HTML to Markdown MCP Server

by sunshad0w

Converts HTML to clean, well-formed Markdown, with Playwright support for fetching and converting JavaScript-heavy dynamic pages.

MCP (Model Context Protocol) server that converts HTML webpages to clean Markdown. It typically reduces page size by about 90-95% while preserving tables, images, and the main content, which makes the output well suited to AI context windows.

Features

  • Converts HTML from URLs to clean Markdown
  • Preserves tables, images, and links
  • Removes unnecessary elements (scripts, styles, navigation, footers, headers)
  • Significant size reduction (typically 90-95% compression)
  • Configurable options for images, tables, and links
  • Built with trafilatura and BeautifulSoup4 for robust extraction
  • Stream processing for efficient handling of large pages
  • Configurable download size limit (1MB-50MB) to avoid fetching excessively large pages
  • Optional caching to speed up repeated conversions of the same URLs
  • 🌐 Browser mode with Playwright - Handles JavaScript-heavy sites and authenticated pages
    • Execute JavaScript (perfect for SPAs: React, Vue, Angular)
    • Use your browser profile with cookies (access authenticated pages!)
    • Support for Chromium, Firefox, WebKit
    • Configurable wait strategies for dynamic content
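
The size limit combined with stream processing can be pictured as reading a response chunk by chunk and aborting as soon as the limit is crossed. The following is a minimal sketch under that assumption, not the project's actual code: read_limited and ContentTooLargeError are hypothetical names, and 1MB is assumed to mean 1024 * 1024 bytes.

```python
from typing import Iterable

MIN_LIMIT = 1 * 1024 * 1024    # 1MB lower bound from the feature list (binary MB assumed)
MAX_LIMIT = 50 * 1024 * 1024   # 50MB upper bound from the feature list


class ContentTooLargeError(Exception):
    """Raised when a download exceeds the configured limit (hypothetical name)."""


def read_limited(chunks: Iterable[bytes], max_size: int) -> bytes:
    """Accumulate streamed chunks, aborting once more than max_size bytes arrive."""
    if not MIN_LIMIT <= max_size <= MAX_LIMIT:
        raise ValueError("max_size must be between 1MB and 50MB")
    received = bytearray()
    for chunk in chunks:
        received.extend(chunk)
        if len(received) > max_size:
            # Stop early instead of buffering the rest of the response.
            raise ContentTooLargeError(f"content exceeds {max_size} bytes")
    return bytes(received)
```

Any iterable of byte chunks works as input, so the same function could wrap a streamed HTTP response body.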

Installation

Prerequisites

  • Python 3.10 or higher
  • uv package manager (recommended) or pip

Install with uv (recommended)

bash
# Clone the repository
git clone <your-repo-url>
cd html2md

# Install dependencies
uv pip install -e .

# Install Playwright browsers (required for browser mode)
playwright install chromium

Install with pip

bash
# Clone the repository
git clone <your-repo-url>
cd html2md

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -e .

# Install Playwright browsers (required for browser mode)
playwright install chromium

Docker Installation (Recommended for Production)

The easiest way to use html2md is with Docker:

bash
# Build the image
docker build -t html2md .

# Or use pre-built image (when published)
docker pull your-registry/html2md:latest

For Claude Desktop, configure with Docker:

json
{
  "mcpServers": {
    "html2md": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "html2md"
      ]
    }
  }
}

Docker Image Features:

  • Pre-installed Playwright with Chromium
  • Optimized for minimal size (~1GB)
  • Non-root user for security
  • Ready to use - no additional setup required

Configuration

Add the server to your Claude Desktop configuration file:

macOS

Edit ~/Library/Application Support/Claude/claude_desktop_config.json:

json
{
  "mcpServers": {
    "html2md": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/html2md",
        "run",
        "html2md"
      ]
    }
  }
}

Windows

Edit %APPDATA%\Claude\claude_desktop_config.json:

json
{
  "mcpServers": {
    "html2md": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\absolute\\path\\to\\html2md",
        "run",
        "html2md"
      ]
    }
  }
}

Linux

Edit ~/.config/Claude/claude_desktop_config.json:

json
{
  "mcpServers": {
    "html2md": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/html2md",
        "run",
        "html2md"
      ]
    }
  }
}
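
If the configuration file already lists other MCP servers, the html2md entry has to be merged in rather than pasted over them. The sketch below shows one way to do that with the standard library; add_html2md_server is a hypothetical helper, not part of the project, and the path is passed in by the caller.

```python
import json
from pathlib import Path


def add_html2md_server(config_path: Path, project_dir: str) -> dict:
    """Add or update the html2md entry, keeping any other servers intact."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        # An empty file is treated the same as a missing one.
        config = json.loads(text) if text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers["html2md"] = {
        "command": "uv",
        "args": ["--directory", project_dir, "run", "html2md"],
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Restart Claude Desktop after the file changes so the new entry is picked up.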

Usage

Once configured, the MCP server will be available in Claude Desktop. You can use the html_to_markdown tool:

Example 1: Basic conversion

code
Convert this webpage to markdown: https://example.com/article

Example 2: With options

code
Use the html_to_markdown tool with:
- url: https://example.com/docs
- include_images: false
- include_tables: true

Example 3: Browser mode for JavaScript-heavy sites

code
Use the html_to_markdown tool with:
- url: https://spa-application.com
- fetch_method: playwright
- wait_for: networkidle

Example 4: Access authenticated pages

code
Use the html_to_markdown tool with:
- url: https://private-site.com/dashboard
- fetch_method: playwright
- use_user_profile: true
- browser_type: chromium

Note: For use_user_profile=true, make sure Chrome is closed before running.
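
Claude Desktop builds the tool call for you, but it can help to see roughly what travels over the stdio transport. MCP tool invocations are JSON-RPC 2.0 requests with the method tools/call. The sketch below builds such a request for Example 3; build_tool_call is a hypothetical helper used only for illustration.

```python
import json


def build_tool_call(request_id: int, **arguments) -> str:
    """Serialize an MCP tools/call request for the html_to_markdown tool."""
    if "url" not in arguments:
        raise ValueError("url is required")
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "html_to_markdown", "arguments": arguments},
    }
    return json.dumps(message)


# Example 3 from above, expressed as a request payload.
payload = build_tool_call(
    1,
    url="https://spa-application.com",
    fetch_method="playwright",
    wait_for="networkidle",
)
```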

Tool Parameters

Basic Parameters:

  • url (required): URL of the webpage to convert
  • include_images (optional, default: true): Include images in Markdown
  • include_tables (optional, default: true): Include tables in Markdown
  • include_links (optional, default: true): Include links in Markdown
  • timeout (optional, default: 30): Request timeout in seconds (5-120)

Performance Parameters:

  • max_size (optional, default: 10MB): Maximum size of content to download in bytes (1MB-50MB)
  • use_cache (optional, default: false): Enable caching for faster repeated conversions
  • cache_ttl (optional, default: 3600): Cache time-to-live in seconds (60-86400)

Browser Mode Parameters:

  • fetch_method (optional, default: "fetch"): Fetch method - "fetch" (fast) or "playwright" (handles JS, auth)
  • browser_type (optional, default: "chromium"): Browser to use - "chromium", "firefox", or "webkit"
  • headless (optional, default: true): Run browser in headless mode
  • wait_for (optional, default: "networkidle"): Wait strategy - "load", "domcontentloaded", or "networkidle"
  • use_user_profile (optional, default: false): Use your browser profile with cookies (requires Chrome closed)
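
The defaults and ranges listed above can be captured in a single validated structure. This is a sketch only: ConvertOptions and its helper functions are hypothetical, and the README does not say whether 1MB means 10**6 or 2**20 bytes, so the binary interpretation here is an assumption.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary megabytes


def _check_range(name: str, value: int, low: int, high: int) -> None:
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def _check_choice(name: str, value: str, allowed: tuple) -> None:
    if value not in allowed:
        raise ValueError(f"{name} must be one of {allowed}, got {value!r}")


@dataclass(frozen=True)
class ConvertOptions:
    """Hypothetical container mirroring the documented tool parameters."""

    url: str
    include_images: bool = True
    include_tables: bool = True
    include_links: bool = True
    timeout: int = 30
    max_size: int = 10 * MB
    use_cache: bool = False
    cache_ttl: int = 3600
    fetch_method: str = "fetch"
    browser_type: str = "chromium"
    headless: bool = True
    wait_for: str = "networkidle"
    use_user_profile: bool = False

    def __post_init__(self) -> None:
        # Enforce the documented ranges and allowed values.
        _check_range("timeout", self.timeout, 5, 120)
        _check_range("max_size", self.max_size, 1 * MB, 50 * MB)
        _check_range("cache_ttl", self.cache_ttl, 60, 86400)
        _check_choice("fetch_method", self.fetch_method, ("fetch", "playwright"))
        _check_choice("browser_type", self.browser_type, ("chromium", "firefox", "webkit"))
        _check_choice("wait_for", self.wait_for, ("load", "domcontentloaded", "networkidle"))
```

For instance, ConvertOptions(url="https://example.com/docs", include_images=False) keeps every other default, while timeout=2 raises a ValueError.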

Development

Install development dependencies

bash
uv pip install -e ".[dev]"

Run tests

bash
pytest

Code formatting

bash
# Format with black
black src/ tests/

# Lint with ruff
ruff check src/ tests/

Type checking

bash
mypy src/

Architecture

The project consists of five main modules:

converter.py

Core HTML to Markdown conversion functionality:

  • fetch_html(): Downloads HTML from URL
  • clean_html(): Removes unnecessary elements with BeautifulSoup
  • convert_to_markdown(): Converts cleaned HTML to Markdown with trafilatura
  • html_to_markdown(): Main workflow combining all steps
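
To illustrate the cleaning step on its own, here is a standard-library stand-in that drops script, style, nav, footer, and header subtrees and keeps the remaining text. It is not the project's clean_html(): the project uses BeautifulSoup and trafilatura as described above, and TextExtractor and extract_text are hypothetical names. It also emits plain text, not Markdown.

```python
from html.parser import HTMLParser

# Tags whose entire subtree is dropped, following the feature list.
SKIPPED_TAGS = {"script", "style", "nav", "footer", "header"}


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring everything inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped subtree.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Return the text left after removing the skipped elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

The depth counter matters because skipped elements can nest, for example a nav inside a header.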

server.py

MCP server implementation:

  • Registers the html_to_markdown tool
  • Handles tool calls and error responses
  • Runs async MCP server with stdio transport

utils.py

Utility functions:

  • Hash calculation for caching
  • Text formatting and truncation
  • Domain extraction
  • Filename sanitization
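
The helpers listed for utils.py are small enough to sketch with the standard library. The function names and exact behavior below are assumptions for illustration, since the README only names the responsibilities.

```python
import hashlib
import re
from urllib.parse import urlparse


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest, usable as part of a cache key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def extract_domain(url: str) -> str:
    """Return the lowercased host portion of a URL."""
    return urlparse(url).netloc.lower()


def truncate(text: str, limit: int, suffix: str = "...") -> str:
    """Shorten text to at most limit characters, marking the cut with suffix."""
    if len(text) <= limit:
        return text
    return text[: max(limit - len(suffix), 0)] + suffix


def sanitize_filename(name: str) -> str:
    """Replace runs of characters that are unsafe in filenames with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("._")
    return cleaned or "untitled"
```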

cache.py

In-memory caching system:

  • SimpleCache class with TTL support
  • Global cache instance management
  • Automatic expiration of old entries
  • Hash-based cache keys for URL + parameters
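
A TTL cache of this kind fits in a few lines. The sketch below keeps the SimpleCache name from the list above, but the method names, the injectable clock, and make_cache_key are assumptions made for illustration and may differ from the project's code.

```python
import hashlib
import json
import time
from typing import Any, Callable, Optional


def make_cache_key(url: str, params: dict) -> str:
    """Hash the URL together with its parameters so each variant is cached separately."""
    payload = json.dumps({"url": url, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SimpleCache:
    """In-memory cache whose entries expire after a per-entry TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl: float = 3600) -> None:
        self._entries[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are removed lazily on access.
            del self._entries[key]
            return None
        return value
```

Sorting the keys before hashing means the same URL and options always map to the same cache entry, regardless of the order in which the options were supplied.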

browser.py

Playwright browser automation:

  • fetch_html_playwright(): Async browser-based HTML fetching
  • Support for Chromium, Firefox, WebKit
  • User profile integration for authenticated access
  • Configurable wait strategies for dynamic content
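
Browser-based fetching with Playwright's async API generally follows the pattern below. This is a sketch rather than the project's fetch_html_playwright(): fetch_rendered_html is a hypothetical name, user-profile handling is left out, and Playwright is imported lazily so the module still loads when it is not installed.

```python
WAIT_STRATEGIES = {"load", "domcontentloaded", "networkidle"}
BROWSER_TYPES = {"chromium", "firefox", "webkit"}


async def fetch_rendered_html(
    url: str,
    browser_type: str = "chromium",
    wait_for: str = "networkidle",
    headless: bool = True,
    timeout: int = 30,
) -> str:
    """Load a page in a real browser and return the HTML after JavaScript has run."""
    if browser_type not in BROWSER_TYPES:
        raise ValueError(f"unsupported browser_type: {browser_type!r}")
    if wait_for not in WAIT_STRATEGIES:
        raise ValueError(f"unsupported wait_for: {wait_for!r}")

    # Lazy import: only browser mode needs Playwright.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        launcher = getattr(p, browser_type)  # p.chromium, p.firefox, or p.webkit
        browser = await launcher.launch(headless=headless)
        try:
            page = await browser.new_page()
            # Playwright expects the timeout in milliseconds.
            await page.goto(url, wait_until=wait_for, timeout=timeout * 1000)
            return await page.content()
        finally:
            await browser.close()
```

From synchronous code it can be run with asyncio.run(fetch_rendered_html("https://example.com")), provided the browsers have been installed with playwright install.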

Troubleshooting

Server not appearing in Claude Desktop

  1. Check that the path in claude_desktop_config.json is absolute and correct
  2. Restart Claude Desktop completely
  3. Check Claude Desktop logs for errors

Installation issues

bash
# Verify Python version
python --version  # Should be 3.10+

# Try reinstalling dependencies
uv pip install --force-reinstall -e .

Conversion errors

  • Timeout errors: Increase the timeout parameter
  • Empty content: Some websites may block automated requests or use JavaScript rendering
    • Solution: Use fetch_method: playwright to execute JavaScript
  • Parse errors: The webpage structure may be unusual or malformed
  • Content too large: Increase the max_size parameter (up to 50MB). Pages above that maximum exceed the supported limit and cannot be downloaded
  • Cache issues: Disable caching with use_cache: false if you need fresh content

Browser mode issues

  • Playwright not installed: Run playwright install chromium
  • Browser launch fails: Check that you have sufficient permissions and disk space
  • User profile error: Make sure Chrome is completely closed before using use_user_profile: true
  • Page doesn't load fully: Try different wait_for strategies:
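    • "load" - waits for the page load event, which fires after images and stylesheets have loaded
    • "domcontentloaded" - fastest, waits only until the DOM has been parsed
    • "networkidle" - slowest, waits until network activity has settled; often the most reliable choice for dynamic pages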
  • Authentication not working: Ensure you're using browser_type: chromium and use_user_profile: true

Performance

Typical conversion results:

  • Original HTML: ~500KB - 2MB
  • Markdown output: ~25KB - 100KB
  • Compression: 90-95%
  • Processing time: 2-10 seconds (depending on page size and network)
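
The compression figure follows directly from the sizes above. As a quick check of the arithmetic, both endpoint pairs (500KB to 25KB and 2MB to 100KB) work out to a 95% reduction; the 90% end of the range would correspond to a larger output, such as 500KB shrinking to 50KB. The helper name below is illustrative only.

```python
KB = 1024


def reduction_percent(original_bytes: int, markdown_bytes: int) -> float:
    """Percentage by which the Markdown output is smaller than the original HTML."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return (original_bytes - markdown_bytes) * 100 / original_bytes


small_page = reduction_percent(500 * KB, 25 * KB)     # 95.0
large_page = reduction_percent(2000 * KB, 100 * KB)   # 95.0
lower_bound = reduction_percent(500 * KB, 50 * KB)    # 90.0
```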

License

MIT

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Credits

Built with trafilatura, BeautifulSoup4, and Playwright.
