HTML to Markdown MCP Server
Coding & Debugging · by sunshad0w
Converts HTML to clean, well-formed Markdown, with Playwright support for fetching and converting JavaScript-heavy dynamic pages.
MCP (Model Context Protocol) server for converting HTML webpages to clean Markdown format. Reduces HTML size by ~90-95% while preserving tables, images, and important content - perfect for AI context.
Features
- Converts HTML from URLs to clean Markdown
- Preserves tables, images, and links
- Removes unnecessary elements (scripts, styles, navigation, footers, headers)
- Significant size reduction (typically 90-95% compression)
- Configurable options for images, tables, and links
- Built with trafilatura and BeautifulSoup4 for robust extraction
- Stream processing for efficient handling of large pages
- Size limits to prevent downloading excessively large content (1MB-50MB)
- Optional caching to speed up repeated conversions of the same URLs
- 🌐 Browser mode with Playwright - Handles JavaScript-heavy sites and authenticated pages
  - Execute JavaScript (perfect for SPAs: React, Vue, Angular)
  - Use your browser profile with cookies (access authenticated pages!)
  - Support for Chrome, Firefox, WebKit
  - Configurable wait strategies for dynamic content
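The stream processing and size limit features can be pictured with a small sketch. This is not the project's code: `read_limited` and `ContentTooLargeError` are hypothetical names, and the sketch assumes "MB" means 1024 * 1024 bytes. It reads any binary stream in chunks and stops as soon as the configured limit is exceeded, so an oversized page is never buffered in full.

```python
import io
from typing import BinaryIO

MIN_SIZE = 1 * 1024 * 1024    # 1MB lower bound from the feature list
MAX_SIZE = 50 * 1024 * 1024   # 50MB upper bound from the feature list


class ContentTooLargeError(Exception):
    """Hypothetical error raised when a body exceeds the configured limit."""


def read_limited(stream: BinaryIO, max_size: int, chunk_size: int = 64 * 1024) -> bytes:
    """Read a binary stream chunk by chunk, aborting once max_size is exceeded."""
    if not MIN_SIZE <= max_size <= MAX_SIZE:
        raise ValueError("max_size must be between 1MB and 50MB")
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_size:
            # Stop early instead of buffering the rest of the body.
            raise ContentTooLargeError(f"content exceeds {max_size} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# An in-memory stream stands in for a network response here.
sample = io.BytesIO(b"<html>" + b"x" * 1000 + b"</html>")
html_bytes = read_limited(sample, max_size=MIN_SIZE)
```

The same loop would work with any file-like response object that exposes `read()`.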
Installation
Prerequisites
- Python 3.10 or higher
- uv package manager (recommended) or pip
Install with uv (recommended)
# Clone the repository
git clone <your-repo-url>
cd html2md
# Install dependencies
uv pip install -e .
# Install Playwright browsers (required for browser mode)
playwright install chromium
Install with pip
# Clone the repository
git clone <your-repo-url>
cd html2md
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -e .
# Install Playwright browsers (required for browser mode)
playwright install chromium
Docker Installation (Recommended for Production)
The easiest way to use html2md is with Docker:
# Build the image
docker build -t html2md .
# Or use pre-built image (when published)
docker pull your-registry/html2md:latest
For Claude Desktop, configure with Docker:
{
"mcpServers": {
"html2md": {
"command": "docker",
"args": [
"run",
"-i",
"--rm",
"html2md"
]
}
}
}
Docker Image Features:
- Pre-installed Playwright with Chromium
- Optimized for minimal size (~1GB)
- Non-root user for security
- Ready to use - no additional setup required
Configuration
Add the server to your Claude Desktop configuration file:
macOS
Edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"html2md": {
"command": "uv",
"args": [
"--directory",
"/absolute/path/to/html2md",
"run",
"html2md"
]
}
}
}
Windows
Edit %APPDATA%/Claude/claude_desktop_config.json:
{
"mcpServers": {
"html2md": {
"command": "uv",
"args": [
"--directory",
"C:\\absolute\\path\\to\\html2md",
"run",
"html2md"
]
}
}
}
Linux
Edit ~/.config/Claude/claude_desktop_config.json:
{
"mcpServers": {
"html2md": {
"command": "uv",
"args": [
"--directory",
"/absolute/path/to/html2md",
"run",
"html2md"
]
}
}
}
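The three platform snippets differ only in the config file location and the project path. As a rough sketch, the entry could be generated rather than typed by hand. The helper names `config_path` and `server_entry` are hypothetical, and the fallback used when `APPDATA` is unset is an assumption.

```python
import json
import os
import sys
from pathlib import Path, PurePosixPath, PureWindowsPath

CONFIG_NAME = "claude_desktop_config.json"


def config_path(platform: str = sys.platform) -> Path:
    """Return the Claude Desktop config location listed above for each OS."""
    home = Path.home()
    if platform == "darwin":
        return home / "Library" / "Application Support" / "Claude" / CONFIG_NAME
    if platform.startswith("win"):
        # Assumption: fall back to the home directory if APPDATA is unset.
        return Path(os.environ.get("APPDATA", str(home))) / "Claude" / CONFIG_NAME
    return home / ".config" / "Claude" / CONFIG_NAME


def server_entry(project_dir: str) -> dict:
    """Build the mcpServers entry shown in the JSON snippets above."""
    is_absolute = (
        PurePosixPath(project_dir).is_absolute()
        or PureWindowsPath(project_dir).is_absolute()
    )
    if not is_absolute:
        raise ValueError("project_dir must be an absolute path")
    return {
        "mcpServers": {
            "html2md": {
                "command": "uv",
                "args": ["--directory", project_dir, "run", "html2md"],
            }
        }
    }


# Render the entry as JSON, ready to merge into the config file by hand.
entry_json = json.dumps(server_entry("/absolute/path/to/html2md"), indent=2)
```

The sketch only builds the JSON text. Merging it into an existing config file is left manual so that other configured servers are not overwritten.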
Usage
Once configured, the MCP server will be available in Claude Desktop. You can use the html_to_markdown tool:
Example 1: Basic conversion
Convert this webpage to markdown: https://example.com/article
Example 2: With options
Use the html_to_markdown tool with:
- url: https://example.com/docs
- include_images: false
- include_tables: true
Example 3: Browser mode for JavaScript-heavy sites
Use the html_to_markdown tool with:
- url: https://spa-application.com
- fetch_method: playwright
- wait_for: networkidle
Example 4: Access authenticated pages
Use the html_to_markdown tool with:
- url: https://private-site.com/dashboard
- fetch_method: playwright
- use_user_profile: true
- browser_type: chromium
Note: For use_user_profile=true, make sure Chrome is closed before running.
Tool Parameters
Basic Parameters:
- url (required): URL of the webpage to convert
- include_images (optional, default: true): Include images in Markdown
- include_tables (optional, default: true): Include tables in Markdown
- include_links (optional, default: true): Include links in Markdown
- timeout (optional, default: 30): Request timeout in seconds (5-120)
Performance Parameters:
- max_size (optional, default: 10MB): Maximum size of content to download in bytes (1MB-50MB)
- use_cache (optional, default: false): Enable caching for faster repeated conversions
- cache_ttl (optional, default: 3600): Cache time-to-live in seconds (60-86400)
Browser Mode Parameters:
- fetch_method (optional, default: "fetch"): Fetch method - "fetch" (fast) or "playwright" (handles JS, auth)
- browser_type (optional, default: "chromium"): Browser to use - "chromium", "firefox", or "webkit"
- headless (optional, default: true): Run browser in headless mode
- wait_for (optional, default: "networkidle"): Wait strategy - "load", "domcontentloaded", or "networkidle"
- use_user_profile (optional, default: false): Use your browser profile with cookies (requires Chrome closed)
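The documented defaults and ranges can be summarized as a small validation sketch. `ToolOptions` is a hypothetical container, not a class from the project, and it assumes "MB" means 1024 * 1024 bytes. Only the parameter names, defaults, ranges, and allowed values come from the lists above.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the README's "MB" means 1024 * 1024 bytes


def _check_range(name: str, value: int, low: int, high: int) -> None:
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def _check_choice(name: str, value: str, allowed: tuple) -> None:
    if value not in allowed:
        raise ValueError(f"{name} must be one of {allowed}, got {value!r}")


@dataclass
class ToolOptions:
    """Hypothetical container mirroring the documented parameters and defaults."""

    url: str
    include_images: bool = True
    include_tables: bool = True
    include_links: bool = True
    timeout: int = 30
    max_size: int = 10 * MB
    use_cache: bool = False
    cache_ttl: int = 3600
    fetch_method: str = "fetch"
    browser_type: str = "chromium"
    headless: bool = True
    wait_for: str = "networkidle"
    use_user_profile: bool = False

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges before any fetch happens.
        if not self.url:
            raise ValueError("url is required")
        _check_range("timeout", self.timeout, 5, 120)
        _check_range("max_size", self.max_size, 1 * MB, 50 * MB)
        _check_range("cache_ttl", self.cache_ttl, 60, 86400)
        _check_choice("fetch_method", self.fetch_method, ("fetch", "playwright"))
        _check_choice("browser_type", self.browser_type, ("chromium", "firefox", "webkit"))
        _check_choice("wait_for", self.wait_for, ("load", "domcontentloaded", "networkidle"))


options = ToolOptions(url="https://example.com/docs", include_images=False)
```

Validating up front means a mistyped `wait_for` or an out-of-range `timeout` fails immediately instead of after a slow network request.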
Development
Install development dependencies
uv pip install -e ".[dev]"
Run tests
pytest
Code formatting
# Format with black
black src/ tests/
# Lint with ruff
ruff check src/ tests/
Type checking
mypy src/
Architecture
The project consists of five main modules:
converter.py
Core HTML to Markdown conversion functionality:
- fetch_html(): Downloads HTML from URL
- clean_html(): Removes unnecessary elements with BeautifulSoup
- convert_to_markdown(): Converts cleaned HTML to Markdown with trafilatura
- html_to_markdown(): Main workflow combining all steps
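To illustrate the cleaning step, here is a dependency-free sketch. It deliberately swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra packages, so it is not how `clean_html()` is implemented. The names `TextKeeper` and `visible_text` are hypothetical. The sketch drops everything inside the element types the feature list says are removed and keeps the remaining text.

```python
from html.parser import HTMLParser

# Element types the feature list says are removed.
SKIPPED_TAGS = {"script", "style", "nav", "footer", "header"}


class TextKeeper(HTMLParser):
    """Collect text that is not nested inside a skipped element."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are outside every skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> list[str]:
    """Return the text fragments that survive the cleaning pass."""
    parser = TextKeeper()
    parser.feed(html)
    parser.close()
    return parser.parts


page = (
    "<html><head><style>p{}</style></head><body>"
    "<nav>Menu</nav><p>Hello</p><script>var x=1;</script>"
    "<footer>(c)</footer></body></html>"
)
kept = visible_text(page)
```

A real cleaner would preserve the surviving markup for the Markdown converter rather than flattening it to text. The sketch only shows which content is discarded.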
server.py
MCP server implementation:
- Registers the html_to_markdown tool
- Handles tool calls and error responses
- Runs async MCP server with stdio transport
utils.py
Utility functions:
- Hash calculation for caching
- Text formatting and truncation
- Domain extraction
- Filename sanitization
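The four utilities listed here are small enough to sketch with the standard library. The function names, the SHA-256 choice, and the exact sanitization rules are assumptions for illustration, not the contents of utils.py.

```python
import hashlib
import re
from urllib.parse import urlparse


def content_hash(text: str) -> str:
    """Hash a string into a stable hex digest (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def extract_domain(url: str) -> str:
    """Return the host part of a URL, or an empty string if there is none."""
    return urlparse(url).hostname or ""


def sanitize_filename(name: str, max_length: int = 100) -> str:
    """Replace runs of unsafe characters with underscores and cap the length."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("._")
    return cleaned[:max_length] or "untitled"


def truncate(text: str, limit: int, suffix: str = "...") -> str:
    """Shorten text to at most `limit` characters, marking the cut with a suffix."""
    if len(text) <= limit:
        return text
    return text[: max(limit - len(suffix), 0)] + suffix


domain = extract_domain("https://example.com/article")
filename = sanitize_filename("My Article: Part 1/2")
```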
cache.py
In-memory caching system:
- SimpleCache class with TTL support
- Global cache instance management
- Automatic expiration of old entries
- Hash-based cache keys for URL + parameters
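A TTL cache with hash-based keys might look roughly like the sketch below. `TTLCacheSketch` and `make_key` are hypothetical names rather than the project's `SimpleCache` API, and expiring entries lazily on read is an assumed design. The injectable clock exists only to make expiry easy to demonstrate.

```python
import hashlib
import json
import time
from typing import Any, Callable, Optional


def make_key(url: str, **params: Any) -> str:
    """Derive one cache key from the URL plus conversion parameters."""
    payload = json.dumps({"url": url, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCacheSketch:
    """In-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = 3600, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Store the value together with its absolute expiry time.
        self._entries[key] = (self._clock() + self.ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily, on the next read.
            del self._entries[key]
            return None
        return value


cache = TTLCacheSketch(ttl=3600)
key = make_key("https://example.com/article", include_images=False)
cache.set(key, "# Article")
```

Including the parameters in the key matters: the same URL converted with and without images yields different Markdown, so the two results must not share a cache entry.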
browser.py
Playwright browser automation:
- fetch_html_playwright() - Async browser-based HTML fetching
- Support for Chromium, Firefox, WebKit
- User profile integration for authenticated access
- Configurable wait strategies for dynamic content
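Browser-based fetching with Playwright's async API can be sketched as follows. `fetch_rendered_html` is a hypothetical name and this is not the body of `fetch_html_playwright()`. User profile support is omitted, and Playwright is imported lazily so the module still loads when it is not installed.

```python
import asyncio

WAIT_STRATEGIES = {"load", "domcontentloaded", "networkidle"}
BROWSER_TYPES = {"chromium", "firefox", "webkit"}


async def fetch_rendered_html(
    url: str,
    browser_type: str = "chromium",
    wait_for: str = "networkidle",
    headless: bool = True,
    timeout: int = 30,
) -> str:
    """Load a page in a real browser and return the HTML after JavaScript has run."""
    if browser_type not in BROWSER_TYPES:
        raise ValueError(f"unsupported browser_type: {browser_type!r}")
    if wait_for not in WAIT_STRATEGIES:
        raise ValueError(f"unsupported wait_for: {wait_for!r}")

    # Lazy import: requires `pip install playwright` and `playwright install`.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await getattr(p, browser_type).launch(headless=headless)
        try:
            page = await browser.new_page()
            # Playwright expects the timeout in milliseconds.
            await page.goto(url, wait_until=wait_for, timeout=timeout * 1000)
            return await page.content()
        finally:
            await browser.close()


def fetch_rendered_html_sync(url: str, **options) -> str:
    """Blocking wrapper for callers that are not already inside an event loop."""
    return asyncio.run(fetch_rendered_html(url, **options))
```

Calling `fetch_rendered_html_sync("https://example.com")` would return the rendered HTML, provided Playwright and a browser build are installed.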
Troubleshooting
Server not appearing in Claude Desktop
- Check that the path in claude_desktop_config.json is absolute and correct
- Restart Claude Desktop completely
- Check Claude Desktop logs for errors
Installation issues
# Verify Python version
python --version # Should be 3.10+
# Try reinstalling dependencies
uv pip install --force-reinstall -e .
Conversion errors
- Timeout errors: Increase the timeout parameter (up to 120 seconds)
- Empty content: Some websites block automated requests or render their content with JavaScript
  - Solution: Use fetch_method: playwright to execute JavaScript
- Parse errors: The webpage structure may be unusual or malformed
- Content too large: Increase the max_size parameter (up to 50MB); pages above that limit cannot be downloaded
- Cache issues: Disable caching with use_cache: false if you need fresh content
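The advice for empty content amounts to a fallback: try the fast fetch method first and switch to browser mode only if nothing usable comes back. Below is a minimal sketch of that idea. `convert_with_fallback` is hypothetical, and `convert` stands for any callable that wraps the html_to_markdown tool and takes a URL and a fetch method.

```python
from typing import Callable, Tuple


def convert_with_fallback(
    url: str,
    convert: Callable[[str, str], str],
    min_chars: int = 1,
) -> Tuple[str, str]:
    """Return (method_used, markdown), retrying with playwright on empty output."""
    for method in ("fetch", "playwright"):
        markdown = convert(url, method)
        if len(markdown.strip()) >= min_chars:
            return method, markdown
    raise RuntimeError(f"no content extracted from {url} with either fetch method")


def fake_convert(url: str, method: str) -> str:
    """Stand-in converter: simulates a page that only renders with JavaScript."""
    return "# Dashboard" if method == "playwright" else ""


method_used, result = convert_with_fallback("https://spa-application.com", fake_convert)
```

Trying the plain fetch first keeps the common case fast, since launching a browser costs far more than a plain HTTP request.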
Browser mode issues
- Playwright not installed: Run playwright install chromium
- Browser launch fails: Check that you have sufficient permissions and disk space
- User profile error: Make sure Chrome is completely closed before using use_user_profile: true
- Page doesn't load fully: Try different wait_for strategies:
  - "domcontentloaded" - fastest, waits only for the DOM to be ready
  - "load" - waits for the page load event
  - "networkidle" - slowest but most reliable, waits for network to be idle
- Authentication not working: Ensure you're using browser_type: chromium and use_user_profile: true
Performance
Typical conversion results:
- Original HTML: ~500KB - 2MB
- Markdown output: ~25KB - 100KB
- Compression: 90-95%
- Processing time: 2-10 seconds (depending on page size and network)
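The compression figure is the share of bytes removed. A short sketch of the arithmetic (the function name is hypothetical) shows that both ends of the quoted ranges, 500KB to 25KB and 2MB to 100KB, work out to 95%, the top of the stated 90-95% band.

```python
def compression_percent(original_bytes: int, markdown_bytes: int) -> float:
    """Percentage of the original size removed by the conversion."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 100 * (original_bytes - markdown_bytes) / original_bytes


KB = 1024
small_page = compression_percent(500 * KB, 25 * KB)     # 500KB page -> 25KB Markdown
large_page = compression_percent(2048 * KB, 100 * KB)   # 2MB page -> 100KB Markdown
```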
License
MIT
Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
Credits
Built with:
- MCP SDK - Model Context Protocol
- trafilatura - Web content extraction
- BeautifulSoup4 - HTML parsing