Document to Markdown

Name: Document to Markdown
Rating: 5 (1284 reviews)
Author: daymade

Universal

markdown-tools

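Intelligently converts PDF, DOCX, and PPTX files into high-quality markdown. Quick mode suits fast drafts, while Heavy mode runs multiple tools in parallel and merges the best results. It can also extract images and validate conversion quality, which makes it well suited to producing LLM-friendly document output.

1.3k · Efficiency & Workflow · Not scanned · March 5, 2026

Installation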

claude skill add --url github.com/daymade/claude-code-skills/tree/main/markdown-tools

Documentation

Markdown Tools

Convert documents to high-quality markdown with intelligent multi-tool orchestration.

Dual Mode Architecture

| Mode | Speed | Quality | Use Case |
|------|-------|---------|----------|
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |

Quick Start

Installation

bash

# Required: PDF/DOCX/PPTX support
uv tool install "markitdown[pdf]"
pip install pymupdf4llm
brew install pandoc  # Homebrew; on Debian/Ubuntu use: sudo apt install pandoc

Basic Conversion

bash

# Quick Mode (default) - fast, single best tool
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Heavy Mode - multi-tool parallel execution with merge
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy

# Check available tools
uv run scripts/convert.py --list-tools
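
For batch runs, the commands above can be assembled programmatically. The following is a minimal sketch, not part of the bundled scripts: `build_convert_command` and `convert_all` are hypothetical helpers, and writing each output next to its source with a `.md` suffix is an assumption. Only the `uv run ... scripts/convert.py` invocation and the `-o` and `--heavy` flags come from the examples above.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, heavy=False):
    """Build the documented convert.py invocation for one file (hypothetical helper)."""
    source = Path(source)
    # Assumption: the markdown output is written next to the source file.
    output = source.with_suffix(".md")
    cmd = [
        "uv", "run",
        "--with", "pymupdf4llm",
        "--with", "markitdown",
        "scripts/convert.py",
        str(source),
        "-o", str(output),
    ]
    if heavy:
        cmd.append("--heavy")
    return cmd


def convert_all(directory, pattern="*.pdf", heavy=False):
    """Run the conversion for every matching file in a directory, one at a time."""
    for source in sorted(Path(directory).glob(pattern)):
        # check=True raises CalledProcessError if a conversion exits non-zero.
        subprocess.run(build_convert_command(source, heavy=heavy), check=True)
```

Calling `convert_all("docs", heavy=True)` would then convert every PDF under `docs/` in Heavy Mode, stopping at the first failure.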

Tool Selection Matrix

| Format | Quick Mode Tool | Heavy Mode Tools |
|--------|-----------------|------------------|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
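
The matrix above amounts to a lookup keyed by file extension and mode. A minimal sketch of that lookup follows; the `TOOL_MATRIX` layout and the `select_tools` name are illustrative assumptions and may not match how `convert.py` is actually organized.

```python
from pathlib import Path

# Mirrors the Tool Selection Matrix; the dict layout is illustrative.
TOOL_MATRIX = {
    ".pdf": {"quick": ["pymupdf4llm"], "heavy": ["pymupdf4llm", "markitdown"]},
    ".docx": {"quick": ["pandoc"], "heavy": ["pandoc", "markitdown"]},
    ".pptx": {"quick": ["markitdown"], "heavy": ["markitdown", "pandoc"]},
    ".xlsx": {"quick": ["markitdown"], "heavy": ["markitdown"]},
}


def select_tools(path, heavy=False):
    """Return the tool names to run for a file, based on its extension and mode."""
    suffix = Path(path).suffix.lower()
    if suffix not in TOOL_MATRIX:
        raise ValueError(f"Unsupported format: {suffix or path}")
    mode = "heavy" if heavy else "quick"
    # Return a copy so callers cannot mutate the shared table.
    return list(TOOL_MATRIX[suffix][mode])
```

For example, `select_tools("deck.pptx", heavy=True)` returns `["markitdown", "pandoc"]`, while an unlisted extension raises `ValueError`.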

Tool Characteristics

pymupdf4llm: LLM-optimized PDF conversion with native table detection and image extraction
markitdown: Microsoft's universal converter, good for Office formats
pandoc: Excellent structure preservation for DOCX/PPTX

Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:

1. Parallel Execution: Run all applicable tools simultaneously
2. Segment Analysis: Parse each output into segments (tables, headings, images, paragraphs)
3. Quality Scoring: Score each segment based on completeness and structure
4. Intelligent Merge: Select best version of each segment across tools
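
The first two steps of this workflow can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: `run_tools_in_parallel`, `classify`, and `split_segments` are hypothetical names, the converters are stand-in callables rather than the real tools, and segments are split naively on blank lines.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def run_tools_in_parallel(converters, source):
    """Run every converter on the same source at once; return {tool_name: markdown}."""
    with ThreadPoolExecutor(max_workers=max(1, len(converters))) as pool:
        futures = {name: pool.submit(fn, source) for name, fn in converters.items()}
        # result() blocks until that tool finishes and re-raises any exception.
        return {name: future.result() for name, future in futures.items()}


def classify(block):
    """Guess the segment type from the first line of a non-empty markdown block."""
    first = block.strip().splitlines()[0].lstrip()
    if re.match(r"#{1,6}\s", first):
        return "heading"
    if first.startswith("|"):
        return "table"
    if re.match(r"!\[[^\]]*\]\([^)]*\)", first):
        return "image"
    if re.match(r"([-*+]|\d+\.)\s", first):
        return "list"
    return "paragraph"


def split_segments(markdown):
    """Split markdown on blank lines and label each block with its segment type."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    return [(classify(b), b) for b in blocks]
```

Threads are used here because real converters would mostly wait on subprocesses or file I/O; a process pool would be the alternative if the conversions were CPU-bound Python code.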

Merge Criteria

| Segment Type | Selection Criteria |
|--------------|--------------------|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
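
To make the table criterion concrete, here is a minimal scoring sketch. The function names and the weights (one point per cell, a flat bonus of 10 for a header separator) are arbitrary assumptions chosen for illustration; the actual scoring in `merge_outputs.py` may differ.

```python
import re

# Matches a header separator row such as |---|:---:| or | --- | --- |
SEPARATOR_ROW = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")


def score_table(table_md):
    """Score a markdown table: more rows and columns and a header separator score higher."""
    lines = [ln.strip() for ln in table_md.strip().splitlines() if ln.strip()]
    has_separator = len(lines) > 1 and bool(SEPARATOR_ROW.match(lines[1]))
    # Exclude the separator row itself from the row count.
    rows = [ln for i, ln in enumerate(lines) if not (i == 1 and has_separator)]
    columns = max((len(ln.strip("|").split("|")) for ln in rows), default=0)
    # Assumed weighting: cell count plus a flat bonus for a proper separator.
    return len(rows) * columns + (10 if has_separator else 0)


def pick_best_table(candidates):
    """Given {tool_name: table_markdown}, return the name of the highest-scoring tool."""
    return max(candidates, key=lambda name: score_table(candidates[name]))
```

With one candidate that kept the separator and all rows and another that dropped both, `pick_best_table` returns the name of the more complete version.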

Image Extraction

bash

# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md

Output:

Images: assets/img_page1_1.png, assets/img_page2_1.jpg
Metadata: assets/images_metadata.json (page, position, dimensions)

Quality Validation

bash

# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html

Quality Metrics

| Metric | Pass | Warn | Fail |
|--------|------|------|------|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
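
The thresholds above translate into a small classifier. This is a sketch only: `classify_retention` is a hypothetical name, and treating values between 99% and 100% as a warning for tables and images is an assumption, since the table leaves that range unspecified.

```python
def classify_retention(metric, percent):
    """Map a retention percentage to 'pass', 'warn', or 'fail' per the Quality Metrics table."""
    if metric == "text":
        # Text passes only strictly above 95%; 85-95% inclusive is a warning.
        if percent > 95:
            return "pass"
        return "warn" if percent >= 85 else "fail"
    # Tables and images pass only at full retention.
    warn_floor = {"table": 90, "image": 80}[metric]
    if percent >= 100:
        return "pass"
    # Assumption: anything below 100% but at or above the floor is a warning.
    return "warn" if percent >= warn_floor else "fail"
```

Text is the only metric with tolerance in its pass band, presumably because whitespace and ligature differences make exact text matches unlikely, whereas a missing table or image is a discrete loss.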

Merge Outputs Manually

bash

# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose

Path Conversion (Windows/WSL)

bash

# Windows → WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
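
The mapping shown above can be sketched in a few lines. This is an illustration, not the code of `convert_path.py`: `windows_to_wsl` is a hypothetical name, and it assumes drive-letter paths and the default WSL mount point `/mnt/<drive>`.

```python
import re


def windows_to_wsl(path):
    """Convert a drive-letter Windows path to its default WSL mount path."""
    match = re.match(r"^([A-Za-z]):[\\/]*(.*)$", path)
    if not match:
        raise ValueError(f"Not a drive-letter Windows path: {path!r}")
    drive, rest = match.groups()
    # WSL mounts drives in lowercase and uses forward slashes.
    rest = rest.replace("\\", "/")
    mount = f"/mnt/{drive.lower()}"
    return f"{mount}/{rest}" if rest else mount


print(windows_to_wsl(r"C:\Users\name\Documents\file.pdf"))
# prints /mnt/c/Users/name/Documents/file.pdf
```

Inside a WSL shell, the built-in `wslpath` utility performs the same conversion and also respects custom mount configurations.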

Common Issues

"No conversion tools available"

bash

# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc  # Homebrew; on Debian/Ubuntu use: sudo apt install pandoc
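
Before reinstalling, it can help to see which tools the current environment actually exposes. The sketch below is an assumption about how such a check might look, not the logic behind `--list-tools`; `available_tools` is a hypothetical helper built only on the standard library.

```python
import importlib.util
import shutil


def available_tools():
    """Report which of the three conversion tools are visible from this environment."""
    return {
        # pymupdf4llm is a Python package, so look for an importable module.
        "pymupdf4llm": importlib.util.find_spec("pymupdf4llm") is not None,
        # markitdown may be installed as a CLI (uv tool install) or as a package.
        "markitdown": shutil.which("markitdown") is not None
        or importlib.util.find_spec("markitdown") is not None,
        # pandoc is a standalone binary looked up on PATH.
        "pandoc": shutil.which("pandoc") is not None,
    }


for name, found in available_tools().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Note that a tool installed with `uv tool install` lives in its own isolated environment, so its module may not be importable even when its command is on `PATH`; the check covers both cases for markitdown.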

FontBBox warnings during PDF conversion

These are harmless font-parsing warnings; the output is still correct.

Images missing from output

Use Heavy Mode for better image preservation
Or extract separately with scripts/extract_pdf_images.py

Tables broken in output

Use Heavy Mode - it selects the most complete table version
Or validate with scripts/validate_output.py

Bundled Scripts

| Script | Purpose |
|--------|---------|
| `convert.py` | Main orchestrator with Quick/Heavy mode |
| `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata |
| `convert_path.py` | Windows to WSL path converter |

References

references/heavy-mode-guide.md - Detailed Heavy Mode documentation
references/tool-comparison.md - Tool capabilities comparison
references/conversion-examples.md - Batch operation examples
