io.github.jztan/pdf-mcp
Platforms and Services, by jztan
A production-grade MCP server for PDF processing, with built-in intelligent caching that improves processing efficiency and stability.
README
pdf-mcp
A Model Context Protocol (MCP) server that enables AI agents to read, search, and extract content from PDF files. Built with Python and PyMuPDF, with SQLite-based caching for persistence across server restarts.
mcp-name: io.github.jztan/pdf-mcp
Features
- 8 specialized tools for different PDF operations
- SQLite caching — persistent cache survives server restarts (essential for STDIO transport)
- Paginated reading — read large PDFs in manageable chunks
- Full-text search — FTS5 index with BM25 ranking and Porter stemming
- Semantic search — find pages by meaning using local embeddings (no external API)
- Image extraction — per-page images returned as PNG file paths alongside text
- Table extraction — per-page tables with header and row data, detected via visible borders
- URL support — read PDFs from HTTP/HTTPS URLs
Installation
pip install pdf-mcp
For semantic search (adds fastembed and numpy, ~67 MB model download on first use):
pip install 'pdf-mcp[semantic]'
Quick Start
<details open> <summary><strong>Claude Code</strong></summary>
claude mcp add pdf-mcp -- pdf-mcp
Or add to ~/.claude.json:
{
"mcpServers": {
"pdf-mcp": {
"command": "pdf-mcp"
}
}
}
</details> <details> <summary><strong>Claude Desktop</strong></summary>
Add to your claude_desktop_config.json:
{
"mcpServers": {
"pdf-mcp": {
"command": "pdf-mcp"
}
}
}
Config file location:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
Restart Claude Desktop after updating the config.
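Editing the file by hand risks overwriting servers that are already configured. The following is a minimal Python sketch of a safer merge. The helper name `add_pdf_mcp` is hypothetical and not part of pdf-mcp, and it assumes the config is plain JSON with a top-level `mcpServers` object, as in the snippet above.

```python
import json
from pathlib import Path


def add_pdf_mcp(config_path: Path) -> dict:
    """Add the pdf-mcp entry to an MCP client config, keeping other servers.

    Hypothetical helper for illustration; not shipped with pdf-mcp.
    """
    config = {}
    if config_path.exists():
        # Treat an empty file as an empty JSON object.
        config = json.loads(config_path.read_text(encoding="utf-8") or "{}")

    # "mcpServers" is the key used in the JSON snippets in this README.
    servers = config.setdefault("mcpServers", {})
    servers["pdf-mcp"] = {"command": "pdf-mcp"}

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Pass the config path for your platform from the list above, then restart Claude Desktop as usual.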
</details> <details> <summary><strong>Visual Studio Code</strong></summary>
Requires VS Code 1.102+ with GitHub Copilot.
CLI:
code --add-mcp '{"name":"pdf-mcp","command":"pdf-mcp"}'
Command Palette:
- Open the Command Palette (Cmd/Ctrl+Shift+P)
- Run MCP: Open User Configuration (global) or MCP: Open Workspace Folder Configuration (project-specific)
- Add the configuration:
  { "servers": { "pdf-mcp": { "command": "pdf-mcp" } } }
- Save. VS Code will automatically load the server.
Manual: Create .vscode/mcp.json in your workspace:
{
"servers": {
"pdf-mcp": {
"command": "pdf-mcp"
}
}
}
</details> <details> <summary><strong>Codex</strong></summary>
codex mcp add pdf-mcp -- pdf-mcp
Or configure manually in ~/.codex/config.toml:
[mcp_servers.pdf-mcp]
command = "pdf-mcp"
</details> <details> <summary><strong>Kiro</strong></summary>
Create or edit .kiro/settings/mcp.json in your workspace:
{
"mcpServers": {
"pdf-mcp": {
"command": "pdf-mcp",
"args": [],
"disabled": false
}
}
}
Save and restart Kiro.
</details> <details> <summary><strong>Other MCP Clients</strong></summary>
Most MCP clients use a standard configuration format:
{
"mcpServers": {
"pdf-mcp": {
"command": "pdf-mcp"
}
}
}
With uvx (for isolated environments):
{
"mcpServers": {
"pdf-mcp": {
"command": "uvx",
"args": ["pdf-mcp"]
}
}
}
</details>
Verify Installation
pdf-mcp --help
Tools
pdf_info — Get Document Information
Returns page count, metadata, file size, and estimated token count. Call this first to understand a document before reading it. Includes toc_entry_count and inline TOC entries when the document has ≤50 bookmarks; larger TOCs (e.g. slide decks) return toc_truncated: true — use pdf_get_toc to retrieve the full outline.
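As a rough illustration of how an agent might act on this result, the sketch below picks follow-up tools from a `pdf_info`-style dictionary. `toc_truncated` is the flag described above; the `estimated_tokens` key, the function name, and the 20,000-token budget are assumptions made for the example, not the server's documented response schema.

```python
def plan_next_calls(info: dict, token_budget: int = 20_000) -> list[str]:
    """Choose follow-up tool names from a pdf_info-style result.

    Illustrative only: "estimated_tokens" is an assumed field name.
    """
    steps = []
    # A truncated inline TOC means the full outline must be fetched separately.
    if info.get("toc_truncated"):
        steps.append("pdf_get_toc")
    # Small documents fit in one call; larger ones are searched, then read in parts.
    if info.get("estimated_tokens", 0) <= token_budget:
        steps.append("pdf_read_all")
    else:
        steps.extend(["pdf_search", "pdf_read_pages"])
    return steps
```

The budget would normally depend on the context window of the model driving the agent.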
"Read the PDF at /path/to/document.pdf"
pdf_read_pages — Read Specific Pages
Read selected pages to manage context size. Each page dict includes text, images/image_count, and tables/table_count. Tables are extracted as structured data (header + rows) and inlined directly in the page response — no separate tool call needed. Table detection requires visible borders in the PDF.
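Page selections such as `15, 20, 25-30` need to be expanded into individual page numbers at some point. The sketch below shows one way to do that in Python. The exact syntax that `pdf_read_pages` accepts is not specified here, so the comma and hyphen format, the 1-based numbering, and the function name are assumptions.

```python
def parse_page_spec(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "15, 20, 25-30" into sorted, unique 1-based pages.

    Assumed format for illustration; not taken from the pdf-mcp source.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length so an overlong range does not fail.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)
```

Clamping rather than raising on an overlong range is a design choice for this sketch; a stricter implementation could reject it instead.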
"Read pages 1-10 of the PDF"
"Read pages 15, 20, and 25-30"
pdf_read_all — Read Entire Document
Read a complete document in one call. Subject to a safety limit on page count.
"Read the entire PDF (it's only 10 pages)"
pdf_search — Search Within PDF
Find relevant pages before loading content. Uses a SQLite FTS5 index with Porter stemming and BM25 relevance ranking — results are ordered by relevance, not page number. Morphological variants are matched automatically (e.g. searching "managing" finds pages containing "management"). Falls back to a linear scan on SQLite builds without FTS5 support.
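The behavior described above can be reproduced with Python's standard `sqlite3` module. This is a minimal sketch, not the server's actual schema: the table name `page_fts`, the function name, and the in-memory database are assumptions chosen for the example. It uses the FTS5 `porter` tokenizer and the built-in `bm25()` ranking function, and falls back to a substring scan when FTS5 is unavailable.

```python
import sqlite3


def search_pages(pages: dict[int, str], query: str) -> list[int]:
    """Return matching page numbers, best match first (illustrative sketch)."""
    terms = query.split()
    if not terms:
        return []

    conn = sqlite3.connect(":memory:")
    try:
        # The porter tokenizer stems words, so "managing" and "management"
        # are indexed under the same stem.
        conn.execute(
            "CREATE VIRTUAL TABLE page_fts USING fts5(body, tokenize='porter')"
        )
    except sqlite3.OperationalError:
        # SQLite built without FTS5: case-insensitive linear scan, page order.
        conn.close()
        lowered = [t.lower() for t in terms]
        return [n for n, text in sorted(pages.items())
                if all(t in text.lower() for t in lowered)]

    conn.executemany(
        "INSERT INTO page_fts(rowid, body) VALUES (?, ?)", list(pages.items())
    )
    # Quote each term so punctuation is not parsed as FTS5 query syntax.
    match = " ".join('"' + t.replace('"', '""') + '"' for t in terms)
    rows = conn.execute(
        # bm25() returns lower values for better matches, so ascending order
        # puts the most relevant page first.
        "SELECT rowid FROM page_fts WHERE page_fts MATCH ? ORDER BY bm25(page_fts)",
        (match,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

On a SQLite build with FTS5, a query for `managing` should match a page that contains only `management`; the fallback path matches literal substrings and therefore would not.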
"Search for 'quarterly revenue' in the PDF"
pdf_semantic_search — Search by Meaning
Find pages by conceptual similarity rather than exact keywords. Searching "revenue growth" matches pages about "sales increase" or "financial performance" even without literal keyword overlap. Uses BAAI/bge-small-en-v1.5 embeddings (384-dim, local ONNX Runtime — no external API, no GPU required).
Embeddings are generated once per page and cached in SQLite. The first call for a document takes ~8–15 s on CPU; subsequent queries are instant.
Requires: pip install 'pdf-mcp[semantic]'
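Once per-page embeddings exist, ranking pages against a query reduces to a vector similarity comparison. The sketch below uses cosine similarity in plain Python; that metric is a common choice for embedding search, but whether pdf-mcp uses exactly this computation is an assumption. The toy 3-dimensional vectors stand in for the 384-dimensional model output, and the function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_pages(query_vec: list[float],
               page_vecs: dict[int, list[float]],
               top_k: int = 3) -> list[int]:
    """Return the top_k page numbers most similar to the query vector."""
    ranked = sorted(page_vecs,
                    key=lambda page: cosine(query_vec, page_vecs[page]),
                    reverse=True)
    return ranked[:top_k]


# Toy vectors standing in for cached page embeddings (assumed data).
page_vecs = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0], 3: [0.7, 0.7, 0.0]}
best = rank_pages([0.9, 0.1, 0.0], page_vecs, top_k=2)
```

Because the page vectors are cached, only the query needs to be embedded for each new search.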
"Find pages about revenue growth in the PDF"
"Which pages discuss supply chain risks?"
pdf_get_toc — Get Table of Contents
"Show me the table of contents"
pdf_cache_stats — View Cache Statistics
"Show PDF cache statistics"
pdf_cache_clear — Clear Cache
"Clear expired PDF cache entries"
Example Workflow
For a large document (e.g., a 200-page annual report):
User: "Summarize the risk factors in this annual report"
Agent workflow:
1. pdf_info("report.pdf")
→ 200 pages, TOC shows "Risk Factors" on page 89
2. pdf_search("report.pdf", "risk factors")
→ Relevant pages: 89-110
3. pdf_read_pages("report.pdf", "89-100")
→ First batch
4. pdf_read_pages("report.pdf", "101-110")
→ Second batch
5. Synthesize answer from chunks
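Steps 3 and 4 split the relevant pages into batches. A small helper along these lines could produce the range strings; the function name and the batch size of 12 are assumptions chosen to reproduce the split shown in the workflow.

```python
def batch_ranges(first: int, last: int, batch_size: int) -> list[str]:
    """Split an inclusive page range into "start-end" strings.

    Each string covers at most batch_size pages. Illustrative helper only.
    """
    if batch_size < 1 or last < first:
        raise ValueError("expected batch_size >= 1 and first <= last")
    ranges = []
    for start in range(first, last + 1, batch_size):
        end = min(start + batch_size - 1, last)
        ranges.append(f"{start}-{end}")
    return ranges


batches = batch_ranges(89, 110, 12)
```

Each string in `batches` can then be passed to a separate `pdf_read_pages` call.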
Caching
The server uses SQLite for persistent caching. This is necessary because MCP servers using STDIO transport are spawned as a new process for each conversation.
Cache location: ~/.cache/pdf-mcp/cache.db
What's cached:
| Data | Benefit |
|---|---|
| Metadata | Avoid re-parsing document info |
| Page text | Skip re-extraction |
| Images | Skip re-encoding |
| Tables | Skip re-detection |
| TOC | Skip re-parsing |
| FTS5 index | O(log N) search with BM25 ranking after first query |
| Embeddings | Instant semantic search after first indexing run |
Cache invalidation:
- Automatic when file modification time changes
- Manual via the pdf_cache_clear tool
- TTL: 24 hours (configurable)
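The automatic rules above (modification time and TTL) can be expressed as a single freshness check. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and the server's real cache logic may store or compare these values differently.

```python
import time
from typing import Optional


def is_cache_valid(cached_mtime: float,
                   cached_at: float,
                   current_mtime: float,
                   ttl_hours: float = 24.0,
                   now: Optional[float] = None) -> bool:
    """Return True if a cache entry can be reused (illustrative sketch).

    cached_mtime:  file modification time recorded when the entry was stored
    cached_at:     timestamp at which the entry was stored
    current_mtime: modification time of the file on disk right now
    """
    now = time.time() if now is None else now
    # Any change to the file's modification time invalidates the entry.
    if current_mtime != cached_mtime:
        return False
    # Otherwise the entry is valid until the TTL has elapsed.
    return (now - cached_at) < ttl_hours * 3600
```

Passing `now` explicitly keeps the check deterministic in tests.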
Configuration
Environment variables:
# Cache directory (default: ~/.cache/pdf-mcp)
PDF_MCP_CACHE_DIR=/path/to/cache
# Cache TTL in hours (default: 24)
PDF_MCP_CACHE_TTL=48
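For reference, reading these two variables with their documented defaults might look like the following in Python. The variable names and defaults come from the section above; the function name and return shape are assumptions for the example, not the server's internal API.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_cache_config(env: Optional[Mapping[str, str]] = None) -> tuple[Path, float]:
    """Return (cache_dir, ttl_hours) from the environment (illustrative sketch)."""
    env = os.environ if env is None else env
    # Defaults mirror the values documented above.
    cache_dir = Path(env.get("PDF_MCP_CACHE_DIR", "~/.cache/pdf-mcp")).expanduser()
    ttl_hours = float(env.get("PDF_MCP_CACHE_TTL", "24"))
    return cache_dir, ttl_hours
```

In an MCP client config, such variables are typically set in the server entry's environment block rather than in the shell, so check your client's documentation for the exact key.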
Development
git clone https://github.com/jztan/pdf-mcp.git
cd pdf-mcp
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Type checking
mypy src/
# Linting
flake8 src/ tests/
# Formatting
black src/ tests/
Why pdf-mcp?
| | Without pdf-mcp | With pdf-mcp |
|---|---|---|
| Large PDFs | Context overflow | Chunked reading |
| Token budgeting | Guess and overflow | Estimated tokens before reading |
| Finding content | Load everything | Keyword search (FTS5 + BM25) or semantic search (local embeddings) |
| Tables | Lost in raw text | Extracted and inlined per page |
| Images | Ignored | Extracted as PNG files |
| Repeated access | Re-parse every time | SQLite cache |
| Tool design | Single monolithic tool | 8 specialized tools |
Roadmap
See ROADMAP.md for planned features and release history.
Contributing
Contributions are welcome. Please submit a pull request.
License
MIT — see LICENSE.
Links
- pdf-mcp on PyPI
- pdf-mcp on GitHub
- How I Built pdf-mcp — The problem with large PDFs in AI agents and a working solution
- MCP Server Security: 8 Vulnerabilities — What we found when we audited an MCP server for security holes
- How Claude Code Actually Reads PDFs — How chunked reading, FTS5, and SQLite caching work together
- Semantic vs Keyword Search for AI Agents — Benchmarks and a dual-search routing pattern: FTS5 for exact identifiers, embeddings for natural language