io.github.jztan/pdf-mcp

by jztan

A production-grade MCP server for PDF processing, with built-in intelligent caching for better efficiency and stability.

pdf-mcp

Python 3.10+ | License: MIT

A Model Context Protocol (MCP) server that enables AI agents to read, search, and extract content from PDF files. Built with Python and PyMuPDF, with SQLite-based caching for persistence across server restarts.

mcp-name: io.github.jztan/pdf-mcp

Features

  • 8 specialized tools for different PDF operations
  • SQLite caching — persistent cache survives server restarts (essential for STDIO transport)
  • Paginated reading — read large PDFs in manageable chunks
  • Full-text search — FTS5 index with BM25 ranking and Porter stemming
  • Semantic search — find pages by meaning using local embeddings (no external API)
  • Image extraction — per-page images returned as PNG file paths alongside text
  • Table extraction — per-page tables with header and row data, detected via visible borders
  • URL support — read PDFs from HTTP/HTTPS URLs

Installation

bash
pip install pdf-mcp

For semantic search (adds fastembed and numpy, ~67 MB model download on first use):

bash
pip install 'pdf-mcp[semantic]'

Quick Start

<details open> <summary><strong>Claude Code</strong></summary>
bash
claude mcp add pdf-mcp -- pdf-mcp

Or add to ~/.claude.json:

json
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}
</details> <details> <summary><strong>Claude Desktop</strong></summary>

Add to your claude_desktop_config.json:

json
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

Config file location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

Restart Claude Desktop after updating the config.

</details> <details> <summary><strong>Visual Studio Code</strong></summary>

Requires VS Code 1.102+ with GitHub Copilot.

CLI:

bash
code --add-mcp '{"name":"pdf-mcp","command":"pdf-mcp"}'

Command Palette:

  1. Open Command Palette (Cmd/Ctrl+Shift+P)
  2. Run MCP: Open User Configuration (global) or MCP: Open Workspace Folder Configuration (project-specific)
  3. Add the configuration:
    json
    {
      "servers": {
        "pdf-mcp": {
          "command": "pdf-mcp"
        }
      }
    }
    
  4. Save. VS Code will automatically load the server.

Manual: Create .vscode/mcp.json in your workspace:

json
{
  "servers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}
</details> <details> <summary><strong>Codex CLI</strong></summary>
bash
codex mcp add pdf-mcp -- pdf-mcp

Or configure manually in ~/.codex/config.toml:

toml
[mcp_servers.pdf-mcp]
command = "pdf-mcp"
</details> <details> <summary><strong>Kiro</strong></summary>

Create or edit .kiro/settings/mcp.json in your workspace:

json
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp",
      "args": [],
      "disabled": false
    }
  }
}

Save and restart Kiro.

</details> <details> <summary><strong>Other MCP Clients</strong></summary>

Most MCP clients use a standard configuration format:

json
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}

With uvx (for isolated environments):

json
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "uvx",
      "args": ["pdf-mcp"]
    }
  }
}
</details>

Verify Installation

bash
pdf-mcp --help

Tools

pdf_info — Get Document Information

Returns page count, metadata, file size, and estimated token count. Call this first to understand a document before reading it. Includes toc_entry_count and inline TOC entries when the document has ≤50 bookmarks; larger TOCs (e.g. slide decks) return toc_truncated: true — use pdf_get_toc to retrieve the full outline.
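
The fields described above suggest a simple planning step for an agent. The sketch below is a hypothetical helper, not part of pdf-mcp: it assumes the pdf_info result arrives as a dict, that the token estimate sits under a key named `estimated_tokens` (an assumed name), and that the caller picks its own token budget. Only `toc_truncated` and the tool names come from this README.

```python
def plan_reading(info: dict, token_budget: int = 50_000) -> list[str]:
    """Pick the next tool calls from a pdf_info-style result.

    `info` is a hypothetical dict; `estimated_tokens` is an assumed key name.
    `toc_truncated` is the flag described in this README.
    """
    steps = []
    # A truncated inline TOC means the full outline needs a separate call.
    if info.get("toc_truncated"):
        steps.append("pdf_get_toc")
    # Small documents fit in one call; larger ones are searched, then read in pages.
    if info.get("estimated_tokens", 0) <= token_budget:
        steps.append("pdf_read_all")
    else:
        steps.extend(["pdf_search", "pdf_read_pages"])
    return steps


print(plan_reading({"estimated_tokens": 180_000, "toc_truncated": True}))
# ['pdf_get_toc', 'pdf_search', 'pdf_read_pages']
```

The 50,000-token default is arbitrary; a real agent would set it from its own context window.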

code
"Read the PDF at /path/to/document.pdf"

pdf_read_pages — Read Specific Pages

Read selected pages to manage context size. Each page dict includes text, images/image_count, and tables/table_count. Tables are extracted as structured data (header + rows) and inlined directly in the page response — no separate tool call needed. Table detection requires visible borders in the PDF.
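
Page selections such as "15, 20, and 25-30" have to be expanded into individual page numbers at some point. The helper below is an illustrative sketch of that expansion, assuming a comma-separated spec of 1-based pages and inclusive ranges. The function name is hypothetical, and the format pdf-mcp actually accepts may differ.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "15, 20, 25-30" into sorted, unique 1-based pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_page_spec("15, 20, 25-30"))  # [15, 20, 25, 26, 27, 28, 29, 30]
```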

code
"Read pages 1-10 of the PDF"
"Read pages 15, 20, and 25-30"

pdf_read_all — Read Entire Document

Read a complete document in one call. Subject to a safety limit on page count.

code
"Read the entire PDF (it's only 10 pages)"

pdf_search — Search Within PDF

Find relevant pages before loading content. Uses a SQLite FTS5 index with Porter stemming and BM25 relevance ranking — results are ordered by relevance, not page number. Morphological variants are matched automatically (e.g. searching "managing" finds pages containing "management"). Falls back to a linear scan on SQLite builds without FTS5 support.
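
The indexing approach described above can be tried with Python's built-in `sqlite3` module. This is a minimal sketch, not the server's actual schema: the table name `page_index`, the use of `rowid` as the page number, and the `LIKE`-based fallback are all assumptions made for illustration.

```python
import sqlite3


def build_index(pages: dict[int, str]) -> tuple[sqlite3.Connection, bool]:
    """Index page text in memory; report whether FTS5 was available."""
    conn = sqlite3.connect(":memory:")
    try:
        # The porter tokenizer stems both the indexed text and query terms.
        conn.execute(
            "CREATE VIRTUAL TABLE page_index USING fts5(body, tokenize='porter')"
        )
        has_fts5 = True
    except sqlite3.OperationalError:
        # SQLite build without FTS5: fall back to a plain table.
        conn.execute("CREATE TABLE page_index (body TEXT)")
        has_fts5 = False
    # The rowid doubles as the page number in this sketch.
    conn.executemany(
        "INSERT INTO page_index (rowid, body) VALUES (?, ?)",
        list(pages.items()),
    )
    return conn, has_fts5


def search(conn: sqlite3.Connection, has_fts5: bool, query: str) -> list[int]:
    """Return matching page numbers, best match first when FTS5 is in use."""
    if has_fts5:
        # bm25() returns lower values for better matches, so sort ascending.
        rows = conn.execute(
            "SELECT rowid FROM page_index WHERE page_index MATCH ? "
            "ORDER BY bm25(page_index)",
            (query,),
        )
    else:
        # Linear scan: plain substring match, no stemming or ranking.
        rows = conn.execute(
            "SELECT rowid FROM page_index WHERE body LIKE ? ORDER BY rowid",
            (f"%{query}%",),
        )
    return [row[0] for row in rows]


pages = {
    1: "Quarterly revenue rose compared with last year.",
    2: "Risk management practices are reviewed annually.",
}
conn, has_fts5 = build_index(pages)
print(search(conn, has_fts5, "managing"))
```

With FTS5 available, the query "managing" matches page 2 because both it and "management" reduce to the same Porter stem. On the fallback path the same query returns nothing, which shows what the stemmed index adds.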

code
"Search for 'quarterly revenue' in the PDF"

pdf_semantic_search — Search by Meaning

Find pages by conceptual similarity rather than exact keywords. Searching "revenue growth" matches pages about "sales increase" or "financial performance" even without literal keyword overlap. Uses BAAI/bge-small-en-v1.5 embeddings (384-dim, local ONNX Runtime — no external API, no GPU required).

Embeddings are generated once per page and cached in SQLite. The first call for a document takes ~8–15 s on CPU; subsequent queries are instant.

Requires: pip install 'pdf-mcp[semantic]'
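
Once per-page embeddings are cached, a semantic query comes down to comparing one query vector against the stored page vectors. The sketch below shows that ranking step with cosine similarity, which is an assumption: this README does not state which similarity measure pdf-mcp uses. The 3-dimensional vectors are made-up stand-ins for real 384-dimensional embeddings, and the function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_pages(
    query_vec: list[float], page_vecs: dict[int, list[float]], top_k: int = 3
) -> list[int]:
    """Return up to top_k page numbers, most similar to the query first."""
    scored = [(cosine(query_vec, vec), page) for page, vec in page_vecs.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [page for _, page in scored[:top_k]]


# Toy vectors standing in for cached page embeddings.
page_vecs = {
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.7, 0.7, 0.0],
}
print(rank_pages([1.0, 0.0, 0.0], page_vecs, top_k=2))  # [1, 3]
```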

code
"Find pages about revenue growth in the PDF"
"Which pages discuss supply chain risks?"

pdf_get_toc — Get Table of Contents

code
"Show me the table of contents"

pdf_cache_stats — View Cache Statistics

code
"Show PDF cache statistics"

pdf_cache_clear — Clear Cache

code
"Clear expired PDF cache entries"

Example Workflow

For a large document (e.g., a 200-page annual report):

code
User: "Summarize the risk factors in this annual report"

Agent workflow:
1. pdf_info("report.pdf")
   → 200 pages, TOC shows "Risk Factors" on page 89

2. pdf_search("report.pdf", "risk factors")
   → Relevant pages: 89-110

3. pdf_read_pages("report.pdf", "89-100")
   → First batch

4. pdf_read_pages("report.pdf", "101-110")
   → Second batch

5. Synthesize answer from chunks
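
Steps 3 and 4 above split the relevant span into two reads. A small helper can produce those batches. This is an illustrative sketch: the function name and the "start-end" string format are assumptions taken from the workflow, not a documented pdf-mcp API.

```python
def batch_ranges(first: int, last: int, batch_size: int) -> list[str]:
    """Split an inclusive page span into "start-end" strings of at most batch_size pages."""
    if batch_size < 1 or last < first:
        raise ValueError("need batch_size >= 1 and last >= first")
    ranges = []
    for start in range(first, last + 1, batch_size):
        end = min(start + batch_size - 1, last)  # clamp the final batch
        ranges.append(f"{start}-{end}")
    return ranges


print(batch_ranges(89, 110, 12))  # ['89-100', '101-110']
```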

Caching

The server uses SQLite for persistent caching. This is necessary because MCP servers using STDIO transport are spawned as a new process for each conversation.

Cache location: ~/.cache/pdf-mcp/cache.db

What's cached:

| Data | Benefit |
| --- | --- |
| Metadata | Avoid re-parsing document info |
| Page text | Skip re-extraction |
| Images | Skip re-encoding |
| Tables | Skip re-detection |
| TOC | Skip re-parsing |
| FTS5 index | O(log N) search with BM25 ranking after first query |
| Embeddings | Instant semantic search after first indexing run |

Cache invalidation:

  • Automatic when file modification time changes
  • Manual via the pdf_cache_clear tool
  • TTL: 24 hours (configurable)
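
The first and third rules can be combined into a single freshness check. The function below is a sketch under stated assumptions: it takes the file's modification time and the time the entry was cached as plain Unix timestamps, and the name and signature are hypothetical rather than pdf-mcp's internal API.

```python
def is_entry_fresh(
    cached_mtime: float,
    cached_at: float,
    current_mtime: float,
    now: float,
    ttl_hours: float = 24.0,
) -> bool:
    """True when the file is unchanged since caching and the entry is within its TTL."""
    if current_mtime != cached_mtime:
        return False  # the file was modified after it was cached
    return (now - cached_at) < ttl_hours * 3600


# Cached one hour ago, file untouched: still fresh.
print(is_entry_fresh(100.0, 1000.0, 100.0, 1000.0 + 3600))  # True
# Same entry 25 hours later: expired under the default 24-hour TTL.
print(is_entry_fresh(100.0, 1000.0, 100.0, 1000.0 + 25 * 3600))  # False
```

In practice `current_mtime` would come from something like `os.path.getmtime(path)` and `now` from `time.time()`.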

Configuration

Environment variables:

bash
# Cache directory (default: ~/.cache/pdf-mcp)
PDF_MCP_CACHE_DIR=/path/to/cache

# Cache TTL in hours (default: 24)
PDF_MCP_CACHE_TTL=48
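
For readers wiring these variables into their own tooling, the sketch below shows one way to resolve them with the documented defaults. The function name and returned dict are hypothetical, and parsing the TTL as a float is an assumption; the server may only accept whole hours.

```python
import os
from pathlib import Path


def load_settings(env=None) -> dict:
    """Resolve cache settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    cache_dir = Path(env.get("PDF_MCP_CACHE_DIR", "~/.cache/pdf-mcp")).expanduser()
    ttl_hours = float(env.get("PDF_MCP_CACHE_TTL", "24"))
    return {
        "cache_dir": cache_dir,
        "ttl_hours": ttl_hours,
        # File name taken from the cache location stated in the Caching section.
        "db_path": cache_dir / "cache.db",
    }


print(load_settings({"PDF_MCP_CACHE_TTL": "48"})["ttl_hours"])  # 48.0
```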

Development

bash
git clone https://github.com/jztan/pdf-mcp.git
cd pdf-mcp

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Type checking
mypy src/

# Linting
flake8 src/ tests/

# Formatting
black src/ tests/

Why pdf-mcp?

| | Without pdf-mcp | With pdf-mcp |
| --- | --- | --- |
| Large PDFs | Context overflow | Chunked reading |
| Token budgeting | Guess and overflow | Estimated tokens before reading |
| Finding content | Load everything | Keyword search (FTS5 + BM25) or semantic search (local embeddings) |
| Tables | Lost in raw text | Extracted and inlined per page |
| Images | Ignored | Extracted as PNG files |
| Repeated access | Re-parse every time | SQLite cache |
| Tool design | Single monolithic tool | 8 specialized tools |

Roadmap

See ROADMAP.md for planned features and release history.

Contributing

Contributions are welcome. Please submit a pull request.

License

MIT — see LICENSE.
