io.github.MikeRecognex/mcp-codebase-index

Coding & Debugging

by mikerecognex

A structural codebase indexer with 18 built-in query tools that cuts token consumption by 87%, with zero dependencies.

README

<!-- mcp-name: io.github.MikeRecognex/mcp-codebase-index -->

mcp-codebase-index

A structural codebase indexer with an MCP server for AI-assisted development. Zero runtime dependencies — uses Python's ast module for Python analysis and regex-based parsing for TypeScript/JS, Go, Rust, and C#. Requires Python 3.11+.

What It Does

Indexes codebases by parsing source files into structural metadata -- functions, classes, imports, dependency graphs, and cross-file call chains -- then exposes 18 query tools via the Model Context Protocol, enabling Claude Code and other MCP clients to navigate codebases efficiently without reading entire files.

Automatic incremental re-indexing: In git repositories, the index stays up to date automatically. Before every query, the server checks git diff and git status (~1-2ms). If files changed, only those files are re-parsed and the dependency graph is rebuilt. No need to manually call reindex after edits, branch switches, or pulls.
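
The change detection described above could look roughly like the sketch below. This is not the project's actual implementation: the helper names `parse_porcelain` and `changed_files` are hypothetical, and the code assumes `git` is on the PATH and that file paths contain no characters git would quote.

```python
import subprocess


def parse_porcelain(output: str) -> set[str]:
    """Extract paths from `git status --porcelain` (v1) output.

    Each line is two status characters, a space, then the path.
    Deleted files are included so a caller can drop them from its index.
    """
    paths = set()
    for line in output.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]
        # Renames are reported as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        paths.add(path)
    return paths


def changed_files(root: str, since_ref: str) -> set[str]:
    """Files committed since `since_ref` plus uncommitted or untracked files."""

    def git(*args: str) -> str:
        result = subprocess.run(
            ["git", *args], cwd=root, capture_output=True, text=True, check=True
        )
        return result.stdout

    committed = set(git("diff", "--name-only", since_ref, "HEAD").splitlines())
    return committed | parse_porcelain(git("status", "--porcelain"))
```

Only the files returned by `changed_files` would need re-parsing. A branch switch or pull shows up through the `git diff` half, and local edits through the `git status` half.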

Persistent disk cache: The index is saved to a pickle cache file (.codebase-index-cache.pkl) after every build. On subsequent server starts, the cache is loaded and validated against the current git HEAD — if the ref matches, startup is instant. If a small number of files changed (≤20), the cached index is loaded and incrementally updated instead of rebuilt from scratch. This eliminates the cold-start penalty when restarting Claude Code sessions, restarting the MCP server, or resuming work after context compaction.

Language Support

| Language | Method | Extracts |
| --- | --- | --- |
| Python (.py) | AST parsing | Functions, classes, methods, imports, dependency graph |
| TypeScript/JS (.ts, .tsx, .js, .jsx) | Regex-based | Functions, arrow functions, classes, interfaces, type aliases, imports |
| Go (.go) | Regex-based | Functions, methods (receiver-based), structs, interfaces, type aliases, imports, doc comments |
| Rust (.rs) | Regex-based | Functions (pub/async/const/unsafe), structs, enums, traits, impl blocks, use statements, attributes, doc comments, macro_rules |
| C# (.cs) | Regex-based | Classes, interfaces, structs, enums, records, methods, constructors, using directives, [Attributes], /// XML doc comments |
| Markdown/Text (.md, .txt, .rst) | Heading detection | Sections (# headings, underlines, numbered, ALL-CAPS) |
| Other | Generic | Line counts only |
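
To give a feel for what AST-based extraction of Python files involves, here is a minimal sketch using only the standard `ast` module. The function name `extract_structure` and the shape of the returned dictionary are illustrative assumptions, not the package's internal data model.

```python
import ast


def extract_structure(source: str) -> dict[str, list]:
    """Collect top-level functions, classes (with method names), and imports."""
    tree = ast.parse(source)
    info: dict[str, list] = {"functions": [], "classes": [], "imports": []}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Record the name plus the first and last source line.
            info["functions"].append((node.name, node.lineno, node.end_lineno))
        elif isinstance(node, ast.ClassDef):
            methods = [
                child.name
                for child in node.body
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
            info["classes"].append((node.name, node.lineno, methods))
        elif isinstance(node, ast.Import):
            info["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as "from . import x" have no module name.
            info["imports"].append(node.module or ".")
    return info


SAMPLE = '''
import os
from pathlib import Path

class Greeter:
    def hello(self):
        return "hi"

def main():
    Greeter().hello()
'''

print(extract_structure(SAMPLE))
```

Because the parser works on the syntax tree rather than on text, it is not confused by function-like strings or comments, which is the main trade-off against the regex-based approach used for the other languages.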

Installation

bash
pip install "mcp-codebase-index[mcp]"

The [mcp] extra includes the MCP server dependency. Omit it if you only need the programmatic API.

For development (from a local clone):

bash
pip install -e ".[dev,mcp]"

MCP Server

Running

bash
# As a console script
PROJECT_ROOT=/path/to/project mcp-codebase-index

# As a Python module
PROJECT_ROOT=/path/to/project python -m mcp_codebase_index.server

PROJECT_ROOT specifies which directory to index. Defaults to the current working directory.

Persistent Cache

In git repositories, the server automatically caches the index to .codebase-index-cache.pkl in the project root. On startup:

  1. Cache hit (exact match): If the cached git ref matches the current HEAD, the index loads instantly from disk — no parsing, no file walking.
  2. Cache hit (small changeset): If ≤20 files changed since the cached ref, the cached index is loaded and incrementally updated on the first query.
  3. Cache miss: If the changeset is large or no cache exists, a full rebuild runs and saves a new cache.

Add .codebase-index-cache.pkl to your .gitignore — it's a local-only build artifact.

Configuring with OpenClaw

Install the package on the machine where OpenClaw is running:

bash
# Local install
pip install "mcp-codebase-index[mcp]"

# Or inside a Docker container / remote VPS
docker exec -it openclaw bash
pip install "mcp-codebase-index[mcp]"

Add the MCP server to your OpenClaw agent config (openclaw.json):

json
{
  "agents": {
    "list": [{
      "id": "main",
      "mcp": {
        "servers": [
          {
            "name": "codebase-index",
            "command": "mcp-codebase-index",
            "env": {
              "PROJECT_ROOT": "/path/to/project"
            }
          }
        ]
      }
    }]
  }
}

Restart OpenClaw and verify the connection:

bash
openclaw mcp list

All 18 tools will be available to your agent.

Performance note: The server detects file changes via git diff before every query (~1-2ms) and incrementally re-indexes only what changed. However, OpenClaw's default MCP integration via mcporter spawns a fresh server process per tool call, which discards the in-memory index and would otherwise force a full rebuild each time (~1-2s for small projects, longer for large ones). With persistent caching, these cold starts are significantly faster, because the server loads the index from the disk cache instead of re-parsing the entire codebase. To avoid even the cache-load overhead, use the openclaw-mcp-adapter plugin, which connects once at startup and keeps the server running:

bash
pip install openclaw-mcp-adapter

Configuring with Claude Code

Add to your project's .mcp.json:

json
{
  "mcpServers": {
    "codebase-index": {
      "command": "mcp-codebase-index",
      "env": {
        "PROJECT_ROOT": "/path/to/project"
      }
    }
  }
}

Or using the Python module directly (useful if installed in a virtualenv):

json
{
  "mcpServers": {
    "codebase-index": {
      "command": "/path/to/.venv/bin/python3",
      "args": ["-m", "mcp_codebase_index.server"],
      "env": {
        "PROJECT_ROOT": "/path/to/project"
      }
    }
  }
}
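
If you would rather not hand-edit the interpreter path, the virtualenv variant above can be generated from inside the environment where the package is installed. The sketch below is a convenience script, not part of the package; `build_mcp_config` and `write_mcp_config` are hypothetical names, and the writer overwrites any existing .mcp.json.

```python
import json
import sys
from pathlib import Path


def build_mcp_config(project_root: str, python: str = sys.executable) -> dict:
    """Build the .mcp.json structure, pointing at the given Python interpreter."""
    return {
        "mcpServers": {
            "codebase-index": {
                "command": python,
                "args": ["-m", "mcp_codebase_index.server"],
                "env": {"PROJECT_ROOT": str(Path(project_root).resolve())},
            }
        }
    }


def write_mcp_config(project_root: str) -> Path:
    """Write .mcp.json into the project root (replaces an existing file)."""
    target = Path(project_root) / ".mcp.json"
    target.write_text(json.dumps(build_mcp_config(project_root), indent=2) + "\n")
    return target
```

Running `write_mcp_config(".")` with the virtualenv's interpreter records that interpreter in the config, so Claude Code launches the server with the environment that actually has the package.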

Reinforcing Tool Usage with Hooks

Claude Code tends to default to built-in Glob/Grep/Read tools even when codebase-index is available. In addition to CLAUDE.md instructions (see below), you can add hooks that fire at session start and on every prompt to reinforce the behavior. Add this to .claude/settings.local.json:

json
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "echo 'CRITICAL REMINDER: Use codebase-index MCP tools FIRST for ALL code navigation (find_symbol, get_function_source, search_codebase, get_dependencies, etc). Only fall back to Glob/Grep/Read for non-code files.'"
          }
        ]
      }
    ],
    "UserPromptSubmit": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "echo 'Use codebase-index MCP tools first for code navigation.'"
          }
        ]
      }
    ]
  }
}

Hook stdout is injected as context Claude sees before responding. SessionStart fires on startup, resume, and context compaction. UserPromptSubmit fires on every turn.

Important: Make the AI Actually Use Indexed Tools

By default, AI assistants will ignore the indexed tools and fall back to reading entire files with Glob/Grep/Read. Soft language like "prefer" gets rationalized away. Add this to your project's CLAUDE.md (or equivalent instructions file) with mandatory language:

code
## Codebase Navigation — MANDATORY

You MUST use codebase-index MCP tools FIRST when exploring or navigating the codebase. This is not optional.

- ALWAYS start with: get_project_summary, find_symbol, get_function_source, get_class_source,
  get_structure_summary, get_dependencies, get_dependents, get_change_impact, get_call_chain, search_codebase
- Only fall back to Read/Glob/Grep when codebase-index tools genuinely don't have what you need
  (e.g. reading non-code files, config, frontmatter)
- If you catch yourself reaching for Glob/Grep/Read to find or understand code, STOP and use
  codebase-index instead

The word "prefer" is too weak — models treat it as a suggestion and default to familiar tools. Mandatory language with explicit fallback criteria is what actually changes behavior.

Available Tools (18)

| Tool | Description |
| --- | --- |
| get_project_summary | File count, packages, top classes/functions |
| list_files | List indexed files with optional glob filter |
| get_structure_summary | Structure of a file or the whole project |
| get_functions | List functions with name, lines, params |
| get_classes | List classes with name, lines, methods, bases |
| get_imports | List imports with module, names, line |
| get_function_source | Full source of a function/method |
| get_class_source | Full source of a class |
| find_symbol | Find where a symbol is defined (file, line, type) |
| get_dependencies | What a symbol calls/uses |
| get_dependents | What calls/uses a symbol |
| get_change_impact | Direct + transitive dependents |
| get_call_chain | Shortest dependency path (BFS) |
| get_file_dependencies | Files imported by a given file |
| get_file_dependents | Files that import from a given file |
| search_codebase | Regex search across all files (max 100 results) |
| reindex | Force full re-index (rarely needed; incremental updates happen automatically in git repos) |
| get_usage_stats | Session efficiency stats: tool calls, characters returned vs total source, estimated token savings |

Benchmarks

Tested across four real-world projects on an M-series MacBook Pro, from a small project to CPython itself (1.1 million lines):

Index Build Performance

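| Project | Files | Lines | Functions | Classes | Index Time | Peak Memory |
| --- | --- | --- | --- | --- | --- | --- |
| RMLPlus | 36 | 7,762 | 237 | 55 | 0.9s | 2.4 MB |
| FastAPI | 2,556 | 332,160 | 4,139 | 617 | 5.7s | 55 MB |
| Django | 3,714 | 707,493 | 29,995 | 7,371 | 36.2s | 126 MB |
| CPython | 2,464 | 1,115,334 | 59,620 | 9,037 | 55.9s | 197 MB |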

With persistent caching, subsequent startups bypass the full build entirely. Cache load time is negligible compared to parsing — a cache hit on CPython restores the full index in under a second instead of 56s.

Query Response Size vs Total Source

Querying CPython — 41 million characters of source code:

| Query | Response | Total Source | Reduction |
| --- | --- | --- | --- |
| find_symbol("TestCase") | 67 chars | 41,077,561 chars | 99.9998% |
| get_dependencies("compile") | 115 chars | 41,077,561 chars | 99.9997% |
| get_change_impact("TestCase") | 16,812 chars | 41,077,561 chars | 99.96% |
| get_function_source("compile") | 4,531 chars | 41,077,561 chars | 99.99% |
| get_function_source("run_unittest") | 439 chars | 41,077,561 chars | 99.999% |

find_symbol returns 54-67 characters regardless of whether the project is 7K lines or 1.1M lines. Response size scales with the answer, not the codebase.
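
The reduction figures follow directly from the character counts: reduction = (1 - response / total) × 100. The short script below reproduces them from the numbers in the table; the function name `reduction_percent` is just a label for this example.

```python
TOTAL_SOURCE_CHARS = 41_077_561  # CPython source size from the table


def reduction_percent(response_chars: int, total_chars: int = TOTAL_SOURCE_CHARS) -> float:
    """Percentage of source characters that did not have to be sent."""
    return (1 - response_chars / total_chars) * 100


responses = {
    'find_symbol("TestCase")': 67,
    'get_dependencies("compile")': 115,
    'get_change_impact("TestCase")': 16_812,
    'get_function_source("compile")': 4_531,
    'get_function_source("run_unittest")': 439,
}

for query, size in responses.items():
    print(f"{query}: {reduction_percent(size):.4f}%")
```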

get_change_impact("TestCase") on CPython found 154 direct dependents and 492 transitive dependents in 0.45ms — the kind of query that's impossible without a dependency graph. Use max_direct and max_transitive to cap output to your token budget.

Query Response Time

Most targeted queries return in well under a millisecond, even on CPython's 1.1M lines; the slowest measured, get_change_impact on Django, took 2.81ms:

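| Query | RMLPlus | FastAPI | Django | CPython |
| --- | --- | --- | --- | --- |
| find_symbol | 0.01ms | 0.01ms | 0.03ms | 0.08ms |
| get_dependencies | 0.00ms | 0.00ms | 0.00ms | 0.01ms |
| get_change_impact | 0.02ms | 0.00ms | 2.81ms | 0.45ms |
| get_function_source | 0.01ms | 0.02ms | 0.03ms | 0.10ms |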

Run the benchmarks yourself: python benchmarks/benchmark.py

How Is This Different from LSP?

LSP answers "where is this function?" — mcp-codebase-index answers "what happens if I change it?" LSP is point queries: one symbol, one file, one position. It can tell you where LLMClient is defined and who references it. But ask "what breaks transitively if I refactor LLMClient?" and LSP has nothing. This tool returns 11 direct dependents and 31 transitive impacts in a single call — 204 characters. To get the same answer from LSP, the AI would need to chain dozens of find-reference calls recursively, reading files at every step, burning thousands of tokens to reconstruct what the dependency graph already knows.

LSP also requires you to install a separate language server for every language in your project — pyright for Python, vtsls for TypeScript, gopls for Go. Each one is a heavyweight binary with its own dependencies and configuration. mcp-codebase-index is zero dependencies, handles Python + TypeScript/JS + Go + Rust + C# + Markdown out of the box, and every response has built-in token budget controls (max_results, max_lines). LSP was built for IDEs. This was built for AI.

Programmatic Usage

python
from mcp_codebase_index.project_indexer import ProjectIndexer
from mcp_codebase_index.query_api import create_project_query_functions

indexer = ProjectIndexer("/path/to/project", include_patterns=["**/*.py"])
index = indexer.index()
query_funcs = create_project_query_functions(index)

# Use query functions
print(query_funcs["get_project_summary"]())
print(query_funcs["find_symbol"]("MyClass"))
print(query_funcs["get_change_impact"]("some_function"))

Development

bash
pip install -e ".[dev,mcp]"
pytest tests/ -v
ruff check src/ tests/

References

The structural indexer was originally developed as part of the RMLPlus project, an implementation of the Recursive Language Models framework.

License

This project is dual-licensed under AGPL-3.0 and a commercial license.

If you're using mcp-codebase-index as a standalone MCP server for development, the AGPL-3.0 license applies at no cost. If you're embedding it in a proprietary product or offering it as part of a hosted service, you'll need a commercial license. See COMMERCIAL-LICENSE.md for details.
