io.github.jghiringhelli/codeseeker
Graph-powered code intelligence with semantic search and knowledge graph for AI assistants
CodeSeeker
Four-layer hybrid search and knowledge graph for AI coding assistants.
BM25 + vector embeddings + RAPTOR directory summaries + graph expansion — fused into a single MCP tool that gives Claude, Copilot, and Cursor a real understanding of your codebase.
Works with Claude Code, GitHub Copilot (VS Code 1.99+), Cursor, Windsurf, and Claude Desktop.
Zero configuration — indexes on first use, stays in sync automatically.
The Problem
AI assistants are powerful editors, but they navigate code like a tourist:
- Grep finds text — not meaning. "find authentication logic" returns every file containing the word "auth"
- File reads are isolated — Claude sees a file but not its dependencies, callers, or the patterns your team established
- No memory of your project — every session starts from scratch
CodeSeeker fixes this. It indexes your codebase once and gives AI assistants a queryable knowledge graph they can use on every turn.
How It Works
A 4-stage pipeline runs on every query:
Query: "find JWT refresh token logic"
│
▼ Stage 1 — Hybrid retrieval
┌─────────────────────────────────────────────────────┐
│ BM25 (exact symbols, camelCase tokenized) │
│ + │
│ Vector search (384-dim Xenova embeddings) │
│ ↓ │
│ Reciprocal Rank Fusion: score = Σ 1/(60 + rank_i) │
│ Top-30 results, including RAPTOR directory nodes │
└─────────────────────────────────────────────────────┘
│
▼ Stage 2 — RAPTOR cascade (conditional)
┌─────────────────────────────────────────────────────┐
│ IF best directory-summary score ≥ 0.5: │
│ → narrow results to that directory automatically │
│ ELSE: all 30 results pass through unchanged │
│ Effect: "what does auth/ do?" scopes to auth/ │
│ "jwt.ts decode function" bypasses this │
└─────────────────────────────────────────────────────┘
│
▼ Stage 3 — Scoring and deduplication
┌─────────────────────────────────────────────────────┐
│ Dedup: keep highest-score chunk per file │
│ Source files: +0.10 (definition sites matter) │
│ Test files: −0.15 (prevent test dominance) │
│ Symbol boost: +0.20 (query token in filename) │
│ Multi-chunk: up to +0.30 (file has many hits) │
└─────────────────────────────────────────────────────┘
│
▼ Stage 4 — Graph expansion
┌─────────────────────────────────────────────────────┐
│ Top-10 results → follow IMPORTS/CALLS/EXTENDS edges │
│ Structural neighbors scored at source × 0.7 │
│ Avg graph connectivity: 20.8 edges/node │
└─────────────────────────────────────────────────────┘
│
▼
auth/jwt.ts (0.94), auth/refresh.ts (0.89), ...
The knowledge graph is built from AST-parsed imports at index time. It's what powers analyze dependencies, dead-code detection, and graph expansion in every search.
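The Reciprocal Rank Fusion step in Stage 1 can be sketched as follows. This is a minimal illustration of the `score = Σ 1/(60 + rank_i)` formula from the diagram above; `fuseRRF` and the sample rankings are hypothetical, not CodeSeeker's actual code:

```typescript
// Reciprocal Rank Fusion sketch: each ranked list contributes
// 1/(60 + rank) per document, so items ranked highly by BOTH
// BM25 and vector search end up with the largest fused score.
const RRF_K = 60;

function fuseRRF(rankedLists: string[][]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((doc, i) => {
      const rank = i + 1; // 1-based rank within this list
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (RRF_K + rank));
    });
  }
  return scores;
}

// Example: "jwt.ts" is #1 in the BM25 list and #2 in the vector list.
const bm25 = ["jwt.ts", "auth.ts", "user.ts"];
const vector = ["refresh.ts", "jwt.ts", "auth.ts"];
const fused = fuseRRF([bm25, vector]);
// jwt.ts scores 1/61 + 1/62, edging out auth.ts (1/62 + 1/63)
```

Because each list contributes at most 1/61 per document, a file must rank well in several lists to dominate, which is why RRF needs no manual weight tuning between the BM25 and vector signals.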
What Makes It Different
| Approach | Strengths | Limitations |
|---|---|---|
| Grep / ripgrep | Fast, universal | No semantic understanding |
| Vector search only | Finds similar code | Misses structural relationships |
| Serena | Precise LSP symbol navigation, 30+ languages | No semantic search, no cross-file reasoning |
| Codanna | Fast symbol lookup, good call graphs | Semantic search needs JSDoc — undocumented code gets no embeddings; no BM25, no RAPTOR, Windows experimental |
| CodeSeeker | BM25 + embedding fusion + RAPTOR + graph + coding standards + multi-language AST | Requires initial indexing (30s–5min) |
What LSP tools can't do:
- "Find code that handles errors like this" → semantic pattern search
- "What validation approach does this project use?" → auto-detected coding standards
- "Show me everything related to authentication" → graph traversal across indirect dependencies
What vector-only search misses:
- Direct import/export chains
- Class inheritance hierarchies
- Which files actually depend on which
Installation
Recommended: npx (no install needed)
The standard way to configure any MCP server — no global install required:
{
"mcpServers": {
"codeseeker": {
"command": "npx",
"args": ["-y", "codeseeker", "serve", "--mcp"]
}
}
}
Add this to your MCP config file (see below for per-client locations) and restart your editor.
npm global install
npm install -g codeseeker
codeseeker install --vscode # or --cursor, --windsurf
🔌 Claude Code Plugin
For Claude Code CLI users — adds auto-sync hooks and slash commands:
/plugin install codeseeker@github:jghiringhelli/codeseeker#plugin
Slash commands: /codeseeker:init, /codeseeker:reindex
☁️ Devcontainers / GitHub Codespaces
{
"name": "My Project",
"image": "mcr.microsoft.com/devcontainers/javascript-node:18",
"postCreateCommand": "npm install -g codeseeker && codeseeker install --vscode"
}
✅ Verify
Ask your AI assistant: "What CodeSeeker tools do you have?"
You should see: search, analyze, index — CodeSeeker's three tools.
Advanced Installation Options
<details> <summary><b>📋 MCP Configuration by client</b></summary>

The MCP config JSON is the same for all clients — only the file location differs:
| Client | Config file |
|---|---|
| VS Code (Claude Code / Copilot) | .vscode/mcp.json in your project, or ~/.vscode/mcp.json globally |
| Cursor | .cursor/mcp.json in your project |
| Claude Desktop | ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows) |
| Windsurf | .windsurf/mcp.json in your project |
{
"mcpServers": {
"codeseeker": {
"command": "npx",
"args": ["-y", "codeseeker", "serve", "--mcp"]
}
}
}
npm install -g codeseeker
cd your-project
codeseeker init
codeseeker -c "how does authentication work in this project?"
</details>
What You Get
Once configured, Claude has access to these MCP tools (used automatically):
| Tool | Actions / Usage | What It Does |
|---|---|---|
| search | {query} | Hybrid search: vector + BM25 text + path-match, fused with RRF; RAPTOR directory summaries surface for abstract queries |
| search | {query, search_type: "graph"} | Hybrid search + Graph RAG — follows import/call/extends edges to surface structurally connected files |
| search | {query, search_type: "vector"} | Pure embedding cosine-similarity search (no BM25 or path scoring) |
| search | {query, search_type: "fts"} | Pure BM25 text search with CamelCase tokenisation and synonym expansion |
| search | {query, read: true} | Search + read file contents in one step |
| search | {filepath} | Read a file with its related code automatically included |
| analyze | {action: "dependencies", filepath} | Traverse the knowledge graph (imports, calls, extends) |
| analyze | {action: "standards"} | Your project's detected patterns (validation, error handling) |
| analyze | {action: "duplicates"} | Find duplicate/similar code blocks across your codebase |
| analyze | {action: "dead_code"} | Detect unused exports, functions, and classes |
| index | {action: "init", path} | Manually trigger indexing (rarely needed) |
| index | {action: "sync", changes} | Update index for specific files |
| index | {action: "exclude", paths} | Dynamically exclude/include files from the index |
| index | {action: "status"} | List indexed projects with file/chunk counts |
You don't invoke these manually—Claude uses them automatically when searching code or analyzing relationships.
How Indexing Works
You don't need to manually index. When Claude uses any CodeSeeker tool, the tool automatically checks if the project is indexed. If not, it indexes on first use.
User: "Find the authentication logic"
│
▼
┌─────────────────────────────────────┐
│ Claude calls search({query: ...}) │
│ │ │
│ ▼ │
│ Project indexed? ──No──► Index now │
│ │ (auto) │
│ Yes │ │
│ │◀───────────────────┘ │
│ ▼ │
│ Return search results │
└─────────────────────────────────────┘
First search on a new project takes 30 seconds to several minutes (depending on size). Subsequent searches are instant.
Search Quality Research
<details> <summary><b>📊 Component ablation study (v2.0.0)</b> — measured impact of each retrieval layer</summary>

Setup
18 hand-labelled queries across two real-world codebases:
| Corpus | Language | Files | Queries | Query types |
|---|---|---|---|---|
| Conclave | TypeScript (pnpm monorepo) | 201 | 10 | Symbol lookup, cross-file chains, out-of-scope |
| ImperialCommander2 | C# / Unity | 199 | 8 | Class lookup, controller wiring, file I/O |
Each query has one or more mustFind targets (exact file basenames) and optional mustNotFind targets (scope leak check). Queries were run on a real index built from source — real Xenova embeddings, real graph, real RAPTOR L2 nodes — to reflect production conditions.
Metrics: MRR (Mean Reciprocal Rank), P@1 (Precision at 1), R@5 (Recall at 5), F1@3.
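For reference, these ranking metrics can be computed as follows. This is an illustrative sketch; the function names are mine, not the benchmark script's:

```typescript
// results: ranked file basenames returned by a query.
// mustFind: the labelled target files for that query.

// Reciprocal rank: 1/rank of the first relevant result, 0 if absent.
// MRR is this value averaged over all queries.
function reciprocalRank(results: string[], mustFind: Set<string>): number {
  const idx = results.findIndex((r) => mustFind.has(r));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Precision@1: is the single top result a labelled target?
function precisionAt1(results: string[], mustFind: Set<string>): number {
  return mustFind.has(results[0]) ? 1 : 0;
}

// Recall@5: fraction of labelled targets found in the top 5.
function recallAt5(results: string[], mustFind: Set<string>): number {
  const top5 = new Set(results.slice(0, 5));
  let hit = 0;
  for (const t of mustFind) if (top5.has(t)) hit++;
  return hit / mustFind.size;
}
```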
Ablation results
| Configuration | MRR | P@1 | P@3 | R@5 | F1@3 | Notes |
|---|---|---|---|---|---|---|
| Hybrid baseline (BM25 + embed + RAPTOR, no graph) | 75.2% | 61.1% | 29.6% | 91.7% | 44.4% | Production default |
| + graph 1-hop | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | ±0% ranking, adds structural neighbors |
| + graph 2-hop | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | Scope leaks on unrelated queries |
| No RAPTOR (graph 1-hop) | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | RAPTOR contributes +0.3% |
What each layer actually does
BM25 + embedding fusion (RRF)
The workhorse. Handles ~94% of ranking quality on its own. BM25 catches exact symbol names and camelCase tokens; vector embeddings catch semantic similarity when names differ. Fused with Reciprocal Rank Fusion to combine both signals without manual weight tuning.
RAPTOR (hierarchical directory summaries)
Generates per-directory embedding nodes by mean-pooling all file embeddings in a folder. Acts as a post-filter: when a directory summary scores ≥ 0.5 against the query, results are narrowed to that directory's files. Measured contribution: +0.3% MRR on symbol queries. Fires conservatively — only when the directory is an obvious match. Its real value is on abstract queries ("what does the payments module do?") which don't appear in this benchmark; for those queries it prevents broad scattering across the entire codebase.
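The mean-pool-and-filter behaviour described above can be sketched like this (illustrative shapes; `meanPool` and `raptorFilter` are hypothetical names, while the 0.5 threshold comes from the text above):

```typescript
// A RAPTOR directory node is the mean of its files' embeddings.
function meanPool(embeddings: number[][]): number[] {
  const dim = embeddings[0].length;
  const out: number[] = new Array(dim).fill(0);
  for (const e of embeddings)
    for (let i = 0; i < dim; i++) out[i] += e[i] / embeddings.length;
  return out;
}

// Post-filter: only when the best directory summary scores >= 0.5
// against the query are results narrowed to that directory's files.
function raptorFilter(
  results: { file: string; score: number }[],
  bestDir: { path: string; score: number } | null,
): { file: string; score: number }[] {
  if (bestDir && bestDir.score >= 0.5) {
    // "what does auth/ do?" → keep only files under auth/
    return results.filter((r) => r.file.startsWith(bestDir.path));
  }
  return results; // symbol queries pass through unchanged
}
```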
Knowledge graph (import/dependency edges)
Average connectivity: 20.8 file→file edges per node across both TS and C# codebases. Measured ranking impact: ±0% MRR for 1-hop expansion. The graph doesn't move MRR because the semantic layer already finds the right files — the graph's neighbors are usually already in the top-15. Its value is structural: the analyze dependencies action and explicit graph search type give Claude traversable import chains, inheritance hierarchies, and dependency paths that embeddings alone cannot provide.
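The 1-hop expansion itself is simple to sketch (a hypothetical helper; the × 0.7 neighbor discount is the figure from Stage 4 above):

```typescript
// An edge in the knowledge graph: IMPORTS / CALLS / EXTENDS.
type Edge = { from: string; to: string };

// Structural neighbors of the top results enter the result list
// at their source file's score × 0.7, unless already present.
function expandOneHop(
  top: Map<string, number>, // file -> score for the top results
  edges: Edge[],
): Map<string, number> {
  const out = new Map(top);
  for (const { from, to } of edges) {
    const src = top.get(from);
    if (src !== undefined && !out.has(to)) {
      out.set(to, src * 0.7); // neighbor inherits a discounted score
    }
  }
  return out;
}
```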
Type boost / penalty scoring
Source files get +0.10 score boost; test files get −0.15 penalty; lock files and docs get −0.05 penalty. Without this, integration.test.ts would rank above dag-engine.ts for exact symbol queries because test files import and exercise every symbol in the source. The penalty corrects this without eliminating test files from results.
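A sketch of this scoring step (the boost/penalty constants come from the text above; the file-classification regexes are simplified stand-ins, not CodeSeeker's actual rules):

```typescript
// Apply type boosts/penalties plus the symbol boost to a fused score.
function adjustScore(file: string, score: number, queryTokens: string[]): number {
  let s = score;
  if (/\.test\.|\.spec\./.test(file)) s -= 0.15;  // keep tests from dominating
  else if (/lock|\.md$/.test(file)) s -= 0.05;    // lock files and docs
  else s += 0.10;                                 // source definition sites
  // Symbol boost: a query token appears in the filename.
  if (queryTokens.some((t) => file.toLowerCase().includes(t.toLowerCase())))
    s += 0.20;
  return s;
}
```

With equal fused scores, `dag-engine.ts` (+0.10, +0.20 symbol boost) now outranks `integration.test.ts` (−0.15), which is exactly the inversion the penalty is meant to produce.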
Monorepo directory exclusion fix
The single highest-impact change in v1.12.0: removing packages/ from the default exclusion list. For pnpm/yarn/lerna monorepos where all source lives under packages/, this exclusion was silently dropping all source files. Effect: 10% → 72% MRR on the Conclave monorepo benchmark.
Known limitations
| Query | Target | Issue | Root cause |
|---|---|---|---|
| cv-prompts | orchestrator.ts | rank 97+ even with 2-hop graph | prompt-builder.test.ts outscores prompt-builder.ts semantically; source file never enters top-10, so we can't graph-walk from it to orchestrator.ts. Test-file dominance on cross-file queries. |
| cv-exec-mode | types.ts | rank 11–12 | types.ts is a pure type-export file; low keyword density. Found within R@5 (rank ≤ 15). |
Benchmark script
Reproduce with:
npm run build
node scripts/real-bench.js
Requires C:\workspace\claude\conclave and C:\workspace\ImperialCommander2 to be present locally (or update paths in scripts/real-bench.js).
</details>
Auto-Detected Coding Standards
CodeSeeker analyzes your codebase and extracts patterns:
{
"validation": {
"email": {
"preferred": "z.string().email()",
"usage_count": 12,
"files": ["src/auth.ts", "src/user.ts"]
}
},
"react-patterns": {
"state": {
"preferred": "useState<T>()",
"usage_count": 45
}
}
}
Detected pattern categories:
- validation: Zod, Yup, Joi, validator.js, custom regex
- error-handling: API error responses, try-catch patterns, custom Error classes
- logging: Console, Winston, Bunyan, structured logging
- testing: Jest/Vitest setup, assertion patterns
- react-patterns: Hooks (useState, useEffect, useMemo, useCallback, useRef)
- state-management: Redux Toolkit, Zustand, React Context, TanStack Query
- api-patterns: Fetch, Axios, Express routes, Next.js API routes
When Claude writes new code, it follows your existing conventions instead of inventing new ones.
Managing Index Exclusions
If Claude notices files that shouldn't be indexed (like Unity's Library folder, build outputs, or generated files), it can dynamically exclude them:
// Exclude Unity Library folder and generated files
index({
action: "exclude",
project: "my-unity-game",
paths: ["Library/**", "Temp/**", "*.generated.cs"],
reason: "Unity build artifacts"
})
Exclusions are persisted in .codeseeker/exclusions.json and automatically respected during reindexing.
Code Cleanup Tools
CodeSeeker helps you maintain a clean codebase by finding duplicate code and detecting dead code.
Finding Duplicate Code
Ask Claude to find similar code blocks that could be consolidated:
"Find duplicate code in my project"
"Are there any similar functions that could be merged?"
"Show me copy-pasted code that should be refactored"
CodeSeeker uses vector similarity to find semantically similar code—not just exact matches. It detects:
- Copy-pasted functions with minor variations
- Similar validation logic across files
- Repeated patterns that could be extracted into utilities
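Under the hood this amounts to pairwise cosine similarity over chunk embeddings, roughly like the sketch below (illustrative only; the 0.9 threshold is an assumed cutoff, not CodeSeeker's actual value):

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Report chunk pairs whose embeddings exceed the similarity cutoff.
function findDuplicates(
  chunks: { id: string; embedding: number[] }[],
  threshold = 0.9, // assumed cutoff for this sketch
): [string, string, number][] {
  const pairs: [string, string, number][] = [];
  for (let i = 0; i < chunks.length; i++)
    for (let j = i + 1; j < chunks.length; j++) {
      const sim = cosine(chunks[i].embedding, chunks[j].embedding);
      if (sim >= threshold) pairs.push([chunks[i].id, chunks[j].id, sim]);
    }
  return pairs;
}
```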
Finding Dead Code
Ask Claude to identify unused code that can be safely removed:
"Find dead code in this project"
"What functions are never called?"
"Show me unused exports"
CodeSeeker analyzes the knowledge graph to find:
- Exported functions/classes that are never imported
- Internal functions with no callers
- Orphaned files with no incoming dependencies
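Conceptually, the unused-export check reduces to a set difference over graph edges, roughly as follows (illustrative data shapes, not CodeSeeker's internal schema):

```typescript
// An import edge in the knowledge graph: which file imports
// which symbol from which source file.
type ImportEdge = { importer: string; source: string; symbol: string };

// An export with no incoming import edge anywhere is a dead-code candidate.
function findDeadExports(
  exports: { file: string; symbol: string }[],
  imports: ImportEdge[],
): { file: string; symbol: string }[] {
  const used = new Set(imports.map((i) => `${i.source}#${i.symbol}`));
  return exports.filter((e) => !used.has(`${e.file}#${e.symbol}`));
}
```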
Example workflow:
User: "Use CodeSeeker to clean up this project"
Claude: I'll analyze your codebase for cleanup opportunities.
Found 3 duplicate code blocks:
- validateEmail() in auth.ts and user.ts (92% similar)
- formatDate() appears in 4 files with minor variations
- Error handling pattern repeated in api/*.ts
Found 2 dead code files:
- src/utils/legacy-helper.ts (0 imports)
- src/services/unused-service.ts (exported but never imported)
Would you like me to:
1. Consolidate the duplicate validators into a shared utility?
2. Remove the dead code files?
Language Support
| Language | Parser | Relationship Extraction |
|---|---|---|
| TypeScript/JavaScript | Babel AST | Excellent |
| Python | Tree-sitter | Excellent |
| Java | Tree-sitter | Excellent |
| C# | Regex | Good |
| Go | Regex | Good |
| Rust, C/C++, Ruby, PHP | Regex | Basic |
Tree-sitter parsers install automatically when needed.
Keeping the Index in Sync
With Claude Code Plugin
The plugin installs hooks that automatically update the index:
| Event | What Happens |
|---|---|
| Claude edits a file | Index updated automatically |
| Claude runs git pull/checkout/merge | Full reindex triggered |
| You run /codeseeker:reindex | Manual full reindex |
You don't need to do anything—the plugin handles sync automatically.
With MCP Server Only (Cursor, Claude Desktop)
- Claude-initiated changes: Claude can call the index({action: "sync"}) tool
- Manual changes: Not automatically detected — ask Claude to reindex periodically
Sync Summary
| Setup | Claude Edits | Git Operations | Manual Edits |
|---|---|---|---|
| Plugin (Claude Code) | Auto | Auto | Manual |
| MCP (Cursor, Desktop) | Ask Claude | Ask Claude | Ask Claude |
| CLI | Auto | Auto | Manual |
When CodeSeeker Helps Most
Good fit:
- Large codebases (10K+ files) where Claude struggles to find relevant code
- Projects with established patterns you want Claude to follow
- Complex dependency chains across multiple files
- Teams wanting consistent AI-generated code
Less useful:
- Greenfield projects with little existing code
- Single-file scripts
- Projects where you're actively changing architecture
Architecture
┌──────────────────────────────────────────────────────────┐
│ Claude Code │
│ │ │
│ MCP Protocol │
│ │ │
│ ┌──────────────────────▼──────────────────────────┐ │
│ │ CodeSeeker MCP Server │ │
│ │ ┌─────────────┬─────────────┬────────────────┐ │ │
│ │ │ Vector │ Knowledge │ Coding │ │ │
│ │ │ Search │ Graph │ Standards │ │ │
│ │ │ (SQLite) │ (SQLite) │ (JSON) │ │ │
│ │ └─────────────┴─────────────┴────────────────┘ │ │
│ └─────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
All data stored locally in .codeseeker/. No external services required.
For large teams (100K+ files, shared indexes), server mode supports PostgreSQL + Neo4j. See Storage Documentation.
For the complete technical internals — exact scoring formulas, MCP tool schema, graph edge types, RAPTOR threshold logic, pipeline stages, analysis confidence tiers — see the Technical Architecture Manual.
Troubleshooting
MCP server not connecting
- Verify npm and npx work: npx -y codeseeker --version
- Check MCP config file syntax (valid JSON, no trailing commas)
- Restart your editor/Claude application completely
- Check that Node.js is installed: node --version (need v18+)
Indexing seems slow
First-time indexing of large projects (50K+ files) can take 5+ minutes. Subsequent uses are instant.
Tools not appearing in Claude
- Ask Claude: "What CodeSeeker tools do you have?"
- If no tools appear, check MCP config file exists and has correct syntax
- Restart your IDE completely (not just reload window)
- Check Claude/Copilot MCP connection status in IDE
Still stuck?
Open an issue: GitHub Issues
Documentation
- Integration Guide - How all components connect
- Architecture - Technical deep dive
- CLI Commands - Full command reference
Supported Platforms
| Client | MCP Support | Config |
|---|---|---|
| Claude Code (VS Code) | ✅ | .vscode/mcp.json or plugin |
| GitHub Copilot (VS Code 1.99+) | ✅ | .vscode/mcp.json |
| Cursor | ✅ | .cursor/mcp.json |
| Windsurf | ✅ | .windsurf/mcp.json |
| Claude Desktop | ✅ | claude_desktop_config.json |
| Visual Studio | ✅ | codeseeker install --vs |
Claude Code and GitHub Copilot share the same .vscode/mcp.json — configure once, works for both.
Support
If CodeSeeker is useful to you, consider sponsoring the project.
License
MIT License. See LICENSE.
CodeSeeker gives Claude the code understanding that grep and embeddings alone can't provide.