io.github.jghiringhelli/codeseeker

by jghiringhelli

Graph-powered code intelligence with semantic search and knowledge graph for AI assistants

CodeSeeker

Four-layer hybrid search and knowledge graph for AI coding assistants.
BM25 + vector embeddings + RAPTOR directory summaries + graph expansion — fused into a single MCP tool that gives Claude, Copilot, and Cursor a real understanding of your codebase.

Works with Claude Code, GitHub Copilot (VS Code 1.99+), Cursor, Windsurf, and Claude Desktop.
Zero configuration — indexes on first use, stays in sync automatically.

The Problem

AI assistants are powerful editors, but they navigate code like a tourist:

  • Grep finds text — not meaning. "find authentication logic" returns every file containing the word "auth"
  • File reads are isolated — Claude sees a file but not its dependencies, callers, or the patterns your team established
  • No memory of your project — every session starts from scratch

CodeSeeker fixes this. It indexes your codebase once and gives AI assistants a queryable knowledge graph they can use on every turn.

How It Works

A 4-stage pipeline runs on every query:

```text
Query: "find JWT refresh token logic"
        │
        ▼  Stage 1 — Hybrid retrieval
   ┌─────────────────────────────────────────────────────┐
   │ BM25 (exact symbols, camelCase tokenized)           │
   │   +                                                 │
   │ Vector search (384-dim Xenova embeddings)           │
   │   ↓                                                 │
   │ Reciprocal Rank Fusion: score = Σ 1/(60 + rank_i)  │
   │ Top-30 results, including RAPTOR directory nodes    │
   └─────────────────────────────────────────────────────┘
        │
        ▼  Stage 2 — RAPTOR cascade (conditional)
   ┌─────────────────────────────────────────────────────┐
   │ IF best directory-summary score ≥ 0.5:              │
   │   → narrow results to that directory automatically  │
   │ ELSE: all 30 results pass through unchanged         │
   │ Effect: "what does auth/ do?" scopes to auth/       │
   │         "jwt.ts decode function" bypasses this      │
   └─────────────────────────────────────────────────────┘
        │
        ▼  Stage 3 — Scoring and deduplication
   ┌─────────────────────────────────────────────────────┐
   │ Dedup: keep highest-score chunk per file            │
   │ Source files:  +0.10  (definition sites matter)     │
   │ Test files:    −0.15  (prevent test dominance)      │
   │ Symbol boost:  +0.20  (query token in filename)     │
   │ Multi-chunk:   up to +0.30  (file has many hits)    │
   └─────────────────────────────────────────────────────┘
        │
        ▼  Stage 4 — Graph expansion
   ┌─────────────────────────────────────────────────────┐
   │ Top-10 results → follow IMPORTS/CALLS/EXTENDS edges │
   │ Structural neighbors scored at source × 0.7        │
   │ Avg graph connectivity: 20.8 edges/node             │
   └─────────────────────────────────────────────────────┘
        │
        ▼
   auth/jwt.ts (0.94), auth/refresh.ts (0.89), ...
```

The knowledge graph is built from AST-parsed imports at index time. It's what powers analyze dependencies, dead-code detection, and graph expansion in every search.
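
The Reciprocal Rank Fusion step from Stage 1 of the pipeline can be illustrated in a few lines. This is a minimal sketch, not CodeSeeker's actual implementation: the function name `reciprocalRankFusion` and the sample file lists are hypothetical, and only the formula and the constant 60 are taken from the pipeline description.

```typescript
interface FusedHit {
  id: string;
  score: number;
}

// Reciprocal Rank Fusion: every ranked list contributes 1 / (k + rank) for
// each item it contains, and the contributions are summed per item.
function reciprocalRankFusion(rankings: string[][], k: number = 60): FusedHit[] {
  // Object.create(null) avoids collisions with Object.prototype keys.
  const scores: Record<string, number> = Object.create(null);
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores[id] = (scores[id] || 0) + 1 / (k + rank);
    });
  }
  return Object.keys(scores)
    .map((id) => ({ id: id, score: scores[id] }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical ranked lists for the query "find JWT refresh token logic".
const bm25Hits = ["auth/jwt.ts", "auth/refresh.ts", "tests/jwt.test.ts"];
const vectorHits = ["auth/refresh.ts", "auth/session.ts", "auth/jwt.ts"];
const fused = reciprocalRankFusion([bm25Hits, vectorHits]);
```

In this toy input, `auth/refresh.ts` sits near the top of both lists and therefore ends up ahead of `auth/jwt.ts`, which one list ranks first but the other ranks third. Because only ranks are used, the BM25 and vector scores never need to be put on a common scale.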

What Makes It Different

| Approach | Strengths | Limitations |
| --- | --- | --- |
| Grep / ripgrep | Fast, universal | No semantic understanding |
| Vector search only | Finds similar code | Misses structural relationships |
| Serena | Precise LSP symbol navigation, 30+ languages | No semantic search, no cross-file reasoning |
| Codanna | Fast symbol lookup, good call graphs | Semantic search needs JSDoc, so undocumented code gets no embeddings; no BM25; no RAPTOR; Windows support is experimental |
| CodeSeeker | BM25 + embedding fusion + RAPTOR + graph + coding standards + multi-language AST | Requires initial indexing (30s–5min) |

What LSP tools can't do:

  • "Find code that handles errors like this" → semantic pattern search
  • "What validation approach does this project use?" → auto-detected coding standards
  • "Show me everything related to authentication" → graph traversal across indirect dependencies

What vector-only search misses:

  • Direct import/export chains
  • Class inheritance hierarchies
  • Which files actually depend on which

Installation

Recommended: npx (no install needed)

The standard way to configure any MCP server — no global install required:

```json
{
  "mcpServers": {
    "codeseeker": {
      "command": "npx",
      "args": ["-y", "codeseeker", "serve", "--mcp"]
    }
  }
}
```

Add this to your MCP config file (see below for per-client locations) and restart your editor.

npm global install

```bash
npm install -g codeseeker
codeseeker install --vscode      # or --cursor, --windsurf
```

🔌 Claude Code Plugin

For Claude Code CLI users — adds auto-sync hooks and slash commands:

```bash
/plugin install codeseeker@github:jghiringhelli/codeseeker#plugin
```

Slash commands: /codeseeker:init, /codeseeker:reindex

☁️ Devcontainers / GitHub Codespaces

```json
{
  "name": "My Project",
  "image": "mcr.microsoft.com/devcontainers/javascript-node:18",
  "postCreateCommand": "npm install -g codeseeker && codeseeker install --vscode"
}
```

✅ Verify

Ask your AI assistant: "What CodeSeeker tools do you have?"

You should see: search, analyze, index — CodeSeeker's three tools.

Advanced Installation Options

<details> <summary><b>📋 MCP Configuration by client</b></summary>

The MCP config JSON is the same for all clients — only the file location differs:

| Client | Config file |
| --- | --- |
| VS Code (Claude Code / Copilot) | .vscode/mcp.json in your project, or ~/.vscode/mcp.json globally |
| Cursor | .cursor/mcp.json in your project |
| Claude Desktop | ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows) |
| Windsurf | .windsurf/mcp.json in your project |

```json
{
  "mcpServers": {
    "codeseeker": {
      "command": "npx",
      "args": ["-y", "codeseeker", "serve", "--mcp"]
    }
  }
}
```

</details>

<details> <summary><b>🖥️ CLI Standalone Usage</b> (without AI assistant)</summary>

```bash
npm install -g codeseeker
cd your-project
codeseeker init
codeseeker -c "how does authentication work in this project?"
```

</details>

What You Get

Once configured, Claude has access to these MCP tools (used automatically):

| Tool | Actions / Usage | What It Does |
| --- | --- | --- |
| search | {query} | Hybrid search: vector + BM25 text + path-match, fused with RRF; RAPTOR directory summaries surface for abstract queries |
| search | {query, search_type: "graph"} | Hybrid search + Graph RAG: follows import/call/extends edges to surface structurally connected files |
| search | {query, search_type: "vector"} | Pure embedding cosine-similarity search (no BM25 or path scoring) |
| search | {query, search_type: "fts"} | Pure BM25 text search with CamelCase tokenisation and synonym expansion |
| search | {query, read: true} | Search + read file contents in one step |
| search | {filepath} | Read a file with its related code automatically included |
| analyze | {action: "dependencies", filepath} | Traverse the knowledge graph (imports, calls, extends) |
| analyze | {action: "standards"} | Your project's detected patterns (validation, error handling) |
| analyze | {action: "duplicates"} | Find duplicate/similar code blocks across your codebase |
| analyze | {action: "dead_code"} | Detect unused exports, functions, and classes |
| index | {action: "init", path} | Manually trigger indexing (rarely needed) |
| index | {action: "sync", changes} | Update index for specific files |
| index | {action: "exclude", paths} | Dynamically exclude/include files from the index |
| index | {action: "status"} | List indexed projects with file/chunk counts |

You don't invoke these manually—Claude uses them automatically when searching code or analyzing relationships.

How Indexing Works

You don't need to manually index. When Claude uses any CodeSeeker tool, the tool automatically checks if the project is indexed. If not, it indexes on first use.

```text
User: "Find the authentication logic"
        │
        ▼
┌─────────────────────────────────────┐
│ Claude calls search({query: ...})  │
│         │                           │
│         ▼                           │
│ Project indexed? ──No──► Index now  │
│         │                  (auto)   │
│        Yes                   │      │
│         │◀───────────────────┘      │
│         ▼                           │
│ Return search results               │
└─────────────────────────────────────┘
```

First search on a new project takes 30 seconds to several minutes (depending on size). Subsequent searches are instant.


Search Quality Research

<details> <summary><b>📊 Component ablation study (v2.0.0)</b> — measured impact of each retrieval layer</summary>

Setup

18 hand-labelled queries across two real-world codebases:

| Corpus | Language | Files | Queries | Query types |
| --- | --- | --- | --- | --- |
| Conclave | TypeScript (pnpm monorepo) | 201 | 10 | Symbol lookup, cross-file chains, out-of-scope |
| ImperialCommander2 | C# / Unity | 199 | 8 | Class lookup, controller wiring, file I/O |

Each query has one or more mustFind targets (exact file basenames) and optional mustNotFind targets (scope leak check). Queries were run on a real index built from source — real Xenova embeddings, real graph, real RAPTOR L2 nodes — to reflect production conditions.

Metrics: MRR (Mean Reciprocal Rank), P@1 and P@3 (Precision at 1 and 3), R@5 (Recall at 5), F1@3.
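
For readers unfamiliar with these metrics, the sketch below computes them using their standard textbook definitions. The benchmark script's exact definitions are not given here, so treat this as an assumption; the type `LabelledQuery`, the helper names, and the sample data are all hypothetical.

```typescript
interface LabelledQuery {
  ranked: string[];   // basenames returned by search, best first
  mustFind: string[]; // hand-labelled target basenames
}

function isRelevant(q: LabelledQuery, file: string): boolean {
  return q.mustFind.indexOf(file) !== -1;
}

// 1 / rank of the first relevant result, or 0 if none was returned.
function reciprocalRank(q: LabelledQuery): number {
  for (let i = 0; i < q.ranked.length; i++) {
    if (isRelevant(q, q.ranked[i])) return 1 / (i + 1);
  }
  return 0;
}

// Fraction of the top-k slots occupied by relevant files.
function precisionAtK(q: LabelledQuery, k: number): number {
  return q.ranked.slice(0, k).filter((f) => isRelevant(q, f)).length / k;
}

// Fraction of the mustFind targets that appear in the top k.
function recallAtK(q: LabelledQuery, k: number): number {
  const head = q.ranked.slice(0, k);
  const found = q.mustFind.filter((f) => head.indexOf(f) !== -1).length;
  return q.mustFind.length === 0 ? 0 : found / q.mustFind.length;
}

// Harmonic mean of precision@k and recall@k for one query.
function f1AtK(q: LabelledQuery, k: number): number {
  const p = precisionAtK(q, k);
  const r = recallAtK(q, k);
  return p + r === 0 ? 0 : (2 * p * r) / (p + r);
}

function average(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical labelled queries, one target each.
const labelled: LabelledQuery[] = [
  { ranked: ["dag-engine.ts", "types.ts", "index.ts"], mustFind: ["dag-engine.ts"] },
  { ranked: ["index.ts", "utils.ts", "orchestrator.ts"], mustFind: ["orchestrator.ts"] },
];
const mrr = average(labelled.map(reciprocalRank));
const p1 = average(labelled.map((q) => precisionAtK(q, 1)));
const r5 = average(labelled.map((q) => recallAtK(q, 5)));
const f1 = average(labelled.map((q) => f1AtK(q, 3)));
```

Under these definitions, a query with a single mustFind target can never score above 1/3 on P@3, so low P@3 values alongside high R@5 values are expected rather than alarming.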

Ablation results

| Configuration | MRR | P@1 | P@3 | R@5 | F1@3 | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Hybrid baseline (BM25 + embed + RAPTOR, no graph) | 75.2% | 61.1% | 29.6% | 91.7% | 44.4% | Production default |
| + graph 1-hop | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | ±0% ranking, adds structural neighbors |
| + graph 2-hop | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | Scope leaks on unrelated queries |
| No RAPTOR (graph 1-hop) | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | RAPTOR contributes +0.3% |

What each layer actually does

BM25 + embedding fusion (RRF)
The workhorse. Handles ~94% of ranking quality on its own. BM25 catches exact symbol names and camelCase tokens; vector embeddings catch semantic similarity when names differ. Fused with Reciprocal Rank Fusion to combine both signals without manual weight tuning.
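
To make the camelCase point concrete, here is one possible way to split identifiers into tokens before they reach a BM25 index. It is a sketch under assumptions: the function name `tokenizeIdentifier` and the exact splitting rules are illustrative and may differ from what CodeSeeker does internally.

```typescript
// Split identifiers and paths into lowercase tokens so that a query term
// such as "refresh" can match `refreshJwtToken` or `JWTRefreshHandler`.
function tokenizeIdentifier(text: string): string[] {
  return text
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")    // refreshJwt -> refresh Jwt
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2") // JWTRefresh -> JWT Refresh
    .split(/[^A-Za-z0-9]+/)                    // break on ".", "-", "_", "/", spaces
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase());
}

const camelTokens = tokenizeIdentifier("refreshJwtToken");
const acronymTokens = tokenizeIdentifier("JWTRefreshHandler");
const pathTokens = tokenizeIdentifier("dag-engine.ts");
```

With tokens like these, an exact-match ranker can hit on individual words inside a symbol name, while the embedding side covers cases where the names share no words at all.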

RAPTOR (hierarchical directory summaries)
Generates per-directory embedding nodes by mean-pooling all file embeddings in a folder. Acts as a post-filter: when a directory summary scores ≥ 0.5 against the query, results are narrowed to that directory's files. Measured contribution: +0.3% MRR on symbol queries. Fires conservatively — only when the directory is an obvious match. Its real value is on abstract queries ("what does the payments module do?") which don't appear in this benchmark; for those queries it prevents broad scattering across the entire codebase.
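
The mean-pooling and post-filter behaviour described above can be sketched as follows. This assumes the directory score is a cosine similarity between the query embedding and the pooled directory embedding, which the text does not state explicitly; all names (`meanPool`, `raptorCascade`, `ScoredFile`) and the toy 2-dimensional vectors are hypothetical.

```typescript
type Embedding = number[];

// Average the file embeddings of one directory, dimension by dimension.
function meanPool(vectors: Embedding[]): Embedding {
  if (vectors.length === 0) throw new Error("cannot pool an empty directory");
  const pooled: number[] = vectors[0].map(() => 0);
  for (const v of vectors) {
    for (let d = 0; d < pooled.length; d++) pooled[d] += v[d] / vectors.length;
  }
  return pooled;
}

function cosine(a: Embedding, b: Embedding): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface ScoredFile {
  path: string;
  score: number;
}

// Narrow hits to the best-matching directory only when its summary clears
// the threshold; otherwise pass every hit through unchanged.
function raptorCascade(
  hits: ScoredFile[],
  query: Embedding,
  directorySummaries: Record<string, Embedding>,
  threshold: number = 0.5
): ScoredFile[] {
  let bestDir = "";
  let bestScore = -Infinity;
  for (const dir of Object.keys(directorySummaries)) {
    const s = cosine(query, directorySummaries[dir]);
    if (s > bestScore) {
      bestScore = s;
      bestDir = dir;
    }
  }
  if (bestScore < threshold) return hits;
  return hits.filter((h) => h.path.indexOf(bestDir + "/") === 0);
}

// Toy data: two directories with two file embeddings each.
const summaries: Record<string, Embedding> = {
  auth: meanPool([[1, 0], [0.8, 0.2]]),
  billing: meanPool([[0, 1], [0.2, 0.8]]),
};
const sampleHits: ScoredFile[] = [
  { path: "auth/jwt.ts", score: 0.9 },
  { path: "billing/invoice.ts", score: 0.4 },
];
const narrowed = raptorCascade(sampleHits, [1, 0], summaries);
```

Raising the threshold makes the filter fire less often, which is the conservative behaviour the paragraph describes: an unclear directory match leaves the result list untouched.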

Knowledge graph (import/dependency edges)
Average connectivity: 20.8 file→file edges per node across both TS and C# codebases. Measured ranking impact: ±0% MRR for 1-hop expansion. The graph doesn't move MRR because the semantic layer already finds the right files — the graph's neighbors are usually already in the top-15. Its value is structural: the analyze dependencies action and explicit graph search type give Claude traversable import chains, inheritance hierarchies, and dependency paths that embeddings alone cannot provide.
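
A 1-hop expansion of the kind measured here might look like the sketch below. It assumes a plain adjacency list for the graph and uses the 0.7 neighbor factor and top-10 seed count from the pipeline description; the names `expandOneHop`, `RankedFile`, and `DependencyGraph` are hypothetical, and how CodeSeeker merges scores for files that are already ranked is an assumption.

```typescript
interface RankedFile {
  path: string;
  score: number;
}

// Adjacency list: file -> files it reaches via IMPORTS / CALLS / EXTENDS edges.
type DependencyGraph = Record<string, string[]>;

function expandOneHop(
  hits: RankedFile[],
  graph: DependencyGraph,
  topN: number = 10,
  decay: number = 0.7
): RankedFile[] {
  const best: Record<string, number> = Object.create(null);
  for (const h of hits) {
    if (best[h.path] === undefined || h.score > best[h.path]) best[h.path] = h.score;
  }
  // Only the top-N hits act as seeds for graph traversal.
  const seeds = hits.slice().sort((a, b) => b.score - a.score).slice(0, topN);
  for (const seed of seeds) {
    const neighbors = Object.prototype.hasOwnProperty.call(graph, seed.path)
      ? graph[seed.path]
      : [];
    for (const n of neighbors) {
      const inherited = seed.score * decay;
      // Keep the better of the neighbor's own score and the inherited one.
      if (best[n] === undefined || inherited > best[n]) best[n] = inherited;
    }
  }
  return Object.keys(best)
    .map((path) => ({ path: path, score: best[path] }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical search hits and import edges.
const searchHits: RankedFile[] = [
  { path: "auth/jwt.ts", score: 0.94 },
  { path: "auth/refresh.ts", score: 0.89 },
];
const importGraph: DependencyGraph = {
  "auth/jwt.ts": ["auth/token-store.ts", "auth/refresh.ts"],
};
const expanded = expandOneHop(searchHits, importGraph);
```

In this toy run, `auth/refresh.ts` keeps its own score because the inherited one is lower, and only `auth/token-store.ts` is new. That mirrors the observation above: when neighbors are already ranked, expansion adds structure without reordering the top results.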

Type boost / penalty scoring
Source files get +0.10 score boost; test files get −0.15 penalty; lock files and docs get −0.05 penalty. Without this, integration.test.ts would rank above dag-engine.ts for exact symbol queries because test files import and exercise every symbol in the source. The penalty corrects this without eliminating test files from results.
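
The effect of these adjustments is easy to see with a small example. In the sketch below, the boost and penalty values come from the paragraph above, while the path patterns used to classify files, the names `classifyPath` and `adjustScore`, and the base scores are assumptions made for illustration.

```typescript
type FileKind = "test" | "docOrLock" | "source" | "other";

// Path patterns are illustrative guesses, not CodeSeeker's actual rules.
function classifyPath(path: string): FileKind {
  if (/\.(test|spec)\.[a-z]+$/i.test(path)) return "test";
  if (/\.(md|lock)$/i.test(path) || /package-lock\.json$/.test(path)) return "docOrLock";
  if (/\.(ts|tsx|js|jsx|py|java|cs|go|rs)$/i.test(path)) return "source";
  return "other";
}

// Adjustment values as stated in the text.
const ADJUSTMENT: Record<FileKind, number> = {
  source: 0.1,
  test: -0.15,
  docOrLock: -0.05,
  other: 0,
};

function adjustScore(path: string, baseScore: number): number {
  return baseScore + ADJUSTMENT[classifyPath(path)];
}

// Hypothetical raw scores where the test file starts ahead of the source file.
const testScore = adjustScore("integration.test.ts", 0.8);
const sourceScore = adjustScore("dag-engine.ts", 0.72);
```

With these made-up base scores, the test file drops to roughly 0.65 and the source file rises to roughly 0.82, so the definition site ranks first while the test file still appears in the results.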

Monorepo directory exclusion fix
The single highest-impact change in v1.12.0: removing packages/ from the default exclusion list. For pnpm/yarn/lerna monorepos where all source lives under packages/, this exclusion was silently dropping all source files. Effect: 10% → 72% MRR on the Conclave monorepo benchmark.

Known limitations

| Query | Target | Issue | Root cause |
| --- | --- | --- | --- |
| cv-prompts | orchestrator.ts | rank 97+ even with 2-hop graph | prompt-builder.test.ts outscores prompt-builder.ts semantically; source file never enters top-10, so we can't graph-walk from it to orchestrator.ts. Test-file dominance on cross-file queries. |
| cv-exec-mode | types.ts | rank 11–12 | types.ts is a pure type-export file; low keyword density. Found within R@5 (rank ≤ 15). |

Benchmark script

Reproduce with:

```bash
npm run build
node scripts/real-bench.js
```

Requires C:\workspace\claude\conclave and C:\workspace\ImperialCommander2 to be present locally (or update paths in scripts/real-bench.js).

</details>

Auto-Detected Coding Standards

CodeSeeker analyzes your codebase and extracts patterns:

```json
{
  "validation": {
    "email": {
      "preferred": "z.string().email()",
      "usage_count": 12,
      "files": ["src/auth.ts", "src/user.ts"]
    }
  },
  "react-patterns": {
    "state": {
      "preferred": "useState<T>()",
      "usage_count": 45
    }
  }
}
```

Detected pattern categories:

  • validation: Zod, Yup, Joi, validator.js, custom regex
  • error-handling: API error responses, try-catch patterns, custom Error classes
  • logging: Console, Winston, Bunyan, structured logging
  • testing: Jest/Vitest setup, assertion patterns
  • react-patterns: Hooks (useState, useEffect, useMemo, useCallback, useRef)
  • state-management: Redux Toolkit, Zustand, React Context, TanStack Query
  • api-patterns: Fetch, Axios, Express routes, Next.js API routes

When Claude writes new code, it follows your existing conventions instead of inventing new ones.

Managing Index Exclusions

If Claude notices files that shouldn't be indexed (like Unity's Library folder, build outputs, or generated files), it can dynamically exclude them:

```javascript
// Exclude Unity Library folder and generated files
index({
  action: "exclude",
  project: "my-unity-game",
  paths: ["Library/**", "Temp/**", "*.generated.cs"],
  reason: "Unity build artifacts"
})
```

Exclusions are persisted in .codeseeker/exclusions.json and automatically respected during reindexing.

Code Cleanup Tools

CodeSeeker helps you maintain a clean codebase by finding duplicate code and detecting dead code.

Finding Duplicate Code

Ask Claude to find similar code blocks that could be consolidated:

```text
"Find duplicate code in my project"
"Are there any similar functions that could be merged?"
"Show me copy-pasted code that should be refactored"
```

CodeSeeker uses vector similarity to find semantically similar code—not just exact matches. It detects:

  • Copy-pasted functions with minor variations
  • Similar validation logic across files
  • Repeated patterns that could be extracted into utilities
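
Vector-similarity duplicate detection can be sketched as a pairwise comparison of chunk embeddings. This is a simplified illustration, not CodeSeeker's implementation: the 0.9 threshold, the names `findDuplicatePairs` and `CodeChunk`, and the 3-dimensional sample embeddings are all assumptions.

```typescript
interface CodeChunk {
  id: string;
  embedding: number[];
}

interface DuplicatePair {
  a: string;
  b: string;
  similarity: number;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Compare every pair of chunks. This is O(n^2), which is fine for a sketch
// but would need an approximate nearest-neighbor index on a large codebase.
function findDuplicatePairs(chunks: CodeChunk[], threshold: number = 0.9): DuplicatePair[] {
  const pairs: DuplicatePair[] = [];
  for (let i = 0; i < chunks.length; i++) {
    for (let j = i + 1; j < chunks.length; j++) {
      const similarity = cosineSimilarity(chunks[i].embedding, chunks[j].embedding);
      if (similarity >= threshold) {
        pairs.push({ a: chunks[i].id, b: chunks[j].id, similarity: similarity });
      }
    }
  }
  return pairs.sort((x, y) => y.similarity - x.similarity);
}

// Hypothetical chunks: two near-identical validators and one unrelated helper.
const sampleChunks: CodeChunk[] = [
  { id: "auth.ts#validateEmail", embedding: [0.9, 0.1, 0.4] },
  { id: "user.ts#validateEmail", embedding: [0.88, 0.12, 0.42] },
  { id: "date.ts#formatDate", embedding: [0.1, 0.95, 0.2] },
];
const duplicates = findDuplicatePairs(sampleChunks);
```

Because the comparison works on embeddings rather than text, two functions with renamed variables or reordered statements can still be flagged as a pair.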

Finding Dead Code

Ask Claude to identify unused code that can be safely removed:

```text
"Find dead code in this project"
"What functions are never called?"
"Show me unused exports"
```

CodeSeeker analyzes the knowledge graph to find:

  • Exported functions/classes that are never imported
  • Internal functions with no callers
  • Orphaned files with no incoming dependencies
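
The first and third checks in the list above can be expressed directly over an import graph. The sketch below assumes a simplified module record with named exports and imports; `ModuleNode`, `findUnusedExports`, and `findOrphanFiles` are hypothetical names, and real code would also need to handle re-exports, dynamic imports, and entry points declared in package metadata.

```typescript
interface ImportEdge {
  from: string;    // path of the imported file
  names: string[]; // symbols imported from it
}

interface ModuleNode {
  path: string;
  exports: string[];
  imports: ImportEdge[];
}

// Exports that no module in the project imports by name.
function findUnusedExports(modules: ModuleNode[]): string[] {
  const used: Record<string, boolean> = Object.create(null);
  for (const m of modules) {
    for (const edge of m.imports) {
      for (const name of edge.names) used[edge.from + "#" + name] = true;
    }
  }
  const unused: string[] = [];
  for (const m of modules) {
    for (const name of m.exports) {
      if (!used[m.path + "#" + name]) unused.push(m.path + "#" + name);
    }
  }
  return unused;
}

// Files with no incoming import edge, excluding known entry points.
function findOrphanFiles(modules: ModuleNode[], entryPoints: string[]): string[] {
  const imported: Record<string, boolean> = Object.create(null);
  for (const m of modules) {
    for (const edge of m.imports) imported[edge.from] = true;
  }
  return modules
    .map((m) => m.path)
    .filter((p) => !imported[p] && entryPoints.indexOf(p) === -1);
}

// Hypothetical three-file project.
const sampleModules: ModuleNode[] = [
  { path: "src/index.ts", exports: [], imports: [{ from: "src/auth.ts", names: ["login"] }] },
  { path: "src/auth.ts", exports: ["login", "legacyLogin"], imports: [] },
  { path: "src/utils/legacy-helper.ts", exports: ["pad"], imports: [] },
];
const unusedExports = findUnusedExports(sampleModules);
const orphanFiles = findOrphanFiles(sampleModules, ["src/index.ts"]);
```

Results from a check like this are candidates for review rather than proof that code is dead, since usage through reflection or external consumers is invisible to a static import graph.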

Example workflow:

```text
User: "Use CodeSeeker to clean up this project"

Claude: I'll analyze your codebase for cleanup opportunities.

Found 3 duplicate code blocks:
- validateEmail() in auth.ts and user.ts (92% similar)
- formatDate() appears in 4 files with minor variations
- Error handling pattern repeated in api/*.ts

Found 2 dead code files:
- src/utils/legacy-helper.ts (0 imports)
- src/services/unused-service.ts (exported but never imported)

Would you like me to:
1. Consolidate the duplicate validators into a shared utility?
2. Remove the dead code files?
```

Language Support

| Language | Parser | Relationship Extraction |
| --- | --- | --- |
| TypeScript/JavaScript | Babel AST | Excellent |
| Python | Tree-sitter | Excellent |
| Java | Tree-sitter | Excellent |
| C# | Regex | Good |
| Go | Regex | Good |
| Rust, C/C++, Ruby, PHP | Regex | Basic |

Tree-sitter parsers install automatically when needed.

Keeping the Index in Sync

With Claude Code Plugin

The plugin installs hooks that automatically update the index:

| Event | What Happens |
| --- | --- |
| Claude edits a file | Index updated automatically |
| Claude runs git pull/checkout/merge | Full reindex triggered |
| You run /codeseeker:reindex | Manual full reindex |

You don't need to do anything—the plugin handles sync automatically.

With MCP Server Only (Cursor, Claude Desktop)

  • Claude-initiated changes: Claude can call the index tool with {action: "sync", changes} to update the affected files
  • Manual changes: Not automatically detected—ask Claude to reindex periodically

Sync Summary

| Setup | Claude Edits | Git Operations | Manual Edits |
| --- | --- | --- | --- |
| Plugin (Claude Code) | Auto | Auto | Manual |
| MCP (Cursor, Desktop) | Ask Claude | Ask Claude | Ask Claude |
| CLI | Auto | Auto | Manual |

When CodeSeeker Helps Most

Good fit:

  • Large codebases (10K+ files) where Claude struggles to find relevant code
  • Projects with established patterns you want Claude to follow
  • Complex dependency chains across multiple files
  • Teams wanting consistent AI-generated code

Less useful:

  • Greenfield projects with little existing code
  • Single-file scripts
  • Projects where you're actively changing architecture

Architecture

```text
┌──────────────────────────────────────────────────────────┐
│                     Claude Code                          │
│                         │                                │
│                    MCP Protocol                          │
│                         │                                │
│  ┌──────────────────────▼──────────────────────────┐    │
│  │              CodeSeeker MCP Server               │    │
│  │  ┌─────────────┬─────────────┬────────────────┐ │    │
│  │  │   Vector    │  Knowledge  │    Coding      │ │    │
│  │  │   Search    │    Graph    │   Standards    │ │    │
│  │  │  (SQLite)   │  (SQLite)   │   (JSON)       │ │    │
│  │  └─────────────┴─────────────┴────────────────┘ │    │
│  └─────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────┘
```

All data stored locally in .codeseeker/. No external services required.

For large teams (100K+ files, shared indexes), server mode supports PostgreSQL + Neo4j. See Storage Documentation.

For the complete technical internals — exact scoring formulas, MCP tool schema, graph edge types, RAPTOR threshold logic, pipeline stages, analysis confidence tiers — see the Technical Architecture Manual.

Troubleshooting

MCP server not connecting

  1. Verify npm and npx work: npx -y codeseeker --version
  2. Check MCP config file syntax (valid JSON, no trailing commas)
  3. Restart your editor/Claude application completely
  4. Check that Node.js is installed: node --version (need v18+)
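
Step 2 can be checked mechanically. The sketch below parses a config string with strict `JSON.parse` (which rejects trailing commas) and verifies the `mcpServers.codeseeker` shape shown in the Installation section. The function name `checkMcpConfig` is hypothetical, and the script is not part of CodeSeeker.

```typescript
// Returns a list of human-readable problems; an empty list means the text
// parsed and contains a "codeseeker" server entry of the expected shape.
function checkMcpConfig(text: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text); // strict JSON: no trailing commas, no comments
  } catch (err) {
    return ["invalid JSON: " + (err instanceof Error ? err.message : String(err))];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return ["top level must be a JSON object"];
  }
  const servers = (parsed as { mcpServers?: unknown }).mcpServers;
  if (typeof servers !== "object" || servers === null) {
    return ['missing "mcpServers" object'];
  }
  const entry = (servers as { codeseeker?: unknown }).codeseeker;
  if (typeof entry !== "object" || entry === null) {
    return ['missing "mcpServers.codeseeker" entry'];
  }
  const problems: string[] = [];
  const fields = entry as { command?: unknown; args?: unknown };
  if (typeof fields.command !== "string") problems.push('"command" must be a string');
  if (!Array.isArray(fields.args)) problems.push('"args" must be an array');
  return problems;
}

const goodConfig = JSON.stringify({
  mcpServers: {
    codeseeker: { command: "npx", args: ["-y", "codeseeker", "serve", "--mcp"] },
  },
});
const badConfig = '{ "mcpServers": { "codeseeker": { "command": "npx", } } }';

const goodProblems = checkMcpConfig(goodConfig);
const badProblems = checkMcpConfig(badConfig);
```

To use it on a real file, read the config from the location listed for your client and pass its contents to `checkMcpConfig`; any returned message points at what to fix before restarting the editor.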

Indexing seems slow

First-time indexing of large projects (50K+ files) can take 5+ minutes. Subsequent uses are instant.

Tools not appearing in Claude

  1. Ask Claude: "What CodeSeeker tools do you have?"
  2. If no tools appear, check MCP config file exists and has correct syntax
  3. Restart your IDE completely (not just reload window)
  4. Check Claude/Copilot MCP connection status in IDE

Still stuck?

Open an issue: GitHub Issues

Documentation

Supported Platforms

| Client | MCP Support | Config |
| --- | --- | --- |
| Claude Code (VS Code) | Yes | .vscode/mcp.json or plugin |
| GitHub Copilot (VS Code 1.99+) | Yes | .vscode/mcp.json |
| Cursor | Yes | .cursor/mcp.json |
| Windsurf | Yes | .windsurf/mcp.json |
| Claude Desktop | Yes | claude_desktop_config.json |
| Visual Studio | Yes | codeseeker install --vs |

Claude Code and GitHub Copilot share the same .vscode/mcp.json — configure once, works for both.

Support

If CodeSeeker is useful to you, consider sponsoring the project.

License

MIT License. See LICENSE.


CodeSeeker gives Claude the code understanding that grep and embeddings alone can't provide.
