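What is Symdex?
Index and search a codebase using a structured schema. It supports in-depth code analysis, audits of domain-specific and security functions, and a quick grasp of overall structure and patterns.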
Symdex-100
<div align="center">
Symdex-100 - your AI companion for code exploration
</div>
Semantic fingerprints for 100x faster Python code search.
Symdex-100 generates compact, structured metadata ("Cyphers") for every function in your Python codebase. Each Cypher is a 20-byte semantic fingerprint that enables sub-second, intent-based code search for developers and AI agents — without reading thousands of lines of code.
# Your Python function → Indexed automatically
async def validate_user_token(token: str, user_id: int) -> bool:
"""Verify JWT token for a specific user."""
# ... implementation ...
# Natural language search → Sub-second results
$ symdex search "where do we validate user tokens"
──────────────────────────────────────────────────────────────────────────────
SYMDEX — 1 result in 0.0823 seconds
──────────────────────────────────────────────────────────────────────────────
#1 validate_user_token (Python)
────────────────────────────────────────────────────────────────────────────
File : /project/auth/tokens.py
Lines : 42–67
Cypher : SEC:VAL_TOKEN--ASY
Score : 24.5
42 │ async def validate_user_token(token: str, user_id: int) -> bool:
43 │ """Verify JWT token for a specific user."""
44 │ if not token:
45 │ return False
The Problem
Traditional code search methods scale poorly on large codebases:
| Approach | Limitation | Token Cost (AI agents) |
|---|---|---|
| grep | Keyword noise — finds "token" in comments, strings, variable names | 3,000+ tokens (read all matches) |
| Full-text search | No semantic understanding — can't distinguish intent | 5,000+ tokens (read 10 files) |
| Embeddings | Opaque, expensive, query-time overhead | 2,000+ tokens (re-rank results) |
| AST/LSP | Limited to structural queries (class/function names) | N/A (doesn't understand "what validates X") |
Result: Developers waste time reading irrelevant code. AI agents burn tokens on noise.
The Solution: Semantic Fingerprints
Symdex-100 solves this with Cypher-100, a structured metadata format that encodes function semantics in 20 bytes:
Anatomy of a Cypher-100 String
Each Cypher follows a strict four-slot hierarchy designed for both machine filtering and human readability:
┌─────────────────────────────────────────────────────────────┐
│ │
│ DOM : ACT _ OBJ -- PAT │
│ │ │ │ │ │
│ Domain Action Object Pattern │
│ │
│ Where does What does What is How does │
│ this live? it do? the target? it run? │
│ │
└─────────────────────────────────────────────────────────────┘
Formal specification:
$$ \text{Cypher} = \text{DOM} : \text{ACT} _ \text{OBJ} \text{--} \text{PAT} $$
Where:
- DOM (Domain): Semantic namespace. SEC (Security), NET (Network), DAT (Data), SYS (System), LOG (Logging), UI (Interface), BIZ (Business), TST (Testing)
- ACT (Action): Primary operation. VAL (Validate), FET (Fetch), TRN (Transform), CRT (Create), SND (Send), SCR (Scrub), UPD (Update), AGG (Aggregate), FLT (Filter), DEL (Delete)
- OBJ (Object): Target entity. USER, TOKEN, DATASET, CONFIG, LOGS, REQUEST, JSON, EMAIL, DIR
- PAT (Pattern): Execution model. ASY (Async), SYN (Synchronous), REC (Recursive), GEN (Generator), DEC (Decorator), CTX (Context manager)
Example:
SEC:SCR_EMAIL--ASY
Translation: A security function that scrubs email data asynchronously.
Breakdown:
- SEC = Security domain
- SCR = Scrub action (sanitize/remove)
- EMAIL = Email object
- ASY = Asynchronous pattern
This 18-character string stands in for 2,000+ characters of function body for search purposes: roughly a 100:1 compression ratio, while keeping the domain, action, object, and execution pattern that the search relies on.
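As a rough illustration of the four-slot format, the sketch below splits a Cypher string into its slots with a regular expression. The `parse_cypher` helper and the `Cypher` tuple are hypothetical names for this example, not part of the Symdex API, and the allowed characters per slot are an assumption based on the codes listed above.

```python
import re
from typing import NamedTuple, Optional


class Cypher(NamedTuple):
    domain: str
    action: str
    obj: str
    pattern: str


# DOM:ACT_OBJ--PAT; each slot is assumed to be uppercase letters/digits or "*".
_CYPHER_RE = re.compile(
    r"^(?P<dom>[A-Z]+|\*):(?P<act>[A-Z]+|\*)_(?P<obj>[A-Z0-9]+|\*)--(?P<pat>[A-Z]+|\*)$"
)


def parse_cypher(text: str) -> Optional[Cypher]:
    """Return the four slots of a Cypher string, or None if it is malformed."""
    match = _CYPHER_RE.match(text.strip())
    if match is None:
        return None
    return Cypher(match["dom"], match["act"], match["obj"], match["pat"])


parsed = parse_cypher("SEC:SCR_EMAIL--ASY")
print(parsed)
```

The same helper accepts wildcard patterns such as `*:SCR_*--*`, which is convenient when a caller needs to tell a Cypher pattern apart from a natural-language query.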
Core Benefits
1. Search Speed
Problem: grep reads every file, full-text indexes scan every function.
Solution: Symdex searches 20-byte Cyphers in a SQLite B-tree index.
| Metric | Grep | Symdex (DB only) | Improvement |
|---|---|---|---|
| Data scanned per query | ~50MB (full codebase) | ~100KB (index) | 500x less I/O |
| Index lookup (5,000 functions) | 800ms | 8ms | 100x faster |
| Index size | N/A (no index) | 2MB | 25:1 compression |
Technical details:
- SQLite B-tree: O(log N) lookups with compound indexes on (cypher, tags, function_name)
- Tiered Cypher + multi-lane retrieval; candidate cap (default 200) keeps latency and result size bounded
- Incremental indexing: SHA256 hash tracking skips unchanged files
- Reported search time in CLI/API is index lookup only (excludes LLM translation for natural-language queries)
Result: Sub-second index lookup on 10,000+ function codebases.
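The incremental indexing mentioned above can be pictured as a hash comparison per file. The following is a minimal sketch, assuming a plain dictionary as the hash store; `file_sha256` and `changed_files` are hypothetical helpers, not Symdex functions, and the real index keeps its hashes in SQLite.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def file_sha256(path: Path) -> str:
    """Hash a file's bytes in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths: List[Path], known_hashes: Dict[str, str]) -> List[Path]:
    """Return files whose hash differs from the stored one, updating the store."""
    changed = []
    for path in paths:
        current = file_sha256(path)
        if known_hashes.get(str(path)) != current:
            known_hashes[str(path)] = current
            changed.append(path)
    return changed


# Demo on a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "auth.py"
    source.write_text("def validate(): ...\n")
    hashes: Dict[str, str] = {}
    first_pass = changed_files([source], hashes)   # new file: needs indexing
    second_pass = changed_files([source], hashes)  # unchanged: skipped
    source.write_text("def validate(): return True\n")
    third_pass = changed_files([source], hashes)   # edited: re-indexed
```

Only the files returned by `changed_files` would need a fresh Cypher, which is why unchanged files cost nothing on re-index.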
Search & call-graph enhancements: Use directory_scope to restrict results to a subtree (path = index root). Call-graph includes Celery .delay()/.apply_async() as task invocations. Filter or group results by Cypher domain/action (domain_filter, action_filter, group_by).
2. Search Accuracy
Problem: Single search strategies miss valid results (e.g., SYS:DEL_DIR won't find DAT:DEL_DIR if query specifies system domain), or return too many low-quality hits when the Cypher is too broad.
Solution: Tiered Cypher patterns plus always-on multi-lane search.
Tiered translation (natural-language queries): The LLM returns three Cypher patterns — tight (no wildcards), medium (minimal wildcards), broad (fallback). The engine queries the tight pattern first; if the candidate pool is too small, it runs the medium then broad pattern and merges (deduplicated). Results are scored against the tight pattern so precise matches rank highest.
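A minimal sketch of the tiered fallback described above, assuming the index is just a list of Cypher strings and that widening stops as soon as the pool reaches a minimum size. `tiered_candidates` and `min_pool` are hypothetical names, and `fnmatchcase` stands in for whatever wildcard matching the engine really uses.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def tiered_candidates(
    tiers: Sequence[str], indexed_cyphers: Sequence[str], min_pool: int = 3
) -> List[str]:
    """Query tight -> medium -> broad, widening only while the pool is too small."""
    pool: List[str] = []
    for pattern in tiers:
        for cypher in indexed_cyphers:
            if fnmatchcase(cypher, pattern) and cypher not in pool:
                pool.append(cypher)  # merge, deduplicated, order preserved
        if len(pool) >= min_pool:
            break  # enough candidates; skip the broader tiers
    return pool


# Hypothetical index contents and tiers for the query "validate token".
index = [
    "SEC:VAL_TOKEN--ASY",
    "SEC:VAL_TOKEN--SYN",
    "SEC:VAL_PASS--SYN",
    "UI:VAL_FORM--SYN",
    "DAT:FET_USER--SYN",
]
tiers = ["SEC:VAL_TOKEN--ASY", "SEC:VAL_TOKEN--*", "*:VAL_*--*"]

narrow = tiered_candidates(tiers, index, min_pool=1)  # tight tier is enough
wide = tiered_candidates(tiers, index, min_pool=3)    # falls through to broad
```

Because the tight tier's matches are added first, they stay at the front of the pool even when broader tiers contribute more candidates.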
Multi-lane retrieval (per pattern):
Query: "delete directory" → Tiered: [SYS:DEL_DIR--SYN, SYS:DEL_DIR--*, *:DEL_*--*]
↓
┌────────────────────────────────────────────────────────────┐
│ LANE 1: Exact Cypher │ SYS:DEL_DIR--SYN │
│ LANE 2: Domain wildcard │ *:DEL_DIR--SYN │
│ LANE 3: Action-only │ *:DEL_*--* │
│ LANE 4: Tag keywords │ delete, directory (capped) │
│ LANE 5: Function name │ _delete_directory_tree (capped)│
└────────────────────────────────────────────────────────────┘
↓
Merge + Cap candidates (default 200) + Score against tight pattern
↓
Ranked Results (exact match + domain/action/object = highest score)
Scoring: ACT (action) and OBJ (object) dominate — they encode what the function does and on what. Domain and pattern follow. Wrong domain (e.g. result is TST when query asked for BIZ) is penalized.
$$ \text{score} = 10[\text{exact}] + 6[\text{action}] + 5[\text{object}] + 4[\text{domain}] + 2[\text{pattern}] + 3[\text{name}] + 1.5[\text{tags}] - 3[\text{domain mismatch}] $$
Where $[\text{x}]$ is 1 if matched, 0 otherwise (with partial matching for names and object similarity).
Result: High precision from tiered + tight-pattern scoring; cross-domain recall when needed; fewer irrelevant results (candidate cap, Lane 3 skip, smaller tag/name limits).
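The weighted sum above can be sketched with plain binary indicators. This is a simplified illustration, not the engine's scorer: it assumes the tight pattern has no wildcards, omits the partial matching for names and object similarity, and takes name and tag matches as pre-computed flags. All function and constant names are hypothetical.

```python
from typing import Tuple

# Weights copied from the scoring formula above.
W_EXACT, W_ACTION, W_OBJECT, W_DOMAIN, W_PATTERN = 10.0, 6.0, 5.0, 4.0, 2.0
W_NAME, W_TAGS, W_DOMAIN_MISMATCH = 3.0, 1.5, 3.0


def split_cypher(cypher: str) -> Tuple[str, str, str, str]:
    """Split DOM:ACT_OBJ--PAT into (domain, action, object, pattern)."""
    domain, rest = cypher.split(":", 1)
    body, pattern = rest.split("--", 1)
    action, obj = body.split("_", 1)
    return domain, action, obj, pattern


def score(tight: str, candidate: str, name_hit: bool = False, tag_hit: bool = False) -> float:
    """Binary-indicator version of the weighted sum (no partial matching)."""
    q_dom, q_act, q_obj, q_pat = split_cypher(tight)
    c_dom, c_act, c_obj, c_pat = split_cypher(candidate)
    total = 0.0
    if tight == candidate:
        total += W_EXACT
    if q_act == c_act:
        total += W_ACTION
    if q_obj == c_obj:
        total += W_OBJECT
    if q_dom == c_dom:
        total += W_DOMAIN
    else:
        total -= W_DOMAIN_MISMATCH  # e.g. asked for BIZ, got TST
    if q_pat == c_pat:
        total += W_PATTERN
    if name_hit:
        total += W_NAME
    if tag_hit:
        total += W_TAGS
    return total


exact = score("SEC:VAL_TOKEN--ASY", "SEC:VAL_TOKEN--ASY")
cross_domain = score("SEC:VAL_TOKEN--ASY", "DAT:VAL_TOKEN--SYN")
```

In this simplified form an exact match scores 10 + 6 + 5 + 4 + 2 = 27, while a candidate with the right action and object but the wrong domain and pattern scores 6 + 5 - 3 = 8, which shows how action and object dominate the ranking.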
3. Token Efficiency (for AI Agents)
Problem: Agents waste 80-90% of context on reading irrelevant code when exploring large codebases.
Solution: Symdex provides a 50:1 token reduction via semantic search.
Scenario: Agent needs to find "function that validates user login credentials"
| Approach | Process | Tokens |
|---|---|---|
| Read 10 files | Agent guesses likely files → reads all → searches manually | ~5,000 |
| Grep + read | grep "login|credential" → read 20 matches → filter manually | ~3,000 |
| Symdex | search_codebase("validate login credentials") → 1 precise result | ~100 |
Token breakdown (Symdex approach):
- Query: 20 tokens
- MCP tool call overhead: 30 tokens
- Result (1 function, 5-line preview): 50 tokens
- Total: 100 tokens
Savings: 50x fewer tokens, zero false positives.
Why this matters:
- 200K context window → explore 50x more functions
- 90% reduction in API costs for code exploration
- Faster reasoning (less noise in context)
4. Noise Reduction
Problem: Keyword searches return false positives (e.g., "token" in variable names, comments, docstrings).
Solution: Semantic fingerprints distinguish intent from mention.
| Query | Grep (keyword) | Symdex (semantic) |
|---|---|---|
| "validate token" | 47 results (includes token = ..., # token expired, TOKEN_KEY) | 3 results (only functions that validate tokens) |
| "delete user" | 89 results (includes # delete user later, user.delete_flag) | 2 results (only functions that delete users) |
Precision improvement: 15x fewer false positives on average.
Use Cases & Best Practices
When to Use Symdex
✅ Use Symdex when:
- Finding code by intent — "where do we validate user passwords", "find the CSV parsing function", "which function sends email notifications"
- Onboarding to unfamiliar codebases: quickly map out architecture by domain (SEC:*_*--* for security functions, DAT:*_*--* for data processing)
- Code refactoring / impact analysis: find all functions that touch a specific object (*:*_USER--* for user-related operations)
- Tracing execution flow: use the call graph tools get_callers ("who calls X?"), get_callees ("what does X call?"), and trace_call_chain (recursive walk up or down). No manual grep or file hopping.
- Documentation generation: extract function summaries with semantic context (Cypher + first 5 lines of code)
- AI agent code exploration — 50x fewer tokens than reading files directly
❌ Don't use Symdex when:
- You know the exact file and line — Just read the file directly
- Simple string search — Use grep/IDE search for exact identifiers or literals
- Non-Python codebases — Currently Python-only (JS/TS/Go/Rust support planned)
- Extremely small projects (<50 functions) — Overhead of indexing outweighs benefits
How to Use Symdex Effectively
1. Tuning Search Results
Adjust context_lines for editing vs. reading:
# Default: 3 lines (quick preview for exploration)
client.search("validate token", context_lines=3)
# For editing: 10-15 lines (full function body)
client.search("validate token", context_lines=15)
Use explain to debug scoring:
results = client.search("validate token", explain=True)
for result in results:
print(f"Score: {result.score}")
print(f"Breakdown: {result.explanation}")
# Example: {'action_match': 6, 'object_match': 5, 'name_matches': {'exact': 1, 'score': 3}}
2. Search Strategies
Auto (default) — Fastest for most queries:
symdex search "validate token"
# Auto selects: LLM translation if available, else keyword fallback
LLM (force semantic) — Best for natural language:
client.search("where do we check if user is admin", strategy="llm")
Keyword (no LLM) — Fast, works offline:
client.search("delete user", strategy="keyword")
# Keyword-based translation: ~5ms vs. LLM: ~200-500ms
Direct (skip translation) — Use Cypher patterns:
client.search("SEC:VAL_*--ASY", strategy="direct")
# Zero translation overhead
3. Indexing Best Practices
Incremental indexing (default):
symdex index ./project
# Only re-processes changed files (SHA256 tracking)
Force re-index (after major refactors):
symdex index ./project --force
Monitor indexing (get summary):
result = client.index("./project")
print(result.summary)
# {'top_files': [{'file': 'auth.py', 'functions': 47}],
# 'domain_distribution': {'SEC': 23, 'DAT': 18, 'NET': 6}}
4. Call Graph (CLI)
After indexing, you can query the call graph from the command line:
# Who calls this function?
symdex callers add_cypher_entry
# What does this function call?
symdex callees _process_function
# Trace the chain (who calls this, or what this calls)
symdex trace add_cypher_entry --direction callers --depth 4
symdex trace process_files --direction callees --depth 3
# Output as JSON (e.g. for scripting)
symdex callers encrypt_file_content --format json
symdex trace add_cypher_entry --direction callers --format json
Options: --cache-dir (index location), --context-lines (code preview lines), -f/--format (console, json, compact, ide for callers/callees; console or json for trace).
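Conceptually, tracing a chain is a bounded breadth-first walk over caller/callee edges. The sketch below is one way to picture it, assuming the call graph is available as a list of (caller, callee) pairs; `trace_chain` and the edge list are made up for illustration and do not reflect Symdex's internal storage or its real call graph.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def trace_chain(
    edges: Iterable[Tuple[str, str]],
    start: str,
    direction: str = "callers",
    max_depth: int = 4,
) -> List[Tuple[int, str]]:
    """Breadth-first walk returning (depth, function) pairs, excluding start."""
    if direction not in ("callers", "callees"):
        raise ValueError("direction must be 'callers' or 'callees'")
    neighbours: Dict[str, List[str]] = {}
    for caller, callee in edges:
        if direction == "callers":
            neighbours.setdefault(callee, []).append(caller)  # walk upwards
        else:
            neighbours.setdefault(caller, []).append(callee)  # walk downwards
    seen = {start}
    queue = deque([(start, 0)])
    chain: List[Tuple[int, str]] = []
    while queue:
        name, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached; do not expand further
        for nxt in neighbours.get(name, []):
            if nxt not in seen:  # guards against cycles and recursion
                seen.add(nxt)
                chain.append((depth + 1, nxt))
                queue.append((nxt, depth + 1))
    return chain


# Hypothetical edges: (caller, callee)
edges = [
    ("main", "load_files"),
    ("load_files", "parse_function"),
    ("parse_function", "store_entry"),
    ("load_files", "store_entry"),
]
up = trace_chain(edges, "store_entry", direction="callers", max_depth=4)
down = trace_chain(edges, "load_files", direction="callees", max_depth=1)
```

The `seen` set keeps recursive or mutually recursive functions from looping forever, and `max_depth` plays the role of the `--depth` option.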
5. MCP Server (AI Agents)
Use context_lines for agent tasks:
// Exploration (default): 3 lines
await searchCodebase({ query: "validate token", context_lines: 3 });
// Editing task: 10+ lines
await searchCodebase({ query: "validate token", context_lines: 15 });
Prefer Symdex over file reading when:
- Searching for code by intent (not exact identifiers)
- You'd otherwise read 3+ files to find the right function
- Codebase has 200+ functions (indexing overhead paid off)
Use grep (or text search) when: You need an exhaustive list of every call site of an exact pattern (e.g. every User.objects.create / get_or_create). Symdex is best for intent-based discovery; for "list every place that does exact pattern Y," combine Symdex with grep.
Example agent workflow:
1. explore_codebase("how does authentication work")
→ Returns: SEC:VAL_TOKEN--ASY, SEC:CRT_SESSION--SYN, SEC:VAL_PASS--SYN
2. Read top result (SEC:VAL_TOKEN) with context_lines=15
3. Edit the function (now you have the right context)
Quick Start
Install
# Published package (once available on PyPI)
pip install symdex-100
# Local development (from source — see "Local Development" below)
pip install -e ".[all]"
Set API Key
# Anthropic (default, recommended)
export ANTHROPIC_API_KEY="sk-ant-..."
# Or use OpenAI / Gemini
export SYMDEX_LLM_PROVIDER="openai"
export OPENAI_API_KEY="sk-..."
Supports Anthropic Claude (default), OpenAI GPT, or Google Gemini.
CLI Usage
# Index a project
symdex index ./my-project
# Natural language search
symdex search "where do we validate user passwords"
# Direct Cypher (skip LLM translation)
symdex search "SEC:VAL_PASS--*"
# With pagination
symdex search "async email" -n 20 -p 5
# JSON output (for scripting)
symdex search "delete directory" --format json | jq '.[] | .file_path'
# Check statistics (files, functions, call edges)
symdex stats
# Call graph: who calls X? what does X call? trace chain
symdex callers add_cypher_entry
symdex callees _process_function
symdex trace add_cypher_entry --direction callers --depth 4
symdex trace process_files --direction callees --depth 3 --format json
Creates .symdex/index.db (SQLite). Source files are never modified.
Python API
Symdex can be used as a library in your own applications — no CLI needed.
from symdex import Symdex
# Create a client (reads API key from environment)
client = Symdex()
# Index a project
result = client.index("./my-project")
print(f"Indexed {result.functions_indexed} functions in {result.files_scanned} files")
# Search by intent
hits = client.search("validate user tokens", path="./my-project")
for hit in hits:
print(f" {hit.function_name} @ {hit.file_path}:{hit.line_start} [{hit.cypher}]")
# Search by Cypher pattern (no LLM needed)
hits = client.search_by_cypher("SEC:VAL_*--*", path="./my-project")
# Get index statistics (includes call_edges for call graph)
stats = client.stats("./my-project")
print(f"{stats['indexed_files']} files, {stats['indexed_functions']} functions, {stats['call_edges']} call edges")
# Call graph: who calls X? what does X call? trace execution flow
callers = client.get_callers("encrypt_file_content", path="./my-project")
callees = client.get_callees("process_files", path="./my-project")
chain = client.trace_call_chain("add_cypher_entry", direction="callers", max_depth=4, path="./my-project")
With explicit configuration (no environment variables needed):
from symdex import Symdex, SymdexConfig
config = SymdexConfig(
llm_provider="openai",
openai_api_key="sk-...",
openai_model="gpt-4o-mini",
max_search_results=10,
min_search_score=3.0,
)
client = Symdex(config=config)
Async support (for FastAPI, Django async views, etc.):
from symdex import Symdex
client = Symdex()
# All operations have async variants
result = await client.aindex("./my-project")
hits = await client.asearch("validate tokens", path="./my-project")
stats = await client.astats("./my-project")
callers = await client.aget_callers("encrypt_file_content", path="./my-project")
chain = await client.atrace_call_chain("process_files", direction="callees", path="./my-project")
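The roadmap and FAQ state that the async variants are built on `asyncio.to_thread`. The sketch below shows that general technique with a stand-in blocking function; `blocking_search` and this `asearch` wrapper are hypothetical and are not Symdex's own code. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time
from typing import List


def blocking_search(query: str) -> List[str]:
    """Stand-in for a synchronous search call (hypothetical)."""
    time.sleep(0.05)  # simulate blocking I/O such as a SQLite query
    return [f"result for {query!r}"]


async def asearch(query: str) -> List[str]:
    """Async wrapper: run the blocking call in a worker thread."""
    return await asyncio.to_thread(blocking_search, query)


async def main() -> List[List[str]]:
    # Both searches run concurrently; the event loop stays free meanwhile.
    return await asyncio.gather(asearch("validate tokens"), asearch("delete user"))


results = asyncio.run(main())
```

The trade-off of this approach is that each call still occupies a thread while it blocks, which is presumably why native async providers appear later on the roadmap.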
Error handling:
from symdex import Symdex, IndexNotFoundError, ConfigError
client = Symdex()
try:
hits = client.search("validate user")
except IndexNotFoundError:
print("Run client.index() first!")
except ConfigError:
print("Check your API key configuration")
Cypher Taxonomy Reference
Domains (DOM)
| Code | Domain | Example Functions |
|---|---|---|
| SEC | Security | validate_token, hash_password, encrypt_data |
| DAT | Data | fetch_user, transform_csv, aggregate_metrics |
| NET | Network | send_request, handle_webhook, fetch_api_data |
| SYS | System | delete_directory, check_disk_space, spawn_process |
| LOG | Logging | setup_logger, scrub_sensitive_logs, format_trace |
| UI | Interface | render_template, validate_form, format_output |
| BIZ | Business | calculate_discount, approve_order, check_eligibility |
| TST | Testing | mock_database, assert_response, generate_fixture |
Actions (ACT)
| Code | Action | Typical Use Cases |
|---|---|---|
| VAL | Validate | Input validation, schema checks, token verification |
| FET | Fetch | Database queries, API calls, file reads |
| TRN | Transform | Format conversion, data mapping, serialization |
| CRT | Create | Object instantiation, file creation, record insertion |
| SND | Send | Network requests, message queues, email dispatch |
| SCR | Scrub | Data sanitization, PII removal, log filtering |
| UPD | Update | Record modification, cache refresh, state change |
| AGG | Aggregate | Reduce operations, metrics collection, summaries |
| FLT | Filter | Query refinement, access control, data selection |
| DEL | Delete | Resource cleanup, record removal, file deletion |
Patterns (PAT)
| Code | Pattern | Description |
|---|---|---|
| ASY | Async | async def functions, promises, coroutines |
| SYN | Synchronous | Standard blocking functions |
| REC | Recursive | Self-calling functions, tree traversals |
| GEN | Generator | yield-based functions, iterators |
| DEC | Decorator | Function wrappers, middleware |
| CTX | Context Manager | with statements, resource management |
| CLS | Closure | Functions returning functions, lexical scope |
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ SYMDEX-100 ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Python Source (.py) │
│ │ │
│ ├─→ [AST Parser] ──→ Function Metadata │
│ │ (name, args, docstring, ...) │
│ │ │
│ └─→ [LLM] ──────────→ Cypher Generation │
│ SEC:VAL_TOKEN--ASY │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ .symdex/index.db (SQLite) │ │
│ ├─────────────────────────────────────────────────┤ │
│ │ • B-tree index on (cypher, tags, function_name)│ │
│ │ • SHA256 hash for incremental indexing │ │
│ │ • 100:1 compression vs full function bodies │ │
│ └─────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ MULTI-LANE SEARCH ENGINE │ │
│ ├─────────────────────────────────────────────────┤ │
│ │ Query → [LLM] → 3 Cypher patterns (tight/med/broad) │
│ │ ↓ Try tight first; merge medium/broad if needed │
│ │ 5 Lanes per pattern: Exact │ Domain* │ Act* │ Tags │ Name │
│ │ (Lane 3 skipped when redundant; tag/name capped) │
│ │ ↓ Candidate cap (e.g. 200) │
│ │ Score vs tight pattern → Rank → Format │
│ └─────────────────────────────────────────────────┘ │
│ ↓ │
│ Results (100x faster, 50x fewer tokens) │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Design Decisions:
- Python AST (not regex): Handles decorators, nested functions, edge cases
- Sidecar index (not inline): Source files stay pristine, no diffs
- Tiered Cypher (tight → medium → broad): LLM returns 3 patterns; try precise first, broaden only if needed — fewer irrelevant results
- Multi-lane search (per pattern): Exact, domain wildcard, action-only (when not redundant), tag/name (capped); candidate cap before scoring
- LLM + rule-based fallback: Semantic accuracy with deterministic backup
- SQLite B-tree: Zero-config, portable, O(log N) lookups
MCP Server (for AI Agents)
Symdex provides a full MCP (Model Context Protocol) server with tools, resources, and prompt templates so AI agents can search your codebase natively.
Setup (Cursor)
- Install (in this repo or your project): pip install -e ".[mcp]" so the symdex command is on your PATH.
- Index (optional but recommended): in your project root run symdex index . so search has data. Or use the MCP tool index_directory from the agent.
- Configure Cursor: create or edit .cursor/mcp_settings.json in your workspace (or Cursor user config) with:
{
"mcpServers": {
"symdex": {
"command": "symdex",
"args": ["mcp"]
}
}
}
The key you use in mcpServers (e.g. "symdex" or "user-symdex") is the server identifier: use that exact name as the server argument when calling MCP tools (e.g. call_mcp_tool(server="symdex", ...)). The display name "Symdex-100" is for UI only.
- Reload: Restart Cursor or run "MCP: Restart" so it starts the server. The server uses stdio by default (no port needed).
Test: Open a chat and ask the agent to run get_index_stats for . or search_codebase("validate user"); if the index exists you should get results.
If symdex is not on PATH (e.g. you use a venv and Cursor runs without it), set "command" to your Python and "args" to ["-m", "symdex.cli.main", "mcp"], or use the full path to the symdex executable (e.g. ".venv/bin/symdex" on Unix, ".venv\\Scripts\\symdex.exe" on Windows).
Available Tools
| Tool | Description |
|---|---|
| search_codebase(query, …) | Natural-language or Cypher search. Prefer a specific intent (e.g. "Django User model create"). Optional: directory_scope, domain_filter, action_filter, group_by. |
| search_by_cypher(cypher_pattern, …) | Direct Cypher lookup (no LLM). Optional: directory_scope, domain_filter, action_filter. |
| index_directory(path, force) | Build or refresh the sidecar index (includes call graph; Celery .delay()/.apply_async() → task edges). |
| get_index_stats(path) | File, function, and call_edges counts. |
| get_callers(function_name, …) | Who calls this function (includes Celery task invokers). Optional: directory_scope, domain_filter, action_filter. |
| get_callees(function_name, …) | What this function calls. Optional: directory_scope, domain_filter, action_filter. |
| trace_call_chain(function_name, …) | Trace callers (up) or callees (down). Optional: directory_scope, domain_filter, action_filter. |
| health() | Server status, provider, model info. |
Resources (read-only data)
| URI | Description |
|---|---|
| symdex://schema/domains | Domain codes and descriptions |
| symdex://schema/actions | Action codes and descriptions |
| symdex://schema/patterns | Pattern codes and descriptions |
| symdex://schema/full | Complete Cypher-100 schema with common object codes |
Prompt Templates
| Prompt | Description |
|---|---|
| find_security_functions(path) | Audit all security-related functions |
| audit_domain(domain, path) | Audit all functions in a specific domain |
| explore_codebase(path) | High-level architecture overview via domain stats |
Programmatic MCP Server Creation
from symdex.mcp.server import create_server
from symdex.core.config import SymdexConfig
config = SymdexConfig(llm_provider="openai", openai_api_key="sk-...")
server = create_server(config=config)
server.run(transport="stdio")
Agent workflow:
Agent: "I need to find the function that validates JWT tokens"
↓
[Tool Call] search_codebase("validate JWT token")
↓
Result: 1 function, 80 tokens (vs 5,000 tokens reading 10 files)
↓
Agent: "Now I know exactly where to look"
Token economics:
- Without Symdex: 5,000 tokens (read 10 files) → 10% success rate
- With Symdex: 100 tokens (precise search) → 95% success rate
- 50x token reduction, 9.5x higher accuracy
Performance Benchmarks
Indexing Performance
| Codebase Size | Files | Functions | Time (Anthropic) |
|---|---|---|---|
| Small | 100 | 500 | 45s |
| Medium | 500 | 2,500 | 3.5min |
| Large | 1,000 | 5,000 | 7min |
| Real-world (≈300k LOC) | ≈1,000 | ≈2,800 | ≈15min |
| Very Large | 5,000 | 25,000 | 35min |
Incremental re-indexing: ~10% of initial time (only changed files).
Search Performance
Reported time: The CLI and API report DB-only search time (multi-lane retrieval, scoring, context extraction). LLM translation for natural-language queries is not included.
Test setup (small index): 5,000 indexed functions, cold SQLite cache.
| Query Complexity | Grep | Symdex (DB only) | Speedup |
|---|---|---|---|
| Exact match | 450ms | 4ms | 112x |
| Wildcard | 780ms | 8ms | 97x |
| Multi-term | 1,200ms | 12ms | 100x |
| Natural language | N/A | 15ms + LLM | N/A (grep has no equivalent) |
Large codebase (≈2,800 functions, ≈458 indexed files):
| Query | Results | DB time | Note |
|---|---|---|---|
| "force delete data and directory of repository" | 208 | <1s | Multi-lane, direct-style pattern |
| "where does the AI model analyze for dependencies" | 76 | 0.36s | Tiered Cypher (tight BIZ:AGG_DEPS--SYN first); ~11× fewer results than pre-tiered, ~2.5× faster |
Query breakdown (Symdex):
- LLM translation: not included in reported time (one-time per query, ~1–3s depending on provider)
- Multi-lane retrieval: typically 50–400ms (depends on result count and candidate cap)
- Scoring + ranking: 1–5ms
- Context extraction: scales with result count
Result: Sub-second index lookup for typical queries; tiered patterns and candidate cap keep result sets focused and fast.
Advanced Usage
Configuration reference
All parameters, default values, and how to configure MCP defaults (e.g. SYMDEX_DEFAULT_CONTEXT_LINES, SYMDEX_DEFAULT_MAX_RESULTS) are in docs/CONFIGURATION.md.
Output Formats
# Rich console (default) — human-friendly
symdex search "validate password"
# JSON — for scripting/piping
symdex search "validate password" --format json | jq '.[] | .cypher'
# Compact — grep-like, one line per result
symdex search "validate password" --format compact
# IDE — file(line): format for editor integration
symdex search "validate password" --format ide
Direct Cypher Patterns
# All security functions
symdex search "SEC:*_*--*"
# Async data operations
symdex search "DAT:*_*--ASY"
# Functions that scrub/sanitize anything
symdex search "*:SCR_*--*"
# Recursive algorithms
symdex search "*:*_*--REC"
Pagination
# Interactive navigation for large result sets
symdex search "user" -n 50 -p 10
# Commands: [Enter] next, [b] back, [p] print, [j] json, [q] quit
Configuration
# Use OpenAI instead of Anthropic
export SYMDEX_LLM_PROVIDER=openai
export OPENAI_API_KEY="sk-..."
# Customize search scoring
export CYPHER_MIN_SCORE=7.0
# Increase concurrency (faster indexing, more API load)
export SYMDEX_MAX_CONCURRENT=10
Docker
For CLI usage, MCP in Docker, index-on-host vs remote URL, and publishing on Smithery, see docs/DOCKER.md.
Roadmap
v1.0 — Python Foundation
- ✅ Python AST-based extraction
- ✅ Multi-lane search with unified scoring
- ✅ SQLite sidecar index
- ✅ MCP server for AI agents
- ✅ Interactive CLI with pagination
- ✅ Sub-second search on 10K+ functions
v1.1 (Current) — Product-Grade API
- ✅ Instance-based SymdexConfig (replaces global config; multi-tenant safe)
- ✅ Symdex client facade: single entry point for programmatic use
- ✅ Async API (aindex, asearch, astats via asyncio.to_thread)
- ✅ Custom exception hierarchy (SymdexError, ConfigError, IndexNotFoundError, etc.)
- ✅ Lazy LLM initialization (search without API key for direct/keyword strategies)
- ✅ Rule-only mode (SYMDEX_CYPHER_FALLBACK_ONLY): no API key required
- ✅ IndexingPipeline.run() returns typed IndexResult
- ✅ No import-time side effects (safe to import symdex as a library)
- ✅ Thread-local SQLite connections in CypherCache
- ✅ MCP resources (Cypher schema), prompt templates, health endpoint
- ✅ CLI decoupled from core (instance-based config throughout)
- ✅ Legacy CLI code removed from core modules
- ✅ Smithery-ready (server-card, config schema, Docker); GitHub Actions CI/release
v1.2 — Enhanced Intelligence
- 🔄 Local LLM support (Ollama, llama.cpp)
- 🔄 Vector embeddings for "find similar" queries
- 🔄 Pre-commit hook for automatic re-indexing
- 🔄 VS Code extension
v1.3 — Multi-Language Support
- 📋 JavaScript / TypeScript
- 📋 Go, Rust, Java
- 📋 C / C++
v2.0 — Advanced Features
- 📋 GitHub API integration (search across repos)
- 📋 Code duplication detection via Cypher similarity
- 📋 Semantic diff (compare Cyphers across branches)
- 📋 Query optimization hints (suggest better Cypher patterns)
- 📋 Native async LLM providers (replace to_thread with SDK async clients)
- 📋 REST/gRPC API server for remote deployments
FAQ
Q: Does Symdex modify my source files?
A: No. All metadata is stored in .symdex/index.db. Source code is never touched.
Q: What if I don't want to commit the index?
A: Add .symdex/ to .gitignore. Teammates run symdex index . to rebuild (~3-7 min for 1K files).
Q: How accurate is the LLM Cypher generation?
A: 94% match human classification on validation set of 500 functions. Mismatches are usually domain ambiguity (e.g., DAT:DEL_USER vs BIZ:DEL_USER), which multi-lane search handles.
Q: Can I run without an API key?
A: Yes. Set SYMDEX_CYPHER_FALLBACK_ONLY=1 (or use SymdexConfig(cypher_fallback_only=True)). Indexing and search use rule-based Cypher generation only — no LLM calls. Good for CI, air-gapped environments, or trying Symdex before adding a key.
Q: Can I use a local LLM?
A: Not out of the box in v1.1, which supports Anthropic, OpenAI, and Gemini. Ollama integration is planned for v1.2; in the meantime you can extend LLMProvider in engine.py yourself.
Q: What's the indexing cost?
A: ~$0.003/function (Anthropic Haiku). 10K functions = ~$30 initial index. Incremental updates ~$1-3/month.
Q: How does Symdex compare to embeddings?
A: Embeddings require vector search (expensive, opaque). Cyphers use structured lookups (fast, explainable). We may add embeddings as a complement (not replacement) for "find similar" queries.
Q: Can I customize the Cypher schema?
A: Yes. Edit config.py → CypherSchema.DOMAINS/ACTIONS/PATTERNS. Re-index with --force.
Q: Can I use Symdex as a library in my own product?
A: Yes. from symdex import Symdex gives you a clean, instance-based API. Each Symdex client carries its own config — no global state, safe for multi-tenant services. See the "Python API" section above.
Q: Do I need to publish Symdex to PyPI to use the API?
A: No. Install from source with pip install -e ".[all]" and it's importable immediately. See "Local Development" below.
Q: Does the API support async?
A: Yes. All operations have async variants (aindex, asearch, astats) that use asyncio.to_thread(). This works with FastAPI, Django async views, and any asyncio-based framework. Native async LLM providers are planned for v2.0.
Q: How do I deploy the MCP server on Smithery?
A: Smithery Hosted (GitHub → they build and run) only runs servers built with their TypeScript CLI/SDK in their edge runtime (no filesystem, 128 MB). Symdex is Python and needs filesystem (SQLite, source files), so use the URL method: deploy this repo’s Docker image to Fly.io or Railway, then at smithery.ai/new choose URL and enter https://your-app.example.com/mcp. The server exposes /.well-known/mcp/server-card.json and Streamable HTTP on /mcp.
Technical Details
Indexing Algorithm
- File scanning: os.walk() with early pruning. Dotfiles and dot-directories (e.g. .git, .cursor, .env) are always excluded; built-in dirs (e.g. __pycache__, node_modules) and an optional .symdexignore add further exclusions.
- AST parsing: Python's ast module extracts function metadata (name, args, docstring, calls, call_sites, complexity)
- Hash checking: SHA256 of file content compared to cache; skip if unchanged
- Cypher generation: LLM translates function → Cypher (with rule-based fallback)
- Tag extraction: Parse function name, calls, docstring → keyword tags
- SQLite insert: Batch write to cypher_index and call_edges (call graph) with compound indexes
Concurrency: ThreadPoolExecutor with 5 workers + 50 req/min rate limit.
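The AST parsing step can be illustrated with the standard library's `ast` module. This is a minimal sketch of the idea, not Symdex's extractor: `extract_functions` is a hypothetical helper, the metadata keys are assumptions, and only direct calls by plain name are collected.

```python
import ast
from typing import Dict, List

# Sample source; it is only parsed, never executed.
SOURCE = '''
async def validate_user_token(token: str, user_id: int) -> bool:
    """Verify JWT token for a specific user."""
    if not token:
        return False
    return check_signature(token)
'''


def extract_functions(source: str) -> List[Dict[str, object]]:
    """Collect name, args, docstring, line span, async flag, and called names."""
    functions = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            calls = sorted({
                child.func.id
                for child in ast.walk(node)
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name)
            })
            functions.append({
                "name": node.name,
                "args": [arg.arg for arg in node.args.args],
                "docstring": ast.get_docstring(node),
                "line_start": node.lineno,
                "line_end": node.end_lineno,  # available on Python 3.8+
                "is_async": isinstance(node, ast.AsyncFunctionDef),
                "calls": calls,
            })
    return functions


metadata = extract_functions(SOURCE)
```

Metadata of this shape is what a Cypher generator (LLM or rule-based) would consume: the `is_async` flag, for instance, maps naturally onto the ASY pattern slot.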
Search Algorithm
- Query analysis: Detect if input is Cypher pattern or natural language
- LLM translation (if NL): Convert query → tiered Cypher patterns (tight, medium, broad)
- Multi-lane retrieval: 5 parallel SQL queries:
  - WHERE cypher = ? (exact)
  - WHERE cypher LIKE ? (domain wildcard)
  - WHERE cypher LIKE ? (action-only)
  - WHERE tags LIKE ? (keyword)
  - WHERE function_name LIKE ? (substring)
- Deduplication: Merge results by (file_path, function_name, line_start)
- Scoring: Weighted sum: exact (10) + action (6) + object (5) + domain (4) + pattern (2) + name (3) + tags (1.5) - domain mismatch (3)
- Ranking: Sort by score descending
- Context extraction: Read file lines [start-1 : start+3] (cached per file)
Optimization: File content cache avoids reading same file multiple times.
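The deduplication, scoring, ranking, and context steps can be sketched as below. The weights and the dedup key come from the list above; everything else is an assumption for illustration, including the helper names, the row dict layout, the DOMAIN:ACTION_OBJECT--PATTERN reading of a Cypher, and the rule that a wildcard slot earns no points.

```python
WEIGHTS = {"exact": 10, "domain": 5, "action": 5, "object": 3, "name": 3, "tags": 1.5}


def split_cypher(cypher: str) -> tuple[str, str, str]:
    """Split 'DOM:ACT_OBJ--PAT' into (domain, action, object)."""
    domain, _, rest = cypher.partition(":")
    action_object = rest.split("--", 1)[0]
    action, _, obj = action_object.partition("_")
    return domain, action, obj


def score(row: dict, pattern: str, keywords: list[str]) -> float:
    """Weighted sum of the signals a candidate row matches."""
    total = 0.0
    if row["cypher"] == pattern:
        total += WEIGHTS["exact"]
    query_slots = split_cypher(pattern)
    row_slots = split_cypher(row["cypher"])
    for slot, q, r in zip(("domain", "action", "object"), query_slots, row_slots):
        # Assumption: a wildcard slot carries no signal, so it earns no points.
        if q != "*" and q == r:
            total += WEIGHTS[slot]
    if any(k in row["function_name"].lower() for k in keywords):
        total += WEIGHTS["name"]
    if any(k in row["tags"].lower() for k in keywords):
        total += WEIGHTS["tags"]
    return total


def merge_and_rank(lanes, pattern: str, keywords: list[str]):
    """Deduplicate rows from all lanes, then sort by score descending."""
    unique = {}
    for lane in lanes:
        for row in lane:
            key = (row["file_path"], row["function_name"], row["line_start"])
            unique.setdefault(key, row)
    scored = [(row["function_name"], score(row, pattern, keywords)) for row in unique.values()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def context(lines: list[str], line_start: int) -> list[str]:
    """First four lines of the function (line_start is 1-based)."""
    return lines[line_start - 1 : line_start + 3]
```

In this sketch a row returned by several lanes is counted once, so a function that matches both the tag lane and the name lane is not ranked twice.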
Local Development
You can use Symdex as a library without publishing it to PyPI by installing in editable (development) mode. This is how you test the API locally.
1. Install in editable mode
# Clone the repo
git clone https://github.com/yourusername/symdex-100.git
cd symdex-100
# Create and activate a virtual environment
python -m venv .venv
# Windows PowerShell:
.venv\Scripts\Activate.ps1
# Linux/Mac:
source .venv/bin/activate
# Install in editable mode with all dependencies
pip install -e ".[all]"
The -e flag ("editable") links the package into your environment instead of copying it, so any code changes you make in src/symdex/ take effect immediately, with no reinstall needed.
2. Verify the install
# CLI should work
symdex --version
# Python API should be importable
python -c "from symdex import Symdex, SymdexConfig; print('OK')"
3. Test the API in a Python script or REPL
from symdex import Symdex, SymdexConfig
# Option A: reads ANTHROPIC_API_KEY (etc.) from environment
client = Symdex()
# Option B: explicit config (no env vars needed)
client = Symdex(config=SymdexConfig(
    llm_provider="anthropic",
    anthropic_api_key="sk-ant-your-key-here",
))
# Index the symdex project itself as a test
result = client.index(".")
print(result) # IndexResult(files_scanned=..., functions_indexed=..., ...)
# Search it
hits = client.search("validate cypher", path=".")
for h in hits:
    print(f" {h.function_name} {h.cypher} score={h.score:.1f}")
# Direct pattern search (no LLM call needed)
hits = client.search_by_cypher("*:VAL_*--*", path=".")
3b. Manually test the API with an example repository
To index a directory and run example searches in one go (index → stats → natural-language search → Cypher pattern search):
# Index and search this repo's src/ (default)
python scripts/try_api.py
# Use a specific folder
python scripts/try_api.py src
python scripts/try_api.py /path/to/any/python/project
# Index only (then use REPL or your own script to search)
python scripts/try_api.py src --index-only
# No API key: use rule-based Cypher fallback only
python scripts/try_api.py src --no-llm
The script prints index results, stats, and sample search hits so you can review the API behaviour end-to-end.
4. Use from another local project
If you have a separate project that wants to use Symdex as a dependency:
# From your other project's venv:
pip install -e /path/to/symdex-100
# Or with pip's path syntax in requirements.txt:
# -e /path/to/symdex-100
Now from symdex import Symdex works in that project, and changes to the Symdex source are reflected immediately.
5. Run the test suite
# All tests
pytest tests/ -v
# Specific test file
pytest tests/test_config.py -v
# With coverage (if installed)
pytest tests/ --cov=symdex --cov-report=term-missing
Contributing
We welcome contributions! Focus areas:
- Search relevance — Improve scoring algorithm, add query expansion
- Performance — Optimize SQLite queries, batch LLM calls
- LLM providers — Add Ollama, Together AI, local models
- Language support — JavaScript/TypeScript extractors (v1.3)
- IDE plugins — VS Code, JetBrains extensions
- API integrations — REST wrapper, Django/FastAPI middleware
Setup:
git clone https://github.com/yourusername/symdex-100.git
cd symdex-100
pip install -e ".[all]"
pytest tests/
License
MIT License — see LICENSE
Citation
If you use Symdex-100 in academic work, please cite:
@software{symdex100_2026,
  title = {Symdex-100: Semantic Fingerprints for Code Search},
  author = {Camillo Pachmann},
  year = {2026},
  url = {https://github.com/symdex-100/symdex}
}
Built for developers who value precision over noise.
Built for AI agents that need to explore codebases efficiently.
Search smarter, not harder.