Symdex

Security & Compliance

by symdex-100

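Index and search a codebase using a structured schema, with support for deep code analysis, domain-specific and security-function audits, and a quick overview of overall structure and patterns.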

Symdex-100

<div align="center">

Symdex Robot

symdex-100 - your AI companion for code exploration

</div>

Semantic fingerprints for 100x faster Python code search.

Symdex-100 generates compact, structured metadata ("Cyphers") for every function in your Python codebase. Each Cypher is a 20-byte semantic fingerprint that enables sub-second, intent-based code search for developers and AI agents — without reading thousands of lines of code.

python
# Your Python function → Indexed automatically
async def validate_user_token(token: str, user_id: int) -> bool:
    """Verify JWT token for a specific user."""
    # ... implementation ...
bash
# Natural language search → Sub-second results
$ symdex search "where do we validate user tokens"

──────────────────────────────────────────────────────────────────────────────
  SYMDEX — 1 result in 0.0823 seconds
──────────────────────────────────────────────────────────────────────────────

  #1  validate_user_token  (Python)
  ────────────────────────────────────────────────────────────────────────────
    File   : /project/auth/tokens.py
    Lines  : 42–67
    Cypher : SEC:VAL_TOKEN--ASY
    Score  : 24.5

      42 │ async def validate_user_token(token: str, user_id: int) -> bool:
      43 │     """Verify JWT token for a specific user."""
      44 │     if not token:
      45 │         return False

The Problem

Traditional code search methods scale poorly on large codebases:

| Approach | Limitation | Token Cost (AI agents) |
| --- | --- | --- |
| grep | Keyword noise — finds "token" in comments, strings, variable names | 3,000+ tokens (read all matches) |
| Full-text search | No semantic understanding — can't distinguish intent | 5,000+ tokens (read 10 files) |
| Embeddings | Opaque, expensive, query-time overhead | 2,000+ tokens (re-rank results) |
| AST/LSP | Limited to structural queries (class/function names) | N/A (doesn't understand "what validates X") |

Result: Developers waste time reading irrelevant code. AI agents burn tokens on noise.


The Solution: Semantic Fingerprints

Symdex-100 solves this with Cypher-100, a structured metadata format that encodes function semantics in 20 bytes:

Anatomy of a Cypher-100 String

Each Cypher follows a strict four-slot hierarchy designed for both machine filtering and human readability:

code
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│            DOM   :   ACT   _   OBJ   --   PAT               │
│              │        │         │           │               │
│         Domain   Action       Object        Pattern         │
│                                                             │
│   Where does     What does    What is       How does        │
│   this live?     it do?       the target?   it run?         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Formal specification:

$$ \text{Cypher} = \text{DOM} : \text{ACT} \_ \text{OBJ} \text{--} \text{PAT} $$

Where:

  • DOM (Domain): Semantic namespace — SEC (Security), NET (Network), DAT (Data), SYS (System), LOG (Logging), UI (Interface), BIZ (Business), TST (Testing)

  • ACT (Action): Primary operation — VAL (Validate), FET (Fetch), TRN (Transform), CRT (Create), SND (Send), SCR (Scrub), UPD (Update), AGG (Aggregate), FLT (Filter), DEL (Delete)

  • OBJ (Object): Target entity — USER, TOKEN, DATASET, CONFIG, LOGS, REQUEST, JSON, EMAIL, DIR

  • PAT (Pattern): Execution model — ASY (Async), SYN (Synchronous), REC (Recursive), GEN (Generator), DEC (Decorator), CTX (Context manager), CLS (Closure)

Example:

code
SEC:SCR_EMAIL--ASY

Translation: A security function that scrubs email data asynchronously.

Breakdown:

  • SEC = Security domain
  • SCR = Scrub action (sanitize/remove)
  • EMAIL = Email object
  • ASY = Asynchronous pattern

This 18-character string stands in for 2,000+ characters of function body at search time, roughly a 100:1 size reduction, while keeping the four facts the search needs: domain, action, object, and pattern.
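
To make the four-slot layout concrete, here is a minimal parsing sketch. The regular expression is an assumption inferred from the DOM:ACT_OBJ--PAT format above (with `*` as a wildcard), and `parse_cypher` and the `Cypher` tuple are hypothetical helper names, not part of the Symdex API.

```python
import re
from typing import NamedTuple

# Assumed grammar, inferred from the DOM:ACT_OBJ--PAT layout; "*" is a wildcard slot.
CYPHER_RE = re.compile(
    r"^(?P<dom>[A-Z]+|\*):(?P<act>[A-Z]+|\*)_(?P<obj>[A-Z0-9_]+|\*)--(?P<pat>[A-Z]+|\*)$"
)


class Cypher(NamedTuple):
    dom: str  # domain, e.g. SEC
    act: str  # action, e.g. SCR
    obj: str  # object, e.g. EMAIL
    pat: str  # pattern, e.g. ASY


def parse_cypher(text: str) -> Cypher:
    """Split a Cypher string (or wildcard pattern) into its four slots."""
    match = CYPHER_RE.match(text)
    if match is None:
        raise ValueError(f"not a Cypher string: {text!r}")
    return Cypher(**match.groupdict())


print(parse_cypher("SEC:SCR_EMAIL--ASY"))
print(parse_cypher("*:SCR_*--*"))
```

Because the action slot cannot contain an underscore, the first `_` after the colon always separates action from object, so object codes with underscores would still parse under this assumed grammar.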


Core Benefits

1. Search Speed

Problem: grep reads every file, full-text indexes scan every function.

Solution: Symdex searches 20-byte Cyphers in a SQLite B-tree index.

| Metric | Grep | Symdex (DB only) | Improvement |
| --- | --- | --- | --- |
| Data scanned per query | ~50MB (full codebase) | ~100KB (index) | 500x less I/O |
| Index lookup (5,000 functions) | 800ms | 8ms | 100x faster |
| Index size | N/A (no index) | 2MB | 25:1 compression |

Technical details:

  • SQLite B-tree: O(log N) lookups with compound indexes on (cypher, tags, function_name)
  • Tiered Cypher + multi-lane retrieval; candidate cap (default 200) keeps latency and result size bounded
  • Incremental indexing: SHA256 hash tracking skips unchanged files
  • Reported search time in CLI/API is index lookup only (excludes LLM translation for natural-language queries)

Result: Sub-second index lookup on 10,000+ function codebases.

Search & call-graph enhancements: Use directory_scope to restrict results to a subtree (path = index root). Call-graph includes Celery .delay()/.apply_async() as task invocations. Filter or group results by Cypher domain/action (domain_filter, action_filter, group_by).


2. Search Accuracy

Problem: Single search strategies miss valid results (e.g., SYS:DEL_DIR won't find DAT:DEL_DIR if query specifies system domain), or return too many low-quality hits when the Cypher is too broad.

Solution: Tiered Cypher patterns plus always-on multi-lane search.

Tiered translation (natural-language queries): The LLM returns three Cypher patterns — tight (no wildcards), medium (minimal wildcards), broad (fallback). The engine queries the tight pattern first; if the candidate pool is too small, it runs the medium then broad pattern and merges (deduplicated). Results are scored against the tight pattern so precise matches rank highest.
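
The tight-to-broad fallback can be sketched in a few lines. This is an illustration under assumptions, not the engine's code: `tiered_search`, the `lookup` callable, and the `min_candidates` threshold are hypothetical names, and the in-memory list with `fnmatch` merely stands in for the real index lookup.

```python
import fnmatch
from typing import Callable, Iterable


def tiered_search(
    patterns: Iterable[str],
    lookup: Callable[[str], list[str]],
    min_candidates: int = 5,  # assumed threshold for "pool is too small"
) -> list[str]:
    """Query patterns from tight to broad, stopping once the pool is large enough."""
    pool: list[str] = []
    seen: set[str] = set()
    for pattern in patterns:
        for hit in lookup(pattern):
            if hit not in seen:  # deduplicate across tiers
                seen.add(hit)
                pool.append(hit)
        if len(pool) >= min_candidates:
            break  # enough candidates: do not broaden further
    return pool


# Toy index standing in for the SQLite lookup.
INDEX = ["SYS:DEL_DIR--SYN", "DAT:DEL_DIR--SYN", "DAT:DEL_USER--ASY"]


def lookup(pattern: str) -> list[str]:
    return [cypher for cypher in INDEX if fnmatch.fnmatchcase(cypher, pattern)]


tiers = ["SYS:DEL_DIR--SYN", "SYS:DEL_DIR--*", "*:DEL_*--*"]  # tight, medium, broad
print(tiered_search(tiers, lookup, min_candidates=1))
print(tiered_search(tiers, lookup, min_candidates=2))
```

With a threshold of 1 the tight pattern alone is enough; with a threshold of 2 the search falls through to the broad pattern and picks up the cross-domain matches, with the tight hit still first in the pool.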

Multi-lane retrieval (per pattern):

code
Query: "delete directory"  →  Tiered: [SYS:DEL_DIR--SYN, SYS:DEL_DIR--*, *:DEL_*--*]
    ↓
┌────────────────────────────────────────────────────────────┐
│ LANE 1: Exact Cypher      │ SYS:DEL_DIR--SYN               │
│ LANE 2: Domain wildcard   │ *:DEL_DIR--SYN                 │
│ LANE 3: Action-only       │ *:DEL_*--*                     │
│ LANE 4: Tag keywords      │ delete, directory  (capped)    │
│ LANE 5: Function name     │ _delete_directory_tree (capped)│
└────────────────────────────────────────────────────────────┘
    ↓
Merge + Cap candidates (default 200) + Score against tight pattern
    ↓
Ranked Results (exact match + domain/action/object = highest score)

Scoring: ACT (action) and OBJ (object) dominate — they encode what the function does and on what. Domain and pattern follow. Wrong domain (e.g. result is TST when query asked for BIZ) is penalized.

$$ \text{score} = 10[\text{exact}] + 6[\text{action}] + 5[\text{object}] + 4[\text{domain}] + 2[\text{pattern}] + 3[\text{name}] + 1.5[\text{tags}] - 3[\text{domain mismatch}] $$

Where $[\text{x}]$ is 1 if matched, 0 otherwise (with partial matching for names and object similarity).
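
As a worked instance of the formula, the sketch below scores a candidate Cypher against a query pattern. It is a simplification under stated assumptions: name and tag matches are passed in as plain booleans (the partial matching mentioned above is left out), a wildcard slot in the query earns no points, and `score` and `split_cypher` are hypothetical helper names.

```python
# Weights taken from the scoring formula above.
WEIGHTS = {
    "exact": 10.0, "action": 6.0, "object": 5.0, "domain": 4.0,
    "pattern": 2.0, "name": 3.0, "tags": 1.5,
}
DOMAIN_MISMATCH_PENALTY = 3.0


def split_cypher(cypher: str) -> tuple[str, str, str, str]:
    """Split DOM:ACT_OBJ--PAT into (domain, action, object, pattern)."""
    dom, rest = cypher.split(":", 1)
    body, pat = rest.rsplit("--", 1)
    act, obj = body.split("_", 1)
    return dom, act, obj, pat


def score(query: str, candidate: str, name_match: bool = False, tag_match: bool = False) -> float:
    q_slots = split_cypher(query)
    c_slots = split_cypher(candidate)
    total = WEIGHTS["exact"] if query == candidate else 0.0
    for slot, q_val, c_val in zip(("domain", "action", "object", "pattern"), q_slots, c_slots):
        if q_val != "*" and q_val == c_val:  # assumption: wildcards earn no points
            total += WEIGHTS[slot]
    if q_slots[0] != "*" and q_slots[0] != c_slots[0]:
        total -= DOMAIN_MISMATCH_PENALTY  # wrong domain is penalized
    if name_match:
        total += WEIGHTS["name"]
    if tag_match:
        total += WEIGHTS["tags"]
    return total


print(score("SYS:DEL_DIR--SYN", "SYS:DEL_DIR--SYN"))  # 10 + 4 + 6 + 5 + 2
print(score("SYS:DEL_DIR--SYN", "DAT:DEL_DIR--SYN"))  # 6 + 5 + 2 - 3
```

The second call shows why a cross-domain hit still surfaces but ranks well below the exact match: action, object, and pattern agree, and only the domain penalty applies.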

Result: High precision from tiered + tight-pattern scoring; cross-domain recall when needed; fewer irrelevant results (candidate cap, Lane 3 skip, smaller tag/name limits).


3. Token Efficiency (for AI Agents)

Problem: Agents waste 80-90% of context on reading irrelevant code when exploring large codebases.

Solution: Symdex provides a 50:1 token reduction via semantic search.

Scenario: Agent needs to find "function that validates user login credentials"

| Approach | Process | Tokens |
| --- | --- | --- |
| Read 10 files | Agent guesses likely files → reads all → searches manually | ~5,000 |
| Grep + read | grep "login\|credential" → read 20 matches → filter manually | ~3,000 |
| Symdex | search_codebase("validate login credentials") → 1 precise result | ~100 |

Token breakdown (Symdex approach):

  • Query: 20 tokens
  • MCP tool call overhead: 30 tokens
  • Result (1 function, 5-line preview): 50 tokens
  • Total: 100 tokens

Savings: 50x fewer tokens, zero false positives.

Why this matters:

  • 200K context window → explore 50x more functions
  • 90% reduction in API costs for code exploration
  • Faster reasoning (less noise in context)

4. Noise Reduction

Problem: Keyword searches return false positives (e.g., "token" in variable names, comments, docstrings).

Solution: Semantic fingerprints distinguish intent from mention.

| Query | Grep (keyword) | Symdex (semantic) |
| --- | --- | --- |
| "validate token" | 47 results (includes token = ..., # token expired, TOKEN_KEY) | 3 results (only functions that validate tokens) |
| "delete user" | 89 results (includes # delete user later, user.delete_flag) | 2 results (only functions that delete users) |

Precision improvement: 15x fewer false positives on average.


Use Cases & Best Practices

When to Use Symdex

✅ Use Symdex when:

  1. Finding code by intent — "where do we validate user passwords", "find the CSV parsing function", "which function sends email notifications"
  2. Onboarding to unfamiliar codebases — Quickly map out architecture by domain (SEC:*_*--* for security functions, DAT:*_*--* for data processing)
  3. Code refactoring / impact analysis — Find all functions that touch a specific object (*:*_USER--* for user-related operations)
  4. Tracing execution flow — Use call graph tools: get_callers ("who calls X?"), get_callees ("what does X call?"), trace_call_chain (recursive walk up or down). No manual grep or file hopping.
  5. Documentation generation — Extract function summaries with semantic context (Cypher + first 5 lines of code)
  6. AI agent code exploration — 50x fewer tokens than reading files directly

❌ Don't use Symdex when:

  1. You know the exact file and line — Just read the file directly
  2. Simple string search — Use grep/IDE search for exact identifiers or literals
  3. Non-Python codebases — Currently Python-only (JS/TS/Go/Rust support planned)
  4. Extremely small projects (<50 functions) — Overhead of indexing outweighs benefits

How to Use Symdex Effectively

1. Tuning Search Results

Adjust context_lines for editing vs. reading:

python
# Default: 3 lines (quick preview for exploration)
client.search("validate token", context_lines=3)

# For editing: 10-15 lines (full function body)
client.search("validate token", context_lines=15)

Use explain to debug scoring:

python
results = client.search("validate token", explain=True)
for result in results:
    print(f"Score: {result.score}")
    print(f"Breakdown: {result.explanation}")
    # Example: {'action_match': 6, 'object_match': 5, 'name_matches': {'exact': 1, 'score': 3}}

2. Search Strategies

Auto (default) — Fastest for most queries:

bash
symdex search "validate token"
# Auto selects: LLM translation if available, else keyword fallback

LLM (force semantic) — Best for natural language:

python
client.search("where do we check if user is admin", strategy="llm")

Keyword (no LLM) — Fast, works offline:

python
client.search("delete user", strategy="keyword")
# Keyword-based translation: ~5ms vs. LLM: ~200-500ms

Direct (skip translation) — Use Cypher patterns:

python
client.search("SEC:VAL_*--ASY", strategy="direct")
# Zero translation overhead

3. Indexing Best Practices

Incremental indexing (default):

bash
symdex index ./project
# Only re-processes changed files (SHA256 tracking)

Force re-index (after major refactors):

bash
symdex index ./project --force

Monitor indexing (get summary):

python
result = client.index("./project")
print(result.summary)
# {'top_files': [{'file': 'auth.py', 'functions': 47}],
#  'domain_distribution': {'SEC': 23, 'DAT': 18, 'NET': 6}}

4. Call Graph (CLI)

After indexing, you can query the call graph from the command line:

bash
# Who calls this function?
symdex callers add_cypher_entry

# What does this function call?
symdex callees _process_function

# Trace the chain (who calls this, or what this calls)
symdex trace add_cypher_entry --direction callers --depth 4
symdex trace process_files --direction callees --depth 3

# Output as JSON (e.g. for scripting)
symdex callers encrypt_file_content --format json
symdex trace add_cypher_entry --direction callers --format json

Options: --cache-dir (index location), --context-lines (code preview lines), -f/--format (console, json, compact, ide for callers/callees; console or json for trace).

5. MCP Server (AI Agents)

Use context_lines for agent tasks:

typescript
// Exploration (default): 3 lines
await searchCodebase({ query: "validate token", context_lines: 3 });

// Editing task: 10+ lines
await searchCodebase({ query: "validate token", context_lines: 15 });

Prefer Symdex over file reading when:

  • Searching for code by intent (not exact identifiers)
  • You'd otherwise read 3+ files to find the right function
  • Codebase has 200+ functions (indexing overhead paid off)

Use grep (or text search) when: You need an exhaustive list of every call site of an exact pattern (e.g. every User.objects.create / get_or_create). Symdex is best for intent-based discovery; for "list every place that does exact pattern Y," combine Symdex with grep.

Example agent workflow:

code
1. search_codebase("how does authentication work")
   → Returns: SEC:VAL_TOKEN--ASY, SEC:CRT_SESSION--SYN, SEC:VAL_PASS--SYN

2. Read top result (SEC:VAL_TOKEN) with context_lines=15

3. Edit the function (now you have the right context)

Quick Start

Install

bash
# Published package (once available on PyPI)
pip install symdex-100

# Local development (from source — see "Local Development" below)
pip install -e ".[all]"

Set API Key

bash
# Anthropic (default, recommended)
export ANTHROPIC_API_KEY="sk-ant-..."

# Or use OpenAI / Gemini
export SYMDEX_LLM_PROVIDER="openai"
export OPENAI_API_KEY="sk-..."

Supports Anthropic Claude (default), OpenAI GPT, or Google Gemini.

CLI Usage

bash
# Index a project
symdex index ./my-project

# Natural language search
symdex search "where do we validate user passwords"

# Direct Cypher (skip LLM translation)
symdex search "SEC:VAL_PASS--*"

# With pagination
symdex search "async email" -n 20 -p 5

# JSON output (for scripting)
symdex search "delete directory" --format json | jq '.[] | .file_path'

# Check statistics (files, functions, call edges)
symdex stats

# Call graph: who calls X? what does X call? trace chain
symdex callers add_cypher_entry
symdex callees _process_function
symdex trace add_cypher_entry --direction callers --depth 4
symdex trace process_files --direction callees --depth 3 --format json

Creates .symdex/index.db (SQLite). Source files are never modified.

Python API

Symdex can be used as a library in your own applications — no CLI needed.

python
from symdex import Symdex

# Create a client (reads API key from environment)
client = Symdex()

# Index a project
result = client.index("./my-project")
print(f"Indexed {result.functions_indexed} functions in {result.files_scanned} files")

# Search by intent
hits = client.search("validate user tokens", path="./my-project")
for hit in hits:
    print(f"  {hit.function_name} @ {hit.file_path}:{hit.line_start}  [{hit.cypher}]")

# Search by Cypher pattern (no LLM needed)
hits = client.search_by_cypher("SEC:VAL_*--*", path="./my-project")

# Get index statistics (includes call_edges for call graph)
stats = client.stats("./my-project")
print(f"{stats['indexed_files']} files, {stats['indexed_functions']} functions, {stats['call_edges']} call edges")

# Call graph: who calls X? what does X call? trace execution flow
callers = client.get_callers("encrypt_file_content", path="./my-project")
callees = client.get_callees("process_files", path="./my-project")
chain = client.trace_call_chain("add_cypher_entry", direction="callers", max_depth=4, path="./my-project")

With explicit configuration (no environment variables needed):

python
from symdex import Symdex, SymdexConfig

config = SymdexConfig(
    llm_provider="openai",
    openai_api_key="sk-...",
    openai_model="gpt-4o-mini",
    max_search_results=10,
    min_search_score=3.0,
)
client = Symdex(config=config)

Async support (for FastAPI, Django async views, etc.):

python
from symdex import Symdex

client = Symdex()

# All operations have async variants
result  = await client.aindex("./my-project")
hits    = await client.asearch("validate tokens", path="./my-project")
stats   = await client.astats("./my-project")
callers = await client.aget_callers("encrypt_file_content", path="./my-project")
chain   = await client.atrace_call_chain("process_files", direction="callees", path="./my-project")
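
The roadmap notes that the async variants are built on asyncio.to_thread. The sketch below shows that wrapping technique in isolation; `BlockingClient` is a stand-in written for this example, not the real Symdex class, and the sleep simulates blocking I/O.

```python
import asyncio
import time


class BlockingClient:
    """Stand-in for a synchronous client (hypothetical, not the Symdex class)."""

    def search(self, query: str) -> list[str]:
        time.sleep(0.01)  # simulates blocking work such as SQLite or HTTP calls
        return [f"hit for {query!r}"]

    async def asearch(self, query: str) -> list[str]:
        # Run the blocking method in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.search, query)


async def main() -> list[list[str]]:
    client = BlockingClient()
    # Both searches run concurrently in separate worker threads.
    return await asyncio.gather(
        client.asearch("validate tokens"),
        client.asearch("delete user"),
    )


results = asyncio.run(main())
print(results)
```

This pattern needs Python 3.9 or later. The work still occupies a thread per call, which is why native async providers are listed separately on the roadmap.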

Error handling:

python
from symdex import Symdex, IndexNotFoundError, ConfigError

client = Symdex()

try:
    hits = client.search("validate user")
except IndexNotFoundError:
    print("Run client.index() first!")
except ConfigError:
    print("Check your API key configuration")

Cypher Taxonomy Reference

Domains (DOM)

| Code | Domain | Example Functions |
| --- | --- | --- |
| SEC | Security | validate_token, hash_password, encrypt_data |
| DAT | Data | fetch_user, transform_csv, aggregate_metrics |
| NET | Network | send_request, handle_webhook, fetch_api_data |
| SYS | System | delete_directory, check_disk_space, spawn_process |
| LOG | Logging | setup_logger, scrub_sensitive_logs, format_trace |
| UI | Interface | render_template, validate_form, format_output |
| BIZ | Business | calculate_discount, approve_order, check_eligibility |
| TST | Testing | mock_database, assert_response, generate_fixture |

Actions (ACT)

| Code | Action | Typical Use Cases |
| --- | --- | --- |
| VAL | Validate | Input validation, schema checks, token verification |
| FET | Fetch | Database queries, API calls, file reads |
| TRN | Transform | Format conversion, data mapping, serialization |
| CRT | Create | Object instantiation, file creation, record insertion |
| SND | Send | Network requests, message queues, email dispatch |
| SCR | Scrub | Data sanitization, PII removal, log filtering |
| UPD | Update | Record modification, cache refresh, state change |
| AGG | Aggregate | Reduce operations, metrics collection, summaries |
| FLT | Filter | Query refinement, access control, data selection |
| DEL | Delete | Resource cleanup, record removal, file deletion |

Patterns (PAT)

| Code | Pattern | Description |
| --- | --- | --- |
| ASY | Async | async def functions, promises, coroutines |
| SYN | Synchronous | Standard blocking functions |
| REC | Recursive | Self-calling functions, tree traversals |
| GEN | Generator | yield-based functions, iterators |
| DEC | Decorator | Function wrappers, middleware |
| CTX | Context Manager | with statements, resource management |
| CLS | Closure | Functions returning functions, lexical scope |

Architecture

code
┌─────────────────────────────────────────────────────────────────┐
│                     SYMDEX-100 ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Python Source (.py)                                           │
│         │                                                       │
│         ├─→ [AST Parser] ──→ Function Metadata                  │
│         │                     (name, args, docstring, ...)      │
│         │                                                       │
│         └─→ [LLM] ──────────→ Cypher Generation                 │
│                                SEC:VAL_TOKEN--ASY               │
│                                                                 │
│   ┌─────────────────────────────────────────────────┐           │
│   │         .symdex/index.db (SQLite)               │           │
│   ├─────────────────────────────────────────────────┤           │
│   │  • B-tree index on (cypher, tags, function_name)│           │
│   │  • SHA256 hash for incremental indexing         │           │
│   │  • 100:1 compression vs full function bodies    │           │
│   └─────────────────────────────────────────────────┘           │
│                        ↓                                        │
│   ┌─────────────────────────────────────────────────┐           │
│   │           MULTI-LANE SEARCH ENGINE              │           │
│   ├─────────────────────────────────────────────────┤           │
│   │  Query → [LLM] → 3 Cypher patterns (tight/med/broad)        │
│   │     ↓  Try tight first; merge medium/broad if needed        │
│   │  5 Lanes per pattern:  Exact │ Domain* │ Act* │ Tags │ Name │
│   │  (Lane 3 skipped when redundant; tag/name capped)           │
│   │     ↓  Candidate cap (e.g. 200)                             │
│   │  Score vs tight pattern → Rank → Format                     │
│   └─────────────────────────────────────────────────┘           │
│                        ↓                                        │
│   Results (100x faster, 50x fewer tokens)                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Design Decisions:

  1. Python AST (not regex): Handles decorators, nested functions, edge cases
  2. Sidecar index (not inline): Source files stay pristine, no diffs
  3. Tiered Cypher (tight → medium → broad): LLM returns 3 patterns; try precise first, broaden only if needed — fewer irrelevant results
  4. Multi-lane search (per pattern): Exact, domain wildcard, action-only (when not redundant), tag/name (capped); candidate cap before scoring
  5. LLM + rule-based fallback: Semantic accuracy with deterministic backup
  6. SQLite B-tree: Zero-config, portable, O(log N) lookups
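
To illustrate the first design decision, here is a small sketch of AST-based extraction using Python's standard `ast` module. The metadata fields and the `extract_functions` helper are assumptions chosen for the example; the actual indexer may record different or additional fields.

```python
import ast

SOURCE = '''
import functools

@functools.lru_cache
def cached(x):
    return x

async def validate_user_token(token: str, user_id: int) -> bool:
    """Verify JWT token for a specific user."""
    def helper():
        return bool(token)
    return helper()
'''


def extract_functions(source: str) -> list[dict]:
    """Collect metadata for every function, including decorated and nested ones."""
    functions = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append({
                "name": node.name,
                "args": [arg.arg for arg in node.args.args],
                "docstring": ast.get_docstring(node),
                "is_async": isinstance(node, ast.AsyncFunctionDef),
                "decorators": len(node.decorator_list),
                "line_start": node.lineno,
                "line_end": node.end_lineno,
            })
    # ast.walk order is unspecified, so sort by position in the file.
    return sorted(functions, key=lambda f: f["line_start"])


for info in extract_functions(SOURCE):
    print(info["name"], info["args"], info["is_async"])
```

A regex over `def` lines would struggle with the decorator and the nested `helper`; walking the tree finds both without special cases.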

MCP Server (for AI Agents)

Symdex provides a full MCP (Model Context Protocol) server with tools, resources, and prompt templates so AI agents can search your codebase natively.

Setup (Cursor)

  1. Install (in this repo or your project): pip install -e ".[mcp]" so the symdex command is on your PATH.
  2. Index (optional but recommended): in your project root run symdex index . so search has data. Or use the MCP tool index_directory from the agent.
  3. Configure Cursor: create or edit .cursor/mcp_settings.json in your workspace (or Cursor user config) with:
json
{
  "mcpServers": {
    "symdex": {
      "command": "symdex",
      "args": ["mcp"]
    }
  }
}

The key you use in mcpServers (e.g. "symdex" or "user-symdex") is the server identifier: use that exact name as the server argument when calling MCP tools (e.g. call_mcp_tool(server="symdex", ...)). The display name "Symdex-100" is for UI only.

  4. Reload: Restart Cursor or run "MCP: Restart" so it starts the server. The server uses stdio by default (no port needed).

Test: Open a chat and ask the agent to run get_index_stats for . or search_codebase("validate user"); if the index exists you should get results.

If symdex is not on PATH (e.g. you use a venv and Cursor runs without it), set "command" to your Python and "args" to ["-m", "symdex.cli.main", "mcp"], or use the full path to the symdex executable (e.g. ".venv/bin/symdex" on Unix, ".venv\\Scripts\\symdex.exe" on Windows).

Available Tools

| Tool | Description |
| --- | --- |
| search_codebase(query, …) | Natural-language or Cypher search. Prefer a specific intent (e.g. "Django User model create"). Optional: directory_scope, domain_filter, action_filter, group_by. |
| search_by_cypher(cypher_pattern, …) | Direct Cypher lookup (no LLM). Optional: directory_scope, domain_filter, action_filter. |
| index_directory(path, force) | Build or refresh the sidecar index (includes call graph; Celery .delay()/.apply_async() → task edges). |
| get_index_stats(path) | File, function, and call_edges counts. |
| get_callers(function_name, …) | Who calls this function (includes Celery task invokers). Optional: directory_scope, domain_filter, action_filter. |
| get_callees(function_name, …) | What this function calls. Optional: directory_scope, domain_filter, action_filter. |
| trace_call_chain(function_name, …) | Trace callers (up) or callees (down). Optional: directory_scope, domain_filter, action_filter. |
| health() | Server status, provider, model info. |

Resources (read-only data)

| URI | Description |
| --- | --- |
| symdex://schema/domains | Domain codes and descriptions |
| symdex://schema/actions | Action codes and descriptions |
| symdex://schema/patterns | Pattern codes and descriptions |
| symdex://schema/full | Complete Cypher-100 schema with common object codes |

Prompt Templates

| Prompt | Description |
| --- | --- |
| find_security_functions(path) | Audit all security-related functions |
| audit_domain(domain, path) | Audit all functions in a specific domain |
| explore_codebase(path) | High-level architecture overview via domain stats |

Programmatic MCP Server Creation

python
from symdex.mcp.server import create_server
from symdex.core.config import SymdexConfig

config = SymdexConfig(llm_provider="openai", openai_api_key="sk-...")
server = create_server(config=config)
server.run(transport="stdio")

Agent workflow:

code
Agent: "I need to find the function that validates JWT tokens"
    ↓
[Tool Call] search_codebase("validate JWT token")
    ↓
Result: 1 function, ~100 tokens (vs 5,000 tokens reading 10 files)
    ↓
Agent: "Now I know exactly where to look"

Token economics:

  • Without Symdex: 5,000 tokens (read 10 files) → 10% success rate
  • With Symdex: 100 tokens (precise search) → 95% success rate
  • 50x token reduction, 9.5x higher accuracy

Performance Benchmarks

Indexing Performance

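| Codebase Size | Files | Functions | Time (Anthropic) |
| --- | --- | --- | --- |
| Small | 100 | 500 | 45s |
| Medium | 500 | 2,500 | 3.5min |
| Large | 1,000 | 5,000 | 7min |
| Real-world (≈300k LOC) | ≈1,000 | ≈2,800 | ≈15min |
| Very Large | 5,000 | 25,000 | 35min |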

Incremental re-indexing: ~10% of initial time (only changed files).

Search Performance

Reported time: The CLI and API report DB-only search time (multi-lane retrieval, scoring, context extraction). LLM translation for natural-language queries is not included.

Test setup (small index): 5,000 indexed functions, cold SQLite cache.

| Query Complexity | Grep | Symdex (DB only) | Speedup |
| --- | --- | --- | --- |
| Exact match | 450ms | 4ms | 112x |
| Wildcard | 780ms | 8ms | 97x |
| Multi-term | 1,200ms | 12ms | 100x |
| Natural language | N/A | 15ms + LLM | N/A |

Large codebase (≈2,800 functions, ≈458 indexed files):

| Query | Results | DB time | Note |
| --- | --- | --- | --- |
| "force delete data and directory of repository" | 208 | <1s | Multi-lane, direct-style pattern |
| "where does the AI model analyze for dependencies" | 76 | 0.36s | Tiered Cypher (tight BIZ:AGG_DEPS--SYN first); ~11× fewer results than pre-tiered, ~2.5× faster |

Query breakdown (Symdex):

  • LLM translation: not included in reported time (one-time per query, ~1–3s depending on provider)
  • Multi-lane retrieval: typically 50–400ms (depends on result count and candidate cap)
  • Scoring + ranking: 1–5ms
  • Context extraction: scales with result count

Result: Sub-second index lookup for typical queries; tiered patterns and candidate cap keep result sets focused and fast.


Advanced Usage

Configuration reference

All parameters, default values, and how to configure MCP defaults (e.g. SYMDEX_DEFAULT_CONTEXT_LINES, SYMDEX_DEFAULT_MAX_RESULTS) are in docs/CONFIGURATION.md.

Output Formats

bash
# Rich console (default) — human-friendly
symdex search "validate password"

# JSON — for scripting/piping
symdex search "validate password" --format json | jq '.[] | .cypher'

# Compact — grep-like, one line per result
symdex search "validate password" --format compact

# IDE — file(line): format for editor integration
symdex search "validate password" --format ide

Direct Cypher Patterns

bash
# All security functions
symdex search "SEC:*_*--*"

# Async data operations
symdex search "DAT:*_*--ASY"

# Functions that scrub/sanitize anything
symdex search "*:SCR_*--*"

# Recursive algorithms
symdex search "*:*_*--REC"

Pagination

bash
# Interactive navigation for large result sets
symdex search "user" -n 50 -p 10

# Commands: [Enter] next, [b] back, [p] print, [j] json, [q] quit

Configuration

bash
# Use OpenAI instead of Anthropic
export SYMDEX_LLM_PROVIDER=openai
export OPENAI_API_KEY="sk-..."

# Customize search scoring
export CYPHER_MIN_SCORE=7.0

# Increase concurrency (faster indexing, more API load)
export SYMDEX_MAX_CONCURRENT=10

Docker

For CLI usage, MCP in Docker, index-on-host vs remote URL, and publishing on Smithery, see docs/DOCKER.md.


Roadmap

v1.0 — Python Foundation

  • ✅ Python AST-based extraction
  • ✅ Multi-lane search with unified scoring
  • ✅ SQLite sidecar index
  • ✅ MCP server for AI agents
  • ✅ Interactive CLI with pagination
  • ✅ Sub-second search on 10K+ functions

v1.1 (Current) — Product-Grade API

  • ✅ Instance-based SymdexConfig (replaces global config — multi-tenant safe)
  • ✅ Symdex client facade — single entry point for programmatic use
  • ✅ Async API (aindex, asearch, astats via asyncio.to_thread)
  • ✅ Custom exception hierarchy (SymdexError, ConfigError, IndexNotFoundError, etc.)
  • ✅ Lazy LLM initialization (search without API key for direct/keyword strategies)
  • ✅ Rule-only mode (SYMDEX_CYPHER_FALLBACK_ONLY) — no API key required
  • ✅ IndexingPipeline.run() returns typed IndexResult
  • ✅ No import-time side effects (safe to import symdex as a library)
  • ✅ Thread-local SQLite connections in CypherCache
  • ✅ MCP resources (Cypher schema), prompt templates, health endpoint
  • ✅ CLI decoupled from core (instance-based config throughout)
  • ✅ Legacy CLI code removed from core modules
  • ✅ Smithery-ready (server-card, config schema, Docker); GitHub Actions CI/release

v1.2 — Enhanced Intelligence

  • 🔄 Local LLM support (Ollama, llama.cpp)
  • 🔄 Vector embeddings for "find similar" queries
  • 🔄 Pre-commit hook for automatic re-indexing
  • 🔄 VS Code extension

v1.3 — Multi-Language Support

  • 📋 JavaScript / TypeScript
  • 📋 Go, Rust, Java
  • 📋 C / C++

v2.0 — Advanced Features

  • 📋 GitHub API integration (search across repos)
  • 📋 Code duplication detection via Cypher similarity
  • 📋 Semantic diff (compare Cyphers across branches)
  • 📋 Query optimization hints (suggest better Cypher patterns)
  • 📋 Native async LLM providers (replace to_thread with SDK async clients)
  • 📋 REST/gRPC API server for remote deployments

FAQ

Q: Does Symdex modify my source files?
A: No. All metadata is stored in .symdex/index.db. Source code is never touched.

Q: What if I don't want to commit the index?
A: Add .symdex/ to .gitignore. Teammates run symdex index . to rebuild (~3-7 min for 1K files).

Q: How accurate is the LLM Cypher generation?
A: About 94% of generated Cyphers match human classification on a validation set of 500 functions. Mismatches are usually domain ambiguity (e.g., DAT:DEL_USER vs BIZ:DEL_USER), which multi-lane search handles.

Q: Can I run without an API key?
A: Yes. Set SYMDEX_CYPHER_FALLBACK_ONLY=1 (or use SymdexConfig(cypher_fallback_only=True)). Indexing and search use rule-based Cypher generation only — no LLM calls. Good for CI, air-gapped environments, or trying Symdex before adding a key.

Q: Can I use a local LLM?
A: Not out of the box yet. v1.1 supports Anthropic, OpenAI, and Gemini. Ollama integration is planned for v1.2; until then you can extend LLMProvider in engine.py.

Q: What's the indexing cost?
A: ~$0.003/function (Anthropic Haiku). 10K functions = ~$30 initial index. Incremental updates ~$1-3/month.

Q: How does Symdex compare to embeddings?
A: Embeddings require vector search (expensive, opaque). Cyphers use structured lookups (fast, explainable). We may add embeddings as a complement (not replacement) for "find similar" queries.

Q: Can I customize the Cypher schema?
A: Yes. Edit config.py → CypherSchema.DOMAINS/ACTIONS/PATTERNS. Re-index with --force.

Q: Can I use Symdex as a library in my own product?
A: Yes. from symdex import Symdex gives you a clean, instance-based API. Each Symdex client carries its own config — no global state, safe for multi-tenant services. See the "Python API" section above.

Q: Do I need to publish Symdex to PyPI to use the API?
A: No. Install from source with pip install -e ".[all]" and it's importable immediately. See "Local Development" below.

Q: Does the API support async?
A: Yes. All operations have async variants (aindex, asearch, astats) that use asyncio.to_thread(). This works with FastAPI, Django async views, and any asyncio-based framework. Native async LLM providers are planned for v2.0.

Q: How do I deploy the MCP server on Smithery?
A: Smithery Hosted (GitHub → they build and run) only runs servers built with their TypeScript CLI/SDK in their edge runtime (no filesystem, 128 MB). Symdex is Python and needs filesystem (SQLite, source files), so use the URL method: deploy this repo’s Docker image to Fly.io or Railway, then at smithery.ai/new choose URL and enter https://your-app.example.com/mcp. The server exposes /.well-known/mcp/server-card.json and Streamable HTTP on /mcp.


Technical Details

Indexing Algorithm

  1. File scanning: os.walk() with early pruning. Dotfiles and dot-directories (e.g. .git, .cursor, .env) are always excluded; built-in dirs (e.g. __pycache__, node_modules) and optional .symdexignore add further exclusions.
  2. AST parsing — Python's ast module extracts function metadata (name, args, docstring, calls, call_sites, complexity)
  3. Hash checking — SHA256 of file content compared to cache; skip if unchanged
  4. Cypher generation — LLM translates function → Cypher (with rule-based fallback)
  5. Tag extraction — Parse function name, calls, docstring → keyword tags
  6. SQLite insert — Batch write to cypher_index and call_edges (call graph) with compound indexes
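
Steps 1 to 3 can be sketched with the standard library alone. The code below is an illustration under stated assumptions rather than the Symdex implementation: the function names, the in-memory `hash_cache` dict, and the shape of the extracted metadata are hypothetical, and `.symdexignore` handling, call extraction, and complexity are left out.

```python
import ast
import hashlib
import os
import tempfile

SKIP_DIRS = {"__pycache__", "node_modules"}  # built-in exclusions named above


def iter_python_files(root: str):
    """Yield .py paths, pruning dot-directories and SKIP_DIRS before descent."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Mutating dirnames in place stops os.walk from entering pruned dirs.
        dirnames[:] = [d for d in dirnames if not d.startswith(".") and d not in SKIP_DIRS]
        for name in filenames:
            if name.endswith(".py") and not name.startswith("."):
                yield os.path.join(dirpath, name)


def extract_functions(source: str) -> list[dict]:
    """Collect name, args, docstring and line span for every function."""
    functions = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append({
                "name": node.name,
                "args": [a.arg for a in node.args.args],
                "docstring": ast.get_docstring(node),
                "is_async": isinstance(node, ast.AsyncFunctionDef),
                "lines": (node.lineno, node.end_lineno),
            })
    return functions


def index_tree(root: str, hash_cache: dict[str, str]) -> list[dict]:
    """Parse only files whose SHA256 differs from the cached digest."""
    indexed = []
    for path in iter_python_files(root):
        with open(path, "rb") as fh:
            raw = fh.read()
        digest = hashlib.sha256(raw).hexdigest()
        if hash_cache.get(path) == digest:
            continue  # unchanged since the last run
        hash_cache[path] = digest
        indexed.extend(extract_functions(raw.decode("utf-8")))
    return indexed


# Demo on a throwaway directory containing one real file and one file under .git.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, ".git"))
    with open(os.path.join(tmp, ".git", "hook.py"), "w") as fh:
        fh.write("def ignored(): pass\n")
    with open(os.path.join(tmp, "tokens.py"), "w") as fh:
        fh.write('async def validate_user_token(token, user_id):\n    """Verify JWT."""\n    return bool(token)\n')
    cache: dict[str, str] = {}
    first_run = index_tree(tmp, cache)
    second_run = index_tree(tmp, cache)  # nothing changed, so nothing is parsed

print(first_run)
print(second_run)
```

Pruning `dirnames` in place is what makes the scan "early": excluded directories are never listed at all, which matters for large trees such as `node_modules`.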

Concurrency: ThreadPoolExecutor with 5 workers + 50 req/min rate limit.
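
One way to combine a worker pool with a request-per-minute cap is a shared limiter that hands out evenly spaced time slots. The sketch below is an assumption about how such a limit could be enforced, not a copy of Symdex's code: `RateLimiter` and `generate_cypher` are hypothetical names, and the demo uses a much higher limit than 50 req/min so that it finishes quickly.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Space calls at least 60 / per_minute seconds apart across all threads."""

    def __init__(self, per_minute: float) -> None:
        self._interval = 60.0 / per_minute
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def acquire(self) -> None:
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)  # earliest free slot
            self._next_slot = slot + self._interval
        # Sleep outside the lock so other threads can reserve their slots.
        time.sleep(max(0.0, slot - now))


def generate_cypher(name: str, limiter: RateLimiter) -> str:
    """Stand-in for one LLM request (hypothetical)."""
    limiter.acquire()
    return name.upper()


# 50 req/min would mean 1.2 s between calls; the demo uses 10 ms spacing instead.
limiter = RateLimiter(per_minute=6000)
names = [f"func_{i}" for i in range(10)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(lambda n: generate_cypher(n, limiter), names))
elapsed = time.monotonic() - start

print(results)
print(f"{elapsed:.3f}s")
```

With five workers and a 1.2 s slot interval, the rate limit rather than the pool size is the bottleneck, so adding workers would not speed up a cold index.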

Search Algorithm

  1. Query analysis — Detect if input is Cypher pattern or natural language
  2. LLM translation (if NL) — Convert query → Cypher pattern with wildcards
  3. Multi-lane retrieval — 5 parallel SQL queries:
    • WHERE cypher = ? (exact)
    • WHERE cypher LIKE ? (domain wildcard)
    • WHERE cypher LIKE ? (action-only)
    • WHERE tags LIKE ? (keyword)
    • WHERE function_name LIKE ? (substring)
  4. Deduplication — Merge results by (file_path, function_name, line_start)
  5. Scoring — Weighted sum: exact (10) + domain (5) + action (5) + object (3) + name (3) + tags (1.5)
  6. Ranking — Sort by score descending
  7. Context extraction — Read file lines [start-1 : start+3] (cached per file)
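
Steps 3 to 6 can be illustrated against an in-memory SQLite database. The table name `cypher_index` and the lane weights come from the description above; the column names, the sample rows, the query parameters, and the rule that each matching lane simply adds its weight are assumptions of this sketch. The lanes run sequentially here, and the separate "object" weight is omitted because no lane in the list maps to it.

```python
import sqlite3

# (lane name, WHERE clause, weight). Weights are taken from the scoring step above.
LANES = [
    ("exact", "cypher = ?", 10.0),
    ("domain", "cypher LIKE ?", 5.0),
    ("action", "cypher LIKE ?", 5.0),
    ("tags", "tags LIKE ?", 1.5),
    ("name", "function_name LIKE ?", 3.0),
]


def search(conn: sqlite3.Connection, params: dict[str, str]) -> list[tuple[float, tuple]]:
    """Run every lane, merge hits by (file, name, line), and sum the lane weights."""
    scores: dict[tuple, float] = {}
    for lane, where, weight in LANES:
        sql = f"SELECT file_path, function_name, line_start FROM cypher_index WHERE {where}"
        for row in conn.execute(sql, (params[lane],)):
            # The row tuple is the dedup key, so repeated hits accumulate score.
            scores[row] = scores.get(row, 0.0) + weight
    return sorted(((score, key) for key, score in scores.items()), reverse=True)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cypher_index "
    "(file_path TEXT, function_name TEXT, line_start INTEGER, cypher TEXT, tags TEXT)"
)
conn.executemany("INSERT INTO cypher_index VALUES (?, ?, ?, ?, ?)", [
    ("auth/tokens.py", "validate_user_token", 42, "SEC:VAL_TOKEN--ASY", "validate,user,token,jwt"),
    ("auth/forms.py", "validate_email", 10, "SEC:VAL_EMAIL--SYN", "validate,email"),
    ("db/cache.py", "get_token_cache", 5, "DAT:GET_CACHE--SYN", "get,token,cache"),
])

ranked = search(conn, {
    "exact": "SEC:VAL_TOKEN--ASY",
    "domain": "SEC:%",
    "action": "%:VAL%",
    "tags": "%token%",
    "name": "%token%",
})
for score, (path, name, line) in ranked:
    print(f"{score:5.1f}  {name}  ({path}:{line})")
```

A function that matches several lanes outranks one that matches a single lane, which is what makes the ranking explainable: each point of score traces back to a specific query.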

Optimization: The file content cache avoids reading the same file more than once when several results come from it.
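
A per-file cache plus the `[start-1 : start+3]` slice can be sketched as follows. Using `functools.lru_cache` is a choice made for this example, and `read_lines` and `context` are hypothetical names; the source does not say how Symdex implements its cache.

```python
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=256)
def read_lines(path: str) -> tuple[str, ...]:
    """Read a file once; repeat calls for the same path hit the cache."""
    with open(path, encoding="utf-8") as fh:
        return tuple(fh.read().splitlines())


def context(path: str, line_start: int, length: int = 4) -> list[tuple[int, str]]:
    """Return (line_number, text) pairs starting at the 1-based line_start."""
    lines = read_lines(path)
    first = line_start - 1  # convert to a 0-based slice index
    return [(first + i + 1, text) for i, text in enumerate(lines[first:first + length])]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.py")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(f"line {n}" for n in range(1, 7)))
    first_hit = context(path, 2)
    second_hit = context(path, 5)  # same file, so it is served from the cache
    cache_hits = read_lines.cache_info().hits

print(first_hit)
print(second_hit)
print(cache_hits)
```

The slice yields four lines beginning at the function's first line, and it shortens gracefully when the function sits near the end of the file.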


Local Development

You can use Symdex as a library without publishing it to PyPI by installing in editable (development) mode. This is how you test the API locally.

1. Install in editable mode

bash
# Clone the repo
git clone https://github.com/yourusername/symdex-100.git
cd symdex-100

# Create and activate a virtual environment
python -m venv .venv
# Windows PowerShell:
.venv\Scripts\Activate.ps1
# Linux/Mac:
source .venv/bin/activate

# Install in editable mode with all dependencies
pip install -e ".[all]"

The -e flag ("editable") links the package into your environment instead of copying it. Any code changes you make in src/symdex/ take effect immediately, with no reinstall needed.

2. Verify the install

bash
# CLI should work
symdex --version

# Python API should be importable
python -c "from symdex import Symdex, SymdexConfig; print('OK')"

3. Test the API in a Python script or REPL

python
from symdex import Symdex, SymdexConfig

# Option A: reads ANTHROPIC_API_KEY (etc.) from environment
client = Symdex()

# Option B: explicit config (no env vars needed)
client = Symdex(config=SymdexConfig(
    llm_provider="anthropic",
    anthropic_api_key="sk-ant-your-key-here",
))

# Index the symdex project itself as a test
result = client.index(".")
print(result)  # IndexResult(files_scanned=..., functions_indexed=..., ...)

# Search it
hits = client.search("validate cypher", path=".")
for h in hits:
    print(f"  {h.function_name}  {h.cypher}  score={h.score:.1f}")

# Direct pattern search (no LLM call needed)
hits = client.search_by_cypher("*:VAL_*--*", path=".")
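
To see what a wildcard pattern such as `*:VAL_*--*` selects, the matching can be reproduced with the standard library. This is an illustrative sketch only: the sample Cyphers other than `SEC:VAL_TOKEN--ASY` are made up, `to_like` is a hypothetical helper, and the source does not state whether Symdex translates patterns to SQL in this way.

```python
from fnmatch import fnmatchcase


def to_like(pattern: str) -> str:
    """Translate '*' wildcards into SQL LIKE syntax, escaping literal % and _."""
    escaped = pattern.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return escaped.replace("*", "%")


cyphers = ["SEC:VAL_TOKEN--ASY", "SEC:VAL_EMAIL--SYN", "DAT:GET_CACHE--SYN"]

# Any domain, any object, any pattern, but the action must be VAL.
matches = [c for c in cyphers if fnmatchcase(c, "*:VAL_*--*")]
print(matches)
print(to_like("*:VAL_*--*"))
```

The underscore needs escaping because `_` is itself a single-character wildcard in SQL `LIKE`; a query using the translated string would also need an `ESCAPE '\'` clause.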

3b. Manually test the API with an example repository

To index a directory and run example searches in one go (index → stats → natural-language search → Cypher pattern search):

bash
# Index and search this repo's src/ (default)
python scripts/try_api.py

# Use a specific folder
python scripts/try_api.py src
python scripts/try_api.py /path/to/any/python/project

# Index only (then use REPL or your own script to search)
python scripts/try_api.py src --index-only

# No API key: use rule-based Cypher fallback only
python scripts/try_api.py src --no-llm

The script prints index results, stats, and sample search hits so you can review the API behaviour end-to-end.

4. Use from another local project

If you have a separate project that wants to use Symdex as a dependency:

bash
# From your other project's venv:
pip install -e /path/to/symdex-100

# Or with pip's path syntax in requirements.txt:
# -e /path/to/symdex-100

Now from symdex import Symdex works in that project, and changes to the Symdex source are reflected immediately.

5. Run the test suite

bash
# All tests
pytest tests/ -v

# Specific test file
pytest tests/test_config.py -v

# With coverage (if installed)
pytest tests/ --cov=symdex --cov-report=term-missing

Contributing

We welcome contributions! Focus areas:

  1. Search relevance — Improve scoring algorithm, add query expansion
  2. Performance — Optimize SQLite queries, batch LLM calls
  3. LLM providers — Add Ollama, Together AI, local models
  4. Language support — JavaScript/TypeScript extractors (v1.3)
  5. IDE plugins — VS Code, JetBrains extensions
  6. API integrations — REST wrapper, Django/FastAPI middleware

Setup:

bash
git clone https://github.com/yourusername/symdex-100.git
cd symdex-100
pip install -e ".[all]"
pytest tests/

License

MIT License — see LICENSE


Citation

If you use Symdex-100 in academic work, please cite:

bibtex
@software{symdex100_2026,
  title = {Symdex-100: Semantic Fingerprints for Code Search},
  author = {Camillo Pachmann},
  year = {2026},
  url = {https://github.com/symdex-100/symdex}
}

Built for developers who value precision over noise.
Built for AI agents that need to explore codebases efficiently.

Search smarter, not harder.

