Judges Panel

Coding & Debugging

by kevinrabun

A panel of 45 specialized judges that evaluates AI-generated code for security, cost, and quality.


Judges Panel

An MCP (Model Context Protocol) server that provides a panel of 45 specialized judges to evaluate AI-generated code — acting as an independent quality gate regardless of which project is being reviewed. Combines deterministic pattern matching & AST analysis (instant, offline, zero LLM calls) with LLM-powered deep-review prompts that let your AI assistant perform expert-persona analysis across all 45 domains.

Highlights:

  • Includes an App Builder Workflow (3-step) demo for release decisions, plain-language risk summaries, and prioritized fixes — see Try the Demo.
  • Includes V2 context-aware evaluation with policy profiles, evidence calibration, specialty feedback, confidence scoring, and uncertainty reporting.
  • Includes public repository URL reporting to clone a repo, run the full tribunal, and output a consolidated markdown report.
  • 200+ deterministic auto-fix patches (see src/patches/index.ts) plus LLM-powered deep review.

🧪 Many commands in printHelp are experimental/roadmap. By default, we show GA commands only. Set JUDGES_SHOW_EXPERIMENTAL=1 to reveal stubs; these may not be wired yet.


🔰 Packages

  • CLI: @kevinrabun/judges-cli → binary judges (use npx @kevinrabun/judges-cli eval --file app.ts).
  • MCP/API: @kevinrabun/judges → programmatic API + MCP server (npm install @kevinrabun/judges).
  • VS Code extension: see vscode-extension/.
  • GitHub Action: uses: KevinRabun/judges@main (see CI quickstart).

Quickstart

CLI (one-off)

bash
# Using the CLI package (recommended)
npx @kevinrabun/judges-cli eval --file src/app.ts

# Show GA commands only (default)
npx @kevinrabun/judges-cli --help

# Show experimental/roadmap commands
JUDGES_SHOW_EXPERIMENTAL=1 npx @kevinrabun/judges-cli --help

# License scan (supply-chain & license compliance)
npx @kevinrabun/judges-cli license-scan --dir .

CLI vs API: If you want to embed Judges in your app (MCP/API), install @kevinrabun/judges. For the command-line, use @kevinrabun/judges-cli (binary judges).

GitHub Action

yaml
name: Judges
on: [pull_request, push]
jobs:
  judges:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: KevinRabun/judges@main
        with:
          path: .
          diff-only: true           # evaluate only changed lines in PRs (default true)
          fail-on-findings: true    # fail on critical/high findings
          upload-sarif: true        # upload SARIF to GitHub Code Scanning

Programmatic API (MCP server included)

bash
npm install @kevinrabun/judges
ts
import { evaluateCode } from "@kevinrabun/judges/api";
const verdict = evaluateCode("const password = 'ProdSecret';", "typescript");
console.log(verdict.overallVerdict, verdict.overallScore);

MCP server

The MCP server runs on stdio and is started by your MCP client (VS Code, Claude Desktop, etc.). Configure it in your MCP settings (e.g. mcp.json):

json
{
  "servers": {
    "judges": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@kevinrabun/judges"]
    }
  }
}

Or run the server directly:

bash
npx @kevinrabun/judges
# Starts the MCP server on stdio

Config file: .judgesrc.json (supports ${ENV_VAR} substitution via expandEnvPlaceholders). See Configuration.
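A minimal sketch of what a `.judgesrc.json` might contain — the keys shown here are illustrative assumptions, not the documented schema; only the `${ENV_VAR}` placeholder behavior is described above:

```json
{
  "preset": "security-only",
  "minScore": 80,
  "baseline": "baseline.json",
  "token": "${JUDGES_TOKEN}"
}
```

At load time, `expandEnvPlaceholders` would replace `${JUDGES_TOKEN}` with the value of that environment variable.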


Why Judges?

AI code generators (Copilot, Cursor, Claude, ChatGPT, etc.) write code fast — but they routinely produce insecure defaults, missing auth, hardcoded secrets, and poor error handling. Human reviewers catch some of this, but nobody reviews 45 dimensions consistently.

|  | ESLint / Biome | SonarQube | Semgrep / CodeQL | Judges |
| --- | --- | --- | --- | --- |
| Scope | Style + some bugs | Bugs + code smells | Security patterns | 45 domains: security, cost, compliance, a11y, API design, cloud, UX, … |
| AI-generated code focus | No | No | Partial | Purpose-built for AI output failure modes |
| Setup | Config per project | Server + scanner | Cloud or local | One command: `npx @kevinrabun/judges-cli eval file.ts` |
| Auto-fix patches | Some | No | No | 200+ deterministic patches — instant, offline |
| Non-technical output | No | Dashboard | No | Plain-language findings with What/Why/Next |
| MCP native | No | No | No | Yes — works inside Copilot, Claude, Cursor |
| SARIF output | No | Yes | Yes | Yes — upload to GitHub Code Scanning |
| Cost | Free | $$$$ | Free/paid | Free / MIT |

Judges doesn't replace linters — it covers the dimensions linters don't: authentication strategy, data sovereignty, cost patterns, accessibility, framework-specific anti-patterns, and architectural issues across multiple files.

<p align="center"> <img src="docs/terminal-output.svg" alt="Judges — Terminal Output" width="680" /> </p>

Quick Start

Prereqs: Node.js >=18 (>=20 recommended), npx available. The judges CLI binary ships with @kevinrabun/judges-cli (preferred) and also works via npx @kevinrabun/judges.

Packages:

  • CLI: npm install -g @kevinrabun/judges-cli (or npx @kevinrabun/judges-cli ...)
  • MCP/API: npm install @kevinrabun/judges

Use @kevinrabun/judges for the MCP server and programmatic API. Use @kevinrabun/judges-cli when you want the judges terminal command.

Try it now (no clone needed)

bash
# Install the CLI globally
npm install -g @kevinrabun/judges-cli

# Evaluate any file
judges eval src/app.ts

# Pipe from stdin
cat api.py | judges eval --language python

# Single judge
judges eval --judge cybersecurity server.ts

# SARIF output for CI
judges eval --file app.ts --format sarif > results.sarif

# HTML report with severity filters and dark/light theme
judges eval --file app.ts --format html > report.html

# Fail CI on findings (exit code 1)
judges eval --fail-on-findings src/api.ts

# Suppress known findings via baseline
judges eval --baseline baseline.json src/api.ts

# Use a named preset
judges eval --preset security-only src/api.ts

# Use a config file
judges eval --config .judgesrc.json src/api.ts

# Set a minimum score threshold (exit 1 if below)
judges eval --min-score 80 src/api.ts

# One-line summary for scripts
judges eval --summary src/api.ts

# Agentic skills (orchestrated judge sets)
judges skill ai-code-review --file src/app.ts
judges skill security-review --file src/api.ts --format json
judges skill release-gate --file src/app.ts
judges skills   # list available skills

> Full catalog: [`docs/skills.md`](docs/skills.md)


# List all 45 judges
judges list

Additional CLI Commands

bash
# Interactive project setup wizard
judges init

# Preview auto-fix patches (dry run)
judges fix src/app.ts

# Apply patches directly
judges fix src/app.ts --apply

# License compliance scan (copyleft/unknown detection)
judges license-scan --format json --risk high

# Watch mode — re-evaluate on file save
judges watch src/

# Project-level report (local directory)
judges report . --format html --output report.html

# Evaluate a unified diff (pipe from git diff)
git diff HEAD~1 | judges diff

# Analyze dependencies for supply-chain risks
judges deps --path . --format json

# Run GitHub App server (zero-config PR reviews)
judges app serve --port 4567

# Run GitHub PR review (gh CLI required)
judges review --pr 123 --repo owner/name --diff-only

# Auto-tune presets and configs
judges tune --dir . --apply

# Create a baseline file to suppress known findings
judges baseline create --file src/api.ts -o baseline.json

# Generate CI template files
judges ci-templates --provider github
judges ci-templates --provider gitlab
judges ci-templates --provider azure
judges ci-templates --provider bitbucket

# Generate per-judge rule documentation
judges docs
judges docs --judge cybersecurity
judges docs --output docs/

# Install shell completions
judges completions bash   # eval "$(judges completions bash)"
judges completions zsh
judges completions fish
judges completions powershell

# Install pre-commit hook
judges hook install

# Uninstall pre-commit hook
judges hook uninstall

🔎 Tip: The CLI help now defaults to GA commands only. To see experimental/roadmap commands, run:

bash
JUDGES_SHOW_EXPERIMENTAL=1 judges --help

GitHub App (self-hosted webhook)

Run a zero-config PR reviewer as a GitHub App:

bash
# Run the webhook server locally
judges app serve --port 4567

Required env vars:

  • JUDGES_APP_ID – GitHub App ID
  • JUDGES_PRIVATE_KEY or JUDGES_PRIVATE_KEY_PATH – PEM private key
  • JUDGES_WEBHOOK_SECRET – signature verification secret

Optional:

  • JUDGES_MIN_SEVERITY (default: medium)
  • JUDGES_MAX_COMMENTS (default: 25)
  • JUDGES_TEST_DRY_RUN=1 to avoid live network calls during tests

For local testing, you can expose `http://localhost:4567/webhook` via `ngrok http 4567` and configure the GitHub App webhook URL accordingly.
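For example, a local setup might export the required variables before starting the server (all values below are placeholders, not real credentials):

```shell
# Placeholder values -- substitute your real GitHub App credentials
export JUDGES_APP_ID="123456"
export JUDGES_PRIVATE_KEY_PATH="$HOME/.keys/judges-app.pem"
export JUDGES_WEBHOOK_SECRET="a-long-random-string"
export JUDGES_MIN_SEVERITY="high"   # optional; default is medium

# then start the webhook server:
# judges app serve --port 4567
```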

Use in GitHub Actions

Add Judges to your CI pipeline with zero configuration:

yaml
# .github/workflows/judges.yml
name: Judges Code Review
on: [pull_request]

jobs:
  judges:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # only if using upload-sarif
    steps:
      - uses: actions/checkout@v4
      - uses: KevinRabun/judges@main
        with:
          path: src/api.ts        # file or directory
          format: text             # text | json | sarif | markdown
          upload-sarif: true       # upload to GitHub Code Scanning
          fail-on-findings: true   # fail CI on critical/high findings

Outputs available for downstream steps: verdict, score, findings, critical, high, sarif-file.
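A later step can read those outputs via the standard `steps.<id>.outputs` syntax — the step `id` below is our addition; the output names are taken from the list above:

```yaml
      - uses: KevinRabun/judges@main
        id: judges
        with:
          path: .
      - name: Print verdict
        if: always()
        run: |
          echo "Verdict: ${{ steps.judges.outputs.verdict }}"
          echo "Score:   ${{ steps.judges.outputs.score }}/100"
          echo "Critical findings: ${{ steps.judges.outputs.critical }}"
```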

Use with Docker (no Node.js required)

bash
# Build the image
docker build -t judges .

# Evaluate a local file
docker run --rm -v $(pwd):/code judges eval --file /code/app.ts

# Pipe from stdin
cat api.py | docker run --rm -i judges eval --language python

# List judges
docker run --rm judges list

Or use as an MCP server

1. Install and Build

bash
git clone https://github.com/KevinRabun/judges.git
cd judges
npm install
npm run build

2. Try the Demo

Run the included demo to see all 45 judges evaluate a purposely flawed API server:

bash
npm run demo

This evaluates examples/sample-vulnerable-api.ts — a file intentionally packed with security holes, performance anti-patterns, and code quality issues — and prints a full verdict with per-judge scores and findings.

The demo now also includes an App Builder Workflow (3-step) section. In a single run, you get both tribunal output and workflow output:

  • Release decision (Ship now / Ship with caution / Do not ship)
  • Plain-language summaries of top risks
  • Prioritized remediation tasks and AI-fixable P0/P1 items

Sample workflow output (truncated):

text
╔══════════════════════════════════════════════════════════════╗
║             App Builder Workflow Demo (3-Step)             ║
╚══════════════════════════════════════════════════════════════╝

  Decision       : Do not ship
  Verdict        : FAIL (47/100)
  Risk Counts    : Critical 24 | High 27 | Medium 55

  Step 2 — Plain-Language Findings:
  - [CRITICAL] DATA-001: Hardcoded password detected
      What: ...
      Why : ...
      Next: ...

  Step 3 — Prioritized Tasks:
  - P0 | DEVELOPER | Effort L | DATA-001
      Task: ...
      Done: ...

  AI-Fixable Now (P0/P1):
  - P0 DATA-001: ...

Sample tribunal output (truncated):

code
╔══════════════════════════════════════════════════════════════╗
║           Judges Panel — Full Tribunal Demo                 ║
╚══════════════════════════════════════════════════════════════╝

  Overall Verdict : FAIL
  Overall Score   : 43/100
  Critical Issues : 15
  High Issues     : 17
  Total Findings  : 83
  Judges Run      : 33

  Per-Judge Breakdown:
  ────────────────────────────────────────────────────────────────
  ❌ Judge Data Security              0/100    7 finding(s)
  ❌ Judge Cybersecurity              0/100    7 finding(s)
  ❌ Judge Cost Effectiveness        52/100    5 finding(s)
  ⚠️  Judge Scalability              65/100    4 finding(s)
  ❌ Judge Cloud Readiness           61/100    4 finding(s)
  ❌ Judge Software Practices        45/100    6 finding(s)
  ❌ Judge Accessibility              0/100    8 finding(s)
  ❌ Judge API Design                 0/100    9 finding(s)
  ❌ Judge Reliability               54/100    3 finding(s)
  ❌ Judge Observability             45/100    5 finding(s)
  ❌ Judge Performance               27/100    5 finding(s)
  ❌ Judge Compliance                 0/100    4 finding(s)
  ⚠️  Judge Testing                  90/100    1 finding(s)
  ⚠️  Judge Documentation            70/100    4 finding(s)
  ⚠️  Judge Internationalization     65/100    4 finding(s)
  ⚠️  Judge Dependency Health        90/100    1 finding(s)
  ❌ Judge Concurrency               44/100    4 finding(s)
  ❌ Judge Ethics & Bias             65/100    2 finding(s)
  ❌ Judge Maintainability           52/100    4 finding(s)
  ❌ Judge Error Handling            27/100    3 finding(s)
  ❌ Judge Authentication             0/100    4 finding(s)
  ❌ Judge Database                   0/100    5 finding(s)
  ❌ Judge Caching                   62/100    3 finding(s)
  ❌ Judge Configuration Mgmt         0/100    3 finding(s)
  ⚠️  Judge Backwards Compat         80/100    2 finding(s)
  ⚠️  Judge Portability              72/100    2 finding(s)
  ❌ Judge UX                        52/100    4 finding(s)
  ❌ Judge Logging Privacy            0/100    4 finding(s)
  ❌ Judge Rate Limiting             27/100    4 finding(s)
  ⚠️  Judge CI/CD                    80/100    2 finding(s)

3. Run the Tests

bash
npm test

Runs automated tests covering all judges, AST parsers, markdown formatters, and edge cases.

4. Connect to Your Editor

VS Code (recommended — zero config)

Install the Judges Panel extension from the Marketplace. It provides:

  • Inline diagnostics & quick-fixes on every file save
  • @judges chat participant — type @judges in Copilot Chat, or just ask for a "judges panel review" and Copilot routes automatically
  • Auto-configured MCP server — all 45 expert-persona prompts available to Copilot with zero setup
bash
code --install-extension kevinrabun.judges-panel

VS Code — manual MCP config

If you prefer explicit workspace config (or want teammates without the extension to benefit), create .vscode/mcp.json:

json
{
  "servers": {
    "judges": {
      "command": "npx",
      "args": ["-y", "@kevinrabun/judges"]
    }
  }
}

Claude Desktop

Add to claude_desktop_config.json:

json
{
  "mcpServers": {
    "judges": {
      "command": "npx",
      "args": ["-y", "@kevinrabun/judges"]
    }
  }
}

Cursor / other MCP clients

Use the same npx command for any MCP-compatible client:

json
{
  "command": "npx",
  "args": ["-y", "@kevinrabun/judges"]
}

5. Use Judges in GitHub Copilot PR Reviews

Yes — users can include Judges as part of GitHub-based review workflows, with one important caveat:

  • The hosted copilot-pull-request-reviewer on GitHub does not currently let you directly attach arbitrary local MCP servers the same way VS Code does.
  • The practical pattern is to run Judges in CI on each PR, publish a report/check, and have Copilot + human reviewers use that output during review.

Option A (recommended): PR workflow check + report artifact

Create .github/workflows/judges-pr-review.yml:

yaml
name: Judges PR Review

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  judges:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm

      - name: Install
        run: npm ci

      - name: Generate Judges report
        run: |
          npx tsx -e "import { generateRepoReportFromLocalPath } from './src/reports/public-repo-report.ts';
          const result = generateRepoReportFromLocalPath({
            repoPath: process.cwd(),
            outputPath: 'judges-pr-report.md',
            maxFiles: 600,
            maxFindingsInReport: 150,
          });
          console.log('Overall:', result.overallVerdict, result.averageScore);"

      - name: Upload report artifact
        uses: actions/upload-artifact@v4
        with:
          name: judges-pr-report
          path: judges-pr-report.md

This gives every PR a reproducible Judges output your team (and Copilot) can reference.

Option B: Add Copilot custom instructions in-repo

Add .github/instructions/judges.instructions.md with guidance such as:

markdown
When reviewing pull requests:
1. Read the latest Judges report artifact/check output first.
2. Prioritize CRITICAL and HIGH findings in remediation guidance.
3. If findings conflict, defer to security/compliance-related Judges.
4. Include rule IDs (e.g., DATA-001, CYBER-004) in suggested fixes.

This helps keep Copilot feedback aligned with Judges findings.


CLI Reference

All commands support --help for usage details.

judges eval

Evaluate a file with all 45 judges or a single judge.

| Flag | Description |
| --- | --- |
| `--file <path>` / positional | File to evaluate |
| `--judge <id>` / `-j <id>` | Single judge mode |
| `--language <lang>` / `-l <lang>` | Language hint (auto-detected from extension) |
| `--format <fmt>` / `-f <fmt>` | Output format: text, json, sarif, markdown, html, pdf, junit, codeclimate, github-actions |
| `--output <path>` / `-o <path>` | Write output to file |
| `--fail-on-findings` | Exit with code 1 if verdict is FAIL |
| `--baseline <path>` / `-b <path>` | JSON baseline file — suppress known findings |
| `--summary` | Print a single summary line (ideal for scripts) |
| `--config <path>` | Load a .judgesrc / .judgesrc.json config file |
| `--preset <name>` | Use a named preset (see Named Presets for all 22 options) |
| `--min-score <n>` | Exit with code 1 if overall score is below this threshold |
| `--verbose` | Print timing and debug information |
| `--quiet` | Suppress non-essential output |
| `--no-color` | Disable ANSI colors |

judges init

Interactive wizard that generates project configuration:

  • .judgesrc.json — rule customization, disabled judges, severity thresholds
  • .github/workflows/judges.yml — GitHub Actions CI workflow
  • .gitlab-ci.judges.yml — GitLab CI pipeline (optional)
  • azure-pipelines.judges.yml — Azure Pipelines (optional)

judges fix

Preview or apply auto-fix patches from deterministic findings.

| Flag | Description |
| --- | --- |
| positional | File to fix |
| `--apply` | Write patches to disk (default: dry run) |
| `--judge <id>` | Limit to a single judge's findings |

judges watch

Continuously re-evaluate files on save.

| Flag | Description |
| --- | --- |
| positional | File or directory to watch (default: .) |
| `--judge <id>` | Single judge mode |
| `--fail-on-findings` | Exit non-zero if any evaluation fails |

judges report

Run a full project-level tribunal on a local directory.

| Flag | Description |
| --- | --- |
| positional | Directory path (default: .) |
| `--format <fmt>` | Output format: text, json, html, markdown |
| `--output <path>` | Write report to file |
| `--max-files <n>` | Maximum files to analyze (default: 600) |
| `--max-file-bytes <n>` | Skip files larger than this (default: 300000) |

judges hook

Manage a Git pre-commit hook that runs Judges on staged files.

bash
judges hook install    # add pre-commit hook
judges hook uninstall  # remove pre-commit hook

Detects Husky (.husky/pre-commit) and falls back to .git/hooks/pre-commit. Uses marker-based injection so it won't clobber existing hooks.

judges diff

Evaluate only the changed lines from a unified diff (e.g., git diff output).

| Flag | Description |
| --- | --- |
| `--file <path>` | Read diff from file instead of stdin |
| `--format <fmt>` | Output format: text, json, sarif, junit, codeclimate |
| `--output <path>` | Write output to file |

bash
git diff HEAD~1 | judges diff
judges diff --file changes.patch --format sarif

judges deps

Analyze project dependencies for supply-chain risks.

| Flag | Description |
| --- | --- |
| `--path <dir>` | Project root to scan (default: .) |
| `--format <fmt>` | Output format: text, json |

bash
judges deps --path .
judges deps --path ./backend --format json

judges baseline

Create a baseline file to suppress known findings in future evaluations.

bash
judges baseline create --file src/api.ts
judges baseline create --file src/api.ts -o .judges-baseline.json

judges ci-templates

Generate CI/CD configuration templates for popular providers.

bash
judges ci-templates --provider github   # .github/workflows/judges.yml
judges ci-templates --provider gitlab   # .gitlab-ci.judges.yml
judges ci-templates --provider azure    # azure-pipelines.judges.yml
judges ci-templates --provider bitbucket # bitbucket-pipelines.yml (snippet)

judges docs

Generate per-judge rule documentation in Markdown.

| Flag | Description |
| --- | --- |
| `--judge <id>` | Generate docs for a single judge |
| `--output <dir>` | Write individual .md files per judge |

bash
judges docs                          # all judges to stdout
judges docs --judge cybersecurity    # single judge
judges docs --output docs/judges/    # write files to directory

judges completions

Generate shell completion scripts.

bash
eval "$(judges completions bash)"        # Bash
eval "$(judges completions zsh)"         # Zsh
judges completions fish | source         # Fish
judges completions powershell            # PowerShell (Register-ArgumentCompleter)

Named Presets

Use --preset to apply pre-configured evaluation settings:

| Preset | Description |
| --- | --- |
| strict | All severities, all judges — maximum thoroughness |
| lenient | Only high and critical findings — fast and focused |
| security-only | Security-focused — disables non-security judges (cost, scalability, docs, a11y, i18n, UX, etc.) |
| startup | Skip compliance, sovereignty, i18n judges — move fast |
| compliance | Only compliance, data-sovereignty, authentication — regulatory focus |
| performance | Only performance, scalability, caching, cost-effectiveness |
| react | Tuned for React/Next.js apps — enables accessibility, XSS protection |
| express | Tuned for Express.js APIs — middleware security, auth, CORS, rate limiting |
| fastapi | Tuned for Python FastAPI — input validation, async patterns, API security |
| django | Tuned for Django apps — template security, ORM misuse, CSRF |
| spring-boot | Tuned for Java Spring Boot — injection, configuration, actuator security |
| rails | Tuned for Ruby on Rails — mass assignment, CSRF, SQL injection |
| nextjs | Tuned for Next.js — server/client security, API routes, SSR/ISR |
| terraform | Tuned for Terraform/OpenTofu IaC — infrastructure security, compliance |
| kubernetes | Tuned for K8s manifests — security contexts, RBAC, resource limits |
| onboarding | Smart defaults for first-time adoption — suppresses noisy rules |
| fintech | Financial services — PCI DSS, cryptography, authentication, audit |
| healthtech | Healthcare — HIPAA compliance, data sovereignty, encryption, audit trails |
| saas | Multi-tenant SaaS — tenant isolation, rate limiting, scalability |
| government | Government/public sector — compliance, sovereignty, authentication |
| open-source | Open-source projects — documentation, backwards compatibility, security, dependency health |
| ai-review | AI-generated code review — hallucination detection, security, authentication, correctness |

bash
judges eval --preset security-only src/api.ts
judges eval --preset strict --format sarif src/app.ts > results.sarif

CI Output Formats

JUnit XML

Generate JUnit XML for Jenkins, Azure DevOps, GitHub Actions, or GitLab test result viewers:

bash
judges eval --format junit src/api.ts > results.xml

Each judge maps to a <testsuite>, each finding becomes a <testcase> with <failure> for critical/high severity.
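As a rough illustration of that mapping (the rule IDs, messages, and exact attributes below are invented for this example; the CLI's real output may differ):

```xml
<testsuites>
  <testsuite name="Judge Cybersecurity" tests="2" failures="1">
    <testcase name="CYBER-004: SQL injection risk">
      <failure message="critical: SQL injection risk at src/api.ts:42"/>
    </testcase>
    <testcase name="CYBER-010: CORS configuration"/>
  </testsuite>
</testsuites>
```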

CodeClimate / GitLab Code Quality

Generate CodeClimate JSON for GitLab Code Quality or similar tools:

bash
judges eval --format codeclimate src/api.ts > codequality.json

Score Badges

Generate SVG or text badges for your README:

typescript
import { generateBadgeSvg, generateBadgeText } from "@kevinrabun/judges/badge";

const svg = generateBadgeSvg(85);          // shields.io-style SVG
const text = generateBadgeText(85);        // "✓ judges 85/100"
const svg2 = generateBadgeSvg(75, "quality"); // custom label

The Judge Panel

<!-- JUDGES_TABLE_START -->
| Judge | Domain | Rule Prefix | What It Evaluates |
| --- | --- | --- | --- |
| Data Security | Data Security & Privacy | DATA- | Encryption, PII handling, secrets management, access controls |
| Cybersecurity | Cybersecurity & Threat Defense | CYBER- | Injection attacks, XSS, CSRF, auth flaws, OWASP Top 10 |
| Cost Effectiveness | Cost Optimization & Resource Efficiency | COST- | Algorithm efficiency, N+1 queries, memory waste, caching strategy |
| Scalability | Scalability & Performance | SCALE- | Statelessness, horizontal scaling, concurrency, bottlenecks |
| Cloud Readiness | Cloud-Native Architecture & DevOps | CLOUD- | 12-Factor compliance, containerization, graceful shutdown, IaC |
| Software Practices | Software Engineering Best Practices & Secure SDLC | SWDEV- | SOLID principles, type safety, error handling, input validation |
| Accessibility | Accessibility (a11y) | A11Y- | WCAG compliance, screen reader support, keyboard navigation, ARIA |
| API Design | API Design & Contracts | API- | REST conventions, versioning, pagination, error responses |
| Reliability | Reliability & Resilience | REL- | Error handling, timeouts, retries, circuit breakers |
| Observability | Monitoring & Diagnostics | OBS- | Structured logging, health checks, metrics, tracing |
| Performance | Runtime Performance | PERF- | N+1 queries, sync I/O, caching, memory leaks |
| Compliance | Regulatory & License Compliance | COMP- | GDPR/CCPA, PII protection, consent, data retention, audit trails |
| Data Sovereignty | Data, Technological & Operational Sovereignty | SOV- | Data residency, cross-border transfers, vendor key management, AI model portability, identity federation, circuit breakers, audit trails, data export |
| Testing | Test Quality & Coverage | TEST- | Test coverage, assertions, test isolation, naming |
| Documentation | Documentation & Developer Experience | DOC- | JSDoc/docstrings, magic numbers, TODOs, code comments |
| Internationalization | i18n & Localization | I18N- | Hardcoded strings, locale handling, currency formatting |
| Dependency Health | Supply Chain & Dependencies | DEPS- | Version pinning, deprecated packages, supply chain |
| Concurrency | Concurrency & Thread Safety | CONC- | Race conditions, unbounded parallelism, missing await |
| Ethics & Bias | AI/ML Fairness & Ethics | ETHICS- | Demographic logic, dark patterns, inclusive language |
| Maintainability | Code Maintainability & Technical Debt | MAINT- | Any types, magic numbers, deep nesting, dead code, file length |
| Error Handling | Error Handling & Fault Tolerance | ERR- | Empty catch blocks, missing error handlers, swallowed errors |
| Authentication | Authentication & Authorization | AUTH- | Hardcoded creds, missing auth middleware, token in query params |
| Database | Database Design & Query Efficiency | DB- | SQL injection, N+1 queries, connection pooling, transactions |
| Caching | Caching Strategy & Data Freshness | CACHE- | Unbounded caches, missing TTL, no HTTP cache headers |
| Configuration Management | Configuration & Secrets Management | CFG- | Hardcoded secrets, missing env vars, config validation |
| Backwards Compatibility | Backwards Compatibility & Versioning | COMPAT- | API versioning, breaking changes, response consistency |
| Portability | Platform Portability & Vendor Independence | PORTA- | OS-specific paths, vendor lock-in, hardcoded hosts |
| UX | User Experience & Interface Quality | UX- | Loading states, error messages, pagination, destructive actions |
| Logging Privacy | Logging Privacy & Data Redaction | LOGPRIV- | PII in logs, token logging, structured logging, redaction |
| Rate Limiting | Rate Limiting & Throttling | RATE- | Missing rate limits, unbounded queries, backoff strategy |
| CI/CD | CI/CD Pipeline & Deployment Safety | CICD- | Test infrastructure, lint config, Docker tags, build scripts |
| Code Structure | Structural Analysis | STRUCT- | Cyclomatic complexity, nesting depth, function length, dead code, type safety |
| Agent Instructions | Agent Instruction Markdown Quality & Safety | AGENT- | Instruction hierarchy, conflict detection, unsafe overrides, scope, validation, policy guidance |
| AI Code Safety | AI-Generated Code Quality & Security | AICS- | Prompt injection, insecure LLM output handling, debug defaults, missing validation, unsafe deserialization of AI responses |
| Framework Safety | Framework-Specific Security & Best Practices | FW- | React hooks ordering, Express middleware chains, Next.js SSR/SSG pitfalls, Angular/Vue lifecycle patterns, Django/Flask/FastAPI safety, Spring Boot security, ASP.NET Core auth & CORS, Go Gin/Echo/Fiber patterns |
| IaC Security | Infrastructure as Code | IAC- | Terraform, Bicep, ARM template misconfigurations, hardcoded secrets, missing encryption, overly permissive network/IAM rules |
| Security | General Security Posture | SEC- | Holistic security assessment — insecure data flows, weak cryptography, unsafe deserialization |
| Hallucination Detection | AI-Hallucinated API & Import Validation | HALLU- | Detects hallucinated APIs, fabricated imports, and non-existent modules from AI code generators |
| Intent Alignment | Code–Comment Alignment & Stub Detection | INTENT- | Detects mismatches between stated intent and implementation, placeholder stubs, TODO-only functions |
| API Contract Conformance | API Design & REST Best Practices | API- | API endpoint input validation, REST conformance, request/response contract consistency |
| Multi-Turn Coherence | Code Coherence & Consistency | COH- | Self-contradicting patterns, duplicate definitions, dead code, inconsistent naming |
| Model Fingerprint Detection | AI Code Provenance & Model Attribution | MFPR- | Detects stylistic fingerprints characteristic of specific AI code generators |
| Over-Engineering | Simplicity & Pragmatism | OVER- | Unnecessary abstractions, wrapper-mania, premature generalization, over-complex patterns |
| Logic Review | Semantic Correctness & Logic Integrity | LOGIC- | Inverted conditions, dead code, name-body mismatch, off-by-one, incomplete control flow |
| False-Positive Review | False Positive Detection & Finding Accuracy | FPR- | Meta-judge reviewing pattern-based findings for false positives: string literal context, comment/docstring matches, test scaffolding, IaC template gating |
<!-- JUDGES_TABLE_END -->

How It Works

The tribunal operates in three layers:

  1. Pattern-Based Analysis — All tools (evaluate_code, evaluate_code_single_judge, evaluate_project, evaluate_diff) perform heuristic analysis using regex pattern matching to catch common anti-patterns. This layer is instant, deterministic, and runs entirely offline with zero external API calls.

  2. AST-Based Structural Analysis — The Code Structure judge (STRUCT-* rules) uses real Abstract Syntax Tree parsing to measure cyclomatic complexity, nesting depth, function length, parameter count, dead code, and type safety with precision that regex cannot achieve. All supported languages — TypeScript, JavaScript, Python, Rust, Go, Java, C#, and C++ — are parsed via tree-sitter WASM grammars (real syntax trees compiled to WebAssembly, in-process, zero native dependencies). A scope-tracking structural parser is kept as a fallback when WASM grammars are unavailable. No external AST server required.

  3. LLM-Powered Deep Analysis (Prompts) — The server exposes MCP prompts (e.g., judge-data-security, judge-cybersecurity) that provide each judge's expert persona as a system prompt. When used by an LLM-based client (Copilot, Claude, Cursor, etc.), the host LLM performs deeper, context-aware probabilistic analysis beyond what static patterns can detect. This is where the systemPrompt on each judge comes alive — Judges itself makes no LLM calls, but it provides the expert criteria so your AI assistant can act as 45 specialized reviewers.
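To make the layered model concrete, here is a toy version of the kind of structural metric layer 2 computes: a brace-counting nesting-depth estimator. This sketch is purely illustrative; the real Code Structure judge walks tree-sitter syntax trees, so it is not fooled by braces inside strings or comments.

```typescript
// Toy nesting-depth metric: tracks peak brace depth in a source string.
// Illustrative only -- the real STRUCT analysis uses full syntax trees.
function maxNestingDepth(code: string): number {
  let depth = 0;
  let peak = 0;
  for (const ch of code) {
    if (ch === "{") {
      depth += 1;
      if (depth > peak) peak = depth;
    } else if (ch === "}") {
      depth = Math.max(0, depth - 1);
    }
  }
  return peak;
}
```

A syntax-tree walk computes the same peak per function, which is what lets STRUCT-002 flag nesting deeper than 4 levels.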


Composable by Design

Judges Panel is a dual-layer review system: instant deterministic tools (offline, no API keys) for pattern and AST analysis, plus 45 expert-persona MCP prompts that unlock LLM-powered deep analysis when connected to an AI client. It does not try to be a CVE scanner or a linter. Those capabilities belong in dedicated MCP servers that an AI agent can orchestrate alongside Judges.

Built-in AST Analysis

Unlike earlier versions that recommended a separate AST MCP server, Judges Panel now includes real AST-based structural analysis out of the box:

  • TypeScript, JavaScript, Python, Rust, Go, Java, C#, C++ — All parsed with a unified tree-sitter WASM engine for full syntax-tree analysis (functions, complexity, nesting, dead code, type safety). Falls back to a scope-tracking structural parser when WASM grammars are unavailable

The Code Structure judge (STRUCT-*) uses these parsers to accurately measure:

| Rule | Metric | Threshold |
| --- | --- | --- |
| STRUCT-001 | Cyclomatic complexity | > 10 per function (high) |
| STRUCT-002 | Nesting depth | > 4 levels (medium) |
| STRUCT-003 | Function length | > 50 lines (medium) |
| STRUCT-004 | Parameter count | > 5 parameters (medium) |
| STRUCT-005 | Dead code | Unreachable statements (low) |
| STRUCT-006 | Weak types | `any`, `dynamic`, `Object`, `interface{}`, `unsafe` (medium) |
| STRUCT-007 | File complexity | > 40 total cyclomatic complexity (high) |
| STRUCT-008 | Extreme complexity | > 20 per function (critical) |
| STRUCT-009 | Extreme parameters | > 8 parameters (high) |
| STRUCT-010 | Extreme function length | > 150 lines (high) |
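For instance, a hypothetical function like the one below (not taken from the test suite) would trip STRUCT-002, since its innermost statement sits five levels deep:

```typescript
// Illustrative only: five nested ifs — one past the 4-level STRUCT-002 threshold.
function classify(n: number): string {
  if (n > 0) {                 // level 1
    if (n % 2 === 0) {         // level 2
      if (n % 3 === 0) {       // level 3
        if (n % 5 === 0) {     // level 4
          if (n % 7 === 0) {   // level 5 — exceeds the threshold
            return "divisible by 2, 3, 5, and 7";
          }
          return "divisible by 2, 3, and 5";
        }
      }
    }
  }
  return "other";
}
```

Flattening this into early returns or a divisor loop would bring the nesting depth back under the limit.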

Recommended MCP Stack

When your AI coding assistant connects to multiple MCP servers, each one contributes its specialty:

```text
┌─────────────────────────────────────────────────────────┐
│                   AI Coding Assistant                   │
│              (Claude, Copilot, Cursor, etc.)            │
└──────┬──────────────────┬──────────┬───────────────────┘
       │                  │          │
       ▼                  ▼          ▼
  ┌──────────────┐  ┌────────┐  ┌────────┐
  │   Judges     │  │  CVE / │  │ Linter │
  │   Panel      │  │  SBOM  │  │ Server │
  │ ─────────────│  └────────┘  └────────┘
  │ 44 Heuristic │   Vuln DB     Style &
  │   judges     │   scanning    correctness
  │ + AST judge  │
  └──────────────┘
   Patterns +
   structural
   analysis
```

| Layer | What It Does | Example Servers |
| --- | --- | --- |
| Judges Panel | 45-judge quality gate — security patterns, AST analysis, cost, scalability, a11y, compliance, sovereignty, ethics, dependency health, agent instruction governance, AI code safety, framework safety | This server |
| CVE / SBOM | Vulnerability scanning against live databases — known CVEs, license risks, supply chain | OSV, Snyk, Trivy, Grype MCP servers |
| Linting | Language-specific style and correctness rules | ESLint, Ruff, Clippy MCP servers |
| Runtime Profiling | Memory, CPU, latency measurement on running code | Custom profiling MCP servers |

What This Means in Practice

When you ask your AI assistant "Is this code production-ready?", the agent can:

  1. Judges Panel → Scan for hardcoded secrets, missing error handling, N+1 queries, accessibility gaps, compliance issues, plus analyze cyclomatic complexity, detect dead code, and flag deeply nested functions via AST
  2. CVE Server → Check every dependency in package.json against known vulnerabilities
  3. Linter Server → Enforce team style rules, catch language-specific gotchas

Each server returns structured findings. The AI synthesizes everything into a single, actionable review — no single server needs to do it all.


MCP Tools

evaluate_v2

Run a V2 context-aware tribunal evaluation designed to raise feedback quality toward the level of a lead engineer's or architect's review:

  • Policy profile calibration (default, startup, regulated, healthcare, fintech, public-sector)
  • Context ingestion (architecture notes, constraints, standards, known risks, data-boundary model)
  • Runtime evidence hooks (tests, coverage, latency, error rate, vulnerability counts)
  • Specialty feedback aggregation by judge/domain
  • Confidence scoring and explicit uncertainty reporting

Supports:

  • Code mode: code + language
  • Project mode: files[]
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `code` | string | conditional | Source code for single-file mode |
| `language` | string | conditional | Programming language for single-file mode |
| `files` | array | conditional | `{ path, content, language }[]` for project mode |
| `context` | string | no | High-level review context |
| `includeAstFindings` | boolean | no | Include AST/code-structure findings (default: true) |
| `minConfidence` | number | no | Minimum finding confidence to include (0-1, default: 0) |
| `policyProfile` | enum | no | default, startup, regulated, healthcare, fintech, public-sector |
| `evaluationContext` | object | no | Structured architecture/constraint context |
| `evidence` | object | no | Runtime/operational evidence for confidence calibration |
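A minimal code-mode invocation might look like this (the code and argument values are illustrative):

```json
{
  "tool": "evaluate_v2",
  "arguments": {
    "code": "const token = 'sk-12345';",
    "language": "typescript",
    "policyProfile": "fintech",
    "minConfidence": 0.5
  }
}
```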

evaluate_app_builder_flow

Run a 3-step app-builder workflow for technical and non-technical stakeholders:

  1. Tribunal review (code/project/diff)
  2. Plain-language translation of top risks
  3. Prioritized remediation tasks with AI-fixable P0/P1 extraction

Supports:

  • Code mode: code + language
  • Project mode: files[]
  • Diff mode: code + language + changedLines[]
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `code` | string | conditional | Full source content (code/diff mode) |
| `language` | string | conditional | Programming language (code/diff mode) |
| `files` | array | conditional | `{ path, content, language }[]` for project mode |
| `changedLines` | number[] | no | 1-based changed lines for diff mode |
| `context` | string | no | Optional business/technical context |
| `maxFindings` | number | no | Max translated top findings (default: 10) |
| `maxTasks` | number | no | Max generated tasks (default: 20) |
| `includeAstFindings` | boolean | no | Include AST/code-structure findings (default: true) |
| `minConfidence` | number | no | Minimum finding confidence to include (0-1, default: 0) |
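A diff-mode call reviewing two changed lines could look like this (values illustrative; the file content is elided):

```json
{
  "tool": "evaluate_app_builder_flow",
  "arguments": {
    "code": "...full post-change file content...",
    "language": "typescript",
    "changedLines": [12, 13],
    "maxFindings": 5
  }
}
```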

evaluate_public_repo_report

Clone a public repository URL, run the full judges panel across eligible source files, and generate a consolidated markdown report.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `repoUrl` | string | yes | Public repository URL (https://...) |
| `branch` | string | no | Optional branch name |
| `outputPath` | string | no | Optional path to write report markdown |
| `maxFiles` | number | no | Max files analyzed (default: 600) |
| `maxFileBytes` | number | no | Max file size in bytes (default: 300000) |
| `maxFindingsInReport` | number | no | Max detailed findings in output (default: 150) |
| `credentialMode` | string | no | Credential detection mode: standard (default) or strict |
| `includeAstFindings` | boolean | no | Include AST/code-structure findings (default: true) |
| `minConfidence` | number | no | Minimum finding confidence to include (0-1, default: 0) |
| `enableMustFixGate` | boolean | no | Enable must-fix gate summary for high-confidence dangerous findings (default: false) |
| `mustFixMinConfidence` | number | no | Confidence threshold for must-fix gate triggers (0-1, default: 0.85) |
| `mustFixDangerousRulePrefixes` | string[] | no | Optional dangerous rule prefixes for gate matching (e.g., AUTH, CYBER, DATA) |
| `keepClone` | boolean | no | Keep cloned repo on disk for inspection |

Quick examples

Generate a report from CLI:

```bash
npm run report:public-repo -- --repoUrl https://github.com/microsoft/vscode --output reports/vscode-judges-report.md

# stricter credential-signal mode (optional)
npm run report:public-repo -- --repoUrl https://github.com/openclaw/openclaw --credentialMode strict --output reports/openclaw-judges-report-strict.md

# judge findings only (exclude AST/code-structure findings)
npm run report:public-repo -- --repoUrl https://github.com/openclaw/openclaw --includeAstFindings false --output reports/openclaw-judges-report-no-ast.md

# show only findings at 80%+ confidence
npm run report:public-repo -- --repoUrl https://github.com/openclaw/openclaw --minConfidence 0.8 --output reports/openclaw-judges-report-high-confidence.md

# include must-fix gate summary in the generated report
npm run report:public-repo -- --repoUrl https://github.com/openclaw/openclaw --enableMustFixGate true --mustFixMinConfidence 0.9 --mustFixDangerousPrefix AUTH --mustFixDangerousPrefix CYBER --output reports/openclaw-judges-report-mustfix.md

# opinionated quick-start mode (recommended first run)
npm run report:quickstart -- --repoUrl https://github.com/openclaw/openclaw --output reports/openclaw-quickstart.md
```

Call from MCP client:

```json
{
  "tool": "evaluate_public_repo_report",
  "arguments": {
    "repoUrl": "https://github.com/microsoft/vscode",
    "branch": "main",
    "maxFiles": 400,
    "maxFindingsInReport": 120,
    "credentialMode": "strict",
    "includeAstFindings": false,
    "minConfidence": 0.8,
    "enableMustFixGate": true,
    "mustFixMinConfidence": 0.9,
    "mustFixDangerousRulePrefixes": ["AUTH", "CYBER", "DATA"],
    "outputPath": "reports/vscode-judges-report.md"
  }
}
```

Typical response summary includes:

  • overall verdict and average score
  • analyzed file count and total findings
  • per-judge score table
  • highest-risk findings and lowest-scoring files

Sample report snippet:

```text
# Public Repository Full Judges Report

Generated from https://github.com/microsoft/vscode on 2026-02-21T12:00:00.000Z.

## Executive Summary
- Overall verdict: WARNING
- Average file score: 78/100
- Total findings: 412 (critical 3, high 29, medium 114, low 185, info 81)
```

get_judges

List all available judges with their domains and descriptions.

evaluate_code

Submit code to the full judges panel. All 45 judges evaluate independently and return a combined verdict.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `code` | string | yes | The source code to evaluate |
| `language` | string | yes | Programming language (e.g., typescript, python) |
| `context` | string | no | Additional context about the code |
| `includeAstFindings` | boolean | no | Include AST/code-structure findings (default: true) |
| `minConfidence` | number | no | Minimum finding confidence to include (0-1, default: 0) |
| `config` | object | no | Inline configuration (see Configuration) |

evaluate_code_single_judge

Submit code to a specific judge for targeted review.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `code` | string | yes | The source code to evaluate |
| `language` | string | yes | Programming language |
| `judgeId` | string | yes | See judge IDs below |
| `context` | string | no | Additional context |
| `minConfidence` | number | no | Minimum finding confidence to include (0-1, default: 0) |
| `config` | object | no | Inline configuration (see Configuration) |

evaluate_project

Submit multiple files for project-level analysis. All 45 judges evaluate each file, and cross-file architectural analysis detects code duplication, inconsistent error handling, and dependency cycles.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `files` | array | yes | Array of `{ path, content, language }` objects |
| `context` | string | no | Optional project context |
| `includeAstFindings` | boolean | no | Include AST/code-structure findings (default: true) |
| `minConfidence` | number | no | Minimum finding confidence to include (0-1, default: 0) |
| `config` | object | no | Inline configuration (see Configuration) |

evaluate_diff

Evaluate only the changed lines in a code diff. Runs all 45 judges on the full file but filters findings to lines you specify. Ideal for PR reviews and incremental analysis.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `code` | string | yes | The full file content (post-change) |
| `language` | string | yes | Programming language |
| `changedLines` | number[] | yes | 1-based line numbers that were changed |
| `context` | string | no | Optional context about the change |
| `includeAstFindings` | boolean | no | Include AST/code-structure findings (default: true) |
| `minConfidence` | number | no | Minimum finding confidence to include (0-1, default: 0) |
| `config` | object | no | Inline configuration (see Configuration) |

analyze_dependencies

Analyze a dependency manifest file for supply-chain risks, version pinning issues, typosquatting indicators, and dependency hygiene. Supports package.json, requirements.txt, Cargo.toml, go.mod, pom.xml, and .csproj files.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `manifest` | string | yes | Contents of the dependency manifest file |
| `manifestType` | string | yes | File type: package.json, requirements.txt, etc. |
| `context` | string | no | Optional context |
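A sketch of a call that would flag an unpinned wildcard dependency (manifest contents illustrative):

```json
{
  "tool": "analyze_dependencies",
  "arguments": {
    "manifest": "{ \"dependencies\": { \"left-pad\": \"*\" } }",
    "manifestType": "package.json"
  }
}
```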

evaluate_git_diff

Evaluate only changed lines from a git diff. Provide either repoPath for a live git diff or diffText for a pre-computed unified diff.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `repoPath` | string | conditional | Absolute path to the git repository |
| `base` | string | no | Git ref to diff against (default: HEAD~1) |
| `diffText` | string | conditional | Pre-computed unified diff text |
| `confidenceFilter` | number | no | Minimum confidence threshold for findings (0–1) |
| `autoTune` | boolean | no | Apply feedback-driven auto-tuning (default: false) |
| `maxPromptChars` | number | no | Max character budget for LLM prompts (default: 100000, 0 = unlimited) |
| `config` | object | no | Inline configuration |

re_evaluate_with_context

Re-run the tribunal with prior findings as context for iterative refinement. Supports dispute resolution, developer context injection, and focus-area filtering.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `code` | string | yes | Source code to re-evaluate |
| `language` | string | yes | Programming language |
| `disputedRuleIds` | string[] | no | Rule IDs the developer disputes as false positives |
| `acceptedRuleIds` | string[] | no | Rule IDs the developer accepts |
| `developerContext` | string | no | Free-form explanation of developer intent |
| `focusAreas` | string[] | no | Specific areas to focus on (e.g., ["security"]) |
| `confidenceFilter` | number | no | Minimum confidence threshold (default: 0.5) |
| `filePath` | string | no | File path for context-aware evaluation |
| `deepReview` | boolean | no | Include LLM deep-review prompt section |
| `relatedFiles` | array | no | Cross-file context `{ path, snippet, relationship? }[]` |
| `maxPromptChars` | number | no | Max character budget for LLM prompts (default: 100000, 0 = unlimited) |
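For example, disputing one rule while supplying developer intent (values illustrative; the code is elided):

```json
{
  "tool": "re_evaluate_with_context",
  "arguments": {
    "code": "...source under review...",
    "language": "typescript",
    "disputedRuleIds": ["SEC-003"],
    "developerContext": "Input is validated upstream by the API gateway",
    "focusAreas": ["security"]
  }
}
```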

Additional MCP Tools

| Tool | Description |
| --- | --- |
| `evaluate_file` | Read a file from disk and submit it to the full panel. Auto-detects language from extension. |
| `evaluate_code_streaming` | Streaming evaluation — returns per-judge results as each judge completes with running aggregates. |
| `evaluate_focused` | Run only specified judges. Use after an initial full evaluation to re-check specific areas. |
| `evaluate_batch` | Evaluate multiple code files in a single call. Returns per-file verdicts plus aggregate statistics. |
| `evaluate_then_fix` | Evaluate code and automatically generate fix patches for all findings with auto-fix support. |
| `evaluate_with_progress` | Evaluate with progress callbacks for long-running evaluations. |
| `evaluate_policy_aware` | Policy-aware evaluation with named profiles (startup, regulated, healthcare, fintech, public-sector). |
| `fix_code` | Evaluate code and apply all available auto-fix patches. Returns fixed code with applied/remaining summary. |
| `explain_finding` | Explain a finding in plain language with OWASP/CWE references, risk context, and remediation guidance. |
| `triage_finding` | Set triage status of a finding (accepted-risk, deferred, wont-fix, false-positive) with attribution. |
| `record_feedback` | Record user feedback (true-positive, false-positive, wont-fix) to calibrate confidence scores. |
| `get_finding_stats` | Finding lifecycle statistics: open, fixed, recurring, and triaged counts plus trends. |
| `get_suppression_analytics` | Analyze suppression patterns: FP rates by rule, suppression rates, auto-suppress candidates. |
| `list_triaged_findings` | List triaged findings, optionally filtered by triage status. |
| `benchmark_gate` | Run benchmarks against quality thresholds. Returns pass/fail with F1, precision, recall metrics. |
| `run_benchmark` | Run the full benchmark suite with per-judge, per-category, per-difficulty breakdowns. |
| `scaffold_judge` | Generate boilerplate files to add a new judge: definition, evaluator skeleton, and registration. |
| `scaffold_plugin` | Generate a starter plugin template with custom rules, judges, and lifecycle hooks. |
| `session_status` | Current evaluation session state: evaluation count, frameworks, verdict history, stability. |
| `list_files` | List files and directories in the workspace for project exploration. |
| `read_file` | Read file contents from the workspace. |

Judge IDs

data-security · cybersecurity · security · cost-effectiveness · scalability · cloud-readiness · software-practices · accessibility · api-design · api-contract · reliability · observability · performance · compliance · data-sovereignty · testing · documentation · internationalization · dependency-health · concurrency · ethics-bias · maintainability · error-handling · authentication · database · caching · configuration-management · backwards-compatibility · portability · ux · logging-privacy · rate-limiting · ci-cd · code-structure · agent-instructions · ai-code-safety · framework-safety · iac-security · hallucination-detection · intent-alignment · multi-turn-coherence · model-fingerprint · over-engineering · logic-review · false-positive-review


MCP Prompts

Each judge has a corresponding prompt for LLM-powered deep analysis:

<!-- PROMPTS_TABLE_START -->
| Prompt | Description |
| --- | --- |
| `judge-data-security` | Deep data security review |
| `judge-cybersecurity` | Deep cybersecurity review |
| `judge-cost-effectiveness` | Deep cost optimization review |
| `judge-scalability` | Deep scalability review |
| `judge-cloud-readiness` | Deep cloud readiness review |
| `judge-software-practices` | Deep software practices review |
| `judge-accessibility` | Deep accessibility/WCAG review |
| `judge-api-design` | Deep API design review |
| `judge-reliability` | Deep reliability & resilience review |
| `judge-observability` | Deep observability & monitoring review |
| `judge-performance` | Deep performance optimization review |
| `judge-compliance` | Deep regulatory compliance review |
| `judge-data-sovereignty` | Deep data, technological & operational sovereignty review |
| `judge-testing` | Deep testing quality review |
| `judge-documentation` | Deep documentation quality review |
| `judge-internationalization` | Deep i18n review |
| `judge-dependency-health` | Deep dependency health review |
| `judge-concurrency` | Deep concurrency & async safety review |
| `judge-ethics-bias` | Deep ethics & bias review |
| `judge-maintainability` | Deep maintainability & tech debt review |
| `judge-error-handling` | Deep error handling review |
| `judge-authentication` | Deep authentication & authorization review |
| `judge-database` | Deep database design & query review |
| `judge-caching` | Deep caching strategy review |
| `judge-configuration-management` | Deep configuration & secrets review |
| `judge-backwards-compatibility` | Deep backwards compatibility review |
| `judge-portability` | Deep platform portability review |
| `judge-ux` | Deep user experience review |
| `judge-logging-privacy` | Deep logging privacy review |
| `judge-rate-limiting` | Deep rate limiting review |
| `judge-ci-cd` | Deep CI/CD pipeline review |
| `judge-code-structure` | Deep AST-based structural analysis review |
| `judge-agent-instructions` | Deep review of agent instruction markdown quality and safety |
| `judge-ai-code-safety` | Deep review of AI-generated code risks: prompt injection, insecure LLM output handling, debug defaults, missing validation |
| `judge-framework-safety` | Deep review of framework-specific safety: React hooks, Express middleware, Next.js SSR/SSG, Angular/Vue, Django, Spring Boot, ASP.NET Core, Flask, FastAPI, Go frameworks |
| `judge-iac-security` | Deep review of infrastructure-as-code security: Terraform, Bicep, ARM template misconfigurations |
| `judge-security` | Deep holistic security posture review: insecure data flows, weak cryptography, unsafe deserialization |
| `judge-hallucination-detection` | Deep review of AI-hallucinated APIs, fabricated imports, non-existent modules |
| `judge-intent-alignment` | Deep review of code–comment alignment, stub detection, placeholder functions |
| `judge-api-contract` | Deep review of API contract conformance, input validation, REST best practices |
| `judge-multi-turn-coherence` | Deep review of code coherence: self-contradictions, duplicate definitions, dead code |
| `judge-model-fingerprint` | Deep review of AI code provenance and model attribution fingerprints |
| `judge-over-engineering` | Deep review of unnecessary abstractions, wrapper-mania, premature generalization |
| `judge-logic-review` | Deep review of logic correctness, semantic mismatches, and dead code in AI-generated code |
| `judge-false-positive-review` | Meta-judge review of pattern-based findings for false positive detection and accuracy |
<!-- PROMPTS_TABLE_END -->

Configuration

Create a .judgesrc.json (or .judgesrc) file in your project root to customize evaluation behavior. See .judgesrc.example.json for a copy-paste-ready template, or reference the JSON Schema for full IDE autocompletion.

```json
{
  "$schema": "https://github.com/KevinRabun/judges/blob/main/judgesrc.schema.json",
  "preset": "strict",
  "minSeverity": "medium",
  "disabledRules": ["COST-*", "I18N-001"],
  "disabledJudges": ["accessibility", "ethics-bias"],
  "ruleOverrides": {
    "SEC-003": { "severity": "critical" },
    "DOC-*": { "disabled": true }
  },
  "languages": ["typescript", "python"],
  "format": "text",
  "failOnFindings": false,
  "baseline": "",
  "regulatoryScope": ["GDPR", "PCI-DSS", "SOC2"],
  "consensusThreshold": 0.7
}
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `$schema` | string | — | JSON Schema URL for IDE validation |
| `preset` | string | — | Named preset (see Named Presets for all 22 options) |
| `minSeverity` | string | "info" | Minimum severity to report: critical · high · medium · low · info |
| `disabledRules` | string[] | [] | Rule IDs or prefix wildcards to suppress (e.g. "COST-*", "SEC-003") |
| `disabledJudges` | string[] | [] | Judge IDs to skip entirely (e.g. "cost-effectiveness") |
| `ruleOverrides` | object | {} | Per-rule overrides keyed by rule ID or wildcard — `{ disabled?: boolean, severity?: string }` |
| `languages` | string[] | [] | Restrict analysis to specific languages (empty = all) |
| `format` | string | "text" | Default output format: text · json · sarif · markdown · html · pdf · junit · codeclimate · github-actions |
| `failOnFindings` | boolean | false | Exit code 1 when verdict is fail — useful for CI gates |
| `baseline` | string | "" | Path to a baseline JSON file — matching findings are suppressed |
| `plugins` | string[] | [] | Plugin module specifiers (npm packages or relative paths) that export custom judges |
| `judgeWeights` | object | {} | Weighted importance per judge for aggregated scoring (e.g. `{ "cybersecurity": 2.0 }`) |
| `failOnScoreBelow` | number | — | Minimum score (0–100) for the run to pass; complements failOnFindings |
| `regulatoryScope` | string[] | — | Regulatory frameworks in scope (e.g. ["GDPR", "PCI-DSS"]). Findings citing ONLY out-of-scope frameworks are suppressed. Run `judges list --frameworks` for supported values. |
| `consensusThreshold` | number | — | Consensus suppression (0–1). If this fraction of judges report zero findings, minority findings are suppressed. Recommended: 0.7 for CI. |
| `escalationThreshold` | number | — | Confidence threshold (0–1) below which findings are flagged for human review |
| `overrides` | array | [] | Path-scoped config overrides (e.g. `[{ "files": "**/*.test.ts", "disabledJudges": ["documentation"] }]`) |
| `customRules` | array | [] | User-defined regex-based rules for business logic validation |

All evaluation tools (CLI and MCP) accept the same configuration fields via --config <path> or inline config parameter.


Advanced Features

Inline Suppressions

Suppress specific findings directly in source code using comment directives:

```typescript
const x = eval(input); // judges-ignore SEC-001
// judges-ignore-next-line CYBER-002
const y = dangerousOperation();
// judges-file-ignore DOC-*    ← suppress globally for this file
```

Supported comment styles: //, #, /* */. Supports comma-separated rule IDs and wildcards (*, SEC-*).

Auto-Fix Patches

Certain findings include machine-applicable patches in the patch field:

| Pattern | Auto-Fix |
| --- | --- |
| `new Buffer(x)` | `Buffer.from(x)` |
| `http://` URLs (non-localhost) | `https://` |
| `Math.random()` | `crypto.randomUUID()` |

Patches include oldText, newText, startLine, and endLine for automated application.
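Using the Buffer example above, a patch object could look like this (only the four fields named above are shown; line numbers and snippet are illustrative):

```json
{
  "startLine": 12,
  "endLine": 12,
  "oldText": "const buf = new Buffer(input);",
  "newText": "const buf = Buffer.from(input);"
}
```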

Cross-Evaluator Deduplication

When multiple judges flag the same issue (e.g., both Data Security and Cybersecurity detect SQL injection on line 15), findings are automatically deduplicated. The highest-severity finding wins, and the description is annotated with cross-references (e.g., "Also identified by: CYBER-003").

Human Focus Guide

Every tribunal evaluation includes a humanFocusGuide that categorizes findings into three buckets for human reviewers:

| Bucket | Description | When to use |
| --- | --- | --- |
| ✅ Trust | High-confidence (≥80%), evidence-backed findings with AST/taint confirmation | Act directly — these have strong automated evidence |
| 🔍 Verify | Lower-confidence or absence-based findings | Use your judgment — the issue may exist elsewhere in the project |
| 🔦 Blind Spots | Areas automated analysis cannot evaluate | Focus your manual review time here |

Blind spots are detected from code characteristics: complex branching logic, external service calls, financial calculations, PII handling, state machines, and complex regex. The guide appears in CLI text/markdown output, JSON/SARIF output, and GitHub Action step summaries.

Regulatory Scope

Configure which regulatory frameworks apply to your project in .judgesrc:

```json
{ "regulatoryScope": ["GDPR", "PCI-DSS", "SOC2"] }
```

Findings that cite ONLY out-of-scope frameworks are suppressed. Findings with no regulatory reference (general code quality) are always kept. Run judges list --frameworks to see all 17 supported frameworks (GDPR, CCPA, HIPAA, PCI-DSS, SOC2, SOX, COPPA, FedRAMP, NIST, ISO27001, ePrivacy, DORA, NIS2, EU-AI-Act, and more).

Self-Teaching Amendments

The LLM benchmark system auto-generates precision amendments for judges with high false-positive rates. Amendments are data-driven corrections injected into prompts that improve accuracy over successive benchmark runs.

The self-teaching loop:

  1. Run benchmark → analyzer identifies judges below 70% precision
  2. Generates targeted amendments (e.g., "Judge ERR: do not flag clean Express code with framework error middleware")
  3. Next benchmark run loads amendments → precision improves
  4. Run judges codify-amendments to bake amendments permanently into the distributed package

Taint Flow Analysis

The engine performs inter-procedural taint tracking to trace data from user-controlled sources (e.g., req.body, process.env) through transformations to security-sensitive sinks (e.g., eval(), exec(), SQL queries). Taint flows are used to boost confidence on true-positive findings and suppress false positives where sanitization is detected.
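As an illustration (this is not engine code), the kind of flow the tracker is built to trace is a user-controlled value that survives a transformation on its way toward a sink:

```typescript
// Illustrative source-to-sink flow. "userInput" stands in for a
// user-controlled source such as req.body; eval() would be the sink.
function buildCommand(userInput: string): string {
  const trimmed = userInput.trim();   // taint survives the transformation
  const command = `echo ${trimmed}`;  // still tainted after interpolation
  // eval(command);  // sink — flagged unless a sanitizer intervenes on the path
  return command;
}
```

If a recognized sanitizer is applied between source and sink, the same flow is used to suppress the finding instead.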

Positive Signal Detection

Code that demonstrates good practices receives score bonuses (capped at +15):

| Signal | Bonus |
| --- | --- |
| Parameterized queries | +3 |
| Security headers (helmet) | +3 |
| Auth middleware (passport, etc.) | +3 |
| Proper error handling | +2 |
| Input validation libs (zod, joi, etc.) | +2 |
| Rate limiting | +2 |
| Structured logging (pino, winston) | +2 |
| CORS configuration | +1 |
| Strict mode / strictNullChecks | +1 |
| Test patterns (describe/it/expect) | +1 |

Framework-Aware Rules

Judges include framework-specific detection for Express, Django, Flask, FastAPI, Spring, ASP.NET, Rails, and more. Framework middleware (e.g., helmet(), express-rate-limit, passport.authenticate()) is recognized as mitigation, reducing false positives.

Cross-File Import Resolution

In project-level analysis, imports are resolved across files. If one file imports a security middleware module from another file in the project, findings about missing security controls are automatically adjusted with reduced confidence.


Scoring

Each judge scores the code from 0 to 100:

| Severity | Score Deduction |
| --- | --- |
| Critical | −30 points |
| High | −18 points |
| Medium | −10 points |
| Low | −5 points |
| Info | −2 points |

Verdict logic:

  • FAIL — Any critical finding, or score < 60
  • WARNING — Any high finding, any medium finding, or score < 80
  • PASS — Score ≥ 80 with no critical, high, or medium findings

The overall tribunal score is the average of all 45 judges. The overall verdict fails if any judge fails.
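The per-judge deduction and verdict rules above can be sketched in TypeScript (flooring the score at 0 for heavily penalized files is an assumption, not documented behavior):

```typescript
type Severity = "critical" | "high" | "medium" | "low" | "info";

// Deductions per finding, as listed in the table above.
const DEDUCTIONS: Record<Severity, number> = {
  critical: 30,
  high: 18,
  medium: 10,
  low: 5,
  info: 2,
};

// Score a single judge: start at 100, subtract per finding (floor at 0 assumed).
function judgeScore(findings: Severity[]): number {
  const total = findings.reduce((sum, s) => sum + DEDUCTIONS[s], 0);
  return Math.max(0, 100 - total);
}

// Verdict per the rules above: critical or <60 fails; high/medium or <80 warns.
function verdict(findings: Severity[]): "pass" | "warning" | "fail" {
  const score = judgeScore(findings);
  if (findings.includes("critical") || score < 60) return "fail";
  if (findings.includes("high") || findings.includes("medium") || score < 80) {
    return "warning";
  }
  return "pass";
}
```

For example, one high and one low finding yields a score of 77 and a WARNING verdict.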


Project Structure

```text
judges/
├── src/
│   ├── index.ts              # MCP server entry point — tools, prompts, transport
│   ├── api.ts                # Programmatic API entry point
│   ├── cli.ts                # CLI argument parser and command router
│   ├── types.ts              # TypeScript interfaces (Finding, JudgeEvaluation, etc.)
│   ├── config.ts             # .judgesrc configuration parser and validation
│   ├── errors.ts             # Custom error types (ConfigError, EvaluationError, ParseError)
│   ├── language-patterns.ts  # Multi-language regex pattern constants and helpers
│   ├── judge-registry.ts     # Unified JudgeRegistry — single source of truth for all judges
│   ├── plugins.ts            # Plugin API façade (delegates to JudgeRegistry)
│   ├── scoring.ts            # Confidence scoring and calibration
│   ├── dedup.ts              # Finding deduplication engine
│   ├── fingerprint.ts        # Finding fingerprint generation
│   ├── comparison.ts         # Tool comparison benchmark data
│   ├── cache.ts              # Evaluation result caching
│   ├── calibration.ts        # Confidence calibration from feedback data
│   ├── fix-history.ts        # Auto-fix application history tracking
│   ├── ast/                  # AST analysis engine (built-in, no external deps)
│   │   ├── index.ts          # analyzeStructure() — routes to correct parser
│   │   ├── types.ts          # FunctionInfo, CodeStructure interfaces
│   │   ├── tree-sitter-ast.ts    # Tree-sitter WASM parser (all 8 languages)
│   │   ├── structural-parser.ts  # Fallback scope-tracking parser
│   │   ├── cross-file-taint.ts   # Cross-file taint propagation analysis
│   │   └── taint-tracker.ts      # Single-file taint flow tracking
│   ├── evaluators/           # Analysis engine for each judge
│   │   ├── index.ts          # evaluateWithJudge(), evaluateWithTribunal(), evaluateProject(), etc.
│   │   ├── shared.ts         # Scoring, verdict logic, markdown formatters
│   │   └── *.ts              # One analyzer per judge (45 files)
│   ├── formatters/           # Output formatters
│   │   ├── sarif.ts              # SARIF 2.1.0 output
│   │   ├── html.ts               # Self-contained HTML report (dark/light theme, filters)
│   │   ├── junit.ts              # JUnit XML output (Jenkins, Azure DevOps, GitHub Actions)
│   │   ├── codeclimate.ts        # CodeClimate/GitLab Code Quality JSON
│   │   ├── diagnostics.ts        # Diagnostics formatter
│   │   └── badge.ts              # SVG and text badge generator
│   ├── commands/             # CLI subcommands
│   │   ├── init.ts               # Interactive project setup wizard
│   │   ├── fix.ts                # Auto-fix patch preview and application
│   │   ├── watch.ts              # Watch mode — re-evaluate on save
│   │   ├── report.ts             # Project-level local report
│   │   ├── hook.ts               # Pre-commit hook install/uninstall
│   │   ├── ci-templates.ts       # GitLab, Azure, Bitbucket CI templates
│   │   ├── diff.ts               # Evaluate unified diff (git diff)
│   │   ├── deps.ts               # Dependency supply-chain analysis
│   │   ├── baseline.ts           # Create baseline for finding suppression
│   │   ├── completions.ts        # Shell completions (bash/zsh/fish/PowerShell)
│   │   ├── docs.ts               # Per-judge rule documentation generator
│   │   ├── feedback.ts           # False-positive tracking & finding feedback
│   │   ├── benchmark.ts          # Detection accuracy benchmark suite
│   │   ├── rule.ts               # Custom rule authoring wizard
│   │   ├── language-packs.ts     # Language-specific rule pack presets
│   │   └── config-share.ts       # Shareable team/org configuration
│   ├── presets.ts            # Named evaluation presets (strict, lenient, security-only, …)
│   ├── patches/
│   │   └── index.ts              # 201 deterministic auto-fix patch rules
│   ├── tools/                # MCP tool registrations
│   │   ├── register.ts           # Tool registration orchestrator
│   │   ├── register-evaluation.ts    # Evaluation tools (evaluate_code, etc.)
│   │   ├── register-workflow.ts      # Workflow tools (app builder, reports, etc.)
│   │   ├── prompts.ts            # MCP prompt registrations (per-judge prompts)
│   │   └── schemas.ts            # Zod schemas for tool parameters
│   ├── reports/
│   │   └── public-repo-report.ts   # Public repo clone + full tribunal report generation
│   └── judges/               # Judge definitions (id, name, domain, system prompt)
│       ├── index.ts          # Side-effect imports + re-exports (JUDGES, getJudge, getJudgeSummaries)
│       └── *.ts              # One self-registering definition per judge (45 files)
├── scripts/
│   ├── generate-public-repo-report.ts  # Run: npm run report:public-repo -- --repoUrl <url>
│   ├── daily-popular-repo-autofix.ts   # Run: npm run automation:daily-popular
│   └── debug-fp.ts                     # Debug false-positive findings
├── examples/
│   ├── sample-vulnerable-api.ts  # Intentionally flawed code (triggers all judges)
│   ├── demo.ts                   # Run: npm run demo
│   └── quickstart.ts             # Quick-start evaluation example
├── tests/
│   ├── judges.test.ts            # Core judge evaluation tests
│   ├── negative.test.ts          # Negative / FP-avoidance tests
│   ├── subsystems.test.ts        # Subsystem integration tests
│   ├── extension-logic.test.ts   # VS Code extension logic tests
│   └── tool-routing.test.ts      # MCP tool routing tests
├── grammars/                 # Tree-sitter WASM grammar files
│   ├── tree-sitter-typescript.wasm
│   ├── tree-sitter-cpp.wasm
│   ├── tree-sitter-python.wasm
│   ├── tree-sitter-go.wasm
│   ├── tree-sitter-rust.wasm
│   ├── tree-sitter-java.wasm
│   └── tree-sitter-c_sharp.wasm
├── judgesrc.schema.json      # JSON Schema for .judgesrc config files
├── server.json               # MCP Registry manifest
├── package.json
├── tsconfig.json
└── README.md
```

Scripts

| Command | Description |
| --- | --- |
| `npm run build` | Compile TypeScript to dist/ |
| `npm run dev` | Watch mode — recompile on save |
| `npm test` | Run the full test suite |
| `npm run demo` | Run the sample tribunal demo |
| `npm run report:public-repo -- --repoUrl <url>` | Generate a full tribunal report for a public repository URL |
| `npm run report:quickstart -- --repoUrl <url>` | Run opinionated high-signal report defaults for fast adoption |
| `npm run automation:daily-popular` | Analyze up to 10 rotating popular repos/day and open up to 5 remediation PRs per repo |
| `npm start` | Start the MCP server |
| `npm run clean` | Remove dist/ |
| `judges init` | Interactive project setup wizard |
| `judges fix <file>` | Preview auto-fix patches (add --apply to write) |
| `judges watch <dir>` | Watch mode — re-evaluate on file save |
| `judges report <dir>` | Full tribunal report on a local directory |
| `judges hook install` | Install a Git pre-commit hook |
| `judges diff` | Evaluate changed lines from unified diff |
| `judges deps` | Analyze dependencies for supply-chain risks |
| `judges baseline create` | Create baseline for finding suppression |
| `judges ci-templates` | Generate CI pipeline templates |
| `judges docs` | Generate per-judge rule documentation |
| `judges completions <shell>` | Shell completion scripts |
| `judges feedback submit` | Mark findings as true positive, false positive, or won't fix |
| `judges feedback stats` | Show false-positive rate statistics |
| `judges benchmark run` | Run detection accuracy benchmark suite |
| `judges rule create` | Interactive custom rule creation wizard |
| `judges rule list` | List custom evaluation rules |
| `judges pack list` | List available language packs |
| `judges config export` | Export config as shareable package |
| `judges config import <src>` | Import a shared configuration |
| `judges compare` | Compare judges against other code review tools |
| `judges list` | List all 45 judges with domains and descriptions |
| `judges list --frameworks` | List supported regulatory frameworks and .judgesrc usage |
| `judges codify-amendments` | Bake self-teaching amendments into judge source files |

## Daily Popular Repo Automation

This repo includes a scheduled workflow at `.github/workflows/daily-popular-repo-autofix.yml` that:

- selects up to 10 repositories per day from a default pool of 100+ popular repos (or a manually supplied target),
- runs the full Judges evaluation across supported source languages,
- applies only conservative, single-line remediations that reduce matching finding counts,
- opens up to 5 PRs per repository, with attribution to both Judges and the target repository,
- skips repositories unless they are public and PR creation is possible with existing GitHub auth (no additional auth flow),
- enforces hard runtime caps of 10 repositories/day and 5 PRs/repository.

Each run writes `daily-autofix-summary.json` (or `SUMMARY_PATH`) with per-repository telemetry, including:

- `runAggregate`: compact run-level totals and cross-repo top prioritized rules,
- `runAggregate.totalCandidatesDiscovered` and `runAggregate.totalCandidatesAfterLocationDedupe`: how much overlap was removed before attempting fixes,
- `runAggregate.totalCandidatesAfterPriorityThreshold`: candidates that remain after applying the minimum priority score,
- `runAggregate.dedupeReductionPercent`: percent reduction from location dedupe, for quick runtime-efficiency tracking,
- `runAggregate.priorityThresholdReductionPercent`: percent reduction from minimum-priority filtering after dedupe,
- `priorityRulePrefixesUsed`: dangerous rule prefixes used during prioritization,
- `minPriorityScoreUsed`: minimum `candidatePriorityScore` applied for candidate inclusion,
- `candidatesDiscovered`, `candidatesAfterLocationDedupe`, and `candidatesAfterPriorityThreshold`: per-repo candidate counts after each filter stage,
- `topPrioritizedRuleCounts`: most common rule IDs among ranked candidates,
- `topPrioritizedCandidates`: top-ranked candidate samples (rule, severity, confidence, file, line, priority score).
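The two reduction percentages in the summary can be derived from the stage counts. A minimal sketch follows; the field names come from the summary schema above, but the exact rounding and formula are assumptions, not the automation's actual implementation:

```typescript
// Derive the dedupe/priority reduction percentages reported in
// daily-autofix-summary.json from the three candidate-stage counts.
// Field names match the summary schema; the math itself is illustrative.
interface StageCounts {
  totalCandidatesDiscovered: number;
  totalCandidatesAfterLocationDedupe: number;
  totalCandidatesAfterPriorityThreshold: number;
}

// Percent of candidates removed between two stages, rounded to 2 decimals.
function reductionPercent(before: number, after: number): number {
  if (before === 0) return 0; // guard against empty runs
  return Math.round(((before - after) / before) * 10000) / 100;
}

function summarize(c: StageCounts) {
  return {
    dedupeReductionPercent: reductionPercent(
      c.totalCandidatesDiscovered,
      c.totalCandidatesAfterLocationDedupe,
    ),
    priorityThresholdReductionPercent: reductionPercent(
      c.totalCandidatesAfterLocationDedupe,
      c.totalCandidatesAfterPriorityThreshold,
    ),
  };
}

// e.g. 200 discovered, 150 after dedupe, 120 after the priority filter
const percentages = summarize({
  totalCandidatesDiscovered: 200,
  totalCandidatesAfterLocationDedupe: 150,
  totalCandidatesAfterPriorityThreshold: 120,
});
```

The second percentage is computed relative to the post-dedupe count (not the discovered count), matching the "after dedupe" wording above.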

Optional runtime control:

- `AUTOFIX_MIN_PRIORITY_SCORE`: minimum candidate priority score required after dedupe (default: `0`, disabled).

Required secret:

- `JUDGES_AUTOFIX_GH_TOKEN`: a GitHub token with permission to fork, push, and create PRs on target repositories.

Manual run:

```bash
gh workflow run "Judges Daily Full-Run Autofix PRs" -f targetRepoUrl=https://github.com/owner/repo
```

## Programmatic API

Judges can be consumed as a library (not just via MCP). Import from `@kevinrabun/judges/api`:

```typescript
import {
  evaluateCode,
  evaluateProject,
  evaluateCodeSingleJudge,
  getJudge,
  JUDGES,
  findingsToSarif,
} from "@kevinrabun/judges/api";

// Full tribunal evaluation
const verdict = evaluateCode("const x = eval(input);", "typescript");
console.log(verdict.overallScore, verdict.overallVerdict);

// Single judge
const result = evaluateCodeSingleJudge("cybersecurity", code, "typescript");

// SARIF output for CI integration
const sarif = findingsToSarif(verdict.evaluations.flatMap(e => e.findings));
```

## Package Exports

| Entry Point | Description |
| --- | --- |
| `@kevinrabun/judges/api` | Programmatic API (default) |
| `@kevinrabun/judges/server` | MCP server entry point |
| `@kevinrabun/judges/sarif` | SARIF 2.1.0 formatter |
| `@kevinrabun/judges/junit` | JUnit XML formatter |
| `@kevinrabun/judges/codeclimate` | CodeClimate/GitLab Code Quality JSON |
| `@kevinrabun/judges/badge` | SVG and text badge generator |
| `@kevinrabun/judges/diagnostics` | Diagnostics formatter |
| `@kevinrabun/judges/plugins` | Plugin system API (see Plugin Guide) |
| `@kevinrabun/judges/fingerprint` | Finding fingerprint utilities |
| `@kevinrabun/judges/comparison` | Tool comparison benchmarks |
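To illustrate what a finding fingerprint is for (stable identity for a finding across runs, e.g. for baseline suppression), here is a hypothetical, self-contained sketch. The real `@kevinrabun/judges/fingerprint` API may hash different fields or use a different encoding; the `FindingLocation` shape and `fingerprint` helper below are assumptions:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: a fingerprint is a stable hash over the fields
// that identify a finding, so the same finding maps to the same ID
// across runs. Not the library's actual implementation.
interface FindingLocation {
  ruleId: string;
  file: string;
  line: number;
}

function fingerprint(f: FindingLocation): string {
  return createHash("sha256")
    // NUL separator avoids ambiguity between concatenated fields
    .update(`${f.ruleId}\u0000${f.file}\u0000${f.line}`)
    .digest("hex")
    .slice(0, 16); // short, collision-unlikely prefix
}
```

Same inputs always yield the same fingerprint, while changing any field (even just the line) yields a different one, which is what makes baselines and suppression lists stable.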

## SARIF Output

Convert findings to SARIF 2.1.0 for GitHub Code Scanning, Azure DevOps, and other CI/CD tools:

```typescript
import fs from "node:fs";
import { findingsToSarif, evaluationToSarif, verdictToSarif } from "@kevinrabun/judges/sarif";

const sarif = verdictToSarif(verdict, "src/app.ts");
fs.writeFileSync("results.sarif", JSON.stringify(sarif, null, 2));
```
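For readers unfamiliar with SARIF, here is a minimal, self-contained sketch of the 2.1.0 shape that tools like GitHub Code Scanning consume. The `SimpleFinding` type and `toSarif` helper are illustrative assumptions; the formatter in `@kevinrabun/judges/sarif` emits richer metadata:

```typescript
// Minimal SARIF 2.1.0 document: a `runs` array, each run with a tool
// driver and a list of results pointing at file/line locations.
interface SimpleFinding {
  ruleId: string;
  message: string;
  file: string;
  line: number;
  level: "error" | "warning" | "note";
}

function toSarif(findings: SimpleFinding[]) {
  return {
    $schema: "https://json.schemastore.org/sarif-2.1.0.json",
    version: "2.1.0",
    runs: [
      {
        tool: { driver: { name: "judges", rules: [] } },
        results: findings.map((f) => ({
          ruleId: f.ruleId,
          level: f.level,
          message: { text: f.message },
          locations: [
            {
              physicalLocation: {
                artifactLocation: { uri: f.file },
                region: { startLine: f.line },
              },
            },
          ],
        })),
      },
    ],
  };
}
```

Writing this JSON to a `.sarif` file and uploading it (e.g. with GitHub's `upload-sarif` action) is enough to surface findings as code-scanning alerts.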

## Custom Error Types

All thrown errors extend `JudgesError` with a machine-readable `code` property:

| Error Class | Code | When |
| --- | --- | --- |
| `ConfigError` | `JUDGES_CONFIG_INVALID` | Malformed `.judgesrc` or invalid inline config |
| `EvaluationError` | `JUDGES_EVALUATION_FAILED` | Unknown judge, analyzer crash |
| `ParseError` | `JUDGES_PARSE_FAILED` | Unparseable source code or input data |

```typescript
import { ConfigError, EvaluationError } from "@kevinrabun/judges/api";

try {
  evaluateCode(code, "typescript");
} catch (e) {
  if (e instanceof ConfigError) console.error("Config issue:", e.code);
}
```
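To show how such a hierarchy hangs together, here is a self-contained sketch mirroring the table above. It is not the library's source; only the class names and `code` values come from the documentation, the implementation details are assumptions:

```typescript
// Illustrative base class: every error carries a machine-readable `code`
// so callers can branch on it without string-matching messages.
class JudgesError extends Error {
  constructor(message: string, readonly code: string) {
    super(message);
    this.name = new.target.name; // "ConfigError", "EvaluationError", ...
  }
}

class ConfigError extends JudgesError {
  constructor(message: string) {
    super(message, "JUDGES_CONFIG_INVALID");
  }
}

const err = new ConfigError("malformed .judgesrc");
// err instanceof JudgesError is true, and err.code === "JUDGES_CONFIG_INVALID"
```

Because every subclass funnels through `JudgesError`, a single `instanceof JudgesError` check distinguishes the library's errors from unexpected ones, while `code` stays stable for programmatic handling.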

## License

MIT
