llm-judge
by anderskev
LLM-as-judge methodology for comparing code implementations across repositories. Scores implementations on functionality, security, test quality, overengineering, and dead code using weighted rubrics. Used by the /beagle:llm-judge command.
Installation
claude skill add --url github.com/openclaw/skills/tree/main/skills/anderskev/llm-judge
Documentation
LLM Judge Skill
Compare code implementations across 2+ repositories using structured evaluation.
Overview
This skill implements a two-phase LLM-as-judge evaluation:
- Phase 1: Fact Gathering - Parallel agents explore each repo and extract structured facts
- Phase 2: Judging - Parallel judges score each dimension using consistent rubrics
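The control flow of the two phases can be sketched as follows. This is only an illustration of the ordering and parallelism: `run_two_phase`, `gather_facts`, and `judge` are hypothetical names, and the snake_case dimension keys are assumed. In the skill itself the agents are spawned as Task agents, not Python callables.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Illustrative dimension keys; the skill's schemas may name them differently.
DIMENSIONS = [
    "functionality",
    "security",
    "test_quality",
    "overengineering",
    "dead_code",
]


def run_two_phase(
    repos: dict[str, str],
    gather_facts: Callable[[str, str], dict],
    judge: Callable[[str, dict], dict],
) -> dict[str, dict]:
    """Run Phase 1 for every repo, then Phase 2 for every dimension.

    gather_facts(label, path) and judge(dimension, all_facts) are hypothetical
    stand-ins for whatever actually spawns the agents.
    """
    with ThreadPoolExecutor() as pool:
        # Phase 1: one fact-gathering task per repo, submitted in parallel.
        fact_futures = {
            label: pool.submit(gather_facts, label, path)
            for label, path in repos.items()
        }
        all_facts = {label: f.result() for label, f in fact_futures.items()}

        # Phase 2 starts only after every Phase 1 result has been collected.
        judge_futures = {
            dim: pool.submit(judge, dim, all_facts) for dim in DIMENSIONS
        }
        return {dim: f.result() for dim, f in judge_futures.items()}
```

Every judge receives the facts for all repos at once, which is what lets a single judge score the repos against each other on its dimension.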
Reference Files
| File | Purpose |
|---|---|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
Scoring Dimensions
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
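The default weights can be held in a simple mapping. A minimal sketch, assuming hypothetical snake_case dimension keys and a hypothetical `validate_weights` helper (neither appears in the skill); it checks that a custom weight set covers all five dimensions and still sums to 100%.

```python
# Default weights from the table above, in percent.
# The snake_case keys are illustrative, not taken from the skill's schemas.
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "test_quality": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def validate_weights(weights: dict[str, float]) -> None:
    """Raise ValueError unless weights cover every dimension and sum to 100."""
    missing = set(DEFAULT_WEIGHTS) - set(weights)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(weights.values())
    if abs(total - 100) > 1e-9:
        raise ValueError(f"weights sum to {total}, expected 100")


validate_weights(DEFAULT_WEIGHTS)  # passes silently
```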
Scoring Scale
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:** Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.
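The `$NAME` placeholders in the prompt above happen to match the syntax of Python's `string.Template`, so filling them in per repo could look like the sketch below. The prompt is abbreviated, and the repo labels, paths, and spec text are made-up examples; how the skill actually performs the substitution is not stated in the source.

```python
from string import Template

# Abbreviated copy of the Phase 1 prompt; placeholders match the skill text.
REPO_AGENT_PROMPT = Template(
    "You are a Phase 1 Repo Agent for the LLM Judge evaluation.\n"
    "**Your Repo:** $REPO_LABEL at $REPO_PATH\n"
    "**Spec Document:**\n"
    "$SPEC_CONTENT\n"
)


def render_repo_prompts(repos: dict[str, str], spec: str) -> dict[str, str]:
    """Return one rendered prompt per repo label.

    substitute() raises KeyError if a placeholder is left unfilled, which
    catches template typos before any agent is spawned.
    """
    return {
        label: REPO_AGENT_PROMPT.substitute(
            REPO_LABEL=label, REPO_PATH=path, SPEC_CONTENT=spec
        )
        for label, path in repos.items()
    }


prompts = render_repo_prompts(
    {"repo-a": "/tmp/repo-a", "repo-b": "/tmp/repo-b"},  # hypothetical paths
    spec="Build a URL shortener.",  # hypothetical spec
)
print(prompts["repo-a"])
```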
Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents (one per dimension):
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:** Read @beagle:llm-judge references/judge-agents.md
Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
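Because judges are told to return only valid JSON, the caller can parse replies strictly and reject anything else. The sketch below assumes a reply shape of `{"scores": {"<repo>": <integer 1-5>}}`; that shape and the `parse_judge_reply` helper are assumptions for illustration, since the real judge output schema is defined in references/judge-agents.md.

```python
import json


def parse_judge_reply(reply: str, repo_labels: set[str]) -> dict[str, int]:
    """Parse a judge reply and return {repo_label: score}.

    ASSUMPTION: the reply looks like {"scores": {"<repo>": <1-5>, ...}}.
    The actual schema lives in references/judge-agents.md.
    """
    # json.loads raises json.JSONDecodeError (a ValueError) if the reply is
    # wrapped in markdown or contains prose around the JSON.
    data = json.loads(reply)
    scores = data["scores"]
    if set(scores) != repo_labels:
        raise ValueError(f"expected scores for exactly {sorted(repo_labels)}")
    for label, score in scores.items():
        # type() check rather than isinstance() so that True/False are rejected.
        if type(score) is not int or not 1 <= score <= 5:
            raise ValueError(f"{label}: score {score!r} is outside 1-5")
    return scores
```

Failing loudly here keeps a malformed judge reply from silently skewing the weighted totals later on.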
Aggregation
After Phase 2 completes:
- Collect scores from all 5 judges
- For each repo, compute weighted total:
  weighted_total = sum(score[dim] * weight[dim]) / 100
- Rank repos by weighted total (descending)
- Generate verdict explaining the ranking
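A worked example of the weighted total and ranking, using the default weights. The dimension keys, repo labels, and scores are invented for illustration. Since the weights are percentages summing to 100, dividing by 100 keeps the total on the same 1-5 scale as the individual scores.

```python
# Default weights in percent; snake_case keys are illustrative.
WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "test_quality": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def weighted_total(scores: dict[str, int], weights: dict[str, int] = WEIGHTS) -> float:
    """sum(score[dim] * weight[dim]) / 100, as in the aggregation step."""
    return sum(scores[dim] * weights[dim] for dim in weights) / 100


def rank_repos(all_scores: dict[str, dict[str, int]]) -> list[tuple[str, float]]:
    """Return (repo, weighted_total) pairs, highest total first."""
    totals = {repo: weighted_total(s) for repo, s in all_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical judge scores for two repos.
example_scores = {
    "repo-a": {"functionality": 4, "security": 3, "test_quality": 5,
               "overengineering": 4, "dead_code": 2},
    "repo-b": {"functionality": 5, "security": 4, "test_quality": 3,
               "overengineering": 3, "dead_code": 4},
}

print(rank_repos(example_scores))
# [('repo-b', 3.95), ('repo-a', 3.75)]
```

Here repo-a scores (120 + 75 + 100 + 60 + 20) / 100 = 3.75 and repo-b scores (150 + 100 + 60 + 45 + 40) / 100 = 3.95, so repo-b ranks first despite weaker test quality.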
Output
Write results to .beagle/llm-judge-report.json and display a markdown summary.
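One way the output step might look is sketched below. Only the report path comes from the text above; the `write_report` helper, the report fields (`ranking`, `verdict`), and the summary table layout are assumptions, as the source does not specify the report format.

```python
import json
import tempfile
from pathlib import Path


def write_report(ranking: list[tuple[str, float]], verdict: str, root=".") -> str:
    """Write the JSON report under root and return a markdown summary.

    ASSUMPTION: the "ranking" and "verdict" fields are illustrative only.
    """
    report_path = Path(root) / ".beagle" / "llm-judge-report.json"
    report_path.parent.mkdir(parents=True, exist_ok=True)
    report = {
        "ranking": [{"repo": repo, "weighted_total": total} for repo, total in ranking],
        "verdict": verdict,
    }
    report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")

    # Build the markdown summary as a small ranking table plus the verdict.
    lines = ["| Rank | Repo | Weighted Total |", "|---|---|---|"]
    for rank, (repo, total) in enumerate(ranking, start=1):
        lines.append(f"| {rank} | {repo} | {total:.2f} |")
    return "\n".join(lines) + f"\n\n{verdict}"


# Demo in a temporary directory so nothing is written to the working tree.
with tempfile.TemporaryDirectory() as tmp:
    summary = write_report(
        [("repo-b", 3.95), ("repo-a", 3.75)],  # hypothetical ranking
        "repo-b ranks first on functionality and security.",  # hypothetical verdict
        root=tmp,
    )
    print(summary)
```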
Dependencies
- @beagle:llm-artifacts-detection - Reused by repo agents for dead code/overengineering analysis