io.github.MetriLLM/metrillm

by metrillm

Benchmark local LLM models from any MCP client, evaluating speed, quality, and hardware fit.

README

MetriLLM


Benchmark your local LLM models in one command. Speed, quality, hardware fitness — with a shareable score and public leaderboard.

Think Geekbench, but for local LLMs on your actual hardware.

```bash
npm install -g metrillm@latest
metrillm bench
```

<p align="center">
  <img src="docs/images/cli1.png" width="48%" alt="MetriLLM CLI — interactive menu" />
  <img src="docs/images/cli2.png" width="48%" alt="MetriLLM CLI — hardware detection" />
</p>

MetriLLM Leaderboard

What You Get

  • Performance metrics: tokens/sec, time to first token, memory usage, load time
  • Quality evaluation: reasoning, coding, math, instruction following, structured output, multilingual (14 prompts, 6 categories)
  • Global score (0-100): 30% hardware fit + 70% quality
  • Verdict: EXCELLENT / GOOD / MARGINAL / NOT RECOMMENDED
  • One-click share: --share uploads your result and gives you a public URL + leaderboard rank

Real Benchmark Results

From the public leaderboard — all results below were submitted with metrillm bench --share.

| Model | Machine | CPU | RAM | tok/s | TTFT | Global | Verdict |
|---|---|---|---|---|---|---|---|
| llama3.2:latest | Mac Mini | Apple M4 Pro | 64 GB | 98.9 | 125 ms | 77 | GOOD |
| mistral:latest | Mac Mini | Apple M4 Pro | 64 GB | 54.3 | 124 ms | 76 | GOOD |
| gemma3:4b | MacBook Air | Apple M4 | 32 GB | 35.9 | 303 ms | 72 | GOOD |
| gemma3:1b | MacBook Air | Apple M4 | 32 GB | 39.4 | 362 ms | 72 | GOOD |
| qwen3:1.7b | MacBook Air | Apple M4 | 32 GB | 37.9 | 3.1 s | 70 | GOOD |
| llama3.2:3b | MacBook Air | Apple M4 | 32 GB | 27.8 | 285 ms | 69 | GOOD |
| gemma3:12b | MacBook Air | Apple M4 | 32 GB | 12.3 | 656 ms | 67 | GOOD |
| phi4:14b | MacBook Air | Apple M4 | 32 GB | 11.1 | 515 ms | 65 | GOOD |
| mistral:7b | MacBook Air | Apple M4 | 32 GB | 13.6 | 517 ms | 61 | GOOD |
| deepseek-r1:14b | MacBook Air | Apple M4 | 32 GB | 10.8 | 30.0 s | 25 | NOT RECOMMENDED |

Key takeaway: Small models (1-4B) fly on Apple Silicon. Larger models (14B+) with thinking chains can choke even on capable hardware. See full leaderboard →

Install

Requires Node 20+ and a local runtime: Ollama or LM Studio.

```bash
# Install globally
npm install -g metrillm@latest
metrillm bench

# Alternative package managers
pnpm add -g metrillm@latest
bun add -g metrillm@latest

# Homebrew
brew install MetriLLM/metrillm/metrillm

# Or run without installing
npx metrillm@latest bench
```

Usage

```bash
# Interactive mode — pick models from a menu
metrillm bench

# Benchmark a specific model
metrillm bench --model gemma3:4b

# Benchmark with LM Studio backend
metrillm bench --backend lm-studio --model qwen3-8b

# Benchmark all installed models
metrillm bench --all

# Share your result (upload + public URL + leaderboard rank)
metrillm bench --share

# CI/non-interactive mode
metrillm bench --ci-no-menu --share

# Force unload after each model (useful for memory isolation)
metrillm bench --all --unload-after-bench

# Export results locally
metrillm bench --export json
metrillm bench --export csv
```

Upload Configuration (CLI + MCP)

By default, production builds upload shared results to the official MetriLLM leaderboard (https://metrillm.dev).

  • No CI secret injection is required for standard releases.
  • Local/dev runs use the same default behavior.
  • Self-hosted or staging deployments can override endpoints with:
    • METRILLM_SUPABASE_URL
    • METRILLM_SUPABASE_ANON_KEY
    • METRILLM_PUBLIC_RESULT_BASE_URL

If these variables are set to placeholder values (from templates), MetriLLM falls back to official defaults.

Windows Users

PowerShell's default execution policy blocks npm global scripts. If you see PSSecurityException or UnauthorizedAccess when running metrillm, run this once:

```powershell
Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned
```

Alternatively, use npx metrillm@latest which bypasses the issue entirely.

Runtime Backends

| Backend | Flag | Default URL | Required env |
|---|---|---|---|
| Ollama | --backend ollama | http://127.0.0.1:11434 | OLLAMA_HOST (optional) |
| LM Studio | --backend lm-studio | http://127.0.0.1:1234 | LM_STUDIO_BASE_URL (optional), LM_STUDIO_API_KEY (optional) |

Shared runtime env:

  • METRILLM_STREAM_STALL_TIMEOUT_MS (optional): stream watchdog for all backends, default 30000, 0 disables it
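A stream-stall watchdog of this kind can be pictured as a wrapper that tracks the gap between chunks. The sketch below is a simplified illustration, not MetriLLM's actual implementation: a production watchdog would use a timer thread so it can fire while blocked on the stream, whereas this version only checks the elapsed gap once the next chunk arrives.

```python
import time
from typing import Iterable, Iterator

class StreamStallError(RuntimeError):
    """Raised when the stream went silent for longer than the timeout."""

def watch_stream(chunks: Iterable[str], stall_timeout_ms: int = 30000) -> Iterator[str]:
    """Yield chunks, raising if the gap between two consecutive chunks
    exceeds the timeout. A timeout of 0 disables the watchdog, matching
    the documented env-var semantics."""
    last = time.monotonic()
    for chunk in chunks:
        now = time.monotonic()
        if stall_timeout_ms and (now - last) * 1000 > stall_timeout_ms:
            raise StreamStallError(f"no data for over {stall_timeout_ms} ms")
        last = now
        yield chunk
```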

LM Studio benchmark runs now use the native REST inference endpoint (/api/v1/chat) for both streaming and non-streaming generation. The previous OpenAI-compatible inference path (/v1/chat/completions) has been retired from MetriLLM so that tok/s and TTFT can rely on native LM Studio stats when available. If an LM Studio response omits native token stats, MetriLLM still computes a score and shows the throughput as estimated.
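When native token stats are missing, throughput has to be derived from wall-clock timing. One plausible way to do this (an assumption about the approach, not MetriLLM's exact formula) is to divide the generated tokens by the decode time, i.e. total time minus time to first token:

```python
def estimated_tok_per_sec(completion_tokens: int, total_s: float, ttft_s: float) -> float:
    """Estimate decode throughput when the backend reports no native
    stats: tokens generated over the time spent decoding (total wall
    time minus time to first token)."""
    decode_s = max(total_s - ttft_s, 1e-9)  # guard against zero/negative spans
    return completion_tokens / decode_s
```

For example, 100 tokens in 10.5 s with a 0.5 s TTFT estimates to 10 tok/s.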

For very large models, tune timeout flags:

  • --perf-warmup-timeout-ms (default 300000)
  • --perf-prompt-timeout-ms (default 120000)
  • --quality-timeout-ms (default 120000)
  • --coding-timeout-ms (default 240000)
  • --stream-stall-timeout-ms (default 30000, 0 disables stall timeout for any backend)

Benchmark Profile v1 (applied to all benchmark prompts):

  • temperature=0
  • top_p=1
  • seed=42
  • thinking follows your benchmark mode (--thinking / --no-thinking)
  • Context window stays runtime default (context=runtime-default) and is recorded as such in metadata.

LM Studio non-thinking guard:

  • When benchmark mode requests non-thinking (--no-thinking or default), MetriLLM now aborts if the model still emits reasoning traces (for result comparability).
  • To disable it in LM Studio for affected models, put this at the top of the model chat template: {%- set enable_thinking = false %} then eject/reload the model.
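The guard can be pictured as a check on the model's output before scoring. The trace markers below are illustrative assumptions: tags like `<think>` are a common convention, but MetriLLM's actual detection logic may differ.

```python
# Hypothetical reasoning-trace markers; real detection may differ.
REASONING_MARKERS = ("<think>", "<reasoning>")

class ThinkingLeakError(RuntimeError):
    """Raised when a non-thinking run still emits reasoning traces."""

def guard_non_thinking(text: str) -> str:
    """Abort the run if a model asked to operate in non-thinking mode
    still emits reasoning traces, keeping results comparable."""
    lowered = text.lower()
    for marker in REASONING_MARKERS:
        if marker in lowered:
            raise ThinkingLeakError(f"reasoning trace detected: {marker}")
    return text
```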

How Scoring Works

Hardware Fit Score (0-100) — how well the model runs on your machine:

  • Speed: 50% (tokens/sec relative to your hardware tier)
  • TTFT: 20% (time to first token)
  • Memory: 30% (RAM efficiency)

Quality Score (0-100) — how well the model answers:

  • Reasoning: 20pts | Coding: 20pts | Instruction Following: 20pts
  • Structured Output: 15pts | Math: 15pts | Multilingual: 10pts

Global Score = 30% Hardware Fit + 70% Quality

Hardware is auto-detected and scoring adapts to your tier (Entry/Balanced/High-End). A model hitting 10 tok/s on an 8 GB machine scores differently than on a 64 GB rig.
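The weighting described above can be written down directly. The sub-scores passed in here are illustrative inputs, not real benchmark data; tier-adaptive normalization of the raw measurements is left out.

```python
def hardware_fit(speed: float, ttft: float, memory: float) -> float:
    """Hardware Fit = 50% speed + 20% TTFT + 30% memory (each 0-100)."""
    return 0.5 * speed + 0.2 * ttft + 0.3 * memory

def global_score(hw_fit: float, quality: float) -> float:
    """Global Score = 30% Hardware Fit + 70% Quality."""
    return 0.3 * hw_fit + 0.7 * quality
```

For instance, sub-scores of 80 (speed), 90 (TTFT), and 70 (memory) give a Hardware Fit of 79; combined with a Quality score of 75, the Global Score is 76.2.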

Full methodology →

Share Your Results

Every benchmark you share enriches the public leaderboard. No account needed — pick the method that fits your workflow:

| Method | Command / Action | Best for |
|---|---|---|
| CLI | metrillm bench --share | Terminal users |
| MCP | Call share_result tool | AI coding assistants |
| Plugin | /benchmark skill with share option | Claude Code / Cursor |

All methods produce the same result:

  • A public URL for your benchmark
  • Your rank: "Top X% globally, Top Y% on [your CPU]"
  • A share card for social media
  • A challenge link to send to friends
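The "Top X%" rank is a percentile over leaderboard scores. A minimal sketch under that assumption (not the service's actual computation):

```python
def top_percent(score: float, all_scores: list) -> float:
    """Share of leaderboard entries scoring at or above `score`,
    expressed as a 'Top X%' figure."""
    if not all_scores:
        return 100.0
    at_or_above = sum(1 for s in all_scores if s >= score)
    return 100.0 * at_or_above / len(all_scores)
```

A score of 77 against the five entries [77, 76, 72, 70, 65] would read as "Top 20% globally".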

Compare your results on the leaderboard →

MCP Server

Use MetriLLM from Claude Code, Cursor, Windsurf, or any MCP client — no CLI needed.

```bash
# Claude Code
claude mcp add metrillm -- npx metrillm-mcp@latest

# Claude Desktop / Cursor / Windsurf — add to MCP config:
# { "command": "npx", "args": ["metrillm-mcp@latest"] }
```

| Tool | Description |
|---|---|
| list_models | List locally available LLM models |
| run_benchmark | Run full benchmark (performance + quality) on a model |
| get_results | Retrieve previous benchmark results |
| share_result | Upload a result to the public leaderboard |

Full MCP documentation →

Skills

Slash commands that work inside AI coding assistants — no server needed, just a Markdown file.

| Skill | Trigger | Description |
|---|---|---|
| /benchmark | User-invoked | Run a full benchmark interactively |
| metrillm-guide | Auto-invoked | Contextual guidance on model selection and results |

Skills are included in the plugins below, or can be installed standalone:

```bash
# Claude Code
cp -r plugins/claude-code/skills/* ~/.claude/skills/

# Cursor
cp -r plugins/cursor/skills/* ~/.cursor/skills/
```

Plugins

Pre-built bundles (MCP + skills + agents) for deeper IDE integration.

| Component | Description |
|---|---|
| MCP config | Auto-connects to metrillm-mcp server |
| Skills | /benchmark + metrillm-guide |
| Agent | benchmark-advisor — analyzes your hardware and recommends models |

Install:

```bash
# Claude Code
cp -r plugins/claude-code/.claude/* ~/.claude/

# Cursor
cp -r plugins/cursor/.cursor/* ~/.cursor/
```

See Claude Code plugin and Cursor plugin for details.

Integrations

| Integration | Package | Status | Docs |
|---|---|---|---|
| CLI | metrillm | Stable | Usage |
| MCP Server | metrillm-mcp | Stable | MCP docs |
| Skills | | Stable | Skills |
| Claude Code plugin | | Stable | Plugin docs |
| Cursor plugin | | Stable | Plugin docs |

Development

```bash
npm ci
npm run ci:verify     # typecheck + tests + build
npm run dev           # run from source
npm run test:watch    # vitest watch mode
```

Homebrew Formula Maintenance

The tap formula lives in Formula/metrillm.rb.

```bash
# Refresh Formula/metrillm.rb with latest npm tarball + sha256
./scripts/update-homebrew-formula.sh

# Or pin a specific version
./scripts/update-homebrew-formula.sh 0.2.1
```

After updating the formula, commit and push so users can install/update with:

```bash
brew tap MetriLLM/metrillm
brew install metrillm
brew upgrade metrillm
```

Contributing

Contributions are welcome! Please read the Contributing Guide before submitting a pull request. All commits must include a DCO sign-off.

License

Apache License 2.0 — see NOTICE for trademark information.

