What is RAGScore?
A tool for generating QA datasets and evaluating RAG systems. It is privacy-first, works with any LLM, and runs locally or in the cloud.
Generate QA datasets & evaluate RAG systems in 2 commands
🔒 Privacy-First • ⚡ Lightning Fast • 🤖 Any LLM • 🏠 Local or Cloud • 🌍 Multilingual
</div>
⚡ 2-Line RAG Evaluation
# Step 1: Generate QA pairs from your docs
ragscore generate docs/
# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query
That's it. Get accuracy scores and incorrect QA pairs instantly.
============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================
❌ 15 Incorrect Pairs:
1. Q: "What is RAG?"
Score: 2/5 - Factually incorrect
2. Q: "How does retrieval work?"
Score: 3/5 - Incomplete answer
🚀 Quick Start
Install
pip install ragscore # Core (works with Ollama)
pip install "ragscore[openai]" # + OpenAI support
pip install "ragscore[notebook]" # + Jupyter/Colab support
pip install "ragscore[all]" # + All providers
Already installed? Keep up to date — new versions add features like failure diagnosis and retrieved context capture:
pip install --upgrade ragscore
Option 1: Python API (Notebook-Friendly)
Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.
from ragscore import quick_test
# 1. Audit your RAG in one line
result = quick_test(
endpoint="http://localhost:8000/query", # Your RAG API
docs="docs/", # Your documents
n=10, # Number of test questions
)
# 1b. Tailored QA — target specific audiences
result = quick_test(
endpoint="http://localhost:8000/query",
docs="docs/",
audience="developers", # Who asks the questions?
purpose="api-integration", # What's the document for?
)
# 2. See the report
result.plot()
# 3. Inspect failures
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])
Rich Object API:
- result.accuracy: Accuracy score
- result.df: Pandas DataFrame of all results
- result.plot(): 3-panel visualization (4-panel with detailed=True)
- result.corrections: List of items to fix
Option 2: CLI (Production)
Generate QA Pairs
# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."
# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10
# Tailored QA generation — target specific audiences
ragscore generate docs/ --audience developers --purpose faq
ragscore generate docs/ --audience customers --purpose "pre-sales"
ragscore generate docs/ --audience "compliance auditors" --purpose "security audit"
Evaluate Your RAG
# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query
# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json
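This README does not show the request and response format the evaluator uses, so the following is only a sketch of a stand-in endpoint for trying the commands above on a local machine. It assumes, without confirmation from this document, that the evaluator sends a JSON POST body with a `question` field and reads an `answer` field from the JSON response. `answer_question`, `QueryHandler`, and `make_server` are hypothetical names; check `ragscore evaluate --help` for the actual contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def answer_question(question: str) -> str:
    """Placeholder for a real retrieve-then-generate pipeline."""
    return f"Stub answer for: {question}"


class QueryHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # ASSUMPTION: the request body is JSON shaped like {"question": "..."}.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            question = payload["question"]
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "Expected a JSON object with a 'question' field")
            return

        # ASSUMPTION: the response is JSON shaped like {"answer": "..."}.
        body = json.dumps({"answer": answer_question(question)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet during evaluation runs


def make_server(port: int = 8000) -> HTTPServer:
    """Build (but do not start) a local server bound to 127.0.0.1."""
    return HTTPServer(("127.0.0.1", port), QueryHandler)
```

Calling `make_server(8000).serve_forever()` would serve the stub at http://localhost:8000/query (the handler ignores the path). Replacing `answer_question` with a call into a real pipeline is the only change needed to test something meaningful.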
🔬 Detailed Multi-Metric Evaluation
Go beyond a single score. Add detailed=True to get 5 diagnostic dimensions per answer — in the same single LLM call.
result = quick_test(
endpoint=my_rag,
docs="docs/",
n=10,
detailed=True, # ⭐ Enable multi-metric evaluation
)
# Inspect per-question metrics
display(result.df[[
"question", "score", "correctness", "completeness",
"relevance", "conciseness", "faithfulness"
]])
# Radar chart + 4-panel visualization
result.plot()
==================================================
✅ PASSED: 9/10 correct (90%)
Average Score: 4.3/5.0
Threshold: 70%
──────────────────────────────────────────────────
Correctness: 4.5/5.0
Completeness: 4.2/5.0
Relevance: 4.8/5.0
Conciseness: 4.1/5.0
Faithfulness: 4.6/5.0
==================================================
| Metric | What it measures | Scale |
|---|---|---|
| Correctness | Semantic match to golden answer | 5 = fully correct |
| Completeness | Covers all key points | 5 = fully covered |
| Relevance | Addresses the question asked | 5 = perfectly on-topic |
| Conciseness | Focused, no filler | 5 = concise and precise |
| Faithfulness | No fabricated claims | 5 = fully faithful |
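As a rough illustration of how the per-metric averages in the report relate to per-question rows, the sketch below averages the five metric columns over a list of plain dictionaries and picks the weakest dimension. The rows are made-up sample data, and `metric_averages` and `weakest_metric` are hypothetical helpers, not part of RAGScore.

```python
from statistics import mean

METRICS = ["correctness", "completeness", "relevance", "conciseness", "faithfulness"]

# Made-up rows standing in for per-question results (same column names as above).
rows = [
    {"question": "What is RAG?", "score": 5, "correctness": 5,
     "completeness": 4, "relevance": 5, "conciseness": 4, "faithfulness": 5},
    {"question": "How does retrieval work?", "score": 3, "correctness": 3,
     "completeness": 2, "relevance": 5, "conciseness": 4, "faithfulness": 4},
]


def metric_averages(rows):
    """Average each metric column across all rows."""
    return {metric: mean(row[metric] for row in rows) for metric in METRICS}


def weakest_metric(averages):
    """Return the metric with the lowest average, i.e. where to look first."""
    return min(averages, key=averages.get)


averages = metric_averages(rows)
for name, value in averages.items():
    print(f"{name.capitalize():<14}{value:.1f}/5.0")
print("Weakest dimension:", weakest_metric(averages))
```

With real results, the same aggregation could be run on `result.df` directly, since it is described as a Pandas DataFrame with these columns.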
CLI:
ragscore evaluate http://localhost:8000/query --detailed
🔍 Failure Diagnosis (--diagnose)
When answers fail, --diagnose tells you why — retriever miss, generator hallucination, incomplete answer, or wrong interpretation:
ragscore evaluate http://localhost:8000/query --diagnose
🔍 Failure Diagnosis:
Retriever Miss: 3 (42.9%)
Generator Hallucination: 2 (28.6%)
Incomplete Answer: 1 (14.3%)
Wrong Interpretation: 1 (14.3%)
Diagnosis reuses the support_span already generated with each QA pair to give the judge grounding context. Combine it with --detailed for full diagnostics:
ragscore evaluate http://localhost:8000/query --diagnose --detailed -o results.json
| Category | Meaning |
|---|---|
| Retriever Miss | RAG didn't retrieve the chunk containing the evidence |
| Generator Hallucination | Retrieved correctly but fabricated information |
| Incomplete Answer | Retrieved correctly but answer is partial |
| Wrong Interpretation | Retrieved correctly but misunderstood the content |
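The percentages in the diagnosis summary are simple shares of the failed answers. The sketch below reproduces that arithmetic for the sample counts shown earlier (7 failures in total). How RAGScore stores the diagnosis label on each result is not shown in this README, so the flat list of labels is an assumption, and `diagnosis_breakdown` is a hypothetical helper.

```python
from collections import Counter


def diagnosis_breakdown(labels):
    """Return (label, count, percent) tuples, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [
        (label, count, round(100 * count / total, 1))
        for label, count in counts.most_common()
    ]


# ASSUMPTION: one diagnosis label per failed answer, matching the sample report.
labels = (
    ["Retriever Miss"] * 3
    + ["Generator Hallucination"] * 2
    + ["Incomplete Answer", "Wrong Interpretation"]
)

breakdown = diagnosis_breakdown(labels)
for label, count, percent in breakdown:
    print(f"{label}: {count} ({percent}%)")
```

A breakdown dominated by Retriever Miss points at chunking or retrieval, while one dominated by the other three categories points at the generation step.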
📓 Full demo notebook — build a mini RAG and test it with detailed metrics.
🎯 Audience & Purpose demo — generate tailored QA for developers, customers, auditors, and more.
🏠 Ollama local demo — 100% private RAG evaluation with no API keys.
🏠 100% Private with Local LLMs
# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query
Perfect for: Healthcare 🏥 • Legal ⚖️ • Finance 🏦 • Research 🔬
Ollama Model Recommendations
RAGScore generates complex structured QA pairs (question + answer + rationale + support span) in JSON format. This requires models with strong instruction-following and JSON output capabilities.
| Model | Size | Min Memory | QA Quality | Recommended For |
|---|---|---|---|---|
| llama3.1:70b | 40GB | 48GB VRAM | Excellent | GPU server (A100, L40) |
| qwen2.5:32b | 18GB | 24GB VRAM | Excellent | GPU server (A10, L20) |
| llama3.1:8b | 4.7GB | 8GB VRAM | Good | Best local choice |
| qwen2.5:7b | 4.4GB | 8GB VRAM | Good | Good local alternative |
| mistral:7b | 4.1GB | 8GB VRAM | Good | Good local alternative |
| llama3.2:3b | 2.0GB | 4GB RAM | Fair | CPU-only / testing |
| qwen2.5:1.5b | 1.0GB | 2GB RAM | Poor | Not recommended |
Minimum recommended: 8B+ models. Smaller models (1.5B–3B) produce lower-quality support spans and may time out on longer chunks.
Ollama Performance Guide
# Recommended: 8B model with concurrency 2 for local machines
ollama pull llama3.1:8b
ragscore generate docs/ --provider ollama --model llama3.1:8b
# GPU server (A10/L20): larger model with higher concurrency
ollama pull qwen2.5:32b
ragscore generate docs/ --provider ollama --model qwen2.5:32b --concurrency 5
Expected performance (28 chunks, 5 QA pairs per chunk):
| Hardware | Model | Time | Concurrency |
|---|---|---|---|
| MacBook (CPU) | llama3.2:3b | ~45 min | 2 |
| MacBook (CPU) | llama3.1:8b | ~25 min | 2 |
| A10 (24GB) | llama3.1:8b | ~3–5 min | 5 |
| L20/L40 (48GB) | qwen2.5:32b | ~3–5 min | 5 |
| OpenAI API | gpt-4o-mini | ~2 min | 10 |
RAGScore auto-reduces concurrency to 2 for local Ollama to avoid GPU/CPU contention.
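The `--concurrency` flag and the automatic cap of 2 for local Ollama both limit how many LLM requests are in flight at once. The sketch below illustrates that general idea with `asyncio.Semaphore`. It is not RAGScore's implementation: `run_bounded` and `fake_llm_call` are hypothetical names, and a short sleep stands in for a model call.

```python
import asyncio


async def run_bounded(items, worker, limit):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:  # waits here once `limit` workers are running
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


# Track how many fake calls overlap, to show the cap is respected.
stats = {"in_flight": 0, "peak": 0}


async def fake_llm_call(chunk_id):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # stand-in for a slow model request
    stats["in_flight"] -= 1
    return f"qa-for-chunk-{chunk_id}"


# 28 chunks as in the benchmark above, capped at 2 concurrent calls.
results = asyncio.run(run_bounded(range(28), fake_llm_call, limit=2))
print(len(results), "chunks processed, peak concurrency:", stats["peak"])
```

On a single local GPU or CPU, raising the limit mostly makes requests queue inside the model server, which is presumably why a low cap is the default for Ollama.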
🔌 Supported LLMs
| Provider | Setup | Notes |
|---|---|---|
| Ollama | ollama serve | Local, free, private |
| OpenAI | export OPENAI_API_KEY="sk-..." | Best quality |
| Anthropic | export ANTHROPIC_API_KEY="..." | Long context |
| DashScope | export DASHSCOPE_API_KEY="..." | Qwen models |
| vLLM | export LLM_BASE_URL="..." | Production-grade |
| Any OpenAI-compatible | export LLM_BASE_URL="..." | Groq, Together, etc. |
📊 Output Formats
Generated QA Pairs (output/generated_qas.jsonl)
{
"id": "abc123",
"question": "What is RAG?",
"answer": "RAG (Retrieval-Augmented Generation) combines...",
"rationale": "This is explicitly stated in the introduction...",
"support_span": "RAG systems retrieve relevant documents...",
"difficulty": "medium",
"source_path": "docs/rag_intro.pdf"
}
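A quick way to sanity-check a generated dataset is to load it and count records by difficulty and source. The sketch below assumes the file holds one JSON object per line, as the `.jsonl` extension conventionally implies. The second record, including its `id` and the `"hard"` difficulty value, is made up for illustration, and `load_qas` is a hypothetical helper.

```python
import json
from collections import Counter
from io import StringIO

# In-memory stand-in for output/generated_qas.jsonl (second record is made up).
SAMPLE = StringIO(
    '{"id": "abc123", "question": "What is RAG?", '
    '"answer": "RAG (Retrieval-Augmented Generation) combines...", '
    '"difficulty": "medium", "source_path": "docs/rag_intro.pdf"}\n'
    '{"id": "def456", "question": "How does retrieval work?", '
    '"answer": "The retriever selects relevant chunks...", '
    '"difficulty": "hard", "source_path": "docs/rag_intro.pdf"}\n'
)


def load_qas(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


qas = load_qas(SAMPLE)
by_difficulty = Counter(qa["difficulty"] for qa in qas)
by_source = Counter(qa["source_path"] for qa in qas)

print("QA pairs:", len(qas))
print("By difficulty:", dict(by_difficulty))
print("By source:", dict(by_source))
```

To run this against a real file, pass an open file handle instead of `SAMPLE`, for example `load_qas(open("output/generated_qas.jsonl", encoding="utf-8"))`.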
Evaluation Results (--output results.json)
{
"summary": {
"total": 100,
"correct": 85,
"incorrect": 15,
"accuracy": 0.85,
"avg_score": 4.2
},
"incorrect_pairs": [
{
"question": "What is RAG?",
"golden_answer": "RAG combines retrieval with generation...",
"rag_answer": "RAG is a database system.",
"score": 2,
"reason": "Factually incorrect - RAG is not a database"
}
]
}
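The summary block is easy to consume from a script, for example to gate a CI job on accuracy. The sketch below parses the sample shown above, recomputes accuracy from the counts, and lists the failed pairs from worst score upward. The 0.70 threshold mirrors the "Threshold: 70%" line in the sample report; treating it as the pass mark here is an assumption, and `check_results` is a hypothetical helper.

```python
import json

RESULTS_JSON = """
{
  "summary": {"total": 100, "correct": 85, "incorrect": 15,
              "accuracy": 0.85, "avg_score": 4.2},
  "incorrect_pairs": [
    {
      "question": "What is RAG?",
      "golden_answer": "RAG combines retrieval with generation...",
      "rag_answer": "RAG is a database system.",
      "score": 2,
      "reason": "Factually incorrect - RAG is not a database"
    }
  ]
}
"""


def check_results(results, threshold=0.70):
    """Recompute accuracy from the counts and compare it with a pass mark."""
    summary = results["summary"]
    recomputed = summary["correct"] / summary["total"]
    consistent = (
        summary["correct"] + summary["incorrect"] == summary["total"]
        and abs(recomputed - summary["accuracy"]) < 1e-9
    )
    return {
        "accuracy": recomputed,
        "consistent": consistent,
        "passed": recomputed >= threshold,
    }


results = json.loads(RESULTS_JSON)
report = check_results(results)
print(report)

# Worst answers first, so the most serious failures are reviewed first.
for pair in sorted(results["incorrect_pairs"], key=lambda p: p["score"]):
    print(f'{pair["score"]}/5  {pair["question"]}  ({pair["reason"]})')
```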
🧪 Python API
from ragscore import run_pipeline, run_evaluation
# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)
# Generate tailored QA pairs for specific audiences
run_pipeline(
paths=["docs/"],
audience="support engineers",
purpose="fine-tuning a support chatbot",
)
# Evaluate RAG
results = run_evaluation(
endpoint="http://localhost:8000/query",
model="gpt-4o", # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")
🤖 AI Agent Integration
RAGScore is designed for AI agents and automation:
# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json
# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing
CLI Reference:
| Command | Description |
|---|---|
| ragscore generate <paths> | Generate QA pairs from documents |
| ragscore generate <paths> --audience <who> | Tailored QA for specific audience |
| ragscore generate <paths> --purpose <why> | Focus QA on document purpose |
| ragscore evaluate <endpoint> | Evaluate RAG against golden QAs |
| ragscore evaluate <endpoint> --detailed | Multi-metric evaluation |
| ragscore evaluate <endpoint> --diagnose | Failure root-cause classification |
| ragscore --help | Show all commands and options |
| ragscore generate --help | Show generate options |
| ragscore evaluate --help | Show evaluate options |
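For agents and automation scripts, the exit codes and JSON output described above suggest a simple pattern: run the command, treat any non-zero exit code as failure, and parse the output file only on success. The sketch below shows that pattern with `subprocess.run`. So that it runs without RAGScore installed, a one-line Python script stands in for the CLI; `run_and_load` and `FAKE_CLI` are hypothetical names.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_load(command, output_path):
    """Run a CLI command; return parsed JSON from output_path, or None on failure."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:  # README: 0 = success, 1 = error
        print(f"command failed with exit code {completed.returncode}")
        return None
    return json.loads(Path(output_path).read_text(encoding="utf-8"))


# Stand-in for the real CLI: writes a tiny results file to the path in argv[1].
FAKE_CLI = (
    "import json, sys; from pathlib import Path; "
    "Path(sys.argv[1]).write_text(json.dumps({'summary': {'accuracy': 0.85}}))"
)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results.json"
    ok = run_and_load([sys.executable, "-c", FAKE_CLI, str(out)], out)
    failed = run_and_load([sys.executable, "-c", "import sys; sys.exit(1)"], out)

print("success case:", ok)
print("failure case:", failed)
```

With RAGScore installed, the command list would instead be something like `["ragscore", "evaluate", "http://api/query", "--output", "results.json"]`, using only the flags shown in this README.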
⚙️ Configuration
Zero config required. Optional environment variables:
export RAGSCORE_CHUNK_SIZE=512 # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5 # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir # Working directory
🔐 Privacy & Security
| Data | Cloud LLM | Local LLM |
|---|---|---|
| Documents | ✅ Local | ✅ Local |
| Text chunks | ⚠️ Sent to LLM | ✅ Local |
| Generated QAs | ✅ Local | ✅ Local |
| Evaluation results | ✅ Local | ✅ Local |
Compliance: GDPR ✅ • HIPAA ✅ (with local LLMs) • SOC 2 ✅
🧪 Development
git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest
📡 Telemetry
RAGScore collects telemetry only in MCP server mode (ragscore serve). Standard CLI and Python API usage do not send telemetry.
We collect limited anonymous operational metrics to understand feature usage and improve reliability. No document content, prompts, QA text, model outputs, API keys, endpoint URLs, or file paths are collected.
Collected in MCP mode:
- MCP tool invoked
- LLM provider and model name
- ragscore version, Python version, OS type
- Success/failure status
- Random anonymous installation ID
Opt out:
export RAGSCORE_NO_TELEMETRY=1
🔗 Links
- GitHub • PyPI • Issues • Discussions
<p align="center"> <b>⭐ Star us on GitHub if RAGScore helps you!</b><br> Made with ❤️ for the RAG community </p>