What is RAGScore?
Generate QA datasets and evaluate RAG systems. Privacy-first, compatible with any LLM, and runs locally or in the cloud.
Generate QA datasets & evaluate RAG systems in 2 commands
🔒 Privacy-First • ⚡ Lightning Fast • 🤖 Any LLM • 🏠 Local or Cloud • 🌍 Multilingual
⚡ 2-Line RAG Evaluation
# Step 1: Generate QA pairs from your docs
ragscore generate docs/
# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query
That's it. Get accuracy scores and incorrect QA pairs instantly.
============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================
❌ 15 Incorrect Pairs:
1. Q: "What is RAG?"
Score: 2/5 - Factually incorrect
2. Q: "How does retrieval work?"
Score: 3/5 - Incomplete answer
🚀 Quick Start
Install
pip install ragscore # Core (works with Ollama)
pip install "ragscore[openai]" # + OpenAI support
pip install "ragscore[notebook]" # + Jupyter/Colab support
pip install "ragscore[all]" # + All providers
Option 1: Python API (Notebook-Friendly)
Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.
from ragscore import quick_test
# 1. Audit your RAG in one line
result = quick_test(
endpoint="http://localhost:8000/query", # Your RAG API
docs="docs/", # Your documents
n=10, # Number of test questions
)
# 1b. Tailored QA — target specific audiences
result = quick_test(
endpoint="http://localhost:8000/query",
docs="docs/",
audience="developers", # Who asks the questions?
purpose="api-integration", # What's the document for?
)
# 2. See the report
result.plot()
# 3. Inspect failures (display() is built in on Jupyter/Colab;
#    in plain Python: from IPython.display import display)
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])
Rich Object API:
- `result.accuracy` - Accuracy score
- `result.df` - Pandas DataFrame of all results
- `result.plot()` - 3-panel visualization (4-panel with `detailed=True`)
- `result.corrections` - List of items to fix
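The object also drops straight into scripted checks. A minimal sketch (the 0.8 threshold is arbitrary, and the exact shape of the items in `result.corrections` is not documented here, so they are simply printed):

# Gate on overall accuracy, then list the items flagged for fixing.
if result.accuracy < 0.8:  # threshold is arbitrary
    for item in result.corrections:
        print(item)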
Option 2: CLI (Production)
Generate QA Pairs
# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."
# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10
# Tailored QA generation — target specific audiences
ragscore generate docs/ --audience developers --purpose faq
ragscore generate docs/ --audience customers --purpose "pre-sales"
ragscore generate docs/ --audience "compliance auditors" --purpose "security audit"
Evaluate Your RAG
# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query
# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json
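If you need a local target to evaluate against, a stub endpoint might look like the sketch below. Note: the request/response schema here (a JSON body with a question field, answered with an answer field) is an assumption for illustration, as are the module name and the my_rag_pipeline helper; match them to the contract your RAG service actually exposes.

# rag_stub.py - hypothetical endpoint for testing `ragscore evaluate`.
# ASSUMPTION: the evaluator POSTs JSON like {"question": "..."} and
# reads {"answer": "..."} back; adjust to your real contract.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

def my_rag_pipeline(question: str) -> str:
    # Replace with your retrieval + generation logic.
    return "stub answer"

@app.post("/query")
def query(body: Query) -> dict:
    return {"answer": my_rag_pipeline(body.question)}

# Run with: uvicorn rag_stub:app --port 8000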
🔬 Detailed Multi-Metric Evaluation
Go beyond a single score. Add detailed=True to get 5 diagnostic dimensions per answer — in the same single LLM call.
result = quick_test(
endpoint=my_rag,
docs="docs/",
n=10,
detailed=True, # ⭐ Enable multi-metric evaluation
)
# Inspect per-question metrics
display(result.df[[
"question", "score", "correctness", "completeness",
"relevance", "conciseness", "faithfulness"
]])
# Radar chart + 4-panel visualization
result.plot()
==================================================
✅ PASSED: 9/10 correct (90%)
Average Score: 4.3/5.0
Threshold: 70%
──────────────────────────────────────────────────
Correctness: 4.5/5.0
Completeness: 4.2/5.0
Relevance: 4.8/5.0
Conciseness: 4.1/5.0
Faithfulness: 4.6/5.0
==================================================
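The per-metric averages in the report are easy to recompute yourself from the DataFrame, using the columns listed above:

# Average each diagnostic dimension across all questions.
metrics = ["correctness", "completeness", "relevance", "conciseness", "faithfulness"]
print(result.df[metrics].mean().round(2))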
| Metric | What it measures | Scale |
|---|---|---|
| Correctness | Semantic match to golden answer | 5 = fully correct |
| Completeness | Covers all key points | 5 = fully covered |
| Relevance | Addresses the question asked | 5 = perfectly on-topic |
| Conciseness | Focused, no filler | 5 = concise and precise |
| Faithfulness | No fabricated claims | 5 = fully faithful |
CLI:
ragscore evaluate http://localhost:8000/query --detailed
📓 Full demo notebook — build a mini RAG and test it with detailed metrics.
🎯 Audience & Purpose demo — generate tailored QA for developers, customers, auditors, and more.
🏠 Ollama local demo — 100% private RAG evaluation with no API keys.
🏠 100% Private with Local LLMs
# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query
Perfect for: Healthcare 🏥 • Legal ⚖️ • Finance 🏦 • Research 🔬
Ollama Model Recommendations
RAGScore generates complex structured QA pairs (question + answer + rationale + support span) in JSON format. This requires models with strong instruction-following and JSON output capabilities.
| Model | Size | Min RAM/VRAM | QA Quality | Recommended For |
|---|---|---|---|---|
| `llama3.1:70b` | 40GB | 48GB VRAM | Excellent | GPU server (A100, L40) |
| `qwen2.5:32b` | 18GB | 24GB VRAM | Excellent | GPU server (A10, L20) |
| `llama3.1:8b` | 4.7GB | 8GB VRAM | Good | Best local choice |
| `qwen2.5:7b` | 4.4GB | 8GB VRAM | Good | Good local alternative |
| `mistral:7b` | 4.1GB | 8GB VRAM | Good | Good local alternative |
| `llama3.2:3b` | 2.0GB | 4GB RAM | Fair | CPU-only / testing |
| `qwen2.5:1.5b` | 1.0GB | 2GB RAM | Poor | Not recommended |
Minimum recommended: 8B+ models. Smaller models (1.5B–3B) produce lower-quality support spans and may time out on longer chunks.
Ollama Performance Guide
# Recommended: 8B model with concurrency 2 for local machines
ollama pull llama3.1:8b
ragscore generate docs/ --provider ollama --model llama3.1:8b
# GPU server (A10/L20): larger model with higher concurrency
ollama pull qwen2.5:32b
ragscore generate docs/ --provider ollama --model qwen2.5:32b --concurrency 5
Expected performance (28 chunks, 5 QA pairs per chunk):
| Hardware | Model | Time | Concurrency |
|---|---|---|---|
| MacBook (CPU) | llama3.2:3b | ~45 min | 2 |
| MacBook (CPU) | llama3.1:8b | ~25 min | 2 |
| A10 (24GB) | llama3.1:8b | ~3–5 min | 5 |
| L20/L40 (48GB) | qwen2.5:32b | ~3–5 min | 5 |
| OpenAI API | gpt-4o-mini | ~2 min | 10 |
RAGScore auto-reduces concurrency to 2 for local Ollama to avoid GPU/CPU contention.
🔌 Supported LLMs
| Provider | Setup | Notes |
|---|---|---|
| Ollama | `ollama serve` | Local, free, private |
| OpenAI | `export OPENAI_API_KEY="sk-..."` | Best quality |
| Anthropic | `export ANTHROPIC_API_KEY="..."` | Long context |
| DashScope | `export DASHSCOPE_API_KEY="..."` | Qwen models |
| vLLM | `export LLM_BASE_URL="..."` | Production-grade |
| Any OpenAI-compatible | `export LLM_BASE_URL="..."` | Groq, Together, etc. |
📊 Output Formats
Generated QA Pairs (output/generated_qas.jsonl)
{
"id": "abc123",
"question": "What is RAG?",
"answer": "RAG (Retrieval-Augmented Generation) combines...",
"rationale": "This is explicitly stated in the introduction...",
"support_span": "RAG systems retrieve relevant documents...",
"difficulty": "medium",
"source_path": "docs/rag_intro.pdf"
}
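The file is plain JSONL, one record per line, so it is easy to post-process. A minimal sketch using the path and fields shown above:

import json
from collections import Counter

# Load the generated QA pairs and summarize them by difficulty.
with open("output/generated_qas.jsonl", encoding="utf-8") as f:
    qas = [json.loads(line) for line in f]

print(len(qas), "QA pairs")
print(Counter(qa["difficulty"] for qa in qas))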
Evaluation Results (--output results.json)
{
"summary": {
"total": 100,
"correct": 85,
"incorrect": 15,
"accuracy": 0.85,
"avg_score": 4.2
},
"incorrect_pairs": [
{
"question": "What is RAG?",
"golden_answer": "RAG combines retrieval with generation...",
"rag_answer": "RAG is a database system.",
"score": 2,
"reason": "Factually incorrect - RAG is not a database"
}
]
}
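Because the summary schema is plain JSON, it drops straight into a CI gate. A small sketch based on the fields above (the 0.8 threshold is arbitrary):

import json
import sys

# Print failing pairs, then exit non-zero when accuracy is below threshold.
with open("results.json", encoding="utf-8") as f:
    results = json.load(f)

for pair in results["incorrect_pairs"]:
    print(f"[{pair['score']}/5] {pair['question']}: {pair['reason']}")

sys.exit(0 if results["summary"]["accuracy"] >= 0.8 else 1)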
🧪 Python API
from ragscore import run_pipeline, run_evaluation
# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)
# Generate tailored QA pairs for specific audiences
run_pipeline(
paths=["docs/"],
audience="support engineers",
purpose="fine-tuning a support chatbot",
)
# Evaluate RAG
results = run_evaluation(
endpoint="http://localhost:8000/query",
model="gpt-4o", # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")
🤖 AI Agent Integration
RAGScore is designed for AI agents and automation:
# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json
# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing
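An agent can shell out to the CLI and read the report back. This sketch relies only on the documented exit codes and the --output flag:

import json
import subprocess

# Run an evaluation, then parse the JSON report on success.
proc = subprocess.run(
    ["ragscore", "evaluate", "http://api/query", "--output", "results.json"]
)
if proc.returncode == 0:  # 0 = success, 1 = error
    with open("results.json", encoding="utf-8") as f:
        print(json.load(f)["summary"])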
CLI Reference:
| Command | Description |
|---|---|
| `ragscore generate <paths>` | Generate QA pairs from documents |
| `ragscore generate <paths> --audience <who>` | Tailored QA for a specific audience |
| `ragscore generate <paths> --purpose <why>` | Focus QA on the document's purpose |
| `ragscore evaluate <endpoint>` | Evaluate RAG against golden QAs |
| `ragscore evaluate <endpoint> --detailed` | Multi-metric evaluation |
| `ragscore --help` | Show all commands and options |
| `ragscore generate --help` | Show generate options |
| `ragscore evaluate --help` | Show evaluate options |
⚙️ Configuration
Zero config required. Optional environment variables:
export RAGSCORE_CHUNK_SIZE=512 # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5 # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir # Working directory
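The same knobs can be set from Python before the pipeline runs. A sketch that assumes RAGScore reads these variables at execution time:

import os

# ASSUMPTION: RAGScore reads these environment variables when the
# pipeline executes, matching the exports above.
os.environ["RAGSCORE_CHUNK_SIZE"] = "512"
os.environ["RAGSCORE_QUESTIONS_PER_CHUNK"] = "5"

from ragscore import run_pipeline

run_pipeline(paths=["docs/"])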
🔐 Privacy & Security
| Data | Cloud LLM | Local LLM |
|---|---|---|
| Documents | ✅ Local | ✅ Local |
| Text chunks | ⚠️ Sent to LLM | ✅ Local |
| Generated QAs | ✅ Local | ✅ Local |
| Evaluation results | ✅ Local | ✅ Local |
Compliance: GDPR ✅ • HIPAA ✅ (with local LLMs) • SOC 2 ✅
🧪 Development
git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest
🔗 Links
- GitHub • PyPI • Issues • Discussions
<p align="center"> <b>⭐ Star us on GitHub if RAGScore helps you!</b><br> Made with ❤️ for the RAG community </p>