RAGScore

by hzyai

Generate QA datasets and evaluate RAG systems. Privacy-first, works with any LLM, runs locally or in the cloud.


<div align="center"> <img src="RAGScore.png" alt="RAGScore Logo" width="400"/>

PyPI version PyPI Downloads Python 3.9+ License Ollama Open In Colab MCP

<!-- mcp-name: io.github.HZYAI/ragscore -->

Generate QA datasets & evaluate RAG systems in 2 commands

🔒 Privacy-First • ⚡ Lightning Fast • 🤖 Any LLM • 🏠 Local or Cloud • 🌍 Multilingual

English | 中文 | 日本語 | Deutsch

</div>

⚡ 2-Line RAG Evaluation

```bash
# Step 1: Generate QA pairs from your docs
ragscore generate docs/

# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query
```

That's it. Get accuracy scores and incorrect QA pairs instantly.

```text
============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================

❌ 15 Incorrect Pairs:

  1. Q: "What is RAG?"
     Score: 2/5 - Factually incorrect

  2. Q: "How does retrieval work?"
     Score: 3/5 - Incomplete answer
```

🚀 Quick Start

Install

```bash
pip install ragscore              # Core (works with Ollama)
pip install "ragscore[openai]"    # + OpenAI support
pip install "ragscore[notebook]"  # + Jupyter/Colab support
pip install "ragscore[all]"       # + All providers
```

Option 1: Python API (Notebook-Friendly)

Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.

```python
from ragscore import quick_test

# 1. Audit your RAG in one line
result = quick_test(
    endpoint="http://localhost:8000/query",  # Your RAG API
    docs="docs/",                            # Your documents
    n=10,                                    # Number of test questions
)

# 1b. Tailored QA — target specific audiences
result = quick_test(
    endpoint="http://localhost:8000/query",
    docs="docs/",
    audience="developers",                   # Who asks the questions?
    purpose="api-integration",               # What's the document for?
)

# 2. See the report
result.plot()

# 3. Inspect failures
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])
```

Rich Object API:

  • `result.accuracy` - Accuracy score
  • `result.df` - Pandas DataFrame of all results
  • `result.plot()` - 3-panel visualization (4-panel with `detailed=True`)
  • `result.corrections` - List of items to fix

Option 2: CLI (Production)

Generate QA Pairs

```bash
# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."

# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10

# Tailored QA generation — target specific audiences
ragscore generate docs/ --audience developers --purpose faq
ragscore generate docs/ --audience customers --purpose "pre-sales"
ragscore generate docs/ --audience "compliance auditors" --purpose "security audit"
```

Evaluate Your RAG

```bash
# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query

# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json
```
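Evaluation drives your endpoint with the generated questions, so you need something listening at that URL. The stub below is a minimal sketch for local experimentation; the request/response contract (POST `{"question": ...}` returning `{"answer": ...}`) is an assumed convention, not taken from this README, so check your ragscore version's documentation for the exact schema. `QueryHandler` is an illustrative name.

```python
# Hypothetical stub endpoint for trying out `ragscore evaluate`.
# ASSUMPTION: the POST {"question": ...} -> {"answer": ...} contract is
# a common convention, not specified in this README.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class QueryHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        # A real handler would retrieve context and generate an answer here.
        payload = json.dumps(
            {"answer": f"Stub answer for: {body.get('question', '')}"}
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep per-request logging quiet

# To serve it for an evaluation run:
#   HTTPServer(("localhost", 8000), QueryHandler).serve_forever()
```

Swapping the stub body for a call into your retrieval pipeline turns this into a real target for `ragscore evaluate http://localhost:8000/query`.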

🔬 Detailed Multi-Metric Evaluation

Go beyond a single score. Add `detailed=True` to get 5 diagnostic dimensions per answer — in the same single LLM call.

```python
result = quick_test(
    endpoint=my_rag,
    docs="docs/",
    n=10,
    detailed=True,  # ⭐ Enable multi-metric evaluation
)

# Inspect per-question metrics
display(result.df[[
    "question", "score", "correctness", "completeness",
    "relevance", "conciseness", "faithfulness"
]])

# Radar chart + 4-panel visualization
result.plot()
```

```text
==================================================
✅ PASSED: 9/10 correct (90%)
Average Score: 4.3/5.0
Threshold: 70%
──────────────────────────────────────────────────
  Correctness: 4.5/5.0
  Completeness: 4.2/5.0
  Relevance: 4.8/5.0
  Conciseness: 4.1/5.0
  Faithfulness: 4.6/5.0
==================================================
```
| Metric | What it measures | Scale |
|---|---|---|
| Correctness | Semantic match to golden answer | 5 = fully correct |
| Completeness | Covers all key points | 5 = fully covered |
| Relevance | Addresses the question asked | 5 = perfectly on-topic |
| Conciseness | Focused, no filler | 5 = concise and precise |
| Faithfulness | No fabricated claims | 5 = fully faithful |
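With `detailed=True`, each evaluated answer carries these five metrics as columns of `result.df`, so the per-metric averages in the summary come straight from a column mean (`result.df[metrics].mean()` in pandas). A dependency-free sketch of the same aggregation, with plain dict rows standing in for DataFrame records and illustrative scores:

```python
# Toy rows standing in for result.df records when detailed=True;
# the scores are made up for illustration.
METRICS = ["correctness", "completeness", "relevance", "conciseness", "faithfulness"]

rows = [
    {"correctness": 5, "completeness": 4, "relevance": 5, "conciseness": 4, "faithfulness": 5},
    {"correctness": 4, "completeness": 4, "relevance": 5, "conciseness": 4, "faithfulness": 4},
]

def metric_means(rows):
    """Average each diagnostic metric across all evaluated answers."""
    return {m: sum(r[m] for r in rows) / len(rows) for m in METRICS}

for name, mean in metric_means(rows).items():
    print(f"  {name.capitalize()}: {mean:.1f}/5.0")
```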

CLI:

```bash
ragscore evaluate http://localhost:8000/query --detailed
```

📓 Full demo notebook — build a mini RAG and test it with detailed metrics.

🎯 Audience & Purpose demo — generate tailored QA for developers, customers, auditors, and more.

🏠 Ollama local demo — 100% private RAG evaluation with no API keys.


🏠 100% Private with Local LLMs

```bash
# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query
```

Perfect for: Healthcare 🏥 • Legal ⚖️ • Finance 🏦 • Research 🔬

Ollama Model Recommendations

RAGScore generates complex structured QA pairs (question + answer + rationale + support span) in JSON format. This requires models with strong instruction-following and JSON output capabilities.

| Model | Size | Min RAM | QA Quality | Recommended |
|---|---|---|---|---|
| llama3.1:70b | 40GB | 48GB VRAM | Excellent | GPU server (A100, L40) |
| qwen2.5:32b | 18GB | 24GB VRAM | Excellent | GPU server (A10, L20) |
| llama3.1:8b | 4.7GB | 8GB VRAM | Good | Best local choice |
| qwen2.5:7b | 4.4GB | 8GB VRAM | Good | Good local alternative |
| mistral:7b | 4.1GB | 8GB VRAM | Good | Good local alternative |
| llama3.2:3b | 2.0GB | 4GB RAM | Fair | CPU-only / testing |
| qwen2.5:1.5b | 1.0GB | 2GB RAM | Poor | Not recommended |

Minimum recommended: 8B+ models. Smaller models (1.5B–3B) produce lower-quality support spans and may time out on longer chunks.

Ollama Performance Guide

```bash
# Recommended: 8B model with concurrency 2 for local machines
ollama pull llama3.1:8b
ragscore generate docs/ --provider ollama --model llama3.1:8b

# GPU server (A10/L20): larger model with higher concurrency
ollama pull qwen2.5:32b
ragscore generate docs/ --provider ollama --model qwen2.5:32b --concurrency 5
```

Expected performance (28 chunks, 5 QA pairs per chunk):

| Hardware | Model | Time | Concurrency |
|---|---|---|---|
| MacBook (CPU) | llama3.2:3b | ~45 min | 2 |
| MacBook (CPU) | llama3.1:8b | ~25 min | 2 |
| A10 (24GB) | llama3.1:8b | ~3–5 min | 5 |
| L20/L40 (48GB) | qwen2.5:32b | ~3–5 min | 5 |
| OpenAI API | gpt-4o-mini | ~2 min | 10 |

RAGScore auto-reduces concurrency to 2 for local Ollama to avoid GPU/CPU contention.


🔌 Supported LLMs

| Provider | Setup | Notes |
|---|---|---|
| Ollama | `ollama serve` | Local, free, private |
| OpenAI | `export OPENAI_API_KEY="sk-..."` | Best quality |
| Anthropic | `export ANTHROPIC_API_KEY="..."` | Long context |
| DashScope | `export DASHSCOPE_API_KEY="..."` | Qwen models |
| vLLM | `export LLM_BASE_URL="..."` | Production-grade |
| Any OpenAI-compatible | `export LLM_BASE_URL="..."` | Groq, Together, etc. |

📊 Output Formats

Generated QA Pairs (output/generated_qas.jsonl)

```json
{
  "id": "abc123",
  "question": "What is RAG?",
  "answer": "RAG (Retrieval-Augmented Generation) combines...",
  "rationale": "This is explicitly stated in the introduction...",
  "support_span": "RAG systems retrieve relevant documents...",
  "difficulty": "medium",
  "source_path": "docs/rag_intro.pdf"
}
```
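Because the file is one JSON object per line, downstream tooling can load it with nothing but the standard library. A minimal sketch (the helper name `load_jsonl` is illustrative, not part of the ragscore API):

```python
import json

def load_jsonl(lines):
    """Parse JSON-Lines records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# One record in the shape shown above (fields abridged).
sample = '{"id": "abc123", "question": "What is RAG?", "difficulty": "medium"}'
qas = load_jsonl([sample])
print(qas[0]["question"])  # -> What is RAG?
```

In practice you would pass `open("output/generated_qas.jsonl")` instead of the in-memory sample.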

Evaluation Results (--output results.json)

```json
{
  "summary": {
    "total": 100,
    "correct": 85,
    "incorrect": 15,
    "accuracy": 0.85,
    "avg_score": 4.2
  },
  "incorrect_pairs": [
    {
      "question": "What is RAG?",
      "golden_answer": "RAG combines retrieval with generation...",
      "rag_answer": "RAG is a database system.",
      "score": 2,
      "reason": "Factually incorrect - RAG is not a database"
    }
  ]
}
```
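The `summary` block makes this file easy to use as a CI gate. A sketch, assuming the 70% pass threshold shown in the detailed-evaluation output earlier (treat that default as an assumption, not a documented flag):

```python
import json

# Example results.json content, copied from the shape above.
results = json.loads("""
{
  "summary": {"total": 100, "correct": 85, "incorrect": 15,
              "accuracy": 0.85, "avg_score": 4.2},
  "incorrect_pairs": [{"question": "What is RAG?", "score": 2}]
}
""")

def passes(results, threshold=0.70):
    """True when evaluation accuracy meets the threshold."""
    return results["summary"]["accuracy"] >= threshold

print(passes(results))        # -> True  (85% >= 70%)
print(passes(results, 0.90))  # -> False
```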

🧪 Python API

```python
from ragscore import run_pipeline, run_evaluation

# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)

# Generate tailored QA pairs for specific audiences
run_pipeline(
    paths=["docs/"],
    audience="support engineers",
    purpose="fine-tuning a support chatbot",
)

# Evaluate RAG
results = run_evaluation(
    endpoint="http://localhost:8000/query",
    model="gpt-4o",  # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")
```

🤖 AI Agent Integration

RAGScore is designed for AI agents and automation:

```bash
# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json

# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing
```
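An agent can wrap the CLI with `subprocess`, branching on the exit code and parsing any JSON it produces. Whether a given ragscore command prints JSON to stdout or only writes it via `--output` is not specified here, so the demo call below uses a stand-in command that prints JSON; in practice you would pass something like `["ragscore", "evaluate", ...]` and read `results.json` if stdout is empty. The `run_cli` helper is illustrative.

```python
import json
import subprocess
import sys

def run_cli(args):
    """Run a command; return (exit_code, parsed JSON from stdout or None)."""
    proc = subprocess.run(args, capture_output=True, text=True)
    try:
        data = json.loads(proc.stdout)
    except json.JSONDecodeError:
        data = None
    return proc.returncode, data

# Stand-in for a real call such as:
#   run_cli(["ragscore", "evaluate", "http://api/query", "--output", "results.json"])
code, data = run_cli([sys.executable, "-c", "print('{\"accuracy\": 0.85}')"])
print(code, data)  # -> 0 {'accuracy': 0.85}
```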

CLI Reference:

| Command | Description |
|---|---|
| `ragscore generate <paths>` | Generate QA pairs from documents |
| `ragscore generate <paths> --audience <who>` | Tailored QA for specific audience |
| `ragscore generate <paths> --purpose <why>` | Focus QA on document purpose |
| `ragscore evaluate <endpoint>` | Evaluate RAG against golden QAs |
| `ragscore evaluate <endpoint> --detailed` | Multi-metric evaluation |
| `ragscore --help` | Show all commands and options |
| `ragscore generate --help` | Show generate options |
| `ragscore evaluate --help` | Show evaluate options |

⚙️ Configuration

Zero config required. Optional environment variables:

```bash
export RAGSCORE_CHUNK_SIZE=512          # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5   # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir   # Working directory
```

🔐 Privacy & Security

| Data | Cloud LLM | Local LLM |
|---|---|---|
| Documents | ✅ Local | ✅ Local |
| Text chunks | ⚠️ Sent to LLM | ✅ Local |
| Generated QAs | ✅ Local | ✅ Local |
| Evaluation results | ✅ Local | ✅ Local |

Compliance: GDPR ✅ • HIPAA ✅ (with local LLMs) • SOC 2 ✅


🧪 Development

```bash
git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest
```

<p align="center"> <b>⭐ Star us on GitHub if RAGScore helps you!</b><br> Made with ❤️ for the RAG community </p>
