RAGScore

Name: RAGScore
Rating: 1.5 (30 reviews)
Author: hzyai

Generate QA datasets and evaluate RAG systems. Privacy-first, works with any LLM, and runs locally or in the cloud.

README

Generate QA datasets & evaluate RAG systems in 2 commands

🔒 Privacy-First • ⚡ Lightning Fast • 🤖 Any LLM • 🏠 Local or Cloud • 🌍 Multilingual

⚡ 2-Line RAG Evaluation

bash

# Step 1: Generate QA pairs from your docs
ragscore generate docs/

# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query

That's it. Get accuracy scores and incorrect QA pairs instantly.

code

============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================

❌ 15 Incorrect Pairs:

  1. Q: "What is RAG?"
     Score: 2/5 - Factually incorrect

  2. Q: "How does retrieval work?"
     Score: 3/5 - Incomplete answer

🚀 Quick Start

Install

bash

pip install ragscore              # Core (works with Ollama)
pip install "ragscore[openai]"    # + OpenAI support
pip install "ragscore[notebook]"  # + Jupyter/Colab support
pip install "ragscore[all]"       # + All providers

Option 1: Python API (Notebook-Friendly)

Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.

python

from ragscore import quick_test

# 1. Audit your RAG in one line
result = quick_test(
    endpoint="http://localhost:8000/query",  # Your RAG API
    docs="docs/",                            # Your documents
    n=10,                                    # Number of test questions
)

# 1b. Tailored QA — target specific audiences
result = quick_test(
    endpoint="http://localhost:8000/query",
    docs="docs/",
    audience="developers",                   # Who asks the questions?
    purpose="api-integration",               # What's the document for?
)

# 2. See the report
result.plot()

# 3. Inspect failures
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])

Rich Object API:

result.accuracy - Accuracy score
result.df - Pandas DataFrame of all results
result.plot() - 3-panel visualization (4-panel with detailed=True)
result.corrections - List of items to fix

Option 2: CLI (Production)

Generate QA Pairs

bash

# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."

# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10

# Tailored QA generation — target specific audiences
ragscore generate docs/ --audience developers --purpose faq
ragscore generate docs/ --audience customers --purpose "pre-sales"
ragscore generate docs/ --audience "compliance auditors" --purpose "security audit"

Evaluate Your RAG

bash

# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query

# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json

🔬 Detailed Multi-Metric Evaluation

Go beyond a single score. Pass detailed=True to get five diagnostic dimensions per answer (correctness, completeness, relevance, conciseness, faithfulness), all scored in the same single LLM call.

python

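from ragscore import quick_test

my_rag = "http://localhost:8000/query"  # Your RAG API endpoint, as in Quick Start

result = quick_test(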
    endpoint=my_rag,
    docs="docs/",
    n=10,
    detailed=True,  # ⭐ Enable multi-metric evaluation
)

# Inspect per-question metrics
display(result.df[[
    "question", "score", "correctness", "completeness",
    "relevance", "conciseness", "faithfulness"
]])

# Radar chart + 4-panel visualization
result.plot()

code

==================================================
✅ PASSED: 9/10 correct (90%)
Average Score: 4.3/5.0
Threshold: 70%
──────────────────────────────────────────────────
  Correctness: 4.5/5.0
  Completeness: 4.2/5.0
  Relevance: 4.8/5.0
  Conciseness: 4.1/5.0
  Faithfulness: 4.6/5.0
==================================================

| Metric | What it measures | Scale (1–5) |
|---|---|---|
| Correctness | Semantic match to golden answer | 5 = fully correct |
| Completeness | Covers all key points | 5 = fully covered |
| Relevance | Addresses the question asked | 5 = perfectly on-topic |
| Conciseness | Focused, no filler | 5 = concise and precise |
| Faithfulness | No fabricated claims | 5 = fully faithful |
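
The per-metric lines in the summary can be reproduced from per-question rows by averaging each dimension. The following is a minimal sketch, assuming rows shaped like the `result.df` columns shown earlier. The sample values and the `metric_averages` helper are hypothetical, and this is not necessarily how RAGScore aggregates internally.

```python
from statistics import mean

# The five diagnostic dimensions named in the table above.
METRICS = ["correctness", "completeness", "relevance", "conciseness", "faithfulness"]

def metric_averages(rows):
    """Average each metric (1-5 scale) across per-question rows."""
    return {m: round(mean(row[m] for row in rows), 2) for m in METRICS}

# Hypothetical per-question scores, for illustration only.
rows = [
    {"correctness": 5, "completeness": 4, "relevance": 5, "conciseness": 4, "faithfulness": 5},
    {"correctness": 4, "completeness": 4, "relevance": 5, "conciseness": 4, "faithfulness": 4},
]

averages = metric_averages(rows)
for name, value in averages.items():
    print(f"{name.capitalize()}: {value}/5.0")

# The lowest-scoring dimension is usually the first place to look.
weakest = min(averages, key=averages.get)
print("Weakest dimension:", weakest)
```

A low completeness average with high correctness, for example, suggests answers that are right but omit key points.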

CLI:

bash

ragscore evaluate http://localhost:8000/query --detailed

📓 Full demo notebook — build a mini RAG and test it with detailed metrics.

🎯 Audience & Purpose demo — generate tailored QA for developers, customers, auditors, and more.

🏠 Ollama local demo — 100% private RAG evaluation with no API keys.

🏠 100% Private with Local LLMs

bash

# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query

Perfect for: Healthcare 🏥 • Legal ⚖️ • Finance 🏦 • Research 🔬

Ollama Model Recommendations

RAGScore generates complex structured QA pairs (question + answer + rationale + support span) in JSON format. This requires models with strong instruction-following and JSON output capabilities.

| Model | Size | Min memory | QA Quality | Recommended for |
|---|---|---|---|---|
| `llama3.1:70b` | 40GB | 48GB VRAM | Excellent | GPU server (A100, L40) |
| `qwen2.5:32b` | 18GB | 24GB VRAM | Excellent | GPU server (A10, L20) |
| `llama3.1:8b` | 4.7GB | 8GB VRAM | Good | Best local choice |
| `qwen2.5:7b` | 4.4GB | 8GB VRAM | Good | Good local alternative |
| `mistral:7b` | 4.1GB | 8GB VRAM | Good | Good local alternative |
| `llama3.2:3b` | 2.0GB | 4GB RAM | Fair | CPU-only / testing |
| `qwen2.5:1.5b` | 1.0GB | 2GB RAM | Poor | Not recommended |

Minimum recommended: 7B–8B-class models. Smaller models (1.5B–3B) produce lower-quality support spans and may time out on longer chunks.

Ollama Performance Guide

bash

# Recommended for local machines: 8B model (concurrency is set to 2 automatically for local Ollama)
ollama pull llama3.1:8b
ragscore generate docs/ --provider ollama --model llama3.1:8b

# GPU server (A10/L20): larger model with higher concurrency
ollama pull qwen2.5:32b
ragscore generate docs/ --provider ollama --model qwen2.5:32b --concurrency 5

Expected performance (28 chunks, 5 QA pairs per chunk):

| Hardware | Model | Time | Concurrency |
|---|---|---|---|
| MacBook (CPU) | llama3.2:3b | ~45 min | 2 |
| MacBook (CPU) | llama3.1:8b | ~25 min | 2 |
| A10 (24GB) | llama3.1:8b | ~3–5 min | 5 |
| L20/L40 (48GB) | qwen2.5:32b | ~3–5 min | 5 |
| OpenAI API | gpt-4o-mini | ~2 min | 10 |

RAGScore auto-reduces concurrency to 2 for local Ollama to avoid GPU/CPU contention.
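
To compare the rows, it can help to convert the timings into throughput. The benchmark workload is 28 chunks × 5 QA pairs = 140 pairs. The sketch below divides that by the approximate times from the table; using 4.0 minutes as the midpoint of the "~3–5 min" range is an assumption made here for illustration.

```python
CHUNKS = 28
QAS_PER_CHUNK = 5
total_pairs = CHUNKS * QAS_PER_CHUNK  # 140 QA pairs in the benchmark workload

# Approximate minutes taken from the table; 4.0 is the midpoint of "~3-5 min".
timings_min = {
    "MacBook (CPU) / llama3.2:3b": 45,
    "MacBook (CPU) / llama3.1:8b": 25,
    "A10 (24GB) / llama3.1:8b": 4.0,
    "L20/L40 (48GB) / qwen2.5:32b": 4.0,
    "OpenAI API / gpt-4o-mini": 2,
}

# QA pairs generated per minute for each hardware/model combination.
throughput = {name: total_pairs / minutes for name, minutes in timings_min.items()}
for name, rate in throughput.items():
    print(f"{name}: ~{rate:.1f} QA pairs/min")
```

Scaling these rates to your own chunk count gives a rough estimate only, since the timings themselves are approximate.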

🔌 Supported LLMs

| Provider | Setup | Notes |
|---|---|---|
| Ollama | `ollama serve` | Local, free, private |
| OpenAI | `export OPENAI_API_KEY="sk-..."` | Best quality |
| Anthropic | `export ANTHROPIC_API_KEY="..."` | Long context |
| DashScope | `export DASHSCOPE_API_KEY="..."` | Qwen models |
| vLLM | `export LLM_BASE_URL="..."` | Production-grade |
| Any OpenAI-compatible | `export LLM_BASE_URL="..."` | Groq, Together, etc. |
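
Before a run, a script can check which of the variables from the table are set. This is a minimal sketch: the `configured_providers` helper is hypothetical and only inspects the environment. It does not reflect how RAGScore itself selects a provider.

```python
import os

# Environment variables listed in the provider table above.
PROVIDER_ENV_VARS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "DashScope": "DASHSCOPE_API_KEY",
    "vLLM / OpenAI-compatible": "LLM_BASE_URL",
}

def configured_providers(env=None):
    """Return provider names whose variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]

found = configured_providers()
print("Configured via environment:", found or "none (Ollama needs no key)")
```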

📊 Output Formats

Generated QA Pairs (`output/generated_qas.jsonl`)

json

{
  "id": "abc123",
  "question": "What is RAG?",
  "answer": "RAG (Retrieval-Augmented Generation) combines...",
  "rationale": "This is explicitly stated in the introduction...",
  "support_span": "RAG systems retrieve relevant documents...",
  "difficulty": "medium",
  "source_path": "docs/rag_intro.pdf"
}
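
Because the file is JSON Lines, each line parses independently. The following is a minimal sketch of loading records and checking for the fields shown in the sample. `load_qas` is a hypothetical helper, and the second record (including its `"easy"` difficulty value) is made up for illustration.

```python
import json
from collections import Counter

# Fields that appear in the sample record above.
REQUIRED_FIELDS = {"id", "question", "answer", "rationale",
                   "support_span", "difficulty", "source_path"}

def load_qas(lines):
    """Parse JSONL lines, skipping blanks and rejecting incomplete records."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"line {number}: missing fields {sorted(missing)}")
        records.append(record)
    return records

sample_lines = [
    json.dumps({
        "id": "abc123", "question": "What is RAG?",
        "answer": "RAG (Retrieval-Augmented Generation) combines...",
        "rationale": "This is explicitly stated in the introduction...",
        "support_span": "RAG systems retrieve relevant documents...",
        "difficulty": "medium", "source_path": "docs/rag_intro.pdf",
    }),
    "",  # blank lines are tolerated
    json.dumps({  # hypothetical second record
        "id": "def456", "question": "How does retrieval work?",
        "answer": "...", "rationale": "...", "support_span": "...",
        "difficulty": "easy", "source_path": "docs/rag_intro.pdf",
    }),
]

qas = load_qas(sample_lines)
by_difficulty = Counter(qa["difficulty"] for qa in qas)
print(len(qas), "QA pairs;", dict(by_difficulty))
```

To read the real file, pass an open file handle instead of the sample list, for example `with open("output/generated_qas.jsonl", encoding="utf-8") as f: qas = load_qas(f)`.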

Evaluation Results (`--output results.json`)

json

{
  "summary": {
    "total": 100,
    "correct": 85,
    "incorrect": 15,
    "accuracy": 0.85,
    "avg_score": 4.2
  },
  "incorrect_pairs": [
    {
      "question": "What is RAG?",
      "golden_answer": "RAG combines retrieval with generation...",
      "rag_answer": "RAG is a database system.",
      "score": 2,
      "reason": "Factually incorrect - RAG is not a database"
    }
  ]
}
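
In the sample above the summary fields are related by simple arithmetic (`correct + incorrect = total`, `accuracy = correct / total`), which makes the file easy to sanity-check in a script. The sketch below uses the sample as an inline string; `check_summary` is a hypothetical helper, not part of RAGScore.

```python
import json

# The sample evaluation result from above, inlined so the sketch is self-contained.
RESULTS_JSON = """
{
  "summary": {"total": 100, "correct": 85, "incorrect": 15,
              "accuracy": 0.85, "avg_score": 4.2},
  "incorrect_pairs": [
    {"question": "What is RAG?",
     "golden_answer": "RAG combines retrieval with generation...",
     "rag_answer": "RAG is a database system.",
     "score": 2,
     "reason": "Factually incorrect - RAG is not a database"}
  ]
}
"""

def check_summary(results):
    """Return True if the counts add up and accuracy matches correct/total."""
    s = results["summary"]
    counts_ok = s["correct"] + s["incorrect"] == s["total"]
    accuracy_ok = abs(s["correct"] / s["total"] - s["accuracy"]) < 1e-9
    return counts_ok and accuracy_ok

results = json.loads(RESULTS_JSON)
print("Summary consistent:", check_summary(results))

# List each failing pair with its score and the judge's reason.
for pair in results["incorrect_pairs"]:
    print(f'{pair["score"]}/5  {pair["question"]}  ({pair["reason"]})')
```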

🧪 Python API

python

from ragscore import run_pipeline, run_evaluation

# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)

# Generate tailored QA pairs for specific audiences
run_pipeline(
    paths=["docs/"],
    audience="support engineers",
    purpose="fine-tuning a support chatbot",
)

# Evaluate RAG
results = run_evaluation(
    endpoint="http://localhost:8000/query",
    model="gpt-4o",  # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")

🤖 AI Agent Integration

RAGScore is designed for AI agents and automation:

bash

# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json

# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing

CLI Reference:

| Command | Description |
|---|---|
| `ragscore generate <paths>` | Generate QA pairs from documents |
| `ragscore generate <paths> --audience <who>` | Tailored QA for specific audience |
| `ragscore generate <paths> --purpose <why>` | Focus QA on document purpose |
| `ragscore evaluate <endpoint>` | Evaluate RAG against golden QAs |
| `ragscore evaluate <endpoint> --detailed` | Multi-metric evaluation |
| `ragscore --help` | Show all commands and options |
| `ragscore generate --help` | Show generate options |
| `ragscore evaluate --help` | Show evaluate options |

⚙️ Configuration

Zero config required. Optional environment variables:

bash

export RAGSCORE_CHUNK_SIZE=512          # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5   # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir   # Working directory
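
When launching the CLI from Python, these variables can be set per run rather than globally. This is a sketch: `ragscore_env` is a hypothetical helper, and the values passed are the example values from above, not documented defaults.

```python
import os

def ragscore_env(chunk_size=None, questions_per_chunk=None, work_dir=None, base=None):
    """Copy the environment and add the optional RAGScore variables that were given."""
    env = dict(os.environ if base is None else base)
    overrides = {
        "RAGSCORE_CHUNK_SIZE": chunk_size,
        "RAGSCORE_QUESTIONS_PER_CHUNK": questions_per_chunk,
        "RAGSCORE_WORK_DIR": work_dir,
    }
    # Environment values must be strings; skip anything left unset.
    env.update({k: str(v) for k, v in overrides.items() if v is not None})
    return env

# base={} keeps the printed result small; omit it to inherit the real environment.
env = ragscore_env(chunk_size=512, questions_per_chunk=5, base={})
print(env)
```

The resulting mapping can be passed as `env=` to `subprocess.run(["ragscore", "generate", "docs/"], env=ragscore_env(chunk_size=512))`, which leaves the parent process's environment untouched.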

🔐 Privacy & Security

| Data | Cloud LLM | Local LLM |
|---|---|---|
| Documents | ✅ Local | ✅ Local |
| Text chunks | ⚠️ Sent to LLM | ✅ Local |
| Generated QAs | ✅ Local | ✅ Local |
| Evaluation results | ✅ Local | ✅ Local |

Compliance: GDPR ✅ • HIPAA ✅ (with local LLMs) • SOC 2 ✅

🧪 Development

bash

git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest

🔗 Links

GitHub • PyPI • Issues • Discussions

⭐ Star us on GitHub if RAGScore helps you!
Made with ❤️ for the RAG community
