Ragaai Catalyst

by bytesagain1

Python SDK for an AI agent observability, monitoring, and evaluation framework. Tags: ragaai-catalyst, python, agentic-ai.

3.7k · AI & Agents · Not scanned · March 23, 2026

Installation

claude skill add --url github.com/openclaw/skills/tree/main/skills/bytesagain1/rag-evaluator

Documentation

Rag Evaluator

AI-powered RAG (Retrieval-Augmented Generation) evaluation toolkit. Configure, benchmark, compare, and optimize your RAG pipelines from the command line. Track prompts, evaluations, fine-tuning experiments, costs, and usage — all with persistent local logging and full export capabilities.

Commands

Run rag-evaluator <command> [args] to use.

Command          Description
configure        Configure RAG evaluation settings and parameters
benchmark        Run benchmarks against your RAG pipeline
compare          Compare results across different RAG configurations
prompt           Log and manage prompt templates and variations
evaluate         Evaluate RAG output quality and relevance
fine-tune        Track fine-tuning experiments and parameters
analyze          Analyze evaluation results and identify patterns
cost             Track and log API/inference costs
usage            Monitor token usage and API call volumes
optimize         Log optimization strategies and results
test             Run test cases against RAG configurations
report           Generate evaluation reports
stats            Show summary statistics across all categories
export <fmt>     Export data in json, csv, or txt format
search <term>    Search across all logged entries
recent           Show recent activity from history log
status           Health check: version, data dir, disk usage
help             Show help and available commands
version          Show version (v2.0.0)

Each domain command (configure, benchmark, compare, etc.) works in two modes:

  • Without arguments: displays the most recent 20 entries from that category
  • With arguments: logs the input with a timestamp and saves to the category log file
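A minimal sketch of this two-mode pattern (assumed internals for illustration only; the actual script may differ, and a temp directory stands in for ~/.local/share/rag-evaluator/):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in data dir so the sketch does not touch real logs
DATA_DIR="$(mktemp -d)"
log_file="$DATA_DIR/configure.log"

configure() {
  if [ "$#" -eq 0 ]; then
    # Without arguments: show the 20 most recent entries
    [ -f "$log_file" ] && tail -n 20 "$log_file"
  else
    # With arguments: append a timestamp|value entry
    printf '%s|%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$log_file"
  fi
}

configure "model=gpt-4 chunks=512"   # logging mode
configure                            # listing mode: prints the entry above
```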

Data Storage

All data is stored locally in ~/.local/share/rag-evaluator/:

  • Each command creates its own log file (e.g., configure.log, benchmark.log)
  • A unified history.log tracks all activity across commands
  • Entries are stored in timestamp|value pipe-delimited format
  • Export supports JSON, CSV, and plain text formats
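Because entries are plain timestamp|value lines, standard tools can slice them. For example (the sample lines below simply mirror the documented format; the real logs live under ~/.local/share/rag-evaluator/):

```shell
# Write a sample log in the documented timestamp|value format,
# then split it into fields with cut (the pipe is the delimiter).
log="$(mktemp)"
cat > "$log" <<'EOF'
2026-03-23 10:01:12|model=gpt-4 chunks=512
2026-03-23 10:05:44|latency=230ms recall@5=0.82
EOF

cut -d'|' -f1 "$log"   # timestamps only
cut -d'|' -f2- "$log"  # logged values only
```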

Requirements

  • Bash 4+ with set -euo pipefail strict mode
  • Standard Unix utilities: date, wc, du, tail, grep, sed, cat
  • No external dependencies or API keys required
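The strict mode mentioned above is the standard Bash header shown here (a generic sketch, not the script's actual source):

```shell
#!/usr/bin/env bash
# -e: abort on any failing command
# -u: treat unset variables as errors
# -o pipefail: a pipeline fails if any stage in it fails
set -euo pipefail
```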

When to Use

  1. Evaluating RAG pipeline quality — log evaluation scores, compare retrieval strategies, and track improvements over time
  2. Benchmarking different configurations — run benchmarks across embedding models, chunk sizes, or retrieval methods and compare results side by side
  3. Tracking costs and usage — monitor API costs and token usage across experiments to stay within budget
  4. Managing prompt engineering — log prompt variations, test them against your pipeline, and analyze which templates perform best
  5. Generating reports for stakeholders — export evaluation data as JSON/CSV for dashboards, or generate text reports summarizing RAG performance

Examples

```bash
# Configure a new evaluation run
rag-evaluator configure "model=gpt-4 chunks=512 overlap=50 top_k=5"

# Run a benchmark and log results
rag-evaluator benchmark "latency=230ms recall@5=0.82 precision@5=0.71"

# Compare two retrieval strategies
rag-evaluator compare "bm25 vs dense: bm25 recall=0.78, dense recall=0.85"

# Track evaluation scores
rag-evaluator evaluate "faithfulness=0.91 relevance=0.87 coherence=0.93"

# Log API cost for a run
rag-evaluator cost "run-042: $0.23 (1.2k tokens input, 800 tokens output)"

# View summary statistics
rag-evaluator stats

# Export all data as CSV
rag-evaluator export csv

# Search for specific entries
rag-evaluator search "gpt-4"

# Check recent activity
rag-evaluator recent

# Health check
rag-evaluator status
```

Output

All commands output to stdout. Redirect to a file if needed:

```bash
rag-evaluator report "weekly summary" > report.txt
rag-evaluator export json  # saves to ~/.local/share/rag-evaluator/export.json
```

Configuration

Set DATA_DIR by modifying the script, or use the default: ~/.local/share/rag-evaluator/
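Hypothetically, the relevant assignment near the top of the script might look like the sketch below (the DATA_DIR name comes from this doc; the environment-override form is a suggestion, not a documented feature):

```shell
# Documented default location, optionally overridable via the environment
DATA_DIR="${DATA_DIR:-$HOME/.local/share/rag-evaluator}"
mkdir -p "$DATA_DIR"
```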


Powered by BytesAgain | bytesagain.com | hello@bytesagain.com

Related Skills

Claude API

by anthropics

Universal
Popular

For development that integrates the Claude API, Anthropic SDK, or Agent SDK: auto-detects your project's language and provides matching examples and default configuration so you can build LLM applications quickly.

If you want to wire Claude's capabilities into an app or agent, claude-api is quick to pick up, compatible with the Anthropic and Agent SDKs, and offers a clear, low-friction integration path.

AI & Agents
Not scanned · 109.6k

Prompt Engineering Expert

by alirezarezvani

Universal
Popular

Covers prompt optimization, few-shot design, structured output, RAG evaluation, and agent workflow orchestration; well suited to analyzing token costs, evaluating LLM output quality, and building practical AI agent systems.

Ties prompt optimization, LLM evaluation, RAG, and agent design into a single methodology; for anyone who wants to systematically raise their AI development productivity.

AI & Agents
Not scanned · 9.0k

Agent Workflow Design

by alirezarezvani

Universal
Popular

For production-grade multi-agent orchestration: lays out five workflow designs (sequential, parallel, hierarchical, event-driven, and consensus) and covers handoffs, state management, fault-tolerant retries, context budgeting, and cost optimization. Suited to building complex AI collaboration systems.

Unifies multi-agent workflow design, orchestration, and automation so even complex workflows land reliably; a fit for teams that want strong control.

AI & Agents
Not scanned · 9.0k

Related MCP Servers

Sequential Thinking

Editor's Pick

by Anthropic

Popular

Sequential Thinking is a reference server that lets AI solve complex problems through dynamic chains of thought.

This server demonstrates how to make Claude reason step by step, the way a person would, and is a good way for developers to learn MCP chain-of-thought implementations. Note that it is only a reference example; don't expect to run it in production as-is.

AI & Agents
82.9k

Knowledge Graph Memory

Editor's Pick

by Anthropic

Popular

Memory is a persistent memory system built on a local knowledge graph, letting AI retain long-term context.

It fills the "can't remember" gap for AI and agents: a local knowledge graph accumulates long-term context, making multi-turn conversations smarter while keeping your data under your control.

AI & Agents
82.9k

PraisonAI

Editor's Pick

by mervinpraison

Popular

PraisonAI is a low-code AI agent framework with self-reflection and multi-LLM support.

If you need to quickly stand up a team of AI agents that runs 24/7 on complex tasks (such as automated research or code generation), PraisonAI's low-code design and multi-platform integrations (e.g., Telegram) make it very fast to get started. As an unofficial project, though, its ecosystem is less mature than mainstream frameworks like LangChain, so it suits developers willing to experiment.

AI & Agents
6.4k
