Ragaai Catalyst
by bytesagain1
Python SDK for an Agent AI observability, monitoring, and evaluation framework. Topics: ragaai-catalyst, python, agentic-ai.
Installation
claude skill add --url github.com/openclaw/skills/tree/main/skills/bytesagain1/rag-evaluator
Documentation
Rag Evaluator
AI-powered RAG (Retrieval-Augmented Generation) evaluation toolkit. Configure, benchmark, compare, and optimize your RAG pipelines from the command line. Track prompts, evaluations, fine-tuning experiments, costs, and usage — all with persistent local logging and full export capabilities.
Commands
Run rag-evaluator <command> [args] to use.
| Command | Description |
|---|---|
| configure | Configure RAG evaluation settings and parameters |
| benchmark | Run benchmarks against your RAG pipeline |
| compare | Compare results across different RAG configurations |
| prompt | Log and manage prompt templates and variations |
| evaluate | Evaluate RAG output quality and relevance |
| fine-tune | Track fine-tuning experiments and parameters |
| analyze | Analyze evaluation results and identify patterns |
| cost | Track and log API/inference costs |
| usage | Monitor token usage and API call volumes |
| optimize | Log optimization strategies and results |
| test | Run test cases against RAG configurations |
| report | Generate evaluation reports |
| stats | Show summary statistics across all categories |
| export <fmt> | Export data in json, csv, or txt format |
| search <term> | Search across all logged entries |
| recent | Show recent activity from history log |
| status | Health check — version, data dir, disk usage |
| help | Show help and available commands |
| version | Show version (v2.0.0) |
Each domain command (configure, benchmark, compare, etc.) works in two modes:
- Without arguments: displays the most recent 20 entries from that category
- With arguments: logs the input with a timestamp and saves to the category log file
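For example (the logging behavior is documented above; the exact formatting of the listing output is not specified here):
# Log a new entry to the evaluate category
rag-evaluator evaluate "faithfulness=0.88 relevance=0.90"
# Show the 20 most recent evaluate entries
rag-evaluator evaluate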
Data Storage
All data is stored locally in ~/.local/share/rag-evaluator/:
- Each command creates its own log file (e.g., configure.log, benchmark.log)
- A unified history.log tracks all activity across commands
- Entries are stored in timestamp|value pipe-delimited format
- Export supports JSON, CSV, and plain text formats
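Because entries are plain pipe-delimited text, the logs can be inspected directly with standard tools. A sample entry might look like the commented line below; the exact timestamp format the script writes is an assumption here:
# Hypothetical entry in ~/.local/share/rag-evaluator/evaluate.log
# 2025-01-15 10:32:07|faithfulness=0.91 relevance=0.87 coherence=0.93
# Peek at the raw logs with standard utilities
tail -n 5 ~/.local/share/rag-evaluator/history.log
grep "gpt-4" ~/.local/share/rag-evaluator/configure.log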
Requirements
- Bash 4+ with set -euo pipefail strict mode
- Standard Unix utilities: date, wc, du, tail, grep, sed, cat
- No external dependencies or API keys required
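For a feel of what these requirements imply, here is a minimal sketch of a category logger built on the same constraints (strict mode, date, plain append-only files). The function and variable names are hypothetical and not taken from the rag-evaluator source:
#!/usr/bin/env bash
# Hypothetical sketch, not the actual rag-evaluator implementation.
set -euo pipefail

DATA_DIR="${HOME}/.local/share/rag-evaluator"

log_entry() {  # usage: log_entry <category> <value>
  local category="$1" value="$2" ts
  mkdir -p "$DATA_DIR"
  ts="$(date '+%Y-%m-%d %H:%M:%S')"
  # Append timestamp|value to the category log and mirror it in history.log
  printf '%s|%s\n' "$ts" "$value" >> "$DATA_DIR/$category.log"
  printf '%s|%s: %s\n' "$ts" "$category" "$value" >> "$DATA_DIR/history.log"
}

log_entry evaluate "faithfulness=0.91 relevance=0.87"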
When to Use
- Evaluating RAG pipeline quality — log evaluation scores, compare retrieval strategies, and track improvements over time
- Benchmarking different configurations — run benchmarks across embedding models, chunk sizes, or retrieval methods and compare results side by side
- Tracking costs and usage — monitor API costs and token usage across experiments to stay within budget
- Managing prompt engineering — log prompt variations, test them against your pipeline, and analyze which templates perform best
- Generating reports for stakeholders — export evaluation data as JSON/CSV for dashboards, or generate text reports summarizing RAG performance
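An end-to-end loop combining several of these scenarios might look like this (all run names and metric values are illustrative):
# Log two candidate configurations, benchmark each, then compare and report
rag-evaluator configure "run-A: chunks=256 top_k=3"
rag-evaluator configure "run-B: chunks=512 top_k=5"
rag-evaluator benchmark "run-A: recall@5=0.74"
rag-evaluator benchmark "run-B: recall@5=0.83"
rag-evaluator compare "run-B beats run-A on recall@5 (+0.09)"
rag-evaluator report "weekly: promote run-B to default"
rag-evaluator export csv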
Examples
# Configure a new evaluation run
rag-evaluator configure "model=gpt-4 chunks=512 overlap=50 top_k=5"
# Run a benchmark and log results
rag-evaluator benchmark "latency=230ms recall@5=0.82 precision@5=0.71"
# Compare two retrieval strategies
rag-evaluator compare "bm25 vs dense: bm25 recall=0.78, dense recall=0.85"
# Track evaluation scores
rag-evaluator evaluate "faithfulness=0.91 relevance=0.87 coherence=0.93"
# Log API cost for a run
rag-evaluator cost 'run-042: $0.23 (1.2k tokens input, 800 tokens output)'
# View summary statistics
rag-evaluator stats
# Export all data as CSV
rag-evaluator export csv
# Search for specific entries
rag-evaluator search "gpt-4"
# Check recent activity
rag-evaluator recent
# Health check
rag-evaluator status
Output
All commands output to stdout. Redirect to a file if needed:
rag-evaluator report "weekly summary" > report.txt
rag-evaluator export json # saves to ~/.local/share/rag-evaluator/export.json
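Because the logs are plain text, quick summaries are also possible with the same standard utilities the tool relies on (stripping the leading timestamp field assumes the timestamp|value format described under Data Storage):
# Count entries in each category log
wc -l ~/.local/share/rag-evaluator/*.log
# Show just the logged values (drop the timestamp field) from recent benchmarks
sed 's/^[^|]*|//' ~/.local/share/rag-evaluator/benchmark.log | tail -n 10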
Configuration
Set DATA_DIR by modifying the script, or use the default: ~/.local/share/rag-evaluator/
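The data directory lives in a single variable inside the script; the exact assignment shown below is an assumption, but relocating the data amounts to editing that one line:
# Hypothetical line inside the rag-evaluator script
DATA_DIR="${HOME}/.local/share/rag-evaluator"
# Change it to any writable path, for example a project-local directory
DATA_DIR="${HOME}/projects/rag-eval-data"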
Powered by BytesAgain | bytesagain.com | hello@bytesagain.com
Related Skills
Claude API
by anthropics
For development scenarios that integrate the Claude API, Anthropic SDK, or Agent SDK; automatically detects the project language and provides matching examples and default configurations to get an LLM application running quickly.
✎ If you want to wire Claude capabilities into an app or agent, claude-api gets you started fast, stays compatible with the Anthropic and Agent SDKs, and keeps the integration path clear and low-friction.
Prompt Engineering Expert
by alirezarezvani
Covers prompt optimization, few-shot design, structured output, RAG evaluation, and agent workflow orchestration; suited for analyzing token costs, assessing LLM output quality, and building practical AI agent systems.
✎ Ties prompt optimization, LLM evaluation, RAG, and agent design into one methodology; good for anyone who wants to systematically improve their AI development workflow.
Agent Workflow Design
by alirezarezvani
Aimed at production-grade multi-agent orchestration; walks through five workflow patterns (sequential, parallel, hierarchical, event-driven, consensus) and covers handoffs, state management, fault-tolerant retries, context budgeting, and cost optimization; suited for building complex AI collaboration systems.
✎ Helps you unify multi-agent workflow design, orchestration, and automation so complex workflows land more reliably; a fit for teams that want tight control.
Related MCP Servers
Sequential Thinking
Editor's pick · by Anthropic
Sequential Thinking is a reference server that lets AI work through complex problems with a dynamic chain of thought.
✎ This server shows how to have Claude reason step by step the way a person would; useful for developers learning how chain-of-thought is implemented over MCP. Note that it is only a reference example, so do not expect to drop it straight into production.
Knowledge Graph Memory
Editor's pick · by Anthropic
Memory is a persistent memory system built on a local knowledge graph that lets AI retain long-term context.
✎ Fills the "can't remember" gap for AI and agents by persisting long-term context in a local knowledge graph; conversations stay coherent across sessions and the data stays under your control.
PraisonAI
Editor's pick · by mervinpraison
PraisonAI is a low-code AI agent framework with self-reflection and multi-LLM support.
✎ If you need to quickly stand up an AI agent team that runs 24/7 on complex tasks (such as automated research or code generation), PraisonAI's low-code design and multi-platform integrations (e.g., Telegram) make it very quick to pick up. As an unofficial project, though, its ecosystem is less mature than mainstream frameworks like LangChain; best suited to developers willing to experiment.