bigdata
by BytesAgain
Split large files, run parallel processing, and stream batch analysis. Use when sampling datasets, aggregating logs, or transforming bulk data.
Installation

```bash
claude skill add --url github.com/openclaw/skills/tree/main/skills/bytesagain3/bigdata
```

Documentation
BigData
A comprehensive data processing toolkit for ingesting, transforming, querying, filtering, aggregating, and managing data workflows — all from the command line with local timestamped log storage.
Commands
| Command | Description |
|---|---|
| `bigdata ingest <input>` | Ingest raw data into the system. Without args, shows recent ingest entries |
| `bigdata transform <input>` | Record a data transformation step. Without args, shows recent transforms |
| `bigdata query <input>` | Log and track data queries. Without args, shows recent queries |
| `bigdata filter <input>` | Apply and record data filters. Without args, shows recent filters |
| `bigdata aggregate <input>` | Record aggregation operations. Without args, shows recent aggregations |
| `bigdata visualize <input>` | Log visualization tasks. Without args, shows recent visualizations |
| `bigdata export <input>` | Log export operations. Without args, shows recent exports |
| `bigdata sample <input>` | Record data sampling operations. Without args, shows recent samples |
| `bigdata schema <input>` | Track schema definitions and changes. Without args, shows recent schemas |
| `bigdata validate <input>` | Log data validation checks. Without args, shows recent validations |
| `bigdata pipeline <input>` | Record pipeline configurations. Without args, shows recent pipelines |
| `bigdata profile <input>` | Log data profiling operations. Without args, shows recent profiles |
| `bigdata stats` | Show summary statistics across all entry types |
| `bigdata search <term>` | Search across all log entries for a keyword |
| `bigdata recent` | Show the 20 most recent activity entries from the history log |
| `bigdata status` | Health check — version, data dir, total entries, disk usage, last activity |
| `bigdata help` | Show all available commands |
| `bigdata version` | Print version (v2.0.0) |
Each data command (ingest, transform, query, etc.) works the same way (a rough sketch of the pattern follows this list):
- With arguments: saves the entry with a timestamp to its dedicated `.log` file and records it in the activity history
- Without arguments: displays the 20 most recent entries from that command's log
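The page doesn't reproduce the script's source, so the following is only a minimal sketch of the behavior described above; the helper name `log_or_show` and its internals are illustrative assumptions, not the real implementation:

```bash
#!/usr/bin/env bash
set -euo pipefail   # the strict mode the Requirements section mentions

DATA_DIR="${HOME}/.local/share/bigdata"
mkdir -p "$DATA_DIR"

# Hypothetical helper sketching the shared append-or-show pattern.
# With a value: append "YYYY-MM-DD HH:MM|<value>" to <command>.log and
# mirror it into history.log. Without a value: show the 20 newest entries.
log_or_show() {
  local cmd="$1"
  local logfile="${DATA_DIR}/${cmd}.log"
  if [ "$#" -gt 1 ]; then
    shift
    local stamp
    stamp="$(date '+%Y-%m-%d %H:%M')"
    printf '%s|%s\n' "$stamp" "$*" >> "$logfile"
    printf '%s|[%s] %s\n' "$stamp" "$cmd" "$*" >> "${DATA_DIR}/history.log"
    echo "Logged to ${cmd}.log"
  elif [ -f "$logfile" ]; then
    tail -n 20 "$logfile"
  else
    echo "No ${cmd} entries yet."
  fi
}

log_or_show ingest "customer_orders_2024.csv — 1.2M rows loaded"
log_or_show ingest   # prints the 20 most recent ingest entries
```

In this sketch every data command is a thin wrapper over one helper, which is consistent with the identical with-args/without-args behavior the docs describe.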
Data Storage
All data is stored locally in plain-text log files:
```
~/.local/share/bigdata/
├── ingest.log      # Ingested data entries
├── transform.log   # Transformation records
├── query.log       # Query log
├── filter.log      # Filter operations
├── aggregate.log   # Aggregation records
├── visualize.log   # Visualization tasks
├── export.log      # Export operations
├── sample.log      # Sampling records
├── schema.log      # Schema definitions
├── validate.log    # Validation checks
├── pipeline.log    # Pipeline configurations
├── profile.log     # Profiling results
└── history.log     # Unified activity log with timestamps
```
Each entry is stored as `YYYY-MM-DD HH:MM|<value>` for easy parsing and export.
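Because every line follows that one pattern, the logs can be sliced with ordinary UNIX tools. A few illustrative one-liners (the date shown is made up; only the paths and format come from this page):

```bash
# extract just the values, dropping the timestamp prefix
cut -d'|' -f2- ~/.local/share/bigdata/ingest.log

# everything recorded on a given day, across all commands
grep '^2024-03-15' ~/.local/share/bigdata/history.log

# the 5 oldest transform entries
head -n 5 ~/.local/share/bigdata/transform.log
```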
Requirements
- Bash 4.0+ (uses `set -euo pipefail`)
- Standard UNIX utilities: `date`, `wc`, `du`, `grep`, `head`, `tail`, `cat`
- No external dependencies or API keys required
- Works offline — all data stays on your machine
When to Use
- Data pipeline tracking — Record each step of a multi-stage data workflow (ingest → transform → validate → export) with full timestamps for audit trails
- Quick data logging — Capture observations, measurements, or notes about datasets directly from the terminal without opening a separate app
- Schema management — Keep track of schema definitions, changes, and validation rules as your data evolves over time
- Data quality monitoring — Log validation checks and profiling results to build a history of data quality metrics
- Workflow documentation — Use search and recent commands to review what data operations were performed, when, and in what order
Examples
Log a complete data workflow
```bash
# Ingest raw data
bigdata ingest "customer_orders_2024.csv — 1.2M rows loaded"

# Transform it
bigdata transform "normalize dates to ISO-8601, trim whitespace, deduplicate"

# Validate the output
bigdata validate "all required fields present, no nulls in customer_id"

# Record the schema
bigdata schema "orders: id(int), customer_id(int), amount(decimal), date(date)"

# Export when ready
bigdata export "final dataset pushed to analytics warehouse"
```
Search and review activity
```bash
# Search across all logs for a keyword
bigdata search "customer"

# Check overall statistics
bigdata stats

# View recent activity across all commands
bigdata recent

# Health check
bigdata status
```
Pipeline and profiling
```bash
# Define a pipeline
bigdata pipeline "daily-etl: ingest → clean → validate → load — runs at 02:00 UTC"

# Profile a dataset
bigdata profile "users table: 500K rows, 12 columns, 0.3% nulls in email field"

# Sample data for testing
bigdata sample "random 10% sample from transactions for QA testing"

# Record an aggregation
bigdata aggregate "monthly revenue by region — Q1 totals computed"
```
Filter and query tracking
```bash
# Log a filter operation
bigdata filter "removed records older than 2020-01-01, kept 850K of 1.2M rows"

# Track a query
bigdata query "SELECT region, SUM(revenue) FROM orders GROUP BY region"

# Log a visualization
bigdata visualize "bar chart: monthly revenue trend, exported as PNG"
```
Output
All commands print confirmation to stdout. Data is persisted in `~/.local/share/bigdata/`. Use `bigdata stats` for a summary or `bigdata search <term>` to find specific entries across all logs.
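And since everything is plain text, the logs are easy to back up or hand off to other tools. These one-liners are illustrations rather than part of the skill itself:

```bash
# per-command entry counts, similar in spirit to `bigdata stats`
wc -l ~/.local/share/bigdata/*.log

# convert a log to two-column CSV (rough sketch: values containing
# commas would need proper quoting)
sed 's/|/,/' ~/.local/share/bigdata/export.log > export.csv

# snapshot the whole data directory
tar czf "bigdata-backup-$(date +%Y%m%d).tar.gz" -C ~/.local/share bigdata
```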
Powered by BytesAgain | bytesagain.com | hello@bytesagain.com
Related Skills
Spreadsheet Processing
by anthropics
Reads, writes, repairs, cleans, reformats, computes formulas, and converts between .xlsx, .xlsm, .csv, and .tsv files. Suited to modifying existing spreadsheets, generating new reports, or turning messy data into deliverable-grade spreadsheets.
✎ A real time-saver for Excel/CSV tasks: it reads, writes, repairs, cleans, and converts directly, and is especially good at turning chaotic spreadsheets into deliverable-grade files.
PDF Processing
by anthropics
Reach for it when you need PDF reading and writing, text and table extraction, merging and splitting, rotation and watermarking, form filling, or encryption and decryption. It can also extract images, generate new PDFs, and turn scans into searchable documents via OCR.
✎ Stop hopping between tools for PDF chores: text and table extraction, merge/split, and OCR in one place, and even scanned documents become searchable.
Word Documents
by anthropics
Covers creating, reading, editing, and restructuring Word/.docx documents. Suited to generating reports, memos, letters, and templates, and handles tables of contents, headers and footers, page numbers, image replacement, find and replace, tracked changes and comments, and content extraction.
✎ Handles .docx creation, rewriting, and fine layout; tables of contents, batch replacement, tracked changes and comments, and image updates can all be automated, which is especially handy for formal documents.
Related MCP Servers
Filesystem
Editor's pick · by Anthropic
Filesystem is the official MCP reference server that lets LLMs safely read and write the local filesystem.
✎ This server solves the pain point of letting Claude operate on local files directly, e.g. automatically organizing documents or generating code files. Good for developers who need automated file handling, but note it is only a reference implementation; harden it yourself for production use.
Desktop Commander
by wonderwhy-er
Desktop Commander is an MCP server that lets AI execute terminal commands and manage files and processes directly.
✎ This tool solves the pain point of AI having no direct access to the local environment; it suits developers who need automated script debugging or batch file processing. It lets you drive the terminal in natural language, but keep permissions tight: letting an AI run rm -rf is no joke.
EdgarTools
Editor's pick · by dgunning
EdgarTools is an open-source Python library that parses SEC EDGAR filings without an API key.
✎ This tool solves the pain point of financial data access: it lets AI read structured filings directly, e.g. having Claude analyze Apple's 10-K. Good for quants and fintech developers building data pipelines quickly. Note that it depends on the stability of the SEC website and may lag at peak times.