ETL Pipeline

etl

by BytesAgain

Build ETL pipelines with data ingestion, cleaning, and validation steps. Use when ingesting sources, transforming formats, validating data, or scheduling loads.

4.2k · Data & Storage · Not scanned · March 23, 2026

Installation

claude skill add --url github.com/openclaw/skills/tree/main/skills/bytesagain1/etl

Documentation

ETL

Extract-Transform-Load data toolkit (v2.0.0). Record and manage data pipeline activities across the full ETL lifecycle — ingest, transform, query, filter, aggregate, visualize, export, sample, schema definition, validation, pipeline orchestration, and data profiling. Each command logs timestamped entries to its own log file, giving you a structured record of all data operations.

Commands

| Command | Description |
|---------|-------------|
| `etl ingest <input>` | Record a data ingestion event (source, format, row count, etc.). Without args, shows recent ingest entries. |
| `etl transform <input>` | Log a transformation step (column rename, type cast, normalization, etc.). Without args, shows recent transforms. |
| `etl query <input>` | Record a query operation or SQL statement. Without args, shows recent queries. |
| `etl filter <input>` | Log a filtering rule or condition applied to data. Without args, shows recent filters. |
| `etl aggregate <input>` | Record an aggregation step (GROUP BY, SUM, AVG, etc.). Without args, shows recent aggregations. |
| `etl visualize <input>` | Log a visualization request or chart configuration. Without args, shows recent visualizations. |
| `etl export <input>` | Record an export operation (destination, format, row count). Without args, shows recent exports. |
| `etl sample <input>` | Log a data sampling step (sample size, method, seed). Without args, shows recent samples. |
| `etl schema <input>` | Record a schema definition or schema change. Without args, shows recent schema entries. |
| `etl validate <input>` | Log a data validation rule or result. Without args, shows recent validations. |
| `etl pipeline <input>` | Record a pipeline configuration or execution step. Without args, shows recent pipeline entries. |
| `etl profile <input>` | Log a data profiling result (null counts, distributions, anomalies). Without args, shows recent profiles. |
| `etl stats` | Show summary statistics: entry counts per category, total entries, data size, and earliest record date. |
| `etl export <fmt>` | Export all logged data to a file. Supported formats: json, csv, txt. (Note: this is a different code path from the export log command — it exports the tool's own data.) |
| `etl search <term>` | Search across all log files for a keyword (case-insensitive). |
| `etl recent` | Show the 20 most recent entries from the activity history log. |
| `etl status` | Health check: version, data directory, total entries, disk usage, last activity. |
| `etl help` | Show the built-in help with all available commands. |
| `etl version` | Print the current version (v2.0.0). |

Data Storage

All data is stored as plain-text log files in ~/.local/share/etl/:

  • Per-command logs — Each command (ingest, transform, query, etc.) writes to its own .log file (e.g., ingest.log, transform.log).
  • History log — Every operation is also appended to history.log with a timestamp and command name.
  • Export files — Generated in the same directory as export.json, export.csv, or export.txt.

Entries are stored in timestamp|value format, making them easy to grep, parse, or pipe into downstream tools.
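For instance, the timestamp|value format slices cleanly with standard coreutils. A minimal sketch (the sample file and its entries are illustrative, not real tool output; actual logs live in ~/.local/share/etl/):

```shell
# Create a sample log in the timestamp|value format described above
# (illustrative entries in a scratch directory).
mkdir -p /tmp/etl-demo
cat > /tmp/etl-demo/transform.log <<'EOF'
2026-03-23 02:00:01|Normalize email to lowercase
2026-03-23 02:00:02|Cast created_at to UTC timestamp
EOF

# Field 1 is the timestamp, field 2 the logged value
cut -d'|' -f2 /tmp/etl-demo/transform.log

# Case-insensitive keyword filtering, in the spirit of `etl search`
grep -i 'utc' /tmp/etl-demo/transform.log
```

The same pattern works for any of the per-command logs, since they all share one delimiter.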

Requirements

  • Bash 4.0+ (uses set -euo pipefail)
  • coreutils: date, wc, du, head, tail, grep, basename, cut
  • No external dependencies, API keys, or network access required
  • Works fully offline on any POSIX-compatible system

When to Use

  1. Logging data pipeline steps — Record each stage of your ETL process (ingest → transform → validate → export) with timestamps, creating a complete audit trail of data movements.
  2. Schema management and validation — Use schema to document table structures and validate to log data quality rules and their pass/fail results.
  3. Data profiling and exploration — Use profile to record column statistics, null rates, and distribution anomalies; use sample to log sampling parameters for reproducibility.
  4. Pipeline orchestration tracking — Use pipeline to record multi-step workflow configurations, execution order, and dependencies between ETL stages.
  5. Cross-team data operations review — Run stats for aggregate counts, search to find specific operations by keyword, and export json to share pipeline logs with team members or load into dashboards.
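The audit-trail idea behind use case 1 can be approximated even without the tool installed. This sketch mimics the append-only timestamp|value logging described under Data Storage; the log_step helper and the /tmp paths are assumptions for illustration, not part of etl:

```shell
# Hypothetical helper mirroring etl's append-only logging
# (log_step and LOGDIR are illustrative, not the tool's internals).
LOGDIR=/tmp/etl-audit
mkdir -p "$LOGDIR"

log_step() {  # usage: log_step <category> <message>
  printf '%s|%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$2" >> "$LOGDIR/$1.log"
}

log_step ingest   "s3://data-lake/raw/users_2024.csv"
log_step validate "NOT NULL check on user_id: 0 violations"
log_step export   "loaded users_dim into postgres"

# Each stage now has its own timestamped log line
wc -l < "$LOGDIR/ingest.log"
```

Each category gets its own .log file, so the trail stays grep-able per stage while the timestamps preserve execution order across stages.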

Examples

```bash
# Log a data ingestion from S3
etl ingest "s3://data-lake/raw/users_2024.csv — 1.2M rows, CSV format"

# Record a transformation step
etl transform "Normalize email to lowercase, cast created_at to UTC timestamp"

# Log a validation rule
etl validate "NOT NULL check on user_id: 0 violations out of 1,200,000 rows"

# Record schema for a new table
etl schema "users_dim: id INT PK, email VARCHAR(255), created_at TIMESTAMP, country CHAR(2)"

# Define a pipeline
etl pipeline "daily_user_load: ingest(s3) -> dedupe -> validate -> load(postgres)"

# Search for anything related to 'users'
etl search users

# Export all ETL logs to CSV for analysis
etl export csv

# View summary statistics
etl stats

# Check system health
etl status
```

Tips

  • Run any data command without arguments to see recent entries (e.g., etl ingest shows the last 20 ingest entries).
  • Use etl recent for a quick overview of all activity across all categories.
  • Combine with cron to auto-log pipeline runs: 0 2 * * * etl pipeline "nightly_load completed at $(date)"
  • Back up your data by copying ~/.local/share/etl/ to your preferred backup location.
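One way to follow the last tip is a dated tar archive. A minimal sketch, where a scratch copy stands in for ~/.local/share/etl/ so the example is self-contained (all paths here are assumptions):

```shell
# A scratch directory stands in for ~/.local/share/etl/
SRC=/tmp/etl-data
mkdir -p "$SRC"
echo '2026-03-23 02:00:00|nightly_load completed' > "$SRC/pipeline.log"

# Archive the whole data directory; -C keeps paths relative
tar -czf /tmp/etl-backup.tar.gz -C "$(dirname "$SRC")" "$(basename "$SRC")"

# List the archive contents to verify the backup
tar -tzf /tmp/etl-backup.tar.gz
```

Since the logs are plain text, the archives also compress well and restore with a single `tar -xzf`.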

Powered by BytesAgain | bytesagain.com | hello@bytesagain.com
