autoresearch

by baiyunrei2025

Installation

```bash
claude skill add --url github.com/openclaw/skills/tree/main/skills/baiyunrei2025/autoresearch-karpathy
```

Documentation

Autoresearch Skill

This skill enables autonomous AI research experiments based on Andrej Karpathy's autoresearch project: an AI agent modifies neural network training code, runs experiments, evaluates the results, and iteratively improves the model.

Core Concept

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously. The agent modifies the code, trains for 5 minutes, checks whether the result improved, keeps or discards the change, and repeats. You can leave it running overnight and wake up to a log of experiments and (hopefully) a better model.

Key Files

The project has three core files:

  1. prepare.py — Fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
  2. train.py — The single file the agent edits. Contains the full GPT model, the optimizer (Muon + AdamW), and the training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. (see the sketch after this list).
  3. program.md — Baseline instructions for the agent. This file is edited and iterated on by the human.
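
To make the division of labor concrete, here is a minimal sketch of the kind of configuration block an agent might edit inside train.py. The names (GPTConfig and its fields) are illustrative assumptions, not the actual identifiers in the repository:

```python
from dataclasses import dataclass

# Hypothetical illustration: the field names are assumptions,
# not the real contents of train.py.
@dataclass
class GPTConfig:
    depth: int = 8                 # number of transformer blocks
    n_heads: int = 8               # attention heads per block
    d_model: int = 512             # residual stream width
    learning_rate: float = 0.02    # optimizer LR the agent might sweep
    batch_size: int = 32           # sequences per training step

# An "experiment" is often a one-line change here, e.g.
# learning_rate 0.02 -> 0.04, followed by one 5-minute run.
```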

Requirements

  • Single NVIDIA GPU (tested on H100)
  • Python 3.10+
  • uv package manager

Quick Start Workflow

Phase 1: Initial Setup

  1. Clone the repository (if not already done):

     ```bash
     git clone https://github.com/karpathy/autoresearch.git
     cd autoresearch
     ```

  2. Install dependencies:

     ```bash
     uv sync
     ```

  3. Prepare data (one-time setup):

     ```bash
     uv run prepare.py
     ```

Phase 2: Experiment Setup

  1. Agree on a run tag (e.g., based on the date, like mar20)
  2. Create a new branch:

     ```bash
     git checkout -b autoresearch/<tag>
     ```

  3. Initialize the results file:

     ```bash
     echo -e "commit\tval_bpb\tmemory_gb\tstatus\tdescription" > results.tsv
     ```

Phase 3: Autonomous Experimentation Loop

The agent follows this loop indefinitely:

```
LOOP FOREVER:
  1. Look at current git state
  2. Modify train.py with experimental idea
  3. git commit
  4. Run experiment: uv run train.py > run.log 2>&1
  5. Extract results: grep "^val_bpb:\|^peak_vram_mb:" run.log
  6. If crash → analyze logs and fix, or mark as crash
  7. Record results in results.tsv
  8. If improved → keep commit
  9. If not improved → git reset
```
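
The loop above is executed by the agent itself, but its control flow can be pictured as a small driver script. Below is a minimal sketch under stated assumptions: it assumes the summary line val_bpb: <value> appears in run.log (as in the sample output later in this document) and uses a plain git reset to discard a failed experiment. It is a sketch of the loop, not the project's actual harness:

```python
import re
import subprocess

def run(cmd: str) -> subprocess.CompletedProcess:
    """Run a shell command, capturing its output."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

def read_val_bpb(log_path: str = "run.log") -> float | None:
    """Extract the val_bpb summary value, or None if the run crashed."""
    try:
        with open(log_path) as f:
            match = re.search(r"^val_bpb:\s+([\d.]+)", f.read(), re.MULTILINE)
    except FileNotFoundError:
        return None
    return float(match.group(1)) if match else None

best = float("inf")
while True:
    # (1-3) The agent edits train.py with an idea, then commits the candidate.
    run('git commit -am "experiment: <description>"')
    # (4) Train under the fixed 5-minute budget, redirecting output to run.log.
    run("uv run train.py > run.log 2>&1")
    # (5-6) Parse the result; a missing summary line means the run crashed.
    val_bpb = read_val_bpb()
    if val_bpb is not None and val_bpb < best:
        best = val_bpb                    # (8) improved: keep the commit
    else:
        run("git reset --hard HEAD~1")    # (9) not improved or crash: discard
    # (7) Recording the row in results.tsv is omitted here for brevity.
```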

Key Metrics

  • val_bpb (validation bits per byte) — lower is better; independent of vocabulary size (see the conversion sketch after this list)
  • Training time — fixed 5-minute budget per experiment
  • Peak VRAM — memory usage in GB
  • Status — keep, discard, or crash
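
For intuition, bits per byte is the model's average cross-entropy re-expressed per byte of raw text, which is what makes runs with different tokenizers comparable. The real computation lives in prepare.py's evaluation utilities; the sketch below shows only the standard conversion arithmetic:

```python
import math

def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits per byte."""
    total_bits = mean_loss_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Example: 0.69 nats/token over 1000 tokens covering 4300 bytes of text
print(bits_per_byte(0.69, n_tokens=1000, n_bytes=4300))  # ~0.2315
```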

Constraints

What the agent CAN do:

  • Modify train.py (architecture, optimizer, hyperparameters, training loop, etc.)
  • Experiment with different model configurations
  • Run training experiments autonomously

What the agent CANNOT do:

  • Modify prepare.py (read-only)
  • Install new packages or add dependencies
  • Modify the evaluation harness

Quality Criteria

  1. Simplicity: Simpler solutions are preferred over complex ones
  2. Performance: Lower val_bpb is better
  3. Memory: VRAM usage should be reasonable
  4. Stability: Code must run without crashing

Output Format

Each experiment produces a summary:

```
---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
num_steps:        953
num_params_M:     50.3
depth:            8
```
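
A small helper can turn this summary block into a dictionary, for example when filling in results.tsv. A sketch, assuming the key: value layout shown above:

```python
import re

def parse_summary(log_path: str = "run.log") -> dict[str, float]:
    """Collect 'key: value' pairs from the run.log summary block."""
    fields: dict[str, float] = {}
    line_re = re.compile(r"^(\w+):\s+([\d.]+)\s*$")
    with open(log_path) as f:
        for line in f:
            m = line_re.match(line.strip())
            if m:
                fields[m.group(1)] = float(m.group(2))
    return fields

# e.g. parse_summary()["val_bpb"] -> 0.9979
```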

Results Logging

Results are logged to results.tsv (tab-separated):

```
commit	val_bpb	memory_gb	status	description
a1b2c3d	0.997900	44.0	keep	baseline
b2c3d4e	0.993200	44.2	keep	increase LR to 0.04
c3d4e5f	1.005000	44.0	discard	switch to GeLU activation
d4e5f6g	0.000000	0.0	crash	double model width (OOM)
```
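
Recording a row after each experiment is then a one-liner. A sketch matching the schema above (the function name is illustrative):

```python
def log_result(commit: str, val_bpb: float, memory_gb: float,
               status: str, description: str, path: str = "results.tsv") -> None:
    """Append one tab-separated experiment record to results.tsv."""
    with open(path, "a") as f:
        f.write(f"{commit}\t{val_bpb:.6f}\t{memory_gb:.1f}\t{status}\t{description}\n")

# log_result("b2c3d4e", 0.9932, 44.2, "keep", "increase LR to 0.04")
```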

Autonomous Operation

CRITICAL: Once the experiment loop begins, the agent operates autonomously:

  • Do NOT pause to ask the human if you should continue
  • Do NOT ask "should I keep going?" or "is this a good stopping point?"
  • Continue working indefinitely until manually stopped
  • If out of ideas, think harder: read papers, re-analyze code, try radical changes

Use Cases

  1. Overnight experiments: Leave running while sleeping, wake up to results
  2. Architecture search: Automatically explore model architectures
  3. Hyperparameter optimization: Find optimal training parameters
  4. Research automation: Reduce manual experimentation effort

Troubleshooting

Common Issues:

  1. GPU not available: Check the CUDA installation and GPU drivers (see the preflight sketch after this list)
  2. uv not installed: Install uv package manager
  3. Data not prepared: Run uv run prepare.py
  4. Out of memory: Reduce model size or batch size
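
A quick preflight check can catch issues 1 and 4 before an overnight loop starts. A minimal sketch assuming PyTorch, which the GPT training code presumably uses:

```python
import torch

# Fail fast if no GPU is visible before starting an overnight loop.
assert torch.cuda.is_available(), "No CUDA device found: check drivers and CUDA install"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
```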

Error Handling:

  • Crashes are logged as crash status
  • Analyze logs with tail -n 50 run.log
  • Fix simple issues and retry; skip fundamentally broken ideas

Best Practices

  1. Start with baseline: Always run unmodified code first
  2. Incremental changes: Make small, focused modifications
  3. Document experiments: Clear descriptions in results.tsv
  4. Monitor progress: Regularly check results and trends (see the sketch after this list)
  5. Balance exploration/exploitation: Mix radical ideas with incremental improvements
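
For the monitoring step, a few lines of Python can summarize results.tsv at a glance. A sketch against the schema used earlier (the function name is illustrative):

```python
import csv

def best_runs(path: str = "results.tsv", n: int = 5) -> None:
    """Print the n kept experiments with the lowest val_bpb."""
    with open(path) as f:
        rows = [r for r in csv.DictReader(f, delimiter="\t") if r["status"] == "keep"]
    rows.sort(key=lambda r: float(r["val_bpb"]))
    for r in rows[:n]:
        print(f"{r['commit']}  {r['val_bpb']}  {r['description']}")
```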

Integration with Agent Teams

This skill can be combined with the agent-teams-playbook skill for:

  • Multi-agent research coordination
  • Parallel experimentation
  • Specialized roles (architect, optimizer, evaluator)
  • Distributed research workflows

Related Skills

agent-browser

by chulla-ceja

Browser automation CLI for AI agents. Use when the user needs to interact with websites, including navigating pages, filling forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task. Triggers include requests to "open a website", "fill out a form", "click a button", "take a screenshot", "scrape data from a page", "test this web app", "login to a site", "automate browser actions", or any task requiring programmatic web interaction.

接口规范

by alexxxiong

API specification management tool for initializing, updating, querying, and searching API documentation across projects. Triggers: 'API文档', 'API规范', '接口文档', '路由解析', 'apispec', 'API lookup', 'API search'.

investment-research

by caijichang212

Perform structured investment research (投研分析) for a company/stock/ETF/sector using a repeatable framework: fundamentals (financial statements and business model), technical analysis (indicators and key price levels), industry research (industry cycle and competitive landscape), valuation (comparative valuation/scenarios), catalysts and risks, and produce a professional research report + actionable plan. Use when the user asks for: equity/ETF analysis, earnings/financial statement breakdown, peer/industry comparison, valuation ranges, bull/base/bear scenarios, technical trend/support-resistance, or a full research memo.

Related MCP Servers

Puppeteer

by Anthropic

Puppeteer is an MCP server that lets Claude drive a browser automatically for web scraping and testing.

This server removes the tedium of hand-writing Puppeteer scripts and suits developers who need automated web interaction, such as scraping dynamic content or running end-to-end tests. As a reference implementation, though, it may lack production-grade security hardening, so it is best used in a controlled environment.

Fetch

by Anthropic

Fetch is the official MCP reference server that lets an AI fetch web pages and convert them to Markdown.

This server solves the problem of messy formatting when an AI processes raw web content directly, and suits developers who want Claude to analyze online documents or news. As a reference implementation it lacks production-grade security configuration, so you have to handle anti-scraping and privacy risks yourself.

Brave Search

by Anthropic

Brave Search is an MCP server that lets Claude call the Brave Search API directly for real-time web information.

If you want an AI assistant to search for the latest news or technical documentation, this tool bypasses the limits of conventional search and returns structured data directly. It is especially useful for developers who need real-time information, such as checking API updates or tracking competitors. Note that it depends on Brave's API quota, so heavy use may be rate-limited.

Related News

Dan Woods combined Apple's "LLM in a Flash" paper with Andrej Karpathy's autoresearch pattern, using Claude Code to generate the implementation, and got the Qwen3.5-397B-A17B model running efficiently on a MacBook Pro M3 Max. The model's expert weights were quantized to 2-bit while the non-expert parts kept their original precision; the final code and paper have been open-sourced.

In-depth · Simon Willison · March 18 · 2 min

autoresearch is an AI-agent-driven automated research project, based on a simplified single-GPU nanochat implementation. The developer sets research goals in a Markdown instruction file (program.md), while the agent autonomously modifies the training code (train.py); each experiment runs for a fixed 5 minutes, and the model is iteratively improved by comparing validation loss (val_bpb). The project is deliberately minimal, consisting of three core files for data preparation, training, and instructions; it runs with agents such as Claude/Codex, and comes with tuning advice for smaller compute budgets and community forks.

Guide · March 8 · 5 min
