Runbook Generator

Universal

by alirezarezvani

Scans a repository to auto-detect CI/CD, databases, containers, and hosting environments, then generates production-grade operational runbooks with executable commands, verification checks, rollbacks, escalation paths, and time estimates, and flags stale documentation when configuration changes.

Automatically turns tedious operational procedures and incident-handling experience into executable runbooks, helping DevOps teams capture knowledge and speed up incident response.


Install

claude skill add --url github.com/alirezarezvani/claude-skills/tree/main/engineering/runbook-generator

Documentation

Tier: POWERFUL
Category: Engineering
Domain: DevOps / Site Reliability Engineering


Overview

Analyze a codebase and generate production-grade operational runbooks. Detects your stack (CI/CD, database, hosting, containers), then produces step-by-step runbooks with copy-paste commands, verification checks, rollback procedures, escalation paths, and time estimates. Keeps runbooks fresh with staleness detection linked to config file modification dates.


Core Capabilities

  • Stack detection — auto-identify CI/CD, database, hosting, orchestration from repo files
  • Runbook types — deployment, incident response, database maintenance, scaling, monitoring setup
  • Format discipline — numbered steps, copy-paste commands, ✅ verification checks, time estimates
  • Escalation paths — L1 → L2 → L3 with contact info and decision criteria
  • Rollback procedures — every deployment step has a corresponding undo
  • Staleness detection — runbook sections reference config files; flag when source changes
  • Testing methodology — dry-run framework for staging validation, quarterly review cadence

When to Use

Use when:

  • A codebase has no runbooks and you need to bootstrap them fast
  • Existing runbooks are outdated or incomplete (point at the repo, regenerate)
  • Onboarding a new engineer who needs clear operational procedures
  • Preparing for an incident response drill or audit
  • Setting up monitoring and on-call rotation from scratch

Skip when:

  • The system is too early-stage to have stable operational patterns
  • Runbooks already exist and only need minor updates (edit directly)

Stack Detection

When given a repo, scan for these signals before writing a single runbook line:

bash
# CI/CD
ls .github/workflows/     → GitHub Actions
ls .gitlab-ci.yml         → GitLab CI
ls Jenkinsfile            → Jenkins
ls .circleci/             → CircleCI
ls bitbucket-pipelines.yml → Bitbucket Pipelines

# Database
grep -r "postgresql\|postgres\|pg" package.json pyproject.toml → PostgreSQL
grep -r "mysql\|mariadb"           package.json               → MySQL
grep -r "mongodb\|mongoose"        package.json               → MongoDB
grep -r "redis"                    package.json               → Redis
ls prisma/schema.prisma            → Prisma ORM (check provider field)
ls drizzle.config.*                → Drizzle ORM

# Hosting
ls vercel.json                     → Vercel
ls railway.toml                    → Railway
ls fly.toml                        → Fly.io
ls .ebextensions/                  → AWS Elastic Beanstalk
ls terraform/  ls *.tf             → Custom AWS/GCP/Azure (check provider)
ls kubernetes/ ls k8s/             → Kubernetes
ls docker-compose.yml              → Docker Compose

# Framework
ls next.config.*                   → Next.js
ls nuxt.config.*                   → Nuxt
ls svelte.config.*                 → SvelteKit
cat package.json | jq '.scripts'   → Check build/start commands

Map detected stack → runbook templates. A Next.js + PostgreSQL + Vercel + GitHub Actions repo needs:

  • Deployment runbook (Vercel + GitHub Actions)
  • Database runbook (PostgreSQL backup, migration, vacuum)
  • Incident response (with Vercel logs + pg query debugging)
  • Monitoring setup (Vercel Analytics, pg_stat, alerting)
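The signal table and the mapping above can be collapsed into a small detection script. A minimal sketch, assuming it is pointed at a checkout of the repo; the `detect_stack` name, the component labels, and the subset of signals checked are illustrative, not part of the skill:

```bash
# detect_stack <repo-dir>
# Print detected stack components for a repo, using a subset of the
# signals from the table above. One "category:component" line per hit.
detect_stack() {
  local repo=$1
  local out=()
  [ -d "$repo/.github/workflows" ]    && out+=("ci:github-actions")
  [ -f "$repo/.gitlab-ci.yml" ]       && out+=("ci:gitlab-ci")
  [ -f "$repo/Jenkinsfile" ]          && out+=("ci:jenkins")
  [ -f "$repo/prisma/schema.prisma" ] && out+=("orm:prisma")
  [ -f "$repo/vercel.json" ]          && out+=("host:vercel")
  [ -f "$repo/fly.toml" ]             && out+=("host:fly")
  [ -f "$repo/docker-compose.yml" ]   && out+=("host:docker-compose")
  if [ -f "$repo/package.json" ] && grep -q '"next"' "$repo/package.json"; then
    out+=("framework:nextjs")
  fi
  if [ "${#out[@]}" -eq 0 ]; then
    echo "no known stack signals found"
  else
    printf '%s\n' "${out[@]}"
  fi
}

# Usage: detect_stack /path/to/repo
```

A real implementation would also read the Prisma `provider` field and the Terraform provider blocks, as the table notes; this sketch stops at file-presence checks.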

Runbook Types

1. Deployment Runbook

markdown
# Deployment Runbook — [App Name]
**Stack:** Next.js 14 + PostgreSQL 15 + Vercel  
**Last verified:** 2025-03-01  
**Source configs:** vercel.json (modified: git log -1 --format=%ci -- vercel.json)  
**Owner:** Platform Team  
**Est. total time:** 15–25 min  

---

## Pre-deployment Checklist
- [ ] All PRs merged to main
- [ ] CI passing on main (GitHub Actions green)
- [ ] Database migrations tested in staging
- [ ] Rollback plan confirmed

## Steps

### Step 1 — Run CI checks locally (3 min)
```bash
pnpm test
pnpm lint
pnpm build
```

✅ Expected: All pass with 0 errors. Build output in .next/

### Step 2 — Apply database migrations (5 min)

```bash
# Staging first
DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
```

✅ Expected: "All migrations have been successfully applied."

```bash
# Verify the migration applied
psql $STAGING_DATABASE_URL -c "SELECT migration_name, finished_at FROM _prisma_migrations ORDER BY finished_at DESC LIMIT 1;"
```

✅ Expected: Migration table shows a new entry with today's date

### Step 3 — Deploy to production (5 min)

```bash
git push origin main
# OR trigger manually:
vercel --prod
```

✅ Expected: Vercel dashboard shows deployment in progress. URL format: https://app-name-<hash>-team.vercel.app

### Step 4 — Smoke test production (5 min)

```bash
# Health check
curl -sf https://your-app.vercel.app/api/health | jq .

# Critical path
curl -sf https://your-app.vercel.app/api/users/me \
  -H "Authorization: Bearer $TEST_TOKEN" | jq '.id'
```

✅ Expected: health returns {"status":"ok","db":"connected"}. Users API returns a valid ID.

### Step 5 — Monitor for 10 min

- Check the Vercel Functions log for errors: `vercel logs --since=10m`
- Check the error rate in Vercel Analytics: < 1% 5xx
- Check the DB connection pool: `SELECT count(*) FROM pg_stat_activity;` (< 80% of max_connections)

## Rollback

If smoke tests fail or the error rate spikes:

```bash
# Instant rollback via Vercel (preferred — < 30 sec)
vercel rollback [previous-deployment-url]
```

⚠️ Database rollback (only if a migration was applied): Prisma has no automatic down migrations, and `prisma migrate reset` drops the ENTIRE database before re-applying every migration — never run it against production. Restore from the latest backup or apply a hand-written down migration, then record it with `prisma migrate resolve --rolled-back <migration>`.

✅ Expected after rollback: Previous deployment URL becomes active. Verify with a smoke test.


## Escalation

- L1 (on-call engineer): Check Vercel logs, run smoke tests, attempt rollback
- L2 (platform lead): DB issues, data loss risk, rollback failed — Slack: @platform-lead
- L3 (CTO): Production down > 30 min, data breach — PagerDuty: #critical-incidents

---

2. Incident Response Runbook

markdown
# Incident Response Runbook
**Severity levels:** P1 (down), P2 (degraded), P3 (minor)  
**Est. total time:** P1: 30–60 min, P2: 1–4 hours  

## Phase 1 — Triage (5 min)

### Confirm the incident
```bash
# Is the app responding?
curl -sw "%{http_code}" https://your-app.vercel.app/api/health -o /dev/null

# Check Vercel function errors (last 15 min)
vercel logs --since=15m | grep -i "error\|exception\|5[0-9][0-9]"
```

✅ 200 = app up. 5xx or timeout = incident confirmed.

### Declare severity

- Site completely down → P1 — page L2/L3 immediately
- Partial degradation / slow responses → P2 — notify the team channel
- Single feature broken → P3 — create a ticket, fix in business hours

## Phase 2 — Diagnose (10–15 min)

```bash
# Recent deployments — did something just ship?
vercel ls --limit=5

# Database health
psql $DATABASE_URL -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"

# Long-running queries (> 30 sec)
psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"

# Connection pool saturation
psql $DATABASE_URL -c "SELECT count(*), max_conn FROM pg_stat_activity, (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') t GROUP BY max_conn;"
```

### Diagnostic decision tree

- Recent deploy + new errors → roll back (see Deployment Runbook)
- DB query timeout / pool saturation → kill long queries, scale connections
- External dependency failing → check status pages, add a circuit breaker
- Memory/CPU spike → check Vercel function logs for infinite loops

## Phase 3 — Mitigate (variable)

```bash
# Kill a runaway DB query
psql $DATABASE_URL -c "SELECT pg_terminate_backend(<pid>);"

# Scale DB connections (Supabase/Neon — adjust pool size)
# Vercel → Settings → Environment Variables → update DATABASE_POOL_MAX

# Enable maintenance mode (if you have a feature flag);
# `vercel env add` reads the value from stdin
echo "true" | vercel env add MAINTENANCE_MODE production
vercel --prod  # redeploy with the flag
```

## Phase 4 — Resolve & Postmortem

After the incident is resolved, within 24 hours:

  1. Write incident timeline (what happened, when, who noticed, what fixed it)
  2. Identify root cause (5-Whys)
  3. Define action items with owners and due dates
  4. Update this runbook if a step was missing or wrong
  5. Add monitoring/alert that would have caught this earlier

Postmortem template: docs/postmortems/YYYY-MM-DD-incident-title.md
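Step 1 is less likely to slip if the postmortem file is scaffolded the moment the incident closes. A minimal sketch; the `new_postmortem` helper and its section headings are hypothetical, the path follows the template above:

```bash
# new_postmortem <incident-slug>
# Scaffold today's postmortem file under docs/postmortems/ and print its path.
new_postmortem() {
  local slug=${1:?usage: new_postmortem <incident-slug>}
  local file="docs/postmortems/$(date +%F)-${slug}.md"
  mkdir -p docs/postmortems
  cat > "$file" <<EOF
# Postmortem: ${slug} ($(date +%F))

## Timeline
<!-- what happened, when, who noticed, what fixed it -->

## Root Cause (5 Whys)

## Action Items (owner + due date)

## Runbook Updates
<!-- which runbook step was missing or wrong -->

## Monitoring Gaps
<!-- what alert would have caught this earlier -->
EOF
  echo "$file"
}
```

Running `new_postmortem db-pool-exhaustion` creates and prints `docs/postmortems/<today>-db-pool-exhaustion.md` with the sections above ready to fill in.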


## Escalation Path

| Level | Who | When | Contact |
|-------|-----|------|---------|
| L1 | On-call engineer | Always first | PagerDuty rotation |
| L2 | Platform lead | DB issues, rollback needed | Slack @platform-lead |
| L3 | CTO/VP Eng | P1 > 30 min, data loss | Phone + PagerDuty |
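The Phase 1 severity declaration can also be encoded so triage is not a judgment call under pressure. A sketch assuming the rules above; the `classify` function and its inputs (health-check HTTP code, recent 5xx error-rate percent) are illustrative:

```bash
# classify <http_code> <error_rate_pct>
# Map a health-check HTTP status and recent 5xx error rate to a severity.
classify() {
  local code=$1
  local err_rate=${2%.*}   # truncate decimals for integer comparison
  if [ "$code" -ge 500 ] || [ "$code" -eq 0 ]; then
    echo "P1"   # site down: page L2/L3 immediately
  elif [ "${err_rate:-0}" -ge 1 ]; then
    echo "P2"   # degraded: notify the team channel
  else
    echo "P3"   # minor: create a ticket, fix in business hours
  fi
}

# Usage with the health check from Phase 1:
# classify "$(curl -sw '%{http_code}' -o /dev/null https://your-app.vercel.app/api/health)" 0
```

The 1% threshold mirrors the "< 1% 5xx" target from the deployment runbook's monitoring step; tune it to your SLO.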

---

3. Database Maintenance Runbook

markdown
# Database Maintenance Runbook — PostgreSQL
**Schedule:** Weekly vacuum (automated), monthly manual review  

## Backup

```bash
# Full backup
pg_dump $DATABASE_URL \
  --format=custom \
  --compress=9 \
  --file="backup-$(date +%Y%m%d-%H%M%S).dump"
```

✅ Expected: File created, size > 0. `pg_restore --list backup.dump | head -20` shows tables.

### Verify the backup is restorable (test monthly)

```bash
# --clean --if-exists drops existing objects first so the restore is repeatable
pg_restore --clean --if-exists --dbname=$STAGING_DATABASE_URL backup.dump
psql $STAGING_DATABASE_URL -c "SELECT count(*) FROM users;"
```

✅ Expected: Row count matches production.
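Backup storage also needs a retention policy so the volume holding the dumps does not fill up. A sketch of a pruning step, assuming dumps land in a backups/ directory with the timestamped names above; the 14-day window is an example policy, not part of the skill:

```bash
# prune_backups <dir> <days>
# Delete backup-*.dump files older than <days> days; print what was removed.
prune_backups() {
  local dir=$1
  local days=$2
  find "$dir" -name 'backup-*.dump' -type f -mtime +"$days" -print -delete
}

# Usage (example policy): prune_backups backups/ 14
```

Run it right after the backup step so a full disk never silently breaks the nightly dump.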

## Migration

```bash
# Always test in staging first
DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
# Verify, then:
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy
```

✅ Expected: "All migrations have been successfully applied."

⚠️ For large table migrations (> 1M rows), watch for lock risk: build indexes with CREATE INDEX CONCURRENTLY, backfill new columns in batches, and use pg_repack to reclaim bloat without long table locks.

## Vacuum & Reindex

```bash
# Check bloat before deciding
psql $DATABASE_URL -c "
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
       n_dead_tup, n_live_tup,
       ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;"

# Vacuum high-bloat tables (non-blocking)
psql $DATABASE_URL -c "VACUUM ANALYZE users;"
psql $DATABASE_URL -c "VACUUM ANALYZE events;"

# Reindex (CONCURRENTLY avoids locks; requires PostgreSQL 12+)
psql $DATABASE_URL -c "REINDEX INDEX CONCURRENTLY users_email_idx;"
```

✅ Expected: dead_ratio drops below 5% after vacuum.


---

Staleness Detection

Add a staleness header to every runbook:

markdown
## Staleness Check
This runbook references the following config files. If they've changed since the
"Last verified" date, review the affected steps.

| Config File | Last Modified | Affects Steps |
|-------------|--------------|---------------|
| vercel.json | `git log -1 --format=%ci -- vercel.json` | Step 3, Rollback |
| prisma/schema.prisma | `git log -1 --format=%ci -- prisma/schema.prisma` | Step 2, DB Maintenance |
| .github/workflows/deploy.yml | `git log -1 --format=%ci -- .github/workflows/deploy.yml` | Step 1, Step 3 |
| docker-compose.yml | `git log -1 --format=%ci -- docker-compose.yml` | All scaling steps |

Automation: Add a CI job that runs weekly and comments on the runbook doc if any referenced file was modified more recently than the runbook's "Last verified" date.
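That weekly CI job reduces to a few lines of shell. A sketch assuming the runbook's "Last verified" date is available in ISO form; the `staleness_check` name and output format are illustrative:

```bash
# staleness_check <last-verified-date> <config-file...>
# Flag any config file whose last commit is newer than the runbook's
# "Last verified" date; return nonzero if anything is stale.
staleness_check() {
  local verified_ts file_ts file stale=0
  verified_ts=$(date -d "$1" +%s); shift
  for file in "$@"; do
    # %ct = committer date of the file's most recent commit, as a Unix epoch
    file_ts=$(git log -1 --format=%ct -- "$file")
    if [ -n "$file_ts" ] && [ "$file_ts" -gt "$verified_ts" ]; then
      echo "STALE: $file changed after the Last verified date"
      stale=1
    fi
  done
  return "$stale"
}

# Usage: staleness_check 2025-03-01 vercel.json prisma/schema.prisma
```

A CI wrapper would parse the "Last verified" line out of each runbook, call this per runbook, and post a comment when the function returns nonzero.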


Runbook Testing Methodology

Dry-Run in Staging

Before trusting a runbook in production, validate every step in staging:

bash
# 1. Create a staging environment mirror
vercel env pull .env.staging
source .env.staging

# 2. Run each step with staging credentials
# Replace all $DATABASE_URL with $STAGING_DATABASE_URL
# Replace all production URLs with staging URLs

# 3. Verify expected outputs match
# Document any discrepancies and update the runbook

# 4. Time each step — update estimates in the runbook
time npx prisma migrate deploy
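Timing each step (point 4) can be wrapped so the measurement lands next to the command it describes. A sketch; the `timed` helper is hypothetical:

```bash
# timed <label> <command...>
# Run one runbook step, report wall-clock seconds, preserve the exit code.
timed() {
  local label=$1; shift
  local start end rc
  start=$(date +%s)
  "$@"
  rc=$?
  end=$(date +%s)
  echo "step '$label' took $((end - start))s (exit $rc)" >&2
  return $rc
}

# Usage: timed "migrate" npx prisma migrate deploy
```

Collect the reported durations from a dry run and paste them back into the runbook's time estimates.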

Quarterly Review Cadence

Schedule a 1-hour review every quarter:

  1. Run each command in staging — does it still work?
  2. Check config drift — compare "Last Modified" dates vs "Last verified"
  3. Test rollback procedures — actually roll back in staging
  4. Update contact info — L1/L2/L3 may have changed
  5. Add new failure modes discovered in the past quarter
  6. Update "Last verified" date at top of runbook

Common Pitfalls

| Pitfall | Fix |
|---------|-----|
| Commands that require manual copy of dynamic values | Use env vars — $DATABASE_URL, not postgres://user:pass@host/db |
| No expected output specified | Add ✅ with the exact expected string after every verification step |
| Rollback steps missing | Every destructive step needs a corresponding undo |
| Runbooks that never get tested | Schedule quarterly staging dry-runs in the team calendar |
| L3 escalation contact is the former CTO | Review contact info every quarter |
| Migration runbook doesn't mention table locks | Call out lock risk for large table operations explicitly |

Best Practices

  1. Every command must be copy-pasteable — no placeholder text, use env vars
  2. ✅ after every step — explicit expected output, not "it should work"
  3. Time estimates are mandatory — engineers need to know if they have time to fix before SLA breach
  4. Rollback before you deploy — plan the undo before executing
  5. Runbooks live in the repo — docs/runbooks/, versioned with the code they describe
  6. Postmortem → runbook update — every incident should improve a runbook
  7. Link, don't duplicate — reference the canonical config file, don't copy its contents into the runbook
  8. Test runbooks like you test code — untested runbooks are worse than no runbooks (false confidence)
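Practice 2 can be mechanized: a tiny wrapper that runs a command and compares its output against the ✅ expectation. A sketch; the `verify` helper is hypothetical, not part of the generated runbooks:

```bash
# verify <expected-substring> <command...>
# Run a command and check its output contains the expected substring;
# the executable form of a runbook's "✅ Expected:" line.
verify() {
  local expected=$1; shift
  local out
  out=$("$@" 2>&1) || true
  if [[ "$out" == *"$expected"* ]]; then
    echo "✅ $*: output contains '$expected'"
  else
    echo "❌ $*: expected '$expected', got: $out"
    return 1
  fi
}

# Usage (health check from the deployment runbook):
# verify '"status":"ok"' curl -sf https://your-app.vercel.app/api/health
```

Chaining `verify` calls turns a runbook section into a script that fails loudly on the first unmet expectation, which is exactly what a staging dry-run needs.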
