itil-ops
by chefboyrdave21

Install:
claude skill add --url github.com/openclaw/skills/tree/main/skills/chefboyrdave21/itil-ops
ITIL Ops — IT Service Management for AI Agents
Structured incident, problem, and change management adapted from ITIL 4 for autonomous agent operations.
Core Concepts
Severity Levels
| Level | Meaning | Response | Example |
|---|---|---|---|
| P1 | Critical — service down, data at risk | Immediate alert + auto-remediate | Crash loop, disk full, OOM |
| P2 | High — degraded service | Alert within 1h | Service restarts, auth failures |
| P3 | Medium — non-critical issue | Next review cycle | Cron timeouts, broken files |
| P4 | Low — cosmetic/minor | Track, fix when convenient | Log warnings, config drift |
Incident vs Problem vs Change
- Incident: Something broke. Restore service ASAP. (reactive)
- Problem: Pattern of incidents. Find and fix root cause. (proactive)
- Change: Planned modification. Assess risk before executing. (controlled)
Incident Management
Detection Sources
Scan these in order of criticality:
- Service crashes — `journalctl --user -u SERVICE --since "12 hours ago"` for watchdog timeouts, SIGABRT, SIGSEGV, core dumps
- Cron failures — consecutive error count > 2 in job state files
- Health endpoints — HTTP health checks returning non-200
- Resource pressure — disk > 80%, RAM > 80%, swap active
- Data integrity — schema validation failures, broken files, load errors
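The resource-pressure checks above can be sketched as a small shell helper. The thresholds come straight from this list (disk > 80%, RAM > 80%, any active swap); the live wiring below assumes a Linux host with GNU `df` and `free` and is skipped elsewhere.

```shell
# check_pressure: pure decision helper so the thresholds are testable in isolation.
check_pressure() {  # usage: check_pressure DISK_PCT RAM_PCT SWAP_USED_KB
  local disk=$1 ram=$2 swap=${3:-0}
  [ "$disk" -gt 80 ] && echo "disk ${disk}%"
  [ "$ram" -gt 80 ] && echo "ram ${ram}%"
  [ "$swap" -gt 0 ] && echo "swap active"
  return 0
}

# Live wiring (Linux with GNU df/free; no output means no resource pressure):
if command -v free >/dev/null 2>&1; then
  disk_now=$(df --output=pcent / | tail -1 | tr -dc '0-9')
  ram_now=$(free | awk '/^Mem:/ {printf "%d", $3/$2*100}')
  swap_now=$(free | awk '/^Swap:/ {print $3}')
  check_pressure "$disk_now" "$ram_now" "$swap_now"
fi
```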
Detection Script
Run scripts/itil-review.sh to scan all sources. It outputs:
- ITIL_CLEAR if nothing found (reply HEARTBEAT_OK)
- A formatted report with incidents and problems if issues are detected
Incident Lifecycle
DETECTED → CLASSIFIED (P1-P4) → DIAGNOSED → RESOLVED → CLOSED
↓
(3+ occurrences)
↓
ESCALATE TO PROBLEM
Auto-Classification Rules
# P1 — Critical
- Service crash count >= 3 in 12h (crash loop)
- Disk usage >= 90%
- RAM usage >= 90%
- Data loss detected
# P2 — High
- Service crashed 1-2 times
- 3+ services down simultaneously
- Auth/token failures affecting operations
- Cron job with 5+ consecutive failures
# P3 — Medium
- Broken data files (schema violations)
- Memory load errors > 10 in 12h
- Cron job with 3-4 consecutive failures
- Disk usage 80-89%
# P4 — Low
- 1 service down (non-critical)
- Config warnings
- Log noise
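The rules above are simple threshold maps, so they can be sketched as shell helpers. Each function mirrors the thresholds listed (crash counts, disk percentage, consecutive cron failures) and prints a severity or OK; this is an illustrative sketch, not the skill's actual classifier.

```shell
# Severity helpers mirroring the auto-classification thresholds above.
classify_crashes() {  # crash count for one service in 12h
  if [ "$1" -ge 3 ]; then echo P1       # crash loop
  elif [ "$1" -ge 1 ]; then echo P2     # crashed 1-2 times
  else echo OK; fi
}
classify_disk() {     # disk usage percent
  if [ "$1" -ge 90 ]; then echo P1
  elif [ "$1" -ge 80 ]; then echo P3
  else echo OK; fi
}
classify_cron() {     # consecutive cron failures
  if [ "$1" -ge 5 ]; then echo P2
  elif [ "$1" -ge 3 ]; then echo P3
  else echo OK; fi
}
```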
Creating Incident Tickets
When incidents are found, create coordination tasks:
Title: [ITIL-INC] <brief description>
Body:
- Severity: P1/P2/P3/P4
- Category: service|cron|memory|disk|security
- Detected: <timestamp>
- Detail: <what happened>
- Impact: <what's affected>
- Action: <what to do>
Problem Management
Pattern Detection
An incident becomes a problem when:
- Same error occurs 3+ times in 24h
- Same incident type recurs across 2+ review cycles
- Multiple related incidents share a common root cause
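The first rule (same error 3+ times in 24h) can be sketched as a predicate plus journal wiring. `SERVICE` and the grep pattern are placeholders for your environment.

```shell
# Escalation rule as a predicate: exit 0 when a count crosses the 3-in-24h bar.
escalate_needed() {
  [ "${1:-0}" -ge 3 ]
}

# Example wiring (SERVICE and the pattern are placeholders):
#   count=$(journalctl --user -u SERVICE --since "24 hours ago" \
#           | grep -ciE "watchdog timeout")
#   escalate_needed "$count" && echo "ESCALATE TO PROBLEM (${count}x in 24h)"
```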
Root Cause Analysis (RCA)
When a problem is identified:
- Gather evidence — journal logs, error messages, state files, recent changes
- Timeline — reconstruct the sequence of events
- 5 Whys — ask why iteratively until you reach the actual root cause
- Fix classification:
- Quick fix — config change, file repair, timeout bump
- Code fix — bug in script or daemon, needs PR
- Architecture fix — design flaw, needs redesign
Problem Ticket Format
Title: [ITIL-PRB] <root cause description>
Body:
- Related incidents: <list>
- Root cause: <what's actually broken>
- Evidence: <logs, patterns, data>
- Fix applied: <immediate remediation>
- Fix needed: <permanent solution>
- Prevention: <how to prevent recurrence>
Known Error Database
Track resolved problems in state file (itil-state.json):
{
"last_review": "2026-03-22T04:19:50Z",
"last_incident_count": 2,
"last_problem_count": 1,
"known_errors": {
"memory-content-dict": {
"description": "Scripts writing content as dict instead of string",
"root_cause": "Missing json.dumps() in memory file writers",
"fix": "Wrap content in json.dumps() before saving",
"fixed_date": "2026-03-22"
}
}
}
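Given that schema, a known-error lookup before opening a fresh problem ticket can be sketched with jq (assumed available). The sample file below just seeds the demo; in practice you would query the real itil-state.json.

```shell
# Seed a sample state file matching the schema above, then query it.
cat > itil-state.json <<'EOF'
{"known_errors": {"memory-content-dict": {"fix": "Wrap content in json.dumps() before saving"}}}
EOF

lookup_fix() {  # usage: lookup_fix ERROR_KEY STATE_FILE
  jq -r --arg k "$1" '.known_errors[$k].fix // empty' "$2"
}

lookup_fix memory-content-dict itil-state.json
# prints: Wrap content in json.dumps() before saving
```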
Change Management
Pre-Change Checklist
Before modifying services, configs, or infrastructure:
- What's changing? — specific files, services, configs
- Why? — linked incident/problem ticket
- Risk? — what could go wrong
- Rollback plan? — how to undo if it breaks
- Test? — how to verify it worked
- Notify? — does the human need to know
Change Categories
| Type | Approval | Example |
|---|---|---|
| Standard | Pre-approved, just do it | Restart service, bump timeout |
| Normal | Inform human, wait for OK | New cron job, config change |
| Emergency | Fix now, inform after | Service down, data at risk |
Post-Change Verification
After any change:
- Check service status — `systemctl --user status SERVICE`
- Watch logs for 60s — `journalctl --user -u SERVICE -f --since "now"`
- Run health check — scripts/itil-review.sh
- Verify no new errors in the first 5 minutes
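The pass/fail decision behind those steps can be sketched as a pure helper, with the systemd wiring shown as comments so it stays testable off a systemd host. `SERVICE` is a placeholder.

```shell
# Pass/fail decision as a pure helper: a change passes when the service is
# active and no new errors appeared during the watch window.
change_ok() {  # usage: change_ok ACTIVE(1/0) NEW_ERROR_COUNT
  [ "$1" -eq 1 ] && [ "${2:-0}" -eq 0 ]
}

# Example wiring on a systemd host (SERVICE is a placeholder):
#   systemctl --user is-active --quiet SERVICE && active=1 || active=0
#   errs=$(timeout 60 journalctl --user -u SERVICE -f -n 0 \
#          | grep -ciE "error|failed" || true)
#   change_ok "$active" "$errs" && echo "CHANGE OK" || echo "ROLL BACK"
```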
Event Management
Log Monitoring Patterns
# Service crashes
journalctl --user -u SERVICE --since "12h ago" | grep -ciE "watchdog timeout|killed|SIGABRT|SIGSEGV|failed with"
# Memory/resource issues
journalctl --user -u SERVICE --since "12h ago" | grep -c "Failed to load"
# Auth failures
journalctl --user -u SERVICE --since "12h ago" | grep -ciE "unauthorized|403|token expired|auth fail"
Health Check Endpoints
Check services with curl:
curl -sf --max-time 5 "$URL" >/dev/null 2>&1 || echo "DOWN"
Configure endpoints in the review script for your environment.
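The one-liner above can be wrapped in a helper and looped over several endpoints. The service names and URLs here are illustrative placeholders, not part of the skill.

```shell
# check_endpoint: wrap the curl health probe; prints UP or DOWN.
check_endpoint() {
  curl -sf --max-time 5 "$1" >/dev/null 2>&1 && echo UP || echo DOWN
}

# Illustrative endpoint list — substitute your own services:
for url in "http://127.0.0.1:8080/health" "http://127.0.0.1:3000/health"; do
  echo "$url $(check_endpoint "$url")"
done
```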
Continual Improvement
Review Cadence
| Review | Frequency | Purpose |
|---|---|---|
| Incident review | Every 12h | Detect and classify new issues |
| Problem review | Weekly | Identify patterns, track RCA progress |
| Capacity review | Weekly | Disk, RAM, memory count trends |
| Process review | Monthly | Are our detection rules catching real issues? |
KPIs to Track
- MTTR (Mean Time to Resolve) — how fast do we fix incidents?
- Incident recurrence rate — are the same things breaking?
- False positive rate — are we alerting on non-issues?
- Known error resolution — are problems getting permanent fixes?
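As a sketch of the first KPI: if each closed incident records detected and resolved epoch-second timestamps (an assumption — the skill's state file does not mandate this), MTTR is just the mean of the differences.

```shell
# MTTR from "detected:resolved" epoch-second pairs, space-separated.
mttr_minutes() {  # usage: mttr_minutes "det1:res1 det2:res2 ..."
  awk -v pairs="$1" 'BEGIN {
    n = split(pairs, a, " ")
    for (i = 1; i <= n; i++) { split(a[i], t, ":"); sum += t[2] - t[1] }
    if (n) printf "%d\n", sum / n / 60
  }'
}

mttr_minutes "1700000000:1700000600 1700001000:1700002200"
# mean of 600s and 1200s = 900s → 15 minutes
```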
State Tracking
The review script maintains itil-state.json with:
- Last review timestamp and results
- Incident/problem counts per review
- System metrics (disk, RAM, restart count)
- Cross-review pattern detection data
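Writing those fields back can be sketched with jq (assumed available); the seed file below is only for the demo, and the temp-file-then-move keeps the update atomic on one filesystem.

```shell
# Record a review's results into the state file's top-level fields.
update_state() {  # usage: update_state STATE_FILE INCIDENT_COUNT PROBLEM_COUNT
  local f=$1 now
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  jq --arg t "$now" --argjson i "$2" --argjson p "$3" \
     '.last_review = $t | .last_incident_count = $i | .last_problem_count = $p' \
     "$f" > "$f.tmp" && mv "$f.tmp" "$f"
}

echo '{}' > itil-state.json   # seed an empty state for the demo
update_state itil-state.json 2 1
```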
Cron Setup
Recommended Schedule
# Incident review — every 12 hours
openclaw cron add --name "itil-review" --every "12h" \
--model "anthropic/claude-sonnet-4-6" --timeout-seconds 180 \
--session isolated \
--message "Run ITIL review: bash ~/.skcapstone/agents/lumina/scripts/itil-review.sh"
# Weekly problem review (Sunday 9 AM)
# Analyze the week's incidents, identify patterns, suggest improvements
File Structure
itil-ops/
├── SKILL.md # This file
├── scripts/
│ └── itil-review.sh # Main review script (scan + classify + report)
└── references/
└── itil4-agent-mapping.md # ITIL 4 → Agent operations reference
Integration Points
- Coordination tasks — `skcapstone coord create` for incident/problem tickets
- Memory snapshots — `skmemory_snapshot` to record resolutions for future reference
- Heartbeat — integrate with existing heartbeat to run lightweight checks
- Cron — scheduled reviews via OpenClaw cron system
- Alerting — Telegram/Discord delivery for P1/P2 issues