itil-ops

by chefboyrdave21

3.7k · DevOps · Not scanned · March 23, 2026

Install

claude skill add --url github.com/openclaw/skills/tree/main/skills/chefboyrdave21/itil-ops

Documentation

ITIL Ops — IT Service Management for AI Agents

Structured incident, problem, and change management adapted from ITIL 4 for autonomous agent operations.

Core Concepts

Severity Levels

| Level | Meaning | Response | Example |
|-------|---------|----------|---------|
| P1 | Critical: service down, data at risk | Immediate alert + auto-remediate | Crash loop, disk full, OOM |
| P2 | High: degraded service | Alert within 1h | Service restarts, auth failures |
| P3 | Medium: non-critical issue | Next review cycle | Cron timeouts, broken files |
| P4 | Low: cosmetic/minor | Track, fix when convenient | Log warnings, config drift |

Incident vs Problem vs Change

  • Incident: Something broke. Restore service ASAP. (reactive)
  • Problem: Pattern of incidents. Find and fix root cause. (proactive)
  • Change: Planned modification. Assess risk before executing. (controlled)

Incident Management

Detection Sources

Scan these in order of criticality:

  1. Service crashes: check journalctl --user -u SERVICE --since "12 hours ago" for watchdog timeouts, SIGABRT, SIGSEGV, core dumps
  2. Cron failures — consecutive error count > 2 in job state files
  3. Health endpoints — HTTP health checks returning non-200
  4. Resource pressure — disk > 80%, RAM > 80%, swap active
  5. Data integrity — schema validation failures, broken files, load errors
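The resource-pressure thresholds in item 4 could be checked with small helpers along these lines. This is a minimal sketch, not the contents of scripts/itil-review.sh: the function names `pressure_status` and `swap_status` are hypothetical, and the inputs are assumed to be integers already extracted from tools such as df and free.

```bash
# Hypothetical helpers (not part of the skill): flag resource pressure
# using the thresholds listed above (disk > 80%, RAM > 80%, swap active).

# pressure_status LABEL PERCENT -> prints "LABEL PRESSURE" or "LABEL OK"
pressure_status() {
  if [ "$2" -gt 80 ]; then
    echo "$1 PRESSURE"
  else
    echo "$1 OK"
  fi
}

# swap_status USED_KB -> any swap in use counts as active
swap_status() {
  if [ "$1" -gt 0 ]; then
    echo "swap ACTIVE"
  else
    echo "swap OK"
  fi
}
```

On a typical Linux host the disk percentage might come from `df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }'`, though the right command depends on your environment.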

Detection Script

Run scripts/itil-review.sh to scan all sources. It outputs:

  • ITIL_CLEAR if nothing is found (reply HEARTBEAT_OK)
  • A formatted report of incidents and problems if issues are detected

Incident Lifecycle

code
DETECTED → CLASSIFIED (P1-P4) → DIAGNOSED → RESOLVED → CLOSED
                                      ↓
                              (3+ occurrences)
                                      ↓
                              ESCALATE TO PROBLEM
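One way to read the lifecycle above is as a simple state machine with an escalation check. The sketch below is illustrative only: `next_state` and `should_escalate` are hypothetical names, and the occurrence count is assumed to be tracked elsewhere.

```bash
# Hypothetical sketch: advance an incident one step through the lifecycle.
next_state() {
  case "$1" in
    DETECTED)   echo CLASSIFIED ;;
    CLASSIFIED) echo DIAGNOSED ;;
    DIAGNOSED)  echo RESOLVED ;;
    RESOLVED)   echo CLOSED ;;
    CLOSED)     echo CLOSED ;;   # terminal state
    *)          echo "unknown state: $1" >&2; return 1 ;;
  esac
}

# Succeeds (exit 0) once the same incident has occurred 3 or more times,
# which is the point at which it is escalated to a problem.
should_escalate() {
  [ "$1" -ge 3 ]
}
```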

Auto-Classification Rules

bash
# P1 — Critical
- Service crash count >= 3 in 12h (crash loop)
- Disk usage >= 90%
- RAM usage >= 90%
- Data loss detected

# P2 — High
- Service crashed 1-2 times
- 3+ services down simultaneously
- Auth/token failures affecting operations
- Cron job with 5+ consecutive failures

# P3 — Medium
- Broken data files (schema violations)
- Memory load errors > 10 in 12h
- Cron job with 3-4 consecutive failures
- Disk usage 80-89%

# P4 — Low
- 1 service down (non-critical)
- Config warnings
- Log noise
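Some of the numeric rules above translate directly into shell. The following is a minimal sketch under stated assumptions: the function names are hypothetical, each input is an integer measured elsewhere, and `NONE` is an invented marker for "no incident", which the rules themselves do not define.

```bash
# Hypothetical classifiers for the numeric rules above.
# Each prints a severity (P1-P4) or NONE when no rule matches.

classify_crashes() {   # $1 = service crashes in the last 12h
  if   [ "$1" -ge 3 ]; then echo P1    # crash loop
  elif [ "$1" -ge 1 ]; then echo P2    # crashed 1-2 times
  else echo NONE
  fi
}

classify_disk() {      # $1 = disk usage in percent
  if   [ "$1" -ge 90 ]; then echo P1
  elif [ "$1" -ge 80 ]; then echo P3   # 80-89%
  else echo NONE
  fi
}

classify_cron() {      # $1 = consecutive failures of one cron job
  if   [ "$1" -ge 5 ]; then echo P2
  elif [ "$1" -ge 3 ]; then echo P3    # 3-4 failures
  else echo NONE
  fi
}
```

For example, `classify_disk 85` prints `P3` and `classify_crashes 4` prints `P1`.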

Creating Incident Tickets

When incidents are found, create coordination tasks:

code
Title: [ITIL-INC] <brief description>
Body:
- Severity: P1/P2/P3/P4
- Category: service|cron|memory|disk|security
- Detected: <timestamp>
- Detail: <what happened>
- Impact: <what's affected>
- Action: <what to do>
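A ticket in this format could be generated by a small helper such as the one below. This is a sketch only: `format_incident` and its argument order are hypothetical, and the detection timestamp is assumed to be the current UTC time.

```bash
# Hypothetical helper (not part of the skill): print an incident ticket
# in the format shown above.
# Usage: format_incident SEVERITY CATEGORY SUMMARY DETAIL IMPACT ACTION
format_incident() {
  detected=$(date -u +%Y-%m-%dT%H:%M:%SZ)   # detection time, UTC
  printf '%s\n' \
    "Title: [ITIL-INC] $3" \
    "Body:" \
    "- Severity: $1" \
    "- Category: $2" \
    "- Detected: $detected" \
    "- Detail: $4" \
    "- Impact: $5" \
    "- Action: $6"
}
```

The printed text can then be handed to whatever ticketing or coordination command is in use.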

Problem Management

Pattern Detection

An incident becomes a problem when:

  • Same error occurs 3+ times in 24h
  • Same incident type recurs across 2+ review cycles
  • Multiple related incidents share a common root cause
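The first criterion (same error 3+ times in 24h) can be approximated by counting identical error signatures. The sketch below assumes the input has already been limited to the last 24 hours and reduced to one signature per line; `recurring_errors` is a hypothetical name.

```bash
# Hypothetical sketch: read one error signature per line on stdin
# (assumed to cover the last 24h) and print each signature that occurs
# 3 or more times as "COUNT<TAB>SIGNATURE".
recurring_errors() {
  sort | uniq -c | awk '$1 >= 3 {
    count = $1
    $1 = ""            # drop the count field from the record
    sub(/^ /, "")      # remove the leading space left behind
    print count "\t" $0
  }'
}
```

For example, piping three "disk full" lines and one "auth fail" line into `recurring_errors` reports only the "disk full" signature with a count of 3.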

Root Cause Analysis (RCA)

When a problem is identified:

  1. Gather evidence — journal logs, error messages, state files, recent changes
  2. Timeline — reconstruct the sequence of events
  3. 5 Whys — ask why iteratively until you reach the actual root cause
  4. Fix classification:
    • Quick fix — config change, file repair, timeout bump
    • Code fix — bug in script or daemon, needs PR
    • Architecture fix — design flaw, needs redesign

Problem Ticket Format

code
Title: [ITIL-PRB] <root cause description>
Body:
- Related incidents: <list>
- Root cause: <what's actually broken>
- Evidence: <logs, patterns, data>
- Fix applied: <immediate remediation>
- Fix needed: <permanent solution>
- Prevention: <how to prevent recurrence>

Known Error Database

Track resolved problems in the state file (itil-state.json):

json
{
  "last_review": "2026-03-22T04:19:50Z",
  "last_incident_count": 2,
  "last_problem_count": 1,
  "known_errors": {
    "memory-content-dict": {
      "description": "Scripts writing content as dict instead of string",
      "root_cause": "Missing json.dumps() in memory file writers",
      "fix": "Wrap content in json.dumps() before saving",
      "fixed_date": "2026-03-22"
    }
  }
}

Change Management

Pre-Change Checklist

Before modifying services, configs, or infrastructure:

  1. What's changing? — specific files, services, configs
  2. Why? — linked incident/problem ticket
  3. Risk? — what could go wrong
  4. Rollback plan? — how to undo if it breaks
  5. Test? — how to verify it worked
  6. Notify? — does the human need to know

Change Categories

| Type | Approval | Example |
|------|----------|---------|
| Standard | Pre-approved, just do it | Restart service, bump timeout |
| Normal | Inform human, wait for OK | New cron job, config change |
| Emergency | Fix now, inform after | Service down, data at risk |
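The approval rule for each change type can be encoded as a lookup so that automation never has to guess. This is an illustrative sketch: `approval_for` and the action labels it prints are hypothetical.

```bash
# Hypothetical sketch: map a change type to the approval behaviour above.
approval_for() {
  case "$1" in
    standard)  echo "proceed" ;;               # pre-approved
    normal)    echo "notify-and-wait" ;;       # inform human, wait for OK
    emergency) echo "proceed-then-notify" ;;   # fix now, inform after
    *)         echo "unknown change type: $1" >&2; return 1 ;;
  esac
}
```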

Post-Change Verification

After any change:

  1. Check service status — systemctl --user status SERVICE
  2. Watch logs for 60s — journalctl --user -u SERVICE -f --since "now"
  3. Run health check — scripts/itil-review.sh
  4. Verify no new errors appear in the first 5 minutes
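Steps 1, 2, and 4 could be combined into one check. The sketch below is a rough illustration under stated assumptions: `verify_change` is a hypothetical name, the error pattern is an example rather than the skill's own list, and the service is assumed to be a user-level systemd unit.

```bash
# Hypothetical sketch: verify a service after a change.
# Usage: verify_change SERVICE SINCE   (SINCE e.g. "5 minutes ago")
# Returns 0 if the unit is active and no error-like lines were logged.
verify_change() {
  service=$1
  since=$2
  if ! systemctl --user is-active --quiet "$service"; then
    echo "FAIL: $service is not active"
    return 1
  fi
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  errors=$(journalctl --user -u "$service" --since "$since" --no-pager \
    | grep -ciE "error|failed|SIGABRT|SIGSEGV" || true)
  if [ "$errors" -gt 0 ]; then
    echo "FAIL: $errors error lines since $since"
    return 1
  fi
  echo "OK: $service healthy"
}
```

Calling `verify_change SERVICE "5 minutes ago"` five minutes after the change covers the final step of the checklist.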

Event Management

Log Monitoring Patterns

bash
# Service crashes
journalctl --user -u SERVICE --since "12h ago" | grep -ciE "watchdog timeout|killed|SIGABRT|SIGSEGV|failed with"

# Memory/resource issues
journalctl --user -u SERVICE --since "12h ago" | grep -c "Failed to load"

# Auth failures
journalctl --user -u SERVICE --since "12h ago" | grep -ciE "unauthorized|403|token expired|auth fail"

Health Check Endpoints

Check services with curl:

bash
curl -sf --max-time 5 "$URL" >/dev/null 2>&1 || echo "DOWN"

Configure endpoints in the review script for your environment.
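For several endpoints, the same curl call can be wrapped in a loop. This is a minimal sketch, not the review script itself: `check_endpoints` is a hypothetical name and the URLs you pass in are your own.

```bash
# Hypothetical sketch: check each URL given as an argument and report
# its status. Returns 0 only if every endpoint responded successfully.
check_endpoints() {
  down=0
  for url in "$@"; do
    if curl -sf --max-time 5 "$url" >/dev/null 2>&1; then
      echo "UP   $url"
    else
      echo "DOWN $url"
      down=$((down + 1))
    fi
  done
  [ "$down" -eq 0 ]
}
```

Because the function's exit status reflects overall health, it can gate alerting directly, for example `check_endpoints "$URL" || echo "at least one endpoint is down"`.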

Continual Improvement

Review Cadence

| Review | Frequency | Purpose |
|--------|-----------|---------|
| Incident review | Every 12h | Detect and classify new issues |
| Problem review | Weekly | Identify patterns, track RCA progress |
| Capacity review | Weekly | Disk, RAM, memory count trends |
| Process review | Monthly | Are our detection rules catching real issues? |

KPIs to Track

  • MTTR (Mean Time to Resolve) — how fast do we fix incidents?
  • Incident recurrence rate — are the same things breaking?
  • False positive rate — are we alerting on non-issues?
  • Known error resolution — are problems getting permanent fixes?
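MTTR is the mean of (resolved time minus detected time) across incidents. The sketch below assumes each incident is recorded as a pair of Unix epoch timestamps, which is an assumption about the data rather than something the skill specifies; `mttr_seconds` is a hypothetical name.

```bash
# Hypothetical sketch: read "DETECTED_EPOCH RESOLVED_EPOCH" pairs on stdin
# and print the mean time to resolve in whole seconds ("n/a" if no input).
mttr_seconds() {
  awk 'NF == 2 { total += $2 - $1; n++ }
       END { if (n > 0) printf "%d\n", total / n; else print "n/a" }'
}
```

With two incidents that took 600 and 1200 seconds to resolve, the function prints 900.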

State Tracking

The review script maintains itil-state.json with:

  • Last review timestamp and results
  • Incident/problem counts per review
  • System metrics (disk, RAM, restart count)
  • Cross-review pattern detection data

Cron Setup

Recommended Schedule

bash
# Incident review — every 12 hours
openclaw cron add --name "itil-review" --every "12h" \
  --model "anthropic/claude-sonnet-4-6" --timeout-seconds 180 \
  --session isolated \
  --message "Run ITIL review: bash ~/.skcapstone/agents/lumina/scripts/itil-review.sh"

# Weekly problem review (Sunday 9 AM)
# Analyze the week's incidents, identify patterns, suggest improvements

File Structure

code
itil-ops/
├── SKILL.md              # This file
├── scripts/
│   └── itil-review.sh    # Main review script (scan + classify + report)
└── references/
    └── itil4-agent-mapping.md  # ITIL 4 → Agent operations reference

Integration Points

  • Coordination tasks: skcapstone coord create for incident/problem tickets
  • Memory snapshots: skmemory_snapshot to record resolutions for future reference
  • Heartbeat — integrate with existing heartbeat to run lightweight checks
  • Cron — scheduled reviews via OpenClaw cron system
  • Alerting — Telegram/Discord delivery for P1/P2 issues
