Incident Response

Severity Levels

Level	Description	Response Time
P1	Service down	15 min
P2	Major degradation	30 min
P3	Minor impact	4 hours
P4	No impact	Next business day
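
The table above can be encoded so that tooling can compute a response deadline. The following is a minimal sketch, not a prescribed implementation: the names `RESPONSE_TARGETS`, `response_deadline`, and `is_breached` are hypothetical, and treating "Next business day" as 24 hours is an assumption made only to keep the example simple.

```python
from datetime import datetime, timedelta

# Response-time targets copied from the severity table.
# Assumption: "Next business day" (P4) is approximated as 24 hours.
RESPONSE_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


def response_deadline(severity: str, alerted_at: datetime) -> datetime:
    """Return the time by which an alert of this severity needs a response."""
    return alerted_at + RESPONSE_TARGETS[severity]


def is_breached(severity: str, alerted_at: datetime, now: datetime) -> bool:
    """Return True once the response target for this severity has passed."""
    return now > response_deadline(severity, alerted_at)
```

An unknown severity raises `KeyError`, which is probably preferable to silently picking a default target.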

Incident Flow

Alert → Acknowledge → Assess → Mitigate → Resolve → Postmortem
          │             │         │
          └── Page ─────┴── Communicate
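
The main line of the flow above can be modelled as an ordered list of stages. This is a small sketch under the assumption that an incident only moves forward one stage at a time; the Page and Communicate branches are side activities and are left out. The names `STAGES`, `next_stage`, and `can_advance` are hypothetical.

```python
from typing import Optional

# Stages in the order shown in the flow diagram.
STAGES = ["Alert", "Acknowledge", "Assess", "Mitigate", "Resolve", "Postmortem"]


def next_stage(current: str) -> Optional[str]:
    """Return the stage that follows `current`, or None after Postmortem."""
    index = STAGES.index(current)  # raises ValueError for an unknown stage
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


def can_advance(current: str, target: str) -> bool:
    """Allow only a move to the immediately following stage."""
    return next_stage(current) == target
```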

On-Call Checklist

Acknowledge alert within SLA
Assess impact and severity
Communicate status to stakeholders
Mitigate: stop the bleeding
Investigate root cause
Resolve underlying issue
Document in postmortem

Communication Template

🔴 INCIDENT: [Brief description]
Impact: [Who/what is affected]
Status: [Investigating/Mitigating/Resolved]
ETA: [Expected resolution time]
Updates: [Channel/page]
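
The template above can be filled in programmatically so that every update has the same shape. This is a sketch only: `format_incident_update` and its parameter names are hypothetical, and restricting `status` to the three values listed in the template is an assumption.

```python
# Status values taken from the template's Status line.
ALLOWED_STATUSES = ("Investigating", "Mitigating", "Resolved")


def format_incident_update(description, impact, status, eta, updates):
    """Render the five-line incident message from the template."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {ALLOWED_STATUSES}")
    return "\n".join([
        f"🔴 INCIDENT: {description}",
        f"Impact: {impact}",
        f"Status: {status}",
        f"ETA: {eta}",
        f"Updates: {updates}",
    ])
```

For example, `format_incident_update("API errors", "Checkout users", "Mitigating", "30 min", "#incidents")` returns a message ready to paste into a status channel (the channel name here is illustrative).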

Common Runbooks

High CPU

Identify process: top -c
Check for runaway processes
Scale horizontally if needed
Investigate root cause
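
The first two steps above can be partly automated by filtering process listings for heavy CPU users. The sketch below assumes the input comes from `ps -A -o pid,pcpu,args`; the function names and the 50% threshold are hypothetical. Note that `ps` reports CPU use differently from `top`, so treat the result as a rough snapshot.

```python
import subprocess


def read_ps() -> str:
    """Capture a process listing with PID, CPU percentage, and command line."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,pcpu,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_cpu_hogs(ps_output: str, threshold: float = 50.0):
    """Return (pid, cpu_percent, command) tuples at or above the threshold."""
    hogs = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue
        pid, cpu, command = parts
        try:
            cpu_value = float(cpu)
        except ValueError:
            continue  # skip rows whose CPU column is not a plain number
        if cpu_value >= threshold:
            hogs.append((int(pid), cpu_value, command))
    # Highest CPU consumers first.
    return sorted(hogs, key=lambda row: row[1], reverse=True)
```

Calling `find_cpu_hogs(read_ps())` lists candidate runaway processes; deciding whether to restart or scale is still a human judgment.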

Out of Disk

Check usage: df -h
Find large files: du -sh /* 2>/dev/null | sort -h
Clear logs/temp files
Add storage or archive
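
The usage check and large-file search above can also be scripted. This is a minimal sketch using only the Python standard library; `disk_usage_percent` and `largest_files` are hypothetical names, and the default limit of 10 files is an assumption.

```python
import os
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root: str, limit: int = 10):
    """Return up to `limit` (size_in_bytes, path) tuples, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                if os.path.islink(full_path):
                    continue  # do not count symlink targets
                sizes.append((os.path.getsize(full_path), full_path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:limit]
```

Pointing `largest_files` at a log or temp directory (for example `/var/log`) is usually faster and safer than walking the whole filesystem.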

Database Slow

Check connections (MySQL): SHOW PROCESSLIST
Identify slow queries
Kill blocking queries if needed
Scale or optimize
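
The steps above can be sketched as a filter over `SHOW PROCESSLIST` output. This example assumes a MySQL server and that the rows have already been fetched as dictionaries keyed by the statement's column names (`Id`, `Command`, `Time`, `Info`, and so on), as a dict-style cursor would return them. The function names and the 60-second threshold are hypothetical. The functions only build `KILL QUERY` statements; they do not execute anything.

```python
def long_running_queries(rows, min_seconds: int = 60):
    """Return active queries that have run for at least `min_seconds`."""
    slow = [
        row for row in rows
        if row["Command"] == "Query"
        and row["Time"] is not None
        and row["Time"] >= min_seconds
    ]
    # Longest-running queries first.
    return sorted(slow, key=lambda row: row["Time"], reverse=True)


def kill_statements(rows):
    """Build one KILL QUERY statement per row, for review before running."""
    return [f"KILL QUERY {row['Id']};" for row in rows]
```

Printing the statements for review rather than executing them keeps a human in the loop, which matches the "if needed" wording of the runbook.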

Escalation Path

On-Call Engineer (15 min)
    ↓
Team Lead (30 min)
    ↓
Engineering Manager (1 hour)
    ↓
VP Engineering (2 hours)
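
The path above can be expressed as a lookup from elapsed time to the role that should currently be engaged. The sketch assumes that each time is measured from the initial alert and marks the point at which that role is brought in if the incident is still open; the source does not state this explicitly. `ESCALATION_PATH` and `escalation_level` are hypothetical names.

```python
# (minutes since alert, role), in escalation order.
# Assumption: times are measured from the initial alert.
ESCALATION_PATH = [
    (15, "On-Call Engineer"),
    (30, "Team Lead"),
    (60, "Engineering Manager"),
    (120, "VP Engineering"),
]


def escalation_level(elapsed_minutes: float) -> str:
    """Return the highest role whose threshold has been reached."""
    current = ESCALATION_PATH[0][1]  # the on-call engineer owns it from the start
    for threshold, role in ESCALATION_PATH:
        if elapsed_minutes >= threshold:
            current = role
    return current
```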
