Incident Response
Severity Levels
| Level | Description | Response Time |
|---|---|---|
| P1 | Service down | 15 min |
| P2 | Major degradation | 30 min |
| P3 | Minor impact | 4 hours |
| P4 | No impact | Next business day |
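The table above can be encoded as a small lookup so that paging scripts and dashboards agree on the targets. This is a minimal sketch: `response_target` is a hypothetical helper name, and the values are copied directly from the table.

```sh
#!/bin/sh
# Hypothetical helper: print the response-time target for a severity level.
# Values mirror the severity table; unknown levels return a non-zero status.
response_target() {
    case "$1" in
        P1) echo "15 min" ;;
        P2) echo "30 min" ;;
        P3) echo "4 hours" ;;
        P4) echo "Next business day" ;;
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}
```

For example, `response_target P2` prints `30 min`.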
Incident Flow
Alert → Acknowledge → Assess → Mitigate → Resolve → Postmortem
│ │ │
└── Page ─────┴── Communicate
On-Call Checklist
- Acknowledge alert within SLA
- Assess impact and severity
- Communicate status to stakeholders
- Mitigate: stop the bleeding
- Investigate root cause
- Resolve underlying issue
- Document in postmortem
Communication Template
🔴 INCIDENT: [Brief description]
Impact: [Who/what is affected]
Status: [Investigating/Mitigating/Resolved]
ETA: [Expected resolution time]
Updates: [Channel/page]
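Filling in the template by hand under pressure is error-prone, so one option is to render it from a script. The sketch below assumes a hypothetical helper named `incident_message` and an argument order chosen for illustration. It rejects any status other than the three listed in the template.

```sh
#!/bin/sh
# Hypothetical helper: render the communication template.
# Arguments (order is an assumption): description, impact, status, ETA, updates.
incident_message() {
    case "$3" in
        Investigating|Mitigating|Resolved) ;;
        *) echo "status must be Investigating, Mitigating, or Resolved" >&2
           return 1 ;;
    esac
    printf '🔴 INCIDENT: %s\nImpact: %s\nStatus: %s\nETA: %s\nUpdates: %s\n' \
        "$1" "$2" "$3" "$4" "$5"
}
```

A call might look like `incident_message "API errors" "EU checkout users" Mitigating "30 min" "#incidents"`, with the output pasted or piped into whatever channel the team uses.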
Common Runbooks
High CPU
- Identify process: top -c
- Check for runaway processes
- Scale horizontally if needed
- Investigate root cause
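When an interactive `top` session is not practical (for example, over a slow connection or inside a script), the first two steps can be approximated with `ps`. This is a sketch under assumptions: `list_cpu` and `filter_cpu` are hypothetical helper names, and the 80% threshold in the usage note is only an example.

```sh
#!/bin/sh
# Hypothetical helper: list every process as "PID %CPU COMMAND", no header.
list_cpu() {
    LC_ALL=C ps -A -o pid= -o pcpu= -o comm=
}

# Hypothetical helper: keep only rows whose %CPU (2nd column) is at or
# above the threshold given as the first argument.
filter_cpu() {
    awk -v t="$1" '$2 + 0 >= t + 0'
}
```

Usage: `list_cpu | filter_cpu 80 | sort -k2,2nr` prints the processes at or above 80% CPU, busiest first. The `pcpu` value reported by `ps` is typically averaged over the process lifetime rather than sampled instantly, so treat it as a first pass and confirm with `top -c`.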
Out of Disk
- Check usage: df -h
- Find large files: du -sh /* | sort -h
- Clear logs/temp files
- Add storage or archive
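The usage check above can be narrowed to just the filesystems that need attention. The sketch below parses the POSIX-format output of `df -P`. `full_filesystems` is a hypothetical helper name, the 90% threshold in the usage note is an example, and mount points containing spaces are not handled.

```sh
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print
# "<mount point> <capacity>" for filesystems at or above the threshold (%).
full_filesystems() {
    awk -v t="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)              # "95%" -> "95"
        if (pct + 0 >= t + 0) print $6, $5
    }'
}
```

Usage: `df -P | full_filesystems 90`. Once a full mount point is identified, running `du` against that mount point rather than `/*` keeps the search focused.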
Database Slow
- Check connections (MySQL): SHOW PROCESSLIST;
- Identify slow queries
- Kill blocking queries if needed
- Scale or optimize
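On MySQL, one way to identify slow queries is to filter the process list by running time. The sketch below only builds the SQL text. `long_query_sql` is a hypothetical helper name, the 30-second default is an assumption, and `information_schema.PROCESSLIST` is deprecated in recent MySQL releases in favor of `performance_schema.processlist`, so adjust the table name to match your server.

```sh
#!/bin/sh
# Hypothetical helper: print a query listing non-idle connections that have
# been running for at least N seconds (default 30), longest first.
long_query_sql() {
    secs="${1:-30}"
    case "$secs" in
        ''|*[!0-9]*)
            echo "threshold must be a whole number of seconds" >&2
            return 1 ;;
    esac
    printf "SELECT ID, USER, DB, TIME, STATE, INFO FROM information_schema.PROCESSLIST WHERE COMMAND <> 'Sleep' AND TIME >= %s ORDER BY TIME DESC;\n" "$secs"
}
```

Usage: pipe the output into the `mysql` client with your own connection options, for example `long_query_sql 60 | mysql`. A blocking query can then be stopped with `KILL <ID>;` using an ID from the result. Confirm what the query is doing before killing it.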
Escalation Path
On-Call Engineer (15 min)
↓
Team Lead (30 min)
↓
Engineering Manager (1 hour)
↓
VP Engineering (2 hours)
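The path above can be turned into a lookup for paging automation. The sketch below reads each time as the cumulative elapsed time at which that tier's window closes, which is an assumption because the path does not say whether the times are cumulative. `escalation_tier` is a hypothetical helper name.

```sh
#!/bin/sh
# Hypothetical helper: given minutes elapsed since the alert, print the tier
# that should currently own the incident.
# Assumption: times in the escalation path are cumulative window boundaries.
escalation_tier() {
    case "$1" in
        ''|*[!0-9]*)
            echo "elapsed minutes must be a whole number" >&2
            return 1 ;;
    esac
    if   [ "$1" -lt 15 ]; then echo "On-Call Engineer"
    elif [ "$1" -lt 30 ]; then echo "Team Lead"
    elif [ "$1" -lt 60 ]; then echo "Engineering Manager"
    else                       echo "VP Engineering"
    fi
}
```

For example, `escalation_tier 20` prints `Team Lead`. If your team treats the times as per-tier durations instead, the boundaries become 15, 45, and 105 minutes.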