Incident Response Guide
Incident phases, severity levels, best practices, and tools.
Response Phases
Detection
Identify incident, make initial severity assessment
Action: Monitoring, alerts, user reports
Triage
Classify severity, assign team
Action: Severity levels, role assignment
Containment
Stop spread, limit damage
Action: Isolate, rollback, disable
Resolution
Fix root cause
Action: Debug, patch, deploy fix
Recovery
Restore normal operations
Action: Verify fix, restore services
Post-mortem
Learn from incident
Action: Document, improve, share
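The six phases above run in a fixed order, which could be modeled as a small ordered enumeration. This is only a minimal sketch: the names Phase and next_phase are hypothetical and do not come from any particular incident tool.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """Response phases, declared in the order the guide lists them."""
    DETECTION = "Detection"
    TRIAGE = "Triage"
    CONTAINMENT = "Containment"
    RESOLUTION = "Resolution"
    RECOVERY = "Recovery"
    POST_MORTEM = "Post-mortem"


# Enum members iterate in definition order, so this list mirrors the guide.
PHASE_ORDER = list(Phase)


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    index = PHASE_ORDER.index(current)
    if index + 1 < len(PHASE_ORDER):
        return PHASE_ORDER[index + 1]
    return None
```

For example, next_phase(Phase.CONTAINMENT) returns Phase.RESOLUTION, and next_phase(Phase.POST_MORTEM) returns None because nothing follows the post-mortem.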
Severity Levels
SEV-1 (Critical)
Complete service outage
Response: All hands, immediate
SEV-2 (High)
Major functionality broken
Response: Core team, ASAP
SEV-3 (Medium)
Partial impact, workaround exists
Response: Assigned team, 24h
SEV-4 (Low)
Minor issue, limited impact
Response: Normal queue
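The severity table above can be read as a decision rule that checks the worst case first. The sketch below is one possible encoding, not a standard: the three boolean inputs and the classify function are hypothetical, and real triage usually weighs more signals (customer count, data loss, security impact).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    level: str
    response: str


# Levels and responses copied from the table above.
SEV1 = Severity("SEV-1", "All hands, immediate")
SEV2 = Severity("SEV-2", "Core team, ASAP")
SEV3 = Severity("SEV-3", "Assigned team, 24h")
SEV4 = Severity("SEV-4", "Normal queue")


def classify(
    complete_outage: bool,
    major_functionality_broken: bool,
    partial_impact_with_workaround: bool,
) -> Severity:
    """Pick the highest severity whose condition holds (hypothetical inputs)."""
    if complete_outage:
        return SEV1
    if major_functionality_broken:
        return SEV2
    if partial_impact_with_workaround:
        return SEV3
    # Anything else is treated as a minor issue with limited impact.
    return SEV4
```

Checking conditions from most to least severe means an incident that matches several rows is always assigned the higher level.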
Best Practices
Clear incident commander
Dedicated communication channel
Regular status updates
Document timeline
Stay calm
Customer communication plan
Root cause focus
Blameless post-mortem
Incident Tools
PagerDuty
Alerting, on-call management
Slack
Incident channel, communication
Jira
Issue tracking, post-mortem
Datadog
Monitoring, dashboards
Runbooks
Playbooks, procedures
Incident Response Checklist
1. Acknowledge alert
2. Assess severity
3. Assign incident commander
4. Open communication channel
5. Start incident log
6. Communicate to stakeholders
7. Contain issue
8. Investigate root cause
9. Implement fix
10. Verify resolution
11. Recovery actions
12. Close incident
13. Post-mortem
14. Action items
Speed matters. Communicate often. Learn always.
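The checklist above pairs naturally with a timestamped incident log, since each completed step becomes an entry in the timeline. The following is a minimal sketch under that assumption: IncidentLog and its methods are hypothetical names, and a real team would more likely record this in its tracker or incident channel.

```python
from datetime import datetime, timezone
from typing import List, Tuple

# Steps copied from the checklist above, in order.
CHECKLIST = [
    "Acknowledge alert",
    "Assess severity",
    "Assign incident commander",
    "Open communication channel",
    "Start incident log",
    "Communicate to stakeholders",
    "Contain issue",
    "Investigate root cause",
    "Implement fix",
    "Verify resolution",
    "Recovery actions",
    "Close incident",
    "Post-mortem",
    "Action items",
]


class IncidentLog:
    """Records when each checklist step was completed (hypothetical helper)."""

    def __init__(self) -> None:
        self.entries: List[Tuple[datetime, str]] = []

    def complete(self, step: str) -> None:
        """Mark a step as done, stamped with the current UTC time."""
        if step not in CHECKLIST:
            raise ValueError(f"Unknown checklist step: {step}")
        self.entries.append((datetime.now(timezone.utc), step))

    def remaining(self) -> List[str]:
        """Return the steps not yet completed, in checklist order."""
        done = {step for _, step in self.entries}
        return [step for step in CHECKLIST if step not in done]
```

After calling complete("Acknowledge alert") on a fresh log, remaining() starts with "Assess severity", and the entries list doubles as the timeline for the post-mortem.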