1. Detection & reporting
- Monitor alerts from observability tools
- Accept reports from customer support, users, or team members
- Document the initial report with timestamp and symptoms
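As a rough illustration of documenting the initial report, the sketch below captures the timestamp, source, and symptoms in one record. The `IncidentReport` class, its field names, and the sample values are hypothetical and not part of any specific tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentReport:
    """Hypothetical record of an initial incident report."""
    source: str            # e.g. "alert", "customer support", "team member"
    symptoms: list[str]    # what was observed, in the reporter's words
    # Default to the current UTC time when no timestamp is supplied.
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def summary(self) -> str:
        """One-line summary suitable for an incident channel or ticket."""
        stamp = self.reported_at.strftime("%Y-%m-%d %H:%M UTC")
        return f"[{stamp}] via {self.source}: " + "; ".join(self.symptoms)


# Illustrative values only.
report = IncidentReport(
    source="customer support",
    symptoms=["checkout page returns HTTP 500", "started after latest deploy"],
    reported_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(report.summary())
```

Storing the timestamp in UTC avoids ambiguity when responders are spread across time zones.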
2. Initial assessment
- Verify the incident is real and not a false alarm
- Determine severity level
- Identify which systems and users are affected
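One possible way to make the severity decision repeatable is a small rule-based function. The severity names (`SEV1` to `SEV3`) and the thresholds below are assumptions chosen for illustration; a real team would substitute its own definitions.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical severity levels; adjust to your own scheme."""
    SEV1 = "critical"
    SEV2 = "major"
    SEV3 = "minor"


def assess_severity(core_service_down: bool, affected_user_fraction: float) -> Severity:
    """Map impact to a severity level using illustrative thresholds."""
    if not 0.0 <= affected_user_fraction <= 1.0:
        raise ValueError("affected_user_fraction must be between 0 and 1")
    # Assumed rule: a core outage or at least half of users affected is critical.
    if core_service_down or affected_user_fraction >= 0.5:
        return Severity.SEV1
    # Assumed rule: at least 5% of users affected is major.
    if affected_user_fraction >= 0.05:
        return Severity.SEV2
    return Severity.SEV3


print(assess_severity(core_service_down=False, affected_user_fraction=0.10))
```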
3. Mobilize the team
- Incident Commander: Coordinates response, makes decisions, communicates status
- Technical Lead: Directs investigation and remediation efforts
- Communications Lead: Handles internal and external communications
- Support Lead: Interfaces with affected users and support team
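A simple check that every role above has an owner before the response proceeds might look like the following. The `unfilled_roles` helper and the names in the example are hypothetical.

```python
# The four roles listed above.
REQUIRED_ROLES = (
    "Incident Commander",
    "Technical Lead",
    "Communications Lead",
    "Support Lead",
)


def unfilled_roles(assignments: dict[str, str]) -> list[str]:
    """Return the required roles that have no person assigned yet."""
    return [role for role in REQUIRED_ROLES if not assignments.get(role)]


# Illustrative assignments; the names are placeholders.
assignments = {
    "Incident Commander": "alice",
    "Technical Lead": "bob",
}
print(unfilled_roles(assignments))
```

An empty result means all roles are covered; anything else is a list of roles still to be assigned.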
4. Investigation & mitigation
- Gather logs, metrics, and other diagnostic information
- Identify the root cause; if it cannot be found quickly, implement a temporary workaround to limit impact
- Test and deploy fix
- Monitor for stability
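"Monitor for stability" can be made concrete by requiring a metric to stay healthy for several consecutive samples after the fix. The sketch below assumes error-rate samples are already being collected; the `is_stable` helper, the 1% threshold, and the five-sample window are illustrative assumptions.

```python
def is_stable(error_rates: list[float], threshold: float = 0.01, window: int = 5) -> bool:
    """Return True if the most recent `window` samples are all below `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    # Not enough data yet to call the system stable.
    if len(error_rates) < window:
        return False
    return all(rate < threshold for rate in error_rates[-window:])


# Illustrative samples: elevated error rates early, then recovery after the fix.
samples = [0.2, 0.05, 0.004, 0.003, 0.002, 0.001, 0.002]
print(is_stable(samples))
```

Requiring a full window of healthy samples guards against declaring success on a single good reading.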
5. Resolution & verification
- Confirm all systems are operating normally