Incident Response: Detection, Response, and Post-Mortems
Build an effective incident response process: from detection and escalation to resolution and blameless post-mortems that prevent recurrence.
When something breaks in production, the minutes matter. Not just for user impact, but because stressed engineers under time pressure make bad decisions. An incident response process gives you a playbook so your team can focus on fixing the problem, not figuring out who does what.
When to Use
SEV1 vs. SEV2: How to Decide
Declare SEV1 when you have complete service unavailability, data loss or corruption, or a security breach. If customers cannot complete their primary task at all, that is SEV1. The bar for declaring SEV1 should be low — it is better to over-communicate severity and scale back than to under-respond.
Declare SEV2 when a major feature is degraded but customers can work around it. Search returning errors for 20% of users is SEV2. Payment processing slow but completing is SEV2. SEV2 means the response is urgent but the business is not on fire.
The hard call is partial availability. A service that is technically up but serving errors for a subset of users could be either severity, depending on how large the affected subset is and whether a workaround exists. When in doubt, declare the higher severity and demote it later if it turns out to be minor.
War Room vs. Async Slack
Open a war room (video call, dedicated channel) when you have a SEV1, when multiple teams need to coordinate simultaneously, or when the incident is actively worsening and needs rapid decision-making. War rooms add coordination overhead, so use them only when the speed of parallel investigation outweighs the overhead.
Use async Slack updates when you have a SEV2, when the incident is stable and slowly improving, or when only one team is working the problem. Async keeps people informed without tying up multiple engineers in a call.
Rollback vs. Fix-Forward
Rollback when the deployment caused the incident, when a known-good state exists and can be restored quickly, and when a fix would take longer than rolling back. Rollback is the right call when you can be confident within 10 minutes that reverting solves the problem.
Fix-forward when the deployment is not the cause, when rollback would cause more disruption (for example, in-flight transactions), and when the fix is simpler than the rollback procedure. Fix-forward requires confidence that the fix will not make things worse.
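The rollback vs. fix-forward heuristics above can be sketched as code. This is an illustrative decision helper, not a real tool; the field and function names are hypothetical.

```python
# Illustrative sketch of the rollback vs. fix-forward decision rule.
# All names here are hypothetical, invented for this example.
from dataclasses import dataclass

@dataclass
class IncidentContext:
    deployment_caused: bool    # did a recent deployment trigger the incident?
    known_good_exists: bool    # is there a known-good state to restore?
    rollback_minutes: int      # estimated time to complete a rollback
    fix_minutes: int           # estimated time to implement and ship a fix
    rollback_disruptive: bool  # would rollback break in-flight work?

def choose_strategy(ctx: IncidentContext) -> str:
    """Return 'rollback' or 'fix-forward' per the heuristics above."""
    if (ctx.deployment_caused
            and ctx.known_good_exists
            and not ctx.rollback_disruptive
            and ctx.rollback_minutes <= ctx.fix_minutes):
        return "rollback"
    return "fix-forward"
```

For a deployment-caused incident with a 10-minute rollback and a 45-minute fix, this returns `"rollback"`; if the deployment is not the cause, it returns `"fix-forward"`.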
Incident Lifecycle
flowchart TD
A[Alert Fires<br/>or User Reports] --> B{Classify Severity}
B -->|SEV1| C[Open War Room<br/>Page Secondary On-Call]
B -->|SEV2| D[Open Incident Channel<br/>Assign IC]
C --> E[Investigate<br/>Identify Root Cause]
D --> E
E --> F{Rollback<br/>or Fix-Forward?}
F -->|Rollback| G[Execute Rollback<br/>Monitor Recovery]
F -->|Fix-Forward| H[Implement Fix<br/>Test in Staging]
G --> I{Service<br/>Recovered?}
H --> I
I -->|Yes| J[Declare Resolved<br/>Update Status Page]
I -->|No| E
J --> K[Schedule Post-Mortem<br/>Track Action Items]
Incident Classification and Severity
Not all incidents are created equal. Define severity levels so everyone knows the stakes.
| Severity | Definition | Example | Response |
|---|---|---|---|
| SEV1 | Complete outage or data loss | Checkout service down, database corrupted | Immediate all-hands |
| SEV2 | Major feature degraded | Search returning errors for 20% of queries | Response in 15 minutes |
| SEV3 | Minor feature degraded | PDF exports slow but working | Response in 2 hours |
| SEV4 | Cosmetic or minor issue | Wrong logo color on one page | Next sprint |
Classify based on user impact, not technical cause. A single-user issue is SEV4 even if the root cause is interesting. A 1% error rate on a critical path is SEV2.
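The severity table can be encoded so tooling enforces the response targets. A minimal sketch, with values mirroring the table above; the names are illustrative, not from any incident platform.

```python
# Severity levels mapped to response-time targets, as in the table above.
# The dict and function names are illustrative assumptions.
from datetime import datetime, timedelta

SEVERITY_RESPONSE = {
    "SEV1": timedelta(minutes=0),   # immediate all-hands
    "SEV2": timedelta(minutes=15),  # response within 15 minutes
    "SEV3": timedelta(hours=2),     # response within 2 hours
    "SEV4": None,                   # next sprint, no paging SLA
}

def response_deadline(severity, detected_at):
    """When must someone be actively responding, given detection time?"""
    sla = SEVERITY_RESPONSE[severity]
    return None if sla is None else detected_at + sla
```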
Detection Sources and Alerting
The Alerting in Production post covers how to build alerts that page the right people. This is where that investment pays off.
Detection can come from automated monitoring, user reports, or employee reports. Automated detection is faster and more reliable. When PagerDuty fires at 3am, you have context: error rates, latency, which service is affected.
User-reported incidents are harder. Create a clear intake channel (a dedicated Slack channel, email address, or emergency hotline) that routes to the on-call engineer. Train customer support to recognize severity.
Escalation Paths and Communication
Escalation is about getting the right people involved quickly.
# Example escalation policy (PagerDuty format)
escalation_policy:
  name: Platform Team Escalation
  description: Primary on-call, then secondary, then manager
  levels:
    - recipients:
        - user: primary-oncall
      delay_minutes: 0
    - recipients:
        - user: secondary-oncall
      delay_minutes: 15
    - recipients:
        - user: platform-manager
      delay_minutes: 30
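A misconfigured policy is a silent failure mode: an alert fires and nobody is paged. A small sanity check can catch the obvious mistakes. This is an illustrative sketch over a plain-dict version of the policy above, not a PagerDuty API client.

```python
# Illustrative check that an escalation policy's delays actually escalate:
# the first level pages immediately, and each later level fires later.
# The data shape mirrors the YAML example; it is an assumption for the sketch.

def validate_escalation(levels):
    """Return a list of problems found in the escalation levels."""
    problems = []
    if not levels:
        problems.append("policy has no levels")
    delays = [lvl["delay_minutes"] for lvl in levels]
    if delays and delays[0] != 0:
        problems.append("first level should page immediately (delay 0)")
    for i in range(1, len(delays)):
        if delays[i] <= delays[i - 1]:
            problems.append(f"level {i + 1} does not fire after level {i}")
    return problems

policy = [
    {"recipients": ["primary-oncall"], "delay_minutes": 0},
    {"recipients": ["secondary-oncall"], "delay_minutes": 15},
    {"recipients": ["platform-manager"], "delay_minutes": 30},
]
```

Running a check like this in CI whenever the policy changes is cheaper than discovering the gap during a 3am incident.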
During an incident, communicate early and often. Users would rather hear “we are aware and investigating” than watch silence.
Status page updates:
**2026-03-25 14:32 UTC - Investigating**
We are seeing elevated error rates on our checkout service. Our team is investigating and will update in 30 minutes.
**2026-03-25 14:45 UTC - Identified**
We have identified the issue as a database connection pool exhaustion caused by a slow-running query. We are working on a fix.
**2026-03-25 15:02 UTC - Resolved**
The issue has been resolved. Checkout service is operating normally. We will publish a post-mortem within 5 business days.
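Templating the updates keeps every responder posting in the same shape. A hypothetical helper matching the format above; the function name is an assumption for the sketch.

```python
# Hypothetical helper that renders status page entries in the format shown
# above (timestamp, state, message), so updates stay consistent under pressure.
from datetime import datetime, timezone

def status_update(state, message, when=None):
    """Format a status page entry like '**2026-03-25 14:32 UTC - Investigating**'."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y-%m-%d %H:%M UTC")
    return f"**{stamp} - {state}**\n{message}"
```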
Internal communication should be in a dedicated incident Slack channel. Do not let incident discussion pollute regular team channels.
Active Incident Management
One person leads the incident. This is the Incident Commander (IC). The IC coordinates, communicates, and makes decisions. They are not necessarily the most technical person; they are the person who can keep the incident moving.
The IC role:
- Declares incident (sets severity, creates Slack channel, pages on-call)
- Coordinates responders (who is investigating, who is fixing, who is communicating)
- Tracks progress (main incident timeline)
- Makes calls (rollback vs. fix-forward, customer communication)
- Closes incident (declares resolved, schedules post-mortem)
Use a war room for SEV1s. Video conference, screen share, one channel for coordination. For SEV2s, a Slack channel with async updates often suffices.
The Chaos Engineering post has more on building systems that fail gracefully, which reduces incident frequency and severity.
Blameless Post-Mortem Process
After a SEV1 or SEV2, conduct a post-mortem. The purpose is learning, not blame. Blameless means: focus on systems and processes, not individuals.
Timeline first. Reconstruct what happened and when. Include when the incident was detected, when it was escalated, when mitigation started.
## Post-Mortem: Checkout Service Outage
**Date:** 2026-03-25
**Duration:** 47 minutes
**Impact:** 1,247 failed checkout attempts
**Severity:** SEV1
### Timeline (UTC)
- 13:15 - Last successful deployment to checkout service
- 13:47 - Alerting fires: error rate > 5%
- 13:48 - Primary on-call acknowledges
- 13:52 - Incident channel created, IC assigned
- 14:01 - Database connection issue identified
- 14:08 - Query timeout identified as cause
- 14:15 - Rollback initiated
- 14:34 - Rollback complete, service recovering
- 14:47 - Error rates back to normal
### Root Cause
A deployment introduced a query that did not use an index, causing full table scans on the orders table. Under load, this exhausted the connection pool.
### Contributing Factors
- No query plan review in deployment process
- Load testing did not include the new query pattern
- Connection pool size was not monitored
### Action Items
| Item | Owner | Due |
| -------------------------------------------- | ------ | ---------- |
| Add query plan review to CI | @sarah | 2026-04-01 |
| Add connection pool monitoring | @james | 2026-04-05 |
| Update load testing to include checkout flow | @ops | 2026-04-10 |
### What Went Well
- Detection was fast (3 minutes from failure to alert)
- Rollback procedure worked as documented
- Communication was clear and timely
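The first action item in the example, query plan review in CI, could be as simple as scanning `EXPLAIN` output for sequential scans on known-large tables. A minimal sketch; the table list and string-parsing approach are assumptions, and a real check would run `EXPLAIN` against a staging database rather than parse a captured string.

```python
# Illustrative CI check for the "query plan review" action item: flag any
# Seq Scan on a table known to be large. LARGE_TABLES is a hypothetical list.
LARGE_TABLES = {"orders", "order_items"}

def flag_seq_scans(explain_output: str) -> list[str]:
    """Return warnings for full table scans on large tables in an EXPLAIN plan."""
    warnings = []
    for line in explain_output.splitlines():
        if "Seq Scan on" in line:
            table = line.split("Seq Scan on", 1)[1].split()[0]
            if table in LARGE_TABLES:
                warnings.append(f"full table scan on large table: {table}")
    return warnings
```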
Share post-mortems widely. Blameless only works if people believe it. Seeing the same process applied to senior engineers as to junior engineers builds trust.
Improving Detection and Response
Post-mortems are useless if action items are not tracked. Put them in your project management tool. Review action items in weekly ops meetings.
Track your incident metrics over time:
- Mean Time to Detection (MTTD): How long between failure and alert?
- Mean Time to Acknowledge (MTTA): How long between alert and someone looking at it?
- Mean Time to Resolution (MTTR): How long from alert to fix?
Set targets for each. If your MTTD is 15 minutes and your target is 5, you need better alerting. If your MTTR is 2 hours and your target is 30 minutes, you need better runbooks or faster rollback procedures.
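These metrics fall out of the incident timeline directly. A minimal sketch of computing them from incident records; the field names are illustrative, and the timestamps would come from your incident tracker.

```python
# Sketch: compute MTTD, MTTA, and MTTR (in minutes) from incident records.
# The record fields (failed_at, alerted_at, acked_at, resolved_at) are
# illustrative assumptions, not a specific tracker's schema.
from datetime import datetime
from statistics import mean

def incident_metrics(incidents):
    """Return mean time to detection, acknowledgement, and resolution."""
    def minutes(a, b):
        return (b - a).total_seconds() / 60
    return {
        "MTTD": mean(minutes(i["failed_at"], i["alerted_at"]) for i in incidents),
        "MTTA": mean(minutes(i["alerted_at"], i["acked_at"]) for i in incidents),
        "MTTR": mean(minutes(i["alerted_at"], i["resolved_at"]) for i in incidents),
    }
```

Fed the checkout-outage timeline from the post-mortem example (failure at 13:44, alert at 13:47, acknowledgement at 13:48, recovery at 14:47), this yields MTTD of 3 minutes, MTTA of 1 minute, and MTTR of 60 minutes.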
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Severity misclassification (under-response) | SEV2 treated as SEV3, wrong people paged, slow resolution | Default to higher severity, demote later if warranted; track over- and under-escalations in post-mortems |
| Escalation policy not triggering due to on-call rotation gap | Alert fires but nobody sees it for 30 minutes | Test escalation policies quarterly, simulate on-call transitions, have a fallback communication channel |
| Status page update causing customer panic due to inaccurate information | Customers react to incomplete data, support tickets spike | Have a reviewer check status page updates before posting, verify facts before publishing |
| Post-mortem action items never implemented | Same incident recurs six months later | Assign action item owners with due dates, review in weekly ops meeting, track in project management tool |
| War room too large, too many people talking | Noise drowns signal, IC cannot coordinate | Strict attendee list, use a separate investigation channel for parallel work, IC manages participation |
Incident Response Trade-offs
| Dimension | War Room | Async Slack |
|---|---|---|
| Coordination speed | Fast (everyone on the call) | Slow (check messages periodically) |
| Noise level | High | Low |
| Multi-team alignment | Good | Requires explicit updates |
| Best for | SEV1, rapidly evolving | SEV2, stable investigation |
| Dimension | Rollback | Fix-Forward |
|---|---|---|
| Time to recovery | Usually faster | Variable |
| Risk | Lower (known-good state) | Depends on fix quality |
| When to use | Deployment-caused incidents | Non-deployment causes |
| Trade-off | May not fix root cause | Fixes the actual problem |
Incident Response Observability
Track these metrics to understand your incident response health:
| Metric | What It Tells You | Target |
|---|---|---|
| MTTD (Mean Time to Detection) | Speed of automated detection | < 5 minutes |
| MTTA (Mean Time to Acknowledge) | On-call responsiveness | < 15 minutes |
| MTTR (Mean Time to Resolution) | Overall incident resolution speed | < 30 minutes for SEV1 |
| False positive alert rate | Alert quality | < 10% |
| Alert to war room open time | How fast coordination starts | < 5 minutes |
Alert fatigue is a real problem. If engineers ignore alerts because 80% are false positives, a real incident gets missed. Review alert quality monthly. If an alert fires and nobody acts on it within 15 minutes, it was either not important enough to page or it was a false positive.
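The 15-minute rule above can drive the monthly review directly: flag every alert that fired but saw no action within the window. A sketch under assumed field names; the alert-log shape is illustrative, not from a specific monitoring system.

```python
# Sketch of the monthly alert-quality review: flag alerts nobody acted on
# within 15 minutes. The alert-log fields are illustrative assumptions.
from datetime import datetime, timedelta

ACTION_WINDOW = timedelta(minutes=15)

def unactionable_alerts(alert_log):
    """Return names of alerts with no action inside the action window."""
    flagged = []
    for alert in alert_log:
        acted = alert.get("acted_at")
        if acted is None or acted - alert["fired_at"] > ACTION_WINDOW:
            flagged.append(alert["name"])
    return flagged
```

Alerts that show up here month after month are candidates for demotion to a ticket or deletion.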
Key commands:
# List escalation policies with the PagerDuty CLI (pagerduty-cli)
pd ep list
# Count currently firing alerts grouped by service
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=ALERTS{alertstate="firing"}' | jq '.data.result | group_by(.metric.service) | map({service: .[0].metric.service, count: length})'
# Count cluster warning events (a rough proxy for alert noise)
kubectl get events --all-namespaces --field-selector type=Warning --no-headers | wc -l
Common Anti-Patterns
Skipping post-mortems for minor incidents. Every SEV1 and SEV2 deserves a post-mortem. SEV3s and SEV4s can be documented in a sentence, but you should still review patterns. If your SEV4 count is growing, something is wrong.
Not acting on post-mortem action items. A post-mortem without tracked action items is just a document. If the same class of incident happens twice, the first post-mortem failed.
Severity inflation. Declaring everything SEV1 because you want attention makes real SEV1s harder to spot. Save SEV1 for actual outages. Your team will learn to ignore the noise.
Treating all incidents as fire drills. Not every incident requires everyone to drop everything. A SEV3 can wait until business hours. Waking people up unnecessarily builds resentment that surfaces when a real SEV1 happens.
No clear IC. When everyone is investigating, nobody is coordinating. The IC makes calls, tracks the timeline, and manages communication. Without one, you get parallel investigations, duplicated effort, and conflicting status updates.
Quick Recap
Key Takeaways
- Define severity levels and use them consistently — classify by user impact, not technical interest
- Declare SEV1 early and demote if needed — under-responding is worse than over-responding
- The Incident Commander coordinates, does not investigate — separate the coordination role from the fix role
- Blameless post-mortems focus on systems and processes, not individuals
- Track MTTD, MTTA, MTTR over time — you cannot improve what you do not measure
- Action items from post-mortems must be tracked and reviewed
Incident Response Checklist
# 1. Define severity levels (SEV1/SEV2/SEV3/SEV4) in your runbook
# 2. Set up PagerDuty escalation with 0 -> 15 -> 30 minute delays
# 3. Create status page templates for Investigating / Identified / Resolved
# 4. Assign an Incident Commander for every SEV1 and SEV2
# 5. Open separate Slack channel per incident: #inc-YYYY-MM-DD-description
# 6. Document timeline as the incident progresses, not after
# 7. Conduct blameless post-mortem within 5 business days of SEV1/SEV2
# 8. Create action items in project management tool with owners and due dates
# 9. Review MTTD, MTTA, MTTR monthly in ops review meeting
For more on building resilient systems, see Chaos Engineering. For alerting best practices, see Alerting in Production. For monitoring best practices and SLOs, see Observability Engineering.
Trade-off Summary
| Runbook Approach | Speed | Consistency | Flexibility | Best For |
|---|---|---|---|---|
| Rigid script | Fastest | Highest | Lowest | Mature, stable systems |
| Flexible template | Moderate | High | Moderate | Evolving systems |
| Open-ended checklist | Slowest | Lowest | Highest | Novel incidents |
| ChatOps + bot execution | Fast | High | Medium | Large on-call teams |
| On-Call Schedule | Burnout Risk | Coverage Quality | Cost |
|---|---|---|---|
| Fixed rotation | Low | Varies | Lower |
| Follow-the-sun | Medium | High | Higher |
| Dev-first rotation | Medium | High | Lowest |
| Managed service | Lowest | Highest | Highest |
Conclusion
Effective incident response is not about being perfect. It is about being consistent. When every incident follows the same playbook, same severity definitions, same escalation paths, same communication templates, your team can focus on the problem instead of the process.
Run the process, learn from it, and improve it. Incidents are inevitable. Outages are optional.