Incident Response: Detection, Response, and Post-Mortems

Build an effective incident response process: from detection and escalation to resolution and blameless post-mortems that prevent recurrence.


When something breaks in production, the minutes matter. Not just for user impact, but because stressed engineers under time pressure make bad decisions. An incident response process gives you a playbook so your team can focus on fixing the problem, not figuring out who does what.

When to Use

SEV1 vs. SEV2: How to Decide

Declare SEV1 when you have complete service unavailability, data loss or corruption, or a security breach. If customers cannot complete their primary task at all, that is SEV1. The bar for declaring SEV1 should be low — it is better to over-communicate severity and scale back than to under-respond.

Declare SEV2 when a major feature is degraded but customers can work around it. Search returning errors for 20% of users is SEV2. Payment processing slow but completing is SEV2. SEV2 means the response is urgent but the business is not on fire.

The hard call is partial availability. A service that is technically up but serving errors for a subset of users could be SEV2. When in doubt, declare the higher severity and demote it later if it turns out to be minor.

War Room vs. Async Slack

Open a war room (video call, dedicated channel) when you have a SEV1, when multiple teams need to coordinate simultaneously, or when the incident is actively worsening and needs rapid decision-making. War rooms add coordination overhead, so use them only when the speed of parallel investigation outweighs the overhead.

Use async Slack updates when you have a SEV2, when the incident is stable and slowly improving, or when only one team is working the problem. Async keeps people informed without tying up multiple engineers in a call.

Rollback vs. Fix-Forward

Rollback when the deployment caused the incident, when a known-good state exists and can be restored quickly, and when the fix requires more time than rollback. Rollback is the right call when you are confident that reverting will resolve the problem within about 10 minutes.

Fix-forward when the deployment is not the cause, when rollback would cause more disruption (for example, in-flight transactions), and when the fix is simpler than the rollback procedure. Fix-forward requires confidence that the fix will not make things worse.

Incident Lifecycle

flowchart TD
    A[Alert Fires<br/>or User Reports] --> B{Classify Severity}
    B -->|SEV1| C[Open War Room<br/>Page Secondary On-Call]
    B -->|SEV2| D[Open Incident Channel<br/>Assign IC]
    C --> E[Investigate<br/>Identify Root Cause]
    D --> E
    E --> F{Rollback<br/>or Fix-Forward?}
    F -->|Rollback| G[Execute Rollback<br/>Monitor Recovery]
    F -->|Fix-Forward| H[Implement Fix<br/>Test in Staging]
    G --> I{Service<br/>Recovered?}
    H --> I
    I -->|Yes| J[Declare Resolved<br/>Update Status Page]
    I -->|No| E
    J --> K[Schedule Post-Mortem<br/>Track Action Items]

Incident Classification and Severity

Not all incidents are created equal. Define severity levels so everyone knows the stakes.

| Severity | Definition                    | Example                                    | Response               |
| -------- | ----------------------------- | ------------------------------------------ | ---------------------- |
| SEV1     | Complete outage or data loss  | Checkout service down, database corrupted  | Immediate all-hands    |
| SEV2     | Major feature degraded        | Search returning errors for 20% of queries | Response in 15 minutes |
| SEV3     | Minor feature degraded        | PDF exports slow but working               | Response in 2 hours    |
| SEV4     | Cosmetic or minor issue       | Wrong logo color on one page               | Next sprint            |

Classify based on user impact, not technical cause. A single-user issue is SEV4 even if the root cause is interesting. A 1% error rate on a critical path is SEV2.
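The classification rule can be expressed as a small helper. This is a sketch with illustrative thresholds and parameter names, not a standard — tune the cutoffs to your own severity definitions:

```python
def classify_severity(total_outage: bool, data_loss: bool,
                      error_rate: float, on_critical_path: bool) -> str:
    """Map user impact (not technical cause) to a severity level."""
    if total_outage or data_loss:
        return "SEV1"  # complete outage or data loss
    if on_critical_path and error_rate >= 0.01:
        return "SEV2"  # 1%+ errors on a critical path
    if error_rate > 0:
        return "SEV3"  # degraded but usable
    return "SEV4"      # cosmetic or minor
```

Note that the inputs are all impact measurements — nothing about the root cause appears in the decision.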

Detection Sources and Alerting

The Alerting in Production post covers how to build alerts that page the right people. This is where that investment pays off.

Detection can come from automated monitoring, user reports, or employee reports. Automated detection is faster and more reliable. When PagerDuty fires at 3am, you have context: error rates, latency, which service is affected.

User-reported incidents are harder. Create a clear channel (Slack channel, dedicated email, emergency hotline) that routes to the on-call. Train customer support to recognize severity.

Escalation Paths and Communication

Escalation is about getting the right people involved quickly.

# Example escalation policy (PagerDuty format)
escalation_policy:
  name: Platform Team Escalation
  description: Primary on-call, then secondary, then manager
  levels:
    - recipients:
        - user: primary-oncall
      delay_minutes: 0
    - recipients:
        - user: secondary-oncall
      delay_minutes: 15
    - recipients:
        - user: platform-manager
      delay_minutes: 30

During an incident, communicate early and often. Users would rather hear “we are aware and investigating” than be met with silence.

Status page updates:

**2026-03-25 14:32 UTC - Investigating**
We are seeing elevated error rates on our checkout service. Our team is investigating and will update in 30 minutes.

**2026-03-25 14:45 UTC - Identified**
We have identified the issue as a database connection pool exhaustion caused by a slow-running query. We are working on a fix.

**2026-03-25 15:02 UTC - Resolved**
The issue has been resolved. Checkout service is operating normally. We will publish a post-mortem within 5 business days.
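Templating these updates keeps the format consistent under pressure. A minimal sketch — the phase names and layout follow the examples above, but the function and its signature are my own illustration:

```python
from datetime import datetime, timezone

# Phases matching the status page template above
PHASES = ("Investigating", "Identified", "Resolved")

def status_update(phase: str, message: str, when: datetime) -> str:
    """Render one status-page entry: a bold UTC timestamp line, then the message."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    stamp = when.strftime("%Y-%m-%d %H:%M UTC")
    return f"**{stamp} - {phase}**\n{message}"
```

Keeping the timestamp in UTC avoids confusion when responders span time zones.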

Internal communication should be in a dedicated incident Slack channel. Do not let incident discussion pollute regular team channels.

Active Incident Management

One person leads the incident. This is the Incident Commander (IC). The IC coordinates, communicates, and makes decisions. They are not necessarily the most technical person; they are the person who can keep the incident moving.

The IC role:

  1. Declares incident (sets severity, creates Slack channel, pages on-call)
  2. Coordinates responders (who is investigating, who is fixing, who is communicating)
  3. Tracks progress (main incident timeline)
  4. Makes calls (rollback vs. fix-forward, customer communication)
  5. Closes incident (declares resolved, schedules post-mortem)

Use a war room for SEV1s. Video conference, screen share, one channel for coordination. For SEV2s, a Slack channel with async updates often suffices.

The Chaos Engineering post has more on building systems that fail gracefully, which reduces incident frequency and severity.

Blameless Post-Mortem Process

After a SEV1 or SEV2, conduct a post-mortem. The purpose is learning, not blame. Blameless means: focus on systems and processes, not individuals.

Timeline first. Reconstruct what happened and when. Include when the incident was detected, when it was escalated, when mitigation started.

## Post-Mortem: Checkout Service Outage

**Date:** 2026-03-25
**Duration:** 47 minutes
**Impact:** 1,247 failed checkout attempts
**Severity:** SEV1

### Timeline (UTC)

- 13:15 - Last successful deployment to checkout service
- 13:47 - Alerting fires: error rate > 5%
- 13:48 - Primary on-call acknowledges
- 13:52 - Incident channel created, IC assigned
- 14:01 - Database connection issue identified
- 14:08 - Query timeout identified as cause
- 14:15 - Rollback initiated
- 14:34 - Rollback complete, service recovering
- 14:47 - Error rates back to normal

### Root Cause

A deployment introduced a query that did not use an index, causing full table scans on the orders table. Under load, this exhausted the connection pool.

### Contributing Factors

- No query plan review in deployment process
- Load testing did not include the new query pattern
- Connection pool size was not monitored

### Action Items

| Item                                         | Owner  | Due        |
| -------------------------------------------- | ------ | ---------- |
| Add query plan review to CI                  | @sarah | 2026-04-01 |
| Add connection pool monitoring               | @james | 2026-04-05 |
| Update load testing to include checkout flow | @ops   | 2026-04-10 |

### What Went Well

- Detection was fast (3 minutes from failure to alert)
- Rollback procedure worked as documented
- Communication was clear and timely
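The durations in a timeline like the one above can be checked mechanically instead of by hand. A sketch — the event labels are my own shorthand for entries in the example timeline:

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Key events from the example timeline (UTC)
alert, acked, mitigation, recovered = "13:47", "13:48", "14:15", "14:47"

time_to_ack = minutes_between(alert, acked)            # 1 minute
time_to_mitigate = minutes_between(alert, mitigation)  # 28 minutes
time_to_recover = minutes_between(alert, recovered)    # 60 minutes
```

Computing these per incident is also what feeds the MTTD/MTTA/MTTR averages discussed later.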

Share post-mortems widely. Blameless only works if people believe it. Seeing the same process applied to senior and junior engineers alike builds trust.

Improving Detection and Response

Post-mortems are useless if action items are not tracked. Put them in your project management tool. Review action items in weekly ops meetings.

Track your incident metrics over time:

  • Mean Time to Detection (MTTD): How long between failure and alert?
  • Mean Time to Acknowledge (MTTA): How long between alert and someone looking at it?
  • Mean Time to Resolution (MTTR): How long from alert to fix?

Set targets for each. If your MTTD is 15 minutes and your target is 5, you need better alerting. If your MTTR is 2 hours and your target is 30 minutes, you need better runbooks or faster rollback procedures.
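Aggregating these metrics across incidents is straightforward once each incident records its key timestamps. A sketch — the record field names (`failed_at`, `alerted_at`, `acked_at`, `resolved_at`) are assumptions, not a standard schema:

```python
from datetime import datetime, timedelta
from statistics import mean

def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60

def incident_metrics(incidents: list[dict]) -> dict:
    """Mean time to detection, acknowledgement, and resolution, in minutes."""
    return {
        "MTTD": mean(_minutes(i["failed_at"], i["alerted_at"]) for i in incidents),
        "MTTA": mean(_minutes(i["alerted_at"], i["acked_at"]) for i in incidents),
        "MTTR": mean(_minutes(i["alerted_at"], i["resolved_at"]) for i in incidents),
    }
```

Track the trend month over month rather than obsessing over single incidents — one bad outlier should prompt a post-mortem, not a reorg.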

Production Failure Scenarios

| Failure | Impact | Mitigation |
| ------- | ------ | ---------- |
| Severity misclassification (under-response) | SEV2 treated as SEV3, wrong people paged, slow resolution | Default to higher severity, demote later if warranted; track over- and under-escalations in post-mortems |
| Escalation policy not triggering due to on-call rotation gap | Alert fires but nobody sees it for 30 minutes | Test escalation policies quarterly, simulate on-call transitions, have a fallback communication channel |
| Status page update causing customer panic due to inaccurate information | Customers react to incomplete data, support tickets spike | Have a reviewer check status page updates before posting, verify facts before publishing |
| Post-mortem action items never implemented | Same incident recurs six months later | Assign action item owners with due dates, review in weekly ops meeting, track in project management tool |
| War room too large, too many people talking | Noise drowns signal, IC cannot coordinate | Strict attendee list, use a separate investigation channel for parallel work, IC manages participation |

Incident Response Trade-offs

| Dimension | War Room | Async Slack |
| --------- | -------- | ----------- |
| Coordination speed | Fast (everyone on call) | Slow (check messages periodically) |
| Noise level | High | Low |
| Multi-team alignment | Good | Requires explicit updates |
| Best for | SEV1, rapidly evolving | SEV2, stable investigation |

| Dimension | Rollback | Fix-Forward |
| --------- | -------- | ----------- |
| Time to recovery | Usually faster | Variable |
| Risk | Lower (known-good state) | Depends on fix quality |
| When to use | Deployment-caused incidents | Non-deployment causes |
| Trade-off | May not fix root cause | Fixes the actual problem |

Incident Response Observability

Track these metrics to understand your incident response health:

| Metric | What It Tells You | Target |
| ------ | ----------------- | ------ |
| MTTD (Mean Time to Detection) | Speed of automated detection | < 5 minutes |
| MTTA (Mean Time to Acknowledge) | On-call responsiveness | < 15 minutes |
| MTTR (Mean Time to Resolution) | Overall incident resolution speed | < 30 minutes for SEV1 |
| False positive alert rate | Alert quality | < 10% |
| Alert to war room open time | How fast coordination starts | < 5 minutes |

Alert fatigue is a real problem. If engineers ignore alerts because 80% are false positives, a real incident gets missed. Review alert quality monthly. If an alert fires and nobody acts on it within 15 minutes, it was either not important enough to page or it was a false positive.
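The 15-minute rule above can be applied programmatically during the monthly review. A sketch — the alert record shape (`name`, `ack_minutes`) is an assumption; feed it from whatever your paging tool exports:

```python
ACK_DEADLINE_MIN = 15  # the 15-minute threshold from the rule above

def review_alerts(alerts: list[dict]) -> dict:
    """alerts: [{"name": str, "ack_minutes": float or None}]; None = never acked.
    Flags alerts nobody acted on in time -- candidates for demotion or deletion."""
    unacted = [a["name"] for a in alerts
               if a["ack_minutes"] is None or a["ack_minutes"] > ACK_DEADLINE_MIN]
    rate = len(unacted) / len(alerts) if alerts else 0.0
    return {"total": len(alerts), "unacted": unacted, "unacted_rate": rate}
```

An `unacted_rate` creeping toward your false-positive ceiling is an early signal of alert fatigue, before engineers start silencing pages outright.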

Key commands:

# Check PagerDuty escalation policy status
pd-cli escalation-policy list --team "Platform Team"

# Review currently firing alerts by service (Prometheus query API)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=ALERTS{alertstate='firing'}" \
  | jq '.data.result | group_by(.metric.service) | map({service: .[0].metric.service, count: length})'

# Count warning events across the cluster (rough instability signal)
kubectl get events --all-namespaces --field-selector type=Warning | wc -l

Common Anti-Patterns

Skipping post-mortems for minor incidents. Every SEV1 and SEV2 deserves a post-mortem. SEV3s and SEV4s can be documented in a sentence, but you should still review patterns. If your SEV4 count is growing, something is wrong.

Not acting on post-mortem action items. A post-mortem without tracked action items is just a document. If the same class of incident happens twice, the first post-mortem failed.

Severity inflation. Declaring everything SEV1 because you want attention makes real SEV1s harder to spot. Save SEV1 for actual outages. Your team will learn to ignore the noise.

Treating all incidents as fire drills. Not every incident requires everyone to drop everything. A SEV3 can wait until business hours. Waking people up unnecessarily builds resentment that surfaces when a real SEV1 happens.

No clear IC. When everyone is investigating, nobody is coordinating. The IC makes calls, tracks the timeline, and manages communication. Without one, you get parallel investigations, duplicated effort, and conflicting status updates.

Quick Recap

Key Takeaways

  • Define severity levels and use them consistently — classify by user impact, not technical interest
  • Declare SEV1 early and demote if needed — under-responding is worse than over-responding
  • The Incident Commander coordinates, does not investigate — separate the coordination role from the fix role
  • Blameless post-mortems focus on systems and processes, not individuals
  • Track MTTD, MTTA, MTTR over time — you cannot improve what you do not measure
  • Action items from post-mortems must be tracked and reviewed

Incident Response Checklist

# 1. Define severity levels (SEV1/SEV2/SEV3/SEV4) in your runbook
# 2. Set up PagerDuty escalation with 0 -> 15 -> 30 minute delays
# 3. Create status page templates for Investigating / Identified / Resolved
# 4. Assign an Incident Commander for every SEV1 and SEV2
# 5. Open separate Slack channel per incident: #inc-YYYY-MM-DD-description
# 6. Document timeline as the incident progresses, not after
# 7. Conduct blameless post-mortem within 5 business days of SEV1/SEV2
# 8. Create action items in project management tool with owners and due dates
# 9. Review MTTD, MTTA, MTTR monthly in ops review meeting
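Item 5's channel naming convention is easy to automate so every IC produces the same format. A sketch — the slug rules are my own assumption; adjust to your chat tool's channel-name constraints:

```python
import re
from datetime import date

def incident_channel(description: str, day: date) -> str:
    """Build a #inc-YYYY-MM-DD-description channel name from a free-text summary."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"
```

Consistent names make incidents searchable months later, which matters when a post-mortem asks "has this happened before?"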

For more on building resilient systems, see Chaos Engineering. For alerting best practices, see Alerting in Production. For monitoring best practices and SLOs, see Observability Engineering.

Trade-off Summary

| Runbook Approach | Speed | Consistency | Flexibility | Best For |
| ---------------- | ----- | ----------- | ----------- | -------- |
| Rigid script | Fastest | Highest | Lowest | Mature, stable systems |
| Flexible template | Moderate | High | Moderate | Evolving systems |
| Open-ended checklist | Slowest | Lowest | Highest | Novel incidents |
| ChatOps + bot execution | Fast | High | Medium | Large on-call teams |

| On-Call Schedule | Burnout Risk | Coverage Quality | Cost |
| ---------------- | ------------ | ---------------- | ---- |
| Fixed rotation | Low | Varies | Lower |
| Follow-the-sun | Medium | High | Higher |
| Dev-first rotation | Medium | High | Lowest |
| Managed service | Lowest | Highest | Highest |

Conclusion

Effective incident response is not about being perfect. It is about being consistent. When every incident follows the same playbook, same severity definitions, same escalation paths, same communication templates, your team can focus on the problem instead of the process.

Run the process, learn from it, and improve it. Incidents are inevitable. Outages are optional.


Related Posts

Alerting in Production: Paging, Runbooks, and On-Call

Build effective alerting systems that wake people up for real emergencies: alert fatigue prevention, runbook automation, and healthy on-call practices.

#alerting #monitoring #on-call

Kubernetes High Availability: HPA, Pod Disruption Budgets, Multi-AZ

Build resilient Kubernetes applications with Horizontal Pod Autoscaler, Pod Disruption Budgets, and multi-availability zone deployments for production workloads.

#kubernetes #high-availability #hpa

The Observability Engineering Mindset: Beyond Monitoring

Transition from traditional monitoring to full observability: structured logs, metrics, traces, and the cultural practices that make observability teams successful.

#observability #engineering #sre