Incident Response: Detection, Response, and Post-Mortems

Build an effective incident response process: from detection and escalation to resolution and blameless post-mortems that prevent recurrence.

Author: GeekWorkBench | Updated: May 14, 2026 | Reading time: 24 min

Introduction

When something breaks in production, the minutes matter. Not just for user impact, but because stressed engineers under time pressure make bad decisions. An incident response process gives you a playbook so your team can focus on fixing the problem, not figuring out who does what. Without one, organizations end up running fire drills, missing escalations, and sending confused updates that prolong outages and baffle stakeholders.

Good incident response covers three phases that feed into each other. Detection gets the right people alerted quickly through monitoring and alerting systems. Response gets the problem solved through coordination, investigation, and whatever mitigation works fastest. Learning is where you figure out why it happened and how to stop it happening again. Weakness in any one phase drags down the whole system.

This guide walks through building an incident response process from scratch: severity levels and when to use each, escalation paths and who to page, communication templates that keep stakeholders informed, the Incident Commander role and why it matters, and the blameless post-mortem that turns failures into improvements. By the end, you will have the structure to handle incidents consistently, communicate clearly under pressure, and extract learning that compounds over time.

When to Use

SEV1 vs. SEV2: How to Decide

Declare SEV1 when you have complete service unavailability, data loss or corruption, or a security breach. If customers cannot complete their primary task at all, that is SEV1. The bar for declaring SEV1 should be low — it is better to over-communicate severity and scale back than to under-respond.

Declare SEV2 when a major feature is degraded but customers can work around it. Search returning errors for 20% of users is SEV2. Payment processing slow but completing is SEV2. SEV2 means the response is urgent but the business is not on fire.

The hard call is partial availability. A service that is technically up but serving errors for a subset of users could be SEV2. When in doubt, declare the higher severity and demote it later if it turns out to be minor.

War Room vs. Async Slack

Open a war room (video call, dedicated channel) when you have a SEV1, when multiple teams need to coordinate simultaneously, or when the incident is actively worsening and needs rapid decision-making. War rooms add coordination overhead, so use them only when the speed of parallel investigation outweighs the overhead.

Use async Slack updates when you have a SEV2, when the incident is stable and slowly improving, or when only one team is working the problem. Async keeps people informed without tying up multiple engineers in a call.
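
The criteria above can be written down as a small decision helper. This is only a sketch: the `IncidentState` fields and the `coordination_mode` function are hypothetical names introduced for illustration, and a real team would likely tune the conditions.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    severity: int        # 1 for SEV1, 2 for SEV2, and so on
    teams_involved: int  # number of teams actively working the incident
    worsening: bool      # True while impact is still growing


def coordination_mode(state: IncidentState) -> str:
    """Return "war_room" or "async" using the criteria described above."""
    if state.severity == 1 or state.teams_involved > 1 or state.worsening:
        return "war_room"
    return "async"


# A stable SEV2 handled by one team stays async.
print(coordination_mode(IncidentState(severity=2, teams_involved=1, worsening=False)))
# The same SEV2 spanning three teams moves to a war room.
print(coordination_mode(IncidentState(severity=2, teams_involved=3, worsening=False)))
```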

Rollback vs. Fix-Forward

Roll back when the deployment caused the incident, when a known-good state exists and can be restored quickly, and when a fix would take longer than the rollback. Rollback is the right call when you can be confident within 10 minutes that reverting solves the problem.

Fix forward when the deployment is not the cause, when rollback would cause more disruption (for example, by dropping in-flight transactions), or when the fix is simpler than the rollback procedure. Fix-forward requires confidence that the fix will not make things worse.
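
As a rough illustration, the rules of thumb above can be expressed as a function. The name `choose_mitigation` and its parameters are assumptions made for this sketch; the time estimates are whatever the responders believe at that moment, not measured values.

```python
def choose_mitigation(
    deployment_caused: bool,
    known_good_state: bool,
    rollback_minutes: float,
    fix_minutes: float,
    rollback_disrupts_in_flight: bool,
) -> str:
    """Return "rollback" or "fix_forward" following the rules of thumb above."""
    # Without a causal deployment or a known-good state, reverting has nothing to restore.
    if not deployment_caused or not known_good_state:
        return "fix_forward"
    # Reverting would do more damage than it repairs (for example, in-flight transactions).
    if rollback_disrupts_in_flight:
        return "fix_forward"
    # Otherwise take whichever path is expected to restore service sooner.
    return "rollback" if rollback_minutes <= fix_minutes else "fix_forward"


print(choose_mitigation(True, True, rollback_minutes=10, fix_minutes=45,
                        rollback_disrupts_in_flight=False))
```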

Incident Lifecycle

flowchart TD
    A[Alert Fires<br/>or User Reports] --> B{Classify Severity}
    B -->|SEV1| C[Open War Room<br/>Page Secondary On-Call]
    B -->|SEV2| D[Open Incident Channel<br/>Assign IC]
    C --> E[Investigate<br/>Identify Root Cause]
    D --> E
    E --> F{Rollback<br/>or Fix-Forward?}
    F -->|Rollback| G[Execute Rollback<br/>Monitor Recovery]
    F -->|Fix-Forward| H[Implement Fix<br/>Test in Staging]
    G --> I{Service<br/>Recovered?}
    H --> I
    I -->|Yes| J[Declare Resolved<br/>Update Status Page]
    I -->|No| E
    J --> K[Schedule Post-Mortem<br/>Track Action Items]

Incident Classification and Severity

Not all incidents are created equal. Define severity levels so everyone knows the stakes.

| Severity | Definition                   | Example                                   | Response               |
| -------- | ---------------------------- | ----------------------------------------- | ---------------------- |
| SEV1     | Complete outage or data loss | Checkout service down, database corrupted | Immediate all-hands    |
| SEV2     | Major feature degraded       | Search returning errors for 20% of queries | Response in 15 minutes |
| SEV3     | Minor feature degraded       | PDF exports slow but working              | Response in 2 hours    |
| SEV4     | Cosmetic or minor issue      | Wrong logo color on one page              | Next sprint            |

Classify based on user impact, not technical cause. A single-user issue is SEV4 even if the root cause is interesting. A 1% error rate on a critical path is SEV2.
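
One way to keep classification tied to user impact is to encode the table as a function whose inputs describe impact only. The `Impact` fields and the `classify` function below are hypothetical and deliberately coarse; they sketch the idea rather than a complete policy.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    primary_task_blocked: bool = False    # customers cannot complete their main task
    data_loss: bool = False
    security_breach: bool = False
    major_feature_degraded: bool = False  # degraded, but a workaround exists
    minor_feature_degraded: bool = False


def classify(impact: Impact) -> str:
    """Map user impact to a severity level; technical cause is never an input."""
    if impact.primary_task_blocked or impact.data_loss or impact.security_breach:
        return "SEV1"
    if impact.major_feature_degraded:
        return "SEV2"
    if impact.minor_feature_degraded:
        return "SEV3"
    return "SEV4"


print(classify(Impact(major_feature_degraded=True)))  # SEV2
print(classify(Impact()))                             # SEV4
```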

Detection Sources and Alerting

The Alerting in Production post covers how to build alerts that page the right people. This is where that investment pays off.

Detection can come from automated monitoring, user reports, or employee reports. Automated detection is faster and more reliable. When PagerDuty fires at 3am, you have context: error rates, latency, which service is affected.

User-reported incidents are harder. Create a clear channel (a Slack channel, a dedicated email address, or an emergency hotline) that routes to the on-call engineer. Train customer support to recognize severity.

Escalation Paths and Communication

Escalation is about getting the right people involved quickly.

# Example escalation policy (illustrative YAML modeled on a PagerDuty-style policy, not the PagerDuty API schema)
escalation_policy:
  name: Platform Team Escalation
  description: Primary on-call, then secondary, then manager
  levels:
    - recipients:
        - user: primary-oncall
      delay_minutes: 0
    - recipients:
        - user: secondary-oncall
      delay_minutes: 15
    - recipients:
        - user: platform-manager
      delay_minutes: 30
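
To make the timing concrete, here is a minimal sketch that reads the policy above as cumulative delays, meaning minutes since the alert fired with no acknowledgement. That reading is an assumption, as are the `Level`, `POLICY`, and `paged_so_far` names; check how your paging tool interprets delays before relying on it.

```python
from typing import List, NamedTuple


class Level(NamedTuple):
    recipient: str
    delay_minutes: int  # assumed: minutes after the alert fired


# Mirrors the example policy above.
POLICY = [
    Level("primary-oncall", 0),
    Level("secondary-oncall", 15),
    Level("platform-manager", 30),
]


def paged_so_far(policy: List[Level], minutes_unacknowledged: int) -> List[str]:
    """Return everyone who has been paged once the alert has gone this long unacknowledged."""
    return [lvl.recipient for lvl in policy if lvl.delay_minutes <= minutes_unacknowledged]


print(paged_so_far(POLICY, 20))  # ['primary-oncall', 'secondary-oncall']
```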

During an incident, communicate early and often. Users would rather hear “we are aware and investigating” than be met with silence.

Status page updates:

**2026-03-25 14:32 UTC - Investigating**
We are seeing elevated error rates on our checkout service. Our team is investigating and will update in 30 minutes.

**2026-03-25 14:45 UTC - Identified**
We have identified the issue as a database connection pool exhaustion caused by a slow-running query. We are working on a fix.

**2026-03-25 15:02 UTC - Resolved**
The issue has been resolved. Checkout service is operating normally. We will publish a post-mortem within 5 business days.
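
Updates like these are easier to write under pressure when the wording is templated ahead of time. The sketch below reproduces the three updates above from templates; `TEMPLATES`, `render_update`, and the field names are hypothetical, and posting to an actual status page is left out.

```python
from datetime import datetime, timezone

# Template text is taken from the example updates above; field names are illustrative.
TEMPLATES = {
    "Investigating": (
        "We are seeing {symptom} on our {service}. "
        "Our team is investigating and will update in {next_update_minutes} minutes."
    ),
    "Identified": "We have identified the issue as {cause}. We are working on a fix.",
    "Resolved": (
        "The issue has been resolved. {service} is operating normally. "
        "We will publish a post-mortem within {postmortem_days} business days."
    ),
}


def render_update(status: str, when: datetime, **fields) -> str:
    """Render one status page entry: a bold timestamped header plus the body."""
    header = f"**{when:%Y-%m-%d %H:%M} UTC - {status}**"
    return header + "\n" + TEMPLATES[status].format(**fields)


print(render_update(
    "Investigating",
    datetime(2026, 3, 25, 14, 32, tzinfo=timezone.utc),
    symptom="elevated error rates",
    service="checkout service",
    next_update_minutes=30,
))
```

A missing field raises `KeyError`, which is a useful property here: an update cannot be rendered with a blank where the next-update time should be.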

Internal communication should be in a dedicated incident Slack channel. Do not let incident discussion pollute regular team channels.

Active Incident Management

One person leads the incident. This is the Incident Commander (IC). The IC coordinates, communicates, and makes decisions. They are not necessarily the most technical person; they are the person who can keep the incident moving.

The IC role:

  1. Declares incident (sets severity, creates Slack channel, pages on-call)
  2. Coordinates responders (who is investigating, who is fixing, who is communicating)
  3. Tracks progress (main incident timeline)
  4. Makes calls (rollback vs. fix-forward, customer communication)
  5. Closes incident (declares resolved, schedules post-mortem)
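
Tracking progress mostly means keeping a timestamped timeline as events happen. A minimal sketch of such a record follows; the `Incident` class, its fields, and the commander name are hypothetical, and most teams would keep this in their incident tool or channel rather than in code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class Incident:
    title: str
    severity: str
    commander: str
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, note: str, at: Optional[datetime] = None) -> None:
        """Record an event; defaults to the current UTC time."""
        self.timeline.append((at or datetime.now(timezone.utc), note))

    def render(self) -> str:
        """Return the timeline in chronological order, one bullet per event."""
        ordered = sorted(self.timeline, key=lambda entry: entry[0])
        return "\n".join(f"- {at:%H:%M} - {note}" for at, note in ordered)


inc = Incident("Checkout errors", "SEV1", commander="alex")
inc.log("Incident channel created, IC assigned", datetime(2026, 3, 25, 13, 52, tzinfo=timezone.utc))
inc.log("Alerting fires: error rate > 5%", datetime(2026, 3, 25, 13, 47, tzinfo=timezone.utc))
print(inc.render())
```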

Use a war room for SEV1s. Video conference, screen share, one channel for coordination. For SEV2s, a Slack channel with async updates often suffices.

The Chaos Engineering post has more on building systems that fail gracefully, which reduces incident frequency and severity.

Blameless Post-Mortem Process

After a SEV1 or SEV2, conduct a post-mortem. The purpose is learning, not blame. Blameless means: focus on systems and processes, not individuals.

Timeline first. Reconstruct what happened and when. Include when the incident was detected, when it was escalated, when mitigation started.

## Post-Mortem: Checkout Service Outage

**Date:** 2026-03-25
**Duration:** 47 minutes
**Impact:** 1,247 failed checkout attempts
**Severity:** SEV1

### Timeline (UTC)

- 13:15 - Last successful deployment to checkout service
- 13:47 - Alerting fires: error rate > 5%
- 13:48 - Primary on-call acknowledges
- 13:52 - Incident channel created, IC assigned
- 14:01 - Database connection issue identified
- 14:08 - Query timeout identified as cause
- 14:15 - Rollback initiated
- 14:34 - Rollback complete, service recovering
- 14:47 - Error rates back to normal

### Root Cause

A deployment introduced a query that did not use an index, causing full table scans on the orders table. Under load, this exhausted the connection pool.

### Contributing Factors

- No query plan review in deployment process
- Load testing did not include the new query pattern
- Connection pool size was not monitored

### Action Items

| Item                                         | Owner  | Due        |
| -------------------------------------------- | ------ | ---------- |
| Add query plan review to CI                  | @sarah | 2026-04-01 |
| Add connection pool monitoring               | @james | 2026-04-05 |
| Update load testing to include checkout flow | @ops   | 2026-04-10 |

### What Went Well

- Detection was fast (3 minutes from failure to alert)
- Rollback procedure worked as documented
- Communication was clear and timely

Share post-mortems widely. Blameless only works if people believe it. Seeing the same process applied to senior and junior engineers alike builds trust.

Improving Detection and Response

Post-mortems are useless if action items are not tracked. Put them in your project management tool. Review action items in weekly ops meetings.
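
If your tracker can export action items, the weekly review can start from a list of what is overdue. The sketch below uses the three items from the example post-mortem; the `ActionItem` class, the `overdue` function, and the completion states are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Return open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


items = [
    ActionItem("Add query plan review to CI", "@sarah", date(2026, 4, 1), done=True),
    ActionItem("Add connection pool monitoring", "@james", date(2026, 4, 5)),
    ActionItem("Update load testing to include checkout flow", "@ops", date(2026, 4, 10)),
]

for item in overdue(items, today=date(2026, 4, 6)):
    print(f"OVERDUE: {item.title} ({item.owner}, due {item.due})")
```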

Track your incident metrics over time:

  • Mean Time to Detection (MTTD): How long between failure and alert?
  • Mean Time to Acknowledge (MTTA): How long between alert and someone looking at it?
  • Mean Time to Resolution (MTTR): How long from alert to fix?

Set targets for each. If your MTTD is 15 minutes and your target is 5, you need better alerting. If your MTTR is 2 hours and your target is 30 minutes, you need better runbooks or faster rollback procedures.
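
The three metrics are simple averages over per-incident timestamps. Here is a minimal sketch: the first record uses the times from the example post-mortem (with the failure placed 3 minutes before the alert, as that post-mortem states, and resolution at rollback completion), while the second record and all names are hypothetical.

```python
from datetime import datetime
from statistics import mean


def at(hhmm: str) -> datetime:
    """Helper: a time on 2026-03-25, used only to keep the sample data short."""
    return datetime.strptime(f"2026-03-25 {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def incident_metrics(incidents: list) -> dict:
    """Compute MTTD, MTTA, and MTTR in minutes, using the definitions above."""
    return {
        "MTTD": mean([minutes_between(i["failed"], i["alerted"]) for i in incidents]),
        "MTTA": mean([minutes_between(i["alerted"], i["acknowledged"]) for i in incidents]),
        "MTTR": mean([minutes_between(i["alerted"], i["resolved"]) for i in incidents]),
    }


incidents = [
    {"failed": at("13:44"), "alerted": at("13:47"), "acknowledged": at("13:48"), "resolved": at("14:34")},
    {"failed": at("09:00"), "alerted": at("09:07"), "acknowledged": at("09:10"), "resolved": at("09:40")},
]

print(incident_metrics(incidents))  # {'MTTD': 5.0, 'MTTA': 2.0, 'MTTR': 40.0}
```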

Production Failure Scenarios

| Failure | Impact | Mitigation |
| ------- | ------ | ---------- |
| Severity misclassification (under-response) | SEV2 treated as SEV3, wrong people paged, slow resolution | Default to higher severity, demote later if warranted; track over- and under-escalations in post-mortems |
| Escalation policy not triggering due to on-call rotation gap | Alert fires but nobody sees it for 30 minutes | Test escalation policies quarterly, simulate on-call transitions, have a fallback communication channel |
| Status page update causing customer panic due to inaccurate information | Customers react to incomplete data, support tickets spike | Have a reviewer check status page updates before posting, verify facts before publishing |
| Post-mortem action items never implemented | Same incident recurs six months later | Assign action item owners with due dates, review in weekly ops meeting, track in project management tool |
| War room too large, too many people talking | Noise drowns signal, IC cannot coordinate | Strict attendee list, use a separate investigation channel for parallel work, IC manages participation |

Incident Response Trade-off Analysis

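| Factor               | War Room                | Async Slack                        |
| -------------------- | ----------------------- | ---------------------------------- |
| Coordination speed   | Fast (everyone on call) | Slow (check messages periodically) |
| Noise level          | High                    | Low                                |
| Multi-team alignment | Good                    | Requires explicit updates          |
| Best for             | SEV1, rapidly evolving  | SEV2, stable investigation         |

| Factor           | Rollback                    | Fix-Forward             |
| ---------------- | --------------------------- | ----------------------- |
| Time to recovery | Usually faster              | Variable                |
| Risk             | Lower (known-good state)    | Depends on fix quality  |
| When to use      | Deployment-caused incidents | Non-deployment causes   |
| Trade-off        | May not fix root cause      | Fixes the actual problem |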

Incident Response Observability

Track these metrics to understand your incident response health:

| Metric                          | What It Tells You                 | Target                |
| ------------------------------- | --------------------------------- | --------------------- |
| MTTD (Mean Time to Detection)   | Speed of automated detection      | < 5 minutes           |
| MTTA (Mean Time to Acknowledge) | On-call responsiveness            | < 15 minutes          |
| MTTR (Mean Time to Resolution)  | Overall incident resolution speed | < 30 minutes for SEV1 |
| False positive alert rate       | Alert quality                     | < 10%                 |
| Alert to war room open time     | How fast coordination starts      | < 5 minutes           |

Alert fatigue is a real problem. If engineers ignore alerts because 80% are false positives, a real incident gets missed. Review alert quality monthly. If an alert fires and nobody acts on it within 15 minutes, it was either not important enough to page or it was a false positive.
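
A monthly alert review can start from a per-alert false positive rate. The sketch below assumes you can export each page as an alert name plus whether anyone had to act on it; the function names, the sample alert names, and the 10% threshold (taken from the target table above) are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def false_positive_rates(pages: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """pages: (alert_name, was_actionable) pairs. Returns the false positive rate per alert."""
    total: Counter = Counter()
    false_pos: Counter = Counter()
    for name, actionable in pages:
        total[name] += 1
        if not actionable:
            false_pos[name] += 1
    return {name: false_pos[name] / total[name] for name in total}


def noisy_alerts(pages: Iterable[Tuple[str, bool]], threshold: float = 0.10) -> List[str]:
    """Alerts whose false positive rate exceeds the threshold, sorted by name."""
    rates = false_positive_rates(pages)
    return sorted(name for name, rate in rates.items() if rate > threshold)


pages = (
    [("HighErrorRate", True)] * 9 + [("HighErrorRate", False)]
    + [("DiskAlmostFull", False)] * 4 + [("DiskAlmostFull", True)]
)
print(noisy_alerts(pages))  # ['DiskAlmostFull']
```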

Key commands:

# Check PagerDuty escalation policy status
pd-cli escalation-policy list --team "Platform Team"

# Count currently firing alerts by service (assumes alerts carry a "service" label)
curl -sG "http://prometheus:9090/api/v1/query" --data-urlencode 'query=ALERTS{alertstate="firing"}' | jq '.data.result | group_by(.metric.service) | map({service: .[0].metric.service, count: length})'

# Count Warning events currently retained by the cluster (kubectl get has no --since flag)
kubectl get events --all-namespaces --field-selector type=Warning --no-headers | wc -l

Common Pitfalls / Anti-Patterns

Skipping post-mortems for minor incidents. Every SEV1 and SEV2 deserves a post-mortem. SEV3s and SEV4s can be documented in a sentence, but you should still review patterns. If your SEV4 count is growing, something is wrong.

Not acting on post-mortem action items. A post-mortem without tracked action items is just a document. If the same class of incident happens twice, the first post-mortem failed.

Severity inflation. Declaring everything SEV1 because you want attention makes real SEV1s harder to spot, and your team learns to ignore the noise. Save SEV1 for actual outages; when you escalate out of genuine uncertainty, demote as soon as the impact is understood.

Treating all incidents as fire drills. Not every incident requires everyone to drop everything. A SEV3 can wait until business hours. Waking people up unnecessarily builds resentment that surfaces when a real SEV1 happens.

No clear IC. When everyone is investigating, nobody is coordinating. The IC makes calls, tracks the timeline, and manages communication. Without one, you get parallel investigations, duplicated effort, and conflicting status updates.

Closing Thoughts

Effective incident response is not about being perfect. It is about being consistent. When every incident follows the same playbook (same severity definitions, same escalation paths, same communication templates), your team can focus on the problem instead of the process.

Run the process, learn from it, and improve it. Incidents are inevitable. Outages are optional.

Trade-off Analysis

No single incident response approach works for every situation. These are the key trade-offs you will face.

Severity Declaration: SEV1 vs SEV2

The choice between declaring a SEV1 or SEV2 shapes how much coordination overhead you get versus how much attention the incident receives.

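| Factor              | SEV1                 | SEV2                  |
| ------------------- | -------------------- | --------------------- |
| War room            | All hands, full focus | Lead + 2-3 responders |
| Communication       | Executive updates    | Team-level updates    |
| Resolution pressure | Maximum              | High                  |
| Overhead cost       | High                 | Moderate              |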

Default to higher severity and demote later. It is easier to explain why a SEV1 was over-engineered than why a SEV2 became a SEV1 mid-incident. Partial availability is the hardest case: when in doubt, declare higher.

Incident Commander: Single IC vs Distributed Investigation

Having one person own coordination versus letting everyone investigate in parallel is a fundamental trade-off.

| Approach | Pros | Cons |
| -------- | ---- | ---- |
| Single IC | Clear decisions, no duplicated effort, one timeline | IC must resist investigating; coordination is a full-time job |
| No IC (everyone investigates) | More parallel coverage | Conflicting updates, duplicated work, no clear go/no-go |

The IC role is non-negotiable for SEV1 and SEV2. For SEV3 and SEV4, a brief Slack thread with a designated lead is usually sufficient.

Rollback vs Fix-Forward

Whether to revert a deployment or push a fix depends on the situation and what you know.

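| Factor                                  | Rollback | Fix-Forward           |
| --------------------------------------- | -------- | --------------------- |
| Deployment caused the incident          | Yes      | Only if fix is faster |
| Known-good state available              | Yes      | N/A                   |
| Fix is trivial and fast                 | No       | Yes                   |
| In-flight transactions would be lost    | No       | Yes                   |
| Root cause unknown                      | Risky    | Risky either way      |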

The 10-minute rule: If you cannot be confident within 10 minutes that rollback solves the problem, evaluate fix-forward. Both are valid — the goal is fastest time-to-resolution.

Post-Mortem Timing: Same Day vs Within 5 Days

| Timing | When to use | Trade-off |
| ------ | ----------- | --------- |
| Same day | Minor incidents, quick wins | Memory is freshest, but contributing factors that need distance to see may be missed |
| Within 5 business days | SEV1/SEV2 | Time to process, but action items can lose urgency |

For SEV1 and SEV2, 5 business days gives enough distance to see patterns rather than just symptoms, while staying close enough to the incident for accurate recall.

Communication Templates vs Ad-Hoc Updates

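| Approach  | Pros                                         | Cons                                |
| --------- | -------------------------------------------- | ----------------------------------- |
| Templates | Faster, consistent, harder to forget key info | Feels robotic if over-used          |
| Ad-hoc    | Flexible, context-specific                   | Inconsistency, missing stakeholders |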

Use templates for severity, status page, and stakeholder updates. Adapt the tone but keep the structure. For SEV1, pre-built templates in a shared doc save minutes that matter.

Interview Questions

1. Define the key phases of an incident response lifecycle and explain why each matters.

Expected answer points:

  • Detection: monitoring, alerting, and user reports — the faster you know, the faster you respond
  • Containment: isolate the blast radius to prevent further damage while you investigate
  • Investigation: root cause analysis under pressure — distinguish symptoms from causes
  • Resolution: fix the problem and verify the fix works in production
  • Post-mortem: blameless review to identify systemic improvements, not assign blame

2. What is the difference between SEV1, SEV2, and SEV3? How do you decide which severity to assign?

Expected answer points:

  • SEV1: complete service unavailability, data loss, security breach — the business is on fire
  • SEV2: major feature degraded but customers can work around it — urgent but not catastrophic
  • SEV3: minor feature impairment, low user impact — addressed in normal workflow
  • When in doubt, declare a higher severity and demote later — under-responding is more damaging
  • Partial availability is the hardest call: treat it as SEV2 unless proven otherwise

3. When should you open a war room versus handling an incident asynchronously in Slack?

Expected answer points:

  • Open a war room for SEV1, multiple-team coordination, or actively worsening incidents
  • Use async Slack for SEV2, stable incidents, or single-team investigations
  • War rooms add coordination overhead — justify it with the speed of parallel investigation
  • Keep war room scope tight: focused on stopping bleeding, not complete diagnosis

4. What makes an effective on-call rotation? How do you balance coverage quality against engineer burnout?

Expected answer points:

  • Follow-the-sun works for global teams: high coverage without night pages, at the cost of handoff overhead between time zones
  • Fixed rotations minimize context-switching but can lead to stale knowledge
  • Dev-first rotation (same team builds and runs) increases ownership but raises burnout risk
  • Managed services (PagerDuty, Opsgenie) offload scheduling complexity but add cost
  • Track MTTA and MTTR by rotation; if responders are burned out, the numbers show it

5. What is a runbook and what separates a good one from a bad one?

Expected answer points:

  • A runbook is a step-by-step playbook for a specific incident type
  • Good runbooks are precise, scoped to one problem, and updated after every incident
  • Bad runbooks are vague, try to cover too many scenarios, or become outdated
  • Template-driven runbooks are more maintainable than rigid scripts
  • Every runbook should have a clear exit criterion: when is this incident resolved?

6. How do you prevent cascading failures during incident response?

Expected answer points:

  • Identify dependent services early — what breaks when this component fails?
  • Use circuit breakers to stop failures from propagating upstream
  • Throttle traffic intentionally to keep the system alive at reduced capacity
  • Capacity buffers and graceful degradation prevent total collapse under load
  • Chaos engineering tests these failure modes before they happen in production

7. What is a blast radius analysis and when should you perform one during an incident?

Expected answer points:

  • Blast radius analysis estimates how far the impact spreads from the failure point
  • Do it immediately after containment — understand scope before adding more responders
  • Categorize impact: user-facing vs internal, revenue vs reputation, current vs potential
  • The goal is right-sizing the response: enough to contain, not so much it creates noise

8. How should you communicate incident status to stakeholders during an ongoing incident?

Expected answer points:

  • Public status page updates within minutes of declaring an incident
  • Use a single source of truth — one channel, one incident room, one status page
  • Initial communication: what happened, current impact, what you are doing
  • Follow-up communication: every 15-30 min for SEV1, less for SEV2
  • Never speculate about root cause publicly until it is confirmed

9. What is a blameless post-mortem and why is it important for organizational learning?

Expected answer points:

  • A blameless post-mortem focuses on systemic causes, not individual fault
  • Engineers who fear blame will hide mistakes, making systemic problems invisible
  • Effective post-mortems identify contributing factors across people, process, and tooling
  • Action items must be specific, assigned, and tracked to completion
  • Review post-mortems regularly; patterns across incidents reveal structural issues

10. How do you distinguish root cause from symptom during incident investigation?

Expected answer points:

  • Symptoms are observable effects — error rates, latency spikes, failed requests
  • Root cause is the underlying mechanism that produces the symptom
  • Ask "why" five times: why did users see errors → why did the service fail → why did it crash
  • Correlation is not causation — a spike in metric A followed by failure B does not mean A caused B
  • Instrumentation and tracing help separate coincidence from causality

11. What are the most common monitoring and alerting failures that prolong incidents?

Expected answer points:

  • Missing metrics: services without RED (Rate, Errors, Duration) coverage hide failure signals
  • Silent failures: catch-all error handlers that swallow exceptions without alerting
  • Alert fatigue: too many alerts cause responders to ignore real incidents or respond to them late
  • No causation signal: alerts show symptoms but give no path to diagnosis
  • Stale dashboards: dashboards that nobody maintains show outdated or misleading state when an incident hits

12. How do you design on-call schedules that balance coverage quality with engineer wellbeing?

Expected answer points:

  • Shorter, more frequent shifts are easier to sustain than long ones: a weekly rotation is better than a monthly one
  • Compensate on-call time regardless of incident frequency
  • Allow engineers to opt out of on-call during high-stress personal periods
  • Track alert volume per service — unhealthy services burn out on-call engineers
  • Regular on-call training and game days reduce stress during actual incidents

13. What is MTTR and how do you optimize it without sacrificing resolution quality?

Expected answer points:

  • MTTR (Mean Time To Resolution) measures average time from alert to restored service
  • High-fidelity alerting reduces MTTR by cutting time spent in detection
  • Runbook maturity reduces MTTR by removing the "what do we do now" phase
  • Parallel investigation (war room) reduces MTTR for complex incidents
  • Never trade off resolution quality for speed; incomplete fixes cause repeat incidents

14. What is the role of a dedicated incident commander during a major outage?

Expected answer points:

  • The incident commander owns coordination, not investigation — they facilitate, not fix
  • They manage the war room, delegate tasks, and track action items
  • They own external communication and stakeholder updates
  • Separating command from investigation prevents tunnel vision and missed angles
  • Rotation during long incidents prevents fatigue-driven decision errors

15. How do you handle an SLA breach situation when customers are actively being impacted?

Expected answer points:

  • Declare the breach immediately and update the status page — transparency > avoidance
  • Focus engineering effort on the fastest viable fix, not the perfect fix
  • Communication cadence doubles — every 10-15 minutes for active SLA breach
  • Account management should be pre-warned before customers call in
  • Post-incident: review the SLA itself. Was it achievable given the system's true capacity?

16. What steps would you take in the first 5 minutes of a suspected cascading failure?

Expected answer points:

  • Declare incident severity — do not wait for full understanding to start responding
  • Open the war room and page the incident commander
  • Check dashboards for upstream dependency failures — look for causal signals, not correlations
  • Throttle or shed traffic to preserve partial service — degraded is better than dead
  • Enable circuit breakers on dependent services to stop further propagation

17. How do you facilitate an effective post-mortem meeting without it turning into a blame session?

Expected answer points:

  • Start with a clear framing: this is about systems and processes, not individuals
  • Use the 5 whys technique to trace backward from incident to root cause
  • Every action item needs an owner and a deadline — vague follow-ups are useless
  • Share post-mortems broadly — organizational learning compounds over time
  • Track action item completion rate; if it is low, the process has no teeth

18. How would you design an incident response process for a team that has never done formal on-call?

Expected answer points:

  • Start with alerting: instrument services so failures produce signals, not silence
  • Define severity levels and response SLAs — even preliminary ones create structure
  • Create runbooks for the top 3 most common incidents — build habit before building completeness
  • Run game days: simulate failures in staging to stress-test the process
  • Retrospect after every incident, no matter how small, to build the learning muscle

19. What are the signs that an incident is becoming chronic and how do you manage it differently?

Expected answer points:

  • Chronic incidents recur multiple times in a short window — same root cause, different symptoms
  • Warning signs: repeat pages, a service that never leaves the incident state, growing defect backlog
  • Management approach: stop treating symptoms, allocate dedicated time to root-cause elimination
  • Elevate to project mode — incidents become tracked work items, not firefighting
  • Communicate chronic status to stakeholders; ongoing instability requires expectation management

20. How do you balance the need for speed in incident response with documentation and process compliance?

Expected answer points:

  • During active incident: document actions taken, not analysis — real-time notes are enough
  • Formal documentation and process updates happen after resolution, not during
  • Build templates and runbooks in peacetime so incident responders do not create process during crisis
  • Compliance should be baked into tooling (automated rollback, mandatory SLO gates) not human memory
  • Post-incident: the review process is where rigor and documentation matter — do not rush it

Conclusion

Key Takeaways

  • Define severity levels and use them consistently — classify by user impact, not technical interest
  • Declare SEV1 early and demote if needed — under-responding is worse than over-responding
  • The Incident Commander coordinates, does not investigate — separate the coordination role from the fix role
  • Blameless post-mortems focus on systems and processes, not individuals
  • Track MTTD, MTTA, MTTR over time — you cannot improve what you do not measure
  • Action items from post-mortems must be tracked and reviewed

Incident Response Checklist

# 1. Define severity levels (SEV1/SEV2/SEV3/SEV4) in your runbook
# 2. Set up PagerDuty escalation with 0 -> 15 -> 30 minute delays
# 3. Create status page templates for Investigating / Identified / Resolved
# 4. Assign an Incident Commander for every SEV1 and SEV2
# 5. Open separate Slack channel per incident: #inc-YYYY-MM-DD-description
# 6. Document timeline as the incident progresses, not after
# 7. Conduct blameless post-mortem within 5 business days of SEV1/SEV2
# 8. Create action items in project management tool with owners and due dates
# 9. Review MTTD, MTTA, MTTR monthly in ops review meeting
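
Item 5 is easy to automate so that channel names stay consistent. The sketch below only builds the name in the `#inc-YYYY-MM-DD-description` pattern from the checklist; `incident_channel` is a hypothetical helper, and creating the channel through your chat tool's API is left out.

```python
import re
from datetime import date


def incident_channel(day: date, description: str) -> str:
    """Build a channel name like #inc-2026-03-25-checkout-service-outage."""
    # Lowercase the description and collapse anything that is not a letter or digit into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


print(incident_channel(date(2026, 3, 25), "Checkout Service Outage"))
```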

For more on building resilient systems, see Chaos Engineering. For alerting best practices, see Alerting in Production. For monitoring best practices and SLOs, see Observability Engineering.
