Alerting in Production: Paging, Runbooks, and On-Call
Build effective alerting systems that wake people up for real emergencies: alert fatigue prevention, runbook automation, and healthy on-call practices.
Alert fatigue is the silent killer of operational excellence. You have probably seen it: pages firing constantly, engineers starting to ignore them, and then when something genuinely breaks at 3am, nobody responds fast enough because they assume it is another false alarm. This is not a tooling problem. It is a design philosophy problem.
When to Use
PagerDuty vs Static Thresholds vs SLO-Based Alerting
Use PagerDuty-style urgent alerting when you have customer-facing outages that need immediate human response: complete service unavailability, data integrity risks, or security breaches. These warrant 3am pages and should be rare.
Use static threshold alerting when you have clear, known failure modes with fixed boundaries: CPU > 90%, disk > 80%, error rate > 1%. Static thresholds work for infrastructure-level signals that have predictable normal ranges.
Use SLO-based alerting when you want to alert on user impact rather than internal metrics. SLO alerts fire when your error budget is burning faster than sustainable, catching both sudden spikes and slow leaks. This is the most user-centric approach.
Use anomaly-based alerting when normal behavior varies too much for static thresholds — for example, traffic patterns that change by time of day or day of week. Anomaly detection adapts to patterns but requires historical data and can produce false positives.
The practical stack: SLO alerts for customer-facing services, static thresholds for infrastructure health, anomaly detection for high-variance business metrics.
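A multi-window burn-rate page, in the style popularized by Google's SRE workbook, can be sketched in PromQL. The `http_requests_total` metric name and the 0.001 budget rate (for a 99.9% availability SLO) are assumptions here, not part of any standard:

```promql
# Fast-burn page: both the short (5m) and long (1h) windows must exceed
# 14.4x the sustainable error-budget rate before waking anyone up.
(
  sum(rate(http_requests_total{code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
)
and
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
)
```

Requiring both windows prevents a single transient spike (short window only) or a long-recovered incident (long window only) from paging.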
When to Page vs When to Slack
Page (PagerDuty, SMS, call) when the issue requires human action within 15 minutes: service is down, data is at risk, security incident in progress. If it cannot be fixed by an engineer clicking something in the next 15 minutes, it probably does not need a page.
Slack (or email) when the issue is important but can wait: a disk at 75% has days of runway, a non-critical service is degraded but users can work around it, capacity planning needs attention.
The litmus test: if you wake someone up, can they do something actionable in 5 minutes? If not, it should not page.
```mermaid
flowchart TD
    A[Metric Fires] --> B{User impact?}
    B -->|Yes| C{SLO budget burning?}
    B -->|No| D{Remediable in 15min?}
    C -->|Yes| E[SLO Burn Rate Alert<br/>Slack + Page if fast burn]
    C -->|No| F[Static Threshold Alert<br/>Slack only]
    D -->|Yes| G[Static Threshold Alert<br/>Slack only]
    D -->|No| H[Log/Capacity Alert<br/>Ticketing system]
```
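The flowchart reduces to a small routing helper. The `Alert` fields and channel names below are illustrative, not a real PagerDuty or Slack API:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    user_impact: bool          # is a user-visible symptom involved?
    fast_burn: bool            # is the SLO error budget burning quickly?
    remediable_in_15min: bool  # can an engineer act within 15 minutes?

def route(alert: Alert) -> str:
    """Map an alert to a delivery channel, mirroring the flowchart above."""
    if alert.user_impact:
        # User-facing: page only when the error budget is burning fast.
        return "page" if alert.fast_burn else "slack"
    # No direct user impact: Slack if actionable soon, otherwise a ticket.
    return "slack" if alert.remediable_in_15min else "ticket"
```

Encoding the decision as code makes it testable, so routing changes can be reviewed like any other change.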
Alerting Philosophy: Symptoms vs Causes
The first question to ask before creating any alert: does this represent a symptom a user is experiencing, or a cause that needs investigation? Many teams alert on causes (high CPU, memory pressure, disk I/O) when they should be alerting on symptoms: error rates, latency spikes, request failures.
A user does not care that your CPU hit 90%. They care that their checkout page is loading slowly or returning errors. Alert on the symptom, investigate the cause.
The Metrics & Monitoring guide covers this distinction in more detail, but the key principle is simple: your alerting should answer the question “is the user okay?” without requiring deep system knowledge.
Defining SLOs and Error Budgets
Service Level Objectives give your alerting meaning. Without SLOs, you have no rational basis for deciding what deserves a page and what can wait.
Define SLOs based on user experience:
```yaml
# Example SLO definitions
checkout_service:
  availability: 99.9%   # ~43 minutes of downtime budget per month
  latency_p99: 2000ms   # 99% of requests complete under 2 seconds
  error_rate: 0.1%      # at most 0.1% of requests return 5xx
```
Error budgets are the flip side of SLOs. If your availability SLO is 99.9%, you have a 43-minute error budget per month. When you burn through that budget, alerts should fire, even if the system has not completely failed.
This approach shifts alerting from “something is wrong” to “we are at risk of missing our commitments to users.”
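The budget arithmetic is simple enough to sketch. This helper assumes a 30-day window and expresses burn rate relative to the SLO's allowed error rate:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is burning.
    A burn rate of 1.0 exhausts the budget exactly at the window's end."""
    budget_rate = 1 - slo  # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / budget_rate

# 99.9% over 30 days -> ~43.2 minutes of budget
print(round(error_budget_minutes(0.999), 1))      # 43.2
# 1.44% errors against a 99.9% SLO -> 14.4x burn, a page-worthy fast burn
print(round(burn_rate(0.0144, 0.999), 1))         # 14.4
```

The 14.4x figure is why fast-burn alerts use that multiplier: at that rate, a 30-day budget is gone in roughly two days.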
Alert Severity Levels and Routing
Not everything deserves the same response. Use a clear severity hierarchy:
| Severity | Definition | Response Time | Channel |
|---|---|---|---|
| SEV1 | User-facing outage, data loss risk | 5 minutes | PagerDuty + SMS + Call |
| SEV2 | Degraded performance, partial outage | 15 minutes | PagerDuty + Slack |
| SEV3 | Non-critical issue, capacity risk | 2 hours | Slack |
| SEV4 | Informational, maintenance needed soon | Next business day | Ticketing system |
Route alerts based on severity and on-call schedules. A SEV1 fires regardless of time; a SEV3 on a Saturday afternoon might wait until Monday if the on-call engineer is on a hike.
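In Prometheus Alertmanager, this severity routing might look like the following sketch; the receiver names are placeholders for your own PagerDuty, Slack, and ticketing integrations:

```yaml
route:
  receiver: slack-default                   # fallback for anything unmatched
  routes:
    - matchers: ['severity="critical"']     # SEV1/SEV2: page regardless of time
      receiver: pagerduty-oncall
    - matchers: ['severity="warning"']      # SEV3: Slack, follow up in hours
      receiver: slack-oncall
    - matchers: ['severity="info"']         # SEV4: ticket queue
      receiver: ticketing-webhook

receivers:
  - name: pagerduty-oncall
  - name: slack-oncall
  - name: ticketing-webhook
  - name: slack-default
```

Routes are evaluated top-down, so keep the most urgent matchers first and let the fallback catch anything unlabeled.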
Runbook Writing and Automation
A runbook is not documentation. It is a decision tree for stressful situations. When an alert fires at 2am, you should not be reading documentation. You should be following steps.
Good runbooks have three properties: they are skimmable, they have clear commands to run, and they include escalation points.
```markdown
# Runbook: High Error Rate on Checkout Service

## Symptoms
- Error rate > 1% for 5 minutes
- Checkout failures appearing in logs

## Investigation Steps
1. Check payment processor status → [link to status page]
2. Check database connections: `SELECT count(*) FROM pg_stat_activity WHERE state = 'active';`
3. Check recent deployments: `git log --oneline -10`
4. Check feature flags: [feature flag dashboard link]

## Mitigation
- If deployment-related: roll back with `./scripts/rollback.sh checkout-service`
- If database: scale up the connection pool
- If payment processor: enable the circuit breaker

## Escalation
- SEV1: Call Platform Lead immediately
- Still stuck after 20 minutes: Page Engineering Manager
```
Automate what you can. If a runbook step can be scripted, script it. Runbook automation reduces mean time to resolution because you are not copy-pasting commands under pressure.
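A minimal sketch of a runbook step runner, using the commands from the runbook above; the dry-run default lets the on-call engineer review every command before anything executes. The structure is illustrative, not a real tool:

```python
import subprocess

# Ordered runbook steps: (description, command). Commands are illustrative.
STEPS = [
    ("Check active DB connections",
     ["psql", "-c",
      "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"]),
    ("List recent deployments", ["git", "log", "--oneline", "-10"]),
]

def run_steps(steps, dry_run=True):
    """Execute runbook steps in order.

    With dry_run=True, only the commands are returned for review;
    nothing is executed until the engineer opts in."""
    results = []
    for description, cmd in steps:
        if dry_run:
            results.append((description, " ".join(cmd)))
        else:
            out = subprocess.run(cmd, capture_output=True, text=True)
            results.append((description, out.stdout))
    return results

for desc, output in run_steps(STEPS):
    print(f"{desc}: {output}")
```

Even this small amount of structure beats copy-pasting from a wiki at 2am, and the step list doubles as the runbook's source of truth.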
The Logging Best Practices post has examples of log queries you can embed directly into runbooks for faster diagnosis.
On-Call Rotation Best Practices
Healthy on-call rotations share a few characteristics: fairness, manageable load, and psychological safety.
Rotate frequently enough that no single engineer bears the burden. A two-week rotation is common; a one-week rotation shortens each stint and limits sustained disruption, at the cost of more frequent handoffs.
Compensate people for being on-call. This is table stakes. Being on-call interrupts sleep, social life, and personal time. Pay them for that disruption, either in extra pay or time off.
Separate primary and secondary on-call. Primary gets paged first. If they do not acknowledge within 5 minutes, secondary gets paged. This prevents single points of failure.
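The primary/secondary handoff reduces to a small piece of logic; the 5-minute acknowledgment window below mirrors the text above:

```python
ACK_TIMEOUT_MINUTES = 5  # primary must acknowledge within this window

def who_to_page(minutes_since_fire: float, acknowledged: bool) -> str:
    """Return the current escalation target under the primary/secondary model."""
    if acknowledged:
        return "nobody"
    if minutes_since_fire < ACK_TIMEOUT_MINUTES:
        return "primary"
    return "secondary"
```

Paging tools implement this as an escalation policy; writing it out makes the timeout an explicit, reviewable number rather than a buried setting.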
Post-Incident Review and Alert Tuning
After every SEV1 and SEV2, conduct a post-incident review. The goal is to understand what happened, why the alert did or did not help, and what to fix.
Use the “5 Whys” technique: start with the incident, then ask why five times to get to root cause.
```markdown
## Post-Incident Review: Checkout Outage 2026-03-15

**Duration:** 23 minutes
**Impact:** ~340 failed checkouts
**Root Cause:** Database connection pool exhausted after a slow query

**5 Whys:**
1. Why did checkouts fail? → Database connections were exhausted
2. Why were connections exhausted? → A query was holding connections for >30 seconds
3. Why was the query slow? → Missing index on orders.user_id
4. Why was the index missing? → Added in staging but not in production migration
5. Why did the migration not run? → Production migration was blocked by a lint check

**Alert Feedback:** The "high error rate" alert fired, but engineers had to
investigate for 8 minutes before finding the database connection issue. We
should add a separate alert for connection pool utilization >80%.
```
Tune alerts based on post-incident findings. If an alert fired but was not actionable, adjust the threshold or add context. If something should have alerted but did not, add a new alert.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Alert storm from one root cause | Hundreds of pages; on-call ignores all | Use alert grouping and deduplication; route related alerts to single incident |
| SLO alert fires but system already recovering | Wasted wake-up; engineer annoyed | Add a `for: 5m` hold so transient blips resolve before firing; prefer burn-rate alerts over raw thresholds |
| Alert routing to wrong team | Wrong people paged; issue unaddressed | Validate routing rules and test escalation paths quarterly |
| Runbook steps no longer accurate | MTTR increases; engineers confused | Review and update runbooks after every SEV1; version control runbooks |
| On-call rotation gap | Alert fires but nobody acknowledges; extended outage | Overlap on-call schedules; test handoff process; have fallback escalation |
| Static threshold too sensitive | Pages fire for normal traffic spikes | Add baseline deviation; use relative thresholds (CPU > 2x normal) not absolute |
Observability Hooks for Alerting
Alerting systems need their own monitoring. If your alerting system fails, you lose visibility at the worst possible moment.
Alert on Alerting Itself
| What to Monitor | Metric | Alert Threshold |
|---|---|---|
| Alertmanager down | alertmanager_up == 0 | Page immediately |
| Alert delivery latency | alertmanager_notification_latency_seconds > 30s | Page if > 1 minute |
| Alert storm detected | alerts_firing{severity="critical"} > 20 | Page if sustained > 5 min |
| Notification queue backing up | alertmanager_alerts_pending > 100 | Warning if > 50 |
| On-call acknowledgment rate | MTTA by engineer | Track but do not alert on |
Alert Quality Metrics to Track
```promql
# False positive rate: alerts that fire but require no action
sum(rate(alerts_firing{action="none"}[1h])) / sum(rate(alerts_firing[1h]))

# P95 time to acknowledge by severity
histogram_quantile(0.95,
  sum(rate(alert_ack_time_seconds_bucket[1h])) by (le, severity)
)

# Alert volume by service and severity
sum by (service, severity) (rate(alerts_firing[1h]))

# Week-over-week paging trend: are we paging more this week than last?
sum(rate(alerts_firing{severity="warning"}[1h])) /
  sum(rate(alerts_firing{severity="warning"}[1h] offset 7d))
```
Track these metrics in a dashboard. If false positive rate exceeds 30%, start pruning alerts. If MTTA is trending up, on-call load may be too high.
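If you export alert history, the two headline quality metrics are a few lines of code. The record shape below is hypothetical; adapt it to whatever your paging tool's API returns:

```python
# Each record: (severity, seconds_to_acknowledge, action_taken)
alerts = [
    ("critical", 120, True),
    ("critical", 300, True),
    ("warning",  600, False),   # fired but needed no action: a false positive
    ("warning",  480, True),
]

def false_positive_rate(records):
    """Fraction of alerts that fired but required no action."""
    return sum(1 for _, _, acted in records if not acted) / len(records)

def mean_time_to_ack(records, severity):
    """Mean seconds to acknowledge, filtered to one severity."""
    acks = [t for sev, t, _ in records if sev == severity]
    return sum(acks) / len(acks)

print(false_positive_rate(alerts))            # 0.25
print(mean_time_to_ack(alerts, "critical"))   # 210.0
```

Run this weekly and trend the numbers: a rising false positive rate is the earliest measurable sign of alert fatigue.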
Alert Routing Observability
```yaml
# Alert to verify alert routing is healthy
groups:
  - name: alerting-health
    rules:
      - alert: AlertManagerDown
        expr: up{job="alertmanager"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AlertManager instance {{ $labels.instance }} is down"
          description: "Alert routing is unavailable. All alerts will queue or fail."
      - alert: AlertDeliveryLatency
        expr: histogram_quantile(0.95, sum(rate(alertmanager_notification_latency_seconds_bucket[5m])) by (le)) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alert delivery latency above 30 seconds"
          description: "P95 alert delivery latency is {{ $value }}s. Alerts may arrive late during incidents."
      - alert: AlertQueueBackingUp
        expr: alertmanager_alerts_pending / alertmanager_alerts_maximum > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Alert notification queue above 80%"
          description: "Alert queue is backing up. Check AlertManager connectivity."
```
Common Anti-Patterns
Alerting on causes instead of symptoms. CPU at 90% does not tell you if users are affected. Error rate > 1% does. Alert on what users experience, then investigate internal signals to find the cause.
Static thresholds without context. A CPU at 90% on a Monday morning after a deployment is suspicious. The same CPU at 90% during peak traffic on Black Friday might be expected. Pair static thresholds with service-level context.
No alert deduplication. When a database fails, you do not want 50 alerts from 50 affected services. Group alerts by root cause so engineers investigate one incident, not 50 pages.
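In Alertmanager, grouping and deduplication are configured on the route; the values below are illustrative starting points, not recommendations:

```yaml
route:
  group_by: ['alertname', 'service']  # collapse related alerts into one notification
  group_wait: 30s        # wait for more alerts to join the group before sending
  group_interval: 5m     # batch additional alerts for an existing group
  repeat_interval: 4h    # don't re-notify an unresolved incident more often
  receiver: pagerduty-oncall
```

With grouping in place, the database failure above produces one incident with 50 member alerts instead of 50 separate pages.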
Alerting without runbooks. A page without a runbook means the on-call engineer has to start from scratch. Every page-worthy alert needs a linked runbook.
Failing to tune alerts. If an alert fires every week and nobody fixes it, the alert is useless. Either fix the underlying issue or remove the alert. Constant alerts train engineers to ignore pages.
Compensating for bad architecture with alerts. If your database fills up every month, fix the cleanup job, do not just alert on it. Alerts mask problems; they do not solve them.
Quick Recap
Key Takeaways
- Alert on symptoms users experience, not internal root causes
- SLO-based alerting catches both fast burns and slow leaks
- Every page-worthy alert needs a runbook with actionable steps
- Tune alerts after every SEV1 and SEV2; dead alerts train engineers to ignore pages
- Track alert quality metrics: false positive rate, MTTA, alert volume over time
Alerting Checklist
```text
# 1. Define SLOs for customer-facing services
#    availability: 99.9% (43 min/month budget)
#    latency_p99: 2000ms
#    error_rate: 0.1%
#
# 2. Configure multi-window burn-rate alerts
#    1h window: page if burning 14.4x the sustainable rate
#    6h window: warning if burning 6x the sustainable rate
#    3d window: investigate if burning 3x the sustainable rate
#
# 3. Set up alert routing
#    PagerDuty: SEV1 critical (page + SMS + call)
#    Slack: SEV2/SEV3 warnings
#    Ticket: SEV4 informational
#
# 4. Write runbooks for every page-worthy alert
#    Investigation steps with exact commands
#    Mitigation steps with rollback procedures
#    Escalation path with contact info
#
# 5. Tune quarterly
#    Review false positive rate (target < 20%)
#    Review MTTA (target < 5 minutes for critical)
#    Remove or fix alerts that fire but require no action
#
# 6. Monitor the monitoring
#    alertmanager_up == 1
#    MTTA by engineer tracked weekly
#    Alert volume by severity trended monthly
```
Trade-off Summary
| Alerting Strategy | Precision | Recall | Alert Volume | Best For |
|---|---|---|---|---|
| SLO-based alerts | High | High | Low | User-facing services |
| Resource-based alerts | Medium | Low | Medium | Infrastructure |
| Error-rate alerts | High | Medium | Medium | API services |
| Anomaly-based alerts | Low | High | High | Novel failure modes |
| Blackbox monitoring | High | Low | Low | External dependencies |
| Tool | Cost | Integrations | On-call Features |
|---|---|---|---|
| PagerDuty | High | Largest | Most mature |
| OpsGenie | Medium | Large | Good |
| Squadcast | Low | Growing | Good |
| Slack + bots | Lowest | Varies | Limited |
| Custom (webhooks) | Infra cost only | Varies | DIY |
Conclusion
Good alerting is not about having more alerts. It is about having the right alerts that tell you when users are impacted, route to the right person, and give enough context to act. Invest in SLOs, write actionable runbooks, rotate on-call fairly, and tune relentlessly.
Alert fatigue is a cultural problem as much as a technical one. Create an environment where it is safe to say “this alert is not actionable” and prioritize fixing it over tolerating it.