Alerting in Production: Paging, Runbooks, and On-Call

Build effective alerting systems that wake people up for real emergencies: alert fatigue prevention, runbook automation, and healthy on-call practices.


Alert fatigue is the silent killer of operational excellence. You have probably seen it: pages firing constantly, engineers starting to ignore them, and then when something genuinely breaks at 3am, nobody responds fast enough because they assume it is another false alarm. This is not a tooling problem. It is a design philosophy problem.

When to Use

PagerDuty vs Static Thresholds vs SLO-Based Alerting

Use PagerDuty-style urgent alerting when you have customer-facing outages that need immediate human response: complete service unavailability, data integrity risks, or security breaches. These warrant 3am pages and should be rare.

Use static threshold alerting when you have clear, known failure modes with fixed boundaries: CPU > 90%, disk > 80%, error rate > 1%. Static thresholds work for infrastructure-level signals that have predictable normal ranges.

Use SLO-based alerting when you want to alert on user impact rather than internal metrics. SLO alerts fire when your error budget is burning faster than sustainable, catching both sudden spikes and slow leaks. This is the most user-centric approach.

Use anomaly-based alerting when normal behavior varies too much for static thresholds — for example, traffic patterns that change by time of day or day of week. Anomaly detection adapts to patterns but requires historical data and can produce false positives.

The practical stack: SLO alerts for customer-facing services, static thresholds for infrastructure health, anomaly detection for high-variance business metrics.

When to Page vs When to Slack

Page (PagerDuty, SMS, call) when the issue requires human action within 15 minutes: service is down, data is at risk, security incident in progress. If it cannot be fixed by an engineer clicking something in the next 15 minutes, it probably does not need a page.

Slack (or email) when the issue is important but can wait: a disk at 75% has days of runway, a non-critical service is degraded but users can work around it, capacity planning needs attention.

The litmus test: if you wake someone up, is there a concrete action they can start within minutes? If not, it should not page.

```mermaid
flowchart TD
    A[Metric Fires] --> B{User impact?}
    B -->|Yes| C{SLO budget burning?}
    B -->|No| D{Remediable in 15min?}
    C -->|Yes| E[SLO Burn Rate Alert<br/>Slack + Page if fast burn]
    C -->|No| F[Static Threshold Alert<br/>Slack only]
    D -->|Yes| G[Static Threshold Alert<br/>Slack only]
    D -->|No| H[Log/Capacity Alert<br/>Ticketing system]
```

Alerting Philosophy: Symptoms vs Causes

The first question to ask before creating any alert: does this represent a symptom a user is experiencing, or a cause that needs investigation? Many teams alert on causes (high CPU, memory pressure, disk I/O) when they should be alerting on symptoms: error rates, latency spikes, request failures.

A user does not care that your CPU hit 90%. They care that their checkout page is loading slowly or returning errors. Alert on the symptom, investigate the cause.

The Metrics & Monitoring guide covers this distinction in more detail, but the key principle is simple: your alerting should answer the question “is the user okay?” without requiring deep system knowledge.

Defining SLOs and Error Budgets

Service Level Objectives give your alerting meaning. Without SLOs, you have no rational basis for deciding what deserves a page and what can wait.

Define SLOs based on user experience:

```yaml
# Example SLO definitions
checkout_service:
  availability: 99.9% # ~43 minutes of downtime allowed per month
  latency_p99: 2000ms # 99% of requests complete under 2 seconds
  error_rate: 0.1% # at most 0.1% of requests return 5xx
```

Error budgets are the flip side of SLOs. If your availability SLO is 99.9%, you have a 43-minute error budget per month. When you burn through that budget, alerts should fire, even if the system has not completely failed.

This approach shifts alerting from “something is wrong” to “we are at risk of missing our commitments to users.”
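The budget arithmetic is simple enough to script so nobody recomputes it by hand during a review. A minimal sketch (the function name is illustrative):

```python
def error_budget_minutes(slo_availability: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_availability)

# 99.9% over a 30-day month leaves roughly 43.2 minutes of budget
print(round(error_budget_minutes(0.999), 1))  # 43.2
```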

Alert Severity Levels and Routing

Not everything deserves the same response. Use a clear severity hierarchy:

| Severity | Definition | Response Time | Channel |
|---|---|---|---|
| SEV1 | User-facing outage, data loss risk | 5 minutes | PagerDuty + SMS + Call |
| SEV2 | Degraded performance, partial outage | 15 minutes | PagerDuty + Slack |
| SEV3 | Non-critical issue, capacity risk | 2 hours | Slack |
| SEV4 | Informational, maintenance soon | Next business day | Email |

Route alerts based on severity and on-call schedules. A SEV1 fires regardless of time; a SEV3 on a Saturday afternoon might wait until Monday if the on-call engineer is on a hike.
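A sketch of that routing logic, following the severity table above; the channel names and the business-hours rule are illustrative assumptions, not any vendor's API:

```python
from datetime import datetime

# Channels per severity, mirroring the severity table (names are illustrative).
ROUTES = {
    "SEV1": ["pagerduty", "sms", "call"],
    "SEV2": ["pagerduty", "slack"],
    "SEV3": ["slack"],
    "SEV4": ["email"],
}

def route_alert(severity: str, now: datetime) -> list[str]:
    """SEV1/SEV2 fire regardless of time; SEV3 defers outside business hours."""
    is_business_hours = now.weekday() < 5 and 9 <= now.hour < 18
    if severity == "SEV3" and not is_business_hours:
        return ["ticket"]  # wait for Monday instead of pinging a hiker
    return ROUTES[severity]

# A SEV1 at 3am on a Saturday still pages in full:
assert route_alert("SEV1", datetime(2026, 3, 14, 3)) == ["pagerduty", "sms", "call"]
```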

Runbook Writing and Automation

A runbook is not documentation. It is a decision tree for stressful situations. When an alert fires at 2am, you should not be reading documentation. You should be following steps.

Good runbooks have three properties: they are skimmable, they have clear commands to run, and they include escalation points.

```markdown
# Runbook: High Error Rate on Checkout Service

## Symptoms

- Error rate > 1% for 5 minutes
- Checkout failures appearing in logs

## Investigation Steps

1. Check payment processor status → [link to status page]
2. Check database connections: `SELECT count(*) FROM pg_stat_activity WHERE state = 'active';`
3. Check recent deployments: `git log --oneline -10`
4. Check feature flags: [Plaid link]

## Mitigation

- If deployment-related: rollback with `./scripts/rollback.sh checkout-service`
- If database: scale up connection pool
- If payment processor: enable circuit breaker

## Escalation

- SEV1: Call Platform Lead immediately
- Still stuck after 20 minutes: Page Engineering Manager
```

Automate what you can. If a runbook step can be scripted, script it. Runbook automation reduces mean time to resolution because you are not copy-pasting commands under pressure.
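A hypothetical sketch of that automation: a runner that executes scripted investigation steps in order and captures their output, so the on-call engineer runs one command instead of copy-pasting several. The step commands below are placeholders, not part of any real runbook:

```python
import subprocess

# Scripted runbook steps (commands are illustrative placeholders).
STEPS = [
    ("check recent deploys", ["git", "log", "--oneline", "-10"]),
    ("check db connections", ["psql", "-c", "SELECT count(*) FROM pg_stat_activity;"]),
]

def run_runbook(steps, timeout=30):
    """Run each step, capture its output, and stop at the first failure."""
    results = []
    for name, cmd in steps:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        results.append((name, proc.returncode, proc.stdout.strip()))
        if proc.returncode != 0:
            break  # surface the failing step instead of ploughing on
    return results
```

Each tuple in the result is `(step name, exit code, output)`, which can be pasted straight into the incident channel.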

The Logging Best Practices post has examples of log queries you can embed directly into runbooks for faster diagnosis.

On-Call Rotation Best Practices

Healthy on-call rotations share a few characteristics: fairness, manageable load, and psychological safety.

Rotate frequently enough that no single engineer bears the burden. A one-week rotation is a good default: two weeks concentrates too much disruption on one engineer, while shorter rotations add handoff overhead.

Compensate people for being on-call. This is table stakes. Being on-call interrupts sleep, social life, and personal time. Pay them for that disruption, either in extra pay or time off.

Separate primary and secondary on-call. Primary gets paged first. If they do not acknowledge within 5 minutes, secondary gets paged. This prevents single points of failure.

Post-Incident Review and Alert Tuning

After every SEV1 and SEV2, conduct a post-incident review. The goal is to understand what happened, why the alert did or did not help, and what to fix.

Use the “5 Whys” technique: start with the incident, then ask why five times to get to root cause.

## Post-Incident Review: Checkout Outage 2026-03-15

**Duration:** 23 minutes
**Impact:** ~340 failed checkouts
**Root Cause:** Database connection pool exhausted after a slow query

**5 Whys:**

1. Why did checkouts fail? → Database connections were exhausted
2. Why were connections exhausted? → A query was holding connections for >30 seconds
3. Why was the query slow? → Missing index on orders.user_id
4. Why was the index missing? → Added in staging but not in production migration
5. Why did the migration not run? → Production migration was blocked by a lint check

**Alert Feedback:** The "high error rate" alert fired, but engineers had to investigate for 8 minutes before finding the database connection issue. We should add a separate alert for connection pool utilization >80%.

Tune alerts based on post-incident findings. If an alert fired but was not actionable, adjust the threshold or add context. If something should have alerted but did not, add a new alert.

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Alert storm from one root cause | Hundreds of pages; on-call ignores all | Use alert grouping and deduplication; route related alerts to a single incident |
| SLO alert fires but system already recovering | Wasted wake-up; engineer annoyed | Add a `for: 5m` hold so transient blips do not page; prefer burn-rate alerts over raw thresholds |
| Alert routing to wrong team | Wrong people paged; issue unaddressed | Validate routing rules and test escalation paths quarterly |
| Runbook steps no longer accurate | MTTR increases; engineers confused | Review and update runbooks after every SEV1; version-control runbooks |
| On-call rotation gap | Alert fires but nobody acknowledges; extended outage | Overlap on-call schedules; test the handoff process; have a fallback escalation |
| Static threshold too sensitive | Pages fire for normal traffic spikes | Use relative thresholds against a baseline (CPU > 2x normal), not absolute values |
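The relative-threshold mitigation is easy to sketch; assuming a rolling baseline built from recent samples:

```python
from statistics import mean

def breaches_relative_threshold(current: float, history: list[float],
                                factor: float = 2.0) -> bool:
    """Fire only when the current value exceeds `factor` times the recent baseline."""
    baseline = mean(history)
    return current > factor * baseline

# 90% CPU is normal against a 70% baseline, alarming against a 30% baseline:
assert not breaches_relative_threshold(90, [68, 72, 70])
assert breaches_relative_threshold(90, [28, 32, 30])
```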

Observability Hooks for Alerting

Alerting systems need their own monitoring. If your alerting system fails, you lose visibility at the worst possible moment.

Alert on Alerting Itself

| What to Monitor | Metric | Alert Threshold |
|---|---|---|
| Alertmanager down | `alertmanager_up` | Page immediately if `== 0` |
| Alert delivery latency | `alertmanager_notification_latency_seconds` | Warn if p95 > 30s; page if > 1 minute |
| Alert storm detected | `alerts_firing{severity="critical"}` | Page if > 20 sustained for 5 min |
| Notification queue backing up | `alertmanager_alerts_pending` | Warn if > 50; escalate if > 100 |
| On-call acknowledgment rate | MTTA by engineer | Track but do not alert on |

Alert Quality Metrics to Track

```promql
# False positive rate: alerts that fire but require no action
sum(rate(alerts_firing{action="none"}[1h])) / sum(rate(alerts_firing[1h]))

# Mean time to acknowledge (MTTA) by severity, p95
histogram_quantile(0.95,
  sum(rate(alert_ack_time_seconds_bucket[1h])) by (le, severity)
)

# Alert volume by service and severity
sum by (service, severity) (rate(alerts_firing[1h]))

# Alert fatigue trend: are we alerting more this week than last?
sum(rate(alerts_firing{severity="warning"}[1h])) /
  sum(rate(alerts_firing{severity="warning"}[1h] offset 7d))
```

Track these metrics in a dashboard. If false positive rate exceeds 30%, start pruning alerts. If MTTA is trending up, on-call load may be too high.
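The same pruning decision can be checked offline against incident records. A toy sketch (the record shape and field names are illustrative assumptions):

```python
def false_positive_rate(alerts: list[dict]) -> float:
    """Fraction of fired alerts that required no action."""
    if not alerts:
        return 0.0
    noise = sum(1 for a in alerts if a["action"] == "none")
    return noise / len(alerts)

# 4 of 10 alerts needed no action: 40%, above the 30% pruning threshold
alerts = [{"action": "none"}] * 4 + [{"action": "rollback"}] * 6
assert false_positive_rate(alerts) == 0.4
```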

Alert Routing Observability

```yaml
# Alert to verify alert routing is healthy
groups:
  - name: alerting-health
    rules:
      - alert: AlertManagerDown
        expr: up{job="alertmanager"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AlertManager instance {{ $labels.instance }} is down"
          description: "Alert routing is unavailable. All alerts will queue or fail."

      - alert: AlertDeliveryLatency
        expr: histogram_quantile(0.95, sum(rate(alertmanager_notification_latency_seconds_bucket[5m])) by (le)) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alert delivery latency above 30 seconds"
          description: "P95 alert delivery latency is {{ $value }}s. Alerts may arrive late during incidents."

      - alert: AlertQueueBackingUp
        expr: alertmanager_alerts_pending / alertmanager_alerts_maximum > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Alert notification queue above 80%"
          description: "Alert queue is backing up. Check AlertManager connectivity."
```

Common Anti-Patterns

Alerting on causes instead of symptoms. CPU at 90% does not tell you if users are affected. Error rate > 1% does. Alert on what users experience, then investigate internal signals to find the cause.

Static thresholds without context. A CPU at 90% on a Monday morning after a deployment is suspicious. The same CPU at 90% during peak traffic on Black Friday might be expected. Pair static thresholds with service-level context.

No alert deduplication. When a database fails, you do not want 50 alerts from 50 affected services. Group alerts by root cause so engineers investigate one incident, not 50 pages.
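One way to sketch the grouping: collapse alerts that share a root-cause label into a single incident. The label names below are illustrative:

```python
from collections import defaultdict

def group_alerts(alerts: list[dict], key: str = "root_resource") -> dict:
    """Collapse alerts sharing a root resource into one incident per group."""
    incidents = defaultdict(list)
    for alert in alerts:
        # Fall back to the service name when no root-cause label is attached.
        incidents[alert.get(key, alert["service"])].append(alert)
    return dict(incidents)

alerts = [
    {"service": "checkout", "root_resource": "db-primary"},
    {"service": "billing", "root_resource": "db-primary"},
    {"service": "search"},
]
# Two incidents: one database incident covering both services, one for search
assert len(group_alerts(alerts)) == 2
```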

Alerting without runbooks. A page without a runbook means the on-call engineer has to start from scratch. Every page-worthy alert needs a linked runbook.

Failing to tune alerts. If an alert fires every week and nobody fixes it, the alert is useless. Either fix the underlying issue or remove the alert. Constant alerts train engineers to ignore pages.

Compensating for bad architecture with alerts. If your database fills up every month, fix the cleanup job, do not just alert on it. Alerts mask problems; they do not solve them.

Quick Recap

Key Takeaways

  • Alert on symptoms users experience, not internal root causes
  • SLO-based alerting catches both fast burns and slow leaks
  • Every page-worthy alert needs a runbook with actionable steps
  • Tune alerts after every SEV1 and SEV2; dead alerts train engineers to ignore pages
  • Track alert quality metrics: false positive rate, MTTA, alert volume over time

Alerting Checklist

```text
# 1. Define SLOs for customer-facing services
# availability: 99.9% (43 min/month budget)
# latency_p99: 2000ms
# error_rate: 0.1%

# 2. Configure multi-window burn-rate alerts
# 1h window: page if burning 14.4x sustainable rate
# 6h window: warning if burning 6x sustainable rate
# 3d window: investigate if burning 3x sustainable rate

# 3. Set up alert routing
# PagerDuty: SEV1 critical (page + SMS + call)
# Slack: SEV2/SEV3 warnings
# Ticket: SEV4 informational

# 4. Write runbooks for every page-worthy alert
# Investigation steps with exact commands
# Mitigation steps with rollback procedures
# Escalation path with contact info

# 5. Tune quarterly
# Review false positive rate — target < 20%
# Review MTTA — target < 5 minutes for critical
# Remove or fix alerts that fire but require no action

# 6. Monitor the monitoring
# alertmanager_up == 1
# MTTA by engineer tracked weekly
# Alert volume by severity trended monthly
```
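The multipliers in step 2 fall out of simple arithmetic: a burn rate is the multiple of the sustainable spend rate at which the budget is being consumed. A sketch, assuming a 30-day budget window:

```python
def burn_rate(budget_fraction_spent: float, window_hours: float,
              budget_days: int = 30) -> float:
    """Burn-rate multiple implied by spending `budget_fraction_spent`
    of the error budget within `window_hours`."""
    return budget_fraction_spent * (budget_days * 24) / window_hours

# 2% of a 30-day budget gone in 1 hour is a 14.4x burn -> page
# 5% gone in 6 hours is a 6x burn -> warn
print(round(burn_rate(0.02, 1), 1), round(burn_rate(0.05, 6), 1))
```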

Trade-off Summary

| Alerting Strategy | Precision | Recall | Alert Volume | Best For |
|---|---|---|---|---|
| SLO-based alerts | High | High | Low | User-facing services |
| Resource-based alerts | Medium | Low | Medium | Infrastructure |
| Error-rate alerts | High | Medium | Medium | API services |
| Anomaly-based alerts | Low | High | High | Novel failure modes |
| Blackbox monitoring | High | Low | Low | External dependencies |

| PagerDuty vs Alternatives | Cost | Integrations | On-call Features |
|---|---|---|---|
| PagerDuty | High | Largest | Most mature |
| OpsGenie | Medium | Large | Good |
| Squadcast | Low | Growing | Good |
| Slack + bots | Lowest | Varies | Limited |
| Custom (webhooks) | Infrastructure only | Varies | DIY |

Conclusion

Good alerting is not about having more alerts. It is about having the right alerts that tell you when users are impacted, route to the right person, and give enough context to act. Invest in SLOs, write actionable runbooks, rotate on-call fairly, and tune relentlessly.

Alert fatigue is a cultural problem as much as a technical one. Create an environment where it is safe to say “this alert is not actionable” and prioritize fixing it over tolerating it.
