Alerting in Production: Building Alerts That Matter

Build alerting systems that catch real problems without fatigue. Learn alert design principles, severity levels, runbooks, and on-call best practices.



You know the drill. 3am. Your phone screams. You stumble for it, heart pounding. The alert says something about CPU. You log in, check, find nothing wrong. Back to bed, if you can get back to sleep at all. That is alert fatigue, and it is worse than silence.

The flip side is just as bad: systems quietly falling apart while dashboards stay green and nobody notices until the tickets start pouring in.

This guide covers how to design alerting systems that catch real problems, minimize noise, and actually help during incidents.

When to Use Alerting

Alerting is appropriate when:

  • You have production services where downtime or degradation matters
  • You have on-call engineers who can respond to issues
  • Your service has known failure modes that can be detected automatically
  • You have established SLOs with defined error budgets

When to skip or simplify alerting:

  • Development or test environments with no user impact
  • Services you are actively migrating or planning to deprecate within 30 days
  • Systems where you have no on-call coverage and cannot respond anyway
  • Research prototypes with no uptime requirements

The Alerting Problem

The typical pipeline, from metric collection through rule evaluation to notification:

flowchart TB
    subgraph "Metric Collection"
        APP["Application\n(Metrics, Traces)"]
        HOST["Host Exporter\n(Node Exporter)"]
        DB["Database\nExporter"]
    end
    subgraph "Alerting Engine"
        PROM["Prometheus\n/ Alertmanager"]
        RULES["Alert Rules\n(SLO, Burn Rate)"]
        EVAL["Rule\nEvaluator"]
    end
    subgraph "Alert Routing"
        ROUTE["Alertmanager\nRouting Tree"]
        INHIBIT["Inhibit\nRules"]
        SILENCE["Silence\nPeriods"]
    end
    subgraph "Notification"
        PAGER["PagerDuty\n(P1, P2)"]
        SLACK["Slack\n(P3, Team)"]
        EMAIL["Email\n(P4, Backlog)"]
    end
    APP --> PROM
    HOST --> PROM
    DB --> PROM
    PROM --> EVAL
    EVAL --> RULES
    RULES --> ROUTE
    ROUTE --> INHIBIT
    ROUTE --> SILENCE
    INHIBIT --> PAGER
    ROUTE --> SLACK
    ROUTE --> EMAIL

Alerting is a signal-to-noise problem. Too many alerts and real issues get lost in the noise. Too few and you miss real failures.

Most bad alerting stems from a few common mistakes. Teams alert on causes instead of symptoms. Thresholds get set once at launch and never touched again, even as traffic patterns change. There is no clear line between “something needs attention” and “wake somebody up right now.” And sometimes alerts fire for issues that clear up on their own before anyone could act anyway.

Our Metrics, Monitoring, and Alerting guide covers the foundational concepts like SLIs, SLOs, and golden signals that underpin effective alerting.

Alert Severity Levels

Not all alerts are equal. Severity levels determine response time and escalation path.

Standard Severity Levels

| Severity | Response Time | Examples | Who Gets Paged |
|---|---|---|---|
| P1 Critical | Minutes, 24/7 | Complete outage, data loss, security breach | On-call immediately |
| P2 High | 30 minutes | Degraded performance affecting users | On-call |
| P3 Medium | Business hours | Non-critical failures, capacity warnings | Team Slack |
| P4 Low | Next sprint | Predictable issues, capacity planning | Backlog |

PagerDuty gets expensive fast if everything is P1. Reserve that level for actual outages.

When to Page at Each Level

Page at P1 when:

  • Users cannot complete core workflows
  • Data integrity is at risk
  • Security breach is in progress
  • Revenue is directly affected

Page at P2 when:

  • A significant percentage of users are affected
  • Error rates exceed acceptable thresholds
  • Latency makes the service unusable
  • A dependency failure is cascading

Do not page at P3 or P4. Add them to dashboards and team channels instead.
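The severity table above maps naturally onto an Alertmanager routing tree. A minimal sketch, assuming alerts carry a `severity` label; the receiver names, channels, and keys are placeholders:

```yaml
# Sketch: route by severity label (receiver names are illustrative)
route:
  receiver: team-slack              # default: P3-style alerts go to chat
  routes:
    - match:
        severity: P1
      receiver: pagerduty-critical  # pages on-call immediately
    - match:
        severity: P2
      receiver: pagerduty-high
    - match:
        severity: P4
      receiver: email-backlog       # no page, lands in the backlog queue

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "<integration-key>"
  - name: pagerduty-high
    pagerduty_configs:
      - service_key: "<integration-key>"
  - name: team-slack
    slack_configs:
      - channel: "#team-alerts"
  - name: email-backlog
    email_configs:
      - to: "team@example.com"
```

The default receiver is deliberately the quiet one: anything without an explicit severity match goes to chat, not to a pager.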

Alert Design Principles

Principle 1: Alert on Symptoms, Not Causes

Users do not care that your CPU is at 95%. They care that pages are loading slowly. Alert on the symptom:

# Bad: Alert on cause
- alert: HighCPU
  expr: cpu_usage > 95
  # Fires constantly, not actionable

# Good: Alert on symptom
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 5m
  # Only fires when users are actually affected

Principle 2: Provide Context in the Alert

Every alert should tell the on-call engineer what they need to start debugging:

- alert: DatabaseConnectionsExhausted
  expr: db_connections_active / db_connections_max > 0.95
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Database connections at 95% capacity"
    description: |
      Service: {{ $labels.service }}
      Current connections: {{ $value | humanize }}
      Max connections: {{ $labels.max_connections }}
      Instance: {{ $labels.instance }}
      Runbook: https://runbooks.example.com/db-connections

Principle 3: Link to a Runbook

Every P1 and P2 alert should link to a runbook with:

  • What this alert means
  • Common causes
  • How to investigate
  • How to resolve
  • When to escalate

Set the link in the alert's annotations:

annotations:
  runbook_url: "https://runbooks.example.com/high-error-rate"

Runbook Structure

A good runbook is a checklist, not an essay. On-call engineers are stressed at 2am and cannot parse paragraphs. Give them steps they can follow without thinking too hard.

Runbook Template

# High Error Rate Runbook

## Symptoms

- Alert: <alert name from the notification>
- Current error rate: <value from the notification>

## Quick Checks

1. [ ] Check database health: `kubectl exec -it db-pod -- psql`
2. [ ] Check external dependencies status page
3. [ ] Check recent deployments: `kubectl get pods --sort-by=.metadata.creationTimestamp`
4. [ ] Check for increased traffic patterns

## Common Causes

- Database connection pool exhausted (check db_metrics)
- Upstream API outage (check dependencies)
- Bad deployment (check for recent changes)
- Traffic spike (check request rate)

## Mitigation Steps

1. Scale up if capacity issue: `kubectl scale deployment api-gateway --replicas=10`
2. Rollback if deployment issue: `./scripts/rollback.sh`
3. Enable circuit breaker if dependency issue

## Escalation

If not resolved in 30 minutes, page @platform-lead

Alert Suppression and Routing

Suppression During Maintenance

Prevent alerts from firing during planned maintenance:

# Alertmanager inhibit rules
inhibit_rules:
  - source_match:
      severity: maintenance
    target_match:
      severity: critical
    equal: ["alertname", "service"]

Routing by Service

Route alerts to the right team:

# Alertmanager routing
route:
  receiver: default
  routes:
    - match:
        service: database
      receiver: database-oncall
      continue: true
    - match:
        service: api
      receiver: api-oncall
    - match:
        severity: critical
      receiver: pagerduty

Blackbox Monitoring

Blackbox monitoring tests your service from the outside, independent of application metrics.

# blackbox.yml -- blackbox exporter module definition
modules:
  http_2xx:
    prober: http
    http:
      method: GET
      fail_if_ssl: false

# prometheus.yml -- scrape targets through the blackbox exporter
scrape_configs:
  - job_name: "blackbox-health"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # address of the exporter

Prometheus scrapes the exporter's /probe endpoint; the relabeling rewrites each target URL into the ?target= query parameter so the exporter probes it on Prometheus's behalf.

See our Prometheus & Grafana guide for hands-on examples of implementing these patterns.

Alert Fatigue Prevention

Alert Reviews

Go through your alerts every few months. If an alert has not fired in six months, ask why you still have it. If something fires multiple times per week and nobody fixes it, it is noise. If the same alert always needs the same manual fix, automate that fix. And if an alert wakes people up but nobody can actually do anything about it at 3am, get rid of it or make it something that can be handled automatically.
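One way to find review candidates is Prometheus's built-in ALERTS series, which tracks every firing alert. A sketch query that ranks alert names by how long they spent firing recently (the counts are raw samples, so they depend on your evaluation interval; use them for relative comparison only):

```promql
# Rough ranking of the noisiest alerts over the last 30 days
sort_desc(
  sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d]))
)
```

Anything at the top of this list that never led to action is a strong candidate for deletion or demotion.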

Signal vs Noise Metrics

Track your alert signal-to-noise ratio:

# Alert quality metric (assumes you emit your own counters
# for alerts fired and alerts that were resolved by real action)
sum(rate(alerts_resolved_total[1h])) / sum(rate(alerts_fired_total[1h]))

If this ratio stays below 0.5, you have too much noise. Teams with healthy alerting typically aim for 0.8 or higher.

SLO-Based Alerting

Instead of alerting on arbitrary thresholds, alert based on error budgets:

# Burn-rate alerting (from our metrics guide)
- alert: ErrorBudgetBurningFast
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    )
    > (1 - 0.999) * 14.4
  for: 5m
  labels:
    severity: page
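The rule above is the fast-burn window. A common companion, following the multi-window convention from the Google SRE workbook, is a slower rule with a longer window and a lower multiplier that opens a ticket instead of paging; a sketch:

```yaml
# Slow burn: a 6x rate over 6h consumes ~5% of a 30-day error budget
- alert: ErrorBudgetBurningSlow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    )
    > (1 - 0.999) * 6
  for: 30m
  labels:
    severity: ticket
```

The pairing matters: the fast rule wakes someone for acute incidents, while the slow rule catches gradual erosion that would otherwise exhaust the budget silently.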

Common Alert Patterns

Resource Exhaustion

| Alert | Threshold | Action |
|---|---|---|
| Disk space | > 85% | Clean old logs, rotate data |
| Memory | > 90% | OOM kill imminent, scale up |
| CPU | > 95% for 10m | Scale up, check for runaway processes |
| Connections | > 80% of max | Connection pool exhausted, investigate leaks |
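For disk space in particular, alerting on the trend rather than the current level gives more lead time. A sketch using node_exporter metrics and predict_linear:

```yaml
# Warn when a filesystem is predicted to fill within 4 hours,
# based on the last 6 hours of usage (node_exporter metrics)
- alert: DiskWillFillSoon
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
  for: 15m
  labels:
    severity: warning
```

A static "> 85%" threshold fires late on a fast leak and early on a large, slowly filling volume; the prediction-based rule handles both.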

Error Rate Spikes

| Alert | Threshold | Investigation |
|---|---|---|
| 5xx rate | > 1% for 5m | Check recent deployments, dependencies |
| 4xx rate | > 10% for 5m | Client issue, potential attack |
| Error log rate | > 100/min | Check logs for patterns |

Latency Degradation

| Alert | Threshold | Investigation |
|---|---|---|
| P95 latency | > 2s | Check database queries, external APIs |
| P99 latency | > 5s | Find slow queries, connection pool issues |
| Latency spike | > 3x baseline | Traffic anomaly or resource exhaustion |

On-Call Practices

Alert Acknowledgment

When an alert fires:

  1. Acknowledge immediately to stop duplicate notifications
  2. Join the incident channel
  3. Start documenting your investigation in real-time
  4. Escalate early if you are stuck

Escalation Paths

Define escalation clearly:

# Illustrative escalation policy (syntax varies by paging tool)
escalation:
  - name: on-call
    timeout: 15m
    if_no_response:
      - name: team-lead
      - name: engineering-manager

Post-Incident Review

After every P1:

  1. What triggered the alert?
  2. Why did it become a P1?
  3. What was the time to detect vs time to resolve?
  4. What can we automate to prevent this?
  5. Do we need better alerting?

Observability Hooks

Your alerting system needs its own monitoring.

Alert on Alert System Health

| Alert | Condition | Severity |
|---|---|---|
| Alertmanager down | Cannot receive alerts | Critical |
| Alert channel failing | Notifications not delivering | Critical |
| Alert evaluation error | Rules returning errors | High |

Security Considerations

Alerting systems handle sensitive information. Protect them:

  • Alert content should not include passwords, tokens, or secrets
  • Runbook URLs require authentication
  • Alert notification channels should be encrypted
  • On-call schedules are sensitive — protect access to rotation data

Alerting Trade-Offs

| Alerting Approach | When to Use | Key Risk |
|---|---|---|
| Threshold-based (CPU > 95%) | Simple resource monitoring, known baselines | Constantly needs retuning as traffic changes |
| SLO-based burn rate | Services with defined SLOs, error budget tracking | Requires SLO definition first; complex to set up |
| Canary analysis | Rolling deployments, infrastructure changes | Fails to catch novel failure modes |
| Blackbox/external probing | End-to-end availability, synthetic transactions | Cannot see internal state |
| Anomaly detection (ML) | Highly variable metrics, seasonal patterns | High false positive rate without tuning |

Alerting Production Failure Scenarios

Alert storm from a single downstream dependency failure

A critical database starts returning errors. Every service that calls it fires a “database error rate” alert simultaneously. Your on-call engineer receives 47 pages in 2 minutes. The real issue is one database — the 46 other alerts are redundant. Nobody can think clearly.

Mitigation: Use Alertmanager inhibition rules so that when the database alert fires, the downstream service alerts are suppressed. Group related alerts so a cascading failure produces a single page about the root cause, not one page per affected service.

Alert routing failure silently drops pages

Alertmanager loses connectivity to PagerDuty for 20 minutes during a region failover. During that window, a critical service fails. No page goes out. By the time the routing issue is discovered, the outage has been running for 25 minutes with no on-call response.

Mitigation: Monitor your alert delivery system itself. Alert when Alertmanager cannot reach its notification endpoints. Have a backup notification channel (SMS, Slack) that operates independently. Test failover scenarios quarterly.
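Alertmanager exports its own delivery metrics, so "monitor the monitor" can be a normal alert rule; a sketch (this rule must route to the independent backup channel, or it shares the same blind spot):

```yaml
# Fires when any notification integration is failing to deliver
- alert: AlertmanagerNotificationsFailing
  expr: |
    rate(alertmanager_notifications_failed_total[5m]) > 0
  for: 5m
  labels:
    severity: critical
```

The metric is labeled by integration, so the alert also tells you which channel (PagerDuty, Slack, email) is the one dropping pages.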

SLO burn-rate alert fires but no error budget remains

Your SLO is 99.9% monthly availability, which allows roughly 43 minutes of downtime per month. A bug causes 4 hours of downtime in a single day. The burn-rate alert fires correctly, but the month's error budget is now spent more than five times over. The SLO is breached regardless of how quickly you respond, and every further incident digs the hole deeper.

Mitigation: Set up multi-window burn-rate alerts. A 1-hour window at a 14.4x burn rate (which consumes 2% of a 30-day error budget per hour) catches acute issues; a 6-hour window at a 6x rate catches slower burns. And always have a "budget remaining" dashboard alongside burn-rate alerting.

Stale threshold creates a hidden failure mode

A service’s latency P99 threshold was set at 500ms during launch in 2023. Traffic has grown 10x since then. The baseline P99 is now 800ms, but the threshold never changed. A new code change causes P99 to spike to 2500ms. The alert fires, but the on-call engineer marks it “known issue” because it looks like the normal elevated baseline.

Mitigation: Automatically recalibrate thresholds based on current baselines. Alert on deviation from baseline, not absolute thresholds. Review thresholds quarterly and after any significant traffic shift.
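Deviation-from-baseline alerting can be sketched by comparing a metric against its own recent history. This assumes a recording rule named job:latency_p99:5m already exists (the name is illustrative):

```yaml
# Fire when P99 is 3x its own 7-day average,
# instead of comparing against a fixed millisecond threshold
- alert: LatencyAboveBaseline
  expr: |
    job:latency_p99:5m
      > 3 * avg_over_time(job:latency_p99:5m[7d])
  for: 10m
  labels:
    severity: high
```

Because the baseline moves with traffic, this rule would have flagged the 800ms-to-2500ms spike in the scenario above even though both values exceed the stale 500ms threshold.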

Alerting Anti-Patterns

Alerting on infrastructure instead of user outcomes. CPU at 95% means nothing to users if the service is handling the load fine. Alert on what users experience: latency, errors, throughput.

No alert review cadence. Alerts created at launch stay forever. Traffic patterns change, services get refactored, old alerts become irrelevant or actively misleading. Schedule quarterly alert reviews.

Every alert pages immediately. If your P3 informational alerts also wake someone up at 3am, you have not actually set severity levels — you have just labeled things incorrectly.

Runbooks that nobody can read at 3am. Walls of text, no checkboxes, no clear escalation path. A runbook should be a step-by-step checklist.

Alert on symptoms without context. “High error rate” tells you nothing. “Error rate in checkout service > 5% for 5 minutes, affecting payment processing” gives you a starting point.

Quick Recap

Key Takeaways:

  • Alert on symptoms that affect users, not internal causes
  • Every alert needs a runbook with investigation steps
  • Calibrate thresholds to your actual traffic patterns
  • Track alert quality and remove noisy alerts
  • SLO-based burn-rate alerting reduces false positives
  • Review and tune alerts quarterly

Checklist for new alerts:

  • Does this indicate a real user-facing problem?
  • Is there a clear action the on-call can take?
  • Is there a runbook linked?
  • Is the severity level appropriate?
  • Have we tested that it actually fires when expected?
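That last checklist item can be automated: promtool unit-tests alert rules against synthetic series. A sketch, assuming a hypothetical HighErrorRate rule in alerts.yml; note that exp_labels must match the rule's output label set exactly:

```yaml
# alerts_test.yml -- run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic traffic: 10% of requests return 500, enough
    # to trip a "> 1% 5xx rate" style alert
    input_series:
      - series: 'http_requests_total{status="500"}'
        values: "0+10x15"
      - series: 'http_requests_total{status="200"}'
        values: "0+90x15"
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
```

Running this in CI means a broken PromQL expression fails the build instead of silently never firing in production.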


Related Posts

Metrics, Monitoring, and Alerting: From SLIs to Alerts

Learn the RED and USE methods, SLIs/SLOs/SLAs, and how to build alerting systems that catch real problems. Includes examples for web services and databases.

#observability #monitoring #metrics

Alerting in Production: Paging, Runbooks, and On-Call

Build effective alerting systems that wake people up for real emergencies: alert fatigue prevention, runbook automation, and healthy on-call practices.

#alerting #monitoring #on-call

The Observability Engineering Mindset: Beyond Monitoring

Transition from traditional monitoring to full observability: structured logs, metrics, traces, and the cultural practices that make observability teams successful.

#observability #engineering #sre