Alerting in Production: Building Alerts That Matter

Build alerting systems that catch real problems without fatigue. Learn alert design principles, severity levels, runbooks, and on-call best practices.



You know the drill. 3am. Your phone screams. You stumble for it, heart pounding. The alert says something about CPU. You log in, check, find nothing wrong. Back to bed, if you can get back to sleep at all. That is alert fatigue, and it is worse than silence.

The flip side is just as bad: systems quietly falling apart while dashboards stay green and nobody notices until the tickets start pouring in.

This guide covers how to design alerting systems that catch real problems, minimize noise, and actually help during incidents.

When to Use Alerting

Alerting is appropriate when:

  • You have production services where downtime or degradation matters
  • You have on-call engineers who can respond to issues
  • Your service has known failure modes that can be detected automatically
  • You have established SLOs with defined error budgets

When to skip or simplify alerting:

  • Development or test environments with no user impact
  • Services you are actively migrating or planning to deprecate within 30 days
  • Systems where you have no on-call coverage and cannot respond anyway
  • Research prototypes with no uptime requirements

The Alerting Problem

The typical pipeline, from metric collection through rule evaluation to notification:

flowchart TB
    subgraph "Metric Collection"
        APP["Application\n(Metrics, Traces)"]
        HOST["Host Exporter\n(Node Exporter)"]
        DB["Database\nExporter"]
    end
    subgraph "Alerting Engine"
        PROM["Prometheus\n/ Alertmanager"]
        RULES["Alert Rules\n(SLO, Burn Rate)"]
        EVAL["Rule\nEvaluator"]
    end
    subgraph "Alert Routing"
        ROUTE["Alertmanager\nRouting Tree"]
        INHIBIT["Inhibit\nRules"]
        SILENCE["Silence\nPeriods"]
    end
    subgraph "Notification"
        PAGER["PagerDuty\n(P1, P2)"]
        SLACK["Slack\n(P3, Team)"]
        EMAIL["Email\n(P4, Backlog)"]
    end
    APP --> PROM
    HOST --> PROM
    DB --> PROM
    PROM --> EVAL
    EVAL --> RULES
    RULES --> ROUTE
    ROUTE --> INHIBIT
    ROUTE --> SILENCE
    INHIBIT --> PAGER
    ROUTE --> SLACK
    ROUTE --> EMAIL

Alerting is a signal-to-noise problem. Too many alerts and real issues get lost in the noise. Too few and you miss real failures.

Most bad alerting stems from a few common mistakes. Teams alert on causes instead of symptoms. Thresholds get set once at launch and never touched again, even as traffic patterns change. There is no clear line between “something needs attention” and “wake somebody up right now.” And sometimes alerts fire for issues that clear up on their own before anyone could act anyway.

Our Metrics, Monitoring, and Alerting guide covers the foundational concepts like SLIs, SLOs, and golden signals that underpin effective alerting.

Alert Severity Levels

Not all alerts are equal. Severity levels determine response time and escalation path.

Standard Severity Levels

| Severity | Response Time | Examples | Who Gets Paged |
|---|---|---|---|
| P1 Critical | Minutes, 24/7 | Complete outage, data loss, security breach | On-call immediately |
| P2 High | 30 minutes | Degraded performance affecting users | On-call |
| P3 Medium | Business hours | Non-critical failures, capacity warnings | Team Slack |
| P4 Low | Next sprint | Predictable issues, capacity planning | Backlog |

PagerDuty gets expensive fast if everything is P1. Reserve that level for actual outages.

When to Page at Each Level

Page at P1 when:

  • Users cannot complete core workflows
  • Data integrity is at risk
  • Security breach is in progress
  • Revenue is directly affected

Page at P2 when:

  • A significant percentage of users are affected
  • Error rates exceed acceptable thresholds
  • Latency makes the service unusable
  • A dependency failure is cascading

Do not page at P3 or P4. Add them to dashboards and team channels instead.
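The severity table above maps naturally onto an Alertmanager routing tree. A minimal sketch, assuming alerts carry a `severity` label; the receiver names, channels, and keys are placeholders:

```yaml
# Sketch: route by severity label (receiver names are illustrative)
route:
  receiver: team-slack              # default: P3-style alerts go to chat
  routes:
    - match:
        severity: P1
      receiver: pagerduty-critical  # pages on-call immediately
    - match:
        severity: P2
      receiver: pagerduty-high
    - match:
        severity: P4
      receiver: email-backlog       # no page, lands in the backlog queue

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "<integration-key>"
  - name: pagerduty-high
    pagerduty_configs:
      - service_key: "<integration-key>"
  - name: team-slack
    slack_configs:
      - channel: "#team-alerts"
  - name: email-backlog
    email_configs:
      - to: "team@example.com"
```

The default receiver is deliberately the quiet one: anything without an explicit severity match goes to chat, not to a pager.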

Alert Design Principles

Principle 1: Alert on Symptoms, Not Causes

Users do not care that your CPU is at 95%. They care that pages are loading slowly. Alert on the symptom:

# Bad: Alert on cause
- alert: HighCPU
  expr: cpu_usage > 95
  # Fires constantly, not actionable

# Good: Alert on symptom
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 5m
  # Only fires when users are actually affected

Principle 2: Provide Context in the Alert

Every alert should tell the on-call engineer what they need to start debugging:

- alert: DatabaseConnectionsExhausted
  expr: db_connections_active / db_connections_max > 0.95
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Database connections at 95% capacity"
    description: |
      Service: {{ $labels.service }}
      Current connections: {{ $value | humanize }}
      Max connections: {{ $labels.max_connections }}
      Instance: {{ $labels.instance }}
      Runbook: https://runbooks.example.com/db-connections

Principle 3: Link to a Runbook

Every P1 and P2 alert should link to a runbook with:

  • What this alert means
  • Common causes
  • How to investigate
  • How to resolve
  • When to escalate

Set the link in the alert's annotations:

annotations:
  runbook_url: "https://runbooks.example.com/high-error-rate"

Runbook Structure

A good runbook is a checklist, not an essay. On-call engineers are stressed at 2am and cannot parse paragraphs. Give them steps they can follow without thinking too hard.

Runbook Template

# High Error Rate Runbook

## Symptoms

- Alert: <alert name from the notification>
- Current error rate: <value from the notification>

## Quick Checks

1. [ ] Check database health: `kubectl exec -it db-pod -- psql`
2. [ ] Check external dependencies status page
3. [ ] Check recent deployments: `kubectl get pods --sort-by=.metadata.creationTimestamp`
4. [ ] Check for increased traffic patterns

## Common Causes

- Database connection pool exhausted (check db_metrics)
- Upstream API outage (check dependencies)
- Bad deployment (check for recent changes)
- Traffic spike (check request rate)

## Mitigation Steps

1. Scale up if capacity issue: `kubectl scale deployment api-gateway --replicas=10`
2. Rollback if deployment issue: `./scripts/rollback.sh`
3. Enable circuit breaker if dependency issue

## Escalation

If not resolved in 30 minutes, page @platform-lead

Alert Suppression and Routing

Suppression During Maintenance

Prevent alerts from firing during planned maintenance:

# Alertmanager inhibit rules
inhibit_rules:
  - source_match:
      severity: maintenance
    target_match:
      severity: critical
    equal: ["alertname", "service"]

Routing by Service

Route alerts to the right team:

# Alertmanager routing
route:
  receiver: default
  routes:
    - match:
        service: database
      receiver: database-oncall
      continue: true
    - match:
        service: api
      receiver: api-oncall
    - match:
        severity: critical
      receiver: pagerduty

Blackbox Monitoring

Blackbox monitoring tests your service from the outside, independent of application metrics.

# blackbox.yml -- blackbox exporter module definition
modules:
  http_2xx:
    prober: http
    http:
      method: GET
      fail_if_ssl: false

# prometheus.yml -- scrape targets through the blackbox exporter
scrape_configs:
  - job_name: "blackbox-health"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # address of the exporter

Prometheus scrapes the exporter's /probe endpoint; the relabeling rewrites each target URL into the ?target= query parameter so the exporter probes it on Prometheus's behalf.

See our Prometheus & Grafana guide for hands-on examples of implementing these patterns.

Alert Fatigue Prevention

Alert Reviews

Go through your alerts every few months. If an alert has not fired in six months, ask why you still have it. If something fires multiple times per week and nobody fixes it, it is noise. If the same alert always needs the same manual fix, automate that fix. And if an alert wakes people up but nobody can actually do anything about it at 3am, get rid of it or make it something that can be handled automatically.
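One way to find review candidates is Prometheus's built-in ALERTS series, which tracks every firing alert. A sketch query that ranks alert names by how long they spent firing recently (the counts are raw samples, so they depend on your evaluation interval; use them for relative comparison only):

```promql
# Rough ranking of the noisiest alerts over the last 30 days
sort_desc(
  sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d]))
)
```

Anything at the top of this list that never led to action is a strong candidate for deletion or demotion.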

Signal vs Noise Metrics

Track your alert signal-to-noise ratio:

# Alert quality metric (assumes you emit your own counters
# for alerts fired and alerts that were resolved by real action)
sum(rate(alerts_resolved_total[1h])) / sum(rate(alerts_fired_total[1h]))

If this ratio stays below 0.5, you have too much noise. Teams with healthy alerting typically aim for 0.8 or higher.

SLO-Based Alerting

Instead of alerting on arbitrary thresholds, alert based on error budgets:

# Burn-rate alerting (from our metrics guide)
- alert: ErrorBudgetBurningFast
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    )
    > (1 - 0.999) * 14.4
  for: 5m
  labels:
    severity: page
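The rule above is the fast-burn window. A common companion, following the multi-window convention from the Google SRE workbook, is a slower rule with a longer window and a lower multiplier that opens a ticket instead of paging; a sketch:

```yaml
# Slow burn: a 6x rate over 6h consumes ~5% of a 30-day error budget
- alert: ErrorBudgetBurningSlow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    )
    > (1 - 0.999) * 6
  for: 30m
  labels:
    severity: ticket
```

The pairing matters: the fast rule wakes someone for acute incidents, while the slow rule catches gradual erosion that would otherwise exhaust the budget silently.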

Common Alert Patterns

Resource Exhaustion

| Alert | Threshold | Action |
|---|---|---|
| Disk space | > 85% | Clean old logs, rotate data |
| Memory | > 90% | OOM kill imminent, scale up |
| CPU | > 95% for 10m | Scale up, check for runaway processes |
| Connections | > 80% of max | Connection pool exhausted, investigate leaks |
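For disk space in particular, alerting on the trend rather than the current level gives more lead time. A sketch using node_exporter metrics and predict_linear:

```yaml
# Warn when a filesystem is predicted to fill within 4 hours,
# based on the last 6 hours of usage (node_exporter metrics)
- alert: DiskWillFillSoon
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
  for: 15m
  labels:
    severity: warning
```

A static "> 85%" threshold fires late on a fast leak and early on a large, slowly filling volume; the prediction-based rule handles both.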

Error Rate Spikes

| Alert | Threshold | Investigation |
|---|---|---|
| 5xx rate | > 1% for 5m | Check recent deployments, dependencies |
| 4xx rate | > 10% for 5m | Client issue, potential attack |
| Error log rate | > 100/min | Check logs for patterns |

Latency Degradation

| Alert | Threshold | Investigation |
|---|---|---|
| P95 latency | > 2s | Check database queries, external APIs |
| P99 latency | > 5s | Find slow queries, connection pool issues |
| Latency spike | > 3x baseline | Traffic anomaly or resource exhaustion |

On-Call Practices

Alert Acknowledgment

When an alert fires:

  1. Acknowledge immediately to stop duplicate notifications
  2. Join the incident channel
  3. Start documenting your investigation in real-time
  4. Escalate early if you are stuck

Escalation Paths

Define escalation clearly:

# Illustrative escalation policy (syntax varies by paging tool)
escalation:
  - name: on-call
    timeout: 15m
    if_no_response:
      - name: team-lead
      - name: engineering-manager

Post-Incident Review

After every P1:

  1. What triggered the alert?
  2. Why did it become a P1?
  3. What was the time to detect vs time to resolve?
  4. What can we automate to prevent this?
  5. Do we need better alerting?

Observability Hooks

Your alerting system needs its own monitoring.

Alert on Alert System Health

| Alert | Condition | Severity |
|---|---|---|
| Alertmanager down | Cannot receive alerts | Critical |
| Alert channel failing | Notifications not delivering | Critical |
| Alert evaluation error | Rules returning errors | High |

Security Considerations

Alerting systems handle sensitive information. Protect them:

  • Alert content should not include passwords, tokens, or secrets
  • Runbook URLs require authentication
  • Alert notification channels should be encrypted
  • On-call schedules are sensitive — protect access to rotation data

Alerting Trade-Offs

| Alerting Approach | When to Use | Key Risk |
|---|---|---|
| Threshold-based (CPU > 95%) | Simple resource monitoring, known baselines | Constantly needs retuning as traffic changes |
| SLO-based burn rate | Services with defined SLOs, error budget tracking | Requires SLO definition first; complex to set up |
| Canary analysis | Rolling deployments, infrastructure changes | Fails to catch novel failure modes |
| Blackbox/external probing | End-to-end availability, synthetic transactions | Cannot see internal state |
| Anomaly detection (ML) | Highly variable metrics, seasonal patterns | High false positive rate without tuning |

Alerting Production Failure Scenarios

Alert storm from a single downstream dependency failure

A critical database starts returning errors. Every service that calls it fires a “database error rate” alert simultaneously. Your on-call engineer receives 47 pages in 2 minutes. The real issue is one database — the 46 other alerts are redundant. Nobody can think clearly.

Mitigation: Use Alertmanager inhibition rules so that when the database alert fires, the downstream service alerts are suppressed. Group related alerts so a cascading failure produces a single page about the root cause, not one page per affected service.

Alert routing failure silently drops pages

Alertmanager loses connectivity to PagerDuty for 20 minutes during a region failover. During that window, a critical service fails. No page goes out. By the time the routing issue is discovered, the outage has been running for 25 minutes with no on-call response.

Mitigation: Monitor your alert delivery system itself. Alert when Alertmanager cannot reach its notification endpoints. Have a backup notification channel (SMS, Slack) that operates independently. Test failover scenarios quarterly.
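Alertmanager exports its own delivery metrics, so "monitor the monitor" can be a normal alert rule; a sketch (this rule must route to the independent backup channel, or it shares the same blind spot):

```yaml
# Fires when any notification integration is failing to deliver
- alert: AlertmanagerNotificationsFailing
  expr: |
    rate(alertmanager_notifications_failed_total[5m]) > 0
  for: 5m
  labels:
    severity: critical
```

The metric is labeled by integration, so the alert also tells you which channel (PagerDuty, Slack, email) is the one dropping pages.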

SLO burn-rate alert fires but no error budget remains

Your SLO is 99.9% monthly availability, which allows roughly 43 minutes of downtime per month. A bug causes 4 hours of downtime in a single day. The burn-rate alert fires correctly, but the month's error budget is now spent more than five times over. The SLO is breached regardless of how quickly you respond, and every further incident digs the hole deeper.

Mitigation: Set up multi-window burn-rate alerts. A 1-hour window at a 14.4x burn rate (which consumes 2% of a 30-day error budget per hour) catches acute issues; a 6-hour window at a 6x rate catches slower burns. And always have a "budget remaining" dashboard alongside burn-rate alerting.

Stale threshold creates a hidden failure mode

A service’s latency P99 threshold was set at 500ms during launch in 2023. Traffic has grown 10x since then. The baseline P99 is now 800ms, but the threshold never changed. A new code change causes P99 to spike to 2500ms. The alert fires, but the on-call engineer marks it “known issue” because it looks like the normal elevated baseline.

Mitigation: Automatically recalibrate thresholds based on current baselines. Alert on deviation from baseline, not absolute thresholds. Review thresholds quarterly and after any significant traffic shift.
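Deviation-from-baseline alerting can be sketched by comparing a metric against its own recent history. This assumes a recording rule named job:latency_p99:5m already exists (the name is illustrative):

```yaml
# Fire when P99 is 3x its own 7-day average,
# instead of comparing against a fixed millisecond threshold
- alert: LatencyAboveBaseline
  expr: |
    job:latency_p99:5m
      > 3 * avg_over_time(job:latency_p99:5m[7d])
  for: 10m
  labels:
    severity: high
```

Because the baseline moves with traffic, this rule would have flagged the 800ms-to-2500ms spike in the scenario above even though both values exceed the stale 500ms threshold.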

Alerting Anti-Patterns

Alerting on infrastructure instead of user outcomes. CPU at 95% means nothing to users if the service is handling the load fine. Alert on what users experience: latency, errors, throughput.

No alert review cadence. Alerts created at launch stay forever. Traffic patterns change, services get refactored, old alerts become irrelevant or actively misleading. Schedule quarterly alert reviews.

Every alert pages immediately. If your P3 informational alerts also wake someone up at 3am, you have not actually set severity levels — you have just labeled things incorrectly.

Runbooks that nobody can read at 3am. Walls of text, no checkboxes, no clear escalation path. A runbook should be a step-by-step checklist.

Alert on symptoms without context. “High error rate” tells you nothing. “Error rate in checkout service > 5% for 5 minutes, affecting payment processing” gives you a starting point.

Quick Recap

Key Takeaways:

  • Alert on symptoms that affect users, not internal causes
  • Every alert needs a runbook with investigation steps
  • Calibrate thresholds to your actual traffic patterns
  • Track alert quality and remove noisy alerts
  • SLO-based burn-rate alerting reduces false positives
  • Review and tune alerts quarterly

Checklist for new alerts:

  • Does this indicate a real user-facing problem?
  • Is there a clear action the on-call can take?
  • Is there a runbook linked?
  • Is the severity level appropriate?
  • Have we tested that it actually fires when expected?
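That last checklist item can be automated: promtool unit-tests alert rules against synthetic series. A sketch, assuming a hypothetical HighErrorRate rule in alerts.yml; note that exp_labels must match the rule's output label set exactly:

```yaml
# alerts_test.yml -- run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic traffic: 10% of requests return 500, enough
    # to trip a "> 1% 5xx rate" style alert
    input_series:
      - series: 'http_requests_total{status="500"}'
        values: "0+10x15"
      - series: 'http_requests_total{status="200"}'
        values: "0+90x15"
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
```

Running this in CI means a broken PromQL expression fails the build instead of silently never firing in production.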


Related Posts

Metrics, Monitoring, and Alerting: From SLIs to Alerts

Learn the RED and USE methods, SLIs/SLOs/SLAs, and how to build alerting systems that catch real problems. Includes examples for web services and databases.

#observability #monitoring #metrics

Alerting in Production: Paging, Runbooks, and On-Call

Build effective alerting systems that wake people up for real emergencies: alert fatigue prevention, runbook automation, and healthy on-call practices.

#alerting #monitoring #on-call

The Observability Engineering Mindset: Beyond Monitoring

Transition from traditional monitoring to full observability: structured logs, metrics, traces, and the cultural practices that make observability teams successful.

#observability #engineering #sre