Alerting in Production: Building Alerts That Matter
Build alerting systems that catch real problems without fatigue. Learn alert design principles, severity levels, runbooks, and on-call best practices.
You know the drill. 3am. Your phone screams. You fumble for it, heart pounding. The alert says something about CPU. You log in, check, find nothing wrong. Back to sleep, if you can still sleep. That is alert fatigue, and it is worse than silence.
The flip side is just as bad: systems quietly falling apart while dashboards stay green and nobody notices until the tickets start pouring in.
This guide covers how to design alerting systems that catch real problems, minimize noise, and actually help during incidents.
When to Use Alerting
Alerting is appropriate when:
- You have production services where downtime or degradation matters
- You have on-call engineers who can respond to issues
- Your service has known failure modes that can be detected automatically
- You have established SLOs with defined error budgets
When to skip or simplify alerting:
- Development or test environments with no user impact
- Services you are actively migrating or planning to deprecate within 30 days
- Systems where you have no on-call coverage and cannot respond anyway
- Research prototypes with no uptime requirements
The Alerting Problem
```mermaid
flowchart TB
    subgraph "Metric Collection"
        APP["Application\n(Metrics, Traces)"]
        HOST["Host Exporter\n(Node Exporter)"]
        DB["Database\nExporter"]
    end
    subgraph "Alerting Engine"
        PROM["Prometheus\n/ Alertmanager"]
        RULES["Alert Rules\n(SLO, Burn Rate)"]
        EVAL["Rule\nEvaluator"]
    end
    subgraph "Alert Routing"
        ROUTE["Alertmanager\nRouting Tree"]
        INHIBIT["Inhibit\nRules"]
        SILENCE["Silence\nPeriods"]
    end
    subgraph "Notification"
        PAGER["PagerDuty\n(P1, P2)"]
        SLACK["Slack\n(P3, Team)"]
        EMAIL["Email\n(P4, Backlog)"]
    end
    APP --> PROM
    HOST --> PROM
    DB --> PROM
    PROM --> EVAL
    RULES --> EVAL
    EVAL --> ROUTE
    ROUTE --> INHIBIT
    ROUTE --> SILENCE
    INHIBIT --> PAGER
    ROUTE --> SLACK
    ROUTE --> EMAIL
```
Alerting is a signal-to-noise problem. Too many alerts and real issues get lost in the noise. Too few and you miss real failures.
Most bad alerting stems from a few common mistakes. Teams alert on causes instead of symptoms. Thresholds get set once at launch and never touched again, even as traffic patterns change. There is no clear line between “something needs attention” and “wake somebody up right now.” And sometimes alerts fire for issues that clear up on their own before anyone could act anyway.
Our Metrics, Monitoring, and Alerting guide covers the foundational concepts like SLIs, SLOs, and golden signals that underpin effective alerting.
Alert Severity Levels
Not all alerts are equal. Severity levels determine response time and escalation path.
Standard Severity Levels
| Severity | Response Time | Examples | Who Gets Paged |
|---|---|---|---|
| P1 Critical | Minutes, 24/7 | Complete outage, data loss, security breach | On-call immediately |
| P2 High | 30 minutes | Degraded performance affecting users | On-call |
| P3 Medium | Business hours | Non-critical failures, capacity warnings | Team Slack |
| P4 Low | Next sprint | Predictable issues, capacity planning | Backlog |
PagerDuty gets expensive fast if everything is P1. Reserve that severity for actual outages.
When to Page at Each Level
Page at P1 when:
- Users cannot complete core workflows
- Data integrity is at risk
- Security breach is in progress
- Revenue is directly affected
Page at P2 when:
- A significant percentage of users are affected
- Error rates exceed acceptable thresholds
- Latency makes the service unusable
- A dependency failure is cascading
Do not page at P3 or P4. Add them to dashboards and team channels instead.
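In Alertmanager terms, that severity split might look like the following sketch. The receiver names and the `severity` label values are illustrative assumptions, not a standard:

```yaml
# Sketch: P1/P2 page the on-call, P3 goes to Slack, P4 lands in email.
route:
  receiver: email-backlog          # default catch-all (P4)
  routes:
    - match:
        severity: P1
      receiver: pagerduty-oncall
    - match:
        severity: P2
      receiver: pagerduty-oncall
    - match:
        severity: P3
      receiver: team-slack
```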
Alert Design Principles
Principle 1: Alert on Symptoms, Not Causes
Users do not care that your CPU is at 95%. They care that pages are loading slowly. Alert on the symptom:
```yaml
# Bad: alert on the cause
- alert: HighCPU
  expr: cpu_usage > 95
  # Fires constantly, not actionable

# Good: alert on the symptom
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 5m
  # Only fires when users are actually affected
```
Principle 2: Provide Context in the Alert
Every alert should tell the on-call engineer what they need to start debugging:
```yaml
- alert: DatabaseConnectionsExhausted
  expr: db_connections_active / db_connections_max > 0.95
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Database connections at 95% capacity"
    description: |
      Service: {{ $labels.service }}
      Current connections: {{ $value | humanize }}
      Max connections: {{ $labels.max_connections }}
      Instance: {{ $labels.instance }}
      Runbook: https://runbooks.example.com/db-connections
```
Principle 3: Include Runbook Links
Every P1 and P2 alert should link to a runbook with:
- What this alert means
- Common causes
- How to investigate
- How to resolve
- When to escalate
```yaml
annotations:
  runbook_url: "https://runbooks.example.com/high-error-rate"
```
Runbook Structure
A good runbook is a checklist, not an essay. On-call engineers are stressed at 2am and cannot parse paragraphs. Give them steps they can follow without thinking too hard.
Runbook Template
```markdown
# High Error Rate Runbook

## Symptoms
- Alert: {{ $alert.name }}
- Current error rate: {{ $value }}

## Quick Checks
1. [ ] Check database health: `kubectl exec -it db-pod -- psql`
2. [ ] Check external dependencies status page
3. [ ] Check recent deployments: `kubectl get pods --sort-by=.metadata.creationTimestamp`
4. [ ] Check for increased traffic patterns

## Common Causes
- Database connection pool exhausted (check db_metrics)
- Upstream API outage (check dependencies)
- Bad deployment (check for recent changes)
- Traffic spike (check request rate)

## Mitigation Steps
1. Scale up if capacity issue: `kubectl scale deployment api-gateway --replicas=10`
2. Roll back if deployment issue: `./scripts/rollback.sh`
3. Enable circuit breaker if dependency issue

## Escalation
If not resolved in 30 minutes, page @platform-lead
```
Alert Suppression and Routing
Suppression During Maintenance
Prevent alerts from firing during planned maintenance:
```yaml
# Alertmanager inhibit rules
inhibit_rules:
  - source_match:
      severity: maintenance
    target_match:
      severity: critical
    equal: ["alertname", "service"]
```
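For recurring maintenance windows, newer Alertmanager releases also support time-based muting on routes. A sketch, assuming a weekly Saturday window (the interval name and schedule are illustrative):

```yaml
# Define a named time window, then mute a route during it.
time_intervals:
  - name: maintenance-window
    time_intervals:
      - weekdays: ["saturday"]
        times:
          - start_time: "02:00"
            end_time: "04:00"

route:
  receiver: default
  routes:
    - match:
        service: database
      receiver: database-oncall
      mute_time_intervals:
        - maintenance-window
```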
Routing by Service
Route alerts to the right team:
```yaml
# Alertmanager routing
route:
  receiver: default
  routes:
    - match:
        service: database
      receiver: database-oncall
      continue: true
    - match:
        service: api
      receiver: api-oncall
    - match:
        severity: critical
      receiver: pagerduty
```
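Grouping settings on the routing tree control how many notifications a burst of related alerts produces. A sketch with commonly used starting values (tune them to your paging tolerance):

```yaml
route:
  receiver: default
  group_by: ["alertname", "service"]  # one notification per alert + service pair
  group_wait: 30s        # wait briefly so related alerts batch into one page
  group_interval: 5m     # how often to notify about new alerts in an existing group
  repeat_interval: 4h    # re-notify if the alert is still firing
```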
Blackbox Monitoring
Blackbox monitoring tests your service from the outside, independent of application metrics.
```yaml
# blackbox.yml (Blackbox Exporter module definition)
modules:
  http_2xx:
    prober: http
    http:
      method: GET
      fail_if_ssl: false

# prometheus.yml (scrape targets through the Blackbox Exporter)
scrape_configs:
  - job_name: "blackbox-health"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # address of the Blackbox Exporter
```
See our Prometheus & Grafana guide for hands-on examples of implementing these patterns.
Alert Fatigue Prevention
Alert Reviews
Go through your alerts every few months. If an alert has not fired in six months, ask why you still have it. If something fires multiple times per week and nobody fixes it, it is noise. If the same alert always needs the same manual fix, automate that fix. And if an alert wakes people up but nobody can actually do anything about it at 3am, get rid of it or make it something that can be handled automatically.
Signal vs Noise Metrics
Track your alert signal-to-noise ratio:
```promql
# Alert quality metric: resolved vs fired over the last hour
sum(rate(alerts_resolved_total[1h])) / sum(rate(alerts_fired_total[1h]))
```
If this ratio is below 0.5, you have too much noise. Some teams aim for 0.8 or higher.
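To track this ratio over time and flag sustained noise, it can be materialized as a Prometheus recording rule. The metric names mirror the query above and are assumptions about your instrumentation:

```yaml
groups:
  - name: alert-quality
    rules:
      # Precompute the ratio so dashboards and alerts can reuse it cheaply.
      - record: alerts:signal_noise_ratio
        expr: |
          sum(rate(alerts_resolved_total[1h]))
            /
          sum(rate(alerts_fired_total[1h]))
      # Flag sustained noise, but only as a team-channel (P3) issue.
      - alert: AlertNoiseTooHigh
        expr: alerts:signal_noise_ratio < 0.5
        for: 24h
        labels:
          severity: P3
```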
SLO-Based Alerting
Instead of alerting on arbitrary thresholds, alert based on error budgets:
```yaml
# Burn-rate alerting (from our metrics guide)
- alert: ErrorBudgetBurningFast
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    )
    > (1 - 0.999) * 14.4
  for: 5m
  labels:
    severity: page
```
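The fast-burn rule pairs naturally with a slower companion window that catches gradual budget erosion. A sketch of the 6x / 6-hour variant for the same 99.9% SLO (the `severity: ticket` label is an assumption):

```yaml
# Slow burn: a lower burn rate over a longer window. Ticket, do not page.
- alert: ErrorBudgetBurningSlow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    )
    > (1 - 0.999) * 6
  for: 30m
  labels:
    severity: ticket
```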
Common Alert Patterns
Resource Exhaustion
| Alert | Threshold | Action |
|---|---|---|
| Disk space | > 85% | Clean old logs, rotate data |
| Memory | > 90% | OOM kill imminent, scale up |
| CPU | > 95% for 10m | Scale up, check for runaway processes |
| Connections | > 80% max | Connection pool exhausted, investigate leaks |
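Static thresholds like "disk > 85%" can be supplemented with trend-based rules that fire before the resource is actually exhausted. A sketch using PromQL's `predict_linear`, assuming node_exporter metric names:

```yaml
# Page before the disk fills: extrapolate the 6h trend 4 hours ahead.
- alert: DiskWillFillIn4Hours
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
  for: 15m
  labels:
    severity: P2
  annotations:
    summary: "Disk projected to fill within 4 hours on {{ $labels.instance }}"
```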
Error Rate Spikes
| Alert | Threshold | Investigation |
|---|---|---|
| 5xx rate | > 1% for 5m | Check recent deployments, dependencies |
| 4xx rate | > 10% for 5m | Client issue, potential attack |
| Error log rate | > 100/min | Check logs for patterns |
Latency Degradation
| Alert | Threshold | Investigation |
|---|---|---|
| P95 latency | > 2s | Check database queries, external APIs |
| P99 latency | > 5s | Find slow queries, connection pool issues |
| Latency spike | > 3x baseline | Traffic anomaly or resource exhaustion |
On-Call Practices
Alert Acknowledgment
When an alert fires:
- Acknowledge immediately to stop duplicate notifications
- Join the incident channel
- Start documenting your investigation in real-time
- Escalate early if you are stuck
Escalation Paths
Define escalation clearly:
```yaml
# Illustrative escalation policy (the schema varies by paging tool)
escalation:
  - name: on-call
    timeout: 15m
    if_no_response:
      - name: team-lead
      - name: engineering-manager
```
Post-Incident Review
After every P1:
- What triggered the alert?
- Why did it become a P1?
- What was the time to detect vs time to resolve?
- What can we automate to prevent this?
- Do we need better alerting?
Observability Hooks
Your alerting system needs its own monitoring.
Alert on Alert System Health
| Alert | Condition | Severity |
|---|---|---|
| Alertmanager down | Cannot receive alerts | Critical |
| Alert channel failing | Notifications not delivering | Critical |
| Alert evaluation error | Rules returning errors | High |
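A common way to verify the whole pipeline end to end is a "dead man's switch": an alert that always fires, routed to an external service that pages you if the heartbeat ever stops arriving. This is the same pattern as the Watchdog alert shipped with kube-prometheus:

```yaml
- alert: Watchdog
  expr: vector(1)   # always true, so this alert fires continuously
  labels:
    severity: none
  annotations:
    summary: >
      Always-firing heartbeat. Route this to an external dead-man's-switch
      service; if that service stops receiving it, the alerting pipeline
      itself is broken.
```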
Security Considerations
Alerting systems handle sensitive information. Protect them:
- Alert content should not include passwords, tokens, or secrets
- Runbook URLs require authentication
- Alert notification channels should be encrypted
- On-call schedules are sensitive — protect access to rotation data
Alerting Trade-Offs
| Alerting Approach | When to Use | Key Risk |
|---|---|---|
| Threshold-based (CPU > 95%) | Simple resource monitoring, known baselines | Constantly needs retuning as traffic changes |
| SLO-based burn-rate | Services with defined SLOs, error budget tracking | Requires SLO definition first; complex to set up |
| Canary analysis | Rolling deployments, infrastructure changes | Fails to catch novel failure modes |
| Blackbox/external probing | End-to-end availability, synthetic transactions | Cannot see internal state changes |
| Anomaly detection (ML) | Highly variable metrics, seasonal patterns | High false positive rate without tuning |
Alerting Production Failure Scenarios
Alert storm from a single downstream dependency failure
A critical database starts returning errors. Every service that calls it fires a “database error rate” alert simultaneously. Your on-call engineer receives 47 pages in 2 minutes. The real issue is one database — the 46 other alerts are redundant. Nobody can think clearly.
Mitigation: Use Alertmanager inhibition rules so that when a database alert fires, downstream service alerts are suppressed. Group alerts by root cause, not by symptom. For cascading failures, send a single aggregated page rather than one page per affected service.
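For the database scenario above, an inhibition rule might look like the following sketch. The alert names, the regex, and the `cluster` label are assumptions about your labeling scheme:

```yaml
inhibit_rules:
  - source_match:
      alertname: DatabaseDown
    target_match_re:
      alertname: ".*ErrorRate.*"   # suppress downstream error-rate alerts
    equal: ["cluster"]             # only within the same cluster
```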
Alert routing failure silently drops pages
Alertmanager loses connectivity to PagerDuty for 20 minutes during a region failover. During that window, a critical service fails. No page goes out. By the time the routing issue is discovered, the outage has been running for 25 minutes with no on-call response.
Mitigation: Monitor your alert delivery system itself. Alert when Alertmanager cannot reach its notification endpoints. Have a backup notification channel (SMS, Slack) that operates independently. Test failover scenarios quarterly.
SLO burn-rate alert fires but no error budget remains
Your SLO is 99.9% monthly availability, which allows roughly 43 minutes of downtime per month. A bug causes 40 minutes of downtime in a single day. The burn-rate alert fires correctly, but the error budget for the rest of the month is now effectively gone. Any further downtime will breach the SLO regardless of how quickly you respond.
Mitigation: Set up multi-window burn-rate alerts. A 1-hour window at a 14.4x burn rate (fast enough to consume 2% of a 30-day budget in one hour) catches acute issues; a 6-hour window at a 6x rate catches slower burns. And always have a “budget remaining” dashboard alongside burn-rate alerting.
Stale threshold creates a hidden failure mode
A service’s latency P99 threshold was set at 500ms during launch in 2023. Traffic has grown 10x since then. The baseline P99 is now 800ms, but the threshold never changed. A new code change causes P99 to spike to 2500ms. The alert fires, but the on-call engineer marks it “known issue” because it looks like the normal elevated baseline.
Mitigation: Automatically recalibrate thresholds based on current baselines. Alert on deviation from baseline, not absolute thresholds. Review thresholds quarterly and after any significant traffic shift.
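One way to sketch a baseline-relative rule is to compare current latency against the same metric a week earlier using an `offset` modifier. The 3x factor and the metric name are illustrative assumptions:

```yaml
# Fire on deviation from last week's baseline, not on an absolute number.
- alert: LatencyAboveBaseline
  expr: |
    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[10m])))
      >
    3 * histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[10m] offset 1w)))
  for: 10m
  labels:
    severity: P2
```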
Alerting Anti-Patterns
Alerting on infrastructure instead of user outcomes. CPU at 95% means nothing to users if the service is handling the load fine. Alert on what users experience: latency, errors, throughput.
No alert review cadence. Alerts created at launch stay forever. Traffic patterns change, services get refactored, old alerts become irrelevant or actively misleading. Schedule quarterly alert reviews.
Every alert pages immediately. If your P3 informational alerts also wake someone up at 3am, you have not actually set severity levels — you have just labeled things incorrectly.
Runbooks that nobody can read at 3am. Walls of text, no checkboxes, no clear escalation path. A runbook should be a step-by-step checklist.
Alert on symptoms without context. “High error rate” tells you nothing. “Error rate in checkout service > 5% for 5 minutes, affecting payment processing” gives you a starting point.
Quick Recap
Key Takeaways:
- Alert on symptoms that affect users, not internal causes
- Every alert needs a runbook with investigation steps
- Calibrate thresholds to your actual traffic patterns
- Track alert quality and remove noisy alerts
- SLO-based burn-rate alerting reduces false positives
- Review and tune alerts quarterly
Checklist for new alerts:
- Does this indicate a real user-facing problem?
- Is there a clear action the on-call can take?
- Is there a runbook linked?
- Is the severity level appropriate?
- Have we tested that it actually fires when expected?