Metrics, Monitoring, and Alerting: From SLIs to Alerts
Learn the RED and USE methods, SLIs/SLOs/SLAs, and how to build alerting systems that catch real problems. Includes examples for web services and databases.
Monitoring is how you know your system is healthy. Without it, you are blind to failures, degradation, and capacity issues until users report them. This guide covers the theory and practice of building monitoring systems that actually help.
We assume you have basic familiarity with collecting logs and metrics. If you need a refresher, start with our Logging Best Practices guide, which covers structured logging.
The Three Methods
There are three established methodologies for defining what to measure: RED, USE, and Google’s Four Golden Signals. Each serves different purposes.
RED Method
RED focuses on request-driven services, particularly APIs:
- Rate: Requests per second
- Errors: Error rate (usually percentage of requests resulting in errors)
- Duration: Response time distribution (p50, p95, p99)
# Request rate
sum(rate(http_requests_total[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Duration percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
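Behind histogram_quantile: Prometheus finds the cumulative bucket that contains the target rank, then interpolates linearly inside it. A simplified sketch of that calculation (the real function also handles the +Inf bucket and aggregated le labels; the bucket data here is illustrative):

```python
# Estimate a quantile from cumulative histogram buckets: locate the
# bucket containing the target rank, then interpolate linearly within it.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound_seconds, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside this bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Example: 1000 requests bucketed at 0.1s, 0.25s, 0.5s, 1s
buckets = [(0.1, 800), (0.25, 950), (0.5, 990), (1.0, 1000)]
print(f"p95 = {histogram_quantile(0.95, buckets):.3f}s")
```

Note how the answer is an estimate: all requests inside a bucket are assumed to be evenly spread, which is why bucket boundaries should sit near your SLO thresholds.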
USE Method
USE focuses on resource utilization:
- Utilization: How busy is the resource
- Saturation: How much work is queued beyond what the resource can handle
- Errors: Internal errors preventing correct operation
# CPU utilization
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization (a saturation proxy; for true saturation watch swap or pressure metrics)
node_memory_Active_bytes / node_memory_MemTotal_bytes * 100
# Disk saturation (fraction of time the device is busy with I/O)
rate(node_disk_io_time_seconds_total[5m])
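The rate() calls in these queries compute the per-second increase of a counter over a window, treating a drop in value as a counter reset (process restart). A simplified sketch of that logic (real rate() also extrapolates to the window boundaries; the samples are illustrative):

```python
# Per-second increase of a counter over a window, handling resets:
# a value lower than its predecessor means the counter restarted from zero.
def prom_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:          # counter reset: count from zero
            increase += value
        else:
            increase += value - prev
        prev = value
    window = samples[-1][0] - samples[0][0]
    return increase / window

# 5 samples over 60s; the counter resets between t=30 and t=45
samples = [(0, 100), (15, 160), (30, 220), (45, 40), (60, 100)]
print(f"{prom_rate(samples):.1f} requests/sec")
```

This is why you rate() counters instead of graphing them raw: the absolute value is meaningless after restarts, but the increase per second is stable.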
Four Golden Signals
Google’s SRE book defines four signals every service should monitor:
- Latency: How long operations take
- Traffic: How much demand exists
- Errors: How often requests fail
- Saturation: How full the system is
For most web services, these four signals cover what matters most.
Service Level Indicators
SLIs are the actual metrics you measure. They define what “good” looks like for your service.
Common SLIs for Web Services
| SLI | Definition | Good | Acceptable |
|---|---|---|---|
| Availability | Percentage of requests that get a successful response | 99.9% | 99.5% |
| Latency | p95 response time under normal conditions | < 200ms | < 500ms |
| Throughput | Requests handled per second | Varies | Above baseline |
| Error Rate | Percentage of errors (5xx, timeouts) | < 0.1% | < 1% |
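The SLIs in the table reduce to the same shape: good events divided by valid events over a window. A minimal sketch of the availability SLI (the traffic numbers are illustrative):

```python
# Compute an availability SLI from raw request counts.
def availability_sli(total_requests, error_requests):
    """Fraction of requests that succeeded (non-5xx)."""
    if total_requests == 0:
        return 1.0  # no traffic: conventionally treated as meeting the SLI
    return (total_requests - error_requests) / total_requests

total, errors = 1_000_000, 800
sli = availability_sli(total, errors)
print(f"Availability: {sli:.4%}")        # 99.9200%
print("Meets 99.9% target:", sli >= 0.999)
```

Defining the SLI as a ratio of counters (rather than, say, an average of gauges) is what makes it composable: you can sum the counters across instances and regions and the ratio stays meaningful.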
Measuring SLIs
Define SLIs precisely so they can be measured consistently:
# sli-config.yaml
service: api-gateway
environment: production
slis:
- name: request_success_rate
description: "Percentage of requests returning non-5xx responses"
query: |
sum(rate(http_requests_total{service="api-gateway",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api-gateway"}[5m]))
- name: p95_latency
description: "95th percentile request duration"
query: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
)
- name: error_rate
description: "Percentage of requests returning 5xx errors"
query: |
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api-gateway"}[5m]))
Service Level Objectives
SLOs are the targets you want to achieve. They transform SLIs into goals your team commits to.
Defining SLOs
SLOs combine an SLI, a threshold, and a time window:
# SLO configuration
objectives:
- display_name: "API Availability"
sli: request_success_rate
target: 99.9
window: 30d
description: "API should be available 99.9% of the time"
- display_name: "API Latency"
sli: p95_latency
target: 99.0
threshold_ms: 200
window: 30d
description: "p95 latency should be below 200ms for 99% of measurement windows"
- display_name: "API Error Rate"
sli: error_rate
target: 99.5
threshold_percent: 0.1
window: 30d
description: "Error rate should stay below 0.1%"
Error Budgets
Error budgets convert SLO compliance into actionable information. If your availability SLO is 99.9% over 30 days, you have 43.2 minutes of allowed downtime in that period:
# Calculate error budget
def error_budget(slo_target, window_days=30):
window_seconds = window_days * 24 * 60 * 60
allowed_downtime = window_seconds * (1 - slo_target)
return allowed_downtime
# 99.9% over 30 days = 43.2 minutes
budget = error_budget(0.999)
print(f"Monthly error budget: {budget / 60:.1f} minutes")
# Monthly error budget: 43.2 minutes
Error budgets tell you when you can be aggressive with releases (large budget remaining) and when you need to be careful (budget nearly exhausted).
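That deploy-freely versus be-careful decision can be encoded directly. A sketch of a release gate driven by remaining budget (the thresholds and policy names are illustrative, not a standard):

```python
def error_budget_remaining(slo_target, observed_availability):
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    budget = 1 - slo_target                    # allowed failure fraction
    spent = 1 - observed_availability          # actual failure fraction
    return 1 - spent / budget

def release_policy(remaining):
    if remaining > 0.5:
        return "deploy freely"
    if remaining > 0.1:
        return "deploy with caution"
    return "freeze non-essential releases"

# 99.9% SLO, 99.95% observed: half the budget is spent
remaining = error_budget_remaining(0.999, 0.9995)
print(f"Budget remaining: {remaining:.0%}")
print(release_policy(remaining))
```

The useful property is that the gate is mechanical: nobody argues about whether "the service feels stable enough to ship" when the budget math answers the question.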
flowchart TB
subgraph "Error Budget Lifecycle"
A["SLO Target\n99.9% = 43.2 min/month"]
B["Error Budget\n= Allowed Downtime"]
C["Budget Remaining\n= Good (deploy freely)"]
D["Budget Burning\n= Medium Burn (6x)"]
E["Budget Critical\n= Fast Burn (14.4x)"]
F["Budget Exhausted\n= SLA Breach"]
end
A --> B
B --> C
B --> D
D --> E
E --> F
SLO Alerting
Set up alerting before you burn through the budget:
# Alert when error burn would consume 10% of the 30-day budget per hour
# (burn rate 72x: 0.10 * 720h / 1h = 72)
- alert: SLOErrorBudgetBurning
expr: |
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="api-gateway"}[1h]))
)
> (1 - 0.999) * 72
labels:
severity: warning
annotations:
summary: "Error budget burning fast"
description: "At the current error rate, 10% of the 30-day error budget is burned every hour"
Multi-Window Burn-Rate Alerting
Standard threshold alerts catch sudden spikes but miss slow leaks. Multi-window burn-rate alerting detects both:
| Window | Burn Rate Multiplier | Budget Burned | Use Case |
|---|---|---|---|
| 1 hour | 14.4x | 2% per hour | Fast burn (page immediately) |
| 6 hours | 6x | 5% per 6 hours | Medium burn (warning) |
| 3 days | 3x | 30% per 3 days | Slow leak (investigate) |
| 30 days | 1x | 100% per 30 days | Budget exhausted (review) |
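The arithmetic behind these windows: a burn rate of B consumes the budget B times faster than sustainable, so the full budget lasts window / B, and the fraction burned after h hours is B * h / window. A quick sketch:

```python
def hours_to_exhaustion(burn_rate, window_days=30):
    """How long the full error budget lasts at a given burn rate."""
    return window_days * 24 / burn_rate

def budget_burned(burn_rate, hours, window_days=30):
    """Fraction of the budget consumed after `hours` at this burn rate."""
    return burn_rate * hours / (window_days * 24)

# The three alerting windows: 14.4x over 1h, 6x over 6h, 3x over 3d
for rate, window_h in [(14.4, 1), (6, 6), (3, 72)]:
    print(f"{rate:>5}x: {budget_burned(rate, window_h):.0%} of budget "
          f"in {window_h}h, full budget gone in {hours_to_exhaustion(rate):.0f}h")
```

Running this shows why the multipliers pair with those windows: the fast window catches outages within an hour, while the slow window catches leaks that would quietly exhaust the budget in about ten days.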
Fast Burn Alert (1-Hour Window)
# Multi-window burn-rate alerting rules
groups:
- name: slo-burn-rate
interval: 30s
rules:
# 1-hour window: Page immediately if burning 14.4x sustainable rate
# At 99.9% SLO, sustainable error rate = 0.001
# 14.4 * 0.001 = 0.0144 = 1.44% error rate
- alert: SLOErrorBudgetFastBurn
expr: |
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="api-gateway"}[1h]))
)
> (1 - 0.999) * 14.4
for: 5m
labels:
severity: page
category: slo
window: 1h
annotations:
summary: "Error budget burning fast - FAST PAGE"
description: |
Error rate {{ $value | humanizePercentage }} is above 14.4x the sustainable rate (1.44%).
At this burn rate, the entire 30-day budget is exhausted in ~50 hours.
SLO: 99.9% | Sustainable error rate: 0.1%
Action: Page on-call immediately and investigate error spike.
Medium Burn Alert (6-Hour Window)
# 6-hour window: Warning if burning 6x sustainable rate
# 6 * 0.001 = 0.006 = 0.6% error rate
- alert: SLOErrorBudgetMediumBurn
expr: |
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[6h]))
/
sum(rate(http_requests_total{service="api-gateway"}[6h]))
)
> (1 - 0.999) * 6
for: 30m
labels:
severity: warning
category: slo
window: 6h
annotations:
summary: "Error budget burning - INVESTIGATE"
description: |
Error rate {{ $value | humanizePercentage }} is above 6x the sustainable rate (0.6%).
At this burn rate, ~5% of the 30-day budget is burned every 6 hours.
SLO: 99.9% | Sustainable error rate: 0.1%
Action: Investigate elevated error patterns during business hours.
Slow Burn Alert (3-Day Window)
# 3-day window: Long-term trend detection
# 3 * 0.001 = 0.003 = 0.3% error rate
- alert: SLOErrorBudgetSlowBurn
expr: |
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[3d]))
/
sum(rate(http_requests_total{service="api-gateway"}[3d]))
)
> (1 - 0.999) * 3
for: 3h
labels:
severity: warning
category: slo
window: 3d
annotations:
summary: "Error budget slow leak - REVIEW"
description: |
Error rate {{ $value | humanizePercentage }} is above 3x the sustainable rate (0.3%).
At this burn rate, the 30-day budget lasts only ~10 days.
SLO: 99.9% | Sustainable error rate: 0.1%
Action: Schedule reliability review; may indicate systemic issues.
Combined Multi-Window Alert
# Combined: Fire if ANY window exceeds threshold
# This catches both fast spikes and slow leaks
- alert: SLOErrorBudgetMultiWindowBurn
expr: |
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="api-gateway"}[1h]))
)
> (1 - 0.999) * 14.4
or
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[6h]))
/
sum(rate(http_requests_total{service="api-gateway"}[6h]))
)
> (1 - 0.999) * 6
or
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[3d]))
/
sum(rate(http_requests_total{service="api-gateway"}[3d]))
)
> (1 - 0.999) * 3
for: 5m
labels:
severity: page
category: slo
annotations:
summary: "Error budget burning across multiple time windows"
description: |
Multi-window burn-rate alert triggered.
One or more windows (1h at 14.4x, 6h at 6x, 3d at 3x) show unsustainable error rates.
Current error rate in the firing window: {{ $value | humanizePercentage }}
Action: On-call should investigate and consider declaring an incident.
Burn-Rate Alerting Template (Parameterizable)
# Template for burn-rate alerting with a configurable SLO.
# Prometheus rule files have no native variables: substitute $service and
# $slo_threshold with your config tooling (Helm, Jsonnet, envsubst) before loading.
- alert: SLOBudgetBurnGeneric
expr: |
(
sum(rate(http_requests_total{service="$service",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="$service"}[1h]))
)
> (1 - $slo_threshold) * 14.4
or
(
sum(rate(http_requests_total{service="$service",status=~"5.."}[6h]))
/
sum(rate(http_requests_total{service="$service"}[6h]))
)
> (1 - $slo_threshold) * 6
or
(
sum(rate(http_requests_total{service="$service",status=~"5.."}[3d]))
/
sum(rate(http_requests_total{service="$service"}[3d]))
)
> (1 - $slo_threshold) * 3
for: 5m
labels:
severity: page
category: slo
slo_target: "$slo_threshold"
runbook_url: "$runbook_url"
annotations:
summary: "Error budget burning for {{ $labels.service }} (SLO: {{ $labels.slo_target }})"
description: |
Multi-window burn-rate alert for {{ $labels.service }}.
SLO: {{ $labels.slo_target }}% | Windows: 1h / 6h / 3d
Refer to incident runbook: {{ $labels.runbook_url }}
Error Budget Dashboard
{
"dashboard": {
"title": "SLO Error Budget Dashboard",
"panels": [
{
"title": "Error Budget Remaining",
"type": "gauge",
"targets": [
{
"expr": "(1 - ((sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[30d])) / sum(rate(http_requests_total{service=\"api-gateway\"}[30d]))) / (1 - 0.999))) * 100",
"legendFormat": "Budget Remaining %"
}
],
"fieldConfig": {
"defaults": {
"min": 0,
"max": 100,
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "red" },
{ "value": 25, "color": "orange" },
{ "value": 50, "color": "yellow" },
{ "value": 75, "color": "green" }
]
}
}
}
},
{
"title": "Burn Rate by Window",
"type": "graph",
"targets": [
{
"expr": "(sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[1h])) / sum(rate(http_requests_total{service=\"api-gateway\"}[1h]))) / (1 - 0.999)",
"legendFormat": "1h Burn Rate"
},
{
"expr": "(sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[6h])) / sum(rate(http_requests_total{service=\"api-gateway\"}[6h]))) / (1 - 0.999)",
"legendFormat": "6h Burn Rate"
},
{
"expr": "(sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[3d])) / sum(rate(http_requests_total{service=\"api-gateway\"}[3d]))) / (1 - 0.999)",
"legendFormat": "3d Burn Rate"
}
],
"gridPos": { "x": 0, "y": 8, "w": 12, "h": 8 }
},
{
"title": "Projected Budget Exhaustion (Hours)",
"type": "stat",
"targets": [
{
"expr": "((1 - ((sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[30d])) / sum(rate(http_requests_total{service=\"api-gateway\"}[30d]))) / (1 - 0.999))) * 30 * 24) / ((sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[1h])) / sum(rate(http_requests_total{service=\"api-gateway\"}[1h]))) / (1 - 0.999))",
"legendFormat": "Hours remaining"
}
],
"gridPos": { "x": 12, "y": 8, "w": 6, "h": 4 }
},
{
"title": "Error Rate vs SLO Target",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"api-gateway\"}[5m])) * 100",
"legendFormat": "Current Error Rate %"
},
{
"expr": "(1 - 0.999) * 100",
"legendFormat": "SLO Target Error Rate %"
}
],
"gridPos": { "x": 18, "y": 8, "w": 6, "h": 8 }
}
]
}
}
Observability Hooks for Metrics Monitoring
This section defines what to log, measure, trace, and alert for metrics monitoring systems themselves.
Log (What to Emit)
| Event | Fields | Level |
|---|---|---|
| Scrape target added | target, job, endpoint | INFO |
| Scrape target removed | target, job, reason | INFO |
| Alert state change | alert_name, old_state, new_state, duration | INFO |
| Recording rule evaluation error | rule_name, error | ERROR |
| Alert evaluation error | alert_name, error | ERROR |
| Remote write failure | remote_url, error, retry_count | WARN |
| TSDB checkpoint created | checkpoint_size, duration | DEBUG |
Measure (Metrics to Collect)
| Metric | Type | Description |
|---|---|---|
| prometheus_tsdb_head_samples | Gauge | Samples in TSDB head |
| prometheus_tsdb_head_chunks | Gauge | Chunks in TSDB head |
| prometheus_tsdb_head_duration_seconds | Gauge | Time head has existed |
| prometheus_target_scrapes_total | Counter | Total scrape attempts |
| prometheus_target_scrapes_failed_total | Counter | Failed scrape attempts |
| prometheus_target_scrapes_exceeded_target_limit_total | Counter | Targets exceeding limit |
| prometheus_remote_write_requests_total | Counter | Remote write requests |
| prometheus_remote_write_requests_failed_total | Counter | Failed remote write requests |
| prometheus_alertmanager_alerts_total | Counter | Alerts sent to Alertmanager |
| prometheus_alerting_rules_evaluated_total | Counter | Rule evaluations |
| prometheus_notifications_queue_length | Gauge | Pending notifications |
| prometheus_http_request_duration_seconds | Histogram | HTTP request latency |
Trace (Correlation Points)
| Operation | Trace Attribute | Purpose |
|---|---|---|
| Scrape cycle | scrape.job, scrape.target | Track scrape performance |
| Remote write | remote_write.endpoint, remote_write.status | Monitor write health |
| Alert evaluation | alert.name, alert.severity | Correlate alerts |
Alert (When to Page for Monitoring System Itself)
| Alert | Condition | Severity | Purpose |
|---|---|---|---|
| Prometheus Down | Prometheus instance unreachable | P1 Critical | Monitoring unavailable |
| TSDB Head Growing | Head chunks > 2 weeks of data | P2 High | Storage issue |
| Scrape Failure Rate | Scrape failures > 10% for 10 min | P2 High | Missing metrics |
| Remote Write Failing | Remote write failures > 5% | P1 Critical | Backup missing |
| Alert Queue Full | Pending alerts > 100 | P3 Medium | Notification delay |
| High Cardinality | Cardinality > configured limit | P3 Medium | Memory pressure |
Metrics Monitoring System Observability Template
# monitoring-system-observability.yaml
groups:
- name: prometheus-self-monitoring
rules:
# Prometheus instance down
- alert: PrometheusDown
expr: up{job="prometheus"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Prometheus instance {{ $labels.instance }} is down"
description: "Prometheus monitoring is unavailable. Investigate immediately."
# TSDB head growing unbounded
- alert: PrometheusTSDBHeadOld
expr: prometheus_tsdb_head_min_time{job="prometheus"} / 1000 < (time() - 3600 * 24 * 14)
for: 1h
labels:
severity: warning
annotations:
summary: "Prometheus TSDB head is more than 2 weeks old"
description: "TSDB head has not compacted in 2 weeks. Check storage and compaction settings."
# High scrape failure rate
- alert: PrometheusScrapeFailureRate
expr: |
sum(rate(prometheus_target_scrapes_failed_total{job="prometheus"}[10m]))
/
sum(rate(prometheus_target_scrapes_total{job="prometheus"}[10m])) > 0.1
for: 10m
labels:
severity: high
annotations:
summary: "Prometheus scrape failure rate above 10%"
description: "{{ $value | humanizePercentage }} of scrapes are failing."
# Remote write failures
- alert: PrometheusRemoteWriteFailing
expr: |
sum(rate(prometheus_remote_write_requests_failed_total[5m]))
/
sum(rate(prometheus_remote_write_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus remote write failure rate above 5%"
description: "Remote write to long-term storage is failing. Historical metrics at risk."
# Notification queue backing up
- alert: PrometheusNotificationQueueFull
expr: prometheus_notifications_queue_length{job="prometheus"} > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus alert notification queue is backing up"
description: "{{ $value }} alerts pending delivery. Alertmanager may be unreachable."
# High HTTP request latency
- alert: PrometheusHighQueryLatency
expr: |
histogram_quantile(0.95,
sum(rate(prometheus_http_request_duration_seconds_bucket{job="prometheus"}[5m])) by (le)
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "Prometheus query latency above 2 seconds (p95)"
description: "Query performance is degraded. Check TSDB load and query complexity."
# Burn-rate alerting for SLO itself (meta-monitoring)
- alert: SLOBurnRateAlertMetaMonitoring
expr: |
(
sum(rate(prometheus_target_scrapes_failed_total{job="prometheus"}[1h]))
/
sum(rate(prometheus_target_scrapes_total{job="prometheus"}[1h]))
) > 0.01
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus monitoring quality degraded"
description: "Scrape failure rate is {{ $value | humanizePercentage }}. Monitoring accuracy reduced."
Service Level Agreements
SLAs are contractual commitments to customers, often backed by financial penalties. They are usually less aggressive than SLOs.
SLA vs SLO
| Aspect | SLA | SLO |
|---|---|---|
| Audience | Customers | Internal |
| Enforced by | Contracts, penalties | Team discipline |
| Target | Usually less strict | Usually more strict |
| Consequences | Financial | Operational |
Don’t set SLAs until you have SLOs you are confident you can meet. Adding contractual SLAs before you understand your system’s behavior is asking for trouble.
Building Dashboards
Dashboards translate metrics into actionable information. Good dashboards answer specific questions. Bad dashboards show everything and answer nothing.
Dashboard Design Principles
Start with the user and their questions. A dashboard for on-call engineers answering “is my service healthy?” looks different from an executive dashboard showing business metrics.
Group related metrics. Use rows and panels to organize information logically.
Include context. Raw numbers without comparison are hard to interpret. Show current value versus target, versus last week, or versus baseline.
Minimize chart junk. Every element should convey information. Remove gridlines, legends, and labels that don’t add value.
Example Dashboard Panels
# Grafana dashboard JSON (abbreviated)
{
"dashboard": {
"title": "API Gateway Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{service='api-gateway'}[5m])) by (service)",
"legendFormat": "{{service}}"
}
],
"gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
},
{
"title": "Error Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(http_requests_total{service='api-gateway',status=~'5..'}[5m])) / sum(rate(http_requests_total{service='api-gateway'}[5m])) * 100",
"legendFormat": "Error %"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "green" },
{ "value": 0.1, "color": "yellow" },
{ "value": 1, "color": "red" }
]
}
}
},
"gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 }
},
{
"title": "P95 Latency",
"type": "gauge",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service='api-gateway'}[5m])) by (le)) * 1000",
"legendFormat": "p95 ms"
}
],
"gridPos": { "x": 18, "y": 0, "w": 6, "h": 4 }
}
]
}
}
Essential Dashboard Sections
At a Glance: Key health indicators in a single row with green/yellow/red status for critical metrics.
Request Metrics: Rate, error rate, and latency percentiles for your main endpoints.
Infrastructure Metrics: CPU, memory, disk, and network for your hosts.
Application Metrics: Business-specific metrics like queue depths, cache hit rates, or background job counts.
Dependency Health: Metrics for databases, caches, and external services your application depends on.
Alert Design
Alerts should wake someone up only when they need to act. Too many alerts cause fatigue; too few cause outages.
Alert Severity Levels
| Severity | Response Time | Examples |
|---|---|---|
| P1 Critical | Minutes, 24/7 | Complete outage, data loss, security breach |
| P2 High | 30 minutes | Degraded performance affecting many users |
| P3 Medium | Business hours | Minor degradation, non-critical failures |
| P4 Low | Next sprint | Predictable issues, capacity planning |
Alert Quality Checklist
Before creating an alert, ask:
- Does this indicate a real problem affecting users?
- Is the root cause something we can fix?
- Is the alert actionable? Can the recipient do something about it?
- Is the alert specific? Does it point toward the likely cause?
- Is the threshold calibrated? Are we alerting on symptoms or causes?
Alerting on raw resource numbers leads to noise. Alerting on user-visible symptoms, with enough context to act on them, requires understanding your system well enough to know what indicates a real problem.
Example Alert Rules
# Prometheus alerting rules
groups:
- name: api-gateway
rules:
# High error rate
- alert: APIGatewayHighErrorRate
expr: |
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api-gateway"}[5m])) > 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "API Gateway error rate above 1%"
description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
# Latency degradation
- alert: APIGatewayHighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "API Gateway P95 latency above 1 second"
description: "P95 latency is {{ $value | humanizeDuration }}"
# Slow queries
- alert: DatabaseSlowQueries
expr: |
rate(django_dbqueries_total{type="slow"}[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "More than 10 slow queries per second"
description: "Database is experiencing query performance issues"
Alert Routing
Route alerts to the right people based on severity and service:
# Alertmanager configuration
route:
receiver: default
routes:
- match:
severity: critical
receiver: pagerduty
continue: true
- match:
service: database
receiver: database-oncall
- match:
severity: warning
receiver: slack-warnings
- match:
severity: info
receiver: none # Don't alert, just log
receivers:
- name: pagerduty
pagerduty_configs:
- service_key: xxx
severity: critical
- name: slack-warnings
slack_configs:
- channel: "#alerts-warning"
- name: database-oncall
pagerduty_configs:
- service_key: yyy # database team's PagerDuty service
- name: none # no notifier configs defined, so matched alerts are dropped
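Route evaluation is top-down: the first matching route wins, unless it sets continue: true, in which case the alert also falls through to later routes. A simplified sketch of that behavior (real Alertmanager routes form a tree and support regex matchers; this flattens them to a list):

```python
# Simplified Alertmanager routing: walk routes in order, collect the
# receiver of each match, and stop at the first match without `continue`.
def route_alert(alert_labels, routes, default="default"):
    receivers = []
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break
    return receivers or [default]

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty", "continue": True},
    {"match": {"service": "database"}, "receiver": "database-oncall"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
]

# A critical database alert pages PagerDuty AND the database on-call
print(route_alert({"severity": "critical", "service": "database"}, routes))
```

This is why the order of routes matters, and why continue: true on the critical route is what lets team-specific routing still fire for critical alerts.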
Blackbox Monitoring
Blackbox monitoring tests your service from the outside, independent of application metrics. It catches failures that internal metrics miss.
Prometheus Blackbox Exporter
# blackbox.yml
modules:
http_2xx:
prober: http
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
method: GET
fail_if_ssl: false
# Prometheus scrape config
scrape_configs:
- job_name: "blackbox-http"
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://api.example.com/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
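The three relabel rules above read as a small data flow: copy the probe target into the ?target= URL parameter, keep it as the instance label, then redirect the scrape itself to the exporter. A sketch of that flow (simplified; real relabeling supports regex captures and more actions):

```python
# Mimic the relabel_configs above: the original target address becomes
# the probe parameter and the instance label, and the scrape address
# is rewritten to point at the blackbox exporter.
def relabel(target):
    labels = {"__address__": target}
    labels["__param_target"] = labels["__address__"]   # rule 1: address -> ?target=
    labels["instance"] = labels["__param_target"]      # rule 2: keep readable instance
    labels["__address__"] = "blackbox-exporter:9115"   # rule 3: scrape the exporter
    return labels

print(relabel("https://api.example.com/health"))
```

Without rule 2, every probe would show up with instance="blackbox-exporter:9115", making the results indistinguishable.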
Synthetic Transactions
Run synthetic transactions regularly to catch degradation before users notice:
import requests
import time
def synthetic_checkout():
start = time.time()
# 1. Create cart
cart_resp = requests.post('https://api.example.com/carts', timeout=5)
cart_id = cart_resp.json()['id']
# 2. Add item
requests.post(f'https://api.example.com/carts/{cart_id}/items',
json={'product_id': 'PROD123', 'quantity': 1}, timeout=5)
# 3. Checkout
checkout_resp = requests.post(f'https://api.example.com/carts/{cart_id}/checkout',
timeout=10)
duration = time.time() - start
# Report metrics (assumes a statsd-style metrics client)
metrics.histogram('synthetic_checkout_duration', duration)
metrics.gauge('synthetic_checkout_success', 1 if checkout_resp.status_code == 200 else 0)
Capacity Planning
Monitoring helps predict when you need more capacity.
Trends and Projections
Monitor usage trends over weeks and months:
# Weekly growth rate
sum(rate(http_requests_total[7d])) / sum(rate(http_requests_total[7d] offset 7d)) - 1
# Predicted free disk space 30 days out (negative means exhaustion before then)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 30 * 24 * 3600)
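predict_linear fits a least-squares line through the recent samples of a gauge and extrapolates it forward. The same projection in plain Python (the disk-usage samples are illustrative):

```python
# Least-squares linear extrapolation, the idea behind predict_linear:
# fit value = slope*t + intercept over recent samples, then evaluate
# at the last timestamp plus the horizon.
def predict_linear(samples, horizon_seconds):
    """samples: list of (timestamp_seconds, value), oldest first."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / \
            sum((t - mean_t) ** 2 for t, _ in samples)
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_seconds) + intercept

# Free disk shrinking by ~1 GB/day; project 30 days past the last sample
day = 24 * 3600
samples = [(i * day, 100e9 - i * 1e9) for i in range(8)]  # one week of data
print(f"Free in 30 days: {predict_linear(samples, 30 * day) / 1e9:.1f} GB")
```

Because it is a linear fit, the projection only works on gauges with roughly steady trends; counters must be converted with rate() first, and step changes (a big cleanup job, a traffic spike) will skew it.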
Scaling Thresholds
# KEDA scaled object for event-driven scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: api-gateway-scaler
namespace: production
spec:
scaleTargetRef:
name: api-gateway
minReplicaCount: 3
maxReplicaCount: 100
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: http_requests_per_second
threshold: "1000"
query: sum(rate(http_requests_total{service="api-gateway"}[2m]))
When to Use Metrics-Based Monitoring
- Real-time system health monitoring and alerting
- Capacity planning and trend analysis
- SLO/SLA tracking and error budget management
- Correlation with business metrics (revenue, conversions)
- Long-term historical analysis and reporting
- Infrastructure and application layer monitoring
When Not to Use Metrics-Based Monitoring:
- Debugging specific request failures (use logs)
- Understanding request flow across services (use tracing)
- One-off troubleshooting of transient issues
- Monitoring low-volume events that do not aggregate well
- Situations where you need the full context of a single operation
Trade-off Analysis
| Aspect | Metrics Monitoring | Log Analysis | Distributed Tracing |
|---|---|---|---|
| Latency Detection | Aggregate patterns | Per-request detail | Request-level timing |
| Root Cause | Correlation difficult | Full context available | Causal chain is clear |
| Storage Cost | Low (aggregated) | High (raw events) | Medium (spans) |
| Query Flexibility | High (PromQL/SQL) | Medium (text search) | Low (structured) |
| Historical Analysis | Excellent | Limited by cost | Poor |
| Real-time Alerting | Excellent | Poor | Good |
| Debug Single Request | Poor | Excellent | Excellent |
| Capacity Planning | Excellent | Poor | Poor |
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Prometheus scrape failure | Gaps in metrics; missing visibility | Configure remote_write to backup; alert on target down; use Prometheus federation |
| Alert fatigue | Teams ignore alerts; real issues missed | Regularly review alert quality; remove stale alerts; tune thresholds |
| Metric cardinality explosion | Prometheus OOM; query performance degraded | Limit label cardinality; use recording rules; segment by service |
| Alert routing failure | Critical alerts not delivered; extended outages | Test alert routing; use multiple notification channels; on-call rotation |
| Dashboard data source outage | No visibility for teams; blind decision making | Configure backup data sources; cache dashboard definitions |
| SLO target chronically missed | Customer trust erosion; potential SLA penalties | Investigate root causes; adjust SLO targets if unrealistic; build error budget alerts |
Observability Checklist
Core Metrics (Golden Signals)
- Latency: p50, p95, p99 response times
- Traffic: requests per second, throughput
- Errors: error rate (4xx, 5xx), success rate
- Saturation: CPU, memory, disk, queue depths
Infrastructure Metrics
- Host-level: CPU utilization, memory usage, disk I/O, network throughput
- Container-level: resource limits, restart counts, OOM kills
- Kubernetes: pod status, node conditions, namespace quotas
Application Metrics (RED/USE)
- Rate: requests per second by endpoint, method, status
- Errors: error counts and rates by type, service, endpoint
- Duration: histogram of request latencies with percentiles
- Utilization: resource usage, queue lengths, connection pool stats
- Saturation: backpressure indicators, throttling events
SLO/SLA Metrics
- Availability SLI with 30-day rolling window
- Latency SLI (p95 under 200ms target)
- Error rate SLI (below 0.1% target)
- Error budget tracking and burn rate alerts
Alerting Rules
- P1: Service down or availability SLA breach
- P2: High error rate (>1% for 5 minutes)
- P2: Latency degradation (p95 >500ms)
- P3: Resource utilization >80%
- P4: Error budget 50% consumed by mid-window (review release pace)
Security Checklist
- Metrics endpoint (/metrics) not exposed publicly
- Prometheus access authenticated (if external)
- Alert manager notifications do not contain sensitive data
- Dashboard links to internal systems use authenticated proxies
- No API keys or secrets visible in dashboard URLs
- Alert routing logs audited
- Metrics cardinality limits enforced to prevent DoS
- Remote write connections use TLS
- Dashboard snapshots sanitized before sharing
Common Pitfalls / Anti-Patterns
1. Alerting on Symptoms Without Context
# Bad: raw resource threshold, no context
- alert: HighCPU
expr: cpu_usage > 80
# Fires constantly with no user impact attached
# Good: Alert with actionable context
- alert: HighCPUOnCriticalService
expr: cpu_usage{service="api-gateway"} > 80 and rate(http_requests_total{service="api-gateway"}[5m]) > 1000
# Only fires when it matters
2. Missing Alert Aggregation
Firing 1000 alerts for one failure causes chaos. Use grouping, inhibition, and routing to consolidate alerts:
route:
group_by: ["alertname", "cluster", "service"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
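What grouping does, mechanically: alerts sharing the same values for the group_by labels collapse into a single notification. A sketch of that step (the alert payloads are illustrative):

```python
from collections import defaultdict

# Collapse individual alerts into one notification per group key,
# the way group_by: [alertname, service] does.
def group_alerts(alerts, group_by):
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return groups

alerts = [
    {"alertname": "HighErrorRate", "service": "api", "instance": "pod-1"},
    {"alertname": "HighErrorRate", "service": "api", "instance": "pod-2"},
    {"alertname": "HighErrorRate", "service": "billing", "instance": "pod-9"},
]
groups = group_alerts(alerts, ["alertname", "service"])
print(f"{len(alerts)} alerts -> {len(groups)} notifications")
```

group_wait then holds each new group briefly so late-arriving members of the same failure land in the same page instead of a stream of follow-ups.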
3. SLOs That Do Not Reflect User Experience
Setting SLOs on backend metrics instead of user-visible metrics creates false confidence. SLOs should measure what users experience, not internal implementation details.
4. Dashboard Overload
More panels do not mean better monitoring. Each dashboard should answer specific questions. If you need to scroll, you have too many panels.
5. No Alert Testing
Alerts that never fire in production may have broken queries. Test alerts periodically by simulating conditions that should trigger them.
6. Ignoring Alert Fatigue
If engineers start ignoring alerts, real problems get missed. Review and tune alerts regularly. Remove alerts that no longer serve a purpose.
Quick Recap
Key Takeaways:
- Start with the four golden signals: latency, traffic, errors, saturation
- Define SLOs that reflect user experience, not internal metrics
- Use error budgets to prioritize reliability work
- Alert on symptoms with context, not raw numbers
- Every alert should be actionable: someone should know what to do
- Review and tune alerts regularly to avoid fatigue
Copy/Paste Checklist:
# Availability SLO query
1 - (
sum(rate(http_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
)
# Fast-burn error budget alert (14.4x the sustainable rate)
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="api-gateway"}[1h]))
) > (1 - 0.999) * 14.4
# P95 latency with context
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
) > 1
# Dashboard variable template
sum(rate(http_requests_total{service=~"$service"}[$interval])) by (service)
# Alert routing with grouping
route:
receiver: default
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
continue: true
routes:
- match:
severity: critical
receiver: pagerduty
Conclusion
Effective monitoring combines the right metrics, clear SLOs, thoughtful alerts, and actionable dashboards. Start with the four golden signals: latency, traffic, errors, and saturation. Define SLOs that reflect what users actually care about. Build alerts that fire only when someone needs to act.
For deeper observability into distributed systems, our Distributed Tracing guide covers request flow across services. The Prometheus & Grafana guide provides hands-on examples for implementing these patterns.