Metrics, Monitoring, and Alerting: From SLIs to Alerts
Learn the RED and USE methods, SLIs/SLOs/SLAs, and how to build alerting systems that catch real problems. Includes examples for web services and databases.
Monitoring is how you know your system is healthy. Without it, you are blind to failures, degradation, and capacity issues until users report them. This guide covers the theory and practice of building monitoring systems that actually help.
We assume you have basic familiarity with collecting logs and metrics. If you need a refresher, start with our Logging Best Practices guide, which covers structured logging.
The Three Methods
There are three established methodologies for defining what to measure: RED, USE, and Google’s Four Golden Signals. Each serves different purposes.
RED Method
RED focuses on request-driven services, particularly APIs:
- Rate: Requests per second
- Errors: Error rate (usually percentage of requests resulting in errors)
- Duration: Response time distribution (p50, p95, p99)
# Request rate
sum(rate(http_requests_total[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Duration percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
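Behind histogram_quantile: Prometheus finds the cumulative bucket that contains the target rank, then interpolates linearly inside it. A simplified sketch of that calculation (the real function also handles the +Inf bucket and aggregated le labels; the bucket data here is illustrative):

```python
# Estimate a quantile from cumulative histogram buckets: locate the
# bucket containing the target rank, then interpolate linearly within it.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound_seconds, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside this bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Example: 1000 requests bucketed at 0.1s, 0.25s, 0.5s, 1s
buckets = [(0.1, 800), (0.25, 950), (0.5, 990), (1.0, 1000)]
print(f"p95 = {histogram_quantile(0.95, buckets):.3f}s")
```

Note how the answer is an estimate: all requests inside a bucket are assumed to be evenly spread, which is why bucket boundaries should sit near your SLO thresholds.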
USE Method
USE focuses on resource utilization:
- Utilization: How busy is the resource
- Saturation: How much work is queued beyond what the resource can handle
- Errors: Internal errors preventing correct operation
# CPU utilization
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization (a saturation proxy; for true saturation watch swap or pressure metrics)
node_memory_Active_bytes / node_memory_MemTotal_bytes * 100
# Disk saturation (fraction of time the device is busy with I/O)
rate(node_disk_io_time_seconds_total[5m])
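The rate() calls in these queries compute the per-second increase of a counter over a window, treating a drop in value as a counter reset (process restart). A simplified sketch of that logic (real rate() also extrapolates to the window boundaries; the samples are illustrative):

```python
# Per-second increase of a counter over a window, handling resets:
# a value lower than its predecessor means the counter restarted from zero.
def prom_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:          # counter reset: count from zero
            increase += value
        else:
            increase += value - prev
        prev = value
    window = samples[-1][0] - samples[0][0]
    return increase / window

# 5 samples over 60s; the counter resets between t=30 and t=45
samples = [(0, 100), (15, 160), (30, 220), (45, 40), (60, 100)]
print(f"{prom_rate(samples):.1f} requests/sec")
```

This is why you rate() counters instead of graphing them raw: the absolute value is meaningless after restarts, but the increase per second is stable.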
Four Golden Signals
Google’s SRE book defines four signals every service should monitor:
- Latency: How long operations take
- Traffic: How much demand exists
- Errors: How often requests fail
- Saturation: How full the system is
For most web services, these four signals cover what matters most.
Service Level Indicators
SLIs are the actual metrics you measure. They define what “good” looks like for your service.
Common SLIs for Web Services
| SLI | Definition | Good | Acceptable |
|---|---|---|---|
| Availability | Percentage of requests that get a successful response | 99.9% | 99.5% |
| Latency | p95 response time under normal conditions | < 200ms | < 500ms |
| Throughput | Requests handled per second | Varies | Above baseline |
| Error Rate | Percentage of errors (5xx, timeouts) | < 0.1% | < 1% |
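The SLIs in the table reduce to the same shape: good events divided by valid events over a window. A minimal sketch of the availability SLI (the traffic numbers are illustrative):

```python
# Compute an availability SLI from raw request counts.
def availability_sli(total_requests, error_requests):
    """Fraction of requests that succeeded (non-5xx)."""
    if total_requests == 0:
        return 1.0  # no traffic: conventionally treated as meeting the SLI
    return (total_requests - error_requests) / total_requests

total, errors = 1_000_000, 800
sli = availability_sli(total, errors)
print(f"Availability: {sli:.4%}")        # 99.9200%
print("Meets 99.9% target:", sli >= 0.999)
```

Defining the SLI as a ratio of counters (rather than, say, an average of gauges) is what makes it composable: you can sum the counters across instances and regions and the ratio stays meaningful.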
Measuring SLIs
Define SLIs precisely so they can be measured consistently:
# sli-config.yaml
service: api-gateway
environment: production
slis:
- name: request_success_rate
description: "Percentage of requests returning non-5xx responses"
query: |
sum(rate(http_requests_total{service="api-gateway",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api-gateway"}[5m]))
- name: p95_latency
description: "95th percentile request duration"
query: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
)
- name: error_rate
description: "Percentage of requests returning 5xx errors"
query: |
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api-gateway"}[5m]))
Service Level Objectives
SLOs are the targets you want to achieve. They transform SLIs into goals your team commits to.
Defining SLOs
SLOs combine an SLI, a threshold, and a time window:
# SLO configuration
objectives:
- display_name: "API Availability"
sli: request_success_rate
target: 99.9
window: 30d
description: "API should be available 99.9% of the time"
- display_name: "API Latency"
sli: p95_latency
target: 99.0
threshold_ms: 200
window: 30d
description: "p95 latency should be below 200ms for 99% of measurement windows"
- display_name: "API Error Rate"
sli: error_rate
target: 99.5
threshold_percent: 0.1
window: 30d
description: "Error rate should stay below 0.1%"
Error Budgets
Error budgets convert SLO compliance into actionable information. If your availability SLO is 99.9% over 30 days, you have 43.2 minutes of allowed downtime in that period:
# Calculate error budget
def error_budget(slo_target, window_days=30):
window_seconds = window_days * 24 * 60 * 60
allowed_downtime = window_seconds * (1 - slo_target)
return allowed_downtime
# 99.9% over 30 days = 43.2 minutes
budget = error_budget(0.999)
print(f"Monthly error budget: {budget / 60:.1f} minutes")
# Monthly error budget: 43.2 minutes
Error budgets tell you when you can be aggressive with releases (large budget remaining) and when you need to be careful (budget nearly exhausted).
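That deploy-freely versus be-careful decision can be encoded directly. A sketch of a release gate driven by remaining budget (the thresholds and policy names are illustrative, not a standard):

```python
def error_budget_remaining(slo_target, observed_availability):
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    budget = 1 - slo_target                    # allowed failure fraction
    spent = 1 - observed_availability          # actual failure fraction
    return 1 - spent / budget

def release_policy(remaining):
    if remaining > 0.5:
        return "deploy freely"
    if remaining > 0.1:
        return "deploy with caution"
    return "freeze non-essential releases"

# 99.9% SLO, 99.95% observed: half the budget is spent
remaining = error_budget_remaining(0.999, 0.9995)
print(f"Budget remaining: {remaining:.0%}")
print(release_policy(remaining))
```

The useful property is that the gate is mechanical: nobody argues about whether "the service feels stable enough to ship" when the budget math answers the question.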
flowchart TB
subgraph "Error Budget Lifecycle"
A["SLO Target\n99.9% = 43.2 min/month"]
B["Error Budget\n= Allowed Downtime"]
C["Budget Remaining\n= Good (deploy freely)"]
D["Budget Burning\n= Medium Burn (6x)"]
E["Budget Critical\n= Fast Burn (14.4x)"]
F["Budget Exhausted\n= SLA Breach"]
end
A --> B
B --> C
B --> D
D --> E
E --> F
SLO Alerting
Set up alerting before you burn through the budget:
# Alert when error burn would consume 10% of the 30-day budget per hour
# (burn rate 72x: 0.10 * 720h / 1h = 72)
- alert: SLOErrorBudgetBurning
expr: |
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="api-gateway"}[1h]))
)
> (1 - 0.999) * 72
labels:
severity: warning
annotations:
summary: "Error budget burning fast"
description: "At the current error rate, 10% of the 30-day error budget is burned every hour"
Multi-Window Burn-Rate Alerting
Standard threshold alerts catch sudden spikes but miss slow leaks. Multi-window burn-rate alerting detects both:
| Window | Burn Rate Multiplier | Budget Burned | Use Case |
|---|---|---|---|
| 1 hour | 14.4x | 2% per hour | Fast burn (page immediately) |
| 6 hours | 6x | 5% per 6 hours | Medium burn (warning) |
| 3 days | 3x | 30% per 3 days | Slow leak (investigate) |
| 30 days | 1x | 100% per 30 days | Budget exhausted (review) |
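The arithmetic behind these windows: a burn rate of B consumes the budget B times faster than sustainable, so the full budget lasts window / B, and the fraction burned after h hours is B * h / window. A quick sketch:

```python
def hours_to_exhaustion(burn_rate, window_days=30):
    """How long the full error budget lasts at a given burn rate."""
    return window_days * 24 / burn_rate

def budget_burned(burn_rate, hours, window_days=30):
    """Fraction of the budget consumed after `hours` at this burn rate."""
    return burn_rate * hours / (window_days * 24)

# The three alerting windows: 14.4x over 1h, 6x over 6h, 3x over 3d
for rate, window_h in [(14.4, 1), (6, 6), (3, 72)]:
    print(f"{rate:>5}x: {budget_burned(rate, window_h):.0%} of budget "
          f"in {window_h}h, full budget gone in {hours_to_exhaustion(rate):.0f}h")
```

Running this shows why the multipliers pair with those windows: the fast window catches outages within an hour, while the slow window catches leaks that would quietly exhaust the budget in about ten days.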
Fast Burn Alert (1-Hour Window)
# Multi-window burn-rate alerting rules
groups:
- name: slo-burn-rate
interval: 30s
rules:
# 1-hour window: Page immediately if burning 14.4x sustainable rate
# At 99.9% SLO, sustainable error rate = 0.001
# 14.4 * 0.001 = 0.0144 = 1.44% error rate
- alert: SLOErrorBudgetFastBurn
expr: |
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="api-gateway"}[1h]))
)
> (1 - 0.999) * 14.4
for: 5m
labels:
severity: page
category: slo
window: 1h
annotations:
summary: "Error budget burning fast - FAST PAGE"
description: |
Error rate {{ $value | humanizePercentage }} is above 14.4x the sustainable rate (1.44%).
At this burn rate, the entire 30-day budget is exhausted in ~50 hours.
SLO: 99.9% | Sustainable error rate: 0.1%
Action: Page on-call immediately and investigate error spike.
Medium Burn Alert (6-Hour Window)
# 6-hour window: Warning if burning 6x sustainable rate
# 6 * 0.001 = 0.006 = 0.6% error rate
- alert: SLOErrorBudgetMediumBurn
expr: |
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[6h]))
/
sum(rate(http_requests_total{service="api-gateway"}[6h]))
)
> (1 - 0.999) * 6
for: 30m
labels:
severity: warning
category: slo
window: 6h
annotations:
summary: "Error budget burning - INVESTIGATE"
description: |
Error rate {{ $value | humanizePercentage }} is above 6x the sustainable rate (0.6%).
At this burn rate, ~5% of the 30-day budget is burned every 6 hours.
SLO: 99.9% | Sustainable error rate: 0.1%
Action: Investigate elevated error patterns during business hours.
Slow Burn Alert (3-Day Window)
# 3-day window: Long-term trend detection
# 3 * 0.001 = 0.003 = 0.3% error rate
- alert: SLOErrorBudgetSlowBurn
expr: |
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[3d]))
/
sum(rate(http_requests_total{service="api-gateway"}[3d]))
)
> (1 - 0.999) * 3
for: 3h
labels:
severity: warning
category: slo
window: 3d
annotations:
summary: "Error budget slow leak - REVIEW"
description: |
Error rate {{ $value | humanizePercentage }} is above 3x the sustainable rate (0.3%).
At this burn rate, the 30-day budget lasts only ~10 days.
SLO: 99.9% | Sustainable error rate: 0.1%
Action: Schedule reliability review; may indicate systemic issues.
Combined Multi-Window Alert
# Combined: Fire if ANY window exceeds threshold
# This catches both fast spikes and slow leaks
- alert: SLOErrorBudgetMultiWindowBurn
expr: |
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="api-gateway"}[1h]))
)
> (1 - 0.999) * 14.4
or
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[6h]))
/
sum(rate(http_requests_total{service="api-gateway"}[6h]))
)
> (1 - 0.999) * 6
or
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[3d]))
/
sum(rate(http_requests_total{service="api-gateway"}[3d]))
)
> (1 - 0.999) * 3
for: 5m
labels:
severity: page
category: slo
annotations:
summary: "Error budget burning across multiple time windows"
description: |
Multi-window burn-rate alert triggered.
One or more windows (1h at 14.4x, 6h at 6x, 3d at 3x) show unsustainable error rates.
Current error rate in the firing window: {{ $value | humanizePercentage }}
Action: On-call should investigate and consider declaring an incident.
Burn-Rate Alerting Template (Parameterizable)
# Template for burn-rate alerting with a configurable SLO.
# Prometheus rule files have no native variables: substitute $service and
# $slo_threshold with your config tooling (Helm, Jsonnet, envsubst) before loading.
- alert: SLOBudgetBurnGeneric
expr: |
(
sum(rate(http_requests_total{service="$service",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="$service"}[1h]))
)
> (1 - $slo_threshold) * 14.4
or
(
sum(rate(http_requests_total{service="$service",status=~"5.."}[6h]))
/
sum(rate(http_requests_total{service="$service"}[6h]))
)
> (1 - $slo_threshold) * 6
or
(
sum(rate(http_requests_total{service="$service",status=~"5.."}[3d]))
/
sum(rate(http_requests_total{service="$service"}[3d]))
)
> (1 - $slo_threshold) * 3
for: 5m
labels:
severity: page
category: slo
slo_target: "$slo_threshold"
runbook_url: "$runbook_url"
annotations:
summary: "Error budget burning for {{ $labels.service }} (SLO: {{ $labels.slo_target }})"
description: |
Multi-window burn-rate alert for {{ $labels.service }}.
SLO: {{ $labels.slo_target }}% | Windows: 1h / 6h / 3d
Refer to incident runbook: {{ $labels.runbook_url }}
Error Budget Dashboard
{
"dashboard": {
"title": "SLO Error Budget Dashboard",
"panels": [
{
"title": "Error Budget Remaining",
"type": "gauge",
"targets": [
{
"expr": "(1 - ((sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[30d])) / sum(rate(http_requests_total{service=\"api-gateway\"}[30d]))) / (1 - 0.999))) * 100",
"legendFormat": "Budget Remaining %"
}
],
"fieldConfig": {
"defaults": {
"min": 0,
"max": 100,
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "red" },
{ "value": 25, "color": "orange" },
{ "value": 50, "color": "yellow" },
{ "value": 75, "color": "green" }
]
}
}
}
},
{
"title": "Burn Rate by Window",
"type": "graph",
"targets": [
{
"expr": "(sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[1h])) / sum(rate(http_requests_total{service=\"api-gateway\"}[1h]))) / (1 - 0.999)",
"legendFormat": "1h Burn Rate"
},
{
"expr": "(sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[6h])) / sum(rate(http_requests_total{service=\"api-gateway\"}[6h]))) / (1 - 0.999)",
"legendFormat": "6h Burn Rate"
},
{
"expr": "(sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[3d])) / sum(rate(http_requests_total{service=\"api-gateway\"}[3d]))) / (1 - 0.999)",
"legendFormat": "3d Burn Rate"
}
],
"gridPos": { "x": 0, "y": 8, "w": 12, "h": 8 }
},
{
"title": "Projected Budget Exhaustion (Hours)",
"type": "stat",
"targets": [
{
"expr": "((1 - ((sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[30d])) / sum(rate(http_requests_total{service=\"api-gateway\"}[30d]))) / (1 - 0.999))) * 30 * 24) / ((sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[1h])) / sum(rate(http_requests_total{service=\"api-gateway\"}[1h]))) / (1 - 0.999))",
"legendFormat": "Hours remaining"
}
],
"gridPos": { "x": 12, "y": 8, "w": 6, "h": 4 }
},
{
"title": "Error Rate vs SLO Target",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"api-gateway\"}[5m])) * 100",
"legendFormat": "Current Error Rate %"
},
{
"expr": "(1 - 0.999) * 100",
"legendFormat": "SLO Target Error Rate %"
}
],
"gridPos": { "x": 18, "y": 8, "w": 6, "h": 8 }
}
]
}
}
Observability Hooks for Metrics Monitoring
This section defines what to log, measure, trace, and alert for metrics monitoring systems themselves.
Log (What to Emit)
| Event | Fields | Level |
|---|---|---|
| Scrape target added | target, job, endpoint | INFO |
| Scrape target removed | target, job, reason | INFO |
| Alert state change | alert_name, old_state, new_state, duration | INFO |
| Recording rule evaluation error | rule_name, error | ERROR |
| Alert evaluation error | alert_name, error | ERROR |
| Remote write failure | remote_url, error, retry_count | WARN |
| TSDB checkpoint created | checkpoint_size, duration | DEBUG |
Measure (Metrics to Collect)
| Metric | Type | Description |
|---|---|---|
| prometheus_tsdb_head_samples | Gauge | Samples in TSDB head |
| prometheus_tsdb_head_chunks | Gauge | Chunks in TSDB head |
| prometheus_tsdb_head_duration_seconds | Gauge | Time head has existed |
| prometheus_target_scrapes_total | Counter | Total scrape attempts |
| prometheus_target_scrapes_failed_total | Counter | Failed scrape attempts |
| prometheus_target_scrapes_exceeded_target_limit_total | Counter | Targets exceeding limit |
| prometheus_remote_write_requests_total | Counter | Remote write requests |
| prometheus_remote_write_requests_failed_total | Counter | Failed remote write requests |
| prometheus_alertmanager_alerts_total | Counter | Alerts sent to Alertmanager |
| prometheus_alerting_rules_evaluated_total | Counter | Rule evaluations |
| prometheus_notifications_queue_length | Gauge | Pending notifications |
| prometheus_http_request_duration_seconds | Histogram | HTTP request latency |
Trace (Correlation Points)
| Operation | Trace Attribute | Purpose |
|---|---|---|
| Scrape cycle | scrape.job, scrape.target | Track scrape performance |
| Remote write | remote_write.endpoint, remote_write.status | Monitor write health |
| Alert evaluation | alert.name, alert.severity | Correlate alerts |
Alert (When to Page for Monitoring System Itself)
| Alert | Condition | Severity | Purpose |
|---|---|---|---|
| Prometheus Down | Prometheus instance unreachable | P1 Critical | Monitoring unavailable |
| TSDB Head Growing | Head chunks > 2 weeks of data | P2 High | Storage issue |
| Scrape Failure Rate | Scrape failures > 10% for 10 min | P2 High | Missing metrics |
| Remote Write Failing | Remote write failures > 5% | P1 Critical | Backup missing |
| Alert Queue Full | Pending alerts > 100 | P3 Medium | Notification delay |
| High Cardinality | Cardinality > configured limit | P3 Medium | Memory pressure |
Metrics Monitoring System Observability Template
# monitoring-system-observability.yaml
groups:
- name: prometheus-self-monitoring
rules:
# Prometheus instance down
- alert: PrometheusDown
expr: up{job="prometheus"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Prometheus instance {{ $labels.instance }} is down"
description: "Prometheus monitoring is unavailable. Investigate immediately."
# TSDB head growing unbounded
- alert: PrometheusTSDBHeadOld
expr: prometheus_tsdb_head_min_time{job="prometheus"} / 1000 < (time() - 3600 * 24 * 14)
for: 1h
labels:
severity: warning
annotations:
summary: "Prometheus TSDB head is more than 2 weeks old"
description: "TSDB head has not compacted in 2 weeks. Check storage and compaction settings."
# High scrape failure rate
- alert: PrometheusScrapeFailureRate
expr: |
sum(rate(prometheus_target_scrapes_failed_total{job="prometheus"}[10m]))
/
sum(rate(prometheus_target_scrapes_total{job="prometheus"}[10m])) > 0.1
for: 10m
labels:
severity: high
annotations:
summary: "Prometheus scrape failure rate above 10%"
description: "{{ $value | humanizePercentage }} of scrapes are failing."
# Remote write failures
- alert: PrometheusRemoteWriteFailing
expr: |
sum(rate(prometheus_remote_write_requests_failed_total[5m]))
/
sum(rate(prometheus_remote_write_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus remote write failure rate above 5%"
description: "Remote write to long-term storage is failing. Historical metrics at risk."
# Notification queue backing up
- alert: PrometheusNotificationQueueFull
expr: prometheus_notifications_queue_length{job="prometheus"} > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus alert notification queue is backing up"
description: "{{ $value }} alerts pending delivery. Alertmanager may be unreachable."
# High HTTP request latency
- alert: PrometheusHighQueryLatency
expr: |
histogram_quantile(0.95,
sum(rate(prometheus_http_request_duration_seconds_bucket{job="prometheus"}[5m])) by (le)
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "Prometheus query latency above 2 seconds (p95)"
description: "Query performance is degraded. Check TSDB load and query complexity."
# Burn-rate alerting for SLO itself (meta-monitoring)
- alert: SLOBurnRateAlertMetaMonitoring
expr: |
(
sum(rate(prometheus_target_scrapes_failed_total{job="prometheus"}[1h]))
/
sum(rate(prometheus_target_scrapes_total{job="prometheus"}[1h]))
) > 0.01
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus monitoring quality degraded"
description: "Scrape failure rate is {{ $value | humanizePercentage }}. Monitoring accuracy reduced."
Service Level Agreements
SLAs are contractual commitments to customers, often backed by financial penalties. They are usually less aggressive than SLOs.
SLA vs SLO
| Aspect | SLA | SLO |
|---|---|---|
| Audience | Customers | Internal |
| Enforced by | Contracts, penalties | Team discipline |
| Target | Usually less strict | Usually more strict |
| Consequences | Financial | Operational |
Don’t set SLAs until you have SLOs you are confident you can meet. Adding contractual SLAs before you understand your system’s behavior is asking for trouble.
Building Dashboards
Dashboards translate metrics into actionable information. Good dashboards answer specific questions. Bad dashboards show everything and answer nothing.
Dashboard Design Principles
Start with the user and their questions. A dashboard for on-call engineers answering “is my service healthy?” looks different from an executive dashboard showing business metrics.
Group related metrics. Use rows and panels to organize information logically.
Include context. Raw numbers without comparison are hard to interpret. Show current value versus target, versus last week, or versus baseline.
Minimize chart junk. Every element should convey information. Remove gridlines, legends, and labels that don’t add value.
Example Dashboard Panels
# Grafana dashboard JSON (abbreviated)
{
"dashboard": {
"title": "API Gateway Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{service='api-gateway'}[5m])) by (service)",
"legendFormat": "{{service}}"
}
],
"gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
},
{
"title": "Error Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(http_requests_total{service='api-gateway',status=~'5..'}[5m])) / sum(rate(http_requests_total{service='api-gateway'}[5m])) * 100",
"legendFormat": "Error %"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "green" },
{ "value": 0.1, "color": "yellow" },
{ "value": 1, "color": "red" }
]
}
}
},
"gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 }
},
{
"title": "P95 Latency",
"type": "gauge",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service='api-gateway'}[5m])) by (le)) * 1000",
"legendFormat": "p95 ms"
}
],
"gridPos": { "x": 18, "y": 0, "w": 6, "h": 4 }
}
]
}
}
Essential Dashboard Sections
At a Glance: Key health indicators in a single row with green/yellow/red status for critical metrics.
Request Metrics: Rate, error rate, and latency percentiles for your main endpoints.
Infrastructure Metrics: CPU, memory, disk, and network for your hosts.
Application Metrics: Business-specific metrics like queue depths, cache hit rates, or background job counts.
Dependency Health: Metrics for databases, caches, and external services your application depends on.
Alert Design
Alerts should wake someone up only when they need to act. Too many alerts cause fatigue; too few cause outages.
Alert Severity Levels
| Severity | Response Time | Examples |
|---|---|---|
| P1 Critical | Minutes, 24/7 | Complete outage, data loss, security breach |
| P2 High | 30 minutes | Degraded performance affecting many users |
| P3 Medium | Business hours | Minor degradation, non-critical failures |
| P4 Low | Next sprint | Predictable issues, capacity planning |
Alert Quality Checklist
Before creating an alert, ask:
- Does this indicate a real problem affecting users?
- Is the root cause something we can fix?
- Is the alert actionable? Can the recipient do something about it?
- Is the alert specific? Does it point toward the likely cause?
- Is the threshold calibrated? Are we alerting on symptoms or causes?
Alerting on raw resource numbers leads to noise. Alerting on user-visible symptoms, with enough context to act on them, requires understanding your system well enough to know what indicates a real problem.
Example Alert Rules
# Prometheus alerting rules
groups:
- name: api-gateway
rules:
# High error rate
- alert: APIGatewayHighErrorRate
expr: |
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api-gateway"}[5m])) > 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "API Gateway error rate above 1%"
description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
# Latency degradation
- alert: APIGatewayHighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "API Gateway P95 latency above 1 second"
description: "P95 latency is {{ $value | humanizeDuration }}"
# Slow queries
- alert: DatabaseSlowQueries
expr: |
rate(django_dbqueries_total{type="slow"}[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "More than 10 slow queries per second"
description: "Database is experiencing query performance issues"
Alert Routing
Route alerts to the right people based on severity and service:
# Alertmanager configuration
route:
receiver: default
routes:
- match:
severity: critical
receiver: pagerduty
continue: true
- match:
service: database
receiver: database-oncall
- match:
severity: warning
receiver: slack-warnings
- match:
severity: info
receiver: none # Don't alert, just log
receivers:
- name: pagerduty
pagerduty_configs:
- service_key: xxx
severity: critical
- name: slack-warnings
slack_configs:
- channel: "#alerts-warning"
- name: database-oncall
pagerduty_configs:
- service_key: yyy # database team's PagerDuty service
- name: none # no notifier configs defined, so matched alerts are dropped
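Route evaluation is top-down: the first matching route wins, unless it sets continue: true, in which case the alert also falls through to later routes. A simplified sketch of that behavior (real Alertmanager routes form a tree and support regex matchers; this flattens them to a list):

```python
# Simplified Alertmanager routing: walk routes in order, collect the
# receiver of each match, and stop at the first match without `continue`.
def route_alert(alert_labels, routes, default="default"):
    receivers = []
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break
    return receivers or [default]

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty", "continue": True},
    {"match": {"service": "database"}, "receiver": "database-oncall"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
]

# A critical database alert pages PagerDuty AND the database on-call
print(route_alert({"severity": "critical", "service": "database"}, routes))
```

This is why the order of routes matters, and why continue: true on the critical route is what lets team-specific routing still fire for critical alerts.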
Blackbox Monitoring
Blackbox monitoring tests your service from the outside, independent of application metrics. It catches failures that internal metrics miss.
Prometheus Blackbox Exporter
# blackbox.yml
modules:
http_2xx:
prober: http
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
method: GET
fail_if_ssl: false
# Prometheus scrape config
scrape_configs:
- job_name: "blackbox-http"
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://api.example.com/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
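The three relabel rules above read as a small data flow: copy the probe target into the ?target= URL parameter, keep it as the instance label, then redirect the scrape itself to the exporter. A sketch of that flow (simplified; real relabeling supports regex captures and more actions):

```python
# Mimic the relabel_configs above: the original target address becomes
# the probe parameter and the instance label, and the scrape address
# is rewritten to point at the blackbox exporter.
def relabel(target):
    labels = {"__address__": target}
    labels["__param_target"] = labels["__address__"]   # rule 1: address -> ?target=
    labels["instance"] = labels["__param_target"]      # rule 2: keep readable instance
    labels["__address__"] = "blackbox-exporter:9115"   # rule 3: scrape the exporter
    return labels

print(relabel("https://api.example.com/health"))
```

Without rule 2, every probe would show up with instance="blackbox-exporter:9115", making the results indistinguishable.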
Synthetic Transactions
Run synthetic transactions regularly to catch degradation before users notice:
import requests
import time
def synthetic_checkout():
start = time.time()
# 1. Create cart
cart_resp = requests.post('https://api.example.com/carts', timeout=5)
cart_id = cart_resp.json()['id']
# 2. Add item
requests.post(f'https://api.example.com/carts/{cart_id}/items',
json={'product_id': 'PROD123', 'quantity': 1}, timeout=5)
# 3. Checkout
checkout_resp = requests.post(f'https://api.example.com/carts/{cart_id}/checkout',
timeout=10)
duration = time.time() - start
# Report metrics (assumes a statsd-style metrics client)
metrics.histogram('synthetic_checkout_duration', duration)
metrics.gauge('synthetic_checkout_success', 1 if checkout_resp.status_code == 200 else 0)
Capacity Planning
Monitoring helps predict when you need more capacity.
Trends and Projections
Monitor usage trends over weeks and months:
# Weekly growth rate
sum(rate(http_requests_total[7d])) / sum(rate(http_requests_total[7d] offset 7d)) - 1
# Predicted free disk space 30 days out (negative means exhaustion before then)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 30 * 24 * 3600)
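predict_linear fits a least-squares line through the recent samples of a gauge and extrapolates it forward. The same projection in plain Python (the disk-usage samples are illustrative):

```python
# Least-squares linear extrapolation, the idea behind predict_linear:
# fit value = slope*t + intercept over recent samples, then evaluate
# at the last timestamp plus the horizon.
def predict_linear(samples, horizon_seconds):
    """samples: list of (timestamp_seconds, value), oldest first."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / \
            sum((t - mean_t) ** 2 for t, _ in samples)
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_seconds) + intercept

# Free disk shrinking by ~1 GB/day; project 30 days past the last sample
day = 24 * 3600
samples = [(i * day, 100e9 - i * 1e9) for i in range(8)]  # one week of data
print(f"Free in 30 days: {predict_linear(samples, 30 * day) / 1e9:.1f} GB")
```

Because it is a linear fit, the projection only works on gauges with roughly steady trends; counters must be converted with rate() first, and step changes (a big cleanup job, a traffic spike) will skew it.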
Scaling Thresholds
# KEDA scaled object for event-driven scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: api-gateway-scaler
namespace: production
spec:
scaleTargetRef:
name: api-gateway
minReplicaCount: 3
maxReplicaCount: 100
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: http_requests_per_second
threshold: "1000"
query: sum(rate(http_requests_total{service="api-gateway"}[2m]))
When to Use Metrics-Based Monitoring
- Real-time system health monitoring and alerting
- Capacity planning and trend analysis
- SLO/SLA tracking and error budget management
- Correlation with business metrics (revenue, conversions)
- Long-term historical analysis and reporting
- Infrastructure and application layer monitoring
When Not to Use Metrics-Based Monitoring:
- Debugging specific request failures (use logs)
- Understanding request flow across services (use tracing)
- One-off troubleshooting of transient issues
- Monitoring low-volume events that do not aggregate well
- Situations where you need the full context of a single operation
Trade-off Analysis
| Aspect | Metrics Monitoring | Log Analysis | Distributed Tracing |
|---|---|---|---|
| Latency Detection | Aggregate patterns | Per-request detail | Request-level timing |
| Root Cause | Correlation difficult | Full context available | Causal chain is clear |
| Storage Cost | Low (aggregated) | High (raw events) | Medium (spans) |
| Query Flexibility | High (PromQL/SQL) | Medium (text search) | Low (structured) |
| Historical Analysis | Excellent | Limited by cost | Poor |
| Real-time Alerting | Excellent | Poor | Good |
| Debug Single Request | Poor | Excellent | Excellent |
| Capacity Planning | Excellent | Poor | Poor |
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Prometheus scrape failure | Gaps in metrics; missing visibility | Configure remote_write to backup; alert on target down; use Prometheus federation |
| Alert fatigue | Teams ignore alerts; real issues missed | Regularly review alert quality; remove stale alerts; tune thresholds |
| Metric cardinality explosion | Prometheus OOM; query performance degraded | Limit label cardinality; use recording rules; segment by service |
| Alert routing failure | Critical alerts not delivered; extended outages | Test alert routing; use multiple notification channels; on-call rotation |
| Dashboard data source outage | No visibility for teams; blind decision making | Configure backup data sources; cache dashboard definitions |
| SLO target chronically missed | Customer trust erosion; potential SLA penalties | Investigate root causes; adjust SLO targets if unrealistic; build error budget alerts |
Observability Checklist
Core Metrics (Golden Signals)
- Latency: p50, p95, p99 response times
- Traffic: requests per second, throughput
- Errors: error rate (4xx, 5xx), success rate
- Saturation: CPU, memory, disk, queue depths
Infrastructure Metrics
- Host-level: CPU utilization, memory usage, disk I/O, network throughput
- Container-level: resource limits, restart counts, OOM kills
- Kubernetes: pod status, node conditions, namespace quotas
Application Metrics (RED/USE)
- Rate: requests per second by endpoint, method, status
- Errors: error counts and rates by type, service, endpoint
- Duration: histogram of request latencies with percentiles
- Utilization: resource usage, queue lengths, connection pool stats
- Saturation: backpressure indicators, throttling events
SLO/SLA Metrics
- Availability SLI with 30-day rolling window
- Latency SLI (p95 under 200ms target)
- Error rate SLI (below 0.1% target)
- Error budget tracking and burn rate alerts
Alerting Rules
- P1: Service down or availability SLA breach
- P2: High error rate (>1% for 5 minutes)
- P2: Latency degradation (p95 >500ms)
- P3: Resource utilization >80%
- P4: Error budget 50% consumed by mid-window (review release pace)
Security Checklist
- Metrics endpoint (/metrics) not exposed publicly
- Prometheus access authenticated (if external)
- Alert manager notifications do not contain sensitive data
- Dashboard links to internal systems use authenticated proxies
- No API keys or secrets visible in dashboard URLs
- Alert routing logs audited
- Metrics cardinality limits enforced to prevent DoS
- Remote write connections use TLS
- Dashboard snapshots sanitized before sharing
Common Pitfalls / Anti-Patterns
1. Alerting on Symptoms Without Context
# Bad: raw resource threshold, no context
- alert: HighCPU
expr: cpu_usage > 80
# Fires constantly with no user impact attached
# Good: Alert with actionable context
- alert: HighCPUOnCriticalService
expr: cpu_usage{service="api-gateway"} > 80 and rate(http_requests_total{service="api-gateway"}[5m]) > 1000
# Only fires when it matters
2. Missing Alert Aggregation
Firing 1000 alerts for one failure causes chaos. Use grouping, inhibition, and routing to consolidate alerts:
route:
group_by: ["alertname", "cluster", "service"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
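What grouping does, mechanically: alerts sharing the same values for the group_by labels collapse into a single notification. A sketch of that step (the alert payloads are illustrative):

```python
from collections import defaultdict

# Collapse individual alerts into one notification per group key,
# the way group_by: [alertname, service] does.
def group_alerts(alerts, group_by):
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return groups

alerts = [
    {"alertname": "HighErrorRate", "service": "api", "instance": "pod-1"},
    {"alertname": "HighErrorRate", "service": "api", "instance": "pod-2"},
    {"alertname": "HighErrorRate", "service": "billing", "instance": "pod-9"},
]
groups = group_alerts(alerts, ["alertname", "service"])
print(f"{len(alerts)} alerts -> {len(groups)} notifications")
```

group_wait then holds each new group briefly so late-arriving members of the same failure land in the same page instead of a stream of follow-ups.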
3. SLOs That Do Not Reflect User Experience
Setting SLOs on backend metrics instead of user-visible metrics creates false confidence. SLOs should measure what users experience, not internal implementation details.
4. Dashboard Overload
More panels do not mean better monitoring. Each dashboard should answer specific questions. If you need to scroll, you have too many panels.
5. No Alert Testing
Alerts that never fire in production may have broken queries. Test alerts periodically by simulating conditions that should trigger them.
6. Ignoring Alert Fatigue
If engineers start ignoring alerts, real problems get missed. Review and tune alerts regularly. Remove alerts that no longer serve a purpose.
Quick Recap
Key Takeaways:
- Start with the four golden signals: latency, traffic, errors, saturation
- Define SLOs that reflect user experience, not internal metrics
- Use error budgets to prioritize reliability work
- Alert on symptoms with context, not raw numbers
- Every alert should be actionable: someone should know what to do
- Review and tune alerts regularly to avoid fatigue
Copy/Paste Checklist:
# Availability SLO query
1 - (
sum(rate(http_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
)
# Fast-burn error budget alert (14.4x the sustainable rate)
(
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="api-gateway"}[1h]))
) > (1 - 0.999) * 14.4
# P95 latency with context
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
) > 1
# Dashboard variable template
sum(rate(http_requests_total{service=~"$service"}[$interval])) by (service)
# Alert routing with grouping
route:
receiver: default
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
continue: true
routes:
- match:
severity: critical
receiver: pagerduty
Conclusion
Effective monitoring combines the right metrics, clear SLOs, thoughtful alerts, and actionable dashboards. Start with the four golden signals: latency, traffic, errors, and saturation. Define SLOs that reflect what users actually care about. Build alerts that fire only when someone needs to act.
For deeper observability into distributed systems, our Distributed Tracing guide covers request flow across services. The Prometheus & Grafana guide provides hands-on examples for implementing these patterns.