Metrics, Monitoring, and Alerting: From SLIs to Alerts

Learn the RED and USE methods, SLIs/SLOs/SLAs, and how to build alerting systems that catch real problems. Includes examples for web services and databases.

Reading time: 37 min · Author: GeekWorkBench

Monitoring is how you know your system is healthy. Without it, you are blind to failures, degradation, and capacity issues until users report them. This guide covers the theory and practice of building monitoring systems that actually help.

We assume you have basic familiarity with logs and metrics collection. If you need a logging refresher, our Logging Best Practices guide covers structured logging first.

Introduction

There are three established methodologies for defining what to measure: RED, USE, and Google’s Four Golden Signals. Each serves different purposes.

RED Method

RED focuses on request-driven services, particularly APIs:

  • Rate: Requests per second
  • Errors: Error rate (usually percentage of requests resulting in errors)
  • Duration: Response time distribution (p50, p95, p99)
# Request rate
sum(rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Duration percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
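
To make the three RED numbers concrete, here is a minimal Python sketch that computes them directly from raw request records. The `Request` type and `red_summary` helper are illustrative names, not part of any library. Unlike `histogram_quantile`, which estimates percentiles from histogram buckets, this version works on exact samples.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # response time in seconds


def red_summary(requests: list[Request], window_s: float) -> dict:
    """Compute Rate, Errors, and Duration for requests seen in one window."""
    if len(requests) < 2:
        raise ValueError("need at least two requests to compute percentiles")
    total = len(requests)
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95
    cuts = quantiles([r.duration_s for r in requests], n=100, method="inclusive")
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total,
        "p50_s": cuts[49],
        "p95_s": cuts[94],
    }


# 95 fast successes and 5 slow server errors observed over 50 seconds
sample = [Request(200, 0.1)] * 95 + [Request(500, 0.9)] * 5
summary = red_summary(sample, window_s=50)
print(summary)
```

The same three values are what the PromQL queries above derive from counters and histograms, only aggregated by Prometheus instead of in application code.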

USE Method

USE focuses on resource utilization:

  • Utilization: How busy is the resource
  • Saturation: How much work is queued beyond what the resource can handle
  • Errors: Internal errors preventing correct operation
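# CPU utilization (percent busy)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization (percent of total memory that is active)
node_memory_Active_bytes / node_memory_MemTotal_bytes * 100

# Disk I/O utilization (fraction of time the device was busy; this metric has no mode label)
rate(node_disk_io_time_seconds_total[5m])

# Network errors (receive errors per second)
rate(node_network_receive_errs_total[5m])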

Four Golden Signals

Google’s SRE book defines four signals every service should monitor:

  • Latency: How long operations take
  • Traffic: How much demand exists
  • Errors: How often requests fail
  • Saturation: How full the system is

For most web services, these four signals cover what matters most.

Core Concepts

SLIs (Service Level Indicators) are the metrics you actually measure. They define what “good” looks like for your service.

Common SLIs for Web Services

| SLI | Definition | Good | Acceptable |
| --- | --- | --- | --- |
| Availability | Percentage of requests that get a successful response | 99.9% | 99.5% |
| Latency | p95 response time under normal conditions | < 200ms | < 500ms |
| Throughput | Requests handled per second | Varies | Above baseline |
| Error Rate | Percentage of errors (5xx, timeouts) | < 0.1% | < 1% |

Measuring SLIs

Define SLIs precisely so they can be measured consistently:

# sli-config.yaml
service: api-gateway
environment: production

slis:
  - name: request_success_rate
    description: "Percentage of requests returning non-5xx responses"
    query: |
      sum(rate(http_requests_total{service="api-gateway",status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="api-gateway"}[5m]))

  - name: p95_latency
    description: "95th percentile request duration"
    query: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
      )

  - name: error_rate
    description: "Percentage of requests returning 5xx errors"
    query: |
      sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="api-gateway"}[5m]))

Service Level Objectives

SLOs are the targets you want to achieve. They transform SLIs into goals your team commits to.

Defining SLOs

SLOs combine an SLI, a threshold, and a time window:

# SLO configuration
objectives:
  - display_name: "API Availability"
    sli: request_success_rate
    target: 99.9
    window: 30d
    description: "API should be available 99.9% of the time"

  - display_name: "API Latency"
    sli: p95_latency
    target: 99.0
    threshold_ms: 200
    window: 30d
    description: "p95 latency should stay under 200ms in 99% of measurement intervals"

  - display_name: "API Error Rate"
    sli: error_rate
    target: 99.5
    threshold_percent: 0.1
    window: 30d
    description: "Error rate should stay below 0.1%"
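
As a rough illustration of how an SLI, a threshold, and a target fit together, the sketch below checks compliance over a list of sampled SLI values. The `SLO` dataclass and its field names are assumptions made for this example; they do not correspond to any particular SLO tool.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    name: str
    target_percent: float    # share of intervals that must be good, e.g. 99.0
    threshold: float         # SLI value that separates good from bad
    higher_is_better: bool   # True for success rate, False for latency


def compliance_percent(slo: SLO, sli_samples: list[float]) -> float:
    """Percentage of sampled intervals in which the SLI met the threshold."""
    if slo.higher_is_better:
        good = sum(1 for v in sli_samples if v >= slo.threshold)
    else:
        good = sum(1 for v in sli_samples if v <= slo.threshold)
    return 100 * good / len(sli_samples)


def is_met(slo: SLO, sli_samples: list[float]) -> bool:
    return compliance_percent(slo, sli_samples) >= slo.target_percent


latency_slo = SLO("API Latency", target_percent=99.0, threshold=200, higher_is_better=False)
# 1,000 intervals of p95 latency in ms; 5 of them breach the 200 ms threshold
p95_samples = [150.0] * 995 + [450.0] * 5
print(compliance_percent(latency_slo, p95_samples), is_met(latency_slo, p95_samples))
```

In practice the samples would come from evaluating the SLI query at a fixed step across the SLO window.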

Error Budgets

Error budgets convert SLO compliance into actionable information. If your availability SLO is 99.9% over 30 days, you have 43 minutes of allowed downtime in that period:

# Calculate error budget
def error_budget(slo_target, window_days=30):
    window_seconds = window_days * 24 * 60 * 60
    allowed_downtime = window_seconds * (1 - slo_target)
    return allowed_downtime

# 99.9% over 30 days = 43.2 minutes
budget = error_budget(0.999)
print(f"Monthly error budget: {budget / 60:.1f} minutes")
# Monthly error budget: 43.2 minutes

Error budgets tell you when you can be aggressive with releases (large budget remaining) and when you need to be careful (budget nearly exhausted).
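
For request-based SLOs the budget is easier to track in failed requests than in minutes. The following sketch, with hypothetical request counts and an illustrative policy whose cut-offs are assumptions rather than a standard, shows how the remaining budget might drive a release decision.

```python
def budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures


def release_policy(remaining: float) -> str:
    # Illustrative cut-offs; pick values that match your own risk tolerance
    if remaining > 0.5:
        return "deploy freely"
    if remaining > 0.1:
        return "deploy with care"
    return "freeze risky changes"


# 10M requests this window, 2,500 failed, 99.9% SLO allows about 10,000 failures
remaining = budget_remaining(10_000_000, 2_500, 0.999)
print(f"Budget remaining: {remaining:.0%} -> {release_policy(remaining)}")
```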

flowchart TB
    subgraph "Error Budget Lifecycle"
        A["SLO Target\n99.9% = 43.2 min/month"]
        B["Error Budget\n= Allowed Downtime"]
        C["Budget Remaining\n= Good (deploy freely)"]
        D["Budget Burning\n= Slow Burn (> 6x)"]
        E["Budget Critical\n= Fast Burn (14.4x)"]
        F["Budget Exhausted\n= SLO Breach"]
    end
    A --> B
    B --> C
    B --> D
    D --> E
    E --> F

SLO Alerting

Set up alerting before you burn through the budget:

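# Alert when 10% of the 30-day budget is burned in 1 hour.
# Burning 10% of a 720-hour budget in 1 hour is a burn rate of 0.10 * 720 = 72x,
# so the threshold is 72 times the sustainable error rate (1 - SLO).
- alert: SLOErrorBudgetBurning
  expr: |
    (
      sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{service="api-gateway"}[1h]))
    )
    > (1 - 0.999) * 72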
  labels:
    severity: warning
  annotations:
    summary: "Error budget burning fast"
    description: "More than 10% of the 30-day error budget has been burned in the last hour"

Multi-Window Burn-Rate Alerting

Standard threshold alerts catch sudden spikes but miss slow leaks. Multi-window burn-rate alerting detects both:

| Window | Burn Rate Multiplier | Budget Burned | Use Case |
| --- | --- | --- | --- |
| 1 hour | 14.4x | 2% per hour | Fast burn (page immediately) |
| 6 hours | 6x | 5% per 6 hours | Medium burn (warning) |
| 3 days | 3x | 30% per 3 days | Slow leak (investigate) |
| 30 days | 1x | 100% per 30 days | Budget exhausted (review) |
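
The relationship between burn rate, window length, and budget consumed is plain arithmetic, which the short sketch below works through for a 30-day SLO period. The helper names are illustrative.

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO window


def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = PERIOD_HOURS) -> float:
    """Fraction of the period's error budget spent by holding burn_rate for window_hours."""
    return burn_rate * window_hours / period_hours


def hours_to_exhaustion(burn_rate: float, period_hours: float = PERIOD_HOURS) -> float:
    """How long a full budget lasts if the burn rate stays constant."""
    return period_hours / burn_rate


for rate, window in [(14.4, 1), (6, 6), (3, 72), (1, 720)]:
    print(f"{rate}x for {window}h: {budget_consumed(rate, window):.0%} of budget, "
          f"full budget gone after {hours_to_exhaustion(rate):.0f}h")
```

A burn rate of 1x spends exactly the whole budget over the period; anything above 1x exhausts it early.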

Fast Burn Alert (1-Hour Window)

# Multi-window burn-rate alerting rules
groups:
  - name: slo-burn-rate
    interval: 30s
    rules:
      # 1-hour window: Page immediately if burning 14.4x sustainable rate
      # At 99.9% SLO, sustainable error rate = 0.001
      # 14.4 * 0.001 = 0.0144 = 1.44% error rate
      - alert: SLOErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="api-gateway"}[1h]))
          )
          > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: page
          category: slo
          window: 1h
        annotations:
          summary: "Error budget burning fast - FAST PAGE"
          description: |
            Error rate over the last hour is {{ $value | humanizePercentage }}, more than 14.4x the sustainable rate of 0.1%.
            At 14.4x, the entire 30-day budget is exhausted in about 50 hours (2% of the budget per hour).
            SLO: 99.9% | Threshold: 1.44% error rate
            Action: Page on-call immediately and investigate error spike.

Medium Burn Alert (6-Hour Window)

# 6-hour window: Warning if burning 6x sustainable rate
# 6 * 0.001 = 0.006 = 0.6% error rate
- alert: SLOErrorBudgetMediumBurn
  expr: |
    (
      sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total{service="api-gateway"}[6h]))
    )
    > (1 - 0.999) * 6
  for: 30m
  labels:
    severity: warning
    category: slo
    window: 6h
  annotations:
    summary: "Error budget burning - INVESTIGATE"
    description: |
      Error rate over the last 6 hours is {{ $value | humanizePercentage }}, more than 6x the sustainable rate of 0.1%.
      At 6x, 5% of your 30-day budget is burned every 6 hours.
      SLO: 99.9% | Threshold: 0.6% error rate
      Action: Investigate elevated error patterns during business hours.

Slow Burn Alert (3-Day Window)

# 3-day window: Long-term trend detection
# 3 * 0.001 = 0.003 = 0.3% error rate
- alert: SLOErrorBudgetSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[3d]))
      /
      sum(rate(http_requests_total{service="api-gateway"}[3d]))
    )
    > (1 - 0.999) * 3
  for: 3h
  labels:
    severity: warning
    category: slo
    window: 3d
  annotations:
    summary: "Error budget slow leak - REVIEW"
    description: |
      Error rate over the last 3 days is {{ $value | humanizePercentage }}, more than 3x the sustainable rate of 0.1%.
      At 3x, 10% of your 30-day budget is burned every day (30% in 3 days).
      SLO: 99.9% | Threshold: 0.3% error rate
      Action: Schedule reliability review; may indicate systemic issues.

Combined Multi-Window Alert

# Combined: Fire if ANY window exceeds threshold
# This catches both fast spikes and slow leaks
- alert: SLOErrorBudgetMultiWindowBurn
  expr: |
    (
      sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{service="api-gateway"}[1h]))
    )
    > (1 - 0.999) * 14.4
    or
    (
      sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total{service="api-gateway"}[6h]))
    )
    > (1 - 0.999) * 6
    or
    (
      sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[3d]))
      /
      sum(rate(http_requests_total{service="api-gateway"}[3d]))
    )
    > (1 - 0.999) * 3
  for: 5m
  labels:
    severity: page
    category: slo
  annotations:
    summary: "Error budget burning across multiple time windows"
    description: |
      Multi-window burn-rate alert triggered.
      One or more time windows show unsustainable error rates.
      Error rate in the first breaching window (checked in order 1h, 6h, 3d): {{ $value | humanizePercentage }}
      Thresholds: 1.44% (1h) / 0.6% (6h) / 0.3% (3d)
      Action: On-call should investigate and consider declaring incident.

Burn-Rate Alerting Template (Parameterizable)

# Template for burn-rate alerting with configurable SLO
# Use $slo_threshold as a template variable or replace manually
- alert: SLOBudgetBurnGeneric
  expr: |
    (
      sum(rate(http_requests_total{service="$service",status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{service="$service"}[1h]))
    )
    > (1 - $slo_threshold) * 14.4
    or
    (
      sum(rate(http_requests_total{service="$service",status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total{service="$service"}[6h]))
    )
    > (1 - $slo_threshold) * 6
    or
    (
      sum(rate(http_requests_total{service="$service",status=~"5.."}[3d]))
      /
      sum(rate(http_requests_total{service="$service"}[3d]))
    )
    > (1 - $slo_threshold) * 3
  for: 5m
  labels:
    severity: page
    category: slo
  annotations:
    # sum() drops the service label, so $labels.service would render empty here;
    # reuse the same placeholders as in expr ($runbook_url is one more placeholder to fill in)
    summary: "Error budget burning for $service (SLO target: $slo_threshold)"
    description: |
      Multi-window burn-rate alert for $service.
      SLO target: $slo_threshold | Windows: 1h / 6h / 3d
      Refer to incident runbook: $runbook_url
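
One way to fill in the placeholders is Python's standard `string.Template`, which happens to use the same `$name` syntax. This is only a sketch with a shortened rule; the `render_rule` helper is a made-up name. Note that a literal `$` meant for Prometheus templating (such as `$value`) has to be written as `$$` so that `string.Template` leaves it alone.

```python
from string import Template

# Shortened rule: only the 1h window is shown to keep the sketch small
RULE_TEMPLATE = Template("""\
- alert: SLOBudgetBurnGeneric
  expr: |
    (
      sum(rate(http_requests_total{service="$service",status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{service="$service"}[1h]))
    )
    > (1 - $slo_threshold) * 14.4
  annotations:
    summary: "Error budget burning for $service ({{ $$value | humanizePercentage }})"
""")


def render_rule(service: str, slo_threshold: float) -> str:
    """Substitute the per-service values; raises KeyError if a placeholder is missing."""
    return RULE_TEMPLATE.substitute(service=service, slo_threshold=slo_threshold)


rendered = render_rule("checkout", 0.995)
print(rendered)
```

Generating one rule file per service this way keeps the thresholds consistent while letting each service carry its own SLO target.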

Error Budget Dashboard

{
  "dashboard": {
    "title": "SLO Error Budget Dashboard",
    "panels": [
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "targets": [
          {
            "expr": "(1 - (sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[30d])) / sum(rate(http_requests_total{service=\"api-gateway\"}[30d]))) / (1 - 0.999)) * 100",
            "legendFormat": "Budget Remaining %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0,
            "max": 100,
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 25, "color": "orange" },
                { "value": 50, "color": "yellow" },
                { "value": 75, "color": "green" }
              ]
            }
          }
        }
      },
      {
        "title": "Burn Rate by Window",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[1h])) / sum(rate(http_requests_total{service=\"api-gateway\"}[1h]))) / (1 - 0.999)",
            "legendFormat": "1h Burn Rate"
          },
          {
            "expr": "(sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[6h])) / sum(rate(http_requests_total{service=\"api-gateway\"}[6h]))) / (1 - 0.999)",
            "legendFormat": "6h Burn Rate"
          },
          {
            "expr": "(sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[3d])) / sum(rate(http_requests_total{service=\"api-gateway\"}[3d]))) / (1 - 0.999)",
            "legendFormat": "3d Burn Rate"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 12, "h": 8 }
      },
      {
        "title": "Projected Budget Exhaustion (Hours)",
        "type": "stat",
        "targets": [
          {
            "expr": "(1 - (sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[30d])) / sum(rate(http_requests_total{service=\"api-gateway\"}[30d]))) / (1 - 0.999)) * 30 * 24 / ((sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[1h])) / sum(rate(http_requests_total{service=\"api-gateway\"}[1h]))) / (1 - 0.999))",
            "legendFormat": "Hours remaining"
          }
        ],
        "gridPos": { "x": 12, "y": 8, "w": 6, "h": 4 }
      },
      {
        "title": "Error Rate vs SLO Target",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"api-gateway\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"api-gateway\"}[5m])) * 100",
            "legendFormat": "Current Error Rate %"
          },
          {
            "expr": "(1 - 0.999) * 100",
            "legendFormat": "SLO Target Error Rate %"
          }
        ],
        "gridPos": { "x": 18, "y": 8, "w": 6, "h": 8 }
      }
    ]
  }
}

Observability Hooks for Metrics Monitoring

This section covers what to log, measure, trace, and alert on for the monitoring system itself.

Log (What to Emit)

| Event | Fields | Level |
| --- | --- | --- |
| Scrape target added | target, job, endpoint | INFO |
| Scrape target removed | target, job, reason | INFO |
| Alert state change | alert_name, old_state, new_state, duration | INFO |
| Recording rule evaluation error | rule_name, error | ERROR |
| Alert evaluation error | alert_name, error | ERROR |
| Remote write failure | remote_url, error, retry_count | WARN |
| TSDB checkpoint created | checkpoint_size, duration | DEBUG |

Measure (Metrics to Collect)

| Metric | Type | Description |
| --- | --- | --- |
| prometheus_tsdb_head_samples | Gauge | Samples in TSDB head |
| prometheus_tsdb_head_chunks | Gauge | Chunks in TSDB head |
| prometheus_tsdb_head_duration_seconds | Gauge | Time head has existed |
| prometheus_target_scrapes_total | Counter | Total scrape attempts |
| prometheus_target_scrapes_failed_total | Counter | Failed scrape attempts |
| prometheus_target_scrapes_exceeded_target_limit_total | Counter | Targets exceeding limit |
| prometheus_remote_write_requests_total | Counter | Remote write requests |
| prometheus_remote_write_requests_failed_total | Counter | Failed remote write requests |
| prometheus_alertmanager_alerts_total | Counter | Alerts sent to Alertmanager |
| prometheus_alerting_rules_evaluated_total | Counter | Rule evaluations |
| prometheus_notifications_queue_length | Gauge | Pending notifications |
| prometheus_http_request_duration_seconds | Histogram | HTTP request latency |

Exact metric names differ between Prometheus versions; confirm them against your server's /metrics output before building alerts on them.

Trace (Correlation Points)

| Operation | Trace Attribute | Purpose |
| --- | --- | --- |
| Scrape cycle | scrape.job, scrape.target | Track scrape performance |
| Remote write | remote_write.endpoint, remote_write.status | Monitor write health |
| Alert evaluation | alert.name, alert.severity | Correlate alerts |

Alert (When to Page for Monitoring System Itself)

| Alert | Condition | Severity | Purpose |
| --- | --- | --- | --- |
| Prometheus Down | Prometheus instance unreachable | P1 Critical | Monitoring unavailable |
| TSDB Head Growing | Head chunks > 2 weeks of data | P2 High | Storage issue |
| Scrape Failure Rate | Scrape failures > 10% for 10 min | P2 High | Missing metrics |
| Remote Write Failing | Remote write failures > 5% | P1 Critical | Backup missing |
| Alert Queue Full | Pending alerts > 100 | P3 Medium | Notification delay |
| High Cardinality | Cardinality > configured limit | P3 Medium | Memory pressure |

Metrics Monitoring System Observability Template

# monitoring-system-observability.yaml
groups:
  - name: prometheus-self-monitoring
    rules:
      # Prometheus instance down
      - alert: PrometheusDown
        expr: up{job="prometheus"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus instance {{ $labels.instance }} is down"
          description: "Prometheus monitoring is unavailable. Investigate immediately."

      # TSDB head growing unbounded
      - alert: PrometheusTSDBHeadOld
        # prometheus_tsdb_head_min_time is reported in milliseconds; time() is in seconds
        expr: prometheus_tsdb_head_min_time{job="prometheus"} / 1000 < (time() - 3600 * 24 * 14)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Prometheus TSDB head is more than 2 weeks old"
          description: "TSDB head has not compacted in 2 weeks. Check storage and compaction settings."

      # High scrape failure rate
      - alert: PrometheusScrapeFailureRate
        expr: |
          sum(rate(prometheus_target_scrapes_failed_total{job="prometheus"}[10m]))
          /
          sum(rate(prometheus_target_scrapes_total{job="prometheus"}[10m])) > 0.1
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "Prometheus scrape failure rate above 10%"
          description: "{{ $value | humanizePercentage }} of scrapes are failing."

      # Remote write failures
      - alert: PrometheusRemoteWriteFailing
        expr: |
          sum(rate(prometheus_remote_write_requests_failed_total[5m]))
          /
          sum(rate(prometheus_remote_write_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus remote write failure rate above 5%"
          description: "Remote write to long-term storage is failing. Historical metrics at risk."

      # Notification queue backing up
      - alert: PrometheusNotificationQueueFull
        expr: prometheus_notifications_queue_length{job="prometheus"} > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus alert notification queue is backing up"
          description: "{{ $value }} alerts pending delivery. Alertmanager may be unreachable."

      # High HTTP request latency
      - alert: PrometheusHighQueryLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(prometheus_http_request_duration_seconds_bucket{job="prometheus"}[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus query latency above 2 seconds (p95)"
          description: "Query performance is degraded. Check TSDB load and query complexity."

      # Burn-rate alerting for SLO itself (meta-monitoring)
      - alert: SLOBurnRateAlertMetaMonitoring
        expr: |
          (
            sum(rate(prometheus_target_scrapes_failed_total{job="prometheus"}[1h]))
            /
            sum(rate(prometheus_target_scrapes_total{job="prometheus"}[1h]))
          ) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus monitoring quality degraded"
          description: "Scrape failure rate is {{ $value | humanizePercentage }}. Monitoring accuracy reduced."

DORA Metrics for Delivery Performance

DORA (DevOps Research and Assessment) research identified four metrics that predict software delivery performance. You can use your existing monitoring infrastructure to track all of them.

The Four DORA Metrics

| Metric | Definition | Target |
| --- | --- | --- |
| Deployment Frequency | How often code deploys to production | Multiple per day |
| Lead Time for Changes | Time from commit to production | Under 1 week |
| Change Failure Rate | Percentage of deployments causing failures | Under 15% |
| Mean Time to Recovery (MTTR) | Time to restore service after failure | Under 1 hour |

Measuring DORA Metrics

# Deployment frequency (deployments per day)
sum(increase(deployments_total[1d]))

# Lead time for changes (p95 time from commit to deploy)
histogram_quantile(0.95,
  sum(rate(deploy_build_to_deploy_seconds_bucket[1w])) by (le)
)

# Change failure rate (failed deployments / total deployments)
sum(increase(deployments_total{status="failed"}[1w]))
/
sum(increase(deployments_total[1w]))

# MTTR from incident start to resolution
avg(incident_resolution_seconds) by (service)
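
If deployment events are not yet exported as Prometheus metrics, the same four numbers can be derived from deployment records. The sketch below assumes a hypothetical `Deployment` record with commit, deploy, and recovery timestamps; it reports the median lead time rather than the p95 used in the query above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def dora_summary(deploys: list[Deployment], days: int) -> dict:
    """Compute the four DORA metrics for deployments made over `days` days."""
    failures = [d for d in deploys if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


t0 = datetime(2024, 1, 1)
day = timedelta(days=1)
deploys = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               restored_at=t0 + timedelta(hours=4, minutes=30)),
    Deployment(t0 + day, t0 + day + timedelta(hours=1)),
    Deployment(t0 + day, t0 + day + timedelta(hours=3)),
]
summary = dora_summary(deploys, days=2)
print(summary)
```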

Connecting DORA to SLOs

Deploying frequently without causing failures requires both mature processes and good monitoring. SLO burn-rate alerts give you early warning before reliability problems show up in DORA metrics.

The elite performers identified by DORA research deploy multiple times daily with change failure rates below 5%. That kind of reliability does not happen by accident.

Monitoring Cost Optimization

Monitoring costs have a way of creeping up on you. Cardinality explosions, excessive retention, and over-scraping are the usual suspects.

Controlling Cardinality

Every unique combination of label values creates a new time series. High-cardinality labels like user IDs or request IDs quickly exhaust Prometheus memory.

# Check cardinality per metric
topk(10,
  count by (__name__) (
    {__name__=~".+"}
  )
)

# Identify high-cardinality label combinations
count by (job, __name__) (
  {__name__="http_requests_total"}
)

Practical limits per metric:

  • Keep unique label combinations under 10,000 per metric
  • Avoid labels with unbounded values (user_id, session_id, request_id)
  • Use recording rules to pre-aggregate high-cardinality queries
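
A quick way to see why unbounded labels are dangerous is to multiply the number of distinct values per label. The sketch below uses made-up label counts; the product is a worst-case upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod

SERIES_LIMIT = 10_000  # per-metric guideline from the list above


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case series count: the product of distinct values per label."""
    return prod(label_values.values())


bounded = {"method": 5, "status": 8, "endpoint": 40}
with_user_id = {**bounded, "user_id": 50_000}

for name, labels in [("bounded labels", bounded), ("with user_id", with_user_id)]:
    n = series_count(labels)
    verdict = "OK" if n <= SERIES_LIMIT else "over limit"
    print(f"{name}: {n:,} series ({verdict})")
```

Adding a single label with 50,000 values turns a metric with 1,600 series into one with up to 80 million.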

Retention and Downsampling

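Long retention periods multiply storage costs. Prometheus does not downsample stored data on its own, so the usual approach is to pre-aggregate with recording rules and keep only those rollups long-term (for example via federation or remote storage), preserving historical patterns at lower resolution.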

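# prometheus.yml
# rule_files is a top-level setting; it cannot be nested inside a scrape config
rule_files:
  - /etc/prometheus/rules/*.yml
  - /etc/prometheus/rollups.yml

scrape_configs:
  - job_name: "federate"
    metrics_path: /federate
    honor_labels: true # keep the original job/instance labels of federated series
    params:
      match[]:
        - '{__name__=~"job:.+"}'

# rollups.yml - pre-computed aggregations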
groups:
  - name: hourly_rollups
    interval: 1h
    rules:
      - record: job:http_requests_total:hourly_rate
        expr: |
          sum by (job, status) (
            rate(http_requests_total[1h])
          )

Retention guidelines:

  • Raw data: 15-30 days (sufficient for most alerting)
  • 1-hour resolution: 90 days (for capacity planning)
  • 1-day resolution: 1-2 years (for business reporting)

Scrape Interval Tuning

Frequent scrapes improve resolution but increase resource usage. Match scrape intervals to alert requirements.

| Use Case | Scrape Interval | Notes |
| --- | --- | --- |
| Real-time alerting | 15s | Use for critical P1 metrics only |
| Standard metrics | 60s | Sufficient for most use cases |
| Slow-changing metrics | 5m | Disk usage, job queue depths |
| Historical analysis | 15m+ | Roll up to reduce storage |
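
To get a feel for how the scrape interval drives storage, the following back-of-the-envelope sketch estimates sample volume and disk usage. The default of 2 bytes per sample is an assumption (a rough compressed average); measure your own TSDB before relying on it.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(series: int, scrape_interval_s: int) -> int:
    """Samples written per day for a given number of active series."""
    return series * (SECONDS_PER_DAY // scrape_interval_s)


def storage_gib(series: int, scrape_interval_s: int, retention_days: int,
                bytes_per_sample: float = 2.0) -> float:
    """Rough on-disk size; bytes_per_sample is an assumed compressed average."""
    total_bytes = samples_per_day(series, scrape_interval_s) * retention_days * bytes_per_sample
    return total_bytes / 2**30


# 100,000 active series kept for 30 days at each interval from the table
for interval in (15, 60, 300, 900):
    print(f"{interval:>3}s interval: {storage_gib(100_000, interval, 30):.2f} GiB")
```

Moving a metric from a 15s to a 60s interval cuts its sample volume to a quarter, which is why the shortest interval is best reserved for metrics that page someone.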

Multi-Region Monitoring Strategies

Running monitoring across regions means dealing with network latency, aggregation headaches, and the question of whether to set SLOs per region or globally.

Federation Architecture

Federation allows hierarchical collection: regional Prometheus servers scrape local targets, then a global Prometheus server scrapes the regional instances.

# Regional Prometheus (scrapes local targets)
scrape_configs:
  - job_name: "local-services"
    static_configs:
      - targets: ["app-us-east-1:9090", "db-us-east-1:9090"]

# Global Prometheus (federates regional)
scrape_configs:
  - job_name: "federate-us-east-1"
    metrics_path: /federate
    params:
      match[]:
        - '{__name__=~"job:.+"}'
    static_configs:
      - targets: ["prometheus-us-east-1:9090"]

Considerations:

  • Network latency affects metric freshness; use longer evaluation intervals for federated data
  • Not all metrics need global aggregation; filter with match[] rules
  • Cross-region bandwidth costs money; only federate what you need

Regional vs Global SLOs

A user in Frankfurt cares about Frankfurt availability, not your global average. Global SLOs can hide regional outages entirely.

| Approach | Pros | Cons |
| --- | --- | --- |
| Global SLO only | Simpler reporting | Masks regional issues |
| Regional SLOs | Reflects user experience | More dashboards to manage |
| Both | Accurate + simple reporting | Requires good tooling |

For global services, set regional SLOs per region plus a composite global SLO. Alert on regional SLOs to catch degradation early.
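
The masking effect is easy to reproduce with numbers. In this sketch the request counts are invented for illustration: a small region runs at 95% availability while the traffic-weighted global figure still clears a 99.9% target.

```python
SLO_TARGET = 0.999

# Hypothetical 30-day request counts per region: (total, failed)
REGION_COUNTS = {
    "us-east-1": (9_500_000, 950),
    "eu-central-1": (100_000, 5_000),   # Frankfurt is having a bad month
    "ap-southeast-1": (400_000, 40),
}


def availability(total: int, failed: int) -> float:
    return 1 - failed / total


def global_availability(counts: dict[str, tuple[int, int]]) -> float:
    """Traffic-weighted availability across all regions."""
    total = sum(t for t, _ in counts.values())
    failed = sum(f for _, f in counts.values())
    return availability(total, failed)


def breaching_regions(counts: dict[str, tuple[int, int]], target: float) -> list[str]:
    return [r for r, (t, f) in counts.items() if availability(t, f) < target]


print(f"Global: {global_availability(REGION_COUNTS):.4%}")
print(f"Regions below target: {breaching_regions(REGION_COUNTS, SLO_TARGET)}")
```

Because the global number is weighted by traffic, low-traffic regions can fail badly without moving it, which is the argument for alerting on regional SLOs.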

Synthetic Monitoring from Multiple Locations

Real-user monitoring and synthetic checks from multiple geographic points catch regional issues that internal metrics miss.

import time

import requests

# Run synthetic checks from multiple regions
REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]

# NOTE: `metrics` is assumed to be your metrics client (for example a StatsD
# wrapper exposing gauge(name, value)); it is not defined in this snippet.

def synthetic_check_region(region):
    start = time.monotonic()
    try:
        resp = requests.get(
            "https://api.example.com/health",
            timeout=5,
            headers={"X-Region": region},
        )
        duration = time.monotonic() - start
        # Report per-region metrics
        metrics.gauge(f"health_check_{region}_duration", duration)
        metrics.gauge(f"health_check_{region}_success", 1 if resp.status_code == 200 else 0)
    except requests.RequestException:
        # Timeouts and connection errors count as failed checks
        metrics.gauge(f"health_check_{region}_success", 0)

Route alerts based on both regional synthetic checks and internal metrics. A regional synthetic failure plus elevated internal latency in the same region strongly indicates a regional outage.

Service Level Agreements

SLAs are contractual commitments to customers, often backed by financial penalties. They are usually less aggressive than SLOs.

SLA vs SLO

| Aspect | SLA | SLO |
| --- | --- | --- |
| Audience | Customers | Internal |
| Enforced by | Contracts, penalties | Team discipline |
| Target | Usually less strict | Usually more strict |
| Consequences | Financial | Operational |

Don’t set SLAs until you have SLOs you are confident you can meet. Adding contractual SLAs before you understand your system’s behavior is asking for trouble.

Building Dashboards

Dashboards translate metrics into actionable information. Good dashboards answer specific questions. Bad dashboards show everything and answer nothing.

Dashboard Design Principles

Start with the user and their questions. A dashboard for on-call engineers answering “is my service healthy?” looks different from an executive dashboard showing business metrics.

Group related metrics. Use rows and panels to organize information logically.

Include context. Raw numbers without comparison are hard to interpret. Show current value versus target, versus last week, or versus baseline.

Minimize chart junk. Every element should convey information. Remove gridlines, legends, and labels that don’t add value.

Example Dashboard Panels

# Grafana dashboard JSON (abbreviated)
{
  "dashboard": {
    "title": "API Gateway Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service='api-gateway'}[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service='api-gateway',status=~'5..'}[5m])) / sum(rate(http_requests_total{service='api-gateway'}[5m])) * 100",
            "legendFormat": "Error %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.1, "color": "yellow" },
                { "value": 1, "color": "red" }
              ]
            }
          }
        },
        "gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 }
      },
      {
        "title": "P95 Latency",
        "type": "gauge",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service='api-gateway'}[5m])) by (le)) * 1000",
            "legendFormat": "p95 ms"
          }
        ],
        "gridPos": { "x": 18, "y": 0, "w": 6, "h": 4 }
      }
    ]
  }
}

Essential Dashboard Sections

At a Glance: Key health indicators in a single row with green/yellow/red status for critical metrics.

Request Metrics: Rate, error rate, and latency percentiles for your main endpoints.

Infrastructure Metrics: CPU, memory, disk, and network for your hosts.

Application Metrics: Business-specific metrics like queue depths, cache hit rates, or background job counts.

Dependency Health: Metrics for databases, caches, and external services your application depends on.

Alert Design

Alerts should wake someone up only when they need to act. Too many alerts cause fatigue; too few cause outages.

Alert Severity Levels

| Severity | Response Time | Examples |
| --- | --- | --- |
| P1 Critical | Minutes, 24/7 | Complete outage, data loss, security breach |
| P2 High | 30 minutes | Degraded performance affecting many users |
| P3 Medium | Business hours | Minor degradation, non-critical failures |
| P4 Low | Next sprint | Predictable issues, capacity planning |

Alert Quality Checklist

Before creating an alert, ask:

  • Does this indicate a real problem affecting users?
  • Is the root cause something we can fix?
  • Is the alert actionable? Can the recipient do something about it?
  • Is the alert specific? Does it point toward the likely cause?
  • Is the threshold calibrated? Are we alerting on symptoms or causes?

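Alerting on causes leads to noise, because many internal anomalies never reach users. Alerting on symptoms (error rate, latency, and availability as users experience them) keeps pages tied to real impact; cause-level metrics still belong on dashboards, where they speed up diagnosis.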

Example Alert Rules

# Prometheus alerting rules
groups:
  - name: api-gateway
    rules:
      # High error rate
      - alert: APIGatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api-gateway"}[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API Gateway error rate above 1%"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      # Latency degradation
      - alert: APIGatewayHighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API Gateway P95 latency above 1 second"
          description: "P95 latency is {{ $value | humanizeDuration }}"

      # Slow queries
      - alert: DatabaseSlowQueries
        expr: |
          rate(django_dbqueries_total{type="slow"}[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 10 slow queries per second"
          description: "Database is experiencing query performance issues"

Alert Routing

Route alerts to the right people based on severity and service:

# Alertmanager configuration
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    - match:
        service: database
      receiver: database-oncall
    - match:
        severity: warning
      receiver: slack-warnings
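    - match:
        severity: info
      receiver: none # No integrations configured: notifications are dropped

receivers:
  - name: default
  - name: none
  - name: pagerduty
    pagerduty_configs:
      - routing_key: xxx # Events API v2 integration key
        severity: critical
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts-warning" # needs slack_api_url in the global section
  - name: database-oncall
    pagerduty_configs:
      - routing_key: yyy # set routing_key (v2) or service_key (v1), not both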
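Route order and `continue` decide who actually gets notified, which is easy to misread in YAML. Here is a rough Python sketch of the matching logic for a one-level routing tree like the one above. `route_alert` and `ROUTES` are illustrative names, and the sketch ignores nested routes, regex matchers, and grouping.

```python
def route_alert(labels, routes, default_receiver="default"):
    """Return the receivers an alert reaches in a one-level routing tree."""
    receivers = []
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless the route sets continue: true
    # Alerts that match no child route fall back to the root receiver
    return receivers or [default_receiver]


ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty", "continue": True},
    {"match": {"service": "database"}, "receiver": "database-oncall"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    {"match": {"severity": "info"}, "receiver": "none"},
]

critical_db = route_alert({"severity": "critical", "service": "database"}, ROUTES)
warning_db = route_alert({"severity": "warning", "service": "database"}, ROUTES)
```

Note the second case: a database warning stops at the `database-oncall` route and never reaches Slack. If that is not what you want, reorder the routes or add `continue: true`.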

Blackbox Monitoring

Blackbox monitoring tests your service from the outside, independent of application metrics. It catches failures that internal metrics miss.

Prometheus Blackbox Exporter

# blackbox.yml
modules:
  http_2xx:
    prober: http
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      method: GET
      fail_if_ssl: false

# Prometheus scrape config
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
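The three relabeling steps are terse, so here is a small Python sketch of what they do to a single target. It only mimics the label shuffling; `probe_url` is an illustrative helper, not part of Prometheus, and the `http` scheme is assumed because the scrape config does not set one.

```python
from urllib.parse import urlencode


def probe_url(target, module="http_2xx", exporter="blackbox-exporter:9115"):
    """Return (scrape URL, instance label) after the relabeling steps."""
    labels = {"__address__": target}
    # 1. Copy the configured target into the `target` URL parameter
    labels["__param_target"] = labels["__address__"]
    # 2. Keep the probed URL as the `instance` label on the resulting series
    labels["instance"] = labels["__param_target"]
    # 3. Send the actual scrape to the exporter instead of the target
    labels["__address__"] = exporter
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}", labels["instance"]


url, instance = probe_url("https://api.example.com/health")
```

Prometheus never contacts the target directly. It asks the exporter to run the probe, while the `instance` label still shows the URL being probed.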

Synthetic Transactions

Run synthetic transactions regularly to catch degradation before users notice:

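import time

import requests

BASE_URL = 'https://api.example.com'


def synthetic_checkout(metrics):
    """Run one synthetic checkout and report its duration and outcome.

    `metrics` is your metrics client (StatsD, Prometheus wrapper, etc.);
    it only needs histogram() and gauge() methods.
    """
    start = time.monotonic()
    success = 0
    try:
        # 1. Create cart
        cart_resp = requests.post(f'{BASE_URL}/carts', timeout=5)
        cart_resp.raise_for_status()
        cart_id = cart_resp.json()['id']

        # 2. Add item
        item_resp = requests.post(f'{BASE_URL}/carts/{cart_id}/items',
                                  json={'product_id': 'PROD123', 'quantity': 1}, timeout=5)
        item_resp.raise_for_status()

        # 3. Checkout
        checkout_resp = requests.post(f'{BASE_URL}/carts/{cart_id}/checkout',
                                      timeout=10)
        success = 1 if checkout_resp.status_code == 200 else 0
    except (requests.RequestException, KeyError, ValueError):
        # A failure at any step counts as a failed transaction
        success = 0

    duration = time.monotonic() - start

    # Report metrics even when a step fails
    metrics.histogram('synthetic_checkout_duration', duration)
    metrics.gauge('synthetic_checkout_success', success)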

Capacity Planning

Monitoring helps predict when you need more capacity.

Monitor usage trends over weeks and months:

# Weekly growth rate
sum(rate(http_requests_total[7d])) / sum(rate(http_requests_total[7d] offset 7d)) - 1

# Projected CPU utilization (0-1) in 30 days, extrapolating the last 7 days linearly
predict_linear((1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))[7d:1h], 30 * 24 * 3600)
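predict_linear fits a straight line through recent samples and extends it forward. To turn that into "days until we hit a limit", you can do the same fit offline. This is a minimal Python sketch; the sample values and the 80% limit are made up, and it assumes at least two samples with distinct timestamps.

```python
def linear_fit(points):
    """Least-squares slope and intercept for (t_seconds, value) samples."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    return slope, mean_v - slope * mean_t


def days_until(points, limit):
    """Days after the last sample until the fitted line reaches `limit`."""
    slope, intercept = linear_fit(points)
    if slope <= 0:
        return None  # flat or shrinking usage never reaches the limit
    t_hit = (limit - intercept) / slope
    return max(0.0, (t_hit - points[-1][0]) / 86400)


# Hypothetical daily CPU utilization samples growing 2 points per day
samples = [(day * 86400, 0.50 + 0.02 * day) for day in range(7)]
remaining = days_until(samples, limit=0.80)
```

A linear fit is only as good as the trend is linear. Weekly seasonality or a recent launch can skew the slope, so treat the result as a planning hint rather than a deadline.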

Scaling Thresholds

# KEDA scaled object for event-driven scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-gateway-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-gateway
  minReplicaCount: 3
  maxReplicaCount: 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_per_second
        threshold: "1000"
        query: sum(rate(http_requests_total{service="api-gateway"}[2m]))
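To sanity-check a threshold before deploying it, it helps to work out the replica count it implies at different traffic levels. The sketch below assumes the simple "total metric divided by per-replica threshold" model; the real autoscaler also applies tolerances and stabilization windows, so take it as an approximation. `desired_replicas` is an illustrative helper, with defaults mirroring minReplicaCount and maxReplicaCount above.

```python
import math


def desired_replicas(metric_value, threshold, min_replicas=3, max_replicas=100):
    """Replicas needed so each one handles at most `threshold` units of the metric."""
    wanted = math.ceil(metric_value / threshold)
    # Clamp to the configured bounds
    return max(min_replicas, min(max_replicas, wanted))


normal = desired_replicas(4500, 1000)     # 4,500 req/s at 1,000 req/s per replica
quiet = desired_replicas(1200, 1000)      # below the floor, stays at the minimum
spike = desired_replicas(250_000, 1000)   # above the ceiling, capped at the maximum
```

If the spike case hits the cap at traffic levels you expect to see, raise maxReplicaCount or revisit the per-replica threshold.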

When to Use Metrics-Based Monitoring

When to Use Metrics-Based Monitoring:

  • Real-time system health monitoring and alerting
  • Capacity planning and trend analysis
  • SLO/SLA tracking and error budget management
  • Correlation with business metrics (revenue, conversions)
  • Long-term historical analysis and reporting
  • Infrastructure and application layer monitoring

When Not to Use Metrics-Based Monitoring:

  • Debugging specific request failures (use logs)
  • Understanding request flow across services (use tracing)
  • One-off troubleshooting of transient issues
  • Monitoring low-volume events that do not aggregate well
  • Situations where you need the full context of a single operation

Trade-off Analysis

| Aspect | Metrics Monitoring | Log Analysis | Distributed Tracing |
| --- | --- | --- | --- |
| Latency Detection | Aggregate patterns | Per-request detail | Request-level timing |
| Root Cause | Correlation difficult | Full context available | Causal chain is clear |
| Storage Cost | Low (aggregated) | High (raw events) | Medium (spans) |
| Query Flexibility | High (PromQL/SQL) | Medium (text search) | Low (structured) |
| Historical Analysis | Excellent | Limited by cost | Poor |
| Real-time Alerting | Excellent | Poor | Good |
| Debug Single Request | Poor | Excellent | Excellent |
| Capacity Planning | Excellent | Poor | Poor |

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Prometheus scrape failure | Gaps in metrics; missing visibility | Configure remote_write to backup; alert on target down; use Prometheus federation |
| Alert fatigue | Teams ignore alerts; real issues missed | Regularly review alert quality; remove stale alerts; tune thresholds |
| Metric cardinality explosion | Prometheus OOM; query performance degraded | Limit label cardinality; use recording rules; segment by service |
| Alert routing failure | Critical alerts not delivered; extended outages | Test alert routing; use multiple notification channels; on-call rotation |
| Dashboard data source outage | No visibility for teams; blind decision making | Configure backup data sources; cache dashboard definitions |
| SLO target chronically missed | Customer trust erosion; potential SLA penalties | Investigate root causes; adjust SLO targets if unrealistic; build error budget alerts |

Common Pitfalls / Anti-Patterns

1. Alerting on Symptoms Without Context

# Bad: raw threshold with no context
- alert: HighCPU
  expr: cpu_usage > 80
  # Fires constantly without context

# Good: Alert with actionable context
- alert: HighCPUOnCriticalService
  expr: |
    avg by (service) (cpu_usage{service="api-gateway"}) > 80
    and on (service)
    sum by (service) (rate(http_requests_total{service="api-gateway"}[5m])) > 1000
  # Only fires when it matters; "on (service)" is needed because the two metrics carry different labels

2. Missing Alert Aggregation

Firing 1000 alerts for one failure causes chaos. Use grouping, inhibition, and routing to consolidate alerts:

route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
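To see why `group_by` matters, here is a small Python sketch of the grouping step: alerts that share the same values for the grouping labels end up in one notification. The alert labels are invented for the example and `group_alerts` is an illustrative helper.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "cluster", "service")):
    """Bucket alert label sets by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# 1000 pods failing the same way, plus one unrelated alert
alerts = [
    {"alertname": "HighErrorRate", "cluster": "eu-1",
     "service": "api-gateway", "instance": f"pod-{i}"}
    for i in range(1000)
]
alerts.append({"alertname": "DiskFull", "cluster": "eu-1",
               "service": "database", "instance": "db-0"})

groups = group_alerts(alerts)
```

The 1001 alerts collapse into two notifications. Adding `instance` to `group_by` would undo that, so only group by labels you want to be notified about separately.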

3. SLOs That Do Not Reflect User Experience

Setting SLOs on backend metrics instead of user-visible metrics creates false confidence. SLOs should measure what users experience, not internal implementation details.

4. Dashboard Overload

More panels do not mean better monitoring. Each dashboard should answer specific questions. If you need to scroll, you have too many panels.

5. No Alert Testing

Alerts that never fire in production may have broken queries. Test alerts periodically by simulating conditions that should trigger them.

6. Ignoring Alert Fatigue

If engineers start ignoring alerts, real problems get missed. Review and tune alerts regularly. Remove alerts that no longer serve a purpose.

Observability Checklist

Core Metrics (Golden Signals)

  • Latency: p50, p95, p99 response times
  • Traffic: requests per second, throughput
  • Errors: error rate (4xx, 5xx), success rate
  • Saturation: CPU, memory, disk, queue depths

Infrastructure Metrics

  • Host-level: CPU utilization, memory usage, disk I/O, network throughput
  • Container-level: resource limits, restart counts, OOM kills
  • Kubernetes: pod status, node conditions, namespace quotas

Application Metrics (RED/USE)

  • Rate: requests per second by endpoint, method, status
  • Errors: error counts and rates by type, service, endpoint
  • Duration: histogram of request latencies with percentiles
  • Utilization: resource usage, queue lengths, connection pool stats
  • Saturation: backpressure indicators, throttling events

SLO/SLA Metrics

  • Availability SLI with 30-day rolling window
  • Latency SLI (p95 under 200ms target)
  • Error rate SLI (below 0.1% target)
  • Error budget tracking and burn rate alerts

Alerting Rules

  • P1: Service down or availability SLA breach
  • P2: High error rate (>1% for 5 minutes)
  • P2: Latency degradation (p95 >500ms)
  • P3: Resource utilization >80%
  • P4: Error budget 50% consumed within the 30-day window

Security Checklist

  • Metrics endpoint (/metrics) not exposed publicly
  • Prometheus access authenticated (if external)
  • Alertmanager notifications do not contain sensitive data
  • Dashboard links to internal systems use authenticated proxies
  • No API keys or secrets visible in dashboard URLs
  • Alert routing logs audited
  • Metrics cardinality limits enforced to prevent DoS
  • Remote write connections use TLS
  • Dashboard snapshots sanitized before sharing

Real-world Failure Scenarios

Scenario 1: Alert Storm from Misconfigured Threshold

What happened: A dashboard team deployed a new dashboard with an incorrectly configured alert rule. The alert threshold was set to trigger on any request above 0ms latency instead of the intended threshold of above 1000ms. Within minutes, the alerting system was flooded with thousands of alerts.

Root cause: A copy-paste error dropped the leading digits of the threshold, turning 1000 into 0. No unit tests or dry-run validation existed for alert expressions.

Impact: The on-call engineer received over 3,000 notifications in 5 minutes. The alerting system’s queue became saturated, delaying genuine alerts by over 10 minutes.

Lesson learned: Always dry-run alert queries against production data before activating. Set a maximum alert rate limit per team. Implement alert expression validation in CI/CD.
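A dry run does not need to be elaborate. The sketch below replays a threshold against historical samples and rejects it if it would have fired too often. The 10% ceiling is an arbitrary policy picked for illustration, and `dry_run` is a hypothetical helper rather than an existing tool.

```python
def dry_run(samples, threshold, max_firing_ratio=0.10):
    """Report how often `value > threshold` would have held on past samples."""
    firing = sum(1 for value in samples if value > threshold)
    ratio = firing / len(samples)
    return {"firing_ratio": ratio, "ok": ratio <= max_firing_ratio}


# Hypothetical latency samples in milliseconds: mostly healthy, one slow request
latencies_ms = [150] * 19 + [1300]

intended = dry_run(latencies_ms, threshold=1000)  # the threshold that was meant
broken = dry_run(latencies_ms, threshold=0)       # the threshold that was deployed
```

Run as a CI step, a check like this would have flagged the 0ms threshold, which matches every sample, before it reached the on-call engineer.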

Scenario 2: Metric Gaps from Prometheus Federation Outage

What happened: A network partition between two data centers caused Prometheus federation to fail. Global dashboards showed no data for 45 minutes while regional dashboards continued to function normally.

Root cause: The federation setup had no redundancy. A single network path connected the global Prometheus to regional instances.

Impact: Executive dashboards were blank during a live incident review. The operations team had to manually aggregate regional dashboard data to report status.

Lesson learned: Implement dual redundant federation paths. Alert on federation scrape health. Use remote_write to a central storage backend (Thanos or Cortex) for global query reliability.

Interview Questions

These questions assess practical knowledge of metrics, monitoring, and alerting systems.

1. Explain the difference between RED and USE methods. When would you use each?

Expected answer points:

  • RED (Rate, Errors, Duration) focuses on request-driven services like APIs; USE (Utilization, Saturation, Errors) focuses on resource utilization
  • Use RED for services where you measure request/response patterns
  • Use USE for infrastructure resources like CPU, memory, disk
  • Most systems need both: USE for underlying resources, RED for user-facing services
2. What is an error budget, and how does it inform alerting strategy?

Expected answer points:

  • Error budget is the allowed amount of downtime/errors within an SLO window (e.g., 99.9% over 30 days = 43.2 minutes)
  • Burn rate alerts trigger when errors consume budget faster than sustainable
  • Multi-window alerting catches both fast burns (1h window, 14.4x rate) and slow leaks (3-day window, 1x rate)
  • Large remaining budget allows aggressive deployments; exhausted budget requires stability focus
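The numbers in these answers come from simple arithmetic, sketched below in Python. The budget fractions (2% in 1 hour, 5% in 6 hours, 10% in 3 days) are commonly cited starting points, assumed here for illustration; the function names are our own.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime an availability SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(budget_fraction, alert_window_hours, window_days=30):
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * window_days * 24 / alert_window_hours


budget = error_budget_minutes(0.999)   # 99.9% over 30 days
fast = burn_rate(0.02, 1)              # 2% of the budget in 1 hour
medium = burn_rate(0.05, 6)            # 5% of the budget in 6 hours
slow = burn_rate(0.10, 72)             # 10% of the budget in 3 days
```

A burn rate of 1 means the budget runs out exactly at the end of the window. Anything above 1 exhausts it early.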
3. What are the Four Golden Signals, and why are they important?

Expected answer points:

  • Latency: how long operations take (distinguish slow from failed)
  • Traffic: demand on the system (requests per second)
  • Errors: rate of failures (distinguish 4xx from 5xx)
  • Saturation: how full the system is (CPU, memory, queue depth)
  • Google SRE book defines these as minimum signals every service should monitor
4. How would you design a multi-window burn-rate alerting system?

Expected answer points:

  • Configure multiple time windows: 1h (fast burn, page immediately), 6h (medium burn, warning), 3d (slow leak, investigate)
  • Use burn rate multipliers: 14.4x for 1h, 6x for 6h, 1x for 3d
  • Pair each long window with a short one (for example 1h with 5m) using AND so the alert clears quickly after recovery, then combine the pairs with OR
  • Set severity based on burn rate: page for fast burn, warning for medium/slow
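As a rough illustration of the design above, the Python sketch below evaluates three window pairs against a 99.9% SLO. The short windows (5m, 30m, 6h) and the thresholds are assumptions based on commonly used values, and `evaluate` is a hypothetical helper operating on precomputed error ratios.

```python
SLO = 0.999
BUDGET = 1 - SLO  # allowed error ratio

# (long window, short window, burn-rate threshold, action)
POLICIES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "warning"),
    ("3d", "6h", 1.0, "ticket"),
]


def evaluate(error_ratios):
    """Return the first action whose window pair is burning too fast, else None.

    `error_ratios` maps a window name to the error ratio observed over it.
    """
    for long_w, short_w, factor, action in POLICIES:
        limit = factor * BUDGET
        # The long window shows the burn is significant;
        # the short window shows it is still happening.
        if error_ratios[long_w] > limit and error_ratios[short_w] > limit:
            return action
    return None


fast_burn = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
recovered = {"5m": 0.0, "30m": 0.001, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
slow_leak = {"5m": 0.002, "30m": 0.002, "1h": 0.002, "6h": 0.002, "3d": 0.0015}
```

The `recovered` case shows why the short window is worth having: the 1-hour ratio is still high, but the last 5 minutes are clean, so nobody gets paged for an incident that is already over.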
5. What is the difference between SLIs, SLOs, and SLAs?

Expected answer points:

  • SLI (Service Level Indicator): the actual metric you measure (e.g., error rate percentage)
  • SLO (Service Level Objective): the target you commit to internally (e.g., 99.9% availability)
  • SLA (Service Level Agreement): contractual commitment to customers, often with penalties
  • SLOs are usually tighter than SLAs since internal targets should exceed customer commitments
6. What metrics would you collect to monitor a PostgreSQL database?

Expected answer points:

  • Connection pool usage (active connections / max connections)
  • Query performance (slow query rate, p95/p99 latency per query type)
  • Replication lag for read replicas
  • Buffer cache hit ratio and memory pressure
  • Disk I/O (WAL writes, table scans, index usage)
  • Locking and contention metrics
  • Database size and table bloat
7. How do you avoid alert fatigue while ensuring critical issues are paged?

Expected answer points:

  • Every alert must be actionable: someone must know what to do when paged
  • Alert on causes, not symptoms (e.g., database connection pool exhausted, not just "high latency")
  • Use severity levels appropriately: P1 for page-worthy, P2/P3 for warning channels
  • Regularly review alert quality: remove stale alerts, tune thresholds
  • Use alert grouping and inhibition to consolidate related alerts
8. What is blackbox monitoring, and when would you use it over whitebox?

Expected answer points:

  • Blackbox monitoring tests services from outside (synthetic transactions, endpoint health checks)
  • Whitebox monitoring collects internal metrics (application logs, infrastructure metrics)
  • Blackbox catches failures that internal metrics miss (network partitions, DNS issues)
  • Use both: blackbox for user-experience validation, whitebox for root cause analysis
  • Blackbox exporter with Prometheus probes endpoints for availability and response correctness
9. How would you implement capacity planning using metrics?

Expected answer points:

  • Track usage trends over weeks and months using rate comparisons (7d vs 7d offset)
  • Use predict_linear() in PromQL to forecast when resources will saturate
  • Monitor scaling thresholds with tools like KEDA that scale based on Prometheus metrics
  • Set up alerts when utilization crosses 70-80% to allow lead time for scaling
  • Consider both horizontal (add replicas) and vertical (bigger instances) scaling paths
10. What considerations are unique to multi-region monitoring deployments?

Expected answer points:

  • Data aggregation strategy: federate regional Prometheus instances to central global view
  • Network latency between regions affects metric freshness; account for scrape intervals
  • Separate SLOs per region vs global SLOs (users care about their region)
  • Alert routing must account for on-call schedules across timezones
  • Remote write costs increase with cross-region bandwidth; compress and batch appropriately
  • Synthetic monitoring from multiple geographic locations catches regional degradation
11. How would you decide on appropriate SLO targets for a new service?

Expected answer points:

  • Start with what users actually experience - not internal implementation details
  • Analyze historical data: availability, latency distributions from existing monitoring
  • Consider business context: revenue impact of downtime, user expectations
  • Set targets tighter than SLAs (if SLAs exist) to give buffer for error budget
  • Iterate and adjust: SLOs should be achievable - unrealistic SLOs create false confidence
  • Align with similar services in your stack for consistency
12. Explain how you would monitor a microservices architecture. What challenges arise that do not exist in monolithic systems?

Expected answer points:

  • Need distributed tracing to understand request flow across services
  • Service mesh metrics (Envoy, Istio) provide sidecar proxy telemetry
  • Challenge: determining which service is the root cause when latency propagates
  • Challenge: network latency between services adds to end-to-end latency
  • Use trace correlation IDs to link requests across service boundaries
  • RED/USE methods apply per service, but need aggregation across service dependencies
  • Container orchestration metrics (Kubernetes) add another monitoring layer
13. What is the difference between Push and Pull metrics collection? When would you choose each?

Expected answer points:

  • Pull model (Prometheus): scraping occurs from a central collector hitting target endpoints
  • Push model (StatsD, Wavefront): agents send metrics to a central collector
  • Pull advantages: easier to validate targets, no push agents to deploy and manage, easier to detect unmonitored targets
  • Push advantages: works better for short-lived jobs, NAT traversal easier, clients control emission rate
  • Use pull for long-lived services that expose /metrics endpoints
  • Use push for batch jobs, serverless functions, or fire-and-forget workloads
14. How would you design a monitoring system to detect anomalies without relying on static thresholds?

Expected answer points:

  • Use adaptive thresholds based on historical patterns (seasonal variations, growth trends)
  • Implement anomaly detection algorithms: rolling standard deviation, exponential smoothing
  • Leverage machine learning models trained on normal behavior patterns
  • Use multivariate analysis to correlate multiple signals rather than single metrics
  • Start with simpler approaches: deviation from rolling average, rate of change
  • Combine with SLO burn-rate alerts as a complementary approach
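The "deviation from rolling average" approach is a good first step and fits in a few lines. This Python sketch flags points more than three standard deviations from the mean of the previous ten samples; the window size and threshold are illustrative defaults, not recommendations.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=10, z_threshold=3.0):
    """Yield (index, value) for points far from the rolling mean of prior samples."""
    history = deque(maxlen=window)
    for i, value in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            # Skip flat history to avoid dividing by zero
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                yield i, value
        history.append(value)


# Hypothetical requests-per-second series with one spike
rps = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 250, 101]
anomalies = list(detect_anomalies(rps))
```

One caveat: the spike itself enters the history and inflates the standard deviation, so follow-up anomalies are harder to detect for the next few samples. Excluding flagged points from the history is a common refinement.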
15. How do you handle monitoring for services that experience bursty traffic patterns?

Expected answer points:

  • Use longer aggregation windows to avoid false positives during idle periods
  • Set baseline detection that learns traffic patterns over time
  • Separate alert rules for burst windows vs normal operation windows
  • Use percentiles (p95, p99) instead of averages to handle outliers
  • Consider autoscaling correlation: if traffic spikes and latency degrades, alert on the combination
  • Alert on error rates and saturation more than absolute request rates
16. What are recording rules in Prometheus, and when would you use them?

Expected answer points:

  • Recording rules pre-compute frequently needed expressions and save them as new time series
  • Use to reduce query load: expensive aggregations run once, results queried repeatedly
  • Use to reduce cardinality: pre-aggregate high-cardinality data into lower cardinality
  • Essential for dashboard queries that run frequently (every 10-30s)
  • Use for metrics derived from multiple sources that are expensive to join at query time
  • Example: record `job:http_requests_total:rate5m` instead of computing `sum(rate(http_requests_total[5m])) by (job)` on every dashboard refresh
17. How would you structure alerting for a service that has both latency-sensitive and throughput-sensitive workloads?

Expected answer points:

  • Separate alert rules for latency SLIs vs throughput SLIs
  • Use different time windows: latency might need faster alerting (1-5 min), throughput might tolerate longer
  • Latency-sensitive path: alert on p95/p99 latency exceeding thresholds, not just averages
  • Throughput-sensitive path: alert when RPS drops below minimum threshold
  • Consider resource saturation (CPU, connections) as leading indicators for both
  • Alert severity might differ: latency breach page immediately, throughput drop warn first
18. Describe how you would implement meta-monitoring (monitoring the monitoring system itself).

Expected answer points:

  • Track Prometheus itself: up/down, TSDB head age, scrape success rates
  • Monitor Alertmanager: are alerts being received, are notifications being sent
  • Track metric gaps: watch the rate of `prometheus_tsdb_head_samples_appended_total` for ingestion drops and `prometheus_tsdb_compactions_failed_total` for compaction issues
  • Build an error budget for monitoring quality: if scrape failures exceed threshold, alert
  • Use federation cross-checks: compare data between federated Prometheus instances
  • External probing: use blackbox exporter to verify monitoring system endpoints
19. What is the relationship between DORA metrics and SLO-based alerting? How would you use them together?

Expected answer points:

  • DORA metrics track delivery performance: deployment frequency, lead time, change failure rate, MTTR
  • SLO burn-rate alerting provides early warning before reliability problems affect DORA metrics
  • Use SLO alerts to prevent change failure rate increases - catch degradation before it compounds
  • Error budget exhaustion correlates with change failure rate spike
  • Link deployment events to SLO impact: did this deploy cause an alert?
  • Use DORA metrics to measure if SLO improvements actually translate to better delivery outcomes
20. How would you approach setting up alerting for a third-party API dependency?

Expected answer points:

  • Blackbox monitoring: synthetic transactions hitting the third-party endpoint
  • Track latency percentiles from your service's perspective (not just availability)
  • Monitor error rates when calling third-party API (5xx, timeouts, connection failures)
  • Set circuit breaker thresholds: if third-party failure rate exceeds X%, route around it
  • Alert on dependency health rather than just availability - degraded performance matters
  • Use multi-location monitoring to distinguish between your issue and provider issue

Conclusion

Key Takeaways:

  • Start with the four golden signals: latency, traffic, errors, saturation
  • Define SLOs that reflect user experience, not internal metrics
  • Use error budgets to prioritize reliability work
  • Alert on symptoms with context, not raw numbers
  • Every alert should be actionable: someone should know what to do
  • Review and tune alerts regularly to avoid fatigue

Copy/Paste Checklist:

# Availability SLO query
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
)

# Error budget burning alert
(
  sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{service="api-gateway"}[1h]))
) > 14.4 * (1 - 0.999)  # 14.4x burn rate: 2% of a 30-day budget in 1 hour

# P95 latency with context
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
) > 1

# Dashboard variable template
sum(rate(http_requests_total{service=~"$service"}[$interval])) by (service)

# Alert routing with grouping
route:
  receiver: default
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true

Effective monitoring combines the right metrics, clear SLOs, thoughtful alerts, and actionable dashboards. Start with the four golden signals: latency, traffic, errors, and saturation. Define SLOs that reflect what users actually care about. Build alerts that fire only when someone needs to act.

For deeper observability into distributed systems, our Distributed Tracing guide covers request flow across services. The Prometheus & Grafana guide provides hands-on examples for implementing these patterns.
