Prometheus and Grafana: Metrics Collection and Visualization

Learn Prometheus metrics collection, PromQL querying, and Grafana dashboard creation. Complete guide to building observable systems with metrics.

published: March 22, 2026 reading time: 26 min read

Prometheus & Grafana: Metrics Collection and Visualization

Prometheus and Grafana form the backbone of modern monitoring stacks. Prometheus pulls metrics from services and stores them in time-series format, while Grafana visualizes the data and lets you build dashboards for real-time analysis.

This guide covers Prometheus architecture, metric types, PromQL, and Grafana dashboard construction. If you need background on monitoring philosophy, see our Metrics, Monitoring & Alerting guide first.

Prometheus Architecture

graph LR
    A[Services] -->|Pull Metrics| B[Prometheus Server]
    B --> C[Time Series DB]
    A -->|Push Metrics| D[Push Gateway]
    D --> B
    B --> E[Grafana]
    B --> F[Alertmanager]
    F --> G[Email/PagerDuty/Slack]

Prometheus uses a pull model by default: it scrapes targets at configured intervals. For short-lived jobs that cannot be scraped, the Push Gateway accepts pushed metrics.

Key Components

Prometheus Server: Pulls, stores, and queries metrics
Push Gateway: Receives metrics from short-lived batch jobs
Alertmanager: Handles alerting and notification routing
Exporters: Agents that expose metrics from third-party systems

Metric Types

Prometheus supports four fundamental metric types.

Counter

A cumulative metric that only increases. Use for request counts, error counts, or anything that resets at restart.

# Python client example
from prometheus_client import Counter

requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()

Gauge

A metric that can go up or down. Use for current values like memory usage, in-flight requests, or temperature.

from prometheus_client import Gauge

current_temperature = Gauge(
    'room_temperature_celsius',
    'Current temperature in Celsius'
)

current_temperature.set(22.5)
current_temperature.dec(0.5)

Histogram

Samples observations and counts them in configurable buckets. Use for request durations or response sizes.

from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)
)

with request_duration.labels(method='GET', endpoint='/api/users').time():
    # Handle request
    pass

The histogram automatically calculates quantiles and provides _sum and _count suffixes for total and count.

Summary

Similar to histogram but calculates quantiles on the client side. Use when you need exact percentiles and can accept increased cardinality.

from prometheus_client import Summary

request_latency = Summary(
    'http_request_latency_seconds',
    'HTTP request latency in seconds',
    ['method', 'endpoint']
)

with request_latency.labels(method='GET', endpoint='/api/users').time():
    # Handle request
    pass

Instrumenting Applications

Express.js with prom-client

const client = require("prom-client");
const express = require("express");

const register = new client.Registry();

register.setDefaultLabels({
  app: "api-gateway",
});

client.collectDefaultMetrics({ register });

const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "path", "status"],
  registers: [register],
});

const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register],
});

const app = express();

app.use((req, res, next) => {
  const start = Date.now();

  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000;
    const path = req.route ? req.route.path : req.path;

    httpRequestsTotal.inc({
      method: req.method,
      path: path,
      status: res.statusCode,
    });

    httpRequestDuration.observe(
      {
        method: req.method,
        path: path,
      },
      duration,
    );
  });

  next();
});

app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});

app.get("/api/users", (req, res) => {
  res.json([{ id: 1, name: "Alice" }]);
});

app.listen(3000);

Python with FastAPI

from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from fastapi import FastAPI, Request
from fastapi.responses import Response
import time

app = FastAPI()

http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0)
)

@app.middleware("http")
async def add_metrics(request: Request, call_next):
    start = time.time()

    response = await call_next(request)

    duration = time.time() - start
    path = request.url.path

    http_requests_total.labels(
        method=request.method,
        endpoint=path,
        status=response.status_code
    ).inc()

    http_request_duration.labels(
        method=request.method,
        endpoint=path
    ).observe(duration)

    return response

@app.get("/metrics")
def metrics():
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )

Prometheus Configuration

Scrape Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: production
    cluster: us-east-1

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "api-gateway"
    scrape_interval: 10s
    static_configs:
      - targets: ["api-gateway:3000"]
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: "api-gateway-1"

  - job_name: "kubernetes-apiservers"
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        action: keep
        regex: default;kubernetes

  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Recording Rules

Recording rules pre-compute frequently needed queries:

# /etc/prometheus/rules/recording.yml
groups:
  - name: api-gateway
    interval: 30s
    rules:
      - record: apigw:http_requests:rate5m
        expr: |
          sum(rate(http_requests_total{service="api-gateway"}[5m])) by (service)

      - record: apigw:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m])) by (service)

      - record: apigw:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
          )

  - name: slo-performance
    interval: 60s
    rules:
      - record: job:slo:availability:30d
        expr: |
          1 -
          sum(rate(http_requests_total{status=~"5.."}[30d]))
          /
          sum(rate(http_requests_total)[30d])

Alerting Rules

# /etc/prometheus/rules/alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api-gateway"}[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
          team: api
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value | humanizeDuration }}"

      - alert: InstanceDown
        expr: up{job="api-gateway"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"

PromQL

Prometheus Query Language lets you analyze time-series data.

Basic Queries

# All metrics starting with http
{__name__=~"http_.*"}

# Request rate per endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Aggregations

# Sum by label
sum(http_requests_total) by (service)

# Average
avg(http_request_duration_seconds) by (service)

# Percentile
histogram_quantile(0.99, http_request_duration_seconds_bucket)

# Top 10 by value
topk(10, http_requests_total)

Functions

# Rate of change (per second)
rate(http_requests_total[5m])

# Increase over time range
increase(http_requests_total[1h])

# Predict linear trend
predict_linear(node_memory_MemFree_bytes[10m], 3600)

# Timestamp of last sample
timestamp(http_requests_total)

Subqueries

# Rate with nested aggregation
max_over_time(
  rate(http_requests_total[5m])[15m:1m]
)

# Combined functions
min_over_time(
  histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))[30m:]
)

Grafana Dashboard Construction

Grafana connects to Prometheus and provides visualization.

Panel Types

Type	Use Case
Graph	Time series visualization
Stat	Single big number
Gauge	Numeric with thresholds
Table	Multiple metrics and dimensions
Pie chart	Proportional distribution
Heatmap	Density visualization

Graph Panel Configuration

{
  "panel": {
    "title": "Request Rate",
    "type": "graph",
    "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
    "targets": [
      {
        "expr": "sum(rate(http_requests_total[5m])) by (service)",
        "legendFormat": "{{service}}",
        "refId": "A"
      }
    ],
    "yaxes": [
      {
        "format": "reqps",
        "label": "Requests/s"
      }
    ],
    "xaxis": {
      "mode": "time"
    },
    "seriesOverrides": [],
    "fieldConfig": {
      "defaults": {
        "custom": {
          "drawStyle": "line",
          "lineWidth": 2,
          "fillOpacity": 10,
          "gradientMode": "none"
        },
        "mappings": [],
        "thresholds": {
          "mode": "absolute",
          "steps": [{ "value": 0, "color": "green" }]
        }
      }
    }
  }
}

Variables and Templating

Dashboard variables make dashboards reusable:

{
  "templating": {
    "list": [
      {
        "name": "service",
        "type": "query",
        "query": "label_values(http_requests_total, service)",
        "multi": true
      },
      {
        "name": "interval",
        "type": "interval",
        "query": "1m,5m,15m,30m,1h",
        "multi": true
      }
    ]
  }
}

Use variables in queries:

sum(rate(http_requests_total{service=~"$service"}[$interval])) by (service)

Annotations

Annotations mark events on dashboards:

{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "query": "{job="deployments"}",
        "iconColor": "rgba(255, 96, 96, 1)"
      }
    ]
  }
}

Alerting Rules in Grafana

{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [0.01],
          "type": "gt"
        },
        "operator": {
          "type": "and"
        },
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": {
          "type": "avg"
        }
      }
    ],
    "frequency": "1m",
    "noDataState": "no_data",
    "exec_err_state": "alerting",
    "message": "Error rate is above 1% for 5 minutes"
  }
}

Alertmanager Configuration

Alertmanager handles routing alerts to notification channels.

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alerts@example.com"
  smtp_auth_username: "alerts"
  smtp_auth_password: "${SMTP_PASSWORD}"

route:
  receiver: "default"
  group_by: ["alertname", "cluster", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true
    - match:
        severity: warning
      receiver: "slack"
      continue: true
    - match:
        team: "database"
      receiver: "database-oncall"

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts-general"
        send_resolved: true
        title: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
        text: |
          {{ range .Alerts }}
          **{{ .Labels.alertname }}**
          {{ .Annotations.description }}
          {{ end }}

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_KEY}"
        severity: "{{ .Labels.severity }}"
        component: "{{ .Labels.service }}"

  - name: "slack"
    slack_configs:
      - channel: "#alerts-critical"
        send_resolved: true
        api_url: "${SLACK_WEBHOOK_URL}"

  - name: "database-oncall"
    email_configs:
      - to: "database-oncall@example.com"
        headers:
          subject: "Database Alert: {{ .GroupLabels.alertname }}"

Exporters

Exporters expose metrics from third-party systems.

Node Exporter

System-level metrics:

# Run node exporter
docker run -d \
  --name node-exporter \
  --network host \
  prom/node-exporter:latest \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)

Blackbox Exporter

Probing endpoints:

# blackbox.yml
modules:
  http_2xx:
    prober: http
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
  tcp_connect:
    prober: tcp
  dns:
    prober: dns
    dns:
      query_name: example.com
      query_type: A

# Prometheus scrape config for blackbox
scrape_configs:
  - job_name: "blackbox"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
          - https://api2.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

When to Use Prometheus and Grafana

When to Use Prometheus:

Pull-based metrics collection from dynamic services
Time-series data requiring flexible querying
SLO/SLA tracking with PromQL
Alerting on metric thresholds
Service discovery integration (Kubernetes, Consul, etc.)
High-dimensional metrics with many labels

When Not to Use Prometheus:

Pure log aggregation (use ELK Stack)
Distributed request tracing (use Jaeger)
Event streaming or real-time processing (use Kafka)
Long-term data warehousing (Prometheus is not designed for decades of retention)
Ultra-high cardinality use cases (pre-aggregate or use other storage)

When to Use Grafana:

Time-series visualization and dashboards
Multi-data source dashboards (Prometheus + Elasticsearch + Jaeger)
Alert rule management and notification channels
Exploring metrics interactively
Building SLO error budget dashboards

Trade-off Analysis

Aspect	Prometheus	Datadog	InfluxDB	CloudWatch
Query Model	Pull-based	Agent/push	Agent/push	Pull/push hybrid
Metric Cardinality	High	High	Medium	Low
Query Language	PromQL	MQL/SQL	InfluxQL/Flux	CloudWatch SQL
Alerting	Native	Native	Kapacitor	CloudWatch Alerts
Storage Cost	Self-managed	SaaS (expensive)	Self-managed	Pay-per-use
Retention	Configurable	SaaS tiers	Configurable	15mo default
Learning Curve	Moderate	Low	Medium	Steep
Kubernetes Support	Excellent	Good	Good	Limited
Long-term Storage	Thanos/Cortex	Built-in	Built-in	Auto-archival

SLI/SLO/Error Budget Templates for Prometheus & Grafana

SLI Definition Template

# sli-definitions.yaml
# Service Level Indicator definitions for Prometheus/Grafana
service: example-service
environment: production

slis:
  # Availability SLI
  - name: availability
    description: "Successful requests as percentage of total"
    sli_type: ratio
    target: 99.9
    query: |
      sum(rate(http_requests_total{service="example-service",status!~"5.."}[{{ window }}]))
      /
      sum(rate(http_requests_total{service="example-service"}[{{ window }}]))

  # Latency SLI (good requests)
  - name: latency_success
    description: "P95 latency for successful requests"
    sli_type: latency
    target: 200 # milliseconds
    query: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket{
          service="example-service",
          status!~"5.."
        }[{{ window }}])) by (le)
      )

  # Latency SLI (all requests)
  - name: latency_overall
    description: "P95 latency for all requests"
    sli_type: latency
    target: 500 # milliseconds
    query: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket{
          service="example-service"
        }[{{ window }}])) by (le)
      )

  # Error rate SLI
  - name: error_rate
    description: "5xx error rate as percentage of total"
    sli_type: ratio
    target: 99.5
    query: |
      100 - (
        sum(rate(http_requests_total{service="example-service",status!~"5.."}[{{ window }}]))
        /
        sum(rate(http_requests_total{service="example-service"}[{{ window }}]))
      ) * 100

  # Throughput SLI
  - name: throughput
    description: "Requests per second"
    sli_type: gauge
    target: 1000 # minimum rps
    query: |
      sum(rate(http_requests_total{service="example-service"}[{{ window }}]))

SLO Configuration Template

# slo-configuration.yaml
# Service Level Objectives for Grafana Enterprise / Prometheus
objectives:
  # High-priority availability SLO
  - display_name: "API Availability"
    sli: availability
    target: 99.9
    window: 30d
    description: "API should successfully handle 99.9% of requests"
    alert_at_budget_remaining: 50% # Alert when 50% budget remains
    alert_severity: critical

  # Latency SLO for user-facing requests
  - display_name: "API Latency (p95)"
    sli: latency_success
    target: 99.0 # 99% of requests under 200ms
    window: 30d
    description: "99% of successful requests should complete within 200ms"
    alert_at_budget_remaining: 25%
    alert_severity: warning

  # Error budget SLO
  - display_name: "Error Rate"
    sli: error_rate
    target: 99.5
    window: 30d
    description: "Error rate should stay below 0.5%"
    alert_at_budget_remaining: 25%
    alert_severity: warning

Error Budget Calculator Template

# error-budget-calculator.py
"""
Error Budget Calculator for SLOs
Run: python error-budget-calculator.py
"""

def calculate_error_budget(slo_target, window_days=30):
    """
    Calculate error budget in minutes for a given SLO target.

    Args:
        slo_target: Target as decimal (e.g., 0.999 for 99.9%)
        window_days: Measurement window in days

    Returns:
        tuple: (total_budget_minutes, budget_per_hour, budget_per_day)
    """
    window_seconds = window_days * 24 * 60 * 60
    allowed_errors_seconds = window_seconds * (1 - slo_target)
    total_budget_minutes = allowed_errors_seconds / 60

    # Budget burning rates
    budget_per_hour = total_budget_minutes / (window_days * 24)
    budget_per_day = total_budget_minutes / window_days

    return total_budget_minutes, budget_per_hour, budget_per_day

# Standard SLO targets
slo_targets = {
    "99%": 0.99,
    "99.5%": 0.995,
    "99.9%": 0.999,
    "99.95%": 0.9995,
    "99.99%": 0.9999,
}

print("=" * 70)
print("Error Budget Calculator (30-day window)")
print("=" * 70)

for name, target in slo_targets.items():
    total, per_hour, per_day = calculate_error_budget(target)
    sustainable_rate = (1 - target) * 100

    print(f"\nSLO Target: {name}")
    print(f"  Sustainable error rate: {sustainable_rate:.4f}%")
    print(f"  Total error budget: {total:.2f} minutes ({total/60:.2f} hours)")
    print(f"  Budget burn rate: {per_hour:.4f} min/hour, {per_day:.2f} min/day")
    print(f"  Time to exhaust budget at 1% overhead: {total / (per_hour * 0.01):.1f} hours")
    print(f"  Time to exhaust budget at 10% overhead: {total / (per_hour * 0.1):.1f} hours")

# Burn-rate multipliers
print("\n" + "=" * 70)
print("Burn Rate Thresholds for 99.9% SLO")
print("=" * 70)
slo = 0.999
burn_rates = {
    "1 hour (fast burn)": 14.4,
    "6 hours (medium)": 6.0,
    "3 days (slow leak)": 3.0,
    "30 days (sustained)": 1.0,
}

for window, multiplier in burn_rates.items():
    threshold = (1 - slo) * multiplier * 100
    budget_burned_per_window = (multiplier * (1 - slo) * 100)
    print(f"\n{window}:")
    print(f"  Burn rate multiplier: {multiplier}x")
    print(f"  Error rate threshold: {threshold:.4f}%")
    print(f"  Budget consumed per window: {budget_burned_per_window:.4f}%")

Grafana SLO Dashboard Template

{
  "dashboard": {
    "title": "SLO Error Budget Dashboard - Example Service",
    "uid": "slo-example-service",
    "panels": [
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "gridPos": { "x": 0, "y": 0, "w": 8, "h": 8 },
        "targets": [
          {
            "expr": "(1 - (sum(rate(http_requests_total{service='example-service',status=~'5..'}[30d])) / sum(rate(http_requests_total{service='example-service'}[30d])))) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0,
            "max": 100,
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 25, "color": "orange" },
                { "value": 50, "color": "yellow" },
                { "value": 75, "color": "green" }
              ]
            },
            "custom": {
              "displayMode": "lcd"
            }
          }
        },
        "options": {
          "orientation": "auto",
          "showThresholdLabels": false,
          "showThresholdMarkers": true
        }
      },
      {
        "title": "Burn Rate by Time Window",
        "type": "timeseries",
        "gridPos": { "x": 8, "y": 0, "w": 16, "h": 8 },
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{service='example-service',status=~'5..'}[1h])) / sum(rate(http_requests_total{service='example-service'}[1h]))) / (1 - 0.999)",
            "legendFormat": "1h Burn Rate",
            "refId": "A"
          },
          {
            "expr": "(sum(rate(http_requests_total{service='example-service',status=~'5..'}[6h])) / sum(rate(http_requests_total{service='example-service'}[6h]))) / (1 - 0.999)",
            "legendFormat": "6h Burn Rate",
            "refId": "B"
          },
          {
            "expr": "(sum(rate(http_requests_total{service='example-service',status=~'5..'}[3d])) / sum(rate(http_requests_total{service='example-service'}[3d]))) / (1 - 0.999)",
            "legendFormat": "3d Burn Rate",
            "refId": "C"
          },
          {
            "expr": "1",
            "legendFormat": "Sustainable Rate",
            "refId": "D"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "custom": {
              "drawStyle": "line",
              "lineWidth": 2,
              "fillOpacity": 10
            }
          }
        }
      },
      {
        "title": "Error Rate vs SLO Target",
        "type": "stat",
        "gridPos": { "x": 0, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{service='example-service',status=~'5..'}[5m])) / sum(rate(http_requests_total{service='example-service'}[5m]))) * 100",
            "refId": "A"
          }
        ],
        "options": {
          "colorMode": "value",
          "graphMode": "area"
        },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.1, "color": "yellow" },
                { "value": 0.5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "SLO Target (0.1%)",
        "type": "stat",
        "gridPos": { "x": 6, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "(1 - 0.999) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent"
          }
        }
      },
      {
        "title": "Projected Budget Exhaustion",
        "type": "stat",
        "gridPos": { "x": 12, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "((1 - (sum(rate(http_requests_total{service='example-service',status=~'5..'}[30d])) / sum(rate(http_requests_total{service='example-service'}[30d])))) * 30 * 24) / max((sum(rate(http_requests_total{service='example-service',status=~'5..'}[1h])) / sum(rate(http_requests_total{service='example-service'}[1h]))) / (1 - 0.999), 0.1)",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "h",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 168, "color": "yellow" },
                { "value": 720, "color": "green" }
              ]
            }
          }
        }
      },
      {
        "title": "Recent Error Budget Burn Events",
        "type": "table",
        "gridPos": { "x": 18, "y": 8, "w": 6, "h": 8 },
        "targets": [
          {
            "expr": "ALERTS{alertname=~'SLO.*Burn.*', service='example-service'}",
            "refId": "A"
          }
        ]
      }
    ],
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)",
          "multi": false
        },
        {
          "name": "slo_target",
          "type": "custom",
          "query": "0.999,0.9995,0.9999",
          "multi": false
        }
      ]
    }
  }
}

Multi-Window Burn-Rate Alerting for Prometheus & Grafana

Burn-Rate Alerting Rules Template

# prometheus-burn-rate-alerts.yaml
# Multi-window burn-rate alerting for SLO-based alerting
groups:
  - name: slo-burn-rate-alerts
    interval: 30s # Evaluate every 30 seconds
    rules:
      # FAST BURN: 1-hour window, 14.4x burn rate
      # Budget exhaustion in ~7 hours
      - alert: SLOBurnRateFastBurn1h
        expr: |
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="example-service"}[1h]))
          )
          > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
          category: slo
          window: 1h
          slo: availability
        annotations:
          summary: "SLO FAST BURN [1h window] - Page Immediately"
          description: |
            Error budget is burning at {{ $value | humanize }}x sustainable rate.
            SLO: 99.9% | Error Rate: {{ $value | humanizePercentage }}
            At this rate, budget will be exhausted in ~7 hours.
            Runbook: https://runbooks.example.com/slo-fast-burn

      # MEDIUM BURN: 6-hour window, 6x burn rate
      # Budget exhaustion in ~5 days
      - alert: SLOBurnRateMediumBurn6h
        expr: |
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{service="example-service"}[6h]))
          )
          > (1 - 0.999) * 6
        for: 30m
        labels:
          severity: warning
          category: slo
          window: 6h
          slo: availability
        annotations:
          summary: "SLO MEDIUM BURN [6h window] - Investigate"
          description: |
            Error budget is burning at {{ $value | humanize }}x sustainable rate.
            SLO: 99.9% | Error Rate: {{ $value | humanizePercentage }}
            At this rate, 10% of budget will be burned in ~6 hours.
            Runbook: https://runbooks.example.com/slo-medium-burn

      # SLOW BURN: 3-day window, 3x burn rate
      # Budget exhaustion in ~10 days
      - alert: SLOBurnRateSlowBurn3d
        expr: |
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[3d]))
            /
            sum(rate(http_requests_total{service="example-service"}[3d]))
          )
          > (1 - 0.999) * 3
        for: 3h
        labels:
          severity: warning
          category: slo
          window: 3d
          slo: availability
        annotations:
          summary: "SLO SLOW BURN [3d window] - Review"
          description: |
            Error budget is burning at {{ $value | humanize }}x sustainable rate.
            SLO: 99.9% | Error Rate: {{ $value | humanizePercentage }}
            At this rate, 10% of budget will be burned in ~3 days.
            Runbook: https://runbooks.example.com/slo-slow-burn

      # COMBINED MULTI-WINDOW: Fires if ANY window exceeds threshold
      - alert: SLOBurnRateMultiWindow
        expr: |
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="example-service"}[1h]))
          )
          > (1 - 0.999) * 14.4
          or
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{service="example-service"}[6h]))
          )
          > (1 - 0.999) * 6
          or
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[3d]))
            /
            sum(rate(http_requests_total{service="example-service"}[3d]))
          )
          > (1 - 0.999) * 3
        for: 5m
        labels:
          severity: critical
          category: slo
        annotations:
          summary: "SLO BURNING across multiple windows"
          description: |
            Multi-window burn-rate alert triggered for example-service.
            Windows: 1h (14.4x), 6h (6x), 3d (3x)
            Check burn rates at: https://grafana.example.com/d/slo-dashboard
            Runbook: https://runbooks.example.com/slo-multi-window

Latency SLO Burn-Rate Alerts

# Latency burn-rate (using histogram)
- alert: SLOLatencyBurnFast1h
  expr: |
    (
      sum(rate(http_request_duration_seconds_bucket{
        service="example-service",
        le="0.2"  # Under 200ms
      }[1h]))
      /
      sum(rate(http_request_duration_seconds_count{
        service="example-service"
      }[1h]))
    )
    < 0.99
  for: 5m
  labels:
    severity: critical
    category: slo
    window: 1h
    slo: latency
  annotations:
    summary: "Latency SLO burning fast [1h window]"
    description: |
      Latency SLO (99% under 200ms) is burning at unsustainable rate.
      Current good rate: {{ $value | humanizePercentage }}
      Target: 99% | At this rate, budget will exhaust in ~7 hours.

Grafana Alerting with Burn-Rate

{
  "grafanaAlert": {
    "name": "SLO Error Budget Multi-Window",
    "condition": "C",
    "data": [
      {
        "refId": "A",
        "query": {
          "expr": "sum(rate(http_requests_total{service=\"example-service\",status=~\"5..\"}[1h])) / sum(rate(http_requests_total{service=\"example-service\"}[1h]))",
          "reducer": "last"
        }
      },
      {
        "refId": "B",
        "query": {
          "expr": "(1 - 0.999) * 14.4",
          "reducer": "last"
        }
      },
      {
        "refId": "C",
        "type": "threshold",
        "evaluator": {
          "type": "gt",
          "params": ["B"]
        }
      }
    ],
    "execErrState": "alerting",
    "noDataState": "no_data",
    "for": "5m",
    "annotations": {
      "summary": "SLO Error Budget Fast Burn",
      "description": "Error budget is burning at {{ $values.A.Value }} per second. At this rate, budget will be exhausted in approximately 7 hours. Immediate investigation required."
    },
    "labels": {
      "severity": "critical",
      "team": "platform"
    }
  }
}

Observability Hooks for Prometheus & Grafana

This section defines what to log, measure, trace, and alert for Prometheus and Grafana themselves.

Log (What to Emit)

Event	Fields	Level
Prometheus started	version, instance, listen_address	INFO
TSDB compaction	duration_seconds, blocks_removed	DEBUG
Scrape failure	target, error, duration	WARN
Remote write failure	endpoint, error, retries	WARN
Alert triggered	alert_name, labels, eval_duration	INFO
Grafana dashboard saved	dashboard_id, folder, user	DEBUG
Alert notification sent	alert_name, receiver, status	INFO

Measure (Metrics to Collect)

Metric	Type	Description
`prometheus_target_scrapes_total`	Counter	Total scrape operations
`prometheus_target_scrapes_failed_total`	Counter	Failed scrape operations
`prometheus_tsdb_head_samples`	Gauge	Samples in memory
`prometheus_tsdb_compactions_total`	Counter	Compaction operations
`prometheus_remote_write_requests_total`	Counter	Remote write requests
`prometheus_remote_write_requests_failed_total`	Counter	Failed remote write requests
`prometheus_notifications_total`	Counter	Alert notifications sent
`grafana_api_response_status_total`	Counter	API response by status
`grafana_dashboard_save_duration_seconds`	Histogram	Dashboard save latency
`grafana_alerting_active_alerts`	Gauge	Currently active alerts

Trace (Correlation Points)

Operation	Trace Attribute	Purpose
Scrape cycle	`prometheus.scrape.job`, `prometheus.scrape.target`	Monitor scrape health
TSDB write	`prometheus.tsdb.write.samples`	Track write performance
Query execution	`prometheus.query.duration_seconds`	Monitor query performance
Alert evaluation	`prometheus.alert.eval_duration_seconds`	Track alert latency

Alert (When to Page for Prometheus/Grafana)

Alert	Condition	Severity	Purpose
Prometheus Down	up{job=“prometheus”} == 0	P1 Critical	Monitoring offline
Prometheus OOM	process_resident_memory_bytes > 10GB	P1 Critical	Memory exhaustion
TSDB Head Growing	head_min_time < now - 7d	P2 High	Compaction lagging
Remote Write Failing	remote_write_failures > 5%	P1 Critical	Long-term data at risk
Scrape Target Down	up{job=“node”} == 0	P2 High	Infrastructure issue
Grafana Down	grafana_http_request_total{status=“500”} > 100	P1 Critical	Dashboards unavailable
Alert Storm	alertEvaluationDuration > 30s	P2 High	Alert logic problem
Query Latency	query_duration_seconds > 10s	P3 Medium	Performance degradation

Prometheus & Grafana Observability Template

# prometheus-grafana-observability.yaml
groups:
  - name: prometheus-self-monitoring
    rules:
      # Prometheus instance down
      - alert: PrometheusDown
        expr: up{job="prometheus"} == 0
        for: 2m
        labels:
          severity: critical
          component: prometheus
        annotations:
          summary: "Prometheus instance {{ $labels.instance }} is down"
          description: "Prometheus monitoring is unavailable. All SLO dashboards are affected."

      # Prometheus memory pressure
      - alert: PrometheusHighMemory
        expr: process_resident_memory_bytes{job="prometheus"} / 1024 / 1024 / 1024 > 10
        for: 5m
        labels:
          severity: warning
          component: prometheus
        annotations:
          summary: "Prometheus memory usage above 10GB"
          description: "Prometheus is using {{ $value | humanize }}GB of memory. Risk of OOM."

      # TSDB head not compacting
      - alert: PrometheusTSDBHeadNotCompacting
        expr: (time() - prometheus_tsdb_head_min_time{job="prometheus"}) > 7 * 24 * 3600
        for: 1h
        labels:
          severity: warning
          component: prometheus
        annotations:
          summary: "Prometheus TSDB head has not compacted in 7 days"
          description: "TSDB head is growing unbounded. Check disk I/O and compaction settings."

      # High scrape failure rate
      - alert: PrometheusScrapeFailureRate
        expr: |
          sum(rate(prometheus_target_scrapes_failed_total{job="prometheus"}[10m]))
          /
          sum(rate(prometheus_target_scrapes_total{job="prometheus"}[10m])) > 0.05
        for: 10m
        labels:
          severity: high
          component: prometheus
        annotations:
          summary: "Prometheus scrape failure rate above 5%"
          description: "{{ $value | humanizePercentage }} of scrape operations are failing."

      # Remote write failures
      - alert: PrometheusRemoteWriteFailing
        expr: |
          sum(rate(prometheus_remote_write_requests_failed_total{job="prometheus"}[5m]))
          /
          sum(rate(prometheus_remote_write_requests_total{job="prometheus"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          component: prometheus
        annotations:
          summary: "Prometheus remote write failure rate above 5%"
          description: "Metrics are not being backed up to long-term storage. Historical data at risk."

  - name: grafana-monitoring
    rules:
      # Grafana API errors
      - alert: GrafanaHighAPIErrorRate
        expr: |
          sum(rate(grafana_http_request_status_total{status=~"5..",handler="/api/*"}[5m]))
          /
          sum(rate(grafana_http_request_status_total{handler="/api/*"}[5m])) > 0.05
        for: 5m
        labels:
          severity: high
          component: grafana
        annotations:
          summary: "Grafana API error rate above 5%"
          description: "Grafana API is returning errors. Dashboards may be unavailable."

      # Grafana alert evaluation slow
      - alert: GrafanaAlertEvaluationSlow
        expr: grafana_alerting_rule_evaluation_duration_seconds{quantile="0.95"} > 30
        for: 5m
        labels:
          severity: warning
          component: grafana
        annotations:
          summary: "Grafana alert evaluation P95 above 30 seconds"
          description: "Alert evaluation is taking {{ $value }}s at P95. Risk of alert delays."

      # Active alerts increasing rapidly
      - alert: GrafanaAlertStorm
        expr: increase(grafana_alerting_active_alerts[5m]) > 50
        for: 5m
        labels:
          severity: warning
          component: grafana
        annotations:
          summary: "Rapid increase in Grafana active alerts"
          description: "{{ $value }} new alerts activated in 5 minutes. Possible alert storm."

## Production Failure Scenarios

| Failure | Impact | Mitigation |
|---------|--------|------------|
| Prometheus OOM from high cardinality | Metrics dropped; monitoring gaps | Limit label values; use recording rules; segment by service |
| Target scrape timeout | Missing metrics; gaps in data | Optimize query performance; adjust scrape timeout; increase resources |
| Alertmanager down | No alerts delivered; extended outages | Configure redundant Alertmanager instances; test alert delivery |
| Prometheus TSDB corruption | Historical data loss | Regular snapshots; replicated storage (Thanos, Cortex); backup recovery |
| Grafana dashboard database corruption | Lost dashboards | Use dashboard provisioning; store in Git; regular backups |
| Remote write failures | Metrics not backed up to long-term storage | Implement local buffering; retry with backoff; monitor remote write queue |

## Observability Checklist

### Prometheus Metrics Coverage

- [ ] Request rate (http_requests_total by method, endpoint, status)
- [ ] Request latency (http_request_duration_seconds histogram)
- [ ] Error rate (5xx responses as ratio of total)
- [ ] Active connections or in-flight requests
- [ ] Saturation metrics (queue depth if applicable)

### Infrastructure Metrics

- [ ] CPU usage per service
- [ ] Memory usage and fragmentation
- [ ] Disk I/O and storage usage
- [ ] Network throughput
- [ ] Container restart counts
- [ ] OOM kill events

### Alerting Rules Checklist

- [ ] Service-level SLO alerts (availability, latency)
- [ ] Resource exhaustion warnings (CPU, memory, disk >80%)
- [ ] Dependency health (database, cache, external APIs)
- [ ] Pipeline health (scrape success, remote write queue)
- [ ] Error budget burn rate alerts

### Recording Rules Checklist

- [ ] Pre-aggregate frequently queried metrics
- [ ] Calculate SLO ratios (error rate, availability)
- [ ] Create service-level metrics from raw instrumentation
- [ ] Define SLI metrics for dashboards

## Security Checklist

- [ ] Prometheus /metrics endpoint not publicly accessible
- [ ] Alertmanager notifications sanitized (no secrets)
- [ ] Grafana authentication enabled (OAuth/LDAP preferred)
- [ ] Grafana dashboard permissions scoped by team
- [ ] API keys for Alertmanager stored securely
- [ ] TLS configured for all endpoints
- [ ] Scraping credentials stored in Kubernetes secrets
- [ ] Remote write uses TLS and authentication
- [ ] No sensitive data in metric labels or annotations

## Common Pitfalls / Anti-Patterns

### 1. Label Cardinality Explosion

Unbounded label values cause memory exhaustion:

```yaml
# Bad: User ID as label (millions of values)
http_requests_total{user_id="usr_123456"}

# Good: Aggregate, or use low-cardinality labels
http_requests_total{user_type="premium"}  # Count by type instead
# Or track user metrics separately

2. Querying Raw Metrics in Dashboards

Raw high-cardinality metrics slow down dashboards:

# Bad: Query raw metrics at visualization time
sum(rate(http_requests_total{service="api"}[5m])) by (user_id)

# Good: Use recording rules to pre-aggregate
sum(rate(http_requests_aggregated{service="api"}[5m])) by (service)

3. Missing Metric Labels for Debugging

Labels should enable useful filtering:

# Bad: Too few labels
http_requests_total: 1000

# Good: Labels that enable useful debugging
http_requests_total{service="api-gateway", method="GET", endpoint="/api/users", status="200"}

4. Alerting Without Severity Classification

All alerts at same severity causes alert fatigue:

# Good: Severity classification
- alert: ServiceDown
  labels:
    severity: critical
- alert: HighLatency
  labels:
    severity: warning
- alert: MetricScrapeLag
  labels:
    severity: info

5. No Alert Routing Testing

Alerts that never fire in production may have broken routing:

# Test alert routing
curl -X POST http://alertmanager:9093/api/v1/alerts \
  -d '[{"labels":{"alertname":"TestAlert","severity":"critical"}}]'

6. Ignoring Recording Rules

Querying raw high-resolution data at visualization time is slow:

# prometheus.yml - recording rules
groups:
  - name: api_service_rules
    interval: 30s
    rules:
      - record: apigw:request_rate:5m
        expr: |
          sum(rate(http_requests_total{service="api-gateway"}[5m])) by (service)

Quick Recap

Key Takeaways:

Prometheus uses pull-based scraping; configure service discovery
Four metric types: Counter (increases), Gauge (varies), Histogram (buckets), Summary (quantiles)
PromQL enables powerful aggregation, but pre-aggregate with recording rules
Grafana visualizes Prometheus data and manages alerting
Alertmanager routes alerts with grouping, inhibition, and silencing
Monitor Prometheus itself: scrape health, TSDB size, query performance

Copy/Paste Checklist:

# Counter metric
http_requests_total{method="GET", endpoint="/api", status="200"}

# Gauge metric
room_temperature_celsius{room="office"} 23.5

# Histogram metric
http_request_duration_seconds_bucket{le="0.1"}

# Recording rule
- record: service:error_rate:5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))

# Alert rule
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical

# Alert routing (Alertmanager)
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack

Conclusion

Prometheus and Grafana together provide complete metrics observability. Instrument your applications with counters, gauges, and histograms. Configure Prometheus to scrape and store these metrics. Use PromQL to analyze trends and calculate SLOs. Build Grafana dashboards for real-time monitoring and alerting.

For logs and tracing correlation, see our ELK Stack and Distributed Tracing guides. For building complete monitoring pipelines with alerting, see the Metrics, Monitoring & Alerting guide.