Prometheus and Grafana: Metrics Collection and Visualization

Learn Prometheus metrics collection, PromQL querying, and Grafana dashboard creation. Complete guide to building observable systems with metrics.

Reading time: 40 min read | Author: GeekWorkBench

Prometheus and Grafana form the backbone of modern monitoring stacks. Prometheus pulls metrics from services and stores them in time-series format, while Grafana visualizes the data and lets you build dashboards for real-time analysis.

This guide covers Prometheus architecture, metric types, PromQL, and Grafana dashboard construction. If you need background on monitoring philosophy, see our Metrics, Monitoring & Alerting guide first.

Introduction

Prometheus, originally developed at SoundCloud, has become the de facto standard for open-source monitoring in cloud-native environments. Its pull-based model and powerful query language make it particularly well-suited for dynamic Kubernetes clusters where services spin up and down frequently. Grafana complements Prometheus by providing rich visualization capabilities, allowing teams to build dashboards that surface critical operational data in real time.

Together, these tools form a complete metrics pipeline: instrumentation emits metric data, Prometheus scrapes and stores it, PromQL enables flexible analysis, and Grafana turns raw numbers into actionable dashboards. This combination covers the four golden signals of monitoring (latency, traffic, errors, and saturation) and supports SLO tracking and alerting workflows.

This guide walks through what you need to build a production-ready monitoring stack. You will learn how Prometheus schedules scrapes across dynamic targets, how the four metric types differ, and how to query time-series data with PromQL. You will also see how to construct Grafana dashboards, configure Alertmanager routing, and instrument applications in Python and JavaScript. By the end, you will be able to design a monitoring system that surfaces failures before they become outages.

Prometheus Architecture

graph LR
    A[Services] -->|Pull Metrics| B[Prometheus Server]
    B --> C[Time Series DB]
    A -->|Push Metrics| D[Pushgateway]
    D --> B
    B --> E[Grafana]
    B --> F[Alertmanager]
    F --> G[Email/PagerDuty/Slack]

Prometheus uses a pull model by default: it scrapes an HTTP endpoint on each target at a configured interval. For short-lived jobs that cannot be scraped, the Pushgateway accepts pushed metrics and exposes them for Prometheus to scrape.
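
To make the pull model concrete, here is a minimal sketch of what a scrape target serves. It uses only the Python standard library and hard-coded sample values; the `REQUEST_COUNTS` dictionary, `render_metrics`, and `serve` are hypothetical names for illustration. In practice a client library such as prometheus_client generates this output for you.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter state: (method, status) -> count.
REQUEST_COUNTS = {("GET", "200"): 42, ("POST", "500"): 3}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics, the path Prometheus scrapes by default."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on http://localhost:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at `localhost:8000` would be enough for Prometheus to collect the two sample series.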

When to use Pushgateway: Pushgateway is appropriate for batch jobs and short-lived processes that cannot expose a scrape endpoint, such as CI/CD pipeline jobs, scheduled cron tasks, or one-off scripts that finish before the next scrape. Because these jobs exit before Prometheus can pull from them, they push their metrics to Pushgateway, which then serves as the scrape target on their behalf.
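
As a rough sketch of the push path, the snippet below builds the HTTP request a batch job could send to Pushgateway using only the standard library. The gateway address, job name, and metric name are assumptions for illustration; the prometheus_client library offers `push_to_gateway` for the same purpose.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway_url, job, metrics_text):
    """Build (but do not send) a PUT that replaces all metrics for one job."""
    url = f"{gateway_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    # The body is text exposition format and must end with a newline.
    if not metrics_text.endswith("\n"):
        metrics_text += "\n"
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(request, timeout=5):
    """Send the request; needs a reachable Pushgateway."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A cron job would call `push(build_push_request("http://pushgateway:9091", "nightly_backup", "backup_last_success_timestamp_seconds 1700000000"))` as its last step, where the host and metric are placeholders.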

Key Components

  • Prometheus Server: Pulls, stores, and queries metrics
  • Pushgateway: Holds metrics pushed by short-lived batch jobs until Prometheus scrapes them
  • Alertmanager: Handles alerting and notification routing
  • Exporters: Agents that expose metrics from third-party systems

Metric Types

Prometheus supports four fundamental metric types.

Counter

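A cumulative metric that only increases; it returns to zero only when the process restarts. Use it for request counts, error counts, or completed tasks, and query it with rate() or increase() rather than reading the raw value.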

# Python client example
from prometheus_client import Counter

requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()

Gauge

A metric that can go up or down. Use for current values like memory usage, in-flight requests, or temperature.

from prometheus_client import Gauge

current_temperature = Gauge(
    'room_temperature_celsius',
    'Current temperature in Celsius'
)

current_temperature.set(22.5)
current_temperature.dec(0.5)

Histogram

Samples observations and counts them in configurable buckets. Use for request durations or response sizes.

from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)
)

with request_duration.labels(method='GET', endpoint='/api/users').time():
    # Handle request
    pass

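A histogram exposes cumulative _bucket series (one per le upper bound) plus _sum and _count series for the total of observed values and the number of observations. The client does not calculate quantiles; they are estimated at query time with histogram_quantile().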
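
To show how a quantile is estimated from cumulative buckets, here is a simplified sketch of the linear interpolation that histogram_quantile() performs. It assumes classic buckets with a +Inf bucket, at least one finite bucket, and a quantile between 0 and 1; the bucket counts are invented sample data, and the real function handles more edge cases.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Assumes the list contains a +Inf bucket and at least one finite bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Invented sample: 100 observations, 90 of them at or below 0.5 seconds.
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
```

Because the estimate interpolates inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.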

Summary

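Similar to a histogram, but quantiles are calculated on the client side over a sliding time window. Client-side quantiles are accurate for a single instance but cannot be aggregated across instances, so prefer histograms when you need fleet-wide percentiles. Note that the Python client's Summary exposes only the _count and _sum series, not quantiles.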

from prometheus_client import Summary

request_latency = Summary(
    'http_request_latency_seconds',
    'HTTP request latency in seconds',
    ['method', 'endpoint']
)

with request_latency.labels(method='GET', endpoint='/api/users').time():
    # Handle request
    pass

Instrumenting Applications

Express.js with prom-client

const client = require("prom-client");
const express = require("express");

const register = new client.Registry();

register.setDefaultLabels({
  app: "api-gateway",
});

client.collectDefaultMetrics({ register });

const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "path", "status"],
  registers: [register],
});

const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register],
});

const app = express();

app.use((req, res, next) => {
  const start = Date.now();

  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000;
    const path = req.route ? req.route.path : req.path;

    httpRequestsTotal.inc({
      method: req.method,
      path: path,
      status: res.statusCode,
    });

    httpRequestDuration.observe(
      {
        method: req.method,
        path: path,
      },
      duration,
    );
  });

  next();
});

app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});

app.get("/api/users", (req, res) => {
  res.json([{ id: 1, name: "Alice" }]);
});

app.listen(3000);

Python with FastAPI

from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from fastapi import FastAPI, Request
from fastapi.responses import Response
import time

app = FastAPI()

http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0)
)

@app.middleware("http")
async def add_metrics(request: Request, call_next):
    start = time.time()

    response = await call_next(request)

    duration = time.time() - start
    path = request.url.path

    http_requests_total.labels(
        method=request.method,
        endpoint=path,
        status=response.status_code
    ).inc()

    http_request_duration.labels(
        method=request.method,
        endpoint=path
    ).observe(duration)

    return response

@app.get("/metrics")
def metrics():
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )
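
The middleware above labels each series with the raw `request.url.path`, so every distinct user or order ID creates a new time series. One way to limit that is to normalize paths before using them as label values. The sketch below is an assumption-laden illustration: `normalize_path` and its placeholder names are hypothetical, and it only recognizes numeric and UUID segments.

```python
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize_path(path):
    """Replace ID-like path segments with fixed placeholders."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit():
            segments.append("{id}")
        elif _UUID.match(segment):
            segments.append("{uuid}")
        else:
            segments.append(segment)
    return "/".join(segments)
```

In the middleware, `path = normalize_path(request.url.path)` would keep `/api/users/1` and `/api/users/2` in the same series.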

Prometheus Configuration

Scrape Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: production
    cluster: us-east-1

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "api-gateway"
    scrape_interval: 10s
    static_configs:
      - targets: ["api-gateway:3000"]
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: "api-gateway-1"

  - job_name: "kubernetes-apiservers"
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        action: keep
        regex: default;kubernetes

  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
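
The last relabel rule in the pod job is the least obvious one, so here is a small Python sketch of what it does. Prometheus joins the source labels with `;`, requires the regex to match the whole string, and writes the replacement into the target label. The function name and sample addresses are made up for illustration.

```python
import re

# Same pattern as the relabel rule; Prometheus anchors it at both ends.
ADDRESS_PORT = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address, port_annotation):
    """Return the new __address__ after applying the rule."""
    joined = f"{address};{port_annotation}"  # source labels joined with ';'
    match = ADDRESS_PORT.fullmatch(joined)
    if match is None:
        # No match: a replace action leaves the target label untouched.
        return address
    return match.expand(r"\1:\2")  # equivalent of "$1:$2"
```

A pod discovered at `10.0.0.5:8080` with the annotation `prometheus.io/port: "9102"` is therefore scraped at `10.0.0.5:9102`, while a pod without the annotation keeps its discovered address.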

Recording Rules

Recording rules pre-compute frequently needed queries:

# /etc/prometheus/rules/recording.yml
groups:
  - name: api-gateway
    interval: 30s
    rules:
      - record: apigw:http_requests:rate5m
        expr: |
          sum(rate(http_requests_total{service="api-gateway"}[5m])) by (service)

      - record: apigw:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m])) by (service)

      - record: apigw:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le)
          )

  - name: slo-performance
    interval: 60s
    rules:
      - record: job:slo:availability:30d
        expr: |
          1 -
          sum(rate(http_requests_total{status=~"5.."}[30d]))
          /
          sum(rate(http_requests_total[30d]))

Alerting Rules

# /etc/prometheus/rules/alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total{service="api-gateway"}[5m])) by (service) > 0.01
        for: 5m
        labels:
          severity: critical
          team: api
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) by (le, service)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value | humanizeDuration }}"

      - alert: InstanceDown
        expr: up{job="api-gateway"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
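
The `for` clause in these rules delays firing until the expression has been true at every evaluation for the stated duration. The following sketch models that behavior under simplifying assumptions: one alert instance, evaluations supplied as a list, and hypothetical names throughout.

```python
def alert_states(samples, hold_seconds):
    """Return the alert state after each evaluation.

    samples: list of (timestamp_seconds, condition_is_true), oldest first.
    hold_seconds: the rule's `for` duration in seconds.
    """
    states = []
    active_since = None
    for timestamp, condition in samples:
        if not condition:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= hold_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# Evaluations every 60 s with a 2-minute hold; one dip at t=120 restarts the clock.
history = [(0, True), (60, True), (120, False), (180, True), (240, True), (300, True)]
states = alert_states(history, hold_seconds=120)
```

This is why a short `for` pages quickly but flaps on brief spikes, while a long `for` suppresses noise at the cost of slower detection.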

PromQL

Prometheus Query Language lets you analyze time-series data.

Basic Queries

# All metrics starting with http
{__name__=~"http_.*"}

# Request rate per endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Aggregations

# Sum by label
sum(http_requests_total) by (service)

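# Average request duration (histogram _sum divided by _count)
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)

# Percentile (estimated from per-bucket rates)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))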

# Top 10 by value
topk(10, http_requests_total)

Functions

# Rate of change (per second)
rate(http_requests_total[5m])

# Increase over time range
increase(http_requests_total[1h])

# Predict linear trend
predict_linear(node_memory_MemFree_bytes[10m], 3600)

# Timestamp of last sample
timestamp(http_requests_total)
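
rate() and increase() are safe to use on counters because they compensate for resets. The sketch below shows the core idea on invented samples; it is a simplification that omits the extrapolation to the window boundaries that Prometheus also applies.

```python
def counter_increase(samples):
    """Total increase across samples, treating any drop as a counter reset.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # The process restarted and the counter began again from zero.
            increase += current
        else:
            increase += current - previous
    return increase


def counter_rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Invented scrape results: the counter restarts between t=15 and t=30.
scrapes = [(0, 100), (15, 130), (30, 5), (45, 20)]
```

Subtracting the first value from the last would give a negative number here, which is why raw counter values should not be graphed or compared directly.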

Subqueries

# Peak 5-minute request rate over the last 15 minutes, sampled every minute
max_over_time(
  rate(http_requests_total[5m])[15m:1m]
)

# Lowest p95 latency over the last 30 minutes (default subquery resolution)
min_over_time(
  histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))[30m:]
)

Grafana Dashboard Construction

Grafana connects to Prometheus and provides visualization.

Panel Types

| Type      | Use Case                        |
|-----------|---------------------------------|
| Graph     | Time series visualization       |
| Stat      | Single big number               |
| Gauge     | Numeric with thresholds         |
| Table     | Multiple metrics and dimensions |
| Pie chart | Proportional distribution       |
| Heatmap   | Density visualization           |

Graph Panel Configuration

{
  "panel": {
    "title": "Request Rate",
    "type": "graph",
    "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
    "targets": [
      {
        "expr": "sum(rate(http_requests_total[5m])) by (service)",
        "legendFormat": "{{service}}",
        "refId": "A"
      }
    ],
    "yaxes": [
      {
        "format": "reqps",
        "label": "Requests/s"
      }
    ],
    "xaxis": {
      "mode": "time"
    },
    "seriesOverrides": [],
    "fieldConfig": {
      "defaults": {
        "custom": {
          "drawStyle": "line",
          "lineWidth": 2,
          "fillOpacity": 10,
          "gradientMode": "none"
        },
        "mappings": [],
        "thresholds": {
          "mode": "absolute",
          "steps": [{ "value": 0, "color": "green" }]
        }
      }
    }
  }
}

Variables and Templating

Dashboard variables make dashboards reusable:

{
  "templating": {
    "list": [
      {
        "name": "service",
        "type": "query",
        "query": "label_values(http_requests_total, service)",
        "multi": true
      },
      {
        "name": "interval",
        "type": "interval",
        "query": "1m,5m,15m,30m,1h",
        "multi": false
      }
    ]
  }
}

Use variables in queries:

sum(rate(http_requests_total{service=~"$service"}[$interval])) by (service)

Annotations

Annotations mark events on dashboards:

{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "query": "{job=\"deployments\"}",
        "iconColor": "rgba(255, 96, 96, 1)"
      }
    ]
  }
}

Alerting Rules in Grafana

{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [0.01],
          "type": "gt"
        },
        "operator": {
          "type": "and"
        },
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": {
          "type": "avg"
        }
      }
    ],
    "for": "5m",
    "frequency": "1m",
    "noDataState": "no_data",
    "executionErrorState": "alerting",
    "message": "Error rate is above 1% for 5 minutes"
  }
}

Alertmanager Configuration

Alertmanager handles routing alerts to notification channels.

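# alertmanager.yml
# Alertmanager does not expand environment variables in this file. Render the
# ${...} placeholders with a templating step (for example envsubst) before loading it.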
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alerts@example.com"
  smtp_auth_username: "alerts"
  smtp_auth_password: "${SMTP_PASSWORD}"

route:
  receiver: "default"
  group_by: ["alertname", "cluster", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true
    - match:
        severity: warning
      receiver: "slack"
      continue: true
    - match:
        team: "database"
      receiver: "database-oncall"

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts-general"
        send_resolved: true
        title: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
        text: |
          {{ range .Alerts }}
          **{{ .Labels.alertname }}**
          {{ .Annotations.description }}
          {{ end }}

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_KEY}"
        severity: "{{ .CommonLabels.severity }}"
        component: "{{ .CommonLabels.service }}"

  - name: "slack"
    slack_configs:
      - channel: "#alerts-warning"
        send_resolved: true
        api_url: "${SLACK_WEBHOOK_URL}"

  - name: "database-oncall"
    email_configs:
      - to: "database-oncall@example.com"
        headers:
          subject: "Database Alert: {{ .GroupLabels.alertname }}"
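
The effect of `continue: true` is easiest to see by tracing an alert through the routes. This sketch mirrors the three child routes and the default receiver from the configuration above; it models only exact-match label routing on a flat tree, and `receivers_for` is a hypothetical helper, not part of Alertmanager.

```python
# Child routes in the order they appear under `routes:`.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "slack", "continue": True},
    {"match": {"team": "database"}, "receiver": "database-oncall", "continue": False},
]
DEFAULT_RECEIVER = "default"


def receivers_for(labels):
    """Return the receivers that would be notified for an alert's labels."""
    matched = []
    for route in ROUTES:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            matched.append(route["receiver"])
            if not route["continue"]:
                break  # stop at the first match unless continue is set
    # An alert that matches no child route falls back to the root receiver.
    return matched or [DEFAULT_RECEIVER]
```

With these routes a critical database alert reaches both `pagerduty` and `database-oncall`, whereas without `continue: true` it would stop at `pagerduty`.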

Exporters

Exporters expose metrics from third-party systems.

Node Exporter

System-level metrics:

# Run node exporter
docker run -d \
  --name node-exporter \
  --network host \
  prom/node-exporter:latest \
  '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'

Blackbox Exporter

Probing endpoints:

# blackbox.yml
modules:
  http_2xx:
    prober: http
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
  tcp_connect:
    prober: tcp
  dns:
    prober: dns
    dns:
      query_name: example.com
      query_type: A

# Prometheus scrape config for blackbox
scrape_configs:
  - job_name: "blackbox"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
          - https://api2.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
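
The three relabel steps swap the roles of target and exporter: the URL to probe becomes a query parameter, and the scrape itself goes to the Blackbox Exporter. The sketch below replays those steps on a plain dictionary to show the URL Prometheus would end up requesting; `probe_url` is a hypothetical helper and assumes the default `http` scheme.

```python
from urllib.parse import urlencode


def probe_url(target, module="http_2xx", exporter="blackbox-exporter:9115"):
    """Replay the relabel_configs and return (scrape_url, instance_label)."""
    # Initial labels: the static target, plus __param_module from `params`.
    labels = {"__address__": target, "__param_module": module}
    # Step 1: copy the target URL into the ?target= parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed URL as the instance label.
    labels["instance"] = labels["__param_target"]
    # Step 3: send the actual scrape to the exporter.
    labels["__address__"] = exporter

    params = {
        name[len("__param_"):]: value
        for name, value in labels.items()
        if name.startswith("__param_")
    }
    query = urlencode(sorted(params.items()))
    return f"http://{labels['__address__']}/probe?{query}", labels["instance"]


url, instance = probe_url("https://api.example.com/health")
```

The `instance` label still shows the probed endpoint, so dashboards and alerts refer to the service being checked rather than to the exporter.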

When to Use Prometheus and Grafana

When to Use Prometheus:

  • Pull-based metrics collection from dynamic services
  • Time-series data requiring flexible querying
  • SLO/SLA tracking with PromQL
  • Alerting on metric thresholds
  • Service discovery integration (Kubernetes, Consul, etc.)
  • High-dimensional metrics with many labels

When Not to Use Prometheus:

  • Pure log aggregation (use ELK Stack)
  • Distributed request tracing (use Jaeger)
  • Event streaming or real-time processing (use Kafka)
  • Long-term data warehousing (Prometheus is not designed for decades of retention)
  • Ultra-high cardinality use cases (pre-aggregate or use other storage)

When to Use Grafana:

  • Time-series visualization and dashboards
  • Multi-data source dashboards (Prometheus + Elasticsearch + Jaeger)
  • Alert rule management and notification channels
  • Exploring metrics interactively
  • Building SLO error budget dashboards

Trade-off Analysis

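| Aspect             | Prometheus    | Datadog          | InfluxDB      | CloudWatch        |
|--------------------|---------------|------------------|---------------|-------------------|
| Collection Model   | Pull-based    | Agent/push       | Agent/push    | Pull/push hybrid  |
| Metric Cardinality | High          | High             | Medium        | Low               |
| Query Language     | PromQL        | MQL/SQL          | InfluxQL/Flux | CloudWatch SQL    |
| Alerting           | Native        | Native           | Kapacitor     | CloudWatch Alarms |
| Storage Cost       | Self-managed  | SaaS (expensive) | Self-managed  | Pay-per-use       |
| Retention          | Configurable  | SaaS tiers       | Configurable  | 15mo default      |
| Learning Curve     | Moderate      | Low              | Medium        | Steep             |
| Kubernetes Support | Excellent     | Good             | Good          | Limited           |
| Long-term Storage  | Thanos/Cortex | Built-in         | Built-in      | Auto-archival     |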

SLI/SLO/Error Budget Templates for Prometheus & Grafana

SLI Definition Template

# sli-definitions.yaml
# Service Level Indicator definitions for Prometheus/Grafana
service: example-service
environment: production

slis:
  # Availability SLI
  - name: availability
    description: "Successful requests as percentage of total"
    sli_type: ratio
    target: 99.9
    query: |
      sum(rate(http_requests_total{service="example-service",status!~"5.."}[{{ window }}]))
      /
      sum(rate(http_requests_total{service="example-service"}[{{ window }}]))

  # Latency SLI (good requests)
  - name: latency_success
    description: "P95 latency for successful requests"
    sli_type: latency
    target: 200 # milliseconds; the query below returns seconds, so compare against 0.2
    query: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket{
          service="example-service",
          status!~"5.."
        }[{{ window }}])) by (le)
      )

  # Latency SLI (all requests)
  - name: latency_overall
    description: "P95 latency for all requests"
    sli_type: latency
    target: 500 # milliseconds; the query below returns seconds, so compare against 0.5
    query: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket{
          service="example-service"
        }[{{ window }}])) by (le)
      )

  # Error rate SLI
  - name: error_rate
    description: "5xx error rate as percentage of total"
    sli_type: ratio
    target: 0.5 # maximum 5xx percentage (equivalent to 99.5% success)
    query: |
      100 - (
        sum(rate(http_requests_total{service="example-service",status!~"5.."}[{{ window }}]))
        /
        sum(rate(http_requests_total{service="example-service"}[{{ window }}]))
      ) * 100

  # Throughput SLI
  - name: throughput
    description: "Requests per second"
    sli_type: gauge
    target: 1000 # minimum rps
    query: |
      sum(rate(http_requests_total{service="example-service"}[{{ window }}]))

SLO Configuration Template

# slo-configuration.yaml
# Service Level Objectives for Grafana Enterprise / Prometheus
objectives:
  # High-priority availability SLO
  - display_name: "API Availability"
    sli: availability
    target: 99.9
    window: 30d
    description: "API should successfully handle 99.9% of requests"
    alert_at_budget_remaining: 50% # Alert when 50% budget remains
    alert_severity: critical

  # Latency SLO for user-facing requests
  - display_name: "API Latency (p95)"
    sli: latency_success
    target: 99.0 # 99% of requests under 200ms
    window: 30d
    description: "99% of successful requests should complete within 200ms"
    alert_at_budget_remaining: 25%
    alert_severity: warning

  # Error budget SLO
  - display_name: "Error Rate"
    sli: error_rate
    target: 99.5
    window: 30d
    description: "Error rate should stay below 0.5%"
    alert_at_budget_remaining: 25%
    alert_severity: warning

Error Budget Calculator Template

# error-budget-calculator.py
"""
Error Budget Calculator for SLOs
Run: python error-budget-calculator.py
"""

def calculate_error_budget(slo_target, window_days=30):
    """
    Calculate error budget in minutes for a given SLO target.

    Args:
        slo_target: Target as decimal (e.g., 0.999 for 99.9%)
        window_days: Measurement window in days

    Returns:
        tuple: (total_budget_minutes, budget_per_hour, budget_per_day)
    """
    window_seconds = window_days * 24 * 60 * 60
    allowed_errors_seconds = window_seconds * (1 - slo_target)
    total_budget_minutes = allowed_errors_seconds / 60

    # Budget burning rates
    budget_per_hour = total_budget_minutes / (window_days * 24)
    budget_per_day = total_budget_minutes / window_days

    return total_budget_minutes, budget_per_hour, budget_per_day

# Standard SLO targets
slo_targets = {
    "99%": 0.99,
    "99.5%": 0.995,
    "99.9%": 0.999,
    "99.95%": 0.9995,
    "99.99%": 0.9999,
}

print("=" * 70)
print("Error Budget Calculator (30-day window)")
print("=" * 70)

for name, target in slo_targets.items():
    total, per_hour, per_day = calculate_error_budget(target)
    sustainable_rate = (1 - target) * 100

    print(f"\nSLO Target: {name}")
    print(f"  Sustainable error rate: {sustainable_rate:.4f}%")
    print(f"  Total error budget: {total:.2f} minutes ({total/60:.2f} hours)")
    print(f"  Budget allowance: {per_hour:.4f} min/hour, {per_day:.2f} min/day")
    # Hours until the 30-day budget is gone if the error rate holds at a fixed level
    for error_rate in (0.01, 0.10):
        hours = 30 * 24 * (1 - target) / error_rate
        print(f"  Time to exhaust budget at {error_rate:.0%} error rate: {hours:.1f} hours")

# Burn-rate multipliers
print("\n" + "=" * 70)
print("Burn Rate Thresholds for 99.9% SLO")
print("=" * 70)
slo = 0.999
slo_window_hours = 30 * 24
# label -> (alert window in hours, burn-rate multiplier)
burn_rates = {
    "1 hour (fast burn)": (1, 14.4),
    "6 hours (medium)": (6, 6.0),
    "3 days (slow leak)": (72, 3.0),
    "30 days (sustained)": (720, 1.0),
}

for window, (window_hours, multiplier) in burn_rates.items():
    threshold = (1 - slo) * multiplier * 100
    # Share of the 30-day budget consumed if the multiplier holds for the whole window
    budget_burned_per_window = multiplier * window_hours / slo_window_hours * 100
    print(f"\n{window}:")
    print(f"  Burn rate multiplier: {multiplier}x")
    print(f"  Error rate threshold: {threshold:.4f}%")
    print(f"  Budget consumed per window: {budget_burned_per_window:.2f}%")

Grafana SLO Dashboard Template

{
  "dashboard": {
    "title": "SLO Error Budget Dashboard - Example Service",
    "uid": "slo-example-service",
    "panels": [
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "gridPos": { "x": 0, "y": 0, "w": 8, "h": 8 },
        "targets": [
          {
            "expr": "(1 - ((sum(rate(http_requests_total{service='example-service',status=~'5..'}[30d])) / sum(rate(http_requests_total{service='example-service'}[30d]))) / (1 - 0.999))) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0,
            "max": 100,
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 25, "color": "orange" },
                { "value": 50, "color": "yellow" },
                { "value": 75, "color": "green" }
              ]
            },
            "custom": {
              "displayMode": "lcd"
            }
          }
        },
        "options": {
          "orientation": "auto",
          "showThresholdLabels": false,
          "showThresholdMarkers": true
        }
      },
      {
        "title": "Burn Rate by Time Window",
        "type": "timeseries",
        "gridPos": { "x": 8, "y": 0, "w": 16, "h": 8 },
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{service='example-service',status=~'5..'}[1h])) / sum(rate(http_requests_total{service='example-service'}[1h]))) / (1 - 0.999)",
            "legendFormat": "1h Burn Rate",
            "refId": "A"
          },
          {
            "expr": "(sum(rate(http_requests_total{service='example-service',status=~'5..'}[6h])) / sum(rate(http_requests_total{service='example-service'}[6h]))) / (1 - 0.999)",
            "legendFormat": "6h Burn Rate",
            "refId": "B"
          },
          {
            "expr": "(sum(rate(http_requests_total{service='example-service',status=~'5..'}[3d])) / sum(rate(http_requests_total{service='example-service'}[3d]))) / (1 - 0.999)",
            "legendFormat": "3d Burn Rate",
            "refId": "C"
          },
          {
            "expr": "1",
            "legendFormat": "Sustainable Rate",
            "refId": "D"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "custom": {
              "drawStyle": "line",
              "lineWidth": 2,
              "fillOpacity": 10
            }
          }
        }
      },
      {
        "title": "Error Rate vs SLO Target",
        "type": "stat",
        "gridPos": { "x": 0, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{service='example-service',status=~'5..'}[5m])) / sum(rate(http_requests_total{service='example-service'}[5m]))) * 100",
            "refId": "A"
          }
        ],
        "options": {
          "colorMode": "value",
          "graphMode": "area"
        },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.1, "color": "yellow" },
                { "value": 0.5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "SLO Target (0.1%)",
        "type": "stat",
        "gridPos": { "x": 6, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "(1 - 0.999) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent"
          }
        }
      },
      {
        "title": "Projected Budget Exhaustion",
        "type": "stat",
        "gridPos": { "x": 12, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "(clamp_min(1 - ((sum(rate(http_requests_total{service='example-service',status=~'5..'}[30d])) / sum(rate(http_requests_total{service='example-service'}[30d]))) / (1 - 0.999)), 0) * 30 * 24) / clamp_min((sum(rate(http_requests_total{service='example-service',status=~'5..'}[1h])) / sum(rate(http_requests_total{service='example-service'}[1h]))) / (1 - 0.999), 0.1)",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "h",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 168, "color": "yellow" },
                { "value": 720, "color": "green" }
              ]
            }
          }
        }
      },
      {
        "title": "Recent Error Budget Burn Events",
        "type": "table",
        "gridPos": { "x": 18, "y": 8, "w": 6, "h": 8 },
        "targets": [
          {
            "expr": "ALERTS{alertname=~'SLO.*Burn.*', service='example-service'}",
            "refId": "A"
          }
        ]
      }
    ],
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)",
          "multi": false
        },
        {
          "name": "slo_target",
          "type": "custom",
          "query": "0.999,0.9995,0.9999",
          "multi": false
        }
      ]
    }
  }
}

Multi-Window Burn-Rate Alerting for Prometheus & Grafana

Burn-Rate Alerting Rules Template

# prometheus-burn-rate-alerts.yaml
# Multi-window burn-rate alerting for SLO-based alerting
groups:
  - name: slo-burn-rate-alerts
    interval: 30s # Evaluate every 30 seconds
    rules:
      # FAST BURN: 1-hour window, 14.4x burn rate
      # Budget exhaustion in ~50 hours (2% of the 30-day budget per hour)
      - alert: SLOBurnRateFastBurn1h
        expr: |
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="example-service"}[1h]))
          )
          > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
          category: slo
          window: 1h
          slo: availability
        annotations:
          summary: "SLO FAST BURN [1h window] - Page Immediately"
          description: |
            Error ratio over the last hour is {{ $value | humanizePercentage }},
            above the 14.4x burn-rate threshold (1.44%) for the 99.9% SLO.
            At a sustained 14.4x burn rate, the 30-day budget is exhausted in ~50 hours.
            Runbook: https://runbooks.example.com/slo-fast-burn

      # MEDIUM BURN: 6-hour window, 6x burn rate
      # Budget exhaustion in ~5 days
      - alert: SLOBurnRateMediumBurn6h
        expr: |
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{service="example-service"}[6h]))
          )
          > (1 - 0.999) * 6
        for: 30m
        labels:
          severity: warning
          category: slo
          window: 6h
          slo: availability
        annotations:
          summary: "SLO MEDIUM BURN [6h window] - Investigate"
          description: |
            Error ratio over the last 6 hours is {{ $value | humanizePercentage }},
            above the 6x burn-rate threshold (0.6%) for the 99.9% SLO.
            At a sustained 6x burn rate, 10% of the budget is burned in ~12 hours.
            Runbook: https://runbooks.example.com/slo-medium-burn

      # SLOW BURN: 3-day window, 3x burn rate
      # Budget exhaustion in ~10 days
      - alert: SLOBurnRateSlowBurn3d
        expr: |
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[3d]))
            /
            sum(rate(http_requests_total{service="example-service"}[3d]))
          )
          > (1 - 0.999) * 3
        for: 3h
        labels:
          severity: warning
          category: slo
          window: 3d
          slo: availability
        annotations:
          summary: "SLO SLOW BURN [3d window] - Review"
          description: |
            Error ratio over the last 3d is {{ $value | humanizePercentage }}, above the 0.3% threshold (3x burn rate).
            SLO: 99.9% | Error budget: 0.1%
            At a sustained 3x burn rate, 10% of a 30-day budget is burned in ~1 day.
            Runbook: https://runbooks.example.com/slo-slow-burn

      # COMBINED MULTI-WINDOW: Fires if ANY window exceeds threshold
      - alert: SLOBurnRateMultiWindow
        expr: |
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="example-service"}[1h]))
          )
          > (1 - 0.999) * 14.4
          or
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{service="example-service"}[6h]))
          )
          > (1 - 0.999) * 6
          or
          (
            sum(rate(http_requests_total{service="example-service",status=~"5.."}[3d]))
            /
            sum(rate(http_requests_total{service="example-service"}[3d]))
          )
          > (1 - 0.999) * 3
        for: 5m
        labels:
          severity: critical
          category: slo
        annotations:
          summary: "SLO BURNING across multiple windows"
          description: |
            Multi-window burn-rate alert triggered for example-service.
            Windows: 1h (14.4x), 6h (6x), 3d (3x)
            Check burn rates at: https://grafana.example.com/d/slo-dashboard
            Runbook: https://runbooks.example.com/slo-multi-window
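
The thresholds and exhaustion times in these rules follow from simple arithmetic. The sketch below reproduces that arithmetic in Python, assuming a 30-day SLO window (the window implied by "6x burn rate, exhaustion in ~5 days"). The names `BurnRateTier`, `error_ratio_threshold`, `hours_to_exhaustion`, and `budget_consumed_in_window` are illustrative and not part of any Prometheus API.

```python
from dataclasses import dataclass

SLO_WINDOW_HOURS = 30 * 24  # assumption: 30-day rolling SLO window


@dataclass(frozen=True)
class BurnRateTier:
    name: str
    window_hours: float  # lookback used in rate(...[window])
    burn_rate: float     # multiple of the sustainable error rate


def error_ratio_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio above which the alert fires: (1 - SLO) * burn rate."""
    return (1.0 - slo) * burn_rate


def hours_to_exhaustion(burn_rate: float) -> float:
    """Hours until the whole budget is gone if the burn rate is sustained."""
    return SLO_WINDOW_HOURS / burn_rate


def budget_consumed_in_window(tier: BurnRateTier) -> float:
    """Fraction of the total budget burned during one alert window."""
    return tier.burn_rate * tier.window_hours / SLO_WINDOW_HOURS


# The three tiers used by the alert rules above.
TIERS = [
    BurnRateTier("fast", 1, 14.4),
    BurnRateTier("medium", 6, 6),
    BurnRateTier("slow", 72, 3),
]

for tier in TIERS:
    print(
        f"{tier.name}: threshold={error_ratio_threshold(0.999, tier.burn_rate):.4%} "
        f"exhaustion={hours_to_exhaustion(tier.burn_rate):.0f}h "
        f"budget_per_window={budget_consumed_in_window(tier):.0%}"
    )
```

For a 99.9% SLO this gives error-ratio thresholds of 1.44%, 0.6%, and 0.3%, and exhaustion times of about 50, 120, and 240 hours. Changing `SLO_WINDOW_HOURS` shows how the same multipliers behave under a different SLO window.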

Latency SLO Burn-Rate Alerts

# Latency burn-rate (using histogram)
- alert: SLOLatencyBurnFast1h
  expr: |
    (
      sum(rate(http_request_duration_seconds_bucket{
        service="example-service",
        le="0.2"  # Under 200ms
      }[1h]))
      /
      sum(rate(http_request_duration_seconds_count{
        service="example-service"
      }[1h]))
    )
    < 1 - 14.4 * (1 - 0.99)
  for: 5m
  labels:
    severity: critical
    category: slo
    window: 1h
    slo: latency
  annotations:
    summary: "Latency SLO burning fast [1h window]"
    description: |
      Latency SLO (99% under 200ms) is burning at more than 14.4x the sustainable rate.
      Current good ratio: {{ $value | humanizePercentage }}
      Target: 99% | At a sustained 14.4x burn rate, a 30-day budget is exhausted in ~2 days.
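
A latency SLO of 99% under 200ms leaves a 1% budget for slow requests. The following sketch shows how the good ratio (the `le="0.2"` bucket divided by `_count`) maps to a burn rate and back. The function names are illustrative, and the request counts in the usage lines are made-up inputs.

```python
def good_ratio(fast_requests: float, total_requests: float) -> float:
    """Share of requests at or under the latency target (le bucket / _count)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return fast_requests / total_requests


def latency_burn_rate(ratio: float, slo: float = 0.99) -> float:
    """How many times faster than sustainable the latency budget is burning."""
    return (1.0 - ratio) / (1.0 - slo)


def good_ratio_threshold(burn_rate: float, slo: float = 0.99) -> float:
    """Good ratio below which an alert at the given burn-rate multiplier fires."""
    return 1.0 - burn_rate * (1.0 - slo)


# Hypothetical hour of traffic: 850 of 1000 requests finished under 200ms.
ratio = good_ratio(850, 1000)
print(f"good ratio={ratio:.3f} burn rate={latency_burn_rate(ratio):.1f}x")
print(f"14.4x threshold={good_ratio_threshold(14.4):.3f}")
```

A good ratio of exactly 0.99 corresponds to a burn rate of 1x, meaning the budget is consumed exactly over the SLO window. A 14.4x burn rate corresponds to a good ratio of about 0.856.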

Grafana Alerting with Burn-Rate

{
  "grafanaAlert": {
    "name": "SLO Error Budget Multi-Window",
    "condition": "C",
    "data": [
      {
        "refId": "A",
        "query": {
          "expr": "sum(rate(http_requests_total{service=\"example-service\",status=~\"5..\"}[1h])) / sum(rate(http_requests_total{service=\"example-service\"}[1h]))",
          "reducer": "last"
        }
      },
      {
        "refId": "B",
        "query": {
          "expr": "(1 - 0.999) * 14.4",
          "reducer": "last"
        }
      },
      {
        "refId": "C",
        "type": "threshold",
        "evaluator": {
          "type": "gt",
          "params": ["B"]
        }
      }
    ],
    "execErrState": "alerting",
    "noDataState": "no_data",
    "for": "5m",
    "annotations": {
      "summary": "SLO Error Budget Fast Burn",
      "description": "Error ratio over the last hour is {{ $values.A.Value }}, above the 14.4x burn-rate threshold. At a sustained 14.4x burn rate, a 30-day budget is exhausted in about 2 days. Immediate investigation required."
    },
    "labels": {
      "severity": "critical",
      "team": "platform"
    }
  }
}

Observability Hooks for Prometheus & Grafana

This section defines what to log, measure, trace, and alert for Prometheus and Grafana themselves.

Log (What to Emit)

| Event | Fields | Level |
| --- | --- | --- |
| Prometheus started | version, instance, listen_address | INFO |
| TSDB compaction | duration_seconds, blocks_removed | DEBUG |
| Scrape failure | target, error, duration | WARN |
| Remote write failure | endpoint, error, retries | WARN |
| Alert triggered | alert_name, labels, eval_duration | INFO |
| Grafana dashboard saved | dashboard_id, folder, user | DEBUG |
| Alert notification sent | alert_name, receiver, status | INFO |

Measure (Metrics to Collect)

| Metric | Type | Description |
| --- | --- | --- |
| up | Gauge | 1 if the last scrape of a target succeeded, 0 if it failed |
| scrape_duration_seconds | Gauge | Duration of the last scrape per target |
| prometheus_tsdb_head_series | Gauge | Active series in the in-memory head block |
| prometheus_tsdb_compactions_total | Counter | Compaction operations |
| prometheus_remote_storage_samples_total | Counter | Samples sent to remote storage |
| prometheus_remote_storage_samples_failed_total | Counter | Samples that failed to send to remote storage |
| prometheus_notifications_sent_total | Counter | Alert notifications sent |
| grafana_api_response_status_total | Counter | API responses by status |
| grafana_dashboard_save_duration_seconds | Histogram | Dashboard save latency |
| grafana_alerting_active_alerts | Gauge | Currently active alerts |

Trace (Correlation Points)

| Operation | Trace Attribute | Purpose |
| --- | --- | --- |
| Scrape cycle | prometheus.scrape.job, prometheus.scrape.target | Monitor scrape health |
| TSDB write | prometheus.tsdb.write.samples | Track write performance |
| Query execution | prometheus.query.duration_seconds | Monitor query performance |
| Alert evaluation | prometheus.alert.eval_duration_seconds | Track alert latency |

Alert (When to Page for Prometheus/Grafana)

| Alert | Condition | Severity | Purpose |
| --- | --- | --- | --- |
| Prometheus Down | up{job="prometheus"} == 0 | P1 Critical | Monitoring offline |
| Prometheus OOM | process_resident_memory_bytes > 10GB | P1 Critical | Memory exhaustion |
| TSDB Head Growing | head_min_time < now - 7d | P2 High | Compaction lagging |
| Remote Write Failing | remote_write_failures > 5% | P1 Critical | Long-term data at risk |
| Scrape Target Down | up{job="node"} == 0 | P2 High | Infrastructure issue |
| Grafana Down | grafana_http_request_total{status="500"} > 100 | P1 Critical | Dashboards unavailable |
| Alert Storm | alertEvaluationDuration > 30s | P2 High | Alert logic problem |
| Query Latency | query_duration_seconds > 10s | P3 Medium | Performance degradation |

Prometheus & Grafana Observability Template

# prometheus-grafana-observability.yaml
groups:
  - name: prometheus-self-monitoring
    rules:
      # Prometheus instance down
      - alert: PrometheusDown
        expr: up{job="prometheus"} == 0
        for: 2m
        labels:
          severity: critical
          component: prometheus
        annotations:
          summary: "Prometheus instance {{ $labels.instance }} is down"
          description: "Prometheus monitoring is unavailable. All SLO dashboards are affected."

      # Prometheus memory pressure
      - alert: PrometheusHighMemory
        expr: process_resident_memory_bytes{job="prometheus"} / 1024 / 1024 / 1024 > 10
        for: 5m
        labels:
          severity: warning
          component: prometheus
        annotations:
          summary: "Prometheus memory usage above 10GB"
          description: "Prometheus is using {{ $value | humanize }}GB of memory. Risk of OOM."

      # TSDB head not compacting (prometheus_tsdb_head_min_time is in milliseconds)
      - alert: PrometheusTSDBHeadNotCompacting
        expr: (time() - prometheus_tsdb_head_min_time{job="prometheus"} / 1000) > 7 * 24 * 3600
        for: 1h
        labels:
          severity: warning
          component: prometheus
        annotations:
          summary: "Prometheus TSDB head has not compacted in 7 days"
          description: "TSDB head is growing unbounded. Check disk I/O and compaction settings."

      # High scrape failure rate
      - alert: PrometheusScrapeFailureRate
        # up is 1 when the last scrape of a target succeeded, so 1 - avg(up) is the failing share
        expr: 1 - avg(up) > 0.05
        for: 10m
        labels:
          severity: high
          component: prometheus
        annotations:
          summary: "More than 5% of scrape targets are failing"
          description: "{{ $value | humanizePercentage }} of scrape targets are currently failing."

      # Remote write failures
      - alert: PrometheusRemoteWriteFailing
        expr: |
          sum(rate(prometheus_remote_storage_samples_failed_total{job="prometheus"}[5m]))
          /
          sum(rate(prometheus_remote_storage_samples_total{job="prometheus"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          component: prometheus
        annotations:
          summary: "Prometheus remote write failure rate above 5%"
          description: "Metrics are not being backed up to long-term storage. Historical data at risk."

  - name: grafana-monitoring
    rules:
      # Grafana API errors
      - alert: GrafanaHighAPIErrorRate
        expr: |
          sum(rate(grafana_http_request_duration_seconds_count{status_code=~"5..",handler=~"/api/.*"}[5m]))
          /
          sum(rate(grafana_http_request_duration_seconds_count{handler=~"/api/.*"}[5m])) > 0.05
        for: 5m
        labels:
          severity: high
          component: grafana
        annotations:
          summary: "Grafana API error rate above 5%"
          description: "Grafana API is returning errors. Dashboards may be unavailable."

      # Grafana alert evaluation slow
      - alert: GrafanaAlertEvaluationSlow
        expr: grafana_alerting_rule_evaluation_duration_seconds{quantile="0.95"} > 30
        for: 5m
        labels:
          severity: warning
          component: grafana
        annotations:
          summary: "Grafana alert evaluation P95 above 30 seconds"
          description: "Alert evaluation is taking {{ $value }}s at P95. Risk of alert delays."

      # Active alerts increasing rapidly (delta() because active_alerts is a gauge, not a counter)
      - alert: GrafanaAlertStorm
        expr: delta(grafana_alerting_active_alerts[5m]) > 50
        for: 5m
        labels:
          severity: warning
          component: grafana
        annotations:
          summary: "Rapid increase in Grafana active alerts"
          description: "{{ $value }} new alerts activated in 5 minutes. Possible alert storm."

Prometheus Federation and Scalability

When your metric volume outgrows a single Prometheus server, federation gives you a hierarchical way to scale.

graph TB
    A[Global Prometheus] -->|Federate Aggregated| B[Regional Prometheus 1]
    A -->|Federate Aggregated| C[Regional Prometheus 2]
    A -->|Federate Aggregated| D[Regional Prometheus N]
    B --> E[Prometheus TSDB 1]
    C --> F[Prometheus TSDB 2]
    D --> G[Prometheus TSDB N]
    B --> H[Object Storage<br/>Thanos Sidecar]
    C --> I[Object Storage<br/>Thanos Sidecar]
    D --> J[Object Storage<br/>Thanos Sidecar]
    H --> K[Thanos Query]
    I --> K
    J --> K
    K --> L[Grafana]

Federation Configuration

# Global Prometheus - federate from regional
scrape_configs:
  - job_name: "federate-regional"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="high-level"}'
    static_configs:
      - targets:
          - regional-1:9090
          - regional-2:9090
          - regional-n:9090
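
To see what this job would pull from a regional server, it can help to build the same `/federate` request by hand and inspect the response. The sketch below only constructs the URL, using the target and matcher from the configuration above. `federate_url` is an illustrative helper, and plain HTTP on port 9090 is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(target: str, matchers: list[str]) -> str:
    """Build the /federate URL a global server requests from one regional target."""
    # Each selector is sent as its own match[] query parameter.
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"http://{target}/federate?{query}"


url = federate_url("regional-1:9090", ['{job="high-level"}'])
print(url)

# Round-trip check: the encoded query decodes back to the original selector.
print(parse_qs(urlsplit(url).query)["match[]"])
```

Fetching that URL with any HTTP client returns the matching series in the Prometheus text exposition format, which makes it easy to confirm that a matcher selects only aggregated series before adding it to the scrape config.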

Scalability Patterns

| Pattern | Use Case | Complexity | Trade-off |
| --- | --- | --- | --- |
| Larger Prometheus | <100K series | Low | Vertical limit |
| Federation | Hierarchical aggregation | Medium | Query fan-out |
| Thanos | Global view + long-term | High | Operational complexity |
| Cortex | Multi-tenant SaaS | Very High | Full managed service |

Long-Term Storage with Thanos

Thanos adds unlimited metric retention and global aggregation to Prometheus.

Thanos Components

graph LR
    A[Prometheus] -->|StoreAPI| B[Thanos Sidecar]
    B --> C[Object Storage]
    A[Prometheus] -->|Remote Write| D[Thanos Receive]
    D --> C
    C --> E[Thanos Query]
    E --> F[Grafana]
    E --> G[Thanos Rule]
    G --> C

Thanos Sidecar Configuration

# prometheus.yml with Thanos sidecar
global:
  external_labels:
    cluster: prod-us-east
    region: us-east-1
# thanos-sidecar args
args:
  - --prometheus.url=http://localhost:9090
  - --tsdb.path=/prometheus
  - --objstore.config-file=s3-config.yaml
  - --shipper.upload-compacted

S3 Configuration for Object Storage

# s3-config.yaml
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.amazonaws.com
  region: us-east-1
  bucket_lookup_type: path  # Thanos equivalent of forcing path-style bucket addressing
  trace:
    enable: true

Thanos Query Federation

# Thanos Query configuration
apiVersion: thanos.io/v1alpha1
kind: ThanosQuery
metadata:
  name: thanos-query
spec:
  replicas: 2
  stores:
    - thanos-sidecar:10901
    - thanos-receive:10901
    - prometheus-operated:9090

Retention and Compaction

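# Thanos Compactor args: retention and downsampling are enforced by the compactor
# Prometheus keeps 2h blocks (--storage.tsdb.min-block-duration=2h and
# --storage.tsdb.max-block-duration=2h) so the sidecar can upload them unmodified
args:
  - compact
  - --wait
  - --objstore.config-file=s3-config.yaml
  - --retention.resolution-raw=90d
  - --retention.resolution-5m=90d
  - --retention.resolution-1h=90d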

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Prometheus OOM from high cardinality | Metrics dropped; monitoring gaps | Limit label values; use recording rules; segment by service |
| Target scrape timeout | Missing metrics; gaps in data | Optimize query performance; adjust scrape timeout; increase resources |
| Alertmanager down | No alerts delivered; extended outages | Configure redundant Alertmanager instances; test alert delivery |
| Prometheus TSDB corruption | Historical data loss | Regular snapshots; replicated storage (Thanos, Cortex); backup recovery |
| Grafana dashboard database corruption | Lost dashboards | Use dashboard provisioning; store in Git; regular backups |
| Remote write failures | Metrics not backed up to long-term storage | Implement local buffering; retry with backoff; monitor remote write queue |

Real-World Failure Scenario: Prometheus OOM from Cardinality Explosion

A team at a mid-size SaaS company discovered their Prometheus instance consuming 40GB of memory and crashing twice daily. The culprit: a developer had added user_id as a label to an HTTP request counter.

On a platform with 500,000 active users, this created 500,000 unique time series for a single metric. Combined with multiple endpoints and methods, the metric cardinality exploded to over 2 million series.

The memory exhaustion caused Prometheus to crash, which meant no monitoring data was collected during the outage periods. The team had no visibility into what caused the outage itself—a classic monitoring blind spot.

The fix: remove high-cardinality labels, pre-aggregate user-level metrics into lower-cardinality dimensions (user_type, plan_tier), and add cardinality monitoring to Prometheus itself.

Lesson: Test metric label cardinality under realistic load before deploying. Monitor prometheus_tsdb_head_series (active series) and prometheus_tsdb_symbol_table_size_bytes, and set cardinality alerts before you run out of memory.
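
The explosion in this scenario is multiplicative: the worst-case series count for one metric is the product of the distinct values of each label. A rough sketch of that estimate follows. The endpoint, method, user_type, and plan_tier counts are hypothetical; only the 500,000 users figure comes from the scenario.

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


def labels_over_limit(label_values: dict[str, int], limit: int) -> list[str]:
    """Labels whose distinct-value count exceeds a per-label limit."""
    return sorted(name for name, count in label_values.items() if count > limit)


# Hypothetical label counts before and after the fix.
before = {"user_id": 500_000, "endpoint": 2, "method": 2}
after = {"user_type": 3, "plan_tier": 4, "endpoint": 2, "method": 2}

print(worst_case_series(before))             # 2000000
print(worst_case_series(after))              # 48
print(labels_over_limit(before, limit=100))  # ['user_id']
```

Running an estimate like this in code review, with a per-label limit agreed by the team, catches an unbounded label before it reaches production. The real series count is usually lower than the product because not every combination occurs.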

Real-World Failure Scenario: Grafana Dashboard SQL Timeout Taking Down Shared API

A popular Grafana dashboard was querying a PostgreSQL datasource with a poorly optimized query that performed a full table scan on 50 million rows. The query took 45 seconds to execute and locked rows that other queries needed.

Because multiple teams shared the same datasource API, the slow dashboard query created a backlog that caused 200ms latency on all other queries using the same connection pool. Eventually, the connection pool was exhausted and the datasource API started returning 503 errors to all dashboards—not just the one with the bad query.

The on-call engineer spent 3 hours identifying which dashboard was causing the issue because there was no per-dashboard query performance monitoring.

Lesson: Implement per-dashboard query timeout limits in Grafana. Add datasource query performance monitoring. Set up alerting on datasource response latency. Configure connection pool limits per dashboard or team to prevent noisy neighbor problems.

Common Pitfalls / Anti-Patterns

Common Pitfalls

1. Label Cardinality Explosion

Unbounded label values cause memory exhaustion:

# Bad: User ID as label (millions of values)
http_requests_total{user_id="usr_123456"}

# Good: Aggregate, or use low-cardinality labels
http_requests_total{user_type="premium"}  # Count by type instead
# Or track user metrics separately

2. Querying Raw Metrics in Dashboards

Raw high-cardinality metrics slow down dashboards:

# Bad: Query raw metrics at visualization time
sum(rate(http_requests_total{service="api"}[5m])) by (user_id)

# Good: Query a pre-aggregated recording rule directly (rule name is illustrative)
service:http_requests:rate5m{service="api"}

3. Missing Metric Labels for Debugging

Labels should enable useful filtering:

# Bad: Too few labels
http_requests_total: 1000

# Good: Labels that enable useful debugging
http_requests_total{service="api-gateway", method="GET", endpoint="/api/users", status="200"}

4. Alerting Without Severity Classification

Putting every alert at the same severity causes alert fatigue:

# Good: Severity classification
- alert: ServiceDown
  labels:
    severity: critical
- alert: HighLatency
  labels:
    severity: warning
- alert: MetricScrapeLag
  labels:
    severity: info

5. No Alert Routing Testing

Alerts that never fire in production may have broken routing:

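# Test alert routing (the v1 API has been removed from current Alertmanager releases; use v2)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"critical"}}]'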
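
The same routing test can be scripted so it runs on a schedule or in CI. This is a minimal sketch using only the standard library, assuming the Alertmanager v2 endpoint `/api/v2/alerts` at the hostname used above. `build_test_alert` and `build_request` are illustrative helpers, and the request is built but not sent.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname: str, severity: str, minutes: int = 5) -> list[dict]:
    """One synthetic alert that expires on its own after a few minutes."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"summary": "Routing test, safe to ignore"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def build_request(base_url: str, alerts: list[dict]) -> urllib.request.Request:
    """POST request carrying the alerts as a JSON array."""
    return urllib.request.Request(
        url=f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "http://alertmanager:9093",
    build_test_alert("TestAlert", "critical"),
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=5)
```

Setting `endsAt` a few minutes ahead lets the test alert resolve by itself, so a forgotten test does not keep paging the receiver.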

6. Ignoring Recording Rules

Querying raw high-resolution data at visualization time is slow:

# rules.yml - recording rules (loaded through rule_files in prometheus.yml)
groups:
  - name: api_service_rules
    interval: 30s
    rules:
      - record: apigw:request_rate:5m
        expr: |
          sum(rate(http_requests_total{service="api-gateway"}[5m])) by (service)

Interview Questions

1. What is the difference between Prometheus pull-based model and traditional push-based monitoring?
  • Pull model: Prometheus scrapes targets at configured intervals. Simpler to manage with dynamic infrastructure since you only need to configure which endpoints to hit.
  • Push model: Agents send metrics to a central server. Better for short-lived jobs or fire-and-forget workloads that cannot wait to be scraped.
  • Pull advantages: Easier target discovery via service discovery integrations, no need to install or configure agents on every service, and you can verify exactly which targets are being monitored by hitting the scrape endpoint directly.
  • Push advantages: Works well behind firewalls or NAT, handles ephemeral services better, and reduces the attack surface of target services since nothing needs to listen for incoming scrape requests.
2. Explain the four Prometheus metric types and when you would use each.
  • Counter: Cumulative, only increases (or resets to 0 on restart). Use for request counts, error counts, task completions
  • Gauge: Can go up or down. Use for current values like memory usage, queue depth, temperature, active connections
  • Histogram: Buckets observations for calculating quantiles. Use for request durations, response sizes. Provides _sum, _count, and _bucket suffixes
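  • Summary: Calculates quantiles client-side over a sliding time window. Use when you need a specific quantile per instance and do not need to aggregate across instances; precomputed quantiles cannot be meaningfully averaged or summed, and the quantiles are fixed at instrumentation time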
3. How do you prevent label cardinality explosion in Prometheus?
  • Never use high-cardinality values as labels: user IDs, session IDs, request IDs. These multiply the number of time series stored.
  • Aggregate high-cardinality data before labeling (count by user_type instead of user_id)
  • Use recording rules to pre-aggregate frequently queried metrics
  • Implement federation to segment metrics by service or cluster
  • Use metrics relabeling to drop unwanted labels early in the pipeline
  • Monitor prometheus_tsdb_head_series, prometheus_tsdb_symbol_table_size_bytes, and memory usage to detect issues early
4. What is PromQL and explain the difference between `rate()` and `increase()` functions.
  • PromQL is Prometheus Query Language for selecting and aggregating time-series data
  • rate(): Calculates per-second rate of change. Best for dashboards and alerting because it normalizes to a consistent time unit regardless of query window
  • increase(): Calculates total increase over the time range. Better for reporting total counts over a period
  • Note: rate() should be used for alerting and visualization; increase() for batch reporting
  • Both handle counter resets automatically
5. How does histogram_quantile() work in PromQL?
  • Calculates approximate quantile from histogram buckets using linear interpolation
  • Requires histogram metric with _bucket suffixes and a le (less than or equal) label
  • Takes a value between 0 and 1 (e.g., 0.95 for 95th percentile)
  • Results are approximate because of bucket granularity and interpolation
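  • Must apply rate() to the _bucket series first, and aggregate by le when combining series: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
  • For tighter accuracy, add buckets around the values you care about; a summary avoids bucket interpolation but its quantiles cannot be aggregated across instances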
6. What are recording rules and why are they important?
  • Pre-compute frequently needed queries and store results as new time series
  • Reduce query load at visualization time, especially for complex aggregations
  • Essential for dashboard performance when querying high-resolution raw data
  • Example: Instead of calculating error rate at visualization time, store it as a recording rule that runs every 30s
  • Named using the level:metric:operations convention (e.g., apigw:http_requests:rate5m)
  • Critical for SLO calculations that need to run over long time windows
7. Explain the relationship between SLI, SLO, and SLA in monitoring.
  • SLI (Service Level Indicator): The quantitative measure of service behavior (latency, availability, error rate)
  • SLO (Service Level Objective): The target value or range for the SLI (e.g., 99.9% availability, p99 latency under 500ms)
  • SLA (Service Level Agreement): The contractual commitment to customers, usually looser than internal SLOs so the team has room to react before a breach
  • Error budget = 1 - SLO, represents how much unreliability is acceptable
  • SLIs should be measured continuously; SLOs drive alerting; SLAs have business consequences
8. How does multi-window burn-rate alerting work and why is it useful?
  • Burn-rate alerting detects SLO violations faster than traditional threshold alerts
  • Uses multiple time windows (1h, 6h, 3d) with different burn-rate multipliers (14.4x, 6x, 3x)
  • Fast burn (1h window, 14.4x): Catches severe outages quickly; a 30-day budget would be exhausted in ~2 days
  • Medium burn (6h window, 6x): Catches sustained degradation; budget exhaustion in ~5 days
  • Slow burn (3d window, 3x): Catches gradual issues; budget exhaustion in ~10 days
  • Multi-window alert fires if ANY window exceeds its threshold, catching both fast and slow burns
9. What is the purpose of the Pushgateway in Prometheus and when should you avoid using it?
  • The Pushgateway receives metrics from short-lived batch jobs that cannot be scraped (jobs that complete before the next scrape interval)
  • Use when: batch jobs are too short-lived to be scraped, scheduled jobs that run infrequently, CI/CD pipeline jobs
  • Avoid using for long-running services (use direct scraping instead)
  • The Pushgateway can become a single point of failure and a cardinality bottleneck if not managed properly
  • Never expose the Pushgateway publicly; it has no authentication by default
  • Prefer the pull model for most services; the Pushgateway is a last resort for ephemeral workloads
10. How do you monitor Prometheus itself and what critical self-monitoring metrics should you track?
  • Monitor Prometheus uptime with up{job="prometheus"} metric
  • Track TSDB health: prometheus_tsdb_head_series (active series, the main driver of memory), prometheus_tsdb_compactions_total (compaction activity)
  • Scrape health: up and scrape_duration_seconds per target
  • Query performance: prometheus_engine_query_duration_seconds for slow queries
  • Alert evaluation: prometheus_rule_evaluation_duration_seconds for alert lag
  • Remote write backlog if using Thanos Receive or Cortex: prometheus_remote_storage_samples_pending
  • Self-monitoring alerts: PrometheusDown, PrometheusHighMemory, PrometheusScrapeFailureRate
11. What is Prometheus federation and when would you use it?
  • Federation lets Prometheus servers scrape from other Prometheus servers, creating a hierarchy
  • Segment metrics by region, team, or service to stay within single-server scalability limits
  • Hierarchical federation scrapes aggregated metrics from lower-level servers; cross-service federation pulls selected metrics from another service's Prometheus server
  • Set `honor_labels: true` to keep metric labels from remote servers intact
  • Typical setup: global Prometheus aggregates high-level metrics; regional servers keep full-resolution data
  • Federation is not a replacement for long-term storage like Thanos or Cortex
12. How does Prometheus handle high availability and what are the limitations?
  • Prometheus has no native HA; you run multiple replicas with identical config
  • Each replica scrapes independently, so you get duplicate data and duplicated alerts unless you handle it
  • Alertmanager instances run as a cluster and deduplicate notifications for identical alerts sent by the replicas
  • Thanos and Cortex add HA by replicating data across stores with eventual consistency
  • Grafana does not deduplicate replica data itself; point it at one replica behind a load balancer or at Thanos Query, which deduplicates by replica label
  • For real HA, look at Thanos sidecar mode or Cortex with consistent hashing
13. What is the difference between recording rules and alerting rules?
  • Recording rules pre-compute queries and store results as new time series, making dashboards faster
  • Alerting rules check conditions and fire alerts via Alertmanager when something is wrong
  • Recording rules are about query performance; alerting rules are about operational response
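  • Both rule types are evaluated at the rule group's interval (the global `evaluation_interval` by default); recording rules persist results as new series, while alerting rules track pending and firing state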
  • Use recording rules for SLO calculations, error rates, and dashboard latency percentiles
  • Use alerting rules for threshold violations, service downtime, and error budget burns
14. Explain the Prometheus TSDB (Time Series Database) structure and how it stores data.
  • TSDB is Prometheus's custom time-series database, purpose-built for Prometheus 2.x (Prometheus 1.x used LevelDB for its indexes)
  • Data goes into 2-hour blocks on disk, each with compressed sample chunks
  • Head block holds recent uncompacted data for real-time writes
  • Each sample is a (timestamp, value) pair; series are identified by metric name and labels
  • WAL (Write-Ahead Log) handles crash recovery; checkpointing cuts replay time
  • Compaction merges older blocks; retention deletes blocks past the retention period
15. What are the trade-offs between Prometheus pull model and push-based alternatives?
  • Pull model: service discovery makes management simpler, targets are easy to verify, no agents to deploy
  • Pull model downside: cannot reach devices behind firewalls, poor fit for short-lived jobs
  • Push model: works behind NAT and firewalls, good for ephemeral services, the sender controls when data is delivered and can retry
  • Push model downside: agents on every service, harder inventory of what is monitored, resource overhead
  • Hybrid: Pushgateway for short-lived batch jobs, direct scraping for long-running services
  • OpenTelemetry uses push but can feed a Collector that Prometheus scrapes (through the Collector's Prometheus exporter)
16. How do you implement long-term metric storage with Thanos or Cortex?
  • Thanos runs a sidecar next to Prometheus and uploads TSDB blocks to object storage (GCS, S3)
  • Thanos Query federates metrics across Prometheus instances for a global view
  • Cortex is a horizontally scalable multi-tenant Prometheus-as-a-Service using consistent hashing
  • Both give you unlimited retention, global aggregation, and HA
  • Trade-offs: more complexity, eventual consistency delays, storage costs, higher query latency
  • Thanos is simpler to adopt; Cortex works better for multi-tenant SaaS products
17. What are the key Grafana dashboard best practices for maintainability?
  • Templating with variables makes panels reusable across services
  • Folder structure should mirror your service topology
  • Annotations mark deployments and events on time-series graphs
  • Stat and gauge panels beat graphs for single-KPI dashboards
  • Row collapsing cuts initial load time for complex dashboards
  • Provision dashboards via Git for change tracking and easy rollback
  • Links between dashboards enable drill-down navigation
18. How do you handle metric naming conventions and labeling best practices?
  • Name format: an application or domain prefix, then what is measured, then the unit as a suffix, like `http_request_duration_seconds`
  • Labels enable filtering; avoid high-cardinality values like user_id or session_id
  • Only add version or environment labels when actually needed for debugging
  • Exporter metric names should be prefixed by the exporter (e.g., `node_memory_MemFree_bytes`)
  • Dynamic label values that grow unbounded are a cardinality problem waiting to happen
  • Counters get `_total` suffix; latency histograms get `_seconds`
19. What is the difference between histogram and summary metric types in Prometheus?
  • Histogram buckets observations and calculates quantiles server-side using `histogram_quantile()`. Results are approximate because of bucket granularity
  • Summary calculates quantiles on the client side before export. No bucket-interpolation error, but quantiles from different instances cannot be aggregated
  • Histogram buckets can be aggregated across instances and any quantile computed at query time; summary quantiles are fixed at instrumentation time
  • Histogram cardinality grows with buckets; summary cardinality grows with quantiles
  • Use histogram when you need flexible or aggregated quantiles and plan to use `histogram_quantile()`
  • Use summary when you need a fixed quantile per instance and do not need to aggregate across instances
20. How do you set up and configure alerting with Alertmanager inhibition and silencing?
  • Inhibition suppresses alerts when other alerts are already firing (e.g., suppress all alerts when an instance is down)
  • Set up inhibition with `inhibit_rules` that match source and target alert labels
  • Silencing mutes notifications for matching alerts during a time window, useful for maintenance; the alerts still fire in Prometheus
  • Create silences with `amtool` CLI or the Alertmanager API using matcher rules
  • Route grouping sends alerts to receivers based on labels; set `continue: true` for multiple matches
  • Test routing with `amtool config routes test`, or send a test alert with `amtool alert add` or the API
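
Question 5 describes `histogram_quantile()` as linear interpolation over cumulative buckets. The sketch below is a simplified Python version of that idea for classic histograms, not the Prometheus implementation: it assumes positive bucket bounds, treats the lowest bucket as starting at 0, and takes `(upper_bound, cumulative_count)` pairs as input. The bucket values in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan  # need at least one finite bucket plus the +Inf bucket
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        return buckets[-2][0]  # beyond the last finite bound: report that bound
    lower, prev_count = (0.0, 0.0) if index == 0 else buckets[index - 1]
    if count == prev_count:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical latency histogram: le=0.1, 0.2, 0.5, +Inf
example = [(0.1, 50), (0.2, 90), (0.5, 99), (math.inf, 100)]
print(round(histogram_quantile(0.95, example), 4))  # 0.3667
print(histogram_quantile(0.5, example))             # 0.1
```

The 95th percentile lands in the 0.2 to 0.5 bucket, and the estimate depends entirely on where the bucket boundaries sit. This is why bucket layout matters more than bucket count for quantile accuracy.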

Conclusion

Key Takeaways:

  • Prometheus uses pull-based scraping; configure service discovery
  • Four metric types: Counter (increases), Gauge (varies), Histogram (buckets), Summary (quantiles)
  • PromQL enables powerful aggregation, but pre-aggregate with recording rules
  • Grafana visualizes Prometheus data and manages alerting
  • Alertmanager routes alerts with grouping, inhibition, and silencing
  • Monitor Prometheus itself: scrape health, TSDB size, query performance

Copy/Paste Checklist:

# Counter metric
http_requests_total{method="GET", endpoint="/api", status="200"}

# Gauge metric
room_temperature_celsius{room="office"} 23.5

# Histogram metric
http_request_duration_seconds_bucket{le="0.1"}

# Recording rule
- record: service:error_rate:5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))

# Alert rule
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical

# Alert routing (Alertmanager)
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack

Observability Checklist

Prometheus Metrics Coverage

  • Request rate (http_requests_total by method, endpoint, status)
  • Request latency (http_request_duration_seconds histogram)
  • Error rate (5xx responses as ratio of total)
  • Active connections or in-flight requests
  • Saturation metrics (queue depth if applicable)

Infrastructure Metrics

  • CPU usage per service
  • Memory usage and fragmentation
  • Disk I/O and storage usage
  • Network throughput
  • Container restart counts
  • OOM kill events

Alerting Rules Checklist

  • Service-level SLO alerts (availability, latency)
  • Resource exhaustion warnings (CPU, memory, disk >80%)
  • Dependency health (database, cache, external APIs)
  • Pipeline health (scrape success, remote write queue)
  • Error budget burn rate alerts

Recording Rules Checklist

  • Pre-aggregate frequently queried metrics
  • Calculate SLO ratios (error rate, availability)
  • Create service-level metrics from raw instrumentation
  • Define SLI metrics for dashboards

Security Checklist

  • Prometheus /metrics endpoint not publicly accessible
  • Alertmanager notifications sanitized (no secrets)
  • Grafana authentication enabled (OAuth/LDAP preferred)
  • Grafana dashboard permissions scoped by team
  • API keys for Alertmanager stored securely
  • TLS configured for all endpoints
  • Scraping credentials stored in Kubernetes secrets
  • Remote write uses TLS and authentication
  • No sensitive data in metric labels or annotations

Prometheus and Grafana together provide complete metrics observability. Instrument your applications with counters, gauges, and histograms. Configure Prometheus to scrape and store these metrics. Use PromQL to analyze trends and calculate SLOs. Build Grafana dashboards for real-time monitoring and alerting.

For logs and tracing correlation, see our ELK Stack and Distributed Tracing guides. For building complete monitoring pipelines with alerting, see the Metrics, Monitoring & Alerting guide.
