Database Monitoring: Metrics, Tools, and Alerting

Keep your PostgreSQL database healthy with comprehensive monitoring. This guide covers query latency, connection usage, disk I/O, cache hit ratios, and alerting with pg_stat_statements and Prometheus.

A database without monitoring is flying blind. You will not know you are out of disk space until writes start failing. You will not know queries are slowing down until users complain. You will not know replication lag is growing until you have a replica that is hours behind.

This guide covers the essential metrics to track, tools for collecting them, and how to set up alerting that actually helps rather than just creating noise.

Key Metrics to Track

Not all metrics are equally important. Focus on the ones that directly indicate user-facing problems or impending failures.

Query Latency

Query latency is the most user-visible metric. Slow queries mean slow page loads, frustrated users, and potential timeouts.

Track these latency percentiles:

  • p50 (median): the typical response time; half of requests are faster
  • p95: the latency that 1 in 20 requests exceeds
  • p99: the latency that 1 in 100 requests exceeds, the tail that drives complaints
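To make the definitions concrete, here is a minimal nearest-rank percentile calculation in Python (the sample latencies are illustrative):

```python
# Nearest-rank percentile over raw latency samples (milliseconds).
def percentile(samples, p):
    """Return the p-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(1, -(-len(ordered) * p // 100))  # ceil(p/100 * N) via floor division
    return ordered[k - 1]

# Two slow outliers dominate the tail but barely move the median.
latencies = [12, 15, 14, 13, 250, 16, 14, 13, 15, 900]
print(percentile(latencies, 50))  # 14
print(percentile(latencies, 95))  # 900
print(percentile(latencies, 99))  # 900
```

Note that the mean of this sample is about 126 ms, nowhere near the median of 14 ms: averages hide exactly the tail behavior that p95 and p99 expose.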
The query below surfaces the most expensive statements:

-- Check query latency from pg_stat_statements
SELECT
  query,
  calls,
  total_exec_time / calls AS avg_ms,
  stddev_exec_time AS stddev_ms,
  min_exec_time,
  max_exec_time,
  rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

The pg_stat_statements extension must be enabled:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

PostgreSQL’s pg_stat_statements reports totals, averages, and min/max per statement, not per-call latency percentiles. For true percentiles, consider pg_stat_monitor (covered below) or application-side latency histograms.

Connection Usage

PostgreSQL has a finite connection limit. When connections are exhausted, new connections fail. This is a hard failure that affects users directly.

-- Current connection status
SELECT
  state,
  COUNT(*) AS count
FROM pg_stat_activity
GROUP BY state;
-- Connection limits and usage
SELECT
  (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections,
  (SELECT COUNT(*) FROM pg_stat_activity) AS current_connections,
  (SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active') AS active_connections;

Track connection usage as a percentage of max. If you are regularly above 70-80%, investigate: either raise max_connections or reduce connection usage through pooling.
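As a sketch, the utilization check might look like this in application code (the function name and exact thresholds are illustrative):

```python
# Classify connection utilization against max_connections.
# Thresholds follow the warning/critical guidance in this guide.
def connection_status(current, max_connections, warn=0.70, crit=0.85):
    """Return 'ok', 'warning', or 'critical' for connection utilization."""
    ratio = current / max_connections
    if ratio >= crit:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"

print(connection_status(72, 100))  # warning: above 70% of max
print(connection_status(40, 100))  # ok
```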

Disk I/O

Disk I/O bottlenecks are common database problems. Slow I/O means queries wait for data to come from disk rather than being served from memory.

-- Per-table I/O statistics
SELECT
  relname,
  heap_blks_read,
  heap_blks_hit,
  idx_blks_read,
  idx_blks_hit
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;

For aggregate I/O statistics:

SELECT
  pg_stat_database.datname,
  blks_read,
  blks_hit,
  ROUND(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

A cache hit ratio below roughly 95% indicates that your working set does not fit in memory, or that problematic queries are scanning large amounts of data.

Buffer Cache Hit Ratio

The buffer cache hit ratio measures how often PostgreSQL finds data in memory versus on disk. A low ratio means lots of disk I/O.

-- System-wide buffer cache hit ratio
SELECT
  pg_stat_database.datname,
  pg_stat_database.blks_hit,
  pg_stat_database.blks_read,
  ROUND(100.0 * pg_stat_database.blks_hit /
    NULLIF(pg_stat_database.blks_hit + pg_stat_database.blks_read, 0), 2) AS hit_ratio
FROM pg_stat_database
WHERE pg_stat_database.datname = current_database();

A hit ratio above 99% is excellent. Below 95% warrants investigation—either you need more memory, or queries are scanning too much data.
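One caveat: blks_hit and blks_read are cumulative counters since the last stats reset, so the lifetime ratio can hide a recent regression. A sketch of computing the ratio over the delta between two snapshots (sample numbers are illustrative):

```python
# Compute the cache hit ratio over the interval between two snapshots,
# rather than over the database's whole lifetime.
def hit_ratio_delta(prev, curr):
    """prev and curr are (blks_hit, blks_read) tuples; returns % or None."""
    hits = curr[0] - prev[0]
    reads = curr[1] - prev[1]
    total = hits + reads
    if total == 0:
        return None  # no block access during the interval
    return round(100.0 * hits / total, 2)

# Lifetime ratio is 99%, but the most recent interval tells another story.
prev = (99_000_000, 1_000_000)   # cumulative: 99% hit ratio overall
curr = (99_100_000, 1_050_000)   # interval: 100k hits vs 50k reads
print(hit_ratio_delta(prev, curr))  # 66.67
```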

Replication Lag

If you are using streaming replication (read replicas), monitor lag carefully. A replica that falls behind serves stale data to any reads routed to it, silently breaking the assumption that it mirrors the primary.

-- Check replication lag on replica
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- On primary, check replica status (PostgreSQL 10+)
SELECT
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS byte_lag,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

Lag can be measured in bytes (LSN difference) or in time; the time-based figures are more useful for judging user-facing impact. Replication lag above a few seconds can cause problems for applications that expect read-after-write consistency.

Slow Query Log Analysis

Identifying slow queries is the first step to fixing them. PostgreSQL can log slow queries automatically.

Configuring Slow Query Logging

-- Log queries taking longer than 100ms
ALTER SYSTEM SET log_min_duration_statement = 100;
SELECT pg_reload_conf();

This logs the query text and duration (and, for prepared statements, the bind parameters). Review logs regularly to find candidates for optimization.

Analyzing Slow Queries

Once you have slow query data, analyze patterns:

-- Top slowest queries by average time
SELECT
  LEFT(query, 100) AS query_preview,
  calls,
  total_exec_time / calls AS avg_ms,
  max_exec_time,
  rows / calls AS avg_rows
FROM pg_stat_statements
WHERE calls > 10
ORDER BY avg_ms DESC
LIMIT 20;
-- Most frequently called slow queries
SELECT
  LEFT(query, 100) AS query_preview,
  calls,
  total_exec_time / calls AS avg_ms,
  total_exec_time AS total_ms
FROM pg_stat_statements
WHERE calls > 100
ORDER BY total_exec_time DESC
LIMIT 20;

The first query finds queries that are slow even when called infrequently. The second finds queries called so often that even small slowdowns add up.
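Because pg_stat_statements counters are cumulative, diffing two snapshots isolates recent behavior. A sketch, assuming each snapshot row carries a queryid, call count, and total time (the row shape here is illustrative):

```python
# Diff two pg_stat_statements snapshots to get per-query average latency
# over the interval between them, slowest first.
def diff_snapshots(old, new):
    old_by_id = {r["queryid"]: r for r in old}
    out = []
    for r in new:
        prev = old_by_id.get(r["queryid"], {"calls": 0, "total_ms": 0.0})
        calls = r["calls"] - prev["calls"]
        total = r["total_ms"] - prev["total_ms"]
        if calls > 0:
            out.append({"queryid": r["queryid"], "calls": calls,
                        "avg_ms": round(total / calls, 2)})
    return sorted(out, key=lambda r: r["avg_ms"], reverse=True)

old = [{"queryid": 1, "calls": 1000, "total_ms": 5000.0}]
new = [{"queryid": 1, "calls": 1100, "total_ms": 9000.0},
       {"queryid": 2, "calls": 50, "total_ms": 100.0}]
print(diff_snapshots(old, new))  # query 1 averaged 40 ms over the interval
```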

Query Execution Plans

For slow queries, examine the execution plan:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT *
FROM orders
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 10;

Key things to look for:

  • Seq Scan on large tables: Often indicates missing indexes
  • High actual rows vs estimated rows: Statistics are stale, run ANALYZE
  • Nested Loop on large datasets: May benefit from hash join
  • Sort nodes with high memory: May need work_mem increase

Monitoring Tools

PostgreSQL includes built-in stats. For production monitoring, you will want something more comprehensive.

pg_stat_statements

This is the essential tool for query-level visibility.

Enable it in postgresql.conf (changing shared_preload_libraries requires a server restart):

shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = top

DataDog

DataDog provides PostgreSQL monitoring with minimal setup. The agent auto-discovers PostgreSQL metrics and provides dashboards out of the box.

Key DataDog metrics for PostgreSQL:

  • postgresql.connections: Current connections
  • postgresql.rows_fetched: Rows returned by queries
  • postgresql.rows_inserted/updated/deleted: Write rates
  • postgresql.buffer_cache_hit_ratio: Cache effectiveness
  • postgresql.active_temporary_files: Temporary file creation (indicates memory pressure)

DataDog integrates well with other monitoring data—correlating database metrics with application traces, logs, and infrastructure metrics in one place.

Prometheus and Grafana

For a more DIY approach, Prometheus + Grafana provides powerful monitoring. The postgres_exporter scrapes PostgreSQL metrics for Prometheus.

# prometheus.yml
scrape_configs:
  - job_name: "postgres"
    static_configs:
      - targets: ["localhost:9187"]

Common Prometheus queries for PostgreSQL (metric names as exposed by postgres_exporter):

# Connection utilization
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections)

# Cache hit ratio over the last 5 minutes
rate(pg_stat_database_blks_hit[5m])
  / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))

# Transactions per second
rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])

Grafana dashboards visualize these metrics. The official PostgreSQL dashboard template provides a reasonable starting point.

pg_stat_monitor

pg_stat_monitor is an improved version of pg_stat_statements that provides histogram data, allowing true percentile calculations.

CREATE EXTENSION pg_stat_monitor;
-- Top queries with response-time histograms (column names vary by version)
SELECT
  query,
  calls,
  total_exec_time,
  min_exec_time,
  mean_exec_time,
  max_exec_time,
  resp_calls
FROM pg_stat_monitor
ORDER BY total_exec_time DESC
LIMIT 20;

Like pg_stat_statements, the extension must be listed in shared_preload_libraries before CREATE EXTENSION will work. The resp_calls column holds call counts per latency bucket (bucket boundaries are configurable), so you can derive p90/p95/p99 latencies directly from the database, which is more useful than averages alone.

How the pieces fit together:

flowchart TD
    subgraph DB["PostgreSQL Internals"]
        SS[pg_stat_statements<br/>Query statistics]
        SA[pg_stat_activity<br/>Connection states]
        SB[pg_stat_bgwriter<br/>Write buffers]
        SR[pg_stat_replication<br/>Replica lag]
    end

    subgraph Collection["Collection Layer"]
        EX[postgres_exporter<br/>Prometheus metrics]
        DD[DataDog Agent<br/>Custom metrics]
        BC[pg_stat_monitor<br/>Percentile histograms]
    end

    subgraph Store["Storage & Visualization"]
        PROM[Prometheus<br/>Time-series DB]
        GRAF[Grafana<br/>Dashboards]
        DD2[DataDog<br/>Dashboards + APM]
    end

    subgraph Alert["Alerting"]
        AL[AlertManager<br/>Paging]
        PN[PagerDuty<br/>On-call routing]
    end

    DB --> EX
    DB --> DD
    DB --> BC
    SS --> EX
    SA --> EX
    SR --> EX
    EX --> PROM
    DD --> DD2
    BC --> PROM
    PROM --> GRAF
    DD2 --> AL
    PROM --> AL
    AL --> PN

When to Use / When Not to Use

Use built-in PostgreSQL stats (pg_stat_statements, pg_stat_activity) when you need quick visibility without extra setup, during initial investigation of an issue, or when evaluating whether to adopt a heavier monitoring stack.

Use pg_stat_monitor over pg_stat_statements when you need true latency percentiles (p90/p95/p99) directly from the database, since pg_stat_statements only gives averages and totals.

Use DataDog or Prometheus+Grafana when you need to correlate database metrics with application traces, infrastructure metrics, and logs in a single view, or when managing multiple database instances across environments.

Use custom metrics exported to Prometheus when you have specific business-level SLIs (like order placement latency) that standard database exports do not capture.

Do not rely solely on averages — a database with a p99 of 5 seconds looks fine if you only track the mean. Always use percentile-based alerting.

Do not set up alert rules for every metric you track. Alert fatigue is real — if everything triggers an alert, nothing gets attention.

Setting Up Alerts for Thresholds

Alerts should notify you of problems, not create noise that trains you to ignore them. Good alerts are actionable and specific.

Essential Alerts

Disk space critical

PostgreSQL cannot see filesystem usage directly, so monitor the mountpoint with OS-level tools (df, node_exporter, or your monitoring agent). From SQL you can track database size growth:

-- Database sizes, largest first
SELECT
  datname,
  pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

Alert threshold: warning at 80% disk usage, critical at 90%. At 90%, you might not have room for VACUUM or emergency operations.

Connection utilization high

Alert when connections exceed 80% of max. At this threshold, some capacity remains for burst traffic, but you should investigate.

Replication lag

Alert when replica lag exceeds 30 seconds. Beyond this, read-after-write consistency guarantees break down.

Long-running queries

Alert when queries exceed 60 seconds. These often indicate problems and can hold locks that block other queries.

Alert Anti-Patterns

Avoid alerting on:

  • CPU spikes that resolve quickly: Brief CPU spikes are normal during batch operations
  • Single slow queries: Track these via slow query log, alert only if they are increasing
  • Cache hit ratio drops: Brief fluctuations are normal; alert only on sustained drops

Alerting Best Practices

  1. Set reasonable time windows: Alert only when a condition persists for 5+ minutes, not on momentary spikes.

  2. Use multiple severity levels: Warning (investigate) vs Critical (respond now).

  3. Include context in alerts: The alert should say what is wrong, which system, and when it started.

  4. Alert on symptoms, not causes: “Database latency is high” is better than “Cache hit ratio is low”—the symptom is what affects users.

  5. Review and tune regularly: If you are ignoring an alert, either fix the underlying issue or adjust the threshold.
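Practice 1 (persistence windows) is simple to implement. A minimal sketch, with the five-minute hold time as a parameter:

```python
import time

# Fire only after the breach has held continuously for hold_seconds.
# Any recovery resets the timer, so momentary spikes never page anyone.
class PersistentAlert:
    def __init__(self, hold_seconds=300):
        self.hold_seconds = hold_seconds
        self.breach_started = None

    def observe(self, breached, now=None):
        """Feed one sample; returns True once persistence is reached."""
        now = time.monotonic() if now is None else now
        if not breached:
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.hold_seconds

alert = PersistentAlert(hold_seconds=300)
print(alert.observe(True, now=0))     # False: breach just started
print(alert.observe(False, now=180))  # False: recovered, timer resets
print(alert.observe(True, now=200))   # False: new breach window opens
print(alert.observe(True, now=500))   # True: breached for 300s straight
```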

Building a Monitoring Dashboard

A good database dashboard shows the key metrics at a glance:

  1. Connection count (current and trend)
  2. Query latency (p50, p95, p99)
  3. Transactions per second
  4. Cache hit ratio
  5. Replication lag (if applicable)
  6. Disk usage (current and trend)
  7. Active queries (top offenders)
  8. Locks (blocked queries)

Most monitoring tools provide templates. Build from there based on your specific alerts.

Production Failure Scenarios

Failure | Cause | Mitigation
Monitoring blind spot during incident | Alerts set only on CPU/disk, not on query latency spikes | Alert on p95/p99 latency and query error rate, not just infrastructure metrics
Alert fatigue desensitizes team | Hundreds of warnings fire daily for minor fluctuations | Raise thresholds; use composite alerts (multiple conditions must fire together)
Stale pg_stat_statements entries mask a fix | Old executions dominate cumulative averages after a query is optimized | Run pg_stat_statements_reset() after fixing a query to clear stale entries
Monitoring dashboard lies about freshness | Metric export interval is 60s but dashboard refreshes every 5s | Verify metric scrape intervals match dashboard expectations
Lock monitoring missed cascading failure | pg_locks view not monitored; deadlocks cascade across tables | Monitor pg_stat_activity.state = 'active' together with wait_event_type; set deadlock detection alerts

Trade-Off Table: Monitoring Tools

Dimension | pg_stat_statements | auto_explain | DataDog | Prometheus + Grafana
Setup effort | None (built-in) | Low | Medium | Medium-High
Latency percentiles | Averages only | N/A | Yes (histograms) | Yes (histograms)
Query plan analysis | No | Yes (actual plans) | Via integration | Via exporter
Cross-service correlation | No | No | Yes (APM + infra) | Yes (if metrics aligned)
Alerting | No (external tool needed) | No | Yes (built-in) | AlertManager required
Cost | Free | Free | Per-host pricing | Free (self-hosted)
Best for | Initial investigation, tuning | Plan-level debugging | Production observability | Large-scale infrastructure

Real-World Case Study: PagerDuty’s PostgreSQL Monitoring Setup

PagerDuty, a company that builds incident management software, runs PostgreSQL as a core operational database. Their engineering team has written extensively about their PostgreSQL monitoring architecture.

What they built: PagerDuty’s monitoring stack uses Prometheus exporters for infrastructure metrics (CPU, disk, network), combined with a PostgreSQL-specific exporter that captures pg_stat_statements, pg_stat_activity, pg_stat_user_tables, replication lag, and vacuum activity. They store all metrics in Prometheus and visualize in Grafana with per-team dashboards.

Key monitoring patterns they emphasize:

  • Cardinality management: PagerDuty runs pg_stat_statements tracking top 10,000 queries by total time, but they avoid high-cardinality labels (like individual user IDs) on PostgreSQL metrics to prevent Prometheus cardinality explosions. They aggregate user-level data into buckets (e.g., “heavy user” vs “light user”) rather than tracking per-user.
  • Alert fatigue prevention: They use a three-tier alert model. Tier 1: page immediately for customer-facing impact (error rate > 1%, p99 latency > 5s). Tier 2: notify Slack channel for degraded performance (cache hit ratio < 90%, replication lag > 10s). Tier 3: ticket created for trend analysis (disk growth rate > 10%/week, slow query count increasing). This prevents page fatigue while ensuring real incidents get immediate attention.
  • Query-level observability: They use pg_stat_statements with pg_stat_statements.track = all and aggregate by normalized query text (literal values stripped). They tag each query with a role (api, worker, background) to isolate which component is causing issues.
  • Capacity-triggered scaling: When pg_stat_database.tup_returned drops below baseline by 20% for more than 5 minutes (suggesting table bloat or index degradation), they automatically trigger a vacuum or reindex job.

The lesson: The value of monitoring is not in having dashboards—it’s in having a clear escalation path from metric to action. PagerDuty’s three-tier alert model is a pattern any team can adopt: separate alerts by impact severity, assign different response protocols for each tier, and tune thresholds based on observed baseline rather than arbitrary values.

Interview Questions

Q1: Your PostgreSQL database shows a cache hit ratio of 99% but query latency has doubled over the past week. What do you investigate?

A: A high cache hit ratio with degraded latency suggests the bottleneck is not memory-related. Investigate in this order: First, check pg_stat_statements for new slow queries; a new query pattern may have emerged that was not present in the previous baseline. Second, check pg_stat_activity for lock contention: run SELECT * FROM pg_stat_activity WHERE wait_event_type IS NOT NULL to see if queries are waiting on locks. Third, check replication lag if using replicas: under synchronous replication, a lagging replica stalls commits on the primary, so write latency rises even though the primary itself is healthy. Fourth, check for index bloat using pgstatindex() (from the pgstattuple extension) on your most-used indexes: a bloated index has higher tree depth and requires more I/O per lookup even if the buffer cache is warm. Fifth, check disk I/O latency (iostat on the host): a saturated disk queue causes latency even with high cache hit ratios. Finally, check connection counts: if max_connections is nearly exhausted, clients waiting on a connection pooler see added latency before their queries even start.

Q2: How do you set meaningful alert thresholds for a PostgreSQL database?

A: Threshold setting is an iterative process. Start with theoretical limits: max_connections (e.g., 100), disk capacity (e.g., 500GB), shared_buffers (e.g., 8GB). Set initial thresholds at 70% (warning) and 85% (critical) of theoretical limits. Then refine based on observed baseline: capture metrics for 2-4 weeks to establish baseline behavior. Set warning thresholds at baseline plus 3 standard deviations for highly variable metrics (query latency), and at fixed percentages for monotonic resources (disk usage, connection count). Use composite alerts to reduce noise: fire only when both cache hit ratio drops below 90% AND transaction latency exceeds 500ms simultaneously. Review and adjust thresholds quarterly or after any significant load change.
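The baseline-plus-three-sigma rule and the composite alert from this answer can be sketched as follows (sample data and thresholds are illustrative):

```python
import statistics

# Warning threshold = baseline mean + 3 standard deviations.
def warning_threshold(baseline_samples, sigmas=3):
    mean = statistics.mean(baseline_samples)
    stdev = statistics.pstdev(baseline_samples)
    return mean + sigmas * stdev

# In practice this would be 2-4 weeks of p95 latency samples (ms).
baseline = [40, 42, 38, 41, 39, 43, 40, 41]
print(warning_threshold(baseline))  # 45.0

# Composite alert: both conditions must hold before firing.
def composite_alert(cache_hit_ratio, txn_latency_ms):
    return cache_hit_ratio < 0.90 and txn_latency_ms > 500

print(composite_alert(0.85, 650))  # True: both signals degraded
print(composite_alert(0.85, 120))  # False: latency alone still fine
```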

Q3: What’s the difference between pg_stat_statements, pg_stat_activity, and pg_stat_user_tables?

A: pg_stat_statements tracks query performance across all queries globally: total calls, total time, total I/O time, and cache hit ratio per normalized query. It requires the extension to be installed (CREATE EXTENSION pg_stat_statements) and does not track which session or user ran a query. pg_stat_activity is session-level: it shows every currently active connection, the query each session is running, wait events, transaction state, and query start time. Use it to diagnose currently running queries and lock contention in real time. pg_stat_user_tables (and pg_stat_sys_tables) tracks per-table statistics: sequential scans vs index scans, individual table hit/read/insert/update/delete rates, and vacuum/autovacuum activity. Use it to identify which tables are hot spots and whether indexes are being used effectively.

Q4: How would you monitor for a slow query that only occurs during peak traffic?

A: Configure pg_stat_statements with pg_stat_statements.track_utility = on to capture utility commands such as COPY and VACUUM alongside regular statements. Set a low log_min_duration_statement threshold (e.g., 100ms) during peak hours only, using a scheduled configuration change (for example, a cron job that runs ALTER SYSTEM and pg_reload_conf()). Capture pg_stat_statements snapshots hourly and diff them to identify queries whose total time increases disproportionately during peak periods. Look specifically at total_exec_time / calls (average per execution) rather than total time alone: a query that runs fast but 100,000 times per hour will show high total time regardless. Use pg_stat_activity filtered to state = 'active' AND query_start < now() - interval '5 minutes' during peak to catch queries that have been running for more than five minutes. For recurring issues, add auto_explain with auto_explain.log_min_duration = '500ms' to capture query plans for slow queries automatically.

Quick Reference: Alert Thresholds

Disk usage: Critical at 90%, Warning at 80%. At 90% there may be no room for VACUUM or emergency operations.

Connection utilization: Warning at 70% of max_connections, Critical at 85%. Investigate pool sizing or query efficiency before hitting limits.

Cache hit ratio: Warning below 95%, Critical below 90%. Sustained drops indicate queries scanning too much data.

Replication lag: Warning at 10 seconds, Critical at 30 seconds. Beyond 30 seconds read-after-write consistency breaks down.

Query latency: Set p95 alerts at 2-3x your baseline. Set p99 alerts at 5x baseline or at your SLA threshold.

Conclusion

Database monitoring is not optional. The metrics exist, the tools exist, and ignoring them leads to incidents that could have been prevented.

Start with the basics: track query latency, connections, disk usage, and cache hit ratios. Add slow query logging. Set up alerts for critical thresholds.

As you mature, add more sophisticated monitoring: query-level percentiles, detailed replication stats, correlation with application traces.

The goal is visibility. You cannot fix what you cannot see. Monitoring turns guesswork into data-driven decisions about where to optimize, when to scale, and how to prevent problems before they become incidents.

