Database Backup Strategies: Full, Incremental, and PITR
Picture this: it’s 2am. Your phone rings. Production data is gone—corrupted, deleted, inaccessible. The question that determines whether you have a bad night or a catastrophic week is one you’ve already answered: what’s your backup and recovery strategy?
This guide covers database backup approaches without the usual vendor fluff.
flowchart LR
subgraph BackupFlow["Backup Process"]
F1[("Full<br/>Backup")] --> F2[("WAL<br/>Archiving")]
F2 --> F3[("Incremental<br/>Backup")]
F2 --> F4[("Differential<br/>Backup")]
end
subgraph RestoreFlow["Restore Process"]
R1[("Full<br/>Restore")] --> R2[("Differential<br/>Restore")]
R1 --> R3[("Incremental<br/>Chain")]
R3 --> R4[("WAL<br/>Replay")]
R2 --> R4
R4 --> PITR[("Point-in-Time<br/>Recovery")]
PITR --> Ready[("Database<br/>Available")]
end
subgraph CloudSnap["Cloud Snapshots"]
S1[("Automated<br/>Snapshots")] --> S2[("Retention<br/>Policy")]
S2 --> S3[("Cross-Region<br/>Copy")]
S3 --> S4[("Restore from<br/>Snapshot")]
S4 --> Ready
end
We'll break down the trade-offs between full, incremental, and differential backups, show you how point-in-time recovery actually works, and help you figure out which recovery objectives make sense for your situation.
Backup Types Explained
Full Backups
A full backup captures the entire database at the time of backup start. Every data page, every transaction log segment, every system catalog entry—all of it written to backup media.
# PostgreSQL full backup example
pg_basebackup -h localhost -U postgres -D /backups/full -Ft -z -P
Advantages:
- Complete restoration in a single operation
- Simple to understand and verify
- No dependency on other backups for restore
- Self-contained disaster recovery
Disadvantages:
- Largest storage footprint
- Longest backup duration
- Maximum impact on production system I/O
- Cannot be applied incrementally
Full backups are the foundation of any backup strategy. You should always have at least one full backup as your recovery base.
Incremental Backups
An incremental backup captures only the data that has changed since the most recent backup of any type, whether that was a full backup or a previous incremental.
# PostgreSQL 17+: true incremental backup, referencing the previous backup's manifest
pg_basebackup -h localhost -U postgres -D /backups/incr_$(date +%Y%m%d_%H%M%S) -Ft -z -P --incremental=/backups/full/backup_manifest
Advantages:
- Minimal storage consumption
- Fast backup operations
- Low production system impact
- Frequent backups possible
Disadvantages:
- Restore requires chain of all backups from last full
- Complex restore procedures
- Backup chain corruption means data loss
- Longer recovery time
The key insight: incremental backups are delta changes only. To restore, you need the last full backup plus every incremental backup in sequence.
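To make the chain dependency concrete, here is a minimal sketch (a hypothetical `Backup` record, not any real tool's catalog format) that resolves the restore sequence: the most recent full backup plus every incremental taken after it, in order.

```python
from dataclasses import dataclass

@dataclass
class Backup:
    taken_at: int        # e.g. a Unix timestamp
    kind: str            # "full" or "incremental"

def restore_chain(backups: list[Backup]) -> list[Backup]:
    """Return the backups needed for a restore: the last full backup,
    then every incremental taken after it, in chronological order."""
    ordered = sorted(backups, key=lambda b: b.taken_at)
    # Find the most recent full backup; without one, restore is impossible
    last_full_idx = max(
        (i for i, b in enumerate(ordered) if b.kind == "full"),
        default=None,
    )
    if last_full_idx is None:
        raise ValueError("no full backup available; chain cannot be restored")
    # Full backup plus all incrementals that follow it
    return [ordered[last_full_idx]] + [
        b for b in ordered[last_full_idx + 1:] if b.kind == "incremental"
    ]

chain = restore_chain([
    Backup(1, "full"), Backup(2, "incremental"),
    Backup(3, "full"), Backup(4, "incremental"), Backup(5, "incremental"),
])
print([(b.taken_at, b.kind) for b in chain])
# [(3, 'full'), (4, 'incremental'), (5, 'incremental')]
```

Note that backups 1 and 2 are dead weight once backup 3 exists, which is exactly why retention policies expire old chains as a unit, never individual links.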
Differential Backups
A differential backup captures all changes since the last full backup—not since the last differential. This creates a growing but simpler restore chain.
# PostgreSQL has no built-in differential mode; tools such as pgBackRest provide one
# (the stanza name "main" is site-specific)
pgbackrest --stanza=main --type=diff backup
Advantages:
- Smaller than full backups
- Simpler restore than incremental (only need full + one differential)
- Moderate backup duration
- Good balance of storage and complexity
Disadvantages:
- Larger than incremental backups (grows over time)
- Backup duration increases as differential grows
- Still requires full backup + one differential to restore
Differential backups offer a middle ground: you trade some storage for restore simplicity.
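The storage side of that trade is easy to quantify. A sketch with illustrative numbers (one full backup per week, one backup per day; the 10GB/day change rate is an assumption, not a benchmark):

```python
def weekly_storage_gb(full_gb: float, daily_change_gb: float, strategy: str) -> float:
    """Storage for one week: a full backup on day 0, then six daily backups.
    Incrementals each hold one day's changes; differentials accumulate
    all changes since the full, so each day's backup is larger than the last."""
    if strategy == "incremental":
        dailies = daily_change_gb * 6                            # one delta per day
    elif strategy == "differential":
        dailies = sum(daily_change_gb * d for d in range(1, 7))  # grows each day
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return full_gb + dailies

# 500GB database changing 10GB/day (illustrative numbers)
print(weekly_storage_gb(500, 10.0, "incremental"))   # 560.0
print(weekly_storage_gb(500, 10.0, "differential"))  # 710.0
```

The differential week costs roughly 27% more storage here, but buys a two-step restore instead of a seven-step chain.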
Understanding Point-in-Time Recovery
Point-in-time recovery (PITR) allows you to restore a database to any specific moment—not just the moment when a backup completed. This is essential for recovering from user errors, partial data corruption, or transactions that shouldn’t have happened.
How PITR Works
PITR relies on Write-Ahead Logging (WAL). Every database modification is written to a sequential log file before being applied to data files. This log captures:
- Transaction begin and commit records
- Every row-level change (insert, update, delete)
- Page-level changes for efficiency
# PostgreSQL: configure WAL archiving in postgresql.conf
wal_level = replica
max_wal_senders = 3
archive_mode = on
archive_command = 'cp %p /archive/wal/%f'
To perform PITR:
- Restore the latest full backup to a recovery location
- Set the recovery target (time or transaction ID); in PostgreSQL 12+ these settings live in postgresql.conf, paired with an empty recovery.signal file (versions before 12 used a recovery.conf file)
- Replay archived WAL segments sequentially
- Database reaches target state and becomes available
# postgresql.conf (PostgreSQL 12+); touch recovery.signal to start recovery
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2026-03-26 14:30:00 UTC'
recovery_target_action = 'promote'
WAL Archiving Considerations
WAL archiving is the backbone of robust backup strategies. Key considerations:
Archive Storage: WAL segments accumulate rapidly under write load. Plan for significant storage—typically 1-2x your database size per day under heavy write workloads.
Archive Verification: Untested WAL archives are worthless. Regularly verify that archived WAL can actually restore.
# Dry run: list WAL segments older than the given restart file (omit -n to delete them)
pg_archivecleanup -n /archive/wal 00000001000000000000001C
# Confirm an archived segment is readable by dumping its records
pg_waldump /archive/wal/00000001000000000000001C | head
Continuous Archiving vs. Continuous Replication: PITR relies on archived WAL, while replication keeps a live standby. They solve different problems, and the latency, bandwidth, and complexity trade-offs differ:
| Method | Latency | Bandwidth | Complexity |
|---|---|---|---|
| WAL shipping | Minutes (bounded by archive_timeout) | Low | Low |
| Streaming | Seconds | Medium | Medium |
| Logical Replication | Near real-time | Higher | Higher |
RTO and RPO Planning
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are your contractual commitments to stakeholders about recovery capabilities.
Defining Your Objectives
RTO (Recovery Time Objective): Maximum acceptable downtime before systems are restored. This is a duration—typically hours or minutes.
RPO (Recovery Point Objective): Maximum acceptable data loss measured in time. If your RPO is 1 hour, you accept losing up to 1 hour of transactions.
These aren’t technical decisions—they’re business decisions. Work with stakeholders to define realistic objectives based on:
- Cost of downtime per hour
- Cost of data loss per transaction
- Regulatory requirements
- Customer expectations
Translating to Technical Requirements
Once you have RTO and RPO targets, translate them to technical specifications:
| RPO | Technical Requirement |
|---|---|
| 0 (no data loss) | Synchronous replication or synchronous commit |
| 15 minutes | Frequent WAL archiving + tested recovery |
| 1 hour | Hourly differential backups + WAL |
| 24 hours | Daily full backups |
| RTO | Technical Requirement |
|---|---|
| Minutes | HA architecture, automatic failover |
| Hours | Tested restore procedures, on-call staff |
| Days | Documented manual recovery procedures |
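The RPO half of that mapping reduces to simple arithmetic: worst case, a failure lands just before the next backup or WAL archive ships, so you lose up to one full interval of transactions. A trivial illustrative helper (not from any tool):

```python
def meets_rpo(archive_interval_minutes: float, rpo_minutes: float) -> bool:
    """True if a backup/archive cadence satisfies an RPO target.
    Worst-case data loss equals one full interval between recovery points."""
    return archive_interval_minutes <= rpo_minutes

# WAL shipped every 5 minutes against a 15-minute RPO target
print(meets_rpo(5, 15))        # True
# Daily full backups alone against a 1-hour RPO target
print(meets_rpo(24 * 60, 60))  # False
```

The second case is the common trap: a nightly backup schedule silently implies an RPO of up to 24 hours, whatever the documentation claims.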
The Math of Recovery
Calculate realistic recovery times:
def estimate_recovery_time(
    full_backup_size_gb: float,
    differential_size_gb: float,
    network_mbps: float,          # network throughput, megabits/s
    restore_speed_mbps: float,    # restore write throughput, megabits/s
    wal_replay_seconds: int
) -> dict:
    # Total data to move: the full backup plus the differential on top of it.
    # 1 GB = 8 * 1024 megabits, so units match the throughput figures.
    size_mbits = (full_backup_size_gb + differential_size_gb) * 8 * 1024
    transfer_time = size_mbits / network_mbps        # seconds to copy backups
    restore_time = size_mbits / restore_speed_mbps   # seconds to write data files
    total_rto_minutes = (transfer_time + restore_time + wal_replay_seconds) / 60
    return {
        "transfer_minutes": round(transfer_time / 60, 1),
        "restore_minutes": round(restore_time / 60, 1),
        "wal_replay_minutes": round(wal_replay_seconds / 60, 1),
        "total_rto_minutes": round(total_rto_minutes, 1)
    }
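The same arithmetic can be inverted: given an RTO budget, how large a backup can you afford to restore? A sketch under the same assumptions (throughput in megabits per second, transfer followed by a sequential restore):

```python
def max_restorable_gb(rto_minutes: float, network_mbps: float,
                      restore_speed_mbps: float, wal_replay_seconds: int) -> float:
    """Largest full-backup size (GB) that can be transferred and restored
    within the RTO budget, after subtracting WAL replay time."""
    budget_seconds = rto_minutes * 60 - wal_replay_seconds
    if budget_seconds <= 0:
        return 0.0
    # Transfer and restore happen sequentially; each GB costs 8192 megabits twice
    seconds_per_gb = 8 * 1024 / network_mbps + 8 * 1024 / restore_speed_mbps
    return round(budget_seconds / seconds_per_gb, 1)

# 60-minute RTO, 1Gbps network, 2Gbps restore throughput, 10 minutes of WAL replay
print(max_restorable_gb(60, 1000, 2000, 600))  # 244.1
```

If your compressed full backup is larger than this number, the RTO is fiction until you improve throughput, pre-stage the backup closer to the recovery environment, or shorten the replay window.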
Cloud Backup Solutions
AWS RDS
Amazon RDS provides automated backups with configurable retention:
# RDS backup configuration (AWS Console or CLI)
aws rds modify-db-instance \
--db-instance-identifier my-db-instance \
--backup-retention-period 30 \
--preferred-backup-window "03:00-04:00" \
--preferred-maintenance-window "sun:04:00-sun:05:00"
Key features:
- Automated daily backups with configurable retention (1-35 days)
- Point-in-time recovery to any second within retention period
- Manual snapshots for long-term retention
- Cross-region snapshot copying for disaster recovery
- Automated backup to S3 with encryption
Limitations:
- Restore time depends on snapshot size and network transfer
- PITR restores always create a new DB instance rather than restoring in place
- Cross-region copy latency for disaster recovery
Azure SQL
Azure SQL provides automated backups with geo-redundancy options:
# Azure SQL backup configuration
Set-AzSqlDatabaseBackupShortTermRetentionPolicy `
    -ResourceGroupName "my-resource-group" `
    -ServerName "my-server" `
    -DatabaseName "my-database" `
    -RetentionDays 30
Key features:
- Full backup every week, differential every 12 hours, log backup every 5-10 minutes
- Point-in-time restore within retention period (7-35 days)
- Long-term retention (LTR) for years
- Geo-restore with RA-GRS redundancy
- Active geo-replication for read scale and failover
Comparing Cloud Providers
| Feature | AWS RDS | Azure SQL | Google Cloud SQL |
|---|---|---|---|
| Automated Backups | Yes | Yes | Yes |
| PITR | Yes | Yes | Yes |
| Max Retention | 35 days | 35 days | 7 days default (configurable) |
| Cross-region Restore | Via snapshot copy | Via geo-restore | Via regional instances |
| LTR | Glacier integration | Up to 10 years | Cloud Storage |
When to Use / When Not to Use Each Backup Type
Full Backups — Use when:
- You need a standalone recovery point
- Recovery time is more important than backup time
- Storage is not a constraint
- You’re establishing a new backup baseline
Full Backups — Do not use when:
- Your database is very large (>10TB) and backup windows are tight
- You’re doing incremental backup chains
Incremental Backups — Use when:
- RPO targets are tight (under 1 hour)
- Storage costs must be minimized
- Backup window is limited
Incremental Backups — Do not use when:
- Backup chain complexity is an operational risk you can't absorb
- Fast recovery matters more to you than storage savings
Differential Backups — Use when:
- You need a middle ground between full and incremental
- Restore simplicity matters but storage savings still appeal
Differential Backups — Do not use when:
- Differential growth becomes nearly as large as a full backup
- Your RPO requires more granular recovery
Backup Type Trade-offs
| Dimension | Full Backup | Incremental Backup | Differential Backup |
|---|---|---|---|
| Backup size | Largest | Smallest | Grows over time |
| Backup speed | Slowest | Fastest | Moderate |
| Restore complexity | Simplest — one step | Complex — chain required | Moderate — full + one diff |
| Storage cost | Highest | Lowest | Moderate |
| RTO (restore time) | Lowest | Highest | Moderate |
| Best for | Small DBs, baseline | Large DBs, tight RPO | Medium RPO, moderate complexity |
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| WAL gap from missed archive | PITR restore impossible at exact point | Monitor archive status continuously, alert on gaps |
| Backup chain corruption | Full data loss — restore impossible | Verify backup integrity on every job, store checksums |
| Storage exhaustion during backup | Backup fails, disk fills | Monitor disk space, alert at 70% threshold |
| Restore failing due to permissions | DR procedure fails when needed most | Test restore with minimal permissions weekly |
| Cloud snapshot restore taking hours | Extended downtime exceeds RTO | Pre-stage recovery environment, test RTO monthly |
| Incremental backup overflow | Disk fills, backup deleted | Set retention enforcement, monitor growth rate |
Capacity Estimation: Backup Storage Sizing and WAL Growth Rate
Backup storage sizing depends on your backup strategy, retention period, and database growth rate.
Full backup storage formula:
full_backup_size ≈ database_size (uncompressed)
compressed_full_backup_size = database_size / compression_ratio
For a 500GB PostgreSQL database with pg_dump compression (typical ratio 3:1 to 5:1 for row data, less for indexes):
- Row data with 3:1 compression: 500GB / 3 = ~167GB compressed dump
- With gzip at 5:1: 500GB / 5 = 100GB compressed dump
- Actual size varies significantly: highly compressible data (text, logs) compresses 10:1; binary data (JSON blobs) compresses 2:1
WAL growth rate formula:
WAL_daily_growth ≈ write_throughput_bytes_per_second × 86400 × wal_amplification_factor
WAL volume roughly tracks the logical write rate, inflated by full-page writes after each checkpoint (an amplification factor of roughly 1-3x, depending on workload and checkpoint frequency).
For PostgreSQL, the WAL segment size is 16MB by default, so a busy system cycles through segments quickly. At 100MB/second of sustained writes, expect on the order of 100MB/second × 86,400 seconds ≈ 8.6TB of WAL per day before amplification. This is why your WAL retention (wal_keep_size in PostgreSQL 13+, plus the archive itself) must be sized to cover your PITR window.
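As a back-of-envelope helper, that daily-growth arithmetic folds into a sizing function for the WAL archive itself. The 2x amplification default below is an assumption; measure the real factor on your own workload before trusting the output:

```python
def wal_archive_size_gb(write_mb_per_s: float, pitr_window_hours: float,
                        amplification: float = 2.0) -> float:
    """Approximate WAL archive storage (GB) needed to cover a PITR window.
    Amplification accounts for full-page writes after checkpoints
    (commonly 1-3x; workload-dependent, so measure it)."""
    seconds = pitr_window_hours * 3600
    return round(write_mb_per_s * amplification * seconds / 1024, 1)

# 20 MB/s sustained writes, 24-hour PITR window, assumed 2x amplification
print(wal_archive_size_gb(20, 24))  # 3375.0
```

Roughly 3.4TB of archive storage for a single day's window at a modest 20MB/s write rate, which is why PITR windows are usually measured in days, not months.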
Incremental backup sizing: Track the change rate per day, not the total database size:
-- PostgreSQL: total WAL written since cluster init; sample twice and
-- diff the values over 24 hours to get a daily generation rate
SELECT
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000')::bigint
    ) AS total_wal_written;
Typical change rates: OLTP databases 1-5% of database size per day. Data warehouses with nightly batch loads: 20-50% per day during loads, 0% between loads. Knowing your change rate lets you size incremental backups correctly.
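Putting change rate and retention together gives a first-order storage estimate. A rough sizing sketch (hypothetical helper; assumes weekly fulls plus a daily incremental holding one day's changed data):

```python
def incremental_retention_gb(db_size_gb: float, daily_change_rate: float,
                             retention_days: int, fulls_per_week: int = 1) -> float:
    """Rough storage for a retention window: weekly full backups plus a
    daily incremental sized by the database's daily change rate."""
    weeks = retention_days / 7
    fulls = db_size_gb * fulls_per_week * weeks
    incrementals = db_size_gb * daily_change_rate * retention_days
    return round(fulls + incrementals, 1)

# 500GB OLTP database, 3% daily change rate, 28-day retention
print(incremental_retention_gb(500, 0.03, 28))  # 2420.0
```

Note that the fulls dominate here (2,000GB of the 2,420GB total); compression and deduplication in the backup tool attack exactly that term.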
Observability Hooks: Backup Success/Failure Alerting and RTO Tracking
Key backup metrics: job success/failure, backup duration trends, restoration time, and backup size anomalies.
# pgBackRest: backup history, sizes, and status (stanza name is site-specific)
pgbackrest --stanza=main info
# Machine-readable form for dashboards and alerting
pgbackrest --stanza=main --output=json info
# Critical: backup job failed
- alert: BackupJobFailed
expr: backup_job_status != 0
for: 1m
labels:
severity: critical
annotations:
summary: "Backup job {{ $labels.job }} failed"
# Warning: backup duration increased significantly
- alert: BackupDurationAnomaly
expr: backup_duration_seconds > 1.5 * avg_over_time(backup_duration_seconds[7d])
for: 10m
labels:
severity: warning
annotations:
summary: "Backup duration {{ $value }}s exceeds 1.5x the 7-day average"
# Warning: backup size anomaly (could indicate data growth or corruption)
- alert: BackupSizeAnomaly
expr: backup_size_bytes > 2 * (backup_size_bytes offset 1d)
for: 1h
labels:
severity: warning
annotations:
summary: "Backup size {{ $value }} bytes is more than 2x yesterday's"
# Critical: WAL archival lag (PITR at risk); the byte-position metric
# names here assume a custom exporter exposing them
- alert: WalArchivalLag
expr: (pg_wal_current_lsn_bytes - pg_wal_last_archived_lsn_bytes) > 1073741824
for: 5m
labels:
severity: critical
annotations:
summary: "WAL archival lag exceeds 1GB - PITR window at risk"
Real-World Case Study: GitLab’s 2017 Database Incident
In January 2017, GitLab suffered what became the canonical backup-failure story. During an incident response, an engineer accidentally deleted the production PostgreSQL data directory. That alone should have been recoverable, but the team then discovered that essentially every backup mechanism had been failing silently: the scheduled pg_dump backups were producing near-empty files because the pg_dump binary was an older version than the database server, and the failure notifications had never reached anyone.
The root cause: backup "verification" amounted to checking that the job ran, not that its output contained data. Nothing ever compared a backup against the database it was supposed to protect.
The impact: GitLab ultimately restored from a staging LVM snapshot taken roughly six hours earlier, losing about six hours of production data (issues, merge requests, comments). Their public postmortem drove fixes that have become industry defaults: automated restore testing in an isolated environment, and verification that compares restored row counts against known-good baselines rather than trusting job exit codes.
The lesson: backup verification must check completeness, not just completion. A backup job that exits 0 but writes an empty file is worse than no backup at all, because it gives false confidence.
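The row-count comparison described above can be sketched as a small helper. The table names and the 1% tolerance are illustrative, not prescriptive:

```python
def verify_restore(baseline_counts: dict[str, int],
                   restored_counts: dict[str, int],
                   tolerance: float = 0.01) -> list[str]:
    """Compare restored row counts against known-good baselines.
    Returns a list of human-readable problems; an empty list means the
    restore looks complete. The tolerance allows for rows written
    between the baseline capture and the backup."""
    problems = []
    for table, expected in baseline_counts.items():
        actual = restored_counts.get(table)
        if actual is None:
            problems.append(f"{table}: missing from restore")
        elif expected and abs(actual - expected) / expected > tolerance:
            problems.append(f"{table}: {actual} rows vs baseline {expected}")
    return problems

# An empty restored table fails loudly; small drift passes
print(verify_restore({"users": 1000, "orders": 5000},
                     {"users": 1002, "orders": 0}))
# ['orders: 0 rows vs baseline 5000']
```

In practice the two dicts would be populated by `SELECT count(*)` queries against the production and restored databases; the key point is that an empty result is a hard failure, not a passing exit code.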
Testing Recovery Procedures
Here’s the uncomfortable truth: a backup that cannot be restored is not a backup. This is non-negotiable.
Regular Recovery Testing
Weekly verification: Automated test restore to confirm backup chain validity.
#!/bin/bash
# Weekly backup verification script
set -euo pipefail

BACKUP_DATE=$(date -d "yesterday" +%Y%m%d)
BACKUP_FILE="/backups/full_${BACKUP_DATE}.dump"
RESTORE_DB="restore_test_$(date +%Y%m%d_%H%M%S)"

# Restore yesterday's full backup into a throwaway database
createdb "$RESTORE_DB"
pg_restore -d "$RESTORE_DB" "$BACKUP_FILE"

# Verify basic integrity of the restored copy
psql -d "$RESTORE_DB" -c "SELECT count(*) FROM pg_class;"
psql -d "$RESTORE_DB" -c "SELECT pg_database_size(current_database()) / 1024 / 1024 AS size_mb;"

# Clean up the test database
dropdb "$RESTORE_DB"
echo "Backup verification successful"
Monthly full disaster simulation: Complete restore to isolated environment, verify data integrity, measure actual RTO against targets.
What to Verify
- Backup completion status — Confirm automated backups completed successfully
- Backup chain integrity — All required WAL segments present
- Restore completion — Backup actually restores without errors
- Data integrity — Restored data matches expected state
- Application functionality — Database serves queries correctly after restore
- Performance baseline — Restore doesn’t degrade performance
Common Recovery Failures
- WAL gap: Missing WAL segments break the restore chain
- Corrupted backup: Checksum failures on backup files
- Storage exhaustion: Restore fails due to insufficient disk space
- Permission issues: Restored database has wrong ownership
- Timezone handling: Timestamps differ between backup and restore environments
Conclusion
Backup and recovery is not a set-it-and-forget-it operation. It requires ongoing attention, testing, and refinement.
Start with the basics: full backups are non-negotiable. Layer on incremental or differential backups based on your RPO requirements. Enable WAL archiving if you need point-in-time recovery. Test everything—regularly.
The time to discover your backup strategy fails is not during an actual disaster. Run fire drills. Simulate failures. Know exactly what your RTO and RPO actually are under real conditions, not just on paper.
For deeper exploration of disaster recovery patterns, see our disaster recovery guide. And for monitoring your database health, check out relational databases fundamentals.