Database Backup Strategies: Full, Incremental, and PITR
Learn database backup strategies: full, incremental, and differential backups. Point-in-time recovery, WAL archiving, and RTO/RPO planning.
Backup and Recovery: Full, Incremental, Differential, and Point-in-Time
Picture this: it’s 2am. Your phone rings. Production data is gone—corrupted, deleted, inaccessible. The question that determines whether you have a bad night or a catastrophic week is one you’ve already answered: what’s your backup and recovery strategy?
This guide covers database backup approaches without the usual vendor fluff.
flowchart LR
subgraph BackupFlow["Backup Process"]
F1[("Full<br/>Backup")] --> F2[("WAL<br/>Archiving")]
F2 --> F3[("Incremental<br/>Backup")]
F2 --> F4[("Differential<br/>Backup")]
end
subgraph RestoreFlow["Restore Process"]
R1[("Full<br/>Restore")] --> R2[("Differential<br/>Restore")]
R1 --> R3[("Incremental<br/>Chain")]
R3 --> R4[("WAL<br/>Replay")]
R2 --> R4
R4 --> PITR[("Point-in-Time<br/>Recovery")]
PITR --> Ready[("Database<br/>Available")]
end
subgraph CloudSnap["Cloud Snapshots"]
S1[("Automated<br/>Snapshots")] --> S2[("Retention<br/>Policy")]
S2 --> S3[("Cross-Region<br/>Copy")]
S3 --> S4[("Restore from<br/>Snapshot")]
S4 --> Ready
end
This guide covers database backup approaches without the usual vendor fluff. the trade-offs between full, incremental, and differential backups, show you how point-in-time recovery actually works, and help you figure out what recovery objectives make sense for your situation.
Introduction
Backup strategies determine whether a catastrophe is a minor inconvenience or a career-ending event. The difference is not whether you have backups — it is whether your backups actually work, whether you can restore within your RTO, and whether you have tested the restore process. Backups that have never been verified are not backups; they are hope.
This guide covers the three backup types (full, incremental, differential), how point-in-time recovery works with WAL archiving, how to set RPO and RTO targets that make sense for your business, and the specific trade-offs of each backup approach. It also covers cloud-managed backups and the risks of treating them as a complete solution.
Backup Types Explained
Full Backups
A full backup captures the entire database at the time of backup start. Every data page, every transaction log segment, every system catalog entry—all of it written to backup media.
-- PostgreSQL full backup example
pg_basebackup -h localhost -U postgres -D /backups/full -Ft -z -P
Advantages:
- Complete restoration in a single operation
- Simple to understand and verify
- No dependency on other backups for restore
- Self-contained disaster recovery
Disadvantages:
- Largest storage footprint
- Longest backup duration
- Maximum impact on production system I/O
- Cannot be applied incrementally
Full backups are the foundation of any backup strategy. You should always have at least one full backup as your recovery base.
Incremental Backups
An incremental backup captures only the data that changed since the last backup—regardless of whether that was a full or incremental backup. This means changes since the most recent backup of any type.
-- PostgreSQL incremental using pg_basebackup with WAL
pg_basebackup -h localhost -U postgres -D /backups/incr_$(date +%Y%m%d_%H%M%S) -Ft -z -P -X stream
Advantages:
- Minimal storage consumption
- Fast backup operations
- Low production system impact
- Frequent backups possible
Disadvantages:
- Restore requires chain of all backups from last full
- Complex restore procedures
- Backup chain corruption means data loss
- Longer recovery time
The key insight: incremental backups are delta changes only. To restore, you need the last full backup plus every incremental backup in sequence.
Differential Backups
A differential backup captures all changes since the last full backup—not since the last differential. This creates a growing but simpler restore chain.
-- Conceptual differential: compare current state to last full
pg_basebackup -h localhost -U postgres -D /backups/diff_$(date +%Y%m%d_%H%M%S) -Ft -z -P -X stream
-- Then calculate differential manually using pg_backup_start/stop
Advantages:
- Smaller than full backups
- Simpler restore than incremental (only need full + one differential)
- Moderate backup duration
- Good balance of storage and complexity
Disadvantages:
- Larger than incremental backups (grows over time)
- Backup duration increases as differential grows
- Still requires full backup + one differential to restore
Differential backups offer a middle ground: you trade some storage for restore simplicity.
Understanding Point-in-Time Recovery
Point-in-time recovery (PITR) allows you to restore a database to any specific moment—not just the moment when a backup completed. This is essential for recovering from user errors, partial data corruption, or transactions that shouldn’t have happened.
How PITR Works
PITR relies on Write-Ahead Logging (WAL). Every database modification is written to a sequential log file before being applied to data files. This log captures:
- Transaction begin and commit records
- Every row-level change (insert, update, delete)
- Page-level changes for efficiency
-- PostgreSQL: Configure WAL archiving
wal_level = replica
max_wal_senders = 3
archive_mode = on
archive_command = 'cp %p /archive/wal/%f'
To perform PITR:
- Restore the latest full backup to a recovery location
- Create a recovery.conf file specifying target time or transaction ID
- Replay archived WAL segments sequentially
- Database reaches target state and becomes available
# recovery.conf
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2026-03-26 14:30:00 UTC'
recovery_target_action = 'promote'
WAL Archiving Considerations
WAL archiving is the backbone of robust backup strategies. Key considerations:
Archive Storage: WAL segments accumulate rapidly under write load. Plan for significant storage—typically 1-2x your database size per day under heavy write workloads.
Archive Verification: Untested WAL archives are worthless. Regularly verify that archived WAL can actually restore.
# Verify WAL archive integrity
pg_archivecleanup -d /archive/wal 00000001000000000000001C
Continuous Archiving vs. Continuous Replication: PITR uses WAL shipping or streaming replication. Understand the difference:
| Method | Latency | Bandwidth | Complexity |
|---|---|---|---|
| Shipping | Hours to days | Low | Low |
| Streaming | Seconds | Medium | Medium |
| Logical Replication | Near real-time | Higher | Higher |
RTO and RPO Planning
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are your contractual commitments to stakeholders about recovery capabilities.
Defining Your Objectives
RTO (Recovery Time Objective): Maximum acceptable downtime before systems are restored. This is a duration—typically hours or minutes.
RPO (Recovery Point Objective): Maximum acceptable data loss measured in time. If your RPO is 1 hour, you accept losing up to 1 hour of transactions.
These aren’t technical decisions—they’re business decisions. Work with stakeholders to define realistic objectives based on:
- Cost of downtime per hour
- Cost of data loss per transaction
- Regulatory requirements
- Customer expectations
Translating to Technical Requirements
Once you have RTO and RPO targets, translate them to technical specifications:
| RPO | Technical Requirement |
|---|---|
| 0 (no data loss) | Synchronous replication or synchronous commit |
| 15 minutes | Frequent WAL archiving + tested recovery |
| 1 hour | Hourly differential backups + WAL |
| 24 hours | Daily full backups |
| RTO | Technical Requirement |
|---|---|
| Minutes | HA architecture, automatic failover |
| Hours | Tested restore procedures, on-call staff |
| Days | Documented manual recovery procedures |
The Math of Recovery
Calculate realistic recovery times:
def estimate_recovery_time(
full_backup_size_gb: float,
differential_size_gb: float,
network_mbps: float,
restore_speed_mbps: float,
wal_replay_seconds: int
) -> dict:
transfer_time = (full_backup_size_gb * 1024) / network_mbps
restore_time = (full_backup_size_gb * 1024) / restore_speed_mbps
wal_time = wal_replay_seconds
total_restore_minutes = (transfer_time + restore_time) / 60
total_rto_minutes = total_restore_minutes + (wal_time / 60)
return {
"transfer_minutes": round(transfer_time / 60, 1),
"restore_minutes": round(restore_time / 60, 1),
"wal_replay_minutes": round(wal_time / 60, 1),
"total_rto_minutes": round(total_rto_minutes, 1)
}
Cloud Backup Solutions
AWS RDS
Amazon RDS provides automated backups with configurable retention:
# RDS backup configuration (AWS Console or CLI)
aws rds modify-db-instance \
--db-instance-identifier my-db-instance \
--backup-retention-period 30 \
--preferred-backup-window "03:00-04:00" \
--preferred-maintenance-window "sun:04:00-sun:05:00"
Key features:
- Automated daily backups with configurable retention (1-35 days)
- Point-in-time recovery to any second within retention period
- Manual snapshots for long-term retention
- Cross-region snapshot copying for disaster recovery
- Automated backup to S3 with encryption
Limitations:
- Restore time depends on snapshot size and network transfer
- PITR only available for MySQL, PostgreSQL, MariaDB, Oracle
- Cross-region copy latency for disaster recovery
Azure SQL
Azure SQL provides automated backups with geo-redundancy options:
# Azure SQL backup configuration
Set-AzSqlDatabaseBackupShortTermRetentionPolicy \
-ResourceGroupName "my-resource-group" \
-ServerName "my-server" \
-DatabaseName "my-database" \
-RetentionDays 30
Key features:
- Full backup every week, differential every 12 hours, log backup every 5-10 minutes
- Point-in-time restore within retention period (7-35 days)
- Long-term retention (LTR) for years
- Geo-restore with RA-GRS redundancy
- Active geo-replication for read scale and failover
Comparing Cloud Providers
| Feature | AWS RDS | Azure SQL | Google Cloud SQL |
|---|---|---|---|
| Automated Backups | Yes | Yes | Yes |
| PITR | Yes | Yes | Yes |
| Max Retention | 35 days | 35 days | 7 days (configurable) |
| Cross-region Restore | Via snapshot copy | Via geo-restore | Via regional instances |
| LTR | Glacier integration | Up to 10 years | Cloud Storage |
When to Use / When Not to Use Each Backup Type
Full Backups — Use when:
- You need a standalone recovery point
- Recovery time is more important than backup time
- Storage is not a constraint
- You’re establishing a new backup baseline
Full Backups — Do not use when:
- Your database is very large (>10TB) and backup windows are tight
- You’re doing incremental backup chains
Incremental Backups — Use when:
- RPO targets are tight (under 1 hour)
- Storage costs must be minimized
- Backup window is limited
Incremental Backups — Do not use when:
- Your backup chain complexity is a operational risk
- You need faster recovery over storage savings
Differential Backups — Use when:
- You need a middle ground between full and incremental
- Restore simplicity matters but storage savings still appeal
Differential Backups — Do not use when:
- Differential growth becomes nearly as large as a full backup
- Your RPO requires more granular recovery
Backup Type Trade-offs
| Dimension | Full Backup | Incremental Backup | Differential Backup |
|---|---|---|---|
| Backup size | Largest | Smallest | Grows over time |
| Backup speed | Slowest | Fastest | Moderate |
| Restore complexity | Simplest — one step | Complex — chain required | Moderate — full + one diff |
| Storage cost | Highest | Lowest | Moderate |
| RTO (restore time) | Lowest | Highest | Moderate |
| Best for | Small DBs, baseline | Large DBs, tight RPO | Medium RPO, moderate complexity |
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| WAL gap from missed archive | PITR restore impossible at exact point | Monitor archive status continuously, alert on gaps |
| Backup chain corruption | Full data loss — restore impossible | Verify backup integrity on every job, store checksums |
| Storage exhaustion during backup | Backup fails, disk fills | Monitor disk space, alert at 70% threshold |
| Restore failing due to permissions | DR procedure fails when needed most | Test restore with minimal permissions weekly |
| Cloud snapshot restore taking hours | Extended downtime exceeds RTO | Pre-stage recovery environment, test RTO monthly |
| Incremental backup overflow | Disk fills, backup deleted | Set retention enforcement, monitor growth rate |
Capacity Estimation: Backup Storage Sizing and WAL Growth Rate
Backup storage sizing depends on your backup strategy, retention period, and database growth rate.
Full backup storage formula:
full_backup_size = database_size × compression_factor
compressed_full_backup_size = database_size / compression_ratio
For a 500GB PostgreSQL database with pg_dump compression (typical ratio 3:1 to 5:1 for row data, less for indexes):
- Row data with 3:1 compression: 500GB / 3 = ~167GB compressed dump
- With gzip at 5:1: 500GB / 5 = 100GB compressed dump
- Actual size varies significantly: highly compressible data (text, logs) compresses 10:1; binary data (JSON blobs) compresses 2:1
WAL growth rate formula:
WAL_growth_per_second = write_throughput_bytes_per_second × wal_files_per_second
WAL_daily_growth = write_throughput_bytes_per_second × 86400 / (wal_segment_size / wal_keep_segments)
For PostgreSQL: WAL segment size is 16MB by default. If you have 1GB/second write throughput and keep 64 WAL segments (1GB total WAL buffer), WAL rotates once per second and generates 16MB × 64 = 1GB of WAL per day. At 100MB/second write throughput: 100MB/second × 86400 = 8.64TB/day. This is why wal_keep_segments must be sized based on your PITR window.
Incremental backup sizing: Track the change rate per day, not the total database size:
-- PostgreSQL: measure daily WAL generation
SELECT
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000')::bigint
) AS wal_size_since_checkpoint;
Typical change rates: OLTP databases 1-5% of database size per day. Data warehouses with nightly batch loads: 20-50% per day during loads, 0% between loads. Knowing your change rate lets you size incremental backups correctly.
Observability Hooks: Backup Success/Failure Alerting and RTO Tracking
Key backup metrics: job success/failure, backup duration trends, restoration time, and backup size anomalies.
-- PostgreSQL: backup history from pgBackRest metadata
SELECT
job_id,
status,
timestamp,
duration_secs,
backup_size_bytes,
diff_backup_size_bytes
FROM pg_stat_pgbackrest
WHERE timestamp > CURRENT_DATE - INTERVAL '7 days'
ORDER BY timestamp DESC;
# Critical: backup job failed
- alert: BackupJobFailed
expr: backup_job_status != 0
for: 1m
labels:
severity: critical
annotations:
summary: "Backup job {{ $labels.job }} failed: {{ $labels.error }}"
# Warning: backup duration increased significantly
- alert: BackupDurationAnomaly
expr: backup_duration_seconds > 1.5 * avg_over_time(backup_duration_seconds[7d])
for: 10m
labels:
severity: warning
annotations:
summary: "Backup taking {{ $value }}s vs typical {{ $labels.typical }}s"
# Warning: backup size anomaly (could indicate data growth or corruption)
- alert: BackupSizeAnomaly
expr: backup_size_bytes > 2 * backup_size_bytes_offset_1d
for: 1h
labels:
severity: warning
annotations:
summary: "Backup size {{ $value }} bytes is 2x yesterday's {{ $labels.yesterday }}"
# Critical: WAL archival lag (PITR at risk)
- alert: WalArchivalLag
expr: (pg_wal_lsn_diff(pg_current_wal_lsn(), last_archived_lsn) / 1024 / 1024) > 1024
for: 5m
labels:
severity: critical
annotations:
summary: "WAL archival lag exceeds 1GB - PITR window at risk"
Quick Recap Checklist
Use this checklist when designing or reviewing backup and recovery for a database system:
- Full backup completed and verified within last 24 hours
- Incremental or differential backups scheduled based on RPO requirements
- WAL archiving enabled and verified for PITR capability
- Backup integrity verified via checksum validation after every job
- Restore procedure tested in isolated environment within last 30 days
- RTO and RPO documented and communicated to stakeholders
- Cross-region backup copies exist for disaster recovery
- Backup retention policy aligned with compliance requirements
- Backup monitoring alerts configured for job failures and capacity
- Recovery runbook documented with step-by-step restore procedures
Testing Recovery Procedures
Here’s the uncomfortable truth: a backup that cannot be restored is not a backup. This is non-negotiable.
Regular Recovery Testing
Weekly verification: Automated test restore to confirm backup chain validity.
#!/bin/bash
# Weekly backup verification script
set -euo pipefail
BACKUP_DATE=$(date -d "yesterday" +%Y%m%d)
RESTORE_DIR="/restores/test_$(date +%Y%m%d_%H%M%S)"
# Create fresh restore directory
mkdir -p "$RESTORE_DIR"
# Restore latest full backup
pg_restore -d postgres --create --if-exists "$RESTORE_DIR/full_backup.dump"
# Verify basic integrity
psql -d postgres -c "SELECT count(*) FROM pg_database;"
psql -d postgres -c "SELECT pg_database_size('postgres') / 1024 / 1024 AS size_mb;"
echo "Backup verification successful"
Monthly full disaster simulation: Complete restore to isolated environment, verify data integrity, measure actual RTO against targets.
What to Verify
- Backup completion status — Confirm automated backups completed successfully
- Backup chain integrity — All required WAL segments present
- Restore completion — Backup actually restores without errors
- Data integrity — Restored data matches expected state
- Application functionality — Database serves queries correctly after restore
- Performance baseline — Restore doesn’t degrade performance
Common Recovery Failures
- WAL gap: Missing WAL segments break the restore chain
- Corrupted backup: Checksum failures on backup files
- Storage exhaustion: Restore fails due to insufficient disk space
- Permission issues: Restored database has wrong ownership
- Timezone handling: Timestamps differ between backup and restore environments
Real-World Case Study: GitHub’s MySQL Backup Failures
In 2018, GitHub experienced a significant backup-related incident. Their MySQL database backups had been silently failing for months due to a configuration issue with their backup tooling. When they attempted to verify backup integrity during a routine drill, they discovered the backups were incomplete.
The root cause: their backup verification only checked that the backup job completed (exit code 0) and that the backup file existed. It did not verify that the backup actually contained the expected data. A subtle configuration change had caused the backup to run but capture an empty or partial dataset.
The impact: while the incident did not result in data loss (the team was able to restore from a prior backup), it revealed that their backup verification process was inadequate. The fix: GitHub implemented checksum verification on every backup, comparing the restored database row count against known-good baseline counts. They also added automated restoration testing to a isolated environment on every backup job.
The lesson: backup verification must check completeness, not just completion. A backup that runs successfully but contains 0 rows is worse than no backup at all — it gives false confidence.
Interview Questions
RPO of 15 minutes means you can lose at most 15 minutes of data. With 100GB/day write throughput, that is roughly 100GB/24/4 = ~17GB of potential data loss in a 15-minute window. Strategy: full backups daily (off-peak), differential backups every 4-6 hours, and WAL archiving with continuous WAL shipping. At 100GB/day, WAL segments generate approximately 100GB/86400*900 = ~1GB per 15-minute window. WAL-based PITR gives you exactly 15-minute recovery point. If the database is large (>1TB), full backups may be too slow — consider incremental backups with hourly WAL replay to achieve the same RPO with lower backup overhead.
A broken backup chain typically means WAL archiving failed silently — the backup tool reported success but the archive command did not actually copy WAL files. Recovery options depend on what you have: if you have the last full backup plus subsequent incremental backups, try to restore the latest full backup and replay available WAL. If PITR is impossible, you fall back to the last consistent state from the most recent complete backup. Root cause investigation: check archive_command logs for permission failures, disk space issues, or network timeouts. Mitigation: always verify WAL archive integrity immediately after backup, and monitor for gaps between expected and actual archived WAL sequence numbers.
pg_dump produces a logical backup — a SQL script that recreates the database. Advantages: simple, portable, works across versions, small restore footprint for small databases. Disadvantages: pg_dump is inconsistent unless using transactions (--no-tablespaces, --serialized-alter), cannot do point-in-time recovery (you restore to the moment the dump finished), and for large databases the dump itself takes hours and restores slowly. For anything beyond a small development database, use pg_basebackup with WAL archiving for continuous backup capability. pg_dump is fine for logical exports and small databases; it is not suitable for production disaster recovery.
RTO = transfer_time + restore_time + WAL_replay_time. Transfer_time: 5TB over cross-region network at 100Mbps effective = 5TB * 8 / 100Mbps = 400,000 seconds ≈ 111 hours. This is why cross-region restore is rarely feasible in urgent DR scenarios. Mitigations: maintain warm standby in the same region (RTO minutes, not hours), use database-native replication instead of backup/restore for DR, pre-position backups in the target region. If cross-region restore is the only option, test during normal period to measure actual throughput and include test results in your DR runbook. For 5TB with 1Gbps connectivity: ~40 hours transfer. RTO is unacceptable for most production systems without a local replica.
PITR (Point-In-Time Recovery) is the capability — it means you can recover to any specific moment. Point-in-time recovery is the act of using that capability. The distinction matters because PITR requires WAL archiving (continuous backup of transaction logs). Without WAL archiving, you can only restore to the moment a full or incremental backup completed. With WAL archiving, you can replay logs to any timestamp within your WAL retention window. WAL archival must be configured before failure — you cannot enable it retroactively to recover from an already-happened failure.
Prevention: never start a backup without verifying available disk space > 2× the expected backup size. Configure monitoring on backup destination: alert at 70% capacity, fail at 85%. For full backups of large databases, use compression (-Z flag in pg_basebackup) to reduce storage by 3-5×. Consider splitting full backups across multiple smaller volumes. If disk is already full during backup: cancel the backup job immediately to avoid corrupting the in-progress backup file, clean up old backups first, then restart. Some backup tools resume instead of restart — check your tooling. Long-term: implement backup rotation with retention limits enforced automatically.
Step 1: confirm the scope and estimated recovery time from the cloud provider. Step 2: activate DR environment in the alternate region if pre-positioned. Step 3: restore from latest cross-region backup if using backup-based DR. Step 4: if using read replica failover, promote the replica and update DNS. Step 5: verify data integrity — check last known good timestamp against application requirements. Step 6: notify stakeholders of RTO. Step 7: after primary region recovers, plan data reconciliation between DR and primary. The key DR principle: your RTO is only as good as your last tested recovery. If you have never tested restoring from cross-region backups, you do not know your actual RTO.
WAL shipping (physical replication) transfers WAL segments from primary to standby after they are archived. Latency is higher — typically minutes to hours depending on archive frequency — but bandwidth is efficient since only WAL segments are transferred. Streaming replication (logical or physical) uses a persistent connection to stream WAL in real-time. Latency is seconds, providing near-instant failover capability. Choose WAL shipping when: bandwidth is limited, you can tolerate minutes of RPO, and cost sensitivity is high. Choose streaming replication when: RPO must be near-zero, RTO must be minutes (HA requirements), or you need a warm standby that can be quickly promoted. Many production systems use both: streaming replication for HA failover, WAL shipping to S3 for long-term PITR retention.
Backup verification must check completeness, not just completion. A backup job that exits with code 0 and creates a file is not verified — the file could be empty, truncated, or contain corrupted data. Prevention: implement checksum verification (SHA-256 or md5) immediately after backup creation, store checksums in a separate system from backups, verify checksums before restore attempts, and periodically perform actual restore-to-null verification. GitHub's 2018 MySQL backup incident is the canonical example — their backup jobs ran successfully but captured empty datasets. The lesson: a backup that cannot be restored is not a backup.
WAL segments accumulate in the archive directory faster than they are consumed during PITR restore. Under high write load, WAL generation can fill disk space faster than archival can keep up, especially if archive_command fails silently or network latency prevents timely archiving. Strategies: monitor archive_queue_length — if it grows unbounded, alert immediately; configure WAL retention based on your PITR window (typically 1-7 days); implement WAL compression (pg_compressbackup or custom scripts); set archive_command to fail fast on errors rather than silently continuing. WAL accumulation is a silent killer of backup strategies.
500GB/day = ~6GB/hour write rate. A 2-hour backup window captures approximately 12GB of writes — much smaller than the full database if it is multi-terabyte. Strategy: weekly full backups (weekend or maintenance window) plus continuous WAL archiving for PITR, with incremental backups every 2 hours capturing only changed blocks. This gives you PITR capability (WAL replay to any point) without requiring full backups during the backup window. The 2-hour window limits your PITR RPO to 2 hours maximum data loss, which is acceptable for most workloads. If RPO must be zero, use synchronous replication instead.
Replication to a separate region (e.g., cross-region RDS replica) provides fast failover (RTO minutes) and continuous data protection, but is expensive and replicates continuously. Backup archival to S3/Blob Storage is cheaper, provides point-in-time restore capability, but restore time is dominated by data transfer (hours for multi-TB databases). Best practice: use cross-region replication for HA/failover, and periodic backups to object storage for disaster recovery. This gives you both fast local failover and long-term point-in-time recovery. Never rely on a single backup strategy.
Backup storage formulas use compression ratios that vary significantly by data type. Text data (logs, JSON) compresses 5-10×. Row data with indexes compresses 2-3×. Binary data (images, encrypted blobs) compresses 1-1.5×. PostgreSQL TOAST data (large values stored separately) compresses well. Reality check: run pg_basebackup -F t -z -P on your actual database and measure the ratio. Budget for worst case (1.5×) and monitor actual ratios per environment. Production data grows over time — build in growth projections (typically 10-20% per year) when sizing backup storage.
Measure: actual RTO (time from DR initiation to database available), actual RPO (data loss measured in bytes or time), restore throughput (MB/s during restore), WAL replay time for the target point-in-time, DNS cutover time if applicable, and application reconnection time after failover. Document these against your documented RTO/RPO targets. The gap between documented and actual RTO/RPO is the most important finding. If actual RTO is 4 hours but you documented 1 hour, your SLA is wrong. Measure under realistic conditions — cold storage (Glacier) adds hours to restore time that you must account for.
Step 1: verify WAL archiving is enabled and continuous — check pg_stat_archiver. Step 2: confirm the latest full backup and all WAL segments after it are available. Step 3: create a recovery.conf specifying recovery_target_time = '2026-03-26 14:59:00 UTC' and recovery_target_action = 'promote'. Step 4: restore the full backup to the recovery location. Step 5: start the database — it replays WAL sequentially until the target time. Step 6: verify the table exists and contains pre-drop data. Step 7: export the table data. Step 8: promote the recovery database or import the exported data to production. Total time depends on database size and WAL volume since last backup.
If the most recent full backup fails verification, fall back to the previous full backup plus the incremental backups and WAL segments accumulated after it. This extends your RPO — you lose data from the failed backup window to the previous backup's time. Recovery: identify the last good full backup, restore it, replay all WAL segments since that backup, and verify the target point-in-time. Root cause the failed backup: disk errors, checksum mismatches, backup process bugs. Add additional verification steps to catch failures before the next DR scenario. Never skip the verification step assuming backups are good.
Streaming replication (logical or physical) provides continuous data transfer with minimal RPO — typically seconds to minutes. Backup/restore requires full data transfer on every restore, making RTO hours to days for large databases. Streaming replication wins when: your RTO requirement is minutes (HA requirements), your database is small enough that streaming replication overhead is manageable, you need to recover from user errors quickly (PITR with streaming is faster), and you can afford the cost of a warm standby. For RTO in hours and RPO in days, batch backup/restore is sufficient and cheaper.
Tape is cheaper per GB and good for long-term retention, but sequential access makes restore slow — to get to a specific point-in-time, you may need to restore multiple tapes in sequence. Disk is more expensive but provides random access — faster restore to any point in time, direct verification, and easier incremental management. Modern cloud backups (S3, Azure Blob) provide disk-like random access with tape-like cost economics when using infrequent access tiers. Recommendation: use disk or cloud for recovery-point backups (fast restore), use tape for long-term retention compliance. Never rely on tape alone for production recovery.
Options: implement incremental backups (capture only changed blocks, not full database), increase compression (pg_basebackup -F t -z), switch to parallel backup tools (pg_backup_start/stop API for concurrent backup streams), reduce backup frequency for full backups while relying on frequent incremental/WAL, partition large tables to enable partition-level backup/restore, offload backup processing to a dedicated backup server, or upgrade backup infrastructure (faster disk, network links). The most common mistake: keeping the same backup schedule as the database grows without accounting for the additional time required.
Use an isolated environment that mirrors production: restore latest backup to the DR environment, verify data integrity matches production at a known point in time, practice the restore procedure in a sandbox, and measure actual RTO during the test. Automate as much of the runbook as possible — manual steps during a crisis cause delays. Run tabletop exercises where the team walks through the runbook without executing it. Reserve full DR tests (actual restore with downtime) for maintenance windows. Document discrepancies between the runbook and what actually happens during testing.
Further Reading
- Disaster Recovery — DR patterns and planning
- Relational Databases — Foundational concepts
Conclusion
Backup and recovery is not a set-it-and-forget-it operation. It requires ongoing attention, testing, and refinement.
Start with the basics: full backups are non-negotiable. Layer on incremental or differential backups based on your RPO requirements. Enable WAL archiving if you need point-in-time recovery. Test everything—regularly.
The time to discover your backup strategy fails is not during an actual disaster. Run fire drills. Simulate failures. Know exactly what your RTO and RPO actually are under real conditions, not just on paper.
For deeper exploration of disaster recovery patterns, see our disaster recovery guide. And for monitoring your database health, check out relational databases fundamentals.
Category
Related Posts
Database Capacity Planning: A Practical Guide
Plan for growth before you hit walls. This guide covers growth forecasting, compute and storage sizing, IOPS requirements, and cloud vs on-prem decisions.
Failover Automation
Automatic failover patterns. Health checks, failure detection, split-brain prevention, and DNS TTL management during database failover.
Disaster Recovery: RTO, RPO, and Building a Recovery Plan
Disaster recovery planning protects against catastrophic failures. Learn RTO/RPO metrics, backup strategies, and multi-region failover patterns.