Database Backup Strategies: Full, Incremental, and PITR
Picture this: it’s 2am. Your phone rings. Production data is gone—corrupted, deleted, inaccessible. The question that determines whether you have a bad night or a catastrophic week is one you’ve already answered: what’s your backup and recovery strategy?
This guide covers database backup approaches without the usual vendor fluff.
flowchart LR
subgraph BackupFlow["Backup Process"]
F1[("Full<br/>Backup")] --> F2[("WAL<br/>Archiving")]
F2 --> F3[("Incremental<br/>Backup")]
F2 --> F4[("Differential<br/>Backup")]
end
subgraph RestoreFlow["Restore Process"]
R1[("Full<br/>Restore")] --> R2[("Differential<br/>Restore")]
R1 --> R3[("Incremental<br/>Chain")]
R3 --> R4[("WAL<br/>Replay")]
R2 --> R4
R4 --> PITR[("Point-in-Time<br/>Recovery")]
PITR --> Ready[("Database<br/>Available")]
end
subgraph CloudSnap["Cloud Snapshots"]
S1[("Automated<br/>Snapshots")] --> S2[("Retention<br/>Policy")]
S2 --> S3[("Cross-Region<br/>Copy")]
S3 --> S4[("Restore from<br/>Snapshot")]
S4 --> Ready
end
We'll break down the trade-offs between full, incremental, and differential backups, show you how point-in-time recovery actually works, and help you figure out which recovery objectives make sense for your situation.
Backup Types Explained
Full Backups
A full backup captures the entire database at the time of backup start. Every data page, every transaction log segment, every system catalog entry—all of it written to backup media.
# PostgreSQL full backup example
pg_basebackup -h localhost -U postgres -D /backups/full -Ft -z -P
Advantages:
- Complete restoration in a single operation
- Simple to understand and verify
- No dependency on other backups for restore
- Self-contained disaster recovery
Disadvantages:
- Largest storage footprint
- Longest backup duration
- Maximum impact on production system I/O
- Cannot be applied incrementally
Full backups are the foundation of any backup strategy. You should always have at least one full backup as your recovery base.
Incremental Backups
An incremental backup captures only the data that has changed since the most recent backup of any type, whether that was a full backup or a previous incremental.
# PostgreSQL 17+: true incremental backup, referencing the previous backup's manifest
pg_basebackup -h localhost -U postgres -D /backups/incr_$(date +%Y%m%d_%H%M%S) -Ft -z -P --incremental=/backups/full/backup_manifest
Advantages:
- Minimal storage consumption
- Fast backup operations
- Low production system impact
- Frequent backups possible
Disadvantages:
- Restore requires chain of all backups from last full
- Complex restore procedures
- Backup chain corruption means data loss
- Longer recovery time
The key insight: incremental backups are delta changes only. To restore, you need the last full backup plus every incremental backup in sequence.
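To make the chain dependency concrete, here is a minimal sketch (a hypothetical `Backup` record, not any real tool's catalog format) that resolves the restore sequence: the most recent full backup plus every incremental taken after it, in order.

```python
from dataclasses import dataclass

@dataclass
class Backup:
    taken_at: int        # e.g. a Unix timestamp
    kind: str            # "full" or "incremental"

def restore_chain(backups: list[Backup]) -> list[Backup]:
    """Return the backups needed for a restore: the last full backup,
    then every incremental taken after it, in chronological order."""
    ordered = sorted(backups, key=lambda b: b.taken_at)
    # Find the most recent full backup; without one, restore is impossible
    last_full_idx = max(
        (i for i, b in enumerate(ordered) if b.kind == "full"),
        default=None,
    )
    if last_full_idx is None:
        raise ValueError("no full backup available; chain cannot be restored")
    # Full backup plus all incrementals that follow it
    return [ordered[last_full_idx]] + [
        b for b in ordered[last_full_idx + 1:] if b.kind == "incremental"
    ]

chain = restore_chain([
    Backup(1, "full"), Backup(2, "incremental"),
    Backup(3, "full"), Backup(4, "incremental"), Backup(5, "incremental"),
])
print([(b.taken_at, b.kind) for b in chain])
# [(3, 'full'), (4, 'incremental'), (5, 'incremental')]
```

Note that backups 1 and 2 are dead weight once backup 3 exists, which is exactly why retention policies expire old chains as a unit, never individual links.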
Differential Backups
A differential backup captures all changes since the last full backup—not since the last differential. This creates a growing but simpler restore chain.
# PostgreSQL has no built-in differential mode; tools such as pgBackRest provide one
# (the stanza name "main" is site-specific)
pgbackrest --stanza=main --type=diff backup
Advantages:
- Smaller than full backups
- Simpler restore than incremental (only need full + one differential)
- Moderate backup duration
- Good balance of storage and complexity
Disadvantages:
- Larger than incremental backups (grows over time)
- Backup duration increases as differential grows
- Still requires full backup + one differential to restore
Differential backups offer a middle ground: you trade some storage for restore simplicity.
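The storage side of that trade is easy to quantify. A sketch with illustrative numbers (one full backup per week, one backup per day; the 10GB/day change rate is an assumption, not a benchmark):

```python
def weekly_storage_gb(full_gb: float, daily_change_gb: float, strategy: str) -> float:
    """Storage for one week: a full backup on day 0, then six daily backups.
    Incrementals each hold one day's changes; differentials accumulate
    all changes since the full, so each day's backup is larger than the last."""
    if strategy == "incremental":
        dailies = daily_change_gb * 6                            # one delta per day
    elif strategy == "differential":
        dailies = sum(daily_change_gb * d for d in range(1, 7))  # grows each day
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return full_gb + dailies

# 500GB database changing 10GB/day (illustrative numbers)
print(weekly_storage_gb(500, 10.0, "incremental"))   # 560.0
print(weekly_storage_gb(500, 10.0, "differential"))  # 710.0
```

The differential week costs roughly 27% more storage here, but buys a two-step restore instead of a seven-step chain.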
Understanding Point-in-Time Recovery
Point-in-time recovery (PITR) allows you to restore a database to any specific moment—not just the moment when a backup completed. This is essential for recovering from user errors, partial data corruption, or transactions that shouldn’t have happened.
How PITR Works
PITR relies on Write-Ahead Logging (WAL). Every database modification is written to a sequential log file before being applied to data files. This log captures:
- Transaction begin and commit records
- Every row-level change (insert, update, delete)
- Page-level changes for efficiency
# PostgreSQL: configure WAL archiving in postgresql.conf
wal_level = replica
max_wal_senders = 3
archive_mode = on
archive_command = 'cp %p /archive/wal/%f'
To perform PITR:
- Restore the latest full backup to a recovery location
- Set the recovery target (time or transaction ID); in PostgreSQL 12+ these settings live in postgresql.conf, paired with an empty recovery.signal file (versions before 12 used a recovery.conf file)
- Replay archived WAL segments sequentially
- Database reaches target state and becomes available
# postgresql.conf (PostgreSQL 12+); touch recovery.signal to start recovery
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2026-03-26 14:30:00 UTC'
recovery_target_action = 'promote'
WAL Archiving Considerations
WAL archiving is the backbone of robust backup strategies. Key considerations:
Archive Storage: WAL segments accumulate rapidly under write load. Plan for significant storage—typically 1-2x your database size per day under heavy write workloads.
Archive Verification: Untested WAL archives are worthless. Regularly verify that archived WAL can actually restore.
# Dry run: list WAL segments older than the given restart file (omit -n to delete them)
pg_archivecleanup -n /archive/wal 00000001000000000000001C
# Confirm an archived segment is readable by dumping its records
pg_waldump /archive/wal/00000001000000000000001C | head
Continuous Archiving vs. Continuous Replication: PITR relies on archived WAL, while replication keeps a live standby. They solve different problems, and the latency, bandwidth, and complexity trade-offs differ:
| Method | Latency | Bandwidth | Complexity |
|---|---|---|---|
| WAL shipping | Minutes (bounded by archive_timeout) | Low | Low |
| Streaming | Seconds | Medium | Medium |
| Logical Replication | Near real-time | Higher | Higher |
RTO and RPO Planning
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are your contractual commitments to stakeholders about recovery capabilities.
Defining Your Objectives
RTO (Recovery Time Objective): Maximum acceptable downtime before systems are restored. This is a duration—typically hours or minutes.
RPO (Recovery Point Objective): Maximum acceptable data loss measured in time. If your RPO is 1 hour, you accept losing up to 1 hour of transactions.
These aren’t technical decisions—they’re business decisions. Work with stakeholders to define realistic objectives based on:
- Cost of downtime per hour
- Cost of data loss per transaction
- Regulatory requirements
- Customer expectations
Translating to Technical Requirements
Once you have RTO and RPO targets, translate them to technical specifications:
| RPO | Technical Requirement |
|---|---|
| 0 (no data loss) | Synchronous replication or synchronous commit |
| 15 minutes | Frequent WAL archiving + tested recovery |
| 1 hour | Hourly differential backups + WAL |
| 24 hours | Daily full backups |
| RTO | Technical Requirement |
|---|---|
| Minutes | HA architecture, automatic failover |
| Hours | Tested restore procedures, on-call staff |
| Days | Documented manual recovery procedures |
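The RPO half of that mapping reduces to simple arithmetic: worst case, a failure lands just before the next backup or WAL archive ships, so you lose up to one full interval of transactions. A trivial illustrative helper (not from any tool):

```python
def meets_rpo(archive_interval_minutes: float, rpo_minutes: float) -> bool:
    """True if a backup/archive cadence satisfies an RPO target.
    Worst-case data loss equals one full interval between recovery points."""
    return archive_interval_minutes <= rpo_minutes

# WAL shipped every 5 minutes against a 15-minute RPO target
print(meets_rpo(5, 15))        # True
# Daily full backups alone against a 1-hour RPO target
print(meets_rpo(24 * 60, 60))  # False
```

The second case is the common trap: a nightly backup schedule silently implies an RPO of up to 24 hours, whatever the documentation claims.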
The Math of Recovery
Calculate realistic recovery times:
def estimate_recovery_time(
    full_backup_size_gb: float,
    differential_size_gb: float,
    network_mbps: float,          # network throughput, megabits/s
    restore_speed_mbps: float,    # restore write throughput, megabits/s
    wal_replay_seconds: int
) -> dict:
    # Total data to move: the full backup plus the differential on top of it.
    # 1 GB = 8 * 1024 megabits, so units match the throughput figures.
    size_mbits = (full_backup_size_gb + differential_size_gb) * 8 * 1024
    transfer_time = size_mbits / network_mbps        # seconds to copy backups
    restore_time = size_mbits / restore_speed_mbps   # seconds to write data files
    total_rto_minutes = (transfer_time + restore_time + wal_replay_seconds) / 60
    return {
        "transfer_minutes": round(transfer_time / 60, 1),
        "restore_minutes": round(restore_time / 60, 1),
        "wal_replay_minutes": round(wal_replay_seconds / 60, 1),
        "total_rto_minutes": round(total_rto_minutes, 1)
    }
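The same arithmetic can be inverted: given an RTO budget, how large a backup can you afford to restore? A sketch under the same assumptions (throughput in megabits per second, transfer followed by a sequential restore):

```python
def max_restorable_gb(rto_minutes: float, network_mbps: float,
                      restore_speed_mbps: float, wal_replay_seconds: int) -> float:
    """Largest full-backup size (GB) that can be transferred and restored
    within the RTO budget, after subtracting WAL replay time."""
    budget_seconds = rto_minutes * 60 - wal_replay_seconds
    if budget_seconds <= 0:
        return 0.0
    # Transfer and restore happen sequentially; each GB costs 8192 megabits twice
    seconds_per_gb = 8 * 1024 / network_mbps + 8 * 1024 / restore_speed_mbps
    return round(budget_seconds / seconds_per_gb, 1)

# 60-minute RTO, 1Gbps network, 2Gbps restore throughput, 10 minutes of WAL replay
print(max_restorable_gb(60, 1000, 2000, 600))  # 244.1
```

If your compressed full backup is larger than this number, the RTO is fiction until you improve throughput, pre-stage the backup closer to the recovery environment, or shorten the replay window.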
Cloud Backup Solutions
AWS RDS
Amazon RDS provides automated backups with configurable retention:
# RDS backup configuration (AWS Console or CLI)
aws rds modify-db-instance \
--db-instance-identifier my-db-instance \
--backup-retention-period 30 \
--preferred-backup-window "03:00-04:00" \
--preferred-maintenance-window "sun:04:00-sun:05:00"
Key features:
- Automated daily backups with configurable retention (1-35 days)
- Point-in-time recovery to any second within retention period
- Manual snapshots for long-term retention
- Cross-region snapshot copying for disaster recovery
- Automated backup to S3 with encryption
Limitations:
- Restore time depends on snapshot size and network transfer
- PITR restores always create a new DB instance rather than restoring in place
- Cross-region copy latency for disaster recovery
Azure SQL
Azure SQL provides automated backups with geo-redundancy options:
# Azure SQL backup configuration
Set-AzSqlDatabaseBackupShortTermRetentionPolicy `
    -ResourceGroupName "my-resource-group" `
    -ServerName "my-server" `
    -DatabaseName "my-database" `
    -RetentionDays 30
Key features:
- Full backup every week, differential every 12 hours, log backup every 5-10 minutes
- Point-in-time restore within retention period (7-35 days)
- Long-term retention (LTR) for years
- Geo-restore with RA-GRS redundancy
- Active geo-replication for read scale and failover
Comparing Cloud Providers
| Feature | AWS RDS | Azure SQL | Google Cloud SQL |
|---|---|---|---|
| Automated Backups | Yes | Yes | Yes |
| PITR | Yes | Yes | Yes |
| Max Retention | 35 days | 35 days | 7 days default (configurable) |
| Cross-region Restore | Via snapshot copy | Via geo-restore | Via regional instances |
| LTR | Glacier integration | Up to 10 years | Cloud Storage |
When to Use / When Not to Use Each Backup Type
Full Backups — Use when:
- You need a standalone recovery point
- Recovery time is more important than backup time
- Storage is not a constraint
- You’re establishing a new backup baseline
Full Backups — Do not use when:
- Your database is very large (>10TB) and backup windows are tight
- You’re doing incremental backup chains
Incremental Backups — Use when:
- RPO targets are tight (under 1 hour)
- Storage costs must be minimized
- Backup window is limited
Incremental Backups — Do not use when:
- Backup chain complexity is an operational risk you can't absorb
- Fast recovery matters more to you than storage savings
Differential Backups — Use when:
- You need a middle ground between full and incremental
- Restore simplicity matters but storage savings still appeal
Differential Backups — Do not use when:
- Differential growth becomes nearly as large as a full backup
- Your RPO requires more granular recovery
Backup Type Trade-offs
| Dimension | Full Backup | Incremental Backup | Differential Backup |
|---|---|---|---|
| Backup size | Largest | Smallest | Grows over time |
| Backup speed | Slowest | Fastest | Moderate |
| Restore complexity | Simplest — one step | Complex — chain required | Moderate — full + one diff |
| Storage cost | Highest | Lowest | Moderate |
| RTO (restore time) | Lowest | Highest | Moderate |
| Best for | Small DBs, baseline | Large DBs, tight RPO | Medium RPO, moderate complexity |
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| WAL gap from missed archive | PITR restore impossible at exact point | Monitor archive status continuously, alert on gaps |
| Backup chain corruption | Full data loss — restore impossible | Verify backup integrity on every job, store checksums |
| Storage exhaustion during backup | Backup fails, disk fills | Monitor disk space, alert at 70% threshold |
| Restore failing due to permissions | DR procedure fails when needed most | Test restore with minimal permissions weekly |
| Cloud snapshot restore taking hours | Extended downtime exceeds RTO | Pre-stage recovery environment, test RTO monthly |
| Incremental backup overflow | Disk fills, backup deleted | Set retention enforcement, monitor growth rate |
Capacity Estimation: Backup Storage Sizing and WAL Growth Rate
Backup storage sizing depends on your backup strategy, retention period, and database growth rate.
Full backup storage formula:
full_backup_size ≈ database_size (uncompressed)
compressed_full_backup_size = database_size / compression_ratio
For a 500GB PostgreSQL database with pg_dump compression (typical ratio 3:1 to 5:1 for row data, less for indexes):
- Row data with 3:1 compression: 500GB / 3 = ~167GB compressed dump
- With gzip at 5:1: 500GB / 5 = 100GB compressed dump
- Actual size varies significantly: highly compressible data (text, logs) compresses 10:1; binary data (JSON blobs) compresses 2:1
WAL growth rate formula:
WAL_daily_growth ≈ write_throughput_bytes_per_second × 86400 × wal_amplification_factor
WAL volume roughly tracks the logical write rate, inflated by full-page writes after each checkpoint (an amplification factor of roughly 1-3x, depending on workload and checkpoint frequency).
For PostgreSQL, the WAL segment size is 16MB by default, so a busy system cycles through segments quickly. At 100MB/second of sustained writes, expect on the order of 100MB/second × 86,400 seconds ≈ 8.6TB of WAL per day before amplification. This is why your WAL retention (wal_keep_size in PostgreSQL 13+, plus the archive itself) must be sized to cover your PITR window.
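As a back-of-envelope helper, that daily-growth arithmetic folds into a sizing function for the WAL archive itself. The 2x amplification default below is an assumption; measure the real factor on your own workload before trusting the output:

```python
def wal_archive_size_gb(write_mb_per_s: float, pitr_window_hours: float,
                        amplification: float = 2.0) -> float:
    """Approximate WAL archive storage (GB) needed to cover a PITR window.
    Amplification accounts for full-page writes after checkpoints
    (commonly 1-3x; workload-dependent, so measure it)."""
    seconds = pitr_window_hours * 3600
    return round(write_mb_per_s * amplification * seconds / 1024, 1)

# 20 MB/s sustained writes, 24-hour PITR window, assumed 2x amplification
print(wal_archive_size_gb(20, 24))  # 3375.0
```

Roughly 3.4TB of archive storage for a single day's window at a modest 20MB/s write rate, which is why PITR windows are usually measured in days, not months.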
Incremental backup sizing: Track the change rate per day, not the total database size:
-- PostgreSQL: total WAL written since cluster init; sample twice and
-- diff the values over 24 hours to get a daily generation rate
SELECT
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000')::bigint
    ) AS total_wal_written;
Typical change rates: OLTP databases 1-5% of database size per day. Data warehouses with nightly batch loads: 20-50% per day during loads, 0% between loads. Knowing your change rate lets you size incremental backups correctly.
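Putting change rate and retention together gives a first-order storage estimate. A rough sizing sketch (hypothetical helper; assumes weekly fulls plus a daily incremental holding one day's changed data):

```python
def incremental_retention_gb(db_size_gb: float, daily_change_rate: float,
                             retention_days: int, fulls_per_week: int = 1) -> float:
    """Rough storage for a retention window: weekly full backups plus a
    daily incremental sized by the database's daily change rate."""
    weeks = retention_days / 7
    fulls = db_size_gb * fulls_per_week * weeks
    incrementals = db_size_gb * daily_change_rate * retention_days
    return round(fulls + incrementals, 1)

# 500GB OLTP database, 3% daily change rate, 28-day retention
print(incremental_retention_gb(500, 0.03, 28))  # 2420.0
```

Note that the fulls dominate here (2,000GB of the 2,420GB total); compression and deduplication in the backup tool attack exactly that term.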
Observability Hooks: Backup Success/Failure Alerting and RTO Tracking
Key backup metrics: job success/failure, backup duration trends, restoration time, and backup size anomalies.
# pgBackRest: backup history, sizes, and status (stanza name is site-specific)
pgbackrest --stanza=main info
# Machine-readable form for dashboards and alerting
pgbackrest --stanza=main --output=json info
# Critical: backup job failed
- alert: BackupJobFailed
expr: backup_job_status != 0
for: 1m
labels:
severity: critical
annotations:
summary: "Backup job {{ $labels.job }} failed"
# Warning: backup duration increased significantly
- alert: BackupDurationAnomaly
expr: backup_duration_seconds > 1.5 * avg_over_time(backup_duration_seconds[7d])
for: 10m
labels:
severity: warning
annotations:
summary: "Backup duration {{ $value }}s exceeds 1.5x the 7-day average"
# Warning: backup size anomaly (could indicate data growth or corruption)
- alert: BackupSizeAnomaly
expr: backup_size_bytes > 2 * (backup_size_bytes offset 1d)
for: 1h
labels:
severity: warning
annotations:
summary: "Backup size {{ $value }} bytes is more than 2x yesterday's"
# Critical: WAL archival lag (PITR at risk); the byte-position metric
# names here assume a custom exporter exposing them
- alert: WalArchivalLag
expr: (pg_wal_current_lsn_bytes - pg_wal_last_archived_lsn_bytes) > 1073741824
for: 5m
labels:
severity: critical
annotations:
summary: "WAL archival lag exceeds 1GB - PITR window at risk"
Real-World Case Study: GitLab’s 2017 Database Incident
In January 2017, GitLab suffered what became the canonical backup-failure story. During an incident response, an engineer accidentally deleted the production PostgreSQL data directory. That alone should have been recoverable, but the team then discovered that essentially every backup mechanism had been failing silently: the scheduled pg_dump backups were producing near-empty files because the pg_dump binary was an older version than the database server, and the failure notifications had never reached anyone.
The root cause: backup "verification" amounted to checking that the job ran, not that its output contained data. Nothing ever compared a backup against the database it was supposed to protect.
The impact: GitLab ultimately restored from a staging LVM snapshot taken roughly six hours earlier, losing about six hours of production data (issues, merge requests, comments). Their public postmortem drove fixes that have become industry defaults: automated restore testing in an isolated environment, and verification that compares restored row counts against known-good baselines rather than trusting job exit codes.
The lesson: backup verification must check completeness, not just completion. A backup job that exits 0 but writes an empty file is worse than no backup at all, because it gives false confidence.
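The row-count comparison described above can be sketched as a small helper. The table names and the 1% tolerance are illustrative, not prescriptive:

```python
def verify_restore(baseline_counts: dict[str, int],
                   restored_counts: dict[str, int],
                   tolerance: float = 0.01) -> list[str]:
    """Compare restored row counts against known-good baselines.
    Returns a list of human-readable problems; an empty list means the
    restore looks complete. The tolerance allows for rows written
    between the baseline capture and the backup."""
    problems = []
    for table, expected in baseline_counts.items():
        actual = restored_counts.get(table)
        if actual is None:
            problems.append(f"{table}: missing from restore")
        elif expected and abs(actual - expected) / expected > tolerance:
            problems.append(f"{table}: {actual} rows vs baseline {expected}")
    return problems

# An empty restored table fails loudly; small drift passes
print(verify_restore({"users": 1000, "orders": 5000},
                     {"users": 1002, "orders": 0}))
# ['orders: 0 rows vs baseline 5000']
```

In practice the two dicts would be populated by `SELECT count(*)` queries against the production and restored databases; the key point is that an empty result is a hard failure, not a passing exit code.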
Testing Recovery Procedures
Here’s the uncomfortable truth: a backup that cannot be restored is not a backup. This is non-negotiable.
Regular Recovery Testing
Weekly verification: Automated test restore to confirm backup chain validity.
#!/bin/bash
# Weekly backup verification script
set -euo pipefail

BACKUP_DATE=$(date -d "yesterday" +%Y%m%d)
BACKUP_FILE="/backups/full_${BACKUP_DATE}.dump"
RESTORE_DB="restore_test_$(date +%Y%m%d_%H%M%S)"

# Restore yesterday's full backup into a throwaway database
createdb "$RESTORE_DB"
pg_restore -d "$RESTORE_DB" "$BACKUP_FILE"

# Verify basic integrity of the restored copy
psql -d "$RESTORE_DB" -c "SELECT count(*) FROM pg_class;"
psql -d "$RESTORE_DB" -c "SELECT pg_database_size(current_database()) / 1024 / 1024 AS size_mb;"

# Clean up the test database
dropdb "$RESTORE_DB"
echo "Backup verification successful"
Monthly full disaster simulation: Complete restore to isolated environment, verify data integrity, measure actual RTO against targets.
What to Verify
- Backup completion status — Confirm automated backups completed successfully
- Backup chain integrity — All required WAL segments present
- Restore completion — Backup actually restores without errors
- Data integrity — Restored data matches expected state
- Application functionality — Database serves queries correctly after restore
- Performance baseline — Restore doesn’t degrade performance
Common Recovery Failures
- WAL gap: Missing WAL segments break the restore chain
- Corrupted backup: Checksum failures on backup files
- Storage exhaustion: Restore fails due to insufficient disk space
- Permission issues: Restored database has wrong ownership
- Timezone handling: Timestamps differ between backup and restore environments
Conclusion
Backup and recovery is not a set-it-and-forget-it operation. It requires ongoing attention, testing, and refinement.
Start with the basics: full backups are non-negotiable. Layer on incremental or differential backups based on your RPO requirements. Enable WAL archiving if you need point-in-time recovery. Test everything—regularly.
The time to discover your backup strategy fails is not during an actual disaster. Run fire drills. Simulate failures. Know exactly what your RTO and RPO actually are under real conditions, not just on paper.
For deeper exploration of disaster recovery patterns, see our disaster recovery guide. And for monitoring your database health, check out relational databases fundamentals.