Disaster Recovery: RTO, RPO, and Building a Recovery Plan
Disaster recovery planning protects against catastrophic failures. Learn RTO/RPO metrics, backup strategies, failover automation, and multi-region recovery patterns.
Every system can fail catastrophically. A fire destroys your data center. A corrupted deployment breaks production. A malicious actor deletes critical data. Disasters happen. The question is whether you survive them.
Disaster recovery (DR) is preparing for catastrophic failures. It involves planning, architecture, and processes to restore service when normal recovery mechanisms fail.
This article covers DR metrics, backup strategies, failover approaches, and building a recovery plan.
Understanding RTO and RPO
Two metrics define your disaster recovery requirements.
Recovery Time Objective (RTO)
RTO is the maximum acceptable downtime after a disaster. If your RTO is 4 hours, you must restore service within 4 hours of failure. Anything longer is unacceptable.
RTO drives your architecture decisions. An RTO of 4 hours might allow manual restore procedures. An RTO of 5 minutes requires automatic failover.
Recovery Point Objective (RPO)
RPO is the maximum acceptable data loss. If your RPO is 1 hour, you accept losing up to 1 hour of data. Your backups must capture state from no more than 1 hour ago.
RPO drives your backup frequency. An RPO of 1 hour requires backups at least every hour. An RPO of 1 minute requires continuous replication.
graph LR
A[Last Backup] -->|1 hour| B[Disaster]
C[Last Backup] -->|15 minutes| D[Disaster]
B -.-> E[1 hour data loss - RPO of 1 hour met]
D -.-> F[15 minutes data loss - meets a tighter RPO]
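The relationship between backup interval and worst-case data loss is simple arithmetic. A minimal sketch (function names are illustrative, not from any library):

```python
from datetime import datetime, timedelta

def worst_case_data_loss(last_backup: datetime, disaster: datetime) -> timedelta:
    """Everything written after the last backup is lost."""
    return disaster - last_backup

def interval_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """A backup taken every `backup_interval` can lose up to one full interval."""
    return backup_interval <= rpo

disaster = datetime(2026, 3, 22, 15, 0)
print(worst_case_data_loss(datetime(2026, 3, 22, 14, 0), disaster))   # 1:00:00
print(interval_meets_rpo(timedelta(minutes=15), timedelta(hours=1)))  # True
```

The second function is why RPO drives backup frequency: the interval, not the backup itself, bounds your loss.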
Setting Targets
Different services need different targets. A payment system might need RTO of minutes and RPO of seconds. A logging service might tolerate hours of downtime and days of data loss.
Define targets based on business impact. Work backwards from what the business can survive.
Backup Strategies
Full Backups
Copy everything. Restore is simple: replace current data with backup. The cost is storage and time. Full backups of large databases take hours and need significant storage space.
# Full database backup
pg_dump -Fc mydb > backup_$(date +%Y%m%d).dump
Incremental Backups
Only back up changes since the last backup. Storage efficient. Restore requires last full backup plus all incrementals. Slower restore process.
# Base backup plus WAL archiving, PostgreSQL's incremental mechanism
# In postgresql.conf: archive_command = 'cp %p /backup/wal/%f'
pg_basebackup -Ft -z -X stream -D /backup/base_$(date +%Y%m%d)
Continuous Replication
Stream changes to a replica in real time. Minimal RPO. More complex to operate: you must monitor replication lag and, in multi-master setups, resolve replication conflicts.
# Redis replica configuration
replicaof master.example.com 6379
Point-in-Time Recovery
Combine full backups with transaction logs. Restore to any point in time. Useful for recovering from data corruption or bad deployments.
# PostgreSQL point-in-time recovery: restore a base backup, then set
# recovery targets in postgresql.conf before starting the server
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2026-03-22 14:30:00'   # just before the bad deployment
Failover Architectures
Active-Passive Failover
Primary site handles traffic. Secondary site stays ready but idle. When primary fails, traffic moves to secondary.
Primary:   [Active Service] --> [Primary DB]
                                     |
                               [Replication]
                                     |
Secondary: [Standby Service] --> [Replica DB]
The upside is simplicity. The downside is cost—you pay for idle capacity.
Active-Active Failover
Multiple sites handle traffic simultaneously. When one site fails, others continue. No single point of failure.
Site A: [Service] <--> [DB A]
                          \
                    [Replication]
                          /
Site B: [Service] <--> [DB B]
The upside is maximum availability. The downside is complexity: conflict resolution gets messy, and latency between sites adds friction.
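One common (and deliberately lossy) conflict-resolution strategy for active-active replication is last-write-wins on a timestamp. A minimal sketch, with plain dicts standing in for each site's records (the record shape is an assumption for illustration):

```python
from datetime import datetime

def merge_last_write_wins(site_a: dict, site_b: dict) -> dict:
    """Merge two replicas' records, keeping the newest write per key.
    Each record is {'value': ..., 'updated_at': datetime}."""
    merged = dict(site_a)
    for key, record in site_b.items():
        if key not in merged or record["updated_at"] > merged[key]["updated_at"]:
            merged[key] = record
    return merged

a = {"user:1": {"value": "alice@old.com", "updated_at": datetime(2026, 3, 22, 14, 0)}}
b = {"user:1": {"value": "alice@new.com", "updated_at": datetime(2026, 3, 22, 14, 5)}}
print(merge_last_write_wins(a, b)["user:1"]["value"])  # alice@new.com
```

Note what makes this messy in practice: the older concurrent write is silently discarded, and clock skew between sites can pick the wrong winner.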
Pilot Light
Minimal infrastructure runs in backup region. When failover triggers, you scale up to full capacity. Cost effective but slower recovery.
# Pilot light: minimal always-on, scale on failover
# Always on:
#   - 1 web server
#   - 1 small database replica
# On failover:
#   - Scale web to 10 servers
#   - Promote database replica
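The failover logic itself is small. A runnable sketch in which `Cloud` is a hypothetical stand-in for your provider's API (real code would call boto3, Terraform, or similar):

```python
class Cloud:
    """Hypothetical provider API; a real implementation calls AWS/GCP/etc."""
    def __init__(self):
        self.web_servers = 1        # pilot light: minimal always-on footprint
        self.db_role = "replica"    # small replica kept in sync with primary

    def scale_web(self, count: int):
        self.web_servers = count

    def promote_db(self):
        self.db_role = "primary"

def activate_pilot_light(cloud: Cloud, full_capacity: int = 10):
    cloud.scale_web(full_capacity)  # scale the web tier to full size
    cloud.promote_db()              # promote the replica to primary

dr = Cloud()
activate_pilot_light(dr)
print(dr.web_servers, dr.db_role)  # 10 primary
```

The recovery time of this pattern is dominated by the scale-up step, which is why pilot light trades cost for a slower RTO than active-passive.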
Database Recovery
SQL Database Recovery
from datetime import datetime
from typing import Optional

def restore_database(backup_path: str, point_in_time: Optional[datetime] = None):
    db.stop()                            # stop the database
    db.restore(backup_path)              # restore from backup
    if point_in_time:
        db.recover_until(point_in_time)  # replay WAL up to the target time
    db.start()                           # start database
    db.verify()                          # verify the restore succeeded
NoSQL Database Recovery
def restore_cassandra(backup_location: str):
    truncate_all_keyspaces()             # clear existing data
    for node in cluster.nodes:           # restore snapshots on each node
        restore_snapshots(node, backup_location)
    cluster.repair()                     # repair to ensure consistency
Automation
Manual recovery fails under pressure. When a disaster happens, people panic. Procedures get skipped. Mistakes get made. Automate recovery to remove human error.
Infrastructure as Code
Define infrastructure in code. When you need to rebuild, you deploy from code. No manual configuration.
# Terraform for DR infrastructure
resource "aws_instance" "dr_server" {
  ami           = var.ami_id
  instance_type = "m5.large"
  subnet_id     = aws_subnet.dr_subnet.id
}
Automated Failover
class FailoverController:
    def check_health(self):
        primary = self.get_primary()
        if not primary.is_healthy():
            self.initiate_failover()

    def initiate_failover(self):
        replica = self.get_replica()
        replica.promote()                    # 1. Promote replica
        dns.update(primary=replica.address)  # 2. Update DNS
        notify_team("Failover complete")     # 3. Notify
        log.failover_event()                 # 4. Log
Runbooks
Document recovery procedures step by step. When stressed, people cannot improvise. They need checklists.
# Database Failover Runbook
1. Verify primary is unreachable
   - Ping primary_ip
   - Try connecting to primary:5432
2. Check replica health
   - Check replication lag: under 10 seconds
   - Verify replica is accepting connections
3. Promote replica
   - pg_ctl promote -D /var/lib/postgresql/data
4. Update DNS
   - Point the database CNAME in Route53 at the new primary
5. Verify application
   - Check application logs for errors
   - Verify read/write operations work
6. Notify
   - Post in #incidents
   - Update status page
Testing Your Recovery
A recovery plan you have never tested is not a recovery plan. Test it regularly.
Backup Restore Tests
Restore from your backups to a test environment. Verify data integrity. Measure restore time. Document any issues.
# Monthly restore test
1. Launch test instance
2. Restore latest backup
3. Verify application connects
4. Spot check data integrity
5. Document restore time
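The monthly test above can be automated. A sketch that times a restore and spot-checks integrity with a checksum; `restore_fn` is a placeholder for your actual restore command, and byte-for-byte comparison only makes sense for file-level backups, not logical dumps:

```python
import hashlib
import time

def checksum(path: str) -> str:
    """SHA-256 of a file, read in chunks so large backups fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def run_restore_test(backup_path: str, restored_path: str, restore_fn) -> float:
    """Restore a backup, verify integrity, return restore time in seconds."""
    start = time.time()
    restore_fn(backup_path, restored_path)   # e.g. shell out to pg_restore
    duration = time.time() - start
    if checksum(backup_path) != checksum(restored_path):
        raise RuntimeError("restored data does not match the backup")
    return duration                          # document this against your RTO
```

Recording the returned duration each month gives you an evidence-based answer to "can we actually meet our RTO?"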
Failover Drills
Practice failover in staging. When comfortable, practice in production during low-traffic windows. Measure time to fail over.
# Quarterly failover test
def test_failover():
start = time.time()
# Simulate primary failure
primary.stop()
# Trigger failover
controller.initiate_failover()
# Verify
assert is_healthy(secondary)
assert application.is_working()
duration = time.time() - start
assert duration < RTO
Tabletop Exercises
Walk through disaster scenarios with the team. No actual failures, just discussion. “If the data center caught fire, what would we do?” This surfaces gaps in the plan.
Common Mistakes
No Documented RTO/RPO
Without defined targets, you have no way to know if your architecture actually meets your needs. Write them down. Revisit them quarterly.
Single Point of Failure in Backup
Your backup is in the same rack as your primary. Fire destroys both. Keep backups geographically separate.
Not Testing Backups
Backups you have never restored from are not backups. They are hope. Test restore monthly.
Forgetting About Data in Transit
Backups protect data at rest. What about data in transit? If the database corrupts and you restore, the corruption might have already replicated. Know your data boundaries.
Recovery Plan Without Contact Info
Your plan says “call the DBA.” The DBA is on vacation. Put contact info in the plan. Include multiple contacts.
Building the Plan
A disaster recovery plan documents:
- Scope: What systems are covered
- RTO/RPO: Maximum acceptable downtime and data loss
- Architecture: How the system is deployed and replicated
- Backup Schedule: When backups run, where they are stored
- Failover Procedure: Step-by-step instructions
- Contacts: Who to call during a disaster
- Testing Schedule: When you test recovery
Review and update the plan quarterly. Systems evolve. What worked last year may not work this year.
When to Use / When Not to Use DR Planning
Use comprehensive DR planning when:
- Your service has availability requirements exceeding 99%
- Data loss would cause significant business harm
- Regulatory requirements mandate disaster recovery capabilities
- Your team has the operational maturity to execute and test plans
Simplified DR is acceptable when:
- Service can tolerate hours of downtime
- Data is non-critical or easily reproducible
- Cost of DR infrastructure exceeds business value
- System is in early stage with no revenue impact
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Primary data center destroyed | Complete service outage; data loss | Activate DR site; promote replica; update DNS |
| Database corruption | Application serves bad data; potential data loss | Point-in-time recovery; verify backups; discard corrupted replica |
| Failed deployment to production | Service degradation or outage | Rollback deployment; use blue-green deploys; have canary analysis |
| Backup system fails silently | No recovery possible when needed | Verify backup integrity automatically; alert on backup failures |
| Network partition | Split-brain risk; inconsistent data | Use quorum-based decisions; prefer consistency over availability |
| Ransomware attack | Data encrypted; service unavailable | Isolate affected systems; restore from clean backups; do not pay |
Observability Checklist
Metrics:
- Backup success/failure rate (target: 100% success)
- Backup age (oldest successful backup)
- Replication lag for DR replica
- Time to complete restore in test environment
- DR site health check status
Logs:
- All backup job completions with file counts and sizes
- Restore test results with any discrepancies found
- Failover drill outcomes
- Anomalies in backup or replication streams
Alerts:
- Backup job fails
- Backup age exceeds RPO threshold
- Replication lag exceeds 5 minutes (warning) / RPO (critical)
- DR site unreachable
- Restore test fails
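The backup-age and replication-lag alerts above reduce to two small threshold checks. A sketch whose thresholds mirror the list (function names are illustrative):

```python
from datetime import timedelta

def backup_age_alert(backup_age: timedelta, rpo: timedelta) -> str:
    """Critical once the newest successful backup is older than the RPO."""
    return "critical" if backup_age > rpo else "ok"

def replication_lag_alert(lag: timedelta, rpo: timedelta) -> str:
    """Warning at 5 minutes of lag, critical once lag exceeds the RPO."""
    if lag > rpo:
        return "critical"
    if lag > timedelta(minutes=5):
        return "warning"
    return "ok"

print(backup_age_alert(timedelta(hours=2), rpo=timedelta(hours=1)))      # critical
print(replication_lag_alert(timedelta(minutes=10), timedelta(hours=1)))  # warning
```

Tying both alerts to the RPO, rather than fixed numbers, keeps monitoring honest when the business changes its targets.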
Security Checklist
- Backups encrypted at rest with separate key management
- Backup access restricted to authorized personnel with MFA
- Offsite backups protected against unauthorized access
- Backup retention complies with data retention policies
- DR site access authenticated and authorized separately from primary
- Restore procedures require dual authorization
- Audit log all backup and restore operations
- Test disaster recovery without exposing production data
Quick Recap
- RTO is maximum acceptable downtime; RPO is maximum acceptable data loss
- RTO drives architecture (manual vs automatic failover); RPO drives backup frequency
- Active-passive is simpler but wasteful; active-active is complex but resilient
- Automate recovery to remove human error during crisis
- Test recovery quarterly; tabletop exercises catch planning gaps
Copy/Paste Checklist:
DR Plan Document:
[ ] Scope: which systems and data
[ ] RTO and RPO per system
[ ] Backup strategy (full/incremental/continuous)
[ ] Backup schedule and retention
[ ] Backup storage locations (geographically separate)
[ ] Failover procedure (step by step)
[ ] Recovery procedure (step by step)
[ ] DNS and routing updates
[ ] Contact list with multiple options
[ ] Notification procedures (who, when, how)
[ ] Testing schedule
[ ] Last tested date and results
Conclusion
For more on related topics, see Chaos Engineering and Geo-Distribution.