Disaster Recovery: RTO, RPO, and Building a Recovery Plan

Disaster recovery planning protects against catastrophic failures. Learn RTO/RPO metrics, backup strategies, failover automation, and multi-region recovery patterns.

Every system can fail catastrophically. A fire destroys your data center. A corrupted deployment breaks production. A malicious actor deletes critical data. Disasters happen. The question is whether you survive them.

Disaster recovery (DR) is preparing for catastrophic failures. It involves planning, architecture, and processes to restore service when normal recovery mechanisms fail.

This article covers DR metrics, backup strategies, failover approaches, and building a recovery plan.

Understanding RTO and RPO

Two metrics define your disaster recovery requirements.

Recovery Time Objective (RTO)

RTO is the maximum acceptable downtime after a disaster. If your RTO is 4 hours, you must restore service within 4 hours of failure. Anything longer is unacceptable.

RTO drives your architecture decisions. An RTO of 4 hours might allow manual restore procedures. An RTO of 5 minutes requires automatic failover.

Recovery Point Objective (RPO)

RPO is the maximum acceptable data loss. If your RPO is 1 hour, you accept losing up to 1 hour of data. Your backups must capture state from no more than 1 hour ago.

RPO drives your backup frequency. An RPO of 1 hour requires backups at least every hour. An RPO of 1 minute requires continuous replication.

graph LR
    A[Last Backup] -->|1 hour| B[Disaster]
    C[Last Backup] -->|15 minutes| D[Disaster]
    B -.-> E[1 hour data loss - RPO met]
    D -.-> F[15 minutes data loss - tighter RPO]

Setting Targets

Different services need different targets. A payment system might need RTO of minutes and RPO of seconds. A logging service might tolerate hours of downtime and days of data loss.

Define targets based on business impact. Work backwards from what the business can survive.

Backup Strategies

Full Backups

Copy everything. Restore is simple: replace current data with backup. The cost is storage and time. Full backups of large databases take hours and need significant storage space.

# Full database backup
pg_dump -Fc mydb > backup_$(date +%Y%m%d).dump

Incremental Backups

Only back up changes since the last backup. Storage efficient. Restore requires last full backup plus all incrementals. Slower restore process.

# Incremental backup via continuous WAL archiving (postgresql.conf)
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'

Continuous Replication

Stream changes to a replica in real time. Minimal RPO. Complex setup. Requires handling replication conflicts.

# Redis replica configuration
replicaof master.example.com 6379

Point-in-Time Recovery

Combine full backups with transaction logs. Restore to any point in time. Useful for recovering from data corruption or bad deployments.

# PostgreSQL point-in-time recovery: restore the base backup, then
# replay WAL to just before the bad deployment (postgresql.conf)
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2026-03-22 14:30:00'

Failover Architectures

Active-Passive Failover

Primary site handles traffic. Secondary site stays ready but idle. When primary fails, traffic moves to secondary.

Primary:  [Active Service] --> [Primary DB]
                                |
                          [Replication]
                                |
Secondary: [Standby Service] --> [Replica DB]

The upside is simplicity. The downside is cost—you pay for idle capacity.

Active-Active Failover

Multiple sites handle traffic simultaneously. When one site fails, others continue. No single point of failure.

Site A: [Service] <--> [DB A]
Site B: [Service] <--> [DB B]
         \             /
        [Replication]

The upside is maximum availability. The downside is complexity: conflict resolution gets messy, and latency between sites adds friction.
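One common (if lossy) approach to the conflict problem is last-write-wins. A minimal sketch, assuming each write carries a timestamp and that site clocks are reasonably synchronized; the `Versioned` type and tie-breaking rule are illustrative, not from any particular database:

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: float  # wall-clock write time (assumes roughly synced clocks)
    site: str

def resolve_lww(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins merge; ties broken by site id so every site converges
    on the same winner regardless of which order it sees the writes."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.site > b.site else b
```

Last-write-wins silently discards the losing write, which is exactly why active-active conflict resolution "gets messy": for data where dropped writes are unacceptable, you need application-level merges or single-writer partitioning instead.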

Pilot Light

Minimal infrastructure runs in backup region. When failover triggers, you scale up to full capacity. Cost effective but slower recovery.

# Pilot light: minimal always-on, scale on failover
# Always on:
- 1 web server
- 1 small database replica

# On failover:
- Scale web to 10 servers
- Promote database replica

Database Recovery

SQL Database Recovery

from datetime import datetime

def restore_database(backup_path: str, point_in_time: datetime | None = None):
    # Stop the database so the restore sees a quiet data directory
    db.stop()

    # Restore from the most recent full backup
    db.restore(backup_path)

    # Replay WAL up to the requested point in time, if any
    if point_in_time:
        db.recover_until(point_in_time)

    # Start the database
    db.start()

    # Verify restored data before returning to service
    db.verify()

NoSQL Database Recovery

def restore_cassandra(backup_location: str):
    # Clear existing data
    truncate_all_keyspaces()

    # Restore snapshots
    for node in cluster.nodes:
        restore_snapshots(node, backup_location)

    # Repair to ensure consistency
    cluster.repair()

Automation

Manual recovery fails under pressure. When a disaster happens, people panic. Procedures get skipped. Mistakes get made. Automate recovery to remove human error.

Infrastructure as Code

Define infrastructure in code. When you need to rebuild, you deploy from code. No manual configuration.

# Terraform for DR infrastructure
resource "aws_instance" "dr_server" {
  ami           = var.ami_id
  instance_type = "m5.large"
  subnet_id     = aws_subnet.dr_subnet.id
}

Automated Failover

class FailoverController:
    def check_health(self):
        primary = self.get_primary()
        if not primary.is_healthy():
            self.initiate_failover()

    def initiate_failover(self):
        replica = self.get_replica()

        # 1. Promote the replica to primary
        replica.promote()

        # 2. Point clients at the new primary
        dns.update(primary=replica.address)

        # 3. Notify the on-call team
        notify_team("Failover complete")

        # 4. Record the event for the postmortem
        log.failover_event()

Runbooks

Document recovery procedures step by step. When stressed, people cannot improvise. They need checklists.

# Database Failover Runbook

1. Verify primary is unreachable
   - Ping primary_ip
   - Try connecting to primary:5432

2. Check replica health
   - Check replication lag: under 10 seconds
   - Verify the replica is connected and replaying WAL

3. Promote replica
   - pg_ctl promote -D /var/lib/postgresql/data

4. Update DNS
   - Point the database DNS record in Route 53 at the promoted replica

5. Verify application
   - Check application logs for errors
   - Verify read/write operations work

6. Notify
   - Post in #incidents
   - Update status page
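A runbook like the one above can also be encoded as light automation that still keeps a human in the loop, confirming each step before it runs. A sketch with injected step actions and a confirmation callback (all names here are hypothetical, not from any library):

```python
from typing import Callable

def run_runbook(steps: list[tuple[str, Callable[[], None]]],
                confirm: Callable[[str], bool]) -> list[str]:
    """Run runbook steps in order; stop at the first step the operator declines.
    Each action raises on failure, halting the runbook before later steps."""
    completed: list[str] = []
    for name, action in steps:
        if not confirm(name):
            break  # operator said no — stop before doing anything risky
        action()
        completed.append(name)
    return completed

# Example: auto-confirm everything except the DNS change
steps = [("Verify primary unreachable", lambda: None),
         ("Promote replica", lambda: None),
         ("Update DNS", lambda: None)]
done = run_runbook(steps, confirm=lambda name: name != "Update DNS")
print(done)  # ['Verify primary unreachable', 'Promote replica']
```

The confirmation hook is the point: under stress, the operator follows the checklist the code enforces instead of improvising, but retains the authority to stop.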

Testing Your Recovery

A recovery plan you have never tested is not a recovery plan. Test it regularly.

Backup Restore Tests

Restore from your backups to a test environment. Verify data integrity. Measure restore time. Document any issues.

# Monthly restore test
1. Launch test instance
2. Restore latest backup
3. Verify application connects
4. Spot check data integrity
5. Document restore time
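The monthly test above can be wrapped in a small harness that times the restore and checks the result against the RTO. A sketch with injected `restore` and `verify` callables (hypothetical names standing in for your actual restore commands and integrity checks):

```python
import time
from typing import Callable

def run_restore_test(restore: Callable[[], None],
                     verify: Callable[[], bool],
                     rto_seconds: float) -> dict:
    """Time a restore and check it both succeeds and fits inside the RTO."""
    start = time.monotonic()
    restore()
    data_ok = verify()
    duration = time.monotonic() - start
    return {"data_ok": data_ok,
            "duration_s": duration,
            "within_rto": duration <= rto_seconds}

# Example with a stubbed restore; in practice these callables would
# launch the test instance, restore the backup, and spot-check data
result = run_restore_test(restore=lambda: time.sleep(0.1),
                          verify=lambda: True,
                          rto_seconds=4 * 3600)
print(result["data_ok"], result["within_rto"])  # True True
```

Recording `duration_s` each month gives you a trend line: restores that slowly creep toward the RTO are a warning you get for free.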

Failover Drills

Practice failover in staging. When comfortable, practice in production during low-traffic windows. Measure time to fail over.

# Quarterly failover test
def test_failover():
    start = time.time()

    # Simulate primary failure
    primary.stop()

    # Trigger failover
    controller.initiate_failover()

    # Verify
    assert is_healthy(secondary)
    assert application.is_working()

    duration = time.time() - start
    assert duration < RTO

Tabletop Exercises

Walk through disaster scenarios with the team. No actual failures, just discussion. “If the data center caught fire, what would we do?” This surfaces gaps in the plan.

Common Mistakes

No Documented RTO/RPO

Without defined targets, you have no way to know if your architecture actually meets your needs. Write them down. Revisit them quarterly.

Single Point of Failure in Backup

Your backup is in the same rack as your primary. Fire destroys both. Keep backups geographically separate.

Not Testing Backups

Backups you have never restored from are not backups. They are hope. Test restore monthly.

Forgetting About Data in Transit

Backups protect data at rest, but corruption travels with replication. If the database corrupts and the corruption replicates to your standby before you notice, restoring from the replica restores the corruption. Know your replication scope.

Recovery Plan Without Contact Info

Your plan says “call the DBA.” The DBA is on vacation. Put contact info in the plan. Include multiple contacts.

Building the Plan

A disaster recovery plan documents:

  1. Scope: What systems are covered
  2. RTO/RPO: Maximum acceptable downtime and data loss
  3. Architecture: How the system is deployed and replicated
  4. Backup Schedule: When backups run, where they are stored
  5. Failover Procedure: Step-by-step instructions
  6. Contacts: Who to call during a disaster
  7. Testing Schedule: When you test recovery

Review and update the plan quarterly. Systems evolve. What worked last year may not work this year.

When to Use / When Not to Use DR Planning

Use comprehensive DR planning when:

  • Your service has availability requirements exceeding 99%
  • Data loss would cause significant business harm
  • Regulatory requirements mandate disaster recovery capabilities
  • Your team has the operational maturity to execute and test plans

Simplified DR is acceptable when:

  • Service can tolerate hours of downtime
  • Data is non-critical or easily reproducible
  • Cost of DR infrastructure exceeds business value
  • System is in early stage with no revenue impact

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Primary data center destroyed | Complete service outage; data loss | Activate DR site; promote replica; update DNS |
| Database corruption | Application serves bad data; potential data loss | Point-in-time recovery; verify backups; discard corrupted replica |
| Failed deployment to production | Service degradation or outage | Roll back deployment; use blue-green deploys; have canary analysis |
| Backup system fails silently | No recovery possible when needed | Verify backup integrity automatically; alert on backup failures |
| Network partition | Split-brain risk; inconsistent data | Use quorum-based decisions; prefer consistency over availability |
| Ransomware attack | Data encrypted; service unavailable | Isolate affected systems; restore from clean backups; do not pay |

Observability Checklist

  • Metrics:

    • Backup success/failure rate (target: 100% success)
    • Backup age (oldest successful backup)
    • Replication lag for DR replica
    • Time to complete restore in test environment
    • DR site health check status
  • Logs:

    • All backup job completions with file counts and sizes
    • Restore test results with any discrepancies found
    • Failover drill outcomes
    • Anomalies in backup or replication streams
  • Alerts:

    • Backup job fails
    • Backup age exceeds RPO threshold
    • Replication lag exceeds 5 minutes (warning) / RPO (critical)
    • DR site unreachable
    • Restore test fails
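The replication-lag thresholds in the alert list above can be encoded directly; a minimal sketch, with the 5-minute warning level as a default:

```python
def lag_severity(lag_seconds: float, rpo_seconds: float,
                 warn_seconds: float = 300) -> str:
    """Map DR replication lag to an alert level using the thresholds above."""
    if lag_seconds >= rpo_seconds:
        return "critical"  # failing over now would lose more data than the RPO allows
    if lag_seconds >= warn_seconds:
        return "warning"
    return "ok"
```

Tying the critical threshold to the RPO keeps the alert honest: it fires exactly when a failover would violate your stated data-loss target.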

Security Checklist

  • Backups encrypted at rest with separate key management
  • Backup access restricted to authorized personnel with MFA
  • Offsite backups protected against unauthorized access
  • Backup retention complies with data retention policies
  • DR site access authenticated and authorized separately from primary
  • Restore procedures require dual authorization
  • Audit log all backup and restore operations
  • Test disaster recovery without exposing production data

Common Pitfalls / Anti-Patterns

Backups Without Testing

Backups that have never been restored are not backups. They are hope. Test restore monthly. Measure actual restore time. Verify data integrity.

Backup Geographic Risk

Backups stored in the same data center as primary are vulnerable to the same disasters. Keep backups geographically separate.

DR Plan Without Contact Info

A plan that says “call the DBA” when the DBA is on vacation is not a plan. Include multiple contacts with phone numbers.

Forgetting About Data in Transit

Point-in-time recovery to a timestamp might not capture corruption that already replicated. Understand your replication scope.

Planning Without RTO/RPO

Without defined targets, you cannot evaluate whether your architecture meets needs. Set targets based on business impact.

Quick Recap

Key Bullets:

  • RTO is maximum acceptable downtime; RPO is maximum acceptable data loss
  • RTO drives architecture (manual vs automatic failover); RPO drives backup frequency
  • Active-passive is simpler but wasteful; active-active is complex but resilient
  • Automate recovery to remove human error during crisis
  • Test recovery quarterly; tabletop exercises catch planning gaps

Copy/Paste Checklist:

DR Plan Document:
[ ] Scope: which systems and data
[ ] RTO and RPO per system
[ ] Backup strategy (full/incremental/continuous)
[ ] Backup schedule and retention
[ ] Backup storage locations (geographically separate)
[ ] Failover procedure (step by step)
[ ] Recovery procedure (step by step)
[ ] DNS and routing updates
[ ] Contact list with multiple options
[ ] Notification procedures (who, when, how)
[ ] Testing schedule
[ ] Last tested date and results

Conclusion

For more on related topics, see Chaos Engineering and Geo-Distribution.
