Disaster Recovery: RTO, RPO, and Building a Recovery Plan

Disaster recovery planning protects against catastrophic failures. Learn RTO/RPO metrics, backup strategies, and multi-region failover patterns.

published: reading time: 19 min read author: GeekWorkBench

Every system can fail catastrophically. A fire destroys your data center. A corrupted deployment breaks production. A malicious actor deletes critical data. Disasters happen. The question is whether you survive them.

Disaster recovery (DR) is preparing for catastrophic failures. It involves planning, architecture, and processes to restore service when normal recovery mechanisms fail.

This article covers DR metrics, backup strategies, failover approaches, and building a recovery plan.

Understanding RTO and RPO

Two metrics define your disaster recovery requirements.

Recovery Time Objective (RTO)

RTO is the maximum acceptable downtime after a disaster. If your RTO is 4 hours, you must restore service within 4 hours of the failure. Anything longer is unacceptable.

RTO drives your architecture decisions. An RTO of 4 hours might allow manual restore procedures. An RTO of 5 minutes requires automatic failover.

Recovery Point Objective (RPO)

RPO is the maximum acceptable data loss. If your RPO is 1 hour, you accept losing up to 1 hour of data. Your backups must capture state from no more than 1 hour ago.

RPO drives your backup frequency. An RPO of 1 hour requires backups at least every hour. An RPO of 1 minute requires continuous replication.

graph LR
    A[Last Backup] -->|1 hour| B[Disaster]
    C[Last Backup] -->|15 minutes| D[Disaster]
    B -.-> E[1 hour data loss - RPO met]
    D -.-> F[15 minutes data loss - tighter RPO]
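The link between backup schedule and RPO comes down to simple arithmetic. The sketch below assumes backups start on a fixed interval and that a backup only becomes usable once it finishes. The function names are illustrative and do not come from any library.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Return the largest possible gap between the last usable backup and a disaster.

    Worst case: the disaster hits just before the next backup finishes, so the
    newest usable backup reflects state from one interval plus one run ago.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    hourly = timedelta(hours=1)
    # Backups that complete instantly
    print(meets_rpo(hourly, rpo=timedelta(hours=1)))
    # Backups that take ten minutes to complete
    print(meets_rpo(hourly, rpo=timedelta(hours=1),
                    backup_duration=timedelta(minutes=10)))
```

Once backups take ten minutes to complete, an hourly schedule no longer satisfies a one-hour RPO, which is why the interval usually needs to be shorter than the target.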

Setting Targets

Different services need different targets. A payment system might need RTO of minutes and RPO of seconds. A logging service might tolerate hours of downtime and days of data loss.

Define targets based on business impact. Work backwards from what the business can survive.

Backup Strategies

Full Backups

Copy everything. Restore is simple: replace current data with backup. The cost is storage and time. Full backups of large databases take hours and need significant storage space.

# Full database backup
pg_dump -Fc mydb > backup_$(date +%Y%m%d).dump

Incremental Backups

Only back up changes since the last backup. Storage efficient. Restore requires last full backup plus all incrementals. Slower restore process.

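# Base backup (tar format, gzip-compressed, WAL streamed alongside).
# WAL archived after this point supplies the incremental changes;
# PostgreSQL 17+ can also take block-level incrementals with --incremental.
pg_basebackup -Ft -z -Xs -D /backup/base_$(date +%Y%m%d)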
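The restore dependency described above (last full backup plus every incremental after it) can be sketched as a small selection function. The `Backup` record and `restore_chain` name are hypothetical, used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full" or "incremental"


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the backups to apply, in order, to restore state as of `target`."""
    # Only backups taken at or before the target are usable
    usable = sorted((b for b in backups if b.taken_at <= target),
                    key=lambda b: b.taken_at)
    # Walk backwards to the most recent full backup; it and every
    # incremental after it make up the restore chain
    for i in range(len(usable) - 1, -1, -1):
        if usable[i].kind == "full":
            return usable[i:]
    raise ValueError("no full backup available at or before the target time")
```

The longer the chain, the slower and more fragile the restore, since one missing or corrupt incremental breaks everything after it. Taking full backups more often shortens the chain at the cost of storage.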

Continuous Replication

Stream changes to a replica in real time. Minimal RPO. Complex setup. Requires monitoring replication lag, and multi-writer setups must also handle replication conflicts.

# Redis replica configuration
replicaof master.example.com 6379

Point-in-Time Recovery

Combine full backups with transaction logs. Restore to any point in time. Useful for recovering from data corruption or bad deployments.

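# PostgreSQL point-in-time recovery (pg_restore has no PITR option).
# Restore a base backup, then set in postgresql.conf (archive path is an example):
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2026-03-22 14:30:00'  # just before the bad deployment
# Create recovery.signal in the data directory and start the server to replay WAL.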

Failover Architectures

Active-Passive Failover

Primary site handles traffic. Secondary site stays ready but idle. When primary fails, traffic moves to secondary.

Primary:   [Active Service]  --> [Primary DB]
                                       |
                                 [Replication]
                                       |
Secondary: [Standby Service] --> [Replica DB]

The upside is simplicity. The downside is cost—you pay for idle capacity.

Active-Active Failover

Multiple sites handle traffic simultaneously. When one site fails, others continue. No single point of failure.

Site A: [Service] <--> [DB A]
Site B: [Service] <--> [DB B]
         \             /
        [Replication]

The upside is maximum availability. The downside is complexity: conflict resolution gets messy, and latency between sites adds friction.

Pilot Light

Minimal infrastructure runs in the backup region. When failover triggers, you scale up to full capacity. Cost effective, but recovery is slower.

# Pilot light: minimal always-on, scale on failover
# Always on:
- 1 web server
- 1 small database replica

# On failover:
- Scale web to 10 servers
- Promote database replica

Database Recovery

SQL Database Recovery

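from datetime import datetime
from typing import Optional


def restore_database(db, backup_path: str, point_in_time: Optional[datetime] = None) -> None:
    """Restore from a backup, optionally replaying WAL up to `point_in_time`.

    `db` is a hypothetical admin wrapper; its methods are placeholders.
    """
    # Stop the database so nothing writes during the restore
    db.stop()

    # Restore from backup
    db.restore(backup_path)

    # Apply WAL to point in time
    if point_in_time is not None:
        db.recover_until(point_in_time)

    # Start database
    db.start()

    # Verify the restored data before reopening to traffic
    db.verify()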

NoSQL Database Recovery

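def restore_cassandra(cluster, backup_location: str) -> None:
    """Restore every node from snapshots.

    `cluster` is a hypothetical wrapper around your Cassandra tooling.
    """
    # Clear existing data so stale rows do not mix with restored ones
    cluster.truncate_all_keyspaces()

    # Restore each node's snapshot files
    for node in cluster.nodes:
        node.restore_snapshots(backup_location)

    # Repair to ensure consistency across replicas
    cluster.repair()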

Automation

Manual recovery fails under pressure. When a disaster happens, people panic. Procedures get skipped. Mistakes get made. Automate recovery to remove human error.

Infrastructure as Code

Define infrastructure in code. When you need to rebuild, you deploy from code. No manual configuration.

# Terraform for DR infrastructure
resource "aws_instance" "dr_server" {
  ami           = var.ami_id
  instance_type = "m5.large"
  subnet_id     = aws_subnet.dr_subnet.id
}

Automated Failover

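class FailoverController:
    """Promotes the replica after repeated failed health checks.

    primary, replica, dns, notify, and log are hypothetical collaborators
    injected by the caller.
    """

    def __init__(self, primary, replica, dns, notify, log, max_failures=3):
        self.primary, self.replica, self.dns = primary, replica, dns
        self.notify, self.log = notify, log
        self.max_failures = max_failures
        self.failures = 0

    def check_health(self):
        if self.primary.is_healthy():
            self.failures = 0
            return
        # Require consecutive failures so one dropped probe does not trigger failover
        self.failures += 1
        if self.failures >= self.max_failures:
            self.initiate_failover()

    def initiate_failover(self):
        # 1. Promote replica
        self.replica.promote()

        # 2. Update DNS
        self.dns.update(primary=self.replica.address)

        # 3. Notify
        self.notify("Failover complete")

        # 4. Log
        self.log.failover_event()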

Runbooks

Document recovery procedures step by step. When stressed, people cannot improvise. They need checklists.

# Database Failover Runbook

1. Verify primary is unreachable
   - Ping primary_ip
   - Try connecting to primary:5432

2. Check replica health
   - Check last reported replication lag: under 10 seconds
   - Verify replica is reachable and still in recovery: SELECT pg_is_in_recovery(); returns true

3. Promote replica
   - pg_ctl promote -D /var/lib/postgresql/data

4. Update DNS
   - Point the database record in Route 53 at the promoted replica

5. Verify application
   - Check application logs for errors
   - Verify read/write operations work

6. Notify
   - Post in #incidents
   - Update status page
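A runbook like the one above can also be encoded so that each step carries its own verification. This is a minimal sketch: the `Step` record and `run_runbook` function are hypothetical names, and the demo mutates an in-memory dictionary instead of touching real infrastructure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # performs the step
    verify: Callable[[], bool]  # confirms the expected outcome


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first step whose verification fails."""
    completed: List[str] = []
    for step in steps:
        step.action()
        if not step.verify():
            raise RuntimeError(
                f"step failed verification: {step.name}; completed so far: {completed}"
            )
        completed.append(step.name)
    return completed


if __name__ == "__main__":
    state = {"promoted": False, "dns": "primary"}
    steps = [
        Step("promote replica",
             lambda: state.update(promoted=True),
             lambda: state["promoted"]),
        Step("update DNS",
             lambda: state.update(dns="replica"),
             lambda: state["dns"] == "replica"),
    ]
    print(run_runbook(steps))
```

Stopping at the first failed verification keeps a half-finished failover visible instead of letting later steps run against a broken state.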

Testing Your Recovery

A recovery plan you have never tested is not a recovery plan. Test it regularly.

Backup Restore Tests

Restore from your backups to a test environment. Verify data integrity. Measure restore time. Document any issues.

# Monthly restore test
1. Launch test instance
2. Restore latest backup
3. Verify application connects
4. Spot check data integrity
5. Document restore time

Failover Drills

Practice failover in staging. When comfortable, practice in production during low-traffic windows. Measure time to fail over.

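# Quarterly failover test
import time

RTO_SECONDS = 300  # example target; replace with the service's actual RTO


def test_failover(primary, secondary, controller, application):
    """Failover drill. The four arguments are hypothetical test fixtures."""
    start = time.monotonic()

    # Simulate primary failure
    primary.stop()

    # Trigger failover
    controller.initiate_failover()

    # Verify the secondary took over and the application still works
    assert secondary.is_healthy()
    assert application.is_working()

    # The whole drill must fit inside the RTO
    duration = time.monotonic() - start
    assert duration < RTO_SECONDS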

Tabletop Exercises

Walk through disaster scenarios with the team. No actual failures, just discussion. “If the data center caught fire, what would we do?” This surfaces gaps in the plan.

Building the Plan

A disaster recovery plan documents:

  1. Scope: What systems are covered
  2. RTO/RPO: Maximum acceptable downtime and data loss
  3. Architecture: How the system is deployed and replicated
  4. Backup Schedule: When backups run, where they are stored
  5. Failover Procedure: Step-by-step instructions
  6. Contacts: Who to call during a disaster
  7. Testing Schedule: When you test recovery

Review and update the plan quarterly. Systems evolve. What worked last year may not work this year.

When to Use / When Not to Use DR Planning

Use comprehensive DR planning when:

  • Your service has availability requirements exceeding 99%
  • Data loss would cause significant business harm
  • Regulatory requirements mandate disaster recovery capabilities
  • Your team has the operational maturity to execute and test plans

Simplified DR is acceptable when:

  • Service can tolerate hours of downtime
  • Data is non-critical or easily reproducible
  • Cost of DR infrastructure exceeds business value
  • System is in early stage with no revenue impact

DR Strategy Comparison

Not every system needs the same approach. Choosing a DR strategy is a cost-versus-recovery-speed trade-off. Overspend on DR for a dev environment and you waste budget. Underspend on DR for a payment processor and you face catastrophic losses.

| Strategy | RTO | RPO | Cost | Complexity | Best For |
|---|---|---|---|---|---|
| Backup and Restore | 4–24 hours | 1–24 hours | Low | Low | Dev/test, non-critical services |
| Pilot Light | 10–30 min | Minutes | Medium-low | Medium | Internal apps, moderate availability |
| Warm Standby | 2–10 min | Seconds–minutes | Medium-high | Medium | SaaS apps, 99.9% SLA |
| Active-Passive (Hot Standby) | Seconds–minutes | Seconds | High | Medium | Customer-facing services, 99.95% |
| Active-Active | Near-zero | Near-zero | Very high | Very high | Global services, payments, 99.99%+ |

Moving from backup-and-restore to active-active shrinks RTO from hours to near-zero, and cost and complexity climb with it. Pilot light sits in between: you keep a skeleton of infrastructure running and scale it up on failover. Active-active is the expensive end of the table. Favor it only when zero downtime is genuinely non-negotiable and your team is ready to manage distributed writes and conflict resolution.
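The availability percentages in the table translate into a yearly downtime budget, which is a useful sanity check on any RTO: a single incident that consumes the whole budget leaves no room for a second one. A minimal sketch, with `downtime_budget` as an illustrative name:

```python
from datetime import timedelta


def downtime_budget(availability_percent: float,
                    period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime allowed over `period` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99):
        print(f"{target}% -> {downtime_budget(target)} per year")
```

Over a 365-day year, 99.9% allows about 8.8 hours of downtime and 99.99% allows about 53 minutes, so an RTO measured in hours is hard to reconcile with a 99.99% target.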

Cloud Provider DR Capabilities

Managed cloud services absorb a lot of the DR operational work. Each major cloud has offerings that handle replication, failover orchestration, and backup retention without you writing the plumbing.

AWS

  • RDS Multi-AZ: Synchronous standby replica in a different Availability Zone. Automatic failover typically completes within one to two minutes. RPO is near-zero for AZ failures.
  • S3 Cross-Region Replication: Asynchronously replicates objects to a bucket in another region. Useful for offsite backup copies and read latency optimization.
  • Route 53 Health Checks and Failover Routing: DNS failover with active health checks. A TTL of 60 seconds or less is typical for failover records, but resolvers that cache beyond the TTL can still delay the switch by minutes.
  • AWS Backup: Centralized backup management across RDS, EBS, DynamoDB, and EFS. Supports cross-region vault copies and policy-based retention.

Google Cloud

  • Cloud SQL High Availability: Synchronous standby in the same region with automatic failover. Cross-region read replicas can be promoted for regional DR.
  • Cloud Spanner: Multi-region writes with synchronous replication. Near-zero RPO globally—no manual failover required.
  • Cloud Storage Multi-Region: Objects are stored redundantly across multiple geographic regions automatically.
  • Persistent Disk Snapshots: Scheduled snapshots for Compute Engine disks. Cross-region snapshot copies enable recovery in another region.

Azure

  • Azure Site Recovery: Replicates Azure VMs and on-premises VMware, Hyper-V, and physical servers to a secondary location. Supports automated failover orchestration with recovery plans.
  • Azure SQL Geo-Replication and Failover Groups: Asynchronous readable replica in another region. Failover groups expose listener endpoints that follow the current primary, so connection strings do not need to change after failover.
  • Geo-Redundant Storage (GRS): Asynchronously replicates storage account data (blobs, files, queues, tables) to a paired region. The secondary copy becomes accessible after a failover, either customer-initiated or Microsoft-managed; RA-GRS adds read access to the secondary before that.

Important caveat: Managed failover services handle infrastructure failover, not application state. Session stores, in-progress transactions, and message queues still require application-level DR logic.

Real-world Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Primary data center destroyed | Complete service outage; data loss | Activate DR site; promote replica; update DNS |
| Database corruption | Application serves bad data; potential data loss | Point-in-time recovery; verify backups; discard corrupted replica |
| Failed deployment to production | Service degradation or outage | Rollback deployment; use blue-green deploys; have canary analysis |
| Backup system fails silently | No recovery possible when needed | Verify backup integrity automatically; alert on backup failures |
| Network partition | Split-brain risk; inconsistent data | Use quorum-based decisions; prefer consistency over availability |
| Ransomware attack | Data encrypted; service unavailable | Isolate affected systems; restore from clean backups; do not pay |
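The quorum-based mitigation for network partitions can be sketched as a majority vote among independent observers: failover proceeds only when more than half of them agree that the primary is unreachable. The function and observer names are hypothetical.

```python
from typing import Dict


def primary_is_down(observations: Dict[str, bool]) -> bool:
    """Decide by strict majority whether the primary is down.

    `observations` maps an observer name to True if that observer could
    reach the primary. Observers should sit in different networks or zones.
    """
    if not observations:
        return False  # no evidence, so take no action
    unreachable = sum(1 for reachable in observations.values() if not reachable)
    # A tie is not a majority; stay on the primary
    return unreachable > len(observations) / 2


if __name__ == "__main__":
    print(primary_is_down({"zone-a": False, "zone-b": False, "zone-c": True}))
    print(primary_is_down({"zone-a": False, "zone-b": True, "zone-c": True}))
```

An odd number of observers avoids ties. Treating a tie as "not down" favors consistency over availability, in line with the mitigation in the table.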

Common Pitfalls / Anti-Patterns

Backups Without Testing

Backups that have never been restored are not backups. They are hope. Test restore monthly. Measure actual restore time. Verify data integrity.
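Between full restore tests, a cheap automated check can at least catch truncated or silently corrupted backup files. The sketch below stores a SHA-256 checksum next to each backup and re-verifies it later. The sidecar naming convention (`<backup>.sha256`) and function names are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksum(backup: Path) -> Path:
    """Write `<backup>.sha256` next to the backup and return its path."""
    sidecar = backup.with_name(backup.name + ".sha256")
    sidecar.write_text(sha256_of(backup))
    return sidecar


def verify_backup(backup: Path) -> bool:
    """True only if the backup exists, is non-empty, and matches its checksum."""
    sidecar = backup.with_name(backup.name + ".sha256")
    if not backup.exists() or not sidecar.exists():
        return False
    if backup.stat().st_size == 0:
        return False
    return sha256_of(backup) == sidecar.read_text().strip()
```

A matching checksum shows that the file is intact, not that it can be restored. It complements the monthly restore test and does not replace it.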

Backup Geographic Risk

Backups stored in the same data center as primary are vulnerable to the same disasters. Keep backups geographically separate.

DR Plan Without Contact Info

A plan that says “call the DBA” when the DBA is on vacation is not a plan. Include multiple contacts with phone numbers.

Forgetting About Data in Transit

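Replication copies corruption and bad writes as faithfully as good ones, so a replica may already hold the damage by the time you notice it. Point-in-time recovery only helps if the target timestamp is earlier than the corruption and your retained logs reach back that far. Understand your replication scope and log retention.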

Planning Without RTO/RPO

Without defined targets, you cannot evaluate whether your architecture meets needs. Set targets based on business impact.

Trade-off Analysis

| Factor | Active-Passive | Active-Active | Tape Backup |
|---|---|---|---|
| RTO | Medium (minutes to hours) | Low (seconds to minutes) | High (hours to days) |
| RPO | Low (minutes of data loss) | Very low (near-zero) | High (last backup only) |
| Cost | Low (1 standby site) | High (2+ full sites) | Low (offline media) |
| Complexity | Low | High | Low |
| Data Loss Risk | Low | Very low | High |
| Geographic Redundancy | Yes | Yes | No |
| Automation Level | Medium | High | Low |
| Best For | Tier-2 workloads | Mission-critical apps | Long-term archival |

Decision Framework:

  • RTO < 1 hour, RPO < 5 minutes → Active-Active with synchronous replication
  • RTO < 4 hours, RPO < 1 hour → Active-Passive with async replication + frequent backups
  • RTO > 4 hours acceptable → Tape/disk backup with documented restore procedure
  • Compliance/archival → Tape or cold storage with periodic restore tests
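The first three rules of the framework can be written down as a small lookup. This sketch applies the thresholds exactly as listed, with two assumptions that are not in the list: an RTO of exactly 4 hours is treated as backup-based, and combinations the list does not cover are flagged for manual review instead of being guessed.

```python
from datetime import timedelta


def recommend_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Map RTO/RPO targets to a strategy using the decision framework's thresholds."""
    if rto < timedelta(hours=1) and rpo < timedelta(minutes=5):
        return "Active-Active with synchronous replication"
    if rto < timedelta(hours=4) and rpo < timedelta(hours=1):
        return "Active-Passive with async replication + frequent backups"
    if rto >= timedelta(hours=4):
        return "Tape/disk backup with documented restore procedure"
    # Tight RTO with a loose RPO is not covered by the framework
    return "No direct match: review manually"


if __name__ == "__main__":
    print(recommend_strategy(timedelta(minutes=30), timedelta(minutes=1)))
    print(recommend_strategy(timedelta(hours=2), timedelta(minutes=30)))
    print(recommend_strategy(timedelta(hours=12), timedelta(hours=24)))
```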

Quick Recap Checklist

Key Bullets:

  • RTO is maximum acceptable downtime; RPO is maximum acceptable data loss
  • RTO drives architecture (manual vs automatic failover); RPO drives backup frequency
  • Active-passive is simpler but wasteful; active-active is complex but resilient
  • Automate recovery to remove human error during crisis
  • Test restores monthly and failover quarterly; tabletop exercises catch planning gaps

Copy/Paste Checklist:

DR Plan Document:
[ ] Scope: which systems and data
[ ] RTO and RPO per system
[ ] Backup strategy (full/incremental/continuous)
[ ] Backup schedule and retention
[ ] Backup storage locations (geographically separate)
[ ] Failover procedure (step by step)
[ ] Recovery procedure (step by step)
[ ] DNS and routing updates
[ ] Contact list with multiple options
[ ] Notification procedures (who, when, how)
[ ] Testing schedule
[ ] Last tested date and results

Observability Checklist

  • Metrics:

    • Backup success/failure rate (target: 100% success)
    • Backup age (oldest successful backup)
    • Replication lag for DR replica
    • Time to complete restore in test environment
    • DR site health check status
  • Logs:

    • All backup job completions with file counts and sizes
    • Restore test results with any discrepancies found
    • Failover drill outcomes
    • Anomalies in backup or replication streams
  • Alerts:

    • Backup job fails
    • Backup age exceeds RPO threshold
    • Replication lag exceeds 5 minutes (warning) / RPO (critical)
    • DR site unreachable
    • Restore test fails
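Two of the alerts above (backup age and replication lag) reduce to threshold comparisons against the RPO. A minimal sketch follows. The function name and alert strings are illustrative, and a real setup would express the same rules in the monitoring system's own rule language.

```python
from datetime import datetime, timedelta
from typing import List

WARNING_LAG = timedelta(minutes=5)  # warning threshold from the checklist


def evaluate_dr_alerts(now: datetime,
                       last_successful_backup: datetime,
                       replication_lag: timedelta,
                       rpo: timedelta) -> List[str]:
    """Return the alerts that should fire for the given measurements."""
    alerts: List[str] = []
    if now - last_successful_backup > rpo:
        alerts.append("critical: backup age exceeds RPO")
    if replication_lag > rpo:
        alerts.append("critical: replication lag exceeds RPO")
    elif replication_lag > WARNING_LAG:
        alerts.append("warning: replication lag exceeds 5 minutes")
    return alerts


if __name__ == "__main__":
    now = datetime(2026, 1, 1, 12, 0)
    print(evaluate_dr_alerts(now, datetime(2026, 1, 1, 10, 30),
                             timedelta(minutes=6), rpo=timedelta(hours=1)))
```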

Security Checklist

  • Backups encrypted at rest with separate key management
  • Backup access restricted to authorized personnel with MFA
  • Offsite backups protected against unauthorized access
  • Backup retention complies with data retention policies
  • DR site access authenticated and authorized separately from primary
  • Restore procedures require dual authorization
  • Audit log all backup and restore operations
  • Test disaster recovery without exposing production data

Interview Questions

1. What is the difference between RTO and RPO?

RTO (Recovery Time Objective) is the maximum acceptable downtime—how long can the system be down before the business impact is unacceptable? RPO (Recovery Point Objective) is the maximum acceptable data loss—how much data can be lost without causing unacceptable business impact?

RTO drives architecture decisions: a 5-minute RTO requires automated failover. A 24-hour RTO might allow manual recovery procedures. RPO drives backup decisions: a 1-hour RPO requires a backup or replication cycle at least every hour. A 24-hour RPO allows daily backups.

2. Compare active-passive and active-active disaster recovery architectures.

Active-passive keeps a warm standby in another region. The primary handles all traffic. When the primary fails, DNS failover routes traffic to the standby. Simpler to operate, but wastefully idle during normal operations. Failover takes time—minutes to hours depending on DNS propagation and application startup.

Active-active runs in multiple regions simultaneously, all serving traffic. When one region fails, the others continue handling its share. No failover delay. More complex—requires data replication between regions and handling of concurrent writes. Higher operational cost since all regions run hot.

Most systems benefit from active-passive with automated failover. True active-active is reserved for systems requiring zero downtime and willing to pay the complexity cost.

3. How do you determine appropriate RTO and RPO values for a system?

Start from the business: what is the cost of downtime per hour? What is the cost of data loss per hour? These costs help justify the investment in DR infrastructure.

Different services have different requirements. A payments service might need minutes RTO. A reporting dashboard might tolerate hours. Set targets per service based on business criticality.

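Regulatory and compliance frameworks can add requirements. HIPAA requires a contingency plan that includes data backup and disaster recovery, and PCI DSS and SOC 2 audits expect documented, tested recovery and incident response procedures. These frameworks generally require you to define and test your own targets rather than prescribing specific RTO or RPO numbers.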

4. What are the main disaster recovery strategies for databases?

Backup and restore: periodic backups stored offsite. Cheapest, slowest recovery. RPO depends on backup frequency; RTO depends on restore time.

Point-in-time recovery: full backup plus transaction log replay. Enables recovery to any point in time. Useful for recovering from data corruption or bad deployments.

Read replicas: asynchronous copies updated from primary. Can be promoted for DR but may lose some data. RPO depends on replication lag.

Synchronous replication: writes not confirmed until replica acknowledges. Zero RPO but adds latency to every write. Expensive to maintain.

Multi-region writes: like Spanner or CockroachDB. True zero RPO with no single point of failure. Highest complexity and cost.

5. How often should you test disaster recovery?

Backup restore tests: monthly at minimum. Verify you can actually restore from backups and measure how long it takes.

Failover drills: quarterly for critical systems. Practice the actual failover procedure, not just the theory. Document any gaps.

Tabletop exercises: twice yearly with the full team. Walk through disaster scenarios verbally. Surface gaps in the plan before real disasters expose them.

The most common failure is never testing until you actually need to recover. By then it's too late to fix what's broken.

6. What is the difference between hot, warm, and cold standby?

Cold standby: no infrastructure running. Backup site is just hardware or instances that get spun up when needed. Cheapest but slowest to recover—RTO of hours.

Warm standby: minimal infrastructure always running, enough to handle reduced traffic. Can scale up to full capacity on failover. RTO of a few minutes is typical, roughly 2-10 minutes.

Hot standby: full production infrastructure running in backup region. Serving reduced traffic or idle. Ready to take over immediately. RTO of seconds to minutes.

The naming reflects how "hot" the standby is—how ready it is to take over. Hotter standby costs more but recovers faster.

7. How do you handle stateful workloads in a DR scenario?

Stateful workloads are the hardest part of DR. The key is understanding what state needs to be preserved and at what cost.

Session state: store in a distributed cache (Redis) instead of local memory. Sessions survive failover only if that cache is itself replicated to the DR site; otherwise users must sign in again.

In-progress transactions: use idempotency keys and design for exactly-once semantics. Some data loss is unavoidable for truly in-flight work.

Message queues: acknowledge messages only after processing completes. Use dead letter queues for messages that can't be processed.

Database state: rely on replication for most recent state, point-in-time recovery for historical. Accept small RPO.

The fundamental challenge: you can't replicate state faster than the network allows. At some point you accept data loss to achieve acceptable RTO.
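The idempotency-key approach for in-progress transactions can be sketched as follows: clients retry after failover with the same key, and the server replays the stored result instead of running the operation twice. The class below is a single-process illustration with an in-memory dictionary. A real deployment would need a durable, replicated store with an atomic insert, which this sketch does not provide.

```python
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Runs a handler at most once per idempotency key and replays the stored result."""

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler
        # Assumption: stands in for a durable store that survives failover
        self._results: Dict[str, Any] = {}

    def process(self, key: str, payload: Any) -> Any:
        # A retried request with a known key returns the earlier result
        if key in self._results:
            return self._results[key]
        result = self._handler(payload)
        self._results[key] = result
        return result


if __name__ == "__main__":
    charges = []
    processor = IdempotentProcessor(lambda amount: charges.append(amount) or len(charges))
    processor.process("order-123", 50)
    processor.process("order-123", 50)  # client retry after a failover
    print(charges)
```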

8. What is a DR runbook and why is it important?

A DR runbook is a step-by-step procedure for recovering from a specific disaster scenario. It lists exact commands to run, in order, with expected outcomes at each step.

During a disaster, people make mistakes under pressure. They skip steps. They panic. A runbook removes decision-making from the crisis. Anyone qualified can execute it, not just the person who built the system.

Runbooks should cover: detecting failure, confirming it's real, initiating failover, verifying recovery, communicating status. Each step should be copy-pasteable commands, not descriptions of what to do.

Test your runbook in a failover drill. If you first discover missing steps or ambiguous instructions during a real disaster, you have learned them too late.

9. How do you manage DNS failover without causing downtime?

DNS failover sounds simple: change the DNS record and traffic goes to the new site. In practice, DNS TTLs mean changes take minutes to hours to propagate globally.

Low TTLs: set DNS TTL to 60 seconds or less well before you need to fail over. Some resolvers ignore TTLs and cache longer anyway.

Anycast: use CDNs or global load balancers that route by IP prefix rather than DNS. More complex but eliminates DNS propagation delay.

Health checks: combine DNS failover with active health checks on the primary. Detect failure faster than DNS TTL would suggest.

The real solution: don't rely on DNS alone for fast failover. Use anycast or active health checks at the load balancer level. DNS is the backup, not the primary mechanism.
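A rough worst-case estimate shows why DNS alone struggles with tight RTOs: clients keep using the old record until the failure is detected and their cached answer expires. The sketch below adds those delays together. The parameter names and the example numbers are assumptions, and the estimate excludes the time needed to promote the replica itself.

```python
from datetime import timedelta


def worst_case_dns_failover(check_interval: timedelta,
                            failure_threshold: int,
                            ttl: timedelta,
                            resolver_overhang: timedelta = timedelta(0)) -> timedelta:
    """Estimate how long clients may keep hitting the failed site.

    Detection takes `failure_threshold` consecutive failed checks; after the
    record changes, caches may serve the old answer for up to `ttl`, plus any
    extra time misbehaving resolvers hold it (`resolver_overhang`).
    """
    detection = check_interval * failure_threshold
    return detection + ttl + resolver_overhang


if __name__ == "__main__":
    estimate = worst_case_dns_failover(
        check_interval=timedelta(seconds=30),
        failure_threshold=3,
        ttl=timedelta(seconds=60),
    )
    print(estimate.total_seconds())
```

With 30-second checks, three failures required, and a 60-second TTL, the estimate is 150 seconds before well-behaved clients move over, which already rules out an RTO measured in seconds.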

10. How do you validate that your DR plan actually works?

The only way to validate a DR plan is to actually execute it under simulated disaster conditions. Documentation is not validation.

Backup restore tests: monthly. Launch a clean environment, restore from backup, verify application works. Measure actual restore time.

Chaos engineering: for advanced teams. Intentionally trigger failure of primary site in staging and verify DR site takes over correctly.

Game days: full team exercises where you simulate a disaster and walk through the runbook. No actual failures, but you measure how long it takes and where people get stuck.

The most important metric: can you actually meet your RTO and RPO targets under realistic conditions? If you haven't tested, you don't know.
