Disaster Recovery: RTO, RPO, and Building a Recovery Plan
Disaster recovery planning protects against catastrophic failures. Learn RTO/RPO metrics, backup strategies, failover automation, and multi-region recovery patterns.
Every system can fail catastrophically. A fire destroys your data center. A corrupted deployment breaks production. A malicious actor deletes critical data. Disasters happen. The question is whether you survive them.
Disaster recovery (DR) is preparing for catastrophic failures. It involves planning, architecture, and processes to restore service when normal recovery mechanisms fail.
This article covers DR metrics, backup strategies, failover approaches, and building a recovery plan.
Understanding RTO and RPO
Two metrics define your disaster recovery requirements.
Recovery Time Objective (RTO)
RTO is the maximum acceptable downtime after a disaster. If your RTO is 4 hours, you must restore service within 4 hours of failure. Anything longer is unacceptable.
RTO drives your architecture decisions. An RTO of 4 hours might allow manual restore procedures. An RTO of 5 minutes requires automatic failover.
Recovery Point Objective (RPO)
RPO is the maximum acceptable data loss. If your RPO is 1 hour, you accept losing up to 1 hour of data. Your backups must capture state from no more than 1 hour ago.
RPO drives your backup frequency. An RPO of 1 hour requires backups at least every hour. An RPO of 1 minute requires continuous replication.
graph LR
A[Last Backup] -->|1 hour| B[Disaster]
C[Last Backup] -->|15 minutes| D[Disaster]
B -.-> E[1 hour data loss - RPO of 1 hour met]
D -.-> F[15 minutes data loss - meets a tighter RPO]
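The relationship between backup interval and worst-case data loss is simple arithmetic. A minimal sketch (function names are illustrative, not from any library):

```python
from datetime import datetime, timedelta

def worst_case_data_loss(last_backup: datetime, disaster: datetime) -> timedelta:
    """Everything written after the last backup is lost."""
    return disaster - last_backup

def interval_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """A backup taken every `backup_interval` can lose up to one full interval."""
    return backup_interval <= rpo

disaster = datetime(2026, 3, 22, 15, 0)
print(worst_case_data_loss(datetime(2026, 3, 22, 14, 0), disaster))   # 1:00:00
print(interval_meets_rpo(timedelta(minutes=15), timedelta(hours=1)))  # True
```

The second function is why RPO drives backup frequency: the interval, not the backup itself, bounds your loss.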
Setting Targets
Different services need different targets. A payment system might need RTO of minutes and RPO of seconds. A logging service might tolerate hours of downtime and days of data loss.
Define targets based on business impact. Work backwards from what the business can survive.
Backup Strategies
Full Backups
Copy everything. Restore is simple: replace current data with backup. The cost is storage and time. Full backups of large databases take hours and need significant storage space.
# Full database backup
pg_dump -Fc mydb > backup_$(date +%Y%m%d).dump
Incremental Backups
Only back up changes since the last backup. Storage efficient. Restore requires last full backup plus all incrementals. Slower restore process.
# Base backup plus WAL archiving, PostgreSQL's incremental mechanism
# In postgresql.conf: archive_command = 'cp %p /backup/wal/%f'
pg_basebackup -Ft -z -X stream -D /backup/base_$(date +%Y%m%d)
Continuous Replication
Stream changes to a replica in real time. Minimal RPO. More complex to operate: you must monitor replication lag and, in multi-master setups, resolve replication conflicts.
# Redis replica configuration
replicaof master.example.com 6379
Point-in-Time Recovery
Combine full backups with transaction logs. Restore to any point in time. Useful for recovering from data corruption or bad deployments.
# PostgreSQL point-in-time recovery: restore a base backup, then set
# recovery targets in postgresql.conf before starting the server
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2026-03-22 14:30:00'   # just before the bad deployment
Failover Architectures
Active-Passive Failover
Primary site handles traffic. Secondary site stays ready but idle. When primary fails, traffic moves to secondary.
Primary:   [Active Service] --> [Primary DB]
                                     |
                               [Replication]
                                     |
Secondary: [Standby Service] --> [Replica DB]
The upside is simplicity. The downside is cost—you pay for idle capacity.
Active-Active Failover
Multiple sites handle traffic simultaneously. When one site fails, others continue. No single point of failure.
Site A: [Service] <--> [DB A]
                          \
                    [Replication]
                          /
Site B: [Service] <--> [DB B]
The upside is maximum availability. The downside is complexity: conflict resolution gets messy, and latency between sites adds friction.
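One common (and deliberately lossy) conflict-resolution strategy for active-active replication is last-write-wins on a timestamp. A minimal sketch, with plain dicts standing in for each site's records (the record shape is an assumption for illustration):

```python
from datetime import datetime

def merge_last_write_wins(site_a: dict, site_b: dict) -> dict:
    """Merge two replicas' records, keeping the newest write per key.
    Each record is {'value': ..., 'updated_at': datetime}."""
    merged = dict(site_a)
    for key, record in site_b.items():
        if key not in merged or record["updated_at"] > merged[key]["updated_at"]:
            merged[key] = record
    return merged

a = {"user:1": {"value": "alice@old.com", "updated_at": datetime(2026, 3, 22, 14, 0)}}
b = {"user:1": {"value": "alice@new.com", "updated_at": datetime(2026, 3, 22, 14, 5)}}
print(merge_last_write_wins(a, b)["user:1"]["value"])  # alice@new.com
```

Note what makes this messy in practice: the older concurrent write is silently discarded, and clock skew between sites can pick the wrong winner.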
Pilot Light
Minimal infrastructure runs in backup region. When failover triggers, you scale up to full capacity. Cost effective but slower recovery.
# Pilot light: minimal always-on, scale on failover
# Always on:
#   - 1 web server
#   - 1 small database replica
# On failover:
#   - Scale web to 10 servers
#   - Promote database replica
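The failover logic itself is small. A runnable sketch in which `Cloud` is a hypothetical stand-in for your provider's API (real code would call boto3, Terraform, or similar):

```python
class Cloud:
    """Hypothetical provider API; a real implementation calls AWS/GCP/etc."""
    def __init__(self):
        self.web_servers = 1        # pilot light: minimal always-on footprint
        self.db_role = "replica"    # small replica kept in sync with primary

    def scale_web(self, count: int):
        self.web_servers = count

    def promote_db(self):
        self.db_role = "primary"

def activate_pilot_light(cloud: Cloud, full_capacity: int = 10):
    cloud.scale_web(full_capacity)  # scale the web tier to full size
    cloud.promote_db()              # promote the replica to primary

dr = Cloud()
activate_pilot_light(dr)
print(dr.web_servers, dr.db_role)  # 10 primary
```

The recovery time of this pattern is dominated by the scale-up step, which is why pilot light trades cost for a slower RTO than active-passive.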
Database Recovery
SQL Database Recovery
from datetime import datetime
from typing import Optional

def restore_database(backup_path: str, point_in_time: Optional[datetime] = None):
    db.stop()                            # stop the database
    db.restore(backup_path)              # restore from backup
    if point_in_time:
        db.recover_until(point_in_time)  # replay WAL up to the target time
    db.start()                           # start database
    db.verify()                          # verify the restore succeeded
NoSQL Database Recovery
def restore_cassandra(backup_location: str):
    truncate_all_keyspaces()             # clear existing data
    for node in cluster.nodes:           # restore snapshots on each node
        restore_snapshots(node, backup_location)
    cluster.repair()                     # repair to ensure consistency
Automation
Manual recovery fails under pressure. When a disaster happens, people panic. Procedures get skipped. Mistakes get made. Automate recovery to remove human error.
Infrastructure as Code
Define infrastructure in code. When you need to rebuild, you deploy from code. No manual configuration.
# Terraform for DR infrastructure
resource "aws_instance" "dr_server" {
  ami           = var.ami_id
  instance_type = "m5.large"
  subnet_id     = aws_subnet.dr_subnet.id
}
Automated Failover
class FailoverController:
    def check_health(self):
        primary = self.get_primary()
        if not primary.is_healthy():
            self.initiate_failover()

    def initiate_failover(self):
        replica = self.get_replica()
        replica.promote()                    # 1. Promote replica
        dns.update(primary=replica.address)  # 2. Update DNS
        notify_team("Failover complete")     # 3. Notify
        log.failover_event()                 # 4. Log
Runbooks
Document recovery procedures step by step. When stressed, people cannot improvise. They need checklists.
# Database Failover Runbook
1. Verify primary is unreachable
   - Ping primary_ip
   - Try connecting to primary:5432
2. Check replica health
   - Check replication lag: under 10 seconds
   - Verify replica is accepting connections
3. Promote replica
   - pg_ctl promote -D /var/lib/postgresql/data
4. Update DNS
   - Point the database CNAME in Route53 at the new primary
5. Verify application
   - Check application logs for errors
   - Verify read/write operations work
6. Notify
   - Post in #incidents
   - Update status page
Testing Your Recovery
A recovery plan you have never tested is not a recovery plan. Test it regularly.
Backup Restore Tests
Restore from your backups to a test environment. Verify data integrity. Measure restore time. Document any issues.
# Monthly restore test
1. Launch test instance
2. Restore latest backup
3. Verify application connects
4. Spot check data integrity
5. Document restore time
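The monthly test above can be automated. A sketch that times a restore and spot-checks integrity with a checksum; `restore_fn` is a placeholder for your actual restore command, and byte-for-byte comparison only makes sense for file-level backups, not logical dumps:

```python
import hashlib
import time

def checksum(path: str) -> str:
    """SHA-256 of a file, read in chunks so large backups fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def run_restore_test(backup_path: str, restored_path: str, restore_fn) -> float:
    """Restore a backup, verify integrity, return restore time in seconds."""
    start = time.time()
    restore_fn(backup_path, restored_path)   # e.g. shell out to pg_restore
    duration = time.time() - start
    if checksum(backup_path) != checksum(restored_path):
        raise RuntimeError("restored data does not match the backup")
    return duration                          # document this against your RTO
```

Recording the returned duration each month gives you an evidence-based answer to "can we actually meet our RTO?"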
Failover Drills
Practice failover in staging. When comfortable, practice in production during low-traffic windows. Measure time to fail over.
# Quarterly failover test
def test_failover():
start = time.time()
# Simulate primary failure
primary.stop()
# Trigger failover
controller.initiate_failover()
# Verify
assert is_healthy(secondary)
assert application.is_working()
duration = time.time() - start
assert duration < RTO
Tabletop Exercises
Walk through disaster scenarios with the team. No actual failures, just discussion. “If the data center caught fire, what would we do?” This surfaces gaps in the plan.
Common Mistakes
No Documented RTO/RPO
Without defined targets, you have no way to know if your architecture actually meets your needs. Write them down. Revisit them quarterly.
Single Point of Failure in Backup
Your backup is in the same rack as your primary. Fire destroys both. Keep backups geographically separate.
Not Testing Backups
Backups you have never restored from are not backups. They are hope. Test restore monthly.
Forgetting About Data in Transit
Backups protect data at rest. What about data in transit? If the database corrupts and you restore, the corruption might have already replicated. Know your data boundaries.
Recovery Plan Without Contact Info
Your plan says “call the DBA.” The DBA is on vacation. Put contact info in the plan. Include multiple contacts.
Building the Plan
A disaster recovery plan documents:
- Scope: What systems are covered
- RTO/RPO: Maximum acceptable downtime and data loss
- Architecture: How the system is deployed and replicated
- Backup Schedule: When backups run, where they are stored
- Failover Procedure: Step-by-step instructions
- Contacts: Who to call during a disaster
- Testing Schedule: When you test recovery
Review and update the plan quarterly. Systems evolve. What worked last year may not work this year.
When to Use / When Not to Use DR Planning
Use comprehensive DR planning when:
- Your service has availability requirements exceeding 99%
- Data loss would cause significant business harm
- Regulatory requirements mandate disaster recovery capabilities
- Your team has the operational maturity to execute and test plans
Simplified DR is acceptable when:
- Service can tolerate hours of downtime
- Data is non-critical or easily reproducible
- Cost of DR infrastructure exceeds business value
- System is in early stage with no revenue impact
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Primary data center destroyed | Complete service outage; data loss | Activate DR site; promote replica; update DNS |
| Database corruption | Application serves bad data; potential data loss | Point-in-time recovery; verify backups; discard corrupted replica |
| Failed deployment to production | Service degradation or outage | Rollback deployment; use blue-green deploys; have canary analysis |
| Backup system fails silently | No recovery possible when needed | Verify backup integrity automatically; alert on backup failures |
| Network partition | Split-brain risk; inconsistent data | Use quorum-based decisions; prefer consistency over availability |
| Ransomware attack | Data encrypted; service unavailable | Isolate affected systems; restore from clean backups; do not pay |
Observability Checklist
Metrics:
- Backup success/failure rate (target: 100% success)
- Backup age (oldest successful backup)
- Replication lag for DR replica
- Time to complete restore in test environment
- DR site health check status
Logs:
- All backup job completions with file counts and sizes
- Restore test results with any discrepancies found
- Failover drill outcomes
- Anomalies in backup or replication streams
Alerts:
- Backup job fails
- Backup age exceeds RPO threshold
- Replication lag exceeds 5 minutes (warning) / RPO (critical)
- DR site unreachable
- Restore test fails
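The backup-age and replication-lag alerts above reduce to two small threshold checks. A sketch whose thresholds mirror the list (function names are illustrative):

```python
from datetime import timedelta

def backup_age_alert(backup_age: timedelta, rpo: timedelta) -> str:
    """Critical once the newest successful backup is older than the RPO."""
    return "critical" if backup_age > rpo else "ok"

def replication_lag_alert(lag: timedelta, rpo: timedelta) -> str:
    """Warning at 5 minutes of lag, critical once lag exceeds the RPO."""
    if lag > rpo:
        return "critical"
    if lag > timedelta(minutes=5):
        return "warning"
    return "ok"

print(backup_age_alert(timedelta(hours=2), rpo=timedelta(hours=1)))      # critical
print(replication_lag_alert(timedelta(minutes=10), timedelta(hours=1)))  # warning
```

Tying both alerts to the RPO, rather than fixed numbers, keeps monitoring honest when the business changes its targets.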
Security Checklist
- Backups encrypted at rest with separate key management
- Backup access restricted to authorized personnel with MFA
- Offsite backups protected against unauthorized access
- Backup retention complies with data retention policies
- DR site access authenticated and authorized separately from primary
- Restore procedures require dual authorization
- Audit log all backup and restore operations
- Test disaster recovery without exposing production data
Quick Recap
- RTO is maximum acceptable downtime; RPO is maximum acceptable data loss
- RTO drives architecture (manual vs automatic failover); RPO drives backup frequency
- Active-passive is simpler but wasteful; active-active is complex but resilient
- Automate recovery to remove human error during crisis
- Test recovery quarterly; tabletop exercises catch planning gaps
Copy/Paste Checklist:
DR Plan Document:
[ ] Scope: which systems and data
[ ] RTO and RPO per system
[ ] Backup strategy (full/incremental/continuous)
[ ] Backup schedule and retention
[ ] Backup storage locations (geographically separate)
[ ] Failover procedure (step by step)
[ ] Recovery procedure (step by step)
[ ] DNS and routing updates
[ ] Contact list with multiple options
[ ] Notification procedures (who, when, how)
[ ] Testing schedule
[ ] Last tested date and results
Conclusion
For more on related topics, see Chaos Engineering and Geo-Distribution.