Failover Automation

Automatic failover patterns. Health checks, failure detection, split-brain prevention, and DNS TTL management during database failover.

Author: GeekWorkBench | Reading time: 28 min

A primary node fails at 3am. Do you get paged? Does the system recover automatically? Or does someone have to wake up, diagnose, and manually trigger a failover?

Failover automation determines whether your high availability setup is a real safety net or just a checklist item that gives false confidence.

Introduction

Failover automation turns a database outage from a 3am emergency into an automatic recovery. Without it, a primary failure requires someone to diagnose the problem, confirm the replica is ready, promote the replica, and update DNS or connection strings — all while the system is down. With it, the system detects failure, promotes a replica, and reconnects traffic, typically in under a minute.

The catch is that automated failover introduces its own failure modes: split-brain where both primary and replica think they are primary, promoting a replica that is too far behind, or failover that happens unnecessarily during temporary network hiccups. This guide covers health check design, failure detection strategies, split-brain prevention, and DNS TTL management during failover transitions.

What Automatic Failover Means

The system detects a primary failure and promotes a replica to primary without human intervention. The goal: minimize downtime and data loss. The catch: it involves tradeoffs between speed, safety, and infrastructure complexity.

Failure detection and promotion sound straightforward. They rarely are.

Health Checks and Failure Detection

Before failover can happen, the system must confirm the primary is actually down. Too sensitive and you get spurious failovers that cause instability. Too lenient and you accept extended downtime.

Types of Health Checks

Database-level checks verify the process is accepting connections: pg_isready for PostgreSQL, TCP connection to the database port, SHOW GLOBAL STATUS for MySQL.

Query-level checks run an actual query to confirm the engine can process work. SELECT 1 proves the query engine is functional, not just the network layer.

Replication checks verify the replica is receiving changes from the primary. A replica that cannot keep up is a poor failover candidate—it would become primary with missing data.

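# PostgreSQL health check: fail over only after 3 consecutive failed probes
failures=0
while [ "$failures" -lt 3 ]; do
    # -t 3 gives each probe a 3-second timeout; exit early if the primary answers
    pg_isready -h "$PRIMARY_HOST" -p "$PRIMARY_PORT" -U "$CHECK_USER" -t 3 && exit 0
    failures=$((failures + 1))
    sleep 5
done
trigger_failover  # placeholder for your promotion hook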
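The replication check described above can also gate which replica is eligible for promotion. The following is a minimal sketch, assuming lag is measured in bytes behind the primary's last known position; the `Replica` type, `pick_candidate`, and the threshold are hypothetical names for illustration, not part of any tool mentioned here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    healthy: bool    # passed its own connection and query checks
    lag_bytes: int   # how far replay trails the primary's last known position


def pick_candidate(replicas: List[Replica], max_lag_bytes: int) -> Optional[Replica]:
    """Return the healthy replica with the least lag, or None if none qualifies."""
    eligible = [r for r in replicas if r.healthy and r.lag_bytes <= max_lag_bytes]
    return min(eligible, key=lambda r: r.lag_bytes, default=None)
```

Returning `None` when every replica is too far behind lets the automation stop and page a human instead of promoting a node with missing data.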

Failure Detection Timeout Tradeoffs

The timeout for declaring a primary dead requires balancing:

  • Short timeout: fast recovery, but risk of false positives from temporary network blips
  • Long timeout: safer from spurious failovers, but longer downtime

Most production systems use 10-30 seconds with multiple consecutive checks. Some implementations require failures to persist over a period. Others use quorum—multiple independent checks must agree before triggering failover.
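The "multiple consecutive checks" rule can be sketched as a small counter that resets on any successful probe. This is an illustrative sketch, not the logic of any specific tool; `FailureDetector` and its threshold are hypothetical.

```python
class FailureDetector:
    """Declares the primary dead only after N consecutive failed probes."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one probe result; return True once the threshold is reached."""
        if check_passed:
            # A single success resets the streak, so brief blips never accumulate.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

With a 5-second probe interval and a threshold of 3, a blip that clears within one or two probes never triggers failover, while a real outage is declared after roughly 15 seconds.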

Split-Brain Prevention

Split-brain occurs when two nodes both believe they are the primary. Both accept writes, data diverges, and reconciliation becomes painful or impossible.

Network partitions cause most split-brain scenarios. If the primary loses network connectivity but keeps running, it continues accepting writes locally. A replica promotes itself believing the primary is dead. When the network recovers, you have two databases with conflicting data.

Quorum-Based Prevention

The safest approach uses quorum: a node can only become primary if a majority agrees. A three-node cluster needs 2 votes. If one node loses connection to the others, it cannot promote itself—only 1 vote. The remaining two nodes can elect a new primary.

Raft consensus protocols work this way. etcd, Consul, and some database clustering solutions implement similar logic.
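The majority rule itself is simple arithmetic. A minimal sketch, with hypothetical function names, of the vote check and of how many node failures a cluster of a given size can tolerate:

```python
def has_quorum(votes: int, cluster_size: int) -> bool:
    """True when votes form a strict majority of the full cluster."""
    return votes > cluster_size // 2


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail while the remainder still forms a majority."""
    return (cluster_size - 1) // 2
```

Note that a 4-node cluster tolerates no more failures than a 3-node cluster, which is why odd cluster sizes are the usual choice.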

Fencing

Fencing ensures the old primary cannot continue writing when a new primary takes over. Two common approaches:

  • STONITH (Shoot The Other Node In The Head): power off or isolate the failed node before promotion
  • Resource fencing: revoke the old primary’s access to shared storage or network

Fencing adds complexity but provides stronger guarantees.
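The ordering matters more than the mechanism: promotion should run only after fencing is confirmed. The sketch below assumes two caller-supplied hooks, `fence_old_primary` and `promote_replica`, which are hypothetical placeholders for whatever power-control or cloud API and promotion command you use.

```python
from typing import Callable


def fenced_failover(fence_old_primary: Callable[[], bool],
                    promote_replica: Callable[[], None]) -> bool:
    """Promote only after fencing is confirmed. Returns True if promotion ran."""
    try:
        fenced = fence_old_primary()
    except Exception:
        # An unreachable fencing device counts as "not fenced", never as success.
        fenced = False
    if not fenced:
        return False
    promote_replica()
    return True
```

Treating a fencing error as a refusal to promote trades availability for safety: the cluster stays down until an operator intervenes, but it cannot end up with two writers.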

Prometheus, HAProxy, and Patroni Patterns

Common tooling patterns for managing failover:

Prometheus Alertmanager + Blackbox Exporter

Monitor the primary with Prometheus. When checks fail, Alertmanager fires an alert that triggers automated failover via webhook.

groups:
  - name: database_failover
    rules:
      - alert: DatabasePrimaryDown
        expr: probe_success{job="db-primary-check"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database primary is down"
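On the receiving side, the webhook handler has to decide whether a notification actually calls for failover. The sketch below assumes the standard Alertmanager webhook JSON shape (a top-level `alerts` list whose entries carry `status` and `labels`); the function name is hypothetical, and the HTTP server and the failover action itself are left out.

```python
import json


def should_trigger_failover(payload: str, alertname: str = "DatabasePrimaryDown") -> bool:
    """Return True if the webhook body contains a firing alert with the given name."""
    body = json.loads(payload)
    return any(
        alert.get("status") == "firing"
        and alert.get("labels", {}).get("alertname") == alertname
        for alert in body.get("alerts", [])
    )
```

Checking `status` matters because Alertmanager can also send notifications when an alert resolves, and those must not trigger a second failover.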

HAProxy for Connection Routing

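HAProxy detects primary failure and routes writes to the backup server once the primary fails its health checks. The backup must already have been promoted by your failover tooling, since HAProxy does not promote databases itself. Your application keeps connecting to the same HAProxy endpoint; HAProxy handles the backend switch.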

backend db_write
    mode tcp
    # PostgreSQL protocol-level check; haproxy_check is a placeholder role name
    option pgsql-check user haproxy_check
    server primary db-primary:5432 check inter 5s fall 3 rise 2
    server replica1 db-replica1:5432 check inter 5s fall 3 rise 2 backup

Patroni for PostgreSQL

Patroni automates PostgreSQL failover using distributed consensus (etcd, ZooKeeper, or Consul) for leader election. It manages leader election, replication, automated failover with safety checks, and post-failover recovery.

Vitess and CockroachDB

Vitess and CockroachDB have automatic failover built in as core features. If you want failover that works without custom configuration, distributed databases like these handle it internally.

Mean Time to Recovery (MTTR)

MTTR measures average recovery time. For database failover, it includes:

  1. Time to detect failure
  2. Time to decide on failover
  3. Time to execute promotion
  4. Time for replicas to catch up to new primary

Target MTTR varies by industry, but 30-60 seconds is achievable with proper automation. Manual failover typically takes 5-15 minutes when accounting for human response time, diagnosis, and execution.

The MTTR formula breaks down component by component. Detection time is your health check interval times the number of consecutive failures required to trigger failover. If you check every 5 seconds and require 3 consecutive failures, minimum detection time is 15 seconds. Add network jitter and you might see 20-30 seconds before detection fires.

Decision time is the time between detection and promotion trigger. For quorum-based systems (Patroni with etcd, for example), this includes the time for the old leader's lock to expire and for a replica to win the election. In Patroni the leader key ttl defaults to 30 seconds; lowering it, together with loop_wait and retry_timeout, shortens this window at the cost of higher false-positive risk.

Promotion time depends on your database. PostgreSQL promotion typically takes 2-5 seconds (rewriting the control file, promoting the replica). MySQL is similar. CockroachDB and other distributed databases promote automatically as part of their Raft consensus — usually under 1 second.

Replication catchup time: after promotion, the new primary must have all committed data. With synchronous replication, catchup time is zero — the replica was already current. With asynchronous replication, the new primary may be missing the last few seconds of writes. The practical catchup time for async replication is roughly equal to your replication lag at the time of failover. If lag was 2 seconds, you may have lost 2 seconds of data and catchup (for the new replica to catch up to the new primary) takes another few seconds under load.
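The four components add up directly. A minimal sketch of the arithmetic, using illustrative values rather than any tool's defaults; the function and parameter names are hypothetical.

```python
def estimate_mttr(check_interval_s: float, failures_required: int,
                  decision_s: float, promotion_s: float,
                  catchup_s: float = 0.0) -> float:
    """Sum the failover time components, all in seconds."""
    detection_s = check_interval_s * failures_required
    return detection_s + decision_s + promotion_s + catchup_s


# 5 s checks x 3 failures, 10 s decision, 3 s promotion, 2 s async catchup
print(estimate_mttr(5, 3, 10, 3, 2))  # 30
```

This is a lower bound: network jitter and client reconnection time come on top of it.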

Real-World Case Study: Patroni at Spotify

Spotify runs one of the largest PostgreSQL clusters managed by Patroni in production. Patroni handles automatic failover for their hundreds of PostgreSQL clusters using etcd as the consensus backend.

The challenge at Spotify’s scale was not the failover itself — Patroni’s failover is reliable — but managing the post-failover state across their service mesh. When a PostgreSQL primary fails and a replica promotes, hundreds of application servers that had connections to the old primary need to reconnect to the new primary. Without connection pooling (Spotify uses PgBouncer in front of PostgreSQL), this would cause a connection storm as all applications simultaneously try to reconnect.

Spotify’s solution was aggressive connection pooling at every tier. PgBouncer sits in front of every PostgreSQL cluster. When failover happens, PgBouncer detects connection failures and reconnects to the new primary with exponential backoff. The connection pooling layer absorbs the reconnection burst rather than letting it hit the database directly.

The operational lesson: failover automation handles the database layer, but your connection pooling and application reconnection logic must be equally resilient. Patroni’s failover works reliably, but if your application crashes on connection errors instead of retrying gracefully, you still get an outage.

Real-World Case Study: AWS RDS Failover

AWS RDS with Multi-AZ failover is automatic but not instantaneous. When an RDS primary fails, the failover process typically takes 60-120 seconds. This includes detection (Amazon’s internal monitoring), promotion of the standby, and DNS update propagation.

The catch: during those 60-120 seconds, your application receives connection errors. If your connection pooling layer cannot handle connection failures gracefully (by retrying with backoff and eventually connecting to the new endpoint), your application starts failing requests even though RDS has already recovered.

The lesson: managed failover does not eliminate application-level connection resilience. RDS handles the database failover; your application must handle the connection interruption. Use the RDS endpoint (which automatically resolves to the new primary after failover) rather than hardcoding the primary’s IP. Configure your database client to retry on connection failure with exponential backoff. Test failover explicitly — RDS allows you to trigger a manual failover to verify your application handles it correctly.
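Retrying with exponential backoff can be sketched as follows. This is a generic illustration, not tied to any driver: `connect` stands for whatever function opens your database connection, and real drivers raise their own exception types (catch those instead of `ConnectionError`). Random jitter is added so that many clients that failed at the same moment do not retry in lockstep.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_backoff(connect: Callable[[], T], max_attempts: int = 8,
                         base_delay_s: float = 0.5, max_delay_s: float = 30.0,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds, sleeping a random, growing delay between tries."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))  # "full jitter": anywhere in [0, cap]
    raise ValueError("max_attempts must be at least 1")
```

With the defaults, the delay cap grows from 0.5 s up to 30 s, which comfortably spans a 60-120 second failover window before giving up.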

Quick Recap Checklist

  • Health checks use multiple consecutive failures to avoid false positives
  • Quorum-based promotion prevents split-brain (minimum 3 nodes)
  • Fencing (STONITH) isolates old primary before new primary promotes
  • DNS TTL set to 60 seconds or lower before failover
  • Connection pooling absorbs reconnection bursts during failover
  • MTTR target: 30-60 seconds with proper automation
  • Test failover explicitly in staging before production
  • 2-node clusters cannot do safe automatic failover — use manual or add witness
  • Synchronous replication eliminates data loss gap on failover
  • Alert at lag exceeding your RPO threshold

DNS TTL Management During Failover

When the primary changes, applications must connect to the new primary. DNS-based discovery handles this: point your database hostname to the current primary’s IP, update DNS on failover.

The problem: DNS caching. If your TTL is 1 hour, clients might keep connecting to the failed IP for an hour after failover.

Low TTL Before Failover

Set a low TTL (60 seconds or less) on your database DNS records before you need failover. This lets you update quickly when the time comes.

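# Repoint the database hostname from the old primary's IP to the new primary's IP.
# OLD_PRIMARY_IP and NEW_PRIMARY_IP are placeholders; the removed record must match the existing one.
gcloud dns record-sets transaction start --zone=db-zone
gcloud dns record-sets transaction remove "$OLD_PRIMARY_IP" \
    --name=db-primary.example.com. --type=A --ttl=60 --zone=db-zone
gcloud dns record-sets transaction add "$NEW_PRIMARY_IP" \
    --name=db-primary.example.com. --type=A --ttl=60 --zone=db-zone
gcloud dns record-sets transaction execute --zone=db-zone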

Application-Level DNS Caching

Applications may cache DNS lookups at the OS, runtime, or library level regardless of your TTL. Ensure your database client re-resolves hostnames on connection failure or periodically.

Some clients have dns_refresh_rate settings. Others need a connection pooler or proxy like PgBouncer or ProxySQL to handle reconnection gracefully.
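Re-resolving on every connection attempt can be sketched with the standard library alone. In this illustration `connect_to_ip` is a hypothetical stand-in for your driver's connect call, and the resolver is injectable so the logic can be tested without a network; note that the OS resolver behind `socket.getaddrinfo` may still apply its own caching.

```python
import socket
from typing import Callable, List


def resolve_ipv4(hostname: str, port: int) -> List[str]:
    """Ask the OS resolver for the hostname's current IPv4 addresses."""
    infos = socket.getaddrinfo(hostname, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def connect_fresh(hostname: str, port: int,
                  connect_to_ip: Callable[[str, int], object],
                  resolve: Callable[[str, int], List[str]] = resolve_ipv4) -> object:
    """Resolve the hostname on every attempt instead of reusing a remembered IP."""
    last_error = None
    for ip in resolve(hostname, port):
        try:
            return connect_to_ip(ip, port)
        except OSError as exc:
            last_error = exc
    raise ConnectionError(f"no reachable address for {hostname}") from last_error
```

Calling `connect_fresh` from a retry loop means each retry picks up the updated record as soon as the resolver sees it, rather than hammering the failed primary's old address.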

When to Use / When Not to Use Automatic Failover

Use automatic failover when:

  • Your SLA requires fast recovery (under 60 seconds)
  • You run multi-AZ deployments where node failure is a realistic scenario
  • You have 3+ database nodes so quorum-based promotion is safe
  • Your application handles brief write unavailability gracefully

Do not use automatic failover when:

  • You have only 2 nodes — quorum-based failover cannot work safely with 2
  • Your database runs on shared storage (SAN) where primary and replica share failure domain
  • Your team can respond faster manually than the automation can detect and act
  • You lack monitoring that distinguishes node failure from slow node

Automatic vs Semi-Automatic vs Manual Failover Trade-offs

| Dimension | Automatic Failover | Semi-Automatic (Operator-Triggered) | Manual Failover |
|---|---|---|---|
| MTTR | 30-60 seconds | 2-5 minutes | 5-15 minutes |
| False positive risk | Higher: can trigger on transient issues | Lower: human verifies first | None |
| Split-brain risk | Moderate without quorum, low with quorum | Very low | None |
| Operational complexity | High | Medium | Low |
| Best suited for | 24/7 services with SLA | Business hours ops, critical data | Non-critical dev/staging |
| Requires | Quorum, fencing, watchdog | Alerting + runbook | On-call engineer |
| Data loss risk | Low (with sync rep) | Very low | Depends on async lag |

What Automation Cannot Solve

Automation handles node failure, not root cause failure. If a bad deployment corrupts data on the primary, promoting a replica spreads the corruption. You still need backups, point-in-time recovery, and tests that verify your recovery procedures actually work.

Automation also does not eliminate runbooks. When something unexpected happens, operators need documented procedures, not just alerts.

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Spurious failover from network blip | Primary demoted unnecessarily, brief write outage | Require multiple consecutive failures, use longer detection windows |
| Split-brain from network partition | Two primaries accept writes, data diverges | Use quorum-based promotion, implement fencing (STONITH) |
| Fencing failure leaves old primary writing | Data corruption, split-brain continues | Test fencing regularly, monitor fencing state |
| Promotion of lagging replica | Data loss: missing writes not yet replicated | Check replication position before promotion, use semi-sync replication |
| DNS TTL too high | Clients keep connecting to failed primary for minutes | Set TTL to 60s or lower before failover, use connection pooling |
| Cascade failure during failover | Multiple nodes fail during transition | Keep cluster stable during failover, avoid concurrent config changes |
| Automation runs during planned maintenance | Primary demoted while operator working | Pause automation during maintenance windows, use manual mode |

Interview Questions

1. You have a 2-node PostgreSQL cluster with async replication. Someone suggests adding automatic failover. What is the fundamental problem?

With 2 nodes, quorum-based failover cannot work safely. When the replica loses contact with the primary, it cannot tell a dead primary from a network partition, and there is no third vote to break the tie. If it promotes during a partition while the primary is still running, both nodes accept writes and you have split-brain. The only safe configurations with 2 nodes are manual failover, or failover gated by fencing or shared storage, where one node can definitively prove it is the only one able to serve writes. The recommendation is to add a third node or a witness/arbitrator node to enable quorum-based failover.

2. Your automatic failover triggers during a brief network partition. The old primary is still running but cannot communicate with the cluster. What happens?

The old primary may still have active connections and continue accepting writes locally. When the network recovers and the old primary rejoins the cluster, it has writes that were never replicated to the new primary. You now have divergent data. The cluster must reconcile: either the old primary is rewound and rejoined as a replica (for example with pg_rewind), which discards its divergent writes, or those writes are extracted and merged by hand. This is why fencing (STONITH) is critical: before the new primary promotes, the old primary must be isolated from accepting writes. Without fencing, the old primary can continue writing during a partition and cause split-brain.

3. How do you tune a Patroni cluster for sub-30-second MTTR?

Patroni nodes do not heartbeat each other directly. The leader renews a lock (the leader key) in the DCS on every loop, and replicas take over when that key expires. Three settings matter: ttl, the lifetime of the leader key (default 30 seconds); loop_wait, the interval of the Patroni main loop (default 10 seconds); and retry_timeout, the timeout for DCS and PostgreSQL operations (default 10 seconds). With defaults, a dead primary is detected when its key expires, so detection takes up to about 30 seconds. Lowering ttl, with loop_wait and retry_timeout reduced to match, brings detection under 10 seconds, at the cost of more DCS write load and more sensitivity to brief network or DCS stalls. Synchronous replication to at least one replica does not make failover faster, but it guarantees the promoted replica is current, so a fast failover does not translate into lost writes.

4. How does quorum-based failover prevent split-brain, and what is the minimum number of nodes required for safe automatic failover?

Quorum requires that a majority of nodes agree before a new primary is promoted. With 3 nodes, you need 2 votes to promote. If a network partition isolates one node, that node has only 1 vote and cannot promote itself or, if it was the primary, keep the leader role. The remaining 2-node group has 2 votes and can elect a new primary. This prevents split-brain because a single node cannot form a majority on its own. Minimum safe configuration: 3 nodes, and odd cluster sizes (3, 5, 7) in general. With 2 nodes, a majority means both nodes, so neither side of a partition can safely promote; if promotion is allowed anyway, any partition can leave both nodes believing they are primary. Some setups add a witness/arbitrator node, a voting member that stores no database data, to break ties for 2-node clusters, but a 3-node cluster is the recommended minimum for production automatic failover.

5. What is STONITH fencing and why is it critical for automatic failover safety?

STONITH (Shoot The Other Node In The Head) is a fencing technique that physically isolates a failed node before promoting a new primary. When a replica decides to promote, STONITH first power-cycles or network-isolates the old primary to ensure it cannot continue writing. Without STONITH, the old primary may still be running and accepting writes during a network partition — when the partition heals, you have two primaries with divergent data (split-brain). STONITH ensures the old primary is definitively dead before the new primary accepts writes. Implementation options: IPMI power management, hardware out-of-band interfaces, or cloud provider APIs (AWS instance stop). STONITH adds operational complexity but is required for strong failover safety guarantees.

6. During a failover, DNS TTL misconfiguration causes clients to keep connecting to the failed primary. What are the consequences and how do you prevent it?

Clients cached the old IP with a 1-hour TTL. After failover, the DNS record points to the new primary's IP, but clients ignore the update for up to an hour. During this window, writes to the failed primary's IP fail with connection errors. If the old primary is still running (not properly shut down), it may accept writes that are never replicated to the new primary — data divergence. Prevention: set database DNS TTL to 60 seconds or less before deploying failover automation. Use a database-specific DNS endpoint (like RDS endpoint or a dedicated CNAME) that resolves to the current primary rather than hardcoding IPs. Some setups use a virtual IP (VIP) that floats to the new primary — the application connects to the VIP and the network handles the switch without DNS changes.

7. Explain the tuning parameters in Patroni that affect failover detection time: loop_wait, ttl, and retry_timeout.

loop_wait is the main Patroni loop interval, default 10 seconds. On each iteration the leader renews its leader key in the DCS and every node checks cluster state. ttl is the lifetime of that leader key, default 30 seconds; Patroni has no separate heartbeat setting, and renewing the key acts as the heartbeat. When the leader stops renewing, replicas see the key expire and start an election, so worst-case detection time is roughly ttl plus network jitter, about 30 seconds with defaults. retry_timeout, default 10 seconds, is the timeout for DCS and PostgreSQL operation retries; if the DCS stays unreachable, a leader that cannot renew its key demotes itself. To achieve sub-30-second MTTR: lower ttl, reduce loop_wait and retry_timeout to match (this increases DCS load), and keep etcd/Consul fast and close to the database nodes. Synchronous replication to at least one replica complements aggressive timeouts by ensuring a fast failover does not lose data.
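When lowering these values, Patroni's documentation recommends keeping loop_wait + 2 * retry_timeout no greater than ttl, so the leader has time to retry renewing its key before the key expires. A minimal sketch of that sanity check follows; the function name is hypothetical and the defaults of 10, 10, and 30 seconds are stated here as an assumption to verify against your Patroni version.

```python
def patroni_timing_is_consistent(loop_wait: int, retry_timeout: int, ttl: int) -> bool:
    """Check the recommended relation loop_wait + 2 * retry_timeout <= ttl."""
    return loop_wait + 2 * retry_timeout <= ttl


print(patroni_timing_is_consistent(10, 10, 30))  # True: assumed defaults sit on the boundary
print(patroni_timing_is_consistent(10, 10, 20))  # False: ttl lowered without the other two
```

The second call shows the common mistake: cutting ttl alone leaves too little time for retries, which makes spurious demotions more likely.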

8. Your automatic failover triggers spuriously during a brief network partition. The old primary never actually failed. What happened and how do you prevent it?

The monitoring system triggered failover based on a single failed health check or too-short timeout. A brief network blip caused health checks to fail for a few seconds, the failover automation interpreted this as primary death, and promoted a replica. Meanwhile the old primary continued running normally and was still accepting writes locally. When the network recovered and both primaries existed, you had a split-brain scenario. Prevention: require multiple consecutive failures before triggering failover (e.g., 3 consecutive failures over 30 seconds). Use health check intervals of 5-10 seconds with enough consecutive failures to survive brief blips. Some systems use quorum (multiple independent monitors must agree) rather than single-monitor decision. Also implement a pre-promotion verification step that checks the candidate replica's data freshness before allowing promotion.

9. How does AWS RDS Multi-AZ failover work internally, and what are the application-level implications during the 60-120 second failover window?

RDS Multi-AZ uses synchronous replication to a standby in a different AZ. When the primary fails, Amazon's internal monitoring detects it, promotes the standby, and updates the DNS endpoint to point to the new primary. The application connects to the same endpoint (e.g., mydb.abc123.us-east-1.rds.amazonaws.com) — DNS propagates the new IP. During the 60-120 second failover window: connection attempts fail with connection errors; in-flight transactions fail; your application must handle reconnection with exponential backoff. The RDS endpoint masks the IP change but does not eliminate application disruption. Key application requirements: retry logic with backoff on connection failure, connection pool that can recover from failed connections, and no hardcoded IP addresses. Test failover explicitly — RDS allows you to trigger a test failover to verify your application's handling.

10. What is the difference between pgpool-II's raw mode and replication mode in the context of automatic failover?

In raw mode (native PostgreSQL), pgpool-II does not handle replication — it just routes queries to a single PostgreSQL backend. Failover in raw mode means detecting backend failure and switching to a standby — pgpool can do this but you must configure watchdog for HA. In replication mode, pgpool maintains a real replication connection to multiple PostgreSQL servers and replicates writes across them. Failover in replication mode is more complex — pgpool can detect failed nodes but must also manage the replication state. For most automatic failover setups with PostgreSQL, Patroni + PgBouncer is preferred over pgpool-II because Patroni handles failover coordination with etcd/consul while pgpool focuses on connection pooling. pgpool-II works well for read load balancing but Patroni is the standard for HA.

11. You have a 2-node PostgreSQL cluster. Your team is considering enabling automatic failover using a witness node. How does a witness node enable safe failover with only 2 data nodes?

A witness node (or arbitrator) participates in quorum but does not store data. With 2 data nodes + 1 witness: if the primary fails, the replica + witness have 2 votes (majority of 3), allowing promotion. If the witness fails, the 2 data nodes can still communicate and continue operating, and no spurious promotion occurs. This gives you safe automatic failover with 2 data nodes without requiring a 3rd full data node. The tradeoff: the cluster tolerates only one failure at a time. If the witness and one data node fail simultaneously, you lose quorum and automatic promotion stops. Also verify that your failover tool (Patroni, etc.) supports witness/arbitrator nodes, that the witness's network is reliable, and that the witness sits in a different failure domain from both data nodes so a single outage cannot take out a data node and the tiebreaker together.

12. Describe the tradeoffs between automatic failover, semi-automatic (operator-triggered), and manual failover for a business-critical database.

Automatic failover: MTTR 30-60 seconds, highest availability, but risk of false positives (spurious failover) and split-brain without proper quorum/fencing. Semi-automatic (operator-triggered): monitoring alerts an on-call engineer who manually approves and triggers failover — MTTR 2-5 minutes, lower false-positive risk because a human verifies before acting. Manual failover: engineer diagnoses, then runs procedures manually — MTTR 5-15 minutes, no spurious failover risk, but human response time adds delay. For business-critical databases with 24/7 availability requirements, automatic failover is appropriate if quorum and fencing are properly implemented. For less critical systems or those with skilled ops teams that can respond in under 5 minutes, semi-automatic may provide better safety-to-speed balance. Manual is appropriate for development and staging.

13. During a failover, you observe a connection storm as hundreds of application instances simultaneously try to reconnect. What causes this and how do you prevent it?

When the primary fails, all application instances that had connections to it receive connection errors nearly simultaneously. Each instance's connection pool or ORM sees the failure and independently starts retrying with exponential backoff — but without coordination, all instances retry at the same times (they failed at the same time). This creates a connection storm that can overwhelm the new primary or the connection pooler. Prevention: use PgBouncer or ProxySQL in front of the database — the pooler maintains its own connection pool to the database and handles reconnection internally, absorbing the burst rather than letting it hit the database directly. Configure pooler reconnection with jitter and exponential backoff. At the application level, implement circuit breakers that temporarily stop sending requests to the database during failover, allowing the pooler time to stabilize.

14. What are the minimum etcd/Consul requirements for a production Patroni cluster, and how do you avoid the consensus backend itself becoming a bottleneck?

Patroni requires a distributed consensus store (etcd, Consul, ZooKeeper). For production: use 3 or 5 etcd nodes for HA — a 3-node etcd cluster tolerates 1 failure, 5-node tolerates 2. Place etcd nodes on separate hardware and network segments from the PostgreSQL primaries to avoid correlated failures. Etcd performance matters: use SSDs for etcd data directory, network locality between etcd nodes and Patroni nodes, and monitor etcd write latency (should be < 10ms). If etcd becomes slow, Patroni loops hang and failover detection slows down. For read-heavy workloads, etcd can handle thousands of reads per second, but ensure your etcd cluster is not overutilized — dedicated etcd clusters for critical Patroni deployments are recommended over sharing with other workloads.

15. How do you calculate MTTR for an automatic failover system, and what are the main time components?

MTTR = detection_time + decision_time + promotion_time + catchup_time. Detection_time = health_check_interval * consecutive_failures_required. With 5s checks and 3 failures required, minimum detection is 15s plus network jitter. Decision_time = the time for the old leader's lock to expire and a new leader to be elected (for quorum-based systems like Patroni with etcd). This is set by the failover tool's lease settings, such as Patroni's ttl, rather than by etcd's internal Raft election timeout, which defaults to about 1 second. Promotion_time = time to rewrite PostgreSQL control file and update replication position. PostgreSQL promotion typically takes 2-5s. Catchup_time = if using synchronous replication, catchup_time is 0. If async, the new primary may need to replay remaining WAL, which depends on lag at failover time. Target MTTR with proper automation: 30-60 seconds. Manual failover typically takes 5-15 minutes. Formula-based: MTTR = (health_check_interval * failures) + leader_election_time + promotion_latency + async_lag_at_failover.

16. A failover has just completed and the new primary is promoted. What post-failover steps do you take to restore full redundancy?

Immediately after failover: verify the new primary is accepting writes and the old primary is isolated (offline or demoted). Then: (1) Configure the old primary to replicate from the new primary as a replica — this is usually automatic in Patroni. (2) Verify replication is catching up — check pg_stat_replication for the new replica. (3) If using a load balancer or connection pooler, verify traffic is routing to the new primary. (4) Monitor the new replica's lag — it may need to replay a backlog of WAL. (5) Once replication is current and stable, you are back to full redundancy (1 primary + N replicas). If the old primary had hardware failure, repair or replace it and add it back to the cluster as a new replica. Document the failover event and review whether the automation worked correctly or needs tuning.

17. You suspect your automatic failover triggered incorrectly (false positive). How do you investigate and confirm, and what metrics would have warned you?

Investigate: check the failover automation logs (Patroni, Consul, etc.) for the decision rationale. Look at the health check results around the failover time — were there multiple consecutive failures or just one? Check system metrics on the old primary: was it actually healthy or was it experiencing load/CPU/network issues that made it appear down? If the old primary was healthy and reachable, the health check was too sensitive. Metrics that warn of false positives: health check success rate over time (if success rate drops below 100%, you are near the threshold), replication lag (a lagging replica is a poor failover candidate), and CPU/memory/disk on the candidate replica (a resource-exhausted replica should not be promoted). Prevention: require more consecutive failures, use longer health check intervals, and implement pre-promotion checks that verify the candidate's resource health and replication position.

18. How does HAProxy detect primary failure and route traffic to the new primary during a failover?

HAProxy uses health checks (TCP checks or custom scripts) to monitor each backend. If the primary fails checks, HAProxy marks it dead and stops routing traffic to it. If you configure a backup server (the replica becomes the backup), HAProxy promotes the backup to active once the primary is marked dead. For PostgreSQL, the typical health check is pg_isready via TCP. The failover is not automatic on the database side — HAProxy only detects and routes away from failed backends. You still need Patroni or similar to promote the replica at the database level. HAProxy then detects the new primary's IP (which Patroni or your failover script updates) and resumes routing. For smoother transitions, use a virtual IP that floats to the new primary — HAProxy just checks the VIP and the network handles the IP movement.

19. How do you design a failover strategy that accounts for both hardware failures and software bugs that cause crashes but leave the database in an inconsistent state?

Hardware failures and software bugs require different recovery approaches. Hardware failure (power supply, disk, memory) typically leaves the database in a clean state where the last committed transaction is well-defined. Software bugs (corrupt indexes, half-applied transactions, crashed writer process) can leave the database in an inconsistent state where promotion is dangerous because the replica may carry forward corrupted data. Design for both: implement checksum verification on replication data to detect corruption before promotion. Use post-promotion verification that runs consistency checks (PostgreSQL's amcheck or pg_amcheck, MySQL's CHECK TABLE) before accepting writes on the new primary; note that pg_checksums only runs against a cleanly shut down cluster, so it cannot verify a live, freshly promoted node. For software bugs causing inconsistency, prefer a full data comparison between candidate replica and primary before promotion rather than blindly promoting the first replica that responds. Consider separating failure classes: hardware failure triggers immediate promotion, software inconsistency triggers alert-and-review before automatic promotion. Maintain backups that are independent of the replication chain so you can restore to a known-good state regardless of which replica you promote.

20. How does read replica promotion affect long-running transactions, and what strategies minimize transaction failures during automatic failover?

Long-running transactions that started on the primary before failover fail when the replica is promoted: the session's connection is broken, and the uncommitted work never existed on the new primary. The new primary also starts a new timeline, so transactions that were in flight on the old primary are lost, and with asynchronous replication even recently committed transactions may be missing if they had not yet been replicated. Strategies: keep transactions short, since long-running transactions are more likely to span a failover event. Use application-level retry logic that catches connection errors and re-executes the failed transaction; this is only safe when the operations are idempotent. Avoid multi-statement transactions that hold locks across the failover window. For critical transactions, consider synchronous replication so the replica is current at promotion time. Monitor the in-flight transaction count as part of your failover metrics: a spike in long-running transactions just before failover increases the number of transaction failures during promotion.
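The application-level retry strategy might look like the following sketch. It assumes the transaction is wrapped in a callable that is safe to run more than once, and it catches Python's built-in ConnectionError; a real database driver usually raises its own exception type, which you would catch instead. The name run_with_retry and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    txn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run an idempotent transaction, retrying on connection errors.

    Waits base_delay, then 2x, 4x, ... between attempts so the new
    primary has time to come up before the next try.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The total retry window should be sized to cover your expected failover duration. The sleep parameter is injectable so the backoff can be tested without real waiting.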

Further Reading

Tools and Patterns:

  • Patroni: Automates PostgreSQL failover using etcd/ZooKeeper/Consul for leader election
  • HAProxy: Connection routing and health checking for database clusters
  • Prometheus + Alertmanager: Health check monitoring and automated alert-based failover triggering
  • Vitess: MySQL sharding middleware with built-in failover
  • CockroachDB: Distributed SQL with automatic failover as a core feature

Monitoring Queries:

-- PostgreSQL: check replication status (run on the primary)
SELECT application_name, client_addr, state, sync_state
FROM pg_stat_replication;

-- Check node role: returns true on a standby, false on a primary
SELECT pg_is_in_recovery();

Conclusion

Automatic failover separates resilient systems from expensive PagerDuty schedules. The core components are reliable failure detection, quorum or fencing to prevent split-brain, and DNS or connection routing that updates when the primary changes.

Test your failover process before you need it. A failover that works in theory but fails under pressure is a liability, not a safety net.

