Failover Automation
Automatic failover patterns. Health checks, failure detection, split-brain prevention, and DNS TTL management during database failover.
A primary node fails at 3am. Do you get paged? Does the system recover automatically? Or does someone have to wake up, diagnose, and manually trigger a failover?
Failover automation determines whether your high availability setup is a real safety net or just a checklist item that gives false confidence.
Introduction
Failover automation turns a database outage from a 3am emergency into an automatic recovery. Without it, a primary failure requires someone to diagnose the problem, confirm the replica is ready, promote the replica, and update DNS or connection strings — all while the system is down. With it, the system detects failure, promotes a replica, and reconnects traffic, typically in under a minute.
The catch is that automated failover introduces its own failure modes: split-brain where both primary and replica think they are primary, promoting a replica that is too far behind, or failover that happens unnecessarily during temporary network hiccups. This guide covers health check design, failure detection strategies, split-brain prevention, and DNS TTL management during failover transitions.
What Automatic Failover Means
The system detects a primary failure and promotes a replica to primary without human intervention. The goal: minimize downtime and data loss. The catch: it involves tradeoffs between speed, safety, and infrastructure complexity.
Failure detection and promotion sound straightforward. They rarely are.
Health Checks and Failure Detection
Before failover can happen, the system must confirm the primary is actually down. Too sensitive and you get spurious failovers that cause instability. Too lenient and you accept extended downtime.
Types of Health Checks
Database-level checks verify the process is accepting connections: pg_isready for PostgreSQL, TCP connection to the database port, SHOW GLOBAL STATUS for MySQL.
Query-level checks run an actual query to confirm the engine can process work. SELECT 1 proves the query engine is functional, not just the network layer.
Replication checks verify the replica is receiving changes from the primary. A replica that cannot keep up is a poor failover candidate—it would become primary with missing data.
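# PostgreSQL health check (single probe; production automation should
# require several consecutive failures before acting)
# trigger_failover is a placeholder for your own failover hook
if ! pg_isready -h "$PRIMARY_HOST" -p "$PRIMARY_PORT" -U "$CHECK_USER"; then
    trigger_failover
fi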
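The replication check can also feed candidate selection. The following is a minimal Python sketch, assuming lag has already been measured per replica; the `Replica` type, the `pick_candidate` helper, and the byte-based lag metric are illustrative names, not part of any specific tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    name: str
    healthy: bool   # passed connection- and query-level checks
    lag_bytes: int  # bytes of change stream not yet replayed (hypothetical metric)


def pick_candidate(replicas: Sequence[Replica], max_lag_bytes: int) -> Optional[Replica]:
    """Return the healthy replica with the least lag, or None if none qualifies."""
    eligible = [r for r in replicas if r.healthy and r.lag_bytes <= max_lag_bytes]
    # default=None covers the case where every replica is unhealthy or too far behind
    return min(eligible, key=lambda r: r.lag_bytes, default=None)
```

Returning `None` rather than the "least bad" replica is a deliberate choice in this sketch: when no replica is within the lag budget, a human decision about data loss is usually safer than an automatic one.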
Failure Detection Timeout Tradeoffs
The timeout for declaring a primary dead requires balancing:
- Short timeout: fast recovery, but risk of false positives from temporary network blips
- Long timeout: safer from spurious failovers, but longer downtime
Many production setups declare failure after 10-30 seconds of consecutive failed checks, for example three failures at a 5-second interval. Some implementations require failures to persist over a time window. Others use quorum: multiple independent checkers must agree before failover is triggered.
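The consecutive-failure rule can be sketched in a few lines of Python. This is an illustration under simple assumptions: `FailureDetector` and its `check` callable are hypothetical names, and the caller is responsible for invoking `observe()` once per check interval.

```python
from typing import Callable


class FailureDetector:
    """Declare the primary down only after N consecutive failed checks."""

    def __init__(self, check: Callable[[], bool], threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self._check = check
        self._threshold = threshold
        self._failures = 0

    def observe(self) -> bool:
        """Run one check; return True once the failure threshold is reached."""
        if self._check():
            self._failures = 0  # any success resets the streak
        else:
            self._failures += 1
        return self._failures >= self._threshold
```

With a 5-second interval and `threshold=3`, a single dropped check never triggers failover, and a real outage is declared after about 15 seconds.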
Split-Brain Prevention
Split-brain occurs when two nodes both believe they are the primary. Both accept writes, data diverges, and reconciliation becomes painful or impossible.
Network partitions cause most split-brain scenarios. If the primary loses network connectivity but keeps running, it continues accepting writes locally. A replica promotes itself believing the primary is dead. When the network recovers, you have two databases with conflicting data.
Quorum-Based Prevention
The safest approach uses quorum: a node can only become primary if a majority agrees. A three-node cluster needs 2 votes. If one node loses connection to the others, it cannot promote itself—only 1 vote. The remaining two nodes can elect a new primary.
The Raft consensus protocol works this way. etcd and Consul both use Raft, and several database clustering tools delegate leader election to them.
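The majority rule itself is small enough to state as code. A minimal sketch, with `has_quorum` as an illustrative helper name:

```python
def has_quorum(votes: int, cluster_size: int) -> bool:
    """True when `votes` is a strict majority of `cluster_size`."""
    if cluster_size < 1 or not 0 <= votes <= cluster_size:
        raise ValueError("invalid vote count or cluster size")
    return votes > cluster_size // 2
```

For a three-node cluster, `has_quorum(2, 3)` holds and `has_quorum(1, 3)` does not. For a two-node cluster, a node that can only see itself gets `has_quorum(1, 2) == False`, which is why two nodes alone cannot fail over automatically.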
Fencing
Fencing ensures the old primary cannot continue writing when a new primary takes over. Two common approaches:
- STONITH (Shoot The Other Node In The Head): power off or isolate the failed node before promotion
- Resource fencing: revoke the old primary’s access to shared storage or network
Fencing adds complexity but provides stronger guarantees.
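The ordering matters more than the mechanism: promotion should happen only after fencing is confirmed. A minimal Python sketch of that ordering, where `fence_old_primary` and `promote` are hypothetical callables standing in for your STONITH and promotion hooks:

```python
from typing import Callable


def fence_then_promote(fence_old_primary: Callable[[], bool],
                       promote: Callable[[], None]) -> bool:
    """Promote a replica only if fencing of the old primary is confirmed."""
    if not fence_old_primary():
        # Fencing unconfirmed: refuse to promote rather than risk two primaries.
        return False
    promote()
    return True
```

Failing closed trades availability for safety: if the fencing call cannot confirm success, the cluster stays without a writable primary until an operator intervenes.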
Prometheus, HAProxy, and Patroni Patterns
Common tooling patterns for managing failover:
Prometheus Alertmanager + Blackbox Exporter
Monitor the primary with Prometheus and the Blackbox Exporter. When the probe fails for long enough, Prometheus fires an alert, and Alertmanager routes it to a webhook receiver that triggers the failover automation.
groups:
  - name: database_failover
    rules:
      - alert: DatabasePrimaryDown
        expr: probe_success{job="db-primary-check"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database primary is down"
HAProxy for Connection Routing
HAProxy detects primary failure and routes writes to the new primary automatically. Your application keeps connecting to the same HAProxy endpoint; HAProxy handles the backend switch.
backend db_write
    mode tcp
    # Plain TCP connect check: proves the port is open,
    # not that the node is currently the primary.
    option tcp-check
    tcp-check connect
    server primary db-primary:5432 check inter 5s fall 3 rise 2
    server replica1 db-replica1:5432 check inter 5s fall 3 rise 2 backup
Patroni for PostgreSQL
Patroni automates PostgreSQL failover, using a distributed configuration store (etcd, ZooKeeper, or Consul) for leader election. It also manages replication, applies safety checks before promotion, and handles post-failover recovery.
Vitess and CockroachDB
Vitess and CockroachDB have automatic failover built in as core features. If you want failover without assembling it yourself from separate tools, distributed databases like these handle it internally.
Mean Time to Recovery (MTTR)
MTTR measures average recovery time. For database failover, it includes:
- Time to detect failure
- Time to decide on failover
- Time to execute promotion
- Time for replicas to catch up to new primary
Target MTTR varies by industry, but 30-60 seconds is achievable with proper automation. Manual failover typically takes 5-15 minutes when accounting for human response time, diagnosis, and execution.
The MTTR formula breaks down component by component. Detection time is your health check interval times the number of consecutive failures required to trigger failover. If you check every 5 seconds and require 3 consecutive failures, minimum detection time is 15 seconds. Add network jitter and you might see 20-30 seconds before detection fires.
Decision time is the time between detection and the promotion trigger. For quorum-based systems it is the time needed to agree on a new leader. With Patroni and etcd, detection and decision overlap: the leader key must expire (ttl, 30 seconds by default) before a replica can acquire it, and each node checks on every loop_wait (10 seconds by default). You can lower these values at the cost of higher false-positive risk.
Promotion time depends on your database. PostgreSQL promotion typically takes 2-5 seconds (the replica finishes replaying WAL and switches to a new timeline). MySQL is similar. CockroachDB and other Raft-based distributed databases have no separate promotion step: leadership for the affected data moves to surviving replicas through Raft, typically within seconds.
Replication catchup time: after promotion, the new primary must have all committed data. With synchronous replication, catchup time is zero — the replica was already current. With asynchronous replication, the new primary may be missing the last few seconds of writes. The practical catchup time for async replication is roughly equal to your replication lag at the time of failover. If lag was 2 seconds, you may have lost 2 seconds of data and catchup (for the new replica to catch up to the new primary) takes another few seconds under load.
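The four components add up directly. A minimal Python sketch of the arithmetic, where `estimate_mttr` and its parameter names are illustrative and the inputs are assumed values you would measure in your own environment:

```python
def estimate_mttr(check_interval_s: float, failures_required: int,
                  decision_s: float, promotion_s: float,
                  catchup_s: float = 0.0) -> float:
    """Sum detection, decision, promotion, and catch-up time in seconds."""
    detection_s = check_interval_s * failures_required
    return detection_s + decision_s + promotion_s + catchup_s


# 5 s checks x 3 failures, plus an assumed 10 s decision, 5 s promotion, 2 s async lag
print(estimate_mttr(5, 3, decision_s=10, promotion_s=5, catchup_s=2))  # 32
```

This is a lower bound: network jitter and client reconnection time come on top of it.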
Real-World Case Study: Patroni at Spotify
Spotify runs one of the largest PostgreSQL clusters managed by Patroni in production. Patroni handles automatic failover for their hundreds of PostgreSQL clusters using etcd as the consensus backend.
The challenge at Spotify’s scale was not the failover itself — Patroni’s failover is reliable — but managing the post-failover state across their service mesh. When a PostgreSQL primary fails and a replica promotes, hundreds of application servers that had connections to the old primary need to reconnect to the new primary. Without connection pooling (Spotify uses PgBouncer in front of PostgreSQL), this would cause a connection storm as all applications simultaneously try to reconnect.
Spotify’s solution was aggressive connection pooling at every tier. PgBouncer sits in front of every PostgreSQL cluster. When failover happens, PgBouncer detects connection failures and reconnects to the new primary with exponential backoff. The connection pooling layer absorbs the reconnection burst rather than letting it hit the database directly.
The operational lesson: failover automation handles the database layer, but your connection pooling and application reconnection logic must be equally resilient. Patroni’s failover works reliably, but if your application crashes on connection errors instead of retrying gracefully, you still get an outage.
Real-World Case Study: AWS RDS Failover
AWS RDS with Multi-AZ failover is automatic but not instantaneous. When an RDS primary fails, the failover process typically takes 60-120 seconds. This includes detection (Amazon’s internal monitoring), promotion of the standby, and DNS update propagation.
The catch: during those 60-120 seconds, your application receives connection errors. If your connection pooling layer cannot handle connection failures gracefully (by retrying with backoff and eventually connecting to the new endpoint), your application starts failing requests even though RDS has already recovered.
The lesson: managed failover does not eliminate application-level connection resilience. RDS handles the database failover; your application must handle the connection interruption. Use the RDS endpoint (which automatically resolves to the new primary after failover) rather than hardcoding the primary’s IP. Configure your database client to retry on connection failure with exponential backoff. Test failover explicitly — RDS allows you to trigger a manual failover to verify your application handles it correctly.
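Retry with exponential backoff can be sketched as follows. This is a minimal Python illustration, not any driver's built-in behavior: `connect_with_backoff` is a hypothetical helper, `connect` stands in for your driver's connect call, and the sketch assumes failures surface as `ConnectionError` (real drivers often raise their own exception types). Random jitter is added so that many clients do not retry in lockstep.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_backoff(connect: Callable[[], T], attempts: int = 6,
                         base_s: float = 0.5, cap_s: float = 30.0,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect` until it succeeds, sleeping a jittered delay between tries."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: random delay in [0, min(cap, base * 2^attempt)]
            sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists only so the delay can be replaced in tests. With the defaults, the upper bounds on the five delays sum to 15.5 seconds, so if failover takes 60-120 seconds you would need more attempts or a larger base delay.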
Quick Recap Checklist
- Health checks use multiple consecutive failures to avoid false positives
- Quorum-based promotion prevents split-brain (minimum 3 nodes)
- Fencing (STONITH) isolates old primary before new primary promotes
- DNS TTL set to 60 seconds or lower before failover
- Connection pooling absorbs reconnection bursts during failover
- MTTR target: 30-60 seconds with proper automation
- Test failover explicitly in staging before production
- 2-node clusters cannot do safe automatic failover — use manual or add witness
- Synchronous replication eliminates data loss gap on failover
- Alert at lag exceeding your RPO threshold
DNS TTL Management During Failover
When the primary changes, applications must connect to the new primary. DNS-based discovery handles this: point your database hostname to the current primary’s IP, update DNS on failover.
The problem: DNS caching. If your TTL is 1 hour, clients might keep connecting to the failed IP for an hour after failover.
Low TTL Before Failover
Set a low TTL (60 seconds or less) on your database DNS records before you need failover. This lets you update quickly when the time comes.
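# OLD_PRIMARY_IP and NEW_PRIMARY_IP are placeholders for your node addresses.
# The hostname stays the same; only the A record's target changes.
gcloud dns record-sets transaction start --zone=db-zone
gcloud dns record-sets transaction remove "$OLD_PRIMARY_IP" \
    --name=db-primary.example.com. --type=A --ttl=60 --zone=db-zone
gcloud dns record-sets transaction add "$NEW_PRIMARY_IP" \
    --name=db-primary.example.com. --type=A --ttl=60 --zone=db-zone
gcloud dns record-sets transaction execute --zone=db-zone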
Application-Level DNS Caching
Applications may cache DNS lookups at the OS, runtime, or library level, sometimes for longer than your TTL. Ensure your database client re-resolves hostnames on connection failure or periodically.
Some proxies and poolers expose a DNS refresh setting, such as Envoy's dns_refresh_rate or PgBouncer's dns_max_ttl. Other clients rely on a pooler or proxy like PgBouncer or ProxySQL to handle reconnection gracefully.
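Re-resolving on reconnect can be as simple as asking the system resolver again instead of reusing a stored IP. A minimal Python sketch, with `resolve_ipv4` as an illustrative helper; it bypasses only address caching in your own code, while OS-level resolver caches still apply.

```python
import socket
from typing import List


def resolve_ipv4(hostname: str, port: int) -> List[str]:
    """Ask the system resolver for the current IPv4 addresses of `hostname`."""
    infos = socket.getaddrinfo(hostname, port,
                               family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})
```

Calling this before each reconnection attempt, rather than once at startup, lets a client pick up the new primary's address as soon as the cached DNS record expires.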
When to Use / When Not to Use Automatic Failover
Use automatic failover when:
- Your SLA requires fast recovery (under 60 seconds)
- You run multi-AZ deployments where node failure is a realistic scenario
- You have 3+ database nodes so quorum-based promotion is safe
- Your application handles brief write unavailability gracefully
Do not use automatic failover when:
- You have only 2 nodes — quorum-based failover cannot work safely with 2
- Your database runs on shared storage (SAN) where primary and replica share failure domain
- Your team can respond faster manually than the automation can detect and act
- You lack monitoring that distinguishes node failure from slow node
Automatic vs Semi-Automatic vs Manual Failover Trade-offs
| Dimension | Automatic Failover | Semi-Automatic (Operator-Triggered) | Manual Failover |
|---|---|---|---|
| MTTR | 30-60 seconds | 2-5 minutes | 5-15 minutes |
| False positive risk | Higher: can trigger on transient issues | Lower: human verifies first | Lowest: human diagnoses before acting |
| Split-brain risk | Moderate without quorum, low with quorum | Very low | Very low, limited to operator error |
| Operational complexity | High | Medium | Low |
| Best suited for | 24/7 services with SLA | Business hours ops, critical data | Non-critical dev/staging |
| Requires | Quorum, fencing, watchdog | Alerting + runbook | On-call engineer |
| Data loss risk | Low (with sync rep) | Very low | Depends on async lag |
What Automation Cannot Solve
Automation handles node failure, not root cause failure. If a bad deployment corrupts data on the primary, the corruption has usually replicated already, so promoting a replica just makes the corrupted copy authoritative. You still need backups, point-in-time recovery, and tests that verify your recovery procedures actually work.
Automation also does not eliminate runbooks. When something unexpected happens, operators need documented procedures, not just alerts.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Spurious failover from network blip | Primary demoted unnecessarily, brief write outage | Require multiple consecutive failures, use longer detection windows |
| Split-brain from network partition | Two primaries accept writes, data diverges | Use quorum-based promotion, implement fencing (STONITH) |
| Fencing failure leaves old primary writing | Data corruption, split-brain continues | Test fencing regularly, monitor fencing state |
| Promotion of lagging replica | Data loss — missing writes not yet replicated | Check replication position before promotion, use semi-sync replication |
| DNS TTL too high | Clients keep connecting to failed primary for minutes | Set TTL to 60s or lower before failover, use connection pooling |
| Cascade failure during failover | Multiple nodes fail during transition | Keep cluster stable during failover, avoid concurrent config changes |
| Automation runs during planned maintenance | Primary demoted while operator working | Pause automation during maintenance windows, use manual mode |
Interview Questions
With 2 nodes, quorum-based failover cannot work safely. If the replica loses contact with the primary, it cannot tell a dead primary from a network partition, and there is no third vote to break the tie. Promoting in that situation risks split-brain, where both nodes believe they are primary. The only safe options with 2 nodes are manual failover, or failover guarded by fencing or a shared-storage lock so that one node can definitively prove it is the only one able to serve writes. The recommendation is to add a third node or a witness/arbitrator node to enable quorum-based failover.
The old primary may still have active connections and continue accepting writes locally. When the network recovers and the old primary rejoins the cluster, it holds writes that were never replicated to the new primary. You now have divergent data. The cluster must reconcile, either by rewinding the old primary to the point of divergence (for example with pg_rewind) and rejoining it as a replica, which discards its extra writes, or by merging those writes manually. This is why fencing (STONITH) is critical: before the new primary promotes, the old primary must be prevented from accepting writes. Without fencing, the old primary can continue writing during a partition and cause split-brain.
Three settings matter: ttl (how long the leader key lives in the distributed store without being refreshed), loop_wait (the interval of Patroni's main loop), and retry_timeout (the timeout for distributed store and PostgreSQL operations). The defaults are ttl 30 seconds, loop_wait 10 seconds, and retry_timeout 10 seconds, so a dead primary is typically detected after roughly 30 seconds, when its leader key expires. Lowering all three can bring detection under 10 seconds, but keep loop_wait + 2 * retry_timeout no greater than ttl, and expect more etcd traffic and a higher risk of spurious failover. Synchronous replication to at least one replica does not speed up detection, but it guarantees the promoted replica is current, which makes fast failover safer.
Quorum requires a majority of nodes to agree before a new primary is promoted. With 3 nodes, you need 2 votes. If a network partition isolates one node, that node has only 1 vote and cannot promote itself, while the remaining two nodes have 2 votes and can elect a new primary. This prevents split-brain because a single node cannot form a majority on its own. Minimum safe configuration: 3 nodes, and odd cluster sizes (3, 5, 7) in general. With 2 nodes, no majority is possible without the other node, so a lone node cannot tell a dead peer from a network partition. Some setups add a witness or arbitrator node that votes but stores no data to break ties between 2 data nodes, but a 3-node cluster is the recommended minimum for production automatic failover.
STONITH (Shoot The Other Node In The Head) is a fencing technique that physically isolates a failed node before promoting a new primary. When a replica decides to promote, STONITH first power-cycles or network-isolates the old primary to ensure it cannot continue writing. Without STONITH, the old primary may still be running and accepting writes during a network partition — when the partition heals, you have two primaries with divergent data (split-brain). STONITH ensures the old primary is definitively dead before the new primary accepts writes. Implementation options: IPMI power management, hardware out-of-band interfaces, or cloud provider APIs (AWS instance stop). STONITH adds operational complexity but is required for strong failover safety guarantees.
Clients cached the old IP with a 1-hour TTL. After failover, the DNS record points to the new primary's IP, but clients ignore the update for up to an hour. During this window, writes to the failed primary's IP fail with connection errors. If the old primary is still running (not properly shut down), it may accept writes that are never replicated to the new primary — data divergence. Prevention: set database DNS TTL to 60 seconds or less before deploying failover automation. Use a database-specific DNS endpoint (like RDS endpoint or a dedicated CNAME) that resolves to the current primary rather than hardcoding IPs. Some setups use a virtual IP (VIP) that floats to the new primary — the application connects to the VIP and the network handles the switch without DNS changes.
loop_wait is the interval of Patroni's main loop, 10 seconds by default. On each iteration the leader checks cluster health and refreshes its leader key in the distributed store. ttl is the lifetime of that leader key, 30 seconds by default: if the leader stops refreshing it, the key expires and replicas compete for it. Detection time is therefore bounded by roughly ttl plus network jitter, about 30 seconds with default settings. retry_timeout is the timeout for etcd or Consul operations, 10 seconds by default; if the store does not respond within it, Patroni treats the operation as failed. To achieve sub-30-second MTTR, reduce ttl, loop_wait, and retry_timeout together (this increases load on the store) and keep etcd or Consul fast and close to the database nodes. Synchronous replication to at least one replica makes aggressive timeouts safer, because the promoted replica is guaranteed to be current.
The monitoring system triggered failover based on a single failed health check or too-short timeout. A brief network blip caused health checks to fail for a few seconds, the failover automation interpreted this as primary death, and promoted a replica. Meanwhile the old primary continued running normally and was still accepting writes locally. When the network recovered and both primaries existed, you had a split-brain scenario. Prevention: require multiple consecutive failures before triggering failover (e.g., 3 consecutive failures over 30 seconds). Use health check intervals of 5-10 seconds with enough consecutive failures to survive brief blips. Some systems use quorum (multiple independent monitors must agree) rather than single-monitor decision. Also implement a pre-promotion verification step that checks the candidate replica's data freshness before allowing promotion.
RDS Multi-AZ uses synchronous replication to a standby in a different AZ. When the primary fails, Amazon's internal monitoring detects it, promotes the standby, and updates the DNS endpoint to point to the new primary. The application connects to the same endpoint (e.g., mydb.abc123.us-east-1.rds.amazonaws.com) — DNS propagates the new IP. During the 60-120 second failover window: connection attempts fail with connection errors; in-flight transactions fail; your application must handle reconnection with exponential backoff. The RDS endpoint masks the IP change but does not eliminate application disruption. Key application requirements: retry logic with backoff on connection failure, connection pool that can recover from failed connections, and no hardcoded IP addresses. Test failover explicitly — RDS allows you to trigger a test failover to verify your application's handling.
In raw mode, pgpool-II does not handle replication or load balancing; it routes queries to a single PostgreSQL backend and can switch to a standby when that backend fails. To make pgpool-II itself highly available, you must configure its watchdog. In native replication mode, pgpool-II sends each write to every backend itself instead of relying on PostgreSQL replication, so failover is more complex: pgpool-II must detect failed nodes and also manage the replication state. For most automatic failover setups with PostgreSQL, Patroni + PgBouncer is preferred over pgpool-II because Patroni handles failover coordination through etcd or Consul while PgBouncer handles connection pooling. pgpool-II works well for read load balancing, but Patroni is the standard for HA.
A witness node (or arbitrator) participates in quorum but does not store data. With 2 data nodes + 1 witness: if the primary fails, the replica + witness have 2 votes (majority of 3), allowing promotion. If the witness fails, the 2 data nodes can still communicate and continue operating, and the replica does not promote spuriously. This gives you safe automatic failover with 2 data nodes without requiring a third full data node. The tradeoff: the cluster tolerates only one failure at a time (if the witness and one data node fail simultaneously, you lose quorum). Also verify that your failover tool (Patroni, etc.) supports witness/arbitrator nodes, and place the witness on a reliable network in a separate failure domain from both data nodes, so a single outage cannot take out the witness together with a data node.
Automatic failover: MTTR 30-60 seconds, highest availability, but risk of false positives (spurious failover) and split-brain without proper quorum/fencing. Semi-automatic (operator-triggered): monitoring alerts an on-call engineer who manually approves and triggers failover — MTTR 2-5 minutes, lower false-positive risk because a human verifies before acting. Manual failover: engineer diagnoses, then runs procedures manually — MTTR 5-15 minutes, no spurious failover risk, but human response time adds delay. For business-critical databases with 24/7 availability requirements, automatic failover is appropriate if quorum and fencing are properly implemented. For less critical systems or those with skilled ops teams that can respond in under 5 minutes, semi-automatic may provide better safety-to-speed balance. Manual is appropriate for development and staging.
When the primary fails, all application instances that had connections to it receive connection errors nearly simultaneously. Each instance's connection pool or ORM sees the failure and independently starts retrying with exponential backoff — but without coordination, all instances retry at the same times (they failed at the same time). This creates a connection storm that can overwhelm the new primary or the connection pooler. Prevention: use PgBouncer or ProxySQL in front of the database — the pooler maintains its own connection pool to the database and handles reconnection internally, absorbing the burst rather than letting it hit the database directly. Configure pooler reconnection with jitter and exponential backoff. At the application level, implement circuit breakers that temporarily stop sending requests to the database during failover, allowing the pooler time to stabilize.
Patroni requires a distributed consensus store (etcd, Consul, ZooKeeper). For production: use 3 or 5 etcd nodes for HA — a 3-node etcd cluster tolerates 1 failure, 5-node tolerates 2. Place etcd nodes on separate hardware and network segments from the PostgreSQL primaries to avoid correlated failures. Etcd performance matters: use SSDs for etcd data directory, network locality between etcd nodes and Patroni nodes, and monitor etcd write latency (should be < 10ms). If etcd becomes slow, Patroni loops hang and failover detection slows down. For read-heavy workloads, etcd can handle thousands of reads per second, but ensure your etcd cluster is not overutilized — dedicated etcd clusters for critical Patroni deployments are recommended over sharing with other workloads.
MTTR = detection_time + decision_time + promotion_time + catchup_time. Detection_time = health_check_interval * consecutive_failures_required; with 5s checks and 3 failures required, minimum detection is 15s plus network jitter. Decision_time is the time to agree on a new leader. For Patroni with etcd it is governed by Patroni's leader key ttl (30s by default) and loop_wait (10s by default), not by etcd's internal Raft election timeout, which defaults to 1s. Promotion_time = time for the replica to finish WAL replay and switch to a new timeline, typically 2-5s for PostgreSQL. Catchup_time = 0 with synchronous replication; with async replication it depends on lag at failover time. Target MTTR with proper automation: 30-60 seconds. Manual failover typically takes 5-15 minutes. Formula-based: MTTR = (health_check_interval * failures) + election_timeout + promotion_latency + async_lag_at_failover.
Immediately after failover: verify the new primary is accepting writes and the old primary is isolated (offline or demoted). Then: (1) Configure the old primary to replicate from the new primary as a replica — this is usually automatic in Patroni. (2) Verify replication is catching up — check pg_stat_replication for the new replica. (3) If using a load balancer or connection pooler, verify traffic is routing to the new primary. (4) Monitor the new replica's lag — it may need to replay a backlog of WAL. (5) Once replication is current and stable, you are back to full redundancy (1 primary + N replicas). If the old primary had hardware failure, repair or replace it and add it back to the cluster as a new replica. Document the failover event and review whether the automation worked correctly or needs tuning.
Investigate: check the failover automation logs (Patroni, Consul, etc.) for the decision rationale. Look at the health check results around the failover time — were there multiple consecutive failures or just one? Check system metrics on the old primary: was it actually healthy or was it experiencing load/CPU/network issues that made it appear down? If the old primary was healthy and reachable, the health check was too sensitive. Metrics that warn of false positives: health check success rate over time (if success rate drops below 100%, you are near the threshold), replication lag (a lagging replica is a poor failover candidate), and CPU/memory/disk on the candidate replica (a resource-exhausted replica should not be promoted). Prevention: require more consecutive failures, use longer health check intervals, and implement pre-promotion checks that verify the candidate's resource health and replication position.
HAProxy uses health checks (TCP checks, protocol-specific checks, or external scripts) to monitor each backend. If the primary fails its checks, HAProxy marks it down and stops routing traffic to it. If you configure the replica as a backup server, HAProxy starts sending traffic to it once the primary is marked down. For PostgreSQL, typical checks are a plain TCP connect or HAProxy's pgsql-check; with Patroni, an HTTP check against Patroni's REST API tells HAProxy which node is currently the primary. The failover is not automatic on the database side: HAProxy only detects and routes away from failed backends. You still need Patroni or similar to promote the replica at the database level. HAProxy then resumes routing once its checks identify the new primary. For smoother transitions, some setups use a virtual IP that floats to the new primary, so HAProxy just checks the VIP and the network handles the IP movement.
Hardware failures and software bugs require different recovery approaches. Hardware failure (power supply, disk, memory) typically leaves the database in a clean state, where the last committed transaction is well-defined. Software bugs (corrupt indexes, half-applied transactions, crashed writer process) can leave the database in an inconsistent state where promotion is dangerous because the replica may carry forward corrupted data. Design for both: implement checksum verification on replication data to detect corruption before promotion. Use post-promotion verification that runs consistency checks before accepting writes on the new primary (PostgreSQL's amcheck extension, or pg_checksums, which requires a cleanly stopped instance; MySQL's CHECK TABLE). For software bugs causing inconsistency, prefer a full data comparison between candidate replica and primary before promotion rather than blindly promoting the first replica that responds. Consider separating failure classes: hardware failure triggers immediate promotion, while software inconsistency triggers alert-and-review before any promotion. Maintain backups that are independent of the replication chain so you can restore to a known-good state regardless of which replica you promote.
Long-running transactions that started on the primary before failover may fail when the replica promotes — the transaction's session is interrupted, and if it attempts to commit, the new primary may not have the transaction in its state. The new primary also has a different timeline: transactions that were in-flight on the old primary but not yet committed may be lost. Strategies: keep transaction durations short — long-running transactions are more likely to span a failover event. Use application-level transaction retry logic that catches connection errors and re-executes failed transactions (idempotent operations). Avoid multi-statement transactions that hold locks across the failover window. For critical transactions, consider using synchronous replication so the replica is current at promotion time. Monitor in-flight transaction count as part of your failover metrics — a sudden spike in long-running transactions before failover increases the chance of transaction failures during promotion.
Further Reading
Official Documentation:
- Patroni Documentation — PostgreSQL HA template using distributed consensus
- Consul Documentation — Service mesh and distributed coordination
- etcd Documentation — Distributed key-value store for shared configuration and service discovery
- PgBouncer Documentation — Connection pooler for PostgreSQL
Tools and Patterns:
- Patroni: Automates PostgreSQL failover using etcd/ZooKeeper/Consul for leader election
- HAProxy: Connection routing and health checking for database clusters
- Prometheus + Alertmanager: Health check monitoring and automated alert-based failover triggering
- Vitess: MySQL sharding middleware with built-in failover
- CockroachDB: Distributed SQL with automatic failover as a core feature
Case Studies:
- Spotify Engineering Blog — PostgreSQL at scale with Patroni (see archive for failover patterns)
- AWS RDS Patroni Automation — Reference architecture for AWS-managed PostgreSQL failover
Monitoring Queries:
-- PostgreSQL: check replication status (run on the primary)
SELECT application_name, state, sync_state, replay_lag
FROM pg_stat_replication;
-- Check node role: returns true on a replica, false on a primary
SELECT pg_is_in_recovery();
Conclusion
Automatic failover separates resilient systems from expensive PagerDuty schedules. The core components: reliable failure detection, quorum or fencing to prevent split-brain, and DNS or connection routing that updates when the primary changes.
Test your failover process before you need it. A failover that works in theory but fails under pressure is a liability, not a safety net.