Failover Automation
Automatic failover patterns. Health checks, failure detection, split-brain prevention, and DNS TTL management during database failover.
A primary node fails at 3am. Do you get paged? Does the system recover automatically? Or does someone have to wake up, diagnose, and manually trigger a failover?
Failover automation determines whether your high availability setup is a real safety net or just a checklist item that gives false confidence.
What Automatic Failover Means
The system detects a primary failure and promotes a replica to primary without human intervention. The goal: minimize downtime and data loss. The catch: it involves tradeoffs between speed, safety, and infrastructure complexity.
Failure detection and promotion sound straightforward. They rarely are.
Health Checks and Failure Detection
Before failover can happen, the system must confirm the primary is actually down. Too sensitive and you get spurious failovers that cause instability. Too lenient and you accept extended downtime.
Types of Health Checks
Database-level checks verify the process is accepting connections: pg_isready for PostgreSQL, TCP connection to the database port, SHOW GLOBAL STATUS for MySQL.
Query-level checks run an actual query to confirm the engine can process work. SELECT 1 proves the query engine is functional, not just the network layer.
Replication checks verify the replica is receiving changes from the primary. A replica that cannot keep up is a poor failover candidate—it would become primary with missing data.
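One way to act on replication checks is to rank replicas by lag before promoting. A minimal sketch, assuming lag values (in seconds) have already been collected — the replica names and threshold are illustrative, and real systems compare WAL or binlog positions rather than wall-clock lag:

```python
# Sketch: choosing the best failover candidate from replication lag.
# Names and the max_lag threshold are illustrative placeholders.

def best_failover_candidate(lag_seconds, max_lag=5.0):
    """Return the least-lagged replica under the threshold, or None."""
    eligible = {name: lag for name, lag in lag_seconds.items() if lag <= max_lag}
    if not eligible:
        # No safe candidate: promoting any replica would lose too much data.
        return None
    return min(eligible, key=eligible.get)

print(best_failover_candidate({"replica1": 0.4, "replica2": 12.0}))  # replica1
print(best_failover_candidate({"replica1": 30.0}))                   # None
```

Returning None when every replica exceeds the threshold is deliberate: refusing to fail over is safer than promoting a badly lagged replica.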
# PostgreSQL health check
if ! pg_isready -h "$PRIMARY_HOST" -p "$PRIMARY_PORT" -U "$CHECK_USER"; then
    trigger_failover
fi
Failure Detection Timeout Tradeoffs
The timeout for declaring a primary dead requires balancing:
- Short timeout: fast recovery, but risk of false positives from temporary network blips
- Long timeout: safer from spurious failovers, but longer downtime
Most production systems use 10-30 seconds with multiple consecutive checks. Some implementations require failures to persist over a period. Others use quorum—multiple independent checks must agree before triggering failover.
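The consecutive-failure pattern can be sketched in a few lines. This is a simplified illustration, assuming `check_primary` is a wrapper you supply around something like `pg_isready`; the interval and threshold mirror the check-every-5-seconds, three-failures example discussed later:

```python
import time

# Sketch of a consecutive-failure detector. A single success resets the
# window, so a transient blip does not accumulate toward failover.

def detect_failure(check_primary, interval=5.0, failures_required=3,
                   sleep=time.sleep):
    """Block until `failures_required` consecutive checks fail.

    Returns the total number of checks performed.
    """
    checks = 0
    consecutive = 0
    while consecutive < failures_required:
        checks += 1
        if check_primary():
            consecutive = 0          # any success resets the window
        else:
            consecutive += 1
        if consecutive < failures_required:
            sleep(interval)
    return checks

# Demo with a scripted probe and a no-op sleep: one success, then three
# consecutive failures -> detection fires after four checks.
probe_results = iter([True, False, False, False])
print(detect_failure(lambda: next(probe_results), sleep=lambda _: None))  # 4
```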
Split-Brain Prevention
Split-brain occurs when two nodes both believe they are the primary. Both accept writes, data diverges, and reconciliation becomes painful or impossible.
Network partitions cause most split-brain scenarios. If the primary loses network connectivity but keeps running, it continues accepting writes locally. A replica promotes itself believing the primary is dead. When the network recovers, you have two databases with conflicting data.
Quorum-Based Prevention
The safest approach uses quorum: a node can only become primary if a majority agrees. A three-node cluster needs 2 votes. If one node loses connection to the others, it cannot promote itself—only 1 vote. The remaining two nodes can elect a new primary.
Raft consensus protocols work this way. etcd, Consul, and some database clustering solutions implement similar logic.
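The majority rule itself is simple arithmetic. A minimal sketch (a simplification of what Raft-based systems like etcd actually implement, which also involves terms and log comparison):

```python
# Minimal quorum check: a node may promote only if a strict majority of
# cluster members (counting its own vote) agrees the primary is down.

def can_promote(votes_for_promotion, cluster_size):
    """True only when a strict majority agrees."""
    return votes_for_promotion > cluster_size // 2

# 3-node cluster: an isolated node holds only its own vote and cannot
# promote itself; the remaining pair can elect a new primary.
print(can_promote(1, 3))  # False
print(can_promote(2, 3))  # True
```

Note why even cluster sizes are wasteful: a 4-node cluster still needs 3 votes, so it tolerates no more failures than a 3-node cluster.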
Fencing
Fencing ensures the old primary cannot continue writing when a new primary takes over. Two common approaches:
- STONITH (Shoot The Other Node In The Head): power off or isolate the failed node before promotion
- Resource fencing: revoke the old primary’s access to shared storage or network
Fencing adds complexity but provides stronger guarantees.
Prometheus, HAProxy, and Patroni Patterns
Common tooling patterns for managing failover:
Prometheus Alertmanager + Blackbox Exporter
Monitor the primary with Prometheus. When checks fail, Alertmanager fires an alert that triggers automated failover via webhook.
groups:
  - name: database_failover
    rules:
      - alert: DatabasePrimaryDown
        expr: probe_success{job="db-primary-check"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database primary is down"
HAProxy for Connection Routing
HAProxy detects primary failure and routes writes to the new primary automatically. Your application keeps connecting to the same HAProxy endpoint; HAProxy handles the backend switch.
backend db_write
    # pgsql-check sends a PostgreSQL startup packet and verifies the
    # server's response; the check user (example name) must exist.
    option pgsql-check user haproxy_check
    server primary db-primary:5432 check inter 5s fall 3 rise 2
    server replica1 db-replica1:5432 check inter 5s fall 3 rise 2 backup
Patroni for PostgreSQL
Patroni automates PostgreSQL failover using distributed consensus (etcd, ZooKeeper, or Consul) for leader election. It manages leader election, replication, automated failover with safety checks, and post-failover recovery.
Vitess and CockroachDB
Vitess and CockroachDB have automatic failover built in as core features. If you want failover that works without custom configuration, managed distributed databases handle this internally.
Mean Time to Recovery (MTTR)
MTTR measures average recovery time. For database failover, it includes:
- Time to detect failure
- Time to decide on failover
- Time to execute promotion
- Time for replicas to catch up to new primary
Target MTTR varies by industry, but 30-60 seconds is achievable with proper automation. Manual failover typically takes 5-15 minutes when accounting for human response time, diagnosis, and execution.
The MTTR formula breaks down component by component. Detection time is your health check interval times the number of consecutive failures required to trigger failover. If you check every 5 seconds and require 3 consecutive failures, minimum detection time is 15 seconds. Add network jitter and you might see 20-30 seconds before detection fires.
Decision time is the time between detection and promotion trigger. For quorum-based systems (Patroni with etcd, for example), this includes leader election timeout — typically 10-15 seconds by default. You can tune this down to 3-5 seconds at the cost of higher false-positive risk.
Promotion time depends on your database. PostgreSQL promotion typically takes 2-5 seconds (rewriting the control file, promoting the replica). MySQL is similar. CockroachDB and other distributed databases promote automatically as part of their Raft consensus — usually under 1 second.
Replication catchup time: after promotion, the new primary must have all committed data. With synchronous replication, catchup time is zero — the replica was already current. With asynchronous replication, the new primary may be missing the last few seconds of writes. The practical catchup time for async replication is roughly equal to your replication lag at the time of failover. If lag was 2 seconds, you may have lost 2 seconds of data and catchup (for the new replica to catch up to the new primary) takes another few seconds under load.
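The component-by-component breakdown above is back-of-envelope arithmetic. A small sketch, using the illustrative defaults from the text (not measurements):

```python
# Back-of-envelope MTTR estimate from the four components discussed above.

def estimate_mttr(check_interval, failures_required, decision_time,
                  promotion_time, replication_lag):
    """Sum detection, decision, promotion, and catchup time (seconds)."""
    detection_time = check_interval * failures_required
    return detection_time + decision_time + promotion_time + replication_lag

# 5s checks x 3 consecutive failures, 10s leader election,
# 3s PostgreSQL promotion, 2s of async replication lag:
print(estimate_mttr(5, 3, 10, 3, 2))  # 30 seconds
```

The same arithmetic shows where tuning pays off: detection usually dominates, which is why reducing the check interval or failure threshold is the first lever operators reach for.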
Real-World Case Study: Patroni at Spotify
Spotify runs one of the largest PostgreSQL clusters managed by Patroni in production. Patroni handles automatic failover for their hundreds of PostgreSQL clusters using etcd as the consensus backend.
The challenge at Spotify’s scale was not the failover itself — Patroni’s failover is reliable — but managing the post-failover state across their service mesh. When a PostgreSQL primary fails and a replica promotes, hundreds of application servers that had connections to the old primary need to reconnect to the new primary. Without connection pooling (Spotify uses PgBouncer in front of PostgreSQL), this would cause a connection storm as all applications simultaneously try to reconnect.
Spotify’s solution was aggressive connection pooling at every tier. PgBouncer sits in front of every PostgreSQL cluster. When failover happens, PgBouncer detects connection failures and reconnects to the new primary with exponential backoff. The connection pooling layer absorbs the reconnection burst rather than letting it hit the database directly.
The operational lesson: failover automation handles the database layer, but your connection pooling and application reconnection logic must be equally resilient. Patroni’s failover works reliably, but if your application crashes on connection errors instead of retrying gracefully, you still get an outage.
Real-World Case Study: AWS RDS Failover
AWS RDS with Multi-AZ failover is automatic but not instantaneous. When an RDS primary fails, the failover process typically takes 60-120 seconds. This includes detection (Amazon’s internal monitoring), promotion of the standby, and DNS update propagation.
The catch: during those 60-120 seconds, your application receives connection errors. If your connection pooling layer cannot handle connection failures gracefully (by retrying with backoff and eventually connecting to the new endpoint), your application starts failing requests even though RDS has already recovered.
The lesson: managed failover does not eliminate application-level connection resilience. RDS handles the database failover; your application must handle the connection interruption. Use the RDS endpoint (which automatically resolves to the new primary after failover) rather than hardcoding the primary’s IP. Configure your database client to retry on connection failure with exponential backoff. Test failover explicitly — RDS allows you to trigger a manual failover to verify your application handles it correctly.
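The retry-with-backoff behavior the RDS case study calls for can be sketched as follows. This is a client-side illustration, assuming `connect` is whatever connect call your database driver exposes; full jitter is used to avoid synchronized reconnection storms:

```python
import random
import time

# Sketch: reconnect with exponential backoff plus full jitter, so that
# hundreds of clients do not retry in lockstep after a failover.

def connect_with_backoff(connect, max_attempts=6, base=0.5, cap=30.0,
                         sleep=time.sleep):
    """Call `connect` until it succeeds or attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the caller
            # Full jitter: sleep a random amount up to the capped backoff.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Because DNS points to the new primary after failover, each retry that re-resolves the endpoint eventually lands on the promoted instance.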
Interview Questions
Q: You have a 2-node PostgreSQL cluster with async replication. Someone suggests adding automatic failover. What is the fundamental problem?
With two nodes, quorum-based failover cannot work safely. If the primary fails and the replica promotes, there is no third vote to break ties: a network partition that coincides with an apparent primary failure leaves both nodes believing they are primary, which is split-brain. The only safe two-node configurations are manual failover, or failover where one node can definitively prove it is the sole writer (for example via a shared-storage lock or fencing). The standard recommendation is to add a third node, or a lightweight witness/arbitrator node, to enable quorum-based failover.
Q: Your automatic failover triggers during a brief network partition. The old primary is still running but cannot communicate with the cluster. What happens?
The old primary may still have active connections and continue accepting writes locally. When the network recovers and the old primary rejoins the cluster, it has writes that were never replicated to the new primary. You now have divergent data. The cluster must reconcile — either by reverting the old primary to a replica and having it replay the missing writes, or by accepting data loss. This is why fencing (STONITH) is critical: before the new primary promotes, the old primary must be isolated from accepting writes. Without fencing, the old primary can continue writing during a partition and cause split-brain.
Q: How do you tune a Patroni cluster for sub-30-second MTTR?
Three tuning points: reduce the heartbeat interval (Patroni sends heartbeats between nodes), reduce the loop_wait parameter which controls the Patroni main loop, and reduce the retry_timeout for etcd operations. Standard Patroni settings have a loop_wait of 10 seconds and a heartbeat interval of 10 seconds, giving a minimum detection time around 30 seconds. Reducing loop_wait to 1-2 seconds gets detection time under 10 seconds, but at the cost of increased etcd write load. For truly sub-30-second failover, synchronous replication to at least one replica is the most reliable approach since the promoted replica is guaranteed to be current.
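As an illustration, the tuning described above maps onto Patroni's dynamic configuration roughly like this. The values are examples only, not recommendations: Patroni expects ttl to be at least loop_wait plus twice retry_timeout, and tighter settings increase both etcd write load and false-positive risk.

```yaml
# Illustrative Patroni DCS settings for faster detection.
# Defaults are ttl: 30, loop_wait: 10, retry_timeout: 10.
bootstrap:
  dcs:
    ttl: 10                 # leader key expires sooner -> faster failure detection
    loop_wait: 2            # main loop runs every 2s instead of 10s
    retry_timeout: 3        # give up on slow DCS operations faster
    synchronous_mode: true  # the promoted replica is guaranteed current
```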
Q: What is the difference between automatic failover and continuous availability solutions like CockroachDB?
Automatic failover recovers from node failure by promoting a replica. There is always a gap — even if brief — between the primary failing and the replica becoming primary. During this gap, the database is unavailable for writes. Continuous availability solutions like CockroachDB use Raft consensus to replicate writes to a quorum of nodes before acknowledging the client. If a node fails, the Raft group continues accepting writes with no gap — the failure is invisible to the application because a new leader is elected within milliseconds. The tradeoff is latency: Raft consensus requires round-trips to a quorum of nodes, so write latency is higher than a single-node PostgreSQL. CockroachDB’s model works best for globally distributed workloads where you need zero-downtime recovery and can tolerate higher write latency.
DNS TTL Management During Failover
When the primary changes, applications must connect to the new primary. DNS-based discovery handles this: point your database hostname to the current primary’s IP, update DNS on failover.
The problem: DNS caching. If your TTL is 1 hour, clients might keep connecting to the failed IP for an hour after failover.
Low TTL Before Failover
Set a low TTL (60 seconds or less) on your database DNS records before you need failover. This lets you update quickly when the time comes.
gcloud dns record-sets transaction start --zone=db-zone
# Remove the A record pointing at the old primary (IPs are placeholders;
# the TTL was already lowered to 60s before the failover).
gcloud dns record-sets transaction remove "10.0.1.5" \
    --name=db-primary.example.com. --type=A --ttl=60 --zone=db-zone
# Point the same hostname at the promoted replica's IP.
gcloud dns record-sets transaction add "10.0.1.6" \
    --name=db-primary.example.com. --type=A --ttl=60 --zone=db-zone
gcloud dns record-sets transaction execute --zone=db-zone
Application-Level DNS Caching
Applications cache DNS lookups at the OS or library level regardless of your TTL. Ensure your database client re-resolves hostnames on connection failure or periodically.
Some clients have dns_refresh_rate settings. Others need connection pooling libraries like PgBouncer or ProxySQL to handle reconnection gracefully.
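A simple defense is to resolve the hostname fresh on every reconnect rather than reusing a cached address. A minimal sketch using only the standard library; the hostname and port are placeholders:

```python
import socket

# Sketch: look up the database hostname's current addresses at connect
# time, so a post-failover DNS update is picked up on the next attempt.

def resolve_db_host(hostname, port=5432):
    """Return the current IP addresses for the database hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); the
    # address is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})

print(resolve_db_host("localhost"))  # e.g. ['127.0.0.1']
```

Whether this helps in practice depends on the driver: some resolve once at pool creation and never again, which is exactly the behavior this works around.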
When to Use / When Not to Use Automatic Failover
Use automatic failover when:
- Your SLA requires fast recovery (under 60 seconds)
- You run multi-AZ deployments where node failure is a realistic scenario
- You have 3+ database nodes so quorum-based promotion is safe
- Your application handles brief write unavailability gracefully
Do not use automatic failover when:
- You have only 2 nodes — quorum-based failover cannot work safely with 2
- Your database runs on shared storage (SAN) where primary and replica share failure domain
- Your team can respond faster manually than the automation can detect and act
- You lack monitoring that distinguishes node failure from slow node
Automatic vs Semi-Automatic vs Manual Failover Trade-offs
| Dimension | Automatic Failover | Semi-Automatic (Operator-Triggered) | Manual Failover |
|---|---|---|---|
| MTTR | 30-60 seconds | 2-5 minutes | 5-15 minutes |
| False positive risk | Higher — can trigger on transient issues | Lower — human verifies first | None |
| Split-brain risk | Moderate without quorum, low with quorum | Very low | None |
| Operational complexity | High | Medium | Low |
| Best suited for | 24/7 services with SLA | Business hours ops, critical data | Non-critical dev/staging |
| Requires | Quorum, fencing, watchdog | Alerting + runbook | On-call engineer |
| Data loss risk | Low (with sync rep) | Very low | Depends on async lag |
What Automation Cannot Solve
Automation handles node failure, not root cause failure. If a bad deployment corrupts data on the primary, promoting a replica spreads the corruption. You still need backups, point-in-time recovery, and tests that verify your recovery procedures actually work.
Automation also does not eliminate runbooks. When something unexpected happens, operators need documented procedures, not just alerts.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Spurious failover from network blip | Primary demoted unnecessarily, brief write outage | Require multiple consecutive failures, use longer detection windows |
| Split-brain from network partition | Two primaries accept writes, data diverges | Use quorum-based promotion, implement fencing (STONITH) |
| Fencing failure leaves old primary writing | Data corruption, split-brain continues | Test fencing regularly, monitor fencing state |
| Promotion of lagging replica | Data loss — missing writes not yet replicated | Check replication position before promotion, use semi-sync replication |
| DNS TTL too high | Clients keep connecting to failed primary for minutes | Set TTL to 60s or lower before failover, use connection pooling |
| Cascade failure during failover | Multiple nodes fail during transition | Keep cluster stable during failover, avoid concurrent config changes |
| Automation runs during planned maintenance | Primary demoted while operator working | Pause automation during maintenance windows, use manual mode |
Conclusion
Automatic failover separates resilient systems from expensive PagerDuty schedules. The core components: reliable failure detection, quorum or fencing to prevent split-brain, and DNS or connection routing that updates when the primary changes.
Test your failover process before you need it. A failover that works in theory but fails under pressure is a liability, not a safety net.
Related Posts
- Database Replication — Replication fundamentals these systems build on
- Database Scaling — Scaling strategies including replication
- Consistent Hashing — Related distribution mechanisms