Failover Automation
Automatic failover patterns. Health checks, failure detection, split-brain prevention, and DNS TTL management during database failover.
A primary node fails at 3am. Do you get paged? Does the system recover automatically? Or does someone have to wake up, diagnose, and manually trigger a failover?
Failover automation determines whether your high availability setup is a real safety net or just a checklist item that gives false confidence.
What Automatic Failover Means
The system detects a primary failure and promotes a replica to primary without human intervention. The goal: minimize downtime and data loss. The catch: it involves tradeoffs between speed, safety, and infrastructure complexity.
Failure detection and promotion sound straightforward. They rarely are.
Health Checks and Failure Detection
Before failover can happen, the system must confirm the primary is actually down. Too sensitive and you get spurious failovers that cause instability. Too lenient and you accept extended downtime.
Types of Health Checks
Database-level checks verify the process is accepting connections: pg_isready for PostgreSQL, TCP connection to the database port, SHOW GLOBAL STATUS for MySQL.
Query-level checks run an actual query to confirm the engine can process work. SELECT 1 proves the query engine is functional, not just the network layer.
Replication checks verify the replica is receiving changes from the primary. A replica that cannot keep up is a poor failover candidate—it would become primary with missing data.
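One way to act on replication checks is to rank replicas by lag before promoting. A minimal sketch, assuming lag values (in seconds) have already been collected — the replica names and threshold are illustrative, and real systems compare WAL or binlog positions rather than wall-clock lag:

```python
# Sketch: choosing the best failover candidate from replication lag.
# Names and the max_lag threshold are illustrative placeholders.

def best_failover_candidate(lag_seconds, max_lag=5.0):
    """Return the least-lagged replica under the threshold, or None."""
    eligible = {name: lag for name, lag in lag_seconds.items() if lag <= max_lag}
    if not eligible:
        # No safe candidate: promoting any replica would lose too much data.
        return None
    return min(eligible, key=eligible.get)

print(best_failover_candidate({"replica1": 0.4, "replica2": 12.0}))  # replica1
print(best_failover_candidate({"replica1": 30.0}))                   # None
```

Returning None when every replica exceeds the threshold is deliberate: refusing to fail over is safer than promoting a badly lagged replica.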
# PostgreSQL health check
if ! pg_isready -h "$PRIMARY_HOST" -p "$PRIMARY_PORT" -U "$CHECK_USER"; then
    trigger_failover
fi
Failure Detection Timeout Tradeoffs
The timeout for declaring a primary dead requires balancing:
- Short timeout: fast recovery, but risk of false positives from temporary network blips
- Long timeout: safer from spurious failovers, but longer downtime
Most production systems use 10-30 seconds with multiple consecutive checks. Some implementations require failures to persist over a period. Others use quorum—multiple independent checks must agree before triggering failover.
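The consecutive-failure pattern can be sketched in a few lines. This is a simplified illustration, assuming `check_primary` is a wrapper you supply around something like `pg_isready`; the interval and threshold mirror the check-every-5-seconds, three-failures example discussed later:

```python
import time

# Sketch of a consecutive-failure detector. A single success resets the
# window, so a transient blip does not accumulate toward failover.

def detect_failure(check_primary, interval=5.0, failures_required=3,
                   sleep=time.sleep):
    """Block until `failures_required` consecutive checks fail.

    Returns the total number of checks performed.
    """
    checks = 0
    consecutive = 0
    while consecutive < failures_required:
        checks += 1
        if check_primary():
            consecutive = 0          # any success resets the window
        else:
            consecutive += 1
        if consecutive < failures_required:
            sleep(interval)
    return checks

# Demo with a scripted probe and a no-op sleep: one success, then three
# consecutive failures -> detection fires after four checks.
probe_results = iter([True, False, False, False])
print(detect_failure(lambda: next(probe_results), sleep=lambda _: None))  # 4
```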
Split-Brain Prevention
Split-brain occurs when two nodes both believe they are the primary. Both accept writes, data diverges, and reconciliation becomes painful or impossible.
Network partitions cause most split-brain scenarios. If the primary loses network connectivity but keeps running, it continues accepting writes locally. A replica promotes itself believing the primary is dead. When the network recovers, you have two databases with conflicting data.
Quorum-Based Prevention
The safest approach uses quorum: a node can only become primary if a majority agrees. A three-node cluster needs 2 votes. If one node loses connection to the others, it cannot promote itself—only 1 vote. The remaining two nodes can elect a new primary.
Raft consensus protocols work this way. etcd, Consul, and some database clustering solutions implement similar logic.
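The majority rule itself is simple arithmetic. A minimal sketch (a simplification of what Raft-based systems like etcd actually implement, which also involves terms and log comparison):

```python
# Minimal quorum check: a node may promote only if a strict majority of
# cluster members (counting its own vote) agrees the primary is down.

def can_promote(votes_for_promotion, cluster_size):
    """True only when a strict majority agrees."""
    return votes_for_promotion > cluster_size // 2

# 3-node cluster: an isolated node holds only its own vote and cannot
# promote itself; the remaining pair can elect a new primary.
print(can_promote(1, 3))  # False
print(can_promote(2, 3))  # True
```

Note why even cluster sizes are wasteful: a 4-node cluster still needs 3 votes, so it tolerates no more failures than a 3-node cluster.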
Fencing
Fencing ensures the old primary cannot continue writing when a new primary takes over. Two common approaches:
- STONITH (Shoot The Other Node In The Head): power off or isolate the failed node before promotion
- Resource fencing: revoke the old primary’s access to shared storage or network
Fencing adds complexity but provides stronger guarantees.
Prometheus, HAProxy, and Patroni Patterns
Common tooling patterns for managing failover:
Prometheus Alertmanager + Blackbox Exporter
Monitor the primary with Prometheus. When checks fail, Alertmanager fires an alert that triggers automated failover via webhook.
groups:
  - name: database_failover
    rules:
      - alert: DatabasePrimaryDown
        expr: probe_success{job="db-primary-check"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database primary is down"
HAProxy for Connection Routing
HAProxy detects primary failure and routes writes to the new primary automatically. Your application keeps connecting to the same HAProxy endpoint; HAProxy handles the backend switch.
backend db_write
    # pgsql-check sends a PostgreSQL startup packet and verifies the
    # server's response; the check user (example name) must exist.
    option pgsql-check user haproxy_check
    server primary db-primary:5432 check inter 5s fall 3 rise 2
    server replica1 db-replica1:5432 check inter 5s fall 3 rise 2 backup
Patroni for PostgreSQL
Patroni automates PostgreSQL failover using distributed consensus (etcd, ZooKeeper, or Consul) for leader election. It manages leader election, replication, automated failover with safety checks, and post-failover recovery.
Vitess and CockroachDB
Vitess and CockroachDB have automatic failover built in as core features. If you want failover that works without custom configuration, managed distributed databases handle this internally.
Mean Time to Recovery (MTTR)
MTTR measures average recovery time. For database failover, it includes:
- Time to detect failure
- Time to decide on failover
- Time to execute promotion
- Time for replicas to catch up to new primary
Target MTTR varies by industry, but 30-60 seconds is achievable with proper automation. Manual failover typically takes 5-15 minutes when accounting for human response time, diagnosis, and execution.
The MTTR formula breaks down component by component. Detection time is your health check interval times the number of consecutive failures required to trigger failover. If you check every 5 seconds and require 3 consecutive failures, minimum detection time is 15 seconds. Add network jitter and you might see 20-30 seconds before detection fires.
Decision time is the time between detection and promotion trigger. For quorum-based systems (Patroni with etcd, for example), this includes leader election timeout — typically 10-15 seconds by default. You can tune this down to 3-5 seconds at the cost of higher false-positive risk.
Promotion time depends on your database. PostgreSQL promotion typically takes 2-5 seconds (rewriting the control file, promoting the replica). MySQL is similar. CockroachDB and other distributed databases promote automatically as part of their Raft consensus — usually under 1 second.
Replication catchup time: after promotion, the new primary must have all committed data. With synchronous replication, catchup time is zero — the replica was already current. With asynchronous replication, the new primary may be missing the last few seconds of writes. The practical catchup time for async replication is roughly equal to your replication lag at the time of failover. If lag was 2 seconds, you may have lost 2 seconds of data and catchup (for the new replica to catch up to the new primary) takes another few seconds under load.
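The component-by-component breakdown above is back-of-envelope arithmetic. A small sketch, using the illustrative defaults from the text (not measurements):

```python
# Back-of-envelope MTTR estimate from the four components discussed above.

def estimate_mttr(check_interval, failures_required, decision_time,
                  promotion_time, replication_lag):
    """Sum detection, decision, promotion, and catchup time (seconds)."""
    detection_time = check_interval * failures_required
    return detection_time + decision_time + promotion_time + replication_lag

# 5s checks x 3 consecutive failures, 10s leader election,
# 3s PostgreSQL promotion, 2s of async replication lag:
print(estimate_mttr(5, 3, 10, 3, 2))  # 30 seconds
```

The same arithmetic shows where tuning pays off: detection usually dominates, which is why reducing the check interval or failure threshold is the first lever operators reach for.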
Real-World Case Study: Patroni at Spotify
Spotify runs one of the largest PostgreSQL clusters managed by Patroni in production. Patroni handles automatic failover for their hundreds of PostgreSQL clusters using etcd as the consensus backend.
The challenge at Spotify’s scale was not the failover itself — Patroni’s failover is reliable — but managing the post-failover state across their service mesh. When a PostgreSQL primary fails and a replica promotes, hundreds of application servers that had connections to the old primary need to reconnect to the new primary. Without connection pooling (Spotify uses PgBouncer in front of PostgreSQL), this would cause a connection storm as all applications simultaneously try to reconnect.
Spotify’s solution was aggressive connection pooling at every tier. PgBouncer sits in front of every PostgreSQL cluster. When failover happens, PgBouncer detects connection failures and reconnects to the new primary with exponential backoff. The connection pooling layer absorbs the reconnection burst rather than letting it hit the database directly.
The operational lesson: failover automation handles the database layer, but your connection pooling and application reconnection logic must be equally resilient. Patroni’s failover works reliably, but if your application crashes on connection errors instead of retrying gracefully, you still get an outage.
Real-World Case Study: AWS RDS Failover
AWS RDS with Multi-AZ failover is automatic but not instantaneous. When an RDS primary fails, the failover process typically takes 60-120 seconds. This includes detection (Amazon’s internal monitoring), promotion of the standby, and DNS update propagation.
The catch: during those 60-120 seconds, your application receives connection errors. If your connection pooling layer cannot handle connection failures gracefully (by retrying with backoff and eventually connecting to the new endpoint), your application starts failing requests even though RDS has already recovered.
The lesson: managed failover does not eliminate application-level connection resilience. RDS handles the database failover; your application must handle the connection interruption. Use the RDS endpoint (which automatically resolves to the new primary after failover) rather than hardcoding the primary’s IP. Configure your database client to retry on connection failure with exponential backoff. Test failover explicitly — RDS allows you to trigger a manual failover to verify your application handles it correctly.
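The retry-with-backoff behavior the RDS case study calls for can be sketched as follows. This is a client-side illustration, assuming `connect` is whatever connect call your database driver exposes; full jitter is used to avoid synchronized reconnection storms:

```python
import random
import time

# Sketch: reconnect with exponential backoff plus full jitter, so that
# hundreds of clients do not retry in lockstep after a failover.

def connect_with_backoff(connect, max_attempts=6, base=0.5, cap=30.0,
                         sleep=time.sleep):
    """Call `connect` until it succeeds or attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the caller
            # Full jitter: sleep a random amount up to the capped backoff.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Because DNS points to the new primary after failover, each retry that re-resolves the endpoint eventually lands on the promoted instance.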
Interview Questions
Q: You have a 2-node PostgreSQL cluster with async replication. Someone suggests adding automatic failover. What is the fundamental problem?
With two nodes, quorum-based failover cannot work safely. If the primary fails and the replica promotes, there is no third vote to break ties: a network partition that coincides with an apparent primary failure leaves both nodes believing they are primary, which is split-brain. The only safe two-node configurations are manual failover, or failover where one node can definitively prove it is the sole writer (for example via a shared-storage lock or fencing). The standard recommendation is to add a third node, or a lightweight witness/arbitrator node, to enable quorum-based failover.
Q: Your automatic failover triggers during a brief network partition. The old primary is still running but cannot communicate with the cluster. What happens?
The old primary may still have active connections and continue accepting writes locally. When the network recovers and the old primary rejoins the cluster, it has writes that were never replicated to the new primary. You now have divergent data. The cluster must reconcile — either by reverting the old primary to a replica and having it replay the missing writes, or by accepting data loss. This is why fencing (STONITH) is critical: before the new primary promotes, the old primary must be isolated from accepting writes. Without fencing, the old primary can continue writing during a partition and cause split-brain.
Q: How do you tune a Patroni cluster for sub-30-second MTTR?
Three tuning points: reduce the heartbeat interval (Patroni sends heartbeats between nodes), reduce the loop_wait parameter which controls the Patroni main loop, and reduce the retry_timeout for etcd operations. Standard Patroni settings have a loop_wait of 10 seconds and a heartbeat interval of 10 seconds, giving a minimum detection time around 30 seconds. Reducing loop_wait to 1-2 seconds gets detection time under 10 seconds, but at the cost of increased etcd write load. For truly sub-30-second failover, synchronous replication to at least one replica is the most reliable approach since the promoted replica is guaranteed to be current.
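As an illustration, the tuning described above maps onto Patroni's dynamic configuration roughly like this. The values are examples only, not recommendations: Patroni expects ttl to be at least loop_wait plus twice retry_timeout, and tighter settings increase both etcd write load and false-positive risk.

```yaml
# Illustrative Patroni DCS settings for faster detection.
# Defaults are ttl: 30, loop_wait: 10, retry_timeout: 10.
bootstrap:
  dcs:
    ttl: 10                 # leader key expires sooner -> faster failure detection
    loop_wait: 2            # main loop runs every 2s instead of 10s
    retry_timeout: 3        # give up on slow DCS operations faster
    synchronous_mode: true  # the promoted replica is guaranteed current
```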
Q: What is the difference between automatic failover and continuous availability solutions like CockroachDB?
Automatic failover recovers from node failure by promoting a replica. There is always a gap — even if brief — between the primary failing and the replica becoming primary. During this gap, the database is unavailable for writes. Continuous availability solutions like CockroachDB use Raft consensus to replicate writes to a quorum of nodes before acknowledging the client. If a node fails, the Raft group continues accepting writes with no gap — the failure is invisible to the application because a new leader is elected within milliseconds. The tradeoff is latency: Raft consensus requires round-trips to a quorum of nodes, so write latency is higher than a single-node PostgreSQL. CockroachDB’s model works best for globally distributed workloads where you need zero-downtime recovery and can tolerate higher write latency.
DNS TTL Management During Failover
When the primary changes, applications must connect to the new primary. DNS-based discovery handles this: point your database hostname to the current primary’s IP, update DNS on failover.
The problem: DNS caching. If your TTL is 1 hour, clients might keep connecting to the failed IP for an hour after failover.
Low TTL Before Failover
Set a low TTL (60 seconds or less) on your database DNS records before you need failover. This lets you update quickly when the time comes.
gcloud dns record-sets transaction start --zone=db-zone
# Remove the A record pointing at the old primary (IPs are placeholders;
# the TTL was already lowered to 60s before the failover).
gcloud dns record-sets transaction remove "10.0.1.5" \
    --name=db-primary.example.com. --type=A --ttl=60 --zone=db-zone
# Point the same hostname at the promoted replica's IP.
gcloud dns record-sets transaction add "10.0.1.6" \
    --name=db-primary.example.com. --type=A --ttl=60 --zone=db-zone
gcloud dns record-sets transaction execute --zone=db-zone
Application-Level DNS Caching
Applications cache DNS lookups at the OS or library level regardless of your TTL. Ensure your database client re-resolves hostnames on connection failure or periodically.
Some clients have dns_refresh_rate settings. Others need connection pooling libraries like PgBouncer or ProxySQL to handle reconnection gracefully.
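A simple defense is to resolve the hostname fresh on every reconnect rather than reusing a cached address. A minimal sketch using only the standard library; the hostname and port are placeholders:

```python
import socket

# Sketch: look up the database hostname's current addresses at connect
# time, so a post-failover DNS update is picked up on the next attempt.

def resolve_db_host(hostname, port=5432):
    """Return the current IP addresses for the database hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); the
    # address is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})

print(resolve_db_host("localhost"))  # e.g. ['127.0.0.1']
```

Whether this helps in practice depends on the driver: some resolve once at pool creation and never again, which is exactly the behavior this works around.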
When to Use / When Not to Use Automatic Failover
Use automatic failover when:
- Your SLA requires fast recovery (under 60 seconds)
- You run multi-AZ deployments where node failure is a realistic scenario
- You have 3+ database nodes so quorum-based promotion is safe
- Your application handles brief write unavailability gracefully
Do not use automatic failover when:
- You have only 2 nodes — quorum-based failover cannot work safely with 2
- Your database runs on shared storage (SAN) where primary and replica share failure domain
- Your team can respond faster manually than the automation can detect and act
- You lack monitoring that distinguishes node failure from slow node
Automatic vs Semi-Automatic vs Manual Failover Trade-offs
| Dimension | Automatic Failover | Semi-Automatic (Operator-Triggered) | Manual Failover |
|---|---|---|---|
| MTTR | 30-60 seconds | 2-5 minutes | 5-15 minutes |
| False positive risk | Higher — can trigger on transient issues | Lower — human verifies first | None |
| Split-brain risk | Moderate without quorum, low with quorum | Very low | None |
| Operational complexity | High | Medium | Low |
| Best suited for | 24/7 services with SLA | Business hours ops, critical data | Non-critical dev/staging |
| Requires | Quorum, fencing, watchdog | Alerting + runbook | On-call engineer |
| Data loss risk | Low (with sync rep) | Very low | Depends on async lag |
What Automation Cannot Solve
Automation handles node failure, not root cause failure. If a bad deployment corrupts data on the primary, promoting a replica spreads the corruption. You still need backups, point-in-time recovery, and tests that verify your recovery procedures actually work.
Automation also does not eliminate runbooks. When something unexpected happens, operators need documented procedures, not just alerts.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Spurious failover from network blip | Primary demoted unnecessarily, brief write outage | Require multiple consecutive failures, use longer detection windows |
| Split-brain from network partition | Two primaries accept writes, data diverges | Use quorum-based promotion, implement fencing (STONITH) |
| Fencing failure leaves old primary writing | Data corruption, split-brain continues | Test fencing regularly, monitor fencing state |
| Promotion of lagging replica | Data loss — missing writes not yet replicated | Check replication position before promotion, use semi-sync replication |
| DNS TTL too high | Clients keep connecting to failed primary for minutes | Set TTL to 60s or lower before failover, use connection pooling |
| Cascade failure during failover | Multiple nodes fail during transition | Keep cluster stable during failover, avoid concurrent config changes |
| Automation runs during planned maintenance | Primary demoted while operator working | Pause automation during maintenance windows, use manual mode |
Conclusion
Automatic failover separates resilient systems from expensive PagerDuty schedules. The core components: reliable failure detection, quorum or fencing to prevent split-brain, and DNS or connection routing that updates when the primary changes.
Test your failover process before you need it. A failover that works in theory but fails under pressure is a liability, not a safety net.
Related Posts
- Database Replication — Replication fundamentals these systems build on
- Database Scaling — Scaling strategies including replication
- Consistent Hashing — Related distribution mechanisms