Database Replication: Master-Slave and Failover Patterns
Replication keeps copies of your data on multiple servers. When one server fails, others take over. Replication also distributes read load. The topology you choose affects consistency, availability, and operational complexity.
I have seen replication fail in subtle ways. A replica falls behind, queries return stale data, and nobody notices until a customer complains. Understanding replication mechanics helps you build robust systems.
Replication Fundamentals
Replication copies data from one database to another. The source is the primary. The copies are replicas. Changes propagate from primary to replicas.
```mermaid
graph TD
    App[Application] --> Primary[(Primary DB)]
    Primary -->|replicate| Replica1[(Replica 1)]
    Primary -->|replicate| Replica2[(Replica 2)]
    Primary -->|replicate| Replica3[(Replica 3)]
```
Replication can be synchronous or asynchronous. Synchronous replication waits for replicas to confirm writes before reporting success. Asynchronous replication confirms writes immediately and propagates in the background.
Synchronous replication guarantees that replicas have committed data. If the primary fails, no data is lost. The cost is latency. Every write waits for replica confirmation.
Asynchronous replication writes immediately on primary. Replicas receive changes eventually. This is faster but data might be lost if the primary fails before replicas receive the update.
Master-Slave Replication
Master-slave is the most common topology. The master handles writes. Slaves handle reads. All writes go to master. Reads distribute across slaves.
```mermaid
graph LR
    AppW[Write] --> Master[(Master)]
    AppR1[Read] --> Slave1[(Slave 1)]
    AppR2[Read] --> Slave2[(Slave 2)]
    Master -->|SQL replication| Slave1
    Master -->|SQL replication| Slave2
```
PostgreSQL calls this streaming replication. MySQL calls it source-replica replication (formerly master-slave). The mechanics differ but the topology is the same.
Slave lag is a common problem. Asynchronous replication means slaves might lag behind the master. If a slave falls behind, reads return stale data.
```sql
-- PostgreSQL: check replication lag
SELECT client_addr, state, sent_lsn, write_lsn,
       flush_lsn, replay_lsn
FROM pg_stat_replication;
```
Monitoring lag helps. Alert when lag exceeds thresholds. Route reads away from lagging replicas.
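Lag-aware read routing can be sketched in a few lines. This is an illustrative helper, not any driver's API: `Replica`, `pick_read_replica`, and the lag figures are hypothetical, and a real system would pull live lag from `pg_stat_replication` or a proxy.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    lag_seconds: float  # hypothetical: fed from monitoring in practice

def pick_read_replica(replicas: list[Replica], max_lag_seconds: float = 5.0):
    """Return the least-lagged replica under the threshold, or None.

    Returning None signals the caller to fall back to the primary
    rather than serve a read from a badly lagging replica.
    """
    healthy = [r for r in replicas if r.lag_seconds <= max_lag_seconds]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.lag_seconds)

replicas = [Replica("replica1", 0.4), Replica("replica2", 12.0)]
chosen = pick_read_replica(replicas)  # replica2 is excluded for lagging
```

The same threshold should drive your alerting, so routing and alerts agree on what "too stale" means.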
Multi-Master Replication
Multi-master allows writes to multiple primaries. Any master can handle reads and writes. This improves write availability. If one master fails, others continue accepting writes.
```mermaid
graph TD
    App1[App 1] --> Master1[(Master 1)]
    App2[App 2] --> Master2[(Master 2)]
    Master1 -->|bidirectional| Master2
    Master2 -->|bidirectional| Master1
```
The complexity is conflict resolution. Two masters might modify the same row simultaneously. Who wins?
```sql
-- Conflict: both masters update user email
-- Master 1: UPDATE users SET email='a@test.com' WHERE id=1
-- Master 2: UPDATE users SET email='b@test.com' WHERE id=1
-- Result depends on conflict resolution strategy
```
Different systems handle conflicts differently. Last-write-wins uses timestamps, which can be unreliable across servers. Some systems track conflicts and allow applications to resolve them.
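A minimal sketch of last-write-wins, assuming each write carries a `(timestamp, node_id, value)` version tuple (the helper name and tuple shape are hypothetical). Tuple comparison breaks timestamp ties by node id, which is deterministic but arbitrary:

```python
def last_write_wins(local: tuple, remote: tuple) -> tuple:
    """Resolve a multi-master conflict with last-write-wins.

    Versions are (timestamp, node_id, value) tuples; Python's tuple
    comparison orders by timestamp first, then node_id as a tiebreaker.
    Caveat from the text above: wall clocks skew across servers, so
    LWW can silently discard the write that was "really" latest.
    """
    return max(local, remote)

a = (1700000000.123, "master1", "a@test.com")
b = (1700000000.123, "master2", "b@test.com")
winner = last_write_wins(a, b)  # tie on timestamp, "master2" wins lexically
```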
Multi-master is rarely worth the complexity. Write conflicts are hard to reason about. Most applications do not need multi-master. Master-slave with failover handles most availability requirements.
Synchronous vs Asynchronous
Synchronous replication provides stronger guarantees. A transaction commits only when replicas confirm the write. If a replica fails, the write fails. This guarantees zero data loss on primary failure.
```sql
-- Synchronous replication behavior
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Waits for replica confirmation before proceeding
COMMIT;
```
The cost is latency. Writes take longer because they wait for network round trips to replicas. In geo-distributed setups, this latency is significant.
Asynchronous replication writes immediately. The primary confirms writes without waiting for replicas. This is faster but some data loss is possible if the primary fails.
```sql
-- Asynchronous replication behavior
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Returns immediately, replicates in background
COMMIT;
```
Most deployments use asynchronous replication. The performance benefit outweighs the small window of potential data loss. If durability matters, synchronous replication to at least one nearby replica provides a good balance.
Handling Replica Acknowledgment Timeouts
Synchronous replication requires waiting for replica confirmations. When a replica fails to acknowledge in time, you need proper handling to avoid blocking writes indefinitely.
```python
import time
from enum import Enum
from dataclasses import dataclass
from typing import Optional


class ReplicationMode(Enum):
    SYNCHRONOUS = "synchronous"
    SEMI_SYNCHRONOUS = "semi_synchronous"
    ASYNCHRONOUS = "asynchronous"


@dataclass
class WriteResult:
    success: bool
    acknowledged_by: list[str]
    fallback_triggered: bool = False
    error: Optional[str] = None


class ReplicaTimeoutHandler:
    """
    Handles replica acknowledgment timeouts for synchronous replication.
    Supports configurable timeout and fallback strategies.
    """

    def __init__(
        self,
        timeout_seconds: float = 5.0,
        mode: ReplicationMode = ReplicationMode.SYNCHRONOUS,
        min_acks_required: int = 1,
    ):
        self.timeout_seconds = timeout_seconds
        self.mode = mode
        self.min_acks_required = min_acks_required

    def write_with_ack(self, data: str, replicas: list[str]) -> WriteResult:
        """
        Write to replicas with acknowledgment handling.
        Returns result with acknowledgment status.
        """
        acks_received = []
        start_time = time.time()
        for replica in replicas:
            try:
                ack = self._wait_for_ack(replica, start_time)
                if ack:
                    acks_received.append(replica)
            except TimeoutError:
                if self.mode == ReplicationMode.SYNCHRONOUS:
                    return WriteResult(
                        success=False,
                        acknowledged_by=acks_received,
                        error=f"Timeout waiting for {replica}",
                    )
        # Check if we have enough acks
        if len(acks_received) >= self.min_acks_required:
            return WriteResult(
                success=True,
                acknowledged_by=acks_received,
                fallback_triggered=len(acks_received) < len(replicas),
            )
        # Semi-synchronous fallback
        if self.mode == ReplicationMode.SEMI_SYNCHRONOUS:
            return WriteResult(
                success=True,
                acknowledged_by=acks_received,
                fallback_triggered=True,
            )
        return WriteResult(
            success=False,
            acknowledged_by=acks_received,
            error="Insufficient acknowledgments",
        )

    def _wait_for_ack(self, replica: str, start_time: float) -> bool:
        """Simulate waiting for replica acknowledgment.

        The deadline is shared across all replicas: the whole write
        must complete within timeout_seconds.
        """
        elapsed = time.time() - start_time
        remaining = self.timeout_seconds - elapsed
        if remaining <= 0:
            raise TimeoutError(f"Replica {replica} acknowledgment timed out")
        # Simulate network call with timeout
        # In practice: connection.execute("WAIT FOR REPLICA ACK", timeout=remaining)
        return True  # Simulated success
```
Key timeout handling strategies:
| Strategy | Behavior | Use When |
|---|---|---|
| Fail on timeout | Write fails if replica doesn’t ack in time | Data durability is critical |
| Fallback to async | Switch to async if timeout occurs | Availability matters more than durability |
| Queue and retry | Queue writes for later replica ack | Network hiccups are temporary |
```python
# Example: Semi-synchronous with fallback
handler = ReplicaTimeoutHandler(
    timeout_seconds=2.0,
    mode=ReplicationMode.SEMI_SYNCHRONOUS,
    min_acks_required=1,
)
result = handler.write_with_ack("transaction_data", ["replica1", "replica2"])
if result.success:
    if result.fallback_triggered:
        print(f"Write succeeded with {len(result.acknowledged_by)} acks (fallback mode)")
    else:
        print(f"Full synchronous write confirmed by {len(result.acknowledged_by)} replicas")
else:
    print(f"Write failed: {result.error}")
```
Trade-Off Comparison: Synchronous vs Asynchronous Replication
| Dimension | Synchronous Replication | Asynchronous Replication |
|---|---|---|
| Write Latency | High (waits for replica confirmation) | Low (returns immediately) |
| Data Durability | Zero data loss on primary failure | Small window of potential data loss |
| Availability | Lower (can block if replica fails) | Higher (continues even if replicas lag) |
| Network Dependency | Requires low-latency network to replicas | Tolerates high-latency links |
| Cross-DC Support | Impractical across continents | Works well geo-distributed |
| RPO (Recovery Point Objective) | Zero | Minimal (lag-dependent) |
| RTO (Recovery Time Objective) | Faster (replica is current) | May be slower (replica must catch up) |
| Replica Load | Higher (must keep up with primary) | Lower (can lag) |
| Typical Use Cases | Financial transactions, inventory, critical data | Web sessions, analytics, social media feeds |
Decision Flowchart: When to Choose Sync vs Async
Use this flowchart to determine which replication strategy fits your use case:
```mermaid
flowchart TD
    Start[Start: Choose Replication Strategy] --> Q1{Can you tolerate data loss?}
    Q1 -->|No, zero tolerance| Sync[Synchronous Replication]
    Q1 -->|Yes, small gap OK| Q2{Can you accept higher write latency?}
    Q2 -->|Yes, latency acceptable| Sync
    Q2 -->|No, need low latency| Q3{Are nodes geo-distributed?}
    Q3 -->|Yes, cross-continent| Async[Asynchronous Replication]
    Q3 -->|No, same datacenter| Q4{What's your RPO requirement?}
    Q4 -->|RPO = 0| Sync
    Q4 -->|RPO > 0| Q5{Can replicas occasionally block writes?}
    Q5 -->|No, availability critical| Async
    Q5 -->|Yes, can tolerate blocking| Sync
    Sync --> SyncUse[Use synchronous replication to at least one nearby replica]
    Async --> AsyncUse[Use asynchronous replication with lag monitoring]
```
Quick Decision Guide:
| Question | If Yes | If No |
|---|---|---|
| Financial transaction? | Sync | Continue |
| Cross-datacenter (>100ms RTT)? | Async | Continue |
| Can tolerate 1-5ms extra latency? | Sync | Async |
| Write-heavy workload? | Async | Continue |
| RPO must be zero? | Sync | Async |
Quorum Formula Mathematical Derivation
The quorum condition R + W > N is central to understanding strong consistency in replicated databases. Let us derive it formally and prove why it gives strong consistency guarantees.
The Problem Setup
Consider a replicated database with N replicas. To read a value, you must contact R replicas and return the most recent write among them. To write a value, you must contact W replicas and wait for all of them to acknowledge the write.
The fundamental question: How do we guarantee that every read sees the most recent write?
Formal Proof
Let us prove that R + W > N guarantees strong consistency.
Theorem: In a replicated system with N replicas, if you require W acknowledgments for writes and query R replicas for reads, then every read is guaranteed to see the most recent write if and only if R + W > N.
Proof Sketch:

1. Notation:
   - Let `W_set` be the set of replicas that acknowledged the latest write, with `|W_set| = W`
   - Let `R_set` be the set of replicas contacted for the current read, with `|R_set| = R`
   - Let `X = W_set ∩ R_set` be the intersection (replicas that have the write AND are contacted for the read)
   - Let `U = W_set ∪ R_set` be the union
2. Cardinality Bound: By the principle of inclusion-exclusion, `|X| = |W_set| + |R_set| - |U| = W + R - |U|`. Since there are only N replicas in total, `|U| ≤ N`, and therefore `|X| = W + R - |U| ≥ W + R - N`.
3. The Intersection Must Be Non-Empty: If `R + W > N`, then `|X| ≥ W + R - N > 0`, so `|X| ≥ 1`: at least one replica in the read set also has the latest write.
4. The Read Always Sees the Latest Write: Because the read quorum and write quorum overlap, the read coordinator can compare timestamps or version numbers across the overlapping replicas. The overlapping replica(s) definitely have the latest write, so the read returns the correct value.

Conversely, if `R + W ≤ N`, the lower bound `W + R - N ≤ 0` gives no guarantee: the read and write sets may be disjoint, and the read might miss the latest write entirely.
Why This Matters for Consistency
The quorum overlap guarantee means you cannot have a situation where:
- A write is committed (W replicas acknowledged it)
- A subsequent read contacts only replicas that did NOT receive the write
- The read returns stale data
With R + W > N, this scenario is impossible by construction.
Examples
Example 1: N=3, R=2, W=2
R + W = 2 + 2 = 4 > 3
Write quorum: {A, B}. Read quorum: {B, C}. Intersection: {B} - non-empty, read sees latest write.
Example 2: N=5, R=2, W=3
R + W = 2 + 3 = 5, NOT > 5
Write quorum: {A, B, C}. Read quorum: {D, E}. Intersection: {} - empty. Read may return stale data.
Example 3: N=5, R=3, W=3
R + W = 3 + 3 = 6 > 5
Write quorum: {A, B, C}. Read quorum: {C, D, E}. Intersection: {C} - non-empty.
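The overlap check in these examples reduces to two one-line functions (illustrative helpers, not from any library):

```python
def quorum_guarantees_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum of size r must intersect every
    write quorum of size w among n replicas: the R + W > N condition."""
    return r + w > n

def min_overlap(n: int, r: int, w: int) -> int:
    """Smallest possible intersection of a read and a write quorum,
    from the cardinality bound |X| >= W + R - N (floored at zero)."""
    return max(0, r + w - n)
```

Running the three examples through these: N=3, R=2, W=2 overlaps; N=5, R=2, W=3 does not; N=5, R=3, W=3 guarantees at least one common replica.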
Cassandra-Style Quorum
Cassandra uses a variant where consistency level is configurable:
```python
def cassandra_consistency_level(n: int, r: int, w: int) -> str:
    """
    Determine the consistency guarantee for Cassandra-style settings.
    """
    if r + w > n:
        return "Strong consistency (read quorum always overlaps write quorum)"
    elif r + w == n:
        return "Boundary case (quorums can just miss each other; stale reads possible)"
    else:
        return "Eventual consistency (may read stale data)"

# Cassandra consistency levels
# LOCAL_ONE: R=1, no guarantee
# QUORUM: R=W=floor(N/2)+1 -> guarantees R+W > N
# ALL: R=W=N -> maximum consistency, minimum availability
# EACH_QUORUM: R=W=quorum in EACH datacenter
```
Strong Consistency vs Linearizability
The R + W > N condition gives you strong consistency for single-key operations. However, for linearizability across operations (real-time ordering guarantees), or for atomicity across multiple keys, you need additional mechanisms:
- Paxos or Raft: Consensus protocols that order operations globally
- Spanner’s TrueTime: Physical time with bounded uncertainty
- Hybrid Logical Clocks: Physical time + logical counters
```python
# R+W>N gives you strong consistency for single keys
# For multi-key transactions, you need something stronger
def check_linearizability_requirements(system_type: str) -> str:
    """
    What additional guarantees does each system provide?
    """
    if system_type == "dynamo_style":
        return "Eventual consistency per key; R+W>N for single-key strong consistency"
    elif system_type == "cassandra":
        return "Tunable per-query; QUORUM gives R+W>N per key"
    elif system_type == "spanner":
        return "Linearizable via TrueTime (hardware clocks + GPS)"
    elif system_type == "cockroachdb":
        return "Linearizable via HLC and distributed transactions"
    else:
        return "Unknown consistency model"
```
Summary
The quorum formula R + W > N is derived from basic set theory:

- The intersection of read and write quorums must be non-empty
- This requires `W + R - N > 0`, which is `R + W > N`
- This guarantees every read sees at least one replica with the latest write
- Without this condition, read and write sets may be disjoint, allowing stale reads
This is why quorum-based replication is the foundation of consistent distributed systems like DynamoDB, Cassandra, and etcd.
Failover Patterns
When the primary fails, a replica must take over. This is failover. Automating failover reduces downtime but introduces complexity.
Manual Failover
Manual failover is simple. When the primary fails, an operator promotes a replica to primary. Traffic redirects. Replicas reconfigure to follow the new primary.
The downside is time. Minutes of downtime while failover happens. For some applications, this is acceptable. For others, it is not.
Automatic Failover
Automatic failover detects primary failure and promotes a replica automatically. This reduces downtime but requires careful implementation.
```mermaid
graph TD
    Monitor[Monitor] -->|primary unreachable| Election[Election]
    Election --> Replica1{Which replica?}
    Replica1 -->|most advanced| Promote[(Promote to Primary)]
    Promote --> NewMaster[(New Primary)]
    NewMaster --> Replica2[(Old Replica 2)]
```
Failure detection is tricky. How do you distinguish between primary being slow and primary being down? False positives cause unnecessary failovers. False negatives leave you without a writable primary.
Most systems use a consensus approach. Multiple monitors must agree the primary is down before failover proceeds. This prevents split-brain scenarios where two servers think they are primary.
Failover Timeline Sequence
Here is what actually happens during an automated failover, step by step. The short version: writes go down for a bit, then come back up.
```mermaid
sequenceDiagram
    participant App as Application
    participant Primary as Primary DB
    participant Replica1 as Replica 1
    participant Replica2 as Replica 2
    participant Monitor as Failover Monitor
    participant Proxy as DB Proxy
    Note over Primary,Replica1: Normal operation — async replication
    Primary->>Replica1: WAL entries (async)
    Primary->>Replica2: WAL entries (async)
    App->>Primary: WRITE (committed)
    Primary->>App: WRITE OK
    Note over Primary: T1: Primary becomes unreachable
    Monitor->>Primary: Health check (timeout)
    Monitor->>Primary: Health check (timeout)
    Monitor->>Primary: Health check (timeout)
    Monitor->>Replica1: Is primary reachable? (no response)
    Monitor->>Replica2: Is primary reachable? (no response)
    Note over Monitor: T2: Consensus — primary is down
    Monitor->>Replica1: Get LSN position
    Replica1->>Monitor: LSN: 88400
    Monitor->>Replica2: Get LSN position
    Replica2->>Monitor: LSN: 88350
    Note over Monitor: Replica1 is most advanced
    Monitor->>Replica1: PROMOTE to primary
    Monitor->>Proxy: New primary = Replica1
    Note over Replica1: T3: Replica1 promoted — brief window of unavailability
    Replica1->>Replica2: Reconfigure to follow new primary
    Replica2->>Replica1: Acknowledged
    Note over App,Primary: T4: Application reconnects
    App->>Proxy: WRITE (retry)
    Proxy->>Replica1: WRITE
    Replica1->>App: WRITE OK
    Note over App,Replica1: Write availability restored
```
Timeline breakdown:
| Phase | What Happens | Duration |
|---|---|---|
| T1 | Primary stops responding to health checks | Set by your timeout configuration |
| T2 | Monitor agrees the primary is dead, picks the most current replica | Usually 3-10 seconds |
| T3 | That replica gets promoted, other replicas switch to follow it | 5-30 seconds |
| T4 | Applications reconnect through the proxy | 1-5 seconds |
Unplanned failovers usually land somewhere between 10 and 60 seconds end to end. Planned ones are faster because the old primary is still up and coordination is simpler.
Replication Lag and Its Effects
Replication lag causes subtle problems. Applications might write to primary and immediately read from a replica. The read might hit a replica that has not yet received the write.
```python
# This can fail with replication lag
update_user(user_id, {"email": "new@test.com"})
user = read_from_replica(user_id)  # Might still have old email
```
Read-after-write consistency requires either reading from primary after writes or waiting for replication to catch up before reading from replicas.
```python
# Solution 1: Read from primary after writes
update_user(user_id, {"email": "new@test.com"})
user = read_from_primary(user_id)  # Always current

# Solution 2: Wait for replication
update_user(user_id, {"email": "new@test.com"})
wait_for_replication_lag(user_id, timeout=5)
user = read_from_replica(user_id)  # Now includes the write
```
Some systems offer synchronous replication to eliminate lag. PostgreSQL synchronous replication guarantees the standby has received the write before committing.
Cascading Replication
Cascading replication chains replicas. A replica follows another replica rather than directly following the primary. This reduces load on the primary when you have many replicas.
```mermaid
graph TD
    Primary[(Primary)] --> Replica1[(Replica 1)]
    Replica1 --> Replica2[(Replica 2)]
    Replica1 --> Replica3[(Replica 3)]
    Replica2 --> Replica4[(Replica 4)]
```
Cascading increases replication lag. Data flows through intermediate replicas before reaching leaf replicas. For read-heavy workloads where staleness is acceptable, this is fine.
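As a rough model, the lag observed at a leaf replica is the sum of the lag introduced at each hop of the cascade. The function below is illustrative only; real end-to-end lag also depends on apply throughput at each tier:

```python
def end_to_end_lag(per_hop_lag_seconds: list[float]) -> float:
    """Approximate lag at a leaf replica as the sum of per-hop lag.

    Each element is the lag one tier adds relative to its upstream.
    Simplified model: it ignores variance and apply-rate limits.
    """
    return sum(per_hop_lag_seconds)

# Primary -> Replica1 -> Replica2 -> Replica4 (the longest chain above),
# with hypothetical per-hop lag figures
leaf_lag = end_to_end_lag([0.5, 1.2, 2.0])
```

This is why keeping the chain short is listed as a mitigation later: each tier you remove subtracts its entire contribution from the leaf's staleness.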
When to Use and When Not to Use Each Replication Topology
Master-Slave Replication:
- Use when: You need read scale and read replicas; you need high availability for reads
- Do not use when: You need write scaling (all writes still go to master); you need strong consistency across replicas
Multi-Master Replication:
- Use when: You need write availability across geographic regions; no single master can handle all writes
- Do not use when: You have high write conflict potential; your team lacks experience with conflict resolution
Synchronous Replication:
- Use when: Data durability is critical; you cannot afford any data loss on primary failure
- Do not use when: Network latency between nodes is high; write throughput is more important than immediate durability
Asynchronous Replication:
- Use when: Write latency is critical; some data loss is acceptable; nodes are geographically distributed
- Do not use when: You need zero RPO; strict durability guarantees are required
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Primary failure without automatic failover | Writes fail, extended downtime | Implement automated failover, test failover regularly, have runbooks |
| Replication break due to divergent data | Replica stops replicating, falls behind | Use pt-table-checksum to detect drift, implement resync procedure |
| Network partition causing split-brain | Two primaries accept writes independently | Use quorum-based failover, implement fencing, test network partition scenarios |
| Replication lag buildup | Read replicas return stale data | Monitor lag, provision stronger replicas, switch to synchronous replication |
| Long-running query holding replication slots | Replication cannot advance, WAL accumulates | Kill long-running queries, tune max_slot_wal_keep_size |
| Cascading replication chain failure | Leaf replicas fall far behind | Keep replication chain short, monitor all replicas |
| Accidental replica promotion | Application writes go to wrong server | Protect against manual promotion, require multiple approvals |
| WAL corruption on primary | Replicas replicate corrupted data | Enable checksums, verify replica integrity regularly |
Replica Promotion Failure Mode: Write Burst Risk
When a primary fails during high write throughput, automatic failover promotes a replica. Here is what can go wrong.
If async replication was in use, the replica may not have received the latest writes. During peak load, replication lag easily reaches seconds or more. Once that replica becomes the new primary, it has stale data. Any writes accepted after promotion that depend on the missing writes create inconsistency.
Picture a $100 transfer from account A to account B. The primary fails after writing the debit but before replicating it. The replica is promoted. The client was told the transfer succeeded, yet the debit never reached the new primary: account A still shows its pre-transfer balance, and the system has silently lost a committed write.
Once promoted, the new primary gets a flood of writes. Buffered requests retry, new requests arrive, and the burst can overwhelm the new primary before it stabilizes. This makes it harder to catch up on replication from other replicas, even with semi-synchronous replication.
```
# Timeline of promotion failure
# T1: Primary writes "debit account A: $100" (async replication)
# T2: Primary fails before replicating to Replica1
# T3: Replica1 promoted to new primary
# T4: Client reads balance of account A from new primary (returns $900, missing debit)
# T5: Client transfers $100 from account A to account B (new primary accepts)
# T6: New primary now has inconsistent state ($100 debit from T1 is lost)
```
Mitigation strategies:
1. Use semi-synchronous replication: Require at least one replica to acknowledge writes before committing. The promoted replica then has at least one durable copy of recent writes.
2. Verify replica lag before promotion: Most failover tools check lag before promoting. If lag exceeds a threshold, wait or alert an operator instead of auto-promoting.
3. Implement guardrails on the application side: Use idempotent write patterns so duplicate writes at the new primary do not cause inconsistency. Design the application to treat promotion as a potential data integrity event requiring reconciliation.
4. Use quorum-based replication: With replication factor 3 and W=2 acknowledgments, at least one of the two nodes that acknowledged must have the latest write. Quorum systems like DynamoDB or Cassandra handle this more gracefully than simple primary-replica setups.
5. Plan for post-promotion reconciliation: Accept that some data may be inconsistent after a promotion under load. Run a reconciliation process after failover to detect and correct drift.
Automated failover reduces downtime but introduces this risk. Manual failover with operator verification is slower but reduces the risk. For financial or other high-integrity systems, synchronous replication or quorum-based approaches are worth the latency cost.
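The idempotent-write guardrail can be sketched like this. Everything here is hypothetical (the class name, the in-memory key set, the balances); in production the processed-keys record lives in the database and is written in the same transaction as the debit:

```python
class IdempotentWriter:
    """Sketch of idempotency keys guarding retried writes after failover.

    If a client retries a transfer against the newly promoted primary,
    the key ensures the debit is applied at most once, even when the
    retry storm after promotion replays requests.
    """

    def __init__(self):
        self.processed: set[str] = set()
        self.balance = {"A": 1000}  # hypothetical starting balance

    def debit(self, idempotency_key: str, account: str, amount: int) -> str:
        if idempotency_key in self.processed:
            return "duplicate-ignored"  # retry detected, no double debit
        self.processed.add(idempotency_key)
        self.balance[account] -= amount
        return "applied"

w = IdempotentWriter()
first = w.debit("txn-42", "A", 100)   # original request
second = w.debit("txn-42", "A", 100)  # retry after failover
```

The key must be generated by the client before the first attempt, so the retry carries the same key regardless of which primary serves it.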
Observability Checklist
Metrics to Monitor:
- Replication lag in seconds (primary vs each replica)
- Replica IO and SQL thread status
- Replica heartbeat/lag metrics
- WAL generation rate vs replay rate
- Replication slot age (if using slots)
- Disk usage on replicas vs primary
- Connection count per replica
- Error count on replication threads
Logs to Capture:
- Replication state changes (connect, disconnect, error)
- Replication conflict events (duplicate key, deadlocks)
- Long-running queries on replicas blocking replication
- Checkpoint and WAL activity
- Failover events and decisions
Alerts to Set:
- Replication lag > 30 seconds (adjust to your SLA)
- Replica IO/SQL thread stopped
- Replication slot age > threshold
- Disk usage on replica > 80%
- High number of replication conflicts
- Replica connection failures
```sql
-- PostgreSQL: Monitor replication status
SELECT client_addr, state, sent_lsn, write_lsn,
       flush_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;

-- MySQL: Check replica status
SHOW REPLICA STATUS\G
```
Security Checklist
- Use strong authentication for replication connections (scram-sha-256, certificates)
- Encrypt replication traffic with TLS
- Restrict replication user to replica connections only
- Implement network segmentation for replication traffic
- Use separate credentials for replication (not application credentials)
- Audit replication configuration changes
- Rotate replication credentials regularly
- Monitor authentication failures on replication ports
- Verify replica SSL certificates
Common Pitfalls and Anti-Patterns
1. Assuming replicas are consistent immediately after writes: With async replication, replicas lag. Do not read from replicas for critical data immediately after writes.
2. Not monitoring replication lag: Lag accumulates silently until users complain. Monitor proactively and alert on threshold breaches.
3. Promoting replica without verifying data: An incompletely replicated replica promoted to primary causes data loss. Verify replication is current before promotion.
4. Using replicas for writes without understanding conflicts: Writes to replicas are typically replicated back. This can cause conflicts if not carefully designed.
5. Ignoring replication slot retention: Slots prevent WAL removal. If replicas fall behind, WAL accumulates and can fill disk.
6. Configuring weak authentication for replicas: Replicas with weak auth become attack vectors. Treat replica authentication as seriously as primary.
7. Not testing failover procedures: If you have never tested failover, you will discover problems during actual outages. Test regularly.
8. Mixing synchronous and asynchronous replication: This causes confusing behavior. Choose one replication mode per replica.
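For the slot-retention pitfall, a query along these lines shows how much WAL each PostgreSQL slot is holding back (`pg_replication_slots` and `pg_wal_lsn_diff` are standard PostgreSQL; run it against a live primary):

```sql
-- PostgreSQL: find replication slots retaining WAL
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
         AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```

An inactive slot with gigabytes of retained WAL is the usual disk-filling culprit; drop it or fix the replica behind it.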
Quick Recap
Key Bullets:
- Replication provides high availability and read scale; it does not scale writes
- Asynchronous replication is faster but allows stale reads and potential data loss
- Synchronous replication guarantees durability at the cost of latency
- Monitor replication lag on every replica and alert on anomalies
- Test failover procedures before you need them in production
- Multi-master creates conflict resolution complexity; use only when write availability demands it
Copy/Paste Checklist:
```sql
-- PostgreSQL: Set up synchronous replication
ALTER SYSTEM SET synchronous_commit = on;
ALTER SYSTEM SET synchronous_standby_names = 'replica1,replica2';
SELECT pg_reload_conf();  -- apply the ALTER SYSTEM changes

-- MySQL: Configure replica with GTID
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary-host',
  SOURCE_USER = 'repl_user',
  SOURCE_PASSWORD = 'secure_password',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;

-- Monitor replication health
SELECT * FROM pg_stat_replication;
SHOW REPLICA STATUS\G
```
Real-World Case Study: GitHub’s MySQL Replication Outage
In 2022, GitHub experienced a major outage caused by a cascading replication failure. The problem started with a planned maintenance window in which a primary database was taken offline. During the failover to a replica, the replica that was promoted had accumulated significant replication lag because a long-running analytical query had stalled its replication apply thread. When the failover occurred, the promoted replica was missing several hours of writes.
GitHub’s architecture used MySQL with asynchronous replication. The primary was replicating to a relay master in each datacenter, which then replicated to read replicas. When the relay master failed, the read replicas attached to it fell behind. The failover process did not adequately account for the lag differential between replicas attached to different relay masters.
The lesson from GitHub’s post-incident analysis was around replication topology complexity. With multiple tiers of replication, lag compounds at each level. A replica attached to a lagging relay master can be far behind the primary even if its own replication thread is healthy. Their fix involved reducing the replication tier depth (no more cascading relay masters), adding automated lag monitoring with hard failovers blocked when lag exceeded thresholds, and improving the failover runbook to require operators to verify replica lag before promoting.
Real-World Case Study: Amazon Aurora’s Replica Propagation
Amazon Aurora separates the storage layer from the compute layer. Rather than replicating SQL statements or WAL entries, Aurora replicates the storage volume itself across 6 storage nodes in 3 Availability Zones. When your Aurora primary writes a page, the write is acknowledged after 4 of 6 storage nodes have committed the page. This is a form of synchronous replication at the storage layer.
The benefit: replica lag in Aurora is measured in single-digit milliseconds under normal operation, because replicas read from the same shared storage volume as the primary. There is no replication lag in the traditional sense. A replica that is promoted does not need to catch up because it has been reading from the same storage from the moment it started.
The complexity: Aurora’s shared storage design means write throughput is limited by the storage layer’s capacity, not the compute layer. This trade-off works well for most applications but becomes a bottleneck for write-heavy workloads. Aurora also requires that replicas be in the same region as the primary — cross-region replication is asynchronous and has higher lag.
Interview Questions
Q: What happens to replication lag when you add a read replica that is significantly weaker than other replicas?
The weaker replica cannot replay WAL entries as fast as the stronger replicas. Classic MySQL replication applies changes on a single SQL thread (parallel appliers exist but have limits); if the apply thread cannot keep up with the primary's write rate, the weak replica falls behind. Meanwhile, stronger replicas stay current. The practical impact: if your application routes reads round-robin across replicas without checking lag, some users get stale data while others see current data. The fix is to either upgrade the weak replica's resources or exclude it from read distribution and use it only for specific low-throughput tasks.
Q: Your application writes to a primary and immediately reads from a read replica. Under asynchronous replication, what consistency guarantee does the read see?
The read sees eventual consistency at best. There is no guarantee that the replica has received the write. If replication lag is zero, the read sees the write. If lag is 5 seconds, the read sees data that is 5 seconds old. For read-after-write consistency, you must either read from the primary after writes or use semi-synchronous replication to guarantee at least one replica has the write before the primary acknowledges the transaction.
Q: What is the difference between synchronous and semi-synchronous replication?
Synchronous replication waits for all configured replicas to acknowledge writes before committing the transaction on the primary. If any replica does not acknowledge within a timeout, the write fails. Semi-synchronous replication waits for at least one replica to acknowledge, then commits on the primary. If no replica acknowledges within the timeout, it falls back to asynchronous replication and commits anyway. MySQL implements the at-least-one-ack model with its semi-synchronous replication plugin; in PostgreSQL, synchronous_standby_names controls which standbys must confirm, and synchronous_commit (remote_apply, on, remote_write) controls how far each confirmation must go.
Q: You are designing a replication strategy for a financial application where zero data loss on failover is required. What do you choose?
Use synchronous replication to at least one replica in the same datacenter. This guarantees that when the primary commits a transaction, at least one replica has already written it. On failover, the promoted replica is current with no data loss. The tradeoff is that every write now waits for a network round-trip to the replica. For financial applications where correctness matters more than write latency, this is the right trade. You also want to use semi-synchronous replication as a fallback if the synchronous replica becomes unavailable, and you want automated failover that promotes only if the replica is confirmed current.
Conclusion
Replication provides availability and read scale. Master-slave is the simplest topology. Writes go to master, reads distribute across replicas.
Synchronous replication guarantees durability at the cost of latency. Asynchronous replication is faster but leaves a small window for data loss.
Failover automation reduces downtime but adds complexity. Monitor replication lag and alert on anomalies. Unexpected staleness causes subtle bugs.
Multi-master sounds appealing for write availability but creates conflict resolution challenges. Master-slave with failover handles most needs.
For related reading, see Database Scaling for broader scaling strategies, and Horizontal Sharding for distributing writes across servers.