Database Replication: Master-Slave and Failover Patterns

Database replication explained: master-slave, multi-master, synchronous vs asynchronous strategies, failover patterns, and consistency.

Replication keeps copies of your data on multiple servers. When one server fails, others take over. Replication also distributes read load. The topology you choose affects consistency, availability, and operational complexity.

I have seen replication fail in subtle ways. A replica falls behind, queries return stale data, and nobody notices until a customer complains. Understanding replication mechanics helps you build robust systems.

Replication Fundamentals

Replication copies data from one database to another. The source is the primary. The copies are replicas. Changes propagate from primary to replicas.

graph TD
    App[Application] --> Primary[(Primary DB)]
    Primary -->|replicate| Replica1[(Replica 1)]
    Primary -->|replicate| Replica2[(Replica 2)]
    Primary -->|replicate| Replica3[(Replica 3)]

Replication can be synchronous or asynchronous. Synchronous replication waits for replicas to confirm writes before reporting success. Asynchronous replication confirms writes immediately and propagates in the background.

Synchronous replication guarantees that replicas have committed data. If the primary fails, no data is lost. The cost is latency. Every write waits for replica confirmation.

Asynchronous replication writes immediately on primary. Replicas receive changes eventually. This is faster but data might be lost if the primary fails before replicas receive the update.
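
A toy simulation makes the data-loss window concrete. The classes and names here are illustrative, not any real driver API:

```python
# Illustrative only: models the data-loss window of async replication.
class Primary:
    def __init__(self):
        self.committed = []      # writes acknowledged to the client
        self.replicated = []     # writes that have reached the replica

    def write_async(self, value):
        self.committed.append(value)   # acknowledge immediately

    def replicate_one(self):
        # Background replication ships one pending write.
        pending = self.committed[len(self.replicated):]
        if pending:
            self.replicated.append(pending[0])

p = Primary()
p.write_async("w1")
p.write_async("w2")
p.replicate_one()          # only w1 has reached the replica

# Primary crashes now: w2 was acknowledged but never replicated.
lost = [w for w in p.committed if w not in p.replicated]
print(lost)  # ['w2']
```

The acknowledged-but-unreplicated set is exactly what a promoted replica would be missing.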

Master-Slave Replication

Master-slave is the most common topology. The master handles writes. Slaves handle reads. All writes go to master. Reads distribute across slaves.

graph LR
    AppW[Write] --> Master[(Master)]
    AppR1[Read] --> Slave1[(Slave 1)]
    AppR2[Read] --> Slave2[(Slave 2)]
    Master -->|SQL replication| Slave1
    Master -->|SQL replication| Slave2

PostgreSQL calls this streaming replication. MySQL calls it source-replica (binary log) replication. The mechanics differ but the topology is the same.

Slave lag is a common problem. Asynchronous replication means slaves might lag behind the master. If a slave falls behind, reads return stale data.

-- PostgreSQL: check replication lag
SELECT client_addr, state, sent_lsn, write_lsn,
       flush_lsn, replay_lsn
FROM pg_stat_replication;

Monitoring lag helps. Alert when lag exceeds thresholds. Route reads away from lagging replicas.
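
A minimal sketch of lag-aware read routing, assuming lag numbers come from your monitoring (for instance the pg_stat_replication query above); the threshold and replica names are hypothetical:

```python
import random

MAX_LAG_SECONDS = 5.0  # illustrative threshold; tune to your SLA

def pick_read_replica(replica_lags: dict[str, float]) -> str:
    """Route reads to a replica whose measured lag is under the threshold.

    replica_lags maps replica name -> lag in seconds, as reported by
    monitoring. Falls back to the least-lagging replica if all exceed
    the threshold, so reads degrade gracefully instead of failing.
    """
    healthy = [r for r, lag in replica_lags.items() if lag <= MAX_LAG_SECONDS]
    if healthy:
        return random.choice(healthy)
    # Every replica is lagging: pick the freshest one.
    return min(replica_lags, key=replica_lags.get)

lags = {"replica1": 0.4, "replica2": 12.9, "replica3": 1.1}
print(pick_read_replica(lags))  # replica1 or replica3, never replica2
```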

Multi-Master Replication

Multi-master allows writes to multiple primaries. Any master can handle reads and writes. This improves write availability. If one master fails, others continue accepting writes.

graph TD
    App1[App 1] --> Master1[(Master 1)]
    App2[App 2] --> Master2[(Master 2)]
    Master1 -->|bidirectional| Master2
    Master2 -->|bidirectional| Master1

The complexity is conflict resolution. Two masters might modify the same row simultaneously. Who wins?

-- Conflict: both masters update user email
-- Master 1: UPDATE users SET email='a@test.com' WHERE id=1
-- Master 2: UPDATE users SET email='b@test.com' WHERE id=1
-- Result depends on conflict resolution strategy

Different systems handle conflicts differently. Last-write-wins uses timestamps, which can be unreliable across servers. Some systems track conflicts and allow applications to resolve them.
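
A last-write-wins resolver can be sketched in a few lines; this is a simplified illustration, not any particular database's implementation:

```python
from dataclasses import dataclass

@dataclass
class VersionedWrite:
    value: str
    timestamp: float   # wall-clock time of the write
    node_id: str       # tie-breaker when timestamps collide

def last_write_wins(a: VersionedWrite, b: VersionedWrite) -> VersionedWrite:
    """Resolve a conflict by timestamp, breaking ties on node id.

    Note the weakness the text describes: if the two masters' clocks
    are skewed, the "winner" may actually be the older write.
    """
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.node_id > b.node_id else b

w1 = VersionedWrite("a@test.com", 1700000000.120, "master1")
w2 = VersionedWrite("b@test.com", 1700000000.250, "master2")
print(last_write_wins(w1, w2).value)  # b@test.com
```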

Multi-master is rarely worth the complexity. Write conflicts are hard to reason about. Most applications do not need multi-master. Master-slave with failover handles most availability requirements.

Synchronous vs Asynchronous

Synchronous replication provides stronger guarantees. A transaction commits only when replicas confirm the write. If a replica fails, the write fails. This guarantees zero data loss on primary failure.

-- Synchronous replication behavior
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Waits for replica confirmation before proceeding
COMMIT;

The cost is latency. Writes take longer because they wait for network round trips to replicas. In geo-distributed setups, this latency is significant.

Asynchronous replication writes immediately. The primary confirms writes without waiting for replicas. This is faster but some data loss is possible if the primary fails.

-- Asynchronous replication behavior
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Returns immediately, replicates in background
COMMIT;

Most deployments use asynchronous replication. The performance benefit outweighs the small window of potential data loss. If durability matters, synchronous replication to at least one nearby replica provides a good balance.

Handling Replica Acknowledgment Timeouts

Synchronous replication requires waiting for replica confirmations. When a replica fails to acknowledge in time, you need proper handling to avoid blocking writes indefinitely.

import time
from enum import Enum
from dataclasses import dataclass
from typing import Optional

class ReplicationMode(Enum):
    SYNCHRONOUS = "synchronous"
    SEMI_SYNCHRONOUS = "semi_synchronous"
    ASYNCHRONOUS = "asynchronous"

@dataclass
class WriteResult:
    success: bool
    acknowledged_by: list[str]
    fallback_triggered: bool = False
    error: Optional[str] = None

class ReplicaTimeoutHandler:
    """
    Handles replica acknowledgment timeouts for synchronous replication.
    Supports configurable timeout and fallback strategies.
    """
    def __init__(
        self,
        timeout_seconds: float = 5.0,
        mode: ReplicationMode = ReplicationMode.SYNCHRONOUS,
        min_acks_required: int = 1
    ):
        self.timeout_seconds = timeout_seconds
        self.mode = mode
        self.min_acks_required = min_acks_required

    def write_with_ack(
        self,
        data: str,
        replicas: list[str]
    ) -> WriteResult:
        """
        Write to replicas with acknowledgment handling.
        Returns result with acknowledgment status.
        """
        acks_received = []
        start_time = time.time()

        for replica in replicas:
            try:
                ack = self._wait_for_ack(replica, start_time)
                if ack:
                    acks_received.append(replica)
            except TimeoutError:
                if self.mode == ReplicationMode.SYNCHRONOUS:
                    return WriteResult(
                        success=False,
                        acknowledged_by=acks_received,
                        error=f"Timeout waiting for {replica}"
                    )

        # Check if we have enough acks
        if len(acks_received) >= self.min_acks_required:
            return WriteResult(
                success=True,
                acknowledged_by=acks_received,
                fallback_triggered=len(acks_received) < len(replicas)
            )

        # Semi-synchronous fallback
        if self.mode == ReplicationMode.SEMI_SYNCHRONOUS:
            return WriteResult(
                success=True,
                acknowledged_by=acks_received,
                fallback_triggered=True
            )

        return WriteResult(
            success=False,
            acknowledged_by=acks_received,
            error="Insufficient acknowledgments"
        )

    def _wait_for_ack(self, replica: str, start_time: float) -> bool:
        """Simulate waiting for replica acknowledgment."""
        elapsed = time.time() - start_time
        remaining = self.timeout_seconds - elapsed

        if remaining <= 0:
            raise TimeoutError(f"Replica {replica} acknowledgment timed out")

        # Simulate network call with timeout
        # In practice: connection.execute("WAIT FOR REPLICA ACK", timeout=remaining)
        return True  # Simulated success

Key timeout handling strategies:

| Strategy | Behavior | Use When |
|---|---|---|
| Fail on timeout | Write fails if replica doesn't ack in time | Data durability is critical |
| Fallback to async | Switch to async if timeout occurs | Availability matters more than durability |
| Queue and retry | Queue writes for later replica ack | Network hiccups are temporary |

# Example: Semi-synchronous with fallback
handler = ReplicaTimeoutHandler(
    timeout_seconds=2.0,
    mode=ReplicationMode.SEMI_SYNCHRONOUS,
    min_acks_required=1
)

result = handler.write_with_ack("transaction_data", ["replica1", "replica2"])

if result.success:
    if result.fallback_triggered:
        print(f"Write succeeded with {len(result.acknowledged_by)} acks (fallback mode)")
    else:
        print(f"Full synchronous write confirmed by {len(result.acknowledged_by)} replicas")
else:
    print(f"Write failed: {result.error}")

Trade-Off Comparison: Synchronous vs Asynchronous Replication

| Dimension | Synchronous Replication | Asynchronous Replication |
|---|---|---|
| Write Latency | High (waits for replica confirmation) | Low (returns immediately) |
| Data Durability | Zero data loss on primary failure | Small window of potential data loss |
| Availability | Lower (can block if replica fails) | Higher (continues even if replicas lag) |
| Network Dependency | Requires low-latency network to replicas | Tolerates high-latency links |
| Cross-DC Support | Impractical across continents | Works well geo-distributed |
| RPO (Recovery Point Objective) | Zero | Minimal (lag-dependent) |
| RTO (Recovery Time Objective) | Faster (replica is current) | May be slower (replica must catch up) |
| Replica Load | Higher (must keep up with primary) | Lower (can lag) |
| Typical Use Cases | Financial transactions, inventory, critical data | Web sessions, analytics, social media feeds |

Decision Flowchart: When to Choose Sync vs Async

Use this flowchart to determine which replication strategy fits your use case:

flowchart TD
    Start[Start: Choose Replication Strategy] --> Q1{Can you tolerate data loss?}

    Q1 -->|No, zero tolerance| Sync[Synchronous Replication]
    Q1 -->|Yes, small gap OK| Q2{Can you accept higher write latency?}

    Q2 -->|Yes, latency acceptable| Sync
    Q2 -->|No, need low latency| Q3{Are nodes geo-distributed?}

    Q3 -->|Yes, cross-continent| Async[Asynchronous Replication]
    Q3 -->|No, same datacenter| Q4{What's your RPO requirement?}

    Q4 -->|RPO = 0| Sync
    Q4 -->|RPO > 0| Q5{Can replicas occasionally block writes?}

    Q5 -->|No, availability critical| Async
    Q5 -->|Yes, can tolerate blocking| Sync

    Sync --> SyncUse[Use synchronous replication to at least one nearby replica]
    Async --> AsyncUse[Use asynchronous replication with lag monitoring]

Quick Decision Guide:

| Question | If Yes | If No |
|---|---|---|
| Financial transaction? | Sync | Continue |
| Cross-datacenter (>100ms RTT)? | Async | Continue |
| Can tolerate 1-5ms extra latency? | Sync | Async |
| Write-heavy workload? | Async | Continue |
| RPO must be zero? | Sync | Async |
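
The quick decision guide can be encoded as a function. The inputs below are illustrative simplifications of the table's questions, not a complete decision model:

```python
def choose_replication(
    zero_rpo_required: bool,
    cross_dc_rtt_ms: float,
    latency_budget_ms: float,
) -> str:
    """Pick a replication mode from three simplified questions."""
    if zero_rpo_required:
        return "synchronous"          # RPO = 0 forces sync
    if cross_dc_rtt_ms > 100:
        return "asynchronous"         # cross-continent RTT makes sync impractical
    if latency_budget_ms >= 5:
        return "synchronous"          # a few ms of extra write latency is acceptable
    return "asynchronous"

print(choose_replication(False, 150.0, 2.0))  # asynchronous
print(choose_replication(True, 8.0, 2.0))     # synchronous
```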

Quorum Formula Mathematical Derivation

The quorum condition R + W > N is central to understanding strong consistency in replicated databases. Let us derive it formally and prove why it gives strong consistency guarantees.

The Problem Setup

Consider a replicated database with N replicas. To read a value, you must contact R replicas and return the most recent write among them. To write a value, you must contact W replicas and wait for all of them to acknowledge the write.

The fundamental question: How do we guarantee that every read sees the most recent write?

Formal Proof

Let us prove that R + W > N guarantees strong consistency.

Theorem: In a replicated system with N replicas, if you require W acknowledgments for writes and query R replicas for reads, then every read is guaranteed to see the most recent write if and only if R + W > N.

Proof Sketch:

  1. Notation:

    • Let W_set be the set of replicas that acknowledged the latest write, with |W_set| = W
    • Let R_set be the set of replicas contacted for the current read, with |R_set| = R
    • Let X = W_set ∩ R_set be the intersection (replicas that have the write AND are contacted for the read)
    • Let U = W_set ∪ R_set be the union
  2. Cardinality Bound: By the principle of inclusion-exclusion:

    |X| = |W_set| + |R_set| - |U|
         = W + R - |U|
  3. Since all replicas are in the union (there are only N replicas total):

    |U| ≤ N
    Therefore:
    |X| = W + R - |U|
         ≥ W + R - N
  4. The Intersection Must Be Non-Empty: If R + W > N, then:

    |X| ≥ W + R - N > 0

    Therefore, |X| ≥ 1, meaning at least one replica in the read set also has the latest write.

  5. The Read Always Sees the Latest Write: Because the read quorum and write quorum overlap, the read coordinator can compare timestamps or version numbers across the overlapping replicas. The overlapping replica(s) definitely have the latest write, so the read returns the correct value.

Conversely, if R + W ≤ N, the lower bound tells us nothing useful:

|X| ≥ W + R - N, and W + R - N ≤ 0

so the intersection may be empty, meaning the read set might miss the latest write entirely.

Why This Matters for Consistency

The quorum overlap guarantee means you cannot have a situation where:

  • A write is committed (W replicas acknowledged it)
  • A subsequent read contacts only replicas that did NOT receive the write
  • The read returns stale data

With R + W > N, this scenario is impossible by construction.

Examples

Example 1: N=3, R=2, W=2

R + W = 2 + 2 = 4 > 3

Write quorum: {A, B}. Read quorum: {B, C}. Intersection: {B} - non-empty, read sees latest write.

Example 2: N=5, R=2, W=3

R + W = 2 + 3 = 5, NOT > 5

Write quorum: {A, B, C}. Read quorum: {D, E}. Intersection: {} - empty. Read may return stale data.

Example 3: N=5, R=3, W=3

R + W = 3 + 3 = 6 > 5

Write quorum: {A, B, C}. Read quorum: {C, D, E}. Intersection: {C} - non-empty.
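
The three examples can also be checked by brute force: enumerate every possible write quorum and read quorum and verify the overlap claim matches the formula. A small sketch:

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Exhaustively check that every size-W write quorum intersects
    every size-R read quorum over N replicas."""
    replicas = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(replicas, w)
        for rs in combinations(replicas, r)
    )

# The brute-force check agrees with the formula R + W > N:
for n, r, w in [(3, 2, 2), (5, 2, 3), (5, 3, 3)]:
    assert quorums_always_overlap(n, r, w) == (r + w > n)
print("formula verified for the three examples above")
```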

Cassandra-Style Quorum

Cassandra uses a variant where consistency level is configurable:

def cassandra_consistency_level(n: int, r: int, w: int) -> str:
    """
    Determine consistency guarantee for Cassandra-style settings.
    """
    if r + w > n:
        return "Strong consistency (linearizable)"
    elif r + w == n:
        return "Boundary case: quorums can just fail to overlap (stale reads possible)"
    else:
        return "Eventual consistency (may read stale data)"

# Cassandra consistency levels
# LOCAL_ONE: R=1, no guarantee
# QUORUM: R=W=floor(N/2)+1 -> guarantees R+W > N
# ALL: R=W=N -> maximum consistency, minimum availability
# EACH_QUORUM: R=W=quorum in EACH datacenter

Strong Consistency vs Linearizability

The R + W > N condition gives you strong consistency for single-key operations, assuming strict quorums. For linearizability across multiple keys, or for transactions that must appear atomic, you need additional mechanisms:

  • Paxos or Raft: Consensus protocols that order operations globally
  • Spanner’s TrueTime: Physical time with bounded uncertainty
  • Hybrid Logical Clocks: Physical time + logical counters

# R+W>N gives you strong consistency for single keys
# For multi-key transactions, you need something stronger

def check_linearizability_requirements(system_type: str) -> str:
    """
    What additional guarantees does each system provide?
    """
    if system_type == "dynamo_style":
        return "Eventual consistency per key; R+W>N for single-key strong consistency"
    elif system_type == "cassandra":
        return "Tunable per-query; QUORUM gives R+W>N per key"
    elif system_type == "spanner":
        return "Linearizable via TrueTime (hardware clocks + GPS)"
    elif system_type == "cockroachdb":
        return "Linearizable via HLC and distributed transactions"
    else:
        return "Unknown consistency model"

Summary

The quorum formula R + W > N is derived from basic set theory:

  1. The intersection of read and write quorums must be non-empty
  2. This requires W + R - N > 0, which is R + W > N
  3. This guarantees every read sees at least one replica with the latest write
  4. Without this condition, read and write sets may be disjoint, allowing stale reads

This is why quorum-based replication is the foundation of consistent distributed systems like DynamoDB, Cassandra, and etcd.


Failover Patterns

When the primary fails, a replica must take over. This is failover. Automating failover reduces downtime but introduces complexity.

Manual Failover

Manual failover is simple. When the primary fails, an operator promotes a replica to primary. Traffic redirects. Replicas reconfigure to follow the new primary.

The downside is time. Minutes of downtime while failover happens. For some applications, this is acceptable. For others, it is not.

Automatic Failover

Automatic failover detects primary failure and promotes a replica automatically. This reduces downtime but requires careful implementation.

graph TD
    Monitor[Monitor] -->|primary unreachable| Election[Election]
    Election --> Replica1{Which replica?}
    Replica1 -->|most advanced| Promote[(Promote to Primary)]
    Promote --> NewMaster[(New Primary)]
    NewMaster --> Replica2[(Old Replica 2)]

Failure detection is tricky. How do you distinguish between primary being slow and primary being down? False positives cause unnecessary failovers. False negatives leave you without a writable primary.

Most systems use a consensus approach. Multiple monitors must agree the primary is down before failover proceeds. This prevents split-brain scenarios where two servers think they are primary.
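
The voting idea can be sketched as a simple majority check; real systems (Patroni, Orchestrator, Sentinel) layer leases and fencing on top of this:

```python
def primary_is_down(monitor_votes: list[bool], quorum: int) -> bool:
    """Declare the primary dead only when a quorum of independent
    monitors agree.

    A single monitor's "down" vote may only mean that monitor is
    partitioned from the primary, not that the primary has failed.
    Requiring a quorum prevents one isolated monitor from triggering
    a spurious failover and a split-brain.
    """
    down_votes = sum(monitor_votes)
    return down_votes >= quorum

# Three monitors, quorum of two (majority):
print(primary_is_down([True, False, False], quorum=2))  # False: no failover
print(primary_is_down([True, True, False], quorum=2))   # True: begin failover
```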

Failover Timeline Sequence

Here is what actually happens during an automated failover, step by step. The short version: writes go down for a bit, then come back up.

sequenceDiagram
    participant App as Application
    participant Primary as Primary DB
    participant Replica1 as Replica 1
    participant Replica2 as Replica 2
    participant Monitor as Failover Monitor
    participant Proxy as DB Proxy

    Note over Primary,Replica1: Normal operation — async replication
    Primary->>Replica1: WAL entries (async)
    Primary->>Replica2: WAL entries (async)
    App->>Primary: WRITE (committed)
    Primary->>App: WRITE OK

    Note over Primary: T1: Primary becomes unreachable
    Monitor->>Primary: Health check (timeout)
    Monitor->>Primary: Health check (timeout)
    Monitor->>Primary: Health check (timeout)
    Monitor->>Replica1: Is primary reachable? (no response)
    Monitor->>Replica2: Is primary reachable? (no response)

    Note over Monitor: T2: Consensus — primary is down
    Monitor->>Replica1: Get LSN position
    Replica1->>Monitor: LSN: 88400
    Monitor->>Replica2: Get LSN position
    Replica2->>Monitor: LSN: 88350
    Note over Monitor: Replica1 is most advanced

    Monitor->>Replica1: PROMOTE to primary
    Monitor->>Proxy: New primary = Replica1

    Note over Replica1: T3: Replica1 promoted — brief window of unavailability
    Replica1->>Replica2: Reconfigure to follow new primary
    Replica2->>Replica1: Acknowledged

    Note over App,Primary: T4: Application reconnects
    App->>Proxy: WRITE (retry)
    Proxy->>Replica1: WRITE
    Replica1->>App: WRITE OK

    Note over App,Replica1: Write availability restored

Timeline breakdown:

| Phase | What Happens | Duration |
|---|---|---|
| T1 | Primary stops responding to health checks | Set by your timeout configuration |
| T2 | Monitor agrees the primary is dead, picks the most current replica | Usually 3-10 seconds |
| T3 | That replica gets promoted, other replicas switch to follow it | 5-30 seconds |
| T4 | Applications reconnect through the proxy | 1-5 seconds |
Unplanned failovers usually land somewhere between 10 and 60 seconds end to end. Planned ones are faster because the old primary is still up and coordination is simpler.

Replication Lag and Its Effects

Replication lag causes subtle problems. Applications might write to primary and immediately read from a replica. The read might hit a replica that has not yet received the write.

# This can fail with replication lag
update_user(user_id, {"email": "new@test.com"})
user = read_from_replica(user_id)  # Might still have old email

Read-after-write consistency requires either reading from primary after writes or waiting for replication to catch up before reading from replicas.

# Solution 1: Read from primary after writes
update_user(user_id, {"email": "new@test.com"})
user = read_from_primary(user_id)  # Always current

# Solution 2: Wait for replication
update_user(user_id, {"email": "new@test.com"})
wait_for_replication_lag(user_id, timeout=5)
user = read_from_replica(user_id)  # Now includes the write

Some systems offer synchronous replication to eliminate lag. PostgreSQL synchronous replication guarantees the standby has received the write before committing.
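
A common middle ground is to track a replication position token per session and only read from a replica that has caught up to it. This sketch uses hypothetical placeholder helpers, not a real driver API:

```python
# Sketch of read-your-writes using a replication position token.
# The helpers below are hypothetical placeholders, not a real driver API.

class Session:
    """Tracks the highest write position this client has produced."""
    def __init__(self):
        self.last_write_pos = 0

def write(session: Session, primary_pos: list[int]) -> None:
    # A real system would record the commit LSN returned by the primary.
    primary_pos[0] += 1
    session.last_write_pos = primary_pos[0]

def choose_read_target(session: Session, replica_pos: int) -> str:
    """Read from the replica only if it has replayed past our last write."""
    if replica_pos >= session.last_write_pos:
        return "replica"
    return "primary"   # replica is behind this session's writes

s = Session()
primary = [100]
write(s, primary)                        # session now requires pos >= 101
print(choose_read_target(s, replica_pos=100))  # primary
print(choose_read_target(s, replica_pos=101))  # replica
```

Sessions without recent writes can always read from replicas, which preserves most of the read-scaling benefit.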

Cascading Replication

Cascading replication chains replicas. A replica follows another replica rather than directly following the primary. This reduces load on the primary when you have many replicas.

graph TD
    Primary[(Primary)] --> Replica1[(Replica 1)]
    Replica1 --> Replica2[(Replica 2)]
    Replica1 --> Replica3[(Replica 3)]
    Replica2 --> Replica4[(Replica 4)]

Cascading increases replication lag. Data flows through intermediate replicas before reaching leaf replicas. For read-heavy workloads where staleness is acceptable, this is fine.
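
The lag arithmetic is simple but worth stating: a leaf replica's staleness is roughly the sum of the lag at every hop between it and the primary. Numbers below are illustrative:

```python
def end_to_end_lag(per_hop_lag_seconds: list[float]) -> float:
    """In a cascading chain, lag compounds: a leaf replica is behind
    the primary by roughly the sum of each hop's lag."""
    return sum(per_hop_lag_seconds)

# Primary -> Replica1 -> Replica2 -> Replica4 (per the diagram above)
print(end_to_end_lag([0.5, 1.2, 2.0]))  # about 3.7 seconds behind the primary
```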

When to Use and When Not to Use Each Replication Topology

Master-Slave Replication:

  • Use when: You need read scale and read replicas; you need high availability for reads
  • Do not use when: You need write scaling (all writes still go to master); you need strong consistency across replicas

Multi-Master Replication:

  • Use when: You need write availability across geographic regions; no single master can handle all writes
  • Do not use when: You have high write conflict potential; your team lacks experience with conflict resolution

Synchronous Replication:

  • Use when: Data durability is critical; you cannot afford any data loss on primary failure
  • Do not use when: Network latency between nodes is high; write throughput is more important than immediate durability

Asynchronous Replication:

  • Use when: Write latency is critical; some data loss is acceptable; nodes are geographically distributed
  • Do not use when: You need zero RPO; strict durability guarantees are required

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Primary failure without automatic failover | Writes fail, extended downtime | Implement automated failover, test failover regularly, have runbooks |
| Replication break due to divergent data | Replica stops replicating, falls behind | Use pt-table-checksum to detect drift, implement resync procedure |
| Network partition causing split-brain | Two primaries accept writes independently | Use quorum-based failover, implement fencing, test network partition scenarios |
| Replication lag buildup | Read replicas return stale data | Monitor lag, provision stronger replicas, switch to synchronous replication |
| Long-running query holding replication slots | Replication cannot advance, WAL accumulates | Kill long-running queries, tune max_slot_wal_keep_size |
| Cascading replication chain failure | Leaf replicas fall far behind | Keep replication chain short, monitor all replicas |
| Accidental replica promotion | Application writes go to wrong server | Protect against manual promotion, require multiple approvals |
| WAL corruption on primary | Replicas replicate corrupted data | Enable checksums, verify replica integrity regularly |

Replica Promotion Failure Mode: Write Burst Risk

When a primary fails during high write throughput, automatic failover promotes a replica. Here is what can go wrong.

If async replication was in use, the replica may not have received the latest writes. During peak load, replication lag easily reaches seconds or more. Once that replica becomes the new primary, it has stale data. Any writes accepted after promotion that depend on the missing writes create inconsistency.

Picture a $100 transfer out of account A. The old primary commits the debit but fails before replicating it. The replica is promoted without that write. To the client the transfer appeared to succeed, yet on the new primary the debit never happened, so account A's balance reads $100 higher than it should. The data looks plausible but is internally wrong.

Once promoted, the new primary gets a flood of writes. Buffered requests retry, new requests arrive, and the burst can overwhelm the new primary before it stabilizes. This makes it harder to catch up on replication from other replicas, even with semi-synchronous replication.

# Timeline of promotion failure
# T1: Primary commits "debit account A: $100" (async replication)
# T2: Primary fails before replicating to Replica1
# T3: Replica1 promoted to new primary
# T4: Client reads account A from new primary (pre-debit balance; the $100 debit is missing)
# T5: Client transfers $100 from account A to account B (new primary accepts)
# T6: New primary now has inconsistent state (the debit from T1 is lost)

Mitigation strategies:

  1. Use semi-synchronous replication: Require at least one replica to acknowledge writes before committing. The promoted replica then has at least one durable copy of recent writes.

  2. Verify replica lag before promotion: Most failover tools check lag before promoting. If lag exceeds a threshold, wait or alert an operator instead of auto-promoting.

  3. Implement guardrails on the application side: Use idempotent write patterns so duplicate writes at the new primary do not cause inconsistency. Design the application to treat promotion as a potential data integrity event requiring reconciliation.

  4. Use quorum-based replication: With replication factor 3 and W=2 acknowledgments, every acknowledged write exists on at least two nodes, so after losing any single node the survivors still hold a copy of it. Quorum systems like DynamoDB or Cassandra handle this more gracefully than simple primary-replica setups.

  5. Plan for post-promotion reconciliation: Accept that some data may be inconsistent after a promotion under load. Run a reconciliation process after failover to detect and correct drift.

Automated failover reduces downtime but introduces this risk. Manual failover with operator verification is slower but reduces the risk. For financial or other high-integrity systems, synchronous replication or quorum-based approaches are worth the latency cost.
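
Mitigation 2 above (verify lag before promotion) can be sketched as a guardrail. The threshold and replica names here are hypothetical, and lag is measured in bytes of unreplayed WAL:

```python
from typing import Optional

MAX_PROMOTION_LAG_BYTES = 16 * 1024 * 1024  # illustrative 16 MB threshold

def pick_promotion_candidate(
    replica_lag_bytes: dict[str, int]
) -> Optional[str]:
    """Return the most-advanced replica, but refuse to auto-promote
    (return None, escalating to an operator) when even the best
    replica is too far behind the last known primary position.
    """
    best = min(replica_lag_bytes, key=replica_lag_bytes.get)
    if replica_lag_bytes[best] > MAX_PROMOTION_LAG_BYTES:
        return None   # alert instead of promoting stale data
    return best

lags = {"replica1": 4_096, "replica2": 64 * 1024 * 1024}
print(pick_promotion_candidate(lags))                     # replica1
print(pick_promotion_candidate({"replica1": 32 * 1024 * 1024}))  # None
```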

Observability Checklist

Metrics to Monitor:

  • Replication lag in seconds (primary vs each replica)
  • Replica IO and SQL thread status
  • Replica heartbeat/lag metrics
  • WAL generation rate vs replay rate
  • Replication slot age (if using slots)
  • Disk usage on replicas vs primary
  • Connection count per replica
  • Error count on replication threads

Logs to Capture:

  • Replication state changes (connect, disconnect, error)
  • Replication conflict events (duplicate key, deadlocks)
  • Long-running queries on replicas blocking replication
  • Checkpoint and WAL activity
  • Failover events and decisions

Alerts to Set:

  • Replication lag > 30 seconds (adjust to your SLA)
  • Replica IO/SQL thread stopped
  • Replication slot age > threshold
  • Disk usage on replica > 80%
  • High number of replication conflicts
  • Replica connection failures

-- PostgreSQL: Monitor replication status
SELECT client_addr, state, sent_lsn, write_lsn,
       flush_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;

-- MySQL: Check replica status
SHOW REPLICA STATUS\G

Security Checklist

  • Use strong authentication for replication connections (scram-sha-256, certificates)
  • Encrypt replication traffic with TLS
  • Restrict replication user to replica connections only
  • Implement network segmentation for replication traffic
  • Use separate credentials for replication (not application credentials)
  • Audit replication configuration changes
  • Rotate replication credentials regularly
  • Monitor authentication failures on replication ports
  • Verify replica SSL certificates

Common Pitfalls and Anti-Patterns

  1. Assuming replicas are consistent immediately after writes: With async replication, replicas lag. Do not read from replicas for critical data immediately after writes.

  2. Not monitoring replication lag: Lag accumulates silently until users complain. Monitor proactively and alert on threshold breaches.

  3. Promoting replica without verifying data: An incompletely replicated replica promoted to primary causes data loss. Verify replication is current before promotion.

  4. Using replicas for writes without understanding conflicts: Writes to replicas are typically replicated back. This can cause conflicts if not carefully designed.

  5. Ignoring replication slot retention: Slots prevent WAL removal. If replicas fall behind, WAL accumulates and can fill disk.

  6. Configuring weak authentication for replicas: Replicas with weak auth become attack vectors. Treat replica authentication as seriously as primary.

  7. Not testing failover procedures: If you have never tested failover, you will discover problems during actual outages. Test regularly.

  8. Mixing synchronous and asynchronous replication: This causes confusing behavior. Choose one replication mode per replica.

Quick Recap

Key Bullets:

  • Replication provides high availability and read scale; it does not scale writes
  • Asynchronous replication is faster but allows stale reads and potential data loss
  • Synchronous replication guarantees durability at the cost of latency
  • Monitor replication lag on every replica and alert on anomalies
  • Test failover procedures before you need them in production
  • Multi-master creates conflict resolution complexity; use only when write availability demands it

Copy/Paste Checklist:

-- PostgreSQL: Set up synchronous replication
ALTER SYSTEM SET synchronous_commit = on;
ALTER SYSTEM SET synchronous_standby_names = 'replica1,replica2';

-- MySQL: Configure replica with GTID
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'primary-host',
    SOURCE_USER = 'repl_user',
    SOURCE_PASSWORD = 'secure_password',
    SOURCE_AUTO_POSITION = 1;

START REPLICA;

-- Monitor replication health
SELECT * FROM pg_stat_replication;
SHOW REPLICA STATUS;

Real-World Case Study: GitHub’s MySQL Replication Outage

In October 2018, GitHub experienced a major outage caused by a cascading MySQL replication failure. The problem started during a planned maintenance window in which a primary database was taken offline. The replica promoted during the failover had accumulated significant replication lag, so it was missing recent writes when it took over.

GitHub’s architecture used MySQL with asynchronous replication. The primary was replicating to a relay master in each datacenter, which then replicated to read replicas. When the relay master failed, the read replicas attached to it fell behind. The failover process did not adequately account for the lag differential between replicas attached to different relay masters.

The lesson from GitHub’s post-incident analysis was around replication topology complexity. With multiple tiers of replication, lag compounds at each level. A replica attached to a lagging relay master can be far behind the primary even if its own replication thread is healthy. Their fix involved reducing the replication tier depth (no more cascading relay masters), adding automated lag monitoring with hard failovers blocked when lag exceeded thresholds, and improving the failover runbook to require operators to verify replica lag before promoting.

Real-World Case Study: Amazon Aurora’s Replica Propagation

Amazon Aurora separates the storage layer from the compute layer. Rather than replicating SQL statements or WAL entries, Aurora replicates the storage volume itself across 6 storage nodes in 3 Availability Zones. When your Aurora primary writes a page, the write is acknowledged after 4 of 6 storage nodes have committed the page. This is a form of synchronous replication at the storage layer.

The benefit: replica lag in Aurora is measured in single-digit milliseconds under normal operation, because replicas read from the same shared storage volume as the primary. There is no replication lag in the traditional sense. A replica that is promoted does not need to catch up because it has been reading from the same storage from the moment it started.

The complexity: Aurora’s shared storage design means write throughput is limited by the storage layer’s capacity, not the compute layer. This trade-off works well for most applications but becomes a bottleneck for write-heavy workloads. Aurora’s low-lag replicas must also live in the same region as the primary, since they read the shared storage volume; cross-region replication (Aurora Global Database) is asynchronous and has higher lag.

Interview Questions

Q: What happens to replication lag when you add a read replica that is significantly weaker than other replicas?

The weaker replica cannot apply changes as fast as the stronger replicas. In MySQL’s classic replication, a single SQL thread replays the primary’s changes; in PostgreSQL, a single WAL-apply process does the same. If that thread cannot keep up with the primary’s write rate, the weak replica falls behind while the stronger replicas stay current. The practical impact: if your application routes reads round-robin across replicas without checking lag, some users get stale data while others see current data. The fix is either to upgrade the weak replica’s resources or to exclude it from read distribution and use it only for specific low-throughput tasks.
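Excluding lagging replicas from read distribution can be sketched as a lag-aware router. This assumes a monitoring source that reports per-replica lag; the replica names and the 1-second budget are illustrative.

```python
# Sketch of lag-aware read routing: cycle reads only over replicas whose
# reported lag is within budget; fall back to the primary if none qualify.
import itertools

MAX_READ_LAG_SECONDS = 1.0

def make_read_router(replica_lags):
    """replica_lags: dict mapping replica name -> current lag in seconds.

    Returns an iterator that round-robins over healthy replicas only.
    """
    healthy = [name for name, lag in replica_lags.items()
               if lag <= MAX_READ_LAG_SECONDS]
    if not healthy:
        return itertools.cycle(["primary"])  # no fresh replica: read primary
    return itertools.cycle(healthy)

router = make_read_router({"replica-1": 0.2, "replica-2": 0.4,
                           "weak-replica": 42.0})
print(next(router))  # replica-1 (weak-replica is excluded)
```

In production the lag snapshot would be refreshed periodically rather than captured once, but the filtering logic is the same.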

Q: Your application writes to a primary and immediately reads from a read replica. Under asynchronous replication, what consistency guarantee does the read see?

The read sees eventual consistency at best. There is no guarantee that the replica has received the write: if replication lag is zero, the read sees the write; if lag is 5 seconds, the read sees data that is 5 seconds old. For read-after-write consistency you must read from the primary after writes, route a user’s reads to the primary for a short window after their writes, or track the write’s replication position (a GTID in MySQL, an LSN in PostgreSQL) and wait until the replica has replayed past it. Semi-synchronous replication alone is not enough: it guarantees that some replica received the write, not that the replica you happen to read from has applied it.
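The simplest of those options, pinning a user's reads to the primary for a short window after each write, can be sketched like this. The 5-second window and the names are assumptions for illustration.

```python
# Sketch of read-after-write routing: after a user writes, send that
# user's reads to the primary until the replication lag window has
# plausibly passed, then resume reading from replicas.
import time

STICKY_WINDOW_SECONDS = 5.0
_last_write_at = {}  # user_id -> timestamp of that user's last write

def record_write(user_id, now=None):
    _last_write_at[user_id] = now if now is not None else time.monotonic()

def read_target(user_id, now=None):
    """Route to the primary if this user wrote recently, else a replica."""
    now = now if now is not None else time.monotonic()
    last = _last_write_at.get(user_id)
    if last is not None and now - last < STICKY_WINDOW_SECONDS:
        return "primary"
    return "replica"

record_write("alice", now=100.0)
print(read_target("alice", now=101.0))  # primary (inside sticky window)
print(read_target("alice", now=110.0))  # replica (window elapsed)
```

The window should exceed your observed worst-case replication lag; if lag can spike past it, position-based waiting (GTID/LSN) is the safer variant.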

Q: What is the difference between synchronous and semi-synchronous replication?

Synchronous replication waits for the configured replicas to acknowledge a write before the transaction commits on the primary; if a replica does not acknowledge, the commit blocks or fails after a timeout. Semi-synchronous replication, a MySQL feature, waits for at least one replica to acknowledge receipt of the transaction before committing; if no replica responds within the timeout (rpl_semi_sync_master_timeout), it degrades to asynchronous replication and commits anyway. PostgreSQL does not use the term semi-synchronous. Instead, synchronous_commit controls how far the primary waits for a synchronous standby: remote_write (the standby has written the WAL to its operating system), on (the standby has flushed the WAL to disk), and remote_apply (the standby has applied the WAL, so queries there can see the commit).
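In PostgreSQL terms, the graduated durability levels might look like this in postgresql.conf; the standby names and quorum size are illustrative.

```ini
# postgresql.conf on the primary (illustrative values)
synchronous_standby_names = 'ANY 1 (replica_a, replica_b)'  # quorum: any one standby

synchronous_commit = on       # commit waits until a standby flushed the WAL to disk
# Weaker / stronger alternatives:
#   remote_write  - standby has written WAL to its OS, not yet flushed
#   remote_apply  - standby has applied WAL; reads on it see the commit
```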

Q: You are designing a replication strategy for a financial application where zero data loss on failover is required. What do you choose?

Use synchronous replication to at least one replica in the same datacenter. This guarantees that when the primary commits a transaction, at least one replica has already written it; on failover, the promoted replica is current with no data loss. The tradeoff is that every write now waits for a network round-trip to the replica. For financial applications where correctness matters more than write latency, this is the right trade. Be careful with automatic fallback to asynchronous mode when the synchronous replica becomes unavailable: the fallback restores write availability but reopens the data-loss window, so strict zero-data-loss systems usually block writes instead, or fail over to another synchronous standby. Finally, automate failover so that promotion happens only after the candidate replica is confirmed current.

Conclusion

Replication provides availability and read scale. Master-slave is the simplest topology. Writes go to master, reads distribute across replicas.

Synchronous replication guarantees durability at the cost of latency. Asynchronous replication is faster but leaves a small window for data loss.

Failover automation reduces downtime but adds complexity. Monitor replication lag and alert on anomalies. Unexpected staleness causes subtle bugs.

Multi-master sounds appealing for write availability but creates conflict resolution challenges. Master-slave with failover handles most needs.

For related reading, see Database Scaling for broader scaling strategies, and Horizontal Sharding for distributing writes across servers.
