Asynchronous Replication: Speed and Availability at Scale

Learn how asynchronous replication works in distributed databases, including eventual consistency implications, lag monitoring, and practical use cases where speed outweighs strict consistency.


Asynchronous replication is the workhorse of distributed databases. Writes confirm immediately on the primary. Replicas receive the changes eventually, usually milliseconds later but sometimes much longer. This gives you speed and availability, but it introduces a window where data can be lost if the primary fails.

Most production databases run async by default. The performance benefit is real. The trade-off is accepting a small probability of losing recent writes during failover. For many applications, this is acceptable. For others, it is not.

How Asynchronous Replication Works

The primary processes a write, confirms it to the client, and sends the change to replicas in the background. The client does not wait for replica confirmation.

sequenceDiagram
    participant Client
    participant Primary
    participant Replica

    Client->>Primary: BEGIN; UPDATE...; COMMIT
    Primary->>Primary: Write to local WAL
    Primary-->>Client: COMMIT confirmed
    Note over Primary,Replica: (Background replication)

    rect rgb(200, 200, 200)
        Primary->>Replica: Send WAL entry
        Replica-->>Primary: ACK
    end

This non-blocking behavior is why async replication is fast. Write latency is essentially local disk latency plus a small amount of processing. There is no waiting for remote replica confirmation.
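The non-blocking handoff can be sketched as a toy model in Python. This is purely illustrative: the `Primary` class and the in-process queue stand in for the database engine and its replication stream, not any real driver API.

```python
import queue
import threading
import time

replica_wal = []  # stands in for a replica's applied log

class Primary:
    """Toy async-replication primary: commit locally, ship WAL in background."""

    def __init__(self):
        self.wal = []                # local write-ahead log
        self.stream = queue.Queue()  # stands in for the replication channel
        threading.Thread(target=self._ship, daemon=True).start()

    def commit(self, entry):
        self.wal.append(entry)       # durable local write
        self.stream.put(entry)       # hand off to the background shipper
        return "COMMIT confirmed"    # client unblocked immediately

    def _ship(self):
        while True:                  # replicas receive entries eventually
            replica_wal.append(self.stream.get())

primary = Primary()
print(primary.commit({"table": "users", "id": 1, "email": "new@example.com"}))
time.sleep(0.2)        # give background shipping a moment to catch up
print(len(replica_wal))  # 1 once the entry has been shipped
```

Note that `commit` returns before the entry reaches `replica_wal`; the `sleep` is only there so the toy example can observe convergence.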

The Replication Lag Problem

With async replication, replicas fall behind the primary. This lag can be seconds or minutes depending on load, network conditions, and replica performance.

# This sequence can fail with async replication
def update_and_read(user_id, new_email):
    db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)
    # Write confirmed immediately
    # But replica might not have the update yet

    user = db_replica.execute("SELECT email FROM users WHERE id = ?", user_id)
    # Might return old email if replica is lagging
    return user['email']

This is the fundamental tension. Async replication is fast, but reads from replicas can return stale data.

Statement-Based vs WAL-Based Replication

Different databases use different mechanisms.

Statement-based replication sends the SQL statements to replicas. MySQL’s original replication used this approach.

-- Primary executes:
UPDATE users SET email = 'new@example.com' WHERE id = 1;

-- Replica executes the same statement:
UPDATE users SET email = 'new@example.com' WHERE id = 1;

The problem: nondeterministic functions like NOW() or RAND() produce different results on different nodes.
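A toy Python simulation makes the failure mode concrete. Here a per-node random generator stands in for SQL's RAND() and a wall-clock read for NOW(); the point is that replaying the same statement on two nodes writes two different values.

```python
import random
import time

def apply_statement(node_state, node_seed):
    # Each node evaluates RAND() with its own generator state,
    # so the "same" statement writes different values on each node.
    rng = random.Random(node_seed)
    node_state["token"] = rng.random()      # UPDATE ... SET token = RAND()
    node_state["updated_at"] = time.time()  # ... updated_at = NOW()
    return node_state

primary = apply_statement({}, node_seed=1)
replica = apply_statement({}, node_seed=2)
print(primary["token"] != replica["token"])  # True: the nodes diverged
```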

WAL-based replication sends the binary WAL entries. PostgreSQL uses this. WAL entries are deterministic because they contain the actual bytes that changed.

# Simplified, logical view of a WAL entry (real entries are binary, page-level records)
wal_entry = {
    'type': 'UPDATE',
    'table': 'users',
    'key': 'id=1',
    'old_data': {'email': 'old@example.com'},
    'new_data': {'email': 'new@example.com'},
    'timestamp': 1711234567
}

Row-based replication sends the actual row changes. MySQL’s row-based replication combines the benefits of both.

Eventual Consistency

Async replication provides eventual consistency. Given enough time without new writes, all replicas will converge to the same state.

graph LR
    P[Primary] -->|immediate| R1[Replica 1]
    P -->|immediate| R2[Replica 2]

    R1 -.->|eventually consistent| Same[Same State]
    R2 -.->|eventually consistent| Same

The “eventually” part is important. During the lag window, reads from different replicas can return different results. This is observable inconsistency, not just a theoretical concern.

# User experience with eventual consistency
def get_user_email(user_id):
    # Route to random replica
    replica = random.choice([replica1, replica2, replica3])
    return replica.execute("SELECT email FROM users WHERE id = ?", user_id)

# Two consecutive calls might return different results
email1 = get_user_email(123)  # 'old@example.com'
email2 = get_user_email(123)  # 'new@example.com' if replica2 has newer data

Use Cases Where Async Makes Sense

Async replication is the right choice when:

  • Web sessions and preferences: User profile updates can lag a few seconds without causing real problems
  • Analytics and reporting: Reads for dashboards do not need millisecond freshness
  • Social media feeds: Timeline updates do not need to be immediately consistent
  • Logging and metrics: Time-series data tolerates small gaps
  • Geographic distribution: Replicating across continents with sync would be prohibitively slow

# Good async use case: user preferences
def update_user_preference(user_id, key, value):
    # Write immediately, replicate async
    db.execute("UPDATE preferences SET value = ? WHERE user_id = ? AND key = ?",
               value, user_id, key)
    # User sees update instantly on primary
    # Other replicas catch up shortly

# Bad async use case: account balance
def transfer_funds(from_id, to_id, amount):
    # This needs synchronous replication for correctness
    db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
               amount, from_id)
    db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
               amount, to_id)

Monitoring Replication Lag

Lag monitoring is critical with async replication. You need to know how far behind your replicas are.

-- PostgreSQL: Check replication lag
SELECT client_addr, state,
       sent_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;

-- MySQL: Check replica lag
SHOW REPLICA STATUS\G
# Look at: Seconds_Behind_Source (Seconds_Behind_Master before MySQL 8.0.22), Relay_Log_Pos

# Application-level lag monitoring
def check_replica_health(replica):
    last_replicated_ts = replica.get_last_replicated_timestamp()
    current_ts = time.time()
    lag_seconds = current_ts - last_replicated_ts

    if lag_seconds > 30:
        alert(f"Replica {replica.name} lag: {lag_seconds}s")
        # Route reads away from this replica
        remove_from_read_pool(replica)

Key metrics to track:

Metric             | Normal  | Warning   | Critical
Replication lag    | < 1s    | 1-10s     | > 10s
Replica IO thread  | Running | n/a       | Stopped
Replica SQL thread | Running | n/a       | Stopped
WAL backlog        | < 100MB | 100MB-1GB | > 1GB
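Those thresholds are easy to encode in monitoring code. A minimal sketch follows; the tier boundaries come from the table above, while the function names are illustrative:

```python
def lag_severity(lag_seconds):
    """Map replication lag to a tier: <1s normal, 1-10s warning, >10s critical."""
    if lag_seconds < 1:
        return "normal"
    if lag_seconds <= 10:
        return "warning"
    return "critical"

def wal_backlog_severity(backlog_mb):
    """Map WAL backlog to a tier: <100MB normal, up to 1GB warning, above that critical."""
    if backlog_mb < 100:
        return "normal"
    if backlog_mb <= 1024:
        return "warning"
    return "critical"

print(lag_severity(0.2), lag_severity(5), lag_severity(42))
# normal warning critical
```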

The CAP Theorem and Async Replication

Async replication is the availability (A) choice in the CAP theorem trade-off. During a network partition, an async primary can continue accepting writes. Those writes might not survive if the primary fails before replicas receive them.

graph TD
    subgraph Partition
        P[Primary] --> R1[(Replica 1)]
        P -.- X[(Partition)]
        R2[(Replica 2)] -.- X
    end

    P -->|writes continue| Client1[Client]
    R2 -.->|disconnected| P

If you need consistency during partitions, you need synchronous replication. If you need availability during partitions, async is your choice.

For more on this trade-off, see CAP Theorem and Consistency Models.

Reducing Lag in Async Setups

If lag is a problem, there are techniques to reduce it.

Upgrade replica hardware: Replicas that cannot keep up with the primary’s write rate will always lag. Give them faster disks and more CPU.

Reduce write workload: If the primary writes 10,000 transactions per second and replicas can only handle 8,000, lag will grow indefinitely.
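The arithmetic behind that statement is simple: the backlog grows at the difference between the incoming and applied rates. A quick calculation using the example numbers:

```python
def backlog_after(primary_tps, replica_tps, seconds):
    """Transactions a replica falls behind when it applies slower than the primary writes."""
    return max(0, primary_tps - replica_tps) * seconds

# 10,000 tps written, 8,000 tps applied: the gap grows at 2,000 tps
print(backlog_after(10_000, 8_000, 60))  # 120000 transactions behind after one minute
```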

Use parallel replication: Modern MySQL applies transactions on replicas with multiple worker threads, and PostgreSQL 16 adds parallel apply for large transactions in logical replication.

-- MySQL: Enable parallel replication
STOP REPLICA;
SET GLOBAL replica_parallel_workers = 8;
START REPLICA;

-- PostgreSQL: Raise wal_sender_timeout so a slow replica is not disconnected
ALTER SYSTEM SET wal_sender_timeout = '60s';

Cascade replication: Chain replicas so the primary does not feed every replica directly. This reduces primary load.

graph TD
    P[Primary] --> R1[Replica 1]
    R1 --> R2[Replica 2]
    R1 --> R3[Replica 3]
    R2 --> R4[Replica 4]

Read Routing with Async Replicas

Application code should handle stale reads gracefully when using async replicas.

# Read from replica with fallback to primary
def read_user(user_id):
    try:
        # Try replica first
        replica = get_read_replica()
        user = replica.execute("SELECT * FROM users WHERE id = ?", user_id)
        return user
    except ReplicaTooLagError:
        # Fall back to primary if replica is too far behind
        return primary.execute("SELECT * FROM users WHERE id = ?", user_id)

# Read-after-write with async replication
def update_and_read(user_id, new_email):
    # Always read from primary after writes
    db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)
    return primary.execute("SELECT email FROM users WHERE id = ?", user_id)

When Async Is Not Enough

Some data genuinely cannot tolerate async replication:

  • Financial transactions: If you write a payment and the primary fails before replicas receive it, money disappears
  • Inventory management: Overselling happens when replicas do not have the latest stock counts
  • Locking operations: If you acquire a lock and the primary fails before it replicates, you have a zombie lock

For these cases, see Synchronous Replication for strong consistency options.

Read-Your-Writes Consistency Checklist

When using async replicas, ensuring read-after-write consistency requires explicit handling:

# Strategy 1: Read from primary after writes
def read_after_write(session, user_id, new_email):
    session.execute(
        "UPDATE users SET email = ? WHERE id = ?",
        new_email, user_id
    )
    # Must read from primary to see our own write
    return primary.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )

# Strategy 2: Track LSN and wait for replication
def read_after_wait(session, user_id, new_email):
    lsn = session.execute(
        "UPDATE users SET email = ? WHERE id = ?",
        new_email, user_id
    )
    # Wait for replica to catch up to our LSN
    replica.wait_for_lsn(lsn, timeout=30)
    return replica.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )

# Strategy 3: Session pinning (sticky to primary)
def read_with_session_pin(session, user_id):
    # Session always routes to primary for this user
    session.set_pin(user_id, to_node='primary')
    return session.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )

Strategy          | Pros                   | Cons                | Best For
Read from primary | Simple, always correct | Adds primary load   | Write-heavy workloads
Wait for LSN      | Works with any replica | Adds latency        | Occasional consistency needs
Session pinning   | Consistent experience  | Reduces flexibility | User-specific data

Common Pitfalls

  1. Assuming replica data is current: After a write, do not assume replicas have the update immediately. Always read from primary for critical data after writes.

  2. Not monitoring lag: Lag accumulates silently. By the time users complain, you might be hours behind. Monitor proactively.

  3. Promoting a lagging replica: If a replica is 2 hours behind and you promote it, you just lost 2 hours of data. Always check lag before promoting.

  4. Ignoring WAL retention: If a replica falls so far behind that the primary has already removed the WAL it still needs (for example, because its replication slot was dropped or retention limits were exceeded), you cannot recover without a full resync.

  5. Testing in ideal conditions: Replication lag grows under load. Test with production-level write rates.
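Pitfall 3 in particular is cheap to automate: failover tooling should refuse to promote a replica whose lag exceeds a budget. A sketch under assumed inputs; the `(name, lag_seconds)` tuples and `PromotionError` are illustrative, and in a real system the lag figures would come from pg_stat_replication or SHOW REPLICA STATUS:

```python
class PromotionError(Exception):
    """Raised when no replica is fresh enough to promote safely."""

def pick_promotion_candidate(replicas, max_lag_seconds=5.0):
    """Return the least-lagged replica name, refusing if even it exceeds the lag budget.

    replicas: list of (name, lag_seconds) tuples.
    """
    name, lag = min(replicas, key=lambda r: r[1])
    if lag > max_lag_seconds:
        raise PromotionError(f"best candidate {name} is {lag}s behind; aborting failover")
    return name

print(pick_promotion_candidate([("replica-a", 2.0), ("replica-b", 7200.0)]))
# replica-a
```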

Quick Recap

  • Async replication confirms writes immediately without waiting for replicas
  • Replication lag means replicas can return stale data
  • Best for: non-critical reads, analytics, geographically distributed systems
  • Monitor lag actively and alert on threshold breaches
  • Use read-after-write consistency when needed by reading from primary after writes
  • For critical data requiring zero RPO, use synchronous replication

For related reading, see Database Replication for the broader replication landscape, Distributed Caching for read-scaling patterns, and Event-Driven Architecture for alternative consistency models.

Async replication is the right choice for most read-heavy workloads. The key is understanding which data needs strong consistency and which can tolerate eventual consistency, then routing reads accordingly.
