Asynchronous Replication: Speed and Availability at Scale
Learn how asynchronous replication works in distributed databases, including eventual consistency implications, lag monitoring, and practical use cases where speed outweighs strict consistency.
Asynchronous replication is the workhorse of distributed databases. Writes confirm immediately on the primary. Replicas receive the changes eventually, usually milliseconds later but sometimes much longer. This gives you speed and availability, but it introduces a window where data can be lost if the primary fails.
Most production databases run async by default. The performance benefit is real. The trade-off is accepting a small probability of losing recent writes during failover. For many applications, this is acceptable. For others, it is not.
How Asynchronous Replication Works
The primary processes a write, confirms it to the client, and sends the change to replicas in the background. The client does not wait for replica confirmation.
sequenceDiagram
participant Client
participant Primary
participant Replica
Client->>Primary: BEGIN; UPDATE...; COMMIT
Primary->>Primary: Write to local WAL
Primary-->>Client: COMMIT confirmed
Note over Primary,Replica: (Background replication)
rect rgb(200, 200, 200)
Primary->>Replica: Send WAL entry
Replica-->>Primary: ACK
end
This non-blocking behavior is why async replication is fast. Write latency is essentially local disk latency plus a small amount of processing. There is no waiting for remote replica confirmation.
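The flow above can be sketched in a few lines of Python, with an in-process queue standing in for the WAL shipping channel and a background thread playing the replication worker. All names here are illustrative, not a real database API:

```python
import queue
import threading

class AsyncPrimary:
    """Toy primary: commit locally, ship changes to replicas in the background."""
    def __init__(self):
        self.data = {}            # stands in for the primary's tables
        self.wal = queue.Queue()  # stands in for the WAL shipping channel
        self.replicas = []

    def write(self, key, value):
        self.data[key] = value    # "write to local WAL" and apply
        self.wal.put((key, value))
        return "COMMIT confirmed"  # returned before any replica has seen the change

    def replication_worker(self):
        # Runs in the background, independent of client-observed write latency
        while True:
            key, value = self.wal.get()
            for replica in self.replicas:
                replica[key] = value  # replica applies the change
            self.wal.task_done()

primary = AsyncPrimary()
replica = {}
primary.replicas.append(replica)
threading.Thread(target=primary.replication_worker, daemon=True).start()

print(primary.write("user:1", "old@example.com"))  # confirms immediately
primary.wal.join()  # wait for the background worker to drain the queue
print(replica["user:1"])
```

The `wal.join()` call is only there so the demo can observe convergence; a real client never waits on it, which is exactly the point.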
The Replication Lag Problem
With async replication, replicas fall behind the primary. This lag can be seconds or minutes depending on load, network conditions, and replica performance.
```python
# This sequence can fail with async replication
def update_and_read(user_id, new_email):
    db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)
    # The write is confirmed immediately,
    # but the replica might not have the update yet
    user = db_replica.execute("SELECT email FROM users WHERE id = ?", user_id)
    # Might return the old email if the replica is lagging
    return user['email']
```
This is the fundamental tension. Async replication is fast, but reads from replicas can return stale data.
Statement-Based vs WAL-Based Replication
Different databases use different mechanisms.
Statement-based replication sends the SQL statements to replicas. MySQL’s original replication used this approach.
```sql
-- Primary executes:
UPDATE users SET email = 'new@example.com' WHERE id = 1;
-- Replica executes the same statement:
UPDATE users SET email = 'new@example.com' WHERE id = 1;
```
The problem: nondeterministic functions like NOW() or RAND() produce different results on different nodes.
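A toy illustration of the divergence, with string substitution standing in for statement execution (a hypothetical helper, not MySQL internals): each node evaluates `RAND()` against its own random state, so replaying the identical statement text produces different rows.

```python
import random

def execute_statement(stmt_template, rng):
    """Toy statement-based apply: each node evaluates RAND() locally."""
    return stmt_template.replace("RAND()", str(rng.random()))

primary_rng = random.Random(1)   # every node has its own random state
replica_rng = random.Random(2)

stmt = "UPDATE users SET token = RAND() WHERE id = 1"
on_primary = execute_statement(stmt, primary_rng)
on_replica = execute_statement(stmt, replica_rng)
print(on_primary == on_replica)  # False: same statement, different data
```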
WAL-based replication sends the binary WAL entries. PostgreSQL uses this. WAL entries are deterministic because they contain the actual bytes that changed.
```python
# A WAL entry contains the actual changed bytes
wal_entry = {
    'type': 'UPDATE',
    'table': 'users',
    'key': 'id=1',
    'old_data': {'email': 'old@example.com'},
    'new_data': {'email': 'new@example.com'},
    'timestamp': 1711234567
}
```
Row-based replication sends the actual row changes. MySQL’s row-based replication combines the benefits of both.
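Applying an entry like the one above is purely mechanical, which is why every replica converges to the same bytes. A minimal sketch, using plain dicts as stand-ins for tables:

```python
def apply_row_change(table, entry):
    """Apply a row-based change: values are literal, so every replica converges."""
    if entry['type'] == 'UPDATE':
        row = table[entry['key']]
        # Sanity check: the old row image should match before applying
        assert row['email'] == entry['old_data']['email']
        row.update(entry['new_data'])

wal_entry = {
    'type': 'UPDATE',
    'key': 'id=1',
    'old_data': {'email': 'old@example.com'},
    'new_data': {'email': 'new@example.com'},
}

# Two replicas applying the same entry end up identical
replica_a = {'id=1': {'email': 'old@example.com'}}
replica_b = {'id=1': {'email': 'old@example.com'}}
apply_row_change(replica_a, wal_entry)
apply_row_change(replica_b, wal_entry)
print(replica_a == replica_b)  # True
```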
Eventual Consistency
Async replication provides eventual consistency. Given enough time without new writes, all replicas will converge to the same state.
graph LR
P[Primary] -->|immediate| R1[Replica 1]
P -->|immediate| R2[Replica 2]
R1 -.->|eventually consistent| Same[Same State]
R2 -.->|eventually consistent| Same
The “eventually” part is important. During the lag window, reads from different replicas can return different results. This is observable inconsistency, not just a theoretical concern.
```python
import random

# User experience with eventual consistency
def get_user_email(user_id):
    # Route to a random replica
    replica = random.choice([replica1, replica2, replica3])
    return replica.execute("SELECT email FROM users WHERE id = ?", user_id)

# Two consecutive calls might return different results
email1 = get_user_email(123)  # 'old@example.com'
email2 = get_user_email(123)  # 'new@example.com' if that replica has newer data
```
Use Cases Where Async Makes Sense
Async replication is the right choice when:
- Web sessions and preferences: User profile updates can lag a few seconds without causing real problems
- Analytics and reporting: Reads for dashboards do not need millisecond freshness
- Social media feeds: Timeline updates do not need to be immediately consistent
- Logging and metrics: Time-series data tolerates small gaps
- Geographic distribution: Replicating across continents with sync would be prohibitively slow
```python
# Good async use case: user preferences
def update_user_preference(user_id, key, value):
    # Write immediately, replicate asynchronously
    db.execute("UPDATE preferences SET value = ? WHERE user_id = ? AND key = ?",
               value, user_id, key)
    # The user sees the update instantly on the primary;
    # replicas catch up shortly after

# Bad async use case: account balance
def transfer_funds(from_id, to_id, amount):
    # This needs synchronous replication for correctness
    db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
               amount, from_id)
    db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
               amount, to_id)
```
Monitoring Replication Lag
Lag monitoring is critical with async replication. You need to know how far behind your replicas are.
```sql
-- PostgreSQL: check replication lag
SELECT client_addr, state,
       sent_lsn, replay_lsn,
       sent_lsn - replay_lsn AS lag_bytes
FROM pg_stat_replication;
```

```sql
-- MySQL: check replica lag
SHOW REPLICA STATUS\G
-- Key fields: Seconds_Behind_Source (Seconds_Behind_Master on older
-- versions) and Relay_Log_Pos
```
```python
import time

# Application-level lag monitoring
def check_replica_health(replica):
    last_replicated_ts = replica.get_last_replicated_timestamp()
    lag_seconds = time.time() - last_replicated_ts
    if lag_seconds > 30:
        alert(f"Replica {replica.name} lag: {lag_seconds}s")
        # Route reads away from this replica
        remove_from_read_pool(replica)
```
Key metrics to track:
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| Replication lag | < 1s | 1-10s | > 10s |
| Replica IO thread | Running | — | Stopped |
| Replica SQL thread | Running | — | Stopped |
| WAL backlog | < 100MB | 100MB-1GB | > 1GB |
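The lag thresholds from the table can be encoded directly in monitoring code. A minimal sketch; the cutoffs mirror the table and should be tuned per workload:

```python
def classify_lag(lag_seconds):
    """Map replication lag to the alerting tiers from the table above."""
    if lag_seconds < 1:
        return "normal"
    if lag_seconds <= 10:
        return "warning"
    return "critical"

print(classify_lag(0.2), classify_lag(5), classify_lag(120))
```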
The CAP Theorem and Async Replication
Async replication is the availability (A) choice in the CAP theorem trade-off. During a network partition, an async primary can continue accepting writes. Those writes might not survive if the primary fails before replicas receive them.
graph TD
subgraph Partition
P[Primary] --> R1[(Replica 1)]
P -.- X[(Partition)]
R2[(Replica 2)] -.- X
end
P -->|writes continue| Client1[Client]
R2 -.->|disconnected| P
If you need consistency during partitions, you need synchronous replication. If you need availability during partitions, async is your choice.
For more on this trade-off, see CAP Theorem and Consistency Models.
Reducing Lag in Async Setups
If lag is a problem, there are techniques to reduce it.
Upgrade replica hardware: Replicas that cannot keep up with the primary’s write rate will always lag. Give them faster disks and more CPU.
Reduce write workload: If the primary commits 10,000 transactions per second and replicas can only apply 8,000, lag grows without bound. Batch writes or shed non-essential work so replicas can keep up.
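Back-of-the-envelope arithmetic makes this concrete (the rates are the hypothetical figures from above):

```python
# If the primary commits faster than a replica can apply, backlog grows linearly.
write_rate = 10_000  # transactions/second committed on the primary
apply_rate = 8_000   # transactions/second the replica can replay

backlog_growth = write_rate - apply_rate            # 2,000 txn/s of new backlog
backlog_after_one_hour = backlog_growth * 3600      # transactions never cleared
print(backlog_after_one_hour)                       # 7200000
```

While the load persists, that backlog can never drain; only reducing the write rate or speeding up apply changes the trend.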
Use parallel replication: Modern MySQL and PostgreSQL support parallel replica apply.
```sql
-- MySQL: enable parallel replication
STOP REPLICA;
SET GLOBAL replica_parallel_workers = 8;  -- slave_parallel_workers before MySQL 8.0.26
START REPLICA;
```

```sql
-- PostgreSQL: raise wal_sender_timeout so slow replica connections are not dropped
ALTER SYSTEM SET wal_sender_timeout = '60s';
```
Cascade replication: Chain replicas so the primary does not feed every replica directly. This reduces primary load.
graph TD
P[Primary] --> R1[Replica 1]
R1 --> R2[Replica 2]
R1 --> R3[Replica 3]
R2 --> R4[Replica 4]
Read Routing with Async Replicas
Application code should handle stale reads gracefully when using async replicas.
```python
# Read from a replica with fallback to the primary
def read_user(user_id):
    try:
        # Try a replica first
        replica = get_read_replica()
        return replica.execute("SELECT * FROM users WHERE id = ?", user_id)
    except ReplicaTooLagError:
        # Fall back to the primary if the replica is too far behind
        return primary.execute("SELECT * FROM users WHERE id = ?", user_id)

# Read-after-write with async replication
def update_and_read(user_id, new_email):
    # Always read from the primary after writes
    db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)
    return primary.execute("SELECT email FROM users WHERE id = ?", user_id)
```
When Async Is Not Enough
Some data genuinely cannot tolerate async replication:
- Financial transactions: If you write a payment and the primary fails before replicas receive it, money disappears
- Inventory management: Overselling happens when replicas do not have the latest stock counts
- Locking operations: If you acquire a lock and the primary fails before it replicates, you have a zombie lock
For these cases, see Synchronous Replication for strong consistency options.
Read-Your-Writes Consistency Checklist
When using async replicas, ensuring read-after-write consistency requires explicit handling:
```python
# Strategy 1: Read from primary after writes
def read_after_write(session, user_id, new_email):
    session.execute(
        "UPDATE users SET email = ? WHERE id = ?",
        new_email, user_id
    )
    # Must read from the primary to see our own write
    return primary.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )
```

```python
# Strategy 2: Track the LSN and wait for replication
def read_after_wait(session, user_id, new_email):
    lsn = session.execute(
        "UPDATE users SET email = ? WHERE id = ?",
        new_email, user_id
    )
    # Wait for the replica to catch up to our LSN
    replica.wait_for_lsn(lsn, timeout=30)
    return replica.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )
```

```python
# Strategy 3: Session pinning (sticky to primary)
def read_with_session_pin(session, user_id):
    # This session always routes to the primary for this user
    session.set_pin(user_id, to_node='primary')
    return session.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )
```
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Read from primary | Simple, always correct | Adds primary load | Write-heavy workloads |
| Wait for LSN | Works with any replica | Adds latency | Occasional consistency needs |
| Session pinning | Consistent experience | Reduces flexibility | User-specific data |
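Strategy 2 hinges on comparing LSNs. PostgreSQL formats an LSN as two hex words separated by a slash (e.g. `0/16B3748`); the sketch below shows the comparison a `wait_for_lsn` helper would perform. The helper itself, and polling `pg_last_wal_replay_lsn()` on the replica, are left out:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/16B3748' to a comparable integer."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def replica_caught_up(write_lsn: str, replay_lsn: str) -> bool:
    """True once the replica has replayed at least up to our write's LSN."""
    return parse_lsn(replay_lsn) >= parse_lsn(write_lsn)

# After a write that returned LSN '0/16B3748' on the primary:
print(replica_caught_up('0/16B3748', '0/16B3000'))  # False: replica still behind
print(replica_caught_up('0/16B3748', '0/16B4000'))  # True: safe to read
```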
Common Pitfalls
- Assuming replica data is current: After a write, do not assume replicas have the update immediately. Always read from the primary for critical data after writes.
- Not monitoring lag: Lag accumulates silently. By the time users complain, you might be hours behind. Monitor proactively.
- Promoting a lagging replica: If a replica is 2 hours behind and you promote it, you just lost 2 hours of data. Always check lag before promoting.
- Ignoring WAL retention: If a replica falls behind and the WAL it still needs has been recycled (no replication slot retaining it), you cannot recover without a full resync.
- Testing in ideal conditions: Replication lag grows under load. Test with production-level write rates.
Quick Recap
- Async replication confirms writes immediately without waiting for replicas
- Replication lag means replicas can return stale data
- Best for: non-critical reads, analytics, geographically distributed systems
- Monitor lag actively and alert on threshold breaches
- Use read-after-write consistency when needed by reading from primary after writes
- For critical data requiring zero RPO, use synchronous replication
For related reading, see Database Replication for the broader replication landscape, Distributed Caching for read-scaling patterns, and Event-Driven Architecture for alternative consistency models.
Async replication is the right choice for most read-heavy workloads. The key is understanding which data needs strong consistency and which can tolerate eventual consistency, then routing reads accordingly.