Multi-Paxos
Multi-Paxos extends basic Paxos to achieve consensus on sequences of values, enabling practical replicated state machines for distributed systems and databases.
Introduction
Basic Paxos agrees on a single value. Run it once, and all non-faulty nodes commit to the same value. Run it again with a different value, and they commit to that. But what if you need to agree on a sequence of values, like a replicated log of commands? That’s what Multi-Paxos solves.
The extension is conceptually simple: treat each slot in the log as a separate instance of basic Paxos. The complexity comes from optimization—running Paxos from scratch for every log entry would be prohibitively expensive.
Core Concepts
Multi-Paxos optimizes away the prepare phase for consecutive entries once a leader is established. The insight is that if a proposer has recently won a majority vote, it can reasonably assume it remains the leader for a period. During this stable period, it can skip Phase 1 and go directly to Phase 2 for subsequent entries.
This optimization transforms Multi-Paxos from a two-phase protocol per entry to essentially a one-phase protocol after the initial leader election.
sequenceDiagram
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
rect rgb(40, 40, 60)
Note over L,F2: First entry: Full Paxos (Phase 1 + Phase 2)
L->>F1: Prepare(n=10)
L->>F2: Prepare(n=10)
F1-->>L: Promise(v=nil)
F2-->>L: Promise(v=nil)
L->>F1: Accept(n=10, v="cmd1")
L->>F2: Accept(n=10, v="cmd1")
F1-->>L: Accepted
F2-->>L: Accepted
end
rect rgb(40, 60, 40)
Note over L,F2: Subsequent entries: Optimized (Phase 2 only)
L->>F1: Accept(n=10, v="cmd2")
L->>F2: Accept(n=10, v="cmd2")
F1-->>L: Accepted
F2-->>L: Accepted
L->>F1: Accept(n=10, v="cmd3")
L->>F2: Accept(n=10, v="cmd3")
F1-->>L: Accepted
F2-->>L: Accepted
end
Role of the Leader
Multi-Paxos typically designates a stable leader to avoid contention. Without leadership, multiple nodes might propose conflicting values for the same log position, causing retries and latency spikes.
The leader election can use basic Paxos itself (running a round of prepare messages to see if any node can get a majority). Some implementations piggyback on the first client’s request to establish leadership lazily.
Log Gaps and Truncation
Multi-Paxos truncates log entries at the last known stable index, skipping over gaps where commands were already decided. The gaps happen naturally when leaders change: the new leader may not have entries that were committed by the previous leader if they had not been fully replicated.
graph LR
L1["Node 1<br/>Log: [A] [B] [C] [D] [E]"]
L2["Node 2<br/>Log: [A] [B] [C]"]
L3["Node 3<br/>Log: [A] [B] [C] [D]"]
L1 -->|Committed| C1["Idx 1-5 committed"]
L2 -->|Truncated| C2["Missing idx 4,5"]
L3 -->|Truncated| C3["Missing idx 5"]
When a new leader takes over, it may have gaps in its log where other nodes already committed entries. Multi-Paxos handles this by having the leader re-propose missing entries rather than copying them — ensuring consistency without a full synchronization.
Skipped Log Indices
What happens if a client request needs to go into log position 100, but the log only has 50 entries? In practice, the leader detects gaps and either waits for the gap to be filled or replicates a no-op command to fill the gap. Some implementations use a snapshot-based approach where the state machine state is checkpointed and old log entries are discarded.
Multi-Paxos vs Raft
Raft was designed to be understandable by explicitly separating concerns. Multi-Paxos is more abstract and leaves many details unspecified. This distinction has practical implications.
graph LR
subgraph "Design Philosophy"
M[Multi-Paxos<br/>Abstract model<br/>Minimal assumptions] --> R[Raft<br/>Concrete specification<br/>Every case defined]
end
subgraph "Practical Trade-offs"
M2[Multi-Paxos<br/>Fewer network rounds<br/>for stable workloads<br/>Harder to implement] --> R2[Raft<br/>Easier to implement<br/>Correctly<br/>More rounds]
end
| Aspect | Multi-Paxos | Raft |
|---|---|---|
| Network Rounds (stable leader) | 1 RTT per entry | 1 RTT per entry |
| Leader Election | Via Paxos (2+ RTTs) | Heartbeat-based |
| Log Compaction | unspecified (varies) | Defined clearly (snapshot + log truncation) |
| Membership Changes | Complex, varies by implementation | Joint consensus or single-server changes |
| Implementation Difficulty | High (many unspecified details) | Moderate (clear specification) |
| Leader Stability | Critical for performance | Important but more resilient |
| Disk I/O | Implementation-dependent | Typically batched snapshots |
Raft separates leader election from log replication explicitly. Multi-Paxos treats these as the same mechanism. The Raft paper gives you reference pseudo-code for all components; Multi-Paxos papers focus on correctness proofs rather than implementation guidance.
For new projects, Raft is usually the better choice. The clarity of specification reduces implementation bugs. For existing systems (Chubby, Zookeeper’s Zab), understanding Multi-Paxos helps you reason about their behavior.
Checkpointing and State Machine Recovery
Multi-Paxos logs grow indefinitely. Without checkpointing, recovery time grows proportionally to log length. Checkpointing (or snapshotting) solves this by periodically saving the state machine state and discarding old log entries.
What a Checkpoint Contains
A checkpoint is not just a memory dump. It must capture everything needed to reconstruct the state machine at a specific log index:
class Checkpoint:
def __init__(self, last_applied_index, state, log_length):
self.last_applied_index = last_applied_index
# The highest log index whose command has been applied
# All earlier entries are now reflected in state
self.state = state
# The actual state machine state (key-value pairs,
# document contents, etc.)
self.log_length = log_length
# How many log entries existed when checkpoint was taken
# Entries before this index can be discarded
The Replay Process
When a node restarts, it must reconstruct its state by reading the checkpoint and replaying log entries after the checkpoint index.
graph TB
subgraph "State Machine Recovery"
S1[Read last checkpoint<br/>last_applied=500<br/>state=Snapshot_500] --> S2[Find log entries<br/>501, 502, 503...]
S2 --> S3[Replay entry 501<br/>Apply to state]
S3 --> S4[Replay entry 502<br/>Apply to state]
S4 --> S5[Replay entry 503<br/>Apply to state]
S5 --> S6[State machine matches<br/>other replicas]
end
Checkpoint Creation and Log Truncation
Checkpoints must be created consistently across replicas to avoid divergence.
sequenceDiagram
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
participant SM as State Machine
Note over L: Checkpoint interval reached
L->>SM: Create checkpoint<br/>state=apply(log 1-500)
L->>F1: Send checkpoint + index=500
L->>F2: Send checkpoint + index=500
Note over SM: Log entries 1-500 now reflected<br/>in state
F1->>F1: Apply checkpoint locally
F2->>F2: Apply checkpoint locally
L->>F1: Truncate log entries 1-500
L->>F2: Truncate log entries 1-500
Note over L,F2: All replicas now at<br/>checkpoint index 500
The leader coordinates checkpoint creation to ensure all replicas can truncate the same log entries. If a follower is slow, the leader waits before truncating.
Consistency Requirements
A checkpoint is only trustworthy if it reflects a consistent view of the state machine. In Multi-Paxos terms, this means the checkpoint index must be a committed log position.
If you checkpoint at index 400 but index 401 was not yet committed (no majority acknowledged), you might checkpoint state that other nodes never saw. After a failure, the replica might diverge.
Proper implementations only checkpoint at committed indices.
Scalability Considerations
Multi-Paxos scales well for write-heavy workloads under stable leadership, but certain limits emerge as cluster size increases.
Leader Bottleneck
The designated leader becomes a bottleneck at high throughput. Every write must go through the leader, which sends Accept messages to all followers and waits for majority acknowledgment. As more nodes join the cluster, the leader’s network and disk I/O burden grows proportionally.
For a 3-node cluster, the leader sends 2 messages per entry. For a 7-node cluster, it sends 6 messages per entry. The network overhead scales with cluster size, and disk I/O remains constant per entry regardless of cluster size.
# Throughput estimation under stable leadership
def estimate_throughput(node_count, fsync_latency_ms, network_latency_ms):
# Time per entry = max(fsync_latency, network_round_trip) + processing
round_trip = network_latency_ms * 2 # leader -> follower -> leader
entry_time_ms = max(fsync_latency_ms, round_trip)
max_throughput = 1000 / entry_time_ms
return max_throughput
# 3 nodes, 5ms fsync, 1ms network: ~200 entries/sec
# 7 nodes, 5ms fsync, 1ms network: ~200 entries/sec (fsync bound)
# Same throughput, more network overhead for larger clusters
Read Scalability Workarounds
Multi-Paxos clusters often implement follower reads for horizontal read scalability. Followers serve stale reads by applying committed log entries to their local state machine. This trades consistency for throughput.
Client --> Leader: write x=1
Leader --> Followers: Accept(n=10, v="write x=1")
Followers --> Leader: Accepted
Leader --> Client: committed
Client --> Follower: read x (returns local state, potentially stale)
For strong consistency reads, clients must route to the leader. Some systems use lease-based follower reads where the leader grants followers permission to serve reads for a time window, accepting staleness in exchange for distributed read throughput.
Membership Change Overhead
Adding nodes to a Multi-Paxos cluster requires careful coordination. The protocol does not specify membership changes, so implementations typically:
- Pause client traffic
- Propagate new configuration through the log
- Wait for the new configuration to commit
- Resume traffic
This pause duration grows with the number of nodes being added, making large-scale cluster expansion disruptive.
Durability and Performance Trade-offs
Multi-Paxos provides strong durability guarantees by default: a value is only committed when a majority acknowledges it. However, this creates performance tensions that production deployments must navigate.
The Durability Throughput Tension
Synchronous persistence before acknowledgment creates a hard throughput ceiling. Each entry must survive a crash to be considered committed, which means fsync completes before the leader responds to clients.
| Configuration | Durability | Throughput | Crash Loss |
|---|---|---|---|
| Sync per entry | Highest | Lowest (~100-1000/sec) | None |
| Batch fsync (8 entries) | High | Moderate (~5000/sec) | Last batch |
| Async append + background sync | Moderate | High (~20000/sec) | Several seconds of writes |
Battery-Backed Write Caches
Enterprise storage systems use battery-backed write caches (BBWC) to accelerate durability acknowledgments. The cache is battery-protected, so data in the cache survives power loss. The storage system acknowledges writes after they hit the BBWC, before data reaches disk.
Leader --> Storage: write entry 501
Storage --> BBWC: append entry
BBWC --> Storage: acknowledged (durable even if power lost)
Storage --> Leader: Accepted
Leader --> Followers: Accept(n=10, v="entry 501")
This enables high throughput while maintaining durability guarantees. If power fails, the BBWC flushes to disk on recovery.
epoch-based Leases
Some Multi-Paxos variants use epoch-based leases to batch acknowledgments across multiple entries. The leader proposes a range of entries and followers acknowledge the entire epoch with a single round trip.
# Epoch-based batching
def on_client_request(value):
epoch = current_epoch
entries.append(Entry(epoch, sequence++, value))
if should_flush_epoch():
# Send all entries in epoch together
send_accept_batch(entries)
wait_for_majority_ack(entries)
commit_epoch(epoch)
entries.clear()
This amortizes network round trips across many entries but introduces a vulnerability window. If the leader crashes before the epoch commits, all entries in the epoch are lost.
Relationship to Other Patterns
Multi-Paxos forms the backbone of many replicated log systems. My post on Two-Phase Commit discusses a different approach to coordinating distributed transactions, though it lacks the fault-tolerance guarantees that Multi-Paxos provides.
For building complete replicated services, you’ll need more than just Multi-Paxos. My posts on Database Replication and Distributed Transactions explore how consensus fits into broader system designs.
Practical Applications
Google’s Spanner uses Multi-Paxos variants for synchronizing replicas across data centers. Chubby uses it for lock service. Many distributed databases use the same underlying pattern: a replicated log of commands that is applied to each replica’s state machine.
The reason this works in practice: basic Paxos is expensive, but you only need it once to establish leadership. After that, the common case is fast.
Production Failure Scenarios
Scenario 1: The Phantom Write Problem
What happened: A financial trading system using Multi-Paxos lost orders during a leadership election. The old leader had received and logged an order but the new leader, having a shorter log, did not have the order committed.
Root cause: The client sent an order to the old leader, which wrote it to its local log and returned success to the client. However, the old leader crashed before replicating the entry to a majority of followers. The new leader, with a different log prefix, was elected and continued processing without the lost order.
Impact: Several customer orders were silently dropped. The trading system had to reconcile its actual state against what clients believed had succeeded.
Lesson learned: Clients must wait for commit acknowledgments from a majority before considering a request complete. Idempotency keys help with client-side retry safety.
Scenario 2: The Stale Checkpoint Catastrophe
What happened: A distributed database using Multi-Paxos experienced silent data corruption after a routine maintenance event when a checkpoint was restored from an uncommitted log index.
Root cause: The checkpoint was taken at log index 400, but index 401 had not yet been committed (no majority acknowledgment). When a follower took a local checkpoint including index 401’s partial state, and then crashed and restarted, it restored state that other replicas had never seen.
Impact: After recovery, that replica diverged from the cluster. Detecting the divergence required expensive checksum verification across all data.
Lesson learned: Checkpoints must only be taken at committed log indices. The committed index is the highest index that a majority has acknowledged—nothing less is safe.
Scenario 3: The Fsync Bottleneck Chain
What happened: A high-throughput key-value store using Multi-Paxos experienced severe write latency spikes during peak load, eventually timing out clients.
Root cause: The system ran on spinning disks where fsync latency dominated. At high throughput, the single leader became a bottleneck—the disk could not keep up with the rate of accepted proposals requiring synchronous persistence. Write queues grew, latency spiked, and clients timed out.
Impact: The service became effectively unavailable during peak load despite having adequate CPU and network capacity.
Lesson learned: Profile disk I/O patterns before deploying Multi-Paxos. Consider batching fsync calls, using battery-backed write caches, or switching to SSDs for write-heavy workloads.
Common Pitfalls / Anti-Patterns
Multi-Paxos sounds simple, but production implementations face real challenges. The gap between theory and practice is substantial.
Persistence: If the leader crashes before replicating to a majority, clients may think a command succeeded when it actually failed. Careful clients must track which entries are actually committed.
Checkpointing: The log grows indefinitely. Periodic checkpointing of the state machine lets you truncate old entries, but taking consistent checkpoints across a distributed system is its own problem.
Membership changes: Adding or removing nodes while the system is running requires careful coordination. Most implementations pause the protocol, change membership, and resume.
Leader Stability Challenges
Multi-Paxos needs a relatively stable leader to get its performance benefits. When leadership flips often, the protocol collapses back toward basic Paxos.
Consider a scenario with three nodes (N1, N2, N3) and a client issuing commands:
# Scenario: Leadership changes during command processing
# Time T1: N1 is leader
client_cmd_1 = {"seq": 1, "value": "write x=1"}
# N1 sends Accept to N2, N3
# Time T2: N1 fails, N2 becomes leader before cmd_1 commits
# N2 receives client_cmd_2 = {"seq": 1, "value": "write x=2"}
# N2 proposes seq=1 with value "write x=2"
# Time T3: N1 recovers, N3 thinks N1 is still leader
# N3 receives Accept from N1 for seq=1, N2 for seq=1
# Conflict: same sequence number, different values
If clients retry with the same sequence numbers after leader changes, conflicting proposals can emerge. The system must either assign sequence numbers centrally (defeating distributed benefits) or accept that some commands will be lost and require client-level retry logic.
Log Compaction Interaction
Checkpointing interacts with Multi-Paxos in subtle ways. A checkpoint must capture the complete state, but determining “complete” in a distributed setting is tricky.
graph TB
subgraph "Checkpoint Creation Process"
C1[Leader creates checkpoint<br/>of state machine state]
C2[Send checkpoint to all followers<br/>asynchronously]
C3[Followers apply checkpoint<br/>locally]
C4[Leader truncates log<br/>at checkpoint index]
C5[Clients can now read<br/>consistent snapshot]
C1 --> C2 --> C3 --> C4 --> C5
end
Here is the tricky part: a follower might be at log index 100 while the leader checkpointed at index 150. If the follower crashes and restarts, it must reconstruct state from the checkpoint plus log entries 101-150. If any of those entries were lost during the crash, the follower has an inconsistent view that no amount of Paxos rounds will fix.
Disk I/O Bottlenecks
Multi-Paxos requires durability before acknowledging accepted proposals. Each Accept message typically triggers a fsync to persistent storage.
# Typical accept handler with persistence
def on_accept(proposal_id, value):
# 1. Write to log (sequential I/O, relatively fast)
log.append(proposal_id, value)
# 2. Persist to disk (fsync - orders of magnitude slower)
log.flush_to_disk() # ~5-10ms on spinning disk
log.sync() # fsync call
# 3. Update in-memory state
state.update(value)
# 4. Send response
return Accepted(proposal_id)
On a system with spinning disks, fsync latency dominates. A single leader processing thousands of commands per second can become bottlenecked on disk I/O. SSDs help, but the fundamental issue stays: every accepted proposal requires synchronous persistence.
Some implementations batch multiple accepts before fsync, trading durability for throughput. This means a crash might lose several recently-accepted commands.
Concurrency Control Limitations
Multi-Paxos decides on a sequence of values but provides no native concurrency control for overlapping writes to the same keys.
# Multi-Paxos does not handle this:
# Client A: write x = 1, read x
# Client B: write x = 2, read x
# If both operations touch the same key "x",
# Multi-Paxos guarantees they are ordered,
# but provides no mechanism to detect
# or prevent conflicting access patterns
# You must layer your own concurrency control:
# - External locking (Zookeeper advisory locks)
# - Optimistic concurrency (version vectors)
# - Pessimistic locking (2PL, MVCC)
For simple key-value stores, this is manageable. For complex transactions touching multiple keys, you need additional coordination.
Persistent State Requirements
Before sending Accept messages, a proper Multi-Paxos implementation must persist certain state to ensure crash safety.
# State that MUST be persisted before sending Accept
class AcceptorState:
def on_prepare_response(self, proposal_id, accepted_id, accepted_value):
# Persist promise before sending ack
# If we crash after sending Promise but before persisting,
# we might accept a lower-numbered proposal after restart
self.promised_id = proposal_id
persist_to_disk({
'promised_id': self.promised_id,
'accepted_id': self.accepted_id, # Also persist accepted
'accepted_value': self.accepted_value
})
def on_accept(self, proposal_id, value):
# Must persist BEFORE sending Accepted response
# If we crash after sending response but before persisting,
# we might lose knowledge of what we accepted
if proposal_id >= self.promised_id:
self.accepted_id = proposal_id
self.accepted_value = value
persist_to_disk({
'accepted_id': self.accepted_id,
'accepted_value': self.accepted_value
})
return Accepted(proposal_id)
The invariant is simple: once you have promised not to accept lower proposals, or have accepted a value, that state must survive crashes. Violating this leads to split-brain scenarios where different nodes believe different values were chosen.
Quick Recap
Before diving into implementation, ensure you understand:
- Basic Paxos agrees on a single value; Multi-Paxos extends this to a sequence of values
- The optimization: skip Phase 1 (prepare) for consecutive entries once a leader is stable
- A stable leader enables 1-RTT per entry instead of 2-RTT
- Log gaps occur during leader changes and are handled by re-proposing missing entries
- Checkpointing must occur at committed log indices to maintain consistency
- Disk I/O (fsync) is the primary bottleneck for high-throughput Multi-Paxos implementations
- Multi-Paxos provides no native concurrency control—layer it on top
- Membership changes require pausing the protocol in most implementations
Interview Questions
Key points:
- Basic Paxos only agrees on a single value; Multi-Paxos agrees on a sequence of values
- This enables replicated state machines where a distributed log drives state changes
- Each log position becomes a separate Paxos instance, but optimized for the common case
Key points:
- After a leader wins a majority vote, it assumes leadership is stable for a period
- Subsequent entries skip Phase 1 (prepare phase) and go directly to Phase 2 (accept phase)
- This reduces 2-RTT per entry to 1-RTT per entry after leader election
Key points:
- A new leader must be elected, typically via Paxos itself (prepare messages)
- The new leader may have gaps in its log where other nodes committed entries
- Multi-Paxos handles this by re-proposing missing entries, not copying them
- This ensures consistency without requiring full synchronization
Key points:
- Frequent leadership changes collapse the protocol back toward basic Paxos
- Each leadership change requires new Phase 1 (prepare) rounds before Phase 2
- This eliminates the 1-RTT optimization benefit for subsequent entries
- Systems must balance leader leases, heartbeats, and failure detection carefully
Key points:
- Without checkpointing, recovery time grows proportionally to log length
- Checkpointing periodically saves the state machine state and discards old log entries
- A checkpoint must capture everything needed to reconstruct state at a specific log index
- Only checkpoint at committed indices—uncommitted state would cause divergence
Key points:
- Every accepted proposal requires fsync to persistent storage before acknowledgment
- fsync latency (~5-10ms on spinning disks) dominates throughput
- Solutions: batch multiple accepts before fsync (trades durability), use SSDs, pipeline writes
- Batching means a crash might lose several recently-accepted commands
Key points:
- Membership changes require careful coordination to maintain consistency
- Most implementations pause the protocol, change membership, then resume
- The Paxos mechanism itself does not specify how to handle membership changes
- Joint consensus or single-server changes are common approaches (as in Raft)
Key points:
- Multi-Paxos decides the order of values but provides no conflict detection
- Overlapping writes to the same keys are ordered but not prevented
- Additional layering required: advisory locks (Zookeeper), optimistic concurrency (version vectors), or pessimistic locking (2PL, MVCC)
- For multi-key transactions, additional coordination beyond Multi-Paxos is essential
Key points:
- Multi-Paxos is abstract with many unspecified details—high implementation difficulty
- Raft provides a concrete specification with pseudo-code for all components
- Raft separates leader election from log replication explicitly; Multi-Paxos treats them as the same
- For new projects, Raft is usually the better choice due to clarity of specification
Key points:
- Promise state: once promised not to accept lower-numbered proposals, that must survive crashes
- Accepted state: once accepted a value, that must persist before sending Accepted response
- The invariant: both promised_id and accepted_id/accepted_value must survive crashes
- Violating this leads to split-brain scenarios where nodes disagree on chosen values
Key points:
- Spanner uses Multi-Paxos variants for synchronizing replicas across data centers
- Chubby uses Multi-Paxos for its lock service coordination
- These systems demonstrate that basic Paxos, despite its theoretical nature, underlies production distributed systems
- The optimization (1-RTT after leader election) makes it fast enough for production use
Key points:
- Multi-Paxos is not a separate protocol—it extends Basic Paxos
- Each log slot is treated as an independent Basic Paxos instance
- Leader election in Multi-Paxos is itself implemented using Basic Paxos
- The "multi" part is an optimization layer on top of repeated Basic Paxos instances
Key points:
- Truncation removes entries at the last known stable index, skipping gaps
- Gaps naturally occur during leader changes
- The new leader may not have entries that were committed by the previous leader
- The leader re-proposes missing entries rather than copying them from other nodes
Key points:
- The follower must reconstruct state from its last checkpoint plus log entries after that index
- If those log entries were lost during the crash, the follower has an inconsistent view
- This inconsistency cannot be fixed by additional Paxos rounds
- Proper persistence of promised and accepted state before responding prevents this
Key points:
- If the leader crashes after receiving a request but before replicating to a majority, the command may be lost
- Careful clients must track which entries are actually committed before considering a request successful
- Client retries with the same sequence numbers after leader changes can cause conflicting proposals
- The system must either assign sequence numbers centrally or accept lost commands requiring client-level retry
Key points:
- Under stable leadership, Multi-Paxos achieves 1-RTT per entry vs 2-RTT for basic Paxos
- The optimization eliminates prepare phases for consecutive entries once leader is established
- With frequent leadership changes, Multi-Paxos collapses to basic Paxos performance
- Leader leases and failure detection tuning directly impact effective throughput
Key points:
- Synchronous checkpointing pauses the protocol during checkpoint creation, ensuring all replicas stay in sync
- Asynchronous checkpointing allows the leader to continue processing but risks divergent checkpoints if the leader crashes mid-creation
- Synchronous is safer but impacts latency; asynchronous is faster but requires careful crash recovery handling
- Hybrid approaches checkpoint when the system is idle or during low-traffic periods
Key points:
- The optimization skips Phase 1 (prepare phase) assuming leadership remains stable
- If leadership changes frequently, each new leader must run Phase 1 before Phase 2, losing the 1-RTT benefit
- Systems use leader leases (time-bound authority) to maintain stability while allowing failure detection
- The trade-off: aggressive leases improve performance but risk longer outages if the leader fails unexpectedly
Key points:
- Client requests that the old leader accepted but did not commit to a majority are lost
- The client must retry the request, but retrying with the same sequence number risks conflicting proposals if a new leader is already processing different values
- Idempotency keys help clients safely retry without causing duplicate commands
- The new leader re-proposes missing log entries, but requests that only reached the old leader's log are effectively dropped
Key points:
- Epoch-based batching collects multiple entries and sends them as a single accept round, amortizing network overhead
- Followers acknowledge an entire epoch with one round trip, reducing per-entry latency
- The risk: if the leader crashes before the epoch commits, all entries in that epoch are lost
- Systems must balance batch size against crash loss window—larger batches mean higher throughput but greater loss on failure
Further Reading
- Paxos Made Simple — Leslie Lamport’s original paper
- Paxos Made Live — Google’s engineering retrospective on implementing Chubby’s Multi-Paxos variant
- The Chubby Lock Service — Original paper describing how Paxos underlies Google’s coordination service
- Spanner: Google’s Globally Distributed Database — Shows how Multi-Paxos variants synchronize replicas across data centers
- Raft Refloated — Comparison of Raft and Multi-Paxos design philosophies
- On the Parallels between Paxos and Raft — Formal analysis connecting the two consensus approaches
Conclusion
Multi-Paxos extends the theoretical Paxos algorithm into a practical protocol for replicating logs and state machines. The optimization from two phases to one per entry makes it fast enough for production use.
Understanding Multi-Paxos helps when working with systems like Zookeeper, etcd, or distributed databases. But for new projects, Raft usually provides a cleaner foundation.
Category
Related Posts
Paxos Consensus Algorithm
Paxos is a family of consensus algorithms for achieving agreement in distributed systems despite failures, pioneered by Leslie Lamport.
Google Chubby: The Lock Service That Inspired ZooKeeper
Explore Google Chubby's architecture, lock-based coordination, Paxos integration, cell hierarchy, and its influence on distributed systems design.
Raft Consensus Algorithm
Raft is a consensus algorithm designed to be understandable, decomposing the problem into leader election, log replication, and safety.