Multi-Paxos
Multi-Paxos extends basic Paxos to achieve consensus on sequences of values, enabling practical replicated state machines for distributed systems and databases.
Introduction
Basic Paxos agrees on a single value. Run it once, and all non-faulty nodes commit to the same value. Run it again with a different value, and they commit to that. But what if you need to agree on a sequence of values, like a replicated log of commands? That’s what Multi-Paxos solves.
The extension is conceptually simple: treat each slot in the log as a separate instance of basic Paxos. The complexity comes from optimization—running Paxos from scratch for every log entry would be prohibitively expensive.
The Core Idea
Multi-Paxos optimizes away the prepare phase for consecutive entries once a leader is established. The insight is that if a proposer has recently completed Phase 1 with a majority, it can reasonably assume it remains the leader for a period. During this stable period, it can skip Phase 1 and go directly to Phase 2 for subsequent entries.
This optimization transforms Multi-Paxos from a two-phase protocol per entry to essentially a one-phase protocol after the initial leader election.
sequenceDiagram
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
rect rgb(40, 40, 60)
Note over L,F2: First entry: Full Paxos (Phase 1 + Phase 2)
L->>F1: Prepare(n=10)
L->>F2: Prepare(n=10)
F1-->>L: Promise(v=nil)
F2-->>L: Promise(v=nil)
L->>F1: Accept(n=10, v="cmd1")
L->>F2: Accept(n=10, v="cmd1")
F1-->>L: Accepted
F2-->>L: Accepted
end
rect rgb(40, 60, 40)
Note over L,F2: Subsequent entries: Optimized (Phase 2 only)
L->>F1: Accept(n=10, v="cmd2")
L->>F2: Accept(n=10, v="cmd2")
F1-->>L: Accepted
F2-->>L: Accepted
L->>F1: Accept(n=10, v="cmd3")
L->>F2: Accept(n=10, v="cmd3")
F1-->>L: Accepted
F2-->>L: Accepted
end
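The optimization in the diagram can be sketched in a few lines of Python. The class and message tuples below are illustrative names, not from any particular implementation: Phase 1 runs once, and each subsequent command reuses the same ballot number for a new log slot.

```python
# Sketch of the Phase-1-skipping optimization (illustrative names).
class MultiPaxosLeader:
    def __init__(self, proposal_n):
        self.proposal_n = proposal_n   # ballot number won in Phase 1
        self.next_slot = 0             # next free log index
        self.has_leadership = False

    def elect(self):
        # Phase 1 runs once, covering all future slots at this ballot.
        self.has_leadership = True
        return [("Prepare", self.proposal_n)]

    def propose(self, value):
        if not self.has_leadership:
            raise RuntimeError("must run Phase 1 first")
        slot = self.next_slot
        self.next_slot += 1
        # Phase 2 only: one Accept message per log slot while leadership holds.
        return [("Accept", self.proposal_n, slot, value)]
```

In a real implementation, `elect` would wait for a majority of Promise responses before setting `has_leadership`, and a rejected Accept (someone saw a higher ballot) would clear it.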
Role of the Leader
Multi-Paxos typically designates a stable leader to avoid contention. Without leadership, multiple nodes might propose conflicting values for the same log position, causing retries and latency spikes.
The leader election can use basic Paxos itself (running a round of prepare messages to see if any node can get a majority). Some implementations piggyback on the first client’s request to establish leadership lazily.
Log Gaps and Truncation
Gaps in the log arise naturally when leaders change: the new leader may be missing entries that the previous leader replicated, if those entries had not reached the new leader before the failover. Before the new leader can safely apply or propose later entries, it must resolve these gap slots.
graph LR
L1["Node 1<br/>Log: [A] [B] [C] [D] [E]"]
L2["Node 2<br/>Log: [A] [B] [C]"]
L3["Node 3<br/>Log: [A] [B] [C] [D]"]
L1 -->|Committed| C1["Idx 1-5 committed"]
L2 -->|Gap| C2["Missing idx 4,5"]
L3 -->|Gap| C3["Missing idx 5"]
When a new leader takes over, it may have gaps in its log where other nodes already accepted entries. Multi-Paxos handles this by having the new leader run Phase 1 for the unresolved slots and re-propose any values it discovers, ensuring consistency without a full log copy.
Skipped Log Indices
What happens if a client request needs to go into log position 100, but the log only has 50 entries? In practice, the leader detects gaps and either waits for the gap to be filled or replicates a no-op command to fill the gap. Some implementations use a snapshot-based approach where the state machine state is checkpointed and old log entries are discarded.
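The no-op approach can be sketched as follows. `fill_gaps` is a hypothetical helper: it only collects the proposals, whereas a real leader would run a Paxos round for each missing slot to check whether some value was already accepted there.

```python
NOOP = {"op": "noop"}  # illustrative no-op command

def fill_gaps(log, target_index):
    """Collect no-op proposals for every undecided slot below target_index.

    `log` maps decided indices to commands. Sketch only: a real leader
    runs Phase 1 per missing slot and proposes NOOP only if no value
    was previously accepted there.
    """
    proposals = []
    for idx in range(target_index):
        if idx not in log:
            proposals.append((idx, NOOP))
    return proposals
```

Once the no-ops commit, the state machine can apply entries in order without stalling on holes; applying a no-op changes nothing.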
Implementation Challenges
Multi-Paxos sounds simple, but production implementations face real challenges. The gap between theory and practice is substantial.
Persistence: If the leader acknowledges a client before the entry is replicated to a majority, a crash can lose the command even though the client saw success. Careful clients must track which entries are actually committed.
Checkpointing: The log grows indefinitely. Periodic checkpointing of the state machine lets you truncate old entries, but taking consistent checkpoints across a distributed system is its own problem.
Membership changes: Adding or removing nodes while the system is running requires careful coordination. Most implementations pause the protocol, change membership, and resume.
Leader Stability Challenges
Multi-Paxos needs a relatively stable leader to get its performance benefits. When leadership flips often, the protocol collapses back toward basic Paxos.
Consider a scenario with three nodes (N1, N2, N3) and a client issuing commands:
# Scenario: Leadership changes during command processing
# Time T1: N1 is leader
client_cmd_1 = {"seq": 1, "value": "write x=1"}
# N1 sends Accept to N2, N3
# Time T2: N1 fails, N2 becomes leader before cmd_1 commits
# N2 receives client_cmd_2 = {"seq": 1, "value": "write x=2"}
# N2 proposes seq=1 with value "write x=2"
# Time T3: N1 recovers, N3 thinks N1 is still leader
# N3 receives Accept from N1 for seq=1, N2 for seq=1
# Conflict: same sequence number, different values
If clients retry with the same sequence numbers after leader changes, conflicting proposals can emerge. The system must either assign sequence numbers centrally (defeating distributed benefits) or accept that some commands will be lost and require client-level retry logic.
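One common mitigation for the retry problem, sketched here with assumed names rather than taken from the text above, is to deduplicate in the state machine: clients tag each command with a unique (client_id, request_id) pair, and the state machine applies each pair at most once regardless of which log slot the retry lands in.

```python
class DedupStateMachine:
    """Sketch: deduplicate client retries by (client_id, request_id).

    Because every replica applies the same log in the same order,
    every replica caches and replays the same result for a retry.
    """
    def __init__(self):
        self.applied = {}   # (client_id, request_id) -> cached result
        self.state = {}     # the key-value state itself

    def apply(self, cmd):
        key = (cmd["client_id"], cmd["request_id"])
        if key in self.applied:
            return self.applied[key]   # retry: replay the cached result
        self.state[cmd["k"]] = cmd["v"]
        self.applied[key] = "ok"
        return "ok"
```

This keeps sequence-number assignment decentralized at the cost of tracking per-client request IDs (bounded in practice by expiring old entries).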
Log Compaction Interaction
Checkpointing interacts with Multi-Paxos in subtle ways. A checkpoint must capture the complete state, but determining “complete” in a distributed setting is tricky.
graph TB
subgraph "Checkpoint Creation Process"
C1[Leader creates checkpoint<br/>of state machine state]
C2[Send checkpoint to all followers<br/>asynchronously]
C3[Followers apply checkpoint<br/>locally]
C4[Leader truncates log<br/>at checkpoint index]
C5[Clients can now read<br/>consistent snapshot]
C1 --> C2 --> C3 --> C4 --> C5
end
Here is the tricky part: a follower might be at log index 100 while the leader checkpointed at index 150 and truncated its log. If the follower crashes and restarts, it must reconstruct state from the checkpoint plus log entries 101-150. If those entries are gone, no number of Paxos rounds will recover them; the follower must instead install a fresh checkpoint from another replica.
Disk I/O Bottlenecks
Multi-Paxos requires durability before acknowledging accepted proposals: each Accept message typically triggers an fsync to persistent storage.
# Typical accept handler with persistence
def on_accept(proposal_id, value):
    # 1. Append to the log (sequential I/O, relatively fast)
    log.append(proposal_id, value)
    # 2. Persist to disk: the fsync is orders of magnitude slower
    #    (~5-10ms on a spinning disk)
    log.sync()  # fsync call
    # 3. Update in-memory state
    state.update(value)
    # 4. Send response
    return Accepted(proposal_id)
On a system with spinning disks, fsync latency dominates. A single leader processing thousands of commands per second can become bottlenecked on disk I/O. SSDs help, but the fundamental issue remains: every accepted proposal requires synchronous persistence.
Some implementations batch multiple accepts before fsync, trading durability for throughput. This means a crash might lose several recently-accepted commands.
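A group-commit sketch of that batching idea follows. The class and parameters are illustrative; real implementations also flush on a timeout so a lone request is not stalled waiting for a full batch.

```python
import os
import threading

class GroupCommitLog:
    """Batch several appends behind a single fsync (group commit sketch)."""

    def __init__(self, path, batch_size=8):
        self.f = open(path, "ab")
        self.batch_size = batch_size
        self.pending = 0
        self.lock = threading.Lock()

    def append(self, entry: bytes):
        with self.lock:
            self.f.write(entry + b"\n")
            self.pending += 1
            if self.pending >= self.batch_size:
                self.f.flush()
                os.fsync(self.f.fileno())  # one fsync covers the whole batch
                self.pending = 0
```

The trade-off is visible in the structure: entries between fsyncs are acknowledged together, so a crash can lose up to `batch_size - 1` recently appended entries.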
Concurrency Control Limitations
Multi-Paxos decides on a sequence of values but provides no native concurrency control for overlapping writes to the same keys.
# Multi-Paxos does not handle this:
# Client A: write x = 1, read x
# Client B: write x = 2, read x
# If both operations touch the same key "x",
# Multi-Paxos guarantees they are ordered,
# but provides no mechanism to detect
# or prevent conflicting access patterns
# You must layer your own concurrency control:
# - External locking (e.g. ZooKeeper advisory locks)
# - Optimistic concurrency (version checks, MVCC)
# - Pessimistic locking (two-phase locking)
For simple key-value stores, this is manageable. For complex transactions touching multiple keys, you need additional coordination.
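As a sketch of the optimistic approach, each write can carry the version the client read, and the state machine rejects stale writes deterministically. Names below are illustrative; the key property is that every replica, applying the same log in the same order, reaches the same verdict.

```python
class VersionedStore:
    """Optimistic concurrency layered above a Multi-Paxos log (sketch)."""

    def __init__(self):
        self.data = {}  # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def apply(self, cmd):
        # cmd = {"key": ..., "value": ..., "expected_version": ...}
        value, version = self.data.get(cmd["key"], (None, 0))
        if cmd["expected_version"] != version:
            return "conflict"  # deterministic: same verdict on every replica
        self.data[cmd["key"]] = (cmd["value"], version + 1)
        return "ok"
```

A client that receives "conflict" re-reads the key and retries with the new version, which is exactly the client-side loop Multi-Paxos itself does not provide.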
Persistent State Requirements
Before responding to Prepare or Accept messages, a proper Multi-Paxos acceptor must persist certain state to ensure crash safety.
# State that MUST be persisted before replying to Prepare or Accept
class AcceptorState:
    def on_prepare(self, proposal_id):
        # Persist the promise before sending it.
        # If we crash after sending Promise but before persisting,
        # we might accept a lower-numbered proposal after restart.
        if proposal_id > self.promised_id:
            self.promised_id = proposal_id
            persist_to_disk({
                'promised_id': self.promised_id,
                'accepted_id': self.accepted_id,  # also persist accepted state
                'accepted_value': self.accepted_value
            })
            return Promise(proposal_id, self.accepted_id, self.accepted_value)
        return Nack(self.promised_id)

    def on_accept(self, proposal_id, value):
        # Must persist BEFORE sending the Accepted response.
        # If we crash after responding but before persisting,
        # we lose knowledge of what we accepted.
        if proposal_id >= self.promised_id:
            self.accepted_id = proposal_id
            self.accepted_value = value
            persist_to_disk({
                'accepted_id': self.accepted_id,
                'accepted_value': self.accepted_value
            })
            return Accepted(proposal_id)
        return Nack(self.promised_id)
The invariant is simple: once you have promised not to accept lower proposals, or have accepted a value, that state must survive crashes. Violating this leads to split-brain scenarios where different nodes believe different values were chosen.
Multi-Paxos vs Raft
Raft was designed to be understandable by explicitly separating concerns. Multi-Paxos is more abstract and leaves many details unspecified. This distinction has practical implications.
graph LR
subgraph "Design Philosophy"
M[Multi-Paxos<br/>Abstract model<br/>Minimal assumptions] --> R[Raft<br/>Concrete specification<br/>Every case defined]
end
subgraph "Practical Trade-offs"
M2[Multi-Paxos<br/>Fewer network rounds<br/>for stable workloads<br/>Harder to implement] --> R2[Raft<br/>Easier to implement<br/>correctly<br/>More rounds]
end
| Aspect | Multi-Paxos | Raft |
|---|---|---|
| Network Rounds (stable leader) | 1 RTT per entry | 1 RTT per entry |
| Leader Election | Via Paxos (2+ RTTs) | Heartbeat-based |
| Log Compaction | Unspecified (varies) | Clearly defined (snapshot + log truncation) |
| Membership Changes | Complex, varies by implementation | Joint consensus or single-server changes |
| Implementation Difficulty | High (many unspecified details) | Moderate (clear specification) |
| Leader Stability | Critical for performance | Important but more resilient |
| Disk I/O | Implementation-dependent | Typically batched snapshots |
Raft separates leader election from log replication explicitly. Multi-Paxos treats these as the same mechanism. The Raft paper gives you reference pseudo-code for all components; Multi-Paxos papers focus on correctness proofs rather than implementation guidance.
For new projects, Raft is usually the better choice. The clarity of specification reduces implementation bugs. For existing systems (Chubby, ZooKeeper's Zab), understanding Multi-Paxos helps you reason about their behavior.
Checkpointing and State Machine Recovery
Multi-Paxos logs grow indefinitely. Without checkpointing, recovery time grows proportionally to log length. Checkpointing (or snapshotting) solves this by periodically saving the state machine state and discarding old log entries.
What a Checkpoint Contains
A checkpoint is not just a memory dump. It must capture everything needed to reconstruct the state machine at a specific log index:
class Checkpoint:
    def __init__(self, last_applied_index, state, log_length):
        self.last_applied_index = last_applied_index
        # The highest log index whose command has been applied;
        # all earlier entries are now reflected in state
        self.state = state
        # The actual state machine state (key-value pairs,
        # document contents, etc.)
        self.log_length = log_length
        # How many log entries existed when the checkpoint was taken;
        # entries before this index can be discarded
The Replay Process
When a node restarts, it must reconstruct its state by reading the checkpoint and replaying log entries after the checkpoint index.
graph TB
subgraph "State Machine Recovery"
S1[Read last checkpoint<br/>last_applied=500<br/>state=Snapshot_500] --> S2[Find log entries<br/>501, 502, 503...]
S2 --> S3[Replay entry 501<br/>Apply to state]
S3 --> S4[Replay entry 502<br/>Apply to state]
S4 --> S5[Replay entry 503<br/>Apply to state]
S5 --> S6[State machine matches<br/>other replicas]
end
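The replay loop itself is small. This sketch assumes the log is a mapping from index to command and that `apply_fn` mutates the state in place; both are illustrative, not the API of any specific system.

```python
def recover(checkpoint_state, last_applied, log, apply_fn):
    """Reconstruct state machine state after a restart (sketch).

    Entries at or below last_applied are already reflected in
    checkpoint_state, so only later entries are replayed, in order.
    """
    state = dict(checkpoint_state)
    for index in sorted(log):
        if index > last_applied:
            apply_fn(state, log[index])
    return state
```

Because application is deterministic, replaying the same entries over the same checkpoint always yields the same state as the replicas that never crashed.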
Checkpoint Creation and Log Truncation
Checkpoints must be created consistently across replicas to avoid divergence.
sequenceDiagram
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
participant SM as State Machine
Note over L: Checkpoint interval reached
L->>SM: Create checkpoint<br/>state=apply(log 1-500)
L->>F1: Send checkpoint + index=500
L->>F2: Send checkpoint + index=500
Note over SM: Log entries 1-500 now reflected<br/>in state
F1->>F1: Apply checkpoint locally
F2->>F2: Apply checkpoint locally
L->>F1: Truncate log entries 1-500
L->>F2: Truncate log entries 1-500
Note over L,F2: All replicas now at<br/>checkpoint index 500
The leader coordinates checkpoint creation to ensure all replicas can truncate the same log entries. If a follower is slow, the leader waits before truncating.
Consistency Requirements
A checkpoint is only trustworthy if it reflects a consistent view of the state machine. In Multi-Paxos terms, this means the checkpoint index must be a committed log position.
If you checkpoint at index 400 but index 401 was not yet committed (no majority acknowledged), you might checkpoint state that other nodes never saw. After a failure, the replica might diverge.
Proper implementations only checkpoint at committed indices.
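The rule reduces to a simple guard (names are illustrative): a checkpoint index must not exceed either the commit index or the index the local state machine has actually applied.

```python
def can_checkpoint_at(index, commit_index, last_applied):
    """Only checkpoint at indices that are both committed
    (majority-acknowledged) and already applied locally (sketch)."""
    return index <= commit_index and index <= last_applied
```

A checkpoint that violates either condition could capture state other nodes never agreed on, which is exactly the divergence scenario described above.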
Relationship to Other Patterns
Multi-Paxos forms the backbone of many replicated log systems. My post on Two-Phase Commit discusses a different approach to coordinating distributed transactions, though it lacks the fault-tolerance guarantees that Multi-Paxos provides.
For building complete replicated services, you’ll need more than just Multi-Paxos. My posts on Database Replication and Distributed Transactions explore how consensus fits into broader system designs.
Practical Applications
Google’s Spanner uses Multi-Paxos variants to synchronize replicas across data centers. Chubby uses it for its lock service. Many distributed databases follow the same underlying pattern: a replicated log of commands applied to each replica’s state machine.
The reason this works in practice: basic Paxos is expensive, but you only need it once to establish leadership. After that, the common case is fast.
Conclusion
Multi-Paxos extends the theoretical Paxos algorithm into a practical protocol for replicating logs and state machines. The optimization from two phases to one per entry makes it fast enough for production use.
Understanding Multi-Paxos helps when working with systems like Zookeeper, etcd, or distributed databases. But for new projects, Raft usually provides a cleaner foundation.