Paxos Consensus Algorithm
Paxos is a family of consensus algorithms for achieving agreement in distributed systems despite failures, pioneered by Leslie Lamport.
Introduction
Leslie Lamport first published the algorithm in 1998, though he had developed it years earlier in the 1980s. The name comes from a fictional parliament on the Greek island of Paxos, which Lamport used to illustrate how a group of nodes could agree on a value even when some nodes fail or messages get lost.
The problem Paxos solves: how do you get a collection of nodes to agree on a single value when the network is unreliable and nodes can crash? This comes up everywhere from choosing a leader to replicating state across a database cluster.
The Two Phases
Paxos operates in two distinct phases, each serving a specific purpose.
Phase 1: Prepare
A node that wants to propose a value becomes the proposer and begins Phase 1. It generates a proposal number higher than any it has used before and sends prepare messages to the acceptors, needing responses from at least a majority. Each acceptor responds by promising not to accept any proposal with a lower number, and it includes the highest-numbered proposal it has already accepted, if any.
This phase serves two purposes: it asks acceptors whether they will listen to the proposer, and it gathers information about any values that have already been accepted.
Phase 2: Accept
If the proposer receives responses from a majority of acceptors, it moves to Phase 2. It sends accept messages containing the proposal number and either its original value or, if any acceptors reported a previously accepted value, the value from the highest-numbered such proposal. Each acceptor accepts this proposal unless it has already promised a higher-numbered one.
Once a majority of acceptors accept a proposal, the value is chosen.
```mermaid
sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    rect rgb(40, 40, 60)
        Note over P,A3: Phase 1: Prepare
        P->>A1: Prepare(n=5)
        P->>A2: Prepare(n=5)
        P->>A3: Prepare(n=5)
        A1-->>P: Promise(n=5, accepted=nil)
        A2-->>P: Promise(n=5, accepted=nil)
        A3-->>P: Promise(n=5, accepted=nil)
    end
    rect rgb(40, 60, 40)
        Note over P,A3: Phase 2: Accept
        P->>A1: Accept(n=5, v="value")
        P->>A2: Accept(n=5, v="value")
        P->>A3: Accept(n=5, v="value")
        A1-->>P: Accepted(n=5, v="value")
        A2-->>P: Accepted(n=5, v="value")
        A3-->>P: Accepted(n=5, v="value")
    end
    Note over P,A3: Consensus achieved!
```
Why Majority Matters
Paxos requires only a majority of nodes to agree, not all nodes. This is crucial for handling failures gracefully. If you required all nodes, a single crash would block progress forever. By needing only a majority, a system of n nodes can tolerate up to floor((n-1)/2) failures and still make progress — for example, 2 failures out of 5 nodes.
Any two majorities overlap, which means if a value has been chosen, subsequent proposals will learn about it. This guarantees safety—if one node thinks a value has been chosen, no other node can choose a different value.
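The overlap property is easy to check exhaustively for a small cluster. The sketch below (illustrative, not part of any Paxos implementation) enumerates every majority of a 5-node cluster and confirms that any two of them share at least one node:

```python
from itertools import combinations

nodes = {1, 2, 3, 4, 5}
majority_size = len(nodes) // 2 + 1  # 3 of 5

# Every possible majority (3-node subset) of the cluster.
majorities = [set(c) for c in combinations(nodes, majority_size)]

# Any two majorities intersect: 3 + 3 > 5, so they cannot be disjoint.
assert all(a & b for a in majorities for b in majorities)
```

Because every pair of majorities intersects, at least one acceptor in any later Phase 1 quorum has seen the chosen value and will report it in its Promise.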
The Role of Proposal Numbers
Proposal numbers act as logical timestamps that order proposals globally. Each proposer generates numbers higher than any it has used before, typically by combining a unique node ID with an incrementing counter.
This numbering prevents conflicting proposals from derailing the system. If two proposers try to propose different values simultaneously, the higher proposal number wins, and the lower one defers.
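One common scheme (a sketch, with hypothetical names) packs a per-node counter into the high bits and a unique node ID into the low bits. Numbers are then globally unique, each proposer's numbers strictly increase, and ties on the counter are broken deterministically by node ID:

```python
NODE_ID_BITS = 16  # assumes node IDs fit in 16 bits

def make_proposal_id(counter: int, node_id: int) -> int:
    """Combine a monotonically increasing counter with a node ID
    so that no two nodes ever generate the same proposal number."""
    assert 0 <= node_id < (1 << NODE_ID_BITS)
    return (counter << NODE_ID_BITS) | node_id

# Node 7's round-3 proposal outranks node 9's round-2 proposal...
assert make_proposal_id(3, 7) > make_proposal_id(2, 9)
# ...and equal counters are ordered by node ID.
assert make_proposal_id(2, 9) > make_proposal_id(2, 7)
```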
Practical Considerations
Implementing Paxos correctly is notoriously difficult. The original paper reads like a story rather than a specification, which has led to many misconceptions. Chubby (Google’s lock service) and many distributed databases use variants of Paxos, but most production implementations differ from the textbook algorithm in subtle ways.
Multi-Paxos extends the basic algorithm to agree on a sequence of values, which is more useful for building replicated state machines. You can read about this extension in my Multi-Paxos post.
Relationship to CAP Theorem
Paxos achieves consistency in the CAP sense—it will not return inconsistent data. However, like any consensus algorithm, network partitions can prevent progress. During a partition, the algorithm stalls until the partition heals. This is sometimes called “consistency over availability,” though I find that framing oversimplifies things. You can learn more about this trade-off in my post on the CAP Theorem.
Failure Scenarios
Proposer Crash During Phase 1
If a proposer crashes after sending prepare messages but before collecting a majority of responses, a subsequent proposer simply gathers its own majority with a higher number. If the crashed proposer later recovers and sends accept messages with its old number, acceptors that have since promised the higher number will reject them.
```mermaid
sequenceDiagram
    participant P1 as Proposer A
    participant P2 as Proposer B
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    P1->>A1: Prepare(n=5)
    P1->>A2: Prepare(n=5)
    P1-xA3: Prepare(n=5) (message lost)
    Note over P1,A3: P1 crashes here
    P2->>A1: Prepare(n=6)
    P2->>A2: Prepare(n=6)
    P2->>A3: Prepare(n=6)
    A1-->>P2: Promise(n=6, accepted=nil)
    A2-->>P2: Promise(n=6, accepted=nil)
    A3-->>P2: Promise(n=6, accepted=nil)
    Note over P2,A3: P2 collects majority with n=6
    P2->>A1: Accept(n=6, v="x")
    P2->>A2: Accept(n=6, v="x")
    P2->>A3: Accept(n=6, v="x")
    Note over P2,A3: Value "x" chosen
```
Acceptor Crash
If an acceptor crashes after accepting a proposal but before responding to the proposer, the proposer simply continues once it receives majority acknowledgments from the others. The value survives because a majority recorded it. Real implementations persist acceptor state to stable storage before replying; an acceptor that truly forgot its promises after a restart could break an earlier promise and endanger safety.
```mermaid
sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    P->>A1: Prepare(n=5)
    P->>A2: Prepare(n=5)
    P->>A3: Prepare(n=5)
    A1-->>P: Promise(n=5)
    A2-->>P: Promise(n=5)
    A3-->>P: Promise(n=5)
    P->>A1: Accept(n=5, v="y")
    P->>A2: Accept(n=5, v="y")
    A1-->>P: Accepted
    A2-->>P: Accepted
    Note over P,A3: Majority (2 of 3) achieved — value "y" chosen
    P-xA3: Accept(n=5, v="y") (A3 crashed, never responds)
    Note over A3: A3 recovers later without having recorded this proposal
```
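To make acceptor state survive restarts, an implementation must flush it to stable storage before replying to any message. Below is a minimal file-based sketch; the file name and function names are hypothetical, and a production system would use a proper write-ahead log:

```python
import json
import os
import tempfile

STATE_FILE = "acceptor_state.json"  # hypothetical path

def save_state(promised_id, accepted_id, accepted_value):
    """Atomically persist acceptor state before sending any reply."""
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        json.dump({"promised_id": promised_id,
                   "accepted_id": accepted_id,
                   "accepted_value": accepted_value}, f)
        f.flush()
        os.fsync(f.fileno())     # force to disk, not just the page cache
    os.replace(tmp, STATE_FILE)  # atomic rename: old or new state, never half

def load_state():
    """Recover state on restart; a fresh acceptor has promised nothing."""
    if not os.path.exists(STATE_FILE):
        return {"promised_id": -1, "accepted_id": -1, "accepted_value": None}
    with open(STATE_FILE) as f:
        return json.load(f)
```

The write-to-temp-then-rename pattern ensures a crash mid-write leaves the previous state intact rather than a truncated file.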
Network Partition
During a partition, only the side containing a majority of acceptors can make progress; a proposer isolated on the minority side stalls. When the partition heals, nodes exchange their highest-known proposals and the minority side learns any value the majority chose.
```mermaid
sequenceDiagram
    participant Left as Left Partition
    participant Right as Right Partition
    Note over Left: Proposer L1 has majority (3 nodes)
    Note over Right: Proposer R1 isolated, no majority
    Left->>Left: Propose(n=7, v="left")
    Note over Left: Value "left" chosen in left partition
    Right->>Right: Propose(n=5, v="right")
    Note over Right: No progress in right partition
    Note over Left,Right: Partition heals
    Left->>Right: Exchange highest proposal info
    Note over Left,Right: n=7 wins (higher)
    Note over Left,Right: Right partition adopts "left"
```
Acceptor and Proposer Roles — Pseudo-code
```python
# Acceptor state (persisted to stable storage in a real implementation)
promised_id = -1
accepted_id = -1
accepted_value = None

def on_prepare(proposal_id):
    global promised_id
    if proposal_id > promised_id:
        promised_id = proposal_id
        # Report any previously accepted proposal so the proposer
        # can adopt its value.
        return Promise(promised_id, accepted_id, accepted_value)
    return Reject(promised_id)

def on_accept(proposal_id, value):
    global promised_id, accepted_id, accepted_value
    if proposal_id >= promised_id:
        promised_id = proposal_id
        accepted_id = proposal_id
        accepted_value = value
        return Accepted(accepted_id)
    return Reject(promised_id)

# Proposer
def on_propose(original_value):
    proposal_id = generate_unique_id()  # higher than any number used before
    responses = send_prepare_to_all(proposal_id)
    if len(responses) >= majority:
        accepted = [r for r in responses if r.accepted_id != -1]
        if accepted:
            # Must adopt the value of the highest-numbered accepted proposal.
            value = max(accepted, key=lambda r: r.accepted_id).accepted_value
        else:
            value = original_value
        send_accept(proposal_id, value)
```
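To see these handlers in action, here is a minimal in-memory simulation of a single-decree round across three acceptors. It is a sketch: real Paxos runs these exchanges over an unreliable network, and the `Acceptor` class and `run_round` helper are illustrative names, not part of any library:

```python
class Acceptor:
    def __init__(self):
        self.promised_id = -1
        self.accepted_id = -1
        self.accepted_value = None

    def on_prepare(self, proposal_id):
        if proposal_id > self.promised_id:
            self.promised_id = proposal_id
            return ("promise", self.accepted_id, self.accepted_value)
        return ("reject", self.promised_id, None)

    def on_accept(self, proposal_id, value):
        if proposal_id >= self.promised_id:
            self.promised_id = proposal_id
            self.accepted_id = proposal_id
            self.accepted_value = value
            return ("accepted", proposal_id)
        return ("reject", self.promised_id)

def run_round(acceptors, proposal_id, value):
    # Phase 1: gather promises; abort without a majority.
    promises = [a.on_prepare(proposal_id) for a in acceptors]
    ok = [p for p in promises if p[0] == "promise"]
    if len(ok) <= len(acceptors) // 2:
        return None
    # Adopt the highest-numbered previously accepted value, if any.
    prior = [(pid, v) for _, pid, v in ok if pid != -1]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    # Phase 2: the value is chosen once a majority accepts.
    accepts = [a.on_accept(proposal_id, value) for a in acceptors]
    if sum(r[0] == "accepted" for r in accepts) > len(acceptors) // 2:
        return value
    return None

acceptors = [Acceptor() for _ in range(3)]
print(run_round(acceptors, 5, "value"))  # value
```

Running a second round with a higher number and a different value, `run_round(acceptors, 8, "other")`, still returns `"value"`: the new proposer learns the accepted value in Phase 1 and must adopt it, which is exactly the safety property.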
Livelock Scenario
Here is what goes wrong without a designated proposer. Two nodes keep outbidding each other. Each one proposes a higher number, collects promises from a majority, but then the other jumps in with an even higher number before the first can send its accept messages. No value ever gets chosen — just an infinite loop of higher and higher proposal numbers.
```mermaid
sequenceDiagram
    participant P1 as Proposer A
    participant P2 as Proposer B
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    P1->>A1: Prepare(n=1)
    P1->>A2: Prepare(n=1)
    P1->>A3: Prepare(n=1)
    A1-->>P1: Promise(n=1)
    A2-->>P1: Promise(n=1)
    Note over P1,A3: P1 collecting majority...
    P2->>A1: Prepare(n=2)
    P2->>A2: Prepare(n=2)
    P2->>A3: Prepare(n=2)
    A1-->>P2: Promise(n=2, higher!)
    A2-->>P2: Promise(n=2)
    Note over P1,A3: P2 preempts P1's n=1
    P1->>A1: Prepare(n=3)
    P1->>A2: Prepare(n=3)
    A1-->>P1: Promise(n=3, higher!)
    Note over P1,A3: P1 preempts P2's n=2
    P2->>A1: Prepare(n=4)
    Note over P1,A3: ...continues indefinitely
```
This is why production systems pick a leader to be the only proposer — otherwise you get this exact livelock.
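Short of full leader election, a simpler mitigation is randomized exponential backoff: a preempted proposer waits a random, growing delay before retrying, which desynchronizes the competitors. A sketch (function name and constants are illustrative):

```python
import random

def backoff_delay(attempt, base=0.05, cap=2.0):
    """Exponential backoff with jitter: each preempted round waits longer,
    and the random jitter keeps competing proposers from retrying in
    lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# A preempted proposer would sleep for backoff_delay(attempt) seconds
# before retrying Phase 1 with a higher proposal number.
delays = [backoff_delay(a) for a in range(6)]
assert all(0 <= d <= 2.0 for d in delays)
```

Backoff only makes livelock improbable rather than impossible, which is why a distinguished proposer remains the standard production answer.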
Limitations
Paxos is not the fastest consensus algorithm. The two phases of communication introduce latency, and the algorithm is not designed for high-throughput scenarios. It prioritizes safety over performance.
Paxos alone does not specify how to choose a leader. Without a distinguished proposer, multiple nodes may try to propose simultaneously, leading to livelock. In practice, implementations layer a separate leader-election mechanism on top, often based on randomized timeouts or an external coordination service.
When to Use Paxos
Paxos is the right choice when:
- Correctness is paramount: When your application cannot tolerate divergent state (e.g., financial systems, distributed databases)
- You need a proven foundation: Paxos has formal correctness proofs and decades of analysis behind it
- You’re building infrastructure: Paxos forms the basis for systems like Chubby, Zookeeper (via Zab), and Google’s Spanner
- You need flexibility: Multi-Paxos adapts to sequences of values, making it suitable for replicated state machines
Paxos is not ideal when:
- Throughput matters more than safety: Raft’s leader-centric design often provides better performance
- You need operational simplicity: Raft’s clearer separation of concerns makes it easier to implement and debug
- Your team needs to maintain the code: Raft’s understandability advantage matters for long-term maintenance
Consensus Algorithm Comparison
| Algorithm | Leader Assumption | Complexity | Reconfiguration | Correctness Proof | Learning Curve |
|---|---|---|---|---|---|
| Paxos | Optional (can be leaderless) | High | Complex | Classical | Steep |
| Raft | Strong (single leader) | Medium | Joint consensus | In paper | Moderate |
| VSR | Within view | Medium | View changes | Formal | Moderate |
| Zab | Primary-backup | Medium | Static | Formal | Moderate |
| Algorithm | Typical Latency | Throughput | Fault Tolerance | Production Use |
|---|---|---|---|---|
| Paxos | 2-3 RTTs per decision | Moderate | f failures with 2f+1 nodes | Chubby, Spanner |
| Raft | 1-2 RTTs (with leader) | High | f failures with 2f+1 nodes | etcd, CockroachDB, TiKV |
| VSR | 1-2 RTTs | Moderate | f failures with 2f+1 nodes | CORFU, some Zab variants |
| Zab | 1-2 RTTs | High | f failures with 2f+1 nodes | ZooKeeper |
Conclusion
Paxos is one of the most important algorithms in distributed systems. It shows that consensus is achievable in asynchronous systems with failures: safety holds unconditionally, while progress requires the network to cooperate, as the FLP result demands. Despite its complexity, understanding Paxos provides the foundation for reasoning about more practical consensus protocols like Raft.
If you want to go deeper, my post on Consistency Models explores how consensus fits into the broader landscape of consistency guarantees in distributed systems.
Related Posts
Multi-Paxos
Multi-Paxos extends basic Paxos to achieve consensus on sequences of values, enabling practical replicated state machines for distributed systems and databases.
View-Stamped Replication
View-Stamped Replication (VSR) is a distributed consensus protocol that uses views and timestamps to achieve agreement in asynchronous systems.
The FLP Impossibility Result
The FLP impossibility theorem proves that no consensus algorithm can guarantee termination in an asynchronous system with even one faulty process. Understanding FLP is essential for distributed systems designers.