Paxos Consensus Algorithm

Paxos is a family of consensus algorithms for achieving agreement in distributed systems despite failures, pioneered by Leslie Lamport.

Reading time: 9 min read · Updated: March 24, 2026

Introduction

Paxos is a family of consensus algorithms for achieving agreement in distributed systems despite failures. Leslie Lamport first published the algorithm in 1998, though he had developed it years earlier in the 1980s. The name comes from a fictional legislative consensus protocol on the island of Paxos, which Lamport used to illustrate how a group of nodes could agree on a value even when some nodes fail or messages get lost.

The problem Paxos solves: how do you get a collection of nodes to agree on a single value when the network is unreliable and nodes can crash? This comes up everywhere from choosing a leader to replicating state across a database cluster.

The Two Phases

Paxos operates in two distinct phases, each serving a specific purpose.

Phase 1: Prepare

A node that wants to propose a value becomes the proposer and begins Phase 1. It generates a proposal number (higher than any proposal number it has used before) and sends prepare messages to a majority of acceptors. The acceptors respond by promising not to accept any proposal with a number lower than the received number, and they include the highest-numbered proposal they have already accepted if any exists.

This phase asks each acceptor, in effect, "will you listen to my proposals?" while also gathering information about any values that acceptors have already accepted.

Phase 2: Accept

If the proposer receives responses from a majority of acceptors, it moves to Phase 2. It sends accept messages containing the proposal number and either its original value or, if any acceptor reported a previously accepted proposal, the value of the highest-numbered such proposal. Each acceptor accepts the request unless it has already promised a higher-numbered proposal.

Once a majority of acceptors accept a proposal, the value is chosen.
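The Phase 2 value-selection rule can be sketched as a small helper. The function name and the (accepted_id, accepted_value) pair shape are hypothetical, chosen just to illustrate the rule:

```python
# Hypothetical helper illustrating the Phase 2 value-selection rule:
# re-propose the value of the highest-numbered accepted proposal, if any.
def choose_value(promises, original_value):
    """promises: (accepted_id, accepted_value) pairs from a majority.
    accepted_id is -1 when that acceptor has accepted nothing yet."""
    accepted = [p for p in promises if p[0] != -1]
    if accepted:
        # Tuples compare by accepted_id first, so max() finds the
        # highest-numbered accepted proposal.
        return max(accepted)[1]
    return original_value

# No acceptor has accepted anything: the proposer may use its own value.
print(choose_value([(-1, None), (-1, None)], "mine"))  # mine
# One acceptor already accepted "b" at n=4: the proposer must adopt "b".
print(choose_value([(3, "a"), (4, "b")], "mine"))      # b
```

This rule is what makes Paxos safe: once a value might have been chosen, every later proposer is forced to carry it forward.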

sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3

    rect rgb(40, 40, 60)
        Note over P,A3: Phase 1: Prepare
        P->>A1: Prepare(n=5)
        P->>A2: Prepare(n=5)
        P->>A3: Prepare(n=5)
        A1-->>P: Promise(n=5, accepted=nil)
        A2-->>P: Promise(n=5, accepted=nil)
        A3-->>P: Promise(n=5, accepted=nil)
    end

    rect rgb(40, 60, 40)
        Note over P,A3: Phase 2: Accept
        P->>A1: Accept(n=5, v="value")
        P->>A2: Accept(n=5, v="value")
        P->>A3: Accept(n=5, v="value")
        A1-->>P: Accepted(n=5, v="value")
        A2-->>P: Accepted(n=5, v="value")
        A3-->>P: Accepted(n=5, v="value")
    end

    Note over P,A3: Consensus achieved!

Why Majority Matters

Paxos requires only a majority of nodes to agree, not all nodes. This is crucial for handling failures gracefully: if you required all nodes, a single crash would block progress forever. By needing only a majority, a cluster of n nodes can tolerate up to floor((n-1)/2) failures and still make progress.

Any two majorities overlap, which means if a value has been chosen, subsequent proposals will learn about it. This guarantees safety—if one node thinks a value has been chosen, no other node can choose a different value.
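The quorum arithmetic is simple enough to state in code (a sketch, not tied to any implementation):

```python
# Quorum arithmetic for a cluster of n acceptors.
def quorum_size(n):
    return n // 2 + 1     # smallest majority

def max_failures(n):
    return (n - 1) // 2   # crashes tolerated while still making progress

for n in (3, 5, 7):
    print(n, quorum_size(n), max_failures(n))
# 3 acceptors -> quorum 2, tolerates 1 failure
# 5 acceptors -> quorum 3, tolerates 2 failures
# 7 acceptors -> quorum 4, tolerates 3 failures
```

Note that even cluster sizes buy nothing: 4 nodes need a quorum of 3 and tolerate only 1 failure, the same as 3 nodes.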

The Role of Proposal Numbers

Proposal numbers act as logical timestamps that order proposals globally. Each proposer generates numbers higher than any it has used before, typically by combining a unique node ID with an incrementing counter.

This numbering prevents conflicting proposals from derailing the system. If two proposers try to propose different values simultaneously, the higher proposal number wins, and the lower one defers.
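One common numbering scheme packs a per-node counter into the high bits and a unique node ID into the low bits. The bit layout and helper names below are illustrative, not from any particular implementation:

```python
import itertools

# Illustrative scheme: numbers from different nodes never collide (distinct
# low bits), and each node's numbers strictly increase (incrementing counter
# in the high bits).
NODE_BITS = 16

def make_id_generator(node_id):
    counter = itertools.count(1)
    def next_id():
        return (next(counter) << NODE_BITS) | node_id
    return next_id

gen_a = make_id_generator(node_id=1)
gen_b = make_id_generator(node_id=2)
print(gen_a(), gen_a(), gen_b())  # 65537 131073 65538
```

A node that sees a higher number from a peer can also bump its own counter past it, so its next proposal outbids everything it has observed.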

Practical Considerations

Implementing Paxos correctly is notoriously difficult. The original paper reads like a story rather than a specification, which has led to many misconceptions. Chubby (Google’s lock service) and many distributed databases use variants of Paxos, but most production implementations differ from the textbook algorithm in subtle ways.

Multi-Paxos extends the basic algorithm to agree on a sequence of values, which is more useful for building replicated state machines. You can read about this extension in my Multi-Paxos post.

Relationship to CAP Theorem

Paxos achieves consistency in the CAP sense—it will not return inconsistent data. However, like any consensus algorithm, network partitions can prevent progress. During a partition, the algorithm stalls until the partition heals. This is sometimes called “consistency over availability,” though I find that framing oversimplifies things. You can learn more about this trade-off in my post on the CAP Theorem.

Failure Scenarios

Proposer Crash During Phase 1

If a proposer crashes after sending prepare messages but before completing the protocol, progress is not blocked: a later proposer simply runs its own Phase 1 with a higher proposal number and gathers its own majority of promises. Should the crashed proposer recover and retry with its old number, acceptors that have since promised the higher number will reject its messages.

sequenceDiagram
    participant P1 as Proposer A
    participant P2 as Proposer B
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3

    P1->>A1: Prepare(n=5)
    P1->>A2: Prepare(n=5)
    P1-xA3: Prepare(n=5) (timeout, lost)
    Note over P1,A3: P1 crashes here
    P2->>A1: Prepare(n=6)
    P2->>A2: Prepare(n=6)
    P2->>A3: Prepare(n=6)
    A1-->>P2: Promise(n=6, none)
    A2-->>P2: Promise(n=6, none)
    A3-->>P2: Promise(n=6, none)
    Note over P2,A3: P2 collects majority with n=6
    P2->>A1: Accept(n=6, v="x")
    P2->>A2: Accept(n=6, v="x")
    P2->>A3: Accept(n=6, v="x")
    Note over P2,A3: Value "x" chosen

Acceptor Crash

If an acceptor crashes after accepting a proposal but before responding to the proposer, the proposer simply continues once it receives acknowledgments from a majority. The value survives because a majority durably recorded it. Note that correct Paxos requires each acceptor to write its promised and accepted state to stable storage before replying, so a recovering acceptor does not forget its promises.

sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3

    P->>A1: Prepare(n=5)
    P->>A2: Prepare(n=5)
    P->>A3: Prepare(n=5)
    A1-->>P: Promise(n=5)
    A2-->>P: Promise(n=5)
    A3-->>P: Promise(n=5)
    P->>A1: Accept(n=5, v="y")
    P->>A2: Accept(n=5, v="y")
    A1-->>P: Accepted
    A2-->>P: Accepted
    Note over P,A3: Majority (2 of 3) achieved — value "y" chosen
    P-xA3: Accept(n=5, v="y") (A3 crashed, no reply)
    Note over A3: A3 recovers later and learns the chosen value from the majority

Network Partition

During a partition, one side may elect a proposer that cannot reach a majority on the other side. When the partition heals, nodes exchange their highest-known proposals and continue correctly.

sequenceDiagram
    participant Left as Left Partition
    participant Right as Right Partition

    Note over Left: Proposer L1 has majority (3 nodes)
    Note over Right: Proposer R1 isolated, no majority

    Left->>Left: Propose(n=7, v="left")
    Note over Left: Value "left" chosen in left partition

    Right->>Right: Propose(n=5, v="right")
    Note over Right: No progress in right partition

    Note over Left,Right: Partition heals
    Left->>Right: Exchange highest proposal info
    Note over Left,Right: n=7 wins (higher)
    Note over Left,Right: Right partition adopts "left"

Acceptor and Proposer Roles — Pseudo-code

# Acceptor state (in a real implementation this must live on stable storage)
promised_id = -1
accepted_id = -1
accepted_value = None

def on_prepare(proposal_id):
    if proposal_id > promised_id:
        promised_id = proposal_id
        # Report any previously accepted proposal so the proposer can adopt it
        return Promise(promised_id, accepted_id, accepted_value)
    return Reject(promised_id)

def on_accept(proposal_id, value):
    if proposal_id >= promised_id:
        promised_id = proposal_id
        accepted_id = proposal_id
        accepted_value = value
        return Accepted(accepted_id)
    return Reject(promised_id)

# Proposer
def on_propose(original_value):
    proposal_id = generate_unique_id()
    responses = send_prepare_to_all(proposal_id)
    if len(responses) >= majority:
        # If any acceptor already accepted a value, re-propose the value
        # from the highest-numbered accepted proposal
        accepted = [r for r in responses if r.accepted_id != -1]
        if accepted:
            value = max(accepted, key=lambda r: r.accepted_id).accepted_value
        else:
            value = original_value
        send_accept(proposal_id, value)
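Assembled into a runnable (toy) form, these roles can be exercised end to end. In this sketch everything runs in one process, with direct method calls standing in for the network and in-memory fields standing in for stable storage; the class and function names are mine, not from any library:

```python
from dataclasses import dataclass

@dataclass
class Promise:
    promised_id: int
    accepted_id: int
    accepted_value: object

class Acceptor:
    def __init__(self):
        self.promised_id = -1
        self.accepted_id = -1
        self.accepted_value = None

    def on_prepare(self, proposal_id):
        if proposal_id > self.promised_id:
            self.promised_id = proposal_id
            return Promise(self.promised_id, self.accepted_id, self.accepted_value)
        return None  # reject

    def on_accept(self, proposal_id, value):
        if proposal_id >= self.promised_id:
            self.promised_id = proposal_id
            self.accepted_id = proposal_id
            self.accepted_value = value
            return True
        return False

def propose(acceptors, proposal_id, original_value):
    majority = len(acceptors) // 2 + 1
    promises = [p for a in acceptors if (p := a.on_prepare(proposal_id))]
    if len(promises) < majority:
        return None  # preempted in Phase 1
    accepted = [p for p in promises if p.accepted_id != -1]
    value = (max(accepted, key=lambda p: p.accepted_id).accepted_value
             if accepted else original_value)
    acks = sum(a.on_accept(proposal_id, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 5, "x"))  # x -- chosen
# A later proposer with a different value is forced to adopt "x":
print(propose(acceptors, 6, "y"))  # x
```

Notice the second call: even though the new proposer wanted "y", Phase 1 told it that "x" was already accepted at a majority, so it re-proposed "x". That is the safety property in action.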

Livelock Scenario

Here is what goes wrong without a designated proposer. Two nodes keep outbidding each other. Each one proposes a higher number, collects promises from a majority, but then the other jumps in with an even higher number before the first can send its accept messages. No value ever gets chosen — just an infinite loop of higher and higher proposal numbers.

sequenceDiagram
    participant P1 as Proposer A
    participant P2 as Proposer B
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3

    P1->>A1: Prepare(n=1)
    P1->>A2: Prepare(n=1)
    P1->>A3: Prepare(n=1)
    A1-->>P1: Promise(n=1)
    A2-->>P1: Promise(n=1)
    Note over P1,A3: P1 collecting majority...

    P2->>A1: Prepare(n=2)
    P2->>A2: Prepare(n=2)
    P2->>A3: Prepare(n=2)
    A1-->>P2: Promise(n=2, higher!)
    A2-->>P2: Promise(n=2)
    Note over P1,A3: P2 preempts P1's n=1

    P1->>A1: Prepare(n=3)
    P1->>A2: Prepare(n=3)
    A1-->>P1: Promise(n=3, higher!)
    Note over P1,A3: P1 preempts P2's n=2
    P2->>A1: Prepare(n=4)
    Note over P1,A3: ...continues indefinitely

This is why production systems pick a leader to be the only proposer — otherwise you get this exact livelock.
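Besides electing a stable leader, a common mitigation is randomized backoff between retries, so two dueling proposers eventually stop preempting each other. A sketch, where `attempt` is a hypothetical stand-in for one full prepare/accept round:

```python
import random
import time

def propose_with_backoff(attempt, max_tries=10, base_delay=0.01):
    """attempt() returns the chosen value, or None if preempted by a
    higher-numbered proposal. Retries with a randomized, growing delay."""
    for i in range(max_tries):
        result = attempt()
        if result is not None:
            return result
        # Random interval that grows with each collision, making it
        # unlikely that both proposers retry at the same moment again.
        time.sleep(random.uniform(0, base_delay * 2 ** i))
    return None

# Stub standing in for a real proposal round: preempted twice, then succeeds.
outcomes = iter([None, None, "value"])
print(propose_with_backoff(lambda: next(outcomes)))  # value
```

Backoff only makes livelock improbable rather than impossible, which is consistent with the FLP result: no asynchronous consensus algorithm can guarantee termination.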

Limitations

Paxos is not the fastest consensus algorithm. Basic Paxos needs two rounds of communication (prepare and accept) for every decision, which adds latency and limits throughput. It prioritizes safety over performance.

Paxos alone does not specify how to choose a leader. Without a distinguished proposer, multiple nodes may try to propose simultaneously, leading to livelock. In practice, leaders are often chosen separately using techniques like leader election.

When to Use Paxos

Paxos is the right choice when:

  • Correctness is paramount: When your application cannot tolerate divergent state (e.g., financial systems, distributed databases)
  • You need a proven foundation: Paxos has formal correctness proofs and decades of analysis behind it
  • You’re building infrastructure: Paxos forms the basis for systems like Chubby, Zookeeper (via Zab), and Google’s Spanner
  • You need flexibility: Multi-Paxos adapts to sequences of values, making it suitable for replicated state machines

Paxos is not ideal when:

  • Throughput matters more than safety: Raft’s leader-centric design often provides better performance
  • You need operational simplicity: Raft’s clearer separation of concerns makes it easier to implement and debug
  • Your team needs to maintain the code: Raft’s understandability advantage matters for long-term maintenance

Consensus Algorithm Comparison

| Algorithm | Leader Assumption | Complexity | Reconfiguration | Correctness Proof | Learning Curve |
|---|---|---|---|---|---|
| Paxos | Optional (can be leaderless) | High | Complex | Classical | Steep |
| Raft | Strong (single leader) | Medium | Joint consensus | In paper | Moderate |
| VSR | Within view | Medium | View changes | Formal | Moderate |
| Zab | Primary-backup | Medium | Static | Formal | Moderate |

| Algorithm | Typical Latency | Throughput | Fault Tolerance | Production Use |
|---|---|---|---|---|
| Paxos | 2-3 RTTs per decision | Moderate | f failures with 2f+1 nodes | Chubby, Spanner |
| Raft | 1-2 RTTs (with leader) | High | f failures with 2f+1 nodes | etcd, CockroachDB, TiKV |
| VSR | 1-2 RTTs | Moderate | f failures with 2f+1 nodes | CORFU, some Zab variants |
| Zab | 1-2 RTTs | High | f failures with 2f+1 nodes | ZooKeeper |

Conclusion

Paxos is one of the most important algorithms in distributed systems. It shows that consensus can be achieved safely in asynchronous systems with failures: safety holds unconditionally, while liveness requires periods of good network behavior (by the FLP result, no algorithm can guarantee both in a fully asynchronous system). Despite its complexity, understanding Paxos provides the foundation for reasoning about more practical consensus protocols like Raft.

If you want to go deeper, my post on Consistency Models explores how consensus fits into the broader landscape of consistency guarantees in distributed systems.

Related Posts

Multi-Paxos

Multi-Paxos extends basic Paxos to achieve consensus on sequences of values, enabling practical replicated state machines for distributed systems and databases.

#distributed-systems #consensus #paxos

View-Stamped Replication

View-Stamped Replication (VSR) is a distributed consensus protocol that uses views and timestamps to achieve agreement in asynchronous systems.

#distributed-systems #consensus #view-stamped

The FLP Impossibility Result

The FLP impossibility theorem proves that no consensus algorithm can guarantee termination in an asynchronous system with even one faulty process. Understanding FLP is essential for distributed systems designers.

#distributed-systems #consensus #flp