Paxos Consensus Algorithm

Paxos is a family of consensus algorithms for achieving agreement in distributed systems despite failures, pioneered by Leslie Lamport.

published: reading time: 18 min read author: GeekWorkBench updated: March 24, 2026

Introduction

Paxos is a family of consensus algorithms for achieving agreement in distributed systems despite failures. Leslie Lamport first published the algorithm in 1998, though he had developed it years earlier in the 1980s. The name comes from a fictional legislative consensus protocol on the island of Paxos, which Lamport used to illustrate how a group of nodes could agree on a value even when some nodes fail or messages get lost.

The problem Paxos solves: how do you get a collection of nodes to agree on a single value when the network is unreliable and nodes can crash? This comes up everywhere from choosing a leader to replicating state across a database cluster.

Core Concepts

Paxos operates in two distinct phases, each serving a specific purpose.

Phase 1: Prepare

A node that wants to propose a value becomes the proposer and begins Phase 1. It generates a proposal number (higher than any proposal number it has used before) and sends prepare messages to a majority of acceptors. The acceptors respond by promising not to accept any proposal with a number lower than the received number, and they include the highest-numbered proposal they have already accepted if any exists.

This phase asks acceptors whether they will listen to my proposals and gathers information about what values have already been proposed by others.

Phase 2: Accept

If the proposer receives responses from a majority of acceptors, it moves to Phase 2. It sends accept messages containing the proposal number and either its original value or the value from the highest-numbered proposal among the responses (if any acceptors had already accepted a value). The acceptors accept this proposal if they have not already promised to reject it.

Once a majority of acceptors accept a proposal, the value is chosen.

sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3

    rect rgb(40, 40, 60)
        Note over P,A3: Phase 1: Prepare
        P->>A1: Prepare(n=5)
        P->>A2: Prepare(n=5)
        P->>A3: Prepare(n=5)
        A1-->>P: Promise(n=5, accepted=nil)
        A2-->>P: Promise(n=5, accepted=nil)
        A3-->>P: Promise(n=5, accepted=nil)
    end

    rect rgb(40, 60, 40)
        Note over P,A3: Phase 2: Accept
        P->>A1: Accept(n=5, v="value")
        P->>A2: Accept(n=5, v="value")
        P->>A3: Accept(n=5, v="value")
        A1-->>P: Accepted(n=5, v="value")
        A2-->>P: Accepted(n=5, v="value")
        A3-->>P: Accepted(n=5, v="value")
    end

    Note over P,A3: Consensus achieved!

Why Majority Matters

Paxos requires only a majority of nodes to agree, not all nodes. This is crucial for handling failures gracefully. If you required all nodes, a single crash would block progress forever. By only needing a majority, the system can tolerate up to floor(n/2) failures and still make progress.

Any two majorities overlap, which means if a value has been chosen, subsequent proposals will learn about it. This guarantees safety—if one node thinks a value has been chosen, no other node can choose a different value.

The Role of Proposal Numbers

Proposal numbers act as logical timestamps that order proposals globally. Each proposer generates numbers higher than any it has used before, typically by combining a unique node ID with an incrementing counter.

This numbering prevents conflicting proposals from derailing the system. If two proposers try to propose different values simultaneously, the higher proposal number wins, and the lower one defers.

Practical Considerations

Implementing Paxos correctly is notoriously difficult. The original paper reads like a story rather than a specification, which has led to many misconceptions. Chubby (Google’s lock service) and many distributed databases use variants of Paxos, but most production implementations differ from the textbook algorithm in subtle ways.

Multi-Paxos extends the basic algorithm to agree on a sequence of values, which is more useful for building replicated state machines. You can read about this extension in my Multi-Paxos post.

Relationship to CAP Theorem

Paxos achieves consistency in the CAP sense—it will not return inconsistent data. However, like any consensus algorithm, network partitions can prevent progress. During a partition, the algorithm stalls until the partition heals. This is sometimes called “consistency over availability,” though I find that framing oversimplifies things. You can learn more about this trade-off in my post on the CAP Theorem.

Failure Scenarios

Proposer Crash During Phase 1

If a proposer crashes after sending prepare messages but before collecting a majority of responses, subsequent proposers must gather their own majority before proposing. The already-promised acceptors will reject the crashed proposer’s later accept messages.

sequenceDiagram
    participant P1 as Proposer A
    participant P2 as Proposer B
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3

    P1->>A1: Prepare(n=5)
    P1->>A2: Prepare(n=5)
    P1-xA3: Prepare(n=5) --timeout-->
    Note over P1,A3: P1 crashes here
    P2->>A1: Prepare(n=6)
    P2->>A2: Prepare(n=6)
    P2->>A3: Prepare(n=6)
    A1-->>P2: Promise(n=6, none)
    A2-->>P2: Promise(n=6, none)
    A3-->>P2: Promise(n=6, none)
    Note over P2,A3: P2 collects majority with n=6
    P2->>A1: Accept(n=6, v="x")
    P2->>A2: Accept(n=6, v="x")
    P2->>A3: Accept(n=6, v="x")
    Note over P2,A3: Value "x" chosen

Acceptor Crash

If an acceptor crashes after accepting a proposal but before responding to the proposer, the proposer simply continues once it receives majority acknowledgments. The crashed acceptor’s state is lost, but the value survives because a majority acknowledged it.

sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3

    P->>A1: Prepare(n=5)
    P->>A2: Prepare(n=5)
    P->>A3: Prepare(n=5)
    A1-->>P: Promise(n=5)
    A2-->>P: Promise(n=5)
    A3-->>P: Promise(n=5)
    P->>A1: Accept(n=5, v="y")
    P->>A2: Accept(n=5, v="y")
    A1-->>P: Accepted
    A2-->>P: Accepted
    Note over P,A3: Majority (2 of 3) achieved — value "y" chosen
    P->>A3: Accept(n=5, v="y") --never responds (crashed)-->
    Note over A3: A3 comes back later with no memory of this proposal

Network Partition

During a partition, one side may elect a proposer that cannot reach a majority on the other side. When the partition heals, nodes exchange their highest-known proposals and continue correctly.

sequenceDiagram
    participant Left as Left Partition
    participant Right as Right Partition

    Note over Left: Proposer L1 has majority (3 nodes)
    Note over Right: Proposer R1 isolated, no majority

    Left->>Left: Propose(n=7, v="left")
    Note over Left: Value "left" chosen in left partition

    Right->>Right: Propose(n=5, v="right")
    Note over Right: No progress in right partition

    Note over Left,Right: Partition heals
    Left->>Right: Exchange highest proposal info
    Note over Left,Right: n=7 wins (higher)
    Note over Left,Right: Right partition adopts "left"

Acceptor and Proposer Roles — Pseudo-code

# Acceptor state
promised_id = -1
accepted_id = -1
accepted_value = None

def on_prepare(proposal_id):
    if proposal_id > promised_id:
        promised_id = proposal_id
        return Promise(promised_id, accepted_id, accepted_value)
    return Reject(promised_id)

def on_accept(proposal_id, value):
    if proposal_id >= promised_id:
        accepted_id = proposal_id
        accepted_value = value
        return Accepted(accepted_id)
    return Reject(promised_id)

# Proposer state
proposal_id = None
value = None

def on_propose(original_value):
    proposal_id = generate_unique_id()
    responses = send_prepare_to_all(proposal_id)
    if len(responses >= majority):
        accepted = [r for r in responses if r.accepted_id != -1]
        if accepted:
            value = max(accepted, key=lambda r: r.accepted_id).accepted_value
        else:
            value = original_value
        send_accept(proposal_id, value)

Common Pitfalls / Anti-Patterns

Here is what goes wrong without a designated proposer. Two nodes keep outbidding each other. Each one proposes a higher number, collects promises from a majority, but then the other jumps in with an even higher number before the first can send its accept messages. No value ever gets chosen — just an infinite loop of higher and higher proposal numbers.

sequenceDiagram
    participant P1 as Proposer A
    participant P2 as Proposer B
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3

    P1->>A1: Prepare(n=1)
    P1->>A2: Prepare(n=1)
    P1->>A3: Prepare(n=1)
    A1-->>P1: Promise(n=1)
    A2-->>P1: Promise(n=1)
    Note over P1,A3: P1 collecting majority...

    P2->>A1: Prepare(n=2)
    P2->>A2: Prepare(n=2)
    P2->>A3: Prepare(n=2)
    A1-->>P2: Promise(n=2, higher!)
    A2-->>P2: Promise(n=2)
    Note over P1,A3: P2 preempts P1's n=1

    P1->>A1: Prepare(n=3)
    P1->>A2: Prepare(n=3)
    A1-->>P1: Promise(n=3, higher!)
    Note over P1,A3: P1 preempts P2's n=2
    P2->>A1: Prepare(n=4)
    Note over P1,A3: ...continues indefinitely

This is why production systems pick a leader to be the only proposer — otherwise you get this exact livelock.

Limitations

Paxos is not the fastest consensus algorithm. The two phases of communication introduce latency, and the algorithm is not designed for high-throughput scenarios. It prioritizes safety over performance.

Paxos alone does not specify how to choose a leader. Without a distinguished proposer, multiple nodes may try to propose simultaneously, leading to livelock. In practice, leaders are often chosen separately using techniques like leader election.

When to Use Paxos

Paxos is the right choice when:

  • Correctness is paramount: When your application cannot tolerate divergent state (e.g., financial systems, distributed databases)
  • You need a proven foundation: Paxos has formal correctness proofs and decades of analysis behind it
  • You’re building infrastructure: Paxos forms the basis for systems like Chubby, Zookeeper (via Zab), and Google’s Spanner
  • You need flexibility: Multi-Paxos adapts to sequences of values, making it suitable for replicated state machines

Paxos is not ideal when:

  • Throughput matters more than safety: Raft’s leader-centric design often provides better performance
  • You need operational simplicity: Raft’s clearer separation of concerns makes it easier to implement and debug
  • Your team needs to maintain the code: Raft’s understandability advantage matters for long-term maintenance

Interview Questions

1. What problem does Paxos solve in distributed systems?

Paxos solves the consensus problem: achieving agreement among a distributed group of nodes on a single value, even when nodes can crash, messages can be lost, and the network is unreliable. It guarantees safety—if any node decides a value has been chosen, no other node can choose a different value for that same decision.

2. Explain the two phases of the Paxos algorithm.

Phase 1 (Prepare): A proposer generates a unique proposal number higher than any it has used before and sends prepare messages to a majority of acceptors. Acceptors promise not to accept proposals with lower numbers and report any previously accepted values.

Phase 2 (Accept): If the proposer receives promises from a majority, it sends accept messages with the proposal number and either its original value or the value from the highest-numbered proposal among responses. A majority of acceptors accepting a proposal means the value is chosen.

3. Why does Paxos require only a majority of nodes rather than all nodes?

Requiring only a majority provides fault tolerance: the system can tolerate up to floor(n/2) node failures without blocking progress. Additionally, any two majorities overlap, which guarantees that if a value is chosen, subsequent proposals will learn about it, ensuring safety.

4. What is the purpose of proposal numbers in Paxos?

Proposal numbers act as logical timestamps that order proposals globally. They prevent conflicting proposals from derailing the system—higher proposal numbers take precedence, and lower-numbered proposals must defer. Each proposer generates numbers higher than any it has previously used, typically combining a unique node ID with an incrementing counter.

5. What happens when a proposer crashes during Phase 1?

If a proposer crashes after sending prepare messages but before collecting a majority of responses, subsequent proposers must gather their own majority before proposing. The already-promised acceptors will reject the crashed proposer's later accept messages since they have already promised to a higher proposal number.

6. How does Paxos handle an acceptor crash during the algorithm?

If an acceptor crashes after accepting a proposal but before responding to the proposer, the proposer simply continues once it receives majority acknowledgments. The crashed acceptor's state is lost, but the value survives because a majority acknowledged it before the crash.

7. What is livelock in Paxos and how is it prevented?

Livelock occurs when two or more proposers continuously outbid each other with higher proposal numbers. Each one collects promises from a majority, but the other jumps in with a higher number before the first can send its accept messages. No value ever gets chosen. Production systems prevent this by designating a single leader as the only proposer.

8. How does Paxos relate to the CAP Theorem?

Paxos achieves consistency in the CAP sense—it will not return inconsistent data. However, like any consensus algorithm, network partitions can prevent progress. During a partition, the algorithm stalls until the partition heals, favoring consistency over availability.

9. What is Multi-Paxos and how does it extend basic Paxos?

Multi-Paxos extends the basic single-decree Paxos algorithm to agree on a sequence of values, making it suitable for building replicated state machines. It optimizes by skipping Phase 1 for consecutive proposals when a leader is stable, reducing the communication rounds from 2 RTTs to 1 RTT per decision.

10. Compare Paxos and Raft in terms of leader assumption and complexity.

Paxos: Leader is optional (can be leaderless), high complexity, steep learning curve, classical correctness proofs. Typical latency is 2-3 RTTs per decision.

Raft: Strong leader assumption (single leader), medium complexity, clearer separation of concerns making it easier to implement. Typical latency is 1-2 RTTs with a stable leader, providing better throughput.

11. What is the Byzantine Generals Problem and how does it relate to Paxos?

The Byzantine Generals Problem deals with achieving consensus in a system where nodes can send conflicting or malicious information (not just fail silently). Paxos assumes "fail-stop" failures where crashed nodes simply stop responding. Byzantine Fault Tolerance (BFT) algorithms like PBFT handle the stronger threat model. Paxos provides safety under fail-stop failures but not under Byzantine failures.

12. How does network partition affect Paxos?

During a partition, one side may elect a proposer that cannot reach a majority on the other side, preventing any new values from being chosen on either side. When the partition heals, nodes exchange their highest-known proposals and continue correctly. The higher proposal number wins, and the value from the partition that achieved majority is adopted.

13. What are the limitations of Paxos?

Paxos is not the fastest consensus algorithm due to its two-phase communication overhead. It prioritizes safety over performance. Paxos alone does not specify how to choose a leader—without a distinguished proposer, multiple nodes may try to propose simultaneously, leading to livelock. It is complex to implement correctly, and the original paper is notoriously difficult to read.

14. What production systems use Paxos or variants of it?

Several prominent systems use Paxos: Google's Chubby (lock service), Google Spanner (distributed database), CockroachDB, TiKV, and etcd all use consensus protocols derived from or inspired by Paxos. ZooKeeper uses Zab, a related protocol. Microsoft's Azure uses Paxos variants for state machine replication.

15. Can Paxos reconfigure its cluster membership?

Basic Paxos does not specify reconfiguration, making it complex to add or remove nodes. Joint consensus is one approach where a special joint quorum of old and new configuration must agree. Many production implementations develop custom reconfiguration schemes that extend the core Paxos algorithm.

16. What is the role of the learner in Paxos?

Learners are acceptors that have accepted a value but need to propagate this information to other nodes that did not participate directly. When a learner receives accepted messages from a majority, it can inform other interested nodes (replicas) that a value has been chosen. Learners are part of the classic Paxos formulation but are often simplified in implementations.

17. Why is Paxos considered harder to understand than Raft?

Paxos was published in a single paper that reads like a mathematical proof mixed with narrative, using the fictional Paxos legislative protocol as an analogy. The algorithm is not clearly separated into independent components. Raft was designed specifically to be more understandable, with clear separation of leader election, log replication, and safety, and uses more concrete descriptions.

18. What is the relationship between Paxos and Viewstamped Replication (VSR)?

Viewstamped Replication is a consensus protocol developed independently by Barbara Liskov and Brian Oki around the same time as Paxos. Like Paxos, it achieves consensus in asynchronous systems with failures. VSR uses a view-change mechanism similar to Paxos's Phase 1, and both protocols have similar fault tolerance characteristics (majority quorum). They are functionally equivalent in terms of safety guarantees.

19. How does Paxos ensure that if a value is chosen, no other value can be chosen?

This is guaranteed by the promise mechanism: when an acceptor promises to a proposal number N, it agrees not to accept any proposal with a number lower than N. Since any two majorities overlap, if a proposal P with value v is accepted by a majority, any future proposal that wants to be accepted must collect promises from at least one node in that majority. That node will reject any proposal with a lower number, ensuring the higher-numbered proposal (which carries v or a later value) wins.

20. What optimizations exist for Paxos in practice?

Several optimizations are used in practice: Multi-Paxos optimizes consecutive agreements by skipping Phase 1 when a leader is stable. Read-your-writes consistency can be achieved by tagging reads with the current proposal number. Leases reduce RTTs for reads. Some implementations use batching to group multiple proposals. Cheap Paxos uses fewer nodes for recovery. Fast Paxos attempts to reduce latency with larger quorums.

Further Reading

Quick Recap Checklist

Before deploying a system based on Paxos, verify:

  • You need consensus for a single or sequence of values in a distributed system
  • Fault tolerance is required (system must continue despite node failures)
  • Network partitions are possible and the system should handle them safely
  • A majority quorum strategy is acceptable for your failure threshold
  • You have a plan for leader election to prevent livelock
  • Your implementation handles Phase 1 failures and retries correctly
  • Acceptor state is persisted to survive crashes
  • You understand the difference between Paxos (fail-stop) and Byzantine fault tolerance
  • For multi-value decisions, you are using Multi-Paxos or an equivalent
  • Cluster reconfiguration is handled if nodes need to be added or removed

Consensus Algorithm Comparison

AlgorithmLeader AssumptionComplexityReconfigurationCorrectness ProofLearning Curve
PaxosOptional (can be leaderless)HighComplexClassicalSteep
RaftStrong (single leader)MediumJoint consensusIn paperModerate
VSRWithin viewMediumView changesFormalModerate
ZabPrimary-backupMediumStaticFormalModerate
AlgorithmTypical LatencyThroughputFault ToleranceProduction Use
Paxos2-3 RTTs per decisionModeratef+1 nodes for majorityChubby, Spanner
Raft1-2 RTTs (with leader)Highf+1 nodes for majorityetcd, CockroachDB, TiKV
VSR1-2 RTTsModeratef+1 nodes for majorityCORFU, some Zab variants
Zab1-2 RTTsHighf+1 nodes for majorityZooKeeper

Conclusion

Paxos is one of the most important algorithms in distributed systems. It formally proves that consensus is achievable in asynchronous systems with failures. Despite its complexity, understanding Paxos provides the foundation for reasoning about more practical consensus protocols like Raft.

If you want to go deeper, my post on Consistency Models explores how consensus fits into the broader landscape of consistency guarantees in distributed systems.

Category

Related Posts

Multi-Paxos

Multi-Paxos extends basic Paxos to achieve consensus on sequences of values, enabling practical replicated state machines for distributed systems and databases.

#distributed-systems #consensus #paxos

View-Stamped Replication

View-Stamped Replication (VSR) is a distributed consensus protocol that uses views and timestamps to achieve agreement in asynchronous systems.

#distributed-systems #consensus #view-stamped

The FLP Impossibility Result

The FLP impossibility theorem proves that no consensus algorithm can guarantee termination in an asynchronous system with even one faulty process.

#distributed-systems #consensus #flp