Paxos Consensus Algorithm
Paxos is a family of consensus algorithms for achieving agreement in distributed systems despite failures, pioneered by Leslie Lamport.
Introduction
Leslie Lamport first published the algorithm in 1998, though he had developed it years earlier in the 1980s. The name comes from a fictional parliament on the Greek island of Paxos, which Lamport used to illustrate how a group of nodes could agree on a value even when some nodes fail or messages get lost.
The problem Paxos solves: how do you get a collection of nodes to agree on a single value when the network is unreliable and nodes can crash? This comes up everywhere from choosing a leader to replicating state across a database cluster.
The Two Phases
Paxos operates in two distinct phases, each serving a specific purpose.
Phase 1: Prepare
A node that wants to propose a value becomes the proposer and begins Phase 1. It generates a proposal number higher than any it has used before and sends prepare messages to the acceptors, needing responses from at least a majority. Each acceptor responds by promising not to accept any proposal with a lower number, and it includes the highest-numbered proposal it has already accepted, if any.
This phase serves two purposes: it asks acceptors whether they will listen to the proposer, and it gathers information about any values that have already been accepted.
Phase 2: Accept
If the proposer receives responses from a majority of acceptors, it moves to Phase 2. It sends accept messages containing the proposal number and either its original value or, if any acceptors reported a previously accepted value, the value from the highest-numbered such proposal. Each acceptor accepts this proposal unless it has already promised a higher-numbered one.
Once a majority of acceptors accept a proposal, the value is chosen.
```mermaid
sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    rect rgb(40, 40, 60)
        Note over P,A3: Phase 1: Prepare
        P->>A1: Prepare(n=5)
        P->>A2: Prepare(n=5)
        P->>A3: Prepare(n=5)
        A1-->>P: Promise(n=5, accepted=nil)
        A2-->>P: Promise(n=5, accepted=nil)
        A3-->>P: Promise(n=5, accepted=nil)
    end
    rect rgb(40, 60, 40)
        Note over P,A3: Phase 2: Accept
        P->>A1: Accept(n=5, v="value")
        P->>A2: Accept(n=5, v="value")
        P->>A3: Accept(n=5, v="value")
        A1-->>P: Accepted(n=5, v="value")
        A2-->>P: Accepted(n=5, v="value")
        A3-->>P: Accepted(n=5, v="value")
    end
    Note over P,A3: Consensus achieved!
```
Why Majority Matters
Paxos requires only a majority of nodes to agree, not all nodes. This is crucial for handling failures gracefully. If you required all nodes, a single crash would block progress forever. By needing only a majority, a system of n nodes can tolerate up to floor((n-1)/2) failures and still make progress — for example, 2 failures out of 5 nodes.
Any two majorities overlap, which means if a value has been chosen, subsequent proposals will learn about it. This guarantees safety—if one node thinks a value has been chosen, no other node can choose a different value.
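The overlap property is easy to check exhaustively for a small cluster. The sketch below (illustrative, not part of any Paxos implementation) enumerates every majority of a 5-node cluster and confirms that any two of them share at least one node:

```python
from itertools import combinations

nodes = {1, 2, 3, 4, 5}
majority_size = len(nodes) // 2 + 1  # 3 of 5

# Every possible majority (3-node subset) of the cluster.
majorities = [set(c) for c in combinations(nodes, majority_size)]

# Any two majorities intersect: 3 + 3 > 5, so they cannot be disjoint.
assert all(a & b for a in majorities for b in majorities)
```

Because every pair of majorities intersects, at least one acceptor in any later Phase 1 quorum has seen the chosen value and will report it in its Promise.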
The Role of Proposal Numbers
Proposal numbers act as logical timestamps that order proposals globally. Each proposer generates numbers higher than any it has used before, typically by combining a unique node ID with an incrementing counter.
This numbering prevents conflicting proposals from derailing the system. If two proposers try to propose different values simultaneously, the higher proposal number wins, and the lower one defers.
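One common scheme (a sketch, with hypothetical names) packs a per-node counter into the high bits and a unique node ID into the low bits. Numbers are then globally unique, each proposer's numbers strictly increase, and ties on the counter are broken deterministically by node ID:

```python
NODE_ID_BITS = 16  # assumes node IDs fit in 16 bits

def make_proposal_id(counter: int, node_id: int) -> int:
    """Combine a monotonically increasing counter with a node ID
    so that no two nodes ever generate the same proposal number."""
    assert 0 <= node_id < (1 << NODE_ID_BITS)
    return (counter << NODE_ID_BITS) | node_id

# Node 7's round-3 proposal outranks node 9's round-2 proposal...
assert make_proposal_id(3, 7) > make_proposal_id(2, 9)
# ...and equal counters are ordered by node ID.
assert make_proposal_id(2, 9) > make_proposal_id(2, 7)
```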
Practical Considerations
Implementing Paxos correctly is notoriously difficult. The original paper reads like a story rather than a specification, which has led to many misconceptions. Chubby (Google’s lock service) and many distributed databases use variants of Paxos, but most production implementations differ from the textbook algorithm in subtle ways.
Multi-Paxos extends the basic algorithm to agree on a sequence of values, which is more useful for building replicated state machines. You can read about this extension in my Multi-Paxos post.
Relationship to CAP Theorem
Paxos achieves consistency in the CAP sense—it will not return inconsistent data. However, like any consensus algorithm, network partitions can prevent progress. During a partition, the algorithm stalls until the partition heals. This is sometimes called “consistency over availability,” though I find that framing oversimplifies things. You can learn more about this trade-off in my post on the CAP Theorem.
Failure Scenarios
Proposer Crash During Phase 1
If a proposer crashes after sending prepare messages but before collecting a majority of responses, a subsequent proposer simply gathers its own majority with a higher number. If the crashed proposer later recovers and sends accept messages with its old number, acceptors that have since promised the higher number will reject them.
```mermaid
sequenceDiagram
    participant P1 as Proposer A
    participant P2 as Proposer B
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    P1->>A1: Prepare(n=5)
    P1->>A2: Prepare(n=5)
    P1-xA3: Prepare(n=5) (message lost)
    Note over P1,A3: P1 crashes here
    P2->>A1: Prepare(n=6)
    P2->>A2: Prepare(n=6)
    P2->>A3: Prepare(n=6)
    A1-->>P2: Promise(n=6, accepted=nil)
    A2-->>P2: Promise(n=6, accepted=nil)
    A3-->>P2: Promise(n=6, accepted=nil)
    Note over P2,A3: P2 collects majority with n=6
    P2->>A1: Accept(n=6, v="x")
    P2->>A2: Accept(n=6, v="x")
    P2->>A3: Accept(n=6, v="x")
    Note over P2,A3: Value "x" chosen
```
Acceptor Crash
If an acceptor crashes after accepting a proposal but before responding to the proposer, the proposer simply continues once it receives majority acknowledgments from the others. The value survives because a majority recorded it. Real implementations persist acceptor state to stable storage before replying; an acceptor that truly forgot its promises after a restart could break an earlier promise and endanger safety.
```mermaid
sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    P->>A1: Prepare(n=5)
    P->>A2: Prepare(n=5)
    P->>A3: Prepare(n=5)
    A1-->>P: Promise(n=5)
    A2-->>P: Promise(n=5)
    A3-->>P: Promise(n=5)
    P->>A1: Accept(n=5, v="y")
    P->>A2: Accept(n=5, v="y")
    A1-->>P: Accepted
    A2-->>P: Accepted
    Note over P,A3: Majority (2 of 3) achieved — value "y" chosen
    P-xA3: Accept(n=5, v="y") (A3 crashed, never responds)
    Note over A3: A3 recovers later without having recorded this proposal
```
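To make acceptor state survive restarts, an implementation must flush it to stable storage before replying to any message. Below is a minimal file-based sketch; the file name and function names are hypothetical, and a production system would use a proper write-ahead log:

```python
import json
import os
import tempfile

STATE_FILE = "acceptor_state.json"  # hypothetical path

def save_state(promised_id, accepted_id, accepted_value):
    """Atomically persist acceptor state before sending any reply."""
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        json.dump({"promised_id": promised_id,
                   "accepted_id": accepted_id,
                   "accepted_value": accepted_value}, f)
        f.flush()
        os.fsync(f.fileno())     # force to disk, not just the page cache
    os.replace(tmp, STATE_FILE)  # atomic rename: old or new state, never half

def load_state():
    """Recover state on restart; a fresh acceptor has promised nothing."""
    if not os.path.exists(STATE_FILE):
        return {"promised_id": -1, "accepted_id": -1, "accepted_value": None}
    with open(STATE_FILE) as f:
        return json.load(f)
```

The write-to-temp-then-rename pattern ensures a crash mid-write leaves the previous state intact rather than a truncated file.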
Network Partition
During a partition, only the side containing a majority of acceptors can make progress; a proposer isolated on the minority side stalls. When the partition heals, nodes exchange their highest-known proposals and the minority side learns any value the majority chose.
```mermaid
sequenceDiagram
    participant Left as Left Partition
    participant Right as Right Partition
    Note over Left: Proposer L1 has majority (3 nodes)
    Note over Right: Proposer R1 isolated, no majority
    Left->>Left: Propose(n=7, v="left")
    Note over Left: Value "left" chosen in left partition
    Right->>Right: Propose(n=5, v="right")
    Note over Right: No progress in right partition
    Note over Left,Right: Partition heals
    Left->>Right: Exchange highest proposal info
    Note over Left,Right: n=7 wins (higher)
    Note over Left,Right: Right partition adopts "left"
```
Acceptor and Proposer Roles — Pseudo-code
```python
# Acceptor state (persisted to stable storage in a real implementation)
promised_id = -1
accepted_id = -1
accepted_value = None

def on_prepare(proposal_id):
    global promised_id
    if proposal_id > promised_id:
        promised_id = proposal_id
        # Report any previously accepted proposal so the proposer
        # can adopt its value.
        return Promise(promised_id, accepted_id, accepted_value)
    return Reject(promised_id)

def on_accept(proposal_id, value):
    global promised_id, accepted_id, accepted_value
    if proposal_id >= promised_id:
        promised_id = proposal_id
        accepted_id = proposal_id
        accepted_value = value
        return Accepted(accepted_id)
    return Reject(promised_id)

# Proposer
def on_propose(original_value):
    proposal_id = generate_unique_id()  # higher than any number used before
    responses = send_prepare_to_all(proposal_id)
    if len(responses) >= majority:
        accepted = [r for r in responses if r.accepted_id != -1]
        if accepted:
            # Must adopt the value of the highest-numbered accepted proposal.
            value = max(accepted, key=lambda r: r.accepted_id).accepted_value
        else:
            value = original_value
        send_accept(proposal_id, value)
```
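To see these handlers in action, here is a minimal in-memory simulation of a single-decree round across three acceptors. It is a sketch: real Paxos runs these exchanges over an unreliable network, and the `Acceptor` class and `run_round` helper are illustrative names, not part of any library:

```python
class Acceptor:
    def __init__(self):
        self.promised_id = -1
        self.accepted_id = -1
        self.accepted_value = None

    def on_prepare(self, proposal_id):
        if proposal_id > self.promised_id:
            self.promised_id = proposal_id
            return ("promise", self.accepted_id, self.accepted_value)
        return ("reject", self.promised_id, None)

    def on_accept(self, proposal_id, value):
        if proposal_id >= self.promised_id:
            self.promised_id = proposal_id
            self.accepted_id = proposal_id
            self.accepted_value = value
            return ("accepted", proposal_id)
        return ("reject", self.promised_id)

def run_round(acceptors, proposal_id, value):
    # Phase 1: gather promises; abort without a majority.
    promises = [a.on_prepare(proposal_id) for a in acceptors]
    ok = [p for p in promises if p[0] == "promise"]
    if len(ok) <= len(acceptors) // 2:
        return None
    # Adopt the highest-numbered previously accepted value, if any.
    prior = [(pid, v) for _, pid, v in ok if pid != -1]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    # Phase 2: the value is chosen once a majority accepts.
    accepts = [a.on_accept(proposal_id, value) for a in acceptors]
    if sum(r[0] == "accepted" for r in accepts) > len(acceptors) // 2:
        return value
    return None

acceptors = [Acceptor() for _ in range(3)]
print(run_round(acceptors, 5, "value"))  # value
```

Running a second round with a higher number and a different value, `run_round(acceptors, 8, "other")`, still returns `"value"`: the new proposer learns the accepted value in Phase 1 and must adopt it, which is exactly the safety property.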
Livelock Scenario
Here is what goes wrong without a designated proposer. Two nodes keep outbidding each other. Each one proposes a higher number, collects promises from a majority, but then the other jumps in with an even higher number before the first can send its accept messages. No value ever gets chosen — just an infinite loop of higher and higher proposal numbers.
```mermaid
sequenceDiagram
    participant P1 as Proposer A
    participant P2 as Proposer B
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    P1->>A1: Prepare(n=1)
    P1->>A2: Prepare(n=1)
    P1->>A3: Prepare(n=1)
    A1-->>P1: Promise(n=1)
    A2-->>P1: Promise(n=1)
    Note over P1,A3: P1 collecting majority...
    P2->>A1: Prepare(n=2)
    P2->>A2: Prepare(n=2)
    P2->>A3: Prepare(n=2)
    A1-->>P2: Promise(n=2, higher!)
    A2-->>P2: Promise(n=2)
    Note over P1,A3: P2 preempts P1's n=1
    P1->>A1: Prepare(n=3)
    P1->>A2: Prepare(n=3)
    A1-->>P1: Promise(n=3, higher!)
    Note over P1,A3: P1 preempts P2's n=2
    P2->>A1: Prepare(n=4)
    Note over P1,A3: ...continues indefinitely
```
This is why production systems pick a leader to be the only proposer — otherwise you get this exact livelock.
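Short of full leader election, a simpler mitigation is randomized exponential backoff: a preempted proposer waits a random, growing delay before retrying, which desynchronizes the competitors. A sketch (function name and constants are illustrative):

```python
import random

def backoff_delay(attempt, base=0.05, cap=2.0):
    """Exponential backoff with jitter: each preempted round waits longer,
    and the random jitter keeps competing proposers from retrying in
    lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# A preempted proposer would sleep for backoff_delay(attempt) seconds
# before retrying Phase 1 with a higher proposal number.
delays = [backoff_delay(a) for a in range(6)]
assert all(0 <= d <= 2.0 for d in delays)
```

Backoff only makes livelock improbable rather than impossible, which is why a distinguished proposer remains the standard production answer.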
Limitations
Paxos is not the fastest consensus algorithm. The two phases of communication introduce latency, and the algorithm is not designed for high-throughput scenarios. It prioritizes safety over performance.
Paxos alone does not specify how to choose a leader. Without a distinguished proposer, multiple nodes may try to propose simultaneously, leading to livelock. In practice, implementations layer a separate leader-election mechanism on top, often based on randomized timeouts or an external coordination service.
When to Use Paxos
Paxos is the right choice when:
- Correctness is paramount: When your application cannot tolerate divergent state (e.g., financial systems, distributed databases)
- You need a proven foundation: Paxos has formal correctness proofs and decades of analysis behind it
- You’re building infrastructure: Paxos forms the basis for systems like Chubby, Zookeeper (via Zab), and Google’s Spanner
- You need flexibility: Multi-Paxos adapts to sequences of values, making it suitable for replicated state machines
Paxos is not ideal when:
- Throughput matters more than safety: Raft’s leader-centric design often provides better performance
- You need operational simplicity: Raft’s clearer separation of concerns makes it easier to implement and debug
- Your team needs to maintain the code: Raft’s understandability advantage matters for long-term maintenance
Consensus Algorithm Comparison
| Algorithm | Leader Assumption | Complexity | Reconfiguration | Correctness Proof | Learning Curve |
|---|---|---|---|---|---|
| Paxos | Optional (can be leaderless) | High | Complex | Classical | Steep |
| Raft | Strong (single leader) | Medium | Joint consensus | In paper | Moderate |
| VSR | Within view | Medium | View changes | Formal | Moderate |
| Zab | Primary-backup | Medium | Static | Formal | Moderate |
| Algorithm | Typical Latency | Throughput | Fault Tolerance | Production Use |
|---|---|---|---|---|
| Paxos | 2-3 RTTs per decision | Moderate | f failures with 2f+1 nodes | Chubby, Spanner |
| Raft | 1-2 RTTs (with leader) | High | f failures with 2f+1 nodes | etcd, CockroachDB, TiKV |
| VSR | 1-2 RTTs | Moderate | f failures with 2f+1 nodes | CORFU, some Zab variants |
| Zab | 1-2 RTTs | High | f failures with 2f+1 nodes | ZooKeeper |
Conclusion
Paxos is one of the most important algorithms in distributed systems. It shows that consensus is achievable in asynchronous systems with failures: safety holds unconditionally, while progress requires the network to cooperate, as the FLP result demands. Despite its complexity, understanding Paxos provides the foundation for reasoning about more practical consensus protocols like Raft.
If you want to go deeper, my post on Consistency Models explores how consensus fits into the broader landscape of consistency guarantees in distributed systems.
Related Posts
Multi-Paxos
Multi-Paxos extends basic Paxos to achieve consensus on sequences of values, enabling practical replicated state machines for distributed systems and databases.
View-Stamped Replication
View-Stamped Replication (VSR) is a distributed consensus protocol that uses views and timestamps to achieve agreement in asynchronous systems.
The FLP Impossibility Result
The FLP impossibility theorem proves that no consensus algorithm can guarantee termination in an asynchronous system with even one faulty process. Understanding FLP is essential for distributed systems designers.