Multi-Paxos
Multi-Paxos extends basic Paxos to achieve consensus on sequences of values, enabling practical replicated state machines for distributed systems and databases.
Introduction
Basic Paxos agrees on a single value. Run it once, and all non-faulty nodes commit to the same value. Run it again with a different value, and they commit to that. But what if you need to agree on a sequence of values, like a replicated log of commands? That’s what Multi-Paxos solves.
The extension is conceptually simple: treat each slot in the log as a separate instance of basic Paxos. The complexity comes from optimization—running Paxos from scratch for every log entry would be prohibitively expensive.
The Core Idea
Multi-Paxos optimizes away the prepare phase for consecutive entries once a leader is established. The insight is that if a proposer has recently completed Phase 1 with a majority, it can reasonably assume it remains the leader for a period. During this stable period, it can skip Phase 1 and go directly to Phase 2 for subsequent entries.
This optimization transforms Multi-Paxos from a two-phase protocol per entry to essentially a one-phase protocol after the initial leader election.
sequenceDiagram
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
rect rgb(40, 40, 60)
Note over L,F2: First entry: Full Paxos (Phase 1 + Phase 2)
L->>F1: Prepare(n=10)
L->>F2: Prepare(n=10)
F1-->>L: Promise(v=nil)
F2-->>L: Promise(v=nil)
L->>F1: Accept(n=10, v="cmd1")
L->>F2: Accept(n=10, v="cmd1")
F1-->>L: Accepted
F2-->>L: Accepted
end
rect rgb(40, 60, 40)
Note over L,F2: Subsequent entries: Optimized (Phase 2 only)
L->>F1: Accept(n=10, v="cmd2")
L->>F2: Accept(n=10, v="cmd2")
F1-->>L: Accepted
F2-->>L: Accepted
L->>F1: Accept(n=10, v="cmd3")
L->>F2: Accept(n=10, v="cmd3")
F1-->>L: Accepted
F2-->>L: Accepted
end
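The optimization in the diagram can be sketched in a few lines of Python. The class and message tuples below are illustrative names, not from any particular implementation: Phase 1 runs once, and each subsequent command reuses the same ballot number for a new log slot.

```python
# Sketch of the Phase-1-skipping optimization (illustrative names).
class MultiPaxosLeader:
    def __init__(self, proposal_n):
        self.proposal_n = proposal_n   # ballot number won in Phase 1
        self.next_slot = 0             # next free log index
        self.has_leadership = False

    def elect(self):
        # Phase 1 runs once, covering all future slots at this ballot.
        self.has_leadership = True
        return [("Prepare", self.proposal_n)]

    def propose(self, value):
        if not self.has_leadership:
            raise RuntimeError("must run Phase 1 first")
        slot = self.next_slot
        self.next_slot += 1
        # Phase 2 only: one Accept message per log slot while leadership holds.
        return [("Accept", self.proposal_n, slot, value)]
```

In a real implementation, `elect` would wait for a majority of Promise responses before setting `has_leadership`, and a rejected Accept (someone saw a higher ballot) would clear it.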
Role of the Leader
Multi-Paxos typically designates a stable leader to avoid contention. Without leadership, multiple nodes might propose conflicting values for the same log position, causing retries and latency spikes.
The leader election can use basic Paxos itself (running a round of prepare messages to see if any node can get a majority). Some implementations piggyback on the first client’s request to establish leadership lazily.
Log Gaps and Truncation
Gaps in the log arise naturally when leaders change: the new leader may be missing entries that the previous leader replicated, if those entries had not reached the new leader before the failover. Before the new leader can safely apply or propose later entries, it must resolve these gap slots.
graph LR
L1["Node 1<br/>Log: [A] [B] [C] [D] [E]"]
L2["Node 2<br/>Log: [A] [B] [C]"]
L3["Node 3<br/>Log: [A] [B] [C] [D]"]
L1 -->|Committed| C1["Idx 1-5 committed"]
L2 -->|Gap| C2["Missing idx 4,5"]
L3 -->|Gap| C3["Missing idx 5"]
When a new leader takes over, it may have gaps in its log where other nodes already accepted entries. Multi-Paxos handles this by having the new leader run Phase 1 for the unresolved slots and re-propose any values it discovers, ensuring consistency without a full log copy.
Skipped Log Indices
What happens if a client request needs to go into log position 100, but the log only has 50 entries? In practice, the leader detects gaps and either waits for the gap to be filled or replicates a no-op command to fill the gap. Some implementations use a snapshot-based approach where the state machine state is checkpointed and old log entries are discarded.
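The no-op approach can be sketched as follows. `fill_gaps` is a hypothetical helper: it only collects the proposals, whereas a real leader would run a Paxos round for each missing slot to check whether some value was already accepted there.

```python
NOOP = {"op": "noop"}  # illustrative no-op command

def fill_gaps(log, target_index):
    """Collect no-op proposals for every undecided slot below target_index.

    `log` maps decided indices to commands. Sketch only: a real leader
    runs Phase 1 per missing slot and proposes NOOP only if no value
    was previously accepted there.
    """
    proposals = []
    for idx in range(target_index):
        if idx not in log:
            proposals.append((idx, NOOP))
    return proposals
```

Once the no-ops commit, the state machine can apply entries in order without stalling on holes; applying a no-op changes nothing.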
Implementation Challenges
Multi-Paxos sounds simple, but production implementations face real challenges. The gap between theory and practice is substantial.
Persistence: If the leader acknowledges a client before the entry is replicated to a majority, a crash can lose the command even though the client saw success. Careful clients must track which entries are actually committed.
Checkpointing: The log grows indefinitely. Periodic checkpointing of the state machine lets you truncate old entries, but taking consistent checkpoints across a distributed system is its own problem.
Membership changes: Adding or removing nodes while the system is running requires careful coordination. Most implementations pause the protocol, change membership, and resume.
Leader Stability Challenges
Multi-Paxos needs a relatively stable leader to get its performance benefits. When leadership flips often, the protocol collapses back toward basic Paxos.
Consider a scenario with three nodes (N1, N2, N3) and a client issuing commands:
# Scenario: Leadership changes during command processing
# Time T1: N1 is leader
client_cmd_1 = {"seq": 1, "value": "write x=1"}
# N1 sends Accept to N2, N3
# Time T2: N1 fails, N2 becomes leader before cmd_1 commits
# N2 receives client_cmd_2 = {"seq": 1, "value": "write x=2"}
# N2 proposes seq=1 with value "write x=2"
# Time T3: N1 recovers, N3 thinks N1 is still leader
# N3 receives Accept from N1 for seq=1, N2 for seq=1
# Conflict: same sequence number, different values
If clients retry with the same sequence numbers after leader changes, conflicting proposals can emerge. The system must either assign sequence numbers centrally (defeating distributed benefits) or accept that some commands will be lost and require client-level retry logic.
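One common mitigation for the retry problem, sketched here with assumed names rather than taken from the text above, is to deduplicate in the state machine: clients tag each command with a unique (client_id, request_id) pair, and the state machine applies each pair at most once regardless of which log slot the retry lands in.

```python
class DedupStateMachine:
    """Sketch: deduplicate client retries by (client_id, request_id).

    Because every replica applies the same log in the same order,
    every replica caches and replays the same result for a retry.
    """
    def __init__(self):
        self.applied = {}   # (client_id, request_id) -> cached result
        self.state = {}     # the key-value state itself

    def apply(self, cmd):
        key = (cmd["client_id"], cmd["request_id"])
        if key in self.applied:
            return self.applied[key]   # retry: replay the cached result
        self.state[cmd["k"]] = cmd["v"]
        self.applied[key] = "ok"
        return "ok"
```

This keeps sequence-number assignment decentralized at the cost of tracking per-client request IDs (bounded in practice by expiring old entries).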
Log Compaction Interaction
Checkpointing interacts with Multi-Paxos in subtle ways. A checkpoint must capture the complete state, but determining “complete” in a distributed setting is tricky.
graph TB
subgraph "Checkpoint Creation Process"
C1[Leader creates checkpoint<br/>of state machine state]
C2[Send checkpoint to all followers<br/>asynchronously]
C3[Followers apply checkpoint<br/>locally]
C4[Leader truncates log<br/>at checkpoint index]
C5[Clients can now read<br/>consistent snapshot]
C1 --> C2 --> C3 --> C4 --> C5
end
Here is the tricky part: a follower might be at log index 100 while the leader checkpointed at index 150 and truncated its log. If the follower crashes and restarts, it must reconstruct state from the checkpoint plus log entries 101-150. If those entries are gone, no number of Paxos rounds will recover them; the follower must instead install a fresh checkpoint from another replica.
Disk I/O Bottlenecks
Multi-Paxos requires durability before acknowledging accepted proposals: each Accept message typically triggers an fsync to persistent storage.
# Typical accept handler with persistence
def on_accept(proposal_id, value):
    # 1. Append to the log (sequential I/O, relatively fast)
    log.append(proposal_id, value)
    # 2. Persist to disk: the fsync is orders of magnitude slower
    #    (~5-10ms on a spinning disk)
    log.sync()  # fsync call
    # 3. Update in-memory state
    state.update(value)
    # 4. Send response
    return Accepted(proposal_id)
On a system with spinning disks, fsync latency dominates. A single leader processing thousands of commands per second can become bottlenecked on disk I/O. SSDs help, but the fundamental issue remains: every accepted proposal requires synchronous persistence.
Some implementations batch multiple accepts before fsync, trading durability for throughput. This means a crash might lose several recently-accepted commands.
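A group-commit sketch of that batching idea follows. The class and parameters are illustrative; real implementations also flush on a timeout so a lone request is not stalled waiting for a full batch.

```python
import os
import threading

class GroupCommitLog:
    """Batch several appends behind a single fsync (group commit sketch)."""

    def __init__(self, path, batch_size=8):
        self.f = open(path, "ab")
        self.batch_size = batch_size
        self.pending = 0
        self.lock = threading.Lock()

    def append(self, entry: bytes):
        with self.lock:
            self.f.write(entry + b"\n")
            self.pending += 1
            if self.pending >= self.batch_size:
                self.f.flush()
                os.fsync(self.f.fileno())  # one fsync covers the whole batch
                self.pending = 0
```

The trade-off is visible in the structure: entries between fsyncs are acknowledged together, so a crash can lose up to `batch_size - 1` recently appended entries.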
Concurrency Control Limitations
Multi-Paxos decides on a sequence of values but provides no native concurrency control for overlapping writes to the same keys.
# Multi-Paxos does not handle this:
# Client A: write x = 1, read x
# Client B: write x = 2, read x
# If both operations touch the same key "x",
# Multi-Paxos guarantees they are ordered,
# but provides no mechanism to detect
# or prevent conflicting access patterns
# You must layer your own concurrency control:
# - External locking (e.g. ZooKeeper advisory locks)
# - Optimistic concurrency (version checks, MVCC)
# - Pessimistic locking (two-phase locking)
For simple key-value stores, this is manageable. For complex transactions touching multiple keys, you need additional coordination.
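As a sketch of the optimistic approach, each write can carry the version the client read, and the state machine rejects stale writes deterministically. Names below are illustrative; the key property is that every replica, applying the same log in the same order, reaches the same verdict.

```python
class VersionedStore:
    """Optimistic concurrency layered above a Multi-Paxos log (sketch)."""

    def __init__(self):
        self.data = {}  # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def apply(self, cmd):
        # cmd = {"key": ..., "value": ..., "expected_version": ...}
        value, version = self.data.get(cmd["key"], (None, 0))
        if cmd["expected_version"] != version:
            return "conflict"  # deterministic: same verdict on every replica
        self.data[cmd["key"]] = (cmd["value"], version + 1)
        return "ok"
```

A client that receives "conflict" re-reads the key and retries with the new version, which is exactly the client-side loop Multi-Paxos itself does not provide.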
Persistent State Requirements
Before responding to Prepare or Accept messages, a proper Multi-Paxos acceptor must persist certain state to ensure crash safety.
# State that MUST be persisted before replying to Prepare or Accept
class AcceptorState:
    def on_prepare(self, proposal_id):
        # Persist the promise before sending it.
        # If we crash after sending Promise but before persisting,
        # we might accept a lower-numbered proposal after restart.
        if proposal_id > self.promised_id:
            self.promised_id = proposal_id
            persist_to_disk({
                'promised_id': self.promised_id,
                'accepted_id': self.accepted_id,  # also persist accepted state
                'accepted_value': self.accepted_value
            })
            return Promise(proposal_id, self.accepted_id, self.accepted_value)
        return Nack(self.promised_id)

    def on_accept(self, proposal_id, value):
        # Must persist BEFORE sending the Accepted response.
        # If we crash after responding but before persisting,
        # we lose knowledge of what we accepted.
        if proposal_id >= self.promised_id:
            self.accepted_id = proposal_id
            self.accepted_value = value
            persist_to_disk({
                'accepted_id': self.accepted_id,
                'accepted_value': self.accepted_value
            })
            return Accepted(proposal_id)
        return Nack(self.promised_id)
The invariant is simple: once you have promised not to accept lower proposals, or have accepted a value, that state must survive crashes. Violating this leads to split-brain scenarios where different nodes believe different values were chosen.
Multi-Paxos vs Raft
Raft was designed to be understandable by explicitly separating concerns. Multi-Paxos is more abstract and leaves many details unspecified. This distinction has practical implications.
graph LR
subgraph "Design Philosophy"
M[Multi-Paxos<br/>Abstract model<br/>Minimal assumptions] --> R[Raft<br/>Concrete specification<br/>Every case defined]
end
subgraph "Practical Trade-offs"
M2[Multi-Paxos<br/>Fewer network rounds<br/>for stable workloads<br/>Harder to implement] --> R2[Raft<br/>Easier to implement<br/>correctly<br/>More rounds]
end
| Aspect | Multi-Paxos | Raft |
|---|---|---|
| Network Rounds (stable leader) | 1 RTT per entry | 1 RTT per entry |
| Leader Election | Via Paxos (2+ RTTs) | Heartbeat-based |
| Log Compaction | Unspecified (varies) | Clearly defined (snapshot + log truncation) |
| Membership Changes | Complex, varies by implementation | Joint consensus or single-server changes |
| Implementation Difficulty | High (many unspecified details) | Moderate (clear specification) |
| Leader Stability | Critical for performance | Important but more resilient |
| Disk I/O | Implementation-dependent | Typically batched snapshots |
Raft separates leader election from log replication explicitly. Multi-Paxos treats these as the same mechanism. The Raft paper gives you reference pseudo-code for all components; Multi-Paxos papers focus on correctness proofs rather than implementation guidance.
For new projects, Raft is usually the better choice. The clarity of specification reduces implementation bugs. For existing systems (Chubby, ZooKeeper's Zab), understanding Multi-Paxos helps you reason about their behavior.
Checkpointing and State Machine Recovery
Multi-Paxos logs grow indefinitely. Without checkpointing, recovery time grows proportionally to log length. Checkpointing (or snapshotting) solves this by periodically saving the state machine state and discarding old log entries.
What a Checkpoint Contains
A checkpoint is not just a memory dump. It must capture everything needed to reconstruct the state machine at a specific log index:
class Checkpoint:
    def __init__(self, last_applied_index, state, log_length):
        self.last_applied_index = last_applied_index
        # The highest log index whose command has been applied;
        # all earlier entries are now reflected in state
        self.state = state
        # The actual state machine state (key-value pairs,
        # document contents, etc.)
        self.log_length = log_length
        # How many log entries existed when the checkpoint was taken;
        # entries before this index can be discarded
The Replay Process
When a node restarts, it must reconstruct its state by reading the checkpoint and replaying log entries after the checkpoint index.
graph TB
subgraph "State Machine Recovery"
S1[Read last checkpoint<br/>last_applied=500<br/>state=Snapshot_500] --> S2[Find log entries<br/>501, 502, 503...]
S2 --> S3[Replay entry 501<br/>Apply to state]
S3 --> S4[Replay entry 502<br/>Apply to state]
S4 --> S5[Replay entry 503<br/>Apply to state]
S5 --> S6[State machine matches<br/>other replicas]
end
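The replay loop itself is small. This sketch assumes the log is a mapping from index to command and that `apply_fn` mutates the state in place; both are illustrative, not the API of any specific system.

```python
def recover(checkpoint_state, last_applied, log, apply_fn):
    """Reconstruct state machine state after a restart (sketch).

    Entries at or below last_applied are already reflected in
    checkpoint_state, so only later entries are replayed, in order.
    """
    state = dict(checkpoint_state)
    for index in sorted(log):
        if index > last_applied:
            apply_fn(state, log[index])
    return state
```

Because application is deterministic, replaying the same entries over the same checkpoint always yields the same state as the replicas that never crashed.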
Checkpoint Creation and Log Truncation
Checkpoints must be created consistently across replicas to avoid divergence.
sequenceDiagram
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
participant SM as State Machine
Note over L: Checkpoint interval reached
L->>SM: Create checkpoint<br/>state=apply(log 1-500)
L->>F1: Send checkpoint + index=500
L->>F2: Send checkpoint + index=500
Note over SM: Log entries 1-500 now reflected<br/>in state
F1->>F1: Apply checkpoint locally
F2->>F2: Apply checkpoint locally
L->>F1: Truncate log entries 1-500
L->>F2: Truncate log entries 1-500
Note over L,F2: All replicas now at<br/>checkpoint index 500
The leader coordinates checkpoint creation to ensure all replicas can truncate the same log entries. If a follower is slow, the leader waits before truncating.
Consistency Requirements
A checkpoint is only trustworthy if it reflects a consistent view of the state machine. In Multi-Paxos terms, this means the checkpoint index must be a committed log position.
If you checkpoint at index 400 but index 401 was not yet committed (no majority acknowledged), you might checkpoint state that other nodes never saw. After a failure, the replica might diverge.
Proper implementations only checkpoint at committed indices.
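The rule reduces to a simple guard (names are illustrative): a checkpoint index must not exceed either the commit index or the index the local state machine has actually applied.

```python
def can_checkpoint_at(index, commit_index, last_applied):
    """Only checkpoint at indices that are both committed
    (majority-acknowledged) and already applied locally (sketch)."""
    return index <= commit_index and index <= last_applied
```

A checkpoint that violates either condition could capture state other nodes never agreed on, which is exactly the divergence scenario described above.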
Relationship to Other Patterns
Multi-Paxos forms the backbone of many replicated log systems. My post on Two-Phase Commit discusses a different approach to coordinating distributed transactions, though it lacks the fault-tolerance guarantees that Multi-Paxos provides.
For building complete replicated services, you’ll need more than just Multi-Paxos. My posts on Database Replication and Distributed Transactions explore how consensus fits into broader system designs.
Practical Applications
Google’s Spanner uses Multi-Paxos variants to synchronize replicas across data centers. Chubby uses it for its lock service. Many distributed databases follow the same underlying pattern: a replicated log of commands applied to each replica’s state machine.
The reason this works in practice: basic Paxos is expensive, but you only need it once to establish leadership. After that, the common case is fast.
Conclusion
Multi-Paxos extends the theoretical Paxos algorithm into a practical protocol for replicating logs and state machines. The optimization from two phases to one per entry makes it fast enough for production use.
Understanding Multi-Paxos helps when working with systems like Zookeeper, etcd, or distributed databases. But for new projects, Raft usually provides a cleaner foundation.