View-Stamped Replication

View-Stamped Replication (VSR) is a distributed consensus protocol that uses views and timestamps to achieve agreement in asynchronous systems.

published: reading time: 18 min read author: GeekWorkBench updated: March 24, 2026

Introduction

View-Stamped Replication (VSR) was developed by Barbara Liskov and colleagues at MIT in the late 1980s, predating Raft by nearly two decades. The algorithm achieved consensus using a view-based approach where nodes progress through numbered views, with one node serving as the primary in each view.

The name comes from the two key pieces of state that nodes maintain: the view number (indicating their current understanding of who is primary) and the status (whether they are normal, recovering, or have been replaced).

VSR is less well-known than Paxos or Raft, but it influenced both. If you want another angle on consensus algorithms, VSR is worth your time.

Views and Roles

The core abstraction in VSR is the view. A view consists of a view number and an ordered list of replicas. The first replica in the list is the primary, the second is the backup, and so on. Nodes advance through views as failures happen.

stateDiagram-v2
    [*] --> Normal
    Normal --> Normal: Primary receives client requests
    Normal --> Normal: Backup processes prepare messages

    Normal --> ViewChange: Timeout without communication
    ViewChange --> Normal: New view established

    state ViewChange {
        [*] --> StartViewChange
        StartViewChange --> DoViewChange: Received enough VIEW-CHANGE messages
        DoViewChange --> [*]: Sent STARTVIEW to new primary
    }

When a backup detects that the primary is unresponsive (via timeout), it initiates a view change. It increments its local view number and sends VIEW-CHANGE messages to all replicas. Once a node receives VIEW-CHANGE messages from a majority of replicas, it sends a STARTVIEW message to the new primary.

The Normal Phase

In the normal (non-view-change) phase, the primary processes client requests sequentially. When the primary receives a request, it assigns it a monotonically increasing transaction number and forwards an PREPARE message to all backups.

Each backup receives the PREPARE message, persists the operation to its log, and sends an acknowledgment back to the primary. When the primary receives acknowledgments from a majority, it applies the operation to its state machine and responds to the client.

This is similar to how Raft handles log replication, but with a different mechanism for view changes and recovery.

sequenceDiagram
    participant C as Client
    participant P as Primary
    participant B1 as Backup 1
    participant B2 as Backup 2

    C->>P: Request: op(x)
    P->>P: Assign transaction number n=42
    P->>B1: PREPARE(view=5, n=42, op(x))
    P->>B2: PREPARE(view=5, n=42, op(x))
    B1->>P: ACK(view=5, n=42)
    B2->>P: ACK(view=5, n=42)
    P->>P: Apply op(x) to state machine
    P-->>C: Response: OK
    Note over P: Committed: transaction 42

Key points about the normal phase:

  1. The primary assigns a monotonically increasing transaction number (not log index like Raft)
  2. Operations are forwarded to all backups via PREPARE messages
  3. A majority of acknowledgments (including the primary) commits the transaction
  4. The primary applies the operation after quorum is reached, not before

Handling Failures and Recovery

VSR handles node failures carefully. When a primary fails, some backup becomes the new primary through the view change protocol. When a failed node recovers, it must rejoin the group and catch up on missed operations.

Recovery involves the recovering node sending a RECOVERY message to the current primary. The primary responds with a STATE message containing the current view number and the operations the recovering node missed. This allows the recovering node to replay those operations and rejoin as a full member.

Relationship to Paxos and Raft

VSR shares conceptual territory with both Paxos and Raft but takes a different path. Like Raft, it has a clear notion of leadership within a view. Like Paxos, it uses quorum-based replication and can tolerate asynchronous communication, within limits.

The view change mechanism is more elaborate than Raft’s leader election. VSR separates “who is primary” from “what is committed,” whereas Raft embeds these in its leader and log replication.

VSR is a solid choice for building replicated state machines, but it is less commonly implemented than Raft. If you need a well-tested consensus algorithm with a mature ecosystem, Raft-based implementations (like etcd) are more readily available.

However, if you’re building a system that requires the specific properties of view-stamped replication, or if you’re working with existing VSR-based code, understanding the algorithm helps.

VSR and Zab: The Direct Lineage

The Zab protocol (ZooKeeper Atomic Broadcast) is essentially VSR adapted for ZooKeeper’s specific needs. Understanding this lineage clarifies why ZooKeeper behaves the way it does.

The key similarities:

  • Both use view numbers to track leadership epochs
  • Both use a primary-backup model where only the primary can order operations
  • Both separate “who is primary” from “what is committed”
  • Both use versions of VIEW-CHANGE/STARTVIEW-equivalent messages to transition leadership

The key adaptations Zab made:

AspectVSRZab
RecoveryFull state replay via STATE messagesDiscovery phase + state sync
Epoch conceptView numberEpoch (proposal) number
CommitmentMajority ACK to primaryQuorum of acknowledgments to leader
Abort conditionIf a quorum cannot be reachedIf quorum is lost, suspend operations

When you debug ZooKeeper leadership issues, the view-change concepts from VSR map directly. ZooKeeper’s zxid (transaction ID) carries the epoch number and transaction counter, much like VSR’s view number plus transaction number.

Comparison to Other Approaches

For more context on consensus in distributed systems, see my posts on CAP Theorem and Consistency Models. These explore the broader landscape of trade-offs in distributed systems.

The Two-Phase Commit post discusses a different coordination approach that is simpler but less fault-tolerant.

VSR vs Raft vs Paxos

AspectVSRRaftPaxos
Leader ConceptPrimary within viewSingle strong leaderOptional distinguished proposer
ReconfigurationView changes with majorityJoint consensusComplex, requires extra care
Recovery ComplexityRECOVERY/STATE messagesLog truncation + snapshotDepends on implementation
CommitmentMajority of acknowledgmentsMajority of AppendEntriesMajority of Accept messages
Log StructureTransaction numbersLog indices with termsProposal numbers
Node StateView number + statusCurrent term + votedForPromised proposal numbers
AspectVSRRaftPaxos
UnderstandabilityModerateHigh (designed for clarity)Low (proof-heavy)
PerformanceComparable to RaftOptimized for throughputLower due to 2 phases
InfluenceCORFU, Zabetcd, CockroachDB, TiKVChubby, Spanner
Formal ProofYesYesYes (classical)
Year Published198820141998 (first paper)

Membership Changes and Reconfiguration

VSR handles cluster membership changes through the view change mechanism. When nodes need to be added or removed, the system progresses through a new view that includes the updated membership configuration.

The reconfiguration procedure follows these steps:

sequenceDiagram
    participant O as Old Configuration
    participant L as Leader
    participant N as New Node
    participant F as Followers

    Note over O,L: Client requests configuration change
    L->>L: Prepare reconfiguration view v+1
    L->>N: STARTVIEW(v+1, new_config)
    L->>F: STARTVIEW(v+1, new_config)
    Note over N: New node catches up via STATE messages
    N-->>L: ACK(v+1)
    F-->>L: ACK(v+1)
    Note over L: Quorum reached in view v+1
    Note over L,N,F: New configuration active

Key points about VSR reconfiguration:

  1. View-based approach — Membership changes are embedded in view changes, giving a natural mechanism for transitioning between configurations
  2. Catch-up via STATE messages — New nodes receive the current state from the primary before participating in consensus
  3. No joint consensus required — Unlike Raft’s two-phase joint consensus, VSR transitions in a single view change once the new node catches up
  4. Transition safety — Quorum requirements prevent two conflicting configurations from operating simultaneously

The primary sends STATE messages containing the view number, committed transaction number, and the recent log entries the new node needs to replay. Once the new node acknowledges receipt and catches up, it becomes a full voting member.

Recovery and Checkpointing Procedure

VSR nodes can fail and recover, requiring them to rejoin the cluster with the correct state. The recovery procedure uses the RECOVERY and STATE message exchange.

sequenceDiagram
    participant R as Recovering Node
    participant P as Primary
    participant B as Backup

    R->>P: RECOVERY(v=5, n_recovered)
    Note over P: Current view = 5, committed_n = 100
    P->>R: STATE(view=5, committed=100, operations=[...])
    Note over R: Replay operations 1-100
    R->>R: State machine up-to-date
    Note over R: Rejoin as full member

The recovery procedure:

  1. RECOVERY message — The recovering node sends its last known view number and last recovered transaction number to the current primary
  2. STATE response — The primary responds with the current view, the highest committed transaction number, and any uncommitted operations the recovering node missed
  3. State reconstruction — The recovering node replays the missing operations to its state machine, bringing it back up to date
  4. Return to service — Once caught up, the node resumes normal operation and can participate in quorum decisions

Checkpointing in VSR

Checkpointing (snapshotting) in VSR serves the same purpose as in Raft: bounding log growth for long-running clusters. The checkpoint contains:

class Checkpoint:
    def __init__(self, view_number, last_transaction, state):
        self.view_number = view_number
        self.last_transaction = last_transaction  # Last committed transaction
        self.state = state                        # State machine state snapshot
        self.checksum = compute_hash(state)       # Integrity verification

Checkpoint creation:

  • The primary periodically creates checkpoints of the application state machine
  • Checkpoints are triggered based on transaction count thresholds (e.g., every 10,000 transactions)
  • Log entries up to the checkpoint’s last_transaction can be discarded

Checkpoint recovery:

  • A recovering node receives the latest checkpoint via STATE message
  • The checkpoint includes enough information to replay from that point
  • Nodes verify checkpoint integrity via the embedded checksum

This approach means:

  • Memory usage stays bounded even with high transaction rates
  • Recovery time stays bounded by the checkpoint interval, not total transaction history
  • Network transfer during recovery stays proportional to checkpoint size, not full history

Production Failure Scenarios

Network Partition During View Change

During a network partition, two nodes may simultaneously believe they are the primary. VSR resolves this because only the node that receives VIEW-CHANGE messages from a majority can actually form a quorum and send STARTVIEW to establish leadership. The partitioned node that cannot reach a majority will stall—it cannot commit transactions, but it also cannot cause split-brain.

Cascading Node Failures

When multiple nodes fail in quick succession, VSR can enter a situation where no majority exists to form a new view. In this case, the cluster becomes unavailable for writes until at least one node recovers or network connectivity is restored. This is an intentional design trade-off—no false commits are possible, but availability is sacrificed.

Checkpoint Corruption

If a recovering node receives a corrupted checkpoint via the STATE message, the embedded checksum allows detection. The node must request the checkpoint again or sync from another replica. This prevents replaying corrupted operations to the state machine.

View Number Overflow

For very long-running clusters with frequent view changes, the view number could theoretically overflow. In practice, systems using VSR implement checks that alert administrators well before this becomes a concern, and reconfiguration can be used to reset view numbers if needed.

Quick Recap Checklist

Before diving deeper, here are the key points to remember about View-Stamped Replication:

  • VSR uses views—a view number plus an ordered replica list—to track leadership
  • The first replica in a view is the primary; remaining replicas are backups
  • Transaction numbers (not log indices) identify operations in the order they are committed
  • A majority of acknowledgments commits a transaction; the primary applies it after quorum
  • View changes happen when a backup detects an unresponsive primary via timeout
  • A quorum of VIEW-CHANGE messages promotes a new primary in a higher-numbered view
  • Recovery uses RECOVERY/STATE message exchange to replay missed operations
  • Checkpointing bounds log growth and speeds up recovery for long-running clusters
  • Reconfiguration embeds membership changes in view changes—no joint consensus needed
  • VSR influenced CORFU, ZooKeeper’s Zab, and shares concepts with Raft and Paxos

Interview Questions

1. What is a "view" in View-Stamped Replication and what two pieces of state does each node maintain?

A view in VSR consists of a view number and an ordered list of replicas. Each node maintains two key pieces of state: the view number (indicating their current understanding of who is primary) and their status (whether they are normal, recovering, or replaced).

2. How does VSR handle a primary failure and subsequent view change?

When a backup detects that the primary is unresponsive via timeout, it increments its local view number and sends VIEW-CHANGE messages to all replicas. Once a node receives VIEW-CHANGE messages from a majority of replicas, it sends a STARTVIEW message to the new primary, completing the transition.

3. How does the normal phase work in VSR?

In the normal phase, the primary receives client requests and assigns each a monotonically increasing transaction number. It forwards an PREPARE message to all backups. Each backup persists the operation to its log and sends an acknowledgment back. When the primary receives acknowledgments from a majority, it applies the operation to its state machine and responds to the client.

4. How does VSR's recovery procedure work when a node rejoins the cluster?

The recovering node sends a RECOVERY message to the current primary with its last known view number and last recovered transaction number. The primary responds with a STATE message containing the current view number, the highest committed transaction number, and any uncommitted operations the recovering node missed. The recovering node replays those operations and rejoins as a full member.

5. How does checkpointing work in VSR and why is it needed?

The primary periodically creates checkpoints of the application state machine based on transaction count thresholds (e.g., every 10,000 transactions). Log entries up to the checkpoint's last_transaction can then be discarded. This bounds memory usage, recovery time, and network transfer during recovery.

6. How does VSR handle membership changes and reconfiguration?

VSR embeds membership changes in view changes. When nodes need to be added or removed, the system progresses through a new view that includes the updated membership configuration. The primary sends STATE messages containing the view number, committed transaction number, and recent log entries. Once the new node catches up, it becomes a full voting member.

7. What is the key difference between VSR and Raft in terms of leader concept and reconfiguration?

VSR uses "primary within a view" as its leader concept, while Raft uses a single strong leader. For reconfiguration, VSR transitions via single view changes with majority quorum once the new node catches up. Raft requires a two-phase joint consensus approach.

8. What systems were influenced by VSR?

CORFU used VSR as a foundation for distributed shared log storage. Apache ZooKeeper's Zab protocol borrows heavily from VSR's view-change approach. VSR also influenced the thinking behind Raft, which was developed nearly two decades later.

9. How does VSR's transaction numbering differ from Raft's log indexing?

VSR assigns monotonically increasing transaction numbers to operations, which identify the order of operations being committed. Raft uses log indices with terms—each entry has a term number that helps with leader election and consistency checks.

10. What are the node state components in VSR compared to Raft?

In VSR, node state is the view number plus status (normal, recovering, replaced). In Raft, node state includes the current term number and votedFor field. Both approaches track leadership and consensus participation, but through different mechanisms.

11. How does VSR's commitment mechanism work?

VSR commits transactions when the primary receives acknowledgments from a majority of backups (including itself). The primary assigns a transaction number and forwards PREPARE messages. Once quorum is reached, the primary applies the operation to its state machine before responding to the client.

12. Why is VSR's separation of "who is primary" from "what is committed" significant?

This separation gives VSR flexibility in how leadership and consensus are decoupled. The view change mechanism handles who should be primary independently from the committed state, whereas Raft embeds these concepts together in its leader and log replication model.

13. How does VSR tolerate network asynchronous conditions compared to Raft?

VSR can tolerate asynchronous conditions during normal operation as long as messages eventually get delivered and a majority can be reached. However, view changes require synchrony—a node cannot establish a new view if it cannot collect VIEW-CHANGE messages from a majority. This is similar to Raft's requirement that leader election requires a majority of reachable nodes.

14. What happens if a primary crashes after sending PREPARE but before committing?

If the primary crashes after broadcasting PREPARE messages but before receiving a majority of acknowledgments, no commitment occurs. The new primary (elected in the subsequent view change) will have access to the prepared but uncommitted operation in its log. It can either re-prepare the operation (send PREPARE again with the same transaction number) or discard it depending on implementation choices.

15. Why does VSR use transaction numbers instead of log indices?

Transaction numbers provide a simpler way to track operation ordering across view changes. Since views have ordered replica lists, the transaction number alone can identify "who committed what when" without needing to track term numbers, log indices, and snapshot state like Raft does. This simplifies the recovery procedure.

16. How does VSR's view change compare to Raft's leader election in terms of complexity?

VSR's view change is more elaborate—it requires multiple rounds of messages (VIEW-CHANGE to all, then STARTVIEW from the new primary). Raft's leader election is simpler: nodes increment their term, vote for themselves, and the candidate with most votes wins. However, VSR separates the "who is primary" question from "what is committed," giving more flexibility.

17. What are the advantages of VSR's RECOVERY/STATE message exchange over Raft's snapshot transfer?

RECOVERY/STATE allows incremental catch-up based on the recovering node's last known state. The primary sends only the operations the node missed, reducing network transfer. Raft typically transfers full snapshots for nodes that are too far behind. VSR's approach is more efficient when the gap is small but requires the primary to track per-node recovery state.

18. How does Zab differ from VSR in handling epoch/leadership transitions?

Zab uses epochs (similar to VSR's view numbers) but with a discovery phase where the leader learns the current state from followers before proposing. VSR transitions more directly—once a quorum of VIEW-CHANGE messages is received, the new primary is established immediately. Zab's discovery phase adds safety but increases latency for leadership transitions.

19. What constraints does VSR place on quorum configuration?

VSR requires that any two successive views share a majority of nodes to ensure progress. This is necessary because view changes depend on receiving VIEW-CHANGE messages from a majority. If the membership configuration breaks this property (e.g., too few nodes relative to failure tolerance), the system may become unavailable for writes.

20. How would you implement a client retry mechanism in VSR?

When a client sends a request and doesn't receive a response within a timeout, it should resend the request to the primary with its original transaction number (if known) and client ID. The primary must track which requests have already been applied to prevent duplicate execution. Idempotent operations can be safely retried; non-idempotent operations require careful tracking of client sequence numbers.

Further Reading

Conclusion

View-Stamped Replication is a consensus algorithm worth knowing about. It predates Raft, influenced both Paxos and Raft, and provides a coherent approach to achieving consensus through views and quorum-based replication.

If you’re interested in the history and diversity of consensus algorithms, VSR is worth studying. It shows that there are many valid approaches to the same fundamental problem.

Category

Related Posts

Paxos Consensus Algorithm

Paxos is a family of consensus algorithms for achieving agreement in distributed systems despite failures, pioneered by Leslie Lamport.

#distributed-systems #consensus #paxos

The FLP Impossibility Result

The FLP impossibility theorem proves that no consensus algorithm can guarantee termination in an asynchronous system with even one faulty process.

#distributed-systems #consensus #flp

Leader Election in Distributed Systems

Leader election is the process of designating a single node as the coordinator among a set of distributed nodes, critical for consensus protocols.

#distributed-systems #leader-election #consensus