View-Stamped Replication

View-Stamped Replication (VSR) is a distributed consensus protocol that uses views and timestamps to achieve agreement in asynchronous systems.

Published: March 24, 2026 | Updated: March 24, 2026 | Author: GeekWorkBench

Quick Summary

View-Stamped Replication is a consensus protocol from 1988 that has drawn far less attention than Paxos and Raft, even though it is a contemporary of Paxos and anticipated much of Raft's design. VSR uses views - a view number plus an ordered replica list - to track leadership, with the first replica in each view serving as primary. Nodes progress through views as failures happen, and a quorum of VIEW-CHANGE messages promotes a new primary. What makes VSR interesting is how it separates 'who is primary' from 'what is committed', a design choice that ZooKeeper's Zab protocol borrowed directly. Recovery uses a RECOVERY/STATE message exchange to replay missed operations rather than a full snapshot transfer. If you're debugging ZooKeeper leadership issues or working with CORFU, VSR's view-change concepts map directly to what you're seeing.

Introduction

View-Stamped Replication (VSR) was developed by Barbara Liskov and colleagues at MIT in the late 1980s, predating Raft by about 25 years. The algorithm achieves consensus using a view-based approach where nodes progress through numbered views, with one node serving as the primary in each view.

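The name comes from the viewstamp that tags each operation: the number of the view in which it was issued, paired with a sequence number within that view. Beyond that, each node maintains two key pieces of state: the view number (its current understanding of who is primary) and its status (whether it is normal, recovering, or has been replaced).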

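VSR is less well-known than Paxos or Raft, but it is closely related to both: it was developed independently of Paxos at around the same time, and Raft's authors acknowledge how similar it is to their design. If you want another angle on consensus algorithms, VSR is worth your time.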

Views and Roles

The core abstraction in VSR is the view. A view consists of a view number and an ordered list of replicas. The first replica in the list is the primary, and the remaining replicas are backups. Nodes advance through views as failures happen.
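A minimal sketch of that abstraction, assuming string replica names. The `View` class, its field names, and the rotation policy in `next_view` are illustrative choices, not something the protocol mandates:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class View:
    number: int
    replicas: Tuple[str, ...]  # ordered; position 0 is the primary

    @property
    def primary(self) -> str:
        return self.replicas[0]

    @property
    def backups(self) -> Tuple[str, ...]:
        return self.replicas[1:]

    def quorum_size(self) -> int:
        # Smallest group that is a strict majority of the replica list
        return len(self.replicas) // 2 + 1


def next_view(view: View) -> View:
    # Hypothetical policy: rotate the list so the old first backup leads
    rotated = view.replicas[1:] + view.replicas[:1]
    return View(view.number + 1, rotated)
```

With `View(5, ("a", "b", "c"))`, replica `a` is primary, and `next_view` yields view 6 with `b` in front.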

stateDiagram-v2
    [*] --> Normal
    Normal --> Normal: Primary receives client requests
    Normal --> Normal: Backup processes prepare messages

    Normal --> ViewChange: Timeout without communication
    ViewChange --> Normal: New view established

    state ViewChange {
        [*] --> StartViewChange
        StartViewChange --> DoViewChange: Received enough VIEW-CHANGE messages
        DoViewChange --> [*]: Sent DOVIEWCHANGE to new primary
    }

When a backup detects that the primary is unresponsive (via timeout), it initiates a view change. It increments its local view number and sends VIEW-CHANGE messages to all replicas. Once a node receives VIEW-CHANGE messages from a majority of replicas, it sends a DOVIEWCHANGE message carrying its log to the new primary. After hearing from a majority, the new primary adopts the most up-to-date log and announces the new view to the other replicas with a STARTVIEW message.
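The majority check at the heart of this step can be sketched as a small tally. `ViewChangeTally` and its method names are hypothetical, and the sketch counts distinct senders per proposed view only; it leaves out timers and log transfer:

```python
from collections import defaultdict
from typing import Dict, Set


class ViewChangeTally:
    """Counts distinct VIEW-CHANGE senders per proposed view number."""

    def __init__(self, cluster_size: int) -> None:
        self.quorum = cluster_size // 2 + 1
        self.votes: Dict[int, Set[str]] = defaultdict(set)

    def record(self, sender: str, proposed_view: int) -> bool:
        """Record one message; return True once a majority backs this view."""
        self.votes[proposed_view].add(sender)  # a set ignores resends
        return len(self.votes[proposed_view]) >= self.quorum
```

Keying the tally by view number keeps messages for different proposed views from being counted together.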

The Normal Phase

In the normal (non-view-change) phase, the primary processes client requests sequentially. When the primary receives a request, it assigns it a monotonically increasing transaction number and forwards a PREPARE message to all backups.

Each backup receives the PREPARE message, persists the operation to its log, and sends an acknowledgment back to the primary. When the primary receives acknowledgments from a majority, it applies the operation to its state machine and responds to the client.

This is similar to how Raft handles log replication, but with a different mechanism for view changes and recovery.

sequenceDiagram
    participant C as Client
    participant P as Primary
    participant B1 as Backup 1
    participant B2 as Backup 2

    C->>P: Request: op(x)
    P->>P: Assign transaction number n=42
    P->>B1: PREPARE(view=5, n=42, op(x))
    P->>B2: PREPARE(view=5, n=42, op(x))
    B1->>P: ACK(view=5, n=42)
    B2->>P: ACK(view=5, n=42)
    P->>P: Apply op(x) to state machine
    P-->>C: Response: OK
    Note over P: Committed: transaction 42

Key points about the normal phase:

The primary assigns a monotonically increasing transaction number (not log index like Raft)
Operations are forwarded to all backups via PREPARE messages
Acknowledgments from a majority of replicas (counting the primary itself) commit the transaction
The primary applies the operation after quorum is reached, not before
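The key points above can be sketched as a toy primary. This is an illustration under simplifying assumptions: the `Primary` class is hypothetical, messages are plain tuples, nothing is persisted, and operations are committed strictly in transaction-number order:

```python
from typing import Dict, List, Set, Tuple


class Primary:
    """Toy primary for one view; networking and durability are omitted."""

    def __init__(self, name: str, view: int, cluster_size: int) -> None:
        self.name = name
        self.view = view
        self.quorum = cluster_size // 2 + 1
        self.next_n = 1
        self.log: Dict[int, str] = {}
        self.acks: Dict[int, Set[str]] = {}
        self.committed = 0            # highest committed transaction number
        self.applied: List[str] = []  # stand-in for the state machine

    def on_request(self, op: str) -> Tuple[str, int, int, str]:
        n = self.next_n
        self.next_n += 1
        self.log[n] = op
        self.acks[n] = {self.name}  # the primary counts toward the majority
        return ("PREPARE", self.view, n, op)

    def on_ack(self, sender: str, view: int, n: int) -> bool:
        """Record an ACK; return True if the commit point advanced."""
        if view != self.view or n not in self.acks:
            return False  # stale view or unknown transaction
        self.acks[n].add(sender)
        advanced = False
        nxt = self.committed + 1
        # Apply only after quorum, and never skip an earlier transaction
        while nxt in self.acks and len(self.acks[nxt]) >= self.quorum:
            self.applied.append(self.log[nxt])
            self.committed = nxt
            nxt += 1
            advanced = True
        return advanced
```

In a 3-node cluster a single backup ACK is enough to commit, because the primary's own copy already counts as one of the two required.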

Handling Failures and Recovery

VSR handles node failures carefully. When a primary fails, some backup becomes the new primary through the view change protocol. When a failed node recovers, it must rejoin the group and catch up on missed operations.

Recovery involves the recovering node sending a RECOVERY message to the current primary. The primary responds with a STATE message containing the current view number and the operations the recovering node missed. This allows the recovering node to replay those operations and rejoin as a full member.

Relationship to Paxos and Raft

VSR shares conceptual territory with both Paxos and Raft but takes a different path. Like Raft, it has a clear notion of leadership within a view. Like Paxos, it uses quorum-based replication and can tolerate asynchronous communication, within limits.

The view change mechanism is more elaborate than Raft’s leader election. VSR separates “who is primary” from “what is committed,” whereas Raft embeds these in its leader and log replication.

VSR is a solid choice for building replicated state machines, but it is less commonly implemented than Raft. If you need a well-tested consensus algorithm with a mature ecosystem, Raft-based implementations (like etcd) are more readily available.

However, if you’re building a system that requires the specific properties of view-stamped replication, or if you’re working with existing VSR-based code, understanding the algorithm helps.

VSR and Zab: The Direct Lineage

The Zab protocol (ZooKeeper Atomic Broadcast) is essentially VSR adapted for ZooKeeper’s specific needs. Understanding this lineage clarifies why ZooKeeper behaves the way it does.

The key similarities:

Both use view numbers to track leadership epochs
Both use a primary-backup model where only the primary can order operations
Both separate “who is primary” from “what is committed”
Both use versions of VIEW-CHANGE/STARTVIEW-equivalent messages to transition leadership

The key adaptations Zab made:

Aspect	VSR	Zab
Recovery	Full state replay via STATE messages	Discovery phase + state sync
Epoch concept	View number	Epoch (proposal) number
Commitment	Majority ACK to primary	Quorum of acknowledgments to leader
Abort condition	If a quorum cannot be reached	If quorum is lost, suspend operations

When you debug ZooKeeper leadership issues, the view-change concepts from VSR map directly. ZooKeeper’s zxid (transaction ID) carries the epoch number and transaction counter, much like VSR’s view number plus transaction number.
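ZooKeeper packs the zxid into a single 64-bit value, with the epoch in the high 32 bits and the counter in the low 32 bits. The helper names below are made up for illustration and are not part of any ZooKeeper API:

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch: int, counter: int) -> int:
    """Combine an epoch and a per-epoch counter into one 64-bit id."""
    if not 0 <= counter <= COUNTER_MASK:
        raise ValueError("counter does not fit in 32 bits")
    return (epoch << COUNTER_BITS) | counter


def split_zxid(zxid: int) -> tuple:
    """Return (epoch, counter) for a packed zxid."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK
```

Because the epoch occupies the high bits, any zxid from a newer epoch compares greater than every zxid from an older one, which mirrors how a higher view number dominates in VSR.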

Comparison to Other Approaches

For more context on consensus in distributed systems, see my posts on CAP Theorem and Consistency Models. These explore the broader landscape of trade-offs in distributed systems.

The Two-Phase Commit post discusses a different coordination approach that is simpler but less fault-tolerant.

VSR vs Raft vs Paxos

Aspect	VSR	Raft	Paxos
Leader Concept	Primary within view	Single strong leader	Optional distinguished proposer
Reconfiguration	View changes with majority	Joint consensus	Complex, requires extra care
Recovery Complexity	RECOVERY/STATE messages	Log truncation + snapshot	Depends on implementation
Commitment	Majority of acknowledgments	Majority of AppendEntries	Majority of Accept messages
Log Structure	Transaction numbers	Log indices with terms	Proposal numbers
Node State	View number + status	Current term + votedFor	Promised proposal numbers

Aspect	VSR	Raft	Paxos
Understandability	Moderate	High (designed for clarity)	Low (proof-heavy)
Performance	Comparable to Raft	Optimized for throughput	Lower due to 2 phases
Influence	CORFU, Zab	etcd, CockroachDB, TiKV	Chubby, Spanner
Formal Proof	Yes	Yes	Yes (classical)
Year Published	1988	2014	1998 (first paper)

Membership Changes and Reconfiguration

VSR handles cluster membership changes through the view change mechanism. When nodes need to be added or removed, the system progresses through a new view that includes the updated membership configuration.

The reconfiguration procedure follows these steps:

sequenceDiagram
    participant O as Old Configuration
    participant L as Leader
    participant N as New Node
    participant F as Followers

    Note over O,L: Client requests configuration change
    L->>L: Prepare reconfiguration view v+1
    L->>N: STARTVIEW(v+1, new_config)
    L->>F: STARTVIEW(v+1, new_config)
    Note over N: New node catches up via STATE messages
    N-->>L: ACK(v+1)
    F-->>L: ACK(v+1)
    Note over L: Quorum reached in view v+1
    Note over L,N,F: New configuration active

Key points about VSR reconfiguration:

View-based approach — Membership changes are embedded in view changes, giving a natural mechanism for transitioning between configurations
Catch-up via STATE messages — New nodes receive the current state from the primary before participating in consensus
No joint consensus required — Unlike Raft’s two-phase joint consensus, VSR transitions in a single view change once the new node catches up
Transition safety — Quorum requirements prevent two conflicting configurations from operating simultaneously

The primary sends STATE messages containing the view number, committed transaction number, and the recent log entries the new node needs to replay. Once the new node acknowledges receipt and catches up, it becomes a full voting member.

Recovery and Checkpointing Procedure

VSR nodes can fail and recover, requiring them to rejoin the cluster with the correct state. The recovery procedure uses the RECOVERY and STATE message exchange.

sequenceDiagram
    participant R as Recovering Node
    participant P as Primary
    participant B as Backup

    R->>P: RECOVERY(v=5, n_recovered)
    Note over P: Current view = 5, committed_n = 100
    P->>R: STATE(view=5, committed=100, operations=[...])
    Note over R: Replay operations n_recovered+1 to 100
    R->>R: State machine up-to-date
    Note over R: Rejoin as full member

The recovery procedure:

RECOVERY message — The recovering node sends its last known view number and last recovered transaction number to the current primary
STATE response — The primary responds with the current view, the highest committed transaction number, and the logged operations the recovering node missed
State reconstruction — The recovering node replays the missing operations to its state machine, bringing it back up to date
Return to service — Once caught up, the node resumes normal operation and can participate in quorum decisions
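The four steps above can be sketched as follows. The dictionary layout of the STATE message, the `build_state` helper, and the `RecoveringNode` class are assumptions for illustration; a real implementation would also authenticate the responder and handle a view change arriving mid-recovery:

```python
from typing import Dict, List


def build_state(view: int, log: Dict[int, str], committed: int,
                n_recovered: int) -> dict:
    """Primary side: collect the operations after n_recovered."""
    missed = [(n, log[n]) for n in range(n_recovered + 1, committed + 1)]
    return {"view": view, "committed": committed, "operations": missed}


class RecoveringNode:
    def __init__(self, n_recovered: int) -> None:
        self.view = 0
        self.last_applied = n_recovered
        self.applied: List[str] = []  # stand-in for the state machine
        self.status = "recovering"

    def on_state(self, state: dict) -> None:
        for n, op in state["operations"]:
            if n != self.last_applied + 1:
                raise ValueError("gap in replayed operations")
            self.applied.append(op)
            self.last_applied = n
        self.view = state["view"]
        self.status = "normal"  # caught up: rejoin as a full member
```

A node that had applied up to transaction 97 and recovers against a primary at 100 replays exactly three operations.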

Checkpointing in VSR

Checkpointing (snapshotting) in VSR serves the same purpose as in Raft: bounding log growth for long-running clusters. The checkpoint contains:

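import hashlib

def compute_hash(state: bytes) -> str:
    # SHA-256 hex digest of the serialized snapshot (assumes state is bytes)
    return hashlib.sha256(state).hexdigest()

class Checkpoint:
    def __init__(self, view_number, last_transaction, state):
        self.view_number = view_number
        self.last_transaction = last_transaction  # Last committed transaction
        self.state = state                        # Serialized state machine snapshot
        self.checksum = compute_hash(state)       # Integrity verification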

Checkpoint creation:

The primary periodically creates checkpoints of the application state machine
Checkpoints are triggered based on transaction count thresholds (e.g., every 10,000 transactions)
Log entries up to the checkpoint’s last_transaction can be discarded

Checkpoint recovery:

A recovering node receives the latest checkpoint via STATE message
The checkpoint includes enough information to replay from that point
Nodes verify checkpoint integrity via the embedded checksum

This approach means:

Memory usage stays bounded even with high transaction rates
Recovery time stays bounded by the checkpoint interval, not total transaction history
Network transfer during recovery stays proportional to checkpoint size, not full history
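As a rough sketch of the threshold trigger and log truncation, assuming the log is a dictionary keyed by transaction number. The `maybe_checkpoint` function is hypothetical, and actually capturing the snapshot is left out:

```python
from typing import Dict


def maybe_checkpoint(log: Dict[int, str], committed: int,
                     last_checkpoint: int, interval: int = 10_000) -> int:
    """Truncate the log if enough transactions have committed.

    Returns the transaction number covered by the newest checkpoint.
    """
    if committed - last_checkpoint < interval:
        return last_checkpoint  # threshold not reached yet
    # The snapshot would be taken here; entries it covers can be dropped
    for n in [n for n in log if n <= committed]:
        del log[n]
    return committed
```

Only entries at or below the committed point are dropped, so operations still waiting for a quorum stay in the log.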

Production Failure Scenarios

Network Partition During View Change

A network partition during a view change can leave two nodes both thinking they are primary. VSR prevents split-brain through its quorum requirement for establishing any new view.

Here is how the race plays out. Suppose primary P is on side A of the partition, and backup B on side B times out on P and triggers a view change. B increments its local view number and broadcasts VIEW-CHANGE messages to all replicas. If B can reach a majority, it collects enough VIEW-CHANGE messages to form a quorum. It then sends a DOVIEWCHANGE message carrying its log to the new primary, which is determined by the new view's replica ordering and may be B itself.

Once the new primary has heard from a majority, it announces the new view with STARTVIEW and becomes the legitimate primary. It starts accepting and ordering client requests. P on side A of the partition cannot receive VIEW-CHANGE messages from any majority. It may keep receiving client requests, but it cannot commit them because it cannot gather acknowledgments from a quorum of backups. It stalls.

What stops false commits is the quorum overlap property. Any two successive views that both reach quorum must share at least one node that voted in both. This means at most one configuration of nodes can certify a new primary at any given view number. A partitioned node trying to act as primary will fail to gather quorum acknowledgments and cannot drive transactions to commit.
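The overlap property is easy to check by brute force for small clusters. The helper names below are illustrative. The sketch enumerates every majority-sized subset and confirms that each pair shares a node:

```python
from itertools import combinations
from typing import Iterable, List, Set


def majorities(nodes: Iterable[str]) -> List[Set[str]]:
    """Every subset that contains a strict majority of the nodes."""
    members = list(nodes)
    quorum = len(members) // 2 + 1
    return [set(group)
            for size in range(quorum, len(members) + 1)
            for group in combinations(members, size)]


def all_quorums_intersect(nodes: Iterable[str]) -> bool:
    groups = majorities(nodes)
    return all(a & b for a in groups for b in groups)
```

The check passes for any cluster size, because two groups that each hold more than half of the nodes cannot be disjoint.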

Client requests sent to the partitioned primary will eventually time out. Clients should retry against the new primary once they detect the timeout, using their original transaction number if known. The new primary must deduplicate based on client ID and transaction number to avoid double-applying operations.
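One way to sketch the deduplication step is a per-client table of the latest request and its cached result. The `DedupTable` class and the use of a per-client `request_number` are assumptions here, and the sketch assumes each client has at most one request outstanding:

```python
from typing import Any, Callable, Dict, Optional, Tuple


class DedupTable:
    """Remembers the latest request number and result per client."""

    def __init__(self) -> None:
        self.latest: Dict[str, Tuple[int, Any]] = {}

    def execute(self, client_id: str, request_number: int,
                apply: Callable[[], Any]) -> Optional[Any]:
        seen = self.latest.get(client_id)
        if seen is not None:
            last_number, last_result = seen
            if request_number == last_number:
                return last_result  # retry: reply again, do not re-apply
            if request_number < last_number:
                return None         # stale duplicate: ignore it
        result = apply()
        self.latest[client_id] = (request_number, result)
        return result
```

For this to survive a view change, the table has to be part of the replicated state, so that the new primary inherits it along with the log.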

The partition is detected through timeout on the backup side. The backup does not need to know whether the primary is truly dead or just unreachable—it treats silence as failure and initiates the view change protocol. This means VSR tolerates asynchronous networks as long as messages are eventually delivered and the timeout is set appropriately.

Cascading Node Failures

When multiple nodes fail in quick succession, VSR can reach a state where no majority of nodes remains reachable. The cluster becomes unavailable for writes—a deliberate trade-off that favors correctness over availability.

The failure threshold depends on cluster size. In a 5-node cluster, any 2 nodes can fail and the remaining 3 still form a majority. In a 3-node cluster, a single failure leaves only 2 nodes, which is still a majority. But if a second node fails before the first recovers, the cluster is left with 1 node—below majority—and cannot make progress.
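The arithmetic behind those numbers, as a small sketch with illustrative function names:

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail while a majority still remains."""
    return (cluster_size - 1) // 2


def can_commit(cluster_size: int, reachable: int) -> bool:
    return reachable >= majority(cluster_size)
```

A 4-node cluster tolerates only one failure, the same as a 3-node cluster, which is why odd cluster sizes are the usual choice.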

VSR has no automatic recovery path when a majority cannot be formed. The algorithm waits for one of the failed nodes to recover and rejoin. Once a recovering node completes its STATE synchronization with the current primary and rejoins the cluster, a majority can be re-established and normal operation resumes.

During the unavailable period, client requests are rejected or time out. There is no speculative or degraded mode that allows partial writes. This is the same safety trade-off Raft and Paxos make: if you cannot confirm that a majority of nodes have acknowledged an operation, you cannot safely commit it.

The vulnerability window for cascading failures depends on detection time. Longer view-change timeouts keep a failed primary appearing alive to its backups. Setting timeouts too short causes spurious view changes during temporary load spikes. Setting them too long delays failure detection. Most production deployments use heartbeat intervals in the 100ms to 1s range with a few missed heartbeats before triggering a view change.
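A minimal sketch of the timeout rule described above, with the clock passed in as an integer millisecond value so the logic is easy to test. The `FailureDetector` class and its parameters are hypothetical:

```python
from typing import Optional


class FailureDetector:
    """Suspects the primary after several missed heartbeats."""

    def __init__(self, heartbeat_interval_ms: int, missed_allowed: int) -> None:
        self.timeout_ms = heartbeat_interval_ms * missed_allowed
        self.last_heard_ms: Optional[int] = None

    def on_heartbeat(self, now_ms: int) -> None:
        self.last_heard_ms = now_ms

    def suspects_primary(self, now_ms: int) -> bool:
        if self.last_heard_ms is None:
            return False  # nothing heard yet, so nothing to time out
        return now_ms - self.last_heard_ms > self.timeout_ms
```

With a 200 ms heartbeat and three missed beats allowed, a backup starts a view change a little over 600 ms after the last heartbeat it saw.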

Administrators should track the number of healthy nodes in each view and alert as soon as the cluster is one failure away from losing its majority. Some implementations add a witness or observer node type that participates in view changes without storing data, which raises the number of failures the cluster can tolerate without adding replication overhead.

Checkpoint Corruption

A recovering node that gets a corrupted checkpoint from the primary risks replaying incorrect state to its state machine. VSR handles this with a checksum embedded in each checkpoint. The receiving node verifies integrity before applying anything to its local state.

When a node initiates recovery, it asks the current primary for the latest checkpoint via the RECOVERY message exchange. The primary responds with a STATE message containing the checkpoint object, which includes the view number at the time of the checkpoint, the last committed transaction number, a serialized snapshot of the application state, and a checksum computed over the state data. The recovering node recomputes the checksum over the received state and compares it against the embedded value. A mismatch means something went wrong.

On detecting corruption, the recovering node has two options. It can ask for the checkpoint again from the primary, in case the error happened during network transmission. Or it can ask a different replica that was part of the majority in the view where the checkpoint was created. If all replicas return corrupted checkpoints, the node cannot recover through VSR alone and needs administrative intervention—potentially restoring from an external backup or rebuilding the state from the transaction log if enough replicas are available.

The checksum does not tell you whether the corruption is partial (some bytes flipped) or complete (checkpoint truncated). Both cases trigger the same recovery path. This is why VSR implementations tend to use cryptographic hashes like SHA-256 rather than simpler checksums that might miss certain error patterns.
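The verify-then-fall-back logic might look like the sketch below, using SHA-256 from Python's standard `hashlib`. The `fetch_verified` function and the idea of passing replicas as fetch callables are assumptions made for illustration:

```python
import hashlib
from typing import Callable, Iterable, Tuple

# A fetch callable returns (serialized_state, checksum_hex) from one replica
Fetch = Callable[[], Tuple[bytes, str]]


def is_intact(state: bytes, checksum_hex: str) -> bool:
    return hashlib.sha256(state).hexdigest() == checksum_hex


def fetch_verified(sources: Iterable[Fetch],
                   attempts_per_source: int = 2) -> bytes:
    """Try each replica in order, retrying before moving to the next."""
    for fetch in sources:
        for _ in range(attempts_per_source):
            state, checksum = fetch()
            if is_intact(state, checksum):
                return state
    raise RuntimeError("no replica returned an intact checkpoint")
```

Listing the primary first and other replicas after it reproduces the two options above, and the final exception is the point where administrative intervention takes over.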

Checkpoint corruption during normal operation is not a concern in VSR’s design. Checkpoints are created by the primary and only transferred to recovering nodes. If the primary’s checkpoint becomes corrupted, clients or other replicas might eventually detect the corrupted state in normal operation, but VSR does not give the primary a way to verify its own checkpoint. Production deployments usually pair checkpoint integrity checks with external monitoring and periodic health checks on the primary’s state.

View Number Overflow

View numbers in VSR are stored as integers, and theoretically they can overflow if a cluster runs long enough with frequent view changes. For most deployments this is not a realistic concern, but the mechanism is worth knowing.

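VSR view numbers typically use 64-bit integers. Even with a view change every second, a 64-bit counter would not overflow for hundreds of billions of years (2^64 seconds is roughly 585 billion years). A 32-bit counter at the same rate would last about 136 years, which is still far beyond the lifetime of any realistic cluster. In practice, deployments use 32-bit or 64-bit integers without special overflow protection, relying on the practical infeasibility of overflow rather than explicit defenses.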
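The estimate is simple to reproduce. The function below is an illustrative helper that assumes a constant view-change rate and a 365.25-day year:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_until_overflow(bits: int, changes_per_second: float = 1.0) -> float:
    """Years before an unsigned counter of the given width wraps around."""
    return (2 ** bits) / changes_per_second / SECONDS_PER_YEAR
```

At one view change per second, a 32-bit counter lasts about 136 years and a 64-bit counter about 5.8e11 years.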

The overflow concern ties into the reconfiguration mechanism. When an administrator wants to reset view numbers as a precaution or because they are approaching a practical limit, they perform a reconfiguration that effectively starts fresh. The reconfiguration creates a new view with the updated membership configuration, and the new view’s number can be set to any value the administrator chooses, as long as it is higher than any view number existing nodes have seen. This gives administrators a controlled way to reset the counter without disrupting the cluster.

For long-running deployments that expect extremely high view change rates, some VSR-inspired systems add explicit overflow guards. These usually take the form of alerts when the view number exceeds a configurable threshold, such as 90% of the maximum value for the integer type, giving administrators advance notice to plan a reconfiguration.

A more practical concern than overflow is state synchronization after a reconfiguration. When a new node joins or an existing node recovers, it receives the current view number via the STATE message. If the node is far behind, the primary may need to send a large number of missed operations. Checkpointing bounds this: if checkpoints are created every 10,000 transactions, the node loads the latest checkpoint and replays at most 10,000 logged operations, regardless of how long it was down.

Interview Questions

1. What is a "view" in View-Stamped Replication and what two pieces of state does each node maintain?

A view in VSR consists of a view number and an ordered list of replicas. Each node maintains two key pieces of state: the view number (indicating their current understanding of who is primary) and their status (whether they are normal, recovering, or replaced).

2. How does VSR handle a primary failure and subsequent view change?

When a backup detects that the primary is unresponsive via timeout, it increments its local view number and sends VIEW-CHANGE messages to all replicas. Once a node receives VIEW-CHANGE messages from a majority of replicas, it sends a DOVIEWCHANGE message carrying its log to the new primary. The new primary waits for a majority of these, adopts the most up-to-date log, and broadcasts STARTVIEW to complete the transition.

3. How does the normal phase work in VSR?

In the normal phase, the primary receives client requests and assigns each a monotonically increasing transaction number. It forwards a PREPARE message to all backups. Each backup persists the operation to its log and sends an acknowledgment back. When the primary receives acknowledgments from a majority, it applies the operation to its state machine and responds to the client.

4. How does VSR's recovery procedure work when a node rejoins the cluster?

The recovering node sends a RECOVERY message to the current primary with its last known view number and last recovered transaction number. The primary responds with a STATE message containing the current view number, the highest committed transaction number, and the logged operations the recovering node missed. The recovering node replays those operations and rejoins as a full member.

5. How does checkpointing work in VSR and why is it needed?

The primary periodically creates checkpoints of the application state machine based on transaction count thresholds (e.g., every 10,000 transactions). Log entries up to the checkpoint's last_transaction can then be discarded. This bounds memory usage, recovery time, and network transfer during recovery.

6. How does VSR handle membership changes and reconfiguration?

VSR embeds membership changes in view changes. When nodes need to be added or removed, the system progresses through a new view that includes the updated membership configuration. The primary sends STATE messages containing the view number, committed transaction number, and recent log entries. Once the new node catches up, it becomes a full voting member.

7. What is the key difference between VSR and Raft in terms of leader concept and reconfiguration?

VSR uses "primary within a view" as its leader concept, while Raft uses a single strong leader. For reconfiguration, VSR transitions via single view changes with majority quorum once the new node catches up. Raft requires a two-phase joint consensus approach.

8. What systems were influenced by VSR?

CORFU used VSR as a foundation for distributed shared log storage. Apache ZooKeeper's Zab protocol borrows heavily from VSR's view-change approach. VSR also influenced the thinking behind Raft, which was developed about 25 years later.

9. How does VSR's transaction numbering differ from Raft's log indexing?

VSR assigns monotonically increasing transaction numbers to operations, which identify the order of operations being committed. Raft uses log indices with terms—each entry has a term number that helps with leader election and consistency checks.

10. What are the node state components in VSR compared to Raft?

In VSR, node state is the view number plus status (normal, recovering, replaced). In Raft, node state includes the current term number and votedFor field. Both approaches track leadership and consensus participation, but through different mechanisms.

11. How does VSR's commitment mechanism work?

VSR commits a transaction once a majority of replicas, counting the primary itself, have acknowledged it. The primary assigns a transaction number and forwards PREPARE messages. Once quorum is reached, the primary applies the operation to its state machine and then responds to the client.

12. Why is VSR's separation of "who is primary" from "what is committed" significant?

This separation gives VSR flexibility in how leadership and consensus are decoupled. The view change mechanism handles who should be primary independently from the committed state, whereas Raft embeds these concepts together in its leader and log replication model.

13. How does VSR tolerate asynchronous network conditions compared to Raft?

VSR can tolerate asynchronous conditions during normal operation as long as messages eventually get delivered and a majority can be reached. However, view changes require synchrony—a node cannot establish a new view if it cannot collect VIEW-CHANGE messages from a majority. This is similar to Raft's requirement that leader election requires a majority of reachable nodes.

14. What happens if a primary crashes after sending PREPARE but before committing?

If the primary crashes after broadcasting PREPARE messages but before receiving a majority of acknowledgments, no commitment occurs. The new primary (elected in the subsequent view change) will have access to the prepared but uncommitted operation in its log. It can either re-prepare the operation (send PREPARE again with the same transaction number) or discard it depending on implementation choices.

15. Why does VSR use transaction numbers instead of log indices?

Transaction numbers provide a simpler way to track operation ordering across view changes. Since views have ordered replica lists, the transaction number alone can identify "who committed what when" without needing to track term numbers, log indices, and snapshot state like Raft does. This simplifies the recovery procedure.

16. How does VSR's view change compare to Raft's leader election in terms of complexity?

VSR's view change is more elaborate—it requires multiple rounds of messages (VIEW-CHANGE to all, then STARTVIEW from the new primary). Raft's leader election is simpler: nodes increment their term, vote for themselves, and the candidate with most votes wins. However, VSR separates the "who is primary" question from "what is committed," giving more flexibility.

17. What are the advantages of VSR's RECOVERY/STATE message exchange over Raft's snapshot transfer?

RECOVERY/STATE allows incremental catch-up based on the recovering node's last known state. The primary sends only the operations the node missed, reducing network transfer. Raft typically transfers full snapshots for nodes that are too far behind. VSR's approach is more efficient when the gap is small but requires the primary to track per-node recovery state.

18. How does Zab differ from VSR in handling epoch/leadership transitions?

Zab uses epochs (similar to VSR's view numbers) but with a discovery phase where the leader learns the current state from followers before proposing. VSR transitions more directly—once a quorum of VIEW-CHANGE messages is received, the new primary is established immediately. Zab's discovery phase adds safety but increases latency for leadership transitions.

19. What constraints does VSR place on quorum configuration?

VSR requires that any two quorums share at least one node, which majority quorums over a fixed membership guarantee. View changes depend on receiving VIEW-CHANGE messages from a majority, and this overlap ensures that the new view learns about every committed operation. If too few nodes remain reachable to form a majority (e.g., more failures than the cluster size tolerates), the system becomes unavailable for writes.

20. How would you implement a client retry mechanism in VSR?

When a client sends a request and doesn't receive a response within a timeout, it should resend the request to the primary with its original transaction number (if known) and client ID. The primary must track which requests have already been applied to prevent duplicate execution. Idempotent operations can be safely retried; non-idempotent operations require careful tracking of client sequence numbers.

Conclusion

View-Stamped Replication is a consensus algorithm worth knowing about. It predates Raft, is closely related to both Paxos and Raft, and provides a coherent approach to achieving consensus through views and quorum-based replication.

If you’re interested in the history and diversity of consensus algorithms, VSR is worth studying. It shows that there are many valid approaches to the same fundamental problem.