Raft Consensus Algorithm

Raft is a consensus algorithm designed to be understandable, decomposing the problem into leader election, log replication, and safety.

Reading time: 10 min read · Updated: March 24, 2026

Introduction

Raft was designed with a specific goal: be understandable. Diego Ongaro and John Ousterhout published the algorithm in 2014, explicitly stating that their aim was to create a consensus algorithm that practitioners could actually reason about. Their user study showed that people understood Raft better than Paxos, which matters because consensus algorithms are notoriously difficult to implement correctly.

The algorithm breaks the consensus problem into three independent subproblems: leader election, log replication, and safety. This decomposition makes it easier to reason about and implement than Paxos.

Leader Election

At any given time, each node in a Raft cluster is in one of three states: leader, follower, or candidate. All nodes start as followers. If a follower receives no communication from a leader within an election timeout, it becomes a candidate and initiates an election.

stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate: Election timeout
    Candidate --> Follower: Another node wins
    Candidate --> Leader: Receives votes from majority
    Leader --> Follower: Higher term discovered
    Candidate --> Follower: Higher term discovered

    state Leader {
        [*] --> Active
        Active --> Active: Heartbeat received
    }

To win an election, a candidate must receive votes from a majority of nodes. Each node votes for at most one candidate per term, and nodes vote for the first candidate that requests their vote if that candidate’s log is at least as up-to-date as their own.
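The "at least as up-to-date" comparison can be sketched in Go. This is a minimal illustration, not taken from any particular implementation; the function name and parameters are invented for clarity:

```go
package main

import "fmt"

// candidateLogUpToDate implements Raft's voting restriction: a candidate's
// log is at least as up-to-date if its last entry has a higher term, or
// the same term and an equal-or-greater index.
func candidateLogUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex int) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm
	}
	return candLastIndex >= voterLastIndex
}

func main() {
	// Candidate's last entry: term 3, index 7. Voter's: term 3, index 9.
	fmt.Println(candidateLogUpToDate(3, 7, 3, 9)) // false: the voter's log is longer
	fmt.Println(candidateLogUpToDate(4, 2, 3, 9)) // true: a higher last term wins
}
```

Note that term is compared first: a shorter log with a higher last term still qualifies, because term dominates index in the up-to-date ordering.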

The election timeout is randomized within a range (typically 150-300ms). This randomness breaks ties and ensures that split votes are rare. When they do happen, nodes wait for a new timeout before trying again.

Log Replication

Once a leader is elected, it begins replicating log entries to followers. Clients send commands to the leader, which appends them to its log and then sends AppendEntries RPCs to all followers in parallel. Once a majority of nodes have written the entry to their logs, the leader marks it committed, applies it to its state machine, and returns the result to the client.

The log is structured as a sequence of entries, each containing a term number and a command. The term numbers serve as logical clocks, helping nodes identify which entries are stale or conflicting.
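The commit rule above can be sketched as follows. This is illustrative only; a real implementation must also verify that the entry at the new commit index belongs to the leader's current term before committing (Raft §5.4.2):

```go
package main

import (
	"fmt"
	"sort"
)

// advanceCommitIndex returns the highest index replicated on a majority,
// counting the leader itself. matchIndex holds, per follower, the highest
// log index known to be replicated on that follower.
func advanceCommitIndex(matchIndex []int, leaderLastIndex int) int {
	all := append([]int{leaderLastIndex}, matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(all)))
	// The median of the descending-sorted indices is the largest index
	// held by a majority of the cluster.
	return all[len(all)/2]
}

func main() {
	// 5-node cluster: leader at index 7; followers at 7, 5, 5, 3.
	fmt.Println(advanceCommitIndex([]int{7, 5, 5, 3}, 7)) // 5
}
```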

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    C->>L: Command: x=1
    L->>L: Append to local log
    L->>F1: AppendEntries(term=3, entry=x:1)
    L->>F2: AppendEntries(term=3, entry=x:1)
    F1-->>L: Success
    F2-->>L: Success
    L->>L: Apply to state machine
    L-->>C: OK
    Note over L: Committed: entry x=1

Safety

Raft’s safety guarantee is straightforward: if a command has been applied to the state machine on any node, no other node may apply a different command for the same log index. This is enforced through the voting mechanism: nodes refuse to vote for a candidate whose log is less up-to-date than their own, so any elected leader is guaranteed to hold every committed entry.

When a leader crashes, some of its log entries may not have been fully replicated to all followers. The new leader must bring all followers up to date. This process can involve sending missing entries or truncating conflicting entries from followers’ logs.
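The follower-side consistency check and truncation can be sketched like this. It is simplified: indexing starts at 0, and conflicting suffixes are always truncated, whereas a careful implementation truncates only on an actual term conflict to stay idempotent:

```go
package main

import "fmt"

type LogEntry struct {
	Term int // term in which the entry was created
}

// appendEntries applies Raft's consistency check: the follower accepts new
// entries only if it has an entry at prevLogIndex whose term is
// prevLogTerm. Everything after the match point is replaced.
func appendEntries(log []LogEntry, prevLogIndex, prevLogTerm int, entries []LogEntry) ([]LogEntry, bool) {
	if prevLogIndex >= 0 {
		if prevLogIndex >= len(log) || log[prevLogIndex].Term != prevLogTerm {
			return log, false // reject: leader will back up and retry
		}
	}
	// Truncate the conflicting suffix, then append the new entries.
	log = append(log[:prevLogIndex+1], entries...)
	return log, true
}

func main() {
	followerLog := []LogEntry{{Term: 1}, {Term: 2}}
	followerLog, ok := appendEntries(followerLog, 0, 1, []LogEntry{{Term: 3}})
	fmt.Println(ok, followerLog[1].Term) // true 3: the term-2 entry was overwritten
}
```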

Membership Changes

Raft handles cluster membership changes through joint consensus. When you need to add or remove nodes, the cluster transitions through a joint configuration in which decisions require majorities of both the old and the new membership before the transition completes. This prevents the scenario where two disjoint majorities could form and make conflicting decisions.

The joint consensus approach adds complexity but guarantees safety during transitions. You add nodes incrementally, giving the system time to replicate data and adjust quorums.

AppendEntries RPC

The AppendEntries RPC is how Raft’s leader replicates log entries to followers. Here’s the core structure:

type LogEntry struct {
    Term    int    // Term in which the entry was created
    Command []byte // Opaque state machine command
}

type AppendEntriesArgs struct {
    Term         int        // Leader's current term
    LeaderID     int        // So followers can redirect clients
    PrevLogIndex int        // Index of log entry immediately before new ones
    PrevLogTerm  int        // Term of PrevLogIndex entry
    Entries      []LogEntry // Log entries to store (empty for heartbeat)
    LeaderCommit int        // Leader's commitIndex
}

type AppendEntriesReply struct {
    Term    int  // Current term, for leader to update itself
    Success bool // True if follower contained entry matching PrevLogIndex/PrevLogTerm

    // Optimization: tell leader who to backtrack to
    ConflictIndex int // First index conflict starts at
    ConflictTerm  int // Term at ConflictIndex
}

A heartbeat is just an AppendEntries with an empty Entries slice. Followers expect heartbeats within the election timeout range (typically 150-300ms) to confirm the leader is still active.
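One common way for the leader to use the conflict hints is sketched below: instead of decrementing nextIndex one entry at a time, it jumps past the entire conflicting term. This is a sketch of the optimization mentioned in the reply struct; actual implementations vary in the details:

```go
package main

import "fmt"

// nextIndexAfterConflict computes where the leader should retry after a
// rejected AppendEntries, using the follower's conflict hints.
// leaderTerms[i] is the term of the leader's log entry at index i.
func nextIndexAfterConflict(leaderTerms []int, conflictIndex, conflictTerm int) int {
	// If the leader also has entries from conflictTerm, retry just past
	// its last entry of that term; otherwise skip the whole term.
	for i := len(leaderTerms) - 1; i >= 0; i-- {
		if leaderTerms[i] == conflictTerm {
			return i + 1
		}
	}
	return conflictIndex
}

func main() {
	// Leader's log terms by index: [1, 1, 2, 2, 3]
	fmt.Println(nextIndexAfterConflict([]int{1, 1, 2, 2, 3}, 2, 2)) // 4: leader has term 2, ends at index 3
	fmt.Println(nextIndexAfterConflict([]int{1, 1, 3, 3, 3}, 2, 2)) // 2: leader lacks term 2 entirely
}
```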

Joint Consensus Membership Change

When adding or removing nodes, Raft uses a two-phase approach:

sequenceDiagram
    participant L as Leader
    participant O1 as Old Node 1
    participant O2 as Old Node 2
    participant N1 as New Node 3

    rect rgb(40, 40, 60)
        Note over L,N1: Phase 1: Joint Configuration
        L->>O1: C_old + C_new (joint config)
        L->>O2: C_old + C_new (joint config)
        L->>N1: C_old + C_new (joint config)
        O1-->>L: OK
        O2-->>L: OK
        N1-->>L: OK (catching up)
        Note over L: Committed joint config
    end

    rect rgb(40, 60, 40)
        Note over L,N1: Phase 2: New Configuration Only
        L->>O1: C_new only
        L->>O2: C_new only
        L->>N1: C_new only
        O1-->>L: OK
        O2-->>L: OK
        N1-->>L: OK
        Note over L: New config committed
    end

The key steps:

  1. Leader receives request to add/remove node(s)
  2. Leader creates joint configuration C_old + C_new
  3. Leader replicates joint config to all nodes
  4. Once joint config is committed, leader creates C_new config
  5. New config is replicated; old nodes outside new config can shut down
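The quorum rule that makes the joint phase safe can be sketched directly: during joint consensus, an entry commits only when acknowledged by a majority of *each* configuration. Names here are illustrative:

```go
package main

import "fmt"

// majority reports whether the acknowledging node IDs form a majority
// of the given configuration.
func majority(acks map[int]bool, cfg []int) bool {
	n := 0
	for _, id := range cfg {
		if acks[id] {
			n++
		}
	}
	return n > len(cfg)/2
}

// jointQuorum requires a majority in BOTH the old and new configurations,
// as joint consensus demands during a membership change.
func jointQuorum(acks map[int]bool, oldCfg, newCfg []int) bool {
	return majority(acks, oldCfg) && majority(acks, newCfg)
}

func main() {
	oldCfg := []int{1, 2, 3}
	newCfg := []int{1, 2, 3, 4, 5}
	acks := map[int]bool{1: true, 2: true, 4: true}
	// Majority of old (2/3) and of new (3/5): commit is allowed.
	fmt.Println(jointQuorum(acks, oldCfg, newCfg)) // true
}
```

Because both majorities must intersect any previously committed decision, no disjoint pair of quorums can form while the membership is in flux.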

Why Raft Wins in Practice

Raft’s emphasis on understandability has paid off. The algorithm has been implemented in many projects, including etcd (which backs Kubernetes), CockroachDB, and TiKV. The clear separation of concerns makes it easier to debug and verify.

Compared to Paxos, Raft has a stronger notion of leadership. A single leader processes all requests and coordinates replication, which simplifies the protocol at the cost of concentrating load on one node. When the leader fails, a new election happens quickly.

Performance Characteristics

Raft’s performance is dominated by the leader. It can process hundreds to thousands of commands per second depending on network latency and the number of followers. Write throughput scales with the leader’s ability to send RPCs in parallel to followers.

Read throughput can be improved by serving reads from followers in read-heavy workloads, but only if you accept stale reads. Linearizable reads require the leader to confirm it is still the current leader, either through a lease or by exchanging heartbeats with a quorum before answering.

Performance Bottleneck Table

| Component | Latency Contributor | Bottleneck |
|---|---|---|
| Leader election | Clock drift + election timeout | Worst-case ~election timeout |
| Log replication | Disk fsync on followers | Disk I/O is the ceiling |
| Snapshotting | State machine serialization | CPU and memory |
| Commit latency | Majority ack + disk write | fsync latency |
| RPC round-trip | Network | Geographic distance |

Performance Reference Table

| Component | Typical Value | What Affects It |
|---|---|---|
| Election timeout | 150-300ms random | Network jitter, load |
| Heartbeat interval | 50ms | Throughput vs sensitivity tradeoff |
| Log replication latency | 1-2 RTTs | Network RTT, follower catch-up speed |
| Commit latency | 10-50ms | Replication factor, network latency |
| Snapshot size | Varies | Application state size, snapshot frequency |
| Recovery time | Election timeout + catch-up | How far behind the new leader is |

The leader can become a bottleneck under heavy write load. etcd mitigates this with batching and pipelining: packing multiple log entries into a single AppendEntries RPC reduces per-entry overhead. CockroachDB serves consistent reads from the range leaseholder, avoiding a full Raft round trip, and offers follower reads for workloads that can tolerate slightly stale data.
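The batching idea is simple enough to sketch: accumulate pending commands and ship them in chunks. This is an illustration of the general technique, not etcd's actual code:

```go
package main

import "fmt"

// batchEntries groups pending commands into batches of at most maxBatch
// entries, so each AppendEntries RPC amortizes its fixed per-RPC cost
// (syscalls, network framing, one follower fsync) over many entries.
func batchEntries(pending []string, maxBatch int) [][]string {
	var batches [][]string
	for len(pending) > 0 {
		n := maxBatch
		if len(pending) < n {
			n = len(pending)
		}
		batches = append(batches, pending[:n])
		pending = pending[n:]
	}
	return batches
}

func main() {
	b := batchEntries([]string{"x=1", "y=2", "z=3", "w=4", "v=5"}, 2)
	fmt.Println(len(b)) // 3 batches: 2 + 2 + 1 entries
}
```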

Relationship to Other Algorithms

Raft and Paxos achieve the same safety guarantees through different means. Paxos is more abstract and general; Raft is more prescriptive and practical. If you want to understand the theoretical foundations, I recommend reading about the CAP Theorem and how consensus algorithms navigate the consistency-availability trade-off.

Raft also relates closely to leader election concepts. My post on Leader Election explores different approaches to selecting leaders in distributed systems.

Pre-Vote Mechanism

A partitioned node can disrupt the cluster when it rejoins: while cut off, it repeatedly times out and increments its term, so it returns with a term higher than the cluster’s. When it sends a RequestVote with that inflated term, it forces the current leader to step down and triggers an unnecessary election. The Pre-Vote phase prevents this.

Before becoming a candidate, a node first asks all other nodes: “Would you vote for me if I started an election?” If a majority responds yes, the node increments its term and proceeds with the real election. If not enough nodes respond, the node knows it is partitioned and does not disrupt the cluster.

sequenceDiagram
    participant F as Follower (partitioned)
    participant N1 as Node 1
    participant N2 as Node 2
    participant L as Leader

    F->>N1: Pre-Vote(term=9)?
    F->>N2: Pre-Vote(term=9)?
    Note over N1,N2: Current term = 5, leader is active
    N1-->>F: No (our leader is still active)
    N2-->>F: No (same reason)
    Note over F: Two rejections received — knows it is behind
    Note over F: Does NOT increment term
    Note over F: Does NOT send RequestVote
    Note over F: Stays quiet until it hears from the current leader

The Pre-Vote check forces the partitioned node to accept reality before it can cause damage.

Log Compaction and Snapshotting

Raft logs grow indefinitely. In production, you must compact the log — snapshotting is the mechanism that makes this safe.

Instead of keeping every log entry forever, the leader periodically creates a snapshot of the application state machine and discards all log entries up to that point. Each snapshot covers a contiguous prefix of the log.

graph LR
    A["Log entries 1-1000"] -->|Compacted into| S["Snapshot: state at idx 1000"]
    S --> B["Log entries 1001-onwards"]

When a follower falls too far behind, the leader sends its snapshot instead of log entries. The follower installs the snapshot to rebuild its state, then continues receiving log entries from there.

Snapshot contents typically include:

  • Last included index (the last log entry in the snapshot)
  • Last included term (term of that index)
  • Application state (the actual data)
  • Last cluster configuration (membership info)

The tradeoff: snapshotting requires the application to support state machine snapshots, and sending large snapshots across the network is expensive. etcd, for example, triggers a snapshot after a configurable number of applied entries (its --snapshot-count setting).
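A count-based compaction trigger in the spirit of that setting is trivial to sketch (illustrative names; the threshold value is an assumption, not a recommendation):

```go
package main

import "fmt"

// shouldSnapshot reports whether the number of entries applied since the
// last snapshot has crossed the compaction threshold.
func shouldSnapshot(appliedIndex, lastSnapshotIndex, threshold int) bool {
	return appliedIndex-lastSnapshotIndex >= threshold
}

func main() {
	fmt.Println(shouldSnapshot(25000, 10000, 10000)) // true: 15000 entries since last snapshot
	fmt.Println(shouldSnapshot(12000, 10000, 10000)) // false: only 2000 entries
}
```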

Limitations

Raft does not solve all distributed systems problems. It provides consensus for a single log, but building a full replicated state machine service requires additional engineering. And while its safety holds under arbitrary message delays, its liveness does not: in networks with long delays or frequent partitions, elections can fail repeatedly and the cluster makes no progress.

The leader can become a bottleneck. If the leader becomes overloaded or the network to the leader degrades, the entire cluster’s throughput suffers. Some systems route reads through followers with freshness checks, but this adds complexity.

Conclusion

Raft succeeded at its primary goal: being understandable. It provides a practical foundation for building replicated systems without the complexity of more general consensus protocols. If you’re building a distributed system that needs consensus, Raft is often the right starting point.

For more on how consensus fits into broader distributed systems patterns, see my post on Consistency Models.

Related Posts

etcd: Distributed Key-Value Store for Configuration

Deep dive into etcd architecture using Raft consensus, watches for reactive configuration, leader election patterns, and Kubernetes integration.

#distributed-systems #databases #etcd

Leader Election in Distributed Systems

Leader election is the process of designating a single node as the coordinator among a set of distributed nodes, critical for consensus protocols.

#distributed-systems #leader-election #consensus

Multi-Paxos

Multi-Paxos extends basic Paxos to achieve consensus on sequences of values, enabling practical replicated state machines for distributed systems and databases.

#distributed-systems #consensus #paxos