Raft Consensus Algorithm
Raft is a consensus algorithm designed to be understandable, decomposing the problem into leader election, log replication, and safety.
Introduction
Raft was designed with a specific goal: be understandable. Diego Ongaro and John Ousterhout published the algorithm in 2014, explicitly stating that their aim was to create a consensus algorithm that practitioners could actually reason about. Their user study showed that people understood Raft better than Paxos, which matters because consensus algorithms are notoriously difficult to implement correctly.
The algorithm breaks the consensus problem into three independent subproblems: leader election, log replication, and safety. This decomposition makes it easier to reason about and implement than Paxos.
Leader Election
At any given time, each node in a Raft cluster is in one of three states: leader, follower, or candidate. All nodes start as followers. If a follower receives no communication from a leader within an election timeout, it becomes a candidate and initiates an election.
stateDiagram-v2
[*] --> Follower
Follower --> Candidate: Election timeout
Candidate --> Follower: Another node wins
Candidate --> Leader: Receives votes from majority
Leader --> Follower: Higher term discovered
Candidate --> Follower: Higher term discovered
state Leader {
[*] --> Active
Active --> Active: Heartbeat received
}
To win an election, a candidate must receive votes from a majority of nodes. Each node votes for at most one candidate per term, and nodes vote for the first candidate that requests their vote if that candidate’s log is at least as up-to-date as their own.
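The “at least as up-to-date” rule is mechanical: compare the terms of the two logs’ last entries first, and fall back to log length only when those terms match. A minimal sketch in Go (the function and parameter names are illustrative, not from any particular implementation):

// logIsUpToDate reports whether a candidate's log, described by the term
// and index of its last entry, is at least as up-to-date as the voter's.
func logIsUpToDate(candidateLastTerm, candidateLastIndex, myLastTerm, myLastIndex int) bool {
	if candidateLastTerm != myLastTerm {
		return candidateLastTerm > myLastTerm // the higher last term wins
	}
	return candidateLastIndex >= myLastIndex // same term: the longer log wins
}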
The election timeout is randomized within a range (typically 150-300ms). This randomness breaks ties and ensures that split votes are rare. When they do happen, nodes wait for a new timeout before trying again.
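As a sketch, drawing a fresh randomized timeout every time the timer is reset could look like the following; the 150-300ms range mirrors the numbers above, and math/rand and time are standard library packages:

import (
	"math/rand"
	"time"
)

// electionTimeout returns a random duration in [150ms, 300ms). A new value
// is drawn each time a follower resets its timer, so two nodes rarely time
// out at exactly the same moment.
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}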
Log Replication
Once a leader is elected, it begins replicating log entries to followers. Clients send commands to the leader, which appends them to its log and then sends AppendEntries RPCs to all followers in parallel. Once a majority of nodes have appended the entry, the leader marks it committed, applies it to its state machine, and returns the result to the client (a sketch of the commit rule follows the diagram below).
The log is structured as a sequence of entries, each containing a term number and a command. The term numbers serve as logical clocks, helping nodes identify which entries are stale or conflicting.
sequenceDiagram
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
C->>L: Command: x=1
L->>L: Append to local log
L->>F1: AppendEntries(term=3, entry=x:1)
L->>F2: AppendEntries(term=3, entry=x:1)
F1-->>L: Success
F2-->>L: Success
L->>L: Apply to state machine
L-->>C: OK
Note over L: Committed: entry x=1
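The commit decision itself is just counting. The leader tracks, for each follower, the highest log index known to be replicated there (often called matchIndex) and advances its own commitIndex to the largest index stored on a majority, with the extra restriction from the Raft paper that only entries from the current term are committed this way. A rough sketch, with illustrative names:

// advanceCommitIndex returns the leader's new commit index. matchIndex[i]
// is the highest log index known to be replicated on follower i, and
// logTerm(n) returns the term of the leader's entry at index n.
func advanceCommitIndex(matchIndex []int, commitIndex, lastLogIndex, currentTerm int, logTerm func(int) int) int {
	clusterSize := len(matchIndex) + 1 // followers plus the leader itself
	for n := lastLogIndex; n > commitIndex; n-- {
		replicas := 1 // the leader always holds its own entry
		for _, m := range matchIndex {
			if m >= n {
				replicas++
			}
		}
		// Only entries from the leader's current term are committed by
		// counting replicas (the paper's Figure 8 restriction).
		if replicas*2 > clusterSize && logTerm(n) == currentTerm {
			return n
		}
	}
	return commitIndex
}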
Safety
Raft’s safety guarantee is straightforward: if a command has been applied to the state machine on any node, no other node may apply a different command for the same log index. This is enforced through the election restriction: nodes will not vote for a candidate whose log is less up-to-date than their own, so every elected leader already holds all committed entries.
When a leader crashes, some of its log entries may not have been fully replicated to all followers. The new leader must bring every follower up to date: it finds the latest point at which each follower’s log agrees with its own, then overwrites any conflicting entries beyond that point with its own.
Membership Changes
Raft handles cluster membership changes through joint consensus. When you need to add or remove nodes, the cluster transitions through a joint configuration where old and new nodes must agree before the transition completes. This prevents the scenario where two disjoint majorities could form and make conflicting decisions.
The joint consensus approach adds complexity but guarantees safety during transitions. Nodes are added incrementally, giving the system time to replicate data and adjust quorum sizes.
AppendEntries RPC
The AppendEntries RPC is how Raft’s leader replicates log entries to followers. Here’s the core structure:
type LogEntry struct {
	Term    int         // Term when the entry was received by the leader
	Command interface{} // Opaque command for the application state machine
}

type AppendEntriesArgs struct {
	Term         int        // Leader's current term
	LeaderID     int        // So followers can redirect clients
	PrevLogIndex int        // Index of the log entry immediately preceding the new ones
	PrevLogTerm  int        // Term of the PrevLogIndex entry
	Entries      []LogEntry // Log entries to store (empty for heartbeat)
	LeaderCommit int        // Leader's commitIndex
}
type AppendEntriesReply struct {
Term int // Current term, for leader to update itself
Success bool // True if follower contained entry matching PrevLogIndex/PrevLogTerm
// Optimization: tell the leader where to back up to
ConflictIndex int // First index of the conflicting term
ConflictTerm int // Term at ConflictIndex
}
A heartbeat is just an AppendEntries with an empty Entries slice. The leader sends heartbeats at a short, regular interval (on the order of 50ms) so that followers hear from it well before their election timeout (typically 150-300ms) expires.
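Putting the pieces together, the follower-side handler is where the consistency check and log repair described earlier actually happen. The sketch below assumes a Server type with currentTerm, a log slice of LogEntry values (with a sentinel at index 0 so log indices start at 1), a commitIndex, and a resetElectionTimer helper; term bookkeeping, persistence, and locking are omitted:

// handleAppendEntries is a simplified follower-side handler.
func (s *Server) handleAppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	reply.Term = s.currentTerm

	// 1. Reject requests from a stale leader.
	if args.Term < s.currentTerm {
		reply.Success = false
		return
	}
	s.resetElectionTimer() // a valid leader exists; don't start an election

	// 2. Consistency check: we must already hold the entry the new ones build on.
	if args.PrevLogIndex >= len(s.log) {
		reply.Success = false
		reply.ConflictIndex = len(s.log) // our log is too short; back up to its end
		return
	}
	if s.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false
		reply.ConflictTerm = s.log[args.PrevLogIndex].Term
		i := args.PrevLogIndex
		for i > 1 && s.log[i-1].Term == reply.ConflictTerm {
			i-- // walk back to the first index of the conflicting term
		}
		reply.ConflictIndex = i
		return
	}

	// 3. Append new entries, truncating a conflicting suffix if one exists.
	for i, entry := range args.Entries {
		idx := args.PrevLogIndex + 1 + i
		if idx < len(s.log) && s.log[idx].Term == entry.Term {
			continue // already stored (e.g. from a duplicate or reordered RPC)
		}
		s.log = append(s.log[:idx], args.Entries[i:]...)
		break
	}

	// 4. Advance commitIndex as far as the leader allows.
	if args.LeaderCommit > s.commitIndex {
		s.commitIndex = args.LeaderCommit
		if last := len(s.log) - 1; last < s.commitIndex {
			s.commitIndex = last
		}
	}
	reply.Success = true
}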
Joint Consensus Membership Change
When adding or removing nodes, Raft uses a two-phase approach:
sequenceDiagram
participant L as Leader
participant O1 as Old Node 1
participant O2 as Old Node 2
participant N1 as New Node 3
rect rgb(40, 40, 60)
Note over L,N1: Phase 1: Joint Configuration
L->>O1: C_old + C_new (joint config)
L->>O2: C_old + C_new (joint config)
L->>N1: C_old + C_new (joint config)
O1-->>L: OK
O2-->>L: OK
N1-->>L: OK (catching up)
Note over L: Committed joint config
end
rect rgb(40, 60, 40)
Note over L,N1: Phase 2: New Configuration Only
L->>O1: C_new only
L->>O2: C_new only
L->>N1: C_new only
O1-->>L: OK
O2-->>L: OK
N1-->>L: OK
Note over L: New config committed
end
The key steps:
- Leader receives a request to add or remove node(s)
- Leader creates the joint configuration C_old + C_new
- Leader replicates the joint configuration entry to all nodes; while it is in effect, decisions require a majority of both C_old and C_new (see the sketch below)
- Once the joint configuration is committed, the leader creates a configuration entry containing only C_new
- The new configuration is replicated; old nodes that are not part of C_new can shut down
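During the joint phase, “majority” means a majority of C_old and, independently, a majority of C_new. A small sketch of that quorum rule (node IDs and the set representation are illustrative):

// jointQuorum reports whether the acknowledging nodes form a majority of
// BOTH the old and the new configuration. While the joint config is in
// effect, an entry is committed (and an election won) only if this holds.
func jointQuorum(acks, oldConfig, newConfig map[int]bool) bool {
	return majority(acks, oldConfig) && majority(acks, newConfig)
}

func majority(acks, config map[int]bool) bool {
	count := 0
	for id := range config {
		if acks[id] {
			count++
		}
	}
	return count*2 > len(config)
}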
Why Raft Wins in Practice
Raft’s emphasis on understandability has paid off. The algorithm has been implemented in many projects, including etcd (which backs Kubernetes), CockroachDB, and TiKV. The clear separation of concerns makes it easier to debug and verify.
Compared to Paxos, Raft has a stronger notion of leadership. A single leader processes all requests and coordinates replication, which simplifies the protocol at the cost of concentrating load on one node. When the leader fails, a new election happens quickly.
Performance Characteristics
Raft’s performance is dominated by the leader. It can process hundreds to thousands of commands per second depending on network latency and the number of followers. Write throughput scales with the leader’s ability to send RPCs in parallel to followers.
Read throughput can be improved by serving reads from followers in read-heavy workloads, but only if you accept stale reads. Linearizable reads require the leader to confirm it is still the leader before responding, using either a lease or a round of heartbeats to a majority.
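A lease is the cheaper of the two: the leader may answer reads locally as long as it has heard from a majority recently enough that no rival could have been elected in the meantime. A hedged sketch of that check, assuming illustrative Server fields (lastHeartbeatAck per follower, electionTimeout, clusterSize):

// canServeLinearizableRead is a lease-style check: reads may be served
// locally if a majority acknowledged the leader's heartbeats within one
// election timeout, minus an allowance for clock drift.
func (s *Server) canServeLinearizableRead(now time.Time) bool {
	const clockDriftBound = 10 * time.Millisecond // assumed bound; tune per deployment
	acks := 1 // the leader counts itself
	for _, t := range s.lastHeartbeatAck {
		if now.Sub(t) < s.electionTimeout-clockDriftBound {
			acks++
		}
	}
	return acks*2 > s.clusterSize
}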
Performance Bottleneck Table
| Component | Latency Contributor | Bottleneck |
|---|---|---|
| Leader election | Clock drift + election timeout | Worst-case ~election timeout |
| Log replication | Disk fsync on followers | Disk I/O is the ceiling |
| Snapshotting | State machine serialization | CPU and memory |
| Commit latency | Majority ack + disk write | fsync latency |
| RPC round-trip | Network | Geographic distance |
Performance Reference Table
| Component | Typical Value | What Affects It |
|---|---|---|
| Election timeout | 150-300ms random | Network jitter, load |
| Heartbeat interval | 50ms | Throughput vs sensitivity tradeoff |
| Log replication latency | 1-2 RTTs | Network RTT, follower catch-up speed |
| Commit latency | 10-50ms | Replication factor, network latency |
| Snapshot size | Varies | Application state size, snapshot frequency |
| Recovery time | Election timeout + catch-up | How far behind the new leader is |
The leader can become a bottleneck under heavy write load. etcd mitigates this with batching and pipelining: packing multiple log entries into a single AppendEntries RPC amortizes per-entry overhead, and keeping several RPCs in flight hides network latency. CockroachDB spreads the load differently: it splits data into many ranges, each backed by its own Raft group, and the node holding a range's lease can serve consistent reads locally without a Raft round trip.
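The batching idea is simple to sketch: rather than one RPC per client command, the leader drains whatever proposals are queued and ships them as a single Entries slice (the proposal channel here is illustrative, not etcd's actual API):

// nextBatch drains up to maxBatch queued commands into one slice so the
// leader can append and replicate them with a single AppendEntries RPC.
func nextBatch(proposals <-chan LogEntry, maxBatch int) []LogEntry {
	batch := []LogEntry{<-proposals} // block until at least one command arrives
	for len(batch) < maxBatch {
		select {
		case e := <-proposals:
			batch = append(batch, e)
		default:
			return batch // queue is empty; send what we have
		}
	}
	return batch
}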
Relationship to Other Algorithms
Raft and Paxos achieve the same safety guarantees through different means. Paxos is more abstract and general; Raft is more prescriptive and practical. If you want to understand the theoretical foundations, I recommend reading about the CAP Theorem and how consensus algorithms navigate the consistency-availability trade-off.
Raft also relates closely to leader election concepts. My post on Leader Election explores different approaches to selecting leaders in distributed systems.
Pre-Vote Mechanism
A node that has been partitioned away repeatedly times out and increments its term while it cannot reach the cluster. When it rejoins and sends a RequestVote carrying that inflated term, it forces the current leader to step down and triggers an unnecessary election, even though the rejoining node's log is stale. The Pre-Vote phase prevents this.
Before becoming a candidate, a node first asks all other nodes: “Would you vote for me if I started an election?” Only if a majority says yes does it increment its term and run a real election. If it cannot gather a majority of pre-votes, it concludes that it is partitioned (or that a healthy leader still exists) and leaves the cluster undisturbed. A sketch of the receiver's check follows the diagram below.
sequenceDiagram
participant F as Follower (partitioned)
participant N1 as Node 1
participant N2 as Node 2
participant L as Leader
F->>N1: Pre-Vote(term=9)?
F->>N2: Pre-Vote(term=9)?
Note over N1,N2: Current term = 5, leader is active
N1-->>F: No (our leader is still active)
N2-->>F: No (same reason)
Note over F: Majority rejected the pre-vote
Note over F: Does NOT increment its term
Note over F: Does NOT send a real RequestVote
Note over F: Waits until it hears from the current leader
The Pre-Vote check forces the partitioned node to accept reality before it can cause damage.
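On the receiving side, a pre-vote is granted only when the receiver has not heard from a leader recently and the requester's log is up to date, and granting it changes no persistent state. A sketch, assuming RequestVoteArgs/RequestVoteReply carry the paper's usual fields (Term, LastLogIndex, LastLogTerm / Term, VoteGranted), the same illustrative Server fields as the earlier sketches, and the logIsUpToDate helper from the election section:

// handlePreVote answers "would you vote for me?" without touching
// currentTerm or votedFor, so an isolated node cannot disrupt the cluster.
func (s *Server) handlePreVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	heardFromLeader := time.Since(s.lastLeaderContact) < s.electionTimeout
	reply.Term = s.currentTerm
	reply.VoteGranted = !heardFromLeader &&
		args.Term > s.currentTerm &&
		logIsUpToDate(args.LastLogTerm, args.LastLogIndex, s.lastLogTerm(), s.lastLogIndex())
}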
Log Compaction and Snapshotting
Raft logs grow indefinitely. In production, you must compact the log — snapshotting is the mechanism that makes this safe.
Instead of keeping every log entry forever, the leader periodically creates a snapshot of the application state machine and discards all log entries up to that point. Each snapshot covers a contiguous prefix of the log.
graph LR
A["Log entries 1-1000"] -->|Snapshot created| B["Log entries 1001-onwards"]
A -->|Kept| S["Snapshot: state at idx 1000"]
When a follower falls too far behind and the entries it needs have already been compacted away, the leader sends its latest snapshot instead of log entries. The follower installs the snapshot to replace its state, then resumes normal log replication from the snapshot's last included index.
Snapshot contents typically include:
- Last included index (the last log entry in the snapshot)
- Last included term (term of that index)
- Application state (the actual data)
- Last cluster configuration (membership info)
The tradeoff: snapshotting requires the application to support state machine snapshots, and sending large snapshots across the network is expensive. etcd recommends snapshotting every 10,000-20,000 entries for typical workloads.
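As a sketch, the snapshot metadata and the compaction trigger can be as simple as this; the field names and threshold are illustrative, and a real implementation must persist the snapshot before discarding any log entries:

// Snapshot captures everything a lagging follower needs to start from a
// compacted log.
type Snapshot struct {
	LastIncludedIndex int    // last log index the snapshot covers
	LastIncludedTerm  int    // term of that entry
	Config            []int  // cluster membership at that point
	State             []byte // serialized application state machine
}

// shouldSnapshot is the trigger: compact once enough entries have been
// applied since the previous snapshot was taken.
func shouldSnapshot(lastApplied, lastSnapshotIndex, threshold int) bool {
	return lastApplied-lastSnapshotIndex >= threshold
}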
Limitations
Raft does not solve all distributed systems problems. It provides consensus for a single log, but building a full replicated state machine requires additional engineering. Its safety holds even when messages are delayed arbitrarily, but its liveness does not: elections only make progress when messages are exchanged and processed well within the election timeout, so sustained delays or partitions can leave the cluster without a working leader.
The leader can become a bottleneck. If the leader becomes overloaded or the network to the leader degrades, the entire cluster’s throughput suffers. Some systems route reads through followers with freshness checks, but this adds complexity.
Conclusion
Raft succeeded at its primary goal: being understandable. It provides a practical foundation for building replicated systems without the complexity of more general consensus protocols. If you’re building a distributed system that needs consensus, Raft is often the right starting point.
For more on how consensus fits into broader distributed systems patterns, see my post on Consistency Models.