The FLP Impossibility Result
The FLP impossibility theorem proves that no consensus algorithm can guarantee termination in an asynchronous system with even one faulty process. Understanding FLP is essential for distributed systems designers.
Introduction
In 1985, Fischer, Lynch, and Paterson published a short paper that proved a striking impossibility: no consensus algorithm can guarantee termination in a fully asynchronous system if even a single process can fail. The result is known as FLP after the authors’ initials.
This is not an engineering limitation that a cleverer implementation can overcome. It is a mathematical proof that, in a fully asynchronous system with failures, agreement and guaranteed termination are fundamentally incompatible.
The Setup
The FLP result assumes an asynchronous system with the following properties:
- Processes communicate by sending messages
- Messages can be delayed arbitrarily but eventually delivered
- Processes can fail by stopping (crash-stop, not Byzantine)
- No clocks or timeouts are available to detect failures
This is a realistic model of many systems. Networks have variable latency; you cannot reliably distinguish a crashed process from a slow one without real-time clocks.
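The model above can be made concrete in a small sketch. This is an illustration under assumed names (`AsyncNetwork` and `deliver_any` are mine, not from the paper); the defining feature is that delivery order and timing are entirely up to the scheduler, not the processes:

```python
import random

class AsyncNetwork:
    """Minimal model of the FLP network assumptions: messages sit in a
    buffer and the scheduler (the adversary) picks which pending message
    is delivered next, so any message can be delayed arbitrarily long,
    though none is ever lost."""

    def __init__(self):
        self.pending = []  # (sender, receiver, payload) triples

    def send(self, sender, receiver, payload):
        self.pending.append((sender, receiver, payload))

    def deliver_any(self, rng):
        # One asynchronous step: deliver one arbitrary pending message.
        if not self.pending:
            return None
        return self.pending.pop(rng.randrange(len(self.pending)))

net = AsyncNetwork()
net.send("P1", "P2", "propose 0")
net.send("P2", "P1", "propose 1")
msg = net.deliver_any(random.Random(42))  # scheduler's choice, not ours
```

Nothing in this interface lets a process ask "has P2 crashed?" — the only observable events are message arrivals, which is exactly the ambiguity FLP exploits.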
What Consensus Requires
Consensus algorithms typically require three properties:
- Agreement: All non-faulty processes decide on the same value
- Validity: The decided value must have been proposed by some process
- Termination: All non-faulty processes eventually decide
FLP proves that in an asynchronous system in which even a single process may crash, no algorithm can guarantee all three simultaneously.
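The three properties can be stated as a checker over one finished execution. The sketch below uses names of my own choosing; FLP's claim, in these terms, is that an adversary can always force `termination` to fail while agreement and validity still hold:

```python
def check_consensus(proposals, decisions):
    """Check the three consensus properties on one execution.
    `proposals` maps each process to its proposed value; `decisions`
    maps each non-faulty process to its decided value, or None if it
    never decided."""
    decided = [v for v in decisions.values() if v is not None]
    return {
        # Agreement: all processes that decided chose the same value.
        "agreement": len(set(decided)) <= 1,
        # Validity: every decided value was actually proposed.
        "validity": all(v in proposals.values() for v in decided),
        # Termination: every non-faulty process decided.
        "termination": all(v is not None for v in decisions.values()),
    }

# The kind of run FLP constructs: safe, but never finishing.
result = check_consensus({"P1": 0, "P2": 1}, {"P1": 0, "P2": None})
print(result)  # {'agreement': True, 'validity': True, 'termination': False}
```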
graph TB
subgraph "FLP Model"
P1[Process 1]
P2[Process 2]
P3[Process 3]
M1[Message Queue]
M2[Message Queue]
M3[Message Queue]
P1 --> M2
P2 --> M1
P2 --> M3
P3 --> M1
P3 --> M2
end
Note(("No timeouts<br/>Messages delayed<br/>but eventually<br/>delivered"))
The Core Insight
The proof constructs an adversarial message scheduler. Given any protocol that appears to be working, the scheduler can delay critical messages to keep the system in a state where processes disagree but cannot gather enough information to decide.
The key is that in an asynchronous system, you cannot know whether a process has crashed or is just slow. The scheduler can exploit this uncertainty indefinitely by ensuring that whichever choice a process makes, there is always a plausible scenario where the other choice was correct.
Bivalent States
The proof hinges on the concept of bivalent states. A system state is bivalent if both decision values are still reachable from it: the outcome is not yet determined and depends on the order of future message deliveries. The proof establishes two facts. First, any protocol has some bivalent initial state. Second, from any bivalent state, the adversary can reach another bivalent state by carefully ordering message deliveries.
This means the system can be driven through an endless sequence of states in which processes hold conflicting information, yet can never reach a point where a decision is forced.
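Valency is just a reachability question, which makes it easy to illustrate on a toy state graph (the graph and function names here are my own invention, not the paper's notation):

```python
def reachable_decisions(state, transitions, decided):
    """A state is 0-valent, 1-valent, or bivalent depending on which
    final decisions are reachable from it. `transitions` maps a state
    to its successor states (one per deliverable message); `decided`
    maps terminal states to their decision value."""
    seen, stack, outcomes = set(), [state], set()
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        if s in decided:
            outcomes.add(decided[s])
        stack.extend(transitions.get(s, []))
    return outcomes

# Toy graph: from S both decisions remain reachable (bivalent);
# from A only decision 0 is reachable (0-valent).
transitions = {"S": ["A", "B"], "A": ["D0"], "B": ["D1"]}
decided = {"D0": 0, "D1": 1}
print(reachable_decisions("S", transitions, decided))  # {0, 1}: bivalent
print(reachable_decisions("A", transitions, decided))  # {0}: 0-valent
```

The proof's adversary, in these terms, always steers the execution from one state with a `{0, 1}` result to another.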
Concrete 2-Process Example
Consider two processes, P1 and P2, trying to agree on a binary value (0 or 1):
graph LR
subgraph Initial
A1[Initial State<br/>P1: initial<br/>P2: initial]
end
subgraph Bivalent
B1[State A<br/>P1: decided 0<br/>P2: undecided]
B2[State B<br/>P1: undecided<br/>P2: decided 1]
B3[Truly Bivalent<br/>P1: if recv 0 -> 0<br/>P2: if recv 1 -> 1]
end
A1 -->|P1 receives 0| B1
A1 -->|P1 receives 1| B2
B1 -.->|scheduler delays<br/>critical msg| B3
B2 -.->|scheduler delays<br/>critical msg| B3
The scheduler keeps the system in state B3 (bivalent) by:
- Delaying the message from P1 to P2 that would confirm value 0
- Delaying the message from P2 to P1 that would confirm value 1
As a result, neither process has enough information to commit to a decision.
This adversarial scheduling can continue indefinitely because from either “almost-decided” state, there exists a plausible scenario where the other value would have been correct.
Adversarial Scheduler in Action
The FLP proof constructs an explicit adversarial scheduler that keeps the system undecided. Here is the gist with two nodes:
sequenceDiagram
participant A as Node A
participant S as Adversary<br/>Scheduler
participant B as Node B
rect rgb(50, 50, 80)
Note over A,B: Round 1: A proposes
A->>S: Send "propose 0"
Note over S: Intercept message
S-xB: Delay "propose 0"
Note over A: A waits for ack<br/>from majority (only B)<br/>Incomplete info
end
rect rgb(80, 50, 50)
Note over A,B: Round 2: B proposes
B->>S: Send "propose 1"
Note over S: Intercept message
S-xA: Delay "propose 1"
Note over B: B waits for ack<br/>from majority (only A)<br/>Incomplete info
end
rect rgb(50, 80, 50)
Note over A,B: Round 3: Scheduler acts again
S-xA: Deliver B's "propose 1"<br/>to A only
S-xB: Deliver A's "propose 0"<br/>to B only
Note over A: A now sees<br/>conflicting info
Note over B: B now sees<br/>conflicting info
end
Note over A,B: Neither node has majority<br/>System stays bivalent<br/>Scheduler repeats indefinitely
The scheduler alternates which messages it delays. Each node sees partial information but never enough to commit. The system can remain in this state forever, proving that no algorithm can guarantee termination.
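The exchange above can be condensed into a toy simulation. All names here are mine, and this is a drastic simplification of the proof, not the proof itself: a node commits only after hearing from a quorum, and the adversary simply keeps the decisive messages in flight. At every finite point that stalling is consistent with "eventual delivery", so no finite run ever forces a decision:

```python
class Node:
    """A node decides once it has heard proposals from a quorum of
    nodes (itself included), using a fixed deterministic rule."""
    def __init__(self, name, proposal, quorum):
        self.name, self.proposal, self.quorum = name, proposal, quorum
        self.heard = {name: proposal}
        self.decision = None

    def receive(self, sender, value):
        self.heard[sender] = value
        if len(self.heard) >= self.quorum:
            self.decision = min(self.heard.values())

def fair_delivery():
    # If the scheduler cooperates, both nodes decide and agree.
    a, b = Node("A", 0, quorum=2), Node("B", 1, quorum=2)
    b.receive("A", a.proposal)
    a.receive("B", b.proposal)
    return a.decision, b.decision

def adversarial_rounds(n_rounds):
    # The adversary declines to deliver the two cross messages. This is
    # legal at every finite point: "eventual delivery" is never violated
    # by any finite amount of stalling.
    a, b = Node("A", 0, quorum=2), Node("B", 1, quorum=2)
    for _ in range(n_rounds):
        pass  # each round, the critical messages stay in the buffer
    return a.decision, b.decision

print(fair_delivery())           # (0, 0): both decide, and agree
print(adversarial_rounds(1000))  # (None, None): still undecided
```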
Structure of the Proof
FLP proves impossibility directly, by an argument about reachable states, not by reduction to another problem. The argument has two steps. First, any consensus protocol has a bivalent initial configuration: some assignment of inputs for which the outcome is not yet determined. Second, from any bivalent configuration, the adversary can always choose which pending message to deliver next so that the resulting configuration is again bivalent. Chaining these steps yields an infinite execution in which no process ever decides, so no algorithm can guarantee termination.
This is not a construction flaw. It is a fundamental result. Any algorithm that makes progress in all failure scenarios can be forced into indecision by an adversarial scheduler that carefully delays messages to keep the system in a bivalent state.
The proof’s central point: you cannot distinguish a slow node from a crashed one in an asynchronous network. This ambiguity is what makes guaranteed consensus impossible without additional assumptions, like timing bounds or synchrony.
What This Means Practically
FLP does not mean consensus is impossible. It means consensus algorithms must make a trade-off:
- They can guarantee safety (agreement and validity) but not liveness (termination), or
- They can guarantee liveness under certain conditions (like synchronous networks), or
- They can use randomness to guarantee termination with high probability
Coping Strategies
Real systems use various strategies to work around FLP:
Synchrony assumptions: If you assume bounds on message delivery, you can use timeouts to detect failures and guarantee termination. The CAP theorem captures a related trade-off: during a network partition, you must choose between consistency (refusing some requests) and availability (serving possibly inconsistent responses).
Probabilistic termination: Some algorithms, like Ben-Or’s randomized consensus, guarantee termination with probability 1. They may run for an unbounded time in the worst case, but the probability of that happening is zero.
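Ben-Or's actual protocol is more involved than this, but the toy model below illustrates why local random coins defeat the adversary: the scheduler cannot predict the flips, so every round has a fixed chance of breaking the symmetry, and the wait for success is geometric, hence finite with probability 1:

```python
import random

def rounds_until_agreement(n_processes, rng):
    """Toy model of randomized escape from bivalence: each round every
    process flips a local coin, and the round succeeds when all coins
    happen to agree. Per-round success probability is 2**(1 - n)."""
    rounds = 0
    while True:
        rounds += 1
        flips = {rng.randrange(2) for _ in range(n_processes)}
        if len(flips) == 1:  # all processes flipped the same value
            return rounds

rng = random.Random(7)
trials = [rounds_until_agreement(3, rng) for _ in range(10_000)]
average = sum(trials) / len(trials)
# With 3 processes, success probability per round is 1/4, so the
# expected number of rounds is 4.
print(round(average, 2))
```

Note that no individual run has a bound on its length, which is exactly the FLP-permitted behavior: termination is guaranteed only with probability 1, not in every execution.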
Lease-based approaches: As discussed in my Leader Election post, lease-based approaches assume bounded clock skew and network delays. They provide eventual detection of failures but cannot guarantee instant detection.
Partial Synchrony and the Dwork-Lynch Model
FLP assumes a fully asynchronous system with no timing assumptions. But what if we relax this slightly?
The partial synchrony model of Dwork, Lynch, and Stearns (1988) captures this: the system is usually asynchronous, but eventually messages are delivered within some bounded time. The bound exists but is not known a priori.
graph LR
subgraph "System Models"
A[Asynchronous<br/>No timing<br/>assumptions] --> B[Partial Synchrony<br/>Eventually<br/>bounded delay]
B --> C[Synchronous<br/>Known bounds<br/>always]
end
In partial synchrony:
- Initially: System behaves asynchronously (FLP applies)
- Eventually: After an unknown time GST (Global Stabilization Time), timing guarantees hold
- Result: Algorithms can guarantee liveness after GST while maintaining safety always
This is how practical systems work around FLP. They assume “the network will eventually be well-behaved” rather than “the network is always well-behaved.”
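Partial synchrony fits in a few lines. This sketch (the function names and constants are mine) shows the key consequence: a timeout-based failure detector can be wrong before GST but is reliable after it, which is enough to regain liveness without ever risking safety:

```python
def message_delay(send_time, gst, bound, worst_case):
    """Partial synchrony in one function: a message sent before GST can
    take arbitrarily long (modelled here as `worst_case`); from GST on,
    delivery takes at most `bound`."""
    return worst_case if send_time < gst else bound

def detector_is_accurate(send_time, gst, bound, timeout):
    """A timeout-based failure detector is trustworthy only when a live
    peer's reply is guaranteed back within the timeout, i.e. after GST
    with timeout > bound. Before GST it may falsely suspect live peers."""
    return message_delay(send_time, gst, bound, worst_case=10**9) <= timeout

GST, BOUND, TIMEOUT = 100, 5, 50
print(detector_is_accurate(10, GST, BOUND, TIMEOUT))   # False: pre-GST
print(detector_is_accurate(200, GST, BOUND, TIMEOUT))  # True: post-GST
```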
FLP in Practice: How Spanner and Paxos Handle This
Real systems using Paxos or Raft don’t violate FLP mathematically, but they work in practice because the assumptions underlying FLP don’t perfectly match reality:
Google Spanner uses TrueTime (bounded clock uncertainty) to provide external consistency. Spanner’s TrueTime API exposes an explicit uncertainty interval on every clock reading, and Google keeps that uncertainty small in practice, typically a few milliseconds. This means:
- Spanner can use timeout-based leader election safely
- After a leader failure, Spanner waits at least the maximum clock uncertainty before promoting a new leader
- This effectively converts the system to partial synchrony during critical periods
Paxos-based systems (Chubby, ZooKeeper) use leader leases:
- The leader acquires a lease before processing requests
- If the leader fails to renew, followers wait for the lease to expire before starting an election
- This bounds the “asynchronous” window in which an adversarial schedule could leave the system undecided
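The lease discipline above boils down to a single conservative check. This is a sketch under assumed names and units, not any particular system's API: a new election is safe only once the old lease has expired on every clock, skew included:

```python
class Lease:
    """Leader lease sketch: a new leader may be elected only after the
    old lease has provably expired, even accounting for clock skew.
    Waiting out duration + max_skew bounds the window in which two
    nodes could both believe they are leader."""
    def __init__(self, holder, granted_at, duration, max_skew):
        self.holder = holder
        self.granted_at = granted_at
        self.duration = duration
        self.max_skew = max_skew

    def safe_to_elect(self, now):
        # Conservative: the lease must be expired on every clock in the
        # system, not merely expired on the local clock.
        return now >= self.granted_at + self.duration + self.max_skew

lease = Lease("old-leader", granted_at=0, duration=10, max_skew=2)
print(lease.safe_to_elect(11))  # False: may still be valid on a skewed clock
print(lease.safe_to_elect(12))  # True: expired everywhere
```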
Raft (used in etcd, CockroachDB) relies on:
- Heartbeat timeouts to detect leader failure
- Election timeout randomization to break ties
- Assumption that networks eventually deliver messages
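The randomization in the second bullet is worth seeing concretely. This sketch (names and constants mine; the 150-300 ms range follows the Raft paper's suggested defaults) shows why random timeouts break symmetry: one follower almost always times out well before the rest, so split votes are rare rather than perpetual:

```python
import random

def election_timeouts(n_nodes, base_ms, spread_ms, rng):
    """Raft-style randomized election timeouts: each follower picks a
    timeout uniformly in [base, base + spread], so one node usually
    times out first and can win an election before the others even
    become candidates."""
    return {f"node-{i}": base_ms + rng.uniform(0, spread_ms)
            for i in range(n_nodes)}

rng = random.Random(1)
timeouts = election_timeouts(5, base_ms=150, spread_ms=150, rng=rng)
first = min(timeouts, key=timeouts.get)
# The node with the smallest timeout starts the election first.
print(first, round(timeouts[first], 1))
```

Randomization alone does not evade FLP: if the network never stabilizes, elections can still fail forever. It only makes livelock improbable once messages start arriving in time.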
These systems guarantee safety (no two leaders in the same term, no divergent committed state) but accept that during extended network partitions, liveness (the ability to make progress) may be temporarily suspended.
In practice: “the network is usually reliable, and when it’s not, we sacrifice liveness for safety until it recovers.”
Relationship to CAP
FLP and CAP are related but distinct. CAP focuses on the trade-off between consistency and availability during network partitions. FLP focuses on the impossibility of guaranteed termination in asynchronous systems with failures.
Both results stem from the same underlying reality: in asynchronous systems, you cannot distinguish failures from delays. CAP accepts this and makes availability the default. FLP formalizes the impossibility and forces algorithm designers to be explicit about their assumptions.
My post on the CAP Theorem explores these trade-offs in more detail.
Why This Matters
FLP is an important result in distributed systems theory. It sets limits on what can be achieved and forces practitioners to be explicit about their assumptions.
Understanding FLP changes how you think about system design. Instead of trying to achieve impossible guarantees, you design systems that degrade gracefully under adversarial conditions.
The Broader Impact
Since 1985, researchers have built on FLP in various directions. The result has been extended to Byzantine failures (where nodes can behave arbitrarily, including maliciously), partial synchrony models, and different communication patterns.
The field of distributed consensus has grown substantially since FLP. Paxos, Raft, and many other algorithms have been developed with practical trade-offs in mind. FLP does not make these algorithms useless; it clarifies what they can and cannot guarantee.
Conclusion
The FLP impossibility result tells us something fundamental about the nature of distributed systems. We cannot have both safety and guaranteed liveness in asynchronous systems with failures. Every consensus algorithm makes explicit trade-offs based on this reality.
Understanding FLP does not make distributed systems programming easier, but it does make your reasoning about these systems more sound. When something seems too good to be true, FLP reminds us why.
For more on consistency trade-offs, see my post on Consistency Models.