The FLP Impossibility Result

The FLP impossibility theorem proves that no consensus algorithm can guarantee termination in an asynchronous system with even one faulty process.

published: March 24, 2026 reading time: 19 min read author: GeekWorkBench updated: March 24, 2026

Introduction

In 1985, Fischer, Lynch, and Paterson published a short paper that proved a striking impossibility: no consensus algorithm can guarantee termination in a fully asynchronous system if even a single process can fail. The result is known as FLP after the authors’ initials.

This is not a practical limitation to be engineered around. It is a mathematical proof that consensus and guaranteed termination are fundamentally incompatible in asynchronous systems.

The Setup

The FLP result assumes an asynchronous system with the following properties:

Processes communicate by sending messages
Messages can be delayed arbitrarily but eventually delivered
Processes can fail by stopping (crash-stop, not Byzantine)
No clocks or timeouts are available to detect failures

This is a realistic model of many systems. Networks have variable latency; you cannot reliably distinguish a crashed process from a slow one without real-time clocks.

What Consensus Requires

Consensus algorithms typically require three properties:

Agreement: All non-faulty processes decide on the same value
Validity: The decided value must have been proposed by some process
Termination: All non-faulty processes eventually decide

FLP proves that in an asynchronous system with at least one faulty process, no algorithm can guarantee all three simultaneously.

graph TB
    subgraph "FLP Model"
        P1[Process 1]
        P2[Process 2]
        P3[Process 3]
        M1[Message Queue]
        M2[Message Queue]
        M3[Message Queue]

        P1 --> M2
        P2 --> M1
        P2 --> M3
        P3 --> M1
        P3 --> M2
    end

    Note(("No timeouts<br/>Messages delayed<br/>but eventually<br/>delivered"))

The Core Insight

The proof constructs an adversarial message scheduler. Given any protocol that appears to be working, the scheduler can delay critical messages to keep the system in a state where processes disagree but cannot gather enough information to decide.

The key is that in an asynchronous system, you cannot know whether a process has crashed or is just slow. The scheduler can exploit this uncertainty indefinitely by ensuring that whichever choice a process makes, there is always a plausible scenario where the other choice was correct.

The Bivalent Undecidable State

The proof hinges on the concept of bivalent states. A system state is bivalent if the final decision depends on future events that have not yet occurred. The proof shows that from any initial bivalent state, the adversary can always keep the system bivalent by carefully ordering message deliveries.

This means the system can be driven to a point where processes have conflicting information, yet cannot make progress because any decision might be wrong.

Concrete 2-Process Example

Consider two processes, P1 and P2, trying to agree on a binary value (0 or 1):

graph LR
    subgraph Initial
        A1[Initial State<br/>P1: initial<br/>P2: initial]
    end

    subgraph Bivalent
        B1[State A<br/>P1: decided 0<br/>P2: undecided]
        B2[State B<br/>P1: undecided<br/>P2: decided 1]
        B3[Truly Bivalent<br/>P1: if recv 0 -> 0<br/>P2: if recv 1 -> 1]
    end

    A1 -->|P1 receives 0| B1
    A1 -->|P1 receives 1| B2
    B1 -.->|scheduler delays<br/>critical msg| B3
    B2 -.->|scheduler delays<br/>critical msg| B3

The scheduler keeps the system in state B3 (bivalent) by:

Delaying the message from P1 to P2 that would confirm value 0
Delaying the message from P2 to P1 that would confirm value 1
Neither process has enough information to commit to a decision

This adversarial scheduling can continue indefinitely because from either “almost-decided” state, there exists a plausible scenario where the other value would have been correct.

Adversarial Scheduler in Action

The FLP proof constructs an explicit adversarial scheduler that keeps the system undecided. Here is the gist with two nodes:

sequenceDiagram
    participant A as Node A
    participant S as Adversary<br/>Scheduler
    participant B as Node B

    rect rgb(50, 50, 80)
        Note over A,B: Round 1: A proposes
        A->>S: Send "propose 0"
        Note over S: Intercept message
        S-xB: Delay "propose 0"
        Note over A: A waits for ack<br/>from majority (only B)<br/>Incomplete info
    end

    rect rgb(80, 50, 50)
        Note over A,B: Round 2: B proposes
        B->>S: Send "propose 1"
        Note over S: Intercept message
        S-xA: Delay "propose 1"
        Note over B: B waits for ack<br/>from majority (only A)<br/>Incomplete info
    end

    rect rgb(50, 80, 50)
        Note over A,B: Round 3: Scheduler acts again
        S-xA: Deliver B's "propose 1"<br/>to A only
        S-xB: Deliver A's "propose 0"<br/>to B only
        Note over A: A now sees<br/>conflicting info
        Note over B: B now sees<br/>conflicting info
    end

    Note over A,B: Neither node has majority<br/>System stays bivalent<br/>Scheduler repeats indefinitely

The scheduler alternates which messages it delays. Each node sees partial information but never enough to commit. The system can remain in this state forever, proving that no algorithm can guarantee termination.

Formal Problem Reduction

FLP proves impossibility through reduction. If you could solve consensus in an asynchronous system with even one possible failure, you could solve the simpler problem of Byzantine agreement, which is provably impossible.

The reduction works like this. Given a system that solves consensus despite failures, you can construct a scheduler (an adversary) that forces the system into a bivalent state, where the final outcome depends on timing, by delaying just the right messages at just the right moments. Since no algorithm can avoid this, consensus is impossible in general asynchronous systems.

This is not a construction flaw. It is a fundamental result. Any algorithm that makes progress in all failure scenarios can be forced into indecision by an adversarial scheduler that carefully delays messages to keep the system in a bivalent state.

The proof’s central point: you cannot distinguish a slow node from a crashed one in an asynchronous network. This ambiguity is what makes consensus undecidable without additional assumptions, like timing bounds or synchrony.

What This Means Practically

FLP does not mean consensus is impossible. It means consensus algorithms must make a trade-off:

They can guarantee safety (agreement and validity) but not liveness (termination), or
They can guarantee liveness under certain conditions (like synchronous networks), or
They can use randomness to guarantee termination with high probability

Common Pitfalls / Anti-Patterns

Real systems use various strategies to work around FLP:

Synchrony assumptions: If you assume bounds on message delivery, you can use timeouts to detect failures and guarantee termination. The CAP theorem captures this trade-off: during partitions, you must choose between consistency (giving up availability) or availability (giving up consistency during partition recovery).

Probabilistic termination: Some algorithms, like Ben-Or’s randomized consensus, guarantee termination with probability 1. They may run for an unbounded time in the worst case, but the probability of that happening is zero.

Lease-based approaches: As discussed in my Leader Election post, lease-based approaches assume bounded clock skew and network delays. They provide eventual detection of failures but cannot guarantee instant detection.

Real-world Failure Scenarios

Understanding FLP is one thing; seeing it manifest in production is another. These are real scenarios where FLP’s impossibility rears its head:

Scenario 1: Network Partition with Split-Brain

Two data centers lose connectivity for 30 seconds. Both sides believe the other has failed. Without timing bounds:

Both sides attempt to elect a leader
Both sides start accepting writes locally
When the partition heals, conflict resolution must reconcile divergent states
This is why Paxos and Raft guarantee safety (no divergent state committed) but may block writes during partition

Scenario 2: Leader Crashes During Message Delay

A Raft leader sends a RequestVote message that gets delayed by 40 seconds due to network jitter:

The delayed message arrives after the election timeout on followers
A new leader may have already been elected
The old leader may have already committed entries
Raft handles this via term checking: stale messages are rejected

Scenario 3: Clock Skew Exploiting Asynchrony

In a system without TrueTime-style bounded clocks, an adversarial network can cause:

A node to believe its lease is still valid when the new leader has already taken over
Two nodes serving reads with stale data while believing they are the sole leader
This is why bounded clock uncertainty (partial synchrony) is critical for strong consistency

Scenario 4: GC Pauses and Message Reordering

A JVM garbage collection pause of 10+ seconds can cause:

Heartbeat timeouts to trigger leader elections
Messages that were in-flight to appear as if never sent
The system must tolerate such pauses without violating safety

Scenario 5: Byzantine Failure in Practice

While FLP originally addresses crash-stop failures, real systems face Byzantine (arbitrary malicious) behavior:

A node may claim to have received messages it never received
A corrupted node may send contradictory messages to different peers
PBFT and similar protocols handle this but require more than n/2 honest nodes

Partial Synchrony and the Dwork-Lynch Model

FLP assumes a fully asynchronous system with no timing assumptions. But what if we relax this slightly?

The Dwork-Lynch model (1988) introduced partial synchrony: the system is usually asynchronous, but eventually messages are delivered within some bounded time. This bounded time is not known a priori but exists.

graph LR
    subgraph "System Models"
        A[Asynchronous<br/>No timing<br/>assumptions] --> B[Partial Synchrony<br/>Eventually<br/>bounded delay]
        B --> C[Synchronous<br/>Known bounds<br/>always]
    end

In partial synchrony:

Initially: System behaves asynchronously (FLP applies)
Eventually: After unknown bound GST (Global Stabilization Time), timing guarantees hold
Result: Algorithms can guarantee liveness after GST while maintaining safety always

This is how practical systems work around FLP. They assume “the network will eventually be well-behaved” rather than “the network is always well-behaved.”

FLP in Practice: How Spanner and Paxos Handle This

Real systems using Paxos or Raft don’t violate FLP mathematically, but they work in practice because the assumptions underlying FLP don’t perfectly match reality:

Google Spanner uses TrueTime (bounded clock uncertainty) to provide external consistency. Spanner’s TrueTime API guarantees that uncertainty is bounded to at most 7 seconds. This means:

Spanner can use timeout-based leader election safely
After a leader failure, Spanner waits at least the maximum clock uncertainty before promoting a new leader
This effectively converts the system to partial synchrony during critical periods

Paxos-based systems (Chubby, Zookeeper) use leader leases:

The leader acquires a lease before processing requests
If the leader fails to renew, followers wait for the lease to expire before starting an election
This bounds the “asynchronous” period where the adversary could schedule messages adversarially

Raft (used in etcd, CockroachDB) relies on:

Heartbeat timeouts to detect leader failure
Election timeout randomization to break ties
Assumption that networks eventually deliver messages

These systems guarantee safety (no two nodes can be leaders simultaneously, no divergent state) but accept that during extended network partitions, liveness (ability to make progress) may be temporarily suspended.

In practice: “the network is usually reliable, and when it’s not, we sacrifice liveness for safety until it recovers.”

Relationship to CAP

FLP and CAP are related but distinct. CAP focuses on the trade-off between consistency and availability during network partitions. FLP focuses on the impossibility of guaranteed termination in asynchronous systems with failures.

Both results stem from the same underlying reality: in asynchronous systems, you cannot distinguish failures from delays. CAP accepts this and makes availability the default. FLP formalizes the impossibility and forces algorithm designers to be explicit about their assumptions.

My post on the CAP Theorem explores these trade-offs in more detail.

Why This Matters

FLP is an important result in distributed systems theory. It sets limits on what can be achieved and forces practitioners to be explicit about their assumptions.

Understanding FLP changes how you think about system design. Instead of trying to achieve impossible guarantees, you design systems that degrade gracefully under adversarial conditions.

The Broader Impact

Since 1985, researchers have built on FLP in various directions. The result has been extended to Byzantine failures (where nodes can behave arbitrarily maliciously), partial synchrony models, and different communication patterns.

The field of distributed consensus has grown substantially since FLP. Paxos, Raft, and many other algorithms have been developed with practical trade-offs in mind. FLP does not make these algorithms useless; it clarifies what they can and cannot guarantee.

Trade-off Comparison Table

The FLP result forces every consensus algorithm to make explicit trade-offs along safety/liveness dimensions. Here is how common approaches compare:

Approach	System Model	Safety	Liveness	Failure Handling	Practical Example
FLP Original	Fully Async	Yes	No	Crash-stop	Theoretical only
Paxos	Partial Sync	Yes	Yes*	Crash-stop	Chubby, Zookeeper
Raft	Partial Sync	Yes	Yes*	Crash-stop	etcd, CockroachDB
Ben-Or (Randomized)	Fully Async	Yes	Probabilistic	Crash-stop	Theoretical interest
PBFT	Partial Sync	Yes	Yes*	Byzantine	Zyzzyva, BFT-SMaRt
Dwork-Lynch	Partial Sync	Yes	Yes*	Crash-stop	Most production systems

*Liveness guaranteed only after the system stabilizes (GST in partial synchrony models).

Quick Recap Checklist

Before diving into implementation or deeper theory, verify your understanding of these core FLP concepts:

The FLP result applies specifically to fully asynchronous systems with crash-stop failures
Bivalent states are states where the final decision depends on future message ordering
An adversarial scheduler can keep a bivalent system undecided indefinitely
FLP proves you cannot simultaneously guarantee safety, liveness, and fault tolerance in async systems
Partial synchrony (Dwork-Lynch model) relaxes timing assumptions to allow liveness guarantees
Real systems like Paxos and Raft guarantee safety but may sacrifice liveness during partitions
Failure detectors and randomized algorithms are practical workarounds for the FLP impossibility
The FLP result does not mean consensus is impossible, only that guaranteed termination is impossible without additional assumptions

Interview Questions

1. What does the FLP impossibility result actually prove?

Expected answer points:

No consensus algorithm can guarantee termination in a fully asynchronous system with even one crash-stop failure
The proof shows safety and guaranteed liveness are incompatible under async timing assumptions
It is a mathematical impossibility, not a practical engineering limitation

2. What is a bivalent state in the context of the FLP proof?

Expected answer points:

A system state where the final decision depends on future message ordering that has not yet occurred
Both possible final values (0 or 1) remain plausible given current information
The adversary can keep the system bivalent by carefully ordering message deliveries

3. Why can an adversarial scheduler force a consensus algorithm to never terminate?

Expected answer points:

In async systems, you cannot distinguish a crashed process from a slow one without real-time clocks
The scheduler delays critical messages strategically, keeping each process in an "almost decided" but not committed state
From any bivalent state, the scheduler can always find a plausible scenario where the other decision value would have been correct

4. What are the three properties that consensus algorithms typically require?

Expected answer points:

Agreement: All non-faulty processes decide on the same value
Validity: The decided value must have been proposed by some process
Termination: All non-faulty processes eventually decide

5. How does the Dwork-Lynch partial synchrony model work around FLP?

Expected answer points:

The system is initially asynchronous with no timing guarantees (FLP applies)
Eventually, after a global stabilization time (GST), message delivery is bounded
Algorithms guarantee safety always, but liveness only after GST when timing assumptions hold

6. How does Google Spanner use TrueTime to work around FLP?

Expected answer points:

TrueTime provides bounded clock uncertainty (at most 7 seconds for Spanner)
This effectively converts the system to partial synchrony during critical periods
Spanner waits at least the maximum clock uncertainty before promoting a new leader after failure

7. Why does FLP not make consensus algorithms like Paxos or Raft useless?

Expected answer points:

FLP proves termination cannot be guaranteed, not that consensus is impossible
Paxos and Raft guarantee safety (no two nodes decide different values) under all conditions
They guarantee liveness under practical assumptions (eventual message delivery, partial synchrony)

8. What is the relationship between FLP and the CAP theorem?

Expected answer points:

Both stem from the same underlying reality: you cannot distinguish failures from delays in async networks
CAP focuses on consistency vs availability trade-offs during partitions
FLP formalizes the impossibility of guaranteed termination, forcing algorithm designers to be explicit about assumptions

9. How does Ben-Or's randomized consensus algorithm address FLP?

Expected answer points:

Ben-Or uses coin-flipping to achieve consensus with probability 1 in fully async systems
Termination is probabilistic rather than deterministic
The probability of non-termination in the worst case is zero, bypassing FLP's deterministic impossibility

10. What is the role of failure detectors in working around FLP?

Expected answer points:

Failure detectors provide suspicion information about process failures (possibly unreliable)
They introduce timing assumptions indirectly, converting async to partial synchrony
Algorithms using failure detectors can guarantee liveness because they can eventually suspect crashed processes

11. Why is the asynchronous communication model the key to FLP's impossibility?

Expected answer points:

Asynchronous message passing means no bounds on message delivery time
Without timing bounds, no process can know whether another has crashed or is simply slow
This uncertainty is what the adversarial scheduler exploits to keep the system bivalent indefinitely

12. What is the formal problem reduction used in the FLP proof?

Expected answer points:

FLP reduces Byzantine agreement to consensus, showing if consensus is solvable, Byzantine agreement would also be solvable
Since Byzantine agreement is provably impossible in async systems, consensus must also be impossible
The reduction constructs an adversarial scheduler that forces the system into a bivalent state

13. How do lease-based approaches like in Paxos handle FLP?

Expected answer points:

The leader acquires a lease before processing requests
If the leader fails to renew, followers wait for the lease to expire before starting election
This bounds the asynchronous window where adversarial scheduling could occur

14. What is the Global Stabilization Time (GST) in partial synchrony models?

Expected answer points:

GST is the unknown but finite time after which message delivery is bounded
Before GST, the system behaves asynchronously and FLP applies
After GST, timing assumptions hold and algorithms can guarantee liveness

15. Does FLP apply to Byzantine failures as well as crash-stop failures?

Expected answer points:

The original FLP result applies to crash-stop (only stops responding) failures
The result has been extended to Byzantine failures where nodes behave arbitrarily maliciously
Byzantine FLP variants show impossibility results are even stronger with more severe failure models

16. What is the practical implication of FLP for system designers?

Expected answer points:

You must be explicit about which guarantees your system provides under which conditions
Design for graceful degradation under adversarial network conditions
Accept that during extended network partitions, some form of liveness sacrifice is unavoidable

17. Why does Raft's election timeout randomization help with liveness?

Expected answer points:

Randomized timeouts reduce the chance of split-brain elections where multiple nodes become candidates simultaneously
It increases the probability that one node wins decisively before adversarial scheduling can interfere
Combined with heartbeat assumptions, it provides probabilistic liveness guarantees

18. Can randomized algorithms completely bypass FLP's impossibility?

Expected answer points:

Randomized algorithms like Ben-Or achieve consensus with probability 1, not deterministic certainty
They circumvent FLP by not requiring guaranteed termination in the worst case
The probability of non-termination is zero, but the running time could be unbounded in adversarial scenarios

19. What happens to consensus algorithms during network partitions in practice?

Expected answer points:

Safety is always maintained: no two nodes decide different values
Liveness may be sacrificed: the system cannot make progress until the partition heals
This is the CAP theorem in action: choosing consistency over availability during partitions

20. In the 2-process bivalent example, how does the scheduler keep the system undecided?

Expected answer points:

The scheduler delays messages from P1 to P2 confirming value 0 and from P2 to P1 confirming value 1
Neither process has enough information to commit to a decision
The system reaches a truly bivalent state where whichever choice a process makes, the other value remains plausible

Introduction

The Setup

What Consensus Requires

The Core Insight

The Bivalent Undecidable State

Concrete 2-Process Example

Adversarial Scheduler in Action

Formal Problem Reduction

What This Means Practically

Common Pitfalls / Anti-Patterns

Real-world Failure Scenarios

Scenario 1: Network Partition with Split-Brain

Scenario 2: Leader Crashes During Message Delay

Scenario 3: Clock Skew Exploiting Asynchrony

Scenario 4: GC Pauses and Message Reordering

Scenario 5: Byzantine Failure in Practice

Partial Synchrony and the Dwork-Lynch Model

FLP in Practice: How Spanner and Paxos Handle This

Relationship to CAP

Why This Matters

The Broader Impact

Trade-off Comparison Table

Quick Recap Checklist

Interview Questions

Further Reading

Category

Tags

Related Posts

Leader Election in Distributed Systems

Paxos Consensus Algorithm

View-Stamped Replication