The FLP Impossibility Result

The FLP impossibility theorem proves that no consensus algorithm can guarantee termination in an asynchronous system with even one faulty process.

published: reading time: 19 min read author: GeekWorkBench updated: March 24, 2026

Introduction

In 1985, Fischer, Lynch, and Paterson published a short paper that proved a striking impossibility: no consensus algorithm can guarantee termination in a fully asynchronous system if even a single process can fail. The result is known as FLP after the authors’ initials.

This is not a practical limitation to be engineered around. It is a mathematical proof that consensus and guaranteed termination are fundamentally incompatible in asynchronous systems.

The Setup

The FLP result assumes an asynchronous system with the following properties:

  • Processes communicate by sending messages
  • Messages can be delayed arbitrarily but eventually delivered
  • Processes can fail by stopping (crash-stop, not Byzantine)
  • No clocks or timeouts are available to detect failures

This is a realistic model of many systems. Networks have variable latency; you cannot reliably distinguish a crashed process from a slow one without real-time clocks.

What Consensus Requires

Consensus algorithms typically require three properties:

  1. Agreement: All non-faulty processes decide on the same value
  2. Validity: The decided value must have been proposed by some process
  3. Termination: All non-faulty processes eventually decide

FLP proves that in an asynchronous system with at least one faulty process, no algorithm can guarantee all three simultaneously.

graph TB
    subgraph "FLP Model"
        P1[Process 1]
        P2[Process 2]
        P3[Process 3]
        M1[Message Queue]
        M2[Message Queue]
        M3[Message Queue]

        P1 --> M2
        P2 --> M1
        P2 --> M3
        P3 --> M1
        P3 --> M2
    end

    Note(("No timeouts<br/>Messages delayed<br/>but eventually<br/>delivered"))

The Core Insight

The proof constructs an adversarial message scheduler. Given any protocol that appears to be working, the scheduler can delay critical messages to keep the system in a state where processes disagree but cannot gather enough information to decide.

The key is that in an asynchronous system, you cannot know whether a process has crashed or is just slow. The scheduler can exploit this uncertainty indefinitely by ensuring that whichever choice a process makes, there is always a plausible scenario where the other choice was correct.

The Bivalent Undecidable State

The proof hinges on the concept of bivalent states. A system state is bivalent if the final decision depends on future events that have not yet occurred. The proof shows that from any initial bivalent state, the adversary can always keep the system bivalent by carefully ordering message deliveries.

This means the system can be driven to a point where processes have conflicting information, yet cannot make progress because any decision might be wrong.

Concrete 2-Process Example

Consider two processes, P1 and P2, trying to agree on a binary value (0 or 1):

graph LR
    subgraph Initial
        A1[Initial State<br/>P1: initial<br/>P2: initial]
    end

    subgraph Bivalent
        B1[State A<br/>P1: decided 0<br/>P2: undecided]
        B2[State B<br/>P1: undecided<br/>P2: decided 1]
        B3[Truly Bivalent<br/>P1: if recv 0 -> 0<br/>P2: if recv 1 -> 1]
    end

    A1 -->|P1 receives 0| B1
    A1 -->|P1 receives 1| B2
    B1 -.->|scheduler delays<br/>critical msg| B3
    B2 -.->|scheduler delays<br/>critical msg| B3

The scheduler keeps the system in state B3 (bivalent) by:

  1. Delaying the message from P1 to P2 that would confirm value 0
  2. Delaying the message from P2 to P1 that would confirm value 1
  3. Neither process has enough information to commit to a decision

This adversarial scheduling can continue indefinitely because from either “almost-decided” state, there exists a plausible scenario where the other value would have been correct.

Adversarial Scheduler in Action

The FLP proof constructs an explicit adversarial scheduler that keeps the system undecided. Here is the gist with two nodes:

sequenceDiagram
    participant A as Node A
    participant S as Adversary<br/>Scheduler
    participant B as Node B

    rect rgb(50, 50, 80)
        Note over A,B: Round 1: A proposes
        A->>S: Send "propose 0"
        Note over S: Intercept message
        S-xB: Delay "propose 0"
        Note over A: A waits for ack<br/>from majority (only B)<br/>Incomplete info
    end

    rect rgb(80, 50, 50)
        Note over A,B: Round 2: B proposes
        B->>S: Send "propose 1"
        Note over S: Intercept message
        S-xA: Delay "propose 1"
        Note over B: B waits for ack<br/>from majority (only A)<br/>Incomplete info
    end

    rect rgb(50, 80, 50)
        Note over A,B: Round 3: Scheduler acts again
        S-xA: Deliver B's "propose 1"<br/>to A only
        S-xB: Deliver A's "propose 0"<br/>to B only
        Note over A: A now sees<br/>conflicting info
        Note over B: B now sees<br/>conflicting info
    end

    Note over A,B: Neither node has majority<br/>System stays bivalent<br/>Scheduler repeats indefinitely

The scheduler alternates which messages it delays. Each node sees partial information but never enough to commit. The system can remain in this state forever, proving that no algorithm can guarantee termination.

Formal Problem Reduction

FLP proves impossibility through reduction. If you could solve consensus in an asynchronous system with even one possible failure, you could solve the simpler problem of Byzantine agreement, which is provably impossible.

The reduction works like this. Given a system that solves consensus despite failures, you can construct a scheduler (an adversary) that forces the system into a bivalent state, where the final outcome depends on timing, by delaying just the right messages at just the right moments. Since no algorithm can avoid this, consensus is impossible in general asynchronous systems.

This is not a construction flaw. It is a fundamental result. Any algorithm that makes progress in all failure scenarios can be forced into indecision by an adversarial scheduler that carefully delays messages to keep the system in a bivalent state.

The proof’s central point: you cannot distinguish a slow node from a crashed one in an asynchronous network. This ambiguity is what makes consensus undecidable without additional assumptions, like timing bounds or synchrony.

What This Means Practically

FLP does not mean consensus is impossible. It means consensus algorithms must make a trade-off:

  • They can guarantee safety (agreement and validity) but not liveness (termination), or
  • They can guarantee liveness under certain conditions (like synchronous networks), or
  • They can use randomness to guarantee termination with high probability

Common Pitfalls / Anti-Patterns

Real systems use various strategies to work around FLP:

Synchrony assumptions: If you assume bounds on message delivery, you can use timeouts to detect failures and guarantee termination. The CAP theorem captures this trade-off: during partitions, you must choose between consistency (giving up availability) or availability (giving up consistency during partition recovery).

Probabilistic termination: Some algorithms, like Ben-Or’s randomized consensus, guarantee termination with probability 1. They may run for an unbounded time in the worst case, but the probability of that happening is zero.

Lease-based approaches: As discussed in my Leader Election post, lease-based approaches assume bounded clock skew and network delays. They provide eventual detection of failures but cannot guarantee instant detection.

Real-world Failure Scenarios

Understanding FLP is one thing; seeing it manifest in production is another. These are real scenarios where FLP’s impossibility rears its head:

Scenario 1: Network Partition with Split-Brain

Two data centers lose connectivity for 30 seconds. Both sides believe the other has failed. Without timing bounds:

  • Both sides attempt to elect a leader
  • Both sides start accepting writes locally
  • When the partition heals, conflict resolution must reconcile divergent states
  • This is why Paxos and Raft guarantee safety (no divergent state committed) but may block writes during partition

Scenario 2: Leader Crashes During Message Delay

A Raft leader sends a RequestVote message that gets delayed by 40 seconds due to network jitter:

  • The delayed message arrives after the election timeout on followers
  • A new leader may have already been elected
  • The old leader may have already committed entries
  • Raft handles this via term checking: stale messages are rejected

Scenario 3: Clock Skew Exploiting Asynchrony

In a system without TrueTime-style bounded clocks, an adversarial network can cause:

  • A node to believe its lease is still valid when the new leader has already taken over
  • Two nodes serving reads with stale data while believing they are the sole leader
  • This is why bounded clock uncertainty (partial synchrony) is critical for strong consistency

Scenario 4: GC Pauses and Message Reordering

A JVM garbage collection pause of 10+ seconds can cause:

  • Heartbeat timeouts to trigger leader elections
  • Messages that were in-flight to appear as if never sent
  • The system must tolerate such pauses without violating safety

Scenario 5: Byzantine Failure in Practice

While FLP originally addresses crash-stop failures, real systems face Byzantine (arbitrary malicious) behavior:

  • A node may claim to have received messages it never received
  • A corrupted node may send contradictory messages to different peers
  • PBFT and similar protocols handle this but require more than n/2 honest nodes

Partial Synchrony and the Dwork-Lynch Model

FLP assumes a fully asynchronous system with no timing assumptions. But what if we relax this slightly?

The Dwork-Lynch model (1988) introduced partial synchrony: the system is usually asynchronous, but eventually messages are delivered within some bounded time. This bounded time is not known a priori but exists.

graph LR
    subgraph "System Models"
        A[Asynchronous<br/>No timing<br/>assumptions] --> B[Partial Synchrony<br/>Eventually<br/>bounded delay]
        B --> C[Synchronous<br/>Known bounds<br/>always]
    end

In partial synchrony:

  • Initially: System behaves asynchronously (FLP applies)
  • Eventually: After unknown bound GST (Global Stabilization Time), timing guarantees hold
  • Result: Algorithms can guarantee liveness after GST while maintaining safety always

This is how practical systems work around FLP. They assume “the network will eventually be well-behaved” rather than “the network is always well-behaved.”

FLP in Practice: How Spanner and Paxos Handle This

Real systems using Paxos or Raft don’t violate FLP mathematically, but they work in practice because the assumptions underlying FLP don’t perfectly match reality:

Google Spanner uses TrueTime (bounded clock uncertainty) to provide external consistency. Spanner’s TrueTime API guarantees that uncertainty is bounded to at most 7 seconds. This means:

  • Spanner can use timeout-based leader election safely
  • After a leader failure, Spanner waits at least the maximum clock uncertainty before promoting a new leader
  • This effectively converts the system to partial synchrony during critical periods

Paxos-based systems (Chubby, Zookeeper) use leader leases:

  • The leader acquires a lease before processing requests
  • If the leader fails to renew, followers wait for the lease to expire before starting an election
  • This bounds the “asynchronous” period where the adversary could schedule messages adversarially

Raft (used in etcd, CockroachDB) relies on:

  • Heartbeat timeouts to detect leader failure
  • Election timeout randomization to break ties
  • Assumption that networks eventually deliver messages

These systems guarantee safety (no two nodes can be leaders simultaneously, no divergent state) but accept that during extended network partitions, liveness (ability to make progress) may be temporarily suspended.

In practice: “the network is usually reliable, and when it’s not, we sacrifice liveness for safety until it recovers.”

Relationship to CAP

FLP and CAP are related but distinct. CAP focuses on the trade-off between consistency and availability during network partitions. FLP focuses on the impossibility of guaranteed termination in asynchronous systems with failures.

Both results stem from the same underlying reality: in asynchronous systems, you cannot distinguish failures from delays. CAP accepts this and makes availability the default. FLP formalizes the impossibility and forces algorithm designers to be explicit about their assumptions.

My post on the CAP Theorem explores these trade-offs in more detail.

Why This Matters

FLP is an important result in distributed systems theory. It sets limits on what can be achieved and forces practitioners to be explicit about their assumptions.

Understanding FLP changes how you think about system design. Instead of trying to achieve impossible guarantees, you design systems that degrade gracefully under adversarial conditions.

The Broader Impact

Since 1985, researchers have built on FLP in various directions. The result has been extended to Byzantine failures (where nodes can behave arbitrarily maliciously), partial synchrony models, and different communication patterns.

The field of distributed consensus has grown substantially since FLP. Paxos, Raft, and many other algorithms have been developed with practical trade-offs in mind. FLP does not make these algorithms useless; it clarifies what they can and cannot guarantee.

Trade-off Comparison Table

The FLP result forces every consensus algorithm to make explicit trade-offs along safety/liveness dimensions. Here is how common approaches compare:

ApproachSystem ModelSafetyLivenessFailure HandlingPractical Example
FLP OriginalFully AsyncYesNoCrash-stopTheoretical only
PaxosPartial SyncYesYes*Crash-stopChubby, Zookeeper
RaftPartial SyncYesYes*Crash-stopetcd, CockroachDB
Ben-Or (Randomized)Fully AsyncYesProbabilisticCrash-stopTheoretical interest
PBFTPartial SyncYesYes*ByzantineZyzzyva, BFT-SMaRt
Dwork-LynchPartial SyncYesYes*Crash-stopMost production systems

*Liveness guaranteed only after the system stabilizes (GST in partial synchrony models).

Quick Recap Checklist

Before diving into implementation or deeper theory, verify your understanding of these core FLP concepts:

  • The FLP result applies specifically to fully asynchronous systems with crash-stop failures
  • Bivalent states are states where the final decision depends on future message ordering
  • An adversarial scheduler can keep a bivalent system undecided indefinitely
  • FLP proves you cannot simultaneously guarantee safety, liveness, and fault tolerance in async systems
  • Partial synchrony (Dwork-Lynch model) relaxes timing assumptions to allow liveness guarantees
  • Real systems like Paxos and Raft guarantee safety but may sacrifice liveness during partitions
  • Failure detectors and randomized algorithms are practical workarounds for the FLP impossibility
  • The FLP result does not mean consensus is impossible, only that guaranteed termination is impossible without additional assumptions

Interview Questions

1. What does the FLP impossibility result actually prove?

Expected answer points:

  • No consensus algorithm can guarantee termination in a fully asynchronous system with even one crash-stop failure
  • The proof shows safety and guaranteed liveness are incompatible under async timing assumptions
  • It is a mathematical impossibility, not a practical engineering limitation
2. What is a bivalent state in the context of the FLP proof?

Expected answer points:

  • A system state where the final decision depends on future message ordering that has not yet occurred
  • Both possible final values (0 or 1) remain plausible given current information
  • The adversary can keep the system bivalent by carefully ordering message deliveries
3. Why can an adversarial scheduler force a consensus algorithm to never terminate?

Expected answer points:

  • In async systems, you cannot distinguish a crashed process from a slow one without real-time clocks
  • The scheduler delays critical messages strategically, keeping each process in an "almost decided" but not committed state
  • From any bivalent state, the scheduler can always find a plausible scenario where the other decision value would have been correct
4. What are the three properties that consensus algorithms typically require?

Expected answer points:

  • Agreement: All non-faulty processes decide on the same value
  • Validity: The decided value must have been proposed by some process
  • Termination: All non-faulty processes eventually decide
5. How does the Dwork-Lynch partial synchrony model work around FLP?

Expected answer points:

  • The system is initially asynchronous with no timing guarantees (FLP applies)
  • Eventually, after a global stabilization time (GST), message delivery is bounded
  • Algorithms guarantee safety always, but liveness only after GST when timing assumptions hold
6. How does Google Spanner use TrueTime to work around FLP?

Expected answer points:

  • TrueTime provides bounded clock uncertainty (at most 7 seconds for Spanner)
  • This effectively converts the system to partial synchrony during critical periods
  • Spanner waits at least the maximum clock uncertainty before promoting a new leader after failure
7. Why does FLP not make consensus algorithms like Paxos or Raft useless?

Expected answer points:

  • FLP proves termination cannot be guaranteed, not that consensus is impossible
  • Paxos and Raft guarantee safety (no two nodes decide different values) under all conditions
  • They guarantee liveness under practical assumptions (eventual message delivery, partial synchrony)
8. What is the relationship between FLP and the CAP theorem?

Expected answer points:

  • Both stem from the same underlying reality: you cannot distinguish failures from delays in async networks
  • CAP focuses on consistency vs availability trade-offs during partitions
  • FLP formalizes the impossibility of guaranteed termination, forcing algorithm designers to be explicit about assumptions
9. How does Ben-Or's randomized consensus algorithm address FLP?

Expected answer points:

  • Ben-Or uses coin-flipping to achieve consensus with probability 1 in fully async systems
  • Termination is probabilistic rather than deterministic
  • The probability of non-termination in the worst case is zero, bypassing FLP's deterministic impossibility
10. What is the role of failure detectors in working around FLP?

Expected answer points:

  • Failure detectors provide suspicion information about process failures (possibly unreliable)
  • They introduce timing assumptions indirectly, converting async to partial synchrony
  • Algorithms using failure detectors can guarantee liveness because they can eventually suspect crashed processes
11. Why is the asynchronous communication model the key to FLP's impossibility?

Expected answer points:

  • Asynchronous message passing means no bounds on message delivery time
  • Without timing bounds, no process can know whether another has crashed or is simply slow
  • This uncertainty is what the adversarial scheduler exploits to keep the system bivalent indefinitely
12. What is the formal problem reduction used in the FLP proof?

Expected answer points:

  • FLP reduces Byzantine agreement to consensus, showing if consensus is solvable, Byzantine agreement would also be solvable
  • Since Byzantine agreement is provably impossible in async systems, consensus must also be impossible
  • The reduction constructs an adversarial scheduler that forces the system into a bivalent state
13. How do lease-based approaches like in Paxos handle FLP?

Expected answer points:

  • The leader acquires a lease before processing requests
  • If the leader fails to renew, followers wait for the lease to expire before starting election
  • This bounds the asynchronous window where adversarial scheduling could occur
14. What is the Global Stabilization Time (GST) in partial synchrony models?

Expected answer points:

  • GST is the unknown but finite time after which message delivery is bounded
  • Before GST, the system behaves asynchronously and FLP applies
  • After GST, timing assumptions hold and algorithms can guarantee liveness
15. Does FLP apply to Byzantine failures as well as crash-stop failures?

Expected answer points:

  • The original FLP result applies to crash-stop (only stops responding) failures
  • The result has been extended to Byzantine failures where nodes behave arbitrarily maliciously
  • Byzantine FLP variants show impossibility results are even stronger with more severe failure models
16. What is the practical implication of FLP for system designers?

Expected answer points:

  • You must be explicit about which guarantees your system provides under which conditions
  • Design for graceful degradation under adversarial network conditions
  • Accept that during extended network partitions, some form of liveness sacrifice is unavoidable
17. Why does Raft's election timeout randomization help with liveness?

Expected answer points:

  • Randomized timeouts reduce the chance of split-brain elections where multiple nodes become candidates simultaneously
  • It increases the probability that one node wins decisively before adversarial scheduling can interfere
  • Combined with heartbeat assumptions, it provides probabilistic liveness guarantees
18. Can randomized algorithms completely bypass FLP's impossibility?

Expected answer points:

  • Randomized algorithms like Ben-Or achieve consensus with probability 1, not deterministic certainty
  • They circumvent FLP by not requiring guaranteed termination in the worst case
  • The probability of non-termination is zero, but the running time could be unbounded in adversarial scenarios
19. What happens to consensus algorithms during network partitions in practice?

Expected answer points:

  • Safety is always maintained: no two nodes decide different values
  • Liveness may be sacrificed: the system cannot make progress until the partition heals
  • This is the CAP theorem in action: choosing consistency over availability during partitions
20. In the 2-process bivalent example, how does the scheduler keep the system undecided?

Expected answer points:

  • The scheduler delays messages from P1 to P2 confirming value 0 and from P2 to P1 confirming value 1
  • Neither process has enough information to commit to a decision
  • The system reaches a truly bivalent state where whichever choice a process makes, the other value remains plausible

Further Reading

  • Original Paper: Fischer, Lynch, and Paterson - "Impossibility of Distributed Consensus with One Faulty Process" (1985)
  • Partial Synchrony: Dwork and Lynch - "Consensus in the Presence of Partial Synchrony" (1988)
  • Randomized Consensus: Ben-Or - "Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols"
  • Paxos: Lamport - "The Part-Time Parliament" (1998)
  • Raft: Ongaro and Ousterhout - "In Search of an Understandable Consensus Algorithm" (2014)
  • TrueTime: Google Spanner - "Spanner: Google's Globally Distributed Database" (2012)
## Conclusion

The FLP impossibility result tells us something fundamental about the nature of distributed systems. We cannot have both safety and guaranteed liveness in asynchronous systems with failures. Every consensus algorithm makes explicit trade-offs based on this reality.

Understanding FLP does not make distributed systems programming easier, but it does make your reasoning about these systems more sound. When something seems too good to be true, FLP reminds us why.

For more on consistency trade-offs, see my post on Consistency Models.

Category

Related Posts

Leader Election in Distributed Systems

Leader election is the process of designating a single node as the coordinator among a set of distributed nodes, critical for consensus protocols.

#distributed-systems #leader-election #consensus

Paxos Consensus Algorithm

Paxos is a family of consensus algorithms for achieving agreement in distributed systems despite failures, pioneered by Leslie Lamport.

#distributed-systems #consensus #paxos

View-Stamped Replication

View-Stamped Replication (VSR) is a distributed consensus protocol that uses views and timestamps to achieve agreement in asynchronous systems.

#distributed-systems #consensus #view-stamped