Two-Phase Commit Protocol Explained
Learn the two-phase commit protocol for distributed transactions: prepare and commit phases, coordinator role, failure handling, and why 2PC is rarely used.
Two-phase commit (2PC) is a protocol for achieving atomic commitment across multiple database nodes. All nodes either commit or rollback together. No partial states.
The appeal is clear: distributed transactions that behave like single-node transactions. In practice, 2PC has failure modes that make it problematic, and most systems use alternatives such as the saga pattern instead.
This post explains how 2PC works, the failure scenarios that cause trouble, and why it fell out of favor.
The Basic Protocol
2PC works in two phases: prepare and commit.
Phase 1: Prepare
The coordinator sends a prepare message to all participants. Each participant votes:
- Vote Yes: The participant has completed its work, its locks are held, and it is ready to commit.
- Vote No: Something went wrong. The participant rolls back and releases locks.
```mermaid
sequenceDiagram
    participant C as Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    participant P3 as Participant 3
    C->>P1: Prepare
    C->>P2: Prepare
    C->>P3: Prepare
    P1-->>C: Vote Yes
    P2-->>C: Vote Yes
    P3-->>C: Vote No
```
If any participant votes no, the coordinator sends rollback to all. Done.
Phase 2: Commit
If all participants vote yes, the coordinator sends commit to all. Each participant commits its local transaction and releases locks.
```mermaid
sequenceDiagram
    participant C as Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    participant P3 as Participant 3
    C->>P1: Commit
    C->>P2: Commit
    C->>P3: Commit
    P1-->>C: Committed
    P2-->>C: Committed
    P3-->>C: Committed
```
All participants must commit for the transaction to succeed.
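The participant's side of both phases can be sketched as follows. This is a minimal illustration, not any real database's API; `Participant`, `acquire_locks`, and the vote strings are all hypothetical names:

```python
# Minimal participant sketch: vote in phase 1, apply the decision in phase 2.
# All names (Participant, acquire_locks, state values) are illustrative.
class Participant:
    def __init__(self, name):
        self.name = name
        self.state = 'init'

    def prepare(self, txn):
        # Phase 1: do the work and acquire locks, then promise to commit.
        try:
            self.acquire_locks(txn)
            self.state = 'prepared'  # locks stay held until a decision arrives
            return 'yes'
        except Exception:
            self.rollback(txn)
            return 'no'

    def commit(self, txn):
        # Phase 2: the coordinator decided commit.
        assert self.state == 'prepared'
        self.state = 'committed'     # apply changes, release locks

    def rollback(self, txn):
        self.state = 'rolled_back'   # undo work, release locks

    def acquire_locks(self, txn):
        pass  # placeholder: a real system locks the rows touched by txn
```

The key invariant is that a `yes` vote is a binding promise: from `prepared` onward, the participant may no longer change its mind on its own.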
The Coordinator
The coordinator is the linchpin of 2PC. It manages the voting, decides the outcome, and tells participants what to do.
In practice, the coordinator is often a separate service or embedded in the application. Some databases implement 2PC internally; MySQL, for example, runs an internal two-phase commit between the InnoDB storage engine and the binary log to keep them consistent.
The coordinator must be reliable. If it crashes, participants may be left in limbo.
Embedded vs External Coordinator
| Aspect | Embedded Coordinator | External Coordinator |
|---|---|---|
| Definition | Coordinator logic runs within the application process initiating the transaction | Coordinator is a separate service/process that manages 2PC across participants |
| Example | Application code directly coordinates MySQL/PostgreSQL XA transactions | Dedicated transaction manager such as Narayana or Atomikos, or a custom coordinator service |
| Advantages | Simpler deployment; no additional service to operate; lower latency for single-site transactions | Coordinator survives application crashes; easier to monitor and manage; better for distributed multi-service transactions |
| Disadvantages | Application crash kills the coordinator; harder to monitor; participants may block if application is overloaded | Additional dependency; network hop to coordinator; requires HA configuration for coordinator |
| Coordinator SPOF | Yes — if the application crashes, coordinator dies with it | Yes — unless configured in HA mode with consensus (e.g., Paxos-based) |
| When to Use | Single application coordinating local database partitions; simple XA transactions within one database cluster | Multi-service transactions; transactions spanning multiple applications; when coordinator must survive application restarts |
| Recovery | Application restart recovers coordinator state from persistent storage | Coordinator service restart with state persisted to disk or replicated via consensus |
Failure Handling
2PC has several failure scenarios that cause problems.
Participant Crashes Before Voting
If a participant crashes before voting, the coordinator times out and sends rollback. Other participants roll back. The crashed participant, when it recovers, also rolls back (or has nothing to do if it had not started work).
This case is manageable.
Participant Crashes After Voting Yes but Before Commit
This is the problematic case. The participant has voted yes, which means it has acquired locks and completed its work. It is waiting for the coordinator to tell it to commit or rollback.
If the coordinator crashes before sending the decision, this participant is stuck. It holds its locks indefinitely. Other participants may be in different states: any that received the decision before the coordinator crashed have already committed or rolled back, while the rest are blocked in the same way.
The participant cannot decide unilaterally. It cannot commit (because the coordinator might have sent rollback to others). It cannot rollback (because the coordinator might have sent commit to others).
This is the blocking problem. The participant must wait for the coordinator to recover.
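The in-doubt state can be made concrete with a tiny sketch (hypothetical names): a prepared participant's only safe move without a coordinator decision is to keep waiting.

```python
# Why a prepared participant is stuck: neither commit nor rollback is safe
# without the coordinator's decision. Purely illustrative.
def resolve_in_doubt(state, coordinator_decision):
    if state != 'prepared':
        return state                      # nothing in doubt
    if coordinator_decision is None:
        return 'blocked'                  # must hold locks and keep waiting
    return 'committed' if coordinator_decision == 'commit' else 'rolled_back'
```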
Coordinator Crashes
If the coordinator crashes after collecting yes votes but before sending commit, participants are stuck. They must wait for the coordinator to recover.
If the coordinator crashes and never recovers, participants are stuck forever. In practice, systems use timeouts and manual intervention.
Network Partition
If a participant cannot reach the coordinator, it blocks. If the coordinator cannot reach a participant during the prepare phase, it treats the silence as a no vote and sends rollback. If the partition happens during the commit phase, the coordinator must keep retrying, because the decision is already final.
Network partitions are common in distributed systems. 2PC does not handle them gracefully.
Why 2PC Is Rarely Used
The blocking problem is the fatal flaw. In a system that must stay available (which is most systems), blocking on coordinator recovery is unacceptable.
Other protocols address this. Three-phase commit (3PC) adds a pre-commit phase to eliminate blocking, but it still assumes a synchronous system and makes stronger network assumptions. It is rarely used in practice either.
The saga pattern avoids locking entirely. Instead of atomic commitment, saga uses compensating transactions. If step 3 fails, steps 1 and 2 are undone. The system remains available. The cost is temporary inconsistency. For saga pattern details, see Saga Pattern.
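A minimal saga sketch makes the contrast concrete. Each step pairs an action with a compensating action, and a failure triggers compensation in reverse order; the names here are illustrative, not a framework API:

```python
# Saga sketch: each step carries a compensating action. On failure, the
# completed steps are undone in reverse order. Illustrative only.
def run_saga(steps):
    """steps: list of (action, compensation) pairs of callables."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):   # undo completed steps, newest first
                comp()
            return 'compensated'
    return 'completed'
```

Note what is given up: between a step and its compensation, other transactions can observe the intermediate state. That is the temporary inconsistency mentioned above.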
Event sourcing stores events, not state. The event log is the source of truth. Rebuilding state is a matter of replaying events. This avoids distributed transactions altogether.
Database-Specific Implementations
Most real-world 2PC implementations don’t look like the textbook version. The blocking problem is well-known, so database teams built around it. Spanner, CockroachDB, and TiDB each took a different path — but they share one idea: make the coordinator state survive crashes.
Google Spanner
Spanner uses Paxos groups as the coordinator. Instead of one node deciding and hoping it doesn’t crash, the group agrees on the decision. When a transaction touches multiple participant groups:
- The coordinator leader proposes a prepare timestamp via Paxos
- Participants acknowledge via Paxos consensus
- The coordinator leader assigns a commit timestamp and broadcasts via Paxos
The coordinator is the Paxos leader — if it dies, another node picks up immediately. Spanner’s TrueTime adds another trick: even during uncertainty windows, participants know how long to wait before giving up.
Spanner’s Paxos integration solves the blocking problem by replicating coordinator state across all group members. Any surviving group member can drive recovery after a failure.
```python
# Simplified Spanner-style Paxos-coordinated 2PC (illustrative, not real API)
class PaxosCoordinator:
    def __init__(self, participants, paxos_group):
        self.participants = participants
        self.paxos_group = paxos_group  # replicated coordinator state

    def execute(self, transaction):
        # Phase 1: collect prepare votes from the participant groups
        votes = [p.prepare(transaction) for p in self.participants]
        decision = 'commit' if all(v == 'yes' for v in votes) else 'rollback'
        # Replicate the decision via Paxos consensus among group members,
        # so any surviving member can drive recovery after a crash
        decided = self.paxos_group.propose(
            {'decision': decision, 'txn': transaction, 'timestamp': None})
        # Phase 2: broadcast the agreed decision to all participants
        for p in self.participants:
            if decided['decision'] == 'commit':
                p.commit(transaction)
            else:
                p.rollback(transaction)
```
CockroachDB
CockroachDB went with distributed commit: the transaction record itself is a key-value entry replicated through Raft. The leaseholder of the range holding the transaction record acts as the coordinator, not a separate process. When a transaction commits:
- The transaction record gets updated with a commit timestamp
- This update goes into the Raft log across all replica nodes
- Participants read the committed transaction record to figure out what happened
The outcome lives in Raft, so it survives coordinator crashes. No single-node SPOF.
TiDB
TiDB splits things across three components: the TiDB Server handles coordination, PD hands out timestamps, and TiKV stores everything with Raft replication underneath. Fail the TiDB Server and the transaction record survives in TiKV.
| Database | Coordinator Type | Consensus Layer | Blocking Problem Solved? |
|---|---|---|---|
| Spanner | Paxos group leader | Paxos | Yes — replicated coordinator state |
| CockroachDB | Transaction leaseholder | Raft | Yes — transaction record replicated |
| TiDB | TiDB Server + PD | Raft (via TiKV) | Yes — commit decision in replicated Raft log |
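The shared idea in the table above can be sketched abstractly: the commit decision is an entry in a replicated log, so any surviving replica can answer a recovering participant. This is an illustration of the concept, not any of these databases' actual interfaces:

```python
# The common trick: record the commit decision in a replicated log so the
# outcome survives any single coordinator crash. Illustrative only; a real
# system runs Paxos/Raft instead of naively copying to every replica.
class ReplicatedDecisionLog:
    def __init__(self, num_replicas=3):
        self.replicas = [{} for _ in range(num_replicas)]

    def record(self, txn_id, decision):
        # Stand-in for consensus: write the decision to every replica.
        for replica in self.replicas:
            replica[txn_id] = decision

    def lookup(self, txn_id):
        # Any surviving replica can answer a recovering participant.
        for replica in self.replicas:
            if txn_id in replica:
                return replica[txn_id]
        return None  # no decision recorded: the transaction is still in doubt
```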
When to Use / When Not to Use 2PC
Use 2PC when:
- You need atomic commitment across multiple nodes with strong consistency
- Your network is stable and transactions are short (minimizing blocking window)
- You are using a distributed database with Paxos-based coordinator (like Spanner) that eliminates the coordinator single point of failure
- Occasional blocking during coordinator recovery is acceptable
- Regulatory requirements demand ACID semantics across distributed nodes
Avoid 2PC when:
- Your system must stay available during network partitions
- Transactions are long-running (blocking window becomes unacceptable)
- You have multiple independent services with separate databases
- You cannot guarantee coordinator reliability (no HA configuration)
- The blocking problem is unacceptable for your availability requirements
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Participant crashes after voting Yes but before commit | Participant blocks indefinitely waiting for decision | Use Paxos-based consensus for coordinator; implement timeout-based lock release |
| Coordinator crashes after collecting Yes votes | Participants block waiting for decision | Run coordinator in HA with consensus; use persistent state for recovery |
| Network partition during commit phase | Some participants commit, others rollback | Design for partition tolerance; use saga pattern for compensation |
| Participant recovery with uncertain state | Participant doesn’t know whether to commit or rollback | Use WAL to determine state; implement recovery protocol; log prepare state |
| Coordinator timeout misconfiguration | Premature rollback or indefinite waiting | Set appropriate timeout values based on network characteristics |
| All participants vote No | Coordinator sends rollback; this is normal protocol behavior | Ensure proper error handling; investigate the root cause of the No votes |
Recovery Protocol Implementation
Crash recovery in 2PC is where things get uncomfortable. When a participant restarts with a transaction in flight, it has to figure out what happened while it was gone — without asking the coordinator (because the coordinator might be down too).
The trick is that participants log their state to WAL before doing anything else. If you voted Yes, that goes to WAL before the vote message leaves. This means a restarted participant can reconstruct its state by reading its own log.
WAL Entry Types
- TXN_PREPARE: Voted Yes, locks acquired, waiting for coordinator decision
- TXN_COMMIT: Decision came back commit, locks released
- TXN_ROLLBACK: Decision came back rollback (or voted No), locks released
Participant Recovery Procedure
On restart, scan for transactions stuck in PREPARED state. For each one:
- Check the coordinator’s decision — either from the coordinator’s recovery log (old style) or from the replicated Paxos/Raft log (modern systems)
- If the decision was COMMIT, commit locally and release locks
- If the decision was ROLLBACK, roll back and release locks
- If the coordinator is still unreachable, stay in prepared state and wait
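The procedure above can be sketched as a WAL replay followed by decision lookup. Entry and decision names follow the list above; the function itself (`recover`) and its inputs are illustrative:

```python
# Participant recovery sketch: replay the WAL, then resolve transactions that
# were still PREPARED at crash time by consulting the coordinator's decision
# log (here a plain dict standing in for the replicated log). Illustrative.
def recover(wal_entries, decision_log):
    state = {}
    for txn_id, entry in wal_entries:           # replay in log order
        if entry == 'TXN_PREPARE':
            state[txn_id] = 'prepared'
        elif entry in ('TXN_COMMIT', 'TXN_ROLLBACK'):
            state[txn_id] = 'resolved'
    outcome = {}
    for txn_id, s in state.items():
        if s != 'prepared':
            continue                            # nothing to do
        decision = decision_log.get(txn_id)     # may be unreachable: None
        if decision == 'commit':
            outcome[txn_id] = 'committed'
        elif decision == 'rollback':
            outcome[txn_id] = 'rolled_back'
        else:
            outcome[txn_id] = 'still_prepared'  # hold locks, keep waiting
    return outcome
```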
Coordinator Recovery Procedure
The coordinator must write its decision to WAL before telling participants. If it crashes after collecting Yes votes but before persisting anything, recovery can safely presume abort. The dangerous case is sending commit to some participants before persisting the decision: if the coordinator then crashes, it has no record of a decision that others have already acted on.
On restart, the coordinator reads its WAL for unresolved transactions and resends the decision to participants that haven’t acknowledged.
The Critical Rule
The coordinator MUST write its decision to persistent storage before sending it to participants. If the coordinator crashes after collecting yes votes but before persisting, and the participants have already committed, the system enters an inconsistent state that requires manual intervention.
Modern databases sidestep this entirely — the commit decision lives in a replicated log, so it survives coordinator crashes without anyone having to remember anything.
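The critical rule reduces to a strict write-then-broadcast ordering, sketched below. The `wal` list stands in for a durable, fsynced write-ahead log; all names are illustrative:

```python
# Coordinator decision sketch: persist the decision durably BEFORE telling
# any participant. `wal` stands in for a durable (fsynced) write-ahead log.
def decide_and_broadcast(txn_id, votes, wal, participants):
    decision = 'commit' if all(v == 'yes' for v in votes) else 'rollback'
    # Step 1: make the decision durable first (a real system fsyncs here).
    wal.append({'txn': txn_id, 'decision': decision})
    # Step 2: only now is it safe to notify participants.
    for notify in participants:
        notify(txn_id, decision)
    return decision
```

If the process dies between step 1 and step 2, restart recovery reads the WAL and resends the decision, which is exactly the coordinator recovery procedure described above.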
Observability Checklist
Metrics
- Transaction commit rate vs rollback rate
- Prepare phase duration
- Commit phase duration
- Coordinator crash count and recovery time
- Participant blocking time (time in prepared state)
- Transaction timeout rate
Logs
- Log prepare and vote from each participant
- Log coordinator decision (commit or rollback)
- Log participant state changes
- Include transaction ID and participant IDs for correlation
- Log timeout and recovery events
Alerts
- Alert when participants are in prepared state too long
- Alert on coordinator crash
- Alert when transaction timeout rate increases
- Alert when rollback rate exceeds baseline
Security Checklist
- Authenticate all participants in the transaction
- Authorize which participants can join which transactions
- Encrypt coordinator-to-participant communication
- Audit log all commit and rollback decisions
- Validate transaction inputs at each participant
- Protect coordinator management APIs
Common Pitfalls / Anti-Patterns
Ignoring the blocking problem: The coordinator crash scenario where participants block indefinitely is not theoretical. In production systems with network issues, it happens. Do not use 2PC without a plan for this.
Long-running transactions with 2PC: The longer the transaction, the longer participants hold locks. Long transactions increase contention and blocking window. Keep transactions short.
Using 2PC for microservice transactions: 2PC across independent services with separate databases is generally the wrong approach. Services should own their data and use saga or choreography for cross-service consistency.
Not planning for coordinator failure: If the coordinator is a single point of failure, you will eventually have a bad day. Use HA coordinator configuration or consensus-based coordination.
Underestimating timeout configuration: Timeouts that are too short cause unnecessary rollbacks. Timeouts that are too long cause long blocking. Tune based on actual network characteristics.
Assuming atomicity provides isolation: 2PC provides atomic commitment but not isolation. During the commit phase, a concurrent reader may see one participant's committed changes while another participant has not yet committed. Use appropriate isolation levels or locking to prevent this.
Quick Recap
```mermaid
sequenceDiagram
    participant C as Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    C->>P1: Prepare
    C->>P2: Prepare
    P1-->>C: Vote Yes
    P2-->>C: Vote Yes
    C->>P1: Commit
    C->>P2: Commit
```
Key Points
- 2PC provides atomic commitment across distributed nodes (all commit or all rollback)
- Phase 1 (Prepare): Coordinator asks participants to vote; participants acquire locks and vote
- Phase 2 (Commit): If all vote Yes, coordinator sends commit; participants commit and release locks
- Fatal flaw: Participants block indefinitely if coordinator crashes after prepare
- Most microservice architectures use saga instead due to 2PC’s blocking problem
Production Checklist
# Two-Phase Commit Production Readiness
- [ ] Coordinator running in HA mode (Paxos-based if possible)
- [ ] Transaction timeout values tuned for network characteristics
- [ ] Participant recovery protocol implemented and tested
- [ ] Monitoring for prepared-state blocking time
- [ ] Alerting for coordinator crashes
- [ ] Rollback rate and latency monitored
- [ ] Audit logging for all commit/rollback decisions
- [ ] Clear escalation path for stuck transactions
- [ ] Documented why 2PC was chosen over saga
Implementation Sketch
A simplified 2PC coordinator looks like this:
```python
class TwoPhaseCommitCoordinator:
    def __init__(self, participants):
        self.participants = participants
        self.state = 'init'

    def execute(self, transaction):
        # Phase 1: Prepare
        votes = []
        for participant in self.participants:
            vote = participant.prepare(transaction)
            votes.append(vote)
        if any(vote == 'no' for vote in votes):
            self.state = 'rollback'
            for participant in self.participants:
                participant.rollback(transaction)
            return 'rolled back'
        # Phase 2: Commit
        self.state = 'commit'
        for participant in self.participants:
            participant.commit(transaction)
        return 'committed'
```
Real implementations must handle timeouts, crashes, and retries.
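As one example of that hardening, the prepare phase can bound how long it waits for each vote and treat silence or failure as a No. This is a sketch with hypothetical names (`collect_votes`, the `timeout` parameter), not production code:

```python
# Timeout-aware prepare sketch: an unresponsive or crashed participant counts
# as a No vote, so the coordinator can roll back instead of waiting forever.
from concurrent.futures import ThreadPoolExecutor

def collect_votes(participants, transaction, timeout=2.0):
    """Ask every participant to prepare; silence or failure counts as a No."""
    votes = []
    with ThreadPoolExecutor(max_workers=max(1, len(participants))) as pool:
        futures = [pool.submit(p.prepare, transaction) for p in participants]
        for future in futures:
            try:
                votes.append(future.result(timeout=timeout))
            except Exception:
                votes.append('no')  # timed out, crashed, or unreachable
    return votes
```

A real coordinator must also abandon or cancel the in-flight call rather than let a slow participant delay shutdown, and must still send rollback to participants that did vote Yes.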
Conclusion
Two-phase commit provides atomic commitment across distributed nodes. It works in theory. In practice, the blocking problem makes it unsuitable for high-availability systems.
When a participant votes yes and the coordinator crashes before sending the decision, the participant blocks indefinitely. This is unacceptable in systems that must stay available.
Most microservice architectures use saga instead. Saga sacrifices isolation and atomicity for availability. The trade-off is usually the right one. For cross-service business transactions, eventual consistency with compensating transactions is more practical than 2PC’s blocking semantics.