Two-Phase Commit Protocol Explained

Learn the two-phase commit protocol for distributed transactions: prepare and commit phases, coordinator role, failure handling, and why 2PC is rarely used.



Two-phase commit (2PC) is a protocol for achieving atomic commitment across multiple database nodes. All nodes either commit or roll back together. No partial states.

The appeal is clear: distributed transactions that behave like single-node transactions. In practice, 2PC has failure modes that make it problematic, and most systems use alternatives like the saga pattern instead.

This post explains how 2PC works, the failure scenarios that cause trouble, and why it fell out of favor.

The Basic Protocol

2PC works in two phases: prepare and commit.

Phase 1: Prepare

The coordinator sends a prepare message to all participants. Each participant votes:

  • Vote Yes: The participant has completed its work, its locks are held, and it is ready to commit.
  • Vote No: Something went wrong. The participant rolls back and releases locks.
    Coordinator -> Participant 1: Prepare
    Coordinator -> Participant 2: Prepare
    Coordinator -> Participant 3: Prepare
    Participant 1 -> Coordinator: Vote Yes
    Participant 2 -> Coordinator: Vote Yes
    Participant 3 -> Coordinator: Vote No

If any participant votes no, the coordinator sends rollback to all. Done.
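The participant's side of the prepare phase can be sketched as follows. This is a minimal in-memory model, assuming a hypothetical Participant class whose write-ahead log is just a Python list; a real participant would fsync the log and hold real locks:

```python
class Participant:
    """Minimal model of a 2PC participant's vote logic (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.wal = []           # stands in for a durable write-ahead log
        self.locks_held = False

    def prepare(self, txn_id, do_work):
        try:
            do_work()                   # execute the transaction's local work
            self.locks_held = True      # locks stay held until the decision arrives
            self.wal.append(('TXN_PREPARE', txn_id))  # log BEFORE voting yes
            return 'yes'
        except Exception:
            self.rollback(txn_id)       # local failure: vote no and clean up
            return 'no'

    def rollback(self, txn_id):
        self.locks_held = False
        self.wal.append(('TXN_ROLLBACK', txn_id))

p = Participant('p1')
vote = p.prepare('txn-1', lambda: None)          # work succeeds: yes vote, locks held
failing = Participant('p2')
vote2 = failing.prepare('txn-2', lambda: 1 / 0)  # work fails: no vote, rolled back
```

The key ordering is inside prepare: the TXN_PREPARE record must hit the log before the yes vote leaves, so a restarted participant can see that it promised to commit.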

Phase 2: Commit

If all participants vote yes, the coordinator sends commit to all. Each participant commits its local transaction and releases locks.

    Coordinator -> Participant 1: Commit
    Coordinator -> Participant 2: Commit
    Coordinator -> Participant 3: Commit
    Participant 1 -> Coordinator: Committed
    Participant 2 -> Coordinator: Committed
    Participant 3 -> Coordinator: Committed

All participants must commit for the transaction to succeed.

The Coordinator

The coordinator is the linchpin of 2PC. It manages the voting, decides the outcome, and tells participants what to do.

In practice, the coordinator is often a separate service or embedded in the application. Some databases implement 2PC internally for distributed transactions (MySQL, for example, exposes it through XA transactions).

The coordinator must be reliable. If it crashes, participants may be left in limbo.

Embedded vs External Coordinator

| Aspect | Embedded Coordinator | External Coordinator |
| --- | --- | --- |
| Definition | Coordinator logic runs within the application process initiating the transaction | Coordinator is a separate service/process that manages 2PC across participants |
| Example | Application code directly coordinates MySQL/PostgreSQL XA transactions | Dedicated service like Narayana, WebSphere LTCO, or a custom coordinator service |
| Advantages | Simpler deployment; no additional service to operate; lower latency for single-site transactions | Coordinator survives application crashes; easier to monitor and manage; better for distributed multi-service transactions |
| Disadvantages | Application crash kills the coordinator; harder to monitor; participants may block if the application is overloaded | Additional dependency; network hop to coordinator; requires HA configuration for the coordinator |
| Coordinator SPOF | Yes — if the application crashes, the coordinator dies with it | Yes — unless configured in HA mode with consensus (e.g., Paxos-based) |
| When to Use | Single application coordinating local database partitions; simple XA transactions within one database cluster | Multi-service transactions; transactions spanning multiple applications; when the coordinator must survive application restarts |
| Recovery | Application restart recovers coordinator state from persistent storage | Coordinator service restart with state persisted to disk or replicated via consensus |

Failure Handling

2PC has several failure scenarios that cause problems.

Participant Crashes Before Voting

If a participant crashes before voting, the coordinator times out and sends rollback. Other participants roll back. The crashed participant, when it recovers, also rolls back (or has nothing to do if it had not started work).

This case is manageable.

Participant Crashes After Voting Yes but Before Commit

This is the problematic case. The participant has voted yes, which means it has acquired locks and completed its work. It is waiting for the coordinator to tell it to commit or rollback.

If the coordinator crashes before delivering the decision to this participant, it is stuck and holds its locks indefinitely. Meanwhile, other participants may already have committed (if the commit message reached them before the crash) or rolled back (if they received rollback or voted no).

The participant cannot decide unilaterally. It cannot commit (because the coordinator might have sent rollback to others). It cannot rollback (because the coordinator might have sent commit to others).

This is the blocking problem. The participant must wait for the coordinator to recover.

Coordinator Crashes

If the coordinator crashes after collecting yes votes but before sending commit, participants are stuck. They must wait for the coordinator to recover.

If the coordinator crashes and never recovers, participants are stuck forever. In practice, systems use timeouts and manual intervention.

Network Partition

If a participant cannot reach the coordinator, it blocks. If the coordinator cannot reach a participant, it must treat it as a no vote and send rollback.

Network partitions are common in distributed systems. 2PC does not handle them gracefully.

Why 2PC Is Rarely Used

The blocking problem is the fatal flaw. In a system that must stay available (which is most systems), blocking on coordinator recovery is unacceptable.

Other protocols address this. Three-phase commit (3PC) adds a pre-commit phase to avoid blocking, but it only works under a synchronous model with bounded message delays and no network partitions, so it is rarely used in practice either.

The saga pattern avoids locking entirely. Instead of atomic commitment, saga uses compensating transactions. If step 3 fails, steps 1 and 2 are undone. The system remains available. The cost is temporary inconsistency. For saga pattern details, see Saga Pattern.
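By contrast, a saga replaces the two phases with a sequence of local steps, each paired with a compensating action; a failure triggers the compensations in reverse order. A minimal sketch, where the step/compensation pairs are purely illustrative:

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, compensate in reverse."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for comp in reversed(completed):  # undo already-completed steps
                comp()
            return 'compensated'
    return 'completed'

log = []
steps = [
    (lambda: log.append('reserve'), lambda: log.append('unreserve')),
    (lambda: log.append('charge'),  lambda: log.append('refund')),
    (lambda: 1 / 0,                 lambda: None),  # third step fails
]
result = run_saga(steps)
```

Note what is traded away: between a step and its compensation, other transactions can observe the intermediate state, which is exactly the temporary inconsistency mentioned above.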

Event sourcing stores events, not state. The event log is the source of truth. Rebuilding state is a matter of replaying events. This avoids distributed transactions altogether.

Database-Specific Implementations

Most real-world 2PC implementations don’t look like the textbook version. The blocking problem is well-known, so database teams built around it. Spanner, CockroachDB, and TiDB each took a different path — but they share one idea: make the coordinator state survive crashes.

Google Spanner

Spanner uses Paxos groups as the coordinator. Instead of one node deciding and hoping it doesn’t crash, the group agrees on the decision. When a transaction touches multiple participant groups:

  1. The coordinator leader proposes a prepare timestamp via Paxos
  2. Participants acknowledge via Paxos consensus
  3. The coordinator leader assigns a commit timestamp and broadcasts via Paxos

The coordinator is the Paxos leader — if it dies, another node picks up immediately. Spanner's TrueTime adds another trick: clock uncertainty is explicitly bounded, so a leader can wait out the uncertainty window (commit wait) before making a commit visible, preserving external consistency.

Spanner’s Paxos integration solves the blocking problem by replicating coordinator state across all group members. Any surviving group member can drive recovery after a failure.

# Simplified Spanner-style Paxos-coordinated 2PC
class PaxosCoordinator:
    def __init__(self, participants, paxos_group):
        self.participants = participants
        self.paxos_group = paxos_group  # Replicated coordinator state

    def execute(self, transaction):
        # Propose commit decision to Paxos group (not single node)
        proposal = {'decision': 'commit', 'txn': transaction, 'timestamp': None}

        # Paxos consensus among group members
        decided_value = self.paxos_group.propose(proposal)

        if decided_value['decision'] == 'commit':
            # Broadcast commit to all participants
            for p in self.participants:
                p.commit(transaction)
        else:
            for p in self.participants:
                p.rollback(transaction)

CockroachDB

CockroachDB went with distributed commit: the transaction record itself is stored in a Raft-replicated range, and the leaseholder of that record acts as the coordinator rather than a separate process. When a transaction commits:

  1. The transaction record gets updated with a commit timestamp
  2. This update goes into the Raft log across all replica nodes
  3. Participants read the committed transaction record to figure out what happened

The outcome lives in Raft, so it survives coordinator crashes. No single-node SPOF.

TiDB

TiDB splits things across three components: the TiDB Server handles coordination, PD (the Placement Driver) hands out timestamps, and TiKV stores everything with Raft replication underneath. Fail the TiDB Server and the transaction record survives in TiKV.

| Database | Coordinator Type | Consensus Layer | Blocking Problem Solved? |
| --- | --- | --- | --- |
| Spanner | Paxos group leader | Paxos | Yes — replicated coordinator state |
| CockroachDB | Transaction leaseholder | Raft | Yes — transaction record replicated |
| TiDB | TiDB Server + PD | Raft (via TiKV) | Yes — commit decision in replicated Raft log |

When to Use / When Not to Use 2PC

Use 2PC when:

  • You need atomic commitment across multiple nodes with strong consistency
  • Your network is stable and transactions are short (minimizing blocking window)
  • You are using a distributed database with Paxos-based coordinator (like Spanner) that eliminates the coordinator single point of failure
  • Occasional blocking during coordinator recovery is acceptable
  • Regulatory requirements demand ACID semantics across distributed nodes

Avoid 2PC when:

  • Your system must stay available during network partitions
  • Transactions are long-running (blocking window becomes unacceptable)
  • You have multiple independent services with separate databases
  • You cannot guarantee coordinator reliability (no HA configuration)
  • The blocking problem is unacceptable for your availability requirements

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Participant crashes after voting Yes but before commit | Participant blocks indefinitely waiting for the decision | Use Paxos-based consensus for the coordinator; implement timeout-based lock release |
| Coordinator crashes after collecting Yes votes | Participants block waiting for the decision | Run the coordinator in HA with consensus; use persistent state for recovery |
| Network partition during commit phase | Some participants commit, others roll back | Design for partition tolerance; use the saga pattern for compensation |
| Participant recovery with uncertain state | Participant doesn't know whether to commit or roll back | Use the WAL to determine state; implement a recovery protocol; log the prepare state |
| Coordinator timeout misconfiguration | Premature rollback or indefinite waiting | Set timeout values based on actual network characteristics |
| All participants vote No | Coordinator sends rollback; this is normal | Ensure proper error handling; investigate the root cause of the No votes |

Recovery Protocol Implementation

Crash recovery in 2PC is where things get uncomfortable. When a participant restarts with a transaction in flight, it first has to reconstruct what it was doing while it was gone, and it cannot rely on the coordinator for that (the coordinator might be down too).

The trick is that participants log their state to WAL before doing anything else. If you voted Yes, that goes to WAL before the vote message leaves. This means a restarted participant can reconstruct its state by reading its own log.

WAL Entry Types

- TXN_PREPARE: Voted Yes, locks acquired, waiting for coordinator decision
- TXN_COMMIT: Decision came back commit, locks released
- TXN_ROLLBACK: Decision came back rollback (or voted No), locks released

Participant Recovery Procedure

On restart, scan for transactions stuck in PREPARED state. For each one:

  1. Check the coordinator’s decision — either from the coordinator’s recovery log (old style) or from the replicated Paxos/Raft log (modern systems)
  2. If the decision was COMMIT, commit locally and release locks
  3. If the decision was ROLLBACK, roll back and release locks
  4. If the coordinator is still unreachable, stay in prepared state and wait
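The recovery scan above can be sketched like this. The WAL is modeled as a list of (type, txn_id) records and the coordinator lookup is a stubbed callback that may return None when unreachable; both are assumptions for illustration:

```python
def recover_participant(wal, ask_coordinator):
    """Find transactions stuck in PREPARED state and resolve them if possible."""
    state = {}
    for record, txn_id in wal:     # replay the log: last record wins per txn
        state[txn_id] = record

    resolved = {}
    for txn_id, last in state.items():
        if last != 'TXN_PREPARE':
            continue                        # already committed or rolled back
        decision = ask_coordinator(txn_id)  # None means coordinator unreachable
        if decision == 'commit':
            resolved[txn_id] = 'committed'
        elif decision == 'rollback':
            resolved[txn_id] = 'rolled back'
        else:
            resolved[txn_id] = 'blocked'    # stay prepared, keep waiting
    return resolved

wal = [('TXN_PREPARE', 't1'), ('TXN_COMMIT', 't1'),
       ('TXN_PREPARE', 't2'), ('TXN_PREPARE', 't3')]
decisions = {'t2': 'commit'}                # coordinator knows about t2 only
result = recover_participant(wal, decisions.get)
```

Here t1 already committed before the crash, t2 can be resolved from the coordinator's decision, and t3 illustrates the blocking case: it stays prepared with locks held.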

Coordinator Recovery Procedure

The coordinator must write its decision to WAL before telling participants. If the decision reaches any participant before it is persisted and the coordinator then crashes, recovery cannot reconstruct what was decided, and you have an inconsistent mess that needs manual intervention.

On restart, the coordinator reads its WAL for unresolved transactions and resends the decision to participants that haven’t acknowledged.

The Critical Rule

The coordinator MUST write its decision to persistent storage before sending it to participants. If the coordinator crashes after collecting yes votes but before persisting, and the participants have already committed, the system enters an inconsistent state that requires manual intervention.
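The rule can be expressed as an ordering constraint in the coordinator's commit path. In this sketch the durable log and the participant channels are modeled as plain lists (stand-ins for an fsync'd WAL and network sends):

```python
class DurableCoordinator:
    """Sketch of the persist-before-send rule (storage modeled as a list)."""

    def __init__(self, participants, storage):
        self.participants = participants  # each is a "channel" (a list here)
        self.storage = storage            # stands in for the coordinator's WAL

    def decide_and_commit(self, txn_id, votes):
        decision = 'commit' if all(v == 'yes' for v in votes) else 'rollback'
        # CRITICAL: persist the decision before any participant hears it
        self.storage.append((txn_id, decision))
        for p in self.participants:
            p.append((txn_id, decision))  # "send" the decision to a participant
        return decision

storage, p1, p2 = [], [], []
coord = DurableCoordinator([p1, p2], storage)
outcome = coord.decide_and_commit('t1', ['yes', 'yes'])
```

Because the append to storage happens first, a crash at any later point leaves a record that recovery can resend; a crash before it leaves nothing sent, so the transaction can safely be aborted.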

Modern databases sidestep this entirely — the commit decision lives in a replicated log, so it survives coordinator crashes without anyone having to remember anything.

Observability Checklist

Metrics

  • Transaction commit rate vs rollback rate
  • Prepare phase duration
  • Commit phase duration
  • Coordinator crash count and recovery time
  • Participant blocking time (time in prepared state)
  • Transaction timeout rate

Logs

  • Log prepare and vote from each participant
  • Log coordinator decision (commit or rollback)
  • Log participant state changes
  • Include transaction ID and participant IDs for correlation
  • Log timeout and recovery events

Alerts

  • Alert when participants are in prepared state too long
  • Alert on coordinator crash
  • Alert when transaction timeout rate increases
  • Alert when rollback rate exceeds baseline

Security Checklist

  • Authenticate all participants in the transaction
  • Authorize which participants can join which transactions
  • Encrypt coordinator-to-participant communication
  • Audit log all commit and rollback decisions
  • Validate transaction inputs at each participant
  • Protect coordinator management APIs

Common Pitfalls / Anti-Patterns

Ignoring the blocking problem: The coordinator crash scenario where participants block indefinitely is not theoretical. In production systems with network issues, it happens. Do not use 2PC without a plan for this.

Long-running transactions with 2PC: The longer the transaction, the longer participants hold locks. Long transactions increase contention and blocking window. Keep transactions short.

Using 2PC for microservice transactions: 2PC across independent services with separate databases is generally the wrong approach. Services should own their data and use saga or choreography for cross-service consistency.

Not planning for coordinator failure: If the coordinator is a single point of failure, you will eventually have a bad day. Use HA coordinator configuration or consensus-based coordination.

Underestimating timeout configuration: Timeouts that are too short cause unnecessary rollbacks. Timeouts that are too long cause long blocking. Tune based on actual network characteristics.
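One way to make the tuning concrete (a heuristic assumption, not a standard): derive the prepare timeout from a high percentile of observed round-trip times, with a floor to avoid pathological values.

```python
def prepare_timeout(rtt_samples_ms, multiplier=4, floor_ms=50):
    """Pick a prepare timeout from observed RTTs: multiplier x ~p99, with a floor."""
    s = sorted(rtt_samples_ms)
    p99 = s[min(len(s) - 1, int(0.99 * len(s)))]  # crude high-percentile pick
    return max(multiplier * p99, floor_ms)

samples_ms = [10, 12, 11, 15, 30, 12, 13, 11, 14, 12]
timeout_ms = prepare_timeout(samples_ms)  # dominated by the 30 ms outlier
```

The multiplier and floor are knobs: a larger multiplier tolerates jitter at the cost of a longer blocking window, which is exactly the trade-off described above.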

Assuming atomicity provides isolation: 2PC provides atomicity but not isolation. Another transaction can observe the effects of a distributed transaction on one participant before they become visible on another. Configure appropriate isolation levels at each participant.

Quick Recap

    Coordinator -> Participant 1: Prepare
    Coordinator -> Participant 2: Prepare
    Participant 1 -> Coordinator: Vote Yes
    Participant 2 -> Coordinator: Vote Yes
    Coordinator -> Participant 1: Commit
    Coordinator -> Participant 2: Commit

Key Points

  • 2PC provides atomic commitment across distributed nodes (all commit or all rollback)
  • Phase 1 (Prepare): Coordinator asks participants to vote; participants acquire locks and vote
  • Phase 2 (Commit): If all vote Yes, coordinator sends commit; participants commit and release locks
  • Fatal flaw: Participants block indefinitely if coordinator crashes after prepare
  • Most microservice architectures use saga instead due to 2PC’s blocking problem

Production Checklist

# Two-Phase Commit Production Readiness

- [ ] Coordinator running in HA mode (Paxos-based if possible)
- [ ] Transaction timeout values tuned for network characteristics
- [ ] Participant recovery protocol implemented and tested
- [ ] Monitoring for prepared-state blocking time
- [ ] Alerting for coordinator crashes
- [ ] Rollback rate and latency monitored
- [ ] Audit logging for all commit/rollback decisions
- [ ] Clear escalation path for stuck transactions
- [ ] Documented why 2PC was chosen over saga

Implementation Sketch

A simplified 2PC coordinator looks like this:

class TwoPhaseCommitCoordinator:
    def __init__(self, participants):
        self.participants = participants
        self.state = 'init'

    def execute(self, transaction):
        # Phase 1 (prepare): collect a vote from every participant
        votes = []
        for participant in self.participants:
            try:
                vote = participant.prepare(transaction)
            except Exception:
                vote = 'no'  # an unreachable or failed participant counts as a No vote
            votes.append(vote)

        if any(vote == 'no' for vote in votes):
            self.state = 'rollback'
            for participant in self.participants:
                participant.rollback(transaction)
            return 'rolled back'

        # Phase 2 (commit): everyone voted yes
        self.state = 'commit'
        for participant in self.participants:
            participant.commit(transaction)
        return 'committed'

Real implementations must handle timeouts, crashes, and retries.

Conclusion

Two-phase commit provides atomic commitment across distributed nodes. It works in theory. In practice, the blocking problem makes it unsuitable for high-availability systems.

When a participant votes yes and the coordinator crashes before sending the decision, the participant blocks indefinitely. This is unacceptable in systems that must stay available.

Most microservice architectures use saga instead. Saga sacrifices isolation and atomicity for availability. The trade-off is usually the right one. For cross-service business transactions, eventual consistency with compensating transactions is more practical than 2PC’s blocking semantics.


Related Posts

Distributed Transactions: ACID vs BASE Trade-offs

Explore distributed transaction patterns: ACID vs BASE trade-offs, two-phase commit, saga pattern, eventual consistency, and choosing the right model.

#distributed-systems #transactions #consistency

Google Spanner: Globally Distributed SQL at Scale

Google Spanner architecture combining relational model with horizontal scalability, TrueTime API for global consistency, and F1 database implementation.

#distributed-systems #databases #google

Gossip Protocol: Scalable State Propagation

Learn how gossip protocols enable scalable state sharing in distributed systems. Covers epidemic broadcast, anti-entropy, SWIM failure detection, and real-world applications like Cassandra and Consul.

#distributed-systems #gossip-protocol #consistency