Two-Phase Commit Protocol Explained
Learn the two-phase commit protocol for distributed transactions: prepare and commit phases, coordinator role, failure handling, and why 2PC is rarely used.
Two-phase commit (2PC) is a protocol for achieving atomic commitment across multiple database nodes. All nodes either commit or rollback together. No partial states.
The appeal is clear: distributed transactions that behave like single-node transactions. In practice, 2PC has failure modes that make it problematic, and most systems use alternatives such as the saga pattern instead.
This post explains how 2PC works, the failure scenarios that cause trouble, and why it fell out of favor.
The Basic Protocol
2PC works in two phases: prepare and commit.
Phase 1: Prepare
The coordinator sends a prepare message to all participants. Each participant votes:
- Vote Yes: The participant has completed its work, its locks are held, and it is ready to commit.
- Vote No: Something went wrong. The participant rolls back and releases locks.
```mermaid
sequenceDiagram
    participant C as Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    participant P3 as Participant 3
    C->>P1: Prepare
    C->>P2: Prepare
    C->>P3: Prepare
    P1-->>C: Vote Yes
    P2-->>C: Vote Yes
    P3-->>C: Vote No
```
If any participant votes no, the coordinator sends rollback to all. Done.
Phase 2: Commit
If all participants vote yes, the coordinator sends commit to all. Each participant commits its local transaction and releases locks.
```mermaid
sequenceDiagram
    participant C as Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    participant P3 as Participant 3
    C->>P1: Commit
    C->>P2: Commit
    C->>P3: Commit
    P1-->>C: Committed
    P2-->>C: Committed
    P3-->>C: Committed
```
All participants must commit for the transaction to succeed.
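The participant's side of both phases can be sketched as follows. This is a minimal illustration, not any real database's API; `Participant`, `acquire_locks`, and the vote strings are all hypothetical names:

```python
# Minimal participant sketch: vote in phase 1, apply the decision in phase 2.
# All names (Participant, acquire_locks, state values) are illustrative.
class Participant:
    def __init__(self, name):
        self.name = name
        self.state = 'init'

    def prepare(self, txn):
        # Phase 1: do the work and acquire locks, then promise to commit.
        try:
            self.acquire_locks(txn)
            self.state = 'prepared'  # locks stay held until a decision arrives
            return 'yes'
        except Exception:
            self.rollback(txn)
            return 'no'

    def commit(self, txn):
        # Phase 2: the coordinator decided commit.
        assert self.state == 'prepared'
        self.state = 'committed'     # apply changes, release locks

    def rollback(self, txn):
        self.state = 'rolled_back'   # undo work, release locks

    def acquire_locks(self, txn):
        pass  # placeholder: a real system locks the rows touched by txn
```

The key invariant is that a `yes` vote is a binding promise: from `prepared` onward, the participant may no longer change its mind on its own.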
The Coordinator
The coordinator is the linchpin of 2PC. It manages the voting, decides the outcome, and tells participants what to do.
In practice, the coordinator is often a separate service or embedded in the application. Some databases implement 2PC internally; MySQL, for example, runs an internal two-phase commit between the InnoDB storage engine and the binary log to keep them consistent.
The coordinator must be reliable. If it crashes, participants may be left in limbo.
Embedded vs External Coordinator
| Aspect | Embedded Coordinator | External Coordinator |
|---|---|---|
| Definition | Coordinator logic runs within the application process initiating the transaction | Coordinator is a separate service/process that manages 2PC across participants |
| Example | Application code directly coordinates MySQL/PostgreSQL XA transactions | Dedicated transaction manager such as Narayana or Atomikos, or a custom coordinator service |
| Advantages | Simpler deployment; no additional service to operate; lower latency for single-site transactions | Coordinator survives application crashes; easier to monitor and manage; better for distributed multi-service transactions |
| Disadvantages | Application crash kills the coordinator; harder to monitor; participants may block if application is overloaded | Additional dependency; network hop to coordinator; requires HA configuration for coordinator |
| Coordinator SPOF | Yes — if the application crashes, coordinator dies with it | Yes — unless configured in HA mode with consensus (e.g., Paxos-based) |
| When to Use | Single application coordinating local database partitions; simple XA transactions within one database cluster | Multi-service transactions; transactions spanning multiple applications; when coordinator must survive application restarts |
| Recovery | Application restart recovers coordinator state from persistent storage | Coordinator service restart with state persisted to disk or replicated via consensus |
Failure Handling
2PC has several failure scenarios that cause problems.
Participant Crashes Before Voting
If a participant crashes before voting, the coordinator times out and sends rollback. Other participants roll back. The crashed participant, when it recovers, also rolls back (or has nothing to do if it had not started work).
This case is manageable.
Participant Crashes After Voting Yes but Before Commit
This is the problematic case. The participant has voted yes, which means it has acquired locks and completed its work. It is waiting for the coordinator to tell it to commit or rollback.
If the coordinator crashes before sending the decision, this participant is stuck. It holds its locks indefinitely. Other participants may be in different states: any that received the decision before the coordinator crashed have already committed or rolled back, while the rest are blocked in the same way.
The participant cannot decide unilaterally. It cannot commit (because the coordinator might have sent rollback to others). It cannot rollback (because the coordinator might have sent commit to others).
This is the blocking problem. The participant must wait for the coordinator to recover.
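The in-doubt state can be made concrete with a tiny sketch (hypothetical names): a prepared participant's only safe move without a coordinator decision is to keep waiting.

```python
# Why a prepared participant is stuck: neither commit nor rollback is safe
# without the coordinator's decision. Purely illustrative.
def resolve_in_doubt(state, coordinator_decision):
    if state != 'prepared':
        return state                      # nothing in doubt
    if coordinator_decision is None:
        return 'blocked'                  # must hold locks and keep waiting
    return 'committed' if coordinator_decision == 'commit' else 'rolled_back'
```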
Coordinator Crashes
If the coordinator crashes after collecting yes votes but before sending commit, participants are stuck. They must wait for the coordinator to recover.
If the coordinator crashes and never recovers, participants are stuck forever. In practice, systems use timeouts and manual intervention.
Network Partition
If a participant cannot reach the coordinator, it blocks. If the coordinator cannot reach a participant during the prepare phase, it treats the silence as a no vote and sends rollback. If the partition happens during the commit phase, the coordinator must keep retrying, because the decision is already final.
Network partitions are common in distributed systems. 2PC does not handle them gracefully.
Why 2PC Is Rarely Used
The blocking problem is the fatal flaw. In a system that must stay available (which is most systems), blocking on coordinator recovery is unacceptable.
Other protocols address this. Three-phase commit (3PC) adds a pre-commit phase to eliminate blocking, but it still assumes a synchronous system and makes stronger network assumptions. It is rarely used in practice either.
The saga pattern avoids locking entirely. Instead of atomic commitment, saga uses compensating transactions. If step 3 fails, steps 1 and 2 are undone. The system remains available. The cost is temporary inconsistency. For saga pattern details, see Saga Pattern.
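A minimal saga sketch makes the contrast concrete. Each step pairs an action with a compensating action, and a failure triggers compensation in reverse order; the names here are illustrative, not a framework API:

```python
# Saga sketch: each step carries a compensating action. On failure, the
# completed steps are undone in reverse order. Illustrative only.
def run_saga(steps):
    """steps: list of (action, compensation) pairs of callables."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):   # undo completed steps, newest first
                comp()
            return 'compensated'
    return 'completed'
```

Note what is given up: between a step and its compensation, other transactions can observe the intermediate state. That is the temporary inconsistency mentioned above.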
Event sourcing stores events, not state. The event log is the source of truth. Rebuilding state is a matter of replaying events. This avoids distributed transactions altogether.
Database-Specific Implementations
Most real-world 2PC implementations don’t look like the textbook version. The blocking problem is well-known, so database teams built around it. Spanner, CockroachDB, and TiDB each took a different path — but they share one idea: make the coordinator state survive crashes.
Google Spanner
Spanner uses Paxos groups as the coordinator. Instead of one node deciding and hoping it doesn’t crash, the group agrees on the decision. When a transaction touches multiple participant groups:
- The coordinator leader proposes a prepare timestamp via Paxos
- Participants acknowledge via Paxos consensus
- The coordinator leader assigns a commit timestamp and broadcasts via Paxos
The coordinator is the Paxos leader — if it dies, another node picks up immediately. Spanner’s TrueTime adds another trick: even during uncertainty windows, participants know how long to wait before giving up.
Spanner’s Paxos integration solves the blocking problem by replicating coordinator state across all group members. Any surviving group member can drive recovery after a failure.
```python
# Simplified Spanner-style Paxos-coordinated 2PC (illustrative, not real API)
class PaxosCoordinator:
    def __init__(self, participants, paxos_group):
        self.participants = participants
        self.paxos_group = paxos_group  # replicated coordinator state

    def execute(self, transaction):
        # Phase 1: collect prepare votes from the participant groups
        votes = [p.prepare(transaction) for p in self.participants]
        decision = 'commit' if all(v == 'yes' for v in votes) else 'rollback'
        # Replicate the decision via Paxos consensus among group members,
        # so any surviving member can drive recovery after a crash
        decided = self.paxos_group.propose(
            {'decision': decision, 'txn': transaction, 'timestamp': None})
        # Phase 2: broadcast the agreed decision to all participants
        for p in self.participants:
            if decided['decision'] == 'commit':
                p.commit(transaction)
            else:
                p.rollback(transaction)
```
CockroachDB
CockroachDB went with distributed commit: the transaction record itself is a key-value entry replicated through Raft. The leaseholder of the range holding the transaction record acts as the coordinator, not a separate process. When a transaction commits:
- The transaction record gets updated with a commit timestamp
- This update goes into the Raft log across all replica nodes
- Participants read the committed transaction record to figure out what happened
The outcome lives in Raft, so it survives coordinator crashes. No single-node SPOF.
TiDB
TiDB splits things across three components: the TiDB Server handles coordination, PD hands out timestamps, and TiKV stores everything with Raft replication underneath. Fail the TiDB Server and the transaction record survives in TiKV.
| Database | Coordinator Type | Consensus Layer | Blocking Problem Solved? |
|---|---|---|---|
| Spanner | Paxos group leader | Paxos | Yes — replicated coordinator state |
| CockroachDB | Transaction leaseholder | Raft | Yes — transaction record replicated |
| TiDB | TiDB Server + PD | Raft (via TiKV) | Yes — commit decision in replicated Raft log |
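The shared idea in the table above can be sketched abstractly: the commit decision is an entry in a replicated log, so any surviving replica can answer a recovering participant. This is an illustration of the concept, not any of these databases' actual interfaces:

```python
# The common trick: record the commit decision in a replicated log so the
# outcome survives any single coordinator crash. Illustrative only; a real
# system runs Paxos/Raft instead of naively copying to every replica.
class ReplicatedDecisionLog:
    def __init__(self, num_replicas=3):
        self.replicas = [{} for _ in range(num_replicas)]

    def record(self, txn_id, decision):
        # Stand-in for consensus: write the decision to every replica.
        for replica in self.replicas:
            replica[txn_id] = decision

    def lookup(self, txn_id):
        # Any surviving replica can answer a recovering participant.
        for replica in self.replicas:
            if txn_id in replica:
                return replica[txn_id]
        return None  # no decision recorded: the transaction is still in doubt
```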
When to Use / When Not to Use 2PC
Use 2PC when:
- You need atomic commitment across multiple nodes with strong consistency
- Your network is stable and transactions are short (minimizing blocking window)
- You are using a distributed database with Paxos-based coordinator (like Spanner) that eliminates the coordinator single point of failure
- Occasional blocking during coordinator recovery is acceptable
- Regulatory requirements demand ACID semantics across distributed nodes
Avoid 2PC when:
- Your system must stay available during network partitions
- Transactions are long-running (blocking window becomes unacceptable)
- You have multiple independent services with separate databases
- You cannot guarantee coordinator reliability (no HA configuration)
- The blocking problem is unacceptable for your availability requirements
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Participant crashes after voting Yes but before commit | Participant blocks indefinitely waiting for decision | Use Paxos-based consensus for coordinator; implement timeout-based lock release |
| Coordinator crashes after collecting Yes votes | Participants block waiting for decision | Run coordinator in HA with consensus; use persistent state for recovery |
| Network partition during commit phase | Some participants commit, others rollback | Design for partition tolerance; use saga pattern for compensation |
| Participant recovery with uncertain state | Participant doesn’t know whether to commit or rollback | Use WAL to determine state; implement recovery protocol; log prepare state |
| Coordinator timeout misconfiguration | Premature rollback or indefinite waiting | Set appropriate timeout values based on network characteristics |
| All participants vote No | Coordinator sends rollback; this is normal protocol behavior | Ensure proper error handling; investigate the root cause of the No votes |
Recovery Protocol Implementation
Crash recovery in 2PC is where things get uncomfortable. When a participant restarts with a transaction in flight, it has to figure out what happened while it was gone — without asking the coordinator (because the coordinator might be down too).
The trick is that participants log their state to WAL before doing anything else. If you voted Yes, that goes to WAL before the vote message leaves. This means a restarted participant can reconstruct its state by reading its own log.
WAL Entry Types
- TXN_PREPARE: Voted Yes, locks acquired, waiting for coordinator decision
- TXN_COMMIT: Decision came back commit, locks released
- TXN_ROLLBACK: Decision came back rollback (or voted No), locks released
Participant Recovery Procedure
On restart, scan for transactions stuck in PREPARED state. For each one:
- Check the coordinator’s decision — either from the coordinator’s recovery log (old style) or from the replicated Paxos/Raft log (modern systems)
- If the decision was COMMIT, commit locally and release locks
- If the decision was ROLLBACK, roll back and release locks
- If the coordinator is still unreachable, stay in prepared state and wait
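The procedure above can be sketched as a WAL replay followed by decision lookup. Entry and decision names follow the list above; the function itself (`recover`) and its inputs are illustrative:

```python
# Participant recovery sketch: replay the WAL, then resolve transactions that
# were still PREPARED at crash time by consulting the coordinator's decision
# log (here a plain dict standing in for the replicated log). Illustrative.
def recover(wal_entries, decision_log):
    state = {}
    for txn_id, entry in wal_entries:           # replay in log order
        if entry == 'TXN_PREPARE':
            state[txn_id] = 'prepared'
        elif entry in ('TXN_COMMIT', 'TXN_ROLLBACK'):
            state[txn_id] = 'resolved'
    outcome = {}
    for txn_id, s in state.items():
        if s != 'prepared':
            continue                            # nothing to do
        decision = decision_log.get(txn_id)     # may be unreachable: None
        if decision == 'commit':
            outcome[txn_id] = 'committed'
        elif decision == 'rollback':
            outcome[txn_id] = 'rolled_back'
        else:
            outcome[txn_id] = 'still_prepared'  # hold locks, keep waiting
    return outcome
```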
Coordinator Recovery Procedure
The coordinator must write its decision to WAL before telling participants. If it crashes after collecting Yes votes but before persisting anything, recovery can safely presume abort. The dangerous case is sending commit to some participants before persisting the decision: if the coordinator then crashes, it has no record of a decision that others have already acted on.
On restart, the coordinator reads its WAL for unresolved transactions and resends the decision to participants that haven’t acknowledged.
The Critical Rule
The coordinator MUST write its decision to persistent storage before sending it to participants. If the coordinator crashes after collecting yes votes but before persisting, and the participants have already committed, the system enters an inconsistent state that requires manual intervention.
Modern databases sidestep this entirely — the commit decision lives in a replicated log, so it survives coordinator crashes without anyone having to remember anything.
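The critical rule reduces to a strict write-then-broadcast ordering, sketched below. The `wal` list stands in for a durable, fsynced write-ahead log; all names are illustrative:

```python
# Coordinator decision sketch: persist the decision durably BEFORE telling
# any participant. `wal` stands in for a durable (fsynced) write-ahead log.
def decide_and_broadcast(txn_id, votes, wal, participants):
    decision = 'commit' if all(v == 'yes' for v in votes) else 'rollback'
    # Step 1: make the decision durable first (a real system fsyncs here).
    wal.append({'txn': txn_id, 'decision': decision})
    # Step 2: only now is it safe to notify participants.
    for notify in participants:
        notify(txn_id, decision)
    return decision
```

If the process dies between step 1 and step 2, restart recovery reads the WAL and resends the decision, which is exactly the coordinator recovery procedure described above.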
Observability Checklist
Metrics
- Transaction commit rate vs rollback rate
- Prepare phase duration
- Commit phase duration
- Coordinator crash count and recovery time
- Participant blocking time (time in prepared state)
- Transaction timeout rate
Logs
- Log prepare and vote from each participant
- Log coordinator decision (commit or rollback)
- Log participant state changes
- Include transaction ID and participant IDs for correlation
- Log timeout and recovery events
Alerts
- Alert when participants are in prepared state too long
- Alert on coordinator crash
- Alert when transaction timeout rate increases
- Alert when rollback rate exceeds baseline
Security Checklist
- Authenticate all participants in the transaction
- Authorize which participants can join which transactions
- Encrypt coordinator-to-participant communication
- Audit log all commit and rollback decisions
- Validate transaction inputs at each participant
- Protect coordinator management APIs
Common Pitfalls / Anti-Patterns
Ignoring the blocking problem: The coordinator crash scenario where participants block indefinitely is not theoretical. In production systems with network issues, it happens. Do not use 2PC without a plan for this.
Long-running transactions with 2PC: The longer the transaction, the longer participants hold locks. Long transactions increase contention and blocking window. Keep transactions short.
Using 2PC for microservice transactions: 2PC across independent services with separate databases is generally the wrong approach. Services should own their data and use saga or choreography for cross-service consistency.
Not planning for coordinator failure: If the coordinator is a single point of failure, you will eventually have a bad day. Use HA coordinator configuration or consensus-based coordination.
Underestimating timeout configuration: Timeouts that are too short cause unnecessary rollbacks. Timeouts that are too long cause long blocking. Tune based on actual network characteristics.
Assuming atomicity provides isolation: 2PC provides atomic commitment but not isolation. During the commit phase, a concurrent reader may see one participant's committed changes while another participant has not yet committed. Use appropriate isolation levels or locking to prevent this.
Quick Recap
```mermaid
sequenceDiagram
    participant C as Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    C->>P1: Prepare
    C->>P2: Prepare
    P1-->>C: Vote Yes
    P2-->>C: Vote Yes
    C->>P1: Commit
    C->>P2: Commit
```
Key Points
- 2PC provides atomic commitment across distributed nodes (all commit or all rollback)
- Phase 1 (Prepare): Coordinator asks participants to vote; participants acquire locks and vote
- Phase 2 (Commit): If all vote Yes, coordinator sends commit; participants commit and release locks
- Fatal flaw: Participants block indefinitely if coordinator crashes after prepare
- Most microservice architectures use saga instead due to 2PC’s blocking problem
Production Checklist
# Two-Phase Commit Production Readiness
- [ ] Coordinator running in HA mode (Paxos-based if possible)
- [ ] Transaction timeout values tuned for network characteristics
- [ ] Participant recovery protocol implemented and tested
- [ ] Monitoring for prepared-state blocking time
- [ ] Alerting for coordinator crashes
- [ ] Rollback rate and latency monitored
- [ ] Audit logging for all commit/rollback decisions
- [ ] Clear escalation path for stuck transactions
- [ ] Documented why 2PC was chosen over saga
Implementation Sketch
A simplified 2PC coordinator looks like this:
```python
class TwoPhaseCommitCoordinator:
    def __init__(self, participants):
        self.participants = participants
        self.state = 'init'

    def execute(self, transaction):
        # Phase 1: Prepare
        votes = []
        for participant in self.participants:
            vote = participant.prepare(transaction)
            votes.append(vote)
        if any(vote == 'no' for vote in votes):
            self.state = 'rollback'
            for participant in self.participants:
                participant.rollback(transaction)
            return 'rolled back'
        # Phase 2: Commit
        self.state = 'commit'
        for participant in self.participants:
            participant.commit(transaction)
        return 'committed'
```
Real implementations must handle timeouts, crashes, and retries.
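As one example of that hardening, the prepare phase can bound how long it waits for each vote and treat silence or failure as a No. This is a sketch with hypothetical names (`collect_votes`, the `timeout` parameter), not production code:

```python
# Timeout-aware prepare sketch: an unresponsive or crashed participant counts
# as a No vote, so the coordinator can roll back instead of waiting forever.
from concurrent.futures import ThreadPoolExecutor

def collect_votes(participants, transaction, timeout=2.0):
    """Ask every participant to prepare; silence or failure counts as a No."""
    votes = []
    with ThreadPoolExecutor(max_workers=max(1, len(participants))) as pool:
        futures = [pool.submit(p.prepare, transaction) for p in participants]
        for future in futures:
            try:
                votes.append(future.result(timeout=timeout))
            except Exception:
                votes.append('no')  # timed out, crashed, or unreachable
    return votes
```

A real coordinator must also abandon or cancel the in-flight call rather than let a slow participant delay shutdown, and must still send rollback to participants that did vote Yes.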
Conclusion
Two-phase commit provides atomic commitment across distributed nodes. It works in theory. In practice, the blocking problem makes it unsuitable for high-availability systems.
When a participant votes yes and the coordinator crashes before sending the decision, the participant blocks indefinitely. This is unacceptable in systems that must stay available.
Most microservice architectures use saga instead. Saga sacrifices isolation and atomicity for availability. The trade-off is usually the right one. For cross-service business transactions, eventual consistency with compensating transactions is more practical than 2PC’s blocking semantics.