TCC: Try-Confirm-Cancel Pattern for Distributed Transactions
Learn the Try-Confirm-Cancel pattern for distributed transactions. Explore how TCC differs from 2PC and saga, with implementation examples and real-world use cases.
Introduction
TCC works by splitting every operation into three phases. Try reserves what you need. Confirm makes it permanent. Cancel releases what you reserved. Each service implements these three operations, and a coordinator orchestrates the flow.
sequenceDiagram
Coordinator->>ServiceA: Try(Reserve 5 units)
ServiceA->>Coordinator: TryConfirmed
Coordinator->>ServiceB: Try(Reserve $100)
ServiceB->>Coordinator: TryConfirmed
Coordinator->>ServiceA: Confirm
ServiceA->>Coordinator: Confirmed
Coordinator->>ServiceB: Confirm
ServiceB->>Coordinator: Confirmed
When Try fails for any participant:
sequenceDiagram
Coordinator->>ServiceA: Try(Reserve 5 units)
ServiceA->>Coordinator: TryConfirmed
Coordinator->>ServiceB: Try(Reserve $100)
ServiceB->>Coordinator: TryFailed(Review carefully)
Coordinator->>ServiceA: Cancel
ServiceA->>Coordinator: Cancelled
Idempotency matters at every phase. Services must handle duplicate Try calls gracefully. Confirm and Cancel also need to be idempotent, since the coordinator may retry if calls fail or get lost.
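As a minimal sketch of a duplicate-tolerant Try, the participant below remembers which transactions already hold a reservation and returns the same answer on a repeat call. The `InventoryParticipant` class, its in-memory dictionary, and the string results are assumptions for illustration; a real participant would keep this state in durable storage.

```python
class InventoryParticipant:
    """Hypothetical in-memory participant; a real one would use durable storage."""

    def __init__(self, available):
        self.available = available   # units not yet held by anyone
        self.holds = {}              # transaction_id -> units held tentatively

    def try_reserve(self, transaction_id, units):
        # A duplicate Try for the same transaction must not reserve twice.
        if transaction_id in self.holds:
            return "TryConfirmed"
        if units > self.available:
            return "TryFailed"
        self.available -= units
        self.holds[transaction_id] = units
        return "TryConfirmed"


participant = InventoryParticipant(available=5)
first = participant.try_reserve("txn-1", 5)
duplicate = participant.try_reserve("txn-1", 5)  # retried by the coordinator
```

Keying the hold by transaction ID is what makes the second call harmless: it is answered from recorded state instead of being evaluated against the remaining stock.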
Topic-Specific Deep Dives
Conceptual Foundations
TCC vs Two-Phase Commit
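TCC and 2PC both use a coordinator that asks every participant to prepare before finalizing, but they work differently. 2PC has two phases (Prepare, then Commit or Rollback), while TCC has three operations (Try, then Confirm or Cancel). In 2PC, participants lock resources during Prepare and hold those locks until Commit or Rollback. This blocking is the price of atomicity. In TCC, Try reserves but does not lock. Other operations can proceed against the same data, aware of pending reservations but not blocked by them.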
The difference shows up under contention. Two competing transactions trying to reserve the same inventory: 2PC locks the rows and makes one wait or fail. TCC shows the second transaction a pending reservation and lets it decide what to do, whether that means queuing or picking an alternative.
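The contention behaviour can be sketched roughly as follows. The `Stock` class and the shape of its return value are hypothetical. The point is only that the second Try returns immediately with information about pending holds, instead of waiting on a lock.

```python
from dataclasses import dataclass, field


@dataclass
class Stock:
    """Hypothetical stock record that exposes pending holds to callers."""
    total: int
    pending: dict = field(default_factory=dict)  # transaction_id -> units

    def free_units(self):
        return self.total - sum(self.pending.values())

    def try_reserve(self, transaction_id, units):
        # No lock is held between Try and Confirm; the hold is just data.
        if units > self.free_units():
            # Report how much is tied up in pending holds so the caller can decide.
            return {"ok": False, "pending_units": sum(self.pending.values())}
        self.pending[transaction_id] = units
        return {"ok": True, "pending_units": sum(self.pending.values())}


stock = Stock(total=5)
first = stock.try_reserve("txn-A", 5)
second = stock.try_reserve("txn-B", 3)  # returns immediately, never waits

# The caller of txn-B chooses: queue, or fall back to an alternative item.
decision = "proceed" if second["ok"] else "queue_or_pick_alternative"
```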
2PC also assumes participants share a transaction manager. TCC works across heterogeneous systems because each service implements its own Try/Confirm/Cancel logic. Payment service, inventory service, shipping service, all different stacks, all coordinatable.
For a deeper look at 2PC and its limitations, see Two-Phase Commit.
TCC vs Basic Saga
Saga and TCC both avoid blocking, but failure handling differs. In a basic saga, each step has a compensation that undoes what the step did. If Step 3 fails, run compensation for Step 2, then Step 1. The compensation logic must know how to reverse each step’s effects.
TCC takes a different angle. Confirm and Cancel are explicit and symmetrical. Confirm commits the tentative reservation. Cancel releases it. The complexity shifts from writing reverse logic to implementing a reservation system that tracks pending operations.
TCC fits naturally when you can model operations as reservations. Hotel booking, inventory allocation, credit holds, seat reservations. These have a clear notion of “tentatively take this” and “make it official or release it.”
Saga fits better when operations are transformations rather than reservations. Moving money from account A to account B, transforming an order into an invoice. These lack natural reservation semantics and saga works fine there.
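For contrast, the reverse-order compensation that a basic saga relies on might look like the sketch below. `run_saga` and the step functions are illustrative names, not part of any framework.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps in reverse on failure."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Undo only the steps that finished, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


log = []


def fail_step():
    raise RuntimeError("step 3 failed")


steps = [
    (lambda: log.append("do1"), lambda: log.append("undo1")),
    (lambda: log.append("do2"), lambda: log.append("undo2")),
    (fail_step, lambda: log.append("undo3")),
]
ok = run_saga(steps)
```

Each compensation has to know how to reverse its own step, which is the burden TCC replaces with a uniform release of a tentative hold.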
For more on saga, see Saga Pattern.
TCC compared to 2PC and saga
| Aspect | 2PC | Saga | TCC |
|---|---|---|---|
| Blocking | Yes - participants block during commit | No - no blocking | No - no blocking |
| Locking | Locks resources during prepare | No locks | Reservations, not locks |
| Atomicity | True atomic commit | Eventual atomicity | Eventual atomicity |
| Isolation | Full serializable isolation | No isolation | No isolation |
| Coordination | Centralized coordinator | Centralized or choreographed | Centralized coordinator |
| Heterogeneous Systems | Requires shared TM | Yes | Yes |
| Compensation Model | Automatic rollback | Explicit compensations | Explicit Confirm/Cancel |
| Failure Handling | Blocking on coordinator crash | Compensations in reverse | Confirm/Cancel with retries |
| Latency | Two round trips minimum | Per-step latency | Two round trips minimum |
| Use Case Fit | Strong consistency required | Transformation operations | Reservation operations |
| Recovery Complexity | High (blocking states) | Medium (compensation chain) | Medium (tentative state cleanup) |
| Implementation Complexity | Medium (DB-supported) | Low-Medium | Medium-High (reservation design) |
Implementing TCC in Practice
TCC requires a coordinator and participant implementations. Many frameworks handle the coordinator role. You focus on implementing Try, Confirm, and Cancel methods on your services.
A Flight Booking Example
Consider a flight booking system that coordinates an airline reservation, a hotel booking, and a car rental. All three must succeed or all three must be cancelled.
class FlightBooking:
    def try_reserve(self, flight_id, passenger_id, seats):
        # Check availability and tentatively hold seats
        reservation = Reservation(
            flight_id=flight_id,
            passenger_id=passenger_id,
            seats=seats,
            status="TENTATIVE"
        )
        self.reservations.save(reservation)
        return "TryConfirmed"

    def confirm(self, flight_id, passenger_id):
        # Make the tentative reservation permanent
        reservation = self.reservations.find(flight_id, passenger_id)
        reservation.status = "CONFIRMED"
        self.reservations.save(reservation)
        return "Confirmed"

    def cancel(self, flight_id, passenger_id):
        # Release the tentative hold
        reservation = self.reservations.find(flight_id, passenger_id)
        reservation.status = "CANCELLED"
        self.reservations.save(reservation)
        return "Cancelled"
The coordinator orchestrates the three-phase flow:
class BookingCoordinator:
    def __init__(self, flight, hotel, car):
        self.flight = flight
        self.hotel = hotel
        self.car = car

    def book_trip(self, flight_id, hotel_id, car_id, passenger):
        # Try phase
        results = []
        results.append(self.flight.try_reserve(flight_id, passenger, 1))
        results.append(self.hotel.try_reserve(hotel_id, passenger, 1))
        results.append(self.car.try_reserve(car_id, passenger, 1))

        if all(r == "TryConfirmed" for r in results):
            # Confirm phase
            self.flight.confirm(flight_id, passenger)
            self.hotel.confirm(hotel_id, passenger)
            self.car.confirm(car_id, passenger)
        else:
            # Cancel phase
            self.flight.cancel(flight_id, passenger)
            self.hotel.cancel(hotel_id, passenger)
            self.car.cancel(car_id, passenger)
This example omits error handling, timeouts, and duplicate detection. A production implementation needs retry logic, idempotency keys, and timeout handlers for when participants fail to respond.
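One way to start closing those gaps is a coordinator that stops at the first failed Try and cancels every participant it has already contacted. This is a sketch under assumptions: the participant interface (`try_reserve`, `confirm`, `cancel` taking a single request) and the `FakeParticipant` test double are hypothetical. Cancel is also assumed to be idempotent and safe to call when the Try never took effect.

```python
def book_trip(participants, request):
    """Try each participant in order; on any failure, cancel those already tried."""
    attempted = []
    for p in participants:
        # Record before calling: a Try that timed out may still have landed.
        attempted.append(p)
        try:
            ok = p.try_reserve(request) == "TryConfirmed"
        except Exception:
            ok = False
        if not ok:
            for q in attempted:
                q.cancel(request)
            return "Cancelled"
    for p in participants:
        p.confirm(request)
    return "Confirmed"


class FakeParticipant:
    """Test double that records the calls it receives."""

    def __init__(self, try_result):
        self.try_result = try_result
        self.calls = []

    def try_reserve(self, request):
        self.calls.append("try")
        return self.try_result

    def confirm(self, request):
        self.calls.append("confirm")

    def cancel(self, request):
        self.calls.append("cancel")


flight = FakeParticipant("TryConfirmed")
hotel = FakeParticipant("TryFailed")
car = FakeParticipant("TryConfirmed")
outcome = book_trip([flight, hotel, car], request={"passenger": "p1"})
```

Here the car service is never contacted, and the hotel receives a Cancel even though its Try failed, which is why Cancel has to tolerate a reservation that does not exist.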
Handling Failures and Timeouts
TCC assumes participants will eventually respond to Try, Confirm, or Cancel calls. When a participant becomes unresponsive, the coordinator must decide what to do. This is where TCC implementations diverge.
Some frameworks use guaranteed delivery. They store the intended action in a log and retry until the participant acknowledges. Others use a maximum retry count and then flag the transaction as requiring manual intervention.
The tricky case is when Try succeeded but Confirm failed. The participant reserved the resource but never received the confirmation. From the participant’s perspective, it has a tentative reservation waiting to be confirmed or cancelled. The coordinator’s retry of Confirm should eventually clear this state. But if the coordinator crashed entirely, you need a recovery process that queries participants about their pending states.
flowchart TD
A[Coordinator calls Confirm] --> B{Participant reachable?}
B -->|Yes| C[Confirm succeeds]
B -->|No| D[Store in retry queue]
D --> E[Retry with backoff]
E --> F{Participant responds?}
F -->|Yes| C
F -->|No| G[Max retries exceeded]
G --> H[Flag for manual review]
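The retry path in this flowchart could be sketched as below. The function name, the use of `ConnectionError` to signal an unreachable participant, and the injectable `sleep` parameter are assumptions made so the example is easy to test.

```python
import time


def confirm_with_retry(confirm, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Call `confirm` until it succeeds or the retry budget is spent.

    `confirm` is a zero-argument callable that raises ConnectionError when the
    participant is unreachable. Returns "confirmed" or "manual_review".
    """
    for attempt in range(max_retries + 1):
        try:
            confirm()
            return "confirmed"
        except ConnectionError:
            if attempt == max_retries:
                break
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    return "manual_review"


# Simulate a participant that is unreachable twice, then recovers.
failures = [ConnectionError(), ConnectionError()]


def flaky_confirm():
    if failures:
        raise failures.pop()


def always_down():
    raise ConnectionError()


delays = []
result = confirm_with_retry(flaky_confirm, sleep=delays.append)
gave_up = confirm_with_retry(always_down, max_retries=2, sleep=lambda s: None)
```

A production version would usually persist the pending Confirm before retrying, so the retry survives a coordinator restart.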
Complete TCC Flow Diagram
flowchart TD
Start[TCC Transaction] --> TryPhase[Coordinator sends<br/>Try to all participants]
TryPhase --> TryResults{All Try succeed?}
TryResults -->|No| CancelPhase[Coordinator sends<br/>Cancel to all participants]
CancelPhase --> CancelDone[Resources released<br/>Transaction aborted]
TryResults -->|Yes| ConfirmPhase[Coordinator sends<br/>Confirm to all participants]
ConfirmPhase --> ConfirmResults{All Confirm succeed?}
ConfirmResults -->|No| ConfirmRetry[Retry with backoff]
ConfirmRetry --> ConfirmResults
ConfirmResults -->|Yes| Success[Transaction committed<br/>All reservations finalized]
TryPhase --> Timeout{Participant times out?}
Timeout -->|Yes| CancelPhase
Timeout -->|No| TryResults
Three Main Scenarios:
| Scenario | Trigger | Coordinator Action | Outcome |
|---|---|---|---|
| Try -> Confirm success | All participants respond TryConfirmed | Send Confirm to all | All reservations become permanent |
| Try -> Cancel | Any participant responds TryFailed | Send Cancel to all | All tentative reservations released |
| Try timeout -> Cancel | Participant times out on Try | Send Cancel to all | All tentative reservations released |
State Transitions for a Single Participant:
stateDiagram-v2
[*] --> Idle: Transaction starts
Idle --> Tentative: Try succeeds
Tentative --> Confirmed: Confirm received
Tentative --> Cancelled: Cancel received
Tentative --> Tentative: Try timeout, waiting for Cancel
Confirmed --> [*]
Cancelled --> [*]
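The participant states can be enforced with a small transition table, sketched here. The table also treats a repeated call, and a late Cancel after Confirm, as no-ops, in line with the idempotency rules discussed in this article. Rejecting a Confirm after Cancel is an assumption of this sketch.

```python
ALLOWED = {
    ("TENTATIVE", "confirm"): "CONFIRMED",
    ("TENTATIVE", "cancel"): "CANCELLED",
    # Repeats of the call that produced a terminal state are harmless no-ops.
    ("CONFIRMED", "confirm"): "CONFIRMED",
    ("CANCELLED", "cancel"): "CANCELLED",
    # A late Cancel must never undo a confirmed reservation.
    ("CONFIRMED", "cancel"): "CONFIRMED",
}


def next_status(status, event):
    """Return the status after `event`, or raise if the transition is not allowed."""
    try:
        return ALLOWED[(status, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} while {status}") from None


after_confirm = next_status("TENTATIVE", "confirm")
after_late_cancel = next_status(after_confirm, "cancel")

try:
    next_status("CANCELLED", "confirm")
    rejected = False
except ValueError:
    rejected = True
```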
TCC Frameworks
Building TCC from scratch means managing coordinator state, retry logic, timeout handling, and recovery — all nontrivial. Several frameworks handle the heavy lifting.
Apache TCM (Transaction Coordinator Manager)
Apache TCM is the reference implementation for J2EE-style TCC. It integrates with application servers and provides declarative transaction boundaries. Best for Java/J2EE shops already invested in that ecosystem.
Narayana (JBossTS)
Narayana is an open-source transaction manager supporting LRC (Last Resource Commit) optimization, 2PC, and TCC. It provides both programmatic and declarative (annotation-based) approaches. Works well with Spring via Narayana’s Spring integration.
@Compensable(compensationMethod = "cancelReservation")
public void tryReserveSeats(ReservationRequest request) {
    // Tentatively reserve seats
    reservationService.createTentativeReservation(request);
}

public void cancelReservation(ReservationRequest request) {
    // Release the tentative hold
    reservationService.cancelReservation(request.getReservationId());
}

public void confirmReservation(ReservationRequest request) {
    // Finalize the reservation
    reservationService.confirmReservation(request.getReservationId());
}
ByteTCC
ByteTCC is a TCC implementation for Spring applications. It uses annotations to define Try/Confirm/Cancel methods and handles coordinator logic transparently. Lightweight and Spring-native, good for microservices running in Spring Boot.
@Compensable(confirmMethod = "confirm", cancelMethod = "cancel")
public boolean tryReserveInventory(InventoryRequest request) {
    // Try logic: check availability, create tentative hold
    return inventoryService.tentativeHold(request.getItemId(), request.getQuantity());
}

public void confirm(InventoryRequest request) {
    inventoryService.confirmHold(request.getItemId(), request.getQuantity());
}

public void cancel(InventoryRequest request) {
    inventoryService.releaseHold(request.getItemId(), request.getQuantity());
}
Spring TCC (Spring-Cloud-tencent)
Spring TCC is part of the Tencent Spring Cloud stack. Integrates with Service Comb and provides distributed TCC transaction support for Spring Cloud microservices.
Framework Comparison
| Framework | Language | Coordinator | Spring Integration | Recovery Support | Best For |
|---|---|---|---|---|---|
| Apache TCM | Java | Embedded | Yes (J2EE) | Yes | Enterprise Java apps |
| Narayana | Java/C | Both | Yes | Yes | JBoss/Spring ecosystem |
| ByteTCC | Java | External | Yes (Spring Boot) | Yes | Lightweight Spring microservices |
| Spring TCC | Java | External | Yes | Yes | Tencent/Spring Cloud stack |
For most new projects, ByteTCC or Narayana are the practical choices. ByteTCC is simpler and more Spring Boot-friendly. Narayana has more enterprise features and a longer track record.
Confirm/Cancel Idempotency Implementation
Idempotency is not optional in TCC — it is load-bearing. The coordinator retries Confirm and Cancel calls until it gets a response. Your participant must handle duplicates gracefully.
The Idempotency Problem
Consider this scenario:
- Coordinator calls confirm(reservation_id="abc")
- Participant confirms successfully but the network drops before the response arrives
- Coordinator retries confirm(reservation_id="abc")
- If your confirm handler is not idempotent, you might re-confirm an already-confirmed reservation
Idempotency Key Design
Use a dedicated idempotency key for each Try/Confirm/Cancel call. The key should be deterministic — the same operation always gets the same key.
import hashlib
from datetime import datetime, timezone


def make_idempotency_key(transaction_id, participant_id, phase):
    """Generate a deterministic idempotency key.

    Same transaction + participant + phase always produces same key.
    """
    raw = f"{transaction_id}:{participant_id}:{phase}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


class TccParticipant:
    def confirm(self, transaction_id, participant_id, reservation_data):
        key = make_idempotency_key(transaction_id, participant_id, "confirm")

        # Idempotency check
        existing = self.confirm_log.find_by_idempotency_key(key)
        if existing:
            # Already confirmed: return success without re-confirming
            return ConfirmResult(
                success=True,
                already_confirmed=True,
                confirmed_at=existing.confirmed_at
            )

        # Actual confirmation logic
        reservation = self.reservations.find(reservation_data.id)
        reservation.status = "CONFIRMED"
        self.reservations.save(reservation)

        # Record this confirmation for future idempotency
        self.confirm_log.save(IdempotencyRecord(
            key=key,
            transaction_id=transaction_id,
            confirmed_at=datetime.now(timezone.utc)
        ))
        return ConfirmResult(success=True, already_confirmed=False)

    def cancel(self, transaction_id, participant_id, reservation_data):
        key = make_idempotency_key(transaction_id, participant_id, "cancel")

        existing = self.cancel_log.find_by_idempotency_key(key)
        if existing:
            return CancelResult(
                success=True,
                already_cancelled=True,
                cancelled_at=existing.cancelled_at
            )

        reservation = self.reservations.find(reservation_data.id)
        reservation.status = "CANCELLED"
        self.reservations.save(reservation)

        self.cancel_log.save(IdempotencyRecord(
            key=key,
            transaction_id=transaction_id,
            cancelled_at=datetime.now(timezone.utc)
        ))
        return CancelResult(success=True, already_cancelled=False)
Confirm Before Cancel Problem
A subtler idempotency problem: what if Confirm runs twice (first times out, second succeeds), and then Cancel is retried? The cancellation would incorrectly release a confirmed reservation.
Track state transitions explicitly. Confirm transitions from TENTATIVE to CONFIRMED. Cancel transitions from TENTATIVE to CANCELLED. Once in CONFIRMED, Cancel should be a no-op, not a failure.
def cancel(self, transaction_id, participant_id, reservation_data):
    reservation = self.reservations.find(reservation_data.id)

    if reservation.status == "CONFIRMED":
        # Already confirmed: Cancel is correctly a no-op
        return CancelResult(success=True, reason="already_confirmed")

    if reservation.status == "CANCELLED":
        # Already cancelled: still a no-op
        return CancelResult(success=True, reason="already_cancelled")

    # Actual cancellation from TENTATIVE state
    reservation.status = "CANCELLED"
    self.reservations.save(reservation)
    return CancelResult(success=True)
Timeout vs Permanent Failure
TCC distinguishes between transient failures (retry might succeed) and permanent failures (never going to succeed). In your Try handler:
- Transient failure: Return a retryable error, coordinator retries
- Permanent failure: Return TryFailed with a reason that means “do not retry, cancel everything”
def try_reserve(self, inventory_id, quantity):
    try:
        # Try logic
        reserved = self.inventory.tentative_hold(inventory_id, quantity)
        return TryResult(success=True, reservation_id=reserved.id)
    except InsufficientInventory:
        # Permanent failure: retrying will not help
        return TryResult(success=False, reason="INSUFFICIENT_INVENTORY")
    except TemporaryDatabaseError:
        # Transient failure: worth retrying
        raise TryRetryableError("Database temporarily unavailable")
    except CapacityExceeded:
        # Permanent failure: no amount of retry will fix this
        return TryResult(success=False, reason="CAPACITY_EXCEEDED")
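On the coordinator side, the same distinction might be handled as in the sketch below. `run_try`, the dictionary result shape, and the `RETRIES_EXHAUSTED` reason are illustrative assumptions. `TryRetryableError` is redefined locally so the example is self-contained.

```python
class TryRetryableError(Exception):
    """Raised by a participant when a Try may succeed if repeated."""


def run_try(try_call, max_attempts=3):
    """Repeat a Try on transient errors; stop at once on a definite answer."""
    for _ in range(max_attempts):
        try:
            return try_call()  # success or permanent failure: no retry
        except TryRetryableError:
            continue           # transient: try again
    return {"success": False, "reason": "RETRIES_EXHAUSTED"}


flaky_calls = []


def flaky_try():
    flaky_calls.append(1)
    if len(flaky_calls) < 3:
        raise TryRetryableError("database temporarily unavailable")
    return {"success": True, "reason": None}


sold_out_calls = []


def sold_out_try():
    sold_out_calls.append(1)
    return {"success": False, "reason": "INSUFFICIENT_INVENTORY"}


flaky_result = run_try(flaky_try)
sold_out_result = run_try(sold_out_try)
```

The permanent failure is returned after a single call, so the coordinator can move straight to the Cancel phase.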
Advantages of TCC
The main advantage is that resources do not lock during the transaction. Other operations can read or modify the same data, aware of pending reservations but not blocked by them. This makes TCC more scalable than 2PC, particularly under high contention.
The three-phase structure is explicit. Every participant agrees to the contract: if you can reserve in Try, you guarantee you can confirm or cancel later.
TCC also works across heterogeneous systems. No shared transaction manager required. Each service implements its own semantics for Try, Confirm, and Cancel.
Common Pitfalls / Anti-Patterns
TCC is not a silver bullet.
The biggest challenge is designing Try/Confirm/Cancel for your specific domain. Not all operations map naturally to reservation semantics. Forcing a square peg into a round hole produces brittle implementations.
Idempotency trips people up. Confirm and Cancel must handle duplicate calls gracefully. If the coordinator retries a Confirm that actually succeeded, the participant needs to recognize this and return Confirmed, not try to confirm again.
The timeout case requires care. Try succeeds but the coordinator crashes before sending Confirm or Cancel. Resources sit in a tentative state. Without a resolution mechanism, you get resource leaks that pile up silently.
Latency also increases. Every transaction needs at least two round trips to each participant.
Use Cases & Decision Criteria
TCC fits well when your business logic naturally supports reservation semantics: inventory allocation, booking systems, credit reservations, seat holds. If you can model the operation as “tentatively take X and later either commit or release,” TCC gives you a clean structure.
TCC gets awkward when operations are transformations rather than reservations. If Step 2 depends on the output of Step 1 in a way that does not fit reservation semantics, you end up stuffing intermediate state into Try and carrying it forward to Confirm. This works but loses the elegance.
For an overview of distributed transaction patterns including TCC, see Distributed Transactions. For reliable message delivery in distributed systems, see the Outbox Pattern.
Production Failure Scenarios
Scenario 1: Network Partition During Confirm Phase
Trigger: Network partition separates coordinator from one or more participants after Try succeeds.
What happens:
- All participants respond TryConfirmed
- Coordinator starts Confirm phase
- Coordinator can reach Participant A and B but not Participant C
- A and B confirm successfully; C never receives Confirm
- Coordinator retries C with backoff until max retries
- Max retries exceeded, transaction flagged for manual review
Outcome: Participants A and B have CONFIRMED reservations; Participant C has TENTATIVE. Inventory is inconsistently allocated. Manual intervention required to either Confirm C (if resources still available) or Cancel all participants.
Mitigation: Use a recovery process that periodically scans for TENTATIVE reservations older than a threshold and either completes or cancels them. Implement saga-style compensation as a fallback.
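A recovery scan along these lines could look like the sketch below. The record layout, the five-minute threshold, and the rule of cancelling when the coordinator has no recorded decision are all assumptions. The right default depends on what the coordinator logs before it starts the Confirm phase.

```python
from datetime import datetime, timedelta, timezone


def find_stale(reservations, now, max_age=timedelta(minutes=5)):
    """Return IDs of TENTATIVE reservations older than `max_age`.

    `reservations` maps reservation_id -> {"status": str, "created_at": datetime}.
    """
    return sorted(
        rid for rid, r in reservations.items()
        if r["status"] == "TENTATIVE" and now - r["created_at"] > max_age
    )


def resolve(reservations, stale_ids, decision_for):
    """Apply the coordinator's logged decision; cancel when none was recorded."""
    for rid in stale_ids:
        decision = decision_for(rid)  # "CONFIRMED", "CANCELLED", or None
        reservations[rid]["status"] = decision or "CANCELLED"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
reservations = {
    "r1": {"status": "TENTATIVE", "created_at": now - timedelta(minutes=10)},
    "r2": {"status": "TENTATIVE", "created_at": now - timedelta(minutes=1)},
    "r3": {"status": "TENTATIVE", "created_at": now - timedelta(minutes=20)},
}
coordinator_log = {"r1": "CONFIRMED"}  # hypothetical durable decision log

stale = find_stale(reservations, now)
resolve(reservations, stale, coordinator_log.get)
```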
Scenario 2: Participant Crash and Recovery
Trigger: A participant process crashes after receiving Try but before processing Confirm.
What happens:
- Participant D receives Try, writes TENTATIVE reservation to durable storage, responds TryConfirmed
- Participant D crashes before receiving Confirm
- Participant D restarts and recovers its state
- Coordinator retries Confirm for Participant D
- The participant finds the durable TENTATIVE reservation and applies Confirm; if an earlier Confirm had already been applied before the crash, the idempotency check recognizes the retry and returns Confirmed without changing state
Outcome: The transaction completes successfully if the participant persisted its TENTATIVE state and handles Confirm idempotently. Otherwise, recovery may fail to recognize the pending reservation and create duplicate or conflicting state.
Mitigation: Durable write of TENTATIVE state before responding to Try. Idempotency check on Confirm that recognizes already-processed transactions.
Scenario 3: Clock Skew Across Participants
Trigger: Different participants have slightly different system clocks, causing TTL-based auto-expiry to fire at different times.
What happens:
- Transaction with 5-minute TTL on tentative reservations starts
- Participant E has a fast clock and expires the reservation at minute 4
- Participant F has a slow clock and still has TENTATIVE at minute 5
- Coordinator sends Confirm at minute 5 to both
- Participant E rejects Confirm (no longer has reservation)
- Participant F accepts Confirm
Outcome: Inconsistent state. Participant E has cancelled, Participant F has confirmed. Without reconciliation, the system has a phantom confirmed reservation.
Mitigation: Use logical time or a centralized time source for TTL. Build reconciliation logic that detects and resolves inconsistent states across participants.
Scenario 4: Cancel Storm Under High Contention
Trigger: System experiences high contention on a shared resource during peak load.
What happens:
- High traffic causes many transactions to timeout on Try simultaneously
- Coordinator sends Cancel to all affected participants
- Cancel storm overwhelms participant capacity
- Some Cancels time out, causing the coordinator to retry
- Retries add more load, worsening the situation
Outcome: System becomes unstable. Participants may need to shed load or the coordinator may need to slow down transaction initiation.
Mitigation: Implement circuit breakers on participants. Use bulkhead isolation between TCC participants. Rate-limit Try requests per client. Consider queueing Try requests instead of immediate rejection.
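A circuit breaker for calls to one participant might be sketched as follows. The class, its thresholds, and the injectable `clock` are hypothetical; libraries that provide this pattern differ in detail.

```python
class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after `cooldown` seconds."""

    def __init__(self, threshold, cooldown, clock):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # zero-argument callable returning seconds
        self.failures = 0
        self.opened_at = None

    def allow(self):
        # Closed breaker: calls pass. Open breaker: calls pass only after the cooldown.
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


now = [0.0]
breaker = CircuitBreaker(threshold=2, cooldown=30, clock=lambda: now[0])
breaker.record(False)
breaker.record(False)      # second consecutive failure opens the breaker
blocked = not breaker.allow()
now[0] = 31.0
probe_allowed = breaker.allow()
```

While the breaker is open the coordinator skips the call instead of retrying, which removes the feedback loop that turns timeouts into more load.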
Scenario 5: Double-Spend via Race Condition
Trigger: Two different coordinators attempt to reserve the same inventory simultaneously.
What happens:
- Coordinator X sends Try(reserve 5 units) to Inventory Service
- Coordinator Y sends Try(reserve 5 units) to Inventory Service (same inventory)
- Both see availability because neither reservation is final yet
- Both receive TryConfirmed
- Both send Confirm (or one confirms and one cancels then retries)
- Without careful concurrency control, double-spend occurs
Outcome: More inventory is confirmed than actually available. Overselling occurs.
Mitigation: Use row-level locking or optimistic concurrency control in the Try phase. The tentative reservation should check available inventory at Confirm time, not just Try time. Some implementations use per-resource locking that short-circuits concurrent Try attempts on the same resource.
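The core of the fix is making the availability check and the hold a single atomic step. The sketch below uses an in-process `threading.Lock` as a stand-in for a row lock or a conditional update in a database. The `Inventory` class is hypothetical.

```python
import threading


class Inventory:
    """Hypothetical inventory whose Try checks and holds stock in one critical section."""

    def __init__(self, units):
        self.free = units
        self.holds = {}
        self._lock = threading.Lock()

    def try_reserve(self, transaction_id, units):
        # The lock covers only this short check-and-decrement, not the whole transaction.
        with self._lock:
            if units > self.free:
                return False
            self.free -= units
            self.holds[transaction_id] = units
            return True


inventory = Inventory(units=5)
results = {}


def coordinator(name):
    results[name] = inventory.try_reserve(name, 5)


threads = [threading.Thread(target=coordinator, args=(n,)) for n in ("X", "Y")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one of the two coordinators gets the hold, whichever order the threads run in, because tentative holds are subtracted from free stock at Try time.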
Trade-off Analysis
TCC vs 2PC:
| Dimension | TCC | 2PC | Notes |
|---|---|---|---|
| Blocking | No - participants not blocked | Yes - participants block during commit | |
| Resource Locking | Reservations only, no locks | Locks held during prepare and commit | |
| Atomicity | Eventual (via retry) | True atomic commit | |
| Isolation | None (others see pending reservations) | Full serializable isolation | |
| Latency | Two round trips minimum | Two round trips minimum | |
| Heterogeneous Support | Yes - each service implements T/C/C | No - requires shared transaction manager | |
| Implementation Complexity | Medium-High (reservation design) | Medium (DB-supported) | Winner for heterogeneous microservices |
| Failure Handling | Retry with Confirm/Cancel | Blocking on coordinator crash | Winner for availability under partial failure |
TCC vs Saga
| Dimension | TCC | Saga | Notes |
|---|---|---|---|
| Failure Handling | Explicit Confirm/Cancel symmetry | Reverse compensations (undo each step) | |
| Best For | Reservation semantics | Transformation semantics | |
| Idempotency Requirement | Critical (all phases) | Lower (compensations must be idempotent) | |
| Complexity Location | Reservation system design | Compensation chain logic | |
| Natural Fit | Hotel booking, inventory holds | Money transfers, order-to-invoice | Depends on domain |
| Implementation Effort | Medium-High | Low-Medium | Winner for simpler domains |
Best-Effort vs Exactly-Once TCC
| Dimension | Best-Effort TCC | Exactly-Once TCC | Notes |
|---|---|---|---|
| Delivery Guarantee | Retry with limit, then flag for manual review | Retry indefinitely until acknowledged | |
| Dangling Reservations | Possible (manual cleanup needed) | Eliminated | |
| Implementation Complexity | Lower | Higher (persistent logging required) | |
| Operational Overhead | Higher (manual intervention) | Lower (self-healing) | Winner for reliability-critical systems |
| Storage Overhead | Minimal | Logs must be maintained durably | Winner for storage-constrained environments |
When to Choose TCC Over Alternatives
| Scenario | Recommendation |
|---|---|
| Operations are reservations (inventory, booking, holds) | TCC |
| Operations are transformations (money transfer, order processing) | Saga |
| True atomicity with full isolation required | 2PC |
| Heterogeneous systems (different stacks, no shared TM) | TCC |
| High contention with concurrent access to same resources | TCC (with careful concurrency control) or Saga |
| Simple domain with low-stakes operations | Saga |
| Strict no-blocking requirement with strong eventual consistency | TCC |
| Enterprise Java ecosystem with existing transaction infrastructure | 2PC (if available) or Narayana TCC |
Quick Recap
Before finalizing your TCC implementation, verify these critical items:
- Try phase returns TryConfirmed or TryFailed with appropriate reason (permanent vs transient)
- Confirm and Cancel are idempotent; duplicate calls return success without re-executing
- Cancel handles the Confirm-before-Cancel problem (no-op when already CONFIRMED)
- Cancel handles already CANCELLED state as an idempotent no-op
- Deterministic idempotency keys are generated from transaction_id:participant_id:phase
- Timeout handling is implemented for unresponsive participants
- Maximum retry count with exponential backoff for Confirm/Cancel retries
- Recovery process exists for querying participants about pending states after coordinator crash
- TTL is set on tentative reservations to auto-expire abandoned transactions
- Coordinator-to-participant calls are authenticated (mTLS or JWT)
- Reservation data is validated in Try phase (not trusted blindly from coordinator)
- Observability is in place: completion rate, phase durations, retry counts, dangling reservation count
- Alerts configured for: accumulated dangling TENTATIVE reservations, confirm retry threshold exceeded, frequent cancel phase runs
Observability Checklist
TCC transactions span multiple services and involve multiple round trips. Without observability, you cannot tell whether a failed transaction left dangling tentative reservations.
Observability & Monitoring
Metrics
- TCC transaction completion rate (success vs try-fail vs confirm-fail vs cancel)
- Try phase duration and success rate
- Confirm phase duration and retry count
- Cancel phase duration and how often it runs
- Average number of participants per transaction
- Timeout rate per phase (try timeout, confirm timeout)
- Dangling tentative reservation count (reservations stuck in TENTATIVE state)
Logs
- Log Try phase start with transaction ID, participant ID, and reservation data
- Log Try phase outcome (confirmed, failed, timeout)
- Log Confirm/Cancel phase starts and outcomes
- Log retry attempts with attempt number and delay
- Include idempotency key in all phase logs for correlation
- Log participant state transitions (TENTATIVE → CONFIRMED, TENTATIVE → CANCELLED)
Alerts
- Alert when dangling TENTATIVE reservations accumulate (cleanup is failing)
- Alert when confirm retry count exceeds threshold
- Alert when cancel phase runs frequently (indicates try phase instability)
- Alert when participant times out repeatedly on try phase
- Alert when transaction takes longer than expected threshold
Tracing
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


class TccTransaction:
    def execute(self):
        with tracer.start_as_current_span("tcc.transaction") as span:
            span.set_attribute("tcc.transaction_id", self.txn_id)
            span.set_attribute("tcc.participant_count", len(self.participants))

            # Try phase
            try_results = []
            with tracer.start_as_current_span("tcc.try_phase"):
                for participant in self.participants:
                    with tracer.start_as_current_span(f"tcc.try.{participant.name}") as p_span:
                        p_span.set_attribute("participant.name", participant.name)
                        result = participant.try_(self.request)
                        try_results.append(result)
                        # Span attributes must be primitive values, so record the boolean outcome
                        p_span.set_attribute("tcc.try_result", result.success)

            if all(r.success for r in try_results):
                # Confirm phase
                with tracer.start_as_current_span("tcc.confirm_phase"):
                    for participant in self.participants:
                        with tracer.start_as_current_span(f"tcc.confirm.{participant.name}") as p_span:
                            result = participant.confirm()
                            p_span.set_attribute("tcc.confirm_result", str(result))
            else:
                # Cancel phase
                with tracer.start_as_current_span("tcc.cancel_phase"):
                    for participant in self.participants:
                        with tracer.start_as_current_span(f"tcc.cancel.{participant.name}") as p_span:
                            result = participant.cancel()
                            p_span.set_attribute("tcc.cancel_result", str(result))
Security & Resilience
Security Checklist
TCC coordination involves multiple services making state changes. Security misconfigurations can lead to unauthorized reservations or data leakage.
- Authenticate the coordinator-to-participant RPC calls (mutual TLS or JWT tokens)
- Authorize participants — coordinator should only call Confirm/Cancel on registered participants
- Validate reservation data in Try phase — do not trust coordinator-supplied quantities or IDs without validation
- Audit log all state transitions on tentative reservations (created, confirmed, cancelled)
- Encrypt coordinator-to-participant communication in transit
- Do not expose internal transaction IDs in error responses (use correlation IDs instead)
- Rate-limit Try requests per participant to prevent reservation exhaustion attacks
- Set TTL on tentative reservations so abandoned transactions auto-expire
Reservation Exhaustion Attack
A subtle TCC security concern: an attacker triggers many Try operations that succeed but never Confirm or Cancel. If tentative reservations hold inventory, the attacker can exhaust available inventory without paying.
Mitigations:
- TTL on tentative reservations: Auto-cancel after timeout
- Per-entity locking: Lock the reservation entity itself, not just the inventory
- Rate limiting Try: Limit how many Try requests a single client can make
- Verification on Confirm: Check the original request is still valid before confirming
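Two of these mitigations, TTL expiry and a limit on Try, can be combined in one small component, sketched below. `HoldLimiter`, its parameters, and the use of plain numbers for time are assumptions for illustration. The limit here is on unresolved holds per client rather than on request rate.

```python
class HoldLimiter:
    """Caps unresolved tentative holds per client and expires holds past their TTL."""

    def __init__(self, max_open_holds, ttl_seconds):
        self.max_open_holds = max_open_holds
        self.ttl_seconds = ttl_seconds
        self.holds = {}  # hold_id -> (client_id, created_at_seconds)

    def expire(self, now):
        expired = [h for h, (_, created) in self.holds.items()
                   if now - created >= self.ttl_seconds]
        for hold_id in expired:
            del self.holds[hold_id]  # a real system would also release the inventory
        return expired

    def try_hold(self, hold_id, client_id, now):
        self.expire(now)
        open_holds = sum(1 for c, _ in self.holds.values() if c == client_id)
        if open_holds >= self.max_open_holds:
            return False
        self.holds[hold_id] = (client_id, now)
        return True


limiter = HoldLimiter(max_open_holds=2, ttl_seconds=300)
a = limiter.try_hold("h1", "client-1", now=0)
b = limiter.try_hold("h2", "client-1", now=1)
c = limiter.try_hold("h3", "client-1", now=2)    # over the cap, rejected
d = limiter.try_hold("h4", "client-1", now=301)  # earlier holds have expired
```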
Interview Questions
What are the three phases of TCC, and what does each one do?
Expected answer points:
- Try: Reserves resources tentatively without committing. Participant records the intent and reserves the required capacity.
- Confirm: Commits the tentative reservation, making it permanent. Only called when Try succeeded for all participants.
- Cancel: Releases the tentative reservation, undoing the effects of Try. Called when any Try fails or the transaction times out.
How does TCC differ from 2PC in the way it handles resource locking?
Expected answer points:
- 2PC locks resources during the Prepare phase and holds those locks until Commit or Rollback, blocking other transactions.
- TCC uses reservations, not locks. Try reserves capacity but does not block other operations from seeing or using the same resources.
- Under contention, 2PC makes competing transactions wait or fail, while TCC shows the second transaction a pending reservation and lets it decide how to respond.
Expected answer points:
- The coordinator retries Confirm and Cancel calls until it receives a response, so duplicate calls are inevitable.
- If a participant is not idempotent, a duplicate Confirm might apply the commit twice (for example, deducting the same inventory a second time), leaving incorrect state.
- Idempotency is typically implemented using deterministic idempotency keys based on transaction ID, participant ID, and phase.
- Participants must also handle the Confirm-before-Cancel problem: once CONFIRMED, Cancel should be a no-op.
Expected answer points:
- The participant has a tentative reservation waiting to be confirmed or cancelled.
- The coordinator retries Confirm with exponential backoff until the participant responds or max retries are exceeded.
- If max retries are exceeded, the transaction is flagged for manual intervention.
- If the coordinator crashes entirely, a recovery process must query participants about their pending states to resolve dangling reservations.
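The retry behavior described above might look like the following sketch. The function name, the `send_confirm` callable, the use of `ConnectionError` as the retryable failure, and the returned status strings are all assumptions for illustration.

```python
import time


def confirm_with_backoff(send_confirm, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Call send_confirm() until it succeeds or retries run out.

    Returns "CONFIRMED" on success, "NEEDS_MANUAL_INTERVENTION" otherwise.
    """
    for attempt in range(max_retries + 1):
        try:
            send_confirm()
            return "CONFIRMED"
        except ConnectionError:
            if attempt == max_retries:
                break  # retries exhausted; do not sleep again
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt))
    return "NEEDS_MANUAL_INTERVENTION"
```

Because Confirm may be delivered more than once under this loop, it is only safe when the participant's Confirm is idempotent.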
Expected answer points:
- 2PC assumes participants share a common transaction manager (TM) and often a shared database.
- TCC works across heterogeneous systems because each service implements its own Try/Confirm/Cancel logic independently.
- Payment service, inventory service, shipping service can all run different stacks but still participate in the same TCC transaction.
- No shared TM or distributed lock manager is required.
Expected answer points:
- Scenario: Confirm runs twice (first times out, second succeeds), then Cancel is retried. The cancellation would incorrectly release a confirmed reservation.
- Solution: Track state transitions explicitly. Confirm transitions from TENTATIVE to CONFIRMED. Cancel transitions from TENTATIVE to CANCELLED.
- Once in CONFIRMED state, Cancel should be a no-op that returns success, not an error or a re-cancellation.
- Similarly, once in CANCELLED state, subsequent Cancel calls should be idempotent no-ops.
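A minimal sketch of the explicit state tracking described above. `ReservationRecord` and its `releases` counter are hypothetical, added only to show that a late Cancel does not release a confirmed reservation.

```python
from enum import Enum


class State(Enum):
    TENTATIVE = "TENTATIVE"
    CONFIRMED = "CONFIRMED"
    CANCELLED = "CANCELLED"


class ReservationRecord:
    def __init__(self):
        self.state = State.TENTATIVE
        self.releases = 0  # how many times capacity was actually released

    def confirm(self) -> bool:
        if self.state is State.CONFIRMED:
            return True    # duplicate Confirm: no-op
        if self.state is State.CANCELLED:
            return False   # already released; cannot be confirmed
        self.state = State.CONFIRMED
        return True

    def cancel(self) -> bool:
        if self.state is not State.TENTATIVE:
            return True    # CONFIRMED or CANCELLED: no-op that reports success
        self.state = State.CANCELLED
        self.releases += 1
        return True
```

Returning `False` for Confirm after Cancel is a choice made in this sketch. The points above only specify the Cancel side.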
Expected answer points:
- TCC fits naturally when operations can be modeled as reservations: hotel booking, inventory allocation, credit holds, seat reservations.
- These have a clear notion of "tentatively take this" and "make it official or release it."
- Saga fits better when operations are transformations rather than reservations: moving money from account A to B, transforming an order into an invoice.
- TCC provides more structure than plain saga with explicit Confirm/Cancel symmetry.
Expected answer points:
- An attacker triggers many Try operations that succeed but never Confirm or Cancel.
- If tentative reservations hold inventory, the attacker can exhaust available inventory without paying.
- Mitigations: TTL on tentative reservations (auto-cancel after timeout), per-entity locking, rate limiting Try requests per client, and verification on Confirm that the original request is still valid.
Expected answer points:
- Transient failure: Return a retryable error. The coordinator retries the Try call. Example: TemporaryDatabaseError.
- Permanent failure: Return TryFailed with a reason indicating "do not retry, cancel everything." Example: InsufficientInventory, CapacityExceeded.
- Mixed case handling: CapacityExceeded might be permanent (no amount of retry will fix it), while a temporary lock timeout might be transient.
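One way to make the transient/permanent distinction explicit is to give each kind of failure its own exception type and let the coordinator branch on it. The exception classes, `run_try`, and the returned strings below are illustrative assumptions, not part of any TCC library.

```python
class TransientTryError(Exception):
    """Retrying may succeed (e.g. a temporary database error)."""


class PermanentTryError(Exception):
    """Retrying cannot succeed (e.g. insufficient inventory)."""


def run_try(try_call, max_attempts=3):
    """Return "TRY_OK", or "CANCEL_ALL" when the transaction should be cancelled."""
    for _ in range(max_attempts):
        try:
            try_call()
            return "TRY_OK"
        except TransientTryError:
            continue             # retry the same Try call
        except PermanentTryError:
            return "CANCEL_ALL"  # do not retry; cancel every participant
    return "CANCEL_ALL"          # transient errors persisted past the attempt limit
```

A production coordinator would normally add a delay between attempts; it is omitted here to keep the classification logic visible.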
Expected answer points:
- Apache TCM: Reference implementation for J2EE-style TCC. Best for Java/J2EE shops already invested in that ecosystem with declarative transaction boundaries.
- Narayana (JBossTS): Open-source transaction manager supporting LRC optimization, 2PC, and TCC. Works well with Spring via Narayana's Spring integration. Good for JBoss/Spring ecosystem.
- ByteTCC: TCC implementation for Spring applications using annotations. Lightweight and Spring-native. Good for lightweight Spring Boot microservices.
Expected answer points:
- Use a hash function (e.g., SHA-256) over a string combining transaction_id, participant_id, and phase.
- Example: raw = f"{transaction_id}:{participant_id}:{phase}", then hash and take first N characters.
- The same operation always produces the same key, allowing duplicate detection.
- Store the idempotency key with the result to detect and skip duplicate Confirm/Cancel calls.
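The key derivation and duplicate detection above can be sketched as follows. The key format follows the example in the text. The `length` default, the `IdempotentHandler` class, and the in-memory result store are assumptions for illustration (a real participant would persist results alongside the reservation).

```python
import hashlib


def idempotency_key(transaction_id: str, participant_id: str, phase: str,
                    length: int = 32) -> str:
    """Deterministic key: same inputs always produce the same key."""
    raw = f"{transaction_id}:{participant_id}:{phase}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:length]


class IdempotentHandler:
    def __init__(self):
        self._results = {}  # idempotency key -> stored result

    def handle(self, transaction_id, participant_id, phase, operation):
        key = idempotency_key(transaction_id, participant_id, phase)
        if key in self._results:
            return self._results[key]  # duplicate call: return the stored result
        result = operation()
        self._results[key] = result
        return result
```

Truncating the hash shortens the key at the cost of a higher collision probability, so `length` should stay large enough for the expected transaction volume.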
Expected answer points:
- TCC transaction completion rate (success vs try-fail vs confirm-fail vs cancel).
- Try/Confirm/Cancel phase duration and success rate per phase.
- Confirm phase retry count and cancel phase frequency.
- Average number of participants per transaction.
- Timeout rate per phase and dangling TENTATIVE reservation count (reservations stuck in TENTATIVE state).
Expected answer points:
- 2PC provides true atomic commit: once every participant has voted in Prepare, all of them commit or all of them roll back, and locks hide the intermediate state from other transactions.
- TCC provides eventual atomicity: if Try succeeds for all, Confirm will eventually make all permanent, but there is a window where participants may have inconsistent states.
- TCC lacks the isolation property that 2PC provides. Other transactions can see pending reservations.
- The coordinator may take time to retry Confirm or Cancel, during which the state is not fully resolved.
Expected answer points:
- The coordinator orchestrates the three-phase flow: sends Try to all participants, then Confirm or Cancel based on results.
- It manages timeouts, retries, and recovery logic for dangling transactions.
- If the coordinator crashes after Try succeeds but before Confirm/Cancel is sent, participants may have pending tentative reservations.
- A recovery process must query participants about their pending states and either confirm or cancel based on what it finds.
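A recovery pass along these lines could be sketched as below. It assumes the coordinator durably logs its Confirm/Cancel decision before sending it, and that a transaction with no logged decision is safe to cancel. Both are assumptions of this sketch, as are the `recover` function and the `InMemoryParticipant` stand-in.

```python
class InMemoryParticipant:
    """Stand-in for a remote service; holds reservation state per transaction."""

    def __init__(self, name, tentative):
        self.name = name
        self.states = {tx: "TENTATIVE" for tx in tentative}

    def pending(self):
        return [tx for tx, s in self.states.items() if s == "TENTATIVE"]

    def confirm(self, tx_id):
        self.states[tx_id] = "CONFIRMED"

    def cancel(self, tx_id):
        self.states[tx_id] = "CANCELLED"


def recover(decision_log, participants):
    """Resolve reservations left TENTATIVE by a coordinator crash.

    decision_log maps tx_id -> "CONFIRM" or "CANCEL"; a missing entry means
    the coordinator crashed before deciding, which this sketch treats as cancel.
    """
    for p in participants:
        for tx_id in p.pending():
            if decision_log.get(tx_id) == "CONFIRM":
                p.confirm(tx_id)
            else:
                p.cancel(tx_id)
```

Since recovery may run more than once, it relies on the same idempotent Confirm and Cancel behavior as normal retries.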
Expected answer points:
- Every TCC transaction needs at least two round trips to each participant: Try, then Confirm (or Cancel).
- 2PC also needs two rounds (Prepare, then Commit), and its participants block between them. TCC avoids the blocking but adds the application-level overhead of managing tentative state and retry logic.
- Coordination overhead grows with the number of participants, all of whom must respond before the transaction completes.
Expected answer points:
- Idle (initial state) -> Tentative (when Try succeeds).
- Tentative -> Confirmed (when Confirm is received and processed successfully).
- Tentative -> Cancelled (when Cancel is received and processed successfully).
- Tentative -> Tentative (no change: if the coordinator's Try call timed out after the reservation was created, the reservation stays Tentative until Cancel arrives).
- Confirmed and Cancelled are terminal states.
Expected answer points:
- Log Try phase start with transaction ID, participant ID, and reservation data.
- Log Try phase outcome (confirmed, failed, timeout) and Confirm/Cancel phase starts and outcomes.
- Log retry attempts with attempt number and delay, including the idempotency key for correlation.
- Log participant state transitions (TENTATIVE -> CONFIRMED, TENTATIVE -> CANCELLED).
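As one possible shape for these log entries, the sketch below emits a single JSON line per state transition using Python's standard `logging` module. The logger name, the field names, and the `log_transition` helper are assumptions chosen for illustration.

```python
import json
import logging

logger = logging.getLogger("tcc.participant")  # logger name is an assumption


def log_transition(transaction_id, participant_id, old_state, new_state,
                   idempotency_key, attempt=1):
    """Emit one JSON log line per state transition and return the logged fields."""
    event = {
        "event": "tcc_state_transition",
        "transaction_id": transaction_id,
        "participant_id": participant_id,
        "from": old_state,
        "to": new_state,
        "idempotency_key": idempotency_key,  # lets retries be correlated
        "attempt": attempt,
    }
    logger.info(json.dumps(event, sort_keys=True))
    return event
```

Keeping the transaction ID and idempotency key in every entry makes it possible to reconstruct a transaction's full history across services from the logs alone.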
Expected answer points:
- Both TCC and Saga avoid blocking, unlike 2PC.
- Saga uses compensations that undo what each step did. If Step 3 fails, run compensation for Step 2, then Step 1.
- TCC uses explicit Confirm and Cancel operations that are symmetrical: every participant exposes the same three-operation contract, whatever its domain.
- The complexity in Saga is writing reverse logic for each step; in TCC, complexity shifts to implementing the reservation system.
Expected answer points:
- Best-effort TCC: The coordinator retries Confirm/Cancel with backoff until a limit, then gives up and flags for manual intervention.
- Exactly-once TCC: Persists the intended action before sending it and retries indefinitely until the participant acknowledges. Delivery is at-least-once; idempotent participants make the effect exactly-once.
- Best-effort is simpler but may leave dangling reservations; exactly-once is more complex but ensures eventual resolution.
Expected answer points:
- Authenticate coordinator-to-participant RPC calls (mutual TLS or JWT tokens).
- Authorize participants so the coordinator only calls Confirm/Cancel on registered participants.
- Validate reservation data in Try phase; do not trust coordinator-supplied quantities or IDs without validation.
- Audit log all state transitions on tentative reservations.
- Encrypt coordinator-to-participant communication in transit.
- Use correlation IDs instead of internal transaction IDs in error responses.
- Rate-limit Try requests.
- Set TTL on tentative reservations.
Further Reading
- Two-Phase Commit — Deep dive into 2PC, its limitations, and why it blocks
- Saga Pattern — Saga’s compensation-based approach for distributed transactions
- Distributed Transactions — Overview of distributed transaction patterns including TCC, saga, and 2PC
- Outbox Pattern — Reliable message delivery patterns that complement TCC
- Event-Driven Architecture — Patterns that complement TCC in microservices ecosystems
Conclusion
TCC gives you a structured way to coordinate distributed transactions without blocking. The three-phase model makes the contract explicit: reserve, commit, release. When your domain fits the reservation pattern, you get clean separation of concerns and better scalability than 2PC.
The trade-offs are real: you have to build idempotent operations, timeout handling, and recovery logic for dangling reservations. For high-contention scenarios with natural reservation semantics, TCC is worth the implementation effort. For simpler saga flows or operations that do not fit reservation semantics, basic saga or choreography may be the better choice.
See also Event-Driven Architecture for patterns that complement TCC in microservices ecosystems.