The Saga Pattern: Managing Distributed Transactions
Learn the saga pattern for managing distributed transactions without two-phase commit. Understand choreography vs orchestration implementations with practical examples and production considerations.
ACID transactions do not scale across services. When the order, inventory, and payment services each have their own database, you cannot wrap a single transaction around all three. Two-phase commit is theoretically possible but practically problematic (more on that in Two-Phase Commit).
The saga pattern offers an alternative. Instead of locking resources across services, a saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction that can undo it.
Here I will explain how sagas work, the two implementation approaches (orchestration and choreography), and where the pattern breaks down.
The Core Idea
A saga represents a distributed transaction as a series of steps. Each step is a local transaction on one service. After each step, the saga either continues to the next step or, if a step fails, runs compensating transactions to undo the previous steps.
Step 1: Reserve Inventory (compensate: release inventory)
Step 2: Charge Payment (compensate: refund payment)
Step 3: Create Shipment (compensate: cancel shipment)
If step 3 fails, you run the compensation for step 2 (refund) and then step 1 (release inventory). The saga undoes what it did, leaving the system consistent.
sequenceDiagram
    participant Saga
    participant Inv as Inventory
    participant Pay as Payment
    participant Ship as Shipping
    Saga->>Inv: Reserve
    Inv-->>Saga: OK
    Saga->>Pay: Charge
    Pay-->>Saga: OK
    Saga->>Ship: Ship
    Ship-->>Saga: OK
    Note over Saga: Success - all steps complete
The key insight: a saga does not provide isolation. Unlike ACID transactions, saga steps can see each other's partial results, and so can concurrent sagas. You manage this at the application level.
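The step-plus-compensation structure can be sketched as a minimal saga executor: run each step, record it, and on failure walk the record backwards. This is a simplified sketch with toy lambda steps, not a production implementation:

```python
class SagaStep:
    """A local transaction paired with the compensating transaction that undoes it."""
    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action
        self.compensation = compensation

def run_saga(steps):
    completed = []  # successfully executed steps, in execution order
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # A step failed: compensate the completed steps in reverse order
            for done in reversed(completed):
                done.compensation()
            return False
    return True

# Demo: shipping fails, so payment is refunded and then inventory released
log = []
def fail():
    raise RuntimeError("shipping unavailable")

steps = [
    SagaStep("reserve", lambda: log.append("reserve"), lambda: log.append("release")),
    SagaStep("charge", lambda: log.append("charge"), lambda: log.append("refund")),
    SagaStep("ship", fail, lambda: log.append("cancel")),
]
assert run_saga(steps) is False
assert log == ["reserve", "charge", "refund", "release"]
```

Note that the failed step itself is not compensated: its local transaction never committed, so only the previously completed steps need undoing.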
Saga vs Two-Phase Commit
Two-phase commit (2PC) locks resources until all participants confirm. It provides atomicity but at the cost of availability. If the coordinator fails during the commit phase, participants may be left waiting indefinitely.
Saga takes a different approach. It sacrifices isolation and atomicity for availability and scalability. Steps execute one at a time. Failures trigger compensation, not rollback.
For a deep dive on 2PC and why it is often avoided, see Two-Phase Commit.
Choreography vs Orchestration
There are two ways to implement a saga.
Choreographed Saga
Each service knows its own step and its own compensation. When a step completes, the service emits an event. The next service reacts to that event. If something fails, services emit failure events that trigger compensations.
graph LR
Order[Order Service] -->|OrderCreated| Inv[Inventory]
Inv -->|InventoryReserved| Pay[Payment]
Pay -->|PaymentCharged| Ship[Shipping]
Ship -->|ShipmentCreated| Order
In this flow, each service reacts to the previous step’s event. The behavior is distributed. Each service knows only its own piece.
When Payment fails after Inventory is reserved:
graph LR
Pay -->|PaymentFailed| Inv
Inv -->|InventoryReleased| Order
The compensation logic lives in each service. Inventory responds to the failure by releasing the reservation.
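That event-driven flow can be sketched with a toy in-process event bus. The bus and the event names mirror the diagrams above but are illustrative, not a real broker API:

```python
# Minimal in-process event bus to illustrate choreography
handlers = {}

def subscribe(event, handler):
    handlers.setdefault(event, []).append(handler)

def emit(event, payload):
    for handler in handlers.get(event, []):
        handler(payload)

log = []

# Inventory reacts to OrderCreated, and to PaymentFailed (its compensation)
subscribe("OrderCreated", lambda o: (log.append("reserved"), emit("InventoryReserved", o)))
subscribe("PaymentFailed", lambda o: log.append("released"))

# Payment reacts to InventoryReserved; in this demo it always declines
subscribe("InventoryReserved", lambda o: (log.append("charge-declined"), emit("PaymentFailed", o)))

emit("OrderCreated", {"order_id": 1})
assert log == ["reserved", "charge-declined", "released"]
```

No service here knows the whole workflow: Inventory only knows which events it reacts to, including the failure event that triggers its own compensation.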
Orchestrated Saga
A central orchestrator manages the sequence. It decides what step runs next, handles failures, and triggers compensations. The orchestrator knows the entire workflow.
graph LR
Orch[Order Orchestrator] -->|Reserve| Inv
Inv -->|OK| Orch
Orch -->|Charge| Pay
Pay -->|OK| Orch
Orch -->|Ship| Ship
Ship -->|OK| Orch
The orchestrator keeps state about what has completed. If Payment fails, the orchestrator tells Inventory to release and returns an error to the client.
For a full comparison, see Service Orchestration.
Implementing Saga Compensation
Compensation logic must be deterministic. If step 3 fails after step 2 succeeded, you must undo step 2. Running compensation twice or in the wrong order causes problems.
class OrderSaga:
    def __init__(self, inventory, payment, shipping):
        self.inventory = inventory
        self.payment = payment
        self.shipping = shipping
        self.steps = []  # completed steps, in execution order

    def execute(self, order):
        try:
            # Step 1: Reserve inventory
            reservation = self.inventory.reserve(order.items)
            self.steps.append(('reserve', reservation))
            # Step 2: Process payment
            charge = self.payment.charge(order.payment_info, order.total)
            self.steps.append(('charge', charge))
            # Step 3: Create shipment
            shipment = self.shipping.create(order.address)
            self.steps.append(('ship', shipment))
            return SagaResult(success=True, shipment=shipment)
        except PaymentDeclined:
            # Compensate step 1
            self.inventory.release(self.steps[0][1].reservation_id)
            return SagaResult(success=False, reason="Payment declined")
        except ShippingUnavailable:
            # Compensate step 2
            self.payment.refund(self.steps[1][1].charge_id)
            # Compensate step 1
            self.inventory.release(self.steps[0][1].reservation_id)
            return SagaResult(success=False, reason="Shipping unavailable")
The saga tracks completed steps in order. On failure, it runs compensations in reverse order. This is the compensating transaction pattern.
Idempotency in Sagas
Sagas execute across unreliable networks. Messages may be delivered twice. Services may crash mid-operation. Your saga must handle idempotency.
Reserving the same inventory twice should not double-reserve it. Charging the same payment twice should not double-charge the customer.
def reserve_inventory(self, reservation_id, items):
    # Idempotency check
    if self.inventory.is_reserved(reservation_id):
        return self.inventory.get_reservation(reservation_id)
    # Actually reserve
    return self.inventory.create_reservation(reservation_id, items)

def charge_payment(self, charge_id, payment_info, amount):
    # Idempotency check using charge_id
    existing = self.payment.get_charge(charge_id)
    if existing:
        return existing
    # Process charge
    return self.payment.create_charge(charge_id, payment_info, amount)
Include an idempotency key (the reservation ID from step 1) in the compensation call. Check before acting.
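Compensation needs the same treatment as the forward steps. A sketch of an idempotent release keyed by the reservation ID (the class and method names here are illustrative):

```python
class Inventory:
    def __init__(self):
        self.reservations = {}  # reservation_id -> items
        self.released = set()   # reservation_ids already compensated

    def create_reservation(self, reservation_id, items):
        self.reservations[reservation_id] = items

    def release(self, reservation_id):
        # Idempotent: a second release of the same key is a no-op
        if reservation_id in self.released:
            return True
        self.reservations.pop(reservation_id, None)
        self.released.add(reservation_id)
        return True

inv = Inventory()
inv.create_reservation("r-1", ["item-a"])
assert inv.release("r-1")
assert inv.release("r-1")  # retried compensation: safe, no error
assert "r-1" not in inv.reservations
```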
Nested Sagas
Large workflows sometimes need sub-sagas. Rather than one monolithic saga with dozens of steps, you can decompose into nested sagas where a step’s “execute” is itself a saga.
For example, an order fulfillment saga might have a step called ProcessPayment that internally runs authorize → capture as a nested saga. If the nested saga fails, the parent saga treats it as a single failed step and compensates accordingly.
Why nest sagas?
- Reusability: The ProcessPayment nested saga can be reused across multiple parent sagas (order, subscription, refund)
- Readability: Top-level saga reads like a business workflow, not a technical protocol
- Scoped failures: If payment fails, you know exactly which sub-step failed without scrolling through 20 parent steps
Concurrency control with nested sagas:
Nested sagas introduce concurrency at the parent level. While the payment nested saga is running, other parent sagas may also be running and trying to access shared resources. Use optimistic or pessimistic locking at the parent level.
class ParentSaga:
    def execute(self):
        # Step 1: Reserve inventory (parent-level lock on inventory record)
        with self.lock('inventory', self.order.inventory_id):
            self.reserve_inventory()
        # Step 2: Run payment as nested saga (has its own compensation)
        payment_result = PaymentNestedSaga().execute(self.payment_context)
        if not payment_result.success:
            # Parent-level compensation for inventory
            self.release_inventory()  # Uses the same lock
            return Failed()
        # Step 3: Create shipment
        self.create_shipment()
Optimistic locking: Read the resource version before modifying. On update, check the version hasn’t changed. If it has, abort and retry.
Pessimistic locking: Acquire a lock before accessing the resource. Blocks other sagas from accessing it until the lock releases. Simpler but reduces concurrency.
Use optimistic locking for most cases (higher throughput). Use pessimistic locking only when the cost of a concurrent modification is very high (financial transactions, inventory with hard limits).
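The optimistic variant can be sketched as a compare-and-swap on a version number. The store and record shapes here are a toy illustration of the read-check-write cycle:

```python
class VersionConflict(Exception):
    pass

class Store:
    """Toy record store with versioned, compare-and-swap updates."""
    def __init__(self):
        self.records = {}  # key -> (version, value)

    def read(self, key):
        return self.records[key]  # (version, value)

    def update(self, key, expected_version, new_value):
        version, _ = self.records[key]
        if version != expected_version:
            # Someone modified the record first: abort, re-read, retry
            raise VersionConflict(key)
        self.records[key] = (version + 1, new_value)

store = Store()
store.records["stock"] = (1, 10)

v, stock = store.read("stock")
store.update("stock", v, stock - 1)       # succeeds: version matched
try:
    store.update("stock", v, stock - 2)   # stale version: conflict detected
except VersionConflict:
    pass
assert store.records["stock"] == (2, 9)
```

In a real system the version column lives on the database row and the check-and-increment happens in a single conditional UPDATE.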
For crash recovery, persist saga state to durable storage after every step transition:

from datetime import datetime

class SagaState:
    """Persisted saga state for crash recovery."""

    def __init__(self, saga_id: str):
        self.saga_id = saga_id
        self.steps: list[tuple[str, str, dict]] = []  # [(step_name, status, data)]
        self.status = "pending"
        self.created_at = datetime.utcnow()

    def mark_step_started(self, step_name: str, step_data: dict):
        self.steps.append((step_name, "started", step_data))
        self._persist()

    def mark_step_completed(self, step_name: str, result_data: dict):
        # Update step status
        for i, (name, status, data) in enumerate(self.steps):
            if name == step_name and status == "started":
                self.steps[i] = (name, "completed", {**data, **result_data})
                break
        self._persist()

    def mark_step_compensated(self, step_name: str):
        for i, (name, status, data) in enumerate(self.steps):
            if name == step_name and status == "completed":
                self.steps[i] = (name, "compensated", data)
                break
        self._persist()

    def get_pending_steps(self) -> list[str]:
        return [name for name, status, _ in self.steps if status == "started"]

    def get_completed_steps(self) -> list[str]:
        return [name for name, status, _ in self.steps if status == "completed"]

    def _persist(self):
        # Save to durable storage (database, etc.)
        db.sagas.upsert(self.saga_id, self.to_dict())
Failure Handling
Saga failures fall into several categories.
Transient failures: Network timeouts, temporary unavailability. Retry with backoff. If it keeps failing, treat as permanent.
Permanent failures: Insufficient inventory, card declined. These will not succeed on retry. Trigger compensation.
Unknown state: A service crashes mid-operation. When it recovers, determine what happened. Did the transaction commit before the crash? Did it not? This requires idempotency and state tracking.
Design compensations to be idempotent. If compensation runs twice, the second run should have no effect.
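The transient-vs-permanent distinction can be encoded in a small retry wrapper: retry with exponential backoff, and after the attempt budget is exhausted, reclassify the failure as permanent so compensation kicks in. A sketch with illustrative exception types and delays:

```python
import time

class TransientError(Exception): pass   # e.g. timeout: worth retrying
class PermanentError(Exception): pass   # e.g. card declined: trigger compensation

def run_with_retry(step, max_attempts=3, base_delay=0.01):
    """Retry transient failures with exponential backoff; after
    max_attempts, treat the failure as permanent."""
    for attempt in range(max_attempts):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts - 1:
                raise PermanentError("retries exhausted")
            time.sleep(base_delay * (2 ** attempt))

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError()
    return "ok"

assert run_with_retry(flaky) == "ok"
assert len(attempts) == 3
```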
Performance Considerations
Saga trades a synchronous two-round-trip 2PC for a sequential multi-step approach. The math is worth understanding.
2PC latency profile: Two network round-trips (prepare + commit), but all participants vote and commit in parallel. Typical: 10-50ms per participant phase.
Saga latency profile: Each step adds its own latency. If step 1 takes 20ms, step 2 takes 30ms, step 3 takes 15ms, your total is 65ms plus orchestration overhead. Steps run sequentially, not in parallel.
For a 3-step saga vs 2PC across the same 3 services:
| Metric | 2PC | Saga |
|---|---|---|
| Happy path latency | ~20-40ms (parallel phases) | ~50-80ms (sequential steps) |
| Failure recovery | Blocks until coordinator recovers | Compensation runs immediately |
| Availability | Lower (blocking on coordinator) | Higher (no coordinator SPOF) |
| Lock duration | All locks held during both phases | Each lock released after its step |
Saga’s latency overhead is real but often acceptable. If each step is 10-30ms (typical for service calls), a 5-step saga runs in 50-150ms total. Compare that to a user-facing API timeout (usually 1-5 seconds) and the overhead is negligible.
Where saga latency hurts is high-throughput, low-latency paths (trading systems, real-time pricing). In those cases, consider whether you can pipeline steps (some steps don’t depend on previous step results and can overlap).
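As a sketch of that pipelining, suppose a fraud check that does not depend on the inventory reservation: the two can overlap, and only the shipment step waits on both. The step names and latencies here are illustrative assumptions:

```python
import asyncio

async def reserve_inventory():
    await asyncio.sleep(0.02)  # ~20ms service call
    return "reserved"

async def run_fraud_check():
    await asyncio.sleep(0.03)  # ~30ms, independent of inventory
    return "clean"

async def create_shipment():
    await asyncio.sleep(0.015)  # ~15ms, depends on both results
    return "shipped"

async def pipelined_saga():
    # Independent steps overlap: total is max(20, 30) = ~30ms, not 50ms
    reservation, fraud = await asyncio.gather(reserve_inventory(), run_fraud_check())
    # Shipment depends on both, so it runs after
    shipment = await create_shipment()
    return reservation, fraud, shipment

result = asyncio.run(pipelined_saga())
assert result == ("reserved", "clean", "shipped")
```

Pipelining complicates compensation: if two overlapped steps both complete and a later step fails, both need compensating, so the state tracking must record them independently.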
Testing Strategies
Sagas are notoriously hard to test because they span services and involve time. A structured approach helps.
Unit Testing Compensations
Test each compensation in isolation first. The compensation is the most critical piece — if it fails, your saga is stuck.
# Test that releasing inventory twice has no effect (idempotency)
def test_release_inventory_idempotent():
    inventory = InMemoryInventory()
    inventory.reserve("order-123", ["item-a"])
    # First release — succeeds
    result1 = inventory.release("order-123")
    assert result1.success
    assert not inventory.is_reserved("order-123")
    # Second release — should be idempotent (no error)
    result2 = inventory.release("order-123")
    assert result2.success  # Still succeeds, even though already released
    # Third release — still idempotent
    result3 = inventory.release("order-123")
    assert result3.success
Integration Testing Failure Scenarios
The real test is whether your saga handles failures correctly. Set up test infrastructure that simulates failures at each step.
# Test: step 3 fails, step 2 compensation runs correctly
def test_saga_step3_failure_triggers_step2_compensation():
    inventory = MockInventory()                 # Always succeeds
    payment = MockPayment()                     # Always succeeds
    shipping = MockShipping(fail_on="create")   # Fails on create
    saga = OrderSaga(inventory, payment, shipping)
    result = saga.execute(order_with_3_items)
    assert result.failed
    assert result.failed_step == "create_shipment"
    assert payment.refund_was_called()      # Step 2 compensated
    assert inventory.release_was_called()   # Step 1 compensated
    assert not shipping.shipment_created    # Step 3 never ran
Testing Compensations Run in Correct Order
The most common saga bug is compensation running in the wrong order. Write a test that explicitly verifies reverse order.
def test_compensation_runs_in_reverse_order():
    call_order = []

    class TrackingService:
        def __init__(self, name):
            self.name = name
        def do_step(self):
            call_order.append(f"do-{self.name}")
        def compensate(self):
            call_order.append(f"compensate-{self.name}")

    class OrderedSaga:
        def __init__(self):
            self.services = [
                TrackingService("step1"),
                TrackingService("step2"),
                TrackingService("step3"),
            ]
            self.completed = []

        def execute(self):
            for svc in self.services:
                svc.do_step()
                self.completed.append(svc)

        def compensate(self):
            # Must run in reverse order of execution: step3, step2, step1
            for svc in reversed(self.completed):
                svc.compensate()

    saga = OrderedSaga()
    saga.execute()
    saga.compensate()
    assert call_order == [
        "do-step1", "do-step2", "do-step3",
        "compensate-step3", "compensate-step2", "compensate-step1",
    ]
Chaos Testing Sagas in Production
Once your saga is running in production, inject failures to verify it handles them:
- Kill a service mid-saga and verify compensation runs
- Introduce network latency and verify timeouts trigger correctly
- Fill up a resource (disk, connection pool) and verify graceful degradation
- Split the network between two services and verify saga completes or compensates correctly
Distributed Tracing Integration
Sagas span multiple services. Without tracing, debugging a failed saga means grepping logs across 5 services and trying to piece together what happened. With distributed tracing (OpenTelemetry, Zipkin, Jaeger), you get a single trace ID that follows the saga across all services.
Trace Structure for Sagas
A saga trace has a parent span for the overall saga, with child spans for each step and compensation.
Trace: order-123-saga (trace-id: abc123)
├── Span: saga_created (service: order-service)
├── Span: step.reserve_inventory (service: inventory-service)
│ └── Span: compensate.reserve_inventory (service: inventory-service)
├── Span: step.charge_payment (service: payment-service)
│ └── Span: compensate.charge_payment (service: payment-service)
├── Span: step.create_shipment (service: shipping-service)
└── Span: saga_completed / saga_failed
Implementing Trace Context Propagation
Pass trace context through saga steps using baggage or span links.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer(__name__)

class OrderSaga:
    def execute(self, order, carrier=None):
        # Extract the incoming trace context, if triggered by an HTTP request
        # (the carrier is the dict of incoming headers)
        ctx = extract(carrier) if carrier else None
        with tracer.start_as_current_span("order_saga", context=ctx) as span:
            span.set_attribute("saga.id", order.saga_id)
            span.set_attribute("saga.type", "order_fulfillment")
            # Inject the current trace context into outgoing headers
            headers = {}
            inject(headers)
            try:
                # Step 1: Reserve inventory (pass headers for trace propagation)
                self.inventory.reserve(order.items, headers)
                # Step 2: Charge payment
                self.payment.charge(order.payment, headers)
                # Step 3: Create shipment
                shipment = self.shipping.create(order.address, headers)
                span.set_status(StatusCode.OK)
                return Success(shipment)
            except Exception as e:
                span.record_exception(e)
                span.set_status(StatusCode.ERROR, str(e))
                raise
When a step fails and compensation runs, link the compensation span to the original step span so you can see the pairing in the trace.
from opentelemetry.trace import Link

def compensate_inventory(self, reservation_id, original_span):
    # Link the compensation span back to the step span it undoes
    with tracer.start_as_current_span(
        "compensate.reserve_inventory",
        links=[Link(original_span.get_span_context())],
    ) as span:
        span.set_attribute("compensation.for", "step.reserve_inventory")
        span.set_attribute("reservation.id", reservation_id)
        self.inventory.release(reservation_id)
This way, in your trace viewer, you see the compensate span linked back to its original do span — paired visually rather than guessing from logs.
Framework Recommendations
Building saga from scratch is educational but painful for production. Use a framework that handles state persistence, retry logic, observability, and distributed tracing out of the box.
Temporal
The strongest choice for saga orchestration. Temporal provides durable workflow execution — if your service crashes mid-saga, Temporal persists the workflow state and resumes it from where it left off. No need to build your own saga state machine.
- Strengths: Durable execution (survives worker crashes), built-in retries with backoff, activity heartbeats, sandboxed workflow code, strong OpenTelemetry integration
- Good for: Complex multi-step business workflows, long-running sagas (hours to days)
- Trade-offs: Self-hosting is operationally heavy; Temporal Cloud pricing can be significant at scale
# Temporal workflow example
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> OrderResult:
        # Activities with automatic retry
        reservation = await workflow.execute_activity(
            reserve_inventory,
            order.items,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        try:
            charge = await workflow.execute_activity(
                charge_payment,
                args=[order.payment, order.total],
                start_to_close_timeout=timedelta(seconds=30),
            )
        except PaymentDeclined:
            await workflow.execute_activity(
                release_inventory,
                reservation.id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            return OrderResult(rejected=True, reason="Payment declined")
        shipment = await workflow.execute_activity(
            create_shipment,
            order.address,
            start_to_close_timeout=timedelta(seconds=30),
        )
        return OrderResult(confirmed=True, shipment=shipment)
AWS Step Functions
Managed saga orchestration on AWS. Integrates tightly with AWS services (Lambda, ECS, SQS, DynamoDB). Good if you’re already all-in on AWS.
- Strengths: Fully managed, pay-per-state-transition, tight Lambda integration, visual workflow designer
- Good for: AWS-centric architectures, medium-complexity workflows
- Trade-offs: Vendor lock-in, expensive at high step counts, debugging can be opaque
Conductor (Netflix)
Conductor is an open-source saga orchestrator from Netflix. Good for microservices that need workflow orchestration without heavy operational overhead.
- Strengths: Open source (self-hostable), JSON-based workflow definitions, HTTP-based workers
- Good for: Teams wanting open-source without Temporal complexity
- Trade-offs: Not as battle-tested as Temporal at extreme scale, less mature ecosystem
Comparison
| Framework | Durability | Open Source | Complexity | Best For |
|---|---|---|---|---|
| Temporal | Excellent (durable execution) | Yes (server) + cloud | Medium | Complex long-running workflows |
| AWS Step Functions | Good (managed) | No | Low | AWS-centric, simple workflows |
| Conductor | Good | Yes (fully) | Medium | Open-source preference |
For most production scenarios, Temporal is the right call. The durable execution guarantee alone saves you from a whole class of saga state loss bugs.
When to Use / When Not to Use Saga
| Criteria | Saga (Choreography) | Saga (Orchestration) | Two-Phase Commit |
|---|---|---|---|
| Atomicity | Eventual | Eventual | True atomicity |
| Isolation | None | None | Full isolation |
| Availability | High | High | Low (blocks on coordinator failure) |
| Complexity | Distributed logic | Centralized logic | Distributed but synchronous |
| Compensation | Each service knows its own | Orchestrator directs | Automatic rollback |
| Latency | Per-step latency | Per-step + orchestration overhead | Two round-trips |
| Debugging | Harder (distributed) | Easier (centralized) | Moderate |
| Rollback Cost | Compensation required | Compensation required | Free (automatic) |
Use saga when:
- Operations span multiple services with separate databases
- You cannot use 2PC (which is usually the right call)
- Business transactions map naturally to a sequence of steps
- You can define compensating transactions for each step
- Eventual consistency is acceptable for your domain
- You need availability over strong isolation
Avoid saga when:
- Steps have tight interdependencies that require strict isolation
- Rollback must be immediate and guaranteed
- Compensation is expensive or impossible (sending an email, charging a card with long refund times)
- Your domain requires all-or-nothing atomicity that saga cannot provide
- The inconsistency window is unacceptable for your use case
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Step fails and compensation also fails | System left in inconsistent state | Design idempotent compensations; implement retry with exponential backoff; alert on repeated compensation failures |
| Service crashes mid-step | Step may or may not have completed; unknown state | Use idempotency keys; implement saga state tracking; use durable workflow engines |
| Concurrent sagas interfere | One saga’s uncommitted data affects another saga’s read | Implement optimistic concurrency control; use application-level locks for critical resources |
| Compensation runs on already-succeeded step | Double compensation causes incorrect state | Track completed steps explicitly; prevent compensation on committed steps |
| Saga state lost (orchestrator crash) | Cannot determine what steps completed | Persist saga state to durable storage; use workflow engines that handle this |
| Infinite retry loop | System stuck in repeated failed attempts | Implement max retry count; move to dead letter state after threshold; alert |
Failure Flow Diagram
graph TD
Start[Start Saga] --> Step1[Execute Step 1]
Step1 --> Step1OK{Step 1 OK?}
Step1OK -->|No| Fail1[Return Error<br/>Nothing to compensate]
Step1OK -->|Yes| Step2[Execute Step 2]
Step2 --> Step2OK{Step 2 OK?}
Step2OK -->|No| Comp1[Compensate Step 1]
Comp1 --> Fail2[Return Error]
Step2OK -->|Yes| Step3[Execute Step 3]
Step3 --> Step3OK{Step 3 OK?}
Step3OK -->|No| Comp2A[Compensate Step 2]
Comp2A --> Comp2B[Compensate Step 1]
Comp2B --> Fail3[Return Error]
Step3OK -->|Yes| Success[Saga Complete]
Saga Execution State Machine
stateDiagram-v2
[*] --> Pending: Saga created
Pending --> Executing: First step started
Executing --> Executing: Step N completed
Executing --> Compensating: Step failed
Compensating --> Compensating: Compensation in progress
Compensating --> Completed: All compensations done
Executing --> Completed: Final step succeeded
Compensating --> Failed: Compensation failed after retries
Pending --> Failed: Immediate failure (validation error)
Completed --> [*]
Failed --> [*]
State Definitions:
| State | Description |
|---|---|
| Pending | Saga created but not yet started. Initial state before first step execution. |
| Executing | One or more steps have completed successfully. Saga is processing subsequent steps. |
| Compensating | A step failed and compensation is running in reverse order to undo completed steps. |
| Completed | All steps succeeded or all compensations succeeded. Terminal success state. |
| Failed | Saga reached an unrecoverable state (compensation failed or max retries exceeded). |
State Transition Rules:
- Once in Executing, cannot return to Pending
- Compensating can only be entered from Executing
- Completed and Failed are terminal states
- From Compensating, success leads to Completed, persistent failure leads to Failed
- Only Pending and Failed can be safely retried at the saga level
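These rules can be encoded as a transition table and enforced on every state change, so an orchestrator bug cannot move a saga out of a terminal state. A sketch using the state names from the diagram above:

```python
# Allowed transitions, derived from the state diagram
ALLOWED = {
    "pending":      {"executing", "failed"},
    "executing":    {"executing", "compensating", "completed"},
    "compensating": {"compensating", "completed", "failed"},
    "completed":    set(),  # terminal
    "failed":       set(),  # terminal
}

def transition(current, nxt):
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal saga transition: {current} -> {nxt}")
    return nxt

state = "pending"
state = transition(state, "executing")
state = transition(state, "compensating")  # a step failed
state = transition(state, "completed")     # all compensations succeeded
try:
    transition(state, "executing")         # terminal state: rejected
except ValueError:
    pass
assert state == "completed"
```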
Observability Checklist
Metrics
- Saga completion rate (success vs failure vs compensating)
- Saga execution duration by type
- Compensation execution count and success rate
- Step failure rate by step type
- Concurrent saga count
- Average number of steps per saga
Logs
- Log saga start with correlation ID and input parameters
- Log each step start, completion, and compensation with step index
- Include compensating transaction ID in compensation logs
- Log saga outcome (success, failure, compensating)
- Include all relevant IDs: saga ID, correlation ID, step IDs, compensation IDs
Alerts
- Alert when saga takes longer than expected threshold
- Alert when compensation repeatedly fails
- Alert when saga failure rate exceeds normal baseline
- Alert when max retry count reached on a step
- Alert on stuck sagas (no progress for extended period)
Security Checklist
- Authenticate and authorize saga trigger endpoints
- Validate saga input parameters to prevent injection
- Audit log all saga state changes (start, step completion, compensation)
- Do not log sensitive data (payment info, passwords) in saga context
- Encrypt saga state at rest if using external storage
- Use correlation IDs for tracing without exposing internal IDs
Common Pitfalls / Anti-Patterns
Treating saga as ACID transaction: Saga does not provide isolation. Concurrent sagas can see each other’s partial results. If saga A reserves inventory and saga B reads inventory before A completes, B may make decisions on uncommitted data. Handle this at the application level.
Non-idempotent steps or compensations: If a step or compensation runs twice due to retries, the effect should be the same as running once. Reserve inventory twice should not double-reserve. Always check before acting.
Compensation order errors: Compensations must run in reverse order of execution. If step 2’s compensation runs before step 1’s, you may leave the system in a worse state. Explicitly track execution order.
Ignoring the inconsistency window: During a saga execution, the system is in an inconsistent state. Other operations may read partial results. Design your application to handle this (show pending states, use optimistic UI).
Long-running compensations: Compensations can take time (refunds, cancellations). While compensating, the system is still inconsistent. Minimize compensation time and alert if compensations are taking too long.
Not planning for compensation failure: What happens if compensation fails? The saga is stuck. Implement retry with backoff, then move to a dead letter state that requires manual intervention.
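That retry-then-dead-letter policy can be sketched as a small wrapper. The function name, delays, and dead-letter list are illustrative; in production the dead letter would go to a queue or table that pages a human:

```python
import time

dead_letters = []

def compensate_with_backoff(compensation, saga_id, max_attempts=3, base_delay=0.01):
    """Retry a failing compensation with exponential backoff; after the
    threshold, park the saga in a dead-letter state for manual review."""
    for attempt in range(max_attempts):
        try:
            compensation()
            return True
        except Exception as e:
            if attempt == max_attempts - 1:
                dead_letters.append((saga_id, str(e)))  # alert + manual intervention
                return False
            time.sleep(base_delay * (2 ** attempt))

calls = []
def always_fails():
    calls.append(1)
    raise RuntimeError("refund endpoint down")

assert compensate_with_backoff(always_fails, "saga-42") is False
assert len(calls) == 3
assert dead_letters == [("saga-42", "refund endpoint down")]
```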
Quick Recap
sequenceDiagram
    participant Saga
    participant Inv as Inventory
    participant Pay as Payment
    Saga->>Inv: Step 1: Reserve
    Inv-->>Saga: OK
    Saga->>Pay: Step 2: Charge
    Pay-->>Saga: Fail
    Saga->>Inv: Compensate (release)
Key Points
- Saga breaks distributed transactions into steps with compensating transactions
- If a step fails, previous steps are compensated in reverse order
- Saga provides eventual consistency, not ACID isolation
- Two implementations: choreography (distributed) and orchestration (centralized)
- Idempotency is essential for safe retries
Production Checklist
# Saga Pattern Production Readiness
- [ ] Idempotent step and compensation handlers implemented
- [ ] Saga state persisted to durable storage
- [ ] Compensation logic tested for each step
- [ ] Maximum retry count configured per step
- [ ] Monitoring for saga execution duration
- [ ] Alerting for compensation failures
- [ ] Concurrent saga handling planned (optimistic locking or locks)
- [ ] User-facing pending state handling designed
- [ ] Dead letter handling for unrecoverable saga states
- [ ] Correlation IDs in all saga logs
Example: Order Processing Saga
A complete order processing saga with three services:
def create_order_saga(order):
    saga = OrderSaga()
    # Step 1: Reserve inventory
    reservation = saga.reserve_inventory(order.items)
    if not reservation.success:
        return OrderResult(rejected=True, reason="Insufficient inventory")
    # Step 2: Authorize payment
    authorization = saga.authorize_payment(order.payment, order.total)
    if not authorization.success:
        saga.compensate_inventory(reservation.id)
        return OrderResult(rejected=True, reason="Payment declined")
    # Step 3: Create shipment
    shipment = saga.create_shipment(order.address, order.items)
    if not shipment.success:
        saga.compensate_payment(authorization.id)
        saga.compensate_inventory(reservation.id)
        return OrderResult(rejected=True, reason="Shipping unavailable")
    # All steps succeeded
    return OrderResult(confirmed=True, order_id=order.id, shipment=shipment)
Each compensate_* method is a compensating transaction. They run in reverse order on failure.
Limitations
Sagas have real limitations.
No isolation: Concurrent sagas can interfere. If saga A reserves inventory and saga B reads it before A completes, B may make decisions based on uncommitted data. Application-level locks or optimistic concurrency control help, but the pattern does not give you automatic isolation.
Compensation complexity: Some operations are hard to compensate. Sending an email cannot be undone. Charging a card can be refunded, but refunds take time. Compensation is not free.
Debugging: A saga failure means tracing through multiple services to understand what happened. Logs and correlation IDs are essential.
Related Concepts
For distributed transaction fundamentals, see Distributed Transactions. For two-phase commit, see Two-Phase Commit.
For workflow patterns, see Service Orchestration and Service Choreography.
Conclusion
The saga pattern manages distributed transactions without 2PC. Break the transaction into steps, and define a compensation for each step. If a step fails, run the compensations in reverse order.
Choreographed sagas distribute behavior across services. Orchestrated sagas centralize coordination in an orchestrator. Both work; the choice depends on workflow complexity and team structure.
Sagas trade ACID isolation for availability. The application must handle partial states and concurrent access. This complexity is inherent to distributed transactions; saga just makes it explicit rather than pretending it does not exist.