The Saga Pattern: Managing Distributed Transactions

Learn the saga pattern for managing distributed transactions without two-phase commit. Understand choreography vs orchestration implementations with practical examples and production considerations.

ACID transactions do not scale across services. When order service, inventory service, and payment service each have their own databases, you cannot wrap a single transaction around all three. Two-phase commit is theoretically possible but practically problematic (more on that in Two-Phase Commit).

The saga pattern offers an alternative. Instead of locking resources across services, a saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction that can undo it.

Here I will explain how sagas work, the two implementation approaches (orchestration and choreography), and where the pattern breaks down.

The Core Idea

A saga represents a distributed transaction as a series of steps. Each step is a local transaction on one service. After each step, the saga either continues to the next step or, if a step fails, runs compensating transactions to undo the previous steps.

Step 1: Reserve Inventory (compensate: release inventory)
Step 2: Charge Payment (compensate: refund payment)
Step 3: Create Shipment (compensate: cancel shipment)

If step 3 fails, you run the compensation for step 2 (refund) and then step 1 (release inventory). The saga undoes what it did, leaving the system consistent.
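The rule above (run steps in order, undo completed steps in reverse on failure) fits in a few lines. This is a minimal in-process sketch, not a production engine: the `Step` tuple, the `run_saga` helper, and the lambdas standing in for service calls are all hypothetical names chosen for illustration.

```python
from typing import Callable, List, Tuple

# Each step pairs a forward action with the compensation that undoes it.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            # The failed step never completed, so only earlier steps are undone.
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_shipment() -> None:
    raise RuntimeError("shipping unavailable")


ok = run_saga([
    ("reserve", lambda: log.append("reserve"), lambda: log.append("release")),
    ("charge", lambda: log.append("charge"), lambda: log.append("refund")),
    ("ship", fail_shipment, lambda: log.append("cancel")),
])
# log is now ["reserve", "charge", "refund", "release"] and ok is False
```

Note that the shipment's own compensation never runs, because the shipment was never created.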

sequenceDiagram
    participant Saga
    participant Inv as Inventory
    participant Pay as Payment
    participant Ship as Shipping
    Saga->>Inv: Reserve
    Inv-->>Saga: OK
    Saga->>Pay: Charge
    Pay-->>Saga: OK
    Saga->>Ship: Ship
    Ship-->>Saga: OK
    Note over Saga: Success - all steps complete

The key caveat: a saga does not provide isolation. Unlike an ACID transaction, each step commits locally as soon as it finishes, so concurrent sagas and other readers can see partial results. You manage this at the application level.

Saga vs Two-Phase Commit

Two-phase commit (2PC) locks resources until all participants confirm. It provides atomicity but at the cost of availability. If the coordinator fails during the commit phase, participants may be left waiting indefinitely.

Saga takes a different approach. It gives up isolation and immediate atomicity (the outcome is only eventually all-or-nothing) in exchange for availability and scalability. Steps execute one at a time, and each commits locally. Failures trigger compensation, not rollback.

For a deep dive on 2PC and why it is often avoided, see Two-Phase Commit.

Choreography vs Orchestration

There are two ways to implement a saga.

Choreographed Saga

Each service knows its own step and its own compensation. When a step completes, the service emits an event. The next service reacts to that event. If something fails, services emit failure events that trigger compensations.

graph LR
    Order[Order Service] -->|OrderCreated| Inv[Inventory]
    Inv -->|InventoryReserved| Pay[Payment]
    Pay -->|PaymentCharged| Ship[Shipping]
    Ship -->|ShipmentCreated| Order

In this flow, each service reacts to the previous step’s event. The behavior is distributed. Each service knows only its own piece.

When Payment fails after Inventory is reserved:

graph LR
    Pay -->|PaymentFailed| Inv
    Inv -->|InventoryReleased| Order

The compensation logic lives in each service. Inventory responds to the failure by releasing the reservation.
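To make the event flow concrete, here is a small sketch of the failure path using a synchronous in-memory bus. The `EventBus` class, the handler names, and the `card_ok` flag are hypothetical stand-ins; a real system would use a message broker with asynchronous, at-least-once delivery.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker (hypothetical)."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.history.append(event)
        for handler in self.handlers[event]:
            handler(payload)


bus = EventBus()
reserved: set = set()


def on_order_created(order: Dict) -> None:
    # Inventory reacts to the order and announces its own result.
    reserved.add(order["id"])
    bus.publish("InventoryReserved", order)


def on_inventory_reserved(order: Dict) -> None:
    # Payment decides on its own whether to emit success or failure.
    event = "PaymentCharged" if order["card_ok"] else "PaymentFailed"
    bus.publish(event, order)


def on_payment_failed(order: Dict) -> None:
    # Inventory owns its compensation; discard() keeps it idempotent.
    reserved.discard(order["id"])
    bus.publish("InventoryReleased", order)


bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("InventoryReserved", on_inventory_reserved)
bus.subscribe("PaymentFailed", on_payment_failed)

bus.publish("OrderCreated", {"id": "order-1", "card_ok": False})
```

No component in this sketch knows the whole workflow. The sequence only emerges from the subscriptions, which is both the appeal and the debugging cost of choreography.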

Orchestrated Saga

A central orchestrator manages the sequence. It decides what step runs next, handles failures, and triggers compensations. The orchestrator knows the entire workflow.

graph LR
    Orch[Order Orchestrator] -->|Reserve| Inv
    Inv -->|OK| Orch
    Orch -->|Charge| Pay
    Pay -->|OK| Orch
    Orch -->|Ship| Ship
    Ship -->|OK| Orch

The orchestrator keeps state about what has completed. If Payment fails, the orchestrator tells Inventory to release and returns an error to the client.

For a full comparison, see Service Orchestration.

Implementing Saga Compensation

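Compensation logic must be deterministic: given the list of completed steps, the set and order of compensations is fixed. If step 3 fails after steps 1 and 2 succeeded, you must undo step 2 and then step 1. Running a compensation twice or in the wrong order causes problems.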

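from dataclasses import dataclass
from typing import Any, Optional

class PaymentDeclined(Exception):
    """Raised by the payment service when the charge is rejected."""

class ShippingUnavailable(Exception):
    """Raised by the shipping service when no shipment can be created."""

@dataclass
class SagaResult:
    success: bool
    shipment: Optional[Any] = None
    reason: Optional[str] = None

class OrderSaga:
    def __init__(self, inventory, payment, shipping):
        self.inventory = inventory
        self.payment = payment
        self.shipping = shipping
        self.steps = []  # (compensation, argument) per completed step, in order

    def execute(self, order):
        try:
            # Step 1: Reserve inventory
            reservation = self.inventory.reserve(order.items)
            self.steps.append((self.inventory.release, reservation.reservation_id))

            # Step 2: Process payment
            charge = self.payment.charge(order.payment_info, order.total)
            self.steps.append((self.payment.refund, charge.charge_id))

            # Step 3: Create shipment (last step, so nothing left to record)
            shipment = self.shipping.create(order.address)
            return SagaResult(success=True, shipment=shipment)

        except PaymentDeclined:
            self._compensate()  # Undoes step 1
            return SagaResult(success=False, reason="Payment declined")

        except ShippingUnavailable:
            self._compensate()  # Undoes step 2, then step 1
            return SagaResult(success=False, reason="Shipping unavailable")

    def _compensate(self):
        # Undo completed steps in reverse order of execution
        while self.steps:
            compensation, argument = self.steps.pop()
            compensation(argument)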

The saga tracks completed steps in order. On failure, it runs compensations in reverse order. This is the compensating transaction pattern.

Idempotency in Sagas

Sagas execute across unreliable networks. Messages may be delivered twice. Services may crash mid-operation. Every step and every compensation must therefore be idempotent.

Reserving the same inventory twice should not double-reserve. Charging the same payment twice should not double-charge.

def reserve_inventory(self, reservation_id, items):
    # Idempotency check
    if self.inventory.is_reserved(reservation_id):
        return self.inventory.get_reservation(reservation_id)

    # Actually reserve
    return self.inventory.create_reservation(reservation_id, items)

def charge_payment(self, charge_id, payment_info, amount):
    # Idempotency check using charge_id
    existing = self.payment.get_charge(charge_id)
    if existing:
        return existing

    # Process charge
    return self.payment.create_charge(charge_id, payment_info, amount)

Compensations need the same treatment. Pass an idempotency key (for example, the reservation ID from step 1) with the compensation call, and check whether the work has already been undone before acting.
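The forward steps above check before acting; a compensation can follow the same shape. The sketch below is an in-memory illustration under assumed names (`InventoryService`, a `released` set acting as a tombstone record); a real service would keep both structures in durable storage.

```python
from typing import Dict, List


class InventoryService:
    """In-memory stand-in; a real service would persist both structures."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self.stock = dict(stock)
        self.reservations: Dict[str, List[str]] = {}
        self.released: set = set()  # reservation IDs already compensated

    def reserve(self, reservation_id: str, items: List[str]) -> None:
        if reservation_id in self.reservations or reservation_id in self.released:
            return  # duplicate or late delivery: already handled
        for item in items:
            self.stock[item] -= 1
        self.reservations[reservation_id] = list(items)

    def release(self, reservation_id: str) -> None:
        if reservation_id in self.released:
            return  # compensation already ran once
        # pop() with a default also covers "release arrived before reserve"
        for item in self.reservations.pop(reservation_id, []):
            self.stock[item] += 1
        self.released.add(reservation_id)


inventory = InventoryService({"item-a": 5})
inventory.reserve("res-1", ["item-a"])
inventory.release("res-1")
inventory.release("res-1")  # retry or duplicate message: no effect
```

Recording released IDs also protects against a reserve message that is delivered after its own compensation, which would otherwise leave stock reserved forever.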

Nested Sagas

Large workflows sometimes need sub-sagas. Rather than one monolithic saga with dozens of steps, you can decompose into nested sagas where a step’s “execute” is itself a saga.

For example, an order fulfillment saga might have a step called ProcessPayment that internally runs authorize → capture as a nested saga. If the nested saga fails, the parent saga treats it as a single failed step and compensates accordingly.

Why nest sagas?

  • Reusability: The ProcessPayment nested saga can be reused across multiple parent sagas (order, subscription, refund)
  • Readability: Top-level saga reads like a business workflow, not a technical protocol
  • Scoped failures: If payment fails, you know exactly which sub-step failed without scrolling through 20 parent steps

Concurrency control with nested sagas:

Nested sagas introduce concurrency at the parent level. While the payment nested saga is running, other parent sagas may also be running and trying to access shared resources. Use optimistic or pessimistic locking at the parent level.

class ParentSaga:
    def execute(self):
        # Step 1: Reserve inventory (parent-level lock on inventory record)
        with self.lock('inventory', self.order.inventory_id):
            self.reserve_inventory()

        # Step 2: Run payment as nested saga (has its own compensation)
        payment_result = PaymentNestedSaga().execute(self.payment_context)

        if not payment_result.success:
            # Parent-level compensation for inventory, taken under the same lock
            with self.lock('inventory', self.order.inventory_id):
                self.release_inventory()
            return SagaResult(success=False, reason="Payment failed")

        # Step 3: Create shipment
        shipment = self.create_shipment()
        return SagaResult(success=True, shipment=shipment)

Optimistic locking: Read the resource version before modifying. On update, check the version hasn’t changed. If it has, abort and retry.

Pessimistic locking: Acquire a lock before accessing the resource. Blocks other sagas from accessing it until the lock releases. Simpler but reduces concurrency.

Use optimistic locking for most cases (higher throughput). Use pessimistic locking only when the cost of a concurrent modification is very high (financial transactions, inventory with hard limits).
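As an illustration of the optimistic variant, here is a minimal sketch of a versioned write with retry. `InventoryStore`, `Record`, and `VersionConflict` are hypothetical names; in a relational database the check would typically be an `UPDATE ... WHERE version = :expected` whose affected-row count tells you whether you won the race.

```python
from dataclasses import dataclass
from typing import Dict


class VersionConflict(Exception):
    pass


@dataclass
class Record:
    quantity: int
    version: int = 0


class InventoryStore:
    """In-memory stand-in for a table with a version column (hypothetical)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Record] = {}

    def read(self, key: str) -> Record:
        row = self.rows[key]
        return Record(row.quantity, row.version)  # return a snapshot copy

    def write_if_version(self, key: str, quantity: int, expected_version: int) -> None:
        row = self.rows[key]
        if row.version != expected_version:
            raise VersionConflict(key)
        self.rows[key] = Record(quantity, expected_version + 1)


def reserve(store: InventoryStore, key: str, amount: int, max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        snapshot = store.read(key)
        if snapshot.quantity < amount:
            return False  # permanent failure: not enough stock
        try:
            store.write_if_version(key, snapshot.quantity - amount, snapshot.version)
            return True
        except VersionConflict:
            continue  # another saga won the race; re-read and retry
    return False
```

The retry loop is bounded on purpose: under heavy contention an optimistic writer can lose repeatedly, and at that point the saga step should fail rather than spin.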

Whichever locking strategy you choose, persist the saga’s own progress so that a crashed orchestrator can resume or compensate after a restart:

from datetime import datetime, timezone

class SagaState:
    """Persisted saga state for crash recovery."""

    def __init__(self, saga_id: str):
        self.saga_id = saga_id
        self.steps: list[tuple[str, str, dict]] = []  # [(step_name, status, data)]
        self.status = "pending"
        self.created_at = datetime.now(timezone.utc)

    def mark_step_started(self, step_name: str, step_data: dict):
        self.steps.append((step_name, "started", step_data))
        self._persist()

    def mark_step_completed(self, step_name: str, result_data: dict):
        # Update step status
        for i, (name, status, data) in enumerate(self.steps):
            if name == step_name and status == "started":
                self.steps[i] = (name, "completed", {**data, **result_data})
                break
        self._persist()

    def mark_step_compensated(self, step_name: str):
        for i, (name, status, data) in enumerate(self.steps):
            if name == step_name and status == "completed":
                self.steps[i] = (name, "compensated", data)
                break
        self._persist()

    def get_pending_steps(self) -> list[str]:
        return [name for name, status, _ in self.steps if status == "started"]

    def get_completed_steps(self) -> list[str]:
        return [name for name, status, _ in self.steps if status == "completed"]

    def to_dict(self) -> dict:
        return {
            "saga_id": self.saga_id,
            "status": self.status,
            "created_at": self.created_at.isoformat(),
            "steps": [list(step) for step in self.steps],
        }

    def _persist(self):
        # Save to durable storage; `db` stands in for your storage client
        db.sagas.upsert(self.saga_id, self.to_dict())
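Once state like `SagaState.steps` is persisted, recovery after a crash comes down to reading it back and deciding what to do. The sketch below assumes one possible policy (verify any step left in "started", then roll back everything that completed); `plan_recovery` is a hypothetical helper, and a real system might instead resume forward.

```python
from typing import List, Tuple

# (step_name, status, data), the same shape as SagaState.steps
StepRecord = Tuple[str, str, dict]


def plan_recovery(steps: List[StepRecord]) -> Tuple[List[str], List[str]]:
    """Return (steps_to_verify, steps_to_compensate) for an interrupted saga."""
    # "started" means the outcome is unknown: ask the service before acting.
    to_verify = [name for name, status, _ in steps if status == "started"]
    # Completed steps are undone newest first.
    to_compensate = [name for name, status, _ in reversed(steps) if status == "completed"]
    return to_verify, to_compensate


persisted = [
    ("reserve_inventory", "completed", {"reservation_id": "res-1"}),
    ("charge_payment", "completed", {"charge_id": "ch-1"}),
    ("create_shipment", "started", {}),
]
to_verify, to_compensate = plan_recovery(persisted)
# to_verify == ["create_shipment"]
# to_compensate == ["charge_payment", "reserve_inventory"]
```

Steps already marked "compensated" are skipped automatically, so running the plan again after a second crash does not undo anything twice.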

Failure Handling

Saga failures fall into several categories.

Transient failures: Network timeouts, temporary unavailability. Retry with backoff. If it keeps failing, treat as permanent.

Permanent failures: Insufficient inventory, card declined. These will not succeed on retry. Trigger compensation.

Unknown state: A service crashes mid-operation. When it recovers, determine what happened. Did the transaction commit before the crash? Did it not? This requires idempotency and state tracking.

Design compensations to be idempotent. If compensation runs twice, the second run should have no effect.
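The transient/permanent split can be sketched as a small retry helper. The exception classes, `call_with_retry`, and the default of three attempts with a 0.1 second base delay are assumptions for illustration; the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Timeouts, temporary unavailability: worth retrying."""


class PermanentError(Exception):
    """Declined card, insufficient stock: retrying will not help."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            if attempt == max_attempts:
                # Retries exhausted: escalate so the saga compensates.
                raise PermanentError(f"gave up after {attempt} attempts") from exc
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise AssertionError("max_attempts must be at least 1")


attempts: List[int] = []
delays: List[float] = []


def flaky_charge() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("timeout")
    return "charge-ok"


result = call_with_retry(flaky_charge, sleep=delays.append)
```

A `PermanentError` raised by the operation itself is not caught, so it propagates immediately and the saga can start compensating without wasting retries.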

Performance Considerations

Saga trades a synchronous two-round-trip 2PC for a sequential multi-step approach. The math is worth understanding.

2PC latency profile: Two network round-trips (prepare + commit), but all participants vote and commit in parallel. Typical: 10-50ms per participant phase.

Saga latency profile: Each step adds its own latency. If step 1 takes 20ms, step 2 takes 30ms, step 3 takes 15ms, your total is 65ms plus orchestration overhead. Steps run sequentially, not in parallel.

For a 3-step saga vs 2PC across the same 3 services (illustrative figures, not benchmarks):

| Metric | 2PC | Saga |
| --- | --- | --- |
| Happy path latency | ~20-40ms (parallel phases) | ~50-80ms (sequential steps) |
| Failure recovery | Blocks until coordinator recovers | Compensation runs immediately |
| Availability | Lower (blocking on coordinator) | Higher (no coordinator SPOF) |
| Lock duration | All locks held during both phases | Each lock released after its step |

Saga’s latency overhead is real but often acceptable. If each step is 10-30ms (typical for service calls), a 5-step saga runs in 50-150ms total. Compare that to a user-facing API timeout (usually 1-5 seconds) and the overhead is negligible.

Where saga latency hurts is high-throughput, low-latency paths (trading systems, real-time pricing). In those cases, consider whether you can pipeline steps (some steps don’t depend on previous step results and can overlap).
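The latency arithmetic in this section can be written as a tiny model. It is a back-of-the-envelope sketch that ignores orchestration overhead and variance; `saga_latency_ms` and the idea of grouping steps into concurrent "stages" are assumptions introduced here for illustration.

```python
from typing import List


def saga_latency_ms(stages: List[List[float]]) -> float:
    """Each stage is a list of step latencies that run concurrently.

    Stages run one after another, so the total is the sum of the
    slowest step in each stage.
    """
    return sum(max(stage) for stage in stages)


# Fully sequential: the 20ms + 30ms + 15ms example from this section.
sequential = saga_latency_ms([[20], [30], [15]])   # 65

# Hypothetical variant: steps 1 and 2 are independent and overlap.
overlapped = saga_latency_ms([[20, 30], [15]])     # 45
```

Overlapping steps only works when neither step needs the other's result, and it complicates compensation, since both may have completed by the time a failure is detected.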

Testing Strategies

Sagas are notoriously hard to test because they span services and involve time. A structured approach helps.

Unit Testing Compensations

Test each compensation in isolation first. The compensation is the most critical piece — if it fails, your saga is stuck.

# Test that releasing inventory twice has no effect (idempotency)
def test_release_inventory_idempotent():
    inventory = InMemoryInventory()
    inventory.reserve("order-123", ["item-a"])

    # First release — succeeds
    result1 = inventory.release("order-123")
    assert result1.success
    assert not inventory.is_reserved("order-123")

    # Second release — should be idempotent (no error)
    result2 = inventory.release("order-123")
    assert result2.success  # Still succeeds, even though already released

    # Third release — still idempotent
    result3 = inventory.release("order-123")
    assert result3.success

Integration Testing Failure Scenarios

The real test is whether your saga handles failures correctly. Set up test infrastructure that simulates failures at each step.

# Test: step 3 fails, so steps 2 and 1 are compensated
def test_saga_step3_failure_triggers_step2_compensation():
    inventory = MockInventory()  # Always succeeds
    payment = MockPayment()      # Always succeeds
    shipping = MockShipping(fail_on="create")  # Raises ShippingUnavailable on create

    saga = OrderSaga(inventory, payment, shipping)
    result = saga.execute(order_with_3_items)

    assert not result.success
    assert result.reason == "Shipping unavailable"
    assert payment.refund_was_called()        # Step 2 compensated
    assert inventory.release_was_called()      # Step 1 compensated
    assert not shipping.shipment_created      # Step 3 failed, so no shipment exists

Testing Compensations Run in Correct Order

A common saga bug is compensation running in the wrong order. Write a test that explicitly verifies reverse order.

def test_compensation_runs_in_reverse_order():
    call_order = []

    class StepFailed(Exception):
        pass

    class TrackingService:
        def __init__(self, name, fail=False):
            self.name = name
            self.fail = fail

        def do_step(self):
            call_order.append(f"do-{self.name}")
            if self.fail:
                raise StepFailed(self.name)

        def compensate(self):
            call_order.append(f"compensate-{self.name}")

    class OrderedSaga:
        def __init__(self):
            self.services = [
                TrackingService("step1"),
                TrackingService("step2"),
                TrackingService("step3", fail=True),  # Fails here
            ]
            self.completed = []

        def execute(self):
            for service in self.services:
                service.do_step()
                self.completed.append(service)

        def compensate(self):
            # Undo only the steps that completed, newest first
            for service in reversed(self.completed):
                service.compensate()

    saga = OrderedSaga()
    try:
        saga.execute()  # Fails on step 3
    except StepFailed:
        saga.compensate()

    # step3 never completed, so only step2 and step1 are compensated
    assert call_order == [
        "do-step1", "do-step2", "do-step3",
        "compensate-step2", "compensate-step1",
    ]

Chaos Testing Sagas in Production

Once your saga is running in production, inject failures to verify it handles them:

  1. Kill a service mid-saga and verify compensation runs
  2. Introduce network latency and verify timeouts trigger correctly
  3. Fill up a resource (disk, connection pool) and verify graceful degradation
  4. Split the network between two services and verify saga completes or compensates correctly

Distributed Tracing Integration

Sagas span multiple services. Without tracing, debugging a failed saga means grepping logs across 5 services and trying to piece together what happened. With distributed tracing (OpenTelemetry, Zipkin, Jaeger), you get a single trace ID that follows the saga across all services.

Trace Structure for Sagas

A saga trace has a parent span for the overall saga, with child spans for each step and compensation.

Trace: order-123-saga (trace-id: abc123)
├── Span: saga_created (service: order-service)
├── Span: step.reserve_inventory (service: inventory-service)
│   └── Span: compensate.reserve_inventory (service: inventory-service)
├── Span: step.charge_payment (service: payment-service)
│   └── Span: compensate.charge_payment (service: payment-service)
├── Span: step.create_shipment (service: shipping-service)
└── Span: saga_completed / saga_failed

Implementing Trace Context Propagation

Pass trace context through saga steps by injecting it into the headers of each outgoing call, and use span links to tie compensations back to the steps they undo.

from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

class OrderSaga:
    def execute(self, order, context=None):
        # `context` is the carrier of the incoming request (e.g. HTTP headers).
        # Without a carrier, parent_ctx stays None and the current context is used.
        parent_ctx = extract(context) if context else None

        # start_as_current_span makes the span current, so inject() below sees it
        with tracer.start_as_current_span("order_saga", context=parent_ctx) as span:
            span.set_attribute("saga.id", order.saga_id)
            span.set_attribute("saga.type", "order_fulfillment")

            # Inject trace context into step calls
            headers = {}
            inject(headers)  # Writes the current trace context into headers

            try:
                # Step 1: Reserve inventory (pass headers for trace propagation)
                inventory_ctx = self.inventory.reserve(order.items, headers)

                # Step 2: Charge payment
                payment_ctx = self.payment.charge(order.payment, headers)

                # Step 3: Create shipment
                shipment = self.shipping.create(order.address, headers)

                span.set_status(Status(StatusCode.OK))
                return Success(shipment)

            except Exception as e:
                span.record_exception(e)
                span.set_status(Status(StatusCode.ERROR, str(e)))
                raise

When a step fails and compensation runs, link the compensation span to the original step span so you can see the pairing in the trace.

from opentelemetry.trace import Link

def compensate_inventory(self, reservation_id, original_span):
    with tracer.start_as_current_span(
        "compensate.reserve_inventory",
        links=[Link(original_span.get_span_context())]
    ) as span:
        # The API-level Span interface does not expose the span name,
        # so the original step is named explicitly here
        span.set_attribute("compensation.for", "step.reserve_inventory")
        span.set_attribute("reservation.id", reservation_id)
        self.inventory.release(reservation_id)

This way, in your trace viewer, you see the compensate span linked back to its original do span — paired visually rather than guessing from logs.

Framework Recommendations

Building saga from scratch is educational but painful for production. Use a framework that handles state persistence, retry logic, observability, and distributed tracing out of the box.

Temporal

The strongest choice for saga orchestration. Temporal provides durable workflow execution — if your service crashes mid-saga, Temporal persists the workflow state and resumes it from where it left off. No need to build your own saga state machine.

  • Strengths: Durable execution (survives worker crashes), built-in retries with backoff, activity heartbeats, sandboxed workflow code, strong OpenTelemetry integration
  • Good for: Complex multi-step business workflows, long-running sagas (hours to days)
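  • Trade-offs: Self-hosting is operationally heavy; Temporal Cloud pricing can be significant at scale

# Temporal workflow example (Python SDK); the activities are defined elsewhere
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> OrderResult:
        # Activities with automatic retry
        reservation = await workflow.execute_activity(
            reserve_inventory,
            order.items,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        try:
            charge = await workflow.execute_activity(
                charge_payment,
                args=[order.payment, order.total],  # Multiple arguments go through args=
                start_to_close_timeout=timedelta(seconds=30),
                # A declined card will not succeed on retry
                retry_policy=RetryPolicy(non_retryable_error_types=["PaymentDeclined"]),
            )
        except ActivityError:
            # Activity failures reach the workflow wrapped in ActivityError
            await workflow.execute_activity(
                release_inventory,
                reservation.id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            return OrderResult(rejected=True, reason="Payment declined")

        # A failure here would also need a refund and a release (omitted for brevity)
        shipment = await workflow.execute_activity(
            create_shipment,
            order.address,
            start_to_close_timeout=timedelta(seconds=30),
        )

        return OrderResult(confirmed=True, shipment=shipment)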

AWS Step Functions

Managed saga orchestration on AWS. Integrates tightly with AWS services (Lambda, ECS, SQS, DynamoDB). Good if you’re already all-in on AWS.

  • Strengths: Fully managed, pay-per-state-transition, tight Lambda integration, visual workflow designer
  • Good for: AWS-centric architectures, medium-complexity workflows
  • Trade-offs: Vendor lock-in, expensive at high step counts, debugging can be opaque

Conductor (Netflix)

Conductor is an open-source workflow orchestrator that originated at Netflix. Good for microservices that need workflow orchestration without heavy operational overhead.

  • Strengths: Open source (self-hostable), JSON-based workflow definitions, HTTP-based workers
  • Good for: Teams wanting open-source without Temporal complexity
  • Trade-offs: Not as battle-tested as Temporal at extreme scale, less mature ecosystem

Comparison

| Framework | Durability | Open Source | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Temporal | Excellent (durable execution) | Yes (server) + cloud | Medium | Complex long-running workflows |
| AWS Step Functions | Good (managed) | No | Low | AWS-centric, simple workflows |
| Conductor | Good | Yes (fully) | Medium | Open-source preference |

For most production scenarios, Temporal is the right call. The durable execution guarantee alone saves you from a whole class of saga state loss bugs.

When to Use / When Not to Use Saga

| Criteria | Saga (Choreography) | Saga (Orchestration) | Two-Phase Commit |
| --- | --- | --- | --- |
| Atomicity | Eventual | Eventual | True atomicity |
| Isolation | None | None | Full isolation |
| Availability | High | High | Low (blocks on coordinator failure) |
| Complexity | Distributed logic | Centralized logic | Distributed but synchronous |
| Compensation | Each service knows its own | Orchestrator directs | Automatic rollback |
| Latency | Per-step latency | Per-step + orchestration overhead | Two round-trips |
| Debugging | Harder (distributed) | Easier (centralized) | Moderate |
| Rollback Cost | Compensation required | Compensation required | Free (automatic) |

Use saga when:

  • Operations span multiple services with separate databases
  • You cannot or do not want to use 2PC (avoiding it is usually the right call)
  • Business transactions map naturally to a sequence of steps
  • You can define compensating transactions for each step
  • Eventual consistency is acceptable for your domain
  • You need availability over strong isolation

Avoid saga when:

  • Steps have tight interdependencies that require strict isolation
  • Rollback must be immediate and guaranteed
  • Compensation is expensive or impossible (sending an email, charging a card with long refund times)
  • Your domain requires all-or-nothing atomicity that saga cannot provide
  • The inconsistency window is unacceptable for your use case

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Step fails and compensation also fails | System left in inconsistent state | Design idempotent compensations; implement retry with exponential backoff; alert on repeated compensation failures |
| Service crashes mid-step | Step may or may not have completed; unknown state | Use idempotency keys; implement saga state tracking; use durable workflow engines |
| Concurrent sagas interfere | One saga’s uncommitted data affects another saga’s read | Implement optimistic concurrency control; use application-level locks for critical resources |
| Compensation runs twice on the same step | Double compensation causes incorrect state (for example, a double refund) | Track each step’s status explicitly; make compensations idempotent |
| Saga state lost (orchestrator crash) | Cannot determine what steps completed | Persist saga state to durable storage; use workflow engines that handle this |
| Infinite retry loop | System stuck in repeated failed attempts | Implement max retry count; move to dead letter state after threshold; alert |

Failure Flow Diagram

graph TD
    Start[Start Saga] --> Step1[Execute Step 1]
    Step1 --> Step1OK{Step 1 OK?}
    Step1OK -->|No| Fail1[Return Error]
    Step1OK -->|Yes| Step2[Execute Step 2]
    Step2 --> Step2OK{Step 2 OK?}
    Step2OK -->|No| Comp1[Compensate Step 1]
    Comp1 --> Fail2[Return Error]
    Step2OK -->|Yes| Step3[Execute Step 3]
    Step3 --> Step3OK{Step 3 OK?}
    Step3OK -->|No| Comp2A[Compensate Step 2]
    Comp2A --> Comp2B[Compensate Step 1]
    Comp2B --> Fail3[Return Error]
    Step3OK -->|Yes| Success[Saga Complete]

Saga Execution State Machine

stateDiagram-v2
    [*] --> Pending: Saga created
    Pending --> Executing: First step started
    Executing --> Executing: Step N completed
    Executing --> Compensating: Step failed
    Compensating --> Compensating: Compensation in progress
    Compensating --> Completed: All compensations done
    Executing --> Completed: Final step succeeded
    Compensating --> Failed: Compensation failed after retries
    Pending --> Failed: Immediate failure (validation error)
    Completed --> [*]
    Failed --> [*]

State Definitions:

| State | Description |
| --- | --- |
| Pending | Saga created but not yet started. Initial state before first step execution. |
| Executing | At least one step has started. Saga is processing its steps in order. |
| Compensating | A step failed and compensation is running in reverse order to undo completed steps. |
| Completed | All steps succeeded, or a step failed and every compensation succeeded. Terminal state in which the system is consistent. |
| Failed | Saga reached an unrecoverable state (compensation failed or max retries exceeded). |

State Transition Rules:

  1. Once in Executing, cannot return to Pending
  2. Compensating can only be entered from Executing
  3. Completed and Failed are terminal states
  4. From Compensating, success leads to Completed, persistent failure leads to Failed
  5. Only Pending and Failed can be safely retried at the saga level
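One way to enforce these rules is a lookup table that rejects any transition missing from the state diagram above. This is a minimal sketch; `SagaStatus`, `TRANSITIONS`, and `transition` are hypothetical names, and the table simply mirrors the edges in the diagram.

```python
from enum import Enum


class SagaStatus(Enum):
    PENDING = "pending"
    EXECUTING = "executing"
    COMPENSATING = "compensating"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions, copied from the state diagram.
TRANSITIONS = {
    SagaStatus.PENDING: {SagaStatus.EXECUTING, SagaStatus.FAILED},
    SagaStatus.EXECUTING: {
        SagaStatus.EXECUTING,
        SagaStatus.COMPENSATING,
        SagaStatus.COMPLETED,
    },
    SagaStatus.COMPENSATING: {
        SagaStatus.COMPENSATING,
        SagaStatus.COMPLETED,
        SagaStatus.FAILED,
    },
    SagaStatus.COMPLETED: set(),  # terminal
    SagaStatus.FAILED: set(),     # terminal
}


def transition(current: SagaStatus, target: SagaStatus) -> SagaStatus:
    """Return the new status, or raise if the diagram has no such edge."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Calling `transition` before every status write turns a silent state corruption (for example, a late message moving a completed saga back to executing) into an immediate, loggable error.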

Observability Checklist

Metrics

  • Saga completion rate (success vs failure vs compensating)
  • Saga execution duration by type
  • Compensation execution count and success rate
  • Step failure rate by step type
  • Concurrent saga count
  • Average number of steps per saga

Logs

  • Log saga start with correlation ID and input parameters
  • Log each step start, completion, and compensation with step index
  • Include compensating transaction ID in compensation logs
  • Log saga outcome (success, failure, compensating)
  • Include all relevant IDs: saga ID, correlation ID, step IDs, compensation IDs

Alerts

  • Alert when saga takes longer than expected threshold
  • Alert when compensation repeatedly fails
  • Alert when saga failure rate exceeds normal baseline
  • Alert when max retry count reached on a step
  • Alert on stuck sagas (no progress for extended period)
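The stuck-saga alert can be driven by a periodic scan over persisted saga records. The sketch below assumes each record carries `saga_id`, `status`, and an `updated_at` timestamp, and uses a five-minute idle threshold; all of these are illustrative choices, not fields the examples above define.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

TERMINAL = {"completed", "failed"}


def find_stuck_sagas(
    sagas: List[Dict],
    now: datetime,
    max_idle: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return IDs of non-terminal sagas with no state change for max_idle."""
    return [
        s["saga_id"]
        for s in sagas
        if s["status"] not in TERMINAL and now - s["updated_at"] > max_idle
    ]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sagas = [
    {"saga_id": "a", "status": "executing", "updated_at": now - timedelta(minutes=10)},
    {"saga_id": "b", "status": "executing", "updated_at": now - timedelta(seconds=30)},
    {"saga_id": "c", "status": "completed", "updated_at": now - timedelta(hours=2)},
]
stuck = find_stuck_sagas(sagas, now)  # ["a"]
```

Passing `now` in explicitly keeps the check deterministic in tests; the threshold should be tuned per saga type, since a long-running saga may legitimately sit idle for hours.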

Security Checklist

  • Authenticate and authorize saga trigger endpoints
  • Validate saga input parameters to prevent injection
  • Audit log all saga state changes (start, step completion, compensation)
  • Do not log sensitive data (payment info, passwords) in saga context
  • Encrypt saga state at rest if using external storage
  • Use correlation IDs for tracing without exposing internal IDs

Common Pitfalls / Anti-Patterns

Treating saga as ACID transaction: Saga does not provide isolation. Concurrent sagas can see each other’s partial results. If saga A reserves inventory and saga B reads inventory before A completes, B may make decisions on uncommitted data. Handle this at the application level.

Non-idempotent steps or compensations: If a step or compensation runs twice due to retries, the effect should be the same as running once. Reserving inventory twice should not double-reserve. Always check before acting.

Compensation order errors: Compensations must run in reverse order of execution. If step 1’s compensation runs before step 2’s, you may leave the system in a worse state. Explicitly track execution order.

Ignoring the inconsistency window: During a saga execution, the system is in an inconsistent state. Other operations may read partial results. Design your application to handle this (show pending states, use optimistic UI).

Long-running compensations: Compensations can take time (refunds, cancellations). While compensating, the system is still inconsistent. Minimize compensation time and alert if compensations are taking too long.

Not planning for compensation failure: What happens if compensation fails? The saga is stuck. Implement retry with backoff, then move to a dead letter state that requires manual intervention.
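A bounded retry followed by a dead letter record might look like the sketch below. `compensate_or_park` and the list used as a dead letter store are hypothetical; a real implementation would back off between attempts and write to a durable queue or table that pages an operator.

```python
from typing import Callable, Dict, List


def compensate_or_park(
    saga_id: str,
    step_name: str,
    compensation: Callable[[], None],
    dead_letters: List[Dict[str, str]],
    max_attempts: int = 3,
) -> bool:
    """Try a compensation a bounded number of times, then park it for a human."""
    last_error = ""
    for _ in range(max_attempts):
        try:
            compensation()
            return True
        except Exception as exc:  # a real system would also back off here
            last_error = str(exc)
    # Retries exhausted: record enough context for manual intervention.
    dead_letters.append({"saga_id": saga_id, "step": step_name, "error": last_error})
    return False


dead_letters: List[Dict[str, str]] = []
calls: List[int] = []


def broken_refund() -> None:
    calls.append(1)
    raise RuntimeError("payment provider down")


succeeded = compensate_or_park("saga-1", "charge_payment", broken_refund, dead_letters)
```

Because the compensation may be retried several times before it is parked, it has to be idempotent, which ties this pitfall back to the earlier one.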

Quick Recap

sequenceDiagram
    participant Saga
    participant Inv as Inventory
    participant Pay as Payment
    Saga->>Inv: Step 1 - Reserve
    Inv-->>Saga: OK
    Saga->>Pay: Step 2 - Charge
    Pay-->>Saga: Fail
    Saga->>Inv: Compensate

Key Points

  • Saga breaks distributed transactions into steps with compensating transactions
  • If a step fails, previous steps are compensated in reverse order
  • Saga provides eventual consistency, not ACID isolation
  • Two implementations: choreography (distributed) and orchestration (centralized)
  • Idempotency is essential for safe retries

Production Checklist

# Saga Pattern Production Readiness

- [ ] Idempotent step and compensation handlers implemented
- [ ] Saga state persisted to durable storage
- [ ] Compensation logic tested for each step
- [ ] Maximum retry count configured per step
- [ ] Monitoring for saga execution duration
- [ ] Alerting for compensation failures
- [ ] Concurrent saga handling planned (optimistic locking or locks)
- [ ] User-facing pending state handling designed
- [ ] Dead letter handling for unrecoverable saga states
- [ ] Correlation IDs in all saga logs

Example: Order Processing Saga

A complete order processing saga with three services:

def create_order_saga(order):
    saga = OrderSaga()

    # Step 1: Reserve inventory
    reservation = saga.reserve_inventory(order.items)
    if not reservation.success:
        return OrderResult(rejected=True, reason="Insufficient inventory")

    # Step 2: Authorize payment
    authorization = saga.authorize_payment(order.payment, order.total)
    if not authorization.success:
        saga.compensate_inventory(reservation.id)
        return OrderResult(rejected=True, reason="Payment declined")

    # Step 3: Create shipment
    shipment = saga.create_shipment(order.address, order.items)
    if not shipment.success:
        saga.reverse_payment(authorization.id)
        saga.compensate_inventory(reservation.id)
        return OrderResult(rejected=True, reason="Shipping unavailable")

    # All steps succeeded
    return OrderResult(confirmed=True, order_id=order.id, shipment=shipment)

compensate_inventory and reverse_payment are the compensating transactions. They run in reverse order of the steps they undo.

Limitations

Sagas have real limitations.

No isolation: Concurrent sagas can interfere. If saga A reserves inventory and saga B reads it before A completes, B may make decisions based on uncommitted data. Application-level locks or optimistic concurrency control help, but the pattern does not give you automatic isolation.

Compensation complexity: Some operations are hard to compensate. Sending an email cannot be undone. Charging a card can be refunded, but refunds take time. Compensation is not free.

Debugging: A saga failure means tracing through multiple services to understand what happened. Logs and correlation IDs are essential.

For distributed transaction fundamentals, see Distributed Transactions. For two-phase commit, see Two-Phase Commit.

For workflow patterns, see Service Orchestration and Service Choreography.

Conclusion

The saga pattern manages distributed transactions without 2PC. Break the transaction into steps, and define a compensation for each step. If a step fails, run the compensations in reverse order.

Choreographed sagas distribute behavior across services. Orchestrated sagas centralize coordination in an orchestrator. Both work; the choice depends on workflow complexity and team structure.

Sagas trade ACID isolation for availability. The application must handle partial states and concurrent access. This complexity is inherent to distributed transactions; saga just makes it explicit rather than pretending it does not exist.
