Service Choreography: Event-Driven Distributed Workflows

Learn service choreography patterns for building distributed workflows through events, sagas with choreography, and when to prefer events over orchestration.



In choreography, there is no conductor. Services emit events when they do something, and other services react to those events by doing their own thing. The overall behavior emerges from the chain of reactions, not from a central plan.

This is a fundamentally different way of thinking about distributed systems. Instead of asking “who coordinates this workflow?” you ask “what events should cause what reactions?”

This post covers how choreography works, why it appeals to distributed systems architects, and where it breaks down.

What is Service Choreography

Service choreography is a pattern where services communicate by emitting and consuming events. Each service reacts to incoming events and may emit new events as a result. There is no central orchestrator directing the workflow.

graph LR
    Order[Order Service] -->|OrderPlaced| Inv[Inventory Service]
    Inv -->|InventoryReserved| Pay[Payment Service]
    Pay -->|PaymentCharged| Ship[Shipping Service]
    Ship -->|ShipmentCreated| Notify[Notification Service]

The flow emerges from the event chain. Each service knows only its own trigger and reaction. The service that reserves inventory does not know or care what happens after. It just emits “InventoryReserved” and moves on.
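The reaction chain above can be sketched with a minimal in-process event bus. This is a toy stand-in for a real broker such as Kafka or SNS, and all service and event names are illustrative:

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process event bus -- a stand-in for Kafka, SNS, etc."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Broadcast: the publisher never knows who reacts, or whether anyone does.
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()
steps = []

# Each service registers only its own trigger and reaction.
def reserve_inventory(event):
    steps.append("inventory reserved")
    bus.publish("InventoryReserved", event)

def charge_payment(event):
    steps.append("payment charged")
    bus.publish("PaymentCharged", event)

def create_shipment(event):
    steps.append("shipment created")

bus.subscribe("OrderPlaced", reserve_inventory)
bus.subscribe("InventoryReserved", charge_payment)
bus.subscribe("PaymentCharged", create_shipment)

bus.publish("OrderPlaced", {"order_id": "ord-12345"})
# steps now reads ["inventory reserved", "payment charged", "shipment created"]
```

No function here calls the next step directly; the full workflow emerges from the subscriptions.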

Events vs Commands

Choreography relies on events, not commands. The distinction matters.

A command is directed: “do this thing.” It expects one specific handler. CreateOrder goes to the Order Service, and that is the only place it goes.

An event is broadcast: “this thing happened.” Any service that cares can react. OrderPlaced goes to the event bus, and Inventory, Analytics, Notification, and others can all subscribe and act.

Naming conventions make the distinction obvious. Commands are verb-noun: CreateUser, CancelOrder, ReserveInventory. Events are noun-verb past tense: UserCreated, OrderCancelled, InventoryReserved.

Benefits of Choreography

Choreography has real advantages for the right problem.

True decoupling: Services do not know about each other. They only know about events. Add a new subscriber without touching the publisher. Remove a subscriber the same way.

No single point of failure: There is no orchestrator that, if it crashes, halts all workflows. If one service goes down, only its reactions stop. Other workflows continue.

Independent deployability: Each service deploys on its own schedule. The inventory service does not need to coordinate with the payment service for a release.

Scalability: Event buses are designed to scale. Publishers and subscribers scale independently.

Sagas with Choreography

The saga pattern can be implemented with choreography instead of orchestration. In a choreographed saga, each service knows its own step and its own compensation. When a step fails, the failing service emits a failure event, and services that have already completed their steps react to it by running their compensations.

sequenceDiagram
    participant Order as Order Service
    participant Inv as Inventory Service
    participant Pay as Payment Service
    Order->>Inv: OrderPlaced
    Inv->>Pay: InventoryReserved
    Pay->>Inv: PaymentFailed
    Note over Inv: Compensation: release reservation

This is more complex than an orchestrated saga. Each service must track which steps it has completed so it knows what to compensate, and the compensation logic is spread across services rather than centralized.
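A sketch of that distributed compensation logic, using hypothetical names and in-memory state in place of a broker and a durable store: the inventory service owns both its forward step and its undo.

```python
class InventoryService:
    """Owns its own step (reserve) and its own compensation (release)."""
    def __init__(self):
        self.reserved = set()  # stand-in for durable per-order state

    def on_order_placed(self, event):
        # Forward step: reserve stock and remember it for possible compensation.
        self.reserved.add(event["order_id"])
        return {"type": "InventoryReserved", "order_id": event["order_id"]}

    def on_payment_failed(self, event):
        # Compensation: undo our step only if we actually performed it.
        if event["order_id"] in self.reserved:
            self.reserved.discard(event["order_id"])
            return {"type": "InventoryReleased", "order_id": event["order_id"]}
        return None  # nothing to compensate for this order

inv = InventoryService()
inv.on_order_placed({"type": "OrderPlaced", "order_id": "ord-1"})
out = inv.on_payment_failed({"type": "PaymentFailed", "order_id": "ord-1"})
# out is an InventoryReleased event, and "ord-1" is no longer reserved
```

Note that the payment service never tells inventory to release anything; it only announces `PaymentFailed`, and inventory decides how to react.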

For a detailed look at saga patterns, see Saga Pattern.

Implementing Choreography

Event Schema Design

Events need consistent schema. An event schema defines the fields each event type carries. Schema evolution matters because events are immutable once published.

{
  "type": "OrderPlaced",
  "version": "1.0",
  "order_id": "ord-12345",
  "customer_id": "cust-67890",
  "items": [...],
  "timestamp": "2026-03-22T10:30:00Z",
  "correlation_id": "corr-abc123"
}

Include a correlation ID in every event. It lets you trace a complete transaction across services by following all events with the same correlation ID.
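Reconstructing a transaction is then a filter over the event log. A sketch against a hypothetical in-memory log:

```python
def events_for_transaction(event_log, correlation_id):
    """Return all events belonging to one transaction, in publish order."""
    return [e for e in event_log if e["correlation_id"] == correlation_id]

event_log = [
    {"type": "OrderPlaced", "correlation_id": "corr-abc123"},
    {"type": "OrderPlaced", "correlation_id": "corr-def456"},  # a different transaction
    {"type": "InventoryReserved", "correlation_id": "corr-abc123"},
    {"type": "PaymentCharged", "correlation_id": "corr-abc123"},
]

trace = events_for_transaction(event_log, "corr-abc123")
# trace contains OrderPlaced, InventoryReserved, PaymentCharged for that one transaction
```

In production the "log" would be a queryable event store or a log-aggregation system indexed by correlation ID.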

Event Contract Testing

Services depend on event schemas. When one service changes its event format, subscribers may break. Use schema registries and contract testing to catch breaking changes early.

Apache Kafka deployments commonly pair the broker with a schema registry (for example, Confluent Schema Registry) to enforce compatibility. Producers and consumers agree on a schema, and the registry rejects incompatible changes.
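The core compatibility rule can be sketched in a few lines. The schema representation below is a simplified stand-in for Avro or JSON Schema, not a real registry API:

```python
def is_backward_compatible(old_schema, new_schema):
    """Consumer-view backward compatibility between two event schemas.
    Schemas here are plain dicts: {field: {"type": str, "required": bool}}."""
    for name, spec in old_schema.items():
        # Fields consumers already rely on must survive with the same type.
        if name not in new_schema or new_schema[name]["type"] != spec["type"]:
            return False
    # Newly added fields must be optional, or events produced against the
    # old schema would fail validation against the new one.
    added = set(new_schema) - set(old_schema)
    return all(not new_schema[f]["required"] for f in added)

v1 = {"order_id": {"type": "string", "required": True}}
v2 = {**v1, "notes": {"type": "string", "required": False}}   # additive, optional: OK
v3 = {"customer_id": {"type": "string", "required": True}}    # drops order_id: breaking
```

Real registries implement richer rules (forward, full, transitive compatibility), but the additive-and-optional principle is the one to internalize.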

Idempotency

Events may be delivered more than once. Network partitions, consumer crashes, and retry logic all cause duplicates. Services must handle duplicate events idempotently.

def handle_order_placed(event):
    # Check if already processed
    if order_processed(event.order_id):
        return

    # Process the order
    process_order(event.order_id)

    # Mark as processed
    mark_order_processed(event.order_id)

Use idempotency keys or check-before-act patterns to ensure re-processing is safe.
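The idempotency-key variant keys deduplication on the event ID rather than on domain state. A sketch with an in-memory set standing in for a durable deduplication table:

```python
processed = set()  # stand-in for a durable deduplication table keyed by event ID
handled = []

def handle_once(event, handler):
    """Drop events whose ID we have already processed (at-least-once delivery)."""
    if event["event_id"] in processed:
        return False  # duplicate delivery: ignore
    handler(event)
    # In production, the handler's side effects and this marker should commit
    # in one transaction, or a crash between them re-introduces duplicates.
    processed.add(event["event_id"])
    return True

event = {"event_id": "evt-1", "type": "OrderPlaced"}
handle_once(event, handled.append)
handle_once(event, handled.append)  # redelivery of the same event: no-op
# the handler ran exactly once
```

The key design point is that the deduplication record and the side effect must be atomic; otherwise the check only narrows the duplicate window, it does not close it.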

When Choreography Works

Choreography shines for simple, linear workflows where each step naturally follows from the previous. It works well in systems with many independent consumers of the same events (analytics, monitoring, notifications). Teams that value service independence and autonomous deployment tend to prefer it. It fits high-scale event streaming architectures like Kafka or Kinesis.

When Choreography Breaks Down

Choreography has limits.

Invisible workflows: You cannot look at one service and see the overall transaction. The behavior is scattered across many services. Debugging is harder.

No atomic transactions: There is no transaction boundary. Partial failures leave the system in inconsistent states. Compensation must handle this.

Event chain complexity: Long chains of events become hard to follow. A triggers B, B triggers C, C triggers D. When something goes wrong in D, tracing back through the chain is painful.

Cyclic dependencies: What if service A needs to know when service D finishes, but D is triggered by C which is triggered by B which is triggered by A? Cycles in event dependencies are difficult to manage.

Observability Challenges

Choreographed systems produce many events. Understanding the overall state of a transaction requires correlating events across services.

Correlation IDs: Every event in a transaction shares a correlation ID. Collect all events with the same correlation ID to reconstruct the transaction.

Distributed tracing: OpenTelemetry propagates trace context through events, giving you a single view of the event chain.

Event sourcing: Store all events in sequence. Replay to reconstruct state. This aligns well with choreography but adds storage overhead.
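State reconstruction from an event log is a fold over the sequence. A minimal sketch with hypothetical event types:

```python
def replay(events, initial, apply):
    """Rebuild current state by folding the event sequence over an initial state."""
    state = dict(initial)
    for event in events:
        state = apply(state, event)
    return state

def apply_order_event(state, event):
    # Each event type maps to a pure state transition; unknown types are ignored.
    if event["type"] == "OrderPlaced":
        return {**state, "status": "placed"}
    if event["type"] == "PaymentCharged":
        return {**state, "status": "paid"}
    return state

state = replay(
    [{"type": "OrderPlaced"}, {"type": "PaymentCharged"}],
    {"status": "new"},
    apply_order_event,
)
# replaying the two events yields status "paid"
```

Because transitions are pure functions of (state, event), replay is deterministic, which is what makes it usable for debugging and recovery.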

When to Use / When Not to Use Choreography

Use choreography when:

  • Services are genuinely independent and decoupled
  • Workflows are simple and linear (A triggers B, B triggers C)
  • You want to avoid a single point of failure
  • Adding new steps should not require modifying a central orchestrator
  • Many independent consumers need to react to the same events (analytics, monitoring, notifications)
  • You are building an event streaming architecture (Kafka, Kinesis)
  • Teams value service autonomy and independent deployability

Avoid choreography when:

  • Workflows have complex branching or conditional paths
  • You need clear visibility into the entire workflow state
  • Compensation logic is complex and must be centralized
  • Debugging the workflow is critical for operations
  • You have cyclic dependencies in your event flow

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Event lost in transit | Downstream steps never execute; system left in incomplete state | Use persistent event stores (Kafka with replication); implement event replay capability |
| Event delivered twice | Duplicate processing; potential double operations (double charges) | Implement idempotency checks using event IDs; use deduplication tables |
| Consumer crashes mid-processing | Event acknowledged but not fully processed | Use consumer groups with rebalance; implement at-least-once delivery; idempotent handlers |
| Downstream service unavailable | Event emitted but no reaction occurs; partial execution | Implement retry with exponential backoff; dead letter queues for failed processing |
| Event schema breaking change | Subscribers fail to process events they depend on | Use schema registry with backward compatibility; version events; implement contract testing |
| Cyclic event dependencies | Deadlock where A waits for D and D is triggered by C which waits for B which waits for A | Design event flows as DAGs; validate for cycles before deployment |
| Event replay storms | Replaying historical events causes cascade of downstream processing | Implement replay window limits; separate replay pipelines from production |

Observability Checklist

Metrics

  • Event publish rate (events per second by type)
  • Event consumption rate per subscriber
  • Event processing duration per consumer
  • Dead letter queue depth
  • Event consumer lag (how far behind real-time)
  • Duplicate event detection count
  • Schema compatibility violations detected

Logs

  • Log all published events with event ID, type, and correlation ID
  • Log all consumed events with consumer ID
  • Include correlation ID to trace events across services
  • Log consumer failures with original event context
  • Log dead letter queue insertions with reason

Alerts

  • Alert when dead letter queue depth exceeds threshold
  • Alert when consumer lag exceeds SLA window
  • Alert when duplicate event rate spikes (indicates upstream issue)
  • Alert when event schema violations are detected
  • Alert on consumer failures that trigger retry storms

Security Checklist

  • Authenticate event publishers to prevent unauthorized event injection
  • Authorize subscribers to prevent subscription to sensitive event streams
  • Encrypt event payloads containing sensitive data at rest and in transit
  • Validate event schemas before publishing to prevent malformed events
  • Audit access to event stores and schema registries
  • Rate limit event publishing to prevent DoS via event flooding
  • Redact sensitive data from event logs (correlation IDs should be safe; payload data may not be)

Common Pitfalls / Anti-Patterns

Invisible workflows: With choreography, you cannot look at one service and see the overall transaction. The behavior emerges from the event chain. Without proper observability, debugging becomes a forensic exercise of correlating events across services.

Event chain complexity: Long chains become hard to follow. A triggers B, B triggers C, C triggers D. When D fails, tracing back through the chain is painful. Keep event chains short and well-documented.

Cyclic dependencies: If A depends on D completion and D is triggered by C which is triggered by B which is triggered by A, you have a cycle. Cycles cause deadlocks and infinite loops. Design your event dependencies as directed acyclic graphs (DAGs).
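A pre-deployment cycle check is straightforward if you can enumerate which events each handler emits in response to each trigger. The mapping format below is a hypothetical representation of that flow:

```python
def has_cycle(triggers):
    """triggers maps an event type to the event types its handlers emit.
    Returns True if the event-flow graph contains a cycle (DFS, three colors)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(event):
        color[event] = GRAY  # on the current DFS path
        for nxt in triggers.get(event, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return True  # back edge to an event on the path: cycle
            if state == WHITE and visit(nxt):
                return True
        color[event] = BLACK  # fully explored, provably cycle-free from here
        return False

    return any(color.get(e, WHITE) == WHITE and visit(e) for e in triggers)

# Linear chain: fine.  A -> B -> C -> back to A: cycle.
has_cycle({"OrderPlaced": ["InventoryReserved"], "InventoryReserved": ["PaymentCharged"]})
has_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
```

Running a check like this in CI, against declared event contracts, catches cycles before they deadlock production.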

Assuming exactly-once delivery: Most event systems provide at-least-once or at-most-once delivery. Designing as if exactly-once exists leads to bugs. Always implement idempotent handlers.

Not planning for schema evolution: Events are immutable once published. If you need to change the schema, you must version the event. Subscribers must handle both old and new versions. This is harder than it sounds.
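One common approach, sketched here with hypothetical versions and fields, is to dispatch on the version field and upgrade old payloads to the current shape before the handler sees them:

```python
def upgrade_order_placed(event):
    """Normalize any supported version of OrderPlaced to the newest shape.
    Versions, fields, and the v1.0 default are all illustrative."""
    if event["version"] == "1.0":
        # v1.0 had no currency field; assume the historical default.
        return {**event, "version": "2.0", "currency": "USD"}
    if event["version"] == "2.0":
        return event
    raise ValueError(f"unsupported OrderPlaced version: {event['version']}")

old = {"type": "OrderPlaced", "version": "1.0", "order_id": "ord-1"}
upgraded = upgrade_order_placed(old)
# upgraded is a v2.0 event with the default currency filled in
```

Centralizing the upcasting in one place keeps the business logic working against a single schema, at the cost of maintaining the upgrade chain forever (or until old events age out of retention).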

Ignoring consumer lag: As systems scale, consumers may fall behind. If the analytics consumer is 10 minutes behind, decisions based on analytics data are stale. Monitor and scale consumers to keep lag within acceptable bounds.

Quick Recap

graph LR
    A[Service A] -->|Event| B[Event Bus]
    B -->|Event| C[Service B]
    B -->|Event| D[Service C]
    B -->|Event| E[Service D]

Key Points

  • Choreography uses events (broadcast) rather than commands (directed)
  • Each service knows only its own trigger and reaction
  • No central orchestrator means no central point of failure
  • Compensation logic is distributed across services
  • Trade-off: simpler services but harder to see overall workflow state

Production Checklist

# Service Choreography Production Readiness

- [ ] Idempotent event handlers implemented
- [ ] Event schema registry with backward compatibility enforced
- [ ] Correlation IDs included in all events
- [ ] Dead letter queue configured for failed processing
- [ ] Consumer lag monitoring and alerting configured
- [ ] Event replay capability tested
- [ ] Distributed tracing configured across event consumers
- [ ] Schema evolution strategy documented and enforced
- [ ] Audit logging for event publishing
- [ ] Rate limiting on event publishing

Combining Choreography and Orchestration

The choice is not binary. Many systems use both.

Core business workflows with complex compensation logic often need orchestration. The orchestrator has the full picture and can manage failure cleanly.

Peripheral side effects (notifications, audit logs, analytics) are good candidates for choreography. These are typically fire-and-forget. If a notification fails, you do not want to roll back the whole order.

The pattern: orchestration for the critical path, choreography for everything else.

For event-driven architecture fundamentals, see event-driven architecture. For messaging infrastructure, see message queue types.

Conclusion

Service choreography works when services are genuinely independent and workflows are simple. Events broadcast and services react. No central coordinator means no central point of failure.

The price is invisible workflows and harder debugging. When something goes wrong, you trace through events across services rather than reading an orchestrator log.

Use choreography for peripheral side effects and independent reactions. Use orchestration for workflows that need clear transaction boundaries and complex compensation. The best systems use both.
