Service Choreography: Event-Driven Distributed Workflows
Learn service choreography patterns for building distributed workflows through events, sagas with choreography, and when to prefer events over orchestration.
In choreography, there is no conductor. Services emit events when they do something, and other services react to those events by doing their own thing. The overall behavior emerges from the chain of reactions, not from a central plan.
This is a fundamentally different way of thinking about distributed systems. Instead of asking “who coordinates this workflow?” you ask “what events should cause what reactions?”
This post covers how choreography works, why it appeals to distributed systems architects, and where it breaks down.
What is Service Choreography
Service choreography is a pattern where services communicate by emitting and consuming events. Each service reacts to incoming events and may emit new events as a result. There is no central orchestrator directing the workflow.
```mermaid
graph LR
    Order[Order Service] -->|OrderPlaced| Inv[Inventory Service]
    Inv -->|InventoryReserved| Pay[Payment Service]
    Pay -->|PaymentCharged| Ship[Shipping Service]
    Ship -->|ShipmentCreated| Notify[Notification Service]
```
The flow emerges from the event chain. Each service knows only its own trigger and reaction. The service that reserves inventory does not know or care what happens after. It just emits “InventoryReserved” and moves on.
Events vs Commands
Choreography relies on events, not commands. The distinction matters.
A command is directed: “do this thing.” It expects one specific handler. CreateOrder goes to the Order Service, and that is the only place it goes.
An event is broadcast: “this thing happened.” Any service that cares can react. OrderPlaced goes to the event bus, and Inventory, Analytics, Notification, and others can all subscribe and act.
Naming conventions make the distinction obvious. Commands are verb-noun: CreateUser, CancelOrder, ReserveInventory. Events are noun-verb past tense: UserCreated, OrderCancelled, InventoryReserved.
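The one-handler-versus-many-subscribers distinction can be made concrete with a minimal in-process bus (a hypothetical sketch; the class and method names are illustrative, not a real library API):

```python
from collections import defaultdict

class Bus:
    """Toy message bus that enforces the command/event distinction."""

    def __init__(self):
        self.command_handlers = {}                   # command -> exactly one handler
        self.event_subscribers = defaultdict(list)   # event -> any number of handlers

    def register_command(self, name, handler):
        # A command has one specific handler; a second registration is an error.
        if name in self.command_handlers:
            raise ValueError(f"{name} already has a handler")
        self.command_handlers[name] = handler

    def subscribe(self, name, handler):
        # Any service that cares can subscribe to an event.
        self.event_subscribers[name].append(handler)

    def send(self, name, payload):
        # Command: directed at the single registered handler.
        return self.command_handlers[name](payload)

    def publish(self, name, payload):
        # Event: broadcast to zero or more subscribers.
        for handler in self.event_subscribers[name]:
            handler(payload)
```

Sending `ReserveInventory` reaches one handler; publishing `InventoryReserved` fans out to every subscriber, and adding a subscriber never touches the publisher.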
Benefits of Choreography
Choreography has real advantages for the right problem.
True decoupling: Services do not know about each other. They only know about events. Add a new subscriber without touching the publisher. Remove a subscriber the same way.
No single point of failure: There is no orchestrator that, if it crashes, halts all workflows. If one service goes down, only its reactions stop. Other workflows continue.
Independent deployability: Each service deploys on its own schedule. The inventory service does not need to coordinate with the payment service for a release.
Scalability: Event buses are designed to scale. Publishers and subscribers scale independently.
Sagas with Choreography
The saga pattern can be implemented with choreography instead of orchestration. In a choreographed saga, each service knows its own step and its own compensation. When a step fails, the failing service emits a failure event, and services earlier in the chain that have already acted react to it by running their compensations.
```mermaid
sequenceDiagram
    participant Customer
    participant OrderSvc as Order Service
    participant Inv as Inventory Service
    participant Pay as Payment Service
    Customer->>OrderSvc: Place order
    OrderSvc->>Inv: OrderPlaced
    Inv->>OrderSvc: InventoryReserved
    OrderSvc->>Pay: ChargePayment
    Pay->>OrderSvc: PaymentFailed
    OrderSvc->>Inv: ReleaseInventory (compensation)
```
This is more complex than an orchestrated saga. You must track which steps have completed so you know what to compensate, and the compensation logic is spread across services rather than centralized.
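One participant in the diagram above can be sketched as follows (a hypothetical sketch of the inventory service; event shapes and function names are illustrative). The key property is that the service knows only its own forward step and its own compensation, each triggered by an event it subscribes to:

```python
# In-memory stand-in for the inventory service's state.
reserved = {}  # order_id -> quantity currently held

def on_order_placed(event, publish):
    # Forward step: reserve stock, then announce the fact as an event.
    reserved[event["order_id"]] = event["quantity"]
    publish({"type": "InventoryReserved", "order_id": event["order_id"]})

def on_payment_failed(event, publish):
    # Compensation: a failure event further down the chain undoes our step.
    # The service does not know why payment failed, only that it must release.
    reserved.pop(event["order_id"], None)
    publish({"type": "InventoryReleased", "order_id": event["order_id"]})
```

No orchestrator tells the inventory service to release stock; it subscribes to `PaymentFailed` and decides that for itself.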
For a detailed look at saga patterns, see Saga Pattern.
Implementing Choreography
Event Schema Design
Events need consistent schema. An event schema defines the fields each event type carries. Schema evolution matters because events are immutable once published.
```json
{
  "type": "OrderPlaced",
  "version": "1.0",
  "order_id": "ord-12345",
  "customer_id": "cust-67890",
  "items": [...],
  "timestamp": "2026-03-22T10:30:00Z",
  "correlation_id": "corr-abc123"
}
```
Include a correlation ID in every event. It lets you trace a complete transaction across services by following all events with the same correlation ID.
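Propagation is the part that is easy to get wrong: the first event in a transaction mints the correlation ID, and every downstream event must copy it rather than generate its own. A minimal sketch (the `new_event` helper is hypothetical):

```python
import uuid

def new_event(event_type, payload, cause=None):
    """Build an event, propagating the correlation ID from the causing event.

    The first event in a transaction mints a fresh correlation ID; every
    downstream event copies it, so the whole chain can be stitched together.
    """
    correlation_id = cause["correlation_id"] if cause else str(uuid.uuid4())
    return {"type": event_type, "correlation_id": correlation_id, **payload}
```

A service handling `OrderPlaced` passes that event as `cause` when emitting `InventoryReserved`, so both carry the same ID.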
Event Contract Testing
Services depend on event schemas. When one service changes its event format, subscribers may break. Use schema registries and contract testing to catch breaking changes early.
Kafka deployments commonly pair the broker with a schema registry (such as Confluent Schema Registry) to enforce compatibility: producers and consumers agree on a schema, and the registry rejects incompatible changes.
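The core idea behind a backward-compatibility check can be shown in a few lines (a simplified sketch, not a real registry's API; real registries check richer rules such as type changes and defaults). Here a schema is just a dict of field name to a `{"required": bool}` spec:

```python
def is_backward_compatible(old_schema, new_schema):
    """Check that consumers on the new schema can still read old events.

    A new required field would be missing from already-published events,
    and a field that only became required in the new schema may be absent
    from old events too. So: every field required by the new schema must
    already have been required by the old one.
    """
    old_required = {f for f, spec in old_schema.items() if spec.get("required")}
    new_required = {f for f, spec in new_schema.items() if spec.get("required")}
    return new_required <= old_required
```

Under this rule, adding an optional field is safe; adding a required field is a breaking change and should be rejected before any event is published.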
Idempotency
Events may be delivered more than once. Network partitions, consumer crashes, and retry logic all cause duplicates. Services must handle duplicate events idempotently.
```python
def handle_order_placed(event):
    # At-least-once delivery means duplicates are expected, not exceptional.
    if order_processed(event.order_id):
        return
    process_order(event.order_id)
    # Record the event so a redelivery becomes a no-op.
    mark_order_processed(event.order_id)
```
Use idempotency keys or check-before-act patterns to ensure re-processing is safe.
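The check-before-act pattern above has a race if two consumer instances receive the same event concurrently: both can pass the check before either marks the event processed. One way to close the gap is to make the check-and-claim a single atomic operation, for example a primary-key insert (a sketch using SQLite for illustration; any store with a uniqueness constraint works the same way):

```python
import sqlite3

def make_handler(db, process):
    """Wrap `process` so redeliveries of the same event ID are no-ops.

    The INSERT doubles as an atomic check-and-claim: a duplicate event ID
    violates the primary key, and the handler returns without reprocessing.
    """
    db.execute("CREATE TABLE IF NOT EXISTS processed (event_id TEXT PRIMARY KEY)")

    def handle(event):
        try:
            db.execute("INSERT INTO processed VALUES (?)", (event["event_id"],))
        except sqlite3.IntegrityError:
            return False          # duplicate delivery: already handled
        process(event)
        db.commit()               # commit the claim and the work together
        return True

    return handle
```

Because the claim is only committed after `process` succeeds, a crash mid-processing leaves the claim unrecorded and the redelivered event is processed cleanly.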
When Choreography Works
Choreography shines for simple, linear workflows where each step naturally follows from the previous. It works well in systems with many independent consumers of the same events (analytics, monitoring, notifications). Teams that value service independence and autonomous deployment tend to prefer it. It fits high-scale event streaming architectures like Kafka or Kinesis.
When Choreography Breaks Down
Choreography has limits.
Invisible workflows: You cannot look at one service and see the overall transaction. The behavior is scattered across many services. Debugging is harder.
No atomic transactions: There is no transaction boundary. Partial failures leave the system in inconsistent states. Compensation must handle this.
Event chain complexity: Long chains of events become hard to follow. A triggers B, B triggers C, C triggers D. When something goes wrong in D, tracing back through the chain is painful.
Cyclic dependencies: What if service A needs to know when service D finishes, but D is triggered by C which is triggered by B which is triggered by A? Cycles in event dependencies are difficult to manage.
Observability Challenges
Choreographed systems produce many events. Understanding the overall state of a transaction requires correlating events across services.
Correlation IDs: Every event in a transaction shares a correlation ID. Collect all events with the same correlation ID to reconstruct the transaction.
Distributed tracing: OpenTelemetry propagates trace context through events, giving you a single view of the event chain.
Event sourcing: Store all events in sequence. Replay to reconstruct state. This aligns well with choreography but adds storage overhead.
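Replaying an event log is just a fold over the events in sequence (a minimal sketch; the event types mirror the saga example earlier, and the fields are illustrative):

```python
def replay(events):
    """Reconstruct per-order state by folding the event log in order."""
    state = {}
    for e in sorted(events, key=lambda e: e["seq"]):
        order = state.setdefault(e["order_id"], {"status": "new"})
        # Each event type maps to a state transition; unknown types are ignored,
        # which is what lets old consumers survive new event types.
        if e["type"] == "OrderPlaced":
            order["status"] = "placed"
        elif e["type"] == "InventoryReserved":
            order["status"] = "reserved"
        elif e["type"] == "PaymentCharged":
            order["status"] = "paid"
        elif e["type"] == "PaymentFailed":
            order["status"] = "failed"
    return state
```

The same fold that rebuilds current state also answers "what state was order X in after event N?" by truncating the log before replaying.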
When to Use / When Not to Use Choreography
Use choreography when:
- Services are genuinely independent and decoupled
- Workflows are simple and linear (A triggers B, B triggers C)
- You want to avoid a single point of failure
- Adding new steps should not require modifying a central orchestrator
- Many independent consumers need to react to the same events (analytics, monitoring, notifications)
- You are building an event streaming architecture (Kafka, Kinesis)
- Teams value service autonomy and independent deployability
Avoid choreography when:
- Workflows have complex branching or conditional paths
- You need clear visibility into the entire workflow state
- Compensation logic is complex and must be centralized
- Debugging the workflow is critical for operations
- You have cyclic dependencies in your event flow
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Event lost in transit | Downstream steps never execute; system left in incomplete state | Use persistent event stores (Kafka with replication); implement event replay capability |
| Event delivered twice | Duplicate processing; potential double operations (double charges) | Implement idempotency checks using event IDs; use deduplication tables |
| Consumer crashes mid-processing | Event acknowledged but not fully processed | Use consumer groups with rebalance; implement at-least-once delivery; idempotent handlers |
| Downstream service unavailable | Event emitted but no reaction occurs; partial execution | Implement retry with exponential backoff; dead letter queues for failed processing |
| Event schema breaking change | Subscribers fail to process events they depend on | Use schema registry with backward compatibility; version events; implement contract testing |
| Cyclic event dependencies | Deadlock where A waits for D and D is triggered by C which waits for B which waits for A | Design event flows as DAGs; validate for cycles before deployment |
| Event replay storms | Replaying historical events causes cascade of downstream processing | Implement replay window limits; separate replay pipelines from production |
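Two mitigations from the table, retry with exponential backoff and a dead letter queue, are often combined in one consumer loop. A minimal in-process sketch (delays and the `dead_letter` list stand in for real broker features):

```python
import time

def consume(event, process, dead_letter, max_attempts=3, base_delay=0.01):
    """Retry a failing handler with exponential backoff.

    After max_attempts the event is parked on the dead letter queue with its
    last error, instead of blocking the stream or being silently dropped.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            process(event)
            return True
        except Exception as exc:
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # back off: 1x, 2x, 4x...
    dead_letter.append({"event": event, "error": str(last_error)})
    return False
```

Parked events keep their original context, so an operator can inspect the error and replay the event once the downstream fault is fixed.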
Observability Checklist
Metrics
- Event publish rate (events per second by type)
- Event consumption rate per subscriber
- Event processing duration per consumer
- Dead letter queue depth
- Event consumer lag (how far behind real-time)
- Duplicate event detection count
- Schema compatibility violations detected
Logs
- Log all published events with event ID, type, and correlation ID
- Log all consumed events with consumer ID
- Include correlation ID to trace events across services
- Log consumer failures with original event context
- Log dead letter queue insertions with reason
Alerts
- Alert when dead letter queue depth exceeds threshold
- Alert when consumer lag exceeds SLA window
- Alert when duplicate event rate spikes (indicates upstream issue)
- Alert when event schema violations are detected
- Alert on consumer failures that trigger retry storms
Security Checklist
- Authenticate event publishers to prevent unauthorized event injection
- Authorize subscribers to prevent subscription to sensitive event streams
- Encrypt event payloads containing sensitive data at rest and in transit
- Validate event schemas before publishing to prevent malformed events
- Audit access to event stores and schema registries
- Rate limit event publishing to prevent DoS via event flooding
- Redact sensitive data from event logs (correlation IDs should be safe; payload data may not be)
Common Pitfalls / Anti-Patterns
Invisible workflows: With choreography, you cannot look at one service and see the overall transaction. The behavior emerges from the event chain. Without proper observability, debugging becomes a forensic exercise of correlating events across services.
Event chain complexity: Long chains become hard to follow. A triggers B, B triggers C, C triggers D. When D fails, tracing back through the chain is painful. Keep event chains short and well-documented.
Cyclic dependencies: If A depends on D completion and D is triggered by C which is triggered by B which is triggered by A, you have a cycle. Cycles cause deadlocks and infinite loops. Design your event dependencies as directed acyclic graphs (DAGs).
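Validating that the event flow is a DAG is a standard graph-coloring cycle check that can run in CI before deployment (a sketch; the flow is given as a dict mapping each service to the services its events trigger):

```python
def has_cycle(flows):
    """Return True if the event dependency graph contains a cycle.

    Uses depth-first search with three implicit colors: unvisited (absent),
    in-progress (GRAY), and finished (BLACK). Reaching a GRAY node again
    means we followed a back edge, i.e. a cycle.
    """
    GRAY, BLACK = 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in flows.get(node, []):
            if color.get(nxt) == GRAY:
                return True            # back edge: cycle found
            if color.get(nxt) is None and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n) is None and visit(n) for n in flows)
```

Running this over the declared subscriptions of every service catches the A-to-B-to-C-to-A loop before it ships, rather than as a production deadlock.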
Assuming exactly-once delivery: Most event systems provide at-least-once or at-most-once delivery. Designing as if exactly-once exists leads to bugs. Always implement idempotent handlers.
Not planning for schema evolution: Events are immutable once published. If you need to change the schema, you must version the event. Subscribers must handle both old and new versions. This is harder than it sounds.
Ignoring consumer lag: As systems scale, consumers may fall behind. If the analytics consumer is 10 minutes behind, decisions based on analytics data are stale. Monitor and scale consumers to keep lag within acceptable bounds.
Quick Recap
```mermaid
graph LR
    A[Service A] -->|Event| B[Event Bus]
    B -->|Event| C[Service B]
    B -->|Event| D[Service C]
    B -->|Event| E[Service D]
```
Key Points
- Choreography uses events (broadcast) rather than commands (directed)
- Each service knows only its own trigger and reaction
- No central orchestrator means no central point of failure
- Compensation logic is distributed across services
- Trade-off: simpler services but harder to see overall workflow state
Production Checklist
# Service Choreography Production Readiness
- [ ] Idempotent event handlers implemented
- [ ] Event schema registry with backward compatibility enforced
- [ ] Correlation IDs included in all events
- [ ] Dead letter queue configured for failed processing
- [ ] Consumer lag monitoring and alerting configured
- [ ] Event replay capability tested
- [ ] Distributed tracing configured across event consumers
- [ ] Schema evolution strategy documented and enforced
- [ ] Audit logging for event publishing
- [ ] Rate limiting on event publishing
Combining Choreography and Orchestration
The choice is not binary. Many systems use both.
Core business workflows with complex compensation logic often need orchestration. The orchestrator has the full picture and can manage failure cleanly.
Peripheral side effects (notifications, audit logs, analytics) are good candidates for choreography. These are typically fire-and-forget. If a notification fails, you do not want to roll back the whole order.
The pattern: orchestration for the critical path, choreography for everything else.
For event-driven architecture fundamentals, see event-driven architecture. For messaging infrastructure, see message queue types.
Conclusion
Service choreography works when services are genuinely independent and workflows are simple. Events broadcast and services react. No central coordinator means no central point of failure.
The price is invisible workflows and harder debugging. When something goes wrong, you trace through events across services rather than reading an orchestrator log.
Use choreography for peripheral side effects and independent reactions. Use orchestration for workflows that need clear transaction boundaries and complex compensation. The best systems use both.