Service Choreography: Event-Driven Distributed Workflows
Learn service choreography patterns for distributed workflows through events, sagas with choreography, and when to prefer events over orchestration.
In choreography, there is no conductor. Services emit events when they do something, and other services react to those events by doing their own thing. The overall behavior emerges from the chain of reactions, not from a central plan.
This is a fundamentally different way of thinking about distributed systems. Instead of asking “who coordinates this workflow?” you ask “what events should cause what reactions?”
This post covers how choreography works, why it appeals to distributed systems architects, and where it breaks down.
Introduction
Service choreography is a pattern where services communicate by emitting and consuming events. Each service reacts to incoming events and may emit new events as a result. There is no central orchestrator directing the workflow.
graph LR
Order[Order Service] -->|OrderPlaced| Inv[Inventory Service]
Inv -->|InventoryReserved| Pay[Payment Service]
Pay -->|PaymentCharged| Ship[Shipping Service]
Ship -->|ShipmentCreated| Notify[Notification Service]
The flow emerges from the event chain. Each service knows only its own trigger and reaction. The service that reserves inventory does not know or care what happens after. It just emits “InventoryReserved” and moves on.
Core Concepts
Choreography relies on events, not commands. The distinction matters.
A command is directed: “do this thing.” It expects one specific handler. CreateOrder goes to the Order Service, and that is the only place it goes.
An event is broadcast: “this thing happened.” Any service that cares can react. OrderPlaced goes to the event bus, and Inventory, Analytics, Notification, and others can all subscribe and act.
Naming conventions make the distinction obvious. Commands are verb-noun: CreateUser, CancelOrder, ReserveInventory. Events are noun-verb past tense: UserCreated, OrderCancelled, InventoryReserved.
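The difference in routing can be sketched with a toy in-memory bus. This is a minimal illustration, not a real broker: the `MessageBus` class and its method names are hypothetical, chosen only to show that a command has exactly one handler while an event can have any number of subscribers.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class MessageBus:
    """Toy in-memory bus: one handler per command, many subscribers per event."""

    def __init__(self) -> None:
        self._command_handlers: dict[str, Handler] = {}
        self._event_subscribers: dict[str, list[Handler]] = defaultdict(list)

    def register_command(self, name: str, handler: Handler) -> None:
        # A command has exactly one owner; a second registration is an error.
        if name in self._command_handlers:
            raise ValueError(f"command {name!r} already has a handler")
        self._command_handlers[name] = handler

    def subscribe(self, event_name: str, subscriber: Handler) -> None:
        # Any number of services may subscribe to the same event.
        self._event_subscribers[event_name].append(subscriber)

    def send(self, name: str, payload: dict) -> None:
        # Directed: raises KeyError if nobody owns the command.
        self._command_handlers[name](payload)

    def publish(self, event_name: str, payload: dict) -> None:
        # Broadcast: having zero subscribers is not an error.
        for subscriber in self._event_subscribers[event_name]:
            subscriber(payload)


bus = MessageBus()
reactions: list[str] = []

# The Order Service owns CreateOrder and announces the result as an event.
bus.register_command("CreateOrder", lambda p: bus.publish("OrderPlaced", p))
bus.subscribe("OrderPlaced", lambda p: reactions.append(f"inventory:{p['order_id']}"))
bus.subscribe("OrderPlaced", lambda p: reactions.append(f"analytics:{p['order_id']}"))

bus.send("CreateOrder", {"order_id": "ord-12345"})
```

Adding a third subscriber to `OrderPlaced` would not require touching the command handler, which is the decoupling the naming convention is meant to signal.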
Benefits of Choreography
Choreography has real advantages for the right problem.
True decoupling: Services do not know about each other. They only know about events. Add a new subscriber without touching the publisher. Remove a subscriber the same way.
No single point of failure: There is no orchestrator that, if it crashes, halts all workflows. If one service goes down, only its reactions stop. Other workflows continue.
Independent deployability: Each service deploys on its own schedule. The inventory service does not need to coordinate with the payment service for a release.
Scalability: Event buses are designed to scale. Publishers and subscribers scale independently.
Sagas with Choreography
The saga pattern can be implemented with choreography instead of orchestration. In a choreographed saga, each service knows its own step and its own compensation. When a step fails, the service emits a failure event. Services that already completed earlier steps react to the failure event by running their compensations.
sequenceDiagram
participant Client
participant OrderSvc as Order Service
participant Inv as Inventory Service
participant Pay as Payment Service
Client->>OrderSvc: Place order
OrderSvc-->>Inv: OrderPlaced
Inv-->>Pay: InventoryReserved
Pay-->>Inv: PaymentFailed
Pay-->>OrderSvc: PaymentFailed
Inv-->>OrderSvc: InventoryReleased
This is more complex than an orchestrated saga. Each service needs to track which of its own steps have completed so it knows what to compensate. The compensation logic is spread across services rather than centralized.
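A choreographed compensation can be sketched in a few lines with an in-memory publish/subscribe helper. Everything here is an assumption made for illustration: the `publish` and `on` helpers, the `PaymentFailed` and `InventoryReleased` event names, and the rule that charges above 100 fail. The point is that the inventory service keeps its own record of what it reserved and undoes it when it sees the failure event, with no coordinator involved.

```python
from collections import defaultdict

subscribers = defaultdict(list)  # event type -> handlers
event_log = []                   # every published event, in order


def publish(event_type, **data):
    event = {"type": event_type, **data}
    event_log.append(event)
    for handler in list(subscribers[event_type]):
        handler(event)


def on(event_type):
    """Decorator that subscribes a handler to an event type."""
    def register(handler):
        subscribers[event_type].append(handler)
        return handler
    return register


stock = {"widget": 5}
reservations = {}  # order_id -> (sku, qty): inventory's record of its own step


@on("OrderPlaced")
def reserve_inventory(event):
    stock[event["sku"]] -= event["qty"]
    reservations[event["order_id"]] = (event["sku"], event["qty"])
    publish("InventoryReserved", order_id=event["order_id"], amount=event["amount"])


@on("InventoryReserved")
def charge_payment(event):
    # Hypothetical failure rule so the example has something to compensate.
    if event["amount"] > 100:
        publish("PaymentFailed", order_id=event["order_id"])
    else:
        publish("PaymentCharged", order_id=event["order_id"])


@on("PaymentFailed")
def release_inventory(event):
    # Compensation: undo only this service's step. pop() makes a
    # redelivered PaymentFailed a no-op instead of a double release.
    reservation = reservations.pop(event["order_id"], None)
    if reservation is None:
        return
    sku, qty = reservation
    stock[sku] += qty
    publish("InventoryReleased", order_id=event["order_id"])


publish("OrderPlaced", order_id="ord-1", sku="widget", qty=2, amount=250)
```

The handlers run synchronously here for readability; with a real broker each reaction would happen in a separate process, which is why each service has to persist its own reservation state.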
For a detailed look at saga patterns, see Saga Pattern.
Implementing Choreography
Event Schema Design
Events need a consistent schema. An event schema defines the fields each event type carries. Schema evolution matters because events are immutable once published.
{
  "type": "OrderPlaced",
  "version": "1.0",
  "order_id": "ord-12345",
  "customer_id": "cust-67890",
  "items": [...],
  "timestamp": "2026-03-22T10:30:00Z",
  "correlation_id": "corr-abc123"
}
Include a correlation ID in every event. It lets you trace a complete transaction across services by following all events with the same correlation ID.
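One way to make correlation IDs hard to forget is to build every event through a helper that copies the ID from the event that caused it. The sketch below assumes a flat envelope like the one above; the `event_id` and `causation_id` fields and the `new_event` and `trace` helpers are hypothetical additions for illustration.

```python
import uuid
from datetime import datetime, timezone


def new_event(event_type, payload, caused_by=None):
    """Build an event envelope, inheriting the correlation ID from its cause."""
    if caused_by is not None:
        correlation_id = caused_by["correlation_id"]
        causation_id = caused_by["event_id"]
    else:
        # First event of a transaction: mint a fresh correlation ID.
        correlation_id = f"corr-{uuid.uuid4().hex[:12]}"
        causation_id = None
    return {
        **payload,  # envelope fields below win on key collisions
        "type": event_type,
        "version": "1.0",
        "event_id": str(uuid.uuid4()),
        "correlation_id": correlation_id,
        "causation_id": causation_id,
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="microseconds"),
    }


def trace(events, correlation_id):
    """Return all events of one transaction, oldest first."""
    matching = [e for e in events if e["correlation_id"] == correlation_id]
    return sorted(matching, key=lambda e: e["timestamp"])


placed = new_event("OrderPlaced", {"order_id": "ord-12345"})
reserved = new_event("InventoryReserved", {"order_id": "ord-12345"}, caused_by=placed)
unrelated = new_event("OrderPlaced", {"order_id": "ord-99999"})

transaction = trace([unrelated, placed, reserved], placed["correlation_id"])
```

Here `transaction` contains only the two events for `ord-12345`, even though they were mixed with an unrelated order.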
Event Contract Testing
Services depend on event schemas. When one service changes its event format, subscribers may break. Use schema registries and contract testing to catch breaking changes early.
Kafka deployments typically pair the broker with a separate schema registry, such as Confluent Schema Registry, to enforce compatibility. Producers and consumers agree on a schema, and the registry rejects incompatible changes.
Idempotency
Events may be delivered more than once. Network partitions, consumer crashes, and retry logic all cause duplicates. Services must handle duplicate events idempotently.
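def handle_order_placed(event):
    # Check-before-act: skip orders that were already handled.
    # order_processed, process_order, and mark_order_processed are
    # placeholders for your own storage and business logic.
    if order_processed(event.order_id):
        return
    # Process the order
    process_order(event.order_id)
    # Mark as processed so a redelivered event becomes a no-op
    mark_order_processed(event.order_id)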
Use idempotency keys or check-before-act patterns to ensure re-processing is safe.
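Check-before-act has a gap when two copies of an event are processed concurrently, or when the process crashes between acting and marking. A common way to close it is to record the event ID and apply the side effect in the same database transaction. The sketch below uses SQLite from the standard library; the table names and the `event_id` field are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")


def handle_order_placed(event: dict) -> bool:
    """Apply the event once. Returns False when it was a duplicate."""
    with db:  # one transaction: dedup record and side effect commit together
        cursor = db.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["event_id"],),
        )
        if cursor.rowcount == 0:
            return False  # primary key already present: seen before
        db.execute(
            "INSERT INTO orders (order_id, status) VALUES (?, 'received')",
            (event["order_id"],),
        )
        return True


event = {"event_id": "evt-1", "order_id": "ord-12345"}
first = handle_order_placed(event)
second = handle_order_placed(event)  # simulated redelivery
```

If the order insert raised an exception, the `with db:` block would roll back the dedup row as well, so a retry would still be treated as the first delivery.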
When Choreography Works
Choreography shines for simple, linear workflows where each step naturally follows from the previous. It works well in systems with many independent consumers of the same events (analytics, monitoring, notifications). Teams that value service independence and autonomous deployment tend to prefer it. It fits high-scale event streaming architectures like Kafka or Kinesis.
When Choreography Breaks Down
Choreography has limits.
Invisible workflows: You cannot look at one service and see the overall transaction. The behavior is scattered across many services. Debugging is harder.
No atomic transactions: There is no transaction boundary. Partial failures leave the system in inconsistent states. Compensation must handle this.
Event chain complexity: Long chains of events become hard to follow. A triggers B, B triggers C, C triggers D. When something goes wrong in D, tracing back through the chain is painful.
Cyclic dependencies: What if service A needs to know when service D finishes, but D is triggered by C which is triggered by B which is triggered by A? Cycles in event dependencies are difficult to manage.
Observability Challenges
Choreographed systems produce many events. Understanding the overall state of a transaction requires correlating events across services.
Correlation IDs: Every event in a transaction shares a correlation ID. Collect all events with the same correlation ID to reconstruct the transaction.
Distributed tracing: OpenTelemetry propagates trace context through events, giving you a single view of the event chain.
Event sourcing: Store all events in sequence. Replay to reconstruct state. This aligns well with choreography but adds storage overhead.
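Reconstructing state by replay amounts to folding the stored events, in order, into a state value. A minimal sketch, assuming events are dicts with `type` and `order_id` fields; the status names and the `apply_event` and `rebuild` functions are hypothetical.

```python
from functools import reduce

# Hypothetical mapping from event type to the order status it implies.
TRANSITIONS = {
    "OrderPlaced": "placed",
    "InventoryReserved": "reserved",
    "PaymentCharged": "paid",
    "PaymentFailed": "payment_failed",
    "ShipmentCreated": "shipped",
}


def apply_event(state: dict, event: dict) -> dict:
    """Pure function: previous state plus one event gives the next state."""
    status = TRANSITIONS.get(event["type"])
    if status is None:
        return state  # unknown event types leave the state untouched
    return {**state, "status": status, "events_applied": state["events_applied"] + 1}


def rebuild(events: list, order_id: str) -> dict:
    """Replay the stored events for one order, in stored order."""
    relevant = (e for e in events if e["order_id"] == order_id)
    initial = {"order_id": order_id, "status": "unknown", "events_applied": 0}
    return reduce(apply_event, relevant, initial)


stored = [
    {"type": "OrderPlaced", "order_id": "ord-1"},
    {"type": "OrderPlaced", "order_id": "ord-2"},
    {"type": "InventoryReserved", "order_id": "ord-1"},
    {"type": "PaymentFailed", "order_id": "ord-1"},
]
order_1 = rebuild(stored, "ord-1")
order_2 = rebuild(stored, "ord-2")
```

Because `apply_event` has no side effects, replaying the same stored events always produces the same state, which is what makes replay usable for debugging.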
When to Use Choreography
Use choreography when:
- Services are genuinely independent and decoupled
- Workflows are simple and linear (A triggers B, B triggers C)
- You want to avoid a single point of failure
- Adding new steps should not require modifying a central orchestrator
- Many independent consumers need to react to the same events (analytics, monitoring, notifications)
- You are building an event streaming architecture (Kafka, Kinesis)
- Teams value service autonomy and independent deployability
Avoid choreography when:
- Workflows have complex branching or conditional paths
- You need clear visibility into the entire workflow state
- Compensation logic is complex and must be centralized
- Debugging the workflow is critical for operations
- You have cyclic dependencies in your event flow
Combining Choreography and Orchestration
The choice is not binary. Many systems use both.
Core business workflows with complex compensation logic often need orchestration. The orchestrator has the full picture and can manage failure cleanly.
Peripheral side effects (notifications, audit logs, analytics) are good candidates for choreography. These are typically fire-and-forget. If a notification fails, you do not want to roll back the whole order.
The pattern: orchestration for the critical path, choreography for everything else.
For event-driven architecture fundamentals, see event-driven architecture. For messaging infrastructure, see message queue types.
Choreography Failure Isolation
Choreography handles failures differently than orchestration. When an orchestrated service fails, the orchestrator decides whether to continue or roll back. That decision lives in one place. In choreography, failure handling is distributed across services.
If the notification service goes down, the order still processes. The customer just does not get an email. You can replay the notification later without reprocessing the order itself.
The catch: your event reactions need to be idempotent. If the notification fires twice, the customer should not get two emails. If inventory is released twice, the stock count should not go negative.
Isolated failure also means you need a strategy for missed events. If the shipping service misses PaymentCharged, it needs to detect that gap and recover. This is harder than a centralized orchestrator that tracks state explicitly.
Choreography Scaling Characteristics
Publishers and subscribers in choreography scale independently. If the order service gets more load than the notification service, you scale order service instances without touching notification service. The event bus absorbs the imbalance.
This independence extends to teams. A team can ship a new event consumer without coordinating with the publisher team. That is valuable at scale, but in smaller organizations the shared event contracts and tooling can cost more coordination than they save.
The bottleneck in choreography is the event bus itself. When the event bus saturates, everything slows down. Kafka and Kinesis address this by splitting each stream into partitions (shards in Kinesis) that are consumed in parallel, but you need to choose partition keys with this in mind from the start.
Schema Registry Deep Dive
A schema registry centralizes event schema management and prevents breaking changes from reaching consumers. In choreography, where publishers and subscribers are decoupled, the registry is your primary mechanism for maintaining contract compatibility.
How Schema Registries Work
Publishers register their event schemas before publishing. The registry validates the schema against compatibility rules. Consumers then fetch schemas from the registry to deserialize events.
Producer -> Schema Registry (validate) -> Event Bus -> Consumer (fetch schema)
Kafka deployments commonly use Confluent Schema Registry or AWS Glue Schema Registry. The registry enforces backward, forward, or full compatibility depending on your configuration.
Compatibility Modes
Backward compatibility (most common): Consumers using the new schema can read data written under the old schema. This lets you add optional fields (fields with defaults) or remove fields, provided consumers are upgraded before producers.
Forward compatibility: Consumers using the old schema can read data written under the new schema. Useful when you cannot update all consumers simultaneously, because producers can start writing the new schema while some consumers still run the old one.
Full compatibility: Both backward and forward. More restrictive but safest for multi-version transitions.
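The three modes reduce to one question asked in two directions: can a reader with schema X decode data written with schema Y? The sketch below is a deliberately simplified model, assuming a schema is a dict of field name to a spec with a `type` and an optional `default`. Real registries apply richer, format-specific rules (type promotion, aliases, and so on), so treat `can_read` as an illustration rather than the registry's algorithm.

```python
def can_read(reader: dict, writer: dict) -> bool:
    """True if data written with `writer` can be decoded with `reader`."""
    for name, spec in reader.items():
        if name in writer:
            if writer[name]["type"] != spec["type"]:
                return False  # same field, different type
        elif "default" not in spec:
            return False  # reader needs a field the writer never wrote
    return True  # extra writer fields are simply ignored by the reader


def is_backward_compatible(new: dict, old: dict) -> bool:
    # New consumers must be able to read old data.
    return can_read(reader=new, writer=old)


def is_forward_compatible(new: dict, old: dict) -> bool:
    # Old consumers must be able to read new data.
    return can_read(reader=old, writer=new)


def is_fully_compatible(new: dict, old: dict) -> bool:
    return is_backward_compatible(new, old) and is_forward_compatible(new, old)


v1 = {"order_id": {"type": "string"}, "customer_id": {"type": "string"}}

# Adds an optional field (has a default).
v2_optional = {**v1, "coupon_code": {"type": "string", "default": None}}

# Adds a required field (no default).
v2_required = {**v1, "currency": {"type": "string"}}

# Removes a field that old readers require.
v2_removed = {"order_id": {"type": "string"}}
```

Under this model, adding an optional field is fully compatible, adding a required field is forward but not backward compatible, and removing a required field is backward but not forward compatible.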
Schema Evolution in Practice
When you need to change an event schema:
- Register the new version with the schema registry
- The registry validates against compatibility rules
- If compatible, the new schema is allowed; consumers continue using their cached schema
- If incompatible, the publish is rejected before it reaches the event bus
Without a registry, a breaking change reaches all consumers before you know something is wrong. With a registry, you catch it at publish time.
Schema Registry Anti-Patterns
Registering schemas without enforcement: If producers can bypass the registry, you lose the safety net. Gate all event publishing through the registry.
Evolving without coordination: Even with backward compatibility, coordinate with consumers before adding required fields or removing fields they depend on. Compatibility protects against crashes, not logical errors.
Not caching schemas: Consumers fetching schemas from the registry on every event adds latency. Cache schemas locally with a TTL and refresh periodically.
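The caching advice can be sketched as a small TTL cache in front of the registry client. The `fetch` callable stands in for whatever registry client you use and is an assumption, as are the class and parameter names; the clock is injectable so the expiry logic can be tested without sleeping.

```python
import time
from typing import Callable


class SchemaCache:
    """Caches schemas by ID and refetches after ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[dict, float]] = {}  # id -> (schema, expires_at)

    def get(self, schema_id: str) -> dict:
        now = self._clock()
        entry = self._entries.get(schema_id)
        if entry is not None and now < entry[1]:
            return entry[0]  # fresh hit: no registry round trip
        schema = self._fetch(schema_id)  # miss or expired: go to the registry
        self._entries[schema_id] = (schema, now + self._ttl)
        return schema


# Usage with a fake registry and a fake clock.
fetch_calls: list[str] = []
fake_now = [0.0]


def fake_fetch(schema_id: str) -> dict:
    fetch_calls.append(schema_id)
    return {"id": schema_id, "fields": ["order_id"]}


cache = SchemaCache(fake_fetch, ttl_seconds=60.0, clock=lambda: fake_now[0])
cache.get("order-placed-v1")
cache.get("order-placed-v1")  # served from cache
fake_now[0] = 61.0
cache.get("order-placed-v1")  # TTL elapsed, fetched again
```

If schemas are addressed by an immutable ID, a cached entry never becomes wrong, so the TTL mainly bounds memory and picks up deletions.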
Trade-off Analysis
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose; services know only events | Tight; services know the orchestrator |
| Single point of failure | No orchestrator; the event bus is shared infrastructure and must itself be highly available | The orchestrator |
| Workflow visibility | Hard to see overall transaction | Easy to see complete workflow state |
| Debugging | Requires correlated event tracing | Centralized log access |
| Compensation logic | Distributed across services | Centralized in orchestrator |
| Independent deployability | High; services deploy on their own schedule | Low; orchestrator changes may affect services |
| Scalability | Event bus and consumers scale independently | Orchestrator may become bottleneck |
| Cyclic dependencies | Risk of cycles in event graphs | Avoided by explicit control flow |
| Schema evolution | Harder; subscribers may break on breaking changes | Easier; orchestrator controls all calls |
Event Replay and Recovery Strategies
Event replay is central to recovery in choreographed systems. When a service crashes mid-processing or a downstream consumer misses events, replay lets you rebuild state from a known position without reprocessing the entire event history.
Replay Mechanisms
Offset-based replay (Kafka): Consumers track their current offset and can reset to an earlier offset to reprocess events. This works when events are immutable and stored durably. Reset the offset to re-process a window of events after a bug fix or crash.
Sequence-number-based replay (Kinesis): Each record has a sequence number. Consumers track their last processed sequence number and can request records from that point forward. Similar to offset replay but native to AWS Kinesis.
Timestamp-based replay: Store the last successful processing timestamp. On restart, query events with timestamps after that point. Less precise than offset/sequence but useful when you cannot track offsets directly.
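Offset-based replay is easiest to see with a toy model of a single partition. The `PartitionLog` and `Consumer` classes below are hypothetical stand-ins, not a Kafka client API: the log is an append-only list, the consumer remembers the next offset to read, and replay is nothing more than moving that position backwards.

```python
class PartitionLog:
    """Append-only list standing in for one durable partition."""

    def __init__(self) -> None:
        self._records: list[dict] = []

    def append(self, event: dict) -> int:
        self._records.append(event)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset: int):
        for position in range(offset, len(self._records)):
            yield position, self._records[position]


class Consumer:
    def __init__(self, log: PartitionLog, handler) -> None:
        self._log = log
        self._handler = handler
        self.position = 0  # next offset to read

    def poll(self) -> int:
        """Process everything from the current position; return the count."""
        processed = 0
        for offset, event in self._log.read_from(self.position):
            self._handler(event)
            self.position = offset + 1  # advance only after handling
            processed += 1
        return processed

    def seek(self, offset: int) -> None:
        """Rewind (or skip ahead) so the next poll starts at `offset`."""
        self.position = offset


log = PartitionLog()
for n in range(3):
    log.append({"event_id": f"evt-{n}"})

seen: list[str] = []
consumer = Consumer(log, lambda e: seen.append(e["event_id"]))
consumer.poll()    # normal processing: evt-0, evt-1, evt-2
consumer.seek(1)   # for example, after fixing a bug that affected evt-1 onward
consumer.poll()    # replays evt-1 and evt-2
```

After the second poll the handler has seen `evt-1` and `evt-2` twice, which is why replay only works safely with idempotent handlers.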
Designing for Replay
Replays can be expensive if consumers are not designed for them. During normal operation you process one event at a time. During replay you can hit the same consumer with thousands of events per second as the backlog catches up.
Design consumers to handle replay gracefully:
def handle_event(event):
    # Always idempotent: safe to call multiple times
    state = get_current_state(event.order_id)
    if state is None:
        # First time seeing this order - normal processing
        process_new_order(event)
    else:
        # Replay or late arrival - check if already processed
        if already_processed(event.event_id):
            return
        handle_duplicate_or_out_of_order(event, state)
Partition-aware replay: If your event bus uses partitioning, replay within a partition is straightforward. Cross-partition replay requires more care since you lose ordering guarantees across partitions. Use the correlation ID as the partition key so all events for a given transaction land in the same partition.
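Keying by correlation ID works because the partition is derived from a stable hash of the key. The sketch below uses SHA-256 from the standard library purely as a stand-in; real clients use their own hash function (Kafka's Java client, for example, uses murmur2 for keyed records), and `partition_for` is a hypothetical helper.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a hash that is stable across processes."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Python's built-in hash() is salted per process, so it is not used here.
    return int.from_bytes(digest[:8], "big") % num_partitions


NUM_PARTITIONS = 6
events = [
    {"type": "OrderPlaced", "correlation_id": "corr-abc123"},
    {"type": "InventoryReserved", "correlation_id": "corr-abc123"},
    {"type": "PaymentCharged", "correlation_id": "corr-abc123"},
]
assigned = {partition_for(e["correlation_id"], NUM_PARTITIONS) for e in events}
```

All three events resolve to the same partition, so their relative order is preserved. Changing `NUM_PARTITIONS` later would remap keys, which is one reason the partition count is worth deciding early.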
Recovery Patterns
Checkpoint-based recovery: Store checkpoint markers in a separate store. After processing each batch, record the checkpoint. On restart, resume from the checkpoint rather than the beginning.
Dead letter queue replay: Events that fail after max retries go to the DLQ. Fix the underlying issue, then replay DLQ events. Inspect the error context before replaying to avoid infinite loops.
Idempotent recovery: The simplest recovery pattern. After processing, write the event ID to a processed events table. On restart, skip any event ID already in the table. Works for moderate event volumes. For high volume, use a TTL or log-structured approach to avoid unbounded table growth.
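The dead letter queue pattern above can be sketched as a retry loop that gives up after a fixed number of attempts and records enough context to replay later. This is a minimal sketch under stated assumptions: the DLQ is a plain list, the function names and default values are hypothetical, and every exception is treated as retryable, which real code would refine.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def process_with_retries(event, handler, dead_letters, max_retries=3, sleep=time.sleep):
    """Try the handler up to max_retries + 1 times, then dead-letter the event."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            handler(event)
            return True
        except Exception as exc:  # sketch: treat every failure as transient
            last_error = exc
            if attempt < max_retries:
                sleep(backoff_delay(attempt))
    # Keep the original event and the error context so the entry can be
    # inspected and replayed after the underlying issue is fixed.
    dead_letters.append(
        {"event": event, "error": repr(last_error), "attempts": max_retries + 1}
    )
    return False


# Usage: a handler that fails twice and then succeeds, and one that never does.
calls = {"flaky": 0}


def flaky_handler(event):
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TimeoutError("transient network error")


def broken_handler(event):
    raise ValueError("schema mismatch")


dlq: list[dict] = []
sleeps: list[float] = []
ok = process_with_retries({"event_id": "evt-1"}, flaky_handler, dlq, sleep=sleeps.append)
failed = process_with_retries({"event_id": "evt-2"}, broken_handler, dlq, sleep=sleeps.append)
```

Passing `sleep` as a parameter keeps the example fast to run; in production the default `time.sleep` (or an async equivalent) applies the delays.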
Common Pitfalls / Anti-Patterns
Invisible workflows: With choreography, you cannot look at one service and see the overall transaction. The behavior emerges from the event chain. Without proper observability, debugging becomes a forensic exercise of correlating events across services.
Event chain complexity: Long chains become hard to follow. A triggers B, B triggers C, C triggers D. When D fails, tracing back through the chain is painful. Keep event chains short and well-documented.
Cyclic dependencies: If A depends on D completion and D is triggered by C which is triggered by B which is triggered by A, you have a cycle. Cycles cause deadlocks and infinite loops. Design your event dependencies as directed acyclic graphs (DAGs).
Assuming exactly-once delivery: Most event systems provide at-least-once or at-most-once delivery. Designing as if exactly-once exists leads to bugs. Always implement idempotent handlers.
Not planning for schema evolution: Events are immutable once published. If you need to change the schema, you must version the event. Subscribers must handle both old and new versions. This is harder than it sounds.
Ignoring consumer lag: As systems scale, consumers may fall behind. If the analytics consumer is 10 minutes behind, decisions based on analytics data are stale. Monitor and scale consumers to keep lag within acceptable bounds.
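The cyclic dependency pitfall is one of the few that can be checked mechanically. If you can describe which services trigger which (the mapping below is hand-written and hypothetical), the standard library's `graphlib` will report a cycle. This sketch could run as a pre-deployment check.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(triggers: dict[str, set[str]]):
    """Return a list of nodes forming a cycle, or None if the graph is a DAG.

    `triggers` maps each service to the services it is connected to by an
    event. Edge direction does not matter for cycle detection.
    """
    sorter = TopologicalSorter(triggers)
    try:
        sorter.prepare()  # raises CycleError if any cycle exists
    except CycleError as err:
        return err.args[1]  # the detected cycle; first and last node are equal
    return None


linear_flow = {
    "Order": {"Inventory"},
    "Inventory": {"Payment"},
    "Payment": {"Shipping"},
    "Shipping": {"Notification"},
}

cyclic_flow = {
    "A": {"B"},
    "B": {"C"},
    "C": {"D"},
    "D": {"A"},  # D's event feeds back into A
}

linear_result = find_cycle(linear_flow)
cyclic_result = find_cycle(cyclic_flow)
```

For the linear flow the result is `None`; for the cyclic flow it is a list that starts and ends on the same service and passes through A, B, C, and D.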
Consumer Lag and Backpressure
Consumer lag is the gap between when an event is published and when it is fully processed by a consumer. In high-throughput choreography systems, lag compounds quickly and degrades system behavior in subtle ways.
Measuring Lag
Lag is measured per consumer group. Each consumer tracks its current position (offset or sequence number) against the latest available event. The difference is the lag in events. For a rough time-based lag, divide the lag in events by the consumer's throughput in events per second; if you measure throughput in bytes per second, multiply the event count by the average event size first.
Kafka lag monitoring: Kafka consumers expose the records-lag and records-lag-max metrics over JMX, and the kafka-consumer-groups tool reports per-partition lag for a group. Tools like Confluent Control Center, Datadog, and CloudWatch can alert on lag thresholds.
Kinesis lag monitoring: Each GetRecords response includes MillisBehindLatest, which measures how far behind the tip of the shard the reader is in milliseconds; CloudWatch reports the related GetRecords.IteratorAgeMilliseconds metric. High values indicate the consumer cannot keep up with incoming data.
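The lag arithmetic is simple enough to write down. The sketch below assumes you already have the latest and committed offsets per partition (how you obtain them depends on your broker); the function names and the sample numbers are made up for illustration.

```python
def lag_per_partition(latest: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Lag in events for each partition: latest offset minus committed offset."""
    return {p: end - committed.get(p, 0) for p, end in latest.items()}


def seconds_to_catch_up(lag_events: int, consume_rate: float, produce_rate: float = 0.0) -> float:
    """Estimated time to drain the backlog, in seconds.

    Rates are in events per second. If new events arrive at least as fast
    as they are consumed, the backlog never drains.
    """
    if lag_events <= 0:
        return 0.0
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return float("inf")
    return lag_events / net_rate


lag = lag_per_partition(latest={0: 1500, 1: 900}, committed={0: 1200, 1: 900})
total_lag = sum(lag.values())                                    # 300 events
idle_drain = seconds_to_catch_up(total_lag, consume_rate=50)     # 6.0 seconds
live_drain = seconds_to_catch_up(total_lag, consume_rate=150, produce_rate=100)  # 6.0 seconds
never = seconds_to_catch_up(total_lag, consume_rate=1000, produce_rate=2000)     # inf
```

The `produce_rate` term matters in practice: a consumer that is fast in isolation can still fall further behind if the publish rate exceeds its throughput.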
Impact of Unchecked Lag
When analytics consumers lag, their data becomes stale. Decision-making based on that data produces incorrect conclusions. When notification consumers lag, customers receive confirmations late. When payment consumers lag, billing cycles stretch unpredictably.
The choreographed system continues operating, but reactions to events arrive progressively later. This can cause race conditions where a subsequent event arrives before the reaction to an earlier one.
Backpressure Mechanisms
Backpressure controls the rate of event processing when consumers are overwhelmed.
Consumer-side throttling: The consumer slows its poll rate when its processing queue fills. This is built into most SDKs but can be tuned. Throttling protects the consumer, not the backlog: if your consumer can process 1000 events per second but the system produces 2000, the unread backlog on the event bus keeps growing until you add capacity or reduce the publish rate.
Publisher-side rate limiting: Cap the rate at which events are published per producer. This protects the system but can cause events to be dropped or delayed if the cap is too low.
Ingress metering: The event bus itself enforces throughput limits. Kafka’s quota mechanism throttles producers or consumers that exceed configured rates. This is a hard limit at the broker level.
Prefetch limits: Some consumer SDKs prefetch events into a local buffer before processing. Setting the prefetch limit too high causes memory pressure; too low causes throughput loss. Tune based on your processing latency and memory budget.
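Publisher-side rate limiting is commonly implemented as a token bucket: tokens refill at a steady rate, each publish spends one, and a burst can use up to the bucket's capacity. The class below is a minimal single-threaded sketch with hypothetical names; the clock is injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows `rate_per_second` publishes on average, with bursts up to `capacity`."""

    def __init__(
        self,
        rate_per_second: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_second
        self._capacity = float(capacity)
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last_refill = clock()

    def try_acquire(self) -> bool:
        """Spend one token if available; never blocks."""
        now = self._clock()
        elapsed = now - self._last_refill
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last_refill = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


# Usage with a fake clock: burst of 2, refill of 1 token per second.
fake_now = [0.0]
bucket = TokenBucket(rate_per_second=1.0, capacity=2, clock=lambda: fake_now[0])

results = [bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire()]
fake_now[0] = 1.0  # one second later, one token has refilled
after_wait = [bucket.try_acquire(), bucket.try_acquire()]
```

What the publisher does when `try_acquire` returns `False` is the real design decision: buffering delays events, while dropping loses them.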
Scaling to Reduce Lag
Horizontal scaling of consumers reduces lag but requires care. Adding a consumer to a consumer group triggers a rebalance where partitions are reassigned. With eager rebalancing, the group processes no events until the rebalance completes. If rebalances are frequent (from unstable consumer instances or aggressive heartbeat and session timeout settings), lag can worsen rather than improve.
Stabilize consumer instances before scaling: health checks, proper shutdown handling, and graceful rebalance protocols reduce the overhead of adding capacity.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Event lost in transit | Downstream steps never execute; system left in incomplete state | Use persistent event stores (Kafka with replication); implement event replay capability |
| Event delivered twice | Duplicate processing; potential double operations (double charges) | Implement idempotency checks using event IDs; use deduplication tables |
| Consumer crashes mid-processing | Event acknowledged but not fully processed | Acknowledge (commit offsets) only after processing completes, giving at-least-once delivery; rely on consumer group rebalancing to reassign the partition; idempotent handlers |
| Downstream service unavailable | Event emitted but no reaction occurs; partial execution | Implement retry with exponential backoff; dead letter queues for failed processing |
| Event schema breaking change | Subscribers fail to process events they depend on | Use schema registry with backward compatibility; version events; implement contract testing |
| Cyclic event dependencies | Deadlock where A waits for D and D is triggered by C which waits for B which waits for A | Design event flows as DAGs; validate for cycles before deployment |
| Event replay storms | Replaying historical events causes cascade of downstream processing | Implement replay window limits; separate replay pipelines from production |
Key Takeaways
graph LR
A[Service A] -->|Event| B[Event Bus]
B -->|Event| C[Service B]
B -->|Event| D[Service C]
B -->|Event| E[Service D]
Key Points
- Choreography uses events (broadcast) rather than commands (directed)
- Each service knows only its own trigger and reaction
- No central orchestrator means no central point of failure
- Compensation logic is distributed across services
- Trade-off: simpler services but harder to see overall workflow state
Production Checklist
# Service Choreography Production Readiness
- [ ] Idempotent event handlers implemented
- [ ] Event schema registry with backward compatibility enforced
- [ ] Correlation IDs included in all events
- [ ] Dead letter queue configured for failed processing
- [ ] Consumer lag monitoring and alerting configured
- [ ] Event replay capability tested
- [ ] Distributed tracing configured across event consumers
- [ ] Schema evolution strategy documented and enforced
- [ ] Audit logging for event publishing
- [ ] Rate limiting on event publishing
Observability Checklist
Metrics
- Event publish rate (events per second by type)
- Event consumption rate per subscriber
- Event processing duration per consumer
- Dead letter queue depth
- Event consumer lag (how far behind real-time)
- Duplicate event detection count
- Schema compatibility violations detected
Logs
- Log all published events with event ID, type, and correlation ID
- Log all consumed events with consumer ID
- Include correlation ID to trace events across services
- Log consumer failures with original event context
- Log dead letter queue insertions with reason
Alerts
- Alert when dead letter queue depth exceeds threshold
- Alert when consumer lag exceeds SLA window
- Alert when duplicate event rate spikes (indicates upstream issue)
- Alert when event schema violations are detected
- Alert on consumer failures that trigger retry storms
Security Checklist
- Authenticate event publishers to prevent unauthorized event injection
- Authorize subscribers to prevent subscription to sensitive event streams
- Encrypt event payloads containing sensitive data at rest and in transit
- Validate event schemas before publishing to prevent malformed events
- Audit access to event stores and schema registries
- Rate limit event publishing to prevent DoS via event flooding
- Redact sensitive data from event logs (correlation IDs should be safe; payload data may not be)
Interview Questions
Choreography: services communicate by emitting and consuming events with no central director. Orchestration: a central orchestrator directs the workflow by calling services directly and holding all workflow state. In choreography each service only knows its own trigger and reaction; in orchestration the orchestrator holds the entire picture. The core trade-off is loose coupling versus central visibility.
Events can be delivered more than once due to network partitions, consumer crashes, or retry logic. Without idempotency, duplicate events cause duplicate operations: double charges, double reservations. Techniques include idempotency keys, check-before-act patterns, and deduplication tables. The pattern is simple: check if already processed before acting, then mark it done.
Every event in a transaction shares a correlation ID. Collecting all events with the same correlation ID lets you reconstruct the complete transaction flow across services. Without it, tracing a failed transaction through multiple services becomes a forensic exercise. Include the correlation ID in every event from the start.
The core challenge is that overall transaction state is scattered across services. Long event chains are hard to trace when something fails. Solutions include correlation IDs on all events, distributed tracing with OpenTelemetry, and event sourcing for replay. Event sourcing aligns well with choreography since you already have the events stored.
In an orchestrated saga, the orchestrator tracks steps and triggers compensations centrally. In a choreographed saga, each service knows its own step and compensation; when a step fails it emits a failure event and the services that already completed earlier steps run their own compensations. A choreographed saga is more complex because you must track which steps completed across services without central state.
If A triggers B, B triggers C, C triggers D, and D triggers A, events can circulate indefinitely; if each service instead waits on an event from the next one in the cycle, you get a deadlock where services wait for events that never arrive. Prevention: design event flows as directed acyclic graphs (DAGs) and validate for cycles before deployment. Regular architecture review of event dependency graphs helps catch emerging cycles early.
Avoid choreography when: workflows have complex branching or conditional paths; you need clear visibility into entire workflow state for operations; compensation logic is complex and must be centralized for audit or safety; debugging the workflow is critical and centralized logs are more practical than distributed event tracing; or when cyclic dependencies are unavoidable in your event flow.
Events are immutable once published, so schema changes require versioning. Subscribers must handle both old and new versions during transitions. Use schema registries with backward compatibility enforcement (Confluent Schema Registry is a common choice with Kafka). Contract testing catches breaking changes before they reach production. Because subscribers are decoupled, breaking changes in choreography are harder to manage than in orchestration.
Commands are directed ("do this thing") targeting one specific handler. Events are broadcast ("this thing happened") and any interested service can react. Naming reflects this: commands use verb-noun (CreateOrder, ReserveInventory), events use noun-verb past tense (OrderPlaced, InventoryReserved). Commands couple services to their handlers; events enable loose coupling.
Partial failures are inherent because there is no atomic transaction boundary across services. Each service must implement compensation for its own actions. When a step fails, emit a failure event; services that already acted react by running their compensations. Track which saga steps have completed so you know what to compensate. Use dead letter queues for events that fail after max retries. Idempotent compensations are essential since compensation events can also arrive more than once.
Choreography is one pattern within EDA, not the whole thing. EDA covers any architecture built around event production and consumption; choreography specifically means decentralized reaction to events with no central coordinator. Orchestration is the other pattern within EDA where a central coordinator directs the flow. The event-driven architecture blog post covers the fundamentals; choreography is the decentralized execution model for those principles.
Name events in noun-verb past tense to describe what happened from the emitting service's perspective: OrderPlaced, PaymentCharged, InventoryReserved. Commands use verb-noun instead: CreateOrder, ReserveInventory. Put the aggregate name (Order, Payment) in the event name when it helps subscribers filter correctly. Leave command-language verbs (Create, Send) out of event names entirely.
Beyond check-before-act, you have a few options. Deduplication tables keyed on event ID with a TTL handle storage cleanly. Bloom filters work for memory-efficient duplicate detection at scale. Some systems use deterministic idempotency keys where the business logic itself defines uniqueness, like order ID plus action type. At-least-once brokers pair well with natural idempotency in your business logic, like detecting duplicate charges via payment ID. Pick based on what your broker guarantees.
Brokers are the event bus that makes choreography viable. They handle routing, durability, scalability through partitioning, and delivery guarantees. Kafka's partitioned log gives you replay capability from offsets, which is essential for recovery. Kinesis does similar things on AWS with different scaling tradeoffs. Either way, the broker replaces what a central orchestrator would do by providing the shared infrastructure for distributing events to all subscribers.
Services only couple to event schemas, not to other services directly. The inventory team can ship a new event type without asking the payment team to change anything. The event contract is the only coupling point. The downside is that nobody has a clear view of the overall workflow anymore, which is why correlation IDs and distributed tracing become essential rather than optional.
You cannot test choreography the same way you test a monolith. Contract testing checks that event schemas work for both producers and subscribers without running the full system. Consumer-driven contracts let subscribers specify what events they expect. For integration testing, use Testcontainers to run Kafka or LocalStack. You can inject events at the broker level to trace end-to-end flows. And chaos testing with service failures shows whether your compensation logic actually works.
Exponential backoff with jitter prevents thundering herd problems on retries. After 3-5 retries (configurable), move the event to a dead letter queue with the original event, error details, and retry count. DLQ events need to be inspectable and replayable after you fix whatever broke. Monitor DLQ depth as an alert metric. For transient failures like network timeouts, retries handle recovery. For permanent failures like schema mismatches or code bugs, the DLQ captures them for later replay.
Orchestrated rollback is straightforward: the orchestrator knows what succeeded and calls compensations in reverse order. Choreographed compensation has no central view, so each service must independently track and compensate for its own actions. That is harder to get right. The upside is resilience: if connectivity to the orchestrator drops, choreographed compensation can still proceed. Both require idempotent compensations because events can arrive more than once.
Choreography pushes you toward database-per-service, naturally. Services often lean toward event sourcing since you are already producing events; storing them as the source of truth fits the pattern. CQRS pairs well too, with commands updating state and events propagating read model updates. You end up with denormalized data within services because you cannot join across service boundaries. Compensation logic sometimes needs to store pending transaction state locally so the service knows what to undo when a failure event arrives.
Watch consumer lag closely, how far behind real-time each subscriber is. Dead letter queue depth and insertion rate tell you when consumers are failing. Event processing duration per consumer catches spikes. Duplicate event rate spikes often indicate upstream producer issues. Schema compatibility violations mean a breaking change slipped through. Alert on consumer lag exceeding your SLA window, DLQ depth above a threshold, and error rates that suggest systemic problems.
Conclusion
Service choreography works when services are genuinely independent and workflows are simple. Events broadcast and services react. No central coordinator means no central point of failure.
The price is invisible workflows and harder debugging. When something goes wrong, you trace through events across services rather than reading an orchestrator log.
Use choreography for peripheral side effects and independent reactions. Use orchestration for workflows that need clear transaction boundaries and complex compensation. The best systems use both.