Service Choreography: Event-Driven Distributed Workflows

Learn service choreography patterns for distributed workflows through events, sagas with choreography, and when to prefer events over orchestration.

published: March 22, 2026 reading time: 32 min read author: GeekWorkBench updated: April 23, 2026

Quick Summary

Service choreography inverts the typical distributed systems picture. There is no conductor directing services—each one just reacts to events from others, and the workflow surfaces from those chained reactions. The upsides are real: services stay loosely coupled, and there is no central point of failure. The tradeoff is that you cannot look at any one service and see the full transaction; when something breaks, you trace through events across service boundaries. For simple, linear flows and peripheral side effects, choreography fits well. When you need to undo steps across services, orchestration tends to win.

In choreography, there is no conductor. Services emit events when they do something, and other services react to those events by doing their own thing. The overall behavior emerges from the chain of reactions, not from a central plan.

This is a fundamentally different way of thinking about distributed systems. Instead of asking “who coordinates this workflow?” you ask “what events should cause what reactions?”

This post covers how choreography works, why it appeals to distributed systems architects, and where it breaks down.

Introduction

Service choreography is a pattern where services communicate by emitting and consuming events. Each service reacts to incoming events and may emit new events as a result. There is no central orchestrator directing the workflow.

graph LR
    Order[Order Service] -->|OrderPlaced| Inv[Inventory Service]
    Inv -->|InventoryReserved| Pay[Payment Service]
    Pay -->|PaymentCharged| Ship[Shipping Service]
    Ship -->|ShipmentCreated| Notify[Notification Service]

The flow emerges from the event chain. Each service knows only its own trigger and reaction. The service that reserves inventory does not know or care what happens after. It just emits “InventoryReserved” and moves on.

Core Concepts

Choreography relies on events, not commands. The distinction matters.

A command is directed: “do this thing.” It expects one specific handler. CreateOrder goes to the Order Service, and that is the only place it goes.

An event is broadcast: “this thing happened.” Any service that cares can react. OrderPlaced goes to the event bus, and Inventory, Analytics, Notification, and others can all subscribe and act.

Naming conventions make the distinction obvious. Commands are verb-noun: CreateUser, CancelOrder, ReserveInventory. Events are noun-verb past tense: UserCreated, OrderCancelled, InventoryReserved.
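To make the distinction concrete, here is a minimal in-memory sketch. The `EventBus` class, `send_command`, and the subscriber lambdas are hypothetical names chosen for illustration; a real system would use a broker, not a Python dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Toy in-memory bus: an event fans out to every subscriber of its type."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event: dict) -> int:
        """Deliver the event to all subscribers; return how many received it."""
        handlers = list(self._subscribers[event["type"]])
        for handler in handlers:
            handler(event)
        return len(handlers)


# A command, by contrast, is routed to exactly one registered handler.
command_handlers: Dict[str, Handler] = {}


def send_command(command: dict) -> None:
    command_handlers[command["type"]](command)  # KeyError if nobody handles it


received: List[str] = []
bus = EventBus()
bus.subscribe("OrderPlaced", lambda e: received.append("inventory"))
bus.subscribe("OrderPlaced", lambda e: received.append("analytics"))
delivered = bus.publish({"type": "OrderPlaced", "order_id": "ord-12345"})
```

The publisher never names its consumers: adding a third subscriber is one more `subscribe` call, with no change to the code that publishes `OrderPlaced`.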

Benefits of Choreography

Choreography has real advantages for the right problem.

True decoupling: Services do not know about each other. They only know about events. Add a new subscriber without touching the publisher. Remove a subscriber the same way.

No single point of failure: There is no orchestrator that, if it crashes, halts all workflows. If one service goes down, only its reactions stop. Other workflows continue.

Independent deployability: Each service deploys on its own schedule. The inventory service does not need to coordinate with the payment service for a release.

Scalability: Event buses are designed to scale. Publishers and subscribers scale independently.

Sagas with Choreography

The saga pattern can be implemented with choreography instead of orchestration. In a choreographed saga, each service knows its own step and its own compensation. When a step fails, the service emits a failure event. Services earlier in the chain that already completed their step react to the failure event by running their compensations.

sequenceDiagram
    participant Client
    participant OrderSvc as Order Service
    participant Inv as Inventory Service
    participant Pay as Payment Service
    Client->>OrderSvc: Place order
    OrderSvc-->>Inv: OrderPlaced
    Inv-->>Pay: InventoryReserved
    Pay-->>Inv: PaymentFailed
    Pay-->>OrderSvc: PaymentFailed
    Inv-->>OrderSvc: InventoryReleased

This is more complex than an orchestrated saga. Each service must track which of its own steps have completed so it knows what to compensate, and no single place holds the state of the whole saga. The compensation logic is spread across services rather than centralized.
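A rough sketch of such a saga, using a toy in-memory publish function, might look like the following. The event names `PaymentFailed` and `InventoryReleased`, the handler names, and the `reserved` dictionary are assumptions made for this example.

```python
from collections import defaultdict

subscribers = defaultdict(list)   # event type -> handlers
log = []                          # event types in publish order, for inspection


def publish(event):
    log.append(event["type"])
    for handler in list(subscribers[event["type"]]):
        handler(event)


reserved = {}  # order_id -> quantity currently held by the inventory service


def inventory_on_order_placed(event):
    reserved[event["order_id"]] = event["quantity"]
    publish({"type": "InventoryReserved", "order_id": event["order_id"]})


def payment_on_inventory_reserved(event):
    # Simulate a declined card: the payment step fails and announces it.
    publish({"type": "PaymentFailed", "order_id": event["order_id"]})


def inventory_on_payment_failed(event):
    # Compensation lives with the service that owns the original step.
    # pop() with a default keeps this safe if the event is delivered twice.
    if reserved.pop(event["order_id"], None) is not None:
        publish({"type": "InventoryReleased", "order_id": event["order_id"]})


subscribers["OrderPlaced"].append(inventory_on_order_placed)
subscribers["InventoryReserved"].append(payment_on_inventory_reserved)
subscribers["PaymentFailed"].append(inventory_on_payment_failed)

publish({"type": "OrderPlaced", "order_id": "ord-12345", "quantity": 2})
```

Note that the inventory service decides for itself whether it has anything to undo. Nothing outside it knows that a reservation existed.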

For a detailed look at saga patterns, see Saga Pattern.

Implementing Choreography

Event Schema Design

Events need a consistent schema. An event schema defines the fields each event type carries. Schema evolution matters because events are immutable once published.

{
  "type": "OrderPlaced",
  "version": "1.0",
  "order_id": "ord-12345",
  "customer_id": "cust-67890",
  "items": [...],
  "timestamp": "2026-03-22T10:30:00Z",
  "correlation_id": "corr-abc123"
}

Include a correlation ID in every event. It lets you trace a complete transaction across services by following all events with the same correlation ID.
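One way to make propagation hard to forget is to build every event through a helper that copies the correlation ID from the event that caused it. This is a sketch: `new_event` and the `event_id` field are assumptions for illustration, and the ID formats simply mimic the sample above.

```python
import uuid
from datetime import datetime, timezone


def new_event(event_type, payload, cause=None):
    """Build an event, inheriting the correlation ID from the causing event."""
    if cause is not None:
        correlation_id = cause["correlation_id"]          # continue the transaction
    else:
        correlation_id = f"corr-{uuid.uuid4().hex[:12]}"  # start a new transaction
    return {
        "type": event_type,
        "version": "1.0",
        "event_id": f"evt-{uuid.uuid4().hex[:12]}",       # unique per event
        "correlation_id": correlation_id,                 # shared per transaction
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **payload,
    }


order_placed = new_event("OrderPlaced", {"order_id": "ord-12345"})
inventory_reserved = new_event(
    "InventoryReserved", {"order_id": "ord-12345"}, cause=order_placed
)
```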

Event Contract Testing

Schema registries catch incompatible changes at publish time, but they do not catch semantic mismatches: a field that means something different to the consumer than the producer intended. That is where contract testing comes in.

Contract testing validates that event producers and consumers agree on the shape and meaning of events before deployment. Unlike schema registry validation (runtime), contract testing runs in your CI pipeline. It answers: “does the consumer’s expectation match what the producer actually sends?”

Approach	When It Catches Issues	Scope
Schema registry	At publish time (runtime)	Structural compatibility only
Contract testing	Before deployment (CI)	Structural + semantic compatibility
End-to-end tests	After deployment	Full system behavior

Consumer-driven contracts (CDC) flip the ownership model. The consumer defines what it expects from an event, publishes that contract, and the producer verifies against it before deploying. This keeps most breaking changes from ever reaching the event bus. Tools like Pact and Spring Cloud Contract implement CDC for event-driven systems.

Tool	Language	Event Format Support
Pact	Multi-language	JSON, Avro, Protobuf via plugins
Spring Cloud Contract	JVM	JSON, XML
AsyncAPI	Language-agnostic	Schema-agnostic

A practical contract testing workflow for event-driven choreography:

Consumer publishes a contract describing the event fields it expects
Producer fetches all consumer contracts before deploying a new schema version
CI runs contract verification: the producer generates a sample event matching the new schema, and each consumer contract is validated against it
If any consumer contract fails, the producer deploy is blocked
The producer team coordinates with the affected consumer team before retrying

This catches the class of bugs where a renamed field, changed data type, or removed optional field breaks a downstream service even though the schema registry allowed it.
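The verification step in that workflow can be sketched in plain Python. This is not the Pact or Spring Cloud Contract API: the contract format (field name mapped to expected type) and the `contract_violations` function are simplifications invented for this example.

```python
# Consumer-owned contract: field name -> expected Python type (illustrative format).
INVENTORY_CONTRACT = {
    "type": str,
    "order_id": str,
    "items": list,
    "correlation_id": str,
}


def contract_violations(sample_event: dict, contract: dict) -> list:
    """Return human-readable problems; an empty list means the contract holds."""
    problems = []
    for field, expected_type in contract.items():
        if field not in sample_event:
            problems.append(f"missing field: {field}")
        elif not isinstance(sample_event[field], expected_type):
            actual = type(sample_event[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


# Producer-side CI step: check a sample of the new schema against each contract.
sample_v2 = {
    "type": "OrderPlaced",
    "order_id": 12345,            # changed from str to int in the new version
    "items": [],
    "correlation_id": "corr-abc123",
}
violations = contract_violations(sample_v2, INVENTORY_CONTRACT)
```

A non-empty `violations` list is what would block the producer deploy. Semantic checks (for example, that an amount is in cents rather than dollars) need explicit assertions on values and are not covered by a type check like this one.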

Idempotency

Events may be delivered more than once. Network partitions, consumer crashes, and retry logic all cause duplicates. Services must handle duplicate events idempotently.

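def handle_order_placed(event):
    # Skip events that were already handled; duplicates are expected
    # under at-least-once delivery.
    if order_processed(event.order_id):
        return

    # Apply the business effect for this order
    process_order(event.order_id)

    # Record completion. Ideally this write commits in the same transaction
    # as process_order, so a crash between the two steps cannot cause the
    # order to be processed again on redelivery.
    mark_order_processed(event.order_id)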

Use idempotency keys or check-before-act patterns to ensure re-processing is safe.
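The check-before-act version above leaves a window between the check and the mark. One way to close it, sketched here with SQLite from the standard library, is to record the event ID and apply the change in a single transaction and let a primary key reject duplicates. The table names and the `event_id` field are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")


def handle_order_placed(event: dict) -> bool:
    """Process the event once; return False if it was a duplicate."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            # The primary key makes the claim atomic; a duplicate raises IntegrityError.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            conn.execute(
                "INSERT INTO orders (order_id, status) VALUES (?, 'PLACED')",
                (event["order_id"],),
            )
        return True
    except sqlite3.IntegrityError:
        return False


event = {"event_id": "evt-1", "order_id": "ord-12345"}
first = handle_order_placed(event)
second = handle_order_placed(event)   # redelivery of the same event
```

This only works when the deduplication record and the business data live in the same database. Side effects outside that transaction, such as sending an email, still need their own idempotency key.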

When Choreography Works

Choreography shines for simple, linear workflows where each step naturally follows from the previous. It works well in systems with many independent consumers of the same events (analytics, monitoring, notifications). Teams that value service independence and autonomous deployment tend to prefer it. It fits high-scale event streaming architectures like Kafka or Kinesis.

When Choreography Breaks Down

Choreography has limits.

Invisible workflows: You cannot look at one service and see the overall transaction. The behavior is scattered across many services. Debugging is harder.

No atomic transactions: There is no transaction boundary. Partial failures leave the system in inconsistent states. Compensation must handle this.

Event chain complexity: Long chains of events become hard to follow. A triggers B, B triggers C, C triggers D. When something goes wrong in D, tracing back through the chain is painful.

Cyclic dependencies: What if service A needs to know when service D finishes, but D is triggered by C which is triggered by B which is triggered by A? Cycles in event dependencies are difficult to manage.
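Cycles are easier to catch if the event flow is written down as data. The sketch below assumes you maintain a mapping from each event type (or service) to what it triggers; that mapping and the `find_cycle` function are hypothetical, and the search is an ordinary depth-first traversal.

```python
from typing import Dict, List, Optional

# What each node in the event flow can trigger (hypothetical flow definition).
EventGraph = Dict[str, List[str]]


def find_cycle(graph: EventGraph) -> Optional[List[str]]:
    """Return one cycle as a list of nodes, or None if the graph is a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:             # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


linear = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
cyclic = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
```

Run against `cyclic`, the function reports the path `A, B, C, D, A`. A check like this could run in CI whenever the flow definition changes.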

Observability Challenges

Choreographed systems produce many events. Understanding the overall state of a transaction requires correlating events across services.

Correlation IDs: Every event in a transaction shares a correlation ID. Collect all events with the same correlation ID to reconstruct the transaction.

Distributed tracing: OpenTelemetry can propagate trace context through event headers, provided producers inject it and consumers extract it, giving you a single view of the event chain.

Event sourcing: Store all events in sequence. Replay to reconstruct state. This aligns well with choreography but adds storage overhead.
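As a small illustration of the correlation ID approach, the function below groups collected events into per-transaction timelines. It assumes events are dictionaries with `correlation_id` and UTC ISO-8601 `timestamp` fields in one consistent format, as in the sample schema; `group_by_correlation` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_correlation(events):
    """Map correlation_id -> that transaction's events, ordered by timestamp."""
    transactions = defaultdict(list)
    for event in events:
        transactions[event["correlation_id"]].append(event)
    for chain in transactions.values():
        # Same-format UTC ISO-8601 strings sort chronologically as plain text.
        chain.sort(key=lambda e: e["timestamp"])
    return dict(transactions)


collected = [
    {"type": "PaymentCharged", "correlation_id": "corr-abc123", "timestamp": "2026-03-22T10:30:02Z"},
    {"type": "OrderPlaced", "correlation_id": "corr-xyz789", "timestamp": "2026-03-22T10:30:01Z"},
    {"type": "OrderPlaced", "correlation_id": "corr-abc123", "timestamp": "2026-03-22T10:30:00Z"},
    {"type": "InventoryReserved", "correlation_id": "corr-abc123", "timestamp": "2026-03-22T10:30:01Z"},
]
timeline = [e["type"] for e in group_by_correlation(collected)["corr-abc123"]]
```

Timestamps from different services are subject to clock skew, so treat the resulting order as approximate unless the events also carry a causal sequence.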

When to Use Choreography

Use choreography when:

Services are genuinely independent and decoupled
Workflows are simple and linear (A triggers B, B triggers C)
You want to avoid a single point of failure
Adding new steps should not require modifying a central orchestrator
Many independent consumers need to react to the same events (analytics, monitoring, notifications)
You are building an event streaming architecture (Kafka, Kinesis)
Teams value service autonomy and independent deployability

Avoid choreography when:

Workflows have complex branching or conditional paths
You need clear visibility into the entire workflow state
Compensation logic is complex and must be centralized
Debugging the workflow is critical for operations
You have cyclic dependencies in your event flow

Combining Choreography and Orchestration

The choice is not binary. Many systems use both.

Core business workflows with complex compensation logic often need orchestration. The orchestrator has the full picture and can manage failure cleanly.

Peripheral side effects (notifications, audit logs, analytics) are good candidates for choreography. These are typically fire-and-forget. If a notification fails, you do not want to roll back the whole order.

The pattern: orchestration for the critical path, choreography for everything else.

For event-driven architecture fundamentals, see event-driven architecture. For messaging infrastructure, see message queue types.

Choreography Failure Isolation

Choreography handles failures differently than orchestration. When an orchestrated service fails, the orchestrator decides whether to continue or roll back. That decision lives in one place. In choreography, failure handling is distributed across services.

If the notification service goes down, the order still processes. The customer just does not get an email. You can replay the notification later without reprocessing the order itself.

The catch: your event reactions need to be idempotent. If the notification fires twice, the customer should not get two emails. If inventory is released twice, the stock count should not go negative.

Isolated failure also means you need a strategy for missed events. If the shipping service misses PaymentCharged, it needs to detect that gap and recover. This is harder than a centralized orchestrator that tracks state explicitly.

Choreography Scaling Characteristics

Publishers and subscribers in choreography scale independently. If the order service gets more load than the notification service, you scale order service instances without touching notification service. The event bus absorbs the imbalance.

This independence extends to teams. A team can ship a new event consumer without coordinating with the publisher team. That autonomy is valuable at scale, but in smaller organizations the event contracts and infrastructure it requires can add more overhead than they save.

The bottleneck in choreography is the event bus itself. When the event bus saturates, everything slows down. Kafka and Kinesis spread the load by splitting each stream into partitions (Kafka) or shards (Kinesis) that consumers read in parallel, but you need to design your event keys and schemas with partitioning in mind from the start.

Schema Registry Deep Dive

A schema registry centralizes event schema management and prevents breaking changes from reaching consumers. In choreography, where publishers and subscribers are decoupled, the registry is your primary mechanism for maintaining contract compatibility.

How Schema Registries Work

Publishers register their event schemas before publishing. The registry validates the schema against compatibility rules. Consumers then fetch schemas from the registry to deserialize events.

Producer -> Schema Registry (validate) -> Event Bus -> Consumer (fetch schema)

Kafka deployments commonly use Confluent Schema Registry or AWS Glue Schema Registry. The registry enforces backward, forward, or full compatibility depending on your configuration.

Compatibility Modes

Backward compatibility (most common): Consumers using the new schema can read data written under the old schema. This lets you add optional fields (fields with defaults) or remove fields, provided consumers are upgraded before producers start writing the new version.

Forward compatibility: Consumers still using the old schema can read data written under the new schema. Useful when you cannot update all consumers simultaneously: producers can move to the new schema first and consumers follow later.

Full compatibility: Both backward and forward. More restrictive but safest for multi-version transitions.
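The modes can be illustrated with a deliberately simplified model. Here a schema is just a mapping from field name to whether the field has a default, and a reader can decode a record if every field it has no default for was actually written. Real registries also compare types and nesting; `can_read` and the sample schemas are assumptions for this sketch.

```python
def can_read(reader: dict, writer: dict) -> bool:
    """True if every reader field either has a default or was written."""
    return all(has_default or name in writer for name, has_default in reader.items())


def is_backward_compatible(new: dict, old: dict) -> bool:
    """Consumers on the new schema can read data written with the old one."""
    return can_read(reader=new, writer=old)


def is_forward_compatible(new: dict, old: dict) -> bool:
    """Consumers on the old schema can read data written with the new one."""
    return can_read(reader=old, writer=new)


def is_fully_compatible(new: dict, old: dict) -> bool:
    return is_backward_compatible(new, old) and is_forward_compatible(new, old)


# field name -> has a default value?
v1 = {"order_id": False, "customer_id": False}
v2_adds_optional = {"order_id": False, "customer_id": False, "coupon": True}
v2_adds_required = {"order_id": False, "customer_id": False, "channel": False}
v2_removes_field = {"order_id": False}
```

Under this model, adding an optional field is fully compatible, adding a required field is forward compatible only, and removing a required field is backward compatible only.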

Schema Evolution in Practice

When you need to change an event schema:

Register the new version with the schema registry
The registry validates against compatibility rules
If compatible, the new schema is allowed; consumers continue using their cached schema
If incompatible, the publish is rejected before it reaches the event bus

Without a registry, a breaking change reaches all consumers before you know something is wrong. With a registry, you catch it at publish time.

Schema Registry Anti-Patterns

Registering schemas without enforcement: If producers can bypass the registry, you lose the safety net. Gate all event publishing through the registry.

Evolving without coordination: Even with backward compatibility, coordinate with consumers before adding required fields or removing fields they depend on. Compatibility protects against crashes, not logical errors.

Not caching schemas: Consumers fetching schemas from the registry on every event adds latency. Cache schemas locally with a TTL and refresh periodically.

Trade-off Analysis

Aspect	Choreography	Orchestration
Coupling	Loose; services know only events	Tight; services know the orchestrator
Single point of failure	No orchestrator; the event bus is shared and must be highly available	The orchestrator
Workflow visibility	Hard to see overall transaction	Easy to see complete workflow state
Debugging	Requires correlated event tracing	Centralized log access
Compensation logic	Distributed across services	Centralized in orchestrator
Independent deployability	High; services deploy on their own schedule	Low; orchestrator changes may affect services
Scalability	Event bus and consumers scale independently	Orchestrator may become bottleneck
Cyclic dependencies	Risk of cycles in event graphs	Avoided by explicit control flow
Schema evolution	Harder; subscribers may break on breaking changes	Easier; orchestrator controls all calls

Event Replay and Recovery Strategies

Event replay is central to recovery in choreographed systems. When a service crashes mid-processing or a downstream consumer misses events, replay lets you rebuild state from a chosen point rather than from the entire event history.

Replay Mechanisms

Three replay mechanisms exist, each with different tradeoffs for precision, storage cost, and infrastructure support.

Offset-based replay (Kafka): Consumers track their current offset and can reset to an earlier offset to reprocess events. This works when events are immutable and stored durably. Reset the offset to re-process a window of events after a bug fix or crash. Kafka topics can be configured with unlimited retention (retention.ms=-1), making offset-based replay viable for years-old data. Log-compacted topics are a different tool: they keep only the latest record per key, so they preserve current state rather than the full history. The downside: you need Kafka’s storage infrastructure and you pay for retention.

Sequence-number-based replay (Kinesis): Each record has a sequence number. Consumers track their last processed sequence number and can request records from that point forward. Similar to offset replay but native to AWS Kinesis. Kinesis data retention is capped at 365 days (default 24 hours), so sequence-number replay only works inside that window. Beyond it, you need a secondary store (S3, DynamoDB) for archived events.

Timestamp-based replay: Store the last successful processing timestamp. On restart, query events with timestamps after that point. Less precise than offset/sequence but useful when you cannot track offsets directly, for example when reading from a database event log or an S3 event archive. Timestamp replay can miss events or double-process them if clock skew exists between producers and consumers.

Mechanism	Precision	Storage Requirement	Typical Use Case
Offset-based (Kafka)	Exact per-partition	Kafka log (configurable retention)	Post-bugfix reprocessing
Sequence-number (Kinesis)	Exact per-shard	Kinesis stream (max 365 days)	AWS-native recovery
Timestamp-based	Approximate	Application-level checkpoint	DB event logs, S3 archives

Choose offset replay when you control the broker and need long retention. Choose sequence-number replay for AWS Kinesis-native systems within the retention window. Choose timestamp replay when your events come from a source that does not natively track offsets, but add deduplication to handle the imprecision.

Designing for Replay

Replays can be expensive if consumers are not designed for them. During normal operation you process one event at a time. During replay you can hit the same consumer with thousands of events per second as the backlog catches up.

Design consumers to handle replay gracefully:

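def handle_event(event):
    # Idempotency guard first: a replayed or redelivered event is a no-op
    if already_processed(event.event_id):
        return

    state = get_current_state(event.order_id)

    if state is None:
        # First event seen for this order - normal processing
        process_new_order(event)
    else:
        # State already exists: the event is new but may be late or out of order
        handle_duplicate_or_out_of_order(event, state)

    # Record the event ID so a later replay skips it
    # (mark_processed is a hypothetical helper, like the others here)
    mark_processed(event.event_id)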

Partition-aware replay: If your event bus uses partitioning, replay within a partition is straightforward. Cross-partition replay requires more care since you lose ordering guarantees across partitions. Design partitions by correlation ID so all events for a given transaction land in the same partition.
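A sketch of keying by correlation ID follows. `partition_for` is a hypothetical helper: Kafka and Kinesis clients apply their own hash to the key you supply, so in practice you would pass the correlation ID as the message key and let the client choose the partition. The point here is only that a stable hash sends every event of a transaction to the same place.

```python
import hashlib


def partition_for(correlation_id: str, num_partitions: int) -> int:
    """Map a correlation ID to a partition using a stable hash."""
    # Python's built-in hash() is salted per process, so use a digest instead.
    digest = hashlib.sha256(correlation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


events = [
    {"type": "OrderPlaced", "correlation_id": "corr-abc123"},
    {"type": "InventoryReserved", "correlation_id": "corr-abc123"},
    {"type": "PaymentCharged", "correlation_id": "corr-abc123"},
]
partitions = {partition_for(e["correlation_id"], 12) for e in events}
```

Changing the partition count changes the mapping, so events published before and after a resize may land in different partitions for the same correlation ID.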

Recovery Patterns

Checkpoint-based recovery: Store checkpoint markers in a separate store. After processing each batch, record the checkpoint. On restart, resume from the checkpoint rather than the beginning.

Dead letter queue replay: Events that fail after max retries go to the DLQ. Fix the underlying issue, then replay DLQ events. Inspect the error context before replaying to avoid infinite loops.

Idempotent recovery: The simplest recovery pattern. After processing, write the event ID to a processed events table. On restart, skip any event ID already in the table. Works for moderate event volumes. For high volume, use a TTL or log-structured approach to avoid unbounded table growth.
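The interaction between checkpoints and idempotency shows up in a small simulation. The in-memory `store` dictionary, the `run_consumer` function, and the `crash_after` switch are all invented for this sketch; a real consumer would commit offsets to the broker or a database.

```python
def run_consumer(events, checkpoint_store, handler, batch_size=2, crash_after=None):
    """Process events from the saved checkpoint; return how many were handled."""
    position = checkpoint_store.get("offset", 0)
    handled = 0
    while position < len(events):
        batch = events[position:position + batch_size]
        for event in batch:
            if crash_after is not None and handled == crash_after:
                raise RuntimeError("simulated crash")
            handler(event)
            handled += 1
        position += len(batch)
        checkpoint_store["offset"] = position   # commit only after the whole batch
    return handled


events = ["e0", "e1", "e2", "e3", "e4"]
store = {}
seen = []

try:
    run_consumer(events, store, seen.append, crash_after=3)   # dies mid-batch
except RuntimeError:
    pass
offset_after_crash = store["offset"]

run_consumer(events, store, seen.append)   # restart resumes from the checkpoint
```

After the restart, `e2` has been handled twice: it was processed before the crash, but its batch was never checkpointed. Checkpoints bound how much work is repeated; idempotent handlers make the repetition harmless.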

Common Pitfalls / Anti-Patterns

Invisible workflows: With choreography, you cannot look at one service and see the overall transaction. The behavior emerges from the event chain. Without proper observability, debugging becomes a forensic exercise of correlating events across services.

Event chain complexity: Long chains become hard to follow. A triggers B, B triggers C, C triggers D. When D fails, tracing back through the chain is painful. Keep event chains short and well-documented.

Cyclic dependencies: If A depends on D completion and D is triggered by C which is triggered by B which is triggered by A, you have a cycle. Cycles cause deadlocks and infinite loops. Design your event dependencies as directed acyclic graphs (DAGs).

Assuming exactly-once delivery: Most event systems provide at-least-once or at-most-once delivery. Designing as if exactly-once exists leads to bugs. Always implement idempotent handlers.

Not planning for schema evolution: Events are immutable once published. If you need to change the schema, you must version the event. Subscribers must handle both old and new versions. This is harder than it sounds.

Ignoring consumer lag: As systems scale, consumers may fall behind. If the analytics consumer is 10 minutes behind, decisions based on analytics data are stale. Monitor and scale consumers to keep lag within acceptable bounds.

Consumer Lag and Backpressure

Consumer lag is the gap between when an event is published and when it is fully processed by a consumer. In high-throughput choreography systems, lag compounds quickly and degrades system behavior in subtle ways.

Measuring Lag

Lag is measured per consumer group. Each consumer tracks its current position (offset or sequence number) against the latest available event. The difference is the lag in events. For time-based lag, divide the event count by consumer throughput in events per second; if throughput is measured in bytes per second, multiply the event count by average event size first.
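As a worked example of the arithmetic, assuming throughput is measured in events per second (the function names and numbers are illustrative):

```python
def lag_in_events(latest_offset: int, committed_offset: int) -> int:
    """Events published but not yet processed by this consumer."""
    return max(latest_offset - committed_offset, 0)


def lag_in_seconds(events_behind: int, consume_rate: float) -> float:
    """Time to work through the backlog if no new events arrived."""
    if consume_rate <= 0:
        return float("inf")
    return events_behind / consume_rate


def seconds_to_drain(events_behind: int, consume_rate: float, publish_rate: float) -> float:
    """Time to reach zero lag while producers keep publishing."""
    net_rate = consume_rate - publish_rate
    if net_rate <= 0:
        return float("inf")   # the backlog never shrinks
    return events_behind / net_rate


behind = lag_in_events(latest_offset=120_000, committed_offset=90_000)
time_lag = lag_in_seconds(behind, consume_rate=500)
drain = seconds_to_drain(behind, consume_rate=500, publish_rate=400)
```

With these numbers the consumer is 30,000 events behind, which is 60 seconds of work at 500 events per second. Because producers keep publishing 400 events per second, actually reaching zero lag takes 300 seconds.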

Metric	Platform	What It Measures
`records-lag`	Kafka (consumer JMX)	Number of records behind per partition
`records-lag-max`	Kafka (consumer JMX)	Max lag across the partitions assigned to a consumer
`LAG` column	Kafka (CLI `kafka-consumer-groups --describe`)	Per-partition lag for a consumer group
`MillisBehindLatest`	Kinesis (GetRecords response)	Milliseconds behind the latest record
`IteratorAge`	AWS Lambda (stream event source mapping)	Age of the last record in each batch the function processes

Kafka lag monitoring: Lag is exposed via consumer JMX metrics and the kafka-consumer-groups CLI. Tools like Confluent Control Center, Datadog, Prometheus (with JMX exporter), and CloudWatch can collect and alert on lag per partition. Partition-level lag matters: one slow partition in a 100-partition topic barely moves the group-wide average, yet every event keyed to that partition is delayed.

Kinesis lag monitoring: The MillisBehindLatest value returned by GetRecords tracks how far behind the shard reader is in milliseconds. High values indicate the consumer cannot keep up. For Lambda-triggered Kinesis consumers, monitor the function’s IteratorAge metric (or the stream’s GetRecords.IteratorAgeMilliseconds) in CloudWatch. This is the Kinesis equivalent of Kafka’s consumer lag.

Setting lag thresholds:

Warning: Lag exceeds 10 minutes of processing time at the consumer’s average throughput
Critical: Lag exceeds 1 hour, or lag is growing monotonically over three consecutive measurement windows
SLA tiering: Not all consumers need the same lag tolerance. Analytics consumers can tolerate minutes of lag; payment consumers need sub-second lag. Set thresholds per consumer group, not globally.
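These thresholds can be expressed as a small classifier. This sketch reads "growing over three consecutive measurement windows" as three consecutive increases, which needs four samples; that reading, the function name, and the default values in seconds are assumptions.

```python
def classify_lag(samples_seconds, warn=600.0, critical=3600.0, increases=3):
    """Classify lag from per-window samples in seconds, oldest first."""
    if not samples_seconds:
        return "ok"
    latest = samples_seconds[-1]
    recent = samples_seconds[-(increases + 1):]
    # Growing: enough samples, and each one strictly larger than the one before.
    growing = len(recent) == increases + 1 and all(
        later > earlier for earlier, later in zip(recent, recent[1:])
    )
    if latest > critical or growing:
        return "critical"
    if latest > warn:
        return "warning"
    return "ok"


analytics_status = classify_lag([700, 650, 640, 630])   # high but shrinking
payments_status = classify_lag([10, 20, 30, 40])        # low but climbing
```

Per-consumer-group tiering then amounts to calling the function with different `warn` and `critical` values for each group.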

Prometheus + Grafana setup: Export consumer lag via the JMX exporter or a Kafka Prometheus exporter, then create Grafana panels showing lag per consumer group per partition. Add a trend line: a falling lag means the consumer is catching up, and a flat lag means it is keeping pace but not recovering. A steadily climbing lag means the consumer cannot keep up with the publish rate and will stay behind until something changes.

Impact of Unchecked Lag

When analytics consumers lag, their data becomes stale. Decision-making based on that data produces incorrect conclusions. When notification consumers lag, customers receive confirmations late. When payment consumers lag, billing cycles stretch unpredictably.

The choreographed system continues operating, but reactions to events arrive progressively later. This can cause race conditions where a subsequent event arrives before the reaction to an earlier one.

The snowball effect: Under sustained load, lag compounds. A consumer already 10 minutes behind does not get faster as the backlog grows, so while events keep arriving at or above its throughput, each new event adds to the tail. The consumer never catches up without intervention: scaling, partition rebalancing, or throttling upstream producers.

Concrete scenario: An inventory reservation consumer falls 5 minutes behind. During those 5 minutes, 200 new orders arrive. The consumer processes them in order, but by the time it reaches order #150, the customer has already cancelled through the web UI and placed a new order. The inventory service reserves stock for the cancelled order and then processes the replacement, double-reserving the same item. The customer gets a “low stock” error on the replacement. The root cause was not inventory logic: it was consumer lag.

Lag cascades across services: In choreography, service B reacts to an event from service A. If B is lagging, it produces its reactions late, causing C to fall behind too. Lag propagates down the event chain. A 30-second lag in the order service becomes a 5-minute lag in the shipping service if each intermediate consumer compounds the delay.

Lag Duration	Observable Effect
Sub-second	No visible impact
1-10 seconds	Dashboard metrics show lag but no user-facing effects
10 seconds - 1 minute	Race conditions possible; stale analytics
1-10 minutes	Customer-facing delays; cancels before confirmations arrive
10+ minutes	Cascading lag across downstream services; system operates on stale state

Monitor lag as a leading indicator of system health. By the time you see error rates spike, lag has been building for minutes. Alert on lag trends, not just absolute values.

Backpressure Mechanisms

Backpressure controls the rate of event processing when consumers are overwhelmed.

Consumer-side throttling: The consumer slows its poll rate when its processing queue fills. This is built into most SDKs but can be tuned. Throttling protects the consumer, not the backlog: if your consumer can process 1000 events per second but the system produces 2000, the unread backlog in the broker grows by 1000 events every second until you add capacity or slow the producers.

Publisher-side rate limiting: Cap the rate at which events are published per producer. This protects the system but can cause events to be dropped or delayed if the cap is too low.

Ingress metering: The event bus itself enforces throughput limits. Kafka’s quota mechanism throttles producers or consumers that exceed configured rates. This is a hard limit at the broker level.

Prefetch limits: Some consumer SDKs prefetch events into a local buffer before processing. Setting the prefetch limit too high causes memory pressure; too low causes throughput loss. Tune based on your processing latency and memory budget.
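Publisher-side rate limiting is often implemented as a token bucket. The sketch below is one such implementation under simple assumptions (single-threaded use, an injectable clock for testing); `TokenBucket` and `try_acquire` are names chosen for this example.

```python
import time


class TokenBucket:
    """Rate limiter: allows bursts up to `capacity`, refills at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the time elapsed, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False   # caller decides: delay, buffer, or drop the event


fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]   # third call exceeds the burst
fake_now[0] = 0.5                                   # half a second refills one token
after_wait = bucket.try_acquire()
```

What the publisher does when `try_acquire` returns `False` is the real design decision: blocking applies backpressure to the caller, while dropping trades completeness for latency.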

Scaling to Reduce Lag

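Horizontal scaling of consumers reduces lag but requires care. Adding a consumer to a consumer group triggers a rebalance where partitions are reassigned. With eager rebalancing, the whole group stops processing until the rebalance completes; cooperative (incremental) rebalancing shortens the pause but does not remove it. If rebalances are frequent (from unstable consumer instances or aggressive heartbeat settings), lag can worsen rather than improve. Consumers beyond the partition count sit idle, so the number of partitions caps useful parallelism.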

Stabilize consumer instances before scaling: health checks, proper shutdown handling, and graceful rebalance protocols reduce the overhead of adding capacity.

Production Failure Scenarios

Failure	Impact	Mitigation
Event lost in transit	Downstream steps never execute; system left in incomplete state	Use persistent event stores (Kafka with replication); implement event replay capability
Event delivered twice	Duplicate processing; potential double operations (double charges)	Implement idempotency checks using event IDs; use deduplication tables
Consumer crashes mid-processing	Event acknowledged but not fully processed	Use consumer groups with rebalance; implement at-least-once delivery; idempotent handlers
Downstream service unavailable	Event emitted but no reaction occurs; partial execution	Implement retry with exponential backoff; dead letter queues for failed processing
Event schema breaking change	Subscribers fail to process events they depend on	Use schema registry with backward compatibility; version events; implement contract testing
Cyclic event dependencies	Deadlock where A waits for D and D is triggered by C which waits for B which waits for A	Design event flows as DAGs; validate for cycles before deployment
Event replay storms	Replaying historical events causes cascade of downstream processing	Implement replay window limits; separate replay pipelines from production
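Two of the mitigations above, retry with exponential backoff and a dead letter queue, combine naturally. In this sketch the DLQ is a plain list and `process_with_retry` is a hypothetical helper; the `sleep` parameter exists so the backoff can be observed without waiting.

```python
import time


def process_with_retry(event, handler, dead_letters, max_attempts=3,
                       base_delay=0.5, sleep=time.sleep):
    """Run the handler with exponential backoff; park the event in the DLQ on exhaustion."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception as exc:   # broad on purpose in this sketch
            if attempt == max_attempts:
                # Keep the error context so the event can be inspected before replay.
                dead_letters.append(
                    {"event": event, "error": repr(exc), "attempts": attempt}
                )
                return False
            sleep(base_delay * 2 ** (attempt - 1))   # 0.5s, 1s, 2s, ...
    return False


def always_failing(event):
    raise ConnectionError("downstream unavailable")


dlq = []
delays = []
ok = process_with_retry({"event_id": "evt-1"}, always_failing, dlq, sleep=delays.append)
```

Production implementations usually add jitter to the delay so that many consumers retrying at once do not hit the recovering service in lockstep.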

Key Takeaways

graph LR
    A[Service A] -->|Event| B[Event Bus]
    B -->|Event| C[Service B]
    B -->|Event| D[Service C]
    B -->|Event| E[Service D]

Key Points

Choreography uses events (broadcast) rather than commands (directed)
Each service knows only its own trigger and reaction
No central orchestrator means no central point of failure
Compensation logic is distributed across services
Trade-off: simpler services but harder to see overall workflow state

Production Checklist

# Service Choreography Production Readiness

- [ ] Idempotent event handlers implemented
- [ ] Event schema registry with backward compatibility enforced
- [ ] Correlation IDs included in all events
- [ ] Dead letter queue configured for failed processing
- [ ] Consumer lag monitoring and alerting configured
- [ ] Event replay capability tested
- [ ] Distributed tracing configured across event consumers
- [ ] Schema evolution strategy documented and enforced
- [ ] Audit logging for event publishing
- [ ] Rate limiting on event publishing

Observability Checklist

Metrics

Event publish rate (events per second by type)
Event consumption rate per subscriber
Event processing duration per consumer
Dead letter queue depth
Event consumer lag (how far behind real-time)
Duplicate event detection count
Schema compatibility violations detected

Logs

Log all published events with event ID, type, and correlation ID
Log all consumed events with consumer ID
Include correlation ID to trace events across services
Log consumer failures with original event context
Log dead letter queue insertions with reason

Alerts

Alert when dead letter queue depth exceeds threshold
Alert when consumer lag exceeds SLA window
Alert when duplicate event rate spikes (indicates upstream issue)
Alert when event schema violations are detected
Alert on consumer failures that trigger retry storms

Security Checklist

Authenticate event publishers to prevent unauthorized event injection
Authorize subscribers to prevent subscription to sensitive event streams
Encrypt event payloads containing sensitive data at rest and in transit
Validate event schemas before publishing to prevent malformed events
Audit access to event stores and schema registries
Rate limit event publishing to prevent DoS via event flooding
Redact sensitive data from event logs (correlation IDs should be safe; payload data may not be)

Interview Questions

1. What is the fundamental difference between choreography and orchestration in microservices?

Choreography: services communicate by emitting and consuming events with no central director. Orchestration: a central orchestrator directs the workflow by calling services directly and holding all workflow state. In choreography each service only knows its own trigger and reaction; in orchestration the orchestrator holds the entire picture. The core trade-off is loose coupling versus central visibility.

2. Why is idempotency critical in event-driven choreography?

Events can be delivered more than once due to network partitions, consumer crashes, or retry logic. Without idempotency, duplicate events cause duplicate operations: double charges, double reservations. Techniques include idempotency keys, check-before-act patterns, and deduplication tables. The pattern is simple: check if already processed before acting, then mark it done.

3. How does a correlation ID help in debugging choreographed systems?

Every event in a transaction shares a correlation ID. Collecting all events with the same correlation ID lets you reconstruct the complete transaction flow across services. Without it, tracing a failed transaction through multiple services becomes a forensic exercise. Include the correlation ID in every event from the start.

4. What are the main observability challenges in choreography?

The core challenge is that overall transaction state is scattered across services. Long event chains are hard to trace when something fails. Solutions include correlation IDs on all events, distributed tracing with OpenTelemetry, and event sourcing for replay. Event sourcing aligns well with choreography since you already have the events stored.

5. How does the saga pattern differ between choreography and orchestration?

In orchestrated saga, the orchestrator tracks steps and triggers compensations centrally. In choreographed saga, each service knows its own step and compensation; when a step fails it emits a failure event and downstream services run their own compensations. Choreographed saga is more complex because you must track which steps completed across services without central state.

6. What is the risk of cyclic dependencies in choreography and how do you prevent it?

A triggers B, B triggers C, C triggers D, and D triggers A. A cycle like this either produces an event loop that never terminates or, when each service waits on the next before emitting, a deadlock where services wait indefinitely for events that never arrive. Prevention: design event flows as directed acyclic graphs (DAGs) and validate for cycles before deployment. Regular architecture review of event dependency graphs helps catch emerging cycles early.
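Validating an event flow for cycles can be done with a depth-first search over the dependency graph. The sketch below assumes the graph is available as a dictionary mapping each service to the services that react to its events. How you extract that graph (from subscriptions, a schema registry, or documentation) is left open.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic.

    `graph` maps each service to the services that react to its events.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path, so we closed a loop
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


linear_flow = {"Order": ["Inventory"], "Inventory": ["Payment"], "Payment": ["Shipping"]}
cyclic_flow = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
linear_result = find_cycle(linear_flow)
cyclic_result = find_cycle(cyclic_flow)
```

A check like this could run in CI against the declared subscriptions so that a new subscription closing a loop fails the build.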

7. When should you avoid choreography in favor of orchestration?

Avoid choreography when: workflows have complex branching or conditional paths; you need clear visibility into entire workflow state for operations; compensation logic is complex and must be centralized for audit or safety; debugging the workflow is critical and centralized logs are more practical than distributed event tracing; or when cyclic dependencies are unavoidable in your event flow.

8. How does schema evolution work in choreography?

Events are immutable once published, so schema changes require versioning. Subscribers must handle both old and new versions during transitions. Use a schema registry with backward compatibility enforcement (Confluent Schema Registry is the common choice for Kafka). Contract testing catches breaking changes before they reach production. Because subscribers are decoupled and the producer does not know who consumes its events, breaking changes in choreography are harder to manage than in orchestration.
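One common way for a subscriber to handle both versions during a transition is to normalize older events to the newest shape before the handler runs (sometimes called upcasting). The sketch below is illustrative only: the `schema_version` field, the `currency` field, and its `"USD"` default are assumptions made up for the example.

```python
def upcast_order_placed(event: dict) -> dict:
    """Normalize any supported OrderPlaced version to the v2 shape."""
    version = event.get("schema_version", 1)  # assume unversioned events are v1
    if version == 1:
        # Hypothetical change: v2 added a currency field; fill in a default for v1.
        return {**event, "schema_version": 2, "currency": "USD"}
    if version == 2:
        return event
    raise ValueError(f"unsupported OrderPlaced schema_version: {version}")


v1_event = {"order_id": "o-1", "total": 25}
v2_event = {"order_id": "o-2", "total": 40, "currency": "EUR", "schema_version": 2}
normalized_v1 = upcast_order_placed(v1_event)
normalized_v2 = upcast_order_placed(v2_event)
```

Rejecting unknown versions explicitly sends them to the retry and dead letter path instead of silently misreading the payload.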

9. What is the difference between events and commands?

Commands are directed ("do this thing") targeting one specific handler. Events are broadcast ("this thing happened") and any interested service can react. Naming reflects this: commands use verb-noun (CreateOrder, ReserveInventory), events use noun-verb past tense (OrderPlaced, InventoryReserved). Commands couple services to their handlers; events enable loose coupling.

10. How do you handle partial failures in choreographed sagas?

Partial failures are inherent because there is no atomic transaction boundary across services. Each service must implement compensation for its own actions. When a step fails, emit a failure event; downstream services react by running their compensations. Track which saga steps have completed so you know what to compensate. Use dead letter queues for events that fail after max retries. Idempotent compensations are essential since compensation events can also arrive more than once.
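A minimal sketch of a service that tracks its own saga step and compensates idempotently might look like this. The class, the `PaymentFailed` handler name, and the returned event names are hypothetical, and state is held in memory where a real service would persist it.

```python
class InventoryService:
    """Tracks its own saga step per correlation ID so it knows what to undo."""

    def __init__(self, stock):
        self.stock = stock
        self.sagas = {}  # correlation_id -> {"qty": int, "state": str}

    def on_order_placed(self, event):
        cid = event["correlation_id"]
        if cid in self.sagas:
            return None  # duplicate delivery: this step already ran
        qty = event["quantity"]
        if qty > self.stock:
            self.sagas[cid] = {"qty": 0, "state": "failed"}
            return "InventoryReservationFailed"
        self.stock -= qty
        self.sagas[cid] = {"qty": qty, "state": "reserved"}
        return "InventoryReserved"

    def on_payment_failed(self, event):
        saga = self.sagas.get(event["correlation_id"])
        if saga is None or saga["state"] != "reserved":
            return None  # nothing to undo, or already compensated
        self.stock += saga["qty"]
        saga["state"] = "released"
        return "InventoryReleased"


inventory = InventoryService(stock=10)
reserved = inventory.on_order_placed({"correlation_id": "tx-1", "quantity": 3})
stock_after_reserve = inventory.stock
released = inventory.on_payment_failed({"correlation_id": "tx-1"})
repeated = inventory.on_payment_failed({"correlation_id": "tx-1"})  # duplicate failure event
```

Keeping the saga record after release, rather than deleting it, is what lets the service ignore both a repeated failure event and a late redelivery of the original order event.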

11. How does service choreography fit within the broader event-driven architecture (EDA) paradigm?

Choreography is one pattern within EDA, not the whole thing. EDA covers any architecture built around event production and consumption; choreography specifically means decentralized reaction to events with no central coordinator. Orchestration is the other pattern within EDA where a central coordinator directs the flow. The event-driven architecture blog post covers the fundamentals; choreography is the decentralized execution model for those principles.

12. What are the best practices for naming events in a choreographed system?

Name events in noun-verb past tense to describe what happened from the emitting service's perspective: OrderPlaced, PaymentCharged, InventoryReserved. Commands use verb-noun instead: CreateOrder, ReserveInventory. Put the aggregate type in the event name (Order, Payment) and carry the aggregate ID in the event payload or metadata so subscribers can filter correctly. Leave command-language verbs (Create, Send) out of event names entirely.

13. What strategies exist for detecting and handling duplicate events beyond basic idempotency checks?

Beyond check-before-act, you have a few options. Deduplication tables keyed on event ID with a TTL handle storage cleanly. Bloom filters work for memory-efficient duplicate detection at scale. Some systems use deterministic idempotency keys where the business logic itself defines uniqueness, like order ID plus action type. At-least-once brokers pair well with natural idempotency in your business logic, like detecting duplicate charges via payment ID. Pick based on what your broker guarantees.
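A deduplication table with a TTL, keyed on a deterministic business key, could be sketched like this. The class name, the key format, and the injectable `clock` parameter are assumptions for illustration. In practice the table would usually live in a shared store with native expiry rather than in process memory.

```python
import time


def idempotency_key(order_id: str, action: str) -> str:
    """Deterministic key: the business logic defines uniqueness."""
    return f"{order_id}:{action}"


class TtlDeduplicator:
    """Remembers keys for ttl_seconds, then forgets them to bound storage."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.expiry_by_key = {}

    def is_duplicate(self, key: str) -> bool:
        now = self.clock()
        # Evict expired entries before checking.
        self.expiry_by_key = {k: exp for k, exp in self.expiry_by_key.items() if exp > now}
        if key in self.expiry_by_key:
            return True
        self.expiry_by_key[key] = now + self.ttl
        return False


fake_now = [0.0]  # controllable clock for the example
dedup = TtlDeduplicator(ttl_seconds=60, clock=lambda: fake_now[0])
key = idempotency_key("o-1", "charge")
seen_first = dedup.is_duplicate(key)
seen_again = dedup.is_duplicate(key)
fake_now[0] = 61.0
seen_after_ttl = dedup.is_duplicate(key)
```

The TTL has to exceed the longest window in which the broker can redeliver, otherwise a late duplicate slips through after its entry expires.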

14. What role do message brokers like Kafka or Kinesis play in enabling choreography?

Brokers are the event bus that makes choreography viable. They handle routing, durability, scalability through partitioning, and delivery guarantees. Kafka's partitioned log gives you replay capability from offsets, which is essential for recovery. Kinesis does similar things on AWS with different scaling tradeoffs. Either way, the broker takes over the communication role a central orchestrator would play by providing the shared infrastructure for distributing events to all subscribers, while the workflow logic stays in the services.

15. How does choreography enable and enforce microservices independence and autonomous deployment?

Services only couple to event schemas, not to other services directly. The inventory team can ship a new event type without asking the payment team to change anything. The event contract is the only coupling point. The downside is that nobody has a clear view of the overall workflow anymore, which is why correlation IDs and distributed tracing become essential rather than optional.

16. What are the main challenges and approaches for testing choreographed distributed systems?

You cannot test choreography the same way you test a monolith. Contract testing checks that event schemas work for both producers and subscribers without running the full system. Consumer-driven contracts let subscribers specify what events they expect. For integration testing, use Testcontainers to run Kafka or LocalStack locally. You can inject events at the broker level to trace end-to-end flows. Chaos testing with injected service failures shows whether your compensation logic actually works.

17. How should retry logic and dead letter queues be implemented for event consumers in choreography?

Exponential backoff with jitter prevents thundering herd problems on retries. After 3-5 retries (configurable), move the event to a dead letter queue with the original event, error details, and retry count. DLQ events need to be inspectable and replayable after you fix whatever broke. Monitor DLQ depth as an alert metric. For transient failures like network timeouts, retries handle recovery. For permanent failures like schema mismatches or code bugs, the DLQ captures them for later replay.
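The retry-then-dead-letter flow can be sketched as below. The function names and the shape of the DLQ record are hypothetical, the DLQ is a plain list standing in for a real queue or topic, and `sleep` is injectable so the example runs without waiting. The jitter variant shown picks a uniformly random delay up to the exponential bound.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Random delay in [0, min(cap, base * 2**attempt)) seconds."""
    return rng() * min(cap, base * (2 ** attempt))


def consume_with_retry(event, handler, dead_letters, max_retries=3, sleep=time.sleep):
    """Run handler; after max_retries failed retries, append the event to dead_letters."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            handler(event)
            return True
        except Exception as exc:  # sketch only: real code would classify errors
            last_error = exc
            if attempt < max_retries:
                sleep(backoff_delay(attempt))
    dead_letters.append(
        {"event": event, "error": repr(last_error), "retries": max_retries}
    )
    return False


dead_letters = []
attempts = []


def always_fails(event):
    attempts.append(event["event_id"])
    raise RuntimeError("schema mismatch")


ok = consume_with_retry(
    {"event_id": "evt-9"}, always_fails, dead_letters, sleep=lambda seconds: None
)
```

Storing the original event alongside the error and retry count is what makes the DLQ entry replayable after the underlying bug is fixed.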

18. How does the compensation mechanism in choreographed sagas compare to rollback in orchestrated transactions?

Orchestrated rollback is straightforward: the orchestrator knows what succeeded and calls compensations in reverse order. Choreographed compensation has no central view, so each service must independently track and compensate for its own actions. That is harder to get right. The upside is resilience: there is no orchestrator whose outage can stall the rollback, so choreographed compensation can proceed as long as the broker is available. Both require idempotent compensations because compensation triggers can arrive more than once.

19. How does choreography influence database architecture choices per service?

Choreography pushes you toward database-per-service, naturally. Services often lean toward event sourcing since you are already producing events; storing them as the source of truth fits the pattern. CQRS pairs well too, with commands updating state and events propagating read model updates. You end up with denormalized data within services because you cannot join across service boundaries. Compensation logic sometimes needs to store pending transaction state locally so the service knows what to undo when a failure event arrives.

20. What specific monitoring and alerting metrics are most critical for production choreographed systems?

Watch consumer lag closely: it tells you how far behind real time each subscriber is. Dead letter queue depth and insertion rate tell you when consumers are failing. Event processing duration per consumer catches latency spikes. Spikes in duplicate event rate often indicate upstream producer issues. Schema compatibility violations mean a breaking change slipped through. Alert on consumer lag exceeding your SLA window, DLQ depth above a threshold, and error rates that suggest systemic problems.
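As a rough sketch of the lag alert, the functions below compute per-partition lag from offsets and flag partitions above a threshold. The offset dictionaries are assumed inputs. How you obtain them depends on your broker and client library, and the threshold value is arbitrary.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: latest offset minus the consumer's committed offset."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }


def lag_alerts(lag_by_partition, threshold):
    """Return the partitions whose lag exceeds the threshold, sorted for stable output."""
    return sorted(p for p, lag in lag_by_partition.items() if lag > threshold)


end_offsets = {0: 1200, 1: 900, 2: 500}
committed_offsets = {0: 1190, 1: 300}  # partition 2 has no commit yet
lag = consumer_lag(end_offsets, committed_offsets)
alerting = lag_alerts(lag, threshold=100)
```

Offset lag is a count of messages, so an SLA expressed in time needs a conversion through the observed processing rate or message timestamps.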

Conclusion

Service choreography works when services are genuinely independent and workflows are simple. Events are broadcast, and services react. No central coordinator means no central point of failure.

The price is invisible workflows and harder debugging. When something goes wrong, you trace through events across services rather than reading an orchestrator log.

Use choreography for peripheral side effects and independent reactions. Use orchestration for workflows that need clear transaction boundaries and complex compensation. The best systems use both.
