Service Choreography: Event-Driven Distributed Workflows

Learn service choreography patterns for distributed workflows through events, sagas with choreography, and when to prefer events over orchestration.

Author: GeekWorkBench · Updated: April 23, 2026 · Reading time: 27 min

In choreography, there is no conductor. Services emit events when they do something, and other services react to those events by doing their own thing. The overall behavior emerges from the chain of reactions, not from a central plan.

This is a fundamentally different way of thinking about distributed systems. Instead of asking “who coordinates this workflow?” you ask “what events should cause what reactions?”

This post covers how choreography works, why it appeals to distributed systems architects, and where it breaks down.

Introduction

Service choreography is a pattern where services communicate by emitting and consuming events. Each service reacts to incoming events and may emit new events as a result. There is no central orchestrator directing the workflow.

graph LR
    Order[Order Service] -->|OrderPlaced| Inv[Inventory Service]
    Inv -->|InventoryReserved| Pay[Payment Service]
    Pay -->|PaymentCharged| Ship[Shipping Service]
    Ship -->|ShipmentCreated| Notify[Notification Service]

The flow emerges from the event chain. Each service knows only its own trigger and reaction. The service that reserves inventory does not know or care what happens after. It just emits “InventoryReserved” and moves on.
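The chain in the diagram can be sketched with a tiny in-memory bus. This is a minimal illustration, not a production design: `EventBus` and `wire_services` are hypothetical names, delivery is synchronous, and a real system would put a broker between the services.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical in-memory bus; a real system would use a broker."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self.log: List[str] = []  # event types in publish order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.log.append(event_type)
        for handler in list(self._subscribers[event_type]):
            handler(payload)


def wire_services(bus: EventBus) -> None:
    # Each service knows only its own trigger and the event it emits.
    bus.subscribe("OrderPlaced", lambda e: bus.publish("InventoryReserved", e))
    bus.subscribe("InventoryReserved", lambda e: bus.publish("PaymentCharged", e))
    bus.subscribe("PaymentCharged", lambda e: bus.publish("ShipmentCreated", e))
    bus.subscribe("ShipmentCreated", lambda e: None)  # notification stand-in


bus = EventBus()
wire_services(bus)
bus.publish("OrderPlaced", {"order_id": "ord-12345"})
```

No function here describes the whole workflow. The order of events in `bus.log` is a consequence of the individual subscriptions.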

Core Concepts

Choreography relies on events, not commands. The distinction matters.

A command is directed: “do this thing.” It expects one specific handler. CreateOrder goes to the Order Service, and that is the only place it goes.

An event is broadcast: “this thing happened.” Any service that cares can react. OrderPlaced goes to the event bus, and Inventory, Analytics, Notification, and others can all subscribe and act.

Naming conventions make the distinction obvious. Commands are verb-noun: CreateUser, CancelOrder, ReserveInventory. Events are noun-verb past tense: UserCreated, OrderCancelled, InventoryReserved.

Benefits of Choreography

Choreography has real advantages for the right problem.

True decoupling: Services do not know about each other. They only know about events. Add a new subscriber without touching the publisher. Remove a subscriber the same way.

No single point of failure: There is no orchestrator that, if it crashes, halts all workflows. If one service goes down, only its reactions stop. Other workflows continue.

Independent deployability: Each service deploys on its own schedule. The inventory service does not need to coordinate with the payment service for a release.

Scalability: Event buses are designed to scale. Publishers and subscribers scale independently.

Sagas with Choreography

The saga pattern can be implemented with choreography instead of orchestration. In a choreographed saga, each service knows its own step and its own compensation. When a step fails, the service emits a failure event. Services that already completed earlier steps react to the failure event by running their compensations.

sequenceDiagram
    participant OrderSvc as Order Service
    participant Inv as Inventory Service
    participant Pay as Payment Service
    OrderSvc->>Inv: OrderPlaced
    Inv->>Pay: InventoryReserved
    Pay->>Inv: PaymentFailed
    Inv->>OrderSvc: InventoryReleased

This is more complex than an orchestrated saga. Each service must track which of its own steps have completed so it knows what to compensate. The compensation logic is spread across services rather than centralized.
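A rough sketch of the compensation flow, assuming a synchronous in-memory `publish` function and two hypothetical classes, `InventoryService` and `PaymentService`. The event names `PaymentFailed` and `InventoryReleased` are illustrative choices, not a fixed convention.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)


def publish(event_type: str, event: dict) -> None:
    for handler in list(subscribers[event_type]):
        handler(event)


class InventoryService:
    def __init__(self, stock: int) -> None:
        self.stock = stock
        self.reserved: Set[str] = set()  # orders with an active reservation
        subscribers["OrderPlaced"].append(self.on_order_placed)
        subscribers["PaymentFailed"].append(self.on_payment_failed)

    def on_order_placed(self, event: dict) -> None:
        self.stock -= 1
        self.reserved.add(event["order_id"])
        publish("InventoryReserved", event)

    def on_payment_failed(self, event: dict) -> None:
        # Compensation: undo only work this service actually did.
        if event["order_id"] in self.reserved:
            self.reserved.remove(event["order_id"])
            self.stock += 1
            publish("InventoryReleased", event)


class PaymentService:
    def __init__(self, declined_customers: Set[str]) -> None:
        self.declined = declined_customers
        subscribers["InventoryReserved"].append(self.on_inventory_reserved)

    def on_inventory_reserved(self, event: dict) -> None:
        if event["customer_id"] in self.declined:
            publish("PaymentFailed", event)
        else:
            publish("PaymentCharged", event)


inventory = InventoryService(stock=5)
payments = PaymentService(declined_customers={"cust-declined"})
publish("OrderPlaced", {"order_id": "ord-1", "customer_id": "cust-declined"})
```

The `reserved` set is the local record of completed steps. Because the compensation checks it first, a repeated `PaymentFailed` event does not release stock twice.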

For a detailed look at saga patterns, see Saga Pattern.

Implementing Choreography

Event Schema Design

Events need consistent schema. An event schema defines the fields each event type carries. Schema evolution matters because events are immutable once published.

{
  "type": "OrderPlaced",
  "version": "1.0",
  "order_id": "ord-12345",
  "customer_id": "cust-67890",
  "items": [...],
  "timestamp": "2026-03-22T10:30:00Z",
  "correlation_id": "corr-abc123"
}

Include a correlation ID in every event. It lets you trace a complete transaction across services by following all events with the same correlation ID.
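One way to make correlation IDs hard to forget is to derive each new event from the event that caused it. The sketch below assumes a hypothetical `Event` envelope and `new_event` helper. The `event_id` field is an addition to the JSON example above, used to tell individual events apart.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Event:
    type: str
    payload: Dict[str, Any]
    correlation_id: str
    version: str = "1.0"
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def new_event(event_type: str, payload: Dict[str, Any],
              cause: Optional[Event] = None) -> Event:
    """Create an event, inheriting the correlation ID from the causing event."""
    correlation_id = cause.correlation_id if cause else f"corr-{uuid.uuid4()}"
    return Event(type=event_type, payload=payload, correlation_id=correlation_id)


placed = new_event("OrderPlaced", {"order_id": "ord-12345"})
reserved = new_event("InventoryReserved", {"order_id": "ord-12345"}, cause=placed)
```

Every event in the chain gets its own `event_id`, while `correlation_id` stays constant from the first event onward.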

Event Contract Testing

Services depend on event schemas. When one service changes its event format, subscribers may break. Use schema registries and contract testing to catch breaking changes early.

Kafka deployments commonly pair the broker with a schema registry, such as Confluent Schema Registry, to enforce compatibility. Producers and consumers agree on a schema, and the registry rejects incompatible changes.

Idempotency

Events may be delivered more than once. Network partitions, consumer crashes, and retry logic all cause duplicates. Services must handle duplicate events idempotently.

def handle_order_placed(event):
    # Check if already processed
    if order_processed(event.order_id):
        return

    # Process the order
    process_order(event.order_id)

    # Mark as processed
    mark_order_processed(event.order_id)

Use idempotency keys or check-before-act patterns to ensure re-processing is safe.
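A self-contained variant of the check-before-act pattern, keyed on an event ID rather than the order ID. This is a sketch under stated assumptions: `InventoryHandler` is hypothetical, the processed-ID store is an in-memory set, and a real service would persist that set in the same transaction as the side effect so a crash between the two cannot cause a double apply.

```python
from typing import Dict, Set


class InventoryHandler:
    def __init__(self, stock: Dict[str, int]) -> None:
        self.stock = stock
        self._processed: Set[str] = set()  # stand-in for a durable dedup table

    def handle_order_placed(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self._processed:
            return False
        for item in event["items"]:
            self.stock[item["sku"]] -= item["qty"]
        self._processed.add(event["event_id"])
        return True


handler = InventoryHandler({"sku-1": 10})
event = {"event_id": "evt-1", "items": [{"sku": "sku-1", "qty": 2}]}
first = handler.handle_order_placed(event)
second = handler.handle_order_placed(event)  # redelivery of the same event
```

Keying on the event ID instead of the order ID lets the same order legitimately produce several distinct events without any of them being dropped as duplicates.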

When Choreography Works

Choreography shines for simple, linear workflows where each step naturally follows from the previous. It works well in systems with many independent consumers of the same events (analytics, monitoring, notifications). Teams that value service independence and autonomous deployment tend to prefer it. It fits high-scale event streaming architectures like Kafka or Kinesis.

When Choreography Breaks Down

Choreography has limits.

Invisible workflows: You cannot look at one service and see the overall transaction. The behavior is scattered across many services. Debugging is harder.

No atomic transactions: There is no transaction boundary. Partial failures leave the system in inconsistent states. Compensation must handle this.

Event chain complexity: Long chains of events become hard to follow. A triggers B, B triggers C, C triggers D. When something goes wrong in D, tracing back through the chain is painful.

Cyclic dependencies: What if service A needs to know when service D finishes, but D is triggered by C which is triggered by B which is triggered by A? Cycles in event dependencies are difficult to manage.

Observability Challenges

Choreographed systems produce many events. Understanding the overall state of a transaction requires correlating events across services.

Correlation IDs: Every event in a transaction shares a correlation ID. Collect all events with the same correlation ID to reconstruct the transaction.

Distributed tracing: OpenTelemetry can propagate trace context through event headers, giving you a single view of the event chain.

Event sourcing: Store all events in sequence. Replay to reconstruct state. This aligns well with choreography but adds storage overhead.
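Reconstructing a transaction from correlation IDs amounts to a group-and-sort over collected events. The helper below is a hypothetical sketch that assumes events are dictionaries with `correlation_id` and `timestamp` fields, and that timestamps are ISO-8601 strings in the same timezone and format so they sort lexicographically.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_correlation(events: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group events by correlation ID and order each group by timestamp."""
    timelines: Dict[str, List[dict]] = defaultdict(list)
    for event in events:
        timelines[event["correlation_id"]].append(event)
    for timeline in timelines.values():
        timeline.sort(key=lambda e: e["timestamp"])
    return dict(timelines)


collected = [
    {"type": "PaymentCharged", "correlation_id": "corr-a", "timestamp": "2026-03-22T10:30:02Z"},
    {"type": "OrderPlaced", "correlation_id": "corr-b", "timestamp": "2026-03-22T10:31:00Z"},
    {"type": "OrderPlaced", "correlation_id": "corr-a", "timestamp": "2026-03-22T10:30:00Z"},
    {"type": "InventoryReserved", "correlation_id": "corr-a", "timestamp": "2026-03-22T10:30:01Z"},
]
timelines = group_by_correlation(collected)
```

Clock skew between services can reorder events that are close together in time, so treat the resulting timeline as approximate unless events also carry a causal sequence.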

When to Use Choreography

Use choreography when:

  • Services are genuinely independent and decoupled
  • Workflows are simple and linear (A triggers B, B triggers C)
  • You want to avoid a single point of failure
  • Adding new steps should not require modifying a central orchestrator
  • Many independent consumers need to react to the same events (analytics, monitoring, notifications)
  • You are building an event streaming architecture (Kafka, Kinesis)
  • Teams value service autonomy and independent deployability

Avoid choreography when:

  • Workflows have complex branching or conditional paths
  • You need clear visibility into the entire workflow state
  • Compensation logic is complex and must be centralized
  • Debugging the workflow is critical for operations
  • You have cyclic dependencies in your event flow

Combining Choreography and Orchestration

The choice is not binary. Many systems use both.

Core business workflows with complex compensation logic often need orchestration. The orchestrator has the full picture and can manage failure cleanly.

Peripheral side effects (notifications, audit logs, analytics) are good candidates for choreography. These are typically fire-and-forget. If a notification fails, you do not want to roll back the whole order.

The pattern: orchestration for the critical path, choreography for everything else.

For event-driven architecture fundamentals, see event-driven architecture. For messaging infrastructure, see message queue types.

Choreography Failure Isolation

Choreography handles failures differently than orchestration. When an orchestrated service fails, the orchestrator decides whether to continue or roll back. That decision lives in one place. In choreography, failure handling is distributed across services.

If the notification service goes down, the order still processes. The customer just does not get an email. You can replay the notification later without reprocessing the order itself.

The catch: your event reactions need to be idempotent. If the notification fires twice, the customer should not get two emails. If inventory is released twice, the stock count should not go negative.

Isolated failure also means you need a strategy for missed events. If the shipping service misses PaymentCharged, it needs to detect that gap and recover. This is harder than with a centralized orchestrator that tracks state explicitly.

Choreography Scaling Characteristics

Publishers and subscribers in choreography scale independently. If the order service gets more load than the notification service, you scale order service instances without touching notification service. The event bus absorbs the imbalance.

This independence extends to teams. A team can ship a new event consumer without coordinating with the publisher team. This is valuable at scale, but it can add coordination overhead in smaller organizations.

The bottleneck in choreography is the event bus itself. When the event bus saturates, everything slows down. Kafka and Kinesis address this by splitting streams into partitions (shards in Kinesis) that consumers read in parallel, but you need to design your event keys and schemas with partitioning in mind from the start.

Schema Registry Deep Dive

A schema registry centralizes event schema management and prevents breaking changes from reaching consumers. In choreography, where publishers and subscribers are decoupled, the registry is your primary mechanism for maintaining contract compatibility.

How Schema Registries Work

Publishers register their event schemas before publishing. The registry validates the schema against compatibility rules. Consumers then fetch schemas from the registry to deserialize events.

Producer -> Schema Registry (validate) -> Event Bus -> Consumer (fetch schema)

Kafka deployments typically use Confluent Schema Registry or AWS Glue Schema Registry. The registry enforces backward, forward, or full compatibility depending on your configuration.

Compatibility Modes

Backward compatibility (most common): Consumers using the new schema can read events written with the previous schema. This lets you add optional fields (fields with defaults) or remove fields, and it means consumers upgrade before producers.

Forward compatibility: Consumers still on the old schema can read events written with the new schema. This is useful when producers upgrade first and you cannot update all consumers simultaneously.

Full compatibility: Both backward and forward. More restrictive but safest for multi-version transitions.
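The three modes reduce to one question asked in two directions: can a reader on schema X decode data written with schema Y? The sketch below is a deliberately simplified model, not how any particular registry implements its checks. Schemas are plain dictionaries, a field is optional only if it declares a `default`, and type promotion rules are ignored. All function names are hypothetical.

```python
from typing import Any, Dict

# field name -> {"type": ..., optional "default": ...}
Schema = Dict[str, Dict[str, Any]]


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if a consumer using `reader` can decode data written with `writer`."""
    for name, spec in reader.items():
        if name in writer:
            if writer[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            return False  # reader requires a field the writer never sent
    return True


def is_backward_compatible(new: Schema, old: Schema) -> bool:
    return can_read(reader=new, writer=old)


def is_forward_compatible(new: Schema, old: Schema) -> bool:
    return can_read(reader=old, writer=new)


def is_fully_compatible(new: Schema, old: Schema) -> bool:
    return is_backward_compatible(new, old) and is_forward_compatible(new, old)


v1: Schema = {"order_id": {"type": "string"}, "customer_id": {"type": "string"}}
v2_optional: Schema = {**v1, "coupon": {"type": "string", "default": None}}
v2_required: Schema = {**v1, "coupon": {"type": "string"}}
v2_removed: Schema = {"order_id": {"type": "string"}}
```

In this model, adding an optional field is fully compatible, adding a required field is only forward compatible, and removing a required field is only backward compatible.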

Schema Evolution in Practice

When you need to change an event schema:

  1. Register the new version with the schema registry
  2. The registry validates against compatibility rules
  3. If compatible, the new schema is allowed; consumers continue using their cached schema
  4. If incompatible, the registration is rejected, so events using the new schema never reach the event bus

Without a registry, a breaking change reaches all consumers before you know something is wrong. With a registry, you catch it at publish time.

Schema Registry Anti-Patterns

Registering schemas without enforcement: If producers can bypass the registry, you lose the safety net. Gate all event publishing through the registry.

Evolving without coordination: Even with backward compatibility, coordinate with consumers before adding required fields or removing fields they depend on. Compatibility protects against crashes, not logical errors.

Not caching schemas: Consumers fetching schemas from the registry on every event adds latency. Cache schemas locally with a TTL and refresh periodically.
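A local schema cache with a TTL can be sketched in a few lines. `SchemaCache` is a hypothetical class, and `fetch` stands in for whatever call your registry client provides to load a schema by ID. The clock is injectable so expiry can be tested without sleeping.

```python
import time
from typing import Callable, Dict, Tuple


class SchemaCache:
    def __init__(self, fetch: Callable[[int], dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[int, Tuple[float, dict]] = {}

    def get(self, schema_id: int) -> dict:
        now = self._clock()
        cached = self._entries.get(schema_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh, no registry round trip
        schema = self._fetch(schema_id)
        self._entries[schema_id] = (now, schema)
        return schema
```

If your registry treats a registered schema ID as immutable, the TTL mainly bounds memory use and picks up deletions. It matters more when you cache lookups such as "latest version for a subject".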

Trade-off Analysis

| Aspect | Choreography | Orchestration |
| --- | --- | --- |
| Coupling | Loose; services know only events | Tight; services know the orchestrator |
| Single point of failure | None; event bus is resilient | The orchestrator |
| Workflow visibility | Hard to see overall transaction | Easy to see complete workflow state |
| Debugging | Requires correlated event tracing | Centralized log access |
| Compensation logic | Distributed across services | Centralized in orchestrator |
| Independent deployability | High; services deploy on their own schedule | Low; orchestrator changes may affect services |
| Scalability | Event bus and consumers scale independently | Orchestrator may become bottleneck |
| Cyclic dependencies | Risk of cycles in event graphs | Avoided by explicit control flow |
| Schema evolution | Harder; subscribers may break on breaking changes | Easier; orchestrator controls all calls |

Event Replay and Recovery Strategies

Event replay is central to recovery in choreographed systems. When a service crashes mid-processing or a downstream consumer misses events, replay lets you rebuild state without replaying the entire event history.

Replay Mechanisms

Offset-based replay (Kafka): Consumers track their current offset and can reset to an earlier offset to reprocess events. This works when events are immutable and stored durably. Reset the offset to re-process a window of events after a bug fix or crash.

Sequence-number-based replay (Kinesis): Each record has a sequence number. Consumers track their last processed sequence number and can request records from that point forward. Similar to offset replay but native to AWS Kinesis.

Timestamp-based replay: Store the last successful processing timestamp. On restart, query events with timestamps after that point. Less precise than offset/sequence but useful when you cannot track offsets directly.
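Offset-based replay is easiest to see against a toy log. `PartitionLog` and `Consumer` below are hypothetical in-memory stand-ins for a single partition and its reader. They mimic the idea of seeking to an earlier offset and do not reproduce any client library's API.

```python
from typing import Callable, List, Tuple


class PartitionLog:
    """Append-only list standing in for one durable partition."""

    def __init__(self) -> None:
        self._records: List[dict] = []

    def append(self, record: dict) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def end_offset(self) -> int:
        return len(self._records)

    def read_from(self, offset: int) -> List[Tuple[int, dict]]:
        return list(enumerate(self._records))[offset:]


class Consumer:
    def __init__(self, log: PartitionLog, handler: Callable[[dict], None]) -> None:
        self.log = log
        self.handler = handler
        self.offset = 0  # next offset to read

    def poll(self) -> int:
        processed = 0
        for offset, record in self.log.read_from(self.offset):
            self.handler(record)
            self.offset = offset + 1
            processed += 1
        return processed

    def seek(self, offset: int) -> None:
        if not 0 <= offset <= self.log.end_offset():
            raise ValueError("offset out of range")
        self.offset = offset
```

Seeking back replays records through the same handler as normal processing, which is why the handler has to tolerate seeing an event it already handled.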

Designing for Replay

Replays can be expensive if consumers are not designed for them. During normal operation you process one event at a time. During replay you can hit the same consumer with thousands of events per second as the backlog catches up.

Design consumers to handle replay gracefully:

def handle_event(event):
    # Always idempotent: safe to call multiple times
    state = get_current_state(event.order_id)

    if state is None:
        # First time seeing this order - normal processing
        process_new_order(event)
    else:
        # Replay or late arrival - check if already processed
        if already_processed(event.event_id):
            return
        handle_duplicate_or_out_of_order(event, state)

Partition-aware replay: If your event bus uses partitioning, replay within a partition is straightforward. Cross-partition replay requires more care since you lose ordering guarantees across partitions. Design partitions by correlation ID so all events for a given transaction land in the same partition.
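Keying partitions by correlation ID comes down to a stable hash of the ID modulo the partition count. The function below is an illustrative sketch using SHA-256 from the standard library. Kafka's default partitioner uses a murmur2 hash of the record key, so in practice you would set the correlation ID as the key and let the client choose the partition.

```python
import hashlib


def partition_for(correlation_id: str, num_partitions: int) -> int:
    """Map a correlation ID to a partition index, stable across processes."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(correlation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Python's built-in `hash()` is avoided here because string hashing is randomized per process, which would send the same transaction to different partitions from different producers. Changing the partition count also changes the mapping, so ordering guarantees only hold while the count is fixed.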

Recovery Patterns

Checkpoint-based recovery: Store checkpoint markers in a separate store. After processing each batch, record the checkpoint. On restart, resume from the checkpoint rather than the beginning.

Dead letter queue replay: Events that fail after max retries go to the DLQ. Fix the underlying issue, then replay DLQ events. Inspect the error context before replaying to avoid infinite loops.

Idempotent recovery: The simplest recovery pattern. After processing, write the event ID to a processed events table. On restart, skip any event ID already in the table. Works for moderate event volumes. For high volume, use a TTL or log-structured approach to avoid unbounded table growth.
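Checkpoint-based recovery can be sketched as a loop that records its position only after a batch completes. `run_consumer` is a hypothetical function, and the `checkpoints` dictionary stands in for the separate durable store mentioned above.

```python
from typing import Callable, Dict, List


def run_consumer(events: List[dict], checkpoints: Dict[str, int], consumer_id: str,
                 handler: Callable[[dict], None], batch_size: int = 2) -> None:
    """Process events in batches, checkpointing after each completed batch."""
    position = checkpoints.get(consumer_id, 0)  # resume point, 0 on first run
    while position < len(events):
        batch = events[position:position + batch_size]
        for event in batch:
            handler(event)  # an exception here leaves the checkpoint untouched
        position += len(batch)
        checkpoints[consumer_id] = position
```

A crash in the middle of a batch means the whole batch is retried on restart, including events that already succeeded. Checkpointing therefore still relies on the idempotent handlers described earlier.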

Common Pitfalls / Anti-Patterns

Invisible workflows: With choreography, you cannot look at one service and see the overall transaction. The behavior emerges from the event chain. Without proper observability, debugging becomes a forensic exercise of correlating events across services.

Event chain complexity: Long chains become hard to follow. A triggers B, B triggers C, C triggers D. When D fails, tracing back through the chain is painful. Keep event chains short and well-documented.

Cyclic dependencies: If A depends on D completion and D is triggered by C which is triggered by B which is triggered by A, you have a cycle. Cycles cause deadlocks and infinite loops. Design your event dependencies as directed acyclic graphs (DAGs).

Assuming exactly-once delivery: Most event systems provide at-least-once or at-most-once delivery. Designing as if exactly-once exists leads to bugs. Always implement idempotent handlers.

Not planning for schema evolution: Events are immutable once published. If you need to change the schema, you must version the event. Subscribers must handle both old and new versions. This is harder than it sounds.

Ignoring consumer lag: As systems scale, consumers may fall behind. If the analytics consumer is 10 minutes behind, decisions based on analytics data are stale. Monitor and scale consumers to keep lag within acceptable bounds.

Consumer Lag and Backpressure

Consumer lag is the gap between when an event is published and when it is fully processed by a consumer. In high-throughput choreography systems, lag compounds quickly and degrades system behavior in subtle ways.

Measuring Lag

Lag is measured per consumer group. Each consumer tracks its current position (offset or sequence number) against the latest available event. The difference is the lag in events. To estimate time-based lag, divide the lag in events by the consumer's throughput in events per second. If throughput is measured in bytes per second, multiply the event count by the average event size first.
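The lag arithmetic is simple enough to write down directly. Both helpers are hypothetical. They take plain dictionaries of partition to offset, which you would populate from your broker's admin or metrics API.

```python
from typing import Dict


def lag_per_partition(end_offsets: Dict[int, int],
                      committed: Dict[int, int]) -> Dict[int, int]:
    """Events not yet processed, per partition (never negative)."""
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def estimated_time_lag_seconds(total_lag_events: int,
                               events_per_second: float) -> float:
    """Rough catch-up time, assuming no new events arrive while draining."""
    if events_per_second <= 0:
        return float("inf")
    return total_lag_events / events_per_second


lags = lag_per_partition({0: 1500, 1: 900}, {0: 1000, 1: 900})
total = sum(lags.values())
seconds = estimated_time_lag_seconds(total, events_per_second=250.0)
```

With a backlog of 500 events and a consumer handling 250 events per second, the estimate is 2 seconds. The real catch-up time is longer whenever producers keep publishing during the drain.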

Kafka lag monitoring: Kafka consumers expose lag through JMX metrics such as records-lag-max, and the kafka-consumer-groups tool reports lag per partition. Tools like Confluent Control Center, Datadog, and CloudWatch can alert on lag thresholds.

Kinesis lag monitoring: The GetRecords response includes MillisBehindLatest, which measures how far behind the tip of the stream the shard reader is, in milliseconds. CloudWatch reports the related GetRecords.IteratorAgeMilliseconds metric. High values indicate the consumer cannot keep up with incoming data.

Impact of Unchecked Lag

When analytics consumers lag, their data becomes stale. Decision-making based on that data produces incorrect conclusions. When notification consumers lag, customers receive confirmations late. When payment consumers lag, billing cycles stretch unpredictably.

The choreographed system continues operating, but reactions to events arrive progressively later. This can cause race conditions where a subsequent event arrives before the reaction to an earlier one.

Backpressure Mechanisms

Backpressure controls the rate of event processing when consumers are overwhelmed.

Consumer-side throttling: The consumer slows its poll rate when its processing queue fills. This is built into most SDKs but can be tuned. Throttling protects the consumer, not the backlog: if your consumer can process 1000 events per second but the system produces 2000, lag on the broker keeps growing until you add capacity or reduce the publish rate.

Publisher-side rate limiting: Cap the rate at which events are published per producer. This protects the system but can cause events to be dropped or delayed if the cap is too low.

Ingress metering: The event bus itself enforces throughput limits. Kafka’s quota mechanism throttles producers or consumers that exceed configured rates. This is a hard limit at the broker level.

Prefetch limits: Some consumer SDKs prefetch events into a local buffer before processing. Setting the prefetch limit too high causes memory pressure; too low causes throughput loss. Tune based on your processing latency and memory budget.
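A bounded local buffer is the simplest form of consumer-side backpressure: when the buffer is full, the consumer stops accepting more work instead of growing without limit. `BoundedInbox` is a hypothetical sketch built on the standard library's `queue.Queue`. Its capacity plays the role of a prefetch limit.

```python
import queue
from typing import Callable


class BoundedInbox:
    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[dict]" = queue.Queue(maxsize=capacity)
        self.rejected = 0  # how often the fetcher was told to back off

    def offer(self, event: dict) -> bool:
        """Try to buffer an event; False signals the caller to slow down."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def drain(self, handler: Callable[[dict], None], max_events: int) -> int:
        """Process up to max_events buffered events; return how many ran."""
        handled = 0
        while handled < max_events:
            try:
                event = self._queue.get_nowait()
            except queue.Empty:
                break
            handler(event)
            handled += 1
        return handled
```

A rejected `offer` is a signal to pause polling. The event stays on the broker, so nothing is lost, but consumer lag grows until processing catches up.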

Scaling to Reduce Lag

Horizontal scaling of consumers reduces lag but requires care. Adding a consumer to a consumer group triggers a rebalance where partitions are reassigned. During an eager (stop-the-world) rebalance, no consumer in the group processes events. If rebalances are frequent (from unstable consumer instances or aggressive heartbeat settings), lag can worsen rather than improve.

Stabilize consumer instances before scaling: health checks, proper shutdown handling, and graceful rebalance protocols reduce the overhead of adding capacity.

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Event lost in transit | Downstream steps never execute; system left in incomplete state | Use persistent event stores (Kafka with replication); implement event replay capability |
| Event delivered twice | Duplicate processing; potential double operations (double charges) | Implement idempotency checks using event IDs; use deduplication tables |
| Consumer crashes mid-processing | Event acknowledged but not fully processed | Use consumer groups with rebalance; implement at-least-once delivery; idempotent handlers |
| Downstream service unavailable | Event emitted but no reaction occurs; partial execution | Implement retry with exponential backoff; dead letter queues for failed processing |
| Event schema breaking change | Subscribers fail to process events they depend on | Use schema registry with backward compatibility; version events; implement contract testing |
| Cyclic event dependencies | Deadlock where A waits for D and D is triggered by C which waits for B which waits for A | Design event flows as DAGs; validate for cycles before deployment |
| Event replay storms | Replaying historical events causes cascade of downstream processing | Implement replay window limits; separate replay pipelines from production |
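The retry and dead letter mitigations in the table can be combined in one small wrapper. This is a sketch under assumptions: `backoff_delay` and `process_with_retry` are hypothetical names, the jitter strategy is "full jitter" (a uniform delay between zero and the exponential bound), and the dead letter queue is a plain list.

```python
import random
import time
from typing import Callable, List, Optional


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def process_with_retry(event: dict, handler: Callable[[dict], None],
                       dead_letters: List[dict], max_retries: int = 3,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run handler with retries; park the event in dead_letters on final failure."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        try:
            handler(event)
            return True
        except Exception as exc:  # a real consumer would not retry permanent errors
            last_error = exc
            if attempt < max_retries:
                sleep(backoff_delay(attempt))
    dead_letters.append({
        "event": event,
        "error": repr(last_error),
        "attempts": max_retries + 1,
    })
    return False
```

Storing the original event alongside the error and attempt count keeps dead-lettered events inspectable and replayable once the underlying problem is fixed.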

Key Takeaways

graph LR
    A[Service A] -->|Event| B[Event Bus]
    B -->|Event| C[Service B]
    B -->|Event| D[Service C]
    B -->|Event| E[Service D]

Key Points

  • Choreography uses events (broadcast) rather than commands (directed)
  • Each service knows only its own trigger and reaction
  • No central orchestrator means no central point of failure
  • Compensation logic is distributed across services
  • Trade-off: simpler services but harder to see overall workflow state

Production Checklist

# Service Choreography Production Readiness

- [ ] Idempotent event handlers implemented
- [ ] Event schema registry with backward compatibility enforced
- [ ] Correlation IDs included in all events
- [ ] Dead letter queue configured for failed processing
- [ ] Consumer lag monitoring and alerting configured
- [ ] Event replay capability tested
- [ ] Distributed tracing configured across event consumers
- [ ] Schema evolution strategy documented and enforced
- [ ] Audit logging for event publishing
- [ ] Rate limiting on event publishing

Observability Checklist

Metrics

  • Event publish rate (events per second by type)
  • Event consumption rate per subscriber
  • Event processing duration per consumer
  • Dead letter queue depth
  • Event consumer lag (how far behind real-time)
  • Duplicate event detection count
  • Schema compatibility violations detected

Logs

  • Log all published events with event ID, type, and correlation ID
  • Log all consumed events with consumer ID
  • Include correlation ID to trace events across services
  • Log consumer failures with original event context
  • Log dead letter queue insertions with reason

Alerts

  • Alert when dead letter queue depth exceeds threshold
  • Alert when consumer lag exceeds SLA window
  • Alert when duplicate event rate spikes (indicates upstream issue)
  • Alert when event schema violations are detected
  • Alert on consumer failures that trigger retry storms

Security Checklist

  • Authenticate event publishers to prevent unauthorized event injection
  • Authorize subscribers to prevent subscription to sensitive event streams
  • Encrypt event payloads containing sensitive data at rest and in transit
  • Validate event schemas before publishing to prevent malformed events
  • Audit access to event stores and schema registries
  • Rate limit event publishing to prevent DoS via event flooding
  • Redact sensitive data from event logs (correlation IDs should be safe; payload data may not be)

Interview Questions

1. What is the fundamental difference between choreography and orchestration in microservices?

Choreography: services communicate by emitting and consuming events with no central director. Orchestration: a central orchestrator directs the workflow by calling services directly and holding all workflow state. In choreography each service only knows its own trigger and reaction; in orchestration the orchestrator holds the entire picture. The core trade-off is loose coupling versus central visibility.

2. Why is idempotency critical in event-driven choreography?

Events can be delivered more than once due to network partitions, consumer crashes, or retry logic. Without idempotency, duplicate events cause duplicate operations: double charges, double reservations. Techniques include idempotency keys, check-before-act patterns, and deduplication tables. The pattern is simple: check if already processed before acting, then mark it done.

3. How does a correlation ID help in debugging choreographed systems?

Every event in a transaction shares a correlation ID. Collecting all events with the same correlation ID lets you reconstruct the complete transaction flow across services. Without it, tracing a failed transaction through multiple services becomes a forensic exercise. Include the correlation ID in every event from the start.

4. What are the main observability challenges in choreography?

The core challenge is that overall transaction state is scattered across services. Long event chains are hard to trace when something fails. Solutions include correlation IDs on all events, distributed tracing with OpenTelemetry, and event sourcing for replay. Event sourcing aligns well with choreography since you already have the events stored.

5. How does the saga pattern differ between choreography and orchestration?

In orchestrated saga, the orchestrator tracks steps and triggers compensations centrally. In choreographed saga, each service knows its own step and compensation; when a step fails it emits a failure event and downstream services run their own compensations. Choreographed saga is more complex because you must track which steps completed across services without central state.

6. What is the risk of cyclic dependencies in choreography and how do you prevent it?

A triggers B, B triggers C, C triggers D, and D triggers A creates a loop. Depending on the handlers, the result is either an endless chain of events or a deadlock where services wait indefinitely for events that never arrive. Prevention: design event flows as directed acyclic graphs (DAGs) and validate for cycles before deployment. Regular architecture review of event dependency graphs helps catch emerging cycles early.

7. When should you avoid choreography in favor of orchestration?

Avoid choreography when: workflows have complex branching or conditional paths; you need clear visibility into entire workflow state for operations; compensation logic is complex and must be centralized for audit or safety; debugging the workflow is critical and centralized logs are more practical than distributed event tracing; or when cyclic dependencies are unavoidable in your event flow.

8. How does schema evolution work in choreography?

Events are immutable once published, so schema changes require versioning. Subscribers must handle both old and new versions during transitions. Use schema registries with backward compatibility enforcement (Confluent Schema Registry is a common choice with Kafka). Contract testing catches breaking changes before they reach production. Because subscribers are decoupled, breaking changes in choreography are harder to manage than in orchestration.

9. What is the difference between events and commands?

Commands are directed ("do this thing") targeting one specific handler. Events are broadcast ("this thing happened") and any interested service can react. Naming reflects this: commands use verb-noun (CreateOrder, ReserveInventory), events use noun-verb past tense (OrderPlaced, InventoryReserved). Commands couple services to their handlers; events enable loose coupling.

10. How do you handle partial failures in choreographed sagas?

Partial failures are inherent because there is no atomic transaction boundary across services. Each service must implement compensation for its own actions. When a step fails, emit a failure event; downstream services react by running their compensations. Track which saga steps have completed so you know what to compensate. Use dead letter queues for events that fail after max retries. Idempotent compensations are essential since compensation events can also arrive more than once.

11. How does service choreography fit within the broader event-driven architecture (EDA) paradigm?

Choreography is one pattern within EDA, not the whole thing. EDA covers any architecture built around event production and consumption; choreography specifically means decentralized reaction to events with no central coordinator. Orchestration is the other pattern within EDA where a central coordinator directs the flow. The event-driven architecture blog post covers the fundamentals; choreography is the decentralized execution model for those principles.

12. What are the best practices for naming events in a choreographed system?

Name events in noun-verb past tense to describe what happened from the emitting service's perspective: OrderPlaced, PaymentCharged, InventoryReserved. Commands use verb-noun instead: CreateOrder, ReserveInventory. Include the aggregate in the event name (OrderPlaced rather than Placed) when it helps subscribers filter correctly. Leave command-language verbs (Create, Send) out of event names entirely.

13. What strategies exist for detecting and handling duplicate events beyond basic idempotency checks?

Beyond check-before-act, you have a few options. Deduplication tables keyed on event ID with a TTL handle storage cleanly. Bloom filters work for memory-efficient duplicate detection at scale. Some systems use deterministic idempotency keys where the business logic itself defines uniqueness, like order ID plus action type. At-least-once brokers pair well with natural idempotency in your business logic, like detecting duplicate charges via payment ID. Pick based on what your broker guarantees.

14. What role do message brokers like Kafka or Kinesis play in enabling choreography?

Brokers are the event bus that makes choreography viable. They handle routing, durability, scalability through partitioning, and delivery guarantees. Kafka's partitioned log gives you replay capability from offsets, which is essential for recovery. Kinesis does similar things on AWS with different scaling tradeoffs. Either way, the broker replaces what a central orchestrator would do by providing the shared infrastructure for distributing events to all subscribers.

15. How does choreography enable and enforce microservices independence and autonomous deployment?

Services only couple to event schemas, not to other services directly. The inventory team can ship a new event type without asking the payment team to change anything. The event contract is the only coupling point. The downside is that nobody has a clear view of the overall workflow anymore, which is why correlation IDs and distributed tracing become essential rather than optional.

16. What are the main challenges and approaches for testing choreographed distributed systems?

You cannot test choreography the same way you test a monolith. Contract testing checks that event schemas work for both producers and subscribers without running the full system. Consumer-driven contracts let subscribers specify what events they expect. For integration testing, use Testcontainers to run Kafka or LocalStack. You can inject events at the broker level to trace end-to-end flows. Chaos testing with service failures shows whether your compensation logic actually works.

17. How should retry logic and dead letter queues be implemented for event consumers in choreography?

Exponential backoff with jitter prevents thundering herd problems on retries. After 3-5 retries (configurable), move the event to a dead letter queue with the original event, error details, and retry count. DLQ events need to be inspectable and replayable after you fix whatever broke. Monitor DLQ depth as an alert metric. For transient failures like network timeouts, retries handle recovery. For permanent failures like schema mismatches or code bugs, the DLQ captures them for later replay.

18. How does the compensation mechanism in choreographed sagas compare to rollback in orchestrated transactions?

Orchestrated rollback is straightforward: the orchestrator knows what succeeded and calls compensations in reverse order. Choreographed compensation has no central view, so each service must independently track and compensate for its own actions. That is harder to get right. The upside is resilience: compensation does not depend on an orchestrator being reachable. Both require idempotent compensations because events can arrive more than once.

19. How does choreography influence database architecture choices per service?

Choreography pushes you toward database-per-service, naturally. Services often lean toward event sourcing since you are already producing events; storing them as the source of truth fits the pattern. CQRS pairs well too, with commands updating state and events propagating read model updates. You end up with denormalized data within services because you cannot join across service boundaries. Compensation logic sometimes needs to store pending transaction state locally so the service knows what to undo when a failure event arrives.

20. What specific monitoring and alerting metrics are most critical for production choreographed systems?

Watch consumer lag closely, how far behind real-time each subscriber is. Dead letter queue depth and insertion rate tell you when consumers are failing. Event processing duration per consumer catches spikes. Duplicate event rate spikes often indicate upstream producer issues. Schema compatibility violations mean a breaking change slipped through. Alert on consumer lag exceeding your SLA window, DLQ depth above a threshold, and error rates that suggest systemic problems.

Conclusion

Service choreography works when services are genuinely independent and workflows are simple. Events broadcast and services react. No central coordinator means no central point of failure.

The price is invisible workflows and harder debugging. When something goes wrong, you trace through events across services rather than reading an orchestrator log.

Use choreography for peripheral side effects and independent reactions. Use orchestration for workflows that need clear transaction boundaries and complex compensation. The best systems use both.
