Exactly-Once Delivery: The Elusive Guarantee
Explore exactly-once semantics in distributed messaging - why it's hard, how Kafka and SQS approach it, and practical patterns for deduplication.
Introduction
Distributed systems fail in ways that are hard to predict. A producer sends a message. The broker receives it and acknowledges. The acknowledgment gets lost in transit. The producer sends again. Now the broker has two copies. Or the consumer processes the message, crashes before committing its offset, and on restart receives the same message again.
sequenceDiagram
Producer->Broker: Send(msg)
Broker->Producer: ACK
Note over Producer,Broker: ACK lost in transit
Producer->Broker: Send(msg)
Broker->Producer: ACK
Note over Broker,Consumer: Two copies in broker
Any retry between producer, broker, or consumer can create duplicates. At-least-once is straightforward. Exactly-once requires coordinating all three.
Delivery Semantics
Message systems typically offer one of three guarantees.
At-least-once means messages may be redelivered but never lost. The consumer acknowledges after processing. If processing succeeds but the ack fails, the message comes back. This is the most common setting, but it requires idempotent consumers.
At-most-once means messages may be lost but never duplicated. The consumer acknowledges before processing. If the consumer crashes after ack but before processing, the message is gone. This is rare and usually a poor trade-off.
Exactly-once means each message is processed precisely once. This requires either a transaction spanning the message and its side effects, or idempotent processing with deduplication.
stateDiagram-v2
[*] --> AtMostOnce: Ack before processing
AtMostOnce --> AtLeastOnce: Ack after processing
AtLeastOnce --> ExactlyOnce: Add deduplication/idempotency
note right of AtMostOnce
Risk: Message loss
Use case: Log aggregation
end note
note right of AtLeastOnce
Risk: Duplicates
Use case: Most applications
end note
note right of ExactlyOnce
Risk: Higher latency
Use case: Financial transactions
end note
The delivery guarantee spectrum:
| Guarantee | Duplicates | Loss | Complexity | Typical Use |
|---|---|---|---|---|
| At-most-once | No | Yes | Low | Critical logging |
| At-least-once | Yes | No | Medium | Most applications |
| Exactly-once | No | No | High | Financial transactions |
Kafka’s Exactly-Once Semantics
Kafka offers exactly-once semantics within the Kafka ecosystem. It can guarantee that messages produced to a topic are consumed exactly once by consumers within Kafka.
Kafka achieves this through transactions introduced in version 0.11. When you enable transactions, Kafka writes messages to partitions as part of a transaction. Consumer offsets are committed as part of the same transaction, tying message consumption to offset commit.
flowchart LR
Producer-->|produce| Kafka[Kafka Cluster]
Kafka-->|consume| Consumer[Consumer]
Consumer-->|commit offset| Kafka
Kafka-->|write tx| Partition[Partition]
Consumer-->|write| DB[(Database)]
Note over Producer,DB: Atomic if both in same transaction
For exactly-once processing outside Kafka, say writing to a database, you need to handle it yourself. The consumer must write the offset and the database update in a single transaction, or rely on idempotent semantics to deduplicate.
For more on Kafka’s partitioning and consumer groups, see Apache Kafka.
AWS SQS FIFO
SQS offers two queue types. Standard queues provide at-least-once delivery. FIFO queues provide exactly-once processing and preserve order.
FIFO queues use content-based deduplication to discard duplicate messages within a 5-minute window. Send the same message twice within 5 minutes and SQS drops the second copy.
sqs.send_message(
QueueUrl=queue_url,
MessageBody=json.dumps({'order_id': '12345', 'action': 'ship'}),
MessageDeduplicationId='order-12345-ship',
MessageGroupId='orders'
)
SQS applies a hash to the message body to derive the deduplication ID when you enable content-based deduplication. Identical bodies produce identical deduplication IDs.
SQS FIFO can guarantee exactly-once within the queue itself, but your consumer still needs to be idempotent. If your consumer times out and does not delete the message within the visibility timeout window, SQS will deliver it again.
For more on SQS features, see AWS SQS and SNS.
RabbitMQ and Idempotent Consumers
RabbitMQ does not support exactly-once semantics out of the box. It gives you at-least-once with manual acknowledgments. Your consumer must make itself idempotent.
Idempotency means processing the same message multiple times has the same effect as processing it once. Here is how to actually do it.
Deduplication table: Store processed message IDs in a database. Before processing, check if the ID exists.
INSERT INTO processed_messages (msg_id, processed_at)
VALUES ('msg-123', NOW())
ON CONFLICT (msg_id) DO NOTHING;
-- If insert succeeds, process the message
-- If insert fails (duplicate), skip processing
Natural idempotency: Some operations are naturally idempotent. Setting a field to the same value twice leaves the database in the same state. UPDATE users SET status = 'active' WHERE id = 1 can run twice and nothing changes. Spotting these operations and using them cuts down on overhead.
Message-level transactions: Combine the message acknowledgment with your database transaction. If the database update fails, the message is not acknowledged and comes back.
For more on RabbitMQ’s acknowledgment model, see RabbitMQ.
Practical Patterns
Idempotency Keys
Every message carries a unique ID or you derive one from content. The consumer stores processed IDs. When a message arrives, check if its ID was already processed and skip if so. This approach works with any message broker.
def process_message(msg):
msg_id = msg.headers.get('x-idempotency-key') or hash(msg.body)
if redis.exists(f"processed:{msg_id}"):
return
do_work(msg)
redis.setex(f"processed:{msg_id}", ttl=86400, value='1')
Set the TTL longer than your maximum retry window to prevent unbounded growth.
Outbox Pattern
The outbox pattern addresses the dual-write problem. Instead of writing to the database and publishing a message in two separate steps, you write both to the database in a single transaction.
flowchart TD
Service-->|1 write| DB[(Outbox Table)]
DB-->|2 read| Relay[Outbox Relay]
Relay-->|3 publish| Broker[Message Broker]
Note over Service,Broker: Atomic write ensures consistency
The outbox table lives in the same database as your business data. You write the business record and the outbox event in the same transaction. A separate process polls the outbox and publishes to the broker.
The key guarantee: if the business record exists, the event will eventually be published. The event is never published without the corresponding record.
For more on these patterns, see Event-Driven Architecture and Asynchronous Communication.
Transactional Outbox with CDC
In high-throughput systems, polling the outbox adds latency. Change Data Capture tools like Debezium tail the database transaction log and publish when records change. This is faster but adds operational complexity.
The pattern stays the same: write to the outbox atomically with business data, let CDC handle propagation.
When Exactly-Once Matters
You do not always need exactly-once. For many use cases, at-least-once with idempotent consumers is sufficient and simpler. A log aggregator can handle duplicates. A notification system can usually tolerate a duplicate email if it is not critical.
Exactly-once matters for financial transactions (duplicate charges are serious), inventory operations (double decrements overstate deductions), unique constraints (double user creation may violate uniqueness), and external API calls (retrying a non-idempotent API creates duplicate side effects).
Measure the cost of duplicates in your system before implementing exactly-once. The complexity is real and the operational burden is higher.
Saga Pattern vs Outbox Pattern
The saga pattern and outbox pattern solve related but distinct problems. Both address exactly-once in distributed systems, but they handle different failure scenarios.
The saga pattern coordinates multiple services in a distributed transaction. When step 3 fails, saga executes compensating transactions to undo steps 1 and 2. This handles partial failure without distributed locks.
The outbox pattern handles the dual-write problem within a single service. Instead of writing to the database and publishing a message in two separate steps, you write both in a single database transaction. If the business record exists, the event will eventually be published.
| Aspect | Saga Pattern | Outbox Pattern |
|---|---|---|
| Problem solved | Multi-service distributed transactions | Single-service dual-write consistency |
| Failure handling | Compensating transactions | Eventual publication |
| Idempotency needed | Yes (compensating transactions must be idempotent) | Yes (outbox relay may retry) |
| Complexity | High (requires careful compensation design) | Medium (requires polling outbox) |
| Consistency model | Eventual | Stronger eventual (within a single service) |
| Use when | Multiple services must update in coordinated way | Service needs to publish events atomically with database changes |
Use the saga pattern when you have multiple independent services each making state changes that must be coordinated. Use the outbox pattern when a single service needs to publish events reliably without a distributed transaction coordinator.
For more on saga, see Saga Pattern. For outbox, see Outbox Pattern.
Idempotent Consumer with Redis Deduplication
Redis works well as a fast deduplication store for idempotent consumers. Store processed message IDs with a TTL longer than your retry window:
import redis
import json
import hashlib
class RedisIdempotentConsumer:
def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 86400):
self.redis = redis_client
self.ttl = ttl_seconds
def _get_dedup_key(self, message) -> str:
"""Derive a unique key from the message."""
msg_id = message.get('message_id')
if msg_id:
return f"dedup:{msg_id}"
# Fallback: hash the message content
content = json.dumps(message, sort_keys=True)
content_hash = hashlib.sha256(content.encode()).hexdigest()[:16]
return f"dedup:hash:{content_hash}"
def is_duplicate(self, message) -> bool:
"""Check if this message was already processed."""
key = self._get_dedup_key(message)
return self.redis.exists(key) == 1
def mark_processed(self, message) -> None:
"""Mark message as processed with TTL."""
key = self._get_dedup_key(message)
self.redis.setex(key, self.ttl, '1')
def process(self, message: dict) -> bool:
"""
Process message idempotently.
Returns True if processed, False if duplicate.
"""
if self.is_duplicate(message):
return False
# Process the actual work
do_work(message)
# Mark as processed
self.mark_processed(message)
return True
# Usage in a consumer loop
consumer = RedisIdempotentConsumer(redis_client, ttl_seconds=86400)
for message in receive_messages():
if consumer.process(message):
print(f"Processed message {message['message_id']}")
else:
print(f"Skipped duplicate {message['message_id']}")
Set the TTL longer than your maximum retry window. If a message might be retried for up to 1 hour due to broker retry policies, set TTL to 2 hours or more. If you process millions of messages per day, watch Redis memory usage.
Kafka Exactly-Once with External Systems
Kafka transactions guarantee exactly-once within the Kafka ecosystem. When writing to external systems like databases, you need additional handling.
Kafka Transactions with JDBC
The Kafka JDBC sink connector handles exactly-once for database writes using the transactional outbox pattern:
# Kafka Connect JDBC sink configuration for exactly-once
transforms=insertIdempotenceKey
transforms.insertIdempotenceKey.type=org.apache.kafka.connect.transformes.ValueToKey
transforms.insertIdempotenceKey.fields=id
# Enable exactly-once for the sink
s Exactly-once support guarantees that records delivered to external systems are processed exactly once.
For the Kafka Streams word count application, each word count update is idempotent. If a count is recalculated due to reprocessing, the database write produces the same result. Design your stream processing to produce deterministic outputs, and exactly-once guarantees become unnecessary complexity.
## Google Pub/Sub Exactly-Once
Pub/Sub offers exactly-once delivery with a 7-day retry window. It deduplicates messages using publisher-supplied message IDs:
```python
from google.cloud import pubsub_v1
publisher = pubsub_v1.PublisherClient()
# Publish with explicit message ID (recommended)
future = publisher.publish(
'projects/my-project/topics/my-topic',
data=b'event data',
idempotency_key='my-unique-idempotency-key'
)
The idempotency_key parameter enables exactly-once delivery on the publish side. If you publish with the same key multiple times, Pub/Sub delivers only one message.
On the subscriber side, Pub/Sub acknowledges automatically after processing. If your subscriber crashes and restarts, messages are redelivered within the ack deadline window:
subscriber = pubsub_v1.SubscriberClient()
subscription_path = 'projects/my-project/subscriptions/my-sub'
def callback(message):
try:
process(message.data)
message.ack()
except Exception:
# Nack and let Pub/Sub redeliver
message.nack()
subscriber.subscribe(subscription_path, callback=callback)
Pub/Sub exactly-once works when:
- Publisher provides an idempotency key (or uses automatic deduplication based on message content)
- Subscriber acknowledges within the ack deadline
- Subscriber is idempotent (safe to process the same message twice)
Pub/Sub does not provide exactly-once if the subscriber crashes after processing but before acknowledging. In that case, the message is redelivered and must be idempotent.
Comparing Broker Support
| Broker | Exactly-Once Support | Notes |
|---|---|---|
| Kafka | Yes (transactions) | Within Kafka ecosystem |
| SQS FIFO | Yes (deduplication) | Within SQS, 5-min window |
| RabbitMQ | No (build it yourself) | At-least-once with idempotency |
| Google Pub/Sub | Yes | Delivery guarantee with retry window |
Trade-off Comparison Tables
Idempotency Implementation Strategies
| Strategy | Complexity | Latency Overhead | Storage Required | Best For |
|---|---|---|---|---|
| Deduplication table | Medium | 1-2 DB round trips | Grows with message volume | High-value transactions |
| Redis TTL keys | Low | 1-2 Redis ops | Keys with TTL | High-throughput, short retry windows |
| Natural idempotency | Low | None | None | Operations that set deterministic values |
| Message-level transactions | High | Transaction overhead | None | When message ack must align with DB commit |
Message Broker Exactly-Once Comparison
| Broker | Exactly-Once Scope | Deduplication Window | Consumer Idempotency Required? |
|---|---|---|---|
| Kafka | Within Kafka ecosystem only | N/A (transactional) | Yes for external systems |
| SQS FIFO | Within SQS, queue-level | 5 minutes | Yes, after visibility timeout |
| RabbitMQ | Not supported natively | N/A | Yes, mandatory |
| Google Pub/Sub | Delivery guarantee with retry | 7 days | Yes, after crash before ack |
Outbox Pattern Deployment Options
| Approach | Latency | Complexity | Consistency Guarantee | Operational Overhead |
|---|---|---|---|---|
| Polling outbox | Higher (poll interval) | Low | Eventual (within poll window) | Low |
| CDC (Debezium) | Low (log-based) | High | Near real-time | High (additional infrastructure) |
| Polling + CDC | Flexible | Medium | Tunable | Medium |
Saga vs Outbox Decision Matrix
| Condition | Use Saga Pattern | Use Outbox Pattern |
|---|---|---|
| Multi-service distributed transaction | Yes | No |
| Single-service dual-write problem | No | Yes |
| Cross-service consistency required | Yes | No |
| Event publication alongside business data | No | Yes |
| Compensating transactions available | Yes | No |
| Low operational complexity desired | No | Yes |
Production Failure Scenarios
Scenario 1: The Payment Processor Crash
A payment service processes an order and publishes a “payment successful” event. The consumer processes the payment, updates the order status, but crashes before deleting the message from SQS. With SQS FIFO’s visibility timeout, the message becomes visible again after the timeout. If the consumer is not idempotent, the payment is processed twice, resulting in a duplicate charge.
Mitigation: Implement idempotent consumers using deduplication tables or Redis-based dedup keys with TTL longer than the visibility timeout.
Scenario 2: The Kafka Transaction Timeout
A producer writes to Kafka with transactions enabled. The transaction commits successfully on the Kafka side, but the database write that was part of the same transaction times out. The Kafka offset is committed, but the downstream database update never happens, leaving the system in an inconsistent state where the offset indicates the message was processed but the side effect did not occur.
Mitigation: Ensure database transaction timeout is longer than Kafka’s transaction timeout, or use the transactional outbox pattern to decouple the writes.
Scenario 3: The RabbitMQ Network Partition
During a network partition, RabbitMQ’s publisher confirms may timeout. The producer assumes the message was not received and retries. After the partition heals, the original message is delivered along with the retry, creating duplicates. The consumer processes both but fails to detect the duplicate because the deduplication logic has a bug in its hash generation.
Mitigation: Use explicit message IDs for deduplication, validate hash generation logic, and ensure deduplication table entries survive restarts.
Scenario 4: The Pub/Sub Ack Deadline Miss
A Pub/Sub subscriber processes a message that requires calling an external API. The API call takes longer than the ack deadline. Pub/Sub redelivers the message to another subscriber instance. Both instances call the external API with the same payload, creating duplicate side effects on the external system.
Mitigation: Extend the ack deadline for long-running operations, implement idempotency keys at the external API level, and use distributed locking if necessary.
Scenario 5: The Outbox Relay Crash
A service writes to the outbox table and business tables in a single transaction. The outbox relay crashes after reading the outbox event but before publishing to the broker. The event is lost because the relay did not mark it as processed. Downstream consumers never receive the event, and the business state becomes inconsistent with the expected side effects.
Mitigation: Implement outbox relay with at-least-once delivery semantics, use idempotent publishing, and track published event IDs to prevent duplicates.
Quick Recap
- Understand that exactly-once refers to processing semantics, not transport guarantees
- Identify whether your system needs exactly-once or if at-least-once with idempotent consumers is sufficient
- Choose the right message broker based on your exactly-once support requirements (Kafka, SQS FIFO, Pub/Sub)
- Implement idempotency keys for message deduplication across retries
- Use the outbox pattern to solve the dual-write problem atomically
- Consider CDC tools like Debezium for high-throughput outbox relay
- Design consumers to be idempotent regardless of broker guarantees
- Apply natural idempotency where operations allow (e.g., SET same value)
- Use saga pattern with compensating transactions for multi-service coordination
- Monitor Redis memory usage if using it for deduplication at high message volumes
- Set TTL longer than maximum retry window to prevent premature key expiration
- Design stream processing outputs to be deterministic when possible
Interview Questions
- Partial failures cause message duplication when acks get lost
- Consumers can crash after processing but before committing offsets
- Any retry across producer-broker-consumer chain creates duplicates
- At-least-once: redelivers but never loses; requires idempotent consumers
- At-most-once: loses messages but never duplicates; consumer acks before processing
- Exactly-once: processed precisely once; needs transactions or idempotent deduplication
- Transactions (0.11+) write messages and offset commits atomically
- Ties message consumption to offset commit in a single transaction
- Only applies within Kafka; external systems need extra handling
- SQS hashes the message body to generate deduplication IDs
- Identical bodies within 5 minutes get dropped automatically
- You can also supply an explicit MessageDeduplicationId
- RabbitMQ only guarantees at-least-once with manual acks
- Use deduplication tables with ON CONFLICT DO NOTHING
- Or leverage natural idempotency (setting a field to the same value)
- Message-level transactions tie ack to database commit
- Writes the business record and outbox event in one database transaction
- A relay process polls the outbox and publishes to the broker
- If the business record exists, the event will eventually publish
- Saga: multi-service distributed transactions with compensating actions
- Outbox: single-service dual-write consistency (one database, one transaction)
- Saga coordinates across services; outbox handles reliable event publication
- Use message_id or content hash as the dedup key
- SETEX with TTL longer than your retry window prevents premature expiry
- Fast in-memory lookups scale well for high-throughput systems
- When a step fails, undo previous steps by running counter-transactions
- Each saga step needs a corresponding compensation
- Compensations must be idempotent since they can be retried
- 7-day retry window with publisher-supplied idempotency_key
- Without explicit key, content-based deduplication applies
- Subscribers must ack within the deadline; crash before ack means redelivery
- Message gets redelivered within the ack deadline window
- Subscriber must be idempotent to survive this safely
- Exactly-once is violated unless your consumer handles duplicates
- Operations that produce the same result regardless of repetitions
- Example: UPDATE users SET status='active' WHERE id=1 changes nothing if run twice
- Reduces need for external deduplication stores and overhead
- CDC tools like Debezium tail the transaction log instead of polling
- Faster propagation in high-throughput systems
- Trade-off: more operational complexity than polling outbox
- Higher latency from transactional coordination
- Operational complexity: outbox polling, CDC, saga coordination
- Consumer idempotency is non-negotiable regardless of broker support
- At-least-once + idempotent consumers is simpler for most use cases
- Consumer has a visibility window to process and delete the message
- If it times out without deleting, SQS redelivers
- Your consumer needs to handle this redelivery idempotently
- A unique identifier the publisher attaches to enable broker-side deduplication
- Kafka includes it in transactions; Pub/Sub uses idempotency_key parameter
- Lets publishers retry safely without creating duplicate messages
- Not every system needs it; at-least-once + idempotent consumers often suffices
- Log aggregators and non-critical notifications tolerate duplicates fine
- Financial transactions, inventory, and unique constraints may actually need it
- Uses transactional outbox pattern internally
- Writes offset and database update in one transaction
- Failed DB write means offset not committed, preventing reprocessing
- Reprocessing produces the same result if outputs are deterministic
- Kafka Streams word count is naturally idempotent
- If your stream operations are deterministic, exactly-once is redundant
- Kafka: yes via transactions (within Kafka only)
- SQS FIFO: yes via content dedup (5-minute window)
- RabbitMQ: no native support, build it yourself
- Pub/Sub: yes via 7-day retry window and idempotency keys
Further Reading
- Apache Kafka - Deep dive into Kafka’s partitioning, consumer groups, and transaction model
- AWS SQS and SNS - SQS features including visibility timeout and FIFO queue semantics
- RabbitMQ - RabbitMQ’s acknowledgment model and consumer patterns
- Event-Driven Architecture - Event-driven patterns and their relationship to exactly-once
- Asynchronous Communication - Async messaging patterns and outbox pattern origin
- Saga Pattern - Multi-service distributed transactions and compensating actions
- Outbox Pattern - Solving the dual-write problem in distributed systems
- Distributed Transactions - Two-phase commit, saga, and eventual consistency patterns
- Pub-Sub Patterns - Publish-subscribe semantics and message delivery guarantees
- Message Queue Types - Overview of message broker types and their delivery semantics
Conclusion
Exactly-once delivery is not a single mechanism but a combination of transport guarantees, idempotent processing, and sometimes distributed transactions. Most systems should start with at-least-once and idempotent consumers. Add exactly-once semantics only where the cost of duplicates justifies the implementation complexity.
The outbox pattern and idempotency keys are the practical tools here. They work across brokers and do not require special broker features.
For related reading, see Message Queue Types, Distributed Transactions, and Pub-Sub Patterns.
Category
Related Posts
Ordering Guarantees in Distributed Messaging
Understand how message brokers provide ordering guarantees, from FIFO queues to causal ordering across partitions, and the trade-offs in distributed systems.
Dead Letter Queues: Handling Message Failures Gracefully
Design and implement Dead Letter Queues for reliable message processing. Learn DLQ patterns, retry strategies, monitoring, and recovery workflows.
Apache Kafka: Distributed Streaming Platform
Learn how Apache Kafka handles distributed streaming with partitions, consumer groups, exactly-once semantics, and event-driven architecture patterns.