Exactly-Once Semantics in Stream Processing
Exactly-once ensures each event processes without duplicates or loss. Learn how it works, when you need it, and the true cost of implementation.
Every message processes exactly once. No duplicates. No data loss. This is exactly-once semantics, and it is the gold standard for stream processing correctness.
The name is misleading. Nothing processes exactly once in the physical sense. A message might be read, processed, and written, and then a failure occurs before the offset is committed. On restart, the message is read again. That is two physical reads. Exactly-once semantics guarantee that the output is as if the message had been processed only once.
Exactly-once is hard because processing involves multiple systems: the message broker, the processor, and the output system. A failure at any point can break the guarantee.
Why Exactly-Once Is Hard
Consider the canonical stream processing path:
Kafka (or other source) -> Processor -> Database (or other sink)
For exactly-once to hold, three things must be true:
- The source must not replay the same message after the processor has acknowledged it
- The processor must not produce output twice for the same message
- The sink must not write twice for the same message
Failure can happen anywhere. A crash after writing to the database but before committing the source offset means the message will be replayed. A crash after committing the offset but before writing to the database means the message is lost. Both are failures of exactly-once.
flowchart LR
Source -->|1. read| Processor[Processor]
Processor -->|2. process| DB[(Database)]
Processor -->|3. commit offset| Source
DB -->|4. ack| Processor
If the processor commits the offset (step 3) and then crashes before the write (step 2) completes, the offset has advanced and the write is lost on restart. If the write completes but the processor crashes before committing the offset, the message is replayed and the write is duplicated on restart.
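The two failure orderings can be reproduced with a toy simulation (all names hypothetical): crashing between the write and the offset commit yields a duplicate, while crashing in the other order yields a loss.

```python
def run(order, crash_between_steps):
    """Process one message with steps in `order`; optionally crash
    between the two steps, then restart and redo the message only if
    its offset was never committed (how real replay works)."""
    sink, offset_committed = [], False

    def attempt(steps):
        nonlocal offset_committed
        for i, step in enumerate(steps):
            if step == "write":
                sink.append("msg-1")
            else:  # "commit"
                offset_committed = True
            if crash_between_steps and i == 0:
                return False  # crashed after the first step
        return True

    if not attempt(order) and not offset_committed:
        attempt(order)  # restart: message is replayed from the source
    return sink

# write-then-commit, crash in between: message replayed -> duplicate
assert run(["write", "commit"], True) == ["msg-1", "msg-1"]
# commit-then-write, crash in between: never replayed -> lost
assert run(["commit", "write"], True) == []
# no crash: exactly one write either way
assert run(["write", "commit"], False) == ["msg-1"]
```

Neither ordering alone is safe; this is why the commit must cover both sides at once.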
The Two-Phase Commit Pattern
The classic solution is two-phase commit (2PC). The processor writes to a staging area and coordinates a commit that covers both the output and the offset.
- Read the message
- Write to a staging area (not visible to consumers yet)
- Commit the offset
- Make the staged write visible (commit the output)
If the processor crashes before the offset commit, the staged write is discarded and the message is replayed. If it crashes after the offset commit but before the staged write becomes visible, recovery finds the pending transaction and re-commits it. Making the commit step idempotent and retriable is what lets the two phases behave as one atomic operation.
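A minimal sketch of the staged-write idea, with a hypothetical `StagedSink` API — the key property is that `commit` is idempotent, so recovery can safely retry it:

```python
class StagedSink:
    """Toy two-phase sink (hypothetical API): writes land in a staging
    area first and become visible only on commit()."""
    def __init__(self):
        self.staging = {}   # txn_id -> staged records
        self.visible = []

    def stage(self, txn_id, record):
        self.staging.setdefault(txn_id, []).append(record)

    def commit(self, txn_id):
        # Idempotent: committing an already-committed txn is a no-op.
        self.visible.extend(self.staging.pop(txn_id, []))

    def abort(self, txn_id):
        self.staging.pop(txn_id, None)

sink = StagedSink()
sink.stage("txn-1", "order-42")
offset_committed = True       # phase 1: offset durably recorded with txn-1
# -- crash here: output is staged but not yet visible --
# On restart, recovery re-drives the decision from the durable offset log:
if offset_committed:
    sink.commit("txn-1")      # re-commit the pending transaction
else:
    sink.abort("txn-1")       # roll back; the message will be replayed

assert sink.visible == ["order-42"]
sink.commit("txn-1")          # a retried commit is harmless
assert sink.visible == ["order-42"]
```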
Flink implements this with its checkpointing mechanism. Flink draws distributed snapshots using a variant of the Chandy-Lamport algorithm (asynchronous barrier snapshotting). Checkpoint barriers flow through the stream graph. When all inputs have received the barrier, the checkpoint is complete. The checkpoint includes both the stream position (offset) and the operator state.
// Flink: Enable exactly-once checkpointing
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10000); // Checkpoint every 10 seconds
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
env.getCheckpointConfig().setCheckpointTimeout(60000);
env.getCheckpointConfig().enableExternalizedCheckpoints(
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
On failure, Flink restores from the last successful checkpoint. State is restored, stream position (offset) is restored, and processing resumes from exactly that point.
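The restore-and-resume behavior can be illustrated with a toy stateful counter that snapshots state and offset together (hypothetical code, not Flink's API): because both are restored atomically, the final counts come out the same no matter where the crash lands.

```python
def run_with_checkpoints(events, checkpoint_every, crash_at):
    """Count events with periodic (state, offset) snapshots; crash once
    at input index `crash_at`, restore the last snapshot, and resume."""
    state, offset = {}, 0
    checkpoint = ({}, 0)           # (state copy, offset) taken together
    crashed = False
    while offset < len(events):
        if offset == crash_at and not crashed:
            crashed = True
            state, offset = dict(checkpoint[0]), checkpoint[1]  # restore
            continue
        key = events[offset]
        state[key] = state.get(key, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (dict(state), offset)  # atomic snapshot of both
    return state

events = ["a", "b", "a", "c", "b", "a"]
# Crashing late or immediately gives the same counts as never crashing.
assert run_with_checkpoints(events, 2, 5) == {"a": 3, "b": 2, "c": 1}
assert run_with_checkpoints(events, 2, 0) == {"a": 3, "b": 2, "c": 1}
```

Snapshotting state and offset separately would reintroduce the dual-write problem; the atomicity of the pair is the whole point.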
Kafka Transactions
Kafka provides exactly-once semantics through transactions. A Kafka transaction atomically commits a consumer offset and a set of producer records.
// Kafka Streams: exactly-once with transactions
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
KafkaStreams streams = new KafkaStreams(builder.build(), props);
// The exactly_once_v2 setting uses Kafka transactions internally:
// - Consumer offsets are written to the __consumer_offsets topic
// - Output records are written to output topics
// - Both are committed in a single transaction
The producer in exactly-once mode is idempotent by default. Each record includes a producer ID and sequence number. Duplicate records with the same sequence number are ignored by the broker.
// Idempotent producer (already enabled by default)
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
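Broker-side deduplication can be sketched as follows (a simplification with hypothetical names — the real broker also rejects out-of-order sequence numbers rather than just ignoring stale ones):

```python
class Broker:
    """Toy broker-side dedup: a record is appended only if its
    (producer_id, sequence) is newer than the last one seen."""
    def __init__(self):
        self.log = []
        self.last_seq = {}   # producer_id -> highest sequence appended

    def append(self, producer_id, seq, record):
        if seq <= self.last_seq.get(producer_id, -1):
            return False     # duplicate retry: silently ignored
        self.last_seq[producer_id] = seq
        self.log.append(record)
        return True

b = Broker()
assert b.append("p1", 0, "x") is True
assert b.append("p1", 0, "x") is False   # retried send, deduplicated
assert b.append("p1", 1, "y") is True
assert b.log == ["x", "y"]
```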
The Transactional Outbox Pattern
Exactly-once within Kafka is straightforward. Exactly-once with external systems is harder. When your processor writes to a database, Kafka transactions do not cover the database write.
The transactional outbox pattern solves this. Instead of writing to the database and to Kafka as two separate operations, write the business data and an outbox record (the event to publish) in a single local database transaction. A separate relay process reads the outbox and publishes the events to Kafka.
flowchart LR
App[Application] -->|1. begin tx| DB[(Database)]
App -->|2. write to outbox| DB
App -->|3. commit tx| DB
DB -->|4. read outbox| Relay[Outbox Relay]
Relay -->|5. publish to Kafka| Kafka
Relay -->|6. mark published| DB
The database transaction guarantees atomicity of the business operation and the outbox write. The relay guarantees that the Kafka message is published only after the outbox record is committed. If the relay crashes after publishing but before marking the record, it republishes the same message on restart — outbox delivery is at-least-once, so downstream consumers must deduplicate, for example on an event ID carried in the payload.
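A compact sketch of the pattern, using SQLite as a stand-in for the database and a plain list as a stand-in for Kafka (schema and names are hypothetical):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, data TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT,"
           " published INTEGER DEFAULT 0)")

def place_order(order_id, data):
    # Business write and outbox write share one local transaction:
    # either both are committed or neither is.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, data))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f"OrderPlaced:{order_id}",))

kafka = []  # stand-in for the real broker

def relay_once():
    # Publish unpublished outbox rows, then mark them. A crash between
    # the publish and the mark re-publishes on the next run, so delivery
    # is at-least-once and consumers must deduplicate.
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        kafka.append(payload)
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?",
                       (row_id,))

place_order("o-1", "widget")
relay_once()
relay_once()  # nothing left to publish: the second run is a no-op
assert kafka == ["OrderPlaced:o-1"]
```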
Idempotent Consumers
Most exactly-once implementations ultimately rely on idempotency. Exactly-once means: if you run the pipeline multiple times with the same input, you get the same output.
An idempotent consumer ignores duplicates:
def process_order(order_id, order_data):
    """
    Idempotent order processing.
    Uses the order_id as a deduplication key.
    """
    existing = db.query("SELECT * FROM orders WHERE order_id = %s", order_id)
    if existing:
        return  # Already processed, skip
    db.execute(
        "INSERT INTO orders (order_id, data) VALUES (%s, %s)",
        order_id, order_data
    )
    db.commit()
Even if the consumer processes the same order_id twice, only one record is inserted. The database's unique constraint on order_id enforces idempotency even when the check-then-insert sequence races (see the pitfalls below).
For stateful aggregation, idempotency is more complex. You cannot just insert or update. You need exactly-once aggregation, which requires the checkpoint-based approaches described above.
When You Need Exactly-Once
Exactly-once is the right guarantee for:
- Financial transactions: A payment must not be processed twice
- Inventory updates: An item must not be reserved twice
- Billing records: A customer must not be charged twice
Exactly-once is overkill for:
- Metrics aggregation: If you miss one page view in a count of millions, the error is negligible
- Click tracking: Same reasoning as metrics
- Non-critical logs: Missing or duplicating a log entry rarely matters
Exactly-once has a real cost. It requires coordination between systems (checkpointing, transactions, distributed consensus). This coordination adds latency and reduces throughput. For many pipelines, at-least-once with idempotent consumers is faster and simpler.
At-Least-Once vs Exactly-Once
At-least-once is the pragmatic middle ground. Every message is processed at least once. If failures occur, messages are replayed. The result may contain duplicates, but no message is lost.
For most use cases, at-least-once with idempotent consumers produces correct results. If your consumer is idempotent, replaying a message produces the same result as processing it once. The duplicates are harmless.
# This consumer is idempotent. At-least-once with this consumer == exactly-once in practice.
@task
def load_orders_to_warehouse(orders):
    """
    Idempotent load: uses UPSERT (INSERT ON CONFLICT).
    Processing the same order twice produces the same result.
    """
    for order in orders:
        warehouse.execute("""
            INSERT INTO orders (order_id, data)
            VALUES (%s, %s)
            ON CONFLICT (order_id) DO UPDATE SET data = EXCLUDED.data
        """, order.order_id, order.data)
Measuring Exactly-Once
How do you know if your pipeline is truly exactly-once? You test it.
Chaos testing: Kill processes mid-execution. Restart. Verify that output is correct and no duplicates exist.
Inject duplicates: Intentionally inject duplicate messages into the stream. Verify that output contains each record exactly once.
Check output cardinality: For a given input, verify that output count matches input count (no duplicates, no losses).
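The duplicate-injection and cardinality checks above can be expressed as a small test against a toy pipeline (all names hypothetical):

```python
def upsert_pipeline(stream):
    """Toy idempotent pipeline: keyed upsert, so duplicate input
    events do not change the output."""
    out = {}
    for key, value in stream:
        out[key] = value
    return out

base = [("e1", 10), ("e2", 20), ("e3", 30)]
with_dups = base + [("e2", 20), ("e1", 10)]   # injected duplicates

clean = upsert_pipeline(base)
dirty = upsert_pipeline(with_dups)

# Cardinality check: output count matches distinct input count,
# and injected duplicates do not change the result.
assert len(dirty) == len({k for k, _ in base})
assert dirty == clean
```

The same shape works against a real pipeline: run it twice (once clean, once with injected duplicates and kills) and diff the outputs.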
Common Exactly-Once Pitfalls
Assuming Kafka exactly-once covers everything
Kafka transactions guarantee exactly-once within Kafka. If your pipeline writes to an external system (database, HTTP endpoint), Kafka transactions do not cover those writes. You need additional idempotency or outbox patterns.
Checkpointing too infrequently
Long checkpoint intervals mean long recovery times. If checkpoints run every 10 minutes and the last one completed 9 minutes before a crash, recovery must replay up to 9 minutes of input on top of the restore itself. For critical pipelines, choose checkpoint intervals that match your recovery time objective.
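A back-of-envelope way to size the interval, with hypothetical rates:

```python
def worst_case_recovery_s(checkpoint_interval_s, restore_s,
                          replay_rate, ingest_rate):
    """Worst case: the crash lands just before the next checkpoint, so
    up to one full interval of records must be replayed, after restoring
    the last snapshot."""
    backlog_records = checkpoint_interval_s * ingest_rate
    return restore_s + backlog_records / replay_rate

# Hypothetical: 100K rec/s ingest, 500K rec/s replay, 20 s restore,
# 10-minute checkpoint interval.
recovery = worst_case_recovery_s(600, 20, 500_000, 100_000)
assert recovery == 140.0   # 20 s restore + 120 s replaying 60M records
```

Invert the formula against your recovery time objective to get the longest acceptable interval.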
State store growth
With exactly-once checkpointing, state stores can grow unbounded if watermarks are not set correctly. Always configure state TTL and watermark grace periods to allow state cleanup.
Production Failure Scenarios
Duplicate output after crash during dual write
The processor writes to the database and commits the Kafka offset as two separate operations that are intended to behave as one. The database transaction commits, then the processor crashes before the offset commit completes. On restart, the database write is committed, the offset is not, and the message is replayed. The database write runs again — duplicate output.
The problem: the transaction boundary and the offset commit are not truly atomic from the processor’s perspective. Fix: the transactional outbox pattern, or a recovery validation step that reconciles database state with offset position after restart.
Checkpoint restore produces inconsistent window state
A stateful Flink job crashes. On restart, it restores from the last checkpoint, which was taken mid-window. Records between the checkpoint and the crash are reprocessed and combined with the restored partial aggregate. That is correct only if processing is deterministic and no pre-crash output escaped to a non-transactional sink; if either condition fails, the window emits wrong numbers.
Test checkpoint restores explicitly. Inject failures during window computations and verify that restored state produces correct results. The timing of the checkpoint relative to window boundaries matters.
Unique constraint race in idempotent consumers
Two instances process the same order_id simultaneously. Both check — neither finds a match. Both attempt to insert. One succeeds. The other hits a unique constraint and fails.
If the failed instance retries, it succeeds. Net result: correct. But the race creates brief inconsistency windows. Use INSERT ... ON CONFLICT DO UPDATE instead of check-then-insert. The database resolves the race atomically.
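A runnable version of the atomic upsert, using SQLite (3.24+) as a stand-in database with a hypothetical schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, data TEXT)")

def process_order(order_id, data):
    # No read-then-write gap: the database resolves concurrent inserts
    # of the same key atomically inside the statement itself.
    with db:
        db.execute("""
            INSERT INTO orders (order_id, data) VALUES (?, ?)
            ON CONFLICT (order_id) DO UPDATE SET data = excluded.data
        """, (order_id, data))

process_order("o-1", "first")
process_order("o-1", "retry")   # duplicate delivery: harmless overwrite
rows = db.execute("SELECT order_id, data FROM orders").fetchall()
assert rows == [("o-1", "retry")]
```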
Kafka exactly-once skipping writes to non-transactional sinks
Kafka transactions make the producer idempotent and the offset commit atomic with the record write. But appending to a Parquet file or writing to S3 via multipart upload is not part of the Kafka transaction. A crash after the Kafka commit but before the filesystem write completes means the write is missing. On restart, the offset has advanced and the record is skipped.
Use transactional sinks, or build a reconciliation process to detect and fill gaps.
Trade-off Table: Exactly-Once Approaches
| Aspect | Kafka Transactions | Flink Checkpointing | Spark Checkpointing | Idempotent Consumer |
|---|---|---|---|---|
| Scope | Kafka-to-Kafka only | End-to-end | End-to-end | Sink-side only |
| State management | Local RocksDB | Distributed RocksDB | RocksDB | No state |
| Recovery time | Milliseconds | Depends on checkpoint | Depends on checkpoint | Immediate |
| Throughput cost | Low | Medium | Medium | None |
| External sink support | No | Yes (if transactional) | Yes (if transactional) | Yes |
| Complexity | Low | High | Medium | Low |
| Exactly-once for aggregates | No | Yes | Yes | No |
| Best for | Kafka-native pipelines | Stateful Flink jobs | Stateful Spark jobs | External database sinks |
Kafka transactions for pure Kafka pipelines. Idempotent consumers with at-least-once delivery for external sinks. Flink or Spark checkpointing for stateful aggregations across arbitrary sources.
Capacity Estimation
Transaction overhead
Kafka transactions add latency: the producer accumulates records and commits them atomically. Overhead depends on how often you commit — in Kafka Streams this is commit.interval.ms — while transaction.timeout.ms only bounds how long a transaction may stay open before the broker aborts it.
# Frequent commits: lower end-to-end latency, more commit round-trips
commit.interval.ms=100
batch.size=16384
# Infrequent commits: higher latency, fewer commit round-trips
commit.interval.ms=30000
batch.size=524288
transaction.timeout.ms=300000
A transaction committing every 10 seconds at 100K records/sec commits 1M records. The commit is a round-trip to the broker leader. At 10ms per round-trip, that is fine. At 500ms, it becomes a bottleneck.
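The commit overhead can be estimated directly (toy arithmetic):

```python
def commit_overhead(commit_interval_s, commit_rtt_s):
    """Fraction of wall time the producer spends blocked on
    transaction commits."""
    return commit_rtt_s / (commit_interval_s + commit_rtt_s)

# With a 10-second commit interval: a 10 ms commit round-trip is
# noise (~0.1%), a 500 ms round-trip costs ~5% of throughput.
assert round(commit_overhead(10, 0.010), 4) == 0.001
assert round(commit_overhead(10, 0.500), 4) == 0.0476
```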
Checkpoint storage
Checkpoint size drives recovery time and storage cost.
# Estimate Flink checkpoint size
# 10M state entries, 500 bytes per entry, 2x serialization overhead
state_entries=10000000
bytes_per_entry=500
serialization_factor=2
checkpoint_size_mb=$((state_entries * bytes_per_entry * serialization_factor / 1024 / 1024))
# ≈ 9.3 GB per checkpoint
# 3 checkpoints retained (recovery + cleanup gap):
total_storage_gb=$((checkpoint_size_mb * 3 / 1024))
# ≈ 28 GB
A 10GB checkpoint writing to S3 in 60 seconds requires ~170 MB/sec sustained bandwidth. Profile this before production.
Recovery time
Recovery time is checkpoint restore time plus replay time.
# Restore: read checkpoint from storage
checkpoint_size_gb=10
storage_read_bandwidth_mbps=500 # S3 or HDFS
restore_seconds=$((checkpoint_size_gb * 1024 / storage_read_bandwidth_mbps))
# ≈ 20 seconds for 10GB at 500 MB/sec
# Replay: records from the offset gap
# depends on processing rate
Flink also redistributes state across TaskManagers on recovery. A 10TB state with 100 partitions means ~100GB per TaskManager over the network. At 1Gbps, that is ~800 seconds. Use incremental checkpoints to reduce this.
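The redistribution estimate, worked out (toy arithmetic):

```python
def redistribution_seconds(state_bytes, task_managers, link_gbps):
    """Time to ship each TaskManager's share of state over its
    network link during recovery."""
    per_tm_bytes = state_bytes / task_managers
    link_bytes_per_s = link_gbps * 1e9 / 8
    return per_tm_bytes / link_bytes_per_s

# 10 TB of state across 100 TaskManagers on 1 Gbps links:
# ~100 GB per TaskManager -> ~800 seconds.
assert round(redistribution_seconds(10e12, 100, 1.0)) == 800
```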
Quick Recap
- Exactly-once means “output is correct despite multiple physical processing attempts” — not one physical read.
- The hard part is crossing system boundaries. Kafka-to-Kafka is tractable. Kafka-to-database requires more.
- Kafka transactions only cover Kafka. External sinks need idempotency or outbox.
- At-least-once with idempotent consumers is the pragmatic default. Reserve exactly-once for financial or inventory pipelines.
- Recovery time is checkpoint_interval plus replay. Size your interval by your recovery time objective.
Conclusion
Exactly-once semantics are achievable but not free. The key is understanding where your pipeline crosses system boundaries and adding appropriate coordination at those points.
Within Kafka: Kafka transactions provide exactly-once. With external systems: idempotent consumers or the transactional outbox pattern. For stateful processing: Flink’s checkpointing or Spark’s checkpoint-based state management.
For most pipelines, at-least-once with idempotent consumers is the pragmatic choice. It is simpler to implement and performs better. Reserve exactly-once for pipelines where duplicate output has serious consequences.
For related reading, see Apache Kafka, Kafka Streams, and Apache Flink.