The Outbox Pattern: Reliable Event Publishing in Distributed Systems

Learn the transactional outbox pattern for reliable event publishing. Discover how to solve the dual-write problem, implement idempotent consumers, and achieve effectively exactly-once processing.

Published: March 24, 2026 | Updated: June 17, 2026 | Author: GeekWorkBench | Reading time: 32 min

Quick Summary

The outbox pattern solves the dual-write problem—where your app writes to a database and publishes a message as two separate operations that can silently diverge. Instead, you write to the business table and an outbox table in the same database transaction. A separate relay process polls the outbox and publishes to the broker. The transaction is your atomicity boundary, so you never end up with a committed business record that never got published. The tradeoff is latency: the broker does not receive events until the relay polls (typically 100ms-1s delay). Consumers must be idempotent because the relay may deliver events more than once. High-throughput systems use CDC (Change Data Capture) instead of polling for near-real-time delivery, and run multiple relays with FOR UPDATE SKIP LOCKED to scale horizontally.


The dual-write problem bites you in production. Application writes to the database and publishes a message to a broker. Either the write succeeds and the publish fails, or vice versa. Your system ends up inconsistent, and you spend days building compensating logic, retry queues, and reconciliation jobs.

The outbox pattern fixes this. Instead of writing to the database and publishing as two separate operations, you do both in a single database transaction. The message goes into an outbox table. A separate process reads the outbox and publishes to the broker. Atomicity without distributed transactions.

Let me walk through the problem the outbox solves, how to implement it, and the operational considerations that matter.

Introduction

Modern applications frequently need to update state and notify other systems. User places an order: Orders table gets a new row, and the Inventory service needs to reserve stock. The naive approach is two writes.

# Dual write approach - problematic
def place_order(order):
    db.insert("orders", order)
    message_broker.publish("order_created", order)

This fails in a few ways. The database insert succeeds but the broker publish fails: Inventory never gets the message and stock stays unreserved. Or the publish succeeds but the database write is lost, for example because the insert is part of a larger transaction that rolls back afterwards: Inventory thinks the order exists when it does not. Both cases leave the system inconsistent.

You might add a retry.

# Retry approach - still problematic
def place_order(order):
    db.insert("orders", order)
    try:
        message_broker.publish("order_created", order)
    except Exception:
        retry_queue.add(order)

This handles transient broker failures but does not solve the fundamental issue. If the process crashes after the insert commits but before the publish or the retry_queue.add call runs, the event is lost and nothing records that it was ever owed. Retry queues also need their own handling for duplicates, ordering, and failures.

The root problem: two separate systems that must stay in sync, coordinated with application code rather than a shared transaction boundary.

Topic-Specific Deep Dives

Core Concepts

The outbox uses the database transaction as the shared boundary. Instead of publishing directly to the broker, the application writes to two tables in the same transaction: the business table and an outbox table.

def place_order(order):
    with db.transaction():
        db.insert("orders", order)
        db.insert("outbox", {
            "event_type": "order_created",
            "payload": json.dumps(order),
            "created_at": now()
        })

Both writes happen atomically. Transaction commits, both order and outbox entry exist. Transaction rolls back, neither exists.
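
The all-or-nothing behavior can be checked with a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the application database; the table layouts and the event_type parameter are simplified for illustration and are not the schema used later in this article. The second call forces the outbox insert to fail, and the order row written just before it disappears with the rollback.

```python
import json
import sqlite3

def make_db():
    # In-memory SQLite stands in for the application database (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn

def place_order(conn, order, event_type="order_created"):
    # "with conn" commits on success and rolls back if any statement raises.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)",
                     (order["id"], order["total"]))
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (f"order-{order['id']}-created", event_type, json.dumps(order)),
        )

def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

conn = make_db()
place_order(conn, {"id": "123", "total": 42.0})

# Force the outbox insert to fail (NULL event_type violates NOT NULL).
# The order insert in the same transaction is rolled back with it.
try:
    place_order(conn, {"id": "124", "total": 10.0}, event_type=None)
except sqlite3.IntegrityError:
    pass

print(count(conn, "orders"), count(conn, "outbox"))
```

Both counts stay equal after the failed call, which is the property the relay depends on: every committed order has exactly one outbox row.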

A separate process, often called the Relay or Publisher, reads the outbox table and publishes events to the broker.

class OutboxRelay:
    def process_outbox(self):
        events = db.fetch_unpublished(limit=100)
        for event in events:
            try:
                message_broker.publish(event["event_type"], event["payload"])
                db.mark_published(event["id"])
            except Exception:
                pass  # Leave the entry unpublished; it is retried on the next poll

After a successful publish, the relay marks the outbox entry as published. If it crashes before publishing, the entry is still unpublished and gets picked up on restart. If it crashes after publishing but before marking, the event gets published again. Consumers must handle duplicates gracefully.
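
The crash window between publish and mark can be simulated in a few lines. This is a sketch under stated assumptions: FakeBroker, CrashAfterPublish, and relay_once are hypothetical names, and the outbox is a plain list rather than a table.

```python
class FakeBroker:
    """In-memory stand-in for a message broker (hypothetical)."""
    def __init__(self):
        self.delivered = []

    def publish(self, event_type, payload):
        self.delivered.append((event_type, payload))

class CrashAfterPublish(Exception):
    """Simulates the relay process dying between publish and mark."""

def relay_once(outbox, broker, crash_before_mark=False):
    # Publish every unpublished row, then mark it.
    for row in outbox:
        if row["published"]:
            continue
        broker.publish(row["event_type"], row["payload"])
        if crash_before_mark:
            raise CrashAfterPublish()
        row["published"] = True

outbox = [{"id": 1, "event_type": "order_created", "payload": "{}", "published": False}]
broker = FakeBroker()

try:
    relay_once(outbox, broker, crash_before_mark=True)  # publishes, then "crashes"
except CrashAfterPublish:
    pass
relay_once(outbox, broker)  # restart: the row is still unpublished

print(len(broker.delivered))
```

The broker ends up with two copies of the same event, which is the at-least-once behavior consumers have to absorb.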

Outbox Architecture Flow

An event starts its life in the database. Your application only ever touches the database—the relay handles everything downstream, which keeps your write path uncluttered.

mermaid
flowchart TD
    subgraph Application["Application Layer"]
        App[Application Code]
        DB[(Business<br/>Table)]
        Outbox[(Outbox<br/>Table)]
        App --> |BEGIN TRANSACTION| T1
        T1 --> |INSERT| DB
        T1 --> |INSERT| Outbox
        T1 --> |COMMIT| Commit1[Commit]
        Commit1 --> |Atomic write| DB
        Commit1 --> |Atomic write| Outbox
    end

    subgraph Relay["Relay / Publisher"]
        Poll[Poll outbox<br/>entries]
        Fetch[Fetch unpublished]
        Publish[Publish to broker]
        Mark[Mark as published]
        Poll --> Fetch
        Fetch --> Publish
        Publish --> Mark
    end

    subgraph Broker["Message Broker"]
        Topic[Event Topic]
    end

    subgraph Consumer["Consumer Services"]
        Handler[Event Handler]
        Process[Process event<br/>idempotently]
        Handler --> Process
    end

    Outbox --> |1. Poll| Poll
    Publish --> |2. Publish event| Topic
    Topic --> |3. Deliver| Handler

Flow Steps:

Step	Component	Action
1	Application	Writes business data + outbox entry in single transaction
2	Relay	Polls outbox table for unpublished entries
3	Relay	Publishes event to message broker
4	Relay	Marks outbox entry as published
5	Broker	Delivers event to subscribed consumers
6	Consumer	Processes event idempotently

Key Properties:

Atomicity: Business data and outbox entry commit together or not at all
Reliability: Relay retries ensure eventual delivery
At-least-once: Events may be published multiple times if relay crashes after publish but before mark
Idempotency required: Consumers must deduplicate using event IDs

For more on event-driven patterns, see Event-Driven Architecture.

Design and Architecture

The cost is delay. The broker does not get the message until the relay polls, processes, and publishes. For most use cases this is fine, milliseconds or seconds. But if your consumers need immediate notification, the outbox pattern is the wrong tool.

The outbox table design determines how fast the relay can run and how painful operations become. Pick the right indexes and partitioning strategy upfront, and polling stays snappy even as the table grows. The sections below cover what actually works in production.

Why Not Just Use Transactions?

Traditional database transactions cannot span both the database and the message broker. The broker is external. You cannot begin a transaction, write to the database, write to the broker, and commit as one unit.

The outbox works around this by keeping everything in the database. The transaction boundary covers the business data and the outbox entry. The relay publishes from the outbox to the broker outside the transaction. Atomicity for the write, eventual consistency for the notification.

You lose immediate notification: delivery waits on the relay. For low-latency requirements, this delay matters. For most use cases, milliseconds or seconds is acceptable.

Outbox Table Schema Design

The outbox table schema matters for performance. A poorly designed outbox table can become a bottleneck under high write load.

Basic Schema

CREATE TABLE outbox (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type      VARCHAR(255) NOT NULL,
    aggregate_type  VARCHAR(255),          -- e.g., 'Order', 'Payment'
    aggregate_id    VARCHAR(255) NOT NULL,   -- e.g., order-123
    payload         JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at    TIMESTAMPTZ,             -- NULL = unpublished
    published       BOOLEAN NOT NULL DEFAULT false,

    -- Idempotency key for deduplication
    event_id       VARCHAR(255) UNIQUE NOT NULL
);

-- Index for relay polling: find unpublished entries efficiently
CREATE INDEX idx_outbox_unpublished
    ON outbox (published, created_at)
    WHERE published = false;

-- Index for aggregate lookups (useful for debugging/replays)
CREATE INDEX idx_outbox_aggregate
    ON outbox (aggregate_type, aggregate_id, created_at);

Partitioning Strategies

For high-throughput systems, partitioning the outbox table prevents it from becoming a bottleneck.

By time range (recommended for most cases):

-- Partition by day for easy TTL cleanup and efficient querying
CREATE TABLE outbox (
    id              UUID DEFAULT gen_random_uuid(),
    event_type      VARCHAR(255) NOT NULL,
    aggregate_type  VARCHAR(255),
    aggregate_id    VARCHAR(255) NOT NULL,
    payload         JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL,
    published_at    TIMESTAMPTZ,
    published       BOOLEAN NOT NULL DEFAULT false,
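    event_id        VARCHAR(255) NOT NULL,
    PRIMARY KEY (id, created_at),
    -- PostgreSQL requires unique constraints on a partitioned table to
    -- include the partition key, so event_id alone cannot be UNIQUE here.
    -- This only rejects exact (event_id, created_at) repeats; consumers
    -- remain the final deduplication point.
    UNIQUE (event_id, created_at)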
) PARTITION BY RANGE (created_at);

-- Create daily partitions
CREATE TABLE outbox_2026_03_01 PARTITION OF outbox
    FOR VALUES FROM ('2026-03-01') TO ('2026-03-02');
-- ... etc

Partition pruning benefits: Queries for unpublished entries in a specific time range only scan the relevant partition. Old partitions can be detached and dropped for efficient cleanup.

Avoiding Hot Partitions

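If all writes go to the most recent partition, that partition becomes a write hotspot. An alternative for extreme throughput is to partition on published status, so the relay only ever scans a small partition of pending rows:

-- Partition by published status: a small "hot" partition holds
-- unpublished events and a "cold" partition holds published ones
-- (the cold side can be sub-partitioned by date).
-- Marking a row published moves it between partitions (PostgreSQL 11+),
-- which costs a delete plus an insert per event.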
CREATE TABLE outbox (
    ...
    published       BOOLEAN NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (id, published, created_at)
) PARTITION BY LIST (published);

Index Design for Relay Performance

The relay needs to find unpublished entries fast. The partial index on (published, created_at) WHERE published = false is the critical path:

Relay queries: WHERE published = false ORDER BY created_at LIMIT N
The index returns rows already ordered by created_at, so no sort step is needed. It is not an index-only scan for SELECT id unless id is added to the index, for example with INCLUDE (id)
Because the index only contains unpublished rows, it stays small and memory-resident as long as the relay keeps up, even when the table does not

-- Verify the index is used with: EXPLAIN (ANALYZE, BUFFERS) SELECT ...
SELECT id FROM outbox
WHERE published = false
ORDER BY created_at
LIMIT 100;

Schema Design Summary

Concern	Recommendation
Primary key	UUID (no sequence contention)
Event ID	Business meaningful ID (order-123-created), not random UUID
Payload	JSONB (flexible, queryable)
Index for relay	`(published, created_at)` WHERE published=false
Partitioning	By time range (daily partitions)
Idempotency	UNIQUE constraint on event_id

With the schema in place, the next decision is how the relay finds new entries. Different approaches work in different situations.

Polling Relay

The simplest approach polls the outbox table at regular intervals.

import time

class PollingRelay:
    def __init__(self, db, broker, interval_ms=100):
        self.db = db
        self.broker = broker
        self.interval = interval_ms / 1000  # seconds

    def run(self):
        while True:
            events = self.db.fetch_unpublished(limit=100, order="created_at")
            for event in events:
                self.publish_and_mark(event)
            time.sleep(self.interval)

    def publish_and_mark(self, event):
        try:
            self.broker.publish(event["event_type"], event["payload"])
            self.db.mark_published(event["id"])
        except Exception:
            pass  # Leave unpublished; retried on the next poll

Polling is simple and reliable. The tradeoff is latency. With a 100ms interval, an event waits up to 100ms for the next poll, plus the time it takes to publish. Reduce the interval and database load goes up.
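
The interval tradeoff is simple arithmetic. The sketch below assumes events commit at uniformly random points within a poll interval and ignores query and publish time; polling_profile is a hypothetical helper, not part of any library.

```python
def polling_profile(interval_ms):
    """Rough numbers for a fixed-interval poller, ignoring query and publish time."""
    return {
        "polls_per_second": 1000 / interval_ms,
        # An event commits at a random point within the interval.
        "avg_added_latency_ms": interval_ms / 2,
        # Worst case: the event commits just after a poll finished.
        "worst_added_latency_ms": interval_ms,
    }

for interval in (1000, 100, 10):
    print(interval, polling_profile(interval))
```

Going from a 100ms to a 10ms interval cuts the average added delay from 50ms to 5ms, but the database answers ten times as many polling queries, most of which return nothing.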

Change Data Capture

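A more sophisticated approach pushes changes to the relay instead of polling for them. CDC tools like Debezium read the write-ahead log. PostgreSQL’s LISTEN/NOTIFY is a lighter alternative: it is not log-based CDC, and notifications sent while the listener is disconnected are lost, so it is usually paired with a slow fallback poll.

class CdcRelay:
    def __init__(self, db, broker):
        self.db = db
        self.broker = broker
        # Wake up on a notification instead of sleeping between polls
        self.db.listen("outbox_inserts", self.handle_notification)

    def handle_notification(self):
        # The notification is only a wake-up signal; the outbox table is
        # still the source of truth for what needs publishing.
        events = self.db.fetch_unpublished(limit=100)
        for event in events:
            self.publish_and_mark(event)  # same method as in PollingRelay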

CDC cuts latency significantly. Events publish seconds or milliseconds after commit rather than on the next poll cycle. The cost is operational complexity. CDC tools add infrastructure and their own failure modes.

Polling vs CDC vs DB-Native Relay Comparison

Aspect	Polling Relay	CDC Relay	DB-Native (Triggers/Procedures)
Latency	Interval-based (100ms-1s typical)	Near real-time (ms-s)	Low (within transaction commit)
Database Load	Higher (continuous polling)	Low (event-driven)	Lowest (no polling)
Infrastructure	Simple process	CDC tool (Debezium, AWS DMS)	Native DB triggers/procedures
Reliability	High (simple, testable)	High (but CDC tool has own failure modes)	Highest (DB handles all)
Operational Complexity	Low	High (CDC tool management)	Medium (DB procedures)
Replay Capability	Yes (mark-based)	Yes (CDC log retention)	Limited (WAL-based)
Cross-Database Support	Yes	Limited (CDC per-DB)	No (DB-specific)
When to Use	Low throughput, simple setup	High throughput, low latency needs	Maximum reliability, DB-native solution

Deleting Instead of Marking

Some implementations delete outbox entries after successful publish rather than marking them.

def publish_and_delete(self, event):
    self.broker.publish(event["event_type"], event["payload"])
    self.db.delete_outbox_entry(event["id"])

This keeps the outbox table small. The cost is losing replay capability. Marking as published preserves the data and lets you replay from the beginning if needed.

Handling Failures and Idempotency

The outbox pattern guarantees at-least-once delivery. Relay publishes an event, then marks it published. If the mark fails, the event gets republished on the next cycle. Consumers must handle duplicates.

Idempotent consumers solve this. The consumer tracks which events it has already processed, using an event ID in a processed_events table or cache.

class IdempotentConsumer:
    def __init__(self, broker, db):
        self.broker = broker
        self.db = db

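    def handle_order_created(self, payload):
        event_id = payload["event_id"]
        # Check, process, and record in one transaction so a crash between
        # processing and recording cannot cause a second execution.
        # This assumes process_order writes to the same database.
        with self.db.transaction():
            if self.db.already_processed(event_id):
                return  # Skip duplicate

            order = payload["order"]
            self.process_order(order)
            self.db.mark_processed(event_id)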

The idempotency table needs its own cleanup to avoid growing forever. Purge processed events older than a certain age, or use a sliding window.
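
One way to make the deduplication check race-free is to let a primary key do the work. The sketch below assumes the consumer's business data and its processed_events table live in the same database (SQLite here, purely for illustration); the stock table and the sku and quantity fields are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, reserved INTEGER NOT NULL)")
conn.execute("INSERT INTO stock VALUES ('widget', 0)")
conn.commit()

def handle_order_created(conn, event):
    """Return True if the event was applied, False if it was a duplicate."""
    try:
        with conn:  # one transaction: dedup record + business effect
            conn.execute("INSERT INTO processed_events (event_id) VALUES (?)",
                         (event["event_id"],))
            conn.execute("UPDATE stock SET reserved = reserved + ? WHERE sku = ?",
                         (event["quantity"], event["sku"]))
        return True
    except sqlite3.IntegrityError:
        return False  # primary key hit: this event was already processed

event = {"event_id": "order-123-created", "sku": "widget", "quantity": 2}
first = handle_order_created(conn, event)
second = handle_order_created(conn, event)  # simulated redelivery
reserved = conn.execute("SELECT reserved FROM stock WHERE sku = 'widget'").fetchone()[0]
print(first, second, reserved)
```

Inserting the event ID first means two concurrent deliveries cannot both pass a "have I seen this?" check: one of them fails on the primary key and rolls back.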

Exactly-Once Delivery

The outbox pattern, combined with idempotent consumers, gives you effectively exactly-once processing end-to-end. The database and outbox writes happen atomically. The relay delivers the event at least once. The consumer applies it exactly once.

The nuance: “exactly once” means the business logic executes once, not that the event moves through the system once. The event may cross the broker multiple times. The consumer must recognize and deduplicate.

For a broader view of delivery semantics, see Saga Pattern which often pairs with outbox for distributed transaction handling.

Operational Considerations

The outbox table grows if the relay falls behind or publishing repeatedly fails. Monitor for stuck entries.

SELECT COUNT(*) FROM outbox
WHERE published = false
AND created_at < NOW() - INTERVAL '5 minutes';

An alert on this query catches relay failures and persistent publish errors.

Consider the relay’s failure modes too. If the relay crashes mid-batch, some events get published but not marked. The next relay instance reprocesses them. As long as consumers are idempotent, this is fine.

For high-throughput systems, batching helps. The relay fetches multiple events, publishes them in a single broker batch if supported, then marks them all published. Fewer broker round trips.
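
A batching relay might look like the sketch below. It assumes the broker client exposes some batch call; BatchBroker.publish_batch is a hypothetical stand-in, and the returned IDs represent rows that would then be marked published in one UPDATE.

```python
from itertools import islice

def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

class BatchBroker:
    """Hypothetical broker client with a batch API; counts round trips."""
    def __init__(self):
        self.round_trips = 0
        self.messages = []

    def publish_batch(self, messages):
        self.round_trips += 1
        self.messages.extend(messages)

def relay_batches(rows, broker, batch_size=100):
    published_ids = []
    for batch in chunked(rows, batch_size):
        broker.publish_batch([(r["event_type"], r["payload"]) for r in batch])
        # Collect IDs only after the broker accepted the whole batch.
        published_ids.extend(r["id"] for r in batch)
    return published_ids

rows = [{"id": i, "event_type": "order_created", "payload": "{}"} for i in range(250)]
broker = BatchBroker()
ids = relay_batches(rows, broker)
print(broker.round_trips, len(ids))
```

The cost of batching is a wider duplicate window: a crash after publish_batch but before the mark republishes the whole batch rather than a single event.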

Relay High Availability

A single relay is a single point of failure. If it crashes, events pile up in the outbox until someone notices. For production systems, you need multiple relay instances.

Active-passive (leader election):

Use a lock in the database or Redis to elect an active relay. The active relay polls and publishes. Passive relays stand by, ready to take over if the active one dies.

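import socket
import time

import redis

class HaRelay:
    def __init__(self, db, broker, lock_key, ttl_seconds=30):
        self.db = db
        self.broker = broker
        self.lock_key = lock_key
        self.ttl = ttl_seconds
        self.redis = redis.Redis()
        self.owner = socket.gethostname()  # Must be unique per relay instance

    def run(self):
        while True:
            acquired = self.redis.set(self.lock_key, self.owner,
                                       nx=True, ex=self.ttl)
            if acquired:
                self.do_work()  # Only active relay does work
            else:
                time.sleep(1)   # Standby, wait for lock release

    def still_leader(self):
        # Renew the TTL only while this instance still owns the lock.
        # GET followed by EXPIRE is not atomic; a Lua script closes that gap.
        try:
            if self.redis.get(self.lock_key) != self.owner.encode():
                return False
            return bool(self.redis.expire(self.lock_key, self.ttl))
        except redis.ConnectionError:
            return False  # Cannot confirm the lock, go back to standby

    def do_work(self):
        while self.still_leader():
            events = self.db.fetch_unpublished(limit=100)
            for event in events:
                self.broker.publish(event["event_type"], event["payload"])
                self.db.mark_published(event["id"])
            if not events:
                time.sleep(0.1)  # Avoid a busy loop when the outbox is empty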

The nx=True means only one instance can hold the lock. The ex=self.ttl means the lock auto-releases if the active relay crashes and doesn’t renew. Passive instances detect the lock is gone and one of them acquires it.

Active-active (multiple relays with sharding):

Multiple relays can run simultaneously if they coordinate which outbox rows each one handles. Use SELECT ... FOR UPDATE SKIP LOCKED to let each relay claim a different batch:

class ShardedRelay:
    def process_outbox(self):
        # Row locks only last until the transaction ends, so claim, publish,
        # and mark must all happen inside one transaction.
        with self.db.transaction():
            # FOR UPDATE SKIP LOCKED: grab rows without blocking other relays
            events = self.db.fetch_unpublished(
                "SELECT * FROM outbox WHERE published = false "
                "ORDER BY created_at LIMIT 100 FOR UPDATE SKIP LOCKED"
            )
            for event in events:
                try:
                    self.broker.publish(event["event_type"], event["payload"])
                    self.db.mark_published(event["id"])
                except Exception:
                    pass  # Row stays unpublished; a relay retries it next cycle

Both relays run concurrently, each grabbing different rows. No leader election is needed, and throughput scales with relay count. The tradeoff is ordering: events claimed by different relays can reach the broker out of created_at order.
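
The claiming behavior can be illustrated without a database. The following is a toy in-memory model of the SKIP LOCKED semantics, not a real implementation: OutboxSim, claim, and commit are hypothetical names, and a real relay would rely on the database's row locks instead.

```python
import threading

class OutboxSim:
    """Toy in-memory model of SELECT ... FOR UPDATE SKIP LOCKED (not a real database)."""
    def __init__(self, n):
        self.unpublished = list(range(n))   # row ids, oldest first
        self.locked = set()
        self._mutex = threading.Lock()

    def claim(self, limit):
        # Take the oldest rows nobody else has locked; never wait on locked ones.
        with self._mutex:
            batch = [r for r in self.unpublished if r not in self.locked][:limit]
            self.locked.update(batch)
            return batch

    def commit(self, batch):
        # Marking published and releasing the locks happen together, like a COMMIT.
        with self._mutex:
            done = set(batch)
            self.unpublished = [r for r in self.unpublished if r not in done]
            self.locked -= done

sim = OutboxSim(10)
relay_a = sim.claim(4)
relay_b = sim.claim(4)   # runs while relay A still holds its locks
print(relay_a, relay_b)
sim.commit(relay_a)
sim.commit(relay_b)
print(sim.unpublished)
```

Relay B skips the rows relay A holds and takes the next four, so the two batches never overlap. If relay B finishes first, its newer events reach the broker before relay A's older ones.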

Outbox Entry TTL and Cleanup

The outbox table grows over time. Published entries that are never cleaned up become dead weight. You need a cleanup strategy.

Soft delete with published flag (already in schema):

# Mark as published, don't delete
db.mark_published(event["id"])

Hard delete for old published entries (run as a scheduled job):

-- Clean up published entries older than 7 days
DELETE FROM outbox
WHERE published = true
AND published_at < NOW() - INTERVAL '7 days';

Automated cleanup with partition expiry (if using time-based partitioning):

-- Detach and drop old partitions
ALTER TABLE outbox DETACH PARTITION outbox_2026_02_01;
DROP TABLE outbox_2026_02_01;

Partitioning makes cleanup cheap: dropping a partition is a metadata operation rather than a row-by-row delete of millions of rows. Confirm a partition holds no unpublished rows before dropping it.

Retention policy guidance:

Use case	TTL for published entries
Financial transactions	30-90 days (audit requirements)
Order events	7-14 days
High-volume event logs	1-3 days
Debug/replay scenarios	Keep all for 24h, then batch archive

Never delete unpublished entries: they represent events that haven’t been delivered yet. If unpublished entries are consistently older than a few poll intervals, something is wrong with the relay.
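
A scheduled cleanup job needs to decide which daily partitions fall outside the retention window. The sketch below assumes the outbox_YYYY_MM_DD naming used in the partition examples; partitions_to_drop is a hypothetical helper that only computes names and does not touch the database.

```python
from datetime import date, timedelta

def partition_name(day):
    # Naming follows the outbox_YYYY_MM_DD convention used in this article.
    return f"outbox_{day:%Y_%m_%d}"

def partitions_to_drop(existing_days, today, retention_days):
    """Return names of daily partitions entirely older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [partition_name(d) for d in sorted(existing_days) if d < cutoff]

days = [date(2026, 3, 1) + timedelta(days=i) for i in range(10)]  # Mar 1 to Mar 10
print(partitions_to_drop(days, today=date(2026, 3, 10), retention_days=7))
```

With a 7-day retention on March 10, only the March 1 and March 2 partitions qualify. A real job would also check each candidate for unpublished rows before detaching it.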

Combining with Saga

The outbox pattern pairs naturally with saga. In a saga, each step updates a service and publishes an event to trigger the next step. Without outbox, the publish can fail and leave the saga stuck.

With outbox, the step’s update and outbox write are atomic. The relay publishes the event that triggers the next step. If the relay is down, events queue up and publish when it recovers. The saga makes progress as soon as the relay is available.

For details on saga implementation, see Saga Pattern.

The core problem outbox solves for saga is broker unavailability mid-transaction. In a distributed transaction spanning multiple services, if Step 2 publishes its “step complete” event but the broker is down, Step 3 never fires. The saga hangs. With outbox, Step 2 writes its completion and the outbox entry atomically. The relay retries until the broker accepts the event. Step 3 fires when the relay recovers.

The failure semantics differ from single-service patterns. In a buy-and-sell saga, each step is atomic within its service, but the cross-service coordination depends on events surviving broker outages. If the relay is down for 30 seconds during a 5-step saga, steps queue up and fire in sequence once the relay returns. The saga completes, just with added latency.

Compensations add complexity in outbox-based sagas. A compensation for Step 3 might execute after Step 4 has already run, depending on relay timing. Design compensations to be idempotent and aware of subsequent step state. Forward-only sagas where each step is irreversible work best with this approach. Sagas that need tight compensation ordering need additional coordination logic beyond outbox alone.

When to Use the Outbox Pattern

Use the outbox pattern when you need reliability guarantees for event publishing and can tolerate the additional latency (hundreds of milliseconds to seconds) between the database write and broker notification.

It works well in high-reliability scenarios: financial transactions, inventory updates, any domain where missed events create real problems. Well-suited to event-driven microservices where services communicate through events rather than direct calls.

The outbox adds operational complexity. You run the relay as a separate process. You monitor it. You handle idempotency in consumers. For low-reliability use cases or one-way notifications where missed events are acceptable, this complexity may not pay off.

For an overview of distributed transaction patterns, see Distributed Transactions.

CQRS and Event Sourcing

CQRS and event sourcing hit the same reliability wall that outbox was built to crack. In CQRS, getting events from the write side to the read side reliably is the whole challenge. Event sourcing has the same problem: your event store needs to tell consumers whenever new events land. Without outbox, both patterns end up relying on direct writes or polling that can quietly break and leave your systems out of sync.

Outbox fixes this. The event store and outbox entry commit together, so events don’t vanish between appending to the log and notifying consumers. Your consumers stay decoupled from your storage layer. Change the schema without worrying about breaking downstream processors.

The sections below dig into each pattern separately, then set them against direct writes and retry queues head-to-head. Five failure scenarios follow to show how outbox actually behaves when production decides to misbehave.

CQRS (Command Query Responsibility Segregation)

In CQRS, you separate read models from write models. Commands (writes) update the command side. Queries (reads) read from potentially different read models updated via events.

The outbox pattern is how you get events from the command side to the query side reliably:

Command Side                         Query Side
┌─────────────────┐                  ┌─────────────────┐
│ Order Service   │  outbox + relay  │ Read Model      │
│ - place_order   │ ───────────────► │ - OrderView     │
│ - update_status │    (reliable     │ - CustomerView  │
│ - cancel        │   event stream)  │ - Analytics DB  │
└─────────────────┘                  └─────────────────┘

Without outbox, you might update the command model and then directly update the read model. That direct call can fail and leave the systems out of sync. With outbox, the update and event publication are atomic — the read model eventually catches up, never misses an event.

Event Sourcing

Event sourcing takes CQRS further: instead of storing current state, you store the sequence of events that produced the current state. The event store is the source of truth.

The outbox pattern works well with event sourcing:

def place_order(order):
    with db.transaction():
        # Append to event log (the "event store")
        db.insert("events", {
            "event_type": "ORDER_PLACED",
            "aggregate_id": order.id,
            "payload": json.dumps(order),
            "sequence": get_next_sequence("Order", order.id)
        })

        # Also write to outbox for reliable notification
        db.insert("outbox", {
            "event_type": "ORDER_PLACED",
            "aggregate_id": order.id,
            "payload": json.dumps(order),
            "event_id": f"order-{order.id}-placed"
        })

The event store is the authoritative log. The outbox ensures events propagate to consumers (read models, audit logs, other services). If you only use event sourcing without outbox, you rely on consumers to read the event store directly — which couples them to your storage mechanism and makes it harder to evolve the event schema.

The outbox decouples: your event store is an internal implementation detail. Consumers receive events through the broker like any other event-driven system.

Trade-off Analysis

Dimension	Outbox Pattern	Direct Write (Dual Write)	Retry Queue
Consistency	Strong (atomic within DB transaction)	Weak (split across systems)	Medium (retries eventually succeed)
Complexity	Medium (relay + idempotency)	Low (direct publish)	Medium (queue management)
Latency	Higher (poll interval delay)	Low (immediate)	Medium (queue processing)
Failure Handling	Self-healing (relay retries)	Requires manual compensation	Built-in (queue retries)
Operational Burden	Medium (monitor relay, outbox size)	Low (no additional processes)	Medium (queue infrastructure)
Replay Capability	Yes (marked entries preserved)	No	Limited (queue retention)
Exactly-Once	Yes (with idempotent consumers)	No	No (at-least-once at best)
Infrastructure Cost	Medium (DB + relay process)	Low (just broker)	High (queue cluster)
Scalability	High (sharded relays)	Medium (broker scales)	High (queue scales)

Production Failure Scenarios

Scenario 1: Network Partition Between Relay and Broker

What happens: The relay publishes an event to the broker, the broker acknowledges receipt, but the network drops before the relay receives the ack. The relay treats the publish as failed and leaves the entry unpublished. On the next cycle it publishes the event again, even though the broker already has the first copy and delivers it to consumers. Duplicate delivery.

Impact: If consumers are not idempotent, they may process the same order twice.

Mitigation: Always design consumers to be idempotent. Use the event_id to deduplicate.

Scenario 2: Database Crash During Outbox Write

What happens: The application starts a transaction, inserts the business data, inserts the outbox entry, but crashes before the transaction commits. The database rolls back automatically. No event published. No business data written. System is consistent.

Impact: None. This is actually correct behavior — the crash prevented an inconsistent state.

Mitigation: None needed. This is why atomic transactions matter.

Scenario 3: Relay Crash After Publish Before Mark

What happens: Relay publishes 100 events to the broker. Broker accepts all 100. Before the relay can mark any as published, it crashes. On restart, the relay re-publishes all 100 events.

Impact: Consumers receive 200 events total (100 original + 100 duplicates). If consumers track processed event_ids, duplicates are discarded harmlessly. If not, you have double-processing.

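Mitigation: Idempotent consumers with deduplication. Publishing to the broker and marking the row cannot share one transaction, so the relay cannot close this window on its own. Some brokers offer deduplication keyed on a message ID, which can narrow it but does not replace idempotent consumers.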

Scenario 4: Outbox Table Grows Unbounded

What happens: The relay is down for maintenance. Business continues. Outbox entries pile up. After a week, millions of rows exist. The relay starts up and struggles to catch up while the table scan degrades database performance.

Mitigation: Monitor unpublished event count and age. Set up alerts for stuck relays. Consider partitioning by time and dropping old partitions instead of deleting rows. Archive cold data to a separate table before dropping partitions.

Scenario 5: Payload Schema Change Breaks Consumers

What happens: A new version of the application changes the event payload schema. Old consumers that rely on specific fields break when they receive the new format.

Mitigation: Use versioned event types (e.g., order_created_v1, order_created_v2). Maintain backward compatibility within the same major version. Implement schema registry and validation at the relay layer.
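
On the consumer side, versioned event types are often handled by upgrading old payloads to the newest shape before the handler runs. The sketch below is one way to do that; the field rename from amount to total, the currency default, and the UPCASTERS table are all hypothetical examples.

```python
def upcast_v1_to_v2(payload):
    # Hypothetical change: v2 renamed "amount" to "total" and added "currency".
    upgraded = {k: v for k, v in payload.items() if k != "amount"}
    upgraded["total"] = payload["amount"]
    upgraded.setdefault("currency", "USD")  # assumed default for old events
    return upgraded

# Maps an old event type to (next event type, function that upgrades the payload).
UPCASTERS = {"order_created_v1": ("order_created_v2", upcast_v1_to_v2)}

def normalize(event_type, payload):
    """Upgrade old event versions step by step until no upcaster applies."""
    while event_type in UPCASTERS:
        event_type, fn = UPCASTERS[event_type]
        payload = fn(payload)
    return event_type, payload

print(normalize("order_created_v1", {"order_id": "123", "amount": 42}))
```

Handlers then only need to understand the latest version, and adding order_created_v3 means adding one more entry to the table.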

Quick Recap

Business data and outbox entry written in same database transaction
Relay process runs separately from application
Consumers are idempotent and deduplicate by event_id
Outbox table has appropriate indexes for relay polling
Published entries are cleaned up on a schedule
Unpublished entry count and age are monitored
Relay is deployed in HA configuration (active-passive or sharded)
Event payload does not contain sensitive data (PII, passwords, tokens)

Observability Checklist

The outbox pattern introduces a separate relay process and additional database writes. Without monitoring, problems in the relay go unnoticed until events pile up and consumers fall behind.

Metrics

Outbox lag: count and age of unpublished entries
Relay publish rate: events published per second
Publish failure rate: events that failed to publish
End-to-end latency: time from outbox write to consumer delivery
Consumer lag: how far behind consumers are from producer rate
Batch size histogram: how many events per publish cycle

Key Queries

-- How many unpublished events are piling up?
SELECT COUNT(*) FROM outbox WHERE published = false;

-- How old is the oldest unpublished event? (should be < 1 minute)
SELECT MAX(NOW() - created_at) FROM outbox WHERE published = false;

-- How many published events per minute?
SELECT DATE_TRUNC('minute', published_at), COUNT(*)
FROM outbox WHERE published = true
GROUP BY 1 ORDER BY 1 DESC LIMIT 10;

Logs

Log each batch of events published with count and duration
Log publish failures with error details and retry count
Log when relay acquires or loses leadership lock
Include event_id, aggregate_type, and aggregate_id in all logs for correlation

Alerts

Alert when unpublished event count exceeds threshold (relay is behind)
Alert when oldest unpublished event age exceeds threshold (relay is stuck)
Alert when publish failure rate exceeds baseline
Alert when relay loses leadership and standby doesn’t pick up (HA failure)
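
The first three alert conditions can be expressed as a small pure function fed by the key queries above. This is a sketch: evaluate_outbox_alerts, the alert names, and every threshold are placeholders to tune per system.

```python
from datetime import timedelta

def evaluate_outbox_alerts(unpublished_count, oldest_age, failure_rate,
                           max_count=10_000, max_age=timedelta(minutes=5),
                           max_failure_rate=0.01):
    """Return the names of alerts that should fire. Thresholds are placeholders."""
    alerts = []
    if unpublished_count > max_count:
        alerts.append("relay_behind")
    # oldest_age is None when the outbox has no unpublished rows.
    if oldest_age is not None and oldest_age > max_age:
        alerts.append("relay_stuck")
    if failure_rate > max_failure_rate:
        alerts.append("publish_failures")
    return alerts

print(evaluate_outbox_alerts(120, timedelta(minutes=12), 0.0))
```

A small backlog with a very old entry fires only relay_stuck, which usually points at a poison event rather than a capacity problem.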

Tracing

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

class OutboxRelay:
    def process_batch(self, events):
        with tracer.start_as_current_span("outbox.publish_batch") as span:
            span.set_attribute("batch.size", len(events))

            for event in events:
                with tracer.start_as_current_span("outbox.publish_event") as event_span:
                    event_span.set_attribute("event.id", event["event_id"])
                    event_span.set_attribute("event.type", event["event_type"])
                    event_span.set_attribute("aggregate.type", event["aggregate_type"])

                    try:
                        self.broker.publish(event["event_type"], event["payload"])
                        self.db.mark_published(event["id"])
                        event_span.set_status(Status(StatusCode.OK))
                    except Exception as e:
                        event_span.record_exception(e)
                        event_span.set_status(Status(StatusCode.ERROR, str(e)))
                        raise  # Stops the batch; remaining events wait for the next cycle

Security Checklist

The outbox pattern writes event payloads to a database table that a separate process reads and publishes. Misconfigured security can leak sensitive data through the event stream.

[ ] Encrypt outbox table data at rest (database-level encryption or application-level encryption of sensitive payload fields)
[ ] Encrypt relay-to-broker communication (TLS on broker connection)
[ ] Authenticate relay-to-broker connections (broker authentication)
[ ] Authorize which relays can publish to which topics (broker ACLs)
[ ] Validate payload schema before publishing (reject payloads that are too large or malformed)
[ ] Audit log all publish operations with event metadata
[ ] Do not put sensitive data (passwords, tokens, PII) directly in event payloads; use references or obfuscated IDs instead
[ ] Rate-limit the relay’s publish rate to prevent accidental or malicious overload of consumers
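
Two of the checklist items, payload validation and keeping sensitive fields out of events, can be enforced with a small pre-publish check. The sketch below is an illustration only: the 64 KiB limit, the deny-list of key names, and the `validate_payload` helper are assumptions, and a key-name check will not catch sensitive values stored under innocuous names.

```python
import json

MAX_PAYLOAD_BYTES = 64 * 1024  # assumed limit; align with your broker's max message size
SENSITIVE_KEYS = {"password", "token", "ssn", "card_number"}  # assumed deny-list


def validate_payload(payload, required):
    """Return a list of problems; an empty list means the payload may be published."""
    problems = []
    encoded = json.dumps(payload).encode("utf-8")
    if len(encoded) > MAX_PAYLOAD_BYTES:
        problems.append("payload too large")
    missing = set(required) - set(payload)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    leaked = SENSITIVE_KEYS & {key.lower() for key in payload}
    if leaked:
        problems.append(f"sensitive fields present: {sorted(leaked)}")
    return problems


print(validate_payload({"order_id": "o-1", "token": "abc"}, {"order_id", "total"}))
```

Running the check in the relay (or in the code that writes the outbox row) means a bad payload is rejected before it reaches any consumer.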

Interview Questions

1. What problem does the outbox pattern solve, and why is it needed?

The outbox pattern solves the dual-write problem, where an application must write to a database and publish a message to a broker as two separate operations. Without outbox, either operation can fail after the other succeeds, leaving the system inconsistent. The outbox pattern uses the database transaction as a single atomic boundary—both the business data and the outbox entry commit together, or neither does.
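
A minimal sketch of that atomic boundary, using SQLite from the standard library so it runs without any setup. The `orders` table, the `create_order` function, and the event IDs are hypothetical; the point is only that both inserts share one transaction.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL NOT NULL);
CREATE TABLE outbox (
    event_id   TEXT PRIMARY KEY,
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL,
    published  INTEGER NOT NULL DEFAULT 0
);
""")


def create_order(conn, order_id, total, event_id):
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (event_id, "order_created", json.dumps({"order_id": order_id, "total": total})),
        )


create_order(conn, "o-1", 42.0, "e-1")
try:
    # Reusing event_id "e-1" makes the outbox insert fail, which also undoes the order insert.
    create_order(conn, "o-2", 10.0, "e-1")
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1
```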

2. How does the relay/publisher component work in the outbox pattern?

The relay is a separate process that polls the outbox table for unpublished entries, publishes them to the message broker, and marks them as published. It runs independently from the application, so if it crashes, the application continues to write to the database normally. On restart, the relay picks up where it left off and publishes any missed events.
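
The poll, publish, mark cycle can be sketched in a few lines. This version uses SQLite and a plain callable as the "broker" so it is runnable as-is; `relay_once` and the table layout are assumptions, and a real relay would wrap this in a loop with a sleep interval and retry handling.

```python
import sqlite3


def relay_once(conn, publish, batch_size=100):
    """Publish one batch of unpublished rows and mark each row after it is sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, payload)  # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_type TEXT, "
    "payload TEXT, published INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
    [("order_created", "{}"), ("order_paid", "{}")],
)

sent = []
first = relay_once(conn, lambda event_type, payload: sent.append(event_type))
second = relay_once(conn, lambda event_type, payload: sent.append(event_type))
print(first, second, sent)  # 2 0 ['order_created', 'order_paid']
```

Because the mark happens after the publish, a crash between the two steps leaves the row unpublished and it is sent again on restart.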

3. What delivery semantics does the outbox pattern provide, and why?

The outbox pattern provides at-least-once delivery. Events may be published multiple times if the relay crashes after publishing but before marking as published. The consumer must be idempotent to handle duplicates and process each event exactly once. "Exactly-once" refers to business logic execution, not message transmission through the system.

4. What is idempotency, and why is it critical for consumers in the outbox pattern?

Idempotency means processing the same event multiple times produces the same result as processing it once. Consumers track which event_ids they have already processed, typically in a processed_events table or cache. When a duplicate event arrives, the consumer skips it. Without idempotency, duplicate delivery would cause double-processing bugs.
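
Here is a small sketch of the processed_events approach, again on SQLite so it runs standalone. The `balances` table and the `handle_deposit` handler are hypothetical; what matters is that the dedup insert and the business update commit in the same transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL);
INSERT INTO balances VALUES ('acct-1', 0);
""")


def handle_deposit(conn, event):
    """Apply the event once; return False if it was already processed."""
    try:
        with conn:
            # The primary key rejects a second insert of the same event_id.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (event["amount"], event["account"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False


event = {"event_id": "e-1", "account": "acct-1", "amount": 100}
print(handle_deposit(conn, event), handle_deposit(conn, event))  # True False
```

The duplicate delivery is detected and skipped, so the balance is credited once.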

5. What is the key index design for the outbox table, and why?

The critical index is `(published, created_at)` with a partial condition `WHERE published = false`. This supports the relay's query: find unpublished entries ordered by creation time. Because the partial index holds only unpublished rows, it stays small and should fit in memory even when the table does not. It is not a covering index for a relay that also selects the payload: the database uses the index to locate the rows and then reads the remaining columns from the table.
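
SQLite also supports partial indexes, so the idea can be tried without a PostgreSQL instance. The sketch below is a variant that indexes only `created_at`, since the partial condition already pins `published`; the index name and table layout are assumptions. It prints the query plan rather than asserting on it, because the plan depends on the SQLite version and table statistics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outbox (
    id         INTEGER PRIMARY KEY,
    created_at REAL NOT NULL,
    published  INTEGER NOT NULL DEFAULT 0,
    payload    TEXT NOT NULL
);
-- Partial index: only unpublished rows are indexed.
CREATE INDEX idx_outbox_unpublished ON outbox (created_at) WHERE published = 0;
""")
conn.executemany(
    "INSERT INTO outbox (created_at, published, payload) VALUES (?, ?, ?)",
    [(3.0, 0, "c"), (1.0, 1, "a"), (2.0, 0, "b")],
)

RELAY_QUERY = (
    "SELECT id, payload FROM outbox "
    "WHERE published = 0 ORDER BY created_at LIMIT 10"
)
rows = conn.execute(RELAY_QUERY).fetchall()
print(rows)  # [(3, 'b'), (1, 'c')]

# Inspect how SQLite plans the relay query.
for step in conn.execute("EXPLAIN QUERY PLAN " + RELAY_QUERY):
    print(step[-1])
```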

6. How does CDC (Change Data Capture) compare to polling relay for outbox publishing?

Polling relay uses simple interval-based queries, which is reliable and easy to operate but adds latency (typically 100ms-1s). CDC uses database notification mechanisms (PostgreSQL LISTEN/NOTIFY) or WAL reading (Debezium) to trigger publishing immediately after a commit, reducing latency to milliseconds. CDC has higher operational complexity and its own failure modes, but dramatically reduces end-to-end latency.

7. What are the main approaches for relay high availability?

Two approaches: active-passive with leader election (using database locks or Redis) where one relay instance is active and others standby, and active-active with sharding (using SELECT FOR UPDATE SKIP LOCKED) where multiple relays claim different batches concurrently. Active-passive is simpler; active-active scales throughput with relay count but requires more coordination logic.

8. Why might you partition the outbox table, and what partitioning strategy do you recommend?

Partitioning prevents the outbox table from becoming a bottleneck under high write load and simplifies cleanup of old entries. Time-range partitioning by day is recommended—daily partitions can be detached and dropped for efficient cleanup (O(1) rather than DELETE of millions of rows). For extreme throughput with write hotspots, composite partitioning on (published, created_at) keeps hot unpublished rows separate from cold published ones.
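
Daily partitions are usually created ahead of time by a scheduled job. The helper below only generates the PostgreSQL DDL string for one day, assuming a parent table named `outbox` that is range-partitioned on `created_at`; the `daily_partition_ddl` name and the partition naming scheme are hypothetical.

```python
from datetime import date, timedelta


def daily_partition_ddl(day, parent="outbox"):
    """Return the CREATE TABLE statement for the partition covering one day."""
    next_day = day + timedelta(days=1)
    name = f"{parent}_{day:%Y_%m_%d}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{day.isoformat()}') TO ('{next_day.isoformat()}');"
    )


print(daily_partition_ddl(date(2026, 3, 24)))
```

Before detaching or dropping an old partition, check that it holds no unpublished rows, since those are events that were never delivered.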

9. How does the outbox pattern combine with the Saga pattern?

Saga orchestrates distributed transactions as a sequence of steps, where each step publishes an event to trigger the next step. Without outbox, a publish failure can leave the saga stuck mid-flight. With outbox, each step's update and outbox write are atomic, so the event that triggers the next step is guaranteed to be published when the relay is available. The saga makes progress as soon as the relay recovers.

10. What happens when the relay falls behind or crashes for an extended period?

Unpublished entries accumulate in the outbox table. The application continues normally, writing both business data and outbox entries. When the relay restarts, it processes the backlog. Problems occur if the backlog grows very large (performance degradation on large table scans) or if consumers have moved on (stale events). Monitoring unpublished count and age catches this early. Partitioning helps by making cleanup efficient.

11. What are the main operational concerns when running the outbox pattern in production?

Key concerns: monitoring outbox lag (unpublished count and age), ensuring relay HA (multiple instances or leader election), managing outbox table growth through partitioning and cleanup policies, handling payload schema evolution across versions, and ensuring consumers are idempotent. Also important: alerting on stuck relays, logging publish batches and failures, and tracing event flow end-to-end.

12. When should you NOT use the outbox pattern?

The outbox pattern adds operational complexity (relay process, monitoring, idempotency logic) and introduces latency (poll interval or CDC delay). Do not use it when: missed events are acceptable (low-value notifications), immediate broker notification is required (low-latency use cases), your reliability requirements are low, or the added complexity is not justified by the business impact. The tradeoff is simplicity versus guaranteed delivery.

13. How does the outbox pattern connect to CQRS and event sourcing?

In CQRS, the outbox pattern provides reliable event propagation from command side to query side (read models). Without outbox, updating a read model directly can fail and leave systems out of sync. In event sourcing, the outbox decouples the event store (internal implementation) from consumers, which receive events through the broker like any event-driven system. The outbox pattern enables both patterns to evolve independently from their consumers.

14. What is the difference between marking as published versus deleting outbox entries?

Marking as published preserves the outbox entry for replay capability. If you need to reprocess events from the beginning (new consumer, debugging, disaster recovery), the marked entries support that. Deleting keeps the table small but loses replay capability—you cannot go back and reprocess past events. The tradeoff is storage and maintenance overhead versus replay flexibility.

15. How do you handle payload schema changes in an outbox-based system?

Use versioned event types (e.g., order_created_v1, order_created_v2). Maintain backward compatibility within major versions. Implement schema validation at the relay layer before publishing. Consider using a schema registry to track versions. When possible, add new fields rather than removing old ones. Consumers should ignore unknown fields to handle forward compatibility.
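
On the consumer side, versioned event types often end up as a small normalization step. The sketch below assumes a hypothetical change where `order_created_v2` added a `currency` field that v1 lacked; the field names and the `USD` default are illustrative only.

```python
def normalize_order_created(event_type, payload):
    """Map v1 and v2 payloads onto one internal shape, ignoring unknown fields."""
    if event_type == "order_created_v1":
        currency = "USD"  # assumed default for events written before the field existed
    elif event_type == "order_created_v2":
        currency = payload.get("currency", "USD")
    else:
        raise ValueError(f"unsupported event type: {event_type}")
    # Only the known fields are copied, so extra fields from newer producers are dropped.
    return {
        "order_id": payload["order_id"],
        "total": payload["total"],
        "currency": currency,
    }


v2 = {"order_id": "o-1", "total": 10, "currency": "EUR", "coupon": "SPRING"}
print(normalize_order_created("order_created_v2", v2))
```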

16. What security considerations are specific to the outbox pattern?

Event payloads in the outbox table can be read by the relay and delivered to brokers, so sensitive data (PII, passwords, tokens) must not appear directly in payloads—use references or obfuscated IDs instead. Additional concerns: encrypt outbox table at rest, TLS on relay-to-broker connections, broker authentication, topic ACLs to authorize which relays can publish where, and payload schema validation to reject malformed or oversized payloads.

17. What monitoring queries should you run against the outbox table in production?

Essential queries: count of unpublished entries (alert if above threshold), age of oldest unpublished entry (alert if above 1-2 minutes), published events per minute (throughput metric), and published_at age distribution (histogram of processing times). Log each batch publish with count and duration, and log relay leadership changes in HA setups. Include event_id, aggregate_type, and aggregate_id in all logs for correlation.

18. What is FOR UPDATE SKIP LOCKED and why is it useful for outbox relays?

FOR UPDATE SKIP LOCKED is a row-locking clause (available in PostgreSQL 9.5+, and also in MySQL 8.0+ and Oracle) that lets concurrent transactions claim different rows without blocking each other: rows already locked by another transaction are skipped instead of waited on. In an active-active relay setup, each relay instance uses this clause to grab a batch of unpublished outbox rows atomically. Each relay gets a different batch, processes and marks its own, and concurrency scales without central coordination.
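
A sketch of how one relay instance might claim a batch. It assumes a PostgreSQL connection from a DB-API driver that uses `%s` placeholders (psycopg, for example) and an outbox table with `id`, `event_type`, `payload`, `published`, `published_at`, and `created_at` columns; `claim_and_publish` is a hypothetical name, and the function is not tied to any specific driver.

```python
CLAIM_SQL = """
SELECT id, event_type, payload
FROM outbox
WHERE published = false
ORDER BY created_at
LIMIT %s
FOR UPDATE SKIP LOCKED
"""

MARK_SQL = "UPDATE outbox SET published = true, published_at = NOW() WHERE id = %s"


def claim_and_publish(conn, publish, batch_size=50):
    """Claim a batch, publish it, and mark it, all inside one transaction."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (batch_size,))
        rows = cur.fetchall()
        for row_id, event_type, payload in rows:
            publish(event_type, payload)
            cur.execute(MARK_SQL, (row_id,))
        conn.commit()  # releases the row locks
        return len(rows)
    except Exception:
        conn.rollback()  # the rows become claimable by another relay again
        raise
```

Holding the locks until commit means a relay that crashes mid-batch releases its rows automatically, at the cost of possibly republishing events it had already sent.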

19. How does the outbox pattern compare to transactional outbox (DB triggers/procedures)?

Transactional outbox uses native DB triggers or stored procedures to publish events directly within the database, without a separate relay process. It has the lowest infrastructure cost and high reliability, since the database handles everything, but it is DB-specific (no cross-database support) and has limited replay capability. The polling or CDC relay approaches are more operationally complex but offer better cross-DB support, replay, and flexibility.

20. What are the retention policy considerations for outbox entries?

Retention depends on use case: financial transactions typically need 30-90 days for audit compliance; order events typically 7-14 days; high-volume logs 1-3 days; debug/replay scenarios keep all for 24 hours then archive. Never delete unpublished entries—they represent undelivered events that indicate relay problems. Use partition-based cleanup for efficiency (drop old partitions rather than DELETE rows).
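
For tables that are not partitioned, the same policy can be applied with a guarded DELETE. The sketch below uses SQLite and epoch-second timestamps so it runs standalone; the `category` column, the `purge_published` helper, and the specific day counts (the upper ends of the ranges above) are assumptions.

```python
import sqlite3

RETENTION_DAYS = {"financial": 90, "order": 14, "log": 3}  # placeholder policy
DAY = 86_400  # seconds


def purge_published(conn, category, now):
    """Delete published rows past retention; unpublished rows are never touched."""
    cutoff = now - RETENTION_DAYS[category] * DAY
    with conn:
        cur = conn.execute(
            "DELETE FROM outbox "
            "WHERE category = ? AND published = 1 AND published_at < ?",
            (category, cutoff),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, category TEXT, "
    "published INTEGER, published_at REAL)"
)
conn.executemany(
    "INSERT INTO outbox (category, published, published_at) VALUES (?, ?, ?)",
    [
        ("order", 1, 50 * DAY),      # past the 14-day window: deleted
        ("order", 1, 95 * DAY),      # still inside the window: kept
        ("order", 0, None),          # unpublished: kept
        ("financial", 1, 50 * DAY),  # inside the 90-day window: kept
    ],
)
print(purge_published(conn, "order", now=100 * DAY))  # 1
```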

Conclusion

The outbox pattern fixes the dual-write problem by using the database transaction as the atomicity boundary. Business data and outbox entries commit together. A separate relay publishes the events. Idempotent consumers handle the duplicates that inevitably occur.

The pattern trades simplicity for reliability. You add infrastructure, monitoring, and idempotency logic. In return, you get guaranteed at-least-once delivery and, with idempotent consumers, effectively exactly-once processing without distributed transactions.

Whether that tradeoff makes sense depends on your reliability requirements. For systems where missed events mean lost money or broken state, the outbox pattern is worth the complexity.

