The Outbox Pattern: Reliable Event Publishing in Distributed Systems

Learn the transactional outbox pattern for reliable event publishing. Discover how to solve the dual-write problem, implement idempotent consumers, and achieve exactly-once delivery.

The dual-write problem bites you in production. Application writes to the database and publishes a message to a broker. Either the write succeeds and the publish fails, or vice versa. Your system ends up inconsistent, and you spend days building compensating logic, retry queues, and reconciliation jobs.

The outbox pattern fixes this. Instead of writing to the database and publishing as two separate operations, you do both in a single database transaction. The message goes into an outbox table. A separate process reads the outbox and publishes to the broker. Atomicity without distributed transactions.

Let me walk through the problem the outbox solves, how to implement it, and the operational considerations that matter.

The Dual-Write Problem

Modern applications frequently need to update state and notify other systems. User places an order: Orders table gets a new row, and the Inventory service needs to reserve stock. The naive approach is two writes.

# Dual write approach - problematic
def place_order(order):
    db.insert("orders", order)
    message_broker.publish("order_created", order)

This fails in a few ways. Database insert succeeds but broker publish fails. Inventory never gets the message and stock stays unreserved. Or broker succeeds but database fails. Inventory thinks the order exists when it does not. Both cases leave the system inconsistent.

You might add a retry.

# Retry approach - still problematic
def place_order(order):
    db.insert("orders", order)
    try:
        message_broker.publish("order_created", order)
    except Exception:
        retry_queue.add(order)

This handles transient failures but does not solve the fundamental issue. If the process crashes after the database insert commits but before the publish succeeds, or before the retry is enqueued, the message is simply lost. The retry queue is itself another system that can fail, and it needs its own handling for duplicates, ordering, and failures.

The root problem: two separate systems that must stay in sync, coordinated with application code rather than a shared transaction boundary.

How the Outbox Pattern Works

The outbox uses the database transaction as the shared boundary. Instead of publishing directly to the broker, the application writes to two tables in the same transaction: the business table and an outbox table.

def place_order(order):
    with db.transaction():
        db.insert("orders", order)
        db.insert("outbox", {
            "event_type": "order_created",
            "payload": json.dumps(order),
            "created_at": now()
        })

Both writes happen atomically. Transaction commits, both order and outbox entry exist. Transaction rolls back, neither exists.
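
To make the atomicity concrete, here is a runnable sketch using an in-memory SQLite database as a stand-in for the application database (the table shapes follow the example above; SQLite is purely for illustration):

```python
import json
import sqlite3

# In-memory SQLite stands in for the application database in this sketch.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")

def place_order(order):
    # The connection-as-context-manager gives one transaction:
    # both inserts commit together or roll back together.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["total"]))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order_created", json.dumps(order)),
        )

place_order({"id": "order-1", "total": 42.0})

# A failure mid-transaction leaves neither row behind.
try:
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", ("order-2", 10.0))
        raise RuntimeError("simulated crash before commit")
except RuntimeError:
    pass

orders = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
events = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(orders, events)  # 1 1
```

The second transaction rolls back entirely: no order row, no orphaned outbox entry.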

A separate process, often called the Relay or Publisher, reads the outbox table and publishes events to the broker.

class OutboxRelay:
    def process_outbox(self):
        events = db.fetch_unpublished(limit=100)
        for event in events:
            try:
                message_broker.publish(event["event_type"], event["payload"])
                db.mark_published(event["id"])
            except Exception:
                pass  # Will retry on next poll

After successful publish, the relay marks the outbox entry as published. If it crashes before marking, the entry gets reprocessed on restart. If it crashes after publishing but before marking, the event gets published again. Consumers must handle duplicates gracefully.
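
The crash window can be simulated in a few lines. This sketch, with plain in-memory structures standing in for the outbox table and the broker, shows why at-least-once delivery is the strongest guarantee the relay can make:

```python
# Simulates the relay crash window: publish succeeds, the mark is lost,
# so the next cycle republishes the same event. All names are illustrative.
outbox = [{"id": 1, "event_type": "order_created", "payload": "{}", "published": False}]
broker_log = []  # what the broker has received

def relay_cycle(crash_before_mark=False):
    for entry in outbox:
        if entry["published"]:
            continue
        broker_log.append(entry["id"])   # publish to broker
        if crash_before_mark:
            return                       # crash: mark never happens
        entry["published"] = True        # mark as published

relay_cycle(crash_before_mark=True)  # first attempt crashes after publish
relay_cycle()                        # restart reprocesses the entry
print(broker_log)  # [1, 1] - the consumer sees a duplicate
```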

Outbox Architecture Flow

flowchart TD
    subgraph Application["Application Layer"]
        App[Application Code]
        DB[(Business<br/>Table)]
        Outbox[(Outbox<br/>Table)]
        App --> |BEGIN TRANSACTION| T1[Transaction]
        T1 --> |INSERT| DB
        T1 --> |INSERT| Outbox
        T1 --> |COMMIT| Commit1[Commit]
        Commit1 --> |Atomic write| DB
        Commit1 --> |Atomic write| Outbox
    end

    subgraph Relay["Relay / Publisher"]
        Poll[Poll outbox<br/>entries]
        Fetch[Fetch unpublished]
        Publish[Publish to broker]
        Mark[Mark as published]
        Poll --> Fetch
        Fetch --> Publish
        Publish --> Mark
    end

    subgraph Broker["Message Broker"]
        Topic[Event Topic]
    end

    subgraph Consumer["Consumer Services"]
        Handler[Event Handler]
        Process[Process event<br/>idempotently]
        Handler --> Process
    end

    Outbox --> |1. Poll| Poll
    Publish --> |2. Publish event| Topic
    Topic --> |3. Deliver| Handler

Flow Steps:

| Step | Component | Action |
| --- | --- | --- |
| 1 | Application | Writes business data + outbox entry in single transaction |
| 2 | Relay | Polls outbox table for unpublished entries |
| 3 | Relay | Publishes event to message broker |
| 4 | Relay | Marks outbox entry as published |
| 5 | Broker | Delivers event to subscribed consumers |
| 6 | Consumer | Processes event idempotently |

Key Properties:

  • Atomicity: Business data and outbox entry commit together or not at all
  • Reliability: Relay retries ensure eventual delivery
  • At-least-once: Events may be published multiple times if relay crashes after publish but before mark
  • Idempotency required: Consumers must deduplicate using event IDs

For more on event-driven patterns, see Event-Driven Architecture.

Why Not Just Use Transactions?

Traditional database transactions cannot span both the database and the message broker. The broker is external. You cannot begin a transaction, write to the database, write to the broker, and commit as one unit.

The outbox works around this by keeping everything in the database. The transaction boundary covers the business data and the outbox entry. The relay publishes from the outbox to the broker outside the transaction. Atomicity for the write, eventual consistency for the notification.

You lose immediate notification. The broker does not get the message until the relay polls, processes, and publishes. For low-latency requirements, this delay matters. For most use cases, milliseconds or seconds is acceptable.

Outbox Table Schema Design

The outbox table schema matters for performance. A poorly designed outbox table can become a bottleneck under high write load.

Basic Schema

CREATE TABLE outbox (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type      VARCHAR(255) NOT NULL,
    aggregate_type  VARCHAR(255),          -- e.g., 'Order', 'Payment'
    aggregate_id    VARCHAR(255) NOT NULL,   -- e.g., order-123
    payload         JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at    TIMESTAMPTZ,             -- NULL = unpublished
    published       BOOLEAN NOT NULL DEFAULT false,

    -- Idempotency key for deduplication
    event_id       VARCHAR(255) UNIQUE NOT NULL
);

-- Index for relay polling: find unpublished entries efficiently
CREATE INDEX idx_outbox_unpublished
    ON outbox (published, created_at)
    WHERE published = false;

-- Index for aggregate lookups (useful for debugging/replays)
CREATE INDEX idx_outbox_aggregate
    ON outbox (aggregate_type, aggregate_id, created_at);

Partitioning Strategies

For high-throughput systems, partitioning the outbox table prevents it from becoming a bottleneck.

By time range (recommended for most cases):

-- Partition by day for easy TTL cleanup and efficient querying
CREATE TABLE outbox (
    id              UUID DEFAULT gen_random_uuid(),
    event_type      VARCHAR(255) NOT NULL,
    aggregate_type  VARCHAR(255),
    aggregate_id    VARCHAR(255) NOT NULL,
    payload         JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL,
    published_at    TIMESTAMPTZ,
    published       BOOLEAN NOT NULL DEFAULT false,
    event_id        VARCHAR(255) UNIQUE NOT NULL,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Create daily partitions
CREATE TABLE outbox_2026_03_01 PARTITION OF outbox
    FOR VALUES FROM ('2026-03-01') TO ('2026-03-02');
-- ... etc

Partition pruning benefits: Queries for unpublished entries in a specific time range only scan the relevant partition. Old partitions can be detached and dropped for efficient cleanup.

Avoiding Hot Partitions

If all writes go to the most recent partition, you get a write hotspot. For extreme throughput, use a composite partition key:

-- Partition by (published status, date) creates a "hot" partition for
-- unpublished events and "cold" partitions for published ones
CREATE TABLE outbox (
    ...
    published       BOOLEAN NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (id, published, created_at)
) PARTITION BY LIST (published);

Index Design for Relay Performance

The relay needs to find unpublished entries fast. The compound index (published, created_at) where published = false is the critical path:

  • Relay queries: WHERE published = false ORDER BY created_at LIMIT N
  • The index covers this query entirely (index-only scan)
  • As the table grows, this index should fit in memory even if the table does not

-- Verify the index is used with: EXPLAIN (ANALYZE, BUFFERS) SELECT ...
SELECT id FROM outbox
WHERE published = false
ORDER BY created_at
LIMIT 100;

Schema Design Summary

| Concern | Recommendation |
| --- | --- |
| Primary key | UUID (no sequence contention) |
| Event ID | Business-meaningful ID (order-123-created), not a random UUID |
| Payload | JSONB (flexible, queryable) |
| Index for relay | (published, created_at) WHERE published = false |
| Partitioning | By time range (daily partitions) |
| Idempotency | UNIQUE constraint on event_id |

Relay Implementations

Different relay implementations suit different situations.

Polling Relay

The simplest approach polls the outbox table at regular intervals.

class PollingRelay:
    def __init__(self, db, broker, interval_ms=100):
        self.db = db
        self.broker = broker
        self.interval = interval_ms / 1000

    def run(self):
        while True:
            events = self.db.fetch_unpublished(limit=100, order="created_at")
            for event in events:
                self.publish_and_mark(event)
            sleep(self.interval)

    def publish_and_mark(self, event):
        try:
            self.broker.publish(event["event_type"], event["payload"])
            self.db.mark_published(event["id"])
        except Exception:
            pass

Polling is simple and reliable. The tradeoff is latency. With a 100ms interval, events publish within 100ms at best. Reduce the interval and database load goes up.
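
One common way to soften that tradeoff is adaptive polling: poll fast while there is work, back off exponentially when idle. This is a general technique rather than part of the pattern itself; a minimal sketch with illustrative names:

```python
class AdaptivePollingRelay:
    """Sketch: shrink the poll interval while there is work, back off when idle.

    `fetch_unpublished` and `publish_and_mark` are assumed to exist as in the
    polling relay above; only the interval logic is shown here.
    """

    def __init__(self, min_interval=0.05, max_interval=2.0):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.interval = min_interval

    def next_interval(self, events_fetched):
        if events_fetched > 0:
            self.interval = self.min_interval  # busy: poll at full speed
        else:
            # idle: exponential backoff, capped at max_interval
            self.interval = min(self.interval * 2, self.max_interval)
        return self.interval

relay = AdaptivePollingRelay()
print(relay.next_interval(10))  # 0.05 - work found, stay fast
print(relay.next_interval(0))   # 0.1  - idle, back off
print(relay.next_interval(0))   # 0.2
```

Under load the relay behaves like a 50ms poller; during quiet periods it costs the database a query every couple of seconds.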

Change Data Capture

A more sophisticated approach uses CDC. Instead of polling, the database notifies the relay when new outbox entries appear. PostgreSQL’s LISTEN/NOTIFY, or tools like Debezium that read the write-ahead log, can drive this.

class CdcRelay:
    def __init__(self, db, broker):
        self.db = db
        self.broker = broker
        self.db.listen("outbox_inserts", self.handle_notification)

    def handle_notification(self):
        events = self.db.fetch_unpublished(limit=100)
        for event in events:
            self.publish_and_mark(event)

CDC cuts latency significantly. Events publish seconds or milliseconds after commit rather than on the next poll cycle. The cost is operational complexity. CDC tools add infrastructure and their own failure modes.

Polling vs CDC vs Transactional Outbox Comparison

| Aspect | Polling Relay | CDC Relay | Transactional Outbox |
| --- | --- | --- | --- |
| Latency | Interval-based (100ms-1s typical) | Near real-time (ms-s) | Low (within transaction commit) |
| Database load | Higher (continuous polling) | Low (event-driven) | Lowest (no polling) |
| Infrastructure | Simple process | CDC tool (Debezium, AWS DMS) | Native DB triggers/procedures |
| Reliability | High (simple, testable) | High (but CDC tool has own failure modes) | Highest (DB handles all) |
| Operational complexity | Low | High (CDC tool management) | Medium (DB procedures) |
| Replay capability | Yes (mark-based) | Yes (CDC log retention) | Limited (WAL-based) |
| Cross-database support | Yes | Limited (CDC per-DB) | No (DB-specific) |
| Example tools | Custom polling process | Debezium, AWS DMS, PostgreSQL LISTEN/NOTIFY | PostgreSQL triggers, SQL Server Service Broker |
| When to use | Low throughput, simple setup | High throughput, low latency needs | Maximum reliability, DB-native solution |

Deleting Instead of Marking

Some implementations delete outbox entries after successful publish rather than marking them.

def publish_and_mark(self, event):
    self.broker.publish(event["event_type"], event["payload"])
    self.db.delete_outbox_entry(event["id"])

This keeps the outbox table small. The cost is losing replay capability. Marking as published preserves the data and lets you replay from the beginning if needed.

Handling Failures and Idempotency

The outbox pattern guarantees at-least-once delivery. Relay publishes an event, then marks it published. If the mark fails, the event gets republished on the next cycle. Consumers must handle duplicates.

Idempotent consumers solve this. The consumer tracks which events it has already processed, using an event ID in a processed_events table or cache.

class IdempotentConsumer:
    def __init__(self, broker, db):
        self.broker = broker
        self.db = db

    def handle_order_created(self, payload):
        event_id = payload["event_id"]
        if self.db.already_processed(event_id):
            return  # Skip duplicate

        order = payload["order"]
        self.process_order(order)
        self.db.mark_processed(event_id)

The idempotency table needs its own cleanup to avoid growing forever. Purge processed events older than a certain age, or use a sliding window.
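
The check-then-mark sequence above can race if two consumer instances receive the same duplicate concurrently. A sturdier variant lets a UNIQUE constraint do the deduplication: claim the event ID with an insert first, and skip processing if the insert was ignored. A runnable sketch using in-memory SQLite (table and helper names are illustrative); the `processed_at` column is what the cleanup job purges on:

```python
import sqlite3

# Let the database enforce deduplication instead of a separate
# check-then-mark, which can race under concurrent deliveries.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE processed_events (
        event_id     TEXT PRIMARY KEY,
        processed_at TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")

def handle_event(event_id, process):
    with db:  # claim and business logic commit together
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate: this event was already claimed
        process()
        return True

calls = []
first = handle_event("order-1-created", lambda: calls.append(1))
second = handle_event("order-1-created", lambda: calls.append(1))
print(first, second, len(calls))  # True False 1 - logic ran exactly once
```

In PostgreSQL the equivalent is `INSERT ... ON CONFLICT DO NOTHING` inside the consumer's transaction.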

Exactly-Once Delivery

The outbox pattern, combined with idempotent consumers, gives you exactly-once semantics end-to-end. Database and outbox write happen atomically. Relay delivers the event at least once. Consumer processes it exactly once.

The nuance: “exactly once” means the business logic executes once, not that the event moves through the system once. The event may cross the broker multiple times. The consumer must recognize and deduplicate.

For a broader view of delivery semantics, see Saga Pattern which often pairs with outbox for distributed transaction handling.

Operational Considerations

The outbox table grows if the relay falls behind or publishes repeatedly fail. Monitor for stuck entries.

SELECT COUNT(*) FROM outbox
WHERE published = false
AND created_at < NOW() - INTERVAL '5 minutes';

Alerting on this query catches relay failures and persistent publish errors.

Consider the relay’s failure modes too. If the relay crashes mid-batch, some events get published but not marked. The next relay instance reprocesses them. As long as consumers are idempotent, this is fine.

For high-throughput systems, batching helps. The relay fetches multiple events, publishes them in a single broker batch if supported, then marks them all published. Fewer broker round trips.
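
A sketch of that batching loop, using in-memory SQLite and a stub broker. The `publish_batch` call is an assumed broker capability; with brokers that lack one, you would loop over single publishes but still batch the mark step into one UPDATE:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_type TEXT, "
           "payload TEXT, published INTEGER DEFAULT 0)")
db.executemany("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
               [("order_created", "{}")] * 250)
db.commit()

published = []

class StubBroker:
    def publish_batch(self, events):
        published.extend(e[0] for e in events)  # one round trip per batch

broker = StubBroker()

def process_batch(limit=100):
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 "
        "ORDER BY id LIMIT ?", (limit,)).fetchall()
    if not rows:
        return 0
    broker.publish_batch(rows)
    ids = [r[0] for r in rows]
    placeholders = ",".join("?" * len(ids))
    with db:  # mark the whole batch in a single UPDATE
        db.execute(f"UPDATE outbox SET published = 1 "
                   f"WHERE id IN ({placeholders})", ids)
    return len(rows)

total = 0
while (n := process_batch()) > 0:
    total += n
print(total)  # 250 events published, three broker round trips
```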

Relay High Availability

A single relay is a single point of failure. If it crashes, events pile up in the outbox until someone notices. For production systems, you need multiple relay instances.

Active-passive (leader election):

Use a lock in the database or Redis to elect an active relay. The active relay polls and publishes. Passive relays standby, ready to take over if the active dies.

import socket
import time

import redis

class HaRelay:
    def __init__(self, db, broker, lock_key, ttl_seconds=30):
        self.db = db
        self.broker = broker
        self.lock_key = lock_key
        self.ttl = ttl_seconds
        self.redis = redis.Redis()

    def run(self):
        while True:
            acquired = self.redis.set(self.lock_key, socket.gethostname(),
                                       nx=True, ex=self.ttl)
            if acquired:
                self.do_work()  # Only active relay does work
            else:
                time.sleep(1)   # Standby, wait for lock release

    def do_work(self):
        while True:
            try:
                # Renew the lock each cycle so it does not expire mid-batch;
                # if renewal fails, another instance may own the lock now
                if not self.redis.expire(self.lock_key, self.ttl):
                    break
                events = self.db.fetch_unpublished(limit=100)
                for event in events:
                    self.broker.publish(event["event_type"], event["payload"])
                    self.db.mark_published(event["id"])
            except redis.ConnectionError:
                break  # Lost Redis, go back to standby

The nx=True means only one instance can hold the lock. The ex=self.ttl means the lock auto-releases if the active relay crashes and doesn’t renew. Passive instances detect the lock is gone and one of them acquires it.

Active-active (multiple relays with sharding):

Multiple relays can run simultaneously if they coordinate which outbox rows each one handles. Use SELECT ... FOR UPDATE SKIP LOCKED to let each relay claim a different batch:

class ShardedRelay:
    def process_outbox(self):
        # FOR UPDATE SKIP LOCKED: grab rows without blocking other relays
        events = self.db.fetch_unpublished(
            "SELECT * FROM outbox WHERE published = false "
            "ORDER BY created_at LIMIT 100 FOR UPDATE SKIP LOCKED"
        )
        for event in events:
            try:
                self.broker.publish(event["event_type"], event["payload"])
                self.db.mark_published(event["id"])
            except Exception:
                pass  # Let other relay try on next cycle

Both relays run concurrently, each grabbing different rows. No coordination overhead, no leader election needed. Throughput scales with relay count.

Outbox Entry TTL and Cleanup

The outbox table grows over time. Published entries that are never cleaned up become dead weight. You need a cleanup strategy.

Soft delete with published flag (already in schema):

# Mark as published, don't delete
db.mark_published(event["id"])

Hard delete for old published entries (run as a scheduled job):

-- Clean up published entries older than 7 days
DELETE FROM outbox
WHERE published = true
AND published_at < NOW() - INTERVAL '7 days';

Automated cleanup with partition expiry (if using time-based partitioning):

-- Detach and drop old partitions
ALTER TABLE outbox DETACH PARTITION outbox_2026_02_01;
DROP TABLE outbox_2026_02_01;

Partition pruning makes cleanup essentially free — dropping a partition is O(1) rather than deleting millions of rows.

Retention policy guidance:

| Use case | TTL for published entries |
| --- | --- |
| Financial transactions | 30-90 days (audit requirements) |
| Order events | 7-14 days |
| High-volume event logs | 1-3 days |
| Debug/replay scenarios | Keep all for 24h, then batch archive |

Never delete unpublished entries — they represent events that haven’t been delivered yet. If you see unpublished entries older than your poll interval, something is wrong with the relay.

Combining with Saga

The outbox pattern pairs naturally with saga. In a saga, each step updates a service and publishes an event to trigger the next step. Without outbox, the publish can fail and leave the saga stuck.

With outbox, the step’s update and outbox write are atomic. The relay publishes the event that triggers the next step. If the relay is down, events queue up and publish when it recovers. The saga makes progress as soon as the relay is available.
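
A sketch of one saga step built on the outbox: an inventory service handler reserves stock and writes the event that triggers the next step in the same transaction. SQLite stands in for the service's database; table and event names are illustrative:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE reservations (order_id TEXT PRIMARY KEY, qty INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_type TEXT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def handle_order_created(event):
    order = json.loads(event["payload"])
    with db:  # state change + next saga event commit together
        db.execute("INSERT INTO reservations VALUES (?, ?)",
                   (order["order_id"], order["qty"]))
        db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                   ("stock_reserved",
                    json.dumps({"order_id": order["order_id"]})))

handle_order_created({"payload": json.dumps({"order_id": "order-1", "qty": 2})})

next_event = db.execute("SELECT event_type FROM outbox").fetchone()[0]
print(next_event)  # stock_reserved
```

If the handler fails, neither the reservation nor the next-step event exists, so the saga can be safely retried from the incoming event.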

For details on saga implementation, see Saga Pattern.

When to Use the Outbox Pattern

Use the outbox pattern when you need reliability guarantees for event publishing and can tolerate the additional latency (hundreds of milliseconds to seconds) between the database write and broker notification.

It works well in high-reliability scenarios: financial transactions, inventory updates, any domain where missed events create real problems. Well-suited to event-driven microservices where services communicate through events rather than direct calls.

The outbox adds operational complexity. You run the relay as a separate process. You monitor it. You handle idempotency in consumers. For low-reliability use cases or one-way notifications where missed events are acceptable, this complexity may not pay off.

For an overview of distributed transaction patterns, see Distributed Transactions.

Observability Checklist

The outbox pattern introduces a separate relay process and additional database writes. Without monitoring, problems in the relay go unnoticed until events pile up and consumers fall behind.

Metrics

  • Outbox lag: count and age of unpublished entries
  • Relay publish rate: events published per second
  • Publish failure rate: events that failed to publish
  • End-to-end latency: time from outbox write to consumer delivery
  • Consumer lag: how far behind consumers are from producer rate
  • Batch size histogram: how many events per publish cycle

Key Queries

-- How many unpublished events are piling up?
SELECT COUNT(*) FROM outbox WHERE published = false;

-- How old is the oldest unpublished event? (should be < 1 minute)
SELECT MAX(NOW() - created_at) FROM outbox WHERE published = false;

-- How many published events per minute?
SELECT DATE_TRUNC('minute', published_at), COUNT(*)
FROM outbox WHERE published = true
GROUP BY 1 ORDER BY 1 DESC LIMIT 10;

Logs

  • Log each batch of events published with count and duration
  • Log publish failures with error details and retry count
  • Log when relay acquires or loses leadership lock
  • Include event_id, aggregate_type, and aggregate_id in all logs for correlation

Alerts

  • Alert when unpublished event count exceeds threshold (relay is behind)
  • Alert when oldest unpublished event age exceeds threshold (relay is stuck)
  • Alert when publish failure rate exceeds baseline
  • Alert when relay loses leadership and standby doesn’t pick up (HA failure)

Tracing

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

class OutboxRelay:
    def process_batch(self, events):
        with tracer.start_as_current_span("outbox.publish_batch") as span:
            span.set_attribute("batch.size", len(events))

            for event in events:
                with tracer.start_as_current_span("outbox.publish_event") as event_span:
                    event_span.set_attribute("event.id", event["event_id"])
                    event_span.set_attribute("event.type", event["event_type"])
                    event_span.set_attribute("aggregate.type", event["aggregate_type"])

                    try:
                        self.broker.publish(event["event_type"], event["payload"])
                        self.db.mark_published(event["id"])
                        event_span.set_status(trace.StatusCode.OK)
                    except Exception as e:
                        event_span.record_exception(e)
                        event_span.set_status(trace.StatusCode.ERROR)
                        raise

Security Checklist

The outbox pattern writes event payloads to a database table that a separate process reads and publishes. Misconfigured security can leak sensitive data through the event stream.

  • [ ] Encrypt outbox table data at rest (database-level encryption or application-level encryption of sensitive payload fields)
  • [ ] Encrypt relay-to-broker communication (TLS on broker connection)
  • [ ] Authenticate relay-to-broker connections (broker authentication)
  • [ ] Authorize which relays can publish to which topics (broker ACLs)
  • [ ] Validate payload schema before publishing (reject payloads that are too large or malformed)
  • [ ] Audit log all publish operations with event metadata
  • [ ] Do not put sensitive data (passwords, tokens, PII) directly in event payloads — use references or obfuscated IDs instead
  • [ ] Rate-limit the relay’s publish rate to prevent accidental or malicious overload of consumers

Connection to CQRS and Event Sourcing

The outbox pattern is a foundational piece for event-driven architectures. It connects directly to CQRS and event sourcing patterns.

CQRS (Command Query Responsibility Segregation)

In CQRS, you separate read models from write models. Commands (writes) update the command side. Queries (reads) read from potentially different read models updated via events.

The outbox pattern is how you get events from the command side to the query side reliably:

Command Side                          Query Side
┌─────────────────┐                  ┌─────────────────┐
│ Order Service   │  outbox + relay  │ Read Model      │
│ - place_order   │ ───────────────► │ - OrderView     │
│ - update_status │    (reliable     │ - CustomerView  │
│ - cancel        │  event stream)   │ - Analytics DB  │
└─────────────────┘                  └─────────────────┘

Without outbox, you might update the command model and then directly update the read model. That direct call can fail and leave the systems out of sync. With outbox, the update and event publication are atomic — the read model eventually catches up, never misses an event.
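
A minimal sketch of the query side: a projector that consumes the outbox-driven event stream and maintains a denormalized order view, deduplicating by event ID since delivery is at-least-once. All names here are illustrative:

```python
import json

class OrderViewProjector:
    """Query-side projector: applies events to a denormalized read model."""

    def __init__(self):
        self.views = {}   # order_id -> denormalized view
        self.seen = set() # event_id dedup, since delivery is at-least-once

    def apply(self, event):
        if event["event_id"] in self.seen:
            return  # duplicate delivery, already projected
        self.seen.add(event["event_id"])
        body = json.loads(event["payload"])
        if event["event_type"] == "order_created":
            self.views[body["order_id"]] = {"status": "created", **body}
        elif event["event_type"] == "order_shipped":
            self.views[body["order_id"]]["status"] = "shipped"

projector = OrderViewProjector()
projector.apply({"event_id": "e1", "event_type": "order_created",
                 "payload": json.dumps({"order_id": "o1", "total": 42.0})})
projector.apply({"event_id": "e2", "event_type": "order_shipped",
                 "payload": json.dumps({"order_id": "o1"})})
projector.apply({"event_id": "e2", "event_type": "order_shipped",
                 "payload": json.dumps({"order_id": "o1"})})  # duplicate, ignored
print(projector.views["o1"]["status"])  # shipped
```

The read model lags the command side by the relay's publish latency, but it never permanently misses an event.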

Event Sourcing

Event sourcing takes CQRS further: instead of storing current state, you store the sequence of events that produced the current state. The event store is the source of truth.

The outbox pattern works well with event sourcing:

def place_order(order):
    with db.transaction():
        # Append to event log (the "event store")
        db.insert("events", {
            "event_type": "ORDER_PLACED",
            "aggregate_id": order.id,
            "payload": json.dumps(order),
            "sequence": get_next_sequence("Order", order.id)
        })

        # Also write to outbox for reliable notification
        db.insert("outbox", {
            "event_type": "ORDER_PLACED",
            "aggregate_id": order.id,
            "payload": json.dumps(order),
            "event_id": f"order-{order.id}-placed"
        })

The event store is the authoritative log. The outbox ensures events propagate to consumers (read models, audit logs, other services). If you only use event sourcing without outbox, you rely on consumers to read the event store directly — which couples them to your storage mechanism and makes it harder to evolve the event schema.

The outbox decouples: your event store is an internal implementation detail. Consumers receive events through the broker like any other event-driven system.

Conclusion

The outbox pattern fixes the dual-write problem by using the database transaction as the atomicity boundary. Business data and outbox entries commit together. A separate relay publishes the events. Idempotent consumers handle the duplicates that inevitably occur.

The pattern trades simplicity for reliability. You add infrastructure, monitoring, and idempotency logic. In return, you get guaranteed at-least-once delivery and exactly-once processing without distributed transactions.

Whether that tradeoff makes sense depends on your reliability requirements. For systems where missed events mean lost money or broken state, the outbox pattern is worth the complexity.

Related Posts

TCC: Try-Confirm-Cancel Pattern for Distributed Transactions

Learn the Try-Confirm-Cancel pattern for distributed transactions. Explore how TCC differs from 2PC and saga, with implementation examples and real-world use cases.

#distributed-systems #transactions #saga

Distributed Transactions: ACID vs BASE Trade-offs

Explore distributed transaction patterns: ACID vs BASE trade-offs, two-phase commit, saga pattern, eventual consistency, and choosing the right model.

#distributed-systems #transactions #consistency

Backpressure Handling: Protecting Pipelines from Overload

Learn how to implement backpressure in data pipelines to prevent cascading failures, handle overload gracefully, and maintain system stability.

#data-engineering #backpressure #data-pipelines