The Outbox Pattern: Reliable Event Publishing in Distributed Systems
The dual-write problem bites you in production. Application writes to the database and publishes a message to a broker. Either the write succeeds and the publish fails, or vice versa. Your system ends up inconsistent, and you spend days building compensating logic, retry queues, and reconciliation jobs.
The outbox pattern fixes this. Instead of writing to the database and publishing as two separate operations, you do both in a single database transaction. The message goes into an outbox table. A separate process reads the outbox and publishes to the broker. Atomicity without distributed transactions.
Let me walk through the problem the outbox solves, how to implement it, and the operational considerations that matter.
The Dual-Write Problem
Modern applications frequently need to update state and notify other systems. User places an order: Orders table gets a new row, and the Inventory service needs to reserve stock. The naive approach is two writes.
# Dual write approach - problematic
def place_order(order):
    db.insert("orders", order)
    message_broker.publish("order_created", order)
This fails in a few ways. Database insert succeeds but broker publish fails. Inventory never gets the message and stock stays unreserved. Or broker succeeds but database fails. Inventory thinks the order exists when it does not. Both cases leave the system inconsistent.
You might add a retry.
# Retry approach - still problematic
def place_order(order):
    db.insert("orders", order)
    try:
        message_broker.publish("order_created", order)
    except Exception:
        retry_queue.add(order)
This handles transient broker failures but does not solve the fundamental issue. If the process crashes after the database insert but before the publish and the retry enqueue, the event is lost entirely. The retry queue also needs its own handling for duplicates, ordering, and failures.
The root problem: two separate systems that must stay in sync, coordinated with application code rather than a shared transaction boundary.
How the Outbox Pattern Works
The outbox uses the database transaction as the shared boundary. Instead of publishing directly to the broker, the application writes to two tables in the same transaction: the business table and an outbox table.
def place_order(order):
    with db.transaction():
        db.insert("orders", order)
        db.insert("outbox", {
            "event_type": "order_created",
            "payload": json.dumps(order),
            "created_at": now()
        })
Both writes happen atomically. Transaction commits, both order and outbox entry exist. Transaction rolls back, neither exists.
A separate process, often called the Relay or Publisher, reads the outbox table and publishes events to the broker.
class OutboxRelay:
    def process_outbox(self):
        events = db.fetch_unpublished(limit=100)
        for event in events:
            try:
                message_broker.publish(event["event_type"], event["payload"])
                db.mark_published(event["id"])
            except Exception:
                pass  # Entry stays unpublished; retried on next poll
After successful publish, the relay marks the outbox entry as published. If it crashes before marking, the entry gets reprocessed on restart. If it crashes after publishing but before marking, the event gets published again. Consumers must handle duplicates gracefully.
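This at-least-once behavior is easy to see in a toy simulation. The in-memory outbox and delivered list below are hypothetical stand-ins for the outbox table and the broker's topic:

```python
# Simulate a relay that crashes after publishing but before marking.
outbox = [{"id": 1, "event_type": "order_created", "payload": "{}", "published": False}]
delivered = []  # stands in for the broker's topic

def relay_pass(crash_before_mark=False):
    for entry in outbox:
        if entry["published"]:
            continue
        delivered.append(entry["id"])  # publish to the broker
        if crash_before_mark:
            return  # crash: the entry is never marked published
        entry["published"] = True  # mark published

relay_pass(crash_before_mark=True)  # first run crashes mid-cycle
relay_pass()                        # restart: the same event is republished
print(delivered)  # [1, 1] -- the consumer sees the event twice
```

The duplicate in `delivered` is exactly what idempotent consumers exist to absorb.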
Outbox Architecture Flow
flowchart TD
subgraph Application["Application Layer"]
App[Application Code]
DB[(Business<br/>Table)]
Outbox[(Outbox<br/>Table)]
App --> |BEGIN TRANSACTION| T1[Transaction]
T1 --> |INSERT| DB
T1 --> |INSERT| Outbox
T1 --> |COMMIT| Commit1[Commit]
Commit1 --> |Atomic write| DB
Commit1 --> |Atomic write| Outbox
end
subgraph Relay["Relay / Publisher"]
Poll[Poll outbox<br/>entries]
Fetch[Fetch unpublished]
Publish[Publish to broker]
Mark[Mark as published]
Poll --> Fetch
Fetch --> Publish
Publish --> Mark
end
subgraph Broker["Message Broker"]
Topic[Event Topic]
end
subgraph Consumer["Consumer Services"]
Handler[Event Handler]
Process[Process event<br/>idempotently]
Handler --> Process
end
Outbox --> |1. Poll| Poll
Publish --> |2. Publish event| Topic
Topic --> |3. Deliver| Handler
Flow Steps:
| Step | Component | Action |
|---|---|---|
| 1 | Application | Writes business data + outbox entry in single transaction |
| 2 | Relay | Polls outbox table for unpublished entries |
| 3 | Relay | Publishes event to message broker |
| 4 | Relay | Marks outbox entry as published |
| 5 | Broker | Delivers event to subscribed consumers |
| 6 | Consumer | Processes event idempotently |
Key Properties:
- Atomicity: Business data and outbox entry commit together or not at all
- Reliability: Relay retries ensure eventual delivery
- At-least-once: Events may be published multiple times if relay crashes after publish but before mark
- Idempotency required: Consumers must deduplicate using event IDs
For more on event-driven patterns, see Event-Driven Architecture.
Why Not Just Use Transactions?
Traditional database transactions cannot span both the database and the message broker. The broker is external. You cannot begin a transaction, write to the database, write to the broker, and commit as one unit.
The outbox works around this by keeping everything in the database. The transaction boundary covers the business data and the outbox entry. The relay publishes from the outbox to the broker outside the transaction. Atomicity for the write, eventual consistency for the notification.
You lose immediate notification. The broker does not get the message until the relay polls, processes, and publishes. For low-latency requirements, this delay matters. For most use cases, milliseconds or seconds is acceptable.
Outbox Table Schema Design
The outbox table schema matters for performance. A poorly designed outbox table can become a bottleneck under high write load.
Basic Schema
CREATE TABLE outbox (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type VARCHAR(255) NOT NULL,
    aggregate_type VARCHAR(255),          -- e.g., 'Order', 'Payment'
    aggregate_id VARCHAR(255) NOT NULL,   -- e.g., order-123
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ,             -- NULL = unpublished
    published BOOLEAN NOT NULL DEFAULT false,
    -- Idempotency key for deduplication
    event_id VARCHAR(255) UNIQUE NOT NULL
);
-- Index for relay polling: find unpublished entries efficiently
CREATE INDEX idx_outbox_unpublished
    ON outbox (published, created_at)
    WHERE published = false;
-- Index for aggregate lookups (useful for debugging/replays)
CREATE INDEX idx_outbox_aggregate
    ON outbox (aggregate_type, aggregate_id, created_at);
Partitioning Strategies
For high-throughput systems, partitioning the outbox table prevents it from becoming a bottleneck.
By time range (recommended for most cases):
-- Partition by day for easy TTL cleanup and efficient querying
CREATE TABLE outbox (
    id UUID DEFAULT gen_random_uuid(),
    event_type VARCHAR(255) NOT NULL,
    aggregate_type VARCHAR(255),
    aggregate_id VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    published_at TIMESTAMPTZ,
    published BOOLEAN NOT NULL DEFAULT false,
    event_id VARCHAR(255) NOT NULL,
    PRIMARY KEY (id, created_at),
    -- A unique constraint on a partitioned table must include the partition key
    UNIQUE (event_id, created_at)
) PARTITION BY RANGE (created_at);
-- Create daily partitions
CREATE TABLE outbox_2026_03_01 PARTITION OF outbox
    FOR VALUES FROM ('2026-03-01') TO ('2026-03-02');
-- ... etc
Partition pruning benefits: Queries for unpublished entries in a specific time range only scan the relevant partition. Old partitions can be detached and dropped for efficient cleanup.
Avoiding Hot Partitions
If all writes go to the most recent partition, you get a write hotspot. An alternative is to partition by published status, which keeps unpublished events in a small "hot" partition and published events in a "cold" one:
-- Partitioning by published status: unpublished rows stay in a small
-- "hot" partition; published rows accumulate in the "cold" partition
CREATE TABLE outbox (
    ...
    published BOOLEAN NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (id, published, created_at)
) PARTITION BY LIST (published);
Note that marking an entry published physically moves the row between partitions (a delete plus an insert under the hood), so benchmark this against your update rate before adopting it.
Index Design for Relay Performance
The relay needs to find unpublished entries fast. The partial index on (published, created_at) WHERE published = false is the critical path:
- The relay query WHERE published = false ORDER BY created_at LIMIT N matches the index's filter and sort order, so no separate sort is needed; adding INCLUDE (id) would make it an index-only scan for the SELECT id polling query
- Because the index is partial, it contains only unpublished rows; it stays small and should fit in memory even when the table does not
-- Verify the index is used with: EXPLAIN (ANALYZE, BUFFERS) SELECT ...
SELECT id FROM outbox
WHERE published = false
ORDER BY created_at
LIMIT 100;
Schema Design Summary
| Concern | Recommendation |
|---|---|
| Primary key | UUID (no sequence contention) |
| Event ID | Business-meaningful ID (order-123-created), not random UUID |
| Payload | JSONB (flexible, queryable) |
| Index for relay | (published, created_at) WHERE published=false |
| Partitioning | By time range (daily partitions) |
| Idempotency | UNIQUE constraint on event_id |
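Putting the summary together, an application-side helper might build an outbox row like this. This is a sketch: the helper name and dict shape are illustrative, not a fixed API.

```python
import json
import uuid
from datetime import datetime, timezone

def build_outbox_entry(aggregate_type, aggregate_id, event_type, payload):
    # Business-meaningful event_id: the UNIQUE constraint then rejects
    # an accidental second write of the same logical event
    return {
        "id": str(uuid.uuid4()),
        "event_type": event_type,
        "aggregate_type": aggregate_type,
        "aggregate_id": aggregate_id,
        "payload": json.dumps(payload),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "published": False,
        "event_id": f"{aggregate_id}-{event_type}",
    }

entry = build_outbox_entry("Order", "order-123", "order_created", {"total": 42})
print(entry["event_id"])  # order-123-order_created
```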
Relay Implementation Strategies
Different relay implementations fit different situations.
Polling Relay
The simplest approach polls the outbox table at regular intervals.
class PollingRelay:
    def __init__(self, db, broker, interval_ms=100):
        self.db = db
        self.broker = broker
        self.interval = interval_ms / 1000

    def run(self):
        while True:
            events = self.db.fetch_unpublished(limit=100, order="created_at")
            for event in events:
                self.publish_and_mark(event)
            sleep(self.interval)

    def publish_and_mark(self, event):
        try:
            self.broker.publish(event["event_type"], event["payload"])
            self.db.mark_published(event["id"])
        except Exception:
            pass  # Entry stays unpublished; retried on the next poll
Polling is simple and reliable. The tradeoff is latency. With a 100ms interval, events publish within 100ms at best. Reduce the interval and database load goes up.
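One way to soften that tradeoff is adaptive polling: poll at full speed while events are flowing, back off exponentially when the outbox is empty. A sketch, with illustrative class name and bounds:

```python
class AdaptiveInterval:
    """Exponential backoff for an idle poller, reset when events appear."""

    def __init__(self, min_ms=50, max_ms=2000):
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.current_ms = min_ms

    def next_interval(self, events_found):
        if events_found:
            self.current_ms = self.min_ms  # busy: poll at full speed
        else:
            # idle: double the interval up to the cap
            self.current_ms = min(self.current_ms * 2, self.max_ms)
        return self.current_ms

poller = AdaptiveInterval()
print(poller.next_interval(events_found=False))  # 100
print(poller.next_interval(events_found=False))  # 200
print(poller.next_interval(events_found=True))   # 50
```

The relay calls next_interval after each fetch and sleeps for the returned duration, keeping latency low under load and database churn low when idle.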
Change Data Capture
A more sophisticated approach uses CDC. Instead of polling, the database notifies the relay when new outbox entries appear. PostgreSQL’s LISTEN/NOTIFY, or tools like Debezium that read the write-ahead log, can drive this.
class CdcRelay:
    def __init__(self, db, broker):
        self.db = db
        self.broker = broker
        self.db.listen("outbox_inserts", self.handle_notification)

    def handle_notification(self, notification):
        # The notification is only a wake-up signal; fetch from the outbox
        # table so nothing is missed even if a notification is dropped
        events = self.db.fetch_unpublished(limit=100)
        for event in events:
            self.publish_and_mark(event)
CDC cuts latency significantly. Events publish seconds or milliseconds after commit rather than on the next poll cycle. The cost is operational complexity. CDC tools add infrastructure and their own failure modes.
Polling vs CDC vs DB-Native Trigger Comparison
| Aspect | Polling Relay | CDC Relay | DB-Native Triggers |
|---|---|---|---|
| Latency | Interval-based (100ms-1s typical) | Near real-time (ms-s) | Low (within transaction commit) |
| Database Load | Higher (continuous polling) | Low (event-driven) | Lowest (no polling) |
| Infrastructure | Simple process | CDC tool (Debezium, AWS DMS) | Native DB triggers/procedures |
| Reliability | High (simple, testable) | High (but CDC tool has own failure modes) | Highest (DB handles all) |
| Operational Complexity | Low | High (CDC tool management) | Medium (DB procedures) |
| Replay Capability | Yes (mark-based) | Yes (CDC log retention) | Limited (WAL-based) |
| Cross-Database Support | Yes | Limited (CDC per-DB) | No (DB-specific) |
| Example Tools | Custom polling process | Debezium, AWS DMS, PostgreSQL LISTEN/NOTIFY | PostgreSQL triggers, SQL Server Service Broker |
| When to Use | Low throughput, simple setup | High throughput, low latency needs | Maximum reliability, DB-native solution |
Deleting Instead of Marking
Some implementations delete outbox entries after successful publish rather than marking them.
def publish_and_mark(self, event):
    self.broker.publish(event["event_type"], event["payload"])
    self.db.delete_outbox_entry(event["id"])
This keeps the outbox table small. The cost is losing replay capability. Marking as published preserves the data and lets you replay from the beginning if needed.
Handling Failures and Idempotency
The outbox pattern guarantees at-least-once delivery. Relay publishes an event, then marks it published. If the mark fails, the event gets republished on the next cycle. Consumers must handle duplicates.
Idempotent consumers solve this. The consumer tracks which events it has already processed, using an event ID in a processed_events table or cache.
class IdempotentConsumer:
    def __init__(self, broker, db):
        self.broker = broker
        self.db = db

    def handle_order_created(self, payload):
        event_id = payload["event_id"]
        if self.db.already_processed(event_id):
            return  # Skip duplicate
        order = payload["order"]
        # Ideally process_order and mark_processed share one transaction,
        # so a crash between them cannot record an unprocessed event
        self.process_order(order)
        self.db.mark_processed(event_id)
The idempotency table needs its own cleanup to avoid growing forever. Purge processed events older than a certain age, or use a sliding window.
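The sliding window can be sketched in memory; in production the same logic would live in the processed_events table or a cache with TTL. The class below is illustrative:

```python
import time
from collections import OrderedDict

class DedupWindow:
    """Remembers event IDs for ttl_seconds, evicting older entries."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.seen = OrderedDict()  # event_id -> first-seen timestamp

    def already_processed(self, event_id, now=None):
        now = time.time() if now is None else now
        # Evict entries that fell out of the window (oldest first)
        while self.seen:
            oldest_id, ts = next(iter(self.seen.items()))
            if now - ts > self.ttl:
                del self.seen[oldest_id]
            else:
                break
        if event_id in self.seen:
            return True
        self.seen[event_id] = now
        return False

w = DedupWindow(ttl_seconds=10)
print(w.already_processed("evt-1", now=0))   # False (first time)
print(w.already_processed("evt-1", now=5))   # True  (duplicate in window)
print(w.already_processed("evt-1", now=20))  # False (window expired)
```

The window must be longer than the relay's worst-case redelivery gap, or expired duplicates slip through.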
Exactly-Once Delivery
The outbox pattern, combined with idempotent consumers, gives you exactly-once semantics end-to-end. Database and outbox write happen atomically. Relay delivers the event at least once. Consumer processes it exactly once.
The nuance: “exactly once” means the business logic executes once, not that the event moves through the system once. The event may cross the broker multiple times. The consumer must recognize and deduplicate.
For a broader view of delivery semantics, see Saga Pattern which often pairs with outbox for distributed transaction handling.
Operational Considerations
The outbox table grows if the relay falls behind or publishes repeatedly fail. Monitor for stuck entries.
SELECT COUNT(*) FROM outbox
WHERE published = false
AND created_at < NOW() - INTERVAL '5 minutes';
Alerting on this query catches relay failures or persistent publish errors.
Consider the relay’s failure modes too. If the relay crashes mid-batch, some events get published but not marked. The next relay instance reprocesses them. As long as consumers are idempotent, this is fine.
For high-throughput systems, batching helps. The relay fetches multiple events, publishes them in a single broker batch if supported, then marks them all published. Fewer broker round trips.
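A batched relay cycle can be sketched with tiny in-memory stand-ins for the database and broker. The publish_many and *_many helpers are assumed interfaces, not a real client API:

```python
def relay_batch(db, broker, batch_size=100):
    # One fetch, one broker batch call, one bulk mark: three round trips
    # per cycle instead of two per event
    events = db.fetch_unpublished(limit=batch_size)
    if not events:
        return 0
    broker.publish_many([(e["event_type"], e["payload"]) for e in events])
    db.mark_published_many([e["id"] for e in events])
    return len(events)

# Hypothetical in-memory stand-ins, just to show the flow
class FakeDb:
    def __init__(self, rows):
        self.rows = rows
    def fetch_unpublished(self, limit):
        return [r for r in self.rows if not r["published"]][:limit]
    def mark_published_many(self, ids):
        for r in self.rows:
            if r["id"] in ids:
                r["published"] = True

class FakeBroker:
    def __init__(self):
        self.sent = []
    def publish_many(self, messages):
        self.sent.extend(messages)

db = FakeDb([{"id": i, "event_type": "order_created", "payload": "{}", "published": False}
             for i in range(3)])
broker = FakeBroker()
print(relay_batch(db, broker))  # 3
print(relay_batch(db, broker))  # 0 (everything already published)
```

Note the ordering: publish first, mark second, so a crash between the two re-delivers rather than drops events.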
Relay High Availability
A single relay is a single point of failure. If it crashes, events pile up in the outbox until someone notices. For production systems, you need multiple relay instances.
Active-passive (leader election):
Use a lock in the database or Redis to elect an active relay. The active relay polls and publishes. Passive relays standby, ready to take over if the active dies.
import socket
import time

import redis

class HaRelay:
    def __init__(self, db, broker, lock_key, ttl_seconds=30):
        self.db = db
        self.broker = broker
        self.lock_key = lock_key
        self.ttl = ttl_seconds
        self.instance_id = socket.gethostname()
        self.redis = redis.Redis()

    def run(self):
        while True:
            acquired = self.redis.set(self.lock_key, self.instance_id,
                                      nx=True, ex=self.ttl)
            if acquired:
                self.do_work()  # Only the active relay does work
            time.sleep(1)  # Standby (or demoted): wait before trying again

    def do_work(self):
        while True:
            # Renew the lock each cycle; step down if another instance holds it
            holder = self.redis.get(self.lock_key)
            if holder is None or holder.decode() != self.instance_id:
                return  # Lost the lock, go back to standby
            self.redis.expire(self.lock_key, self.ttl)
            events = self.db.fetch_unpublished(limit=100)
            for event in events:
                self.broker.publish(event["event_type"], event["payload"])
                self.db.mark_published(event["id"])
The nx=True means only one instance can hold the lock. The ex=self.ttl means the lock auto-releases if the active relay crashes and doesn’t renew. Passive instances detect the lock is gone and one of them acquires it.
Active-active (multiple relays with sharding):
Multiple relays can run simultaneously if they coordinate which outbox rows each one handles. Use SELECT ... FOR UPDATE SKIP LOCKED to let each relay claim a different batch:
class ShardedRelay:
    def process_outbox(self):
        # FOR UPDATE SKIP LOCKED: claim rows without blocking other relays.
        # Keep the transaction open so the row locks hold until commit.
        with db.transaction():
            events = db.fetch_unpublished(
                "SELECT * FROM outbox WHERE published = false "
                "ORDER BY created_at LIMIT 100 FOR UPDATE SKIP LOCKED"
            )
            for event in events:
                try:
                    self.broker.publish(event["event_type"], event["payload"])
                    self.db.mark_published(event["id"])
                except Exception:
                    pass  # Lock releases at commit; another relay retries later
Both relays run concurrently, each grabbing different rows. No coordination overhead, no leader election needed. Throughput scales with relay count.
Outbox Entry TTL and Cleanup
The outbox table grows over time. Published entries that are never cleaned up become dead weight. You need a cleanup strategy.
Soft delete with published flag (already in schema):
# Mark as published, don't delete
db.mark_published(event["id"])
Hard delete for old published entries (run as a scheduled job):
-- Clean up published entries older than 7 days
DELETE FROM outbox
WHERE published = true
AND published_at < NOW() - INTERVAL '7 days';
Automated cleanup with partition expiry (if using time-based partitioning):
-- Detach and drop old partitions
ALTER TABLE outbox DETACH PARTITION outbox_2026_02_01;
DROP TABLE outbox_2026_02_01;
Partition expiry makes cleanup essentially free: dropping a whole partition is O(1) rather than deleting millions of rows.
Retention policy guidance:
| Use case | TTL for published entries |
|---|---|
| Financial transactions | 30-90 days (audit requirements) |
| Order events | 7-14 days |
| High-volume event logs | 1-3 days |
| Debug/replay scenarios | Keep all for 24h, then batch archive |
Never delete unpublished entries; they represent events that have not been delivered yet. If you see unpublished entries older than your poll interval, something is wrong with the relay.
Combining with Saga
The outbox pattern pairs naturally with saga. In a saga, each step updates a service and publishes an event to trigger the next step. Without outbox, the publish can fail and leave the saga stuck.
With outbox, the step’s update and outbox write are atomic. The relay publishes the event that triggers the next step. If the relay is down, events queue up and publish when it recovers. The saga makes progress as soon as the relay is available.
For details on saga implementation, see Saga Pattern.
When to Use the Outbox Pattern
Use the outbox pattern when you need reliability guarantees for event publishing and can tolerate the additional latency (hundreds of milliseconds to seconds) between the database write and broker notification.
It works well in high-reliability scenarios: financial transactions, inventory updates, any domain where missed events create real problems. Well-suited to event-driven microservices where services communicate through events rather than direct calls.
The outbox adds operational complexity. You run the relay as a separate process. You monitor it. You handle idempotency in consumers. For low-reliability use cases or one-way notifications where missed events are acceptable, this complexity may not pay off.
For an overview of distributed transaction patterns, see Distributed Transactions.
Observability Checklist
The outbox pattern introduces a separate relay process and additional database writes. Without monitoring, problems in the relay go unnoticed until events pile up and consumers fall behind.
Metrics
- Outbox lag: count and age of unpublished entries
- Relay publish rate: events published per second
- Publish failure rate: events that failed to publish
- End-to-end latency: time from outbox write to consumer delivery
- Consumer lag: how far behind consumers are from producer rate
- Batch size histogram: how many events per publish cycle
Key Queries
-- How many unpublished events are piling up?
SELECT COUNT(*) FROM outbox WHERE published = false;
-- How old is the oldest unpublished event? (should be < 1 minute)
SELECT MAX(NOW() - created_at) FROM outbox WHERE published = false;
-- How many published events per minute?
SELECT DATE_TRUNC('minute', published_at), COUNT(*)
FROM outbox WHERE published = true
GROUP BY 1 ORDER BY 1 DESC LIMIT 10;
Logs
- Log each batch of events published with count and duration
- Log publish failures with error details and retry count
- Log when relay acquires or loses leadership lock
- Include event_id, aggregate_type, and aggregate_id in all logs for correlation
Alerts
- Alert when unpublished event count exceeds threshold (relay is behind)
- Alert when oldest unpublished event age exceeds threshold (relay is stuck)
- Alert when publish failure rate exceeds baseline
- Alert when relay loses leadership and standby doesn’t pick up (HA failure)
Tracing
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

class OutboxRelay:
    def process_batch(self, events):
        with tracer.start_as_current_span("outbox.publish_batch") as span:
            span.set_attribute("batch.size", len(events))
            for event in events:
                with tracer.start_as_current_span("outbox.publish_event") as event_span:
                    event_span.set_attribute("event.id", event["event_id"])
                    event_span.set_attribute("event.type", event["event_type"])
                    event_span.set_attribute("aggregate.type", event["aggregate_type"])
                    try:
                        self.broker.publish(event["event_type"], event["payload"])
                        self.db.mark_published(event["id"])
                        event_span.set_status(Status(StatusCode.OK))
                    except Exception as e:
                        event_span.record_exception(e)
                        event_span.set_status(Status(StatusCode.ERROR))
                        raise
Security Checklist
The outbox pattern writes event payloads to a database table that a separate process reads and publishes. Misconfigured security can leak sensitive data through the event stream.
- [ ] Encrypt outbox table data at rest (database-level encryption or application-level encryption of sensitive payload fields)
- [ ] Encrypt relay-to-broker communication (TLS on broker connection)
- [ ] Authenticate relay-to-broker connections (broker authentication)
- [ ] Authorize which relays can publish to which topics (broker ACLs)
- [ ] Validate payload schema before publishing (reject payloads that are too large or malformed)
- [ ] Audit log all publish operations with event metadata
- [ ] Do not put sensitive data (passwords, tokens, PII) directly in event payloads; use references or obfuscated IDs instead
- [ ] Rate-limit the relay's publish rate to prevent accidental or malicious overload of consumers
Connection to CQRS and Event Sourcing
The outbox pattern is a foundational piece for event-driven architectures. It connects directly to CQRS and event sourcing patterns.
CQRS (Command Query Responsibility Segregation)
In CQRS, you separate read models from write models. Commands (writes) update the command side. Queries (reads) read from potentially different read models updated via events.
The outbox pattern is how you get events from the command side to the query side reliably:
Command Side                              Query Side
┌─────────────────┐                  ┌─────────────────┐
│ Order Service   │  outbox + relay  │  Read Model     │
│ - place_order   │ ───────────────► │ - OrderView     │
│ - update_status │    (reliable     │ - CustomerView  │
│ - cancel        │   event stream)  │ - Analytics DB  │
└─────────────────┘                  └─────────────────┘
Without outbox, you might update the command model and then directly update the read model. That direct call can fail and leave the two sides out of sync. With outbox, the update and event publication are atomic; the read model eventually catches up and never misses an event.
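A minimal read-model projector might look like this (names are illustrative). It applies events idempotently, so the relay's duplicate deliveries are harmless:

```python
class OrderView:
    """Query-side projection built from order events."""

    def __init__(self):
        self.orders = {}      # aggregate_id -> current view
        self.applied = set()  # event_ids already projected

    def apply(self, event):
        if event["event_id"] in self.applied:
            return  # duplicate delivery from the relay: ignore
        if event["event_type"] == "order_created":
            self.orders[event["aggregate_id"]] = {"status": "created"}
        elif event["event_type"] == "order_cancelled":
            self.orders[event["aggregate_id"]]["status"] = "cancelled"
        self.applied.add(event["event_id"])

view = OrderView()
created = {"event_id": "order-1-created", "event_type": "order_created",
           "aggregate_id": "order-1"}
view.apply(created)
view.apply(created)  # redelivered: no effect
print(view.orders["order-1"]["status"])  # created
```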
Event Sourcing
Event sourcing takes CQRS further: instead of storing current state, you store the sequence of events that produced the current state. The event store is the source of truth.
The outbox pattern works well with event sourcing:
def place_order(order):
    with db.transaction():
        # Append to event log (the "event store")
        db.insert("events", {
            "event_type": "ORDER_PLACED",
            "aggregate_id": order.id,
            "payload": json.dumps(order),
            "sequence": get_next_sequence("Order", order.id)
        })
        # Also write to outbox for reliable notification
        db.insert("outbox", {
            "event_type": "ORDER_PLACED",
            "aggregate_id": order.id,
            "payload": json.dumps(order),
            "event_id": f"order-{order.id}-placed"
        })
The event store is the authoritative log. The outbox ensures events propagate to consumers (read models, audit logs, other services). If you use event sourcing without the outbox, consumers must read the event store directly, which couples them to your storage mechanism and makes it harder to evolve the event schema.
The outbox decouples: your event store is an internal implementation detail. Consumers receive events through the broker like any other event-driven system.
Conclusion
The outbox pattern fixes the dual-write problem by using the database transaction as the atomicity boundary. Business data and outbox entries commit together. A separate relay publishes the events. Idempotent consumers handle the duplicates that inevitably occur.
The pattern trades simplicity for reliability. You add infrastructure, monitoring, and idempotency logic. In return, you get guaranteed at-least-once delivery and exactly-once processing without distributed transactions.
Whether that tradeoff makes sense depends on your reliability requirements. For systems where missed events mean lost money or broken state, the outbox pattern is worth the complexity.
Related Posts
TCC: Try-Confirm-Cancel Pattern for Distributed Transactions
Learn the Try-Confirm-Cancel pattern for distributed transactions. Explore how TCC differs from 2PC and saga, with implementation examples and real-world use cases.
Distributed Transactions: ACID vs BASE Trade-offs
Explore distributed transaction patterns: ACID vs BASE trade-offs, two-phase commit, saga pattern, eventual consistency, and choosing the right model.
Backpressure Handling: Protecting Pipelines from Overload
Learn how to implement backpressure in data pipelines to prevent cascading failures, handle overload gracefully, and maintain system stability.