Event Sourcing

Storing state changes as immutable events. Event store implementation, event replay, schema evolution, and comparison with traditional CRUD approaches.



Most systems store current state. You have an Order with status “shipped.” That is what the database knows. If you want to know the order’s history, you add created_at and updated_at columns, or a separate audit log, or a history table.

Event sourcing stores the history directly. Instead of “order status is shipped,” you store “order was placed,” “payment received,” “order shipped.” The current state comes from replaying these events.

This changes how you think about data. Every change is recorded.

The Core Idea

In event sourcing, you store events, not state. An event represents something that happened:

  • OrderPlaced
  • PaymentReceived
  • OrderShipped
  • OrderDelivered

To find the current state of an order, replay all its events in order. To find history, read the events directly.

# Events
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal

@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str
    items: list
    timestamp: datetime

@dataclass(frozen=True)
class PaymentReceived:
    order_id: str
    amount: Decimal
    timestamp: datetime

@dataclass(frozen=True)
class OrderShipped:
    order_id: str
    tracking_number: str
    timestamp: datetime

# Event store appends events
event_store.append(OrderPlaced(order_id="123", ...))
event_store.append(PaymentReceived(order_id="123", ...))
event_store.append(OrderShipped(order_id="123", ...))

# To get current state, replay events
def get_order_state(order_id):
    events = event_store.get_events_for(order_id)
    state = Order()
    for event in events:
        state = apply_event(state, event)
    return state

Real implementations handle concurrency, snapshots, and versioning. The principle holds.
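The replay loop hinges on a pure `apply_event` function: given a state and one event, it returns the next state. A minimal self-contained sketch, with simplified event types (the names and fields here are illustrative and trimmed relative to the listing above):

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Simplified event types for the sketch.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    items: Tuple[str, ...]

@dataclass(frozen=True)
class PaymentReceived:
    order_id: str
    amount: int  # cents, to keep the sketch dependency-free

@dataclass(frozen=True)
class OrderShipped:
    order_id: str
    tracking_number: str

@dataclass(frozen=True)
class Order:
    status: str = "new"
    items: Tuple[str, ...] = ()
    amount_paid: int = 0
    tracking_number: Optional[str] = None

def apply_event(state: Order, event) -> Order:
    """Pure transition: each event type maps to one state change."""
    if isinstance(event, OrderPlaced):
        return replace(state, status="placed", items=event.items)
    if isinstance(event, PaymentReceived):
        return replace(state, status="paid",
                       amount_paid=state.amount_paid + event.amount)
    if isinstance(event, OrderShipped):
        return replace(state, status="shipped",
                       tracking_number=event.tracking_number)
    return state  # unknown event types are ignored, not errors

events = [
    OrderPlaced("123", ("book",)),
    PaymentReceived("123", 2500),
    OrderShipped("123", "TRK-9"),
]
state = Order()
for e in events:
    state = apply_event(state, e)
# state.status is now "shipped"
```

Keeping `apply_event` free of side effects is what makes replay deterministic: the same event list always produces the same state.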

flowchart LR
    Command[("Command<br/>PlaceOrder")] --> Append[("Event Store<br/>Append-only")]

    Append --> Stream[("Event Stream<br/>Order-123")]

    Stream --> Proj1[("Projection<br/>Order State")]

    Stream --> Proj2[("Projection<br/>Customer History")]

    Stream --> Proj3[("Projection<br/>Analytics Events")]

    Proj1 -.->|periodic| Snapshot[("Snapshot<br/>Version N")]

    Snapshot -->|faster rebuild| Proj1

    ReadModel1[("Read Model<br/>Order Detail")] <--> Proj1

    ReadModel2[("Read Model<br/>Customer View")] <--> Proj2

    ReadModel3[("Read Model<br/>Dashboard")] <--> Proj3

Event Store Implementation

An event store is an append-only log of events. Properties:

  • Events are immutable once written
  • Events are organized by aggregate (entity)
  • Events are ordered by timestamp or sequence number
  • Events include metadata: timestamp, user, correlation ID

Simple implementations use a relational table:

CREATE TABLE events (
    event_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    aggregate_id VARCHAR(100) NOT NULL,
    aggregate_type VARCHAR(50) NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    event_data JSON NOT NULL,
    metadata JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_aggregate (aggregate_id, created_at)
);

Production event stores like EventStoreDB, Axon, and Marten provide stream management, snapshots for performance, projections for building read models, and subscriptions for reacting to new events.
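One concern those products all handle is concurrent writers to the same stream. The usual mechanism is an expected-version check on append. A toy in-memory sketch of the idea (the class and method names here are illustrative, not any product's API):

```python
import threading

class ConcurrencyError(Exception):
    """Raised when another writer appended to the stream first."""

class InMemoryEventStore:
    """Append-only store with optimistic concurrency control."""

    def __init__(self):
        self._streams = {}  # aggregate_id -> list of events
        self._lock = threading.Lock()

    def append(self, aggregate_id, events, expected_version):
        with self._lock:
            stream = self._streams.setdefault(aggregate_id, [])
            if len(stream) != expected_version:
                # The caller read stale state; it must reload and retry.
                raise ConcurrencyError(
                    f"expected version {expected_version}, "
                    f"stream is at {len(stream)}")
            stream.extend(events)
            return len(stream)  # new stream version

    def get_events_for(self, aggregate_id):
        with self._lock:
            return list(self._streams.get(aggregate_id, []))
```

The expected version is the stream length the writer saw when it loaded the aggregate; a mismatch means a concurrent command got there first, and the losing writer replays the stream and retries.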

Event Replay for State Reconstruction

Given all events for an aggregate, you can reconstruct its state at any point in time. This works for debugging, auditing, and recovery.

“Show me the order state as it was on March 15 at 3pm”: replay events up to that timestamp.

“Why did this order get cancelled?”: read the events and see exactly what happened and in what order.

Bug in projection logic? Rebuild the read model from scratch by replaying all events.

Snapshots improve performance for long-lived aggregates. Store periodic snapshots and replay only events since the last snapshot instead of replaying everything from the beginning.

# With snapshots
def get_order_state(order_id, as_of_date):
    snapshot = event_store.get_snapshot_before(order_id, as_of_date)
    if snapshot is None:
        # No snapshot yet: replay from the start of the stream
        state, from_version = Order(), 0
    else:
        state, from_version = snapshot.data, snapshot.version
    events = event_store.get_events_since(order_id, from_version, until=as_of_date)
    for event in events:
        state = apply_event(state, event)
    return state

Eventual Consistency and Read Model Rebuilding

Event sourcing typically uses CQRS. Commands produce events. Events are stored. Separate read models consume events and build query-optimized views.

The read model is eventually consistent. Events are written, then published to consumers who update read models. There is a delay between a command succeeding and the read model reflecting that change.

This delay is typically milliseconds in normal operation but can stretch under load or during rebuilding.
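The lag is measurable if each consumer records the log position it has processed; the gap between that position and the store's head is the projection lag. A minimal sketch (names and the single-counter read model are illustrative):

```python
class OrderCountProjection:
    """Read-model consumer that tracks its position in the event log."""

    def __init__(self):
        self.last_processed = 0  # global log position of the last event seen
        self.orders_placed = 0   # the query-optimized view itself

    def handle(self, position, event_type):
        if event_type == "OrderPlaced":
            self.orders_placed += 1
        self.last_processed = position  # checkpoint after every event

def projection_lag(store_head, projection):
    """Events written but not yet reflected in the read model."""
    return store_head - projection.last_processed

proj = OrderCountProjection()
log = [(1, "OrderPlaced"), (2, "PaymentReceived"), (3, "OrderPlaced")]
for pos, event_type in log:
    proj.handle(pos, event_type)
# proj is caught up to position 3; if the store head advances to 5,
# projection_lag(5, proj) reports 2 pending events
```

Persisting the checkpoint together with the read model (ideally in the same transaction) is what lets a restarted consumer resume without double-counting.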

Read model rebuilding is useful. You can add new projections to existing data, rebuild after fixing bugs, create new read models for new features from historical data, and test projections against the same events.

# Rebuild a read model from scratch
def rebuild_read_model(projection, from_version=0):
    events = event_store.get_all_events(from_version=from_version)
    for event in events:
        projection.handle(event)
    projection.persist()

Comparing with Traditional CRUD

Traditional CRUD stores current state directly. Update overwrites previous state.

| Aspect          | CRUD                       | Event Sourcing     |
|-----------------|----------------------------|--------------------|
| Data stored     | Current state              | History of changes |
| Update behavior | Overwrite                  | Append             |
| History         | Optional, must be designed | Built-in           |
| Debugging       | Limited visibility         | Full audit trail   |
| Read models     | Direct from primary store  | Built from events  |
| Storage         | Typically less             | Typically more     |
| Complexity      | Simpler                    | More complex       |

CRUD works for simple domains where current state is what matters and history is not important. Event sourcing pays off when history matters—audit requirements, complex debugging, temporal queries.

Event sourcing storage growth is manageable with snapshots, event pruning for closed aggregates, and columnar storage for event data.

Handling Event Schema Evolution

Events are immutable but schemas change. A ProductCreated event from three years ago might have different fields than one created today.

Versioning approaches:

Upcasting: write an upcaster that transforms old event versions to new versions during replay.

Event versioning: include version numbers in events and handle different versions in projection logic.

Schema migration: migrate old event data to new schema when reading.

# Upcaster example
class OrderPlacedUpcaster:
    def can_upcast(self, event_data):
        return event_data.get('version', 0) < 2

    def upcast(self, event_data):
        if event_data.get('version', 0) < 1:
            event_data['customer_email'] = None
        if event_data.get('version', 0) < 2:
            event_data['customer_id'] = str(event_data['customer_id'])
        event_data['version'] = 2
        return event_data

Most teams add versioning when needed. Over-engineering upfront costs more than dealing with old events later.

When to Use / When Not to Use Event Sourcing

Use event sourcing when:

  • Audit trails are a compliance requirement
  • You need to reconstruct state at any point in time
  • Debugging requires understanding how state evolved
  • Read models may need to be rebuilt from scratch
  • You are already using CQRS and want complete history
  • Your domain involves long-running processes with multiple steps

Do not use event sourcing when:

  • CRUD is simpler and history is not important
  • Your team lacks experience with distributed systems patterns
  • Storage costs are a significant constraint
  • You need strong consistency within milliseconds
  • Schema changes are frequent and upfront versioning overhead is too high

Event Sourcing vs CRUD vs State Storage Trade-offs

| Dimension            | CRUD (State Storage)  | Event Sourcing            | State + Events Hybrid      |
|----------------------|-----------------------|---------------------------|----------------------------|
| Storage model        | Current state         | Immutable events          | Both                       |
| History preserved    | None (overwrites)     | Full history              | Limited event log          |
| Update behavior      | Overwrite             | Append-only               | Append events              |
| Debugging visibility | Current state only    | Full replay               | Recent events + snapshots  |
| Read model rebuild   | N/A                   | Full from scratch         | From last snapshot         |
| Storage growth       | Controlled            | Unbounded (needs pruning) | Managed with TTL           |
| Schema evolution     | Migrate existing rows | Upcasters on replay       | Versioned events           |
| Complexity           | Low                   | High                      | Medium                     |
| Best for             | Simple CRUD apps      | Audit-critical domains    | Most event-driven systems  |

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---------|--------|------------|
| Event replay explosion | Long aggregates take minutes to rebuild | Use snapshots, limit aggregate size, archive old events |
| Schema evolution gaps | Old events fail to deserialize | Upcasters, version numbers in events, backward-compatible schemas |
| Eventual consistency lag | Read models stale after writes | Monitor projection lag, set SLA alerts, design UI for async updates |
| Eventual consistency in multi-aggregate queries | Cross-aggregate reads return inconsistent state | Design aggregates to be consistency boundaries, compensate with sagas |
| Event store write throughput | Single stream becomes bottleneck | Partition by aggregate type, use event store clustering |
| Snapshot corruption | Rebuilds produce wrong state | Verify snapshots, maintain event replay as source of truth |

Capacity Estimation: Event Store Growth and Snapshot Frequency

Event stores grow indefinitely unless you manage them. Understanding growth rates and snapshot frequency is essential for operational planning.

Event store growth formula:

events_per_day = commands_per_day × events_per_command
storage_per_day = events_per_day × avg_event_size
storage_per_year = storage_per_day × 365

For an order processing system handling 100,000 orders per day with 5 events per order on average and 400 bytes per event:

  • Events per day: 500,000
  • Storage per day: 500,000 × 400 = 200MB
  • Storage per year: 200MB × 365 = 73GB
  • With snapshots (1 snapshot per aggregate per week): add ~20% overhead → ~88GB per year
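The arithmetic above can be checked with a direct transcription of the formulas:

```python
def event_store_growth(commands_per_day, events_per_command, avg_event_size_bytes):
    """Returns (events/day, bytes/day, bytes/year) per the formulas above."""
    events_per_day = commands_per_day * events_per_command
    storage_per_day = events_per_day * avg_event_size_bytes
    return events_per_day, storage_per_day, storage_per_day * 365

# 100,000 orders/day, 5 events per order, 400 bytes per event
events_day, bytes_day, bytes_year = event_store_growth(100_000, 5, 400)
# -> 500,000 events/day, 200 MB/day, 73 GB/year
```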

Snapshot frequency planning:

events_between_snapshots = desired_rebuild_time_ms / avg_event_process_time_ms
snapshot_frequency_days = events_between_snapshots / events_per_day_per_aggregate

If rebuilding an aggregate from scratch takes 30 seconds (30,000ms) and each event takes 1ms to process, you need a snapshot every 30,000 events. With an aggregate receiving 100 events per day, that is one snapshot every 300 days — roughly yearly. For aggregates receiving 1,000 events per day (an active user account), you need snapshots every 30 days.

The practical approach: snapshot when the aggregate crosses a fixed event count threshold (every 100 or 1000 events) rather than on a time schedule. This keeps rebuild time bounded regardless of aggregate activity level.
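A count-threshold policy might look like this sketch (the threshold value and the callback signatures are assumptions for illustration):

```python
SNAPSHOT_THRESHOLD = 1000  # events between snapshots (assumed value)

def maybe_snapshot(aggregate_id, current_version, last_snapshot_version,
                   save_snapshot, load_state):
    """Snapshot on a fixed event-count threshold, not a time schedule.

    Returns the snapshot version to record as the new checkpoint.
    """
    events_since = current_version - last_snapshot_version
    if events_since >= SNAPSHOT_THRESHOLD:
        # Rebuild time is bounded: at most SNAPSHOT_THRESHOLD events
        # ever need replaying, however active the aggregate is.
        save_snapshot(aggregate_id, current_version, load_state(aggregate_id))
        return current_version
    return last_snapshot_version
```

Because the trigger is the event count rather than wall-clock time, a dormant aggregate is never snapshotted needlessly and a hot one is snapshotted as often as it needs.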

Snapshot storage formula:

snapshot_size = aggregate_state_serialized_bytes
total_snapshot_storage = snapshot_size × number_of_aggregates × snapshots_per_aggregate_kept

For 1M aggregates, each snapshot averaging 10KB, keeping 2 snapshots per aggregate (current + previous for safety): 1M × 10KB × 2 = 20GB. Manageable. For 100M aggregates at the same rate: 2TB — requiring compression or archival strategy.
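Transcribed directly, both cases fall out of one function:

```python
def snapshot_storage_bytes(num_aggregates, snapshot_size_bytes, snapshots_kept):
    """Total snapshot storage per the formula above."""
    return num_aggregates * snapshot_size_bytes * snapshots_kept

# 1M aggregates x 10 KB x 2 kept (current + previous)
small = snapshot_storage_bytes(1_000_000, 10_000, 2)    # 20 GB
# 100M aggregates at the same rate
large = snapshot_storage_bytes(100_000_000, 10_000, 2)  # 2 TB
```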

Observability Hooks: Monitoring Event Processing Health

Key metrics for event sourcing systems: projection lag, event processing latency, and event store append latency.

-- PostgreSQL: check event stream staleness (for tracked consumer groups)
SELECT
    consumer_id,
    last_processed_event_id,
    last_processed_timestamp,
    EXTRACT(EPOCH FROM (now() - last_processed_timestamp)) AS lag_seconds
FROM projection_consumers
WHERE is_active = true
ORDER BY lag_seconds DESC;

Key alerts:

# Projection lag alert
- alert: EventSourcingProjectionLag
  expr: event_projection_lag_seconds > 300
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Projection {{ $labels.projection }} lag exceeds 5 minutes"

# Event store append latency
- alert: EventStoreAppendLatencyHigh
  expr: histogram_quantile(0.95, rate(event_append_duration_seconds_bucket[5m])) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Event store append P95 latency above 100ms"

# Snapshot age alert (detect aggregates not being snapshotted)
- alert: SnapshotAgeExceedsThreshold
  expr: max by (aggregate_type) (event_count_since_last_snapshot) > 50000
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Aggregate type {{ $labels.aggregate_type }} has aggregates without recent snapshots"

Real-World Case Study: Axon Framework at a Major Bank

A major European bank rebuilt their trading platform on Axon Framework (Java-based CQRS/ES framework) to handle regulatory compliance requirements. Every trade must have a complete, immutable audit trail. Event sourcing was chosen because regulators required the ability to reconstruct state at any point in time — not just current state, but state as it existed on any historical date.

Their deployment: 50 aggregate types, 200,000 daily trades, 50-100 events per trade (pricing, risk checks, approvals). Event store grew from 100GB in year one to 8TB over five years. Projection lag was their primary operational headache — the risk calculation projection took 45 minutes to catch up after a batch of overnight trades loaded at market open.

The resolution: they implemented parallelized snapshot rebuilding. Instead of one projection worker processing events sequentially, they partitioned the event log by aggregate ID and processed 16 partitions in parallel, reducing projection rebuild time from 45 minutes to under 5 minutes. They also introduced a “warm standby” projection that lagged by 5 minutes intentionally — if the primary projection failed, the warm standby could take over with only 5 minutes of data loss.

Interview Questions

Q: Your event store has 10 years of data and adding a new projection requires replaying all 10 years. How do you manage this?

Archive old events to cold storage (object storage like S3), keep recent events in the hot event store. New projections start from the most recent snapshot in the hot store and only replay the events since that snapshot. If you need full history replay (regulatory requirements), run it offline in a separate environment during a maintenance window. The key insight: plan archival strategy before you need it. Implement snapshots and tiered storage from the beginning, not as a retrofit.

Q: Two aggregates in the same event stream need to stay consistent with each other. How do you handle this?

Aggregates should be designed as consistency boundaries — each aggregate’s invariants are protected internally, and cross-aggregate consistency is handled through sagas or workflows. If you need two aggregates to change together atomically, they should be a single aggregate. If they must remain separate for scalability or ownership reasons, use a saga: emit a CrossAggregateReferenceCreated event from aggregate A, have a saga process that event and emit a compensating command to aggregate B, and handle the eventual consistency explicitly. Do not try to use distributed transactions (2PC) across aggregates — that defeats the purpose of having separate consistency boundaries.

Q: How do you handle schema evolution for events?

Events are immutable once written, so old events must deserialize with new code. Use upcasters: functions that transform old event schemas to the current schema on read. Version your events explicitly in the payload: v1, v2. When the schema changes, add a v2 upcaster that reads v1 events and produces v2 representations. Keep upcasters for all historical versions — never delete old upcasters. Test by replaying events from the beginning of your event store and verifying the final aggregate state matches expected results.

Q: What is the relationship between snapshots and event replay?

Snapshots are a performance optimization. Without snapshots, reconstructing aggregate state requires replaying all events from the beginning of the aggregate’s history. With snapshots, you periodically save the aggregate’s state at a given event sequence number. Reconstructing the aggregate loads the snapshot and replays only events after the snapshot sequence. Without snapshots, an aggregate with 10 years of history would be slow to load. With yearly snapshots, you replay at most 1 year of events. Snapshots do not affect correctness — event replay from event 0 always produces the same state. They only affect performance.


See Also

  • Microservices Roadmap - Event sourcing and CQRS are foundational patterns for event-driven microservice architectures, where services communicate through immutable event logs rather than direct database calls

Conclusion

Event sourcing stores state changes as immutable events. Current state is derived by replaying events. History is preserved automatically. Read models are built by consuming events.

Event sourcing enables complete audit trails, easy debugging, and rebuildable read models. It adds complexity—eventual consistency, schema evolution, event store management. It works best with CQRS and fits domains where history matters.

Use event sourcing when you need the audit trail, when debugging requires understanding how state evolved, or when building read models that might need rebuilding. Avoid it when CRUD is simpler and history is not important.

