Consistency Models in Distributed Systems: Complete Guide

Learn about strong, weak, eventual, and causal consistency models. Understand read-your-writes, monotonic reads, and picking the right model for your system.

published: reading time: 63 min read author: GeekWorkBench

Consistency Models in Distributed Systems

Strong consistency sounds simple: reads always return the most recent write. In distributed systems, though, strong consistency is expensive. The CAP theorem forces us to think about consistency trade-offs, and the PACELC theorem reminds us that even without partitions, consistency has a latency cost.

Knowing where different consistency models apply helps you make better architectural decisions. Most systems do not need strong consistency everywhere. Finding where you can relax consistency guarantees can dramatically improve performance.

This post builds on the CAP theorem and PACELC theorem discussions.


Introduction

Think of consistency as a dial you can turn, not a binary switch:

graph LR
    A[Strong<br/>Consistency] --> B[Sequential<br/>Consistency]
    B --> C[Causal<br/>Consistency]
    C --> D[Eventual<br/>Consistency]
    D --> E[Weak<br/>Consistency]

Strong consistency at one end, weak at the other. Most practical systems sit somewhere in between.


Core Concepts

Strong consistency guarantees that any read sees all preceding writes. The system appears to have a single copy of the data. This is what most developers assume when they think about database transactions.

Linearizability vs Sequential Consistency

There is a formal distinction that trips up even experienced engineers: linearizability and sequential consistency are not the same thing.

Sequential consistency (as described above) guarantees that all nodes see operations in the same total order, but that order does not have to correspond to real time. Operations can appear to happen before or after their actual wall-clock time.

Linearizability adds a real-time requirement on top of sequential consistency. Every operation appears to complete atomically at some point between its invocation and its response, and those points are ordered in real time. If write A completes before write B starts (in wall-clock time), then B must be visible after A on all nodes.

The practical difference: linearizability gives you the illusion of a single copy of data across all time. Sequential consistency only guarantees order, not real-time ordering. Linearizability is strictly stronger.

Most databases that claim “strong consistency” mean linearizability. Paxos and Raft provide linearizability. Single-leader replication with reads from the leader is linearizable (if the leader does not change during the read).

Concrete example: sequentially consistent but not linearizable.

Consider two processes writing to the same variable, with a write from region A (x=2) and a write from region B (x=3) that complete in overlapping real time. Under sequential consistency, all replicas must agree on the same total order (e.g., B’s write first, then A’s), but they might not reflect real time — A’s write “wins” even if it started after B’s completed. Under linearizability, the real-time ordering matters: if B’s write completes before A’s write starts, then A’s write must appear after B’s everywhere. A system can be sequentially consistent (all replicas agree on order) but not linearizable (the agreed-upon order does not match real time).

// Linearizability test: does the read reflect a write that completed before it started?
// Wall clock: write starts at t=0, completes at t=5
//             read starts at t=6, completes at t=10
// Linearizability requires: read sees the write
// Sequential consistency: read might see the write OR the initial state (order is preserved, but real time is not)

// Real-world example of the difference:
// Sequential consistency: process A writes x=1, then x=2. Process B reads x.
// B might see x=2 even if B's read starts before A's write completes in wall-clock time,
// as long as B sees A's writes in order (x=1 then x=2, never x=2 then x=1).
// Linearizability: B's read at t=6 must see x=2 because A's write completed at t=5.

Consistency Model Comparison Table

Here is how real-world systems map to consistency models:

SystemDefault ModelStrongest Model AvailableNotes
DynamoDBEventual (per-item)Linearizable (per-item)Per-item linearizability; cross-item transactions are limited
CassandraEventual (ONE)QUORUM / ALLTunable consistency; QUORUM gives linearizability
Google SpannerLinearizableLinearizable (TrueTime)GPS + atomic clocks bound clock uncertainty to ±7ms
CockroachDBSnapshot + serializableSerializableSerializable isolation level via distributed transactions
etcdLinearizableLinearizableUses Raft consensus
MongoDB (wired tiger)Snapshot / linearizableLinearizable (majority read)Single-shard linearizability; multi-shard needs transactions
Amazon AuroraSession (read-after-write)Linearizable (MCI)Multi-master with cache-coherent invalidation
Azure Cosmos DBSession (default)Strong (bounded staleness)Five consistency models tunable per request
RiakEventualStrong (ALL) / causalCRDT-native; strong requires ALL replicas
Zookeeper (ZK)LinearizableLinearizableSingle-object operations; transactions across objects
KafkaPer-partition orderingExactly-once (transactional)Within a partition, all consumers see messages in order

Why it matters: Teams often pick a database based on its advertised consistency model without realizing they can tune it. Cassandra with QUORUM is strongly consistent. Cosmos DB with bounded staleness is cheaper than strong but still bounded. The default model is not the only model.

Bank account balances are the standard example:

// Strong consistency: read after write always returns the new value
async function transfer(from, to, amount) {
  const account = await db.account.findOne({ id: from });
  if (account.balance >= amount) {
    await db.account.update(
      { id: from },
      { balance: account.balance - amount },
    );
    await db.account.update({ id: to }, { balance: account.balance + amount });
    return { success: true };
  }
  return { success: false, reason: "Insufficient funds" };
}

// After this completes, any read of 'from' balance shows the new value
const balance = await db.account.findOne({ id: from });
// balance is definitely reduced, even if you read from a different replica

Strong consistency uses consensus protocols like Paxos or Raft to coordinate reads and writes across nodes. This coordination adds latency. In a geo-distributed system, a strongly consistent write might take 100-300ms. An eventually consistent write might take 5ms. The extra latency is usually worth it for data that matters.


Sequential Consistency

Sequential consistency sits between strong and eventual. It guarantees that all nodes see operations in the same order, but that order does not have to match real time.

Imagine two processes writing to the same variable:

// Sequential consistency: both replicas see writes in the same order
// but that order might not match real-world time

// Process A writes x = 1
// Process B writes x = 2
// Process C reads x

// Valid under sequential consistency: C might read 1 or 2
// Invalid: C reading x = 0 (the initial value) after both writes completed

With sequential consistency, you cannot have time travel paradoxes where a later write appears to happen before an earlier write. But you also cannot assume that a read reflects all writes that finished before it in real time.


Causal Consistency

Causal consistency is weaker. It only guarantees that causally related operations appear in order across all nodes. Operations with no causal relationship can be observed in different orders by different nodes.

Causally related means: if operation A causes operation B to happen, then A must be visible before B everywhere. Example:

// Causal consistency example
// Post 1: "I got promoted" (write by user)
// Post 2: "Celebrating tonight" (comment on post 1)

// Every node sees post 2 only after seeing post 1
// Because post 2 is causally dependent on post 1

// But "I got promoted" and "Weather is nice today" from the same user
// Are not causally related - they can appear in any order

Causal consistency is attractive because it is the strongest model that can be implemented without coordination in certain systems. The Dynamo paper popularized eventual consistency with causal metadata.


Eventual Consistency Models

Eventual consistency makes the weakest guarantee: if no new updates are made to an object, eventually all reads will return the last written value. This section covers the core eventual consistency variants and their practical implications.

Eventual Consistency

The word “eventually” is doing a lot of work here. It could mean milliseconds or hours. Different systems give different guarantees.

Bounded vs Unbounded Eventual Consistency

An important distinction often overlooked: bounded vs unbounded eventual consistency.

TypeDefinitionReal-World Examples
BoundedConvergence happens within a known, finite timeDynamoDB (typically seconds), Cassandra (usually < 1 second normally)
UnboundedConvergence will happen, but no time bound is guaranteedDNS propagation (minutes to hours), CDN cache expiry (hours to days)

Why this matters:

// Bounded eventual consistency - you can reason about staleness
// DynamoDB withDynamoDB Streams: staleness typically < 5 seconds
async function readWithStalenessBound(key) {
  const item = await dynamodb.getItem({ Key: key }).promise();
  // You know the item is at most 5 seconds stale (bounded)
  return item;
}

// Unbounded eventual consistency - no promises about when
// Traditional DNS: TTL might be 24 hours, but convergence is not guaranteed
async function resolveWithDNS(hostname) {
  const addresses = await dns.lookup(hostname);
  // Could return old IP for hours; no bounded guarantee
  return addresses;
}

Convergence time factors:

FactorImpact on Convergence Time
Network distanceGeo-distributed replicas add 50-200ms per hop
Load averageHigh load increases replication lag
Failure recoveryNode recovery triggers anti-entropy, can take minutes
Quorum availabilityIf quorum is impaired, writes may stall

For bounded eventual consistency, typical convergence times:

  • Same-region replicas: 10-100ms
  • Cross-region replicas: 100ms - 2s
  • Under failure conditions: can extend to 30s or more

For unbounded eventual consistency:

  • DNS: minutes to 48+ hours (depending on TTL and propagation)
  • CDN: hours to days
  • Some gossip protocols: theoretically unbounded, practically O(log N)

Session Consistency Models

Session consistency guarantees apply within a single client session, providing practical guarantees like read-your-writes and monotonicity that users expect in real applications. This section covers session guarantees and their implementations.

Session Consistency

Session consistency guarantees that within a single client session, certain properties hold. This is not a single model but a family of guarantees:

Session GuaranteeDefinitionImplementation
Read-your-writesSession sees its own writesRoute reads to primary or last-write replica
Monotonic readsSession never sees older valuesSticky sessions, read your writes routing
Monotonic writesSession’s writes are orderedPer-client write sequencing
Write-follows-readsWrite is ordered after reads that preceded itCausal ordering tracking

DynamoDB session consistency example:

// DynamoDB: Session-level consistency guarantees
// Using a client's session token to ensure read-your-writes

const dynamodb = new AWS.DynamoDB({ region: "us-east-1" });
const docClient = new AWS.DynamoDB.DocumentClient({
  dynamodb,
  sessionToken: "user-session-id", // Enables session guarantees
});

// With ConsistentRead=true, guarantees read-your-writes within session
const item = await docClient
  .get({
    TableName: "Posts",
    Key: { userId, postId },
    ConsistentRead: true, // Strong within this session
  })
  .promise();

PRAM (Pipelined Random Access Memory) Consistency

PRAM consistency is a weak model that guarantees:

Writes from a single process are observed by all processes in the order they were issued. Writes from different processes may be observed in different orders.

In simpler terms: your own writes appear in order to everyone; other people’s writes may appear in different orders to different observers.

// PRAM consistency example
// Process A writes x=1, then x=2
// Process B writes y=1, then y=2

// All processes see:
// - A's writes in order: x=1 then x=2
// - B's writes in order: y=1 then y=2
// But some might see B's writes interleaved with A's differently

// Valid PRAM ordering: x=1, x=2, y=1, y=2
// Also valid: y=1, y=2, x=1, x=2
// Also valid: x=1, y=1, x=2, y=2
// NOT valid: x=2, x=1 (A's writes out of order)

PRAM vs Causal consistency:

  • PRAM: Only your own write ordering is guaranteed
  • Causal: Write ordering that has causal dependency is guaranteed

PRAM is relatively easy to implement efficiently and provides useful guarantees for some workloads.


// Eventual consistency: write returns immediately
async function writeEventual(key, value) {
  // Write to local replica, return immediately
  await localCache.set(key, value);
  return { success: true, version: ++localVersion };
}

// Later, background replication propagates the write
// During the propagation window, reads might return stale values
async function read(key) {
  return await localCache.get(key); // Could be stale during propagation
}

Amazon DynamoDB uses eventual consistency by default. Google Spanner uses strong consistency. Most databases let you choose per-query.


Read Your Own Writes

Read-your-writes consistency (also called read-after-write consistency) guarantees that after a client writes a value, subsequent reads by that same client see that write.

This is intuitive for single-node systems but tricky in distributed systems:

// Without read-your-writes: user might not see their own post
await api.createPost({ title: "Hello World" });
const posts = await api.getMyPosts();
// posts might not include "Hello World" if the read hits a replica
// that has not yet received the write

// With read-your-writes: the read routes to the right place
await api.createPost({ title: "Hello World" });
const posts = await api.getMyPosts({ consistentRead: true });
// posts definitely includes "Hello World"

Most social media applications need read-your-writes for user-generated content. Users get confused when they post something and then do not see it. This is one of the most common bugs I see in production systems.


Monotonic Reads

Monotonic reads guarantee that if a client reads a value V at time T, it will never read an older value at a later time. Reads never go backward in time.

This prevents a jarring experience where data appears to roll back:

// Without monotonic reads: timeline can go backward
await api.createComment({ text: "Great post!" });
const comments1 = await api.getComments(postId); // Includes "Great post!"
// ... some time passes, replication lag increases ...
const comments2 = await api.getComments(postId); // "Great post!" disappeared!

// With monotonic reads: once you see a value, you never unsee it
const comments = await api.getComments(postId, { monotonic: true });
// Once "Great post!" appears, it stays visible for this client session

Monotonic reads are important for any application where users see a chronological list of items. Without this guarantee, pagination can show duplicate items or skip items that were added while you were reading.


Monotonic Writes

Monotonic writes guarantee that writes from a client appear in the order they were issued. This sounds obvious, but in distributed systems, a client might issue two writes that arrive at different replicas in different orders.

// Without monotonic writes: writes can be applied out of order
await api.updateProfile({ name: "Alice" });
await api.updateProfile({ status: "Busy" });
// The replica receiving updates might apply status before name
// resulting in name being set but status showing previous value

// With monotonic writes: writes are serialized per client
await api.updateProfile({ name: "Alice" }, { serial: true });
await api.updateProfile({ status: "Busy" }, { serial: true });
// Updates are applied in order, regardless of network timing

This is less commonly discussed but matters for operations that build on each other.


ACID vs BASE

Most developers learn ACID (Atomicity, Consistency, Isolation, Durability) as the default model for databases. In distributed systems, the alternative is BASE (Basically Available, Soft state, Eventually consistent).

These are not competing worldviews — they represent different trade-off philosophies.

PropertyACIDBASE
AtomicityAll or nothingAll or nothing (per node)
ConsistencyInvariants preserved at transaction endInvariants eventually restored
IsolationSerializable or weakerBest effort
DurabilitySynchronous write to diskAsynchronous, may lose recent writes
AvailabilityMay sacrifice for consistencyPrioritizes availability
Typical useFinancial, inventory, order managementWeb scale, social networks, caching layers

ACID systems (PostgreSQL, MySQL/InnoDB) use two-phase locking for isolation. They will stall or reject writes if locks cannot be acquired. BASE systems (most NoSQL) trade strict isolation for availability and partition tolerance. Neither is wrong — they serve different purposes.

The key insight: you do not have to choose one paradigm for your entire application. Many production systems use both — ACID databases as the system of record for transactions, and BASE stores (or caches) for read-heavy derived views. The challenge is knowing which data belongs in which model.

In practice: financial ledgers, inventory counts, and payment state belong in ACID. Social feeds, cached user preferences, and analytics dashboards belong in BASE.


Trade-off Analysis

This table provides a decision-matrix comparison for consistency model selection across practical dimensions:

DimensionLinearizableSequentialCausalSessionEventual
Write latency100-300ms (geo-dist)50-200ms20-100ms10-50ms5-20ms
Read latency100-300ms (geo-dist)50-200ms20-100ms10-50ms1-10ms
Coordination requiredAll nodes (quorum)All nodes (total order)Causally related opsPer-sessionNone
Staleness windowZeroZeroMinimalSession-scopedUnbounded or bounded
Implementation complexityHigh (consensus)MediumMediumLowLow
Failure behaviorUnavailable if quorum lostUnavailable if quorum lostDegraded (causal gaps)Degraded (session breaks)Available (may be stale)
CAP classificationCPCPCA hybridCA hybridAP
Cross-region writesExpensive (all regions)ExpensiveModerate (causal only)Cheap (session sticky)Cheapest (async)

Quorum Configuration Quick Reference

Replicas (N)Strong (R+W>N)Fast Reads (R=1)Fast Writes (W=1)
3R=2, W=2R=1, W=3R=3, W=1
5R=3, W=3R=1, W=5R=5, W=1
7R=4, W=4R=1, W=7R=7, W=1

Trade-off Summary

Consistency, availability, latency, and complexity form a fundamental trade-off space in distributed systems. Here is a consolidated view of the key decisions:

Trade-off DimensionStrong ConsistencyWeak/Eventual Consistency
LatencyHigh (quorum round-trips, consensus)Low (local writes, async replication)
ThroughputLower (coordination overhead)Higher (parallel local writes)
AvailabilityLower (fails if quorum unreachable)Higher (continues during partitions)
ComplexityHigher (consensus protocols, 2PC)Lower (eventual merge, CRDTs)
DebuggingEasier (total order, linearizable)Harder (staleness, race conditions)
UX RiskLower (reads always fresh)Higher (stale reads, rollback)
Consistency ModelBest ForAvoid When
LinearizableFinancial transactions, locks, inventoryLatency matters, geo-distributed
SequentialLeader-based reads, single-regionMulti-region with causality needs
CausalSocial apps, collaborative editingTotal order required
SessionUser-generated content, dashboardsFinancial or inventory updates
EventualCaches, counters, analyticsWrites must be immediately visible
WeakHigh-throughput logging, sensor dataAny correctness dependency

Production Failure Scenarios

Failure ScenarioImpactMitigation
Replica returns stale data after writeUser sees old value after confirming writeImplement read-your-writes by routing reads to quorum or primary
Monotonic read violationData appears to roll back, breaking user experienceUse sticky sessions or read-your-writes guarantees
Causal ordering violationComment appears before post it referencesImplement vector clocks for causal tracking
Conflict resolution produces wrong resultLWW picks wrong version in concurrent editsUse CRDTs for automatically mergeable data
Replication lag spikesExtended staleness window during high loadMonitor lag, alert on thresholds, scale replicas

Conflict Resolution Strategies

When multiple replicas accept concurrent writes, conflicts must be resolved. This section covers conflict resolution strategies, how to choose the right model, and common mistakes to avoid.

Choosing the Right Consistency Model

Picking a consistency model is not a one-time architectural decision — it is a per-operation decision that affects both performance and correctness.

Decision Framework

Answer these questions in order:

1. What happens if reads return stale data?

  • Financial transactions, inventory, payment state: Strong consistency (linearizable)
  • Social media feeds, likes, view counts: Eventual consistency is fine
  • User-generated content (posts, comments): Session consistency (read-your-writes)

2. What happens if writes are lost?

  • If a write being lost causes financial harm or data corruption: Strong consistency with quorum
  • If lost writes are acceptable and convergence is eventual: Eventual consistency

3. What is your tolerance for latency?

  • User-facing operations needing fast response: Eventual or session consistency
  • Background jobs, analytics: Eventual consistency is fine
  • Synchronous operations where users wait for confirmation: Strong consistency

4. Are operations causally dependent?

  • If operation B depends on operation A (e.g., comment on a post): Causal consistency at minimum
  • If operations are independent: Eventual consistency is usually sufficient

Consistency Model Selection Table

Use CaseRecommended ModelImplementation
Financial transfers, payment processingLinearizable (strong)QUORUM, Raft/Paxos, 2PC
User posts and commentsSession (read-your-writes)Sticky sessions, primary routing
Product catalog, pricing (display)Bounded eventualDynamoDB, Cassandra with LOCAL_QUORUM
Social media likes, follower countsEventualDynamoDB, Cassandra ONE
Collaborative editing (text, documents)Causal / CRDTRGA, OR-Set, operational transformation
Distributed locks, leader electionLinearizableetcd, Zookeeper (Zab/Raft)
Configuration, feature flagsStrong / linearizableetcd, Zookeeper
Recommendation engine, analyticsEventualCassandra, DynamoDB streams
Shopping cart operationsSession / monotonicDynamoDB with session token

Common Mistakes

  • Using strong consistency everywhere. This is the default for most developers because it is the mental model from single-node databases. Most systems do not need it everywhere, and paying the coordination cost globally tanks performance.
  • Choosing eventual consistency without measuring lag. “Eventual” is not “fast.” Under load, eventual consistency lag can spike to seconds or minutes. If your users notice stale data, you need bounded eventual at minimum.
  • Forgetting read-your-writes for user-facing apps. This is the single most common consistency bug in production. Users post content and do not see it. It is jarring and entirely preventable.
StrategyHow It WorksTrade-offs
Last-Write-Wins (LWW)Highest timestamp winsSimple but can lose writes; depends on clock sync
First-Write-WinsFirst write winsConservative but wasteful
CRDTsData structures that merge automaticallyNo conflicts by design; limited to certain types
Application MergeApp code decides how to combineFlexible but requires custom logic
Manual ResolutionUser decidesBest for critical conflicts; bad UX otherwise

CRDT Patterns and Examples

CRDTs (Conflict-free Replicated Data Types) provide mathematically proven conflict resolution by design. This section covers common CRDT implementations and when to use them.

CRDT Example: G-Counter

// Grow-only counter - merges by taking max of each replica's value
class GCounter {
  constructor(replicaId) {
    this.replicaId = replicaId;
    this.counts = {}; // { replicaId: value }
  }

  increment() {
    this.counts[this.replicaId] = (this.counts[this.replicaId] || 0) + 1;
  }

  merge(other) {
    for (const [id, value] of Object.entries(other.counts)) {
      this.counts[id] = Math.max(this.counts[id] || 0, value);
    }
  }

  value() {
    return Object.values(this.counts).reduce((a, b) => a + b, 0);
  }
}

CRDT Example: LWW-Register (Last-Write-Wins)

LWW-Register stores a value and a timestamp. On merge, the write with the higher timestamp wins:

// Last-Write-Wins Register - simplest conflict-free register
class LWWRegister {
  constructor(replicaId) {
    this.replicaId = replicaId;
    this.value = null;
    this.timestamp = 0;
  }

  set(value) {
    // Use hybrid logical clock or wall clock with care
    this.timestamp = Date.now();
    this.value = value;
  }

  merge(other) {
    if (other.timestamp > this.timestamp) {
      this.value = other.value;
      this.timestamp = other.timestamp;
    }
    // If our timestamp is newer or equal, keep our value
  }

  get() {
    return this.value;
  }
}

// Usage: inventory items, user preferences, configuration
const inventory = new LWWRegister("replica-1");
inventory.set({ item: "widget", count: 10 });
// Concurrent write from another replica with timestamp 1001
inventory.merge({ value: { item: "widget", count: 15 }, timestamp: 1001 });
console.log(inventory.get()); // { item: 'widget', count: 15 }

CRDT Example: OR-Set (Observed-Remove Set)

OR-Set allows adding and removing elements where add-wins-over-remove for concurrent operations:

// Observed-Remove Set - add wins over remove for concurrent ops
class ORSet {
  constructor(replicaId) {
    this.replicaId = replicaId;
    this.items = new Map(); // element -> { addedBy: replicaId, addTag: number }
    this.tombstones = new Map(); // element -> Set of removed tags
  }

  add(element) {
    const tag = Date.now() + Math.random();
    this.items.set(element, {
      addedBy: this.replicaId,
      addTag: tag,
      removed: false,
    });
  }

  remove(element) {
    // Mark as removed but don't delete - we need to track for merging
    const entry = this.items.get(element);
    if (entry) {
      entry.removed = true;
    }
  }

  merge(other) {
    for (const [element, entry] of other.items) {
      const ourEntry = this.items.get(element);

      if (!ourEntry) {
        // Element doesn't exist locally, add it
        this.items.set(element, { ...entry });
      } else if (entry.addTag > ourEntry.addTag) {
        // Other replica added this element more recently
        this.items.set(element, { ...entry });
      } else if (entry.removed && !ourEntry.removed) {
        // Other replica removed it, but we haven't - add wins
        // Keep the element
        ourEntry.removed = false;
      }
      // If both removed, stays removed (no-op)
    }
  }

  contains(element) {
    const entry = this.items.get(element);
    return entry && !entry.removed;
  }
}

// Usage: shopping cart items, collaborative document elements
const cart = new ORSet("user-1");
cart.add("item-A");
cart.add("item-B");
cart.remove("item-B");
// Concurrent removal of item-B and addition of item-C from another device
cart.merge({
  items: new Map([
    ["item-B", { addedBy: "user-1", addTag: 100, removed: true }],
    ["item-C", { addedBy: "user-1", addTag: 101, removed: false }],
  ]),
});
console.log(cart.contains("item-A")); // true
console.log(cart.contains("item-B")); // true (add wins)
console.log(cart.contains("item-C")); // true

CRDT Comparison Table

CRDT TypeOperationsMerge StrategyUse Cases
G-CounterIncrement onlyTake max per replicaVote counts, page views
PN-CounterIncrement + DecrementAdd positives, subtract negativesAccount balances
LWW-RegisterSet valueHigher timestamp winsConfiguration, preferences
OR-SetAdd + RemoveAdd wins over concurrent removeShopping carts, collaborative editing
RGAAdd + Remove (ordered)Conflation of concurrent removesChat messages, collaborative text

Transaction Coordination Protocols

Strong consistency in distributed transactions requires coordination protocols. This section covers two-phase commit, three-phase commit, and alternatives like the Saga pattern for long-running workflows.

Two-Phase Commit (2PC)

2PC is an atomic commitment protocol with two phases:

Phase 1 — Prepare: The coordinator sends a PREPARE message to all participants. Each participant votes YES if it can commit (has locked resources, written redo logs) or NO if it must abort.

Phase 2 — Commit/Abort: If all participants vote YES, the coordinator sends COMMIT and all participants apply the change. If any vote NO, the coordinator sends ABORT and all participants roll back.

// Simplified 2PC coordinator implementation
class TwoPhaseCommitCoordinator {
  async execute(transaction, participants) {
    // Phase 1: Prepare
    const votes = await Promise.all(
      participants.map((p) => p.prepare(transaction)),
    );

    const canCommit = votes.every((v) => v === "YES");

    if (canCommit) {
      // Phase 2a: Commit
      await Promise.all(participants.map((p) => p.commit(transaction)));
      return { success: true };
    } else {
      // Phase 2b: Abort
      await Promise.all(participants.map((p) => p.abort(transaction)));
      return { success: false, reason: "Participant voted NO" };
    }
  }
}

2PC Failure Modes

Failure PointProblemOutcome
Coordinator crashes before prepareParticipants lock resources indefinitelyTimeout-based rollback required
Coordinator crashes after prepare but before commitParticipants do not know decisionBlock waiting for coordinator
Participant crashes after prepareHas locks but voted YES, does not know decisionBlock or heuristic commit possible
Network partition during commitSome participants commit, others do notInconsistent state

The blocking problem is 2PC’s main weakness: if the coordinator fails after participants have voted YES, they block indefinitely waiting for the commit or abort decision.

Three-Phase Commit (3PC)

3PC adds a third phase to eliminate the blocking problem under synchronous networks with the finite upper bound on processor arrest:

Phase 1 — CanCommit: Coordinator asks if participants can commit (similar to 2PC prepare, but participants do not lock resources yet).

Phase 2 — PreCommit: If all say YES, coordinator sends PRE-COMMIT. Participants acknowledge and lock resources.

Phase 3 — DoCommit: Coordinator sends DO-COMMIT. Participants apply the change and release locks.

// 3PC eliminates blocking by adding the PreCommit phase
// If a participant times out in PreCommit, it can safely commit
// because everyone already voted YES and received PreCommit

// Timeout handling in 3PC participant
class ThreePhaseParticipant {
  async handlePreCommit(transaction) {
    this.state = "PRE_COMMITTED";
    await this.writeRedoLog(transaction);
    await this.acquireLocks(transaction);

    // Send acknowledgment
    await this.coordinator.ackPreCommit(transaction.id);
  }

  // If coordinator crashes here and we time out waiting for DoCommit:
  // We can safely commit because:
  // 1. We received PreCommit (meaning all voted YES)
  // 2. We know enough participants are in PreCommitted or Committed state
  async onDoCommitTimeout() {
    if (this.state === "PRE_COMMITTED") {
      await this.doCommit();
    }
  }
}

2PC vs 3PC Comparison

Aspect2PC3PC
BlockingYes (coordinator failure)No (under sync network assumptions)
Network rounds23
LatencyLowerHigher
Failure handlingParticipant block on coordinatorSafe commit on timeout in PreCommit
Real-world usageMost databases (with variations)Rarely used directly
AssumptionsCrash-stop, eventually synchronousEventually synchronous, bounded delay

Why 3PC is not widely used:

  • The assumption of eventually synchronous networks does not hold in geo-distributed systems
  • The protocol is sensitive to timing — if messages are delayed beyond bounds, the safety guarantees break
  • In practice, Paxos/Raft consensus protocols are preferred over 3PC for strong consistency

Sagas

For workflows that span multiple services and cannot hold locks for minutes or hours, the Saga pattern replaces atomic commit with a sequence of local transactions, each with a compensating transaction to undo its work:

// Saga pattern: compensating transactions instead of 2PC
class OrderSaga {
  steps = [
    { name: "reserveInventory", compensate: "releaseInventory" },
    { name: "processPayment", compensate: "refundPayment" },
    { name: "createShipment", compensate: "cancelShipment" },
  ];

  async execute() {
    const completed = [];

    for (const step of this.steps) {
      try {
        const result = await this.executeStep(step);
        completed.push({ step, result });
      } catch (error) {
        // Compensate in reverse order
        for (const { step } of completed.reverse()) {
          await this.executeCompensation(step);
        }
        throw error;
      }
    }
  }
}

Sagas trade atomicity for availability and low latency — if a compensation fails, you need alerting and manual intervention, but the system remains available.


Chain Replication

Chain replication is an alternative to Paxos/Raft for achieving strong consistency. In wide-area networks, it can offer lower write latency than quorum-based approaches. This section covers how chain replication works, its guarantees, and comparison with Raft.

How Chain Replication Works

Data is replicated along a chain of nodes: Head → Middle → Tail. Writes arrive at the Head and propagate sequentially down the chain. Reads are served by the Tail (which has seen all committed writes).

graph LR
    W[Write Request] --> H[Head Node]
    H --> M1[Node 2]
    M1 --> M2[Node 3]
    M2 --> T[Tail Node]
    R[Read Request] --> T
    T --> RV[Return Value]

Write path:

  1. Head receives write, appends to its log
  2. Head forwards to next node; each node appends to its log and forwards
  3. When Tail receives and acknowledges, the write is committed
  4. Tail sends commit acknowledgment back up the chain to Head
  5. Head responds to client

Read path:

  1. Client sends read to Tail
  2. Tail has the most up-to-date committed data (all writes pass through it)
  3. Tail returns the value directly

Chain Replication Guarantees

Because writes propagate in order and the Tail always holds the latest committed state, reads from the Tail are strongly consistent without needing quorum.

// Chain replication write latency: O(chain_length)
// In a 3-node chain: Head → Node2 → Tail
// Write goes through all 3 nodes before client gets acknowledgment

// Compare to Raft quorum: write must reach (N/2 + 1) nodes
// In a 3-node Raft cluster, quorum is 2 — same as chain replication
// Difference becomes apparent with more nodes:
// Raft quorum with 5 nodes: 3 nodes
// Chain replication with 5 nodes: all 5 nodes in chain

// Chain replication drawback: failure of Head or Tail requires reconfiguration
// Raft can elect new leader from any majority node

Chain Replication vs Raft Comparison

AspectChain ReplicationRaft Consensus
Write latencyO(N) sequential propagationO(N) quorum (but parallel)
Read latencyO(1) — direct to TailO(N) if leader, O(N) if followers
Strong consistencyYes (reads from Tail)Yes (leader or quorum reads)
Leader bottleneckNo (writes flow through chain)Yes (all writes go through leader)
Failure handlingHead or Tail failure needs rechainAny server can become leader
Geo-distributionChain topology is fixedFlexible majority quorum placement
ImplementationSimpler (deterministic chain)More complex (leader election)
Used byAzure Cosmos DB (some configs), CORFUetcd, CockroachDB, Consul

CORFU uses chain replication with flash storage for high-performance consensus in data centers.


Quorum-Based Systems

Quorum-based replication underpins many distributed databases. If you understand the math behind quorum selection, you can pick the right consistency level for your workload. This section covers the R/W/N model, quorum math, and practical trade-offs.

The R/W/N Model

In a system with N replicas, a quorum is a subset of replicas that must respond for an operation to succeed.

For read quorum R and write quorum W:

  • A write must be acknowledged by W replicas
  • A read must be acknowledged by R replicas
  • To guarantee strong consistency: R + W > N

This inequality ensures that every read quorum overlaps with every write quorum — at least one replica in any read quorum has seen the latest write.

Quorum Math Table

NRWR + W > NConsistency GuaranteeUse Case
311NoWeak (eventual)Maximum availability, any replica
321NoWeak (eventual)Fast writes, weaker reads
312NoWeak (eventual)Fast reads, weaker writes
322YesStrong (quorum overlap guaranteed)Cassandra QUORUM, DynamoDB
333YesStrongest (all nodes)Maximum durability, highest latency
533YesStrongCommon production configuration
523YesStrongRead-heavy workloads
532YesStrongWrite-heavy workloads

Quorum Selection Trade-offs

// DynamoDB: N=3 replicas (automatic)
// Strong consistency: R=3, W=3 (all replicas must acknowledge)
// Eventually consistent: R=1, W=1 (any replica)
// Session consistency: R=1, W=1 with session sticky

// Cassandra: N=3, configurable
// ONE:  R=1, W=1   — fastest, weakest consistency
// QUORUM: R=2, W=2 — balanced, strong consistency
// ALL:  R=3, W=3   — slowest, strongest durability

// The key insight: R+W > N guarantees that reads see writes
// This is why QUORUM in a 3-node cluster (R=2, W=2) is strongly consistent
// Even if the leader crashes mid-write, quorum ensures overlap

Quorum with Node Failures: Read Repair Implications

When some replicas are down, quorum operations continue but staleness can increase:

// Scenario: N=5, R=3, W=3
// If 2 replicas are down:
// - Writes still succeed (W=3 remaining replicas)
// - Reads still succeed (R=3 remaining replicas)
// - BUT: the 2 dead replicas miss the write

// When dead replicas come back online:
// - They must sync via anti-entropy (read repair only catches divergent reads)
// - During the recovery window, reads from the dead replicas would be stale

// Best practice: monitor replica health
// If more than (N-W) replicas are down, writes should fail
// Because you cannot guarantee W replicas will persist the write
const MIN_REPLICAS_FOR_WRITE = N - (N - W);
if (availableReplicas < MIN_REPLICAS_FOR_WRITE) {
  throw new Error("Insufficient replicas for safe write");
}

Sloppy Quorums and Hinted Handoff

Some systems (notably Dynamo) use sloppy quorums: when the primary quorum is unavailable, writes go to healthy nodes outside the preferred location. The data is hinted to be handed back when the original node recovers.

// Sloppy quorum: write succeeds even if preferred nodes are down
// Example: N=3, preferred nodes [A, B, C], all down
// Coordinator writes to [D, E, F] instead (sloppy quorum)
// When A comes back, D/E/F send hinted writes to A

// Trade-off: sloppy quorums sacrifice strict consistency for availability
// A read during the handoff window might not see a recent write
// But the system remains available during partial failures

// Hinted handoff sequence:
// 1. D receives write for A (A is down)
// 2. D stores the write with a hint: "deliver to A when available"
// 3. D periodically checks if A is back
// 4. When A recovers, D sends the hinted data to A
// 5. A applies the write and acknowledges to D

Data Repair Mechanisms

Distributed databases use two complementary mechanisms to maintain consistency: read repair and anti-entropy. This section explains both approaches and when each matters in production.

Read Repair (Reactive)

Read repair happens during read operations. When a replica returns stale data during a read, the system proactively pushes the fresh data to that replica.

// Conceptual read repair flow
async function readWithRepair(key) {
  // Read from multiple replicas
  const [replica1, replica2, replica3] = await Promise.all([
    readReplica("replica1", key),
    readReplica("replica2", key),
    readReplica("replica3", key),
  ]);

  // Check for divergence
  const versions = [replica1, replica2, replica3];
  const latest = findMostRecent(versions);

  // If any replica is stale, repair it
  for (const replica of versions) {
    if (replica.version < latest.version) {
      // Asynchronously push the correct value
      repairReplica(replica, latest);
    }
  }

  return latest.value;
}

Characteristics:

  • Triggered by: Read operations
  • Scope: Only repairs replicas that were contacted during the read
  • Latency impact: Minimal (async repair)
  • Coverage: Limited - replicas not read may remain stale

Anti-Entropy (Proactive)

Anti-entropy is a background process that continuously compares replicas and repairs any divergence, regardless of whether reads have occurred.

// Conceptual anti-entropy using Merkle trees
class AntiEntropyService {
  constructor(replicas) {
    this.replicas = replicas;
    this.merkleTrees = new Map();
  }

  // Build Merkle tree for a replica's data range
  buildMerkleTree(replica, range) {
    const tree = new MerkleTree();
    for (const [key, value] of this.getRange(replica, range)) {
      tree.insert(hash(key + value.version));
    }
    return tree;
  }

  // Compare trees to find divergent keys
  async compareReplicas(replicaA, replicaB) {
    const treeA = this.buildMerkleTree(replicaA, "all");
    const treeB = this.buildMerkleTree(replicaB, "all");

    // Find nodes that differ
    const diff = treeA.diff(treeB);

    // For each divergent node, traverse down to find actual keys
    const divergentKeys = [];
    for (const node of diff) {
      if (node.isLeaf()) {
        divergentKeys.push(node.key);
      } else {
        // Fetch children for deeper comparison
        const children = await this.fetchChildren(replicaA, node);
        divergentKeys.push(...children);
      }
    }

    // Sync divergent keys
    await this.syncKeys(replicaA, replicaB, divergentKeys);
  }
}

Characteristics:

  • Triggered by: Background scheduled process (e.g., Cassandra nodetool repair)
  • Scope: All replicas, regardless of recent reads
  • Latency impact: Can be significant during repair operations
  • Coverage: Comprehensive - finds all divergence

Comparison Table

AspectRead RepairAnti-Entropy
TriggerOn-readBackground/scheduled
ScopeOnly read replicasAll replicas
Staleness windowUntil next read of stale replicaZero (when repair completes)
Resource costLow (per-read)High (full tree comparison)
Failure detectionDetects on readDetects all divergence
ExampleCassandra, DynamoDBCassandra nodetool repair, Riak

When Each Mechanism Matters

ScenarioRead Repair Sufficient?Need Anti-Entropy?
Low read frequencyNo - replicas may go stale for long periodsYes
High read frequencyYes - repairs happen frequentlyNo
Write-heavy workloadsNo - stale replicas accumulateYes
Read-heavy workloadsYes - continuous reads repair frequentlyOptional
Compliance/audit requirementsNo - need guaranteed convergenceYes

Practical note: Most production systems use both. Cassandra uses read-repair for immediate repairs during reads and nodetool repair (anti-entropy) as a periodic background job to ensure all replicas converge.


CAP Theorem Implications

Eric Brewer’s CAP theorem (2000) states that a distributed system cannot simultaneously provide Consistency, Availability, and Partition tolerance. In practice, partitions happen — networks fail, nodes become unreachable. When they do, you must choose between consistency and availability.

The CAP choice is not permanent. You can choose different trade-offs for different parts of your system, and you can even change the choice dynamically.

ScenarioTrade-offExamples
Partition occursCP (consistency over availability)Zookeeper, etcd, MongoDB majority
Partition healsSystem recovers with full consistencyAll systems return to normal
No partition (normal ops)Both consistency and availabilityDepends on latency vs consistency needs

CP systems (Zookeeper, etcd, MongoDB with majority) will refuse to serve reads or accept writes if a majority cannot be reached. They will never return stale or potentially incorrect data.

AP systems (DynamoDB, Cassandra, Riak) will continue serving reads and accepting writes even during partitions. They may return stale data but they remain available.

The nuance most engineers miss: “availability” in CAP is not about latency, it is about whether the system can respond at all. An AP system that returns 503 because it is overloaded is not “available” in CAP terms. A CP system that returns “I cannot confirm this write” is still available — it is just not accepting writes.

PACELC extends this: even without partitions, latency and consistency are in trade-off. A strongly consistent system has higher write latency because of coordination. This is often the more relevant trade-off in geo-distributed deployments.


Delta CRDTs

Standard CRDTs (G-Counter, OR-Set) track the full state of a data structure. For large data structures like counters or registers, this is acceptable. For growing data structures like lists or text, full-state CRDTs become expensive — the merge payload grows with the size of the data structure.

Delta CRDTs solve this by propagating only the delta (change) since the last sync, rather than the full state.

// Delta-G-Counter: only propagate incremented deltas, not full state
class DeltaGCounter {
  constructor(replicaId) {
    this.replicaId = replicaId;
    this.fullState = {}; // { replicaId: count }
    this.pendingDeltas = {}; // { replicaId: deltaSinceLastSync }
  }

  increment() {
    this.fullState[this.replicaId] = (this.fullState[this.replicaId] || 0) + 1;
    this.pendingDeltas[this.replicaId] =
      (this.pendingDeltas[this.replicaId] || 0) + 1;
  }

  // Get delta to send to another replica
  getDelta() {
    const delta = { ...this.pendingDeltas };
    this.pendingDeltas = {}; // Clear pending after sending
    return delta;
  }

  // Merge received delta (not full state)
  mergeDelta(delta) {
    for (const [id, value] of Object.entries(delta)) {
      this.fullState[id] = Math.max(this.fullState[id] || 0, value);
    }
  }

  value() {
    return Object.values(this.fullState).reduce((a, b) => a + b, 0);
  }
}

// Usage: large counters that change frequently but need efficient sync
// Instead of sending full counter state (which could be huge), send only deltas

When Delta CRDTs matter: High-frequency counters (page views, vote counts) where the full state grows unbounded, but you need efficient network utilization. Most production CRDT deployments for large-scale systems use delta propagation.


Byzantine Fault Tolerance in Consistency

The consistency models discussed so far assume crash-stop failures: a node either works correctly or stops responding. Byzantine failures are worse — a node can behave arbitrarily, sending incorrect data, contradicting itself, or selectively responding to some nodes but not others.

Byzantine Fault Tolerance (BFT) is relevant for:

  • Systems where nodes may be compromised (adversarial environments)
  • Cryptocurrency blockchains (Bitcoin, Byzantine fault tolerance for consensus)
  • Aerospace and critical infrastructure

Practical BFT protocols like PBFT (Practical Byzantine Fault Tolerance) require more than two-thirds of nodes to be honest. This is significantly more expensive than crash-fault tolerant protocols like Raft, which only needs a majority.

For most web and business applications, crash-fault tolerance (Raft, Paxos) is sufficient. Byzantine fault tolerance is overkill and adds substantial overhead. The exception is if you are building systems where nodes may behave maliciously — in which case you are probably working on consensus for cryptocurrencies or distributed ledgers, not general web services.


Common Pitfalls / Anti-Patterns

Pitfall 1: Assuming “Eventually Consistent” Means “Quickly Consistent”

Problem: Teams assume eventual consistency means consistent within seconds when it can take minutes or longer during high load or failures.

Solution: Measure your actual staleness distribution under various conditions. Do not make UX assumptions without data. Consider bounded staleness models if users need timeliness guarantees.

Pitfall 2: Ignoring Monotonic Read Violations

Problem: When reads route to different replicas, users can see data “go backward” - seeing a value, then an older value. This is deeply confusing even if technically “eventually consistent.”

Solution: Use sticky sessions or read-your-writes guarantees for user-facing applications. Route reads to the same replica within a session.

Pitfall 3: Using Vector Clocks Without Understanding Them

Problem: Implementing vector clocks for causal consistency seems simple, but the memory and bandwidth overhead grows with replica count and history length.

Solution: Most databases handle this internally (DynamoDB, Cassandra). Only implement custom vector clocks if you understand the trade-offs and have measured the overhead.

Pitfall 4: Testing Consistency at Wrong Layer

Problem: Testing consistency only at the unit test level misses distributed race conditions that only appear under concurrent load.

Solution: Use chaos testing frameworks (Jepsen, Chaos Monkey) to inject failures and verify consistency guarantees hold. Test your application, not just the database.


Real-World Case Studies

Amazon DynamoDB: Eventual Consistency at Scale

DynamoDB launched with eventual consistency as the default reading model. It was a deliberate engineering trade-off — strong consistency required quorum reads that doubled latency and tripled cost. The team knew most workloads could tolerate brief staleness.

The Dynamo paper (2007) described how they handled the Cassandra counter incident in 2010: a burst of traffic to a popular item caused replication lag, and thousands of reads returned stale data for several minutes. Their mitigation was straightforward — they rate-limited the popular key and let anti-entropy catch up. The incident revealed that bounded vs unbounded eventual consistency matters even at Amazon’s scale.

DynamoDB now offers both eventual and strong consistency options per read. The eventual model still handles the majority of requests — it’s faster and cheaper.

Google Spanner: Strong Consistency Across Geo-Distributed Nodes

Spanner achieves strong consistency across globally distributed nodes using TrueTime — GPS and atomic clock hardware that bounds clock uncertainty to ±7ms. Writes go through Paxos consensus, and reads quorum on the latest Paxos sequence number.

The operational cost is real: Spanner writes take 10-40ms in the same region, 100-200ms cross-region. This is why Spanner is expensive. The trade-off is deliberate — if you need globally consistent data with availability guarantees, you pay in latency and cost.

A known failure: Spanner’s leader lease mechanism means that if a datacenter loses power and its Paxos leader goes down, there is a lease expiration period where no writes can be accepted (typically 10 seconds). During this window the system is unavailable for writes, even though TrueTime could theoretically allow it.

Cassandra: Tunable Consistency Meets the 2018 Outage

Cassandra’s tunable consistency model lets you choose per-query — ONE, QUORUM, ALL. The problem: developers often chose ONE for speed, and did not understand what that meant for durability.

In October 2018, a large Cassandra deployment experienced data loss after a node failure during a configuration change. The root cause was a regression in how Cassandra’s failure detection and repair mechanisms interacted with the ONE consistency level for writes. With replication factor 3 and consistency level ONE, losing one node could silently lose writes that the other replicas had not yet acknowledged.

The fix was not purely technical — it required updating operational runbooks to mandate QUORUM for all writes in production, and adding tooling to enforce this automatically.

Quick Recap Checklist

  • Eventual consistency guarantees that if no new updates are made, all replicas will eventually return the same value
  • Strong consistency ensures all reads receive the most recent write — requires coordination (e.g., Paxos, Raft)
  • Sequential consistency guarantees that all nodes see operations in the same total order as they were initiated
  • Causal consistency ensures causally related operations are seen by all nodes in order
  • Read-your-writes consistency guarantees a client always sees their own previous writes
  • Monotonic reads guarantee that once a client reads a value, subsequent reads never return older values
  • Monotonic writes guarantee that all writes by a client are processed in order across replicas
  • Session consistency provides a good balance — a client’s reads within a session are consistent
  • The CAP theorem constrains what consistency models are achievable during network partitions
  • PACELC extends CAP to describe latency-consistency trade-offs even when the system is healthy

Real-world Failure Scenarios

Scenario 1: Google Gmail’s Eventual Consistency Bug (2009)

What happened: In 2009, Gmail experienced a bug where sent emails were occasionally not delivered to recipients for several hours, despite the sender receiving a “sent successfully” confirmation.

Root cause: Gmail used an eventually consistent storage backend for message routing metadata. A routine storage cluster migration caused the routing index to become stale. Messages were correctly stored but the lookup index pointed to the wrong storage partition.

Impact: Approximately 0.5% of emails sent during the affected window were delayed by 2-8 hours. Business-critical emails (contract updates, interview invitations) arrived late, causing real-world consequences.

Lesson learned: Eventual consistency is acceptable for non-critical data, but the “eventual” window can extend unpredictably during infrastructure operations. Critical notification systems need either synchronous confirmation reads or application-level delivery tracking.

Scenario 2: Amazon DynamoDB’s Strict Consistency Confusion

What happened: In 2014, developers building on DynamoDB noticed inconsistent behaviour when using strongly consistent reads after write operations. In some multi-partition scenarios, reads immediately following writes returned stale values despite DynamoDB’s documentation stating strong consistency was the default.

Root cause: DynamoDB’s global tables replicate across regions using an eventually consistent merge. Strongly consistent reads are only guaranteed within a single region. Cross-region strongly consistent reads were silently falling back to eventual consistency without clear documentation.

Impact: Several applications assumed cross-region strong consistency, leading to data divergence during failover scenarios. One company discovered their disaster recovery data was 15 minutes stale when they needed it most.

Lesson learned: Understand the exact consistency guarantees of your database — not just the marketing name. “Strong consistency” can have geographic and operational scope limitations that aren’t immediately obvious.

Scenario 3: Facebook’s TAO Consistency Model in Practice

What happened: Facebook’s TAO distributed data store experienced a bug where cached object relationships (e.g., “user X is friend of user Y”) would occasionally show contradictory results when queried from different data centre locations simultaneously.

Root cause: TAO uses a hierarchy of caches with eventual consistency between tiers. A bug in the invalidation propagation caused stale friendship edges to persist in secondary caches after the primary cache was invalidated.

Impact: Users occasionally saw friend suggestions that contradicted their actual friend list, or saw content permissions applied incorrectly. The inconsistency was subtle enough that most users attributed it to UI bugs rather than data inconsistency.

Lesson learned: Multi-tier caching systems compound consistency problems. Each cache tier adds its own “eventual” window to writes. Cache invalidation is harder than cache population — design invalidation as carefully as you design the caching logic itself.

Interview Questions

1. A user posts a message to a chat application and immediately refreshes to see it, but it is not there. What consistency model is violated, and how would you fix it?

This is a read-your-writes consistency violation. The write went to one replica but the subsequent read hit a different replica that had not yet received the write. Fixes include routing reads to the primary or last-written replica for the session, using sticky sessions, or issuing reads with a higher consistency level (QUORUM) after a write.

2. You are designing a shopping cart for an e-commerce site. Which consistency model do you choose for cart operations, and why?

The cart needs strong consistency or at minimum read-your-writes. If a user adds an item and it disappears, they lose a sale. However, price and inventory checks can use eventual consistency — it is acceptable if the displayed price is slightly stale, as long as the checkout process re-verifies. Session consistency with QUORUM reads works well here.

3. Explain the difference between read-your-writes and monotonic reads.

Read-your-writes guarantees that after writing value V, subsequent reads by the same session see V or a more recent value. Monotonic reads guarantee that if a session sees value V at time T, it will never see an older value at time T'>T. Read-your-writes is about your own writes; monotonic reads is about not going backward in time for any reason.

4. A financial service processes payments. What consistency level would you use for the debit and credit operations?

Strong consistency (linearizable) — using QUORUM or ALL consistency levels, or distributed transactions with two-phase commit. The cost is latency and throughput, but for financial operations where incorrect balances cause real harm, this is non-negotiable. Many systems use a separate strongly-consistent ledger for the actual money movement and use eventual consistency only for display/read purposes.

5. PACELC says that in the presence of a partition, you choose between consistency and availability. But even without partitions, consistency has a cost. Explain using PACELC.

PACELC states: if there is a partition (P), you choose between availability (A) and consistency (C); else (E), even without partitions, latency (L) and consistency (C) are in trade-off. Strong consistency requires coordination — writes must go through a consensus protocol or quorum, which adds round trips. In a geo-distributed setting, a strongly consistent write might cross continents. An eventually consistent write can commit locally and replicate asynchronously, cutting latency significantly. The trade-off is that you may read stale data while replication is in flight.

6. In a system with 5 replicas (N=5), what quorum values (R, W) would you choose for a read-heavy workload that still requires strong consistency? Explain your reasoning.

For a read-heavy workload with strong consistency, R=3, W=3 (QUORUM) works well. R+W=6 > N=5 guarantees strong consistency. With 5 replicas, quorum is 3 — you need 3 replicas to acknowledge both reads and writes. For read-heavy workloads, you might also consider R=3, W=2 — this still gives R+W=5 > 5 for strong consistency, and reads are faster (3 replicas vs 3 for writes). The trade-off is slightly higher write latency since writes need 3 acknowledgments but reads only need 3 of 5 (same as quorum in this case).

7. What is the difference between 2PC (Two-Phase Commit) and 3PC (Three-Phase Commit)? Why does 3PC eliminate blocking but still not see widespread adoption?

2PC has a blocking problem: if the coordinator crashes after participants have voted YES, they wait forever. 3PC adds a PreCommit phase so participants know the decision before committing. If a participant times out waiting for DoCommit after PreCommit, it can safely commit because all participants already voted YES. However, 3PC assumes eventually synchronous networks — if messages are delayed beyond bounds, the safety guarantees break. In geo-distributed systems, this assumption does not hold. In practice, Paxos/Raft are preferred over 3PC because they handle leader elections more robustly and have cleaner failure semantics.

8. You are building a collaborative document editing application where multiple users can edit the same paragraph simultaneously. Which CRDT type would you use, and why?

An RGA (Replicated Growable Array) or similar ordered sequence CRDT is ideal for collaborative text editing. RGA handles concurrent inserts at the same position by using causal ordering — if two users insert "A" and "B" at position 5 simultaneously, the result is deterministic regardless of network ordering. Deletions are typically implemented as tombstones so that concurrent delete and insert operations merge correctly (add-wins or delete-wins depending on semantics). This avoids the complexity of Operational Transformation while providing automatic conflict resolution. The key property: no matter what order operations arrive, the final document state is identical on all replicas.

9. Explain the difference between read repair and anti-entropy. When would you rely only on read repair?

Read repair is reactive — when a client reads from multiple replicas and detects a stale value, the system asynchronously pushes the fresh data to stale replicas during that read. Anti-entropy is proactive — a background process continuously compares Merkle trees (or similar) across all replica pairs to find and repair divergence, regardless of whether reads have occurred. You would rely only on read repair when: reads are frequent (stale replicas get repaired quickly), write volume is low (staleness accumulates slowly), and you can tolerate replicas being stale between their own reads. You need anti-entropy when: reads are infrequent (replicas stay stale for long periods), write volume is high (stale replicas accumulate missed writes), or compliance requires guaranteed convergence within a bounded time.

10. Chain replication offers O(1) read latency for strongly consistent reads. Why do most production systems use Raft or Paxos instead?

Chain replication's O(1) read latency from the Tail node is attractive, but it has a critical vulnerability: the Head and Tail are single points of failure. If the Tail fails, you cannot serve reads until a reconfiguration completes. If the Head fails, writes stop propagating. Raft elections can elect any majority node as leader, providing better fault tolerance. Additionally, chain replication's sequential propagation means write latency is O(chain_length) — in a wide-area chain spanning multiple data centers, this adds significant latency. Raft's parallel quorum-based approach can complete writes with fewer round-trips in geo-distributed settings. Chain replication shines in single-datacenter or high-performance local-area deployments where the topology is stable.

11. A team member suggests using Last-Write-Wins (LWW) for a shopping cart because it is simple to implement. What is the hidden cost, and what would you recommend instead?

LWW appears simple but has a critical hidden cost: it depends on synchronized wall clocks across distributed nodes, which is notoriously unreliable. Clock skew, NTP drift, or VM pauses can cause the "last" write to actually be an older write that arrived later. This silently drops user data — imagine a user adding an item to their cart while another device removes an item, and due to clock skew the remove wins even though it happened first. Instead, I would recommend an OR-Set (Observed-Remove Set) CRDT for shopping carts — it preserves all add operations and uses causal ordering to handle concurrent adds and removes correctly. The trade-off is slightly more storage overhead, but the semantics are correct for e-commerce.

12. Your monitoring system shows replication lag spiking to 30+ seconds during peak hours. What are the most likely causes, and how do you diagnose them?

Common causes of replication lag spikes: (1) Network congestion or increased latency between datacenters — check cross-region latency metrics; (2) Write burst overwhelming replicas — check write throughput vs replica capacity; (3) Disk IOPS saturation on replicas — check disk queue depth and IOPS utilization; (4) Replica recovery after node failure — anti-entropy or read-repair catching up; (5) Garbage collection pauses in JVM-based replicas causing them to fall behind. To diagnose: check replication_lag_seconds per replica, compare lag spike timestamps with write throughput graphs, check disk and network metrics at the same time windows, and look for correlation with failure events or scaling events. Alerts should fire when lag exceeds acceptable bounds for your bounded eventual consistency window.

13. What is a sloppy quorum, and when might you intentionally choose it over a strict quorum?

A sloppy quorum writes to the first N healthy nodes when preferred nodes are unavailable, rather than failing. The data is "hinted" to be delivered to the correct node when it recovers. You might choose a sloppy quorum when: (1) you prioritize write availability over immediate consistency — a node being down should not block all writes; (2) you have geographically distributed replicas and want writes to succeed even if a datacenter is partially unavailable; (3) your use case tolerates brief staleness as long as writes eventually succeed. The trade-off is that during the handoff window, reads might not see the most recent write. Dynamo-style systems use sloppy quorums deliberately — they are designed for highest availability, not strongest consistency.

14. How would you design a system that needs strong consistency for financial transactions but eventual consistency for analytics dashboards reading from the same data?

The classic approach is a dual-path architecture: financial writes go through a strongly consistent path (Paxos/Raft quorum, or 2PC with a reliable coordinator), while analytics read from a separate eventually consistent replica. The key is event-driven replication: when a financial transaction commits, publish an event to a streaming system (Kafka, Kinesis) that asynchronously replicates to the analytics store. The analytics store uses its own eventual consistency model and can be tuned for read throughput rather than consistency. At consistency boundaries (e.g., month-end reporting), you take a snapshot from the strongly consistent store rather than relying on replication to have caught up. This pattern is used by most large-scale financial systems — the "source of truth" is strongly consistent and expensive; derived views are eventually consistent and cheap to query.

15. You are running Cassandra with N=3 and consistency level ONE for all reads and writes. What failure scenarios can cause data loss? How do you prevent it?

With N=3 and consistency level ONE, data loss occurs when: (1) A write is acknowledged after being written to only one replica, and that replica fails before the other two receive it — the write is lost; (2) A read returns data from a replica that later fails before anti-entropy repairs the other replicas — reads from that replica may have returned stale data in the window between the write and the failure; (3) A network partition splits the cluster such that a minority of replicas accept writes and then fail. Prevention: use QUORUM (R=2, W=2) for all production writes — R+W=4 > N=3 guarantees strong consistency and that at least one replica in any read quorum has seen every write. At minimum, use LOCAL_QUORUM for geographically distributed deployments to ensure reads and writes quorum within the local datacenter before acknowledging.

16. Explain what a vector clock is, and when you would use one instead of a simple timestamp for conflict resolution.

A vector clock is a per-object, per-replica timestamp structure that tracks the logical time at each replica. Unlike a wall-clock timestamp (which is single-valued), a vector clock is a map like {replica_A: 3, replica_B: 1, replica_C: 5}. It can capture causality: if replica A's clock is [3, 1, 5] and it receives an update from replica B with clock [3, 2, 5], A's merged clock is [3, 2, 5]. This lets you determine whether two versions of an object are causally related (one happened before the other) or concurrent (neither caused the other). You use vector clocks when: you need to implement causal consistency (Dynamo-style), you need to distinguish between "this version is older" vs "this version is concurrent but different," and your conflict resolution is more sophisticated than last-write-wins. The cost: vector clocks grow with replica count and history length, which is why most production systems either limit history or switch to LWW after a bounded window.

17. What is the relationship between isolation levels (in ACID databases) and consistency models (in distributed systems)?

Isolation levels in ACID databases and consistency models in distributed systems are related but not identical. Isolation levels (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE) govern how concurrent transactions interact within a single database node. Consistency models govern how data appears across distributed replicas. SERIALIZABLE isolation maps roughly to linearizability — all operations appear to happen in a global total order. READ COMMITTED maps roughly to session consistency — you see only committed data, but you may see your own writes multiple times or not see them immediately. The key difference: isolation levels are about concurrency within one machine; consistency models are about state across multiple machines. A database can have SERIALIZABLE isolation locally but still be eventually consistent across replicas if replication is asynchronous. This is why PostgreSQL's serializable snapshot isolation (SSI) is not the same as linearizability across a distributed setup — the serializability guarantee holds within one node, not across asynchronous replicas.

18. Why does MongoDB's default read concern "local" not provide linearizability, even though reads go to the primary replica?

MongoDB's "local" read concern reads from the primary replica, which means you always get the latest data the primary has acknowledged. However, it does not guarantee linearizability because: (1) The primary may have acknowledged a write that has not yet been replicated to a majority of replicas. If the primary fails and a new primary is elected from a replica that has not received this write, a subsequent read could go to the new primary and see a different (older) value. (2) "local" reads do not wait for replication to confirm the write is durable on a majority — they read whatever the primary has locally, which may not survive a failover. To get linearizability in MongoDB, you need "majority" read concern, which reads from the primary but waits to confirm the data has been replicated to a majority before returning it. This maps to the R+W > N model: majority reads guarantee you are reading from a replica that has seen all acknowledged writes.

19. A user notices that in your multi-region database, their profile update from the US East region is visible to their colleague in EU West within 200ms, but sometimes takes 2-5 seconds. What is happening?

You are likely seeing variable replication lag due to network conditions and consistency level. With a multi-region deployment and eventual consistency: cross-region network latency varies based on load, congestion, and routing — 50-200ms is typical for US-East to EU-West, but spikes happen during network congestion or BGP re-routing. The 200ms reads are likely hitting a replica that has already received the latest replication batch. The 2-5 second reads are hitting replicas still catching up on the replication queue. This is the "eventually" in eventual consistency doing what it says — convergence is not guaranteed within a bounded time unless you use bounded eventual consistency (with explicit latency SLOs) or strong consistency (QUORUM reads across regions, which would be 200-400ms per read). The fix depends on requirements: if users need to see their own writes immediately, implement read-your-writes by routing reads to the primary region for the session. If all reads need to be fast and consistent, you need a read-your-writes session guarantee plus replication lag monitoring.

20. What happens to your system's consistency guarantees when you add a caching layer (like Redis) in front of your database?

Adding a cache layer fundamentally changes the consistency model because the cache and database can diverge. The cache provides a shortcut for reads — if data is in cache, you skip the database entirely. If the cache is not invalidated or updated correctly after a write, you get stale reads from cache. This can violate even read-your-writes consistency: a user writes data, it goes to the database, the cache is not invalidated, the user reads back and hits the stale cache. The solution is cache-aside with careful invalidation: after any write, delete or update the cache entry so the next read fetches fresh data from the database. You can also use write-through cache (cache updated synchronously with database, more consistent but adds write latency) or read-through cache (cache populated on miss, no consistency benefit). Cache TTLs introduce another problem: even with correct invalidation, a long TTL means reads can be arbitrarily stale if the cache entry was populated before a write. The rule: caches are best for data that is read frequently and written rarely. For frequently-written data, the cache adds more consistency headaches than performance benefit unless you invalidate aggressively.

Observability Checklist

Metrics to Capture

  • read_staleness_seconds (histogram) - Time between write and read reflecting it
  • consistency_level_used (counter) - Breakdown by consistency level
  • conflict_resolution_duration_ms (histogram) - Time spent resolving conflicts
  • monotonic_violations_total (counter) - When reads go backward in time
  • replication_lag_seconds (gauge) - Per replica, per datacenter

Testing Eventual Consistency

// Jepsen-style test for eventual consistency
async function testEventualConsistency(clients, writeKey, writeValue) {
  // Write to all clients simultaneously
  await Promise.all(clients.map((c) => c.write(writeKey, writeValue)));

  // Poll until all clients return the value
  const deadline = Date.now() + 30000; // 30 second timeout
  while (Date.now() < deadline) {
    const results = await Promise.all(clients.map((c) => c.read(writeKey)));
    if (results.every((r) => r === writeValue)) {
      return { success: true, timeMs: Date.now() - start };
    }
    await sleep(100);
  }
  return { success: false, results };
}

Alerts to Configure

AlertThresholdSeverity
Staleness > 1s1000msWarning
Staleness > 10s10000msCritical
Monotonic violations> 0 in 5 minutesCritical
Conflict rate > 1%1% of writesWarning

Security Checklist

  • Authentication required for all replica communication
  • TLS encryption for all inter-node replication traffic
  • Audit logging of consistency level changes
  • Network policies restricting replica access
  • Encryption at rest for all replica data
  • Client credentials rotated periodically

Further Reading

Academic Papers

Standards and Specifications

Production Guides

For latency trade-offs, see the PACELC theorem post. For building highly available systems with consistency guarantees, see Availability Patterns.


Conclusion

  • Consistency models exist on a spectrum from strong to weak.
  • Read-your-writes matters for user-facing applications - always verify writes are visible.
  • Monotonic reads prevent jarring rollback experiences in pagination and feeds.
  • Causal consistency covers most real-world scenarios efficiently.
  • Conflict resolution (LWW, CRDTs, app-merge) must match your data’s semantics.
  • Most databases let you configure consistency per query - use this flexibility.

Copy/Paste Checklist

- [ ] Audit application to identify consistency requirements per operation
- [ ] Use strong consistency (QUORUM, LINEARIZABLE) for financial/inventory ops
- [ ] Implement read-your-writes for user-generated content
- [ ] Use sticky sessions to ensure monotonic reads
- [ ] Choose conflict resolution strategy: LWW, CRDT, or application merge
- [ ] Monitor replication lag and staleness metrics
- [ ] Test consistency guarantees under failure injection
- [ ] Document consistency requirements in ADR (Architecture Decision Records)

Category

Related Posts

Google Spanner: Globally Distributed SQL at Scale

Google Spanner architecture combining relational model with horizontal scalability, TrueTime API for global consistency, and F1 database implementation.

#distributed-systems #databases #google

Gossip Protocol: Scalable State Propagation

Learn how gossip protocols enable scalable state sharing in distributed systems. Covers epidemic broadcast, anti-entropy, SWIM failure detection, and real-world applications like Cassandra and Consul.

#distributed-systems #gossip-protocol #consistency

Synchronous Replication: Data Consistency Across Nodes

Learn about synchronous replication patterns in distributed databases, including Quorum-based replication and write-ahead log shipping for zero RPO.

#distributed-systems #replication #consistency