Distributed Transactions: ACID vs BASE Trade-offs

Explore distributed transaction patterns: ACID vs BASE trade-offs, two-phase commit, saga pattern, eventual consistency, and choosing the right model.

published: reading time: 34 min read author: GeekWorkBench

Distributed Transactions: ACID vs BASE Trade-offs

Introduction

In a single database, transactions are straightforward. The database handles atomicity, isolation, consistency, and durability (ACID). When you commit, it is committed. When you read data, it is the latest.

In distributed systems, this simplicity breaks. Data lives across multiple databases, services, and nodes. Keeping everything consistent requires coordination—and coordination has costs: latency, availability, and complexity.

This post explores trade-offs between consistency models, from strict ACID to relaxed eventual consistency, and how to choose the right model for your system.

Topic-Specific Deep Dives

Consistency Models

ACID vs BASE

These are two philosophies for handling distributed data.

ACID (Atomicity, Consistency, Isolation, Durability) prioritizes consistency above all else. Every transaction sees a consistent state. The database will wait (or fail) to maintain consistency.

BASE (Basically Available, Soft state, Eventually consistent) prioritizes availability. The system may return stale data temporarily, but given enough time, all replicas converge to the same value.

flowchart TD
    A[ACID] -->|Strong consistency| B[Reads always see latest writes]
    A -->|Isolation| C[Concurrent tx look serial]
    A -->|Durability| D[Committed data survives crashes]

    E[BASE] -->|Availability| F[System stays available]
    E -->|Eventual| G[Data converges over time]
    E -->|Soft state| H[State may change without input]

ACID is what you want for financial transactions. BASE is what you get with many NoSQL databases and distributed caches.

The CAP Theorem

Eric Brewer’s CAP theorem says you can have at most two of three: Consistency, Availability, and Partition tolerance. Since network partitions happen in real systems, you must choose between consistency and availability.

flowchart LR
    C[Consistency] -->|CP| DB[(Database)]
    C -->|CA| DB
    A[Availability] -->|AP| DB
    P[Partition Tolerance] --> C
    P --> A

A CP system (like ZooKeeper, etcd) will refuse to serve requests if it cannot confirm consistency. An AP system (like Cassandra, DynamoDB) will serve requests even during partitions but may return stale data.

Most real systems choose AP and use compensating transactions (like saga) to handle inconsistency.

Distributed Protocols

Consensus Algorithms

When 2PC’s coordinator fails, participants are left in limbo. Consensus algorithms solve this by using a quorum-based approach where any node can take over if the leader fails.

Paxos, developed by Leslie Lamport, uses two phases:

  1. Prepare phase: A node proposing a value sends a prepare request to a quorum of nodes. Nodes promise not to accept requests with a lower proposal number.
  2. Accept phase: If a majority responds, the proposer sends an accept request. Nodes accept if they haven’t seen a higher-numbered proposal.

Paxos guarantees safety (no two proposals can be chosen for the same value) but is notoriously difficult to implement correctly. Multi-Paxos optimizes for the common case where one node is the stable leader.

Raft was designed as a more understandable alternative to Paxos. It separates concerns into three sub-problems:

  • Leader election: Nodes vote for a leader using randomized election timeouts. The first node to timeout wins and becomes leader.
  • Log replication: The leader accepts entries and replicates them to followers via a heartbeat pipeline.
  • Safety: If a leader fails, only nodes with enough log entries can become the new leader.
flowchart TD
    A[Node A<br/>Candidate] -->|RequestVote| B[Node B<br/>Follower]
    A -->|RequestVote| C[Node C<br/>Follower]
    B -->|Vote granted| A
    C -->|Vote granted| A
    A -->|Elected Leader| D[Node A<br/>Leader]
    D -->|Heartbeat| B
    D -->|Heartbeat| C

Comparison: ZooKeeper and etcd use Raft. Google Spanner uses Paxos (specifically Multi-Paxos with TrueTime). CockroachDB uses a Raft variant called Raft + HLC.

Two-Phase Commit

Two-phase commit (2PC) coordinates a distributed transaction across multiple participants. A coordinator asks all participants to prepare. If all say yes, it tells them to commit. If any says no, it tells them all to rollback.

For a detailed explanation including failure handling, see Two-Phase Commit.

The problem with 2PC is that it is blocking. If the coordinator crashes after participants have prepared but before sending commit, participants must wait indefinitely. For this reason, many systems avoid 2PC in favor of saga or other patterns.

Distributed Locking

When you need exclusive access to a distributed resource, you need distributed locking. Unlike local locks, distributed locks must survive node failures and network partitions.

Chubby, Google’s lock service, uses Paxos to achieve consensus on lock ownership. Applications like BigTable and MapReduce use Chubby for leader election and distributed locking.

Redis Distributed Locks (Redlock): The Redlock algorithm uses multiple Redis instances to acquire a lock. A client writes a unique value to multiple Redis nodes. If a majority of nodes respond within a timeout, the lock is acquired. This approach is simpler than Paxos but has known safety issues under certain network partition scenarios.

ZooKeeper ephemeral nodes: A node can create an ephemeral file. If the client disconnects, ZooKeeper automatically deletes the file. Applications use this for leader election (the first node to create the file becomes leader) and distributed locks (lock files created in a specific order).

sequenceDiagram
    Client1 -->|Create lock| ZooKeeper
    ZooKeeper -->|Created| Client1
    Client2 -->|Create lock| ZooKeeper
    ZooKeeper -->|Node exists| Client2
    Client2 -->|Watch lock| ZooKeeper
    Client1 -->|Release lock| ZooKeeper
    ZooKeeper -->|Node deleted| Client2
    Client2 -->|Create lock| ZooKeeper
    ZooKeeper -->|Created| Client2

Key considerations for distributed locks:

  • Lock lifetime: How long should a lock be held? Too short and work doesn’t complete. Too long and other tasks wait unnecessarily.
  • Clock skew: If nodes have different clock speeds, lock timeouts can behave unexpectedly. Use logical clocks (Lamport timestamps or hybrid logical clocks) when possible.
  • Fencing tokens: Always include a monotonically increasing token with locked operations. Storage systems can reject stale writes using the token.

Consistency Patterns

Transaction Isolation Levels in Distributed Systems

Local database isolation levels (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE) don’t map directly to distributed systems. The guarantees differ.

Isolation LevelLocal MeaningDistributed Equivalent
Read UncommittedDirty reads allowedNot typically used
Read CommittedOnly committed readsEventual consistency with read repair
Repeatable ReadSame row, same valueSession consistency (read your writes)
SerializableFull serial orderLinearizable consistency (Paxos/Raft)

Snapshot isolation is common in distributed databases. Each transaction reads from a consistent snapshot (a point-in-time view). Writes are validated at commit time to ensure no conflicts. PostgreSQL uses snapshot isolation. CockroachDB implements it via its MVCC (Multi-Version Concurrency Control) layer.

Serialized Snapshot Isolation (SSI) extends snapshot isolation by detecting write-write conflicts at commit time rather than at write time. This eliminates the “write skew” anomaly but has higher overhead. CockroachDB uses SSI by default for serializable transactions.

The Saga Pattern

Saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction. If step 3 fails, steps 1 and 2 are compensated (undone).

See Saga Pattern for full details.

Saga sacrifices isolation and atomicity for availability. Unlike 2PC, saga does not block. Unlike ACID, saga allows intermediate states to be visible.

sequenceDiagram
    participant Saga
    participant Inv as Inventory
    participant Pay as Payment
    Saga-->|Reserve| Inv
    Inv-->|OK| Saga
    Saga-->|Charge| Pay
    Pay-->|OK| Saga
    Note over Saga,Pay: If payment fails, compensate inventory

Eventual Consistency

Eventual consistency guarantees that, if no new updates are made, all replicas will eventually return the same value. The system is available during the “eventually” window.

Amazon DynamoDB, Cassandra, and many distributed databases use eventual consistency by default or offer it as an option.

The “eventually” window can be milliseconds or seconds. Under high load or network partitions, it stretches.

Read Repair

One way to achieve eventual consistency: when you read data and detect inconsistency, you repair the stale replica on the fly.

def read(key):
    values = [replica.read(key) for replica in replicas]
    if not all_equal(values):
        # Repair stale replicas
        latest = max(values, key=lambda v: v.version)
        for replica in replicas:
            if replica.value != latest.value:
                replica.write(key, latest)
    return values[0]

This makes reads slightly more expensive but ensures convergence over time.

Anti-Entropy

Anti-entropy is background repair. Nodes periodically compare their data and fix inconsistencies. This runs continuously in the background, separate from read operations.

Consistency Levels

Many databases let you tune consistency per operation.

Strong consistency: All reads see the result of the latest write. No staleness.

Sequential consistency: All nodes see operations in the same order, but that order may lag behind wall clock time.

Causal consistency: If operation A causally precedes operation B, all nodes see A before B.

Eventual consistency: Given no new updates, all replicas eventually converge.

Read-your-writes consistency: Your own writes are always visible to subsequent reads from the same client. A special case of causal consistency.

# Strong (block until confirmed by quorum)
result = db.read('key', consistency='strong')

# Eventual (return immediately from nearest replica)
result = db.read('key', consistency='eventual')

# Session consistency (read from same replica as write)
result = db.read('key', consistency='session', session_id=client_session)

# Quorum read (R out of N replicas must respond)
result = db.read('key', consistency='quorum', read_quorum=3)

# Bounded staleness (read from replica within time bound)
result = db.read('key', consistency='bounded', max_staleness='5s')

Latency and Throughput by Consistency Level

The consistency level you choose directly affects both latency and throughput. Here’s what to expect in practice:

Consistency LevelRead LatencyWrite LatencyThroughputNotes
Strong (linearizable)20-100ms30-150msLowestMust confirm with quorum or leader
Sequential10-50ms20-80msLowTotal order broadcast required
Session5-15ms10-30msMediumReads from preferred replica
Bounded staleness2-10ms10-30msHighReturn stale data within bound
Eventual1-5ms5-20msHighestNearest replica, no confirmation

These numbers assume 3-5 replicas in a single region. Cross-region and global deployments push these higher. Strong consistency across regions can hit 200-500ms round-trip.

Why the gap matters: A user-facing API with a 200ms timeout can barely afford one strong-consistency read cross-region. It can handle dozens of eventual reads in the same window. For high-frequency operations (analytics, caching, social feeds), eventual consistency is the only practical choice.

Throughput scaling: With eventual consistency, you can scale read throughput horizontally by adding replicas. Every replica can serve reads. With strong consistency, every read must contact a quorum, so adding replicas does not help read throughput (though it does help with read availability).

Database-Specific Consistency Implementations

Different databases handle consistency very differently under the hood. Understanding what your database actually does helps you make better trade-offs.

Google Spanner uses TrueTime (GPS + atomic clocks) to bound clock uncertainty to ±7ms. This lets Spanner provide strong consistency without a single leader — any replica can serve reads at any timestamp, as long as it waits out the uncertainty window. The cost is read latency: strong reads must wait for time to advance past the uncertainty bound.

CockroachDB uses Hybrid Logical Clocks (HLC) instead of physical clocks. HLC combines physical time with a logical counter, giving you causality tracking without TrueTime’s hardware requirements. CockroachDB’s default consistency level is snapshot (like eventual), but you can request SYNC reads that force consensus.

Amazon DynamoDB uses last-write-wins by default for eventual consistency. Its conditional writes provide optimistic concurrency control, but it doesn’t support true multi-item transactions without DynamoDB Transactions (which uses 2PC internally). DynamoDB Streams provides ordered change capture per partition.

Apache Cassandra uses eventual consistency with tunable consistency per query. You can request ONE, QUORUM, or ALL reads/writes. Cassandra’s lightweight transactions (LWT) use Paxos for linearizable consistency on a per-partition basis — but LWT is expensive (4 round-trips) and should be used sparingly.

DatabaseConsistency ModelTransaction SupportMax Safe Latency
SpannerStrong (TrueTime)2PC + Paxos~50ms (single region)
CockroachDBStrong (HLC + Raft)Raft-based distributed SQL~20ms (single region)
DynamoDBEventual (LWW)Per-item with Transactions~10ms (provisioned)
CassandraEventual + LWTPaxos (LWT) per partition~5ms (LOCAL_* modes)
etcdStrong (Raft)Linearizable reads by default~5ms

Application-Level Consistency

Handling Inconsistency at the Application Level

When you choose eventual consistency, your application must handle partial states. This is where things get tricky.

Scenario: User orders a product. Payment succeeds. Inventory reservation succeeds. Shipment creation fails. The saga compensates and refunds the payment.

During the failure window, the user sees “order placed” but shipment does not exist. The saga eventually undoes everything, but the user may have seen inconsistent data briefly.

Mitigation options:

  • Show pending states clearly to the user
  • Use idempotency keys to safely retry
  • Implement compensating transactions that notify the user
  • Design sagas to minimize the inconsistency window

When to Use / When Not to Use Strong Consistency

CriteriaStrong Consistency (ACID)Eventual Consistency (BASE)
LatencyHigher (waits for confirmations)Lower (returns immediately)
AvailabilityLower (may refuse requests during partitions)Higher (always serves requests)
IsolationFull serializable isolationNo isolation (dirty reads possible)
ComplexitySimple transactionsComplex conflict resolution
RollbackAutomatic and freeRequires compensation logic
Use CaseFinancial, inventory, complianceAnalytics, social features, caching
Data ReturnedAlways latestMay be stale

When to Use Strong Consistency

Use strong consistency when:

  • Financial transactions where double-charging is unacceptable
  • Inventory systems where overselling has direct business impact
  • Operations with regulatory compliance requirements
  • Any case where stale reads cause business damage
  • Systems where users expect immediate visibility of their actions

When to Use Eventual Consistency

Avoid strong consistency when:

  • High-volume, low-value operations where blocking is too expensive
  • Systems that must stay available during network partitions
  • User-generated content (likes, comments) where occasional duplication is acceptable
  • Analytics and reporting where slightly stale data is fine
  • Operations where eventual consistency is explicitly acceptable to the business

Decision Tree: Which Consistency Model Should I Choose?

flowchart TD
    Start{What are you building?}
    Start --> Financial{Financial transaction?}
    Financial -->|Yes| Strong1[Use Strong Consistency / ACID transactions or 2PC / Accept higher latency]
    Financial -->|No| Inventory{Inventory management?}
    Inventory -->|Yes| Strong2[Use Strong Consistency or Saga with careful design / Prevent overselling]
    Inventory -->|No| UserContent{User-generated content?}
    UserContent -->|Yes| Eventual1[Use Eventual Consistency / Likes, comments, posts / Accept duplicates]
    UserContent -->|No| Analytics{Analytics or reporting?}
    Analytics -->|Yes| Eventual2[Use Eventual Consistency / Bounded staleness OK / Focus on throughput]
    Analytics -->|No| Partition{Need availability during partitions?}
    Partition -->|Yes| Eventual3[Use Eventual Consistency / BASE model / Saga for transactions]
    Partition -->|No| Latency{Critical latency requirements?}
    Latency -->|Yes| Eventual4[Use Eventual Consistency with quorum reads / Tune R/W values]
    Latency -->|No| Regulatory{Regulatory compliance requirements?}
    Regulatory -->|Yes| Strong3[Use Strong Consistency / Audit trails, compliance / Document decisions]
    Regulatory -->|No| Default[Default to Eventual / Add Strong only where needed / Profile before optimizing]

Quick Reference by Use Case:

Use CaseRecommended ModelReasoning
Banking/FinancialStrong (ACID/2PC)Double-charging unacceptable
E-commerce inventoryStrong or SagaOverselling has direct cost
Social media likesEventualDuplicates tolerable, speed matters
Product catalogEventualStale reads acceptable
Order processingSagaCross-service, compensating transactions
Analytics/DashboardEventualStale data fine, throughput critical
User profilesSession/Read-your-writesUser expects to see own changes
Stock tradingStrongRegulatory, exact values required

Consistency and Microservices

In microservices, each service owns its data. Cross-service consistency is the hardest problem. You cannot wrap a transaction around service A’s database and service B’s database.

Options:

  1. Accept eventual consistency: Use saga or choreography. Handle partial failures explicitly.
  2. Shared database: Both services use the same database. This creates coupling.
  3. 2PC: Atomic transaction across databases. Blocked during commit. Rarely used.
  4. Event sourcing: Store events, not state. Reconstruct state by replaying.

Most microservice architectures choose option 1: eventual consistency with saga or orchestration.

For related patterns, see Saga Pattern, Event-Driven Architecture, and Message Queue Types.

Use Cases: Strong Consistency

Strong consistency matters for financial transactions where double-charging is unacceptable. Inventory systems where overselling is unacceptable. Operations with regulatory compliance requirements. Any case where stale reads cause business damage.

For these cases, use 2PC or synchronous replication. Accept the latency and availability trade-offs.

Use Cases: Eventual Consistency

Eventual consistency makes sense for user-generated content (likes, comments) where occasional duplication is acceptable. Analytics and reporting where slightly stale data is fine. High-volume, low-value operations where blocking is too expensive. Systems that must stay available during network partitions.

For more on read and write patterns, see Database Replication.

Trade-off Analysis

Understanding the trade-offs between consistency models helps you make informed decisions for your specific use case.

Consistency vs Availability Trade-offs

ScenarioStrong Consistency (ACID)Eventual Consistency (BASE)
Network partition during writeRequest blocked or rejectedWrite accepted, synced later
Replica unavailableWrite fails (quorum not met)Write succeeds to available replicas
Read during partitionMay return errorReturns local replica (possibly stale)
Recovery after partitionAutomatic reconciliationRequires conflict resolution
Latency per operation20-150ms (quorum round-trip)1-5ms (local write)
Throughput ceilingLimited by slowest replicaHorizontally scalable with replicas

2PC vs Saga vs Event Sourcing Trade-offs

CriteriaTwo-Phase Commit (2PC)Saga PatternEvent Sourcing
AtomicityFull atomicity across participantsNo atomicity (partial states visible)Full atomicity via event log
BlockingYes (coordinator failure blocks participants)No (compensations are async)No (events appended immutably)
LatencyHigh (multiple round-trips)Medium (sequential local transactions)Low write, higher read complexity
ComplexityModerate (protocol handling)High (compensation logic per step)High (event reconstruction)
RollbackAutomatic (all or nothing)Manual (compensation transactions)Via compensating events
Audit TrailNo (only final outcome)Partial (via compensation logs)Full (immutable event history)
Use CaseFinancial, tight couplingCross-service business transactionsEvent-driven systems, CQRS

Consensus Algorithm Trade-offs

AlgorithmImplementation ComplexityLatencyFault ToleranceUse Cases
PaxosVery highMediumN/2 failuresGoogle Spanner, some distributed DBs
RaftHighMediumN/2 failuresetcd, ZooKeeper, CockroachDB
2PCModerateHighCoordinator SPOFRarely used in microservices

Lock-based vs Lock-free Trade-offs

CriteriaDistributed Locks (Chubby, Redlock, ZK)Lock-free (Optimistic, MVCC)
Contention handlingPessimistic (lock before operation)Optimistic (detect conflicts at commit)
Latency under low contentionHigher (lock acquisition overhead)Lower (no lock needed)
Latency under high contentionPredictable (queued)High (retry storms)
Starvation riskYes (if lock held too long)Yes (if retries fail)
ImplementationComplex (fencing, renewal, expiry)Moderate (version checking)

Failure Scenarios Overview

Real-world Failure Scenarios

FailureImpactMitigation
Coordinator crashes in 2PC prepare phaseParticipants hold locks indefinitelyUse Paxos-based consensus for coordinator; implement timeout-based lock release
Network partition during commitSome participants commit, others rollback; inconsistent stateDesign for partition tolerance; use saga for compensation
Replica lag in synchronous replicationRead latency increases; writes may timeoutUse semi-synchronous replication; monitor replica lag; adjust consistency level
Read repair produces wrong resultStale data returned under high contentionUse version vectors; implement conflict resolution; use quorum reads
Eventual consistency window too longUsers see stale data for unacceptable periodMonitor convergence time; adjust replication factor; use read repair
Deadlock in distributed transactionTransactions timeout; partial state possibleImplement deadlock detection; set appropriate timeout values
Fencing token reuse after lock expirationDuplicate writes accepted by storageAlways include monotonically increasing tokens; storage must validate strictly
Clock skew causing read inconsistenciesNodes disagree on latest valueUse logical clocks (HLC) instead of wall-clock time; bound uncertainty
Saga compensation failurePartial transaction state committed; inconsistent dataDesign idempotent compensations; implement retry with exponential backoff
Quorum split-brain during partitionBoth partitions accept writesRequire majority quorum; reject writes on minority partition

Consistency Failure Flow

flowchart TD
    Start[Read Request] --> Quorum{Quorum Reached?}
    Quorum -->|Yes| Latest{All Replicas Current?}
    Latest -->|Yes| ReturnLatest[Return Latest Value]
    Latest -->|No| Repair[Trigger Read Repair]
    Repair --> ReturnLatest
    Quorum -->|No| StaleOK{Can Return Stale?}
    StaleOK -->|Yes| ReturnStale[Return Stale Value / Mark for Repair]
    StaleOK -->|No| Error[Return Error]

2PC Failure Flow

flowchart TD
    Start[Start 2PC] --> Prepare[Coordinator sends Prepare to all participants]
    Prepare --> Wait{All Participants Prepare OK?}
    Wait -->|Yes| Commit[Coordinator sends Commit to all]
    Wait -->|No| Rollback[Coordinator sends Rollback to all]
    Commit --> CoordCrash{Coordinator Crashes?}
    CoordCrash -->|Yes| ParticipantsWait[Participants wait indefinitely]
    ParticipantsWait --> Recovery[Recovery protocol: Timeout + Query participants]
    Recovery --> Outcome{Determine outcome}
    Outcome -->|Commit| DoneC[Distributed transaction committed]
    Outcome -->|Rollback| DoneR[Distributed transaction rolled back]

Common Pitfalls / Anti-Patterns

Defaulting to strong consistency everywhere: Strong consistency is expensive. Not every operation needs it. Profile your access patterns and choose consistency levels appropriately.

Ignoring the CAP tradeoff: You cannot have both strong consistency and availability during partitions. Decide what your system needs and design accordingly—don’t pretend this constraint doesn’t exist.

Using 2PC without understanding blocking: The coordinator blocking problem is real. If you use 2PC, understand the failure modes and have mitigation plans ready before production.

Treating eventual consistency as failure: Eventual consistency is not a bug—it is a feature for scalability. Design your application to handle it gracefully from day one.

Not monitoring replication lag: If you use replica reads for scalability but do not monitor lag, you may serve stale data without knowing it until users complain.

Ignoring conflict resolution: With eventual consistency and concurrent writes, conflicts happen. Have a clear strategy: last-write-wins, application-level merge, or quorum-based resolution.

Interview Questions

1. Explain the CAP theorem. Can you give examples of systems that choose consistency vs availability?

Expected answer points:

  • CAP theorem states you can have at most two of three: Consistency, Availability, and Partition tolerance
  • Since partitions happen in real systems, you must choose between CP (Consistency + Partition tolerance) or AP (Availability + Partition tolerance)
  • CP examples: ZooKeeper, etcd, Google Spanner - these will refuse requests if they cannot confirm consistency
  • AP examples: Cassandra, DynamoDB, Amazon S3 - these will serve requests even during partitions but may return stale data
2. What is the difference between ACID and BASE consistency models?

Expected answer points:

  • ACID prioritizes strong consistency: atomicity, isolation, consistency, durability - the database waits or fails to maintain consistency
  • BASE (Basically Available, Soft state, Eventually consistent) prioritizes availability: the system may return stale data temporarily
  • ACID is used for financial transactions; BASE is common in NoSQL databases and distributed caches
  • Trade-off: ACID gives you certainty but higher latency and lower availability; BASE gives you speed and availability but eventual consistency
3. Describe how Two-Phase Commit works and its main drawbacks.

Expected answer points:

  • 2PC has a coordinator and participants. Phase 1 (prepare): coordinator asks all participants to promise to commit or rollback
  • Phase 2 (commit): if all participants voted yes, coordinator sends commit; if any voted no, coordinator sends rollback
  • Main drawback: blocking - if coordinator crashes after participants prepare but before commit, participants wait indefinitely
  • Other issues: coordinator is a single point of failure; protocol is not resilient to network partitions; all participants must be available
4. How does the Saga pattern handle distributed transactions differently from 2PC?

Expected answer points:

  • Saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction for rollback
  • If step N fails, steps 1 through N-1 are compensated (undone) in reverse order
  • Saga does not block (unlike 2PC) - compensations run as regular transactions
  • Saga sacrifices isolation and atomicity - intermediate states are visible, which 2PC does not allow
  • Best for long-running business transactions across services where 2PC's blocking is unacceptable
5. What are the different consistency levels in distributed databases, from strongest to weakest?

Expected answer points:

  • Strong (linearizable) consistency: all reads see the result of the latest write - highest latency, lowest availability
  • Sequential consistency: all nodes see operations in the same order, but order may lag wall clock time
  • Session consistency: read-your-writes - your own writes are always visible to subsequent reads from the same client
  • Causal consistency: if operation A causally precedes B, all nodes see A before B
  • Bounded staleness: return data within a time bound (e.g., within 5 seconds)
  • Eventual consistency: given no new updates, all replicas eventually converge - lowest latency, highest availability
6. Explain the difference between Paxos and Raft consensus algorithms.

Expected answer points:

  • Paxos, developed by Leslie Lamport, uses two phases (prepare/accept) to achieve consensus on a value
  • Paxos is proven correct but notoriously difficult to implement and understand
  • Raft was designed as a more understandable alternative, separating concerns into leader election, log replication, and safety
  • Raft uses randomized election timeouts for leader election - first node to timeout wins
  • Multi-Paxos optimizes for the common case where one node is the stable leader
  • Practical usage: ZooKeeper and etcd use Raft; Google Spanner uses Paxos
7. What is read repair and how does it achieve eventual consistency?

Expected answer points:

  • Read repair is a technique where, when you read data and detect inconsistency, you repair the stale replica on the fly
  • During a read, you query multiple replicas and compare versions
  • If replicas have different values, you update stale replicas with the latest value
  • This makes reads slightly more expensive but ensures convergence over time without background processes
  • Used in Amazon DynamoDB and Cassandra for eventual consistency
8. What are the main challenges with distributed locking and how do you mitigate them?

Expected answer points:

  • Lock lifetime: too short and work doesn't complete; too long and other tasks wait unnecessarily
  • Clock skew: different clock speeds cause unexpected timeout behavior - use logical clocks (Lamport or HLC) when possible
  • Fencing tokens: always include monotonically increasing tokens with locked operations so storage can reject stale writes
  • Network partitions: locks held by failed nodes are not automatically released - use lease-based locks with renewal
  • Single point of failure: avoid centralized lock services for critical paths - use quorum-based approaches
9. How does anti-entropy differ from read repair for achieving consistency?

Expected answer points:

  • Read repair fixes inconsistencies during read operations - it is triggered by read traffic
  • Anti-entropy is background repair - nodes periodically compare their data and fix inconsistencies separate from reads
  • Anti-entropy uses Merkle trees to efficiently compare large datasets between replicas
  • Anti-entropy runs continuously in the background and can catch inconsistencies that read repair misses (e.g., data that nobody reads)
  • Cassandra uses both: read repair for immediate consistency on reads, anti-entropy for background repair
10. When would you choose eventual consistency over strong consistency in a production system?

Expected answer points:

  • High-volume, low-value operations: when blocking is too expensive (e.g., social media likes, views, non-critical counters)
  • User-generated content: when occasional duplication or staleness is acceptable (e.g., comments, posts, profile updates)
  • Analytics and reporting: when slightly stale data is fine and throughput is critical
  • Systems that must stay available during network partitions: eventual consistency prioritizes availability
  • When your users won't notice or care about brief staleness - most user-facing features fall here
  • Counter-indication: financial transactions, inventory management, anything where stale reads cause business damage
11. What is the difference between synchronous and asynchronous replication in distributed databases?

Expected answer points:

  • Synchronous replication: write waits until all replicas confirm the write - strong consistency but higher latency
  • Asynchronous replication: write returns immediately after local commit, replicas updated in background - lower latency but eventual consistency
  • Semi-synchronous: write waits for at least one replica (not all) - balanced approach
  • Synchronous replication can cause writes to fail if a replica is down; asynchronous may lose writes if primary fails before replication
  • Most production systems use semi-synchronous or asynchronous with appropriate consistency levels
12. How does snapshot isolation work in distributed databases like CockroachDB?

Expected answer points:

  • Snapshot isolation gives each transaction a consistent point-in-time view of the database at the moment it starts
  • Under the hood, distributed databases like CockroachDB use MVCC (Multi-Version Concurrency Control) to maintain multiple versions of data
  • Reads see the most recent commit as of the transaction's start timestamp
  • Writes are validated at commit time to detect write-write conflicts
  • CockroachDB uses HLC (Hybrid Logical Clocks) to assign timestamps that preserve causality across nodes
  • Serializable snapshot isolation (SSI) extends this by detecting write-skew anomalies at commit time
13. What are fencing tokens and why are they important in distributed locking?

Expected answer points:

  • Fencing tokens are monotonically increasing identifiers returned when a lock is acquired
  • When a client holds a lock but gets partitioned, a new client can acquire the same lock
  • Without fencing tokens, storage systems cannot distinguish stale writes from fresh ones
  • With fencing tokens, storage can reject operations with stale tokens, ensuring only one writer succeeds
  • The pattern: include the fencing token in every operation under the lock; storage rejects if token is not strictly increasing
  • Google Chubby and Spanner use this pattern; it is essential when locks can be held by failed processes
14. Explain the difference between linearizability and serializability in distributed systems.

Expected answer points:

  • Linearizability is a recency guarantee on individual operations - every read sees the most recent write preceding it in real-time
  • Serializability is a property of a transaction history - the outcome is equivalent to some serial execution of transactions
  • Linearizability applies to single-object operations; serializability applies to multi-object transactions
  • A system can be linearizable without being serializable (e.g., single-object operations)
  • A system can be serializable without being linearizable (e.g., slow reads can return stale data)
  • Strong consistency in practice often means linearizable; serializable isolation (like SSI) is what database textbooks recommend
15. What is the role of a distributed transaction coordinator, and what happens when it fails?

Expected answer points:

  • The transaction coordinator manages the two-phase commit (2PC) protocol across participants
  • In phase 1, it asks participants to prepare; in phase 2, it tells them to commit or rollback
  • If the coordinator crashes after participants prepare but before commit, participants are blocked indefinitely
  • This blocking problem is why 2PC is avoided in practice for long-running or high-value transactions
  • Solutions: use Paxos-based consensus for coordinator (like Google Spanner), use saga pattern instead of 2PC
  • Some systems use backup coordinators or recovery protocols that query participants to determine outcome
16. How does the Redlock algorithm work and what are its known safety issues?

Expected answer points:

  • Redlock uses multiple independent Redis instances to acquire a distributed lock
  • A client writes a unique value to multiple nodes; if a majority responds within the timeout, the lock is acquired
  • The majority quorum (N/2+1) provides fault tolerance against node failures
  • Known safety issue: Redlock is not fault-tolerant to network partitions as originally designed
  • Under certain network partition scenarios, multiple clients can acquire the lock simultaneously
  • Martin Kleppmann identified these issues in his paper "Redis is mis-handled the Redlock algorithm incorrectly"
  • For production use, consider Redisson's RLock or use a consensus-based lock service like ZooKeeper/etcd
17. What is causal consistency and how does it differ from strong and eventual consistency?

Expected answer points:

  • Causal consistency guarantees that if operation A causally precedes operation B, all nodes see A before B
  • It is weaker than sequential consistency (which requires total order) but stronger than eventual consistency
  • Concurrent operations (A and B where neither causally precedes the other) can be seen in any order
  • Read-your-writes consistency is a special case of causal consistency - your own writes are always visible to your subsequent reads
  • Implementing causal consistency requires tracking dependencies (using vector clocks or similar)
  • Not widely implemented as a first-class consistency level; most systems choose between strong and eventual
18. How does Google Spanner achieve strong consistency without a single leader?

Expected answer points:

  • Spanner uses TrueTime, which combines GPS receivers and atomic clocks to bound clock uncertainty to ±7ms
  • Because clock uncertainty is bounded, any replica can serve reads at a timestamp once it waits out the uncertainty window
  • This allows leaderless strong reads - no single replica must be the leader for reads
  • Writes go through a Paxos-based consensus protocol (Multi-Paxos) for durability
  • The TrueTime API provides now() returns [earliest, latest] timestamps; commits wait until latest < write timestamp
  • Trade-off: read latency increases because replicas must wait out the uncertainty bound
19. What is the difference between choreography and orchestration in the saga pattern?

Expected answer points:

  • In choreography, each participant emits events that trigger the next step in other services - decentralized
  • In orchestration, a central coordinator tells participants what to do and handles compensations - centralized
  • Choreography benefits: services are loosely coupled; no single point of failure
  • Choreography drawbacks: business logic scattered across services; hard to see overall flow
  • Orchestration benefits: clear central view of the saga; easier to test and debug
  • Orchestration drawbacks: coordinator becomes a single point of failure if not designed carefully
  • Most practical sagas use a mix: lightweight orchestration for core flow, choreography for peripheral events
20. What are the main considerations for choosing between 2PC, saga, and event sourcing for distributed transactions?

Expected answer points:

  • 2PC: choose when you need atomicity across a small number of known participants and can accept blocking
  • Saga: choose when participants are independent services with long-running operations; availability matters more than atomicity
  • Event sourcing: choose when you need audit trails, temporal queries, or the ability to reconstruct state at any point
  • 2PC is not suitable for cross-service transactions in microservices due to tight coupling and blocking
  • Saga is the most common choice for business transactions spanning multiple services
  • Event sourcing pairs well with CQRS (Command Query Responsibility Segregation) for read-heavy systems
  • Consider the complexity of compensation logic vs. the cost of failures when making the choice

Further Reading

Quick Recap Checklist

Before shipping distributed transaction logic to production, run through this checklist:

  • CAP tradeoff documented — For each service, have you explicitly decided between CP and AP?
  • Consistency level chosen per operation — Not defaulting to strongest; choosing based on business requirements
  • Saga pattern implemented — For cross-service business transactions needing atomicity without 2PC blocking
  • 2PC avoided for microservices — Using saga or event sourcing instead where appropriate
  • Replication lag monitored — Alerting on lag exceeding defined thresholds
  • Transaction latency SLAs defined — P50/P95/P99 tracked and alerted
  • Conflict resolution strategy documented — Last-write-wins, application-level merge, or quorum-based
  • Idempotency implemented — All distributed operations are idempotent or use idempotency keys
  • Fencing tokens in place — Monotonically increasing tokens with all locked operations
  • Circuit breakers configured — Preventing cascade failures during partial outages
  • Correlation IDs in transaction logs — Every distributed operation traceable across services
  • Failure scenarios tested — Chaos testing or game days covering coordinator crash, partition, replica failure
  • Compensation logic verified — Saga compensations tested with injected failures at each step
  • Read repair and anti-entropy understood — Know which consistency mechanism your database uses
  • Lock lifetime tuned — Lock expiry long enough to complete work, short enough to unblock

Conclusion

Distributed transactions do not have a one-size-fits-all answer. ACID gives you consistency but sacrifices availability and latency. BASE gives you availability but introduces inconsistency windows.

In practice, most systems use a mix. Core business operations get strong consistency. Peripheral operations get eventual consistency. Saga patterns handle cross-service business transactions without 2PC’s blocking problems.

The right choice depends on your domain, your tolerance for inconsistency, and what your users expect. Do not default to strong consistency everywhere. It is expensive. Do not default to eventual consistency everywhere. Your users will notice bugs.

flowchart TD
    A[Consistency Models] --> B[Strong ACID]
    A --> C[Eventual BASE]
    B --> D[2PC, Synchronous Replication]
    C --> E[Saga, Async Replication]
    D --> F[High Latency, Lower Availability]
    E --> G[Low Latency, Higher Availability]

Key Points

  • ACID prioritizes consistency; BASE prioritizes availability
  • CAP theorem: you cannot have consistency, availability, and partition tolerance simultaneously
  • Choose consistency level per operation based on business requirements
  • Saga pattern handles distributed transactions without 2PC’s blocking problems
  • Eventual consistency requires application-level handling for conflicts

Production Checklist

# Distributed Transactions Production Readiness

- [ ] Consistency level chosen per operation based on requirements
- [ ] Saga pattern implemented for cross-service business transactions
- [ ] Replication lag monitored and alerted
- [ ] Transaction latency SLAs defined and monitored
- [ ] Conflict resolution strategy documented and implemented
- [ ] Idempotency implemented for all distributed operations
- [ ] CAP tradeoff understood and documented for each service
- [ ] Circuit breakers implemented to prevent cascade failures
- [ ] Correlation IDs in all distributed transaction logs

Category

Related Posts

The Outbox Pattern: Reliable Event Publishing in Distributed Systems

Learn the transactional outbox pattern for reliable event publishing. Discover how to solve the dual-write problem, implement idempotent consumers, and achieve exactly-once delivery.

#distributed-systems #patterns #event-driven

TCC: Try-Confirm-Cancel Pattern for Distributed Transactions

Learn the Try-Confirm-Cancel pattern for distributed transactions. Explore how TCC differs from 2PC and saga, with implementation examples and real-world use cases.

#distributed-systems #transactions #saga

Two-Phase Commit Protocol Explained

Learn the two-phase commit protocol for distributed transactions: prepare and commit phases, coordinator role, failure handling, and trade-offs.

#distributed-systems #transactions #protocols