Distributed Transactions: ACID vs BASE Trade-offs
Explore distributed transaction patterns: ACID vs BASE trade-offs, two-phase commit, saga pattern, eventual consistency, and choosing the right model.
Distributed Transactions: ACID vs BASE Trade-offs
Introduction
In a single database, transactions are straightforward. The database handles atomicity, isolation, consistency, and durability (ACID). When you commit, it is committed. When you read data, it is the latest.
In distributed systems, this simplicity breaks. Data lives across multiple databases, services, and nodes. Keeping everything consistent requires coordination—and coordination has costs: latency, availability, and complexity.
This post explores trade-offs between consistency models, from strict ACID to relaxed eventual consistency, and how to choose the right model for your system.
Topic-Specific Deep Dives
Consistency Models
ACID vs BASE
These are two philosophies for handling distributed data.
ACID (Atomicity, Consistency, Isolation, Durability) prioritizes consistency above all else. Every transaction sees a consistent state. The database will wait (or fail) to maintain consistency.
BASE (Basically Available, Soft state, Eventually consistent) prioritizes availability. The system may return stale data temporarily, but given enough time, all replicas converge to the same value.
flowchart TD
A[ACID] -->|Strong consistency| B[Reads always see latest writes]
A -->|Isolation| C[Concurrent tx look serial]
A -->|Durability| D[Committed data survives crashes]
E[BASE] -->|Availability| F[System stays available]
E -->|Eventual| G[Data converges over time]
E -->|Soft state| H[State may change without input]
ACID is what you want for financial transactions. BASE is what you get with many NoSQL databases and distributed caches.
The CAP Theorem
Eric Brewer’s CAP theorem says you can have at most two of three: Consistency, Availability, and Partition tolerance. Since network partitions happen in real systems, you must choose between consistency and availability.
flowchart LR
C[Consistency] -->|CP| DB[(Database)]
C -->|CA| DB
A[Availability] -->|AP| DB
P[Partition Tolerance] --> C
P --> A
A CP system (like ZooKeeper, etcd) will refuse to serve requests if it cannot confirm consistency. An AP system (like Cassandra, DynamoDB) will serve requests even during partitions but may return stale data.
Most real systems choose AP and use compensating transactions (like saga) to handle inconsistency.
Distributed Protocols
Consensus Algorithms
When 2PC’s coordinator fails, participants are left in limbo. Consensus algorithms solve this by using a quorum-based approach where any node can take over if the leader fails.
Paxos, developed by Leslie Lamport, uses two phases:
- Prepare phase: A node proposing a value sends a prepare request to a quorum of nodes. Nodes promise not to accept requests with a lower proposal number.
- Accept phase: If a majority responds, the proposer sends an accept request. Nodes accept if they haven’t seen a higher-numbered proposal.
Paxos guarantees safety (no two proposals can be chosen for the same value) but is notoriously difficult to implement correctly. Multi-Paxos optimizes for the common case where one node is the stable leader.
Raft was designed as a more understandable alternative to Paxos. It separates concerns into three sub-problems:
- Leader election: Nodes vote for a leader using randomized election timeouts. The first node to timeout wins and becomes leader.
- Log replication: The leader accepts entries and replicates them to followers via a heartbeat pipeline.
- Safety: If a leader fails, only nodes with enough log entries can become the new leader.
flowchart TD
A[Node A<br/>Candidate] -->|RequestVote| B[Node B<br/>Follower]
A -->|RequestVote| C[Node C<br/>Follower]
B -->|Vote granted| A
C -->|Vote granted| A
A -->|Elected Leader| D[Node A<br/>Leader]
D -->|Heartbeat| B
D -->|Heartbeat| C
Comparison: ZooKeeper and etcd use Raft. Google Spanner uses Paxos (specifically Multi-Paxos with TrueTime). CockroachDB uses a Raft variant called Raft + HLC.
Two-Phase Commit
Two-phase commit (2PC) coordinates a distributed transaction across multiple participants. A coordinator asks all participants to prepare. If all say yes, it tells them to commit. If any says no, it tells them all to rollback.
For a detailed explanation including failure handling, see Two-Phase Commit.
The problem with 2PC is that it is blocking. If the coordinator crashes after participants have prepared but before sending commit, participants must wait indefinitely. For this reason, many systems avoid 2PC in favor of saga or other patterns.
Distributed Locking
When you need exclusive access to a distributed resource, you need distributed locking. Unlike local locks, distributed locks must survive node failures and network partitions.
Chubby, Google’s lock service, uses Paxos to achieve consensus on lock ownership. Applications like BigTable and MapReduce use Chubby for leader election and distributed locking.
Redis Distributed Locks (Redlock): The Redlock algorithm uses multiple Redis instances to acquire a lock. A client writes a unique value to multiple Redis nodes. If a majority of nodes respond within a timeout, the lock is acquired. This approach is simpler than Paxos but has known safety issues under certain network partition scenarios.
ZooKeeper ephemeral nodes: A node can create an ephemeral file. If the client disconnects, ZooKeeper automatically deletes the file. Applications use this for leader election (the first node to create the file becomes leader) and distributed locks (lock files created in a specific order).
sequenceDiagram
Client1 -->|Create lock| ZooKeeper
ZooKeeper -->|Created| Client1
Client2 -->|Create lock| ZooKeeper
ZooKeeper -->|Node exists| Client2
Client2 -->|Watch lock| ZooKeeper
Client1 -->|Release lock| ZooKeeper
ZooKeeper -->|Node deleted| Client2
Client2 -->|Create lock| ZooKeeper
ZooKeeper -->|Created| Client2
Key considerations for distributed locks:
- Lock lifetime: How long should a lock be held? Too short and work doesn’t complete. Too long and other tasks wait unnecessarily.
- Clock skew: If nodes have different clock speeds, lock timeouts can behave unexpectedly. Use logical clocks (Lamport timestamps or hybrid logical clocks) when possible.
- Fencing tokens: Always include a monotonically increasing token with locked operations. Storage systems can reject stale writes using the token.
Consistency Patterns
Transaction Isolation Levels in Distributed Systems
Local database isolation levels (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE) don’t map directly to distributed systems. The guarantees differ.
| Isolation Level | Local Meaning | Distributed Equivalent |
|---|---|---|
| Read Uncommitted | Dirty reads allowed | Not typically used |
| Read Committed | Only committed reads | Eventual consistency with read repair |
| Repeatable Read | Same row, same value | Session consistency (read your writes) |
| Serializable | Full serial order | Linearizable consistency (Paxos/Raft) |
Snapshot isolation is common in distributed databases. Each transaction reads from a consistent snapshot (a point-in-time view). Writes are validated at commit time to ensure no conflicts. PostgreSQL uses snapshot isolation. CockroachDB implements it via its MVCC (Multi-Version Concurrency Control) layer.
Serialized Snapshot Isolation (SSI) extends snapshot isolation by detecting write-write conflicts at commit time rather than at write time. This eliminates the “write skew” anomaly but has higher overhead. CockroachDB uses SSI by default for serializable transactions.
The Saga Pattern
Saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction. If step 3 fails, steps 1 and 2 are compensated (undone).
See Saga Pattern for full details.
Saga sacrifices isolation and atomicity for availability. Unlike 2PC, saga does not block. Unlike ACID, saga allows intermediate states to be visible.
sequenceDiagram
participant Saga
participant Inv as Inventory
participant Pay as Payment
Saga-->|Reserve| Inv
Inv-->|OK| Saga
Saga-->|Charge| Pay
Pay-->|OK| Saga
Note over Saga,Pay: If payment fails, compensate inventory
Eventual Consistency
Eventual consistency guarantees that, if no new updates are made, all replicas will eventually return the same value. The system is available during the “eventually” window.
Amazon DynamoDB, Cassandra, and many distributed databases use eventual consistency by default or offer it as an option.
The “eventually” window can be milliseconds or seconds. Under high load or network partitions, it stretches.
Read Repair
One way to achieve eventual consistency: when you read data and detect inconsistency, you repair the stale replica on the fly.
def read(key):
values = [replica.read(key) for replica in replicas]
if not all_equal(values):
# Repair stale replicas
latest = max(values, key=lambda v: v.version)
for replica in replicas:
if replica.value != latest.value:
replica.write(key, latest)
return values[0]
This makes reads slightly more expensive but ensures convergence over time.
Anti-Entropy
Anti-entropy is background repair. Nodes periodically compare their data and fix inconsistencies. This runs continuously in the background, separate from read operations.
Consistency Levels
Many databases let you tune consistency per operation.
Strong consistency: All reads see the result of the latest write. No staleness.
Sequential consistency: All nodes see operations in the same order, but that order may lag behind wall clock time.
Causal consistency: If operation A causally precedes operation B, all nodes see A before B.
Eventual consistency: Given no new updates, all replicas eventually converge.
Read-your-writes consistency: Your own writes are always visible to subsequent reads from the same client. A special case of causal consistency.
# Strong (block until confirmed by quorum)
result = db.read('key', consistency='strong')
# Eventual (return immediately from nearest replica)
result = db.read('key', consistency='eventual')
# Session consistency (read from same replica as write)
result = db.read('key', consistency='session', session_id=client_session)
# Quorum read (R out of N replicas must respond)
result = db.read('key', consistency='quorum', read_quorum=3)
# Bounded staleness (read from replica within time bound)
result = db.read('key', consistency='bounded', max_staleness='5s')
Latency and Throughput by Consistency Level
The consistency level you choose directly affects both latency and throughput. Here’s what to expect in practice:
| Consistency Level | Read Latency | Write Latency | Throughput | Notes |
|---|---|---|---|---|
| Strong (linearizable) | 20-100ms | 30-150ms | Lowest | Must confirm with quorum or leader |
| Sequential | 10-50ms | 20-80ms | Low | Total order broadcast required |
| Session | 5-15ms | 10-30ms | Medium | Reads from preferred replica |
| Bounded staleness | 2-10ms | 10-30ms | High | Return stale data within bound |
| Eventual | 1-5ms | 5-20ms | Highest | Nearest replica, no confirmation |
These numbers assume 3-5 replicas in a single region. Cross-region and global deployments push these higher. Strong consistency across regions can hit 200-500ms round-trip.
Why the gap matters: A user-facing API with a 200ms timeout can barely afford one strong-consistency read cross-region. It can handle dozens of eventual reads in the same window. For high-frequency operations (analytics, caching, social feeds), eventual consistency is the only practical choice.
Throughput scaling: With eventual consistency, you can scale read throughput horizontally by adding replicas. Every replica can serve reads. With strong consistency, every read must contact a quorum, so adding replicas does not help read throughput (though it does help with read availability).
Database-Specific Consistency Implementations
Different databases handle consistency very differently under the hood. Understanding what your database actually does helps you make better trade-offs.
Google Spanner uses TrueTime (GPS + atomic clocks) to bound clock uncertainty to ±7ms. This lets Spanner provide strong consistency without a single leader — any replica can serve reads at any timestamp, as long as it waits out the uncertainty window. The cost is read latency: strong reads must wait for time to advance past the uncertainty bound.
CockroachDB uses Hybrid Logical Clocks (HLC) instead of physical clocks. HLC combines physical time with a logical counter, giving you causality tracking without TrueTime’s hardware requirements. CockroachDB’s default consistency level is snapshot (like eventual), but you can request SYNC reads that force consensus.
Amazon DynamoDB uses last-write-wins by default for eventual consistency. Its conditional writes provide optimistic concurrency control, but it doesn’t support true multi-item transactions without DynamoDB Transactions (which uses 2PC internally). DynamoDB Streams provides ordered change capture per partition.
Apache Cassandra uses eventual consistency with tunable consistency per query. You can request ONE, QUORUM, or ALL reads/writes. Cassandra’s lightweight transactions (LWT) use Paxos for linearizable consistency on a per-partition basis — but LWT is expensive (4 round-trips) and should be used sparingly.
| Database | Consistency Model | Transaction Support | Max Safe Latency |
|---|---|---|---|
| Spanner | Strong (TrueTime) | 2PC + Paxos | ~50ms (single region) |
| CockroachDB | Strong (HLC + Raft) | Raft-based distributed SQL | ~20ms (single region) |
| DynamoDB | Eventual (LWW) | Per-item with Transactions | ~10ms (provisioned) |
| Cassandra | Eventual + LWT | Paxos (LWT) per partition | ~5ms (LOCAL_* modes) |
| etcd | Strong (Raft) | Linearizable reads by default | ~5ms |
Application-Level Consistency
Handling Inconsistency at the Application Level
When you choose eventual consistency, your application must handle partial states. This is where things get tricky.
Scenario: User orders a product. Payment succeeds. Inventory reservation succeeds. Shipment creation fails. The saga compensates and refunds the payment.
During the failure window, the user sees “order placed” but shipment does not exist. The saga eventually undoes everything, but the user may have seen inconsistent data briefly.
Mitigation options:
- Show pending states clearly to the user
- Use idempotency keys to safely retry
- Implement compensating transactions that notify the user
- Design sagas to minimize the inconsistency window
When to Use / When Not to Use Strong Consistency
| Criteria | Strong Consistency (ACID) | Eventual Consistency (BASE) |
|---|---|---|
| Latency | Higher (waits for confirmations) | Lower (returns immediately) |
| Availability | Lower (may refuse requests during partitions) | Higher (always serves requests) |
| Isolation | Full serializable isolation | No isolation (dirty reads possible) |
| Complexity | Simple transactions | Complex conflict resolution |
| Rollback | Automatic and free | Requires compensation logic |
| Use Case | Financial, inventory, compliance | Analytics, social features, caching |
| Data Returned | Always latest | May be stale |
When to Use Strong Consistency
Use strong consistency when:
- Financial transactions where double-charging is unacceptable
- Inventory systems where overselling has direct business impact
- Operations with regulatory compliance requirements
- Any case where stale reads cause business damage
- Systems where users expect immediate visibility of their actions
When to Use Eventual Consistency
Avoid strong consistency when:
- High-volume, low-value operations where blocking is too expensive
- Systems that must stay available during network partitions
- User-generated content (likes, comments) where occasional duplication is acceptable
- Analytics and reporting where slightly stale data is fine
- Operations where eventual consistency is explicitly acceptable to the business
Decision Tree: Which Consistency Model Should I Choose?
flowchart TD
Start{What are you building?}
Start --> Financial{Financial transaction?}
Financial -->|Yes| Strong1[Use Strong Consistency / ACID transactions or 2PC / Accept higher latency]
Financial -->|No| Inventory{Inventory management?}
Inventory -->|Yes| Strong2[Use Strong Consistency or Saga with careful design / Prevent overselling]
Inventory -->|No| UserContent{User-generated content?}
UserContent -->|Yes| Eventual1[Use Eventual Consistency / Likes, comments, posts / Accept duplicates]
UserContent -->|No| Analytics{Analytics or reporting?}
Analytics -->|Yes| Eventual2[Use Eventual Consistency / Bounded staleness OK / Focus on throughput]
Analytics -->|No| Partition{Need availability during partitions?}
Partition -->|Yes| Eventual3[Use Eventual Consistency / BASE model / Saga for transactions]
Partition -->|No| Latency{Critical latency requirements?}
Latency -->|Yes| Eventual4[Use Eventual Consistency with quorum reads / Tune R/W values]
Latency -->|No| Regulatory{Regulatory compliance requirements?}
Regulatory -->|Yes| Strong3[Use Strong Consistency / Audit trails, compliance / Document decisions]
Regulatory -->|No| Default[Default to Eventual / Add Strong only where needed / Profile before optimizing]
Quick Reference by Use Case:
| Use Case | Recommended Model | Reasoning |
|---|---|---|
| Banking/Financial | Strong (ACID/2PC) | Double-charging unacceptable |
| E-commerce inventory | Strong or Saga | Overselling has direct cost |
| Social media likes | Eventual | Duplicates tolerable, speed matters |
| Product catalog | Eventual | Stale reads acceptable |
| Order processing | Saga | Cross-service, compensating transactions |
| Analytics/Dashboard | Eventual | Stale data fine, throughput critical |
| User profiles | Session/Read-your-writes | User expects to see own changes |
| Stock trading | Strong | Regulatory, exact values required |
Consistency and Microservices
In microservices, each service owns its data. Cross-service consistency is the hardest problem. You cannot wrap a transaction around service A’s database and service B’s database.
Options:
- Accept eventual consistency: Use saga or choreography. Handle partial failures explicitly.
- Shared database: Both services use the same database. This creates coupling.
- 2PC: Atomic transaction across databases. Blocked during commit. Rarely used.
- Event sourcing: Store events, not state. Reconstruct state by replaying.
Most microservice architectures choose option 1: eventual consistency with saga or orchestration.
For related patterns, see Saga Pattern, Event-Driven Architecture, and Message Queue Types.
Use Cases: Strong Consistency
Strong consistency matters for financial transactions where double-charging is unacceptable. Inventory systems where overselling is unacceptable. Operations with regulatory compliance requirements. Any case where stale reads cause business damage.
For these cases, use 2PC or synchronous replication. Accept the latency and availability trade-offs.
Use Cases: Eventual Consistency
Eventual consistency makes sense for user-generated content (likes, comments) where occasional duplication is acceptable. Analytics and reporting where slightly stale data is fine. High-volume, low-value operations where blocking is too expensive. Systems that must stay available during network partitions.
For more on read and write patterns, see Database Replication.
Trade-off Analysis
Understanding the trade-offs between consistency models helps you make informed decisions for your specific use case.
Consistency vs Availability Trade-offs
| Scenario | Strong Consistency (ACID) | Eventual Consistency (BASE) |
|---|---|---|
| Network partition during write | Request blocked or rejected | Write accepted, synced later |
| Replica unavailable | Write fails (quorum not met) | Write succeeds to available replicas |
| Read during partition | May return error | Returns local replica (possibly stale) |
| Recovery after partition | Automatic reconciliation | Requires conflict resolution |
| Latency per operation | 20-150ms (quorum round-trip) | 1-5ms (local write) |
| Throughput ceiling | Limited by slowest replica | Horizontally scalable with replicas |
2PC vs Saga vs Event Sourcing Trade-offs
| Criteria | Two-Phase Commit (2PC) | Saga Pattern | Event Sourcing |
|---|---|---|---|
| Atomicity | Full atomicity across participants | No atomicity (partial states visible) | Full atomicity via event log |
| Blocking | Yes (coordinator failure blocks participants) | No (compensations are async) | No (events appended immutably) |
| Latency | High (multiple round-trips) | Medium (sequential local transactions) | Low write, higher read complexity |
| Complexity | Moderate (protocol handling) | High (compensation logic per step) | High (event reconstruction) |
| Rollback | Automatic (all or nothing) | Manual (compensation transactions) | Via compensating events |
| Audit Trail | No (only final outcome) | Partial (via compensation logs) | Full (immutable event history) |
| Use Case | Financial, tight coupling | Cross-service business transactions | Event-driven systems, CQRS |
Consensus Algorithm Trade-offs
| Algorithm | Implementation Complexity | Latency | Fault Tolerance | Use Cases |
|---|---|---|---|---|
| Paxos | Very high | Medium | N/2 failures | Google Spanner, some distributed DBs |
| Raft | High | Medium | N/2 failures | etcd, ZooKeeper, CockroachDB |
| 2PC | Moderate | High | Coordinator SPOF | Rarely used in microservices |
Lock-based vs Lock-free Trade-offs
| Criteria | Distributed Locks (Chubby, Redlock, ZK) | Lock-free (Optimistic, MVCC) |
|---|---|---|
| Contention handling | Pessimistic (lock before operation) | Optimistic (detect conflicts at commit) |
| Latency under low contention | Higher (lock acquisition overhead) | Lower (no lock needed) |
| Latency under high contention | Predictable (queued) | High (retry storms) |
| Starvation risk | Yes (if lock held too long) | Yes (if retries fail) |
| Implementation | Complex (fencing, renewal, expiry) | Moderate (version checking) |
Failure Scenarios Overview
Real-world Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Coordinator crashes in 2PC prepare phase | Participants hold locks indefinitely | Use Paxos-based consensus for coordinator; implement timeout-based lock release |
| Network partition during commit | Some participants commit, others rollback; inconsistent state | Design for partition tolerance; use saga for compensation |
| Replica lag in synchronous replication | Read latency increases; writes may timeout | Use semi-synchronous replication; monitor replica lag; adjust consistency level |
| Read repair produces wrong result | Stale data returned under high contention | Use version vectors; implement conflict resolution; use quorum reads |
| Eventual consistency window too long | Users see stale data for unacceptable period | Monitor convergence time; adjust replication factor; use read repair |
| Deadlock in distributed transaction | Transactions timeout; partial state possible | Implement deadlock detection; set appropriate timeout values |
| Fencing token reuse after lock expiration | Duplicate writes accepted by storage | Always include monotonically increasing tokens; storage must validate strictly |
| Clock skew causing read inconsistencies | Nodes disagree on latest value | Use logical clocks (HLC) instead of wall-clock time; bound uncertainty |
| Saga compensation failure | Partial transaction state committed; inconsistent data | Design idempotent compensations; implement retry with exponential backoff |
| Quorum split-brain during partition | Both partitions accept writes | Require majority quorum; reject writes on minority partition |
Consistency Failure Flow
flowchart TD
Start[Read Request] --> Quorum{Quorum Reached?}
Quorum -->|Yes| Latest{All Replicas Current?}
Latest -->|Yes| ReturnLatest[Return Latest Value]
Latest -->|No| Repair[Trigger Read Repair]
Repair --> ReturnLatest
Quorum -->|No| StaleOK{Can Return Stale?}
StaleOK -->|Yes| ReturnStale[Return Stale Value / Mark for Repair]
StaleOK -->|No| Error[Return Error]
2PC Failure Flow
flowchart TD
Start[Start 2PC] --> Prepare[Coordinator sends Prepare to all participants]
Prepare --> Wait{All Participants Prepare OK?}
Wait -->|Yes| Commit[Coordinator sends Commit to all]
Wait -->|No| Rollback[Coordinator sends Rollback to all]
Commit --> CoordCrash{Coordinator Crashes?}
CoordCrash -->|Yes| ParticipantsWait[Participants wait indefinitely]
ParticipantsWait --> Recovery[Recovery protocol: Timeout + Query participants]
Recovery --> Outcome{Determine outcome}
Outcome -->|Commit| DoneC[Distributed transaction committed]
Outcome -->|Rollback| DoneR[Distributed transaction rolled back]
Common Pitfalls / Anti-Patterns
Defaulting to strong consistency everywhere: Strong consistency is expensive. Not every operation needs it. Profile your access patterns and choose consistency levels appropriately.
Ignoring the CAP tradeoff: You cannot have both strong consistency and availability during partitions. Decide what your system needs and design accordingly—don’t pretend this constraint doesn’t exist.
Using 2PC without understanding blocking: The coordinator blocking problem is real. If you use 2PC, understand the failure modes and have mitigation plans ready before production.
Treating eventual consistency as failure: Eventual consistency is not a bug—it is a feature for scalability. Design your application to handle it gracefully from day one.
Not monitoring replication lag: If you use replica reads for scalability but do not monitor lag, you may serve stale data without knowing it until users complain.
Ignoring conflict resolution: With eventual consistency and concurrent writes, conflicts happen. Have a clear strategy: last-write-wins, application-level merge, or quorum-based resolution.
Interview Questions
Expected answer points:
- CAP theorem states you can have at most two of three: Consistency, Availability, and Partition tolerance
- Since partitions happen in real systems, you must choose between CP (Consistency + Partition tolerance) or AP (Availability + Partition tolerance)
- CP examples: ZooKeeper, etcd, Google Spanner - these will refuse requests if they cannot confirm consistency
- AP examples: Cassandra, DynamoDB, Amazon S3 - these will serve requests even during partitions but may return stale data
Expected answer points:
- ACID prioritizes strong consistency: atomicity, isolation, consistency, durability - the database waits or fails to maintain consistency
- BASE (Basically Available, Soft state, Eventually consistent) prioritizes availability: the system may return stale data temporarily
- ACID is used for financial transactions; BASE is common in NoSQL databases and distributed caches
- Trade-off: ACID gives you certainty but higher latency and lower availability; BASE gives you speed and availability but eventual consistency
Expected answer points:
- 2PC has a coordinator and participants. Phase 1 (prepare): coordinator asks all participants to promise to commit or rollback
- Phase 2 (commit): if all participants voted yes, coordinator sends commit; if any voted no, coordinator sends rollback
- Main drawback: blocking - if coordinator crashes after participants prepare but before commit, participants wait indefinitely
- Other issues: coordinator is a single point of failure; protocol is not resilient to network partitions; all participants must be available
Expected answer points:
- Saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction for rollback
- If step N fails, steps 1 through N-1 are compensated (undone) in reverse order
- Saga does not block (unlike 2PC) - compensations run as regular transactions
- Saga sacrifices isolation and atomicity - intermediate states are visible, which 2PC does not allow
- Best for long-running business transactions across services where 2PC's blocking is unacceptable
Expected answer points:
- Strong (linearizable) consistency: all reads see the result of the latest write - highest latency, lowest availability
- Sequential consistency: all nodes see operations in the same order, but order may lag wall clock time
- Session consistency: read-your-writes - your own writes are always visible to subsequent reads from the same client
- Causal consistency: if operation A causally precedes B, all nodes see A before B
- Bounded staleness: return data within a time bound (e.g., within 5 seconds)
- Eventual consistency: given no new updates, all replicas eventually converge - lowest latency, highest availability
Expected answer points:
- Paxos, developed by Leslie Lamport, uses two phases (prepare/accept) to achieve consensus on a value
- Paxos is proven correct but notoriously difficult to implement and understand
- Raft was designed as a more understandable alternative, separating concerns into leader election, log replication, and safety
- Raft uses randomized election timeouts for leader election - first node to timeout wins
- Multi-Paxos optimizes for the common case where one node is the stable leader
- Practical usage: ZooKeeper and etcd use Raft; Google Spanner uses Paxos
Expected answer points:
- Read repair is a technique where, when you read data and detect inconsistency, you repair the stale replica on the fly
- During a read, you query multiple replicas and compare versions
- If replicas have different values, you update stale replicas with the latest value
- This makes reads slightly more expensive but ensures convergence over time without background processes
- Used in Amazon DynamoDB and Cassandra for eventual consistency
Expected answer points:
- Lock lifetime: too short and work doesn't complete; too long and other tasks wait unnecessarily
- Clock skew: different clock speeds cause unexpected timeout behavior - use logical clocks (Lamport or HLC) when possible
- Fencing tokens: always include monotonically increasing tokens with locked operations so storage can reject stale writes
- Network partitions: locks held by failed nodes are not automatically released - use lease-based locks with renewal
- Single point of failure: avoid centralized lock services for critical paths - use quorum-based approaches
Expected answer points:
- Read repair fixes inconsistencies during read operations - it is triggered by read traffic
- Anti-entropy is background repair - nodes periodically compare their data and fix inconsistencies separate from reads
- Anti-entropy uses Merkle trees to efficiently compare large datasets between replicas
- Anti-entropy runs continuously in the background and can catch inconsistencies that read repair misses (e.g., data that nobody reads)
- Cassandra uses both: read repair for immediate consistency on reads, anti-entropy for background repair
Expected answer points:
- High-volume, low-value operations: when blocking is too expensive (e.g., social media likes, views, non-critical counters)
- User-generated content: when occasional duplication or staleness is acceptable (e.g., comments, posts, profile updates)
- Analytics and reporting: when slightly stale data is fine and throughput is critical
- Systems that must stay available during network partitions: eventual consistency prioritizes availability
- When your users won't notice or care about brief staleness - most user-facing features fall here
- Counter-indication: financial transactions, inventory management, anything where stale reads cause business damage
Expected answer points:
- Synchronous replication: write waits until all replicas confirm the write - strong consistency but higher latency
- Asynchronous replication: write returns immediately after local commit, replicas updated in background - lower latency but eventual consistency
- Semi-synchronous: write waits for at least one replica (not all) - balanced approach
- Synchronous replication can cause writes to fail if a replica is down; asynchronous may lose writes if primary fails before replication
- Most production systems use semi-synchronous or asynchronous with appropriate consistency levels
Expected answer points:
- Snapshot isolation gives each transaction a consistent point-in-time view of the database at the moment it starts
- Under the hood, distributed databases like CockroachDB use MVCC (Multi-Version Concurrency Control) to maintain multiple versions of data
- Reads see the most recent commit as of the transaction's start timestamp
- Writes are validated at commit time to detect write-write conflicts
- CockroachDB uses HLC (Hybrid Logical Clocks) to assign timestamps that preserve causality across nodes
- Serializable snapshot isolation (SSI) extends this by detecting write-skew anomalies at commit time
Expected answer points:
- Fencing tokens are monotonically increasing identifiers returned when a lock is acquired
- When a client holds a lock but gets partitioned, a new client can acquire the same lock
- Without fencing tokens, storage systems cannot distinguish stale writes from fresh ones
- With fencing tokens, storage can reject operations with stale tokens, ensuring only one writer succeeds
- The pattern: include the fencing token in every operation under the lock; storage rejects if token is not strictly increasing
- Google Chubby and Spanner use this pattern; it is essential when locks can be held by failed processes
Expected answer points:
- Linearizability is a recency guarantee on individual operations - every read sees the most recent write preceding it in real-time
- Serializability is a property of a transaction history - the outcome is equivalent to some serial execution of transactions
- Linearizability applies to single-object operations; serializability applies to multi-object transactions
- A system can be linearizable without being serializable (e.g., single-object operations)
- A system can be serializable without being linearizable (e.g., slow reads can return stale data)
- Strong consistency in practice often means linearizable; serializable isolation (like SSI) is what database textbooks recommend
Expected answer points:
- The transaction coordinator manages the two-phase commit (2PC) protocol across participants
- In phase 1, it asks participants to prepare; in phase 2, it tells them to commit or rollback
- If the coordinator crashes after participants prepare but before commit, participants are blocked indefinitely
- This blocking problem is why 2PC is avoided in practice for long-running or high-value transactions
- Solutions: use Paxos-based consensus for coordinator (like Google Spanner), use saga pattern instead of 2PC
- Some systems use backup coordinators or recovery protocols that query participants to determine outcome
Expected answer points:
- Redlock uses multiple independent Redis instances to acquire a distributed lock
- A client writes a unique value to multiple nodes; if a majority responds within the timeout, the lock is acquired
- The majority quorum (N/2+1) provides fault tolerance against node failures
- Known safety issue: Redlock is not fault-tolerant to network partitions as originally designed
- Under certain network partition scenarios, multiple clients can acquire the lock simultaneously
- Martin Kleppmann identified these issues in his paper "Redis is mis-handled the Redlock algorithm incorrectly"
- For production use, consider Redisson's RLock or use a consensus-based lock service like ZooKeeper/etcd
Expected answer points:
- Causal consistency guarantees that if operation A causally precedes operation B, all nodes see A before B
- It is weaker than sequential consistency (which requires total order) but stronger than eventual consistency
- Concurrent operations (A and B where neither causally precedes the other) can be seen in any order
- Read-your-writes consistency is a special case of causal consistency - your own writes are always visible to your subsequent reads
- Implementing causal consistency requires tracking dependencies (using vector clocks or similar)
- Not widely implemented as a first-class consistency level; most systems choose between strong and eventual
Expected answer points:
- Spanner uses TrueTime, which combines GPS receivers and atomic clocks to bound clock uncertainty to ±7ms
- Because clock uncertainty is bounded, any replica can serve reads at a timestamp once it waits out the uncertainty window
- This allows leaderless strong reads - no single replica must be the leader for reads
- Writes go through a Paxos-based consensus protocol (Multi-Paxos) for durability
- The TrueTime API provides now() returns [earliest, latest] timestamps; commits wait until latest < write timestamp
- Trade-off: read latency increases because replicas must wait out the uncertainty bound
Expected answer points:
- In choreography, each participant emits events that trigger the next step in other services - decentralized
- In orchestration, a central coordinator tells participants what to do and handles compensations - centralized
- Choreography benefits: services are loosely coupled; no single point of failure
- Choreography drawbacks: business logic scattered across services; hard to see overall flow
- Orchestration benefits: clear central view of the saga; easier to test and debug
- Orchestration drawbacks: coordinator becomes a single point of failure if not designed carefully
- Most practical sagas use a mix: lightweight orchestration for core flow, choreography for peripheral events
Expected answer points:
- 2PC: choose when you need atomicity across a small number of known participants and can accept blocking
- Saga: choose when participants are independent services with long-running operations; availability matters more than atomicity
- Event sourcing: choose when you need audit trails, temporal queries, or the ability to reconstruct state at any point
- 2PC is not suitable for cross-service transactions in microservices due to tight coupling and blocking
- Saga is the most common choice for business transactions spanning multiple services
- Event sourcing pairs well with CQRS (Command Query Responsibility Segregation) for read-heavy systems
- Consider the complexity of compensation logic vs. the cost of failures when making the choice
Further Reading
- Two-Phase Commit - Deep dive into 2PC, including failure handling and alternatives
- Saga Pattern - Detailed coverage of choreography vs orchestration approaches
- Database Replication - Replication strategies and their consistency implications
- Event-Driven Architecture - How eventual consistency fits into event-driven systems
- Message Queue Types - Understanding messaging patterns that underpin distributed transactions
- Designing Data-Intensive Applications - Martin Kleppmann’s comprehensive guide to distributed systems
- The Part-Time Parliament - Lamport’s original Paxos paper
Quick Recap Checklist
Before shipping distributed transaction logic to production, run through this checklist:
- CAP tradeoff documented — For each service, have you explicitly decided between CP and AP?
- Consistency level chosen per operation — Not defaulting to strongest; choosing based on business requirements
- Saga pattern implemented — For cross-service business transactions needing atomicity without 2PC blocking
- 2PC avoided for microservices — Using saga or event sourcing instead where appropriate
- Replication lag monitored — Alerting on lag exceeding defined thresholds
- Transaction latency SLAs defined — P50/P95/P99 tracked and alerted
- Conflict resolution strategy documented — Last-write-wins, application-level merge, or quorum-based
- Idempotency implemented — All distributed operations are idempotent or use idempotency keys
- Fencing tokens in place — Monotonically increasing tokens with all locked operations
- Circuit breakers configured — Preventing cascade failures during partial outages
- Correlation IDs in transaction logs — Every distributed operation traceable across services
- Failure scenarios tested — Chaos testing or game days covering coordinator crash, partition, replica failure
- Compensation logic verified — Saga compensations tested with injected failures at each step
- Read repair and anti-entropy understood — Know which consistency mechanism your database uses
- Lock lifetime tuned — Lock expiry long enough to complete work, short enough to unblock
Conclusion
Distributed transactions do not have a one-size-fits-all answer. ACID gives you consistency but sacrifices availability and latency. BASE gives you availability but introduces inconsistency windows.
In practice, most systems use a mix. Core business operations get strong consistency. Peripheral operations get eventual consistency. Saga patterns handle cross-service business transactions without 2PC’s blocking problems.
The right choice depends on your domain, your tolerance for inconsistency, and what your users expect. Do not default to strong consistency everywhere. It is expensive. Do not default to eventual consistency everywhere. Your users will notice bugs.
flowchart TD
A[Consistency Models] --> B[Strong ACID]
A --> C[Eventual BASE]
B --> D[2PC, Synchronous Replication]
C --> E[Saga, Async Replication]
D --> F[High Latency, Lower Availability]
E --> G[Low Latency, Higher Availability]
Key Points
- ACID prioritizes consistency; BASE prioritizes availability
- CAP theorem: you cannot have consistency, availability, and partition tolerance simultaneously
- Choose consistency level per operation based on business requirements
- Saga pattern handles distributed transactions without 2PC’s blocking problems
- Eventual consistency requires application-level handling for conflicts
Production Checklist
# Distributed Transactions Production Readiness
- [ ] Consistency level chosen per operation based on requirements
- [ ] Saga pattern implemented for cross-service business transactions
- [ ] Replication lag monitored and alerted
- [ ] Transaction latency SLAs defined and monitored
- [ ] Conflict resolution strategy documented and implemented
- [ ] Idempotency implemented for all distributed operations
- [ ] CAP tradeoff understood and documented for each service
- [ ] Circuit breakers implemented to prevent cascade failures
- [ ] Correlation IDs in all distributed transaction logs Category
Related Posts
The Outbox Pattern: Reliable Event Publishing in Distributed Systems
Learn the transactional outbox pattern for reliable event publishing. Discover how to solve the dual-write problem, implement idempotent consumers, and achieve exactly-once delivery.
TCC: Try-Confirm-Cancel Pattern for Distributed Transactions
Learn the Try-Confirm-Cancel pattern for distributed transactions. Explore how TCC differs from 2PC and saga, with implementation examples and real-world use cases.
Two-Phase Commit Protocol Explained
Learn the two-phase commit protocol for distributed transactions: prepare and commit phases, coordinator role, failure handling, and trade-offs.