Distributed Systems Primer: Key Concepts for Modern Architecture
A practical introduction to distributed systems fundamentals. Learn about failure modes, replication strategies, consensus algorithms, and the core challenges of building distributed software.
Introduction
Single machines hit limits. A database on one server can only scale vertically so far. At some point, you need more compute, storage, or network bandwidth than one machine provides. Distribution lets you scale horizontally.
Beyond raw capacity, distribution enables reliability. If one server fails, others take over. Users in Tokyo and London can hit nearby servers instead of waiting for cross-continental round trips. Compliance might require data to stay within certain regions.
The trade-off is complexity. A single machine has one clock, one view of memory, and one path for operations. Distributed systems have multiple clocks that never agree perfectly, multiple views of data that can diverge, and network paths that fail in subtle ways.
Core Concepts
A distributed system must coordinate nodes that may fail independently. The Two General Problem shows why this is hard: no matter what protocol two generals agree on, they cannot guarantee they have both committed to the same plan with certainty.
graph LR
A[General A] -->|Attack at dawn| B[General B]
B -->|Acknowledgment| A
A -->|Confirmation| B
B -->|Must confirm| A
A -->|Infinite loop| B
In practice, we accept probabilistic guarantees. Systems use acknowledgments, timeouts, and retries to achieve “good enough” coordination. Understanding what guarantees are actually possible keeps you from over-engineering for certainty that mathematics forbids.
Failure Modes
Distributed systems face richer failure modes than single machines:
Crash Faults
A node stops working. It stops responding to requests. Detection is straightforward: no response within timeout means crash. Handling is relatively simple: switch to a backup.
Byzantine Faults
A node keeps running but produces incorrect responses. It might be buggy, compromised by an attacker, or misconfigured. Byzantine fault tolerance requires cryptographic verification and consensus among multiple nodes. Financial systems and aerospace computing deal with these requirements.
Network Partitions
Nodes stay alive but cannot communicate with each other. The network between them fails. This is the scenario the CAP theorem addresses. During a partition, you choose between serving stale data (availability) or refusing to serve (consistency).
graph TD
subgraph DC1["Data Center 1"]
A[Node A] --> B[Node B]
end
subgraph DC2["Data Center 2"]
C[Node C] --> D[Node D]
end
A -.->|Partition| C
B -.->|Partition| D
Partitions are not rare events. In any large enough system, you should expect regular network blips. Design for partitions, not against them.
See the availability patterns post for building systems that remain available despite partitions. The CAP theorem post explains the fundamental trade-off during partitions.
Replication Strategies
Once you distribute data across multiple nodes, you must keep those copies in sync. Different strategies offer different trade-offs.
Synchronous Replication
Write to all replicas before confirming to the client. All copies stay consistent, but write latency equals the slowest replica. If one replica is slow or unreachable, writes fail.
async function syncWrite(key, value, replicas) {
const results = await Promise.all(replicas.map((r) => r.write(key, value)));
if (results.some((r) => !r.success)) {
throw new Error("Write failed");
}
return { success: true };
}
Used when consistency matters more than write availability. Traditional relational databases often use synchronous replication within a data center.
Asynchronous Replication
Write to primary, acknowledge immediately, then propagate to replicas in background. Writes are fast, but copies can diverge during normal operation and especially during failures.
async function asyncWrite(key, value, primary) {
await primary.write(key, value); // Immediate acknowledgment
await backgroundPropagate(key, value); // Replicas updated later
return { success: true };
}
Used when write latency matters more than immediate consistency. Many NoSQL databases default to async replication.
Quorum-Based Replication
Writes must succeed on a quorum of replicas (typically N/2 + 1). Reads also query a quorum. If the same quorum is used for reads and writes, you get strong consistency.
// Quorum calculation for N=3 replicas
// Strong consistency: W=2, R=2 (W+R > N means reads see previous writes)
// Eventual consistency: W=1, R=1 (any replica can serve reads)
DynamoDB, Cassandra, and similar systems let you tune consistency per query. See the consistency models post for the spectrum from strong to eventual consistency.
Consensus Algorithms
Some operations require all nodes to agree. Which node is the leader? Should we accept this transaction? Did this client actually request this operation? Consensus algorithms solve agreement in the presence of failures.
Paxos
The classic consensus algorithm, developed by Leslie Lamport in 1989. Paxos works in two phases: prepare and accept. Nodes only accept values that have not been seen with a higher proposal number. Paxos is proven correct but notoriously hard to implement.
Most production systems use Multi-Paxos, which optimizes for the common case where one node is the stable leader for an extended period.
Raft
Designed to be more understandable than Paxos. The authors called it “In Search of an Understandable Consensus Algorithm.” Raft separates concerns into leader election, log replication, and safety.
graph LR
A[Follower] -->|Election timeout| B[Candidate]
B -->|Votes from majority| C[Leader]
C -->|Heartbeat timeout| A
B -->|Another leader appears| A
C -->|Crash| A
Raft is used in etcd (Kubernetes’ backing store), CockroachDB, and many other systems. If you need consensus and want something you can actually debug, Raft is usually the right choice.
Why Not Just Use Two-Phase Commit?
Two-phase commit (2PC) is not a consensus algorithm. It coordinates commits across nodes but has a critical flaw: if the coordinator crashes after participants have voted “yes” but before sending “commit,” participants wait forever. 2PC is blocking and not partition-tolerant.
See the two-phase commit and distributed transactions posts for alternatives like the saga pattern.
Time in Distributed Systems
Clocks across machines never agree perfectly. Network latency varies. Machines crash and restart with incorrect times. Understanding time is crucial for debugging and correctness in distributed systems.
Physical Clocks vs Logical Clocks
Physical clocks try to track real time but drift and skew. NTP synchronization helps but cannot eliminate uncertainty. For ordering events, logical clocks (Lamport timestamps, vector clocks) provide better guarantees.
Lamport timestamps capture “happened-before” relationships: if A happened before B, then A’s timestamp is less than B’s. But Lamport timestamps alone cannot tell you if A and B were concurrent.
Vector clocks extend Lamport timestamps to capture causality: each node maintains a vector of timestamps, one per node. Two events are concurrent if neither vector dominates the other.
graph LR
A[Node A: 1] --> B[Node B: 1,1]
A --> C[Node A: 2]
C --> D[Node B: 2,1]
B --> E[Concurrent writes?<br/>A: v=1, B: v=1]
D --> E
Vector clocks are used in systems like DynamoDB and Riak to track causal history and resolve write conflicts.
HLC (Hybrid Logical Clocks)
HLC combines physical time with logical time for ordering. Many modern systems use HLCs to get benefits of both.
HLC maintains both a physical timestamp and a logical counter. When a node receives a message, it compares the embedded HLC with its own and takes the maximum, incrementing the logical counter only if they are equal. This gives you wall-clock timestamps for human readability while preserving correct causal ordering even when physical clocks drift.
MongoDB uses HLCs for global transaction ordering. CockroachDB uses a similar hybrid approach for distributed SQL operations.
Time Synchronization Deep Dive
NTP (Network Time Protocol) synchronizes machines to reference clocks but cannot guarantee perfect synchronization. In practice, clock offset error can reach tens of milliseconds on WAN links. For systems requiring tighter guarantees, consider:
- Precision Time Protocol (PTP): Achieves sub-microsecond accuracy in LAN environments, used in financial trading systems and telecommunications.
- TrueTime: Google’s approach uses GPS and atomic clocks to bound clock uncertainty to approximately 7 milliseconds. Spanner uses this for globally consistent transactions.
- Caesium: A software-only approach that detaches the system clock from physical time, maintaining logical stability even during NTP step corrections.
The key insight: design systems to tolerate clock uncertainty rather than assuming synchronized clocks. Use intervals (not point timestamps) when reasoning about event ordering in safety-critical code.
State Machine Replication
State machine replication is a general technique for building fault-tolerant distributed systems. The core idea: if every node starts in the same state and processes the same sequence of commands in the same order, they will produce the same state and same outputs.
graph TD
A[Client] -->|submit command| B[Leader]
B -->|replicate| C[Node A]
B -->|replicate| D[Node B]
B -->|replicate| E[Node C]
C -->|ack| B
D -->|ack| B
E -->|ack| B
B -->|commit| C
B -->|commit| D
B -->|commit| E
Total order broadcast (also called atomic broadcast) is the primitive that makes this work: every node delivers messages in the same order. Consensus algorithms like Raft and Paxos implement total order broadcast as part of their log replication.
The Paxos and Raft sections above describe how these algorithms achieve consensus on the sequence. Multi-Paxos extends single-decree Paxos to decide on a sequence of values (the log), effectively implementing total order broadcast.
Replicated State Machines in Production
| System | Algorithm | Use Case |
|---|---|---|
| etcd | Raft | Kubernetes cluster state, distributed locks |
| CockroachDB | Raft | Distributed SQL, geo-replicated tables |
| TiKV | Raft | RockDB backing store, transaction coordination |
| ZooKeeper | Zab (Paxos-like) | Configuration storage, leader election |
| Chubby (Google) | Paxos | Lock service for BigTable, Spanner |
The practical takeaway: you rarely implement replicated state machines from scratch. You use systems like etcd for coordination or CockroachDB for distributed data, and those systems handle replication for you. Understanding the model helps you debug them.
Failure Detectors
A failure detector is a module that suspects a node has crashed. Rather than asking “is this node alive?” (which is impossible to answer definitively in an asynchronous network), a failure detector answers “has this node failed?” with some probability.
The two properties that characterize failure detectors:
- Completeness: Every crashed node is eventually suspected by every correct node.
- Accuracy: Correct nodes are rarely suspected of having failed. Perfect accuracy is impossible in asynchronous systems (this is the FLP impossibility result).
The accrual failure detector model, used in Akka and Cassandra, outputs a suspicion level (a probability) rather than a boolean. Instead of “node is down,” you get “node is down with 80% confidence.” Applications can configure their own thresholds based on the cost of false positives vs false negatives.
// Accrual failure detector concept
class AccrualFailureDetector {
constructor(threshold = 0.8) {
this.threshold = threshold;
this.heartbeatHistory = new Map();
}
recordHeartbeat(nodeId) {
const now = Date.now();
const history = this.heartbeatHistory.get(nodeId) || [];
history.push(now);
// Keep last N intervals
this.heartbeatHistory.set(nodeId, history.slice(-10));
}
getSuspicionLevel(nodeId) {
const intervals = this.computeIntervals(nodeId);
if (intervals.length < 2) return 0;
const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
const variance =
intervals.reduce((sum, i) => sum + Math.pow(i - mean, 2), 0) /
intervals.length;
const stdDev = Math.sqrt(variance);
// Phi Accrual: converts variance into a suspicion probability
return -Math.log10(1 - this.cdf(mean + 2 * stdDev));
}
isAvailable(nodeId) {
return this.getSuspicionLevel(nodeId) < this.threshold;
}
}
Cassandra uses an accrual failure detector based on Phi Accrual. Instead of a fixed timeout, each node dynamically computes a suspicion level based on heartbeat inter-arrival times. When network latency varies, the detector adapts rather than triggering false positives.
Distributed Transactions
Coordinating changes across multiple services or databases is notoriously hard. ACID transactions work within a single database. Across multiple databases, you need coordination.
Saga Pattern
Instead of distributed ACID transactions, sagas coordinate via a sequence of local transactions. Each service completes its local transaction and publishes an event. The next service picks up that event and performs its local transaction.
graph TD
A[Order Service] -->|OrderCreated| B[Payment Service]
B -->|PaymentCompleted| C[Inventory Service]
C -->|InventoryReserved| D[Shipping Service]
D -->|ShipmentScheduled| E[Order Complete]
B -->|PaymentFailed| F[Compensating: Refund]
C -->|ReservationFailed| G[Compensating: Cancel Payment]
If any step fails, compensating transactions undo previous steps. Sagas work well for long-running business processes where immediate consistency is less important than eventual completion.
See the saga pattern post for implementation details including choreography versus orchestration approaches.
Outbox Pattern
For reliable event publishing alongside database updates, the transactional outbox pattern writes events to a table within the same transaction as business data. A separate process polls that table and publishes events to consumers.
async function createOrder(order) {
await db.transaction(async (trx) => {
await trx.insert("orders", order);
// Same transaction: write to outbox
await trx.insert("outbox", {
eventType: "OrderCreated",
payload: JSON.stringify(order),
});
});
// Outbox poller picks this up and publishes
}
This avoids the dual-write problem where the database commits but the event fails to publish (or vice versa).
Coordination Services
Some operations should only happen once, regardless of how many nodes try. Distributed locks, leader election, and service discovery all require coordination.
Partitioning and Sharding Strategies
Data partitioning (sharding) distributes data across multiple nodes:
| Strategy | How It Works | Trade-offs |
|---|---|---|
| Hash-based | shard = hash(key) % num_shards | Even distribution, but reshuffling required when adding nodes |
| Range-based | Keys partitioned by value ranges | Good for range queries, but can create hotspots |
| Directory-based | Lookup table maps keys to shards | Flexible, but lookup table is a SPOF |
| Consistent Hashing | Virtual nodes on hash ring | Minimal reshuffling when nodes added/removed |
Consistent hashing minimizes data movement when nodes are added or removed:
// Consistent hashing concept
class ConsistentHash {
constructor(nodes, virtualNodes = 150) {
this.ring = new Map();
this.sortedKeys = [];
for (const node of nodes) {
this.addNode(node, virtualNodes);
}
}
addNode(node, vNodes) {
for (let i = 0; i < vNodes; i++) {
const key = hash(node + ":" + i);
this.ring.set(key, node);
}
this.sortedKeys = Array.from(this.ring.keys()).sort();
}
getNode(key) {
const hash = hash(key);
// Binary search for first key >= hash(key)
const idx =
this.sortedKeys.findIndex((k) => k >= hash) % this.sortedKeys.length;
return this.ring.get(this.sortedKeys[idx]);
}
}
Gossip Protocols for State Synchronization
Gossip protocols spread information through a system like a virus. Each node periodically shares state with a few random peers, achieving eventual consistency with O(log N) propagation time.
graph LR
A[Node A] -->|Gossip to 2 random peers| B[Node B]
A -->|Gossip to 2 random peers| C[Node C]
B -->|Next round| D[Node D]
C -->|Next round| E[Node E]
D -->|Next round| F[Node F]
E -->|Next round| G[Node G]
Gossip characteristics:
- Convergence time: O(log N) rounds to reach all nodes
- Failure tolerance: Nodes can join/leave without protocol disruption
- Eventual consistency: Always converges, but no guaranteed delivery time
- Use cases: Membership protocols, failure detection, anti-entropy
// Simplified gossip implementation
class GossipProtocol {
constructor(nodeId, peers, state) {
this.nodeId = nodeId;
this.peers = peers;
this.state = state;
this.version = 0;
}
// Called periodically
async gossip() {
const targets = this.selectRandomPeers(2);
const payload = { state: this.state, version: this.version };
for (const peer of targets) {
await peer.receiveGossip(this.nodeId, payload);
}
}
async receiveGossip(fromNode, payload) {
if (payload.version > this.version) {
this.state = this.merge(this.state, payload.state);
this.version = payload.version;
}
}
selectRandomPeers(count) {
return this.peers
.filter((p) => p !== this.nodeId)
.sort(() => Math.random() - 0.5)
.slice(0, count);
}
}
Cassandra uses gossip for failure detection and state synchronization. Consul uses gossip for membership and health checking.
Which Coordination Service to Use
| Feature | etcd | Apache ZooKeeper | Consul |
|---|---|---|---|
| Consensus | Raft | Zab (Paxos-like) | Raft |
| Data Model | Flat key-value | Hierarchical znodes | Flat key-value |
| ACLs | RBAC | ACLs per znode | Intentions + ACLs |
| Health Checks | Lease-based | No built-in | Agent-based |
| Multi-region | Limited | No | Yes (WAN federation) |
| DNS Integration | No | No | Yes |
| Use Case Focus | Kubernetes state | HBase, Kafka | Service discovery |
When to use etcd:
- Backing Kubernetes cluster state
- When you need distributed lock with lease support
- Configuration management with watch-based updates
When to use ZooKeeper:
- Legacy systems already using it (HBase, Kafka)
- When you need hierarchical data model
- Java-centric environments
When to use Consul:
- Service discovery with health checking
- Multi-datacenter deployments
- When you need built-in DNS for service discovery
Distributed Locks
Mutexes that work across machines. Naive implementations using Redis SETNX fail in subtle ways: clients can crash before releasing, clocks skew, and network partitions leave locks held indefinitely.
Proper distributed locks require: fencing tokens (to handle client crashes), clock-safe expiry, and atomic check-and-set operations.
Leader Election
One node needs to be primary for a given task. Uses coordination service (etcd, ZooKeeper) to agree on who that is. If leader fails, remaining nodes hold new election.
graph TD
A[All nodes start] --> B[Election timeout]
B --> C[Node 1 votes for self]
B --> D[Node 2 votes for self]
C --> E{Does anyone have majority?}
D --> E
E -->|Yes| F[Node 1 is leader]
E -->|No| G[New election]
Leader election backs high availability patterns. When the primary fails, election identifies a new leader and work continues.
The service registry post covers how services discover leaders.
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| Split-brain after partition | Both sides accept writes, data diverges | Quorum-based decisions, fencing |
| Lost writes during leader crash | Write acknowledged but not replicated | Sync replication for critical data, write-ahead logs |
| Stale reads after replica promotion | Applications read old data | Read-your-writes consistency, routing reads to new primary |
| Cascade from one slow node | Distributed slowdown spreads | Bulkhead pattern, circuit breakers |
| Clock skew causing ordering issues | Events appear in wrong sequence | Logical clocks for ordering, NTP with careful config |
Common Pitfalls / Anti-Patterns
Pitfall 1: Assuming Strong Consistency is Free
Problem: Using synchronous replication everywhere because “consistency matters.” This adds latency to every write and makes your system vulnerable to replica failures.
Solution: Identify which operations truly need strong consistency. Use eventual consistency where acceptable. See the consistency models post for classifying operations.
Pitfall 2: Ignoring Clock Skew
Problem: Using wall-clock time for ordering events or conflict resolution. Clock skew between machines makes this unreliable.
Solution: Use logical clocks (Lamport timestamps, vector clocks) for event ordering. Use HLCs when you need both ordering and wall-clock time.
Pitfall 3: Building Distribution on Day One
Problem: Starting with microservices and distributed data before understanding the actual scale requirements. This adds complexity without benefit.
Solution: Build monolithic first. Distribute when you have concrete scale or reliability requirements. See the microservices vs monolith post for this trade-off.
Pitfall 4: Underestimating Coordination Overhead
Problem: Adding distributed locks, consensus, and leader election for operations that do not actually need them. This creates bottlenecks and fragility.
Solution: Use the simplest coordination mechanism that meets requirements. Avoid distributed transactions if sagas work. Avoid consensus if leader lease-based approaches suffice.
Quick Recap Checklist
- Distributed systems solve capacity and reliability problems that single machines cannot.
- The Two General Problem shows we cannot achieve certainty in coordination—only probabilistic guarantees.
- Failure modes (crash, Byzantine, partition) require different handling.
- Replication strategies (sync, async, quorum) trade consistency against latency and availability.
- Consensus algorithms (Raft, Paxos) solve agreement but add complexity.
- Time in distributed systems requires logical clocks, not just physical clocks.
- Distributed transactions (sagas, outbox) coordinate multi-service operations without ACID across boundaries.
Copy/Paste Checklist
- [ ] Identify which operations need strong vs eventual consistency
- [ ] Choose replication strategy per service based on consistency requirements
- [ ] Implement proper timeouts and retries for all network calls
- [ ] Use consensus (Raft) when you need strongly consistent leader election
- [ ] Implement observability: traces, metrics, and logs across service boundaries
- [ ] Test failure scenarios: network partitions, node crashes, clock skew
- [ ] Start simple: do not add distribution until you have concrete requirements
For deeper exploration, see the distributed systems roadmap for a structured learning path.
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Single machine handles your load | Do not distribute yet |
| Need horizontal scaling | Distribute stateless services first |
| Strong consistency required | Use consensus algorithms (Raft, Paxos) |
| Eventual consistency acceptable | Async replication is simpler |
| Long-running business transactions | Saga pattern with compensating logic |
| Multi-region deployment | Expect partitions, plan for divergence |
Observability Checklist
Distributed systems require more monitoring than single-machine applications:
Metrics to Capture
- Request latency: P50, P95, P99 by service and endpoint
- Error rates: By type (timeout, crash, Byzantine)
- Replication lag: How far behind are replicas
- Quorum health: Are reads and writes achieving quorum
- Partition events: Frequency and duration of network partitions
Logs to Emit
{
"timestamp": "2026-03-24T10:15:30.123Z",
"service": "order-service",
"operation": "createOrder",
"latency_ms": 45,
"replicas_contacted": 3,
"partition_detected": false,
"trace_id": "abc123"
}
Distributed tracing (covered in Distributed Tracing) connects requests across service boundaries.
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| Replication lag > 30s | 30000ms | Warning |
| Replication lag > 5min | 300000ms | Critical |
| Quorum failures > 1% | 1% of requests | Critical |
| Partition lasting > 60s | 60000ms | Critical |
| Leader election rate spike | > 3 elections/hour | Warning |
Interview Questions
Expected answer points:
- Two nodes trying to coordinate across an unreliable network cannot guarantee mutual agreement
- No protocol can achieve certainty; only probabilistic guarantees are possible
- Every acknowledgment cycle has a window where messages can be lost or delayed
- Design implication: accept eventual consistency and design for retry/timeout mechanisms
Expected answer points:
- Crash faults: node stops responding entirely, easiest to handle via failover
- Byzantine faults: node continues operating but produces incorrect results, requires cryptographic verification and consensus
- Network partitions: nodes remain alive but cannot communicate, splits the system into isolated subgroups
- Each requires different mitigation strategies; partitions trigger CAP theorem trade-offs
Expected answer points:
- Choose synchronous when consistency outweighs write availability and latency tolerance
- All replicas must acknowledge before client receives confirmation
- Trade-off: write latency equals slowest replica; one slow node blocks writes
- Use cases: financial systems, databases requiring strong consistency within a datacenter
Expected answer points:
- Writes must succeed on N/2 + 1 replicas (quorum)
- Reads also query a quorum; if W+R > N, reads see previous writes
- With W=2, R=2 for N=3 replicas, you get strong consistency
- DynamoDB and Cassandra allow tuning consistency per query
Expected answer points:
- Paxos: proven correct but notoriously difficult to implement and understand
- Raft: designed for understandability, separates leader election, log replication, and safety
- Multi-Paxos optimizes for stable leader scenarios in production
- Choose Raft for debuggability; choose Paxos if existing system expertise exists
- etcd, CockroachDB use Raft; most production Paxos is Multi-Paxos
Expected answer points:
- 2PC coordinates commits but does not guarantee node agreement in failure scenarios
- If coordinator crashes after participants vote "yes" but before "commit," participants wait indefinitely
- 2PC is blocking and not partition-tolerant
- Consensus algorithms like Raft/Paxos handle this via leader election and log replication
Expected answer points:
- Lamport timestamps capture "happened-before" relationships but cannot detect concurrency
- If A happened before B, A's timestamp is less than B's (but converse not true)
- Vector clocks extend Lamport with per-node timestamp vectors to detect concurrent events
- Two events are concurrent if neither vector dominates the other
- Use Lamport for simple ordering; use vector clocks when you need conflict detection (DynamoDB, Riak)
Expected answer points:
- Sagas replace distributed ACID with a sequence of local transactions
- Each service completes its local transaction and publishes an event
- Next service picks up the event and performs its local transaction
- If any step fails, compensating transactions undo previous steps
- Trade-off: gives up immediate consistency for eventual completion
- Use for long-running business processes where immediate consistency is less critical
Expected answer points:
- Solves the dual-write problem: database commits but event fails to publish (or vice versa)
- Writes events to an outbox table within the same transaction as business data
- A separate poller process reads the outbox and publishes to message broker
- Guarantees atomicity: both business data and outbox entry commit together
Expected answer points:
- Hash-based: even data distribution but requires reshuffling when adding/removing nodes
- Range-based: efficient for range queries but risk of hotspots on certain key ranges
- Directory-based: flexible key-to-shard mapping but introduces single point of failure in lookup table
- Consistent hashing: minimizes data movement when nodes change via virtual nodes on hash ring
Expected answer points:
- Each node periodically shares state with a few random peers
- Information spreads like a virus through the system
- Convergence time is O(log N) rounds regardless of system size
- Nodes can join/leave without disrupting the protocol
- No guaranteed delivery time; only eventual consistency
- Use cases: membership protocols, failure detection, anti-entropy (Cassandra, Consul)
Expected answer points:
- Fencing tokens to handle client crashes and prevent split-brain scenarios
- Clock-safe expiry: locks must auto-release even if holder crashes
- Atomic check-and-set operations to prevent race conditions on lock acquisition
- Naive Redis SETNX fails: clock skew, crash-before-release, network partitions all cause issues
Expected answer points:
- etcd: Raft consensus, flat key-value, RBAC, lease-based health checks; limited multi-region
- ZooKeeper: Zab (Paxos-like), hierarchical znodes, ACLs per znode; no built-in health checks or multi-region
- Consul: Raft consensus, flat key-value, intentions + ACLs, agent-based health checks, WAN federation
- Choose based on existing stack: Kubernetes → etcd; HBase/Kafka → ZooKeeper; service discovery → Consul
Expected answer points:
- Split-brain occurs during network partitions where both sides of the partition accept writes
- Results in data divergence that is difficult to merge after partition heals
- Prevention: quorum-based decision making ensures only majority partition accepts writes
- Fencing tokens prevent stale writes from partitioned nodes after reconnection
Expected answer points:
- Single machine handles current and near-term load without vertical scaling constraints
- When simplicity outweighs scalability needs for your use case
- Strong consistency requirements make distribution expensive
- Premature distribution adds complexity without concrete scale/reliability benefits
- Build monolithic first; distribute when you have concrete requirements
Expected answer points:
- HLCs store both physical time and logical counter
- Provides wall-clock timestamps for human readability and debugging
- Logical component ensures correct ordering when physical clocks drift or skew
- Less uncertainty than pure physical clocks; more interpretable than pure logical clocks
- Used in many modern distributed systems for global event ordering
Expected answer points:
- Leader-based (Raft, Multi-Paxos): efficient writes through single leader, but leader is bottleneck and single point of failure
- Leaderless (some Dynamo-style systems): writes spread across nodes, but coordination complexity increases
- Leader leases reduce leader dependency but introduce clock skew risks
- Choose based on write throughput requirements and consistency needs
Expected answer points:
- Request latency: P50, P95, P99 by service and endpoint
- Error rates by type: timeouts, crashes, Byzantine responses
- Replication lag: how far replicas trail behind primary
- Quorum health: are reads/writes achieving required quorum
- Partition events: frequency and duration of network partitions
- Leader election rate: spikes indicate instability
Expected answer points:
- Wall-clock time is unreliable for event ordering due to machine clock drift and NTP uncertainty
- Using physical time for conflict resolution leads to incorrect results after clock skew
- Logical clocks provide ordering guarantees but lack real-time interpretability
- HLCs combine both but require careful implementation
- Always use logical clocks for event ordering; physical clocks only for human-readable timestamps
Expected answer points:
- Bulkhead pattern: isolates components so failure in one service does not bring down others
- Circuit breakers: detect downstream failures and stop requests to failing component
- Both prevent distributed slowdown from spreading across the system
- Circuit breakers allow failed components time to recover while serving fallback responses
- Without these, one slow node can cause timeouts across all dependent services
Further Reading
- Consistency Models - Deep dive into strong vs eventual consistency trade-offs
- CAP Theorem - Understanding the fundamental partition tolerance trade-off
- Two-Phase Commit - Why blocking coordination fails in distributed systems
- Distributed Transactions - Saga pattern and alternative approaches
- Service Registry - How services discover leaders and coordinate
- Distributed Tracing - Observability across service boundaries
Conclusion
Distributed systems are fundamentally about trade-offs. Capacity versus complexity. Consistency versus availability. Simplicity versus scalability.
The concepts covered in this primer provide a foundation for thinking about distributed problems. Failure modes teach you to expect the unexpected. Replication strategies show how to maintain data consistency across unreliable networks. Consensus algorithms provide the tools for nodes to agree despite failures.
Time and ordering present subtle challenges that trip up even experienced engineers. Coordination services solve the hard problems of leader election and distributed locks. Production failure scenarios remind you that theory and practice diverge in real systems.
Before reaching for distributed solutions, verify that your problem actually requires them. Single machines handle more than developers assume. Distribution adds latency, complexity, and failure modes that are difficult to reason about. Build monolithic first. Distribute when you have concrete requirements.
For structured learning paths, see the distributed systems roadmap which covers these topics in greater depth.
Category
Related Posts
The Eight Fallacies of Distributed Computing
Explore the classic assumptions developers make about networked systems that lead to failures. Learn how to avoid these pitfalls in distributed architecture.
Microservices vs Monolith: Choosing the Right Architecture
Understand the fundamental differences between monolithic and microservices architectures, their trade-offs, and how to decide which approach fits your project.
CAP Theorem: Consistency vs Availability Trade-offs
Learn the fundamental trade-off between Consistency, Availability, and Partition tolerance in distributed systems with practical examples.