Distributed Systems Primer: Key Concepts for Modern Architecture

A practical introduction to distributed systems fundamentals. Learn about failure modes, replication strategies, consensus algorithms, and the core challenges of building distributed software.

published: March 24, 2026 reading time: 37 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

Distributed systems distribute both data and problems across multiple machines, and this primer walks through what that actually means: the different ways nodes fail (crash, lie, or get isolated), the tradeoffs between replication approaches (waiting for everyone versus not), and why consensus algorithms like Raft exist. You'll also see why physical clocks cause trouble and logical clocks save the day, how sagas replace ACID transactions across services, and what etcd actually does beyond just being 'where Kubernetes stores stuff'. The real lesson is accepting that certainty is expensive and sometimes impossible, so design for the failures that actually happen rather than the ones you can wish away.

Introduction

Single machines hit limits. A database on one server can only scale vertically so far. At some point, you need more compute, storage, or network bandwidth than one machine provides. Distribution lets you scale horizontally.

Beyond raw capacity, distribution enables reliability. If one server fails, others take over. Users in Tokyo and London can hit nearby servers instead of waiting for cross-continental round trips. Compliance might require data to stay within certain regions.

The trade-off is complexity. A single machine has one clock, one view of memory, and one path for operations. Distributed systems have multiple clocks that never agree perfectly, multiple views of data that can diverge, and network paths that fail in subtle ways.

Core Concepts

A distributed system must coordinate nodes that may fail independently. The Two General Problem shows why this is hard: no matter what protocol two generals agree on, they cannot guarantee they have both committed to the same plan with certainty.

graph LR
    A[General A] -->|Attack at dawn| B[General B]
    B -->|Acknowledgment| A
    A -->|Confirmation| B
    B -->|Must confirm| A
    A -->|Infinite loop| B

In practice, we accept probabilistic guarantees. Systems use acknowledgments, timeouts, and retries to achieve “good enough” coordination. Understanding what guarantees are actually possible keeps you from over-engineering for certainty that mathematics forbids.

Failure Modes

Distributed systems face richer failure modes than single machines:

Crash Faults

A crash fault is the simplest failure mode. A node stops working entirely. It stops responding to requests, and in most cases, stops executing code. The key point is that the failure is detectable. The node either responds or it does not, with no ambiguous middle state.

Types of crash faults:

Crash faults split into two categories based on how the node fails.

Fail-stop: The node crashes and never recovers. No response means the node is down. Once detected, failover is straightforward. This is the failure mode assumed by most consensus algorithms like Raft and Paxos.
Fail-recover: The node crashes but later restarts. This adds complexity. The recovered node may have stale state that needs to be reconciled with the rest of the cluster. If the node was the leader, it may need to rejoin as a follower or trigger a new election.

Detection mechanisms:

Crash faults are detected through timeouts. A coordinator, whether a client or an intermediate load balancer, sets a timeout for each request. If no response arrives within that window, the node is considered crashed. The timeout value is a trade-off. Set it too short and you get false positives from slow nodes. Set it too long and you delay failover unnecessarily.

// Crash detection via timeout
async function callWithCrashDetection(node, request, timeoutMs = 5000) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetch(node.url, {
      signal: controller.signal,
      method: request.method,
      body: JSON.stringify(request.data),
    });
    clearTimeout(timeout);
    return { success: true, data: response };
  } catch (error) {
    clearTimeout(timeout);
    if (error.name === "AbortError") {
      // Timeout — assume crash
      return { success: false, reason: "timeout", node: node.id };
    }
    // Other error — could be crash or network issue
    return { success: false, reason: error.message, node: node.id };
  }
}

Failover handling:

When a crash is detected, failover switches traffic to a backup node. For stateless services, this is straightforward. Any healthy instance can take over. For stateful services, the backup must have the necessary state. This means either passive replication (async copy of data) or active replication (synchronous state mirroring).

Failover time has several components: failure detection (typically 5-30 seconds depending on timeout configuration), routing update (DNS TTL updates or load balancer reconfiguration, 30-60 seconds), and instance startup if the backup is not pre-warmed (30-120 seconds). Optimizing each component reduces total downtime.

Crash faults are the baseline failure mode for HA design. If your system cannot handle simple node crashes, it will not survive more complex failure modes like network partitions or Byzantine faults.

Byzantine Faults

A Byzantine fault is the most challenging failure mode. A node continues operating but produces incorrect responses. Unlike crash faults where the node simply stops, Byzantine faults produce answers that appear valid but are wrong. The node might be buggy, compromised by an attacker, or misconfigured in a way that corrupts its outputs.

Why Byzantine faults are harder to handle:

With crash faults, detection is binary. Either the node responds or it does not. With Byzantine faults, the node responds, but the response is untrustworthy. You cannot distinguish a buggy response from a correct one without additional verification. This is why Byzantine fault tolerance (BFT) requires cryptographic proofs and agreement among multiple independent nodes.

A node might produce incorrect outputs due to:

Software bugs: A parser bug might accept malformed input that produces wrong but self-consistent outputs
Hardware faults: Memory bit flips can cause silent data corruption that still passes checksums
Security breaches: An attacker with node access can intentionally produce falsified responses
Misconfiguration: A node with wrong system time produces timestamps that invalidate correct data

BFT consensus requirements:

Byzantine fault tolerant systems require more replicas than crash-fault tolerant systems. To tolerate F Byzantine faults, you need 2F+1 total replicas, not a majority but a supermajority of at least two-thirds. With 3 replicas, you can tolerate 1 Byzantine fault. With 4 replicas, you can still only tolerate 1. With 7 replicas, you can tolerate 2.

Byzantine nodes can lie arbitrarily. If F nodes are compromised and lie, you need enough honest nodes to form a quorum that outvotes them. With 3 replicas and 1 Byzantine node, the Byzantine node can claim whatever the honest node claims, making it impossible to determine truth. With 4 replicas and 1 Byzantine node, the 3 honest nodes still form a majority.

Real-world BFT systems:

Production Byzantine fault tolerant systems include:

PBFT (Practical Byzantine Fault Tolerance): The original BFT consensus protocol, used in some permissioned blockchain systems
Zyzzyva: Optimizes PBFT by reducing the number of messages required
Blockchain consensus (Bitcoin, Ethereum): Uses proof-of-work to make Byzantine behavior economically costly rather than computationally preventable
BFT-SMaRt: A Java implementation of BFT consensus for research and production

For most web applications, crash-fault tolerance (Raft, Paxos) is sufficient. Byzantine fault tolerance is required when nodes may behave maliciously: cryptocurrency ledgers, distributed databases with adversarial participants, or aerospace control systems.

BFT adds substantial overhead. More replicas, more complex protocols, and more messages per consensus round. Only adopt BFT when your threat model actually includes nodes that produce incorrect outputs, not just nodes that stop responding.

Network Partitions

Nodes stay alive but cannot communicate with each other. The network between them fails. This is the scenario the CAP theorem addresses. During a partition, you choose between serving stale data (availability) or refusing to serve (consistency).

graph TD
    subgraph DC1["Data Center 1"]
        A[Node A] --> B[Node B]
    end
    subgraph DC2["Data Center 2"]
        C[Node C] --> D[Node D]
    end
    A -.->|Partition| C
    B -.->|Partition| D

Partitions are not rare events. In any large enough system, you should expect regular network blips. Design for partitions, not against them.

See the availability patterns post for building systems that remain available despite partitions. The CAP theorem post explains the fundamental trade-off during partitions.

Replication Strategies

Once you distribute data across multiple nodes, you must keep those copies in sync. Different strategies offer different trade-offs.

Synchronous Replication

Write to all replicas before confirming to the client. All copies stay consistent, but write latency equals the slowest replica. If one replica is slow or unreachable, writes fail.

async function syncWrite(key, value, replicas) {
  const results = await Promise.all(replicas.map((r) => r.write(key, value)));
  if (results.some((r) => !r.success)) {
    throw new Error("Write failed");
  }
  return { success: true };
}

Used when consistency matters more than write availability. Traditional relational databases often use synchronous replication within a data center.

Asynchronous Replication

Write to primary, acknowledge immediately, then propagate to replicas in background. Writes are fast, but copies can diverge during normal operation and especially during failures.

async function asyncWrite(key, value, primary) {
  await primary.write(key, value); // Immediate acknowledgment
  await backgroundPropagate(key, value); // Replicas updated later
  return { success: true };
}

Used when write latency matters more than immediate consistency. Many NoSQL databases default to async replication.

Quorum-Based Replication

Writes must succeed on a quorum of replicas (typically N/2 + 1). Reads also query a quorum. If the same quorum is used for reads and writes, you get strong consistency.

// Quorum calculation for N=3 replicas
// Strong consistency: W=2, R=2 (W+R > N means reads see previous writes)
// Eventual consistency: W=1, R=1 (any replica can serve reads)

DynamoDB, Cassandra, and similar systems let you tune consistency per query. See the consistency models post for the spectrum from strong to eventual consistency.

Consensus Algorithms

Some operations require all nodes to agree. Which node is the leader? Should we accept this transaction? Did this client actually request this operation? Consensus algorithms solve agreement in the presence of failures.

Paxos

Paxos is the classic consensus algorithm, developed by Leslie Lamport in 1989 and proven correct in 1998. It solves the problem of getting distributed nodes to agree on a single value, even when some nodes fail or messages are lost. Paxos is the foundation for many production consensus systems, though few implement it directly due to its complexity.

The two phases:

Paxos operates in two distinct phases, each requiring a round-trip to a quorum of nodes.

Phase 1 — Prepare: A node that wants to propose a value (the proposer) sends a prepare message with a proposal number N to a quorum of acceptors. Each acceptor promises not to accept any proposal with a number less than N, and replies with the highest-numbered proposal it has already accepted, if any.

Proposer → Acceptors: PREPARE(N)
Acceptors reply: PROMISE(N, [accepted_proposal, accepted_value])

Phase 2 — Accept: If a majority of acceptors promise to accept proposal N, the proposer sends an accept message with proposal N and its value. Each acceptor that promised not to accept lower-numbered proposals accepts and acknowledges.

Proposer → Acceptors: ACCEPT(N, value)
Acceptors reply: ACK(N)

Once accepted by a majority, the value is decided. No other value can be chosen because any majority must overlap with the accepting majority on at least one node.

Why Paxos is hard to implement:

The algorithm as described in papers is subtle in ways that bite production implementations.

Concurrency: Multiple proposers can race, causing promise rejections and repeated rounds. The failure recovery paths are complex.
Epochs and ballot numbers: Proposal numbers must increase monotonically across all proposers. Implementing this across distributed nodes without a single coordinator is non-trivial.
State persistence: Acceptors must durably record which proposals they have accepted. If this state is lost and a node restarts, it might accept conflicting values.
Partition handling: If a network partition isolates a node, it might be left uncertain whether its last proposal succeeded. This requires careful recovery logic.

Multi-Paxos optimization:

Most production systems use Multi-Paxos, which extends Paxos to decide on a sequence of values (a log) rather than a single value. The key optimization is the stable leader. If one node is elected as the permanent leader for an extended period, it can skip Phase 1 for subsequent entries in the log. This reduces the common case from two round-trips to one.

Systems like Google’s Chubby, Spanner, and Apache BookKeeper use Multi-Paxos variants. The stable leader approach trades off some availability (if the leader fails, you need a new leader election) for better latency during normal operation.

Paxos vs Raft:

Raft was designed explicitly to be more understandable than Paxos. It separates concerns more cleanly: leader election is separate from log replication, and the algorithm is structured as a series of independent modules. If you need to implement consensus from scratch, Raft is usually the better choice. If you are using an existing consensus library, understanding Paxos helps you reason about correctness guarantees.

Raft

Designed to be more understandable than Paxos. The authors called it “In Search of an Understandable Consensus Algorithm.” Raft separates concerns into leader election, log replication, and safety.

graph LR
    A[Follower] -->|Election timeout| B[Candidate]
    B -->|Votes from majority| C[Leader]
    C -->|Heartbeat timeout| A
    B -->|Another leader appears| A
    C -->|Crash| A

Raft is used in etcd (Kubernetes’ backing store), CockroachDB, and many other systems. If you need consensus and want something you can actually debug, Raft is usually the right choice.

Why Not Just Use Two-Phase Commit?

Two-phase commit (2PC) is not a consensus algorithm. It coordinates commits across nodes but has a critical flaw: if the coordinator crashes after participants have voted “yes” but before sending “commit,” participants wait forever. 2PC is blocking and not partition-tolerant.

See the two-phase commit and distributed transactions posts for alternatives like the saga pattern.

Time in Distributed Systems

Clocks across machines never agree perfectly. Network latency varies. Machines crash and restart with incorrect times. Understanding time is crucial for debugging and correctness in distributed systems.

Physical Clocks vs Logical Clocks

Physical clocks try to track real time but drift and skew. NTP synchronization helps but cannot eliminate uncertainty. For ordering events, logical clocks (Lamport timestamps, vector clocks) provide better guarantees.

Lamport timestamps capture “happened-before” relationships: if A happened before B, then A’s timestamp is less than B’s. But Lamport timestamps alone cannot tell you if A and B were concurrent.

Vector clocks extend Lamport timestamps to capture causality: each node maintains a vector of timestamps, one per node. Two events are concurrent if neither vector dominates the other.

graph LR
    A[Node A: 1] --> B[Node B: 1,1]
    A --> C[Node A: 2]
    C --> D[Node B: 2,1]
    B --> E[Concurrent writes?<br/>A: v=1, B: v=1]
    D --> E

Vector clocks are used in systems like DynamoDB and Riak to track causal history and resolve write conflicts.

HLC (Hybrid Logical Clocks)

An HLC value combines two things: a physical timestamp from the machine’s clock and a logical counter that only increments when two nodes happen to have the same physical time at the same moment. When node A sends a message to node B, it bundles its current HLC value along with the payload. Node B receives it, looks at the physical time embedded in that HLC, compares it against its own physical clock, and takes whichever is larger. If both clocks show the same time, node B bumps its logical counter up by one. This way, every node stays ahead of anything it has received without needing a central time source.

The guarantee this gives you: if event A causally preceded event B, then A’s HLC is less than B’s HLC. This holds even when NTP steps a clock backward mid-operation. The logical counter absorbs the correction without requiring synchronized clocks.

NTP step corrections are the concrete reason physical timestamps alone break down for event ordering. When a clock jumps backward, an event recorded at timestamp 1000 can suddenly appear to have happened after an event at timestamp 1005. For conflict resolution or audit trails, this is catastrophic. HLCs fall back to the logical counter whenever physical times collide, which prevents this kind of reordering.

MongoDB 4.0+ uses HLCs for global transaction ordering across replica sets. CockroachDB uses a similar hybrid approach for distributed SQL. Both need globally consistent timestamps for transactions that might execute concurrently across data centers, and pure physical clocks cannot provide those guarantees.

The trade-off is extra storage per event and careful implementation at message boundaries. The logical counter stays near zero during normal operation and only increments during NTP corrections or high-contention periods. In practice, the counter rarely grows large.

Time Synchronization Deep Dive

NTP (Network Time Protocol) synchronizes machines to reference clocks but cannot guarantee perfect synchronization. In practice, clock offset error can reach tens of milliseconds on WAN links. For systems requiring tighter guarantees, consider:

Precision Time Protocol (PTP): Achieves sub-microsecond accuracy in LAN environments, used in financial trading systems and telecommunications.
TrueTime: Google’s approach uses GPS and atomic clocks to bound clock uncertainty to approximately 7 milliseconds. Spanner uses this for globally consistent transactions.
Caesium: A software-only approach that detaches the system clock from physical time, maintaining logical stability even during NTP step corrections.

The key insight: design systems to tolerate clock uncertainty rather than assuming synchronized clocks. Use intervals (not point timestamps) when reasoning about event ordering in safety-critical code.

State Machine Replication

State machine replication is a general technique for building fault-tolerant distributed systems. The core idea: if every node starts in the same state and processes the same sequence of commands in the same order, they will produce the same state and same outputs.

graph TD
    A[Client] -->|submit command| B[Leader]
    B -->|replicate| C[Node A]
    B -->|replicate| D[Node B]
    B -->|replicate| E[Node C]
    C -->|ack| B
    D -->|ack| B
    E -->|ack| B
    B -->|commit| C
    B -->|commit| D
    B -->|commit| E

Total order broadcast (also called atomic broadcast) is the primitive that makes this work: every node delivers messages in the same order. Consensus algorithms like Raft and Paxos implement total order broadcast as part of their log replication.

The Paxos and Raft sections above describe how these algorithms achieve consensus on the sequence. Multi-Paxos extends single-decree Paxos to decide on a sequence of values (the log), effectively implementing total order broadcast.

Replicated State Machines in Production

System	Algorithm	Use Case
etcd	Raft	Kubernetes cluster state, distributed locks
CockroachDB	Raft	Distributed SQL, geo-replicated tables
TiKV	Raft	RockDB backing store, transaction coordination
ZooKeeper	Zab (Paxos-like)	Configuration storage, leader election
Chubby (Google)	Paxos	Lock service for BigTable, Spanner

The practical takeaway: you rarely implement replicated state machines from scratch. You use systems like etcd for coordination or CockroachDB for distributed data, and those systems handle replication for you. Understanding the model helps you debug them.

Failure Detectors

A failure detector is a module that suspects a node has crashed. Rather than asking “is this node alive?” (which is impossible to answer definitively in an asynchronous network), a failure detector answers “has this node failed?” with some probability.

The two properties that characterize failure detectors:

Completeness: Every crashed node is eventually suspected by every correct node.
Accuracy: Correct nodes are rarely suspected of having failed. Perfect accuracy is impossible in asynchronous systems (this is the FLP impossibility result).

The accrual failure detector model, used in Akka and Cassandra, outputs a suspicion level (a probability) rather than a boolean. Instead of “node is down,” you get “node is down with 80% confidence.” Applications can configure their own thresholds based on the cost of false positives vs false negatives.

// Accrual failure detector concept
class AccrualFailureDetector {
  constructor(threshold = 0.8) {
    this.threshold = threshold;
    this.heartbeatHistory = new Map();
  }

  recordHeartbeat(nodeId) {
    const now = Date.now();
    const history = this.heartbeatHistory.get(nodeId) || [];
    history.push(now);
    // Keep last N intervals
    this.heartbeatHistory.set(nodeId, history.slice(-10));
  }

  getSuspicionLevel(nodeId) {
    const intervals = this.computeIntervals(nodeId);
    if (intervals.length < 2) return 0;

    const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
    const variance =
      intervals.reduce((sum, i) => sum + Math.pow(i - mean, 2), 0) /
      intervals.length;
    const stdDev = Math.sqrt(variance);

    // Phi Accrual: converts variance into a suspicion probability
    return -Math.log10(1 - this.cdf(mean + 2 * stdDev));
  }

  isAvailable(nodeId) {
    return this.getSuspicionLevel(nodeId) < this.threshold;
  }
}

Cassandra uses an accrual failure detector based on Phi Accrual. Instead of a fixed timeout, each node dynamically computes a suspicion level based on heartbeat inter-arrival times. When network latency varies, the detector adapts rather than triggering false positives.

Distributed Transactions

Coordinating changes across multiple services or databases is notoriously hard. ACID transactions work within a single database. Across multiple databases, you need coordination.

Saga Pattern

Instead of distributed ACID transactions, sagas coordinate via a sequence of local transactions. Each service completes its local transaction and publishes an event. The next service picks up that event and performs its local transaction.

graph TD
    A[Order Service] -->|OrderCreated| B[Payment Service]
    B -->|PaymentCompleted| C[Inventory Service]
    C -->|InventoryReserved| D[Shipping Service]
    D -->|ShipmentScheduled| E[Order Complete]
    B -->|PaymentFailed| F[Compensating: Refund]
    C -->|ReservationFailed| G[Compensating: Cancel Payment]

If any step fails, compensating transactions undo previous steps. Sagas work well for long-running business processes where immediate consistency is less important than eventual completion.

See the saga pattern post for implementation details including choreography versus orchestration approaches.

Outbox Pattern

For reliable event publishing alongside database updates, the transactional outbox pattern writes events to a table within the same transaction as business data. A separate process polls that table and publishes events to consumers.

async function createOrder(order) {
  await db.transaction(async (trx) => {
    await trx.insert("orders", order);
    // Same transaction: write to outbox
    await trx.insert("outbox", {
      eventType: "OrderCreated",
      payload: JSON.stringify(order),
    });
  });
  // Outbox poller picks this up and publishes
}

This avoids the dual-write problem where the database commits but the event fails to publish (or vice versa).

Coordination Services

Some operations should only happen once, regardless of how many nodes try. Distributed locks, leader election, and service discovery all require coordination.

Partitioning and Sharding Strategies

Data partitioning (sharding) distributes data across multiple nodes:

Strategy	How It Works	Trade-offs
Hash-based	`shard = hash(key) % num_shards`	Even distribution, but reshuffling required when adding nodes
Range-based	Keys partitioned by value ranges	Good for range queries, but can create hotspots
Directory-based	Lookup table maps keys to shards	Flexible, but lookup table is a SPOF
Consistent Hashing	Virtual nodes on hash ring	Minimal reshuffling when nodes added/removed

Consistent hashing minimizes data movement when nodes are added or removed:

// Consistent hashing concept
class ConsistentHash {
  constructor(nodes, virtualNodes = 150) {
    this.ring = new Map();
    this.sortedKeys = [];

    for (const node of nodes) {
      this.addNode(node, virtualNodes);
    }
  }

  addNode(node, vNodes) {
    for (let i = 0; i < vNodes; i++) {
      const key = hash(node + ":" + i);
      this.ring.set(key, node);
    }
    this.sortedKeys = Array.from(this.ring.keys()).sort();
  }

  getNode(key) {
    const hash = hash(key);
    // Binary search for first key >= hash(key)
    const idx =
      this.sortedKeys.findIndex((k) => k >= hash) % this.sortedKeys.length;
    return this.ring.get(this.sortedKeys[idx]);
  }
}

Gossip Protocols for State Synchronization

Gossip protocols spread information through a system like a virus. Each node periodically shares state with a few random peers, achieving eventual consistency with O(log N) propagation time.

graph LR
    A[Node A] -->|Gossip to 2 random peers| B[Node B]
    A -->|Gossip to 2 random peers| C[Node C]
    B -->|Next round| D[Node D]
    C -->|Next round| E[Node E]
    D -->|Next round| F[Node F]
    E -->|Next round| G[Node G]

Gossip characteristics:

Convergence time: O(log N) rounds to reach all nodes
Failure tolerance: Nodes can join/leave without protocol disruption
Eventual consistency: Always converges, but no guaranteed delivery time
Use cases: Membership protocols, failure detection, anti-entropy

// Simplified gossip implementation
class GossipProtocol {
  constructor(nodeId, peers, state) {
    this.nodeId = nodeId;
    this.peers = peers;
    this.state = state;
    this.version = 0;
  }

  // Called periodically
  async gossip() {
    const targets = this.selectRandomPeers(2);
    const payload = { state: this.state, version: this.version };

    for (const peer of targets) {
      await peer.receiveGossip(this.nodeId, payload);
    }
  }

  async receiveGossip(fromNode, payload) {
    if (payload.version > this.version) {
      this.state = this.merge(this.state, payload.state);
      this.version = payload.version;
    }
  }

  selectRandomPeers(count) {
    return this.peers
      .filter((p) => p !== this.nodeId)
      .sort(() => Math.random() - 0.5)
      .slice(0, count);
  }
}

Cassandra uses gossip for failure detection and state synchronization. Consul uses gossip for membership and health checking.

Which Coordination Service to Use

Feature	etcd	Apache ZooKeeper	Consul
Consensus	Raft	Zab (Paxos-like)	Raft
Data Model	Flat key-value	Hierarchical znodes	Flat key-value
ACLs	RBAC	ACLs per znode	Intentions + ACLs
Health Checks	Lease-based	No built-in	Agent-based
Multi-region	Limited	No	Yes (WAN federation)
DNS Integration	No	No	Yes
Use Case Focus	Kubernetes state	HBase, Kafka	Service discovery

When to use etcd:

Backing Kubernetes cluster state
When you need distributed lock with lease support
Configuration management with watch-based updates

When to use ZooKeeper:

Legacy systems already using it (HBase, Kafka)
When you need hierarchical data model
Java-centric environments

When to use Consul:

Service discovery with health checking
Multi-datacenter deployments
When you need built-in DNS for service discovery

Distributed Locks

Mutexes that work across machines. Naive implementations using Redis SETNX fail in subtle ways: clients can crash before releasing, clocks skew, and network partitions leave locks held indefinitely.

Proper distributed locks require: fencing tokens (to handle client crashes), clock-safe expiry, and atomic check-and-set operations.

Fencing tokens prevent crashed clients from continuing after restart. When you acquire a lock, you get a monotonically increasing token. The resource server rejects any request whose token is older than the last one it accepted. A client that crashes while holding token 5 cannot do damage after restarting and getting token 7 — the resource just says no to the stale token.

Clock-safe expiry prevents locks from hanging around after a client dies. The lock must release even if the client cannot explicitly unlock. The catch: wall-clock time varies across machines. Locks based on system time can expire at different moments on different hosts. Either use a centralized time source or accept that expiry windows are approximate. Redlock, Martin Kleppmann’s algorithm, addresses some of these concerns but relies on network latency assumptions that don’t always hold in production.

Atomic check-and-set is the foundation. Redis’s SET key value NX PX ms handles acquisition and expiry atomically. With Redisson or similar libraries, use tryLock() with a lease time instead of lock() — the latter leaves locks orphaned if your process crashes between acquiring and releasing.

The point I keep landing on: distributed locks protect data, not operations. A lock on “account A” prevents two operations from modifying that balance simultaneously. If the lock fails and both run, you get lost updates or double-charges. Design around invariants, not just mutual exclusion.

Often you do not need a distributed lock at all. Incrementing a counter? An atomic increment works. Reserving inventory? A database transaction with row-level locking handles it. Distributed locks are the fallback when nothing simpler fits.

Leader Election

One node needs to be primary for a given task. Uses coordination service (etcd, ZooKeeper) to agree on who that is. If leader fails, remaining nodes hold new election.

graph TD
    A[All nodes start] --> B[Election timeout]
    B --> C[Node 1 votes for self]
    B --> D[Node 2 votes for self]
    C --> E{Does anyone have majority?}
    D --> E
    E -->|Yes| F[Node 1 is leader]
    E -->|No| G[New election]

Leader election backs high availability patterns. When the primary fails, election identifies a new leader and work continues.

The service registry post covers how services discover leaders.

Production Failure Scenarios

Failure Scenario	Impact	Mitigation
Split-brain after partition	Both sides accept writes, data diverges	Quorum-based decisions, fencing
Lost writes during leader crash	Write acknowledged but not replicated	Sync replication for critical data, write-ahead logs
Stale reads after replica promotion	Applications read old data	Read-your-writes consistency, routing reads to new primary
Cascade from one slow node	Distributed slowdown spreads	Bulkhead pattern, circuit breakers
Clock skew causing ordering issues	Events appear in wrong sequence	Logical clocks for ordering, NTP with careful config

Common Pitfalls / Anti-Patterns

Pitfall 1: Assuming Strong Consistency is Free

Problem: Using synchronous replication everywhere because “consistency matters.” This adds latency to every write and makes your system vulnerable to replica failures.

Solution: Identify which operations truly need strong consistency. Use eventual consistency where acceptable. See the consistency models post for classifying operations.

Pitfall 2: Ignoring Clock Skew

Problem: Using wall-clock time for ordering events or conflict resolution. Clock skew between machines makes this unreliable.

Solution: Use logical clocks (Lamport timestamps, vector clocks) for event ordering. Use HLCs when you need both ordering and wall-clock time.

Pitfall 3: Building Distribution on Day One

Problem: Starting with microservices and distributed data before understanding the actual scale requirements. This adds complexity without benefit.

Solution: Build monolithic first. Distribute when you have concrete scale or reliability requirements. See the microservices vs monolith post for this trade-off.

Pitfall 4: Underestimating Coordination Overhead

Problem: Adding distributed locks, consensus, and leader election for operations that do not actually need them. This creates bottlenecks and fragility.

Every coordination round is a network round-trip. A distributed lock needs at least one hop to acquire and one to release. A consensus round needs multiple round-trips to a quorum. These sound negligible in isolation, but at scale they become the bottleneck.

Here is the math that changed how I think about this. A simple counter increment on a single node takes one CPU instruction. With a distributed lock, you need lock acquisition, the increment itself, and lock release. If your service handles 10,000 requests per second and each lock acquisition takes 5ms, your lock alone introduces 50 seconds of cumulative latency per second. The lock serializes all access across all nodes.

Distributed transactions make this worse. A saga with five steps involves five separate service calls, each with its own database transaction. The network overhead compounds. The compensating transaction logic adds code paths that almost never run in testing but trigger during production failures at exactly the wrong moment.

Most of the time, you do not need any of this. Increment a counter? An atomic increment works. Invalidate a cache? TTL expiry is simpler and avoids the dual-write problem. Make an operation idempotent? Deduplication keys with expiration handle it. Rate limit? Token bucket with local state works fine when approximate is acceptable.

You need distributed locks or consensus only when multiple nodes modify the same state in conflicting ways and there is no way to resolve that conflict without agreement. If you are unsure whether you need it, you probably do not. The default should be no coordination. Add it only when measurements show that uncoordinated access causes actual correctness or availability problems.

Solution: Use the simplest coordination mechanism that meets requirements. Avoid distributed transactions if sagas work. Avoid consensus if leader lease-based approaches suffice.

Quick Recap Checklist

Distributed systems solve capacity and reliability problems that single machines cannot.
The Two General Problem shows we cannot achieve certainty in coordination—only probabilistic guarantees.
Failure modes (crash, Byzantine, partition) require different handling.
Replication strategies (sync, async, quorum) trade consistency against latency and availability.
Consensus algorithms (Raft, Paxos) solve agreement but add complexity.
Time in distributed systems requires logical clocks, not just physical clocks.
Distributed transactions (sagas, outbox) coordinate multi-service operations without ACID across boundaries.

Copy/Paste Checklist

- [ ] Identify which operations need strong vs eventual consistency
- [ ] Choose replication strategy per service based on consistency requirements
- [ ] Implement proper timeouts and retries for all network calls
- [ ] Use consensus (Raft) when you need strongly consistent leader election
- [ ] Implement observability: traces, metrics, and logs across service boundaries
- [ ] Test failure scenarios: network partitions, node crashes, clock skew
- [ ] Start simple: do not add distribution until you have concrete requirements

For deeper exploration, see the distributed systems roadmap for a structured learning path.

When to Use / When Not to Use

Scenario	Recommendation
Single machine handles your load	Do not distribute yet
Need horizontal scaling	Distribute stateless services first
Strong consistency required	Use consensus algorithms (Raft, Paxos)
Eventual consistency acceptable	Async replication is simpler
Long-running business transactions	Saga pattern with compensating logic
Multi-region deployment	Expect partitions, plan for divergence

Observability Checklist

Distributed systems require more monitoring than single-machine applications:

Metrics to Capture

Request latency: P50, P95, P99 by service and endpoint
Error rates: By type (timeout, crash, Byzantine)
Replication lag: How far behind are replicas
Quorum health: Are reads and writes achieving quorum
Partition events: Frequency and duration of network partitions

Logs to Emit

{
  "timestamp": "2026-03-24T10:15:30.123Z",
  "service": "order-service",
  "operation": "createOrder",
  "latency_ms": 45,
  "replicas_contacted": 3,
  "partition_detected": false,
  "trace_id": "abc123"
}

Distributed tracing (covered in Distributed Tracing) connects requests across service boundaries.

Alerts to Configure

Alert	Threshold	Severity
Replication lag > 30s	30000ms	Warning
Replication lag > 5min	300000ms	Critical
Quorum failures > 1%	1% of requests	Critical
Partition lasting > 60s	60000ms	Critical
Leader election rate spike	> 3 elections/hour	Warning

Interview Questions

1. What is the Two General Problem and why does it matter in distributed systems design?

Expected answer points:

Two nodes trying to coordinate across an unreliable network cannot guarantee mutual agreement
No protocol can achieve certainty; only probabilistic guarantees are possible
Every acknowledgment cycle has a window where messages can be lost or delayed
Design implication: accept eventual consistency and design for retry/timeout mechanisms

2. Explain the difference between crash faults, Byzantine faults, and network partitions.

Expected answer points:

Crash faults: node stops responding entirely, easiest to handle via failover
Byzantine faults: node continues operating but produces incorrect results, requires cryptographic verification and consensus
Network partitions: nodes remain alive but cannot communicate, splits the system into isolated subgroups
Each requires different mitigation strategies; partitions trigger CAP theorem trade-offs

3. When would you choose synchronous replication over asynchronous replication?

Expected answer points:

Choose synchronous when consistency outweighs write availability and latency tolerance
All replicas must acknowledge before client receives confirmation
Trade-off: write latency equals slowest replica; one slow node blocks writes
Use cases: financial systems, databases requiring strong consistency within a datacenter

4. What is quorum-based replication and how does it provide strong consistency?

Expected answer points:

Writes must succeed on N/2 + 1 replicas (quorum)
Reads also query a quorum; if W+R > N, reads see previous writes
With W=2, R=2 for N=3 replicas, you get strong consistency
DynamoDB and Cassandra allow tuning consistency per query

5. Compare Paxos and Raft consensus algorithms. When would you choose one over the other?

Expected answer points:

Paxos: proven correct but notoriously difficult to implement and understand
Raft: designed for understandability, separates leader election, log replication, and safety
Multi-Paxos optimizes for stable leader scenarios in production
Choose Raft for debuggability; choose Paxos if existing system expertise exists
etcd, CockroachDB use Raft; most production Paxos is Multi-Paxos

6. Why is Two-Phase Commit not considered a consensus algorithm? What is its critical flaw?

Expected answer points:

2PC coordinates commits but does not guarantee node agreement in failure scenarios
If coordinator crashes after participants vote "yes" but before "commit," participants wait indefinitely
2PC is blocking and not partition-tolerant
Consensus algorithms like Raft/Paxos handle this via leader election and log replication

7. Explain Lamport timestamps versus vector clocks. When would you use each?

Expected answer points:

Lamport timestamps capture "happened-before" relationships but cannot detect concurrency
If A happened before B, A's timestamp is less than B's (but converse not true)
Vector clocks extend Lamport with per-node timestamp vectors to detect concurrent events
Two events are concurrent if neither vector dominates the other
Use Lamport for simple ordering; use vector clocks when you need conflict detection (DynamoDB, Riak)

8. How does the Saga pattern handle distributed transactions differently from ACID transactions?

Expected answer points:

Sagas replace distributed ACID with a sequence of local transactions
Each service completes its local transaction and publishes an event
Next service picks up the event and performs its local transaction
If any step fails, compensating transactions undo previous steps
Trade-off: gives up immediate consistency for eventual completion
Use for long-running business processes where immediate consistency is less critical

9. What problem does the transactional outbox pattern solve? How does it work?

Expected answer points:

Solves the dual-write problem: database commits but event fails to publish (or vice versa)
Writes events to an outbox table within the same transaction as business data
A separate poller process reads the outbox and publishes to message broker
Guarantees atomicity: both business data and outbox entry commit together

10. What are the trade-offs between hash-based, range-based, and consistent hashing sharding strategies?

Expected answer points:

Hash-based: even data distribution but requires reshuffling when adding/removing nodes
Range-based: efficient for range queries but risk of hotspots on certain key ranges
Directory-based: flexible key-to-shard mapping but introduces single point of failure in lookup table
Consistent hashing: minimizes data movement when nodes change via virtual nodes on hash ring

11. How do gossip protocols achieve eventual consistency? What are their characteristics?

Expected answer points:

Each node periodically shares state with a few random peers
Information spreads like a virus through the system
Convergence time is O(log N) rounds regardless of system size
Nodes can join/leave without disrupting the protocol
No guaranteed delivery time; only eventual consistency
Use cases: membership protocols, failure detection, anti-entropy (Cassandra, Consul)

12. What are the key requirements for a proper distributed lock implementation?

Expected answer points:

Fencing tokens to handle client crashes and prevent split-brain scenarios
Clock-safe expiry: locks must auto-release even if holder crashes
Atomic check-and-set operations to prevent race conditions on lock acquisition
Naive Redis SETNX fails: clock skew, crash-before-release, network partitions all cause issues

13. Compare etcd, Apache ZooKeeper, and Consul for coordination services.

Expected answer points:

etcd: Raft consensus, flat key-value, RBAC, lease-based health checks; limited multi-region
ZooKeeper: Zab (Paxos-like), hierarchical znodes, ACLs per znode; no built-in health checks or multi-region
Consul: Raft consensus, flat key-value, intentions + ACLs, agent-based health checks, WAN federation
Choose based on existing stack: Kubernetes → etcd; HBase/Kafka → ZooKeeper; service discovery → Consul

14. What is split-brain in distributed systems and how do you prevent it?

Expected answer points:

Split-brain occurs during network partitions where both sides of the partition accept writes
Results in data divergence that is difficult to merge after partition heals
Prevention: quorum-based decision making ensures only majority partition accepts writes
Fencing tokens prevent stale writes from partitioned nodes after reconnection

15. Under what circumstances would you recommend against distributing a system?

Expected answer points:

Single machine handles current and near-term load without vertical scaling constraints
When simplicity outweighs scalability needs for your use case
Strong consistency requirements make distribution expensive
Premature distribution adds complexity without concrete scale/reliability benefits
Build monolithic first; distribute when you have concrete requirements

16. How does Hybrid Logical Clocks (HLC) combine benefits of physical and logical clocks?

Expected answer points:

HLCs store both physical time and logical counter
Provides wall-clock timestamps for human readability and debugging
Logical component ensures correct ordering when physical clocks drift or skew
Less uncertainty than pure physical clocks; more interpretable than pure logical clocks
Used in many modern distributed systems for global event ordering

17. What are the trade-offs between leader-based consensus and leaderless consensus approaches?

Expected answer points:

Leader-based (Raft, Multi-Paxos): efficient writes through single leader, but leader is bottleneck and single point of failure
Leaderless (some Dynamo-style systems): writes spread across nodes, but coordination complexity increases
Leader leases reduce leader dependency but introduce clock skew risks
Choose based on write throughput requirements and consistency needs

18. What metrics should you monitor to detect distributed system failures early?

Expected answer points:

Request latency: P50, P95, P99 by service and endpoint
Error rates by type: timeouts, crashes, Byzantine responses
Replication lag: how far replicas trail behind primary
Quorum health: are reads/writes achieving required quorum
Partition events: frequency and duration of network partitions
Leader election rate: spikes indicate instability

19. Why is clock skew a subtle but critical problem in distributed systems?

Expected answer points:

Wall-clock time is unreliable for event ordering due to machine clock drift and NTP uncertainty
Using physical time for conflict resolution leads to incorrect results after clock skew
Logical clocks provide ordering guarantees but lack real-time interpretability
HLCs combine both but require careful implementation
Always use logical clocks for event ordering; physical clocks only for human-readable timestamps

20. How does the bulkhead pattern and circuit breakers prevent cascade failures in distributed systems?

Expected answer points:

Bulkhead pattern: isolates components so failure in one service does not bring down others
Circuit breakers: detect downstream failures and stop requests to failing component
Both prevent distributed slowdown from spreading across the system
Circuit breakers allow failed components time to recover while serving fallback responses
Without these, one slow node can cause timeouts across all dependent services

Conclusion

Distributed systems are fundamentally about trade-offs. Capacity versus complexity. Consistency versus availability. Simplicity versus scalability.

The concepts covered in this primer provide a foundation for thinking about distributed problems. Failure modes teach you to expect the unexpected. Replication strategies show how to maintain data consistency across unreliable networks. Consensus algorithms provide the tools for nodes to agree despite failures.

Time and ordering present subtle challenges that trip up even experienced engineers. Coordination services solve the hard problems of leader election and distributed locks. Production failure scenarios remind you that theory and practice diverge in real systems.

Before reaching for distributed solutions, verify that your problem actually requires them. Single machines handle more than developers assume. Distribution adds latency, complexity, and failure modes that are difficult to reason about. Build monolithic first. Distribute when you have concrete requirements.

For structured learning paths, see the distributed systems roadmap which covers these topics in greater depth.

Introduction

Core Concepts

Failure Modes

Crash Faults

Byzantine Faults

Network Partitions

Replication Strategies

Synchronous Replication

Asynchronous Replication

Quorum-Based Replication

Consensus Algorithms

Paxos

Raft

Why Not Just Use Two-Phase Commit?

Time in Distributed Systems

Physical Clocks vs Logical Clocks

HLC (Hybrid Logical Clocks)

Time Synchronization Deep Dive

State Machine Replication

Replicated State Machines in Production

Failure Detectors

Distributed Transactions

Saga Pattern

Outbox Pattern

Coordination Services

Partitioning and Sharding Strategies

Gossip Protocols for State Synchronization

Which Coordination Service to Use

Distributed Locks

Leader Election

Production Failure Scenarios

Common Pitfalls / Anti-Patterns

Pitfall 1: Assuming Strong Consistency is Free

Pitfall 2: Ignoring Clock Skew

Pitfall 3: Building Distribution on Day One

Pitfall 4: Underestimating Coordination Overhead

Quick Recap Checklist

Copy/Paste Checklist

When to Use / When Not to Use

Observability Checklist

Metrics to Capture

Logs to Emit

Alerts to Configure

Interview Questions

Further Reading

Conclusion

Category

Tags

Related Posts

The Eight Fallacies of Distributed Computing

Microservices vs Monolith: Choosing the Right Architecture

CAP Theorem: Consistency vs Availability Trade-offs