Distributed Systems Primer: Key Concepts for Modern Architecture

A practical introduction to distributed systems fundamentals. Learn about failure modes, replication strategies, consensus algorithms, and the core challenges of building distributed software.


Distributed Systems Primer

A distributed system is multiple computers working together to solve a problem that no single machine can handle alone. Sounds simple. Building one that actually works is one of the hardest things in software engineering.

This post covers the fundamental concepts you need before diving into specific distributed systems patterns. If you are building systems that span multiple machines, these ideas will determine whether your architecture holds up.


Why Distribute?

Single machines hit limits. A database on one server can only scale vertically so far. At some point, you need more compute, storage, or network bandwidth than one machine provides. Distribution lets you scale horizontally.

Beyond raw capacity, distribution enables reliability. If one server fails, others take over. Users in Tokyo and London can hit nearby servers instead of waiting for cross-continental round trips. Compliance might require data to stay within certain regions.

The trade-off is complexity. A single machine has one clock, one view of memory, and one path for operations. Distributed systems have multiple clocks that never agree perfectly, multiple views of data that can diverge, and network paths that fail in subtle ways.


The Two Generals Problem

A distributed system must coordinate nodes that may fail independently. The Two Generals Problem shows why this is hard: over an unreliable channel, no protocol lets two generals guarantee they have both committed to the same plan.

graph LR
    A[General A] -->|Attack at dawn| B[General B]
    B -->|Acknowledgment| A
    A -->|Confirmation| B
    B -->|Must confirm| A
    A -->|Infinite loop| B

In practice, we accept probabilistic guarantees. Systems use acknowledgments, timeouts, and retries to achieve “good enough” coordination. Understanding what guarantees are actually possible keeps you from over-engineering for certainty that mathematics forbids.
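That acknowledge-timeout-retry loop can be sketched as follows. The `send` function and the retry and timeout parameters are illustrative; the point is that retries improve the odds of delivery without ever reaching certainty.

```javascript
// Sketch: acknowledgment with timeout and bounded retries.
async function sendWithRetry(send, message, { retries = 3, timeoutMs = 100 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      // Race the send against a timeout; a lost acknowledgment is
      // indistinguishable from a lost message
      return await Promise.race([
        send(message),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error("timeout")), timeoutMs)
        ),
      ]);
    } catch (err) {
      if (attempt === retries) throw err; // Out of retries: report failure upward
    }
  }
}
```

Note that a retried message may arrive twice, which is one reason idempotent handlers matter in practice.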


Failure Modes

Distributed systems face richer failure modes than single machines:

Crash Faults

A node stops working and stops responding to requests. Detection is comparatively simple: no response within a timeout suggests a crash, although a slow node or a partitioned one looks identical from the outside. Handling is also relatively simple: fail over to a backup.
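Timeout-based detection can be sketched as a heartbeat tracker. The class and threshold here are illustrative, and note the honest naming: a silent node is only *suspected* to have crashed.

```javascript
// Sketch: timeout-based crash detection. A perfect failure detector is
// impossible over an asynchronous network; this is a heuristic.
class FailureDetector {
  constructor(timeoutMs = 5000) {
    this.timeoutMs = timeoutMs;
    this.lastHeartbeat = new Map(); // nodeId -> timestamp of last heartbeat
  }

  heartbeat(nodeId, now = Date.now()) {
    this.lastHeartbeat.set(nodeId, now);
  }

  // A silent node is suspected crashed; it may just be slow or partitioned
  isSuspected(nodeId, now = Date.now()) {
    const last = this.lastHeartbeat.get(nodeId);
    return last === undefined || now - last > this.timeoutMs;
  }
}
```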

Byzantine Faults

A node keeps running but produces incorrect responses. It might be buggy, compromised by an attacker, or misconfigured. Byzantine fault tolerance requires cryptographic verification and consensus among multiple nodes. Financial systems and aerospace computing deal with these requirements.

Network Partitions

Nodes stay alive but cannot communicate with each other because the network between them has failed. This is the scenario the CAP theorem addresses: during a partition, you must choose between serving possibly stale data (availability) and refusing to serve requests (consistency).

graph TD
    subgraph DC1["Data Center 1"]
        A[Node A] --> B[Node B]
    end
    subgraph DC2["Data Center 2"]
        C[Node C] --> D[Node D]
    end
    A -.->|Partition| C
    B -.->|Partition| D

Partitions are not rare events. In any large enough system, you should expect regular network blips. Design for partitions, not against them.

See the availability patterns post for building systems that remain available despite partitions. The CAP theorem post explains the fundamental trade-off during partitions.


Replication Strategies

Once you distribute data across multiple nodes, you must keep those copies in sync. Different strategies offer different trade-offs.

Synchronous Replication

Write to all replicas before confirming to the client. All copies stay consistent, but write latency equals the slowest replica. If one replica is slow or unreachable, writes fail.

async function syncWrite(key, value, replicas) {
  const results = await Promise.all(replicas.map((r) => r.write(key, value)));
  if (results.some((r) => !r.success)) {
    throw new Error("Write failed");
  }
  return { success: true };
}

Used when consistency matters more than write availability. Traditional relational databases often use synchronous replication within a data center.

Asynchronous Replication

Write to primary, acknowledge immediately, then propagate to replicas in background. Writes are fast, but copies can diverge during normal operation and especially during failures.

async function asyncWrite(key, value, primary) {
  await primary.write(key, value); // Acknowledge as soon as the primary commits
  // Fire and forget: replicas catch up in the background, after the
  // client has already received its acknowledgment
  backgroundPropagate(key, value).catch((err) => console.error("replication failed", err));
  return { success: true };
}

Used when write latency matters more than immediate consistency. Many NoSQL databases default to async replication.

Quorum-Based Replication

Writes must succeed on a quorum of replicas (typically ⌊N/2⌋ + 1). Reads also query a quorum. If the read and write quorums overlap (W + R > N), every read contacts at least one replica that saw the latest write, which gives strong consistency.

// Quorum calculation for N=3 replicas
// Strong consistency: W=2, R=2 (W+R > N means reads see previous writes)
// Eventual consistency: W=1, R=1 (any replica can serve reads)
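Under these rules, a quorum write can be sketched as follows. The replica objects and their `write` method are illustrative:

```javascript
// Sketch: a write succeeds once W of N replicas acknowledge it.
async function quorumWrite(key, value, replicas, W) {
  const results = await Promise.allSettled(
    replicas.map((r) => r.write(key, value))
  );
  const acks = results.filter(
    (r) => r.status === "fulfilled" && r.value.success
  ).length;
  if (acks < W) throw new Error(`quorum not reached: ${acks}/${W} acks`);
  return { success: true, acks };
}
```

Unlike the synchronous example above, a slow or failed replica does not block the write as long as W others respond.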

DynamoDB, Cassandra, and similar systems let you tune consistency per query. See the consistency models post for the spectrum from strong to eventual consistency.


Consensus Algorithms

Some operations require all nodes to agree. Which node is the leader? Should we accept this transaction? Did this client actually request this operation? Consensus algorithms solve agreement in the presence of failures.

Paxos

The classic consensus algorithm, first described by Leslie Lamport in 1989 (the paper, “The Part-Time Parliament,” was not published until 1998). Paxos works in two phases: prepare and accept. An acceptor only accepts a proposal if it has not already promised a higher proposal number. Paxos is proven correct but notoriously hard to implement.

Most production systems use Multi-Paxos, which optimizes for the common case where one node is the stable leader for an extended period.

Raft

Designed to be more understandable than Paxos; the paper is literally titled “In Search of an Understandable Consensus Algorithm.” Raft separates the problem into leader election, log replication, and safety.

graph LR
    A[Follower] -->|Election timeout| B[Candidate]
    B -->|Votes from majority| C[Leader]
    C -->|Heartbeat timeout| A
    B -->|Another leader appears| A
    C -->|Crash| A

Raft is used in etcd (Kubernetes’ backing store), CockroachDB, and many other systems. If you need consensus and want something you can actually debug, Raft is usually the right choice.

Why Not Just Use Two-Phase Commit?

Two-phase commit (2PC) is not a consensus algorithm. It coordinates commits across nodes but has a critical flaw: if the coordinator crashes after participants have voted “yes” but before sending “commit,” participants wait forever. 2PC is blocking and not partition-tolerant.

See the two-phase commit and distributed transactions posts for alternatives like the saga pattern.


Time in Distributed Systems

Clocks across machines never agree perfectly. Network latency varies. Machines crash and restart with incorrect times. Understanding time is crucial for debugging and correctness in distributed systems.

Physical Clocks vs Logical Clocks

Physical clocks try to track real time but drift and skew. NTP synchronization helps but cannot eliminate uncertainty. For ordering events, logical clocks (Lamport timestamps, vector clocks) provide better guarantees.

Lamport timestamps capture “happened-before” relationships: if A happened before B, then A’s timestamp is less than B’s. But Lamport timestamps alone cannot tell you if A and B were concurrent.
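A Lamport clock is just a counter with two rules, as this minimal sketch shows:

```javascript
// Sketch: a Lamport clock. Tick on local/send events, max-merge on receive.
class LamportClock {
  constructor() {
    this.time = 0;
  }

  tick() {
    return ++this.time; // local event: increment
  }

  send() {
    return ++this.time; // attach the incremented value to the outgoing message
  }

  receive(msgTime) {
    // Jump past both our own time and the sender's, preserving happened-before
    this.time = Math.max(this.time, msgTime) + 1;
    return this.time;
  }
}
```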

Vector clocks extend Lamport timestamps to capture causality: each node maintains a vector of timestamps, one per node. Two events are concurrent if neither vector dominates the other.

graph LR
    A[Node A: 1] --> B[Node B: 1,1]
    A --> C[Node A: 2]
    C --> D[Node B: 2,1]
    B --> E[Concurrent writes?<br/>A: v=1, B: v=1]
    D --> E

Vector clocks are used in systems like Amazon's Dynamo and Riak to track causal history and resolve write conflicts.
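The “neither dominates” check can be written directly. A minimal sketch, representing each clock as a plain node-to-counter object:

```javascript
// Sketch: compare two vector clocks. Missing entries count as 0.
function compareVectorClocks(vcA, vcB) {
  let aLeq = true; // is A <= B component-wise?
  let bLeq = true; // is B <= A component-wise?
  const nodes = new Set([...Object.keys(vcA), ...Object.keys(vcB)]);
  for (const n of nodes) {
    const a = vcA[n] || 0;
    const b = vcB[n] || 0;
    if (a > b) aLeq = false;
    if (b > a) bLeq = false;
  }
  if (aLeq && bLeq) return "equal";
  if (aLeq) return "before";   // A happened-before B
  if (bLeq) return "after";    // B happened-before A
  return "concurrent";         // neither dominates: a genuine conflict
}
```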

HLC (Hybrid Logical Clocks)

Hybrid Logical Clocks combine physical time with logical time for ordering. Many modern systems use HLCs to get benefits of both.
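A sketch of the HLC update rules, simplified from the published algorithm. The injectable wall-clock function makes the behavior deterministic for testing:

```javascript
// Sketch: a Hybrid Logical Clock with physical part `pt` and logical counter `c`.
class HybridLogicalClock {
  constructor(now = Date.now) {
    this.now = now; // injectable wall-clock for determinism
    this.pt = 0;
    this.c = 0;
  }

  // Local or send event: track wall time when it advances, otherwise count
  tick() {
    const wall = this.now();
    if (wall > this.pt) {
      this.pt = wall;
      this.c = 0;
    } else {
      this.c++;
    }
    return { pt: this.pt, c: this.c };
  }

  // Receive event: merge with the remote timestamp
  receive(remote) {
    const wall = this.now();
    const maxPt = Math.max(wall, this.pt, remote.pt);
    if (maxPt === wall && wall > this.pt && wall > remote.pt) this.c = 0;
    else if (maxPt === this.pt && maxPt === remote.pt) this.c = Math.max(this.c, remote.c) + 1;
    else if (maxPt === this.pt) this.c = this.c + 1;
    else this.c = remote.c + 1;
    this.pt = maxPt;
    return { pt: this.pt, c: this.c };
  }
}
```

Timestamps stay close to wall-clock time while still ordering causally related events, even when physical clocks stall or skew.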


Distributed Transactions

Coordinating changes across multiple services or databases is notoriously hard. ACID transactions work within a single database. Across multiple databases, you need coordination.

Saga Pattern

Instead of distributed ACID transactions, sagas coordinate via a sequence of local transactions. Each service completes its local transaction and publishes an event. The next service picks up that event and performs its local transaction.

graph TD
    A[Order Service] -->|OrderCreated| B[Payment Service]
    B -->|PaymentCompleted| C[Inventory Service]
    C -->|InventoryReserved| D[Shipping Service]
    D -->|ShipmentScheduled| E[Order Complete]
    B -->|PaymentFailed| F[Compensating: Refund]
    C -->|ReservationFailed| G[Compensating: Cancel Payment]

If any step fails, compensating transactions undo previous steps. Sagas work well for long-running business processes where immediate consistency is less important than eventual completion.
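An orchestrated saga can be sketched as a loop over steps with reverse-order compensation. The step objects and their names are illustrative:

```javascript
// Sketch: run saga steps in order; on failure, compensate completed steps
// in reverse. Compensations should be idempotent in practice.
async function runSaga(steps) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.execute();
      completed.push(step);
    } catch (err) {
      for (const done of completed.reverse()) {
        await done.compensate(); // undo in reverse order
      }
      return { success: false, failedAt: step.name };
    }
  }
  return { success: true };
}
```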

See the saga pattern post for implementation details including choreography versus orchestration approaches.

Outbox Pattern

For reliable event publishing alongside database updates, the transactional outbox pattern writes events to a table within the same transaction as business data. A separate process polls that table and publishes events to consumers.

async function createOrder(order) {
  await db.transaction(async (trx) => {
    await trx.insert("orders", order);
    // Same transaction: write to outbox
    await trx.insert("outbox", {
      eventType: "OrderCreated",
      payload: JSON.stringify(order),
    });
  });
  // Outbox poller picks this up and publishes
}

This avoids the dual-write problem where the database commits but the event fails to publish (or vice versa).
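The other half of the pattern is the poller. This sketch assumes a hypothetical `db` client, a `publisher`, and an `outbox` table with a `published_at` column:

```javascript
// Sketch: poll unpublished outbox rows, publish, then mark as published.
async function pollOutbox(db, publisher, batchSize = 100) {
  const events = await db.query(
    "SELECT id, event_type, payload FROM outbox WHERE published_at IS NULL ORDER BY id LIMIT ?",
    [batchSize]
  );
  for (const event of events) {
    await publisher.publish(event.event_type, event.payload);
    // Mark only after a successful publish; a crash between the two steps
    // re-publishes the event, so delivery is at-least-once and consumers
    // must be idempotent
    await db.execute("UPDATE outbox SET published_at = NOW() WHERE id = ?", [event.id]);
  }
  return events.length;
}
```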


Coordination Services

Some operations should only happen once, regardless of how many nodes try. Distributed locks, leader election, and service discovery all require coordination.

Partitioning and Sharding Strategies

Data partitioning (sharding) distributes data across multiple nodes:

| Strategy | How It Works | Trade-offs |
| --- | --- | --- |
| Hash-based | `shard = hash(key) % num_shards` | Even distribution, but reshuffling required when adding nodes |
| Range-based | Keys partitioned by value ranges | Good for range queries, but can create hotspots |
| Directory-based | Lookup table maps keys to shards | Flexible, but the lookup table is a SPOF |
| Consistent hashing | Virtual nodes on a hash ring | Minimal reshuffling when nodes are added or removed |

Consistent hashing minimizes data movement when nodes are added or removed:

// Consistent hashing concept (hashFn is any uniform hash that returns a number)
class ConsistentHash {
  constructor(nodes, virtualNodes = 150) {
    this.ring = new Map();
    this.sortedKeys = [];

    for (const node of nodes) {
      this.addNode(node, virtualNodes);
    }
  }

  addNode(node, vNodes) {
    for (let i = 0; i < vNodes; i++) {
      const key = hashFn(node + ":" + i);
      this.ring.set(key, node);
    }
    this.sortedKeys = Array.from(this.ring.keys()).sort((a, b) => a - b);
  }

  getNode(key) {
    const h = hashFn(key);
    // First ring position clockwise of h; wrap to the start if none.
    // This linear scan is O(n); use binary search in production.
    const idx = this.sortedKeys.findIndex((k) => k >= h);
    return this.ring.get(this.sortedKeys[idx === -1 ? 0 : idx]);
  }
}

Gossip Protocols for State Synchronization

Gossip protocols spread information through a system like a virus. Each node periodically shares state with a few random peers, achieving eventual consistency with O(log N) propagation time.

graph LR
    A[Node A] -->|Gossip to 2 random peers| B[Node B]
    A -->|Gossip to 2 random peers| C[Node C]
    B -->|Next round| D[Node D]
    C -->|Next round| E[Node E]
    D -->|Next round| F[Node F]
    E -->|Next round| G[Node G]

Gossip characteristics:

  • Convergence time: O(log N) rounds to reach all nodes
  • Failure tolerance: Nodes can join/leave without protocol disruption
  • Eventual consistency: Always converges, but no guaranteed delivery time
  • Use cases: Membership protocols, failure detection, anti-entropy

// Simplified gossip implementation
class GossipProtocol {
  constructor(nodeId, peers, state) {
    this.nodeId = nodeId;
    this.peers = peers; // other GossipProtocol instances
    this.state = state;
    this.version = 0;
  }

  // Called periodically
  async gossip() {
    const targets = this.selectRandomPeers(2);
    const payload = { state: this.state, version: this.version };

    for (const peer of targets) {
      await peer.receiveGossip(this.nodeId, payload);
    }
  }

  async receiveGossip(fromNode, payload) {
    if (payload.version > this.version) {
      this.state = this.merge(this.state, payload.state);
      this.version = payload.version;
    }
  }

  // Naive last-writer-wins merge; real systems use per-key versions or CRDTs
  merge(local, remote) {
    return { ...local, ...remote };
  }

  selectRandomPeers(count) {
    return this.peers
      .filter((p) => p.nodeId !== this.nodeId)
      .sort(() => Math.random() - 0.5)
      .slice(0, count);
  }
}

Cassandra uses gossip for failure detection and state synchronization. Consul uses gossip for membership and health checking.

Which Coordination Service to Use

| Feature | etcd | Apache ZooKeeper | Consul |
| --- | --- | --- | --- |
| Consensus | Raft | Zab (Paxos-like) | Raft |
| Data model | Flat key-value | Hierarchical znodes | Flat key-value |
| ACLs | RBAC | ACLs per znode | Intentions + ACLs |
| Health checks | Lease-based | No built-in | Agent-based |
| Multi-region | Limited | No | Yes (WAN federation) |
| DNS integration | No | No | Yes |
| Use case focus | Kubernetes state | HBase, Kafka | Service discovery |

When to use etcd:

  • Backing Kubernetes cluster state
  • When you need distributed locks with lease support
  • Configuration management with watch-based updates

When to use ZooKeeper:

  • Legacy systems already using it (HBase, Kafka)
  • When you need hierarchical data model
  • Java-centric environments

When to use Consul:

  • Service discovery with health checking
  • Multi-datacenter deployments
  • When you need built-in DNS for service discovery

Distributed Locks

Mutexes that work across machines. Naive implementations using Redis SETNX fail in subtle ways: clients can crash before releasing, clocks skew, and network partitions leave locks held indefinitely.

Proper distributed locks require: fencing tokens (to handle client crashes), clock-safe expiry, and atomic check-and-set operations.
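Fencing tokens can be sketched from the resource's side. The coordination service hands out a monotonically increasing token with each lock grant, and the resource rejects anything stale; the class here is illustrative:

```javascript
// Sketch: a resource that rejects writes carrying a stale fencing token.
class FencedResource {
  constructor() {
    this.highestToken = 0;
  }

  write(token, value) {
    // A client that paused (GC, network stall) and lost its lock
    // arrives with an old token and must be rejected
    if (token < this.highestToken) {
      throw new Error(`stale fencing token ${token} < ${this.highestToken}`);
    }
    this.highestToken = token;
    this.value = value;
    return true;
  }
}
```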

Leader Election

One node needs to act as the primary for a given task. Nodes use a coordination service (etcd, ZooKeeper) to agree on which one. If the leader fails, the remaining nodes hold a new election.

graph TD
    A[All nodes start] --> B[Election timeout]
    B --> C[Node 1 votes for self]
    B --> D[Node 2 votes for self]
    C --> E{Does anyone have majority?}
    D --> E
    E -->|Yes| F[Node 1 is leader]
    E -->|No| G[New election]

Leader election backs high availability patterns. When the primary fails, election identifies a new leader and work continues.
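A lease-based election can be sketched against a generic compare-and-set store. The `store` interface below is illustrative, not a real etcd or ZooKeeper API:

```javascript
// Sketch: try to acquire (or re-acquire) leadership via a lease record.
async function tryBecomeLeader(store, nodeId, leaseMs, now = Date.now()) {
  const current = await store.get("leader");
  if (current && now < current.expiresAt && current.nodeId !== nodeId) {
    return false; // someone else holds an unexpired lease
  }
  // Atomic compare-and-set: only one contender wins the race
  return store.cas("leader", current, { nodeId, expiresAt: now + leaseMs });
}
```

The winner must renew its lease before expiry; if it pauses too long, another node can legitimately take over, which is exactly why fencing tokens matter.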

The service registry post covers how services discover leaders.


When to Use / When Not to Use

| Scenario | Recommendation |
| --- | --- |
| Single machine handles your load | Do not distribute yet |
| Need horizontal scaling | Distribute stateless services first |
| Strong consistency required | Use consensus algorithms (Raft, Paxos) |
| Eventual consistency acceptable | Async replication is simpler |
| Long-running business transactions | Saga pattern with compensating logic |
| Multi-region deployment | Expect partitions, plan for divergence |

Production Failure Scenarios

| Failure Scenario | Impact | Mitigation |
| --- | --- | --- |
| Split-brain after partition | Both sides accept writes, data diverges | Quorum-based decisions, fencing |
| Lost writes during leader crash | Write acknowledged but not replicated | Sync replication for critical data, write-ahead logs |
| Stale reads after replica promotion | Applications read old data | Read-your-writes consistency, routing reads to new primary |
| Cascade from one slow node | Distributed slowdown spreads | Bulkhead pattern, circuit breakers |
| Clock skew causing ordering issues | Events appear in wrong sequence | Logical clocks for ordering, NTP with careful config |

Observability Checklist

Distributed systems require more monitoring than single-machine applications:

Metrics to Capture

  • Request latency: P50, P95, P99 by service and endpoint
  • Error rates: By type (timeout, crash, Byzantine)
  • Replication lag: How far behind are replicas
  • Quorum health: Are reads and writes achieving quorum
  • Partition events: Frequency and duration of network partitions

Logs to Emit

{
  "timestamp": "2026-03-24T10:15:30.123Z",
  "service": "order-service",
  "operation": "createOrder",
  "latency_ms": 45,
  "replicas_contacted": 3,
  "partition_detected": false,
  "trace_id": "abc123"
}

Distributed tracing (covered in Distributed Tracing) connects requests across service boundaries.

Alerts to Configure

| Alert | Threshold | Severity |
| --- | --- | --- |
| Replication lag > 30s | 30000ms | Warning |
| Replication lag > 5min | 300000ms | Critical |
| Quorum failures > 1% | 1% of requests | Critical |
| Partition lasting > 60s | 60000ms | Critical |
| Leader election rate spike | > 3 elections/hour | Warning |

Common Pitfalls / Anti-Patterns

Pitfall 1: Assuming Strong Consistency is Free

Problem: Using synchronous replication everywhere because “consistency matters.” This adds latency to every write and makes your system vulnerable to replica failures.

Solution: Identify which operations truly need strong consistency. Use eventual consistency where acceptable. See the consistency models post for classifying operations.

Pitfall 2: Ignoring Clock Skew

Problem: Using wall-clock time for ordering events or conflict resolution. Clock skew between machines makes this unreliable.

Solution: Use logical clocks (Lamport timestamps, vector clocks) for event ordering. Use HLCs when you need both ordering and wall-clock time.

Pitfall 3: Building Distribution on Day One

Problem: Starting with microservices and distributed data before understanding the actual scale requirements. This adds complexity without benefit.

Solution: Build monolithic first. Distribute when you have concrete scale or reliability requirements. See the microservices vs monolith post for this trade-off.

Pitfall 4: Underestimating Coordination Overhead

Problem: Adding distributed locks, consensus, and leader election for operations that do not actually need them. This creates bottlenecks and fragility.

Solution: Use the simplest coordination mechanism that meets requirements. Avoid distributed transactions if sagas work. Avoid consensus if leader lease-based approaches suffice.


Quick Recap

  • Distributed systems solve capacity and reliability problems that single machines cannot.
  • The Two Generals Problem shows we cannot achieve certainty in coordination, only probabilistic guarantees.
  • Failure modes (crash, Byzantine, partition) require different handling.
  • Replication strategies (sync, async, quorum) trade consistency against latency and availability.
  • Consensus algorithms (Raft, Paxos) solve agreement but add complexity.
  • Time in distributed systems requires logical clocks, not just physical clocks.
  • Distributed transactions (sagas, outbox) coordinate multi-service operations without ACID across boundaries.

Copy/Paste Checklist

- [ ] Identify which operations need strong vs eventual consistency
- [ ] Choose replication strategy per service based on consistency requirements
- [ ] Implement proper timeouts and retries for all network calls
- [ ] Use consensus (Raft) when you need strongly consistent leader election
- [ ] Implement observability: traces, metrics, and logs across service boundaries
- [ ] Test failure scenarios: network partitions, node crashes, clock skew
- [ ] Start simple: do not add distribution until you have concrete requirements

For deeper exploration, see the distributed systems roadmap for a structured learning path.


Related Posts

Microservices vs Monolith: Choosing the Right Architecture

Understand the fundamental differences between monolithic and microservices architectures, their trade-offs, and how to decide which approach fits your project.

#microservices #monolith #architecture

The Eight Fallacies of Distributed Computing

Explore the classic assumptions developers make about networked systems that lead to failures. Learn how to avoid these pitfalls in distributed architecture.

#distributed-systems #distributed-computing #system-design

Distributed Caching: Multi-Node Cache Clusters

Scale caching across multiple nodes. Learn about cache clusters, consistency models, session stores, and cache coherence patterns.

#distributed-systems #caching #scalability