Distributed Operating Systems
Explore distributed file systems, RPC mechanisms, cluster scheduling, and the fundamental concepts behind modern distributed operating systems.
Introduction
Distributed operating systems represent a fundamental shift from single machine systems: instead of one computer doing all the work, multiple independent machines coordinate to provide unified services. The operating system abstracts away the network boundary, making distributed resources appear local to applications. This is conceptually elegant but implementationally challenging—network delays, partial failures, and consistency trade-offs create complexity that doesn’t exist in single machine systems.
Modern data centers run distributed systems at every layer: HDFS and Ceph for storage, Kubernetes for orchestration, ZooKeeper for coordination, and custom RPC frameworks for service communication. Understanding distributed OS concepts is essential for anyone building scalable, reliable infrastructure.
When to Use / When Not to Use
Distributed systems are appropriate when:
- Horizontal scaling is required — Workload exceeds single-machine capacity
- High availability is mandatory — Applications must survive machine failures
- Geographic distribution is needed — Users span multiple regions with latency requirements
- Different components have different requirements — Specialized machines for compute vs storage
Distributed systems are NOT appropriate when:
- Simple CRUD operations dominate — A single PostgreSQL instance handles millions of queries
- ACID transactions are critical — Distributed transactions are slow; consider data locality first
- Team lacks operational expertise — Distributed systems fail in spectacular ways; operators must understand them
- Latency is ultra-critical — Network round-trips add milliseconds that may be unacceptable
Architecture or Flow Diagram
flowchart TB
subgraph "Client Layer"
APP1[Application]
APP2[Application]
APP3[Application]
end
subgraph "RPC / API Gateway"
LB[Load Balancer]
GW[API Gateway]
end
subgraph "Service Mesh"
S1[Service A<br/>Replica 1]
S2[Service A<br/>Replica 2]
S3[Service B<br/>Replica 1]
S4[Service B<br/>Replica 2]
end
subgraph "Storage Layer"
NFS[NFS Server]
GFS[GFS2 / Ceph Cluster]
ZK[ZooKeeper<br/>Quorum]
end
subgraph "Scheduling"
K8S[Kubernetes<br/>Control Plane]
MESOS[Mesos<br/>Master]
end
APP1 --> LB
APP2 --> LB
APP3 --> GW
LB --> S1
LB --> S2
GW --> S3
GW --> S4
S1 --> NFS
S3 --> GFS
S2 --> ZK
S4 --> ZK
K8S -.-> S1
K8S -.-> S2
K8S -.-> S3
K8S -.-> S4
style GFS stroke:#ff6b6b,stroke-width:3px
style ZK stroke:#ffa94d,stroke-width:3px
Core Concepts
Distributed File Systems
NFS (Network File System)
NFS is the classic distributed file system, dating to 1984. It provides transparent file access over the network:
# Mount NFS share on client
mount -t nfs4 -o rw,sync server.example.com:/shared /mnt/nfs
# /etc/exports on server
# /shared *(rw,sync,no_subtree_check,no_root_squash)
# Root squashing maps root user to anonymous UID for security
# Automount with autofs
# /etc/auto.master
# /mnt/nfs /etc/auto.nfs --timeout=60
# /etc/auto.nfs
# shared -rw,sync server.example.com:/shared
NFSv4 uses a stateful protocol with better performance but requires careful lock management. The sync mount option writes data to disk before responding; async is faster but risks data loss on server crash.
GFS2 (Global File System 2)
GFS2 is a clustered file system for shared storage:
# Create GFS2 filesystem
mkfs.gfs2 -p lock_dlm -j 3 -t cluster_name:vol_name /dev/sdb
# Mount GFS2 filesystem (on each node)
mount -t gfs2 -o noatime /dev/sdb /mnt/gfs2
# Check lock status
gfs_tool lockdump /mnt/gfs2
gfs_tool gettune /mnt/gfs2
GFS2 requires shared storage (SAN) and a working cluster manager (Corosync + Pacemaker). It provides coherent file system access across all nodes—writes are immediately visible to all cluster members.
Remote Procedure Call (RPC)
RPC abstracts network communication into function calls:
// Protocol buffer definition (gRPC style)
syntax = "proto3";
package calculator;
service Calculator {
rpc Add(AddRequest) returns (AddResponse);
rpc StreamSum(StreamRequest) returns (stream StreamResponse);
}
message AddRequest {
int32 a = 1;
int32 b = 2;
}
message AddResponse {
int32 result = 1;
}
// Generated server stub (Python example)
import grpc
from concurrent import futures
import calculator_pb2 as pb2
import calculator_pb2_grpc as pb2_grpc
class CalculatorServicer(pb2_grpc.CalculatorServicer):
def Add(self, request, context):
result = request.a + request.b
return pb2.AddResponse(result=result)
def StreamSum(self, request_iterator, context):
total = 0
for req in request_iterator:
total += req.value
yield pb2.StreamResponse(running_total=total)
# Server setup
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
pb2_grpc.add_CalculatorServicer_to_server(CalculatorServicer(), server)
server.add_insecure_port('[::]:50051')
server.start()
Cluster Scheduling
Kubernetes Scheduling
Kubernetes schedules pods across nodes based on resources and constraints:
apiVersion: v1
kind: Pod
metadata:
name: my-app
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- my-app
topologyKey: kubernetes.io/hostname
containers:
- name: my-container
image: my-image:latest
resources:
requests:
memory: "128Mi"
cpu: "250m"
limits:
memory: "256Mi"
cpu: "500m"
nodeSelector:
disktype: ssd
tolerations:
- key: "node.kubernetes.io/not-ready"
operator: "Exists"
effect: "NoSchedule"
Mesos Architecture
Mesos uses a two-level scheduling architecture:
# Mesos executor framework (Python example)
from mesos.interface import mesos_pb2
import mesos.native as mesos
class MyTask:
def __init__(self, task_id, command):
self.task_id = task_id
self.command = command
def reregistered(driver, frameworkId):
print(f"Re-registered: {frameworkId.value}")
def resourceOffers(driver, offers):
for offer in offers:
if len(tasks) > 0:
task = tasks.pop(0)
task.mesos_task_id.value = str(task_id_counter)
driver.launchTasks(offer.id, [task])
else:
driver.declineOffer(offer.id)
driver = mesos.MesosExecutorDriver(
mesos.Executor(),
framework=mesos_pb2.FrameworkInfo()
)
driver.run()
Production Failure Scenarios
Scenario 1: Split-Brain in Clustered Storage
Problem: Network partition causes both sides of partition to think they have primary storage access, leading to data corruption.
Mitigation:
- Use quorum-based fencing (STONITH—Shoot The Other Node In The Head)
- Configure watchdog timers appropriately
- Test network partitions regularly withchaos engineering
- Use journaling file systems that detect split-brain
Scenario 2: NFS Server Becoming Unavailable
Problem: Applications hang when NFS server fails, causing cascading timeouts.
Mitigation:
# Use soft mounts with timeout for less critical data
mount -t nfs4 -o soft,timeo=30,retrans=3 server:/share /mnt/nfs
# Use hard,intr for critical data (can be interrupted)
mount -t nfs4 -o hard,intr,timeo=600 server:/share /mnt/nfs
# Enable automatic unmount on server loss
umount -l /mnt/nfs # Lazy unmount
Scenario 3: Clock Skew in Distributed Systems
Problem: Nodes disagree on time due to NTP misconfiguration, causing certificate validation failures and cache inconsistencies.
Mitigation:
- Use NTP with multiple upstream sources
- Configure acceptable drift thresholds
- Use logical timestamps (Lamport clocks or vector clocks) for event ordering
- In critical systems, use PTP (Precision Time Protocol) for sub-millisecond accuracy
Trade-off Table
| Aspect | NFS | GFS2 | HDFS | CephFS |
|---|---|---|---|---|
| Consistency Model | Write-through | Cluster-coherent | Write-once | Strong eventual |
| Setup Complexity | Low | Medium | High | High |
| Storage Type | Any shared | Shared SAN only | Distributed | Any |
| Max Scale | Single server | Few dozen nodes | Thousands | Thousands |
| Use Case | General purpose | HA workloads | Big data analytics | Cloud native |
Implementation Snippet: Raft-based RPC Framework
Building a fault-tolerant RPC layer:
import hashlib
import time
from dataclasses import dataclass
from typing import Optional, Dict, Any
@dataclass
class LogEntry:
term: int
index: int
command: Dict[str, Any]
class RaftNode:
def __init__(self, node_id: str, peers: list):
self.node_id = node_id
self.peers = peers
self.current_term = 0
self.voted_for: Optional[str] = None
self.log: list[LogEntry] = []
self.commit_index = 0
def request_vote(self, candidate_id: str, last_log_index: int,
last_log_term: int) -> bool:
if self.current_term > candidate_id.term:
return False
if self.voted_for is None or self.voted_for == candidate_id:
if last_log_term > self.get_last_log_term():
self.voted_for = candidate_id
return True
return False
def append_entries(self, leader_id: str, entries: list[LogEntry],
prev_log_index: int, prev_log_term: int) -> bool:
if prev_log_index > len(self.log):
return False
# Consistency check
if prev_log_index > 0 and self.log[prev_log_index - 1].term != prev_log_term:
self.log = self.log[:prev_log_index - 1]
return False
self.log.extend(entries)
return True
def get_last_log_term(self) -> int:
if self.log:
return self.log[-1].term
return 0
Observability Checklist
For distributed systems, monitor:
- End-to-end latency — Distributed tracing (Jaeger, Zipkin)
- Service mesh metrics — Envoy proxy metrics, circuit breakers
- Storage latency — NFS mount latency, Ceph OSD throughput
- Quorum health — ZooKeeper znode count, Raft leader elections
- Network partitioning — Detector mechanisms, split-brain events
- Clock skew — NTP offset measurements across nodes
Common Pitfalls / Anti-Patterns
- Kerberos for NFSv4 — Prevents unauthorized mount without proper ticket
- mTLS in service mesh — Istio/Linkerd provide automatic encryption and authentication
- RBAC for Kubernetes — Principle of least privilege for service accounts
- Secrets management — Use Vault or Kubernetes secrets, not environment variables
- Audit logging — Distributed traces should include authentication context
Common Pitfalls / Anti-patterns
- Ignoring network partitions — Assuming the network is reliable is the most common distributed systems mistake
- Not planning for partial failure — Every remote call can fail; timeout and retry logic is mandatory
- Over-engineering consistency — Using distributed transactions when simpler eventual consistency would work
- Forgetting about operator complexity — Running a distributed system requires expertise that small teams may lack
- Treating debugging as impossible — Distributed tracing makes debugging tractable; instrument early
Quick Recap Checklist
- Distributed systems trade single-system simplicity for scale and fault tolerance
- CAP theorem forces explicit trade-offs between consistency and availability
- NFS provides simple networked file access; GFS2 provides cluster-coherent storage
- RPC frameworks abstract network communication but not its failure modes
- Cluster schedulers (Kubernetes, Mesos) provide resource abstraction across machines
- Monitor end-to-end latency, quorum health, and clock synchronization
- Design assuming partial failure—every remote call can fail
Real-World Case Study: Netflix’s Distributed System Architecture
Netflix operates one of the largest distributed systems in the world, serving 200+ million subscribers. Their architecture demonstrates key distributed OS concepts:
- Service Discovery: Eureka provides dynamic service registration and location, replacing static configurations
- Data Storage: Cassandra handles global data replication across regions with tunable consistency
- Coordination: ZooKeeper (now etcd) manages configuration and leader election for critical services
- Traffic Management: Zuul provides dynamic routing, load balancing, and circuit breaking
- Asynchronous Communication: Kafka provides durable, ordered message streams between services
Their approach to distributed systems embraces eventual consistency where appropriate (user preferences can be slightly stale) while enforcing strong consistency for billing and playback authorization.
Advanced Topic: Sharding Strategies and Trade-offs
Sharding (horizontal partitioning) distributes data across multiple nodes but introduces complexity:
Hash-based sharding: Route by hash(key) % num_nodes
- Even distribution, but adding nodes requires remapping all data
- Use consistent hashing to minimize remapping
Range-based sharding: Route by key ranges (a-m, n-z)
- Natural for time-series data, but can create hotspots
- Example: logs organized by date ranges
Geo-based sharding: Route by user region (US, EU, APAC)
- Low latency for local access, but cross-region queries are expensive
- Use replication for global reads
Trade-offs: Sharding improves write scalability but complicates queries that span shards (joins, aggregates). Consider whether your workload actually needs sharding—many workloads that scale to a single PostgreSQL instance with read replicas avoid sharding complexity entirely.
Interview Questions
Lamport clocks are single integer counters incremented on each event; they establish partial ordering but cannot determine causality. If L(a) < L(b), we know a happened before b, but not vice versa. Vector clocks are arrays of counters, one per node; they can detect causality (if all entries in VC(a) are ≤ VC(b), then a could have caused b). Vector clocks enable better conflict detection in eventually consistent systems like DynamoDB or Cassandra.
In 2PC, a coordinator asks all participants to "prepare" (vote yes/no). If all vote yes, coordinator sends "commit." If any votes no, coordinator sends "rollback." Problems: (1) The coordinator is a single point of failure—if it crashes after "prepare" but before "commit," participants block indefinitely. (2) Network partitions can leave the system in an ambiguous state. (3) Holding locks during the "prepare" phase blocks other transactions. Alternatives include SAGA pattern (for long transactions) or accepting eventual consistency.
Fencing prevents a node from accessing shared storage after it has been declared "dead" by the cluster manager. Without fencing, a supposedly-dead node might still have I/O queued and corrupt data when it recovers. Fencing techniques include: persistent reservation (SCSI persistent reservations), power fencing (IPMI/PDU), and network fencing (disabling switch ports). STONITH ("Shoot The Other Node In The Head") is the most common fencing strategy in pacemaker/corosync clusters.
ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast) for leader election and consensus. Nodes start as LOOKING state, request votes from peers, and whoever gets a quorum (majority) becomes leader. The leader then broadcasts new epochs; followers sync with the leader before becoming operational. If the leader fails, followers timeout and restart election. This requires a minimum of 3 nodes (for majority quorum) and odd numbers to avoid split-brain.
Synchronous replication waits for all replicas to acknowledge writes before returning success—guaranteeing durability but adding latency equal to the slowest replica's round-trip. Asynchronous replication acknowledges writes immediately after local storage, accepting potential data loss if the primary fails before replication completes—much lower latency but weaker durability guarantees. Semi-synchronous waits for at least one replica, balancing both concerns. MongoDB's majority read concern and PostgreSQL's synchronous commit are examples of these different approaches.
CAP theorem states you can have at most two of three: Consistency (all nodes see the same data), Availability (every request gets a response), and Partition tolerance (system continues despite network partitions). Since partitions will happen, you must choose: CP systems sacrifice availability (may return errors during partitions)—etcd, ZooKeeper. AP systems sacrifice consistency (return stale data during partitions)—Cassandra, DynamoDB. Understanding which trade-off your workload tolerates guides architectural decisions.
CRDT (Conflict-free Replicated Data Type) is a data structure that can be merged automatically without coordination between replicas. Examples include G-Counter (grow-only counter), LWW-Register (last-write-wins), and OR-Set (observed-remove set). Use CRDTs when: (1) multiple nodes can make concurrent updates without coordination, (2) eventual consistency is acceptable, (3) you want to avoid distributed transactions. CRDTs trade semantic richness for automatic convergence—they work for counters and registers but not for complex transactional operations.
Raft handles failures through: (1) Leader election—if a follower doesn't receive heartbeats from the leader within election timeout, it becomes candidate and requests votes. The candidate with quorum becomes leader. (2) Log replication—the leader appends entries to its log and replicates to followers via AppendEntries RPCs. Entries are committed once majority of nodes have persisted them. (3) Membership changes—joint consensus allows adding/removing nodes safely without downtime. If a leader fails, followers timeout and restart election cycle.
Eventual consistency guarantees that if no new updates are made, all replicas will eventually return the same value—reads may return stale data during the convergence window. Strong consistency guarantees that every read returns the most recent write or an error—reads always see the latest committed data. Examples: DynamoDB (eventual), Google Spanner (strong). Trade-offs: eventual consistency offers lower latency and higher availability; strong consistency provides predictable behavior at the cost of higher latency and potential unavailability during partitions.
Consistent hashing maps both data keys and server nodes onto a hash ring (e.g., 0 to 2^32). Each key is stored on the first node clockwise from its hash position. When a node is added/removed, only keys mapped to nearby nodes move—not all keys like traditional hashing. This provides: (1) minimal data movement during scaling, (2) uniform load distribution, (3) simplified cache distribution (CDNs). Variants with virtual nodes improve distribution evenness by mapping each physical node to multiple positions on the ring.
Gossip protocols spread information peer-to-peer with epidemic-style dissemination. Each node periodically picks random peers to exchange state with, and corrupted state converges exponentially fast. Unlike consensus protocols, gossip works async and eventual-consistent, making it resilient to network partitions and node failures. Cassandra uses gossip for membership (each node periodically exchanges cluster state with 3-5 peers), and Consul uses it for service discovery catalog replication. Gossip is eventually consistent but not linearizable.
A basic hash ring maps each physical node to one position on the ring. Problems: (1) Uneven load distribution if nodes have different capacities, (2) Large data movement when adding/removing nodes. Virtual nodes map each physical node to multiple positions (e.g., 100-200 hash values), giving finer-grained distribution. If one physical node has 2x capacity, assign it 2x virtual nodes. When a node fails, its virtual nodes are distributed across multiple successors, and adding a node only requires remapping a fraction of keys. This approach balances load and minimizes reorganization.
MapReduce is a programming model for processing large datasets across distributed nodes. The map phase filters and transforms input data, producing intermediate key/value pairs. The reduce phase merges all values associated with the same key. Under the hood, MapReduce relies on distributed OS concepts: (1) Task scheduling across nodes (like YARN), (2) Data locality awareness to schedule compute near stored data, (3) Fault tolerance through task retry and speculative execution, (4) Network serialization of intermediate results. Hadoop MapReduce demonstrates how distributed frameworks abstract away the complexity of coordinating hundreds of nodes.
DynamoDB uses a synthesis of techniques: (1) Quorum reads/writes — R + W > N for strong consistency, R + W <= N for eventual. Typical config: N=3, R=2, W=2. (2) Vector clocks for causality — tracks which version came from which write, enabling merge during reads. (3) Hinted handoff — if a replica is temporarily down, another accepts writes and later hands them off. (4) Merkle tree synchronization — background reconciliation detects divergence. This design deliberately sacrifices linearizability to achieve availability during partitions — the CAP trade-off made explicit.
Chubby is a distributed lock service providing coarse-grained advisory locks and reliable storage. It uses Paxos consensus for consistency and stores lock ownership and small data files in a replicated BerkeleyDB. Use cases: (1) Leader election for Bigtable, (2) Namespace locking for GFS, (3) Service discovery via DNS-like name lookups, (4) Configuration storage. Chubby's design philosophy: prefer reliability and simplicity over performance — don't use it for frequently updated data. It's been largely replaced by Zookeeper in open-source stacks, but the design influenced etcd and Consul.
A Bloom filter is a probabilistic data structure that answers "is this item in the set?" with possible false positives but no false negatives. It uses k hash functions mapping to a bit array of size m. Adding an item sets k bits; checking queries whether all k bits are set. False positive rate: (1 - e^(-kn/m))^k. In distributed systems: (1) Membership checks — Cassandra uses Bloom filters to check if SSTable might contain a key before accessing disk. (2) Routing optimization — Google uses Bloom filters for approximating set membership in query servers. (3) Cache filtering — avoid caching negative results. Trade-off: memory-efficient but introduces small error rate acceptable for many use cases.
A Merkle tree is a hash tree where each leaf is the hash of a data block, and each internal node is the hash of its children. Comparison: compute root hashes of two replicas, if different, traverse to find the subtrees that differ — minimizing data transfer. Amazon DynamoDB uses Merkle trees for anti-entropy (background reconciliation) between replicas. When synchronizing large datasets, Merkle trees reduce comparison complexity from O(n) to O(log n) storage nodes. Cassandra repairs using Merkle trees via nodetool repair. The tree structure enables efficient identification of divergent ranges without comparing every record.
Pessimistic control assumes conflicts will happen and prevents them by holding locks or serializing access — uses two-phase locking (2PL), where locks are held until transaction commits. Pros: no aborts, predictable latency. Cons: high contention reduces throughput, deadlocks require detection/rollback.
Optimistic control assumes conflicts are rare, allows concurrent execution, and validates at commit time (OCC — optimistic concurrency control). If validation fails (another transaction modified the same data), abort and retry. Pros: high throughput under low contention. Cons: wasted work on aborts, latency from retry. DynamoDB uses optimistic locking via version numbers; Percolator (Google's transaction system) uses timestamp ordering for distributed transactions. Choose based on conflict rates: high contention favors pessimistic; low contention favors optimistic.
Vector clocks grow unboundedly — each node's array expands with every causal event. Truncation strategies: (1) Timestamp-based — drop entries older than a threshold (e.g., 2 weeks). (2) Version-based — keep only last N entries per node. (3) Epoch expiration — partition history by epochs, only keep last epoch. Trade-offs: truncation loses ability to determine causality for older events (may create false causality), but unbounded growth would eventually exhaust storage. Cassandra uses version-based truncation (configurable per table). Some systems switch to hybrid approaches (vector clocks for recent events, hash-based for older) after a threshold.
Eventual consistency: if no new updates, all replicas converge to same value eventually. No ordering guarantee — reads may see values out of chronological order.
Eventual linearizability: each operation appears atomic and takes effect at some point between invocation and response, and all clients see operations in the same total order. It's linearizability applied eventually — operations are eventually consistent but maintain the illusion of sequential consistency when observed. Google Spanner achieves this with globally synchronized commit timestamps via TrueTime (GPS + atomic clocks). This matters for operations like compare-and-swap where the "sequence" of operations affects correctness — eventual linearizability guarantees that if A completes before B starts, B sees A's result.
Further Reading
- Designing Data-Intensive Applications - Martin Kleppmann’s comprehensive guide
- The Raft Consensus Algorithm - Interactive Raft visualization and paper
- Distributed Systems for Practitioners - Practical distributed systems design
- Patterns of Distributed Systems - ETP patterns reference
- Chaos Engineering - Principles for testing distributed system resilience
Conclusion
Distributed operating systems coordinate multiple independent machines to provide unified services, trading single-system simplicity for scale and fault tolerance. The CAP theorem forces explicit trade-offs between consistency and availability—understanding your workload’s tolerance for staleness guides architectural decisions.
Key components include distributed file systems (NFS for general purpose, GFS2 for cluster-coherent storage), RPC frameworks that abstract network communication, and cluster schedulers like Kubernetes and Mesos that provide resource abstraction across machines. Design assuming partial failure—every remote call can fail, network partitions can occur, and clocks can skew.
For continued learning, explore distributed consensus algorithms (Raft, Paxos), eventual consistency patterns (CRDTs, conflict-free data structures), and advanced topics like distributed transactions, shadow leader patterns, and multi-region database architectures.
Category
Related Posts
CPU Affinity & Real-Time Operating Systems
CPU affinity binds processes to specific cores for cache warmth and latency control. RTOS adds deterministic scheduling with bounded latency for industrial, medical, and automotive systems.
Fork & Exec System Calls
fork() duplicates a running process, then exec() replaces it with a new program. Together they power every shell, web server, and daemon on Unix-like systems.
System Calls Interface
System calls are the boundary between user programs and the kernel. They are the mechanism by which user-space applications request services from the operating system — opening files, creating processes, allocating memory, and more. Understanding syscalls reveals how the OS enforces isolation and provides safe access to hardware.