Google Chubby: The Lock Service That Inspired ZooKeeper
Explore Google Chubby's architecture, lock-based coordination, Paxos integration, cell hierarchy, and its influence on distributed systems design.
Google Chubby: The Lock Service That Inspired ZooKeeper
Google Chubby was Google’s internal lock service and coarse-grained file system, described in a 2006 paper that shaped distributed systems design. Chubby itself was not open-sourced, but its design directly inspired Apache ZooKeeper, adopting many Chubby concepts and even API naming.
Chubby solved a core problem: large-scale distributed systems need coordination primitives like locks, leader election, and configuration storage, but building these correctly is notoriously difficult. Chubby provided these as a service, letting developers focus on their problems rather than reinventing coordination code.
Today, Chubby has been largely replaced by other systems within Google, but its influence persists in every ZooKeeper deployment, every etcd cluster, and every system using distributed locks.
Design Philosophy
Chubby was built around a few key observations:
Coarse-grained locks are easier: Acquiring a distributed lock is expensive. Chubby optimized for coarse-grained locks held for seconds or minutes, not fine-grained locks held for milliseconds. The overhead of lock acquisition amortized over longer hold times.
Locks and files are related: A lock can be naturally represented as a file with specific semantics. A file containing the lock holder’s identity serves both as state and as the lock itself.
Event notification matters: Applications need to know when configuration changes, when a lock is released, or when a machine fails. Chubby built event notification into its API.
Cell hierarchy reduces confusion: Rather than a flat namespace, Chubby organized data into cells, each serving a subset of the overall namespace. This isolation simplified deployment and failure boundaries.
graph TD
A[Chubby Cell 1<br/>Bigtable Namespace] --> B[Cell 1 replica 1]
A --> C[Cell 1 replica 2]
A --> D[Cell 1 replica 3]
E[Chubby Cell 2<br/>GFS Namespace] --> F[Cell 2 replica 1]
E --> G[Cell 2 replica 2]
E --> H[Cell 2 replica 3]
Cell Migration and Namespace Sharing
As services grew, a single Chubby cell could become a bottleneck. Chubby supports cell migration: moving a service’s data from one cell to another without service interruption. The process involves:
- Creating a new cell with the target namespace
- Updating the service’s DNS or configuration to point to the new cell
- Allowing in-flight operations to complete on the old cell
- Decommissioning the old cell once all clients have migrated
Namespace sharing across cells was intentionally limited. Each cell originally maintained its own isolated namespace, and there was no built-in cross-cell replication of the namespace itself. Applications that needed cross-cell coordination either used a dedicated shared cell or implemented their own replication layer.
Later versions of Chubby introduced cell federation, allowing a cell to proxy requests to another cell, effectively providing a unified namespace view across multiple cells. This was used by services that needed to span multiple Chubby cells while presenting a single logical namespace to clients.
# Conceptual cell lookup flow
/service/bigtable/leader --> Cell A (handles /ls/google/chubby/cells/bigtable)
/service/gfs/master --> Cell B (handles /ls/google/chubby/cells/gfs)
# Cell A can proxy to Cell B for namespace access
Cell A -->|proxy| Cell B: lookup /ls/google/chubby/cells/gfs/master
Introduction
A Chubby cell consists of 5 replicas running the same software. Replicas use Paxos to agree on operation order, ensuring consistency even when some replicas fail.
One replica is the master at any given time. All writes go through the master, which replicates to others via Paxos. Reads can be served by any replica, which locally cache data for performance.
graph TD
C[Client] -->|Write| M[Master]
C -->|Read| R1[Replica 1]
C -->|Read| R2[Replica 2]
M -->|Replicate| R1
M -->|Replicate| R2
M -->|Replicate| R3[Replica 3]
R1 -.->|Cache| C
R2 -.->|Cache| C
R3 -.->|Cache| C
Paxos gives Chubby linearizable reads and writes, the strongest consistency guarantee. This matters when two processes both think they hold the same lock—that’s the kind of inconsistency Paxos prevents.
Master Election and Failover Latency
When the master fails, Chubby’s failover process follows a predictable timeline:
sequenceDiagram
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3
participant M as Master
participant C as Client
Note over M: Master crashes at T=0
R1->>R2: Lease expired, starting election
R2->>R1: I vote for you
R3->>R1: I vote for you
Note over R1: Quorum (3/5) reached
R1->>R2: I am the new master (epoch+1)
Note over R1: New master at T=~200ms
R1->>C: New master notification
Note over C: Client discovers new master via DNS
Note over C: Typically 1-2 seconds total
| Failover Phase | Typical Duration | Notes |
|---|---|---|
| Lease expiration (master silent) | 4-8 seconds | Configurable per cell, lease duration sets this window |
| Leaseholder detection + election start | 50-200ms | Remaining replicas detect lease expiry via quorum timeout |
| Paxos leader election (propose/accept) | 50-150ms | Single round of Prepare/Accept with 5 replicas |
| Client DNS update propagation | 0-30 seconds | Depends on TTL; clients cache old master up to TTL |
| Handle re-acquisition by clients | 100-500ms | Clients must re-open file handles |
| Total unavailability window | ~5-10 seconds | worst case with default lease and DNS TTL |
The paper reports that Chubby cells typically experience 1-2 seconds of unavailability during a planned master migration (e.g., for maintenance), and 5-10 seconds during an unexpected master failure. The lease duration is the primary tuning knob: shorter leases mean faster failover but more instability from transient network issues.
ZooKeeper’s leader election typically completes in 100-500ms under similar conditions, while etcd finishes within a second. Chubby’s slightly longer failover time is the trade-off for its stronger lease-based split-brain prevention.
The File System Interface
Chubby presents a file system interface (like Unix) rather than a key-value API. Files organize in directories forming a path hierarchy.
/ls/google/chubby/cells/bigtable/master
/ls/google/chubby/cells/gfs/primary
/ls/services/search/config
/ls/services/search/locks/query
Each file has:
- Content: Small amounts of binary data (typically less than 1 MB)
- ACLs: Access control lists for permissions
- Metadata: Modification times, lock holders, etc.
- Locks: Optional advisory locks
# Chubby client operations (conceptual)
open("/ls/services/search/config") # Open a file
setContents("replicas=5, shard=true") # Write data
getContents() # Read data
acquire("my-process-id") # Acquire lock
release("my-process-id") # Release lock
Files can be ephemeral, automatically deleted when the client that created them closes its handle. If a process dies, its ephemeral files disappear, signaling failure to others.
Advisory Locks
Chubby provides advisory locks, not mandatory ones. A process can access a file without acquiring its lock, but coordination requires disciplines to actually use the locks.
# Conceptual Chubby lock acquisition
lock = chubby.lock("/ls/services/search/locks/query")
if lock.acquire(blocking=True, timeout=30):
try:
# Critical section - we hold the lock
config = read_config()
do_query(config)
finally:
lock.release()
else:
# Could not acquire lock
raise Exception("Lock acquisition timed out")
The lock advisory nature means misbehaving clients can ignore locks. In practice, this is rarely a problem because correct clients follow the protocol, and incorrect clients fail in predictable ways that trigger alerts.
Events and Notifications
Clients subscribe to events on files and directories. When things happen, Chubby notifies the client:
- File modified: Content of a watched file changed
- Lock acquired/released: Someone acquired or released a lock
- Child added/removed: A child directory or file was created or deleted
- Master failure: The cell master failed and a new one was elected
- Connection alive/broken: Connection to Chubby cell state changed
# Subscribe to file change events
handle = chubby.openFile("/ls/services/search/config")
handle.subscribe(chubby.FILE_MODIFIED, callback=reload_config)
# Subscribe to lock events
lock = chubby.lock("/ls/services/search/locks/master")
lock.subscribe(chubby.LOCK_ACQUIRED, callback=i_am_master_now)
The event mechanism lets clients react to changes without polling. For large-scale systems, this cuts down on load for both clients and the coordination service.
Chubby and Paxos
Under the hood, Chubby uses Paxos for consensus among replicas. The Chubby paper clarified how Paxos integrates into a production system:
- Database backing store: Chubby uses a simple database (originally BerkeleyDB, later custom) to store state
- Instance numbers: Each Paxos instance is numbered, and Chubby runs many instances in parallel for different operations
- Epoch numbers: Masters are identified by epoch numbers, preventing old masters from acting after a new master is elected
- Lease mechanism: The master holds a lease from replicas, renewing it periodically
sequenceDiagram
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3
participant M as Master
Note over M: Prepare to renew lease
M->>R1: Prepare(N=5)
M->>R2: Prepare(N=5)
M->>R3: Prepare(N=5)
R1-->>M: Promise(N=5,v=4)
R2-->>M: Promise(N=5,v=4)
R3-->>M: Promise(N=5,v=4)
Note over M: Majority promised!
M->>R1: Accept(N=5,v=5)
M->>R2: Accept(N=5,v=5)
M->>R3: Accept(N=5,v=5)
The lease mechanism ensures that if the master fails, replicas wait until the lease expires before accepting a new master. This prevents split-brain where two masters think they are in charge.
Paxos Integration Deep Dive
The original Paxos paper leaves several implementation details unspecified. Chubby’s production deployment fills in these gaps:
Epoch Numbers (Master Identity):
Epoch numbers are monotonically increasing integers that identify a master’s term. When a new master is elected, it increments the epoch number. Any message from an old epoch (lower epoch number) is rejected by replicas. This solves the “old master waking up” problem: if a master crashes and a new one is elected, but the old master recovers and tries to issue commands, replicas reject them because the old master’s epoch is stale.
# Simplified epoch check on replica
def handle_master_message(master_epoch, master_id, operation):
if master_epoch < self.current_epoch:
raise Rejected(f"Stale epoch {master_epoch} < {self.current_epoch}")
if master_epoch == self.current_epoch and master_id != self.master_id:
raise Rejected(f"Master ID mismatch")
# Accept the operation
self.process(operation)
Proposal Numbers (Operation Ordering):
Each Paxos instance (representing one Chubby operation) has a unique proposal number. Proposal numbers are structured as (round_number, replica_id) tuples, compared lexicographically. The higher round number wins, and if round numbers tie, the higher replica ID breaks the tie. This total ordering ensures that even if two proposers compete, only one value can be chosen per instance.
# Proposal number comparison
def compare_proposals(p1: tuple[int, int], p2: tuple[int, int]) -> int:
"""Returns 1 if p1 > p2, -1 if p1 < p2, 0 if equal"""
if p1[0] != p2[0]:
return 1 if p1[0] > p2[0] else -1
return 1 if p1[1] > p2[1] else -1
Chubby runs thousands of Paxos instances in parallel, one per operation. The instance number identifies which operation is being agreed upon, while proposal numbers sequence competing proposals within that instance.
Lease Mechanism in Detail:
The master lease works as a timed promise from replicas to the master. The lease has a fixed duration (typically seconds, configured per cell). The master must renew the lease before it expires or lose mastership. This is different from Raft’s heartbeat approach:
| Aspect | Chubby Master Lease | Raft Leader Heartbeat |
|---|---|---|
| Renewal trigger | Time-based, before expiry | Periodic heartbeat |
| Validity period | Explicit lease interval | Election timeout margin |
| **Stale leader detection | Lease expiry timeout | Heartbeat timeout |
| Replica promise | ”I will not accept others" | "I will not vote for others” |
During normal operation, the master renews its lease well before expiry. If the master fails to renew (crash, network partition), replicas wait until the lease expires before accepting a new master. This grace period is the “lease timeout” and typically ranges from hundreds of milliseconds to a few seconds.
# Conceptual lease renewal timeline
T=0s Master acquires lease (lease_duration=5s)
T=3s Master renews lease (at 60% of duration)
T=6s Master renews lease again
T=11s Master crashes, fails to renew
T=16s Lease expires (master + lease_duration)
T=16s+ Replicas accept new master election
Database Backing Store:
Chubby originally used BerkeleyDB as its backing store, later replacing it with a custom database. The database stores the Paxos log and application state. Each write goes to the WAL (write-ahead log) before being applied, enabling crash recovery. The database also stores the cell’s configuration, lock states, and file metadata.
Use Cases at Google
Within Google, Chubby handled several critical coordination tasks:
Leader Election: For Bigtable and GFS, one server needed to be primary. Chubby’s lock mechanism enabled reliable leader election without complex consensus code in each system.
Configuration Storage: Service configurations, feature flags, and routing tables lived in Chubby. Event notifications let services reload without restart.
Naming Service: Before DNS was sufficient, Chubby served as a naming service. Machine names, service endpoints, and discovery information lived in Chubby.
Distributed Locks: Semaphores and barriers for distributed batch jobs used Chubby locks to coordinate across thousands of machines.
Original Bigtable Dependency Context
Chubby and Bigtable have a deep historical relationship that shaped Google’s early infrastructure. Bigtable was designed to be a distributed storage system for structured data, capable of scaling to petabytes across thousands of machines. But Bigtable’s original design assumed the existence of a coordination service: tablets needed to know who the master was, where to find their tablet server, and how to recover from failures.
Chubby was built, in part, to solve this coordination problem for Bigtable. The original Bigtable paper (2006) describes how it uses Chubby for:
- Master election: The Bigtable master uses a Chubby exclusive lock to ensure only one master exists at a time
- Tablet location: Bigtable stores tablet location metadata in Chubby files, allowing clients to find the right tablet server
- Server lease management: Tablet servers maintain Chubby locks; if a tablet server loses its lock (crashed or network partition), the master knows to reassign its tablets
# Bigtable/Chubby interaction example
/chubby/cells/bigtable/master # Bigtable master election lock
/chubby/cells/bigtable/tablets/ # Directory containing tablet location data
/chubby/cells/bigtable/tserver/ # Ephemeral files for live tablet servers
This tight coupling meant that Bigtable could not function if its Chubby cell was unavailable. In practice, Google ran multiple independent Chubby cells, and each Bigtable cluster was assigned to a specific cell. The Bigtable master would retry Chubby operations with exponential backoff, and would stop serving if the Chubby cell was unreachable for an extended period.
The dependency was unidirectional: Bigtable depended on Chubby, but Chubby did not depend on Bigtable. Chubby’s backing store (originally BerkeleyDB) was independent of Bigtable. This asymmetry is important: if Bigtable had its own consensus layer, it would have been circular, which would complicate recovery.
Over time, both systems evolved. Bigtable’s later incarnation (2012+) as Cloud Bigtable changed some of these assumptions, and Google’s internal infrastructure replaced Chubby with more specialized systems. The original design illustrates how coordination services tend to be foundational: they are typically designed early and replaced late.
The ZooKeeper Connection
The ZooKeeper paper explicitly calls out Chubby as inspiration. The project borrowed several core concepts:
- The hierarchical file system namespace (though ZooKeeper uses znodes)
- Ephemeral nodes for service discovery
- Watches for event notification
- The general lock abstraction for coordination
However, ZooKeeper made significant departures:
- Watches are one-time in ZooKeeper vs. persistent in Chubby: ZooKeeper simplified by making watches fire once and require re-registration
- ZooKeeper is a coordination service, not a lock service: ZooKeeper provides primitives that can be combined into locks, rather than locking as a primary feature
- ZooKeeper separates consensus from caching: Clients can cache data while still getting notifications
Chubby vs ZooKeeper vs etcd: Direct Comparison
| Dimension | Chubby | ZooKeeper | etcd |
|---|---|---|---|
| Consensus Protocol | Paxos (internal) | Zab (custom Paxos variant) | Raft (from paper) |
| API Style | File system interface | Hierarchical znodes | Flat key-value with directories |
| Language Origin | C++ | Java | Go |
| Client Libraries | C++, Java | Java, C, Python, Go | Go, Java, Python, Ruby |
| Watches | Persistent (stay registered) | One-time (must re-register) | One-time (must re-register) |
| Lock Support | Advisory locks as first-class | Via recipes (curator) | Via transactions |
| ACLs | UNIX-like permissions | ACL per znode | RBAC with roles |
| Scalability | Cell-based sharding | Single ensemble | Single cluster |
| Performance | ~100k requests/sec per cell | ~10k-100k requests/sec | ~10k+ requests/sec |
| Leader Leases | Yes (prevents split-brain) | No (Zab handles it) | Yes (Raft heartbeat) |
| Observers | No | Yes (read scaling) | No (but Raft learner mode) |
| Auth Methods | SASL | SASL/Kerberos | mTLS, username/password |
| Typical Use | Internal Google (historical) | Hadoop/Kafka coordination | Kubernetes config |
Key takeaways:
- Chubby was pioneering but closed-source and largely replaced inside Google
- ZooKeeper is the direct Chubby descendant, battle-tested in big data ecosystems
- etcd offers better performance and a cleaner API, preferred for new deployments
graph TD
A[Chubby 2006] --> B[ZooKeeper 2008]
A --> C[etcd 2013]
B --> D[Kafka, HBase, Drill]
C --> E[Kubernetes, CoreOS]
Limitations and Evolution
Chubby has known limitations that shaped its evolution:
Single-cell scalability: Early cells became bottlenecks, requiring sharding into multiple cells.
Event delivery delays: In rare cases, event delivery could be delayed. Applications needed their own timeouts and fallback mechanisms.
Not for high-frequency changes: Chubby was designed for configuration and coordination, not high-frequency data. Clients were expected to cache and poll for high-frequency updates.
Replaced by other systems: Within Google, Chubby has been largely replaced by more scalable systems. Spanner handles distributed transactions, and other specialized systems handle service discovery.
Trade-off Analysis
Chubby’s design embraces specific trade-offs that system designers must understand:
| Design Choice | Advantage | Disadvantage |
|---|---|---|
| Coarse-grained locks | Low coordination overhead; scalable for global locks | Cannot optimize per-record or fine-grained locking |
| Advisory locks | Flexible; no centralized enforcement bottleneck | Requires client discipline; misbehaving clients can ignore locks |
| Paxos with 5 replicas | Tolerates 2 simultaneous failures; strong consistency | Higher latency than quorum-less reads; more resource usage |
| Persistent event notifications | Simple subscription model for long-lived clients | Can accumulate stale watchers; memory pressure on server |
| File system interface | Intuitive hierarchy for naming and organization | Less efficient for flat key-value access patterns |
| Cell-based namespace | Failure isolation; independent scaling per cell | Cross-cell coordination requires federation/proxying |
| Master lease mechanism | Prevents split-brain during partitions effectively | Lease duration tuning is critical; too short causes instability |
| BerkeleyDB/custom backing store | Crash recovery via WAL; proven database technology | Additional dependency; schema changes require cell restart |
| Cell migration support | Allows load balancing; cell replacement without downtime | Operationally complex; requires DNS reconfiguration |
Key insight: Chubby prioritized correctness and simplicity over raw speed. The trade-offs favor reliability and ease of use for coordination tasks—you get strong consistency guarantees in exchange for somewhat higher latency.
When to Use / When Not to Use
When to Use Chubby
Chubby was designed for coarse-grained distributed coordination at Google. If you need a similar system today, there are a few scenarios where it remains the right choice. Chubby’s advisory locks and ephemeral nodes give you a solid abstraction for lock acquisition and leader election among distributed processes, making it straightforward to pick a leader without building custom consensus logic. The hierarchical namespace also maps naturally to naming and service discovery use cases, where you want to organize services and endpoints in a familiar directory structure. Finally, if you have small, rarely-changed configuration that all participants need to agree on, Chubby’s event notifications mean services can reload without restart when configuration changes.
When Not to Use Chubby
Chubby is the wrong tool in several situations. High-frequency data access is one of them, since Chubby cells handle low throughput and cannot accommodate anything more than occasional reads and writes without becoming a bottleneck. Fine-grained coordination is another case where Chubby falls short, because its design centers on coarse-grained locks where one master oversees an entire system. If you need cross-cell namespace sharing, you will hit Chubby’s limitation where each cell’s namespace is isolated from others. For modern deployments outside Google, Chubby is not actively maintained, so ZooKeeper, etcd, or Consul are better choices with active community support.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Master lease expiration | Master loses mastership; all clients must re-acquire handles; brief unavailability | Chubby handles lease renewal automatically; if master fails, clients discover new master via DNS |
| Cell overload | Chubby cell becomes a bottleneck under high request load | Distribute coordination across multiple Chubby cells; move non-critical clients to different cells |
| Event delivery delays | In rare cases, lock acquisition events are delayed; applications holding stale locks may cause conflicts | Applications should implement their own timeouts and re-check lock state before taking action |
| DNS cache staleness | When master moves, DNS entries take time to propagate; clients may contact old master briefly | Use short DNS TTLs; clients should handle OperationTimeout gracefully and retry discovery |
| File handle exhaustion | Too many clients holding handles to the same file saturates handle table | Monitor handle count; release handles promptly; design for brief handle lifetime |
| Incompatible schema changes | DDL changes to the Chubby database itself require cell restart | Coordinate schema changes during maintenance windows; test on non-production cells first |
Master Lease Expiry: Split-Brain and Recovery
The most critical failure scenario is master lease expiry leading to a potential split-brain situation. Understanding the exact sequence matters for operating Chubby cells safely.
What is split-brain in this context?
Split-brain occurs when two replicas believe they are the master simultaneously. Chubby’s lease mechanism prevents true split-brain during the lease validity window, but a narrow window opens during the lease expiry transition.
Failure sequence:
sequenceDiagram
participant M as Master M1
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3
participant M2 as New Master M2
Note over M: Master M1 lease expires at T=0
Note over M: M1 is network-partitioned
R1->>R2: Lease expired, no response from M1
R2->>R1: Agreed, starting election
R1->>R2: I vote for R1
R2->>R1: Confirmed
Note over R1: R1 elected new master M2 (epoch+1)
Note over M: M1 still thinks it is master!
Note over M: M1 serving writes locally during partition
Note over M: But cannot replicate to quorum
Note over M2: M2 holds valid lease, accepts writes
Note over R1,R2,R3: Replicas ignore M1 (stale epoch)
Note over M: At T=0, M1 lost quorum-backed lease
Note over M: Clients retry, discover M2 via DNS
Key protection mechanism:
The epoch number is the critical protection. When M2 is elected, it increments the epoch. Even if M1 (the partitioned master) recovers and tries to issue commands, replicas check the epoch and reject M1’s messages because (M1_epoch < M2_epoch). M1’s writes during the partition window are lost from the perspective of the quorum, but they never contaminated the committed Paxos log.
Recovery procedure:
- M1 detects its lease has expired (no renewal acknowledgment from quorum)
- M1 stops accepting new writes locally
- M1 initiates a rejoin procedure: it fetches the latest Paxos state from the quorum
- M1 discovers it is no longer master (higher epoch exists)
- M1 transitions to a replica and syncs its state to the latest Paxos log entry
- M1 resumes normal operation as a replica
# Simplified master recovery pseudocode
def on_lease_expired():
stop_accepting_writes()
current_epoch = fetch_current_epoch_from_quorum()
if current_epoch > my_epoch:
my_epoch = current_epoch
my_role = REPLICA
sync_state_from_master()
else:
# We might still be master, retry election
start_leader_election()
What happens to M1’s writes during partition?
M1 could have accepted writes during the partition that never reached the quorum. These writes are effectively “tentative” and are discarded during recovery. Clients that read from M1 during the partition might have seen uncommitted state. For this reason, critical operations should use the WAIT_ANY flag to wait for acknowledgment from the quorum, not just the master.
Worst-case timeline:
| Phase | Duration | Details |
|---|---|---|
| Partition begins, M1 still master | 0s | M1 thinks it is master, serves writes locally |
| Lease expires (M1 cannot renew) | 4-8s | M1 cannot get quorum for renewal |
| Replicas detect lease expiry | ~100ms | Election timeout, start new election |
| New master M2 elected | 50-150ms | M2 wins with higher epoch |
| DNS propagation to clients | 0-30s | TTL on DNS entries for Chubby master location |
| All clients redirected | 5-40s | Total worst-case unavailability window |
Practical mitigations for operators:
# Monitor master lease health
# Alert if master has not renewed lease within 50% of lease_duration
alert: ChubbyMasterLeaseRenewalWarning
expr: time() - chubby_last_lease_renewal > (lease_duration * 0.5)
severity: warning
# Use short DNS TTLs for master location
# Recommended: TTL = 5-10 seconds for production cells
# Set lease_duration appropriately
# Short: faster failover (2-4s) but more sensitivity to transient network issues
# Long: more stability (8-12s) but longer failover window
Quick Recap Checklist
Before diving deeper or preparing for an interview, verify your understanding of these key Chubby concepts:
- Chubby uses Paxos consensus among 5 replicas, with one master handling all writes
- Master lease mechanism prevents split-brain scenarios during network partitions
- Reads can be served by any replica with local caching for performance
- Epoch numbers prevent stale master commands from being accepted after failover
- Ephemeral files automatically disappear when the creating client disconnects
- Advisory locks require client discipline; Chubby does not enforce lock usage
- Event notifications are persistent (stay registered until explicitly removed)
- Cell hierarchy organizes the namespace into isolated failure domains
- Chubby’s design directly inspired ZooKeeper’s znodes, watches, and ephemeral nodes
- ZooKeeper simplified Chubby by making watches one-time (require re-registration)
Interview Questions
Expected answer points:
- Distributed lock acquisition is expensive (network round-trips, consensus overhead)
- Coarse-grained locks held for seconds or minutes amortize this overhead effectively
- Fine-grained locks held for milliseconds would make the coordination service a bottleneck
- Chubby was designed for coordination among relatively few processes, not per-record locking
Expected answer points:
- Master holds a lease from replicas, renewed periodically before expiry
- If master fails to renew (crash or partition), replicas wait until lease expires before accepting a new master
- Epoch numbers ensure old masters cannot issue commands after a new master is elected
- During partition, the old master cannot commit writes because it lacks quorum-backed lease
Expected answer points:
- Epoch numbers are monotonically increasing integers identifying a master's term
- When a new master is elected, the epoch number increments
- Any message from an old epoch (lower epoch number) is rejected by replicas
- This solves the "old master waking up" problem after network partition recovery
Expected answer points:
- Ephemeral files are automatically deleted when the client that created them closes its handle
- If a process dies, its ephemeral files disappear automatically
- This signals failure to other clients without requiring explicit cleanup
- Used for service discovery: a tablet server creates an ephemeral file; when it dies, the file disappears and the master knows to reassign its tablets
Expected answer points:
- Chubby events are persistent: they stay registered until explicitly unsubscribed
- ZooKeeper watches are one-time: they fire once and require re-registration
- Chubby's approach is simpler for long-lived subscriptions but can accumulate stale watchers
- ZooKeeper's approach requires more client-side bookkeeping but scales better
Expected answer points:
- Chubby presents a hierarchical Unix-like namespace with directories and files
- Each file has content, ACLs, metadata, and optional advisory locks
- Paths like /ls/google/chubby/cells/bigtable/master mirror familiar file paths
- Key-value stores typically use flat namespaces; Chubby's hierarchy organizes data naturally for naming and service discovery
Expected answer points:
- Paxos requires a majority quorum to operate (3 out of 5 can acknowledge writes)
- Five replicas tolerate two simultaneous failures while maintaining quorum
- Three replicas would only tolerate one failure (majority is 2)
- Seven replicas would be more resilient but add more latency and resource overhead
Expected answer points:
- During partition, the old master may have accepted writes locally that never reached the quorum
- These writes are "tentative" and are discarded during recovery
- Clients reading from the partitioned master during the outage might have seen uncommitted state
- Critical operations should use WAIT_ANY flag to wait for quorum acknowledgment, not just master acceptance
Expected answer points:
- Bigtable master uses a Chubby exclusive lock for master election (only one master at a time)
- Tablet location metadata is stored in Chubby files so clients can find the right tablet server
- Tablet servers maintain Chubby locks; losing a lock signals the master to reassign tablets
- The dependency was unidirectional: Bigtable depended on Chubby, but Chubby did not depend on Bigtable
Expected answer points:
- Chubby: time-based renewal before expiry; explicit lease interval with duration
- Raft: periodic heartbeat messages; election timeout margin determines validity
- Chubby replica promises "I will not accept others" during lease validity
- Raft follower promises "I will not vote for others" during election timeout
- Both achieve similar goals (preventing split-brain) through different mechanisms
Expected answer points:
- Originally used BerkeleyDB as the backing store for Paxos log and application state
- Later replaced with a custom database built by Google
- The database stores the Paxos log, cell configuration, lock states, and file metadata
- Each write goes to the write-ahead log (WAL) before being applied, enabling crash recovery
Expected answer points:
- Each Paxos instance has a unique proposal number structured as (round_number, replica_id)
- Comparison is lexicographic: higher round number wins
- If round numbers tie, higher replica ID breaks the tie
- This total ordering ensures only one value can be chosen per instance, even with competing proposers
Expected answer points:
- Persistent watchers can accumulate and create memory pressure on the server
- One-time watches simplify server-side state management
- Clients must re-register after receiving an event, which is explicit and auditable
- The trade-off is more bookkeeping on the client side, but ZooKeeper prioritizes server scalability
Expected answer points:
- Later versions of Chubby introduced cell federation to provide a unified namespace across multiple cells
- A cell can proxy requests to another cell, effectively bridging namespace views
- This allowed services spanning multiple Chubby cells to present a single logical namespace to clients
- Used by services that outgrew a single cell's capacity but needed coherent namespace access
Expected answer points:
- Lease expiration window: 4-8 seconds (configurable, sets detection delay)
- Election and Paxos leader election: 50-200ms
- Client DNS update propagation: 0-30 seconds (depends on TTL setting)
- Total typical unavailability: 1-2 seconds for planned migrations, 5-10 seconds for unexpected failures
- ZooKeeper typically completes election in 100-500ms, etcd within 1 second
Expected answer points:
- Each cell serves a subset of the overall namespace with its own replica set
- Failure of one cell does not directly impact others serving different namespaces
- This isolation simplifies deployment boundaries and failure domains
- Services can be assigned to cells based on criticality or load characteristics
Expected answer points:
- Single-cell scalability: early cells became bottlenecks under high request load
- Not designed for high-frequency data changes; clients expected to cache and poll
- Event delivery could occasionally be delayed, requiring application-level timeouts
- More specialized systems like Spanner handle distributed transactions more efficiently
- Cell migration was possible but operationally complex
Expected answer points:
- The original Paxos paper left several implementation details unspecified
- Chubby filled gaps: epoch numbers, proposal numbering, lease mechanisms, database backing store
- Chubby runs thousands of Paxos instances in parallel (one per operation)
- Instance numbers identify which operation is being agreed upon, while proposal numbers sequence competing proposals
Expected answer points:
- Mandatory locks would require Chubby to intercept and block all file access attempts
- This would create a centralized bottleneck defeating Chubby's distributed design
- Advisory locks are a convention: correct clients use them, incorrect clients fail predictably
- In practice, well-behaved clients follow the protocol, and misbehaving ones trigger alerts
Expected answer points:
- ZooKeeper directly adopted Chubby's concepts: hierarchical namespace, ephemeral nodes, watches, lock abstraction
- etcd (ZooKeeper's spiritual successor) uses the same Raft consensus algorithm approach
- Consul combines Chubby-style coordination with service discovery
- Kubernetes uses etcd for configuration and leader election, descendants of Chubby's design
- The lesson that "coordination is a common problem worth solving once and well" remains relevant
Further Reading
- Distributed Transactions explains the Paxos consensus algorithm Chubby uses
- etcd Distributed Storage covers ZooKeeper’s spiritual successor
- Apache ZooKeeper describes ZooKeeper’s direct adoption of Chubby patterns
- Consistency Models covers linearizability guarantees that Paxos provides
Conclusion
Chubby was a pioneering coordination service demonstrating how to build production-grade distributed consensus:
- Production Paxos: Showed how Paxos could work efficiently for real-world use
- Coarse-grained locking: Provided locks as a service, amortizing coordination overhead
- Event notifications: Built reactive patterns into the coordination service itself
- Hierarchical namespace: Made data organization intuitive, like familiar file systems
The 2006 Chubby paper is essential reading for distributed systems practitioners. It explains not just how Chubby works, but why decisions were made, and what trade-offs were accepted.
Chubby itself is now largely historical, but its influence persists. Every time you use ZooKeeper to elect a leader, etcd to store configuration, or Consul for service discovery, you are using systems shaped by Chubby’s design.
The lesson: coordination is a common problem worth solving once and well. Getting consensus right is hard enough without every team inventing their own poorly-tested version. Chubby proved that a well-designed coordination service could be general enough and efficient enough to serve as infrastructure for large-scale distributed systems.
Category
Related Posts
Apache ZooKeeper: Consensus and Coordination
Explore ZooKeeper's Zab consensus protocol, hierarchical znodes, watches, leader election, and practical use cases for distributed coordination.
etcd: Distributed Key-Value Store for Configuration
Deep dive into etcd architecture using Raft consensus, watches for reactive configuration, leader election patterns, and Kubernetes integration.
Google Spanner: Globally Distributed SQL at Scale
Google Spanner architecture combining relational model with horizontal scalability, TrueTime API for global consistency, and F1 database implementation.