Google Chubby: The Lock Service That Inspired ZooKeeper

Explore Google Chubby's architecture, lock-based coordination, Paxos integration, cell hierarchy, and its influence on distributed systems design.

published: reading time: 31 min read author: GeekWorkBench

Google Chubby: The Lock Service That Inspired ZooKeeper

Google Chubby was Google’s internal lock service and coarse-grained file system, described in a 2006 paper that shaped distributed systems design. Chubby itself was not open-sourced, but its design directly inspired Apache ZooKeeper, adopting many Chubby concepts and even API naming.

Chubby solved a core problem: large-scale distributed systems need coordination primitives like locks, leader election, and configuration storage, but building these correctly is notoriously difficult. Chubby provided these as a service, letting developers focus on their problems rather than reinventing coordination code.

Today, Chubby has been largely replaced by other systems within Google, but its influence persists in every ZooKeeper deployment, every etcd cluster, and every system using distributed locks.


Design Philosophy

Chubby was built around a few key observations:

Coarse-grained locks are easier: Acquiring a distributed lock is expensive. Chubby optimized for coarse-grained locks held for seconds or minutes, not fine-grained locks held for milliseconds. The overhead of lock acquisition amortized over longer hold times.

Locks and files are related: A lock can be naturally represented as a file with specific semantics. A file containing the lock holder’s identity serves both as state and as the lock itself.

Event notification matters: Applications need to know when configuration changes, when a lock is released, or when a machine fails. Chubby built event notification into its API.

Cell hierarchy reduces confusion: Rather than a flat namespace, Chubby organized data into cells, each serving a subset of the overall namespace. This isolation simplified deployment and failure boundaries.

graph TD
    A[Chubby Cell 1<br/>Bigtable Namespace] --> B[Cell 1 replica 1]
    A --> C[Cell 1 replica 2]
    A --> D[Cell 1 replica 3]

    E[Chubby Cell 2<br/>GFS Namespace] --> F[Cell 2 replica 1]
    E --> G[Cell 2 replica 2]
    E --> H[Cell 2 replica 3]

Cell Migration and Namespace Sharing

As services grew, a single Chubby cell could become a bottleneck. Chubby supports cell migration: moving a service’s data from one cell to another without service interruption. The process involves:

  1. Creating a new cell with the target namespace
  2. Updating the service’s DNS or configuration to point to the new cell
  3. Allowing in-flight operations to complete on the old cell
  4. Decommissioning the old cell once all clients have migrated

Namespace sharing across cells was intentionally limited. Each cell originally maintained its own isolated namespace, and there was no built-in cross-cell replication of the namespace itself. Applications that needed cross-cell coordination either used a dedicated shared cell or implemented their own replication layer.

Later versions of Chubby introduced cell federation, allowing a cell to proxy requests to another cell, effectively providing a unified namespace view across multiple cells. This was used by services that needed to span multiple Chubby cells while presenting a single logical namespace to clients.

# Conceptual cell lookup flow
/service/bigtable/leader  -->  Cell A (handles /ls/google/chubby/cells/bigtable)
/service/gfs/master       -->  Cell B (handles /ls/google/chubby/cells/gfs)

# Cell A can proxy to Cell B for namespace access
Cell A -->|proxy| Cell B: lookup /ls/google/chubby/cells/gfs/master

Introduction

A Chubby cell consists of 5 replicas running the same software. Replicas use Paxos to agree on operation order, ensuring consistency even when some replicas fail.

One replica is the master at any given time. All writes go through the master, which replicates to others via Paxos. Reads can be served by any replica, which locally cache data for performance.

graph TD
    C[Client] -->|Write| M[Master]
    C -->|Read| R1[Replica 1]
    C -->|Read| R2[Replica 2]

    M -->|Replicate| R1
    M -->|Replicate| R2
    M -->|Replicate| R3[Replica 3]

    R1 -.->|Cache| C
    R2 -.->|Cache| C
    R3 -.->|Cache| C

Paxos gives Chubby linearizable reads and writes, the strongest consistency guarantee. This matters when two processes both think they hold the same lock—that’s the kind of inconsistency Paxos prevents.

Master Election and Failover Latency

When the master fails, Chubby’s failover process follows a predictable timeline:

sequenceDiagram
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant M as Master
    participant C as Client

    Note over M: Master crashes at T=0
    R1->>R2: Lease expired, starting election
    R2->>R1: I vote for you
    R3->>R1: I vote for you
    Note over R1: Quorum (3/5) reached
    R1->>R2: I am the new master (epoch+1)
    Note over R1: New master at T=~200ms
    R1->>C: New master notification
    Note over C: Client discovers new master via DNS
    Note over C: Typically 1-2 seconds total
Failover PhaseTypical DurationNotes
Lease expiration (master silent)4-8 secondsConfigurable per cell, lease duration sets this window
Leaseholder detection + election start50-200msRemaining replicas detect lease expiry via quorum timeout
Paxos leader election (propose/accept)50-150msSingle round of Prepare/Accept with 5 replicas
Client DNS update propagation0-30 secondsDepends on TTL; clients cache old master up to TTL
Handle re-acquisition by clients100-500msClients must re-open file handles
Total unavailability window~5-10 secondsworst case with default lease and DNS TTL

The paper reports that Chubby cells typically experience 1-2 seconds of unavailability during a planned master migration (e.g., for maintenance), and 5-10 seconds during an unexpected master failure. The lease duration is the primary tuning knob: shorter leases mean faster failover but more instability from transient network issues.

ZooKeeper’s leader election typically completes in 100-500ms under similar conditions, while etcd finishes within a second. Chubby’s slightly longer failover time is the trade-off for its stronger lease-based split-brain prevention.


The File System Interface

Chubby presents a file system interface (like Unix) rather than a key-value API. Files organize in directories forming a path hierarchy.

/ls/google/chubby/cells/bigtable/master
/ls/google/chubby/cells/gfs/primary
/ls/services/search/config
/ls/services/search/locks/query

Each file has:

  • Content: Small amounts of binary data (typically less than 1 MB)
  • ACLs: Access control lists for permissions
  • Metadata: Modification times, lock holders, etc.
  • Locks: Optional advisory locks
# Chubby client operations (conceptual)
open("/ls/services/search/config")      # Open a file
setContents("replicas=5, shard=true")    # Write data
getContents()                           # Read data
acquire("my-process-id")                # Acquire lock
release("my-process-id")                # Release lock

Files can be ephemeral, automatically deleted when the client that created them closes its handle. If a process dies, its ephemeral files disappear, signaling failure to others.


Advisory Locks

Chubby provides advisory locks, not mandatory ones. A process can access a file without acquiring its lock, but coordination requires disciplines to actually use the locks.

# Conceptual Chubby lock acquisition
lock = chubby.lock("/ls/services/search/locks/query")

if lock.acquire(blocking=True, timeout=30):
    try:
        # Critical section - we hold the lock
        config = read_config()
        do_query(config)
    finally:
        lock.release()
else:
    # Could not acquire lock
    raise Exception("Lock acquisition timed out")

The lock advisory nature means misbehaving clients can ignore locks. In practice, this is rarely a problem because correct clients follow the protocol, and incorrect clients fail in predictable ways that trigger alerts.


Events and Notifications

Clients subscribe to events on files and directories. When things happen, Chubby notifies the client:

  • File modified: Content of a watched file changed
  • Lock acquired/released: Someone acquired or released a lock
  • Child added/removed: A child directory or file was created or deleted
  • Master failure: The cell master failed and a new one was elected
  • Connection alive/broken: Connection to Chubby cell state changed
# Subscribe to file change events
handle = chubby.openFile("/ls/services/search/config")
handle.subscribe(chubby.FILE_MODIFIED, callback=reload_config)

# Subscribe to lock events
lock = chubby.lock("/ls/services/search/locks/master")
lock.subscribe(chubby.LOCK_ACQUIRED, callback=i_am_master_now)

The event mechanism lets clients react to changes without polling. For large-scale systems, this cuts down on load for both clients and the coordination service.


Chubby and Paxos

Under the hood, Chubby uses Paxos for consensus among replicas. The Chubby paper clarified how Paxos integrates into a production system:

  • Database backing store: Chubby uses a simple database (originally BerkeleyDB, later custom) to store state
  • Instance numbers: Each Paxos instance is numbered, and Chubby runs many instances in parallel for different operations
  • Epoch numbers: Masters are identified by epoch numbers, preventing old masters from acting after a new master is elected
  • Lease mechanism: The master holds a lease from replicas, renewing it periodically
sequenceDiagram
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant M as Master

    Note over M: Prepare to renew lease
    M->>R1: Prepare(N=5)
    M->>R2: Prepare(N=5)
    M->>R3: Prepare(N=5)
    R1-->>M: Promise(N=5,v=4)
    R2-->>M: Promise(N=5,v=4)
    R3-->>M: Promise(N=5,v=4)
    Note over M: Majority promised!
    M->>R1: Accept(N=5,v=5)
    M->>R2: Accept(N=5,v=5)
    M->>R3: Accept(N=5,v=5)

The lease mechanism ensures that if the master fails, replicas wait until the lease expires before accepting a new master. This prevents split-brain where two masters think they are in charge.

Paxos Integration Deep Dive

The original Paxos paper leaves several implementation details unspecified. Chubby’s production deployment fills in these gaps:

Epoch Numbers (Master Identity):

Epoch numbers are monotonically increasing integers that identify a master’s term. When a new master is elected, it increments the epoch number. Any message from an old epoch (lower epoch number) is rejected by replicas. This solves the “old master waking up” problem: if a master crashes and a new one is elected, but the old master recovers and tries to issue commands, replicas reject them because the old master’s epoch is stale.

# Simplified epoch check on replica
def handle_master_message(master_epoch, master_id, operation):
    if master_epoch < self.current_epoch:
        raise Rejected(f"Stale epoch {master_epoch} < {self.current_epoch}")
    if master_epoch == self.current_epoch and master_id != self.master_id:
        raise Rejected(f"Master ID mismatch")
    # Accept the operation
    self.process(operation)

Proposal Numbers (Operation Ordering):

Each Paxos instance (representing one Chubby operation) has a unique proposal number. Proposal numbers are structured as (round_number, replica_id) tuples, compared lexicographically. The higher round number wins, and if round numbers tie, the higher replica ID breaks the tie. This total ordering ensures that even if two proposers compete, only one value can be chosen per instance.

# Proposal number comparison
def compare_proposals(p1: tuple[int, int], p2: tuple[int, int]) -> int:
    """Returns 1 if p1 > p2, -1 if p1 < p2, 0 if equal"""
    if p1[0] != p2[0]:
        return 1 if p1[0] > p2[0] else -1
    return 1 if p1[1] > p2[1] else -1

Chubby runs thousands of Paxos instances in parallel, one per operation. The instance number identifies which operation is being agreed upon, while proposal numbers sequence competing proposals within that instance.

Lease Mechanism in Detail:

The master lease works as a timed promise from replicas to the master. The lease has a fixed duration (typically seconds, configured per cell). The master must renew the lease before it expires or lose mastership. This is different from Raft’s heartbeat approach:

AspectChubby Master LeaseRaft Leader Heartbeat
Renewal triggerTime-based, before expiryPeriodic heartbeat
Validity periodExplicit lease intervalElection timeout margin
**Stale leader detectionLease expiry timeoutHeartbeat timeout
Replica promise”I will not accept others""I will not vote for others”

During normal operation, the master renews its lease well before expiry. If the master fails to renew (crash, network partition), replicas wait until the lease expires before accepting a new master. This grace period is the “lease timeout” and typically ranges from hundreds of milliseconds to a few seconds.

# Conceptual lease renewal timeline
T=0s      Master acquires lease (lease_duration=5s)
T=3s      Master renews lease (at 60% of duration)
T=6s      Master renews lease again
T=11s     Master crashes, fails to renew
T=16s     Lease expires (master + lease_duration)
T=16s+    Replicas accept new master election

Database Backing Store:

Chubby originally used BerkeleyDB as its backing store, later replacing it with a custom database. The database stores the Paxos log and application state. Each write goes to the WAL (write-ahead log) before being applied, enabling crash recovery. The database also stores the cell’s configuration, lock states, and file metadata.


Use Cases at Google

Within Google, Chubby handled several critical coordination tasks:

Leader Election: For Bigtable and GFS, one server needed to be primary. Chubby’s lock mechanism enabled reliable leader election without complex consensus code in each system.

Configuration Storage: Service configurations, feature flags, and routing tables lived in Chubby. Event notifications let services reload without restart.

Naming Service: Before DNS was sufficient, Chubby served as a naming service. Machine names, service endpoints, and discovery information lived in Chubby.

Distributed Locks: Semaphores and barriers for distributed batch jobs used Chubby locks to coordinate across thousands of machines.

Original Bigtable Dependency Context

Chubby and Bigtable have a deep historical relationship that shaped Google’s early infrastructure. Bigtable was designed to be a distributed storage system for structured data, capable of scaling to petabytes across thousands of machines. But Bigtable’s original design assumed the existence of a coordination service: tablets needed to know who the master was, where to find their tablet server, and how to recover from failures.

Chubby was built, in part, to solve this coordination problem for Bigtable. The original Bigtable paper (2006) describes how it uses Chubby for:

  • Master election: The Bigtable master uses a Chubby exclusive lock to ensure only one master exists at a time
  • Tablet location: Bigtable stores tablet location metadata in Chubby files, allowing clients to find the right tablet server
  • Server lease management: Tablet servers maintain Chubby locks; if a tablet server loses its lock (crashed or network partition), the master knows to reassign its tablets
# Bigtable/Chubby interaction example
/chubby/cells/bigtable/master     # Bigtable master election lock
/chubby/cells/bigtable/tablets/  # Directory containing tablet location data
/chubby/cells/bigtable/tserver/  # Ephemeral files for live tablet servers

This tight coupling meant that Bigtable could not function if its Chubby cell was unavailable. In practice, Google ran multiple independent Chubby cells, and each Bigtable cluster was assigned to a specific cell. The Bigtable master would retry Chubby operations with exponential backoff, and would stop serving if the Chubby cell was unreachable for an extended period.

The dependency was unidirectional: Bigtable depended on Chubby, but Chubby did not depend on Bigtable. Chubby’s backing store (originally BerkeleyDB) was independent of Bigtable. This asymmetry is important: if Bigtable had its own consensus layer, it would have been circular, which would complicate recovery.

Over time, both systems evolved. Bigtable’s later incarnation (2012+) as Cloud Bigtable changed some of these assumptions, and Google’s internal infrastructure replaced Chubby with more specialized systems. The original design illustrates how coordination services tend to be foundational: they are typically designed early and replaced late.


The ZooKeeper Connection

The ZooKeeper paper explicitly calls out Chubby as inspiration. The project borrowed several core concepts:

  • The hierarchical file system namespace (though ZooKeeper uses znodes)
  • Ephemeral nodes for service discovery
  • Watches for event notification
  • The general lock abstraction for coordination

However, ZooKeeper made significant departures:

  • Watches are one-time in ZooKeeper vs. persistent in Chubby: ZooKeeper simplified by making watches fire once and require re-registration
  • ZooKeeper is a coordination service, not a lock service: ZooKeeper provides primitives that can be combined into locks, rather than locking as a primary feature
  • ZooKeeper separates consensus from caching: Clients can cache data while still getting notifications

Chubby vs ZooKeeper vs etcd: Direct Comparison

DimensionChubbyZooKeeperetcd
Consensus ProtocolPaxos (internal)Zab (custom Paxos variant)Raft (from paper)
API StyleFile system interfaceHierarchical znodesFlat key-value with directories
Language OriginC++JavaGo
Client LibrariesC++, JavaJava, C, Python, GoGo, Java, Python, Ruby
WatchesPersistent (stay registered)One-time (must re-register)One-time (must re-register)
Lock SupportAdvisory locks as first-classVia recipes (curator)Via transactions
ACLsUNIX-like permissionsACL per znodeRBAC with roles
ScalabilityCell-based shardingSingle ensembleSingle cluster
Performance~100k requests/sec per cell~10k-100k requests/sec~10k+ requests/sec
Leader LeasesYes (prevents split-brain)No (Zab handles it)Yes (Raft heartbeat)
ObserversNoYes (read scaling)No (but Raft learner mode)
Auth MethodsSASLSASL/KerberosmTLS, username/password
Typical UseInternal Google (historical)Hadoop/Kafka coordinationKubernetes config

Key takeaways:

  • Chubby was pioneering but closed-source and largely replaced inside Google
  • ZooKeeper is the direct Chubby descendant, battle-tested in big data ecosystems
  • etcd offers better performance and a cleaner API, preferred for new deployments
graph TD
    A[Chubby 2006] --> B[ZooKeeper 2008]
    A --> C[etcd 2013]
    B --> D[Kafka, HBase, Drill]
    C --> E[Kubernetes, CoreOS]

Limitations and Evolution

Chubby has known limitations that shaped its evolution:

Single-cell scalability: Early cells became bottlenecks, requiring sharding into multiple cells.

Event delivery delays: In rare cases, event delivery could be delayed. Applications needed their own timeouts and fallback mechanisms.

Not for high-frequency changes: Chubby was designed for configuration and coordination, not high-frequency data. Clients were expected to cache and poll for high-frequency updates.

Replaced by other systems: Within Google, Chubby has been largely replaced by more scalable systems. Spanner handles distributed transactions, and other specialized systems handle service discovery.


Trade-off Analysis

Chubby’s design embraces specific trade-offs that system designers must understand:

Design ChoiceAdvantageDisadvantage
Coarse-grained locksLow coordination overhead; scalable for global locksCannot optimize per-record or fine-grained locking
Advisory locksFlexible; no centralized enforcement bottleneckRequires client discipline; misbehaving clients can ignore locks
Paxos with 5 replicasTolerates 2 simultaneous failures; strong consistencyHigher latency than quorum-less reads; more resource usage
Persistent event notificationsSimple subscription model for long-lived clientsCan accumulate stale watchers; memory pressure on server
File system interfaceIntuitive hierarchy for naming and organizationLess efficient for flat key-value access patterns
Cell-based namespaceFailure isolation; independent scaling per cellCross-cell coordination requires federation/proxying
Master lease mechanismPrevents split-brain during partitions effectivelyLease duration tuning is critical; too short causes instability
BerkeleyDB/custom backing storeCrash recovery via WAL; proven database technologyAdditional dependency; schema changes require cell restart
Cell migration supportAllows load balancing; cell replacement without downtimeOperationally complex; requires DNS reconfiguration

Key insight: Chubby prioritized correctness and simplicity over raw speed. The trade-offs favor reliability and ease of use for coordination tasks—you get strong consistency guarantees in exchange for somewhat higher latency.


When to Use / When Not to Use

When to Use Chubby

Chubby was designed for coarse-grained distributed coordination at Google. If you need a similar system today, there are a few scenarios where it remains the right choice. Chubby’s advisory locks and ephemeral nodes give you a solid abstraction for lock acquisition and leader election among distributed processes, making it straightforward to pick a leader without building custom consensus logic. The hierarchical namespace also maps naturally to naming and service discovery use cases, where you want to organize services and endpoints in a familiar directory structure. Finally, if you have small, rarely-changed configuration that all participants need to agree on, Chubby’s event notifications mean services can reload without restart when configuration changes.

When Not to Use Chubby

Chubby is the wrong tool in several situations. High-frequency data access is one of them, since Chubby cells handle low throughput and cannot accommodate anything more than occasional reads and writes without becoming a bottleneck. Fine-grained coordination is another case where Chubby falls short, because its design centers on coarse-grained locks where one master oversees an entire system. If you need cross-cell namespace sharing, you will hit Chubby’s limitation where each cell’s namespace is isolated from others. For modern deployments outside Google, Chubby is not actively maintained, so ZooKeeper, etcd, or Consul are better choices with active community support.

Production Failure Scenarios

FailureImpactMitigation
Master lease expirationMaster loses mastership; all clients must re-acquire handles; brief unavailabilityChubby handles lease renewal automatically; if master fails, clients discover new master via DNS
Cell overloadChubby cell becomes a bottleneck under high request loadDistribute coordination across multiple Chubby cells; move non-critical clients to different cells
Event delivery delaysIn rare cases, lock acquisition events are delayed; applications holding stale locks may cause conflictsApplications should implement their own timeouts and re-check lock state before taking action
DNS cache stalenessWhen master moves, DNS entries take time to propagate; clients may contact old master brieflyUse short DNS TTLs; clients should handle OperationTimeout gracefully and retry discovery
File handle exhaustionToo many clients holding handles to the same file saturates handle tableMonitor handle count; release handles promptly; design for brief handle lifetime
Incompatible schema changesDDL changes to the Chubby database itself require cell restartCoordinate schema changes during maintenance windows; test on non-production cells first

Master Lease Expiry: Split-Brain and Recovery

The most critical failure scenario is master lease expiry leading to a potential split-brain situation. Understanding the exact sequence matters for operating Chubby cells safely.

What is split-brain in this context?

Split-brain occurs when two replicas believe they are the master simultaneously. Chubby’s lease mechanism prevents true split-brain during the lease validity window, but a narrow window opens during the lease expiry transition.

Failure sequence:

sequenceDiagram
    participant M as Master M1
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant M2 as New Master M2

    Note over M: Master M1 lease expires at T=0
    Note over M: M1 is network-partitioned
    R1->>R2: Lease expired, no response from M1
    R2->>R1: Agreed, starting election
    R1->>R2: I vote for R1
    R2->>R1: Confirmed
    Note over R1: R1 elected new master M2 (epoch+1)
    Note over M: M1 still thinks it is master!
    Note over M: M1 serving writes locally during partition
    Note over M: But cannot replicate to quorum
    Note over M2: M2 holds valid lease, accepts writes
    Note over R1,R2,R3: Replicas ignore M1 (stale epoch)
    Note over M: At T=0, M1 lost quorum-backed lease
    Note over M: Clients retry, discover M2 via DNS

Key protection mechanism:

The epoch number is the critical protection. When M2 is elected, it increments the epoch. Even if M1 (the partitioned master) recovers and tries to issue commands, replicas check the epoch and reject M1’s messages because (M1_epoch < M2_epoch). M1’s writes during the partition window are lost from the perspective of the quorum, but they never contaminated the committed Paxos log.

Recovery procedure:

  1. M1 detects its lease has expired (no renewal acknowledgment from quorum)
  2. M1 stops accepting new writes locally
  3. M1 initiates a rejoin procedure: it fetches the latest Paxos state from the quorum
  4. M1 discovers it is no longer master (higher epoch exists)
  5. M1 transitions to a replica and syncs its state to the latest Paxos log entry
  6. M1 resumes normal operation as a replica
# Simplified master recovery pseudocode
def on_lease_expired():
    stop_accepting_writes()
    current_epoch = fetch_current_epoch_from_quorum()
    if current_epoch > my_epoch:
        my_epoch = current_epoch
        my_role = REPLICA
        sync_state_from_master()
    else:
        # We might still be master, retry election
        start_leader_election()

What happens to M1’s writes during partition?

M1 could have accepted writes during the partition that never reached the quorum. These writes are effectively “tentative” and are discarded during recovery. Clients that read from M1 during the partition might have seen uncommitted state. For this reason, critical operations should use the WAIT_ANY flag to wait for acknowledgment from the quorum, not just the master.

Worst-case timeline:

PhaseDurationDetails
Partition begins, M1 still master0sM1 thinks it is master, serves writes locally
Lease expires (M1 cannot renew)4-8sM1 cannot get quorum for renewal
Replicas detect lease expiry~100msElection timeout, start new election
New master M2 elected50-150msM2 wins with higher epoch
DNS propagation to clients0-30sTTL on DNS entries for Chubby master location
All clients redirected5-40sTotal worst-case unavailability window

Practical mitigations for operators:

# Monitor master lease health
# Alert if master has not renewed lease within 50% of lease_duration
alert: ChubbyMasterLeaseRenewalWarning
  expr: time() - chubby_last_lease_renewal > (lease_duration * 0.5)
  severity: warning

# Use short DNS TTLs for master location
# Recommended: TTL = 5-10 seconds for production cells

# Set lease_duration appropriately
# Short: faster failover (2-4s) but more sensitivity to transient network issues
# Long: more stability (8-12s) but longer failover window

Quick Recap Checklist

Before diving deeper or preparing for an interview, verify your understanding of these key Chubby concepts:

  • Chubby uses Paxos consensus among 5 replicas, with one master handling all writes
  • Master lease mechanism prevents split-brain scenarios during network partitions
  • Reads can be served by any replica with local caching for performance
  • Epoch numbers prevent stale master commands from being accepted after failover
  • Ephemeral files automatically disappear when the creating client disconnects
  • Advisory locks require client discipline; Chubby does not enforce lock usage
  • Event notifications are persistent (stay registered until explicitly removed)
  • Cell hierarchy organizes the namespace into isolated failure domains
  • Chubby’s design directly inspired ZooKeeper’s znodes, watches, and ephemeral nodes
  • ZooKeeper simplified Chubby by making watches one-time (require re-registration)

Interview Questions

1. Why did Google design Chubby as a coarse-grained lock service rather than a fine-grained one?

Expected answer points:

  • Distributed lock acquisition is expensive (network round-trips, consensus overhead)
  • Coarse-grained locks held for seconds or minutes amortize this overhead effectively
  • Fine-grained locks held for milliseconds would make the coordination service a bottleneck
  • Chubby was designed for coordination among relatively few processes, not per-record locking
2. How does Chubby's master lease mechanism prevent split-brain scenarios?

Expected answer points:

  • Master holds a lease from replicas, renewed periodically before expiry
  • If master fails to renew (crash or partition), replicas wait until lease expires before accepting a new master
  • Epoch numbers ensure old masters cannot issue commands after a new master is elected
  • During partition, the old master cannot commit writes because it lacks quorum-backed lease
3. What is the role of epoch numbers in Chubby's failover process?

Expected answer points:

  • Epoch numbers are monotonically increasing integers identifying a master's term
  • When a new master is elected, the epoch number increments
  • Any message from an old epoch (lower epoch number) is rejected by replicas
  • This solves the "old master waking up" problem after network partition recovery
4. How do ephemeral files work in Chubby, and what problem do they solve?

Expected answer points:

  • Ephemeral files are automatically deleted when the client that created them closes its handle
  • If a process dies, its ephemeral files disappear automatically
  • This signals failure to other clients without requiring explicit cleanup
  • Used for service discovery: a tablet server creates an ephemeral file; when it dies, the file disappears and the master knows to reassign its tablets
5. What is the difference between Chubby's event notifications and ZooKeeper's watches?

Expected answer points:

  • Chubby events are persistent: they stay registered until explicitly unsubscribed
  • ZooKeeper watches are one-time: they fire once and require re-registration
  • Chubby's approach is simpler for long-lived subscriptions but can accumulate stale watchers
  • ZooKeeper's approach requires more client-side bookkeeping but scales better
6. How does Chubby's file system interface differ from a typical key-value store?

Expected answer points:

  • Chubby presents a hierarchical Unix-like namespace with directories and files
  • Each file has content, ACLs, metadata, and optional advisory locks
  • Paths like /ls/google/chubby/cells/bigtable/master mirror familiar file paths
  • Key-value stores typically use flat namespaces; Chubby's hierarchy organizes data naturally for naming and service discovery
7. Why does Chubby use five replicas in a cell?

Expected answer points:

  • Paxos requires a majority quorum to operate (3 out of 5 can acknowledge writes)
  • Five replicas tolerate two simultaneous failures while maintaining quorum
  • Three replicas would only tolerate one failure (majority is 2)
  • Seven replicas would be more resilient but add more latency and resource overhead
8. What happens to uncommitted writes when a network partition causes master lease expiry?

Expected answer points:

  • During partition, the old master may have accepted writes locally that never reached the quorum
  • These writes are "tentative" and are discarded during recovery
  • Clients reading from the partitioned master during the outage might have seen uncommitted state
  • Critical operations should use WAIT_ANY flag to wait for quorum acknowledgment, not just master acceptance
9. How did Bigtable originally depend on Chubby?

Expected answer points:

  • Bigtable master uses a Chubby exclusive lock for master election (only one master at a time)
  • Tablet location metadata is stored in Chubby files so clients can find the right tablet server
  • Tablet servers maintain Chubby locks; losing a lock signals the master to reassign tablets
  • The dependency was unidirectional: Bigtable depended on Chubby, but Chubby did not depend on Bigtable
10. What are the key differences between Chubby leases and Raft leader heartbeats?

Expected answer points:

  • Chubby: time-based renewal before expiry; explicit lease interval with duration
  • Raft: periodic heartbeat messages; election timeout margin determines validity
  • Chubby replica promises "I will not accept others" during lease validity
  • Raft follower promises "I will not vote for others" during election timeout
  • Both achieve similar goals (preventing split-brain) through different mechanisms
11. What was the original backing store for Chubby, and how did it evolve?

Expected answer points:

  • Originally used BerkeleyDB as the backing store for Paxos log and application state
  • Later replaced with a custom database built by Google
  • The database stores the Paxos log, cell configuration, lock states, and file metadata
  • Each write goes to the write-ahead log (WAL) before being applied, enabling crash recovery
12. How does proposal number ordering work in Chubby's Paxos implementation?

Expected answer points:

  • Each Paxos instance has a unique proposal number structured as (round_number, replica_id)
  • Comparison is lexicographic: higher round number wins
  • If round numbers tie, higher replica ID breaks the tie
  • This total ordering ensures only one value can be chosen per instance, even with competing proposers
13. Why did ZooKeeper make watches one-time rather than persistent like Chubby?

Expected answer points:

  • Persistent watchers can accumulate and create memory pressure on the server
  • One-time watches simplify server-side state management
  • Clients must re-register after receiving an event, which is explicit and auditable
  • The trade-off is more bookkeeping on the client side, but ZooKeeper prioritizes server scalability
14. What is cell federation in Chubby and when was it introduced?

Expected answer points:

  • Later versions of Chubby introduced cell federation to provide a unified namespace across multiple cells
  • A cell can proxy requests to another cell, effectively bridging namespace views
  • This allowed services spanning multiple Chubby cells to present a single logical namespace to clients
  • Used by services that outgrew a single cell's capacity but needed coherent namespace access
15. What is the typical failover latency for a Chubby cell when the master crashes?

Expected answer points:

  • Lease expiration window: 4-8 seconds (configurable, sets detection delay)
  • Election and Paxos leader election: 50-200ms
  • Client DNS update propagation: 0-30 seconds (depends on TTL setting)
  • Total typical unavailability: 1-2 seconds for planned migrations, 5-10 seconds for unexpected failures
  • ZooKeeper typically completes election in 100-500ms, etcd within 1 second
16. How does Chubby's cell hierarchy help with failure isolation?

Expected answer points:

  • Each cell serves a subset of the overall namespace with its own replica set
  • Failure of one cell does not directly impact others serving different namespaces
  • This isolation simplifies deployment boundaries and failure domains
  • Services can be assigned to cells based on criticality or load characteristics
17. What are the main limitations that led to Chubby being replaced within Google?

Expected answer points:

  • Single-cell scalability: early cells became bottlenecks under high request load
  • Not designed for high-frequency data changes; clients expected to cache and poll
  • Event delivery could occasionally be delayed, requiring application-level timeouts
  • More specialized systems like Spanner handle distributed transactions more efficiently
  • Cell migration was possible but operationally complex
18. What is the relationship between Chubby's Paxos implementation and the original Lamport Paxos paper?

Expected answer points:

  • The original Paxos paper left several implementation details unspecified
  • Chubby filled gaps: epoch numbers, proposal numbering, lease mechanisms, database backing store
  • Chubby runs thousands of Paxos instances in parallel (one per operation)
  • Instance numbers identify which operation is being agreed upon, while proposal numbers sequence competing proposals
19. Why does Chubby use advisory locks instead of mandatory locks?

Expected answer points:

  • Mandatory locks would require Chubby to intercept and block all file access attempts
  • This would create a centralized bottleneck defeating Chubby's distributed design
  • Advisory locks are a convention: correct clients use them, incorrect clients fail predictably
  • In practice, well-behaved clients follow the protocol, and misbehaving ones trigger alerts
20. How does Chubby's influence persist in modern distributed systems?

Expected answer points:

  • ZooKeeper directly adopted Chubby's concepts: hierarchical namespace, ephemeral nodes, watches, lock abstraction
  • etcd (ZooKeeper's spiritual successor) uses the same Raft consensus algorithm approach
  • Consul combines Chubby-style coordination with service discovery
  • Kubernetes uses etcd for configuration and leader election, descendants of Chubby's design
  • The lesson that "coordination is a common problem worth solving once and well" remains relevant

Further Reading

Conclusion

Chubby was a pioneering coordination service demonstrating how to build production-grade distributed consensus:

  • Production Paxos: Showed how Paxos could work efficiently for real-world use
  • Coarse-grained locking: Provided locks as a service, amortizing coordination overhead
  • Event notifications: Built reactive patterns into the coordination service itself
  • Hierarchical namespace: Made data organization intuitive, like familiar file systems

The 2006 Chubby paper is essential reading for distributed systems practitioners. It explains not just how Chubby works, but why decisions were made, and what trade-offs were accepted.

Chubby itself is now largely historical, but its influence persists. Every time you use ZooKeeper to elect a leader, etcd to store configuration, or Consul for service discovery, you are using systems shaped by Chubby’s design.

The lesson: coordination is a common problem worth solving once and well. Getting consensus right is hard enough without every team inventing their own poorly-tested version. Chubby proved that a well-designed coordination service could be general enough and efficient enough to serve as infrastructure for large-scale distributed systems.


Category

Related Posts

Apache ZooKeeper: Consensus and Coordination

Explore ZooKeeper's Zab consensus protocol, hierarchical znodes, watches, leader election, and practical use cases for distributed coordination.

#distributed-systems #databases #zookeeper

etcd: Distributed Key-Value Store for Configuration

Deep dive into etcd architecture using Raft consensus, watches for reactive configuration, leader election patterns, and Kubernetes integration.

#distributed-systems #databases #etcd

Google Spanner: Globally Distributed SQL at Scale

Google Spanner architecture combining relational model with horizontal scalability, TrueTime API for global consistency, and F1 database implementation.

#distributed-systems #databases #google