Google Chubby: The Lock Service That Inspired ZooKeeper

Explore Google Chubby's architecture, lock-based coordination, Paxos integration, cell hierarchy, and its influence on distributed systems design.



Google Chubby was Google’s internal lock service and coarse-grained file system, described in a 2006 paper that shaped distributed systems design. Chubby itself was not open-sourced, but its design directly inspired Apache ZooKeeper, which adopted many Chubby concepts and even parts of its API naming.

Chubby solved a core problem: large-scale distributed systems need coordination primitives like locks, leader election, and configuration storage, but building these correctly is notoriously difficult. Chubby provided these as a service, letting developers focus on their problems rather than reinventing coordination code.

Today, Chubby has been largely replaced by other systems within Google, but its influence persists in every ZooKeeper deployment, every etcd cluster, and every system using distributed locks.


Design Philosophy

Chubby was built around a few key observations:

Coarse-grained locks are easier: Acquiring a distributed lock is expensive. Chubby optimized for coarse-grained locks held for seconds or minutes, not fine-grained locks held for milliseconds. The overhead of lock acquisition is amortized over longer hold times.
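The amortization argument can be made concrete with a rough calculation. The latency numbers below are illustrative assumptions, not figures from the Chubby paper:

```python
# Illustrative amortization of lock-acquisition cost over hold time.
# The latency figures are assumptions for the sake of the example.
def acquisition_overhead(acquire_ms: float, hold_ms: float) -> float:
    """Fraction of the lock's total lifetime spent acquiring it."""
    return acquire_ms / (acquire_ms + hold_ms)

# Fine-grained: 10 ms to acquire, held for 50 ms -> large relative overhead
fine = acquisition_overhead(10, 50)
# Coarse-grained: 10 ms to acquire, held for 60 s -> negligible overhead
coarse = acquisition_overhead(10, 60_000)
print(f"fine-grained overhead:   {fine:.1%}")
print(f"coarse-grained overhead: {coarse:.3%}")
```

The same acquisition cost shrinks from roughly a sixth of the lock's lifetime to a fraction of a percent, which is the whole case for coarse-grained locking.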

Locks and files are related: A lock can be naturally represented as a file with specific semantics. A file containing the lock holder’s identity serves both as state and as the lock itself.

Event notification matters: Applications need to know when configuration changes, when a lock is released, or when a machine fails. Chubby built event notification into its API.

Cells partition the namespace: Rather than one flat global namespace, Chubby organized data into cells, each serving a subset of the overall namespace. This isolation simplified deployment and contained failure boundaries.

graph TD
    A[Chubby Cell 1<br/>Bigtable Namespace] --> B[Cell 1 replica 1]
    A --> C[Cell 1 replica 2]
    A --> D[Cell 1 replica 3]

    E[Chubby Cell 2<br/>GFS Namespace] --> F[Cell 2 replica 1]
    E --> G[Cell 2 replica 2]
    E --> H[Cell 2 replica 3]

Cell Migration and Namespace Sharing

As services grew, a single Chubby cell could become a bottleneck. Chubby supports cell migration: moving a service’s data from one cell to another without service interruption. The process involves:

  1. Creating a new cell with the target namespace
  2. Updating the service’s DNS or configuration to point to the new cell
  3. Allowing in-flight operations to complete on the old cell
  4. Decommissioning the old cell once all clients have migrated
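During the transition window of steps 2-3, a client effectively prefers the new cell and falls back to the old one. A minimal sketch of that behavior, using a hypothetical in-memory `Cell` class rather than Chubby's real API:

```python
# Hypothetical sketch of client behavior during cell migration.
# The Cell class and lookup() method are illustrative, not Chubby's API.
class Cell:
    def __init__(self, name, namespace):
        self.name = name
        self.namespace = namespace  # path -> contents

    def lookup(self, path):
        if path not in self.namespace:
            raise KeyError(f"{path} not served by {self.name}")
        return self.namespace[path]

def migrating_lookup(path, new_cell, old_cell):
    """Prefer the new cell; fall back to the old one until migration completes."""
    try:
        return new_cell.lookup(path)
    except KeyError:
        return old_cell.lookup(path)

old = Cell("cell-old", {"/ls/services/search/config": "replicas=5"})
new = Cell("cell-new", {})          # step 1: new cell exists but is not yet populated
print(migrating_lookup("/ls/services/search/config", new, old))  # served by old cell

new.namespace["/ls/services/search/config"] = "replicas=5"       # data migrated
print(migrating_lookup("/ls/services/search/config", new, old))  # served by new cell
```

Once every client resolves through the new cell, the old cell can be decommissioned safely (step 4).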

Namespace sharing across cells was intentionally limited. Each cell originally maintained its own isolated namespace, and there was no built-in cross-cell replication of the namespace itself. Applications that needed cross-cell coordination either used a dedicated shared cell or implemented their own replication layer.

Later versions of Chubby introduced cell federation, allowing a cell to proxy requests to another cell, effectively providing a unified namespace view across multiple cells. This was used by services that needed to span multiple Chubby cells while presenting a single logical namespace to clients.

# Conceptual cell lookup flow
/service/bigtable/leader  -->  Cell A (handles /ls/google/chubby/cells/bigtable)
/service/gfs/master       -->  Cell B (handles /ls/google/chubby/cells/gfs)

# Cell A can proxy to Cell B for namespace access
Cell A -->|proxy| Cell B: lookup /ls/google/chubby/cells/gfs/master

Architecture Overview

A Chubby cell consists of a small set of replicas, typically five, running the same software. Replicas use Paxos to agree on operation order, ensuring consistency even when some replicas fail.

One replica is the master at any given time. All writes go through the master, which replicates them to the other replicas via Paxos. In the published design, reads are also served by the master; clients cache data locally for performance, and the master invalidates those caches when data changes.

graph TD
    C[Client] -->|Write| M[Master]
    C -->|Read| M

    M -->|Replicate| R1[Replica 1]
    M -->|Replicate| R2[Replica 2]
    M -->|Replicate| R3[Replica 3]

    M -.->|Cache invalidation| C

Paxos gives Chubby linearizable reads and writes, the strongest consistency guarantee. This matters for locks, where inconsistent state could make multiple processes think they hold the same lock.

Master Election and Failover Latency

When the master fails, Chubby’s failover process follows a predictable timeline:

sequenceDiagram
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant M as Master
    participant C as Client

    Note over M: Master crashes at T=0
    R1->>R2: Lease expired, starting election
    R2->>R1: I vote for you
    R3->>R1: I vote for you
    Note over R1: Quorum (3/5) reached
    R1->>R2: I am the new master (epoch+1)
    Note over R1: New master at T=~200ms
    R1->>C: New master notification
    Note over C: Client discovers new master via DNS
    Note over C: Typically 1-2 seconds total

| Failover Phase | Typical Duration | Notes |
| --- | --- | --- |
| Lease expiration (master silent) | 4-8 seconds | Configurable per cell; the lease duration sets this window |
| Lease-expiry detection + election start | 50-200 ms | Remaining replicas detect lease expiry via quorum timeout |
| Paxos leader election (propose/accept) | 50-150 ms | Single round of Prepare/Accept with 5 replicas |
| Client DNS update propagation | 0-30 seconds | Depends on TTL; clients cache the old master up to the TTL |
| Handle re-acquisition by clients | 100-500 ms | Clients must re-open file handles |
| Total unavailability window | ~5-10 seconds | Worst case with default lease and DNS TTL |

The paper reports that Chubby cells typically experience 1-2 seconds of unavailability during a planned master migration (e.g., for maintenance), and 5-10 seconds during an unexpected master failure. The lease duration is the primary tuning knob: shorter leases mean faster failover but more instability from transient network issues.

For comparison, ZooKeeper’s leader election typically completes in 100-500ms under similar conditions, while etcd leader election usually finishes within 1 second. Chubby’s slightly longer failover is a trade-off for its stronger lease-based split-brain prevention.


The File System Interface

Chubby presents a Unix-like file system interface rather than a key-value API. Files are organized into directories, forming a path hierarchy.

/ls/google/chubby/cells/bigtable/master
/ls/google/chubby/cells/gfs/primary
/ls/services/search/config
/ls/services/search/locks/query

Each file has:

  • Content: Small amounts of binary data (typically less than 1 MB)
  • ACLs: Access control lists for permissions
  • Metadata: Modification times, lock holders, etc.
  • Locks: Optional advisory locks
# Chubby client operations (conceptual)
open("/ls/services/search/config")      # Open a file
setContents("replicas=5, shard=true")    # Write data
getContents()                           # Read data
acquire("my-process-id")                # Acquire lock
release("my-process-id")                # Release lock

Files can be ephemeral, automatically deleted when the client that created them closes its handle. If a process dies, its ephemeral files disappear, signaling failure to others.
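Ephemeral-file semantics can be simulated in a few lines. The `EphemeralStore` class below is an in-memory illustration, not Chubby's API: a file lives only as long as the handle of the client that created it.

```python
# In-memory sketch of ephemeral files: a file exists only while the
# creating client's handle is open. Names and API are illustrative.
class EphemeralStore:
    def __init__(self):
        self.files = {}   # path -> (owner client id, data)

    def create_ephemeral(self, path, client_id, data):
        self.files[path] = (client_id, data)

    def close_client(self, client_id):
        """Simulate a client crash/disconnect: its ephemeral files vanish."""
        self.files = {p: v for p, v in self.files.items() if v[0] != client_id}

    def exists(self, path):
        return path in self.files

store = EphemeralStore()
store.create_ephemeral("/ls/services/search/primary", "worker-1", b"worker-1:8080")
print(store.exists("/ls/services/search/primary"))   # True: worker-1 is alive

store.close_client("worker-1")                       # worker-1 dies
print(store.exists("/ls/services/search/primary"))   # False: others observe the failure
```

This is the basic building block for failure detection: other processes watch the file, and its disappearance is the failure signal.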


Advisory Locks

Chubby provides advisory locks, not mandatory ones. A process can access a file without acquiring its lock, so coordination works only if participants are disciplined about actually using the locks.

# Conceptual Chubby lock acquisition
lock = chubby.lock("/ls/services/search/locks/query")

if lock.acquire(blocking=True, timeout=30):
    try:
        # Critical section - we hold the lock
        config = read_config()
        do_query(config)
    finally:
        lock.release()
else:
    # Could not acquire lock
    raise Exception("Lock acquisition timed out")

The advisory nature of the locks means misbehaving clients can ignore them. In practice, this is rarely a problem because correct clients follow the protocol, and incorrect clients fail in predictable ways that trigger alerts.


Events and Notifications

Clients subscribe to events on files and directories. When things happen, Chubby notifies the client:

  • File modified: Content of a watched file changed
  • Lock acquired/released: Someone acquired or released a lock
  • Child added/removed: A child directory or file was created or deleted
  • Master failure: The cell master failed and a new one was elected
  • Connection alive/broken: Connection to Chubby cell state changed
# Subscribe to file change events
handle = chubby.openFile("/ls/services/search/config")
handle.subscribe(chubby.FILE_MODIFIED, callback=reload_config)

# Subscribe to lock events
lock = chubby.lock("/ls/services/search/locks/master")
lock.subscribe(chubby.LOCK_ACQUIRED, callback=i_am_master_now)

This event mechanism enables reactive configuration. Rather than polling, clients receive notifications when relevant state changes. For large-scale systems, this reduces load on clients and the coordination service.


Chubby and Paxos

Under the hood, Chubby uses Paxos for consensus among replicas. The Chubby paper clarified how Paxos integrates into a production system:

  • Database backing store: Chubby uses a simple database (originally BerkeleyDB, later custom) to store state
  • Instance numbers: Each Paxos instance is numbered, and Chubby runs many instances in parallel for different operations
  • Epoch numbers: Masters are identified by epoch numbers, preventing old masters from acting after a new master is elected
  • Lease mechanism: The master holds a lease from replicas, renewing it periodically
sequenceDiagram
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant M as Master

    Note over M: Prepare to renew lease
    M->>R1: Prepare(N=5)
    M->>R2: Prepare(N=5)
    M->>R3: Prepare(N=5)
    R1-->>M: Promise(N=5,v=4)
    R2-->>M: Promise(N=5,v=4)
    R3-->>M: Promise(N=5,v=4)
    Note over M: Majority promised!
    M->>R1: Accept(N=5,v=5)
    M->>R2: Accept(N=5,v=5)
    M->>R3: Accept(N=5,v=5)

The lease mechanism ensures that if the master fails, replicas wait until the lease expires before accepting a new master. This prevents split-brain where two masters think they are in charge.

Paxos Integration Deep Dive

The original Paxos paper leaves several implementation details unspecified. Chubby’s production deployment fills in these gaps:

Epoch Numbers (Master Identity):

Epoch numbers are monotonically increasing integers that identify a master’s term. When a new master is elected, it increments the epoch number. Any message from an old epoch (lower epoch number) is rejected by replicas. This solves the “old master waking up” problem: if a master crashes and a new one is elected, but the old master recovers and tries to issue commands, replicas reject them because the old master’s epoch is stale.

# Simplified epoch check on replica
def handle_master_message(master_epoch, master_id, operation):
    if master_epoch < self.current_epoch:
        raise Rejected(f"Stale epoch {master_epoch} < {self.current_epoch}")
    if master_epoch == self.current_epoch and master_id != self.master_id:
        raise Rejected(f"Master ID mismatch")
    # Accept the operation
    self.process(operation)

Proposal Numbers (Operation Ordering):

Each Paxos instance (representing one Chubby operation) has a unique proposal number. Proposal numbers are structured as (round_number, replica_id) tuples, compared lexicographically. The higher round number wins, and if round numbers tie, the higher replica ID breaks the tie. This total ordering ensures that even if two proposers compete, only one value can be chosen per instance.

# Proposal number comparison
def compare_proposals(p1: tuple[int, int], p2: tuple[int, int]) -> int:
    """Returns 1 if p1 > p2, -1 if p1 < p2, 0 if equal."""
    if p1[0] != p2[0]:
        return 1 if p1[0] > p2[0] else -1
    if p1[1] != p2[1]:
        return 1 if p1[1] > p2[1] else -1
    return 0

Chubby runs thousands of Paxos instances in parallel, one per operation. The instance number identifies which operation is being agreed upon, while proposal numbers sequence competing proposals within that instance.
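The relationship between instance numbers and proposal numbers can be sketched as a log keyed by instance, where each slot is decided independently and the highest proposal number wins within a slot. This is a heavily simplified, single-process illustration of the bookkeeping, not the network protocol:

```python
# Simplified sketch: a Paxos-style log where each instance (slot) is handled
# independently, and within a slot a higher proposal number supersedes a
# lower one. This models only the bookkeeping, not the real protocol.
class PaxosLog:
    def __init__(self):
        self.slots = {}   # instance number -> (proposal number, value)

    def propose(self, instance, proposal, value):
        """Accept the proposal if its number exceeds what this slot has seen."""
        current = self.slots.get(instance)
        if current is None or proposal > current[0]:
            self.slots[instance] = (proposal, value)
            return True
        return False

log = PaxosLog()
# Two instances (operations) are agreed on in parallel:
log.propose(instance=1, proposal=(1, 0), value="setContents(/a, x)")
log.propose(instance=2, proposal=(1, 2), value="acquire(/lock)")
# Within instance 1, a higher proposal number wins; a stale one is rejected:
log.propose(instance=1, proposal=(2, 1), value="setContents(/a, y)")
print(log.slots[1][1])   # "setContents(/a, y)"
```

Note how `(round_number, replica_id)` tuples compare lexicographically exactly as the text describes, since Python compares tuples element by element.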

Lease Mechanism in Detail:

The master lease works as a timed promise from replicas to the master. The lease has a fixed duration (typically seconds, configured per cell). The master must renew the lease before it expires or lose mastership. This is different from Raft’s heartbeat approach:

| Aspect | Chubby Master Lease | Raft Leader Heartbeat |
| --- | --- | --- |
| Renewal trigger | Time-based, before expiry | Periodic heartbeat |
| Validity period | Explicit lease interval | Election timeout margin |
| Stale leader detection | Lease expiry timeout | Heartbeat timeout |
| Replica promise | "I will not accept others" | "I will not vote for others" |

During normal operation, the master renews its lease well before expiry. If the master fails to renew (crash, network partition), replicas wait until the lease expires before accepting a new master. This grace period is the “lease timeout” and typically ranges from hundreds of milliseconds to a few seconds.

# Conceptual lease renewal timeline
T=0s      Master acquires lease (lease_duration=5s)
T=3s      Master renews lease (at 60% of duration)
T=6s      Master renews lease again
T=9s      Master renews lease again
T=11s     Master crashes before its next renewal
T=14s     Lease expires (last renewal + lease_duration)
T=14s+    Replicas accept a new master election
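A renewal timeline like this corresponds to a simple loop. The simulation below uses a plain counter for time, and the 60%-of-lease renewal threshold is an assumption taken from the timeline, not a documented Chubby constant:

```python
# Simulated master lease renewal. Time is a plain counter, not wall-clock.
# The 60% renewal threshold is an illustrative assumption.
LEASE_DURATION = 5.0
RENEW_AT = 0.6 * LEASE_DURATION   # renew once 60% of the lease has elapsed

def run_master(crash_at, end):
    """Return (renewal times, time at which the lease finally expires)."""
    lease_start, renewals = 0.0, []
    t = 0.0
    while t < end:
        if t >= crash_at:
            break                      # master crashes and stops renewing
        if t - lease_start >= RENEW_AT:
            lease_start = t            # successful renewal resets the lease clock
            renewals.append(t)
        t += 0.5
    return renewals, lease_start + LEASE_DURATION

renewals, expiry = run_master(crash_at=11.0, end=20.0)
print(renewals)   # [3.0, 6.0, 9.0]
print(expiry)     # 14.0: lease expires 5s after the last successful renewal
```

The key property is that expiry is measured from the last successful renewal, which is why replicas must wait out that full window before electing a replacement.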

Database Backing Store:

Chubby originally used BerkeleyDB as its backing store, later replacing it with a custom database. The database stores the Paxos log and application state. Each write goes to the WAL (write-ahead log) before being applied, enabling crash recovery. The database also stores the cell’s configuration, lock states, and file metadata.
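A write-ahead log in the sense described here can be sketched minimally: every mutation is appended to the log before the in-memory state is updated, so replaying a surviving log reconstructs the state after a crash. This is an illustrative toy, not Chubby's on-disk format:

```python
# Minimal write-ahead-log sketch: append to the log first, apply second,
# and recover after a crash by replaying the log. Illustrative only.
class WalStore:
    def __init__(self):
        self.log = []     # durable append-only log (a list stands in for disk)
        self.state = {}   # in-memory application state

    def set(self, key, value):
        self.log.append(("set", key, value))   # 1. write-ahead
        self.state[key] = value                # 2. apply

    @classmethod
    def recover(cls, log):
        """Rebuild state by replaying a log that survived a crash."""
        store = cls()
        for op, key, value in log:
            if op == "set":
                store.state[key] = value
        store.log = list(log)
        return store

store = WalStore()
store.set("/ls/services/search/config", "replicas=5")
store.set("/ls/services/search/config", "replicas=7")

# Simulate a crash in which only the log survives, then replay it.
recovered = WalStore.recover(store.log)
print(recovered.state["/ls/services/search/config"])   # "replicas=7"
```

Because the log entry always lands before the state change, a crash between the two steps loses nothing: replay simply re-applies the entry.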


Use Cases at Google

Within Google, Chubby handled several critical coordination tasks:

Leader Election: For Bigtable and GFS, one server needed to be primary. Chubby’s lock mechanism enabled reliable leader election without complex consensus code in each system.

Configuration Storage: Service configurations, feature flags, and routing tables lived in Chubby. Event notifications let services reload without restart.

Naming Service: Before DNS was sufficient, Chubby served as a naming service. Machine names, service endpoints, and discovery information lived in Chubby.

Distributed Locks: Semaphores and barriers for distributed batch jobs used Chubby locks to coordinate across thousands of machines.
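A distributed barrier of the kind mentioned above can be built from Chubby-style files: each worker creates a child file under a barrier directory, and the barrier opens once the expected number of children exists. The sketch below is an in-memory simulation with an illustrative API, in the style of Chubby/ZooKeeper barrier recipes:

```python
# In-memory sketch of a barrier built from files under a directory,
# in the style of Chubby/ZooKeeper barrier recipes. API is illustrative.
class Barrier:
    def __init__(self, expected):
        self.expected = expected
        self.children = set()   # stands in for files under a barrier directory

    def arrive(self, worker_id):
        """A worker announces arrival by creating its child file."""
        self.children.add(worker_id)

    def is_open(self):
        return len(self.children) >= self.expected

barrier = Barrier(expected=3)
for worker in ["w1", "w2"]:
    barrier.arrive(worker)
print(barrier.is_open())   # False: only 2 of 3 workers have arrived

barrier.arrive("w3")
print(barrier.is_open())   # True: all workers arrived, the barrier opens
```

In a real deployment, waiting workers would watch the directory for child-added events rather than polling `is_open()`.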

Original Bigtable Dependency Context

Chubby and Bigtable have a deep historical relationship that shaped Google’s early infrastructure. Bigtable was designed to be a distributed storage system for structured data, capable of scaling to petabytes across thousands of machines. But Bigtable’s original design assumed the existence of a coordination service: tablets needed to know who the master was, where to find their tablet server, and how to recover from failures.

Chubby was built, in part, to solve this coordination problem for Bigtable. The original Bigtable paper (2006) describes how it uses Chubby for:

  • Master election: The Bigtable master uses a Chubby exclusive lock to ensure only one master exists at a time
  • Tablet location: Bigtable stores tablet location metadata in Chubby files, allowing clients to find the right tablet server
  • Server lease management: Tablet servers maintain Chubby locks; if a tablet server loses its lock (crashed or network partition), the master knows to reassign its tablets
# Bigtable/Chubby interaction example
/chubby/cells/bigtable/master     # Bigtable master election lock
/chubby/cells/bigtable/tablets/  # Directory containing tablet location data
/chubby/cells/bigtable/tserver/  # Ephemeral files for live tablet servers

This tight coupling meant that Bigtable could not function if its Chubby cell was unavailable. In practice, Google ran multiple independent Chubby cells, and each Bigtable cluster was assigned to a specific cell. The Bigtable master would retry Chubby operations with exponential backoff, and would stop serving if the Chubby cell was unreachable for an extended period.

The dependency was unidirectional: Bigtable depended on Chubby, but Chubby did not depend on Bigtable. Chubby’s backing store (originally BerkeleyDB) was independent of Bigtable. This asymmetry is important: had Chubby stored its state in Bigtable, the dependency would have been circular, complicating recovery.

Over time, both systems evolved. Bigtable’s later incarnation (2012+) as Cloud Bigtable changed some of these assumptions, and Google’s internal infrastructure replaced Chubby with more specialized systems. The original design illustrates how coordination services tend to be foundational: they are typically designed early and replaced late.


The ZooKeeper Connection

The ZooKeeper paper explicitly acknowledges Chubby’s influence. ZooKeeper adopted several concepts:

  • The hierarchical file system namespace (though ZooKeeper uses znodes)
  • Ephemeral nodes for service discovery
  • Watches for event notification
  • The general lock abstraction for coordination

However, ZooKeeper made significant departures:

  • Watches are one-time in ZooKeeper vs. persistent in Chubby: ZooKeeper simplified by making watches fire once and require re-registration
  • ZooKeeper is a coordination service, not a lock service: ZooKeeper provides primitives that can be combined into locks, rather than locking as a primary feature
  • ZooKeeper separates consensus from caching: Clients can cache data while still getting notifications
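The one-time vs. persistent watch difference can be simulated in a few lines, without any real client library:

```python
# Simulation of the watch-semantics difference: a Chubby-style persistent
# watch keeps firing on every change, while a ZooKeeper-style one-time
# watch fires once and must be re-registered. Purely illustrative.
class WatchedFile:
    def __init__(self):
        self.persistent, self.one_time = [], []

    def watch_persistent(self, cb):
        self.persistent.append(cb)

    def watch_once(self, cb):
        self.one_time.append(cb)

    def modify(self):
        for cb in self.persistent:
            cb()
        fired, self.one_time = self.one_time, []   # one-time watches are consumed
        for cb in fired:
            cb()

counts = {"chubby": 0, "zk": 0}
f = WatchedFile()
f.watch_persistent(lambda: counts.__setitem__("chubby", counts["chubby"] + 1))
f.watch_once(lambda: counts.__setitem__("zk", counts["zk"] + 1))

f.modify()
f.modify()   # persistent watch fires again; one-time watch was already consumed
print(counts)   # {'chubby': 2, 'zk': 1}
```

ZooKeeper's choice trades convenience for a simpler server: the client must re-register after each event, and may miss changes that occur in between.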

Chubby vs ZooKeeper vs etcd: Direct Comparison

| Dimension | Chubby | ZooKeeper | etcd |
| --- | --- | --- | --- |
| Consensus protocol | Paxos (internal) | Zab (atomic broadcast protocol) | Raft |
| API style | File system interface | Hierarchical znodes | Flat key-value with key prefixes |
| Language origin | C++ | Java | Go |
| Client libraries | C++, Java | Java, C, Python, Go | Go, Java, Python, Ruby |
| Watches | Persistent (stay registered) | One-time (must re-register) | Streaming (v3 watches stay registered) |
| Lock support | Advisory locks as first-class | Via recipes (Curator) | Via leases and transactions |
| ACLs | UNIX-like permissions | ACL per znode | RBAC with roles |
| Scalability | Cell-based sharding | Single ensemble | Single cluster |
| Performance | ~100k requests/sec per cell | ~10k-100k requests/sec | ~10k+ requests/sec |
| Leader leases | Yes (prevents split-brain) | No (Zab handles it) | Yes (Raft leader lease) |
| Observers | No | Yes (read scaling) | No (but Raft learner mode) |
| Auth methods | Internal Google auth | SASL/Kerberos | mTLS, username/password |
| Typical use | Internal Google (historical) | Hadoop/Kafka coordination | Kubernetes configuration |

Key takeaways:

  • Chubby was pioneering but closed-source and largely replaced inside Google
  • ZooKeeper is the direct Chubby descendant, battle-tested in big data ecosystems
  • etcd offers strong performance and a clean API, and is often preferred for new deployments
graph TD
    A[Chubby 2006] --> B[ZooKeeper 2008]
    A --> C[etcd 2013]
    B --> D[Kafka, HBase, Drill]
    C --> E[Kubernetes, CoreOS]

Limitations and Evolution

Chubby has known limitations that shaped its evolution:

Single-cell scalability: Early cells became bottlenecks, requiring sharding into multiple cells.

Event delivery delays: In rare cases, event delivery could be delayed. Applications needed their own timeouts and fallback mechanisms.

Not for high-frequency changes: Chubby was designed for configuration and coordination, not high-frequency data. Clients were expected to cache and poll for high-frequency updates.

Replaced by other systems: Within Google, Chubby has been largely replaced by more scalable systems. Spanner handles distributed transactions, and other specialized systems handle service discovery.


When to Use / When Not to Use Chubby

When to Use Chubby

Chubby was designed for coarse-grained distributed coordination at Google. If you need a similar system today, there are a few scenarios where it remains the right choice. Chubby’s advisory locks and ephemeral nodes give you a solid abstraction for lock acquisition and leader election among distributed processes, making it straightforward to pick a leader without building custom consensus logic. The hierarchical namespace also maps naturally to naming and service discovery use cases, where you want to organize services and endpoints in a familiar directory structure. Finally, if you have small, rarely-changed configuration that all participants need to agree on, Chubby’s event notifications mean services can reload without restart when configuration changes.

When Not to Use Chubby

Chubby is the wrong tool in several situations. High-frequency data access is one of them, since Chubby cells handle low throughput and cannot accommodate anything more than occasional reads and writes without becoming a bottleneck. Fine-grained coordination is another case where Chubby falls short, because its design centers on coarse-grained locks where one master oversees an entire system. If you need cross-cell namespace sharing, you will hit Chubby’s limitation where each cell’s namespace is isolated from others. For modern deployments outside Google, Chubby is not actively maintained, so ZooKeeper, etcd, or Consul are better choices with active community support.

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Master lease expiration | Master loses mastership; all clients must re-acquire handles; brief unavailability | Chubby handles lease renewal automatically; if the master fails, clients discover the new master via DNS |
| Cell overload | Chubby cell becomes a bottleneck under high request load | Distribute coordination across multiple Chubby cells; move non-critical clients to different cells |
| Event delivery delays | In rare cases, lock acquisition events are delayed; applications holding stale locks may cause conflicts | Applications should implement their own timeouts and re-check lock state before taking action |
| DNS cache staleness | When the master moves, DNS entries take time to propagate; clients may briefly contact the old master | Use short DNS TTLs; clients should handle OperationTimeout gracefully and retry discovery |
| File handle exhaustion | Too many clients holding handles to the same file saturates the handle table | Monitor handle count; release handles promptly; design for brief handle lifetimes |
| Incompatible schema changes | DDL changes to the Chubby database itself require a cell restart | Coordinate schema changes during maintenance windows; test on non-production cells first |

Master Lease Expiry: Split-Brain and Recovery

The most critical failure scenario is master lease expiry leading to a potential split-brain situation. Understanding the exact sequence matters for operating Chubby cells safely.

What is split-brain in this context?

Split-brain occurs when two replicas believe they are the master simultaneously. Chubby’s lease mechanism prevents true split-brain during the lease validity window, but a narrow window opens during the lease expiry transition.

Failure sequence:

sequenceDiagram
    participant M as Master M1
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant M2 as New Master M2

    Note over M: Master M1 lease expires at T=0
    Note over M: M1 is network-partitioned
    R1->>R2: Lease expired, no response from M1
    R2->>R1: Agreed, starting election
    R1->>R2: I vote for R1
    R2->>R1: Confirmed
    Note over R1: R1 elected new master M2 (epoch+1)
    Note over M: M1 still thinks it is master!
    Note over M: M1 serving writes locally during partition
    Note over M: But cannot replicate to quorum
    Note over M2: M2 holds valid lease, accepts writes
    Note over R1,R2,R3: Replicas ignore M1 (stale epoch)
    Note over M: At T=0, M1 lost quorum-backed lease
    Note over M: Clients retry, discover M2 via DNS

Key protection mechanism:

The epoch number is the critical protection. When M2 is elected, it increments the epoch. Even if M1 (the partitioned master) recovers and tries to issue commands, replicas check the epoch and reject M1’s messages because (M1_epoch < M2_epoch). M1’s writes during the partition window are lost from the perspective of the quorum, but they never contaminated the committed Paxos log.

Recovery procedure:

  1. M1 detects its lease has expired (no renewal acknowledgment from quorum)
  2. M1 stops accepting new writes locally
  3. M1 initiates a rejoin procedure: it fetches the latest Paxos state from the quorum
  4. M1 discovers it is no longer master (higher epoch exists)
  5. M1 transitions to a replica and syncs its state to the latest Paxos log entry
  6. M1 resumes normal operation as a replica
# Simplified master recovery pseudocode
def on_lease_expired():
    stop_accepting_writes()
    current_epoch = fetch_current_epoch_from_quorum()
    if current_epoch > my_epoch:
        my_epoch = current_epoch
        my_role = REPLICA
        sync_state_from_master()
    else:
        # We might still be master, retry election
        start_leader_election()

What happens to M1’s writes during partition?

M1 could have accepted writes during the partition that never reached the quorum. These writes are effectively “tentative” and are discarded during recovery. Clients that read from M1 during the partition might have seen uncommitted state. For this reason, critical operations should use the WAIT_ANY flag to wait for acknowledgment from the quorum, not just the master.
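The tentative-vs-committed distinction comes down to quorum acknowledgment: a write counts as committed only once a majority of replicas acknowledge it. A minimal sketch (the replica count and helper function are illustrative):

```python
# Sketch: a write is committed only if a majority of replicas acknowledge it.
# A partitioned master can still append locally, but those writes stay tentative.
REPLICAS = 5
QUORUM = REPLICAS // 2 + 1   # 3 of 5

def commit_status(acks: int) -> str:
    """Classify a write by how many replicas (including the master) acked it."""
    return "committed" if acks >= QUORUM else "tentative"

print(commit_status(acks=4))   # healthy master reaches quorum -> "committed"
print(commit_status(acks=1))   # partitioned master, only itself -> "tentative"
```

This is why M1's partition-era writes vanish harmlessly: with at most its own acknowledgment, none of them ever crossed the quorum threshold.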

Worst-case timeline:

| Phase | Duration | Details |
| --- | --- | --- |
| Partition begins, M1 still master | 0 s | M1 thinks it is master, serves writes locally |
| Lease expires (M1 cannot renew) | 4-8 s | M1 cannot get a quorum for renewal |
| Replicas detect lease expiry | ~100 ms | Election timeout, start new election |
| New master M2 elected | 50-150 ms | M2 wins with a higher epoch |
| DNS propagation to clients | 0-30 s | TTL on DNS entries for the Chubby master location |
| All clients redirected | 5-40 s | Total worst-case unavailability window |

Practical mitigations for operators:

# Monitor master lease health
# Alert if master has not renewed lease within 50% of lease_duration
alert: ChubbyMasterLeaseRenewalWarning
  expr: time() - chubby_last_lease_renewal > (lease_duration * 0.5)
  severity: warning

# Use short DNS TTLs for master location
# Recommended: TTL = 5-10 seconds for production cells

# Set lease_duration appropriately
# Short: faster failover (2-4s) but more sensitivity to transient network issues
# Long: more stability (8-12s) but longer failover window

Summary

Chubby was a pioneering coordination service demonstrating how to build production-grade distributed consensus:

  • Production Paxos: Showed how Paxos could work efficiently for real-world use
  • Coarse-grained locking: Provided locks as a service, amortizing coordination overhead
  • Event notifications: Built reactive patterns into the coordination service itself
  • Hierarchical namespace: Made data organization intuitive, like familiar file systems

The 2006 Chubby paper is essential reading for distributed systems practitioners. It explains not just how Chubby works, but why decisions were made, and what trade-offs were accepted.

Chubby itself is now largely historical, but its influence persists. Every time you use ZooKeeper to elect a leader, etcd to store configuration, or Consul for service discovery, you are using systems shaped by Chubby’s design.

The lesson: coordination is a common problem worth solving once and well. Building consensus correctly is hard enough without every team inventing their own poorly-tested version. Chubby showed that a well-designed coordination service could be general enough and efficient enough to serve as infrastructure for large-scale distributed systems.


Related Posts

Apache ZooKeeper: Consensus and Coordination

Explore ZooKeeper's Zab consensus protocol, hierarchical znodes, watches, leader election, and practical use cases for distributed coordination.

#distributed-systems #databases #zookeeper

etcd: Distributed Key-Value Store for Configuration

Deep dive into etcd architecture using Raft consensus, watches for reactive configuration, leader election patterns, and Kubernetes integration.

#distributed-systems #databases #etcd

Google Spanner: Globally Distributed SQL at Scale

Google Spanner architecture combining relational model with horizontal scalability, TrueTime API for global consistency, and F1 database implementation.

#distributed-systems #databases #google