Google Chubby: The Lock Service That Inspired ZooKeeper

Explore Google Chubby's architecture, lock-based coordination, Paxos integration, cell hierarchy, and its influence on distributed systems design.



Google Chubby was Google’s internal lock service and coarse-grained file system, described in a 2006 paper that shaped distributed systems design. Chubby itself was not open-sourced, but its design directly inspired Apache ZooKeeper, which adopted many Chubby concepts and even parts of its API naming.

Chubby solved a core problem: large-scale distributed systems need coordination primitives like locks, leader election, and configuration storage, but building these correctly is notoriously difficult. Chubby provided these as a service, letting developers focus on their problems rather than reinventing coordination code.

Today, Chubby has been largely replaced by other systems within Google, but its influence persists in every ZooKeeper deployment, every etcd cluster, and every system using distributed locks.


Design Philosophy

Chubby was built around a few key observations:

Coarse-grained locks are easier: Acquiring a distributed lock is expensive. Chubby optimized for coarse-grained locks held for seconds or minutes, not fine-grained locks held for milliseconds. The overhead of lock acquisition is amortized over longer hold times.
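The amortization argument can be made concrete with a rough calculation. The latency numbers below are illustrative assumptions, not figures from the Chubby paper:

```python
# Illustrative amortization of lock-acquisition cost over hold time.
# The latency figures are assumptions for the sake of the example.
def acquisition_overhead(acquire_ms: float, hold_ms: float) -> float:
    """Fraction of the lock's total lifetime spent acquiring it."""
    return acquire_ms / (acquire_ms + hold_ms)

# Fine-grained: 10 ms to acquire, held for 50 ms -> large relative overhead
fine = acquisition_overhead(10, 50)
# Coarse-grained: 10 ms to acquire, held for 60 s -> negligible overhead
coarse = acquisition_overhead(10, 60_000)
print(f"fine-grained overhead:   {fine:.1%}")
print(f"coarse-grained overhead: {coarse:.3%}")
```

The same acquisition cost shrinks from roughly a sixth of the lock's lifetime to a fraction of a percent, which is the whole case for coarse-grained locking.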

Locks and files are related: A lock can be naturally represented as a file with specific semantics. A file containing the lock holder’s identity serves both as state and as the lock itself.

Event notification matters: Applications need to know when configuration changes, when a lock is released, or when a machine fails. Chubby built event notification into its API.

Cells partition the namespace: Rather than one flat global namespace, Chubby organized data into cells, each serving a subset of the overall namespace. This isolation simplified deployment and contained failure boundaries.

graph TD
    A[Chubby Cell 1<br/>Bigtable Namespace] --> B[Cell 1 replica 1]
    A --> C[Cell 1 replica 2]
    A --> D[Cell 1 replica 3]

    E[Chubby Cell 2<br/>GFS Namespace] --> F[Cell 2 replica 1]
    E --> G[Cell 2 replica 2]
    E --> H[Cell 2 replica 3]

Cell Migration and Namespace Sharing

As services grew, a single Chubby cell could become a bottleneck. Chubby supports cell migration: moving a service’s data from one cell to another without service interruption. The process involves:

  1. Creating a new cell with the target namespace
  2. Updating the service’s DNS or configuration to point to the new cell
  3. Allowing in-flight operations to complete on the old cell
  4. Decommissioning the old cell once all clients have migrated
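During the transition window of steps 2-3, a client effectively prefers the new cell and falls back to the old one. A minimal sketch of that behavior, using a hypothetical in-memory `Cell` class rather than Chubby's real API:

```python
# Hypothetical sketch of client behavior during cell migration.
# The Cell class and lookup() method are illustrative, not Chubby's API.
class Cell:
    def __init__(self, name, namespace):
        self.name = name
        self.namespace = namespace  # path -> contents

    def lookup(self, path):
        if path not in self.namespace:
            raise KeyError(f"{path} not served by {self.name}")
        return self.namespace[path]

def migrating_lookup(path, new_cell, old_cell):
    """Prefer the new cell; fall back to the old one until migration completes."""
    try:
        return new_cell.lookup(path)
    except KeyError:
        return old_cell.lookup(path)

old = Cell("cell-old", {"/ls/services/search/config": "replicas=5"})
new = Cell("cell-new", {})          # step 1: new cell exists but is not yet populated
print(migrating_lookup("/ls/services/search/config", new, old))  # served by old cell

new.namespace["/ls/services/search/config"] = "replicas=5"       # data migrated
print(migrating_lookup("/ls/services/search/config", new, old))  # served by new cell
```

Once every client resolves through the new cell, the old cell can be decommissioned safely (step 4).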

Namespace sharing across cells was intentionally limited. Each cell originally maintained its own isolated namespace, and there was no built-in cross-cell replication of the namespace itself. Applications that needed cross-cell coordination either used a dedicated shared cell or implemented their own replication layer.

Later versions of Chubby introduced cell federation, allowing a cell to proxy requests to another cell, effectively providing a unified namespace view across multiple cells. This was used by services that needed to span multiple Chubby cells while presenting a single logical namespace to clients.

# Conceptual cell lookup flow
/service/bigtable/leader  -->  Cell A (handles /ls/google/chubby/cells/bigtable)
/service/gfs/master       -->  Cell B (handles /ls/google/chubby/cells/gfs)

# Cell A can proxy to Cell B for namespace access
Cell A -->|proxy| Cell B: lookup /ls/google/chubby/cells/gfs/master

Architecture Overview

A Chubby cell consists of a small set of replicas, typically five, running the same software. Replicas use Paxos to agree on operation order, ensuring consistency even when some replicas fail.

One replica is the master at any given time. All writes go through the master, which replicates them to the other replicas via Paxos. In the published design, reads are also served by the master; clients cache data locally for performance, and the master invalidates those caches when data changes.

graph TD
    C[Client] -->|Write| M[Master]
    C -->|Read| M

    M -->|Replicate| R1[Replica 1]
    M -->|Replicate| R2[Replica 2]
    M -->|Replicate| R3[Replica 3]

    M -.->|Cache invalidation| C

Paxos gives Chubby linearizable reads and writes, the strongest consistency guarantee. This matters for locks, where inconsistent state could make multiple processes think they hold the same lock.

Master Election and Failover Latency

When the master fails, Chubby’s failover process follows a predictable timeline:

sequenceDiagram
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant M as Master
    participant C as Client

    Note over M: Master crashes at T=0
    R1->>R2: Lease expired, starting election
    R2->>R1: I vote for you
    R3->>R1: I vote for you
    Note over R1: Quorum (3/5) reached
    R1->>R2: I am the new master (epoch+1)
    Note over R1: New master at T=~200ms
    R1->>C: New master notification
    Note over C: Client discovers new master via DNS
    Note over C: Typically 1-2 seconds total

| Failover Phase | Typical Duration | Notes |
| --- | --- | --- |
| Lease expiration (master silent) | 4-8 seconds | Configurable per cell; the lease duration sets this window |
| Lease-expiry detection + election start | 50-200 ms | Remaining replicas detect lease expiry via quorum timeout |
| Paxos leader election (propose/accept) | 50-150 ms | Single round of Prepare/Accept with 5 replicas |
| Client DNS update propagation | 0-30 seconds | Depends on TTL; clients cache the old master up to the TTL |
| Handle re-acquisition by clients | 100-500 ms | Clients must re-open file handles |
| Total unavailability window | ~5-10 seconds | Worst case with default lease and DNS TTL |

The paper reports that Chubby cells typically experience 1-2 seconds of unavailability during a planned master migration (e.g., for maintenance), and 5-10 seconds during an unexpected master failure. The lease duration is the primary tuning knob: shorter leases mean faster failover but more instability from transient network issues.

For comparison, ZooKeeper’s leader election typically completes in 100-500ms under similar conditions, while etcd leader election usually finishes within 1 second. Chubby’s slightly longer failover is a trade-off for its stronger lease-based split-brain prevention.


The File System Interface

Chubby presents a Unix-like file system interface rather than a key-value API. Files are organized into directories, forming a path hierarchy.

/ls/google/chubby/cells/bigtable/master
/ls/google/chubby/cells/gfs/primary
/ls/services/search/config
/ls/services/search/locks/query

Each file has:

  • Content: Small amounts of binary data (typically less than 1 MB)
  • ACLs: Access control lists for permissions
  • Metadata: Modification times, lock holders, etc.
  • Locks: Optional advisory locks
# Chubby client operations (conceptual)
open("/ls/services/search/config")      # Open a file
setContents("replicas=5, shard=true")    # Write data
getContents()                           # Read data
acquire("my-process-id")                # Acquire lock
release("my-process-id")                # Release lock

Files can be ephemeral, automatically deleted when the client that created them closes its handle. If a process dies, its ephemeral files disappear, signaling failure to others.
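Ephemeral-file semantics can be simulated in a few lines. The `EphemeralStore` class below is an in-memory illustration, not Chubby's API: a file lives only as long as the handle of the client that created it.

```python
# In-memory sketch of ephemeral files: a file exists only while the
# creating client's handle is open. Names and API are illustrative.
class EphemeralStore:
    def __init__(self):
        self.files = {}   # path -> (owner client id, data)

    def create_ephemeral(self, path, client_id, data):
        self.files[path] = (client_id, data)

    def close_client(self, client_id):
        """Simulate a client crash/disconnect: its ephemeral files vanish."""
        self.files = {p: v for p, v in self.files.items() if v[0] != client_id}

    def exists(self, path):
        return path in self.files

store = EphemeralStore()
store.create_ephemeral("/ls/services/search/primary", "worker-1", b"worker-1:8080")
print(store.exists("/ls/services/search/primary"))   # True: worker-1 is alive

store.close_client("worker-1")                       # worker-1 dies
print(store.exists("/ls/services/search/primary"))   # False: others observe the failure
```

This is the basic building block for failure detection: other processes watch the file, and its disappearance is the failure signal.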


Advisory Locks

Chubby provides advisory locks, not mandatory ones. A process can access a file without acquiring its lock, so coordination works only if participants are disciplined about actually using the locks.

# Conceptual Chubby lock acquisition
lock = chubby.lock("/ls/services/search/locks/query")

if lock.acquire(blocking=True, timeout=30):
    try:
        # Critical section - we hold the lock
        config = read_config()
        do_query(config)
    finally:
        lock.release()
else:
    # Could not acquire lock
    raise Exception("Lock acquisition timed out")

The advisory nature of the locks means misbehaving clients can ignore them. In practice, this is rarely a problem because correct clients follow the protocol, and incorrect clients fail in predictable ways that trigger alerts.


Events and Notifications

Clients subscribe to events on files and directories. When things happen, Chubby notifies the client:

  • File modified: Content of a watched file changed
  • Lock acquired/released: Someone acquired or released a lock
  • Child added/removed: A child directory or file was created or deleted
  • Master failure: The cell master failed and a new one was elected
  • Connection alive/broken: Connection to Chubby cell state changed
# Subscribe to file change events
handle = chubby.openFile("/ls/services/search/config")
handle.subscribe(chubby.FILE_MODIFIED, callback=reload_config)

# Subscribe to lock events
lock = chubby.lock("/ls/services/search/locks/master")
lock.subscribe(chubby.LOCK_ACQUIRED, callback=i_am_master_now)

This event mechanism enables reactive configuration. Rather than polling, clients receive notifications when relevant state changes. For large-scale systems, this reduces load on clients and the coordination service.


Chubby and Paxos

Under the hood, Chubby uses Paxos for consensus among replicas. The Chubby paper clarified how Paxos integrates into a production system:

  • Database backing store: Chubby uses a simple database (originally BerkeleyDB, later custom) to store state
  • Instance numbers: Each Paxos instance is numbered, and Chubby runs many instances in parallel for different operations
  • Epoch numbers: Masters are identified by epoch numbers, preventing old masters from acting after a new master is elected
  • Lease mechanism: The master holds a lease from replicas, renewing it periodically
sequenceDiagram
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant M as Master

    Note over M: Prepare to renew lease
    M->>R1: Prepare(N=5)
    M->>R2: Prepare(N=5)
    M->>R3: Prepare(N=5)
    R1-->>M: Promise(N=5,v=4)
    R2-->>M: Promise(N=5,v=4)
    R3-->>M: Promise(N=5,v=4)
    Note over M: Majority promised!
    M->>R1: Accept(N=5,v=5)
    M->>R2: Accept(N=5,v=5)
    M->>R3: Accept(N=5,v=5)

The lease mechanism ensures that if the master fails, replicas wait until the lease expires before accepting a new master. This prevents split-brain where two masters think they are in charge.

Paxos Integration Deep Dive

The original Paxos paper leaves several implementation details unspecified. Chubby’s production deployment fills in these gaps:

Epoch Numbers (Master Identity):

Epoch numbers are monotonically increasing integers that identify a master’s term. When a new master is elected, it increments the epoch number. Any message from an old epoch (lower epoch number) is rejected by replicas. This solves the “old master waking up” problem: if a master crashes and a new one is elected, but the old master recovers and tries to issue commands, replicas reject them because the old master’s epoch is stale.

# Simplified epoch check on replica
def handle_master_message(master_epoch, master_id, operation):
    if master_epoch < self.current_epoch:
        raise Rejected(f"Stale epoch {master_epoch} < {self.current_epoch}")
    if master_epoch == self.current_epoch and master_id != self.master_id:
        raise Rejected(f"Master ID mismatch")
    # Accept the operation
    self.process(operation)

Proposal Numbers (Operation Ordering):

Each Paxos instance (representing one Chubby operation) has a unique proposal number. Proposal numbers are structured as (round_number, replica_id) tuples, compared lexicographically. The higher round number wins, and if round numbers tie, the higher replica ID breaks the tie. This total ordering ensures that even if two proposers compete, only one value can be chosen per instance.

# Proposal number comparison
def compare_proposals(p1: tuple[int, int], p2: tuple[int, int]) -> int:
    """Returns 1 if p1 > p2, -1 if p1 < p2, 0 if equal."""
    if p1[0] != p2[0]:
        return 1 if p1[0] > p2[0] else -1
    if p1[1] != p2[1]:
        return 1 if p1[1] > p2[1] else -1
    return 0

Chubby runs thousands of Paxos instances in parallel, one per operation. The instance number identifies which operation is being agreed upon, while proposal numbers sequence competing proposals within that instance.
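The relationship between instance numbers and proposal numbers can be sketched as a log keyed by instance, where each slot is decided independently and the highest proposal number wins within a slot. This is a heavily simplified, single-process illustration of the bookkeeping, not the network protocol:

```python
# Simplified sketch: a Paxos-style log where each instance (slot) is handled
# independently, and within a slot a higher proposal number supersedes a
# lower one. This models only the bookkeeping, not the real protocol.
class PaxosLog:
    def __init__(self):
        self.slots = {}   # instance number -> (proposal number, value)

    def propose(self, instance, proposal, value):
        """Accept the proposal if its number exceeds what this slot has seen."""
        current = self.slots.get(instance)
        if current is None or proposal > current[0]:
            self.slots[instance] = (proposal, value)
            return True
        return False

log = PaxosLog()
# Two instances (operations) are agreed on in parallel:
log.propose(instance=1, proposal=(1, 0), value="setContents(/a, x)")
log.propose(instance=2, proposal=(1, 2), value="acquire(/lock)")
# Within instance 1, a higher proposal number wins; a stale one is rejected:
log.propose(instance=1, proposal=(2, 1), value="setContents(/a, y)")
print(log.slots[1][1])   # "setContents(/a, y)"
```

Note how `(round_number, replica_id)` tuples compare lexicographically exactly as the text describes, since Python compares tuples element by element.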

Lease Mechanism in Detail:

The master lease works as a timed promise from replicas to the master. The lease has a fixed duration (typically seconds, configured per cell). The master must renew the lease before it expires or lose mastership. This is different from Raft’s heartbeat approach:

| Aspect | Chubby Master Lease | Raft Leader Heartbeat |
| --- | --- | --- |
| Renewal trigger | Time-based, before expiry | Periodic heartbeat |
| Validity period | Explicit lease interval | Election timeout margin |
| Stale leader detection | Lease expiry timeout | Heartbeat timeout |
| Replica promise | "I will not accept others" | "I will not vote for others" |

During normal operation, the master renews its lease well before expiry. If the master fails to renew (crash, network partition), replicas wait until the lease expires before accepting a new master. This grace period is the “lease timeout” and typically ranges from hundreds of milliseconds to a few seconds.

# Conceptual lease renewal timeline
T=0s      Master acquires lease (lease_duration=5s)
T=3s      Master renews lease (at 60% of duration)
T=6s      Master renews lease again
T=9s      Master renews lease again
T=11s     Master crashes before its next renewal
T=14s     Lease expires (last renewal + lease_duration)
T=14s+    Replicas accept a new master election
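A renewal timeline like this corresponds to a simple loop. The simulation below uses a plain counter for time, and the 60%-of-lease renewal threshold is an assumption taken from the timeline, not a documented Chubby constant:

```python
# Simulated master lease renewal. Time is a plain counter, not wall-clock.
# The 60% renewal threshold is an illustrative assumption.
LEASE_DURATION = 5.0
RENEW_AT = 0.6 * LEASE_DURATION   # renew once 60% of the lease has elapsed

def run_master(crash_at, end):
    """Return (renewal times, time at which the lease finally expires)."""
    lease_start, renewals = 0.0, []
    t = 0.0
    while t < end:
        if t >= crash_at:
            break                      # master crashes and stops renewing
        if t - lease_start >= RENEW_AT:
            lease_start = t            # successful renewal resets the lease clock
            renewals.append(t)
        t += 0.5
    return renewals, lease_start + LEASE_DURATION

renewals, expiry = run_master(crash_at=11.0, end=20.0)
print(renewals)   # [3.0, 6.0, 9.0]
print(expiry)     # 14.0: lease expires 5s after the last successful renewal
```

The key property is that expiry is measured from the last successful renewal, which is why replicas must wait out that full window before electing a replacement.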

Database Backing Store:

Chubby originally used BerkeleyDB as its backing store, later replacing it with a custom database. The database stores the Paxos log and application state. Each write goes to the WAL (write-ahead log) before being applied, enabling crash recovery. The database also stores the cell’s configuration, lock states, and file metadata.
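A write-ahead log in the sense described here can be sketched minimally: every mutation is appended to the log before the in-memory state is updated, so replaying a surviving log reconstructs the state after a crash. This is an illustrative toy, not Chubby's on-disk format:

```python
# Minimal write-ahead-log sketch: append to the log first, apply second,
# and recover after a crash by replaying the log. Illustrative only.
class WalStore:
    def __init__(self):
        self.log = []     # durable append-only log (a list stands in for disk)
        self.state = {}   # in-memory application state

    def set(self, key, value):
        self.log.append(("set", key, value))   # 1. write-ahead
        self.state[key] = value                # 2. apply

    @classmethod
    def recover(cls, log):
        """Rebuild state by replaying a log that survived a crash."""
        store = cls()
        for op, key, value in log:
            if op == "set":
                store.state[key] = value
        store.log = list(log)
        return store

store = WalStore()
store.set("/ls/services/search/config", "replicas=5")
store.set("/ls/services/search/config", "replicas=7")

# Simulate a crash in which only the log survives, then replay it.
recovered = WalStore.recover(store.log)
print(recovered.state["/ls/services/search/config"])   # "replicas=7"
```

Because the log entry always lands before the state change, a crash between the two steps loses nothing: replay simply re-applies the entry.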


Use Cases at Google

Within Google, Chubby handled several critical coordination tasks:

Leader Election: For Bigtable and GFS, one server needed to be primary. Chubby’s lock mechanism enabled reliable leader election without complex consensus code in each system.

Configuration Storage: Service configurations, feature flags, and routing tables lived in Chubby. Event notifications let services reload without restart.

Naming Service: Before DNS was sufficient, Chubby served as a naming service. Machine names, service endpoints, and discovery information lived in Chubby.

Distributed Locks: Semaphores and barriers for distributed batch jobs used Chubby locks to coordinate across thousands of machines.
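A distributed barrier of the kind mentioned above can be built from Chubby-style files: each worker creates a child file under a barrier directory, and the barrier opens once the expected number of children exists. The sketch below is an in-memory simulation with an illustrative API, in the style of Chubby/ZooKeeper barrier recipes:

```python
# In-memory sketch of a barrier built from files under a directory,
# in the style of Chubby/ZooKeeper barrier recipes. API is illustrative.
class Barrier:
    def __init__(self, expected):
        self.expected = expected
        self.children = set()   # stands in for files under a barrier directory

    def arrive(self, worker_id):
        """A worker announces arrival by creating its child file."""
        self.children.add(worker_id)

    def is_open(self):
        return len(self.children) >= self.expected

barrier = Barrier(expected=3)
for worker in ["w1", "w2"]:
    barrier.arrive(worker)
print(barrier.is_open())   # False: only 2 of 3 workers have arrived

barrier.arrive("w3")
print(barrier.is_open())   # True: all workers arrived, the barrier opens
```

In a real deployment, waiting workers would watch the directory for child-added events rather than polling `is_open()`.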

Original Bigtable Dependency Context

Chubby and Bigtable have a deep historical relationship that shaped Google’s early infrastructure. Bigtable was designed to be a distributed storage system for structured data, capable of scaling to petabytes across thousands of machines. But Bigtable’s original design assumed the existence of a coordination service: tablets needed to know who the master was, where to find their tablet server, and how to recover from failures.

Chubby was built, in part, to solve this coordination problem for Bigtable. The original Bigtable paper (2006) describes how it uses Chubby for:

  • Master election: The Bigtable master uses a Chubby exclusive lock to ensure only one master exists at a time
  • Tablet location: Bigtable stores tablet location metadata in Chubby files, allowing clients to find the right tablet server
  • Server lease management: Tablet servers maintain Chubby locks; if a tablet server loses its lock (crashed or network partition), the master knows to reassign its tablets
# Bigtable/Chubby interaction example
/chubby/cells/bigtable/master     # Bigtable master election lock
/chubby/cells/bigtable/tablets/  # Directory containing tablet location data
/chubby/cells/bigtable/tserver/  # Ephemeral files for live tablet servers

This tight coupling meant that Bigtable could not function if its Chubby cell was unavailable. In practice, Google ran multiple independent Chubby cells, and each Bigtable cluster was assigned to a specific cell. The Bigtable master would retry Chubby operations with exponential backoff, and would stop serving if the Chubby cell was unreachable for an extended period.

The dependency was unidirectional: Bigtable depended on Chubby, but Chubby did not depend on Bigtable. Chubby’s backing store (originally BerkeleyDB) was independent of Bigtable. This asymmetry is important: had Chubby stored its state in Bigtable, the dependency would have been circular, complicating recovery.

Over time, both systems evolved. Bigtable’s later incarnation (2012+) as Cloud Bigtable changed some of these assumptions, and Google’s internal infrastructure replaced Chubby with more specialized systems. The original design illustrates how coordination services tend to be foundational: they are typically designed early and replaced late.


The ZooKeeper Connection

The ZooKeeper paper explicitly acknowledges Chubby’s influence. ZooKeeper adopted several concepts:

  • The hierarchical file system namespace (though ZooKeeper uses znodes)
  • Ephemeral nodes for service discovery
  • Watches for event notification
  • The general lock abstraction for coordination

However, ZooKeeper made significant departures:

  • Watches are one-time in ZooKeeper vs. persistent in Chubby: ZooKeeper simplified by making watches fire once and require re-registration
  • ZooKeeper is a coordination service, not a lock service: ZooKeeper provides primitives that can be combined into locks, rather than locking as a primary feature
  • ZooKeeper separates consensus from caching: Clients can cache data while still getting notifications
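The one-time vs. persistent watch difference can be simulated in a few lines, without any real client library:

```python
# Simulation of the watch-semantics difference: a Chubby-style persistent
# watch keeps firing on every change, while a ZooKeeper-style one-time
# watch fires once and must be re-registered. Purely illustrative.
class WatchedFile:
    def __init__(self):
        self.persistent, self.one_time = [], []

    def watch_persistent(self, cb):
        self.persistent.append(cb)

    def watch_once(self, cb):
        self.one_time.append(cb)

    def modify(self):
        for cb in self.persistent:
            cb()
        fired, self.one_time = self.one_time, []   # one-time watches are consumed
        for cb in fired:
            cb()

counts = {"chubby": 0, "zk": 0}
f = WatchedFile()
f.watch_persistent(lambda: counts.__setitem__("chubby", counts["chubby"] + 1))
f.watch_once(lambda: counts.__setitem__("zk", counts["zk"] + 1))

f.modify()
f.modify()   # persistent watch fires again; one-time watch was already consumed
print(counts)   # {'chubby': 2, 'zk': 1}
```

ZooKeeper's choice trades convenience for a simpler server: the client must re-register after each event, and may miss changes that occur in between.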

Chubby vs ZooKeeper vs etcd: Direct Comparison

| Dimension | Chubby | ZooKeeper | etcd |
| --- | --- | --- | --- |
| Consensus protocol | Paxos (internal) | Zab (atomic broadcast protocol) | Raft |
| API style | File system interface | Hierarchical znodes | Flat key-value with key prefixes |
| Language origin | C++ | Java | Go |
| Client libraries | C++, Java | Java, C, Python, Go | Go, Java, Python, Ruby |
| Watches | Persistent (stay registered) | One-time (must re-register) | Streaming (v3 watches stay registered) |
| Lock support | Advisory locks as first-class | Via recipes (Curator) | Via leases and transactions |
| ACLs | UNIX-like permissions | ACL per znode | RBAC with roles |
| Scalability | Cell-based sharding | Single ensemble | Single cluster |
| Performance | ~100k requests/sec per cell | ~10k-100k requests/sec | ~10k+ requests/sec |
| Leader leases | Yes (prevents split-brain) | No (Zab handles it) | Yes (Raft leader lease) |
| Observers | No | Yes (read scaling) | No (but Raft learner mode) |
| Auth methods | Internal Google auth | SASL/Kerberos | mTLS, username/password |
| Typical use | Internal Google (historical) | Hadoop/Kafka coordination | Kubernetes configuration |

Key takeaways:

  • Chubby was pioneering but closed-source and largely replaced inside Google
  • ZooKeeper is the direct Chubby descendant, battle-tested in big data ecosystems
  • etcd offers strong performance and a clean API, and is often preferred for new deployments
graph TD
    A[Chubby 2006] --> B[ZooKeeper 2008]
    A --> C[etcd 2013]
    B --> D[Kafka, HBase, Drill]
    C --> E[Kubernetes, CoreOS]

Limitations and Evolution

Chubby has known limitations that shaped its evolution:

Single-cell scalability: Early cells became bottlenecks, requiring sharding into multiple cells.

Event delivery delays: In rare cases, event delivery could be delayed. Applications needed their own timeouts and fallback mechanisms.

Not for high-frequency changes: Chubby was designed for configuration and coordination, not high-frequency data. Clients were expected to cache and poll for high-frequency updates.

Replaced by other systems: Within Google, Chubby has been largely replaced by more scalable systems. Spanner handles distributed transactions, and other specialized systems handle service discovery.


When to Use / When Not to Use Chubby

When to Use Chubby

Chubby was designed for coarse-grained distributed coordination at Google. If you need a similar system today, there are a few scenarios where it remains the right choice. Chubby’s advisory locks and ephemeral nodes give you a solid abstraction for lock acquisition and leader election among distributed processes, making it straightforward to pick a leader without building custom consensus logic. The hierarchical namespace also maps naturally to naming and service discovery use cases, where you want to organize services and endpoints in a familiar directory structure. Finally, if you have small, rarely-changed configuration that all participants need to agree on, Chubby’s event notifications mean services can reload without restart when configuration changes.

When Not to Use Chubby

Chubby is the wrong tool in several situations. High-frequency data access is one of them, since Chubby cells handle low throughput and cannot accommodate anything more than occasional reads and writes without becoming a bottleneck. Fine-grained coordination is another case where Chubby falls short, because its design centers on coarse-grained locks where one master oversees an entire system. If you need cross-cell namespace sharing, you will hit Chubby’s limitation where each cell’s namespace is isolated from others. For modern deployments outside Google, Chubby is not actively maintained, so ZooKeeper, etcd, or Consul are better choices with active community support.

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Master lease expiration | Master loses mastership; all clients must re-acquire handles; brief unavailability | Chubby handles lease renewal automatically; if the master fails, clients discover the new master via DNS |
| Cell overload | Chubby cell becomes a bottleneck under high request load | Distribute coordination across multiple Chubby cells; move non-critical clients to different cells |
| Event delivery delays | In rare cases, lock acquisition events are delayed; applications holding stale locks may cause conflicts | Applications should implement their own timeouts and re-check lock state before taking action |
| DNS cache staleness | When the master moves, DNS entries take time to propagate; clients may briefly contact the old master | Use short DNS TTLs; clients should handle OperationTimeout gracefully and retry discovery |
| File handle exhaustion | Too many clients holding handles to the same file saturates the handle table | Monitor handle count; release handles promptly; design for brief handle lifetimes |
| Incompatible schema changes | DDL changes to the Chubby database itself require a cell restart | Coordinate schema changes during maintenance windows; test on non-production cells first |

Master Lease Expiry: Split-Brain and Recovery

The most critical failure scenario is master lease expiry leading to a potential split-brain situation. Understanding the exact sequence matters for operating Chubby cells safely.

What is split-brain in this context?

Split-brain occurs when two replicas believe they are the master simultaneously. Chubby’s lease mechanism prevents true split-brain during the lease validity window, but a narrow window opens during the lease expiry transition.

Failure sequence:

sequenceDiagram
    participant M as Master M1
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant M2 as New Master M2

    Note over M: Master M1 lease expires at T=0
    Note over M: M1 is network-partitioned
    R1->>R2: Lease expired, no response from M1
    R2->>R1: Agreed, starting election
    R1->>R2: I vote for R1
    R2->>R1: Confirmed
    Note over R1: R1 elected new master M2 (epoch+1)
    Note over M: M1 still thinks it is master!
    Note over M: M1 serving writes locally during partition
    Note over M: But cannot replicate to quorum
    Note over M2: M2 holds valid lease, accepts writes
    Note over R1,R2,R3: Replicas ignore M1 (stale epoch)
    Note over M: At T=0, M1 lost quorum-backed lease
    Note over M: Clients retry, discover M2 via DNS

Key protection mechanism:

The epoch number is the critical protection. When M2 is elected, it increments the epoch. Even if M1 (the partitioned master) recovers and tries to issue commands, replicas check the epoch and reject M1’s messages because (M1_epoch < M2_epoch). M1’s writes during the partition window are lost from the perspective of the quorum, but they never contaminated the committed Paxos log.

Recovery procedure:

  1. M1 detects its lease has expired (no renewal acknowledgment from quorum)
  2. M1 stops accepting new writes locally
  3. M1 initiates a rejoin procedure: it fetches the latest Paxos state from the quorum
  4. M1 discovers it is no longer master (higher epoch exists)
  5. M1 transitions to a replica and syncs its state to the latest Paxos log entry
  6. M1 resumes normal operation as a replica
# Simplified master recovery pseudocode
def on_lease_expired():
    stop_accepting_writes()
    current_epoch = fetch_current_epoch_from_quorum()
    if current_epoch > my_epoch:
        my_epoch = current_epoch
        my_role = REPLICA
        sync_state_from_master()
    else:
        # We might still be master, retry election
        start_leader_election()

What happens to M1’s writes during partition?

M1 could have accepted writes during the partition that never reached the quorum. These writes are effectively “tentative” and are discarded during recovery. Clients that read from M1 during the partition might have seen uncommitted state. For this reason, critical operations should use the WAIT_ANY flag to wait for acknowledgment from the quorum, not just the master.
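The tentative-vs-committed distinction comes down to quorum acknowledgment: a write counts as committed only once a majority of replicas acknowledge it. A minimal sketch (the replica count and helper function are illustrative):

```python
# Sketch: a write is committed only if a majority of replicas acknowledge it.
# A partitioned master can still append locally, but those writes stay tentative.
REPLICAS = 5
QUORUM = REPLICAS // 2 + 1   # 3 of 5

def commit_status(acks: int) -> str:
    """Classify a write by how many replicas (including the master) acked it."""
    return "committed" if acks >= QUORUM else "tentative"

print(commit_status(acks=4))   # healthy master reaches quorum -> "committed"
print(commit_status(acks=1))   # partitioned master, only itself -> "tentative"
```

This is why M1's partition-era writes vanish harmlessly: with at most its own acknowledgment, none of them ever crossed the quorum threshold.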

Worst-case timeline:

| Phase | Duration | Details |
| --- | --- | --- |
| Partition begins, M1 still master | 0 s | M1 thinks it is master, serves writes locally |
| Lease expires (M1 cannot renew) | 4-8 s | M1 cannot get a quorum for renewal |
| Replicas detect lease expiry | ~100 ms | Election timeout, start new election |
| New master M2 elected | 50-150 ms | M2 wins with a higher epoch |
| DNS propagation to clients | 0-30 s | TTL on DNS entries for the Chubby master location |
| All clients redirected | 5-40 s | Total worst-case unavailability window |

Practical mitigations for operators:

# Monitor master lease health
# Alert if master has not renewed lease within 50% of lease_duration
alert: ChubbyMasterLeaseRenewalWarning
  expr: time() - chubby_last_lease_renewal > (lease_duration * 0.5)
  severity: warning

# Use short DNS TTLs for master location
# Recommended: TTL = 5-10 seconds for production cells

# Set lease_duration appropriately
# Short: faster failover (2-4s) but more sensitivity to transient network issues
# Long: more stability (8-12s) but longer failover window

Summary

Chubby was a pioneering coordination service demonstrating how to build production-grade distributed consensus:

  • Production Paxos: Showed how Paxos could work efficiently for real-world use
  • Coarse-grained locking: Provided locks as a service, amortizing coordination overhead
  • Event notifications: Built reactive patterns into the coordination service itself
  • Hierarchical namespace: Made data organization intuitive, like familiar file systems

The 2006 Chubby paper is essential reading for distributed systems practitioners. It explains not just how Chubby works, but why decisions were made, and what trade-offs were accepted.

Chubby itself is now largely historical, but its influence persists. Every time you use ZooKeeper to elect a leader, etcd to store configuration, or Consul for service discovery, you are using systems shaped by Chubby’s design.

The lesson: coordination is a common problem worth solving once and well. Building consensus correctly is hard enough without every team inventing their own poorly-tested version. Chubby showed that a well-designed coordination service could be general enough and efficient enough to serve as infrastructure for large-scale distributed systems.


Related Posts

Apache ZooKeeper: Consensus and Coordination

Explore ZooKeeper's Zab consensus protocol, hierarchical znodes, watches, leader election, and practical use cases for distributed coordination.

#distributed-systems #databases #zookeeper

etcd: Distributed Key-Value Store for Configuration

Deep dive into etcd architecture using Raft consensus, watches for reactive configuration, leader election patterns, and Kubernetes integration.

#distributed-systems #databases #etcd

Google Spanner: Globally Distributed SQL at Scale

Google Spanner architecture combining relational model with horizontal scalability, TrueTime API for global consistency, and F1 database implementation.

#distributed-systems #databases #google