Apache ZooKeeper: Consensus and Coordination

Explore ZooKeeper's Zab consensus protocol, hierarchical znodes, watches, leader election, and practical use cases for distributed coordination.


Apache ZooKeeper: Consensus and Coordination for Distributed Systems

Yahoo! Research started ZooKeeper in 2007 as a Hadoop subproject. The problem they tackled is universal in distributed systems: coordinating multiple processes is hard, and everyone was writing their own coordination code badly.

The insight: distributed coordination has common patterns. Rather than every system reinventing leader election, distributed locks, and configuration management, ZooKeeper could provide these as building blocks.

ZooKeeper became the coordination layer for HBase, Kafka, Drill, and many others relying on leader election and cluster state management.


Data Model: Hierarchical Namespaces

ZooKeeper exposes a hierarchical namespace of znodes, like a filesystem. Each znode holds data and can have children, forming a tree.

/zookeeper           # Reserved namespace for ZooKeeper internals
/config              # Application configuration
/config/cluster      # Cluster-wide settings
/services            # Service registry
/services/api        # API service endpoints
/locks               # Distributed locks
/election            # Leader election nodes

Unlike filesystems, ZooKeeper znodes are not directories for general use. They are lightweight coordination primitives for small amounts of critical data: configuration values, lock states, leader information.

# Create a znode (persistent by default)
create /config/database "db.example.com:5432"

# Get znode data
get /config/database

# Create an ephemeral znode (deleted when client disconnects)
create -e /services/api/worker-1 "192.168.1.10:8080"

# Create a sequential znode (ZooKeeper appends a monotonically increasing counter)
create -s /locks/job- "worker-1"
# Result: /locks/job-0000000001

Znodes have special flags:

  • Ephemeral: Automatically deleted when client session ends
  • Sequential: ZooKeeper appends a monotonically increasing counter to the name
  • Container: For leader election and similar patterns; deleted when empty
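For reference, the sequence suffix is taken from the parent znode's child-version counter and zero-padded to ten digits. A tiny sketch of the naming convention (an illustration, not ZooKeeper's actual server code):

```java
// Sketch of how ZooKeeper builds sequential znode names: the server appends
// the parent's child-version counter formatted as a 10-digit zero-padded number.
public class SequentialName {
    static String format(String prefix, int counter) {
        return prefix + String.format("%010d", counter);
    }

    public static void main(String[] args) {
        System.out.println(format("/locks/job-", 1));   // /locks/job-0000000001
        System.out.println(format("/locks/job-", 42));  // /locks/job-0000000042
    }
}
```

Because the names sort lexicographically, a plain string sort of the children yields creation order, which the lock and election recipes below rely on.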

Sessions and Watches

Client sessions are the foundation. Connect to ZooKeeper and establish a session with a configurable timeout. Sessions are maintained via periodic pings; miss too many and the session expires.
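The requested timeout is not taken at face value: the server clamps it to its configured bounds, which default to 2x and 20x tickTime. A sketch of that negotiation, assuming the default bounds:

```java
// Sketch of server-side session timeout negotiation: the client's requested
// timeout is clamped to [minSessionTimeout, maxSessionTimeout], which default
// to 2x and 20x tickTime respectively.
public class SessionTimeout {
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;   // minSessionTimeout default
        int max = 20 * tickTimeMs;  // maxSessionTimeout default
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // With the default tickTime of 2000 ms, a 30 s request is accepted as-is...
        System.out.println(negotiate(30000, 2000)); // 30000
        // ...but a 1 s request is raised to the 4 s floor.
        System.out.println(negotiate(1000, 2000));  // 4000
    }
}
```

This is why a client asking for a very short timeout still sees multi-second session expiry: the floor wins.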

// Java client example
ZooKeeper zk = new ZooKeeper(
    "zoo1:2181,zoo2:2181,zoo3:2181",  // Connection string
    30000,                              // Session timeout (ms)
    event -> {
        switch (event.getState()) {
            case SyncConnected:
                System.out.println("Connected to ZooKeeper");
                break;
            case Disconnected:
                System.out.println("Disconnected from ZooKeeper");
                break;
            case Expired:
                System.out.println("Session expired");
                break;
        }
    }
);

Watches are one-time notifications when a znode changes. Set a watch, the znode changes, ZooKeeper sends a notification.

// Set a watch and handle changes
zk.getData("/config/database", event -> {
    if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
        System.out.println("Configuration changed!");
        // Reload configuration
    }
}, new Stat());
sequenceDiagram
    participant C as Client
    participant Z as ZooKeeper

    C->>Z: getData("/config", watch=true)
    Z-->>C: Current data

    Note over Z: Data changed
    Z->>C: Watch triggered
    C->>Z: getData("/config", watch=true)
    Z-->>C: New data

Watches are one-time triggers. After a watch fires, you must set a new watch to keep monitoring. This simplifies the server implementation but requires the client to re-arm on each notification.


The Zab Protocol

ZooKeeper uses ZooKeeper Atomic Broadcast (Zab) for consensus. Zab is similar to Paxos in guarantees (linearizable writes) but different in approach, optimized for ZooKeeper’s workload.

Zab has two main phases:

Recovery (Catch-up): When the leader fails, a new leader takes over and followers synchronize with it. The new leader is guaranteed to hold the most recent committed updates; followers replay any transactions they missed.

Broadcast: During normal operation, leader broadcasts transactions to followers. Transaction commits when a majority acknowledges.

graph TD
    A[Leader] -->|PROPOSE| B[Follower 1]
    A -->|PROPOSE| C[Follower 2]
    A -->|PROPOSE| D[Follower 3]

    B -->|ACK| A
    C -->|ACK| A
    D -->|ACK| A

    A -->|COMMIT| B
    A -->|COMMIT| C
    A -->|COMMIT| D

The key property: ZooKeeper guarantees FIFO ordering per client. If a client submits A then B, ZooKeeper applies A before B, even across leader changes.
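Zab stamps every transaction with a 64-bit zxid: the high 32 bits identify the leader epoch and the low 32 bits count transactions within that epoch (you can see one in the "srvr" command output later in this article). A small sketch of decoding one:

```java
// Decode a zxid into its two components: leader epoch (high 32 bits)
// and per-epoch transaction counter (low 32 bits).
public class Zxid {
    static long epoch(long zxid)   { return zxid >>> 32; }
    static long counter(long zxid) { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long z = 0x10000002cL;
        System.out.println("epoch=" + epoch(z) + " counter=" + counter(z)); // epoch=1 counter=44
    }
}
```

The epoch component is how Zab preserves ordering across leader changes: a new leader bumps the epoch, so transactions from a deposed leader can never be mistaken for newer ones.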


Leader Election

ZooKeeper implements leader election through ephemeral sequential znodes:

public class LeaderElection {
    private static final String ELECTION_PATH = "/election";

    public void runForLeadership() throws KeeperException, InterruptedException {
        // Create ephemeral sequential znode
        String znode = zk.create(
            ELECTION_PATH + "/candidate-",
            myAddress.getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EphemeralSequential
        );

        // Get all candidates sorted by znode name (which includes sequence number)
        List<String> candidates = zk.getChildren(ELECTION_PATH, false);
        Collections.sort(candidates);

        // If my znode sorts first, I'm the leader
        String myName = znode.substring(znode.lastIndexOf('/') + 1);
        if (candidates.get(0).equals(myName)) {
            System.out.println("I am the leader!");
        } else {
            // Otherwise watch the immediate predecessor (not the leader znode)
            // so that a leader failure wakes only one candidate
        }
    }
}

When the leader crashes, its ephemeral znode disappears and the next candidate in sequence becomes leader. Kafka uses this pattern for controller leader election.
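The snippet above only detects whether you won. In practice, each non-leader should watch its immediate predecessor rather than the leader znode, so a failure wakes exactly one candidate instead of the whole herd. A sketch of the predecessor selection (pure list logic, no ZooKeeper calls):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Given the children of the election znode, a candidate is leader if its
// znode sorts first; otherwise it should watch only its immediate predecessor.
public class Predecessor {
    // Returns null if `mine` is the leader, else the znode name to watch.
    static String toWatch(List<String> children, String mine) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted);
        int idx = sorted.indexOf(mine);
        return idx <= 0 ? null : sorted.get(idx - 1);
    }

    public static void main(String[] args) {
        List<String> c = List.of("candidate-0000000003",
                                 "candidate-0000000001",
                                 "candidate-0000000002");
        System.out.println(toWatch(c, "candidate-0000000001")); // null -> leader
        System.out.println(toWatch(c, "candidate-0000000003")); // candidate-0000000002
    }
}
```

When the watched predecessor disappears, the candidate re-runs this check rather than assuming leadership, since an intermediate candidate may have crashed instead.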


Distributed Locks

ZooKeeper implements distributed locks via ephemeral znodes and watches:

  1. Create an ephemeral sequential znode under the lock path
  2. Get all children and sort them
  3. If your znode is first, you hold the lock
  4. If not, watch the previous znode
  5. When the previous znode disappears, re-check whether you are now first

public class DistributedLock {
    private static final String LOCK_PATH = "/locks/my-resource";

    public boolean acquire() throws KeeperException, InterruptedException {
        // Create an ephemeral sequential znode
        myZnode = zk.create(LOCK_PATH + "/lock-",
            myId.getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EphemeralSequential
        );
        String myName = myZnode.substring(myZnode.lastIndexOf('/') + 1);

        // If our znode sorts first, we hold the lock
        List<String> locks = zk.getChildren(LOCK_PATH, false);
        Collections.sort(locks);
        if (locks.get(0).equals(myName)) {
            return true;  // We hold the lock
        }

        // Otherwise watch only the immediate predecessor to avoid a herd effect
        String previous = locks.get(locks.indexOf(myName) - 1);
        zk.getData(LOCK_PATH + "/" + previous, watch, new Stat());
        return false;  // Not our turn yet; retry when the watch fires
    }

    public void release() throws KeeperException, InterruptedException {
        zk.delete(myZnode, -1);
    }
}

This lock implementation is correct but has a scalability limitation: every acquisition creates a znode and reads the full children list. For high-contention locks, consider systems optimized for that workload, such as etcd or Redis-based locks.


Common Use Cases

ZooKeeper handles several coordination patterns:

Service Discovery: Register services as ephemeral znodes, discover by reading the directory. Service dies, znode disappears automatically.

/services/http/api    # Ephemeral znode for each API instance

Configuration Management: Store configuration in ZooKeeper, watch for changes. Config updates notify all interested parties.

/config/application   # Contains JSON or properties
/config/database
/config/features

Leader Election: As shown above, for selecting a primary in a replicated system.

Barrier: Coordinate when multiple processes should proceed. Processes block while a barrier znode exists (or until enough participants have registered), then all proceed together.

Double Barrier: Ensure processes start and end together. Enter when N processes arrive, release when N exit.


Working with ZooKeeper

The ZooKeeper distribution includes a Java client and a CLI. For Java applications, most teams use Apache Curator, which adds retry logic, connection management, and higher-level recipes.

# Start ZooKeeper in standalone mode (testing only)
bin/zkServer.sh start-foreground

# CLI commands
zkCli.sh -server localhost:2181
[zookeeper] ls /
[zookeeper] create /test "hello"
[zookeeper] get /test
[zookeeper] set /test "world"
[zookeeper] delete /test

// Using Curator for higher-level patterns
CuratorFramework client = CuratorFrameworkFactory.newClient(
    "zoo1:2181,zoo2:2181,zoo3:2181",
    new RetryNTimes(3, 1000)
);
client.start();

// Leader latch (simpler leader election)
LeaderLatch latch = new LeaderLatch(client, "/election/leader");
latch.start();
latch.await();
if (latch.hasLeadership()) {
    // I am the leader
}

Limitations and Best Practices

Limitations:

  • Not for data storage: ZooKeeper is for coordination, not application data. Keep znodes small (KB range).
  • Scalability ceiling: Single ensemble handles roughly 10k-100k requests per second. For more, use multiple ensembles or etcd.
  • Watch reliability: Watches are one-time; handle missed notifications during reconnection.
  • Session timeout: Too-aggressive timeouts cause spurious disconnections during GC pauses.

Best practices:

  • Use an odd number of nodes (3, 5, 7) for fault tolerance
  • Co-locate ZooKeeper with low-latency network links
  • Monitor the outstanding requests queue length
  • Set appropriate tickTime, initLimit/syncLimit, and maxClientCnxns limits
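The odd-node recommendation falls out of simple quorum arithmetic: a write needs a strict majority, so an ensemble of N voters tolerates floor((N-1)/2) failures.

```java
// Quorum arithmetic behind the "odd number of nodes" rule: a 4-node ensemble
// tolerates no more failures than a 3-node one, but pays more per write.
public class Quorum {
    static int quorumSize(int n)        { return n / 2 + 1; }
    static int toleratedFailures(int n) { return (n - 1) / 2; }

    public static void main(String[] args) {
        for (int n : new int[]{3, 4, 5}) {
            System.out.println(n + " nodes: quorum=" + quorumSize(n)
                + ", tolerates " + toleratedFailures(n) + " failure(s)");
        }
    }
}
```

A 4th node only raises the quorum from 2 to 3 without improving fault tolerance, which is why even ensemble sizes waste capacity.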

When to Use / When Not to Use ZooKeeper

When to Use ZooKeeper

ZooKeeper shines for distributed locking and leader election. It offers simple, proven recipes using ephemeral nodes and sequences — locks, barriers, and leader election all come out of the box without custom implementation. Configuration management works well too: you store shared configuration in ZooKeeper and watch for changes, updating all interested parties in real time. Service discovery fits naturally since you can register endpoints, discover active instances, and track health through ephemeral nodes. If you need barrier or queue patterns, ZooKeeper’s built-in recipes cover most coordination needs.

When Not to Use ZooKeeper

ZooKeeper stumbles when data access is frequent. It’s not a database, so storing anything beyond small coordination metadata quickly causes performance problems. At very large scale, a single ensemble maxes out around 10k-100k requests per second — if you need more, split across multiple ensembles or consider migrating to etcd. Multi-tenant environments also pose challenges since without namespace isolation, workloads can interfere with each other. And if you’re running cloud-native workloads without StatefulSets, the operational overhead of managing ZooKeeper becomes significant — operators or managed services are easier paths.

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Ensemble loses quorum (N/2+ nodes down) | ZooKeeper becomes unavailable; all coordination operations halt | Run an odd number of nodes (3, 5, 7); spread ensemble members across availability zones; monitor ZK service health |
| Leader election storm | Network blip triggers leader election; all in-flight operations fail | Set appropriate tickTime, initLimit, and syncLimit; ensure network stability; avoid GC pauses that look like failures |
| Session timeout misconfiguration | Clients disconnect during normal GC pauses; ephemeral nodes disappear; locks are released | Keep tickTime at 2000 ms or higher; set maxSessionTimeout generously; handle Disconnected events gracefully |
| Watch delivery delay | During leader election, watch notifications are delayed; clients miss state changes | Monitor watch count per client; set an appropriate session timeout; avoid watch storms |
| Snapshot size explosion | Database snapshot grows too large; restart takes too long | Set snapCount appropriately; enable auto-purge; keep the data directory on fast storage |
| Client reconnect flood | When ZooKeeper recovers, thousands of clients reconnect simultaneously and overwhelm the leader | Use exponential backoff on reconnect; stagger client startup times; rate-limit reconnection attempts |
| Znode data too large | Storing multi-MB data in a znode degrades performance | Keep znode data under 1 MB; store large data elsewhere and reference it (e.g. S3/HDFS paths) |

ACLs and Authentication

ZooKeeper supports fine-grained access control through ACLs. Each znode has its own ACL list governing who can read, write, delete, or administer it.

SASL/Kerberos authentication:

# In zoo.cfg - enable SASL authentication
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
# Requires kinit setup on each ZooKeeper node and client

IP-based ACLs:

// Restrict a znode to specific IP addresses
List<ACL> acl = Arrays.asList(
    new ACL(Perms.READ, new Id("ip", "192.168.1.0/24")),
    new ACL(Perms.ALL, new Id("ip", "192.168.1.100"))
);
zk.create("/config/database", data, acl, CreateMode.PERSISTENT);

The danger of OPEN_ACL_UNSAFE:

// This is in the ZooKeeper source code - OPEN_ACL_UNSAFE grants full permissions to anyone
ZooDefs.Ids.OPEN_ACL_UNSAFE  // = Perms.ALL for scheme "world", id "anyone"

// NEVER use OPEN_ACL_UNSAFE in production:
zk.create("/locks/my-lock", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EphemeralSequential);
// Any client that can reach ZooKeeper can delete or modify this znode!

Production ACL patterns:

// Use world:anyone with READ only for public znodes
new ACL(Perms.READ, new Id("world", "anyone"))

// Use auth: anyone authenticated can access
new ACL(Perms.ALL, new Id("auth", ""))

// Digest authentication - username/password based
Id digestId = new Id("digest", DigestAuthenticationProvider.generateDigest("admin:admin123"));
new ACL(Perms.ALL, digestId)
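For reference, the digest scheme hashes the user:password pair with SHA-1 and base64-encodes the result. The sketch below mirrors what DigestAuthenticationProvider.generateDigest produces; it is my own reimplementation for illustration, not the ZooKeeper class itself:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Reimplementation (for illustration) of ZooKeeper's digest ACL scheme:
// "user:password" -> "user:" + base64(SHA-1("user:password")).
public class DigestAcl {
    static String generateDigest(String idPassword) {
        try {
            String user = idPassword.split(":", 2)[0];
            byte[] sha = MessageDigest.getInstance("SHA-1")
                    .digest(idPassword.getBytes(StandardCharsets.UTF_8));
            return user + ":" + Base64.getEncoder().encodeToString(sha);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // SHA-1 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(generateDigest("admin:admin123"));
    }
}
```

Pre-computing digests like this is handy when you need to put a hashed super-user credential into zoo.cfg without pulling in the ZooKeeper jar.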

SASL (Kerberos) in a Hadoop-style cluster:

# In zoo.cfg for a Kerberized cluster
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
kerberos.removeHostAddress=true
kerberos.removeUserAddress=true

# Client JVM needs jaas.conf:
# Client {
#   com.sun.security.auth.module.Krb5LoginModule required
#   useKeyTab=true keyTab="/path/to/keytab" principal="zookeeper/node@REALM";
# };

Multi-Cluster Deployment

Running separate ZooKeeper ensembles for different applications provides fault isolation and prevents noisy-neighbor problems.

When to use separate ensembles:

| Scenario | Recommendation |
|---|---|
| Kafka and HBase share a cluster | Split into dedicated ensembles |
| Production and staging share | Always use separate ensembles |
| Different security requirements | Separate ensembles with different ACLs |
| Different latency requirements | Separate ensembles co-located per tier |

Deployment architecture:

# Ensemble A - for Kafka (3 nodes, low latency)
/zoo-a/  # 10.0.1.1:2181, 10.0.1.2:2181, 10.0.1.3:2181

# Ensemble B - for HBase (3 nodes, low latency)
/zoo-b/  # 10.0.2.1:2181, 10.0.2.2:2181, 10.0.2.3:2181

# Ensemble C - for monitoring/coordination (3 nodes)
/zoo-c/  # 10.0.3.1:2181, 10.0.3.2:2181, 10.0.3.3:2181

Client configuration for specific ensemble:

// Connect to only the correct ensemble
ZooKeeper zk = new ZooKeeper(
    "zoo-a-1:2181,zoo-a-2:2181,zoo-a-3:2181",  // Only Kafka ensemble
    30000,
    event -> { /* handler */ }
);

Naming conventions in shared namespace deployments:

If you run one ensemble but need logical separation, use top-level znodes as namespaces:

# Logical namespace per application (but same physical ensemble)
/
/kafka        # Kafka's znodes
/hbase        # HBase's znodes
/monitoring   # Monitoring service's znodes

This is less isolated than separate ensembles but simpler to operate. Use separate ensembles when one application’s load or faults could impact another.


Rolling Upgrade Procedures

ZooKeeper supports rolling upgrades where you upgrade one node at a time while keeping the ensemble available.

Pre-upgrade checklist:

# 1. Verify ensemble is healthy
echo "srvr" | nc localhost 2181
# Look for mode: leader or mode: follower, and all connections established

# 2. Check for any pending transactions
echo "mntr" | nc localhost 2181 | grep zk_pending
# Should be near zero before upgrading

# 3. Back up current data
tar -czf /backup/zookeeper-$(date +%Y%m%d).tar.gz /var/lib/zookeeper/data

Rolling upgrade steps:

# For each ZooKeeper node (one at a time):

# 1. Stop the node
systemctl stop zookeeper

# 2. Update ZooKeeper binary to new version
# rpm -Uvh zookeeper-3.9.x-server.rpm  (or apt, tarball, etc.)

# 3. Start the node
systemctl start zookeeper

# 4. Wait for node to rejoin and catch up
sleep 10
echo "srvr" | nc localhost 2181
# Verify the node rejoined: Mode should be follower or leader

# 5. Verify ensemble health before proceeding to next node
echo "srvr" | nc any-healthy-node 2181
# Check that every node reports Mode: leader or follower and accepts connections

Version compatibility:

| Upgrade Path | Behavior |
|---|---|
| 3.4.x -> 3.5.x | Rolling upgrade supported |
| 3.5.x -> 3.6.x | Rolling upgrade supported |
| 3.4.x -> 3.6.x | Go through 3.5.x first |
| 3.4/3.5 -> 3.9.x | Upgrade one minor version at a time |
| Major version jump | Requires full cluster restart |

Never upgrade more than one minor version at a time. For example, 3.4 -> 3.6 should go through 3.5.


Disk I/O Sensitivity

ZooKeeper is extremely sensitive to disk I/O latency, particularly WAL fsync operations.

Why ZooKeeper is I/O sensitive:

# Every write operation:
#   1. Append the transaction to the on-disk transaction log
#   2. fsync the log to durable storage
#   3. Update in-memory state
#   4. Acknowledge to the client only after the fsync completes

# fsync latency therefore directly bounds write throughput:
#   fsync P99 = 10ms -> max ~100 writes/second per node
#   fsync P99 = 1ms  -> max ~1000 writes/second per node
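The back-of-envelope numbers above come straight from the fact that fsyncs serialize writes, so per-node throughput is bounded by the inverse of fsync latency:

```java
// Write-throughput ceiling when every write requires a serialized fsync:
// max writes/second = 1000 ms / fsync latency in ms.
public class FsyncCeiling {
    static double maxWritesPerSecond(double fsyncLatencyMs) {
        return 1000.0 / fsyncLatencyMs;
    }

    public static void main(String[] args) {
        System.out.println(maxWritesPerSecond(10.0)); // 100.0
        System.out.println(maxWritesPerSecond(1.0));  // 1000.0
    }
}
```

In practice ZooKeeper batches some log writes, so real throughput can exceed this naive bound, but fsync latency remains the dominant factor.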

SSD requirements for ZooKeeper:

| Environment | Minimum | Recommended |
|---|---|---|
| Development | SATA SSD | NVMe SSD |
| Production | NVMe SSD | NVMe SSD with >50K IOPS |
| WAN deployment | NVMe SSD + separate disk | Dedicated NVMe for dataLogDir |

Disk configuration:

# Use separate disks for dataDir and dataLogDir (transaction log)
# dataLogDir should be on the fastest disk (SSD/NVMe)
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/txnlog  # NVMe SSD

# In zoo.cfg - Preallocates transaction log files
# Increase if you have many writes
preAllocSize=65536

# Snapshots - less I/O sensitive but still important
# Increase snapshot count to reduce frequency
snapCount=100000

# Monitor disk I/O latency
iostat -x 1
# Key metric: avgqu-sz should stay < 4 for sustained periods
# If avgqu-sz > 16, your disk is severely overloaded

Do not use network-attached storage (NAS, NFS, EBS network volumes) for ZooKeeper data or transaction logs. Network latency adds directly to fsync latency and causes follower lag that triggers leader elections.


Curator Recipes Catalog

Curator is the most widely-used ZooKeeper Java client. Beyond locks, it provides higher-level recipes:

GroupMember recipe (for service membership):

// GroupMember tracks which nodes are currently active members of a group
// Much simpler than building this with ephemeral znodes manually

CuratorFramework client = CuratorFrameworkFactory.newClient(
    "zoo1:2181,zoo2:2181,zoo3:2181",
    new RetryNTimes(3, 1000)
);
client.start();

GroupMember<String> group = new GroupMember<String>(
    client,
    "/services/myapp/members",  // Group path
    UUID.randomUUID().toString()  // Our member ID
);

// Listen for membership changes
group.addMembershipChangeListener(event -> {
    System.out.println("Members: " + event.getMembers());
});

// Start - this creates our ephemeral znode
group.start();

// On shutdown
group.close();
client.close();

DistributedDoubleBarrier recipe (processes start and end together):

// DoubleBarrier - enter N processes, block until all arrive, then release all together
// Useful for distributed batch jobs that need all participants ready before starting

DistributedDoubleBarrier barrier = new DistributedDoubleBarrier(
    client,
    "/barriers/batch-job",
    3  // Wait for 3 processes
);

barrier.enter();  // Blocks until 3 processes call enter()
System.out.println("All participants ready - starting batch job");

// ... do batch work ...

barrier.leave();  // Blocks until 3 processes call leave()
System.out.println("All participants finished");

Other useful Curator recipes:

// LeaderLatch - simple leader election (non-blocking)
LeaderLatch latch = new LeaderLatch(client, "/election/leader");
latch.start();
latch.await();  // Block until we become leader
if (latch.hasLeadership()) {
    // We are leader
}

// DistributedQueue - FIFO queue backed by ZooKeeper znodes
DistributedQueue<String> queue = new DistributedQueue<String>(
    client,
    "/queues/work-queue",
    new StringCodec()
);
queue.put("task-1");
String task = queue.take();  // Blocks until item available

// DistributedPriorityQueue - items consumed in priority order
DistributedPriorityQueue<Integer> priorityQueue =
    new DistributedPriorityQueue<Integer>(client, "/queues/priority", 5);

When to use Curator recipes vs. raw ZooKeeper:

| Use Case | Recommendation |
|---|---|
| Leader election | Curator LeaderLatch (simpler) |
| Distributed locks | Curator InterProcessMutex |
| Double barriers | Curator DistributedDoubleBarrier |
| Service discovery/membership | Curator GroupMember |
| Configuration management | Raw ZooKeeper + watches (Curator is overkill) |

Four-Letter Command Monitoring

ZooKeeper exposes management commands over its client port (default 2181) as four-letter words.

The three most useful four-letter commands:

# srvr - server statistics
echo "srvr" | nc localhost 2181
# Output:
# Zookeeper version: 3.9.3
# Latency min/avg/max: 0/0/3
# Packets: sent=12345 recv=12344
# Outstanding: 0
# Zxid: 0x10000002c
# Mode: leader
# Node count: 1543

# stat - connection statistics + srvr
echo "stat" | nc localhost 2181
# Adds: Connected count, IP list, Session info

# mntr - machine-readable metrics (key and value, tab-separated;
# commonly scraped by monitoring exporters)
echo "mntr" | nc localhost 2181
# zk_version  3.9.3
# zk_server_state  leader
# zk_avg_latency  0.5
# zk_max_latency  15
# zk_min_latency  0
# zk_packets_sent  12345
# zk_packets_received  12344
# zk_num_alive_connections  2
# zk_pending_syncs  0
# zk_znode_count  1543
# zk_ephemerals_count  45
# zk_approximate_data_size  524288
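Since each mntr line is just a key and a value separated by whitespace, parsing the output for a homegrown monitoring agent is straightforward. A small sketch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Parse "mntr" output (one whitespace-separated key/value pair per line)
// into a map, e.g. for feeding a custom metrics agent.
public class MntrParser {
    static Map<String, String> parse(String output) {
        Map<String, String> metrics = new LinkedHashMap<>();
        for (String line : output.split("\n")) {
            String[] kv = line.trim().split("\\s+", 2);
            if (kv.length == 2 && kv[0].startsWith("zk_")) {
                metrics.put(kv[0], kv[1]);
            }
        }
        return metrics;
    }

    public static void main(String[] args) {
        String sample = "zk_version\t3.9.3\nzk_avg_latency\t0.5\nzk_server_state\tleader\n";
        Map<String, String> m = parse(sample);
        System.out.println(m.get("zk_server_state")); // leader
    }
}
```

Filtering on the zk_ prefix skips any banner or error lines the server may emit before the metrics themselves.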

Key mntr metrics to alert on:

| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| zk_num_alive_connections | > 100 | > 500 |
| zk_pending_syncs | > 10 | > 100 |
| zk_outstanding_requests | > 10 | > 100 |
| zk_avg_latency | > 10 ms | > 100 ms |
| zk_max_latency | > 100 ms | > 1000 ms |
| zk_ephemerals_count | > 10000 | > 50000 |
| zk_watch_count | > 50000 | > 200000 |

Automating four-letter commands in scripts:

#!/bin/bash
# health_check.sh - run via cron or monitoring agent

ZK_HOST="${ZK_HOST:-localhost}"
ZK_PORT="${ZK_PORT:-2181}"

# Check if ZooKeeper is responding
if echo "ruok" | nc -w 3 $ZK_HOST $ZK_PORT | grep -q "imok"; then
    echo "ZooKeeper is healthy"
else
    echo "ZooKeeper is NOT healthy"
    exit 1
fi

# Check latency
LATENCY=$(echo "mntr" | nc -w 3 $ZK_HOST $ZK_PORT | grep "^zk_avg_latency" | cut -f2)
if (( $(echo "$LATENCY > 10" | bc -l) )); then
    echo "WARNING: ZooKeeper latency is ${LATENCY}ms"
fi

On ZooKeeper 3.5+, four-letter commands must be explicitly whitelisted (e.g. 4lw.commands.whitelist=ruok,srvr,stat,mntr in zoo.cfg). Enable them only on internal network interfaces, never on public IPs: they expose detailed cluster state that could aid attackers.


initLimit and syncLimit Tuning for WAN Deployments

ZooKeeper has two timing parameters that become critical when ensemble members are spread across WAN locations.

tickTime (base timing unit in ms, default: 2000):

# tickTime = 2000ms (default)
# All other timing values are expressed in tickTime units

initLimit (ticks allowed for a follower's initial sync with the leader):

# initLimit = 10 (default)
# Time allowed = 10 * tickTime = 20 seconds
# Increase this if followers need more time to catch up on initial startup

syncLimit (ticks allowed for follower to sync with leader during normal operation):

# syncLimit = 5 (default)
# Time allowed = 5 * tickTime = 10 seconds
# This is how far behind a follower can drift before being considered dead

WAN tuning example (us-east to us-west):

# WAN RTT between regions: ~70ms
# tickTime = 2000ms
# initLimit = 100  # 200 seconds for initial sync (followers may need to load a
#                  # large snapshot and replay a long transaction log)
# syncLimit = 10   # 20 seconds for sync (catches transient network blips)

# In zoo.cfg:
tickTime=2000
initLimit=100
syncLimit=10

Practical syncLimit tuning guidance:

| Network Type | RTT | syncLimit Recommendation |
|---|---|---|
| Same datacenter | < 1 ms | syncLimit=5 (default) |
| Same region (AZ) | 1-5 ms | syncLimit=5-10 |
| Cross-region | 20-100 ms | syncLimit=10-50 |
| WAN/multi-region | 100-300 ms | syncLimit=50-150 |

initLimit tuning for large data volumes:

# If your ZooKeeper database is large (>1GB),
# a follower may need significant time to load snapshot + replay WAL

# Estimate initLimit needed:
# initLimit >= (time to load snapshot + WAL replay) / tickTime

# For 5GB database with slow disk:
# Snapshot load: 30 seconds
# WAL replay (100K entries): 20 seconds
# Total: 50 seconds
# initLimit = 50 / 2 = 25 (at tickTime=2000ms)

# Recommendation: always over-provision initLimit for WAN
# better to have extra time than to have followers excluded
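The arithmetic above, as code: initLimit must cover snapshot load plus transaction-log replay, expressed in tickTime units and rounded up.

```java
// Estimate the minimum initLimit: total recovery time in ms, divided by
// tickTime, with ceiling division so partial ticks are not dropped.
public class InitLimit {
    static int required(int snapshotLoadSec, int replaySec, int tickTimeMs) {
        int totalMs = (snapshotLoadSec + replaySec) * 1000;
        return (totalMs + tickTimeMs - 1) / tickTimeMs;  // ceiling division
    }

    public static void main(String[] args) {
        // 30 s snapshot load + 20 s replay at tickTime=2000ms -> initLimit >= 25
        System.out.println(required(30, 20, 2000)); // 25
    }
}
```

Treat the result as a floor, not a target: over-provisioning initLimit costs nothing in normal operation, while undershooting it excludes slow followers from the ensemble.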

Monitoring when syncLimit is too tight:

# Check for disconnections in logs
grep "closing client connection" /var/log/zookeeper/zookeeper.log

# Or via four-letter command
echo "srvr" | nc localhost 2181 | grep "Outstanding"
# If Outstanding stays above zero continuously, the server is falling behind
# on requests, which can eventually push followers past syncLimit


Summary

ZooKeeper provides essential coordination primitives for distributed systems:

  • Leader election: Through ephemeral sequential znodes
  • Distributed locks: Through znode ordering and watches
  • Configuration management: Through persistent znodes with watches
  • Service discovery: Through ephemeral znodes that vanish with the services that registered them

The Zab protocol delivers linearizable writes essential for coordination correctness. Without this guarantee, distributed locks could have race conditions that appear correct but fail under leader elections.

ZooKeeper’s age shows in some design decisions: one-time watches need client re-arming, the C client library has rough edges. But for leader election and coordination, ZooKeeper remains solid and widely-deployed.

For greenfield projects, etcd is often a better choice today with a cleaner API and better performance. But ZooKeeper’s maturity and battle-testing in production Hadoop and Kafka clusters keep it relevant, especially when integrating with existing systems.


Observer Nodes: Read Scaling Extension

Standard ZooKeeper ensembles require a majority of voters to acknowledge writes. This caps write scalability, and adding followers to scale reads makes it worse, since every new voter also participates in the write quorum.

Observer nodes solve this by participating in ZooKeeper without voting. They receive proposals from the leader and apply transactions, but do not count toward majority for write commitment.

graph TD
    A[Leader] -->|PROPOSE| B[Follower 1]
    A -->|PROPOSE| C[Follower 2]
    A -->|PROPOSE| D[Observer 1]
    A -->|PROPOSE| E[Observer 2]

    B -->|ACK| A
    C -->|ACK| A
    D -.->|No ACK| A
    E -.->|No ACK| A

    A -->|COMMIT| B
    A -->|COMMIT| C
    A -->|COMMIT| D
    A -->|COMMIT| E

Why use observers:

| Without Observers | With Observers |
|---|---|
| Reads limited by follower count | Scale reads independently |
| Cross-region reads add WAN latency | Observers in each region serve local reads |
| All voters participate in the write quorum | Only voting members are needed for writes |
| Adding nodes increases write cost | Add observers without affecting writes |

Configuration:

# In zoo.cfg for observer node
peerType=observer
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
server.4=zoo4:2888:3888:observer   # Observer

Use case - geo-distributed reads:

us-east-1: 3 voters (leader preferred here)
us-west-2: 2 observers for local reads
eu-west-1: 1 observer for local reads

Clients in us-west-2 can read from their local observer without crossing the WAN. Writes still go to us-east-1 majority.
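Because observers never vote, the write quorum depends only on the voter count. A sketch for the layout above (3 voters plus 3 observers):

```java
// The write quorum is a majority of voters only; observers are deliberately
// excluded, which is why adding them never slows down writes.
public class ObserverQuorum {
    static int writeQuorum(int voters, int observers) {
        return voters / 2 + 1;  // observers intentionally ignored
    }

    public static void main(String[] args) {
        System.out.println(writeQuorum(3, 0)); // 2
        System.out.println(writeQuorum(3, 3)); // 2 -- observers don't change it
    }
}
```

The flip side: observers add no fault tolerance either, so the voter set still determines how many failures the ensemble survives.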

#distributed-systems #leader-election #consensus