etcd: Distributed Key-Value Store for Configuration

Deep dive into etcd architecture using Raft consensus, watches for reactive configuration, leader election patterns, and Kubernetes integration.



etcd started at CoreOS as a coordination service built on the Raft consensus algorithm. Its job is storing configuration, detecting failures, and coordinating leader elections in distributed systems. Today etcd is best known as Kubernetes’ backing store, holding all cluster state: pods, services, deployments, configuration.

The name comes from the Unix "/etc" directory, which holds configuration, plus "d" for "distributed": etcd was meant to be the "/etc" of distributed systems.


Core Architecture

etcd is a distributed, consistent, highly-available key-value store. Unlike databases optimized for arbitrary queries, etcd handles:

  • Small amounts of critical data (configuration, locks, leader election)
  • Strong consistency (no eventual consistency)
  • Fast distributed coordination primitives

The architecture uses Raft to maintain a replicated log. The Raft paper is titled “In Search of an Understandable Consensus Algorithm”, and etcd’s implementation is considered the reference implementation.

graph TD
    A[Client] -->|Read/Write| B[Leader]
    B -->|Replicate| C[Follower 1]
    B -->|Replicate| D[Follower 2]
    B -->|Replicate| E[Follower 3]

    C -.->|Heartbeat| B
    D -.->|Heartbeat| B
    E -.->|Heartbeat| B

    B --> F[(Raft Log)]
    C --> G[(Raft Log)]
    D --> H[(Raft Log)]
    E --> I[(Raft Log)]

In a Raft cluster, one node is the leader and handles all writes. Writes go through the leader's replicated log and are acknowledged only after a majority of nodes have durably committed them. Reads can be served by any node, but followers may return stale data unless you request a linearizable read (the default in the etcd v3 API).


The Raft Consensus Algorithm

Raft achieves consensus by electing a leader and replicating a log of commands. The algorithm has three parts:

Leader Election:

  • Nodes start as followers
  • Followers become candidates if they miss leader communication within the election timeout
  • Candidates request votes from other nodes
  • Majority votes = new leader

Log Replication:

  • Leader accepts commands from clients
  • Appends to local log
  • Replicates to followers via AppendEntries messages
  • Command on majority of logs = applied to state machine, client notified

Safety:

  • A candidate can win an election only if its log contains all committed entries
  • Committed entries survive any failure of a minority of nodes

// Simplified Raft write flow in etcd (illustrative, not the exact API)
func (s *Server) Propose(ctx context.Context, cmd []byte) error {
    // 1. Register a channel that is signaled once the command
    //    has been committed and applied
    ch := s.wait.Register(cmd)

    // 2. Submit the command to the Raft log
    if err := s.node.Propose(ctx, cmd); err != nil {
        return err
    }

    // 3. Block until a majority has replicated the entry and the
    //    apply loop has executed it against the state machine
    select {
    case res := <-ch:
        return res.Err
    case <-ctx.Done():
        return ctx.Err()
    }
}

The key insight: as long as a majority agrees on log contents, the cluster is consistent. Leader crashes = new leader with most complete log elected.
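The majority rule is simple arithmetic. A quick sketch (function names are mine) of quorum size and fault tolerance for common cluster sizes:

```go
package main

import "fmt"

// quorum returns the number of votes a candidate needs to win an
// election, and the number of acks a leader needs to commit an
// entry, in an n-member cluster: a strict majority.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members can fail while the
// cluster still retains a quorum.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 3, 4, 5, 7} {
		fmt.Printf("cluster=%d quorum=%d tolerates=%d failure(s)\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

This is also why clusters use odd sizes: a 4-node cluster tolerates the same single failure as a 3-node one, yet every write needs one more ack.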


Data Model and API

etcd v3 stores data in a flat key-value namespace. There are no real directories; hierarchy is a convention built from slash-separated key prefixes, which range queries and watches exploit. Keys are strings, values can be arbitrary bytes.

# Store a simple value
etcdctl put /services/api/server1 "10.0.0.1:8080"

# Retrieve the value
etcdctl get /services/api/server1

# Store JSON configuration
etcdctl put /config/database '{"host": "db.example.com", "port": 5432}'

# List all keys under a prefix
etcdctl get --prefix /services/

The etcd v3 API is served over gRPC, which you can call directly or via client libraries:

import clientv3 "go.etcd.io/etcd/client/v3"

cli, _ := clientv3.NewFromURLs([]string{"http://localhost:2379"})
defer cli.Close()

// Write a value (errors ignored for brevity)
cli.Put(context.Background(), "/config/featureflags", `{"newUI": true}`)

// Read a value
resp, _ := cli.Get(context.Background(), "/config/featureflags")
fmt.Println(string(resp.Kvs[0].Value))

// Atomic compare-and-swap: flip the flag only if it still holds the old value
cli.Txn(context.Background()).If(
    clientv3.Compare(clientv3.Value("/config/featureflags"), "=", `{"newUI": false}`),
).Then(
    clientv3.OpPut("/config/featureflags", `{"newUI": true}`),
).Commit()

Transactions enable atomic compare-and-swap operations, essential for distributed locks and leader election.


Watches and Reactive Configuration

Native watch support is etcd’s most powerful feature. Subscribe to key changes instead of polling, and get notified immediately when data changes.

// Watch a single key
watchChan := cli.Watch(context.Background(), "/services/api/")
for resp := range watchChan {
    for _, event := range resp.Events {
        fmt.Printf("%s %q: %q\n", event.Type, event.Kv.Key, event.Kv.Value)
    }
}

// Watch an entire prefix recursively
watchChan := cli.Watch(context.Background(), "/services/", clientv3.WithPrefix())

Watches in etcd are cheap and scalable because they are true subscriptions, not polling. The etcd server tracks which clients watch which keys and pushes updates directly.

sequenceDiagram
    participant K as Kubernetes Controller
    participant E as etcd Server
    participant W as Watch Stream

    K->>E: Watch /pods/
    E->>W: Create subscription
    W-->>K: Initial state

    Note over E: Pod created
    E->>W: Push new event
    W-->>K: Pod event notification
    K->>K: Reconcile pod state

This watch mechanism lets Kubernetes controllers react immediately to cluster changes without polling.


Authentication and RBAC

etcd ships with built-in role-based access control. Before enabling auth, you create users and bind them to roles that permit specific key prefixes.

# Create a user with password authentication
etcdctl user add root
# Add user to the root role (full access)
etcdctl user grant-role root root

# Enable authentication
etcdctl auth enable

# Now all requests require credentials
etcdctl --user root:password put /services/api/server1 "10.0.0.1:8080"

Built-in roles:

  • root: full read/write/admin access on all keys
  • guest: read access on all keys (default role)
  • etcd-cluster: read/write on cluster communication keys

Custom roles for least privilege:

# Create a role scoped to a key prefix
etcdctl role add app-config
etcdctl role grant-permission app-config readwrite /config/app/

# Create a user and assign the role
etcdctl user add app-service
etcdctl user grant-role app-service app-config

Without RBAC, any client that can reach etcd port 2379 has full cluster access. In Kubernetes, the etcd RBAC layer is typically bypassed because the API server handles authentication — but for standalone etcd deployments, RBAC is the first line of defense.


TLS/mTLS Setup for Cluster Communication

etcd supports encrypted communication between members and clients via mutual TLS. Each member and client presents a certificate; both sides verify the other’s identity.

Certificate requirements:

  • Each etcd member needs a server certificate with its hostname/IP as Subject Alternative Names (SANs)
  • Each client needs a client certificate for authentication
  • The CA certificate must be trusted by all members and clients

# Generate certificates using cfssl or openssl
# Example using openssl - simplified

# 1. Create the CA
openssl genrsa -out ca-key.pem 2048
openssl req -x509 -new -nodes -key ca-key.pem -days 365 -out ca-crt.pem \
  -subj "/CN=etcd-ca"

# 2. Create member certificate with SANs for all cluster IPs
openssl genrsa -out member-key.pem 2048
# SANs: etcd-1, etcd-2, etcd-3, 192.168.1.1, 192.168.1.2, 192.168.1.3
openssl req -new -key member-key.pem -out member.csr \
  -subj "/CN=etcd-member"
# Sign with CA...

# 3. Client certificates (no SAN required)
openssl genrsa -out client-key.pem 2048
openssl req -new -key client-key.pem -out client.csr \
  -subj "/CN=etcd-client"
# Sign with CA...

Cluster member flags:

# On each member, specify certificates and require client certs (mTLS)
etcd \
  --cert-file=/etc/etcd/ssl/server.crt \
  --key-file=/etc/etcd/ssl/server.key \
  --trusted-ca-file=/etc/etcd/ssl/ca.crt \
  --client-cert-auth \
  --peer-cert-file=/etc/etcd/ssl/peer.crt \
  --peer-key-file=/etc/etcd/ssl/peer.key \
  --peer-trusted-ca-file=/etc/etcd/ssl/ca.crt \
  --peer-client-cert-auth \
  --initial-cluster etcd-1=https://192.168.1.1:2380,etcd-2=https://192.168.1.2:2380 \
  --listen-peer-urls https://0.0.0.0:2380 \
  --listen-client-urls https://0.0.0.0:2379

Client connection with TLS:

# Connect a client with certificates
etcdctl --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/client.crt \
  --key=/etc/etcd/ssl/client.key \
  endpoint health

Without TLS in production, network-level attackers can read all cluster state including Kubernetes secrets. TLS also prevents man-in-the-middle attacks that could inject false configuration into your cluster.


Lease Revocation Edge Cases

etcd leases attach TTLs to keys. When a lease expires, all keys attached to it are deleted atomically. The edge case is what happens when the lease holder crashes before renewal.

Grace period behavior:

// When a lease holder crashes:
// 1. The lease keeps counting down on the etcd server
// 2. If no keepalive arrives before the TTL expires, etcd revokes the lease
// 3. All keys attached to the lease are deleted atomically
// 4. Watchers on those keys receive delete events

// Effective grace period ~ (lease TTL) + (leader election time):
// lease expiry is driven by the leader, so while no leader exists
// the lease can be neither renewed nor revoked

Practical implications:

  • Holder crashes, lease TTL = 10s: keys deleted ~10s after the crash (plus network delay)
  • Leader fails during grace period: new leader elected; the old holder's lease is revoked once the TTL expires and its keys are deleted
  • Network partition (holder isolated): the isolated holder cannot renew; the majority side revokes the lease after the TTL expires
  • Holder pauses (e.g. GC or VM stall): if the pause exceeds the lease TTL, keys are deleted; use a longer TTL plus a reliable keepalive

Recommendation: set lease TTL at 2-5x your expected renewal interval, and implement retry logic with exponential backoff on the client side.


Namespace Isolation for Multi-Tenant Environments

etcd has no native multi-tenant namespace abstraction. Isolation is achieved through key prefix conventions and RBAC roles.

Prefix-based tenancy:

# Tenant A's keys
etcdctl role add tenant-a
etcdctl role grant-permission tenant-a readwrite /tenants/a/

# Tenant B's keys
etcdctl role add tenant-b
etcdctl role grant-permission tenant-b readwrite /tenants/b/

# Each tenant's application only gets credentials granting access to their prefix
# Kubernetes uses this pattern: /registry/...

For Kubernetes itself, the convention is:

# Kubernetes state lives under these prefixes
/registry/apiregistration.k8s.io/apiservices/
/registry/services/endpoints/
/registry/pods/
/registry/secrets/
/registry/configmaps/

Multi-tenant considerations:

  • Tenant key sprawl: enforce naming conventions; monitor key count per prefix
  • Resource contention: separate etcd clusters per tenant at scale
  • Quota limits: etcd has no per-key-prefix quota; implement application-level checks
  • RBAC complexity: keep the role-per-prefix mapping simple; automate role provisioning

At scale (hundreds of tenants), running a dedicated etcd cluster per tenant (automated with a tool such as the etcd Operator) becomes more practical than prefix-based isolation on a single cluster.
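Prefix tenancy is easy to mirror client-side. A small sketch (helper names are mine) of scoping keys and checking ownership the same way an RBAC grant on a /tenants/ prefix would:

```go
package main

import (
	"fmt"
	"strings"
)

// tenantKey scopes a key under a tenant prefix, mirroring the
// /tenants/<id>/ convention above.
func tenantKey(tenant, key string) string {
	return "/tenants/" + tenant + "/" + strings.TrimPrefix(key, "/")
}

// ownedBy reports whether a key falls inside a tenant's prefix,
// the same check etcd's RBAC enforces server-side for a role
// granted readwrite on /tenants/<id>/.
func ownedBy(tenant, key string) bool {
	return strings.HasPrefix(key, "/tenants/"+tenant+"/")
}

func main() {
	k := tenantKey("a", "/config/db")
	fmt.Println(k)                            // /tenants/a/config/db
	fmt.Println(ownedBy("a", k), ownedBy("b", k)) // true false
}
```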


Key Monitoring Metrics

Monitor these metrics to catch etcd problems before they affect your cluster:

Critical metrics:

# Using Prometheus with etcd metrics endpoint (--listen-metrics-urls)
# Metrics endpoint: http://localhost:2381/metrics

# Proposal progress - how far applied entries lag behind committed entries
etcd_server_leader_changes_seen_total
etcd_server_proposals_committed_total
etcd_server_proposals_applied_total
# A growing committed-minus-applied gap means a member is falling behind

# WAL fsync latency - disk performance indicator
etcd_disk_wal_fsync_duration_seconds_bucket
# P99 should be < 10ms; > 100ms indicates disk I/O problems

# Snapshot application - nonzero while a member is restoring a snapshot
etcd_server_snapshot_apply_in_progress_total
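To see what histogram_quantile computes from those _bucket series, here is a rough Go sketch of the linear interpolation over cumulative buckets (the bucket values are made up):

```go
package main

import "fmt"

type bucket struct {
	le    float64 // upper bound in seconds ("le" label)
	count float64 // cumulative observations <= le
}

// quantile linearly interpolates the q-th quantile from cumulative
// histogram buckets, roughly what PromQL's histogram_quantile does.
func quantile(q float64, bs []bucket) float64 {
	total := bs[len(bs)-1].count
	rank := q * total
	prevLE, prevCount := 0.0, 0.0
	for _, b := range bs {
		if b.count >= rank {
			// interpolate within this bucket
			return prevLE + (b.le-prevLE)*(rank-prevCount)/(b.count-prevCount)
		}
		prevLE, prevCount = b.le, b.count
	}
	return bs[len(bs)-1].le
}

func main() {
	// hypothetical etcd_disk_wal_fsync_duration_seconds buckets
	bs := []bucket{{0.001, 500}, {0.002, 900}, {0.004, 980}, {0.008, 1000}}
	fmt.Printf("p99 fsync ~ %.4fs\n", quantile(0.99, bs))
}
```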

Prometheus alerting rules:

groups:
  - name: etcd
    rules:
      # Alert if WAL fsync P99 > 500ms
      - alert: EtcdWalFsyncLatency
        expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "etcd WAL fsync latency is critical"

      # Alert if applied entries lag far behind committed entries
      - alert: EtcdCommitIndexLag
        expr: etcd_server_proposals_committed_total - etcd_server_proposals_applied_total > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd follower is lagging behind leader"

      # Alert if snapshot restore is in progress (node recovering)
      - alert: EtcdSnapshotRestore
        expr: etcd_server_snapshot_apply_in_progress_total == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "etcd is currently restoring a snapshot"

Rolling Upgrade Procedures

etcd supports rolling upgrades where one member is upgraded at a time with no cluster downtime.

Standard rolling upgrade steps:

# 1. Verify cluster health before starting
etcdctl endpoint health --cluster

# 2. Upgrade one member at a time
# Stop etcd on member, replace binary, restart
systemctl stop etcd

# 3. After restart, wait for member to catch up
# Check that the member joins the cluster
etcdctl member list

# 4. Verify the upgraded member is healthy
etcdctl endpoint health

# 5. Proceed to next member

Pre-upgrade checklist:

  • Review etcd release notes for any breaking changes between versions
  • Ensure your etcd data directory (--data-dir) has adequate free space (2x current DB size recommended)
  • Take a snapshot before starting: etcdctl snapshot save /backup/etcd-snap.db
  • Verify the new binary version is compatible with cluster API version (etcd --version)

Upgrade compatibility matrix:

  • From 3.4: rolling upgrade to 3.5 (same major version)
  • From 3.5: rolling upgrade to 3.6 (same major version)
  • From 3.6: rolling upgrade to 3.7 (if released)
  • Note: major version jumps require a full cluster restart, not a rolling upgrade

Never upgrade more than one minor version at a time. For example, going from 3.4 to 3.6 should be done as 3.4 -> 3.5 -> 3.6.


Disk I/O Requirements

etcd’s performance is fundamentally bound by disk I/O, particularly fsync latency on the WAL.

Requirements by environment:

  • Development/QA: minimum HDD (7200 RPM) with ext4/xfs; recommended NVMe SSD
  • Production: minimum SSD (SATA or NVMe); recommended NVMe SSD with >50K IOPS
  • Kubernetes control plane: NVMe SSD mandatory; recommended fsync P99 <1ms
  • Heavy write load: minimum NVMe SSD with >100K IOPS and >1GB/s throughput; recommended RAID-0 NVMe or a dedicated NVMe device

Why SSD/NVMe is mandatory:

# etcd writes to the WAL on every committed operation
# Each write must be fsynced before it is acknowledged
# HDD fsync latency: 10-50ms (unacceptable)
# SSD fsync latency: 0.1-1ms (acceptable)
# NVMe fsync latency: 0.05-0.2ms (optimal)

# Example: a cluster whose members run on HDDs
# fsync P99 = 30ms
# Max sequential fsync throughput = 1000ms / 30ms = ~33 writes/second
# At 1000 writes/second the commit queue backs up, heartbeats miss
# their deadlines, and spurious leader elections follow

Disk configuration best practices:

# Use --data-dir on dedicated disk or partition
# Do NOT share with application data
etcd --data-dir=/var/lib/etcd

# For NVMe, consider disabling disk I/O scheduler
echo "none" > /sys/block/nvme0n1/queue/scheduler

# Mount with noatime to avoid unnecessary writes
# /etc/fstab entry:
# /dev/nvme0n1 /var/lib/etcd ext4 defaults,noatime 0 2

# Monitor disk I/O
iostat -x 1
# Watch %util and avgqu-sz - if avgqu-sz > 4 for sustained periods, disk is saturated

Don’t use network-attached storage (NAS/NFS) for etcd data. Network latency adds directly to fsync latency and will cause leader election timeouts. Local NVMe or dedicated SSD is the only production-supported configuration.


Leader Election

etcd provides primitives for leader election. Create a dedicated directory and use transactions to ensure only one process leads at a time.

// Simplified leader election using etcd (errors elided for brevity)
func ElectLeader(cli *clientv3.Client, leaderName string) {
    // Attach the election key to a lease: if this process dies,
    // the lease expires and the key is deleted automatically
    lease, _ := cli.Grant(context.Background(), 10)

    // Atomically create the key only if it does not exist yet
    // (CreateRevision == 0 means no one has written this key)
    txn, _ := cli.Txn(context.Background()).If(
        clientv3.Compare(clientv3.CreateRevision("/election/leader"), "=", 0),
    ).Then(
        clientv3.OpPut("/election/leader", leaderName, clientv3.WithLease(lease.ID)),
    ).Commit()

    if !txn.Succeeded {
        // Key exists: another process holds leadership
        resp, _ := cli.Get(context.Background(), "/election/leader")
        fmt.Printf("Current leader: %s\n", resp.Kvs[0].Value)
        return
    }

    fmt.Printf("%s is now the leader!\n", leaderName)

    // Keep leadership by renewing the lease in the background;
    // if renewal stops, the lease expires and leadership is released
    cli.KeepAlive(context.Background(), lease.ID)
}

The lease mechanism matters. Leader crashes, lease expires, another node takes over. This prevents stale leadership where a dead process appears to still be in charge.


Handling Failures and Recovery

etcd tolerates node failures gracefully:

  • Minority down: Cluster continues serving reads and writes
  • Leader down: New leader elected within seconds, brief write pause
  • Majority down: Cluster unavailable (cannot guarantee consistency without majority)

# Check cluster health
etcdctl endpoint health

# Check member list
etcdctl member list

# Remove a failed member
etcdctl member remove <member-id>

Recovering nodes must catch up on missed updates. etcd supports:

  • Snapshot recovery: Restore from a recent snapshot and replay WAL entries
  • Member migration: Replace failed node with a new one that learns from the leader

For production clusters, regular snapshots are essential. Without them, a recovering node might replay years of WAL entries.


WAL and Snapshot Mechanics

etcd uses a Write-Ahead Log (WAL) to ensure durability of committed operations before applying them to the state machine. Every write that passes Raft consensus gets written to the WAL before being applied.

Data directory layout:

member/
├── snap/                # snapshots plus the backend database
│   ├── db               # bbolt key-value store (the state machine)
│   └── <term>-<index>.snap
└── wal/                 # write-ahead log segments
    ├── 0000000000000000-0000000000000000.wal
    └── 0000000000000001-0000000000000256.wal
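WAL segment file names encode two 16-digit hex fields, <seq>-<index>.wal: the segment sequence number and the Raft index associated with the segment. A small parser makes the convention concrete:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseWALName splits a WAL segment file name of the form
// <seq>-<index>.wal, where both fields are hexadecimal: seq is the
// segment sequence number, index the Raft index it starts from.
func parseWALName(name string) (seq, index uint64, err error) {
	name = strings.TrimSuffix(name, ".wal")
	parts := strings.Split(name, "-")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("bad wal name %q", name)
	}
	if seq, err = strconv.ParseUint(parts[0], 16, 64); err != nil {
		return 0, 0, err
	}
	index, err = strconv.ParseUint(parts[1], 16, 64)
	return seq, index, err
}

func main() {
	seq, index, _ := parseWALName("0000000000000001-0000000000000256.wal")
	fmt.Println(seq, index) // hex 0x256 = 598 decimal
}
```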

Each WAL segment file contains record batches with:

  • Type: Entry (raft log entry), State (hard state like term/vote), etc.
  • Term: Raft term number
  • Index: Log entry index
  • Data: The actual write command
graph LR
    A[Client Write] --> B[Raft Propose]
    B --> C[Append to WAL]
    C --> D[Replicate to Followers]
    D --> E[Wait for Majority Ack]
    E --> F[Mark Committed in WAL]
    F --> G[Apply to State Machine]
    G --> H[Write Snapshot]

Snapshotting process:

When enough entries accumulate (every --snapshot-count applied entries, 100,000 by default), etcd creates a snapshot:

  1. etcd serializes the current state (backed by the bbolt database) into a snapshot
  2. The WAL is truncated: entries already covered by the snapshot are discarded
  3. A joining member can restore from the snapshot and replay only recent WAL entries

# Manually trigger a snapshot
etcdctl snapshot save /tmp/etcd-snap.db

# Check snapshot status
etcdctl snapshot status /tmp/etcd-snap.db

# Restore from snapshot (for disaster recovery)
etcdctl snapshot restore /tmp/etcd-snap.db \
  --name etcd-1 \
  --initial-cluster etcd-1=http://192.168.1.1:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://192.168.1.1:2380

Why snapshots matter:

  • New node catch-up: without snapshots it must replay ALL WAL entries; with snapshots it replays only recent entries
  • Recovery time: years of entries vs minutes of entries
  • Disk space: unbounded WAL growth vs WAL bounded by the snapshot interval
  • Memory: pressure from an ever-growing WAL index vs a clean, bounded WAL

Operational best practices:

  • Monitor etcd_server_snapshot_apply_in_progress_total metric
  • Set --snapshot-count to control how often snapshots are taken (default: 100,000)
  • Ensure adequate disk space for WAL + snapshots during heavy write periods
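The --snapshot-count trigger boils down to a single comparison; a sketch (function name is mine):

```go
package main

import "fmt"

// shouldSnapshot mirrors etcd's trigger: take a new snapshot once
// the number of entries applied since the last snapshot exceeds
// --snapshot-count.
func shouldSnapshot(appliedIndex, snapshotIndex, snapshotCount uint64) bool {
	return appliedIndex-snapshotIndex > snapshotCount
}

func main() {
	const snapshotCount = 100000 // etcd default
	fmt.Println(shouldSnapshot(150000, 40000, snapshotCount)) // true: 110k entries since last snapshot
	fmt.Println(shouldSnapshot(150000, 60000, snapshotCount)) // false: only 90k entries
}
```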

Performance Characteristics

etcd is not designed for high throughput on large datasets. It excels at:

  • Megabytes to gigabytes of critical configuration
  • Consistent latency operations (1-10ms typically)
  • Thousands of watches and frequent small updates

Rough numbers for a 3-node cluster on modern hardware:

  • Write latency: 1-5ms at 10k+ QPS
  • Read latency: sub-millisecond for cached reads
  • Watch scalability: 100k+ concurrent watches

These come from etcd team’s benchmarking. Real-world performance depends on network latency, disk I/O, and cluster size.


Kubernetes Integration

Kubernetes stores all cluster state in etcd. kubectl get pods? API server reads from etcd. Deploy a service? API server writes to etcd, controllers watching react.

graph TD
    A[kubectl] -->|HTTP| B[API Server]
    B -->|Read/Write| C[etcd]
    D[Controller] -->|Watch| B
    D -.->|Reconcile| E[Pod]
    B -.->|Notify| D

This tight coupling means etcd failures affect the entire Kubernetes control plane. If etcd is unavailable:

  • No new pods scheduled
  • Existing pods keep running (kubelets manage them independently)
  • Services cannot be created or modified
  • DNS might stop resolving new entries

Kubernetes deployments typically run etcd on dedicated nodes with redundant storage and monitoring.


When to Use / When Not to Use etcd

When to Use etcd

For distributed coordination and locking, etcd excels. Its lease mechanism and compare-and-swap semantics make leader election, distributed locks, and barriers straightforward to implement. Service discovery works naturally too — you can register services, track their health, and discover endpoints dynamically without a separate discovery service. Configuration storage is another strong use case: cluster configuration, feature flags, and dynamic settings that need consistent reads across nodes fit perfectly. When your application needs multiple processes to agree on a single value across machines, etcd provides the consensus guarantees you need.

When Not to Use etcd

etcd is not meant for general application data. It’s designed for metadata and coordination, so high-volume application data belongs in Redis or a proper database. High-throughput event streams also push beyond what etcd handles well — while 1-10ms latency at thousands of operations per second works fine for coordination, it won’t meet the demands of event pipelines. For queuing or job scheduling, reach for a dedicated message queue like Kafka or RabbitMQ instead.

Production Failure Scenarios

  • Quorum loss (a majority of nodes fail): cluster becomes unavailable; no reads or writes possible. Mitigation: run a 5-node cluster for better fault tolerance; monitor node health; set an appropriate election-timeout.
  • Disk I/O saturation: etcd is extremely disk-I/O sensitive; slow fsync causes leader heartbeats to time out. Mitigation: use SSDs (NVMe preferred); isolate etcd's disk from other workloads; monitor fsync latency.
  • Slow snapshot transfer: a new or lagging member is unavailable while it downloads a snapshot from the leader. Mitigation: seed new nodes from a recent snapshot; ensure a fast network between members.
  • Lease holder crash: keys attached to the lease are deleted once the TTL expires; watchers are notified. Mitigation: use longer TTLs for less-critical leases; renew via lease keepalives; handle expiry gracefully.
  • Misconfigured Raft timeouts: too-short timeouts cause spurious leader elections; too-long timeouts slow failover. Mitigation: heartbeat-interval around 100ms; election-timeout around 10x the heartbeat; tune for network RTT.
  • Watch flood during leader election: thousands of events fire simultaneously during a leadership transition. Mitigation: use filtered watches where possible; debounce or batch watch event processing.
  • Inconsistent snapshot state: restoring a snapshot taken out of sequence with the WAL yields inconsistent state. Mitigation: compact the Raft log carefully; keep WAL and snapshots properly sequenced; test restore procedures.

Summary

etcd is purpose-built for coordination in distributed systems:

  • Strong consistency: Via Raft, no eventual consistency
  • Reliability: Survives minority failures without data loss
  • Efficiency: Watches enable reactive patterns without polling
  • Simplicity: Few primitives that compose well

The Raft implementation is considered a gold standard. Reading etcd source code teaches you more about distributed consensus than any textbook.

For Kubernetes users, understanding etcd is essential for debugging cluster issues. Root cause is often etcd: slow disk I/O causing leader election timeouts, memory pressure from a bloated WAL, or network issues preventing consensus.

For other use cases: do you actually need etcd? Millions of reads per second on terabytes? Use a different database. Need reliable configuration storage with fast updates and watches? etcd (or AWS Parameter Store or Consul) fits.
