etcd: Distributed Key-Value Store for Configuration
Deep dive into etcd architecture using Raft consensus, watches for reactive configuration, leader election patterns, and Kubernetes integration.
etcd started at CoreOS as a coordination service built on the Raft consensus algorithm. Its job is storing configuration, detecting failures, and coordinating leader elections in distributed systems. Today etcd is best known as Kubernetes’ backing store, holding all cluster state: pods, services, deployments, configuration.
The name comes from the Unix "/etc" directory, where configuration lives, plus "d" for distributed: etcd was meant to be the "/etc" for distributed systems.
Core Architecture
etcd is a distributed, consistent, highly-available key-value store. Unlike databases optimized for arbitrary queries, etcd handles:
- Small amounts of critical data (configuration, locks, leader election)
- Strong consistency (no eventual consistency)
- Fast distributed coordination primitives
The architecture uses Raft to maintain a replicated log. The Raft paper is titled "In Search of an Understandable Consensus Algorithm", and etcd's raft library has become one of the most widely used implementations, embedded in many projects beyond etcd itself.
graph TD
A[Client] -->|Read/Write| B[Leader]
B -->|Replicate| C[Follower 1]
B -->|Replicate| D[Follower 2]
B -->|Replicate| E[Follower 3]
B -.->|Heartbeat| C
B -.->|Heartbeat| D
B -.->|Heartbeat| E
B --> F[(Raft Log)]
C --> G[(Raft Log)]
D --> H[(Raft Log)]
E --> I[(Raft Log)]
In a Raft cluster, one node is the leader and handles all writes. The leader appends each write to its replicated log; an entry is durable once a majority of nodes have persisted it. Reads can go to any node, but followers might return stale data unless you request linearizable reads (the etcd v3 default).
The Raft Consensus Algorithm
Raft achieves consensus by electing a leader and replicating a log of commands. The algorithm has three parts:
Leader Election:
- Nodes start as followers
- Followers become candidates if they miss leader communication within the election timeout
- Candidates request votes from other nodes
- Majority votes = new leader
Log Replication:
- Leader accepts commands from clients
- Appends to local log
- Replicates to followers via AppendEntries messages
- Command on majority of logs = applied to state machine, client notified
Safety:
- Only candidates whose logs contain all committed entries can become leader
- Committed entries survive minority node failures
// Simplified Raft write flow in etcd (illustrative sketch, not the actual etcd API)
func (s *Server) Propose(ctx context.Context, cmd []byte) error {
	// 1. Submit to the Raft log
	ch := make(chan applyFuture, 1)
	s.node.Propose(ctx, cmd, ch)

	// 2. Wait for commit
	result := <-ch
	if result.Error() != nil {
		return result.Error()
	}

	// 3. Apply to the state machine
	return s.apply(cmd)
}
The key insight: as long as a majority agrees on the log contents, the cluster is consistent. If the leader crashes, a new leader with the most up-to-date log is elected.
Data Model and API
etcd v3 stores data in a flat key space; the familiar slash-separated hierarchy is purely a naming convention over key prefixes. Keys are byte strings (conventionally written like filesystem paths), and values can be arbitrary bytes.
# Store a simple value
etcdctl put /services/api/server1 "10.0.0.1:8080"
# Retrieve the value
etcdctl get /services/api/server1
# Store JSON configuration
etcdctl put /config/database '{"host": "db.example.com", "port": 5432}'
# List all keys under a prefix
etcdctl get --prefix /services/
The etcd v3 API is exposed over gRPC, which you can call directly or through client libraries:
import (
	clientv3 "go.etcd.io/etcd/client/v3"
)

cli, _ := clientv3.NewFromURLs([]string{"http://localhost:2379"})
defer cli.Close()

// Write a value
cli.Put(context.Background(), "/config/featureflags", `{"newUI": true}`)

// Read a value
resp, _ := cli.Get(context.Background(), "/config/featureflags")
fmt.Println(string(resp.Kvs[0].Value))
// Atomic compare-and-swap (requires current value)
cli.Txn(context.Background()).If(
clientv3.Compare(clientv3.Value("/config/featureflags"), "=", `{"newUI": false}`),
).Then(
clientv3.OpPut("/config/featureflags", `{"newUI": true}`),
).Else(
clientv3.OpPut("/config/featureflags", `{"newUI": false}`),
).Commit()
Transactions enable atomic compare-and-swap operations, essential for distributed locks and leader election.
Watches and Reactive Configuration
Native watch support is etcd’s most powerful feature. Subscribe to key changes instead of polling, and get notified immediately when data changes.
// Watch a single key
watchChan := cli.Watch(context.Background(), "/services/api/server1")
for resp := range watchChan {
for _, event := range resp.Events {
fmt.Printf("%s %q: %q\n", event.Type, event.Kv.Key, event.Kv.Value)
}
}
// Watch an entire prefix recursively
watchChan = cli.Watch(context.Background(), "/services/", clientv3.WithPrefix())
Watches in etcd are cheap and scalable because they are true subscriptions, not polling. The etcd server tracks which clients watch which keys and pushes updates directly.
sequenceDiagram
participant K as Kubernetes Controller
participant E as etcd Server
participant W as Watch Stream
K->>E: Watch /pods/
E->>W: Create subscription
W-->>K: Initial state
Note over E: Pod created
E->>W: Push new event
W-->>K: Pod event notification
K->>K: Reconcile pod state
This watch mechanism lets Kubernetes controllers react immediately to cluster changes without polling.
Authentication and RBAC
etcd ships with built-in role-based access control. Before enabling auth, you create users and bind them to roles that grant access to specific key prefixes.
# Create a user with password authentication
etcdctl user add root
# Add user to the root role (full access)
etcdctl user grant-role root root
# Enable authentication
etcdctl auth enable
# Now all requests require credentials
etcdctl --user root:password put /services/api/server1 "10.0.0.1:8080"
Built-in roles:

| Role | Permissions |
|---|---|
| root | Full read/write/admin on all keys, plus cluster administration |

In etcd v3, root is the only predefined role; the guest role from the old v2 API no longer exists, and every other role is created by operators.
Custom roles for least privilege:
# Create a role scoped to a key prefix
etcdctl role add app-config
etcdctl role grant-permission app-config readwrite /config/app/
# Create a user and assign the role
etcdctl user add app-service
etcdctl user grant-role app-service app-config
Without RBAC, any client that can reach etcd port 2379 has full cluster access. In Kubernetes, the etcd RBAC layer is typically bypassed because the API server handles authentication — but for standalone etcd deployments, RBAC is the first line of defense.
TLS/mTLS Setup for Cluster Communication
etcd supports encrypted communication between members and clients via mutual TLS. Each member and client presents a certificate; both sides verify the other’s identity.
Certificate requirements:
- Each etcd member needs a server certificate with its hostname/IP as Subject Alternative Names (SANs)
- Each client needs a client certificate for authentication
- The CA certificate must be trusted by all members and clients
# Generate certificates using cfssl or openssl
# Example using openssl - simplified
# 1. Create the CA
openssl genrsa -out ca-key.pem 2048
openssl req -x509 -new -nodes -key ca-key.pem -days 365 -out ca-crt.pem \
-subj "/CN=etcd-ca"
# 2. Create member certificate with SANs for all cluster IPs
openssl genrsa -out member-key.pem 2048
# SANs: etcd-1, etcd-2, etcd-3, 192.168.1.1, 192.168.1.2, 192.168.1.3
openssl req -new -key member-key.pem -out member.csr \
-subj "/CN=etcd-member"
# Sign with CA...
# 3. Client certificates (no SAN required)
openssl genrsa -out client-key.pem 2048
openssl req -new -key client-key.pem -out client.csr \
-subj "/CN=etcd-client"
# Sign with CA...
Cluster member flags:
# On each member, specify certificates and require client certs (mTLS)
etcd \
  --cert-file=/etc/etcd/ssl/server.crt \
  --key-file=/etc/etcd/ssl/server.key \
  --trusted-ca-file=/etc/etcd/ssl/ca.crt \
  --client-cert-auth \
  --peer-cert-file=/etc/etcd/ssl/peer.crt \
  --peer-key-file=/etc/etcd/ssl/peer.key \
  --peer-trusted-ca-file=/etc/etcd/ssl/ca.crt \
  --peer-client-cert-auth \
  --initial-cluster etcd-1=https://192.168.1.1:2380,etcd-2=https://192.168.1.2:2380 \
  --listen-peer-urls https://0.0.0.0:2380 \
  --listen-client-urls https://0.0.0.0:2379
Client connection with TLS:
# Connect a client with certificates
etcdctl --cacert=/etc/etcd/ssl/ca.crt \
--cert=/etc/etcd/ssl/client.crt \
--key=/etc/etcd/ssl/client.key \
endpoint health
Without TLS in production, network-level attackers can read all cluster state including Kubernetes secrets. TLS also prevents man-in-the-middle attacks that could inject false configuration into your cluster.
Lease Revocation Edge Cases
etcd leases attach TTLs to keys. When a lease expires, all keys attached to it are deleted atomically. The edge case is what happens when the lease holder crashes before renewal.
Grace period behavior:
// When a lease holder crashes:
// 1. The lease keeps counting down on the etcd server
// 2. If no keepalive arrives before the TTL expires, etcd revokes the lease
// 3. All keys attached to the lease are deleted atomically
// 4. Watchers on those keys receive delete events

// Worst-case key lifetime is roughly: (remaining lease TTL) + (leader election time)
// If the etcd leader is down, the lease cannot be revoked (or renewed)
// until a new leader is elected
Practical implications:
| Scenario | Lease Behavior |
|---|---|
| Holder crashes, lease TTL = 10s | Keys deleted ~10s after crash (plus network delay) |
| Leader fails during grace period | New leader elected, old holder’s lease revoked, keys deleted |
| Network partition (holder isolated) | Partition-side holder cannot renew; majority-side lease revoked after TTL expires |
| Holder JVM pauses (GC) | If pause exceeds lease TTL, keys deleted; use longer TTL + reliable keepalive |
Recommendation: set lease TTL at 2-5x your expected renewal interval, and implement retry logic with exponential backoff on the client side.
Namespace Isolation for Multi-Tenant Environments
etcd has no native multi-tenant namespace abstraction. Isolation is achieved through key prefix conventions and RBAC roles.
Prefix-based tenancy:
# Tenant A's keys
etcdctl role add tenant-a
etcdctl role grant-permission tenant-a readwrite /tenants/a/
# Tenant B's keys
etcdctl role add tenant-b
etcdctl role grant-permission tenant-b readwrite /tenants/b/
# Each tenant's application only gets credentials granting access to their prefix
# Kubernetes uses this pattern: /registry/...
For Kubernetes itself, the convention is:
# Kubernetes state lives under these prefixes
/registry/apiregistration.k8s.io/apiservices/
/registry/services/endpoints/
/registry/pods/
/registry/secrets/
/registry/configmaps/
Multi-tenant considerations:
| Concern | Mitigation |
|---|---|
| Tenant key sprawl | Enforce naming conventions; monitor key count per prefix |
| Resource contention | Separate etcd clusters per tenant at scale |
| Quota limits | etcd has no per-key-prefix quota; implement application-level checks |
| RBAC complexity | Use role inheritance where possible; automate role provisioning |
At scale (hundreds of tenants), running a dedicated etcd cluster per tenant, automated with something like etcd Operator, becomes more practical than prefix-based isolation on a single cluster.
Key Monitoring Metrics
Monitor these metrics to catch etcd problems before they affect your cluster:
Critical metrics:
# Using Prometheus with etcd metrics endpoint (--listen-metrics-urls)
# Metrics endpoint: http://localhost:2381/metrics

# Raft health - frequent leader changes indicate instability
etcd_server_leader_changes_seen_total

# Proposal progress - the committed vs applied gap shows apply lag
etcd_server_proposals_committed_total
etcd_server_proposals_applied_total

# WAL fsync latency - disk performance indicator
etcd_disk_wal_fsync_duration_seconds_bucket
# P99 should be < 10ms; > 100ms indicates disk I/O problems

# Snapshot activity
etcd_server_snapshot_apply_in_progress_total
# A member stuck applying snapshots is still catching up
Prometheus alerting rules:
groups:
- name: etcd
  rules:
  # Alert if WAL fsync P99 > 500ms
  - alert: EtcdWalFsyncLatency
    expr: histogram_quantile(0.99, sum by (le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))) > 0.5
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "etcd WAL fsync latency is critical"
  # Alert if the gap between committed and applied proposals is growing
  - alert: EtcdProposalApplyLag
    expr: etcd_server_proposals_committed_total - etcd_server_proposals_applied_total > 1000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "etcd is falling behind applying committed proposals"
  # Alert if snapshot apply is in progress (node recovering)
  - alert: EtcdSnapshotRestore
    expr: etcd_server_snapshot_apply_in_progress_total == 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "etcd is currently applying a snapshot"
Rolling Upgrade Procedures
etcd supports rolling upgrades where one member is upgraded at a time with no cluster downtime.
Standard rolling upgrade steps:
# 1. Verify cluster health before starting
etcdctl endpoint health --cluster
# 2. Upgrade one member at a time
# Stop etcd on member, replace binary, restart
systemctl stop etcd
# 3. After restart, wait for member to catch up
# Check that the member joins the cluster
etcdctl member list
# 4. Verify the upgraded member is healthy
etcdctl endpoint health
# 5. Proceed to next member
Pre-upgrade checklist:
- Review etcd release notes for any breaking changes between versions
- Ensure your etcd data directory (--data-dir) has adequate free space (2x current DB size recommended)
- Take a snapshot before starting: etcdctl snapshot save /backup/etcd-snap.db
- Verify the new binary version is compatible with the cluster API version (etcd --version)
Upgrade compatibility matrix:
| Cluster Version | Supported Upgrades |
|---|---|
| 3.4 | 3.4 -> 3.5 (same major version) |
| 3.5 | 3.5 -> 3.6 (same major version) |
| 3.6 | 3.6 -> 3.7 (if released) |
| Note | Major version jumps require full cluster restart (not rolling) |
Never upgrade more than one minor version at a time. For example, going from 3.4 to 3.6 should be done as 3.4 -> 3.5 -> 3.6.
Disk I/O Requirements
etcd’s performance is fundamentally bound by disk I/O, particularly fsync latency on the WAL.
Requirements by environment:
| Environment | Minimum | Recommended |
|---|---|---|
| Development/QA | HDD (7200 RPM) with ext4/xfs | SATA or NVMe SSD |
| Production | SSD (SATA or NVMe) | NVMe SSD with >50K IOPS |
| Kubernetes Control Plane | NVMe SSD mandatory | NVMe SSD with <1ms fsync P99 |
| Heavy write load | NVMe SSD with >100K IOPS and >1GB/s throughput | RAID-0 NVMe or dedicated NVMe device |
Why SSD/NVMe is mandatory:
# etcd writes to WAL on every committed operation
# Each write must be fsynced before acknowledgment
# HDD fsync latency: 10-50ms (unacceptable)
# SSD fsync latency: 0.1-1ms (acceptable)
# NVMe fsync latency: 0.05-0.2ms (optimal)
# A cluster member on a single HDD:
# fsync P99 = 30ms
# Max un-batched throughput = 1000ms / 30ms = ~33 writes/second
# Sustained load beyond that backs up the WAL, heartbeats miss their
# deadline, and the cluster starts churning through leader elections
Disk configuration best practices:
# Use --data-dir on dedicated disk or partition
# Do NOT share with application data
etcd --data-dir=/var/lib/etcd
# For NVMe, consider disabling disk I/O scheduler
echo "none" > /sys/block/nvme0n1/queue/scheduler
# Mount with noatime to avoid unnecessary writes
# /etc/fstab entry:
# /dev/nvme0n1 /var/lib/etcd ext4 defaults,noatime 0 2
# Monitor disk I/O
iostat -x 1
# Watch %util and avgqu-sz - if avgqu-sz > 4 for sustained periods, disk is saturated
Don’t use network-attached storage (NAS/NFS) for etcd data. Network latency adds directly to fsync latency and will cause leader election timeouts. Local NVMe or dedicated SSD is the only production-supported configuration.
Leader Election
etcd provides primitives for leader election. Pick a dedicated key prefix and use transactions to ensure only one process leads at a time.
// Simplified leader election using etcd (production code should use
// the Election helper in the clientv3/concurrency package)
func ElectLeader(cli *clientv3.Client, leaderName string) error {
	// Grant a lease; leadership is held only while the lease is kept alive
	lease, err := cli.Grant(context.Background(), 10)
	if err != nil {
		return err
	}

	// Atomically create the key only if no leader exists yet
	txn, err := cli.Txn(context.Background()).
		If(clientv3.Compare(clientv3.CreateRevision("/election/leader"), "=", 0)).
		Then(clientv3.OpPut("/election/leader", leaderName, clientv3.WithLease(lease.ID))).
		Else(clientv3.OpGet("/election/leader")).
		Commit()
	if err != nil {
		return err
	}
	if !txn.Succeeded {
		kvs := txn.Responses[0].GetResponseRange().Kvs
		fmt.Printf("Current leader: %s\n", kvs[0].Value)
		return nil
	}

	fmt.Printf("%s is now the leader!\n", leaderName)
	// Keep leadership by renewing the lease
	// If this process dies, the lease expires and leadership is released
	_, err = cli.KeepAlive(context.Background(), lease.ID)
	return err
}
The lease mechanism matters. Leader crashes, lease expires, another node takes over. This prevents stale leadership where a dead process appears to still be in charge.
Handling Failures and Recovery
etcd tolerates node failures gracefully:
- Minority down: Cluster continues serving reads and writes
- Leader down: New leader elected within seconds, brief write pause
- Majority down: Cluster unavailable (cannot guarantee consistency without majority)
# Check cluster health
etcdctl endpoint health
# Check member list
etcdctl member list
# Remove a failed member
etcdctl member remove <member-id>
Recovering nodes must catch up on missed updates. etcd supports:
- Snapshot recovery: Restore from a recent snapshot and replay WAL entries
- Member migration: Replace failed node with a new one that learns from the leader
For production clusters, regular snapshots are essential. Without them, a recovering node might replay years of WAL entries.
WAL and Snapshot Mechanics
etcd uses a Write-Ahead Log (WAL) to ensure durability of committed operations before applying them to the state machine. Every write that passes Raft consensus gets written to the WAL before being applied.
WAL structure:
member/
├── snap/
│   ├── db                                      # BoltDB backend database
│   └── 0000000000000002-0000000000049b2f.snap  # Raft snapshot metadata (example name)
└── wal/
    ├── 0000000000000000-0000000000000000.wal   # First WAL segment
    └── 0000000000000001-0000000000012d68.wal   # Later segment (sequence-index naming)
Each WAL segment file contains record batches with:
- Type: Entry (raft log entry), State (hard state like term/vote), etc.
- Term: Raft term number
- Index: Log entry index
- Data: The actual write command
graph LR
A[Client Write] --> B[Raft Propose]
B --> C[Append to WAL]
C --> D[Replicate to Followers]
D --> E[Wait for Majority Ack]
E --> F[Mark Committed in WAL]
F --> G[Apply to State Machine]
G --> H[Write Snapshot]
Snapshotting process:
When the WAL grows too large or during routine snapshots, etcd creates a snapshot:
- etcd takes a BoltDB snapshot of the current state
- WAL is truncated - old entries are discarded
- New member joining can restore from snapshot + replay only recent WAL
# Manually trigger a snapshot
etcdctl snapshot save /tmp/etcd-snap.db
# Check snapshot status
etcdctl snapshot status /tmp/etcd-snap.db
# Restore from snapshot (for disaster recovery)
etcdctl snapshot restore /tmp/etcd-snap.db \
--name etcd-1 \
--initial-cluster etcd-1=http://192.168.1.1:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-advertise-peer-urls http://192.168.1.1:2380
Why snapshots matter:
| Without Snapshots | With Snapshots |
|---|---|
| New node must replay ALL WAL entries | New node replays only recent entries |
| Recovery time = years of entries | Recovery time = minutes of entries |
| Disk space unbounded | WAL bounded by snapshot interval |
| Memory pressure from WAL index | Clean, bounded WAL |
Operational best practices:
- Monitor the etcd_server_snapshot_apply_in_progress_total metric
- Set --snapshot-count to control how often snapshots are taken (default: 100,000)
- Ensure adequate disk space for WAL + snapshots during heavy write periods
Performance Characteristics
etcd is not designed for high throughput on large datasets. It excels at:
- Megabytes to gigabytes of critical configuration
- Consistent latency operations (1-10ms typically)
- Thousands of watches and frequent small updates
Rough numbers for a 3-node cluster on modern hardware:
- Write latency: 1-5ms at 10k+ QPS
- Read latency: sub-millisecond for cached reads
- Watch scalability: 100k+ concurrent watches
These come from etcd team’s benchmarking. Real-world performance depends on network latency, disk I/O, and cluster size.
Kubernetes Integration
Kubernetes stores all cluster state in etcd. kubectl get pods? API server reads from etcd. Deploy a service? API server writes to etcd, controllers watching react.
graph TD
A[kubectl] -->|HTTP| B[API Server]
B -->|Read/Write| C[etcd]
D[Controller] -->|Watch| B
D -.->|Reconcile| E[Pod]
B -.->|Notify| D
This tight coupling means etcd failures affect the entire Kubernetes control plane. If etcd is unavailable:
- No new pods scheduled
- Existing pods keep running (kubelets manage them independently)
- Services cannot be created or modified
- DNS might stop resolving new entries
Kubernetes deployments typically run etcd on dedicated nodes with redundant storage and monitoring.
When to Use / When Not to Use etcd
When to Use etcd
For distributed coordination and locking, etcd excels. Its lease mechanism and compare-and-swap semantics make leader election, distributed locks, and barriers straightforward to implement. Service discovery works naturally too — you can register services, track their health, and discover endpoints dynamically without a separate discovery service. Configuration storage is another strong use case: cluster configuration, feature flags, and dynamic settings that need consistent reads across nodes fit perfectly. When your application needs multiple processes to agree on a single value across machines, etcd provides the consensus guarantees you need.
When Not to Use etcd
etcd is not meant for general application data. It’s designed for metadata and coordination, so high-volume application data belongs in Redis or a proper database. High-throughput event streams also push beyond what etcd handles well — while 1-10ms latency at thousands of operations per second works fine for coordination, it won’t meet the demands of event pipelines. For queuing or job scheduling, reach for a dedicated message queue like Kafka or RabbitMQ instead.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Quorum loss (N/2+ nodes fail) | etcd cluster becomes unavailable; no reads or writes possible | Use 5-node cluster for better fault tolerance; monitor node health; set appropriate election-timeout |
| Disk I/O saturation | etcd is extremely disk I/O sensitive; slow fsync causes leader heartbeats to timeout | Use SSDs (NVMe preferred); isolate etcd disk I/O from other workloads; monitor fsync latency |
| Snapshot restore taking too long | When the leader sends a snapshot to a lagging or new member, that member is unavailable while it downloads and applies the snapshot | Pre-provision nodes from a recent snapshot; ensure fast network between members; keep Raft pre-vote (--pre-vote) enabled to avoid disruptive elections |
| Lease holder crash | Attached keys are deleted once the lease TTL expires; watchers receive delete events | Use longer TTLs for less-critical leases; renew via the lease keepalive stream; handle lease expiry gracefully in clients |
| Raft proposal timeout misconfiguration | Too-short timeouts cause spurious leader elections; too-long causes slow failover | heartbeat-interval should be ~100ms; election-timeout should be 10x heartbeat; tune based on network RTT |
| Watch flood during leader election | Thousands of events fire simultaneously during leadership transition | Use filtering watchers where possible; debounce watch event processing; batch watch event handling |
| Inconsistent snapshot state | If snapshot is taken before WAL is flushed, restored node may have inconsistent state | Use raft log compaction carefully; ensure WAL and snapshot are properly sequenced; test restore procedures |
Related Concepts
- Distributed Transactions covers consensus mechanisms
- CAP Theorem explains why etcd chooses consistency
- Consistency Models covers linearizability
- Service Registry patterns that etcd enables
Summary
etcd is purpose-built for coordination in distributed systems:
- Strong consistency: Via Raft, no eventual consistency
- Reliability: Survives minority failures without data loss
- Efficiency: Watches enable reactive patterns without polling
- Simplicity: Few primitives that compose well
The Raft implementation is considered a gold standard. Reading etcd source code teaches you more about distributed consensus than any textbook.
For Kubernetes users, understanding etcd is essential for debugging cluster issues. Root cause is often etcd: slow disk I/O causing leader election timeouts, memory pressure from a bloated WAL, or network issues preventing consensus.
For other use cases: do you actually need etcd? Millions of reads per second on terabytes? Use a different database. Need reliable configuration storage with fast updates and watches? etcd (or AWS Parameter Store or Consul) fits.