etcd — Complete Internals for Kubernetes
Raft consensus, MVCC storage engine, WAL, snapshots, compaction, keyspace layout, backup/restore, TLS, performance tuning, and production operations for the Kubernetes backing store.
What Is etcd?
etcd is a strongly consistent, distributed key-value store written in Go. It was created by CoreOS in 2013 and became the backing store for Kubernetes in its earliest versions. The name comes from /etc (Unix configuration directory) + d (distributed). etcd provides the following guarantees that Kubernetes relies on:
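- Strong consistency: By default, reads are linearizable: every read returns the latest committed write, as if the cluster were a single node (given quorum). Stale reads are possible only when a client explicitly requests serializable reads.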
- High availability: Tolerates minority member failures while maintaining consistency and availability for the majority partition.
- Watch: Clients can watch a key or prefix and receive real-time notifications when values change. This powers the Kubernetes Informer machinery.
- Transactions (MVCC): Multi-Version Concurrency Control allows optimistic locking — compare-and-swap semantics prevent lost updates between concurrent writers.
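- Leases: Time-bounded ownership of keys; keys attached to a lease are deleted when it expires. The apiserver uses etcd leases to expire short-lived objects such as Events. (Kubernetes leader election and node heartbeats use Lease API objects, which are ordinary keys in etcd rather than etcd leases.)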
Raft Consensus Algorithm
etcd uses the Raft consensus algorithm (2013, Ongaro & Ousterhout) to ensure all members agree on the log order even in the presence of failures. Raft is designed to be more understandable than Paxos while providing the same guarantees.
Member States
Every etcd member is always in one of three states:
Follower
Passive state. Receives AppendEntries RPCs from the leader (log replication + heartbeats). Forwards client writes to the leader. Starts an election if no heartbeat is received within the election timeout (randomized; the Raft paper suggests 150–300ms, while etcd defaults to 1000–2000ms).
Candidate
Transitional state during election. Increments its term, votes for itself, and sends RequestVote RPCs to all peers. Transitions to Leader if it gets majority votes, or back to Follower if it discovers a current leader or higher term.
Leader
The active state for ONE member at a time. Accepts all client writes. Replicates entries to followers via AppendEntries. Sends periodic heartbeats (every 100ms by default in etcd) to prevent elections. Steps down if it can't reach a majority.
Write Commit Flow
Figure 1: Raft write commit sequence. The leader writes to its WAL, replicates to followers in parallel, waits for majority ACK (quorum), commits to boltdb, then sends the response. Followers apply the entry on the next heartbeat carrying the updated commitIndex.
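The quorum step in this flow can be sketched in a few lines of Python. This is a minimal model, not etcd's implementation: it assumes the leader tracks, for every member including itself, the highest log index known to be persisted there (the hypothetical `match_index` mapping), and it omits Raft's additional rule that a leader only commits entries from its own term this way.

```python
def commit_index(match_index: dict[str, int]) -> int:
    """Return the highest log index that is persisted on a majority of members."""
    quorum = len(match_index) // 2 + 1
    # Sort descending: the quorum-th largest value is present on
    # at least `quorum` members.
    ranked = sorted(match_index.values(), reverse=True)
    return ranked[quorum - 1]


# The leader and one follower have persisted index 12; one follower lags at 9.
print(commit_index({"etcd-0": 12, "etcd-1": 12, "etcd-2": 9}))  # 12
```

The lagging follower does not hold back the commit: two of three members are enough.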
Quorum Math
Raft requires ⌊n/2⌋ + 1 votes (a majority) to elect a leader and to commit an entry. This determines how many member failures a cluster can tolerate:
| Members (n) | Quorum | Tolerated Failures | Recommendation |
|---|---|---|---|
| 1 | 1 | 0 | Dev/test only; any failure = total loss |
| 2 | 2 | 0 | Worse than 1 — both must be up; never use |
| 3 | 2 | 1 | Minimum HA. Standard for most production clusters. |
| 4 | 3 | 1 | Same fault tolerance as 3, but more expensive; avoid |
| 5 | 3 | 2 | Large/critical clusters. Handles rolling upgrades safely. |
| 7 | 4 | 3 | Very large clusters. Write latency increases with more members. |
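The table follows directly from the majority formula. A short Python check, with hypothetical helper names, reproduces the Quorum and Tolerated Failures columns:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can fail while a majority remains."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5, 7):
    print(f"members={n} quorum={quorum(n)} tolerated={tolerated_failures(n)}")
```

Going from 3 to 4 members raises the quorum from 2 to 3 without raising the number of tolerated failures, which is why even sizes are avoided.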
Leader Election Timing
etcd's election timeout is randomized per member between --election-timeout (default 1000ms) and 2 * --election-timeout (2000ms). The randomization prevents multiple members from starting elections simultaneously. The heartbeat interval is --heartbeat-interval (default 100ms). Rule: election timeout should be ≥ 10× heartbeat interval.
# Show leader, Raft term, and Raft index for a running member (the timeout values themselves are etcd startup flags)
ETCDCTL_API=3 etcdctl endpoint status --write-out=json \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key | python3 -m json.tool
# Watch for election events
journalctl -u etcd --since "5 min ago" | grep -i "elect\|leader\|campaign"
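# Show which member is the leader (for a count of leader changes, use the
# etcd_server_leader_changes_seen_total metric; it should stay near 0 in a healthy cluster)
ETCDCTL_API=3 etcdctl endpoint status ... | grep -i leader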
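The effect of randomizing the election timeout can be illustrated with a small Python sketch. It only models the idea described above (each member draws a timeout between one and two times the configured value). The function name is hypothetical, and etcd's real implementation counts ticks rather than milliseconds.

```python
import random


def randomized_election_timeout(election_timeout_ms: int, rng: random.Random) -> int:
    """Pick a timeout in [election_timeout_ms, 2 * election_timeout_ms)."""
    return election_timeout_ms + rng.randrange(election_timeout_ms)


rng = random.Random(42)  # seeded only to make the sketch repeatable
timeouts = {f"etcd-{i}": randomized_election_timeout(1000, rng) for i in range(3)}
first = min(timeouts, key=timeouts.get)
print(timeouts, "-> first to campaign:", first)
```

Because the draws rarely coincide, one member usually times out well before the others and wins the election before they start campaigning.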
Storage Engine: boltdb and MVCC
boltdb (bbolt)
etcd uses boltdb (later forked to bbolt by etcd's maintainers) as its embedded storage engine. boltdb is a pure-Go, B+tree-based key-value store with the following properties:
- Single writer: Only one write transaction at a time (serialized via a mutex). Multiple concurrent read transactions are supported.
- ACID transactions: Writes are atomic and durable (fsync before returning). No partial writes.
- Memory-mapped file: The database file (member/snap/db) is memory-mapped with mmap(). Read transactions are zero-copy: they read directly from the OS page cache.
- Copy-on-write B+tree: Pages are never modified in-place; new pages are allocated and the old ones are freed after commit. This means reads don't block writes and vice versa.
etcd calls fsync() on WAL (Write-Ahead Log) entries before acknowledging writes to the leader. This is why etcd is extremely sensitive to disk I/O latency. Network-attached storage (NFS, SAN without write-back caching) is a common source of etcd instability. Use local NVMe SSDs. The benchmark for etcd disk: WAL fsync p99 < 10ms.
MVCC — Multi-Version Concurrency Control
etcd implements MVCC at the application level on top of boltdb. Every write creates a new revision of the key — the previous version is not overwritten. This allows:
- Watch history: Clients can watch from any past revision, receiving all changes since that point.
- Optimistic concurrency: Clients include the modRevision they read; if the key has been updated since, the transaction fails with a conflict.
- Consistent reads: Read at a specific revision to get a snapshot of the database at that point in time.
The MVCC data model:
Backend (boltdb "key" bucket), keyed by revision:
Key: revision → Value: (key, value, create_revision, mod_revision, version, lease)
rev=100 → {key=/registry/pods/default/nginx, data=..., create_rev=100, mod_rev=100, ver=1}
rev=150 → {key=/registry/pods/default/nginx, data=..., create_rev=100, mod_rev=150, ver=2}
rev=200 → {key=/registry/pods/default/nginx, data=..., create_rev=100, mod_rev=200, ver=3}
rev=201 → {key=/registry/secrets/default/mysecret, data=..., create_rev=201, mod_rev=201, ver=1}
Key index (in-memory B-tree, rebuilt from boltdb at startup):
/registry/pods/default/nginx → revisions [100, 150, 200]
/registry/secrets/default/mysecret → revisions [201]
The cluster revision (also called the global revision) is a monotonically increasing integer. Every write that changes the state (PUT or DELETE) increments the cluster revision. This maps directly to Kubernetes resourceVersion.
# Inspect MVCC versions for a key
ETCDCTL_API=3 etcdctl get /registry/pods/default/nginx \
--write-out=json \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key | python3 -m json.tool
# Output: {"kvs":[{"key":"...","create_revision":100,"mod_revision":200,"version":3,"value":"..."}]}
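# Read the key as it was at a past revision (fails if that revision has been compacted);
# to stream the change history instead, use: etcdctl watch <key> --rev=<revision>
ETCDCTL_API=3 etcdctl get /registry/pods/default/nginx \
--write-out=json --rev=150 # omit --rev (or pass 0) for the latest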
# Check current cluster revision
ETCDCTL_API=3 etcdctl endpoint status --write-out=json ... | jq '.[0].Status.header.revision'
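The revision semantics described in this section can be modeled with a toy in-memory store. This is an illustrative sketch only: the class and method names are hypothetical, deletes and compaction are left out, and real etcd keeps this data in boltdb.

```python
class ToyMVCC:
    """Minimal model of etcd revisions: every write gets a new cluster revision."""

    def __init__(self):
        self.revision = 0   # cluster-wide, monotonically increasing
        self.history = {}   # key -> list of (mod_revision, value), oldest first

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, rev=None):
        """Return (mod_revision, value) as of `rev`, or the latest when rev is None."""
        rev = self.revision if rev is None else rev
        visible = [(r, v) for r, v in self.history.get(key, []) if r <= rev]
        return visible[-1] if visible else None

    def put_if_unchanged(self, key, expected_mod_revision, value):
        """Compare-and-swap: write only if the key was not modified since it was read."""
        current = self.get(key)
        current_rev = current[0] if current else 0
        if current_rev != expected_mod_revision:
            return False  # conflict: someone else wrote first
        self.put(key, value)
        return True


store = ToyMVCC()
store.put("/registry/pods/default/nginx", "v1")        # revision 1
store.put("/registry/secrets/default/mysecret", "s1")  # revision 2
store.put("/registry/pods/default/nginx", "v2")        # revision 3
print(store.get("/registry/pods/default/nginx", rev=2))  # (1, 'v1')
```

A writer that read the pod at revision 1 and tries `put_if_unchanged(..., 1, ...)` is rejected, which is the same conflict Kubernetes surfaces when an update carries a stale resourceVersion.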
WAL (Write-Ahead Log) and Snapshots
Write-Ahead Log
Before any write is applied to boltdb, etcd writes it to the WAL — a sequential append-only log on disk. The WAL serves two purposes:
- Durability: If etcd crashes after writing to WAL but before applying to boltdb, it can replay the WAL on restart to recover.
- Replication: The leader sends WAL entries (as Raft log entries via AppendEntries RPC) to followers.
WAL files are stored in member/wal/. Each file is a fixed size (64MB by default). When a file fills up, a new segment is created. Old WAL segments are retained until a snapshot is taken.
Snapshots
Keeping the entire WAL forever would make startup slow (replaying millions of entries). Periodically, etcd takes a snapshot of the entire boltdb state. Snapshots are stored in member/snap/. After a snapshot is saved, WAL entries before the snapshot's Raft index can be discarded.
etcd snapshots are triggered when --snapshot-count entries have been applied since the last snapshot (default: 100,000 through etcd 3.5). For large clusters with high write rates, this can happen frequently.
# etcd data directory layout
/var/lib/etcd/
├── member/
│ ├── snap/
│ │ ├── 0000000000000002-0000000000030000.snap # snapshot at term-2, index-30000
│ │ └── db # boltdb database file (mmap'd)
│ └── wal/
│ ├── 0000000000000000-0000000000000000.wal # first WAL segment
│ └── 0000000000000001-0000000000030001.wal # second WAL segment
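# Check snapshot metadata (run this against a saved snapshot file or with etcd stopped,
# because a running etcd holds a lock on the live db file; on etcd 3.5+, etcdutl snapshot status is the preferred form)
ETCDCTL_API=3 etcdctl snapshot status member/snap/db --write-out=table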
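How a snapshot and the WAL combine during recovery can be sketched as follows. This is a simplified Python model under stated assumptions: entries are hypothetical `(index, key, value)` tuples, and the real WAL also carries Raft metadata and CRC records that are ignored here.

```python
def recover(snapshot_index, snapshot_state, wal_entries):
    """Rebuild state from a snapshot plus the WAL entries recorded after it.

    wal_entries: iterable of (index, key, value) tuples in log order.
    Returns (last_applied_index, state).
    """
    state = dict(snapshot_state)
    applied = snapshot_index
    for index, key, value in wal_entries:
        if index <= snapshot_index:
            continue  # already reflected in the snapshot
        state[key] = value
        applied = index
    return applied, state


wal = [(1, "a", "1"), (2, "b", "2"), (3, "a", "3"), (4, "c", "4")]
print(recover(2, {"a": "1", "b": "2"}, wal))
```

Only entries 3 and 4 are replayed, which is why older WAL segments become disposable once a snapshot covers them.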
Kubernetes Keyspace Layout
All Kubernetes objects are stored in etcd under the /registry/ prefix. The key format encodes the resource type, the namespace (for namespaced resources), and the name. Most built-in resources omit the API group from the key even when they belong to one; custom resources and a few built-in types include it. The exact layout can vary between Kubernetes versions, so verify it against your own cluster:
Cluster-scoped resources:
/registry/{resource}/{name}
/registry/minions/worker-1 # Nodes are stored under the legacy name "minions"
/registry/namespaces/default
/registry/clusterroles/cluster-admin
/registry/persistentvolumes/pvc-abc123
Namespaced resources:
/registry/{resource}/{namespace}/{name}
/registry/pods/default/nginx-abc123
/registry/deployments/production/myapp
/registry/replicasets/default/myapp-7d4f9c
/registry/jobs/default/etl-job
/registry/roles/default/read-pods
/registry/secrets/kube-system/bootstrap-token-xyz
Group-prefixed resources (custom resources and some built-in types):
/registry/{group}/{resource}/{namespace}/{name}
/registry/apiextensions.k8s.io/customresourcedefinitions/{name}
/registry/apiregistration.k8s.io/apiservices/{name}
Special keys:
/registry/events/default/nginx-abc.17b2c3d4e5f6 # Events (use separate etcd in large clusters)
/registry/leases/kube-system/kube-scheduler # Leader election
# Inspect the entire keyspace
ETCDCTL_API=3 etcdctl get /registry \
--prefix --keys-only \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key | head -50
# Count objects by type
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only ... | \
sed 's|/registry/||' | cut -d/ -f1 | sort | uniq -c | sort -rn
# Read a specific pod's raw protobuf value (binary)
ETCDCTL_API=3 etcdctl get /registry/pods/default/nginx-abc \
--print-value-only ... | strings | head -30
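The "count objects by type" pipeline above can also be expressed in Python, which is convenient when the key listing has already been saved to a file. A minimal sketch; the function name and the sample keys are illustrative only.

```python
from collections import Counter


def count_by_type(keys):
    """Count keys by the first path segment after /registry/, most common first."""
    counts = Counter()
    for key in keys:
        if not key.startswith("/registry/"):
            continue  # ignore anything outside the Kubernetes prefix
        first_segment = key[len("/registry/"):].split("/", 1)[0]
        counts[first_segment] += 1
    return counts.most_common()


sample_keys = [
    "/registry/pods/default/nginx-abc123",
    "/registry/pods/kube-system/coredns-xyz",
    "/registry/secrets/kube-system/bootstrap-token-xyz",
]
print(count_by_type(sample_keys))
```

As with the shell version, group-prefixed keys are counted under their group name rather than their resource name.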
Object Encoding
Built-in Kubernetes objects are stored in etcd as protobuf (not JSON), prefixed with a magic byte sequence: k8s\x00. Custom resources are stored as JSON. The protobuf envelope is:
Etcd value = "k8s\x00" + protobuf(Unknown{TypeMeta, Raw})
TypeMeta (in the Unknown struct):
TypeMeta.APIVersion = "apps/v1"
TypeMeta.Kind = "Deployment"
Raw holds the protobuf-encoded object in its versioned storage form (for example apps/v1); the unversioned "internal" Go types are never serialized.
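When inspecting raw values, the magic prefix is the quickest way to tell the encodings apart. The following Python sketch classifies a value by its leading bytes. The function name is hypothetical, and the JSON branch assumes the value belongs to a custom resource, which the apiserver stores as JSON.

```python
MAGIC = b"k8s\x00"  # prefix written before protobuf-encoded objects


def storage_format(value: bytes) -> str:
    """Guess how a raw etcd value under /registry/ is encoded."""
    if value.startswith(MAGIC):
        return "protobuf"
    if value.lstrip().startswith(b"{"):
        return "json"
    return "unknown"


print(storage_format(b"k8s\x00\n\x0c\n\x02v1\x12\x03Pod"))
print(storage_format(b'{"apiVersion":"example.com/v1","kind":"Widget"}'))
```

Values encrypted at rest carry a different prefix and would fall into the "unknown" branch of this sketch.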
Compaction and Defragmentation
Why Compaction Is Needed
MVCC keeps all historical revisions of every key. Without compaction, the database grows unboundedly. A busy cluster with 100 pods being updated 10 times/second generates 1,000 new MVCC entries per second. After a week that's 600 million entries.
Compaction removes superseded revisions older than a specified revision. For each key, the newest revision at or before the compaction point is kept, so current values are never lost.
# Manual compaction (compact everything older than current revision)
ETCD_ENDPOINTS="https://127.0.0.1:2379"
ETCD_CERT_FLAGS="--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key"
# Get current revision
REV=$(ETCDCTL_API=3 etcdctl endpoint status $ETCD_CERT_FLAGS --endpoints=$ETCD_ENDPOINTS --write-out=json | jq '.[0].Status.header.revision')
echo "Current revision: $REV"
# Compact to current revision (removes all history)
ETCDCTL_API=3 etcdctl compact $REV $ETCD_CERT_FLAGS --endpoints=$ETCD_ENDPOINTS
# Defragment (reclaim disk space after compaction — boltdb doesn't shrink automatically)
ETCDCTL_API=3 etcdctl defrag $ETCD_CERT_FLAGS --endpoints=$ETCD_ENDPOINTS
# Note: defrag blocks reads and writes on the member being defragmented; do it one member at a time, leader last
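What compaction keeps and what it drops can be modeled on a per-key revision list. A minimal Python sketch, assuming a hypothetical `history` mapping of key to sorted revisions and ignoring delete tombstones:

```python
def compact(history, compact_rev):
    """Drop superseded revisions at or below compact_rev.

    history: key -> sorted list of revisions at which the key was written.
    For each key, the newest revision <= compact_rev survives (it is still the
    visible value at compact_rev), along with every revision after compact_rev.
    """
    compacted = {}
    for key, revs in history.items():
        old = [r for r in revs if r <= compact_rev]
        new = [r for r in revs if r > compact_rev]
        compacted[key] = old[-1:] + new
    return compacted


history = {
    "/registry/pods/default/nginx": [100, 150, 200],
    "/registry/secrets/default/mysecret": [201],
}
print(compact(history, 150))

# Growth estimate from the text: 100 pods x 10 updates/s for one week.
print(100 * 10 * 7 * 24 * 3600)  # 604800000
```

Compacting at revision 150 removes only revision 100 of the pod; the database file itself does not shrink until a defrag rewrites it.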
Auto-Compaction
etcd supports automatic compaction via the --auto-compaction-mode and --auto-compaction-retention flags:
# Time-based: compact entries older than 1 hour
etcd --auto-compaction-mode=periodic --auto-compaction-retention=1h
# Revision-based: keep last 1000 revisions
etcd --auto-compaction-mode=revision --auto-compaction-retention=1000
# For Kubernetes: periodic with 8-hour retention is a common setting
# (note that kube-apiserver also compacts etcd on its own schedule via --etcd-compaction-interval,
# default 5m, so an Informer that falls behind the retained history must relist)
etcd enforces the --quota-backend-bytes limit (default 2GB). When it is exceeded, etcd raises a NOSPACE alarm and rejects all writes, causing the Kubernetes apiserver to fail all mutations. The cluster becomes effectively read-only. The etcd documentation suggests 8GB as the maximum for normal environments. Monitor etcd_mvcc_db_total_size_in_bytes and alert at 70% of quota.
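The 70% alert threshold is easy to express as a small check. This Python sketch uses hypothetical names and takes the two values you would read from etcd_mvcc_db_total_size_in_bytes and the configured quota:

```python
GIB = 1024 ** 3


def quota_status(db_size_bytes: int, quota_bytes: int, alert_ratio: float = 0.7) -> str:
    """Classify database size against the backend quota."""
    ratio = db_size_bytes / quota_bytes
    if ratio >= 1.0:
        return "nospace"  # etcd would reject writes at this point
    if ratio >= alert_ratio:
        return "alert"
    return "ok"


print(8 * GIB)                          # 8589934592, the value used for an 8GB quota
print(quota_status(6 * GIB, 8 * GIB))   # alert (75% of quota)
```

Alerting well before the limit leaves time to compact and defragment while writes still succeed.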
Leases — TTL-based Key Ownership
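etcd Leases are time-bounded tokens. A key can be attached to a Lease; when the Lease expires (client fails to renew), all attached keys are automatically deleted. The apiserver uses etcd leases to expire objects with a TTL, such as Events. Kubernetes applies the same renew-or-lose idea one level up with its own Lease API objects (coordination.k8s.io), which are stored as ordinary keys under /registry/leases/ and are used for:
- Leader election: kube-scheduler and kube-controller-manager hold a Lease object in the kube-system namespace. The holder is the active leader.
- Node heartbeats (v1.17+): kubelet updates a Lease object in the kube-node-lease namespace every 10s, avoiding constant updates to the heavy Node object.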
# Node heartbeat leases
kubectl get leases -n kube-node-lease
# NAME HOLDER AGE
# worker-1 worker-1 5d
# worker-2 worker-2 5d
# Leader election leases
kubectl get leases -n kube-system
# NAME HOLDER AGE
# kube-controller-manager kube-controller-manager-node1 5d
# kube-scheduler kube-scheduler-node1 5d
# Inspect a lease
kubectl get lease kube-scheduler -n kube-system -o yaml
# spec:
# holderIdentity: kube-scheduler-node1_abc-uuid
# leaseDurationSeconds: 15
# renewTime: "2026-05-15T18:00:00Z" # refreshed every retryPeriod (default 2s); must succeed within renewDeadline (10s)
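The expiry behaviour of an etcd lease (keys vanish when the holder stops renewing) can be modeled with a toy store. This is a sketch with hypothetical names and explicit timestamps instead of a real clock; it is not the etcd client API.

```python
class ToyLeaseStore:
    """Keys attached to a lease are deleted when the lease expires."""

    def __init__(self):
        self.kv = {}
        self.leases = {}  # lease_id -> {"ttl": seconds, "expires": time, "keys": set}

    def grant(self, lease_id, ttl, now):
        self.leases[lease_id] = {"ttl": ttl, "expires": now + ttl, "keys": set()}

    def put(self, key, value, lease_id=None):
        self.kv[key] = value
        if lease_id is not None:
            self.leases[lease_id]["keys"].add(key)

    def keep_alive(self, lease_id, now):
        lease = self.leases[lease_id]
        lease["expires"] = now + lease["ttl"]

    def tick(self, now):
        """Expire leases whose deadline has passed and delete their keys."""
        expired = [i for i, lease in self.leases.items() if lease["expires"] <= now]
        for lease_id in expired:
            for key in self.leases.pop(lease_id)["keys"]:
                self.kv.pop(key, None)


store = ToyLeaseStore()
store.grant(1, ttl=10, now=0)
store.put("/election/leader", "node1", lease_id=1)
store.put("/config/static", "kept")
store.keep_alive(1, now=8)   # deadline moves from 10 to 18
store.tick(now=15)
print(sorted(store.kv))      # both keys still present
store.tick(now=18)
print(sorted(store.kv))      # only the key without a lease remains
```

A holder that crashes simply stops calling `keep_alive`, and its keys disappear without any cleanup code running.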
TLS and Security Configuration
etcd uses two separate TLS configurations:
Client TLS (port 2379)
Used by the apiserver to connect to etcd. Requires:
- --trusted-ca-file: CA that signed apiserver's client cert
- --cert-file: etcd server's serving cert
- --key-file: etcd server's private key
- --client-cert-auth=true: require mTLS from clients
Peer TLS (port 2380)
Used between etcd members for Raft replication. Requires:
- --peer-trusted-ca-file: Peer CA (often same as etcd CA)
- --peer-cert-file: This member's peer cert
- --peer-key-file: This member's peer key
- --peer-client-cert-auth=true: require mTLS from peers
# Full etcd TLS configuration flags
# Flag groups, in order: client TLS, peer TLS, cluster formation.
# (A comment line between backslash-continued lines would end the command early,
# so the group labels are kept up here.)
etcd \
--name=etcd-0 \
--data-dir=/var/lib/etcd \
--listen-client-urls=https://0.0.0.0:2379 \
--advertise-client-urls=https://etcd-0.etcd.svc:2379 \
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--cert-file=/etc/kubernetes/pki/etcd/server.crt \
--key-file=/etc/kubernetes/pki/etcd/server.key \
--client-cert-auth=true \
--listen-peer-urls=https://0.0.0.0:2380 \
--initial-advertise-peer-urls=https://etcd-0.etcd.svc:2380 \
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
--peer-client-cert-auth=true \
--initial-cluster-state=new \
--initial-cluster="etcd-0=https://etcd-0.etcd.svc:2380,etcd-1=https://etcd-1.etcd.svc:2380,etcd-2=https://etcd-2.etcd.svc:2380"
Backup and Restore
Taking a Snapshot
etcd snapshots are atomic, consistent point-in-time copies of the entire keyspace. A snapshot taken from any member (including followers) captures the state at the snapshot revision.
#!/bin/bash
# Production etcd backup script
# Abort on the first failed command so a broken snapshot is never uploaded
set -euo pipefail
BACKUP_DIR="/backup/etcd/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
ETCD_CERT_FLAGS="
--cacert=/etc/kubernetes/pki/etcd/ca.crt
--cert=/etc/kubernetes/pki/etcd/peer.crt
--key=/etc/kubernetes/pki/etcd/peer.key"
# Take snapshot (connects to local etcd member)
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/snapshot.db" \
--endpoints=https://127.0.0.1:2379 \
$ETCD_CERT_FLAGS
# Verify snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status "$BACKUP_DIR/snapshot.db" --write-out=table
# Compress and upload to off-cluster storage (S3, GCS, etc.)
gzip "$BACKUP_DIR/snapshot.db"
aws s3 cp "$BACKUP_DIR/snapshot.db.gz" s3://my-cluster-backups/etcd/
echo "Backup complete: $BACKUP_DIR/snapshot.db.gz"
# Automate with a CronJob (if etcd runs outside the cluster) or system cron
# /etc/cron.d/etcd-backup
0 * * * * root /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1
Restoring From a Snapshot
Restoring from a snapshot replaces the entire etcd cluster state. This is a destructive operation — it overwrites all current data with the snapshot contents. It also forces etcd to generate a new cluster ID, preventing old members from joining.
#!/bin/bash
# Disaster recovery restore procedure
SNAPSHOT="/backup/etcd/20260515-020000/snapshot.db.gz"
RESTORE_DIR="/var/lib/etcd-restore"
# Step 1: Stop apiserver and all controllers (static pod: move manifest away)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
# Step 2: Stop etcd
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait for the etcd process to exit (polling is more reliable than a fixed sleep)
while pgrep -x etcd >/dev/null; do sleep 1; done
# Step 3: Restore snapshot to each member (run on each CP node with appropriate names/URLs)
# On etcd 3.5+, etcdutl snapshot restore is the preferred form of this command.
gunzip -c "$SNAPSHOT" > /tmp/snapshot.db
ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
--name=etcd-0 \
--initial-cluster="etcd-0=https://etcd-0:2380,etcd-1=https://etcd-1:2380,etcd-2=https://etcd-2:2380" \
--initial-cluster-token=etcd-cluster-restored-$(date +%s) \
--initial-advertise-peer-urls=https://etcd-0:2380 \
--data-dir="$RESTORE_DIR"
# Step 4: Replace old data directory
mv /var/lib/etcd /var/lib/etcd-old
mv "$RESTORE_DIR" /var/lib/etcd
chown -R etcd:etcd /var/lib/etcd # only if etcd runs as a dedicated etcd user; skip when it runs as root
# Step 5: Restart etcd (restore manifest)
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
# Wait for etcd to become healthy
sleep 10
ETCDCTL_API=3 etcdctl endpoint health ...
# Step 6: Restore apiserver and controllers
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/
Performance Tuning
Disk Requirements
etcd performance is dominated by WAL fsync latency. Guidelines:
| Metric | Target | Alert At | Action |
|---|---|---|---|
| WAL fsync p99 | <1ms | >10ms | Move to faster disk; check I/O scheduler |
| Backend commit p99 | <25ms | >100ms | Check boltdb size; trigger compaction/defrag |
| DB size | <2GB | >5GB (of 8GB quota) | Increase compaction frequency; check write rate |
| Peer round-trip latency | <1ms | >10ms | Keep etcd members on low-latency links (same datacenter, or nearby AZs in one region) |
| Leader elections/hour | 0 | >0 | Investigate disk latency, CPU pressure, network issues |
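The WAL fsync row can be turned into a quick check on raw latency samples, for example numbers collected from a disk benchmark. This Python sketch uses the nearest-rank percentile method and hypothetical function names. Prometheus's histogram_quantile interpolates over buckets instead, so its results will differ slightly.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile (pct is an integer 1..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct/100 * n, in integers
    return ordered[max(rank, 1) - 1]


def wal_fsync_verdict(latencies_ms):
    """Compare p99 fsync latency with the target (<1ms) and alert (>10ms) levels."""
    p99 = percentile(latencies_ms, 99)
    if p99 > 10:
        return p99, "alert"
    if p99 < 1:
        return p99, "on target"
    return p99, "watch"


samples = [0.5] * 98 + [12.0, 15.0]   # mostly fast, with two slow outliers
print(wal_fsync_verdict(samples))     # (12.0, 'alert')
```

The example shows why the table tracks p99 rather than the mean: two slow syncs out of a hundred are enough to cross the alert line.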
OS-Level Disk Tuning
# Check I/O scheduler (should be 'none' or 'noop' for SSDs)
cat /sys/block/nvme0n1/queue/scheduler
# If not 'none': echo none > /sys/block/nvme0n1/queue/scheduler
# Benchmark disk latency with fio (recommended before deploying etcd)
fio --rw=write --ioengine=sync --fdatasync=1 \
--directory=/var/lib/etcd \
--size=22m --bs=2300 \
--name=etcd-disk-benchmark
# Target: 99th percentile sync latency < 10ms
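# Raise etcd's I/O and CPU priority
ionice -c 1 -n 0 -p $(pgrep -x etcd) # set real-time I/O class (only honored by I/O schedulers that support priorities, such as bfq)
renice -n -10 -p $(pgrep -x etcd) # boost CPU priority (nice cannot change a running process)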
# Dedicated disk for etcd
# /etc/fstab: /dev/nvme1n1 /var/lib/etcd ext4 noatime,nodiratime 0 2
Resource Allocation
# etcd resource recommendations:
# CPU: 2-4 cores dedicated
# Memory: 8GB recommended (etcd uses mmap; OS will cache boltdb in page cache)
# Disk: Dedicated NVMe SSD, 50GB minimum
# If etcd runs as a static pod (kubeadm), set resources in the manifest:
# /etc/kubernetes/manifests/etcd.yaml
resources:
  requests:
    cpu: "100m"
    memory: "100Mi"
  limits:
    cpu: "4"
    memory: "8Gi"
# Isolate etcd from noisy neighbors using the cpuset controller (Linux, systemd with cgroup v2)
systemctl set-property etcd.service AllowedCPUs=0-3 # restrict to CPUs 0-3
Cluster Operations
Adding a Member
# Step 1: Add the member to the cluster (updates cluster membership)
ETCDCTL_API=3 etcdctl member add etcd-3 \
--peer-urls=https://etcd-3:2380 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
--endpoints=https://etcd-0:2379
# Step 2: Start etcd on the new node with --initial-cluster-state=existing
# The new member will automatically sync state from the leader
# Step 3: Verify the member joined
ETCDCTL_API=3 etcdctl member list ...
Removing a Failed Member
# Get member IDs
ETCDCTL_API=3 etcdctl member list \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
--endpoints=https://etcd-0:2379
# Remove the failed member by ID
ETCDCTL_API=3 etcdctl member remove ...
# Membership shrinks immediately, and quorum is recomputed from the new member count.
# A 3-member cluster that just had a member removed is now a 2-member cluster
# and needs BOTH remaining members to achieve quorum!
After member remove, the cluster size decreases. If you remove a member from a 3-member cluster while one member is already down, you now have a 2-member cluster where both remaining members must be up for quorum. Fault tolerance stays at zero until a replacement member is added. Be careful.
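The warning can be checked numerically. This Python sketch, with hypothetical helper names, computes how many further failures a cluster can absorb given its configured size and the number of healthy members:

```python
def quorum(members: int) -> int:
    return members // 2 + 1


def spare_failures(members: int, healthy: int) -> int:
    """Further failures the cluster can absorb; negative means quorum is already lost."""
    return healthy - quorum(members)


print(spare_failures(3, 3))  # 1: healthy 3-member cluster
print(spare_failures(3, 2))  # 0: one member down
print(spare_failures(2, 2))  # 0: the dead member was removed, still no headroom
print(spare_failures(3, 3))  # 1: headroom returns only after a replacement joins
```

Removing the failed member does not buy back any fault tolerance by itself; adding the replacement does.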
Recovering From Loss of Quorum
# Scenario: 3-member cluster, 2 members lost. 1 member running but no quorum.
# The remaining member will refuse all writes: "etcdserver: request timed out"
# Option A: Restore from backup (safest, may lose recent writes)
# (follow the restore procedure above)
# Option B: Force a new single-member cluster from the surviving member's data
# WARNING: This discards all writes since the last committed index on the survivor.
# Step 1: Stop etcd
systemctl stop etcd
# Step 2: Force a new cluster from this member's data dir
etcd --force-new-cluster --data-dir=/var/lib/etcd
# Step 3: Immediately run the apiserver against this single member
# Step 4: Add new etcd members to restore HA
Key etcd Metrics
# etcd exposes Prometheus metrics on its --listen-metrics-urls address (kubeadm uses http://127.0.0.1:2381/metrics) or on the client port (:2379/metrics)
# Sample critical queries:
# Leader elections (should be 0)
etcd_server_leader_changes_seen_total
# Write latency (WAL sync - most critical)
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
# Backend commit latency (boltdb)
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
# Database size
etcd_mvcc_db_total_size_in_bytes
etcd_mvcc_db_total_size_in_use_in_bytes # actual used; difference = fragmentation
# Slow operations
etcd_server_slow_apply_total # operations taking > 100ms
etcd_server_slow_read_indexes_total # read index operations taking too long
# Raft health
etcd_server_proposals_committed_total # rate = write throughput
etcd_server_proposals_pending # should be near 0
etcd_server_proposals_failed_total # should be 0
etcd_network_peer_round_trip_time_seconds # inter-member latency
# Prometheus alerting rules for etcd
groups:
  - name: etcd
    rules:
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 1m
        annotations:
          summary: "etcd member has no leader"
      - alert: EtcdHighNumberOfLeaderChanges
        expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
        annotations:
          summary: "etcd leader changed more than 3 times in 1 hour"
      - alert: EtcdHighFsyncDuration
        expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
        for: 5m
        annotations:
          summary: "etcd WAL fsync p99 > 10ms: disk performance issue"
      - alert: EtcdDbSizeExceeding
        expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.7
        for: 10m
        annotations:
          summary: "etcd database size exceeds 70% of quota"
      - alert: EtcdMemberDown
        expr: up{job="etcd"} == 0
        for: 3m
        annotations:
          summary: "etcd member is down"
Troubleshooting etcd
NOSPACE Alarm — DB Quota Exceeded
# Symptom: all apiserver writes fail with "etcdserver: mvcc: database space exceeded"
# Step 1: Check current state
ETCDCTL_API=3 etcdctl alarm list ... # should show NOSPACE
ETCDCTL_API=3 etcdctl endpoint status --write-out=table ... # check DB SIZE column
# Step 2: Compact and defrag (even if at quota, compaction is allowed)
REV=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json ... | jq '.[0].Status.header.revision')
ETCDCTL_API=3 etcdctl compact $REV ...
ETCDCTL_API=3 etcdctl defrag ...
# Step 3: Clear the alarm (only after DB size is back under quota)
ETCDCTL_API=3 etcdctl alarm disarm ...
# Step 4: Increase quota to prevent recurrence (restart required)
# etcd flag: --quota-backend-bytes=8589934592 (8GB)
Slow Writes / High Latency
# Run etcd's built-in load test (it writes test keys to the cluster for about 60s, so avoid peak hours)
ETCDCTL_API=3 etcdctl check perf ...
# Expected output: PASS lines for throughput and request latency
# Check I/O wait on the node
iostat -x 1 10 | grep -E "Device|nvme"
# Check if another process is contending for disk
iotop -o # interactive I/O monitor
# Check for etcd leader thrashing (election per minute = instability)
journalctl -u etcd --since "1 hour ago" | grep -c "elected leader"
# Check VM steal time (if on cloud)
vmstat 1 10 | awk '{print $17}' # st column = steal (17th field in the classic procps layout; confirm against the header); should be < 1%
Network Partition / Split-Brain Detection
# Detect if a member is isolated
ETCDCTL_API=3 etcdctl endpoint health --cluster ...
# Unhealthy member will show: "failed to commit proposal" or "context deadline exceeded"
# Check peer connectivity from within each member
# (etcd logs show peer connection errors)
journalctl -u etcd | grep "lost the TCP streaming connection"
journalctl -u etcd | grep "failed to send"
# Verify quorum is intact
ETCDCTL_API=3 etcdctl endpoint status --write-out=table ...
# RAFT TERM should be the same for all healthy members
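The term comparison can be automated by parsing the JSON form of endpoint status. The Python sketch below is an assumption-laden illustration: it assumes each array element has an `Endpoint` field and a `Status` object containing `header.raft_term` and `leader`, and the sample document is invented for the demo, so check the field names against the output of your etcdctl version.

```python
import json


def check_raft_terms(status_json: str) -> dict:
    """Summarize whether all reporting members agree on term and leader."""
    statuses = json.loads(status_json)
    terms = {s["Endpoint"]: s["Status"]["header"]["raft_term"] for s in statuses}
    leaders = {s["Status"]["leader"] for s in statuses}
    return {
        "terms": terms,
        "same_term": len(set(terms.values())) == 1,
        "single_leader": len(leaders) == 1,
    }


# Invented sample: the third member is stuck on an older term and an older leader.
SAMPLE = json.dumps([
    {"Endpoint": "https://etcd-0:2379", "Status": {"header": {"raft_term": 7}, "leader": 101}},
    {"Endpoint": "https://etcd-1:2379", "Status": {"header": {"raft_term": 7}, "leader": 101}},
    {"Endpoint": "https://etcd-2:2379", "Status": {"header": {"raft_term": 6}, "leader": 102}},
])
print(check_raft_terms(SAMPLE))
```

A member that reports a different term or leader than the rest is the one to examine for partition or disk problems.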
Data Corruption Recovery
# etcd detects data corruption via hash checks
# Symptom: "etcdserver: database file may be corrupt"
# Step 1: Check integrity
ETCDCTL_API=3 etcdctl snapshot status member/snap/db
# Step 2: If corrupt, restore from snapshot
# See backup/restore section above
# Prevention: avoid SIGKILL on etcd (prefer SIGTERM)
# SIGKILL skips graceful shutdown and can leave a partially written WAL record behind
kill -TERM $(pgrep etcd) # correct
# kill -9 $(pgrep etcd) # NEVER do this
Production Checklist
20-Item etcd Production Checklist
| # | Item | Verification |
|---|---|---|
| 1 | Odd number of members (3 or 5) | etcdctl member list |
| 2 | Members spread across AZs | Verify placement in cloud console |
| 3 | Dedicated NVMe SSD for etcd data | lsblk; df -h /var/lib/etcd |
| 4 | WAL fsync p99 < 10ms | Prometheus: etcd_disk_wal_fsync_duration_seconds |
| 5 | Hourly backups with off-cluster storage | Check backup cron; verify S3/GCS bucket |
| 6 | Backup restore tested | Annual DR drill: restore from snapshot to test cluster |
| 7 | TLS + mTLS on both client and peer ports | --client-cert-auth=true --peer-client-cert-auth=true |
| 8 | etcd not reachable from outside CP network | Firewall: port 2379/2380 restricted to CP node IPs |
| 9 | Auto-compaction configured | --auto-compaction-mode=periodic --auto-compaction-retention=8h |
| 10 | Quota set to 8GB | --quota-backend-bytes=8589934592 |
| 11 | DB size alerting at 70% quota | Prometheus alert configured |
| 12 | Leader election count alert | Alert on >3 elections/hour |
| 13 | All members healthy | etcdctl endpoint health --cluster |
| 14 | Etcd logs monitored for errors | Loki/CloudWatch alert on "failed to commit" |
| 15 | Separate etcd CA from Kubernetes CA | Verify issuer: openssl x509 -in etcd/ca.crt -noout -subject |
| 16 | Etcd member certs expiry > 30 days | kubeadm certs check-expiration |
| 17 | Events stored in separate etcd (large clusters) | Check apiserver --etcd-servers-overrides flag |
| 18 | Resource limits on etcd static pod | grep -A5 resources /etc/kubernetes/manifests/etcd.yaml |
| 19 | No NFS/network storage for etcd data dir | df -T /var/lib/etcd (should be local) |
| 20 | Never SIGKILL etcd; use SIGTERM for shutdown | Review kubelet static pod management; check restart policy |
Separating Events Into a Dedicated etcd
Kubernetes Events are extremely high-volume — every pod phase change, scheduler decision, and controller action generates an Event. In large clusters, Events can account for 50%+ of etcd writes, consuming quota and slowing down writes to actual workload state.
# Configure apiserver to use a separate etcd for Events
# (override format: group/resource#servers; servers are separated by semicolons, overrides by commas)
kube-apiserver \
--etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
--etcd-servers-overrides='/events#https://etcd-events-0:2379;https://etcd-events-1:2379'
# This routes all /registry/events/* keys to the events etcd cluster
# The events etcd can be smaller and have shorter retention
# (Events themselves have a default TTL of 1 hour)
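The override syntax is compact enough to misread, so a small parser can help when validating a flag value before rollout. This Python sketch assumes the format documented for the kube-apiserver flag (overrides separated by commas, each written as group/resource#servers, servers separated by semicolons). The function name is hypothetical.

```python
def parse_overrides(flag_value: str) -> dict:
    """Parse 'group/resource#url;url,group/resource#url' into {resource: [urls]}."""
    overrides = {}
    for item in flag_value.split(","):
        resource, _, servers = item.partition("#")
        overrides[resource] = servers.split(";")
    return overrides


value = "/events#https://etcd-events-0:2379;https://etcd-events-1:2379"
print(parse_overrides(value))
```

If two server URLs are joined with a comma instead of a semicolon, the second URL shows up as a separate, malformed override, which makes the mistake easy to spot.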
Dependency Graph and Next Files
This File Covers
- Raft consensus: states, write flow, quorum
- boltdb storage engine + MVCC
- WAL and snapshots
- Kubernetes keyspace layout + protobuf encoding
- Compaction and defragmentation
- Leases (leader election + node heartbeats)
- TLS/mTLS configuration
- Backup and restore procedures
- Performance tuning (disk, OS, resources)
- Cluster operations (add/remove members)
- Key metrics + Prometheus alerting rules
- Troubleshooting runbooks (NOSPACE, slow writes, split-brain, corruption)