01-control-plane/02-etcd.html
Prerequisites: 01-control-plane/00-control-plane-overview.html · 01-control-plane/01-kube-apiserver.html
Related: 09-production-operations/02-etcd-operations.html · 12-troubleshooting/09-etcd-issues.html

etcd — Complete Internals for Kubernetes

Raft consensus, MVCC storage engine, WAL, snapshots, compaction, keyspace layout, backup/restore, TLS, performance tuning, and production operations for the Kubernetes backing store.

What Is etcd?

etcd is a strongly consistent, distributed key-value store written in Go. It was created by CoreOS in 2013 and became the backing store for Kubernetes in its earliest versions. The name comes from /etc (Unix configuration directory) + d (distributed). etcd provides the following guarantees that Kubernetes relies on:

Strong consistency: By default, reads are linearizable: every read returns the latest committed write, as if the cluster had a single node (given quorum). Stale reads occur only when a client explicitly asks for a serializable read.
High availability: Tolerates minority member failures while maintaining consistency and availability for the majority partition.
Watch: Clients can watch a key or prefix and receive real-time notifications when values change. This powers the Kubernetes Informer machinery.
Transactions (MVCC): Multi-Version Concurrency Control allows optimistic locking — compare-and-swap semantics prevent lost updates between concurrent writers.
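Leases: Time-bounded ownership of keys; keys attached to a lease are deleted when it expires. The apiserver uses etcd leases for object TTLs such as Events. Kubernetes leader election and node heartbeats use Lease API objects, which are ordinary objects stored in etcd.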

etcd IS the Cluster

If all etcd members are permanently lost and no backup exists, the cluster is gone. All Kubernetes objects — every Pod, Deployment, Secret, ConfigMap, ServiceAccount, RBAC policy — live exclusively in etcd. Running pods continue to run on nodes (kubelet doesn't need etcd), but no management, scheduling, or reconciliation is possible.

Raft Consensus Algorithm

etcd uses the Raft consensus algorithm (2013, Ongaro & Ousterhout) to ensure all members agree on the log order even in the presence of failures. Raft is designed to be more understandable than Paxos while providing the same guarantees.

Member States

Every etcd member is always in one of three states:

Follower

Passive state. Receives AppendEntries RPCs from the leader (log replication + heartbeats). Forwards client writes to the leader. Starts an election if no heartbeat is received within the randomized election timeout (150–300ms in the Raft paper; 1000–2000ms with etcd's defaults).

Candidate

Transitional state during election. Increments its term, votes for itself, and sends RequestVote RPCs to all peers. Transitions to Leader if it gets majority votes, or back to Follower if it discovers a current leader or higher term.

Leader

The active state for ONE member at a time. Accepts all client writes. Replicates entries to followers via AppendEntries. Sends periodic heartbeats (every 100ms with etcd's default --heartbeat-interval) to prevent elections. Steps down if it can't reach a majority.

Write Commit Flow

Figure 1: Raft write commit sequence. The leader writes to its WAL, replicates to followers in parallel, waits for majority ACK (quorum), commits to boltdb, then sends the response. Followers apply the entry on the next heartbeat carrying the updated commitIndex.
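The "wait for majority ACK" step in the commit sequence can be sketched as a small calculation. This is a simplified model, not etcd's code: `match_index` is an assumed list holding, for every member including the leader, the highest log index that member has persisted to its WAL. Real Raft additionally requires the entry to belong to the leader's current term, which is omitted here.

```python
def commit_index(match_index):
    """Return the highest log index that is stored on a majority of members."""
    majority = len(match_index) // 2 + 1
    ranked = sorted(match_index, reverse=True)
    # The entry at position (majority - 1) is persisted by at least `majority` members.
    return ranked[majority - 1]


# Leader and one follower have entry 7 on disk; the second follower lags at 5.
print(commit_index([7, 7, 5]))        # 7: two of three members hold entry 7
# Only the leader has entry 7 so far.
print(commit_index([7, 5, 5]))        # 5: entry 7 is not yet committed
print(commit_index([9, 8, 7, 3, 2]))  # 7: three of five members hold entry 7
```

Note how one slow follower in a 3-member cluster does not delay commits, while two slow followers do.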

Quorum Math

Raft requires ⌊n/2⌋ + 1 votes (a majority) to elect a leader and to commit an entry. This determines how many member failures a cluster can tolerate:

Members (n)	Quorum	Tolerated Failures	Recommendation
1	1	0	Dev/test only; any failure = total loss
2	2	0	Worse than 1 — both must be up; never use
3	2	1	Minimum HA. Standard for most production clusters.
4	3	1	Same fault tolerance as 3, but more expensive; avoid
5	3	2	Large/critical clusters. Handles rolling upgrades safely.
7	4	3	Very large clusters. Write latency increases with more members.

Always Use Odd Numbers

Even numbers of members give no benefit. A 4-member cluster tolerates only 1 failure (same as 3) but requires 3 of 4 members for quorum — making it strictly harder to maintain. More members also increase write latency because the leader must wait for more ACKs.
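The quorum table above follows directly from the ⌊n/2⌋ + 1 formula. A minimal sketch that reproduces it (the function names are illustrative only):

```python
def quorum(members):
    """Votes needed to elect a leader or commit an entry."""
    return members // 2 + 1


def tolerated_failures(members):
    """Members that can fail while a quorum is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5, 7):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Going from 3 to 4 members raises the quorum from 2 to 3 without raising the number of tolerated failures, which is the arithmetic behind the odd-number rule.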

Leader Election Timing

etcd's election timeout is randomized per member between --election-timeout (default 1000ms) and 2 * --election-timeout (2000ms). The randomization prevents multiple members from starting elections simultaneously. The heartbeat interval is --heartbeat-interval (default 100ms). Rule: election timeout should be ≥ 10× heartbeat interval.

# Check leader, Raft term, and Raft index on a running member
# (the election and heartbeat intervals themselves are set by etcd flags, not reported here)
ETCDCTL_API=3 etcdctl endpoint status --write-out=json \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key | python3 -m json.tool

# Watch for election events
journalctl -u etcd --since "5 min ago" | grep -i "elect\|leader\|campaign"

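# Show which member is currently the leader. Leader changes over time are counted by the
# etcd_server_leader_changes_seen_total metric (should stay near 0 in a healthy cluster)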
ETCDCTL_API=3 etcdctl endpoint status ... | grep -i leader
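The randomized election timeout described above can be illustrated with a short simulation. This is a sketch under the stated defaults (1000ms election timeout, 100ms heartbeat), not etcd's implementation; the member names and the seeded random generator are assumptions made for the example.

```python
import random


def randomized_election_timeout_ms(election_timeout_ms=1000, rng=random):
    """Draw a per-member timeout in [election_timeout, 2 * election_timeout)."""
    return election_timeout_ms + rng.randrange(election_timeout_ms)


def timing_rule_ok(heartbeat_ms=100, election_timeout_ms=1000):
    """Rule of thumb from the text: election timeout >= 10x heartbeat interval."""
    return election_timeout_ms >= 10 * heartbeat_ms


rng = random.Random(42)  # seeded only to make the example repeatable
timeouts = {name: randomized_election_timeout_ms(1000, rng)
            for name in ("etcd-0", "etcd-1", "etcd-2")}
first = min(timeouts, key=timeouts.get)
print(timeouts)
print("first member to start campaigning:", first)
print("defaults satisfy the 10x rule:", timing_rule_ok())
```

Because each member draws a different timeout, one of them usually times out first and wins the election before the others start campaigning.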

Storage Engine: boltdb and MVCC

boltdb (bbolt)

etcd uses boltdb (later forked to bbolt by etcd's maintainers) as its embedded storage engine. boltdb is a pure-Go, B+tree-based key-value store with the following properties:

Single writer: Only one write transaction at a time (serialized via a mutex). Multiple concurrent read transactions are supported.
ACID transactions: Writes are atomic and durable (fsync before returning). No partial writes.
Memory-mapped file: The database file (member/snap/db) is memory-mapped with mmap(). Read transactions are zero-copy — they read directly from the OS page cache.
Copy-on-write B+tree: Pages are never modified in-place; new pages are allocated and the old ones are freed after commit. This means reads don't block writes and vice versa.

fsync is Mandatory

Every etcd member, leader and followers alike, calls fsync() on its WAL (Write-Ahead Log) before acknowledging an entry. This is why etcd is extremely sensitive to disk I/O latency. Network-attached storage (NFS, SAN without write-back caching) is a common source of etcd instability. Use local NVMe SSDs. The benchmark for etcd disk: WAL fsync p99 < 10ms.

MVCC — Multi-Version Concurrency Control

etcd implements MVCC at the application level on top of boltdb. Every write creates a new revision of the key — the previous version is not overwritten. This allows:

Watch history: Clients can watch from any past revision, receiving all changes since that point.
Optimistic concurrency: Clients include the modRevision they read; if it's been updated, the transaction fails with a conflict.
Consistent reads: Read at a specific revision to get a snapshot of the database at that point in time.

The MVCC data model:

Backend (boltdb "key" bucket), keyed by revision:
  Key: revision     → Value: (key, value, create_revision, mod_revision, version, lease)

  rev=100 → {key=/registry/pods/default/nginx, data=..., create_rev=100, mod_rev=100, ver=1}
  rev=150 → {key=/registry/pods/default/nginx, data=..., create_rev=100, mod_rev=150, ver=2}
  rev=200 → {key=/registry/pods/default/nginx, data=..., create_rev=100, mod_rev=200, ver=3}
  rev=201 → {key=/registry/secrets/default/mysecret, data=..., create_rev=201, mod_rev=201, ver=1}

Key index (in-memory B-tree, rebuilt from boltdb on startup), keyed by user key:
  /registry/pods/default/nginx         → revisions [100, 150, 200]
  /registry/secrets/default/mysecret   → revisions [201]

The cluster revision (also called the global revision) is a monotonically increasing integer. Every write that changes the state (PUT or DELETE) increments the cluster revision. This maps directly to Kubernetes resourceVersion.
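The revision bookkeeping above can be modelled in a few lines. The following is a toy in-memory sketch, not etcd's storage code: `ToyMVCC` and its method names are hypothetical, deletes (tombstones) are left out, and `put_if_unchanged` stands in for an etcd transaction that compares `mod_revision` before writing.

```python
class ToyMVCC:
    def __init__(self):
        self.revision = 0   # cluster-wide revision counter
        self.history = {}   # key -> list of (mod_rev, create_rev, version, value)

    def put(self, key, value):
        self.revision += 1
        versions = self.history.setdefault(key, [])
        if versions:
            _, create_rev, version, _ = versions[-1]
        else:
            create_rev, version = self.revision, 0
        versions.append((self.revision, create_rev, version + 1, value))
        return self.revision

    def get(self, key, rev=0):
        """Newest entry at or before `rev`; rev=0 means the latest revision."""
        rev = rev or self.revision
        visible = [v for v in self.history.get(key, []) if v[0] <= rev]
        return visible[-1] if visible else None

    def put_if_unchanged(self, key, value, expected_mod_rev):
        """Optimistic concurrency: write only if mod_revision still matches."""
        current = self.get(key)
        if (current[0] if current else 0) != expected_mod_rev:
            return False
        self.put(key, value)
        return True


store = ToyMVCC()
pod = "/registry/pods/default/nginx"
store.put(pod, "v1")                                   # revision 1
store.put("/registry/secrets/default/mysecret", "s")   # revision 2
store.put(pod, "v2")                                   # revision 3
print(store.get(pod))                                  # (3, 1, 2, 'v2')
print(store.get(pod, rev=2))                           # (1, 1, 1, 'v1')
print(store.put_if_unchanged(pod, "v3", expected_mod_rev=1))  # False: stale read
print(store.put_if_unchanged(pod, "v3", expected_mod_rev=3))  # True
```

The failed write mirrors what a Kubernetes client sees as a 409 Conflict when it updates an object using an outdated resourceVersion.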

# Inspect MVCC versions for a key
ETCDCTL_API=3 etcdctl get /registry/pods/default/nginx \
  --write-out=json \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key | python3 -m json.tool
# Output: {"kvs":[{"key":"...","create_revision":100,"mod_revision":200,"version":3,"value":"..."}]}

# Read the key as it existed at a specific revision (fails if that revision has been compacted)
ETCDCTL_API=3 etcdctl get /registry/pods/default/nginx \
  --write-out=json --rev=150   # --rev=0 (the default) means latest

# Check current cluster revision
ETCDCTL_API=3 etcdctl endpoint status --write-out=json ... | jq '.[0].Status.header.revision'

WAL (Write-Ahead Log) and Snapshots

Write-Ahead Log

Before any write is applied to boltdb, etcd writes it to the WAL — a sequential append-only log on disk. The WAL serves two purposes:

Durability: If etcd crashes after writing to WAL but before applying to boltdb, it can replay the WAL on restart to recover.
Replication: The leader sends WAL entries (as Raft log entries via AppendEntries RPC) to followers.

WAL files are stored in member/wal/. Each file is a fixed size (64MB by default). When a file fills up, a new segment is created. Old WAL segments are retained until a snapshot is taken.

Snapshots

Keeping the entire WAL forever would make startup slow (replaying millions of entries). Periodically, etcd takes a snapshot of the entire boltdb state. Snapshots are stored in member/snap/. After a snapshot is saved, WAL entries before the snapshot's Raft index can be discarded.

etcd snapshots are triggered when --snapshot-count entries have been applied since the last snapshot (default: 100,000 in etcd v3.5 and earlier; check the default for your version). For large clusters with high write rates, this can happen frequently.
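The interplay between snapshots and WAL replay can be sketched as follows. This is a toy model under simplifying assumptions: WAL entries are `(index, key, value)` tuples, a value of `None` means delete, and the exact boundary condition in `should_snapshot` is an assumption rather than etcd's precise check.

```python
def recover(snapshot_state, snapshot_index, wal_entries):
    """Rebuild state after a crash: load the snapshot, then replay newer WAL entries."""
    state = dict(snapshot_state)
    applied = snapshot_index
    for index, key, value in wal_entries:
        if index <= snapshot_index:
            continue  # already reflected in the snapshot
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
        applied = index
    return state, applied


def should_snapshot(applied_index, last_snapshot_index, snapshot_count=100_000):
    """True once snapshot_count entries have been applied since the last snapshot."""
    return applied_index - last_snapshot_index >= snapshot_count


wal = [(1, "a", "1"), (2, "b", "2"), (3, "a", "3"), (4, "b", None)]
print(recover({"a": "1", "b": "2"}, snapshot_index=2, wal_entries=wal))  # ({'a': '3'}, 4)
print(should_snapshot(applied_index=130_000, last_snapshot_index=30_000))  # True
```

Only entries 3 and 4 are replayed; the snapshot makes entries 1 and 2 redundant, which is why older WAL segments can be deleted.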

# etcd data directory layout
/var/lib/etcd/
├── member/
│   ├── snap/
│   │   ├── 0000000000000002-0000000000030000.snap   # snapshot at term-2, index-30000
│   │   └── db                                       # boltdb database file (mmap'd)
│   └── wal/
│       ├── 0000000000000000-0000000000000000.wal    # first WAL segment
│       └── 0000000000000001-0000000000030001.wal    # second WAL segment

# Check snapshot metadata (etcd v3.5+ moves this subcommand to etcdutl)
ETCDCTL_API=3 etcdctl snapshot status /var/lib/etcd/member/snap/db --write-out=table

Kubernetes Keyspace Layout

All Kubernetes objects are stored in etcd under the /registry/ prefix. The key format encodes the resource type, namespace (for namespaced resources), and name. The API group appears in the key only for some resources:

Cluster-scoped resources:
  /registry/{resource}/{name}
  /registry/minions/worker-1                # Nodes are stored under the legacy name "minions"
  /registry/namespaces/default
  /registry/clusterroles/cluster-admin
  /registry/persistentvolumes/pvc-abc123

Namespaced resources:
  /registry/{resource}/{namespace}/{name}
  /registry/pods/default/nginx-abc123
  /registry/deployments/production/myapp
  /registry/secrets/kube-system/bootstrap-token-xyz

Most built-in resources omit the API group, even when they belong to one:
  /registry/deployments/default/myapp              # apps
  /registry/replicasets/default/myapp-7d4f9c       # apps
  /registry/jobs/default/etl-job                   # batch
  /registry/roles/default/read-pods                # rbac.authorization.k8s.io

Group-prefixed keys (custom resources and a few built-in groups):
  /registry/{group}/{resource}/{namespace}/{name}
  /registry/apiextensions.k8s.io/customresourcedefinitions/{name}
  /registry/apiregistration.k8s.io/apiservices/{name}

Special keys:
  /registry/events/default/nginx-abc.17b2c3d4e5f6    # Events (use separate etcd in large clusters)
  /registry/leases/kube-system/kube-scheduler         # Leader election
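The key patterns above can be expressed as a small helper. This is an illustrative sketch only: `registry_key` and `list_prefix` are hypothetical names, and whether a given resource uses a group segment has to be checked against a real cluster with `etcdctl get /registry --prefix --keys-only`.

```python
def registry_key(resource, name, namespace=None, group=None, prefix="/registry"):
    """Build an etcd key following the /registry/... patterns shown above."""
    parts = [prefix]
    if group:
        parts.append(group)
    parts.append(resource)
    if namespace:
        parts.append(namespace)
    parts.append(name)
    return "/".join(parts)


def list_prefix(resource, namespace=None, prefix="/registry"):
    """Prefix for a range read; the trailing slash avoids matching e.g. 'pods-other'."""
    parts = [prefix, resource] + ([namespace] if namespace else [])
    return "/".join(parts) + "/"


print(registry_key("pods", "nginx-abc123", namespace="default"))
# /registry/pods/default/nginx-abc123
print(registry_key("namespaces", "default"))
# /registry/namespaces/default
print(registry_key("customresourcedefinitions", "widgets.example.com",
                   group="apiextensions.k8s.io"))
# /registry/apiextensions.k8s.io/customresourcedefinitions/widgets.example.com
print(list_prefix("pods", "default"))
# /registry/pods/default/
```

The name `widgets.example.com` is a made-up CRD used only to show the group-prefixed form.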

# Inspect the entire keyspace
ETCDCTL_API=3 etcdctl get /registry \
  --prefix --keys-only \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key | head -50

# Count objects by type
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only ... | \
  sed 's|/registry/||' | cut -d/ -f1 | sort | uniq -c | sort -rn

# Read a specific pod's raw protobuf value (binary)
ETCDCTL_API=3 etcdctl get /registry/pods/default/nginx-abc \
  --print-value-only ... | strings | head -30

Object Encoding

Built-in Kubernetes objects are stored in etcd as protobuf (not JSON) by default, prefixed with a magic byte sequence: k8s\x00. Custom resources are stored as JSON. The protobuf encoding is:

Etcd value = "k8s\x00" + protobuf(Unknown{TypeMeta, Raw})

Unknown struct (envelope):
  TypeMeta.APIVersion = "apps/v1"
  TypeMeta.Kind = "Deployment"
  Raw = protobuf(object)

The object is serialized in its versioned storage type (for example apps/v1). The apiserver's "internal" Go types are never written to etcd.
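A reader of raw etcd values can use the magic prefix to tell protobuf-encoded objects from other encodings. The sketch below only checks and strips the 4-byte prefix; it does not decode the protobuf payload, and the sample byte strings are placeholders rather than real stored objects.

```python
MAGIC = b"k8s\x00"  # prefix written before protobuf-encoded objects


def split_stored_value(raw):
    """Return (is_protobuf, payload) for a raw etcd value."""
    if raw.startswith(MAGIC):
        return True, raw[len(MAGIC):]
    return False, raw


print(split_stored_value(b"k8s\x00\x0a\x0d\x0a\x02v1"))   # (True, b'\n\r\n\x02v1')
print(split_stored_value(b'{"kind":"Widget"}'))           # (False, b'{"kind":"Widget"}')
```

Values without the prefix (for example JSON) are returned unchanged so the caller can parse them with the appropriate decoder.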

Why Protobuf?

JSON encoding is used for the API (client-facing), but protobuf is used for storage because it is 2–4× smaller and faster to encode/decode. A large cluster with 50,000 pods stores ~50MB of pod data in etcd. Protobuf keeps this manageable. The apiserver transparently converts between JSON (API) and protobuf (storage).

Compaction and Defragmentation

Why Compaction Is Needed

MVCC keeps all historical revisions of every key. Without compaction, the database grows unboundedly. A busy cluster with 100 pods being updated 10 times/second generates 1,000 new MVCC entries per second. After a week (604,800 seconds) that's roughly 600 million entries.

Compaction removes all historical revisions older than a specified revision. For each key it keeps only the newest revision at or before that point, plus anything written after it.

# Manual compaction (compact everything older than current revision)
ETCD_ENDPOINTS="https://127.0.0.1:2379"
ETCD_CERT_FLAGS="--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key"

# Get current revision
REV=$(ETCDCTL_API=3 etcdctl endpoint status $ETCD_CERT_FLAGS --endpoints=$ETCD_ENDPOINTS --write-out=json | jq '.[0].Status.header.revision')
echo "Current revision: $REV"

# Compact to current revision (removes all history)
ETCDCTL_API=3 etcdctl compact $REV $ETCD_CERT_FLAGS --endpoints=$ETCD_ENDPOINTS

# Defragment (reclaim disk space after compaction — boltdb doesn't shrink automatically)
ETCDCTL_API=3 etcdctl defrag $ETCD_CERT_FLAGS --endpoints=$ETCD_ENDPOINTS
# Note: defrag briefly blocks all writes — do it one member at a time, non-leader first
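What compaction does to the revision history can be shown on a toy data structure. This is a conceptual sketch, not etcd's algorithm: `history` maps each key to a sorted list of `(mod_revision, value)` pairs, and delete tombstones are not modelled.

```python
def compact(history, compact_rev):
    """Drop superseded revisions at or below compact_rev, keeping the newest of them."""
    compacted = {}
    for key, versions in history.items():
        older = [v for v in versions if v[0] <= compact_rev]
        newer = [v for v in versions if v[0] > compact_rev]
        # The newest entry at or before compact_rev is still needed to answer
        # reads at compact_rev, so it survives.
        compacted[key] = older[-1:] + newer
    return compacted


history = {
    "/registry/pods/default/nginx": [(100, "a"), (150, "b"), (200, "c")],
    "/registry/secrets/default/mysecret": [(201, "s")],
}
print(compact(history, 150)["/registry/pods/default/nginx"])  # [(150, 'b'), (200, 'c')]
print(compact(history, 201)["/registry/pods/default/nginx"])  # [(200, 'c')]

# The weekly growth estimate from the text: 100 pods x 10 updates/s for 7 days.
print(100 * 10 * 7 * 24 * 3600)  # 604800000
```

Compaction only marks the space as free inside boltdb; the file itself shrinks only after defrag, which is why the two commands are run together.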

Auto-Compaction

etcd supports automatic compaction via the --auto-compaction-mode and --auto-compaction-retention flags:

# Time-based: compact entries older than 1 hour
etcd --auto-compaction-mode=periodic --auto-compaction-retention=1h

# Revision-based: keep last 1000 revisions
etcd --auto-compaction-mode=revision --auto-compaction-retention=1000

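# For Kubernetes: periodic with 8-hour retention is a common setting
# (intended to let Informers resume from an 8-hour-old resourceVersion before needing a relist).
# Note: kube-apiserver also compacts etcd on its own schedule (--etcd-compaction-interval,
# default 5m), so the effective history window is usually the shorter of the two.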

DB Size Quota

etcd has a --quota-backend-bytes limit (default 2GB). When the database exceeds it, etcd raises a cluster-wide NOSPACE alarm and rejects writes, so the Kubernetes apiserver fails all mutations and the cluster becomes effectively read-only until the alarm is cleared. The etcd documentation suggests not exceeding 8GB, which is the usual setting for large clusters. Monitor etcd_mvcc_db_total_size_in_bytes and alert at 70% of quota.
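The quota arithmetic is simple enough to check by hand. The sketch below assumes an 8GiB quota and the 70% alert threshold mentioned above; `quota_status` is a hypothetical helper, and the input would come from the etcd_mvcc_db_total_size_in_bytes metric.

```python
GIB = 1024 ** 3


def quota_status(db_size_bytes, quota_bytes=8 * GIB, alert_ratio=0.70):
    """Classify the database size relative to --quota-backend-bytes."""
    ratio = db_size_bytes / quota_bytes
    if ratio >= 1.0:
        return "NOSPACE"
    if ratio >= alert_ratio:
        return "ALERT"
    return "OK"


print(8 * GIB)                   # 8589934592, the value passed to --quota-backend-bytes
print(quota_status(5 * GIB))     # OK      (62.5% of quota)
print(quota_status(6 * GIB))     # ALERT   (75% of quota)
print(quota_status(8 * GIB))     # NOSPACE
```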

Leases — TTL-based Key Ownership

etcd Leases are time-bounded tokens. A key can be attached to a Lease; when the Lease expires (client fails to renew), all attached keys are automatically deleted. The apiserver uses this mechanism for object TTLs such as Events. Kubernetes also exposes its own Lease API objects (coordination.k8s.io), which are ordinary objects stored in etcd that the holder renews by updating renewTime. These are used for:

Leader election: kube-scheduler and kube-controller-manager hold a Lease object in the kube-system namespace. The holder is the active leader.
Node heartbeats (v1.17+): kubelet updates a Lease object in kube-node-lease namespace every 10s, avoiding constant updates to the heavy Node object.
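The etcd lease behaviour described above (keys attached to a lease disappear when it is not renewed) can be modelled with explicit timestamps. `ToyLeaseStore`, its methods, and the key `/demo/leader` are hypothetical names for this sketch; time is passed in as a plain number so the example is deterministic.

```python
class ToyLeaseStore:
    def __init__(self):
        self.leases = {}   # lease_id -> (ttl_seconds, expires_at)
        self.keys = {}     # key -> (value, lease_id)

    def grant(self, lease_id, ttl_seconds, now):
        self.leases[lease_id] = (ttl_seconds, now + ttl_seconds)

    def keep_alive(self, lease_id, now):
        """Renewal pushes the expiry out by one full TTL."""
        ttl_seconds, _ = self.leases[lease_id]
        self.leases[lease_id] = (ttl_seconds, now + ttl_seconds)

    def put(self, key, value, lease_id):
        if lease_id not in self.leases:
            raise KeyError("lease not found")
        self.keys[key] = (value, lease_id)

    def expire(self, now):
        """Revoke expired leases and delete every key attached to them."""
        expired = {lid for lid, (_, exp) in self.leases.items() if exp <= now}
        deleted = sorted(k for k, (_, lid) in self.keys.items() if lid in expired)
        for key in deleted:
            del self.keys[key]
        for lid in expired:
            del self.leases[lid]
        return deleted


store = ToyLeaseStore()
store.grant(lease_id=1, ttl_seconds=15, now=0)
store.put("/demo/leader", "node1", lease_id=1)
store.keep_alive(1, now=10)      # expiry moves from t=15 to t=25
print(store.expire(now=20))      # []
print(store.expire(now=25))      # ['/demo/leader']
```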

# Node heartbeat leases
kubectl get leases -n kube-node-lease
# NAME          HOLDER        AGE
# worker-1      worker-1      5d
# worker-2      worker-2      5d

# Leader election leases
kubectl get leases -n kube-system
# NAME                       HOLDER                          AGE
# kube-controller-manager    kube-controller-manager-node1   5d
# kube-scheduler             kube-scheduler-node1            5d

# Inspect a lease
kubectl get lease kube-scheduler -n kube-system -o yaml
# spec:
#   holderIdentity: kube-scheduler-node1_abc-uuid
#   leaseDurationSeconds: 15
#   renewTime: "2026-05-15T18:00:00Z"   # refreshed every retryPeriod (default 2s); must succeed within renewDeadline (10s)

TLS and Security Configuration

etcd uses two separate TLS configurations:

Client TLS (port 2379)

Used by the apiserver to connect to etcd. Requires:

--trusted-ca-file: CA that signed apiserver's client cert
--cert-file: etcd server's serving cert
--key-file: etcd server's private key
--client-cert-auth=true: require mTLS from clients

Peer TLS (port 2380)

Used between etcd members for Raft replication. Requires:

--peer-trusted-ca-file: Peer CA (often same as etcd CA)
--peer-cert-file: This member's peer cert
--peer-key-file: This member's peer key
--peer-client-cert-auth=true: require mTLS from peers

# Full etcd TLS configuration flags
# (a comment line between backslash-continued lines would end the command early,
#  so the flag groups are described here instead)
#   Client TLS:        --listen-client-urls through --client-cert-auth
#   Peer TLS:          --listen-peer-urls through --peer-client-cert-auth
#   Cluster formation: --initial-cluster-state and --initial-cluster
etcd \
  --name=etcd-0 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://0.0.0.0:2379 \
  --advertise-client-urls=https://etcd-0.etcd.svc:2379 \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --listen-peer-urls=https://0.0.0.0:2380 \
  --initial-advertise-peer-urls=https://etcd-0.etcd.svc:2380 \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --initial-cluster-state=new \
  --initial-cluster="etcd-0=https://etcd-0.etcd.svc:2380,etcd-1=https://etcd-1.etcd.svc:2380,etcd-2=https://etcd-2.etcd.svc:2380"

Backup and Restore

Taking a Snapshot

etcd snapshots are atomic, consistent point-in-time copies of the entire keyspace. A snapshot taken from any member (including followers) captures the state at the snapshot revision.

#!/bin/bash
# Production etcd backup script
set -euo pipefail   # abort on the first failed step so a bad snapshot is never uploaded

STAMP="$(date +%Y%m%d-%H%M%S)"
BACKUP_DIR="/backup/etcd/$STAMP"
mkdir -p "$BACKUP_DIR"

# TLS flags as an array so each flag stays a separate argument
ETCD_CERT_FLAGS=(
  --cacert=/etc/kubernetes/pki/etcd/ca.crt
  --cert=/etc/kubernetes/pki/etcd/peer.crt
  --key=/etc/kubernetes/pki/etcd/peer.key
)

# Take snapshot (connects to local etcd member)
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/snapshot.db" \
  --endpoints=https://127.0.0.1:2379 \
  "${ETCD_CERT_FLAGS[@]}"

# Verify snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status "$BACKUP_DIR/snapshot.db" --write-out=table

# Compress and upload to off-cluster storage (S3, GCS, etc.)
# The timestamp in the object path keeps older backups from being overwritten
gzip "$BACKUP_DIR/snapshot.db"
aws s3 cp "$BACKUP_DIR/snapshot.db.gz" "s3://my-cluster-backups/etcd/$STAMP/"

echo "Backup complete: $BACKUP_DIR/snapshot.db.gz"

# Automate with a CronJob (if etcd runs outside the cluster) or system cron.
# The line below belongs in /etc/cron.d/etcd-backup, not in this script:
# 0 * * * * root /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1

Restoring From a Snapshot

Restoring from a snapshot replaces the entire etcd cluster state. This is a destructive operation — it overwrites all current data with the snapshot contents. It also forces etcd to generate a new cluster ID, preventing old members from joining.

Restore Impacts All Members

You must restore to ALL etcd members (or a fresh set of members) from the same snapshot. Restoring to only one member while others have newer data creates a split state. Always stop all apiservers before restoring.

#!/bin/bash
# Disaster recovery restore procedure

SNAPSHOT="/backup/etcd/20260515-020000/snapshot.db.gz"
RESTORE_DIR="/var/lib/etcd-restore"

# Step 1: Stop apiserver and all controllers (static pod: move manifest away)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/

# Step 2: Stop etcd
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait for etcd process to exit
sleep 5

# Step 3: Restore snapshot to each member (run on each CP node with appropriate names/URLs)
# Note: etcd v3.5 deprecates `etcdctl snapshot restore` in favor of `etcdutl snapshot restore`
gunzip -c "$SNAPSHOT" > /tmp/snapshot.db

ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
  --name=etcd-0 \
  --initial-cluster="etcd-0=https://etcd-0:2380,etcd-1=https://etcd-1:2380,etcd-2=https://etcd-2:2380" \
  --initial-cluster-token=etcd-cluster-restored-$(date +%s) \
  --initial-advertise-peer-urls=https://etcd-0:2380 \
  --data-dir="$RESTORE_DIR"

# Step 4: Replace old data directory
mv /var/lib/etcd /var/lib/etcd-old
mv "$RESTORE_DIR" /var/lib/etcd
chown -R etcd:etcd /var/lib/etcd   # only if etcd runs as a dedicated etcd user; skip when the static pod runs as root

# Step 5: Restart etcd (restore manifest)
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
# Wait for etcd to become healthy
sleep 10
ETCDCTL_API=3 etcdctl endpoint health ...

# Step 6: Restore apiserver and controllers
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/

Performance Tuning

Disk Requirements

etcd performance is dominated by WAL fsync latency. Guidelines:

Metric	Target	Alert At	Action
WAL fsync p99	<1ms	>10ms	Move to faster disk; check I/O scheduler
Backend commit p99	<25ms	>100ms	Check boltdb size; trigger compaction/defrag
DB size	<2GB	>5GB (of 8GB quota)	Increase compaction frequency; check write rate
Peer round-trip latency	<1ms	>10ms	Place etcd members in same datacenter/AZ
Leader elections/hour	0	>0	Investigate disk latency, CPU pressure, network issues

OS-Level Disk Tuning

# Check I/O scheduler (should be 'none' or 'noop' for SSDs)
cat /sys/block/nvme0n1/queue/scheduler
# If not 'none': echo none > /sys/block/nvme0n1/queue/scheduler

# Benchmark disk latency with fio (recommended before deploying etcd)
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd \
    --size=22m --bs=2300 \
    --name=etcd-disk-benchmark
# Target: 99th percentile sync latency < 10ms

# Raise etcd's I/O and CPU priority (run as root)
ionice -c 1 -n 0 -p $(pgrep -x etcd)    # set real-time I/O class
renice -n -10 -p $(pgrep -x etcd)       # boost CPU priority (nice only applies to newly started commands)

# Dedicated disk for etcd
# /etc/fstab: /dev/nvme1n1 /var/lib/etcd ext4 noatime,nodiratime 0 2

Resource Allocation

# etcd resource recommendations:
# CPU: 2-4 cores dedicated
# Memory: 8GB recommended (etcd uses mmap; OS will cache boltdb in page cache)
# Disk: Dedicated NVMe SSD, 50GB minimum

# If etcd runs as a static pod (kubeadm), set resources in the manifest:
# /etc/kubernetes/manifests/etcd.yaml
resources:
  requests:
    cpu: "100m"
    memory: "100Mi"
  limits:
    cpu: "4"
    memory: "8Gi"

# Isolate etcd from noisy neighbors using cpuset (Linux, cgroup v2, systemd-managed etcd)
systemctl set-property etcd.service AllowedCPUs=0-3   # restrict etcd to CPUs 0-3

Cluster Operations

Adding a Member

# Step 1: Add the member to the cluster (updates cluster membership)
ETCDCTL_API=3 etcdctl member add etcd-3 \
  --peer-urls=https://etcd-3:2380 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://etcd-0:2379

# Step 2: Start etcd on the new node with --initial-cluster-state=existing
# The new member will automatically sync state from the leader

# Step 3: Verify the member joined
ETCDCTL_API=3 etcdctl member list ...

Removing a Failed Member

# Get member IDs
ETCDCTL_API=3 etcdctl member list \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://etcd-0:2379

# Remove the failed member by ID (first column of `member list`)
ETCDCTL_API=3 etcdctl member remove <MEMBER_ID> ...

# Quorum is recomputed from the new member count as soon as the removal commits.
# A 3-member cluster that just had a member removed is now a 2-member cluster
# (quorum is still 2) and needs BOTH remaining members to achieve quorum!

Member Remove Is Immediate

As soon as you run member remove, the cluster size decreases. If you remove a member from a 3-member cluster while one member is already down, you now have a 2-member cluster where both remaining members must be up for quorum. You've reduced your fault tolerance to zero. Be careful.

Recovering From Loss of Quorum

# Scenario: 3-member cluster, 2 members lost. 1 member running but no quorum.
# The remaining member will refuse all writes: "etcdserver: request timed out"

# Option A: Restore from backup (safest, may lose recent writes)
# (follow the restore procedure above)

# Option B: Force a new single-member cluster from the surviving member's data
# WARNING: This discards all writes since the last committed index on the survivor.

# Step 1: Stop etcd
systemctl stop etcd

# Step 2: Force a new cluster from this member's data dir
etcd --force-new-cluster --data-dir=/var/lib/etcd

# Step 3: Immediately run the apiserver against this single member
# Step 4: Add new etcd members to restore HA

Key etcd Metrics

# etcd exposes Prometheus metrics at :2381/metrics (or :2379/metrics)
# Sample critical queries:

# Leader changes (a counter; its increase over time should be ~0)
etcd_server_leader_changes_seen_total

# Write latency (WAL sync - most critical)
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# Backend commit latency (boltdb)
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

# Database size
etcd_mvcc_db_total_size_in_bytes
etcd_mvcc_db_total_size_in_use_in_bytes   # actual used; difference = fragmentation

# Slow operations
etcd_server_slow_apply_total             # operations taking > 100ms
etcd_server_slow_read_indexes_total      # read index operations taking too long

# Raft health
etcd_server_proposals_committed_total   # rate = write throughput
etcd_server_proposals_pending           # should be near 0
etcd_server_proposals_failed_total      # should be 0
etcd_network_peer_round_trip_time_seconds # inter-member latency
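The two database size metrics above can be combined into a fragmentation figure that indicates when a defrag is worthwhile. The helper names and the 50% threshold in this sketch are assumptions, not etcd recommendations; the inputs correspond to etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes.

```python
GIB = 1024 ** 3


def fragmentation_ratio(total_bytes, in_use_bytes):
    """Share of the boltdb file that is allocated but no longer in use."""
    if total_bytes <= 0:
        return 0.0
    return (total_bytes - in_use_bytes) / total_bytes


def should_defrag(total_bytes, in_use_bytes, threshold=0.5):
    """Hypothetical policy: defrag once at least half of the file is reclaimable."""
    return fragmentation_ratio(total_bytes, in_use_bytes) >= threshold


print(fragmentation_ratio(4 * GIB, 1 * GIB))   # 0.75
print(should_defrag(4 * GIB, 1 * GIB))         # True
print(should_defrag(4 * GIB, 3 * GIB))         # False
```

A high ratio right after compaction is expected, since compaction frees pages inside the file without shrinking it.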

# Prometheus alerting rules for etcd
groups:
- name: etcd
  rules:
  - alert: EtcdNoLeader
    expr: etcd_server_has_leader == 0
    for: 1m
    annotations:
      summary: "etcd member has no leader"

  - alert: EtcdHighNumberOfLeaderChanges
    expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
    annotations:
      summary: "etcd leader changed more than 3 times in 1 hour"

  - alert: EtcdHighFsyncDuration
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
    for: 5m
    annotations:
      summary: "etcd WAL fsync p99 > 10ms — disk performance issue"

  - alert: EtcdDbSizeExceeding
    expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.7
    for: 10m
    annotations:
      summary: "etcd database size exceeds 70% of quota"

  - alert: EtcdMemberDown
    expr: up{job="etcd"} == 0
    for: 3m
    annotations:
      summary: "etcd member is down"

Troubleshooting etcd

NOSPACE Alarm — DB Quota Exceeded

# Symptom: all apiserver writes fail with "etcdserver: mvcc: database space exceeded"
# Step 1: Check current state
ETCDCTL_API=3 etcdctl alarm list ...    # should show NOSPACE
ETCDCTL_API=3 etcdctl endpoint status --write-out=table ...  # check DB SIZE column

# Step 2: Compact and defrag (even if at quota, compaction is allowed)
REV=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json ... | jq '.[0].Status.header.revision')
ETCDCTL_API=3 etcdctl compact $REV ...
ETCDCTL_API=3 etcdctl defrag ...

# Step 3: Clear the alarm (only after DB size is back under quota)
ETCDCTL_API=3 etcdctl alarm disarm ...

# Step 4: Increase quota to prevent recurrence (restart required)
# etcd flag: --quota-backend-bytes=8589934592  (8GB)

Slow Writes / High Latency

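# Run etcd's built-in load check (it writes test keys for about a minute, so avoid it on an
# already struggling cluster). fsync latency itself is best read from the
# etcd_disk_wal_fsync_duration_seconds metric (should be < 1ms on fast SSDs).
ETCDCTL_API=3 etcdctl check perf ...
# Expected output: PASS lines for throughput, slowest request, and standard deviation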

# Check I/O wait on the node
iostat -x 1 10 | grep -E "Device|nvme"

# Check if another process is contending for disk
iotop -o  # interactive I/O monitor

# Check for etcd leader thrashing (election per minute = instability)
journalctl -u etcd --since "1 hour ago" | grep -c "elected leader"

# Check VM steal time (if on cloud)
# st (steal) is the 17th field in the classic procps vmstat layout; the 16th is wa (I/O wait)
vmstat 1 10 | awk '{print $17}'  # should be < 1%

Network Partition / Split-Brain Detection

# Detect if a member is isolated
ETCDCTL_API=3 etcdctl endpoint health --cluster ...
# Unhealthy member will show: "failed to commit proposal" or "context deadline exceeded"

# Check peer connectivity from within each member
# (etcd logs show peer connection errors)
journalctl -u etcd | grep "lost the TCP streaming connection"
journalctl -u etcd | grep "failed to send"

# Verify quorum is intact
ETCDCTL_API=3 etcdctl endpoint status --write-out=table ...
# RAFT TERM should be the same for all healthy members

Data Corruption Recovery

# etcd detects data corruption via hash checks
# Symptom: "etcdserver: database file may be corrupt"

# Step 1: Check integrity
ETCDCTL_API=3 etcdctl snapshot status member/snap/db

# Step 2: If corrupt, restore from snapshot
# See backup/restore section above

# Prevention: never use SIGKILL on etcd (always SIGTERM)
# SIGKILL during a write can corrupt the WAL
kill -TERM $(pgrep etcd)   # correct
# kill -9 $(pgrep etcd)   # NEVER do this

Production Checklist

20-Item etcd Production Checklist

#	Item	Verification
1	Odd number of members (3 or 5)	`etcdctl member list`
2	Members spread across AZs	Verify placement in cloud console
3	Dedicated NVMe SSD for etcd data	`lsblk; df -h /var/lib/etcd`
4	WAL fsync p99 < 10ms	Prometheus: `etcd_disk_wal_fsync_duration_seconds`
5	Hourly backups with off-cluster storage	Check backup cron; verify S3/GCS bucket
6	Backup restore tested	Annual DR drill: restore from snapshot to test cluster
7	TLS + mTLS on both client and peer ports	`--client-cert-auth=true --peer-client-cert-auth=true`
8	etcd not reachable from outside CP network	Firewall: port 2379/2380 restricted to CP node IPs
9	Auto-compaction configured	`--auto-compaction-mode=periodic --auto-compaction-retention=8h`
10	Quota set to 8GB	`--quota-backend-bytes=8589934592`
11	DB size alerting at 70% quota	Prometheus alert configured
12	Leader election count alert	Alert on >3 elections/hour
13	All members healthy	`etcdctl endpoint health --cluster`
14	Etcd logs monitored for errors	Loki/CloudWatch alert on "failed to commit"
15	Separate etcd CA from Kubernetes CA	Verify issuer: `openssl x509 -in etcd/ca.crt -noout -subject`
16	Etcd member certs expiry > 30 days	`kubeadm certs check-expiration`
17	Events stored in separate etcd (large clusters)	Check apiserver `--etcd-servers-overrides` flag
18	Resource limits on etcd static pod	`grep -A5 resources /etc/kubernetes/manifests/etcd.yaml`
19	No NFS/network storage for etcd data dir	`df -T /var/lib/etcd` (should be local)
20	Never SIGKILL etcd; use SIGTERM for shutdown	Review kubelet static pod management; check restart policy

Separating Events Into a Dedicated etcd

Kubernetes Events are extremely high-volume — every pod phase change, scheduler decision, and controller action generates an Event. In large clusters, Events can account for 50%+ of etcd writes, consuming quota and slowing down writes to actual workload state.

# Configure apiserver to use a separate etcd for Events
kube-apiserver \
  --etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
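  --etcd-servers-overrides="/events#https://etcd-events-0:2379;https://etcd-events-1:2379"

# Servers within one override are separated by ';' (commas separate different overrides),
# so the value is quoted to keep the shell from treating ';' as a command separator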

# This routes all /registry/events/* keys to the events etcd cluster
# The events etcd can be smaller and have shorter retention
# (Events themselves have a default TTL of 1 hour)

Dependency Graph and Next Files

Prerequisites

01-control-plane/00-control-plane-overview.html
01-control-plane/01-kube-apiserver.html

This File Covers

Raft consensus: states, write flow, quorum
boltdb storage engine + MVCC
WAL and snapshots
Kubernetes keyspace layout + protobuf encoding
Compaction and defragmentation
Leases (leader election + node heartbeats)
TLS/mTLS configuration
Backup and restore procedures
Performance tuning (disk, OS, resources)
Cluster operations (add/remove members)
Key metrics + Prometheus alerting rules
Troubleshooting runbooks (NOSPACE, slow writes, split-brain, corruption)