What Is etcd?

etcd is a strongly consistent, distributed key-value store written in Go. It was created by CoreOS in 2013 and became the backing store for Kubernetes in its earliest versions. The name comes from /etc (Unix configuration directory) + d (distributed). Kubernetes relies on the consistency and durability guarantees that etcd provides.

etcd IS the Cluster
If all etcd members are permanently lost and no backup exists, the cluster is gone. All Kubernetes objects — every Pod, Deployment, Secret, ConfigMap, ServiceAccount, RBAC policy — live exclusively in etcd. Running pods continue to run on nodes (kubelet doesn't need etcd), but no management, scheduling, or reconciliation is possible.

Raft Consensus Algorithm

etcd uses the Raft consensus algorithm (2013, Ongaro & Ousterhout) to ensure all members agree on the log order even in the presence of failures. Raft is designed to be more understandable than Paxos while providing the same guarantees.

Member States

Every etcd member is always in one of three states:

Follower

Passive state. Receives AppendEntries RPCs from the leader (log replication + heartbeats). Forwards client writes to the leader. Starts an election if no heartbeat is received within the randomized election timeout (the Raft paper suggests 150–300ms; etcd's default base timeout is 1000ms).

Candidate

Transitional state during election. Increments its term, votes for itself, and sends RequestVote RPCs to all peers. Transitions to Leader if it gets majority votes, or back to Follower if it discovers a current leader or higher term.

Leader

The active state for ONE member at a time. Accepts all client writes. Replicates entries to followers via AppendEntries. Sends periodic heartbeats (every 100ms by default in etcd) to prevent elections. Steps down if it can't reach a majority.

Write Commit Flow

Participants: apiserver; etcd-0 (Leader, term=5, index=1000); etcd-1 (Follower, term=5, index=999); etcd-2 (Follower, term=5, index=999)

  1. apiserver → etcd-0: gRPC Put(key, val). The leader appends the entry to its WAL.
  2. etcd-0 → etcd-1, etcd-2: AppendEntries RPC (index=1001). Each follower appends the entry to its WAL.
  3. etcd-1, etcd-2 → etcd-0: ACK (success). The leader commits (quorum!) and applies the entry to boltdb/bbolt.
  4. etcd-0 → apiserver: Watch event (ADDED/MODIFIED).
  5. Next heartbeat carries commitIndex=1001. Each follower applies the entry to boltdb.
  6. etcd-0 → apiserver: Response OK (new resourceVersion).

Write completes once leader receives majority ACK (2/3 here). Response to client comes AFTER commit.

Figure 1: Raft write commit sequence. The leader writes to its WAL, replicates to followers in parallel, waits for majority ACK (quorum), commits to boltdb, then sends the response. Followers apply the entry on the next heartbeat carrying the updated commitIndex.
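
The commit rule behind step 3 can be sketched as a small shell function. This is only an illustration under simplifying assumptions, not etcd's implementation: `commit_index` is a hypothetical helper that takes the last log index each member has persisted (leader included) and returns the highest index present on a majority. Real Raft additionally requires the entry to belong to the leader's current term.

```shell
# commit_index IDX...: highest log index persisted by a majority of members.
# Each argument is one member's last persisted index (leader included).
commit_index() {
  n=$#
  quorum=$(( n / 2 + 1 ))
  # Sort from highest to lowest; the quorum-th value is held by at least
  # `quorum` members, so everything up to it is safe to commit.
  printf '%s\n' "$@" | sort -rn | sed -n "${quorum}p"
}

commit_index 1001 999 999     # only the leader has index 1001
commit_index 1001 1001 999    # one follower has ACKed index 1001
```

The first call prints 999 (no majority for 1001 yet); the second prints 1001, which matches the figure: one follower ACK plus the leader's own copy is enough in a 3-member cluster.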

Quorum Math

Raft requires ⌊n/2⌋ + 1 votes (a majority) to elect a leader and to commit an entry. This determines how many member failures a cluster can tolerate:

| Members (n) | Quorum | Tolerated Failures | Recommendation |
|-------------|--------|--------------------|----------------|
| 1 | 1 | 0 | Dev/test only; any failure = total loss |
| 2 | 2 | 0 | Worse than 1: both must be up; never use |
| 3 | 2 | 1 | Minimum HA. Standard for most production clusters. |
| 4 | 3 | 1 | Same fault tolerance as 3, but more expensive; avoid |
| 5 | 3 | 2 | Large/critical clusters. Handles rolling upgrades safely. |
| 7 | 4 | 3 | Very large clusters. Write latency increases with more members. |

Always Use Odd Numbers
Even numbers of members give no benefit. A 4-member cluster tolerates only 1 failure (same as 3) but requires 3 of 4 members for quorum — making it strictly harder to maintain. More members also increase write latency because the leader must wait for more ACKs.
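
The quorum arithmetic can be reproduced with shell integer math. This is a minimal sketch; `quorum` and `tolerated_failures` are hypothetical helper names, not etcd commands.

```shell
# Majority needed for a cluster of N members: floor(N/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while a majority is still reachable
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5 7; do
  printf 'members=%d quorum=%d tolerated=%d\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

Because shell division truncates, 3 and 4 members both tolerate one failure, which is why the even size buys nothing.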

Leader Election Timing

etcd's election timeout is randomized per member between --election-timeout (default 1000ms) and 2 * --election-timeout (2000ms). The randomization prevents multiple members from starting elections simultaneously. The heartbeat interval is --heartbeat-interval (default 100ms). Rule: election timeout should be ≥ 10× heartbeat interval.

# Check the current leader, Raft term, and Raft index (from running etcd)
ETCDCTL_API=3 etcdctl endpoint status --write-out=json \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key | python3 -m json.tool

# Watch for election events
journalctl -u etcd --since "5 min ago" | grep -i "elect\|leader\|campaign"

# See which member is the leader (leader changes should be near 0 in a healthy cluster;
# the etcd_server_leader_changes_seen_total metric counts them)
ETCDCTL_API=3 etcdctl endpoint status ... | grep -i leader
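
The timing rule of thumb above can be checked before changing flags. A small sketch: `check_timing` is a hypothetical helper that takes the heartbeat interval and election timeout in milliseconds and applies the 10× rule, printing the resulting randomized election window.

```shell
# check_timing HEARTBEAT_MS ELECTION_MS
check_timing() {
  heartbeat_ms=$1
  election_ms=$2
  if [ "$election_ms" -ge $(( heartbeat_ms * 10 )) ]; then
    # Each member picks its timeout between 1x and 2x the configured value
    echo "ok: election window ${election_ms}-$(( election_ms * 2 ))ms"
  else
    echo "too low: election timeout should be at least $(( heartbeat_ms * 10 ))ms"
  fi
}

check_timing 100 1000   # etcd defaults
check_timing 250 1000   # heartbeat raised without raising the election timeout
```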

Storage Engine: boltdb and MVCC

boltdb (bbolt)

etcd uses boltdb (later forked to bbolt by etcd's maintainers) as its embedded storage engine. boltdb is a pure-Go, B+tree-based key-value store that keeps its data in a single memory-mapped file.

fsync is Mandatory
Each etcd member calls fsync() on WAL (Write-Ahead Log) entries before acknowledging them, so every write waits on disk latency. This is why etcd is extremely sensitive to disk I/O latency. Network-attached storage (NFS, SAN without write-back caching) is a common source of etcd instability. Use local NVMe SSDs. The benchmark for etcd disk: WAL fsync p99 < 10ms.

MVCC — Multi-Version Concurrency Control

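etcd implements MVCC at the application level on top of boltdb. Every write creates a new revision of the key; the previous version is not overwritten. This allows clients to read a key as of an earlier revision and to start a watch from a past revision, as long as that revision has not been compacted.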

The MVCC data model:

Keyspace (logical view; etcd keeps the key → revisions index in memory):
  Key: (keyBytes, revision)     → Value: (value, create_revision, mod_revision, version, lease)

  /registry/pods/default/nginx @ rev=100   → {data=..., create_rev=100, mod_rev=100, ver=1}
  /registry/pods/default/nginx @ rev=150   → {data=..., create_rev=100, mod_rev=150, ver=2}
  /registry/pods/default/nginx @ rev=200   → {data=..., create_rev=100, mod_rev=200, ver=3}

Backend (boltdb, keyed by revision):
  rev=100 → /registry/pods/default/nginx (PUT)
  rev=150 → /registry/pods/default/nginx (PUT)
  rev=200 → /registry/pods/default/nginx (PUT)
  rev=201 → /registry/secrets/default/mysecret (PUT)

The cluster revision (also called the global revision) is a monotonically increasing integer. Every write that changes the state (PUT or DELETE) increments the cluster revision. This maps directly to Kubernetes resourceVersion.

# Inspect MVCC versions for a key
ETCDCTL_API=3 etcdctl get /registry/pods/default/nginx \
  --write-out=json \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key | python3 -m json.tool
# Output: {"kvs":[{"key":"...","create_revision":100,"mod_revision":200,"version":3,"value":"..."}]}

# Read the key as of a past revision (fails if that revision has been compacted)
ETCDCTL_API=3 etcdctl get /registry/pods/default/nginx \
  --write-out=json --rev=150   # --rev=0 (the default) means latest

# Check current cluster revision
ETCDCTL_API=3 etcdctl endpoint status --write-out=json ... | jq '.[0].Status.header.revision'
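
How create_revision, mod_revision, and version evolve can be sketched with a toy replay written as a shell function around awk. This is a model for illustration only, not etcd code: `mvcc_replay` is a hypothetical helper that treats every operation as its own transaction, starts from the revision passed as its argument, and tracks metadata only.

```shell
# mvcc_replay START_REV: read "PUT <key>" / "DELETE <key>" lines on stdin
# and print the revision metadata each operation would produce.
mvcc_replay() {
  awk -v rev="$1" '
    $1 == "PUT" {
      rev++                                   # every write bumps the cluster revision
      if (!($2 in create)) { create[$2] = rev; ver[$2] = 0 }
      ver[$2]++                               # per-key version counts writes since creation
      printf "%s create_revision=%d mod_revision=%d version=%d\n", $2, create[$2], rev, ver[$2]
    }
    $1 == "DELETE" {
      rev++
      delete create[$2]; delete ver[$2]       # a later PUT starts a new generation
      printf "%s deleted at revision=%d\n", $2, rev
    }'
}

printf '%s\n' \
  'PUT /registry/pods/default/nginx' \
  'PUT /registry/secrets/default/mysecret' \
  'PUT /registry/pods/default/nginx' | mvcc_replay 99
```

The second PUT of the pod key keeps create_revision=100 but reports mod_revision=102 and version=2, because the write to the secret consumed revision 101 in between.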

WAL (Write-Ahead Log) and Snapshots

Write-Ahead Log

Before any write is applied to boltdb, etcd writes it to the WAL — a sequential append-only log on disk. The WAL serves two purposes:

  1. Durability: If etcd crashes after writing to WAL but before applying to boltdb, it can replay the WAL on restart to recover.
  2. Replication: The leader sends WAL entries (as Raft log entries via AppendEntries RPC) to followers.

WAL files are stored in member/wal/. Each file is a fixed size (64MB by default). When a file fills up, a new segment is created. Old WAL segments are retained until a snapshot is taken.

Snapshots

Keeping the entire WAL forever would make startup slow (replaying millions of entries). Periodically, etcd takes a snapshot of the entire boltdb state. Snapshots are stored in member/snap/. After a snapshot is saved, WAL entries before the snapshot's Raft index can be discarded.

etcd snapshots are triggered when --snapshot-count entries have been applied since the last snapshot (default: 100,000 in etcd 3.2 through 3.5; etcd 3.6 lowers it to 10,000). For large clusters with high write rates, this can happen frequently.

# etcd data directory layout
/var/lib/etcd/
├── member/
│   ├── snap/
│   │   ├── 0000000000000002-0000000000030000.snap   # snapshot; name is <term>-<index> in hex (term 0x2, index 0x30000)
│   │   └── db                                       # boltdb database file (mmap'd)
│   └── wal/
│       ├── 0000000000000000-0000000000000000.wal    # first WAL segment
│       └── 0000000000000001-0000000000030001.wal    # second WAL segment

# Check snapshot metadata
ETCDCTL_API=3 etcdctl snapshot status member/snap/db --write-out=table
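
The snapshot file names in the listing are hexadecimal, so the term and index are easier to read once converted. A minimal sketch, assuming the `<term>-<index>.snap` naming shown above; `decode_snap_name` is a hypothetical helper.

```shell
# decode_snap_name PATH: print the Raft term and index encoded in a .snap file name
decode_snap_name() {
  base=${1##*/}          # strip any directory part
  base=${base%.snap}     # strip the extension
  term_hex=${base%%-*}   # text before the dash
  index_hex=${base##*-}  # text after the dash
  # printf converts 0x-prefixed hex to decimal
  printf 'term=%d index=%d\n' "0x$term_hex" "0x$index_hex"
}

decode_snap_name /var/lib/etcd/member/snap/0000000000000002-0000000000030000.snap
```

For the file in the listing this prints term=2 index=196608, since 0x30000 is 196608 in decimal.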

Kubernetes Keyspace Layout

All Kubernetes objects are stored in etcd under the /registry/ prefix. The key format encodes the resource type, namespace (for namespaced resources), and name; custom resources and a few other types also include the API group:

Cluster-scoped resources:
  /registry/{resource}/{name}
  /registry/minions/worker-1                          # Nodes are stored under the legacy "minions" prefix
  /registry/namespaces/default
  /registry/clusterroles/cluster-admin
  /registry/persistentvolumes/pvc-abc123

Namespaced resources:
  /registry/{resource}/{namespace}/{name}
  /registry/pods/default/nginx-abc123
  /registry/deployments/production/myapp
  /registry/secrets/kube-system/bootstrap-token-xyz

Group-prefixed keys (custom resources and a few API-machinery types; most built-in
resources, including apps/v1 Deployments, batch/v1 Jobs, and RBAC Roles, use the
un-prefixed forms above):
  /registry/{group}/{resource}/{namespace}/{name}
  /registry/example.com/widgets/default/my-widget     # hypothetical custom resource
  /registry/apiregistration.k8s.io/apiservices/v1.apps

Special keys:
  /registry/events/default/nginx-abc.17b2c3d4e5f6    # Events (use separate etcd in large clusters)
  /registry/minions/{node-name}                       # Old name for nodes (still used internally)
  /registry/apiextensions.k8s.io/customresourcedefinitions/{name}
  /registry/leases/kube-system/kube-scheduler         # Leader election

# Inspect the entire keyspace
ETCDCTL_API=3 etcdctl get /registry \
  --prefix --keys-only \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key | head -50

# Count objects by type
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only ... | \
  sed 's|/registry/||' | cut -d/ -f1 | sort | uniq -c | sort -rn

# Read a specific pod's raw protobuf value (binary)
ETCDCTL_API=3 etcdctl get /registry/pods/default/nginx-abc \
  --print-value-only ... | strings | head -30
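
Splitting a key back into its parts follows directly from the templates above. The sketch below is deliberately naive: `parse_registry_key` is a hypothetical helper that decides by segment count alone, so group-prefixed cluster-scoped keys (such as CustomResourceDefinitions) and multi-segment prefixes would need extra handling.

```shell
# parse_registry_key KEY: label the segments of a /registry/ key by position
parse_registry_key() {
  rest=${1#/registry/}
  # Split on "/" into positional parameters, with globbing disabled
  old_ifs=$IFS; IFS=/
  set -f; set -- $rest; set +f
  IFS=$old_ifs
  case $# in
    2) echo "resource=$1 name=$2" ;;
    3) echo "resource=$1 namespace=$2 name=$3" ;;
    4) echo "group=$1 resource=$2 namespace=$3 name=$4" ;;
    *) echo "unrecognized key: /registry/$rest" ;;
  esac
}

parse_registry_key /registry/namespaces/default
parse_registry_key /registry/pods/default/nginx-abc123
```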

Object Encoding

Kubernetes objects are stored in etcd as protobuf (not JSON), prefixed with a magic byte sequence: k8s\x00. The encoding is:

Etcd value = "k8s\x00" + Unknown{TypeMeta} + protobuf(object)

TypeMeta (Unknown struct):
  TypeMeta.APIVersion = "apps/v1"
  TypeMeta.Kind = "Deployment"

The protobuf payload is the object in its versioned storage form (for example apps/v1), not the apiserver's "internal" Go types, which are never serialized.

Why Protobuf?
JSON encoding is used for the API (client-facing), but protobuf is used for storage because it is 2–4× smaller and faster to encode/decode. A large cluster with 50,000 pods stores ~50MB of pod data in etcd. Protobuf keeps this manageable. The apiserver transparently converts between JSON (API) and protobuf (storage).
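
The k8s\x00 prefix described above makes it easy to tell whether a value dumped from etcd is protobuf-encoded. A minimal sketch, assuming the raw value has been saved to a file; `has_k8s_proto_prefix` is a hypothetical helper and the temporary file stands in for a real dump.

```shell
# has_k8s_proto_prefix FILE: succeed if FILE starts with the bytes "k8s\0"
has_k8s_proto_prefix() {
  # Hex-dump the first four bytes; "k8s\0" is 6b 38 73 00
  prefix=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$prefix" = "6b387300" ]
}

tmp=$(mktemp)
printf 'k8s\000payload' > "$tmp"          # stand-in for a protobuf-encoded value
has_k8s_proto_prefix "$tmp" && echo "protobuf envelope found"
printf '{"kind":"Pod"}' > "$tmp"          # stand-in for a JSON value
has_k8s_proto_prefix "$tmp" || echo "no protobuf envelope"
rm -f "$tmp"
```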

Compaction and Defragmentation

Why Compaction Is Needed

MVCC keeps all historical revisions of every key. Without compaction, the database grows unboundedly. A busy cluster with 100 pods being updated 10 times/second generates 1,000 new MVCC entries per second. After a week that's 600 million entries.

Compaction removes all historical revisions older than a specified revision, keeping only the latest value for each key.
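
The growth estimate above is simple multiplication, reproduced here so the inputs can be swapped for another cluster's numbers. `revisions_created` is a hypothetical helper name.

```shell
# revisions_created PODS UPDATES_PER_POD_PER_SEC SECONDS
revisions_created() { echo $(( $1 * $2 * $3 )); }

week=$(( 7 * 24 * 60 * 60 ))   # 604800 seconds
echo "per second: $(revisions_created 100 10 1)"
echo "per week:   $(revisions_created 100 10 "$week")"
```

This prints 1000 per second and 604800000 per week, which is the roughly 600 million entries quoted above.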

# Manual compaction (compact everything older than current revision)
ETCD_ENDPOINTS="https://127.0.0.1:2379"
ETCD_CERT_FLAGS="--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key"

# Get current revision
REV=$(ETCDCTL_API=3 etcdctl endpoint status $ETCD_CERT_FLAGS --endpoints=$ETCD_ENDPOINTS --write-out=json | jq '.[0].Status.header.revision')
echo "Current revision: $REV"

# Compact to current revision (removes all history)
ETCDCTL_API=3 etcdctl compact $REV $ETCD_CERT_FLAGS --endpoints=$ETCD_ENDPOINTS

# Defragment (reclaim disk space after compaction — boltdb doesn't shrink automatically)
ETCDCTL_API=3 etcdctl defrag $ETCD_CERT_FLAGS --endpoints=$ETCD_ENDPOINTS
# Note: defrag blocks reads and writes on the member being defragmented; do it one member at a time, non-leader first

Auto-Compaction

etcd supports automatic compaction via the --auto-compaction-mode and --auto-compaction-retention flags:

# Time-based: compact entries older than 1 hour
etcd --auto-compaction-mode=periodic --auto-compaction-retention=1h

# Revision-based: keep last 1000 revisions
etcd --auto-compaction-mode=revision --auto-compaction-retention=1000

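# For Kubernetes: periodic with 8-hour retention is a common setting
# (intended to let Informers resume from an older resourceVersion instead of relisting;
# note that kube-apiserver also compacts on its own schedule, --etcd-compaction-interval, default 5m)
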
DB Size Quota
etcd has a --quota-backend-bytes limit (default 2GB). When the database exceeds it, etcd raises a NOSPACE alarm and rejects writes with "mvcc: database space exceeded", causing the Kubernetes apiserver to fail all mutations. The cluster becomes effectively read-only until space is reclaimed and the alarm is disarmed. The etcd documentation suggests not going above 8GB, which is a typical setting for large clusters. Monitor etcd_mvcc_db_total_size_in_bytes and alert at 70% of quota.
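
The 70% threshold can be checked with integer arithmetic. A small sketch: `quota_used_pct` is a hypothetical helper, and the database size below is an example value standing in for a reading of etcd_mvcc_db_total_size_in_bytes.

```shell
# quota_used_pct DB_BYTES QUOTA_BYTES: percentage of the quota in use (rounded down)
quota_used_pct() { echo $(( $1 * 100 / $2 )); }

quota_bytes=8589934592   # --quota-backend-bytes (8GB)
db_bytes=6442450944      # example database size (6GB)

pct=$(quota_used_pct "$db_bytes" "$quota_bytes")
if [ "$pct" -ge 70 ]; then
  echo "ALERT: etcd DB at ${pct}% of quota"
else
  echo "ok: etcd DB at ${pct}% of quota"
fi
```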

Leases — TTL-based Key Ownership

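etcd Leases are time-bounded tokens. A key can be attached to a Lease; when the Lease expires (client fails to renew), all attached keys are automatically deleted. Kubernetes uses etcd leases to expire Events. Node heartbeats and leader election use a separate mechanism: Kubernetes Lease objects (coordination.k8s.io), which are ordinary keys under /registry/leases/ that the holder renews by updating renewTime: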

# Node heartbeat leases
kubectl get leases -n kube-node-lease
# NAME          HOLDER        AGE
# worker-1      worker-1      5d
# worker-2      worker-2      5d

# Leader election leases
kubectl get leases -n kube-system
# NAME                       HOLDER                          AGE
# kube-controller-manager    kube-controller-manager-node1   5d
# kube-scheduler             kube-scheduler-node1            5d

# Inspect a lease
kubectl get lease kube-scheduler -n kube-system -o yaml
# spec:
#   holderIdentity: kube-scheduler-node1_abc-uuid
#   leaseDurationSeconds: 15
#   renewTime: "2026-05-15T18:00:00Z"   # refreshed every retryPeriod (default 2s); renewDeadline is 10s
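
Whether a Lease like the one above is still held comes down to comparing the current time with renewTime plus leaseDurationSeconds. The sketch below is simplified: `lease_expired` is a hypothetical helper that works on epoch seconds, whereas client-go compares against the time it last observed the record change in order to tolerate clock skew.

```shell
# lease_expired RENEW_EPOCH DURATION_SECONDS NOW_EPOCH
# Succeeds when the lease has not been renewed within its duration.
lease_expired() {
  renew_epoch=$1
  duration_s=$2
  now_epoch=$3
  [ "$now_epoch" -gt $(( renew_epoch + duration_s )) ]
}

# Renewed at t=1000 with leaseDurationSeconds=15
if lease_expired 1000 15 1010; then echo "expired"; else echo "held"; fi
if lease_expired 1000 15 1020; then echo "expired"; else echo "held"; fi
```

The first check prints "held" (10 seconds since renewal), the second prints "expired" (20 seconds), at which point another candidate may take over.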

TLS and Security Configuration

etcd uses two separate TLS configurations:

Client TLS (port 2379)

Used by the apiserver to connect to etcd. Requires:

  • --trusted-ca-file: CA that signed apiserver's client cert
  • --cert-file: etcd server's serving cert
  • --key-file: etcd server's private key
  • --client-cert-auth=true: require mTLS from clients

Peer TLS (port 2380)

Used between etcd members for Raft replication. Requires:

  • --peer-trusted-ca-file: Peer CA (often same as etcd CA)
  • --peer-cert-file: This member's peer cert
  • --peer-key-file: This member's peer key
  • --peer-client-cert-auth=true: require mTLS from peers

# Full etcd TLS configuration flags
# (comments cannot sit between backslash-continued lines, so the flag groups are listed here)
#   Client TLS:        --listen-client-urls through --client-cert-auth
#   Peer TLS:          --listen-peer-urls through --peer-client-cert-auth
#   Cluster formation: --initial-cluster-state, --initial-cluster
etcd \
  --name=etcd-0 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=https://0.0.0.0:2379 \
  --advertise-client-urls=https://etcd-0.etcd.svc:2379 \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --client-cert-auth=true \
  --listen-peer-urls=https://0.0.0.0:2380 \
  --initial-advertise-peer-urls=https://etcd-0.etcd.svc:2380 \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-client-cert-auth=true \
  --initial-cluster-state=new \
  --initial-cluster="etcd-0=https://etcd-0.etcd.svc:2380,etcd-1=https://etcd-1.etcd.svc:2380,etcd-2=https://etcd-2.etcd.svc:2380"

Backup and Restore

Taking a Snapshot

etcd snapshots are atomic, consistent point-in-time copies of the entire keyspace. A snapshot taken from any member (including followers) captures the state at the snapshot revision.

#!/bin/bash
# Production etcd backup script
set -euo pipefail   # stop at the first failed step so a bad snapshot is never uploaded

BACKUP_DIR="/backup/etcd/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

ETCD_CERT_FLAGS="
  --cacert=/etc/kubernetes/pki/etcd/ca.crt
  --cert=/etc/kubernetes/pki/etcd/peer.crt
  --key=/etc/kubernetes/pki/etcd/peer.key"

# Take snapshot (connects to local etcd member)
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/snapshot.db" \
  --endpoints=https://127.0.0.1:2379 \
  $ETCD_CERT_FLAGS

# Verify snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status "$BACKUP_DIR/snapshot.db" --write-out=table

# Compress and upload to off-cluster storage (S3, GCS, etc.)
gzip "$BACKUP_DIR/snapshot.db"
aws s3 cp "$BACKUP_DIR/snapshot.db.gz" s3://my-cluster-backups/etcd/

echo "Backup complete: $BACKUP_DIR/snapshot.db.gz"

# Automate with a CronJob (if etcd runs outside the cluster) or system cron
# /etc/cron.d/etcd-backup
0 * * * * root /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1

Restoring From a Snapshot

Restoring from a snapshot replaces the entire etcd cluster state. This is a destructive operation — it overwrites all current data with the snapshot contents. It also forces etcd to generate a new cluster ID, preventing old members from joining.

Restore Impacts All Members
You must restore to ALL etcd members (or a fresh set of members) from the same snapshot. Restoring to only one member while others have newer data creates a split state. Always stop all apiservers before restoring.

#!/bin/bash
# Disaster recovery restore procedure

SNAPSHOT="/backup/etcd/20260515-020000/snapshot.db.gz"
RESTORE_DIR="/var/lib/etcd-restore"

# Step 1: Stop apiserver and all controllers (static pod: move manifest away)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/

# Step 2: Stop etcd
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait for etcd process to exit
sleep 5

# Step 3: Restore snapshot to each member (run on each CP node with appropriate names/URLs)
gunzip -c "$SNAPSHOT" > /tmp/snapshot.db

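# The token must be identical on every member, so use a fixed string rather than
# a per-node timestamp (the value below is only an example).
RESTORE_TOKEN="etcd-cluster-restored-20260515"

# On etcd 3.5+, etcdutl snapshot restore accepts the same flags and replaces this deprecated form.
ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
  --name=etcd-0 \
  --initial-cluster="etcd-0=https://etcd-0:2380,etcd-1=https://etcd-1:2380,etcd-2=https://etcd-2:2380" \
  --initial-cluster-token="$RESTORE_TOKEN" \
  --initial-advertise-peer-urls=https://etcd-0:2380 \
  --data-dir="$RESTORE_DIR"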

# Step 4: Replace old data directory
mv /var/lib/etcd /var/lib/etcd-old
mv "$RESTORE_DIR" /var/lib/etcd
chown -R etcd:etcd /var/lib/etcd   # only when etcd runs as a dedicated etcd user; kubeadm static pods run as root by default

# Step 5: Restart etcd (restore manifest)
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
# Wait for etcd to become healthy
sleep 10
ETCDCTL_API=3 etcdctl endpoint health ...

# Step 6: Restore apiserver and controllers
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/

Performance Tuning

Disk Requirements

etcd performance is dominated by WAL fsync latency. Guidelines:

| Metric | Target | Alert At | Action |
|--------|--------|----------|--------|
| WAL fsync p99 | <1ms | >10ms | Move to faster disk; check I/O scheduler |
| Backend commit p99 | <25ms | >100ms | Check boltdb size; trigger compaction/defrag |
| DB size | <2GB | >5GB (of 8GB quota) | Increase compaction frequency; check write rate |
| Peer round-trip latency | <1ms | >10ms | Place etcd members in same datacenter/AZ |
| Leader elections/hour | 0 | >0 | Investigate disk latency, CPU pressure, network issues |

OS-Level Disk Tuning

# Check I/O scheduler (should be 'none' or 'noop' for SSDs)
cat /sys/block/nvme0n1/queue/scheduler
# If not 'none': echo none > /sys/block/nvme0n1/queue/scheduler

# Benchmark disk latency with fio (recommended before deploying etcd)
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd \
    --size=22m --bs=2300 \
    --name=etcd-disk-benchmark
# Target: 99th percentile sync latency < 10ms

# Raise the I/O and CPU priority of the running etcd process
ionice -c 1 -n 0 -p $(pgrep etcd)    # set real-time I/O class
renice -n -10 -p $(pgrep etcd)       # boost CPU priority (nice only applies to newly started commands)

# Dedicated disk for etcd
# /etc/fstab: /dev/nvme1n1 /var/lib/etcd ext4 noatime,nodiratime 0 2

Resource Allocation

# etcd resource recommendations:
# CPU: 2-4 cores dedicated
# Memory: 8GB recommended (etcd uses mmap; OS will cache boltdb in page cache)
# Disk: Dedicated NVMe SSD, 50GB minimum

# If etcd runs as a static pod (kubeadm), set resources in the manifest:
# /etc/kubernetes/manifests/etcd.yaml
resources:
  requests:
    cpu: "100m"
    memory: "100Mi"
  limits:
    cpu: "4"
    memory: "8Gi"

# Isolate etcd from noisy neighbors using cpuset (Linux; assumes systemd with cgroup v2)
systemctl set-property etcd.service AllowedCPUs=0-3   # restrict to CPUs 0-3

Cluster Operations

Adding a Member

# Step 1: Add the member to the cluster (updates cluster membership)
ETCDCTL_API=3 etcdctl member add etcd-3 \
  --peer-urls=https://etcd-3:2380 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://etcd-0:2379

# Step 2: Start etcd on the new node with --initial-cluster-state=existing
# The new member will automatically sync state from the leader

# Step 3: Verify the member joined
ETCDCTL_API=3 etcdctl member list ...

Removing a Failed Member

# Get member IDs
ETCDCTL_API=3 etcdctl member list \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://etcd-0:2379

# Remove the failed member by ID
ETCDCTL_API=3 etcdctl member remove <MEMBER_ID> ...

# Membership shrinks as soon as the removal is committed.
# A 3-member cluster that just had a member removed is now a 2-member cluster
# whose quorum is still 2, so it needs BOTH remaining members to achieve quorum!

Member Remove Is Immediate
As soon as you run member remove, the cluster size decreases. If you remove a member from a 3-member cluster while one member is already down, you now have a 2-member cluster where both remaining members must be up for quorum. You've reduced your fault tolerance to zero. Be careful.
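
The effect of a removal can be worked out before running it. A minimal sketch: `after_removal` is a hypothetical helper that takes the current member count and the number of members that will still be healthy afterwards. etcd's --strict-reconfig-check option (enabled by default) is meant to reject a removal that would lose quorum, but the arithmetic is worth doing by hand as well.

```shell
# after_removal CURRENT_MEMBERS HEALTHY_AFTER_REMOVAL
after_removal() {
  members=$(( $1 - 1 ))            # cluster size once the member is removed
  healthy=$2                       # members still up afterwards
  quorum=$(( members / 2 + 1 ))
  if [ "$healthy" -ge "$quorum" ]; then
    echo "members=$members quorum=$quorum spare=$(( healthy - quorum ))"
  else
    echo "members=$members quorum=$quorum QUORUM LOST"
  fi
}

after_removal 3 2   # remove the failed member of a 3-member cluster
after_removal 3 1   # remove a healthy member while another one is down
```

The first case reports spare=0 (both survivors are required); the second reports QUORUM LOST.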

Recovering From Loss of Quorum

# Scenario: 3-member cluster, 2 members lost. 1 member running but no quorum.
# The remaining member will refuse all writes: "etcdserver: request timed out"

# Option A: Restore from backup (safest, may lose recent writes)
# (follow the restore procedure above)

# Option B: Force a new single-member cluster from the surviving member's data
# WARNING: This discards all writes since the last committed index on the survivor.

# Step 1: Stop etcd
systemctl stop etcd

# Step 2: Force a new cluster from this member's data dir
etcd --force-new-cluster --data-dir=/var/lib/etcd

# Step 3: Immediately run the apiserver against this single member
# Step 4: Add new etcd members to restore HA

Key etcd Metrics

# etcd exposes Prometheus metrics at :2381/metrics (or :2379/metrics)
# Sample critical queries:

# Leader elections (should be 0)
etcd_server_leader_changes_seen_total

# Write latency (WAL sync - most critical)
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# Backend commit latency (boltdb)
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

# Database size
etcd_mvcc_db_total_size_in_bytes
etcd_mvcc_db_total_size_in_use_in_bytes   # actual used; difference = fragmentation

# Slow operations
etcd_server_slow_apply_total             # operations taking > 100ms
etcd_server_slow_read_indexes_total      # read index operations taking too long

# Raft health
etcd_server_proposals_committed_total   # rate = write throughput
etcd_server_proposals_pending           # should be near 0
etcd_server_proposals_failed_total      # should be 0
etcd_network_peer_round_trip_time_seconds # inter-member latency

# Prometheus alerting rules for etcd
groups:
- name: etcd
  rules:
  - alert: EtcdNoLeader
    expr: etcd_server_has_leader == 0
    for: 1m
    annotations:
      summary: "etcd member has no leader"

  - alert: EtcdHighNumberOfLeaderChanges
    expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
    annotations:
      summary: "etcd leader changed more than 3 times in 1 hour"

  - alert: EtcdHighFsyncDuration
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
    for: 5m
    annotations:
      summary: "etcd WAL fsync p99 > 10ms — disk performance issue"

  - alert: EtcdDbSizeExceeding
    expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.7
    for: 10m
    annotations:
      summary: "etcd database size exceeds 70% of quota"

  - alert: EtcdMemberDown
    expr: up{job="etcd"} == 0
    for: 3m
    annotations:
      summary: "etcd member is down"

Troubleshooting etcd

NOSPACE Alarm — DB Quota Exceeded

# Symptom: all apiserver writes fail with "etcdserver: mvcc: database space exceeded"
# Step 1: Check current state
ETCDCTL_API=3 etcdctl alarm list ...    # should show NOSPACE
ETCDCTL_API=3 etcdctl endpoint status --write-out=table ...  # check DB SIZE column

# Step 2: Compact and defrag (even if at quota, compaction is allowed)
REV=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json ... | jq '.[0].Status.header.revision')
ETCDCTL_API=3 etcdctl compact $REV ...
ETCDCTL_API=3 etcdctl defrag ...

# Step 3: Clear the alarm (only after DB size is back under quota)
ETCDCTL_API=3 etcdctl alarm disarm ...

# Step 4: Increase quota to prevent recurrence (restart required)
# etcd flag: --quota-backend-bytes=8589934592  (8GB)

Slow Writes / High Latency

# Run etcd's built-in performance check (it writes test load to the cluster for about 60 seconds)
ETCDCTL_API=3 etcdctl check perf ...
# Expected output: PASS lines for throughput, slowest request, and standard deviation

# Check I/O wait on the node
iostat -x 1 10 | grep -E "Device|nvme"

# Check if another process is contending for disk
iotop -o  # interactive I/O monitor

# Check for etcd leader thrashing (election per minute = instability)
journalctl -u etcd --since "1 hour ago" | grep -c "elected leader"

# Check VM steal time (if on cloud)
vmstat 1 10 | awk '{print $17}'  # st column = steal (17th field; $16 is wa); should be < 1%

Network Partition / Split-Brain Detection

# Detect if a member is isolated
ETCDCTL_API=3 etcdctl endpoint health --cluster ...
# Unhealthy member will show: "failed to commit proposal" or "context deadline exceeded"

# Check peer connectivity from within each member
# (etcd logs show peer connection errors)
journalctl -u etcd | grep "lost the TCP streaming connection"
journalctl -u etcd | grep "failed to send"

# Verify quorum is intact
ETCDCTL_API=3 etcdctl endpoint status --write-out=table ...
# RAFT TERM should be the same for all healthy members

Data Corruption Recovery

# etcd detects data corruption via hash checks
# Symptom: "etcdserver: database file may be corrupt"

# Step 1: Check integrity
ETCDCTL_API=3 etcdctl snapshot status member/snap/db

# Step 2: If corrupt, restore from snapshot
# See backup/restore section above

# Prevention: never use SIGKILL on etcd (always SIGTERM)
# SIGKILL during a write can corrupt the WAL
kill -TERM $(pgrep etcd)   # correct
# kill -9 $(pgrep etcd)   # NEVER do this

Production Checklist

20-Item etcd Production Checklist
| # | Item | Verification |
|---|------|--------------|
| 1 | Odd number of members (3 or 5) | etcdctl member list |
| 2 | Members spread across AZs | Verify placement in cloud console |
| 3 | Dedicated NVMe SSD for etcd data | lsblk; df -h /var/lib/etcd |
| 4 | WAL fsync p99 < 10ms | Prometheus: etcd_disk_wal_fsync_duration_seconds |
| 5 | Hourly backups with off-cluster storage | Check backup cron; verify S3/GCS bucket |
| 6 | Backup restore tested | Annual DR drill: restore from snapshot to test cluster |
| 7 | TLS + mTLS on both client and peer ports | --client-cert-auth=true --peer-client-cert-auth=true |
| 8 | etcd not reachable from outside CP network | Firewall: port 2379/2380 restricted to CP node IPs |
| 9 | Auto-compaction configured | --auto-compaction-mode=periodic --auto-compaction-retention=8h |
| 10 | Quota set to 8GB | --quota-backend-bytes=8589934592 |
| 11 | DB size alerting at 70% quota | Prometheus alert configured |
| 12 | Leader election count alert | Alert on >3 elections/hour |
| 13 | All members healthy | etcdctl endpoint health --cluster |
| 14 | Etcd logs monitored for errors | Loki/CloudWatch alert on "failed to commit" |
| 15 | Separate etcd CA from Kubernetes CA | Verify issuer: openssl x509 -in etcd/ca.crt -noout -subject |
| 16 | Etcd member certs expiry > 30 days | kubeadm certs check-expiration |
| 17 | Events stored in separate etcd (large clusters) | Check apiserver --etcd-servers-overrides flag |
| 18 | Resource limits on etcd static pod | grep -A5 resources /etc/kubernetes/manifests/etcd.yaml |
| 19 | No NFS/network storage for etcd data dir | df -T /var/lib/etcd (should be local) |
| 20 | Never SIGKILL etcd; use SIGTERM for shutdown | Review kubelet static pod management; check restart policy |

Separating Events Into a Dedicated etcd

Kubernetes Events are extremely high-volume — every pod phase change, scheduler decision, and controller action generates an Event. In large clusters, Events can account for 50%+ of etcd writes, consuming quota and slowing down writes to actual workload state.

# Configure apiserver to use a separate etcd for Events
kube-apiserver \
  --etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  --etcd-servers-overrides='/events#https://etcd-events-0:2379;https://etcd-events-1:2379'   # servers within one override are separated by ";" (quoted for the shell)

# This routes all /registry/events/* keys to the events etcd cluster
# The events etcd can be smaller and have shorter retention
# (Events themselves have a default TTL of 1 hour)
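
The override value is easy to get wrong by hand, because the kube-apiserver flag reference describes the servers inside one override as semicolon-separated (commas separate different overrides). A small sketch that assembles the flag from a list of endpoints; `events_override` is a hypothetical helper and the endpoint names are the ones used above.

```shell
# events_override URL...: build the --etcd-servers-overrides flag for core Events
events_override() {
  servers=""
  for url in "$@"; do
    # Append with ";" between servers, no leading separator
    servers="${servers:+$servers;}$url"
  done
  printf '%s\n' "--etcd-servers-overrides=/events#$servers"
}

events_override https://etcd-events-0:2379 https://etcd-events-1:2379
```

When the result is pasted into a shell command line it needs quoting, since both ";" and "#" are special to the shell.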

Dependency Graph and Next Files

This File Covers

  • Raft consensus: states, write flow, quorum
  • boltdb storage engine + MVCC
  • WAL and snapshots
  • Kubernetes keyspace layout + protobuf encoding
  • Compaction and defragmentation
  • Leases (leader election + node heartbeats)
  • TLS/mTLS configuration
  • Backup and restore procedures
  • Performance tuning (disk, OS, resources)
  • Cluster operations (add/remove members)
  • Key metrics + Prometheus alerting rules
  • Troubleshooting runbooks (NOSPACE, slow writes, split-brain, corruption)