etcd Issues
Overview
Deep-dive diagnosis and recovery procedures for etcd — high latency, database size, compaction, defragmentation, member loss, quorum recovery, and backup/restore.
etcd Health Check
# Set up etcdctl shorthand (kubeadm cluster)
alias etcdctl='kubectl exec -n kube-system etcd-control-plane-1 -- \
etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key'
# Cluster health
etcdctl endpoint health --cluster
# https://10.0.0.5:2379 is healthy: successfully committed proposal
# https://10.0.0.6:2379 is healthy: ...
# https://10.0.0.7:2379 is healthy: ...
# Member list
etcdctl member list -w table
# ID STATUS NAME PEER ADDRS CLIENT ADDRS
# abc123 started cp-node-1 https://10.0.0.5:2380 https://10.0.0.5:2379
# Endpoint status (leader, db size, raft index)
etcdctl endpoint status -w table
# ENDPOINT ID VERSION DB SIZE IS LEADER IS LEARNER RAFT TERM RAFT INDEX
# 127.0.0.1:2379 abc123 3.5.0 45 MB true false 12 1234567
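When several members are listed, it can help to reduce the health output to just the failing endpoints. The following is a minimal sketch, assuming the line format shown in the sample output above (`<endpoint> is unhealthy: <reason>`); `unhealthy_endpoints` is a hypothetical helper name, and the reason text in the canned input is illustrative.

```shell
# Print only the endpoints that were reported as unhealthy.
# Reads `etcdctl endpoint health` lines on stdin; the format is assumed
# to match the sample output above.
unhealthy_endpoints() {
  awk '$2 == "is" && $3 == "unhealthy:" { print $1 }'
}

# Example with canned output (addresses taken from the samples on this page)
printf '%s\n' \
  'https://10.0.0.5:2379 is healthy: successfully committed proposal' \
  'https://10.0.0.7:2379 is unhealthy: failed to commit proposal' \
  | unhealthy_endpoints
```

In an interactive shell with the alias defined, this could be used as `etcdctl endpoint health --cluster 2>&1 | unhealthy_endpoints`; the `2>&1` is a precaution because, depending on the etcdctl version, health lines may be written to stderr.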
etcd Latency Issues
# High etcd latency shows up as API server timeouts and slow controller reconciliation
# Check etcd metrics (kubeadm exposes them on http://127.0.0.1:2381)
# Run on the control plane node itself; the etcd container image often ships without curl
curl -s http://127.0.0.1:2381/metrics | grep fsync
# Key metrics:
# etcd_disk_wal_fsync_duration_seconds — fsync latency (must be < 10ms p99)
# etcd_disk_backend_commit_duration_seconds — backend commit latency
# etcd_server_proposals_committed_total — proposal rate
# etcd_server_leader_changes_seen_total — leader election count
# PromQL: etcd WAL fsync p99 latency
histogram_quantile(0.99,
sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le, instance)
)
# Target: < 10ms. If > 10ms: disk is the bottleneck
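# Check disk I/O on etcd nodes (the stock ubuntu image lacks iostat, so install sysstat first)
kubectl debug node/<cp-node> -it --image=ubuntu -- \
bash -c 'apt-get update -qq && apt-get install -y -qq sysstat && iostat -x 1 5'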
# Look for: %util near 100%, high w_await (ms write latency)
# Fix disk latency:
# 1. Use dedicated SSD for /var/lib/etcd (NVMe preferred)
# 2. Separate etcd from other workloads on control plane
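# 3. Use I/O scheduler none or mq-deadline (kernel 5.0+; on older kernels: noop or deadline, not cfq)
# cat /sys/block/nvme0n1/queue/scheduler (lists available schedulers, active one in brackets)
# echo none > /sys/block/nvme0n1/queue/scheduler
# Check if etcd is on slow network storage (EBS gp2 vs gp3)
# etcd should use NVMe instance storage or dedicated gp3/io1 EBS, NOT gp2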
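The 10ms target can be turned into a quick pass/fail check once a p99 value has been read from Prometheus. This is a small sketch only; `fsync_verdict` is a hypothetical helper, and the input is assumed to be the p99 in seconds, as returned by the `histogram_quantile` query above.

```shell
# Classify a WAL fsync p99 latency (in seconds) against the 10 ms target.
# awk is used because POSIX shell arithmetic has no floating point.
fsync_verdict() {
  awk -v p99="$1" 'BEGIN { if (p99 + 0 < 0.010) print "ok"; else print "slow" }'
}

fsync_verdict 0.004   # 4 ms
fsync_verdict 0.025   # 25 ms
```

A "slow" verdict points at the disk checks listed above (iostat, storage class, I/O scheduler) as the next step.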
etcd Database Size
# etcd database grows as objects are created/updated
# Each update creates a new revision; old revisions accumulate
# Check database size
etcdctl endpoint status -w table
# DB SIZE column: should be < 2GB for healthy clusters
# Default quota: 2GB (can increase to 8GB with --quota-backend-bytes=8589934592)
# etcd key count by key prefix (--keys-only separates keys with blank lines, so drop them first)
etcdctl get "" --prefix --keys-only | grep -v '^$' | \
sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -20
# Typical large consumers:
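# /registry/events: Event objects (expire via TTL but can accumulate under heavy churn)
# /registry/pods: Pod objects, including completed pods that have not been garbage-collected
# /registry/leases: leader election and node heartbeat Leases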
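The sed pipeline above groups keys by their parent path (for example `/registry/pods/default`). A coarser view per resource type can be easier to scan. The sketch below assumes Kubernetes-style keys of the form `/registry/<resource>/...`; `count_by_resource` is a hypothetical helper, and resources in named API groups (such as `/registry/apps/deployments`) are grouped under the group name.

```shell
# Count keys per top-level resource from a key listing on stdin.
# Blank lines (as emitted by --keys-only) are skipped because they have no fields.
count_by_resource() {
  awk -F/ 'NF >= 3 && $2 != "" { n["/" $2 "/" $3]++ }
           END { for (k in n) print n[k], k }' | sort -rn
}

# Example with a canned key listing
printf '%s\n' \
  /registry/events/default/a '' \
  /registry/events/default/b '' \
  /registry/pods/default/web-1 \
  | count_by_resource
```

With the alias defined, the real listing would be piped in as `etcdctl get "" --prefix --keys-only | count_by_resource`. Note that this counts keys, not bytes, so a few very large objects can still dominate the database size.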
# If database is > 80% of quota: compact and defragment now, before the quota is hit
# Once the quota is exceeded, etcd raises a NOSPACE alarm and rejects writes with
# "etcdserver: mvcc: database space exceeded"
# Step 1: Compact old revisions
REVISION=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
etcdctl compact $REVISION
# "compacted revision 1234567"
# Step 2: Defragment (reclaims space after compaction)
etcdctl defrag --cluster
# This takes each member offline briefly — run during low-traffic window
# Defrag releases file space; compact only removes logical entries
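# Step 3: Verify space is reclaimed
etcdctl endpoint status -w table
# DB SIZE should be smaller
# If the quota was exceeded, clear the NOSPACE alarm so that writes resume
etcdctl alarm list
etcdctl alarm disarm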
# Auto-compact configuration (set in etcd flags):
# --auto-compaction-mode=periodic
# --auto-compaction-retention=1h (compact hourly, keep 1h of history)
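To relate the DB SIZE column to the quota, the usage can be computed as a percentage. This is a rough sketch; `db_usage_pct` is a hypothetical helper, and both arguments are assumed to be byte counts (the database size and the value of `--quota-backend-bytes`).

```shell
# Percentage of the backend quota in use (integer, rounded down).
# Usage: db_usage_pct <db_bytes> <quota_bytes>
db_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Byte values for the 2 GiB default quota and the 8 GiB setting mentioned above
echo $(( 2 * 1024 * 1024 * 1024 ))
echo $(( 8 * 1024 * 1024 * 1024 ))

# A 1.8 GB database against the default quota is already past the 80% mark
db_usage_pct 1800000000 2147483648
```

Anything above 80 is a signal to run the compact and defragment steps before writes start failing.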
Member Failure and Recovery
# One etcd member is down
etcdctl member list
# abc123 started cp-node-1 PEER CLIENT ← running
# def456 started cp-node-2 PEER CLIENT ← running
# ghi789 started cp-node-3 PEER CLIENT ← down / unhealthy
etcdctl endpoint health --cluster
# https://10.0.0.7:2379 is unhealthy: ...
# Step 1: Determine if quorum is still met
# 3-member cluster: 2 members up = quorum OK (can still write)
# 3-member cluster: only 1 member up = quorum LOST (no writes)
# Step 2: Replace the failed member
# a) Remove the failed member
etcdctl member remove ghi789
# b) Add new member before starting it (registers the peer URL and prints the ETCD_INITIAL_CLUSTER settings to start it with)
etcdctl member add cp-node-3-new --peer-urls=https://10.0.0.8:2380
# Member abc890 added to cluster xyz
# c) Start the new etcd member with:
# --initial-cluster-state=existing (not "new")
# --initial-cluster="cp-node-1=...,cp-node-2=...,cp-node-3-new=..."
# Step 3: Verify member joined
etcdctl member list
etcdctl endpoint health --cluster
# For kubeadm clusters: after removing the failed member, join a new control plane node.
# kubeadm adds the local etcd member itself, so skip the manual "member add" step.
kubeadm join <apiserver>:6443 --control-plane ...
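The quorum rule from Step 1 generalizes to any cluster size: a majority of members, floor(n/2) + 1, must be healthy. The sketch below works through that arithmetic; the function names are illustrative and not part of etcdctl.

```shell
# Quorum size for an n-member cluster: floor(n/2) + 1
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still accepts writes
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# has_quorum <total_members> <healthy_members>: succeeds if writes are still possible
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

For example, `has_quorum 3 2` succeeds while `has_quorum 3 1` fails, matching the two cases in Step 1. It also shows why even member counts add no resilience: 4 members tolerate one failure, the same as 3.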
Quorum Loss Recovery (Disaster)
3-member cluster: 2 or 3 members dead → quorum lost → cluster cannot accept writes
Symptoms: kubectl returns "etcdserver: request timed out" or "no leader"
This is the most severe etcd failure scenario.
Recovery options (in order of preference):
1. Restore the failed nodes (if hardware failure — data preserved on disk)
2. Restore from most recent etcd snapshot backup
3. Force a new cluster from the single surviving member (data loss possible)
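# Check if there is a leader
# Note: kubectl exec depends on the API server, which is typically unavailable without etcd quorum;
# run etcdctl directly on a control plane node instead (for example via crictl exec)
etcdctl endpoint status -w table
# IS LEADER column: exactly one member should show "true"
# (only endpoints passed via --endpoints are listed; include every member to see the whole cluster)
# Option 1: Restart dead etcd members (if disk data intact)
# SSH to dead control plane node
systemctl restart etcd      # etcd managed by systemd
systemctl restart kubelet   # kubeadm: etcd is a static pod that the kubelet restarts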
# Option 2: Restore from snapshot
# Copy snapshot to control plane node
scp etcd-backup.db cp-node-1:/var/lib/etcd/snapshot.db
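# Stop etcd and kube-apiserver (the kubelet stops static pods once their manifests are removed)
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# Confirm both containers are gone before touching the data directory
crictl ps | grep -E 'etcd|kube-apiserver'
# Restore snapshot (etcdutl ships with etcd 3.5+; older releases use etcdctl snapshot restore)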
etcdutl snapshot restore /var/lib/etcd/snapshot.db \
--name cp-node-1 \
--initial-cluster "cp-node-1=https://10.0.0.5:2380" \
--initial-advertise-peer-urls https://10.0.0.5:2380 \
--data-dir /var/lib/etcd-new
# Replace data directory
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-new /var/lib/etcd
# Restart etcd and kube-apiserver
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# Option 3: Force single-member cluster (last resort, possible data loss)
# Requires etcd --force-new-cluster flag
# This creates a new cluster from the single member's current state
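The restore in Option 2 rebuilds a single-member cluster. If several members are restored instead, each one is restored from the same snapshot file with the same `--initial-cluster` list, so it can be convenient to build that value once. The sketch below is illustrative only: `initial_cluster` is a hypothetical helper, and the peer URLs for cp-node-2 and cp-node-3 are assumptions derived from the sample addresses on this page.

```shell
# Join name=peer-URL pairs into the comma-separated --initial-cluster value.
# IFS is changed inside a subshell so the caller's IFS stays untouched.
initial_cluster() {
  ( IFS=,; printf '%s\n' "$*" )
}

CLUSTER=$(initial_cluster \
  "cp-node-1=https://10.0.0.5:2380" \
  "cp-node-2=https://10.0.0.6:2380" \
  "cp-node-3=https://10.0.0.7:2380")
echo "$CLUSTER"
```

Each node would then pass `--initial-cluster "$CLUSTER"` together with its own `--name` and `--initial-advertise-peer-urls` to the restore command.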
etcd Backup Strategy
# Take snapshot (run regularly as CronJob)
SNAPSHOT=/backup/etcd-$(date +%Y%m%d-%H%M%S).db
etcdctl snapshot save "$SNAPSHOT"
# Verify snapshot (pass a single file; a glob that matches several files is rejected)
etcdctl snapshot status "$SNAPSHOT" -w table
# HASH REVISION TOTAL KEYS TOTAL SIZE
# abc12345 1234567 5000 45 MB
# CronJob to backup every 6 hours to S3
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeName: cp-node-1 # run on control plane
          containers:
          - name: backup
            # The image must provide both etcdctl and the aws CLI; a stock etcd image
            # is unlikely to include the aws CLI, so a custom image may be needed.
            image: bitnami/etcd:3.5
            command:
            - /bin/sh
            - -c
            - |
              set -e  # do not upload if the snapshot failed
              etcdctl --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
                --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
                snapshot save /tmp/etcd-snapshot.db
              aws s3 cp /tmp/etcd-snapshot.db \
                s3://my-cluster-backups/etcd/etcd-$(date +%Y%m%d-%H%M%S).db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          restartPolicy: OnFailure
          serviceAccountName: etcd-backup-sa
# Verify that recent backups exist (enforce the 7-day retention with an S3 lifecycle rule):
aws s3 ls s3://my-cluster-backups/etcd/ | tail -20
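Because the snapshot names embed a `YYYYMMDD-HHMMSS` timestamp, sorting them as plain text also sorts them chronologically, which makes it easy to pick the newest one for verification or restore. A minimal sketch, assuming all names follow the `etcd-<timestamp>.db` pattern used above and live in one directory; `latest_snapshot` is a hypothetical helper.

```shell
# Print the newest snapshot name from a list of names on stdin.
# Lines that do not match the etcd-YYYYMMDD-HHMMSS.db pattern are ignored.
latest_snapshot() {
  grep -E 'etcd-[0-9]{8}-[0-9]{6}\.db$' | LC_ALL=C sort | tail -n 1
}

# Example with canned file names
printf '%s\n' \
  etcd-20240101-060000.db \
  etcd-20240102-000000.db \
  etcd-20240101-180000.db \
  | latest_snapshot
```

On a backup host this could be fed from `ls /backup | latest_snapshot`. For the S3 listing, the object name is the fourth column, so something like `aws s3 ls s3://my-cluster-backups/etcd/ | awk '{print $4}' | latest_snapshot` should work.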
etcd Performance Tuning
# etcd flags for performance (in static pod manifest)
# /etc/kubernetes/manifests/etcd.yaml
spec:
  containers:
  - command:
    - etcd
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=1h # compact hourly
    - --quota-backend-bytes=8589934592 # 8 GiB (default 2 GiB)
    - --max-request-bytes=33554432 # 32 MiB (default 1.5 MiB); raise only for large objects
    - --heartbeat-interval=100 # ms (default 100)
    - --election-timeout=1000 # ms (default 1000)
# For high-latency networks (multi-region):
# --heartbeat-interval=500
# --election-timeout=2500
# Check Raft state (leader ID, index, term)
etcdctl endpoint status -w json | \
jq '.[].Status | {leader:.leader, raftIndex:.raftIndex, raftTerm:.raftTerm}'
# High leader changes indicate instability
etcdctl endpoint status -w json | \
jq '.[].Status.raftTerm' # incrementing fast = leader elections happening
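# Benchmark etcd write performance (generates load for about 60s; avoid on a struggling cluster)
# Uses the etcdctl alias from the health check section so that the TLS flags are passed
etcdctl check perf
# etcd peer latency
# Round-trip time between members should stay well under the heartbeat interval (100ms by default)
# Same region, cross-AZ: typically 1-5ms
# Cross-region: NOT recommended (too high latency for Raft consensus)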
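When changing the two Raft timing flags, the pair should be checked together: etcd expects `--election-timeout` to be at least five times `--heartbeat-interval`. The sketch below encodes that rule only; `check_raft_timing` is a hypothetical helper and takes both values in milliseconds.

```shell
# check_raft_timing <heartbeat_ms> <election_ms>
# Prints "ok" when the election timeout is at least 5x the heartbeat interval.
check_raft_timing() {
  if [ "$2" -ge $(( $1 * 5 )) ]; then
    echo "ok"
  else
    echo "election timeout too low"
  fi
}

check_raft_timing 100 1000   # defaults
check_raft_timing 500 2500   # high-latency values from above
check_raft_timing 500 2000   # violates the 5x rule
```

The ratio is only a lower bound; the heartbeat interval itself should still be chosen from the measured round-trip time between members.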
etcd Monitoring Alerts
# PrometheusRule for etcd
- alert: EtcdMemberDown
  expr: up{job="etcd"} == 0
  for: 1m
  labels:
    severity: critical
- alert: EtcdHighFsyncLatency
  expr: histogram_quantile(0.99,
    rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
  for: 10m
  labels:
    severity: warning
- alert: EtcdDatabaseSpaceExceeded80Pct
  expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
  for: 5m
  labels:
    severity: warning
- alert: EtcdLeaderChanges
  expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
  for: 0m
  labels:
    severity: warning
- alert: EtcdNoLeader
  expr: etcd_server_has_leader == 0
  for: 1m
  labels:
    severity: critical
Related
- 05 — Control Plane Issues — API server and etcd interaction
- 11 — Leader Election — Lease API uses etcd
- 08 — Cluster Maintenance — upgrade and maintenance procedures