etcd Issues

Overview

Deep-dive diagnosis and recovery procedures for etcd — high latency, database size, compaction, defragmentation, member loss, quorum recovery, and backup/restore.

etcd Health Check

# Set up etcdctl shorthand (kubeadm cluster)
alias etcdctl='kubectl exec -n kube-system etcd-control-plane-1 -- \
  etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key'

# Cluster health
etcdctl endpoint health --cluster
# https://10.0.0.5:2379 is healthy: successfully committed proposal
# https://10.0.0.6:2379 is healthy: ...
# https://10.0.0.7:2379 is healthy: ...

# Member list
etcdctl member list -w table
# ID               STATUS   NAME       PEER ADDRS             CLIENT ADDRS
# abc123           started  cp-node-1  https://10.0.0.5:2380  https://10.0.0.5:2379

# Endpoint status (leader, db size, raft index)
etcdctl endpoint status -w table
# ENDPOINT         ID      VERSION  DB SIZE  IS LEADER  IS LEARNER  RAFT TERM  RAFT INDEX
# 127.0.0.1:2379   abc123  3.5.0    45 MB    true       false       12         1234567

etcd Latency Issues

# High etcd latency leads to API server timeouts and slow controller reconciliation

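# Check etcd metrics (kubeadm serves them on 127.0.0.1:2381 of the control plane node)
# Recent etcd images may not ship curl; if the exec fails, run the curl on the node itself
kubectl exec -n kube-system etcd-control-plane-1 -- \
  curl -s http://localhost:2381/metrics | grep fsync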

# Key metrics:
# etcd_disk_wal_fsync_duration_seconds — fsync latency (must be < 10ms p99)
# etcd_disk_backend_commit_duration_seconds — backend commit latency
# etcd_server_proposals_committed_total — proposal rate
# etcd_server_leader_changes_seen_total — leader election count

# PromQL: etcd WAL fsync p99 latency
histogram_quantile(0.99,
  sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le, instance)
)
# Target: < 10ms. Sustained values above 10ms usually mean the disk is the bottleneck
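
When Prometheus is not available, a rough figure can be read straight from the raw metrics text. The sketch below is an assumption-laden helper, not part of etcd: `fsync_p99_bucket` is a hypothetical name, and because the histogram counters are cumulative since process start, the result is coarser than the 5-minute PromQL window above. It prints the smallest bucket upper bound (in seconds) that covers at least 99% of recorded WAL fsyncs.

```shell
# fsync_p99_bucket: read Prometheus text-format metrics on stdin and print the
# smallest "le" bucket bound that contains >= 99% of WAL fsync observations.
# Hypothetical helper; buckets are assumed to arrive in ascending order,
# which is how the exposition format emits them.
fsync_p99_bucket() {
  awk '
    /^etcd_disk_wal_fsync_duration_seconds_bucket/ {
      match($1, /le="[^"]+"/)
      le = substr($1, RSTART + 4, RLENGTH - 5)   # text between the quotes
      n++; bound[n] = le; cum[n] = $NF + 0
      if (le == "+Inf") total = $NF + 0
    }
    END {
      if (total == 0) { print "no samples"; exit }
      for (i = 1; i <= n; i++)
        if (cum[i] >= 0.99 * total) { print bound[i]; exit }
    }
  '
}

# Usage (on the control plane node):
#   curl -s http://localhost:2381/metrics | fsync_p99_bucket
```

A result of 0.016 or higher means at least 1% of fsyncs took longer than 8ms, which is worth confirming against the PromQL query before drawing conclusions.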

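# Check disk I/O on etcd nodes
# (the stock ubuntu image does not include iostat, so install sysstat first)
kubectl debug node/<cp-node> -it --image=ubuntu -- \
  bash -c 'apt-get update -qq && apt-get install -y -qq sysstat && iostat -x 1 5'
# Look for: %util near 100%, high w_await (write latency in ms)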
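
Scanning iostat output by eye is error-prone on nodes with many devices. A minimal sketch of a filter follows; `flag_slow_disks` is a hypothetical name, and the default thresholds (10ms write await, 90% utilization) are assumptions chosen to match the guidance above. Column positions differ between sysstat versions, so the filter looks them up from the header row.

```shell
# flag_slow_disks [MAX_W_AWAIT_MS] [MAX_UTIL_PCT]
# Read `iostat -x` output on stdin and print devices whose w_await or %util
# exceeds the thresholds. Hypothetical helper; defaults are assumptions.
flag_slow_disks() {
  awk -v max_await="${1:-10}" -v max_util="${2:-90}" '
    $1 ~ /^Device/ {
      # Locate the columns by name on every header row
      for (i = 1; i <= NF; i++) {
        if ($i == "w_await") await_col = i
        if ($i == "%util")   util_col = i
      }
      next
    }
    await_col && util_col && NF >= util_col {
      if ($await_col + 0 > max_await || $util_col + 0 > max_util)
        printf "%s w_await=%sms util=%s%%\n", $1, $await_col, $util_col
    }
  '
}

# Usage:
#   iostat -x 1 5 | flag_slow_disks 10 90
```

The first iostat report covers the time since boot, so a device that only shows up in later reports is the more reliable signal.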

# Fix disk latency:
# 1. Use dedicated SSD for /var/lib/etcd (NVMe preferred)
# 2. Separate etcd from other workloads on control plane
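# 3. Use io scheduler: none or mq-deadline (kernels before 5.0: noop or deadline, not cfq)
#    cat /sys/block/nvme0n1/queue/scheduler   # lists available schedulers, active one in [brackets]
#    echo none > /sys/block/nvme0n1/queue/scheduler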

# Check if etcd is on slow network storage (for example EBS gp2)
# etcd should use NVMe instance storage or a dedicated gp3 or io1/io2 EBS volume, NOT gp2

etcd Database Size

# etcd database grows as objects are created/updated
# Each update creates a new revision; old revisions accumulate

# Check database size
etcdctl endpoint status -w table
# DB SIZE column: should be < 2GB for healthy clusters
# Default quota: 2GB (can increase to 8GB with --quota-backend-bytes=8589934592)

# etcd key count by key prefix
# (--keys-only prints a blank line after every key, so drop those before counting)
etcdctl get "" --prefix --keys-only | grep -v '^$' | \
  sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -20
# Typical large consumers:
# /registry/events  - Event objects (auto-TTL'd but can accumulate)
# /registry/pods    - Pod objects (completed or evicted pods linger until garbage-collected)
# /registry/leases  - Lease objects (node heartbeats and leader election)

# If the database reaches the quota (alert well before that, for example at 80%):
# etcd raises a NOSPACE alarm and rejects writes with "etcdserver: mvcc: database space exceeded"
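
To turn the DB SIZE figure into a quota percentage, something like the following sketch can help. The function names are hypothetical, the 80% warning level mirrors the alert threshold used later in this document, and the default quota is assumed to be etcd's built-in 2 GiB (2147483648 bytes) unless a second argument is given.

```shell
# quota_usage_pct DB_BYTES [QUOTA_BYTES]
# Integer percentage of the backend quota in use. DB_BYTES is, for example,
# the dbSize field from `etcdctl endpoint status -w json`; QUOTA_BYTES is the
# --quota-backend-bytes value (assumed default: 2 GiB).
quota_usage_pct() {
  echo $(( $1 * 100 / ${2:-2147483648} ))
}

# quota_status DB_BYTES [QUOTA_BYTES]
# Classify the usage: OK below 80%, WARN from 80%, EXCEEDED at 100% or more.
quota_status() {
  local pct
  pct="$(quota_usage_pct "$@")"
  if [ "$pct" -ge 100 ]; then
    echo "EXCEEDED ${pct}%"
  elif [ "$pct" -ge 80 ]; then
    echo "WARN ${pct}%"
  else
    echo "OK ${pct}%"
  fi
}

# Usage:
#   quota_status 1800000000              # against the 2 GiB default
#   quota_status 1800000000 8589934592   # against an 8 GiB quota
```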

# Step 1: Compact old revisions
REVISION=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
etcdctl compact $REVISION
# "compacted revision 1234567"

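# Step 2: Defragment (reclaims space after compaction)
etcdctl defrag --cluster
# Defrag blocks reads and writes on a member while it runs, so use a low-traffic window
# (on large databases, prefer one member at a time with the leader last)
# Compaction only removes old revisions logically; defrag returns the freed space to the filesystem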

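# Step 3: Verify space is reclaimed
etcdctl endpoint status -w table
# DB SIZE should be smaller

# Step 4: If the quota was exceeded, clear the NOSPACE alarm (writes stay blocked until then)
etcdctl alarm list
etcdctl alarm disarm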
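
Defragmenting one member at a time limits the blast radius if a member misbehaves. A minimal sketch, under stated assumptions: `defrag_rolling` and the `ETCDCTL` override variable are hypothetical names, and TLS settings are expected to come from the standard `ETCDCTL_CACERT`, `ETCDCTL_CERT`, and `ETCDCTL_KEY` environment variables. Shell aliases are not expanded in non-interactive scripts by default, so an interactive `etcdctl` alias is not picked up here.

```shell
# defrag_rolling ENDPOINT...
# Defragment the given endpoints one after another and stop at the first
# failure. ETCDCTL (hypothetical override) names the command to run and
# defaults to plain `etcdctl`.
defrag_rolling() {
  local endpoint
  for endpoint in "$@"; do
    echo "defragmenting ${endpoint}"
    ${ETCDCTL:-etcdctl} --endpoints="${endpoint}" defrag || return 1
    # Confirm the member is serving again before moving on
    ${ETCDCTL:-etcdctl} --endpoints="${endpoint}" endpoint health || return 1
  done
}

# Usage (followers first, leader last):
#   defrag_rolling https://10.0.0.6:2379 https://10.0.0.7:2379 https://10.0.0.5:2379
```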

# Auto-compact configuration (set in etcd flags):
# --auto-compaction-mode=periodic
# --auto-compaction-retention=1h   (compact hourly, keep 1h of history)

Member Failure and Recovery

# One etcd member is down
etcdctl member list
# abc123 started cp-node-1 PEER  CLIENT  ← running
# def456 started cp-node-2 PEER  CLIENT  ← running
# ghi789 started cp-node-3 PEER  CLIENT  ← down / unhealthy

etcdctl endpoint health --cluster
# https://10.0.0.7:2379 is unhealthy: ...

# Step 1: Determine if quorum is still met
# 3-member cluster: 2 members up = quorum OK (can still write)
# 3-member cluster: only 1 member up = quorum LOST (no writes)
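
The quorum rule generalizes to any cluster size: a strict majority, floor(N/2) + 1, must be healthy. The helper names below are hypothetical and the block is only a sketch of that arithmetic.

```shell
# quorum_size N: members that must be healthy for an N-member cluster to
# commit writes (strict majority).
quorum_size() { echo $(( $1 / 2 + 1 )); }

# fault_tolerance N: members that can fail while quorum is kept.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

# has_quorum TOTAL HEALTHY: exit status 0 if HEALTHY still forms a majority.
has_quorum() { [ "$2" -ge "$(quorum_size "$1")" ]; }

for n in 1 3 5 7; do
  echo "members=$n quorum=$(quorum_size "$n") tolerates=$(fault_tolerance "$n") failures"
done
```

A 4-member cluster needs 3 healthy members and so tolerates only one failure, the same as a 3-member cluster, which is why odd member counts are the usual choice.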

# Step 2: Replace the failed member
# a) Remove the failed member
etcdctl member remove ghi789

# b) Add new member (register it with the cluster before starting the new etcd process)
etcdctl member add cp-node-3-new --peer-urls=https://10.0.0.8:2380
# Member abc890 added to cluster xyz

# c) Start the new etcd member with:
# --initial-cluster-state=existing  (not "new")
# --initial-cluster="cp-node-1=...,cp-node-2=...,cp-node-3-new=..."

# Step 3: Verify member joined
etcdctl member list
etcdctl endpoint health --cluster

# For kubeadm clusters: after removing the failed member (step a), skip b) and c).
# kubeadm join registers the new etcd member and starts it as a static pod.
kubeadm join <apiserver>:6443 --control-plane ...

Quorum Loss Recovery (Disaster)

3-member cluster: 2 or 3 members dead → quorum lost → cluster cannot accept writes
Symptoms: kubectl returns "etcdserver: request timed out" or "no leader"

This is the most severe etcd failure scenario.
Recovery options (in order of preference):
  1. Restore the failed nodes (if hardware failure — data preserved on disk)
  2. Restore from most recent etcd snapshot backup
  3. Force a new cluster from the single surviving member (data loss possible)

# Check if there is a leader
etcdctl endpoint status -w table
# IS LEADER column: should have exactly one "true"

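# Option 1: Restart dead etcd members (if disk data intact)
# SSH to dead control plane node
systemctl restart etcd      # etcd managed by systemd
# kubeadm static pod: confirm /etc/kubernetes/manifests/etcd.yaml exists, then
# systemctl restart kubelet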

# Option 2: Restore from snapshot
# Copy snapshot to control plane node
scp etcd-backup.db cp-node-1:/var/lib/etcd/snapshot.db

# Stop etcd and kube-apiserver (repeat on every control plane node so no stale member keeps serving)
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Restore snapshot as a new single-member cluster (re-add the other members afterwards)
etcdutl snapshot restore /var/lib/etcd/snapshot.db \
  --name cp-node-1 \
  --initial-cluster "cp-node-1=https://10.0.0.5:2380" \
  --initial-advertise-peer-urls https://10.0.0.5:2380 \
  --data-dir /var/lib/etcd-new

# Replace data directory
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-new /var/lib/etcd

# Restart etcd and kube-apiserver
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Option 3: Force single-member cluster (last resort, possible data loss)
# Requires etcd --force-new-cluster flag
# This creates a new cluster from the single member's current state

etcd Backup Strategy

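# Take snapshot (run regularly as CronJob)
SNAPSHOT=/backup/etcd-$(date +%Y%m%d-%H%M%S).db
etcdctl snapshot save "$SNAPSHOT"

# Verify snapshot (takes a single file, not a glob)
# etcd 3.5 deprecates "etcdctl snapshot status" in favor of "etcdutl snapshot status"
etcdctl snapshot status "$SNAPSHOT" -w table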
# HASH        REVISION  TOTAL KEYS  TOTAL SIZE
# abc12345    1234567   5000        45 MB

# CronJob to backup every 6 hours to S3
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeName: cp-node-1   # run on control plane
          containers:
          - name: backup
            image: bitnami/etcd:3.5   # ships etcdctl but not the AWS CLI; use an image that contains both
            command:
            - /bin/sh
            - -c
            - |
              etcdctl --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
                --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
                snapshot save /tmp/etcd-snapshot.db
              aws s3 cp /tmp/etcd-snapshot.db \
                s3://my-cluster-backups/etcd/etcd-$(date +%Y%m%d-%H%M%S).db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          restartPolicy: OnFailure
          serviceAccountName: etcd-backup-sa

# Verify recent backups exist (enforce the 7-day retention with an S3 lifecycle rule):
aws s3 ls s3://my-cluster-backups/etcd/ | tail -20
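
If retention has to be handled by a script rather than by the bucket itself, the timestamp embedded in the object name is enough to pick out expired backups. The sketch below assumes the `etcd-YYYYMMDD-HHMMSS.db` naming used by the CronJob above; `expired_backups` is a hypothetical name, and the function only lists candidates without deleting anything.

```shell
# expired_backups CUTOFF
# Read object names on stdin (one per line) and print those whose embedded
# timestamp sorts before CUTOFF (format YYYYMMDD-HHMMSS). Because the
# timestamp is fixed-width, string order equals chronological order.
expired_backups() {
  awk -v cutoff="$1" '
    match($0, /etcd-[0-9]+-[0-9]+\.db$/) {
      stamp = substr($0, RSTART + 5, RLENGTH - 8)   # between "etcd-" and ".db"
      if (length(stamp) == 15 && stamp < cutoff) print
    }
  '
}

# Usage (GNU date syntax for the cutoff):
#   cutoff="$(date -u -d '7 days ago' +%Y%m%d-%H%M%S)"
#   aws s3 ls s3://my-cluster-backups/etcd/ | awk '{print $4}' | expired_backups "$cutoff"
```

Names that do not match the pattern are ignored, so unrelated objects in the same prefix are never listed for removal.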

etcd Performance Tuning

# etcd flags for performance (in static pod manifest)
# /etc/kubernetes/manifests/etcd.yaml

spec:
  containers:
  - command:
    - etcd
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=1h      # compact hourly
    - --quota-backend-bytes=8589934592    # 8GB (default 2GB)
    - --max-request-bytes=33554432        # 32MB (default 1.5MB); raise only if large objects require it
    - --heartbeat-interval=100            # ms (default 100)
    - --election-timeout=1000             # ms (default 1000)
    # For high-latency networks (multi-region):
    # --heartbeat-interval=500
    # --election-timeout=2500

# Check Raft timing
etcdctl endpoint status -w json | \
  jq '.[].Status | {leader:.leader, raftIndex:.raftIndex, raftTerm:.raftTerm}'

# High leader changes indicate instability
etcdctl endpoint status -w json | \
  jq '.[].Status.raftTerm'   # incrementing fast = leader elections happening

# Benchmark etcd performance (writes test load for about a minute; avoid on a struggling cluster)
# Uses the etcdctl alias defined earlier so the TLS flags are included
etcdctl check perf

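# etcd peer latency (metric: etcd_network_peer_round_trip_time_seconds)
# Round-trip time should stay well below the heartbeat interval (100ms by default)
# Cross-AZ within one region: typically 1-5ms
# Cross-region: NOT recommended (too high latency for Raft consensus)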
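
Heartbeat and election values can be derived from a measured peer round-trip time. The sketch below encodes one heuristic, labeled here as an assumption based on the general etcd tuning guidance: heartbeat roughly equal to the RTT, election timeout ten times the RTT, and never below the defaults of 100ms and 1000ms. `suggest_raft_timing` is a hypothetical name; treat its output as a starting point to validate, not a recommendation.

```shell
# suggest_raft_timing RTT_MS
# Print candidate --heartbeat-interval / --election-timeout values for a peer
# round-trip time given in whole milliseconds. Heuristic (assumption):
# heartbeat ~ RTT, election = 10 x RTT, floored at the etcd defaults.
suggest_raft_timing() {
  local heartbeat election
  heartbeat=$1
  if [ "$heartbeat" -lt 100 ]; then
    heartbeat=100
  fi
  election=$(( $1 * 10 ))
  if [ "$election" -lt 1000 ]; then
    election=1000
  fi
  echo "--heartbeat-interval=${heartbeat} --election-timeout=${election}"
}

# Usage:
#   suggest_raft_timing 3     # typical cross-AZ RTT: defaults are kept
#   suggest_raft_timing 250   # high-latency link
```

All members should run with the same two values, so apply any change to every etcd manifest in the cluster.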

etcd Monitoring Alerts

# PrometheusRule for etcd
- alert: EtcdMemberDown
  expr: up{job="etcd"} == 0
  for: 1m
  labels:
    severity: critical

- alert: EtcdHighFsyncLatency
  expr: histogram_quantile(0.99,
    rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
  for: 10m
  labels:
    severity: warning

- alert: EtcdDatabaseSpaceExceeded80Pct
  expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
  for: 5m
  labels:
    severity: warning

- alert: EtcdLeaderChanges
  expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
  for: 0m
  labels:
    severity: warning

- alert: EtcdNoLeader
  expr: etcd_server_has_leader == 0
  for: 1m
  labels:
    severity: critical