Overview

How to diagnose and resolve Kubernetes performance problems: CPU throttling, memory pressure, I/O bottlenecks, slow API server responses, and HPA/VPA misconfigurations.

Performance Triage Checklist

# 1. Pod-level resource usage
kubectl top pods -n <ns> --sort-by=cpu | head -20
kubectl top pods -n <ns> --sort-by=memory | head -20

# 2. Node-level resource usage
kubectl top nodes

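# 3. Check for CPU throttling (fraction of CFS periods that were throttled)
# Prometheus (cAdvisor): container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total
# kubectl get --raw has no --data-urlencode flag, so port-forward and query with curl
kubectl port-forward -n monitoring svc/prometheus-operated 9090:web &
sleep 2   # give the port-forward a moment to come up
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(container_cpu_cfs_throttled_periods_total{namespace="production"}[5m]) /
    rate(container_cpu_cfs_periods_total{namespace="production"}[5m]) > 0.25' | jq .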

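# 4. Check for OOM kills
# OOM events (depending on the cluster, these may be recorded against the Node rather than the pod)
kubectl get events -n <ns> --field-selector reason=OOMKilling
kubectl get events -n <ns> | grep -i oom
# Containers whose last termination was an OOM kill
kubectl get pods -n <ns> -o json | \
  jq -r '.items[] | .metadata.name as $p | .status.containerStatuses[]? |
    select(.lastState.terminated.reason == "OOMKilled") | $p + "/" + .name'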

# 5. Check resource requests vs limits (every container in each pod)
kubectl get pods -n <ns> -o json | \
  jq -r '.items[] | .metadata.name as $pod | .spec.containers[] |
    $pod + "/" + .name + ": " +
    (.resources | "req_cpu=" + (.requests.cpu//"none") +
     " limit_cpu=" + (.limits.cpu//"none") +
     " req_mem=" + (.requests.memory//"none") +
     " limit_mem=" + (.limits.memory//"none"))'

CPU Throttling

CPU throttling occurs when:
  A container uses up its CPU limit (its CFS quota) within a scheduling period.
  Linux CFS (Completely Fair Scheduler) then throttles it for the remainder of that period (100ms by default).

Symptoms:
  - p99 latency much higher than p50 (spiky latency)
  - Application feels "frozen" for short bursts
  - Prometheus metric: container_cpu_cfs_throttled_periods_total keeps increasing

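Throttling vs no-limit:
  cpu limit = 500m → 50ms of CPU time per 100ms period; throttled once that 50ms is used up (several busy threads can use it up in far less than 50ms of wall-clock time)
  no cpu limit → container can burst to full node CPU, but may compete with neighbors

# Check if throttling is occurring
# cgroup v1 path shown; on cgroup v2 the file is /sys/fs/cgroup/cpu.stat
kubectl exec <pod> -n <ns> -- cat /sys/fs/cgroup/cpu/cpu.stat
# Look for: nr_throttled > 0, and throttled_time (nanoseconds, v1) or throttled_usec (v2) increasing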
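
The relationship between a CPU limit and the CFS quota, and the throttle ratio read from cpu.stat, can be sketched with two small helpers. This is a minimal illustration, not part of any tool: the function names `cfs_quota_us` and `throttled_pct` and the sample counter values are made up for this example, and the default 100ms period is assumed.

```shell
#!/bin/sh
# Convert a CPU limit in millicores to the CFS quota: microseconds of CPU
# time allowed per period. Assumes the default 100ms (100000us) period.
cfs_quota_us() {
  limit_millicores=$1
  period_us=${2:-100000}
  echo $(( limit_millicores * period_us / 1000 ))
}

# Read cpu.stat on stdin and print the percentage of periods in which the
# container was throttled (nr_throttled / nr_periods).
throttled_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"
    }'
}

# A 500m limit gives 50000us (50ms) of CPU time per 100ms period.
cfs_quota_us 500

# Hypothetical cpu.stat content: 300 of 1200 periods throttled, i.e. 25.0%.
printf 'nr_periods 1200\nnr_throttled 300\nthrottled_time 4500000000\n' | throttled_pct
```

Against a live pod, the same helper could be fed from `kubectl exec <pod> -n <ns> -- cat <path-to-cpu.stat> | throttled_pct`, since both cgroup versions expose nr_periods and nr_throttled.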

# PromQL query: throttle ratio per container (throttled periods / total periods)
rate(container_cpu_cfs_throttled_periods_total{pod=~"payments-api.*"}[5m]) /
  rate(container_cpu_cfs_periods_total{pod=~"payments-api.*"}[5m])
# A ratio above 0.25 (25% of periods throttled) is a problem

# Fix options:
# 1. Increase cpu limit:
kubectl patch deployment payments-api -n production \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"payments-api","resources":{"limits":{"cpu":"2000m"}}}]}}}}'

# 2. Remove cpu limit entirely (for latency-sensitive services):
# Risk: pod can consume entire node CPU and starve neighbors
# Mitigate: use cpu request (guarantees share) without limit

# 3. Profile the application to reduce CPU usage
# Go: pprof; Java: async-profiler; Node: clinic.js

Memory Pressure and OOM

# Check OOMKill history
kubectl get events -n <ns> -o json | \
  jq '.items[] | select(.reason=="OOMKilling") |
    {pod:.involvedObject.name, msg:.message, time:.lastTimestamp}'

# Current memory usage vs limit
kubectl top pod <pod> -n <ns>
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}'

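# Java heap sizing (common mistake)
# Container limit: 2Gi
# JVM default max heap: 25% of visible RAM = 512Mi when the JVM sees the 2Gi cgroup limit
# Non-heap (metaspace, code cache, thread stacks, GC overhead): ~512Mi
# → Total Java process ≈ 1Gi → fits within the 2Gi limit, but half of it goes unused
#
# Problem: JVMs without container support (before JDK 10 / 8u191) size the heap
#   from HOST memory, not the container cgroup limit, and can exceed the limit
# Fix: run a JVM with -XX:+UseContainerSupport (on by default since JDK 10 and 8u191)
#   and raise the heap share with -XX:MaxRAMPercentage=75, or set an explicit -Xmx1500m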

# Go memory: the runtime has no memory limit by default
# Keep GOGC=100 (default) and add GOMEMLIMIT=1750MiB (Go 1.19+)
# GOMEMLIMIT is a soft limit: the GC works harder as the heap approaches it, before the container limit is hit

# Node.js heap
# --max-old-space-size=1536   (value in MiB: 1.5GiB old-space heap for a 2GiB container)
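
The runtime settings above all follow the same pattern: pick a share of the container limit and leave the rest as headroom for non-heap memory. A minimal sketch of that arithmetic follows; the helper name `runtime_ceiling_mib` and the 75%/85% shares are assumptions chosen for illustration, not recommendations from any runtime's documentation.

```shell
#!/bin/sh
# Print a runtime memory ceiling in MiB as a percentage of the container
# limit. Integer division rounds down, which errs on the safe side.
runtime_ceiling_mib() {
  limit_mib=$1
  percent=$2
  echo $(( limit_mib * percent / 100 ))
}

LIMIT_MIB=2048   # 2Gi container limit, as in the examples above

# Shares below are illustrative assumptions; tune them per workload.
echo "JVM:  -Xmx$(runtime_ceiling_mib "$LIMIT_MIB" 75)m"
echo "Node: --max-old-space-size=$(runtime_ceiling_mib "$LIMIT_MIB" 75)"
echo "Go:   GOMEMLIMIT=$(runtime_ceiling_mib "$LIMIT_MIB" 85)MiB"
```

With a 2048MiB limit this yields 1536 for the 75% share and 1740 for the 85% share, close to the hand-picked values in the comments above.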

# VPA recommendation for memory
kubectl describe vpa <vpa-name> -n <ns>
# Look for: target memory recommendation

Slow Application Response

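# Step 1: Identify which tier is slow
# - Is it the database? The upstream API? The app itself?
# The -w format is inline because a local curl-format.txt would not exist inside the netshoot pod
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never -- \
  curl -s -o /dev/null \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  http://payments-api.production.svc:8080/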

# Step 2: Check connection pool exhaustion
# App log: "too many open connections", "pool timeout", "ETIMEDOUT"
# Fix: tune pool size, or find connection leak

# Step 3: CPU throttling causing latency spikes
# See CPU throttling section above

# Step 4: GC pauses (Java/Go)
# Go: check GODEBUG=gctrace=1 output
# Java: check GC logs (-Xlog:gc* on JDK 9+; -XX:+PrintGCDetails on JDK 8)
# Fix: tune GC settings, increase heap headroom

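# Step 5: Network latency
# Ping a pod IP: a Service ClusterIP is a virtual IP and often does not answer ICMP
kubectl get pods -n production -o wide   # pick a payments-api pod IP
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never -- \
  ping -c 10 <pod-ip>
# Inter-pod latency should be <1ms on same node, <5ms across nodes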

# Step 6: DNS lookup latency
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never -- \
  dig payments-api.production.svc.cluster.local
# Read the ";; Query time:" line: DNS should resolve in <5ms; if >50ms → CoreDNS issue

HPA Not Scaling Fast Enough

# Check HPA status
kubectl describe hpa payments-api -n production
# Conditions:
#   AbleToScale: True
#   ScalingActive: True
#   ScalingLimited: False

# Current vs target metric
kubectl get hpa payments-api -n production
# TARGETS: 850m/200m   ← 4x over target, should be scaling
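
To see how far the HPA should scale for a reading like the one above, the documented core formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A minimal sketch follows; the helper name `hpa_desired_replicas` and the current replica count of 4 are assumptions for illustration, and the sketch ignores the HPA tolerance band, minReplicas/maxReplicas, and behavior policies.

```shell
#!/bin/sh
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
# Arguments: current replica count, current metric value, target metric value
# (both metric values in the same unit, e.g. millicores).
hpa_desired_replicas() {
  awk -v r="$1" -v cur="$2" -v tgt="$3" 'BEGIN {
    d = r * cur / tgt
    c = int(d)
    if (c < d) c++      # round up to the next whole replica
    print c
  }'
}

# 4 replicas (assumed) at 850m against a 200m target → 17 replicas.
hpa_desired_replicas 4 850 200
```

If `kubectl get hpa` shows a replica count far below this number for several minutes, the causes listed next are the usual suspects.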

# Why is it slow to scale?
# 1. scaleUp.stabilizationWindowSeconds > 0 (default: 0 for scale-up)
# 2. scaleUp.policies[].value too small (e.g., a Pods policy that allows only 2 pods per minute)
kubectl get hpa payments-api -n production -o yaml | grep -A10 behavior

# 3. Pods are slow to become Ready (long readiness probe initialDelay)
kubectl describe pod <new-pod> -n production | grep "Readiness"

# 4. metrics-server lag (scrape interval is set by --metric-resolution, e.g. 60s)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/production/pods | \
  jq '.items[].timestamp'

# Fix: configure scale-up behavior for faster response (goes under spec: of an autoscaling/v2 HPA)
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Pods
      value: 10
      periodSeconds: 60    # allow up to 10 pods per minute
    - type: Percent
      value: 100
      periodSeconds: 60    # or double replicas per minute
    selectPolicy: Max
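
With selectPolicy: Max, the HPA applies whichever policy permits the larger change within the period. The sketch below estimates the replica ceiling after one 60s period for the two policies above. The helper name `hpa_max_after_period` is hypothetical, and the sketch ignores maxReplicas and the desired replica count, both of which cap the real result.

```shell
#!/bin/sh
# Upper bound on replicas after one period when both a Pods policy and a
# Percent policy are set and selectPolicy is Max.
# Arguments: replicas at period start, Pods policy value, Percent policy value.
hpa_max_after_period() {
  current=$1
  pods=$2
  percent=$3
  # Replicas the Percent policy allows adding, rounded up.
  by_percent=$(( (current * percent + 99) / 100 ))
  if [ "$by_percent" -gt "$pods" ]; then
    add=$by_percent
  else
    add=$pods
  fi
  echo $(( current + add ))
}

hpa_max_after_period 4 10 100    # small deployment: Pods policy wins → 14
hpa_max_after_period 30 10 100   # large deployment: Percent policy wins → 60
```

Combining both policy types this way keeps small deployments from crawling (at least 10 pods per minute) while letting large ones double.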

Node Resource Exhaustion

# Check allocatable vs requested resources on node
kubectl describe node <node> | grep -A15 "Allocated resources"
# Requests:   CPU=3500m/4000m (87%), Memory=6Gi/8Gi (75%)

# Find which pods are consuming most resources
kubectl top pods -n production --sort-by=memory | head -20

# Check for resource request inflation (pods requesting more than they use)
# PromQL: actual usage vs request (a ratio well below 1 means the pod is over-requesting)
# container_memory_working_set_bytes{container!=""} / on(namespace,pod,container) kube_pod_container_resource_requests{resource="memory"}

# Node is fully packed → new pods stay Pending
# Options:
# 1. Scale out: add nodes or use Karpenter auto-provisioning
# 2. Reduce requests: use VPA to right-size
# 3. Use bin-packing: set the kube-scheduler NodeResourcesFit scoringStrategy to MostAllocated (a scheduler setting, not an HPA one)

# Check node capacity including extended resources
kubectl describe node <node> | grep -E "Capacity:|Allocatable:" -A10

Disk I/O Bottleneck

# Symptoms: high io_wait, slow database, application timeout

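# Check iowait on node
# The stock ubuntu image ships neither iostat nor iotop, so install them first
kubectl debug node/<node> -it --image=ubuntu -- \
  bash -c 'apt-get update -qq && apt-get install -y -qq sysstat && iostat -x 1 5'
# Look for: %util near 100%, high r_await/w_await (ms)

# Which process is causing I/O
# iotop may need extra privileges; on recent kubectl versions, try adding --profile=sysadmin
kubectl debug node/<node> -it --image=ubuntu -- \
  bash -c 'apt-get update -qq && apt-get install -y -qq iotop && iotop -Po'   # -o: show only processes doing I/O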

# Check if pod is hitting EBS burst budget (applies to gp2, not gp3)
# gp3: 3000 IOPS and 125 MiB/s baseline regardless of size (no burst bucket)
# gp2: baseline IOPS = 3 × size in GiB (minimum 100); smaller volumes burst to 3000 from a credit bucket
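
The gp2 baseline rule can be worked through for a few volume sizes to judge whether a volume depends on burst credits. This is a minimal sketch: the helper name `gp2_baseline_iops` is made up for this example, and it assumes the commonly documented gp2 bounds of 100 IOPS minimum and 16000 IOPS maximum.

```shell
#!/bin/sh
# gp2 baseline IOPS: 3 IOPS per GiB, clamped to the range [100, 16000].
gp2_baseline_iops() {
  size_gib=$1
  iops=$(( size_gib * 3 ))
  if [ "$iops" -lt 100 ]; then
    iops=100
  fi
  if [ "$iops" -gt 16000 ]; then
    iops=16000
  fi
  echo "$iops"
}

gp2_baseline_iops 20     # 100  (floor applies; relies on burst for anything more)
gp2_baseline_iops 500    # 1500 (below 3000, so sustained load drains the bucket)
gp2_baseline_iops 1000   # 3000 (baseline equals the burst ceiling)
```

A volume whose baseline is below the workload's sustained IOPS will eventually exhaust its credits, which is what the BurstBalance metric below reveals.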

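# BurstBalance: percentage of burst credits remaining (0 means throttled to baseline)
# For a CSI-provisioned PV, the EBS volume ID is the volume handle
PV_HANDLE=$(kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}')
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=$PV_HANDLE \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 --statistics Average
# -v-1H is BSD/macOS date syntax; with GNU date use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ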

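# Move to gp3 with higher IOPS if a higher sustained baseline is needed
# StorageClass parameters are immutable, so kubectl patch on them is rejected;
# create a new StorageClass (e.g., iops: "6000", throughput: "250") for new PVCs
# and modify the existing EBS volume in place:
aws ec2 modify-volume --volume-id $PV_HANDLE --volume-type gp3 --iops 6000 --throughput 250
# Note: existing PVs are not changed by a new StorageClass; only new PVCs get its params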

# Use fio to benchmark actual disk performance
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: fio-test
  namespace: production
spec:
  restartPolicy: Never   # run the benchmark once instead of restarting fio in a loop
  containers:
  - name: fio
    image: nixery.dev/fio
    command: ["fio", "--name=randread", "--ioengine=libaio",
              "--rw=randread", "--bs=4k", "--numjobs=4",
              "--runtime=30", "--filename=/data/test",
              "--size=1G", "--direct=1"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: payments-data
EOF

API Server Latency

# API server request latency (high → controllers slow to reconcile)
kubectl get --raw '/metrics' | grep apiserver_request_duration_seconds_bucket | \
  grep -v "^#" | tail -5

# PromQL: p99 request latency
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb, resource)
)

# Common causes of API server slowness:
# 1. etcd latency > 100ms
kubectl get --raw '/metrics' | grep etcd_request_duration

# 2. Too many watch connections (informer leak)
kubectl get --raw '/metrics' | grep apiserver_longrunning_requests | grep WATCH

# 3. Large list requests (missing pagination, large objects)
kubectl get --raw '/metrics' | grep apiserver_request_total | grep LIST

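# Fix: keep the API server watch cache on (--watch-cache=true, the default) and have controllers use shared informers
# Fix: paginate large LIST calls (limit/continue) and narrow them with field selectors / label selectors
# Fix: avoid storing large blobs in etcd (keep large payloads in external storage and reference them from the object)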