Performance Tuning | Kubernetes Docs

Performance Investigation Methodology

Before tuning, identify the bottleneck. Tuning the wrong layer wastes time and can mask real issues. Use the USE method (Utilization, Saturation, Errors) as a systematic top-down approach:

USE Method — Kubernetes Mapping

  For each resource, check:
  ├── Utilization  = how busy is it? (%)
  ├── Saturation   = how much work is queued/waiting?
  └── Errors       = are there operation failures?

  CPU:
    Util: rate(container_cpu_usage_seconds_total[5m]) / requests
    Sat:  container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total (throttle ratio)
    Err:  OOMKilled events (indirect CPU→memory link)

  Memory:
    Util: container_memory_working_set_bytes / requests
    Sat:  container_memory_failures_total (page faults), kswapd activity
    Err:  OOMKilled pod restarts

  Network (node):
    Util: node_network_transmit_bytes_total / NIC speed
    Sat:  node_network_transmit_drop_total (TX queue drops)
    Err:  node_network_transmit_errs_total

  Disk (storage):
    Util: rate(node_disk_io_time_seconds_total[5m])
    Sat:  node_disk_io_time_weighted_seconds_total (I/O wait)
    Err:  node_disk_read_errors_total

  Control plane:
    Util: apiserver_request_duration_seconds (p99)
    Sat:  apiserver_current_inflight_requests
    Err:  apiserver_request_total{code=~"5.."}

Symptom	Likely Cause	Where to look
High p99 latency, normal p50	CPU throttling, GC pauses, tail latency in dependency	throttle ratio metric, GC logs, distributed traces
High p50 AND p99 latency	Resource saturation, slow external dependency, network congestion	CPU/mem utilization, downstream service latency
Latency spikes every N minutes	GC stop-the-world, cron job interference, HPA scaling event	GC logs, HPA events, cron schedule
Intermittent pod restarts	OOMKill, liveness probe timeout during GC pause, CPU starvation	`kubectl describe pod` exit reason, throttle metric
Slow pod startup	Large image pull, slow readiness probe, DNS timeout	pod event timestamps, image pull duration
kubectl slow / timeouts	API server overloaded, etcd slow, network policy eval	apiserver latency metrics, etcd commit duration

CPU Throttling

CPU throttling is one of the most common and most misunderstood performance problems in Kubernetes. A pod can be throttled even when node CPU utilization is low — throttling is governed by the CFS (Completely Fair Scheduler) quota enforcement, not node-level utilization.

How CFS throttling works

CFS CPU Quota Enforcement

  cpu.limit = 500m → CFS quota = 50ms per 100ms period

  Time →  0ms      50ms     100ms    150ms    200ms
  Pod A:  ██████████████████ (uses 50ms → THROTTLED for 50ms)
  Pod B:  ████████           (uses 40ms → not throttled)

  Even if node has 80% idle CPU:
  Pod A is throttled because it hit its period quota.

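  Multi-threaded processes are especially vulnerable:
  8 threads × 12ms burst = 96ms of CPU needed, but quota = 50ms
  → quota is exhausted after 6.25ms of wall-clock time (50ms ÷ 8 threads);
    every thread then stalls until the period ends, and the remaining
    46ms of work is deferred to the next period, even at low average utilization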
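
The multi-threaded arithmetic above can be checked with a small model. This is a minimal sketch, assuming all threads start together at the beginning of a period and each gets its own core; `simulateBurst` and `throttleResult` are hypothetical names, not part of any Kubernetes or kernel API.

```go
package main

import "fmt"

// throttleResult describes how one burst plays out inside a single CFS period.
type throttleResult struct {
	RunWallMs     float64 // wall-clock time the threads run before the quota runs out
	StalledWallMs float64 // wall-clock time spent throttled until the period ends
	DeferredCPUMs float64 // CPU time pushed into the next period
}

// simulateBurst models `threads` threads that each need `burstMs` of CPU time.
// Assumption: the node has enough idle cores for every thread to run in parallel.
func simulateBurst(quotaMs, periodMs float64, threads int, burstMs float64) throttleResult {
	needed := float64(threads) * burstMs
	if needed <= quotaMs {
		// The whole burst fits in the quota: no throttling.
		return throttleResult{RunWallMs: burstMs}
	}
	// Quota is consumed `threads` times faster than wall-clock time.
	run := quotaMs / float64(threads)
	return throttleResult{
		RunWallMs:     run,
		StalledWallMs: periodMs - run,
		DeferredCPUMs: needed - quotaMs,
	}
}

func main() {
	r := simulateBurst(50, 100, 8, 12) // 500m limit, 100ms period, 8 threads x 12ms
	fmt.Printf("ran %.2f ms, stalled %.2f ms, deferred %.0f ms of CPU\n",
		r.RunWallMs, r.StalledWallMs, r.DeferredCPUMs)
}
```

The model shows why a burst that needs only 12 ms of wall-clock time unthrottled can take more than a full period to finish once the quota is hit.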

Measuring throttle ratio

# CPU throttle ratio per container: throttled periods / elapsed periods (> 25% is significant)
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_throttled_periods_total[5m])
)
  /
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_periods_total[5m])
)
> 0.25

# Top throttled containers cluster-wide
sort_desc(
  sum by (namespace, container) (
    rate(container_cpu_cfs_throttled_periods_total[5m])
  )
  /
  sum by (namespace, container) (
    rate(container_cpu_cfs_periods_total[5m])
  )
)
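
The same ratio can be read directly inside a container, which is handy when Prometheus is not available. The sketch below assumes cgroup v2, where `cpu.stat` exposes `nr_periods` and `nr_throttled`; `throttleRatio` is a hypothetical helper, and the sample text in `main` is made up for illustration. Note that these counters are cumulative since the cgroup was created, so this gives a lifetime ratio rather than a 5-minute rate.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// throttleRatio parses the text of a cgroup v2 cpu.stat file and returns
// nr_throttled / nr_periods, i.e. the fraction of CFS periods in which the
// cgroup ran out of quota. It returns 0 when no periods have been counted.
func throttleRatio(cpuStat string) (float64, error) {
	fields := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(cpuStat))
	for sc.Scan() {
		parts := strings.Fields(sc.Text())
		if len(parts) != 2 {
			continue // skip blank or unexpected lines
		}
		v, err := strconv.ParseFloat(parts[1], 64)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", sc.Text(), err)
		}
		fields[parts[0]] = v
	}
	periods := fields["nr_periods"]
	if periods == 0 {
		return 0, nil // no CPU limit set, or no period elapsed yet
	}
	return fields["nr_throttled"] / periods, nil
}

func main() {
	// Illustrative content; on a real node read /sys/fs/cgroup/cpu.stat instead.
	sample := "usage_usec 1000000\nnr_periods 400\nnr_throttled 100\nthrottled_usec 2500000\n"
	ratio, err := throttleRatio(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("throttle ratio: %.2f\n", ratio)
}
```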

Resolving CPU throttling

Root cause	Resolution
Limit too low for burst workloads	Raise CPU limit; set limit = 2–3× request for bursty services
Multi-threaded app spawns many goroutines/threads	Set `GOMAXPROCS` / thread count based on the container's CPU limit, not node CPU count
GC runs cause CPU spikes	Tune GC (see JVM section); leave CPU limit headroom for concurrent GC threads
Request too low (VPA recommends higher)	Raise request to VPA target; limit follows at 2× ratio
CPU-intensive startup (e.g. JVM class loading)	Startup probe + higher CPU limit; consider InitContainer for warmup

⚠️

automaxprocs for Go services

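Before Go 1.25, the runtime defaults GOMAXPROCS to the number of logical CPUs on the node (e.g. 96) and ignores the container's CPU quota. Inside a container with a 500m CPU limit, up to 96 threads can then run Go code in parallel and compete for 0.5 CPU, which exhausts the CFS quota early in each period and causes heavy context switching and throttling. Add go.uber.org/automaxprocs to set GOMAXPROCS from the container's CPU quota at startup. Go 1.25 and later take the cgroup CPU limit into account by default.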

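// main.go: add one blank import, no configuration needed
import (
    "runtime"

    _ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the CPU quota at startup
)

// Alternative without the library: cap GOMAXPROCS by hand.
// Use one approach or the other, not both.
// cpuLimit is an assumed value: set it to the container's CPU limit,
// rounded down, with a minimum of 1 (so 500m becomes 1).
const cpuLimit = 2

func init() {
    if runtime.NumCPU() > cpuLimit {
        runtime.GOMAXPROCS(cpuLimit) // 2 threads for a container with a 2-CPU limit
    }
}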

Linux Kernel Tuning

Kernel parameters affect networking, memory management, and file descriptor limits. In Kubernetes, most of them are node-level settings rather than container-level ones; apply them with a DaemonSet that runs a privileged init container, or through node configuration.

🚨

Namespace-scoped vs node-scoped sysctls

Some sysctls are namespace-scoped (they can be set per pod via securityContext.sysctls) and some are node-scoped (they require node-level access). Apply node-scoped tuning through a dedicated DaemonSet or EKS managed node group launch template userData; never grant privileged mode to application containers in production for this purpose.

Network kernel parameters

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-tuner
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-tuner
  template:
    metadata:
      labels:
        app: node-tuner
    spec:
      hostPID: true
      hostNetwork: true
      tolerations:
        - operator: Exists   # run on all nodes including system
      initContainers:
        - name: sysctl-tuner
          image: busybox:1.36
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              # ── TCP Connection Tuning ──────────────────────────────
              # Increase TCP backlog for high-connection-rate services
              sysctl -w net.core.somaxconn=65535
              sysctl -w net.ipv4.tcp_max_syn_backlog=65535

              # TIME_WAIT socket reuse (prevents port exhaustion)
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w net.ipv4.ip_local_port_range="1024 65535"

              # Keep-alive tuning (detect dead connections faster)
              sysctl -w net.ipv4.tcp_keepalive_time=60
              sysctl -w net.ipv4.tcp_keepalive_intvl=10
              sysctl -w net.ipv4.tcp_keepalive_probes=6

              # ── Socket Buffer Sizes ────────────────────────────────
              # Larger buffers for high-throughput services
              sysctl -w net.core.rmem_max=134217728
              sysctl -w net.core.wmem_max=134217728
              sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
              sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"

              # ── Connection Tracking ────────────────────────────────
              # Increase nf_conntrack table (prevents conntrack drops)
              sysctl -w net.netfilter.nf_conntrack_max=1048576
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

              # ── File Descriptors ───────────────────────────────────
              sysctl -w fs.file-max=2097152
              sysctl -w fs.inotify.max_user_watches=1048576
              sysctl -w fs.inotify.max_user_instances=8192

              # ── VM / Memory ────────────────────────────────────────
              # Reduce swappiness (swap kills latency)
              sysctl -w vm.swappiness=1
              # Dirty page writeback (reduce I/O burst latency)
              sysctl -w vm.dirty_ratio=10
              sysctl -w vm.dirty_background_ratio=5
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 1m
              memory: 4Mi

Namespace-scoped sysctls (per-pod)

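These namespaced sysctls can be set on individual pods for latency-sensitive services without a privileged container. Only net.ipv4.ip_local_port_range is on the default safe list; net.ipv4.tcp_tw_reuse and net.core.somaxconn are namespaced but unsafe, so the kubelet must allow them with --allowed-unsafe-sysctls and a Kyverno exception may be needed:

spec:
  securityContext:
    sysctls:
      # TCP connection reuse for high-connection-rate pods (unsafe: must be allowed on the kubelet)
      - name: net.ipv4.tcp_tw_reuse
        value: "1"
      # Local port range for outbound connections (safe)
      - name: net.ipv4.ip_local_port_range
        value: "1024 65535"
      # Socket backlog per pod (unsafe: must be allowed on the kubelet)
      - name: net.core.somaxconn
        value: "65535"

💡

Allowed safe sysctls

Kubernetes allows a defined list of safe sysctls by default. As of K8s 1.27 it includes kernel.shm_rmid_forced, net.ipv4.ip_local_port_range, net.ipv4.tcp_syncookies, net.ipv4.ping_group_range, net.ipv4.ip_unprivileged_port_start, and net.ipv4.ip_local_reserved_ports; later releases add more. net.ipv4.tcp_tw_reuse and net.core.somaxconn are not on that list. Enable unsafe sysctls only if explicitly required, through the kubelet's --allowed-unsafe-sysctls flag, and only if Kyverno policy permits (see 08-05).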

Huge pages for memory-intensive workloads

# Node: pre-allocate 2MiB huge pages via launch template
# echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Pod requesting huge pages (databases, Kafka, Redis)
spec:
  containers:
    - name: redis
      resources:
        requests:
          hugepages-2Mi: 256Mi
          memory: 256Mi
        limits:
          hugepages-2Mi: 256Mi
          memory: 256Mi
      volumeMounts:
        - name: hugepage
          mountPath: /dev/hugepages
  volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages-2Mi

JVM / GC Tuning

JVM-based services (Java, Kotlin, Scala, Clojure) have unique performance challenges in containers: heap sizing, GC algorithm selection, and class loading all affect latency.

Container-aware JVM flags

FROM eclipse-temurin:21-jre

ENV JAVA_OPTS="\
  -XX:+UseContainerSupport \
  -XX:MaxRAMPercentage=75.0 \
  -XX:InitialRAMPercentage=50.0 \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:G1HeapRegionSize=16m \
  -XX:+UseStringDeduplication \
  -XX:+AlwaysPreTouch \
  -XX:+DisableExplicitGC \
  -Xss512k \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/tmp/heapdump.hprof \
  -Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags:filecount=5,filesize=20m \
  -Djava.security.egd=file:/dev/./urandom"

COPY target/app.jar /app.jar
# exec replaces the shell, so the JVM runs as PID 1 and receives SIGTERM on pod shutdown
ENTRYPOINT ["sh", "-c", "exec java $JAVA_OPTS -jar /app.jar"]

GC algorithm selection guide

GC Algorithm	JDK	Pause Goal	Heap Size	Best For
G1GC	8u40+	≤ 200ms (configurable)	4 GiB – 32 GiB	Balanced throughput + latency; default for most services
ZGC	15+ (prod), 11 (exp)	< 1ms	8 GiB+	Low-latency APIs, large heaps; slight throughput cost
Shenandoah	12+ (RedHat/Azul)	< 10ms	Any	Low-latency with smaller heaps than ZGC
ParallelGC	All	No pause target	Any	Batch / throughput-first (not for latency-sensitive)
SerialGC	All	N/A	< 512 MiB	Very small heaps, minimal CPU (sidecars, utilities)

ZGC configuration for ultra-low latency

JAVA_OPTS="\
  -XX:+UseContainerSupport \
  -XX:MaxRAMPercentage=70.0 \
  -XX:+UseZGC \
  -XX:+ZGenerational \
  -XX:SoftMaxHeapSize=6g \
  -XX:ZCollectionInterval=5 \
  -XX:ZUncommitDelay=300 \
  -Xlog:gc*:file=/tmp/gc.log:time,uptime:filecount=3,filesize=10m"

# ZGC requires more memory overhead than G1 — allow 70% not 75%
# -XX:+ZGenerational: generational ZGC (JDK 21, significant improvement)

GC metrics via JMX / micrometer

# JVM GC pause duration (micrometer instrumented apps)
histogram_quantile(0.99,
  rate(jvm_gc_pause_seconds_bucket{application="payment-service"}[5m])
)

# GC overhead (fraction of time spent in GC)
sum by (application, gc) (
  rate(jvm_gc_pause_seconds_sum{application="payment-service"}[5m])
)

# Heap usage vs max
jvm_memory_used_bytes{area="heap",application="payment-service"}
  /
jvm_memory_max_bytes{area="heap",application="payment-service"}

# GC frequency (collections per second)
rate(jvm_gc_pause_seconds_count{application="payment-service"}[5m])

JVM container resource sizing rules

Memory request = MaxHeap + NonHeap (Metaspace + CodeCache + DirectBuffers) + OS overhead (about 20% of heap)
Non-heap estimate: typical Spring Boot app = 256–512 MiB non-heap
Memory limit = memory request × 1.1 (small headroom; OOMKill is preferable to excessive provisioning)
Example: 4 GiB heap → 4096 + 512 + 820 (20% of heap) = 5428 MiB ≈ 5.3 GiB request; × 1.1 ≈ 5.8 GiB limit
Use -XX:MaxRAMPercentage=75.0 so the JVM auto-sizes heap from the container limit
CPU request: 1 CPU per 2 GiB heap is a reasonable starting point for G1GC
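
The sizing rules above reduce to a few lines of arithmetic. The sketch below is one reading of them, assuming the 20% overhead is taken from the heap size (which matches the 820 MiB term in the worked example); `jvmMemorySizing` is a hypothetical helper name.

```go
package main

import "fmt"

// jvmMemorySizing returns a memory request and limit in MiB for a JVM container.
// Assumptions: overhead = 20% of heap, limit = request x 1.1.
func jvmMemorySizing(heapMiB, nonHeapMiB float64) (requestMiB, limitMiB float64) {
	overhead := 0.20 * heapMiB
	requestMiB = heapMiB + nonHeapMiB + overhead
	limitMiB = requestMiB * 1.1
	return requestMiB, limitMiB
}

func main() {
	// 4 GiB heap with a 512 MiB non-heap estimate.
	req, lim := jvmMemorySizing(4096, 512)
	fmt.Printf("request: %.0f MiB, limit: %.0f MiB\n", req, lim)
}
```

Because -XX:MaxRAMPercentage is applied to the container limit, it is worth checking that the percentage of the computed limit still lands near the intended heap size.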

Go Runtime Tuning

GOMAXPROCS and CPU quota

// go.mod
require go.uber.org/automaxprocs v1.5.3

// main.go
import _ "go.uber.org/automaxprocs"
// Sets GOMAXPROCS = floor(cpu_quota / cpu_period), with a minimum of 1
// For 500m CPU limit: GOMAXPROCS = 1 (not 96)
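
To make the quota-to-GOMAXPROCS mapping concrete, here is a standalone sketch of the calculation. It assumes cgroup v2, where `cpu.max` holds `<quota> <period>` in microseconds or `max <period>` when no limit is set, and it assumes the library's default of rounding down with a floor of 1. `gomaxprocsFromCPUMax` is a hypothetical function, not the automaxprocs API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gomaxprocsFromCPUMax derives a GOMAXPROCS value from cgroup v2 cpu.max content.
// It returns 0 when no quota is configured, meaning "leave the default alone".
func gomaxprocsFromCPUMax(cpuMax string) (int, error) {
	parts := strings.Fields(cpuMax)
	if len(parts) != 2 {
		return 0, fmt.Errorf("unexpected cpu.max content %q", cpuMax)
	}
	if parts[0] == "max" {
		return 0, nil // unlimited: no CPU limit on this container
	}
	quota, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, err
	}
	period, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, err
	}
	if period <= 0 {
		return 0, fmt.Errorf("invalid period %d", period)
	}
	n := quota / period // integer division rounds down
	if n < 1 {
		n = 1 // never go below one thread
	}
	return n, nil
}

func main() {
	for _, in := range []string{"50000 100000", "250000 100000", "max 100000"} {
		n, err := gomaxprocsFromCPUMax(in)
		fmt.Printf("%-16q -> %d (err=%v)\n", in, n, err)
	}
}
```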

GOGC and memory pressure

# Pod env — GOGC controls GC frequency (default 100 = trigger at 2× live heap)
env:
  - name: GOGC
    value: "100"      # default; lower = more frequent GC, less memory; higher = less GC, more memory
  - name: GOMEMLIMIT   # Go 1.19+ — soft memory limit; triggers GC before OOMKill
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
        divisor: "1"   # bytes; sets GOMEMLIMIT = container memory limit (no headroom for non-Go memory)
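
The interaction between GOGC and GOMEMLIMIT can be approximated with a simplified model. This sketch assumes the heap goal is live heap × (1 + GOGC/100) and that the memory limit simply caps that goal; the real runtime also accounts for GC roots and applies the limit to total runtime memory, so treat `heapGoal` (a hypothetical helper) as an estimate only.

```go
package main

import "fmt"

// heapGoal estimates the heap size at which the next GC cycle completes.
// liveBytes is the heap that survived the last cycle; memLimit <= 0 means no limit.
func heapGoal(liveBytes, gogc, memLimit int64) int64 {
	goal := liveBytes + liveBytes*gogc/100
	if memLimit > 0 && goal > memLimit {
		goal = memLimit // the soft limit pulls the goal down, so GC runs earlier
	}
	return goal
}

func main() {
	const MiB = 1 << 20
	fmt.Println(heapGoal(200*MiB, 100, 0)/MiB, "MiB with GOGC=100, no limit")
	fmt.Println(heapGoal(200*MiB, 50, 0)/MiB, "MiB with GOGC=50, no limit")
	fmt.Println(heapGoal(200*MiB, 100, 300*MiB)/MiB, "MiB with GOGC=100, 300 MiB limit")
}
```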

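// Programmatic GOMEMLIMIT (preferred: derives the value from the container limit)
import "runtime/debug"

func init() {
    // Set GOMEMLIMIT to 90% of the container memory limit.
    // The soft limit makes the GC work harder as usage approaches it,
    // which reduces (but does not eliminate) the risk of an OOMKill.
    // containerMemoryLimit is an application-provided helper (not shown here)
    // that returns the cgroup memory limit in bytes, or 0 if none is set.
    if limit := containerMemoryLimit(); limit > 0 {
        debug.SetMemoryLimit(int64(float64(limit) * 0.90))
    }
}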
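
The snippet above relies on a `containerMemoryLimit` helper that the text does not define. One possible implementation is sketched below, assuming cgroup v2 with the container's cgroup mounted at /sys/fs/cgroup; the file path and the `parseMemoryMax` helper are assumptions for illustration, and cgroup v1 nodes would need a different path.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// parseMemoryMax parses the content of a cgroup v2 memory.max file.
// It returns 0 when the value is "max" (no limit configured).
func parseMemoryMax(raw string) (int64, error) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, nil
	}
	return strconv.ParseInt(s, 10, 64)
}

// containerMemoryLimit returns the container memory limit in bytes,
// or 0 if it cannot be determined.
func containerMemoryLimit() int64 {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max") // assumed cgroup v2 path
	if err != nil {
		return 0
	}
	limit, err := parseMemoryMax(string(raw))
	if err != nil {
		return 0
	}
	return limit
}

func main() {
	if limit := containerMemoryLimit(); limit > 0 {
		soft := limit / 10 * 9 // 90% of the limit, in integer arithmetic
		debug.SetMemoryLimit(soft)
		fmt.Printf("GOMEMLIMIT set to %d bytes (limit %d)\n", soft, limit)
	} else {
		fmt.Println("no cgroup v2 memory limit found; GOMEMLIMIT left unchanged")
	}
}
```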

Go pprof — identify hot paths

# Collect 30s CPU profile from running pod
kubectl exec -n payments deploy/payment-service -- \
  curl -s "http://localhost:6060/debug/pprof/profile?seconds=30" > cpu.prof

# Analyze: show top 10 functions by CPU time
go tool pprof -top cpu.prof

# Interactive web UI (flame graph)
go tool pprof -http=:8080 cpu.prof

# Heap allocation profile
kubectl exec -n payments deploy/payment-service -- \
  curl -s "http://localhost:6060/debug/pprof/heap" > heap.prof
go tool pprof -alloc_objects heap.prof

Network Performance

Connection pool tuning

In many microservices, a large share of request latency is not compute but time spent waiting for a connection. Properly tuned connection pools can reduce p99 latency substantially.

Parameter	Default	Recommended (high-traffic)	Effect
HTTP idle timeout	90s (Go)	30–60s	Recycle connections before load balancer kills them
HTTP keep-alive	enabled	enabled + DisableKeepAlives: false	Reuse TCP connections across requests
Max idle connections	100 (Go http.Transport)	500–2000	Enough connections for burst RPS
Max idle per host	2 (Go http.Transport)	100–500	Prevents connection queue buildup on single upstream
Dial timeout	30s	5s	Fast-fail on unreachable services
Response header timeout	none	30s	Prevent goroutine leak from slow upstreams

// Tuned HTTP client for high-traffic Go services
import (
    "net"
    "net/http"
    "time"
)

var httpClient = &http.Client{
    Timeout: 30 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:          1000,
        MaxIdleConnsPerHost:   200,
        MaxConnsPerHost:       500,
        IdleConnTimeout:       60 * time.Second,
        TLSHandshakeTimeout:   5 * time.Second,
        ResponseHeaderTimeout: 20 * time.Second,
        DialContext: (&net.Dialer{
            Timeout:   5 * time.Second,
            KeepAlive: 30 * time.Second,
        }).DialContext,
        ForceAttemptHTTP2:     true,
        DisableCompression:    false,
    },
}

Service mesh connection reuse (Istio/Envoy)

# Istio DestinationRule — connection pool per upstream service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-dr
  namespace: payments
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
        connectTimeout: 5s
        tcpKeepalive:
          time: 7200s
          interval: 75s
      http:
        http1MaxPendingRequests: 1000
        http2MaxRequests: 1000
        maxRequestsPerConnection: 0    # 0 = unlimited reuse (H2 multiplex)
        maxRetries: 3
        idleTimeout: 90s
        h2UpgradePolicy: UPGRADE       # prefer HTTP/2
    outlierDetection:
      consecutiveGatewayErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Network bandwidth testing

# Test pod-to-pod bandwidth (iperf3)
# Start server in one pod (--command overrides the image entrypoint,
# so the full command line is given explicitly)
kubectl run iperf-server --image=networkstatic/iperf3 -n default \
  --command -- iperf3 -s

# Run client from another pod (different node)
kubectl run iperf-client --image=networkstatic/iperf3 -n default \
  --restart=Never -it --rm \
  --command -- iperf3 -c $(kubectl get pod iperf-server -n default -o jsonpath='{.status.podIP}') \
     -t 30 -P 8   # 8 parallel streams

# Rough expectation: ~10 Gbps within same AZ on m7i; ~5 Gbps cross-AZ (varies with instance size)

# Test cross-AZ latency
kubectl run netshoot --image=nicolaka/netshoot --restart=Never -it --rm -- \
  ping -c 100 10.0.2.45  # pod IP in different AZ

DNS Performance

DNS is a hidden latency source in Kubernetes. Every hostname lookup that doesn't hit the local cache goes to CoreDNS, which can become a bottleneck under high RPS.

DNS lookup flow and optimization

Kubernetes DNS Resolution Path

  Pod queries "payment-service"
    → fewer than 5 dots (ndots:5), so the search domains are tried in order:
      1. payment-service.payments.svc.cluster.local  ← HIT (found)
      2. payment-service.svc.cluster.local           (skipped after hit)
      ...

  With ndots:5, every external query like "api.stripe.com" tries:
      1. api.stripe.com.payments.svc.cluster.local   (NXDOMAIN)
      2. api.stripe.com.svc.cluster.local            (NXDOMAIN)
      3. api.stripe.com.cluster.local                (NXDOMAIN)
      4. api.stripe.com                              (HIT — external)
  = 4 queries instead of 1 → 4× DNS load for external calls
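
The expansion shown above follows a simple rule that can be written down directly. This is a sketch of glibc-style resolv.conf behavior (names with fewer dots than ndots walk the search list first; a trailing dot bypasses it); `candidates` is a hypothetical function, and resolvers in other libc implementations may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates returns the fully qualified names a stub resolver would try,
// in order, for `name`. The resolver stops at the first name that resolves.
func candidates(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // already absolute: the search list is skipped
	}
	var out []string
	absoluteFirst := strings.Count(name, ".") >= ndots
	if absoluteFirst {
		out = append(out, name+".")
	}
	for _, domain := range search {
		out = append(out, name+"."+domain+".")
	}
	if !absoluteFirst {
		out = append(out, name+".")
	}
	return out
}

func main() {
	search := []string{"payments.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, q := range candidates("api.stripe.com", search, 5) {
		fmt.Println(q)
	}
}
```

Calling `candidates("api.stripe.com", search, 2)` puts the absolute name first, which is why lowering ndots or adding a trailing dot removes the wasted NXDOMAIN lookups.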

Pod DNS configuration tuning

spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
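      # Names with at least this many dots are tried as-is before the search list.
      # ndots:3 helps names with 3+ dots; a two-dot name such as api.stripe.com
      # still walks the search list unless ndots is 2 or the name ends with a dot
      - name: ndots
        value: "3"
      # glibc option: reopen the socket between the A and AAAA queries when a reply
      # goes missing (works around dropped parallel lookups). It is not a cache,
      # and not all images support it
      - name: single-request-reopen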
      # Timeout for DNS queries
      - name: timeout
        value: "3"
      # Number of retries before failing
      - name: attempts
        value: "3"

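Alternatively, use a fully qualified name with a trailing dot to bypass search domain expansion entirely. Without the trailing dot, payment-service.payments.svc.cluster.local has only four dots, so with ndots:5 the resolver still tries the search list first:

# Instead of: http://payment-service (triggers ndots suffix search)
# Use FQDN:   http://payment-service.payments.svc.cluster.local.   (note the trailing dot)
# In the same namespace, http://payment-service resolves on the first search attempt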

CoreDNS autoscaling and tuning

# CoreDNS HPA (scales on CPU utilization; for node-proportional scaling use cluster-proportional-autoscaler instead)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

# CoreDNS ConfigMap tuning (kube-system)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
           prefer_udp
        }
        cache 300 {           # cache TTL up to 300s (the stock Kubernetes Corefile uses 30s)
           success 9984        # max successful cache entries
           denial 9984         # max NXDOMAIN cache entries
           prefetch 10 1m 10%  # prefetch before TTL expires (reduces latency spikes)
        }
        loop
        reload
        loadbalance
    }

Node-local DNS cache (NodeLocal DNSCache)

# NodeLocal DNSCache reduces CoreDNS load by caching at node level
# Install: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

# Verify it's running as DaemonSet
kubectl get daemonset -n kube-system node-local-dns

# After install, pods on that node use link-local DNS (169.254.20.10)
# instead of ClusterIP, bypassing iptables conntrack for DNS queries
# → reduces p99 DNS latency from ~2ms to ~0.1ms

Pod Startup Latency

Pod startup latency (from pending to ready) directly impacts autoscaling responsiveness. The time breaks down into distinct phases:

Pod Startup Timeline

  Pending       Scheduled     PullingImage   Running        Ready
     │              │              │              │             │
     ├──── 0.1s ────┤              │              │             │
     │  Scheduler   │              │              │             │
     │  decision    ├──── 0–60s ───┤              │             │
     │              │  Image pull  │              │             │
     │              │  (first time)├──── 0.5s ────┤             │
     │              │              │  Container   │             │
     │              │              │  start       ├── probe ───►│
     │              │              │              │  period     │
  Total: 0.1s + image_pull + 0.5s + startup_probe_period

Reducing image pull time

# 1. Pin images by digest (also prevents unexpected pulls)
image: my-registry/payment-service@sha256:abc123...

# 2. Use imagePullPolicy: IfNotPresent (default for non-latest)
imagePullPolicy: IfNotPresent

# 3. Pre-pull images on nodes (DaemonSet or node provisioning)
# Karpenter EC2NodeClass userData pre-pull script:
# crictl pull my-registry/payment-service:v1.2.3 || true   (containerd nodes; docker pull only fills the Docker image store)

# 4. Use a registry in the same region/AZ as the cluster
# ECR with VPC endpoint → no egress; pull from S3 locally

# 5. Minimize image size (distroless / scratch)
# Before: 800 MiB (alpine + full JDK + tools)
# After:  85 MiB (distroless/java21 + app jar only)

# 6. Order image layers by change frequency (frequently changing layers last)
COPY --chown=nonroot:nonroot libs/ /app/libs/     # rarely changes
COPY --chown=nonroot:nonroot config/ /app/config/  # sometimes changes
COPY --chown=nonroot:nonroot app.jar /app/         # always changes

Startup, readiness, and liveness probe tuning

spec:
  containers:
    - name: payment-service
      startupProbe:
        httpGet:
          path: /actuator/health/liveness
          port: 8080
        initialDelaySeconds: 10    # wait for JVM basic startup
        periodSeconds: 5
        failureThreshold: 24       # allow up to 10+24×5 = 130s for full startup
        successThreshold: 1
        timeoutSeconds: 3

      readinessProbe:
        httpGet:
          path: /actuator/health/readiness
          port: 8080
        initialDelaySeconds: 0     # startupProbe guards this; start checking immediately
        periodSeconds: 5
        failureThreshold: 3
        successThreshold: 1
        timeoutSeconds: 3

      livenessProbe:
        httpGet:
          path: /actuator/health/liveness
          port: 8080
        initialDelaySeconds: 0     # startupProbe guards this
        periodSeconds: 15          # don't check too often; liveness kill is expensive
        failureThreshold: 3
        successThreshold: 1
        timeoutSeconds: 5          # higher timeout; liveness failure restarts the pod

⚠️

Liveness probe timeout during GC pauses

A common misconfiguration: liveness probe timeoutSeconds: 1 + failureThreshold: 1 on a JVM service. A single GC pause or CPU-throttling stall longer than 1 s then fails one probe, which is enough to trigger a false liveness failure and an unnecessary pod restart. Set timeoutSeconds: 5 and failureThreshold: 3 so that a restart requires three consecutive failures (about 35 s of sustained unresponsiveness with periodSeconds: 15).
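
The probe timings above can be sanity-checked with a little arithmetic. The sketch below is an approximation that ignores kubelet jitter and scheduling delays; the `probe` type and the three helper functions are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// probe mirrors the timing fields of a Kubernetes probe, in seconds.
type probe struct {
	InitialDelay     int
	Period           int
	Timeout          int
	FailureThreshold int
}

// startupBudget is the longest startup the startupProbe tolerates.
func startupBudget(p probe) int {
	return p.InitialDelay + p.FailureThreshold*p.Period
}

// minHangToRestart is the shortest total hang that causes a restart:
// it must cover FailureThreshold consecutive probes, the last one timing out.
func minHangToRestart(p probe) int {
	return (p.FailureThreshold-1)*p.Period + p.Timeout
}

// maxHangToRestart is the worst-case delay before a restart when the hang
// begins right after a successful probe.
func maxHangToRestart(p probe) int {
	return p.FailureThreshold*p.Period + p.Timeout
}

func main() {
	startup := probe{InitialDelay: 10, Period: 5, Timeout: 3, FailureThreshold: 24}
	liveness := probe{Period: 15, Timeout: 5, FailureThreshold: 3}
	fragile := probe{Period: 10, Timeout: 1, FailureThreshold: 1}

	fmt.Println("startup budget (s):", startupBudget(startup))
	fmt.Println("tuned liveness, hang needed for restart (s):",
		minHangToRestart(liveness), "to", maxHangToRestart(liveness))
	fmt.Println("fragile liveness, hang needed for restart (s):", minHangToRestart(fragile))
}
```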

API Server Performance

A slow or overloaded API server affects every kubectl command, every controller reconcile loop, and every admission webhook call.

API server latency metrics

# API server request latency p99 by verb and resource
histogram_quantile(0.99,
  sum by (verb, resource, le) (
    rate(apiserver_request_duration_seconds_bucket{
      job="apiserver"
    }[5m])
  )
)

# Inflight requests (near max → throttling in progress)
apiserver_current_inflight_requests

# Request rate by verb
sum by (verb) (
  rate(apiserver_request_total{job="apiserver"}[5m])
)

# Error rate (5xx responses)
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m]))
  /
sum(rate(apiserver_request_total{job="apiserver"}[5m]))

# Watch event volume by resource kind (a high rate means heavy watch fan-out)
sum by (kind) (rate(apiserver_watch_events_sizes_count[5m]))

API server tuning parameters

# kube-apiserver flags (kubeadm ClusterConfiguration, v1beta3 extraArgs map;
# flag names are written without the leading "--")
apiServer:
  extraArgs:
    # Increase request inflight limits (defaults: 200 mutating, 400 read-only)
    max-mutating-requests-inflight: "800"
    max-requests-inflight: "1600"
    # Timeout for regular (non-watch) requests; 60s is also the default
    request-timeout: "60s"
    # Audit log: only record what you need (high volume = high CPU)
    audit-policy-file: /etc/kubernetes/audit-policy.yaml
    # API Priority and Fairness (APF); enabled by default
    enable-priority-and-fairness: "true"
    # Default watch cache size (100 is the default)
    default-watch-cache-size: "100"

API Priority and Fairness (APF)

APF (GA in K8s 1.29) refines the old max-inflight limits instead of discarding them: the sum of the two limits becomes the server's total concurrency, which APF divides among priority levels so that one noisy client cannot starve critical system operations:

# Check current APF flow schemas and priority levels
kubectl get flowschemas
kubectl get prioritylevelconfigurations

# Monitor queue depth and wait time per flow
kubectl get --raw /metrics | grep apiserver_flowcontrol

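# Example: increase shares for leader election if it is being throttled.
# Built-in priority levels are auto-managed: set the annotation
# apf.kubernetes.io/autoupdate-spec=false on the object, or the change is reverted.
kubectl patch prioritylevelconfiguration leader-election \
  --type=json \
  -p='[{"op":"replace","path":"/spec/limited/nominalConcurrencyShares","value":20}]'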

Reducing API server load

Use informers/watches, not polling: Controller code that calls List in a loop every second generates massive API load. Use client-go informers with SharedIndexInformer and watch events instead.
Server-side filtering: Use labelSelector and fieldSelector in List calls to let the API server filter, reducing data transfer and client CPU.
Avoid large objects: ConfigMaps and Secrets are capped at 1 MiB, and objects approaching that size degrade etcd and API server performance. Use S3/external stores for large data.
Scope RBAC: Broad ClusterRoleBindings with * resource wildcards generate large authorization cache evaluations.

etcd Performance

etcd is the single source of truth for all Kubernetes state. etcd latency directly increases API server latency. The most common causes of etcd degradation are disk I/O latency and network latency between etcd members.

etcd health and latency metrics

# Check etcd endpoint health
etcdctl endpoint health \
  --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key

# Check leader and round-trip latency
etcdctl endpoint status \
  --endpoints=https://etcd-0:2379 \
  --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key

# Output columns: ENDPOINT / ID / VERSION / DB SIZE / IS LEADER / IS LEARNER / RAFT TERM / RAFT INDEX / RAFT APPLIED INDEX / ERRORS

# etcd commit latency p99 (target: < 10ms for SSD)
histogram_quantile(0.99,
  rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])
)

# WAL fsync latency (target: < 5ms; > 10ms = disk problem)
histogram_quantile(0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)

# etcd DB size (> 6 GiB = compact urgently)
etcd_mvcc_db_total_size_in_bytes

# Network peer latency between etcd members
histogram_quantile(0.99,
  rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])
)

etcd compaction and defragmentation

# etcd stores all historical revisions until compacted
# Without compaction, DB grows unbounded and slows down

# Check current revision
REVISION=$(etcdctl endpoint status --write-out=json | \
  jq '.[0].Status.header.revision')
echo "Current revision: $REVISION"

# Compact to current revision (removes all old revisions)
etcdctl compact $REVISION

# Defragment after compaction (reclaims disk space)
# WARNING: etcd is briefly unavailable during defrag on that member;
# defragment one member at a time, leader last
etcdctl defrag

# Check DB size before and after
etcdctl endpoint status --write-out=table

# Automation: kube-apiserver compacts on a timer by default
# --etcd-compaction-interval=5m  (default; compact every 5 minutes)

etcd disk requirements

Metric	Target	Warning	Action
WAL fsync p99	< 5ms	> 10ms	Move to NVMe SSD; check I/O scheduler
Backend commit p99	< 10ms	> 25ms	Compact + defrag; check disk contention
DB size	< 4 GiB	> 6 GiB	Compact immediately; increase quota
DB size / quota	< 60%	> 80%	Compact or increase `--quota-backend-bytes`
Peer latency p99	< 5ms	> 20ms	Co-locate etcd members in same region; check network

🚨

Never run etcd on shared disks

etcd requires predictable fsync latency. Running etcd on a disk shared with other I/O-heavy workloads (logs, container images, Prometheus TSDB) causes latency spikes that appear as API server timeouts. Use dedicated NVMe SSDs for etcd. On AWS, use an io2 EBS volume with > 3000 provisioned IOPS, or an instance-store NVMe disk (i4i family) for highest performance.

Scheduler Optimization

The scheduler can become a bottleneck when there are many unschedulable pods or when complex affinity rules increase per-pod scheduling time.

# Scheduler queue depth (pods waiting to be scheduled)
scheduler_pending_pods{queue="active"}
scheduler_pending_pods{queue="backoff"}
scheduler_pending_pods{queue="unschedulable"}

# Scheduling latency p99 (time from pod creation to scheduling decision)
histogram_quantile(0.99,
  rate(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])
)

# Preemption attempts (expensive: pods are evicted to make room)
rate(scheduler_preemption_attempts_total[5m])

Scheduling performance tips

Avoid requiredDuringScheduling with broad selectors: Every pod evaluates affinity against all matching pods. Use preferredDuringScheduling where hard requirements aren't needed.
Use topologySpreadConstraints over complex anti-affinity: TSC is more efficient for zone spreading.
Increase scheduler parallelism via the parallelism field in KubeSchedulerConfiguration (default 16) for large clusters with high pod churn.
Use percentageOfNodesToScore: For clusters with 1000+ nodes, the scheduler evaluates only a percentage of nodes (default 50% down to 5% for very large clusters), dramatically improving scheduling speed.
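
To see how the adaptive default behaves at different cluster sizes, the calculation can be sketched as follows. This is a paraphrase of the upstream scheduler logic as commonly described (start at 50%, subtract one point per 125 nodes, never go below 5%, always score at least 100 nodes); the constants and the function signature here are assumptions and may differ between Kubernetes versions.

```go
package main

import "fmt"

const (
	minFeasibleNodesToFind           = 100 // always look at this many nodes at least
	minFeasibleNodesPercentageToFind = 5   // lower bound of the adaptive percentage
)

// numFeasibleNodesToFind returns how many feasible nodes the scheduler looks for
// before it stops filtering. A percentage of 0 selects the adaptive default.
func numFeasibleNodesToFind(numAllNodes, percentageOfNodesToScore int) int {
	if numAllNodes < minFeasibleNodesToFind {
		return numAllNodes // small cluster: consider every node
	}
	pct := percentageOfNodesToScore
	if pct <= 0 {
		pct = 50 - numAllNodes/125
		if pct < minFeasibleNodesPercentageToFind {
			pct = minFeasibleNodesPercentageToFind
		}
	}
	n := numAllNodes * pct / 100
	if n < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return n
}

func main() {
	for _, nodes := range []int{50, 1000, 5000, 10000} {
		fmt.Printf("%5d nodes -> score %d (adaptive)\n", nodes, numFeasibleNodesToFind(nodes, 0))
	}
	fmt.Printf("%5d nodes -> score %d (percentageOfNodesToScore: 10)\n",
		1000, numFeasibleNodesToFind(1000, 10))
}
```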

# KubeSchedulerConfiguration (v1 API, K8s 1.25+)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: NodeResourcesBalancedAllocation  # disable if using bin-packing
        enabled:
          - name: NodeResourcesFit
            weight: 5
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated    # bin-pack (vs LeastAllocated = spread)
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
leaderElection:
  leaderElect: true
percentageOfNodesToScore: 10    # for 1000+ node clusters

Profiling Workflow

When a performance issue is reported and USE method analysis doesn't pinpoint it, use continuous profiling with Pyroscope (covered in detail in Section 06-07). This workflow covers the investigation path from symptom to root cause:

Performance Investigation Workflow

  Alert: p99 latency > 500ms
       │
       ▼
  1. Check CPU throttle ratio
     container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 25%?
     ├── Yes → raise CPU limit or right-size request
     └── No  → continue

  2. Check memory pressure
     container_memory_working_set_bytes / limit > 80%?
     ├── Yes → check GC overhead, raise memory
     └── No  → continue

  3. Check GC pauses (JVM)
     jvm_gc_pause_seconds p99 > 200ms?
     ├── Yes → switch to ZGC, tune heap
     └── No  → continue

  4. Flame graph (Pyroscope / pprof)
     What function is consuming the most CPU?
     → json.Marshal hot? → use jsoniter or pre-allocated buffers
     → sql.Query hot?   → add index, reduce N+1 queries
     → sync.Mutex hot?  → partition lock, use sync/atomic

  5. Trace analysis (Tempo)
     Which span is slow? Is it this service or a downstream?
     → DB query 450ms → run EXPLAIN ANALYZE, add index
     → External API 300ms → add timeout + circuit breaker
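
Step 1 of the workflow can also be checked by hand from inside a container when Prometheus data is not available. The sketch below computes the throttle ratio from two snapshots of the cgroup cpu.stat file, whose nr_periods and nr_throttled counters are what the cAdvisor period metrics are derived from. The function names and the sample snapshot values are illustrative assumptions, not output from a real system.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(before: str, after: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled
    between two cpu.stat snapshots."""
    a, b = parse_cpu_stat(before), parse_cpu_stat(after)
    periods = b["nr_periods"] - a["nr_periods"]
    throttled = b["nr_throttled"] - a["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0


# Hypothetical snapshots taken one minute apart.
before = "nr_periods 1000\nnr_throttled 100\nthrottled_usec 5000000\n"
after = "nr_periods 1600\nnr_throttled 280\nthrottled_usec 9000000\n"

ratio = throttle_ratio(before, after)
print(f"throttle ratio: {ratio:.0%}, above 25% threshold: {ratio > 0.25}")
```

The ratio uses counter deltas rather than absolute values, for the same reason the PromQL form wraps both counters in rate().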

Performance Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: performance-tuning-alerts
  namespace: monitoring
spec:
  groups:
    - name: performance.tuning
      rules:

        - alert: ContainerCPUThrottlingHigh
          expr: |
            sum by (namespace, pod, container) (
              rate(container_cpu_cfs_throttled_periods_total[5m])
            )
            /
            sum by (namespace, pod, container) (
              rate(container_cpu_cfs_periods_total[5m])
            ) > 0.25
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} CPU throttled > 25%"
            description: "Throttle ratio is {{ $value | humanizePercentage }}. Raise CPU limit or lower GOMAXPROCS."
            runbook_url: https://runbooks.example.com/performance/cpu-throttling

        - alert: JVMGCPauseHigh
          expr: |
            histogram_quantile(0.99,
              rate(jvm_gc_pause_seconds_bucket[5m])
            ) > 0.5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "JVM GC p99 pause > 500ms in {{ $labels.namespace }}/{{ $labels.application }}"
            description: "Consider switching to ZGC or increasing heap."
            runbook_url: https://runbooks.example.com/performance/jvm-gc-pauses

        - alert: APIServerLatencyHigh
          expr: |
            histogram_quantile(0.99,
              sum by (verb, le) (
                rate(apiserver_request_duration_seconds_bucket{
                  verb!~"WATCH|CONNECT"
                }[5m])
              )
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "API server {{ $labels.verb }} p99 latency > 1s"
            runbook_url: https://runbooks.example.com/performance/apiserver-latency

        - alert: EtcdWALFsyncSlow
          expr: |
            histogram_quantile(0.99,
              rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
            ) > 0.025
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "etcd WAL fsync p99 > 25ms — disk I/O problem"
            runbook_url: https://runbooks.example.com/performance/etcd-slow-disk

        - alert: EtcdDBSizeLarge
          expr: etcd_mvcc_db_total_size_in_bytes > 6e9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "etcd DB size > 6 GB: compact and defrag immediately"
            runbook_url: https://runbooks.example.com/performance/etcd-db-size

        - alert: CoreDNSLatencyHigh
          expr: |
            histogram_quantile(0.99,
              rate(coredns_dns_request_duration_seconds_bucket[5m])
            ) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "CoreDNS p99 latency > 100ms"
            description: "Consider NodeLocal DNSCache or increasing CoreDNS replicas."
            runbook_url: https://runbooks.example.com/performance/coredns-latency

        - alert: SchedulerPendingPodsHigh
          expr: scheduler_pending_pods{queue="unschedulable"} > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $value }} pods unschedulable for > 5 min"
            runbook_url: https://runbooks.example.com/performance/scheduler-pending

Best Practices

Fix throttling before scaling

CPU throttling is not solved by adding pods. Raise the CPU limit or lower GOMAXPROCS. Adding pods just moves the throttle to more instances.

Use automaxprocs for Go

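Every Go service running in a container should import go.uber.org/automaxprocs unless it is built with Go 1.25 or later, where the runtime takes the cgroup CPU limit into account on its own. Without it, GOMAXPROCS defaults to the node's core count: on a 96-core node, a container limited to 0.5 CPU runs 96 scheduler threads, burns through its CFS quota early in each period, and spends the rest of the period throttled. That is a latency disaster.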
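
To make the arithmetic concrete, here is a minimal sketch of the calculation a library like automaxprocs performs, written in Python for illustration. It reads a cgroup v2 cpu.max value ("quota period", or "max period" when unlimited) and derives a GOMAXPROCS value. The function name is hypothetical, and rounding down with a minimum of 1 is an assumption about the library's default behavior.

```python
import math


def gomaxprocs_from_cpu_max(cpu_max: str, host_cpus: int) -> int:
    """Derive a GOMAXPROCS value from the contents of cgroup v2 cpu.max.

    Assumption: the CPU limit is rounded down, with a minimum of 1.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        # No CPU limit set: fall back to the host core count.
        return host_cpus
    limit = int(quota) / int(period)   # e.g. 50000 / 100000 = 0.5 CPU
    return max(1, math.floor(limit))


print(gomaxprocs_from_cpu_max("50000 100000", host_cpus=96))    # 0.5 CPU limit
print(gomaxprocs_from_cpu_max("250000 100000", host_cpus=96))   # 2.5 CPU limit
print(gomaxprocs_from_cpu_max("max 100000", host_cpus=96))      # no limit
```

For the 0.5 CPU container from the text, this yields a single scheduler thread instead of 96.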

ZGC for low-latency JVM

If p99 is critical and heap > 4 GiB, switch from G1GC to generational ZGC with -XX:+UseZGC -XX:+ZGenerational (JDK 21). ZGC pauses are typically sub-millisecond, which removes most GC-induced latency spikes.

NodeLocal DNSCache

Install NodeLocal DNSCache on every cluster. It reduces p99 DNS latency from 2–5ms to <0.5ms and cuts CoreDNS load by 60–80%.

etcd on dedicated NVMe

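Aim to keep etcd p99 WAL fsync latency below 5ms; etcd's own guidance treats 10ms as the upper bound for a healthy disk, and the EtcdWALFsyncSlow alert above fires at 25ms. Shared disks or gp2 EBS can cause intermittent API server timeouts. Use io2 with 3000+ IOPS or instance-store NVMe.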

Startup probe guards liveness

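Always use a startupProbe to cover slow JVM/Python startup. Liveness and readiness checks are held off until the startup probe succeeds. Without it, the liveness probe can fail during class loading and the kubelet restarts the container before it has finished starting, sometimes in a loop.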
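
The startup probe allows roughly failureThreshold × periodSeconds before the container is restarted, so the threshold should be sized from the worst-case startup time. The helper below is a minimal sketch of that sizing. The function names and the 1.5× headroom factor are illustrative assumptions, and initialDelaySeconds is ignored for simplicity.

```python
import math


def startup_window_s(failure_threshold: int, period_s: int) -> int:
    """Approximate time a startupProbe allows before the container is restarted."""
    return failure_threshold * period_s


def startup_failure_threshold(worst_case_startup_s: float,
                              period_s: int = 10,
                              headroom: float = 1.5) -> int:
    """failureThreshold that covers the worst-case startup time with headroom.

    The headroom factor is an assumption; tune it to observed variance.
    """
    return math.ceil(worst_case_startup_s * headroom / period_s)


# A JVM service that takes up to 200 s to start, probed every 10 s.
threshold = startup_failure_threshold(200, period_s=10)
print(threshold, startup_window_s(threshold, 10))
```

Measure the worst-case startup time on a cold node, including image pull and cache warm-up effects, rather than on a warm developer machine.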