Performance Tuning
Diagnosing and resolving latency, throughput, and saturation across workloads, the Linux kernel, JVM runtimes, and the Kubernetes control plane.
Performance Investigation Methodology
Before tuning, identify the bottleneck. Tuning the wrong layer wastes time and can mask real issues. Use the USE method (Utilization, Saturation, Errors) as a systematic top-down approach:
For each resource, check:
├── Utilization = how busy is it? (%)
├── Saturation = how much work is queued/waiting?
└── Errors = are there operation failures?
CPU:
  Util: rate(container_cpu_usage_seconds_total[5m]) / requests
  Sat:  container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total (throttle ratio)
  Err:  OOMKilled events (indirect CPU→memory link)
Memory:
  Util: container_memory_working_set_bytes / requests
  Sat:  container_memory_failures_total (page faults), kswapd activity
  Err:  OOMKilled pod restarts
Network (node):
  Util: node_network_transmit_bytes_total / NIC speed
  Sat:  node_network_transmit_drop_total (TX queue drops)
  Err:  node_network_transmit_errs_total
Disk (storage):
  Util: rate(node_disk_io_time_seconds_total[5m])
  Sat:  node_disk_io_time_weighted_seconds_total (weighted I/O time, approximates queue depth)
  Err:  node_disk_read_errors_total
Control plane:
  Util: apiserver_request_duration_seconds (p99)
  Sat:  apiserver_current_inflight_requests
  Err:  apiserver_request_total{code=~"5.."}
| Symptom | Likely Cause | Where to look |
|---|---|---|
| High p99 latency, normal p50 | CPU throttling, GC pauses, tail latency in dependency | throttle ratio metric, GC logs, distributed traces |
| High p50 AND p99 latency | Resource saturation, slow external dependency, network congestion | CPU/mem utilization, downstream service latency |
| Latency spikes every N minutes | GC stop-the-world, cron job interference, HPA scaling event | GC logs, HPA events, cron schedule |
| Intermittent pod restarts | OOMKill, liveness probe timeout during GC pause, CPU starvation | kubectl describe pod exit reason, throttle metric |
| Slow pod startup | Large image pull, slow readiness probe, DNS timeout | pod event timestamps, image pull duration |
| kubectl slow / timeouts | API server overloaded, etcd slow, network policy eval | apiserver latency metrics, etcd commit duration |
CPU Throttling
CPU throttling is one of the most common and most misunderstood performance problems in Kubernetes. A pod can be throttled even when node CPU utilization is low — throttling is governed by the CFS (Completely Fair Scheduler) quota enforcement, not node-level utilization.
How CFS throttling works
cpu.limit = 500m → CFS quota = 50ms per 100ms period

Time →   0ms        50ms       100ms      150ms      200ms
Pod A:   ██████████████████  (uses 50ms → THROTTLED for the remaining 50ms)
Pod B:   ████████            (uses 40ms → not throttled)

Even if the node has 80% idle CPU, Pod A is throttled because it hit its period quota.

Multi-threaded processes are especially vulnerable:
8 threads × 12ms burst = 96ms of CPU time needed, but quota = 50ms
→ 46ms of work is deferred to the next period, even at low average utilization
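The quota arithmetic above can be sketched in a few lines of Go. This is an illustrative model only: the function names are hypothetical, and it assumes every thread finds a free core and starts its burst at the beginning of the period, which a real scheduler does not guarantee.

```go
package main

import "fmt"

// cfsQuotaMicros converts a CPU limit in millicores into the CFS quota
// (microseconds of CPU time per period) for the given period length.
func cfsQuotaMicros(limitMillicores, periodMicros int64) int64 {
	return limitMillicores * periodMicros / 1000
}

// throttledWallMicros estimates how long a container sits throttled in one
// period when `threads` runnable threads each want `burstMicros` of CPU at
// the start of the period. Assumption: all threads run in parallel.
func throttledWallMicros(quotaMicros, periodMicros, threads, burstMicros int64) int64 {
	demand := threads * burstMicros
	if demand <= quotaMicros {
		return 0 // the burst fits inside the quota
	}
	// With `threads` cores burning quota at once, the quota runs out after
	// quota/threads of wall-clock time; the rest of the period is throttled.
	exhaustedAt := quotaMicros / threads
	return periodMicros - exhaustedAt
}

func main() {
	quota := cfsQuotaMicros(500, 100_000) // 500m limit, 100ms period
	fmt.Println("quota (µs per period):", quota)
	fmt.Println("throttled wall time (µs):", throttledWallMicros(quota, 100_000, 8, 12_000))
}
```

Under these assumptions the 8-thread case exhausts its quota a few milliseconds into the period, which is why a service can show heavy throttling while its average CPU usage looks low.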
Measuring throttle ratio
# CPU throttle ratio per container (> 25% is significant)
sum by (namespace, pod, container) (
rate(container_cpu_cfs_throttled_periods_total[5m])
)
/
sum by (namespace, pod, container) (
rate(container_cpu_cfs_periods_total[5m])
)
> 0.25
# Top throttled containers cluster-wide
sort_desc(
sum by (namespace, container) (
rate(container_cpu_cfs_throttled_periods_total[5m])
)
/
sum by (namespace, container) (
rate(container_cpu_cfs_periods_total[5m])
)
)
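When Prometheus is not available, the same ratio can be read straight from the container's cgroup. The sketch below parses the text of a cgroup v2 cpu.stat file; the function name is hypothetical, and it assumes the nr_periods and nr_throttled keys that cgroup v2 exposes when a CPU limit is set.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// throttleRatio parses the contents of a cgroup v2 cpu.stat file and returns
// nr_throttled / nr_periods, the fraction of CFS periods that hit the quota.
func throttleRatio(cpuStat string) (float64, error) {
	fields := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(cpuStat))
	for sc.Scan() {
		parts := strings.Fields(sc.Text())
		if len(parts) != 2 {
			continue // skip blank or malformed lines
		}
		v, err := strconv.ParseFloat(parts[1], 64)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", sc.Text(), err)
		}
		fields[parts[0]] = v
	}
	if fields["nr_periods"] == 0 {
		return 0, nil // no CPU limit set, or no periods elapsed yet
	}
	return fields["nr_throttled"] / fields["nr_periods"], nil
}

func main() {
	// Sample contents; in a pod this would come from /sys/fs/cgroup/cpu.stat.
	sample := "usage_usec 1200000\nnr_periods 400\nnr_throttled 120\nthrottled_usec 3000000\n"
	ratio, err := throttleRatio(sample)
	fmt.Println(ratio, err)
}
```

The counters are cumulative since the cgroup was created, so two samples taken a known interval apart give a rate comparable to the PromQL above.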
Resolving CPU throttling
| Root cause | Resolution |
|---|---|
| Limit too low for burst workloads | Raise CPU limit; set limit = 2–3× request for bursty services |
| Multi-threaded app spawns many goroutines/threads | Set GOMAXPROCS / thread count based on CPU request, not node CPU count |
| GC runs cause CPU spikes | Tune GC (see JVM section); increase CPU limit during GC window |
| Request too low (VPA recommends higher) | Raise request to VPA target; limit follows at 2× ratio |
| CPU-intensive startup (e.g. JVM class loading) | Startup probe + higher CPU limit; consider InitContainer for warmup |
Before Go 1.25, the runtime defaults GOMAXPROCS to the number of logical CPUs on the node (e.g. 96). Inside a container with a 500m CPU limit, up to 96 threads running goroutines then compete for 0.5 CPU, causing heavy context switching and throttling. Add go.uber.org/automaxprocs to set GOMAXPROCS from the container's CPU quota automatically; Go 1.25 and later take the cgroup CPU limit into account by default.
// main.go: add one blank import, no configuration needed
import (
	"runtime"

	_ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the CPU quota automatically
)

// Alternative to automaxprocs: cap GOMAXPROCS by hand, based on the
// container's CPU limit (for a 500m limit, 1 thread is reasonable).
func init() {
	if nodeCPUs := runtime.NumCPU(); nodeCPUs > 4 {
		runtime.GOMAXPROCS(4) // cap at 4 for a container with a 2 CPU limit
	}
}
Linux Kernel Tuning
Kernel parameters affect networking, memory management, and file descriptor limits. In Kubernetes, these are set at the node level (not container level) — use a DaemonSet with initContainers or node configuration to apply them.
Some sysctls are namespace-scoped (they can be set per pod via securityContext.sysctls) and some are node-scoped (they require node-level access). Apply node-scoped tuning through a dedicated DaemonSet or the EKS managed node group launch template userData, and never by granting privileged mode to application containers in production.
Network kernel parameters
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-tuner
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-tuner
  template:
    metadata:
      labels:
        app: node-tuner
    spec:
      hostPID: true
      hostNetwork: true
      tolerations:
        - operator: Exists # run on all nodes including system
      initContainers:
        - name: sysctl-tuner
          image: busybox:1.36
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              # ── TCP Connection Tuning ──────────────────────────────
              # Increase TCP backlog for high-connection-rate services
              sysctl -w net.core.somaxconn=65535
              sysctl -w net.ipv4.tcp_max_syn_backlog=65535
              # TIME_WAIT socket reuse (prevents port exhaustion)
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w net.ipv4.ip_local_port_range="1024 65535"
              # Keep-alive tuning (detect dead connections faster)
              sysctl -w net.ipv4.tcp_keepalive_time=60
              sysctl -w net.ipv4.tcp_keepalive_intvl=10
              sysctl -w net.ipv4.tcp_keepalive_probes=6
              # ── Socket Buffer Sizes ────────────────────────────────
              # Larger buffers for high-throughput services
              sysctl -w net.core.rmem_max=134217728
              sysctl -w net.core.wmem_max=134217728
              sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
              sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"
              # ── Connection Tracking ────────────────────────────────
              # Increase nf_conntrack table (prevents conntrack drops)
              sysctl -w net.netfilter.nf_conntrack_max=1048576
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
              # ── File Descriptors ───────────────────────────────────
              sysctl -w fs.file-max=2097152
              sysctl -w fs.inotify.max_user_watches=1048576
              sysctl -w fs.inotify.max_user_instances=8192
              # ── VM / Memory ────────────────────────────────────────
              # Reduce swappiness (swap kills latency)
              sysctl -w vm.swappiness=1
              # Dirty page writeback (reduce I/O burst latency)
              sysctl -w vm.dirty_ratio=10
              sysctl -w vm.dirty_background_ratio=5
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 1m
              memory: 4Mi
Namespace-scoped sysctls (per-pod)
These namespaced sysctls can be set on individual pods for latency-sensitive services without making the container privileged. Of the three below, only net.ipv4.ip_local_port_range is in Kubernetes' default safe set; net.ipv4.tcp_tw_reuse and net.core.somaxconn are namespaced but classed as unsafe, so the kubelet must allow them with --allowed-unsafe-sysctls and any Kyverno policy must permit them:
spec:
  securityContext:
    sysctls:
      # TCP connection reuse for high-connection-rate pods (unsafe: needs kubelet allow-list)
      - name: net.ipv4.tcp_tw_reuse
        value: "1"
      # Local port range for outbound connections (safe)
      - name: net.ipv4.ip_local_port_range
        value: "1024 65535"
      # Socket backlog per pod (unsafe: needs kubelet allow-list)
      - name: net.core.somaxconn
        value: "65535"
K8s 1.27+ allows a defined list of safe sysctls by default, including kernel.shm_rmid_forced, net.ipv4.ip_local_port_range, net.ipv4.tcp_syncookies, net.ipv4.ping_group_range, and net.ipv4.ip_unprivileged_port_start. Enable unsafe sysctls only if explicitly required and Kyverno policy permits (see 08-05).
Huge pages for memory-intensive workloads
# Node: pre-allocate 2MiB huge pages via launch template
# echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# Pod requesting huge pages (databases, Kafka, Redis)
spec:
  containers:
    - name: redis
      resources:
        requests:
          hugepages-2Mi: 256Mi
          memory: 256Mi
        limits:
          hugepages-2Mi: 256Mi
          memory: 256Mi
      volumeMounts:
        - name: hugepage
          mountPath: /dev/hugepages
  volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages-2Mi
JVM / GC Tuning
JVM-based services (Java, Kotlin, Scala, Clojure) have unique performance challenges in containers: heap sizing, GC algorithm selection, and class loading all affect latency.
Container-aware JVM flags
FROM eclipse-temurin:21-jre
ENV JAVA_OPTS="\
-XX:+UseContainerSupport \
-XX:MaxRAMPercentage=75.0 \
-XX:InitialRAMPercentage=50.0 \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-XX:G1HeapRegionSize=16m \
-XX:+UseStringDeduplication \
-XX:+AlwaysPreTouch \
-XX:+DisableExplicitGC \
-Xss512k \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/tmp/heapdump.hprof \
-Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags:filecount=5,filesize=20m \
-Djava.security.egd=file:/dev/./urandom"
COPY target/app.jar /app.jar
# exec replaces the shell so the JVM runs as PID 1 and receives SIGTERM directly
ENTRYPOINT ["sh", "-c", "exec java $JAVA_OPTS -jar /app.jar"]
GC algorithm selection guide
| GC Algorithm | JDK | Pause Goal | Heap Size | Best For |
|---|---|---|---|---|
| G1GC | 8u40+ | ≤ 200ms (configurable) | 4 GiB – 32 GiB | Balanced throughput + latency; default for most services |
| ZGC | 15+ (prod), 11 (exp) | < 1ms | 8 GiB+ | Low-latency APIs, large heaps; slight throughput cost |
| Shenandoah | 12+ (RedHat/Azul) | < 10ms | Any | Low-latency with smaller heaps than ZGC |
| ParallelGC | All | No pause target | Any | Batch / throughput-first (not for latency-sensitive) |
| SerialGC | All | N/A | < 512 MiB | Very small heaps, minimal CPU (sidecars, utilities) |
ZGC configuration for ultra-low latency
JAVA_OPTS="\
-XX:+UseContainerSupport \
-XX:MaxRAMPercentage=70.0 \
-XX:+UseZGC \
-XX:+ZGenerational \
-XX:SoftMaxHeapSize=6g \
-XX:ZCollectionInterval=5 \
-XX:ZUncommitDelay=300 \
-Xlog:gc*:file=/tmp/gc.log:time,uptime:filecount=3,filesize=10m"
# ZGC requires more memory overhead than G1 — allow 70% not 75%
# -XX:+ZGenerational: generational ZGC (JDK 21, significant improvement)
GC metrics via JMX / micrometer
# JVM GC pause duration (micrometer instrumented apps)
histogram_quantile(0.99,
rate(jvm_gc_pause_seconds_bucket{application="payment-service"}[5m])
)
# GC overhead (fraction of time spent in GC)
sum by (application, gc) (
rate(jvm_gc_pause_seconds_sum{application="payment-service"}[5m])
)
# Heap usage vs max
jvm_memory_used_bytes{area="heap",application="payment-service"}
/
jvm_memory_max_bytes{area="heap",application="payment-service"}
# GC frequency (collections per second)
rate(jvm_gc_pause_seconds_count{application="payment-service"}[5m])
JVM container resource sizing rules
- Memory request = MaxHeap + NonHeap (Metaspace + CodeCache + DirectBuffers) + 20% OS overhead
- Non-heap estimate: typical Spring Boot app = 256–512 MiB non-heap
- Memory limit = memory request × 1.1 (small headroom; OOMKill is preferable to excessive provisioning)
- Example: 4 GiB heap → 4096 + 512 + 820 (20% of heap) = 5428 MiB ≈ 5.3 GiB request; ≈ 5.8 GiB limit
- Use -XX:MaxRAMPercentage=75.0 so the JVM auto-sizes heap from the container limit
- CPU request: 1 CPU per 2 GiB heap is a reasonable starting point for G1GC
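The sizing rules can be checked with a small calculation. The sketch below is illustrative: the function name is hypothetical, it takes "20% OS overhead" to mean 20% of the heap (which matches the worked example), and it uses integer MiB arithmetic, so results can differ from the prose by a MiB.

```go
package main

import "fmt"

// jvmMemoryMiB applies the sizing rules: request = heap + non-heap + 20% of
// heap for OS overhead, limit = request × 1.1. All values are in MiB.
func jvmMemoryMiB(heapMiB, nonHeapMiB int) (requestMiB, limitMiB int) {
	overhead := heapMiB * 20 / 100
	requestMiB = heapMiB + nonHeapMiB + overhead
	limitMiB = requestMiB * 110 / 100
	return requestMiB, limitMiB
}

func main() {
	req, lim := jvmMemoryMiB(4096, 512) // 4 GiB heap, 512 MiB non-heap
	fmt.Printf("request=%d MiB (%.1f GiB) limit=%d MiB (%.1f GiB)\n",
		req, float64(req)/1024, lim, float64(lim)/1024)
}
```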
Go Runtime Tuning
GOMAXPROCS and CPU quota
// go.mod
require go.uber.org/automaxprocs v1.5.3
// main.go
import _ "go.uber.org/automaxprocs"
// Automatically sets GOMAXPROCS = max(1, floor(cpu_quota / cpu_period)) by default
// For 500m CPU limit: GOMAXPROCS = 1 (not 96)
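To make the quota-to-GOMAXPROCS mapping concrete, here is a minimal sketch that derives the value from the text of a cgroup v2 cpu.max file. It is not the library's code: the function name is hypothetical, and rounding down with a floor of 1 reflects my understanding of the library's default behaviour, so treat that as an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// procsFromCPUMax derives a GOMAXPROCS value from the contents of a cgroup v2
// cpu.max file ("<quota> <period>" or "max <period>"). It rounds the quota
// down and never returns less than 1. ok is false when no limit is set.
func procsFromCPUMax(cpuMax string) (procs int, ok bool) {
	parts := strings.Fields(cpuMax)
	if len(parts) != 2 || parts[0] == "max" {
		return 0, false
	}
	quota, err1 := strconv.ParseInt(parts[0], 10, 64)
	period, err2 := strconv.ParseInt(parts[1], 10, 64)
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return 0, false
	}
	procs = int(quota / period) // integer division rounds down
	if procs < 1 {
		procs = 1
	}
	return procs, true
}

func main() {
	for _, s := range []string{"50000 100000", "250000 100000", "max 100000"} {
		p, ok := procsFromCPUMax(s)
		fmt.Printf("%-16q procs=%d limited=%v\n", s, p, ok)
	}
}
```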
GOGC and memory pressure
# Pod env: GOGC controls GC frequency (default 100 = trigger at 2× live heap)
env:
  - name: GOGC
    value: "100" # default; lower = more frequent GC, less memory; higher = less GC, more memory
  - name: GOMEMLIMIT # Go 1.19+; soft memory limit that triggers GC before OOMKill
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
        divisor: "1" # bytes; sets GOMEMLIMIT = container memory limit
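// Programmatic GOMEMLIMIT (preferred: derives the value from the container limit)
import "runtime/debug"

func init() {
	// Set GOMEMLIMIT to 90% of the container memory limit.
	// The soft limit makes the GC work harder before the cgroup limit is hit,
	// which lowers (but does not eliminate) the risk of an OOMKill.
	// containerMemoryLimit is a helper the application must supply, for
	// example by reading the cgroup memory limit; it is not a standard library call.
	if limit := containerMemoryLimit(); limit > 0 {
		debug.SetMemoryLimit(int64(float64(limit) * 0.90))
	}
}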
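The snippet above calls containerMemoryLimit, which is not part of the standard library. One possible implementation is sketched below. It assumes cgroup v2 with the limit visible at /sys/fs/cgroup/memory.max inside the container; parseMemoryMax and the file path are assumptions for illustration, and cgroup v1 nodes would need a different path.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// parseMemoryMax parses the contents of a cgroup v2 memory.max file.
// It returns 0 when the file says "max" (no limit) or cannot be parsed.
func parseMemoryMax(contents string) int64 {
	s := strings.TrimSpace(contents)
	if s == "" || s == "max" {
		return 0
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0
	}
	return n
}

// containerMemoryLimit reads the limit from the cgroup v2 file that is
// assumed to be mounted inside the container. It returns 0 if unavailable.
func containerMemoryLimit() int64 {
	b, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return 0
	}
	return parseMemoryMax(string(b))
}

func main() {
	if limit := containerMemoryLimit(); limit > 0 {
		soft := limit * 9 / 10 // 90% of the container limit
		debug.SetMemoryLimit(soft)
		fmt.Println("GOMEMLIMIT set to", soft, "bytes")
	} else {
		fmt.Println("no cgroup memory limit found; GOMEMLIMIT unchanged")
	}
}
```

Returning 0 for every failure case keeps the caller's `limit > 0` guard as the single decision point, so a missing file simply leaves the runtime default in place.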
Go pprof — identify hot paths
# Collect 30s CPU profile from running pod
kubectl exec -n payments deploy/payment-service -- \
curl -s "http://localhost:6060/debug/pprof/profile?seconds=30" > cpu.prof
# Analyze: show top 10 functions by CPU time
go tool pprof -top cpu.prof
# Interactive web UI (flame graph)
go tool pprof -http=:8080 cpu.prof
# Heap allocation profile
kubectl exec -n payments deploy/payment-service -- \
curl -s "http://localhost:6060/debug/pprof/heap" > heap.prof
go tool pprof -alloc_objects heap.prof
Network Performance
Connection pool tuning
Most latency in microservices is not compute — it's time waiting for a connection. Properly tuned connection pools dramatically reduce p99 latency.
| Parameter | Default | Recommended (high-traffic) | Effect |
|---|---|---|---|
| HTTP idle timeout | 90s (Go) | 30–60s | Recycle connections before load balancer kills them |
| HTTP keep-alive | enabled | enabled + DisableKeepAlives: false | Reuse TCP connections across requests |
| Max idle connections | 100 (Go http.Transport) | 500–2000 | Enough connections for burst RPS |
| Max idle per host | 2 (Go http.Transport) | 100–500 | Prevents connection queue buildup on single upstream |
| Dial timeout | 30s | 5s | Fast-fail on unreachable services |
| Response header timeout | none | 30s | Prevent goroutine leak from slow upstreams |
// Tuned HTTP client for high-traffic Go services
import (
	"net"
	"net/http"
	"time"
)

var httpClient = &http.Client{
	Timeout: 30 * time.Second, // overall per-request deadline
	Transport: &http.Transport{
		MaxIdleConns:          1000,
		MaxIdleConnsPerHost:   200,
		MaxConnsPerHost:       500,
		IdleConnTimeout:       60 * time.Second,
		TLSHandshakeTimeout:   5 * time.Second,
		ResponseHeaderTimeout: 20 * time.Second,
		DialContext: (&net.Dialer{
			Timeout:   5 * time.Second,  // fast-fail on unreachable services
			KeepAlive: 30 * time.Second, // TCP keep-alive probe interval
		}).DialContext,
		ForceAttemptHTTP2:  true,
		DisableCompression: false,
	},
}
Service mesh connection reuse (Istio/Envoy)
# Istio DestinationRule: connection pool per upstream service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-dr
  namespace: payments
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
        connectTimeout: 5s
        tcpKeepalive:
          time: 7200s
          interval: 75s
      http:
        http1MaxPendingRequests: 1000
        http2MaxRequests: 1000
        maxRequestsPerConnection: 0 # 0 = unlimited reuse (H2 multiplex)
        maxRetries: 3
        idleTimeout: 90s
        h2UpgradePolicy: UPGRADE # prefer HTTP/2
    outlierDetection:
      consecutiveGatewayErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Network bandwidth testing
# Test pod-to-pod bandwidth (iperf3)
# Start server in one pod
kubectl run iperf-server --image=networkstatic/iperf3 -n default \
-- iperf3 -s
# Run client from another pod (different node)
kubectl run iperf-client --image=networkstatic/iperf3 -n default \
--restart=Never -it --rm \
-- iperf3 -c $(kubectl get pod iperf-server -o jsonpath='{.status.podIP}') \
-t 30 -P 8 # 8 parallel streams
# Expected: ~10 Gbps within same AZ on m7i; ~5 Gbps cross-AZ
# Test cross-AZ latency
kubectl run netshoot --image=nicolaka/netshoot --restart=Never -it --rm -- \
ping -c 100 10.0.2.45 # pod IP in different AZ
DNS Performance
DNS is a hidden latency source in Kubernetes. Every hostname lookup that doesn't hit the local cache goes to CoreDNS, which can become a bottleneck under high RPS.
DNS lookup flow and optimization
Pod queries "payment-service"
→ the name has fewer than 5 dots (ndots:5), so the resolver walks the search list first:
  1. payment-service.payments.svc.cluster.local ← HIT (found)
  2. payment-service.svc.cluster.local (skipped after hit)
  ...
With ndots:5, every external query like "api.stripe.com" tries:
  1. api.stripe.com.payments.svc.cluster.local (NXDOMAIN)
  2. api.stripe.com.svc.cluster.local (NXDOMAIN)
  3. api.stripe.com.cluster.local (NXDOMAIN)
  4. api.stripe.com (HIT, external)
= 4 queries instead of 1 → 4× DNS load for external calls
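The expansion order can be reproduced with a short sketch. This is a simplified model of glibc-style stub resolver behaviour, not resolver source code: the function name is hypothetical, and it lists the worst case, since a real resolver stops at the first name that resolves.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates lists the names a stub resolver would try, in order, for `name`
// given a search list and an ndots value (simplified model).
func candidates(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // trailing dot: already fully qualified
	}
	var out []string
	// Names with at least ndots dots are tried as-is before the search list.
	absoluteFirst := strings.Count(name, ".") >= ndots
	if absoluteFirst {
		out = append(out, name+".")
	}
	for _, domain := range search {
		out = append(out, name+"."+domain+".")
	}
	if !absoluteFirst {
		out = append(out, name+".")
	}
	return out
}

func main() {
	search := []string{"payments.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, ndots := range []int{5, 2} {
		fmt.Printf("ndots:%d\n", ndots)
		for i, c := range candidates("api.stripe.com", search, ndots) {
			fmt.Printf("  %d. %s\n", i+1, c)
		}
	}
}
```

Comparing the two runs shows why the ndots value matters for external names: the absolute name moves from last to first in the list once the name has at least ndots dots.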
Pod DNS configuration tuning
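spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      # Reduce search-list expansion: only names with fewer than 3 dots are
      # tried against the search domains first. api.stripe.com has 2 dots and
      # is still expanded; use ndots "2" or a trailing dot to skip it.
      - name: ndots
        value: "3"
      # Send the A and AAAA lookups from separate sockets when needed (glibc
      # only; works around the conntrack race behind sporadic 5s DNS timeouts)
      - name: single-request-reopen
      # Timeout for DNS queries (seconds)
      - name: timeout
        value: "3"
      # Number of attempts before failing
      - name: attempts
        value: "3"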
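Alternatively, use a fully qualified name with a trailing dot for cross-namespace calls. Without the trailing dot the name has only 4 dots, so with ndots:5 the resolver still tries the search domains first:
# Instead of: http://payment-service (triggers ndots suffix search)
# Use FQDN: http://payment-service.payments.svc.cluster.local. (note the trailing dot)
# Or in same namespace, just: http://payment-service works fine (first search domain hits)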
CoreDNS autoscaling and tuning
# CoreDNS HPA (scales on CPU; for scaling proportional to node count,
# use cluster-proportional-autoscaler instead)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
# CoreDNS ConfigMap tuning (kube-system)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
            prefer_udp
        }
        cache 300 { # cache TTL up to 300s (default 30s)
            success 9984 # max successful cache entries
            denial 9984 # max NXDOMAIN cache entries
            prefetch 10 1m 10% # prefetch before TTL expires (reduces latency spikes)
        }
        loop
        reload
        loadbalance
    }
Node-local DNS cache (NodeLocal DNSCache)
# NodeLocal DNSCache reduces CoreDNS load by caching at node level
# Install: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
# Verify it's running as DaemonSet
kubectl get daemonset -n kube-system node-local-dns
# After install, pods on that node use link-local DNS (169.254.20.10)
# instead of ClusterIP, bypassing iptables conntrack for DNS queries
# → reduces p99 DNS latency from ~2ms to ~0.1ms
Pod Startup Latency
Pod startup latency (from pending to ready) directly impacts autoscaling responsiveness. The time breaks down into distinct phases:
Pending Scheduled PullingImage Running Ready
│ │ │ │ │
├──── 0.1s ────┤ │ │ │
│ Scheduler │ │ │ │
│ decision ├──── 0–60s ───┤ │ │
│ │ Image pull │ │ │
│ │ (first time)├──── 0.5s ────┤ │
│ │ │ Container │ │
│ │ │ start ├── probe ───►│
│ │ │ │ period │
Total: 0.1s + image_pull + 0.5s + startup_probe_period
Reducing image pull time
# 1. Pin images by digest (also prevents unexpected pulls)
image: my-registry/payment-service@sha256:abc123...
# 2. Use imagePullPolicy: IfNotPresent (default for non-latest)
imagePullPolicy: IfNotPresent
# 3. Pre-pull images on nodes (DaemonSet or node provisioning)
# Karpenter EC2NodeClass userData pre-pull script:
# docker pull my-registry/payment-service:v1.2.3 || true
# 4. Use a registry in the same region/AZ as the cluster
# ECR with VPC endpoint → no egress; pull from S3 locally
# 5. Minimize image size (distroless / scratch)
# Before: 800 MiB (alpine + full JDK + tools)
# After: 85 MiB (distroless/java21 + app jar only)
# 6. Order layers so that frequently changing ones come last (better layer cache reuse)
COPY --chown=nonroot:nonroot libs/ /app/libs/ # rarely changes
COPY --chown=nonroot:nonroot config/ /app/config/ # sometimes changes
COPY --chown=nonroot:nonroot app.jar /app/ # always changes
Startup, readiness, and liveness probe tuning
spec:
  containers:
    - name: payment-service
      startupProbe:
        httpGet:
          path: /actuator/health/liveness
          port: 8080
        initialDelaySeconds: 10 # wait for JVM basic startup
        periodSeconds: 5
        failureThreshold: 24 # allow up to 10+24×5 = 130s for full startup
        successThreshold: 1
        timeoutSeconds: 3
      readinessProbe:
        httpGet:
          path: /actuator/health/readiness
          port: 8080
        initialDelaySeconds: 0 # startupProbe guards this; start checking immediately
        periodSeconds: 5
        failureThreshold: 3
        successThreshold: 1
        timeoutSeconds: 3
      livenessProbe:
        httpGet:
          path: /actuator/health/liveness
          port: 8080
        initialDelaySeconds: 0 # startupProbe guards this
        periodSeconds: 15 # don't check too often; liveness kill is expensive
        failureThreshold: 3
        successThreshold: 1
        timeoutSeconds: 5 # higher timeout; liveness failure restarts the container
A common misconfiguration is a liveness probe with timeoutSeconds: 1 and failureThreshold: 1 on a JVM service. A GC pause, especially on a CPU-throttled pod, can delay the health endpoint past the 1s timeout, and a single missed probe then triggers an unnecessary restart. Set timeoutSeconds: 5 and failureThreshold: 3 so that a restart requires three consecutive failed probes (at least 30s of sustained unresponsiveness at periodSeconds: 15).
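The probe budgets are simple arithmetic and worth computing before rollout. The helpers below are hypothetical names for illustration; the restart estimate assumes probes fire exactly one period apart and each failing probe runs to its full timeout.

```go
package main

import "fmt"

// startupBudgetSeconds is the longest a container may take to start before
// the kubelet gives up: the initial delay plus failureThreshold probe periods.
func startupBudgetSeconds(initialDelay, period, failureThreshold int) int {
	return initialDelay + failureThreshold*period
}

// restartAfterSeconds approximates how long a container must stay
// unresponsive before a liveness restart: failureThreshold consecutive
// probes, spaced one period apart, with the last one waiting out its timeout.
func restartAfterSeconds(period, failureThreshold, timeout int) int {
	return (failureThreshold-1)*period + timeout
}

func main() {
	fmt.Println("startup budget (s):", startupBudgetSeconds(10, 5, 24))
	fmt.Println("tuned liveness restart after (s):", restartAfterSeconds(15, 3, 5))
	fmt.Println("misconfigured liveness restart after (s):", restartAfterSeconds(10, 1, 1))
}
```

The misconfigured case shows the problem directly: with a threshold of one, the restart window collapses to the probe timeout itself.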
API Server Performance
A slow or overloaded API server affects every kubectl command, every controller reconcile loop, and every admission webhook call.
API server latency metrics
# API server request latency p99 by verb and resource
histogram_quantile(0.99,
sum by (verb, resource, le) (
rate(apiserver_request_duration_seconds_bucket{
job="apiserver"
}[5m])
)
)
# Inflight requests (near max → throttling in progress)
apiserver_current_inflight_requests
# Request rate by verb
sum by (verb) (
rate(apiserver_request_total{job="apiserver"}[5m])
)
# Error rate (5xx responses)
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m]))
/
sum(rate(apiserver_request_total{job="apiserver"}[5m]))
# Watch events sent per second by kind (a high rate means heavy watch fan-out)
sum by (kind) (rate(apiserver_watch_events_sizes_count[5m]))
API server tuning parameters
# kube-apiserver flags (kubeadm ClusterConfiguration extraArgs; drop the leading "--" there)
kube-apiserver:
  # Increase request inflight limits (default: 200 mutating, 400 read-only)
  --max-mutating-requests-inflight: "800"
  --max-requests-inflight: "1600"
  # Request timeout for non-long-running requests (60s is the default)
  --request-timeout: "60s"
  # Audit log: only record what you need (high volume = high CPU)
  --audit-policy-file: /etc/kubernetes/audit-policy.yaml
  # Enable API Priority and Fairness (APF); on by default
  --enable-priority-and-fairness: "true"
  # Default watch cache size per resource (100 is the default; raise for large, busy resources)
  --default-watch-cache-size: "100"
API Priority and Fairness (APF)
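APF (GA in K8s 1.29) refines the old max-inflight limits with a fine-grained priority system that prevents one noisy client from starving critical system operations. With APF enabled, the sum of the two max-inflight flags sets the total concurrency that is divided among priority levels: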
# Check current APF flow schemas and priority levels
kubectl get flowschemas
kubectl get prioritylevelconfigurations
# Monitor queue depth and wait time per flow
kubectl get --raw /metrics | grep apiserver_flowcontrol
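# Example: increase shares for leader election if it is being throttled.
# Built-in priority levels are reconciled by the API server; annotate the
# object with apf.kubernetes.io/autoupdate-spec=false or the change is reverted.
kubectl patch prioritylevelconfiguration leader-election \
  --type=json \
  -p='[{"op":"replace","path":"/spec/limited/nominalConcurrencyShares","value":20}]'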
Reducing API server load
- Use informers/watches, not polling: Controller code that calls List in a loop every second generates massive API load. Use client-go informers with SharedIndexInformer and watch events instead.
- Server-side filtering: Use labelSelector and fieldSelector in List calls to let the API server filter, reducing data transfer and client CPU.
- Avoid large objects: ConfigMaps and Secrets approaching the 1 MiB object size limit degrade etcd and API server performance. Use S3/external stores for large data.
- Scope RBAC: Broad ClusterRoleBindings with * resource wildcards generate large authorization cache evaluations.
etcd Performance
etcd is the single source of truth for all Kubernetes state. etcd latency directly increases API server latency. The most common causes of etcd degradation are disk I/O latency and network latency between etcd members.
etcd health and latency metrics
# Check etcd endpoint health
etcdctl endpoint health \
--endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key
# Check leader, DB size, and raft state
etcdctl endpoint status \
--endpoints=https://etcd-0:2379 \
--write-out=table \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key
# Output columns: ENDPOINT / ID / VERSION / DB SIZE / IS LEADER / IS LEARNER / RAFT TERM / RAFT INDEX / RAFT APPLIED INDEX / ERRORS
# etcd commit latency p99 (target: < 10ms for SSD)
histogram_quantile(0.99,
rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])
)
# WAL fsync latency (target: < 5ms; > 25ms = disk problem)
histogram_quantile(0.99,
rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)
# etcd DB size (> 6 GiB = compact urgently)
etcd_mvcc_db_total_size_in_bytes
# Network peer latency between etcd members
histogram_quantile(0.99,
rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])
)
etcd compaction and defragmentation
# etcd stores all historical revisions until compacted
# Without compaction, DB grows unbounded and slows down
# Check current revision
REVISION=$(etcdctl endpoint status --write-out=json | \
jq '.[0].Status.header.revision')
echo "Current revision: $REVISION"
# Compact to current revision (removes all old revisions)
etcdctl compact $REVISION
# Defragment after compaction (reclaims disk space)
# WARNING: etcd is briefly unavailable during defrag on that member
etcdctl defrag
# Check DB size before and after
etcdctl endpoint status --write-out=table
# Automatic compaction: kube-apiserver compacts etcd on an interval
# --etcd-compaction-interval=5m (the default; compact every 5 minutes)
etcd disk requirements
| Metric | Target | Warning | Action |
|---|---|---|---|
| WAL fsync p99 | < 5ms | > 10ms | Move to NVMe SSD; check I/O scheduler |
| Backend commit p99 | < 10ms | > 25ms | Compact + defrag; check disk contention |
| DB size | < 4 GiB | > 6 GiB | Compact immediately; increase quota |
| DB size / quota | < 60% | > 80% | Compact or increase --quota-backend-bytes |
| Peer latency p99 | < 5ms | > 20ms | Co-locate etcd members in same region; check network |
etcd requires predictable fsync latency. Running etcd on a disk shared with other I/O-heavy workloads (logs, container images, Prometheus TSDB) causes latency spikes that appear as API server timeouts. Use dedicated NVMe SSDs for etcd. On AWS, use an io2 EBS volume with > 3000 provisioned IOPS, or an instance-store NVMe disk (i4i family) for highest performance.
Scheduler Optimization
The scheduler can become a bottleneck when there are many unschedulable pods or when complex affinity rules increase per-pod scheduling time.
# Scheduler queue depth (pods waiting to be scheduled)
scheduler_pending_pods{queue="active"}
scheduler_pending_pods{queue="backoff"}
scheduler_pending_pods{queue="unschedulable"}
# Scheduling latency p99 (time from pod creation to scheduling decision)
histogram_quantile(0.99,
rate(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])
)
# Preemption attempts (expensive: pods are evicted to make room)
rate(scheduler_preemption_attempts_total[5m])
Scheduling performance tips
- Avoid requiredDuringScheduling with broad selectors: Every pod evaluates affinity against all matching pods. Use preferredDuringScheduling where hard requirements aren't needed.
- Use topologySpreadConstraints over complex anti-affinity: TSC is more efficient for zone spreading.
- Increase scheduler parallelism via the parallelism field of KubeSchedulerConfiguration (default 16) for large clusters with high pod churn.
- Use percentageOfNodesToScore: For clusters with 1000+ nodes, the scheduler evaluates only a percentage of nodes (default 50% down to 5% for very large clusters), dramatically improving scheduling speed.
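To see how the adaptive percentage behaves, the following sketch models the calculation. It follows my reading of the upstream scheduler's logic; the constants (a floor of 100 nodes, a base of 50%, one percentage point dropped per 125 nodes, a 5% minimum) and the function name are assumptions to verify against the Kubernetes version in use.

```go
package main

import "fmt"

// nodesToScore returns how many feasible nodes the scheduler looks for before
// it stops filtering. percentage == 0 selects the adaptive default.
func nodesToScore(numAllNodes, percentage int) int {
	const (
		minFeasibleNodes = 100 // small clusters are always scanned in full
		minPercentage    = 5   // floor for the adaptive percentage
	)
	if numAllNodes < minFeasibleNodes {
		return numAllNodes
	}
	if percentage == 0 {
		// Adaptive default: start at 50% and drop one point per 125 nodes.
		percentage = 50 - numAllNodes/125
		if percentage < minPercentage {
			percentage = minPercentage
		}
	}
	if percentage >= 100 {
		return numAllNodes
	}
	n := numAllNodes * percentage / 100
	if n < minFeasibleNodes {
		return minFeasibleNodes
	}
	return n
}

func main() {
	for _, nodes := range []int{50, 1000, 5000, 10000} {
		fmt.Printf("%5d nodes, adaptive -> score %d\n", nodes, nodesToScore(nodes, 0))
	}
	fmt.Printf("%5d nodes, fixed 10%% -> score %d\n", 1000, nodesToScore(1000, 10))
}
```

Under this model an explicit low percentage only pays off on large clusters, because the 100-node floor dominates on small ones.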
# KubeSchedulerConfiguration (v1 API, K8s 1.25+)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: NodeResourcesBalancedAllocation # disable if using bin-packing
        enabled:
          - name: NodeResourcesFit
            weight: 5
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated # bin-pack (vs LeastAllocated = spread)
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
leaderElection:
  leaderElect: true
percentageOfNodesToScore: 10 # for 1000+ node clusters
Profiling Workflow
When a performance issue is reported and USE method analysis doesn't pinpoint it, use continuous profiling with Pyroscope (covered in detail in Section 06-07). This workflow covers the investigation path from symptom to root cause:
Alert: p99 latency > 500ms
        │
        ▼
1. Check CPU throttle ratio
   container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 25%?
   ├── Yes → raise CPU limit or right-size request
   └── No → continue
2. Check memory pressure
   container_memory_working_set_bytes / limit > 80%?
   ├── Yes → check GC overhead, raise memory
   └── No → continue
3. Check GC pauses (JVM)
   jvm_gc_pause_seconds p99 > 200ms?
   ├── Yes → switch to ZGC, tune heap
   └── No → continue
4. Flame graph (Pyroscope / pprof)
   What function is consuming the most CPU?
   → json.Marshal hot? → use jsoniter or pre-allocated buffers
   → sql.Query hot? → add index, reduce N+1 queries
   → sync.Mutex hot? → partition lock, use sync/atomic
5. Trace analysis (Tempo)
   Which span is slow? Is it this service or a downstream?
   → DB query 450ms → run EXPLAIN ANALYZE, add index
   → External API 300ms → add timeout + circuit breaker
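The first three checks of the workflow are mechanical enough to encode, for example in a triage script or runbook bot. The sketch below is illustrative only: the type and function names are hypothetical, and the thresholds are the ones quoted in the workflow.

```go
package main

import "fmt"

// signals holds the measurements the workflow inspects, in the same order.
type signals struct {
	ThrottleRatio float64 // throttled periods / total periods
	MemoryUtil    float64 // working set / memory limit
	GCPauseP99    float64 // seconds; leave zero for non-JVM services
}

// nextStep walks the checks in order and returns the first action that applies.
func nextStep(s signals) string {
	switch {
	case s.ThrottleRatio > 0.25:
		return "raise CPU limit or right-size request"
	case s.MemoryUtil > 0.80:
		return "check GC overhead, raise memory"
	case s.GCPauseP99 > 0.200:
		return "switch to ZGC, tune heap"
	default:
		return "collect a flame graph, then inspect traces"
	}
}

func main() {
	fmt.Println(nextStep(signals{ThrottleRatio: 0.31}))
	fmt.Println(nextStep(signals{MemoryUtil: 0.60, GCPauseP99: 0.35}))
	fmt.Println(nextStep(signals{MemoryUtil: 0.60, GCPauseP99: 0.05}))
}
```

Ordering the cases the same way as the workflow keeps the cheap metric checks ahead of profiling and tracing, which need more setup.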
Performance Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: performance-tuning-alerts
  namespace: monitoring
spec:
  groups:
    - name: performance.tuning
      rules:
        - alert: ContainerCPUThrottlingHigh
          expr: |
            sum by (namespace, pod, container) (
              rate(container_cpu_cfs_throttled_periods_total[5m])
            )
            /
            sum by (namespace, pod, container) (
              rate(container_cpu_cfs_periods_total[5m])
            ) > 0.25
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} CPU throttled > 25%"
            description: "Throttle ratio is {{ $value | humanizePercentage }}. Raise CPU limit or lower GOMAXPROCS."
            runbook_url: https://runbooks.example.com/performance/cpu-throttling
        - alert: JVMGCPauseHigh
          expr: |
            histogram_quantile(0.99,
              rate(jvm_gc_pause_seconds_bucket[5m])
            ) > 0.5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "JVM GC p99 pause > 500ms in {{ $labels.namespace }}/{{ $labels.application }}"
            description: "Consider switching to ZGC or increasing heap."
            runbook_url: https://runbooks.example.com/performance/jvm-gc-pauses
        - alert: APIServerLatencyHigh
          expr: |
            histogram_quantile(0.99,
              sum by (verb, le) (
                rate(apiserver_request_duration_seconds_bucket{
                  verb!~"WATCH|CONNECT"
                }[5m])
              )
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "API server {{ $labels.verb }} p99 latency > 1s"
            runbook_url: https://runbooks.example.com/performance/apiserver-latency
        - alert: EtcdWALFsyncSlow
          expr: |
            histogram_quantile(0.99,
              rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
            ) > 0.025
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "etcd WAL fsync p99 > 25ms (disk I/O problem)"
            runbook_url: https://runbooks.example.com/performance/etcd-slow-disk
        - alert: EtcdDBSizeLarge
          # 6 GiB expressed in bytes
          expr: etcd_mvcc_db_total_size_in_bytes > 6 * 1024 * 1024 * 1024
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "etcd DB size > 6 GiB: compact and defrag immediately"
            runbook_url: https://runbooks.example.com/performance/etcd-db-size
        - alert: CoreDNSLatencyHigh
          expr: |
            histogram_quantile(0.99,
              rate(coredns_dns_request_duration_seconds_bucket[5m])
            ) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "CoreDNS p99 latency > 100ms"
            description: "Consider NodeLocal DNSCache or increasing CoreDNS replicas."
            runbook_url: https://runbooks.example.com/performance/coredns-latency
        - alert: SchedulerPendingPodsHigh
          expr: scheduler_pending_pods{queue="unschedulable"} > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $value }} pods unschedulable for > 5 min"
            runbook_url: https://runbooks.example.com/performance/scheduler-pending
Best Practices
Fix throttling before scaling
CPU throttling is not solved by adding pods. Raise the CPU limit or lower GOMAXPROCS. Adding pods just moves the throttle to more instances.
Use automaxprocs for Go
Every Go service running in a container should import go.uber.org/automaxprocs unless it is built with a Go release that already reads the cgroup CPU limit (1.25+). A default of 96 scheduler threads on a 0.5 CPU container is a latency disaster.
ZGC for low-latency JVM
If p99 is critical and heap > 4 GiB, switch from G1GC to ZGC with -XX:+ZGenerational (JDK 21). Sub-millisecond pauses eliminate GC-induced latency spikes.
NodeLocal DNSCache
Install NodeLocal DNSCache on every cluster. It reduces p99 DNS latency from 2–5ms to <0.5ms and cuts CoreDNS load by 60–80%.
etcd on dedicated NVMe
etcd fsync latency must stay below 5ms. Shared disks or gp2 EBS will cause intermittent API server timeouts. Use io2 with 3000+ IOPS or instance-store NVMe.
Startup probe guards liveness
Always use a startupProbe to cover slow JVM/Python startup. Without it, a liveness probe fires during class loading and restarts the pod unnecessarily — killing the first startup attempt.