
Node Resource Management

Node resource management is the set of mechanisms by which Kubernetes tracks, reserves, and enforces CPU, memory, ephemeral storage, and extended resources across the kernel (cgroups), the kubelet, and the scheduler. Getting this right prevents workload interference, protects node stability, and makes HPA/VPA decisions accurate. Getting it wrong causes unexplained OOMKills, CPU throttling, failed evictions, and "Insufficient CPU" scheduling errors despite seemingly available capacity.

Capacity vs Allocatable

Every Node object exposes two resource views:

[Figure: Capacity vs Allocatable. Example node capacity: 8 CPU, 32 Gi RAM. Carved out of capacity: kube-reserved (500m / 1 Gi), system-reserved (500m / 2 Gi), and the hard eviction threshold. The remainder is allocatable, the space available to schedulable pod workloads: 7 CPU / 28.5 Gi in this example, which implies a 0.5 Gi hard memory eviction threshold. A second bar shows allocatable consumed by the requests of five pods (pod-A to pod-E) plus the remaining schedulable space.]

Allocatable = Capacity − kube-reserved − system-reserved − eviction-threshold (hard memory)

The scheduler sums pod requests against allocatable. Pod limits can exceed allocatable (overcommit); only requests are scheduling constraints.

Field | What it represents | Who sets it
status.capacity | Raw hardware/VM resources: CPU cores, RAM, ephemeral disk, device plugin resources | kubelet reads from OS/cAdvisor at startup
status.allocatable | Capacity minus reserved amounts and hard eviction thresholds (the safe schedulable ceiling) | kubelet computes from KubeletConfiguration and patches Node
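
As a worked example of the allocatable formula, the sketch below subtracts the reservations from an 8 CPU / 32 Gi node. The helper names (`parse_cpu`, `parse_mem`, `allocatable`) and the 512Mi hard eviction threshold are assumptions for illustration; the kubelet's real quantity parser handles more suffixes than this.

```python
from decimal import Decimal

# Binary suffixes used by Kubernetes memory quantities (subset, for illustration).
_MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> int:
    """Return CPU in millicores: "500m" -> 500, "8" -> 8000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def parse_mem(quantity: str) -> int:
    """Return bytes for quantities such as "1Gi" or "512Mi"."""
    for suffix, factor in _MEM_UNITS.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(quantity)


def allocatable(capacity, kube_reserved, system_reserved, eviction_hard_memory):
    """Allocatable = capacity - kube-reserved - system-reserved - hard eviction threshold."""
    cpu = (parse_cpu(capacity["cpu"])
           - parse_cpu(kube_reserved["cpu"])
           - parse_cpu(system_reserved["cpu"]))
    mem = (parse_mem(capacity["memory"])
           - parse_mem(kube_reserved["memory"])
           - parse_mem(system_reserved["memory"])
           - parse_mem(eviction_hard_memory))
    return {"cpu_millicores": cpu, "memory_bytes": mem}


result = allocatable(
    capacity={"cpu": "8", "memory": "32Gi"},
    kube_reserved={"cpu": "500m", "memory": "1Gi"},
    system_reserved={"cpu": "500m", "memory": "2Gi"},
    eviction_hard_memory="512Mi",  # assumed value, not from a real node
)
print(result["cpu_millicores"], "millicores,", result["memory_bytes"] / 2**30, "Gi")
```

Note that the eviction threshold only applies to memory (and storage) here; CPU has no eviction threshold, so CPU allocatable is capacity minus the two reservations.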

Reserved Resources

The kubelet subtracts two reservation buckets from capacity before reporting allocatable. These are configured in KubeletConfiguration:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

# Resources reserved for Kubernetes system daemons (kubelet, CRI, kube-proxy)
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "2Gi"
  pid: "1000"

# Resources reserved for OS system daemons (sshd, journald, systemd, etc.)
systemReserved:
  cpu: "500m"
  memory: "2Gi"
  ephemeral-storage: "2Gi"
  pid: "1000"

# Cgroups on which the kubelet enforces Node Allocatable (a list, not a boolean).
# Default is [pods]; a reservation not listed here still lowers allocatable but its daemons are not capped.
enforceNodeAllocatable:
- pods          # cgroup limit on sum of all pod cgroups
- kube-reserved # requires kubeReservedCgroup to be set
- system-reserved  # requires systemReservedCgroup to be set

# Cgroup paths for enforcement (must pre-exist on node)
kubeReservedCgroup: /kubelet.slice
systemReservedCgroup: /system.slice
Reservations Without Enforcement = Accounting Only

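kubeReserved and systemReserved always reduce the allocatable value shown to the scheduler. What is actually enforced depends on enforceNodeAllocatable. With the default value (pods), the kubelet caps the top-level kubepods cgroup at capacity minus the reservations, so pods collectively cannot consume the reserved space. Adding kube-reserved and system-reserved additionally caps the daemons inside kubeReservedCgroup and systemReservedCgroup at their reservations. If enforceNodeAllocatable is empty, the reservations are accounting only, and pods will eat into the reserved space under pressure. In production, always enforce pods; enforce kube-reserved and system-reserved only after profiling daemon usage, because an undersized reservation then throttles or OOM-kills the daemons themselves.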

Sizing Reservations Correctly

Reservation amounts should be tuned per node type. Undersizing kills the node OS; oversizing wastes schedulable capacity.

Node RAM | Recommended kube-reserved memory | Recommended system-reserved memory | Total reserved
4 Gi | 512 Mi | 512 Mi | ~25% of RAM
8 Gi | 1 Gi | 1 Gi | ~25%
16 Gi | 1.5 Gi | 1 Gi | ~16%
32 Gi | 2 Gi | 2 Gi | ~12%
64 Gi | 3 Gi | 2 Gi | ~8%
128 Gi+ | 4–6 Gi | 2–4 Gi | ~5–6%

For CPU, reserving 50–100m per core for kube-reserved is typical. System processes rarely need more than 500m total on a healthy node.

QoS Classes

Kubernetes assigns every pod a Quality of Service class at admission time, based solely on its container resource requests and limits. QoS determines eviction priority and OOMKill order:

Guaranteed

Every container has both requests and limits set, and requests == limits for both CPU and memory.

  • Last to be evicted under memory pressure
  • CPU: capped at its limit, which equals its request, so it cannot burst; cores are pinned only under the static CPU Manager policy
  • OOMKill: only if container exceeds its memory limit
  • cgroup: cpu.shares set, memory.limit_in_bytes enforced

Burstable

At least one container has a request OR limit set, but not all conditions for Guaranteed are met.

  • Evicted after BestEffort when memory pressure occurs
  • Can burst CPU above request (up to limit or node capacity)
  • OOMKill: if container exceeds its limit, or during pressure
  • Most real-world pods land here

BestEffort

No container in the pod has any requests or limits set.

  • First to be evicted under memory pressure
  • Gets CPU only when no other workload needs it
  • First OOMKilled under memory pressure
  • Avoid in production; acceptable for batch/dev
# Guaranteed pod example (requests == limits for ALL containers)
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"

# Burstable pod example
resources:
  requests:
    cpu: "250m"
    memory: "128Mi"
  limits:
    cpu: "1"         # limit > request = burstable
    memory: "512Mi"

# BestEffort pod example (avoid in production)
resources: {}   # no requests or limits at all

# Check QoS class of a running pod:
kubectl get pod my-pod -o jsonpath='{.status.qosClass}'
QoS Affects cgroup Hierarchy Placement

Guaranteed pods are placed directly under the kubepods/ cgroup; Burstable pods under kubepods/burstable/; BestEffort pods under kubepods/besteffort/. The besteffort cgroup gets the minimum cpu.shares weight, and the burstable cgroup is weighted by the sum of its pods' CPU requests, so pods with CPU requests receive proportionally more CPU time during contention.
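
The classification rules above can be sketched as a small function. This is an illustrative approximation, not the kubelet's code: `qos_class` is a hypothetical name, quantities are compared as plain strings (so "1" and "1000m" would wrongly count as different), and only CPU and memory are considered.

```python
def qos_class(containers):
    """Classify a pod from its containers' resource maps.

    containers: list of dicts with optional "requests" and "limits" maps.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            # A limit without a request implies request == limit.
            effective_request = requests.get(resource, limits.get(resource))
            if resource not in limits or effective_request != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


guaranteed_pod = [{"requests": {"cpu": "500m", "memory": "256Mi"},
                   "limits": {"cpu": "500m", "memory": "256Mi"}}]
burstable_pod = [{"requests": {"cpu": "250m", "memory": "128Mi"},
                  "limits": {"cpu": "1", "memory": "512Mi"}}]
best_effort_pod = [{}]

for pod in (guaranteed_pod, burstable_pod, best_effort_pod):
    print(qos_class(pod))
```

The three sample pods mirror the YAML examples in this section.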

CPU Management

Requests vs Limits

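Resource | Request | Limit | Enforcement
CPU | Scheduling constraint; sets cpu.shares weight in cgroup (cpu.weight on cgroup v2) | Sets cpu.cfs_quota_us (cpu.max on cgroup v2); container throttled when quota exhausted | CFS bandwidth throttling (the container is slowed, never killed)
Memory | Scheduling constraint; sets memory.min on cgroup v2 only when the MemoryQoS feature gate is enabled | Sets memory.limit_in_bytes (memory.max on cgroup v2); process OOMKilled when exceeded | OOMKill (hard enforcement)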

CPU Throttling and CFS

Linux CFS (Completely Fair Scheduler) implements CPU limits via quota/period:

# CFS parameters set by kubelet for a container with limit=500m
# Period: 100ms (default cpu.cfs_period_us = 100000 microseconds)
# Quota:  50ms  (cpu.cfs_quota_us = 50000 = 500m × 100ms)
# Meaning: container can use 50ms of CPU per 100ms window

# Inspect throttling for a container (cgroup v1):
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod//cpu.stat
# nr_periods: total CFS periods run
# nr_throttled: how many periods the container hit its quota
# throttled_time: total nanoseconds throttled

# cgroup v2 equivalent (same counters; throttled time is reported as throttled_usec, in microseconds):
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/.../cpu.stat

# Throttle ratio:
# throttle_ratio = nr_throttled / nr_periods
# > 25% throttle ratio suggests the CPU limit is too low
CPU Throttling Is Silent but Painful

CPU throttling does not show up as high CPU usage — the container appears idle because it is suspended waiting for its next CFS period. It manifests as increased latency, slow request processing, and timeouts at seemingly low CPU utilization. Always check container_cpu_cfs_throttled_seconds_total in Prometheus before assuming CPU limits are correctly sized.
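
The quota arithmetic and the throttle ratio described above can be checked with a few lines of Python. The function names and the sample cpu.stat contents are made up for illustration; on a real node the text would come from the container's cpu.stat file.

```python
def cfs_quota_us(limit_millicores: int, period_us: int = 100_000) -> int:
    """CPU time in microseconds the container may use per CFS period."""
    return limit_millicores * period_us // 1000


def parse_cpu_stat(text: str) -> dict:
    """Parse "key value" lines from a cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(stats: dict) -> float:
    """Fraction of CFS periods in which the container hit its quota."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Hypothetical cgroup v1 cpu.stat contents, for illustration only.
SAMPLE_CPU_STAT = """nr_periods 1200
nr_throttled 420
throttled_time 35000000000
"""

stats = parse_cpu_stat(SAMPLE_CPU_STAT)
print(cfs_quota_us(500))                # quota for a 500m limit, in microseconds
print(round(throttle_ratio(stats), 2))  # compare against the 0.25 rule of thumb
```

With these sample numbers the container was throttled in 420 of 1200 periods, which is above the 25% rule of thumb.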

CPU Manager Policy

The kubelet's CPU Manager (GA 1.26) controls whether pods get shared or exclusive CPU cores:

# KubeletConfiguration
cpuManagerPolicy: static   # default: "none" (shared pool)
cpuManagerReconcilePeriod: 10s

# "none" (default): all pods share a pool of CPUs via CFS weights
# "static": Guaranteed pods with integer CPU requests get dedicated cores

# For static policy: reserve CPUs for OS/kubelet (never given to pods)
reservedSystemCPUs: "0-1"    # cores 0 and 1 always reserved
# OR use kubeReserved.cpu (less precise — doesn't pin specific cores)

# Pod that gets exclusive cores under static policy:
resources:
  requests:
    cpu: "2"    # integer request
  limits:
    cpu: "2"    # must equal request (Guaranteed QoS)
# Result: kubelet pins this container to 2 specific cores via cpuset cgroup
cpuManagerPolicy | Who benefits | Trade-off
none (default) | General workloads; maximizes CPU sharing efficiency | Cache line contention between pods; no NUMA awareness
static | Latency-sensitive, CPU-intensive: real-time, HPC, databases | Reduces utilization (reserved cores may be idle); requires Guaranteed QoS
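
The eligibility rule for exclusive cores can be sketched as a predicate. The function names are hypothetical; the logic follows the conditions stated above (static policy, Guaranteed QoS, integer CPU request) and the documented rule that reserved CPUs are never handed out exclusively.

```python
def gets_exclusive_cores(policy: str, qos: str, cpu_request_millicores: int) -> bool:
    """True when a container would be pinned to dedicated cores."""
    whole_cores = (cpu_request_millicores >= 1000
                   and cpu_request_millicores % 1000 == 0)
    return policy == "static" and qos == "Guaranteed" and whole_cores


def exclusively_allocatable(total_cores: int, reserved_cores: int) -> int:
    """Cores that can be handed out exclusively: everything except the reservation."""
    return total_cores - reserved_cores


print(gets_exclusive_cores("static", "Guaranteed", 2000))  # integer request
print(gets_exclusive_cores("static", "Guaranteed", 1500))  # fractional request
print(gets_exclusive_cores("none", "Guaranteed", 2000))    # default policy
print(exclusively_allocatable(total_cores=8, reserved_cores=2))
```

A Guaranteed container with a fractional request such as 1500m still runs, but in the shared pool.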

Memory Management

Memory Limits and OOMKill

When a container exceeds its memory limit, the Linux OOM killer terminates one of its processes. The kubelet detects this via PLEG and marks the container as OOMKilled. The exit code is 137 (SIGKILL = 9, exit = 128 + 9).

# Check OOMKill history for a pod
kubectl describe pod my-pod | grep -A5 "OOMKilled\|Exit Code\|Last State"

# Check system OOM events (node-level)
dmesg | grep -i "oom\|out of memory"
journalctl -k | grep -i "oom killer"

# cgroup v2: check memory events
cat /sys/fs/cgroup/kubepods.slice/.../memory.events
# oom: times the cgroup hit its limit and the OOM killer was invoked
# oom_kill: number of processes killed by the OOM killer

# OOM score adjustment, set by the kubelet per container (lower = less likely to be killed):
# Guaranteed pods: oom_score_adj = -997 (killed only as a last resort)
# Burstable pods:  oom_score_adj = 2 to 999 (derived from the memory request relative to node memory; larger requests score lower)
# BestEffort pods: oom_score_adj = 1000 (first to die)
cat /proc/$(pgrep -n nginx)/oom_score_adj
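
The score ranges can be reproduced with the formula given in the Kubernetes node-pressure eviction documentation for Burstable pods: min(max(2, 1000 − 1000 × memoryRequest / nodeMemoryCapacity), 999). The sketch below is an approximation of that rule, with a hypothetical function name; the kubelet's own implementation has extra cases (for example node-critical pods).

```python
def oom_score_adj(qos: str, memory_request_bytes: int, node_memory_bytes: int) -> int:
    """Approximate oom_score_adj the kubelet assigns to a container."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    # Burstable: the larger the request relative to node memory, the lower the score.
    score = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
    return min(max(2, score), 999)


GI = 2**30
print(oom_score_adj("Guaranteed", 4 * GI, 32 * GI))
print(oom_score_adj("Burstable", 4 * GI, 32 * GI))
print(oom_score_adj("Burstable", 0, 32 * GI))
print(oom_score_adj("BestEffort", 0, 32 * GI))
```

A Burstable container that requests an eighth of node memory lands at 875, well below a BestEffort container at 1000, so the kernel prefers to kill the BestEffort one first.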

Memory Manager Policy

The Memory Manager (beta since 1.22, GA in 1.32) works alongside the CPU Manager to provide NUMA-aware memory allocation for Guaranteed pods:

# KubeletConfiguration
memoryManagerPolicy: Static   # default: "None"
reservedMemory:
- numaNode: 0
  limits:
    memory: "1Gi"
    hugepages-1Gi: "2Gi"

# Pods that get NUMA-pinned memory (requires Guaranteed QoS; combine with integer CPU requests and the static CPU Manager for full alignment):
resources:
  requests:
    memory: "4Gi"
    hugepages-1Gi: "2Gi"
  limits:
    memory: "4Gi"
    hugepages-1Gi: "2Gi"
# kubelet allocates this pod's memory from the smallest set of NUMA nodes that can satisfy it
# (exactly one node when topologyManagerPolicy is single-numa-node)
# → reduces cross-NUMA memory access latency for HPC/real-time workloads

Eviction Manager

The kubelet's Eviction Manager continuously monitors node resource consumption and evicts pods before the node reaches a critical state. Covered in depth in kubelet: Eviction Manager; key configuration reference here:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

# Hard eviction thresholds (immediate eviction, no grace period)
evictionHard:
  memory.available: "200Mi"        # total node memory available
  nodefs.available: "10%"          # rootfs free space
  nodefs.inodesFree: "5%"          # rootfs inodes
  imagefs.available: "15%"         # container image filesystem
  imagefs.inodesFree: "5%"
  pid.available: "10%"             # available PID count

# Soft eviction thresholds (eviction begins only after grace period)
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "15%"
  imagefs.available: "20%"

# Grace periods for soft eviction (how long threshold must be breached)
evictionSoftGracePeriod:
  memory.available: "2m"
  nodefs.available: "5m"
  imagefs.available: "5m"

# Max pod termination grace period (seconds) allowed during soft eviction
evictionMaxPodGracePeriod: 30

# Minimum amount of resource to reclaim on each eviction round
evictionMinimumReclaim:
  memory.available: "100Mi"
  nodefs.available: "500Mi"
  imagefs.available: "500Mi"

Eviction Signals

Signal | Description | Derived from
memory.available | Node memory not counted in the working set | capacity - workingSet (cAdvisor)
nodefs.available | Free space on the filesystem backing the kubelet root directory (usually the root filesystem) | df /
nodefs.inodesFree | Free inodes on that filesystem | df -i /
imagefs.available | Free space on container image filesystem | df /var/lib/containerd
imagefs.inodesFree | Free inodes on image filesystem | df -i /var/lib/containerd
pid.available | Available PIDs on the node | /proc/sys/kernel/pid_max - current usage
allocatableMemory.available | Memory still available to pods within the kubepods cgroup (not OS-level) | allocatable - working set of the kubepods cgroup
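
Thresholds can be absolute quantities ("200Mi") or percentages of capacity ("10%"). The following sketch shows how such a threshold might be compared against an observed signal; the helper names are hypothetical and only the Ki/Mi/Gi suffixes are handled.

```python
from decimal import Decimal

_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def threshold_to_bytes(threshold: str, capacity_bytes: int) -> int:
    """Convert "200Mi" or "10%" into an absolute byte count."""
    if threshold.endswith("%"):
        return int(Decimal(threshold[:-1]) * capacity_bytes / 100)
    for suffix, factor in _UNITS.items():
        if threshold.endswith(suffix):
            return int(Decimal(threshold[: -len(suffix)]) * factor)
    return int(threshold)


def breached(available_bytes: int, threshold: str, capacity_bytes: int) -> bool:
    """A signal breaches its threshold when the available amount drops below it."""
    return available_bytes < threshold_to_bytes(threshold, capacity_bytes)


MI, GI = 2**20, 2**30
# memory.available: 150 Mi left on a 32 Gi node, threshold 200Mi
print(breached(150 * MI, "200Mi", 32 * GI))
# nodefs.available: 12 Gi free on a 100 Gi disk, threshold 10%
print(breached(12 * GI, "10%", 100 * GI))
```

For soft thresholds the kubelet additionally requires the breach to persist for the configured grace period before it evicts; that timing is not modeled here.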

Eviction Order

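When a threshold is breached, the kubelet ranks pods for eviction by three criteria, in order: whether usage of the starved resource exceeds the pod's request, then pod priority (lowest first), then usage relative to request (largest over-consumption first). QoS class is not a direct input, but the ranking usually produces this order:

1. BestEffort, and Burstable pods whose usage exceeds their requests
2. Burstable pods whose usage is below their requests
3. Guaranteed (last resort)

Guaranteed pods are evicted only when system daemons exceed their reservations and no lower-ranked pods remain.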
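
The kubelet's documented ranking for memory pressure (usage above request first, then pod priority, then amount of over-consumption) can be expressed as a sort key. This is a simplified sketch with made-up pod data and a hypothetical `PodUsage` type, not the eviction manager's implementation.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int
    memory_request: int  # MiB
    memory_usage: int    # MiB


def eviction_order(pods):
    """Return pod names in the order they would be considered for eviction."""
    def key(pod):
        exceeds_request = pod.memory_usage > pod.memory_request
        over_consumption = pod.memory_usage - pod.memory_request
        # False sorts after True here, lower priority first, larger overage first.
        return (0 if exceeds_request else 1, pod.priority, -over_consumption)
    return [pod.name for pod in sorted(pods, key=key)]


pods = [
    PodUsage("guaranteed", priority=1000, memory_request=1000, memory_usage=900),
    PodUsage("burstable-under", priority=0, memory_request=500, memory_usage=300),
    PodUsage("best-effort", priority=0, memory_request=0, memory_usage=300),
    PodUsage("burstable-over", priority=0, memory_request=200, memory_usage=600),
]
print(eviction_order(pods))
```

A BestEffort pod has a request of zero, so any usage counts as exceeding it; that is why BestEffort pods end up near the front of the list.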

Ephemeral Storage Management

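Ephemeral storage is local disk used by container writable layers, emptyDir volumes (except medium: Memory), and container logs. hostPath volumes also write to local disk but are not counted against a pod's ephemeral-storage requests and limits. Ephemeral storage is managed separately from persistent volumes.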

# Container resource requests/limits for ephemeral storage
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"   # pod evicted if exceeded

# Eviction thresholds apply to:
# 1. nodefs (root filesystem): container logs + emptyDir on rootfs
# 2. imagefs (image filesystem): container writable layers (overlayfs upper)

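# Check ephemeral storage usage per pod (kubelet Summary API; assumes jq is installed)
kubectl get --raw "/api/v1/nodes/worker-1/proxy/stats/summary" | \
  jq '.pods[] | {pod: .podRef.name, usedBytes: .["ephemeral-storage"].usedBytes}'
# Show the configured ephemeral-storage requests and limits
kubectl describe pod my-pod | grep -i "ephemeral\|storage"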

# Node-level usage
df -h /
df -h /var/lib/containerd
du -sh /var/log/pods/

# cAdvisor provides per-container ephemeral metrics:
# container_fs_usage_bytes{container!="",image!=""}
imagefs vs nodefs Split

When containerd stores images on a dedicated disk (/var/lib/containerd on a separate mount), Kubernetes tracks imagefs and nodefs as separate signals. This is the recommended production setup — it prevents a flood of container log writes from triggering image eviction, and vice versa. Verify the split with crictl imagefsinfo.

Extended Resources and Device Plugins

Extended resources allow nodes to advertise hardware beyond the standard CPU/memory/storage: GPUs, FPGAs, high-speed network interfaces, custom ASICs.

# Node advertises extended resources via kubelet device plugin or manual patch
# Device plugin (preferred): runs as DaemonSet, registers via kubelet socket
# /var/lib/kubelet/device-plugins/kubelet.sock

# Manual extended resource advertisement (for testing only):
kubectl patch node worker-1 --subresource=status --type=json \
  -p '[{"op":"add","path":"/status/capacity/example.com~1foo","value":"4"}]'

# Requesting extended resources in a pod:
resources:
  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"   # must equal request; no fractional extended resources

# GPU device plugin example (NVIDIA):
# DaemonSet: nvidia-device-plugin-daemonset
# Advertises: nvidia.com/gpu = 
# Kubelet passes GPU device node paths to container via device plugin response
# Container gets: /dev/nvidia0, /dev/nvidia-uvm, etc.

Device Plugin Protocol

[Diagram: Device Plugin DaemonSet → Register(ResourceName) → kubelet (/device-plugins socket) → ListAndWatch() → capacity → Allocate(deviceIDs) → envs/mounts/devices]

Device Plugin gRPC RPC | Direction | Purpose
Register | plugin → kubelet | Register resource name and plugin socket path with kubelet
ListAndWatch | kubelet → plugin (stream) | Plugin streams available device list; kubelet updates Node capacity
Allocate | kubelet → plugin | On pod admission: kubelet asks plugin to allocate specific device IDs; plugin returns env vars, mounts, device nodes to inject into container
GetPreferredAllocation | kubelet → plugin | Optional: plugin suggests which device IDs to allocate (e.g., GPU topology-aware)
PreStartContainer | kubelet → plugin | Optional: plugin performs setup just before container starts
GetDevicePluginOptions | kubelet → plugin | Discover which optional RPCs the plugin supports

Topology Manager

The Topology Manager (GA 1.27) coordinates CPU Manager, Memory Manager, and Device Plugins to ensure that a pod's resources are all allocated from the same NUMA node, minimizing cross-NUMA memory latency and PCIe transfer overhead for GPU/NIC workloads.

# KubeletConfiguration
topologyManagerPolicy: single-numa-node   # default: "none"
# Policies:
# none          - no NUMA alignment (default)
# best-effort   - prefer NUMA alignment; admit the pod anyway if impossible
# restricted    - require the best possible NUMA alignment; reject the pod otherwise
# single-numa-node - allocate ALL resources from exactly one NUMA node or reject the pod

topologyManagerScope: pod   # default: "container"
# container: compute alignment separately for each container
# pod:       place all containers of the pod on a common set of NUMA nodes
Topology Manager Requires Consistent Policies

For the Topology Manager to align CPU and memory, you must also set cpuManagerPolicy: static and memoryManagerPolicy: Static. A pod must be Guaranteed QoS with integer CPU requests. The CPU Manager, Memory Manager, and Device Manager act as hint providers: the Topology Manager collects their NUMA hints and picks an alignment that all of them then honor.
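
The hint-merging idea can be illustrated with a simplified model: each provider offers the sets of NUMA nodes from which it could satisfy the request, and the merged result is the narrowest set common to one choice per provider. This is a sketch under that assumption; the real Topology Manager works on bitmasks with "preferred" flags, and the restricted policy is omitted here.

```python
from itertools import product


def merge_hints(provider_hints):
    """Return the smallest NUMA-node set shared by one hint from each provider.

    provider_hints: list (one entry per provider) of lists of frozensets.
    Returns None when no combination of hints overlaps.
    """
    best = None
    for combination in product(*provider_hints):
        common = frozenset.intersection(*combination)
        if common and (best is None or len(common) < len(best)):
            best = common
    return best


def admit(policy: str, provider_hints) -> bool:
    """Admission decision for the two policies modeled in this sketch."""
    merged = merge_hints(provider_hints)
    if policy == "single-numa-node":
        return merged is not None and len(merged) == 1
    return True  # "none" and "best-effort" never reject on alignment


cpu_hints = [frozenset({0}), frozenset({1})]  # CPUs free on either node
gpu_hints = [frozenset({1})]                  # the only free GPU sits on node 1
print(merge_hints([cpu_hints, gpu_hints]))
print(admit("single-numa-node", [cpu_hints, gpu_hints]))
print(admit("single-numa-node", [[frozenset({0})], gpu_hints]))
```

In the last call the CPUs are only free on node 0 while the GPU is on node 1, so single-numa-node rejects the pod, whereas best-effort would admit it with misaligned resources.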

PID Limits

Kubernetes supports two independent PID limit mechanisms:

Node-level PID limit

Prevents all pods on the node from collectively consuming more than a configured fraction of the kernel PID limit.

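# KubeletConfiguration
systemReserved:
  pid: "1000"            # PIDs reserved for OS daemons
kubeReserved:
  pid: "1000"            # PIDs reserved for Kubernetes daemons
evictionHard:
  pid.available: "10%"   # evict pods when free PIDs drop below 10%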

Per-pod PID limit

Limits the number of PIDs that any single pod can create. Prevents a PID fork bomb from exhausting the node.

# KubeletConfiguration
podPidsLimit: 4096       # max PIDs per pod
# Enforced via pids.max on the pod's cgroup
# Check: cat /sys/fs/cgroup/.../pids.max

HugePages

HugePages (2Mi or 1Gi) reduce TLB pressure for memory-intensive applications. Kubernetes treats them as schedulable resources:

# Node reports hugepage capacity in status.capacity:
# hugepages-2Mi: "4Gi"   (2048 × 2Mi pages)
# hugepages-1Gi: "8Gi"

# Pre-allocate on node (must be done before kubelet starts):
echo 2048 > /proc/sys/vm/nr_hugepages        # 2Mi pages
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages  # 1Gi pages

# Make persistent via /etc/sysctl.conf:
vm.nr_hugepages = 2048

# Pod requesting hugepages:
resources:
  requests:
    hugepages-2Mi: "512Mi"   # request 256 × 2Mi pages
    memory: "512Mi"          # must also request regular memory
  limits:
    hugepages-2Mi: "512Mi"
    memory: "512Mi"

# To use hugetlbfs-backed files, mount an emptyDir volume with medium: HugePages
# (for example at /hugepages); the mount is not created automatically
HugePages Are Not Overcommittable

Unlike regular memory, hugepages cannot be overcommitted. Requests must equal limits, and a pod stays Pending if no node has enough free pre-allocated hugepages. Pre-allocate at boot: allocation at runtime can silently fall short when memory is fragmented. Reserve hugepages in the VM/instance startup script.
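
Because hugepages are counted in whole pages, a request translates directly into a page count that either fits or does not. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
MI = 2**20
GI = 2**30


def hugepage_count(quantity_bytes: int, page_size_bytes: int) -> int:
    """Number of pages a request needs; the quantity must be a whole multiple."""
    if quantity_bytes % page_size_bytes != 0:
        raise ValueError("hugepage quantity must be a multiple of the page size")
    return quantity_bytes // page_size_bytes


def fits(request_bytes: int, free_pages: int, page_size_bytes: int) -> bool:
    """No overcommit: the request fits only if enough pre-allocated pages are free."""
    return hugepage_count(request_bytes, page_size_bytes) <= free_pages


print(hugepage_count(512 * MI, 2 * MI))  # pages behind a 512Mi request of 2Mi pages
print(hugepage_count(4 * GI, 2 * MI))    # pages behind a 4Gi node capacity
print(fits(512 * MI, free_pages=2048, page_size_bytes=2 * MI))
print(fits(8 * GI, free_pages=2048, page_size_bytes=2 * MI))
```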

ResourceQuota and LimitRange

While not strictly node-level, these admission-time objects interact directly with node resource accounting:

# LimitRange: sets defaults and enforces min/max per container, pod, or PersistentVolumeClaim within a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-team
spec:
  limits:
  - type: Container
    default:           # applied if container has no limits set
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:    # applied if container has no requests set
      cpu: "100m"
      memory: "128Mi"
    max:               # no container may exceed this
      cpu: "4"
      memory: "4Gi"
    min:               # no container may request less than this
      cpu: "50m"
      memory: "64Mi"
  - type: PersistentVolumeClaim
    max:
      storage: "50Gi"

# ResourceQuota: limits total resource consumption per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: my-team
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "50"
    persistentvolumeclaims: "20"
    requests.storage: "500Gi"
    count/deployments.apps: "20"
LimitRange Defaults Enable QoS Guarantees Fleet-Wide

Without a LimitRange, developers who omit resource specs create BestEffort pods that get evicted first and can starve other workloads. Setting a namespace LimitRange ensures every pod gets at least a minimal request and limit — moving all pods to at least Burstable QoS — without requiring developers to specify resources manually.
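
The defaulting behavior can be sketched as a merge of the container's own resources with the LimitRange values from the example above. The function name is hypothetical and min/max validation is left out; the rule that an explicit limit without a request makes the request follow the limit is taken from the Kubernetes LimitRange documentation.

```python
def apply_limit_range(resources, default, default_request):
    """Fill in missing requests and limits the way LimitRange defaulting would."""
    limits = dict(resources.get("limits", {}))
    requests = dict(resources.get("requests", {}))
    for resource, value in default_request.items():
        if resource not in requests:
            # An explicit limit without a request: the request follows the limit.
            requests[resource] = limits[resource] if resource in limits else value
    for resource, value in default.items():
        limits.setdefault(resource, value)
    return {"requests": requests, "limits": limits}


DEFAULT = {"cpu": "500m", "memory": "256Mi"}          # LimitRange "default"
DEFAULT_REQUEST = {"cpu": "100m", "memory": "128Mi"}  # LimitRange "defaultRequest"

empty = apply_limit_range({}, DEFAULT, DEFAULT_REQUEST)
cpu_limit_only = apply_limit_range({"limits": {"cpu": "1"}}, DEFAULT, DEFAULT_REQUEST)
print(empty)
print(cpu_limit_only)
```

A container that specified nothing ends up with requests below limits, which is why such pods land in Burstable rather than BestEffort.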

cgroup v2

cgroup v2 is the unified hierarchy model that replaces the split v1 hierarchy (/sys/fs/cgroup/cpu/, /sys/fs/cgroup/memory/, etc.) with a single tree at /sys/fs/cgroup/.

Feature | cgroup v1 | cgroup v2
Hierarchy | Separate trees per subsystem | Single unified hierarchy
Memory accounting | Can miss inter-container charges | Accurate memory accounting with memory.stat
Pressure stall info | Not available | cpu.pressure, memory.pressure, io.pressure (PSI)
CPU throttling | cpu.cfs_quota_us | cpu.max (same semantics, new location)
Memory limits | memory.limit_in_bytes | memory.max (hard) + memory.high (soft)
Memory protection | memory.soft_limit_in_bytes (unreliable) | memory.min (guaranteed) + memory.low (best-effort)
Kubernetes support | Still supported; in maintenance mode since 1.31 | GA since 1.25; needs a distro that enables cgroup v2 (kernel 5.8+ recommended); not mandatory for the kubelet
cgroupDriver | cgroupfs or systemd | systemd strongly recommended
# Verify cgroup v2 is active
stat -fc %T /sys/fs/cgroup/
# Expected: cgroup2fs  (v2)
# "tmpfs" means v1 hybrid or v1 only

# Check kubelet and containerd both use systemd driver
grep cgroupDriver /var/lib/kubelet/config.yaml
# Expected: cgroupDriver: systemd
crictl info | grep -i cgroup
# Expected for containerd with runc: "SystemdCgroup": true

# cgroup v2: Kubernetes-relevant paths
# /sys/fs/cgroup/kubepods.slice/                         — all pod cgroups
# /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/ — burstable class
# Guaranteed pods have no class slice: their pod cgroups sit directly under /sys/fs/cgroup/kubepods.slice/
# /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/ — besteffort class

# Per-pod cgroup: .../pod.slice/
# Per-container:  .../pod.slice//

Resource Observability

Key Metrics

Metric | Type | Labels | Description
container_cpu_usage_seconds_total | Counter | container, pod, namespace | Cumulative CPU seconds used
container_cpu_cfs_throttled_seconds_total | Counter | same | Cumulative seconds container was throttled; key signal for under-sized limits
container_cpu_cfs_throttled_periods_total | Counter | same | Number of CFS periods during which throttling occurred
container_memory_working_set_bytes | Gauge | same | Active memory (used by eviction manager and HPA)
container_memory_rss | Gauge | same | Anonymous memory; excludes file cache
container_memory_cache | Gauge | same | Page cache; reclaimable under pressure
container_oom_events_total | Counter | same | OOM events within the container cgroup
node_memory_MemAvailable_bytes | Gauge | instance | Total available memory on node (approximates, but is not identical to, the memory.available eviction signal)
kubelet_evictions_total | Counter | eviction_signal | Evictions triggered per signal type
kube_node_status_allocatable | Gauge | resource, node | Allocatable resource per node (from kube-state-metrics)
kube_pod_container_resource_requests | Gauge | resource, pod | Requested resources per container (from kube-state-metrics)

Alerting Rules

# Alert: high CPU throttling rate
- alert: ContainerCPUThrottlingHigh
  expr: |
    sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m]))
      by (container, pod, namespace)
    /
    sum(increase(container_cpu_cfs_periods_total{container!=""}[5m]))
      by (container, pod, namespace)
    > 0.25
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} throttled >25% in {{ $labels.namespace }}/{{ $labels.pod }}"

# Alert: node memory near eviction threshold
- alert: NodeMemoryNearEviction
  expr: |
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} memory below 15%"

# Alert: container OOM kills
- alert: ContainerOOMKilled
  expr: |
    increase(container_oom_events_total{container!=""}[5m]) > 0
  labels:
    severity: warning
  annotations:
    summary: "OOMKill in {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }}"

# Alert: node disk pressure (imagefs)
- alert: NodeDiskPressureImageFS
  expr: |
    # node_exporter metrics; assumes imagefs is a dedicated mount at /var/lib/containerd
    node_filesystem_avail_bytes{mountpoint="/var/lib/containerd"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/containerd"} < 0.15
  for: 5m
  labels:
    severity: warning

# Alert: allocatable exhaustion (scheduling failure imminent)
- alert: NodeAllocatableCPULow
  expr: |
    (sum by (node) (kube_node_status_allocatable{resource="cpu"})
      - sum by (node) (kube_pod_container_resource_requests{resource="cpu"}))
    / sum by (node) (kube_node_status_allocatable{resource="cpu"}) < 0.10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} CPU allocatable headroom below 10%"

Troubleshooting Runbooks

Pod OOMKilled — sizing memory limits
# Symptom: pod restarts with OOMKilled, exit code 137

# 1. Check last state
kubectl describe pod my-pod | grep -A5 "Last State"
# Look for: Reason: OOMKilled

# 2. Check working set over time (need Prometheus)
container_memory_working_set_bytes{pod="my-pod", container="app"}

# 3. Check if OOM happened in container or node kernel
# Container OOM: cgroup enforced the limit (most common)
# Node OOM: system ran out of memory (rare with reservations)
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
dmesg | grep "oom killer" | tail -5

# 4. Right-size limit:
# Set limit to p99 of working_set + 20% headroom
# working_set = memory usage minus inactive file cache; it's what eviction tracks

# 5. If working_set spikes on startup (JVM, etc.):
# Size the limit for the startup peak, or lower the peak itself (for example with JVM heap flags)

# 6. Consider VPA (Vertical Pod Autoscaler) for automatic right-sizing
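
Step 4 of this runbook (p99 of working set plus 20% headroom) can be computed from exported samples as follows. This is a sketch using the nearest-rank percentile method and integer arithmetic; the function names and the sample series are made up for illustration.

```python
def percentile(samples, pct: int) -> int:
    """Nearest-rank percentile of a list of integer samples."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def recommended_limit(samples_bytes, pct: int = 99, headroom_pct: int = 20) -> int:
    """p-th percentile of working set plus headroom, rounded up to a whole byte."""
    base = percentile(samples_bytes, pct)
    return -(-base * (100 + headroom_pct) // 100)


MI = 2**20
# Hypothetical working-set samples: mostly 100 MiB with two spikes.
samples = [100 * MI] * 98 + [500 * MI, 900 * MI]

print(percentile(samples, 99) // MI, "MiB p99")
print(recommended_limit(samples) // MI, "MiB suggested limit")
```

Using p99 instead of the maximum ignores the single 900 MiB outlier; whether that is acceptable depends on whether an occasional OOMKill at that peak is tolerable for the workload.
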
High CPU throttling despite low CPU usage
# Symptom: CPU usage appears low but latency is high; throttle metric elevated

# 1. Confirm throttling
kubectl get --raw "/api/v1/nodes/worker-1/proxy/metrics/cadvisor" | \
  grep cpu_cfs_throttled | grep "my-pod"

# Throttle ratio = throttled_periods / total_periods
# > 25% = problem

# 2. Common causes:
# a) CPU limit too low for bursty workloads (e.g., JVM GC spikes)
# b) CFS period too long (100ms default can cause issues for low-latency apps)
# c) Many runnable threads: N threads running in parallel burn the quota N times faster

# 3. Fix options:
# a) Raise CPU limit (simplest)
# b) Remove CPU limit entirely (allow bursting to node capacity); only if workload is trusted
# c) Shorten the CFS period in KubeletConfiguration (affects all containers on the node;
#    requires the CustomCPUCFSQuotaPeriod feature gate):
# cpuCFSQuotaPeriod: "10ms"
# d) Use CPU Manager static policy with dedicated cores (removes contention with other pods for Guaranteed pods)

# 4. Verify fix:
rate(container_cpu_cfs_throttled_seconds_total{pod="my-pod"}[5m])
Node disk pressure — imagefs filling up
# Symptom: DiskPressure=True condition; pods evicted; new pods fail to schedule

# 1. Check which filesystem is full
df -h /
df -h /var/lib/containerd

# 2. Find what's consuming space
# Large images:
ctr -n k8s.io images ls | sort -k4 -h
# Large container logs:
du -sh /var/log/pods/* | sort -h | tail -20
# Unused images (not referenced by any container):
crictl rmi --prune   # removes unreferenced images
# The kubelet exposes no endpoint to force image GC; it runs automatically
# once disk usage crosses imageGCHighThresholdPercent (see step 3)

# 3. Adjust image GC thresholds if eviction is too aggressive:
# KubeletConfiguration:
# imageGCHighThresholdPercent: 85  (default; start GC at 85% full)
# imageGCLowThresholdPercent: 80   (default; GC down to 80%)

# 4. Configure log rotation to prevent logs from filling nodefs:
# containerLogMaxSize: "50Mi"   (default: 10Mi)
# containerLogMaxFiles: 5       (default: 5)

# 5. If imagefs = nodefs (same disk):
# Consider mounting /var/lib/containerd on a separate volume
Scheduler reports "Insufficient CPU" but node looks available
# Symptom: pod Pending with event "0/3 nodes are available: 3 Insufficient cpu"

# 1. Check actual allocatable vs sum of requests
kubectl describe node worker-1 | grep -A5 "Allocated resources"
# Shows: cpu requests / allocatable

# 2. The scheduler compares sum(pod requests) against allocatable
# NOT against actual CPU usage! High request sum = scheduling failure
# even if actual utilization is 10%

# 3. Check for pods with very high CPU requests (all containers per pod)
kubectl get pods -A -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,CPU:.spec.containers[*].resources.requests.cpu'
# Values mix units ("500m" vs "2"), so compare them by eye rather than with sort -h

# 4. Check reserved resources subtract correctly
kubectl get node worker-1 -o jsonpath='{.status.allocatable.cpu}'
# If lower than expected: check kubeReserved + systemReserved in kubelet config

# 5. Solutions:
# a) Add more nodes (Cluster Autoscaler)
# b) Reduce requests on over-provisioned pods
# c) Use VPA to right-size requests automatically
# d) Increase node size
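
The core of this runbook, that fit is decided on requests and never on usage, fits in a few lines. The numbers are invented for illustration and `fits_on_node` is a hypothetical name; the real scheduler evaluates many more predicates.

```python
def fits_on_node(allocatable_m: int, scheduled_requests_m, new_request_m: int) -> bool:
    """CPU fit check: sum of requests (millicores) must stay within allocatable."""
    return sum(scheduled_requests_m) + new_request_m <= allocatable_m


allocatable_m = 7000             # 7 CPU allocatable
scheduled = [2000, 2000, 2500]   # requests of pods already on the node
actual_usage_m = 700             # about 10% utilization; deliberately unused below

print(fits_on_node(allocatable_m, scheduled, new_request_m=1000))
print(fits_on_node(allocatable_m, scheduled, new_request_m=500))
```

The 1000m pod is rejected with "Insufficient cpu" even though the node is nearly idle, because 6500m of the 7000m allocatable is already claimed by requests.
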
Device plugin not advertising GPUs — extended resource missing
# Symptom: nvidia.com/gpu not visible in node capacity; pod Pending

# 1. Check device plugin pod is running
kubectl get pods -n kube-system | grep nvidia-device-plugin

# 2. Check kubelet device plugin socket
ls -la /var/lib/kubelet/device-plugins/
# Expected: kubelet.sock + nvidia_gpu plugin file

# 3. Check kubelet log for device plugin registration
journalctl -u kubelet | grep -i "device plugin\|nvidia\|gpu" | tail -20

# 4. Verify node capacity
kubectl get node worker-1 -o jsonpath='{.status.capacity}'
# Expected: "nvidia.com/gpu": "4" (or however many GPUs)

# 5. Check GPU driver and NVML are installed on node
nvidia-smi
ls /dev/nvidia*

# 6. Restart device plugin DaemonSet pod on the node
kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds \
  --field-selector spec.nodeName=worker-1

Production Best Practices

  1. Set requests and limits on every container: enforce via LimitRange defaults so that no pod is accidentally BestEffort. Missing requests make the scheduler blind; missing limits cause noisy-neighbor interference.
  2. Enforce node allocatable via cgroups: keep pods in enforceNodeAllocatable (the default), and add kube-reserved and system-reserved once the reservations have been sized against measured daemon usage. With an empty enforceNodeAllocatable, reserved values are accounting fiction.
  3. Monitor CPU throttling continuously: container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 0.25 is your primary signal for under-sized CPU limits. Many latency problems have this as root cause.
  4. Use working_set for memory limit sizing, not RSS: the kubelet eviction manager acts on working set, and RSS excludes file cache that the cgroup limit still counts. Set limits to 1.2× your p99 working set.
  5. Separate imagefs from nodefs: mount /var/lib/containerd on a dedicated volume. This prevents log floods from triggering image eviction, and gives you independent tuning of GC thresholds.
  6. Enable cgroup v2 with systemd driver: cgroup v2 provides accurate memory accounting, Pressure Stall Information (PSI), and memory.min/memory.high protection. Required for the MemoryQoS feature. Ensure kubelet and containerd both use cgroupDriver: systemd.
  7. Use Topology Manager for latency-sensitive workloads: single-numa-node policy with static CPU/memory managers avoids cross-NUMA penalties for real-time and ML inference workloads. Accept the scheduling strictness as the cost for predictable performance.
  8. Size reservations per node class, not globally: a 4-core 8 Gi node and a 64-core 256 Gi node need very different reservations. Use node group-specific kubelet config files or kubeadm node patches.
  9. Track allocatable headroom per node in Prometheus: alert when CPU or memory headroom drops below 10%. A fully-packed node cannot accept emergency pods (log shippers, debuggers) and makes drain/rescheduling painfully slow.
  10. Use VPA for request right-sizing: run VPA with updateMode "Off" first (recommendations only, no actual updates), observe suggestions for 1–2 weeks, then switch to Auto for non-production and maintain manual right-sizing for production. Never let VPA and HPA act on the same resource metric (CPU or memory) for the same workload.