
Node Resource Management

Node resource management is the set of mechanisms by which Kubernetes tracks, reserves, and enforces CPU, memory, ephemeral storage, and extended resources across the kernel (cgroups), the kubelet, and the scheduler. Getting this right prevents workload interference, protects node stability, and makes HPA/VPA decisions accurate. Getting it wrong causes unexplained OOMKills, CPU throttling, failed evictions, and "Insufficient CPU" scheduling errors despite seemingly available capacity.

Capacity vs Allocatable

Every Node object exposes two resource views:

Field	What it represents	Who sets it
`status.capacity`	Raw hardware/VM resources: CPU cores, RAM, ephemeral disk, device plugins	kubelet reads from OS/cadvisor at startup
`status.allocatable`	Capacity minus reserved amounts and hard eviction thresholds — the safe schedulable ceiling	kubelet computes from KubeletConfiguration and patches Node

Reserved Resources

The kubelet subtracts two reservation buckets from capacity before reporting allocatable. These are configured in KubeletConfiguration:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

# Resources reserved for Kubernetes system daemons (kubelet, CRI, kube-proxy)
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "2Gi"
  pid: "1000"

# Resources reserved for OS system daemons (sshd, journald, systemd, etc.)
systemReserved:
  cpu: "500m"
  memory: "2Gi"
  ephemeral-storage: "2Gi"
  pid: "1000"

# cgroup enforcement targets (a list, not a boolean; default: ["pods"])
# Reservations always reduce allocatable; this list controls which cgroups get hard limits
enforceNodeAllocatable:
- pods          # caps the kubepods cgroup (all pods combined) at capacity minus reservations
- kube-reserved # caps Kubernetes daemons; requires kubeReservedCgroup to be set
- system-reserved  # caps OS daemons; requires systemReservedCgroup to be set

# Cgroup paths for enforcement (must pre-exist on node)
kubeReservedCgroup: /kubelet.slice
systemReservedCgroup: /system.slice
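
As a worked example of the allocatable calculation, the sketch below subtracts the two reservation buckets and a hard eviction threshold from capacity. The 16Gi node size and the 100Mi `memory.available` threshold are assumptions chosen for illustration, and the helper names are hypothetical; only binary suffixes are parsed.

```python
# Binary-suffix quantities only (Ki, Mi, Gi); enough for this sketch.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_bytes(quantity: str) -> int:
    """Convert a quantity such as '2Gi' or '512Mi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def allocatable_memory(capacity, kube_reserved, system_reserved, eviction_hard):
    """allocatable = capacity - kubeReserved - systemReserved - hard eviction threshold."""
    return (
        parse_bytes(capacity)
        - parse_bytes(kube_reserved)
        - parse_bytes(system_reserved)
        - parse_bytes(eviction_hard)
    )


# Hypothetical 16Gi node, reservations from the config above,
# and an assumed 100Mi hard eviction threshold.
alloc = allocatable_memory("16Gi", "1Gi", "2Gi", "100Mi")
print(f"{alloc // 2**20}Mi")  # 13212Mi
```

On this hypothetical node roughly 3.1Gi of the 16Gi never becomes schedulable, which is why reservations are worth sizing per node type.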

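Reservations Without Enforcement = Accounting Only

kubeReserved and systemReserved always lower the allocatable value shown to the scheduler, but on their own they create no cgroup limits. With the default enforceNodeAllocatable: [pods], the kubelet caps the top-level kubepods cgroup at capacity minus reservations, so pods collectively cannot eat into the reserved space; if pods is removed from the list, the reservations are accounting only. Adding kube-reserved and system-reserved goes further and caps the daemons themselves at their reserved amounts, which can starve or OOM-kill critical node services if the values are undersized. Keep pods enforcement on in production, and enable the other two only after profiling daemon usage on each node type.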

Sizing Reservations Correctly

Reservation amounts should be tuned per node type. Undersizing kills the node OS; oversizing wastes schedulable capacity.

Node RAM	Recommended kube-reserved memory	Recommended system-reserved memory	Total reserved
4 Gi	512 Mi	512 Mi	~25% of RAM
8 Gi	1 Gi	1 Gi	~25%
16 Gi	1.5 Gi	1 Gi	~16%
32 Gi	2 Gi	2 Gi	~12%
64 Gi	3 Gi	2 Gi	~8%
128 Gi+	4–6 Gi	2–4 Gi	~5–8%

For CPU, reserving 50–100m per core for kube-reserved is typical. System processes rarely need more than 500m total on a healthy node.

QoS Classes

Kubernetes assigns every pod a Quality of Service class at admission time, based solely on its container resource requests and limits. QoS determines eviction priority and OOMKill order:

Guaranteed

Every container has both requests and limits set, and requests == limits for both CPU and memory.

Evicted only as a last resort under node memory pressure
CPU: capped at the limit, which equals the request (no bursting); exclusive cores only under the static CPU Manager policy
OOMKill: only if container exceeds its memory limit
cgroup: cpu.shares set, memory.limit_in_bytes enforced

Burstable

At least one container has a request OR limit set, but not all conditions for Guaranteed are met.

Evicted after BestEffort when memory pressure occurs
Can burst CPU above request (up to limit or node capacity)
OOMKill: if container exceeds its limit, or during pressure
Most real-world pods land here

BestEffort

No container in the pod has any requests or limits set.

First to be evicted under memory pressure
Gets CPU only when no other workload needs it
First OOMKilled under memory pressure
Avoid in production; acceptable for batch/dev

# Guaranteed pod example (requests == limits for ALL containers)
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"

# Burstable pod example
resources:
  requests:
    cpu: "250m"
    memory: "128Mi"
  limits:
    cpu: "1"         # limit > request = burstable
    memory: "512Mi"

# BestEffort pod example (avoid in production)
resources: {}   # no requests or limits at all

# Check QoS class of a running pod:
kubectl get pod my-pod -o jsonpath='{.status.qosClass}'
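
The classification rules above can be sketched as a small function. This is a simplified illustration, not the kubelet's implementation: the function name and input shape are hypothetical, only cpu and memory are considered, and quantities are compared as literal strings (so "500m" and "0.5" would not be treated as equal).

```python
def qos_class(containers):
    """Classify a pod from its containers' cpu/memory requests and limits.

    Each container is a dict like {"requests": {...}, "limits": {...}}.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = dict(container.get("requests", {}))
        limits = container.get("limits", {})
        # A limit without an explicit request implies request == limit.
        for resource, value in limits.items():
            requests.setdefault(resource, value)
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


guaranteed_pod = [{"requests": {"cpu": "500m", "memory": "256Mi"},
                   "limits": {"cpu": "500m", "memory": "256Mi"}}]
burstable_pod = [{"requests": {"cpu": "250m", "memory": "128Mi"},
                  "limits": {"cpu": "1", "memory": "512Mi"}}]
besteffort_pod = [{}]

print(qos_class(guaranteed_pod))   # Guaranteed
print(qos_class(burstable_pod))    # Burstable
print(qos_class(besteffort_pod))   # BestEffort
```

Note that one container without limits is enough to move a multi-container pod from Guaranteed to Burstable.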

QoS Affects cgroup Hierarchy Placement

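Guaranteed pods sit directly under the kubepods/ cgroup; Burstable pods under kubepods/burstable/; BestEffort pods under kubepods/besteffort/. The kubelet sets cpu.shares on the burstable cgroup from the sum of its pods' CPU requests and gives the besteffort cgroup the minimum weight (2), so under contention CPU time is divided in proportion to requests and BestEffort pods receive only what is left over.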

CPU Management

Requests vs Limits

Resource	Request	Limit	Enforcement
CPU	Scheduling constraint; sets `cpu.shares` weight in cgroup	Sets `cpu.cfs_quota_us`; container throttled when quota exhausted	CFS bandwidth throttling (compressible: the container is slowed, never killed)
Memory	Scheduling constraint; sets `memory.min` in cgroup v2 only when the MemoryQoS feature gate is enabled	Sets `memory.limit_in_bytes` (`memory.max` in cgroup v2); process OOMKilled when exceeded	OOMKill (incompressible: hard enforcement)

CPU Throttling and CFS

Linux CFS (Completely Fair Scheduler) implements CPU limits via quota/period:

# CFS parameters set by kubelet for a container with limit=500m
# Period: 100ms (default cpu.cfs_period_us = 100000 microseconds)
# Quota:  50ms  (cpu.cfs_quota_us = 50000 = 500m × 100ms)
# Meaning: container can use 50ms of CPU per 100ms window

# Inspect throttling for a container (cgroup v1):
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>/cpu.stat
# nr_periods: total CFS periods run
# nr_throttled: how many periods the container hit its quota
# throttled_time: total nanoseconds throttled

# cgroup v2 equivalent (reports throttled_usec, in microseconds, instead of throttled_time):
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/.../cpu.stat

# Throttle ratio:
# throttle_ratio = nr_throttled / nr_periods
# > 25% throttle ratio suggests the CPU limit is too low
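
The quota arithmetic and the throttle ratio above can be checked with a short sketch. The function names and the sample `cpu.stat` contents are made up for illustration; the parsing assumes the cgroup v1 "key value" line format shown above.

```python
def cfs_quota_us(limit_millicores: int, period_us: int = 100_000) -> int:
    """CFS quota for a CPU limit: (millicores / 1000) * period."""
    return limit_millicores * period_us // 1000


def throttle_ratio(cpu_stat_text: str) -> float:
    """Return nr_throttled / nr_periods from the contents of a cpu.stat file."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Made-up sample data, not taken from a real node.
sample = """nr_periods 1200
nr_throttled 360
throttled_time 18000000000"""

print(cfs_quota_us(500))        # 50000 (50ms per 100ms period)
print(throttle_ratio(sample))   # 0.3, above the 25% guideline
```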

CPU Throttling Is Silent but Painful

CPU throttling does not show up as high CPU usage — the container appears idle because it is suspended waiting for its next CFS period. It manifests as increased latency, slow request processing, and timeouts at seemingly low CPU utilization. Always check container_cpu_cfs_throttled_seconds_total in Prometheus before assuming CPU limits are correctly sized.

CPU Manager Policy

The kubelet's CPU Manager (GA 1.26) controls whether pods get shared or exclusive CPU cores:

# KubeletConfiguration
cpuManagerPolicy: static   # default: "none" (shared pool)
cpuManagerReconcilePeriod: 10s

# "none" (default): all pods share a pool of CPUs via CFS weights
# "static": Guaranteed pods with integer CPU requests get dedicated cores

# For static policy: reserve CPUs for OS/kubelet (never given to pods)
reservedSystemCPUs: "0-1"    # cores 0 and 1 always reserved
# OR use kubeReserved.cpu (less precise — doesn't pin specific cores)

# Pod that gets exclusive cores under static policy:
resources:
  requests:
    cpu: "2"    # integer request
  limits:
    cpu: "2"    # must equal request (Guaranteed QoS)
# Result: kubelet pins this container to 2 specific cores via cpuset cgroup

cpuManagerPolicy	Who benefits	Trade-off
`none` (default)	General workloads; maximizes CPU sharing efficiency	Cache line contention between pods; no NUMA awareness
`static`	Latency-sensitive, CPU-intensive: real-time, HPC, databases	Reduces utilization (reserved cores may be idle); requires Guaranteed QoS

Memory Management

Memory Limits and OOMKill

When a container exceeds its memory limit, the Linux OOM killer terminates one of its processes. The kubelet detects this via PLEG and marks the container as OOMKilled. The exit code is 137 (SIGKILL = 9, exit = 128 + 9).

# Check OOMKill history for a pod
kubectl describe pod my-pod | grep -A5 "OOMKilled\|Exit Code\|Last State"

# Check system OOM events (node-level)
dmesg | grep -i "oom\|out of memory"
journalctl -k | grep -i "oom-killer\|out of memory"

# cgroup v2: check memory events
cat /sys/fs/cgroup/kubepods.slice/.../memory.events
# oom: times usage hit the limit and an allocation was about to fail
# oom_kill: number of processes killed by the OOM killer

# oom_score_adj is set by the kubelet per QoS class (lower = less likely to be killed)
# Guaranteed pods: oom_score_adj = -997 (killed only as a last resort)
# Burstable pods:  oom_score_adj = roughly 2 to 999 (a larger memory request relative to node capacity gives a lower score)
# BestEffort pods: oom_score_adj = 1000 (first to die)
cat /proc/$(pgrep -n nginx)/oom_score_adj
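
To make the Burstable range concrete, here is a minimal sketch of the formula given in the upstream Kubernetes documentation, min(max(2, 1000 - 1000 × memoryRequest / nodeCapacity), 999). The function name is hypothetical, and the exact clamp bounds may differ slightly between kubelet versions, so treat the numbers as approximate.

```python
GI = 2**30
MI = 2**20


def oom_score_adj(qos: str, memory_request_bytes: int = 0,
                  node_capacity_bytes: int = 1) -> int:
    """Approximate the oom_score_adj the kubelet assigns for a QoS class."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    # Burstable: the larger the request relative to the node, the lower the score.
    score = 1000 - (1000 * memory_request_bytes) // node_capacity_bytes
    return min(max(2, score), 999)


# Hypothetical 16Gi node.
print(oom_score_adj("Burstable", 4 * GI, 16 * GI))    # 750
print(oom_score_adj("Burstable", 128 * MI, 16 * GI))  # 993
print(oom_score_adj("Burstable", 1 * MI, 16 * GI))    # 999 (stays below BestEffort)
```

A Burstable pod with a small memory request is therefore almost as exposed to the kernel OOM killer as a BestEffort pod.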

Memory Manager Policy

The Memory Manager (beta since 1.22, GA in 1.32) works alongside the CPU Manager to provide NUMA-aware memory allocation for Guaranteed pods:

# KubeletConfiguration
memoryManagerPolicy: Static   # default: "None"
reservedMemory:
- numaNode: 0
  limits:
    memory: "1Gi"
    hugepages-1Gi: "2Gi"

# Pods that get NUMA-pinned memory (requires Guaranteed QoS; pair with the static CPU Manager for full alignment):
resources:
  requests:
    memory: "4Gi"
    hugepages-1Gi: "2Gi"
  limits:
    memory: "4Gi"
    hugepages-1Gi: "2Gi"
# kubelet allocates this pod's memory exclusively from a single NUMA node
# → eliminates cross-NUMA memory access latency for HPC/real-time workloads

Eviction Manager

The kubelet's Eviction Manager continuously monitors node resource consumption and evicts pods before the node reaches a critical state. Covered in depth in kubelet: Eviction Manager; key configuration reference here:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

# Hard eviction thresholds (immediate eviction, no grace period)
evictionHard:
  memory.available: "200Mi"        # total node memory available
  nodefs.available: "10%"          # rootfs free space
  nodefs.inodesFree: "5%"          # rootfs inodes
  imagefs.available: "15%"         # container image filesystem
  imagefs.inodesFree: "5%"
  pid.available: "10%"             # available PID count

# Soft eviction thresholds (eviction begins only after grace period)
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "15%"
  imagefs.available: "20%"

# Grace periods for soft eviction (how long threshold must be breached)
evictionSoftGracePeriod:
  memory.available: "2m"
  nodefs.available: "5m"
  imagefs.available: "5m"

# Max grace period (seconds) granted to pods terminated by a soft eviction
evictionMaxPodGracePeriod: 30

# Minimum amount of resource to reclaim on each eviction round
evictionMinimumReclaim:
  memory.available: "100Mi"
  nodefs.available: "500Mi"
  imagefs.available: "500Mi"
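
Thresholds can be absolute quantities or percentages of capacity. The sketch below shows how either form might be resolved to bytes and compared against the available amount; the helper names and node sizes are hypothetical, and the grace-period handling for soft thresholds is left out.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def threshold_bytes(threshold: str, capacity_bytes: int) -> int:
    """Resolve an eviction threshold ('200Mi' or '10%') to bytes."""
    if threshold.endswith("%"):
        return int(capacity_bytes * float(threshold[:-1]) / 100)
    for suffix, factor in UNITS.items():
        if threshold.endswith(suffix):
            return int(float(threshold[: -len(suffix)]) * factor)
    return int(threshold)


def breached(available_bytes: int, capacity_bytes: int, threshold: str) -> bool:
    """A signal is breached when the available amount drops below the threshold."""
    return available_bytes < threshold_bytes(threshold, capacity_bytes)


# Hypothetical node: 16Gi RAM with 150Mi available, 100Gi root disk with 12Gi free.
print(breached(150 * 2**20, 16 * 2**30, "200Mi"))  # True  (memory.available hard threshold)
print(breached(12 * 2**30, 100 * 2**30, "10%"))    # False (nodefs.available hard threshold)
```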

Eviction Signals

Signal	Description	Derived from
`memory.available`	Available node memory (not working set)	`capacity - workingSet` (cAdvisor)
`nodefs.available`	Free space on node root filesystem	`df /`
`nodefs.inodesFree`	Free inodes on root filesystem	`df -i /`
`imagefs.available`	Free space on container image filesystem	`df /var/lib/containerd`
`imagefs.inodesFree`	Free inodes on image filesystem	`df -i /var/lib/containerd`
`pid.available`	Available PIDs on the node	`/proc/sys/kernel/pid_max - current usage`
`allocatableMemory.available`	Available memory within pod cgroup (not OS-level)	cgroup memory.usage_in_bytes within kubepods

Eviction Order

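When a threshold is breached, the kubelet ranks pods for eviction. QoS class is not consulted directly, but because the ranking starts from usage relative to requests, the outcome usually follows this order:

1. BestEffort (no requests, so any usage exceeds them) → 2. Burstable (usage exceeds request) → 3. Guaranteed, and Burstable pods below their requests (last resort)

The ranking criteria, applied in order, are: whether the pod's usage of the starved resource exceeds its request, then pod priority (lowest priority first), then how far usage exceeds the request (largest over-consumption first).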
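
As an illustration, the ranking described in the upstream node-pressure eviction documentation (usage above request first, then lower pod priority, then larger over-consumption) can be written as a sort key. The `PodUsage` type and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int
    request: int  # amount requested of the starved resource
    usage: int    # amount currently used


def eviction_order(pods):
    """Return pods sorted so that the first element is evicted first."""
    return sorted(
        pods,
        key=lambda p: (
            p.usage <= p.request,    # pods over their request sort first
            p.priority,              # then lower priority first
            -(p.usage - p.request),  # then larger over-consumption first
        ),
    )


pods = [
    PodUsage("guaranteed", priority=1000, request=400, usage=400),
    PodUsage("burstable-under", priority=0, request=500, usage=100),
    PodUsage("burstable-over", priority=0, request=200, usage=300),
    PodUsage("besteffort", priority=0, request=0, usage=150),
]
print([p.name for p in eviction_order(pods)])
# ['besteffort', 'burstable-over', 'burstable-under', 'guaranteed']
```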

Ephemeral Storage Management

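Ephemeral storage is node-local disk consumed by container writable layers, emptyDir volumes (except memory-backed ones), and container logs. hostPath volumes also live on node disk but are not tracked or limited as ephemeral storage. Ephemeral storage is managed separately from persistent volumes.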

# Container resource requests/limits for ephemeral storage
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"   # pod evicted if the container exceeds this

# Eviction thresholds apply to:
# 1. nodefs (root filesystem): container logs + emptyDir on rootfs
# 2. imagefs (image filesystem): container writable layers (overlayfs upper)

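# Check ephemeral storage usage per pod (kubelet Summary API; assumes jq is installed)
kubectl get --raw "/api/v1/nodes/worker-1/proxy/stats/summary" | \
  jq '.pods[] | select(.podRef.name=="my-pod") | ."ephemeral-storage".usedBytes'
kubectl describe pod my-pod | grep -i "ephemeral\|storage"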

# Node-level usage
df -h /
df -h /var/lib/containerd
du -sh /var/log/pods/

# cAdvisor provides per-container ephemeral metrics:
# container_fs_usage_bytes{container!="",image!=""}

imagefs vs nodefs Split

When containerd stores images on a dedicated disk (/var/lib/containerd on a separate mount), Kubernetes tracks imagefs and nodefs as separate signals. This is the recommended production setup — it prevents a flood of container log writes from triggering image eviction, and vice versa. Verify the split with crictl imagefsinfo.

Extended Resources and Device Plugins

Extended resources allow nodes to advertise hardware beyond the standard CPU/memory/storage: GPUs, FPGAs, high-speed network interfaces, custom ASICs.

# Node advertises extended resources via kubelet device plugin or manual patch
# Device plugin (preferred): runs as DaemonSet, registers via kubelet socket
# /var/lib/kubelet/device-plugins/kubelet.sock

# Manual extended resource advertisement (for testing only):
kubectl patch node worker-1 --subresource=status --type=json \
  -p '[{"op":"add","path":"/status/capacity/example.com~1foo","value":"4"}]'

# Requesting extended resources in a pod:
resources:
  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"   # must equal request; no fractional extended resources

# GPU device plugin example (NVIDIA):
# DaemonSet: nvidia-device-plugin-daemonset
# Advertises: nvidia.com/gpu = <number of GPUs on the node>
# Kubelet passes GPU device node paths to container via device plugin response
# Container gets: /dev/nvidia0, /dev/nvidia-uvm, etc.

Device Plugin Protocol

Device Plugin DaemonSet → kubelet /device-plugins socket → ListAndWatch() (reports capacity) → Allocate(deviceIDs) (returns envs/mounts/devices)

Device Plugin gRPC RPC	Direction	Purpose
`Register`	plugin → kubelet	Register resource name and plugin socket path with kubelet
`ListAndWatch`	kubelet → plugin (stream)	Plugin streams available device list; kubelet updates Node capacity
`Allocate`	kubelet → plugin	On pod admission: kubelet asks plugin to allocate specific device IDs; plugin returns env vars, mounts, device nodes to inject into container
`GetPreferredAllocation`	kubelet → plugin	Optional: plugin suggests which device IDs to allocate (e.g., GPU topology-aware)
`PreStartContainer`	kubelet → plugin	Optional: plugin performs setup just before container starts
`GetDevicePluginOptions`	kubelet → plugin	Discover which optional RPCs the plugin supports

Topology Manager

The Topology Manager (GA 1.27) coordinates CPU Manager, Memory Manager, and Device Plugins to ensure that a pod's resources are all allocated from the same NUMA node, minimizing cross-NUMA memory latency and PCIe transfer overhead for GPU/NIC workloads.

# KubeletConfiguration
topologyManagerPolicy: single-numa-node   # default: "none"
# Policies:
# none          - no NUMA alignment (default)
# best-effort   - try NUMA alignment; admit the pod anyway if impossible
# restricted    - try NUMA alignment; reject the pod if impossible
# single-numa-node - allocate ALL resources from exactly one NUMA node or reject the pod

topologyManagerScope: pod   # default: "container"
# container: align each container independently (containers of one pod may land on different NUMA nodes)
# pod:       align all containers of the pod to one common set of NUMA nodes (stricter)
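
The policies differ mainly in what happens when alignment fails. The sketch below is a deliberately simplified model, not the kubelet's algorithm: each resource manager is assumed to offer candidate sets of NUMA nodes, the candidates are intersected, and the policy decides admission. The real Topology Manager also tracks a "preferred" flag per hint, which this model omits; all names are hypothetical.

```python
from itertools import product


def merge_hints(provider_hints):
    """Intersect one candidate NUMA set per provider; keep the narrowest non-empty result."""
    best = None
    for combo in product(*provider_hints):
        if not combo:
            continue
        merged = frozenset.intersection(*combo)
        if merged and (best is None or len(merged) < len(best)):
            best = merged
    return best


def admit(policy, provider_hints):
    """Simplified admission decision for each topologyManagerPolicy value."""
    merged = merge_hints(provider_hints)
    if policy in ("none", "best-effort"):
        return True  # never rejects, aligned or not
    if policy == "restricted":
        return merged is not None
    if policy == "single-numa-node":
        return merged is not None and len(merged) == 1
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical two-socket node: CPUs are free on either node, the GPU sits on node 1.
cpu_hints = [frozenset({0}), frozenset({1}), frozenset({0, 1})]
gpu_hints = [frozenset({1})]
print(merge_hints([cpu_hints, gpu_hints]))                 # frozenset({1})
print(admit("single-numa-node", [cpu_hints, gpu_hints]))   # True

# CPUs only free on node 0 while the GPU is on node 1: no common node.
print(admit("single-numa-node", [[frozenset({0})], gpu_hints]))  # False
print(admit("best-effort", [[frozenset({0})], gpu_hints]))       # True
```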

Topology Manager Requires Consistent Policies

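For the Topology Manager to be effective, you must also set cpuManagerPolicy: static and memoryManagerPolicy: Static, and the pod must be Guaranteed QoS with integer CPU requests. The CPU Manager, Memory Manager, and Device Manager act as hint providers: each reports which NUMA nodes can satisfy its part of the request, and the Topology Manager merges those hints and admits or rejects the pod according to the policy.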

PID Limits

Kubernetes supports two independent PID limit mechanisms:

Node-level PID limit

Prevents all pods on the node from collectively consuming more than a configured fraction of the kernel PID limit.

# KubeletConfiguration
systemReserved:
  pid: "1000"    # PIDs reserved for OS daemons (kubeReserved.pid works the same way)
# evictionHard:
#   pid.available: "10%"

Per-pod PID limit

Limits the number of PIDs that any single pod can create. Prevents a PID fork bomb from exhausting the node.

# KubeletConfiguration
podPidsLimit: 4096       # max PIDs per pod
# Enforced on the pod-level cgroup via pids.max
# Check: cat /sys/fs/cgroup/.../pids.max

HugePages

HugePages (2Mi or 1Gi) reduce TLB pressure for memory-intensive applications. Kubernetes treats them as schedulable resources:

# Node reports hugepage capacity in status.capacity:
# hugepages-2Mi: "4Gi"   (2048 × 2Mi pages)
# hugepages-1Gi: "8Gi"

# Pre-allocate on node (must be done before kubelet starts):
echo 2048 > /proc/sys/vm/nr_hugepages        # 2Mi pages
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages  # 1Gi pages

# Make persistent via /etc/sysctl.conf:
vm.nr_hugepages = 2048

# Pod requesting hugepages:
resources:
  requests:
    hugepages-2Mi: "512Mi"   # request 256 × 2Mi pages
    memory: "512Mi"          # must also request regular memory
  limits:
    hugepages-2Mi: "512Mi"
    memory: "512Mi"

# hugetlbfs is not mounted automatically: add an emptyDir volume with
# medium: HugePages and mount it in the container (commonly at /dev/hugepages)
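
The page counts in the comments above follow from dividing the request by the page size. A small sketch of that arithmetic, with a hypothetical helper that accepts only Mi and Gi suffixes and rejects requests that are not a whole number of pages:

```python
UNITS = {"Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity: str) -> int:
    """Convert '512Mi' or '1Gi' to bytes (integer values only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity}")


def hugepage_count(request: str, page_size: str) -> int:
    """Number of pages needed to back a hugepages-<size> request."""
    req, page = to_bytes(request), to_bytes(page_size)
    if req % page:
        raise ValueError("request must be a whole multiple of the page size")
    return req // page


print(hugepage_count("512Mi", "2Mi"))  # 256
print(hugepage_count("4Gi", "2Mi"))    # 2048
print(hugepage_count("8Gi", "1Gi"))    # 8
```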

HugePages Are Not Overcommittable

Unlike regular memory, hugepages cannot be overcommitted. Requests must equal limits, and a pod stays Pending if no node has enough pre-allocated hugepages. Pre-allocate at boot: runtime allocation can fall short without an error once memory is fragmented. Reserve hugepages in the VM/instance startup script.

ResourceQuota and LimitRange

While not strictly node-level, these admission-time objects interact directly with node resource accounting:

# LimitRange: sets defaults and enforces min/max per container, pod, or PVC within a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-team
spec:
  limits:
  - type: Container
    default:           # applied if container has no limits set
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:    # applied if container has no requests set
      cpu: "100m"
      memory: "128Mi"
    max:               # no container may exceed this
      cpu: "4"
      memory: "4Gi"
    min:               # no container may request less than this
      cpu: "50m"
      memory: "64Mi"
  - type: PersistentVolumeClaim
    max:
      storage: "50Gi"

# ResourceQuota: limits total resource consumption per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: my-team
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "50"
    persistentvolumeclaims: "20"
    requests.storage: "500Gi"
    count/deployments.apps: "20"

LimitRange Defaults Enable QoS Guarantees Fleet-Wide

Without a LimitRange, developers who omit resource specs create BestEffort pods that get evicted first and can starve other workloads. Setting a namespace LimitRange ensures every pod gets at least a minimal request and limit — moving all pods to at least Burstable QoS — without requiring developers to specify resources manually.
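
The defaulting behaviour can be sketched as follows. This is a simplified model with hypothetical names: it fills in missing limits from `default` and missing requests from `defaultRequest`, and it treats an explicit limit without a request as request == limit. Validation against `min` and `max` is not modelled.

```python
def apply_limitrange_defaults(resources, default, default_request):
    """Return container resources after LimitRange-style defaulting."""
    explicit_limits = dict(resources.get("limits", {}))
    requests = dict(resources.get("requests", {}))
    # An explicit limit with no request implies request == limit.
    for resource, value in explicit_limits.items():
        requests.setdefault(resource, value)
    limits = {**default, **explicit_limits}
    requests = {**default_request, **requests}
    return {"requests": requests, "limits": limits}


# Values from the LimitRange above.
default = {"cpu": "500m", "memory": "256Mi"}
default_request = {"cpu": "100m", "memory": "128Mi"}

# A container with no resources at all becomes Burstable instead of BestEffort.
print(apply_limitrange_defaults({}, default, default_request))
# {'requests': {'cpu': '100m', 'memory': '128Mi'}, 'limits': {'cpu': '500m', 'memory': '256Mi'}}
```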

cgroup v2

cgroup v2 is the unified hierarchy model that replaces the split v1 hierarchy (/sys/fs/cgroup/cpu/, /sys/fs/cgroup/memory/, etc.) with a single tree at /sys/fs/cgroup/.

Feature	cgroup v1	cgroup v2
Hierarchy	Separate trees per subsystem	Single unified hierarchy
Memory accounting	Can miss inter-container charges	Accurate memory accounting with `memory.stat`
Pressure stall info	Not available	`cpu.pressure`, `memory.pressure`, `io.pressure` (PSI)
CPU throttling	`cpu.cfs_quota_us`	`cpu.max` (same semantics, new location)
Memory limits	`memory.limit_in_bytes`	`memory.max` (hard) + `memory.high` (soft)
Memory protection	`memory.soft_limit_in_bytes` (unreliable)	`memory.min` (guaranteed) + `memory.low` (best-effort)
Kubernetes support	Supported; in maintenance mode since 1.31	GA since 1.25; used automatically when the node boots with the unified hierarchy (kernel 5.8+ recommended)
cgroupDriver	`cgroupfs` or `systemd`	`systemd` strongly recommended

# Verify cgroup v2 is active
stat -fc %T /sys/fs/cgroup/
# Expected: cgroup2fs  (v2)
# "tmpfs" means v1 hybrid or v1 only

# Check kubelet and containerd both use systemd driver
cat /var/lib/kubelet/config.yaml | grep cgroupDriver
# Expected: cgroupDriver: systemd
crictl info | grep -i systemdcgroup
# Expected (containerd with runc): "SystemdCgroup": true

# cgroup v2: Kubernetes-relevant paths (systemd driver)
# /sys/fs/cgroup/kubepods.slice/                            all pod cgroups; Guaranteed pods sit directly here
# /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/   burstable class
# /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/  besteffort class
# (there is no kubepods-guaranteed.slice)

# Per-pod cgroup: .../pod<pod-uid>.slice/
# Per-container:  .../pod<pod-uid>.slice/<container-id>/
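
When scripting against these paths, it helps to build the pod cgroup directory from the pod UID and QoS class. The sketch below assumes the systemd cgroup driver on cgroup v2, where dashes in the UID are replaced by underscores and Guaranteed pods sit directly under kubepods.slice; the function name and the sample UID are hypothetical.

```python
def pod_cgroup_path(pod_uid: str, qos: str, root: str = "/sys/fs/cgroup") -> str:
    """Build the pod-level cgroup directory (systemd driver, cgroup v2 assumed)."""
    uid = pod_uid.replace("-", "_")  # systemd slice names cannot contain the UID's dashes
    if qos == "Guaranteed":
        return f"{root}/kubepods.slice/kubepods-pod{uid}.slice"
    tier = qos.lower()
    if tier not in ("burstable", "besteffort"):
        raise ValueError(f"unknown QoS class: {qos}")
    return f"{root}/kubepods.slice/kubepods-{tier}.slice/kubepods-{tier}-pod{uid}.slice"


uid = "1234abcd-0000-4000-8000-000000000001"  # made-up pod UID
print(pod_cgroup_path(uid, "Burstable"))
print(pod_cgroup_path(uid, "Guaranteed"))
```

Appending `/cpu.stat`, `/memory.events`, or `/pids.max` to the result gives the files referenced elsewhere in this section.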

Resource Observability

Key Metrics

Metric	Type	Labels	Description
`container_cpu_usage_seconds_total`	Counter	`container`, `pod`, `namespace`	Cumulative CPU seconds used
`container_cpu_cfs_throttled_seconds_total`	Counter	same	Cumulative seconds container was throttled — key signal for under-sized limits
`container_cpu_cfs_throttled_periods_total`	Counter	same	Number of CFS periods during which throttling occurred
`container_memory_working_set_bytes`	Gauge	same	Active memory (used by eviction manager and HPA)
`container_memory_rss`	Gauge	same	Anonymous memory; excludes file cache
`container_memory_cache`	Gauge	same	Page cache; reclaimable under pressure
`container_oom_events_total`	Counter	same	OOM events within the container cgroup
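`node_memory_MemAvailable_bytes`	Gauge	none	Kernel estimate of available node memory; tracks the `memory.available` eviction signal only approximately, because the kubelet derives that signal from cgroup working set
`kubelet_evictions`	Counter	`eviction_signal`	Cumulative pod evictions per signal type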
`kube_node_status_allocatable`	Gauge	`resource`, `node`	Allocatable resource per node (from kube-state-metrics)
`kube_pod_container_resource_requests`	Gauge	`resource`, `pod`	Requested resources per container (from kube-state-metrics)

Alerting Rules

# Alert: high CPU throttling rate
- alert: ContainerCPUThrottlingHigh
  expr: |
    sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m]))
      by (container, pod, namespace)
    /
    sum(increase(container_cpu_cfs_periods_total{container!=""}[5m]))
      by (container, pod, namespace)
    > 0.25
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} throttled >25% in {{ $labels.namespace }}/{{ $labels.pod }}"

# Alert: node memory near eviction threshold
- alert: NodeMemoryNearEviction
  expr: |
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} memory below 15%"

# Alert: container OOM kills
- alert: ContainerOOMKilled
  expr: |
    increase(container_oom_events_total{container!=""}[5m]) > 0
  labels:
    severity: warning
  annotations:
    summary: "OOMKill in {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }}"

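# Alert: node disk pressure (imagefs)
# Assumes node-exporter and an image filesystem mounted at /var/lib/containerd
# (kubelet_volume_stats_* covers PVC-backed volumes only, not imagefs)
- alert: NodeDiskPressureImageFS
  expr: |
    node_filesystem_avail_bytes{mountpoint="/var/lib/containerd"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/containerd"} < 0.15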
  for: 5m
  labels:
    severity: warning

# Alert: allocatable exhaustion (scheduling failure imminent)
- alert: NodeAllocatableCPULow
  expr: |
    (sum by (node) (kube_node_status_allocatable{resource="cpu"})
      - sum by (node) (kube_pod_container_resource_requests{resource="cpu"}))
    / sum by (node) (kube_node_status_allocatable{resource="cpu"}) < 0.10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} CPU allocatable headroom below 10%"

Troubleshooting Runbooks

Pod OOMKilled — sizing memory limits

# Symptom: pod restarts with OOMKilled, exit code 137

# 1. Check last state
kubectl describe pod my-pod | grep -A5 "Last State"
# Look for: Reason: OOMKilled

# 2. Check working set over time (need Prometheus)
container_memory_working_set_bytes{pod="my-pod", container="app"}

# 3. Check if OOM happened in container or node kernel
# Container OOM: cgroup enforced the limit (most common)
# Node OOM: system ran out of memory (rare with reservations)
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
dmesg | grep -i "oom-killer\|out of memory" | tail -5

# 4. Right-size limit:
# Set limit to p99 of working_set + 20% headroom
# working_set includes: anonymous + file (used) — it's what eviction tracks

# 5. If working_set spikes on startup (JVM, etc.):
# Size the limit for the startup peak, or cap the runtime's own heap (for the JVM, -XX:MaxRAMPercentage)

# 6. Consider VPA (Vertical Pod Autoscaler) for automatic right-sizing
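
Step 4 can be turned into a quick calculation over exported working-set samples. This is a sketch with hypothetical names: it uses a nearest-rank percentile and integer arithmetic, and the sample values (in MiB) are made up.

```python
def recommend_memory_limit(samples, percentile=99, headroom_pct=20):
    """Return the p<percentile> of the samples plus headroom, rounded up."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank percentile: ceil(percentile * n / 100), at least 1.
    rank = max(1, -(-percentile * len(ordered) // 100))
    p_value = ordered[rank - 1]
    # Ceiling division keeps the result an integer.
    return -(-p_value * (100 + headroom_pct) // 100)


# 99 samples at 100 MiB and a single 400 MiB spike.
samples_mib = [100] * 99 + [400]
print(recommend_memory_limit(samples_mib))  # 120
```

The single spike is ignored by the p99, so a workload with rare but real peaks may need a higher percentile or more headroom than this rule of thumb gives.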

High CPU throttling despite low CPU usage

# Symptom: CPU usage appears low but latency is high; throttle metric elevated

# 1. Confirm throttling
kubectl get --raw "/api/v1/nodes/worker-1/proxy/metrics/cadvisor" | \
  grep cpu_cfs_throttled | grep "my-pod"

# Throttle ratio = throttled_periods / total_periods
# > 25% = problem

# 2. Common causes:
# a) CPU limit too low for bursty workloads (e.g., JVM GC spikes)
# b) CFS period too long (100ms default can cause issues for low-latency apps)
# c) Multi-threaded workload spending its whole quota in parallel early in each period

# 3. Fix options:
# a) Raise CPU limit (simplest)
# b) Remove CPU limit entirely (allow bursting to node capacity) — only if workload is trusted
# c) Shorten the CFS period via KubeletConfiguration cpuCFSQuotaPeriod (for example 10ms;
#    requires the CustomCPUCFSQuotaPeriod feature gate; affects all containers on the node)
# d) Use CPU Manager static policy with dedicated cores (eliminates throttling for Guaranteed pods)

# 4. Verify fix:
rate(container_cpu_cfs_throttled_seconds_total{pod="my-pod"}[5m])

Node disk pressure — imagefs filling up

# Symptom: DiskPressure=True condition; pods evicted; new pods fail to schedule

# 1. Check which filesystem is full
df -h /
df -h /var/lib/containerd

# 2. Find what's consuming space
# Large images:
ctr -n k8s.io images ls | sort -k4 -h
# Large container logs:
du -sh /var/log/pods/* | sort -h | tail -20
# Unused images (not referenced by any container):
crictl rmi --prune   # removes unreferenced images
# Image GC cannot be triggered on demand: the kubelet runs it automatically
# once disk usage crosses imageGCHighThresholdPercent

# 3. Adjust image GC thresholds if eviction is too aggressive:
# KubeletConfiguration:
# imageGCHighThresholdPercent: 85  (default; start GC at 85% full)
# imageGCLowThresholdPercent: 80   (default; GC down to 80%)

# 4. Configure log rotation to prevent logs from filling nodefs:
# containerLogMaxSize: "50Mi"   (default: 10Mi)
# containerLogMaxFiles: 5       (default: 5)

# 5. If imagefs = nodefs (same disk):
# Consider mounting /var/lib/containerd on a separate volume

Scheduler reports "Insufficient CPU" but node looks available

# Symptom: pod Pending with event "0/3 nodes are available: 3 Insufficient cpu"

# 1. Check actual allocatable vs sum of requests
kubectl describe node worker-1 | grep -A5 "Allocated resources"
# Shows: cpu requests / allocatable

# 2. The scheduler compares sum(pod requests) against allocatable
# NOT against actual CPU usage! High request sum = scheduling failure
# even if actual utilization is 10%

# 3. Check for pods with very high CPU requests (first container only; mixed units such as 500m and 2 do not sort numerically, so scan the output)
kubectl get pods -A -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,CPU:.spec.containers[0].resources.requests.cpu' \
  | sort -k3 -h | tail -20

# 4. Check reserved resources subtract correctly
kubectl get node worker-1 -o jsonpath='{.status.allocatable.cpu}'
# If lower than expected: check kubeReserved + systemReserved in kubelet config

# 5. Solutions:
# a) Add more nodes (Cluster Autoscaler)
# b) Reduce requests on over-provisioned pods
# c) Use VPA to right-size requests automatically
# d) Increase node size
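
The scheduler's fit check in step 2 is simple arithmetic over requests, which the sketch below reproduces for a single node. The function names and the numbers are hypothetical; only whole-core and millicore CPU quantities are parsed.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert '500m', '2', or '1.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def fits(allocatable: str, existing_requests, new_request: str) -> bool:
    """True if the new pod's CPU request fits in allocatable minus existing requests."""
    free = cpu_millicores(allocatable) - sum(cpu_millicores(r) for r in existing_requests)
    return cpu_millicores(new_request) <= free


# Hypothetical node: 3500m allocatable, 3000m already requested.
existing = ["1", "1500m", "500m"]
print(fits("3500m", existing, "1"))     # False: "Insufficient cpu", whatever the actual usage
print(fits("3500m", existing, "500m"))  # True
```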

Device plugin not advertising GPUs — extended resource missing

# Symptom: nvidia.com/gpu not visible in node capacity; pod Pending

# 1. Check device plugin pod is running
kubectl get pods -n kube-system | grep nvidia-device-plugin

# 2. Check kubelet device plugin socket
ls -la /var/lib/kubelet/device-plugins/
# Expected: kubelet.sock + nvidia_gpu plugin file

# 3. Check kubelet log for device plugin registration
journalctl -u kubelet | grep -i "device plugin\|nvidia\|gpu" | tail -20

# 4. Verify node capacity
kubectl get node worker-1 -o jsonpath='{.status.capacity}'
# Expected: "nvidia.com/gpu": "4" (or however many GPUs)

# 5. Check GPU driver and NVML are installed on node
nvidia-smi
ls /dev/nvidia*

# 6. Restart device plugin DaemonSet pod on the node
kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds \
  --field-selector spec.nodeName=worker-1

Production Best Practices

Set requests and limits on every container: enforce via LimitRange defaults so that no pod is accidentally BestEffort. Missing requests make the scheduler blind; missing limits cause noisy-neighbor interference.
Enforce node allocatable via cgroups: keep pods in enforceNodeAllocatable so that pods can never consume reserved capacity. Add kube-reserved and system-reserved only after profiling daemon usage, because undersized caps throttle or OOM-kill node services.
Monitor CPU throttling continuously: container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 0.25 is your primary signal for under-sized CPU limits. Many latency problems have this as root cause.
Use working_set for memory limit sizing, not RSS: the kubelet eviction manager acts on working set, and RSS omits the active page cache that is charged to the container. Set limits to 1.2× your p99 working set.
Separate imagefs from nodefs: mount /var/lib/containerd on a dedicated volume. This prevents log floods from triggering image eviction, and gives you independent tuning of GC thresholds.
Enable cgroup v2 with systemd driver: cgroup v2 provides accurate memory accounting, Pressure Stall Information (PSI), and memory.min/memory.high protection. Required for the MemoryQoS feature. Ensure the kubelet and containerd both use the systemd cgroup driver.
Use Topology Manager for latency-sensitive workloads: single-numa-node policy with static CPU/memory managers eliminates cross-NUMA penalties for real-time and ML inference workloads. Accept the scheduling strictness as the cost for predictable performance.
Size reservations per node class, not globally: a 4-core 8 Gi node and a 64-core 256 Gi node need very different reservations. Use node group-specific kubelet config files or kubeadm node patches.
Track allocatable headroom per node in Prometheus: alert when CPU or memory headroom drops below 10%. A fully packed node cannot accept emergency pods (log shippers, debuggers) and makes drain/rescheduling painfully slow.
Use VPA for request right-sizing: run VPA with updateMode: "Off" first (recommendations only), observe suggestions for 1–2 weeks, then switch to Auto for non-production and keep manual right-sizing for production. Never let VPA and HPA act on the same metric (CPU or memory) for one workload; if both are needed, drive HPA from custom or external metrics.