Node Resource Management
Node resource management is the set of mechanisms by which Kubernetes tracks, reserves, and enforces CPU, memory, ephemeral storage, and extended resources across the kernel (cgroups), the kubelet, and the scheduler. Getting this right prevents workload interference, protects node stability, and makes HPA/VPA decisions accurate. Getting it wrong causes unexplained OOMKills, CPU throttling, failed evictions, and "Insufficient CPU" scheduling errors despite seemingly available capacity.
Capacity vs Allocatable
Every Node object exposes two resource views:
| Field | What it represents | Who sets it |
|---|---|---|
| status.capacity | Raw hardware/VM resources: CPU cores, RAM, ephemeral disk, device plugins | kubelet reads from OS/cAdvisor at startup |
| status.allocatable | Capacity minus reserved amounts and hard eviction thresholds: the safe schedulable ceiling | kubelet computes from KubeletConfiguration and patches Node |
Reserved Resources
The kubelet subtracts two reservation buckets from capacity before reporting allocatable. These are configured in KubeletConfiguration:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Resources reserved for Kubernetes system daemons (kubelet, CRI, kube-proxy)
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "2Gi"
  pid: "1000"
# Resources reserved for OS system daemons (sshd, journald, systemd, etc.)
systemReserved:
  cpu: "500m"
  memory: "2Gi"
  ephemeral-storage: "2Gi"
  pid: "1000"
# List every cgroup on which the kubelet should enforce reservations
# (an unlisted reservation still reduces allocatable but is not enforced)
enforceNodeAllocatable:
  - pods             # cgroup limit on sum of all pod cgroups
  - kube-reserved    # requires kubeReservedCgroup to be set
  - system-reserved  # requires systemReservedCgroup to be set
# Cgroup paths for enforcement (must pre-exist on node)
kubeReservedCgroup: /kubelet.slice
systemReservedCgroup: /system.slice
Setting kubeReserved and systemReserved without adding them to enforceNodeAllocatable only affects the allocatable calculation shown to the scheduler — it does not create cgroup limits that actually prevent pods from consuming those resources. Under pressure, pods will eat into the reserved space. Always add both to enforceNodeAllocatable in production.
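The subtraction described above can be sketched in a few lines. This is an illustration only: `parse_memory` and `allocatable_memory` are hypothetical helpers (not kubelet code), they handle whole-number quantities with binary suffixes only, and the 16Gi capacity and 200Mi hard eviction threshold are assumed example values.

```python
# Illustrative sketch only: these helpers are hypothetical, not kubelet code.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a whole-number, binary-suffixed quantity such as "2Gi" to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a byte count


def allocatable_memory(capacity: str, kube_reserved: str,
                       system_reserved: str, eviction_hard: str) -> int:
    """Allocatable = capacity - kubeReserved - systemReserved - hard eviction threshold."""
    reserved = sum(parse_memory(q) for q in (kube_reserved, system_reserved, eviction_hard))
    return max(parse_memory(capacity) - reserved, 0)


if __name__ == "__main__":
    # Assumed node: 16Gi RAM, the reservations from the config above, 200Mi hard eviction.
    alloc = allocatable_memory("16Gi", "1Gi", "2Gi", "200Mi")
    print(f"allocatable: {alloc // 2**20} Mi")
```

With these assumed inputs the node advertises 13112 Mi to the scheduler, roughly 3.2 Gi less than its raw capacity.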
Sizing Reservations Correctly
Reservation amounts should be tuned per node type. Undersizing kills the node OS; oversizing wastes schedulable capacity.
| Node RAM | Recommended kube-reserved memory | Recommended system-reserved memory | Total reserved |
|---|---|---|---|
| 4 Gi | 512 Mi | 512 Mi | ~25% of RAM |
| 8 Gi | 1 Gi | 1 Gi | ~25% |
| 16 Gi | 1.5 Gi | 1 Gi | ~16% |
| 32 Gi | 2 Gi | 2 Gi | ~12% |
| 64 Gi | 3 Gi | 2 Gi | ~8% |
| 128 Gi+ | 4–6 Gi | 2–4 Gi | ~5–8% |
For CPU, reserving 50–100m per core for kube-reserved is typical. System processes rarely need more than 500m total on a healthy node.
QoS Classes
Kubernetes assigns every pod a Quality of Service class at admission time, based solely on its container resource requests and limits. QoS determines eviction priority and OOMKill order:
Guaranteed
Every container has both requests and limits set, and requests == limits for both CPU and memory.
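- Evicted last under memory pressure, typically only when system daemons exceed their reservations
- CPU: capped at the limit, which equals the request, so there is no bursting; cores are pinned only under the static CPU Manager policy
- OOMKill: only if container exceeds its memory limit
- cgroup: cpu.shares set, memory.limit_in_bytes enforced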
Burstable
At least one container has a request OR limit set, but not all conditions for Guaranteed are met.
- Evicted after BestEffort when memory pressure occurs
- Can burst CPU above request (up to limit or node capacity)
- OOMKill: if container exceeds its limit, or during pressure
- Most real-world pods land here
BestEffort
No container in the pod has any requests or limits set.
- First to be evicted under memory pressure
- Gets CPU only when no other workload needs it
- First OOMKilled under memory pressure
- Avoid in production; acceptable for batch/dev
# Guaranteed pod example (requests == limits for ALL containers)
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"
# Burstable pod example
resources:
  requests:
    cpu: "250m"
    memory: "128Mi"
  limits:
    cpu: "1" # limit > request = burstable
    memory: "512Mi"
# BestEffort pod example (avoid in production)
resources: {} # no requests or limits at all
# Check QoS class of a running pod:
kubectl get pod my-pod -o jsonpath='{.status.qosClass}'
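Guaranteed pods get their pod cgroups directly under kubepods/; Burstable pods go under kubepods/burstable/ and BestEffort pods under kubepods/besteffort/. The kubelet gives the besteffort cgroup the minimum cpu.shares weight and sizes the burstable cgroup's weight from the sum of its pods' CPU requests, so under contention CPU time is divided in proportion to requests and BestEffort pods receive only what is left over.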
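The classification rules for the three classes can be expressed as a small function. This is a minimal sketch, not the kubelet's implementation: `qos_class` is a hypothetical name, containers are plain dicts, and quantities are compared as strings (so "500m" and "0.5" would wrongly count as different).

```python
def qos_class(containers):
    """Classify a pod from its containers' requests/limits (illustrative sketch).

    Each container is a dict with optional "requests" and "limits" dicts
    keyed by resource name ("cpu", "memory").
    """
    any_resource_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits") or {}
        requests = dict(container.get("requests") or {})
        # A limit without a request implies request == limit.
        for resource, value in limits.items():
            requests.setdefault(resource, value)
        if requests or limits:
            any_resource_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                all_guaranteed = False
    if not any_resource_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


if __name__ == "__main__":
    guaranteed = [{"requests": {"cpu": "500m", "memory": "256Mi"},
                   "limits": {"cpu": "500m", "memory": "256Mi"}}]
    print(qos_class(guaranteed))
```

A single container that breaks the Guaranteed conditions moves the whole pod to Burstable, which is why sidecars without resource specs often change a pod's class unexpectedly.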
CPU Management
Requests vs Limits
| Resource | Request | Limit | Enforcement |
|---|---|---|---|
| CPU | Scheduling constraint; sets cpu.shares weight in cgroup | Sets cpu.cfs_quota_us; container throttled when quota exhausted | CFS bandwidth throttling (soft enforcement) |
| Memory | Scheduling constraint; mapped to memory.min on cgroup v2 only when the MemoryQoS feature gate is enabled | Sets memory.limit_in_bytes (memory.max on cgroup v2); a process is OOMKilled when the limit is exceeded | OOMKill (hard enforcement) |
CPU Throttling and CFS
Linux CFS (Completely Fair Scheduler) implements CPU limits via quota/period:
# CFS parameters set by kubelet for a container with limit=500m
# Period: 100ms (default cpu.cfs_period_us = 100000 microseconds)
# Quota: 50ms (cpu.cfs_quota_us = 50000 = 500m × 100ms)
# Meaning: container can use 50ms of CPU per 100ms window
# Inspect throttling for a container (cgroup v1):
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod//cpu.stat
# nr_periods: total CFS periods run
# nr_throttled: how many periods the container hit its quota
# throttled_time: total nanoseconds throttled
# cgroup v2 equivalent (reports throttled_usec, in microseconds, instead of throttled_time):
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/.../cpu.stat
# Throttle ratio:
# throttle_ratio = nr_throttled / nr_periods
# > 25% throttle ratio suggests the CPU limit is too low
CPU throttling does not show up as high CPU usage — the container appears idle because it is suspended waiting for its next CFS period. It manifests as increased latency, slow request processing, and timeouts at seemingly low CPU utilization. Always check container_cpu_cfs_throttled_seconds_total in Prometheus before assuming CPU limits are correctly sized.
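The quota arithmetic and the throttle ratio discussed above can be checked with a short script. This is a sketch under stated assumptions: `cfs_quota_us` and `throttle_ratio` are hypothetical helper names, and the sample cpu.stat text is made up for illustration.

```python
def cfs_quota_us(limit_millicores: int, period_us: int = 100_000) -> int:
    """CFS quota in microseconds for a CPU limit, e.g. 500m -> 50000 per 100ms period."""
    return limit_millicores * period_us // 1000


def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled.

    Expects the "key value" lines found in a cgroup cpu.stat file.
    """
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


if __name__ == "__main__":
    sample = "nr_periods 1000\nnr_throttled 300\nthrottled_time 4200000000\n"  # made-up values
    print(cfs_quota_us(500), f"{throttle_ratio(sample):.0%}")
```

In this made-up sample the container was throttled in 30% of periods, which is above the 25% guideline and suggests the limit is too low.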
CPU Manager Policy
The kubelet's CPU Manager (GA 1.26) controls whether pods get shared or exclusive CPU cores:
# KubeletConfiguration
cpuManagerPolicy: static # default: "none" (shared pool)
cpuManagerReconcilePeriod: 10s
# "none" (default): all pods share a pool of CPUs via CFS weights
# "static": Guaranteed pods with integer CPU requests get dedicated cores
# For static policy: reserve CPUs for OS/kubelet (never given to pods)
reservedSystemCPUs: "0-1" # cores 0 and 1 always reserved
# OR use kubeReserved.cpu (less precise — doesn't pin specific cores)
# Pod that gets exclusive cores under static policy:
resources:
  requests:
    cpu: "2" # integer request
  limits:
    cpu: "2" # must equal request (Guaranteed QoS)
# Result: kubelet pins this container to 2 specific cores via cpuset cgroup
| cpuManagerPolicy | Who benefits | Trade-off |
|---|---|---|
| none (default) | General workloads; maximizes CPU sharing efficiency | Cache line contention between pods; no NUMA awareness |
| static | Latency-sensitive, CPU-intensive: real-time, HPC, databases | Reduces utilization (reserved cores may be idle); requires Guaranteed QoS |
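Whether a container qualifies for exclusive cores under the static policy follows from two checks: its requests equal its limits, and its CPU request is a whole number of cores. The sketch below is illustrative only. `exclusive_cores` is a hypothetical helper that looks at a single container, whereas the real decision also requires every container in the pod to meet the Guaranteed conditions.

```python
from fractions import Fraction


def parse_cpu(quantity: str) -> Fraction:
    """Parse a CPU quantity such as "2", "0.5", or "1500m" into cores."""
    if quantity.endswith("m"):
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def exclusive_cores(requests: dict, limits: dict) -> int:
    """Number of dedicated cores this container would get, or 0 for the shared pool."""
    if "cpu" not in requests or "cpu" not in limits:
        return 0
    # Memory must also be set with request == limit (compared as strings here).
    if "memory" not in requests or requests["memory"] != limits.get("memory"):
        return 0
    cpu_request = parse_cpu(requests["cpu"])
    if cpu_request != parse_cpu(limits["cpu"]):
        return 0
    # Only whole-core requests are pinned; fractional requests stay shared.
    if cpu_request.denominator != 1 or cpu_request < 1:
        return 0
    return int(cpu_request)


if __name__ == "__main__":
    spec = {"cpu": "2", "memory": "1Gi"}
    print(exclusive_cores(spec, spec))
```

A request of "1500m" returns 0 here even when requests equal limits, which mirrors why fractional Guaranteed pods still run in the shared pool.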
Memory Management
Memory Limits and OOMKill
When a container exceeds its memory limit, the Linux OOM killer terminates one of its processes. The kubelet detects this via PLEG and marks the container as OOMKilled. The exit code is 137 (SIGKILL = 9, exit = 128 + 9).
# Check OOMKill history for a pod
kubectl describe pod my-pod | grep -A5 "OOMKilled\|Exit Code\|Last State"
# Check system OOM events (node-level)
dmesg | grep -i "oom\|out of memory"
journalctl -k | grep -i "oom-killer\|out of memory"
# cgroup v2: check memory events
cat /sys/fs/cgroup/kubepods.slice/.../memory.events
# oom: times the cgroup hit its limit and entered the OOM path
# oom_kill: number of processes killed by the OOM killer
# OOM score adjustment set by the kubelet (lower = less likely to be killed by the kernel)
# Guaranteed pods: oom_score_adj = -997 (killed last, only under extreme pressure)
# Burstable pods: oom_score_adj = 2 to 999 (a larger memory request relative to node capacity gives a lower score)
# BestEffort pods: oom_score_adj = 1000 (first to die)
cat /proc/$(pgrep -n nginx)/oom_score_adj
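The per-class score adjustments can be reproduced with the formula given in the Kubernetes node-pressure eviction documentation, min(max(2, 1000 - 1000 × request / capacity), 999) for Burstable pods. The sketch below assumes that formula; `oom_score_adj` is a hypothetical helper, and the exact clamping bounds may differ slightly between kubelet versions.

```python
def oom_score_adj(qos: str, memory_request_bytes: int = 0,
                  node_memory_bytes: int = 1) -> int:
    """Approximate the oom_score_adj the kubelet assigns per QoS class (sketch)."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    # Burstable: the larger the request relative to the node, the lower the score.
    raw = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
    return min(max(2, raw), 999)


if __name__ == "__main__":
    gib = 2**30
    # Assumed example: a Burstable container requesting 4Gi on a 16Gi node.
    print(oom_score_adj("Burstable", 4 * gib, 16 * gib))
```

Under this formula a Burstable container requesting a quarter of node memory scores 750, so it is killed before a Guaranteed container but after any BestEffort one.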
Memory Manager Policy
The Memory Manager (beta since 1.22, GA in 1.32) works alongside the CPU Manager to provide NUMA-aware memory allocation for Guaranteed pods:
# KubeletConfiguration
memoryManagerPolicy: Static # default: "None"
reservedMemory:
  - numaNode: 0
    limits:
      memory: "1Gi"
      hugepages-1Gi: "2Gi"
# Pods that get NUMA-pinned memory (requires Guaranteed QoS; pair with integer CPU requests and the static CPU Manager for full alignment):
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
    hugepages-1Gi: "2Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
    hugepages-1Gi: "2Gi"
# kubelet allocates this pod's memory exclusively from a single NUMA node
# → eliminates cross-NUMA memory access latency for HPC/real-time workloads
Eviction Manager
The kubelet's Eviction Manager continuously monitors node resource consumption and evicts pods before the node reaches a critical state. Covered in depth in kubelet: Eviction Manager; key configuration reference here:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard eviction thresholds (immediate eviction, no grace period)
evictionHard:
  memory.available: "200Mi" # total node memory available
  nodefs.available: "10%" # rootfs free space
  nodefs.inodesFree: "5%" # rootfs inodes
  imagefs.available: "15%" # container image filesystem
  imagefs.inodesFree: "5%"
  pid.available: "10%" # available PID count
# Soft eviction thresholds (eviction begins only after grace period)
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "15%"
  imagefs.available: "20%"
# Grace periods for soft eviction (how long threshold must be breached)
evictionSoftGracePeriod:
  memory.available: "2m"
  nodefs.available: "5m"
  imagefs.available: "5m"
# Max pod termination grace period, in seconds, allowed during soft eviction
evictionMaxPodGracePeriod: 30
# Minimum amount of resource to reclaim on each eviction round
evictionMinimumReclaim:
  memory.available: "100Mi"
  nodefs.available: "500Mi"
  imagefs.available: "500Mi"
Eviction Signals
| Signal | Description | Derived from |
|---|---|---|
| memory.available | Node memory still available, computed from the working set rather than from free memory | capacity - workingSet (cAdvisor) |
| nodefs.available | Free space on node root filesystem | df / |
| nodefs.inodesFree | Free inodes on root filesystem | df -i / |
| imagefs.available | Free space on container image filesystem | df /var/lib/containerd |
| imagefs.inodesFree | Free inodes on image filesystem | df -i /var/lib/containerd |
| pid.available | Available PIDs on the node | /proc/sys/kernel/pid_max - current usage |
| allocatableMemory.available | Available memory within pod cgroup (not OS-level) | cgroup memory.usage_in_bytes within kubepods |
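A threshold can be an absolute quantity ("200Mi") or a percentage of capacity ("10%"), and a signal is breached when the observed value drops below it. The sketch below illustrates that comparison for memory.available. The helper names are hypothetical, only whole-number quantities are handled, and the node sizes are assumed example values.

```python
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def threshold_bytes(threshold: str, capacity_bytes: int) -> int:
    """Resolve a threshold such as "200Mi" or "10%" to an absolute byte count."""
    if threshold.endswith("%"):
        return capacity_bytes * int(threshold[:-1]) // 100
    for suffix, factor in _SUFFIXES.items():
        if threshold.endswith(suffix):
            return int(threshold[: -len(suffix)]) * factor
    return int(threshold)


def memory_available(capacity_bytes: int, working_set_bytes: int) -> int:
    """memory.available as described in the table: capacity minus working set."""
    return capacity_bytes - working_set_bytes


def is_breached(available_bytes: int, threshold: str, capacity_bytes: int) -> bool:
    """True when the signal has fallen below its eviction threshold."""
    return available_bytes < threshold_bytes(threshold, capacity_bytes)


if __name__ == "__main__":
    capacity = 8 * 2**30                   # assumed 8Gi node
    working_set = capacity - 150 * 2**20   # assumed: only 150Mi left
    print(is_breached(memory_available(capacity, working_set), "200Mi", capacity))
```

With 150Mi left against a 200Mi hard threshold the check reports a breach, which is the point at which the kubelet would start evicting.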
Eviction Order
When a threshold is breached, the kubelet evicts pods in this priority order:
1. BestEffort pods, and Burstable pods whose usage exceeds their requests
2. Guaranteed pods, and Burstable pods whose usage is within their requests
Within each group, pods are ranked by pod priority (lowest first) and then by how far consumption exceeds the request (largest over-consumption first).
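As a rough model, the ranking for memory pressure can be written as a sort. The sketch below follows the ordering in the Kubernetes node-pressure eviction documentation (pods over their request first, then lower priority, then larger over-consumption). `PodStats` and `eviction_order` are hypothetical names, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class PodStats:
    name: str
    priority: int
    memory_request: int  # bytes requested (0 for BestEffort)
    memory_usage: int    # current working set in bytes


def eviction_order(pods):
    """Return pods sorted from first-evicted to last-evicted (illustrative sketch)."""
    def key(pod):
        over = pod.memory_usage - pod.memory_request
        exceeds = 0 if over > 0 else 1  # pods above their request sort first
        return (exceeds, pod.priority, -over)
    return sorted(pods, key=key)


if __name__ == "__main__":
    pods = [
        PodStats("within-request", 0, 100, 50),
        PodStats("besteffort", 0, 0, 80),
        PodStats("high-priority-over", 1000, 100, 300),
        PodStats("slightly-over", 0, 100, 120),
    ]
    print([p.name for p in eviction_order(pods)])
```

In this example the BestEffort pod goes first, and the high-priority pod is evicted after both low-priority over-consumers even though it is furthest over its request.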
Ephemeral Storage Management
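Ephemeral storage is local disk used by container writable layers, emptyDir volumes, and container logs. hostPath volumes also consume node disk, but the kubelet does not count them against a pod's ephemeral-storage requests or limits. Ephemeral storage is managed separately from persistent volumes.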
# Container resource requests/limits for ephemeral storage
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi" # pod evicted if exceeded
# Eviction thresholds apply to:
# 1. nodefs (root filesystem): container logs + emptyDir on rootfs
# 2. imagefs (image filesystem): container writable layers (overlayfs upper)
# Check ephemeral storage usage per pod
kubectl get --raw "/api/v1/nodes/worker-1/proxy/stats/summary" # per-pod usage appears under pods[].ephemeral-storage
kubectl describe pod my-pod | grep -i "ephemeral\|storage"
# Node-level usage
df -h /
df -h /var/lib/containerd
du -sh /var/log/pods/
# cAdvisor provides per-container ephemeral metrics:
# container_fs_usage_bytes{container!="",image!=""}
When containerd stores images on a dedicated disk (/var/lib/containerd on a separate mount), Kubernetes tracks imagefs and nodefs as separate signals. This is the recommended production setup — it prevents a flood of container log writes from triggering image eviction, and vice versa. Verify the split with crictl imagefsinfo.
Extended Resources and Device Plugins
Extended resources allow nodes to advertise hardware beyond the standard CPU/memory/storage: GPUs, FPGAs, high-speed network interfaces, custom ASICs.
# Node advertises extended resources via kubelet device plugin or manual patch
# Device plugin (preferred): runs as DaemonSet, registers via kubelet socket
# /var/lib/kubelet/device-plugins/kubelet.sock
# Manual extended resource advertisement (for testing only):
kubectl patch node worker-1 --subresource=status --type=json \
-p '[{"op":"add","path":"/status/capacity/example.com~1foo","value":"4"}]'
# Requesting extended resources in a pod:
resources:
  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1" # must equal request; no fractional extended resources
# GPU device plugin example (NVIDIA):
# DaemonSet: nvidia-device-plugin-daemonset
# Advertises: nvidia.com/gpu =
# Kubelet passes GPU device node paths to container via device plugin response
# Container gets: /dev/nvidia0, /dev/nvidia-uvm, etc.
Device Plugin Protocol
| Device Plugin gRPC RPC | Direction | Purpose |
|---|---|---|
| Register | plugin → kubelet | Register resource name and plugin socket path with kubelet |
| ListAndWatch | kubelet → plugin (stream) | Plugin streams available device list; kubelet updates Node capacity |
| Allocate | kubelet → plugin | On pod admission: kubelet asks plugin to allocate specific device IDs; plugin returns env vars, mounts, device nodes to inject into container |
| GetPreferredAllocation | kubelet → plugin | Optional: plugin suggests which device IDs to allocate (e.g., GPU topology-aware) |
| PreStartContainer | kubelet → plugin | Optional: plugin performs setup just before container starts |
| GetDevicePluginOptions | kubelet → plugin | Discover which optional RPCs the plugin supports |
Topology Manager
The Topology Manager (GA 1.27) coordinates CPU Manager, Memory Manager, and Device Plugins to ensure that a pod's resources are all allocated from the same NUMA node, minimizing cross-NUMA memory latency and PCIe transfer overhead for GPU/NIC workloads.
# KubeletConfiguration
topologyManagerPolicy: single-numa-node # default: "none"
# Policies:
# none - no NUMA alignment (default)
# best-effort - try NUMA alignment; admit the pod anyway if impossible
# restricted - try NUMA alignment; reject the pod at admission if impossible
# single-numa-node - allocate ALL resources from exactly one NUMA node or reject the pod
topologyManagerScope: pod # default: "container"
# container: align each container independently (more flexible)
# pod: align all containers of the pod to a common set of NUMA nodes (stricter)
For Topology Manager to be effective, you must also set cpuManagerPolicy: static and memoryManagerPolicy: Static, and the pod must be Guaranteed QoS with integer CPU requests. The CPU Manager, Memory Manager, and Device Manager act as hint providers: each reports which NUMA nodes can satisfy its part of the request, and the Topology Manager merges those hints before the pod is admitted.
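The alignment idea can be modeled as intersecting the NUMA-node sets that each resource manager could use. The sketch below is a deliberate simplification, not the kubelet's algorithm: `merge_hints` and `fits_single_numa_node` are hypothetical names, hints are plain sets of NUMA node IDs, and the real implementation also tracks which hints are preferred.

```python
from itertools import product


def merge_hints(provider_hints):
    """Pick the narrowest non-empty NUMA set common to one hint from each provider.

    provider_hints: one list per provider (CPU, memory, devices), each holding
    candidate sets of NUMA node IDs that could satisfy that provider's request.
    """
    if not provider_hints:
        return None
    best = None
    for combination in product(*provider_hints):
        merged = frozenset.intersection(*map(frozenset, combination))
        if merged and (best is None or len(merged) < len(best)):
            best = merged
    return best


def fits_single_numa_node(provider_hints) -> bool:
    """single-numa-node admits only when everything fits on exactly one node."""
    merged = merge_hints(provider_hints)
    return merged is not None and len(merged) == 1


if __name__ == "__main__":
    cpu_hints = [{0}, {1}, {0, 1}]   # CPUs are free on either node
    gpu_hints = [{1}]                # the only free GPU hangs off node 1
    print(sorted(merge_hints([cpu_hints, gpu_hints])))
```

Here the only placement that satisfies both providers is NUMA node 1, so a single-numa-node policy would admit the pod; if the CPU hints offered only node 0, it would be rejected.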
PID Limits
Kubernetes supports two independent PID limit mechanisms:
Node-level PID limit
Prevents all pods on the node from collectively consuming more than a configured fraction of the kernel PID limit.
# KubeletConfiguration
systemReserved:
  pid: "1000" # PIDs reserved for system daemons
# evictionHard:
#   pid.available: "10%"
Per-pod PID limit
Limits the number of PIDs that any single pod can create. Prevents a PID fork bomb from exhausting the node.
# KubeletConfiguration
podPidsLimit: 4096 # max PIDs per pod
# Container-level via cgroup pids.max
# Check: cat /sys/fs/cgroup/.../pids.max
HugePages
HugePages (2Mi or 1Gi) reduce TLB pressure for memory-intensive applications. Kubernetes treats them as schedulable resources:
# Node reports hugepage capacity in status.capacity:
# hugepages-2Mi: "4Gi" (2048 × 2Mi pages)
# hugepages-1Gi: "8Gi"
# Pre-allocate on node (must be done before kubelet starts):
echo 2048 > /proc/sys/vm/nr_hugepages # 2Mi pages
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages # 1Gi pages
# Make persistent via /etc/sysctl.conf:
vm.nr_hugepages = 2048
# Pod requesting hugepages:
resources:
  requests:
    hugepages-2Mi: "512Mi" # request 256 × 2Mi pages
    memory: "512Mi" # a cpu or memory request must accompany hugepages
  limits:
    hugepages-2Mi: "512Mi"
    memory: "512Mi"
# To use the pages through hugetlbfs, mount an emptyDir volume with
# medium: HugePages (for example at /dev/hugepages in the container)
Unlike regular memory, hugepages cannot be overcommitted. Requests must equal limits, and a pod stays Pending if no node has enough pre-allocated hugepages. Pre-allocate at boot: allocation at runtime can fall short when memory is fragmented, and the shortfall is easy to miss. Reserve hugepages in the VM/instance startup script.
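Because hugepages are allocated in whole pages, a request only makes sense as a multiple of the page size. The sketch below does that arithmetic; `parse_size` and `hugepage_count` are hypothetical helpers that handle whole-number binary quantities only.

```python
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_size(quantity: str) -> int:
    """Convert a whole-number quantity such as "512Mi" to bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def hugepage_count(total: str, page_size: str) -> int:
    """Number of pages needed for a hugepages request of the given total size."""
    total_bytes, page_bytes = parse_size(total), parse_size(page_size)
    if total_bytes % page_bytes:
        raise ValueError(f"{total} is not a multiple of the {page_size} page size")
    return total_bytes // page_bytes


if __name__ == "__main__":
    print(hugepage_count("512Mi", "2Mi"))  # pages behind the pod request above
    print(hugepage_count("4Gi", "2Mi"))    # pages to pre-allocate for 4Gi capacity
```

The same calculation gives the value to write to nr_hugepages when sizing the node's pre-allocation.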
ResourceQuota and LimitRange
While not strictly node-level, these admission-time objects interact directly with node resource accounting:
# LimitRange: sets defaults and enforces min/max per container/pod/namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-team
spec:
  limits:
    - type: Container
      default: # applied if container has no limits set
        cpu: "500m"
        memory: "256Mi"
      defaultRequest: # applied if container has no requests set
        cpu: "100m"
        memory: "128Mi"
      max: # no container may exceed this
        cpu: "4"
        memory: "4Gi"
      min: # no container may request less than this
        cpu: "50m"
        memory: "64Mi"
    - type: PersistentVolumeClaim
      max:
        storage: "50Gi"
# ResourceQuota: limits total resource consumption per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: my-team
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "50"
    persistentvolumeclaims: "20"
    requests.storage: "500Gi"
    count/deployments.apps: "20"
Without a LimitRange, developers who omit resource specs create BestEffort pods that get evicted first and can starve other workloads. Setting a namespace LimitRange ensures every pod gets at least a minimal request and limit — moving all pods to at least Burstable QoS — without requiring developers to specify resources manually.
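The defaulting behaviour can be sketched as a function over a container's resources. This is an illustration under stated assumptions: `apply_limit_range` is a hypothetical name, it models only the default and defaultRequest fields (not min/max validation), and it assumes that a container with a limit but no request gets a request equal to that limit rather than the namespace default.

```python
def apply_limit_range(container: dict, default: dict, default_request: dict) -> dict:
    """Return the effective requests/limits after LimitRange defaults (sketch)."""
    user_limits = container.get("limits") or {}
    user_requests = container.get("requests") or {}
    # Limits: explicit values win, otherwise the LimitRange "default" applies.
    limits = {**default, **user_limits}
    requests = {}
    for resource in set(default_request) | set(user_requests) | set(user_limits):
        if resource in user_requests:
            requests[resource] = user_requests[resource]
        elif resource in user_limits:
            requests[resource] = user_limits[resource]  # request follows explicit limit
        else:
            requests[resource] = default_request[resource]
    return {"requests": requests, "limits": limits}


if __name__ == "__main__":
    default = {"cpu": "500m", "memory": "256Mi"}
    default_request = {"cpu": "100m", "memory": "128Mi"}
    print(apply_limit_range({}, default, default_request))
```

A container with no resource spec ends up with requests below its limits, which places the pod in Burstable rather than BestEffort.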
cgroup v2
cgroup v2 is the unified hierarchy model that replaces the split v1 hierarchy (/sys/fs/cgroup/cpu/, /sys/fs/cgroup/memory/, etc.) with a single tree at /sys/fs/cgroup/.
| Feature | cgroup v1 | cgroup v2 |
|---|---|---|
| Hierarchy | Separate trees per subsystem | Single unified hierarchy |
| Memory accounting | Can miss inter-container charges | Accurate memory accounting with memory.stat |
| Pressure stall info | Not available | cpu.pressure, memory.pressure, io.pressure (PSI) |
| CPU throttling | cpu.cfs_quota_us | cpu.max (same semantics, new location) |
| Memory limits | memory.limit_in_bytes | memory.max (hard) + memory.high (soft) |
| Memory protection | memory.soft_limit_in_bytes (unreliable) | memory.min (guaranteed) + memory.low (best-effort) |
| Kubernetes support | Default in older releases; in maintenance mode since 1.31 | GA since 1.25; used automatically when the node boots with cgroup v2 (kernel 5.8+ recommended) and the container runtime supports it |
| cgroupDriver | cgroupfs or systemd | systemd strongly recommended |
# Verify cgroup v2 is active
stat -fc %T /sys/fs/cgroup/
# Expected: cgroup2fs (v2)
# "tmpfs" means v1 hybrid or v1 only
# Check kubelet and containerd both use systemd driver
cat /var/lib/kubelet/config.yaml | grep cgroupDriver
# Expected: cgroupDriver: systemd
crictl info | grep cgroupDriver
# Expected: "cgroupDriver": "systemd"
# cgroup v2: Kubernetes-relevant paths
# /sys/fs/cgroup/kubepods.slice/ — all pod cgroups
# /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/ — burstable class
# Guaranteed pods have no class slice: their pod slices sit directly under /sys/fs/cgroup/kubepods.slice/
# /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/ — besteffort class
# Per-pod cgroup: .../pod.slice/
# Per-container: .../pod.slice//
Resource Observability
Key Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| container_cpu_usage_seconds_total | Counter | container, pod, namespace | Cumulative CPU seconds used |
| container_cpu_cfs_throttled_seconds_total | Counter | same | Cumulative seconds container was throttled; key signal for under-sized limits |
| container_cpu_cfs_throttled_periods_total | Counter | same | Number of CFS periods during which throttling occurred |
| container_memory_working_set_bytes | Gauge | same | Active memory (used by eviction manager and HPA) |
| container_memory_rss | Gauge | same | Anonymous memory; excludes file cache |
| container_memory_cache | Gauge | same | Page cache; reclaimable under pressure |
| container_oom_events_total | Counter | same | OOM events within the container cgroup |
| node_memory_MemAvailable_bytes | Gauge | - | Kernel estimate of available node memory; approximates the memory.available eviction signal, which the kubelet derives from cgroup statistics |
| kubelet_evictions | Counter | eviction_signal | Evictions triggered per signal type |
| kube_node_status_allocatable | Gauge | resource, node | Allocatable resource per node (from kube-state-metrics) |
| kube_pod_container_resource_requests | Gauge | resource, pod | Requested resources per container (from kube-state-metrics) |
Alerting Rules
# Alert: high CPU throttling rate
- alert: ContainerCPUThrottlingHigh
  expr: |
    sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m]))
      by (container, pod, namespace)
    /
    sum(increase(container_cpu_cfs_periods_total{container!=""}[5m]))
      by (container, pod, namespace)
    > 0.25
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} throttled >25% in {{ $labels.namespace }}/{{ $labels.pod }}"
# Alert: node memory near eviction threshold
- alert: NodeMemoryNearEviction
  expr: |
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} memory below 15%"
# Alert: container OOM kills
- alert: ContainerOOMKilled
  expr: |
    increase(container_oom_events_total{container!=""}[5m]) > 0
  labels:
    severity: warning
  annotations:
    summary: "OOMKill in {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }}"
# Alert: node disk pressure (imagefs)
# Assumes node_exporter metrics and a dedicated imagefs mount at /var/lib/containerd
- alert: NodeDiskPressureImageFS
  expr: |
    node_filesystem_avail_bytes{mountpoint="/var/lib/containerd"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/containerd"} < 0.15
  for: 5m
  labels:
    severity: warning
# Alert: allocatable exhaustion (scheduling failure imminent)
- alert: NodeAllocatableCPULow
  expr: |
    (sum by (node) (kube_node_status_allocatable{resource="cpu"})
      - sum by (node) (kube_pod_container_resource_requests{resource="cpu"}))
    / sum by (node) (kube_node_status_allocatable{resource="cpu"}) < 0.10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} CPU allocatable headroom below 10%"
Troubleshooting Runbooks
Pod OOMKilled — sizing memory limits
# Symptom: pod restarts with OOMKilled, exit code 137
# 1. Check last state
kubectl describe pod my-pod | grep -A5 "Last State"
# Look for: Reason: OOMKilled
# 2. Check working set over time (need Prometheus)
container_memory_working_set_bytes{pod="my-pod", container="app"}
# 3. Check if OOM happened in container or node kernel
# Container OOM: cgroup enforced the limit (most common)
# Node OOM: system ran out of memory (rare with reservations)
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
dmesg | grep -i "oom-killer\|out of memory" | tail -5
# 4. Right-size limit:
# Set limit to p99 of working_set + 20% headroom
# working_set includes: anonymous + file (used) — it's what eviction tracks
# 5. If working_set spikes on startup (JVM, etc.):
# Size the limit for the startup peak, or reduce the peak itself (e.g., cap the JVM heap)
# 6. Consider VPA (Vertical Pod Autoscaler) for automatic right-sizing
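The sizing rule in step 4 (p99 of working set plus 20% headroom) is easy to compute from exported samples. The sketch below assumes the samples are working-set readings in bytes pulled from Prometheus by some other means. `recommend_memory_limit` is a hypothetical helper, and it uses the nearest-rank definition of the 99th percentile, which is one of several common definitions.

```python
def recommend_memory_limit(samples, headroom_percent: int = 20) -> int:
    """Suggest a memory limit: nearest-rank p99 of the samples plus headroom.

    Integer arithmetic is used throughout so results are exact for byte counts.
    """
    if not samples:
        raise ValueError("at least one working-set sample is required")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100      # ceil(0.99 * n)
    p99 = ordered[rank - 1]
    return (p99 * (100 + headroom_percent) + 99) // 100  # round up


if __name__ == "__main__":
    mib = 2**20
    # Made-up samples: working set climbing from 1 MiB to 100 MiB.
    samples = [n * mib for n in range(1, 101)]
    print(recommend_memory_limit(samples) // mib, "MiB")
```

Using a percentile instead of the maximum keeps a single outlier from inflating the limit, while the headroom absorbs normal variation above the p99.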
High CPU throttling despite low CPU usage
# Symptom: CPU usage appears low but latency is high; throttle metric elevated
# 1. Confirm throttling
kubectl get --raw "/api/v1/nodes/worker-1/proxy/metrics/cadvisor" | \
grep cpu_cfs_throttled | grep "my-pod"
# Throttle ratio = throttled_periods / total_periods
# > 25% = problem
# 2. Common causes:
# a) CPU limit too low for bursty workloads (e.g., JVM GC spikes)
# b) CFS period too long (100ms default can cause issues for low-latency apps)
# c) Multiple containers in pod competing for shared CFS budget
# 3. Fix options:
# a) Raise CPU limit (simplest)
# b) Remove CPU limit entirely (allow bursting to node capacity) — only if workload is trusted
# c) Shorten the CFS period for all containers on the node via KubeletConfiguration
#    (needs the CustomCPUCFSQuotaPeriod feature gate):
#    cpuCFSQuotaPeriod: "10ms"
# d) Use CPU Manager static policy with dedicated cores (Guaranteed pods no longer compete with other pods for CPU time)
# 4. Verify fix:
rate(container_cpu_cfs_throttled_seconds_total{pod="my-pod"}[5m])
Node disk pressure — imagefs filling up
# Symptom: DiskPressure=True condition; pods evicted; new pods fail to schedule
# 1. Check which filesystem is full
df -h /
df -h /var/lib/containerd
# 2. Find what's consuming space
# Large images:
ctr -n k8s.io images ls | sort -k4 -h
# Large container logs:
du -sh /var/log/pods/* | sort -h | tail -20
# Unused images (not referenced by any container):
crictl rmi --prune # removes unreferenced images
# There is no supported endpoint that forces kubelet image GC; it runs on its own
# once disk usage crosses imageGCHighThresholdPercent (see step 3)
# 3. Adjust image GC thresholds if eviction is too aggressive:
# KubeletConfiguration:
# imageGCHighThresholdPercent: 85 (default; start GC at 85% full)
# imageGCLowThresholdPercent: 80 (default; GC down to 80%)
# 4. Configure log rotation to prevent logs from filling nodefs:
# containerLogMaxSize: "50Mi" (default: 10Mi)
# containerLogMaxFiles: 5 (default: 5)
# 5. If imagefs = nodefs (same disk):
# Consider mounting /var/lib/containerd on a separate volume
Scheduler reports "Insufficient CPU" but node looks available
# Symptom: pod Pending with event "0/3 nodes are available: 3 Insufficient cpu"
# 1. Check actual allocatable vs sum of requests
kubectl describe node worker-1 | grep -A5 "Allocated resources"
# Shows: cpu requests / allocatable
# 2. The scheduler compares sum(pod requests) against allocatable
# NOT against actual CPU usage! High request sum = scheduling failure
# even if actual utilization is 10%
# 3. Check for pods with very high CPU requests
kubectl get pods -A -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,CPU:.spec.containers[0].resources.requests.cpu' \
| sort -k3 -h | tail -20
# 4. Check reserved resources subtract correctly
kubectl get node worker-1 -o jsonpath='{.status.allocatable.cpu}'
# If lower than expected: check kubeReserved + systemReserved in kubelet config
# 5. Solutions:
# a) Add more nodes (Cluster Autoscaler)
# b) Reduce requests on over-provisioned pods
# c) Use VPA to right-size requests automatically
# d) Increase node size
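The key point in step 2, that the scheduler compares summed requests with allocatable and ignores live usage, can be shown with a small check. The sketch below is illustrative only: `parse_cpu_millis` and `fits_on_node` are hypothetical helpers, and the node and pod values are made up.

```python
from decimal import Decimal


def parse_cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as "500m", "2", or "0.3" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)  # Decimal avoids float rounding on "0.3"


def fits_on_node(allocatable: str, existing_requests, new_request: str) -> bool:
    """True if the new pod's CPU request fits within remaining allocatable CPU."""
    used = sum(parse_cpu_millis(r) for r in existing_requests)
    return used + parse_cpu_millis(new_request) <= parse_cpu_millis(allocatable)


if __name__ == "__main__":
    # Assumed node: 3500m allocatable, with pods already requesting 2 and 1 CPU.
    print(fits_on_node("3500m", ["2", "1"], "500m"))
    print(fits_on_node("3500m", ["2", "1"], "750m"))
```

The second pod is rejected with "Insufficient cpu" even if the existing pods are idle, because only their requests count.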
Device plugin not advertising GPUs — extended resource missing
# Symptom: nvidia.com/gpu not visible in node capacity; pod Pending
# 1. Check device plugin pod is running
kubectl get pods -n kube-system | grep nvidia-device-plugin
# 2. Check kubelet device plugin socket
ls -la /var/lib/kubelet/device-plugins/
# Expected: kubelet.sock + nvidia_gpu plugin file
# 3. Check kubelet log for device plugin registration
journalctl -u kubelet | grep -i "device plugin\|nvidia\|gpu" | tail -20
# 4. Verify node capacity
kubectl get node worker-1 -o jsonpath='{.status.capacity}'
# Expected: "nvidia.com/gpu": "4" (or however many GPUs)
# 5. Check GPU driver and NVML are installed on node
nvidia-smi
ls /dev/nvidia*
# 6. Restart device plugin DaemonSet pod on the node
kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds \
--field-selector spec.nodeName=worker-1
Production Best Practices
- Set requests and limits on every container: enforce via LimitRange defaults so that no pod is accidentally BestEffort. Missing requests make the scheduler blind; missing limits cause noisy-neighbor interference.
- Enforce node allocatable via cgroups: add kube-reserved, system-reserved, and pods to enforceNodeAllocatable. Without enforcement, reserved values are accounting fiction.
- Monitor CPU throttling continuously: container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 0.25 is your primary signal for under-sized CPU limits. Many latency problems have this as root cause.
- Use working_set for memory limit sizing, not RSS: the kubelet eviction manager acts on working set, which includes active page cache that RSS leaves out. Set limits to 1.2× your p99 working set.
- Separate imagefs from nodefs: mount /var/lib/containerd on a dedicated volume. This prevents log floods from triggering image eviction, and gives you independent tuning of GC thresholds.
- Enable cgroup v2 with systemd driver: cgroup v2 provides accurate memory accounting, Pressure Stall Information (PSI), and memory.min/memory.high protection. Required for the MemoryQoS feature. Ensure kubelet and containerd both use cgroupDriver: systemd.
- Use Topology Manager for latency-sensitive workloads: single-numa-node policy with static CPU/memory managers eliminates cross-NUMA penalties for real-time and ML inference workloads. Accept the scheduling strictness as the cost for predictable performance.
- Size reservations per node class, not globally: a 4-core 8 Gi node and a 64-core 256 Gi node need very different reservations. Use node group-specific kubelet config files or kubeadm node patches.
- Track allocatable headroom per node in Prometheus: alert when CPU or memory headroom drops below 10%. A fully-packed node cannot accept emergency pods (log shippers, debuggers) and makes drain/rescheduling painfully slow.
- Use VPA for request right-sizing: run VPA in recommendation-only mode first (updateMode: "Off", no actual updates), observe suggestions for 1–2 weeks, then switch to Auto for non-production and maintain manual right-sizing for production. Never let VPA and HPA act on the same resource metric (CPU or memory) for the same workload.