Resource Management
Requests, limits, QoS classes, LimitRanges, ResourceQuotas, and right-sizing
Resource management in Kubernetes operates at two distinct layers: the scheduler uses resource requests to decide where pods land, and the Linux kernel uses resource limits to constrain what running containers can consume. Understanding this separation — and the mechanisms behind each — is essential for building clusters that are both highly utilized and operationally stable.
Requests vs Limits
| Property | Requests | Limits |
|---|---|---|
| Used by | kube-scheduler (pod placement) | Linux kernel cgroups (runtime enforcement) |
| CPU mechanism | CFS cpu.shares (proportional) | CFS cpu.cfs_quota_us (hard cap) |
| Memory mechanism | OOM score adjustment | cgroup memory.limit_in_bytes (hard kill) |
| Effect if exceeded | Pod won't schedule (Pending) when its requests exceed what any node has free | CPU: throttled; Memory: OOM killed |
| Required? | No (but strongly recommended) | No (but required for Guaranteed QoS) |
| Node overcommit | Sum of requests ≤ allocatable | Sum of limits may far exceed capacity |
CPU: CFS Shares and Quota
CPU Requests → CFS Shares
CPU requests map to Linux CFS (Completely Fair Scheduler) cpu.shares. Shares are proportional — a pod with requests.cpu: 1000m gets twice as much CPU time as one with requests.cpu: 500m when the node is contended. When the node is idle, any container can use all available CPU regardless of its request.
cpu.shares = request_millicores × 1024 / 1000
→ requests.cpu: 500m → cpu.shares = 512
→ requests.cpu: 1 → cpu.shares = 1024
→ requests.cpu: 250m → cpu.shares = 256
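The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not the kubelet's code: the function name `milli_cpu_to_shares` is hypothetical, and the clamping bounds (2 and 262144) are stated here as assumptions about how very small and very large requests are handled on cgroup v1.

```python
MIN_SHARES = 2          # assumed lower bound applied to tiny or missing requests
MAX_SHARES = 262144     # assumed upper bound for cpu.shares
SHARES_PER_CPU = 1024
MILLI_PER_CPU = 1000


def milli_cpu_to_shares(millicores: int) -> int:
    """Convert a CPU request in millicores to a cgroup v1 cpu.shares value."""
    if millicores <= 0:
        return MIN_SHARES
    shares = millicores * SHARES_PER_CPU // MILLI_PER_CPU
    return max(MIN_SHARES, min(shares, MAX_SHARES))


for request in (500, 1000, 250):
    print(f"requests.cpu: {request}m -> cpu.shares = {milli_cpu_to_shares(request)}")
```

Because shares are relative weights, only the ratio between containers matters: 1024 versus 512 yields a 2:1 split of contended CPU time.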
CPU Limits → CFS Quota (Throttling Trap)
CPU limits map to cpu.cfs_quota_us and cpu.cfs_period_us. Every 100ms (the default period), each container is allocated a quota of CPU time equal to its limit multiplied by the period, so a 500m limit yields 50ms of CPU time per 100ms. If a container exhausts its quota before the period ends, it is throttled (suspended) until the next period, even if CPUs are otherwise idle.
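To make the quota arithmetic concrete, here is a small Python sketch. It assumes an idealized container whose threads are all CPU-bound and run in parallel for the whole period; the function names are hypothetical and exist only for this illustration.

```python
from typing import Optional

DEFAULT_PERIOD_US = 100_000  # 100 ms, the default CFS period


def cfs_quota_us(limit_millicores: int, period_us: int = DEFAULT_PERIOD_US) -> int:
    """CPU time (in microseconds) a container may use per period."""
    return limit_millicores * period_us // 1000


def throttled_after_ms(limit_millicores: int, busy_threads: int,
                       period_us: int = DEFAULT_PERIOD_US) -> Optional[float]:
    """Wall-clock ms into a period at which the quota runs out.

    Returns None when the quota lasts the whole period (no throttling).
    """
    quota = cfs_quota_us(limit_millicores, period_us)
    exhausted_at_us = quota / busy_threads  # N busy threads burn quota N times faster
    if exhausted_at_us >= period_us:
        return None
    return exhausted_at_us / 1000


print(cfs_quota_us(500))            # 50000 us of CPU time per 100 ms
print(throttled_after_ms(1000, 4))  # 25.0: four busy threads use a 1-CPU quota in 25 ms
print(throttled_after_ms(1000, 1))  # None: one thread never exceeds a 1-CPU quota
```

The second call shows why multi-threaded services are hit hardest: the container is paused for the remaining 75ms of every period although its average utilization looks moderate.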
A throttled container doesn't crash — it silently pauses. This manifests as p99/p999 latency spikes, timeout errors from downstream callers, and HPA confusion (CPU utilization appears low because throttled time doesn't count as "used"). Throttling is one of the most common and least-diagnosed performance issues in Kubernetes. Monitor container_cpu_cfs_throttled_periods_total and alert when throttling exceeds 25%.
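# Detect CPU throttling for a container (cgroup v1 path)
kubectl exec -n <ns> <pod> -- cat /sys/fs/cgroup/cpu/cpu.stat
# On cgroup v2 nodes the file is /sys/fs/cgroup/cpu.stat
# nr_periods: number of enforcement periods that have elapsed
# nr_throttled: number of periods where container was throttled
# throttled_time: nanoseconds spent throttled (throttled_usec, in microseconds, on cgroup v2)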
# Prometheus query for throttling ratio
# (throttled periods / total periods) per container
rate(container_cpu_cfs_throttled_periods_total[5m])
/ rate(container_cpu_cfs_periods_total[5m])
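The same ratio can be computed from a single cpu.stat reading. The sketch below is illustrative only: `throttle_ratio` is a hypothetical helper, the sample values are made up, and the result covers the container's whole lifetime rather than the 5-minute window used in the Prometheus query.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no limit set, or no periods recorded yet
    return stats.get("nr_throttled", 0) / periods


# Hypothetical cpu.stat content captured with `kubectl exec ... cat cpu.stat`
sample = """nr_periods 1200
nr_throttled 360
throttled_time 18000000000
"""

ratio = throttle_ratio(sample)
print(f"throttled in {ratio:.0%} of periods")  # throttled in 30% of periods
print("above 25% threshold:", ratio > 0.25)
```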
No CPU Limit Pattern
Some teams deliberately omit CPU limits to avoid throttling, relying on requests alone for scheduling. This is viable when:
- Nodes run homogeneous workloads with predictable contention
- ResourceQuota enforces limits at namespace level (limits.cpu)
- LimitRange provides defaults so pods without limits still have them
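Without CPU limits, one misbehaving container can consume all idle CPU on a node. CFS shares still give every container its requested proportion under contention, but neighbors lose the burst capacity they may have come to rely on, and latency becomes less predictable. If you run without per-pod CPU limits, use ResourceQuota at the namespace level to cap total CPU consumption. Note that a quota on limits.cpu rejects any pod that declares no CPU limit, so either pair it with LimitRange defaults or set the quota on requests.cpu instead.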
Memory: OOM Scoring and Hard Limits
Memory Requests → OOM Score
Memory requests affect the OOM score adjustment (oom_score_adj) of the container's processes. A lower score means the OOM killer is less likely to kill that process when the node runs out of memory.
OOM score adjustment range: -1000 (never kill) to +1000 (kill first)
Guaranteed pods (req = limit): oom_score_adj = -997 (protected)
Burstable pods: oom_score_adj = 2–999 (proportional to memory)
BestEffort pods (no requests): oom_score_adj = 1000 (kill first)
Formula (Burstable):
oom_score_adj = min(max(2, 1000 - (1000 × memory_request / node_memory_capacity)), 999)
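A worked version of the Burstable formula, as a minimal Python sketch. It assumes integer division, clamping to the 2–999 range listed above, and the node's total memory as the denominator; the function and parameter names are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def burstable_oom_score_adj(memory_request_bytes: int, node_memory_bytes: int) -> int:
    """oom_score_adj for a Burstable container, clamped to the 2..999 range."""
    raw = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
    return min(max(2, raw), 999)


node = 16 * GIB  # hypothetical 16 GiB node
print(burstable_oom_score_adj(4 * GIB, node))    # 750: requests a quarter of the node
print(burstable_oom_score_adj(64 * MIB, node))   # 997: small request, killed early
print(burstable_oom_score_adj(16 * GIB, node))   # 2: clamped at the lower bound
```

The larger the request relative to the node, the lower the score, so containers that reserved more memory are killed later than those that reserved little.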
Memory Limits → Hard OOM Kill
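When a container's memory usage, as accounted by its cgroup, exceeds limits.memory and the kernel cannot reclaim enough to get back under the limit, the OOM killer sends SIGKILL to a process in the container, which then exits with code 137 (128 + signal 9). Unlike CPU throttling, this is immediate and unrecoverable: the container is killed and restarted by the kubelet (if restartPolicy allows).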
# Check if container was OOM killed
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# → OOMKilled
kubectl describe pod <pod> | grep -A3 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
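The cgroup memory accounting includes both anonymous RSS (heap, stack) and file-backed page cache. The kernel reclaims clean page cache before invoking the OOM killer, but dirty pages, tmpfs data, and actively used memory-mapped files are not easily reclaimed, so an I/O-heavy container can be OOM killed even if its heap is small once total memory.usage_in_bytes hits the limit. Set limits with headroom for file I/O working sets, not just heap.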
QoS Classes
Kubernetes assigns one of three QoS classes to every pod based on its resource configuration. This class determines eviction order when nodes face memory pressure.
| QoS Class | Criteria | OOM priority | Eviction order |
|---|---|---|---|
| Guaranteed | Every container has both cpu and memory requests AND limits, and requests == limits | oom_score_adj = -997 (protected) | Last to be evicted |
| Burstable | At least one container has a request or limit set, but not all Guaranteed criteria met | oom_score_adj 2–999 (proportional) | Middle — evicted before Guaranteed |
| BestEffort | No containers have any requests or limits set | oom_score_adj = 1000 (first killed) | First to be evicted |
# Guaranteed QoS — requests must equal limits for ALL containers
containers:
- name: app
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 500m        # ← must equal request
      memory: 512Mi    # ← must equal request
# Burstable QoS — requests < limits (or only some containers have them)
containers:
- name: app
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 2
      memory: 1Gi      # Higher limits allow bursting
# BestEffort QoS — no resources at all (avoid in production)
containers:
- name: app
  resources: {}        # No requests or limits
# Check QoS class of a pod
kubectl get pod <pod> -o jsonpath='{.status.qosClass}'
# → Guaranteed | Burstable | BestEffort
Even a Guaranteed pod (cpu request = limit) will be throttled by CFS quota when it exceeds its limit. Guaranteed QoS only controls OOM kill order and memory eviction priority, not CPU scheduling behavior.
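The classification rules in the table above can be expressed as a short Python sketch. It is a simplified model, not the kubelet's implementation: it looks only at cpu and memory, ignores init containers, and uses a hypothetical `parse_quantity` helper that understands only the common suffixes.

```python
from decimal import Decimal

_SUFFIXES = {
    "Ki": Decimal(1024), "Mi": Decimal(1024) ** 2,
    "Gi": Decimal(1024) ** 3, "Ti": Decimal(1024) ** 4,
    "m": Decimal("0.001"), "k": Decimal(1000),
    "M": Decimal(1000) ** 2, "G": Decimal(1000) ** 3,
}


def parse_quantity(value) -> Decimal:
    """Parse a resource quantity such as '500m', '2' or '512Mi'."""
    text = str(value)
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):  # try 'Mi' before 'M'
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def qos_class(containers: list) -> str:
    """Classify a pod from its containers' resources (cpu and memory only)."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            request = requests.get(name, limit)  # an unset request defaults to the limit
            if limit is not None or request is not None:
                any_set = True
            if limit is None or parse_quantity(request) != parse_quantity(limit):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


same = {"cpu": "500m", "memory": "512Mi"}
print(qos_class([{"resources": {"requests": same, "limits": same}}]))  # Guaranteed
print(qos_class([{"resources": {"requests": {"cpu": "200m"}}}]))       # Burstable
print(qos_class([{"resources": {}}]))                                  # BestEffort
```

Comparing parsed quantities rather than strings matters: `cpu: 500m` and `cpu: "0.5"` are the same amount and still qualify as Guaranteed.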
Node Allocatable
Not all of a node's capacity is available for pods. The scheduler places pods against Allocatable, which is capacity minus the reservations for the OS (systemReserved), the kubelet and container runtime (kubeReserved), and the hard eviction threshold.
# Check node capacity and allocatable
kubectl describe node <node> | grep -A10 "Capacity:\|Allocatable:"
# Check current resource consumption vs allocatable
kubectl describe node <node> | grep -A20 "Allocated resources:"
# Get allocatable across all nodes (JSON)
kubectl get nodes -o json | jq '.items[] | {
name: .metadata.name,
allocatable: .status.allocatable
}'
# kubelet configuration for reservations (in KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "2Gi"
systemReserved:
  cpu: "500m"
  memory: "2Gi"
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "90s"
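Applying the reservations above to a node gives its Allocatable figures. The sketch below is a worked example in Python under stated assumptions: the node size (4 CPUs, 16Gi) is hypothetical, amounts are pre-converted to millicores and MiB, and `node_allocatable` is an illustrative helper rather than a kubelet API.

```python
def node_allocatable(capacity: dict, kube_reserved: dict,
                     system_reserved: dict, eviction_hard: dict) -> dict:
    """Allocatable = capacity - kubeReserved - systemReserved - evictionHard."""
    return {
        name: total
        - kube_reserved.get(name, 0)
        - system_reserved.get(name, 0)
        - eviction_hard.get(name, 0)
        for name, total in capacity.items()
    }


capacity = {"cpu_m": 4000, "memory_mib": 16 * 1024}   # hypothetical node
kube_reserved = {"cpu_m": 500, "memory_mib": 1024}    # kubeReserved above
system_reserved = {"cpu_m": 500, "memory_mib": 2048}  # systemReserved above
eviction_hard = {"memory_mib": 200}                   # evictionHard memory.available

alloc = node_allocatable(capacity, kube_reserved, system_reserved, eviction_hard)
print(alloc)  # {'cpu_m': 3000, 'memory_mib': 13112}
```

On this node a quarter of the CPU and roughly a fifth of the memory never reach the scheduler, which is why small nodes carry proportionally more overhead than large ones.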
LimitRange
LimitRange sets default, minimum, and maximum resource values for containers, pods, and PVCs within a namespace. It applies at admission time — pods created without explicit resource values receive the defaults.
apiVersion: v1
kind: LimitRange
metadata:
  name: platform-limits
  namespace: production
spec:
  limits:
  # Container-level defaults and bounds
  - type: Container
    default:                # Applied as limit if none specified
      cpu: "1"
      memory: 512Mi
    defaultRequest:         # Applied as request if none specified
      cpu: 100m
      memory: 128Mi
    min:                    # Reject containers whose requests fall below this
      cpu: 50m
      memory: 64Mi
    max:                    # Reject containers whose limits exceed this
      cpu: "8"
      memory: 16Gi
    maxLimitRequestRatio:   # Reject if limit/request exceeds this ratio
      cpu: "10"             # Limit may be at most 10× the request
      memory: "4"
  # Pod-level (sum of all containers)
  - type: Pod
    max:
      cpu: "16"
      memory: 32Gi
  # PVC storage bounds
  - type: PersistentVolumeClaim
    min:
      storage: 1Gi
    max:
      storage: 100Gi
If a container specifies a limit but no request, LimitRange sets the request equal to the limit. If neither is specified, both default and defaultRequest apply. LimitRange does not retroactively change existing pods — it only affects pods created after the LimitRange exists.
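The defaulting rules just described can be sketched as a small Python function. This is an illustration of the rules as stated, using the default and defaultRequest values from the example LimitRange; `apply_defaults` is a hypothetical name, and bounds checking (min, max, ratio) is deliberately left out.

```python
DEFAULT_LIMITS = {"cpu": "1", "memory": "512Mi"}       # LimitRange `default`
DEFAULT_REQUESTS = {"cpu": "100m", "memory": "128Mi"}  # LimitRange `defaultRequest`


def apply_defaults(resources: dict) -> dict:
    """Return the effective requests and limits for one container."""
    requests = dict(resources.get("requests", {}))
    limits = dict(resources.get("limits", {}))
    for name in ("cpu", "memory"):
        user_set_limit = name in limits
        if name not in requests:
            # A user-supplied limit becomes the request; otherwise use defaultRequest.
            requests[name] = limits[name] if user_set_limit else DEFAULT_REQUESTS[name]
        if not user_set_limit:
            limits[name] = DEFAULT_LIMITS[name]
    return {"requests": requests, "limits": limits}


print(apply_defaults({}))
print(apply_defaults({"limits": {"cpu": "2"}}))
```

In the second call the CPU request becomes 2 (copied from the limit) while memory still falls back to 128Mi and 512Mi, which shows that defaulting is applied per resource, not per container.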
# View effective LimitRange in a namespace
kubectl describe limitrange -n production
# Test what resources a new pod would get
kubectl run test --image=nginx --dry-run=server -o yaml -n production | \
grep -A10 resources
ResourceQuota
ResourceQuota enforces aggregate resource consumption limits within a namespace. Unlike LimitRange (per-object bounds), ResourceQuota tracks cumulative usage across all objects.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    # Compute resources
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "100"
    limits.memory: 200Gi
    # Object counts
    pods: "200"
    services: "50"
    secrets: "200"
    configmaps: "100"
    persistentvolumeclaims: "50"
    services.loadbalancers: "5"
    services.nodeports: "0"   # Prohibit NodePort services
    # Storage
    requests.storage: 2Ti
    requests.ephemeral-storage: 50Gi
    # Per-StorageClass storage quota
    gold-ssd.storageclass.storage.k8s.io/requests.storage: 500Gi
    standard.storageclass.storage.k8s.io/requests.storage: 1Ti
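Conceptually, quota admission adds the new object's demand to the namespace's current usage and rejects the object if any tracked resource would exceed its hard value. The Python sketch below models only that comparison; the function name, the usage figures, and the pre-converted integer units (millicores, object counts) are assumptions made for illustration.

```python
def quota_violations(hard: dict, used: dict, new_pod: dict) -> list:
    """List the quota entries a new pod would push past their hard limit."""
    violations = []
    for name, limit in hard.items():
        total = used.get(name, 0) + new_pod.get(name, 0)
        if total > limit:
            violations.append(f"{name}: {total} > {limit}")
    return violations


hard = {"requests.cpu": 20_000, "limits.cpu": 100_000, "pods": 200}
used = {"requests.cpu": 19_600, "limits.cpu": 40_000, "pods": 120}  # hypothetical usage
new_pod = {"requests.cpu": 500, "limits.cpu": 2_000, "pods": 1}

print(quota_violations(hard, used, new_pod))  # ['requests.cpu: 20100 > 20000']
```

A single exceeded entry is enough to reject the pod, even though limits.cpu and the pod count still have plenty of room.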
Scoped Quotas
# Quota applying only to BestEffort pods (no resources set)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort-quota
  namespace: production
spec:
  hard:
    pods: "10"
  scopeSelector:
    matchExpressions:
    - scopeName: BestEffort
      operator: Exists        # operator is required; BestEffort only supports Exists
---
# Quota for high-priority batch jobs
apiVersion: v1
kind: ResourceQuota
metadata:
  name: high-priority-quota
  namespace: platform
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    pods: "50"
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["high-priority"]
| Scope | Matches pods that |
|---|---|
| BestEffort | Have no requests or limits (BestEffort QoS) |
| NotBestEffort | Have at least one request or limit (Guaranteed or Burstable) |
| Terminating | Have activeDeadlineSeconds set (Jobs) |
| NotTerminating | Do not have activeDeadlineSeconds (long-running workloads) |
| PriorityClass | Have a specific PriorityClass name |
# Check quota usage in a namespace
kubectl describe resourcequota -n production
# Get quota as JSON for automation
kubectl get resourcequota production-quota -n production -o json | \
jq -r '.status as $s | $s.hard | keys[] as $k | "\($k): hard=\($s.hard[$k]), used=\($s.used[$k])"'
# Watch quota consumption
watch kubectl get resourcequota -n production
Extended Resources
Extended resources represent non-standard hardware (GPUs, FPGAs, InfiniBand, SR-IOV NICs). They are advertised by node device plugins via the kubelet API and consumed in pod specs like CPU/memory.
NVIDIA GPU
# Pod requesting 1 NVIDIA GPU
spec:
  containers:
  - name: ml-trainer
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1   # GPU resources: requests must equal limits
Extended resources like GPUs are integer resources — they cannot be fractionally requested, and requests must always equal limits. Fractional GPU sharing (e.g., NVIDIA MIG, time-slicing) requires specific device plugin configurations that expose virtual GPU resources (e.g., nvidia.com/mig-1g.5gb).
# Check available GPU resources on nodes
kubectl get nodes -o json | jq '.items[] | {
name: .metadata.name,
gpus: .status.allocatable["nvidia.com/gpu"]
}'
# Check GPU allocation per pod
kubectl get pods -A -o json | jq '
  .items[]
  | select(any(.spec.containers[]; .resources.requests["nvidia.com/gpu"] != null))
  | {
      name: .metadata.name,
      namespace: .metadata.namespace,
      # one entry per container that requests a GPU
      gpus: [.spec.containers[] | .resources.requests["nvidia.com/gpu"] // empty]
    }'
Custom Extended Resources
# Manually advertise an extended resource on a node (for testing)
kubectl proxy &
curl -X PATCH \
"http://localhost:8001/api/v1/nodes/<node>/status" \
-H "Content-Type: application/json-patch+json" \
-d '[{"op":"add","path":"/status/capacity/example.com~1fpga","value":"2"}]'
Ephemeral Storage
Ephemeral storage includes emptyDir volumes, container logs, and container image layers written at runtime. Like CPU/memory, it can have requests and limits.
resources:
  requests:
    ephemeral-storage: 1Gi   # Scheduler reserves this on the node
  limits:
    ephemeral-storage: 5Gi   # Pod evicted if total ephemeral usage exceeds this
Ephemeral storage is measured as the sum of:
- Writable container layer (overlay diff from image)
- Container logs written to /var/log/pods/
- emptyDir volumes (unless backed by tmpfs/memory)
A container writing 100MB/s of logs with limits.ephemeral-storage: 2Gi can reach its limit, and be evicted, in roughly 20 seconds if its logs are not rotated. Shipping logs to an external system does not free local disk by itself, so configure rotation as well: containerLogMaxSize and containerLogMaxFiles in the kubelet configuration for CRI runtimes, or --log-opt max-size=100m --log-opt max-file=5 for Docker. Forward logs externally before relying on ephemeral storage limits to contain them.
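The time budget in that scenario is simple arithmetic, sketched here in Python. It assumes a constant write rate, no rotation, and that "100MB/s" means 100 × 10^6 bytes per second; `seconds_to_fill` is a hypothetical helper.

```python
GIB = 1024 ** 3
MB = 10 ** 6


def seconds_to_fill(limit_bytes: int, write_bytes_per_sec: int) -> float:
    """Seconds until a steady writer reaches the ephemeral-storage limit."""
    return limit_bytes / write_bytes_per_sec


elapsed = seconds_to_fill(2 * GIB, 100 * MB)
print(f"{elapsed:.1f} s until the 2Gi limit is reached")  # 21.5 s
```

With rotation capped at 5 files of 100m each, log usage plateaus near 500MB and stays well under the same limit.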
Pod Overhead (RuntimeClass)
Sandbox runtimes (gVisor, Kata Containers, Firecracker) introduce fixed overhead beyond what the containers request. The overhead field on a RuntimeClass declares this overhead, and the scheduler adds it to pod resource consumption.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-containers
handler: kata
overhead:
  podFixed:
    cpu: 250m        # Fixed overhead per pod for the VM/sandbox runtime
    memory: 128Mi    # Included in scheduler placement decisions
scheduling:          # Accepts nodeSelector and tolerations
  tolerations:
  - key: kata-containers
    operator: Exists
    effect: NoSchedule
# Pod using this RuntimeClass
spec:
  runtimeClassName: kata-containers
  containers:
  - name: app
    resources:
      requests:
        cpu: 500m       # Scheduler places on node with >= 750m available
        memory: 512Mi   # (500m + 250m overhead = 750m total charged)
      limits:
        cpu: 2
        memory: 1Gi
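The amount charged to the node is the sum of the container requests plus the fixed overhead. The Python sketch below works through the example values; it models only the addition described here, with a hypothetical helper name and amounts pre-converted to millicores and MiB.

```python
def scheduled_request(container_requests: list, pod_overhead: int) -> int:
    """Total the scheduler charges: all container requests plus podFixed overhead."""
    return sum(container_requests) + pod_overhead


cpu_m = scheduled_request([500], 250)       # 500m request + 250m overhead
memory_mib = scheduled_request([512], 128)  # 512Mi request + 128Mi overhead
print(cpu_m, memory_mib)  # 750 640
```

The overhead is charged once per pod, not once per container, so a sidecar adds its own request but no additional sandbox overhead.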
Right-Sizing Workflow
Over-requesting wastes cluster capacity; under-requesting causes throttling or OOM. A systematic right-sizing workflow:
# Current pod resource usage vs requests
kubectl top pods --containers -n production
# Pods with no resource requests (risk: unknown scheduling behavior)
kubectl get pods -A -o json | jq '
  .items[] |
  select(any(.spec.containers[]; .resources.requests == null)) |
  {ns: .metadata.namespace, name: .metadata.name}'
# Namespace-level resource consumption summary
kubectl describe resourcequota -n production | grep -E "requests|limits"
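# Find containers whose actual CPU usage is below 20% of the request (over-provisioned)
# (requires Prometheus with cAdvisor and kube-state-metrics)
# promql:
# sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
#   / sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"}) < 0.2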
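Once usage samples are exported, a new request can be derived from a percentile plus headroom. The Python sketch below is one possible policy, not a prescribed one: the choice of the 90th percentile, the 20% headroom, the sample data, and the function names are all assumptions for illustration.

```python
def percentile(samples: list, pct: int) -> int:
    """Nearest-rank percentile of integer samples."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling of pct% of n
    return ordered[rank - 1]


def recommend_request(samples_millicores: list, pct: int = 90,
                      headroom_pct: int = 20) -> int:
    """Percentile of observed usage plus headroom, rounded up to a millicore."""
    base = percentile(samples_millicores, pct)
    return -(-base * (100 + headroom_pct) // 100)


def is_over_provisioned(avg_usage_m: int, request_m: int, threshold: float = 0.2) -> bool:
    """True when average usage is below the given fraction of the request."""
    return avg_usage_m / request_m < threshold


usage = [40, 55, 60, 62, 70, 75, 80, 90, 110, 300]  # hypothetical CPU samples (millicores)
print(recommend_request(usage))      # 132
print(is_over_provisioned(40, 500))  # True
```

Using a percentile rather than the maximum keeps a single 300m spike from inflating the request; whether that spike deserves a higher limit is a separate decision.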
FinOps: Cost Attribution
Resource requests are the primary driver of infrastructure cost in Kubernetes — node sizing, autoscaling, and reserved instance purchasing all flow from request profiles. Proper cost attribution requires namespace/team labeling.
# Namespace with team labels for chargeback
apiVersion: v1
kind: Namespace
metadata:
  name: checkout-service
  labels:
    team: checkout
    cost-center: "CC-1042"
    environment: production
# Aggregate requested resources by namespace (rough cost proxy)
kubectl get pods -A -o json | jq -r '
  .items[] | .metadata.namespace as $ns | .spec.containers[] |
  "\($ns) \(.resources.requests.cpu // "0") \(.resources.requests.memory // "0")"' | \
sort | uniq -c
# Tools for cost attribution:
# - Kubecost: per-namespace/pod/deployment cost breakdown
# - OpenCost (CNCF): open-source Kubecost alternative
# - Cloud provider cost allocation tags
Full Resource Spec Reference
containers:
- name: app
  image: myapp:v2
  resources:
    requests:
      cpu: "500m"                # 0.5 vCPU shares
      memory: "512Mi"            # 512 mebibytes
      ephemeral-storage: "1Gi"   # Ephemeral storage reservation
      hugepages-2Mi: "128Mi"     # Huge pages (if supported)
    limits:
      cpu: "2"                   # 2 vCPU hard cap (CFS quota)
      memory: "1Gi"              # 1 GiB hard cap (OOM kill)
      ephemeral-storage: "5Gi"   # Ephemeral storage cap
      nvidia.com/gpu: "1"        # 1 GPU (extended resource)
      hugepages-2Mi: "128Mi"     # Huge pages (req must == limit)
Metrics
| Metric | Labels | Use |
|---|---|---|
| container_cpu_cfs_throttled_periods_total | container, pod, namespace | CFS throttling periods — high ratio = CPU limit too tight |
| container_memory_working_set_bytes | container, pod, namespace | Actual memory in use (vs limit) |
| kube_pod_container_resource_requests | resource, container, namespace | Configured requests — denominator for utilization % |
| kube_resourcequota | resource, type (hard/used) | Quota utilization — alert when near limit |
| kube_pod_container_status_last_terminated_reason | reason=OOMKilled | OOM kill rate — should be 0 in steady state |
Alerting Rules
groups:
- name: resource-management
  rules:
  # CPU throttling exceeding 25%
  - alert: ContainerCPUThrottling
    expr: |
      rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
        / rate(container_cpu_cfs_periods_total{container!=""}[5m]) > 0.25
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} is CPU throttled >25%"
      description: "Increase CPU limit or reduce request/limit ratio"
  # Container OOM killed
  - alert: ContainerOOMKilled
    expr: |
      kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} OOM killed"
  # ResourceQuota near limit (>85% used)
  # ignoring(type) is needed because the two sides differ in the type label
  - alert: ResourceQuotaAlmostFull
    expr: |
      kube_resourcequota{type="used"}
        / ignoring(type) (kube_resourcequota{type="hard"} > 0) > 0.85
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "ResourceQuota {{ $labels.namespace }}/{{ $labels.resourcequota }} {{ $labels.resource }} at >85%"
  # Containers with no CPU request (scheduling risk)
  # The requests metric has no series for containers without a request,
  # so match on containers that lack one instead of comparing to 0
  - alert: PodMissingResourceRequests
    expr: |
      kube_pod_container_info
        unless on(namespace, pod, container) kube_pod_container_resource_requests{resource="cpu"}
        unless on(namespace, pod) (kube_pod_status_phase{phase="Succeeded"} == 1)
    for: 1h
    labels:
      severity: info
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }} has no CPU request"
Runbooks
Container Stuck in CrashLoopBackOff with OOMKilled
# Confirm OOM kill
kubectl describe pod <pod> -n <namespace> | grep -A5 "Last State"
# Check current memory limit
kubectl get pod <pod> -n <namespace> -o jsonpath=\
'{.spec.containers[*].resources.limits.memory}'
# Check historical usage (Prometheus)
# container_memory_working_set_bytes{pod="<pod>", container="app"}
# Fix: increase memory limit
kubectl set resources deployment <name> -n <namespace> \
--containers=app --limits=memory=2Gi
# Or patch directly
kubectl patch deployment <name> -n <namespace> --type=json -p='[
{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi"}
]'
Pod Pending Due to Insufficient Resources
# Check pod events
kubectl describe pod <pod> -n <namespace> | grep -A10 Events
# Check node capacity
kubectl describe nodes | grep -A5 "Allocatable:"
# Check what's consuming resources on each node
kubectl describe nodes | grep -A20 "Allocated resources:"
# Find the most resource-hungry pods
kubectl top pods -A --sort-by=memory | head -20
# Check if ResourceQuota is blocking
kubectl describe resourcequota -n <namespace>
CPU Throttling Causing Latency Spikes
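# kubectl top shows CPU usage only, not throttling; use it to spot containers running near their limit
kubectl top pods --containers -n <namespace>
# Confirm throttling inside the container (cgroup v1 path; use /sys/fs/cgroup/cpu.stat on cgroup v2):
kubectl exec -n <ns> <pod> -c <container> -- \
cat /sys/fs/cgroup/cpu/cpu.stat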
# Quick fix: remove CPU limit (allow bursting)
kubectl patch deployment <name> -n <namespace> --type=json -p='[
{"op":"remove","path":"/spec/template/spec/containers/0/resources/limits/cpu"}
]'
# Better fix: increase CPU limit to match observed usage
kubectl set resources deployment <name> -n <namespace> \
--containers=app --limits=cpu=2
Namespace ResourceQuota Full
# See what's using the quota
kubectl describe resourcequota -n <namespace>
# Find over-provisioned pods in the namespace
kubectl top pods -n <namespace> --containers --sort-by=cpu
# Identify pods with high request but low usage
kubectl get pods -n <namespace> -o json | jq '
.items[] | {
name: .metadata.name,
cpu_request: .spec.containers[].resources.requests.cpu,
memory_request: .spec.containers[].resources.requests.memory
}'
# Increase quota (requires cluster-admin)
kubectl patch resourcequota production-quota -n <namespace> --type=merge \
-p '{"spec":{"hard":{"requests.cpu":"30","requests.memory":"60Gi"}}}'
LimitRange Rejecting Pod Creation
# Check LimitRange in namespace
kubectl describe limitrange -n <namespace>
# Error from pod creation:
# "pods maximum cpu usage per Container is 8, but limit is 16"
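# Fix: reduce the container's limit, or raise the LimitRange max.
# A JSON patch changes only max.cpu; a merge patch would replace the whole limits list.
# Assumes the Container entry is the first item in spec.limits.
kubectl patch limitrange platform-limits -n <namespace> --type=json \
-p '[{"op":"replace","path":"/spec/limits/0/max/cpu","value":"16"}]'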
Best Practices
- Always set CPU and memory requests — pods with no requests or limits at all receive BestEffort QoS and are first to be evicted under node pressure. Without requests, the scheduler also cannot make informed placement decisions.
- Set memory limits conservatively (p99.9 of observed usage) — OOM kills are disruptive but better than unbounded memory consumption starving other pods. Give 20–30% headroom above steady-state RSS.
- Monitor CPU throttling, not just CPU usage — a container at 30% CPU utilization can still be throttled 80% of the time if its burst exceeds the limit. Check container_cpu_cfs_throttled_periods_total alongside utilization.
- Use LimitRange to enforce defaults in every namespace — prevents pods deployed without resource specs from becoming BestEffort or consuming unbounded resources.
- Apply ResourceQuota to every tenant namespace — without quotas, one team's bug (infinite loop, memory leak) can exhaust cluster capacity for all tenants.
- Right-size with VPA Off mode before enabling VPA updates — use Goldilocks to get recommendations passively for 1–2 weeks, then apply them in a controlled rollout. Don't jump straight to Auto mode.
- For latency-sensitive workloads, prefer Guaranteed QoS — set requests = limits to avoid OOM scoring disadvantage under memory pressure and make CPU scheduling predictable (though throttling still applies).
- Include ephemeral-storage limits for log-heavy workloads — without limits, a container writing excessive logs can fill the node's disk and trigger eviction of all pods on that node.