Pod Failures
Overview
Diagnosis and resolution of every common pod failure mode — from scheduling blocks through container startup failures to runtime crashes.
Pod Status Quick Reference
kubectl get pod — STATUS column meanings:
Pending → not yet scheduled (or scheduled but not started)
Init:0/1 → init container running / waiting
Init:Error → init container failed
PodInitializing → init containers done, app container starting
Running → containers started (may or may not be Ready)
CrashLoopBackOff → container repeatedly crashing; kubelet backing off restarts
Error → container exited with non-zero code (not yet in backoff)
OOMKilled → container killed for exceeding memory limit
Evicted → pod evicted from node (DiskPressure, MemoryPressure)
Terminating → pod being deleted (may be stuck if finalizers not removed)
Unknown → kubelet on the node is unreachable
Completed → all containers exited 0 (Jobs/one-shot containers)
ImagePullBackOff → cannot pull container image
Pending — Scheduling Issues
# Symptoms: pod stuck in Pending for > 30s
kubectl describe pod <pod> -n <ns>
# Events section will show WHY:
# "0/5 nodes are available: 5 Insufficient cpu."
# → All nodes lack enough CPU. Options:
# - Reduce pod CPU request
# - Add nodes / enable Karpenter
kubectl describe node <node> | grep -A10 "Allocated resources"
kubectl top nodes
# "0/3 nodes are available: 3 node(s) had taint ... NoSchedule"
# → Pod needs a toleration
kubectl describe node <node> | grep -A5 Taints
# Add to pod spec:
# tolerations:
# - key: "key"
#   operator: "Equal"
#   value: "value"
#   effect: "NoSchedule"
# "0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector"
kubectl get nodes --show-labels
# Fix: adjust nodeSelector or nodeAffinity to match available labels
# "1 node(s) had volume node affinity conflict"
# → The PV bound to the PVC is pinned to a different AZ than the feasible nodes
kubectl get pvc <pvc> -n <ns> -o jsonpath='{.spec.volumeName}'
kubectl get pv <pv-name> \
-o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'
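# Fix: make a node in the PV's zone schedulable. Only if the data is disposable:
# delete and recreate the PVC using a StorageClass with
# volumeBindingMode: WaitForFirstConsumer, so the volume is provisioned in the pod's AZ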
# "pod has unbound immediate PersistentVolumeClaims"
kubectl get pvc -n <ns>
# Fix: PVC is Pending — see Storage Issues playbook
# "0/3 nodes are available: 3 Too many pods"
kubectl get node <node> -o jsonpath='{.status.allocatable.pods}'
# Default limit: 110 pods per node; raise kubelet --max-pods only if the node size and CNI IP capacity allow it
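The scheduler messages quoted above map one-to-one onto a fix, so the lookup can be sketched as a small shell function. This is a minimal sketch: the function name `sched_hint` and the hint wording are hypothetical, and the patterns only cover the messages listed in this section (plus the newer "untolerated taint" wording).

```shell
# Map a FailedScheduling message to the matching fix from this section.
# sched_hint and its output strings are illustrative, not kubectl output.
# Only the first matching pattern wins if a message lists several reasons.
sched_hint() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "resources: lower requests or add nodes" ;;
    *"had taint"*|*"untolerated taint"*)
      echo "taint: add a matching toleration" ;;
    *"node affinity/selector"*)
      echo "affinity: align nodeSelector/nodeAffinity with node labels" ;;
    *"volume node affinity conflict"*)
      echo "volume: PV is pinned to a zone with no feasible node" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "pvc: claim is still Pending" ;;
    *"Too many pods"*)
      echo "pods: node reached its pod limit" ;;
    *)
      echo "unknown: read the full event message" ;;
  esac
}

# Example call against a live cluster (placeholders as elsewhere in this page):
# sched_hint "$(kubectl get events -n <ns> \
#   --field-selector involvedObject.name=<pod>,reason=FailedScheduling \
#   -o jsonpath='{.items[*].message}')"
```

The full event text remains the source of truth; the function only shortens the first triage step.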
ImagePullBackOff / ErrImagePull
kubectl describe pod <pod> -n <ns>
# Events: Failed to pull image "myrepo/myapp:v1.2.3": ...
# Diagnoses:
# 1. Wrong image name or tag
kubectl get pod <pod> -n <ns> \
-o jsonpath='{.spec.containers[0].image}'
# Verify tag exists in registry
# 2. Private registry — missing imagePullSecret
kubectl get pod <pod> -n <ns> \
-o jsonpath='{.spec.imagePullSecrets}'
# Fix: create secret and add to pod spec or service account
kubectl create secret docker-registry regcred \
--docker-server=myrepo.azurecr.io \
--docker-username=<user> \
--docker-password=<token> \
-n <ns>
# Patch the default SA so pods using it get the pull secret automatically:
kubectl patch serviceaccount default -n <ns> \
-p '{"imagePullSecrets":[{"name":"regcred"}]}'
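# 3. Registry unreachable from node
# (the debug pod may land on a different node; any HTTP response, even 401, proves reachability)
kubectl run debug --image=curlimages/curl --rm -it --restart=Never -- \
curl -I https://myrepo.azurecr.io/v2/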
# 4. Rate limiting (Docker Hub)
# Look for: "You have reached your pull rate limit"
# Fix: use authenticated pull or mirror to ECR/GCR
# 5. Tag or digest missing from the registry, or signature verification failed (signed images)
# Look for: "manifest unknown" (reference does not exist) or cosign verification failure
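The image check in step 1 reads only `containers[0]`, but a pull failure can come from any container, including init containers. A minimal sketch that lists every image in the pod follows; the function name `pod_images` is hypothetical, and it assumes kubectl's default behaviour of printing nothing for a missing `initContainers` field.

```shell
# List "<container-name><TAB><image>" for init containers and app containers.
# $1 = pod name, $2 = namespace. pod_images is an illustrative helper name.
pod_images() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .spec.initContainers[*]}{.name}{"\t"}{.image}{"\n"}{end}{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
}

# Example: pod_images <pod> <ns>
```

Comparing each printed image against the registry catches a bad tag on a sidecar or init container that the single-container query would miss.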
CrashLoopBackOff
# CrashLoopBackOff = container keeps exiting (usually non-zero); kubelet restarts it with exponential backoff
# Backoff: 10s → 20s → 40s → ... → 5min max
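To see how quickly the delay reaches its ceiling, the doubling schedule described above can be computed directly. This is a sketch of the arithmetic only, assuming the 10s base and 300s cap stated above; the function name `backoff_schedule` is hypothetical.

```shell
# Print the delay in seconds before each of the first N restarts:
# start at 10s, double each time, cap at 300s (5 minutes).
backoff_schedule() {
  n=$1
  delay=10
  i=1
  while [ "$i" -le "$n" ]; do
    echo "$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$((i + 1))
  done
}

# Example: backoff_schedule 7
```

The cap is reached on the sixth restart, so a pod that has restarted more than a handful of times retries only once every five minutes.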
# Step 1: Get logs from the crashed container
kubectl logs <pod> -n <ns> --previous
kubectl logs <pod> -n <ns> -c <container-name> --previous
# Step 2: Get the exit code
kubectl get pod <pod> -n <ns> \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# 1 = app error (check logs)
# 137 = SIGKILL (check for OOMKill or resource limit)
# 139 = segfault
# 143 = SIGTERM (container stopped on SIGTERM; if it outlives terminationGracePeriodSeconds it gets SIGKILL → 137)
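Exit codes above 128 follow the convention 128 + signal number, which is where 137 (9, SIGKILL), 139 (11, SIGSEGV), and 143 (15, SIGTERM) come from. A minimal decoder sketch, with a hypothetical function name `explain_exit` and only a few signal names filled in:

```shell
# Decode a container exit code using the 128 + signal-number convention.
# explain_exit is an illustrative helper; only common signals are named.
explain_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      6)  name=SIGABRT ;;
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name="signal $sig" ;;
    esac
    echo "$code: terminated by $name"
  else
    case "$code" in
      0)   echo "0: clean exit" ;;
      126) echo "126: command not executable" ;;
      127) echo "127: command not found" ;;
      *)   echo "$code: application-defined error" ;;
    esac
  fi
}

# Example: explain_exit 137
```

Codes 126 and 127 are shell conventions and usually point at a wrong entrypoint path or a missing binary in the image.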
# Step 3: Common causes and fixes
# Bad environment variable / missing ConfigMap key
kubectl describe pod <pod> -n <ns>
# Look for: Error: couldn't find key X in ConfigMap Y
# Fix: check configMapKeyRef / secretKeyRef names
# Entrypoint command error
kubectl run debug --image=<same-image> --rm -it --restart=Never --command -- /bin/sh
# Test the command manually
# Liveness probe killing healthy container
kubectl describe pod <pod> -n <ns> | grep -A10 "Liveness"
# If "Liveness probe failed: ... Killing container"
# Fix: increase initialDelaySeconds, adjust failure thresholds, or add a startupProbe for slow starters
# Database not reachable on startup
# App exits because it can't connect to DB
# Fix: add readiness check to DB, or use retry logic in app
# Or: init container that waits for DB to be ready
OOMKilled
# Container killed because it exceeded memory limit
kubectl describe pod <pod> -n <ns>
# Look for: "OOMKilled" in Last State
# Get memory limit
kubectl get pod <pod> -n <ns> \
-o jsonpath='{.spec.containers[0].resources.limits.memory}'
# Check current memory usage (needs metrics-server; shows the restarted container, not the peak before the kill)
kubectl top pod <pod> -n <ns>
# Options:
# 1. Increase memory limit
kubectl patch deployment <deploy> -n <ns> \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"<c>","resources":{"limits":{"memory":"512Mi"}}}]}}}}'
# 2. Find memory leak in application
# Enable heap profiling (Go pprof, Java JMX, Node --heap-prof)
# Add JVM flags: -XX:+HeapDumpOnOutOfMemoryError
# 3. Use VPA to right-size limits
# VPA will observe actual usage and recommend/set requests
# Common mistake: limit set too tight relative to real peak usage
# If the limit sits just above steady-state usage, any spike → OOMKill
# Starting point: limit = 2-3× request for Java/Node apps, then tune from observed peaks
# Leaving the limit unset avoids the cgroup OOMKill for spiky apps, but usage above
# the request makes the pod an early eviction candidate under node memory pressure
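Comparing the usage reported by `kubectl top pod` with the configured limit is easier once both quantities are in bytes. The sketch below converts integer Kubernetes quantities and reports usage as a percentage of the limit. The function names `to_bytes` and `mem_pct` are hypothetical, and fractional quantities such as `1.5Gi` are not handled.

```shell
# Convert an integer Kubernetes memory quantity (e.g. 512Mi, 2Gi, 500M) to bytes.
# Binary suffixes are powers of 1024; decimal suffixes are powers of 1000.
to_bytes() {
  q=$1
  case "$q" in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *k)  echo $(( ${q%k} * 1000 )) ;;
    *M)  echo $(( ${q%M} * 1000000 )) ;;
    *G)  echo $(( ${q%G} * 1000000000 )) ;;
    *)   echo "$q" ;;
  esac
}

# Print usage as a whole-number percentage of the limit (rounded down).
mem_pct() {
  used=$(to_bytes "$1")
  limit=$(to_bytes "$2")
  echo $(( used * 100 / limit ))
}

# Example: mem_pct 450Mi 512Mi
```

A container that regularly sits close to 100% of its limit is a likely OOMKill candidate on the next spike.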
Init Container Failures
# Pod stuck in Init:0/1 or Init:Error
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> -c <init-container-name>
# Common init container patterns and failures:
# Wait for database
initContainers:
- name: wait-for-db
  image: busybox
  command: ['sh', '-c', 'until nc -z postgres.production.svc 5432; do sleep 2; done']
# If stuck: postgres service doesn't exist or is in wrong namespace
kubectl get svc postgres -n production
# Database migration
initContainers:
- name: migrate
  image: myapp:latest
  command: ["./migrate", "--up"]
# If failing: check migration script logs
kubectl logs <pod> -n <ns> -c migrate
# Vault agent injector (init + sidecar)
# If stuck: vault agent can't reach Vault, or IRSA role wrong
kubectl logs <pod> -n <ns> -c vault-agent-init
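The `until nc -z ...` loop in the wait-for-db pattern never gives up, so a missing Service leaves the pod in `Init:0/1` indefinitely. A bounded variant fails the init container instead, which surfaces as `Init:Error` with a log line. This is a minimal sketch; the function name `wait_for` and the timeout values in the example are assumptions, not part of the original pattern.

```shell
# wait_for TIMEOUT INTERVAL CMD [ARGS...]
# Re-run CMD every INTERVAL seconds until it succeeds or TIMEOUT seconds
# of sleeping have accumulated. Returns 0 on success, 1 on timeout.
# Note: elapsed counts only the sleeps, not the time CMD itself takes.
wait_for() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Example, with hypothetical timing values:
# wait_for 120 2 nc -z postgres.production.svc 5432
```

Inside an init container the function body would have to be inlined into the `sh -c` string or shipped in the image, since busybox does not provide it.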
Readiness Probe Failing
# Pod is Running but not Ready (0/1 in READY column)
# → No traffic from Services until Ready
kubectl describe pod <pod> -n <ns>
# Look for: "Readiness probe failed: ..."
# Common probe types and failures:
# httpGet: returns a status outside 200-399
# → App not listening on correct port
# → App returning 500 on /health
# → App still starting up (increase initialDelaySeconds)
# tcpSocket: connection refused
# → Port not open yet
# exec: command exits non-zero
# Fix 1: match probe to actual app endpoint
kubectl exec <pod> -n <ns> -- wget -qO- http://localhost:8080/healthz
# Fix 2: adjust timing
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10 # wait for app startup
  periodSeconds: 5
  failureThreshold: 3 # mark unready after 3 consecutive failures
  successThreshold: 1
# Fix 3: check if dependency is causing /health to return 500
# (e.g., /health checks DB connectivity — if DB is down, all pods become unready)
# Better pattern: liveness checks basic app health; readiness checks dependencies
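The timing fields in Fix 2 combine into two rough numbers worth checking before tuning: how long a failing pod keeps receiving traffic, and how much time a starting app has before the probe gives up. The sketch below is an approximation that ignores `timeoutSeconds` and the kubelet's probe jitter; the function names are hypothetical.

```shell
# Approximate seconds between the first failed probe and the pod being
# marked unready: periodSeconds * failureThreshold.
detect_window() {
  period=$1
  threshold=$2
  echo $(( period * threshold ))
}

# Approximate seconds after container start before the probe has failed
# failureThreshold times: initialDelaySeconds + detect_window.
startup_budget() {
  delay=$1
  period=$2
  threshold=$3
  echo $(( delay + period * threshold ))
}

# Example with the values from Fix 2:
# detect_window 5 3      (periodSeconds=5, failureThreshold=3)
# startup_budget 10 5 3  (initialDelaySeconds=10)
```

The same arithmetic applies to a liveness probe, where exceeding the budget means a restart rather than removal from Service endpoints.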
Pod Stuck Terminating
# Pod has deletionTimestamp but won't terminate
# Check finalizers
kubectl get pod <pod> -n <ns> \
-o jsonpath='{.metadata.finalizers}'
# Check for preStop hook hanging
kubectl get pod <pod> -n <ns> \
-o jsonpath='{.spec.containers[*].lifecycle.preStop}'
# Check terminationGracePeriodSeconds
kubectl get pod <pod> -n <ns> \
-o jsonpath='{.spec.terminationGracePeriodSeconds}'
# If preStop is hanging or app not handling SIGTERM:
# Option 1: wait for terminationGracePeriodSeconds to expire → SIGKILL
# Option 2: force delete (loses graceful shutdown)
kubectl delete pod <pod> -n <ns> --grace-period=0 --force
# WARNING: only use for stateless pods; stateful pods risk data corruption
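# If stuck due to finalizer (uncommon for pods, common for PVCs), clear it only after
# confirming the controller that owns it is gone; removing it skips that controller's cleanup: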
kubectl patch pod <pod> -n <ns> \
-p '{"metadata":{"finalizers":[]}}' --type=merge
Evicted Pods
# Node evicts pods under resource pressure
kubectl get pods -n <ns> | grep Evicted
# Get reason for eviction
kubectl describe pod <evicted-pod> -n <ns>
# "The node was low on resource: memory. Threshold quantity: 100Mi, available: 50Mi"
# "The node was low on resource: ephemeral-storage. ..."
# Clean up evicted pods (they stay in Evicted state until deleted)
kubectl get pods -A -o json | \
jq -r '.items[] | select(.status.phase=="Failed" and .status.reason=="Evicted") |
"\(.metadata.namespace) \(.metadata.name)"' | \
while read -r ns name; do kubectl delete pod -n "$ns" "$name"; done
# Reduce eviction risk: set requests (QoS class matters)
# Guaranteed QoS (requests == limits) → evicted last
# Burstable QoS (requests < limits) → evicted based on usage vs request
# BestEffort QoS (no requests/limits) → evicted first
# PodDisruptionBudgets do not stop node-pressure eviction; they only limit
# API-initiated evictions such as kubectl drain
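The QoS rules above can be checked for a single container before deploying. The sketch below is a simplified single-container version with a hypothetical function name `qos_class`. It compares quantities as strings, so `500m` and `0.5` would count as different, and a real pod is Guaranteed only when every container qualifies.

```shell
# qos_class CPU_REQ CPU_LIM MEM_REQ MEM_LIM   (pass "" for an unset field)
# Single-container approximation of the pod QoS class.
qos_class() {
  cpu_req=$1
  cpu_lim=$2
  mem_req=$3
  mem_lim=$4
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    # Nothing set at all
    echo BestEffort
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
    && [ "${cpu_req:-$cpu_lim}" = "$cpu_lim" ] \
    && [ "${mem_req:-$mem_lim}" = "$mem_lim" ]; then
    # Both limits set; an unset request defaults to the limit
    echo Guaranteed
  else
    echo Burstable
  fi
}

# Example: qos_class 250m 500m 256Mi 256Mi
```

For a running pod, the class the cluster actually assigned is available with `kubectl get pod <pod> -n <ns> -o jsonpath='{.status.qosClass}'`.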
Related
- 04 — Scheduler Flow — why pods become Pending
- 04 — Performance Issues — OOM and CPU throttling in depth
- 06 — Node Issues — node-level eviction and pressure