Pod Failures

Overview

Diagnosis and resolution of every common pod failure mode — from scheduling blocks through container startup failures to runtime crashes.

Pod Status Quick Reference

kubectl get pod — STATUS column meanings:

Pending          → not yet scheduled (or scheduled but not started)
Init:0/1         → init container running / waiting
Init:Error       → init container failed
PodInitializing  → init containers done, app container starting
Running          → containers started (may or may not be Ready)
CrashLoopBackOff → container repeatedly crashing; kubelet backing off restarts
Error            → container exited with non-zero code (not yet in backoff)
OOMKilled        → container killed for exceeding memory limit
Evicted          → pod evicted from node (DiskPressure, MemoryPressure)
Terminating      → pod being deleted (may be stuck if finalizers not removed)
Unknown          → kubelet on the node is unreachable
Completed        → all containers exited 0 (Jobs/one-shot containers)
ImagePullBackOff → cannot pull container image
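
To pull only the rows that need attention out of that listing, a small filter along these lines can help. It is a sketch that assumes single-namespace `kubectl get pods` output, where STATUS is the third column; `unhealthy_pods` is a hypothetical helper name, not a kubectl feature.

```shell
# Print "<name> <status>" for every pod whose STATUS is not Running or Completed.
# Assumes the default column layout: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage: kubectl get pods -n <ns> | unhealthy_pods
```

A pod that is Running but not Ready (0/1) passes this filter unnoticed, so the READY column still deserves a look.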

Pending — Scheduling Issues

# Symptoms: pod stuck in Pending for > 30s
kubectl describe pod <pod> -n <ns>
# Events section will show WHY:

# "0/5 nodes are available: 5 Insufficient cpu."
# → All nodes lack enough CPU. Options:
#    - Reduce pod CPU request
#    - Add nodes / enable Karpenter
kubectl describe node <node> | grep -A10 "Allocated resources"
kubectl top nodes

# "0/3 nodes are available: 3 node(s) had taint ... NoSchedule"
# → Pod needs a toleration
kubectl describe node <node> | grep -A5 Taints
# Add to pod spec:
# tolerations:
# - key: "key"
#   operator: "Equal"
#   value: "value"
#   effect: "NoSchedule"

# "0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector"
kubectl get nodes --show-labels
# Fix: adjust nodeSelector or nodeAffinity to match available labels

# "1 node(s) had volume node affinity conflict"
# → PVC is in a different AZ than feasible nodes
kubectl get pvc <pvc> -n <ns> -o jsonpath='{.spec.volumeName}'
kubectl get pv <pv-name> \
  -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'
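# Fix: run the pod in the PV's zone, or (only if the data is expendable) delete and
#      recreate the PVC with a StorageClass that sets volumeBindingMode: WaitForFirstConsumer,
#      which delays provisioning until the pod is scheduled so the zones match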

# "pod has unbound immediate PersistentVolumeClaims"
kubectl get pvc -n <ns>
# Fix: PVC is Pending — see Storage Issues playbook

# "0/3 nodes are available: 3 Too many pods"
kubectl get node <node> -o jsonpath='{.status.allocatable.pods}'
# Kubelet default: 110 pods per node (managed clusters may set a different value);
# raise it with the kubelet maxPods setting / --max-pods flag
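
The message-to-cause mapping in this section can be sketched as a lookup. The function below is illustrative only: `pending_hint` is a hypothetical name, the patterns are taken from the messages quoted above, and the "Insufficient memory" and "untolerated taint" variants are added on the assumption that your scheduler version emits them.

```shell
# Map a FailedScheduling event message to the first thing to check.
# First matching pattern wins; unknown messages fall through to the default.
pending_hint() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "resources: lower requests or add nodes" ;;
    *"had taint"*|*"untolerated taint"*)
      echo "taints: add a matching toleration" ;;
    *"node affinity/selector"*)
      echo "affinity: align nodeSelector/nodeAffinity with node labels" ;;
    *"volume node affinity conflict"*)
      echo "volume: PV is pinned to another zone" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "pvc: claim is still Pending" ;;
    *"Too many pods"*)
      echo "max-pods: node pod limit reached" ;;
    *)
      echo "unknown: read the full event text" ;;
  esac
}
```

A single event often lists several reasons at once, so treat the result as a starting point rather than a verdict.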

ImagePullBackOff / ErrImagePull

kubectl describe pod <pod> -n <ns>
# Events: Failed to pull image "myrepo/myapp:v1.2.3": ...

# Diagnoses:
# 1. Wrong image name or tag
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.spec.containers[0].image}'
# Verify tag exists in registry

# 2. Private registry — missing imagePullSecret
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.spec.imagePullSecrets}'
# Fix: create secret and add to pod spec or service account
kubectl create secret docker-registry regcred \
  --docker-server=myrepo.azurecr.io \
  --docker-username=<user> \
  --docker-password=<token> \
  -n <ns>
# Or attach the secret to the default ServiceAccount so new pods using it pull with it:
kubectl patch serviceaccount default -n <ns> \
  -p '{"imagePullSecrets":[{"name":"regcred"}]}'

# 3. Registry unreachable from the cluster
#    (this probes from the pod network; node-level egress rules can differ)
kubectl run debug --image=curlimages/curl --rm -it --restart=Never -- \
  curl -I https://myrepo.azurecr.io/v2/

# 4. Rate limiting (Docker Hub)
# Look for: "You have reached your pull rate limit"
# Fix: use authenticated pull or mirror to ECR/GCR

# 5. Image digest mismatch (signed images)
# Look for: "manifest unknown" or cosign verification failure
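
When checking cause 1, it helps to know which tag the image reference actually resolves to, since a reference without a tag means `latest`. The helper below is a minimal sketch; `image_tag` is a hypothetical name and it only does string parsing, it does not contact any registry.

```shell
# Print the tag part of an image reference, or "latest" when none is given.
# A colon only counts as a tag separator if it comes after the last "/",
# so registry ports such as localhost:5000/app are not mistaken for tags.
image_tag() {
  ref=${1%%@*}      # drop any @sha256:... digest suffix
  last=${ref##*/}   # final path component, e.g. myapp:v1.2.3
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}
```

For example, `image_tag myrepo/myapp:v1.2.3` prints `v1.2.3`, while `image_tag localhost:5000/app` prints `latest`.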

CrashLoopBackOff

# CrashLoopBackOff = container keeps exiting shortly after start (usually non-zero, but
# exit 0 also counts under restartPolicy: Always); kubelet retries with exponential backoff
# Backoff: 10s → 20s → 40s → ... → 5min max
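
As a worked version of that schedule, the sketch below computes the wait before the Nth restart. It assumes exactly the numbers quoted above (start at 10s, double each time, cap at 300s); `crashloop_delay` is a hypothetical helper name.

```shell
# Delay in seconds before the Nth restart (N starts at 1).
crashloop_delay() {
  n=$1
  delay=10
  # Double once per earlier restart, stopping as soon as the cap is reached.
  while [ "$n" -gt 1 ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    n=$((n - 1))
  done
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}
```

Under these assumptions the cap is reached on the sixth restart, which is why a pod that has been crashing for a while appears to retry only every five minutes.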

# Step 1: Get logs from the crashed container
kubectl logs <pod> -n <ns> --previous
kubectl logs <pod> -n <ns> -c <container-name> --previous

# Step 2: Get the exit code
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# 1 = app error (check logs)
# 137 = SIGKILL (check for OOMKill or resource limit)
# 139 = segfault
# 143 = SIGTERM (container was asked to stop and exited; if it outlives
#       terminationGracePeriodSeconds it is SIGKILLed and reports 137 instead)
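
The codes above follow the usual shell convention that an exit code over 128 means "terminated by signal (code minus 128)". A small decoder, offered as a sketch with the hypothetical name `explain_exit_code`, makes that arithmetic explicit:

```shell
# Describe a container exit code using the 128 + signal convention.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "signal 9 (SIGKILL): OOM kill or forced stop" ;;
      11) echo "signal 11 (SIGSEGV): segfault" ;;
      15) echo "signal 15 (SIGTERM): asked to stop" ;;
      *)  echo "signal $sig" ;;
    esac
  else
    echo "application error $code: check logs"
  fi
}
```

It can be fed directly from the jsonpath query in Step 2, for example `explain_exit_code "$(kubectl get pod ... -o jsonpath=...)"`.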

# Step 3: Common causes and fixes

# Bad environment variable / missing ConfigMap key
kubectl describe pod <pod> -n <ns>
# Look for: Error: couldn't find key X in ConfigMap Y
# (the STATUS column typically shows CreateContainerConfigError in this case)
# Fix: check configMapKeyRef / secretKeyRef names

# Entrypoint command error
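kubectl run debug --image=<same-image> --rm -it --restart=Never --command -- /bin/sh
# --command overrides the image entrypoint; without it /bin/sh is passed as an argument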
# Test the command manually

# Liveness probe killing healthy container
kubectl describe pod <pod> -n <ns> | grep -A10 "Liveness"
# If "Liveness probe failed: ... Killing container"
# Fix: increase initialDelaySeconds or failureThreshold, or add a startupProbe for slow starts

# Database not reachable on startup
# App exits because it can't connect to DB
# Fix: add readiness check to DB, or use retry logic in app
# Or: init container that waits for DB to be ready

OOMKilled

# Container killed because it exceeded memory limit
kubectl describe pod <pod> -n <ns>
# Look for: "OOMKilled" in Last State

# Get memory limit
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}'

# Check current memory usage (needs metrics-server; this reflects the restarted
# container, not the moment of the kill)
kubectl top pod <pod> -n <ns>

# Options:
# 1. Increase memory limit
kubectl patch deployment <deploy> -n <ns> \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"<c>","resources":{"limits":{"memory":"512Mi"}}}]}}}}'

# 2. Find memory leak in application
# Enable heap profiling (Go pprof, Java JMX, Node --heap-prof)
# Add JVM flags: -XX:+HeapDumpOnOutOfMemoryError

# 3. Use VPA to right-size limits
# VPA will observe actual usage and recommend/set requests

# Common mistake: limit set too close to steady-state usage
# If the limit barely exceeds normal usage, any spike → OOMKill
# Rule of thumb: set limit = 2-3× request for Java/Node apps
#                leaving the limit unset avoids per-container OOMKills, but the pod
#                  is then exposed to node memory pressure (eviction or node OOM killer)
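
To judge how close a container runs to its limit, the value from `kubectl top pod` can be compared with the configured limit. The sketch below does that arithmetic; `to_mib` and `mem_pct_of_limit` are hypothetical helpers, and they handle only the binary suffixes Ki/Mi/Gi and plain byte counts.

```shell
# Convert a Kubernetes memory quantity to whole MiB (rounded down).
# Decimal suffixes (k, M, G) and fractional values are out of scope here.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo $(( $1 / 1024 / 1024 )) ;;
  esac
}

# Usage as an integer percentage of the limit.
# Usage: mem_pct_of_limit <usage> <limit>
mem_pct_of_limit() {
  echo $(( $(to_mib "$1") * 100 / $(to_mib "$2") ))
}
```

For instance, `mem_pct_of_limit 480Mi 512Mi` prints `93`, which suggests the container has very little room for a spike.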

Init Container Failures

# Pod stuck in Init:0/1 or Init:Error
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> -c <init-container-name>

# Common init container patterns and failures:

# Wait for database
initContainers:
- name: wait-for-db
  image: busybox
  command: ['sh', '-c', 'until nc -z postgres.production.svc 5432; do sleep 2; done']
# If stuck: postgres service doesn't exist or is in wrong namespace
kubectl get svc postgres -n production

# Database migration
initContainers:
- name: migrate
  image: myapp:latest
  command: ["./migrate", "--up"]
# If failing: check migration script logs
kubectl logs <pod> -n <ns> -c migrate

# Vault agent injector (init + sidecar)
# If stuck: vault agent can't reach Vault, or IRSA role wrong
kubectl logs <pod> -n <ns> -c vault-agent-init
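
An unbounded `until ... do sleep` loop, as in the wait-for-db pattern above, leaves the pod in Init:0/1 indefinitely when the dependency never appears. One option is a bounded retry so the init container fails visibly instead. This is a sketch; `wait_for` is a hypothetical helper, and the 120-second budget in the example is an arbitrary assumption.

```shell
# Retry a command until it succeeds or the time budget runs out.
# Usage: wait_for <timeout-seconds> <interval-seconds> <command> [args...]
wait_for() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "gave up after ${elapsed}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Example for the wait-for-db init container:
#   wait_for 120 2 nc -z postgres.production.svc 5432
```

When the budget is exhausted the init container exits non-zero, so the pod shows Init:Error and the reason lands in its logs.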

Readiness Probe Failing

# Pod is Running but not Ready (0/1 in READY column)
# → No traffic from Services until Ready

kubectl describe pod <pod> -n <ns>
# Look for: "Readiness probe failed: ..."

# Common probe types and failures:
# httpGet: returns a status code outside 200-399
#   → App not listening on correct port
#   → App returning 500 on /health
#   → App still starting up (increase initialDelaySeconds)

# tcpSocket: connection refused
#   → Port not open yet

# exec: command exits non-zero

# Fix 1: match probe to actual app endpoint
kubectl exec <pod> -n <ns> -- wget -qO- http://localhost:8080/healthz

# Fix 2: adjust timing
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # wait for app startup
  periodSeconds: 5
  failureThreshold: 3       # fail after 3 consecutive failures
  successThreshold: 1

# Fix 3: check if dependency is causing /health to return 500
# (e.g., /health checks DB connectivity — if DB is down, all pods become unready)
# Better pattern: liveness checks basic app health; readiness checks dependencies
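
The timing fields in Fix 2 combine into a rough reaction time: the probe has to fail failureThreshold times in a row, one attempt per periodSeconds. The sketch below works that out; `probe_failure_window` is a hypothetical helper, and the result is an approximate upper bound that ignores timeoutSeconds.

```shell
# Approximate upper bound, in seconds, between the app breaking and the kubelet
# marking the container unready: failureThreshold consecutive failures, one per period.
# Usage: probe_failure_window <periodSeconds> <failureThreshold>
probe_failure_window() {
  period=$1
  threshold=$2
  echo $((period * threshold))
}
```

With the values shown above, `probe_failure_window 5 3` prints `15`, so a broken pod can keep receiving traffic for roughly 15 seconds.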

Pod Stuck Terminating

# Pod has deletionTimestamp but won't terminate

# Check finalizers
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.metadata.finalizers}'

# Check for preStop hook hanging
kubectl describe pod <pod> -n <ns> | grep -A5 "preStop"

# Check terminationGracePeriodSeconds
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}'

# If preStop is hanging or app not handling SIGTERM:
# Option 1: wait for terminationGracePeriodSeconds to expire → SIGKILL
# Option 2: force delete (loses graceful shutdown)
kubectl delete pod <pod> -n <ns> --grace-period=0 --force
# WARNING: only use for stateless pods; stateful pods risk data corruption

# If stuck due to finalizer (uncommon for pods, common for PVCs):
kubectl patch pod <pod> -n <ns> \
  -p '{"metadata":{"finalizers":[]}}' --type=merge

Evicted Pods

# Node evicts pods under resource pressure
kubectl get pods -n <ns> | grep Evicted

# Get reason for eviction
kubectl describe pod <evicted-pod> -n <ns>
# "The node was low on resource: memory. Threshold quantity: 100Mi, available: 50Mi"
# "The node was low on resource: ephemeral-storage. ..."

# Clean up evicted pods (they stay in Evicted state until deleted)
kubectl get pods -A -o json | \
  jq -r '.items[] | select(.status.phase=="Failed" and .status.reason=="Evicted") |
    "\(.metadata.namespace) \(.metadata.name)"' | \
  while read -r ns name; do kubectl delete pod -n "$ns" "$name"; done

# Prevent eviction: set requests (QoS class matters)
# Guaranteed QoS (requests == limits) → evicted last
# Burstable QoS (requests < limits) → evicted based on usage vs request
# BestEffort QoS (no requests/limits) → evicted first
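
The QoS rules above can be sketched as a classifier for a single-container pod. `qos_class` is a hypothetical helper that compares values as strings; the class Kubernetes actually assigned is available in the pod's `.status.qosClass` field.

```shell
# Classify a single-container pod from its four resource settings.
# Usage: qos_class <cpu-request> <cpu-limit> <mem-request> <mem-limit>
# Pass "" for anything unset. Values are compared as strings, so "500m" and
# "0.5" are not treated as equal in this sketch.
qos_class() {
  # A request left unset defaults to the limit when a limit is set.
  cpu_req=${1:-$2}; cpu_lim=$2
  mem_req=${3:-$4}; mem_lim=$4
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo "BestEffort"
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
       [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}
```

For example, `qos_class 250m 500m 256Mi 512Mi` prints `Burstable`, while setting only limits (`qos_class "" 500m "" 256Mi`) prints `Guaranteed`.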

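# Note: a PodDisruptionBudget limits voluntary disruptions (node drains, API-initiated
# evictions); the kubelet does not honor it during node-pressure eviction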

Related

04 - Scheduler Flow: why pods become Pending
04 - Performance Issues: OOM and CPU throttling in depth
06 - Node Issues: node-level eviction and pressure
