Pod Lifecycle

    Phases, conditions, probes, init containers, graceful termination, and restart policies

    v1/Pod Core Kubernetes Platform Engineer

    Every pod moves through a defined lifecycle from creation to termination. Understanding the exact semantics of each phase, condition, container state, and probe, and how they interact with controllers, load balancers, and autoscalers, is the foundation for building operationally reliable workloads. This page covers the complete lifecycle model that underpins everything from rolling updates to graceful shutdowns.

    Lifecycle Overview

    Pod lifecycle from creation to termination:

    API create → Pending (scheduling)
          │
          ▼
    Node assigned → Pending (image pull + init containers)
          │
          ▼
    Init containers run (sequentially)
          │
          ▼
    Running (app containers + startupProbe)
          │
          ├─ startupProbe passes → livenessProbe + readinessProbe active
          │     │
          │     ├─ readinessProbe passes → pod.Ready=True → added to Endpoints
          │     │
          │     └─ livenessProbe fails N times → container killed → restart
          │
          ├─ All containers exit 0 (no more restarts) → Succeeded
          ├─ Any container exits non-0 (no more restarts) → Failed
          └─ delete signal → graceful termination sequence
                │
                ▼
          Pod removed from Endpoints (in parallel with the steps below)
          preStop hook → SIGTERM → wait → SIGKILL (at grace period)
                │
                ▼
          Containers terminated → pod object removed

    Pod Phases

    A pod's status.phase is a high-level summary of where in its lifecycle the pod is. It is a single string set by the kubelet.

    Phase | Meaning | Containers running?
    Pending | Pod accepted by the cluster but not yet running. Covers: waiting for scheduling, image pulling, init containers running, PVC binding. | No (or only init containers)
    Running | Pod bound to a node and at least one container is running (or starting/restarting). | Yes (at least one)
    Succeeded | All containers exited with code 0 and will not be restarted. Terminal state. | No
    Failed | All containers have terminated; at least one exited with non-zero code or was killed. Terminal state. | No
    Unknown | Pod state cannot be determined, typically because the node hosting it stopped reporting to the control plane. | Unknown

    Phase is coarse-grained: use conditions for precision
    A pod in Running phase may have containers in CrashLoopBackOff or failing readiness probes. Running only means at least one container is running or being restarted. For accurate health assessment, read status.conditions (especially Ready) and status.containerStatuses.

    Pod Conditions

    Conditions are structured status fields that provide granular lifecycle state. Each condition has type, status (True/False/Unknown), reason, and message.

    Condition type | True when | Gate for
    PodScheduled | Pod has been assigned to a node | Kubelet begins pulling images and starting init containers
    Initialized | All init containers have completed successfully | App containers start
    ContainersReady | All app containers are ready (passing readiness probes) | Contributes to pod Ready condition
    Ready | Pod is able to serve requests: ContainersReady=True AND all readinessGates satisfied | Pod added to Service Endpoints / EndpointSlices
    DisruptionTarget | Pod is being terminated due to a voluntary disruption (drain, preemption, eviction) | Informs podFailurePolicy in Jobs (Ignore action)
    PodReadyToStartContainers | Sandbox created and network configured (1.29+) | Init containers may start

    # View all pod conditions
    kubectl get pod <pod> -o jsonpath='{.status.conditions}' | jq .
    
    # Quick condition summary
    kubectl describe pod <pod> | grep -A15 "^Conditions:"
    
    # Find all pods not Ready in a namespace
    # (kubectl's JSONPath does not support nested filters, so filter with jq)
    kubectl get pods -n <namespace> -o json | \
      jq -r '.items[] | select(any(.status.conditions[]?; .type == "Ready" and .status == "True") | not) | .metadata.name'

    Container States

    Each container within a pod has an independent state captured in status.containerStatuses[].state.

    State | Meaning | Key sub-fields
    Waiting | Container not yet running: waiting for image pull, init containers, or backoff | reason: ContainerCreating, ImagePullBackOff, ErrImagePull, CrashLoopBackOff, CreateContainerConfigError
    Running | Container executing | startedAt: timestamp when container started
    Terminated | Container finished (success or failure) | exitCode, reason (Completed, OOMKilled, Error, ContainerCannotRun), startedAt, finishedAt

    Common Container Reason Codes

    Reason | State | Cause
    Completed | Terminated | Exit code 0: normal successful exit
    OOMKilled | Terminated | Exit code 137: exceeded memory limit
    Error | Terminated | Non-zero exit code (app error)
    ContainerCannotRun | Terminated | Container runtime failed to start the container (bad entrypoint, permission error)
    DeadlineExceeded | Terminated | activeDeadlineSeconds expired (typically surfaced as the pod-level status.reason rather than a per-container reason)
    CrashLoopBackOff | Waiting | Container repeatedly failing; kubelet is backing off restarts
    ImagePullBackOff | Waiting | Container image cannot be pulled: auth failure, image not found
    CreateContainerConfigError | Waiting | Referenced Secret or ConfigMap does not exist

    # Get current and last container state
    kubectl get pod <pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}: state={.state}, lastState={.lastState}{"\n"}{end}'
    
    # Check exit code and reason
    kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

    Container Restart Backoff

    When a container fails and restartPolicy allows restart, kubelet uses exponential backoff before restarting: 10s → 20s → 40s → 80s → 160s → 300s (capped). The pod shows CrashLoopBackOff in Waiting state during the backoff window.

    CrashLoopBackOff backoff schedule:
      Restart 1:  immediate
      Restart 2:  wait 10s
      Restart 3:  wait 20s
      Restart 4:  wait 40s
      Restart 5:  wait 80s
      Restart 6:  wait 160s
      Restart 7+: wait 300s (5 min, max)
      Reset: if container runs successfully for 10 minutes, backoff counter resets to 0 on next failure

    CrashLoopBackOff is not a phase; it is a container state reason
    A pod with a container in CrashLoopBackOff is still in Running phase (kubelet is actively managing it). The pod will never enter Failed phase while kubelet is retrying. To terminate the retry loop, either fix the container or set restartPolicy: Never.
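    The schedule above can be sketched as a small shell function. This only illustrates the documented doubling from 10s up to the 300s cap; it is not kubelet's implementation, and the name backoff_delay is made up for this example.

```shell
# backoff_delay N
# Prints the seconds waited before restart number N (N >= 1), following the
# schedule described above: first restart immediate, then 10s doubling up
# to a 300s cap. Illustrative only; not kubelet code.
backoff_delay() {
  n=$1
  if [ "$n" -le 1 ]; then
    echo 0
    return 0
  fi
  delay=10
  i=2
  while [ "$i" -lt "$n" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge 300 ]; then
      delay=300
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}
```

    For example, backoff_delay 4 prints 40, and any restart number from 7 upward prints 300.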

    restartPolicy

    Policy | Restart on failure? | Restart on success? | Use case
    Always (default) | Yes | Yes: container is always restarted when it exits | Long-running services (Deployments, DaemonSets)
    OnFailure | Yes (non-zero exit) | No: container left in Terminated state on exit 0 | Jobs where success = done, failure = retry
    Never | No | No: pod enters Succeeded/Failed immediately | One-shot tasks, migration Jobs

    restartPolicy | All exit 0 | Any exit non-0
    Always | Running (restarting) | Running (restarting)
    OnFailure | Succeeded | Running (restarting) until backoffLimit
    Never | Succeeded | Failed

    Init Containers

    Init containers run sequentially before app containers start. Each must complete successfully before the next begins. If an init container fails, it is retried (per restartPolicy) before the pod proceeds.

    Init container execution sequence:

      init-container-1 runs → exits 0
      init-container-2 runs → exits 0
      init-container-3 runs → exits non-0 → RETRY (per restartPolicy)
      init-container-3 runs → exits 0
      → App containers start (all simultaneously)

    If any init container fails repeatedly: pod stays in Init:CrashLoopBackOff
    Pod condition Initialized=False until all init containers succeed

    spec:
      initContainers:
        # Wait for database to be ready
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c',
            'until nc -z postgres-svc 5432; do echo waiting; sleep 2; done']
    
        # Run schema migration
        - name: migrate
          image: myapp:v3
          command: ['./migrate', '--direction=up']
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
    
      containers:
        - name: app
          image: myapp:v3
    Regular init containers do not support probes
    Regular init containers (those without restartPolicy: Always) do not support livenessProbe, readinessProbe, startupProbe, or lifecycle hooks; a pod spec that sets them is rejected at validation. Only native sidecar containers accept probes. An init container that hangs indefinitely will block the pod from starting, so use activeDeadlineSeconds on the pod or design init containers to fail fast.
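    One way to make a waiting init container fail fast is to bound its retry loop instead of looping forever. The helper below is a minimal sketch; wait_for and its argument order are invented for this example.

```shell
# wait_for MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success, 1 if every attempt failed.
wait_for() {
  max_attempts=$1
  pause=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt/$max_attempts failed" >&2
    attempt=$((attempt + 1))
    sleep "$pause"
  done
  return 1
}
```

    In an init container this could be invoked as wait_for 30 2 nc -z postgres-svc 5432, so the container exits non-zero after roughly a minute and the failure becomes visible as a restart rather than a silent hang.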

    Native Sidecar Containers (Stable 1.33)

    A sidecar is an init container with restartPolicy: Always. It starts before app containers (like other init containers) but keeps running alongside them, and is shut down gracefully after app containers exit.

    spec:
      initContainers:
        # Native sidecar: starts first, stays running alongside app
        - name: log-forwarder
          image: fluent/fluent-bit:3.0
          restartPolicy: Always           # makes this a sidecar container
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log/app
    
        - name: envoy-proxy
          image: envoyproxy/envoy:v1.28
          restartPolicy: Always           # sidecar
          ports:
            - containerPort: 15001
    
      containers:
        - name: app
          image: myapp:v3
    Native sidecar startup and shutdown ordering:

    Startup:
      sidecar-1 (log-forwarder) starts and reports started (startupProbe passes, if defined)
      sidecar-2 (envoy-proxy) starts and reports started
      app container starts

    Shutdown (pod delete signal):
      preStop hooks run and SIGTERM is sent to the app containers first
      App containers exit → sidecars receive SIGTERM (in reverse order of definition) → sidecars exit
      Pod terminates cleanly

    Key difference from regular init containers:
      Regular init containers: exit before app starts
      Sidecar init containers: run alongside app, exit after app

    Sidecars solve the Job termination problem
    Before native sidecars, running Istio or Datadog alongside a Job prevented the Job from completing (sidecar kept running after the worker exited). With restartPolicy: Always on the sidecar init container, Kubernetes sends SIGTERM to sidecars when the main container exits, enabling clean Job completion. See Jobs page.

    Probes

    Kubelet runs three types of probes against containers to determine their health and readiness. They are independent: each has its own timing, threshold, and action when it fails.

    Probe timeline for a newly started container:

    t=0                         Container starts
    t=0..initialDelaySeconds    No probes run
          │
    startupProbe runs every periodSeconds
          │
          ├─ passes → startupProbe disabled, livenessProbe + readinessProbe activate
          └─ fails failureThreshold times → container killed (restarted per restartPolicy)

    After startup passes:
      livenessProbe (runs every periodSeconds)
          ├─ passes → container healthy, no action
          └─ fails failureThreshold times → container killed and restarted
      readinessProbe (runs every periodSeconds)
          ├─ passes → pod condition Ready=True → in Endpoints
          └─ fails → pod condition Ready=False → removed from Endpoints (container is not killed)

    Probe | Failure action | Success action | Purpose
    startupProbe | After failureThreshold failures: container killed and restarted | Probe disabled permanently; liveness + readiness activate | One-time startup gate for slow-starting containers (JVM, model loading)
    livenessProbe | After failureThreshold failures: container killed and restarted | No action | Detect hung/deadlocked containers that should be restarted
    readinessProbe | Pod condition Ready=False → removed from Service Endpoints | Pod condition Ready=True → added back to Endpoints | Signal when container is ready to receive traffic

    Probe Mechanisms

    # 1. httpGet: HTTP GET request; success = 200-399 response
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
          - name: Custom-Header
            value: probe
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
      successThreshold: 1
      timeoutSeconds: 5
    
    # 2. exec: run command in container; success = exit code 0
    readinessProbe:
      exec:
        command: ["/bin/sh", "-c", "redis-cli ping | grep PONG"]
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3
    
    # 3. tcpSocket: TCP connection attempt; success = connection established
    livenessProbe:
      tcpSocket:
        port: 5432
      initialDelaySeconds: 15
      periodSeconds: 20
    
    # 4. grpc: gRPC Health Check Protocol (v1); success = SERVING status
    readinessProbe:
      grpc:
        port: 9090
        service: "liveness"   # gRPC service name (optional)
      initialDelaySeconds: 10
      periodSeconds: 5

    Probe Timing Fields

    Field | Default | Meaning
    initialDelaySeconds | 0 | Seconds to wait after container start before first probe. Use for slow-starting apps if not using startupProbe.
    periodSeconds | 10 | How often to run the probe (minimum 1s)
    timeoutSeconds | 1 | Seconds to wait for probe response before counting as failure
    successThreshold | 1 | Minimum consecutive successes to mark probe as passing (must be 1 for liveness/startup)
    failureThreshold | 3 | Consecutive failures before action is taken (restart for liveness/startup; remove from endpoints for readiness)
    terminationGracePeriodSeconds | pod-level default | Override for liveness/startup probe: grace period before SIGKILL after probe-triggered kill
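
    The timing fields combine into a rough window before a continuously failing probe triggers its action. The sketch below is an approximation that ignores timeoutSeconds and scheduling jitter; probe_window is a name chosen for this example only.

```shell
# probe_window FAILURE_THRESHOLD PERIOD_SECONDS [INITIAL_DELAY_SECONDS]
# Approximate seconds from container start until a continuously failing
# probe triggers its action. Ignores timeoutSeconds and scheduling jitter.
probe_window() {
  failure_threshold=$1
  period_seconds=$2
  initial_delay=${3:-0}
  echo $((initial_delay + failure_threshold * period_seconds))
}
```

    For instance, probe_window 30 10 prints 300 (a five-minute startup allowance), while probe_window 3 10 prints 30 (the time to detect a hung container with the default liveness settings).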

    startupProbe: Extending Slow-Start Grace

    # Pattern: startupProbe gives up to 5 minutes for startup,
    # then hands off to liveness with a 30s timeout window
    startupProbe:
      httpGet:
        path: /ready
        port: 8080
      failureThreshold: 30      # 30 failures × 10s period = 5 minute startup window
      periodSeconds: 10
      timeoutSeconds: 5
    
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 3       # 3 failures × 10s = 30s to detect hung state
      periodSeconds: 10
    
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      failureThreshold: 2
      periodSeconds: 5
      successThreshold: 1
    Liveness probe anti-pattern: checking external dependencies
    A liveness probe that calls an upstream service (database, cache, external API) will restart the container when the dependency is down, not when the container itself is broken. This causes cascading restarts across all replicas simultaneously when a shared dependency degrades. Liveness probes must only check the container's own internal health (goroutine liveness, heap state, internal deadlock detection). Dependency health belongs in readiness probes.
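
    As an illustration of a liveness check that looks only at internal state, the sketch below tests a heartbeat file for freshness. It assumes a hypothetical convention in which the application's main loop rewrites the file with the current Unix timestamp on every iteration; the function name and file layout are not from any particular framework.

```shell
# heartbeat_fresh FILE MAX_AGE_SECONDS
# Succeeds when FILE holds a Unix timestamp no older than MAX_AGE_SECONDS.
# Assumes the application's main loop rewrites FILE with `date +%s`
# output on every iteration (a hypothetical convention for this sketch).
heartbeat_fresh() {
  [ -r "$1" ] || return 1
  last_beat=$(cat "$1")
  now=$(date +%s)
  [ $((now - last_beat)) -le "$2" ]
}
```

    Used from an exec liveness probe, a stale or missing heartbeat fails the probe because the process itself has stalled, while an unreachable database has no effect on the result.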

    Graceful Termination

    When a pod is deleted, Kubernetes executes a structured termination sequence designed to drain in-flight requests before killing the process.

    Graceful termination sequence:

    t=0   kubectl delete pod / controller deletes pod
          │   (the grace period countdown starts here)
          │
          ├─ Endpoints controller removes pod from Service Endpoints
          │    (kube-proxy updates iptables/ipvs; propagation delay ~1-5s)
          │
          ├─ kubelet runs the preStop hook, if one is defined
          │    (SIGTERM is held back until the hook completes)
          │
          └─ kubelet sends SIGTERM to PID 1 of each container
               (immediately, if there is no preStop hook)

    t=terminationGracePeriodSeconds (default 30s):
          If container still running → SIGKILL (immediate kill)

    t=termination complete: pod removed from API server

    Key timing:
      - preStop hook + container shutdown must complete within terminationGracePeriodSeconds
      - iptables propagation happens in parallel and is not guaranteed to finish before SIGTERM
      - Use a preStop sleep to absorb the iptables propagation delay

    spec:
      terminationGracePeriodSeconds: 60    # Total time before SIGKILL (default: 30)
    
      containers:
        - name: app
          lifecycle:
            preStop:
              # Sleep to absorb kube-proxy iptables update propagation (1-5s typical)
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
    
              # OR: call a shutdown endpoint
              # httpGet:
              #   path: /shutdown
              #   port: 8080
    iptables propagation race condition
    When a pod is deleted, the Endpoints controller removes the pod from the EndpointSlice, but kube-proxy on each node needs time to update iptables/ipvs rules. New connections may still route to the terminating pod for 1-5 seconds after SIGTERM arrives. The canonical fix is a preStop: sleep 5 (or longer), which delays SIGTERM delivery, giving kube-proxy time to drain the pod from the load balancer before the application starts refusing new connections.
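
    Because SIGTERM is delivered to PID 1 of the container, a shell script used as the entrypoint has to pass the signal on to the real application, which a plain shell does not do for background children. The wrapper below is a minimal sketch of that forwarding; handle_term and run_app are names invented here.

```shell
# Entrypoint wrapper sketch: forward SIGTERM from PID 1 to the real
# application process and wait for it to exit.
shutting_down=0
child_pid=""

handle_term() {
  shutting_down=1
  if [ -n "$child_pid" ]; then
    kill -TERM "$child_pid" 2>/dev/null || true
  fi
}
trap handle_term TERM

# run_app COMMAND [ARGS...]: start the app in the background, then block
# until it exits; returns the app's exit status.
run_app() {
  "$@" &
  child_pid=$!
  wait "$child_pid"
  app_status=$?
  # If a trapped signal interrupted the first wait, wait again so the
  # child can finish its own shutdown and be reaped.
  if [ "$shutting_down" -eq 1 ]; then
    wait "$child_pid"
    app_status=$?
  fi
  child_pid=""
  return "$app_status"
}
```

    An entrypoint script would end with a call such as run_app /app/server (a placeholder path), so the application gets the whole remaining grace period to drain before SIGKILL. Starting the application with exec avoids the need for a wrapper entirely when no extra shell logic is required.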
    # Production-grade graceful termination config
    spec:
      terminationGracePeriodSeconds: 90   # preStop(10s) + drain(60s) + buffer(20s)
      containers:
        - name: app
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]  # Wait for LB drain
    
            postStart:
              # Run after container starts (before readiness probe)
              exec:
                command: ["/bin/sh", "-c", "./scripts/warm-cache.sh"]

    Lifecycle Hooks

    Hook | When it runs | Blocks? | Failure behavior
    postStart | After container is created and started (not guaranteed to run before entrypoint) | Yes: container stays in ContainerCreating until hook completes or times out | Container killed and restarted
    preStop | Before container receives SIGTERM (executed synchronously) | Yes: SIGTERM is delayed until hook completes (within grace period) | Termination continues (SIGTERM, then SIGKILL once the grace period expires); hook failure is logged but not propagated

    postStart is not guaranteed to run before the entrypoint
    The postStart hook runs concurrently with the container's entrypoint; there is no guaranteed ordering. For initialization that must complete before the application serves traffic, use an init container instead. Use postStart only for best-effort work (cache warming, metric registration) that doesn't block application startup.
    lifecycle:
      postStart:
        httpGet:         # Register with service discovery after start
          path: /register
          port: 8500     # Consul agent port
          host: localhost
    
      preStop:
        httpGet:         # Deregister before shutdown
          path: /deregister
          port: 8500
          host: localhost

    Pod Readiness Gates

    Readiness gates (stable since 1.14) add custom conditions to the pod's readiness calculation. A pod is considered Ready only if all container readiness probes pass AND all readiness gate conditions are True.

    spec:
      readinessGates:
        - conditionType: "feature-gates.example.com/canary-approved"
        - conditionType: "load-balancer.example.com/in-pool"
    
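    # An external controller must set these conditions through the pod's
    # status subresource. With kubectl 1.24+ this can be tried by hand:
    # kubectl patch pod <pod> --subresource=status --type=strategic -p '{"status":{"conditions":[
    #   {"type":"feature-gates.example.com/canary-approved","status":"True"}
    # ]}}'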

    Readiness gates enable external systems to gate traffic routing independently of container health:

    • Canary gates: progressive delivery controllers hold a pod out of rotation until analysis approves
    • Load balancer registration: cloud LB controller signals when the pod is registered in the target group
    • Warm-up gates: custom warm-up controller marks pod ready only after cache pre-population
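
    The readiness rule described above reduces to a logical AND over ContainersReady and every gate condition. The sketch below spells that out; pod_ready is an illustrative name, and the statuses are passed in by hand rather than read from the API.

```shell
# pod_ready CONTAINERS_READY [GATE_STATUS...]
# Prints True only when ContainersReady and every readiness gate status
# are "True". A gate whose condition has not been set yet should be passed
# as anything other than "True" (for example "Unknown").
pod_ready() {
  for condition_status in "$@"; do
    if [ "$condition_status" != "True" ]; then
      echo False
      return 0
    fi
  done
  echo True
}
```

    So pod_ready True True False prints False: healthy containers are not enough while one external gate is still closed.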

    Pod Deletion vs Force Delete

    # Normal deletion: graceful (uses terminationGracePeriodSeconds)
    kubectl delete pod <pod> -n <namespace>
    
    # Override grace period (useful when pod is stuck terminating)
    kubectl delete pod <pod> -n <namespace> --grace-period=5
    
    # Force delete: immediately removes from API server WITHOUT waiting for kubelet confirmation
    # WARNING: The pod process may still be running on the node
    kubectl delete pod <pod> -n <namespace> --grace-period=0 --force
    
    # Force delete is dangerous for StatefulSets:
    # The pod identity may be recreated before the old pod has fully stopped,
    # causing two pods with the same identity (split-brain, data corruption)
    Force delete StatefulSet pods only in emergencies
    Force-deleting a StatefulSet pod removes it from the API immediately, allowing a new pod with the same identity to start. If the old pod is still running (kubelet temporarily unreachable, not crashed), two pods will have the same network identity and claim the same PVC, causing data corruption in databases and split-brain in consensus systems. Only force-delete when you have confirmed the node is truly dead and the old pod cannot be running.

    Complete Lifecycle Configuration

    apiVersion: v1
    kind: Pod
    metadata:
      name: lifecycle-demo   # example name; metadata.name is required
    spec:
      terminationGracePeriodSeconds: 90
    
      # Readiness gate from external controller
      readinessGates:
        - conditionType: "platform.example.com/warmed-up"
    
      initContainers:
        # Native sidecar (runs alongside app containers)
        - name: log-agent
          image: fluent/fluent-bit:3.0
          restartPolicy: Always
    
        # Classic init container (runs before app, then exits)
        - name: db-migrate
          image: myapp:v3
          command: ["./migrate", "--up"]
    
      containers:
        - name: app
          image: myapp:v3
          ports:
            - containerPort: 8080
    
          # Startup: up to 5 min for JVM to warm up
          startupProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            failureThreshold: 30
            periodSeconds: 10
    
          # Liveness: internal deadlock detection only
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
    
          # Readiness: ready to accept traffic
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
            successThreshold: 1
    
          # Lifecycle hooks
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh", "-c", "echo started >> /tmp/lifecycle.log"]
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10 && /app/shutdown.sh"]
    
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2
              memory: 2Gi

    Metrics

    Metric | Labels | Use
    kube_pod_status_phase | phase, pod, namespace | Count pods in each phase; watch for rising Pending/Failed
    kube_pod_container_status_restarts_total | container, pod, namespace | Cumulative restarts; high rate = CrashLoopBackOff
    kube_pod_container_status_ready | container, pod | Binary readiness; 0 = not ready (removed from endpoints)
    kube_pod_status_ready | condition, pod, namespace | Pod-level Ready condition
    kubelet_pod_start_duration_seconds | quantile | Pod start latency; includes image pull + init containers

    Alerting Rules

    groups:
      - name: pod-lifecycle
        rules:
          # Container restart rate high (CrashLoopBackOff indicator)
          - alert: ContainerHighRestartRate
            expr: |
              rate(kube_pod_container_status_restarts_total[15m]) * 60 > 0.1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "{{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} restarting >6/hr"
    
          # Pod stuck in Pending for > 5 minutes
          - alert: PodLongPending
            expr: kube_pod_status_phase{phase="Pending"} == 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} stuck in Pending"
    
          # Pod stuck in Failed phase
          - alert: PodFailed
            expr: kube_pod_status_phase{phase="Failed"} == 1
            for: 0m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is in Failed phase"
    
          # Container not ready for extended period
          - alert: ContainerNotReady
            expr: kube_pod_container_status_ready == 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} not ready for 10m"

    Runbooks

    Pod Stuck in Pending

    # Check events for scheduling failure reason
    kubectl describe pod <pod> -n <namespace> | grep -A20 Events
    
    # Common: insufficient resources → check node allocatable
    kubectl describe nodes | grep -A5 "Allocatable:\|Allocated resources:"
    
    # Common: image pull error → check image name and registry credentials
    kubectl get pod <pod> -o jsonpath='{.spec.containers[*].image}'
    kubectl get events -n <namespace> --field-selector reason=Failed | grep ImagePull
    
    # Common: PVC not bound → check StorageClass and PV availability
    kubectl get pvc -n <namespace>
    kubectl describe pvc <pvc-name> -n <namespace>

    Container in CrashLoopBackOff

    # Get current logs (if container is running)
    kubectl logs <pod> -n <namespace> -c <container>
    
    # Get logs from previous container instance
    kubectl logs <pod> -n <namespace> -c <container> --previous
    
    # Get exit code and reason
    kubectl describe pod <pod> -n <namespace> | grep -A5 "Last State"
    
    # Common: OOMKilled → increase memory limit
    kubectl set resources deployment <name> --limits=memory=2Gi -n <namespace>
    
    # Common: bad entrypoint → check image CMD/entrypoint
    kubectl get pod <pod> -o jsonpath='{.spec.containers[*].command} {.spec.containers[*].args}'
    
    # Common: missing Secret/ConfigMap (CreateContainerConfigError)
    kubectl get pod <pod> -o yaml | grep -A5 "envFrom\|secretRef\|configMapRef"
    kubectl get secret <secret-name> -n <namespace>

    Pod Not Becoming Ready (Readiness Probe Failing)

    # Check readiness probe config
    kubectl describe pod <pod> -n <namespace> | grep -A10 "Readiness:"
    
    # Run readiness check manually inside the container
    kubectl exec <pod> -n <namespace> -- curl -s http://localhost:8080/ready
    
    # Check if app is actually listening
    kubectl exec <pod> -n <namespace> -- ss -tlnp
    
    # Check if it's a readiness gate blocking (not the container probe)
    kubectl get pod <pod> -n <namespace> -o jsonpath='{.status.conditions}' | \
      jq '.[] | select(.type != "Ready" and .status != "True")'

    Pod Stuck in Terminating

    # Check if finalizers are blocking deletion
    kubectl get pod <pod> -n <namespace> -o jsonpath='{.metadata.finalizers}'
    
    # Remove finalizer (if stuck)
    kubectl patch pod <pod> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
    
    # Check if preStop hook is hanging
    kubectl describe pod <pod> -n <namespace> | grep -A5 "PreStop\|terminationGrace"
    
    # Force delete if node is dead (for StatefulSet pods, confirm the old pod is truly gone first)
    kubectl delete pod <pod> -n <namespace> --grace-period=0 --force

    Liveness Probe Causing Cascading Restarts

    # Check restart pattern: are all replicas restarting simultaneously?
    kubectl get pods -n <namespace> -l app=<app> \
      -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,AGE:.status.startTime
    
    # Check liveness probe config
    kubectl describe pods -n <namespace> -l app=<app> | grep -A10 "Liveness:"
    
    # If probe checks external dependency, change to internal health only
    # Temporary mitigation: increase failureThreshold to reduce restart rate
    kubectl patch deployment <name> -n <namespace> --type=json -p='[{
      "op":"replace",
      "path":"/spec/template/spec/containers/0/livenessProbe/failureThreshold",
      "value":10
    }]'

    Best Practices

    1. Use startupProbe for slow-starting applications: set failureThreshold × periodSeconds to cover your worst-case startup time. This prevents liveness probe failures during startup without requiring a dangerously high initialDelaySeconds on the liveness probe.
    2. Liveness probes must check only internal state: never check database connectivity, upstream services, or external dependencies in a liveness probe. Failures cascade across all replicas simultaneously when a shared dependency degrades. Dependency health belongs in readiness probes.
    3. Add a preStop sleep to absorb kube-proxy propagation delay: a preStop exec of sleep 5 (or 10 for high-traffic services) gives kube-proxy time to update iptables before the application stops accepting connections, preventing connection resets on in-flight requests.
    4. Set terminationGracePeriodSeconds to cover preStop + drain time. Formula: preStop duration + max request duration + buffer ≤ terminationGracePeriodSeconds. The default 30s is often too short for services with long-running requests.
    5. Use native sidecar containers (1.33) for Jobs with injected sidecars: Istio, Datadog, and Vault Agent injected sidecars previously blocked Job completion. Upgrading to native sidecars (restartPolicy: Always in initContainers) resolves this cleanly.
    6. Keep readinessProbe.successThreshold at 1 unless you need debouncing: the default is 1, and higher values require the probe to pass multiple consecutive times before the pod is marked ready, which slows down rolling updates.
    7. Never force-delete StatefulSet pods without confirming the node is truly dead: force deletion removes the pod from the API immediately, allowing a replacement to start with the same identity. If the original pod is still running (temporary network partition), two pods will share the same PVC and network identity, causing data corruption.
    8. Use readiness gates for external traffic management coordination: if a progressive delivery controller (Argo Rollouts, Flagger) or cloud load balancer controller needs to signal readiness independently of container health, readiness gates provide the correct integration point without hacking probes.
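
    The grace period formula from practice 4 is easy to check mechanically before rollout. The function below is a minimal sketch with an invented name (grace_period_ok); all four arguments are in seconds and would come from your own manifests and request-latency data.

```shell
# grace_period_ok PRESTOP MAX_REQUEST BUFFER GRACE_PERIOD   (all in seconds)
# Succeeds when preStop + longest request + buffer fits in the grace period.
grace_period_ok() {
  needed=$(( $1 + $2 + $3 ))
  if [ "$needed" -le "$4" ]; then
    echo "ok: need ${needed}s, have ${4}s"
    return 0
  fi
  echo "too short: need ${needed}s, have ${4}s"
  return 1
}
```

    With a 10s preStop sleep, 60s of drain, and a 20s buffer, grace_period_ok 10 60 20 90 succeeds, whereas the same workload against the default 30s grace period fails the check.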