Troubleshooting Overview

Overview

A structured guide to diagnosing Kubernetes issues — from a decision tree for locating the problem, through per-layer playbooks, to escalation and post-incident habits.

Troubleshooting Philosophy

Kubernetes is a distributed system, so a failure in one layer often surfaces as a symptom in another. When something breaks:

1. Narrow the blast radius first
   Is it one pod? One node? One namespace? The whole cluster?

2. Follow the data path
   Start from the symptom (user can't reach service)
   Trace backwards: Service → Pod → Container → Image → Node → Network

3. Distinguish symptom from cause
   "Pod CrashLoopBackOff" is a symptom. Root cause could be:
   - Bad config (wrong env var)
   - OOMKilled (under-resourced)
   - Application bug (null pointer)
   - External dependency unavailable (DB unreachable)

4. Use observability, not intuition
   Logs → Events → Metrics → Traces
   kubectl describe is usually the best first tool: it shows state, conditions, and recent events in one place.
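
Step 1 can be made concrete by counting unhealthy pods per namespace and per node. The following is a minimal sketch, not a standard tool: the function name `blast_radius` is hypothetical, and it assumes that "unhealthy" means "the pod's Ready condition is not True", ignoring pods that have already completed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count not-Ready pods per namespace and per node.
# Assumption: a pod is "unhealthy" when its Ready condition is not True.
blast_radius() {
  kubectl get pods -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NODE:.spec.nodeName,PHASE:.status.phase,READY:.status.conditions[?(@.type=="Ready")].status' |
  awk '
    $3 == "Succeeded" { next }                 # completed Job pods are not failures
    NF && $4 != "True" { ns[$1]++; node[$2]++; total++ }
    END {
      for (k in ns)   print "namespace", k, ns[k]
      for (k in node) print "node", k, node[k]
      print "total", total + 0
    }' | sort
}

# Usage: blast_radius
```

If most of the count lands on one node, start with the node; if it lands in one namespace across many nodes, start with the workload or its configuration.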

Troubleshooting Decision Tree

User reports: "the app is down" or "feature X is broken"
                 │
                 ▼
         Can you reach the service?
         ┌───────────────────────┐
         │ No                    │ Yes → go to application layer
         └──────────┬────────────┘
                    ▼
         Is the Service endpoint populated?
         kubectl get endpoints <svc> -n <ns>
         ┌────────────────────────────────┐
         │ Empty / wrong IPs              │ IPs present → DNS or network issue
         └──────────┬─────────────────────┘
                    ▼
         Are the pods Running and Ready?
         kubectl get pods -n <ns>
         ┌────────────────────────────────────────────────────────────────────┐
         │ Pending          │ CrashLoopBackOff   │ Running+Not Ready           │
         │ → Scheduling     │ → App/Config issue │ → readinessProbe failing    │
         │   issue          │   or OOMKill       │   or startup taking too long│
         └──────────────────┴────────────────────┴─────────────────────────────┘
                    ▼
         Is the node healthy?
         kubectl get nodes
         ┌────────────────────────────────┐
         │ NotReady                       │ Ready → pod-level issue
         └──────────┬─────────────────────┘
                    ▼
         Control plane healthy?
         kubectl get --raw='/readyz?verbose'   (componentstatuses is deprecated since v1.19)
         kubectl get pods -n kube-system
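
The first questions of the tree can be walked with a small script. This is a sketch under assumptions: `triage_service` is a hypothetical name, it relies on the default column layout of `kubectl get pods` and `kubectl get nodes`, and the final hint (compare the Service selector with pod labels) is an extra check that the tree above does not spell out.

```shell
#!/usr/bin/env bash
# Hypothetical helper that walks the endpoint, pod, and node questions.
# Usage: triage_service <service> <namespace>
triage_service() {
  local svc="$1" ns="$2" ips bad_pods bad_nodes found=0

  # Is the Service endpoint populated?
  ips=$(kubectl get endpoints "$svc" -n "$ns" \
          -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  if [ -n "$ips" ]; then
    echo "endpoints populated: suspect DNS or network"
    return 0
  fi

  # Are the pods Running and Ready? (columns: NAME READY STATUS ...)
  bad_pods=$(kubectl get pods -n "$ns" --no-headers 2>/dev/null |
    awk 'NF && $3 != "Completed" { split($2, r, "/")
         if ($3 != "Running" || r[1] != r[2]) print $1, $3, $2 }')
  if [ -n "$bad_pods" ]; then
    echo "pods not ready: suspect scheduling, app/config, or probes"
    echo "$bad_pods"
    found=1
  fi

  # Is the node healthy? (columns: NAME STATUS ...)
  bad_nodes=$(kubectl get nodes --no-headers 2>/dev/null |
    awk 'NF && $2 !~ /^Ready/ { print $1, $2 }')
  if [ -n "$bad_nodes" ]; then
    echo "nodes not ready: suspect node or control plane"
    echo "$bad_nodes"
    found=1
  fi

  if [ "$found" -eq 0 ]; then
    echo "pods and nodes look healthy: compare the Service selector with pod labels"
  fi
  return 0
}
```

The script only points at a layer; the per-layer playbooks remain the place to confirm the cause.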

First-Response Checklist

# 1. Get the big picture
kubectl get pods -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# 2. Zoom in on the failing pod
kubectl describe pod <pod-name> -n <namespace>
# Check: Events section at the bottom (most important)
# Check: Container state (Waiting/reason, Last State/exit code)
# Check: Conditions (Initialized, Ready, ContainersReady, PodScheduled)

# 3. Get container logs
kubectl logs <pod-name> -n <namespace> --tail=100
kubectl logs <pod-name> -n <namespace> --previous   # previous crashed container

# 4. Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes

# 5. Check node health
kubectl get nodes -o wide
kubectl describe node <node-name> | grep -A10 "Conditions:"
kubectl describe node <node-name> | grep -A15 "Allocated resources"
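
During an incident it can help to capture the output of steps 1 to 3 before the pod is restarted or rescheduled. A minimal sketch, assuming a hypothetical function name `first_response` and an output directory layout chosen only for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: save first-response output to files for later review.
# Usage: first_response <namespace> <pod> [output-dir]
first_response() {
  local ns="$1" pod="$2"
  local out="${3:-./triage-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$out" || return 1

  # Each command writes stdout and stderr to its own file, so a failing
  # step (for example, no previous container) is recorded rather than lost.
  kubectl get pods -n "$ns" -o wide                      > "$out/pods.txt" 2>&1
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' > "$out/events.txt" 2>&1
  kubectl describe pod "$pod" -n "$ns"                   > "$out/describe.txt" 2>&1
  kubectl logs "$pod" -n "$ns" --tail=100                > "$out/logs.txt" 2>&1
  kubectl logs "$pod" -n "$ns" --previous --tail=100     > "$out/logs-previous.txt" 2>&1

  echo "$out"   # print where the bundle was written
}
```

Events are typically retained for only about an hour by default, so saving them early preserves evidence for the post-incident review.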

Exit Code Reference

Exit Code	Meaning	Common Cause
`0`	Clean exit	Normal completion (Jobs)
`1`	Application error	Unhandled exception, config error
`2`	Misuse of shell builtin	Shell script error, invalid command-line arguments
`137`	SIGKILL (128+9)	OOMKilled or `kubectl delete pod --force`
`139`	Segfault (128+11)	Memory corruption, native code crash
`143`	SIGTERM (128+15)	Graceful termination signal
`137` with `reason: OOMKilled`	Container memory limit exceeded	Reported in pod status: `reason: OOMKilled`

# Get last exit code
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Get OOMKill reason
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
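
The `128+N` rows in the table follow one rule: an exit code above 128 usually means the process was terminated by signal `N = code - 128`. The sketch below applies that rule; `explain_exit_code` is a hypothetical helper name, and it relies on the shell's `kill -l` to translate a signal number into its name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a container exit code into plain language.
# Usage: explain_exit_code <code>
explain_exit_code() {
  local code="$1" sig name
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
    sig=$((code - 128))
    # kill -l <number> prints the signal name without the SIG prefix
    name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
    echo "terminated by signal $sig (SIG$name)"
  else
    echo "application exited with status $code"
  fi
}

# Usage with the jsonpath query above:
#   explain_exit_code "$(kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```

The exit code alone does not say who sent the signal: a 137 can come from the OOM killer or from a forced delete, which is why the `reason` field is checked separately.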

kubectl Diagnostic Cheatsheet

# Events — the most information-dense diagnostic
kubectl get events -n <ns> --sort-by='.lastTimestamp'
kubectl get events -n <ns> --field-selector reason=BackOff
kubectl get events -n <ns> --field-selector involvedObject.name=<pod-name>

# Watch events in real time
kubectl get events -n <ns> -w

# Pod conditions
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{range .status.conditions[*]}{.type}: {.status} ({.reason}){"\n"}{end}'

# Container image being used
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}: {.image}{"\n"}{end}'

# Check all restarts across namespace
kubectl get pods -n <ns> \
  -o jsonpath='{range .items[*]}{.metadata.name}: restarts={.status.containerStatuses[0].restartCount}{"\n"}{end}' \
  | sort -t= -k2 -rn | head -10

# Exec into running pod
kubectl exec -it <pod> -n <ns> -- /bin/sh

# Run diagnostic pod (netshoot has curl, dig, nc, tcpdump, etc.)
kubectl run netshoot --image=nicolaka/netshoot --rm -it -n <ns> -- /bin/bash

# Copy files out of pod for inspection
kubectl cp <ns>/<pod>:/var/log/app.log ./app.log

# Port-forward for local testing
kubectl port-forward pod/<pod-name> -n <ns> 8080:8080
kubectl port-forward svc/<svc-name> -n <ns> 8080:80
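
The restart query above reads only the first container of each pod, so a crashing sidecar can go unnoticed. A possible variant that sums restarts across all containers is sketched here; `restarts_by_pod` is a hypothetical name, and the sketch assumes `custom-columns` prints the per-container counts as a comma-separated list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: top 10 pods by total restarts across all containers.
# Usage: restarts_by_pod <namespace>
restarts_by_pod() {
  local ns="$1"
  kubectl get pods -n "$ns" --no-headers \
    -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount' |
  awk 'NF { n = split($2, c, ","); sum = 0
            for (i = 1; i <= n; i++) sum += c[i]   # "<none>" counts as 0
            print sum, $1 }' |
  sort -rn | head -10
}
```

Output is one line per pod in the form `<total-restarts> <pod-name>`, highest first.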

Per-Layer Diagnostic Index

Layer	Common Symptoms	Playbook
Pod failures	CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending	01 — Pod Failures
Network issues	Connection refused, DNS failure, timeout, no endpoints	02 — Network Issues
Storage issues	ContainerCreating stuck, PVC Pending, read-only filesystem	03 — Storage Issues
Performance	High CPU/memory, slow response, throttling, eviction	04 — Performance Issues
Control plane	API server unreachable, etcd errors, controller not reconciling	05 — Control Plane Issues
Node issues	NotReady, DiskPressure, MemoryPressure, kubelet crash	06 — Node Issues
Security issues	Forbidden 403, webhook blocking, audit alerts	07 — Security Issues
DNS issues	NXDOMAIN, intermittent DNS failure, slow DNS	08 — DNS Issues
Ingress issues	502/503/504, TLS errors, wrong routing	09 — Ingress Issues
etcd issues	High latency, compaction needed, quorum loss	10 — etcd Issues

Observability Stack Quick Reference

# Prometheus — query current metrics
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Then open http://localhost:9090

# Grafana — dashboards
kubectl port-forward -n monitoring svc/grafana 3000:80

# Stern / kubectl - live log tailing across pods (Loki itself is queried through Grafana)
stern <pod-prefix> -n <namespace> --tail=50
kubectl logs -n <ns> -l app=<app> --tail=100 -f

# Jaeger / Tempo — traces
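kubectl port-forward -n monitoring svc/jaeger-query 16686:16686   # Jaeger UI; service name varies by install. Tempo has no UI of its own: query it through Grafana.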

# Alertmanager — current firing alerts
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093

# Check whether Prometheus is scraping the target
kubectl get servicemonitor,podmonitor -n monitoring
kubectl get --raw /api/v1/namespaces/monitoring/services/prometheus-operated:web/proxy/api/v1/targets \
  | jq '.data.activeTargets[] | select(.labels.namespace=="production") | {job:.labels.job, health:.health}'

Cluster Health Dashboard

# One-shot cluster health check
echo "=== Nodes ===" && kubectl get nodes -o wide
echo "=== Control Plane ===" && kubectl get pods -n kube-system -l tier=control-plane
echo "=== PVC Status ===" && kubectl get pvc -A | grep -v Bound
echo "=== Recent Events ===" && kubectl get events -A --sort-by='.lastTimestamp' | grep -v Normal | tail -20
echo "=== Crashlooping Pods ===" && kubectl get pods -A | grep -E "CrashLoop|Error|OOMKill"
echo "=== Pending Pods ===" && kubectl get pods -A | grep Pending
echo "=== Node Pressure ===" && kubectl describe nodes | grep -E "^Name:|(Pressure|NetworkUnavailable) +True|Ready +(False|Unknown)"

Related

01 — Pod Creation Flow — understanding the happy path helps diagnose deviations
09 — On-Call — incident response workflow
08 — Runbooks — runbook template and standards
