Overview
A structured guide to diagnosing Kubernetes issues — from a decision tree for locating the problem, through per-layer playbooks, to escalation and post-incident habits.
Troubleshooting Philosophy
K8s is a distributed system. When something breaks:
1. Narrow the blast radius first
Is it one pod? One node? One namespace? The whole cluster?
2. Follow the data path
Start from the symptom (user can't reach service)
Trace backwards: Service → Pod → Container → Image → Node → Network
3. Distinguish symptom from cause
"Pod CrashLoopBackOff" is a symptom. Root cause could be:
- Bad config (wrong env var)
- OOMKilled (under-resourced)
- Application bug (null pointer)
- External dependency unavailable (DB unreachable)
4. Use observability, not intuition
Logs → Events → Metrics → Traces
kubectl describe is usually the best first tool: it shows object state and recent events together.
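Step 1 (narrowing the blast radius) can be sketched as a small filter that counts unhealthy pods per namespace or per node. This is a minimal illustration, not a definitive tool: the function name `unhealthy_by` is hypothetical, and it assumes four whitespace-separated input columns (namespace, node, phase, per-container ready flags) produced by the `custom-columns` query shown in the trailing comment.

```shell
# Count unhealthy pods grouped by namespace or by node.
# stdin: one pod per line as "NAMESPACE NODE PHASE READY",
#        where READY is a comma-separated list such as "true,false".
unhealthy_by() {
  # $1 selects the grouping key: "namespace" (column 1) or "node" (column 2)
  case "$1" in
    namespace) col=1 ;;
    node)      col=2 ;;
    *) echo "usage: unhealthy_by namespace|node" >&2; return 2 ;;
  esac
  awk -v col="$col" '
    $3 == "Succeeded" { next }                         # completed Jobs are not failures
    $3 != "Running" || $4 ~ /false/ { count[$col]++ }  # not running, or a container is not ready
    END { for (k in count) print count[k], k }
  ' | sort -k1,1nr -k2,2
}

# Example (requires cluster access):
#   kubectl get pods -A --no-headers \
#     -o 'custom-columns=NS:.metadata.namespace,NODE:.spec.nodeName,PHASE:.status.phase,READY:.status.containerStatuses[*].ready' \
#     | unhealthy_by node
```

If one node dominates the output, suspect the node. If one namespace dominates, suspect a deployment or its configuration. If failures are spread evenly, look at cluster-wide components.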
Troubleshooting Decision Tree
User reports: "the app is down" or "feature X is broken"
│
▼
Can you reach the service?
┌───────────────────────┐
│ No │ Yes → go to application layer
└──────────┬────────────┘
▼
Is the Service endpoint populated?
kubectl get endpoints <svc> -n <ns>
┌────────────────────────────────┐
│ Empty / wrong IPs │ IPs present → DNS or network issue
└──────────┬─────────────────────┘
▼
Are the pods Running and Ready?
kubectl get pods -n <ns>
┌────────────────────────────────────────────────────────────────────┐
│ Pending │ CrashLoopBackOff │ Running+Not Ready │
│ → Scheduling │ → App/Config issue │ → readinessProbe failing │
│ issue │ or OOMKill │ or startup taking too long│
└──────────────────┴────────────────────┴─────────────────────────────┘
▼
Is the node healthy?
kubectl get nodes
┌────────────────────────────────┐
│ NotReady │ Ready → pod-level issue
└──────────┬─────────────────────┘
▼
Control plane healthy?
kubectl get --raw='/readyz?verbose'   # componentstatuses is deprecated since v1.19
kubectl get pods -n kube-system
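The pod-status branch of the tree above can be sketched as a lookup function. This is an illustrative sketch only: `triage_pod` is a hypothetical helper, and it assumes its two arguments are the STATUS and READY values as printed by `kubectl get pods`. Statuses the tree does not mention fall through to a generic hint.

```shell
# Map a pod's STATUS and READY columns (as printed by `kubectl get pods`)
# to the branch of the decision tree to follow next.
triage_pod() {
  status=$1   # e.g. Pending, CrashLoopBackOff, Running
  ready=$2    # e.g. 0/1, 2/2
  case "$status" in
    Pending)          echo "scheduling issue" ;;
    CrashLoopBackOff) echo "app/config issue or OOMKill" ;;
    Running)
      # READY is "<ready containers>/<total containers>"
      if [ "${ready%/*}" = "${ready#*/}" ]; then
        echo "pod looks healthy: check node, DNS, or network"
      else
        echo "readinessProbe failing or slow startup"
      fi ;;
    *) echo "not covered by the tree: run kubectl describe pod" ;;
  esac
}

# Example:
#   triage_pod Running 1/2
```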
First-Response Checklist
# 1. Get the big picture
kubectl get pods -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# 2. Zoom in on the failing pod
kubectl describe pod <pod-name> -n <namespace>
# Check: Events section at the bottom (most important)
# Check: Container state (Waiting/reason, Last State/exit code)
# Check: Conditions (Initialized, Ready, ContainersReady, PodScheduled)
# 3. Get container logs
kubectl logs <pod-name> -n <namespace> --tail=100
kubectl logs <pod-name> -n <namespace> --previous # previous crashed container
# 4. Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes
# 5. Check node health
kubectl get nodes -o wide
kubectl describe node <node-name> | grep -A10 "Conditions:"
kubectl describe node <node-name> | grep -A15 "Allocated resources"
Exit Code Reference
| Exit Code | Meaning | Common Cause |
| 0 | Clean exit | Normal completion (Jobs) |
| 1 | Application error | Unhandled exception, config error |
| 2 | Misuse of shell command | Shell script error |
| 137 | SIGKILL (128+9) | OOMKilled, grace period expired, or kubectl delete pod --force |
| 139 | Segfault (128+11) | Memory corruption, native code crash |
| 143 | SIGTERM (128+15) | Graceful termination signal |
| 137 with reason OOMKilled | Container memory limit exceeded | Reported in pod status as reason: OOMKilled |
# Get last exit code
kubectl get pod <pod> -n <ns> \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Get OOMKill reason
kubectl get pod <pod> -n <ns> \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
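Exit codes above 128 conventionally mean the process was terminated by a signal, with the signal number equal to the exit code minus 128. The following sketch turns the value returned by the first query above into a short explanation. The function name `explain_exit_code` is hypothetical, and the wording of each message is illustrative rather than anything kubectl prints.

```shell
# Translate a container exit code into a short human-readable hint.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "clean exit" ;;
    1)   echo "application error" ;;
    2)   echo "misuse of shell command" ;;
    137) echo "SIGKILL (128+9): check whether reason is OOMKilled" ;;
    139) echo "SIGSEGV (128+11)" ;;
    143) echo "SIGTERM (128+15)" ;;
    *)
      # Anything else above 128 is 128 + signal number by convention;
      # non-numeric input fails the test and lands in the else branch.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi ;;
  esac
}

# Example (requires cluster access):
#   explain_exit_code "$(kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```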
kubectl Diagnostic Cheatsheet
# Events — the most information-dense diagnostic
kubectl get events -n <ns> --sort-by='.lastTimestamp'
kubectl get events -n <ns> --field-selector reason=BackOff
kubectl get events -n <ns> --field-selector involvedObject.name=<pod-name>
# Watch events in real time
kubectl get events -n <ns> -w
# Pod conditions
kubectl get pod <pod> -n <ns> \
-o jsonpath='{range .status.conditions[*]}{.type}: {.status} ({.reason}){"\n"}{end}'
# Container image being used
kubectl get pod <pod> -n <ns> \
-o jsonpath='{range .status.containerStatuses[*]}{.name}: {.image}{"\n"}{end}'
# Check all restarts across namespace
kubectl get pods -n <ns> \
-o jsonpath='{range .items[*]}{.metadata.name}: restarts={.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| sort -t= -k2 -rn | head -10
# Exec into running pod
kubectl exec -it <pod> -n <ns> -- /bin/sh
# Run diagnostic pod (netshoot has curl, dig, nc, tcpdump, etc.)
kubectl run netshoot --image=nicolaka/netshoot --rm -it -n <ns> -- /bin/bash
# Copy files out of pod for inspection
kubectl cp <ns>/<pod>:/var/log/app.log ./app.log
# Port-forward for local testing
kubectl port-forward pod/<pod-name> -n <ns> 8080:8080
kubectl port-forward svc/<svc-name> -n <ns> 8080:80
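The restart query above reads only the first container of each pod (`containerStatuses[0]`), so restarts in sidecars or other containers are not counted. A possible variant sums restarts across all containers. This is a sketch under stated assumptions: `top_restarts` is a hypothetical helper, and it expects two input columns (pod name, comma-separated restart counts) as produced by the `custom-columns` query in the trailing comment.

```shell
# Rank pods by total restarts across all of their containers.
# stdin: one pod per line as "POD COUNTS", where COUNTS is a comma-separated
#        list of per-container restart counts (or "<none>" if not started).
# $1:    how many pods to print (default 10).
top_restarts() {
  awk '{
    n = split($2, parts, ",")
    total = 0
    for (i = 1; i <= n; i++) total += parts[i]   # non-numeric "<none>" counts as 0
    print total, $1
  }' | sort -k1,1nr -k2,2 | head -n "${1:-10}"
}

# Example (requires cluster access):
#   kubectl get pods -n <ns> --no-headers \
#     -o 'custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount' \
#     | top_restarts 5
```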
Per-Layer Diagnostic Index
| Layer | Common Symptoms | Playbook |
| Pod failures | CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending | 01 — Pod Failures |
| Network issues | Connection refused, DNS failure, timeout, no endpoints | 02 — Network Issues |
| Storage issues | ContainerCreating stuck, PVC Pending, read-only filesystem | 03 — Storage Issues |
| Performance | High CPU/memory, slow response, throttling, eviction | 04 — Performance Issues |
| Control plane | API server unreachable, etcd errors, controller not reconciling | 05 — Control Plane Issues |
| Node issues | NotReady, DiskPressure, MemoryPressure, kubelet crash | 06 — Node Issues |
| Security issues | Forbidden 403, webhook blocking, audit alerts | 07 — Security Issues |
| DNS issues | NXDOMAIN, intermittent DNS failure, slow DNS | 08 — DNS Issues |
| Ingress issues | 502/503/504, TLS errors, wrong routing | 09 — Ingress Issues |
| etcd issues | High latency, compaction needed, quorum loss | 10 — etcd Issues |
Observability Stack Quick Reference
# Prometheus — query current metrics
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Then open http://localhost:9090
# Grafana — dashboards
kubectl port-forward -n monitoring svc/grafana 3000:80
# Stern: live log tailing across matching pods (stored logs in Loki are queried through Grafana)
stern <pod-prefix> -n <namespace> --tail=50
kubectl logs -n <ns> -l app=<app> --tail=100 -f
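# Jaeger / Tempo: traces (16686 is the Jaeger query UI port; a Tempo install exposes it only if the tempo-query component is deployed, otherwise query Tempo through Grafana)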
kubectl port-forward -n monitoring svc/tempo 16686:16686
# Alertmanager — current firing alerts
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# Check whether Prometheus is scraping a target
kubectl get servicemonitor,podmonitor -n monitoring
kubectl get --raw /api/v1/namespaces/monitoring/services/prometheus-operated:web/proxy/api/v1/targets \
| jq '.data.activeTargets[] | select(.labels.namespace=="production") | {job:.labels.job, health:.health}'
Cluster Health Dashboard
# One-shot cluster health check
echo "=== Nodes ===" && kubectl get nodes -o wide
echo "=== Control Plane ===" && kubectl get pods -n kube-system -l tier=control-plane  # label set by kubeadm; typically returns nothing on managed clusters
echo "=== PVC Status ===" && kubectl get pvc -A | grep -v Bound
echo "=== Recent Events ===" && kubectl get events -A --sort-by='.lastTimestamp' | grep -v Normal | tail -20
echo "=== Crashlooping Pods ===" && kubectl get pods -A | grep -E "CrashLoop|Error|OOMKill"
echo "=== Pending Pods ===" && kubectl get pods -A | grep Pending
echo "=== Node Pressure ===" && kubectl get nodes -o json | jq -r '.items[] | .metadata.name as $n | .status.conditions[] | select(.type != "Ready" and .status == "True") | "\($n)\t\(.type)"'