Node Issues

Overview

Diagnosis and resolution of Kubernetes node failures — NotReady state, resource pressure conditions, kubelet crashes, and node-level OS issues.

Node Status Quick Reference

kubectl get nodes -o wide
# STATUS    ROLES           AGE   VERSION   INTERNAL-IP   OS-IMAGE
# Ready     control-plane   10d   v1.28.0   10.0.0.5      Ubuntu 22.04
# NotReady  worker          2d    v1.28.0   10.0.0.10     Ubuntu 22.04

# Conditions per node
kubectl describe node <node> | grep -A20 "Conditions:"
# Type              Status  Reason
# MemoryPressure    False   KubeletHasSufficientMemory
# DiskPressure      False   KubeletHasNoDiskPressure
# PIDPressure       False   KubeletHasSufficientPID
# Ready             True    KubeletReady
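
To scan a large cluster quickly, the node listing can be filtered down to the nodes that need attention. The helper below is a minimal sketch; the function name `not_ready_nodes` is hypothetical, and it assumes the default column order of `kubectl get nodes --no-headers` (NAME first, STATUS second).

```shell
# Print "NAME STATUS" for every node whose STATUS is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Cordoned nodes ("Ready,SchedulingDisabled") are listed too, because they
# do not accept new pods.
# Usage: kubectl get nodes --no-headers | not_ready_nodes
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 " " $2 }'
}
```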

Node NotReady

# Node shows NotReady
kubectl describe node <node>
# Look for: Conditions: Ready=False (kubelet reports itself unhealthy) or
#           Ready=Unknown (kubelet stopped posting status), and the Reason

# Most common reasons:
# 1. kubelet stopped / crashed
# 2. Network partition (node can't reach API server)
# 3. Resource pressure (DiskPressure, MemoryPressure)

# SSH to the node and check kubelet
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago" | tail -50

# Common kubelet errors:
# "failed to run Kubelet: unable to load client CA file /etc/kubernetes/pki/ca.crt"
# → Certificate missing or wrong path in kubelet config

# "node not found" → node removed from API server while kubelet was running
# Fix: re-register node (usually happens automatically after kubelet restart)

# Network partition detection:
# Node is healthy but the kubelet can't post its status to the API server
# kubectl get node shows NotReady
# But SSH to node shows kubelet running fine
# → Check security groups, firewall rules between node and control plane

# Restart kubelet (safe: containers are run by the container runtime and keep running while kubelet restarts)
systemctl restart kubelet
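
The kubelet error messages listed above can be matched mechanically when there is a lot of log output to read. This is a rough sketch: the function name `kubelet_log_hints` and the hint wording are illustrative, and only the two messages named in this section are recognized.

```shell
# Map known kubelet log messages to a likely cause.
# Reads kubelet log text on stdin and prints one hint per recognized message.
# Usage: journalctl -u kubelet --since "10 minutes ago" | kubelet_log_hints
kubelet_log_hints() {
    awk '
        /unable to load client CA file/ { ca = 1 }   # CA bundle missing or wrong path
        /node .*not found/              { nf = 1 }   # Node object missing in the API
        END {
            if (ca) print "ca-file: certificate missing or wrong path in kubelet config"
            if (nf) print "node-not-found: Node object missing; restart kubelet to re-register"
            if (!ca && !nf) print "no-known-pattern: read the full kubelet log"
        }'
}
```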

DiskPressure

# Node has DiskPressure condition = True
# Effect: new pods will NOT be scheduled on this node
#         existing pods may be evicted (starting with BestEffort)

kubectl describe node <node> | grep -A5 DiskPressure
# "DiskPressure True KubeletHasDiskPressure"

# SSH to node and check disk usage
df -h
# /dev/nvme0n1p1  100G  95G  5G  95% /   ← root disk almost full

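# Common causes:
# 1. Container images and logs filling the disk
#    (images and writable layers: /var/lib/containerd or /var/lib/docker; logs: /var/log/pods)
du -sh /var/lib/containerd/*
du -sh /var/log/pods/*     # /var/log/containers holds only symlinks into /var/log/pods

# Clean up: remove stopped containers first, then unused images
crictl ps -a -q --state exited | xargs -r crictl rm   # remove stopped containers (no-op if none)
crictl rmi --prune                                    # remove unused images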

# 2. Evicted pod volumes not cleaned up
du -sh /var/lib/kubelet/pods/*/volumes

# 3. Core dumps or application logs
find /var/log -name "*.log" -size +100M

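# kubelet hard eviction thresholds (Linux defaults: memory.available<100Mi,
# nodefs.available<10%, nodefs.inodesFree<5%, imagefs.available<15%)
# Lowering them only delays eviction; adjust in kubelet config:
# evictionHard:
#   nodefs.available: 5%      # default 10%
#   nodefs.inodesFree: 3%     # default 5%
#   imagefs.available: 5%     # default 15%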

# Long term: resize root volume, or separate container runtime to larger volume
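
The `df` check above can be turned into a filter that flags only the filesystems close to the eviction threshold. A minimal sketch, assuming POSIX `df -P` output (usage percentage in column 5, mount point in column 6) and mount points without spaces; the function name `disk_over_threshold` is hypothetical.

```shell
# Print "MOUNTPOINT USE%" for filesystems at or above a usage threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent (default 90,
# which corresponds to nodefs.available < 10%).
# Usage: df -P | disk_over_threshold 90
disk_over_threshold() {
    awk -v limit="${1:-90}" '
        NR > 1 {                        # skip the header row
            use = $5
            sub(/%$/, "", use)          # "95%" -> "95"
            if (use + 0 >= limit + 0) print $6 " " use "%"
        }'
}
```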

MemoryPressure

# Node has MemoryPressure
kubectl describe node <node> | grep -A3 MemoryPressure

# SSH to node
free -h
# Look for: "available" below the hard eviction threshold (memory.available, default 100Mi)

# Which processes are consuming memory
ps aux --sort=-%mem | head -20

# Check OOM killer log
dmesg | grep -i "out of memory\|killed process"

# containerd/docker memory usage
cat /sys/fs/cgroup/memory/system.slice/containerd.service/memory.usage_in_bytes   # cgroup v1
cat /sys/fs/cgroup/system.slice/containerd.service/memory.current                 # cgroup v2

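# kubelet eviction:
# Hard eviction: pods are evicted immediately, with no grace period. Pods whose usage exceeds
#   their requests go first, then lower-priority pods; in practice roughly
#   BestEffort → Burstable → Guaranteed (last)
# Soft eviction: evict only after the threshold has been exceeded for evictionSoftGracePeriod
#   (no default; a grace period must be set for every soft threshold)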

# Fix:
# 1. Add memory to node (resize instance)
# 2. Lower pod memory requests and limits, or move pods elsewhere, so less memory is committed on the node
# 3. Tune kubelet eviction thresholds:
#    evictionHard:
#      memory.available: 200Mi
#    evictionSoft:
#      memory.available: 500Mi
#    evictionSoftGracePeriod:
#      memory.available: 90s
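
Comparing free memory with the eviction threshold can be scripted. The sketch below uses `MemAvailable` from `/proc/meminfo` as an approximation; the kubelet derives `memory.available` from cgroup working-set statistics, so the two values can differ. Both function names are hypothetical.

```shell
# Print MemAvailable in MiB from /proc/meminfo-formatted input on stdin.
# Usage: mem_available_mib < /proc/meminfo
mem_available_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }'   # value is reported in kB
}

# Succeed (exit 0) when available memory is below the threshold.
# $1 = available MiB, $2 = threshold MiB (default 100, the kubelet default).
# Usage: below_eviction_threshold "$(mem_available_mib < /proc/meminfo)" 200 && echo "at risk"
below_eviction_threshold() {
    [ "$1" -lt "${2:-100}" ]
}
```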

PIDPressure

# Node running out of process IDs
kubectl describe node <node> | grep -A3 PIDPressure

# SSH to node
cat /proc/sys/kernel/pid_max          # system PID limit
cat /proc/sys/kernel/threads-max      # thread limit

# Check current PID usage (every thread consumes a PID, so list threads with -L)
ps -eLf --no-headers | wc -l

# Common cause: container running too many threads (connection per thread model)
# Fix: tune application thread pool, or increase kernel PID limit:
sysctl -w kernel.pid_max=262144
echo "kernel.pid_max=262144" >> /etc/sysctl.d/99-kubernetes.conf

# kubelet has no default pid.available threshold; set one to evict before PIDs run out
# evictionHard:
#   pid.available: 500
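
How close the node is to `pid_max` is easier to judge as a percentage. A small sketch of that arithmetic, with a hypothetical function name; it uses integer division, so the result is rounded down.

```shell
# Print the integer percentage of the PID space in use.
# $1 = number of tasks (processes + threads), $2 = kernel.pid_max
# Usage: pid_usage_pct "$(ps -eLf --no-headers | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
pid_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}
```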

Node Draining and Cordoning

# Cordon: mark node unschedulable (no new pods, existing pods stay)
kubectl cordon <node>

# Drain: cordon the node, then evict its pods
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# --ignore-daemonsets: skip DaemonSet pods (the DaemonSet controller would recreate them on the node anyway)
# --delete-emptydir-data: also evict pods with emptyDir volumes (data lost!)

# Common drain failure: PodDisruptionBudget blocks eviction
# "error when evicting pods/"xxx" -n "<ns>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget."
kubectl get pdb -n <ns>
# Fix: wait for enough replicas to be ready elsewhere, then retry drain
# Emergency: kubectl drain --disable-eviction=true (ignores PDBs — risky)

# Uncordon after maintenance
kubectl uncordon <node>

# Check drain progress
kubectl get pods -o wide --all-namespaces --field-selector spec.nodeName=<node>
# After drain: only DaemonSet pods (and static pods on control-plane nodes) remain on node
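
To see exactly which pods are still holding up a drain, the pod list can be filtered by owner kind. This is a sketch under assumptions: the function name `non_daemonset_pods` is hypothetical, and the input is expected in the three-column form produced by the `custom-columns` expression in the usage comment.

```shell
# Print "namespace/name" for pods that a drain still has to evict.
# Reads three columns on stdin: NAMESPACE NAME OWNER-KIND.
# DaemonSet pods and static (mirror) pods, which are owned by the Node, are skipped.
# Usage:
#   kubectl get pods -A --no-headers --field-selector spec.nodeName=<node> \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' \
#     | non_daemonset_pods
non_daemonset_pods() {
    awk '$3 != "DaemonSet" && $3 != "Node" { print $1 "/" $2 }'
}
```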

kubelet Configuration

# View kubelet config (kubeadm clusters)
kubectl get configmap kubelet-config -n kube-system -o yaml
# OR on node:
cat /var/lib/kubelet/config.yaml

# Common kubelet config options to tune:
# maxPods: 110                          # max pods per node
# podsPerCore: 10                       # max pods per CPU core (alternative)
# systemReserved:
#   cpu: 500m                           # reserve for OS processes
#   memory: 500Mi
# kubeReserved:
#   cpu: 500m                           # reserve for kubelet + container runtime
#   memory: 500Mi
# evictionHard:
#   memory.available: 100Mi
#   nodefs.available: 10%
# cpuManagerPolicy: static              # exclusive CPUs for Guaranteed pods with integer CPU requests (for latency-sensitive)
# topologyManagerPolicy: best-effort    # NUMA-aware alignment of CPU and device allocations

# Apply kubelet config change (kubeadm)
# Edit ConfigMap, then update each node's kubelet config:
kubectl edit configmap kubelet-config -n kube-system
# On each node:
kubeadm upgrade node phase kubelet-config
systemctl restart kubelet
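
The reservations above reduce what the scheduler can place on the node: allocatable memory is capacity minus kubeReserved, systemReserved, and the hard eviction threshold. A sketch of that arithmetic in MiB, with a hypothetical function name:

```shell
# Allocatable memory (MiB) = capacity - kubeReserved - systemReserved - evictionHard
# $1 = capacity, $2 = kubeReserved, $3 = systemReserved, $4 = evictionHard memory.available
# Usage: allocatable_mib 8192 500 500 100
allocatable_mib() {
    echo $(( $1 - $2 - $3 - $4 ))
}
```

With the values from the example config on an 8192 MiB node, this leaves 7092 MiB for pods. Compare the result with the Allocatable block of `kubectl describe node <node>`.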

Node Not Joining Cluster

# New node fails to join

# Check kubelet logs on the new node
journalctl -u kubelet --since "5 minutes ago"

# Common errors:
# "node not found" → kubelet has not registered its Node object yet; transient at startup,
#                     persistent if registration keeps failing
# "TLS handshake timeout" → can't reach API server
# "Unauthorized" → bootstrap token expired or wrong

# Generate new bootstrap token (from control plane)
kubeadm token create --print-join-command

# Check token validity
kubeadm token list

# Node registered but stuck in NotReady
# Check if CNI plugin is running on the new node (Cilium shown; use your CNI's pod label)
kubectl get pod -n kube-system -l k8s-app=cilium \
  --field-selector spec.nodeName=<new-node>
# If not running: CNI DaemonSet needs to schedule there

# Check if node has correct labels/taints for CNI
kubectl describe node <new-node> | grep -E "Labels:|Taints:"

Node Debugging with Ephemeral Containers

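# kubectl debug node/<node-name> starts a regular pod (not an ephemeral container) in the node's
# host PID, network, and IPC namespaces, with the node filesystem mounted at /host
kubectl debug node/<node-name> -it --image=ubuntu
# The debug pod is not removed on exit; delete it with kubectl delete pod when done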

# Inside the debug container:
chroot /host   # access node filesystem
# Now you're in the node's root filesystem

# Check running processes
ps aux | grep kubelet

# Check network config
ip addr show
ip route show
iptables -t nat -L KUBE-SERVICES -n | head -20

# Check containerd
crictl ps
crictl images
crictl logs <container-id>

# Check disk
df -h
du -sh /var/lib/containerd
du -sh /var/lib/kubelet/pods

# Check kernel logs
dmesg | tail -50
journalctl -k --since "1 hour ago"

Karpenter Node Provisioner Issues

# Pods Pending but Karpenter not provisioning nodes

# Check Karpenter controller logs
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=100

# Check NodePool (Karpenter v0.32+)
kubectl get nodepool
kubectl describe nodepool default

# Common issues:
# 1. NodePool resource limits (spec.limits.cpu / spec.limits.memory) reached
kubectl describe nodepool default | grep -i -A10 limits

# 2. Pods have node affinity that no NodePool can satisfy
kubectl describe pod <pending-pod> | grep -A5 "Node-Selectors\|Affinity"

# 3. AWS has no capacity for the requested instance types (a capacity shortage, distinct from an account quota)
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "InsufficientInstanceCapacity"
# Fix: allow more instance types in the NodePool requirements (key node.kubernetes.io/instance-type)

# 4. Karpenter can't assume the IAM role
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "AccessDenied"

# Check provisioned nodes
kubectl get nodeclaim
kubectl get nodeclaim -o json | jq '.items[].status.conditions'
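
The two log searches above can be combined into one pass that shows which failure dominates. A minimal sketch with a hypothetical function name; it counts only the two error strings named in this section.

```shell
# Count Karpenter log lines per known error string.
# Reads controller log text on stdin and prints a one-line summary.
# Usage: kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=1000 | karpenter_error_counts
karpenter_error_counts() {
    awk '
        /InsufficientInstanceCapacity/ { cap++ }   # no EC2 capacity for the instance type
        /AccessDenied/                 { iam++ }   # IAM role or policy problem
        END {
            printf "InsufficientInstanceCapacity=%d AccessDenied=%d\n", cap, iam
        }'
}
```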

Related

01 - Pod Failures: eviction and node pressure effects on pods
04 - Performance Issues: node resource exhaustion
05 - Control Plane Issues: API server unreachable from nodes