Node Lifecycle
A node's lifecycle spans from its initial registration through active workload execution to graceful shutdown or forced removal. Every phase involves specific control-plane controllers, kubelet behaviors, and timing parameters that directly affect pod availability. This page covers the complete lifecycle: registration → healthy operation → degradation → drain → deletion, plus cordon/taint mechanics and the node controller's role throughout.
Lifecycle Phases Overview
Node Registration
A node self-registers by the kubelet calling POST /api/v1/nodes on the kube-apiserver. This requires a valid client certificate with the system:node:<nodeName> identity in the system:nodes group. The TLS bootstrap process (covered in kubelet: TLS Bootstrap) issues this certificate automatically.
Bootstrap token exchange
kubelet uses a bootstrap token (--bootstrap-kubeconfig) to authenticate as system:bootstrappers and submit a CSR for a node client certificate.
CSR auto-approval
The csrapproving controller in kube-controller-manager auto-approves CSRs from system:bootstrappers matching the node identity pattern. The certificate is signed by the cluster CA.
Node object creation
kubelet POSTs the Node object with initial status.capacity, status.nodeInfo (OS, kernel, runtime version), and all initial conditions set to Unknown. The apiserver assigns a resourceVersion.
CCM initializer taint
When the kubelet runs with --cloud-provider=external, it registers the node with the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint, preventing workload scheduling until the cloud-controller-manager has populated cloud metadata (zone, instance type, providerID).
CCM initialization
CCM's NodeController queries the cloud provider for the instance's zone, region, instance type, and external IPs; patches Node status.addresses and spec.providerID; removes the uninitialized taint.
CNI / DaemonSet startup
CNI DaemonSet pods are scheduled (they tolerate the not-ready taint). The CNI plugin configures the node network. The container runtime then reports NetworkReady=true to the kubelet via the CRI Status RPC.
Node becomes Ready
kubelet patches Node status with Ready=True. The node.kubernetes.io/not-ready:NoSchedule taint is automatically removed by the node lifecycle controller. The node is now schedulable.
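The registration steps above end when both startup taints are gone. A minimal sketch for checking that from the command line follows; the helper name node_startup_taints is hypothetical, and it assumes kubectl is already configured for the cluster.

```shell
# List the startup taints that still block scheduling on a node.
# Prints one taint key per line; prints nothing once both taints are removed.
# node_startup_taints is a hypothetical helper name, not part of kubectl.
node_startup_taints() {
  node="$1"
  kubectl get node "$node" \
    -o jsonpath='{range .spec.taints[*]}{.key}{"\n"}{end}' |
    grep -E '^(node\.cloudprovider\.kubernetes\.io/uninitialized|node\.kubernetes\.io/not-ready)$' || true
}
```

Calling node_startup_taints worker-1 shortly after a node joins would typically list one or both keys, and an empty result suggests the CCM and node lifecycle controller have finished their part.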
Node Object Anatomy
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    kubernetes.io/hostname: worker-1
    kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    node.kubernetes.io/instance-type: m5.2xlarge # set by CCM
    topology.kubernetes.io/region: us-east-1 # set by CCM
    topology.kubernetes.io/zone: us-east-1a # set by CCM
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
spec:
  podCIDR: 10.244.1.0/24 # assigned by node IPAM controller
  podCIDRs:
  - 10.244.1.0/24
  providerID: aws:///us-east-1a/i-0abc12345def67890 # set by CCM
  taints: # applied manually or by lifecycle controller
  - key: node.kubernetes.io/not-ready
    effect: NoSchedule
status:
  capacity:
    cpu: "8"
    memory: 32Gi
    pods: "110"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    ephemeral-storage: 500Gi
  allocatable: # capacity minus reserved (kubeReserved+systemReserved+evictionThreshold)
    cpu: 7600m
    memory: 29Gi
    pods: "110"
    ephemeral-storage: 470Gi
  addresses:
  - type: InternalIP
    address: 10.0.1.50
  - type: ExternalIP
    address: 54.12.34.56
  - type: Hostname
    address: worker-1
  conditions: # updated by kubelet every nodeStatusUpdateFrequency (10s)
  - type: Ready
    status: "True"
    lastHeartbeatTime: "2024-01-15T10:30:00Z"
    lastTransitionTime: "2024-01-14T08:00:00Z"
    reason: KubeletReady
    message: kubelet is posting ready status
  - type: MemoryPressure
    status: "False"
    ...
  - type: DiskPressure
    status: "False"
    ...
  - type: PIDPressure
    status: "False"
    ...
  - type: NetworkUnavailable
    status: "False"
    ...
  nodeInfo:
    machineID: abc123
    systemUUID: def456
    bootID: ghi789
    kernelVersion: 5.15.0-1034-aws
    osImage: Ubuntu 22.04.3 LTS
    containerRuntimeVersion: containerd://1.7.2
    kubeletVersion: v1.29.2
    kubeProxyVersion: v1.29.2
    operatingSystem: linux
    architecture: amd64
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - registry.k8s.io/pause:3.9
    sizeBytes: 268435456
  volumesInUse: []
  volumesAttached: []
Node Conditions
The kubelet continuously evaluates and patches these conditions. The node lifecycle controller in kube-controller-manager watches them and reacts:
| Condition | True means | False means | Control-plane reaction |
|---|---|---|---|
| Ready | Node healthy, kubelet posting status, network ready | kubelet cannot post status or network unavailable | After node-monitor-grace-period (40s default): add node.kubernetes.io/not-ready:NoSchedule taint; after pod-eviction-timeout (5m): add node.kubernetes.io/not-ready:NoExecute taint → pods evicted |
| MemoryPressure | Node memory is low (below eviction threshold) | Normal memory | Add node.kubernetes.io/memory-pressure:NoSchedule taint; kubelet begins evicting BestEffort pods |
| DiskPressure | Node disk is low (image filesystem or nodefs) | Normal disk | Add node.kubernetes.io/disk-pressure:NoSchedule taint; kubelet evicts pods with large ephemeral storage |
| PIDPressure | Too many processes running | Normal PID count | Add node.kubernetes.io/pid-pressure:NoSchedule taint |
| NetworkUnavailable | Node network not configured (CNI missing) | Network configured normally | Set by CNI plugin (via Node patch); prevents pod IP assignment |
The node lifecycle controller (part of kube-controller-manager) automatically adds and removes node.kubernetes.io/* taints based on conditions. You don't need to manually taint nodes for common conditions — the controller handles it. Manual taints are additive; they coexist with controller-managed taints.
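To see which conditions on a node are currently outside their healthy state, a small filter over the condition list is usually enough. The sketch below assumes the convention in the table above (Ready is healthy when True, every other condition is healthy when False); the function names are hypothetical.

```shell
# Read "Type=Status" lines on stdin and print the conditions that are not in
# their healthy state: Ready should be True, every other condition False.
unhealthy_conditions() {
  awk -F= '
    NF < 2 { next }                                   # skip blank or malformed lines
    $1 == "Ready" && $2 != "True"  { print $1 "=" $2; next }
    $1 != "Ready" && $2 != "False" { print $1 "=" $2 }
  '
}

# Feed the filter from a live node (requires a configured kubectl).
node_unhealthy_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
    unhealthy_conditions
}
```

For example, node_unhealthy_conditions worker-1 prints nothing on a healthy node and a line such as DiskPressure=True when the kubelet has reported disk pressure.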
Node Heartbeat Mechanism
Kubernetes uses two complementary heartbeat channels to detect node failures:
Node Status Updates (legacy)
- kubelet PATCHes node.status every nodeStatusUpdateFrequency (default 10s)
- Writes all condition timestamps + allocatable + images
- Large payload: high etcd write amplification at scale
- Control plane uses lastHeartbeatTime to detect failures
Node Lease (preferred)
- Introduced in 1.13, GA in 1.17; enabled by default
- kubelet renews a coordination.k8s.io/v1/Lease in the kube-node-lease namespace every nodeLeaseDurationSeconds/4 (default every ~10s)
- Tiny object (~200 bytes) vs full Node status
- Node lifecycle controller watches Lease renewTime
- Decouples heartbeat from status updates
# Node Lease object
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: worker-1 # same as node name
  namespace: kube-node-lease
spec:
  holderIdentity: worker-1
  leaseDurationSeconds: 40 # kubelet flag: --node-lease-duration-seconds
  renewTime: "2024-01-15T10:30:45.123456Z" # updated every ~10s
# kubelet config knobs:
# nodeStatusUpdateFrequency: 10s (full status patches)
# nodeLeaseDurationSeconds: 40 (lease TTL; renew at TTL/4 = 10s)
Failure Detection Timing
| Parameter | Default | Owner | Effect |
|---|---|---|---|
| node-monitor-period | 5s | kube-controller-manager | How often node lifecycle controller checks node conditions |
| node-monitor-grace-period | 40s | kube-controller-manager | How long to wait before marking node Unknown/NotReady after last heartbeat |
| pod-eviction-timeout | 5m (deprecated in 1.26) | kube-controller-manager | Legacy: grace period before evicting pods from NotReady node. Now controlled by taint-based eviction + tolerationSeconds |
| nodeLeaseDurationSeconds | 40s | KubeletConfiguration | Lease TTL; lease renewed at TTL/4 |
| nodeStatusUpdateFrequency | 10s | KubeletConfiguration | Full Node status patch interval |
Since Kubernetes 1.18 (GA), taint-based eviction is the standard mechanism. Pods that tolerate node.kubernetes.io/not-ready:NoExecute with tolerationSeconds: 300 (the default added automatically) will be evicted after 5 minutes. To customize eviction timing per workload, set explicit tolerations in the pod spec rather than relying on the deprecated pod-eviction-timeout flag.
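The timing parameters above add up to a rough upper bound on how long a pod stays bound to a dead node. The sketch below is an approximation under stated assumptions: it sums node-monitor-grace-period, one node-monitor-period of detection slack, and the pod's tolerationSeconds, and it ignores API latency and the pod's own termination grace period. The function name is hypothetical.

```shell
# Rough upper bound, in seconds, from a node's last heartbeat to the start of
# eviction for a pod. Arguments (all optional, defaults from the table above):
#   $1 node-monitor-grace-period, $2 node-monitor-period, $3 tolerationSeconds
worst_case_eviction_seconds() {
  grace="${1:-40}"
  period="${2:-5}"
  toleration="${3:-300}"
  echo $((grace + period + toleration))
}
```

With the defaults this gives 345 seconds; a workload that sets tolerationSeconds: 30 would come out at 75 seconds under the same assumptions.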
Taint-Based Eviction
When a node transitions to NotReady or Unknown, the node lifecycle controller applies NoExecute taints. Pods running on the node that do not tolerate these taints are evicted after their toleration's tolerationSeconds expires.
Built-in Node Lifecycle Taints
| Taint Key | Effect | Applied When | Default Toleration |
|---|---|---|---|
| node.kubernetes.io/not-ready | NoExecute | Ready=False after grace period | 300s (added automatically by admission) |
| node.kubernetes.io/unreachable | NoExecute | Ready=Unknown after grace period | 300s (added automatically) |
| node.kubernetes.io/not-ready | NoSchedule | Ready=False immediately | Tolerated by DaemonSets |
| node.kubernetes.io/memory-pressure | NoSchedule | MemoryPressure=True | Tolerated by DaemonSets |
| node.kubernetes.io/disk-pressure | NoSchedule | DiskPressure=True | Tolerated by DaemonSets |
| node.kubernetes.io/pid-pressure | NoSchedule | PIDPressure=True | Tolerated by DaemonSets |
| node.kubernetes.io/network-unavailable | NoSchedule | NetworkUnavailable=True | Tolerated by DaemonSets |
| node.kubernetes.io/unschedulable | NoSchedule | kubectl cordon / spec.unschedulable=true | Tolerated by DaemonSets |
| node.cloudprovider.kubernetes.io/uninitialized | NoSchedule | CCM node initialization pending | Tolerated by DaemonSets |
Controlling Eviction Timing via Tolerations
# Default tolerations added by DefaultTolerationSeconds admission plugin:
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300 # 5 minutes
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
# For fast-eviction workloads (stateless, rebalanceable):
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30 # evict after 30s
# For sticky workloads (caches, leader elections):
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 600 # wait 10 minutes before evicting
# For DaemonSets (never evict from NotReady nodes):
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  # no tolerationSeconds = tolerate indefinitely
Node Lifecycle Controller
The node lifecycle controller (part of kube-controller-manager, formerly called "node controller") manages three responsibilities:
CIDR Assignment
Assigns spec.podCIDR from the cluster CIDR range to each new node. Controlled by --allocate-node-cidrs and --cluster-cidr flags.
Health Monitoring
Watches node Lease and status. After node-monitor-grace-period, marks conditions Unknown and applies NoSchedule/NoExecute taints. Implements rate limiting (see below).
Node Deletion
When a Node object is deleted (by CCM or manually), evicts all pods on the node. Does not delete the node object itself — that's the CCM's role.
Eviction Rate Limiting
The node lifecycle controller applies eviction rate limits to avoid mass pod deletion during a partial zone outage:
# kube-controller-manager flags:
--node-eviction-rate=0.1 # nodes/second whose pods are evicted, per zone, while the zone is healthy
--secondary-node-eviction-rate=0.01 # reduced rate (nodes/second) once the zone is considered unhealthy
--unhealthy-zone-threshold=0.55 # fraction of NotReady nodes at which a zone is considered unhealthy
--large-cluster-size-threshold=50 # clusters of this size or smaller: secondary rate is implicitly 0 (evictions stop in an unhealthy zone)
This means: in a healthy zone, the controller processes failed nodes at 0.1 nodes/s (one node every 10 seconds), evicting the pods on each. If ≥55% of nodes in a zone become NotReady (suggesting a network partition rather than individual node failure), the controller reduces to 0.01 nodes/s (one node every 100 seconds), buying time for the zone to recover without evicting all pods unnecessarily. In clusters at or below --large-cluster-size-threshold, evictions in such a zone stop entirely.
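Nodes are grouped by the topology.kubernetes.io/zone label (set by CCM). If a zone is only partially unhealthy, eviction slows as described above. If every node in a zone is unhealthy while other zones are fine, the controller evicts at the normal rate so workloads can move to the healthy zones; if every zone is completely unhealthy, it assumes a control-plane connectivity problem and stops evicting. If isolated nodes fail, eviction proceeds normally. This is why the CCM-set zone labels are load-bearing for reliability.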
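To get a feel for what these rates mean in practice, the arithmetic can be sketched in a few lines. This is an illustration only: it treats the rates as nodes per second, applies just the --unhealthy-zone-threshold comparison, and ignores the controller's special cases. Both function names are hypothetical.

```shell
# Seconds needed to work through `nodes` failed nodes at `rate` nodes/second.
eviction_duration_seconds() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r }'
}

# Pick the eviction rate for a zone from its node counts.
# $1 total nodes in zone, $2 NotReady nodes, $3 threshold (default 0.55).
zone_eviction_mode() {
  awk -v total="$1" -v notready="$2" -v threshold="${3:-0.55}" 'BEGIN {
    if (total > 0 && notready / total >= threshold) print "secondary"
    else print "primary"
  }'
}
```

Ten failed nodes take about 100 seconds to process at the primary rate of 0.1 and about 1000 seconds at the secondary rate of 0.01, which is the breathing room the reduced rate is meant to buy.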
Cordon and Drain
Cordon and drain are the standard procedure for taking a node out of service — for upgrades, maintenance, or decommissioning.
kubectl cordon
Sets spec.unschedulable = true on the Node object. The scheduler will not place new pods on the node. Existing pods continue running. Equivalent to adding the node.kubernetes.io/unschedulable:NoSchedule taint.
kubectl uncordon
Sets spec.unschedulable = false. Removes the unschedulable taint. The node becomes schedulable again. Use after maintenance is complete.
# Cordon node (prevent new scheduling)
kubectl cordon worker-1
# Verify: spec.unschedulable=true, taint added
kubectl get node worker-1 -o jsonpath='{.spec.unschedulable}'
kubectl describe node worker-1 | grep -A3 Taints
# Drain: cordon the node, then evict all evictable pods
#   --ignore-daemonsets     skip DaemonSet pods (they reschedule automatically)
#   --delete-emptydir-data  allow evicting pods using emptyDir
#   --grace-period=30       override pod terminationGracePeriodSeconds
#   --timeout=5m            give up after 5 minutes
#   --force                 also remove pods not managed by a controller
kubectl drain worker-1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=30 \
  --timeout=5m \
  --force
# What kubectl drain does:
# 1. Cordons the node (spec.unschedulable=true)
# 2. Lists all pods on the node
# 3. Sends Eviction API calls (respects PodDisruptionBudgets)
# 4. Waits for pods to terminate
# 5. Returns when node is empty (or timeout hit)
Eviction API vs Delete
kubectl drain uses the Eviction API (POST /api/v1/namespaces/{ns}/pods/{pod}/eviction), not plain DELETE /api/v1/namespaces/{ns}/pods/{pod}. This difference matters:
Eviction API
- Checked against PodDisruptionBudget before proceeding
- Returns 429 Too Many Requests if PDB is violated
- kubectl drain retries on 429
- Respects terminationGracePeriodSeconds
- Triggers graceful shutdown lifecycle hooks
kubectl delete pod
- Does NOT check PodDisruptionBudget
- Can reduce availability below minimum
- Still triggers graceful shutdown (grace period)
- Use only when you understand the impact
- Force delete (--grace-period=0) bypasses shutdown
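For cases where a single pod needs to be evicted with PDB checks but without draining the whole node, the Eviction subresource can be called directly. The sketch below builds a policy/v1 Eviction object and posts it with kubectl create --raw; the helper names are hypothetical, and the pod and namespace are placeholders.

```shell
# Print the Eviction object for one pod (policy/v1).
# $1 namespace, $2 pod name.
eviction_body() {
  ns="$1"
  pod="$2"
  printf '{"apiVersion":"policy/v1","kind":"Eviction","metadata":{"name":"%s","namespace":"%s"}}\n' "$pod" "$ns"
}

# POST the Eviction to the pod's eviction subresource.
# A 429 response means a PodDisruptionBudget currently blocks the eviction.
evict_pod() {
  ns="$1"
  pod="$2"
  eviction_body "$ns" "$pod" |
    kubectl create --raw "/api/v1/namespaces/$ns/pods/$pod/eviction" -f -
}
```

Running evict_pod my-namespace my-pod-xyz goes through the same PDB check that kubectl drain relies on, unlike kubectl delete pod.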
Graceful Node Shutdown
Graceful Node Shutdown (beta and enabled by default since 1.21) allows the kubelet to detect OS-level shutdown events (via systemd inhibitor locks) and gracefully terminate pods before the node powers off.
# KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 60s # total time for all pods to terminate
shutdownGracePeriodCriticalPods: 20s # reserved for critical pods (system-node-critical)
# remaining = shutdownGracePeriod - shutdownGracePeriodCriticalPods
# regular pods get: 60s - 20s = 40s
Shutdown sequence:
systemd shutdown inhibitor acquired
kubelet registers a systemd inhibitor lock at startup, delaying actual shutdown until kubelet releases it.
Shutdown signal received
When a shutdown is requested (for example by systemctl poweroff), systemd-logind announces it and delays the shutdown while the inhibitor lock is held. kubelet begins graceful shutdown mode.
Non-critical pods evicted
All non-critical pods receive SIGTERM. They have shutdownGracePeriod - shutdownGracePeriodCriticalPods seconds to exit cleanly. preStop hooks run.
Critical pods evicted
Pods with priorityClassName: system-node-critical or system-cluster-critical are terminated with shutdownGracePeriodCriticalPods seconds grace.
Inhibitor released
kubelet releases the systemd inhibitor lock. systemd proceeds with shutdown. The node powers off.
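The time budget in this sequence is simple arithmetic, and it is worth checking before relying on it. The sketch below assumes all values are given in whole seconds; the function names are hypothetical.

```shell
# Seconds available to regular (non-critical) pods during graceful node shutdown.
# $1 shutdownGracePeriod, $2 shutdownGracePeriodCriticalPods.
regular_pod_shutdown_seconds() {
  total="$1"
  critical="$2"
  echo $((total - critical))
}

# Succeeds when the systemd inhibitor delay (InhibitDelayMaxSec) is long enough
# to cover the kubelet's shutdownGracePeriod.
# $1 InhibitDelayMaxSec, $2 shutdownGracePeriod.
inhibit_delay_sufficient() {
  inhibit_delay="$1"
  shutdown_grace="$2"
  [ "$inhibit_delay" -ge "$shutdown_grace" ]
}
```

With the configuration shown earlier, regular_pod_shutdown_seconds 60 20 yields 40, and inhibit_delay_sufficient 5 60 fails, which corresponds to the case where systemd cuts the shutdown short before pods have finished terminating.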
Non-Graceful Node Shutdown
If the node crashes (power cut, kernel panic) without a graceful shutdown, pods remain in Terminating state indefinitely until the node object is deleted. The Non-Graceful Node Shutdown feature (GA 1.28) addresses this:
# After detecting a non-graceful shutdown, an admin can add the taint:
kubectl taint node worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
# This triggers:
# 1. All pods on the node are force-deleted (no graceful period)
# 2. Volume detachments are forced
# 3. PVCs become available for binding elsewhere
# After node recovery (or decommission), remove the taint:
kubectl taint node worker-1 node.kubernetes.io/out-of-service-
Node Deletion
There are three paths to node deletion, each with different consequences:
| Path | Trigger | Pods | Volumes | Use when |
|---|---|---|---|---|
| kubectl delete node | Manual admin action | Evicted immediately (node lifecycle controller) | Force-detached after attachDetachReconcileSyncPeriod | Decommissioning a node that is already drained |
| CCM deletion | Instance disappears from cloud provider API | Same as above | Cloud volumes stay; detached from Kubernetes side | Node terminated in cloud console; CCM detects and deletes Node object |
| Cluster Autoscaler scale-down | Node underutilized; CA determines safe to remove | Drained first (respects PDBs), then node deleted | Graceful detach via drain | Automatic right-sizing in cloud environments |
Running kubectl delete node worker-1 on a node that is still serving workloads skips drain and can cause split-brain for stateful applications. Always drain first, then delete. The only exception is a node that is confirmed to be permanently gone (instance terminated, hardware destroyed).
Node Autoscaling and Lifecycle
The Cluster Autoscaler (CA) interacts with node lifecycle at scale-up and scale-down:
Scale-Up
- CA detects pending pods that cannot be scheduled (insufficient resources)
- Simulates scheduling on each node group to find a fitting group
- Calls cloud provider API to increase ASG/MIG size
- New VM boots, kubelet registers Node object
- CCM initializes node (providerID, labels, taints removed)
- Scheduler places the previously-pending pod
Scale-Down
- CA identifies nodes where all pods can be safely moved (utilization < threshold)
- Checks PodDisruptionBudgets, min replicas, local storage
- Cordons the node
- Drains the node (Eviction API, respects PDBs)
- Deletes the Node object
- Terminates the cloud instance
Node Topology and Labels
Topology labels (set by CCM or node-labeler) drive scheduling decisions and affect node lifecycle:
# Standard topology labels
topology.kubernetes.io/region: us-east-1
topology.kubernetes.io/zone: us-east-1a
# Node role labels
node-role.kubernetes.io/control-plane: ""
node-role.kubernetes.io/worker: ""
# Instance type (cloud-provider specific)
node.kubernetes.io/instance-type: m5.2xlarge
beta.kubernetes.io/instance-type: m5.2xlarge # deprecated alias
# OS/arch
kubernetes.io/os: linux
kubernetes.io/arch: amd64
# Custom labels (examples)
workload-type: gpu
spot: "true"
node-pool: high-memory
The scheduler uses these labels for topology spread constraints, node affinity, and the taint/toleration system. Removing a topology label from a running node can cause pods to violate their spread constraints — the scheduler won't fix that retroactively, but new pods will be scheduled correctly.
Maintenance Patterns
Rolling Node Upgrade
# Step 1: Cordon (prevent new scheduling)
kubectl cordon worker-1
# Step 2: Drain (evict existing pods)
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data --grace-period=60
# Step 3: Upgrade node OS / kubelet
# (OS-specific: apt-get, yum, kubeadm upgrade node, etc.)
sudo apt-get install -y kubelet=1.29.2-00 kubectl=1.29.2-00
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# Step 4: Verify
kubectl get node worker-1 # should show new version, still SchedulingDisabled
# Step 5: Uncordon
kubectl uncordon worker-1
# Verify schedulable
kubectl get node worker-1 # Ready, no SchedulingDisabled
Maintenance Window via Taints
# Add a maintenance taint (prevents new scheduling, doesn't evict existing)
kubectl taint node worker-1 maintenance=scheduled:NoSchedule
# For planned full maintenance (evicts every pod that does not tolerate the taint):
kubectl taint node worker-1 maintenance=scheduled:NoExecute
# On the pods that can tolerate maintenance windows:
tolerations:
- key: maintenance
  operator: Equal
  value: scheduled
  effect: NoSchedule
# Remove when maintenance complete:
kubectl taint node worker-1 maintenance-
Node Problem Detector
The Node Problem Detector (NPD) is an optional DaemonSet that monitors kernel logs, system daemons, and hardware sensors and converts them into Node Conditions and Events, filling gaps that the kubelet's native condition reporting doesn't cover.
# NPD generates conditions like:
# KernelDeadlock=True: kernel is stuck
# ReadonlyFilesystem=True: root FS went readonly (common on failed disks)
# FrequentKubeletRestart=True: kubelet restarting too often
# FrequentDockerRestart=True: container runtime restarting
# CorruptDockerOverlay2=True: overlayfs corruption detected
# Deploy NPD:
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
# NPD config: /etc/npd/kernel-monitor.json (log patterns -> conditions)
# Custom problem detectors via plugin: any stdout-emitting JSON-outputting binary
NPD can be configured to add taints and annotations when problems are detected. The Cluster Autoscaler can be configured to treat nodes with certain NPD-set taints as unhealthy and remove them automatically — enabling self-healing node replacement for persistent hardware faults.
Node Lease Coordination Details
# Watch node leases to monitor heartbeat health
kubectl get leases -n kube-node-lease
# Check when a specific node last renewed its lease
kubectl get lease worker-1 -n kube-node-lease \
-o jsonpath='{.spec.renewTime}'
# If lease.renewTime is stale but node.status.conditions are recent:
# -> split: kubelet is updating status but not renewing lease
# -> check feature gate NodeLease=true (enabled by default since 1.14)
# Calculate staleness
kubectl get lease -n kube-node-lease \
-o custom-columns='NODE:.metadata.name,RENEW:.spec.renewTime' \
| sort -k2
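The listing above sorts leases by renew time but does not compute an age. A minimal sketch for turning renewTime into seconds of staleness follows; it assumes GNU date (the -d option) and UTC timestamps ending in Z, and the function names are hypothetical.

```shell
# Convert an RFC 3339 UTC timestamp (such as a Lease renewTime) to epoch seconds.
# Fractional seconds are dropped. Requires GNU date.
rfc3339_to_epoch() {
  ts="${1%Z}"      # strip the trailing Z
  ts="${ts%%.*}"   # strip fractional seconds, if present
  date -u -d "${ts}Z" +%s
}

# Seconds elapsed since a lease was last renewed.
# $1 renewTime, $2 "now" in epoch seconds (defaults to the current time).
lease_age_seconds() {
  renew_epoch="$(rfc3339_to_epoch "$1")"
  now_epoch="${2:-$(date -u +%s)}"
  echo $((now_epoch - renew_epoch))
}
```

A typical call would be lease_age_seconds "$(kubectl get lease worker-1 -n kube-node-lease -o jsonpath='{.spec.renewTime}')"; an age well above leaseDurationSeconds (40 by default) suggests the kubelet has stopped renewing.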
Key Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| node_collector_evictions_number | Counter | zone | Pod evictions triggered by node lifecycle controller per zone |
| node_collector_unhealthy_nodes_in_zone | Gauge | zone | Count of unhealthy (NotReady) nodes per zone |
| node_collector_zone_health | Gauge | zone | Fraction of healthy nodes in zone (0–1); below 0.45 triggers slow-eviction |
| kubelet_node_status_update_errors_total | Counter | - | Failures to update node status (network or apiserver issue) |
| kubelet_node_status_update_duration_seconds | Histogram | - | Latency of node status PATCH calls |
| kubelet_graceful_shutdown_start_time_seconds | Gauge | - | Unix timestamp when graceful shutdown began (non-zero = in progress) |
| kubelet_graceful_shutdown_end_time_seconds | Gauge | - | Unix timestamp when graceful shutdown completed |
Alerting Rules
# Alert: zone health below threshold (zone outage likely)
- alert: KubernetesZoneHealthLow
  expr: node_collector_zone_health < 0.5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Zone {{ $labels.zone }} health below 50%"
    description: "{{ $value | humanizePercentage }} of nodes in zone are healthy"
# Alert: node eviction rate elevated
- alert: KubernetesNodeEvictionsElevated
  expr: increase(node_collector_evictions_number[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "High pod eviction rate from node controller"
# Alert: node not ready for extended period
- alert: KubernetesNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} NotReady for 10 minutes"
# Alert: graceful shutdown in progress
- alert: KubernetesGracefulShutdownInProgress
  expr: kubelet_graceful_shutdown_start_time_seconds > 0
  labels:
    severity: info
  annotations:
    summary: "Node {{ $labels.node }} is shutting down gracefully"
Troubleshooting Runbooks
Node stuck in NotReady — pods not evicted
# Symptom: node NotReady for >5m but pods still show Running
# 1. Check node conditions
kubectl describe node worker-1 | grep -A 20 Conditions
# 2. Check taints (should have NoExecute if NotReady > grace period)
kubectl get node worker-1 -o jsonpath='{.spec.taints}'
# 3. Check pod tolerations — they may have long tolerationSeconds
kubectl get pod -o jsonpath='{.spec.tolerations}' mypod
# 4. Check node lifecycle controller logs (in kube-controller-manager)
kubectl logs -n kube-system -l component=kube-controller-manager \
| grep -i "node lifecycle\|eviction\|taint" | tail -50
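# 5. Check zone health metrics: if the zone is unhealthy, eviction is rate-limited.
# node_collector_zone_health is served by kube-controller-manager, not the apiserver,
# so query it in Prometheus or from the controller manager's own /metrics endpoint.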
# 6. If node is confirmed dead, force delete the Node object
kubectl delete node worker-1
# -> Triggers immediate eviction of all pods on the node
kubectl drain hangs — PDB blocking eviction
# Symptom: kubectl drain outputs "evicting pod..." forever
# 1. Check which PDB is blocking
kubectl get pdb -A
kubectl describe pdb my-pdb -n my-namespace
# Look for: "Allowed disruptions: 0" — means no pod can be evicted
# 2. Check current disruptions
kubectl get pdb -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: disruptions allowed={.status.disruptionsAllowed}{"\n"}{end}'
# 3. Options:
# a) Wait for more replicas to become healthy (increases allowed disruptions)
# b) Temporarily patch the PDB (risky — understand the impact)
kubectl patch pdb my-pdb -n my-namespace \
--type=merge -p '{"spec":{"minAvailable":0}}'
# c) Force eviction bypassing PDB (last resort)
kubectl delete pod my-pod-xyz -n my-namespace
# Undo after drain:
kubectl patch pdb my-pdb -n my-namespace \
--type=merge -p '{"spec":{"minAvailable":2}}'
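The "Allowed disruptions: 0" value that blocks a drain follows directly from the PDB's numbers. The sketch below covers only the simple case of an integer minAvailable; percentage values and maxUnavailable are not handled, and the function name is hypothetical.

```shell
# Allowed disruptions for a PDB with an integer minAvailable:
# healthy pods above the floor, never negative.
# $1 currently healthy pods, $2 minAvailable.
pdb_disruptions_allowed() {
  current_healthy="$1"
  min_available="$2"
  allowed=$((current_healthy - min_available))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}
```

With three healthy replicas and minAvailable: 2 the result is 1, so drain can evict one pod at a time. With only two healthy replicas it is 0, and drain waits until another replica becomes healthy.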
Pods stuck in Terminating after node deletion
# Symptom: pods show Terminating but node is deleted/gone
# Cause: kubelet on the (gone) node never sent the pod deletion confirmation.
# Kubernetes won't force-delete by default — DeletionTimestamp set but
# finalizers or volume detach is pending.
# 1. Check pod status
kubectl get pod my-pod -n my-ns -o yaml | grep -A5 "deletionTimestamp\|finalizers"
# 2. For stuck CSI volumes (VolumeAttachment objects reference the node and PV, not the pod)
kubectl get volumeattachment | grep worker-1
# If stuck: manually delete the VolumeAttachment object
# 3. Force-delete the pod (use for nodes confirmed permanently gone)
kubectl delete pod my-pod -n my-ns --grace-period=0 --force
# 4. Preferred for a non-graceful shutdown: apply the out-of-service taint instead of force-deleting pods one by one.
# Only do this once the node is confirmed powered off; if it is still running, workloads could run twice.
kubectl taint node worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
# This triggers force deletion + volume detach cleanly
Graceful shutdown not working — pods don't terminate before poweroff
# Symptom: systemd shuts down node but pods don't terminate gracefully
# 1. Check kubelet has shutdown grace period configured
grep -i shutdown /var/lib/kubelet/config.yaml
# Expected: shutdownGracePeriod: 60s
# 2. Verify kubelet holds a systemd inhibitor lock
systemd-inhibit --list | grep kubelet
# Expected: kubelet shutdown delay Graceful kubelet termination
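# 3. Check that InhibitDelayMaxSec (a systemd-logind setting) allows enough time
grep -rs InhibitDelayMaxSec /etc/systemd/logind.conf /etc/systemd/logind.conf.d/
# Default is 5s; must be >= shutdownGracePeriod
# Fix (the drop-in file name below is only an example):
mkdir -p /etc/systemd/logind.conf.d
printf '[Login]\nInhibitDelayMaxSec=90\n' > /etc/systemd/logind.conf.d/90-inhibit-delay.conf
systemctl restart systemd-logind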
# 4. Test: trigger a test shutdown with time limit
# systemctl poweroff
# Observe: kubelet should log "graceful node shutdown activated"
journalctl -u kubelet | grep -i "shutdown\|graceful"
Node re-registers after deletion — ghost node
# Symptom: deleted a Node object, but it reappears minutes later
# Cause: the kubelet is still running on the node and re-registers itself
# (kubelet calls POST /api/v1/nodes if the Node object is missing)
# Resolution:
# 1. Stop the kubelet first
ssh worker-1 "sudo systemctl stop kubelet"
# 2. Then delete the Node object
kubectl delete node worker-1
# 3. If the node should be permanently removed, also terminate the VM
# If the VM will be reused, re-joining it needs a fresh kubelet config
ssh worker-1 "sudo rm /etc/kubernetes/kubelet.conf"
ssh worker-1 "sudo kubeadm reset -f" # or equivalent bootstrap reset
Production Best Practices
- Always drain before maintenance, even for a 5-minute kubelet restart. A kubelet restart without draining leaves pods running but orphaned from PLEG updates, which can cause inconsistent state.
- Set PodDisruptionBudgets for all stateful workloads: PDBs are the only mechanism that makes kubectl drain respect availability guarantees. Without them, drain evicts everything instantly.
- Configure shutdownGracePeriod and increase InhibitDelayMaxSec: the default systemd inhibitor delay (5s) is shorter than any realistic application graceful shutdown. Set both to at least 90s for production nodes.
- Use the out-of-service taint for crashed nodes: never force-delete pods from crashed nodes manually. The node.kubernetes.io/out-of-service:NoExecute taint does it correctly, including forcing CSI volume detachments.
- Monitor zone health: alert on node_collector_zone_health < 0.5. When a whole zone is dark, the reduced eviction rate may mislead you into thinking pods are healthy when they're stuck on dead nodes.
- Label nodes with accurate topology labels: the zone-aware eviction rate limiting and TopologySpreadConstraints both depend on topology.kubernetes.io/zone being correct. Mis-labeled nodes spread pods incorrectly and can cause entire workloads to be on one physical zone.
- Deploy Node Problem Detector: NPD catches conditions the kubelet doesn't (kernel deadlocks, readonly filesystems, OOM events in dmesg). Integrate NPD conditions with your alerting and CA-based self-healing.
- Test node failure scenarios regularly: use systemctl stop kubelet on a test node to verify eviction timing matches your SLO expectations. Test that your PDBs actually protect availability during drain.
- Set appropriate tolerationSeconds per workload class: stateless web services can use 30s; stateful databases should use 300–600s or longer to avoid unnecessary failover churn during transient node issues.
- Automate certificate rotation: enable rotateCertificates: true in KubeletConfiguration. Expired node certificates prevent the kubelet from renewing its Lease, causing the node to appear NotReady even though workloads are healthy.