
Node Lifecycle

A node's lifecycle spans from its initial registration through active workload execution to graceful shutdown or forced removal. Every phase involves specific control-plane controllers, kubelet behaviors, and timing parameters that directly affect pod availability. This page covers the complete lifecycle: registration → healthy operation → degradation → drain → deletion, plus cordon/taint mechanics and the node controller's role throughout.

Lifecycle Phases Overview

Node Registration

A node self-registers by the kubelet calling POST /api/v1/nodes on the kube-apiserver. This requires a valid client certificate with the system:node:<nodeName> identity in the system:nodes group. The TLS bootstrap process (covered in kubelet: TLS Bootstrap) issues this certificate automatically.

Bootstrap token exchange

kubelet uses a bootstrap token (--bootstrap-kubeconfig) to authenticate as system:bootstrappers and submit a CSR for a node client certificate.

CSR auto-approval

The csrapproving controller in kube-controller-manager auto-approves CSRs from system:bootstrappers matching the node identity pattern. The certificate is signed by the cluster CA.

Node object creation

kubelet POSTs the Node object with initial status.capacity, status.nodeInfo (OS, kernel, runtime version), and all initial conditions set to Unknown. The apiserver assigns a resourceVersion.

CCM initializer taint

When the kubelet runs with --cloud-provider=external, it registers the node with the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint already set, preventing workload scheduling until the cloud-controller-manager has populated cloud metadata (zone, instance type, providerID).

CCM initialization

CCM's NodeController queries the cloud provider for the instance's zone, region, instance type, and external IPs; patches Node status.addresses and spec.providerID; removes the uninitialized taint.

CNI / DaemonSet startup

CNI DaemonSet pods are scheduled (they tolerate the not-ready taint). The CNI plugin configures the node network. The container runtime then reports NetworkReady=true in its CRI Status response, which the kubelet uses when computing the Ready condition.

Node becomes Ready

kubelet patches Node status with Ready=True. The node.kubernetes.io/not-ready:NoSchedule taint is automatically removed by the node lifecycle controller. The node is now schedulable.

Node Object Anatomy

apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    kubernetes.io/hostname: worker-1
    kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    node.kubernetes.io/instance-type: m5.2xlarge  # set by CCM
    topology.kubernetes.io/region: us-east-1       # set by CCM
    topology.kubernetes.io/zone: us-east-1a        # set by CCM
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
spec:
  podCIDR: 10.244.1.0/24          # assigned by node IPAM controller
  podCIDRs:
  - 10.244.1.0/24
  providerID: aws:///us-east-1a/i-0abc12345def67890  # set by CCM
  taints:                          # applied manually or by lifecycle controller
  - key: node.kubernetes.io/not-ready
    effect: NoSchedule
status:
  capacity:
    cpu: "8"
    memory: 32Gi
    pods: "110"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    ephemeral-storage: 500Gi
  allocatable:                     # capacity minus reserved (kubeReserved+systemReserved+evictionThreshold)
    cpu: 7600m
    memory: 29Gi
    pods: "110"
    ephemeral-storage: 470Gi
  addresses:
  - type: InternalIP
    address: 10.0.1.50
  - type: ExternalIP
    address: 54.12.34.56
  - type: Hostname
    address: worker-1
  conditions:                      # computed by kubelet every nodeStatusUpdateFrequency (10s); written on change or every nodeStatusReportFrequency (5m)
  - type: Ready
    status: "True"
    lastHeartbeatTime: "2024-01-15T10:30:00Z"
    lastTransitionTime: "2024-01-14T08:00:00Z"
    reason: KubeletReady
    message: kubelet is posting ready status
  - type: MemoryPressure
    status: "False"
    ...
  - type: DiskPressure
    status: "False"
    ...
  - type: PIDPressure
    status: "False"
    ...
  - type: NetworkUnavailable
    status: "False"
    ...
  nodeInfo:
    machineID: abc123
    systemUUID: def456
    bootID: ghi789
    kernelVersion: 5.15.0-1034-aws
    osImage: Ubuntu 22.04.3 LTS
    containerRuntimeVersion: containerd://1.7.2
    kubeletVersion: v1.29.2
    kubeProxyVersion: v1.29.2
    operatingSystem: linux
    architecture: amd64
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - registry.k8s.io/pause:3.9
    sizeBytes: 268435456
  volumesInUse: []
  volumesAttached: []
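The allocatable comment in the manifest above can be checked with a little arithmetic. The following is a minimal sketch, not kubelet code: quantities are plain integers (millicores and MiB) rather than Kubernetes quantity strings, and the reservation values are hypothetical numbers chosen so the result matches the example node.

```python
# Illustrative only: reservation values below are hypothetical.

def allocatable(capacity: int, kube_reserved: int, system_reserved: int,
                eviction_hard: int = 0) -> int:
    """Return capacity minus every reservation, floored at zero."""
    return max(0, capacity - kube_reserved - system_reserved - eviction_hard)

# CPU: 8000m capacity with 300m kubeReserved and 100m systemReserved.
cpu_m = allocatable(8000, kube_reserved=300, system_reserved=100)

# Memory: 32768 MiB capacity with 2048 MiB kubeReserved, 924 MiB
# systemReserved, and a 100 MiB hard eviction threshold.
mem_mib = allocatable(32768, kube_reserved=2048, system_reserved=924,
                      eviction_hard=100)

print(cpu_m, "millicores;", mem_mib // 1024, "GiB")
```

With these assumed reservations the node advertises 7600m of CPU and 29Gi of memory, the same figures shown under status.allocatable.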

Node Conditions

The kubelet continuously evaluates and patches these conditions. The node lifecycle controller in kube-controller-manager watches them and reacts:

Condition	True means	False means	Control-plane reaction
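`Ready`	Node healthy, kubelet posting status, network ready	kubelet reports the node unhealthy (for example, runtime or network not ready); the status becomes `Unknown` when no heartbeat arrives within `node-monitor-grace-period` (40s default)	On `False`: add `node.kubernetes.io/not-ready` taints (`NoSchedule` and `NoExecute`). On `Unknown`: add `node.kubernetes.io/unreachable` taints. Pods are evicted when their `NoExecute` toleration expires (300s by default)
`MemoryPressure`	Available memory has fallen below an eviction threshold	Normal memory	Add `node.kubernetes.io/memory-pressure:NoSchedule` taint; kubelet begins evicting pods, starting with those whose memory usage exceeds their requests, which always includes BestEffort pods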
`DiskPressure`	Node disk is low (image filesystem or nodefs)	Normal disk	Add `node.kubernetes.io/disk-pressure:NoSchedule` taint; kubelet evicts pods with large ephemeral storage
`PIDPressure`	Too many processes running	Normal PID count	Add `node.kubernetes.io/pid-pressure:NoSchedule` taint
`NetworkUnavailable`	Node network not configured (CNI missing)	Network configured normally	Set by CNI plugin (via Node patch); prevents pod IP assignment

Condition → Taint Automation

The node lifecycle controller (part of kube-controller-manager) automatically adds and removes node.kubernetes.io/* taints based on conditions. You don't need to manually taint nodes for common conditions — the controller handles it. Manual taints are additive; they coexist with controller-managed taints.

Node Heartbeat Mechanism

Kubernetes uses two complementary heartbeat channels to detect node failures:

Node Status Updates (legacy)

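kubelet computes node status every nodeStatusUpdateFrequency (default 10s)
With Node Leases enabled, it PATCHes node.status only when something changed or nodeStatusReportFrequency (default 5m) has elapsed
Payload carries all conditions + allocatable + images, so frequent writes cause high etcd write amplification at scale
Control plane reads lastHeartbeatTime alongside the Lease when judging node health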

Node Lease (preferred)

Alpha in 1.12, beta and enabled by default in 1.14, GA in 1.17
kubelet renews a coordination.k8s.io/v1/Lease in the kube-node-lease namespace every nodeLeaseDurationSeconds/4 (default every ~10s)
Tiny object (~200 bytes) vs full Node status
Node lifecycle controller watches Lease renewTime
Decouples heartbeat from status updates

# Node Lease object
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: worker-1          # same as node name
  namespace: kube-node-lease
spec:
  holderIdentity: worker-1
  leaseDurationSeconds: 40   # KubeletConfiguration field: nodeLeaseDurationSeconds
  renewTime: "2024-01-15T10:30:45.123456Z"   # updated every ~10s

# kubelet config knobs:
# nodeStatusUpdateFrequency: 10s  (status computation; written on change or every nodeStatusReportFrequency, 5m)
# nodeLeaseDurationSeconds: 40    (lease TTL; renew at TTL/4 = 10s)
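As a rough illustration of the lease arithmetic, the sketch below derives the renewal interval from the lease duration and decides whether a renewTime is older than a grace period. The function names and the 40-second default are assumptions made for the example; the real comparison happens inside the node lifecycle controller.

```python
from datetime import datetime, timedelta, timezone

def renew_interval(lease_duration_s: float) -> float:
    """The kubelet renews at one quarter of the lease duration."""
    return lease_duration_s * 0.25

def lease_age(renew_time: str, now: datetime) -> timedelta:
    """Age of a lease given its RFC 3339 renewTime (UTC, trailing 'Z')."""
    renewed = datetime.fromisoformat(renew_time.replace("Z", "+00:00"))
    return now - renewed

def heartbeat_missed(renew_time: str, now: datetime,
                     grace_s: float = 40.0) -> bool:
    """True when the lease has not been renewed within the grace period."""
    return lease_age(renew_time, now).total_seconds() > grace_s

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("renew every", renew_interval(40), "seconds")
    print("stale:", heartbeat_missed("2024-01-15T10:30:45.123456Z", now))
```

The renewTime string can be taken directly from the spec.renewTime field of the Lease object.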

Failure Detection Timing

Parameter	Default	Owner	Effect
`node-monitor-period`	5s	kube-controller-manager	How often node lifecycle controller checks node conditions
`node-monitor-grace-period`	40s	kube-controller-manager	How long to wait before marking node Unknown/NotReady after last heartbeat
`pod-eviction-timeout`	5m (deprecated in 1.26)	kube-controller-manager	Legacy: grace period before evicting pods from NotReady node. Now controlled by taint-based eviction + tolerationSeconds
`nodeLeaseDurationSeconds`	40s	KubeletConfiguration	Lease TTL; lease renewed at TTL/4
`nodeStatusUpdateFrequency`	10s	KubeletConfiguration	How often the kubelet computes node status; the status is written on change or every `nodeStatusReportFrequency` (5m)

pod-eviction-timeout Superseded by Taint Tolerations

Since Kubernetes 1.18 (GA), taint-based eviction is the standard mechanism. Pods that tolerate node.kubernetes.io/not-ready:NoExecute with tolerationSeconds: 300 (the default added automatically) will be evicted after 5 minutes. To customize eviction timing per workload, set explicit tolerations in the pod spec rather than relying on the deprecated pod-eviction-timeout flag.
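To see how these parameters combine, here is a back-of-the-envelope sketch of the time between a node's last heartbeat and the eviction of a pod that carries the default toleration. It assumes the controller notices a missed heartbeat on its next monitor pass and ignores API latency and eviction rate limiting, so treat the result as an estimate rather than a guarantee.

```python
def eviction_window(grace_s: int = 40, monitor_period_s: int = 5,
                    toleration_s: int = 300) -> tuple:
    """Return (earliest, latest) seconds from last heartbeat to eviction.

    earliest: the controller checks exactly when the grace period expires.
    latest:   the controller checked just before expiry and needs one more
              monitor period to notice.
    """
    earliest = grace_s + toleration_s
    latest = grace_s + monitor_period_s + toleration_s
    return earliest, latest

print(eviction_window())                 # defaults
print(eviction_window(toleration_s=30))  # fast-eviction workload
```

With the defaults the window is roughly 340 to 345 seconds; lowering tolerationSeconds to 30 brings it down to about 70 to 75 seconds.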

Taint-Based Eviction

When a node transitions to NotReady or Unknown, the node lifecycle controller applies NoExecute taints. Pods that do not tolerate the taint are evicted immediately. Pods that tolerate it with a tolerationSeconds value are evicted once that interval expires. Pods that tolerate it without tolerationSeconds stay bound to the node.

Built-in Node Lifecycle Taints

Taint Key	Effect	Applied When	Default Toleration
`node.kubernetes.io/not-ready`	NoExecute	`Ready=False`	300s (added automatically by admission)
`node.kubernetes.io/unreachable`	NoExecute	`Ready=Unknown` (no heartbeat within `node-monitor-grace-period`)	300s (added automatically)
`node.kubernetes.io/not-ready`	NoSchedule	`Ready=False`	Tolerated by DaemonSets
`node.kubernetes.io/memory-pressure`	NoSchedule	`MemoryPressure=True`	Tolerated by DaemonSets
`node.kubernetes.io/disk-pressure`	NoSchedule	`DiskPressure=True`	Tolerated by DaemonSets
`node.kubernetes.io/pid-pressure`	NoSchedule	`PIDPressure=True`	Tolerated by DaemonSets
`node.kubernetes.io/network-unavailable`	NoSchedule	`NetworkUnavailable=True`	Tolerated by DaemonSet pods that use hostNetwork
`node.kubernetes.io/unschedulable`	NoSchedule	`kubectl cordon` / `spec.unschedulable=true`	Tolerated by DaemonSets
`node.cloudprovider.kubernetes.io/uninitialized`	NoSchedule	CCM node initialization pending	Not added automatically; the CCM's own manifest must tolerate it

Controlling Eviction Timing via Tolerations

# Default tolerations added by DefaultTolerationSeconds admission plugin:
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300   # 5 minutes
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300

# For fast-eviction workloads (stateless, rebalanceable):
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30    # evict after 30s

# For sticky workloads (caches, leader elections):
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 600   # wait 10 minutes before evicting

# For DaemonSets (never evict from NotReady nodes):
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  # no tolerationSeconds = tolerate indefinitely
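The toleration rules above can be expressed as a small matching function. This is a simplified sketch for a single NoExecute taint, with tolerations and taints modeled as plain dictionaries; the helper names are hypothetical and the handling of several matching tolerations (shortest finite tolerationSeconds wins) reflects one reading of the controller's behavior.

```python
from typing import Optional

def tolerates(toleration: dict, taint: dict) -> bool:
    """Check whether one toleration matches one taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        # An Exists toleration with no key matches every taint key.
        return key is None or key == taint["key"]
    return (key == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))

def eviction_delay(tolerations: list, taint: dict) -> Optional[int]:
    """0 = evict immediately, None = never evict, otherwise seconds to wait."""
    matching = [t for t in tolerations if tolerates(t, taint)]
    if not matching:
        return 0
    finite = [t["tolerationSeconds"] for t in matching
              if "tolerationSeconds" in t]
    if not finite:
        return None
    return max(0, min(finite))

NOT_READY = {"key": "node.kubernetes.io/not-ready", "effect": "NoExecute"}
DEFAULT = [{"key": "node.kubernetes.io/not-ready", "operator": "Exists",
            "effect": "NoExecute", "tolerationSeconds": 300}]
DAEMONSET = [{"key": "node.kubernetes.io/not-ready", "operator": "Exists",
              "effect": "NoExecute"}]

print(eviction_delay(DEFAULT, NOT_READY))    # default toleration
print(eviction_delay(DAEMONSET, NOT_READY))  # tolerates indefinitely
print(eviction_delay([], NOT_READY))         # no toleration at all
```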

Node Lifecycle Controller

The node lifecycle controller runs inside kube-controller-manager. It grew out of the original "node controller", whose three responsibilities are now split across several controllers:

CIDR Assignment

Handled by the node IPAM controller rather than the lifecycle controller. It assigns spec.podCIDR from the cluster CIDR range to each new node, controlled by the --allocate-node-cidrs and --cluster-cidr flags.

Health Monitoring

Watches node Lease and status. After node-monitor-grace-period without a heartbeat, marks conditions Unknown and applies NoSchedule/NoExecute taints. Implements rate limiting (see below).

Node Deletion

The lifecycle controller never deletes Node objects; in cloud environments that is the CCM's role. Once a Node object is gone (deleted by the CCM or manually), the pod garbage collector removes the pods still bound to it.

Eviction Rate Limiting

The node lifecycle controller rate-limits evictions so that a partial zone outage does not trigger mass pod deletion:

# kube-controller-manager flags:
--node-eviction-rate=0.1          # nodes/second whose pods are evicted while the zone is healthy
--secondary-node-eviction-rate=0.01  # reduced rate once the zone is unhealthy
--unhealthy-zone-threshold=0.55   # fraction of NotReady nodes at which a zone counts as unhealthy
--large-cluster-size-threshold=50 # at or below this many nodes, the secondary rate is treated as 0

This means: in a healthy zone, the controller evicts pods from at most 0.1 nodes per second, that is, one node every 10 seconds. If ≥55% of nodes in a zone become NotReady (suggesting a network partition rather than individual node failure), the rate drops to 0.01 nodes/s (one node every 100 seconds), or to zero in clusters of 50 nodes or fewer, buying time for the zone to recover without evicting all pods unnecessarily.

Zone-Aware Eviction

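Nodes are grouped by the topology.kubernetes.io/zone label (set by CCM). If a zone is partially unhealthy, eviction in that zone slows to the secondary rate. If every node in one zone is unhealthy while other zones are fine, eviction proceeds at the normal rate. If every zone is completely unhealthy, the controller assumes the control plane itself has lost connectivity and stops evicting. Isolated node failures are evicted normally. This is why the CCM-set zone labels are load-bearing for reliability.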
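The per-zone decision can be sketched as a pure function. This is an approximation modeled on the upstream node documentation, not the controller's code: the function name and parameters are hypothetical, rates are in nodes per second, and the requirement of more than two NotReady nodes before a zone counts as partially disrupted is an assumption about the implementation.

```python
def zone_eviction_rate(ready: int, not_ready: int, cluster_nodes: int,
                       all_zones_down: bool = False,
                       primary: float = 0.1, secondary: float = 0.01,
                       unhealthy_threshold: float = 0.55,
                       large_cluster: int = 50) -> float:
    """Return the eviction rate (nodes/second) to apply to one zone."""
    total = ready + not_ready
    if total == 0 or not_ready == 0:
        return primary                      # healthy zone
    if ready == 0:
        # Whole zone is down: keep evicting unless every zone is down,
        # which points at a control-plane connectivity problem.
        return 0.0 if all_zones_down else primary
    # Assumption: a handful of failures never counts as a zone outage.
    if not_ready > 2 and not_ready / total >= unhealthy_threshold:
        return secondary if cluster_nodes > large_cluster else 0.0
    return primary                          # isolated failures

print(zone_eviction_rate(ready=4, not_ready=6, cluster_nodes=100))
print(zone_eviction_rate(ready=4, not_ready=6, cluster_nodes=20))
```

The first call models a partially disrupted zone in a large cluster and yields the secondary rate; the second models the same zone in a small cluster, where eviction pauses.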

Cordon and Drain

Cordon and drain are the standard procedure for taking a node out of service — for upgrades, maintenance, or decommissioning.

kubectl cordon

Sets spec.unschedulable = true on the Node object. The scheduler will not place new pods on the node. Existing pods continue running. The node lifecycle controller mirrors the field as the node.kubernetes.io/unschedulable:NoSchedule taint.

kubectl uncordon

Sets spec.unschedulable = false. Removes the unschedulable taint. The node becomes schedulable again. Use after maintenance is complete.

# Cordon node (prevent new scheduling)
kubectl cordon worker-1

# Verify: spec.unschedulable=true, taint added
kubectl get node worker-1 -o jsonpath='{.spec.unschedulable}'
kubectl describe node worker-1 | grep -A3 Taints

# Drain: cordon the node, then evict all evictable pods
#   --ignore-daemonsets     skip DaemonSet pods (their controller would recreate them)
#   --delete-emptydir-data  allow evicting pods that use emptyDir (the data is lost)
#   --grace-period=30       override each pod's terminationGracePeriodSeconds
#   --timeout=5m            give up after 5 minutes
#   --force                 also remove pods not managed by a controller
kubectl drain worker-1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=30 \
  --timeout=5m \
  --force

# What kubectl drain does:
# 1. Cordons the node (spec.unschedulable=true)
# 2. Lists all pods on the node
# 3. Sends Eviction API calls (respects PodDisruptionBudgets)
# 4. Waits for pods to terminate
# 5. Returns when node is empty (or timeout hit)

Eviction API vs Delete

kubectl drain uses the Eviction API (POST /api/v1/namespaces/{ns}/pods/{pod}/eviction), not plain DELETE /api/v1/namespaces/{ns}/pods/{pod}. This difference matters:

Eviction API

Checked against PodDisruptionBudget before proceeding
Returns 429 Too Many Requests if PDB is violated
kubectl drain retries on 429
Respects terminationGracePeriodSeconds
Triggers graceful shutdown lifecycle hooks

kubectl delete pod

Does NOT check PodDisruptionBudget
Can reduce availability below minimum
Still triggers graceful shutdown (grace period)
Use only when you understand the impact
Force delete (--grace-period=0) bypasses shutdown
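The effect of a PodDisruptionBudget on a drain can be illustrated with a toy model. The sketch below assumes a PDB expressed as an integer minAvailable and treats each eviction request as either accepted or rejected (the 429 case); it does not talk to a cluster, and the function name is made up for the example.

```python
def drain_order(healthy_replicas: int, min_available: int,
                pods_on_node: list) -> tuple:
    """Split the node's pods into (evicted, blocked) for one drain pass.

    An eviction is allowed only while the number of healthy replicas stays
    above min_available; each accepted eviction lowers the healthy count.
    """
    evicted, blocked = [], []
    for pod in pods_on_node:
        disruptions_allowed = max(0, healthy_replicas - min_available)
        if disruptions_allowed > 0:
            evicted.append(pod)
            healthy_replicas -= 1
        else:
            blocked.append(pod)   # the Eviction API would answer 429 here
    return evicted, blocked

# Three healthy replicas, minAvailable 2, two of them on the draining node.
print(drain_order(3, 2, ["web-a", "web-b"]))
```

In this model the second pod stays blocked until a replacement replica becomes healthy elsewhere, which is why kubectl drain keeps retrying instead of failing outright.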

Graceful Node Shutdown

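Graceful Node Shutdown (beta and enabled by default since 1.21) allows the kubelet to detect OS-level shutdown signals (via systemd inhibitor locks) and gracefully terminate pods before the node powers off. It only takes effect once the two grace periods below are set to non-zero values.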

# KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 60s          # total time for all pods to terminate
shutdownGracePeriodCriticalPods: 20s  # reserved for critical pods (system-node-critical)
# remaining = shutdownGracePeriod - shutdownGracePeriodCriticalPods
# regular pods get: 60s - 20s = 40s

Shutdown sequence:

systemd shutdown inhibitor acquired

kubelet registers a systemd inhibitor lock at startup, delaying actual shutdown until kubelet releases it.

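Shutdown signal received

systemd-logind emits its PrepareForShutdown signal over D-Bus (for example after systemctl poweroff). The kubelet, which subscribes to that signal, enters graceful shutdown mode while the inhibitor lock holds the shutdown back.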

Non-critical pods evicted

All non-critical pods receive SIGTERM. They have shutdownGracePeriod - shutdownGracePeriodCriticalPods seconds to exit cleanly. preStop hooks run.

Critical pods evicted

Pods with priorityClassName: system-node-critical or system-cluster-critical are terminated with shutdownGracePeriodCriticalPods seconds grace.

Inhibitor released

kubelet releases the systemd inhibitor lock. systemd proceeds with shutdown. The node powers off.
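The time budget in this sequence is simple to verify ahead of time. The following sketch splits shutdownGracePeriod between regular and critical pods and flags a systemd inhibitor delay that is too short to honor it; the function and key names are hypothetical and all values are in seconds.

```python
def shutdown_budget(grace_s: int, critical_s: int,
                    inhibit_delay_max_s: int) -> dict:
    """Split the shutdown grace period and sanity-check the inhibitor delay."""
    if critical_s > grace_s:
        raise ValueError("shutdownGracePeriodCriticalPods exceeds shutdownGracePeriod")
    return {
        "regular_pods_s": grace_s - critical_s,   # time for non-critical pods
        "critical_pods_s": critical_s,            # time reserved for critical pods
        # systemd only delays shutdown for InhibitDelayMaxSec, so the
        # kubelet's budget is cut short if this is smaller than grace_s.
        "inhibit_delay_sufficient": inhibit_delay_max_s >= grace_s,
    }

# Values from the KubeletConfiguration above, with systemd's 5s default.
print(shutdown_budget(60, 20, inhibit_delay_max_s=5))
```

With the configuration shown earlier, regular pods get 40 seconds and critical pods 20 seconds, but a 5-second inhibitor delay would not be enough to cover the 60-second total.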

Non-Graceful Node Shutdown

If the node crashes (power cut, kernel panic) without a graceful shutdown, pods remain in Terminating state indefinitely until the node object is deleted. The Non-Graceful Node Shutdown feature (GA 1.28) addresses this:

# After confirming the node is really powered off, an admin adds the taint:
kubectl taint node worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# This triggers:
# 1. Pods on the node without a matching toleration are force-deleted (no grace period)
# 2. Volume detachments are forced
# 3. Replacement pods can attach the same volumes on other nodes
# After node recovery (or decommission), remove the taint:
kubectl taint node worker-1 node.kubernetes.io/out-of-service-

Node Deletion

There are three paths to node deletion, each with different consequences:

Path	Trigger	Pods	Volumes	Use when
kubectl delete node	Manual admin action	Removed by the pod garbage collector once the Node object is gone	Force-detached after `attachDetachReconcileSyncPeriod`	Decommissioning a node that is already drained
CCM deletion	Instance disappears from cloud provider API	Same as above	Cloud volumes stay; detached from Kubernetes side	Node terminated in cloud console; CCM detects and deletes Node object
Cluster Autoscaler scale-down	Node underutilized; CA determines safe to remove	Drained first (respects PDBs), then node deleted	Graceful detach via drain	Automatic right-sizing in cloud environments

Never Force-Delete Running Nodes

Running kubectl delete node worker-1 on a node that is still serving workloads skips drain and can cause split-brain for stateful applications. Always drain first, then delete. The only exception is a node that is confirmed to be permanently gone (instance terminated, hardware destroyed).

Node Autoscaling and Lifecycle

The Cluster Autoscaler (CA) interacts with node lifecycle at scale-up and scale-down:

Scale-Up

CA detects pending pods that cannot be scheduled (insufficient resources)
Simulates scheduling on each node group to find a fitting group
Calls cloud provider API to increase ASG/MIG size
New VM boots, kubelet registers Node object
CCM initializes node (providerID, labels, taints removed)
Scheduler places the previously-pending pod

Scale-Down

CA identifies nodes where all pods can be safely moved (utilization < threshold)
Checks PodDisruptionBudgets, min replicas, local storage
Cordons the node
Drains the node (Eviction API, respects PDBs)
Deletes the Node object
Terminates the cloud instance
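The utilization check in the first scale-down step can be approximated as follows. This sketch assumes utilization is the larger of the CPU and memory request-to-allocatable ratios and uses 0.5 as the threshold; both are assumptions about the autoscaler's defaults, and the helper names are hypothetical.

```python
def node_utilization(cpu_requested_m: float, cpu_allocatable_m: float,
                     mem_requested: float, mem_allocatable: float) -> float:
    """Utilization as the larger of the CPU and memory request ratios."""
    return max(cpu_requested_m / cpu_allocatable_m,
               mem_requested / mem_allocatable)

def scale_down_candidate(utilization: float, threshold: float = 0.5) -> bool:
    """A node is considered for removal only below the threshold."""
    return utilization < threshold

# Example node: 7600m CPU and 29 GiB memory allocatable.
busy = node_utilization(1900, 7600, 20, 29)     # memory-heavy workload
idle = node_utilization(1900, 7600, 7.25, 29)   # lightly loaded
print(scale_down_candidate(busy), scale_down_candidate(idle))
```

Passing the utilization check only makes a node a candidate; the PodDisruptionBudget and local-storage checks in the next step still decide whether it is actually drained.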

Node Topology and Labels

Topology labels (set by CCM or node-labeler) drive scheduling decisions and affect node lifecycle:

# Standard topology labels
topology.kubernetes.io/region: us-east-1
topology.kubernetes.io/zone: us-east-1a

# Node role labels
node-role.kubernetes.io/control-plane: ""
node-role.kubernetes.io/worker: ""

# Instance type (cloud-provider specific)
node.kubernetes.io/instance-type: m5.2xlarge
beta.kubernetes.io/instance-type: m5.2xlarge  # deprecated alias

# OS/arch
kubernetes.io/os: linux
kubernetes.io/arch: amd64

# Custom labels (examples)
workload-type: gpu
spot: "true"
node-pool: high-memory

The scheduler uses these labels for topology spread constraints, node affinity, and the taint/toleration system. Removing a topology label from a running node can cause pods to violate their spread constraints — the scheduler won't fix that retroactively, but new pods will be scheduled correctly.

Maintenance Patterns

Rolling Node Upgrade

# Step 1: Cordon (prevent new scheduling)
kubectl cordon worker-1

# Step 2: Drain (evict existing pods)
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data --grace-period=60

# Step 3: Upgrade node OS / kubelet
# (OS-specific: apt-get, yum, kubeadm upgrade node, etc.)
sudo apt-get install -y kubelet='1.29.2-*' kubectl='1.29.2-*'   # the revision suffix depends on the package repository
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Step 4: Verify
kubectl get node worker-1   # should show new version, still SchedulingDisabled

# Step 5: Uncordon
kubectl uncordon worker-1

# Verify schedulable
kubectl get node worker-1   # Ready, no SchedulingDisabled

Maintenance Window via Taints

# Add a maintenance taint (prevents new scheduling, doesn't evict existing)
kubectl taint node worker-1 maintenance=scheduled:NoSchedule

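# For planned full maintenance: evicts every pod without a matching NoExecute toleration.
# NoExecute eviction bypasses PodDisruptionBudgets; pods still get their own terminationGracePeriodSeconds (30s by default).
kubectl taint node worker-1 maintenance=scheduled:NoExecute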

# On the pods that can tolerate maintenance windows:
tolerations:
- key: maintenance
  operator: Equal
  value: scheduled
  effect: NoSchedule

# Remove when maintenance complete:
kubectl taint node worker-1 maintenance-

Node Problem Detector

The Node Problem Detector (NPD) is an optional DaemonSet that monitors kernel logs, system daemons, and hardware sensors and converts them into Node Conditions and Events, filling gaps that the kubelet's native condition reporting doesn't cover.

# NPD generates conditions like:
# KernelDeadlock=True: kernel is stuck
# ReadonlyFilesystem=True: root FS went readonly (common on failed disks)
# FrequentKubeletRestart=True: kubelet restarting too often
# FrequentDockerRestart=True: container runtime restarting
# CorruptDockerOverlay2=True: overlayfs corruption detected

# Deploy NPD:
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml

# NPD config: /etc/npd/kernel-monitor.json (log patterns -> conditions)
# Custom problem detectors via plugin: any stdout-emitting JSON-outputting binary

NPD + Cluster Autoscaler Integration

NPD itself only reports node conditions and events; it does not taint or drain nodes. A separate remediation controller (Draino is one example) can cordon and drain nodes that carry specific NPD conditions, after which the Cluster Autoscaler can remove the drained nodes, enabling self-healing node replacement for persistent hardware faults.

Node Lease Coordination Details

# Watch node leases to monitor heartbeat health
kubectl get leases -n kube-node-lease

# Check when a specific node last renewed its lease
kubectl get lease worker-1 -n kube-node-lease \
  -o jsonpath='{.spec.renewTime}'

# If lease.renewTime is stale but node.status.conditions are recent:
# -> split: kubelet is updating status but not renewing lease
# -> check kubelet logs for lease renewal errors (NodeLease has been GA since 1.17 and cannot be disabled)

# List leases oldest-first to spot stale heartbeats
kubectl get lease -n kube-node-lease --no-headers \
  -o custom-columns='NODE:.metadata.name,RENEW:.spec.renewTime' \
  | sort -k2

Key Metrics

Metric	Type	Labels	Description
`node_collector_evictions_total`	Counter	`zone`	Node evictions triggered by the node lifecycle controller per zone (named `node_collector_evictions_number` in older releases)
`node_collector_unhealthy_nodes_in_zone`	Gauge	`zone`	Count of unhealthy (NotReady) nodes per zone
`node_collector_zone_health`	Gauge	`zone`	Percentage of healthy nodes in zone (0–100); below 45 means the zone has crossed the default `unhealthy-zone-threshold` and eviction slows
`kubelet_node_status_update_errors_total`	Counter	—	Failures to update node status (network or apiserver issue)
`kubelet_node_status_update_duration_seconds`	Histogram	—	Latency of node status PATCH calls
`kubelet_graceful_shutdown_start_time_seconds`	Gauge	—	Unix timestamp when graceful shutdown began (non-zero = in progress)
`kubelet_graceful_shutdown_end_time_seconds`	Gauge	—	Unix timestamp when graceful shutdown completed

Alerting Rules

# Alert: zone health below threshold (zone outage likely)
- alert: KubernetesZoneHealthLow
  expr: node_collector_zone_health < 50
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Zone {{ $labels.zone }} health below 50%"
    description: "{{ $value }}% of nodes in zone are healthy"

# Alert: node eviction rate elevated
- alert: KubernetesNodeEvictionsElevated
  expr: increase(node_collector_evictions_total[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "High pod eviction rate from node controller"

# Alert: node not ready for extended period
- alert: KubernetesNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} NotReady for 10 minutes"

# Alert: graceful shutdown in progress
- alert: KubernetesGracefulShutdownInProgress
  expr: kubelet_graceful_shutdown_start_time_seconds > 0
  labels:
    severity: info
  annotations:
    summary: "Node {{ $labels.node }} is shutting down gracefully"

Troubleshooting Runbooks

Node stuck in NotReady — pods not evicted

# Symptom: node NotReady for >5m but pods still show Running

# 1. Check node conditions
kubectl describe node worker-1 | grep -A 20 Conditions

# 2. Check taints (should have NoExecute if NotReady > grace period)
kubectl get node worker-1 -o jsonpath='{.spec.taints}'

# 3. Check pod tolerations — they may have long tolerationSeconds
kubectl get pod -o jsonpath='{.spec.tolerations}' mypod

# 4. Check node lifecycle controller logs (in kube-controller-manager)
kubectl logs -n kube-system -l component=kube-controller-manager \
  | grep -i "node lifecycle\|eviction\|taint" | tail -50

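# 5. Check zone health metrics: if the zone is unhealthy, eviction is rate-limited.
# These are kube-controller-manager metrics, so query Prometheus or scrape the
# controller manager's secure port on a control-plane node, not the apiserver.
# ($TOKEN is a placeholder for a token authorized to read /metrics)
curl -sk -H "Authorization: Bearer $TOKEN" https://127.0.0.1:10257/metrics | grep node_collector_zone_health

# 6. If node is confirmed dead, force delete the Node object
kubectl delete node worker-1
# -> The pod garbage collector then removes the pods bound to the node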

kubectl drain hangs — PDB blocking eviction

# Symptom: kubectl drain outputs "evicting pod..." forever

# 1. Check which PDB is blocking
kubectl get pdb -A
kubectl describe pdb my-pdb -n my-namespace
# Look for: "Allowed disruptions: 0" — means no pod can be evicted

# 2. Check current disruptions
kubectl get pdb -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: disruptions allowed={.status.disruptionsAllowed}{"\n"}{end}'

# 3. Options:
# a) Wait for more replicas to become healthy (increases allowed disruptions)
# b) Temporarily patch the PDB (risky: understand the impact; assumes the PDB uses minAvailable)
kubectl patch pdb my-pdb -n my-namespace \
  --type=merge -p '{"spec":{"minAvailable":0}}'
#    Restore the original value after the drain:
kubectl patch pdb my-pdb -n my-namespace \
  --type=merge -p '{"spec":{"minAvailable":2}}'
# c) Delete the pod directly, which bypasses the PDB (last resort)
kubectl delete pod my-pod-xyz -n my-namespace

Pods stuck in Terminating after node deletion

# Symptom: pods show Terminating but node is deleted/gone

# Cause: kubelet on the (gone) node never sent the pod deletion confirmation.
# Kubernetes won't force-delete by default — DeletionTimestamp set but
# finalizers or volume detach is pending.

# 1. Check pod status
kubectl get pod my-pod -n my-ns -o yaml | grep -A5 "deletionTimestamp\|finalizers"

# 2. For stuck CSI volumes (VolumeAttachments are listed by PV and node, not by pod)
kubectl get volumeattachment | grep worker-1
# If stuck: manually delete the VolumeAttachment object

# 3. Force-delete the pod (use for nodes confirmed permanently gone)
kubectl delete pod my-pod -n my-ns --grace-period=0 --force

# 4. If the Node object still exists and the machine is confirmed powered off,
# prefer the out-of-service taint over per-pod force deletion:
kubectl taint node worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
# This triggers force deletion + volume detach cleanly

Graceful shutdown not working — pods don't terminate before poweroff

# Symptom: systemd shuts down node but pods don't terminate gracefully

# 1. Check kubelet has shutdown grace period configured
grep -i shutdown /var/lib/kubelet/config.yaml
# Expected: shutdownGracePeriod: 60s

# 2. Verify kubelet holds a systemd inhibitor lock
systemd-inhibit --list | grep kubelet
# Expected: kubelet  shutdown  delay  Graceful kubelet termination

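# 3. Check that logind's InhibitDelayMaxSec allows enough time
grep -r InhibitDelayMaxSec /etc/systemd/logind.conf /etc/systemd/logind.conf.d/ 2>/dev/null
# Default is 5s; it must be >= shutdownGracePeriod
# Fix (the drop-in file name is arbitrary):
sudo mkdir -p /etc/systemd/logind.conf.d
printf '[Login]\nInhibitDelayMaxSec=90\n' | sudo tee /etc/systemd/logind.conf.d/90-kubelet-shutdown.conf
sudo systemctl kill -s HUP systemd-logind   # ask logind to reload its configuration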

# 4. Test: trigger a test shutdown with time limit
# systemctl poweroff
# Observe: kubelet should log "graceful node shutdown activated"
journalctl -u kubelet | grep -i "shutdown\|graceful"

Node re-registers after deletion — ghost node

# Symptom: deleted a Node object, but it reappears minutes later

# Cause: the kubelet is still running on the node and re-registers itself
# (kubelet calls POST /api/v1/nodes if the Node object is missing)

# Resolution:
# 1. Stop the kubelet first
ssh worker-1 "sudo systemctl stop kubelet"

# 2. Then delete the Node object
kubectl delete node worker-1

# 3. If the node should be permanently removed, also terminate the VM
# If the VM will be reused, re-joining it needs a fresh kubelet config
ssh worker-1 "sudo rm /etc/kubernetes/kubelet.conf"
ssh worker-1 "sudo kubeadm reset -f"   # or equivalent bootstrap reset

Production Best Practices

Drain before disruptive maintenance, including kubelet upgrades. Containers keep running while the kubelet is stopped, but probes, container restarts, and status reporting pause until it returns, and a long enough outage marks the node NotReady and triggers eviction anyway.
Set PodDisruptionBudgets for all stateful workloads. PDBs are the only mechanism that makes kubectl drain respect availability guarantees. Without them, drain evicts every pod on the node without checking how many replicas remain.
Configure shutdownGracePeriod and increase InhibitDelayMaxSec — the default systemd inhibitor delay (5s) is shorter than any realistic application graceful shutdown. Set both to at least 90s for production nodes.
Use the out-of-service taint for crashed nodes — never force-delete pods from crashed nodes manually. The node.kubernetes.io/out-of-service:NoExecute taint does it correctly, including forcing CSI volume detachments.
Monitor zone health: alert on node_collector_zone_health < 50 (the gauge is a percentage). When a whole zone is dark, the reduced eviction rate may mislead you into thinking pods are healthy when they're stuck on dead nodes.
Label nodes with accurate topology labels — the zone-aware eviction rate limiting and TopologySpreadConstraints both depend on topology.kubernetes.io/zone being correct. Mis-labeled nodes spread pods incorrectly and can cause entire workloads to be on one physical zone.
Deploy Node Problem Detector — NPD catches conditions the kubelet doesn't (kernel deadlocks, readonly filesystems, OOM events in dmesg). Integrate NPD conditions with your alerting and CA-based self-healing.
Test node failure scenarios regularly — use systemctl stop kubelet on a test node to verify eviction timing matches your SLO expectations. Test that your PDBs actually protect availability during drain.
Set appropriate tolerationSeconds per workload class — stateless web services can use 30s; stateful databases should use 300–600s or longer to avoid unnecessary failover churn during transient node issues.
Automate certificate rotation — enable rotateCertificates: true in KubeletConfiguration. Expired node certificates prevent the kubelet from renewing its Lease, causing the node to appear NotReady even though workloads are healthy.