kubelet


The kubelet is the primary node agent. It runs on every worker node (and optionally on control plane nodes), typically as a systemd service, and is responsible for the entire pod lifecycle on that node: fetching the pod specs assigned to it, managing container execution through the container runtime, running health probes, mounting volumes, and reporting status back to the API server. Among the core components, kubelet is the one that works most directly against the node's operating system: it manages cgroups and mounts itself and, through the runtime, namespaces and processes.

Unlike control plane components which are stateless reconcilers, kubelet is stateful in the sense that it owns actual running processes and mount points. Its design reflects this: it combines an event-driven informer loop (like a controller) with imperative OS-level operations (unlike a controller).

Process Identity and Configuration

Process and ports

Runs as: systemd unit kubelet.service
API endpoint: :10250 (HTTPS) — used by apiserver, Metrics Server, kubectl exec/logs
Healthz: :10248 (HTTP, localhost only)
Read-only port: :10255 — disable in production (readOnlyPort: 0)
cAdvisor: embedded, exposes container metrics via /metrics/cadvisor
Node identity cert: system:node:<nodeName> in group system:nodes

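Configuration loading order (later sources override earlier ones)

KubeletConfiguration YAML file passed via the --config flag (the base configuration)
Drop-in config files in --config-dir (alpha in 1.28, beta in 1.30), processed in sorted filename order and merged over the base config
Command-line flags (highest precedence: a flag overrides the same setting from any file; most flags are deprecated)
Dynamic kubelet config (deprecated in 1.22, removed in 1.24) is no longer part of this chain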

Use KubeletConfiguration YAML for production; command-line flags are not suitable for complex configuration.
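
The way drop-in files layer over a base config can be pictured with a small merge function. This is a minimal sketch, not kubelet's implementation: the file names and values are hypothetical, and the rule that nested maps merge key by key while every other value is replaced is an assumption of the sketch.

```python
from copy import deepcopy

def merge_config(base: dict, overlay: dict) -> dict:
    """Return base with overlay merged on top (maps merge, other values are replaced)."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged

def effective_config(base: dict, drop_ins: dict) -> dict:
    """Apply drop-in files over the base config in sorted filename order."""
    result = base
    for name in sorted(drop_ins):
        result = merge_config(result, drop_ins[name])
    return result

# Hypothetical base file and two hypothetical drop-ins
base = {"maxPods": 110,
        "evictionHard": {"memory.available": "100Mi", "nodefs.available": "10%"}}
drop_ins = {
    "20-eviction.conf": {"evictionHard": {"memory.available": "200Mi"}},
    "10-pods.conf": {"maxPods": 150},
}
config = effective_config(base, drop_ins)
print(config)
```

Numeric prefixes on the drop-in file names make the processing order explicit, which is the usual convention for drop-in directories.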

Internal Architecture

The syncLoop — Main Reconciliation Loop

The heart of kubelet is syncLoop(), a perpetual loop that processes events from multiple channels and calls syncPod() for any pod that needs reconciliation. Unlike control-plane controllers, syncPod() performs real I/O: it calls the container runtime, mounts filesystems, and executes probes.

Event Sources That Trigger syncLoop

Source	What it delivers
API server watch (via Pod informer)	New pod assigned to this node, pod spec updates, pod deletion
PLEG (Pod Lifecycle Event Generator)	Container state changes detected by polling the CRI every 1s (relist interval)
Probe Manager	Liveness / readiness / startup probe results
Housekeeping timer	Periodic cleanup of dead containers, images, orphaned volumes
Static pod file watch	Changes to files in the static pod directory (`staticPodPath`, formerly `--pod-manifest-path`), e.g. control plane static pods
HTTP pod source	Pod specs fetched from a URL (`staticPodURL`, formerly `--manifest-url`); rarely used
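
To make the shape of the loop concrete, here is a minimal sketch in which every source pushes events onto one queue and the loop dispatches each event to a sync function. The source names, the tuple layout, and the `sync_pod` callback are illustrative assumptions; the real kubelet selects over several Go channels and never returns.

```python
import queue
from typing import Callable, List, Tuple

Event = Tuple[str, str]  # (source name, pod UID), a hypothetical layout

def sync_loop(events: "queue.Queue[Event]", sync_pod: Callable[[str, str], None]) -> int:
    """Drain the queue, calling sync_pod once per event; return how many were handled.

    The real loop runs forever. This one stops when the queue is empty so it can be
    executed as a demo.
    """
    handled = 0
    while True:
        try:
            source, pod_uid = events.get_nowait()
        except queue.Empty:
            return handled
        sync_pod(pod_uid, source)
        handled += 1

events: "queue.Queue[Event]" = queue.Queue()
events.put(("apiserver", "pod-a"))   # new pod assigned to this node
events.put(("pleg", "pod-a"))        # a container changed state
events.put(("probe", "pod-b"))       # a probe result arrived

log: List[Tuple[str, str]] = []
handled = sync_loop(events, lambda uid, src: log.append((uid, src)))
print(handled, log)
```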

syncPod() Steps

For each pod, syncPod() runs the following sequence (abbreviated):

Validate pod — Check pod spec for invalid configurations (e.g., duplicate container names, invalid image names).
Prepare pod directory — Create /var/lib/kubelet/pods/<uid>/ with subdirectories for volumes, plugins, containers.
Mount volumes — Ask the Volume Manager to attach and mount all required volumes. Block until volumes are ready or timeout.
Pull image secrets — Resolve imagePullSecrets and pass credentials to the container runtime.
Create sandbox (pause container) — Call CRI RunPodSandbox. This creates the network namespace, calls the CNI plugin to assign the pod IP, and starts the pause container that holds the namespace.
Start init containers — Run init containers serially. If any init container fails, restart it per restartPolicy. Block until all init containers succeed.
Start app containers — Call CRI CreateContainer + StartContainer for each container. Inject environment variables, mount secrets/configmaps as files, configure cgroups.
Register probes — Tell Probe Manager to begin running liveness, readiness, and startup probes.

PLEG — Pod Lifecycle Event Generator

PLEG is kubelet's mechanism for detecting container state changes without relying on container runtime events (which are not reliable across all runtimes). It works by polling the CRI for the list of all containers every 1 second (the relist interval) and diffing the result against its internal cache.

CRI.ListContainers() → Diff against cache → Emit lifecycle events → syncLoop() triggered

PLEG's container-level event types are ContainerStarted, ContainerDied, ContainerRemoved, and ContainerChanged. The kubelet source also defines a PodSync type.
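
The relist-and-diff step can be sketched as a comparison of two snapshots. This is a simplified illustration, not kubelet's code: the snapshot layout (`container ID -> state string`) and the state names `running` and `exited` are assumptions made for the example.

```python
def relist(cache: dict, current: dict) -> list:
    """Compare two {container_id: state} snapshots and return lifecycle events."""
    events = []
    for cid, state in current.items():
        if cache.get(cid) == state:
            continue  # nothing changed for this container
        if state == "running":
            events.append(("ContainerStarted", cid))
        elif state == "exited":
            events.append(("ContainerDied", cid))
        else:
            events.append(("ContainerChanged", cid))
    for cid in cache:
        if cid not in current:
            events.append(("ContainerRemoved", cid))
    return events

cache = {"c1": "running", "c2": "running", "c3": "exited"}
current = {"c1": "running", "c2": "exited", "c4": "running"}
events = relist(cache, current)
print(events)
```

After each relist the new snapshot replaces the cache, so the next diff only reports changes that happened in the following interval.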

PLEG is not healthy — a common node failure mode

If no relist has completed for more than 3 minutes (the PLEG health check threshold), kubelet marks PLEG as not healthy, which in turn sets the node Ready condition to False. This appears in kubelet logs as "PLEG is not healthy" and usually indicates a slow or unresponsive container runtime. Evented PLEG (alpha in Kubernetes 1.26, beta in 1.27) uses CRI event streaming instead of polling to reduce this overhead; it requires support from the container runtime. See CRI Interface for details.
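
The health check itself amounts to comparing the age of the last completed relist against the threshold. A minimal sketch, assuming timestamps in seconds and using the 3-minute value quoted above (the function and constant names are hypothetical):

```python
RELIST_THRESHOLD_SECONDS = 3 * 60  # threshold quoted in the text

def pleg_healthy(last_relist: float, now: float,
                 threshold: float = RELIST_THRESHOLD_SECONDS) -> bool:
    """Healthy while the last completed relist is younger than the threshold."""
    return (now - last_relist) < threshold

print(pleg_healthy(last_relist=1000.0, now=1001.0))   # relist one second ago
print(pleg_healthy(last_relist=1000.0, now=1200.0))   # relist 200 seconds ago
```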

KubeletConfiguration Reference

The fields below are the most important production-relevant settings; this is not the full configuration object:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

# --- Cluster connectivity ---
clusterDNS:
  - "10.96.0.10"                    # CoreDNS ClusterIP
clusterDomain: "cluster.local"

# --- TLS ---
tlsCertFile: /var/lib/kubelet/pki/kubelet.crt
tlsPrivateKeyFile: /var/lib/kubelet/pki/kubelet.key
rotateCertificates: true             # Auto-rotate node cert before expiry

# --- Authentication / Authorization ---
authentication:
  anonymous:
    enabled: false                   # Disable anonymous access to :10250
  webhook:
    enabled: true                    # Delegate authn to apiserver TokenReview
    cacheTTL: "2m"
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook                      # Delegate authz to apiserver SubjectAccessReview
  webhook:
    cacheAuthorizedTTL: "5m"
    cacheUnauthorizedTTL: "30s"

# --- Ports ---
port: 10250
readOnlyPort: 0                      # DISABLE — security requirement
healthzPort: 10248
healthzBindAddress: "127.0.0.1"

# --- Container runtime ---
containerRuntimeEndpoint: "unix:///run/containerd/containerd.sock"
cgroupDriver: "systemd"              # Must match containerd's cgroup driver
cgroupsPerQOS: true                  # Create cgroup hierarchy per QoS class

# --- Resource reservations ---
kubeReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "2Gi"
systemReserved:
  cpu: "200m"
  memory: "512Mi"
enforceNodeAllocatable:
  - pods
  - kube-reserved
  - system-reserved

# --- Eviction ---
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "5%"
  nodefs.inodesFree: "5%"
  imagefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "10%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "1m30s"
evictionMinimumReclaim:
  memory.available: "0Mi"
  nodefs.available: "500Mi"

# --- Pod limits ---
maxPods: 110
podPidsLimit: 4096                   # Limit PIDs per pod (enforced via the pids cgroup controller)

# --- Image GC ---
imageGCHighThresholdPercent: 85      # Start GC when disk > 85%
imageGCLowThresholdPercent: 80       # GC until disk < 80%
imageMinimumGCAge: "2m"

# --- Heartbeat ---
nodeStatusUpdateFrequency: "10s"     # How often kubelet computes node status
nodeStatusReportFrequency: "5m"      # How often status is posted even when unchanged
nodeLeaseDurationSeconds: 40         # Lease duration; kubelet renews it every 10s

# --- Logging ---
logging:
  format: json                       # json | text
  verbosity: 2                       # 0-10; 2 is normal, 4-5 for debug

# --- Feature gates (examples) ---
featureGates:
  EventedPLEG: true                  # Use CRI events instead of polling (alpha 1.26, beta 1.27)
  TopologyManager: true
  MemoryManager: true

Static Pods

Static pods are pod specs read directly from a directory on the node's filesystem (set via staticPodPath; kubeadm uses /etc/kubernetes/manifests/) rather than from the API server. kubelet creates them and registers a mirror pod for each with the API server, but the lifecycle is managed entirely by kubelet, not by a workload controller such as Deployment or ReplicaSet.

This is how kubeadm deploys the control plane: kube-apiserver, etcd, kube-scheduler, and kube-controller-manager all run as static pods on the control plane nodes. This means they are running even before the API server is healthy enough to accept requests — a deliberate bootstrap design.

# Static pod manifest directory
ls /etc/kubernetes/manifests/
# etcd.yaml
# kube-apiserver.yaml
# kube-controller-manager.yaml
# kube-scheduler.yaml

# Mirror pod visible in the API (read-only)
kubectl get pod kube-apiserver-controlplane -n kube-system
# NAME                          READY   STATUS    NODE
# kube-apiserver-controlplane   1/1     Running   controlplane

Mirror pods are read-only

Mirror pods (the API representation of static pods) carry the annotation kubernetes.io/config.mirror: <hash>. Deleting one with kubectl delete pod does not stop the static pod: the containers keep running on the node and kubelet recreates the mirror pod. To stop a static pod, remove or rename its manifest file in the manifests directory.

Health Probes

kubelet runs three types of health probes per container. Each probe is run independently by the Probe Manager in goroutines:

Probe	Purpose	Failure action	Initial delay
`livenessProbe`	Is the container still alive? If not, it has deadlocked or entered a broken state.	Container is killed and restarted (per `restartPolicy`)	`initialDelaySeconds` (default 0)
`readinessProbe`	Is the container ready to serve traffic?	Pod IP removed from Endpoints / EndpointSlices — no traffic sent to it	`initialDelaySeconds` (default 0)
`startupProbe`	Has the container finished its slow startup? Disables liveness/readiness until it succeeds.	Container killed if startup probe fails after `failureThreshold * periodSeconds`	Runs immediately; prevents premature liveness kills

Probe Mechanisms

Mechanism	Config	How kubelet runs it
`exec`	`exec.command: [cmd, args]`	kubelet calls CRI `ExecSync` inside the container. Exit code 0 = success.
`httpGet`	`httpGet.path`, `httpGet.port`	kubelet makes an HTTP GET from the node (not inside the container). 2xx/3xx = success.
`tcpSocket`	`tcpSocket.port`	kubelet opens a TCP connection to the container's IP:port. Connection established = success.
`grpc`	`grpc.port`, `grpc.service`	kubelet calls gRPC Health Checking Protocol. Status SERVING = success. (GA 1.27)

containers:
  - name: app
    image: myapp:v2
    # Startup probe: allow up to 5m for slow startup
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30          # 30 × 10s = 5 minutes max startup time
      periodSeconds: 10

    # Liveness: restart if deadlocked
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 0        # startupProbe takes care of initial delay
      periodSeconds: 15
      failureThreshold: 3           # 3 consecutive failures → restart
      timeoutSeconds: 5

    # Readiness: control traffic
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
      successThreshold: 1
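
How failureThreshold and successThreshold turn a stream of individual probe results into a reported state can be sketched with a small streak counter. This is an illustration of the counting rule only, not kubelet's prober; the class and method names are hypothetical.

```python
class ProbeState:
    """Flip the reported state only after enough consecutive identical results."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 1,
                 healthy: bool = True):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = healthy
        self._last = None   # most recent raw result
        self._streak = 0    # how many times in a row it has been seen

    def observe(self, ok: bool) -> bool:
        """Record one probe result and return the state reported afterwards."""
        if ok == self._last:
            self._streak += 1
        else:
            self._last, self._streak = ok, 1
        needed = self.success_threshold if ok else self.failure_threshold
        if self._streak >= needed:
            self.healthy = ok
        return self.healthy

# Liveness settings from the manifest above: 3 consecutive failures trigger a restart
liveness = ProbeState(failure_threshold=3)
results = [liveness.observe(ok) for ok in [True, False, False, True, False, False, False]]
print(results)

# Startup budget from the manifest above: failureThreshold x periodSeconds
startup_budget_seconds = 30 * 10
print(startup_budget_seconds)
```

A single success in the middle of the sequence resets the failure streak, which is why two isolated failures do not restart the container.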

Never use liveness probes for external dependencies

A liveness probe that checks an external database or service will restart all pods when that dependency is down, turning a partial outage into a total one. Liveness probes should only check the health of the container's own process. Use readiness probes for external dependency checks — a failed readiness probe takes the pod out of rotation without restarting it.

Volume Manager

The Volume Manager is a sub-component of kubelet responsible for the full volume lifecycle on the node: attaching, mounting, unmounting, and detaching volumes. It runs a continuous reconciliation loop comparing desired state (from pod specs) against actual state (mounts on the filesystem).

Pod spec has PVC → PVC bound to PV → Attach (attach/detach controller in kube-controller-manager; for CSI, via the external-attacher) → Mount to node path → Bind-mount into container

Key paths on the node:

/var/lib/kubelet/pods/<pod-uid>/volumes/ : per-pod volume mount points
/var/lib/kubelet/plugins/kubernetes.io/csi/<csi-driver-name>/<volume-hash>/globalmount : CSI staging path (layout used since 1.24)
/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<vol-name>/mount : final per-pod mount path that is bind-mounted into the container
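
The desired-versus-actual reconciliation described above reduces to two set differences. A minimal sketch, assuming each mount is identified by a hypothetical `(pod, volume)` pair; the real Volume Manager tracks considerably more state per volume.

```python
def reconcile(desired: set, actual: set):
    """Return (mounts to perform, mounts to remove) to make actual match desired."""
    to_mount = sorted(desired - actual)
    to_unmount = sorted(actual - desired)
    return to_mount, to_unmount

desired = {("pod-a", "data"), ("pod-b", "cache")}   # from pod specs
actual = {("pod-b", "cache"), ("pod-c", "logs")}    # found on the node
to_mount, to_unmount = reconcile(desired, actual)
print(to_mount, to_unmount)
```

Running the same function again after the operations succeed returns two empty lists, which is the steady state the loop converges to.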

Volume unmount blocks pod deletion

If a pod is in Terminating state indefinitely, it is often because the Volume Manager cannot unmount a volume (NFS hung, CSI driver unresponsive, or a finalizer on the PVC). The volume must be cleanly unmounted before kubelet will allow the pod directory to be removed. Use kubectl describe pod and kubectl get events to find the volume-related error.

Eviction Manager

When node resources are exhausted, the Eviction Manager proactively kills pods to reclaim resources — before the Linux OOM killer acts (which would be more disruptive and less predictable).

Eviction Signals

Signal	Description	Measured as
`memory.available`	Available memory on the node	`capacity - workingSet`
`nodefs.available`	Disk space on the root filesystem (kubelet data, logs)	% of total capacity
`nodefs.inodesFree`	Available inodes on root filesystem	% of total inodes
`imagefs.available`	Disk space on image filesystem (container images, writable layers)	% of total capacity
`pid.available`	Available PIDs on the node	% of max PID capacity

Eviction Thresholds: Hard vs Soft

Hard eviction

Immediate pod eviction when threshold is crossed. No grace period. Configured via evictionHard.

evictionHard:
  memory.available: "200Mi"
  nodefs.available: "5%"

Soft eviction

Eviction begins only after the threshold has been exceeded for the evictionSoftGracePeriod. Allows transient spikes without disruption. Configured via evictionSoft + evictionSoftGracePeriod.

evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
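
The difference between the two threshold types can be sketched as one decision function. The values mirror the examples above (200Mi hard, 500Mi soft, 90s grace); the function name and the way the "below soft threshold since" timestamp is passed in are assumptions of the sketch.

```python
from typing import Optional

def should_evict(available_mi: float, now: float,
                 below_soft_since: Optional[float],
                 hard_mi: float = 200, soft_mi: float = 500,
                 soft_grace_s: float = 90) -> bool:
    """Hard threshold evicts at once; soft threshold only after the grace period."""
    if available_mi < hard_mi:
        return True
    if available_mi < soft_mi and below_soft_since is not None:
        return (now - below_soft_since) >= soft_grace_s
    return False

print(should_evict(150, now=0, below_soft_since=None))   # under hard threshold
print(should_evict(400, now=30, below_soft_since=0))     # soft, 30s into grace period
print(should_evict(400, now=90, below_soft_since=0))     # soft, grace period elapsed
print(should_evict(600, now=0, below_soft_since=None))   # above both thresholds
```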

Eviction Pod Selection Order

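When evicting pods under memory pressure, kubelet ranks candidates by whether their usage exceeds their requests, then by pod priority, then by how far usage exceeds requests. QoS class is not used directly, but the ranking produces roughly this order:

BestEffort pods and Burstable pods whose usage exceeds requests: evicted first, ordered by priority and then by how far over their request they are (BestEffort pods request nothing, so any usage counts as over)
Burstable pods whose usage is below requests, and Guaranteed pods (requests = limits): last resort, ordered by priority, evicted only if nothing else can be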
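
The upstream node-pressure eviction documentation describes the ranking as three sort keys: whether usage exceeds requests, pod priority, and usage relative to requests. The QoS-flavoured ordering largely falls out of those keys. The sketch below applies them to hypothetical pods; the `PodUsage` type and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class PodUsage:
    name: str
    priority: int
    request_mi: int
    usage_mi: int

def eviction_order(pods):
    """Return pod names in the order they would be considered for eviction."""
    def key(p: PodUsage):
        exceeds = p.usage_mi > p.request_mi
        # Pods over their request first, then lower priority, then largest overage
        return (not exceeds, p.priority, -(p.usage_mi - p.request_mi))
    return [p.name for p in sorted(pods, key=key)]

pods = [
    PodUsage("guaranteed-db", priority=1000, request_mi=1024, usage_mi=900),
    PodUsage("besteffort-batch", priority=0, request_mi=0, usage_mi=300),
    PodUsage("burstable-api", priority=0, request_mi=256, usage_mi=700),
    PodUsage("burstable-idle", priority=0, request_mi=512, usage_mi=100),
]
order = eviction_order(pods)
print(order)
```

In this sample the Burstable pod that is 444Mi over its request is picked before the BestEffort pod using 300Mi, because both exceed their requests and share the same priority.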

QoS classes and eviction

QoS class is derived from resource requests/limits. Guaranteed: all containers have requests = limits for CPU and memory. Burstable: at least one container has a request or limit. BestEffort: no requests or limits. Setting requests and limits is therefore not just about scheduling — it directly affects eviction priority.
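
The derivation rules can be sketched as a small classifier. This is a simplified illustration: it compares quantities as plain strings (so "1Gi" and "1024Mi" would not match), and the container layout is a hypothetical dict rather than a real pod spec.

```python
def qos_class(containers: list) -> str:
    """containers: list of {"requests": {...}, "limits": {...}} keyed by 'cpu'/'memory'."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for res in ("cpu", "memory"):
            limit = limits.get(res)
            # A missing request defaults to the limit, so only the limit must be present
            request = requests.get(res, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{}]))
print(qos_class([{"requests": {"memory": "256Mi"}}]))
print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits": {"cpu": "500m", "memory": "1Gi"}}]))
```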

Image Garbage Collection

kubelet automatically removes unused container images when disk usage on the image filesystem exceeds the high threshold (default: 85%). It removes images that are not currently referenced by any running container, ordered by last-used time (oldest first).
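
The selection logic can be sketched as follows: once usage reaches the high threshold, compute how much must be freed to get down to the low threshold, then walk unused images from least to most recently used. The tuple layout and the sample sizes (in GiB) are assumptions for illustration, not kubelet's data model.

```python
def images_to_remove(images, usage, capacity, high_pct=85, low_pct=80):
    """images: list of (name, size, last_used_timestamp, in_use). Returns names to delete."""
    if usage * 100 < high_pct * capacity:
        return []  # below the high threshold: nothing to do
    to_free = usage - capacity * low_pct // 100
    removed, freed = [], 0
    for name, size, _last_used, in_use in sorted(images, key=lambda i: i[2]):
        if freed >= to_free:
            break
        if in_use:
            continue  # never remove an image a container still references
        removed.append(name)
        freed += size
    return removed

images = [
    ("old-a", 4, 100, False),
    ("busy", 20, 50, True),
    ("old-b", 5, 200, False),
    ("new", 6, 300, False),
    ("mid", 3, 250, False),
]
removed = images_to_remove(images, usage=90, capacity=100)
print(removed)
```

With 90 of 100 GiB used, 10 GiB must be freed to reach 80%, so the three oldest unused images go and the newest one is kept.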

# Check image disk usage
crictl images
du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/

# kubelet has no API endpoint that forces image GC.
# To reclaim space manually, prune unused images through the CRI:
crictl rmi --prune

# View current disk pressure
kubectl describe node worker-1 | grep DiskPressure

TLS Bootstrap and Certificate Rotation

When a new node joins the cluster, it does not yet have a signed certificate. The TLS bootstrap process:

kubelet starts with a bootstrap kubeconfig containing only a bootstrap token (time-limited, and authorized only to request a certificate).
kubelet uses the bootstrap token to authenticate and submits a CertificateSigningRequest to the API server.
The csrapproving controller (or a human operator) approves the CSR.
kubelet receives the signed certificate, saves it, and uses it for all future API communication.
With rotateCertificates: true, kubelet automatically generates a new key and CSR as the current client certificate approaches expiry (at a jittered point, roughly 70-90% of the way through its validity period).
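
The rotation point is simply a fraction of the certificate's lifetime added to its start date. A minimal sketch, with the fraction left as a parameter because the exact point is chosen by kubelet; the dates and the one-year lifetime are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def rotation_deadline(not_before: datetime, not_after: datetime,
                      fraction: float = 0.8) -> datetime:
    """Time at which the given fraction of the certificate's validity has elapsed."""
    lifetime = not_after - not_before
    return not_before + lifetime * fraction

not_before = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=365)
deadline = rotation_deadline(not_before, not_after)
print(deadline.isoformat())
```

The notBefore and notAfter inputs are the two dates printed by the openssl x509 -noout -dates command.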

# Check kubelet certificate expiry
openssl x509 -noout -dates \
  -in /var/lib/kubelet/pki/kubelet-client-current.pem

# Watch CSRs during node join
kubectl get csr -w

# Approve a pending CSR (if auto-approval is disabled)
kubectl certificate approve node-csr-abc123

kubelet API Endpoints

The kubelet exposes an HTTPS API on port 10250. This API is used by the Kubernetes API server to implement kubectl exec, kubectl logs, kubectl port-forward, and by Metrics Server to collect resource usage.

Endpoint	Used by	Description
`/exec/{namespace}/{pod}/{container}`	apiserver (`kubectl exec`)	WebSocket tunnel for interactive exec inside a container
`/logs/{namespace}/{pod}/{container}`	apiserver (`kubectl logs`)	Streams container log output; supports `follow`, `sinceTime`, `tailLines`
`/portForward/{namespace}/{pod}`	apiserver (`kubectl port-forward`)	SPDY/WebSocket tunnel to forward a port to a pod
`/attach/{namespace}/{pod}/{container}`	apiserver (`kubectl attach`)	Attach to a running container's stdin/stdout
`/metrics`	Prometheus	kubelet process metrics
`/metrics/cadvisor`	Prometheus	Container resource metrics (CPU, memory, network, disk per container)
`/metrics/resource`	Metrics Server	Summary resource metrics in the format expected by Metrics Server
`/stats/summary`	Metrics Server (legacy)	JSON summary of node and pod resource usage
`/healthz`	Probes, monitoring	kubelet liveness check

Prometheus Metrics

Metric	Type	Description
`kubelet_running_pods`	Gauge	Current number of running pods on this node
`kubelet_running_containers`	Gauge	Current number of running containers by state
`kubelet_pod_start_duration_seconds`	Histogram	Latency from pod being seen to all containers running
`kubelet_pod_worker_duration_seconds`	Histogram	Time spent in syncPod() per pod operation
`kubelet_cgroup_manager_duration_seconds`	Histogram	Latency of cgroup operations (high = cgroup v1 overhead)
`kubelet_pleg_relist_duration_seconds`	Histogram	Time to relist all containers (PLEG). Alert if p99 > 10s.
`kubelet_pleg_relist_interval_seconds`	Histogram	Actual interval between relists (should be ~1s)
`kubelet_volume_stats_available_bytes`	Gauge	Available bytes in a volume (per pod, PVC, namespace)
`kubelet_evictions`	Counter	Total pod evictions by eviction signal
`container_oom_events_total`	Counter	OOM kills from cAdvisor — critical signal

Alerting Rules

groups:
  - name: kubelet
    rules:
      - alert: KubeletPLEGNotHealthy
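        # kubelet_pleg_relist_duration_seconds is a histogram, so compute p99 from buckets.
        # Assumes the kubelet series carry a `node` label (often added by relabeling).
        expr: |
          kube_node_status_condition{condition="Ready",status="true"} == 0
          and on(node)
          histogram_quantile(0.99,
            sum by (node, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))
          ) > 10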
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PLEG relist p99 > 10s on {{ $labels.node }} — kubelet may mark node NotReady"

      - alert: KubeletPodStartLatencyHigh
        expr: |
          histogram_quantile(0.99,
            rate(kubelet_pod_start_duration_seconds_bucket[5m])
          ) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod start p99 > 60s on {{ $labels.instance }}"

      - alert: KubeletEvictingPods
        expr: increase(kubelet_evictions[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "kubelet on {{ $labels.instance }} is evicting pods — resource pressure"

      - alert: KubeletContainerOOMKilled
        expr: increase(container_oom_events_total[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Container OOM killed on {{ $labels.instance }}"

Troubleshooting Runbooks

Runbook 1: Node NotReady — kubelet not posting heartbeat

# 1. Check node condition
kubectl describe node worker-1 | grep -A5 Conditions

# 2. SSH to node, check kubelet service
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

# 3. Common causes:
# a) kubelet crashed — check exit code and restart count
systemctl is-failed kubelet && journalctl -u kubelet --since "5 minutes ago"

# b) containerd unresponsive — kubelet cannot relist
systemctl status containerd
crictl ps  # Should list containers quickly

# c) Certificate expired — kubelet cannot authenticate to apiserver
openssl x509 -noout -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem
# If expired: re-bootstrap the node or manually approve a new CSR

# d) Network partition — kubelet running but apiserver unreachable
curl -k https://<apiserver-host>:6443/healthz

# 4. Restart kubelet after fixing root cause
systemctl restart kubelet
watch kubectl get node worker-1

Runbook 2: Pod stuck in ContainerCreating

# 1. Get pod events
kubectl describe pod my-pod -n my-ns | tail -20

# 2. Common causes and fixes:
# a) Image pull failure
# Events: "Failed to pull image: ... 404"
kubectl get pod my-pod -n my-ns -o jsonpath='{.spec.containers[*].image}'
# Fix: check image name, tag, registry credentials

# b) Volume not mounting
# Events: "Unable to attach or mount volumes"
kubectl describe pvc my-pvc -n my-ns
# Fix: check PVC binding, CSI driver health, storage class

# c) CNI failure — pod IP not assigned
# Events: "Network plugin returned error"
journalctl -u kubelet | grep CNI
ls /etc/cni/net.d/
# Fix: reinstall/restart CNI plugin DaemonSet

# d) Init container failing
kubectl logs my-pod -n my-ns -c init-container-name --previous

# 3. Check kubelet logs on the target node
kubectl get pod my-pod -n my-ns -o jsonpath='{.spec.nodeName}'
# SSH to that node:
journalctl -u kubelet | grep "my-pod" | tail -30

Runbook 3: PLEG is not healthy / node flapping NotReady

# Symptom: kubelet log shows "PLEG is not healthy"
# "container runtime is down" or relist taking > 3m

# 1. Check PLEG relist duration
# On the node:
journalctl -u kubelet | grep "PLEG"

# Via Prometheus:
# histogram_quantile(0.99, rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) > 5

# 2. Check container runtime health
systemctl status containerd
crictl --timeout 5s ps  # --timeout is a global crictl flag; fails if containerd is slow

# 3. Check total container count (high count = slow relist)
crictl ps -a | wc -l
# If thousands of dead containers: container GC may be misconfigured
# kubelet flags: --maximum-dead-containers-per-container (default 1),
# --maximum-dead-containers (default -1, no global limit)

# 4. Enable Evented PLEG (alpha in 1.26, beta in 1.27; requires runtime support)
# In KubeletConfiguration:
# featureGates:
#   EventedPLEG: true

# 5. If containerd is truly hung:
systemctl restart containerd
# Monitor for impact on running pods

Runbook 4: Pod evicted — memory pressure

# 1. Find evicted pods
kubectl get pods --all-namespaces | grep Evicted
kubectl get events --all-namespaces | grep Evicted

# 2. Check node memory
kubectl top node worker-1
kubectl describe node worker-1 | grep -A5 "Allocated resources"

# 3. Find memory hogs
kubectl top pods --all-namespaces --sort-by=memory | head -20

# 4. List BestEffort pods (no requests or limits on any container; at risk first)
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.status.qosClass == "BestEffort")
  | .metadata.namespace + "/" + .metadata.name'

# 5. Set memory requests and limits on pods without them
# This elevates them from BestEffort to Burstable

# 6. Increase system-reserved / kube-reserved to give kubelet headroom
# Edit /etc/kubernetes/kubelet-config.yaml, restart kubelet

# 7. Delete evicted pods (they consume API objects but no resources)
kubectl get pods --all-namespaces -o json | jq '
  .items[] | select(.status.reason == "Evicted") |
  "kubectl delete pod -n " + .metadata.namespace + " " + .metadata.name
' -r | bash

Runbook 5: OOMKilled containers — container memory limit too low

# Symptom: container restarts with OOMKilled exit code (137)
kubectl describe pod my-pod | grep -A5 "Last State"
# Last State: Terminated
#   Reason: OOMKilled
#   Exit Code: 137

# 1. Check container memory usage before OOM
kubectl top pod my-pod --containers

# 2. Check OOM kill events
kubectl get events --field-selector reason=OOMKilling

# 3. View historical memory from cAdvisor (via Prometheus)
# container_memory_working_set_bytes{pod="my-pod"} — peak usage
# Compare against:
# kube_pod_container_resource_limits{resource="memory",pod="my-pod"}

# 4. Increase memory limit
kubectl set resources deployment my-deploy \
  --containers=app --limits=memory=512Mi

# 5. For JVM workloads: ensure -XX:MaxRAMPercentage is set
# The JVM does not automatically respect cgroup memory limits in older JDKs
# JDK 10+ (and 8u191+): -XX:+UseContainerSupport (default on)
# Explicitly: -XX:MaxRAMPercentage=75.0

Production Best Practices

Always use KubeletConfiguration file (--config), not command-line flags. Flags are deprecated and harder to audit. Store the config file under version control alongside your node provisioning code.
Disable the read-only port (readOnlyPort: 0) and enable webhook authentication/authorization. The kubelet API on :10250 can expose pod secrets and allow exec — it must be protected.
Set cgroupDriver: systemd and ensure containerd is configured with the same driver. A mismatch causes pod failures with cryptic cgroup errors. Always verify both sides on new nodes.
Enable rotateCertificates: true. Node certificates expire by default after 1 year. Without auto-rotation, expired certs cause nodes to go NotReady silently until the certificate is renewed.
Configure eviction thresholds with both hard and soft values. Hard eviction should be a safety net (e.g., 200Mi); soft eviction (500Mi, 90s grace) gives you time to respond before things get critical. Always set evictionMinimumReclaim to prevent thrashing.
Use podPidsLimit to prevent fork bombs from a single pod exhausting the node's PID namespace. A reasonable default is 4096 per pod; set lower for untrusted multi-tenant environments.
Set startupProbe for slow-starting containers rather than inflating initialDelaySeconds on liveness probes. startupProbe prevents premature kills during startup without adding unnecessary delay to all subsequent liveness checks.
Monitor kubelet_pleg_relist_duration_seconds p99. Values consistently above 5s indicate the container runtime is struggling and the node may be approaching a NotReady state before the heartbeat timeout fires.
Review imageGCHighThresholdPercent and imageGCLowThresholdPercent. The defaults (85% / 80%) are reasonable, but on nodes with small image filesystems they can sit too close to your disk alert and eviction thresholds. Consider 80% / 75% on such nodes for more margin.
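Test graceful node shutdown with shutdownGracePeriod and shutdownGracePeriodCriticalPods. When a node is shut down, kubelet needs time to terminate pods gracefully. Both settings default to 0, which leaves the feature inactive, so pods are killed immediately. Set shutdownGracePeriod to 60s or more for stateful workloads, and keep shutdownGracePeriodCriticalPods smaller than it.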