kubelet
The kubelet is the primary node agent. It runs on every worker node (and optionally on control plane nodes) as a systemd service and is responsible for the entire pod lifecycle on that node — from downloading the pod spec to managing container execution, running health probes, mounting volumes, and reporting status back to the API server. kubelet is the only Kubernetes component that has a privileged, direct relationship with the Linux kernel on its node.
Unlike control plane components, which are stateless reconcilers, kubelet is stateful in the sense that it owns actual running processes and mount points. Its design reflects this: it combines an event-driven informer loop (like a controller) with imperative OS-level operations (unlike a controller).
Process Identity and Configuration
Process and ports
- Runs as: systemd unit kubelet.service
- API endpoint: :10250 (HTTPS), used by apiserver, Metrics Server, kubectl exec/logs
- Healthz: :10248 (HTTP, localhost only)
- Read-only port: :10255; disable in production (readOnlyPort: 0)
- cAdvisor: embedded, exposes container metrics via /metrics/cadvisor
- Node identity cert: system:node:<nodeName> in group system:nodes
Configuration loading order
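- Feature gates set on the command line (lowest precedence)
- --config flag pointing to a KubeletConfiguration YAML file
- Drop-in config files in --config-dir (alpha in 1.28, beta in 1.30), merged over the base config in sorted filename order
- Remaining command-line flags (highest precedence, mostly deprecated)
- Dynamic kubelet config is no longer an option (deprecated, removed in 1.24)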
Use KubeletConfiguration YAML for production; command-line flags are not suitable for complex configuration.
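As a rough illustration of layered configuration, the sketch below merges a base config, drop-in files, and flags so that later layers win. It assumes the precedence described in the upstream Kubernetes documentation (config file, then drop-ins in sorted filename order, then command-line flags). The function name `merge_layers`, the file names, and the values are hypothetical, and kubelet's real merge semantics may differ in detail.

```python
from copy import deepcopy


def merge_layers(*layers):
    """Merge config dicts left to right; later layers win.

    Nested dicts are merged recursively. Any other value (including lists)
    is replaced wholesale by the later layer. This is a simplification.
    """
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_layers(merged[key], value)
            else:
                merged[key] = deepcopy(value)
    return merged


# Hypothetical base config, as loaded from the --config file.
base = {
    "maxPods": 110,
    "evictionHard": {"memory.available": "200Mi", "nodefs.available": "5%"},
}
# Hypothetical drop-in files; they are applied in sorted filename order.
drop_ins = {
    "20-eviction.conf": {"evictionHard": {"memory.available": "300Mi"}},
    "10-pods.conf": {"maxPods": 200},
}
# Hypothetical command-line override, applied last.
flags = {"maxPods": 250}

ordered_drop_ins = [drop_ins[name] for name in sorted(drop_ins)]
effective = merge_layers(base, *ordered_drop_ins, flags)
print(effective)
```

The drop-in only overrides the one eviction key it names; the untouched `nodefs.available` value from the base file survives the merge.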
Internal Architecture
The syncLoop — Main Reconciliation Loop
The heart of kubelet is syncLoop(), a perpetual loop that processes events from multiple channels and calls syncPod() for any pod that needs reconciliation. Unlike control-plane controllers, syncPod() performs real I/O: it calls the container runtime, mounts filesystems, and executes probes.
Event Sources That Trigger syncLoop
| Source | What it delivers |
|---|---|
| API server watch (via Pod informer) | New pod assigned to this node, pod spec updates, pod deletion |
| PLEG (Pod Lifecycle Event Generator) | Container state changes detected by polling the CRI every 1s (relist interval) |
| Probe Manager | Liveness / readiness / startup probe results |
| Housekeeping timer | Periodic cleanup of dead containers, images, orphaned volumes |
| Static pod file watch | Changes to files in --pod-manifest-path (e.g., control plane static pods) |
| HTTP pod source | Pod specs fetched from a URL (--manifest-url) — rare |
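The shape of this loop can be sketched as a single consumer draining events from several producers and reconciling the affected pod each time. This is only a toy model: kubelet itself is written in Go and selects over channels, the real loop never terminates, and the names `sync_loop`, `sync_pod`, and the source labels are hypothetical.

```python
import queue


def sync_loop(events, sync_pod, max_iterations=100):
    """Drain (source, pod_uid) events and reconcile each affected pod.

    The loop is bounded by max_iterations so the example terminates.
    """
    handled = []
    for _ in range(max_iterations):
        try:
            source, pod_uid = events.get_nowait()
        except queue.Empty:
            break
        sync_pod(pod_uid)
        handled.append((source, pod_uid))
    return handled


# Events from different sources land in one queue.
events = queue.Queue()
events.put(("api", "pod-a"))     # new pod assigned to the node
events.put(("pleg", "pod-b"))    # container state change detected by relist
events.put(("probe", "pod-a"))   # probe result changed

synced = []
handled = sync_loop(events, synced.append)
print(synced)
```

Note that the same pod can be synced more than once; each sync is expected to be a full, idempotent reconciliation rather than a reaction to one specific event.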
syncPod() Steps
For each pod, syncPod() runs the following sequence (abbreviated):
- Validate pod: Check pod spec for invalid configurations (e.g., duplicate container names, invalid image names).
- Prepare pod directory: Create /var/lib/kubelet/pods/<uid>/ with subdirectories for volumes, plugins, containers.
- Mount volumes: Ask the Volume Manager to attach and mount all required volumes. Block until volumes are ready or timeout.
- Pull image secrets: Resolve imagePullSecrets and pass credentials to the container runtime.
- Create sandbox (pause container): Call CRI RunPodSandbox. This creates the network namespace, calls the CNI plugin to assign the pod IP, and starts the pause container that holds the namespace.
- Start init containers: Run init containers serially. If any init container fails, restart it per restartPolicy. Block until all init containers succeed.
- Start app containers: Call CRI CreateContainer + StartContainer for each container. Inject environment variables, mount secrets/configmaps as files, configure cgroups.
- Register probes: Tell Probe Manager to begin running liveness, readiness, and startup probes.
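The ordering matters because each step depends on the previous one. A minimal sketch of that idea, assuming a simplified model in which a sync stops at the first failing step and a later sync retries: the step names and the helper functions `ok` and `cni_fails` are hypothetical stand-ins, not kubelet functions.

```python
def sync_pod(pod, steps):
    """Run steps in order; stop at the first failure so a later sync can retry."""
    completed = []
    for name, step in steps:
        try:
            step(pod)
        except RuntimeError as err:
            return completed, f"{name}: {err}"
        completed.append(name)
    return completed, None


def ok(pod):
    """Stand-in for a step that succeeds."""


def cni_fails(pod):
    """Stand-in for a sandbox creation that fails because the CNI is not ready."""
    raise RuntimeError("CNI plugin not ready")


steps = [
    ("validate", ok),
    ("prepare_dirs", ok),
    ("mount_volumes", ok),
    ("pull_secrets", ok),
    ("create_sandbox", cni_fails),
    ("init_containers", ok),
    ("app_containers", ok),
    ("register_probes", ok),
]

completed, error = sync_pod({"name": "my-pod"}, steps)
print(completed)
print(error)
```

In this model a CNI failure leaves volumes mounted but no containers started, which matches what a pod stuck in ContainerCreating typically looks like from the outside.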
PLEG — Pod Lifecycle Event Generator
PLEG is kubelet's mechanism for detecting container state changes without relying on container runtime events (which are not reliable across all runtimes). It works by polling the CRI for the list of all containers every 1 second (the relist interval) and diffing the result against its internal cache.
PLEG emits four event types: ContainerStarted, ContainerDied, ContainerRemoved, ContainerChanged.
If a relist takes longer than 3 minutes, kubelet reports PLEG as unhealthy and sets the node's Ready condition to False. This is seen in kubelet logs as "PLEG is not healthy" and often indicates a slow or unresponsive container runtime. Kubernetes 1.26 introduced Evented PLEG (alpha; beta since 1.27), which uses CRI event streaming instead of polling to reduce this overhead. See CRI Interface for details.
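The relist-and-diff idea can be sketched as a comparison of two snapshots of container state. This is a simplified model, not kubelet's code: the state strings "running" and "exited" and the function name `relist_diff` are assumptions made for the example.

```python
def relist_diff(old, new):
    """Compare two {container_id: state} snapshots and return PLEG-style events."""
    events = []
    for cid in sorted(set(old) | set(new)):
        before, after = old.get(cid), new.get(cid)
        if before == after:
            continue  # nothing changed for this container
        if after == "running":
            events.append(("ContainerStarted", cid))
        elif after == "exited":
            events.append(("ContainerDied", cid))
        elif after is None:
            events.append(("ContainerRemoved", cid))
        else:
            events.append(("ContainerChanged", cid))
    return events


previous = {"a": "running", "b": "running", "c": "exited"}
current = {"a": "running", "b": "exited", "d": "running"}

events = relist_diff(previous, current)
print(events)
```

Because every relist walks all containers, the cost grows with the number of containers on the node, which is why a slow runtime or a large backlog of dead containers shows up as long relist durations.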
KubeletConfiguration Reference
The fields below are the most important production-relevant settings of the configuration object:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# --- Cluster connectivity ---
clusterDNS:
  - "10.96.0.10"                   # CoreDNS ClusterIP
clusterDomain: "cluster.local"
# --- TLS ---
tlsCertFile: /var/lib/kubelet/pki/kubelet.crt
tlsPrivateKeyFile: /var/lib/kubelet/pki/kubelet.key
rotateCertificates: true           # Auto-rotate node cert before expiry
# --- Authentication / Authorization ---
authentication:
  anonymous:
    enabled: false                 # Disable anonymous access to :10250
  webhook:
    enabled: true                  # Delegate authn to apiserver TokenReview
    cacheTTL: "2m"
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook                    # Delegate authz to apiserver SubjectAccessReview
  webhook:
    cacheAuthorizedTTL: "5m"
    cacheUnauthorizedTTL: "30s"
# --- Ports ---
port: 10250
readOnlyPort: 0                    # DISABLE: security requirement
healthzPort: 10248
healthzBindAddress: "127.0.0.1"
# --- Container runtime ---
containerRuntimeEndpoint: "unix:///run/containerd/containerd.sock"
cgroupDriver: "systemd"            # Must match containerd's cgroup driver
cgroupsPerQOS: true                # Create cgroup hierarchy per QoS class
# --- Resource reservations ---
kubeReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "2Gi"
systemReserved:
  cpu: "200m"
  memory: "512Mi"
enforceNodeAllocatable:
  - pods
  - kube-reserved                  # Also requires kubeReservedCgroup to be set
  - system-reserved                # Also requires systemReservedCgroup to be set
# --- Eviction ---
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "5%"
  nodefs.inodesFree: "5%"
  imagefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "10%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "1m30s"
evictionMinimumReclaim:
  memory.available: "0Mi"
  nodefs.available: "500Mi"
# --- Pod limits ---
maxPods: 110
podPidsLimit: 4096                 # Limit PIDs per pod
# --- Image GC ---
imageGCHighThresholdPercent: 85    # Start GC when disk > 85%
imageGCLowThresholdPercent: 80     # GC until disk < 80%
imageMinimumGCAge: "2m"
# --- Heartbeat ---
nodeStatusUpdateFrequency: "10s"   # How often kubelet computes node status
nodeStatusReportFrequency: "5m"    # How often status is posted when nothing has changed
nodeLeaseDurationSeconds: 40       # Lease duration; heartbeat every 10s
# --- Logging ---
logging:
  format: json                     # json | text
  verbosity: 2                     # 0-10; 2 is normal, 4-5 for debug
# --- Feature gates (examples) ---
featureGates:
  EventedPLEG: true                # Use CRI events instead of polling (alpha in 1.26, beta since 1.27)
  TopologyManager: true
  MemoryManager: true
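The reservation and eviction settings together determine how much memory the scheduler may hand out to pods. Upstream documentation describes this as Allocatable = Capacity - kubeReserved - systemReserved - hard eviction threshold. The worked example below applies that formula to the memory values from the configuration above; the 16Gi node capacity and the helper names are assumptions made for illustration, and `parse_quantity` handles only a small subset of Kubernetes quantity syntax.

```python
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_quantity(text):
    """Parse a small subset of Kubernetes quantities (Ki/Mi/Gi or plain bytes)."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def allocatable_memory(capacity, kube_reserved, system_reserved, eviction_hard):
    """Allocatable = capacity - kubeReserved - systemReserved - evictionHard."""
    reserved = sum(
        parse_quantity(q) for q in (kube_reserved, system_reserved, eviction_hard)
    )
    return parse_quantity(capacity) - reserved


# Hypothetical 16Gi node with the reservations from the config above.
alloc = allocatable_memory("16Gi", "512Mi", "512Mi", "200Mi")
print(alloc // UNITS["Mi"], "Mi")  # prints: 15160 Mi
```

On small nodes these fixed reservations take a proportionally larger share, so it is worth recomputing the numbers per instance type rather than copying one config everywhere.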
Static Pods
Static pods are pod specs read directly from a directory on the node's filesystem (kubeadm default: /etc/kubernetes/manifests/) rather than from the API server. kubelet creates them and reports mirror pod objects to the API server, but the lifecycle is managed entirely by kubelet, not by the Deployment or ReplicaSet controllers.
This is how kubeadm deploys the control plane: kube-apiserver, etcd, kube-scheduler, and kube-controller-manager all run as static pods on the control plane nodes. This means they are running even before the API server is healthy enough to accept requests — a deliberate bootstrap design.
# Static pod manifest directory
ls /etc/kubernetes/manifests/
# etcd.yaml
# kube-apiserver.yaml
# kube-controller-manager.yaml
# kube-scheduler.yaml
# Mirror pod visible in the API (read-only)
kubectl get pod kube-apiserver-controlplane -n kube-system
# NAME READY STATUS NODE
# kube-apiserver-controlplane 1/1 Running controlplane
Mirror pods carry the annotation kubernetes.io/config.mirror: <hash>. Deleting one with kubectl delete pod does not stop the static pod: kubelet recreates the mirror object and the containers keep running. To stop a static pod, remove or rename its manifest file in the manifests directory.
Health Probes
kubelet runs three types of health probes per container. Each probe is run independently by the Probe Manager in goroutines:
| Probe | Purpose | Failure action | Initial delay |
|---|---|---|---|
| livenessProbe | Is the container still alive? If not, it has deadlocked or entered a broken state. | Container is killed and restarted (per restartPolicy) | initialDelaySeconds (default 0) |
| readinessProbe | Is the container ready to serve traffic? | Pod IP removed from Endpoints / EndpointSlices; no traffic sent to it | initialDelaySeconds (default 0) |
| startupProbe | Has the container finished its slow startup? Disables liveness/readiness until it succeeds. | Container killed if startup probe fails after failureThreshold * periodSeconds | Runs immediately; prevents premature liveness kills |
Probe Mechanisms
| Mechanism | Config | How kubelet runs it |
|---|---|---|
| exec | exec.command: [cmd, args] | kubelet calls CRI ExecSync inside the container. Exit code 0 = success. |
| httpGet | httpGet.path, httpGet.port | kubelet makes an HTTP GET from the node (not inside the container). 2xx/3xx = success. |
| tcpSocket | tcpSocket.port | kubelet opens a TCP connection to the container's IP:port. Connection established = success. |
| grpc | grpc.port, grpc.service | kubelet calls gRPC Health Checking Protocol. Status SERVING = success. (GA 1.27) |
containers:
  - name: app
    image: myapp:v2
    # Startup probe: allow up to 5m for slow startup
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30     # 30 × 10s = 5 minutes max startup time
      periodSeconds: 10
    # Liveness: restart if deadlocked
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 0   # startupProbe takes care of initial delay
      periodSeconds: 15
      failureThreshold: 3      # 3 consecutive failures → restart
      timeoutSeconds: 5
    # Readiness: control traffic
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
      successThreshold: 1
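The threshold fields only take effect on consecutive results, which is easy to misread. The sketch below models that behaviour under simplified assumptions: a probe flips to unhealthy after failureThreshold consecutive failures and back to healthy after successThreshold consecutive successes. The class `ProbeState` and the function `startup_budget_seconds` are hypothetical names, not kubelet types.

```python
class ProbeState:
    """Track consecutive probe results against failure/success thresholds."""

    def __init__(self, failure_threshold=3, success_threshold=1, healthy=True):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = healthy
        self._streak = 0     # length of the current run of identical results
        self._last = None    # most recent raw probe result

    def record(self, success):
        """Record one probe result and return the resulting healthy flag."""
        self._streak = self._streak + 1 if success == self._last else 1
        self._last = success
        if success and self._streak >= self.success_threshold:
            self.healthy = True
        elif not success and self._streak >= self.failure_threshold:
            self.healthy = False
        return self.healthy


def startup_budget_seconds(failure_threshold, period_seconds):
    """Upper bound on startup time allowed by a startupProbe."""
    return failure_threshold * period_seconds


liveness = ProbeState(failure_threshold=3)
results = [False, False, True, False, False, False]
history = [liveness.record(r) for r in results]
print(history)
print(startup_budget_seconds(30, 10))  # prints: 300
```

A single success in the middle resets the failure streak, so two failures followed by a success do not count toward the threshold.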
Volume Manager
The Volume Manager is a sub-component of kubelet responsible for the full volume lifecycle on the node: attaching, mounting, unmounting, and detaching volumes. It runs a continuous reconciliation loop comparing desired state (from pod specs) against actual state (mounts on the filesystem).
Key paths on the node:
- /var/lib/kubelet/pods/<pod-uid>/volumes/: per-pod volume mount points
- /var/lib/kubelet/plugins/<csi-driver-name>/: CSI staging paths (globalMount)
- /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<vol-name>/mount: final bind-mount path
If a pod is stuck in Terminating state indefinitely, it is often because the Volume Manager cannot unmount a volume (NFS hung, CSI driver unresponsive, or a finalizer on the PVC). The volume must be cleanly unmounted before kubelet will allow the pod directory to be removed. Use kubectl describe pod and kubectl get events to find the volume-related error.
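The desired-versus-actual comparison at the core of the Volume Manager can be sketched with two sets. This is a minimal model, assuming volumes are identified by (pod UID, volume name) pairs; the function name `reconcile_volumes` and the sample identifiers are hypothetical.

```python
def reconcile_volumes(desired, actual):
    """Return (to_mount, to_unmount) for sets of (pod_uid, volume_name) pairs."""
    to_mount = sorted(desired - actual)     # wanted by a pod spec, not yet mounted
    to_unmount = sorted(actual - desired)   # still mounted, no longer wanted
    return to_mount, to_unmount


# Desired state comes from pod specs; actual state from mounts on the node.
desired = {("pod-1", "data"), ("pod-2", "config")}
actual = {("pod-1", "data"), ("pod-3", "scratch")}

to_mount, to_unmount = reconcile_volumes(desired, actual)
print(to_mount)
print(to_unmount)
```

A pod that has been deleted but whose volume cannot be unmounted stays in the unmount list on every pass, which is the situation behind most pods that hang in Terminating.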
Eviction Manager
When node resources are exhausted, the Eviction Manager proactively kills pods to reclaim resources — before the Linux OOM killer acts (which would be more disruptive and less predictable).
Eviction Signals
| Signal | Description | Measured as |
|---|---|---|
| memory.available | Available memory on the node | capacity - workingSet |
| nodefs.available | Disk space on the root filesystem (kubelet data, logs) | % of total capacity |
| nodefs.inodesFree | Available inodes on root filesystem | % of total inodes |
| imagefs.available | Disk space on image filesystem (container images, writable layers) | % of total capacity |
| pid.available | Available PIDs on the node | % of max PID capacity |
Eviction Thresholds: Hard vs Soft
Hard eviction
Immediate pod eviction when threshold is crossed. No grace period. Configured via evictionHard.
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "5%"
Soft eviction
Eviction begins only after the threshold has been exceeded for the evictionSoftGracePeriod. Allows transient spikes without disruption. Configured via evictionSoft + evictionSoftGracePeriod.
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
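The difference between the two threshold types comes down to whether time is part of the decision. The sketch below models that with the values used above (200Mi hard, 500Mi soft, 90s grace). It is a simplified illustration: the names `hard_breached` and `SoftThreshold` are hypothetical, and timestamps are plain seconds supplied by the caller.

```python
Mi = 1024**2


def hard_breached(available_bytes, threshold_bytes):
    """Hard thresholds trigger as soon as the signal drops below the limit."""
    return available_bytes < threshold_bytes


class SoftThreshold:
    """Trigger only after the signal stays below the limit for the grace period."""

    def __init__(self, threshold_bytes, grace_seconds):
        self.threshold_bytes = threshold_bytes
        self.grace_seconds = grace_seconds
        self._breached_since = None

    def observe(self, available_bytes, now):
        if available_bytes >= self.threshold_bytes:
            self._breached_since = None   # recovered: reset the timer
            return False
        if self._breached_since is None:
            self._breached_since = now    # first observation below the threshold
        return now - self._breached_since >= self.grace_seconds


soft = SoftThreshold(500 * Mi, grace_seconds=90)
samples = [(450 * Mi, 0), (450 * Mi, 60), (600 * Mi, 70), (400 * Mi, 80), (400 * Mi, 170)]
decisions = [soft.observe(avail, t) for avail, t in samples]
print(decisions)
print(hard_breached(150 * Mi, 200 * Mi))
```

The brief recovery at t=70 resets the timer, so eviction only starts once memory has been continuously low from t=80 to t=170.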
Eviction Pod Selection Order
When evicting pods for resource reclamation, kubelet selects victims in this order (for memory pressure):
- BestEffort pods (no requests/limits set) — evicted first, starting with those consuming the most memory above 0
- Burstable pods where actual usage exceeds requests — sorted by how far over their request they are
- Guaranteed pods (requests = limits) — last resort, only if nothing else can be evicted
Guaranteed: all containers have requests = limits for CPU and memory. Burstable: at least one container has a request or limit. BestEffort: no requests or limits. Setting requests and limits is therefore not just about scheduling — it directly affects eviction priority.
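The selection order can be approximated with a sort key. The sketch below follows the three tiers listed above and additionally assumes that Burstable pods still within their requests rank after those exceeding them; that extra tier, the `Pod` dataclass, and the sample pods are assumptions made for illustration. The real kubelet ranking also considers pod priority, which is left out here.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    qos: str             # "BestEffort", "Burstable", or "Guaranteed"
    usage_mi: int        # current memory usage in Mi
    request_mi: int = 0  # memory request in Mi (0 for BestEffort)


def eviction_order(pods):
    """Return pod names in the order they would be considered for eviction."""

    def key(pod):
        overage = pod.usage_mi - pod.request_mi
        if pod.qos == "BestEffort":
            tier = 0
        elif pod.qos == "Burstable" and overage > 0:
            tier = 1
        elif pod.qos == "Burstable":
            tier = 2  # assumption: within requests, so after the over-request pods
        else:
            tier = 3
        return (tier, -overage)  # within a tier, largest overage first

    return [pod.name for pod in sorted(pods, key=key)]


pods = [
    Pod("db", "Guaranteed", usage_mi=1000, request_mi=2048),
    Pod("api", "Burstable", usage_mi=900, request_mi=512),
    Pod("batch", "BestEffort", usage_mi=300),
    Pod("web", "Burstable", usage_mi=200, request_mi=256),
    Pod("cache", "Burstable", usage_mi=700, request_mi=256),
    Pod("scratch", "BestEffort", usage_mi=800),
]

order = eviction_order(pods)
print(order)
```

The example shows why accurate requests matter: `cache` uses less memory than `api` in absolute terms but is ranked ahead of it because it exceeds its request by more.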
Image Garbage Collection
kubelet automatically removes unused container images when disk usage on the image filesystem exceeds the high threshold (default: 85%). It removes images that are not currently referenced by any running container, ordered by last-used time (oldest first).
# Check image disk usage
crictl images
du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/
# Authenticated request to the kubelet API (requires auth)
# Note: this returns the pprof debug index; it does not trigger image GC
curl -sk --cert /var/lib/kubelet/pki/kubelet-client-current.pem \
  --key /var/lib/kubelet/pki/kubelet-client-current.pem \
  https://localhost:10250/debug/pprof/ 2>/dev/null
# Manually remove unused images if disk must be reclaimed now
crictl rmi --prune
# View current disk pressure
kubectl describe node worker-1 | grep DiskPressure
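The high/low threshold behaviour can be sketched as follows: nothing happens below the high threshold, and once it is crossed, unused images are removed oldest first until usage falls to the low threshold. This is a simplified model that ignores imageMinimumGCAge; the function name `images_to_remove`, the image names, and the sizes (in GiB for readability) are hypothetical.

```python
def images_to_remove(images, in_use, used, capacity, high_pct=85, low_pct=80):
    """Pick images to delete.

    images: list of (name, size, last_used) tuples; sizes share a unit with
    used and capacity. Returns image names in deletion order.
    """
    if used * 100 <= capacity * high_pct:
        return []  # below the high threshold: GC does not start
    target = capacity * low_pct // 100
    removed = []
    for name, size, _last_used in sorted(images, key=lambda img: img[2]):
        if used <= target:
            break
        if name in in_use:
            continue  # referenced by a container: never removed
        removed.append(name)
        used -= size
    return removed


images = [
    ("old-a", 6, 100),
    ("busy", 20, 50),
    ("old-b", 5, 200),
    ("new", 10, 900),
]

removed = images_to_remove(images, in_use={"busy"}, used=90, capacity=100)
print(removed)
```

Because in-use images are skipped, a node whose disk is filled by images that are all referenced by containers cannot be helped by image GC and will move on to disk-pressure eviction instead.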
TLS Bootstrap and Certificate Rotation
When a new node joins the cluster, it does not yet have a signed certificate. The TLS bootstrap process:
- kubelet starts with a bootstrap kubeconfig containing only a short-lived bootstrap token.
- kubelet uses the bootstrap token to authenticate and submits a CertificateSigningRequest to the API server.
- The csrapproving controller (or a human operator) approves the CSR.
- kubelet receives the signed certificate, saves it, and uses it for all future API communication.
- With rotateCertificates: true, kubelet automatically generates a new key and CSR before the current certificate expires (at roughly 80% of the validity period, with some jitter).
# Check kubelet certificate expiry
openssl x509 -noout -dates \
-in /var/lib/kubelet/pki/kubelet-client-current.pem
# Watch CSRs during node join
kubectl get csr -w
# Approve a pending CSR (if auto-approval is disabled)
kubectl certificate approve node-csr-abc123
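To estimate when a node will attempt rotation, the 80% figure can be applied to the certificate's validity window. The sketch below does that arithmetic for a hypothetical one-year certificate; the function name `rotation_deadline` and the dates are assumptions, and the real kubelet adds jitter, so treat the result as approximate.

```python
from datetime import datetime, timedelta, timezone


def rotation_deadline(not_before, not_after, fraction=0.8):
    """Return the time at which `fraction` of the validity period has elapsed."""
    return not_before + (not_after - not_before) * fraction


# Hypothetical certificate valid for exactly 365 days.
not_before = datetime(2025, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2026, 1, 1, tzinfo=timezone.utc)

deadline = rotation_deadline(not_before, not_after)
print(deadline.isoformat())
print((not_after - deadline).days, "days of validity left at rotation")
```

The notBefore and notAfter values for a real node can be read from the openssl output shown above and substituted into the same calculation.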
kubelet API Endpoints
The kubelet exposes an HTTPS API on port 10250. This API is used by the Kubernetes API server to implement kubectl exec, kubectl logs, kubectl port-forward, and by Metrics Server to collect resource usage.
| Endpoint | Used by | Description |
|---|---|---|
| /exec/{namespace}/{pod}/{container} | apiserver (kubectl exec) | WebSocket tunnel for interactive exec inside a container |
| /logs/{namespace}/{pod}/{container} | apiserver (kubectl logs) | Streams container log output; supports follow, sinceTime, tailLines |
| /portForward/{namespace}/{pod} | apiserver (kubectl port-forward) | SPDY/WebSocket tunnel to forward a port to a pod |
| /attach/{namespace}/{pod}/{container} | apiserver (kubectl attach) | Attach to a running container's stdin/stdout |
| /metrics | Prometheus | kubelet process metrics |
| /metrics/cadvisor | Prometheus, Metrics Server | Container resource metrics (CPU, memory, network, disk per container) |
| /metrics/resource | Metrics Server | Summary resource metrics in the format expected by Metrics Server |
| /stats/summary | Metrics Server (legacy) | JSON summary of node and pod resource usage |
| /healthz | Probes, monitoring | kubelet liveness check |
Prometheus Metrics
| Metric | Type | Description |
|---|---|---|
| kubelet_running_pods | Gauge | Current number of running pods on this node |
| kubelet_running_containers | Gauge | Current number of running containers by state |
| kubelet_pod_start_duration_seconds | Histogram | Latency from pod being seen to all containers running |
| kubelet_pod_worker_duration_seconds | Histogram | Time spent in syncPod() per pod operation |
| kubelet_cgroup_manager_duration_seconds | Histogram | Latency of cgroup operations (high = cgroup v1 overhead) |
| kubelet_pleg_relist_duration_seconds | Histogram | Time to relist all containers (PLEG). Alert if p99 > 10s. |
| kubelet_pleg_relist_interval_seconds | Histogram | Actual interval between relists (should be ~1s) |
| kubelet_volume_stats_available_bytes | Gauge | Available bytes in a volume (per pod, PVC, namespace) |
| kubelet_evictions | Counter | Total pod evictions by eviction signal |
| container_oom_events_total | Counter | OOM kills from cAdvisor; critical signal |
Alerting Rules
groups:
  - name: kubelet
    rules:
      - alert: KubeletPLEGNotHealthy
        # The relist metric is a histogram, so p99 comes from histogram_quantile.
        # Assumes kubelet metrics carry a node label matching kube-state-metrics.
        expr: |
          histogram_quantile(0.99,
            sum by (node, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))
          ) > 10
          and on(node)
          kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PLEG relist p99 > 10s on {{ $labels.node }}: kubelet may mark node NotReady"
      - alert: KubeletPodStartLatencyHigh
        expr: |
          histogram_quantile(0.99,
            rate(kubelet_pod_start_duration_seconds_bucket[5m])
          ) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod start p99 > 60s on {{ $labels.instance }}"
      - alert: KubeletEvictingPods
        expr: increase(kubelet_evictions[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "kubelet on {{ $labels.instance }} is evicting pods: resource pressure"
      - alert: KubeletContainerOOMKilled
        expr: increase(container_oom_events_total[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Container OOM killed on {{ $labels.instance }}"
Troubleshooting Runbooks
Runbook 1: Node NotReady — kubelet not posting heartbeat
# 1. Check node condition
kubectl describe node worker-1 | grep -A5 Conditions
# 2. SSH to node, check kubelet service
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager
# 3. Common causes:
# a) kubelet crashed — check exit code and restart count
systemctl is-failed kubelet && journalctl -u kubelet --since "5 minutes ago"
# b) containerd unresponsive — kubelet cannot relist
systemctl status containerd
crictl ps # Should list containers quickly
# c) Certificate expired — kubelet cannot authenticate to apiserver
openssl x509 -noout -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem
# If expired: re-bootstrap the node or manually approve a new CSR
# d) Network partition — kubelet running but apiserver unreachable
curl -k https://<apiserver-host>:6443/healthz
# 4. Restart kubelet after fixing root cause
systemctl restart kubelet
watch kubectl get node worker-1
Runbook 2: Pod stuck in ContainerCreating
# 1. Get pod events
kubectl describe pod my-pod -n my-ns | tail -20
# 2. Common causes and fixes:
# a) Image pull failure
# Events: "Failed to pull image: ... 404"
kubectl get pod my-pod -o jsonpath='{.spec.containers[*].image}'
# Fix: check image name, tag, registry credentials
# b) Volume not mounting
# Events: "Unable to attach or mount volumes"
kubectl describe pvc my-pvc -n my-ns
# Fix: check PVC binding, CSI driver health, storage class
# c) CNI failure — pod IP not assigned
# Events: "Network plugin returned error"
journalctl -u kubelet | grep CNI
ls /etc/cni/net.d/
# Fix: reinstall/restart CNI plugin DaemonSet
# d) Init container failing
kubectl logs my-pod -c init-container-name --previous
# 3. Check kubelet logs on the target node
kubectl get pod my-pod -o jsonpath='{.spec.nodeName}'
# SSH to that node:
journalctl -u kubelet | grep "my-pod" | tail -30
Runbook 3: PLEG is not healthy / node flapping NotReady
# Symptom: kubelet log shows "PLEG is not healthy"
# "container runtime is down" or relist taking > 3m
# 1. Check PLEG relist duration
# On the node:
journalctl -u kubelet | grep "PLEG"
# Via Prometheus:
# histogram_quantile(0.99, rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) > 5
# 2. Check container runtime health
systemctl status containerd
crictl --timeout 5s ps    # --timeout is a global crictl option; times out if containerd is slow
# 3. Check total container count (high count = slow relist)
crictl ps -a | wc -l
# If thousands of dead containers: container GC may be misconfigured
# kubelet flags: --maximum-dead-containers-per-container (default 1), --maximum-dead-containers (default -1)
# 4. Enable Evented PLEG (alpha in 1.26, beta since 1.27)
# In KubeletConfiguration:
# featureGates:
# EventedPLEG: true
# 5. If containerd is truly hung:
systemctl restart containerd
# Monitor for impact on running pods
Runbook 4: Pod evicted — memory pressure
# 1. Find evicted pods
kubectl get pods --all-namespaces | grep Evicted
kubectl get events --all-namespaces | grep Evicted
# 2. Check node memory
kubectl top node worker-1
kubectl describe node worker-1 | grep -A5 "Allocated resources"
# 3. Find memory hogs
kubectl top pods --all-namespaces --sort-by=memory | head -20
# 4. Check which pods have a container without a memory limit
#    (pods with no requests or limits at all are BestEffort and evicted first)
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(any(.spec.containers[]; .resources.limits.memory == null))
  | .metadata.namespace + "/" + .metadata.name'
# 5. Set memory requests and limits on pods without them
# This elevates them from BestEffort to Burstable
# 6. Increase system-reserved / kube-reserved to give kubelet headroom
# Edit /etc/kubernetes/kubelet-config.yaml, restart kubelet
# 7. Delete evicted pods (they consume API objects but no resources)
kubectl get pods --all-namespaces -o json | jq -r '
  .items[] | select(.status.reason == "Evicted") |
  "kubectl delete pod -n " + .metadata.namespace + " " + .metadata.name
' | bash
Runbook 5: OOMKilled containers — container memory limit too low
# Symptom: container restarts with OOMKilled exit code (137)
kubectl describe pod my-pod | grep -A5 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# 1. Check container memory usage before OOM
kubectl top pod my-pod --containers
# 2. Check OOM kill events
kubectl get events --field-selector reason=OOMKilling
# 3. View historical memory from cAdvisor (via Prometheus)
# container_memory_working_set_bytes{pod="my-pod"} — peak usage
# Compare against:
# kube_pod_container_resource_limits{resource="memory",pod="my-pod"}
# 4. Increase memory limit
kubectl set resources deployment my-deploy \
--containers=app --limits=memory=512Mi
# 5. For JVM workloads: ensure -XX:MaxRAMPercentage is set
# The JVM does not automatically respect cgroup memory limits in older JDKs
# JDK 11+: -XX:+UseContainerSupport (default on)
# Explicitly: -XX:MaxRAMPercentage=75.0
Production Best Practices
- Always use a KubeletConfiguration file (--config), not command-line flags. Flags are deprecated and harder to audit. Store the config file under version control alongside your node provisioning code.
- Disable the read-only port (readOnlyPort: 0) and enable webhook authentication/authorization. The kubelet API on :10250 can expose pod secrets and allow exec, so it must be protected.
- Set cgroupDriver: systemd and ensure containerd is configured with the same driver. A mismatch causes pod failures with cryptic cgroup errors. Always verify both sides on new nodes.
- Enable rotateCertificates: true. Node certificates expire by default after 1 year. Without auto-rotation, expired certs cause nodes to go NotReady silently until the certificate is renewed.
- Configure eviction thresholds with both hard and soft values. Hard eviction should be a safety net (e.g., 200Mi); soft eviction (500Mi, 90s grace) gives you time to respond before things get critical. Always set evictionMinimumReclaim to prevent thrashing.
- Use podPidsLimit to prevent fork bombs from a single pod exhausting the node's PID namespace. A reasonable default is 4096 per pod; set lower for untrusted multi-tenant environments.
- Set startupProbe for slow-starting containers rather than inflating initialDelaySeconds on liveness probes. startupProbe prevents premature kills during startup without adding unnecessary delay to all subsequent liveness checks.
- Monitor kubelet_pleg_relist_duration_seconds p99. Values consistently above 5s indicate the container runtime is struggling and the node may be approaching a NotReady state before the heartbeat timeout fires.
- Set imageGCHighThresholdPercent: 85 and imageGCLowThresholdPercent: 80. The default 85%/80% is reasonable, but on nodes with small image filesystems these thresholds may be too close to alert thresholds. Consider 80%/75% for tighter margin.
- Test graceful shutdown with shutdownGracePeriod and shutdownGracePeriodCriticalPods. When a node is shut down, kubelet needs time to gracefully terminate pods. Default is 0 (immediate kill). Set to 60s+ for stateful workloads.