Worker Node Overview
A worker node (simply called a "node" in Kubernetes) is a machine — physical or virtual — that runs containerized workloads. Nodes provide the compute, memory, storage, and networking substrate on which pods execute. The control plane (see Control Plane Overview) decides where to run pods; nodes do the actual work of running them.
Every worker node runs a fixed set of components. Understanding what each component does, how they communicate, and how they share resources on the node is the foundation for diagnosing failures, tuning performance, and hardening security.
Required Node Components
kubelet
The primary node agent. Watches the API server for pods assigned to this node, translates pod specs into container runs via the CRI, manages pod lifecycle, reports node and pod status, and runs probes. Details: 01-kubelet.
kube-proxy
Network rules agent. Programs iptables, nftables, or IPVS rules to implement Service virtual IPs. Watches EndpointSlices and translates them into local forwarding rules. Details: 02-kube-proxy.
Container Runtime
Implements the Container Runtime Interface (CRI). Pulls images, creates and manages container namespaces, executes processes. Primary option today: containerd. Details: 03-container-runtime.
CNI Plugin
Network plugin called by the container runtime (via kubelet) when a pod's network namespace is set up. Assigns IP addresses, creates veth pairs, configures routing. Not a long-running daemon — invoked per pod. Details: CNI Plugins.
CSI Node Plugin
When using CSI storage drivers, a node plugin DaemonSet pod runs on each node to mount/unmount volumes. Interacts with kubelet's volume manager via the CSI Node Service gRPC interface. Details: CSI Drivers.
Node-level DaemonSets
Not a Kubernetes core component, but most clusters run per-node DaemonSets for log collection (Fluentd/Fluent Bit), metrics (node-exporter), security (Falco), and storage (CSI node plugins).
Node Internal Architecture
Component Ports and Endpoints
| Component | Port | Protocol | Purpose | Auth required |
|---|---|---|---|---|
| kubelet | 10250 | HTTPS | API — pod logs, exec, port-forward, metrics; used by apiserver and Metrics Server | Yes — client cert or bearer token |
| kubelet (read-only) | 10255 | HTTP | Unauthenticated metrics endpoint (deprecated — disable in production) | No |
| kubelet healthz | 10248 | HTTP | Local healthz probe only (not exposed externally by default) | No (loopback) |
| kube-proxy | 10249 | HTTP | Prometheus metrics for proxy rules state | No (loopback default) |
| kube-proxy healthz | 10256 | HTTP | Health check endpoint for readiness/liveness probes | No |
| containerd | unix socket | gRPC | /run/containerd/containerd.sock — CRI calls from kubelet | Unix file permissions |
| node-exporter | 9100 | HTTP | Prometheus node metrics (CPU, memory, disk, network per node) | Typically no (in-cluster scrape) |
Disable the kubelet read-only port (--read-only-port=0 in kubelet config) in all production clusters. The CIS Kubernetes Benchmark checks for this.
The Node Object in the API
A node is represented as a Node object in the Kubernetes API. The node itself (via kubelet) populates most of the status fields; the control plane (via the Node Lifecycle Controller — see kube-controller-manager) manages taints and conditions.
# Inspect a node's full status
kubectl get node worker-1 -o yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: worker-1
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: m5.2xlarge # set by CCM
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
spec:
  podCIDR: 10.244.3.0/24 # Assigned by controller-manager
  podCIDRs:
  - 10.244.3.0/24
  providerID: aws:///us-east-1a/i-0abc123def456 # Set by CCM
  taints: # Added by controller; tolerations on pods
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
status:
  addresses:
  - address: 10.0.1.42
    type: InternalIP
  - address: worker-1
    type: Hostname
  allocatable: # Capacity minus system reserved
    cpu: "7800m"
    memory: "29Gi"
    pods: "110"
    hugepages-2Mi: "0"
    ephemeral-storage: "90Gi"
  capacity: # Raw hardware capacity
    cpu: "8"
    memory: "32Gi"
    pods: "110"
    ephemeral-storage: "100Gi"
  conditions:
  - type: MemoryPressure
    status: "False"
  - type: DiskPressure
    status: "False"
  - type: PIDPressure
    status: "False"
  - type: Ready
    status: "True"
    reason: KubeletReady
    message: "kubelet is posting ready status"
    lastHeartbeatTime: "2026-05-16T12:00:00Z"
    lastTransitionTime: "2026-05-15T08:00:00Z"
  nodeInfo:
    kernelVersion: 5.15.0-1053-aws
    osImage: Ubuntu 22.04.3 LTS
    containerRuntimeVersion: containerd://1.7.13
    kubeletVersion: v1.29.3
    kubeProxyVersion: v1.29.3
    architecture: amd64
    operatingSystem: linux
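When triaging a node, only a handful of these fields usually matter. The sketch below is a minimal illustration of pulling them out of a Node object; it assumes the object was obtained as JSON (for example with kubectl get node worker-1 -o json) and parsed into a dict. The function name summarize_node and the trimmed sample document are hypothetical, not part of any Kubernetes tooling.

```python
import json


def summarize_node(node: dict) -> dict:
    """Extract the fields most often needed when triaging a node."""
    status = node.get("status", {})
    # Index conditions by type, e.g. {"Ready": "True", "DiskPressure": "False"}
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    return {
        "name": node["metadata"]["name"],
        "ready": conditions.get("Ready") == "True",
        "pressure": sorted(
            ctype for ctype, cstatus in conditions.items()
            if ctype.endswith("Pressure") and cstatus == "True"
        ),
        "allocatable_cpu": status.get("allocatable", {}).get("cpu"),
        "kubelet_version": status.get("nodeInfo", {}).get("kubeletVersion"),
    }


# Trimmed, hypothetical stand-in for the JSON form of the Node shown above.
sample = json.loads("""
{
  "metadata": {"name": "worker-1"},
  "status": {
    "allocatable": {"cpu": "7800m", "memory": "29Gi"},
    "conditions": [
      {"type": "MemoryPressure", "status": "False"},
      {"type": "DiskPressure", "status": "False"},
      {"type": "Ready", "status": "True"}
    ],
    "nodeInfo": {"kubeletVersion": "v1.29.3"}
  }
}
""")

summary = summarize_node(sample)
print(summary)
```

The same function can be fed each item of kubectl get nodes -o json to build a quick fleet overview.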
Capacity vs Allocatable
Two distinct resource measurements exist at the node level. The difference is critical for scheduling and bin-packing:
capacity
Total raw hardware resources — what the machine physically has. This is what you'd see in lscpu or free -h. kubelet reads this from the OS and reports it.
- CPU: number of logical cores
- Memory: total physical RAM
- Ephemeral storage: total disk for images + writable layers
- Pods: maximum pod count (default: 110)
allocatable
Capacity minus what is reserved for system processes. The scheduler uses allocatable, not capacity, when placing pods.
allocatable = capacity
- kube-reserved
- system-reserved
- eviction-threshold
Configure via kubelet flags: --kube-reserved, --system-reserved, --eviction-hard.
Always configure --kube-reserved and --system-reserved. Typical starting values: 100–250m CPU and 256–512Mi memory for kube-reserved; 100–500m CPU and 256Mi–1Gi memory for system-reserved, depending on node size.
Reservation Example
# kubelet configuration (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "2Gi"
systemReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "2Gi"
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "5%"
  nodefs.inodesFree: "5%"
  imagefs.available: "10%"
On an 8-core / 32Gi node, this yields:
- Allocatable CPU: 8000m − 200m − 200m = 7600m
- Allocatable memory: 32Gi − 512Mi − 512Mi − 200Mi eviction = ~30.8Gi
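The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not kubelet's implementation: the helper names are hypothetical, and the quantity parsing only handles the forms used in this example (whole cores, millicores, and Ki/Mi/Gi suffixes).

```python
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as "8" or "200m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def mem_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "32Gi" to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def allocatable(capacity, kube_reserved, system_reserved, eviction_memory):
    """allocatable = capacity - kube-reserved - system-reserved - eviction threshold."""
    cpu = (cpu_millis(capacity["cpu"])
           - cpu_millis(kube_reserved["cpu"])
           - cpu_millis(system_reserved["cpu"]))
    memory = (mem_bytes(capacity["memory"])
              - mem_bytes(kube_reserved["memory"])
              - mem_bytes(system_reserved["memory"])
              - mem_bytes(eviction_memory))
    return {"cpu_millis": cpu, "memory_bytes": memory}


result = allocatable(
    capacity={"cpu": "8", "memory": "32Gi"},
    kube_reserved={"cpu": "200m", "memory": "512Mi"},
    system_reserved={"cpu": "200m", "memory": "512Mi"},
    eviction_memory="200Mi",
)
print(result["cpu_millis"])                        # 7600
print(round(result["memory_bytes"] / 1024**3, 1))  # 30.8
```

Note that the eviction threshold is subtracted only from memory here, matching the worked example; CPU has no eviction signal.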
Node Conditions
kubelet sets node conditions to communicate health to the control plane. The Node Lifecycle Controller (in kube-controller-manager) reads these to taint unhealthy nodes and eventually evict their pods.
| Condition | True means | Control plane reaction |
|---|---|---|
| Ready | Node is healthy and accepting pods | Normal operation |
| Ready=False | kubelet unhealthy; can't run pods | Taint node.kubernetes.io/not-ready; pods are evicted once their toleration for the taint expires (tolerationSeconds, 300s by default) |
| Ready=Unknown | No heartbeat for node-monitor-grace-period (40s default) | Taint node.kubernetes.io/unreachable; pods are evicted once their toleration for the taint expires |
| MemoryPressure | Available memory below threshold | Taint node.kubernetes.io/memory-pressure; kubelet evicts BestEffort pods first |
| DiskPressure | Available disk below threshold | Taint node.kubernetes.io/disk-pressure; kubelet evicts pods + GCs images |
| PIDPressure | PID count near maximum | Taint node.kubernetes.io/pid-pressure |
| NetworkUnavailable | CNI not configured or network broken | Set by CNI plugin via Node status; prevents pod scheduling |
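The condition-to-taint mapping in the table can be expressed as a small lookup. The following is an illustrative sketch only: expected_taints is a hypothetical helper, not the Node Lifecycle Controller's code, and it covers only the taint keys named in the table.

```python
PRESSURE_TAINTS = {
    "MemoryPressure": "node.kubernetes.io/memory-pressure",
    "DiskPressure": "node.kubernetes.io/disk-pressure",
    "PIDPressure": "node.kubernetes.io/pid-pressure",
}


def expected_taints(conditions: list) -> list:
    """Return the taint keys one would expect for a list of node conditions.

    Each condition is a dict shaped like an entry of status.conditions,
    e.g. {"type": "Ready", "status": "True"}.
    """
    taints = []
    for cond in conditions:
        ctype, status = cond["type"], cond["status"]
        if ctype == "Ready":
            if status == "False":
                taints.append("node.kubernetes.io/not-ready")
            elif status == "Unknown":
                taints.append("node.kubernetes.io/unreachable")
        elif ctype in PRESSURE_TAINTS and status == "True":
            taints.append(PRESSURE_TAINTS[ctype])
    return taints


healthy = [{"type": "Ready", "status": "True"},
           {"type": "DiskPressure", "status": "False"}]
degraded = [{"type": "Ready", "status": "Unknown"},
            {"type": "DiskPressure", "status": "True"}]

print(expected_taints(healthy))   # []
print(expected_taints(degraded))
```

Comparing this expectation with the taints actually present in spec.taints is one way to spot a node whose conditions and taints have drifted apart.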
Node Registration
When kubelet starts on a new machine, it registers the node with the API server. This is the bootstrap sequence:
- TLS bootstrap: kubelet uses a bootstrap token to authenticate and sends a CertificateSigningRequest (CSR) for its node identity certificate (system:node:<nodeName>).
- Certificate issuance: The CertificateApproval controller or auto-approver signs the CSR. kubelet saves the resulting certificate and rotates it before expiry.
- Node creation: kubelet POSTs a Node object with capacity, nodeInfo, and an uninitialized taint if cloud integration is enabled.
- CCM initialization: The cloud-controller-manager's Node Controller (see cloud-controller-manager) reads cloud metadata, sets the providerID and region/zone labels, and removes the uninitialized taint.
- CNI setup: The network plugin is invoked by kubelet when the first pod is created on the node, setting up the node's pod network.
By default (--register-node=true), kubelet registers itself automatically. In some environments (air-gapped, custom orchestration) operators pre-create Node objects and set --register-node=false. The kubelet still runs but will only manage pods on a pre-existing Node object with a matching name.
How a Pod Runs on a Node
Once a pod is scheduled and its spec.nodeName is set, the execution chain on the worker node is:
- kubelet watch: kubelet's informer detects a pod object with spec.nodeName == thisNode whose containers have not been started yet (status.phase is Pending).
- Admit pod: kubelet runs local admission: checks resource fit against allocatable, checks node selectors/taints. If it cannot admit the pod, it sets status.phase = Failed.
- Volume setup: kubelet's volume manager waits for any required PersistentVolumes to be attached, mounts them through the CSI node plugin, and sets up ConfigMap/Secret volumes.
- CNI invocation: kubelet calls the container runtime (containerd), which creates the pod's network namespace and calls the CNI plugin binary. CNI creates the veth pair, assigns the pod IP from the node's podCIDR, and programs routes.
- Pull image: containerd pulls the container image from the OCI registry into its image store. If the image is already present and imagePullPolicy is IfNotPresent, the pull is skipped.
- Create and start containers: containerd calls runc (or another OCI runtime) to set up Linux namespaces (PID, net, mount, UTS, IPC), cgroups, and execute the container entrypoint.
- Init containers first: If the pod has init containers, they run serially to completion before app containers start.
- Probes: kubelet runs liveness, readiness, and startup probes per the pod spec. Failed liveness probes trigger container restarts. Failed readiness probes remove the pod's IP from Endpoints/EndpointSlices.
- Status reporting: kubelet PATCHes the pod's status subresource and the node's status with heartbeats every nodeStatusUpdateFrequency (default: 10s).
Linux Kernel Primitives Used by Nodes
Kubernetes node functionality is built directly on Linux kernel features. Understanding these makes container behavior predictable:
Namespaces
Each container gets isolated views of: PID (process tree), net (network stack + interfaces), mnt (filesystem mount points), UTS (hostname), IPC (message queues), user (UID/GID mapping). Pods share the network namespace between containers — that's how they communicate on localhost.
cgroups v2
Linux control groups enforce CPU and memory limits. cgroup v2 support has been stable since Kubernetes 1.25; kubelet uses it when the node's OS boots with the unified cgroup hierarchy, which most current distributions do. Each pod and container gets its own cgroup in the hierarchy. Memory limits are enforced via memory.max; CPU limits via cpu.max (CFS quota). OOM kills originate from the kernel when a container exceeds memory.max.
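To make the cpu.max mapping concrete, the sketch below converts a CPU limit in millicores into the "quota period" string that cgroup v2 expects. It assumes the default CFS period of 100 ms (100000 µs); the function name is hypothetical, and kubelet's real computation applies additional clamping that is omitted here.

```python
from typing import Optional

CFS_PERIOD_US = 100_000  # default CFS period: 100 ms, expressed in microseconds


def cpu_max(limit_millicores: Optional[int], period_us: int = CFS_PERIOD_US) -> str:
    """Render a CPU limit as the contents of a cgroup v2 cpu.max file.

    cpu.max holds "<quota> <period>"; the literal "max" means no quota.
    """
    if limit_millicores is None:
        return f"max {period_us}"
    # 1000 millicores == one full period of CPU time per period.
    quota_us = limit_millicores * period_us // 1000
    return f"{quota_us} {period_us}"


print(cpu_max(500))   # 50000 100000
print(cpu_max(2000))  # 200000 100000
print(cpu_max(None))  # max 100000
```

A container limited to 500m may therefore run for at most 50 ms of CPU time in every 100 ms window, after which the kernel throttles it until the next period.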
Overlay filesystem
Container images are stored as OCI layers. containerd uses overlayfs (or other snapshotters) to present a unified read-write filesystem to the container. The writeable layer is created per-container and deleted on container removal. Ephemeral storage limits apply to this writeable layer.
seccomp / AppArmor / SELinux
Kernel security modules restrict syscalls and file access. The RuntimeDefault seccomp profile (applied when seccompProfile.type: RuntimeDefault) uses the container runtime's default filter blocking ~40 dangerous syscalls. AppArmor and SELinux profiles can be applied per pod/container.
netfilter / iptables / nftables
kube-proxy programs netfilter rules to implement Service virtual IPs (ClusterIP, NodePort, LoadBalancer) and SNAT for pod traffic leaving the node. In IPVS mode, kube-proxy uses the Linux IPVS kernel module instead for O(1) rule lookup at scale.
eBPF
Increasingly, CNI plugins (Cilium, Calico eBPF) and service mesh agents (Cilium, Istio ambient) use eBPF programs attached to the kernel's traffic hooks, replacing iptables for dramatically lower overhead at scale. See eBPF Networking.
Node Specialization
Not all nodes are equal. Production clusters commonly segment nodes by function:
| Node type | Labels / Taints | Use case |
|---|---|---|
| General-purpose workers | — | Standard workloads — Deployments, StatefulSets, Jobs |
| GPU nodes | nvidia.com/gpu: "true", taint: nvidia.com/gpu=present:NoSchedule | ML training/inference; require NVIDIA device plugin DaemonSet |
| Spot / preemptible | cloud.google.com/gke-spot: "true" or equivalent | Batch jobs, stateless workers that tolerate interruption |
| High-memory nodes | node.kubernetes.io/instance-type: r6i.32xlarge | In-memory databases, caches, large JVM workloads |
| ARM nodes | kubernetes.io/arch: arm64 | Cost-optimized (Graviton, Ampere) for compatible workloads |
| System nodes | node-role.kubernetes.io/infra: "", taint infra:NoSchedule | Monitoring, logging, admission webhook infrastructure |
| Windows nodes | kubernetes.io/os: windows | Windows Server container workloads; see 07-windows-worker-nodes |
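Dedicated node pools only work if pods carry tolerations that match the pool's taints. The sketch below illustrates the matching rules (effect, key, then operator) for a single toleration and taint. The tolerates function is a hypothetical re-implementation written for this page, not the scheduler's code; the GPU taint is the one from the table above.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty or missing effect on the toleration matches every effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False

    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")  # Equal is the default

    if operator == "Exists":
        # An empty key with Exists tolerates every taint key.
        return key == "" or key == taint["key"]
    # Equal: both key and value must match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}

ml_pod = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
daemonset = {"operator": "Exists"}  # tolerate-everything, typical for system DaemonSets
web_pod = {"key": "nvidia.com/gpu", "operator": "Equal", "value": "absent"}

print(tolerates(ml_pod, gpu_taint))     # True
print(tolerates(daemonset, gpu_taint))  # True
print(tolerates(web_pod, gpu_taint))    # False
```

A toleration only permits scheduling onto the tainted node; pairing it with a node selector or node affinity on the pool's label is still needed to steer the workload there.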
Quick Node Diagnostics
# Overview of all nodes
kubectl get nodes -o wide
# Node conditions and events
kubectl describe node worker-1
# Resource usage across nodes
kubectl top nodes
# Pods running on a specific node
kubectl get pods --all-namespaces --field-selector spec.nodeName=worker-1
# kubelet logs (systemd)
journalctl -u kubelet -f --since "10 minutes ago"
# containerd logs
journalctl -u containerd -f
# kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50
# Node resource allocations
kubectl describe node worker-1 | grep -A10 "Allocated resources"
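# Container runtime and version reported by the node
kubectl get node worker-1 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'
# Check cgroup driver consistency (must match between kubelet and containerd); run on the node
crictl info | grep -i cgroup
grep cgroupDriver /var/lib/kubelet/config.yaml # kubeadm default path; adjust for your install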
Key Node Metrics
| Metric | Source | Meaning |
|---|---|---|
| node_cpu_seconds_total | node-exporter | CPU time by mode (user, system, idle, iowait). High iowait → storage bottleneck. |
| node_memory_MemAvailable_bytes | node-exporter | Available memory. Alert when < eviction threshold. |
| node_filesystem_avail_bytes | node-exporter | Disk space. Alert at < 10% (before eviction kicks in at 5%). |
| kubelet_running_pods | kubelet | Current pod count. Alert approaching maxPods (default 110). |
| kubelet_node_name | kubelet | Gauge=1 per node; used to detect kubelet restarts. |
| kube_node_status_condition | kube-state-metrics | Node condition status. Alert on Ready!=1 or any pressure condition. |
| kube_node_status_allocatable | kube-state-metrics | Allocatable resources per node. Useful for capacity planning. |
| container_oom_events_total | cadvisor (kubelet) | OOM kills. Alert on any value > 0 per container. |
Alerting Rules
groups:
- name: worker-node
  rules:
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} not Ready for 2m"
  - alert: NodeMemoryPressure
    expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} under memory pressure"
  - alert: NodeDiskPressure
    expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} under disk pressure"
  - alert: NodeHighCPU
    expr: |
      (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} CPU > 90% for 10m"
  - alert: ContainerOOMKilled
    expr: increase(container_oom_events_total[5m]) > 0
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} OOM killed"
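The NodeHighCPU expression is the least obvious of these rules: it derives a busy fraction from the idle-mode counter rather than summing the busy modes. The sketch below reproduces that arithmetic on two raw counter samples per core. It is a simplification for illustration (PromQL's rate() also handles counter resets and extrapolation), and the function name and sample values are hypothetical.

```python
def busy_fraction(idle_start, idle_end, window_seconds):
    """Approximate 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[window])).

    idle_start / idle_end: cumulative idle seconds per core, sampled at the
    start and end of the window.
    """
    if not idle_start or len(idle_start) != len(idle_end):
        raise ValueError("need one start and one end sample per core")
    # Per-core idle rate: seconds spent idle per second of wall-clock time.
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    return 1 - sum(idle_rates) / len(idle_rates)


# Two cores over a 5-minute window: one idle for 30s, the other never idle.
usage = busy_fraction([1000.0, 2000.0], [1030.0, 2000.0], 300)
print(usage > 0.9)  # True, so the alert condition holds for this window
```

Measuring idle time and inverting it counts every non-idle mode (user, system, iowait, steal, and so on) as busy without having to enumerate them.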
Production Best Practices
- Always configure kube-reserved and system-reserved. Without reservations, the scheduler can completely exhaust node resources, causing kubelet and containerd to be OOM-killed, which looks like a node failure from the control plane's perspective.
- Disable the kubelet read-only port (--read-only-port=0). It exposes pod and node information without authentication and is a CIS benchmark finding.
- Use cgroups v2. It provides better memory accounting, I/O limits, and pressure notifications. Ensure both kubelet (cgroupDriver: systemd) and containerd use the same cgroup driver; a mismatch causes pod failures.
- Enforce node registration constraints. Use --node-labels in kubelet config to set labels at registration time. Use NodeRestriction admission (see Admission Controllers §NodeRestriction) to prevent kubelets from labeling themselves with arbitrary values.
- Set maxPods based on node size. The default of 110 is appropriate for small nodes. Large nodes can host more, but increasing beyond 250 requires careful CNI and kubelet tuning.
- Monitor node conditions with kube-state-metrics and alert on any pressure condition. By the time a node reaches disk pressure, it has already started evicting pods, so catch the trend early.
- Use node taints to dedicate nodes to specific workloads. System DaemonSets should tolerate all taints; application workloads should only be scheduled on appropriately labeled and tainted nodes.
- Enable kubelet certificate rotation (rotateCertificates: true). Node certificates expire; automatic rotation prevents node-level authentication failures after ~1 year.