Worker Node Overview


A worker node (simply called a "node" in Kubernetes) is a machine — physical or virtual — that runs containerized workloads. Nodes provide the compute, memory, storage, and networking substrate on which pods execute. The control plane (see Control Plane Overview) decides where to run pods; nodes do the actual work of running them.

Every worker node runs a fixed set of components. Understanding what each component does, how they communicate, and how they share resources on the node is the foundation for diagnosing failures, tuning performance, and hardening security.

Required Node Components

kubelet

The primary node agent. Watches the API server for pods assigned to this node, translates pod specs into container runs via the CRI, manages pod lifecycle, reports node and pod status, and runs probes. Details: 01-kubelet.

kube-proxy

Network rules agent. Programs iptables, nftables, or IPVS rules to implement Service virtual IPs. Watches EndpointSlices and translates them into local forwarding rules. Details: 02-kube-proxy.

Container Runtime

Implements the Container Runtime Interface (CRI). Pulls images, creates and manages container namespaces, executes processes. Primary option today: containerd. Details: 03-container-runtime.

CNI Plugin

Network plugin invoked by the container runtime (at kubelet's request) when a pod's network namespace is set up. Assigns IP addresses, creates veth pairs, configures routing. The CNI binary is not a long-running daemon; it is executed per pod at sandbox setup and teardown. Details: CNI Plugins.

CSI Node Plugin

When using CSI storage drivers, a node plugin DaemonSet pod runs on each node to mount/unmount volumes. Interacts with kubelet's volume manager via the CSI Node Service gRPC interface. Details: CSI Drivers.

Node-level DaemonSets

Not a Kubernetes core component, but most clusters run per-node DaemonSets for log collection (Fluentd/Fluent Bit), metrics (node-exporter), security (Falco), and storage (CSI node plugins).

Node Internal Architecture

Figure: worker node (Linux) internal architecture.

  • kube-apiserver :6443 (control plane, off-node): kubelet connects to it (watch/patch).
  • kubelet: :10250 (HTTPS), :10255 (read-only, optional); talks CRI gRPC to containerd.
  • kube-proxy: :10249 metrics; programs iptables / IPVS (kernel netfilter).
  • containerd: /run/containerd/containerd.sock (CRI gRPC); drives runc / kata-containers (OCI runtime) via OCI create/start.
  • Pod (network namespace): Container A (app process, overlay fs, cgroups v2, namespaces); Container B (sidecar, shared /tmp); pause container (network/IPC namespace holder); eth0: 10.244.x.x (CNI assigned); veth pair: pod eth0 ↔ node cbr0/cni0.
  • Node network stack: eth0: node IP | cbr0/cni0: pod CIDR bridge | kube-ipvs0 / iptables NAT for Services.

Component Ports and Endpoints

Component | Port | Protocol | Purpose | Auth required
kubelet | 10250 | HTTPS | API: pod logs, exec, port-forward, metrics; used by apiserver and Metrics Server | Yes (client cert or bearer token)
kubelet (read-only) | 10255 | HTTP | Unauthenticated metrics endpoint (deprecated; disable in production) | No
kubelet healthz | 10248 | HTTP | Local healthz probe only (not exposed externally by default) | No (loopback)
kube-proxy | 10249 | HTTP | Prometheus metrics for proxy rules state | No (loopback default)
kube-proxy healthz | 10256 | HTTP | Health check endpoint for readiness/liveness probes | No
containerd | unix socket | gRPC | /run/containerd/containerd.sock; CRI calls from kubelet | Unix file permissions
node-exporter | 9100 | HTTP | Prometheus node metrics (CPU, memory, disk, network per node) | Typically no (in-cluster scrape)

Disable kubelet read-only port in production
Port 10255 exposes pod specs, node metadata, and running container information without any authentication. Disable it in all production clusters by setting --read-only-port=0 on the kubelet command line or readOnlyPort: 0 in KubeletConfiguration. The CIS Kubernetes Benchmark checks for this.
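
A quick way to confirm the port is really closed is to attempt a TCP connection from another machine. The sketch below is illustrative only: the helper name port_is_open is hypothetical, and a closed result can also mean a firewall is filtering the port rather than kubelet having it disabled.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

For example, port_is_open("10.0.1.42", 10255) should return False on a hardened node, where 10.0.1.42 is the example InternalIP used on this page.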

The Node Object in the API

A node is represented as a Node object in the Kubernetes API. The node itself (via kubelet) populates most of the status fields; the control plane (via the Node Lifecycle Controller — see kube-controller-manager) manages taints and conditions.

# Inspect a node's full status
kubectl get node worker-1 -o yaml

# Example output (abridged):
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: worker-1
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: m5.2xlarge   # set by CCM
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
spec:
  podCIDR: 10.244.3.0/24          # Assigned by controller-manager
  podCIDRs:
    - 10.244.3.0/24
  providerID: aws:///us-east-1a/i-0abc123def456   # Set by CCM
  taints:                          # Illustrative: the controller sets this only while the node is not Ready; pods need a matching toleration
    - effect: NoSchedule
      key: node.kubernetes.io/not-ready
status:
  addresses:
    - address: 10.0.1.42
      type: InternalIP
    - address: worker-1
      type: Hostname
  allocatable:                     # Capacity minus system reserved
    cpu: "7800m"
    memory: "29Gi"
    pods: "110"
    hugepages-2Mi: "0"
    ephemeral-storage: "90Gi"
  capacity:                        # Raw hardware capacity
    cpu: "8"
    memory: "32Gi"
    pods: "110"
    ephemeral-storage: "100Gi"
  conditions:
    - type: MemoryPressure
      status: "False"
    - type: DiskPressure
      status: "False"
    - type: PIDPressure
      status: "False"
    - type: Ready
      status: "True"
      reason: KubeletReady
      message: "kubelet is posting ready status"
      lastHeartbeatTime: "2026-05-16T12:00:00Z"
      lastTransitionTime: "2026-05-15T08:00:00Z"
  nodeInfo:
    kernelVersion: 5.15.0-1053-aws
    osImage: Ubuntu 22.04.3 LTS
    containerRuntimeVersion: containerd://1.7.13
    kubeletVersion: v1.29.3
    kubeProxyVersion: v1.29.3
    architecture: amd64
    operatingSystem: linux

Capacity vs Allocatable

Two distinct resource measurements exist at the node level. The difference is critical for scheduling and bin-packing:

capacity

Total raw hardware resources — what the machine physically has. This is what you'd see in lscpu or free -h. kubelet reads this from the OS and reports it.

  • CPU: number of logical cores
  • Memory: total physical RAM
  • Ephemeral storage: total disk for images + writable layers
  • Pods: maximum pod count (default: 110)

allocatable

Capacity minus what is reserved for system processes. The scheduler uses allocatable, not capacity, when placing pods.

allocatable = capacity
  - kube-reserved
  - system-reserved
  - eviction-threshold

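Configure via the kubelet flags --kube-reserved, --system-reserved, and --eviction-hard, or the equivalent KubeletConfiguration fields kubeReserved, systemReserved, and evictionHard.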

Under-reserving system resources is a top cause of node instability
If no reservation is configured, the scheduler can pack pods tightly enough that kubelet, containerd, and the OS kernel are starved of memory, triggering OOM kills and cascading failures. Always configure --kube-reserved and --system-reserved. Typical starting values: 100–250m CPU and 256–512Mi memory for kube-reserved; 100–500m CPU and 256Mi–1Gi for system-reserved depending on node size.

Reservation Example

# kubelet configuration (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "2Gi"
systemReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "2Gi"
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "5%"
  nodefs.inodesFree: "5%"
  imagefs.available: "10%"

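On an 8-core / 32Gi node, this yields 7600m of allocatable CPU (8000m - 200m - 200m; CPU has no eviction threshold) and 31544Mi, about 30.8Gi, of allocatable memory (32768Mi - 512Mi - 512Mi - 200Mi). Allocatable ephemeral storage depends on the node's disk size, which this example does not specify.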
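
The arithmetic behind the allocatable formula can be sketched in a few lines of Python. This is a simplified illustration, not kubelet's implementation: the helper names (parse_cpu, parse_mem, allocatable) are hypothetical, and the quantity parsers handle only the plain integer, "m", and binary-suffix forms used on this page.

```python
import re

_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> int:
    """Parse a CPU quantity ("200m" or "8") into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_mem(quantity: str) -> int:
    """Parse a memory quantity such as "512Mi" or "32Gi" into bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)?", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _BINARY.get(match.group(2), 1)


def allocatable(capacity, kube_reserved, system_reserved, eviction_memory):
    """Return (millicores, bytes) left for pods after reservations."""
    cpu = (parse_cpu(capacity["cpu"])
           - parse_cpu(kube_reserved["cpu"])
           - parse_cpu(system_reserved["cpu"]))
    # Only memory has a hard eviction threshold in this example.
    mem = (parse_mem(capacity["memory"])
           - parse_mem(kube_reserved["memory"])
           - parse_mem(system_reserved["memory"])
           - parse_mem(eviction_memory))
    return cpu, mem


reserved = {"cpu": "200m", "memory": "512Mi"}
cpu_m, mem_bytes = allocatable(
    {"cpu": "8", "memory": "32Gi"}, reserved, reserved, "200Mi"
)
print(f"{cpu_m}m CPU, {mem_bytes // 2**20}Mi memory")
```

The values passed in mirror the reservation example above (kubeReserved and systemReserved are identical there, so one dict is reused for both).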

Node Conditions

kubelet sets node conditions to communicate health to the control plane. The Node Lifecycle Controller (in kube-controller-manager) reads these to taint and eventually evict pods from unhealthy nodes.

Condition | True means | Control plane reaction
Ready | Node is healthy and accepting pods | Normal operation
Ready=False | kubelet unhealthy; can't run pods | Taint node.kubernetes.io/not-ready; pods are evicted once their toleration for the taint expires (tolerationSeconds, 300s by default)
Ready=Unknown | No heartbeat for node-monitor-grace-period (40s default) | Taint node.kubernetes.io/unreachable; pods are evicted once their toleration expires
MemoryPressure | Available memory below threshold | Taint node.kubernetes.io/memory-pressure; kubelet evicts BestEffort pods
DiskPressure | Available disk below threshold | Taint node.kubernetes.io/disk-pressure; kubelet evicts pods + GCs images
PIDPressure | PID count near maximum | Taint node.kubernetes.io/pid-pressure
NetworkUnavailable | CNI not configured or network broken | Set by CNI plugin via Node status; prevents pod scheduling
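
The mapping from conditions to taints in the table can be expressed as a lookup. The sketch below is an illustration of that mapping only, not the Node Lifecycle Controller's code; the function name expected_taints is hypothetical, and the input is assumed to be the status.conditions list from a Node object.

```python
# (condition type, status) -> taint key, taken from the table above.
CONDITION_TAINTS = {
    ("Ready", "False"): "node.kubernetes.io/not-ready",
    ("Ready", "Unknown"): "node.kubernetes.io/unreachable",
    ("MemoryPressure", "True"): "node.kubernetes.io/memory-pressure",
    ("DiskPressure", "True"): "node.kubernetes.io/disk-pressure",
    ("PIDPressure", "True"): "node.kubernetes.io/pid-pressure",
}


def expected_taints(conditions: list) -> list:
    """Return the sorted taint keys implied by a node's conditions."""
    keys = []
    for cond in conditions:
        taint = CONDITION_TAINTS.get((cond["type"], cond["status"]))
        if taint is not None:
            keys.append(taint)
    return sorted(keys)


healthy = [
    {"type": "MemoryPressure", "status": "False"},
    {"type": "DiskPressure", "status": "False"},
    {"type": "PIDPressure", "status": "False"},
    {"type": "Ready", "status": "True"},
]
degraded = [
    {"type": "DiskPressure", "status": "True"},
    {"type": "Ready", "status": "Unknown"},
]
print(expected_taints(healthy))
print(expected_taints(degraded))
```

Comparing this expected set with the taints actually present in spec.taints is one way to spot a node whose taints were edited by hand.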

Node Registration

When kubelet starts on a new machine, it registers the node with the API server. This is the bootstrap sequence:

kubelet starts → TLS bootstrap (CSR) → CSR approved and signed → POST /api/v1/nodes → Node object created → CCM initializes (cloud)

  1. TLS bootstrap: kubelet authenticates with a bootstrap token and submits a CertificateSigningRequest (CSR) for its node identity certificate (system:node:<nodeName>).
  2. Certificate issuance: the CSR is approved, either automatically by kube-controller-manager's CSR approving controller or manually by an administrator, and then signed by the CSR signing controller. kubelet saves the resulting certificate and rotates it before expiry.
  3. Node creation: kubelet POSTs a Node object with capacity and nodeInfo. If cloud integration is enabled, the node also carries the node.cloudprovider.kubernetes.io/uninitialized taint.
  4. CCM initialization: the cloud-controller-manager's Node Controller (see cloud-controller-manager) reads cloud metadata, sets the providerID and region/zone labels, and removes the uninitialized taint.
  5. CNI setup: the container runtime invokes the network plugin (at kubelet's request) when the first pod sandbox is created on the node, setting up the node's pod network.
Node self-registration vs. manual registration
By default (--register-node=true), kubelet registers itself automatically. In some environments (air-gapped, custom orchestration) operators pre-create Node objects and set --register-node=false. The kubelet still runs but will only manage pods on a pre-existing node object with a matching name.

How a Pod Runs on a Node

Once a pod is scheduled and its spec.nodeName is set, the execution chain on the worker node is:

apiserver (Pod.spec.nodeName set) → kubelet watch triggers → CNI: assign IP, create veth → containerd: pull image, create container → runc: create namespaces, exec entrypoint → kubelet: run probes, report status

  1. kubelet watch: kubelet's informer watches pods whose spec.nodeName equals this node and picks up the newly bound pod, which is still in phase Pending.
  2. Admit pod: kubelet runs local admission, checking resource fit against allocatable and checking node selectors/taints. If it cannot admit the pod, it sets status.phase = Failed.
  3. Volume setup: kubelet's volume manager waits for any required PersistentVolumes to be attached, then has the CSI node plugin mount them. ConfigMap and Secret volumes are materialized by kubelet itself.
  4. CNI invocation: kubelet asks the container runtime (containerd) to create the pod sandbox. The runtime creates the network namespace and calls the CNI plugin binary, which creates the veth pair, assigns the pod IP (typically from the node's podCIDR), and programs routes.
  5. Pull image: containerd pulls the container image from the OCI registry into its local image store. If the image is already present and imagePullPolicy is IfNotPresent, the pull is skipped.
  6. Create and start containers: containerd calls runc (or another OCI runtime) to set up Linux namespaces (PID, net, mount, UTS, IPC) and cgroups, and to execute the container entrypoint.
  7. Init containers first: if the pod has init containers, they go through steps 5 and 6 first and run serially to completion before app containers start.
  8. Probes: kubelet runs liveness, readiness, and startup probes per the pod spec. Failed liveness probes trigger container restarts. Failed readiness probes remove the pod's IP from Endpoints/EndpointSlices.
  9. Status reporting: kubelet PATCHes the pod's status subresource and reports node status with heartbeats every nodeStatusUpdateFrequency (default: 10s).
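
The resource-fit part of step 2 amounts to comparing summed requests against allocatable. The sketch below shows only that comparison, under simplifying assumptions: the function name admits is hypothetical, quantities are pre-parsed integers (millicores, bytes, pod count), and kubelet's real admission performs additional checks.

```python
MI = 2**20


def admits(allocatable: dict, running: list, new_pod: dict) -> bool:
    """Return True if new_pod's requests fit into what is left of allocatable."""
    for resource, limit in allocatable.items():
        used = sum(pod.get(resource, 0) for pod in running)
        if used + new_pod.get(resource, 0) > limit:
            return False
    return True


# Allocatable values from the 8-core / 32Gi reservation example.
node = {"cpu": 7600, "memory": 31544 * MI, "pods": 110}
running = [{"cpu": 7000, "memory": 20000 * MI, "pods": 1}]

small = {"cpu": 500, "memory": 1024 * MI, "pods": 1}
large = {"cpu": 700, "memory": 1024 * MI, "pods": 1}

print(admits(node, running, small))  # 7000m + 500m fits within 7600m
print(admits(node, running, large))  # 7000m + 700m exceeds 7600m
```

Note that the comparison uses requests, not actual usage: a node can reject a pod while its real CPU consumption is low.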

Linux Kernel Primitives Used by Nodes

Kubernetes node functionality is built directly on Linux kernel features. Understanding these makes container behavior predictable:

Namespaces

Each container gets isolated views of: PID (process tree), net (network stack + interfaces), mnt (filesystem mount points), UTS (hostname), IPC (message queues), and optionally user (UID/GID mapping). Containers in the same pod share the pod's network namespace, which is how they communicate over localhost.

cgroups v2

Linux control groups enforce CPU and memory limits. cgroups v2 support has been stable since Kubernetes 1.25; kubelet uses it when the node OS boots with the unified cgroup v2 hierarchy. Each pod and container gets a cgroup hierarchy. Memory limits are enforced via memory.max; CPU limits via cpu.max (CFS quota). OOM kills originate from the kernel when a container exceeds memory.max.
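
To make the cpu.max and memory.max files concrete, the sketch below derives the strings that would be written for a given limit. It assumes the common 100000 microsecond CFS period and ignores any minimum quota the runtime may enforce; the function names cpu_max and memory_max are hypothetical.

```python
from typing import Optional


def cpu_max(limit_millicores: Optional[int], period_us: int = 100_000) -> str:
    """Return the cgroup v2 cpu.max content ("<quota> <period>") for a CPU limit."""
    if limit_millicores is None:
        # No CPU limit: quota is the literal string "max".
        return f"max {period_us}"
    quota_us = limit_millicores * period_us // 1000
    return f"{quota_us} {period_us}"


def memory_max(limit_bytes: Optional[int]) -> str:
    """Return the cgroup v2 memory.max content for a memory limit."""
    return "max" if limit_bytes is None else str(limit_bytes)


print(cpu_max(500))               # half a core per period
print(cpu_max(None))              # unlimited
print(memory_max(512 * 2**20))    # 512Mi in bytes
```

A limit of 500m therefore allows 50000 microseconds of CPU time in every 100000 microsecond period, which is why a CPU-limited container is throttled rather than killed.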

Overlay filesystem

Container images are stored as OCI layers. containerd uses overlayfs (or other snapshotters) to present a unified read-write filesystem to the container. The writable layer is created per container and deleted on container removal. Ephemeral storage limits apply to this writable layer, along with container logs and emptyDir volumes.

seccomp / AppArmor / SELinux

Kernel security modules restrict syscalls and file access. The RuntimeDefault seccomp profile (applied when seccompProfile.type: RuntimeDefault) uses the container runtime's default filter blocking ~40 dangerous syscalls. AppArmor and SELinux profiles can be applied per pod/container.

netfilter / iptables / nftables

kube-proxy programs netfilter rules to implement Service virtual IPs (ClusterIP, NodePort, LoadBalancer) and SNAT for pod traffic leaving the node. In IPVS mode, kube-proxy uses the Linux IPVS kernel module instead for O(1) rule lookup at scale.
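
As an illustration of how a chain of netfilter rules can spread traffic across backends, the sketch below models the cascading-probability scheme used in iptables mode: rule i of n matches with probability 1/(n - i), and the last rule always matches. This is a simplified model, not kube-proxy code, and the function names are hypothetical.

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list:
    """Per-rule match probability for n backends, evaluated in order."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n: int) -> list:
    """Fraction of new connections that ends up at each backend."""
    shares = []
    remaining = Fraction(1)  # probability that no earlier rule matched
    for p in rule_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


print(rule_probabilities(3))
print(effective_share(3))
```

Although the per-rule probabilities differ, each backend receives an equal share. The rules are still evaluated sequentially, which is the linear cost that IPVS mode avoids.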

eBPF

Increasingly, CNI plugins (Cilium, Calico eBPF) and service mesh agents (Cilium, Istio ambient) use eBPF programs attached to the kernel's traffic hooks, replacing iptables for dramatically lower overhead at scale. See eBPF Networking.

Node Specialization

Not all nodes are equal. Production clusters commonly segment nodes by function:

Node type | Labels / Taints | Use case
General-purpose workers | (none) | Standard workloads: Deployments, StatefulSets, Jobs
GPU nodes | nvidia.com/gpu: "true", taint: nvidia.com/gpu=present:NoSchedule | ML training/inference; require NVIDIA device plugin DaemonSet
Spot / preemptible | cloud.google.com/gke-spot: "true" or equivalent | Batch jobs, stateless workers that tolerate interruption
High-memory nodes | node.kubernetes.io/instance-type: r6i.32xlarge | In-memory databases, caches, large JVM workloads
ARM nodes | kubernetes.io/arch: arm64 | Cost-optimized (Graviton, Ampere) for compatible workloads
System nodes | node-role.kubernetes.io/infra: "", taint infra:NoSchedule | Monitoring, logging, admission webhook infrastructure
Windows nodes | kubernetes.io/os: windows | Windows Server container workloads; see 07-windows-worker-nodes
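
Taints only dedicate a node if pods without a matching toleration are kept off it. The sketch below models the toleration matching rules for that check. It is an approximation for illustration: the function names tolerates and schedulable are hypothetical, and the real scheduler also evaluates labels, affinity, and resources.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if not key:
        # An empty key is only meaningful with Exists and matches all taints.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(taints: list, tolerations: list) -> bool:
    """Return True if every NoSchedule/NoExecute taint is tolerated."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
gpu_pod = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]

print(schedulable([gpu_taint], gpu_pod))  # GPU workload tolerates the taint
print(schedulable([gpu_taint], []))       # ordinary workload is kept off
```

The taint used here is the GPU example from the table; a toleration alone does not attract the pod to GPU nodes, so it is usually paired with a node selector or affinity on the node label.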

Quick Node Diagnostics

# Overview of all nodes
kubectl get nodes -o wide

# Node conditions and events
kubectl describe node worker-1

# Resource usage across nodes
kubectl top nodes

# Pods running on a specific node
kubectl get pods --all-namespaces --field-selector spec.nodeName=worker-1

# kubelet logs (systemd)
journalctl -u kubelet -f --since "10 minutes ago"

# containerd logs
journalctl -u containerd -f

# kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50

# Node resource allocations
kubectl describe node worker-1 | grep -A10 "Allocated resources"

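# Check cgroup driver consistency (must match between kubelet and containerd)
# kubelet side: read the live kubelet config through the API server proxy
kubectl get --raw "/api/v1/nodes/worker-1/proxy/configz" | grep -o '"cgroupDriver":"[^"]*"'
# containerd side (run on the node): look for the cgroup driver / SystemdCgroup setting
crictl info | grep -i cgroup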

Key Node Metrics

Metric | Source | Meaning
node_cpu_seconds_total | node-exporter | CPU time by mode (user, system, idle, iowait). High iowait → storage bottleneck.
node_memory_MemAvailable_bytes | node-exporter | Available memory. Alert when < eviction threshold.
node_filesystem_avail_bytes | node-exporter | Disk space. Alert at < 10% (before eviction kicks in at 5%).
kubelet_running_pods | kubelet | Current pod count. Alert approaching maxPods (default 110).
kubelet_node_name | kubelet | Gauge=1 per node; used to detect kubelet restarts.
kube_node_status_condition | kube-state-metrics | Node condition status. Alert on Ready!=1 or any pressure condition.
kube_node_status_allocatable | kube-state-metrics | Allocatable resources per node. Useful for capacity planning.
container_oom_events_total | cadvisor (kubelet) | OOM kills. Alert on any value > 0 per container.

Alerting Rules

groups:
  - name: worker-node
    rules:
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} not Ready for 2m"

      - alert: NodeMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} under memory pressure"

      - alert: NodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} under disk pressure"

      - alert: NodeHighCPU
        expr: |
          (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} CPU > 90% for 10m"

      - alert: ContainerOOMKilled
        expr: increase(container_oom_events_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} OOM killed"
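
The NodeHighCPU expression averages the per-CPU idle rate and subtracts it from 1. The sketch below reproduces that calculation from two raw counter samples per CPU. It is a simplified model: Prometheus's rate() also handles counter resets and extrapolation, and the function name cpu_utilization and the sample values are made up for illustration.

```python
def cpu_utilization(samples: dict, window_s: float) -> float:
    """Compute 1 - avg(idle rate) across CPUs.

    samples maps a CPU id to (idle_seconds_at_start, idle_seconds_at_end)
    of node_cpu_seconds_total{mode="idle"} over the window.
    """
    idle_rates = [(end - start) / window_s for start, end in samples.values()]
    return 1 - sum(idle_rates) / len(idle_rates)


# Two CPUs observed over a 5-minute (300s) window.
samples = {"0": (1000.0, 1030.0), "1": (2000.0, 2015.0)}
util = cpu_utilization(samples, window_s=300)
print(f"utilization: {util:.3f}, would fire: {util > 0.9}")
```

Here CPU 0 was idle for 30s and CPU 1 for 15s out of 300s, so the node averages 7.5% idle and crosses the 90% threshold.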

Production Best Practices

  1. Always configure kube-reserved and system-reserved. Without reservations, the scheduler can completely exhaust node resources, causing kubelet and containerd to be OOM-killed — which looks like a node failure from the control plane's perspective.
  2. Disable the kubelet read-only port (--read-only-port=0). It exposes pod and node information without authentication and is a CIS benchmark finding.
  3. Use cgroups v2. It provides better memory accounting, I/O limits, and pressure notifications. Ensure both kubelet (cgroupDriver: systemd) and containerd use the same cgroup driver — mismatch causes pod failures.
  4. Enforce node registration constraints. Set labels at registration time with the kubelet --node-labels flag, and enable NodeRestriction admission (see Admission Controllers §NodeRestriction) so that a kubelet cannot modify labels under the node-restriction.kubernetes.io/ prefix or change labels on other nodes.
  5. Set maxPods based on node size. The default of 110 is appropriate for small nodes. Large nodes can host more, but increasing beyond 250 requires careful CNI and kubelet tuning.
  6. Monitor node conditions with kube-state-metrics and alert on any pressure condition. By the time a node reaches disk pressure, it has already started evicting pods — catch the trend early.
  7. Use node taints to dedicate nodes to specific workloads. System DaemonSets should tolerate all taints; application workloads should only be scheduled on appropriately labeled and tainted nodes.
  8. Enable kubelet certificate rotation (rotateCertificates: true). Node certificates expire; automatic rotation prevents node-level authentication failures after ~1 year.