Worker Node Overview

A worker node (simply called a "node" in Kubernetes) is a machine — physical or virtual — that runs containerized workloads. Nodes provide the compute, memory, storage, and networking substrate on which pods execute. The control plane (see Control Plane Overview) decides where to run pods; nodes do the actual work of running them.

Every worker node runs a fixed set of components. Understanding what each component does, how they communicate, and how they share resources on the node is the foundation for diagnosing failures, tuning performance, and hardening security.

Required Node Components

kubelet

The primary node agent. Watches the API server for pods assigned to this node, turns pod specs into running containers through the CRI, manages pod lifecycle, reports node and pod status, and runs probes. Details: 01-kubelet.

kube-proxy

Network rules agent. Programs iptables, nftables, or IPVS rules to implement Service virtual IPs. Watches EndpointSlices and translates them into local forwarding rules. Details: 02-kube-proxy.

Container Runtime

Implements the Container Runtime Interface (CRI). Pulls images, creates and manages container namespaces, executes processes. Primary option today: containerd. Details: 03-container-runtime.

CNI Plugin

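Network plugin invoked by the container runtime, at kubelet's request over the CRI, when a pod's network namespace is set up. Assigns IP addresses, creates veth pairs, configures routing. The plugin binary is not a long-running daemon; it is invoked per pod, although many CNI implementations also run a per-node agent alongside it. Details: CNI Plugins.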

CSI Node Plugin

When using CSI storage drivers, a node plugin DaemonSet pod runs on each node to mount/unmount volumes. Interacts with kubelet's volume manager via the CSI Node Service gRPC interface. Details: CSI Drivers.

Node-level DaemonSets

Not a Kubernetes core component, but most clusters run per-node DaemonSets for log collection (Fluentd/Fluent Bit), metrics (node-exporter), security (Falco), and storage (CSI node plugins).

Node Internal Architecture

Component Ports and Endpoints

Component	Port	Protocol	Purpose	Auth required
kubelet	10250	HTTPS	API — pod logs, exec, port-forward, metrics; used by apiserver and Metrics Server	Yes — client cert or bearer token
kubelet (read-only)	10255	HTTP	Unauthenticated metrics endpoint (deprecated — disable in production)	No
kubelet healthz	10248	HTTP	Local healthz probe only (not exposed externally by default)	No (loopback)
kube-proxy	10249	HTTP	Prometheus metrics for proxy rules state	No (loopback default)
kube-proxy healthz	10256	HTTP	Health check endpoint for readiness/liveness probes	No
containerd	unix socket	gRPC	`/run/containerd/containerd.sock` — CRI calls from kubelet	Unix file permissions
node-exporter	9100	HTTP	Prometheus node metrics (CPU, memory, disk, network per node)	Typically no (in-cluster scrape)

Disable kubelet read-only port in production

Port 10255 exposes pod specs, node metadata, and running container information without any authentication. Disable it in all production clusters with --read-only-port=0 (or readOnlyPort: 0 in KubeletConfiguration). The CIS Kubernetes Benchmark checks this setting.
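
One way to confirm the read-only port is closed is to attempt a plain TCP connection to it. The following is a minimal sketch, not an official audit tool: the helper names `is_port_open` and `audit_kubelet_ports` are hypothetical, and a successful connection only shows that something is listening, not that it is the kubelet.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def audit_kubelet_ports(host: str) -> dict:
    """Report which kubelet ports from the table above accept TCP connections."""
    ports = {"api": 10250, "read_only": 10255}
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

On a hardened node, `audit_kubelet_ports("10.0.1.42")["read_only"]` would be expected to return False while the `api` entry stays True.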

The Node Object in the API

A node is represented as a Node object in the Kubernetes API. The node itself (via kubelet) populates most of the status fields; the control plane (via the Node Lifecycle Controller — see kube-controller-manager) manages taints and conditions.

# Inspect a node's full status
kubectl get node worker-1 -o yaml

apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: worker-1
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: m5.2xlarge   # set by CCM
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
spec:
  podCIDR: 10.244.3.0/24          # Assigned by controller-manager
  podCIDRs:
    - 10.244.3.0/24
  providerID: aws:///us-east-1a/i-0abc123def456   # Set by CCM
  taints:                          # Illustrative; not-ready is removed once the node reports Ready
    - effect: NoSchedule
      key: node.kubernetes.io/not-ready
status:
  addresses:
    - address: 10.0.1.42
      type: InternalIP
    - address: worker-1
      type: Hostname
  allocatable:                     # Capacity minus reservations and eviction threshold
    cpu: "7800m"
    memory: "29Gi"
    pods: "110"
    hugepages-2Mi: "0"
    ephemeral-storage: "90Gi"
  capacity:                        # Raw hardware capacity
    cpu: "8"
    memory: "32Gi"
    pods: "110"
    ephemeral-storage: "100Gi"
  conditions:
    - type: MemoryPressure
      status: "False"
    - type: DiskPressure
      status: "False"
    - type: PIDPressure
      status: "False"
    - type: Ready
      status: "True"
      reason: KubeletReady
      message: "kubelet is posting ready status"
      lastHeartbeatTime: "2026-05-16T12:00:00Z"
      lastTransitionTime: "2026-05-15T08:00:00Z"
  nodeInfo:
    kernelVersion: 5.15.0-1053-aws
    osImage: Ubuntu 22.04.3 LTS
    containerRuntimeVersion: containerd://1.7.13
    kubeletVersion: v1.29.3
    kubeProxyVersion: v1.29.3
    architecture: amd64
    operatingSystem: linux

Capacity vs Allocatable

Two distinct resource measurements exist at the node level. The difference is critical for scheduling and bin-packing:

capacity

Total raw hardware resources — what the machine physically has. This is what you'd see in lscpu or free -h. kubelet reads this from the OS and reports it.

CPU: number of logical cores
Memory: total physical RAM
Ephemeral storage: total disk for images + writable layers
Pods: maximum pod count (default: 110)

allocatable

Capacity minus what is reserved for system processes. The scheduler uses allocatable, not capacity, when placing pods.

allocatable = capacity
  - kube-reserved
  - system-reserved
  - eviction-threshold

Configure via kubelet flags: --kube-reserved, --system-reserved, --eviction-hard.

Under-reserving system resources is a top cause of node instability

If no reservation is configured, the scheduler can pack pods tightly enough that kubelet, containerd, and the OS kernel are starved of memory, triggering OOM kills and cascading failures. Always configure --kube-reserved and --system-reserved. Typical starting values: 100–250m CPU and 256–512Mi memory for kube-reserved; 100–500m CPU and 256Mi–1Gi for system-reserved depending on node size.

Reservation Example

# kubelet configuration (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "2Gi"
systemReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "2Gi"
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "5%"
  nodefs.inodesFree: "5%"
  imagefs.available: "10%"

On an 8-core / 32Gi node, this yields:

Allocatable CPU: 8000m − 200m − 200m = 7600m
Allocatable memory: 32Gi − 512Mi − 512Mi − 200Mi eviction = ~30.8Gi
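
The arithmetic above can be reproduced with a short script. This is a sketch under simplifying assumptions: it handles only the binary memory suffixes and the CPU forms used in this example, and the function names are hypothetical rather than part of any Kubernetes client library.

```python
# Binary suffixes used by Kubernetes memory quantities, in bytes.
_BINARY = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-2]) * factor)
    return int(quantity)  # no suffix: plain bytes


def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as '200m' or '8' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def allocatable(capacity, kube_reserved, system_reserved, eviction_hard_memory):
    """allocatable = capacity - kube-reserved - system-reserved - eviction threshold.

    The eviction threshold is subtracted from memory only; CPU has none.
    """
    cpu = (parse_cpu(capacity["cpu"])
           - parse_cpu(kube_reserved["cpu"])
           - parse_cpu(system_reserved["cpu"]))
    memory = (parse_memory(capacity["memory"])
              - parse_memory(kube_reserved["memory"])
              - parse_memory(system_reserved["memory"])
              - parse_memory(eviction_hard_memory))
    return {"cpu_millicores": cpu, "memory_bytes": memory}


example = allocatable(
    {"cpu": "8", "memory": "32Gi"},
    {"cpu": "200m", "memory": "512Mi"},
    {"cpu": "200m", "memory": "512Mi"},
    "200Mi",
)
# 7600 millicores and 31544Mi (about 30.8Gi), matching the figures above.
```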

Node Conditions

kubelet sets node conditions to communicate health to the control plane. The Node Lifecycle Controller (in kube-controller-manager) reads these to taint and eventually evict pods from unhealthy nodes.

Condition	Meaning	Control plane reaction
`Ready=True`	Node is healthy and accepting pods	Normal operation
`Ready=False`	kubelet reports that it cannot run pods	Taint `node.kubernetes.io/not-ready`; pods are evicted once their toleration for the `NoExecute` taint expires (`tolerationSeconds`, 300s by default)
`Ready=Unknown`	No heartbeat for `node-monitor-grace-period` (40s default)	Taint `node.kubernetes.io/unreachable`; pods are evicted after the same toleration timeout
`MemoryPressure=True`	Available memory below threshold	Taint `node.kubernetes.io/memory-pressure`; kubelet evicts pods, typically BestEffort pods first
`DiskPressure=True`	Available disk below threshold	Taint `node.kubernetes.io/disk-pressure`; kubelet evicts pods and garbage-collects images
`PIDPressure=True`	PID count near maximum	Taint `node.kubernetes.io/pid-pressure`
`NetworkUnavailable=True`	Node network not configured or broken	Set by the network plugin or the cloud route controller; taint `node.kubernetes.io/network-unavailable` prevents pod scheduling
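
Note that the polarity differs between conditions: Ready is healthy when "True", while every pressure condition is healthy when "False". A minimal sketch of that rule, applied to a Node object already parsed from `kubectl get node -o json` (the function name `problem_conditions` is hypothetical):

```python
def problem_conditions(node: dict) -> list:
    """Return 'Type=Status' strings for conditions that indicate trouble.

    Ready is healthy only when its status is "True"; all other condition
    types (MemoryPressure, DiskPressure, ...) are healthy when "False".
    A status of "Unknown" is therefore always reported as a problem.
    """
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond["type"], cond["status"]
        if ctype == "Ready":
            healthy = (status == "True")
        else:
            healthy = (status == "False")
        if not healthy:
            problems.append(f"{ctype}={status}")
    return problems
```

For the worker-1 object shown earlier, this returns an empty list.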

Node Registration

When kubelet starts on a new machine, it registers the node with the API server. This is the bootstrap sequence:

kubelet starts → TLS bootstrap (CSR) → CSR approved and signed → POST /api/v1/nodes → Node object created → CCM initializes (cloud)

TLS bootstrap: kubelet authenticates with a bootstrap token and submits a CertificateSigningRequest (CSR) for its node identity certificate (common name system:node:<nodeName>, organization system:nodes).
Certificate issuance: The CSR is approved, either automatically by the csrapproving controller in kube-controller-manager or manually with kubectl certificate approve, and then signed by the csrsigning controller. kubelet saves the resulting certificate and rotates it before expiry.
Node creation: kubelet POSTs a Node object with capacity and nodeInfo. When an external cloud provider is configured (--cloud-provider=external), it also adds the node.cloudprovider.kubernetes.io/uninitialized taint.
CCM initialization: The cloud-controller-manager's Node Controller (see cloud-controller-manager) reads cloud metadata, sets the providerID and region/zone labels, and removes the uninitialized taint.
CNI setup: The container runtime invokes the network plugin when the first pod sandbox is created on the node. Until the plugin's configuration is present, the runtime reports the network as not ready and the node stays NotReady.

Node self-registration vs. manual registration

By default (--register-node=true), kubelet registers itself automatically. In some environments (air-gapped, custom orchestration) operators pre-create Node objects and set --register-node=false. The kubelet still runs but will only manage pods on a pre-existing node object with a matching name.

How a Pod Runs on a Node

Once a pod is scheduled and its spec.nodeName is set, the execution chain on the worker node is:

apiserver (Pod.spec.nodeName set) → kubelet watch triggers → CNI: assign IP, create veth → containerd: pull image, create container → runc: create namespaces, exec entrypoint → kubelet: run probes, report status

kubelet watch: kubelet's watch on the API server delivers a pod with spec.nodeName == thisNode that it has not started yet (status.phase is Pending).
Admit pod: kubelet runs local admission, checking resource fit against allocatable and node selectors/taints. If it cannot admit the pod, it sets status.phase = Failed.
Volume setup: kubelet's volume manager waits for required PersistentVolumes to be attached (by the attach-detach controller when controller-managed attach/detach is enabled), asks the CSI node plugin to stage and mount them, and materializes ConfigMap/Secret volumes itself.
CNI invocation: kubelet asks the container runtime (containerd) over the CRI to create the pod sandbox. The runtime creates the network namespace and calls the CNI plugin binary, which creates the veth pair, assigns the pod IP (typically from the node's podCIDR), and programs routes.
Pull image: containerd pulls the container image from the OCI registry into its local image store. If the image is already present and imagePullPolicy is IfNotPresent, the pull is skipped.
Create and start containers: containerd calls runc (or another OCI runtime) to set up Linux namespaces (PID, net, mount, UTS, IPC) and cgroups, and to execute the container entrypoint.
Init containers first: If the pod has init containers, they run serially to completion before any app container is started.
Probes: kubelet runs liveness, readiness, and startup probes per the pod spec. Failed liveness probes trigger container restarts. Failed readiness probes mark the pod as not ready, which removes its IP from the ready addresses in Endpoints/EndpointSlices.
Status reporting: kubelet PATCHes the pod's status subresource as container state changes. Node heartbeats are Lease renewals in the kube-node-lease namespace (every 10s by default). kubelet recomputes the Node status every nodeStatusUpdateFrequency (default: 10s) but writes it only when it changes or after nodeStatusReportFrequency (default: 5m).
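
The probe step depends on consecutive results, not single ones: a probe flips to failed only after failureThreshold failures in a row, and back to healthy after successThreshold successes in a row. The class below is a simplified model of that counting, not kubelet's implementation. `ProbeTracker` is a hypothetical name, and it starts in the healthy state as a liveness probe does (readiness starts as not ready).

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    failure_threshold: int = 3   # Kubernetes default for failureThreshold
    success_threshold: int = 1   # Kubernetes default for successThreshold
    healthy: bool = True         # liveness-style initial state
    _streak: int = 0             # consecutive results that disagree with `healthy`

    def record(self, success: bool) -> bool:
        """Feed one probe result and return the resulting health state."""
        if success == self.healthy:
            # Result agrees with the current state: any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.success_threshold if success else self.failure_threshold
        if self._streak >= needed:
            self.healthy = success
            self._streak = 0
        return self.healthy


tracker = ProbeTracker()
# Two failures, a success that resets the streak, then three failures in a row.
states = [tracker.record(r) for r in (False, False, True, False, False, False)]
```

In this run only the final result changes the state, which is the point at which kubelet would restart the container for a liveness probe.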

Linux Kernel Primitives Used by Nodes

Kubernetes node functionality is built directly on Linux kernel features. Understanding these makes container behavior predictable:

Namespaces

Containers get isolated views of: PID (process tree), net (network stack + interfaces), mnt (filesystem mount points), UTS (hostname), IPC (message queues), and optionally user (UID/GID mapping). Containers in the same pod share the network, IPC, and UTS namespaces, which is how they communicate on localhost. Each container keeps its own mount namespace, and its own PID namespace unless shareProcessNamespace is set.

cgroups v2

Linux control groups enforce CPU and memory limits. cgroup v2 support has been stable since Kubernetes 1.25; it is used when the node OS boots with the unified cgroup hierarchy, which is the default on most current distributions. Each pod and container gets a cgroup hierarchy. Memory limits are enforced via memory.max; CPU limits via cpu.max (CFS quota). OOM kills originate from the kernel when a container exceeds memory.max.
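
To make the mapping concrete, the sketch below converts container limits into the strings written to the two cgroup v2 files. It assumes the default CFS period of 100ms (100000 microseconds); the helper names are hypothetical, and the real values are computed by kubelet and the container runtime.

```python
from typing import Optional

CFS_PERIOD_US = 100_000  # assumed default CFS period: 100 ms in microseconds


def cpu_max(limit_millicores: Optional[int]) -> str:
    """Return the cpu.max content '<quota> <period>' for a CPU limit.

    With no limit, cgroup v2 uses the literal 'max' as the quota.
    """
    if limit_millicores is None:
        return f"max {CFS_PERIOD_US}"
    quota_us = limit_millicores * CFS_PERIOD_US // 1000
    return f"{quota_us} {CFS_PERIOD_US}"


def memory_max(limit_bytes: Optional[int]) -> str:
    """Return the memory.max content for a memory limit given in bytes."""
    return "max" if limit_bytes is None else str(limit_bytes)
```

A container with limits of 500m CPU and 256Mi memory would therefore see `50000 100000` in cpu.max and `268435456` in memory.max.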

Overlay filesystem

Container images are stored as OCI layers. containerd uses overlayfs (or other snapshotters) to present a unified read-write filesystem to the container. The writable layer is created per container and deleted on container removal. Ephemeral storage limits count this writable layer together with container logs and emptyDir volumes.

seccomp / AppArmor / SELinux

Kernel security modules restrict syscalls and file access. The RuntimeDefault seccomp profile (applied when seccompProfile.type: RuntimeDefault) uses the container runtime's default filter blocking ~40 dangerous syscalls. AppArmor and SELinux profiles can be applied per pod/container.

netfilter / iptables / nftables

kube-proxy programs netfilter rules to implement Service virtual IPs (ClusterIP, NodePort, LoadBalancer) and SNAT for pod traffic leaving the node. In IPVS mode, kube-proxy uses the Linux IPVS kernel module instead for O(1) rule lookup at scale.

eBPF

Increasingly, CNI plugins (Cilium, Calico eBPF) and service mesh agents (Cilium, Istio ambient) use eBPF programs attached to the kernel's traffic hooks, replacing iptables for dramatically lower overhead at scale. See eBPF Networking.

Node Specialization

Not all nodes are equal. Production clusters commonly segment nodes by function:

Node type	Labels / Taints	Use case
General-purpose workers	—	Standard workloads — Deployments, StatefulSets, Jobs
GPU nodes	`nvidia.com/gpu: "true"`, `taint: nvidia.com/gpu=present:NoSchedule`	ML training/inference; require NVIDIA device plugin DaemonSet
Spot / preemptible	`cloud.google.com/gke-spot: "true"` or equivalent	Batch jobs, stateless workers that tolerate interruption
High-memory nodes	`node.kubernetes.io/instance-type: r6i.32xlarge`	In-memory databases, caches, large JVM workloads
ARM nodes	`kubernetes.io/arch: arm64`	Cost-optimized (Graviton, Ampere) for compatible workloads
System nodes	`node-role.kubernetes.io/infra: ""`, taint `infra:NoSchedule`	Monitoring, logging, admission webhook infrastructure
Windows nodes	`kubernetes.io/os: windows`	Windows Server container workloads; see 07-windows-worker-nodes
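
Dedicated node pools like these rely on taints, which keep pods away unless they carry a matching toleration. The sketch below models the matching rule as described in the Kubernetes documentation; it is an illustration, not the scheduler's code, and the function names are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect", ""):
        return False  # an empty effect on the toleration matches every effect
    key = toleration.get("key", "")
    if key and key != taint.get("key", ""):
        return False  # an empty key (with Exists) matches every key
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """True if every NoSchedule/NoExecute taint on the node is tolerated.

    PreferNoSchedule is a soft preference and does not block scheduling.
    """
    blocking = [t for t in node_taints
                if t.get("effect") in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations)
               for t in blocking)


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
```

With the GPU taint from the table, `schedulable([gpu_toleration], [gpu_taint])` is True, while a pod with no tolerations is rejected.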

Quick Node Diagnostics

# Overview of all nodes
kubectl get nodes -o wide

# Node conditions and events
kubectl describe node worker-1

# Resource usage across nodes
kubectl top nodes

# Pods running on a specific node
kubectl get pods --all-namespaces --field-selector spec.nodeName=worker-1

# kubelet logs (systemd)
journalctl -u kubelet -f --since "10 minutes ago"

# containerd logs
journalctl -u containerd -f

# kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50

# Node resource allocations
kubectl describe node worker-1 | grep -A10 "Allocated resources"

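# Check cgroup driver consistency (must match between kubelet and containerd)
# kubelet side, run on the node; /var/lib/kubelet/config.yaml is the kubeadm default path
grep cgroupDriver /var/lib/kubelet/config.yaml
# containerd side; look for SystemdCgroup or cgroupDriver in the CRI config
crictl info | grep -i cgroup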

Key Node Metrics

Metric	Source	Meaning
`node_cpu_seconds_total`	node-exporter	CPU time by mode (user, system, idle, iowait). High iowait → storage bottleneck.
`node_memory_MemAvailable_bytes`	node-exporter	Available memory. Alert when < eviction threshold.
`node_filesystem_avail_bytes`	node-exporter	Free disk space. Alert before the evictionHard threshold is reached, for example at < 10% when nodefs.available is set to 5%.
`kubelet_running_pods`	kubelet	Current pod count. Alert approaching `maxPods` (default 110).
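`kubelet_node_name`	kubelet	Constant gauge (value 1) labelled with the node name; useful for joining kubelet metrics to nodes. Its absence signals a kubelet that is down or not being scraped.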
`kube_node_status_condition`	kube-state-metrics	Node condition status. Alert on `Ready!=1` or any pressure condition.
`kube_node_status_allocatable`	kube-state-metrics	Allocatable resources per node. Useful for capacity planning.
`container_oom_events_total`	cadvisor (kubelet)	OOM kills. Alert on any value > 0 per container.

Alerting Rules

groups:
  - name: worker-node
    rules:
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} not Ready for 2m"

      - alert: NodeMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} under memory pressure"

      - alert: NodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} under disk pressure"

      - alert: NodeHighCPU
        expr: |
          (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} CPU > 90% for 10m"

      - alert: ContainerOOMKilled
        expr: increase(container_oom_events_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} OOM killed"
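
The NodeHighCPU expression can be easier to reason about with numbers. The sketch below repeats its arithmetic on two scrapes of the per-core idle counter. It is a simplification, since Prometheus's rate() also extrapolates and handles counter resets, and the function name is hypothetical.

```python
def cpu_utilisation(idle_start: list, idle_end: list, interval_seconds: float) -> float:
    """Approximate 1 - avg(rate(node_cpu_seconds_total{mode="idle"})).

    idle_start and idle_end hold one idle-seconds counter value per core,
    sampled interval_seconds apart.
    """
    idle_rates = [(end - start) / interval_seconds
                  for start, end in zip(idle_start, idle_end)]
    return 1 - sum(idle_rates) / len(idle_rates)


# Two cores over a 5-minute window: idle for 30s and 15s respectively.
utilisation = cpu_utilisation([1000.0, 2000.0], [1030.0, 2015.0], 300.0)
fires = utilisation > 0.9  # same threshold as the NodeHighCPU rule
```

Here the cores were idle 10% and 5% of the time, so utilisation is about 92.5% and the alert condition holds.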

Production Best Practices

Always configure kube-reserved and system-reserved. Without reservations, the scheduler can completely exhaust node resources, causing kubelet and containerd to be OOM-killed — which looks like a node failure from the control plane's perspective.
Disable the kubelet read-only port (--read-only-port=0). It exposes pod and node information without authentication and is a CIS benchmark finding.
Use cgroups v2. It provides better memory accounting, I/O limits, and pressure notifications. Ensure both kubelet (cgroupDriver: systemd) and containerd use the same cgroup driver — mismatch causes pod failures.
Enforce node registration constraints. Use the kubelet flag --node-labels to set labels at registration time. Enable NodeRestriction admission (see Admission Controllers §NodeRestriction) so that kubelets cannot add or change labels under the node-restriction.kubernetes.io/ prefix, and use that prefix for any label that workload isolation depends on.
Set maxPods based on node size. The default of 110 is appropriate for small nodes. Large nodes can host more, but increasing beyond 250 requires careful CNI and kubelet tuning.
Monitor node conditions with kube-state-metrics and alert on any pressure condition. By the time a node reaches disk pressure, it has already started evicting pods — catch the trend early.
Use node taints to dedicate nodes to specific workloads. System DaemonSets should tolerate all taints; application workloads should only be scheduled on appropriately labeled and tainted nodes.
Enable kubelet certificate rotation (rotateCertificates: true). Node certificates expire; automatic rotation prevents node-level authentication failures after ~1 year.