Container Runtime


The container runtime is the software layer that actually creates and manages containers on a node. It sits between kubelet (which expresses intent — "run this container from this image with these resource limits") and the Linux kernel (which enforces namespaces and cgroups). kubelet talks to the runtime through the Container Runtime Interface (CRI) — a gRPC API documented in detail in 04-cri-interface. This page covers the runtimes themselves: their architecture, configuration, image handling, snapshotters, and OCI runtime execution.

Runtime Ecosystem

containerd — The Standard Runtime

containerd (donated to CNCF by Docker, now graduated) is the dominant container runtime for Kubernetes. It was originally extracted from Docker as the "low-level" component handling image management and container execution, while Docker's higher-level UX was stripped away. Kubernetes talks to containerd via its built-in CRI plugin (since containerd 1.1, no longer a separate process).

containerd Internal Architecture

kubelet (CRI gRPC) → containerd CRI plugin → containerd core (metadata, snapshots, events) → containerd-shim-runc-v2 → runc (OCI) → container process

The containerd Shim

The containerd shim (containerd-shim-runc-v2) is a small process that lives between containerd and each container's OCI runtime. One shim instance exists per pod sandbox. Its role:

Serves as the process parent for containers. If containerd restarts, the shim keeps the container alive.
Handles container I/O: relays the container's stdout/stderr to containerd, which writes the log files that kubelet reads to serve kubectl logs.
Implements the TTY/exec protocol for kubectl exec.
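Collects exit status: the shim registers as a subreaper, so once runc exits the container's processes are reparented to the shim, which reaps them and reports exit codes to containerd. Inside the container, PID 1 is still the container's own entrypoint (or the pause process when shareProcessNamespace: true).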

containerd restarts are safe for running containers

Because the shim is independent of containerd, restarting the containerd daemon does not kill running containers. This is essential for node-level upgrades of containerd without disrupting workloads. After the restart, containerd reconnects to the shims that are already running.

containerd Configuration

# Default config location
/etc/containerd/config.toml

# Generate default config
containerd config default > /etc/containerd/config.toml

version = 2

[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0

[plugins."io.containerd.grpc.v1.cri"]
  # Pause image used for every pod sandbox
  sandbox_image = "registry.k8s.io/pause:3.9"

  [plugins."io.containerd.grpc.v1.cri".containerd]
    snapshotter = "overlayfs"     # overlayfs | zfs | btrfs | devmapper
    default_runtime_name = "runc"
    discard_unpacked_layers = true

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true       # CRITICAL: must match kubelet cgroupDriver

    # Register an alternate runtime for sandboxed workloads
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"

  [plugins."io.containerd.grpc.v1.cri".registry]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
        endpoint = ["https://mirror.gcr.io", "https://registry-1.docker.io"]

    [plugins."io.containerd.grpc.v1.cri".registry.configs]
      [plugins."io.containerd.grpc.v1.cri".registry.configs."private.registry.example.com".auth]
        username = "user"
        password = "pass"
      [plugins."io.containerd.grpc.v1.cri".registry.configs."private.registry.example.com".tls]
        ca_file = "/etc/containerd/certs/ca.crt"

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"

[metrics]
  address = "127.0.0.1:1338"    # Prometheus metrics

[debug]
  level = "info"                 # debug | info | warn | error

SystemdCgroup must match kubelet's cgroupDriver

If containerd has SystemdCgroup = true but kubelet has cgroupDriver: cgroupfs (or vice versa), containers will fail to start with cryptic cgroup errors. Always verify both sides after installing or upgrading either component. The current setting can be checked with crictl info | grep -i cgroup.
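The check described above can be automated. The following is a minimal sketch, not an official tool: it assumes the kubelet config is a flat KubeletConfiguration YAML and that the containerd config uses the version 2 CRI plugin layout shown earlier. The function names are hypothetical, and treating a missing setting as cgroupfs is an assumption that may not hold on newer releases that detect the driver automatically.

```python
import re


def kubelet_cgroup_driver(kubelet_yaml: str) -> str:
    """Read cgroupDriver from KubeletConfiguration text (assumed default: cgroupfs)."""
    match = re.search(r"^cgroupDriver:\s*[\"']?(\w+)", kubelet_yaml, re.MULTILINE)
    return match.group(1) if match else "cgroupfs"


def containerd_cgroup_driver(config_toml: str, runtime: str = "runc") -> str:
    """Read SystemdCgroup from the options table of one CRI runtime entry."""
    header = (
        '[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.'
        f"{runtime}.options]"
    )
    in_section = False
    for raw in config_toml.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("["):
            in_section = line == header      # entering or leaving the table
        elif in_section:
            match = re.match(r"SystemdCgroup\s*=\s*(true|false)$", line)
            if match:
                return "systemd" if match.group(1) == "true" else "cgroupfs"
    return "cgroupfs"  # assumed default when the option is absent


def drivers_agree(kubelet_yaml: str, config_toml: str) -> bool:
    return kubelet_cgroup_driver(kubelet_yaml) == containerd_cgroup_driver(config_toml)
```

On a node, the two inputs would typically be the contents of /var/lib/kubelet/config.yaml and /etc/containerd/config.toml.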

OCI Standards

The Open Container Initiative (OCI) maintains three specifications: image, runtime, and distribution (the registry API). The two that matter most on the node are:

OCI Image Specification

Defines the format of a container image: a manifest (JSON), a configuration object (JSON, containing entrypoint, env, labels, etc.), and one or more content-addressable filesystem layers (tarballs, compressed with gzip or zstd).

Images are content-addressed by SHA256 digest of their manifest
Layers are deduplicated across images sharing the same base
Multi-arch images use an Image Index (manifest list) that maps platform (os/arch) to per-platform manifests
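Content addressing is simple to demonstrate. The sketch below is illustrative only: it computes a digest in the `sha256:<hex>` form and shows why two images that share a base layer store it once. The helper names and the sample layer bytes are made up for the example.

```python
import hashlib


def oci_digest(blob: bytes) -> str:
    """Return the content address of a blob in the sha256:<hex> form."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def verify_blob(blob: bytes, expected_digest: str) -> bool:
    """A blob is accepted only if its bytes hash to the digest that named it."""
    return oci_digest(blob) == expected_digest


# Two hypothetical images sharing the same base layer bytes.
base_layer = b"base filesystem tarball"
image_a = [oci_digest(base_layer), oci_digest(b"app A files")]
image_b = [oci_digest(base_layer), oci_digest(b"app B files")]

# The store is keyed by digest, so the shared layer is kept only once.
content_store = set(image_a) | set(image_b)
```

Here `content_store` ends up with three entries rather than four, which is the deduplication the list above refers to.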

OCI Runtime Specification

Defines the format of the container configuration JSON (config.json) passed to the runtime when creating a container, and the lifecycle operations the runtime must implement.

create — set up namespaces and cgroups, but do not start process
start — execute the process inside the configured environment
state — query container state
kill — send a signal to the container process
delete — destroy the container and clean up resources
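The lifecycle operations above imply a small state machine. The following toy model is a sketch for intuition, not runc's implementation: the status names creating, created, running, and stopped follow the runtime spec, while the class itself, the "deleted" marker, and the assumption that kill always terminates the process are simplifications made for the example.

```python
class OCIContainer:
    """Toy model of the status transitions behind create/start/kill/delete."""

    def __init__(self) -> None:
        self.status = "creating"

    def _require(self, operation: str, *allowed: str) -> None:
        if self.status not in allowed:
            raise RuntimeError(f"{operation} not allowed in state {self.status!r}")

    def create(self) -> None:
        # Namespaces and cgroups are set up; the user process is not started yet.
        self._require("create", "creating")
        self.status = "created"

    def start(self) -> None:
        self._require("start", "created")
        self.status = "running"

    def kill(self) -> None:
        # Simplification: assume the signal terminates the container process.
        self._require("kill", "created", "running")
        self.status = "stopped"

    def delete(self) -> None:
        # Only a stopped container may be deleted.
        self._require("delete", "stopped")
        self.status = "deleted"  # hypothetical marker, not a spec status

    def state(self) -> dict:
        return {"status": self.status}
```

The split between create and start is what lets a higher-level runtime configure networking and hooks before the user process runs.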

Image Layers and Snapshotters

Container images are stored as a stack of immutable, content-addressable layers. When a container starts, a writable layer (the container layer) is added on top. All reads below that layer are served from the read-only image layers using a snapshotter (the mechanism for stacking layers).
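A conceptual sketch of this layering follows. It models each layer as a dictionary from path to contents and is not how overlayfs or any snapshotter is implemented; the class name and the whiteout marker are hypothetical stand-ins for the real mechanisms.

```python
WHITEOUT = object()  # marks a path as deleted in the writable layer


class UnionView:
    """Read-only image layers with one writable container layer on top."""

    def __init__(self, image_layers):
        self.lowers = list(image_layers)  # base layer first, never modified
        self.upper = {}                   # the container's writable layer

    def read(self, path):
        # Search the writable layer first, then image layers from top to base.
        for layer in [self.upper, *reversed(self.lowers)]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    raise FileNotFoundError(path)
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Copy-on-write: changes land in the upper layer only.
        self.upper[path] = data

    def delete(self, path):
        # Hide the lower-layer file instead of touching the image layer.
        self.upper[path] = WHITEOUT
```

Because writes and deletes only touch the upper layer, many containers can share the same image layers safely.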

Snapshotters

Snapshotter	Mechanism	Use case	Requirements
`overlayfs`	Linux overlay filesystem (kernel 3.18+)	Default for most Linux distros; efficient copy-on-write	Kernel overlayfs support; NOT available on some older kernels or NFS rootfs
`native`	Full copy of each layer (no COW)	Fallback when overlayfs unavailable	Disk-intensive; high storage usage
`zfs`	ZFS clones/snapshots	High I/O workloads; excellent snapshot performance	ZFS installed on node; not common in cloud VMs
`btrfs`	Btrfs subvolumes/snapshots	COW with efficient snapshots	btrfs filesystem for container storage
`devmapper`	Device Mapper thin-provisioning	RHEL/CentOS legacy compatibility	Complex setup; mostly superseded by overlayfs
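`stargz`	Lazy image pulling (eStargz format)	Large images with slow cold-start: fetches file contents on demand instead of downloading whole layers first	Requires eStargz-formatted images; remote snapshotter plugin. Nydus is a separate remote snapshotter with a similar goal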

Image Pulling and Registry Authentication

When kubelet instructs containerd to run a container, containerd checks whether the image already exists locally. If not, it pulls the image from the registry. The pull process:

Resolve manifest — fetch the image manifest from the registry using the image reference (name + tag or digest).
Check platform — if the manifest is a multi-arch index, select the manifest matching the node's os/arch.
Download missing layers — compare manifest layer digests against the local content store; pull only layers not already present.
Verify digests — each layer is verified against its SHA256 digest after download.
Unpack — layers are unpacked into the snapshotter.
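Steps 2 and 3 above can be sketched as follows. This is a simplified illustration rather than containerd's code: it assumes the index and manifest have already been parsed into dictionaries with the OCI field names (`manifests`, `platform`, `layers`, `digest`), and the function names are hypothetical.

```python
def select_manifest(index: dict, os_name: str, arch: str) -> str:
    """Pick the per-platform manifest digest from a multi-arch image index."""
    for entry in index["manifests"]:
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return entry["digest"]
    raise LookupError(f"no manifest for {os_name}/{arch}")


def layers_to_pull(manifest: dict, local_digests: set) -> list:
    """Return, in manifest order, the layer digests missing from the local store."""
    return [
        layer["digest"]
        for layer in manifest["layers"]
        if layer["digest"] not in local_digests
    ]
```

A node that already holds the base layers of an image therefore downloads only the layers that differ.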

imagePullPolicy

Policy	Behavior	Use case
`Always`	Always contact the registry before starting. Pulls if digest has changed.	Mutable tags (like `:latest`); ensures freshness
`IfNotPresent`	Only pull if the image is not present locally. Default for versioned tags.	Production — avoids unnecessary registry calls
`Never`	Never pull. Fails if image not present locally.	Air-gapped environments where images are pre-loaded
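The table can be read as a small decision function. The sketch below is an approximation of the documented behavior, not kubelet source: the function names are hypothetical, and the defaulting rule (Always for `:latest` or an omitted tag, IfNotPresent otherwise) is applied with deliberately simple image-reference parsing.

```python
def default_pull_policy(image: str) -> str:
    """Policy Kubernetes applies when imagePullPolicy is omitted (simplified)."""
    if "@" in image:                      # pinned by digest
        return "IfNotPresent"
    name = image.rsplit("/", 1)[-1]       # ignore registry host:port
    tag = name.split(":", 1)[1] if ":" in name else ""
    return "Always" if tag in ("", "latest") else "IfNotPresent"


def contacts_registry(policy: str, present_locally: bool) -> bool:
    """Whether starting the container involves a registry round trip."""
    if policy == "Always":
        return True
    if policy == "IfNotPresent":
        return not present_locally
    if policy == "Never":
        return False                      # start fails if the image is absent
    raise ValueError(f"unknown imagePullPolicy: {policy}")
```

For example, `default_pull_policy("localhost:5000/myapp")` yields Always because the reference carries no tag, even though it contains a colon.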

Using :latest or mutable tags in production

If imagePullPolicy: IfNotPresent is used with a mutable tag like :latest, different nodes in the cluster may be running different actual image versions — whichever version was cached when each node last pulled. Always use immutable image digests or versioned tags in production.

Registry Authentication

Kubernetes supports four mechanisms for providing registry credentials:

imagePullSecrets on Pod

# Create the secret (type: kubernetes.io/dockerconfigjson)
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass

# Reference it from the pod spec
spec:
  imagePullSecrets:
    - name: regcred
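For debugging, it can help to know what such a secret holds. The sketch below builds the `.dockerconfigjson` payload for the example server and credentials; the function name is hypothetical and the payload is limited to the commonly seen `auths` fields.

```python
import base64
import json


def dockerconfigjson(server: str, username: str, password: str) -> str:
    """Build the JSON document stored under the .dockerconfigjson key."""
    # The auth field is base64("username:password").
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    return json.dumps(
        {"auths": {server: {"username": username, "password": password, "auth": auth}}}
    )


payload = dockerconfigjson("registry.example.com", "user", "pass")
```

The server key must match the registry host in the image reference, otherwise the credentials are not used for the pull.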

ServiceAccount imagePullSecrets

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-sa
imagePullSecrets:
  - name: regcred
# All pods using this SA inherit these credentials

Node-level credential config

containerd's registry.configs section or a .docker/config.json on the node applies to all pods. Used for internal mirrors and air-gapped environments. See containerd config above.

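Cloud IAM (node identity)

On EKS/GKE/AKS, the node's cloud identity (instance role, node service account, or kubelet managed identity) can be granted ECR/GCR/ACR pull permissions. No Kubernetes secrets are needed; credential provider plugins fetch and refresh short-lived registry tokens automatically. Pod-level identity mechanisms such as IRSA and Workload Identity cover the workload's own cloud API calls, not image pulls.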

OCI Runtimes

runc — Reference Implementation

runc is the reference implementation of the OCI Runtime Specification, written in Go, maintained by the Open Container Initiative. It uses libcontainer (a Go library) to set up Linux namespaces and cgroups. runc is a short-lived CLI tool — it is invoked once per container create/start/delete operation, not as a persistent daemon.

# runc is typically at:
which runc   # /usr/local/bin/runc or /usr/bin/runc

# Check version
runc --version
# runc version 1.1.12
# commit: v1.1.12-0-g51d5e946
# spec: 1.0.2-dev

# List containers managed by runc directly (for debugging)
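# --root is a global flag and goes before the subcommand; Kubernetes containers live in the k8s.io namespace
runc --root /run/containerd/runc/k8s.io list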

crun — High-Performance Alternative

crun is a C implementation of the OCI runtime spec. It is reported to use far less memory than runc and to start containers faster (the figures quoted here, roughly 10x less memory and 2x faster startup, depend heavily on the workload and measurement method), making it attractive for high-density nodes or serverless workloads. It is the default OCI runtime for Podman on Fedora and RHEL 9, and it implements the same OCI spec as runc.

# Use crun instead of runc in containerd config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  BinaryName = "/usr/bin/crun"   # Override binary

Sandboxed Runtimes

For workloads requiring stronger isolation than Linux namespaces provide (multi-tenant platforms, untrusted code execution), sandboxed runtimes create an additional isolation layer:

kata-containers

Runs each pod inside a lightweight VM (using QEMU, Cloud Hypervisor, or Firecracker). The container process runs inside the VM's kernel, providing strong isolation. The pod has a private kernel — a container escape would require a VM escape.

~100ms additional startup time
~128MB additional memory overhead per pod
Requires hardware virtualization (Intel VT-x or AMD-V): bare metal, or cloud VMs with nested virtualization enabled
Transparent to Kubernetes — same pod spec, different RuntimeClass

gVisor (runsc)

Intercepts syscalls from the container process and handles them in a user-space kernel written in Go (the Sentry). Container processes never directly call the Linux kernel — they go through gVisor's Sentry instead.

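Adds overhead to every intercepted syscall (on the order of microseconds, depending on the gVisor platform in use), so syscall-heavy and I/O-heavy workloads slow down the most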
Not all syscalls supported (but covers most common ones)
Does not require virtualization hardware
Used in production at Google (for example, GKE Sandbox is built on gVisor)

RuntimeClass

RuntimeClass is the Kubernetes mechanism for selecting which OCI runtime to use per pod. A cluster can have multiple runtimes registered, and pods select one via spec.runtimeClassName.

# Define available runtimes
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-qemu
handler: kata               # Maps to containerd runtime config name
overhead:
  podFixed:
    memory: "140Mi"         # Extra memory reserved per pod for VM overhead
    cpu: "250m"             # Scheduler accounts for this in resource requests
scheduling:
  nodeSelector:
    runtime: kata           # Only schedule on nodes with kata installed
  tolerations:
    - key: runtime
      operator: Equal
      value: kata
      effect: NoSchedule
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
# Use in a pod
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: kata-qemu    # or gvisor
  containers:
    - name: app
      image: myapp:v1

RuntimeClass overhead is added to scheduling requests

The overhead.podFixed field tells the scheduler to add extra resource requests to the pod's declared requests when placing it. This ensures nodes are not over-committed with sandboxed pods whose actual resource usage (including the VM/Sentry overhead) exceeds what the container spec declares.
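As a worked example of that accounting, the sketch below adds the kata-qemu overhead from the RuntimeClass above to a pod's container requests. It is a simplification: it ignores init containers, uses plain integers (MiB and millicores) instead of Kubernetes quantity strings, and the function name and container values are hypothetical.

```python
def pod_scheduling_request(container_requests, overhead):
    """Sum container requests and add the RuntimeClass podFixed overhead."""
    total = dict(overhead)
    for request in container_requests:
        for resource, amount in request.items():
            total[resource] = total.get(resource, 0) + amount
    return total


# Overhead from the kata-qemu RuntimeClass: 140Mi memory, 250m CPU.
kata_overhead = {"memory_mi": 140, "cpu_m": 250}

# Hypothetical pod with two containers.
containers = [
    {"memory_mi": 256, "cpu_m": 500},
    {"memory_mi": 128, "cpu_m": 250},
]

effective = pod_scheduling_request(containers, kata_overhead)
# effective == {"memory_mi": 524, "cpu_m": 1000}
```

The scheduler therefore needs a node with 524Mi and one full CPU free, not the 384Mi and 750m that the containers alone declare.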

crictl — Runtime Debugging Tool

crictl is the CLI tool for interacting with any CRI-compatible runtime directly, bypassing kubectl. It is the primary tool for debugging containers at the node level.

# Configure crictl to use containerd
cat > /etc/crictl.yaml << 'EOF'
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: false
EOF

# List pods (sandboxes)
crictl pods

# List containers
crictl ps -a        # -a includes stopped containers

# List images
crictl images

# Pull an image manually
crictl pull registry.k8s.io/pause:3.9

# Inspect a container
crictl inspect <container-id>

# Get container logs
crictl logs <container-id>
crictl logs -f <container-id>        # follow

# Exec into a running container
crictl exec -it <container-id> /bin/sh

# Get pod stats
crictl statsp

# Remove all stopped containers
crictl rm $(crictl ps -a -q --state exited)

# Remove unused images
crictl rmi --prune

Image Management and Disk Pressure

Images are stored in containerd's content store at /var/lib/containerd/. Disk pressure from images is one of the most common node issues in production.

# Check total image storage
du -sh /var/lib/containerd/
du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/

# List images with sizes
crictl images        # the SIZE column shows each image's size

# containerd native tool (ctr) for lower-level inspection
ctr --namespace k8s.io images ls
ctr --namespace k8s.io snapshots ls | wc -l

# Inspect disk usage per container snapshot
ctr --namespace k8s.io snapshots usage

# Force image garbage collection (kubelet GC must be configured)
# kubelet auto-GCs based on imageGCHighThresholdPercent
# To manually force: restart kubelet or trigger via kubelet API

# Identify large images pulling excessive disk
crictl images -o json | jq -r '.images[] | [.repoTags[0], .size] | @tsv' \
  | sort -k2 -n -r | head -10
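The threshold-based garbage collection mentioned in the comments above can be approximated as follows. This is a sketch of the documented behavior, not kubelet source: once disk usage reaches imageGCHighThresholdPercent, kubelet tries to free enough image data to bring usage down to imageGCLowThresholdPercent. The function name is hypothetical, and the defaults of 85 and 80 are the commonly documented ones.

```python
def image_gc_bytes_to_free(capacity: int, available: int,
                           high_pct: int = 85, low_pct: int = 80) -> int:
    """Bytes image GC would try to reclaim on the image filesystem."""
    usage_pct = 100 - (available * 100) // capacity
    if usage_pct < high_pct:
        return 0                                   # below the trigger, no GC
    # Target: leave (100 - low_pct) percent of the disk free.
    return capacity * (100 - low_pct) // 100 - available


GIB = 1024 ** 3
# 100 GiB disk with 10 GiB free: usage is 90%, so GC aims for 20 GiB free.
to_free = image_gc_bytes_to_free(100 * GIB, 10 * GIB)
# to_free == 10 * GIB
```

Lowering the high threshold makes GC start earlier, at the cost of more frequent image re-pulls.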

cgroup Driver Consistency

Both kubelet and containerd must use the same cgroup driver. The two options are:

Driver	Cgroup hierarchy	Recommended
`systemd`	Cgroups are managed as systemd units/slices. Hierarchy: `/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<uid>.slice/cri-containerd-<container-id>.scope`	Yes: use on systemd-based distros (Ubuntu 22.04+, RHEL 8+, Fedora)
`cgroupfs`	kubelet writes directly to `/sys/fs/cgroup/`. Hierarchy: `/kubepods/besteffort/pod<uid>/<container>`	Legacy — avoid on cgroups v2; can cause double-management issues with systemd
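When hunting for a pod's cgroup on a node, it helps to know how the pod-level path is typically assembled under each driver. The sketch below reflects commonly observed kubelet naming and is an assumption, not a guaranteed interface: under systemd, slice names repeat their parent prefix and dashes in the pod UID become underscores, and Guaranteed pods sit directly under the kubepods root. The function name is hypothetical.

```python
def pod_cgroup_path(driver: str, qos: str, pod_uid: str) -> str:
    """Build the pod-level cgroup path for a QoS class (simplified)."""
    guaranteed = qos == "guaranteed"
    if driver == "cgroupfs":
        parts = ["kubepods"] + ([] if guaranteed else [qos]) + [f"pod{pod_uid}"]
        return "/" + "/".join(parts)
    if driver == "systemd":
        prefix = "kubepods" if guaranteed else f"kubepods-{qos}"
        uid = pod_uid.replace("-", "_")   # dashes separate slice levels in systemd
        parents = ["kubepods.slice"] + ([] if guaranteed else [f"{prefix}.slice"])
        return "/" + "/".join(parents + [f"{prefix}-pod{uid}.slice"])
    raise ValueError(f"unknown cgroup driver: {driver}")
```

The result is relative to the cgroup mount, so on a cgroups v2 node it would be looked up under /sys/fs/cgroup.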

# Verify cgroup driver agreement
# kubelet:
cat /var/lib/kubelet/config.yaml | grep cgroupDriver

# containerd:
crictl info | grep -i cgroup
# or:
grep -A3 "runc.options" /etc/containerd/config.toml | grep SystemdCgroup

# Check if cgroups v2 is active
stat -fc %T /sys/fs/cgroup/
# tmpfs = cgroups v1
# cgroup2fs = cgroups v2

Prometheus Metrics

Metric	Source	Description
`container_tasks_state`	cAdvisor	Container process states (running, sleeping, stopped, etc.)
`container_memory_working_set_bytes`	cAdvisor	Working set memory; what kubelet uses for memory-pressure eviction decisions (the kernel OOM killer acts on cgroup memory usage against the limit)
`container_cpu_usage_seconds_total`	cAdvisor	Cumulative CPU time; rate = current CPU usage
`container_fs_usage_bytes`	cAdvisor	Container writable layer disk usage
`containerd_snapshots_total`	containerd	Total snapshots in the content store (proxy for image count)

Troubleshooting Runbooks

Runbook 1: Image pull failure — ErrImagePull / ImagePullBackOff

# 1. Get pod events
kubectl describe pod my-pod | grep -A10 Events

# 2. Common causes:
# a) Image doesn't exist or wrong tag
crictl pull myapp:v99   # Test pull directly
# Fix: check image name spelling and registry

# b) Registry auth failure (401)
# Events: "unauthorized: authentication required"
kubectl get secret regcred -o jsonpath='{.data.\.dockerconfigjson}' \
  | base64 -d | jq .
# Fix: recreate secret with correct credentials

# c) Registry unreachable (network issue)
# Test from node:
curl -v https://registry.example.com/v2/
# Fix: check DNS, firewall, proxy settings on node

# d) Certificate error (TLS)
# Events: "x509: certificate signed by unknown authority"
# Fix: add CA to containerd registry config or node trust store
# In /etc/containerd/config.toml:
# [plugins."io.containerd.grpc.v1.cri".registry.configs."registry.example.com".tls]
#   ca_file = "/etc/ssl/certs/custom-ca.crt"

# e) Rate limiting (Docker Hub)
# Events: "toomanyrequests: You have reached your pull rate limit"
# Fix: use registry mirror, or authenticate with Docker Hub credentials

Runbook 2: containerd unresponsive — pods stuck creating

# 1. Check containerd service
systemctl status containerd
journalctl -u containerd -n 50 --no-pager

# 2. Test CRI socket
crictl --timeout 3s ps   # --timeout is a global flag, so it precedes the subcommand
# If times out: containerd is not responding

# 3. Check for deadlocks or high load
top -p $(pidof containerd)
timeout 10 strace -c -f -p "$(pidof containerd)"   # 10-second syscall summary

# 4. Check containerd goroutine dump
kill -USR1 $(pidof containerd)
# Goroutine dump appears in containerd logs

# 5. Check for disk full (common root cause)
df -h /var/lib/containerd
df -h /run/containerd

# 6. Restart containerd (safe — running containers survive via shim)
systemctl restart containerd

# 7. If still unresponsive after restart, check kernel messages
dmesg | grep -E "containerd|cgroup|oom" | tail -20

Runbook 3: cgroup driver mismatch — containers fail to start

# Symptom: kubelet logs show errors like:
# "failed to create containerd task: failed to create shim task:
#  OCI runtime create failed: ... cgroupv2: cgroup path ... is not valid"
# OR pods stuck in ContainerCreating forever

# 1. Check both sides
cat /var/lib/kubelet/config.yaml | grep cgroupDriver
crictl info | grep -i cgroup

# 2. Fix containerd config
vim /etc/containerd/config.toml
# Set: SystemdCgroup = true   (if kubelet uses systemd)
# OR:  SystemdCgroup = false  (if kubelet uses cgroupfs)

# 3. Restart containerd
systemctl restart containerd

# 4. Verify agreement
crictl info | grep -i cgroup
# Should match kubelet config

# 5. Check cgroups v2 (systemd driver strongly recommended)
stat -fc %T /sys/fs/cgroup/
# cgroup2fs → use the systemd driver on both kubelet and containerd

Runbook 4: Disk full — image storage exhausted

# Symptom: DiskPressure on node, new pods cannot start

# 1. Check disk usage
df -h /var/lib/containerd

# 2. Find large images
# .size is emitted as a JSON string, so convert it before dividing
crictl images -o json | jq -r '.images[] |
  "\(.size | tonumber / 1048576 | floor)MB\t\(.repoTags[0])"' \
  | sort -rn | head -20

# 3. Remove unused images immediately
crictl rmi --prune

# 4. Find containers with large writable layers (DISK column; its position
#    varies by crictl version, so adjust -k to match the header)
crictl stats -a | sort -k4 -rn | head -10

# 5. Remove exited containers holding disk
crictl rm $(crictl ps -aq --state exited)

# 6. If kubelet imageGC is not working, check thresholds
cat /var/lib/kubelet/config.yaml | grep -E "imageGC|High|Low"
# Lower threshold: imageGCHighThresholdPercent: 70

# 7. Emergency: remove every image not referenced by a container
# WARNING: pruned images must be re-pulled on next use; needs a recent containerd (1.7+)
ctr --namespace k8s.io images prune --all

Runbook 5: RuntimeClass not working — pod using wrong runtime

# 1. Verify RuntimeClass object exists
kubectl get runtimeclass
kubectl describe runtimeclass kata-qemu

# 2. Verify containerd has the handler configured
grep -A5 "kata" /etc/containerd/config.toml

# 3. Check if kata/gvisor binary is installed
which containerd-shim-kata-v2   # or
ls /usr/local/bin/containerd-shim-*

# 4. Check pod's runtimeClassName is set
kubectl get pod my-pod -o jsonpath='{.spec.runtimeClassName}'

# 5. Check pod events for runtime errors
kubectl describe pod my-pod | grep -A10 Events
# "failed to create containerd task: ... kata: ... binary not found"

# 6. Verify node selector in RuntimeClass matches node labels
kubectl describe runtimeclass kata-qemu | grep -A5 Scheduling
kubectl get node worker-1 --show-labels | grep runtime

# 7. Check nested virtualization (for kata-qemu on cloud VMs)
grep -cE 'vmx|svm' /proc/cpuinfo   # Should be > 0 (vmx = Intel VT-x, svm = AMD-V)
# On AWS: metal instances or those with nested-virt enabled
# On GCP: --enable-nested-virtualization node flag

Production Best Practices

Use containerd 1.7+ with cgroups v2 and SystemdCgroup = true. cgroups v2 provides better memory accounting, I/O rate limiting, and pressure events. Ensure both kubelet and containerd use systemd cgroup driver.
Pin image digests in production, not tags. Tags are mutable; myapp@sha256:abc123 is immutable. This ensures reproducible deployments and prevents accidental pulls of different versions across nodes.
Configure containerd registry mirrors for Docker Hub. Docker Hub enforces pull rate limits (historically 100 pulls per 6 hours for unauthenticated clients and 200 for free accounts; check Docker's current policy). Use mirror.gcr.io or a private registry mirror to avoid ImagePullBackOff at scale.
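Set discard_unpacked_layers = true in containerd config. This lets containerd garbage-collect the compressed layer blobs in the content store once they have been unpacked into the snapshotter, saving significant disk space (typically 30-50% of image store size). The trade-off is that such images may no longer be exportable from the node without a fresh pull.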
Use RuntimeClass with overhead for sandboxed runtimes. Without overhead registration, the scheduler treats kata-containers pods as if they have no VM overhead, leading to node over-commitment and out-of-memory failures on the hypervisor.
Monitor containerd metrics and watch for slow CRI calls. kubelet_runtime_operations_duration_seconds (exposed by kubelet for CRI calls, labeled by operation_type) should be well under 1s for container create/start operations. Slow CRI calls cascade into PLEG health failures.
Pre-pull critical images on nodes using DaemonSets or node startup scripts. Large images (500MB+) can delay pod startup by 30-60s on cold nodes. Pre-pulling ensures images are available before the first pod requires them.
Avoid imagePullPolicy: Always for immutably-tagged images in production. It adds registry latency to every pod start, increases registry load, and provides no benefit if the image digest hasn't changed.