Container Runtime
The container runtime is the software layer that actually creates and manages containers on a node. It sits between kubelet (which expresses intent — "run this container from this image with these resource limits") and the Linux kernel (which enforces namespaces and cgroups). kubelet talks to the runtime through the Container Runtime Interface (CRI) — a gRPC API documented in detail in 04-cri-interface. This page covers the runtimes themselves: their architecture, configuration, image handling, snapshotters, and OCI runtime execution.
Runtime Ecosystem
containerd — The Standard Runtime
containerd (donated to CNCF by Docker, now graduated) is the dominant container runtime for Kubernetes. It was originally extracted from Docker as the "low-level" component handling image management and container execution, while Docker's higher-level UX was stripped away. Kubernetes talks to containerd via its built-in CRI plugin (since containerd 1.1, no longer a separate process).
containerd Internal Architecture
The containerd Shim
The containerd shim (containerd-shim-runc-v2) is a small process that sits between containerd and each container's OCI runtime. One shim instance exists per pod sandbox. Its role:
- Serves as the process parent for containers. If containerd restarts, the shim keeps the container alive.
- Handles container I/O: collects stdout/stderr from the container process and makes it available to kubectl logs.
- Implements the TTY/exec protocol for kubectl exec.
- Reaps exited container processes as their subreaper and reports exit status back to containerd. (Zombies inside the container's PID namespace are reaped by that namespace's PID 1: the container's own entrypoint, or the pause container when shareProcessNamespace: true.)
containerd Configuration
# Default config location
/etc/containerd/config.toml
# Generate default config
containerd config default > /etc/containerd/config.toml
version = 2
[grpc]
address = "/run/containerd/containerd.sock"
uid = 0
gid = 0
[plugins."io.containerd.grpc.v1.cri"]
# Pause (sandbox) image used for every pod
sandbox_image = "registry.k8s.io/pause:3.9"
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs" # overlayfs | zfs | btrfs | devmapper
default_runtime_name = "runc"
discard_unpacked_layers = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true # CRITICAL: must match kubelet cgroupDriver
# Register an alternate runtime for sandboxed workloads
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://mirror.gcr.io", "https://registry-1.docker.io"]
[plugins."io.containerd.grpc.v1.cri".registry.configs]
[plugins."io.containerd.grpc.v1.cri".registry.configs."private.registry.example.com".auth]
username = "user"
password = "pass"
[plugins."io.containerd.grpc.v1.cri".registry.configs."private.registry.example.com".tls]
ca_file = "/etc/containerd/certs/ca.crt"
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
[metrics]
address = "127.0.0.1:1338" # Prometheus metrics
[debug]
level = "info" # debug | info | warn | error
Warning: If containerd has SystemdCgroup = true but kubelet has cgroupDriver: cgroupfs (or vice versa), containers can fail to start with cryptic cgroup errors. Always verify both sides after installing or upgrading either component. The current setting can be checked with crictl info | grep -i cgroup.
OCI Standards
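The Open Container Initiative (OCI) publishes the specifications that compliant runtimes and registries implement. Two of them matter most at the node level (a third, the Distribution Specification, covers the registry API):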
OCI Image Specification
Defines the format of a container image: a manifest (JSON), a configuration object (JSON, containing entrypoint, env, labels, etc.), and one or more content-addressable filesystem layers (tarballs, compressed with gzip or zstd).
- Images are content-addressed by SHA256 digest of their manifest
- Layers are deduplicated across images sharing the same base
- Multi-arch images use an Image Index (manifest list) that maps platform (os/arch) to per-platform manifests
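The content addressing and platform selection described above can be sketched in a few lines of Python. This is a minimal illustration, not containerd's implementation: `digest_of` and `select_manifest` are hypothetical helper names, and the index is a plain dict shaped like an OCI Image Index.

```python
import hashlib


def digest_of(blob: bytes) -> str:
    """Return the OCI-style digest string for a blob (manifest, config, or layer)."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def select_manifest(index: dict, os_name: str, arch: str) -> str:
    """Pick the per-platform manifest digest from an image index (manifest list)."""
    for entry in index["manifests"]:
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return entry["digest"]
    raise LookupError(f"no manifest for {os_name}/{arch}")


# Hypothetical two-platform index; the digests are derived from placeholder bytes.
example_index = {
    "manifests": [
        {"digest": digest_of(b"amd64-manifest"),
         "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": digest_of(b"arm64-manifest"),
         "platform": {"os": "linux", "architecture": "arm64"}},
    ]
}
```

Because the digest is a pure function of the bytes, two images that share a base layer reference the same digest, which is what makes layer deduplication possible.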
OCI Runtime Specification
Defines the format of the container configuration JSON (config.json) passed to the runtime when creating a container, and the lifecycle operations the runtime must implement.
- create: set up namespaces and cgroups, but do not start the process
- start: execute the process inside the configured environment
- state: query container state
- kill: send a signal to the container process
- delete: destroy the container and clean up resources
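The lifecycle operations above form a small state machine. The sketch below models it in Python under simplifying assumptions: the class and exception names are hypothetical, and `kill` is treated as always terminating the process (a real signal may be ignored or handled by the container).

```python
class LifecycleError(Exception):
    """Raised when an operation is not valid in the container's current state."""


class Container:
    def __init__(self, container_id: str):
        self.id = container_id
        self.status = None  # None means the container does not exist yet

    def _require(self, operation: str, allowed: set) -> None:
        if self.status not in allowed:
            raise LifecycleError(f"{operation} not allowed in state {self.status!r}")

    def create(self) -> None:
        self._require("create", {None})
        self.status = "created"   # namespaces and cgroups exist, process not started

    def start(self) -> None:
        self._require("start", {"created"})
        self.status = "running"

    def kill(self) -> None:
        self._require("kill", {"created", "running"})
        self.status = "stopped"   # simplification: the signal ends the process

    def delete(self) -> None:
        self._require("delete", {"stopped"})
        self.status = None        # resources released

    def state(self) -> dict:
        return {"id": self.id, "status": self.status}
```

Splitting create from start is what lets a higher-level runtime set up networking or hooks after the namespaces exist but before the user process runs.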
Image Layers and Snapshotters
Container images are stored as a stack of immutable, content-addressable layers. When a container starts, a writable layer (the container layer) is added on top. Reads that miss the writable layer fall through to the read-only image layers below it; the snapshotter is the component that stacks these layers into a single filesystem view.
Snapshotters
| Snapshotter | Mechanism | Use case | Requirements |
|---|---|---|---|
| overlayfs | Linux overlay filesystem (kernel 3.18+) | Default for most Linux distros; efficient copy-on-write | Kernel overlayfs support; NOT available on some older kernels or NFS rootfs |
| native | Full copy of each layer (no COW) | Fallback when overlayfs unavailable | Disk-intensive; high storage usage |
| zfs | ZFS clones/snapshots | High I/O workloads; excellent snapshot performance | ZFS installed on node; not common in cloud VMs |
| btrfs | Btrfs subvolumes/snapshots | COW with efficient snapshots | btrfs filesystem for container storage |
| devmapper | Device Mapper thin-provisioning | RHEL/CentOS legacy compatibility | Complex setup; mostly superseded by overlayfs |
| stargz | Lazy image pulling (eStargz format) | Large images with slow cold start; fetches only the data a container actually reads | Requires compatible registry; remote snapshotter (Nydus is a separate lazy-pulling snapshotter with its own image format) |
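The copy-on-write behaviour that overlay-style snapshotters provide can be illustrated with a toy model. This is a sketch only: layers are plain dicts mapping paths to bytes, `OverlayView` is a hypothetical name, and whiteouts inside lower layers are ignored for brevity.

```python
WHITEOUT = object()  # marker recording that a path was deleted in the writable layer


class OverlayView:
    def __init__(self, lower_layers):
        # lower_layers: read-only dicts, ordered from base image layer to topmost
        self.lowers = list(lower_layers)
        self.upper = {}  # the container's writable layer

    def read(self, path):
        if path in self.upper:
            value = self.upper[path]
            if value is WHITEOUT:
                raise FileNotFoundError(path)
            return value
        # Fall through the image layers, topmost first
        for layer in reversed(self.lowers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes never touch image layers; they land in the writable layer
        self.upper[path] = data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        self.upper[path] = WHITEOUT
```

Two containers built from the same image would each get their own `upper` dict while sharing the same `lowers`, which is why per-container disk cost is limited to what the container writes.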
Image Pulling and Registry Authentication
When kubelet instructs containerd to run a container, containerd checks whether the image already exists locally. If not, it pulls the image from the registry. The pull process:
- Resolve manifest: fetch the image manifest from the registry using the image reference (name + tag or digest).
- Check platform: if the manifest is a multi-arch index, select the manifest matching the node's os/arch.
- Download missing layers: compare manifest layer digests against the local content store; pull only layers not already present.
- Verify digests: each layer is verified against its SHA256 digest after download.
- Unpack: layers are unpacked into the snapshotter.
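The "download missing layers" and "verify digests" steps above reduce to a set difference and a hash comparison. The following is a rough sketch with hypothetical function names; it assumes digests are strings of the form `sha256:<hex>` and does no network I/O.

```python
import hashlib


def missing_layers(manifest_layers, local_digests):
    """Return the layer digests that still need to be downloaded, in manifest order."""
    local = set(local_digests)
    return [digest for digest in manifest_layers if digest not in local]


def verify_blob(expected_digest: str, blob: bytes) -> bool:
    """Check a downloaded blob against the digest named in the manifest."""
    algorithm, _, hex_value = expected_digest.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    return hashlib.sha256(blob).hexdigest() == hex_value
```

A blob that fails `verify_blob` would be discarded rather than unpacked, since its content no longer matches what the manifest promised.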
imagePullPolicy
| Policy | Behavior | Use case |
|---|---|---|
| Always | Always contact the registry before starting. Pulls if digest has changed. | Mutable tags (like :latest); ensures freshness |
| IfNotPresent | Only pull if the image is not present locally. Default for versioned tags. | Production; avoids unnecessary registry calls |
| Never | Never pull. Fails if image not present locally. | Air-gapped environments where images are pre-loaded |
Warning: If imagePullPolicy: IfNotPresent is used with a mutable tag like :latest, different nodes in the cluster may be running different actual image versions: whichever version was cached when each node last pulled. Always use immutable image digests or versioned tags in production.
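The policy table can be expressed as two small functions. This sketch uses hypothetical helper names and mirrors the Kubernetes defaulting rule for an omitted imagePullPolicy (Always for :latest or an untagged reference, IfNotPresent otherwise); it is an illustration, not kubelet code.

```python
def default_pull_policy(image: str) -> str:
    """Policy Kubernetes applies when imagePullPolicy is omitted from the pod spec."""
    if "@" in image:
        return "IfNotPresent"  # pinned by digest
    # Only the last path component can carry a tag; earlier colons are registry ports
    last_component = image.rsplit("/", 1)[-1]
    _, separator, tag = last_component.partition(":")
    if not separator or tag == "latest":
        return "Always"
    return "IfNotPresent"


def must_contact_registry(policy: str, present_locally: bool) -> bool:
    """Whether the registry is contacted before the container starts."""
    if policy == "Always":
        return True
    if policy == "IfNotPresent":
        return not present_locally
    if policy == "Never":
        return False
    raise ValueError(f"unknown imagePullPolicy: {policy!r}")
```

Note that `must_contact_registry("Never", False)` returns False: the pod simply fails to start, which is the behaviour the table describes for air-gapped nodes.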
Registry Authentication
Registry credentials can be provided through four mechanisms:
imagePullSecrets on Pod
spec:
  imagePullSecrets:
    - name: regcred
# Secret type: kubernetes.io/dockerconfigjson
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass
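For orientation, the payload stored in a kubernetes.io/dockerconfigjson secret is a small JSON document keyed by registry host. The sketch below builds one in Python; `docker_config_json` is a hypothetical helper, and the server and credentials are the placeholder values from the command above.

```python
import base64
import json


def docker_config_json(server: str, username: str, password: str) -> str:
    """Build the JSON stored under the .dockerconfigjson key of a pull secret."""
    # The "auth" field is base64("username:password"), as in ~/.docker/config.json
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    config = {
        "auths": {
            server: {"username": username, "password": password, "auth": auth}
        }
    }
    return json.dumps(config)


payload = docker_config_json("registry.example.com", "user", "pass")
```

Base64 here is an encoding, not encryption, so anyone who can read the secret can recover the password.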
ServiceAccount imagePullSecrets
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-sa
imagePullSecrets:
  - name: regcred
# All pods using this SA inherit these credentials
Node-level credential config
containerd's registry.configs section or a .docker/config.json on the node applies to all pods. Used for internal mirrors and air-gapped environments. See containerd config above.
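Cloud IAM (node identity)
On EKS/GKE/AKS, the node's cloud identity (instance role, node service account, or kubelet managed identity) is typically granted ECR/GCR/ACR pull permissions. No Kubernetes secrets are needed; credentials are refreshed automatically by cloud credential helpers. Image pulls use the node's identity, not pod-level identities such as IRSA or Workload Identity, which apply to the workload's own API calls.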
OCI Runtimes
runc — Reference Implementation
runc is the reference implementation of the OCI Runtime Specification, written in Go, maintained by the Open Container Initiative. It uses libcontainer (a Go library) to set up Linux namespaces and cgroups. runc is a short-lived CLI tool — it is invoked once per container create/start/delete operation, not as a persistent daemon.
# runc is typically at:
which runc # /usr/local/bin/runc or /usr/bin/runc
# Check version
runc --version
# runc version 1.1.12
# commit: v1.1.12-0-g51d5e946
# spec: 1.0.2-dev
# List containers managed by runc directly (for debugging)
# containerd keeps runc state per namespace; Kubernetes containers are in k8s.io
runc --root /run/containerd/runc/k8s.io list
crun — High-Performance Alternative
crun is a C implementation of the OCI runtime spec. It is commonly cited as using far less memory than runc (roughly 10x less) and starting containers faster (roughly 2x), which makes it attractive for high-density nodes or serverless workloads; treat these figures as approximate and benchmark on your own workload. It is the default OCI runtime for Podman on Fedora and RHEL 9, and it implements the same OCI runtime spec as runc.
# Use crun instead of runc in containerd config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = "/usr/bin/crun" # Override binary
Sandboxed Runtimes
For workloads requiring stronger isolation than Linux namespaces provide (multi-tenant platforms, untrusted code execution), sandboxed runtimes create an additional isolation layer:
kata-containers
Runs each pod inside a lightweight VM (using QEMU, Cloud Hypervisor, or Firecracker). The container process runs on the VM's own guest kernel rather than the host kernel, providing strong isolation: a container escape would also require a VM escape.
- ~100ms additional startup time
- ~128MB additional memory overhead per pod
- Requires hardware virtualization: bare metal with VT-x/AMD-V, or nested virtualization on cloud VMs
- Transparent to Kubernetes — same pod spec, different RuntimeClass
gVisor (runsc)
Intercepts syscalls from the container process and handles them in a user-space kernel written in Go (the Sentry). Container processes never directly call the Linux kernel — they go through gVisor's Sentry instead.
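- Adds overhead to every syscall (on the order of microseconds, depending on the gVisor platform); syscall-heavy and I/O-heavy workloads are affected most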
- Not all syscalls supported (but covers most common ones)
- Does not require virtualization hardware
- Used in production at Google (GKE Sandbox is built on gVisor)
RuntimeClass
RuntimeClass is the Kubernetes mechanism for selecting which OCI runtime to use per pod. A cluster can have multiple runtimes registered, and pods select one via spec.runtimeClassName.
# Define available runtimes
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-qemu
handler: kata # Maps to containerd runtime config name
overhead:
  podFixed:
    memory: "140Mi" # Extra memory reserved per pod for VM overhead
    cpu: "250m" # Scheduler accounts for this in resource requests
scheduling:
  nodeSelector:
    runtime: kata # Only schedule on nodes with kata installed
  tolerations:
    - key: runtime
      operator: Equal
      value: kata
      effect: NoSchedule
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
# Use in a pod
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: kata-qemu # or gvisor
  containers:
    - name: app
      image: myapp:v1
Note: The overhead.podFixed field tells the scheduler to add extra resource requests to the pod's declared requests when placing it. This ensures nodes are not over-committed with sandboxed pods whose actual resource usage (including the VM/Sentry overhead) exceeds what the container spec declares.
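The arithmetic the scheduler performs with the overhead can be sketched as follows. This is a simplified illustration: `effective_requests` is a hypothetical helper, quantities are plain integers (millicores and MiB) rather than Kubernetes quantity strings, and init containers are ignored.

```python
def effective_requests(container_requests, overhead):
    """Sum per-container requests and add the RuntimeClass pod overhead."""
    total = {"cpu_m": 0, "memory_mi": 0}
    for request in container_requests:
        total["cpu_m"] += request.get("cpu_m", 0)
        total["memory_mi"] += request.get("memory_mi", 0)
    total["cpu_m"] += overhead.get("cpu_m", 0)
    total["memory_mi"] += overhead.get("memory_mi", 0)
    return total


# Two hypothetical containers scheduled with the kata-qemu overhead shown above
# (250m CPU, 140Mi memory).
pod_total = effective_requests(
    [{"cpu_m": 500, "memory_mi": 256}, {"cpu_m": 250, "memory_mi": 128}],
    {"cpu_m": 250, "memory_mi": 140},
)
```

Here the pod is placed as if it requested 1000m CPU and 524Mi of memory, even though its containers declare only 750m and 384Mi.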
crictl — Runtime Debugging Tool
crictl is the CLI tool for interacting with any CRI-compatible runtime directly, bypassing kubectl. It is the primary tool for debugging containers at the node level.
# Configure crictl to use containerd
cat > /etc/crictl.yaml << 'EOF'
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: false
EOF
# List pods (sandboxes)
crictl pods
# List containers
crictl ps -a # -a includes stopped containers
# List images
crictl images
# Pull an image manually
crictl pull registry.k8s.io/pause:3.9
# Inspect a container
crictl inspect <container-id>
# Get container logs
crictl logs <container-id>
crictl logs -f <container-id> # follow
# Exec into a running container
crictl exec -it <container-id> /bin/sh
# Get pod stats
crictl statsp
# Remove all stopped containers
crictl rm $(crictl ps -a -q --state exited)
# Remove unused images
crictl rmi --prune
Image Management and Disk Pressure
Images are stored in containerd's content store at /var/lib/containerd/. Disk pressure from images is one of the most common node issues in production.
# Check total image storage
du -sh /var/lib/containerd/
du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/
# List images with sizes
crictl images --verbose | grep -E "IMAGE|SIZE"
# containerd native tool (ctr) for lower-level inspection
ctr --namespace k8s.io images ls
ctr --namespace k8s.io snapshots ls | wc -l
# Inspect disk usage per container snapshot
ctr --namespace k8s.io snapshots usage
# Force image garbage collection (kubelet GC must be configured)
# kubelet auto-GCs based on imageGCHighThresholdPercent
# To manually force: restart kubelet or trigger via kubelet API
# Identify the largest images by disk usage
crictl images -o json | jq -r '.images[] | [.repoTags[0], .size] | @tsv' \
| sort -k2 -n -r | head -10
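The kubelet image garbage collector mentioned above works from two thresholds: once disk usage reaches imageGCHighThresholdPercent, it removes the least recently used images that no container references until usage falls to imageGCLowThresholdPercent. The sketch below models that policy; the function name and the image dict fields are hypothetical, and the defaults of 85 and 80 are the kubelet defaults.

```python
def plan_image_gc(capacity, used, images, high_pct=85, low_pct=80):
    """Return image names to delete, oldest-used first, or [] if below the high mark.

    images: list of dicts with keys name, size, last_used, in_use.
    capacity, used, and size share one unit (for example bytes).
    """
    if used * 100 < high_pct * capacity:
        return []  # GC is not triggered below the high threshold
    target = capacity * low_pct // 100
    to_delete = []
    for image in sorted(images, key=lambda item: item["last_used"]):
        if used <= target:
            break
        if image["in_use"]:
            continue  # images referenced by a container are never removed
        to_delete.append(image["name"])
        used -= image["size"]
    return to_delete
```

If every unused image has been removed and usage is still above the low threshold, the node stays under disk pressure and the remaining space has to be reclaimed elsewhere (logs, writable layers, or a larger disk).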
cgroup Driver Consistency
Both kubelet and containerd must use the same cgroup driver. The two options are:
| Driver | Cgroup hierarchy | Recommended |
|---|---|---|
| systemd | Cgroups are managed as systemd units/slices. Hierarchy: kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<uid>.slice/cri-containerd-<container-id>.scope | Yes; use on systemd-based distros (Ubuntu 22.04+, RHEL 8+, Fedora) |
| cgroupfs | kubelet writes directly to /sys/fs/cgroup/. Hierarchy: /kubepods/besteffort/pod<uid>/<container> | Legacy; avoid on cgroups v2, where it can cause double-management issues with systemd |
# Verify cgroup driver agreement
# kubelet:
cat /var/lib/kubelet/config.yaml | grep cgroupDriver
# containerd:
crictl info | grep -i cgroup
# or:
grep -A3 "runc.options" /etc/containerd/config.toml | grep SystemdCgroup
# Check if cgroups v2 is active
stat -fc %T /sys/fs/cgroup/
# tmpfs = cgroups v1
# cgroup2fs = cgroups v2
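The same agreement check can be scripted. The sketch below is a text-matching approximation of the grep commands above, not a full YAML or TOML parser: the function names are hypothetical, it assumes a single SystemdCgroup setting in the containerd config, and it assumes kubelet falls back to cgroupfs when cgroupDriver is not set.

```python
import re


def kubelet_cgroup_driver(config_yaml: str) -> str:
    """Extract cgroupDriver from kubelet config text (assumed default: cgroupfs)."""
    match = re.search(r"^cgroupDriver:\s*[\"']?(\w+)", config_yaml, re.MULTILINE)
    return match.group(1) if match else "cgroupfs"


def containerd_cgroup_driver(config_toml: str) -> str:
    """Map the first SystemdCgroup setting in containerd config text to a driver name."""
    match = re.search(r"^\s*SystemdCgroup\s*=\s*(true|false)", config_toml, re.MULTILINE)
    return "systemd" if match and match.group(1) == "true" else "cgroupfs"


def drivers_agree(kubelet_yaml: str, containerd_toml: str) -> bool:
    return kubelet_cgroup_driver(kubelet_yaml) == containerd_cgroup_driver(containerd_toml)
```

On a node, the inputs would be the contents of /var/lib/kubelet/config.yaml and /etc/containerd/config.toml; a False result points at the mismatch covered in Runbook 3.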
Prometheus Metrics
| Metric | Source | Description |
|---|---|---|
| container_tasks_state | cAdvisor | Container process states (running, sleeping, stopped, etc.) |
| container_memory_working_set_bytes | cAdvisor | Working set memory; what kubelet uses for memory-pressure eviction decisions (the kernel OOM killer acts on the cgroup memory limit instead) |
| container_cpu_usage_seconds_total | cAdvisor | Cumulative CPU time; rate = current CPU usage |
| container_fs_usage_bytes | cAdvisor | Container writable layer disk usage |
| containerd_snapshots_total | containerd | Total snapshots in the content store (proxy for image count) |
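Since container_cpu_usage_seconds_total is a cumulative counter, current CPU usage is the difference between two samples divided by the time between them. The sketch below shows that arithmetic; `cpu_cores_used` is a hypothetical helper, and each sample is a (unix_timestamp, counter_value) pair.

```python
def cpu_cores_used(earlier, later):
    """Average CPU cores consumed between two counter samples.

    Returns None when the counter went backwards (for example after a container
    restart), because the delta is then meaningless.
    """
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        return None
    return (v1 - v0) / (t1 - t0)
```

For example, 30 CPU-seconds accumulated over a 60-second window corresponds to an average of 0.5 cores, which is the same calculation a Prometheus rate() over that window would perform.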
Troubleshooting Runbooks
Runbook 1: Image pull failure — ErrImagePull / ImagePullBackOff
# 1. Get pod events
kubectl describe pod my-pod | grep -A10 Events
# 2. Common causes:
# a) Image doesn't exist or wrong tag
crictl pull myapp:v99 # Test pull directly
# Fix: check image name spelling and registry
# b) Registry auth failure (401)
# Events: "unauthorized: authentication required"
kubectl get secret regcred -o jsonpath='{.data.\.dockerconfigjson}' \
| base64 -d | jq .
# Fix: recreate secret with correct credentials
# c) Registry unreachable (network issue)
# Test from node:
curl -v https://registry.example.com/v2/
# Fix: check DNS, firewall, proxy settings on node
# d) Certificate error (TLS)
# Events: "x509: certificate signed by unknown authority"
# Fix: add CA to containerd registry config or node trust store
# In /etc/containerd/config.toml:
# [plugins."io.containerd.grpc.v1.cri".registry.configs."registry.example.com".tls]
# ca_file = "/etc/ssl/certs/custom-ca.crt"
# e) Rate limiting (Docker Hub)
# Events: "toomanyrequests: You have reached your pull rate limit"
# Fix: use registry mirror, or authenticate with Docker Hub credentials
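The triage in this runbook amounts to matching the event message against a handful of known substrings. A rough Python sketch follows; the function name is hypothetical, the quoted messages come from the runbook, and the "not found", "no such host", and "i/o timeout" patterns are assumptions about typical error text rather than an exhaustive list.

```python
KNOWN_CAUSES = [
    ("unauthorized", "registry auth failure: recreate the pull secret"),
    ("x509", "TLS trust failure: add the registry CA to containerd or the node"),
    ("toomanyrequests", "rate limited: use a mirror or authenticated pulls"),
    ("not found", "image or tag missing: check the image reference"),
    ("no such host", "registry unreachable: check DNS on the node"),
    ("i/o timeout", "registry unreachable: check firewall or proxy settings"),
]


def classify_pull_error(message: str) -> str:
    """Map an ErrImagePull event message to the most likely cause."""
    lowered = message.lower()
    for needle, cause in KNOWN_CAUSES:
        if needle in lowered:
            return cause
    return "unknown: inspect containerd logs on the node"
```

The order of the list matters only when a message matches more than one pattern; the first match wins.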
Runbook 2: containerd unresponsive — pods stuck creating
# 1. Check containerd service
systemctl status containerd
journalctl -u containerd -n 50 --no-pager
# 2. Test CRI socket
crictl --timeout 3s ps # --timeout is a global flag and goes before the subcommand
# If times out: containerd is not responding
# 3. Check for deadlocks or high load
top -p $(pidof containerd)
strace -p $(pidof containerd) -c -f & # Summary of syscalls
# 4. Check containerd goroutine dump
kill -USR1 $(pidof containerd)
# Goroutine dump appears in containerd logs
# 5. Check for disk full (common root cause)
df -h /var/lib/containerd
df -h /run/containerd
# 6. Restart containerd (safe — running containers survive via shim)
systemctl restart containerd
# 7. If still unresponsive after restart, check kernel messages
dmesg | grep -iE "containerd|cgroup|oom" | tail -20
Runbook 3: cgroup driver mismatch — containers fail to start
# Symptom: kubelet logs show errors like:
# "failed to create containerd task: failed to create shim task:
# OCI runtime create failed: ... cgroupv2: cgroup path ... is not valid"
# OR pods stuck in ContainerCreating forever
# 1. Check both sides
cat /var/lib/kubelet/config.yaml | grep cgroupDriver
crictl info | grep -A2 cgroupDriver
# 2. Fix containerd config
vim /etc/containerd/config.toml
# Set: SystemdCgroup = true (if kubelet uses systemd)
# OR: SystemdCgroup = false (if kubelet uses cgroupfs)
# 3. Restart containerd
systemctl restart containerd
# 4. Verify agreement
crictl info | grep cgroupDriver
# Should match kubelet config
# 5. Check cgroups v2 (requires systemd driver)
stat -fc %T /sys/fs/cgroup/
# cgroup2fs → must use systemd driver
Runbook 4: Disk full — image storage exhausted
# Symptom: DiskPressure on node, new pods cannot start
# 1. Check disk usage
df -h /var/lib/containerd
# 2. Find large images
# crictl reports .size as a string, so convert it before doing arithmetic
crictl images -o json | jq -r '.images[] |
"\(.size | tonumber / 1048576 | floor)MB\t\(.repoTags[0])"' \
| sort -rn | head -20
# 3. Remove unused images immediately
crictl rmi --prune
# 4. Find containers with large writable layers
crictl stats -a | sort -k4 -rn | head -10
# 5. Remove exited containers holding disk
crictl rm $(crictl ps -aq --state exited)
# 6. If kubelet imageGC is not working, check thresholds
cat /var/lib/kubelet/config.yaml | grep -E "imageGC|High|Low"
# Lower threshold: imageGCHighThresholdPercent: 70
# 7. Emergency: remove all unused containerd data
# WARNING: this removes stopped containers and unused images
ctr --namespace k8s.io images prune --all
Runbook 5: RuntimeClass not working — pod using wrong runtime
# 1. Verify RuntimeClass object exists
kubectl get runtimeclass
kubectl describe runtimeclass kata-qemu
# 2. Verify containerd has the handler configured
grep -A5 "kata" /etc/containerd/config.toml
# 3. Check if kata/gvisor binary is installed
which containerd-shim-kata-v2 # or
ls /usr/local/bin/containerd-shim-*
# 4. Check pod's runtimeClassName is set
kubectl get pod my-pod -o jsonpath='{.spec.runtimeClassName}'
# 5. Check pod events for runtime errors
kubectl describe pod my-pod | grep -A10 Events
# "failed to create containerd task: ... kata: ... binary not found"
# 6. Verify node selector in RuntimeClass matches node labels
kubectl describe runtimeclass kata-qemu | grep -A5 Scheduling
kubectl get node worker-1 --show-labels | grep runtime
# 7. Check nested virtualization (for kata-qemu on cloud VMs)
grep -cE 'vmx|svm' /proc/cpuinfo # Should be > 0 (vmx = Intel VT-x, svm = AMD-V)
# On AWS: metal instances or those with nested-virt enabled
# On GCP: --enable-nested-virtualization node flag
Production Best Practices
- Use containerd 1.7+ with cgroups v2 and SystemdCgroup = true. cgroups v2 provides better memory accounting, I/O rate limiting, and pressure events. Ensure both kubelet and containerd use the systemd cgroup driver.
- Pin image digests in production, not tags. Tags are mutable; myapp@sha256:abc123 is immutable. This ensures reproducible deployments and prevents different nodes from pulling different versions.
- Configure containerd registry mirrors for Docker Hub. Docker Hub rate-limits pulls (the long-standing limits were 100 pulls/6h for unauthenticated clients and 200 for free accounts; check Docker's current policy). Use mirror.gcr.io or a private registry mirror to avoid ImagePullBackOff at scale.
- Set discard_unpacked_layers = true in containerd config. This lets containerd garbage-collect compressed layer blobs from the content store once they have been unpacked into the snapshotter, saving significant disk space (typically 30-50% of image store size).
- Use RuntimeClass with overhead for sandboxed runtimes. Without overhead registration, the scheduler treats kata-containers pods as if they have no VM overhead, leading to node over-commitment and out-of-memory failures on the hypervisor.
- Monitor containerd metrics and watch for slow CRI calls. kubelet_runtime_operations_duration_seconds (exposed by kubelet for runtime operations) should be well under 1s for container create/start operations. Slow CRI calls cascade into PLEG health failures.
- Pre-pull critical images on nodes using DaemonSets or node startup scripts. Large images (500MB+) can delay pod startup by 30-60s on cold nodes. Pre-pulling ensures images are available before the first pod requires them.
- Avoid imagePullPolicy: Always for immutably-tagged images in production. It adds registry latency to every pod start, increases registry load, and provides no benefit if the image digest hasn't changed.