Container Orchestration Foundations
Kubernetes orchestrates containers, but containers are not a Kubernetes concept —
they are ordinary Linux processes isolated by six kernel primitives.
This document explains every layer of the container stack from the Linux kernel up to the
CRI boundary where Kubernetes hands off control. Without this foundation, the behaviour of
kubelet, CRI plugins, resource limits, and security policies cannot be fully understood.
1 · What Is a Container?
A container is not a virtual machine. It is a Linux process (or process group) that the kernel has isolated and resource-constrained using a combination of:
- Namespaces — what the process can see (other processes, network interfaces, file system mounts, hostname, etc.)
- cgroups — how much resource it can consume (CPU, memory, I/O, PIDs)
- Union file system / overlay FS — the isolated, layered file system the process boots into
- Capabilities — which privileged kernel operations it is allowed to call
- Seccomp — which system calls it may invoke
- AppArmor / SELinux — mandatory access control policy over file and network access
The process itself is unmodified. The kernel surrounds it with these isolation layers
transparently. A containerised nginx binary is the exact same ELF binary as
one running on bare metal — the container is a property of the runtime environment, not the application.
2 · Linux Namespaces
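Linux namespaces provide the visibility isolation for containers. They arrived incrementally: mount namespaces in kernel 2.4.19 (2002), and user namespaces, the last piece needed for unprivileged containers, in kernel 3.8 (2013). There are currently 8 namespace types: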
| Namespace | Kernel flag | What it isolates | Container use |
|---|---|---|---|
| pid | CLONE_NEWPID | Process ID space; PID 1 inside the namespace is unrelated to host PID 1 | Container init is PID 1 inside; can't see host processes |
| net | CLONE_NEWNET | Network interfaces, routing tables, iptables rules, sockets | Each Pod gets its own eth0, loopback, IP address |
| mnt | CLONE_NEWNS | Mount points (file system tree) | Container root FS is isolated; can't see host mounts |
| uts | CLONE_NEWUTS | Hostname and domain name (uname) | Container can have its own hostname independent of host |
| ipc | CLONE_NEWIPC | System V IPC, POSIX message queues | Containers share IPC only if in same Pod (same IPC ns) |
| user | CLONE_NEWUSER | UID/GID mappings; root inside can be non-root outside | Rootless containers; UID 0 inside → an unprivileged UID on the host (unmapped IDs show up as 65534) |
| cgroup | CLONE_NEWCGROUP | cgroup root; container sees a virtualised cgroup hierarchy | Container cannot see or escape host cgroup limits |
| time | CLONE_NEWTIME | System clock offsets (CLOCK_MONOTONIC, CLOCK_BOOTTIME) | Rarely used in K8s currently; useful for checkpoint/restore |
2.1 How Pods Share Namespaces
A Pod is, at the Linux level, a shared namespace envelope.
The kubelet creates a pause container (also called the "infra container") first.
All other containers in the Pod join the pause container's net, ipc, and uts
namespaces. Each container keeps its own mnt namespace (separate root FS).
| Namespace type | Shared across Pod? | Notes |
|---|---|---|
| net | Yes — all containers share one network namespace | All containers on localhost; same Pod IP; ports must not clash |
| ipc | Yes — by default | Enables shared memory between containers in the same Pod |
| uts | Yes — shared hostname | Pod hostname = Pod name by default |
| mnt | No — each container has its own rootfs | Volumes are bind-mounted into individual container mnt namespaces |
| pid | Optional (shareProcessNamespace: true) | When enabled, all containers share one PID namespace and can see and signal each other's processes |
| user | Per Pod when hostUsers: false; otherwise the host user namespace | Relevant for rootless pods and the UserNamespacesSupport feature gate |
# Inspect namespaces for a running container on a node
CONTAINER_PID=$(crictl inspect <containerID> | jq -r .info.pid)
ls -la /proc/$CONTAINER_PID/ns/
# Output (each symlink = namespace inode):
# net -> net:[4026532345] # unique = isolated network namespace
# pid -> pid:[4026532346]
# mnt -> mnt:[4026532347]
# uts -> uts:[4026532348]
# Verify two containers in same Pod share net namespace
# (they will have same net:[...] inode)
CONTAINER_A_PID=1234; CONTAINER_B_PID=1235
ls -la /proc/$CONTAINER_A_PID/ns/net
ls -la /proc/$CONTAINER_B_PID/ns/net
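The inode comparison above can also be scripted. The following is a minimal Python sketch, assuming a Linux-style /proc layout; the helper names `namespace_id` and `share_namespace` and the `proc_root` parameter are illustrative, not part of any Kubernetes tooling.

```python
import os


def namespace_id(pid, ns="net", proc_root="/proc"):
    """Return the namespace identifier a process points at, e.g. 'net:[4026532345]'."""
    return os.readlink(os.path.join(proc_root, str(pid), "ns", ns))


def share_namespace(pid_a, pid_b, ns="net", proc_root="/proc"):
    """True when both processes reference the same namespace inode."""
    return namespace_id(pid_a, ns, proc_root) == namespace_id(pid_b, ns, proc_root)


if __name__ == "__main__":
    # Only meaningful on Linux, where /proc/<pid>/ns/ exists.
    if os.path.exists("/proc/self/ns/net"):
        me = os.getpid()
        # A process trivially shares every namespace with itself.
        print(share_namespace(me, me, "net"))
```

On a node, passing the PIDs of two containers from the same Pod should return True for net and ipc, and False for mnt. Reading another process's namespace links usually requires root.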
2.2 The Pause Container
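The pause container (image: registry.k8s.io/pause:3.9, ~700 KB) does
almost nothing: it installs a few signal handlers and then sleeps in pause():
// pause.c (simplified sketch of the upstream program)
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>
// SIGINT/SIGTERM: exit so the sandbox can be stopped promptly
static void sigdown(int sig) { (void)sig; _exit(0); }
// SIGCHLD: reap zombies re-parented to PID 1 (relevant when the PID namespace is shared)
static void sigreap(int sig) { (void)sig; while (waitpid(-1, NULL, WNOHANG) > 0) {} }
int main(void) {
    signal(SIGINT, sigdown);
    signal(SIGTERM, sigdown);
    signal(SIGCHLD, sigreap);
    // Sleep between signals; the process exists only to keep the namespaces alive
    for (;;) pause();
    return 0;
}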
Its sole purpose is to be the namespace anchor: as long as the pause container is alive, the Pod's network and IPC namespaces remain alive. App containers can crash and restart without losing the network namespace (and thus without losing the Pod IP address).
3 · Control Groups (cgroups)
cgroups (Control Groups) are a Linux kernel feature for
resource accounting and enforcement over groups of processes.
Kubernetes uses them to implement container requests and limits.
3.1 cgroup v1 Subsystems
| Subsystem | Controls | K8s use |
|---|---|---|
| cpu | CPU time shares and CFS quota | resources.requests.cpu → cpu.shares; resources.limits.cpu → cpu.cfs_quota_us |
| cpuset | Which CPUs/NUMA nodes a process may use | CPU Manager static policy pins containers to dedicated CPU cores |
| memory | Memory usage limits, OOM score | resources.limits.memory → memory.limit_in_bytes; OOM kill if exceeded |
| blkio | Block I/O bandwidth and IOPS throttling | Optionally throttle disk I/O per container (not default K8s) |
| pids | Maximum number of PIDs in the group | podPidsLimit kubelet setting (--pod-max-pids flag); prevents fork bombs |
| net_cls | Tags packets with classid for tc (traffic control) | Some CNI plugins use for bandwidth enforcement |
| devices | Allow/deny device file access | GPU device plugin, /dev/fuse access control |
| hugetlb | Huge page memory limits | HugePage resources on Pods (DPDK, databases) |
3.2 cgroup v2 (Unified Hierarchy)
cgroup v2 (merged in kernel 4.5, default on most distros since 2021) replaces the fragmented v1 subsystem tree with a single unified hierarchy. Kubernetes declared cgroup v2 stable in v1.25.
cgroup v1 layout
/sys/fs/cgroup/
  cpu/kubepods/pod-uid/container-id/
  memory/kubepods/pod-uid/container-id/
  pids/kubepods/pod-uid/container-id/
  blkio/kubepods/pod-uid/container-id/
  cpuset/kubepods/pod-uid/container-id/
# Each subsystem in its own tree
# Controllers are mounted separately
cgroup v2 layout
/sys/fs/cgroup/
  kubepods.slice/
    kubepods-pod<uid>.slice/
      <containerID>/
        cpu.max           # replaces cpu.cfs_quota_us
        cpu.weight        # replaces cpu.shares
        memory.max        # replaces memory.limit_in_bytes
        memory.swap.max
        io.max            # combined blkio
        pids.max
# Single unified tree; all controllers in one dir
3.3 Kubernetes cgroup Hierarchy
The kubelet creates cgroups in a three-level hierarchy:
/sys/fs/cgroup/
└── kubepods/                       ← all K8s workloads
    ├── besteffort/                 ← BestEffort QoS Pods
    ├── burstable/                  ← Burstable QoS Pods
    │   └── pod<pod-uid>/           ← per-Pod cgroup
    │       ├── <pause-id>/         ← pause container
    │       └── <container-id>/     ← app container cgroup
    └── pod<pod-uid>/               ← Guaranteed QoS Pods sit directly under kubepods/
        └── <container-id>/
# System (non-K8s) processes live in:
/sys/fs/cgroup/system.slice/
/sys/fs/cgroup/user.slice/
3.4 How CPU Limits Work Internally
When you set resources.limits.cpu: "500m", the kubelet writes:
# cgroup v1
echo 50000 > /sys/fs/cgroup/cpu/kubepods/.../cpu.cfs_quota_us
# (100000 us is 1 CPU; 50000 us = 500m = 0.5 CPU per 100ms period)
# cgroup v2
echo "50000 100000" > /sys/fs/cgroup/kubepods/.../cpu.max
# ^quota ^period (microseconds)
# CPU *request* (500m) maps to cpu.shares (v1) or cpu.weight (v2):
# shares = max(2, floor(milliCPU * 1024 / 1000))
# weight (v2) = 1 + floor((shares - 2) * 9999 / 262142) # range 1-10000
CPU throttling: a container with limits.cpu: "1" that tries to use 1.5 CPUs for a burst
will be throttled by the CFS scheduler. Once its quota is spent, the kernel withholds CPU time for the remainder
of the 100ms period. This causes latency spikes even on machines with plenty of idle CPU.
To detect it, watch the Prometheus metric container_cpu_cfs_throttled_seconds_total (kubectl top pods shows usage, not throttling).
Consider setting no CPU limit (request only) for latency-sensitive apps.
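The quota, shares, and weight arithmetic from this section can be checked with a few lines of Python. This is a sketch of the formulas exactly as written above, assuming the default 100 ms CFS period; the function names are illustrative and this is not the kubelet's actual code.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms, in microseconds


def cpu_quota_us(limit_millicpu, period_us=CFS_PERIOD_US):
    """CPU limit in millicores -> CFS quota per period (cgroup v1 cpu.cfs_quota_us)."""
    return limit_millicpu * period_us // 1000


def cpu_max_line(limit_millicpu, period_us=CFS_PERIOD_US):
    """CPU limit in millicores -> the 'quota period' string written to cgroup v2 cpu.max."""
    return f"{cpu_quota_us(limit_millicpu, period_us)} {period_us}"


def cpu_shares(request_millicpu):
    """CPU request in millicores -> cgroup v1 cpu.shares (never below 2)."""
    return max(2, request_millicpu * 1024 // 1000)


def cpu_weight(shares):
    """cgroup v1 shares -> cgroup v2 cpu.weight, using the linear mapping shown above."""
    return 1 + (shares - 2) * 9999 // 262142


if __name__ == "__main__":
    print(cpu_quota_us(500))             # 50000
    print(cpu_max_line(500))             # 50000 100000
    print(cpu_shares(500))               # 512
    print(cpu_weight(cpu_shares(500)))   # 20
```

A 500m request therefore lands at weight 20 on the 1 to 10000 scale, which is a reminder that weights only matter relative to sibling cgroups under contention.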
3.5 Memory Limits and OOM Killer
When a container exceeds resources.limits.memory, the kernel OOM killer
terminates a process in the container's cgroup. The kubelet detects the exit (via CRI container status) and the
container is restarted according to restartPolicy. This appears as
OOMKilled in kubectl describe pod.
# Inspect OOM kills
kubectl describe pod <name> | grep -A5 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# Check kernel OOM log
dmesg | grep -i oom
# [123456.789] oom-kill:constraint=CONSTRAINT_MEMCG,task=java,...
# Check container memory usage vs limit
kubectl top pod <name> --containers
# cgroup v1 (on cgroup v2 the equivalent files are memory.current and memory.max)
cat /sys/fs/cgroup/memory/kubepods/.../memory.usage_in_bytes
cat /sys/fs/cgroup/memory/kubepods/.../memory.limit_in_bytes
4 · OCI Specifications
The Open Container Initiative (OCI), founded by Docker and CoreOS in 2015 under the Linux Foundation, standardised two specifications that define what a container is:
4.1 OCI Image Specification
An OCI image is a layered, content-addressable bundle of:
- Image manifest — JSON document listing layers and the image config
- Image config — JSON with entrypoint, env vars, labels, architecture, OS, working dir
- Layers — tar.gz archives of filesystem diffs; each layer is identified by its SHA-256 digest
OCI image manifest structure (annotated JSON)
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:abc123...",   // Points to image config JSON
    "size": 7023
  },
  "layers": [
    {
      // Base OS layer (e.g., debian:bookworm rootfs)
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:def456...",
      "size": 27141519
    },
    {
      // apt-get install nginx layer
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:789abc...",
      "size": 2342343
    },
    {
      // COPY nginx.conf layer
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:cdef01...",
      "size": 1024
    }
  ]
}
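Content addressing means every descriptor in the manifest can be verified against the bytes it points to. The sketch below, with hypothetical helper names `digest`, `descriptor`, and `verify`, shows the idea using only the standard library; real registries and runtimes add media-type handling and streaming that are omitted here.

```python
import hashlib


def digest(blob: bytes) -> str:
    """Return an OCI-style digest string for a blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def descriptor(blob: bytes, media_type: str) -> dict:
    """Build a descriptor (mediaType, digest, size) like those in the manifest above."""
    return {"mediaType": media_type, "digest": digest(blob), "size": len(blob)}


def verify(blob: bytes, desc: dict) -> bool:
    """Check that a downloaded blob matches the size and digest its descriptor promises."""
    return len(blob) == desc["size"] and digest(blob) == desc["digest"]


if __name__ == "__main__":
    layer = b"pretend this is a gzipped tar layer"
    desc = descriptor(layer, "application/vnd.oci.image.layer.v1.tar+gzip")
    print(verify(layer, desc))                # True
    print(verify(layer + b"tampered", desc))  # False
```

Because the digest is derived from the content, two images that share a base layer reference the same blob, and a node only needs to store it once.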
4.2 Image Layers and Copy-on-Write
Container images are composed of read-only layers stacked using a
union file system (typically overlayfs on Linux, exposed as Docker's overlay2 driver or containerd's overlayfs snapshotter).
When a container starts, a thin read-write layer is added on top.
# Inspect overlayfs mounts for a running container (on the node)
CONTAINER_ID="abc123..."
# containerd stores snapshots in:
ls /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/
# View the actual overlay mount for a container
cat /proc/mounts | grep overlay
# overlay / overlay rw,lowerdir=.../sha256:..,upperdir=...,workdir=...
# Docker legacy path:
ls /var/lib/docker/overlay2/
# Check disk usage per image layer
du -sh /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/*
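Copy-on-write is easier to reason about with a toy model. The Python sketch below imitates how a union mount resolves paths: reads fall through from the writable upper layer to the read-only lower layers, writes land only in the upper layer, and deletes are recorded as whiteouts. The `OverlayView` class is a hypothetical illustration using dictionaries, not how overlayfs is implemented in the kernel.

```python
WHITEOUT = object()  # marker meaning "this path was deleted in the upper layer"


class OverlayView:
    """Toy union view: read-only lower layers plus one writable upper layer."""

    def __init__(self, lower_layers):
        self.lowers = list(lower_layers)  # bottom-first list of {path: content} dicts
        self.upper = {}                   # the container's thin read-write layer

    def read(self, path):
        # The upper layer wins; a whiteout hides anything underneath.
        if path in self.upper:
            value = self.upper[path]
            if value is WHITEOUT:
                raise FileNotFoundError(path)
            return value
        # Otherwise search the image layers from top to bottom.
        for layer in reversed(self.lowers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, content):
        # Writes never touch image layers; the new content goes to the upper layer.
        self.upper[path] = content

    def delete(self, path):
        self.read(path)              # raises FileNotFoundError if the path is not visible
        self.upper[path] = WHITEOUT  # hide the lower copy without modifying it


if __name__ == "__main__":
    base = {"/etc/nginx/nginx.conf": "default", "/bin/sh": "busybox"}
    app = {"/etc/nginx/nginx.conf": "custom"}
    view = OverlayView([base, app])
    print(view.read("/etc/nginx/nginx.conf"))  # custom
    view.write("/bin/sh", "patched")
    print(view.read("/bin/sh"), base["/bin/sh"])  # patched busybox
```

The image layers stay untouched throughout, which is why many containers can share the same image on disk.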
4.3 OCI Runtime Specification
The OCI runtime spec defines the interface between a high-level runtime and a low-level runtime. It specifies a bundle directory structure and a lifecycle API:
OCI Bundle layout:
bundle/
config.json ← Generated by high-level runtime (containerd)
rootfs/ ← Unpacked image layers (merged overlayfs)
config.json contains:
- ociVersion
- root.path (path to rootfs)
- process (args, env, cwd, user, capabilities, rlimits, noNewPrivileges)
- hostname
- mounts (bind mounts for volumes, /dev, /proc, /sys)
- hooks (prestart, createRuntime, createContainer, startContainer, poststart, poststop)
- linux.namespaces (which namespaces to create/join)
- linux.cgroupsPath
- linux.resources (cpu, memory, devices, hugepageLimits)
- linux.seccomp (default action plus per-syscall allow/deny rules)
- linux.maskedPaths, linux.readonlyPaths
4.4 OCI Container Lifecycle (State Machine)
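The runtime spec gives a container one of four status values: creating, created, running, and stopped. The operations create, start, kill, and delete move it between them. The following Python sketch models that as a small state machine; the `Container` class and the extra "deleted" label are illustrative assumptions, and kill is simplified to always end the process.

```python
class InvalidTransition(Exception):
    """Raised when an operation is not allowed in the current state."""


# operation -> {status before: status after}
TRANSITIONS = {
    "create": {"creating": "created"},   # runtime finished setting up the environment
    "start": {"created": "running"},     # user process is executed
    "kill": {"created": "stopped", "running": "stopped"},  # simplified: the signal ends the process
    "delete": {"stopped": "deleted"},    # "deleted" is our label, not a spec status
}


class Container:
    def __init__(self, container_id):
        self.id = container_id
        self.status = "creating"

    def apply(self, operation):
        """Apply one lifecycle operation or raise InvalidTransition."""
        allowed = TRANSITIONS.get(operation, {})
        if self.status not in allowed:
            raise InvalidTransition(f"cannot {operation} a container that is {self.status}")
        self.status = allowed[self.status]
        return self.status


if __name__ == "__main__":
    c = Container("mycontainer")
    for op in ("create", "start", "kill", "delete"):
        print(op, "->", c.apply(op))
```

The rejected transitions matter as much as the allowed ones: starting a container that is not in the created state, or deleting one that has not stopped, is an error.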
5 · runc — The OCI Low-Level Runtime
runc is the reference implementation of the OCI runtime spec, written in Go,
and is the default low-level runtime used by both containerd and CRI-O.
5.1 What runc Does
- Reads the OCI bundle (config.json + rootfs/)
- Creates Linux namespaces (clone() or unshare() syscalls)
- Moves the process into the correct cgroup (cgroupsPath)
- Applies seccomp profile (loads BPF program via seccomp() syscall)
- Drops capabilities to the configured set
- Applies AppArmor/SELinux profile
- Sets up mounts (pivot_root or chroot to rootfs)
- Executes the container's init process (exec())
# runc command surface (used internally by containerd)
runc create --bundle /path/to/bundle mycontainer
runc start mycontainer
runc state mycontainer # {"ociVersion":"...","id":"...","status":"running","pid":12345,...}
runc list # all containers managed by this runc instance
runc kill mycontainer SIGTERM
runc delete mycontainer
# Run runc manually (debugging)
mkdir -p /tmp/mycontainer/rootfs
# Extract a rootfs tarball
tar xf alpine.tar -C /tmp/mycontainer/rootfs
# Generate a default config.json
cd /tmp/mycontainer && runc spec
# Edit config.json as needed, then:
runc run mycontainer
5.2 Alternative OCI Runtimes
| Runtime | Technology | Use case | K8s RuntimeClass |
|---|---|---|---|
| runc | Linux namespaces + cgroups | Default, general purpose | runc |
| crun | Same model as runc, written in C; lower memory footprint and faster startup | High-density nodes, startup performance | crun |
| gVisor (runsc) | User-space kernel intercepting syscalls (Go) | Untrusted multi-tenant workloads; GKE Sandbox | gvisor |
| Kata Containers | Lightweight VM per Pod (QEMU/Cloud Hypervisor/Firecracker) | Strongest isolation; financial/regulated workloads | kata |
| Nabla (runnc) | Library OS (unikernel) confined by a strict seccomp policy | Research; minimal syscall surface | nabla |
| youki | runc re-implementation in Rust | Memory safety, active development | youki |
# Using RuntimeClass to select a non-default runtime
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc                # Matches handler name configured in containerd/CRI-O
scheduling:
  nodeSelector:
    sandbox: gvisor           # Only schedule on nodes with gVisor installed
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app
spec:
  runtimeClassName: gvisor
  containers:
  - name: app
    image: myapp:latest
6 · Container Runtime Interface (CRI)
The CRI is a gRPC API defined by Kubernetes (SIG Node) that decouples
the kubelet from any specific container runtime. It was introduced in Kubernetes v1.5
and standardised in v1.9. Every CRI-compliant runtime exposes a Unix socket at a
well-known path (/run/containerd/containerd.sock,
/var/run/crio/crio.sock, etc.).
6.1 CRI gRPC Services
The CRI defines two gRPC services:
// RuntimeService — manages Pod sandboxes and containers
service RuntimeService {
// Sandbox lifecycle
rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse);
rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse);
rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse);
rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse);
// Container lifecycle
rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
rpc StopContainer(StopContainerRequest) returns (StopContainerResponse);
rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse);
rpc ListContainers(ListContainersRequest) returns (ListContainersResponse);
rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse);
// Exec / attach / port-forward
rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse);
rpc Exec(ExecRequest) returns (ExecResponse); // returns URL for streaming
rpc Attach(AttachRequest) returns (AttachResponse); // returns URL for streaming
rpc PortForward(PortForwardRequest) returns (PortForwardResponse);
// Stats
rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse);
rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse);
}
// ImageService — manages images
service ImageService {
rpc ListImages(ListImagesRequest) returns (ListImagesResponse);
rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse);
rpc PullImage(PullImageRequest) returns (PullImageResponse);
rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse);
rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse);
}
6.2 Debugging with crictl
crictl is the CLI for the CRI — it speaks directly to the CRI socket,
bypassing Kubernetes entirely. Essential for node-level debugging.
# Configure crictl to use the right socket
cat /etc/crictl.yaml
# runtime-endpoint: unix:///run/containerd/containerd.sock
# image-endpoint: unix:///run/containerd/containerd.sock
# List running Pod sandboxes
crictl pods
# List containers (all states)
crictl ps -a
# Get container details (full JSON from CRI)
crictl inspect <containerID>
# Get Pod sandbox details
crictl inspectp <podID>
# Pull an image
crictl pull nginx:1.25
# List images
crictl images
# Show image filesystem usage (disk space and inodes used by images)
crictl imagefsinfo
# Execute a command in a container (via CRI)
crictl exec -it <containerID> sh
# View container logs via CRI (same as kubectl logs)
crictl logs <containerID>
# Container stats
crictl stats
# Stop and remove a container (without kubectl — for emergencies)
crictl stop <containerID>
crictl rm <containerID>
7 · containerd
containerd is the CNCF-graduated, OCI-compliant, high-level container runtime
and the most widely deployed CRI implementation, particularly since dockershim was removed in Kubernetes v1.24.
It was originally extracted from Docker as its core runtime component.
7.1 containerd Architecture
7.2 containerd-shim — Why It Exists
The shim (containerd-shim-runc-v2) is a small child process
that sits between containerd and runc. It exists for two critical reasons:
- Daemonless containers: runc exits after starting the container init process. The shim becomes the parent of the container process, handling stdio forwarding and exit code collection. If containerd restarts, the shim and container keep running.
- Reattachment: When containerd restarts, it can reconnect to the shim via a Unix socket (/run/containerd/containerd.sock.ttrpc-style), recovering container state without restarting containers.
7.3 Key containerd Paths and Configuration
# containerd config
cat /etc/containerd/config.toml
# Important config sections:
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "registry.k8s.io/pause:3.9"
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true # REQUIRED when kubelet uses systemd cgroup driver
# Data paths
/var/lib/containerd/ # all containerd data
io.containerd.content.v1.content/blobs/sha256/ # image layer blobs
io.containerd.snapshotter.v1.overlayfs/ # snapshot layers
io.containerd.metadata.v1.bolt/meta.db # bbolt metadata DB
# Runtime socket
/run/containerd/containerd.sock
# Check containerd service
systemctl status containerd
journalctl -u containerd -f
8 · CRI-O
CRI-O is a lightweight, Kubernetes-only CRI implementation, a CNCF project originally developed at Red Hat. Unlike containerd, it exposes no general-purpose daemon API beyond CRI and is not intended for use outside Kubernetes. This intentional narrowness keeps it smaller and easier to audit.
| Dimension | containerd | CRI-O |
|---|---|---|
| Scope | General-purpose container runtime (Docker, nerdctl, K8s) | Kubernetes CRI only |
| Architecture | Daemon + plugins + shim | Daemon + conmon (container monitor) + OCI runtime |
| Image storage | Own content store + snapshotters | containers/storage library |
| Default runtime | runc via containerd-shim-runc-v2 | runc via conmon |
| Adoption | Default on GKE, AKS, EKS, most cloud K8s | Default on OpenShift (RHCOS), some bare-metal deployments |
| Config path | /etc/containerd/config.toml | /etc/crio/crio.conf |
| Socket path | /run/containerd/containerd.sock | /var/run/crio/crio.sock |
| Release cadence | Independent release cycle, not tied to K8s minor versions | Releases in lockstep with Kubernetes minor versions |
9 · Container Security Features
9.1 Linux Capabilities
Since kernel 2.2, root's privileges have been split into granular capabilities (about 41 in current kernels). Containers drop most capabilities by default; Kubernetes allows fine-grained control:
spec:
  containers:
  - name: app
    securityContext:
      capabilities:
        drop:
        - ALL                   # Drop every capability first
        add:
        - NET_BIND_SERVICE      # Re-add only what's needed (bind port < 1024)
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 3000
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
Common capabilities and their risks
| Capability | What it allows | Risk if granted |
|---|---|---|
| CAP_SYS_ADMIN | Mount filesystems, set hostname, and a long list of other administrative operations | Near-root; can escape container |
| CAP_NET_ADMIN | Configure network interfaces, iptables, routing | Can reconfigure the Pod's network; with hostNetwork: true, the node's as well |
| CAP_SYS_PTRACE | ptrace any process in namespace | Can read memory of other containers in the Pod when the PID namespace is shared |
| CAP_DAC_OVERRIDE | Bypass file permission checks | Read/write any file in the container regardless of mode |
| CAP_NET_BIND_SERVICE | Bind to ports below 1024 | Low risk; needed for HTTP/HTTPS on standard ports |
| CAP_CHOWN | Change file ownership | Can change ownership of files it has access to |
| CAP_SETUID / CAP_SETGID | Change UID/GID | Can re-escalate to root if binaries are setuid |
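To see which of these capabilities a running process actually holds, the CapEff line in /proc/<pid>/status can be decoded as a bitmask. The sketch below covers only the capabilities in the table above, using their bit numbers from the kernel's capability header; the `decode_caps` helper and the sample mask are illustrative assumptions.

```python
# Bit numbers from the Linux capability header, limited to the table above.
CAP_BITS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(hex_mask):
    """Decode a CapEff/CapPrm/CapBnd hex string into the known capability names it contains."""
    mask = int(hex_mask, 16)
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]


if __name__ == "__main__":
    # Illustrative mask of the kind seen for a container with a default capability set.
    print(decode_caps("00000000a80425fb"))
    # A container that dropped ALL and added back NET_BIND_SERVICE only.
    print(decode_caps("0000000000000400"))
```

On a node, `grep CapEff /proc/<pid>/status` supplies the mask; `capsh --decode=<mask>` gives the full list if libcap tools are installed.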
9.2 Seccomp
seccomp (Secure Computing Mode) restricts which system calls a process can make.
Kubernetes supports loading seccomp profiles as JSON files and applying them per-container:
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault      # Use containerd/CRI-O's built-in default profile
      # type: Localhost
      # localhostProfile: profiles/my-app.json
      # type: Unconfined        # NO seccomp (dangerous)
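RuntimeDefault applies the runtime's default profile. The seccompProfile field has been stable since v1.19,
and the kubelet's SeccompDefault feature, which makes RuntimeDefault the default for every Pod, went stable in v1.27.
The default profile blocks several dozen dangerous syscalls such as kexec_load, open_by_handle_at, and
create_module; the exact list depends on the runtime and its version.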
9.3 AppArmor and SELinux
# AppArmor (legacy annotation form; the securityContext.appArmorProfile field was added in v1.30 and AppArmor support went stable in v1.31)
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/nginx: runtime/default
    # or: localhost/my-custom-profile
spec:
  securityContext:
    # SELinux labels (OpenShift / RHEL nodes)
    seLinuxOptions:
      level: "s0:c123,c456"
      user: system_u
      role: system_r
      type: container_t
10 · Rootless Containers
Rootless containers run the entire container runtime (containerd or Podman) as a
non-root user on the host. The container's UID 0 maps to the host user's UID via
the user namespace, meaning even if a container escapes, it is not root on the host.
Kubernetes exposes user namespaces for Pods through the UserNamespacesSupport feature gate.
It requires kernel ≥ 5.11 for nested user namespace support and
/etc/subuid and /etc/subgid user ID range delegation.
# Enable user namespace for a Pod (KEP-127, K8s 1.30 beta)
spec:
  hostUsers: false              # Each Pod gets its own user namespace
  containers:
  - name: app
    # UID 0 inside = mapped to unprivileged UID on host
# Requires: feature gate UserNamespacesSupport=true
# Requires: kernel >= 5.11 and idmap mounts
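The UID translation a user namespace performs is visible in /proc/<pid>/uid_map, where each line holds three numbers: first ID inside the namespace, first ID on the host, and range length. The Python sketch below applies such a map; the function names and the 100000 starting offset are illustrative assumptions, since the actual range is chosen by the runtime or kubelet.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        ranges.append((inside, outside, length))
    return ranges


def to_host_id(container_id, ranges):
    """Translate an ID seen inside the namespace to the host ID, or None if unmapped."""
    for inside, outside, length in ranges:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None  # unmapped IDs are displayed as the overflow ID (65534 by default)


if __name__ == "__main__":
    pod_map = parse_id_map("0 100000 65536")  # hypothetical mapping for one Pod
    print(to_host_id(0, pod_map))      # 100000: root inside is unprivileged outside
    print(to_host_id(1000, pod_map))   # 101000
    print(to_host_id(70000, pod_map))  # None: outside the mapped range
```

A process that escapes such a container holds host UID 100000 in this example, which owns nothing of value on the node.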
11 · Container Lifecycle Inside Kubernetes
When the kubelet is told to run a Pod, the container creation sequence is:
- kubelet calls RunPodSandbox → containerd creates pause container, sets up network namespace, calls CNI plugin to assign IP
- kubelet calls PullImage for each container image (skipped if already present)
- kubelet calls CreateContainer → containerd creates OCI bundle, prepares overlay FS snapshot
- kubelet calls StartContainer → containerd invokes runc to start the process
- kubelet starts running liveness/readiness probes after initialDelaySeconds
- kubelet reports container status back to API server via status update
# Trace the full lifecycle on a node
journalctl -u kubelet -f | grep -E "RunPodSandbox|CreateContainer|StartContainer|PullImage"
# Trace containerd's view
journalctl -u containerd -f
# Watch CRI events live
crictl events
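The ordering of CRI calls in the creation sequence can be made concrete with a stand-in runtime that records calls instead of speaking gRPC. Everything in this Python sketch is hypothetical (class, method names, image tags); it only mirrors the order described above and is not how the kubelet is implemented.

```python
class RecordingRuntime:
    """Fake CRI runtime: remembers which calls were made, in order."""

    def __init__(self, cached_images=()):
        self.images = set(cached_images)
        self.calls = []

    def run_pod_sandbox(self, pod_name):
        self.calls.append(("RunPodSandbox", pod_name))
        return f"sandbox-{pod_name}"

    def image_present(self, image):
        return image in self.images

    def pull_image(self, image):
        self.calls.append(("PullImage", image))
        self.images.add(image)

    def create_container(self, sandbox_id, name, image):
        self.calls.append(("CreateContainer", name))
        return f"{sandbox_id}/{name}"

    def start_container(self, container_id):
        self.calls.append(("StartContainer", container_id))


def start_pod(runtime, pod_name, containers):
    """Drive the runtime in the order the kubelet uses: sandbox, then pull/create/start."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    for name, image in containers:
        if not runtime.image_present(image):  # pull is skipped when the image is cached
            runtime.pull_image(image)
        container_id = runtime.create_container(sandbox_id, name, image)
        runtime.start_container(container_id)
    return runtime.calls


if __name__ == "__main__":
    rt = RecordingRuntime(cached_images={"nginx:1.25"})
    for call in start_pod(rt, "demo", [("web", "nginx:1.25"), ("sidecar", "busybox:1.36")]):
        print(call)
```

The sandbox call always comes first because the containers need its network namespace to exist before they can join it.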
11.1 Container Termination Sequence
When a Pod is deleted or a container must be stopped, the sequence is:
- kubelet calls StopContainer with the grace period as its timeout → containerd (via the shim and runc kill) sends SIGTERM to PID 1 inside the container
- Container has terminationGracePeriodSeconds (default 30s) to exit cleanly
- If still running after the grace period, the runtime sends SIGKILL
- kubelet calls RemoveContainer → containerd deletes overlay snapshot
- kubelet calls StopPodSandbox → containerd calls CNI DEL to release IP
- kubelet calls RemovePodSandbox → pause container removed
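Signal handling: containers whose PID 1 is a shell wrapper (for example CMD ["/bin/sh", "-c", "..."])
are notorious for ignoring SIGTERM, because the shell does not forward it to the app. Use exec form in the Dockerfile
(CMD ["myapp"]) or use tini as PID 1 to properly forward signals.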
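For an app that does run as PID 1, handling SIGTERM itself is what makes the grace period useful. The following Python sketch shows one common pattern: a handler that only sets a flag, and a work loop that checks the flag between units of work. The `GracefulShutdown` class and `serve` function are illustrative names, not a library API.

```python
import os
import signal


class GracefulShutdown:
    """Installs a SIGTERM handler that records the request instead of dying mid-task."""

    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGTERM, self._on_term)  # must be called from the main thread

    def _on_term(self, signum, frame):
        self.requested = True


def serve(shutdown, work, max_iterations=None):
    """Run work() repeatedly until SIGTERM arrives; return the number of completed units."""
    done = 0
    while not shutdown.requested and (max_iterations is None or done < max_iterations):
        work()
        done += 1
    # Flush buffers, close connections, and so on would go here.
    return done


if __name__ == "__main__":
    shutdown = GracefulShutdown()
    # Simulate the runtime's SIGTERM by signalling ourselves during the first unit of work.
    completed = serve(shutdown, lambda: os.kill(os.getpid(), signal.SIGTERM), max_iterations=100)
    print("stopped after", completed, "unit(s); shutdown requested:", shutdown.requested)
```

Note that PID 1 in a PID namespace gets no default signal actions, so an app without a handler simply ignores SIGTERM and is killed when the grace period expires.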
12 · Image Pull Policy and Registry Authentication
spec:
  containers:
  - name: app
    image: myregistry.io/myapp:v1.2.3
    imagePullPolicy: IfNotPresent   # Always | Never | IfNotPresent (default for tags other than :latest)
  imagePullSecrets:
  - name: registry-credentials      # Secret of type kubernetes.io/dockerconfigjson
# Create image pull secret from docker config
kubectl create secret docker-registry registry-credentials \
--docker-server=myregistry.io \
--docker-username=myuser \
--docker-password=mypassword \
--docker-email=me@example.com
# Or from existing docker config
kubectl create secret generic registry-credentials \
--from-file=.dockerconfigjson=$HOME/.docker/config.json \
--type=kubernetes.io/dockerconfigjson
# Attach to default ServiceAccount so all Pods in namespace get it
kubectl patch serviceaccount default \
-p '{"imagePullSecrets":[{"name":"registry-credentials"}]}'
Mutable tags such as myapp:latest can resolve to different images between node pulls.
Pin to digest: myapp@sha256:abc123... for reproducibility.
Use crane digest myapp:v1.2.3 to get the digest.
13 · Production Best Practices
Container runtime production checklist (12 items)
- Use SystemdCgroup=true in containerd config when kubelet uses the systemd cgroup driver. Mismatch causes Pod evictions and memory accounting errors.
- Pin sandbox image (pause) to a specific digest in containerd config to prevent silent updates.
- Enable seccomp RuntimeDefault cluster-wide via the kubelet --seccomp-default flag, and enforce it with the Pod Security Admission restricted profile.
- Drop ALL capabilities and add back only what your app needs. Most apps need zero capabilities.
- Set readOnlyRootFilesystem: true for all containers that don't need to write to their root FS. A writable root FS makes it easier for an attacker to persist tools or tamper with binaries.
- Never run as root (runAsNonRoot: true). If the app requires root, consider user namespaces or rootless containers instead.
- Monitor containerd/CRI-O for image GC events. When node disk fills with images, the kubelet evicts Pods. Set imageGCHighThresholdPercent and imageGCLowThresholdPercent.
- Use cgroup v2 on all new nodes (default on Ubuntu 22.04+, RHEL 9+). cgroup v2 fixes the memory accounting issues present in v1 and enables better QoS.
- Tune terminationGracePeriodSeconds per workload. Databases may need 120s+; stateless HTTP servers usually need 15–30s.
- Implement preStop hooks for graceful shutdown of apps that don't handle SIGTERM. The preStop hook runs before SIGTERM is sent.
- Watch for CPU throttling with container_cpu_cfs_throttled_seconds_total. Consider removing CPU limits for latency-sensitive workloads and using LimitRange to set namespace defaults.
- Audit image pull secrets rotation. Credentials in imagePullSecrets are long-lived by default. Integrate with IRSA/Workload Identity for cloud registries instead of static credentials.
14 · Troubleshooting Container Runtime Issues
# ---- Container won't start ----
# 1. Check kubelet log for CRI errors
journalctl -u kubelet -n 200 | grep -i error
# 2. Check containerd log
journalctl -u containerd -n 200
# 3. crictl inspect for full container spec and error
crictl inspect <containerID> | jq .status.reason
# ---- Image pull failure ----
crictl pull <image> # Test pull outside of K8s context
# Check registry credentials
kubectl get secret registry-credentials -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
# ---- OOM kill diagnosis ----
dmesg | grep -i "oom-kill"
kubectl describe pod <name> | grep -A3 "Last State"
# ---- CPU throttling ----
# On the node, for a container's cgroup:
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<uid>/<containerID>/cpu.stat
# nr_throttled = times the container was throttled
# throttled_time = total ns spent throttled
# ---- containerd not responding ----
systemctl restart containerd
# Pods keep running! (shim keeps them alive)
# After containerd restart, reconnects to shims automatically
# ---- Check overlay FS health ----
dmesg | grep -i overlayfs
df -h /var/lib/containerd
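The cpu.stat counters mentioned in the throttling check are easier to interpret as a ratio. This Python sketch parses the file's key/value lines and reports what fraction of CFS periods were throttled; it accepts both the cgroup v1 field (throttled_time, in nanoseconds) and the cgroup v2 field (throttled_usec, in microseconds). The function names are illustrative.

```python
def parse_cpu_stat(text):
    """Parse cpu.stat content ('key value' per line) into a dict of ints."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats


def throttle_summary(stats):
    """Return the share of periods that were throttled and the total throttled time in seconds."""
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    if "throttled_usec" in stats:          # cgroup v2 reports microseconds
        seconds = stats["throttled_usec"] / 1e6
    else:                                  # cgroup v1 reports nanoseconds
        seconds = stats.get("throttled_time", 0) / 1e9
    ratio = throttled / periods if periods else 0.0
    return {"throttled_ratio": ratio, "throttled_seconds": seconds}


if __name__ == "__main__":
    sample_v1 = "nr_periods 1000\nnr_throttled 250\nthrottled_time 5000000000\n"
    print(throttle_summary(parse_cpu_stat(sample_v1)))
    # {'throttled_ratio': 0.25, 'throttled_seconds': 5.0}
```

A container throttled in a large share of its periods is a candidate for a higher CPU limit or for removing the limit altogether.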
Next Files
Recommended reading order
- 00-foundations/03-cluster-architecture-overview.html — Full cluster component topology
- 00-foundations/04-kubernetes-api-model.html — API machinery, watch semantics
- 02-node-components/03-container-runtime-interface.html — CRI deep-dive, kubelet↔containerd protocol
- 02-node-components/05-linux-kernel-cgroups-namespaces.html — Production cgroup tuning
- 02-node-components/01-kubelet.html — How kubelet drives the CRI
References
- OCI Image Spec — github.com/opencontainers/image-spec
- OCI Runtime Spec — github.com/opencontainers/runtime-spec
- runc source — github.com/opencontainers/runc
- containerd docs — containerd.io/docs
- CRI-O docs — cri-o.io
- Linux man pages: namespaces(7), cgroups(7), capabilities(7), seccomp(2)
- CRI API proto — github.com/kubernetes/cri-api
- Container Security — Liz Rice, O'Reilly 2020