File Path: 00-foundations/02-container-orchestration.html

Prerequisites: 00-introduction, 01-history, basic Linux process knowledge

Concepts Covered

Linux namespaces, cgroups v1/v2, OCI image spec, OCI runtime spec, runc, containerd, CRI-O, CRI gRPC API, image layers, copy-on-write, container lifecycle, seccomp/AppArmor, rootless containers, gVisor/kata

Container Orchestration Foundations

Kubernetes orchestrates containers, but containers are not a Kubernetes concept — they are a Linux kernel feature composed from six isolation primitives. This document explains every layer of the container stack from the Linux kernel up to the CRI boundary where Kubernetes hands off control. Without this foundation, the behaviour of kubelet, CRI plugins, resource limits, and security policies cannot be fully understood.

Scope: This file covers the pre-orchestration layer: what makes a container a container. The Kubernetes-specific CRI interface is detailed further in 02-node-components/03-container-runtime-interface.html.

1 · What Is a Container?

A container is not a virtual machine. It is a Linux process (or process group) that the kernel has isolated and resource-constrained using a combination of:

Namespaces — what the process can see (other processes, network interfaces, file system mounts, hostname, etc.)
cgroups — how much resource it can consume (CPU, memory, I/O, PIDs)
Union file system / overlay FS: the isolated, layered file system the process starts in
Capabilities: which privileged kernel operations it is allowed to perform
Seccomp — which system calls it may invoke
AppArmor / SELinux — mandatory access control policy over file and network access

The process itself is unmodified. The kernel surrounds it with these isolation layers transparently. A containerised nginx binary is the exact same ELF binary as one running on bare metal — the container is a property of the runtime environment, not the application.

Figure 1 — Container stack: kubelet → high-level runtime (containerd/CRI-O) → OCI low-level runtime (runc) → Linux kernel primitives.

2 · Linux Namespaces

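Linux namespaces provide the visibility isolation for containers. They were added to the kernel incrementally: mount namespaces arrived in 2.4.19 (2002), user namespaces were completed in 3.8 (2013), and cgroup and time namespaces followed in 4.6 and 5.6. There are currently 8 namespace types: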

Namespace	Kernel flag	What it isolates	Container use
`pid`	`CLONE_NEWPID`	Process ID space — PID 1 inside namespace is unrelated to host PID 1	Container init is PID 1 inside; can't see host processes
`net`	`CLONE_NEWNET`	Network interfaces, routing tables, iptables rules, sockets	Each Pod gets its own `eth0`, loopback, IP address
`mnt`	`CLONE_NEWNS`	Mount points (file system tree)	Container root FS is isolated; can't see host mounts
`uts`	`CLONE_NEWUTS`	Hostname and domain name (`uname`)	Container can have its own hostname independent of host
`ipc`	`CLONE_NEWIPC`	System V IPC, POSIX message queues	Containers share IPC only if in same Pod (same IPC ns)
`user`	`CLONE_NEWUSER`	UID/GID mappings; root inside can be non-root outside	Rootless containers; UID 0 inside → an unprivileged UID on the host (for example 100000)
`cgroup`	`CLONE_NEWCGROUP`	cgroup root — container sees a virtualised cgroup hierarchy	Container cannot see or escape host cgroup limits
`time`	`CLONE_NEWTIME`	System clock (CLOCK_MONOTONIC, CLOCK_BOOTTIME)	Rarely used in K8s currently; useful for checkpoint/restore
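
The kernel flags in the table are bit values that a runtime ORs together and passes to clone() or unshare(). The sketch below illustrates that combination. The numeric values are taken from the Linux UAPI header <linux/sched.h>; the helper name clone_flags and the short namespace keys are illustrative assumptions, not part of any runtime's API.

```python
# Namespace flag values as defined in the Linux UAPI header <linux/sched.h>.
CLONE_FLAGS = {
    "time":   0x00000080,  # CLONE_NEWTIME
    "mnt":    0x00020000,  # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts":    0x04000000,  # CLONE_NEWUTS
    "ipc":    0x08000000,  # CLONE_NEWIPC
    "user":   0x10000000,  # CLONE_NEWUSER
    "pid":    0x20000000,  # CLONE_NEWPID
    "net":    0x40000000,  # CLONE_NEWNET
}


def clone_flags(namespaces):
    """Return the bitwise OR of the CLONE_NEW* flags for the given names."""
    mask = 0
    for name in namespaces:
        mask |= CLONE_FLAGS[name]  # raises KeyError for an unknown name
    return mask


# The three namespaces a Pod sandbox shares with its containers:
print(hex(clone_flags(["net", "ipc", "uts"])))  # prints 0x4c000000
```

Passing such a mask to unshare() needs CAP_SYS_ADMIN (or a user namespace), so the example only computes the value.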

2.1 How Pods Share Namespaces

A Pod is, at the Linux level, a shared namespace envelope. The kubelet has the runtime create a pause container (also called the "infra" or sandbox container) first. All other containers in the Pod join the pause container's net, ipc and uts namespaces. Each container keeps its own mnt namespace (separate root FS).

Namespace type	Shared across Pod?	Notes
net	Yes — all containers share one network namespace	All containers on `localhost`; same Pod IP; ports must not clash
ipc	Yes — by default	Enables shared memory between containers in the same Pod
uts	Yes — shared hostname	Pod hostname = Pod name by default
mnt	No — each container has its own rootfs	Volumes are bind-mounted into individual container mnt namespaces
pid	Optional (`shareProcessNamespace: true`)	When enabled, all containers share one PID namespace, so they can see and signal each other's processes
user	No — each container can have own UID mapping	Relevant for rootless pods and user namespace feature gate

# Inspect namespaces for a running container on a node
CONTAINER_PID=$(crictl inspect <containerID> | jq -r .info.pid)
ls -la /proc/$CONTAINER_PID/ns/

# Output (each symlink = namespace inode):
# net  -> net:[4026532345]   # unique = isolated network namespace
# pid  -> pid:[4026532346]
# mnt  -> mnt:[4026532347]
# uts  -> uts:[4026532348]

# Verify two containers in same Pod share net namespace
# (they will have same net:[...] inode)
CONTAINER_A_PID=1234; CONTAINER_B_PID=1235
ls -la /proc/$CONTAINER_A_PID/ns/net
ls -la /proc/$CONTAINER_B_PID/ns/net
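
The same comparison can be scripted. The following is a minimal sketch, assuming a Linux node and enough privileges to read /proc/<pid>/ns; the function names (parse_ns_link, read_namespaces, shared_namespaces) are hypothetical helpers invented for this example.

```python
import os
import re

# Symlink targets under /proc/<pid>/ns look like "net:[4026532345]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace symlink target into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid):
    """Map each entry in /proc/<pid>/ns to its inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


def shared_namespaces(ns_a, ns_b):
    """Return the namespace entries whose inode is identical in both maps."""
    return sorted(name for name in ns_a if ns_b.get(name) == ns_a[name])


# Sample data in place of read_namespaces(1234) / read_namespaces(1235):
a = {"net": 4026532345, "ipc": 4026532350, "mnt": 4026532347}
b = {"net": 4026532345, "ipc": 4026532350, "mnt": 4026532399}
print(shared_namespaces(a, b))  # prints ['ipc', 'net']
```

Two containers of the same Pod should report identical net and ipc inodes and different mnt inodes, as in the sample data.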

2.2 The Pause Container

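The pause container (image: registry.k8s.io/pause:3.9, ~700 KB) does almost nothing. It sleeps in pause(), exits when it receives SIGINT or SIGTERM, and reaps zombie processes when it receives SIGCHLD (relevant when the Pod shares a PID namespace):

// pause.c: simplified sketch of the pause program
// (the upstream source adds error handling and a version flag)
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// SIGINT/SIGTERM: exit so the runtime can tear the sandbox down
static void sigdown(int signo) {
    psignal(signo, "Shutting down, got signal");
    exit(0);
}

// SIGCHLD: reap zombies that were re-parented to PID 1
static void sigreap(int signo) {
    (void)signo;
    while (waitpid(-1, NULL, WNOHANG) > 0) {}
}

int main(void) {
    signal(SIGINT, sigdown);
    signal(SIGTERM, sigdown);
    signal(SIGCHLD, sigreap);
    // Sleep until a signal arrives; the live process keeps the namespaces open
    for (;;) pause();
    return 0;
}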

Its sole purpose is to be the namespace anchor: as long as the pause container is alive, the Pod's network and IPC namespaces remain alive. App containers can crash and restart without losing the network namespace (and thus without losing the Pod IP address).

3 · Control Groups (cgroups)

cgroups (Control Groups) are a Linux kernel feature for resource accounting and enforcement over groups of processes. Kubernetes uses them to implement container requests and limits.

3.1 cgroup v1 Subsystems

Subsystem	Controls	K8s use
cpu	CPU time shares and CFS quota	`resources.requests.cpu` → cpu.shares; `resources.limits.cpu` → cpu.cfs_quota_us
cpuset	Which CPUs/NUMA nodes a process may use	CPU Manager static policy pins containers to dedicated CPU cores
memory	Memory usage limits, OOM score	`resources.limits.memory` → memory.limit_in_bytes; OOM kill if exceeded
blkio	Block I/O bandwidth and IOPS throttling	Optionally throttle disk I/O per container (not default K8s)
pids	Maximum number of PIDs in the group	`podPidsLimit` kubelet setting (`--pod-max-pids` flag); prevents fork bombs
net_cls	Tags packets with classid for tc (traffic control)	Some CNI plugins use for bandwidth enforcement
devices	Allow/deny device file access	GPU device plugin, /dev/fuse access control
hugetlb	Huge page memory limits	HugePage resources on Pods (DPDK, databases)

3.2 cgroup v2 (Unified Hierarchy)

cgroup v2 (merged in kernel 4.5, default on most distros since 2021) replaces the fragmented v1 subsystem tree with a single unified hierarchy. Kubernetes declared cgroup v2 stable in v1.25.

cgroup v1 layout

/sys/fs/cgroup/
  cpu/kubepods/pod-uid/container-id/
  memory/kubepods/pod-uid/container-id/
  pids/kubepods/pod-uid/container-id/
  blkio/kubepods/pod-uid/container-id/
  cpuset/kubepods/pod-uid/container-id/
# Each subsystem in its own tree
# Controllers are mounted separately

cgroup v2 layout

/sys/fs/cgroup/
  kubepods.slice/
    kubepods-pod<uid>.slice/
      <containerID>/
        cpu.max         # replaces cpu.cfs_quota_us
        cpu.weight      # replaces cpu.shares
        memory.max      # replaces memory.limit_in_bytes
        memory.swap.max
        io.max          # combined blkio
        pids.max
# Single unified tree; all controllers in one dir

3.3 Kubernetes cgroup Hierarchy

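The kubelet creates cgroups in a nested hierarchy of QoS class, Pod and container. The tree below uses cgroupfs-driver naming (the systemd driver uses .slice units instead). Guaranteed Pods have no QoS-level cgroup of their own and sit directly under kubepods:

/sys/fs/cgroup/
└── kubepods/                          ← all K8s workloads
    ├── besteffort/                    ← BestEffort QoS Pods
    │   └── pod<pod-uid>/
    ├── burstable/                     ← Burstable QoS Pods
    │   └── pod<pod-uid>/             ← per-Pod cgroup
    │       ├── <pause-id>/           ← pause (sandbox) container
    │       └── <container-id>/       ← app container cgroup
    └── pod<pod-uid>/                  ← Guaranteed QoS Pods (directly under kubepods)
        └── <container-id>/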

# System (non-K8s) processes live in:
/sys/fs/cgroup/system.slice/
/sys/fs/cgroup/user.slice/

3.4 How CPU Limits Work Internally

When you set resources.limits.cpu: "500m", the kubelet writes:

# cgroup v1
echo 50000 > /sys/fs/cgroup/cpu/kubepods/.../cpu.cfs_quota_us
# (100000 us is 1 CPU; 50000 us = 500m = 0.5 CPU per 100ms period)

# cgroup v2
echo "50000 100000" > /sys/fs/cgroup/kubepods/.../cpu.max
#    ^quota  ^period (microseconds)

# CPU *request* (500m) maps to cpu.shares (v1) or cpu.weight (v2):
# shares = max(2, floor(milliCPU * 1024 / 1000))
# weight (v2) = 1 + floor((shares - 2) * 9999 / 262142)  # range 1-10000
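
The conversions in the comments above can be checked with a few lines of arithmetic. This is a sketch that implements exactly the formulas shown; the function names are made up for the example, and the 100 ms period is the default assumed throughout this section.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms


def cpu_quota_us(limit_millicpu, period_us=CFS_PERIOD_US):
    """CPU limit in millicores -> cpu.cfs_quota_us for one period."""
    return limit_millicpu * period_us // 1000


def cpu_shares(request_millicpu):
    """CPU request in millicores -> cgroup v1 cpu.shares (minimum 2)."""
    return max(2, request_millicpu * 1024 // 1000)


def cpu_weight(shares):
    """cgroup v1 cpu.shares -> cgroup v2 cpu.weight (range 1-10000)."""
    return 1 + (shares - 2) * 9999 // 262142


def cpu_max_line(limit_millicpu, period_us=CFS_PERIOD_US):
    """The '<quota> <period>' string written to cgroup v2 cpu.max."""
    return f"{cpu_quota_us(limit_millicpu, period_us)} {period_us}"


print(cpu_quota_us(500))             # prints 50000
print(cpu_shares(500))               # prints 512
print(cpu_weight(cpu_shares(500)))   # prints 20
print(cpu_max_line(500))             # prints 50000 100000
```

Integer division stands in for floor() because all inputs are non-negative.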

CPU throttling is a common hidden performance issue. A container with limits.cpu: "1" that tries to use 1.5 CPUs during a burst is throttled by the CFS bandwidth controller: once its quota is spent, the kernel stops scheduling it for the remainder of the 100 ms period. This causes latency spikes even on machines with plenty of idle CPU. To detect it, watch the Prometheus metric container_cpu_cfs_throttled_seconds_total (kubectl top pods shows usage, not throttling). Consider setting no CPU limit (request only) for latency-sensitive apps.
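
A rough worked example of the arithmetic behind throttling follows. It uses a deliberately simplified model in which the container's demand is constant over the whole period; the function name throttle_per_period is hypothetical.

```python
def throttle_per_period(quota_us, period_us, demand_millicpu):
    """Return (run_us, throttled_us) of wall-clock time in one CFS period.

    Simplified model: the container wants `demand_millicpu` of CPU for the
    entire period, so it burns quota at demand/1000 CPU-us per wall-clock us
    and is frozen once the quota is gone.
    """
    if demand_millicpu <= 0:
        return 0, 0  # idle container: nothing runs, nothing is throttled
    run_us = min(period_us, quota_us * 1000 // demand_millicpu)
    return run_us, period_us - run_us


# limits.cpu: "1" (quota 100000 us per 100000 us period), demand of 1.5 CPUs:
print(throttle_per_period(100_000, 100_000, 1500))  # prints (66666, 33334)
```

Under these assumptions the container runs for about two thirds of each period and is stalled for the last third, regardless of how idle the node is.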

3.5 Memory Limits and OOM Killer

When a container exceeds resources.limits.memory, the kernel OOM killer terminates the process. The kubelet detects the exit (via CRI container status) and the container is restarted according to restartPolicy. This appears as OOMKilled in kubectl describe pod.

# Inspect OOM kills
kubectl describe pod <name> | grep -A5 "Last State"
# Last State: Terminated
#   Reason:   OOMKilled
#   Exit Code: 137

# Check kernel OOM log
dmesg | grep -i oom
# [123456.789] oom-kill:constraint=CONSTRAINT_MEMCG,task=java,...

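# Check container memory usage vs limit
kubectl top pod <name> --containers
# cgroup v1:
cat /sys/fs/cgroup/memory/kubepods/.../memory.usage_in_bytes
cat /sys/fs/cgroup/memory/kubepods/.../memory.limit_in_bytes
# cgroup v2 equivalents are memory.current and memory.max:
cat /sys/fs/cgroup/kubepods.slice/.../memory.current
cat /sys/fs/cgroup/kubepods.slice/.../memory.max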
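
Two numbers in this section are worth decoding: the byte value that ends up in the cgroup limit file, and exit code 137. The sketch below handles only plain integers and the most common quantity suffixes (it is not a full Kubernetes quantity parser), and both function names are illustrative.

```python
import signal

_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def memory_limit_bytes(quantity):
    """Convert a memory quantity such as '512Mi' to a byte count."""
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already bytes


def killing_signal(exit_code):
    """Return the signal name behind a 128+N exit code, else None."""
    if exit_code > 128:
        try:
            return signal.Signals(exit_code - 128).name
        except ValueError:
            return None
    return None


print(memory_limit_bytes("512Mi"))  # prints 536870912
print(killing_signal(137))          # prints SIGKILL
```

Exit code 137 is 128 + 9, which is why an OOM kill (the kernel sends SIGKILL) shows up with that code.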

4 · OCI Specifications

The Open Container Initiative (OCI), founded by Docker and CoreOS in 2015 under the Linux Foundation, standardised two specifications that define what a container is:

4.1 OCI Image Specification

An OCI image is a layered, content-addressable bundle of:

Image manifest — JSON document listing layers and the image config
Image config — JSON with entrypoint, env vars, labels, architecture, OS, working dir
Layers — tar.gz archives of filesystem diffs; each layer is identified by its SHA-256 digest

OCI image manifest structure (annotated JSON)

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:abc123...",   // Points to image config JSON
    "size": 7023
  },
  "layers": [
    {
      // Base OS layer (e.g., debian:bookworm rootfs)
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:def456...",
      "size": 27141519
    },
    {
      // apt-get install nginx layer
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:789abc...",
      "size": 2342343
    },
    {
      // COPY nginx.conf layer
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:cdef01...",
      "size": 1024
    }
  ]
}
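
"Content-addressable" means every digest is the SHA-256 of the exact bytes it names, so any blob can be verified after download. A minimal sketch, with hypothetical helper names and a manifest reduced to the layer sizes from the example above:

```python
import hashlib


def digest_of(blob: bytes) -> str:
    """Return the OCI-style digest string for a blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def verify_blob(blob: bytes, expected_digest: str) -> bool:
    """Check downloaded bytes against the digest listed in a manifest."""
    return digest_of(blob) == expected_digest


def total_layer_size(manifest: dict) -> int:
    """Sum the compressed sizes of all layers listed in a manifest."""
    return sum(layer["size"] for layer in manifest["layers"])


# Layer sizes copied from the annotated manifest above.
manifest = {
    "schemaVersion": 2,
    "layers": [{"size": 27141519}, {"size": 2342343}, {"size": 1024}],
}
print(total_layer_size(manifest))  # prints 29484886
print(digest_of(b""))
```

Because the digest covers the raw bytes, re-serialising a manifest or config with different whitespace produces a different digest.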

4.2 Image Layers and Copy-on-Write

Container images are composed of read-only layers stacked using a union file system (typically overlay2 on Linux). When a container starts, a thin read-write layer is added on top.

Figure 2 — overlayfs: read-only image layers (shared across containers) + per-container R/W layer. File writes trigger copy-on-write into the R/W layer.
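
The lookup and copy-on-write rules can be modelled with dictionaries. This is a toy model only: real overlayfs copies the whole file into the upper directory on first write and records deletions as whiteout entries on disk. The class name OverlayView and the WHITEOUT marker are inventions of this sketch.

```python
WHITEOUT = object()  # stands in for an overlayfs whiteout entry


class OverlayView:
    """One container's view: shared read-only layers plus a private upper layer."""

    def __init__(self, lower_layers):
        self.lower = list(lower_layers)  # dicts of path -> content, lowest first
        self.upper = {}                  # per-container read-write layer

    def read(self, path):
        if path in self.upper:
            if self.upper[path] is WHITEOUT:
                raise FileNotFoundError(path)
            return self.upper[path]
        for layer in reversed(self.lower):  # topmost image layer wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, content):
        self.upper[path] = content  # image layers are never modified

    def delete(self, path):
        self.read(path)              # raises if the file does not exist
        self.upper[path] = WHITEOUT  # hide the lower copy


base = {"/etc/nginx.conf": "v1"}
a, b = OverlayView([base]), OverlayView([base])
a.write("/etc/nginx.conf", "v2")
print(a.read("/etc/nginx.conf"), b.read("/etc/nginx.conf"))  # prints v2 v1
```

Container a sees its own modified copy while container b and the shared image layer are untouched, which is what lets many containers share one set of image layers.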

# Inspect overlayfs mounts for a running container (on the node)
CONTAINER_ID="abc123..."
# containerd stores snapshots in:
ls /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/

# View the actual overlay mount for a container
cat /proc/mounts | grep overlay
# overlay / overlay rw,lowerdir=.../sha256:..,upperdir=...,workdir=...

# Docker legacy path:
ls /var/lib/docker/overlay2/

# Check disk usage per image layer
du -sh /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/*

4.3 OCI Runtime Specification

The OCI runtime spec defines the interface between a high-level runtime and a low-level runtime. It specifies a bundle directory structure and a lifecycle API:

OCI Bundle layout:
  bundle/
    config.json         ← Generated by high-level runtime (containerd)
    rootfs/             ← Unpacked image layers (merged overlayfs)

config.json contains:
  - ociVersion
  - root.path (path to rootfs)
  - process (args, env, cwd, user, capabilities, rlimits, noNewPrivileges, apparmorProfile, selinuxLabel)
  - hostname
  - mounts (bind mounts for volumes, /dev, /proc, /sys)
  - hooks (prestart, createRuntime, createContainer, startContainer, poststart, poststop)
  - linux.namespaces (which namespaces to create/join)
  - linux.cgroupsPath
  - linux.resources (cpu, memory, devices, hugepageLimits)
  - linux.seccomp (default action plus per-syscall allow/deny rules)
  - linux.maskedPaths, readonlyPaths
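
To make the field list concrete, here is a sketch that builds a deliberately minimal config.json as a Python dict. It has not been validated against runc and omits mounts, capabilities and resources; in practice `runc spec` generates a complete file. The function name, default hostname and ociVersion value are assumptions for illustration. Note that the spec spells the namespace types "network" and "mount", not net and mnt.

```python
import json


def minimal_oci_config(args, rootfs="rootfs", hostname="demo"):
    """Build a minimal, illustrative OCI runtime config as a dict."""
    return {
        "ociVersion": "1.0.2",
        "root": {"path": rootfs, "readonly": True},
        "process": {
            "args": list(args),
            "cwd": "/",
            "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
            "user": {"uid": 0, "gid": 0},
        },
        "hostname": hostname,
        "linux": {
            # Each entry asks the runtime to create a new namespace of that type.
            "namespaces": [
                {"type": t} for t in ("pid", "network", "ipc", "uts", "mount")
            ],
        },
    }


print(json.dumps(minimal_oci_config(["/bin/sh"]), indent=2))
```

A high-level runtime such as containerd fills in the same structure from the image config and the Pod spec before handing the bundle to the low-level runtime.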

4.4 OCI Container Lifecycle (State Machine)

Figure 3 — OCI container lifecycle. runc implements: create → start → kill → delete.
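
The lifecycle can be sketched as a small state machine using the status values the runtime spec defines (created, running, stopped). This is a simplification: it assumes a kill always ends the process, and the "deleted" marker and the class names are inventions of the example.

```python
class LifecycleError(Exception):
    """Raised when an operation is not valid in the current state."""


class Container:
    def __init__(self):
        self.status = "created"  # result of a successful `create`

    def _require(self, op, *allowed):
        if self.status not in allowed:
            raise LifecycleError(f"cannot {op} a container in state {self.status!r}")

    def start(self):
        self._require("start", "created")
        self.status = "running"

    def kill(self):
        # Signals may be sent to created or running containers.
        self._require("kill", "created", "running")
        self.status = "stopped"  # simplification: the signal ends the process

    def delete(self):
        self._require("delete", "stopped")
        self.status = "deleted"  # hypothetical marker; the runtime just removes it


c = Container()
c.start()
c.kill()
c.delete()
print(c.status)  # prints deleted
```

Calling delete() on a running container raises LifecycleError, mirroring the rule that a container must be stopped before it can be deleted.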

5 · runc — The OCI Low-Level Runtime

runc is the reference implementation of the OCI runtime spec, written in Go, and is the default low-level runtime used by both containerd and CRI-O.

5.1 What runc Does

Reads the OCI bundle (config.json + rootfs/)
Creates Linux namespaces (clone() or unshare() syscalls)
Moves the process into the correct cgroup (cgroupsPath)
Applies seccomp profile (loads BPF program via seccomp() syscall)
Drops capabilities to the configured set
Applies AppArmor/SELinux profile
Sets up mounts (pivot_root or chroot to rootfs)
Executes the container's init process (exec())

# runc command surface (used internally by containerd)
runc create --bundle /path/to/bundle mycontainer
runc start mycontainer
runc state mycontainer   # {"ociVersion":"...","id":"...","status":"running","pid":12345,...}
runc list                # all containers managed by this runc instance
runc kill mycontainer SIGTERM
runc delete mycontainer

# Run runc manually (debugging)
mkdir -p /tmp/mycontainer/rootfs
# Extract a rootfs tarball
tar xf alpine.tar -C /tmp/mycontainer/rootfs
# Generate a default config.json
cd /tmp/mycontainer && runc spec
# Edit config.json as needed, then:
runc run mycontainer

5.2 Alternative OCI Runtimes

Runtime	Technology	Use case	K8s RuntimeClass
runc	Linux namespaces + cgroups	Default, general purpose	`runc`
crun	Same model as runc, written in C; smaller memory footprint and faster startup	High-density nodes, startup performance	`crun`
gVisor (runsc)	User-space kernel intercepting syscalls (Go)	Untrusted multi-tenant workloads; GKE Sandbox	`gvisor`
Kata Containers	Lightweight VM per Pod (QEMU/Cloud Hypervisor/Firecracker)	Strongest isolation; financial/regulated workloads	`kata`
Nabla (runnc)	Library OS (unikernel) confined to a small set of syscalls by seccomp	Research; minimal syscall surface	`nabla`
youki	runc re-implementation in Rust	Memory safety, active development	`youki`

# Using RuntimeClass to select a non-default runtime
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc        # Matches handler name configured in containerd/CRI-O
scheduling:
  nodeSelector:
    sandbox: gvisor   # Only schedule on nodes with gVisor installed
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app
spec:
  runtimeClassName: gvisor
  containers:
  - name: app
    image: myapp:latest

6 · Container Runtime Interface (CRI)

The CRI is a gRPC API defined by Kubernetes (SIG Node) that decouples the kubelet from any specific container runtime. It was introduced as alpha in Kubernetes v1.5; the v1 version of the API arrived in v1.23 and has been the only version the kubelet accepts since v1.26. Each CRI runtime listens on a Unix socket at a conventional path (/run/containerd/containerd.sock, /var/run/crio/crio.sock, etc.).

6.1 CRI gRPC Services

The CRI defines two gRPC services:

// RuntimeService — manages Pod sandboxes and containers
service RuntimeService {
  // Sandbox lifecycle
  rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
  rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse);
  rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse);
  rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse);
  rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse);

  // Container lifecycle
  rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
  rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
  rpc StopContainer(StopContainerRequest) returns (StopContainerResponse);
  rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse);
  rpc ListContainers(ListContainersRequest) returns (ListContainersResponse);
  rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse);

  // Exec / attach / port-forward
  rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse);
  rpc Exec(ExecRequest) returns (ExecResponse);          // returns URL for streaming
  rpc Attach(AttachRequest) returns (AttachResponse);    // returns URL for streaming
  rpc PortForward(PortForwardRequest) returns (PortForwardResponse);

  // Stats
  rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse);
  rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse);
}

// ImageService — manages images
service ImageService {
  rpc ListImages(ListImagesRequest) returns (ListImagesResponse);
  rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse);
  rpc PullImage(PullImageRequest) returns (PullImageResponse);
  rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse);
  rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse);
}

6.2 Debugging with crictl

crictl is the CLI for the CRI — it speaks directly to the CRI socket, bypassing Kubernetes entirely. Essential for node-level debugging.

# Configure crictl to use the right socket
cat /etc/crictl.yaml
# runtime-endpoint: unix:///run/containerd/containerd.sock
# image-endpoint: unix:///run/containerd/containerd.sock

# List running Pod sandboxes
crictl pods

# List containers (all states)
crictl ps -a

# Get container details (full JSON from CRI)
crictl inspect <containerID>

# Get Pod sandbox details
crictl inspectp <podID>

# Pull an image
crictl pull nginx:1.25

# List images
crictl images

# Show usage of the filesystem where images are stored
crictl imagefsinfo

# Execute a command in a container (via CRI)
crictl exec -it <containerID> sh

# View container logs via CRI (same as kubectl logs)
crictl logs <containerID>

# Container stats
crictl stats

# Stop and remove a container (without kubectl — for emergencies)
crictl stop <containerID>
crictl rm <containerID>

7 · containerd

containerd is the CNCF-graduated, OCI-compliant, high-level container runtime and the most widely deployed CRI implementation in Kubernetes, particularly since dockershim was removed in v1.24. It was originally extracted from Docker as its core runtime component.

7.1 containerd Architecture

Figure 4 — containerd internal architecture. The CRI plugin is built-in; containerd-shim-runc-v2 is a child process that persists after containerd restart, keeping containers alive.

7.2 containerd-shim — Why It Exists

The shim (containerd-shim-runc-v2) is a small child process that sits between containerd and runc. It exists for two critical reasons:

Daemonless containers: runc exits after starting the container init process. The shim becomes the parent of the container process, handling stdio forwarding and exit code collection. If containerd restarts, the shim and container keep running.
Reattachment: When containerd restarts, it reconnects to each running shim over that shim's own Unix socket (the protocol is ttrpc, a lightweight gRPC variant), recovering container state without restarting containers.

7.3 Key containerd Paths and Configuration

# containerd config
cat /etc/containerd/config.toml

# Important config sections:
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.9"
  [plugins."io.containerd.grpc.v1.cri".containerd]
    default_runtime_name = "runc"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true   # REQUIRED when kubelet uses systemd cgroup driver

# Data paths
/var/lib/containerd/              # all containerd data
  io.containerd.content.v1.content/blobs/sha256/   # image layer blobs
  io.containerd.snapshotter.v1.overlayfs/           # snapshot layers
  io.containerd.metadata.v1.bolt/meta.db           # bbolt metadata DB

# Runtime socket
/run/containerd/containerd.sock

# Check containerd service
systemctl status containerd
journalctl -u containerd -f

8 · CRI-O

CRI-O is a lightweight, Kubernetes-only CRI implementation. It is a CNCF project that originated at Red Hat, which remains a major contributor. Unlike containerd, it exposes no general-purpose API beyond the CRI, so it is not intended for use outside Kubernetes. This intentional narrowness keeps it small and easier to audit.

Dimension	containerd	CRI-O
Scope	General-purpose container runtime (Docker, nerdctl, K8s)	Kubernetes CRI only
Architecture	Daemon + plugins + shim	Daemon + conmon (container monitor) + OCI runtime
Image storage	Own content store + snapshotters	containers/storage library
Default runtime	runc via containerd-shim-runc-v2	runc via conmon
Adoption	Default on GKE, AKS, EKS, most cloud K8s	Default on OpenShift (RHCOS), some bare-metal deployments
Config path	/etc/containerd/config.toml	/etc/crio/crio.conf
Socket path	/run/containerd/containerd.sock	/var/run/crio/crio.sock
Release cadence	Independent of Kubernetes; each release supports a range of K8s minor versions	Releases in lockstep with Kubernetes minor versions

9 · Container Security Features

9.1 Linux Capabilities

Since kernel 2.2, Linux has split root's privileges into granular capabilities (41 as of kernel 5.9). Container runtimes grant only a small default subset; Kubernetes allows fine-grained control:

spec:
  containers:
  - name: app
    securityContext:
      capabilities:
        drop:
        - ALL                    # Drop every capability first
        add:
        - NET_BIND_SERVICE       # Re-add only what's needed (bind port < 1024)
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 3000
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false

Common capabilities and their risks

Capability	What it allows	Risk if granted
CAP_SYS_ADMIN	Mount filesystems, set hostname, create or enter namespaces, and many other administrative operations	Near-root; can escape container
CAP_NET_ADMIN	Configure network interfaces, iptables, routing	Can intercept all node traffic
CAP_SYS_PTRACE	ptrace any process in namespace	Can read memory of other containers in Pod
CAP_DAC_OVERRIDE	Bypass file permission checks	Read/write any file in the container regardless of mode
CAP_NET_BIND_SERVICE	Bind to ports below 1024	Low risk; needed for HTTP/HTTPS on standard ports
CAP_CHOWN	Change file ownership	Can change ownership of files it has access to
CAP_SETUID / CAP_SETGID	Change UID/GID	Can re-escalate to root if binaries are setuid
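
To see which capabilities a running container actually holds, read the CapEff line of /proc/<pid>/status and decode the hex bitmask. The sketch below lists capability names in bit order as defined in <linux/capability.h>; the function name decode_caps is hypothetical.

```python
# Capability names in bit order (bit 0 = CAP_CHOWN), per <linux/capability.h>.
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF", "CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask):
    """Decode a CapEff/CapPrm/CapBnd hex string into capability names."""
    mask = int(hex_mask, 16)
    return [f"CAP_{name}" for bit, name in enumerate(CAP_NAMES) if mask >> bit & 1]


# drop: [ALL] plus add: [NET_BIND_SERVICE] leaves only bit 10 set:
print(decode_caps("0000000000000400"))  # prints ['CAP_NET_BIND_SERVICE']
```

The input string would normally come from something like `grep CapEff /proc/$CONTAINER_PID/status` on the node.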

9.2 Seccomp

seccomp (Secure Computing Mode) restricts which system calls a process can make. Kubernetes supports loading seccomp profiles as JSON files and applying them per-container:

spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault     # Use containerd/CRI-O's built-in default profile
      # type: Localhost
      # localhostProfile: profiles/my-app.json
      # type: Unconfined       # NO seccomp (dangerous)

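RuntimeDefault applies the runtime's built-in profile, an allow-list that blocks several dozen dangerous syscalls such as kexec_load, open_by_handle_at, and create_module. The seccompProfile field has been stable since v1.19; applying RuntimeDefault to every Pod automatically (the kubelet's SeccompDefault feature, enabled with --seccomp-default) became stable in v1.27.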
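
A Localhost profile is a JSON file in the format shared by Docker, containerd and CRI-O: a default action plus a list of per-syscall rules. The sketch below builds a tiny allow-list profile and evaluates it in Python. The evaluation helper only mimics name matching (real profiles can also filter on arguments), and both function names are illustrative.

```python
import json


def allowlist_profile(allowed_syscalls):
    """Build a profile that denies everything except the listed syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",        # anything unlisted fails
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(allowed_syscalls), "action": "SCMP_ACT_ALLOW"},
        ],
    }


def action_for(profile, syscall):
    """Return the action the profile would apply to a syscall name."""
    for rule in profile["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


profile = allowlist_profile({"read", "write", "exit_group"})
print(action_for(profile, "read"))        # prints SCMP_ACT_ALLOW
print(action_for(profile, "kexec_load"))  # prints SCMP_ACT_ERRNO
print(json.dumps(profile, indent=2))
```

A three-syscall allow-list is far too small for a real application; it is only meant to show the file structure.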

9.3 AppArmor and SELinux

# AppArmor: annotation form shown; since K8s v1.30 the securityContext.appArmorProfile field is preferred
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/nginx: runtime/default
    # or: localhost/my-custom-profile
spec:
  securityContext:
    # SELinux labels (OpenShift / RHEL nodes)
    seLinuxOptions:
      level: "s0:c123,c456"
      user: system_u
      role: system_r
      type: container_t

10 · Rootless Containers

Rootless containers run the entire container runtime (containerd or Podman) as a non-root user on the host. The container's UID 0 maps to the host user's UID via the user namespace, meaning even if a container escapes, it is not root on the host.

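Production status. Two related features are easy to confuse. Rootless mode runs the kubelet and container runtime themselves as a non-root user and relies on /etc/subuid and /etc/subgid ID range delegation. Pod user namespaces (KEP-127, feature gate UserNamespacesSupport) leave the runtime as it is but map each Pod's UIDs to an unprivileged host range; this feature reached beta in Kubernetes 1.30 and needs a runtime and kernel that support idmapped mounts (the Kubernetes documentation recommends Linux 6.3 or newer).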

# Enable user namespace for a Pod (KEP-127, K8s 1.30 beta)
spec:
  hostUsers: false    # Each Pod gets its own user namespace
  containers:
  - name: app
    # UID 0 inside = mapped to unprivileged UID on host
    # Requires: feature gate UserNamespacesSupport=true
    # Requires: a kernel and container runtime with idmapped-mount support
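
The UID translation itself is visible in /proc/<pid>/uid_map, where each line reads "<first UID inside> <first UID outside> <length>". The sketch below applies such a mapping; the sample range starting at 100000 is an assumed value, and the function names are hypothetical.

```python
def parse_uid_map(text):
    """Parse the contents of /proc/<pid>/uid_map into (inside, outside, length) tuples."""
    mappings = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            mappings.append((inside, outside, length))
    return mappings


def to_host_uid(container_uid, mappings):
    """Translate a UID seen inside the namespace to the host UID, or None if unmapped."""
    for inside, outside, length in mappings:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


# Assumed sample mapping: 65536 IDs starting at host UID 100000.
mappings = parse_uid_map("         0     100000      65536\n")
print(to_host_uid(0, mappings))     # prints 100000
print(to_host_uid(1000, mappings))  # prints 101000
```

Root inside the Pod (UID 0) is therefore an ordinary unprivileged UID on the host, and UIDs outside the mapped range have no host identity at all.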

11 · Container Lifecycle Inside Kubernetes

When the kubelet is told to run a Pod, the container creation sequence is:

kubelet calls RunPodSandbox → containerd creates pause container, sets up network namespace, calls CNI plugin to assign IP
kubelet calls PullImage for each container image (skipped if already present)
kubelet calls CreateContainer → containerd creates OCI bundle, prepares overlay FS snapshot
kubelet calls StartContainer → containerd invokes runc to start the process
kubelet starts running liveness/readiness probes after initialDelaySeconds
kubelet reports container status back to API server via status update
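
The call order above can be sketched as a function that emits the CRI calls for one Pod. This is an illustration of ordering only: the real kubelet also checks image presence through the CRI, handles init containers and retries, and the function name and tuple format here are made up.

```python
def cri_calls_for_pod(pod_name, containers, images_on_node=()):
    """Return the ordered (rpc, target) calls needed to start a Pod.

    containers: list of (container_name, image) pairs.
    images_on_node: images already present, for which PullImage is skipped.
    """
    calls = [("RunPodSandbox", pod_name)]  # pause container, netns, CNI ADD
    present = set(images_on_node)
    for name, image in containers:
        if image not in present:
            calls.append(("PullImage", image))
            present.add(image)
        calls.append(("CreateContainer", name))
        calls.append(("StartContainer", name))
    return calls


for rpc, target in cri_calls_for_pod(
    "web", [("nginx", "nginx:1.25"), ("sidecar", "envoy:v1")],
    images_on_node=["envoy:v1"],
):
    print(f"{rpc:16} {target}")
```

The sandbox always comes first because every container needs the Pod's network namespace to exist before it can join it.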

# Trace the full lifecycle on a node
journalctl -u kubelet -f | grep -E "RunPodSandbox|CreateContainer|StartContainer|PullImage"

# Trace containerd's view
journalctl -u containerd -f

# Watch CRI events live
crictl events

11.1 Container Termination Sequence

When a Pod is deleted or a container must be stopped, the sequence is:

kubelet runs the container's preStop hook, if one is defined
kubelet calls StopContainer with the remaining grace period as the timeout → containerd → runc kill, which sends SIGTERM (or the image's STOPSIGNAL) to PID 1 inside the container
Container has terminationGracePeriodSeconds (default 30s, counted from the start of the preStop hook) to exit cleanly
If still running after the grace period, the runtime sends SIGKILL
kubelet calls StopPodSandbox → containerd calls CNI DEL to release IP
kubelet calls RemoveContainer → containerd deletes overlay snapshot
kubelet calls RemovePodSandbox → pause container removed

PID 1 signal handling. The kernel does not apply default signal actions to PID 1 of a PID namespace, so a process running as PID 1 ignores SIGTERM unless it installs a handler for it. A shell running as PID 1 (shell-form CMD myapp, or CMD ["/bin/sh", "-c", "..."]) typically does not forward SIGTERM to its child. Use exec form in the Dockerfile (CMD ["myapp"]) so the app itself receives the signal, or use tini as PID 1 to forward signals properly.
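
For an application that runs as PID 1, installing an explicit SIGTERM handler is the simplest fix. A minimal sketch follows; the names shutdown_requested, install_sigterm_handler and serve are illustrative, and a real service would drain connections where the comment indicates.

```python
import signal
import threading

shutdown_requested = threading.Event()


def _on_sigterm(signum, frame):
    # Keep the handler tiny: just record that shutdown was requested.
    shutdown_requested.set()


def install_sigterm_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, _on_sigterm)


def serve(poll_interval=0.5):
    """Run until SIGTERM arrives, then return so cleanup code can run."""
    install_sigterm_handler()
    while not shutdown_requested.wait(poll_interval):
        pass  # application work would go here
    # Graceful cleanup (flush buffers, close connections) would go here.
```

Calling serve() as the container entrypoint lets the process exit well inside terminationGracePeriodSeconds instead of waiting for SIGKILL.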

12 · Image Pull Policy and Registry Authentication

spec:
  containers:
  - name: app
    image: myregistry.io/myapp:v1.2.3
    imagePullPolicy: IfNotPresent  # Always | Never | IfNotPresent (the default, unless the tag is :latest or omitted, in which case Always)
  imagePullSecrets:
  - name: registry-credentials     # Secret of type kubernetes.io/dockerconfigjson

# Create image pull secret from docker config
kubectl create secret docker-registry registry-credentials \
  --docker-server=myregistry.io \
  --docker-username=myuser \
  --docker-password=mypassword \
  --docker-email=me@example.com

# Or from existing docker config
kubectl create secret generic registry-credentials \
  --from-file=.dockerconfigjson=$HOME/.docker/config.json \
  --type=kubernetes.io/dockerconfigjson

# Attach to default ServiceAccount so all Pods in namespace get it
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets":[{"name":"registry-credentials"}]}'

Production: always use digest-pinned images. Tags are mutable, so myapp:latest can resolve to different content on different nodes. Pin to a digest (myapp@sha256:abc123...) for reproducibility. Use crane digest myapp:v1.2.3 to get the digest.
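
Both rules, digest pinning and the default pull policy, depend only on the shape of the image reference and are easy to check in CI. The sketch below is a simplified parser for illustration (it does not validate full reference grammar), and the function names are hypothetical.

```python
import re

DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image):
    """True if the reference ends in a full sha256 digest."""
    return DIGEST_RE.search(image) is not None


def image_tag(image):
    """Return the tag of an image reference, or None if it has none."""
    name = image.split("@", 1)[0]   # drop any digest
    last = name.rsplit("/", 1)[-1]  # last path component, so registry ports are ignored
    return last.rsplit(":", 1)[1] if ":" in last else None


def default_pull_policy(image):
    """Policy Kubernetes applies when imagePullPolicy is omitted."""
    if is_digest_pinned(image):
        return "IfNotPresent"
    return "Always" if image_tag(image) in (None, "latest") else "IfNotPresent"


print(default_pull_policy("myregistry.io/myapp:v1.2.3"))  # prints IfNotPresent
print(default_pull_policy("myregistry.io/myapp:latest"))  # prints Always
print(default_pull_policy("localhost:5000/myapp"))        # prints Always
```

A CI step could reject any manifest for which is_digest_pinned() returns False.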

13 · Production Best Practices

Container runtime production checklist (12 items)

Use SystemdCgroup=true in containerd config when kubelet uses the systemd cgroup driver. Mismatch causes Pod evictions and memory accounting errors.
Pin sandbox image (pause) to a specific digest in containerd config to prevent silent updates.
Enable seccomp RuntimeDefault cluster-wide via the kubelet --seccomp-default flag, and enforce it with the Pod Security Admission restricted profile (baseline only forbids Unconfined).
Drop ALL capabilities and add back only what your app needs. Most apps need zero capabilities.
Set readOnlyRootFilesystem: true for all containers that don't need to write to their root FS. A writable root FS lets an attacker who gains code execution drop tools or modify binaries in place.
Never run as root (runAsNonRoot: true). If the app requires root, consider rootless containers instead.
Monitor containerd/CRI-O for image GC events. When node disk fills with images, the kubelet evicts Pods. Set imageGCHighThresholdPercent and imageGCLowThresholdPercent.
Use cgroup v2 on all new nodes (default on Ubuntu 22.04+, RHEL 9+). cgroup v2 provides a single consistent hierarchy and enables newer kubelet features such as Memory QoS.
Tune terminationGracePeriodSeconds per workload. Databases may need 120s+; stateless HTTP servers usually need 15–30s.
Implement preStop hooks for graceful shutdown of apps that don't handle SIGTERM. The preStop hook runs before SIGTERM is sent.
Watch for CPU throttling with container_cpu_cfs_throttled_seconds_total. Consider removing CPU limits for latency-sensitive workloads and using LimitRange to set namespace defaults.
Audit image pull secret rotation. Credentials in imagePullSecrets are static and long-lived unless you rotate them. For cloud registries, prefer node or workload identity (for example IRSA or Workload Identity) over static credentials.

14 · Troubleshooting Container Runtime Issues

# ---- Container won't start ----
# 1. Check kubelet log for CRI errors
journalctl -u kubelet -n 200 | grep -i error

# 2. Check containerd log
journalctl -u containerd -n 200

# 3. crictl inspect for full container spec and error
crictl inspect <containerID> | jq .status.reason

# ---- Image pull failure ----
crictl pull <image>   # Test pull outside of K8s context
# Check registry credentials
kubectl get secret registry-credentials -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# ---- OOM kill diagnosis ----
dmesg | grep -i "oom-kill"
kubectl describe pod <name> | grep -A3 "Last State"

# ---- CPU throttling ----
# On the node, for a container's cgroup (cgroup v1 path shown):
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<uid>/<containerID>/cpu.stat
# nr_throttled = number of periods in which the container was throttled
# throttled_time = total ns spent throttled (cgroup v2 reports throttled_usec, in microseconds)

# ---- containerd not responding ----
systemctl restart containerd
# Pods keep running! (shim keeps them alive)
# After containerd restart, reconnects to shims automatically

# ---- Check overlay FS health ----
dmesg | grep -i overlayfs
df -h /var/lib/containerd
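
The cpu.stat counters from the CPU throttling step are easier to judge as a ratio of throttled periods to total periods. A small sketch, with hypothetical function names and made-up sample numbers:

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(stats):
    """Fraction of CFS periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Made-up sample in cgroup v1 format:
sample = "nr_periods 200\nnr_throttled 50\nthrottled_time 1234567\n"
print(throttle_ratio(parse_cpu_stat(sample)))  # prints 0.25
```

The counters are cumulative since the cgroup was created, so compare two readings taken some time apart to measure current throttling.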
