File Path: 00-foundations/02-container-orchestration.html
Prerequisites: 00-introduction, 01-history, basic Linux process knowledge
Concepts Covered: Linux namespaces, cgroups v1/v2, OCI image spec, OCI runtime spec, runc, containerd, CRI-O, CRI gRPC API, image layers, copy-on-write, container lifecycle, seccomp/AppArmor, rootless containers, gVisor/kata

Container Orchestration Foundations

Kubernetes orchestrates containers, but containers are not a Kubernetes concept — they are a Linux kernel feature composed from six isolation primitives. This document explains every layer of the container stack from the Linux kernel up to the CRI boundary where Kubernetes hands off control. Without this foundation, the behaviour of kubelet, CRI plugins, resource limits, and security policies cannot be fully understood.

Scope: This file covers the pre-orchestration layer: what makes a container a container. The Kubernetes-specific CRI interface is detailed further in 02-node-components/03-container-runtime-interface.html.

1 · What Is a Container?

A container is not a virtual machine. It is a Linux process (or process group) that the kernel has isolated and resource-constrained using a combination of:

  • Namespaces — what the process can see (other processes, network interfaces, file system mounts, hostname, etc.)
  • cgroups — how much resource it can consume (CPU, memory, I/O, PIDs)
  • Union file system / overlay FS — the isolated, layered file system the process boots into
  • Capabilities — which privileged kernel operations it is allowed to call
  • Seccomp — which system calls it may invoke
  • AppArmor / SELinux — mandatory access control policy over file and network access

The process itself is unmodified. The kernel surrounds it with these isolation layers transparently. A containerised nginx binary is the exact same ELF binary as one running on bare metal — the container is a property of the runtime environment, not the application.

[Diagram, top to bottom: kubelet (CRI client: RunPodSandbox / CreateContainer / StartContainer …) → high-level runtime, containerd / CRI-O (image pull, snapshot management, CRI gRPC server) → OCI runtime, runc / crun / kata / gVisor (reads config.json, calls kernel APIs, starts init process) → Linux kernel (namespaces, cgroups, seccomp, capabilities, overlayfs)]
Figure 1 — Container stack: kubelet → high-level runtime (containerd/CRI-O) → OCI low-level runtime (runc) → Linux kernel primitives.

2 · Linux Namespaces

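Linux namespaces provide the visibility isolation for containers. They arrived incrementally: mount namespaces came first in kernel 2.4.19 (2002), and user namespaces, the last piece needed for unprivileged containers, were completed in kernel 3.8 (2013). There are currently 8 namespace types: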

| Namespace | Kernel flag | What it isolates | Container use |
|---|---|---|---|
| pid | CLONE_NEWPID | Process ID space; PID 1 inside the namespace is unrelated to host PID 1 | Container init is PID 1 inside; can't see host processes |
| net | CLONE_NEWNET | Network interfaces, routing tables, iptables rules, sockets | Each Pod gets its own eth0, loopback, IP address |
| mnt | CLONE_NEWNS | Mount points (file system tree) | Container root FS is isolated; can't see host mounts |
| uts | CLONE_NEWUTS | Hostname and domain name (uname) | Container can have its own hostname independent of host |
| ipc | CLONE_NEWIPC | System V IPC, POSIX message queues | Containers share IPC only if in same Pod (same IPC ns) |
| user | CLONE_NEWUSER | UID/GID mappings; root inside can be non-root outside | Rootless containers; UID 0 inside maps to an unprivileged host UID (for example one from the /etc/subuid range) |
| cgroup | CLONE_NEWCGROUP | cgroup root; container sees a virtualised cgroup hierarchy | Container cannot see the host's cgroup tree above its own root |
| time | CLONE_NEWTIME | Offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME | Rarely used in K8s currently; useful for checkpoint/restore |

2.1 How Pods Share Namespaces

A Pod is, at the Linux level, a shared namespace envelope. The kubelet creates a pause container (also called the "infra container") first. All other containers in the Pod join the pause container's net, ipc and uts namespaces. Each container keeps its own mnt namespace (separate root FS).

| Namespace type | Shared across Pod? | Notes |
|---|---|---|
| net | Yes; all containers share one network namespace | All containers reach each other on localhost; same Pod IP; ports must not clash |
| ipc | Yes, by default | Enables shared memory between containers in the same Pod |
| uts | Yes; shared hostname | Pod hostname = Pod name by default |
| mnt | No; each container has its own rootfs | Volumes are bind-mounted into individual container mnt namespaces |
| pid | Optional (shareProcessNamespace: true) | When enabled, containers see and can signal each other's processes |
| user | Per Pod when hostUsers: false; otherwise the host user namespace | Relevant for rootless pods and the user namespace feature gate |
# Inspect namespaces for a running container on a node
CONTAINER_PID=$(crictl inspect <containerID> | jq -r .info.pid)
ls -la /proc/$CONTAINER_PID/ns/

# Output (each symlink = namespace inode):
# net  -> net:[4026532345]   # unique = isolated network namespace
# pid  -> pid:[4026532346]
# mnt  -> mnt:[4026532347]
# uts  -> uts:[4026532348]

# Verify two containers in same Pod share net namespace
# (they will have same net:[...] inode)
CONTAINER_A_PID=1234; CONTAINER_B_PID=1235
ls -la /proc/$CONTAINER_A_PID/ns/net
ls -la /proc/$CONTAINER_B_PID/ns/net
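
The same comparison can be scripted. The following is a minimal sketch, not part of any Kubernetes tooling: the function names and the `proc_root` parameter are hypothetical, and it assumes it runs on the node with enough privileges to read the `/proc/<pid>/ns/` symlinks.

```python
import os

NAMESPACE_TYPES = ("net", "ipc", "uts", "mnt", "pid", "user", "cgroup")


def namespace_ids(pid, proc_root="/proc"):
    """Return {namespace type: identifier such as 'net:[4026532345]'} for a process."""
    ids = {}
    for ns in NAMESPACE_TYPES:
        link = os.path.join(proc_root, str(pid), "ns", ns)
        try:
            ids[ns] = os.readlink(link)
        except OSError:
            # Entry missing (older kernel) or not readable without privileges
            continue
    return ids


def shared_namespaces(pid_a, pid_b, proc_root="/proc"):
    """List the namespace types whose identifiers match for both processes."""
    a = namespace_ids(pid_a, proc_root)
    b = namespace_ids(pid_b, proc_root)
    return sorted(ns for ns in a if ns in b and a[ns] == b[ns])
```

For two containers of the same Pod, `shared_namespaces(1234, 1235)` would be expected to include `net`, `ipc` and `uts` but not `mnt`.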

2.2 The Pause Container

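The pause container (image: registry.k8s.io/pause:3.9, ~700 KB) does almost nothing. It sleeps in pause() until a signal arrives, exits on SIGINT or SIGTERM, and reaps zombie processes when it is PID 1 of a shared PID namespace:

// pause.c: simplified sketch of the pause container userspace program
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

// Exit on SIGINT/SIGTERM so the sandbox can be torn down
static void sigdown(int signo) { (void)signo; _exit(0); }

// Reap zombies that were re-parented to PID 1
static void sigreap(int signo) {
    (void)signo;
    while (waitpid(-1, NULL, WNOHANG) > 0) {}
}

int main(void) {
    signal(SIGINT,  sigdown);
    signal(SIGTERM, sigdown);
    signal(SIGCHLD, sigreap);
    // Sleep until a signal arrives; the process only exists to hold the namespaces open
    for (;;) pause();
    return 0;
}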

Its sole purpose is to be the namespace anchor: as long as the pause container is alive, the Pod's network and IPC namespaces remain alive. App containers can crash and restart without losing the network namespace (and thus without losing the Pod IP address).

3 · Control Groups (cgroups)

cgroups (Control Groups) are a Linux kernel feature for resource accounting and enforcement over groups of processes. Kubernetes uses them to implement container requests and limits.

3.1 cgroup v1 Subsystems

| Subsystem | Controls | K8s use |
|---|---|---|
| cpu | CPU time shares and CFS quota | resources.requests.cpu → cpu.shares; resources.limits.cpu → cpu.cfs_quota_us |
| cpuset | Which CPUs/NUMA nodes a process may use | CPU Manager static policy pins containers to dedicated CPU cores |
| memory | Memory usage limits, OOM control | resources.limits.memory → memory.limit_in_bytes; OOM kill if exceeded |
| blkio | Block I/O bandwidth and IOPS throttling | Optionally throttle disk I/O per container (not default K8s) |
| pids | Maximum number of PIDs in the group | podPidsLimit kubelet setting (--pod-max-pids flag); prevents fork bombs |
| net_cls | Tags packets with classid for tc (traffic control) | Some CNI plugins use it for bandwidth enforcement |
| devices | Allow/deny device file access | GPU device plugin, /dev/fuse access control |
| hugetlb | Huge page memory limits | HugePage resources on Pods (DPDK, databases) |

3.2 cgroup v2 (Unified Hierarchy)

cgroup v2 (merged in kernel 4.5, default on most distros since 2021) replaces the fragmented v1 subsystem tree with a single unified hierarchy. Kubernetes declared cgroup v2 stable in v1.25.

cgroup v1 layout

/sys/fs/cgroup/
  cpu/kubepods/pod-uid/container-id/
  memory/kubepods/pod-uid/container-id/
  pids/kubepods/pod-uid/container-id/
  blkio/kubepods/pod-uid/container-id/
  cpuset/kubepods/pod-uid/container-id/
# Each subsystem in its own tree
# Controllers are mounted separately

cgroup v2 layout

/sys/fs/cgroup/
  kubepods.slice/
    kubepods-pod<uid>.slice/
      <containerID>/
        cpu.max         # replaces cpu.cfs_quota_us
        cpu.weight      # replaces cpu.shares
        memory.max      # replaces memory.limit_in_bytes
        memory.swap.max
        io.max          # combined blkio
        pids.max
# Single unified tree; all controllers in one dir

3.3 Kubernetes cgroup Hierarchy

The kubelet creates cgroups in a three-level hierarchy (QoS class → Pod → container). Guaranteed Pods have no QoS directory of their own; their Pod cgroups sit directly under kubepods:

/sys/fs/cgroup/
└── kubepods/                          ← all K8s workloads
    ├── besteffort/                    ← BestEffort QoS Pods
    │   └── pod<pod-uid>/
    ├── burstable/                     ← Burstable QoS Pods
    │   └── pod<pod-uid>/             ← per-Pod cgroup
    │       ├── <pause-container-id>/ ← pause container
    │       └── <container-id>/       ← app container cgroup
    └── pod<pod-uid>/                  ← Guaranteed QoS Pods (directly under kubepods)
        └── <container-id>/

# System (non-K8s) processes live in:
/sys/fs/cgroup/system.slice/
/sys/fs/cgroup/user.slice/
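
Which QoS branch a Pod lands in follows from its containers' requests and limits. The sketch below restates the documented classification rules in a few lines; `qos_class` and the dict layout are hypothetical names for illustration, and quantities are compared as plain strings, so `"500m"` and `"0.5"` would not be treated as equal the way the real kubelet treats them.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a Pod as Guaranteed, Burstable or BestEffort.

    Each container is a dict with optional "requests" and "limits" dicts.
    Only cpu and memory are considered.
    """
    any_set = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # An unset request defaults to the corresponding limit
        requests = {**limits, **c.get("requests", {})}
        for r in RESOURCES:
            if r in requests or r in limits:
                any_set = True
            # Guaranteed needs a limit for every resource, with request == limit
            if r not in limits or requests.get(r) != limits[r]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

A Pod with `limits: {cpu: "1", memory: "1Gi"}` on every container and no differing requests is classified as Guaranteed; adding a single container without limits moves the whole Pod to Burstable.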

3.4 How CPU Limits Work Internally

When you set resources.limits.cpu: "500m", the kubelet writes:

# cgroup v1
echo 50000 > /sys/fs/cgroup/cpu/kubepods/.../cpu.cfs_quota_us
# (100000 us is 1 CPU; 50000 us = 500m = 0.5 CPU per 100ms period)

# cgroup v2
echo "50000 100000" > /sys/fs/cgroup/kubepods/.../cpu.max
#    ^quota  ^period (microseconds)

# CPU *request* (500m) maps to cpu.shares (v1) or cpu.weight (v2):
# shares = max(2, floor(milliCPU * 1024 / 1000))
# weight (v2) = 1 + floor((shares - 2) * 9999 / 262142)  # range 1-10000
CPU throttling is a common hidden performance issue. A container with limits.cpu: "1" that tries to use 1.5 CPUs for a burst will be throttled by the CFS scheduler: once its quota is spent, the kernel withholds CPU time for the remainder of the 100ms period. This causes latency spikes even on machines with plenty of idle CPU. kubectl top pods shows usage only; to detect throttling, watch the Prometheus metric container_cpu_cfs_throttled_seconds_total. Consider setting no CPU limit (request only) for latency-sensitive apps.
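
The conversions quoted in the comments above can be checked with a few lines of arithmetic. This is a sketch of the formulas as stated in this section, not the kubelet's code; the function names are hypothetical and the 100 ms period is the default assumed throughout.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms


def cpu_quota_us(limit_millicpu, period_us=CFS_PERIOD_US):
    """CPU limit in millicores -> CFS quota in microseconds per period."""
    return limit_millicpu * period_us // 1000


def cpu_max(limit_millicpu, period_us=CFS_PERIOD_US):
    """Value written to cpu.max on cgroup v2: '<quota> <period>'."""
    return f"{cpu_quota_us(limit_millicpu, period_us)} {period_us}"


def cpu_shares(request_millicpu):
    """CPU request in millicores -> cgroup v1 cpu.shares (minimum 2)."""
    return max(2, request_millicpu * 1024 // 1000)


def cpu_weight(shares):
    """cgroup v1 cpu.shares -> cgroup v2 cpu.weight (range 1-10000)."""
    return 1 + (shares - 2) * 9999 // 262142
```

For a 500m request and limit this gives a quota of 50000 µs, `cpu.max` of `50000 100000`, 512 shares and a weight of 20.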

3.5 Memory Limits and OOM Killer

When a container exceeds resources.limits.memory, the kernel OOM killer terminates the process. The kubelet detects the exit (via CRI container status) and the container is restarted according to restartPolicy. This appears as OOMKilled in kubectl describe pod.

# Inspect OOM kills
kubectl describe pod <name> | grep -A5 "Last State"
# Last State: Terminated
#   Reason:   OOMKilled
#   Exit Code: 137

# Check kernel OOM log
dmesg | grep -i oom
# [123456.789] oom-kill:constraint=CONSTRAINT_MEMCG,task=java,...

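# Check container memory usage vs limit
kubectl top pod <name> --containers
# cgroup v1
cat /sys/fs/cgroup/memory/kubepods/.../memory.usage_in_bytes
cat /sys/fs/cgroup/memory/kubepods/.../memory.limit_in_bytes
# cgroup v2
cat /sys/fs/cgroup/kubepods/.../memory.current
cat /sys/fs/cgroup/kubepods/.../memory.max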

4 · OCI Specifications

The Open Container Initiative (OCI), founded by Docker and CoreOS in 2015 under the Linux Foundation, standardised two specifications that define what a container is:

4.1 OCI Image Specification

An OCI image is a layered, content-addressable bundle of:

  • Image manifest — JSON document listing layers and the image config
  • Image config — JSON with entrypoint, env vars, labels, architecture, OS, working dir
  • Layers — tar.gz archives of filesystem diffs; each layer is identified by its SHA-256 digest
OCI image manifest structure (annotated JSON)
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:abc123...",   // Points to image config JSON
    "size": 7023
  },
  "layers": [
    {
      // Base OS layer (e.g., debian:bookworm rootfs)
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:def456...",
      "size": 27141519
    },
    {
      // apt-get install nginx layer
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:789abc...",
      "size": 2342343
    },
    {
      // COPY nginx.conf layer
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:cdef01...",
      "size": 1024
    }
  ]
}
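
"Content-addressable" means each descriptor's digest is simply the SHA-256 of the bytes it points to, so any blob can be verified after download. The sketch below illustrates that relationship; `descriptor_for` and `verify_blob` are hypothetical helper names, not functions from a registry client.

```python
import hashlib


def descriptor_for(blob, media_type):
    """Build an OCI-style descriptor (mediaType, digest, size) for raw bytes."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def verify_blob(blob, descriptor):
    """Check that the bytes match the size and digest recorded in the descriptor."""
    algorithm, _, expected = descriptor["digest"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm}")
    return (
        len(blob) == descriptor["size"]
        and hashlib.sha256(blob).hexdigest() == expected
    )
```

Because the digest depends only on content, two images that share a base layer reference the same blob, and a runtime needs to store it only once.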

4.2 Image Layers and Copy-on-Write

Container images are composed of read-only layers stacked using a union file system (typically overlayfs on Linux, which Docker exposes as its overlay2 storage driver). When a container starts, a thin read-write layer is added on top.

[Diagram: overlayfs stack (lower/upper/work/merged dirs), top to bottom: container layer (R/W), ephemeral and per-container, where writes go; Layer 3: COPY nginx.conf (1 KB); Layer 2: apt-get install nginx (22 MB); Layer 1: debian:bookworm base (27 MB). The three image layers are shared and cached.]
Figure 2 — overlayfs: read-only image layers (shared across containers) + per-container R/W layer. File writes trigger copy-on-write into the R/W layer.
# Inspect overlayfs mounts for a running container (on the node)
CONTAINER_ID="abc123..."
# containerd stores snapshots in:
ls /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/

# View the actual overlay mount for a container
cat /proc/mounts | grep overlay
# overlay / overlay rw,lowerdir=.../sha256:..,upperdir=...,workdir=...

# Docker legacy path:
ls /var/lib/docker/overlay2/

# Check disk usage per image layer
du -sh /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/*
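
The copy-on-write behaviour can be modelled in memory. The following toy model is an illustration only: it uses dicts in place of directories, a hypothetical `OverlayView` class, and a sentinel object in place of overlayfs whiteout files, but it follows the same lookup order (upper layer first, then lower layers from top to bottom).

```python
WHITEOUT = object()  # stand-in for an overlayfs whiteout (file deleted in the upper layer)


class OverlayView:
    """Merged view of read-only lower layers plus one writable upper layer."""

    def __init__(self, lower_layers):
        # lower_layers: list of {path: bytes}, ordered bottom to top, never modified
        self.lowers = lower_layers
        self.upper = {}

    def read(self, path):
        # The upper layer wins, then lower layers from top to bottom
        for layer in [self.upper] + self.lowers[::-1]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    raise FileNotFoundError(path)
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        # Copy-up: the first modification copies the file into the upper layer
        self.upper[path] = self.read(path) + data

    def delete(self, path):
        self.read(path)  # raises if the file is not visible
        self.upper[path] = WHITEOUT
```

Two `OverlayView` objects built over the same lower layers behave like two containers from one image: a change made through one view never appears in the other, and the shared layers stay untouched.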

4.3 OCI Runtime Specification

The OCI runtime spec defines the interface between a high-level runtime and a low-level runtime. It specifies a bundle directory structure and a lifecycle API:

OCI Bundle layout:
  bundle/
    config.json         ← Generated by high-level runtime (containerd)
    rootfs/             ← Unpacked image layers (merged overlayfs)

config.json contains:
  - ociVersion
  - root.path (path to rootfs)
  - process (args, env, cwd, user, capabilities, rlimits, noNewPrivileges)
  - hostname
  - mounts (bind mounts for volumes, /dev, /proc, /sys)
  - hooks (prestart, createRuntime, createContainer, startContainer, poststart, poststop)
  - linux.namespaces (which namespaces to create/join)
  - linux.cgroupsPath
  - linux.resources (cpu, memory, devices, hugepageLimits)
  - linux.seccomp (default action plus per-syscall allow/deny rules)
  - linux.maskedPaths, readonlyPaths
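
To make the field list concrete, the sketch below assembles a deliberately small config.json as a Python dict. It is not a complete or hardened configuration (it omits mounts, capabilities and seccomp, which `runc spec` would generate); `minimal_config` and its parameters are hypothetical, and the `ociVersion` value is an assumed example.

```python
import json


def minimal_config(args, rootfs="rootfs", hostname="demo",
                   memory_limit_bytes=None, cpu_quota_us=None,
                   cpu_period_us=100_000):
    """Build a small OCI runtime config covering the main top-level fields."""
    resources = {}
    if memory_limit_bytes is not None:
        resources["memory"] = {"limit": memory_limit_bytes}
    if cpu_quota_us is not None:
        resources["cpu"] = {"quota": cpu_quota_us, "period": cpu_period_us}
    return {
        "ociVersion": "1.0.2",  # assumed spec version for this example
        "root": {"path": rootfs, "readonly": True},
        "process": {
            "args": list(args),
            "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
        },
        "hostname": hostname,
        "linux": {
            # One entry per namespace the runtime should create
            "namespaces": [{"type": t} for t in ("pid", "network", "mount", "ipc", "uts")],
            "resources": resources,
        },
    }


if __name__ == "__main__":
    cfg = minimal_config(["nginx", "-g", "daemon off;"],
                         memory_limit_bytes=256 * 1024 * 1024,
                         cpu_quota_us=50_000)
    print(json.dumps(cfg, indent=2))
```

A high-level runtime performs the same translation at a larger scale: image config and Pod spec go in, a bundle with config.json comes out.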

4.4 OCI Container Lifecycle (State Machine)

[Diagram: creating → (create) → created → (start) → running → (stop/kill) → stopped → (delete) → deleted]
Figure 3 — OCI container lifecycle. runc implements: create → start → kill → delete.
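
The lifecycle in Figure 3 is a small state machine, and writing it down as a transition table shows why, for example, `start` on a stopped container is an error. This sketch encodes only the transitions drawn in the figure; the `Container` class is hypothetical and does not mirror runc's internal types.

```python
# (current state, operation) -> next state, as drawn in Figure 3
TRANSITIONS = {
    ("creating", "create"): "created",
    ("created", "start"): "running",
    ("running", "kill"): "stopped",
    ("stopped", "delete"): "deleted",
}


class Container:
    def __init__(self):
        self.state = "creating"

    def apply(self, operation):
        """Move to the next state, or raise if the operation is not allowed here."""
        key = (self.state, operation)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {operation} a container in state {self.state}")
        self.state = TRANSITIONS[key]
        return self.state
```

Applying `create`, `start`, `kill` and `delete` in order walks the container through every state once.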

5 · runc — The OCI Low-Level Runtime

runc is the reference implementation of the OCI runtime spec, written in Go, and is the default low-level runtime used by both containerd and CRI-O.

5.1 What runc Does

  1. Reads the OCI bundle (config.json + rootfs/)
  2. Creates Linux namespaces (clone() or unshare() syscalls)
  3. Moves the process into the correct cgroup (cgroupsPath)
  4. Sets up mounts and switches root (pivot_root or chroot to rootfs)
  5. Applies AppArmor/SELinux profile
  6. Drops capabilities to the configured set
  7. Applies seccomp profile (loads BPF program via seccomp() syscall), late enough that the filter does not block runc's own setup calls
  8. Executes the container's init process (execve())
# runc command surface (used internally by containerd)
runc create --bundle /path/to/bundle mycontainer
runc start mycontainer
runc state mycontainer   # {"ociVersion":"...","id":"...","status":"running","pid":12345,...}
runc list                # all containers under runc's state directory (--root, default /run/runc)
runc kill mycontainer SIGTERM
runc delete mycontainer

# Run runc manually (debugging)
mkdir -p /tmp/mycontainer/rootfs
# Extract a rootfs tarball
tar xf alpine.tar -C /tmp/mycontainer/rootfs
# Generate a default config.json
cd /tmp/mycontainer && runc spec
# Edit config.json as needed, then:
runc run mycontainer

5.2 Alternative OCI Runtimes

| Runtime | Technology | Use case | K8s RuntimeClass |
|---|---|---|---|
| runc | Linux namespaces + cgroups | Default, general purpose | runc |
| crun | Same model as runc, written in C; smaller memory footprint | High-density, startup performance | crun |
| gVisor (runsc) | User-space kernel intercepting syscalls (Go) | Untrusted multi-tenant workloads; GKE Sandbox | gvisor |
| Kata Containers | Lightweight VM per Pod (QEMU/Cloud Hypervisor/Firecracker) | Strongest isolation; financial/regulated workloads | kata |
| Nabla (runnc) | Library OS (unikernel) confined by a strict seccomp policy | Research; minimal syscall surface | nabla |
| youki | runc re-implementation in Rust | Memory safety, active development | youki |
# Using RuntimeClass to select a non-default runtime
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc        # Matches handler name configured in containerd/CRI-O
scheduling:
  nodeSelector:
    sandbox: gvisor   # Only schedule on nodes with gVisor installed
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app
spec:
  runtimeClassName: gvisor
  containers:
  - name: app
    image: myapp:latest

6 · Container Runtime Interface (CRI)

The CRI is a gRPC API defined by Kubernetes (SIG Node) that decouples the kubelet from any specific container runtime. It was introduced as alpha in Kubernetes v1.5, and the v1 API has been the kubelet's default since v1.23. Every CRI-compliant runtime exposes a Unix socket at a well-known path (/run/containerd/containerd.sock, /var/run/crio/crio.sock, etc.).

6.1 CRI gRPC Services

The CRI defines two gRPC services (abridged here):

// RuntimeService — manages Pod sandboxes and containers
service RuntimeService {
  // Sandbox lifecycle
  rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
  rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse);
  rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse);
  rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse);
  rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse);

  // Container lifecycle
  rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
  rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
  rpc StopContainer(StopContainerRequest) returns (StopContainerResponse);
  rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse);
  rpc ListContainers(ListContainersRequest) returns (ListContainersResponse);
  rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse);

  // Exec / attach / port-forward
  rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse);
  rpc Exec(ExecRequest) returns (ExecResponse);          // returns URL for streaming
  rpc Attach(AttachRequest) returns (AttachResponse);    // returns URL for streaming
  rpc PortForward(PortForwardRequest) returns (PortForwardResponse);

  // Stats
  rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse);
  rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse);
}

// ImageService — manages images
service ImageService {
  rpc ListImages(ListImagesRequest) returns (ListImagesResponse);
  rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse);
  rpc PullImage(PullImageRequest) returns (PullImageResponse);
  rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse);
  rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse);
}

6.2 Debugging with crictl

crictl is the CLI for the CRI — it speaks directly to the CRI socket, bypassing Kubernetes entirely. Essential for node-level debugging.

# Configure crictl to use the right socket
cat /etc/crictl.yaml
# runtime-endpoint: unix:///run/containerd/containerd.sock
# image-endpoint: unix:///run/containerd/containerd.sock

# List running Pod sandboxes
crictl pods

# List containers (all states)
crictl ps -a

# Get container details (full JSON from CRI)
crictl inspect <containerID>

# Get Pod sandbox details
crictl inspectp <podID>

# Pull an image
crictl pull nginx:1.25

# List images
crictl images

# Show image filesystem usage (bytes and inodes used by stored images)
crictl imagefsinfo

# Execute a command in a container (via CRI)
crictl exec -it <containerID> sh

# View container logs via CRI (same as kubectl logs)
crictl logs <containerID>

# Container stats
crictl stats

# Stop and remove a container (without kubectl — for emergencies)
crictl stop <containerID>
crictl rm <containerID>

7 · containerd

containerd is the CNCF-graduated, OCI-compliant, high-level container runtime that most Kubernetes distributions have used as their CRI implementation since the dockershim removal in v1.24. It was originally extracted from Docker as its core runtime component.

7.1 containerd Architecture

[Diagram: kubelet → (CRI gRPC) → containerd daemon, containing the CRI plugin (gRPC server), image service + snapshotter, content store (blobs), metadata store (bbolt) and events/streaming → containerd-shim-runc-v2 → runc / crun → Linux kernel]
Figure 4 — containerd internal architecture. The CRI plugin is built-in; containerd-shim-runc-v2 is a child process that persists after containerd restart, keeping containers alive.

7.2 containerd-shim — Why It Exists

The shim (containerd-shim-runc-v2) is a small child process that sits between containerd and runc. It exists for two critical reasons:

  1. Daemonless containers: runc exits after starting the container init process. The shim becomes the parent of the container process, handling stdio forwarding and exit code collection. If containerd restarts, the shim and container keep running.
  2. Reattachment: When containerd restarts, it can reconnect to each shim over the shim's own ttrpc Unix socket, recovering container state without restarting containers.

7.3 Key containerd Paths and Configuration

# containerd config
cat /etc/containerd/config.toml

# Important config sections:
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.9"
  [plugins."io.containerd.grpc.v1.cri".containerd]
    default_runtime_name = "runc"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true   # REQUIRED when kubelet uses systemd cgroup driver

# Data paths
/var/lib/containerd/              # all containerd data
  io.containerd.content.v1.content/blobs/sha256/   # image layer blobs
  io.containerd.snapshotter.v1.overlayfs/           # snapshot layers
  io.containerd.metadata.v1.bolt/meta.db           # bbolt metadata DB

# Runtime socket
/run/containerd/containerd.sock

# Check containerd service
systemctl status containerd
journalctl -u containerd -f

8 · CRI-O

CRI-O is a lightweight, Kubernetes-only CRI implementation, a CNCF project maintained largely by Red Hat. Unlike containerd, it has no daemon-level API beyond CRI, so it is not meant to be used outside Kubernetes. This intentional narrowness keeps its code base and attack surface small, which makes it easier to audit.

| Dimension | containerd | CRI-O |
|---|---|---|
| Scope | General-purpose container runtime (Docker, nerdctl, K8s) | Kubernetes CRI only |
| Architecture | Daemon + plugins + shim | Daemon + conmon (container monitor) + OCI runtime |
| Image storage | Own content store + snapshotters | containers/storage library |
| Default runtime | runc via containerd-shim-runc-v2 | runc via conmon |
| Adoption | Default on GKE, AKS, EKS, most cloud K8s | Default on OpenShift (RHCOS), some bare-metal deployments |
| Config path | /etc/containerd/config.toml | /etc/crio/crio.conf |
| Socket path | /run/containerd/containerd.sock | /var/run/crio/crio.sock |
| Release cadence | Independent of Kubernetes minor versions | Releases in lockstep with Kubernetes minor versions |

9 · Container Security Features

9.1 Linux Capabilities

Since kernel 2.2, the Linux privilege model has split root's power into granular capabilities (about 41 in current kernels). Containers drop most capabilities by default; Kubernetes allows fine-grained control:

spec:
  containers:
  - name: app
    securityContext:
      capabilities:
        drop:
        - ALL                    # Drop every capability first
        add:
        - NET_BIND_SERVICE       # Re-add only what's needed (bind port < 1024)
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 3000
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
Common capabilities and their risks
| Capability | What it allows | Risk if granted |
|---|---|---|
| CAP_SYS_ADMIN | Mount filesystems, set hostname, and a long list of other administrative operations | Near-root; can escape container |
| CAP_NET_ADMIN | Configure network interfaces, iptables, routing | Can intercept traffic in its network namespace (all node traffic with hostNetwork) |
| CAP_SYS_PTRACE | ptrace any process in namespace | Can read memory of other containers in Pod when the PID namespace is shared |
| CAP_DAC_OVERRIDE | Bypass file permission checks | Read/write any file in the container regardless of mode |
| CAP_NET_BIND_SERVICE | Bind to ports below 1024 | Low risk; needed for HTTP/HTTPS on standard ports |
| CAP_CHOWN | Change file ownership | Can change ownership of files it has access to |
| CAP_SETUID / CAP_SETGID | Change UID/GID | Can switch to any UID/GID, including 0 |
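
To see which capabilities a running container actually holds, the `CapEff` line in `/proc/<pid>/status` can be decoded. The sketch below maps the hexadecimal bitmask to names using the bit numbers from the kernel's capability header; `decode_caps` is a hypothetical helper, and reading the mask from `/proc` is left to the caller.

```python
# Index in this list = capability bit number
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask):
    """Turn a CapEff-style hex bitmask into a list of capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Bits beyond the known list are reported by number
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"bit{bit}")
        mask >>= 1
        bit += 1
    return names
```

With the securityContext shown earlier (drop ALL, add NET_BIND_SERVICE), a process that still holds its capabilities would show a mask of `0000000000000400`, which decodes to `CAP_NET_BIND_SERVICE` alone.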

9.2 Seccomp

seccomp (Secure Computing Mode) restricts which system calls a process can make. Kubernetes supports loading seccomp profiles as JSON files and applying them per-container:

spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault     # Use containerd/CRI-O's built-in default profile
      # type: Localhost
      # localhostProfile: profiles/my-app.json
      # type: Unconfined       # NO seccomp (dangerous)

RuntimeDefault applies the runtime's default profile, which blocks several dozen dangerous or rarely needed syscalls such as kexec_load, open_by_handle_at and create_module. Making it the default for all workloads through the kubelet's SeccompDefault feature became stable in v1.27.

9.3 AppArmor and SELinux

# AppArmor (legacy beta annotation shown; since K8s v1.30 the stable form is the securityContext.appArmorProfile field)
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/nginx: runtime/default
    # or: localhost/my-custom-profile
spec:
  securityContext:
    # SELinux labels (OpenShift / RHEL nodes)
    seLinuxOptions:
      level: "s0:c123,c456"
      user: system_u
      role: system_r
      type: container_t

10 · Rootless Containers

Rootless containers run the entire container runtime (containerd or Podman) as a non-root user on the host. The container's UID 0 maps to the host user's UID via the user namespace, meaning even if a container escapes, it is not root on the host.

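Production status: Two mechanisms are involved and they mature separately. Running the runtime itself as a non-root user relies on /etc/subuid and /etc/subgid ID range delegation. Giving each Pod its own user namespace (hostUsers: false) is controlled by the UserNamespacesSupport feature gate, which reached beta in Kubernetes 1.30, and needs a container runtime and kernel with idmapped mount support. Check the release notes of your Kubernetes and runtime versions before relying on either in production.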
# Enable user namespace for a Pod (KEP-127, K8s 1.30 beta)
spec:
  hostUsers: false    # Each Pod gets its own user namespace
  containers:
  - name: app
    # UID 0 inside = mapped to unprivileged UID on host
    # Requires: feature gate UserNamespacesSupport=true
    # Requires: kernel >= 5.11 and idmap mounts

11 · Container Lifecycle Inside Kubernetes

When the kubelet is told to run a Pod, the container creation sequence is:

  1. kubelet calls RunPodSandbox → containerd creates pause container, sets up network namespace, calls CNI plugin to assign IP
  2. kubelet calls PullImage for each container image (skipped if already present)
  3. kubelet calls CreateContainer → containerd creates OCI bundle, prepares overlay FS snapshot
  4. kubelet calls StartContainer → containerd invokes runc to start the process
  5. kubelet starts running liveness/readiness probes after initialDelaySeconds
  6. kubelet reports container status back to API server via status update
# Trace the full lifecycle on a node
journalctl -u kubelet -f | grep -E "RunPodSandbox|CreateContainer|StartContainer|PullImage"

# Trace containerd's view
journalctl -u containerd -f

# Watch CRI events live
crictl events

11.1 Container Termination Sequence

When a Pod is deleted or a container must be stopped, the sequence is:

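  1. kubelet runs the container's preStop hook, if one is defined
  2. kubelet calls StopContainer with the remaining grace period as timeout → the runtime sends SIGTERM to PID 1 inside the container
  3. Container has terminationGracePeriodSeconds (default 30s, shared with the preStop hook) to exit cleanly
  4. If still running after the grace period, the runtime sends SIGKILL (containerd → runc kill)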
  5. kubelet calls RemoveContainer → containerd deletes overlay snapshot
  6. kubelet calls StopPodSandbox → containerd calls CNI DEL to release IP
  7. kubelet calls RemovePodSandbox → pause container removed
PID 1 signal handling: The kernel treats PID 1 of a PID namespace specially. A signal whose disposition is still the default action is ignored, so SIGTERM only has an effect if the process installs a handler for it. Shell wrappers (CMD ["/bin/sh", "-c", "..."]) are a frequent culprit: the shell becomes PID 1 and typically does not forward SIGTERM to the application. Use exec form in Dockerfile (CMD ["myapp"]) or use tini as PID 1 to properly forward signals.
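
For an application that runs as PID 1, handling SIGTERM explicitly is usually a few lines. The following is a minimal sketch of the pattern, assuming a simple polling main loop; `serve`, `install_handlers` and the poll interval are hypothetical names and values.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Only record the request; the main loop does the actual cleanup
    shutdown_requested.set()


def install_handlers():
    # Must be called from the main thread
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(poll_interval=0.5):
    """Run until SIGTERM arrives, then return so the process can exit cleanly."""
    install_handlers()
    while not shutdown_requested.is_set():
        # Real work would go here; wait() returns early once the event is set
        shutdown_requested.wait(poll_interval)
    # Flush buffers, close connections, finish in-flight requests here
    return 0
```

The cleanup after the loop has to finish within terminationGracePeriodSeconds, otherwise the process is killed with SIGKILL regardless.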

12 · Image Pull Policy and Registry Authentication

spec:
  containers:
  - name: app
    image: myregistry.io/myapp:v1.2.3
    imagePullPolicy: IfNotPresent  # Always | Never | IfNotPresent (default for tagged images)
  imagePullSecrets:
  - name: registry-credentials     # Secret of type kubernetes.io/dockerconfigjson
# Create image pull secret from docker config
kubectl create secret docker-registry registry-credentials \
  --docker-server=myregistry.io \
  --docker-username=myuser \
  --docker-password=mypassword \
  --docker-email=me@example.com

# Or from existing docker config
kubectl create secret generic registry-credentials \
  --from-file=.dockerconfigjson=$HOME/.docker/config.json \
  --type=kubernetes.io/dockerconfigjson

# Attach to default ServiceAccount so all Pods in namespace get it
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets":[{"name":"registry-credentials"}]}'
Production: always use digest-pinned images. Tags are mutable: myapp:latest can change between node pulls. Pin to digest: myapp@sha256:abc123... for reproducibility. Use crane digest myapp:v1.2.3 to get the digest.
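
When imagePullPolicy is omitted, Kubernetes derives it from the image reference, which is one more reason tags and digests behave differently. The sketch below restates the documented defaulting rules; `default_pull_policy` is a hypothetical helper and its reference parsing is simplified.

```python
def default_pull_policy(image):
    """Pull policy Kubernetes applies when imagePullPolicy is not set."""
    # A digest reference is immutable, so a cached copy is always valid
    if "@" in image:
        return "IfNotPresent"
    # Look only at the last path component so a registry port is not mistaken for a tag
    last = image.rsplit("/", 1)[-1]
    if ":" not in last:
        return "Always"  # no tag means :latest is implied
    tag = last.rsplit(":", 1)[1]
    return "Always" if tag == "latest" else "IfNotPresent"
```

So `myregistry.io/myapp:v1.2.3` defaults to IfNotPresent, while `nginx` and `nginx:latest` are pulled on every container start.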

13 · Production Best Practices

Container runtime production checklist (12 items)
  1. Use SystemdCgroup=true in containerd config when kubelet uses the systemd cgroup driver. A mismatch leaves two cgroup managers on the node and can cause instability under resource pressure.
  2. Pin sandbox image (pause) to a specific digest in containerd config to prevent silent updates.
  3. Enable seccomp RuntimeDefault cluster-wide via the kubelet --seccomp-default flag, or enforce it with the Pod Security Admission restricted profile.
  4. Drop ALL capabilities and add back only what your app needs. Most apps need zero capabilities.
  5. Set readOnlyRootFilesystem: true for all containers that don't need to write to their root FS. A read-only root FS limits what an attacker can modify or persist after a compromise.
  6. Never run as root (runAsNonRoot: true). If the app requires root, consider rootless containers instead.
  7. Monitor containerd/CRI-O for image GC events. When node disk fills with images, the kubelet evicts Pods. Set imageGCHighThresholdPercent and imageGCLowThresholdPercent.
  8. Use cgroup v2 on all new nodes (default on Ubuntu 22.04+, RHEL 9+). cgroup v2 provides more consistent memory accounting than v1 and is required for newer features such as Memory QoS.
  9. Tune terminationGracePeriodSeconds per workload. Databases may need 120s+; stateless HTTP servers usually need 15–30s.
  10. Implement preStop hooks for graceful shutdown of apps that don't handle SIGTERM. The preStop hook runs before SIGTERM is sent.
  11. Watch for CPU throttling with container_cpu_cfs_throttled_seconds_total. Consider removing CPU limits for latency-sensitive workloads and using LimitRange to set namespace defaults.
  12. Audit image pull secret rotation. Credentials in imagePullSecrets are long-lived by default. For cloud registries, prefer IRSA/Workload Identity over static credentials.

14 · Troubleshooting Container Runtime Issues

# ---- Container won't start ----
# 1. Check kubelet log for CRI errors
journalctl -u kubelet -n 200 | grep -i error

# 2. Check containerd log
journalctl -u containerd -n 200

# 3. crictl inspect for full container spec and error
crictl inspect <containerID> | jq .status.reason

# ---- Image pull failure ----
crictl pull <image>   # Test pull outside of K8s context
# Check registry credentials
kubectl get secret registry-credentials -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# ---- OOM kill diagnosis ----
dmesg | grep -i "oom-kill"
kubectl describe pod <name> | grep -A3 "Last State"

# ---- CPU throttling ----
# On the node, for a container's cgroup (cgroup v1 path shown):
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<uid>/<containerID>/cpu.stat
# nr_throttled = number of periods in which the container was throttled
# throttled_time = total ns spent throttled (cgroup v2 reports throttled_usec, in microseconds)

# ---- containerd not responding ----
systemctl restart containerd
# Pods keep running! (shim keeps them alive)
# After containerd restart, reconnects to shims automatically

# ---- Check overlay FS health ----
dmesg | grep -i overlayfs
df -h /var/lib/containerd
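
The cpu.stat counters are easier to interpret as a ratio of throttled periods to total periods. A minimal sketch, assuming the file content has already been read into a string; `parse_cpu_stat` and `throttle_ratio` are hypothetical helper names.

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(stats):
    """Fraction of CFS periods in which the cgroup hit its quota."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


if __name__ == "__main__":
    sample = "nr_periods 1000\nnr_throttled 250\nthrottled_time 7500000000\n"
    print(throttle_ratio(parse_cpu_stat(sample)))
```

A ratio that stays well above zero for a latency-sensitive container suggests its CPU limit is too tight for its burst pattern.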

References

  • OCI Image Spec — github.com/opencontainers/image-spec
  • OCI Runtime Spec — github.com/opencontainers/runtime-spec
  • runc source — github.com/opencontainers/runc
  • containerd docs — containerd.io/docs
  • CRI-O docs — cri-o.io
  • Linux man pages: namespaces(7), cgroups(7), capabilities(7), seccomp(2)
  • CRI API proto — github.com/kubernetes/cri-api
  • Container Security — Liz Rice, O'Reilly 2020