Pods — Kubernetes Docs

What This Page Covers

Pod network model: shared netns, localhost communication, port conflicts

Full annotated PodSpec — every significant field with defaults

Container spec fields: image pull policy, command vs args, env, envFrom

All three probe types: startup, liveness, readiness — when each fires

All four probe mechanisms: exec, httpGet, tcpSocket, grpc

Probe tuning fields: initialDelaySeconds, periodSeconds, failureThreshold, successThreshold, timeoutSeconds, terminationGracePeriodSeconds override

Pod-level security context: runAsUser/Group, fsGroup, fsGroupChangePolicy, supplementalGroups, sysctls

Container-level security context: allowPrivilegeEscalation, readOnlyRootFilesystem, capabilities add/drop, seccompProfile, privileged

PodSecurity admission: Restricted, Baseline, Privileged profiles and what each blocks

Multi-container patterns: sidecar, ambassador, adapter with concrete examples

Init containers: sequential ordering, failure handling, shared volumes

Native sidecar containers (1.29+ stable 1.33): restartPolicy:Always, startup ordering, proper shutdown

Ephemeral containers: kubectl debug, shareProcessNamespace, use cases

Container lifecycle hooks: postStart, preStop — exec vs httpGet

DNS policy: ClusterFirst, ClusterFirstWithHostNet, Default, None + dnsConfig

Host namespaces: hostNetwork, hostPID, hostIPC — security implications

RuntimeClass: gVisor, Kata Containers, selection per-pod

Pod overhead: RuntimeClass overhead added to resource accounting

Topology and scheduling fields: nodeSelector, nodeName, tolerations summary

imagePullSecrets and image pull policy details

5 metrics + 4 alerts + 5 runbooks + 8 best practices

Pod Network Model

Every pod receives a single IP address shared by all its containers. Containers within a pod communicate over localhost — they share the same network namespace (same interfaces, same routing table, same port space). Two containers in the same pod cannot both listen on port 8080.

Pod network model:

┌─────────────────────────────────────────────────────────┐
│  Pod IP: 10.0.1.42                                      │
│                                                         │
│  ┌─────────────────┐    ┌─────────────────┐             │
│  │ app container   │    │ sidecar (envoy) │             │
│  │ :8080 (HTTP)    │◄──►│ :15001 (proxy)  │             │
│  │                 │    │                 │             │
│  │ localhost:15001 │    │ localhost:8080  │             │
│  └────────┬────────┘    └────────┬────────┘             │
│           │                      │                      │
│           └──────────┬───────────┘                      │
│                      │ shared eth0                      │
│  10.0.1.42:8080  ← reachable from cluster               │
│  10.0.1.42:15001 ← reachable from cluster               │
└─────────────────────────────────────────────────────────┘

Pause container (infra):
  → Created first, holds the netns open
  → All other containers join this existing netns via setns(2); clone(CLONE_NEWNET) would create a new one instead
  → Lives for the entire pod lifetime
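A minimal sketch of the shared network namespace, assuming the public nginx and busybox images. The pod and container names are illustrative. The `probe` container reaches nginx over localhost because both containers share the pod's IP and port space.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-netns-demo        # hypothetical name
spec:
  containers:
  - name: web
    image: nginx:1.27            # listens on port 80 inside the pod
    ports:
    - containerPort: 80
  - name: probe
    image: busybox:1.36
    # Same network namespace: localhost:80 here is the nginx container.
    command: ["sh", "-c", "while true; do wget -qO- http://localhost:80 >/dev/null && echo ok; sleep 10; done"]
```

If `probe` also tried to bind port 80, the bind would fail because the port is already taken in the shared namespace.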

Full PodSpec Anatomy

apiVersion: v1
kind: Pod
metadata:
  name: web-app
  namespace: production
  labels:
    app: web-app
    version: v2.1.0
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
spec:
  # ── Scheduling ─────────────────────────────────────────────────
  nodeSelector:
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: m5.xlarge   # optional instance type pin

  nodeName: ""              # direct assignment (bypasses scheduler — avoid in production)

  tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web-app

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web-app
        topologyKey: kubernetes.io/hostname

  priorityClassName: high-priority   # references PriorityClass object

  # ── Runtime ────────────────────────────────────────────────────
  runtimeClassName: gvisor           # sandbox runtime (optional)

  # ── Identity ────────────────────────────────────────────────────
  serviceAccountName: web-app-sa
  automountServiceAccountToken: true  # default true; set false if no API access needed

  # ── Image pull ─────────────────────────────────────────────────
  imagePullSecrets:
  - name: registry-credentials

  # ── Networking ──────────────────────────────────────────────────
  hostname: web-0                    # overrides pod name in hostname
  subdomain: web-headless            # with a headless Service named web-headless, enables DNS: web-0.web-headless.production.svc.cluster.local
  hostNetwork: false                 # default; true = use node's network namespace
  hostPID: false
  hostIPC: false
  dnsPolicy: ClusterFirst            # default; see DNS section
  dnsConfig:
    searches:
    - internal.corp.com
    options:
    - name: ndots
      value: "2"

  # ── Termination ─────────────────────────────────────────────────
  terminationGracePeriodSeconds: 60  # default 30

  # ── Pod-level security ─────────────────────────────────────────
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    runAsNonRoot: true
    fsGroup: 2000
    fsGroupChangePolicy: OnRootMismatch   # Always or OnRootMismatch
    supplementalGroups: [4000]
    seccompProfile:
      type: RuntimeDefault              # RuntimeDefault | Localhost | Unconfined
    sysctls:                            # net.core.somaxconn is an unsafe sysctl; the kubelet must allow it via --allowed-unsafe-sysctls
    - name: net.core.somaxconn
      value: "1024"

  # ── Init containers ─────────────────────────────────────────────
  initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ["sh", "-c", "until nc -z postgres 5432; do sleep 2; done"]

  # Native sidecar (1.29+): restartPolicy: Always in initContainers
  - name: log-shipper
    image: fluent/fluent-bit:3.0
    restartPolicy: Always              # makes this a sidecar, not a blocking init
    volumeMounts:
    - name: logs
      mountPath: /logs

  # ── Main containers ─────────────────────────────────────────────
  containers:
  - name: app
    image: myapp:v2.1.0
    imagePullPolicy: IfNotPresent      # Always | Never | IfNotPresent (default for tagged)

    command: ["/app/server"]           # overrides ENTRYPOINT
    args: ["--port=8080", "--log-level=info"]  # overrides CMD

    ports:
    - name: http
      containerPort: 8080
      protocol: TCP
    - name: metrics
      containerPort: 9090

    env:
    - name: DB_HOST
      value: postgres.data.svc.cluster.local
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-secret
          key: password
          optional: false             # container fails to start if the Secret or key is missing (default false)
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP     # Downward API
    - name: MY_CPU_REQUEST
      valueFrom:
        resourceFieldRef:
          containerName: app
          resource: requests.cpu

    envFrom:
    - configMapRef:
        name: app-config
        optional: false
    - secretRef:
        name: app-secrets
        optional: true               # pod starts even if secret absent

    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        memory: "512Mi"              # no CPU limit — avoid throttling

    volumeMounts:
    - name: config
      mountPath: /etc/app
      readOnly: true
    - name: tmp
      mountPath: /tmp
    - name: logs
      mountPath: /var/log/app

    # Container-level security
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]   # only if binding port <1024
      seccompProfile:
        type: RuntimeDefault

    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo started > /tmp/started"]
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5 && /app/graceful-stop.sh"]

    # Probes — see detailed section below
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10

    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 0        # after startupProbe passes, no additional delay needed
      periodSeconds: 10
      failureThreshold: 3
      timeoutSeconds: 5

    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
      successThreshold: 1
      timeoutSeconds: 3

  # ── Volumes ─────────────────────────────────────────────────────
  volumes:
  - name: config
    configMap:
      name: app-config
      defaultMode: 0644
  - name: tmp
    emptyDir: {}
  - name: logs
    emptyDir: {}

  # ── Restart policy ──────────────────────────────────────────────
  restartPolicy: Always    # Always (default) | OnFailure | Never

Image Pull Policy

imagePullPolicy	When Applied Automatically	Behavior
`Always`	Image tag is `:latest`, or no tag is given	Kubelet always contacts registry to check digest; pulls if digest differs. Guarantees freshness; requires registry availability at pod start.
`IfNotPresent`	Any other tag, or an image digest	Uses cached image if present; pulls only if image not on node. Default for version-tagged images. Faster startup on cached nodes.
`Never`	Never automatically set	Never pulls from registry. Pod fails if image not cached on node. Use for air-gapped environments with pre-pulled images.

Never Use :latest in Production

The :latest tag defaults imagePullPolicy to Always and makes deployments non-reproducible — the same tag may refer to different image digests at different times and on different nodes. Use immutable digests (image@sha256:abc123...) or version tags (v1.2.3) for production workloads. Reference digests in manifests for supply-chain security.
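As a sketch of digest pinning, the fragment below references an image by digest instead of by tag. The registry host is illustrative and `<digest>` is a placeholder for the sha256 value your registry reports for the release.

```yaml
containers:
- name: app
  # <digest> is a placeholder, not a real value; registry.example.com is hypothetical.
  image: registry.example.com/myapp@sha256:<digest>
  imagePullPolicy: IfNotPresent   # a digest is immutable, so a cached copy is always the right one
```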

Probes

Kubernetes has three probe types, each serving a distinct purpose. They are independent — a pod can have all three, any combination, or none.

Probe execution timeline:

Container starts
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│ startupProbe (if defined)                                        │
│ Fires every periodSeconds until success or failureThreshold hit  │
│ Max wait = failureThreshold × periodSeconds (30 × 10s = 5 min)   │
│ → failureThreshold reached: container restarted                  │
│ → Success: startup probe disabled for this container             │
└────────────────────────────────┬─────────────────────────────────┘
                                 │ startup succeeded; liveness and readiness now run in parallel
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│ livenessProbe (continuous, every periodSeconds)                  │
│ Begins only after the startup probe (if any) has succeeded       │
│ failureThreshold consecutive failures → container restarted      │
│ Does NOT remove pod from Service Endpoints                       │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ readinessProbe (continuous, every periodSeconds)                 │
│ failureThreshold failures → pod removed from Endpoints           │
│ successThreshold successes → pod re-added to Endpoints           │
│ Does NOT restart container                                       │
└──────────────────────────────────────────────────────────────────┘

Probe	On Failure	On Success	When to Use
startupProbe	Container restarted (after failureThreshold)	Disables itself; liveness takes over	Slow-starting containers (JVM warmup, large model loading). Prevents liveness from killing container during boot.
livenessProbe	Container restarted	Nothing (pass is expected)	Detect deadlocks, infinite loops, hung processes that can't self-recover. Should check minimally — not dependent on external services.
readinessProbe	Pod removed from Service Endpoints (no restart)	Pod re-added to Endpoints	Temporary unavailability: cache warming, circuit breaker open, dependency down. Drives traffic routing, not container health.

All Four Probe Mechanisms

# 1. httpGet — HTTP GET request; success = 2xx or 3xx status code
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080            # port name or number
    scheme: HTTP          # HTTP (default) or HTTPS
    httpHeaders:
    - name: X-Health-Check
      value: "true"
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1     # must be 1 for liveness (only readiness allows >1)

# 2. exec — runs command in container; success = exit code 0
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - redis-cli ping | grep -q PONG

# 3. tcpSocket — TCP connection attempt; success = connection accepted
livenessProbe:
  tcpSocket:
    port: 5432            # useful for databases with no HTTP endpoint
    host: ""              # defaults to pod IP

# 4. grpc — gRPC health protocol (grpc.health.v1.Health/Check); beta in 1.24, stable in 1.27
livenessProbe:
  grpc:
    port: 50051
    service: ""           # empty string = check overall server health

Probe Tuning Reference

Field	Default	Notes
`initialDelaySeconds`	0	Seconds after container starts before first probe fires. Still supported, but for slow-starting containers a startupProbe is usually a better fit than a large fixed delay.
`periodSeconds`	10	How often to probe. Minimum 1s.
`timeoutSeconds`	1	Probe must complete within this window or it counts as failure. Set ≥ p99 response time.
`failureThreshold`	3	Consecutive failures before action (restart or remove from Endpoints).
`successThreshold`	1	Consecutive successes to re-add to Endpoints (readiness only; liveness/startup must be 1).
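`terminationGracePeriodSeconds`	Pod's value	Per-probe override (on by default since 1.25, stable in 1.28) of the grace period used when a failed liveness or startup probe restarts the container. Typically set shorter than the pod's grace period so a hung container is replaced quickly. Not allowed on readiness probes.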
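The fragment below sketches the probe-level terminationGracePeriodSeconds field, assuming a pod that needs a long drain window on normal shutdown but should be killed quickly when its liveness probe fails. The image name and the specific durations are illustrative.

```yaml
spec:
  terminationGracePeriodSeconds: 3600   # long drain window for a normal shutdown
  containers:
  - name: app
    image: myapp:v2.1.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
      # Applies only when this probe triggers the restart:
      # a hung process gets 60s to exit, not the pod's 3600s.
      terminationGracePeriodSeconds: 60
```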

Liveness Probe Anti-Patterns

A liveness probe that checks external dependencies (database connectivity, downstream API) causes cascading restarts: if the database goes down, all pods restart simultaneously — making recovery harder. Liveness probes should only detect states the process itself cannot recover from (deadlock, memory corruption). Use readiness probes for dependency health.

Security Context

Security context fields exist at two levels: pod-level (spec.securityContext) applies to all containers and volumes; container-level (spec.containers[*].securityContext) applies to a specific container and overrides pod-level where both exist.
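To illustrate the override rule, here is a small sketch with two containers, assuming the busybox image. Names are illustrative. Both containers inherit the pod-level settings, and only `helper` replaces runAsUser.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo     # hypothetical name
spec:
  securityContext:
    runAsUser: 1000               # pod-level default for every container
    fsGroup: 2000
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "id && sleep 3600"]   # runs as UID 1000 (inherited)
  - name: helper
    image: busybox:1.36
    command: ["sh", "-c", "id && sleep 3600"]   # runs as UID 2000 (overridden)
    securityContext:
      runAsUser: 2000             # container-level value wins for this container only
```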

Pod-Level Security Context

spec:
  securityContext:
    runAsUser: 1000            # UID for all containers (overridden by container-level)
    runAsGroup: 3000           # primary GID for all containers
    runAsNonRoot: true         # refuse to start if image runs as UID 0
    fsGroup: 2000              # GID applied to all volumes; files created in volumes get this GID
    fsGroupChangePolicy: OnRootMismatch  # Only chown/chmod if ownership wrong (faster for large volumes)
                                         # Always: always chown (safe but slow for large volumes)
    supplementalGroups: [4000] # additional GIDs for the process (e.g., for shared file access)
    seccompProfile:
      type: RuntimeDefault     # use container runtime's default seccomp profile
      # type: Localhost
      # localhostProfile: profiles/myprofile.json
      # type: Unconfined       # disable seccomp (not recommended)
    sysctls:                   # kernel parameters; only safe (namespaced) sysctls allowed by default
    - name: kernel.shm_rmid_forced  # safe sysctl; allowed by default
      value: "0"
    - name: net.core.somaxconn      # unsafe sysctl; the node's kubelet must list it in --allowed-unsafe-sysctls
      value: "1024"

Container-Level Security Context

containers:
- name: app
  securityContext:
    runAsUser: 1000            # overrides pod-level runAsUser for this container
    runAsGroup: 3000
    runAsNonRoot: true
    allowPrivilegeEscalation: false   # prevents setuid binaries; CRITICAL: always set false
    readOnlyRootFilesystem: true      # immutable container filesystem; forces explicit writable mounts
    privileged: false                 # default; true = full host kernel access (avoid)

    capabilities:
      drop: ["ALL"]            # drop all capabilities first (start from minimal)
      add:
      - NET_BIND_SERVICE       # bind ports < 1024 (only if needed)
      - SYS_PTRACE             # for debugging tools; remove in production

    seccompProfile:
      type: RuntimeDefault

    procMount: Default         # Default | Unmasked (unmasked exposes /proc — avoid)

PodSecurity Admission Profiles

Policy	What It Allows	What It Blocks	Use Case
Privileged	Everything	Nothing	System namespaces (kube-system), CNI, CSI node plugins
Baseline	Most workloads running with default container settings	hostPID, hostIPC, hostNetwork, privileged containers, hostPort, hostPath volumes, capabilities beyond the default set, seccomp Unconfined, unsafe sysctls	General workloads not requiring host access
Restricted	Strictly hardened pods only	All of Baseline PLUS: must set allowPrivilegeEscalation:false, must drop ALL capabilities, must set runAsNonRoot:true, must use RuntimeDefault or Localhost seccomp, volumes limited to approved types	Security-sensitive namespaces; PCI/SOC2 compliance

# Apply PodSecurity to a namespace via labels:
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted
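For clusters managed declaratively, the same labels can be set in the Namespace manifest. The sketch below pairs it with a minimal pod that should satisfy the Restricted profile. The pod name and image are illustrative.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
---
apiVersion: v1
kind: Pod
metadata:
  name: restricted-ok             # hypothetical name
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:v2.1.0
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```

Removing any one of the four hardening settings from the pod should cause the enforce label to reject it at admission.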

Multi-Container Patterns

Sidecar Pattern

A helper container that augments the main container's functionality without modifying it. They share a volume for communication.

containers:
- name: app
  image: myapp:v1
  volumeMounts:
  - name: logs
    mountPath: /var/log/app

- name: log-shipper          # sidecar: ships logs to Loki/Elasticsearch
  image: fluent/fluent-bit:3.0
  volumeMounts:
  - name: logs
    mountPath: /logs
    readOnly: true
  resources:
    requests:
      cpu: "50m"
      memory: "64Mi"
    limits:
      memory: "128Mi"

volumes:
- name: logs
  emptyDir: {}              # shared between app and log-shipper

Ambassador Pattern

A proxy sidecar that handles outbound communication on behalf of the main container — connection pooling, retries, circuit breaking, or protocol translation.

containers:
- name: app
  image: myapp:v1
  env:
  - name: DATABASE_URL
    value: "localhost:5432"    # connects to ambassador on localhost, not directly to DB

- name: db-ambassador          # ambassador: manages connection pool to PostgreSQL
  image: pgbouncer:1.22
  env:
  - name: DATABASES_HOST
    value: postgres.data.svc.cluster.local
  - name: POOL_MODE
    value: transaction
  ports:
  - containerPort: 5432        # app connects here; ambassador connects to real DB

Adapter Pattern

Transforms the main container's output into a standardized format — typically for metrics or log normalization.

containers:
- name: legacy-app
  image: legacy-metrics-app:v1
  # emits StatsD metrics over UDP to localhost:8125 (shared network namespace)

- name: metrics-adapter        # converts StatsD metrics to Prometheus exposition format
  image: prom/statsd-exporter:v0.26.0   # pin a version tag instead of :latest
  ports:
  - containerPort: 9102         # Prometheus scrapes this port
  args:
  - --statsd.listen-udp=:8125
  - --statsd.mapping-config=/etc/mapping.yml

Init Containers

Init containers run sequentially before any main container starts. Each must exit with code 0 for the next to begin. If an init container fails, the kubelet restarts it until it succeeds; with restartPolicy: Never, the pod is marked Failed instead. Init containers can mount the same volumes as main containers but run with their own image and security context.

initContainers:
# Step 1: wait for dependency
- name: wait-for-postgres
  image: busybox:1.36
  command:
  - sh
  - -c
  - |
    until nc -z -w 3 postgres.data.svc.cluster.local 5432; do
      echo "Waiting for PostgreSQL..."; sleep 5
    done
    echo "PostgreSQL is ready"

# Step 2: run database migrations (after DB is ready)
- name: db-migrate
  image: myapp:v2.1.0          # same image as main app
  command: ["/app/migrate", "--direction=up"]
  env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: db-secret
        key: url
  # Shares the same env/secrets as the main app

# Step 3: seed config files into a shared volume
- name: config-init
  image: alpine:3.19
  command:
  - sh
  - -c
  - |
    cp /etc/base-config/* /shared-config/
    sed -i "s/{{ENV}}/production/g" /shared-config/app.conf
  volumeMounts:
  - name: shared-config
    mountPath: /shared-config

containers:
- name: app
  image: myapp:v2.1.0
  volumeMounts:
  - name: shared-config
    mountPath: /etc/app
    readOnly: true

volumes:
- name: shared-config
  emptyDir: {}

Native Sidecar Containers (1.29+ / Stable 1.33)

Classic sidecars defined in spec.containers have no ordering guarantee relative to each other. A log-shipper sidecar might miss early log output if it starts after the main app. Native sidecars fix this: defined in spec.initContainers with restartPolicy: Always, they start during the init phase (before main containers) and run for the pod's lifetime.

initContainers:
# Classic init: runs to completion, blocks next step
- name: wait-for-db
  image: busybox:1.36
  command: ["sh", "-c", "until nc -z postgres 5432; do sleep 2; done"]
  # No restartPolicy → classic init behavior

# Native sidecar: starts here (in init phase order), runs forever
- name: log-shipper
  image: fluent/fluent-bit:3.0
  restartPolicy: Always          # ← this is what makes it a native sidecar
  volumeMounts:
  - name: logs
    mountPath: /logs
  startupProbe:                  # the next init container starts only after this succeeds
    exec:
      command: ["test", "-f", "/tmp/fluent-bit.pid"]
    initialDelaySeconds: 2

# Another native sidecar
- name: envoy-proxy
  image: envoyproxy/envoy:v1.29
  restartPolicy: Always
  startupProbe:
    httpGet:
      path: /ready
      port: 9901

containers:
- name: app                      # starts AFTER both native sidecars have started (startup probes passed)
  image: myapp:v1

Native Sidecar Shutdown Ordering

When the pod terminates, native sidecars receive SIGTERM after all main containers have exited. This ensures a log-shipper flushes buffered logs before shutting down — fixing the classic problem where the log sidecar died before the app finished writing. The shutdown order is: main containers exit → native sidecars receive SIGTERM, in the reverse of the order in which they are defined. All of this still has to fit within the pod's terminationGracePeriodSeconds.

Ephemeral Containers

Ephemeral containers are temporary containers injected into a running pod for debugging. They cannot define ports, probes, or resources and are never restarted. They are useful when a minimal production image lacks debugging tools.

# Inject a debug container into a running pod.
# --target=app attaches to the app container's process namespace.
kubectl debug -it pod/web-app-7d9b8f \
  --image=busybox:1.36 \
  --target=app \
  -- sh

# The app image is distroless and has no shell, so we bring one.
# --copy-to creates a copy of the pod with the debug container added.
kubectl debug -it pod/web-app-7d9b8f \
  --image=gcr.io/distroless/base:debug \
  --copy-to=web-app-debug \
  -- sh

# Share process namespace to see all processes in the pod
# Requires spec.shareProcessNamespace: true (set in pod spec) or --copy-to with it

# Enable process namespace sharing in pod spec (for ephemeral container process access):
spec:
  shareProcessNamespace: true   # containers can see each other's processes via /proc
  containers:
  - name: app
    image: myapp:v1

Container Lifecycle Hooks

Hook	When It Fires	Blocking?	Failure Behavior
`postStart`	Immediately after container is created (async with ENTRYPOINT — may run before or after)	Yes — container stays in `ContainerCreating` until hook completes	Container killed and restarted
`preStop`	Immediately before container is terminated (before SIGTERM)	Yes — SIGTERM delayed until hook completes or grace period expires	Container still terminated

lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "/app/register-service.sh"]
    # OR httpGet:
    #   path: /post-start
    #   port: 8080

  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - |
        # Drain connections gracefully before SIGTERM
        sleep 5
        /app/deregister-service.sh
        # SIGTERM sent after this exits (or grace period expires)

postStart vs readinessProbe

postStart runs once at startup; it is not a health check. Use it for one-time initialization tasks (registering with a service registry, seeding a cache). Use readinessProbe to gate traffic until the application is actually ready to serve. The two are independent — the pod enters Ready state based on the readiness probe, not the postStart hook.

DNS Policy

dnsPolicy	Behavior	Use Case
`ClusterFirst`	Default. Queries go to CoreDNS; cluster.local names resolved internally; fallback to upstream DNS.	All normal pods
`ClusterFirstWithHostNet`	Same as ClusterFirst but required when `hostNetwork: true` (otherwise hostNetwork pods use node DNS).	DaemonSets and infrastructure pods using hostNetwork that still need in-cluster DNS
`Default`	Pod inherits the node's `/etc/resolv.conf` — does NOT resolve cluster services.	Pods that only need external DNS and never access services by cluster DNS name
`None`	Ignores Kubernetes DNS; `spec.dnsConfig` must be provided.	Custom DNS configurations; pods with special resolver requirements

# dnsConfig for custom search domains and options
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    nameservers:
    - 10.96.0.10          # CoreDNS IP; merged with the policy's nameservers, and required when dnsPolicy: None
    searches:
    - production.svc.cluster.local
    - svc.cluster.local
    - cluster.local
    - internal.corp.com   # additional search domain
    options:
    - name: ndots
      value: "2"           # lower from default 5 reduces DNS lookup overhead
    - name: attempts
      value: "3"
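For comparison, a sketch of dnsPolicy: None, where the resolver configuration comes entirely from dnsConfig. The nameserver is a placeholder from the documentation address range; cluster service names resolve only if the listed server can answer them.

```yaml
spec:
  dnsPolicy: "None"              # quoted for clarity; the value is the string None
  dnsConfig:
    nameservers:                 # at least one entry is required with None
    - 192.0.2.53                 # placeholder resolver address (TEST-NET-1)
    searches:
    - internal.corp.com
    options:
    - name: ndots
      value: "1"
```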

Host Namespaces

Field	Default	Effect	Security Risk
`hostNetwork: true`	false	Pod uses node's network namespace — sees all host interfaces, binds to host ports directly	High — can eavesdrop on node traffic, bind privileged ports
`hostPID: true`	false	Pod shares node's PID namespace — can see and signal all processes on the node	Critical — can kill or inspect any node process
`hostIPC: true`	false	Pod shares node's IPC namespace — can access shared memory and semaphores of other processes	High — can read/write shared memory of host processes

Host Namespaces Require Explicit Justification

hostNetwork, hostPID, and hostIPC are all blocked by the PodSecurity Baseline and Restricted profiles. They are legitimate only for infrastructure components (CNI plugins, node monitoring agents, kubelet itself). Application workloads should never use host namespaces. Audit with: kubectl get pods -A -o json | jq '.items[] | select(.spec.hostNetwork==true) | .metadata.namespace + "/" + .metadata.name'
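Where host networking is justified, a sketch like the following keeps the rest of the pod locked down. The name, namespace, and image are placeholders, and the namespace is assumed to run at the Privileged PodSecurity level.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-agent               # hypothetical name
  namespace: monitoring          # hypothetical; must permit hostNetwork
spec:
  hostNetwork: true              # shares the node's network namespace
  dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS despite hostNetwork
  containers:
  - name: agent
    image: node-agent:v1         # placeholder image
    ports:
    - containerPort: 9100
      hostPort: 9100             # with hostNetwork, hostPort must equal containerPort
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```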

RuntimeClass

RuntimeClass selects the container runtime for a pod. Different runtimes provide different security isolation models — from the default runc (process isolation) to hardware-virtualized sandboxes.

# RuntimeClass object (created by cluster admin)
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc            # maps to containerd runtime handler name
overhead:
  podFixed:
    cpu: "250m"
    memory: "256Mi"        # gVisor sentry process overhead added to resource accounting
scheduling:
  nodeSelector:
    sandboxed: "true"      # RuntimeClass can constrain which nodes are used
  tolerations:
  - key: sandboxed
    operator: Exists
---
# In pod spec:
spec:
  runtimeClassName: gvisor     # use the gVisor sandbox

Runtime	Isolation	Overhead	Use Case
`runc` (default)	Linux namespaces + cgroups	Minimal	All standard workloads
`gVisor` (runsc)	Userspace kernel intercept (ptrace/KVM)	Medium (250m CPU, 256Mi memory)	Multi-tenant, untrusted code, CI runners
`Kata Containers`	Hardware virtualization (KVM VM per pod)	High (~500m CPU, 512Mi memory)	Strict isolation, regulated workloads, FaaS
`Firecracker`	MicroVM per pod	Low-medium (~125ms boot)	AWS Lambda-style serverless, fast VM boot
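As a worked example of overhead accounting, assume the gvisor RuntimeClass defined above (250m CPU, 256Mi memory) and a pod whose single container requests the same amounts. The manifest shows the pod as the API server would store it. The pod name is illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app            # hypothetical name
spec:
  runtimeClassName: gvisor
  # Filled in by the RuntimeClass admission controller at creation time;
  # normally you do not set this by hand.
  overhead:
    cpu: "250m"
    memory: "256Mi"
  containers:
  - name: app
    image: myapp:v2.1.0
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
# Scheduler and quota accounting for this pod:
#   cpu:    250m  (request) + 250m  (overhead) = 500m
#   memory: 256Mi (request) + 256Mi (overhead) = 512Mi
```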

Metrics, Alerts, and Runbooks

Key Pod Metrics

Metric	Source	Alert Condition
`kube_pod_container_status_restarts_total`	kube-state-metrics	increase > 5 in 1h → CrashLoopBackOff risk
`kube_pod_status_phase{phase="Failed"}`	kube-state-metrics	Any Failed pod older than 5m in production namespace
`kube_pod_status_ready{condition="false"}`	kube-state-metrics	Pod not ready > 10m (readiness probe continuously failing)
`container_oom_events_total`	cadvisor	Any OOM kill in production
`container_cpu_cfs_throttled_periods_total`	cadvisor	throttled periods / `container_cpu_cfs_periods_total` > 25% → consider raising or removing CPU limits

Alerting Rules

groups:
- name: pod-health
  rules:
  - alert: PodCrashLooping
    expr: |
      increase(kube_pod_container_status_restarts_total[1h]) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"

  - alert: PodOOMKilled
    expr: |
      increase(container_oom_events_total[5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Container OOM killed: {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }}"

  - alert: PodNotReady
    expr: |
      kube_pod_status_ready{condition="false"} == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 10 minutes"

  - alert: ContainerCPUThrottling
    expr: |
      rate(container_cpu_cfs_throttled_periods_total[5m])
      / rate(container_cpu_cfs_periods_total[5m]) > 0.25
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} CPU throttled >25%"

Runbooks

CrashLoopBackOff

Check last exit code: kubectl describe pod POD → "Last State: Terminated reason: Error exitCode: N". Check logs: kubectl logs POD --previous. Common causes: missing env/secret, wrong command, OOM (exitCode 137), permission denied on read-only rootfs.

OOMKilled (exitCode 137)

Container exceeded its memory limit. Check actual usage: kubectl top pod POD --containers. If steady-state usage is legitimately close to the limit, raise the memory limit (and the request with it); VPA can recommend right-sized values. If usage grows without bound, the limit was appropriate and the cause is a leak: investigate with heap profiling.

Pod Stuck Pending

Run kubectl describe pod POD → Events section. Causes: (1) insufficient resources — check kubectl describe nodes for Allocated resources; (2) no matching nodes — check nodeSelector/affinity; (3) taint mismatch — check tolerations; (4) PVC not bound — check PVC status.

Readiness Probe Failing

Check probe events: kubectl describe pod POD → "Readiness probe failed". For httpGet: exec into pod and run curl localhost:PORT/PATH. Check if app has started, dependency health, and timeout vs actual response time. Increase timeoutSeconds if p99 latency exceeds probe timeout.

Image Pull Error (ErrImagePull)

Events show "Failed to pull image". Causes: (1) image tag doesn't exist; (2) registry credentials missing/expired — check imagePullSecrets and secret validity; (3) registry unreachable — network policy or node DNS issue; (4) rate limiting (Docker Hub). For private registries: kubectl create secret docker-registry.

Best Practices

Always define a startupProbe for slow-starting containers — JVM apps, ML model servers, and apps with database migrations can take 30–120 seconds to start. A startupProbe with failureThreshold: 30, periodSeconds: 10 gives up to 5 minutes before liveness takes over, preventing premature container kills.
Set readOnlyRootFilesystem: true and provide explicit emptyDir mounts for writable paths — forces you to enumerate where the app writes (tmp, logs, cache). Limits blast radius if the container is compromised — attackers cannot modify binaries or write scripts to /usr/local/bin.
Drop all capabilities and add back only what is needed — capabilities.drop: [ALL] removes all Linux capabilities. Most apps need none. Only add NET_BIND_SERVICE if binding ports < 1024 (most Kubernetes apps run on high ports and don't need this).
Set allowPrivilegeEscalation: false on every container — prevents setuid/setgid binaries from escalating to root. This is a one-line fix that stops a whole class of container escape techniques. Pair with runAsNonRoot: true.
Use separate liveness and readiness probes with different endpoints — liveness checks minimal viability (/healthz: can the process respond?); readiness checks full operational status (/ready: are all dependencies healthy?). Using the same endpoint for both causes cascading restarts when a dependency is down.
Never set CPU limits without measuring throttling — throttling only occurs once a limit exists, so monitor the ratio of container_cpu_cfs_throttled_periods_total to container_cpu_cfs_periods_total on every container that has one. If throttling exceeds 25%, the limit is too low for the workload's actual burst behavior. Many teams set requests only for CPU and rely on node-level fair-scheduling.
Use preStop: sleep 5 for zero-downtime deployments — the kube-proxy/Endpoints propagation race means a pod can receive traffic 1–5 seconds after deletion begins. A preStop sleep delays SIGTERM, giving load balancers time to drain. Set terminationGracePeriodSeconds to at least preStop duration + application shutdown time + 10s.
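Create a dedicated ServiceAccount per workload and disable automounting on the default SA — every pod that does not name a ServiceAccount runs as the namespace's default SA and has its token mounted automatically. Under default RBAC that token grants almost nothing, but any role later bound to the default SA is inherited by every such pod, and a mounted token is still a credential an attacker can steal. Set automountServiceAccountToken: false on the default SA, and create explicit SAs with minimal RBAC for workloads that need API access.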
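The last practice can be sketched as three manifests, reusing the production namespace and the web-app-sa name from the PodSpec earlier on this page. RBAC bindings are omitted and would be added per workload.

```yaml
# Stop the namespace's default ServiceAccount from mounting a token into pods.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: production
automountServiceAccountToken: false
---
# Dedicated identity for one workload; bind only the RBAC it needs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-app-sa
  namespace: production
automountServiceAccountToken: false   # opt in per pod instead
---
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  namespace: production
spec:
  serviceAccountName: web-app-sa
  automountServiceAccountToken: true  # pod-level setting overrides the ServiceAccount's
  containers:
  - name: app
    image: myapp:v2.1.0
```

Opting in at the pod level makes token access an explicit, reviewable line in each workload's manifest.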
