What This Page Covers
  • Pod network model: shared netns, localhost communication, port conflicts
  • Full annotated PodSpec — every significant field with defaults
  • Container spec fields: image pull policy, command vs args, env, envFrom
  • All three probe types: startup, liveness, readiness — when each fires
  • All four probe mechanisms: exec, httpGet, tcpSocket, grpc
  • Probe tuning fields: initialDelaySeconds, periodSeconds, failureThreshold, successThreshold, timeoutSeconds, terminationGracePeriodSeconds override
  • Pod-level security context: runAsUser/Group, fsGroup, fsGroupChangePolicy, supplementalGroups, sysctls
  • Container-level security context: allowPrivilegeEscalation, readOnlyRootFilesystem, capabilities add/drop, seccompProfile, privileged
  • PodSecurity admission: Restricted, Baseline, Privileged profiles and what each blocks
  • Multi-container patterns: sidecar, ambassador, adapter with concrete examples
  • Init containers: sequential ordering, failure handling, shared volumes
  • Native sidecar containers (beta in 1.29, stable in 1.33): restartPolicy: Always, startup ordering, proper shutdown
  • Ephemeral containers: kubectl debug, shareProcessNamespace, use cases
  • Container lifecycle hooks: postStart, preStop — exec vs httpGet
  • DNS policy: ClusterFirst, ClusterFirstWithHostNet, Default, None + dnsConfig
  • Host namespaces: hostNetwork, hostPID, hostIPC — security implications
  • RuntimeClass: gVisor, Kata Containers, selection per-pod
  • Pod overhead: RuntimeClass overhead added to resource accounting
  • Topology and scheduling fields: nodeSelector, nodeName, tolerations summary
  • imagePullSecrets and image pull policy details
  • 5 metrics + 4 alerts + 5 runbooks + 8 best practices

    Pod Network Model

    Every pod receives a single IP address shared by all its containers. Containers within a pod communicate over localhost — they share the same network namespace (same interfaces, same routing table, same port space). Two containers in the same pod cannot both listen on port 8080.

    Pod network model:

    ┌─────────────────────────────────────────────────────────┐
    │  Pod IP: 10.0.1.42                                      │
    │                                                         │
    │  ┌────────────────┐    ┌────────────────┐               │
    │  │ app container  │    │ sidecar (envoy)│               │
    │  │ :8080 (HTTP)   │◄──►│ :15001 (proxy) │               │
    │  │                │    │                │               │
    │  │ localhost:15001│    │ localhost:8080 │               │
    │  └───────┬────────┘    └───────┬────────┘               │
    │          └────────┬────────────┘                        │
    │                   │ shared eth0                         │
    │  10.0.1.42:8080  ← reachable from cluster               │
    │  10.0.1.42:15001 ← reachable from cluster               │
    └─────────────────────────────────────────────────────────┘

    Pause container (infra):
      → Created first, holds the netns open
      → All other containers join this existing netns (via setns(2)); they do not create their own
      → Lives for the entire pod lifetime
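
    As a minimal sketch of the shared network namespace, the manifest below runs two containers that reach each other over localhost. The pod name and the nginx image are assumptions chosen for illustration; any HTTP server listening on port 80 would behave the same way.

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: netns-demo            # hypothetical name
    spec:
      containers:
      - name: web
        image: nginx:1.27         # assumed image; listens on :80
        ports:
        - containerPort: 80
      - name: probe-client
        image: busybox:1.36
        # Reaches the web container through the shared loopback interface.
        command: ["sh", "-c", "while true; do wget -qO- http://localhost:80/ >/dev/null && echo ok; sleep 5; done"]
    ```

    If a second container in this pod also tried to bind port 80, its bind call would fail with "address already in use", because both containers share one port space.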

    Full PodSpec Anatomy

    apiVersion: v1
    kind: Pod
    metadata:
      name: web-app
      namespace: production
      labels:
        app: web-app
        version: v2.1.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      # ── Scheduling ─────────────────────────────────────────────────
      nodeSelector:
        kubernetes.io/os: linux
        node.kubernetes.io/instance-type: m5.xlarge   # optional instance type pin
    
      nodeName: ""              # direct assignment (bypasses scheduler — avoid in production)
    
      tolerations:
      - key: dedicated
        operator: Equal
        value: gpu
        effect: NoSchedule
    
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-app
    
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web-app
            topologyKey: kubernetes.io/hostname
    
      priorityClassName: high-priority   # references PriorityClass object
    
      # ── Runtime ────────────────────────────────────────────────────
      runtimeClassName: gvisor           # sandbox runtime (optional)
    
      # ── Identity ────────────────────────────────────────────────────
      serviceAccountName: web-app-sa
      automountServiceAccountToken: true  # default true; set false if no API access needed
    
      # ── Image pull ─────────────────────────────────────────────────
      imagePullSecrets:
      - name: registry-credentials
    
      # ── Networking ──────────────────────────────────────────────────
      hostname: web-0                    # overrides pod name in hostname
      subdomain: web-headless            # enables DNS: web-0.web-headless.ns.svc.cluster.local
      hostNetwork: false                 # default; true = use node's network namespace
      hostPID: false
      hostIPC: false
      dnsPolicy: ClusterFirst            # default; see DNS section
      dnsConfig:
        searches:
        - internal.corp.com
        options:
        - name: ndots
          value: "2"
    
      # ── Termination ─────────────────────────────────────────────────
      terminationGracePeriodSeconds: 60  # default 30
    
      # ── Pod-level security ─────────────────────────────────────────
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        runAsNonRoot: true
        fsGroup: 2000
        fsGroupChangePolicy: OnRootMismatch   # Always or OnRootMismatch
        supplementalGroups: [4000]
        seccompProfile:
          type: RuntimeDefault              # RuntimeDefault | Localhost | Unconfined
        sysctls:
        - name: net.core.somaxconn          # not in the safe set; kubelet must allow it via --allowed-unsafe-sysctls
          value: "1024"
    
      # ── Init containers ─────────────────────────────────────────────
      initContainers:
      - name: wait-for-db
        image: busybox:1.36
        command: ["sh", "-c", "until nc -z postgres 5432; do sleep 2; done"]
    
      # Native sidecar (1.29+): restartPolicy: Always in initContainers
      - name: log-shipper
        image: fluent/fluent-bit:3.0
        restartPolicy: Always              # makes this a sidecar, not a blocking init
        volumeMounts:
        - name: logs
          mountPath: /logs
    
      # ── Main containers ─────────────────────────────────────────────
      containers:
      - name: app
        image: myapp:v2.1.0
        imagePullPolicy: IfNotPresent      # Always | Never | IfNotPresent (default for tagged)
    
        command: ["/app/server"]           # overrides ENTRYPOINT
        args: ["--port=8080", "--log-level=info"]  # overrides CMD
    
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: metrics
          containerPort: 9090
    
        env:
        - name: DB_HOST
          value: postgres.data.svc.cluster.local
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: password
              optional: false             # fail pod if secret missing (default false)
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP     # Downward API
        - name: MY_CPU_REQUEST
          valueFrom:
            resourceFieldRef:
              containerName: app
              resource: requests.cpu
    
        envFrom:
        - configMapRef:
            name: app-config
            optional: false
        - secretRef:
            name: app-secrets
            optional: true               # pod starts even if secret absent
    
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            memory: "512Mi"              # no CPU limit — avoid throttling
    
        volumeMounts:
        - name: config
          mountPath: /etc/app
          readOnly: true
        - name: tmp
          mountPath: /tmp
        - name: logs
          mountPath: /var/log/app
    
        # Container-level security
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          capabilities:
            drop: ["ALL"]
            add: ["NET_BIND_SERVICE"]   # only if binding port <1024
          seccompProfile:
            type: RuntimeDefault
    
        lifecycle:
          postStart:
            exec:
              command: ["/bin/sh", "-c", "echo started > /tmp/started"]
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5 && /app/graceful-stop.sh"]
    
        # Probes — see detailed section below
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
    
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 0        # after startupProbe passes, no additional delay needed
          periodSeconds: 10
          failureThreshold: 3
          timeoutSeconds: 5
    
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
          successThreshold: 1
          timeoutSeconds: 3
    
      # ── Volumes ─────────────────────────────────────────────────────
      volumes:
      - name: config
        configMap:
          name: app-config
          defaultMode: 0644
      - name: tmp
        emptyDir: {}
      - name: logs
        emptyDir: {}
    
      # ── Restart policy ──────────────────────────────────────────────
      restartPolicy: Always    # Always (default) | OnFailure | Never

    Image Pull Policy

    | imagePullPolicy | When Applied Automatically | Behavior |
    |---|---|---|
    | Always | Image tag is :latest, or no tag is given | Kubelet always contacts the registry to check the digest; pulls if the digest differs. Guarantees freshness; requires registry availability at pod start. |
    | IfNotPresent | Any explicit tag other than :latest, or a digest reference | Uses the cached image if present; pulls only if the image is not on the node. Faster startup on cached nodes. |
    | Never | Never set automatically | Never pulls from the registry. Pod fails if the image is not cached on the node. Use for air-gapped environments with pre-pulled images. |

    Never Use :latest in Production

    The :latest tag makes the pull policy default to Always and makes deployments non-reproducible: the same tag may resolve to different image digests on different nodes or at different times. Use immutable digests (image@sha256:abc123...) or version tags (v1.2.3) for production workloads. Digests give the strongest supply-chain guarantee because a tag can be re-pushed while a digest cannot change.
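
    The fragment below sketches both pinning styles side by side. The registry host and image names are hypothetical, and the digest is deliberately left as a placeholder to be replaced with the value your registry reports.

    ```yaml
    containers:
    - name: app
      # Digest reference: immutable, and imagePullPolicy defaults to IfNotPresent.
      # Replace the placeholder with the real digest from your registry.
      image: registry.example.com/myapp@sha256:<64-hex-digest>
    - name: helper
      # Version tag: readable, but relies on the registry never re-pushing v1.2.3.
      image: registry.example.com/helper:v1.2.3
      imagePullPolicy: IfNotPresent
    ```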

    Probes

    Kubernetes has three probe types, each serving a distinct purpose. They are independent — a pod can have all three, any combination, or none.

    Probe execution timeline:

    Container starts
                                 │
                                 ▼
    ┌──────────────────────────────────────────────────────────────────┐
    │ startupProbe (if defined)                                        │
    │ Fires every periodSeconds until success or failureThreshold hit  │
    │ Max wait = failureThreshold × periodSeconds (e.g. 30×10 = 5min)  │
    │ → failureThreshold reached: container restarted                  │
    │ → Success: startup probe disabled permanently for this container │
    └────────────────────────────┬─────────────────────────────────────┘
                                 │ startup succeeded
                                 ▼
    (liveness and readiness probes both start here and run concurrently)

    ┌──────────────────────────────────────────────────────────────────┐
    │ livenessProbe (continuous, every periodSeconds)                  │
    │ Starts only after the startup probe has succeeded                │
    │ failureThreshold consecutive failures → container restarted      │
    │ Does NOT remove pod from Service Endpoints                       │
    └──────────────────────────────────────────────────────────────────┘
    ┌──────────────────────────────────────────────────────────────────┐
    │ readinessProbe (continuous, runs alongside liveness)             │
    │ failureThreshold failures → pod removed from Endpoints           │
    │ successThreshold successes → pod re-added to Endpoints           │
    │ Does NOT restart container                                       │
    └──────────────────────────────────────────────────────────────────┘

    | Probe | On Failure | On Success | When to Use |
    |---|---|---|---|
    | startupProbe | Container restarted (after failureThreshold) | Disables itself; liveness takes over | Slow-starting containers (JVM warmup, large model loading). Prevents liveness from killing the container during boot. |
    | livenessProbe | Container restarted | Nothing (pass is expected) | Detect deadlocks, infinite loops, and hung processes that can't self-recover. Should check minimally and not depend on external services. |
    | readinessProbe | Pod removed from Service Endpoints (no restart) | Pod re-added to Endpoints | Temporary unavailability: cache warming, circuit breaker open, dependency down. Drives traffic routing, not container health. |

    All Four Probe Mechanisms

    # 1. httpGet — HTTP GET request; success = 2xx or 3xx status code
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080            # port name or number
        scheme: HTTP          # HTTP (default) or HTTPS
        httpHeaders:
        - name: X-Health-Check
          value: "true"
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
      successThreshold: 1     # must be 1 for liveness (only readiness allows >1)
    
    # 2. exec — runs command in container; success = exit code 0
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - redis-cli ping | grep -q PONG
    
    # 3. tcpSocket — TCP connection attempt; success = connection accepted
    livenessProbe:
      tcpSocket:
        port: 5432            # useful for databases with no HTTP endpoint
        host: ""              # defaults to pod IP
    
    # 4. grpc — gRPC health protocol (grpc.health.v1.Health/Check); beta in 1.24, stable in 1.27
    livenessProbe:
      grpc:
        port: 50051
        service: ""           # empty string = check overall server health

    Probe Tuning Reference

    | Field | Default | Notes |
    |---|---|---|
    | initialDelaySeconds | 0 | Seconds after the container starts before the first probe fires. Still supported, but for slow-starting containers a startupProbe is the better tool than a large fixed delay. |
    | periodSeconds | 10 | How often to probe. Minimum 1s. |
    | timeoutSeconds | 1 | Probe must complete within this window or it counts as a failure. Set ≥ p99 response time. |
    | failureThreshold | 3 | Consecutive failures before action (restart or removal from Endpoints). |
    | successThreshold | 1 | Consecutive successes needed to count as passing again (readiness only; liveness/startup must be 1). |
    | terminationGracePeriodSeconds | Pod's value | Probe-level override (1.25+), valid on liveness and startup probes only. It is the grace period given to a container that is being restarted because the probe failed. Set it shorter than the pod's value so a hung container is killed quickly. |

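
    As a worked example of how the tuning fields combine, the fragment below annotates each probe with the approximate time it takes to act. The values are illustrative, and the arithmetic ignores probe latency, so treat the results as rough lower bounds rather than exact timings.

    ```yaml
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
      # Restart is triggered after roughly failureThreshold × periodSeconds
      # = 3 × 10 = 30s of consecutive failures.
      terminationGracePeriodSeconds: 10   # probe-level: a hung container gets 10s, not the pod's value
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 3       # removed from Endpoints after roughly 3 × 5 = 15s of failures
      successThreshold: 2       # re-added after 2 consecutive successes (roughly 10s)
    ```
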
    Liveness Probe Anti-Patterns

    A liveness probe that checks external dependencies (database connectivity, downstream API) causes cascading restarts: if the database goes down, all pods restart simultaneously — making recovery harder. Liveness probes should only detect states the process itself cannot recover from (deadlock, memory corruption). Use readiness probes for dependency health.

    Security Context

    Security context fields exist at two levels: pod-level (spec.securityContext) applies to all containers and volumes; container-level (spec.containers[*].securityContext) applies to a specific container and overrides pod-level where both exist.

    Pod-Level Security Context

    spec:
      securityContext:
        runAsUser: 1000            # UID for all containers (overridden by container-level)
        runAsGroup: 3000           # primary GID for all containers
        runAsNonRoot: true         # refuse to start if image runs as UID 0
        fsGroup: 2000              # GID applied to all volumes; files created in volumes get this GID
        fsGroupChangePolicy: OnRootMismatch  # Only chown/chmod if ownership wrong (faster for large volumes)
                                             # Always: always chown (safe but slow for large volumes)
        supplementalGroups: [4000] # additional GIDs for the process (e.g., for shared file access)
        seccompProfile:
          type: RuntimeDefault     # use container runtime's default seccomp profile
          # type: Localhost
          # localhostProfile: profiles/myprofile.json
          # type: Unconfined       # disable seccomp (not recommended)
        sysctls:                   # kernel parameters; only "safe" namespaced sysctls are allowed by default
        - name: kernel.shm_rmid_forced  # in the safe set; no extra node configuration needed
          value: "0"
        - name: net.core.somaxconn      # unsafe; the node's kubelet must list it in --allowed-unsafe-sysctls
          value: "1024"

    Container-Level Security Context

    containers:
    - name: app
      securityContext:
        runAsUser: 1000            # overrides pod-level runAsUser for this container
        runAsGroup: 3000
        runAsNonRoot: true
        allowPrivilegeEscalation: false   # prevents setuid binaries; CRITICAL: always set false
        readOnlyRootFilesystem: true      # immutable container filesystem; forces explicit writable mounts
        privileged: false                 # default; true = full host kernel access (avoid)
    
        capabilities:
          drop: ["ALL"]            # drop all capabilities first (start from minimal)
          add:
          - NET_BIND_SERVICE       # bind ports < 1024 (only if needed)
          - SYS_PTRACE             # for debugging tools; remove in production
    
        seccompProfile:
          type: RuntimeDefault
    
        procMount: Default         # Default | Unmasked (unmasked exposes /proc — avoid)

    PodSecurity Admission Profiles

    | Policy | What It Allows | What It Blocks | Use Case |
    |---|---|---|---|
    | Privileged | Everything | Nothing | System namespaces (kube-system), CNI, CSI node plugins |
    | Baseline | Most workloads running with default settings | hostPID, hostIPC, hostNetwork, privileged containers, hostPort, hostPath volumes, capabilities beyond the default set, Unconfined seccomp, unsafe sysctls | General workloads not requiring host access |
    | Restricted | Strictly hardened pods only | Everything Baseline blocks, PLUS: must set allowPrivilegeEscalation: false, must drop ALL capabilities (only NET_BIND_SERVICE may be added back), must set runAsNonRoot: true, must use RuntimeDefault or Localhost seccomp, volumes limited to approved types | Security-sensitive namespaces; PCI/SOC2 compliance |

    # Apply PodSecurity to a namespace via labels:
    kubectl label namespace production \
      pod-security.kubernetes.io/enforce=restricted \
      pod-security.kubernetes.io/enforce-version=latest \
      pod-security.kubernetes.io/warn=restricted \
      pod-security.kubernetes.io/audit=restricted
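
    For orientation, here is a minimal sketch of a pod that should pass the Restricted profile enforced by the labels above. The pod name is hypothetical, and the example assumes the image is built to run as a non-root user.

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: restricted-ok          # hypothetical name
      namespace: production
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: app
        image: myapp:v2.1.0
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}               # emptyDir is among the volume types Restricted permits
    ```

    Removing any one of allowPrivilegeEscalation, capabilities.drop, runAsNonRoot, or seccompProfile causes the API server to reject the pod in an enforce=restricted namespace.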

    Multi-Container Patterns

    Sidecar Pattern

    A helper container that augments the main container's functionality without modifying it. The two containers typically communicate through a shared volume.

    containers:
    - name: app
      image: myapp:v1
      volumeMounts:
      - name: logs
        mountPath: /var/log/app
    
    - name: log-shipper          # sidecar: ships logs to Loki/Elasticsearch
      image: fluent/fluent-bit:3.0
      volumeMounts:
      - name: logs
        mountPath: /logs
        readOnly: true
      resources:
        requests:
          cpu: "50m"
          memory: "64Mi"
        limits:
          memory: "128Mi"
    
    volumes:
    - name: logs
      emptyDir: {}              # shared between app and log-shipper

    Ambassador Pattern

    A proxy sidecar that handles outbound communication on behalf of the main container — connection pooling, retries, circuit breaking, or protocol translation.

    containers:
    - name: app
      image: myapp:v1
      env:
      - name: DATABASE_URL
        value: "localhost:5432"    # connects to ambassador on localhost, not directly to DB
    
    - name: db-ambassador          # ambassador: manages connection pool to PostgreSQL
      image: pgbouncer:1.22
      env:
      - name: DATABASES_HOST
        value: postgres.data.svc.cluster.local
      - name: POOL_MODE
        value: transaction
      ports:
      - containerPort: 5432        # app connects here; ambassador connects to real DB

    Adapter Pattern

    Transforms the main container's output into a standardized format — typically for metrics or log normalization.

    containers:
    - name: legacy-app
      image: legacy-metrics-app:v1
      # emits StatsD-format metrics over UDP to localhost:8125
    
    - name: metrics-adapter        # converts StatsD metrics to Prometheus exposition format
      image: prom/statsd-exporter:latest   # pin a version tag or digest in production
      ports:
      - containerPort: 9102         # Prometheus scrapes this port
      args:
      - --statsd.listen-udp=:8125
      - --statsd.mapping-config=/etc/mapping.yml

    Init Containers

    Init containers run sequentially before any main container starts. Each must exit with code 0 before the next begins. If an init container fails, the kubelet restarts it (with back-off) until it succeeds; with restartPolicy: Never, the pod is marked Failed instead. Init containers can mount the same volumes as main containers but run with their own image and security context.

    initContainers:
    # Step 1: wait for dependency
    - name: wait-for-postgres
      image: busybox:1.36
      command:
      - sh
      - -c
      - |
        until nc -z -w 3 postgres.data.svc.cluster.local 5432; do
          echo "Waiting for PostgreSQL..."; sleep 5
        done
        echo "PostgreSQL is ready"
    
    # Step 2: run database migrations (after DB is ready)
    - name: db-migrate
      image: myapp:v2.1.0          # same image as main app
      command: ["/app/migrate", "--direction=up"]
      env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: db-secret
            key: url
      # Shares the same env/secrets as the main app
    
    # Step 3: seed config files into a shared volume
    - name: config-init
      image: alpine:3.19
      command:
      - sh
      - -c
      - |
        cp /etc/base-config/* /shared-config/
        sed -i "s/{{ENV}}/production/g" /shared-config/app.conf
      volumeMounts:
      - name: shared-config
        mountPath: /shared-config
    
    containers:
    - name: app
      image: myapp:v2.1.0
      volumeMounts:
      - name: shared-config
        mountPath: /etc/app
        readOnly: true
    
    volumes:
    - name: shared-config
      emptyDir: {}

    Native Sidecar Containers (1.29+ / Stable 1.33)

    Classic sidecars defined in spec.containers have no ordering guarantee relative to each other. A log-shipper sidecar might miss early log output if it starts after the main app. Native sidecars fix this: defined in spec.initContainers with restartPolicy: Always, they start during the init phase (before main containers) and run for the pod's lifetime.

    initContainers:
    # Classic init: runs to completion, blocks next step
    - name: wait-for-db
      image: busybox:1.36
      command: ["sh", "-c", "until nc -z postgres 5432; do sleep 2; done"]
      # No restartPolicy → classic init behavior
    
    # Native sidecar: starts here (in init phase order), runs forever
    - name: log-shipper
      image: fluent/fluent-bit:3.0
      restartPolicy: Always          # ← this is what makes it a native sidecar
      volumeMounts:
      - name: logs
        mountPath: /logs
      startupProbe:                  # the next init container waits until this passes
        exec:
          command: ["test", "-f", "/tmp/fluent-bit.pid"]
        initialDelaySeconds: 2
    
    # Another native sidecar
    - name: envoy-proxy
      image: envoyproxy/envoy:v1.29
      restartPolicy: Always
      startupProbe:
        httpGet:
          path: /ready
          port: 9901
    
    containers:
    - name: app                      # starts AFTER both native sidecars have started
      image: myapp:v1
    Native Sidecar Shutdown Ordering

    When the pod terminates, the kubelet sends SIGTERM to native sidecars only after all main containers have exited, and it stops the sidecars in the reverse of the order in which they appear in initContainers. A log shipper can therefore flush buffered logs after the app has finished writing, which fixes the classic problem of the log sidecar dying first. The whole sequence still has to fit within the pod's terminationGracePeriodSeconds.
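
    Native sidecars also matter for Jobs, because a sidecar declared this way does not keep the pod running once the main container exits. The sketch below assumes a hypothetical Job name and a hypothetical /app/report entrypoint.

    ```yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: nightly-report          # hypothetical name
    spec:
      template:
        spec:
          restartPolicy: Never
          initContainers:
          - name: log-shipper
            image: fluent/fluent-bit:3.0
            restartPolicy: Always    # native sidecar: does not block Job completion
            volumeMounts:
            - name: logs
              mountPath: /logs
          containers:
          - name: report
            image: myapp:v2.1.0
            command: ["/app/report"] # hypothetical entrypoint
            volumeMounts:
            - name: logs
              mountPath: /var/log/app
          volumes:
          - name: logs
            emptyDir: {}
    ```

    With a classic sidecar in spec.containers, the same Job would never complete on its own, because the log shipper would keep running after the report finished.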

    Ephemeral Containers

    Ephemeral containers are temporary containers injected into a running pod for debugging. They cannot define ports, probes, or resources and are never restarted. They are useful when a minimal production image lacks debugging tools.

    # Inject a debug container into a running pod.
    # --target attaches to the app container's process namespace.
    kubectl debug -it pod/web-app-7d9b8f \
      --image=busybox:1.36 \
      --target=app \
      -- sh
    
    # The app image has no shell, so bring one in a distroless debug image.
    # --copy-to creates a copy of the pod with the debug container added.
    kubectl debug -it pod/web-app-7d9b8f \
      --image=gcr.io/distroless/base:debug \
      --copy-to=web-app-debug \
      -- sh
    
    # Seeing every process in the pod requires spec.shareProcessNamespace: true
    # (or --share-processes together with --copy-to).
    # Enabling process namespace sharing in the pod spec:
    spec:
      shareProcessNamespace: true   # containers can see each other's processes via /proc
      containers:
      - name: app
        image: myapp:v1

    Container Lifecycle Hooks

    | Hook | When It Fires | Blocking? | Failure Behavior |
    |---|---|---|---|
    | postStart | Immediately after the container is created; runs asynchronously with ENTRYPOINT, so it may run before or after it | Yes: the container is not reported as Running until the hook completes | Container killed and restarted according to restartPolicy |
    | preStop | Immediately before the container is terminated (before SIGTERM) | Yes: SIGTERM is delayed until the hook completes or the grace period expires | Container still terminated |

    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "/app/register-service.sh"]
        # OR httpGet:
        #   path: /post-start
        #   port: 8080
    
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Drain connections gracefully before SIGTERM
            sleep 5
            /app/deregister-service.sh
            # SIGTERM sent after this exits (or grace period expires)
    postStart vs readinessProbe

    postStart runs once at startup; it is not a health check. Use it for one-time initialization tasks (registering with a service registry, seeding a cache). Use readinessProbe to gate traffic until the application is actually ready to serve. The two are independent — the pod enters Ready state based on the readiness probe, not the postStart hook.
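
    On the preStop side, the grace period clock covers both the hook and the time the application needs after SIGTERM. The sketch below budgets for that; the 20-second drain time is an assumed figure for illustration, not a measurement.

    ```yaml
    spec:
      # Budget: 5s preStop sleep + about 20s application drain (assumed) + 10s margin = 35s
      terminationGracePeriodSeconds: 35
      containers:
      - name: app
        image: myapp:v1
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]   # runs first; SIGTERM follows when it exits
    ```

    If the hook and shutdown together exceed the budget, the kubelet sends SIGKILL when the grace period expires, regardless of what the hook is doing.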

    DNS Policy

    | dnsPolicy | Behavior | Use Case |
    |---|---|---|
    | ClusterFirst | The default policy. Queries go to CoreDNS; cluster.local names are resolved internally, with fallback to upstream DNS. | All normal pods |
    | ClusterFirstWithHostNet | Same as ClusterFirst but required when hostNetwork: true (otherwise hostNetwork pods use node DNS). | DaemonSets and infrastructure pods using hostNetwork that still need in-cluster DNS |
    | Default | Pod inherits the node's /etc/resolv.conf and does NOT resolve cluster services. Despite the name, this is not the default policy. | Pods that only need external DNS and never access services by cluster DNS name |
    | None | Ignores Kubernetes DNS; spec.dnsConfig must be provided. | Custom DNS configurations; pods with special resolver requirements |

    # dnsConfig for custom search domains and options
    spec:
      dnsPolicy: ClusterFirst
      dnsConfig:
        nameservers:
        - 10.96.0.10          # CoreDNS IP; merged with the policy's nameservers (mandatory only when dnsPolicy: None)
        searches:
        - production.svc.cluster.local
        - svc.cluster.local
        - cluster.local
        - internal.corp.com   # additional search domain
        options:
        - name: ndots
          value: "2"           # lower from default 5 reduces DNS lookup overhead
        - name: attempts
          value: "3"
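
    For contrast, a minimal sketch of dnsPolicy: None, where dnsConfig is the only source of resolver settings. The nameserver address is a hypothetical value from the documentation address range.

    ```yaml
    spec:
      dnsPolicy: "None"              # ignore cluster and node DNS settings entirely
      dnsConfig:
        nameservers:                 # required with None; at most 3 entries
        - 192.0.2.53                 # hypothetical resolver
        searches:
        - internal.corp.com
        options:
        - name: ndots
          value: "1"
    ```

    A pod configured this way cannot resolve Service names unless the listed resolver forwards the cluster domain to CoreDNS.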

    Host Namespaces

    | Field | Default | Effect | Security Risk |
    |---|---|---|---|
    | hostNetwork: true | false | Pod uses the node's network namespace: sees all host interfaces, binds to host ports directly | High: can eavesdrop on node traffic, bind privileged ports |
    | hostPID: true | false | Pod shares the node's PID namespace: can see and signal all processes on the node | Critical: can kill or inspect any node process |
    | hostIPC: true | false | Pod shares the node's IPC namespace: can access shared memory and semaphores of other processes | High: can read/write shared memory of host processes |

    Host Namespaces Require Explicit Justification

    hostNetwork, hostPID, and hostIPC are all blocked by the PodSecurity Baseline and Restricted profiles. They are legitimate only for infrastructure components (CNI plugins, node monitoring agents, kube-proxy). Application workloads should never use host namespaces. Audit with:

    kubectl get pods -A -o json | jq -r '.items[] | select(.spec.hostNetwork==true) | .metadata.namespace + "/" + .metadata.name'
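
    For the infrastructure case, a hostNetwork workload usually pairs the field with ClusterFirstWithHostNet so that Service names still resolve. The DaemonSet below is a sketch; its name, image, and port are hypothetical.

    ```yaml
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-agent               # hypothetical name
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: node-agent
      template:
        metadata:
          labels:
            app: node-agent
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS despite hostNetwork
          containers:
          - name: agent
            image: example.com/node-agent:v1   # hypothetical image
            ports:
            - containerPort: 9100
              hostPort: 9100         # with hostNetwork, hostPort must equal containerPort
    ```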

    RuntimeClass

    RuntimeClass selects the container runtime for a pod. Different runtimes provide different security isolation models — from the default runc (process isolation) to hardware-virtualized sandboxes.

    # RuntimeClass object (created by cluster admin)
    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: gvisor
    handler: runsc            # maps to containerd runtime handler name
    overhead:
      podFixed:
        cpu: "250m"
        memory: "256Mi"        # gVisor sentry process overhead added to resource accounting
    scheduling:
      nodeSelector:
        sandboxed: "true"      # RuntimeClass can constrain which nodes are used
      tolerations:
      - key: sandboxed
        operator: Exists
    ---
    # In pod spec:
    spec:
      runtimeClassName: gvisor     # use the gVisor sandbox

    | Runtime | Isolation | Overhead | Use Case |
    |---|---|---|---|
    | runc (default) | Linux namespaces + cgroups | Minimal | All standard workloads |
    | gVisor (runsc) | Userspace kernel intercept (ptrace/KVM) | Medium (250m CPU, 256Mi memory) | Multi-tenant, untrusted code, CI runners |
    | Kata Containers | Hardware virtualization (KVM VM per pod) | High (~500m CPU, 512Mi memory) | Strict isolation, regulated workloads, FaaS |
    | Firecracker | MicroVM per pod | Low-medium (~125ms boot) | AWS Lambda-style serverless, fast VM boot |
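
    To make the pod overhead accounting concrete, the sketch below runs a pod under the gvisor RuntimeClass defined above (podFixed overhead of 250m CPU and 256Mi memory). The pod name and request values are illustrative.

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: sandboxed-job            # hypothetical name
    spec:
      runtimeClassName: gvisor       # RuntimeClass declares podFixed overhead 250m / 256Mi
      containers:
      - name: worker
        image: myapp:v1
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            memory: "512Mi"
    # Scheduler and ResourceQuota accounting for this pod:
    #   cpu    = 500m  + 250m  = 750m
    #   memory = 512Mi + 256Mi = 768Mi
    # The RuntimeClass admission controller copies the overhead into spec.overhead;
    # do not set that field by hand.
    ```

    A node therefore needs 750m CPU and 768Mi memory free to accept this pod, even though the container itself requests less.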

    Metrics, Alerts, and Runbooks

    Key Pod Metrics

    | Metric | Source | Alert Condition |
    |---|---|---|
    | kube_pod_container_status_restarts_total | kube-state-metrics | increase > 5 in 1h → CrashLoopBackOff risk |
    | kube_pod_status_phase{phase="Failed"} | kube-state-metrics | Any Failed pod older than 5m in a production namespace |
    | kube_pod_status_ready{condition="false"} | kube-state-metrics | Pod not ready > 10m (readiness probe continuously failing) |
    | container_oom_events_total | cadvisor | Any OOM kill in production |
    | container_cpu_cfs_throttled_seconds_total | cadvisor | Throttled-period ratio (container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total) > 25% → consider raising or removing CPU limits |

    Alerting Rules

    groups:
    - name: pod-health
      rules:
      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"
    
      - alert: PodOOMKilled
        expr: |
          increase(container_oom_events_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Container OOM killed: {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }}"
    
      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{condition="false"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 10 minutes"
    
      - alert: ContainerCPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
          / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} CPU throttled >25%"

    Runbooks

    CrashLoopBackOff

    Check last exit code: kubectl describe pod POD → "Last State: Terminated reason: Error exitCode: N". Check logs: kubectl logs POD --previous. Common causes: missing env/secret, wrong command, OOM (exitCode 137), permission denied on read-only rootfs.

    OOMKilled (exitCode 137)

    Container exceeded its memory limit and was killed with SIGKILL (137 = 128 + 9). Check actual usage: kubectl top pod POD --containers. If usage grows without bound, look for a memory leak with heap profiling. If usage is stable but sits close to the limit, raise the memory limit (and the request with it). VPA can recommend right-sized requests.

    Pod Stuck Pending

    Run kubectl describe pod POD → Events section. Causes: (1) insufficient resources — check kubectl describe nodes for Allocated resources; (2) no matching nodes — check nodeSelector/affinity; (3) taint mismatch — check tolerations; (4) PVC not bound — check PVC status.

    Readiness Probe Failing

    Check probe events: kubectl describe pod POD → "Readiness probe failed". For httpGet: exec into pod and run curl localhost:PORT/PATH. Check if app has started, dependency health, and timeout vs actual response time. Increase timeoutSeconds if p99 latency exceeds probe timeout.

    Image Pull Error (ErrImagePull)

    Events show "Failed to pull image". Causes: (1) image tag doesn't exist; (2) registry credentials missing/expired — check imagePullSecrets and secret validity; (3) registry unreachable — network policy or node DNS issue; (4) rate limiting (Docker Hub). For private registries: kubectl create secret docker-registry.
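
    One way to avoid repeating imagePullSecrets in every pod is to attach the secret to the workload's ServiceAccount, as sketched below. It assumes the registry-credentials Secret from the PodSpec above already exists in the same namespace.

    ```yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: web-app-sa
      namespace: production
    imagePullSecrets:
    - name: registry-credentials     # Secret of type kubernetes.io/dockerconfigjson
    ```

    Pods that run as this ServiceAccount and do not set their own imagePullSecrets have the list injected at admission time.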

    Best Practices

    1. Always define a startupProbe for slow-starting containers — JVM apps, ML model servers, and apps with database migrations can take 30–120 seconds to start. A startupProbe with failureThreshold: 30, periodSeconds: 10 gives up to 5 minutes before liveness takes over, preventing premature container kills.
    2. Set readOnlyRootFilesystem: true and provide explicit emptyDir mounts for writable paths — forces you to enumerate where the app writes (tmp, logs, cache). Limits blast radius if the container is compromised — attackers cannot modify binaries or write scripts to /usr/local/bin.
    3. Drop all capabilities and add back only what is needed. Setting capabilities.drop: [ALL] removes all Linux capabilities. Most apps need none. Only add NET_BIND_SERVICE if binding ports < 1024 (most Kubernetes apps run on high ports and don't need this).
    4. Set allowPrivilegeEscalation: false on every container — prevents setuid/setgid binaries from escalating to root. This is a one-line fix that stops a whole class of container escape techniques. Pair with runAsNonRoot: true.
    5. Use separate liveness and readiness probes with different endpoints — liveness checks minimal viability (/healthz: can the process respond?); readiness checks full operational status (/ready: are all dependencies healthy?). Using the same endpoint for both causes cascading restarts when a dependency is down.
    6. Never set CPU limits without measuring throttling first — add container_cpu_cfs_throttled_seconds_total monitoring before adding CPU limits. If throttling exceeds 25%, the limit is too low for the workload's actual burst behavior. Many teams set requests only for CPU and rely on node-level fair-scheduling.
    7. Use preStop: sleep 5 for zero-downtime deployments — the kube-proxy/Endpoints propagation race means a pod can receive traffic 1–5 seconds after deletion begins. A preStop sleep delays SIGTERM, giving load balancers time to drain. Set terminationGracePeriodSeconds to at least preStop duration + application shutdown time + 10s.
    8. Create a dedicated ServiceAccount per workload and disable automounting on the default SA. The default ServiceAccount in each namespace automounts a token that any process in the pod can use to authenticate to the API server. Even if the workload never uses it, the token is an attack vector. Set automountServiceAccountToken: false on the default SA, and create explicit SAs with minimal RBAC for workloads that need API access.
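
    Practice 8 can be sketched as two ServiceAccount manifests. The production namespace and the web-app-sa name follow the PodSpec earlier on this page; whether the dedicated account should also disable automounting is an assumption that depends on whether the workload calls the API.

    ```yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: default
      namespace: production
    automountServiceAccountToken: false   # pods on the default SA get no token unless they opt in
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: web-app-sa                    # dedicated SA for the web-app workload
      namespace: production
    automountServiceAccountToken: false   # assumed: pods that need API access opt in at pod level
    ```

    A pod that does need the token sets automountServiceAccountToken: true in its own spec, which takes precedence over the ServiceAccount setting.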