Pod Security

Last updated: 2025

On this page

  1. PSP Removal Background
  2. Pod Security Admission (PSA)
  3. Pod Security Standards
  4. PSS Field-by-Field Reference
  5. securityContext Deep Dive
  6. Linux Capabilities
  7. seccomp Profiles
  8. AppArmor
  9. SELinux
  10. Privileged Container Risks
  11. Host Namespaces
  12. Migrating from PSP to PSA
  13. Extending with Kyverno / OPA
  14. Metrics & Alerts
  15. Best Practices
Coverage checklist
  • PSP removal in 1.25 background
  • PSA: enforce / audit / warn modes
  • PSA namespace labels
  • PSA exemptions (namespaces, runtimeClasses, usernames)
  • PSS: privileged / baseline / restricted levels
  • PSS field-by-field complete reference table
  • securityContext: pod-level vs container-level fields
  • runAsUser / runAsGroup / fsGroup semantics
  • runAsNonRoot enforcement
  • allowPrivilegeEscalation
  • readOnlyRootFilesystem
  • capabilities: full Linux caps list, drop ALL + add pattern
  • Dangerous capabilities: CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE
  • seccomp: RuntimeDefault, Localhost, Unconfined
  • Custom seccomp profile JSON example
  • AppArmor: profile modes (enforce/complain), appArmorProfile field (added 1.30, GA 1.31) and legacy annotation
  • SELinux: seLinuxOptions fields
  • Privileged containers: kernel module load, device access, container escape
  • hostPID / hostNetwork / hostIPC risks
  • hostPath mount risks and mitigation
  • PSP → PSA migration checklist
  • Kyverno policies for pod security
  • OPA/Gatekeeper ConstraintTemplate for pod security
  • 5 metrics, 4 alerts, 5 runbooks
  • 8 best practices

PSP Removal Background

PodSecurityPolicy (PSP) was the original Kubernetes mechanism for enforcing pod-level security constraints. It was deprecated in Kubernetes 1.21 and removed in 1.25. Clusters upgrading to 1.25+ must migrate to an alternative before the upgrade.

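PSP had several fundamental design flaws:

  • Confusing authorization model: a policy applied only if the pod's creator or service account was granted the use verb on it through RBAC, so it was hard to tell which policy would govern a given pod.
  • Unpredictable selection: when several policies matched, the one chosen depended on ordering rules and on whether the policy mutated the pod.
  • Mixed mutation and validation: a PSP could silently change a pod spec as well as reject it.
  • No safe rollout path: there was no audit or dry-run mode, and enabling the admission plugin without policies in place blocked all pod creation.

Replacement options:

  • Pod Security Admission (built in; enforces the Pod Security Standards)
  • Kyverno or OPA/Gatekeeper (policy engines for custom rules beyond the three PSS levels)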

Pod Security Admission (PSA)

PSA is a built-in admission controller (GA in 1.25) that enforces Pod Security Standards by evaluating pod specs against one of three levels. It is configured per-namespace using labels.

Modes

| Mode | Label | Effect on Violation |
| --- | --- | --- |
| enforce | pod-security.kubernetes.io/enforce | Pod is rejected; cannot be created |
| audit | pod-security.kubernetes.io/audit | Pod is allowed; violation recorded in audit log |
| warn | pod-security.kubernetes.io/warn | Pod is allowed; warning message returned to client |

Each mode is independent and can reference a different level. A namespace can have all three modes active simultaneously with different levels:

# Apply PSA labels to a namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=v1.28 \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/audit-version=v1.28 \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/warn-version=v1.28
Pin the PSS version with -version labels. Each mode label has a corresponding -version label (e.g., pod-security.kubernetes.io/enforce-version=v1.28). Without it, the version defaults to latest, which means a Kubernetes upgrade could change what's enforced and suddenly break workloads. Always pin to a specific version and bump it deliberately during upgrades.
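
The same labels can also be kept declaratively in a Namespace manifest. The sketch below assumes a hypothetical namespace named staging and shows each mode set to a different level: baseline is enforced, while restricted violations are only audited and warned about.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging                                         # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline        # reject pods that violate baseline
    pod-security.kubernetes.io/enforce-version: v1.28
    pod-security.kubernetes.io/audit: restricted        # record restricted violations in the audit log
    pod-security.kubernetes.io/audit-version: v1.28
    pod-security.kubernetes.io/warn: restricted         # return restricted violations as client warnings
    pod-security.kubernetes.io/warn-version: v1.28
```

Keeping audit and warn one level ahead of enforce is a common way to preview the impact of tightening a namespace before anything is rejected.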

PSA Exemptions

Some system workloads legitimately need privileged access. PSA supports three exemption types, configured in the PodSecurity admission plugin config (via kube-apiserver --admission-control-config-file):

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    defaults:
      enforce: "restricted"          # cluster-wide default for namespaces without labels
      enforce-version: "latest"
      audit: "restricted"
      audit-version: "latest"
      warn: "restricted"
      warn-version: "latest"
    exemptions:
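      usernames:                     # Authenticated users exempt from PSA (e.g., a cluster bootstrapper)
      - "cluster-bootstrapper"       # hypothetical username; avoid exempting controller service accounts such as
                                     # system:serviceaccount:kube-system:replicaset-controller, because that
                                     # exempts every pod created through a ReplicaSet
      runtimeClasses:                # RuntimeClasses exempt from PSA (e.g., kata-containers)
      - "kata-containers"
      namespaces:                    # Namespaces fully exempt from PSA
      - kube-system
      - kube-public
      - cert-manager                 # example only; exempt a third-party namespace only if its pods cannot meet any level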
Use namespace exemptions sparingly. Exempting an entire namespace disables PSA enforcement for all pods in it. Prefer applying the privileged level to system namespaces rather than exempting them entirely — exemption bypasses audit and warn modes too, removing all visibility.
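
As a sketch of the alternative to exemption, a system namespace can be labeled with the privileged level explicitly while audit and warn stay at a stricter level, so violations remain visible. The namespace name is hypothetical, and the choice of baseline for audit and warn is an assumption rather than a requirement.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: node-agents                                   # hypothetical system namespace
  labels:
    pod-security.kubernetes.io/enforce: privileged    # nothing is rejected
    pod-security.kubernetes.io/audit: baseline        # baseline violations still reach the audit log
    pod-security.kubernetes.io/warn: baseline         # and are returned as client warnings
```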

Graduation Timeline

| Version | Status | Notes |
| --- | --- | --- |
| 1.22 | Alpha | PSA introduced; PSP still active |
| 1.23 | Beta (on by default) | Enabled by default; PSP still active |
| 1.25 | GA; PSP removed | Clusters must migrate before upgrading to 1.25 |

Pod Security Standards

Pod Security Standards (PSS) define three levels, each a superset of restrictions from the previous level. The levels apply to pod specs as a whole — if any container in the pod violates the level, the entire pod is rejected.

Privileged

  • No restrictions
  • All host namespace access allowed
  • Privileged containers allowed
  • Any capabilities allowed
  • Any seccomp profile (or none)
  • Use: kube-system, CNI plugins, node agents, storage CSI drivers

Baseline

  • No privileged containers
  • No host namespaces (hostPID, hostIPC, hostNetwork)
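  • Capabilities: additions limited to a fixed default list (no NET_RAW, SYS_ADMIN, and similar)
  • No hostPath volumes (all other volume types allowed)
  • Host ports disallowed (hostPort must be 0 or unset)
  • AppArmor: only runtime/default or localhost profiles
  • seccomp: must not be explicitly set to Unconfined (unset is allowed)
  • Only "safe" sysctls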
  • Use: General workloads not needing host access

Restricted

  • All baseline restrictions plus:
  • Must drop ALL capabilities (only NET_BIND_SERVICE may be added back)
  • seccompProfile: RuntimeDefault or Localhost required
  • runAsNonRoot: true
  • allowPrivilegeEscalation: false
  • Only specific volume types (configMap, CSI, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, secret)
  • Use: Untrusted workloads, public-facing services, high-security requirements
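
To make the gap between the levels concrete, here is a minimal sketch of a pod that passes baseline but would be rejected by restricted. The pod name and image are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: baseline-only        # hypothetical name
spec:
  containers:
  - name: web
    image: nginx:1.25        # assumed image; runs as root by default
    ports:
    - containerPort: 80
    # No securityContext is set. Nothing here violates baseline, but restricted
    # would reject the pod because runAsNonRoot, allowPrivilegeEscalation: false,
    # capabilities.drop: ["ALL"], and a seccompProfile are all missing.
```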

PSS Field-by-Field Reference

This table maps every PSS check to its pod spec field, what each level allows, and any nuances.

Host Namespaces

| Field | Privileged | Baseline | Restricted | Notes |
| --- | --- | --- | --- | --- |
| spec.hostNetwork | Any | Must be false or unset | Must be false or unset | Shares host network stack; bypasses CNI |
| spec.hostPID | Any | Must be false or unset | Must be false or unset | Can see and signal all host processes |
| spec.hostIPC | Any | Must be false or unset | Must be false or unset | Access host IPC namespace (shared memory, semaphores) |

Privileged Containers

| Field | Privileged | Baseline | Restricted |
| --- | --- | --- | --- |
| containers[].securityContext.privileged | Any | Must be false or unset | Must be false or unset |
| initContainers[].securityContext.privileged | Any | Must be false or unset | Must be false or unset |
| ephemeralContainers[].securityContext.privileged | Any | Must be false or unset | Must be false or unset |

Capabilities

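| Field | Baseline | Restricted | Notes |
| --- | --- | --- | --- |
| securityContext.capabilities.add | Only AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT (or empty) | Only NET_BIND_SERVICE (or empty) | Baseline rejects additions outside this list, such as NET_RAW or SYS_ADMIN; restricted allows only NET_BIND_SERVICE, for binding ports below 1024 |
| securityContext.capabilities.drop | No requirement | Must include ALL | Restricted requires every container to drop ALL; only NET_BIND_SERVICE may then be added back |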

Privilege Escalation

| Field | Baseline | Restricted |
| --- | --- | --- |
| securityContext.allowPrivilegeEscalation | No restriction | Must be false |

Run As Non-Root

| Field | Baseline | Restricted |
| --- | --- | --- |
| spec.securityContext.runAsNonRoot | No restriction | Must be true (pod or all containers) |
| containers[].securityContext.runAsNonRoot | No restriction | Must be true OR pod-level must be true |
| spec.securityContext.runAsUser | No restriction | Must not be 0 (if set) |

seccomp

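| Field | Baseline | Restricted |
| --- | --- | --- |
| spec.securityContext.seccompProfile.type | Must not be Unconfined (unset, RuntimeDefault, or Localhost allowed) | Must be RuntimeDefault or Localhost |
| containers[].securityContext.seccompProfile.type | Must not be Unconfined (unset, RuntimeDefault, or Localhost allowed) | Must be RuntimeDefault or Localhost (may be unset if the pod-level field is set) |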

Volumes

| Baseline Allowed Volume Types | Restricted (subset of baseline) |
| --- | --- |
| Every volume type except hostPath, which is forbidden outright | configMap, csi, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, secret only; no hostPath, no NFS |

Host Ports

| Field | Baseline / Restricted |
| --- | --- |
| containers[].ports[].hostPort | Must be 0 or unset. Any hostPort greater than 0 violates baseline, and therefore restricted too. |

AppArmor (baseline only check)

| Annotation | Baseline: Allowed values | Notes |
| --- | --- | --- |
| container.apparmor.security.beta.kubernetes.io/<container> | runtime/default or localhost/<profile> | unconfined violates baseline |

Sysctls

| Field | Baseline / Restricted |
| --- | --- |
| spec.securityContext.sysctls[].name | Only "safe" sysctls: kernel.shm_rmid_forced, net.ipv4.ip_local_port_range, net.ipv4.ip_unprivileged_port_start, net.ipv4.tcp_syncookies, net.ipv4.ping_group_range |
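
As an illustration of a safe sysctl in use, the sketch below lowers net.ipv4.ip_unprivileged_port_start so a non-root container can bind port 80 without adding NET_BIND_SERVICE. The pod name and image are placeholders, and the value "0" is an assumption about what the workload needs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-port-nonroot                           # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
    sysctls:
    - name: net.ipv4.ip_unprivileged_port_start    # one of the safe sysctls
      value: "0"                                   # any port may be bound without extra capabilities
  containers:
  - name: web
    image: registry.example.com/web:1.0            # placeholder image
    ports:
    - containerPort: 80
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```

The sysctl applies to the pod's own network namespace, so it does not change port rules on the host.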

securityContext Deep Dive

Pod-Level vs Container-Level Fields

Some fields exist at both pod and container scope. Container-level overrides pod-level. Some fields only exist at one scope:

| Field | Pod-Level | Container-Level | Notes |
| --- | --- | --- | --- |
| runAsUser | ✅ Default for all containers | ✅ Overrides pod-level | UID the container process runs as; overrides the USER set in the image |
| runAsGroup | ✅ Primary GID for all containers | ✅ Overrides pod-level | |
| runAsNonRoot | ✅ Default for all containers | ✅ Overrides pod-level | Kubelet verifies UID != 0 at container start; fails if image runs as root |
| fsGroup | ✅ Only pod-level | | Supplemental GID applied to volume mounts; owns files in mounted volumes |
| fsGroupChangePolicy | ✅ Only pod-level | | OnRootMismatch (1.20+): only chown if root ownership wrong; avoids slow chown on large volumes |
| supplementalGroups | ✅ Only pod-level | | Additional GIDs for all containers |
| seccompProfile | ✅ Default for all containers | ✅ Overrides pod-level | Container-level overrides pod-level profile |
| seLinuxOptions | ✅ Default | ✅ Overrides pod-level | |
| sysctls | ✅ Only pod-level | | Only safe sysctls without special runtime config |
| privileged | | ✅ Only container-level | |
| allowPrivilegeEscalation | | ✅ Only container-level | |
| capabilities | | ✅ Only container-level | |
| readOnlyRootFilesystem | | ✅ Only container-level | |
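
The override behavior can be sketched with two containers in one pod: one inherits the pod-level identity, the other replaces it. All names and images below are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: override-example                       # hypothetical name
spec:
  securityContext:
    runAsUser: 1000                            # default UID for every container in the pod
    runAsGroup: 3000
    fsGroup: 2000                              # pod-level only
  containers:
  - name: app
    image: registry.example.com/app:1.0        # placeholder image
    # no securityContext: runs as UID 1000, GID 3000
  - name: sidecar
    image: registry.example.com/sidecar:1.0    # placeholder image
    securityContext:
      runAsUser: 2000                          # overrides the pod-level UID for this container only
      readOnlyRootFilesystem: true             # container-level only
```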

Complete Restricted-Compliant securityContext

apiVersion: v1
kind: Pod
metadata:
  name: restricted-example               # hypothetical name; a Pod needs metadata.name to be valid
spec:
  securityContext:                       # Pod-level
    runAsNonRoot: true
    runAsUser: 1000                      # must not be 0
    runAsGroup: 3000
    fsGroup: 2000                        # volume files owned by GID 2000
    fsGroupChangePolicy: OnRootMismatch  # avoid slow recursive chown
    seccompProfile:
      type: RuntimeDefault               # applies to all containers unless overridden
    sysctls: []                          # no unsafe sysctls
  containers:
  - name: app
    image: myapp:latest
    securityContext:                     # Container-level (overrides pod-level where applicable)
      allowPrivilegeEscalation: false    # MUST be false for restricted
      readOnlyRootFilesystem: true       # prevents writes to container rootfs
      capabilities:
        drop: ["ALL"]                    # MUST drop ALL for restricted
        # add: ["NET_BIND_SERVICE"]      # only if binding port < 1024
      seccompProfile:
        type: RuntimeDefault             # can override pod-level per-container
      runAsNonRoot: true                 # belt-and-suspenders with pod-level
    volumeMounts:
    - name: tmp
      mountPath: /tmp                    # writable temp dir for apps that need it
    - name: cache
      mountPath: /app/cache
  volumes:
  - name: tmp
    emptyDir: {}                         # allowed in restricted
  - name: cache
    emptyDir: {}
  # NOT allowed in restricted: hostPath and NFS volumes, hostNetwork/hostPID/hostIPC.
  # Secrets exposed through env vars are allowed but discouraged; prefer mounted secret volumes.

fsGroup and Volume Ownership

When fsGroup is set, the kubelet changes ownership of mounted volumes to the specified GID. For large volumes (hundreds of thousands of files), this recursive chown causes significant pod startup latency. Use fsGroupChangePolicy: OnRootMismatch (GA 1.23) to skip the chown if the root directory already has the correct ownership.

Linux Capabilities

Linux capabilities split root privilege into discrete units. When a container runs without privileged: true, it starts with a default set of capabilities inherited from the container runtime. The recommended security posture is drop ALL, add only what's needed.

Default Container Capabilities (containerd/CRI-O)

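By default, containerd grants containers these capabilities: CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW, SETGID, SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE, SYS_CHROOT, KILL, AUDIT_WRITE. The exact set depends on the runtime and its version; recent CRI-O releases ship a smaller default list.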

NET_RAW in default capabilities is a known risk. NET_RAW allows raw socket creation (ping, ARP spoofing, packet crafting). It's in the default set for compatibility but should be dropped in almost all workloads. Drop it explicitly: capabilities.drop: ["NET_RAW"] or drop: ["ALL"].

Dangerous Capabilities Reference

| Capability | What It Allows | Attack Potential |
| --- | --- | --- |
| CAP_SYS_ADMIN | Mount filesystems, create and enter namespaces, and many other administrative operations | Container escape: effectively root on host |
| CAP_SYS_PTRACE | Trace and inspect other processes; inject code | Process hijack: can ptrace processes on host if hostPID=true |
| CAP_NET_ADMIN | Configure network interfaces, routes, firewall rules, packet mangling | Network attack: ARP poisoning, traffic interception, iptables modification |
| CAP_SYS_MODULE | Load/unload kernel modules | Container escape: insert rootkit kernel module |
| CAP_SYS_RAWIO | Direct access to I/O ports, /dev/mem, /dev/kmem | Container escape via physical memory access |
| CAP_DAC_READ_SEARCH | Bypass file read permission checks and directory search checks; read any file | Data exfiltration: read /etc/shadow, host secrets |
| CAP_CHOWN | Change file ownership | Privilege escalation: change ownership of sensitive files |
| CAP_SETUID | Set UID; switch to any user including root | Root escalation |
| CAP_NET_RAW | Raw sockets, packet injection | Network attack |
| CAP_MKNOD | Create device files | Device access |

Capability Drop ALL Pattern

# Minimum viable capabilities for most web applications
securityContext:
  capabilities:
    drop: ["ALL"]
    # Most apps need zero capabilities after dropping ALL
    # Common exceptions:
    # add: ["NET_BIND_SERVICE"]   # ONLY if binding a port below 1024 (e.g., 80 or 443) as non-root
    # add: ["SYS_NICE"]           # ONLY if app sets process priority (rare)
    # add: ["IPC_LOCK"]           # ONLY if app uses mlock() for security (e.g., Vault)

seccomp Profiles

seccomp (secure computing mode) is a Linux kernel feature that restricts the syscalls a process can make. Kubernetes supports three seccomp profile types:

| Type | Description | Security Level |
| --- | --- | --- |
| Unconfined | No syscall filtering; all syscalls allowed | Minimum; applied when no profile is set, unless the kubelet's seccompDefault option is enabled |
| RuntimeDefault | Container runtime's default profile (Docker's default profile blocks around 44 syscalls) | Good baseline; recommended for most workloads |
| Localhost | Custom JSON profile from /var/lib/kubelet/seccomp/ on the node | Maximum; application-specific allowlist |
RuntimeDefault is not applied automatically on every cluster. The SeccompDefault feature reached GA in Kubernetes 1.27, but it takes effect only on nodes where the kubelet runs with --seccomp-default (or seccompDefault: true in its configuration); otherwise pods without a seccompProfile still run Unconfined. Explicitly setting RuntimeDefault in the pod spec remains best practice for clarity and portability.
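
A minimal sketch of opting a node in to RuntimeDefault as the default profile, assuming the kubelet on that node is configured through a KubeletConfiguration file:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
seccompDefault: true    # equivalent to the kubelet flag --seccomp-default
```

Because this is a per-node setting, pods that must run with a profile regardless of where they are scheduled should still set seccompProfile themselves.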

RuntimeDefault Profile

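The RuntimeDefault profile is defined by the container runtime. For containerd, it's based on Docker's default seccomp profile. It blocks syscalls including:

  • kexec_load, kexec_file_load, reboot (replace or restart the host kernel)
  • init_module, finit_module, delete_module (load or unload kernel modules)
  • mount, umount2, pivot_root (change the filesystem view)
  • setns, unshare (join or create namespaces)
  • swapon, swapoff, settimeofday, clock_settime (host-wide state)

Several of these are permitted again when the container holds the matching capability (for example, mount with CAP_SYS_ADMIN).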

Custom Localhost Profile

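Example profile stored at /var/lib/kubelet/seccomp/profiles/nginx.json. It takes an allowlist approach: only the listed syscalls are permitted and everything else fails with an error. JSON does not allow comments, so keep notes like this one out of the file itself. The syscall list is illustrative; trace your own workload before relying on it.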
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": [
        "accept4", "access", "arch_prctl", "bind", "brk", "capget", "capset",
        "chdir", "chmod", "chown", "clone", "close", "connect", "dup", "dup2",
        "epoll_create1", "epoll_ctl", "epoll_wait", "eventfd2", "execve",
        "exit", "exit_group", "faccessat", "fchmod", "fchown", "fcntl",
        "fstat", "fstatfs", "futex", "getcwd", "getdents64", "getegid",
        "geteuid", "getgid", "getpid", "getppid", "getrandom", "gettid",
        "gettimeofday", "getuid", "ioctl", "lseek", "lstat", "madvise",
        "mkdir", "mmap", "mprotect", "munmap", "nanosleep", "open",
        "openat", "pipe2", "poll", "prctl", "pread64", "read", "readlink",
        "recv", "recvfrom", "recvmsg", "rename", "rt_sigaction",
        "rt_sigprocmask", "rt_sigreturn", "send", "sendfile", "sendmsg",
        "sendto", "set_robust_list", "setgid", "setgroups", "setuid",
        "socket", "stat", "statfs", "sysinfo", "tgkill", "uname",
        "unlink", "wait4", "write", "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
# Use the custom localhost profile in a pod
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: profiles/nginx.json   # relative to /var/lib/kubelet/seccomp/
Custom seccomp profiles must be present on every node. Localhost profiles are read from the node's filesystem. You must ensure the profile file is deployed to every node before pods using it can be scheduled. Use a DaemonSet to distribute profiles, or use the Security Profiles Operator (a Kubernetes SIG project, formerly seccomp-operator), which manages profile distribution via CRDs.
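
One way to sketch the DaemonSet approach is an init container that copies profiles from a ConfigMap onto each node. Everything named here is an assumption for illustration: the DaemonSet name, the seccomp-profiles ConfigMap (expected to contain nginx.json), and the images. Because it mounts a hostPath, the namespace it runs in must allow the privileged level.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: seccomp-profile-installer            # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: seccomp-profile-installer
  template:
    metadata:
      labels:
        app: seccomp-profile-installer
    spec:
      initContainers:
      - name: install
        image: busybox:1.36                  # assumed image providing sh and cp
        # Copy every profile from the ConfigMap into the kubelet's seccomp directory
        command: ["sh", "-c", "cp /profiles/*.json /host/seccomp/profiles/"]
        volumeMounts:
        - name: profiles
          mountPath: /profiles
          readOnly: true
        - name: host-seccomp
          mountPath: /host/seccomp/profiles
      containers:
      - name: pause                          # keeps the pod running after the copy completes
        image: registry.k8s.io/pause:3.9
      volumes:
      - name: profiles
        configMap:
          name: seccomp-profiles             # hypothetical ConfigMap holding nginx.json
      - name: host-seccomp
        hostPath:
          path: /var/lib/kubelet/seccomp/profiles
          type: DirectoryOrCreate
```

Updating the ConfigMap does not re-run the init container, so a rollout restart of the DaemonSet is needed to push a changed profile.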

AppArmor

AppArmor is a Linux MAC (Mandatory Access Control) system that restricts program capabilities based on per-program profiles. It complements seccomp: seccomp restricts syscalls, AppArmor restricts file/capability/network access by pathname.

Profile Modes

| Mode | Description |
| --- | --- |
| enforce | Policy violations are blocked and logged |
| complain | Policy violations are logged but not blocked; useful for profile development |
| unconfined | No AppArmor restrictions applied |

Kubernetes Integration

The dedicated appArmorProfile field in the securityContext was added in Kubernetes 1.30, and AppArmor support reached GA in 1.31. Prior to 1.30, annotations were used:

# Kubernetes 1.30+: securityContext field
spec:
  securityContext:
    appArmorProfile:
      type: RuntimeDefault         # or Localhost
  containers:
  - name: app
    securityContext:
      appArmorProfile:
        type: Localhost
        localhostProfile: k8s-nginx  # profile name loaded on node

---
# Prior to 1.30: annotation-based (still supported for compatibility)
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: "runtime/default"
    # or: localhost/<profile>
    # or: unconfined (violates PSS baseline)
AppArmor requires support from the node OS. AppArmor is available on Debian/Ubuntu-based distributions and SUSE. It is not available on RHEL/CentOS/Fedora (which use SELinux instead). Check node OS before deploying AppArmor profiles — pods requesting a missing profile will fail to start.

SELinux

SELinux (Security-Enhanced Linux) is a MAC system used on RHEL/CentOS/Fedora. It uses a label-based policy model where every process, file, and socket has a security label. Access is governed by policy rules between labels.

spec:
  securityContext:
    seLinuxOptions:
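      level: "s0:c123,c456"    # MCS level; pods that share a volume need matching categories
      role: "system_r"         # process role (object_r applies to files, not processes)
      type: "container_t"      # process type (svirt_sandbox_file_t is a file label)
      user: "system_u"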
  containers:
  - name: app
    securityContext:
      seLinuxOptions:
        type: "container_t"     # container-specific label

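In most Kubernetes deployments on RHEL/CoreOS, the container runtime (CRI-O, containerd) automatically assigns SELinux labels to containers. Manual configuration is primarily needed when:

  • Several pods must share a volume and therefore need the same MCS level
  • A workload needs a custom SELinux type because the default container policy does not fit it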

Privileged Container Risks

A privileged container (securityContext.privileged: true) disables most container isolation. It receives all capabilities, can access all host devices, runs without seccomp, AppArmor, or SELinux confinement, and can typically escape to the host. It is effectively an unrestricted process on the host node.

Container escape via privileged container is trivially easy. A privileged container can load kernel modules, mount the host filesystem, access /dev/sda (raw disk), and modify iptables; combined with hostPID: true it can also read /proc/1/root (host root filesystem). A typical escape, which needs hostPID: true so that PID 1 is the host's init rather than the container's own:
nsenter --target 1 --mount --uts --ipc --net --pid -- /bin/bash
This spawns a shell in the host's namespaces from inside a privileged container. Any workload that doesn't require host access has zero business being privileged.

What Actually Requires Privileged

| Workload Type | Requires Privileged? | Alternative |
| --- | --- | --- |
| CNI plugins (calico-node, cilium-agent) | Often yes: needs to configure host network interfaces | Use specific capabilities + hostNetwork instead of full privileged |
| CSI drivers (storage) | Sometimes: for mount operations | Use only on specific CSI plugin containers; not main application |
| GPU workloads | Device access via device plugin; not full privileged | GPU device plugin via resources.limits; no privileged needed |
| Node monitoring agents (Datadog, Falco) | Partially: needs access to host proc, host net, host pid | Specific capabilities + hostPID/hostNetwork; not full privileged |
| Application containers (99%) | Never | Proper securityContext with drop ALL |
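
The "specific capabilities instead of privileged" alternative might look like the sketch below for a node agent that only needs to inspect host processes. The pod name, image, and the choice of SYS_PTRACE are assumptions; the real requirement depends on what the agent documents. Because it sets hostPID, it still needs a namespace at the privileged level.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-agent                                # hypothetical name
spec:
  hostPID: true                                   # needed to observe host processes
  containers:
  - name: agent
    image: registry.example.com/node-agent:1.0    # placeholder image
    securityContext:
      privileged: false
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
        add: ["SYS_PTRACE"]                       # assumed requirement; add only what the agent needs
      seccompProfile:
        type: RuntimeDefault
```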

Host Namespaces

| Field | Risk When Enabled | Legitimate Use |
| --- | --- | --- |
| hostNetwork: true | Pod shares host network stack; can bind to all host ports; traffic not routed through CNI policies; bypasses NetworkPolicy | CNI plugins, host-level monitoring (node-exporter), some service mesh data planes |
| hostPID: true | Pod can see and signal all processes on the host; can read /proc/<host-pid>/mem | System-level debuggers, Falco (for syscall monitoring), some monitoring agents |
| hostIPC: true | Pod can access host shared memory segments and semaphores; can read/write IPC data from other processes | Extremely rare: specific HPC or legacy enterprise applications |

hostPath Volume Risks

The hostPath volume type mounts a directory from the host node's filesystem directly into the container. This creates a direct path to host data:

| hostPath Mount | Risk |
| --- | --- |
| / | Full read/write access to host root filesystem; equivalent to privileged |
| /etc | Modify /etc/sudoers, /etc/passwd, host certs |
| /var/run/docker.sock or /run/containerd/containerd.sock | Full control of container runtime; can launch privileged containers |
| /proc | Read/write host kernel parameters, process memory |
| /var/lib/kubelet | Access to all pod secrets and service account tokens on the node |
| /sys | Modify kernel settings and hardware interfaces |
Mounting the container runtime socket grants full control of the node. Mounting /var/run/docker.sock or /run/containerd/containerd.sock into a container lets that container launch new privileged containers on the host, bypass all pod security controls, and read all secrets cached on the node. From there an attacker can usually pivot to the rest of the cluster. Block it via admission policy.
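# Kyverno policy: block mounting container runtime sockets
# Paths are matched exactly; mounting a parent directory such as /run is not caught here.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-runtime-socket
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: no-docker-socket
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Mounting the Docker socket is not allowed"
      pattern:
        spec:
          =(volumes):                      # only checked when the pod defines volumes
          - =(hostPath):                   # only checked for hostPath volumes
              path: "!/var/run/docker.sock"
  - name: no-containerd-socket
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Mounting the containerd socket is not allowed"
      pattern:
        spec:
          =(volumes):
          - =(hostPath):
              path: "!/run/containerd/containerd.sock"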

Migrating from PSP to PSA

Migration Strategy

  1. Audit current PSP usage:
    kubectl get psp -o json | jq '.items[] | {name:.metadata.name, privileged:.spec.privileged, hostNetwork:.spec.hostNetwork}'
  2. Map PSPs to PSS levels: Identify which PSS level (privileged/baseline/restricted) each namespace's effective PSP corresponds to.
  3. Enable PSA in warn mode first: Apply pod-security.kubernetes.io/warn=restricted and run your workloads. Collect warnings. This is non-breaking.
    kubectl label namespace my-app pod-security.kubernetes.io/warn=restricted
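  4. Enable PSA in audit mode: Add pod-security.kubernetes.io/audit=restricted. Violations are recorded as pod-security.kubernetes.io/audit-violations annotations on API server audit log entries, not as namespace events, so search the audit log (the path below is a placeholder for wherever your API server writes it):
    grep 'pod-security.kubernetes.io/audit-violations' <audit-log-path>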
  5. Fix violating workloads: Update pod specs to meet the target PSS level (add securityContext, drop capabilities, set seccomp, etc.).
  6. Enable enforce mode: Switch to pod-security.kubernetes.io/enforce=restricted when all violations are resolved.
  7. Remove PSP resources: Delete PSP objects and RBAC bindings granting use verb on PSPs.

PSP → PSS Field Mapping

| PSP Field | PSS Equivalent |
| --- | --- |
| privileged: false | Baseline: privileged=false enforced |
| hostPID/hostIPC/hostNetwork: false | Baseline: all three forbidden |
| allowedCapabilities: [] | Restricted: drop ALL; only NET_BIND_SERVICE may be added |
| requiredDropCapabilities: [ALL] | Restricted: drop ALL required |
| volumes: [configMap, emptyDir, ...] | Restricted: only allowed volume types |
| runAsNonRoot: true | Restricted: runAsNonRoot enforced |
| allowPrivilegeEscalation: false | Restricted: allowPrivilegeEscalation=false |
| seccomp: runtime/default | Restricted: seccompProfile RuntimeDefault |
| apparmor: runtime/default | Baseline: AppArmor unconfined violates baseline |

Extending with Kyverno / OPA

PSA is intentionally simple — it enforces only the three built-in PSS levels. For custom policies (e.g., "images must come from our registry", "all pods must have a specific label", "no latest tags"), use Kyverno or OPA/Gatekeeper.

Kyverno — enforce securityContext patterns

# Kyverno: require readOnlyRootFilesystem on all containers
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ro-rootfs
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: check-readonly-rootfs
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "readOnlyRootFilesystem must be true"
      pattern:
        spec:
          containers:
          - securityContext:
              readOnlyRootFilesystem: true
          =(initContainers):
          - securityContext:
              readOnlyRootFilesystem: true
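
The "no latest tags" rule mentioned above can be sketched in the same style. The policy and rule names are illustrative; the first rule requires some tag to be present, and the second rejects the literal tag latest.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag          # illustrative name
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: require-image-tag
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "An image tag is required"
      pattern:
        spec:
          containers:
          - image: "*:*"             # image reference must contain a tag
  - name: no-latest-tag
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "The 'latest' tag is not allowed"
      pattern:
        spec:
          containers:
          - image: "!*:latest"       # reject any image ending in :latest
```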

OPA/Gatekeeper — ConstraintTemplate for capabilities

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8snocapabilities
spec:
  crd:
    spec:
      names:
        kind: K8sNoCapabilities
      validation:
        openAPIV3Schema:
          type: object
          properties:
            allowedCapabilities:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8snocapabilities
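      violation[{"msg": msg}] {
        container := input_containers[_]
        cap := container.securityContext.capabilities.add[_]
        # Build a set so the membership test is a single negated lookup
        allowed := {c | c := input.parameters.allowedCapabilities[_]}
        not allowed[cap]
        msg := sprintf("Container %v has disallowed capability: %v", [container.name, cap])
      }
      # Check init and ephemeral containers as well as regular containers
      input_containers[c] { c := input.review.object.spec.containers[_] }
      input_containers[c] { c := input.review.object.spec.initContainers[_] }
      input_containers[c] { c := input.review.object.spec.ephemeralContainers[_] }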
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sNoCapabilities
metadata:
  name: no-dangerous-caps
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    allowedCapabilities: ["NET_BIND_SERVICE"]  # only this cap is allowed to be added

Metrics & Alerts

Key Metrics

| Metric | Source | What It Tells You |
| --- | --- | --- |
| pod_security_evaluations_total{decision, policy_level, policy_version, mode, request_operation, resource, subresource} | kube-apiserver | PSA evaluation results per level/mode (decision is allow, deny, or exempt); track deny counts for violations. There is no namespace label. |
| apiserver_admission_controller_admission_duration_seconds{name="PodSecurity"} | kube-apiserver | PSA admission latency; should be sub-millisecond |
| falco_events{rule,priority} | Falco | Runtime security events by rule and priority |
| container_processes{container, namespace} | cAdvisor/kubelet | Process count per container; unexpected process spike may indicate exec/shell spawn |
| apiserver_audit_event_total | kube-apiserver | Total audit events; correlated with PSA audit-mode violations |

Alerts

groups:
- name: pod-security.rules
  rules:

  - alert: PSAViolationsHigh
    expr: |
      sum by (policy_level) (rate(pod_security_evaluations_total{decision="deny",mode="enforce"}[5m])) > 0
    annotations:
      summary: "PSA enforce mode is rejecting pods at level {{ $labels.policy_level }}"
      description: "Pod Security Admission ({{ $labels.policy_level }}) is blocking pod creation. The metric has no namespace label; use events or the audit log to find the workload, then check its securityContext."
    labels:
      severity: warning

  - alert: PrivilegedPodRunning
    # Assumes falco-exporter metrics and that the Falco rule "Launch Privileged Container" is loaded;
    # label names follow falco-exporter and may differ in your setup
    expr: |
      rate(falco_events{rule="Launch Privileged Container"}[5m]) > 0
    annotations:
      summary: "Privileged container detected: {{ $labels.k8s_pod_name }} in {{ $labels.k8s_ns_name }}"
    labels:
      severity: critical

  - alert: FalcoHighPriorityEvent
    expr: |
      rate(falco_events{priority="Critical"}[5m]) > 0
    annotations:
      summary: "Falco critical event: {{ $labels.rule }}"
    labels:
      severity: critical

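  - alert: ContainerRunningAsRoot
    # Assumes the Falco rule "Container Run as Root User" is loaded and enabled;
    # label names follow falco-exporter and may differ in your setup
    expr: |
      rate(falco_events{rule="Container Run as Root User"}[5m]) > 0
    annotations:
      summary: "Container running as root in {{ $labels.k8s_pod_name }} ({{ $labels.k8s_ns_name }})"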
    labels:
      severity: warning
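
Audit-mode denials can be alerted on in the same way, which is useful while a namespace is being migrated. This is a sketch: the group name, alert name, one-hour window, and severity are assumptions, while the metric and its policy_level and resource labels come from kube-apiserver.

```yaml
groups:
- name: pod-security-audit.rules            # hypothetical group name
  rules:
  - alert: PSAAuditViolations               # hypothetical alert name
    expr: |
      sum by (policy_level, resource) (
        increase(pod_security_evaluations_total{decision="deny", mode="audit"}[1h])
      ) > 0
    for: 15m
    labels:
      severity: info
    annotations:
      summary: "Workloads would violate PSS {{ $labels.policy_level }} ({{ $labels.resource }})"
```

A non-zero result means something admitted in the last hour would have been rejected had the audit level been enforced.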

Runbooks

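  1. PSA enforce mode blocking pods: For pods created by a controller, look for FailedCreate events with kubectl get events -n <namespace> | grep -i "violates PodSecurity"; for pods created directly, the same message is returned to the client. The message names the PSS checks that failed. Options: fix the pod spec to comply, lower the namespace's enforce level if legitimate (and explain why), or, as a last resort, add a PSA exemption (exemptions apply per namespace, RuntimeClass, or username, not per workload). Do not blindly lower the PSS level; understand why the violation is occurring first.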
  2. Privileged container detected at runtime: Identify the pod and node (kubectl get pod -o wide). If unauthorized: cordon the node, delete the pod, audit what it did (Falco logs, audit log). If authorized (CNI plugin, CSI driver): verify it matches expected workloads. Consider whether capabilities + specific access can replace full privileged.
  3. Falco critical event — shell spawned in container: Identify pod, container, and user who exec'd. Check kubectl get events and audit log for who triggered the exec. Determine if it was authorized (troubleshooting by a human) or automated (attacker executing code). If unauthorized: isolate the pod (remove from Service, set networkPolicy to deny all), preserve forensic evidence, trigger incident response.
  4. Container running as root (unexpected): Check the image's Dockerfile for USER instruction. If missing, add USER nonroot to the Dockerfile, rebuild, and redeploy. Add runAsNonRoot: true to the pod spec as a safety net — it causes the pod to fail to start if the image runs as root, forcing the issue to be resolved at deploy time rather than silently.
  5. Workload fails to start after PSA enforce upgrade: A rejected pod is never created, so inspect its controller instead: kubectl describe replicaset <rs> -n <ns> (or the Job or StatefulSet) shows the specific PSA violation in its events. Common issues: missing seccompProfile (add RuntimeDefault), missing capabilities.drop: [ALL], allowPrivilegeEscalation not set to false. Apply the minimal fix rather than lowering the PSS level.

Best Practices

  1. Default to restricted for tenant namespaces. Use cluster-level defaults in the PSA admission config to set warn=restricted and audit=restricted cluster-wide, with enforce=baseline as the cluster default. Label tenant namespaces enforce=restricted once their workloads comply; teams whose workloads are not yet restricted-compliant can remain on the cluster default in the meantime. Override to privileged only for explicitly identified system namespaces.
  2. Use enforce, audit, and warn together. Enforce alone gives no advance visibility. During migration, set audit and warn to the target level while enforce stays at the current level, so violations in existing workloads and new deployments surface before they are blocked. In steady state, set all three to the same level.
  3. Drop ALL capabilities, then add only what's proven necessary. Start with capabilities.drop: ["ALL"]. Run the workload. If it fails, check the error, identify the needed syscall, determine which capability grants it, add only that capability. Most applications need zero capabilities after dropping ALL.
  4. Set seccompProfile RuntimeDefault as your baseline, consider Localhost for critical services. RuntimeDefault blocks the most dangerous syscalls with zero configuration. For services handling sensitive data (auth services, secret managers, payment processors), generate a custom allowlist profile by recording the syscalls the service actually makes (with strace, or with a log-only seccomp profile whose defaultAction is SCMP_ACT_LOG), then deploy it as a Localhost profile.
  5. Use readOnlyRootFilesystem: true and back it with emptyDir mounts for writable paths. Set readOnlyRootFilesystem: true on all containers. If the application writes to disk (temp files, caches, logs), mount specific writable directories as emptyDir. This prevents malware from writing persistence or tooling to the container filesystem.
  6. Ban hostPath mounts in tenant namespaces via admission policy. PSA baseline and restricted reject hostPath volumes, but namespaces at the privileged level or exempt from PSA have no such guard. Use Kyverno or OPA/Gatekeeper to deny all hostPath volume types in non-system namespaces. Block specific dangerous paths (docker socket, /proc, /etc, /var/lib/kubelet) even in system namespaces via path-specific policies.
  7. Pair Falco with PSA for runtime detection. PSA prevents known-bad configurations at deploy time. Falco detects unknown-bad behavior at runtime (a container that starts compliant but later executes a shell, makes unexpected network connections, or writes to sensitive paths). Both layers are needed.
  8. Pin PSA version labels and update them deliberately. Always set pod-security.kubernetes.io/enforce-version=v1.<N>. When upgrading Kubernetes, review PSS changelog for new checks, test with warn mode first, then bump the version pin once workloads are compliant. Never use latest in production enforce mode.
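
The hostPath ban from practice 6 can be sketched as a Kyverno policy. The policy name and the list of excluded system namespaces are assumptions to adapt to your cluster.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-host-path               # illustrative name
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: no-hostpath-volumes
    match:
      any:
      - resources:
          kinds: ["Pod"]
    exclude:
      any:
      - resources:
          namespaces: ["kube-system"]    # assumed list of system namespaces
    validate:
      message: "hostPath volumes are not allowed in tenant namespaces"
      pattern:
        spec:
          =(volumes):                    # only checked when the pod defines volumes
          - X(hostPath): "null"          # the hostPath key must not be present
```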