Pod Security
Coverage checklist
- PSP removal in 1.25 background
- PSA: enforce / audit / warn modes
- PSA namespace labels
- PSA exemptions (namespaces, runtimeClasses, usernames)
- PSS: privileged / baseline / restricted levels
- PSS field-by-field complete reference table
- securityContext: pod-level vs container-level fields
- runAsUser / runAsGroup / fsGroup semantics
- runAsNonRoot enforcement
- allowPrivilegeEscalation
- readOnlyRootFilesystem
- capabilities: full Linux caps list, drop ALL + add pattern
- Dangerous capabilities: CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE
- seccomp: RuntimeDefault, Localhost, Unconfined
- Custom seccomp profile JSON example
- AppArmor: profile modes (enforce/complain), K8s annotations and securityContext field (1.30+)
- SELinux: seLinuxOptions fields
- Privileged containers: kernel module load, device access, container escape
- hostPID / hostNetwork / hostIPC risks
- hostPath mount risks and mitigation
- PSP → PSA migration checklist
- Kyverno policies for pod security
- OPA/Gatekeeper ConstraintTemplate for pod security
- 5 metrics, 4 alerts, 5 runbooks
- 8 best practices
PSP Removal Background
PodSecurityPolicy (PSP) was the original Kubernetes mechanism for enforcing pod-level security constraints. It was deprecated in Kubernetes 1.21 and removed in 1.25. Clusters upgrading to 1.25+ must migrate to an alternative before the upgrade.
PSP had several fundamental design flaws:
- Authorization complexity: PSPs were enforced via RBAC: a pod's ServiceAccount (or the user creating the pod) had to have the use verb on the PSP. This created confusing interactions where PSPs existed but weren't being enforced.
- Silent non-enforcement: If no PSP matched a pod, the pod was rejected, but if any authorized PSP admitted it (including a permissive one), the pod was allowed. Overly permissive PSPs were therefore often deployed just to unblock workloads.
- Namespace-level control was poor: PSPs were cluster-scoped objects with namespace-level applicability controlled by RBAC bindings — a confusing indirection.
Replacement options:
- Pod Security Admission (PSA) — built into Kubernetes since 1.22 (GA 1.25). Label-based, simple, but not fully customizable.
- OPA/Gatekeeper — fully custom Rego policies via CRDs; maximum flexibility.
- Kyverno — Kubernetes-native YAML-based policies; lower learning curve than Rego.
- Kubewarden — WebAssembly-based policies.
Pod Security Admission (PSA)
PSA is a built-in admission controller (GA in 1.25) that enforces Pod Security Standards by evaluating pod specs against one of three levels. It is configured per-namespace using labels.
Modes
| Mode | Label | Effect on Violation |
|---|---|---|
| enforce | pod-security.kubernetes.io/enforce | Pod is rejected; cannot be created |
| audit | pod-security.kubernetes.io/audit | Pod is allowed; violation recorded in audit log |
| warn | pod-security.kubernetes.io/warn | Pod is allowed; warning message returned to client |
Each mode is independent and can reference a different level. A namespace can have all three modes active simultaneously with different levels:
# Apply PSA labels to a namespace
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/enforce-version=v1.28 \
pod-security.kubernetes.io/audit=restricted \
pod-security.kubernetes.io/audit-version=v1.28 \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/warn-version=v1.28
Always set the matching -version label (e.g., pod-security.kubernetes.io/enforce-version=v1.28). Without it, the version defaults to latest, which means a Kubernetes upgrade could change what's enforced and suddenly break workloads. Pin to a specific version and bump it deliberately during upgrades.
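To make the independence of the three modes concrete, here is a minimal Python sketch. It is not the admission controller's code: the `evaluate` helper and its outcome keys are hypothetical, and it assumes an unlabeled mode behaves like the privileged level (a cluster-wide default could change that).

```python
# Hypothetical helper: simulate how PSA treats one pod under a namespace's labels.
LEVELS = ["privileged", "baseline", "restricted"]  # least to most strict
PREFIX = "pod-security.kubernetes.io/"
EFFECTS = {"enforce": "rejected", "audit": "audit_annotation", "warn": "client_warning"}


def evaluate(namespace_labels, strictest_level_passed):
    """Return which PSA effects trigger for a pod.

    strictest_level_passed is the strictest level the pod satisfies
    (every pod satisfies "privileged").
    """
    passed = LEVELS.index(strictest_level_passed)
    outcome = {effect: False for effect in EFFECTS.values()}
    for mode, effect in EFFECTS.items():
        # Assumption: a mode without a label imposes no restrictions.
        level = namespace_labels.get(PREFIX + mode, "privileged")
        if LEVELS.index(level) > passed:
            outcome[effect] = True
    return outcome


labels = {
    PREFIX + "enforce": "baseline",
    PREFIX + "audit": "restricted",
    PREFIX + "warn": "restricted",
}
# A baseline-compliant pod is admitted, but still audited and warned about.
print(evaluate(labels, "baseline"))
```

With enforce at baseline and audit/warn at restricted, a baseline-compliant pod is created while its restricted-level gaps stay visible, which is the usual setup while migrating toward restricted.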
PSA Exemptions
Some system workloads legitimately need privileged access. PSA supports three exemption types, configured in the PodSecurity admission plugin config (via kube-apiserver --admission-control-config-file):
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    defaults:
      enforce: "restricted"        # cluster-wide default for namespaces without labels
      enforce-version: "latest"
      audit: "restricted"
      audit-version: "latest"
      warn: "restricted"
      warn-version: "latest"
    exemptions:
      usernames:                   # API server users exempt from PSA (e.g., cluster bootstrappers)
      - "system:serviceaccount:kube-system:replicaset-controller"
      runtimeClasses:              # RuntimeClasses exempt from PSA (e.g., kata-containers)
      - "kata-containers"
      namespaces:                  # Namespaces fully exempt from PSA
      - kube-system
      - kube-public
      - cert-manager               # cert-manager needs elevated permissions
Prefer assigning the privileged level to system namespaces rather than exempting them entirely: exemption bypasses audit and warn modes too, removing all visibility.
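The three exemption dimensions can be read as a simple "any match" test. The sketch below illustrates that reading only; `is_exempt` is a hypothetical helper, and the dictionary mirrors the exemptions section of the configuration above.

```python
def is_exempt(exemptions, username, namespace, runtime_class=None):
    """True if the request matches any of the three exemption lists."""
    return (
        username in exemptions.get("usernames", [])
        or namespace in exemptions.get("namespaces", [])
        or (runtime_class is not None
            and runtime_class in exemptions.get("runtimeClasses", []))
    )


exemptions = {
    "usernames": ["system:serviceaccount:kube-system:replicaset-controller"],
    "runtimeClasses": ["kata-containers"],
    "namespaces": ["kube-system", "kube-public", "cert-manager"],
}

print(is_exempt(exemptions, "alice", "production"))                     # no list matches
print(is_exempt(exemptions, "alice", "production", "kata-containers"))  # runtimeClass matches
print(is_exempt(exemptions, "alice", "kube-system"))                    # namespace matches
```

Because a single match skips enforce, audit, and warn alike, each entry in these lists should be treated as a standing exception and reviewed accordingly.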
Graduation Timeline
| Version | Status | Notes |
|---|---|---|
| 1.22 | Alpha | PSA introduced; PSP still active |
| 1.23 | Beta (on by default) | Enabled by default; PSP still active |
| 1.25 | GA; PSP removed | Clusters must migrate before upgrading to 1.25 |
Pod Security Standards
Pod Security Standards (PSS) define three levels, each a superset of restrictions from the previous level. The levels apply to pod specs as a whole — if any container in the pod violates the level, the entire pod is rejected.
Privileged
- No restrictions
- All host namespace access allowed
- Privileged containers allowed
- Any capabilities allowed
- Any seccomp profile (or none)
- Use: kube-system, CNI plugins, node agents, storage CSI drivers
Baseline
- No privileged containers
- No host namespaces (hostPID, hostIPC, hostNetwork)
- No dangerous capabilities
- No hostPath volumes
- No host ports
- AppArmor: only runtime/default or localhost profiles
- seccomp: Unconfined must not be set explicitly (an unset profile is allowed)
- Only safe sysctls, the default /proc mount, and restricted SELinux options
- Use: General workloads not needing host access
Restricted
- All baseline restrictions plus:
- Must drop ALL capabilities (only NET_BIND_SERVICE may be added back)
- seccompProfile: RuntimeDefault or Localhost required
- runAsNonRoot: true
- allowPrivilegeEscalation: false
- Only specific volume types (configMap, CSI, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, secret)
- Use: Untrusted workloads, public-facing services, high-security requirements
PSS Field-by-Field Reference
This table maps every PSS check to its pod spec field, what each level allows, and any nuances.
Host Namespaces
| Field | Privileged | Baseline | Restricted | Notes |
|---|---|---|---|---|
| spec.hostNetwork | Any | Must be false or unset | Must be false or unset | Shares host network stack; bypasses CNI |
| spec.hostPID | Any | Must be false or unset | Must be false or unset | Can see and signal all host processes |
| spec.hostIPC | Any | Must be false or unset | Must be false or unset | Access host IPC namespace (shared memory, semaphores) |
Privileged Containers
| Field | Privileged | Baseline | Restricted |
|---|---|---|---|
| containers[].securityContext.privileged | Any | Must be false or unset | Must be false or unset |
| initContainers[].securityContext.privileged | Any | Must be false or unset | Must be false or unset |
| ephemeralContainers[].securityContext.privileged | Any | Must be false or unset | Must be false or unset |
Capabilities
| Field | Baseline | Restricted | Notes |
|---|---|---|---|
| securityContext.capabilities.add | Only: AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT (or empty) | Only: NET_BIND_SERVICE (or empty) | Restricted permits adding NET_BIND_SERVICE back for binding ports below 1024 |
| securityContext.capabilities.drop | No requirement | Must include ALL | Restricted requires dropping ALL; only NET_BIND_SERVICE may then be added |
Privilege Escalation
| Field | Baseline | Restricted |
|---|---|---|
| securityContext.allowPrivilegeEscalation | No restriction | Must be false |
Run As Non-Root
| Field | Baseline | Restricted |
|---|---|---|
| spec.securityContext.runAsNonRoot | No restriction | Must be true (pod or all containers) |
| containers[].securityContext.runAsNonRoot | No restriction | Must be true OR pod-level must be true |
| spec.securityContext.runAsUser | No restriction | Must not be 0 (if set) |
seccomp
| Field | Baseline | Restricted |
|---|---|---|
| spec.securityContext.seccompProfile.type | Must not be Unconfined (unset, RuntimeDefault, or Localhost allowed) | Must be RuntimeDefault or Localhost |
| containers[].securityContext.seccompProfile.type | Must not be Unconfined | Must be RuntimeDefault or Localhost (or unset if the pod-level field is set) |
Volumes
| Baseline Allowed Volume Types | Restricted (subset of baseline) |
|---|---|
| All volume types except hostPath (for example configMap, csi, downwardAPI, emptyDir, ephemeral, nfs, persistentVolumeClaim, projected, secret) | configMap, csi, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, secret only; no hostPath, no NFS |
Host Ports
| Field | Baseline / Restricted |
|---|---|
| containers[].ports[].hostPort | Must be 0 or unset. Any host port > 0 violates baseline, and therefore restricted. |
AppArmor (baseline only check)
| Annotation | Baseline: Allowed values |
|---|---|
| container.apparmor.security.beta.kubernetes.io/<container> | runtime/default or localhost/<profile>; unconfined violates baseline |
Sysctls
| Field | Baseline / Restricted |
|---|---|
| spec.securityContext.sysctls[].name | Only "safe" sysctls: kernel.shm_rmid_forced, net.ipv4.ip_local_port_range, net.ipv4.ip_unprivileged_port_start, net.ipv4.tcp_syncookies, net.ipv4.ping_group_range |
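As a rough illustration of how the field-level checks combine, the following Python sketch evaluates a pod spec (as a plain dict) against a simplified subset of the restricted level. It is not the PSA implementation: `restricted_violations` is a hypothetical helper, and it skips volumes, host ports, sysctls, AppArmor, SELinux, and ephemeral containers.

```python
def restricted_violations(pod_spec):
    """Return violation messages for a simplified subset of restricted checks."""
    violations = []
    pod_sc = pod_spec.get("securityContext") or {}

    for field in ("hostNetwork", "hostPID", "hostIPC"):
        if pod_spec.get(field):
            violations.append(f"{field} must be false or unset")

    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for c in containers:
        name = c.get("name", "?")
        sc = c.get("securityContext") or {}
        caps = sc.get("capabilities") or {}
        if sc.get("privileged"):
            violations.append(f"{name}: privileged must be false or unset")
        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in caps.get("drop", []):
            violations.append(f"{name}: capabilities.drop must include ALL")
        if set(caps.get("add", [])) - {"NET_BIND_SERVICE"}:
            violations.append(f"{name}: only NET_BIND_SERVICE may be added")
        # Container-level value wins; otherwise fall back to the pod-level value.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            violations.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return violations


compliant = {
    "securityContext": {"runAsNonRoot": True, "seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [{
        "name": "app",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}
bare = {"hostNetwork": True, "containers": [{"name": "app"}]}

print(restricted_violations(compliant))
for message in restricted_violations(bare):
    print(message)
```

A pod with no securityContext at all fails several checks at once, which is why migrating to restricted usually means editing every workload manifest rather than flipping one field.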
securityContext Deep Dive
Pod-Level vs Container-Level Fields
Some fields exist at both pod and container scope. Container-level overrides pod-level. Some fields only exist at one scope:
| Field | Pod-Level | Container-Level | Notes |
|---|---|---|---|
| runAsUser | ✅ Default for all containers | ✅ Overrides pod-level | UID for process; must match image USER or be set explicitly |
| runAsGroup | ✅ Primary GID for all containers | ✅ Overrides pod-level | |
| runAsNonRoot | ✅ | ✅ Overrides pod-level | Kubelet verifies UID != 0 at container start; fails if image runs as root |
| fsGroup | ✅ Only pod-level | ❌ | Supplemental GID applied to volume mounts; owns files in mounted volumes |
| fsGroupChangePolicy | ✅ Only pod-level | ❌ | OnRootMismatch (1.20+): only chown if root ownership wrong; avoids slow chown on large volumes |
| supplementalGroups | ✅ Only pod-level | ❌ | Additional GIDs for all containers |
| seccompProfile | ✅ Default for all containers | ✅ Overrides pod-level | Container-level overrides pod-level profile |
| seLinuxOptions | ✅ Default | ✅ Overrides pod-level | |
| sysctls | ✅ Only pod-level | ❌ | Only safe sysctls without special runtime config |
| privileged | ❌ | ✅ Only container-level | |
| allowPrivilegeEscalation | ❌ | ✅ Only container-level | |
| capabilities | ❌ | ✅ Only container-level | |
| readOnlyRootFilesystem | ❌ | ✅ Only container-level | |
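The override rule in the table can be sketched as a small merge function. This is an illustration of the precedence only, not kubelet code; `effective_security_context` and the two field tuples are hypothetical names chosen for the example.

```python
# Fields that exist at both scopes: the container-level value wins when present.
SHARED_FIELDS = ("runAsUser", "runAsGroup", "runAsNonRoot", "seccompProfile", "seLinuxOptions")
# Fields that only exist on containers.
CONTAINER_ONLY = ("privileged", "allowPrivilegeEscalation", "capabilities", "readOnlyRootFilesystem")


def effective_security_context(pod_sc, container_sc):
    """Merge pod-level defaults with container-level overrides for one container."""
    effective = {}
    for field in SHARED_FIELDS:
        if field in container_sc:
            effective[field] = container_sc[field]
        elif field in pod_sc:
            effective[field] = pod_sc[field]
    for field in CONTAINER_ONLY:
        if field in container_sc:
            effective[field] = container_sc[field]
    return effective


pod_sc = {"runAsUser": 1000, "runAsNonRoot": True, "fsGroup": 2000}
container_sc = {"runAsUser": 2000, "allowPrivilegeEscalation": False}
print(effective_security_context(pod_sc, container_sc))
```

Pod-only fields such as fsGroup are deliberately left out of the per-container result, since they apply to the pod's volumes rather than to an individual container.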
Complete Restricted-Compliant securityContext
apiVersion: v1
kind: Pod
metadata:
  name: restricted-example               # hypothetical name; a Pod manifest requires one
spec:
  securityContext:                       # Pod-level
    runAsNonRoot: true
    runAsUser: 1000                      # must not be 0
    runAsGroup: 3000
    fsGroup: 2000                        # volume files owned by GID 2000
    fsGroupChangePolicy: OnRootMismatch  # avoid slow recursive chown
    seccompProfile:
      type: RuntimeDefault               # applies to all containers unless overridden
    sysctls: []                          # no unsafe sysctls
  containers:
  - name: app
    image: myapp:latest
    securityContext:                     # Container-level (overrides pod-level where applicable)
      allowPrivilegeEscalation: false    # MUST be false for restricted
      readOnlyRootFilesystem: true       # prevents writes to container rootfs
      capabilities:
        drop: ["ALL"]                    # MUST drop ALL for restricted
        # add: ["NET_BIND_SERVICE"]      # only if binding port < 1024
      seccompProfile:
        type: RuntimeDefault             # can override pod-level per-container
      runAsNonRoot: true                 # belt-and-suspenders with pod-level
    volumeMounts:
    - name: tmp
      mountPath: /tmp                    # writable temp dir for apps that need it
    - name: cache
      mountPath: /app/cache
  volumes:
  - name: tmp
    emptyDir: {}                         # allowed in restricted
  - name: cache
    emptyDir: {}
# NOT allowed in restricted: hostPath and NFS volumes, hostNetwork/hostPID/hostIPC
# Secrets via env are allowed but discouraged
fsGroup and Volume Ownership
When fsGroup is set, the kubelet changes ownership of mounted volumes to the specified GID. For large volumes (hundreds of thousands of files), this recursive chown causes significant pod startup latency. Use fsGroupChangePolicy: OnRootMismatch (GA 1.23) to skip the chown if the root directory already has the correct ownership.
Linux Capabilities
Linux capabilities split root privilege into discrete units. When a container runs without privileged: true, it starts with a default set of capabilities inherited from the container runtime. The recommended security posture is drop ALL, add only what's needed.
Default Container Capabilities (containerd/CRI-O)
By default, containers receive these capabilities: CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW, SETGID, SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE, SYS_CHROOT, KILL, AUDIT_WRITE.
NET_RAW allows raw socket creation (ping, ARP spoofing, packet crafting). It's in the default set for compatibility but should be dropped in almost all workloads. Drop it explicitly: capabilities.drop: ["NET_RAW"] or drop: ["ALL"].
Dangerous Capabilities Reference
| Capability | What It Allows | Attack Potential |
|---|---|---|
| CAP_SYS_ADMIN | Mount filesystems, modify kernel parameters, many other administrative operations | Container escape: effectively root on host |
| CAP_SYS_PTRACE | Trace and inspect other processes; inject code | Process hijack: can ptrace processes on host if hostPID=true |
| CAP_NET_ADMIN | Configure network interfaces, routes, firewall rules, packet mangling | Network attack: ARP poison, traffic interception, iptables modification |
| CAP_SYS_MODULE | Load/unload kernel modules | Container escape: insert rootkit kernel module |
| CAP_SYS_RAWIO | Direct access to I/O ports, /dev/mem, /dev/kmem | Container escape via physical memory access |
| CAP_DAC_READ_SEARCH | Bypass file permission checks; read any file | Data exfiltration: read /etc/shadow, host secrets |
| CAP_CHOWN | Change file ownership | Privilege: change ownership of sensitive files |
| CAP_SETUID | Set UID; switch to any user including root | Root escalation |
| CAP_NET_RAW | Raw sockets, packet injection | Network attack |
| CAP_MKNOD | Create device files | Device access |
Capability Drop ALL Pattern
# Minimum viable capabilities for most web applications
securityContext:
  capabilities:
    drop: ["ALL"]
    # Most apps need zero capabilities after dropping ALL
    # Common exceptions:
    # add: ["NET_BIND_SERVICE"]  # ONLY if binding port 80 or 443 as non-root
    # add: ["SYS_NICE"]          # ONLY if app sets process priority (rare)
    # add: ["IPC_LOCK"]          # ONLY if app uses mlock() for security (e.g., Vault)
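The effect of drop and add on the default set can be approximated in a few lines of Python. This is a simplified model, assuming the default list quoted above and ignoring edge cases such as adding ALL; `effective_capabilities` is a hypothetical helper, not a runtime API.

```python
# Default capability set as listed in this section (assumed for the model).
DEFAULT_CAPS = frozenset({
    "CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "MKNOD", "NET_RAW", "SETGID",
    "SETUID", "SETFCAP", "SETPCAP", "NET_BIND_SERVICE", "SYS_CHROOT", "KILL",
    "AUDIT_WRITE",
})


def effective_capabilities(add=(), drop=()):
    """Approximate the capability set a container ends up with."""
    add = {c.upper() for c in add}
    drop = {c.upper() for c in drop}
    # Dropping ALL starts from an empty set; otherwise start from the defaults.
    base = set() if "ALL" in drop else set(DEFAULT_CAPS) - drop
    return base | add


print(sorted(effective_capabilities()))                                       # the 14 defaults
print(sorted(effective_capabilities(drop=["ALL"])))                           # nothing left
print(sorted(effective_capabilities(drop=["ALL"], add=["NET_BIND_SERVICE"])))
```

The model makes the asymmetry visible: dropping only NET_RAW still leaves thirteen capabilities in place, whereas drop ALL plus one add leaves exactly one.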
seccomp Profiles
seccomp (secure computing mode) is a Linux kernel feature that restricts the syscalls a process can make. Kubernetes supports three seccomp profile types:
| Type | Description | Security Level |
|---|---|---|
| Unconfined | No syscall filtering; all syscalls allowed | Minimum; what a pod gets when no profile is set and kubelet seccomp defaulting is off |
| RuntimeDefault | Container runtime's default profile (blocks dozens of dangerous syscalls) | Good baseline; recommended for most workloads |
| Localhost | Custom JSON profile from /var/lib/kubelet/seccomp/ on the node | Maximum; application-specific allowlist |
RuntimeDefault is applied to pods without an explicit profile only when the kubelet runs with seccomp defaulting enabled (the --seccomp-default flag or seccompDefault: true in the kubelet configuration; the SeccompDefault feature is GA in 1.27). However, explicitly setting it in the pod spec is still best practice for clarity and portability.
RuntimeDefault Profile
The RuntimeDefault profile is defined by the container runtime. For containerd, it's based on Docker's default seccomp profile. It blocks syscalls including:
- kexec_load: load new kernel
- keyctl, add_key, request_key: kernel key management
- ptrace: process tracing (blocked in some profiles)
- reboot, create_module, finit_module: system-level ops
- mount, umount2: filesystem mounting
Custom Localhost Profile
File: /var/lib/kubelet/seccomp/profiles/nginx.json (allowlist approach: only permit needed syscalls)
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": [
        "accept4", "access", "arch_prctl", "bind", "brk", "capget", "capset",
        "chdir", "chmod", "chown", "clone", "close", "connect", "dup", "dup2",
        "epoll_create1", "epoll_ctl", "epoll_wait", "eventfd2", "execve",
        "exit", "exit_group", "faccessat", "fchmod", "fchown", "fcntl",
        "fstat", "fstatfs", "futex", "getcwd", "getdents64", "getegid",
        "geteuid", "getgid", "getpid", "getppid", "getrandom", "gettid",
        "gettimeofday", "getuid", "ioctl", "lseek", "lstat", "madvise",
        "mkdir", "mmap", "mprotect", "munmap", "nanosleep", "open",
        "openat", "pipe2", "poll", "prctl", "pread64", "read", "readlink",
        "recv", "recvfrom", "recvmsg", "rename", "rt_sigaction",
        "rt_sigprocmask", "rt_sigreturn", "send", "sendfile", "sendmsg",
        "sendto", "set_robust_list", "setgid", "setgroups", "setuid",
        "socket", "stat", "statfs", "sysinfo", "tgkill", "uname",
        "unlink", "wait4", "write", "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
# Use the custom localhost profile in a pod
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: profiles/nginx.json  # relative to /var/lib/kubelet/seccomp/
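An allowlist profile like the one above is usually generated from a list of observed syscalls rather than typed by hand. The Python sketch below shows one way to assemble such a file; `build_allowlist_profile` is a hypothetical helper, and the observed syscall list is made-up sample input, not a real trace.

```python
import json


def build_allowlist_profile(syscalls, architectures=("SCMP_ARCH_X86_64",)):
    """Build a deny-by-default seccomp profile that allows only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",           # anything not listed fails with an error
        "architectures": list(architectures),
        "syscalls": [
            {
                "names": sorted(set(syscalls)),      # de-duplicate and order for stable diffs
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Sample input only: in practice this list would come from tracing the workload.
observed = ["read", "write", "openat", "close", "read", "futex"]
profile = build_allowlist_profile(observed)
print(json.dumps(profile, indent=2))
```

Sorting and de-duplicating the names keeps the generated file reviewable in version control when the allowlist changes between releases.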
AppArmor
AppArmor is a Linux MAC (Mandatory Access Control) system that restricts program capabilities based on per-program profiles. It complements seccomp: seccomp restricts syscalls, AppArmor restricts file/capability/network access by pathname.
Profile Modes
| Mode | Description |
|---|---|
| enforce | Policy violations are blocked and logged |
| complain | Policy violations are logged but not blocked; useful for profile development |
| unconfined | No AppArmor restrictions applied |
Kubernetes Integration
Kubernetes 1.30 introduced a dedicated appArmorProfile field in the securityContext, and AppArmor support graduated to GA in 1.31. Prior to 1.30, annotations were used:
# Kubernetes 1.30+: securityContext field
spec:
  securityContext:
    appArmorProfile:
      type: RuntimeDefault            # or Localhost
  containers:
  - name: app
    securityContext:
      appArmorProfile:
        type: Localhost
        localhostProfile: k8s-nginx   # profile name loaded on node
---
# Prior to 1.30: annotation-based (still supported for compatibility)
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: "runtime/default"
    # or: localhost/<profile>
    # or: unconfined (violates PSS baseline)
SELinux
SELinux (Security-Enhanced Linux) is a MAC system used on RHEL/CentOS/Fedora. It uses a label-based policy model where every process, file, and socket has a security label. Access is governed by policy rules between labels.
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"         # MCS label for namespace isolation
      role: "object_r"
      type: "svirt_sandbox_file_t"
      user: "system_u"
  containers:
  - name: app
    securityContext:
      seLinuxOptions:
        type: "container_t"         # container-specific label
In most Kubernetes deployments on RHEL/CoreOS, the container runtime (CRI-O, containerd) automatically assigns SELinux labels to containers. Manual configuration is primarily needed when:
- Accessing host volumes that need specific SELinux labels
- Implementing MCS (Multi-Category Security) for strict container isolation
- Running on OpenShift (which has a specific SELinux policy model)
Privileged Container Risks
A privileged container (securityContext.privileged: true) disables most container isolation. It receives all capabilities, can access all host devices, shares the host's cgroups and namespaces for devices, and can typically escape to the host. It is effectively an unrestricted process on the host node.
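For example, the following command spawns a shell in the host's namespaces from inside a privileged container (it assumes the pod also runs with hostPID: true, so that PID 1 is the host's init process):
nsenter --target 1 --mount --uts --ipc --net --pid -- /bin/bash
Any workload that doesn't require host access has zero business being privileged.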
What Actually Requires Privileged
| Workload Type | Requires Privileged? | Alternative |
|---|---|---|
| CNI plugins (calico-node, cilium-agent) | Often yes — needs to configure host network interfaces | Use specific capabilities + hostNetwork instead of full privileged |
| CSI drivers (storage) | Sometimes — for mount operations | Use only on specific CSI plugin containers; not main application |
| GPU workloads | Device access via device plugin — not full privileged | GPU device plugin via resources.limits; no privileged needed |
| Node monitoring agents (Datadog, Falco) | Partially — needs access to host proc, host net, host pid | Specific capabilities + hostPID/hostNetwork; not full privileged |
| Application containers (99%) | Never | Proper securityContext with drop ALL |
Host Namespaces
| Field | Risk When Enabled | Legitimate Use |
|---|---|---|
| hostNetwork: true | Pod shares host network stack; can bind to all host ports; traffic not routed through CNI policies; bypasses NetworkPolicy | CNI plugins, host-level monitoring (node-exporter), some service mesh data planes |
| hostPID: true | Pod can see and signal all processes on the host; can read /proc/<host-pid>/mem | System-level debuggers, Falco (for syscall monitoring), some monitoring agents |
| hostIPC: true | Pod can access host shared memory segments and semaphores; can read/write IPC data from other processes | Extremely rare: specific HPC or legacy enterprise applications |
hostPath Volume Risks
The hostPath volume type mounts a directory from the host node's filesystem directly into the container. This creates a direct path to host data:
| hostPath Mount | Risk |
|---|---|
| / | Full read/write access to host root filesystem; equivalent to privileged |
| /etc | Modify /etc/sudoers, /etc/passwd, host certs |
| /var/run/docker.sock or /run/containerd/containerd.sock | Full control of container runtime; can launch privileged containers |
| /proc | Read/write host kernel parameters, process memory |
| /var/lib/kubelet | Access to all pod secrets and service account tokens on the node |
| /sys | Modify kernel settings and hardware interfaces |
Mounting /var/run/docker.sock or /run/containerd/containerd.sock into a container lets that container launch new privileged containers on the host, bypass all pod security controls, and read all secrets cached on the node. This is a complete cluster compromise vector. Block it via admission policy.
# Kyverno policy: block mounting container runtime socket
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-runtime-socket
spec:
  validationFailureAction: Enforce
  rules:
  - name: no-runtime-socket
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Mounting container runtime socket is not allowed"
      deny:
        conditions:
          any:
          # Deny if any hostPath volume path is one of the listed runtime sockets;
          # the fallback to an empty list covers pods without volumes
          - key: "{{ request.object.spec.volumes[].hostPath.path || `[]` }}"
            operator: AnyIn
            value:
            - /var/run/docker.sock
            - /run/containerd/containerd.sock
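For auditing existing manifests offline, the risk table above can be turned into a small path check. The sketch below is an illustration only: `risky_host_paths` and the `SENSITIVE_PATHS` tuple are hypothetical, and the list covers just the paths named in this section.

```python
import posixpath

# Paths taken from the risk table above; extend for your environment.
SENSITIVE_PATHS = (
    "/etc", "/proc", "/sys", "/var/lib/kubelet",
    "/var/run/docker.sock", "/run/containerd/containerd.sock",
)


def risky_host_paths(pod_spec):
    """Return (volume name, path) pairs for hostPath mounts of sensitive locations."""
    risky = []
    for volume in pod_spec.get("volumes", []):
        path = (volume.get("hostPath") or {}).get("path")
        if path is None:
            continue  # not a hostPath volume
        normalized = posixpath.normpath(path)
        hits_sensitive = any(
            normalized == p or normalized.startswith(p + "/") for p in SENSITIVE_PATHS
        )
        if normalized == "/" or hits_sensitive:
            risky.append((volume.get("name"), path))
    return risky


pod_spec = {
    "volumes": [
        {"name": "sock", "hostPath": {"path": "/var/run/docker.sock"}},
        {"name": "data", "hostPath": {"path": "/data/app"}},
        {"name": "certs", "hostPath": {"path": "/etc/ssl/"}},
        {"name": "tmp", "emptyDir": {}},
    ]
}
print(risky_host_paths(pod_spec))
```

Matching on a normalized prefix followed by a slash avoids false positives such as a hypothetical /etcd-data directory being treated as /etc.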
Migrating from PSP to PSA
Migration Strategy
- Audit current PSP usage:
  kubectl get psp -o json | jq '.items[] | {name:.metadata.name, privileged:.spec.privileged, hostNetwork:.spec.hostNetwork}'
- Map PSPs to PSS levels: Identify which PSS level (privileged/baseline/restricted) each namespace's effective PSP corresponds to.
- Enable PSA in warn mode first: Apply pod-security.kubernetes.io/warn=restricted and run your workloads. Collect warnings. This is non-breaking.
  kubectl label namespace my-app pod-security.kubernetes.io/warn=restricted
- Enable PSA in audit mode: Add pod-security.kubernetes.io/audit=restricted. Check the API server audit log for entries carrying the pod-security.kubernetes.io/audit-violations annotation.
  kubectl label namespace my-app pod-security.kubernetes.io/audit=restricted
- Fix violating workloads: Update pod specs to meet the target PSS level (add securityContext, drop capabilities, set seccomp, etc.).
- Enable enforce mode: Switch to pod-security.kubernetes.io/enforce=restricted when all violations are resolved.
- Remove PSP resources: Delete PSP objects and RBAC bindings granting the use verb on PSPs.
PSP → PSS Field Mapping
| PSP Field | PSS Equivalent |
|---|---|
| privileged: false | Baseline: privileged=false enforced |
| hostPID/hostIPC/hostNetwork: false | Baseline: all three forbidden |
| allowedCapabilities: [] | Restricted: drop ALL; only NET_BIND_SERVICE may be added |
| requiredDropCapabilities: [ALL] | Restricted: drop ALL required |
| volumes: [configMap, emptyDir, ...] | Restricted: only allowed volume types |
| runAsNonRoot: true | Restricted: runAsNonRoot enforced |
| allowPrivilegeEscalation: false | Restricted: allowPrivilegeEscalation=false |
| seccomp: runtime/default | Restricted: seccompProfile RuntimeDefault |
| apparmor: runtime/default | Baseline: AppArmor unconfined violates baseline |
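The mapping step of the migration can be bootstrapped with a rough heuristic over each PSP's spec. The Python sketch below is one such heuristic, not an official tool: `closest_pss_level` is a hypothetical helper, it inspects only a handful of PSP fields (including the real PSP runAsUser.rule field), and its result should be treated as a starting point for review.

```python
def closest_pss_level(psp_spec):
    """Guess the PSS level that most closely matches a PodSecurityPolicy spec."""
    volumes = psp_spec.get("volumes", [])
    needs_host_access = (
        psp_spec.get("privileged", False)
        or any(psp_spec.get(f, False) for f in ("hostPID", "hostIPC", "hostNetwork"))
        or "hostPath" in volumes
        or "*" in volumes
    )
    if needs_host_access:
        return "privileged"

    looks_restricted = (
        psp_spec.get("allowPrivilegeEscalation") is False
        and "ALL" in psp_spec.get("requiredDropCapabilities", [])
        and psp_spec.get("runAsUser", {}).get("rule") == "MustRunAsNonRoot"
    )
    return "restricted" if looks_restricted else "baseline"


strict_psp = {
    "privileged": False,
    "allowPrivilegeEscalation": False,
    "requiredDropCapabilities": ["ALL"],
    "runAsUser": {"rule": "MustRunAsNonRoot"},
    "volumes": ["configMap", "emptyDir", "secret"],
}
print(closest_pss_level(strict_psp))
print(closest_pss_level({"privileged": False, "volumes": ["*"]}))
print(closest_pss_level({"privileged": False}))
```

Running this over the output of the PSP audit command gives a first-pass label per policy; warn and audit modes then confirm whether the guessed level actually fits the workloads.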
Extending with Kyverno / OPA
PSA is intentionally simple — it enforces only the three built-in PSS levels. For custom policies (e.g., "images must come from our registry", "all pods must have a specific label", "no latest tags"), use Kyverno or OPA/Gatekeeper.
Kyverno — enforce securityContext patterns
# Kyverno: require readOnlyRootFilesystem on all containers
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ro-rootfs
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: check-readonly-rootfs
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "readOnlyRootFilesystem must be true"
      pattern:
        spec:
          containers:
          - securityContext:
              readOnlyRootFilesystem: true
          =(initContainers):
          - securityContext:
              readOnlyRootFilesystem: true
OPA/Gatekeeper — ConstraintTemplate for capabilities
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8snocapabilities
spec:
  crd:
    spec:
      names:
        kind: K8sNoCapabilities
      validation:
        openAPIV3Schema:
          type: object
          properties:
            allowedCapabilities:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8snocapabilities
      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        cap := container.securityContext.capabilities.add[_]
        # Build a set of allowed capabilities, then test membership
        allowed := {c | c := input.parameters.allowedCapabilities[_]}
        not allowed[cap]
        msg := sprintf("Container %v has disallowed capability: %v", [container.name, cap])
      }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sNoCapabilities
metadata:
  name: no-dangerous-caps
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    allowedCapabilities: ["NET_BIND_SERVICE"]  # only this cap is allowed to be added
Metrics & Alerts
Key Metrics
| Metric | Source | What It Tells You |
|---|---|---|
| pod_security_evaluations_total{decision="allow|deny",policy_level,policy_version,mode} | kube-apiserver | PSA evaluation results per level/mode; track deny counts for violations (exemptions are counted separately in pod_security_exemptions_total) |
| apiserver_admission_controller_admission_duration_seconds{name="PodSecurity"} | kube-apiserver | PSA admission latency; should be sub-millisecond |
| falco_events{rule,priority} | Falco | Runtime security events by rule and priority |
| container_processes{container, namespace} | cAdvisor/kubelet | Process count per container; unexpected process spike may indicate exec/shell spawn |
| apiserver_audit_event_total | kube-apiserver | Total audit events; correlated with PSA audit-mode violations |
Alerts
groups:
- name: pod-security.rules
  rules:
  - alert: PSAViolationsHigh
    expr: |
      rate(pod_security_evaluations_total{decision="deny",mode="enforce"}[5m]) > 0
    annotations:
      summary: "PSA enforce mode is rejecting pods ({{ $labels.policy_level }} level)"
      description: "Pod Security Admission ({{ $labels.policy_level }}) is blocking pod creation. Check workload securityContext."
    labels:
      severity: warning
  - alert: PrivilegedPodRunning
    # Implemented via a Falco rule (container.privileged=true). The rule name below is
    # an assumption and must match a rule loaded in your Falco deployment.
    expr: |
      rate(falco_events{rule="Launch Privileged Container"}[5m]) > 0
    annotations:
      # Label names depend on the Falco exporter in use
      summary: "Privileged container detected: {{ $labels.pod }} in {{ $labels.namespace }}"
    labels:
      severity: critical
  - alert: FalcoHighPriorityEvent
    expr: |
      rate(falco_events{priority="Critical"}[5m]) > 0
    annotations:
      summary: "Falco critical event: {{ $labels.rule }}"
    labels:
      severity: critical
  - alert: ContainerRunningAsRoot
    # Implemented via a Falco rule (user.uid=0 AND container=true). The rule name below
    # is an assumption and must match a rule loaded in your Falco deployment.
    expr: |
      rate(falco_events{rule="Container Run as Root User"}[5m]) > 0
    annotations:
      summary: "Container running as root: {{ $labels.container }} in {{ $labels.pod }}"
    labels:
      severity: warning
Runbooks
- PSA enforce mode blocking pods: Check kubectl get events -n <namespace> | grep -i "violates PodSecurity". Identify which PSS check failed. Options: fix the pod spec to comply, lower the namespace's enforce level if legitimate (and explain why), or add a PSA exemption for the specific workload. Do not blindly lower the PSS level; understand why the violation is occurring first.
- Privileged container detected at runtime: Identify the pod and node (kubectl get pod -o wide). If unauthorized: cordon the node, delete the pod, audit what it did (Falco logs, audit log). If authorized (CNI plugin, CSI driver): verify it matches expected workloads. Consider whether capabilities + specific access can replace full privileged.
- Falco critical event (shell spawned in container): Identify pod, container, and user who exec'd. Check kubectl get events and audit log for who triggered the exec. Determine if it was authorized (troubleshooting by a human) or automated (attacker executing code). If unauthorized: isolate the pod (remove from Service, set networkPolicy to deny all), preserve forensic evidence, trigger incident response.
- Container running as root (unexpected): Check the image's Dockerfile for USER instruction. If missing, add USER nonroot to the Dockerfile, rebuild, and redeploy. Add runAsNonRoot: true to the pod spec as a safety net: it causes the pod to fail to start if the image runs as root, forcing the issue to be resolved at deploy time rather than silently.
- Workload fails to start after PSA enforce upgrade: Pods rejected by PSA are never created, so inspect the owning controller with kubectl describe replicaset <rs> -n <ns> (or kubectl get events -n <ns>) to see the specific PSA violation. Common issues: missing seccompProfile (add RuntimeDefault), missing capabilities.drop: [ALL], allowPrivilegeEscalation not set to false. Apply the minimal fix rather than lowering the PSS level.
Best Practices
- Apply PSA restricted to all tenant namespaces by default. Use cluster-level defaults in the PSA admission config to set warn=restricted and audit=restricted cluster-wide, with enforce=baseline as the cluster default. Teams whose workloads are already restricted-compliant can raise enforce to restricted on their namespaces; everyone else remains on the cluster default. Override to privileged only for explicitly exempted system namespaces.
- Always use enforce + audit + warn together. Enforce alone gives no visibility before violations. Audit and warn together let you detect violations in existing workloads and new deployments before they hit enforce. During migration, set audit and warn to the target level while enforce stays at the current level; set all three to the same level in steady state.
- Drop ALL capabilities, then add only what's proven necessary. Start with capabilities.drop: ["ALL"]. Run the workload. If it fails, check the error, identify the needed syscall, determine which capability grants it, add only that capability. Most applications need zero capabilities after dropping ALL.
- Set seccompProfile RuntimeDefault as your baseline, consider Localhost for critical services. RuntimeDefault blocks dozens of dangerous syscalls with zero configuration. For services handling sensitive data (auth services, secret managers, payment processors), generate a custom allowlist profile using strace or Falco's syscall logging, then deploy as Localhost profile.
- Use readOnlyRootFilesystem: true and back it with emptyDir mounts for writable paths. Set readOnlyRootFilesystem: true on all containers. If the application writes to disk (temp files, caches, logs), mount specific writable directories as emptyDir. This prevents malware from writing persistence or tooling to the container filesystem.
- Ban hostPath mounts in tenant namespaces via admission policy. PSA forbids hostPath volumes at the baseline and restricted levels, but namespaces at the privileged level (or exempted from PSA) have no such protection. Use Kyverno or OPA/Gatekeeper to deny all hostPath volume types in non-system namespaces. Block specific dangerous paths (docker socket, /proc, /etc, /var/lib/kubelet) even in system namespaces via path-specific policies.
- Pair Falco with PSA for runtime detection. PSA prevents known-bad configurations at deploy time. Falco detects unknown-bad behavior at runtime (a container that starts compliant but later executes a shell, makes unexpected network connections, or writes to sensitive paths). Both layers are needed.
- Pin PSA version labels and update them deliberately. Always set pod-security.kubernetes.io/enforce-version=v1.<N>. When upgrading Kubernetes, review PSS changelog for new checks, test with warn mode first, then bump the version pin once workloads are compliant. Never use latest in production enforce mode.
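The version-pinning practice lends itself to a simple audit over namespace labels. The Python sketch below flags modes that are active but not pinned; `unpinned_modes` is a hypothetical helper, and it assumes pinned versions follow the v1.<N> form used throughout this page.

```python
import re

PREFIX = "pod-security.kubernetes.io/"
PINNED_VERSION = re.compile(r"^v1\.\d+$")  # e.g. v1.28; "latest" does not match


def unpinned_modes(namespace_labels):
    """Return the PSA modes that are set on a namespace without a pinned version."""
    problems = []
    for mode in ("enforce", "audit", "warn"):
        if PREFIX + mode not in namespace_labels:
            continue  # mode not configured on this namespace
        # A missing -version label falls back to latest.
        version = namespace_labels.get(PREFIX + mode + "-version", "latest")
        if not PINNED_VERSION.match(version):
            problems.append(mode)
    return problems


labels = {
    PREFIX + "enforce": "restricted",
    PREFIX + "enforce-version": "v1.28",
    PREFIX + "warn": "restricted",          # no warn-version label set
}
print(unpinned_modes(labels))
```

Feeding this function the labels of every namespace (for example from kubectl get namespaces -o json) gives a quick list of namespaces whose enforcement could shift on the next cluster upgrade.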