Pods
Pod Network Model
Every pod receives a single IP address shared by all its containers. Containers within a pod communicate over localhost — they share the same network namespace (same interfaces, same routing table, same port space). Two containers in the same pod cannot both listen on port 8080.
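As a minimal sketch of the shared network namespace, the Pod below runs two containers that talk to each other over localhost. The pod name, images, and ports are illustrative assumptions, not part of the original example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: localhost-demo            # hypothetical name
spec:
  containers:
    - name: web
      image: nginx:1.25           # listens on port 80 inside the shared namespace
      ports:
        - containerPort: 80
    - name: probe
      image: busybox:1.36
      # Reaches the web container via localhost because both share one network namespace.
      command:
        - sh
        - -c
        - while true; do wget -qO- http://localhost:80 >/dev/null && echo ok; sleep 5; done
```

If the second container also tried to bind port 80, it would fail with an address-in-use error, because the port space is shared.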
Full PodSpec Anatomy
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  namespace: production
  labels:
    app: web-app
    version: v2.1.0
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
spec:
  # ── Scheduling ─────────────────────────────────────────────────
  nodeSelector:
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: m5.xlarge   # optional instance type pin
  nodeName: ""                    # direct assignment (bypasses scheduler; avoid in production)
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web-app
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: web-app
          topologyKey: kubernetes.io/hostname
  priorityClassName: high-priority   # references PriorityClass object
  # ── Runtime ────────────────────────────────────────────────────
  runtimeClassName: gvisor        # sandbox runtime (optional)
  # ── Identity ────────────────────────────────────────────────────
  serviceAccountName: web-app-sa
  automountServiceAccountToken: true   # default true; set false if no API access needed
  # ── Image pull ─────────────────────────────────────────────────
  imagePullSecrets:
    - name: registry-credentials
  # ── Networking ──────────────────────────────────────────────────
  hostname: web-0                 # overrides pod name in hostname
  subdomain: web-headless         # enables DNS: web-0.web-headless.production.svc.cluster.local
  hostNetwork: false              # default; true = use node's network namespace
  hostPID: false
  hostIPC: false
  dnsPolicy: ClusterFirst         # default; see DNS section
  dnsConfig:
    searches:
      - internal.corp.com
    options:
      - name: ndots
        value: "2"
  # ── Termination ─────────────────────────────────────────────────
  terminationGracePeriodSeconds: 60   # default 30
  # ── Pod-level security ─────────────────────────────────────────
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    runAsNonRoot: true
    fsGroup: 2000
    fsGroupChangePolicy: OnRootMismatch   # Always or OnRootMismatch
    supplementalGroups: [4000]
    seccompProfile:
      type: RuntimeDefault        # RuntimeDefault | Localhost | Unconfined
    sysctls:
      - name: net.core.somaxconn  # unsafe sysctl; kubelet must allow it via --allowed-unsafe-sysctls
        value: "1024"
  # ── Init containers ─────────────────────────────────────────────
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ["sh", "-c", "until nc -z postgres 5432; do sleep 2; done"]
    # Native sidecar (1.29+): restartPolicy: Always in initContainers
    - name: log-shipper
      image: fluent/fluent-bit:3.0
      restartPolicy: Always       # makes this a sidecar, not a blocking init
      volumeMounts:
        - name: logs
          mountPath: /logs
  # ── Main containers ─────────────────────────────────────────────
  containers:
    - name: app
      image: myapp:v2.1.0
      imagePullPolicy: IfNotPresent   # Always | Never | IfNotPresent (default for tagged)
      command: ["/app/server"]        # overrides ENTRYPOINT
      args: ["--port=8080", "--log-level=info"]   # overrides CMD
      ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: metrics
          containerPort: 9090
      env:
        - name: DB_HOST
          value: postgres.data.svc.cluster.local
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: password
              optional: false     # fail pod if secret missing (default false)
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP   # Downward API
        - name: MY_CPU_REQUEST
          valueFrom:
            resourceFieldRef:
              containerName: app
              resource: requests.cpu
      envFrom:
        - configMapRef:
            name: app-config
            optional: false
        - secretRef:
            name: app-secrets
            optional: true        # pod starts even if secret absent
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          memory: "512Mi"         # no CPU limit; avoids throttling
      volumeMounts:
        - name: config
          mountPath: /etc/app
          readOnly: true
        - name: tmp
          mountPath: /tmp
        - name: logs
          mountPath: /var/log/app
      # Container-level security
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]   # only if binding port <1024
        seccompProfile:
          type: RuntimeDefault
      lifecycle:
        postStart:
          exec:
            command: ["/bin/sh", "-c", "echo started > /tmp/started"]
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5 && /app/graceful-stop.sh"]
      # Probes: see detailed section below
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 0    # after startupProbe passes, no additional delay needed
        periodSeconds: 10
        failureThreshold: 3
        timeoutSeconds: 5
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
        successThreshold: 1
        timeoutSeconds: 3
  # ── Volumes ─────────────────────────────────────────────────────
  volumes:
    - name: config
      configMap:
        name: app-config
        defaultMode: 0644
    - name: tmp
      emptyDir: {}
    - name: logs
      emptyDir: {}
  # ── Restart policy ──────────────────────────────────────────────
  restartPolicy: Always           # Always (default) | OnFailure | Never
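The PodSpec above references two objects it does not define: the PriorityClass named by priorityClassName and the headless Service named by subdomain. A minimal sketch of both follows; the priority value, description, and Service port are assumptions chosen for illustration.

```yaml
# PriorityClass referenced by spec.priorityClassName (the value is an assumed example)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "Assumed priority for latency-sensitive workloads"
---
# Headless Service referenced by spec.subdomain; clusterIP: None gives each pod its own DNS record
apiVersion: v1
kind: Service
metadata:
  name: web-headless
  namespace: production
spec:
  clusterIP: None
  selector:
    app: web-app
  ports:
    - name: http
      port: 8080
```

Without the headless Service, the hostname and subdomain fields are accepted but the per-pod DNS name is not published.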
Image Pull Policy
| imagePullPolicy | When Applied Automatically | Behavior |
|---|---|---|
| Always | Image tag is :latest | Kubelet always contacts registry to check digest; pulls if digest differs. Guarantees freshness; requires registry availability at pod start. |
| IfNotPresent | Any tag except :latest | Uses cached image if present; pulls only if image not on node. Default for version-tagged images. Faster startup on cached nodes. |
| Never | Never automatically set | Never pulls from registry. Pod fails if image not cached on node. Use for air-gapped environments with pre-pulled images. |
The :latest tag defaults imagePullPolicy to Always (an explicit policy still overrides it) and makes deployments non-reproducible: the same tag may resolve to different image digests on different nodes or at different times. Use version tags (v1.2.3) or, better, immutable digests (image@sha256:abc123...) for production workloads. Pinning digests in manifests also strengthens supply-chain security.
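A sketch of a digest-pinned container follows. The registry host and repository are hypothetical, and the digest is deliberately left as a placeholder to be filled with the real value from your registry.

```yaml
containers:
  - name: app
    # Hypothetical registry and repository; replace <digest> with the image's real sha256 digest.
    image: registry.example.com/myapp@sha256:<digest>
    # A digest reference is immutable, so a cached copy is always the right one.
    imagePullPolicy: IfNotPresent
```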
Probes
Kubernetes has three probe types, each serving a distinct purpose. They are independent — a pod can have all three, any combination, or none.
| Probe | On Failure | On Success | When to Use |
|---|---|---|---|
| startupProbe | Container restarted (after failureThreshold) | Disables itself; liveness takes over | Slow-starting containers (JVM warmup, large model loading). Prevents liveness from killing container during boot. |
| livenessProbe | Container restarted | Nothing (pass is expected) | Detect deadlocks, infinite loops, hung processes that can't self-recover. Should check minimally — not dependent on external services. |
| readinessProbe | Pod removed from Service Endpoints (no restart) | Pod re-added to Endpoints | Temporary unavailability: cache warming, circuit breaker open, dependency down. Drives traffic routing, not container health. |
All Four Probe Mechanisms
# 1. httpGet: HTTP GET request; success = 2xx or 3xx status code
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080                # port name or number
    scheme: HTTP              # HTTP (default) or HTTPS
    httpHeaders:
      - name: X-Health-Check
        value: "true"
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1         # must be 1 for liveness and startup (only readiness allows >1)
# 2. exec: runs command in container; success = exit code 0
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - redis-cli ping | grep -q PONG
# 3. tcpSocket: TCP connection attempt; success = connection accepted
livenessProbe:
  tcpSocket:
    port: 5432                # useful for databases with no HTTP endpoint
    host: ""                  # defaults to pod IP
# 4. grpc: gRPC health protocol (grpc.health.v1.Health/Check); beta in 1.24, stable in 1.27
livenessProbe:
  grpc:
    port: 50051
    service: ""               # empty string = check overall server health
Probe Tuning Reference
| Field | Default | Notes |
|---|---|---|
| initialDelaySeconds | 0 | Seconds after container starts before first probe fires. For slow-starting containers, prefer a startupProbe over a long fixed delay. |
| periodSeconds | 10 | How often to probe. Minimum 1s. |
| timeoutSeconds | 1 | Probe must complete within this window or it counts as failure. Set ≥ p99 response time. |
| failureThreshold | 3 | Consecutive failures before action (restart or remove from Endpoints). |
| successThreshold | 1 | Consecutive successes to re-add to Endpoints (readiness only; liveness/startup must be 1). |
| terminationGracePeriodSeconds | Pod's value | Per-probe override (1.25+), allowed on liveness and startup probes only. It is the grace period used when the kubelet kills the container because this probe failed; it is usually set shorter than the pod's grace period so a hung container is replaced quickly. |
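To make the probe-level terminationGracePeriodSeconds concrete, here is a small sketch. The 60s and 10s values are assumptions; the point is only that the two fields are set independently.

```yaml
spec:
  terminationGracePeriodSeconds: 60        # pod-level: budget for a normal, graceful shutdown
  containers:
    - name: app
      image: myapp:v2.1.0
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
        # Applies only when this probe's failure triggers the kill:
        # a hung process gets 10s instead of the pod's 60s.
        terminationGracePeriodSeconds: 10
```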
A liveness probe that checks external dependencies (database connectivity, downstream API) causes cascading restarts: if the database goes down, all pods restart simultaneously — making recovery harder. Liveness probes should only detect states the process itself cannot recover from (deadlock, memory corruption). Use readiness probes for dependency health.
Security Context
Security context fields exist at two levels: pod-level (spec.securityContext) applies to all containers and volumes; container-level (spec.containers[*].securityContext) applies to a specific container and overrides pod-level where both exist.
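A brief sketch of the override rule, using a hypothetical second container: the pod-level runAsUser applies to every container except the one that sets its own value.

```yaml
spec:
  securityContext:
    runAsUser: 1000              # inherited by every container that does not override it
    runAsNonRoot: true
  containers:
    - name: app
      image: myapp:v2.1.0
      # No container-level runAsUser, so this container runs as UID 1000.
    - name: helper               # hypothetical second container
      image: busybox:1.36
      command: ["sleep", "3600"]
      securityContext:
        runAsUser: 2000          # container-level value wins for this container only
```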
Pod-Level Security Context
spec:
  securityContext:
    runAsUser: 1000               # UID for all containers (overridden by container-level)
    runAsGroup: 3000              # primary GID for all containers
    runAsNonRoot: true            # refuse to start if image runs as UID 0
    fsGroup: 2000                 # GID applied to all volumes; files created in volumes get this GID
    fsGroupChangePolicy: OnRootMismatch   # only chown/chmod if ownership is wrong (faster for large volumes)
                                          # Always: always chown (safe but slow for large volumes)
    supplementalGroups: [4000]    # additional GIDs for the process (e.g., for shared file access)
    seccompProfile:
      type: RuntimeDefault        # use container runtime's default seccomp profile
      # type: Localhost
      # localhostProfile: profiles/myprofile.json
      # type: Unconfined          # disable seccomp (not recommended)
    sysctls:                      # kernel parameters; only safe (namespaced) sysctls allowed by default
      - name: kernel.shm_rmid_forced   # in the safe set; allowed by default
        value: "0"
      - name: net.core.somaxconn       # unsafe; kubelet must allow it via --allowed-unsafe-sysctls
        value: "1024"
Container-Level Security Context
containers:
  - name: app
    securityContext:
      runAsUser: 1000                   # overrides pod-level runAsUser for this container
      runAsGroup: 3000
      runAsNonRoot: true
      allowPrivilegeEscalation: false   # prevents setuid binaries; CRITICAL: always set false
      readOnlyRootFilesystem: true      # immutable container filesystem; forces explicit writable mounts
      privileged: false                 # default; true = full host kernel access (avoid)
      capabilities:
        drop: ["ALL"]                   # drop all capabilities first (start from minimal)
        add:
          - NET_BIND_SERVICE            # bind ports < 1024 (only if needed)
          - SYS_PTRACE                  # for debugging tools; remove in production
      seccompProfile:
        type: RuntimeDefault
      procMount: Default                # Default | Unmasked (unmasked exposes /proc; avoid)
PodSecurity Admission Profiles
| Policy | What It Allows | What It Blocks | Use Case |
|---|---|---|---|
| Privileged | Everything | Nothing | System namespaces (kube-system), CNI, CSI node plugins |
| Baseline | Most workloads running with default container settings | hostPID, hostIPC, hostNetwork, privileged containers, hostPort, hostPath volumes, capabilities beyond the default set, unsafe sysctls, Unconfined seccomp. Does not restrict allowPrivilegeEscalation (that is a Restricted control) | General workloads not requiring host access |
| Restricted | Strictly hardened pods only | All of Baseline PLUS: must set allowPrivilegeEscalation:false, must drop ALL capabilities, must set runAsNonRoot:true, must use RuntimeDefault or Localhost seccomp, volumes limited to approved types | Security-sensitive namespaces; PCI/SOC2 compliance |
# Apply PodSecurity to a namespace via labels:
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/enforce-version=latest \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
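The same labels can be kept declaratively in a Namespace manifest. The sketch below also includes a pod that should pass the Restricted checks listed in the table; the pod name is hypothetical and the image is reused from earlier examples.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
---
apiVersion: v1
kind: Pod
metadata:
  name: restricted-demo          # hypothetical name
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:v2.1.0
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```

A pod missing any of these fields would be rejected at admission in this namespace, and the warn label surfaces the specific violation in the kubectl output.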
Multi-Container Patterns
Sidecar Pattern
A helper container that augments the main container's functionality without modifying it. They share a volume for communication.
containers:
  - name: app
    image: myapp:v1
    volumeMounts:
      - name: logs
        mountPath: /var/log/app
  - name: log-shipper             # sidecar: ships logs to Loki/Elasticsearch
    image: fluent/fluent-bit:3.0
    volumeMounts:
      - name: logs
        mountPath: /logs
        readOnly: true
    resources:
      requests:
        cpu: "50m"
        memory: "64Mi"
      limits:
        memory: "128Mi"
volumes:
  - name: logs
    emptyDir: {}                  # shared between app and log-shipper
Ambassador Pattern
A proxy sidecar that handles outbound communication on behalf of the main container — connection pooling, retries, circuit breaking, or protocol translation.
containers:
  - name: app
    image: myapp:v1
    env:
      - name: DATABASE_URL
        value: "localhost:5432"   # connects to ambassador on localhost, not directly to DB
  - name: db-ambassador           # ambassador: manages connection pool to PostgreSQL
    image: pgbouncer:1.22
    env:
      - name: DATABASES_HOST
        value: postgres.data.svc.cluster.local
      - name: POOL_MODE
        value: transaction
    ports:
      - containerPort: 5432       # app connects here; ambassador connects to real DB
Adapter Pattern
Transforms the main container's output into a standardized format — typically for metrics or log normalization.
containers:
  - name: legacy-app
    image: legacy-metrics-app:v1
    # exposes metrics in proprietary format on :8888
  - name: metrics-adapter         # converts legacy format to Prometheus exposition format
    image: prom/statsd-exporter:latest
    ports:
      - containerPort: 9102       # Prometheus scrapes this port
    args:
      - --statsd.listen-udp=:8125
      - --statsd.mapping-config=/etc/mapping.yml
Init Containers
Init containers run sequentially before any main container starts. Each must exit with code 0 for the next to begin. If an init container fails, the kubelet restarts it until it succeeds; if the pod's restartPolicy is Never, the pod is marked Failed instead. Init containers can mount the same volumes as main containers but run with their own image and security context.
initContainers:
  # Step 1: wait for dependency
  - name: wait-for-postgres
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        until nc -z -w 3 postgres.data.svc.cluster.local 5432; do
          echo "Waiting for PostgreSQL..."; sleep 5
        done
        echo "PostgreSQL is ready"
  # Step 2: run database migrations (after DB is ready)
  - name: db-migrate
    image: myapp:v2.1.0           # same image as main app
    command: ["/app/migrate", "--direction=up"]
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: db-secret
            key: url
    # Shares the same env/secrets as the main app
  # Step 3: seed config files into a shared volume
  - name: config-init
    image: alpine:3.19
    command:
      - sh
      - -c
      - |
        cp /etc/base-config/* /shared-config/
        sed -i "s/{{ENV}}/production/g" /shared-config/app.conf
    volumeMounts:
      - name: shared-config
        mountPath: /shared-config
containers:
  - name: app
    image: myapp:v2.1.0
    volumeMounts:
      - name: shared-config
        mountPath: /etc/app
        readOnly: true
volumes:
  - name: shared-config
    emptyDir: {}
Native Sidecar Containers (1.29+ / Stable 1.33)
Classic sidecars defined in spec.containers have no ordering guarantee relative to each other. A log-shipper sidecar might miss early log output if it starts after the main app. Native sidecars fix this: defined in spec.initContainers with restartPolicy: Always, they start during the init phase (before main containers) and run for the pod's lifetime.
initContainers:
  # Classic init: runs to completion, blocks next step
  - name: wait-for-db
    image: busybox:1.36
    command: ["sh", "-c", "until nc -z postgres 5432; do sleep 2; done"]
    # No restartPolicy: classic init behavior
  # Native sidecar: starts here (in init phase order), runs for the pod's lifetime
  - name: log-shipper
    image: fluent/fluent-bit:3.0
    restartPolicy: Always         # this is what makes it a native sidecar
    volumeMounts:
      - name: logs
        mountPath: /logs
    startupProbe:                 # the next init container waits until this passes
      exec:
        command: ["test", "-f", "/tmp/fluent-bit.pid"]
      initialDelaySeconds: 2
  # Another native sidecar
  - name: envoy-proxy
    image: envoyproxy/envoy:v1.29
    restartPolicy: Always
    readinessProbe:               # feeds pod readiness; does not gate startup order
      httpGet:
        path: /ready
        port: 9901
containers:
  - name: app                     # starts after both native sidecars have started
    image: myapp:v1
When the pod terminates, native sidecars receive SIGTERM after all main containers have exited. This ensures a log-shipper flushes buffered logs before shutting down — fixing the classic problem where the log sidecar died before the app finished writing. The shutdown order is: main containers exit → native sidecars receive SIGTERM.
Ephemeral Containers
Ephemeral containers are temporary containers injected into a running pod for debugging. They cannot define ports, probes, or resources and are never restarted. They are useful when a minimal production image lacks debugging tools.
# Inject a debug container into a running pod.
# --target makes the debug container share the app container's process namespace.
kubectl debug -it pod/web-app-7d9b8f \
  --image=busybox:1.36 \
  --target=app \
  -- sh
# The app image is distroless and has no shell, so debug a copy of the pod instead.
# --copy-to creates a new pod (web-app-debug) with the debug container added;
# --share-processes enables process namespace sharing in that copy.
kubectl debug -it pod/web-app-7d9b8f \
  --image=gcr.io/distroless/base:debug \
  --copy-to=web-app-debug \
  --share-processes \
  -- sh
# To see every process in the pod from an ephemeral container without --target,
# enable process namespace sharing in the pod spec:
spec:
  shareProcessNamespace: true     # containers can see each other's processes via /proc
  containers:
    - name: app
      image: myapp:v1
Container Lifecycle Hooks
| Hook | When It Fires | Blocking? | Failure Behavior |
|---|---|---|---|
| postStart | Immediately after container is created (async with ENTRYPOINT; may run before or after it) | Yes; container stays in ContainerCreating until hook completes | Container killed and restarted |
| preStop | Immediately before container is terminated (before SIGTERM) | Yes; SIGTERM delayed until hook completes or grace period expires | Container still terminated |
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "/app/register-service.sh"]
    # OR httpGet:
    #   path: /post-start
    #   port: 8080
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # Drain connections gracefully before SIGTERM
          sleep 5
          /app/deregister-service.sh
          # SIGTERM sent after this exits (or grace period expires)
postStart runs once at startup; it is not a health check. Use it for one-time initialization tasks (registering with a service registry, seeding a cache). Use readinessProbe to gate traffic until the application is actually ready to serve. The two are independent — the pod enters Ready state based on the readiness probe, not the postStart hook.
DNS Policy
| dnsPolicy | Behavior | Use Case |
|---|---|---|
| ClusterFirst | Default. Queries go to CoreDNS; cluster.local names resolved internally; fallback to upstream DNS. | All normal pods |
| ClusterFirstWithHostNet | Same as ClusterFirst but required when hostNetwork: true (otherwise hostNetwork pods use node DNS). | DaemonSets and infrastructure pods using hostNetwork that still need in-cluster DNS |
| Default | Pod inherits the node's /etc/resolv.conf; does NOT resolve cluster services. | Pods that only need external DNS and never access services by cluster DNS name |
| None | Ignores Kubernetes DNS; spec.dnsConfig must be provided. | Custom DNS configurations; pods with special resolver requirements |
# dnsConfig for custom search domains and options
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    nameservers:
      - 10.96.0.10                # CoreDNS IP; required only when dnsPolicy: None, otherwise merged with the policy's nameservers
    searches:
      - production.svc.cluster.local
      - svc.cluster.local
      - cluster.local
      - internal.corp.com         # additional search domain
    options:
      - name: ndots
        value: "2"                # lower from default 5 reduces DNS lookup overhead
      - name: attempts
        value: "3"
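For contrast with the ClusterFirst example, the sketch below uses dnsPolicy: None, where the resolver configuration comes entirely from dnsConfig. The nameserver address is the same assumed CoreDNS ClusterIP as above; check your cluster's actual value.

```yaml
spec:
  dnsPolicy: "None"               # nothing is inherited; dnsConfig below is the whole resolv.conf
  dnsConfig:
    nameservers:
      - 10.96.0.10                # assumed CoreDNS ClusterIP; at least one entry is required here
    searches:
      - production.svc.cluster.local
      - svc.cluster.local
      - cluster.local
    options:
      - name: ndots
        value: "2"
  containers:
    - name: app
      image: myapp:v1
```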
Host Namespaces
| Field | Default | Effect | Security Risk |
|---|---|---|---|
| hostNetwork: true | false | Pod uses node's network namespace; sees all host interfaces, binds to host ports directly | High: can eavesdrop on node traffic, bind privileged ports |
| hostPID: true | false | Pod shares node's PID namespace; can see and signal all processes on the node | Critical: can kill or inspect any node process |
| hostIPC: true | false | Pod shares node's IPC namespace; can access shared memory and semaphores of other processes | High: can read/write shared memory of host processes |
hostNetwork, hostPID, and hostIPC are all blocked by the PodSecurity Baseline and Restricted profiles. They are legitimate only for infrastructure components (CNI plugins, node monitoring agents). Application workloads should never use host namespaces. Audit with:
kubectl get pods -A -o json | jq '.items[] | select(.spec.hostNetwork==true) | .metadata.namespace + "/" + .metadata.name'
RuntimeClass
RuntimeClass selects the container runtime for a pod. Different runtimes provide different security isolation models — from the default runc (process isolation) to hardware-virtualized sandboxes.
# RuntimeClass object (created by cluster admin)
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc                    # maps to containerd runtime handler name
overhead:
  podFixed:
    cpu: "250m"
    memory: "256Mi"               # gVisor sentry process overhead added to resource accounting
scheduling:
  nodeSelector:
    sandboxed: "true"             # RuntimeClass can constrain which nodes are used
  tolerations:
    - key: sandboxed
      operator: Exists
---
# In pod spec:
spec:
  runtimeClassName: gvisor        # use the gVisor sandbox
| Runtime | Isolation | Overhead | Use Case |
|---|---|---|---|
| runc (default) | Linux namespaces + cgroups | Minimal | All standard workloads |
| gVisor (runsc) | Userspace kernel intercept (ptrace/KVM) | Medium (250m CPU, 256Mi memory) | Multi-tenant, untrusted code, CI runners |
| Kata Containers | Hardware virtualization (KVM VM per pod) | High (~500m CPU, 512Mi memory) | Strict isolation, regulated workloads, FaaS |
| Firecracker | MicroVM per pod | Low-medium (~125ms boot) | AWS Lambda-style serverless, fast VM boot |
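A RuntimeClass for Kata Containers would follow the same shape as the gVisor one. In this sketch the handler name kata is an assumption (it must match whatever handler your containerd configuration defines), and the overhead reuses the approximate figures from the table.

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata                     # assumption: must match the handler configured in containerd
overhead:
  podFixed:
    cpu: "500m"                   # approximate per-pod VM overhead from the table above
    memory: "512Mi"
```

Because overhead is added to the pod's requests during scheduling, a pod asking for 250m CPU under this class is accounted as 750m on the node.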
Metrics, Alerts, and Runbooks
Key Pod Metrics
| Metric | Source | Alert Condition |
|---|---|---|
| kube_pod_container_status_restarts_total | kube-state-metrics | increase > 5 in 1h → CrashLoopBackOff risk |
| kube_pod_status_phase{phase="Failed"} | kube-state-metrics | Any Failed pod older than 5m in production namespace |
| kube_pod_status_ready{condition="false"} | kube-state-metrics | Pod not ready > 10m (readiness probe continuously failing) |
| container_oom_events_total | cadvisor | Any OOM kill in production |
| container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total | cadvisor | throttle ratio > 25% → consider raising or removing CPU limits |
Alerting Rules
groups:
  - name: pod-health
    rules:
      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"
      - alert: PodOOMKilled
        expr: |
          increase(container_oom_events_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Container OOM killed: {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }}"
      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{condition="false"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 10 minutes"
      - alert: ContainerCPUThrottling
        # Ratio of throttled CFS periods to all CFS periods (both are period counts)
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
            / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        for: 15m
        annotations:
          summary: "Container {{ $labels.container }} CPU throttled >25%"
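The metrics table lists kube_pod_status_phase{phase="Failed"} with a 5-minute condition, but the rules above do not cover it. A possible rule is sketched below; the group name, alert name, and severity are assumptions.

```yaml
groups:
  - name: pod-health-extra        # hypothetical group name
    rules:
      - alert: PodFailed
        expr: |
          kube_pod_status_phase{phase="Failed", namespace="production"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been Failed for 5 minutes"
```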
Runbooks
CrashLoopBackOff: Check last exit code: kubectl describe pod POD → "Last State: Terminated reason: Error exitCode: N". Check logs: kubectl logs POD --previous. Common causes: missing env/secret, wrong command, OOM (exitCode 137), permission denied on read-only rootfs.
OOMKilled: The container exceeded its memory limit. Check actual usage: kubectl top pod POD --containers. If usage grows without bound, look for a memory leak with heap profiling. If the higher usage is legitimate, raise the memory limit (and the request to match); VPA can recommend right-sized requests.
Pod stuck in Pending: Run kubectl describe pod POD → Events section. Causes: (1) insufficient resources: check kubectl describe nodes for Allocated resources; (2) no matching nodes: check nodeSelector/affinity; (3) taint mismatch: check tolerations; (4) PVC not bound: check PVC status.
Readiness probe failing: Check probe events: kubectl describe pod POD → "Readiness probe failed". For httpGet: exec into pod and run curl localhost:PORT/PATH. Check if app has started, dependency health, and timeout vs actual response time. Increase timeoutSeconds if p99 latency exceeds probe timeout.
ImagePullBackOff / ErrImagePull: Events show "Failed to pull image". Causes: (1) image tag doesn't exist; (2) registry credentials missing/expired: check imagePullSecrets and secret validity; (3) registry unreachable: network policy or node DNS issue; (4) rate limiting (Docker Hub). For private registries, create the pull secret with kubectl create secret docker-registry and reference it in imagePullSecrets.
Best Practices
- Always define a startupProbe for slow-starting containers: JVM apps, ML model servers, and apps with database migrations can take 30–120 seconds to start. A startupProbe with failureThreshold: 30, periodSeconds: 10 gives up to 5 minutes (30 × 10s) before liveness takes over, preventing premature container kills.
- Set readOnlyRootFilesystem: true and provide explicit emptyDir mounts for writable paths: this forces you to enumerate where the app writes (tmp, logs, cache) and limits blast radius if the container is compromised, since attackers cannot modify binaries or write scripts to /usr/local/bin.
- Drop all capabilities and add back only what is needed: capabilities.drop: [ALL] removes all Linux capabilities. Most apps need none. Only add NET_BIND_SERVICE if binding ports < 1024 (most Kubernetes apps run on high ports and don't need this).
- Set allowPrivilegeEscalation: false on every container: this prevents setuid/setgid binaries from escalating to root. It is a one-line fix that stops a whole class of container escape techniques. Pair with runAsNonRoot: true.
- Use separate liveness and readiness probes with different endpoints: liveness checks minimal viability (/healthz: can the process respond?); readiness checks full operational status (/ready: are all dependencies healthy?). Using the same endpoint for both causes cascading restarts when a dependency is down.
- Never set CPU limits without measuring throttling first: monitor CFS throttling (container_cpu_cfs_throttled_periods_total relative to container_cpu_cfs_periods_total) before adding CPU limits. If more than 25% of periods are throttled, the limit is too low for the workload's actual burst behavior. Many teams set requests only for CPU and rely on node-level fair scheduling.
- Use preStop: sleep 5 for zero-downtime deployments: the kube-proxy/Endpoints propagation race means a pod can receive traffic 1–5 seconds after deletion begins. A preStop sleep delays SIGTERM, giving load balancers time to drain. Set terminationGracePeriodSeconds to at least preStop duration + application shutdown time + 10s.
- Create a dedicated ServiceAccount per workload and disable automounting on the default SA: the default ServiceAccount in each namespace automounts an API token into every pod that uses it. Even if unused, that token is an attack vector. Set automountServiceAccountToken: false on the default SA, and create explicit SAs with minimal RBAC for workloads that need API access.
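The last two practices can be sketched together. The manifests below assume a 30s application shutdown time, which is an illustrative figure, so the grace period works out to 5s preStop + 30s shutdown + 10s margin = 45s.

```yaml
# Disable token automounting on the namespace's default ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: production
automountServiceAccountToken: false
---
# Dedicated ServiceAccount for the workload; bind minimal RBAC to it separately
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-app-sa
  namespace: production
---
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  namespace: production
spec:
  serviceAccountName: web-app-sa
  # 5s preStop + 30s assumed app shutdown + 10s margin = 45s
  terminationGracePeriodSeconds: 45
  containers:
    - name: app
      image: myapp:v2.1.0
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]
```

If the application's real shutdown time is longer than the assumed 30s, raise terminationGracePeriodSeconds accordingly; otherwise the kubelet sends SIGKILL before shutdown completes.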