Workloads Overview
The Pod: Kubernetes' Atomic Workload Unit
Every workload in Kubernetes runs as one or more Pods. A Pod is a co-scheduled group of one or more containers sharing a network namespace (same IP, same port space) and optionally sharing storage volumes. Pods are ephemeral by design — they are created, run, and eventually terminate. Controllers give pods durability, scale, and update semantics.
The Controller Reconciliation Pattern
All Kubernetes workload controllers follow the same fundamental loop: watch the actual state of the cluster, compare it to the desired state declared in the spec, and take actions to reconcile the difference. This pattern makes Kubernetes self-healing — if a pod crashes, the controller notices and creates a replacement.
The controller never talks to kubelet directly. It writes Pod objects to the API server. The scheduler picks them up and assigns spec.nodeName. Kubelet on the assigned node watches for pods with its node name and starts the containers.
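The loop described above can be sketched in a few lines. This is a toy simulation, not the real controller code: `FakeCluster`, its methods, and the `web-N` pod names are hypothetical stand-ins for the API server and the objects a controller would read and write through it.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for the API server's pod list."""
    pods: set = field(default_factory=set)
    _ids: count = field(default_factory=count)

    def create_pod(self) -> str:
        name = f"web-{next(self._ids)}"
        self.pods.add(name)
        return name

    def delete_pod(self, name: str) -> None:
        self.pods.discard(name)


def reconcile(cluster: FakeCluster, desired_replicas: int) -> int:
    """One pass of the loop: observe, diff, act. Returns the number of actions taken."""
    actual = len(cluster.pods)
    diff = desired_replicas - actual
    if diff > 0:
        # Too few pods: create the missing ones.
        for _ in range(diff):
            cluster.create_pod()
    elif diff < 0:
        # Too many pods: delete the surplus.
        for name in sorted(cluster.pods)[:-diff]:
            cluster.delete_pod(name)
    return abs(diff)


cluster = FakeCluster()
reconcile(cluster, 3)            # scale from 0 to 3
cluster.delete_pod("web-1")      # simulate a crashed pod
actions = reconcile(cluster, 3)  # the next pass notices and creates a replacement
print(actions, sorted(cluster.pods))
```

A real controller runs this pass whenever a watch event arrives rather than on demand, but the observe/diff/act shape is the same.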
Workload Controller Kinds
Deployment
Manages stateless replicated applications. Creates and manages ReplicaSets for rolling updates and rollback. The go-to controller for web servers, APIs, and microservices.
- Stateless — any pod can serve any request
- RollingUpdate or Recreate strategy
- Rollback via revision history
- HPA-compatible for autoscaling
StatefulSet
Manages stateful applications requiring stable network identity, stable persistent storage, and ordered deployment/scaling. Each pod gets a fixed ordinal and a dedicated PVC.
- Stable DNS: <statefulset>-<ordinal>.<headless-service>.<namespace>.svc.cluster.local
- Ordered creation (0→N-1) and deletion (N-1→0)
- volumeClaimTemplates for per-pod PVCs
- Used for databases, message queues, caches
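To make the naming and ordering rules concrete, here is a small sketch that derives the per-pod DNS names for a StatefulSet. The names `db`, `db-headless`, and `prod` are made up for illustration, and `cluster.local` is assumed as the cluster domain.

```python
def pod_dns_names(statefulset: str, service: str, namespace: str,
                  replicas: int, cluster_domain: str = "cluster.local") -> list:
    """Return the stable DNS name of every pod, in ordinal order."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


names = pod_dns_names("db", "db-headless", "prod", replicas=3)
creation_order = names                  # ordinal 0 first
deletion_order = list(reversed(names))  # highest ordinal first

for name in creation_order:
    print(name)
```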
DaemonSet
Runs exactly one pod per node (or per matching node). New nodes automatically get a pod. Used for infrastructure agents that must be co-located with every workload node.
- Log collectors (Fluentd, Filebeat)
- Monitoring agents (node-exporter, Datadog)
- CNI plugins, CSI node plugins
- Can target node subsets via nodeSelector
Job
Runs one or more pods to completion. The controller tracks successful completions and retries on failure. Used for batch processing, database migrations, one-off scripts.
- completions (total successful pods required) and parallelism (maximum pods running at once) for parallel work queues
- backoffLimit controls retry count
- activeDeadlineSeconds for timeout
- Indexed Jobs for partitioned work (alpha in 1.21, stable since 1.24)
CronJob
Creates Jobs on a cron schedule. Manages job history (successfulJobsHistoryLimit / failedJobsHistoryLimit), controls overlapping runs via concurrencyPolicy, and bounds how late a missed run may still start via startingDeadlineSeconds.
- Standard Unix cron syntax
- concurrencyPolicy: Allow / Forbid / Replace
- startingDeadlineSeconds for missed fires
- Time zone support (spec.timeZone, 1.27 GA)
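The two CronJob knobs in the list above answer different questions: concurrencyPolicy decides what happens when the previous Job is still running, and startingDeadlineSeconds decides whether a late run may still start. The following sketch models both decisions; the function names and the returned action strings are illustrative, not part of any Kubernetes API.

```python
from typing import Optional


def on_schedule_fire(policy: str, previous_job_running: bool) -> list:
    """Actions taken when a schedule time arrives, per concurrencyPolicy."""
    if policy not in {"Allow", "Forbid", "Replace"}:
        raise ValueError(f"unknown concurrencyPolicy: {policy}")
    if not previous_job_running or policy == "Allow":
        return ["start new Job"]
    if policy == "Forbid":
        return ["skip this run"]
    # Replace: the running Job is removed before the new one starts.
    return ["delete running Job", "start new Job"]


def may_start_late(seconds_late: float, starting_deadline_seconds: Optional[int]) -> bool:
    """A missed run may still start if it is within the deadline (no deadline if unset)."""
    if starting_deadline_seconds is None:
        return True
    return seconds_late <= starting_deadline_seconds


print(on_schedule_fire("Forbid", previous_job_running=True))
print(may_start_late(seconds_late=120, starting_deadline_seconds=60))
```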
ReplicaSet
Maintains a stable set of replica pods. Rarely used directly: a Deployment manages ReplicaSets on your behalf and adds rolling update and rollback capability.
- Direct use only for specialized controllers
- Owns pods via label selector + ownerReference
- No update strategy — use Deployment instead
Workload Decision Matrix
| Requirement | Deployment | StatefulSet | DaemonSet | Job | CronJob |
|---|---|---|---|---|---|
| Stateless replicas | ✓ primary | ✗ | ~ (1/node) | ✗ | ✗ |
| Stable pod identity (name) | ✗ random | ✓ ordinal | ~ (node-scoped) | ✗ | ✗ |
| Stable DNS per pod | ✗ | ✓ headless svc | ✗ | ✗ | ✗ |
| Per-pod dedicated PVC | ✗ | ✓ volumeClaimTemplates | ✗ | ✗ | ✗ |
| Run on every node | ✗ | ✗ | ✓ primary | ✗ | ✗ |
| Run to completion | ✗ | ✗ | ✗ | ✓ primary | ✓ scheduled |
| Scheduled / recurring | ✗ | ✗ | ✗ | ✗ | ✓ primary |
| Parallel batch processing | ~ (Deployment HPA) | ✗ | ✗ | ✓ indexed/work-queue | ✗ |
| Rolling update strategy | ✓ maxSurge/maxUnavailable | ✓ partition-based | ✓ maxUnavailable | ✗ | ✗ |
| Rollback (manual, kubectl rollout undo) | ✓ ReplicaSet revision history | ✓ ControllerRevision history | ✓ ControllerRevision history | ✗ | ✗ |
| HPA autoscaling | ✓ | ✓ | ✗ | ✗ | ✗ |
| Access to host network/PID | ~ (with hostNetwork) | ~ | ✓ common pattern | ~ | ~ |
Owner References and Garbage Collection
Every pod created by a controller carries an ownerReferences entry pointing back to the controller (or its ReplicaSet). When the owner is deleted, the garbage collector cascades deletion to all owned objects — by default, this includes pods.
# A Pod created by a Deployment sits at the end of a two-level owner chain
# (Pod → ReplicaSet → Deployment). The Pod itself carries one ownerReference, to its ReplicaSet:
metadata:
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: nginx-7d9c8f8b6
    uid: abc123...
    controller: true            # at most one ownerReference may be the managing controller
    blockOwnerDeletion: true    # under foreground deletion, the owner is not removed until this object is gone
# Cascade modes when deleting a Deployment:
# --cascade=background → owner deleted immediately, GC cleans up dependents asynchronously (kubectl default)
# --cascade=foreground → owner marked for deletion; GC deletes dependents first, then the owner
# --cascade=orphan → owner deleted, dependents left running (useful for StatefulSet migration)
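The three cascade modes differ only in which objects are removed and in what order. The sketch below walks a hand-written owner map to show that ordering; it is a simplified model of the garbage collector, and the two pod names are hypothetical.

```python
# child -> owner, mirroring ownerReferences (pod names are made up)
OWNERS = {
    "pod/nginx-7d9c8f8b6-abcde": "replicaset/nginx-7d9c8f8b6",
    "pod/nginx-7d9c8f8b6-fghij": "replicaset/nginx-7d9c8f8b6",
    "replicaset/nginx-7d9c8f8b6": "deployment/nginx",
}


def descendants_top_down(owner: str) -> list:
    """All transitive dependents of owner, each parent listed before its children."""
    found = []
    for child, parent in OWNERS.items():
        if parent == owner:
            found.append(child)
            found.extend(descendants_top_down(child))
    return found


def deletion_order(owner: str, cascade: str = "background") -> list:
    """Order in which objects disappear for a given cascade mode."""
    deps = descendants_top_down(owner)
    if cascade == "orphan":
        return [owner]                        # dependents are left running
    if cascade == "background":
        return [owner] + deps                 # owner first, GC follows
    if cascade == "foreground":
        return list(reversed(deps)) + [owner] # dependents first, owner last
    raise ValueError(f"unknown cascade mode: {cascade}")


for mode in ("background", "foreground", "orphan"):
    print(mode, deletion_order("deployment/nginx", mode))
```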
The .spec.selector on a Deployment, StatefulSet, or DaemonSet is immutable after creation (apps/v1). To change the selector, you must delete and recreate the controller. The pod template's labels must always satisfy the selector; the API server rejects a template that does not. If you relabel a running pod so that it no longer matches the selector, the controller releases (orphans) it and creates a replacement, leaving one more pod running than the desired replica count until you delete the orphan.
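Selector matching with matchLabels is a plain subset test, which is why relabeling a pod can silently remove it from its controller's view. A minimal sketch, with hypothetical pod names and labels:

```python
def matches(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every selector key/value must be present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


selector = {"app": "nginx"}
pods = {
    "nginx-a": {"app": "nginx", "version": "v1.2.3"},
    "nginx-b": {"app": "nginx", "version": "v1.2.3"},
}

owned_before = [name for name, labels in pods.items() if matches(selector, labels)]

# Relabel one pod, for example to pull it out of rotation for debugging.
pods["nginx-b"]["app"] = "nginx-debug"
owned_after = [name for name, labels in pods.items() if matches(selector, labels)]

print(owned_before)  # both pods are matched
print(owned_after)   # only nginx-a is matched; the controller would create a replacement
```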
Pod Template Propagation
Workload controllers embed a spec.template of type PodTemplateSpec: pod metadata plus a complete PodSpec, without apiVersion, kind, or metadata.name. (CronJob nests it one level deeper, under spec.jobTemplate.spec.template.) The controller stamps this template onto every pod it creates, injecting a generated name and the controller's ownerReference.
apiVersion: apps/v1
kind: Deployment
spec:
  selector:
    matchLabels:
      app: nginx             # must match template labels
  template:                  # ← PodTemplateSpec: everything below becomes the Pod spec
    metadata:
      labels:
        app: nginx           # must satisfy selector.matchLabels
        version: v1.2.3      # additional labels allowed
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        # ... full PodSpec here
Any change to spec.template (image tag, env vars, resource limits, annotations) triggers a rolling update. Changes to spec.replicas do not. If you need to force a rolling restart without changing the template (e.g., to pick up a new ConfigMap), use kubectl rollout restart deployment/NAME — it adds a kubectl.kubernetes.io/restartedAt annotation to the template.
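One way to picture the rule is that the controller fingerprints spec.template and rolls out only when the fingerprint changes. The sketch below uses SHA-256 over canonical JSON as a stand-in; the real controller computes its pod-template-hash label differently, and the `nginx:1.26` tag and timestamp are arbitrary example values.

```python
import copy
import hashlib
import json


def template_fingerprint(deployment: dict) -> str:
    """Hash only spec.template; an illustrative stand-in for pod-template-hash."""
    canonical = json.dumps(deployment["spec"]["template"], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:10]


def needs_rollout(old: dict, new: dict) -> bool:
    return template_fingerprint(old) != template_fingerprint(new)


base = {
    "spec": {
        "replicas": 3,
        "template": {
            "metadata": {"labels": {"app": "nginx"}, "annotations": {}},
            "spec": {"containers": [{"name": "nginx", "image": "nginx:1.25"}]},
        },
    }
}

scaled = copy.deepcopy(base)
scaled["spec"]["replicas"] = 5

new_image = copy.deepcopy(base)
new_image["spec"]["template"]["spec"]["containers"][0]["image"] = "nginx:1.26"

restarted = copy.deepcopy(base)
restarted["spec"]["template"]["metadata"]["annotations"][
    "kubectl.kubernetes.io/restartedAt"
] = "2024-01-01T00:00:00Z"

print(needs_rollout(base, scaled))     # False: replicas is outside the template
print(needs_rollout(base, new_image))  # True: image changed
print(needs_rollout(base, restarted))  # True: annotation added to the template
```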
Autoscaling Landscape
| Scaler | What It Scales | Metric Sources | Reaction Time | Notes |
|---|---|---|---|---|
| HPA (HorizontalPodAutoscaler) | Deployment/StatefulSet replicas | CPU, memory (metrics-server); custom (Prometheus Adapter); external | ~15–30s | GA; built-in. Cannot scale to zero. |
| VPA (VerticalPodAutoscaler) | Pod CPU/memory requests | Historical usage (VPA recommender) | Requires pod restart (Off/Initial/Auto modes) | Addon; conflicts with HPA on CPU/memory. Use Off mode for recommendations only. |
| KEDA (Kubernetes Event-Driven Autoscaling) | Deployment/Job replicas (incl. scale to zero) | Kafka lag, SQS depth, Prometheus, cron, 60+ scalers | Bounded by the trigger pollingInterval (default 30s) and, for 1→N scaling, the HPA sync period | CNCF graduated. ScaledObject / ScaledJob CRDs. Wraps HPA. |
| Cluster Autoscaler | Node count in node groups | Pending pods (unschedulable) | ~1–3 min (cloud API) | Works with HPA — HPA adds pods, CA adds nodes. Per-cloud provider. |
| Karpenter | Node provisioning (AWS-native) | Pending pod requirements (shape, topology, GPU) | ~30–60s | AWS-native; smarter bin-packing than CA; consolidation feature. |
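For the HPA row, the documented core calculation is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below implements only that core; it leaves out stabilization windows, scaling policies, and unready-pod handling, and the min/max defaults are arbitrary example values.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Core HPA calculation: scale proportionally to metric/target, then clamp."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    proposed = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, proposed))


print(desired_replicas(4, current_metric=90, target_metric=60))   # 6
print(desired_replicas(4, current_metric=63, target_metric=60))   # 4 (within tolerance)
print(desired_replicas(4, current_metric=300, target_metric=60))  # 10 (clamped to max)
```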
Resource Model Summary
Each container can specify CPU and memory requests (scheduling guarantee) and limits (enforcement ceiling). The combination determines the pod's Quality of Service (QoS) class, which governs eviction priority under node pressure.
| QoS Class | Condition | Eviction Priority | OOM Kill Priority |
|---|---|---|---|
| Guaranteed | Every container sets CPU and memory limits, and requests == limits | Last evicted | Lowest oom_score_adj (-997) |
| Burstable | Not Guaranteed, but at least one container sets a CPU or memory request or limit | Middle | oom_score_adj between 2 and 999; lower for a larger memory request relative to node capacity |
| BestEffort | No requests or limits on ANY container | First evicted | Highest oom_score_adj (1000) |
# Guaranteed QoS: both resources, requests == limits in every container
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Burstable QoS: requests set, limits higher or absent
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "1"
    memory: "512Mi"

# BestEffort QoS: no resources specified (dev/test only)
resources: {}
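The classification rules can be written as a short function. This is a simplified sketch: it compares quantity strings literally (so "500m" and "0.5" would wrongly count as different), whereas the kubelet parses quantities. The dictionaries mirror the three examples above.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers: list) -> str:
    """Classify a pod from its containers' resources blocks (simplified)."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in RESOURCES:
            if resource in requests or resource in limits:
                any_set = True
            limit = limits.get(resource)
            # An unset request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


guaranteed_pod = [{"requests": {"cpu": "500m", "memory": "512Mi"},
                   "limits": {"cpu": "500m", "memory": "512Mi"}}]
burstable_pod = [{"requests": {"cpu": "100m", "memory": "128Mi"},
                  "limits": {"cpu": "1", "memory": "512Mi"}}]
best_effort_pod = [{}]

print(qos_class(guaranteed_pod))   # Guaranteed
print(qos_class(burstable_pod))    # Burstable
print(qos_class(best_effort_pod))  # BestEffort
```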
CPU limits are enforced through CFS bandwidth control: once a container uses up its quota within a scheduling period (100 ms by default), it is throttled for the remainder of that period, even if the node has idle capacity. The effect rarely shows up in average CPU utilization but can inflate tail (p99) latency. Many platform teams remove CPU limits entirely and rely on CPU requests for scheduling fairness. Memory limits remain necessary because memory is not compressible: exceeding the limit causes an OOM kill.
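A little arithmetic shows how throttling stretches latency. The sketch assumes a single-threaded task that starts at the beginning of a period, the default 100 ms CFS period, and a limit of at most one core; real workloads with multiple threads or unlucky timing can fare worse.

```python
def cfs_quota_us(limit_cores: float, period_us: int = 100_000) -> int:
    """CPU time the container may use per period, in microseconds."""
    return round(limit_cores * period_us)


def wall_time_ms(cpu_needed_ms: float, limit_cores: float, period_ms: float = 100.0) -> float:
    """Wall-clock time for a single-threaded task needing cpu_needed_ms of CPU."""
    # A single thread cannot use more than one full period per period.
    quota_ms = min(limit_cores * period_ms, period_ms)
    if cpu_needed_ms <= quota_ms:
        return cpu_needed_ms  # finishes inside the first period, never throttled
    full_periods, remainder = divmod(cpu_needed_ms, quota_ms)
    if remainder == 0:
        # The last period ends as soon as its quota is consumed.
        return (full_periods - 1) * period_ms + quota_ms
    return full_periods * period_ms + remainder


print(cfs_quota_us(0.5))        # 50000
print(wall_time_ms(40, 0.5))    # 40: fits in one quota
print(wall_time_ms(80, 0.5))    # 130.0: 50 ms run, 50 ms throttled, 30 ms run
```

With a 500m limit, a request needing 80 ms of CPU takes 130 ms of wall time, which is the kind of tail-latency inflation the paragraph above describes.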
Scheduling Primitives Summary
| Mechanism | Direction | Hard/Soft | Use Case |
|---|---|---|---|
| nodeSelector | Pod → Node | Hard | Simple label match; schedule to GPU nodes, SSD nodes |
| nodeAffinity required | Pod → Node | Hard | Complex label expressions; zone restriction |
| nodeAffinity preferred | Pod → Node | Soft | Prefer certain node types but don't require them |
| podAffinity required | Pod → Pod | Hard | Co-locate pods (e.g., app + cache on same node) |
| podAntiAffinity required | Pod → Pod | Hard | Spread replicas across nodes/zones |
| podAntiAffinity preferred | Pod → Pod | Soft | Best-effort spread without blocking scheduling |
| taints on Node | Node → Pod | Hard/Soft | Reserve nodes for specific workloads (NoSchedule/PreferNoSchedule) |
| tolerations on Pod | Pod → Node | Override taint | Allow pods to run on tainted nodes (GPU, spot, control-plane) |
| TopologySpreadConstraints | Pod → Topology | Hard/Soft | Even distribution across zones/nodes; preferred over anti-affinity |
| priorityClass | Pod → Scheduler | Preemption | Critical pods preempt lower-priority pods; system-cluster-critical for infra |
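Taints and tolerations are the one pair in the table where the node repels and the pod opts back in. The matching rule can be sketched as follows; this is a simplified model that ignores tolerationSeconds, and the taint keys used in the example are illustrative.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect matches all effects
    if not key:
        return operator == "Exists"  # an empty key is only valid with Exists
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(taints: list, tolerations: list) -> bool:
    """Hard taints must all be tolerated; PreferNoSchedule is only a preference."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)


gpu_taint = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}

print(can_schedule([gpu_taint], []))                # False
print(can_schedule([gpu_taint], [gpu_toleration]))  # True
```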
Pod Lifecycle Summary
Graceful Termination Sequence
# Termination flow when a pod is deleted:
# 1. Pod marked Terminating and the grace-period timer starts; removal from
#    Service endpoints begins in parallel and propagates asynchronously
# 2. preStop hook executed (if defined); blocks until the hook exits or the grace period runs out
# 3. SIGTERM sent to PID 1 of each container
# 4. Kubelet waits for the rest of terminationGracePeriodSeconds (default: 30s, counted from step 1)
# 5. SIGKILL sent if still running after the grace period
spec:
  terminationGracePeriodSeconds: 60   # override default 30s
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]   # drain in-flight requests
        # OR (a hook takes exactly one handler):
        # httpGet:
        #   path: /drain
        #   port: 8080
Between pod deletion and kube-proxy/CoreDNS propagating the Endpoints removal, there is a race: new requests can still be routed to a terminating pod. A preStop: sleep 5 introduces a delay before SIGTERM, giving load balancers time to drain. This is a common production pattern. Set terminationGracePeriodSeconds to at least preStop duration + app shutdown time + 10s buffer.
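The sizing rule above is simple addition, but it helps to see it next to the time budget it implies. The grace period is counted from the moment of deletion, so the preStop hook consumes part of it. The numbers below are illustrative, and both function names are hypothetical.

```python
def min_grace_period_s(pre_stop_s: int, app_shutdown_s: int, buffer_s: int = 10) -> int:
    """Smallest terminationGracePeriodSeconds that covers hook, shutdown, and buffer."""
    return pre_stop_s + app_shutdown_s + buffer_s


def shutdown_budget_s(grace_period_s: int, pre_stop_s: int) -> int:
    """Time left between SIGTERM and SIGKILL once the preStop hook has run."""
    return max(0, grace_period_s - pre_stop_s)


print(min_grace_period_s(pre_stop_s=5, app_shutdown_s=20))   # 35
print(shutdown_budget_s(grace_period_s=30, pre_stop_s=5))    # 25
```

With the default 30s grace period and a 5s preStop sleep, the application has about 25 seconds to exit after SIGTERM before it is killed.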
Init Containers and Sidecar Containers
| Type | Defined In | Run When | Must Complete? | Use Case |
|---|---|---|---|---|
| Init container | spec.initContainers | Before main containers; sequential | Yes (exit 0); a failure is retried per the pod's restartPolicy (with Never, the pod fails) | Wait for DB, git clone, certificate generation, schema migration |
| Native sidecar (1.29+) | spec.initContainers with restartPolicy: Always | Started in init phase, runs for pod lifetime | No; runs alongside main containers | Log shippers, service mesh proxies, secret injectors that must start before main app |
| Regular sidecar (legacy) | spec.containers | Parallel with main containers | No, but blocks pod Ready | Envoy proxy, Vault agent, metrics exporter |
Workload Identity and Service Accounts
Every pod runs under a ServiceAccount. The mounted service account token (projected volume at /var/run/secrets/kubernetes.io/serviceaccount/token) allows the pod to authenticate to the Kubernetes API server. With IRSA (AWS), Workload Identity (GCP/Azure), or SPIFFE/SPIRE, a projected service account token, usually issued with a provider-specific audience, is exchanged for cloud credentials without long-lived secrets.
spec:
  serviceAccountName: my-app-sa             # default: "default" SA in namespace
  automountServiceAccountToken: false       # disable if pod doesn't need API access

# Best practice: create a dedicated SA per workload, not share "default"
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/my-app-role   # IRSA
Cross-Cutting Concerns per Workload Kind
| Concern | Deployment | StatefulSet | DaemonSet | Job |
|---|---|---|---|---|
| RBAC | ServiceAccount bound to Role/ClusterRole (same pattern for all kinds) | Same | Same | Same |
| NetworkPolicy | Pod label selector | Pod label selector (ordinal pods all share same labels) | Pod label selector | Pod label selector (job-name label auto-applied) |
| PodSecurity | Enforced at namespace level via pod-security.kubernetes.io/enforce label; applies to all pod-creating objects in namespace | Same | Same | Same |
| PodDisruptionBudget | Essential for zero-downtime | Essential for quorum apps | Optional (one pod per node, drain handled separately) | Not applicable (run-to-completion) |
| HPA | ✓ | ✓ | ✗ | ✗ (use KEDA ScaledJob) |
| Probes | startup/liveness/readiness all apply | Same; readiness blocks pod-N start in OrderedReady mode | startup/liveness; readiness less critical (not load-balanced) | Probes apply but rarely used; activeDeadlineSeconds preferred |
Section Roadmap
This section covers 11 additional files, each a deep dive into a specific workload topic: