Workloads Overview — Kubernetes Docs

What This Page Covers

Kubernetes workload model: Pod as the atomic unit

All workload controller kinds and their primary use cases

Controller reconciliation loop pattern — desired vs observed state

ReplicaSet mechanics and why you never use it directly

Deployment, StatefulSet, DaemonSet, Job, CronJob decision matrix

Pod template propagation from controller to pods

Label selector immutability and its implications

Owner references and garbage collection cascade

Pod identity: ephemeral (Deployment) vs stable (StatefulSet)

Workload autoscaling landscape: HPA, VPA, KEDA, Cluster Autoscaler

Resource model: requests, limits, QoS classes

Scheduling primitives: nodeSelector, affinity, taints/tolerations, topology spread

Pod lifecycle phases and container states

Graceful termination: SIGTERM, preStop, terminationGracePeriodSeconds

Init containers vs sidecar containers (native 1.29+)

Workload identity and service accounts

Cross-cutting concerns: RBAC, NetworkPolicy, PodSecurity per workload kind

Section roadmap linking all 11 subsequent files

The Pod: Kubernetes' Atomic Workload Unit

Every workload in Kubernetes runs as one or more Pods. A Pod is a co-scheduled group of one or more containers sharing a network namespace (same IP, same port space) and optionally sharing storage volumes. Pods are ephemeral by design — they are created, run, and eventually terminate. Controllers give pods durability, scale, and update semantics.
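As a minimal sketch of the shared-namespace idea, the manifest below puts two containers in one Pod with a shared emptyDir volume. The Pod name, the log path, and the choice of busybox as a helper image are illustrative assumptions, not something this page prescribes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-log-tailer        # hypothetical name
spec:
  volumes:
  - name: logs
    emptyDir: {}                   # scratch volume that lives as long as the Pod
  containers:
  - name: app
    image: nginx:1.25
    ports:
    - containerPort: 80
    volumeMounts:
    - name: logs
      mountPath: /var/log/nginx    # nginx writes its log files here
  - name: log-tailer
    image: busybox:1.36
    # Reads the file written by the other container through the shared volume.
    command: ["/bin/sh", "-c", "tail -F /logs/access.log"]
    volumeMounts:
    - name: logs
      mountPath: /logs
      readOnly: true
```

Both containers are scheduled onto the same node and share one IP address, so the helper could also reach the web server on localhost:80.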

Workload hierarchy:

┌──────────────────────────────────────────────────────────────────────┐
│                        Controller Object                             │
│      (Deployment / StatefulSet / DaemonSet / Job / CronJob)          │
│                                │                                     │
│                                │ manages                             │
│                                ▼                                     │
│                          ┌────────────┐                              │
│                          │ ReplicaSet │  (Deployment only)           │
│                          └─────┬──────┘                              │
│                                │ owns (ownerRef)                     │
│                ┌───────────────┼───────────────┐                     │
│                ▼               ▼               ▼                     │
│             ┌─────┐         ┌─────┐         ┌─────┐                  │
│             │ Pod │         │ Pod │         │ Pod │                  │
│             │  0  │         │  1  │         │  2  │                  │
│             └──┬──┘         └──┬──┘         └──┬──┘                  │
│                │               │               │                     │
│          ┌─────┴──┐      ┌─────┴──┐      ┌─────┴──┐                  │
│          │ app    │      │ app    │      │ app    │  containers      │
│          │sidecar │      │sidecar │      │sidecar │  (shared netns)  │
│          └────────┘      └────────┘      └────────┘                  │
└──────────────────────────────────────────────────────────────────────┘

Pod        ← atomic scheduling unit: all containers land on the same node
Controller ← reconciles desired replica count / placement against actual

The Controller Reconciliation Pattern

All Kubernetes workload controllers follow the same fundamental loop: watch the actual state of the cluster, compare it to the desired state declared in the spec, and take actions to reconcile the difference. This pattern makes Kubernetes self-healing — if a pod crashes, the controller notices and creates a replacement.

Controller reconciliation loop (triggered by watch events, plus a periodic resync):

┌──────────────────────────────────────────────────────────────────────┐
│ Controller (e.g., ReplicaSet controller in kube-controller-manager)  │
│                                                                      │
│ 1. LIST/WATCH: fetch all pods matching .spec.selector                │
│ 2. Count running pods → actualReplicas = 2                           │
│ 3. Read desired state → spec.replicas = 3                            │
│ 4. Diff: need 1 more pod                                             │
│ 5. CREATE pod from .spec.template                                    │
│ 6. Update status.availableReplicas                                   │
│                                                                      │
│ Triggers: pod deleted           → under-replicated → create          │
│           pod added extra       → over-replicated  → delete          │
│           spec.replicas changed → scale up/down                      │
│           node failure          → pod evicted → create replacement   │
└──────────────────────────────────────────────────────────────────────┘

For a Deployment the loop is layered: the Deployment controller reconciles ReplicaSets, and the ReplicaSet controller reconciles pods as shown above.

The controller never talks to kubelet directly. It writes Pod objects to the API server. The scheduler picks them up and assigns spec.nodeName. Kubelet on the assigned node watches for pods with its node name and starts the containers.
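To make "desired vs observed" concrete, here is an illustrative excerpt of what a Deployment object might look like partway through reconciliation, roughly as `kubectl get deployment nginx -o yaml` would print it. The object name and all numbers are hypothetical; only the field names come from the apps/v1 API.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx              # hypothetical object
  generation: 4            # bumped by the API server on every spec change
spec:
  replicas: 3              # desired state, written by the user
status:
  observedGeneration: 4    # the controller has seen the latest spec
  replicas: 2              # observed state, written by the controller
  updatedReplicas: 2
  readyReplicas: 2
  availableReplicas: 2
  unavailableReplicas: 1   # the gap the controller is still closing
```

When status catches up with spec (three available replicas, no unavailable ones), the loop has nothing left to do until the next change.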

Workload Controller Kinds

Deployment (apps/v1)

Manages stateless replicated applications. Creates and manages ReplicaSets for rolling updates and rollback. The go-to controller for web servers, APIs, and microservices.

Stateless — any pod can serve any request
RollingUpdate or Recreate strategy
Rollback via revision history
HPA-compatible for autoscaling

→ Deployments deep dive

StatefulSet (apps/v1)

Manages stateful applications requiring stable network identity, stable persistent storage, and ordered deployment/scaling. Each pod gets a fixed ordinal and a dedicated PVC.

Stable DNS: <pod-name>.<headless-service>.<namespace>.svc.cluster.local (e.g., web-0.nginx.default.svc.cluster.local)
Ordered creation (0→N) and deletion (N→0)
volumeClaimTemplates for per-pod PVCs
Used for databases, message queues, caches

→ StatefulSets deep dive
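A minimal sketch of the pieces listed above: a headless Service plus a StatefulSet with a volumeClaimTemplate. The names (`web`, `nginx`, `www`) and the 1Gi size are placeholders chosen for illustration, and the cluster is assumed to have a default StorageClass.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  clusterIP: None            # headless: gives each pod its own DNS record
  selector:
    app: nginx
  ports:
  - name: web
    port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx         # must reference the headless Service
  replicas: 3                # pods are named web-0, web-1, web-2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - name: web
          containerPort: 80
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:      # one PVC per pod: www-web-0, www-web-1, www-web-2
  - metadata:
      name: www
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

If web-1 is rescheduled to another node, it keeps its name, its DNS record, and its claim www-web-1.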

DaemonSet (apps/v1)

Runs exactly one pod per node (or per matching node). New nodes automatically get a pod. Used for infrastructure agents that must be co-located with every workload node.

Log collectors (Fluentd, Filebeat)
Monitoring agents (node-exporter, Datadog)
CNI plugins, CSI node plugins
Can target node subsets via nodeSelector

→ DaemonSets deep dive
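The following is a sketch of a node-agent DaemonSet, assuming a `monitoring` namespace exists and that the image tag shown is available in your registry. It combines a nodeSelector with a toleration so the agent also lands on control-plane nodes.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring              # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: node-exporter
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1              # replace one node's pod at a time
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      nodeSelector:
        kubernetes.io/os: linux      # target a subset of nodes
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule           # also run on tainted control-plane nodes
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.8.1   # tag is an assumption
        ports:
        - containerPort: 9100
```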

Job (batch/v1)

Runs one or more pods to completion. The controller tracks successful completions and retries on failure. Used for batch processing, database migrations, one-off scripts.

completions × parallelism for parallel work queues
backoffLimit controls retry count
activeDeadlineSeconds for timeout
Indexed Jobs for partitioned work (alpha in 1.21, GA in 1.24)

→ Jobs & CronJobs deep dive
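A small sketch of an Indexed Job that uses the fields listed above. The Job name, shard count, and the busybox command are illustrative assumptions; in Indexed mode each pod receives its index in the JOB_COMPLETION_INDEX environment variable.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: partitioned-work            # hypothetical name
spec:
  completionMode: Indexed           # each completion gets a fixed index 0..5
  completions: 6                    # total successful pods required
  parallelism: 3                    # at most 3 pods run at once
  backoffLimit: 4                   # give up after 4 failed retries
  activeDeadlineSeconds: 600        # hard timeout for the whole Job
  template:
    spec:
      restartPolicy: Never          # Jobs require Never or OnFailure
      containers:
      - name: worker
        image: busybox:1.36
        command: ["/bin/sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
```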

CronJob (batch/v1)

Creates Jobs on a cron schedule. Manages job history (successfulJobsHistoryLimit / failedJobsHistoryLimit), controls overlapping runs via concurrencyPolicy, and bounds how late a missed run may still start via startingDeadlineSeconds.

Standard Unix cron syntax
concurrencyPolicy: Allow / Forbid / Replace
startingDeadlineSeconds for missed fires
Time zone support (spec.timeZone, 1.27 GA)

→ Jobs & CronJobs deep dive
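As a sketch of how these settings fit together, the CronJob below runs once a night in a named time zone. The name, schedule, time zone, and command are placeholders picked for illustration.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report              # hypothetical name
spec:
  schedule: "30 2 * * *"            # 02:30 every day
  timeZone: "Europe/Berlin"         # interpreted in this zone instead of the controller's
  concurrencyPolicy: Forbid         # skip a run if the previous one is still active
  startingDeadlineSeconds: 300      # a missed run may start at most 5 minutes late
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report
            image: busybox:1.36
            command: ["/bin/sh", "-c", "date; echo generating report"]
```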

ReplicaSet (apps/v1)

Maintains a stable set of replica pods. Rarely used directly — Deployment manages ReplicaSets on your behalf and adds rolling update and rollback capability.

Direct use only for specialized controllers
Owns pods via label selector + ownerReference
No update strategy — use Deployment instead

Workload Decision Matrix

Requirement	Deployment	StatefulSet	DaemonSet	Job	CronJob
Stateless replicas	✓ primary	✗	~ (1/node)	✗	✗
Stable pod identity (name)	✗ random	✓ ordinal	~ (node-scoped)	✗	✗
Stable DNS per pod	✗	✓ headless svc	✗	✗	✗
Per-pod dedicated PVC	✗	✓ volumeClaimTemplates	✗	✗	✗
Run on every node	✗	✗	✓ primary	✗	✗
Run to completion	✗	✗	✗	✓ primary	✓ scheduled
Scheduled / recurring	✗	✗	✗	✗	✓ primary
Parallel batch processing	~ (Deployment HPA)	✗	✗	✓ indexed/work-queue	✗
Rolling update strategy	✓ maxSurge/maxUnavailable	✓ partition-based	✓ maxUnavailable	✗	✗
Rollback (manual, kubectl rollout undo)	✓ ReplicaSet revision history	✓ ControllerRevision history	✓ ControllerRevision history	✗	✗
HPA autoscaling	✓	✓	✗	✗	✗
Access to host network/PID	~ (with hostNetwork)	~	✓ common pattern	~	~

Owner References and Garbage Collection

Every pod created by a controller carries an ownerReferences entry pointing back to the controller (or its ReplicaSet). When the owner is deleted, the garbage collector cascades deletion to all owned objects — by default, this includes pods.

# A Pod created via a Deployment carries one ownerReference, pointing at its ReplicaSet.
# The ReplicaSet is in turn owned by the Deployment (Pod → ReplicaSet → Deployment).
metadata:
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: nginx-7d9c8f8b6
    uid: abc123...
    controller: true          # at most one ownerRef may be the managing controller
    blockOwnerDeletion: true  # in foreground deletion, the owner is not removed until this object is gone

# Cascade modes when deleting a Deployment (kubectl delete --cascade=...):
# --cascade=background  → owner deleted immediately, GC cleans up dependents asynchronously (kubectl default)
# --cascade=foreground  → owner marked for deletion, GC deletes dependents first, then the owner
# --cascade=orphan      → owner deleted, dependents left running (useful for StatefulSet migration)
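For completeness, this is roughly what the next link in the ownership chain might look like: the ReplicaSet's own metadata pointing at its Deployment. The Deployment name `nginx` and the uid are hypothetical placeholders that continue the example above.

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx-7d9c8f8b6
  ownerReferences:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx                # hypothetical Deployment name
    uid: def456...             # placeholder, not a real UID
    controller: true
    blockOwnerDeletion: true
```

Deleting the Deployment therefore removes the ReplicaSet, and removing the ReplicaSet removes its pods, unless the orphan mode is used.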

Label Selector Immutability

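The .spec.selector on a Deployment, StatefulSet, or DaemonSet (apps/v1) is immutable after creation. To change the selector labels, you must delete and recreate the controller. The pod template labels must always satisfy the selector: the API server rejects a template whose labels no longer match it, while adding extra template labels that still match simply triggers a normal rollout. Orphaning happens when you edit the labels of a running pod directly (for example with kubectl label) so that it stops matching: the controller releases that pod and creates a replacement, which leaves one more pod running than the desired replica count until the orphan is cleaned up.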

Pod Template Propagation

Every workload controller embeds a spec.template, a PodTemplateSpec made of pod metadata plus a complete PodSpec (without apiVersion, kind, or metadata.name); a CronJob nests it one level deeper under spec.jobTemplate. The controller stamps this template onto every pod it creates, injecting a generated name and the controller's ownerReference.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx             # also the prefix of generated ReplicaSet and Pod names
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx          # must match template labels
  template:               # ← PodTemplateSpec: metadata and spec below are stamped onto every Pod
    metadata:
      labels:
        app: nginx        # must satisfy selector.matchLabels
        version: v1.2.3   # additional labels allowed
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        # ... full PodSpec here

Template Changes Trigger Rolling Updates

Any change to spec.template (image tag, env vars, resource limits, annotations) triggers a rolling update. Changes to spec.replicas do not. If you need to force a rolling restart without changing the template (e.g., to pick up a new ConfigMap), use kubectl rollout restart deployment/NAME — it adds a kubectl.kubernetes.io/restartedAt annotation to the template.
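To illustrate why a restart counts as a template change, this is approximately the fragment that `kubectl rollout restart` leaves in the Deployment. The timestamp is a made-up example value.

```yaml
spec:
  template:
    metadata:
      annotations:
        # Written by kubectl; a new value changes the template hash and starts a rollout.
        kubectl.kubernetes.io/restartedAt: "2024-05-01T12:00:00Z"   # example timestamp
```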

Autoscaling Landscape

Scaler	What It Scales	Metric Sources	Reaction Time	Notes
HPA HorizontalPodAutoscaler	Deployment/StatefulSet replicas	CPU, memory (metrics-server); custom (Prometheus Adapter); external	~15–30s	GA; built-in. Cannot scale to zero.
VPA VerticalPodAutoscaler	Pod CPU/memory requests	Historical usage (VPA recommender)	Requires pod restart (Off/Initial/Auto modes)	Addon; conflicts with HPA on CPU/memory. Use Off mode for recommendations only.
KEDA Kubernetes Event-Driven Autoscaling	Deployment/Job replicas (incl. scale to zero)	Kafka lag, SQS depth, Prometheus, cron, 60+ scalers	Scaler polling interval (default 30s) for 0↔1 activation; HPA cadence above 1 replica	CNCF graduated. ScaledObject / ScaledJob CRDs. Wraps HPA.
Cluster Autoscaler	Node count in node groups	Pending pods (unschedulable)	~1–3 min (cloud API)	Works with HPA — HPA adds pods, CA adds nodes. Per-cloud provider.
Karpenter	Node provisioning (AWS-native)	Pending pod requirements (shape, topology, GPU)	~30–60s	AWS-native; smarter bin-packing than CA; consolidation feature.
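As a reference point for the HPA row, here is a minimal autoscaling/v2 sketch that targets a Deployment named `nginx`. The name, replica bounds, and the 70% target are assumptions for illustration, and CPU utilization is only meaningful if the target pods set CPU requests.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx                     # hypothetical target
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70      # percent of the pods' CPU requests
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before shrinking
```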

Resource Model Summary

Every container specifies CPU and memory requests (scheduling guarantee) and limits (enforcement ceiling). The combination determines the pod's Quality of Service (QoS) class, which governs eviction priority under node pressure.

QoS Class	Condition	Eviction Priority	OOM Kill Priority
Guaranteed	Every container sets CPU and memory requests equal to limits	Last evicted	Lowest oom_score_adj (-997)
Burstable	Not Guaranteed, but at least one container sets a CPU or memory request or limit	Middle	oom_score_adj between 2 and 999, scaled by the memory request relative to node memory (a larger request gives a lower score)
BestEffort	No requests or limits on ANY container	First evicted	Highest oom_score_adj (1000)

# Guaranteed QoS — both resources, requests == limits in every container
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Burstable QoS — requests set, limits higher or absent
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "1"
    memory: "512Mi"

# BestEffort QoS — no resources specified (dev/test only)
resources: {}

CPU Limits Are a Trap

CPU limits cause CPU throttling via CFS bandwidth control — the container is paused for the remainder of its quota period even if the node has idle capacity. This inflates p99 latency invisibly. Many platform teams remove CPU limits entirely and rely on CPU requests for scheduling fairness. Memory limits remain necessary because memory is not compressible — exceeding the limit causes an OOM kill.
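A sketch of the pattern described above, with illustrative values: keep the CPU request, drop the CPU limit, and pin the memory limit to the memory request. Note that a pod configured this way is classed as Burstable, not Guaranteed, because Guaranteed requires CPU limits as well.

```yaml
resources:
  requests:
    cpu: "250m"        # scheduling weight and fair share under contention
    memory: "512Mi"
  limits:
    memory: "512Mi"    # memory is not compressible, so keep a hard ceiling
    # no cpu limit: the container may use idle CPU without CFS throttling
```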

Scheduling Primitives Summary

Mechanism	Direction	Hard/Soft	Use Case
`nodeSelector`	Pod → Node	Hard	Simple label match; schedule to GPU nodes, SSD nodes
`nodeAffinity` required	Pod → Node	Hard	Complex label expressions; zone restriction
`nodeAffinity` preferred	Pod → Node	Soft	Prefer certain node types but don't require them
`podAffinity` required	Pod → Pod	Hard	Co-locate pods (e.g., app + cache on same node)
`podAntiAffinity` required	Pod → Pod	Hard	Spread replicas across nodes/zones
`podAntiAffinity` preferred	Pod → Pod	Soft	Best-effort spread without blocking scheduling
`taints` on Node	Node → Pod	Hard/Soft	Reserve nodes for specific workloads (NoSchedule and NoExecute are hard; PreferNoSchedule is soft)
`tolerations` on Pod	Pod → Node	Override taint	Allow pods to run on tainted nodes (GPU, spot, control-plane)
`topologySpreadConstraints`	Pod → Topology	Hard/Soft	Even distribution across zones/nodes (whenUnsatisfiable: DoNotSchedule or ScheduleAnyway); often preferred over anti-affinity
`priorityClassName`	Pod → Scheduler	Preemption	Higher-priority pods can preempt lower-priority pods; system-cluster-critical for infra
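The sketch below combines three of the mechanisms from the table in one Pod spec. The taint key `example.com/dedicated`, its value, and the `app: web` label are hypothetical; the topology keys are the well-known node labels.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-example              # hypothetical name
  labels:
    app: web
spec:
  nodeSelector:
    kubernetes.io/os: linux         # hard requirement on a node label
  tolerations:
  - key: example.com/dedicated      # hypothetical taint key
    operator: Equal
    value: web
    effect: NoSchedule              # allows, but does not force, tainted nodes
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule     # hard: keep zones within 1 pod of each other
    labelSelector:
      matchLabels:
        app: web
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway    # soft: prefer spreading across nodes
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: web
    image: nginx:1.25
```

In practice these fields would sit in a controller's pod template so that every replica carries the same constraints.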

Pod Lifecycle Summary

Pod lifecycle phases:

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Pending  │───►│ Running  │───►│Succeeded │    │  Failed  │    │ Unknown  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘

Pending → Running:  scheduled to a node
Running:            all containers running/starting
Succeeded:          pod in terminal state (all containers exited 0)

Container states:

┌──────────┐    ┌──────────┐    ┌──────────┐
│ Waiting  │    │ Running  │    │Terminated│
└──────────┘    └──────────┘    └──────────┘

Waiting:     init, image pull, crash
Running:     started, probes
Terminated:  exit code, reason, age

Conditions (all must be True for Running → Ready):

┌──────────────────────┬────────────────────────────────────────┐
│ PodScheduled         │ scheduler found a node                 │
│ Initialized          │ all init containers completed          │
│ ContainersReady      │ all containers passed readiness probe  │
│ Ready                │ pod can receive traffic (Endpoints)    │
└──────────────────────┴────────────────────────────────────────┘

Graceful Termination Sequence

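# Termination flow when a pod is deleted:
# 1. Pod marked Terminating (deletionTimestamp set); the grace-period countdown starts.
#    Removal from Service endpoints begins in parallel and propagates asynchronously.
# 2. preStop hook executed (if defined); SIGTERM is held back until the hook exits or the grace period runs out
# 3. SIGTERM sent to PID 1 of each container
# 4. Containers get whatever remains of terminationGracePeriodSeconds (default: 30s, preStop time included)
# 5. SIGKILL sent to anything still running when the grace period expires

spec:
  terminationGracePeriodSeconds: 60   # override default 30s
  containers:
  - name: app
    lifecycle:
      preStop:                        # exactly one handler type is allowed per hook
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]   # give endpoint removal time to propagate
        # Alternative handler (use instead of exec, not alongside it):
        # httpGet:
        #   path: /drain
        #   port: 8080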

preStop Sleep Pattern for Zero-Downtime Deploys

Between pod deletion and kube-proxy/CoreDNS propagating the Endpoints removal, there is a race: new requests can still be routed to a terminating pod. A preStop: sleep 5 introduces a delay before SIGTERM, giving load balancers time to drain. This is a common production pattern. Set terminationGracePeriodSeconds to at least preStop duration + app shutdown time + 10s buffer.

Init Containers and Sidecar Containers

Type	Defined In	Run When	Must Complete?	Use Case
Init container	`spec.initContainers`	Before main containers; sequential	Yes (exit 0); on failure the kubelet retries it per the pod restartPolicy (with restartPolicy: Never the pod fails)	Wait for DB, git clone, certificate generation, schema migration
Native sidecar (beta and on by default in 1.29, GA in 1.33)	`spec.initContainers` with `restartPolicy: Always`	Started in init phase, runs for pod lifetime	No; runs alongside main containers	Log shippers, service mesh proxies, secret injectors that must start before main app
Regular sidecar (legacy)	`spec.containers`	Parallel with main containers	No; but its readiness counts toward pod Ready	Envoy proxy, Vault agent, metrics exporter
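To show the first two rows side by side, here is a sketch of a Pod with one classic init container and one native sidecar. It assumes a cluster version where the SidecarContainers feature is enabled; the Service name `db`, the log path, and the busybox commands are illustrative only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-native-sidecar     # hypothetical name
spec:
  volumes:
  - name: logs
    emptyDir: {}
  initContainers:
  - name: wait-for-db               # classic init container: must exit 0 first
    image: busybox:1.36
    command: ["/bin/sh", "-c", "until nslookup db; do echo waiting for db; sleep 2; done"]
  - name: log-shipper               # native sidecar: starts here, keeps running
    image: busybox:1.36
    restartPolicy: Always           # this field is what marks it as a sidecar
    command: ["/bin/sh", "-c", "tail -F /logs/app.log"]
    volumeMounts:
    - name: logs
      mountPath: /logs
  containers:
  - name: app
    image: busybox:1.36
    command: ["/bin/sh", "-c", "while true; do date >> /logs/app.log; sleep 5; done"]
    volumeMounts:
    - name: logs
      mountPath: /logs
```

The sidecar is started before the main container and is stopped after it, which is the ordering a legacy sidecar in spec.containers cannot guarantee.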

Workload Identity and Service Accounts

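Every pod runs under a ServiceAccount. The mounted service account token (projected volume at /var/run/secrets/kubernetes.io/serviceaccount/token) allows the pod to authenticate to the Kubernetes API server. With IRSA (AWS) or Workload Identity (GCP/Azure), a projected service account token, typically issued for a provider-specific audience, is exchanged for short-lived cloud credentials, so no long-lived secrets need to be stored in the cluster. SPIFFE/SPIRE follows a similar pattern and issues workload identities (SVIDs) after attesting the pod.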

spec:
  serviceAccountName: my-app-sa      # default: "default" SA in namespace
  automountServiceAccountToken: false # disable if pod doesn't need API access

# Best practice: create a dedicated SA per workload, not share "default"
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-app-role  # IRSA (placeholder 12-digit account ID)
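Token exchange setups generally rely on a projected service account token with a specific audience. The fragment below is a generic sketch of such a volume; the audience `sts.example.com`, the mount path, and the file name are hypothetical, and managed offerings such as IRSA usually inject an equivalent volume for you.

```yaml
spec:
  serviceAccountName: my-app-sa
  containers:
  - name: app
    image: nginx:1.25
    volumeMounts:
    - name: cloud-token
      mountPath: /var/run/secrets/tokens
      readOnly: true
  volumes:
  - name: cloud-token
    projected:
      sources:
      - serviceAccountToken:
          path: cloud-token            # file name inside the mount
          audience: sts.example.com    # hypothetical audience expected by the token exchange
          expirationSeconds: 3600      # kubelet rotates the token before it expires
```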

Cross-Cutting Concerns per Workload Kind

Concern	Deployment	StatefulSet	DaemonSet	Job
RBAC	ServiceAccount bound to Role/ClusterRole — same pattern for all kinds
NetworkPolicy	Pod label selector	Pod label selector (ordinal pods all share same labels)	Pod label selector	Pod label selector (job-name label auto-applied)
PodSecurity	Enforced at namespace level via `pod-security.kubernetes.io/enforce` label; applies to all pod-creating objects in namespace
PodDisruptionBudget	Essential for zero-downtime	Essential for quorum apps	Optional (one pod per node, drain handled separately)	Not applicable (run-to-completion)
HPA	✓	✓	✗	✗ (use KEDA ScaledJob)
Probes	startup/liveness/readiness all apply	Same; readiness blocks pod-N start in OrderedReady mode	startup/liveness; readiness less critical (not load-balanced)	Probes apply but rarely used; activeDeadlineSeconds preferred
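Two of the rows above translate into very small objects. The sketch below shows a namespace-level Pod Security label and a PodDisruptionBudget for a Deployment; the namespace `production`, the `app: nginx` label, and the `minAvailable` value are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Applies to every pod created in the namespace, whatever controller made it.
    pod-security.kubernetes.io/enforce: restricted
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
  namespace: production
spec:
  minAvailable: 2                # voluntary evictions may not drop below 2 ready pods
  selector:
    matchLabels:
      app: nginx                 # same labels the workload's pods carry
```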

Section Roadmap

This section covers 11 additional files, each a deep dive into a specific workload topic:

5.1 Pods: Full PodSpec, probes, security context, multi-container patterns
5.2 Deployments: RollingUpdate strategy, maxSurge/maxUnavailable, rollback, revision history
5.3 StatefulSets: Ordered lifecycle, volumeClaimTemplates, headless service, update strategies
5.4 DaemonSets: Node targeting, RollingUpdate, host namespaces, tolerations for control plane
5.5 Jobs & CronJobs: completions, parallelism, indexed jobs, backoffLimit, TTL, cron scheduling
5.6 HPA: v2 API, custom metrics, external metrics, stabilization window, KEDA
5.7 VPA: Recommender/Updater/Admission Controller, modes, HPA conflict, Goldilocks
5.8 PodDisruptionBudgets: minAvailable, maxUnavailable, unhealthyPodEvictionPolicy, node drain interaction
5.9 Resource Management: Requests vs limits, QoS, LimitRange, ResourceQuota, eBPF overhead, right-sizing
5.10 Scheduling: Scheduler framework, affinity, taints/tolerations, topology spread, priority/preemption
5.11 Pod Lifecycle: Phases, conditions, probes, init containers, graceful termination, restart policies
