▶ What This Page Covers
  • Kubernetes workload model: Pod as the atomic unit
  • All workload controller kinds and their primary use cases
  • Controller reconciliation loop pattern — desired vs observed state
  • ReplicaSet mechanics and why you never use it directly
  • Deployment, StatefulSet, DaemonSet, Job, CronJob decision matrix
  • Pod template propagation from controller to pods
  • Label selector immutability and its implications
  • Owner references and garbage collection cascade
  • Pod identity: ephemeral (Deployment) vs stable (StatefulSet)
  • Workload autoscaling landscape: HPA, VPA, KEDA, Cluster Autoscaler
  • Resource model: requests, limits, QoS classes
  • Scheduling primitives: nodeSelector, affinity, taints/tolerations, topology spread
  • Pod lifecycle phases and container states
  • Graceful termination: SIGTERM, preStop, terminationGracePeriodSeconds
  • Init containers vs sidecar containers (native 1.29+)
  • Workload identity and service accounts
  • Cross-cutting concerns: RBAC, NetworkPolicy, PodSecurity per workload kind
  • Section roadmap linking all 11 subsequent files

    The Pod: Kubernetes' Atomic Workload Unit

    Every workload in Kubernetes runs as one or more Pods. A Pod is a co-scheduled group of one or more containers sharing a network namespace (same IP, same port space) and optionally sharing storage volumes. Pods are ephemeral by design — they are created, run, and eventually terminate. Controllers give pods durability, scale, and update semantics.
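
    As a minimal sketch of the shared network namespace and shared volume described above, the manifest below runs two containers in one Pod. The Pod name, container names, images, and paths are illustrative assumptions, not taken from this page.

    ```yaml
    # Hypothetical two-container Pod; names, images, and paths are illustrative.
    apiVersion: v1
    kind: Pod
    metadata:
      name: web-with-log-tailer
    spec:
      volumes:
      - name: logs
        emptyDir: {}              # scratch volume shared by both containers
      containers:
      - name: app
        image: nginx:1.25
        ports:
        - containerPort: 80       # reachable from the other container at localhost:80
        volumeMounts:
        - name: logs
          mountPath: /var/log/nginx
      - name: log-tailer
        image: busybox:1.36
        command: ["sh", "-c", "tail -F /logs/access.log"]
        volumeMounts:
        - name: logs
          mountPath: /logs
          readOnly: true
    ```

    Both containers are scheduled onto the same node and share one IP address, so they must not bind the same port.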

    Workload hierarchy:
    ┌──────────────────────────────────────────────────────────────────────┐
    │  Controller Object                                                   │
    │  (Deployment / StatefulSet / DaemonSet / Job / CronJob)              │
    │                                │                                     │
    │                        manages ▼                                     │
    │                          ┌────────────┐                              │
    │                          │ ReplicaSet │  (Deployment only)           │
    │                          └─────┬──────┘                              │
    │                                │ owns (ownerRef)                     │
    │                ┌───────────────┼───────────────┐                     │
    │                ▼               ▼               ▼                     │
    │             ┌─────┐         ┌─────┐         ┌─────┐                  │
    │             │ Pod │         │ Pod │         │ Pod │                  │
    │             │  0  │         │  1  │         │  2  │                  │
    │             └──┬──┘         └──┬──┘         └──┬──┘                  │
    │                │               │               │                     │
    │          ┌─────┴──┐      ┌─────┴──┐      ┌─────┴──┐                  │
    │          │ app    │      │ app    │      │ app    │  containers      │
    │          │sidecar │      │sidecar │      │sidecar │  (shared netns)  │
    │          └────────┘      └────────┘      └────────┘                  │
    └──────────────────────────────────────────────────────────────────────┘
    Pod        ← atomic scheduling unit: all containers land on the same node
    Controller ← reconciles desired replica count / placement against actual

    The Controller Reconciliation Pattern

    All Kubernetes workload controllers follow the same fundamental loop: watch the actual state of the cluster, compare it to the desired state declared in the spec, and take actions to reconcile the difference. This pattern makes Kubernetes self-healing — if a pod crashes, the controller notices and creates a replacement.

    Controller reconciliation loop (triggered by watch events, plus a periodic resync):
    ┌──────────────────────────────────────────────────────────────────────────┐
    │ Controller (e.g., ReplicaSet controller in kube-controller-manager)      │
    │                                                                          │
    │ 1. LIST/WATCH: fetch all pods matching .spec.selector                    │
    │ 2. Count running pods → actualReplicas = 2                               │
    │ 3. Read desired state → spec.replicas = 3                                │
    │ 4. Diff: need 1 more pod                                                 │
    │ 5. CREATE pod from .spec.template                                        │
    │ 6. Update status.availableReplicas                                       │
    │                                                                          │
    │ Triggers: pod deleted           → under-replicated → create              │
    │           extra pod added       → over-replicated  → delete              │
    │           spec.replicas changed → scale up/down                          │
    │           node failure          → pod evicted      → create replacement  │
    └──────────────────────────────────────────────────────────────────────────┘

    The controller never talks to kubelet directly. It writes Pod objects to the API server. The scheduler picks them up and assigns spec.nodeName. Kubelet on the assigned node watches for pods with its node name and starts the containers.
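
    The gap between desired and observed state is visible on the object itself. The excerpt below is an illustrative sketch of what kubectl get deployment web -o yaml might show in the middle of a reconciliation; the Deployment name web and all values are assumptions.

    ```yaml
    # Illustrative excerpt; the name "web" and every value are assumed.
    spec:
      replicas: 3                # desired state, written by the user
    status:
      observedGeneration: 4      # last spec generation the controller has processed
      replicas: 2                # pods currently counted by the controller (observed state)
      readyReplicas: 2
      availableReplicas: 2
      unavailableReplicas: 1     # the difference the controller is working to close
    ```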

    Workload Controller Kinds

    🔵
    Deployment
    apps/v1

    Manages stateless replicated applications. Creates and manages ReplicaSets for rolling updates and rollback. The go-to controller for web servers, APIs, and microservices.

    • Stateless — any pod can serve any request
    • RollingUpdate or Recreate strategy
    • Rollback via revision history
    • HPA-compatible for autoscaling
    → Deployments deep dive
    🟢
    StatefulSet
    apps/v1

    Manages stateful applications requiring stable network identity, stable persistent storage, and ordered deployment/scaling. Each pod gets a fixed ordinal and a dedicated PVC.

    • Stable DNS: <statefulset>-0.<service>.<namespace>.svc.cluster.local
    • Ordered creation (0→N) and deletion (N→0)
    • volumeClaimTemplates for per-pod PVCs
    • Used for databases, message queues, caches
    → StatefulSets deep dive
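
    A minimal sketch of the StatefulSet properties listed above: a headless Service for per-pod DNS plus volumeClaimTemplates for per-pod PVCs. The names, the image, the port, and the storage size are hypothetical.

    ```yaml
    # Hypothetical names, image, port, and size throughout.
    apiVersion: v1
    kind: Service
    metadata:
      name: db
    spec:
      clusterIP: None            # headless: DNS resolves to individual pods
      selector:
        app: db
      ports:
      - port: 5432
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: db
    spec:
      serviceName: db            # references the headless Service above
      replicas: 3                # pods are named db-0, db-1, db-2
      selector:
        matchLabels:
          app: db
      template:
        metadata:
          labels:
            app: db
        spec:
          containers:
          - name: db
            image: registry.example.com/db:1.0
            volumeMounts:
            - name: data
              mountPath: /data
      volumeClaimTemplates:      # one PVC per pod: data-db-0, data-db-1, data-db-2
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    ```
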
    🟡
    DaemonSet
    apps/v1

    Runs exactly one pod per node (or per matching node). New nodes automatically get a pod. Used for infrastructure agents that must be co-located with every workload node.

    • Log collectors (Fluentd, Filebeat)
    • Monitoring agents (node-exporter, Datadog)
    • CNI plugins, CSI node plugins
    • Can target node subsets via nodeSelector
    → DaemonSets deep dive
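
    The sketch below shows one way a DaemonSet can target a node subset and still reach tainted nodes. The name, labels, and image tag are illustrative assumptions.

    ```yaml
    # Illustrative DaemonSet; name, labels, and image tag are assumed.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter
    spec:
      selector:
        matchLabels:
          app: node-exporter
      template:
        metadata:
          labels:
            app: node-exporter
        spec:
          nodeSelector:
            kubernetes.io/os: linux        # restrict to a subset of nodes
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule             # also run on control-plane nodes
          containers:
          - name: node-exporter
            image: prom/node-exporter:v1.8.1
            ports:
            - containerPort: 9100
    ```
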
    🟣
    Job
    batch/v1

    Runs one or more pods to completion. The controller tracks successful completions and retries on failure. Used for batch processing, database migrations, one-off scripts.

    • completions × parallelism for parallel work queues
    • backoffLimit controls retry count
    • activeDeadlineSeconds for timeout
    • Indexed Jobs for partitioned work (alpha in 1.21, GA in 1.24)
    → Jobs & CronJobs deep dive
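
    A minimal sketch of an Indexed Job combining the fields listed above. The Job name, image, and command are hypothetical; JOB_COMPLETION_INDEX is the environment variable Kubernetes sets in Indexed mode.

    ```yaml
    # Hypothetical Job; name, image, and command are illustrative.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: shard-processor
    spec:
      completionMode: Indexed
      completions: 5               # indexes 0..4, one successful pod per index
      parallelism: 2               # at most two pods run at the same time
      backoffLimit: 3              # retries before the Job is marked failed
      activeDeadlineSeconds: 600   # overall timeout for the Job
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: busybox:1.36
            command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
    ```
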
    🟠
    CronJob
    batch/v1

    Creates Jobs on a cron schedule. Manages job history (successfulJobsHistoryLimit / failedJobsHistoryLimit), controls overlapping runs via concurrencyPolicy, and bounds how late a missed run may still start via startingDeadlineSeconds.

    • Standard Unix cron syntax
    • concurrencyPolicy: Allow / Forbid / Replace
    • startingDeadlineSeconds for missed fires
    • Time zone support (spec.timeZone, 1.27 GA)
    → Jobs & CronJobs deep dive
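
    The CronJob fields above can be combined as in the following sketch. The name, schedule, image, and command are assumptions chosen for illustration.

    ```yaml
    # Hypothetical CronJob; name, schedule, image, and command are illustrative.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: nightly-report
    spec:
      schedule: "0 2 * * *"            # every day at 02:00
      timeZone: "Etc/UTC"
      concurrencyPolicy: Forbid        # skip a run if the previous Job is still active
      startingDeadlineSeconds: 300     # a missed run may still start up to 5 minutes late
      successfulJobsHistoryLimit: 3
      failedJobsHistoryLimit: 1
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
              - name: report
                image: busybox:1.36
                command: ["sh", "-c", "date; echo generating report"]
    ```
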
    🔴
    ReplicaSet
    apps/v1

    Maintains a stable set of replica pods. Rarely used directly — Deployment manages ReplicaSets on your behalf and adds rolling update and rollback capability.

    • Direct use only for specialized controllers
    • Owns pods via label selector + ownerReference
    • No update strategy — use Deployment instead

    Workload Decision Matrix

    Requirement | Deployment | StatefulSet | DaemonSet | Job | CronJob
    Stateless replicas | ✓ primary | - | ~ (1/node) | - | -
    Stable pod identity (name) | ✗ random | ✓ ordinal | ~ (node-scoped) | - | -
    Stable DNS per pod | - | ✓ headless svc | - | - | -
    Per-pod dedicated PVC | - | ✓ volumeClaimTemplates | - | - | -
    Run on every node | - | - | ✓ primary | - | -
    Run to completion | - | - | - | ✓ primary | ✓ scheduled
    Scheduled / recurring | - | - | - | - | ✓ primary
    Parallel batch processing | ~ (Deployment HPA) | - | - | ✓ indexed/work-queue | -
    Rolling update strategy | ✓ maxSurge/maxUnavailable | ✓ partition-based | ✓ maxUnavailable | - | -
    Automatic rollback | ✓ revision history | ✗ manual | ✗ manual | - | -
    HPA autoscaling | ✓ | ✓ | ✗ | ✗ (use KEDA ScaledJob) | ✗
    Access to host network/PID | ~ (with hostNetwork) | ~ | ✓ common pattern | ~ | ~

    Owner References and Garbage Collection

    Every pod created by a controller carries an ownerReferences entry pointing back to the controller (or its ReplicaSet). When the owner is deleted, the garbage collector cascades deletion to all owned objects — by default, this includes pods.

    # A Pod created via a Deployment carries one ownerReference, pointing to its ReplicaSet;
    # the ReplicaSet is in turn owned by the Deployment (Pod → ReplicaSet → Deployment):
    metadata:
      ownerReferences:
      - apiVersion: apps/v1
        kind: ReplicaSet
        name: nginx-7d9c8f8b6
        uid: abc123...
        controller: true          # only one ownerRef can be the controller
        blockOwnerDeletion: true  # in foreground deletion, the owner is not removed until this object is gone
    
    # Cascade modes when deleting a Deployment:
    # --cascade=background  → owner deleted immediately, GC cleans up pods asynchronously (kubectl default)
    # --cascade=foreground  → owner marked for deletion, GC deletes dependents first, then the owner
    # --cascade=orphan      → owner deleted, pods left running (useful for StatefulSet migration)

    Label Selector Immutability

    The .spec.selector on a Deployment, StatefulSet, or DaemonSet is immutable after creation. To change the selector, you must delete and recreate the controller (with --cascade=orphan if the pods must keep running). The API server rejects a pod template whose labels no longer satisfy the selector. If you relabel a running pod directly so that it no longer matches, the controller releases it (the pod keeps running as an orphan) and creates a replacement, so the total pod count temporarily exceeds the desired replicas.

    Pod Template Propagation

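    All controllers embed a pod template. spec.template is a PodTemplateSpec: pod metadata plus a complete PodSpec, without apiVersion, kind, or metadata.name (CronJob nests it one level deeper, under spec.jobTemplate.spec.template). The controller stamps this template onto every pod it creates, injecting a generated name and the controller's ownerReference.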

    apiVersion: apps/v1
    kind: Deployment
    spec:
      selector:
        matchLabels:
          app: nginx          # must match template labels
      template:               # ← PodTemplateSpec: everything below becomes the Pod spec
        metadata:
          labels:
            app: nginx        # must satisfy selector.matchLabels
            version: v1.2.3   # additional labels allowed
        spec:
          containers:
          - name: nginx
            image: nginx:1.25
            # ... full PodSpec here
    Template Changes Trigger Rolling Updates

    Any change to spec.template (image tag, env vars, resource limits, annotations) triggers a rolling update. Changes to spec.replicas do not. If you need to force a rolling restart without changing the template (e.g., to pick up a new ConfigMap), use kubectl rollout restart deployment/NAME — it adds a kubectl.kubernetes.io/restartedAt annotation to the template.
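
    For illustration, after kubectl rollout restart the Deployment's pod template carries an annotation like the one sketched below. The timestamp value is made up; because the annotation sits inside spec.template, the change counts as a template change and starts a rollout.

    ```yaml
    # Excerpt of a Deployment after a rollout restart; the timestamp is illustrative.
    spec:
      template:
        metadata:
          annotations:
            kubectl.kubernetes.io/restartedAt: "2024-01-01T00:00:00Z"
    ```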

    Autoscaling Landscape

    Scaler | What It Scales | Metric Sources | Reaction Time | Notes
    HPA (HorizontalPodAutoscaler) | Deployment/StatefulSet replicas | CPU, memory (metrics-server); custom (Prometheus Adapter); external | ~15–30s | GA; built-in. Cannot scale to zero.
    VPA (VerticalPodAutoscaler) | Pod CPU/memory requests | Historical usage (VPA recommender) | Requires pod restart (Off/Initial/Auto modes) | Addon; conflicts with HPA on CPU/memory. Use Off mode for recommendations only.
    KEDA (Kubernetes Event-Driven Autoscaling) | Deployment/Job replicas (incl. scale to zero) | Kafka lag, SQS depth, Prometheus, cron, 60+ scalers | ~1–5s (event-driven) | CNCF graduated. ScaledObject / ScaledJob CRDs. Wraps HPA.
    Cluster Autoscaler | Node count in node groups | Pending pods (unschedulable) | ~1–3 min (cloud API) | Works with HPA: HPA adds pods, CA adds nodes. Per-cloud provider.
    Karpenter | Node provisioning (AWS-native) | Pending pod requirements (shape, topology, GPU) | ~30–60s | AWS-native; smarter bin-packing than CA; consolidation feature.
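
    As a minimal sketch of the HPA row, the manifest below scales a Deployment on average CPU utilization. The target name web, the replica bounds, and the 70% threshold are assumptions for illustration.

    ```yaml
    # Hypothetical HPA; target name, bounds, and threshold are illustrative.
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web
      minReplicas: 2               # HPA does not scale to zero
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70 # percent of the containers' CPU requests
    ```

    Utilization is computed against CPU requests, so the target containers need requests set for this metric to work.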

    Resource Model Summary

    Every container specifies CPU and memory requests (scheduling guarantee) and limits (enforcement ceiling). The combination determines the pod's Quality of Service (QoS) class, which governs eviction priority under node pressure.

    QoS Class | Condition | Eviction Priority | OOM Kill Priority
    Guaranteed | requests == limits for ALL containers (CPU + memory) | Last evicted | Lowest OOM score (oom_score_adj -997)
    Burstable | Not Guaranteed, and at least one container sets a CPU or memory request or limit | Middle | Middle OOM score (derived from the memory request relative to node capacity)
    BestEffort | No requests or limits on ANY container | First evicted | Highest OOM score (oom_score_adj 1000)

    # Guaranteed QoS — both resources, requests == limits in every container
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
    
    # Burstable QoS — requests set, limits higher or absent
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "1"
        memory: "512Mi"
    
    # BestEffort QoS — no resources specified (dev/test only)
    resources: {}
    CPU Limits Are a Trap

    CPU limits cause CPU throttling via CFS bandwidth control — the container is paused for the remainder of its quota period even if the node has idle capacity. This inflates p99 latency invisibly. Many platform teams remove CPU limits entirely and rely on CPU requests for scheduling fairness. Memory limits remain necessary because memory is not compressible — exceeding the limit causes an OOM kill.
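
    A sketch of the pattern described above, with a CPU request but no CPU limit and a memory limit equal to the memory request. The values are illustrative assumptions.

    ```yaml
    # Illustrative values; tune them per workload.
    resources:
      requests:
        cpu: "250m"        # scheduling weight; the container can burst when the node is idle
        memory: "256Mi"
      limits:
        memory: "256Mi"    # memory is not compressible, so keep a hard ceiling
        # no cpu limit: avoids CFS throttling
    ```

    Because the CPU limit is absent, a pod configured this way falls into the Burstable QoS class rather than Guaranteed.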

    Scheduling Primitives Summary

    Mechanism | Direction | Hard/Soft | Use Case
    nodeSelector | Pod → Node | Hard | Simple label match; schedule to GPU nodes, SSD nodes
    nodeAffinity required | Pod → Node | Hard | Complex label expressions; zone restriction
    nodeAffinity preferred | Pod → Node | Soft | Prefer certain node types but don't require them
    podAffinity required | Pod → Pod | Hard | Co-locate pods (e.g., app + cache on same node)
    podAntiAffinity required | Pod → Pod | Hard | Spread replicas across nodes/zones
    podAntiAffinity preferred | Pod → Pod | Soft | Best-effort spread without blocking scheduling
    taints on Node | Node → Pod | Hard/Soft | Reserve nodes for specific workloads (NoSchedule/PreferNoSchedule)
    tolerations on Pod | Pod → Node | Override taint | Allow pods to run on tainted nodes (GPU, spot, control-plane)
    TopologySpreadConstraints | Pod → Topology | Hard/Soft | Even distribution across zones/nodes; preferred over anti-affinity
    priorityClass | Pod → Scheduler | Preemption | Critical pods preempt lower-priority pods; system-cluster-critical for infra
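
    Several of these primitives can appear in one pod spec. The fragment below is a sketch only: the zone names, the dedicated=batch taint, the app: web label, and the high-priority PriorityClass are all hypothetical.

    ```yaml
    # Pod spec fragment; zones, taint, labels, and PriorityClass name are hypothetical.
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:   # hard zone restriction
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b"]
      tolerations:
      - key: dedicated
        operator: Equal
        value: batch
        effect: NoSchedule                # allows scheduling onto nodes tainted dedicated=batch
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway # soft; DoNotSchedule makes it hard
        labelSelector:
          matchLabels:
            app: web
      priorityClassName: high-priority    # must exist as a PriorityClass object
    ```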

    Pod Lifecycle Summary

    Pod lifecycle phases:
    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
    │ Pending  │───►│ Running  │───►│Succeeded │    │  Failed  │    │ Unknown  │
    └──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
     Scheduled       All containers   Pod in terminal state
     to node         running/starting (exit 0 all containers)

    Container states:
    ┌──────────┐    ┌──────────┐    ┌──────────┐
    │ Waiting  │    │ Running  │    │Terminated│
    └──────────┘    └──────────┘    └──────────┘
     (init, image    (started,       (exit code,
      pull, crash)    probes)         reason, age)

    Conditions (all must be True for Running → Ready):
    ┌──────────────────────┬────────────────────────────────────────┐
    │ PodScheduled         │ scheduler found a node                 │
    │ Initialized          │ all init containers completed          │
    │ ContainersReady      │ all containers passed readiness probe  │
    │ Ready                │ pod can receive traffic (Endpoints)    │
    └──────────────────────┴────────────────────────────────────────┘

    Graceful Termination Sequence

    # Termination flow when pod is deleted:
    # 1. Pod marked Terminating and the grace-period timer starts; removal from
    #    Service Endpoints begins in parallel and propagates asynchronously
    # 2. preStop hook executed (if defined); blocks until the hook exits or the grace period runs out
    # 3. SIGTERM sent to PID 1 of each container
    # 4. Wait for the rest of terminationGracePeriodSeconds (default: 30s, counted from step 1)
    # 5. SIGKILL sent if still running after grace period
    
    spec:
      terminationGracePeriodSeconds: 60   # override default 30s
      containers:
      - name: app
        lifecycle:
          preStop:                        # only one handler type is allowed per hook
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]   # drain in-flight requests
            # OR, instead of exec:
            # httpGet:
            #   path: /drain
            #   port: 8080
    preStop Sleep Pattern for Zero-Downtime Deploys

    Between pod deletion and kube-proxy/CoreDNS propagating the Endpoints removal, there is a race: new requests can still be routed to a terminating pod. A preStop: sleep 5 introduces a delay before SIGTERM, giving load balancers time to drain. This is a common production pattern. Set terminationGracePeriodSeconds to at least preStop duration + app shutdown time + 10s buffer.
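
    The sizing rule above can be worked through as a sketch. Assume a 5s preStop sleep and an application that needs up to 30s to shut down (the 30s figure and the image name are assumptions): 5s + 30s + 10s buffer = 45s.

    ```yaml
    # Worked example; the 30s shutdown time and the image are assumed.
    spec:
      terminationGracePeriodSeconds: 45   # 5s preStop + 30s app shutdown + 10s buffer
      containers:
      - name: app
        image: registry.example.com/app:1.0
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]   # delay SIGTERM while endpoints drain
    ```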

    Init Containers and Sidecar Containers

    Type | Defined In | Run When | Must Complete? | Use Case
    Init container | spec.initContainers | Before main containers; sequential | Yes (exit 0); on failure the init container is retried according to the pod's restartPolicy | Wait for DB, git clone, certificate generation, schema migration
    Native sidecar (1.29+) | spec.initContainers with restartPolicy: Always | Started in init phase, runs for pod lifetime | No; runs alongside main containers | Log shippers, service mesh proxies, secret injectors that must start before main app
    Regular sidecar (legacy) | spec.containers | Parallel with main containers | No, but blocks pod Ready | Envoy proxy, Vault agent, metrics exporter
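
    The difference between a classic init container and a native sidecar comes down to one field, as the sketch below shows. The container names, images, and the db hostname are hypothetical.

    ```yaml
    # Pod spec fragment; names, images, and the "db" hostname are hypothetical.
    spec:
      initContainers:
      - name: wait-for-db                 # classic init container: must exit 0 first
        image: busybox:1.36
        command: ["sh", "-c", "until nslookup db; do echo waiting for db; sleep 2; done"]
      - name: log-shipper                 # native sidecar: starts here, keeps running
        image: registry.example.com/log-shipper:1.0
        restartPolicy: Always             # this field turns an init container into a sidecar
      containers:
      - name: app
        image: registry.example.com/app:1.0
    ```

    Entries in initContainers start in order, so the sidecar is up before the app container and does not prevent the pod from completing once the main containers exit.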

    Workload Identity and Service Accounts

    Every pod runs under a ServiceAccount. The mounted service account token (projected volume at /var/run/secrets/kubernetes.io/serviceaccount/token) allows the pod to authenticate to the Kubernetes API server. With IRSA (AWS), Workload Identity (GCP/Azure), or SPIFFE/SPIRE, a projected service account token (usually one issued with a provider-specific audience) is exchanged for cloud credentials without long-lived secrets.

    spec:
      serviceAccountName: my-app-sa      # default: "default" SA in namespace
      automountServiceAccountToken: false # disable if pod doesn't need API access
    
    # Best practice: create a dedicated SA per workload, not share "default"
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: my-app-sa
      namespace: production
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/my-app-role  # IRSA

    Cross-Cutting Concerns per Workload Kind

    Concern | Deployment | StatefulSet | DaemonSet | Job
    RBAC | (all kinds) ServiceAccount bound to Role/ClusterRole; same pattern for all kinds
    NetworkPolicy | Pod label selector | Pod label selector (ordinal pods all share same labels) | Pod label selector | Pod label selector (job-name label auto-applied)
    PodSecurity | (all kinds) Enforced at namespace level via pod-security.kubernetes.io/enforce label; applies to all pod-creating objects in namespace
    PodDisruptionBudget | Essential for zero-downtime | Essential for quorum apps | Optional (one pod per node, drain handled separately) | Not applicable (run-to-completion)
    HPA | ✓ | ✓ | ✗ | ✗ (use KEDA ScaledJob)
    Probes | startup/liveness/readiness all apply | Same; readiness blocks the start of the next ordinal in OrderedReady mode | startup/liveness; readiness less critical (not load-balanced) | Probes apply but rarely used; activeDeadlineSeconds preferred
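
    Two of the concerns in the table can be sketched as manifests: namespace-level Pod Security enforcement and a PodDisruptionBudget. The namespace name, the restricted level, the app: web label, and minAvailable: 2 are illustrative assumptions.

    ```yaml
    # Hypothetical namespace, labels, and thresholds.
    apiVersion: v1
    kind: Namespace
    metadata:
      name: production
      labels:
        pod-security.kubernetes.io/enforce: restricted   # applies to every pod in the namespace
    ---
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: web-pdb
      namespace: production
    spec:
      minAvailable: 2              # voluntary disruptions may not drop below two ready pods
      selector:
        matchLabels:
          app: web
    ```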

    Section Roadmap

    This section covers 11 additional files, each a deep dive into a specific workload topic: