Jobs & CronJobs
📋 Page Coverage Checklist
  • Job spec anatomy: completions, parallelism, backoffLimit, activeDeadlineSeconds, TTL
  • Job completion modes: NonIndexed vs Indexed with completion index env var
  • Pod failure policy (1.25+): onExitCodes / onPodConditions actions
  • backoffLimitPerIndex (1.28+): per-index retry isolation
  • Job pod disruption with podReplacementPolicy and terminationGracePeriodSeconds
  • Suspend/resume Jobs; external controller patterns
  • Work queue pattern, static work assignment, indexed scatter-gather
  • Sidecar termination problem and native sidecar fix (1.28+)
  • CronJob spec: schedule, timeZone (GA 1.27), concurrencyPolicy, startingDeadlineSeconds
  • CronJob history limits: successfulJobsHistoryLimit, failedJobsHistoryLimit
  • CronJob missed schedules (>100 → stuck), lastScheduleTime semantics
  • Job patterns: fan-out/fan-in, external work queue (KEDA)
  • Resource management for batch: PriorityClass, preemption, node taints
  • 5 metrics + 4 alerting rules + 5 runbooks + 8 best practices

    Run-to-completion workloads, indexed parallelism, and time-based scheduling

    batch/v1 Kubernetes 1.24+ Platform Engineer

    Unlike Deployments or StatefulSets, Jobs and CronJobs model finite work: they create Pods, track their completions, and succeed or fail as a whole unit. Understanding their controller mechanics (completion tracking, retry semantics, parallelism, and scheduling guarantees) is essential for building reliable batch pipelines, ETL jobs, database migrations, and any workload that must run once (or on a schedule) and stop.

    Job Controller Mechanics

    The Job controller lives in kube-controller-manager and reconciles the observed Pod states against the desired completion count. Its core loop:

    Job spec: completions=3, parallelism=2, backoffLimit=4

    Job Controller reconcile loop
    ┌─────────────────────────────────────────────────┐
    │                                                 │
    │  Watch Job + owned Pods                         │
    │        │                                        │
    │        ▼                                        │
    │  Count succeeded / failed / active Pods         │
    │        │                                        │
    │        ├─ succeeded >= completions → Complete ✓ │
    │        ├─ failed > backoffLimit → Failed ✗      │
    │        ├─ active > parallelism → delete excess  │
    │        └─ active < parallelism → create more    │
    │                                                 │
    │  Pod failure → increment .status.failed         │
    │  Exponential backoff: 10s → 20s → 40s → 80s     │
    │  (capped at 6 minutes)                          │
    └─────────────────────────────────────────────────┘

    Ownership: Job owns Pods via ownerRef (GC on Job deletion)
    Selector: auto-generated (controller-uid label), immutable

    Key invariant: the Job controller does not use a ReplicaSet intermediary. It owns Pods directly, identified by the auto-generated label controller-uid=<job-uid>. This selector is immutable once set.
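    The decision order in the reconcile loop can be sketched in a few lines of Python. This is an illustrative model, not the controller's actual Go code: JobState, reconcile, and backoff_delay are hypothetical names, and capping the wanted Pod count at the number of outstanding completions is an assumption that goes slightly beyond the diagram.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    completions: int
    parallelism: int
    backoff_limit: int
    succeeded: int = 0
    failed: int = 0
    active: int = 0


def reconcile(job: JobState) -> str:
    """Return the single action one reconcile pass would take."""
    if job.succeeded >= job.completions:
        return "complete"
    if job.failed > job.backoff_limit:
        return "failed"
    # Assumption: never run more Pods than there are completions outstanding.
    wanted = min(job.parallelism, job.completions - job.succeeded)
    if job.active > wanted:
        return f"delete {job.active - wanted}"
    if job.active < wanted:
        return f"create {wanted - job.active}"
    return "wait"


def backoff_delay(failures: int, base: int = 10, cap: int = 360) -> int:
    """Seconds before the next retry: 10, 20, 40, ... capped at 6 minutes."""
    if failures <= 0:
        return 0
    return min(base * 2 ** (failures - 1), cap)
```

    With completions=3 and parallelism=2, a fresh Job yields "create 2", and the fifth recorded failure against backoffLimit=4 yields "failed".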

    Job Spec Anatomy

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: data-migration-v2
      namespace: platform
      labels:
        app: data-migration
        version: v2
    spec:
      # --- Completion semantics ---
      completions: 5          # Total successful pods needed (default: 1)
      parallelism: 2          # Max pods running simultaneously (default: 1)
      completionMode: Indexed # NonIndexed (default) or Indexed
    
      # --- Failure handling ---
      backoffLimit: 4         # Max pod failures before Job fails (default: 6)
      backoffLimitPerIndex: 1 # Per-index failures before that index fails (1.28+, Indexed mode only)
      activeDeadlineSeconds: 600  # Job-level timeout (wall clock)
    
      # --- Cleanup ---
      ttlSecondsAfterFinished: 3600  # Auto-delete 1hr after completion
    
      # --- Suspension ---
      suspend: false          # Set true to pause (deletes active pods)
    
      # --- Pod replacement ---
      podReplacementPolicy: Failed  # TerminatingOrFailed (default) | Failed (required when podFailurePolicy is set)
    
      # --- Pod failure policy (1.25+) ---
      podFailurePolicy:
        rules:
          - action: FailJob           # Fail entire Job immediately
            onExitCodes:
              containerName: worker
              operator: In
              values: [42]            # Exit code 42 = non-retriable error
          - action: Ignore            # Don't count toward backoffLimit
            onPodConditions:
              - type: DisruptionTarget # Node preemption / eviction
          - action: FailIndex         # Fail this index only (Indexed mode)
            onExitCodes:
              operator: In
              values: [1, 2, 127]
    
      # --- Selector (auto-generated; only set if manualSelector: true) ---
      # selector:
      #   matchLabels:
      #     controller-uid: <job-uid>
    
      template:
        metadata:
          labels:
            app: data-migration
        spec:
          restartPolicy: Never  # REQUIRED: Never or OnFailure
          containers:
            - name: worker
              image: registry.example.com/migration:v2@sha256:abc123
              env:
                - name: JOB_COMPLETION_INDEX   # Injected by controller (Indexed mode)
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
              resources:
                requests:
                  cpu: 500m
                  memory: 512Mi
                limits:
                  cpu: 2
                  memory: 1Gi
              volumeMounts:
                - name: work-dir
                  mountPath: /work
          volumes:
            - name: work-dir
              emptyDir: {}
          # Batch pods typically don't need high availability
          tolerations:
            - key: batch
              operator: Equal
              value: "true"
              effect: NoSchedule
          nodeSelector:
            workload-type: batch
    restartPolicy is mandatory and constrained
    Jobs require restartPolicy: Never or restartPolicy: OnFailure. Always is forbidden because a pod that always restarts never counts as "succeeded." With Never, each failure creates a new Pod (counting against backoffLimit). With OnFailure, the same Pod is restarted in-place (container restart, not pod replacement).

    Completion Modes

    NonIndexed (default)

    Pods are interchangeable. The controller creates pods until completions successful pods exist. Use this when each pod unit of work is identical (e.g., draining a shared queue).

    • No completion index assigned to pods
    • Any pod success counts toward total; pod failures are retried up to backoffLimit
    • Work distribution is external (e.g., message queue, database cursor)

    Indexed (GA 1.24)

    Each pod gets a unique stable index from 0 to completions-1. Exactly one pod must succeed at each index for the Job to complete.

    Indexed Job: completions=4, parallelism=2

    Index 0: Pod-xkz → succeeded ✓
    Index 1: Pod-mnp → running...
    Index 2: Pod-qrs → pending
    Index 3: Pod-tuv → pending

    JOB_COMPLETION_INDEX env var = "0", "1", "2", "3" respectively
    Hostname: <job-name>-<index> (e.g., data-migration-v2-0)
    DNS: <job-name>-<index>.<svc>.<ns>.svc.cluster.local
         (requires headless Service with job-name selector)

    # JOB_COMPLETION_INDEX is set automatically in every container of an Indexed Job.
    # Declaring it explicitly through the downward API (from the Pod annotation)
    # is equivalent and makes the dependency visible in the manifest:
    env:
      - name: JOB_COMPLETION_INDEX
        valueFrom:
          fieldRef:
            fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']

    The index can also be read from a file if you project the same annotation through a downwardAPI volume, for example at /etc/podinfo/job-completion-index when the volume is mounted at /etc/podinfo.

    Indexed Job use cases

    Data processing

    Shard-parallel ETL

    Each index processes one partition/shard. Index → shard mapping is deterministic from the env var.

    Machine learning

    Distributed training

    Index maps to worker rank. Used by PyTorch distributed and MPI-based frameworks.

    Testing

    Parallel test matrix

    Each index runs one test suite combination. Scatter-gather after all complete.
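    A worker in an Indexed Job typically turns its index into a slice of the input. The sketch below assumes a hypothetical list of input file names and a simple striding scheme (every total-th item, starting at the index); shard_for_index and completion_index are illustrative names, not part of any Kubernetes client library.

```python
import os
from typing import Mapping, Sequence


def completion_index(environ: Mapping[str, str] = os.environ) -> int:
    """Read the index the Job controller assigned to this Pod."""
    return int(environ["JOB_COMPLETION_INDEX"])


def shard_for_index(items: Sequence[str], index: int, total: int) -> list:
    """Return the items owned by one index.

    Sorting first makes the mapping deterministic, so a retried Pod with the
    same index always receives the same shard.
    """
    if not 0 <= index < total:
        raise ValueError(f"index {index} outside 0..{total - 1}")
    return sorted(items)[index::total]
```

    Because every index derives its shard from the same sorted list, the shards are disjoint and together cover the whole input, which is what makes the scatter step safe to retry per index.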

    Pod Failure Policy (1.25+)

    By default, every pod failure counts toward backoffLimit. A pod failure policy lets you classify failures and act on each class differently: fail fast on non-retriable application errors, retry transient ones, and keep infrastructure disruptions (preemption, eviction) from burning retries. It requires restartPolicy: Never in the Pod template.

    Action | Effect | Best for
    FailJob | Immediately mark entire Job as failed, delete active pods | Non-retriable exit codes (config error, data corruption)
    FailIndex | Fail only this index (Indexed mode), others continue | Per-shard non-retriable error without killing the whole job
    Ignore | Don't count toward backoffLimit, pod failure is recorded but ignored | Node preemption, spot instance reclaim (DisruptionTarget condition)
    Count | Default: increment failure counter toward backoffLimit | Retriable transient errors

    podFailurePolicy:
      rules:                       # Evaluated in order; the first matching rule wins
        # Rule 1: non-retriable application error
        - action: FailJob
          onExitCodes:
            containerName: worker
            operator: In
            values: [42, 43]         # Exit 42/43 = configuration / data error

        # Rule 2: preemption / eviction: don't waste retries.
        # Listed before the exit-code rules below because containers of an
        # evicted Pod often exit with 137 or 143 as well.
        - action: Ignore
          onPodConditions:
            - type: DisruptionTarget

        # Rule 3: OOM kill: count toward backoffLimit and retry
        - action: Count
          onExitCodes:
            operator: In
            values: [137]            # 128 + 9 (SIGKILL), typical for an OOM kill

        # Rule 4: per-index isolation (Indexed Jobs with backoffLimitPerIndex only)
        - action: FailIndex
          onExitCodes:
            operator: NotIn
            values: [0]              # Any other non-zero exit code fails this index
    DisruptionTarget condition
    Kubernetes 1.25+ sets the DisruptionTarget pod condition when a Pod is terminated due to node pressure eviction, preemption, or kubectl drain. Using Ignore on this condition prevents spot instance reclaims from burning your retry budget.
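    Rules in a pod failure policy are checked in order and the first match decides the action, so ordering matters. The sketch below models that evaluation under simplifying assumptions: RULES is a hypothetical rule list modelled on the policy in this section, a Pod failure is reduced to one exit code plus a set of condition types, and rule_matches and decide are illustrative names rather than controller code.

```python
RULES = [
    {"action": "FailJob",
     "onExitCodes": {"containerName": "worker", "operator": "In", "values": [42, 43]}},
    {"action": "Ignore", "onPodConditions": [{"type": "DisruptionTarget"}]},
    {"action": "Count", "onExitCodes": {"operator": "In", "values": [137]}},
    {"action": "FailIndex", "onExitCodes": {"operator": "NotIn", "values": [0]}},
]


def rule_matches(rule, exit_code, conditions, container):
    """True if a single rule applies to this failed container."""
    on_exit = rule.get("onExitCodes")
    if on_exit is not None:
        wanted_container = on_exit.get("containerName")
        if wanted_container is not None and wanted_container != container:
            return False
        if exit_code == 0:
            return False  # only failed containers are considered
        listed = exit_code in on_exit["values"]
        return listed if on_exit["operator"] == "In" else not listed
    wanted = {c["type"] for c in rule.get("onPodConditions", [])}
    return bool(wanted & set(conditions))


def decide(rules, exit_code, conditions=(), container="worker"):
    """Return the action of the first matching rule, or the default Count."""
    for rule in rules:
        if rule_matches(rule, exit_code, conditions, container):
            return rule["action"]
    return "Count"
```

    With this ordering, a Pod killed with exit code 137 during an eviction is ignored rather than counted, because the DisruptionTarget rule is reached before the exit-code rule for 137.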

    Per-Index Backoff (1.28+)

    backoffLimitPerIndex applies backoffLimit semantics independently to each index of an Indexed Job. Without it, failures across all indexes share a single global counter, so one pathologically failing index can exhaust retries for all others.

    spec:
      completions: 100
      parallelism: 10
      completionMode: Indexed
      backoffLimit: 10000         # High global limit (not meaningful with perIndex)
      backoffLimitPerIndex: 2     # Each index may fail 2 times before that index fails
      maxFailedIndexes: 5         # Job fails if more than 5 indexes fail (optional)
    Without backoffLimitPerIndex (backoffLimit=6):
      Index 3 fails 7 times → backoffLimit exceeded, entire Job fails
      Indexes still pending or running never get to finish

    With backoffLimitPerIndex: 2
      Index 3 fails 3 times → index 3 marked Failed
      All other indexes continue unaffected
      maxFailedIndexes: 5 → Job fails when 6th index fails
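    The difference between the shared and the per-index retry budget can be replayed with a small simulation. This is a sketch under stated assumptions: failures arrive as a plain list of completion indexes, and evaluate is a hypothetical helper that only models the counting, not Pod scheduling.

```python
from collections import Counter


def evaluate(failures, backoff_limit=6, per_index_limit=None, max_failed_indexes=None):
    """Replay pod failures (one completion index per failure, in order).

    Returns (state, failed_indexes). "Running" means the remaining indexes
    keep going; a real Job is still marked Failed at the end if any index failed.
    """
    per_index = Counter()
    failed_indexes = set()
    for total, index in enumerate(failures, start=1):
        if per_index_limit is None:
            # Shared budget: every failure counts against backoffLimit.
            if total > backoff_limit:
                return "JobFailed", sorted(failed_indexes)
            continue
        # Isolated budget: only this index's own failures count.
        per_index[index] += 1
        if per_index[index] > per_index_limit:
            failed_indexes.add(index)
        if max_failed_indexes is not None and len(failed_indexes) > max_failed_indexes:
            return "JobFailed", sorted(failed_indexes)
    return "Running", sorted(failed_indexes)
```

    Seven failures of index 3 end the whole Job under the shared budget of 6, while with a per-index limit of 2 only index 3 is marked failed and the others continue.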

    Suspend & Resume

    Setting spec.suspend: true pauses a Job: all active Pods are deleted (terminated gracefully), and no new Pods are created. The Job's Suspended condition is set to True. Resuming (setting suspend: false) restores scheduling.

    # Suspend a running job
    kubectl patch job data-migration-v2 -p '{"spec":{"suspend":true}}'
    
    # Resume it
    kubectl patch job data-migration-v2 -p '{"spec":{"suspend":false}}'
    
    # Check suspension status
    kubectl get job data-migration-v2 -o jsonpath='{.status.conditions}'

    Suspend/resume is the foundation for external queueing controllers such as Kueue, which hold Jobs suspended and resume them only once quota or resources are available, so no Pods exist while a Job waits in the queue.

    TTL-based Cleanup

    Finished Jobs (succeeded or failed) accumulate indefinitely without cleanup. The TTL-after-finished controller (GA 1.23) auto-deletes Jobs and their owned Pods after a configurable delay.

    spec:
      ttlSecondsAfterFinished: 86400  # Delete 24 hours after finish
      # 0 = delete immediately after finish (cascade deletes pods too)
      # omit = never auto-delete (manual cleanup required)
    CronJob history vs TTL
    CronJobs manage their own Job history via successfulJobsHistoryLimit and failedJobsHistoryLimit. If you also set ttlSecondsAfterFinished on the Job template, the TTL controller may delete Jobs before CronJob history limits are evaluated. Use one mechanism or the other, not both.

    Pod Replacement Policy

    podReplacementPolicy (1.28+) controls when replacement pods are created:

    Policy | Replacement created when | Use case
    TerminatingOrFailed (default) | Pod has deletionTimestamp (terminating) OR Failed phase | Long graceful termination; don't wait for full shutdown before scheduling replacement
    Failed (default and only allowed value when podFailurePolicy is set) | Pod reaches Failed phase (all containers terminated) | Jobs that must not run a replacement alongside a still-terminating Pod

    Job Patterns

    Pattern 1: Work Queue (NonIndexed)

    Pods pull tasks from an external queue (Redis, SQS, RabbitMQ, Kafka). When the queue is empty, pods exit 0. Leave completions unset and set parallelism to the number of workers you want running; once any worker exits 0 no new Pods are started, and the Job completes when the remaining workers have also finished.

    spec:
      completions: null         # Unset: the Job succeeds once at least one pod has
      parallelism: 5            # succeeded AND all pods have terminated
      # Omitting the field entirely is equivalent to completions: null.
      # This is the "work queue" pattern; completionMode stays NonIndexed.
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: queue-worker:v1
              env:
                - name: QUEUE_URL
                  value: redis://redis-svc:6379/0
    completions: null semantics
    When completions is null and completionMode: NonIndexed, the Job succeeds when at least one pod succeeds and all pods have terminated. This is the classic work-queue model where workers self-terminate when the queue is empty.
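    The worker side of this pattern is a loop that drains the queue and exits 0 once nothing is left. The sketch below uses Python's in-memory queue.Queue as a stand-in for Redis, SQS, or another broker, which is an assumption made to keep it self-contained; drain and run_worker are hypothetical names.

```python
import queue


def drain(tasks, handle):
    """Process items until the queue is empty; return how many were handled."""
    processed = 0
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return processed
        handle(item)
        processed += 1


def run_worker(tasks, handle):
    """Return the process exit code: 0 when drained, 1 if a task raised."""
    try:
        drain(tasks, handle)
    except Exception:
        # A non-zero exit lets the Job controller apply its retry policy.
        return 1
    return 0
```

    In a real container the result would be passed to sys.exit, so an empty queue produces the exit code 0 that this completion model relies on.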

    Pattern 2: Indexed Fan-out / Fan-in

    # Stage 1: fan-out (Indexed Job)
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: process-shards
    spec:
      completions: 32
      parallelism: 8
      completionMode: Indexed
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: processor
              image: shard-processor:v1
              command: ["/bin/sh", "-c"]
              args:
                - |
                  INDEX=${JOB_COMPLETION_INDEX}
                  # Process shard $INDEX of 32 total shards
                  ./process-shard --index=${INDEX} --total=32 --output=s3://bucket/shards/${INDEX}
              env:
                - name: JOB_COMPLETION_INDEX
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
    ---
    # Stage 2: fan-in (separate Job triggered by CI/workflow engine after fan-out completes)
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: merge-shards
    spec:
      completions: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: merger
              image: shard-merger:v1
              command: ["./merge-shards", "--input=s3://bucket/shards/", "--count=32"]

    Pattern 3: Database Migration

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: db-migrate-v3-2-0
      annotations:
        # Annotations recording the migration for the audit trail
        migration/version: "3.2.0"
        migration/type: "schema"
    spec:
      completions: 1
      parallelism: 1
      backoffLimit: 0           # Schema migrations are NOT idempotent: fail fast
      activeDeadlineSeconds: 300
      ttlSecondsAfterFinished: 604800  # Keep for 1 week for debugging
      template:
        spec:
          restartPolicy: Never
          initContainers:
            - name: wait-for-db
              image: busybox:1.36
              command: ['sh', '-c', 'until nc -z postgres-svc 5432; do sleep 2; done']
          containers:
            - name: migrator
              image: myapp:v3.2.0
              command: ["./migrate", "--direction=up", "--target=3.2.0"]
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi

    Sidecar Termination Problem

    Before Kubernetes 1.28, running sidecars (e.g., Istio envoy, Datadog agent, Vault agent) alongside a Job was problematic: when the main container exits 0, the Job should complete, but the sidecar is still running, so the Pod never reaches the Succeeded phase.

    Pre-1.28 sidecar problem:

      Pod: [main-container (exit 0)] [sidecar (still running)]
                    │
          Pod never reaches Succeeded phase
          Job never completes

      Manual workaround: shareProcessNamespace + kill sidecar

    1.28+ Native sidecar solution:

      initContainers:
        - name: vault-agent
          restartPolicy: Always   # ← This makes it a sidecar container
          ...
        - name: envoy
          restartPolicy: Always
          ...
      containers:
        - name: main-worker
          ...

      Shutdown order: main exits 0 → sidecars receive SIGTERM → Pod Succeeded ✓

    # Native sidecar in a Job (1.28+, stable 1.33)
    spec:
      template:
        spec:
          restartPolicy: Never
          initContainers:
            - name: vault-agent
              image: hashicorp/vault:1.15
              restartPolicy: Always   # sidecar designation
              args: ["agent", "-config=/vault/config"]
              volumeMounts:
                - name: vault-config
                  mountPath: /vault/config
            - name: datadog-agent
              image: datadog/agent:7
              restartPolicy: Always   # sidecar designation
              env:
                - name: DD_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: datadog-secret
                      key: api-key
          containers:
            - name: worker
              image: batch-worker:v2
              command: ["./process"]
    Native sidecars in Jobs (stable 1.33)
    With restartPolicy: Always in an initContainer, Kubernetes treats it as a sidecar: it starts before regular containers, receives SIGTERM when the main container exits, and its exit does not fail the Pod. This cleanly solves the Job sidecar problem without shell hacks.

    CronJob

    CronJob creates Job objects on a time-based schedule. The CronJob controller runs in kube-controller-manager and periodically checks whether a new Job should be spawned based on the schedule and concurrency policy.

    CronJob lifecycle:

      CronJob controller (runs every 10s by default)
             │
             ▼
      Evaluate schedule: should a new Job run now?
             │
             ├─ Yes → create Job → Job controller creates Pods
             ├─ No → wait
             └─ Missed? → check startingDeadlineSeconds

      CronJob owns Jobs via ownerRef
      Jobs own Pods via ownerRef
      CronJob GC: prune old Jobs per history limits

    CronJob Spec

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: nightly-report
      namespace: analytics
    spec:
      # --- Schedule ---
      schedule: "0 2 * * *"          # Every day at 02:00
      timeZone: "America/New_York"   # IANA time zone name (GA 1.27); if unset, the kube-controller-manager's local time zone is used
    
      # --- Concurrency ---
      concurrencyPolicy: Forbid      # Allow | Forbid | Replace
    
      # --- Missed schedule deadline ---
      startingDeadlineSeconds: 300   # Allow up to 5 min late start; nil = no deadline
    
      # --- History ---
      successfulJobsHistoryLimit: 3  # Keep last 3 successful Jobs (default: 3)
      failedJobsHistoryLimit: 1      # Keep last 1 failed Job (default: 1)
    
      # --- Suspension ---
      suspend: false                 # Suspend scheduling (existing Jobs unaffected)
    
      jobTemplate:
        spec:
          backoffLimit: 2
          activeDeadlineSeconds: 3600
          ttlSecondsAfterFinished: 86400  # Optional; can delete Jobs before the history limits above apply
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: reporter
                  image: analytics/reporter:v3
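                  # Kubernetes does not run shell substitution in env values, so a
                  # value such as "$(date ...)" reaches the container as a literal
                  # string. Compute the date at container start instead. Assumes the
                  # image ships /bin/sh and GNU date; ./report is a placeholder entrypoint.
                  command: ["/bin/sh", "-c"]
                  args:
                    - |
                      export REPORT_DATE="$(date -d yesterday +%Y-%m-%d)"
                      exec ./report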
                  resources:
                    requests:
                      cpu: 500m
                      memory: 1Gi
                    limits:
                      cpu: 2
                      memory: 4Gi

    Cron Schedule Syntax

    Field | Range | Special chars
    Minute | 0–59 | * , - /
    Hour | 0–23 | * , - /
    Day of month | 1–31 | * , - / ?
    Month | 1–12 or JAN–DEC | * , - /
    Day of week | 0–6 (Sun=0) or SUN–SAT | * , - / ?

    Example schedule | Meaning
    0 * * * * | Every hour at :00
    */15 * * * * | Every 15 minutes
    0 2 * * * | Daily at 02:00
    0 9 * * 1-5 | Weekdays at 09:00
    0 0 1 * * | Monthly, 1st day at midnight
    0 0 * * 0 | Every Sunday at midnight
    @hourly | Macro = 0 * * * *
    @daily | Macro = 0 0 * * *
    @weekly | Macro = 0 0 * * 0
    @monthly | Macro = 0 0 1 * *

    Time Zones (GA 1.27)

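    When timeZone is unset, the kube-controller-manager interprets the schedule in its own local time zone, which is UTC on most clusters; before the field existed, teams worked around this by shifting schedule values manually. The timeZone field accepts IANA timezone names from the tz database.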

    spec:
      schedule: "0 9 * * 1-5"         # 09:00 local time, weekdays
      timeZone: "Europe/Berlin"        # CET/CEST automatically handled
    
    # Common zones
    # America/New_York     UTC-5/UTC-4 (EST/EDT)
    # America/Los_Angeles  UTC-8/UTC-7 (PST/PDT)
    # Asia/Tokyo           UTC+9
    # Australia/Sydney     UTC+10/UTC+11
    # UTC                  Always UTC (explicit is better than implicit)
    DST edge cases
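    When a clock change makes a local time ambiguous (e.g., 02:30 occurs twice during fall-back) or nonexistent (skipped during spring-forward), a schedule in that window may fire at an unexpected time or, for a skipped slot, not at all on that day. Avoid scheduling inside the transition window, or verify the behavior on your Kubernetes version. Schedules in DST-observing zones also shift by one hour relative to UTC seasonally, so audit CronJobs for timezone-sensitive business logic.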

    Concurrency Policy

    Policy | Behavior when previous Job still running | Use case
    Allow (default) | Create new Job anyway; multiple Jobs may run simultaneously | Independent periodic tasks; each run is isolated
    Forbid | Skip this schedule tick; record a missed schedule | Non-reentrant jobs (DB maintenance, cache warm-up)
    Replace | Delete the current running Job, create a new one | Stateless jobs that must always run on fresh data; old run is stale

    Replace deletes the running Job
    With concurrencyPolicy: Replace, the in-flight Job is forcefully deleted (all its Pods are terminated) before the new Job starts. Any partially completed work is lost. Only use Replace when jobs are fully idempotent and partial runs have no side effects.
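    The three policies reduce to one decision per schedule tick: whether to create a Job and which running Jobs to delete first. A minimal sketch of that decision, with on_schedule_tick as a hypothetical helper and active Jobs represented by their names:

```python
def on_schedule_tick(policy, active_jobs):
    """Decide what a schedule tick does, given the Jobs still running."""
    if policy not in ("Allow", "Forbid", "Replace"):
        raise ValueError(f"unknown concurrencyPolicy: {policy}")
    if not active_jobs or policy == "Allow":
        # Nothing in the way, or overlap is acceptable.
        return {"create": True, "delete": []}
    if policy == "Forbid":
        # Skip this tick; the running Job is left alone.
        return {"create": False, "delete": []}
    # Replace: the in-flight Jobs are removed before the new one starts.
    return {"create": True, "delete": list(active_jobs)}
```

    Only Replace ever returns a non-empty delete list, which is why it is the one policy that can discard partially completed work.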

    Missed Schedules & startingDeadlineSeconds

    If the CronJob controller is unavailable (controller-manager downtime, cluster upgrade) or a Job is stuck, schedule ticks may be missed. The controller catches up by counting the schedule times missed since lastScheduleTime.

    100-missed-schedule limit
    If more than 100 schedule ticks have been missed since the last run (or since the CronJob was created), the controller stops scheduling and logs: Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew. The CronJob does not recover on its own. Set or lower startingDeadlineSeconds so that only the ticks inside that window are counted (the remedy the message itself names), delete and recreate the CronJob, or update the schedule to reset the counter.
    spec:
      startingDeadlineSeconds: 300
      # If the job cannot start within 5 minutes of its scheduled time,
      # skip this occurrence and wait for the next schedule tick.
      # null (default) = no deadline; may trigger the 100-missed-schedule trap
      # if controller is down for many schedule periods
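    The effect of startingDeadlineSeconds on the missed-tick count is easiest to see with numbers. The sketch below assumes a fixed-interval schedule (for example, every minute) instead of parsing a cron expression, which is a simplification; missed_ticks and is_stuck are hypothetical helpers, not controller code.

```python
from datetime import datetime, timedelta, timezone

MAX_MISSED = 100  # the limit described in this section


def missed_ticks(last_schedule, now, interval, starting_deadline=None):
    """Count ticks of a fixed-interval schedule that fall in the counted window.

    Without a deadline the window starts at last_schedule; with one it starts
    at now - starting_deadline, whichever is later.
    """
    window_start = last_schedule
    if starting_deadline is not None:
        window_start = max(window_start, now - starting_deadline)
    if now <= window_start:
        return 0
    ticks_until_now = (now - last_schedule) // interval
    ticks_before_window = (window_start - last_schedule) // interval
    return ticks_until_now - ticks_before_window


def is_stuck(missed):
    """True when the count exceeds the limit and scheduling would stop."""
    return missed > MAX_MISSED
```

    For an every-minute schedule and a three-hour controller outage, the unbounded count is 180 and exceeds the limit, while a 300-second deadline reduces the counted window to 5 ticks.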

    lastScheduleTime in the CronJob status shows when the last Job was successfully spawned. Use this to detect scheduling staleness:

    # Check last schedule time
    kubectl get cronjob nightly-report -o jsonpath='{.status.lastScheduleTime}'
    
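    # List Jobs owned by this CronJob (matched through ownerReferences)
    kubectl get jobs -o json | jq -r \
      '.items[] | select(any(.metadata.ownerReferences[]?; .kind == "CronJob" and .name == "nightly-report")) | .metadata.name'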
    
    # Manually trigger a CronJob (create Job from template)
    kubectl create job --from=cronjob/nightly-report nightly-report-manual-$(date +%s)

    History Limits

    CronJob prunes old Jobs to prevent unbounded accumulation. The controller keeps the N most recent Jobs of each type.

    Field | Default | Recommendation
    successfulJobsHistoryLimit | 3 | 3–10 for debugging; 0 to disable (not recommended for prod)
    failedJobsHistoryLimit | 1 | 3–5 to preserve failure logs; higher if jobs are long-lived

    Logs lost when Job is pruned
    When a Job is deleted (by history limits or TTL), its Pods and their logs are deleted too unless you have a log aggregation pipeline (Loki, Elasticsearch, Datadog). Always ship logs to external storage for batch jobs before relying on history limits for debugging.
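    Pruning by history limit amounts to keeping the N most recently finished Jobs of each outcome and deleting the rest. A minimal sketch, assuming each Job is reduced to a (name, finish time, succeeded) tuple; jobs_to_prune is a hypothetical helper and the default limits mirror the CronJob defaults of 3 and 1.

```python
def jobs_to_prune(jobs, successful_limit=3, failed_limit=1):
    """Return names of finished Jobs that exceed the history limits.

    jobs: iterable of (name, finished_at, succeeded) tuples, where
    finished_at is any sortable timestamp.
    """
    doomed = []
    for outcome, limit in ((True, successful_limit), (False, failed_limit)):
        finished = sorted((j for j in jobs if j[2] == outcome), key=lambda j: j[1])
        excess = len(finished) - limit
        if excess > 0:
            # The oldest Jobs of this outcome go first.
            doomed.extend(name for name, _, _ in finished[:excess])
    return doomed
```

    Successful and failed Jobs are counted separately, so a burst of failures never pushes successful runs out of the history, and vice versa.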

    Resource Management for Batch

    PriorityClass for Batch Isolation

    # Batch priority class: lower than production workloads
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: batch-low
    value: 100              # Production = 1000; system-cluster-critical = 2000000000
    globalDefault: false
    preemptionPolicy: Never # Batch should not preempt production pods
    description: "Low-priority batch workloads"
    ---
    # Interactive/short jobs
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: batch-high
    value: 500
    globalDefault: false
    preemptionPolicy: PreemptLowerPriority
    description: "Time-sensitive batch jobs (e.g., SLA-bound reports)"
    ---
    # Use in Job template
    spec:
      template:
        spec:
          priorityClassName: batch-low

    Dedicated Batch Nodes

    # Taint batch nodes to prevent non-batch workloads from landing there
    kubectl taint nodes batch-node-1 batch=true:NoSchedule
    kubectl taint nodes batch-node-2 batch=true:NoSchedule
    
    # Label batch nodes
    kubectl label nodes batch-node-1 workload-type=batch
    kubectl label nodes batch-node-2 workload-type=batch
    # Job template tolerations + nodeSelector for batch nodes
    spec:
      template:
        spec:
          tolerations:
            - key: batch
              operator: Equal
              value: "true"
              effect: NoSchedule
          nodeSelector:
            workload-type: batch
          # Low priority: production pods can preempt these batch pods under resource pressure
          priorityClassName: batch-low

    Right-sizing Batch Resources

    Batch jobs often have predictable resource profiles. Over-requesting wastes cluster capacity; under-requesting causes OOM kills that count as failures.

    resources:
      requests:
        cpu: "500m"      # What the scheduler uses for placement
        memory: "1Gi"    # Set to p95 of observed usage
      limits:
        cpu: "4"         # Generous CPU limit (throttling is recoverable)
        memory: "2Gi"    # Tight memory limit = OOM = pod failure
                         # Memory limit should be >= p99.9 of usage
    Memory OOM in batch jobs
    For batch jobs processing variable-size inputs (e.g., large files), memory needs may vary per run. Either provision generously, use VPA recommendations (see VPA page), or implement input-size-based resource selection in your workflow engine. Pod OOM kills increment the backoff counter.
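    The percentile guidance above (request near p95, limit at or above p99.9) can be turned into numbers from observed usage. The sketch below assumes a list of peak memory samples in MiB collected from earlier runs and uses the nearest-rank percentile method; percentile and suggest_memory are hypothetical helpers.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(pct * len(ordered) / 100, 9)))
    return ordered[rank - 1]


def suggest_memory(samples_mib):
    """Map observed peak usage to a request (p95) and a limit (p99.9)."""
    return {
        "request_mib": percentile(samples_mib, 95),
        "limit_mib": percentile(samples_mib, 99.9),
    }
```

    With few samples the p99.9 value is simply the maximum observed, so the suggestion is only as good as the history behind it; inputs larger than anything seen before can still trigger an OOM kill.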

    KEDA for Event-Driven Job Scaling

    KEDA (Kubernetes Event-Driven Autoscaling) can trigger Jobs based on queue depth, creating zero Jobs when the queue is empty and scaling to N Jobs as messages accumulate. This complements CronJob (time-based) with event-driven batch execution.

    apiVersion: keda.sh/v1alpha1
    kind: ScaledJob
    metadata:
      name: queue-processor
      namespace: platform
    spec:
      jobTargetRef:
        parallelism: 5
        completions: 5
        backoffLimit: 3
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: processor
                image: queue-processor:v2
                resources:
                  requests:
                    cpu: 200m
                    memory: 256Mi
      pollingInterval: 30        # Check queue every 30 seconds
      maxReplicaCount: 50        # Max concurrent Jobs
      scalingStrategy:
        strategy: "accurate"     # Intended for queues whose reported length excludes in-flight messages
      triggers:
        - type: redis
          metadata:
            address: redis-svc.platform.svc.cluster.local:6379
            listName: work-queue
            listLength: "5"      # One Job per 5 messages (batch processing)

    Operational Commands

    # --- Job operations ---
    # List jobs with status
    kubectl get jobs -n platform -o wide
    
    # Watch a job complete
    kubectl get job data-migration-v2 -w
    
    # Get job status details
    kubectl describe job data-migration-v2
    
    # Get job completion ratio
    kubectl get job data-migration-v2 \
      -o jsonpath='{.status.succeeded}/{.spec.completions} succeeded'
    
    # View logs from all pods of a job
    kubectl logs -l job-name=data-migration-v2 --all-containers
    
    # Delete successfully completed jobs older than 1 day (manual cleanup; completionTime is only set on success)
    kubectl get jobs -o json | jq -r \
      '.items[] | select(.status.completionTime != null) |
       select(now - (.status.completionTime | fromdateiso8601) > 86400) |
       .metadata.name' | xargs -I{} kubectl delete job {}
    
    # --- CronJob operations ---
    # List cronjobs with schedule and last schedule time
    kubectl get cronjobs -o wide
    
    # Manually trigger a CronJob
    kubectl create job --from=cronjob/nightly-report nightly-report-manual-$(date +%s)
    
    # Suspend a CronJob (stop new jobs from being created)
    kubectl patch cronjob nightly-report -p '{"spec":{"suspend":true}}'
    
    # Resume a CronJob
    kubectl patch cronjob nightly-report -p '{"spec":{"suspend":false}}'
    
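    # List Jobs created by a specific CronJob (matched through ownerReferences)
    kubectl get jobs -o json | jq -r \
      '.items[] | select(any(.metadata.ownerReferences[]?; .kind == "CronJob" and .name == "nightly-report")) | .metadata.name'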
    
    # View CronJob events (missed schedules, etc.)
    kubectl describe cronjob nightly-report
    
    # --- Indexed Job debugging ---
    # Get pods for each index
    kubectl get pods -l job-name=process-shards -L batch.kubernetes.io/job-completion-index
    
    # Get logs for specific index
    kubectl logs -l job-name=process-shards,batch.kubernetes.io/job-completion-index=3

    Job Status Fields

    Field | Type | Description
    status.active | int | Number of pending and running pods
    status.succeeded | int | Number of successfully completed pods
    status.failed | int | Number of failed pods (total, not just toward backoffLimit)
    status.completedIndexes | string | Compact range notation of completed indexes (Indexed mode)
    status.failedIndexes | string | Compact range notation of failed indexes (with backoffLimitPerIndex)
    status.startTime | time | When the Job was acknowledged by the controller
    status.completionTime | time | When the Job completed successfully (not set for failed Jobs)
    status.conditions | []Condition | Complete, Failed, Suspended, FailureTarget
    status.uncountedTerminatedPods | object | Pods that terminated but haven't been counted yet (transient)

    Metrics

    Metric | Labels | Use
    kube_job_status_active | job_name, namespace | Currently running pods in a Job
    kube_job_status_succeeded | job_name, namespace | Successful completions
    kube_job_status_failed | job_name, namespace | Cumulative failures
    kube_job_complete | job_name, namespace, condition | 1 for condition="true" when the Job has the Complete condition; kube_job_failed is the counterpart for failed Jobs
    kube_cronjob_next_schedule_time | cronjob, namespace | Unix timestamp of next scheduled execution

    Alerting Rules

    groups:
      - name: jobs-cronjobs
        rules:
          # Job failed
          - alert: JobFailed
            expr: kube_job_failed{condition="true"} > 0
            for: 0m
            labels:
              severity: critical
            annotations:
              summary: "Job {{ $labels.namespace }}/{{ $labels.job_name }} has failed"
              description: "Check logs: kubectl logs -l job-name={{ $labels.job_name }} -n {{ $labels.namespace }}"
    
          # Job taking too long (no activeDeadlineSeconds set)
          - alert: JobStalled
            expr: |
              kube_job_status_active > 0
              and (time() - kube_job_status_start_time) > 7200
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Job {{ $labels.job_name }} has been running for > 2 hours"
    
          # CronJob not scheduled on time
          - alert: CronJobMissedSchedule
            expr: |
              time() - kube_cronjob_status_last_schedule_time > 3600
              unless kube_cronjob_spec_suspend == 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} missed last schedule"
    
          # CronJob suspended unexpectedly
          - alert: CronJobSuspended
            expr: kube_cronjob_spec_suspend == 1
            for: 1h
            labels:
              severity: info
            annotations:
              summary: "CronJob {{ $labels.cronjob }} has been suspended for > 1 hour"

    Runbooks

    Job Stuck / Not Completing

    # 1. Check job status and conditions
    kubectl describe job <job-name> -n <namespace>
    
    # 2. Check pod states
    kubectl get pods -l job-name=<job-name> -n <namespace>
    
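    # 3. Look at logs from failed pods
    #    (--previous reads the prior container instance, so it applies to restartPolicy: OnFailure;
    #     omit it for restartPolicy: Never, where each failure leaves a separate pod)
    kubectl logs -l job-name=<job-name> --previous -n <namespace>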
    
    # 4. Check if backoffLimit is exhausted (prints failed/backoffLimit)
    kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.failed}/{.spec.backoffLimit}'
    
    # 5. Check for resource constraints (Pending pods)
    kubectl describe pods -l job-name=<job-name> -n <namespace> | grep -A5 Events

    CronJob Stopped Scheduling (100-missed limit)

    # Check events
    kubectl describe cronjob <name> -n <namespace> | grep -A20 Events
    
    # If "Too many missed start time" error:
    # Option 1: delete and recreate (loses history)
    kubectl delete cronjob <name> -n <namespace>
    kubectl apply -f cronjob.yaml
    
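    # Option 2: add or reduce startingDeadlineSeconds so that missed schedules are counted
    # only within that window instead of since the last scheduled run
    kubectl patch cronjob <name> -n <namespace> -p '{"spec":{"startingDeadlineSeconds":300}}'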
    
    # Option 3: check for controller-manager issues
    kubectl logs -n kube-system -l component=kube-controller-manager | grep cronjob
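Option 2 works because, as the upstream CronJob documentation describes it, the controller counts missed schedules inside the startingDeadlineSeconds window when that field is set, and since the last scheduled run otherwise. The arithmetic can be sketched as below. The helper name slots_in_window is hypothetical, and the sketch assumes a fixed-interval schedule (for example every minute) and is not how the controller itself computes the count.

```shell
# Upper bound on schedule slots that fall inside a look-back window.
# Hypothetical helper; assumes a fixed interval between runs, in seconds.
slots_in_window() {
  window_seconds=$1
  interval_seconds=$2
  echo $(( window_seconds / interval_seconds ))
}

# A two-hour gap on an every-minute schedule exceeds the 100-missed limit.
outage_slots=$(slots_in_window 7200 60)

# With startingDeadlineSeconds: 300, only the last five minutes are counted.
deadline_slots=$(slots_in_window 300 60)

echo "without deadline: $outage_slots, with 300s deadline: $deadline_slots"
```

The same calculation helps pick a startingDeadlineSeconds value: it should be longer than one schedule interval, or runs may be skipped entirely, while keeping the slot count well under 100.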

    Indexed Job Index Stuck

    # Which indexes are complete?
    kubectl get job <job-name> -o jsonpath='{.status.completedIndexes}'
    
    # Which indexes failed (with backoffLimitPerIndex)?
    kubectl get job <job-name> -o jsonpath='{.status.failedIndexes}'
    
    # Get pods for a specific index
    kubectl get pods -l job-name=<job-name>,batch.kubernetes.io/job-completion-index=5
    
    # backoffLimitPerIndex is immutable, so the per-index retry budget cannot be raised on a live Job.
    # To retry failed indexes, fix the cause and run a new Job that covers only those shards.
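When an Indexed Job stalls, it is often more useful to know which indexes are still outstanding than which have completed. The sketch below derives that list from spec.completions and status.completedIndexes. The function name pending_indexes is hypothetical, and it assumes a well-formed completedIndexes string of comma-separated numbers and inclusive ranges.

```shell
# Print the indexes in 0..completions-1 that are absent from a completedIndexes string.
# Hypothetical helper; $1 = spec.completions, $2 = status.completedIndexes (may be empty).
pending_indexes() {
  completions=$1
  completed=$2
  pending=""
  idx=0
  while [ "$idx" -lt "$completions" ]; do
    found=0
    old_ifs=$IFS
    IFS=','
    for part in $completed; do
      # For a single index such as "5", both expansions yield "5".
      lo=${part%-*}
      hi=${part#*-}
      if [ "$idx" -ge "$lo" ] && [ "$idx" -le "$hi" ]; then
        found=1
      fi
    done
    IFS=$old_ifs
    if [ "$found" -eq 0 ]; then
      pending="$pending $idx"
    fi
    idx=$((idx + 1))
  done
  echo "${pending# }"
}

# Example against a live Job (placeholder name, left commented out):
# pending_indexes \
#   "$(kubectl get job <job-name> -o jsonpath='{.spec.completions}')" \
#   "$(kubectl get job <job-name> -o jsonpath='{.status.completedIndexes}')"
```

With 8 completions and completedIndexes of "0-2,5", the function prints 3 4 6 7. Indexes listed in status.failedIndexes also appear in this output, so compare the two before deciding what to re-run.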

    Manually Triggering a CronJob

    # One-shot manual run from CronJob template (keep the generated name for later commands)
    JOB_NAME=<cronjob-name>-manual-$(date +%s)
    kubectl create job --from=cronjob/<cronjob-name> "$JOB_NAME" -n <namespace>
    
    # Monitor the manually triggered Job by name
    kubectl get job "$JOB_NAME" -n <namespace> -w

    Job Pod Failures Due to Node Preemption

    # Check if DisruptionTarget condition is set on failed pods
    kubectl get pods -l job-name=<job-name> -o json | \
      jq '.items[].status.conditions[] | select(.type=="DisruptionTarget")'
    
    # Solution: add podFailurePolicy to Ignore DisruptionTarget (see above)
    # This prevents preemption from consuming backoffLimit retries

    CronJob Job Not Appearing

    # Check if CronJob is suspended
    kubectl get cronjob <name> -o jsonpath='{.spec.suspend}'
    
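    # Check concurrencyPolicy blocking new jobs (Forbid skips a run while the previous Job is still active)
    kubectl get cronjob <name> -o jsonpath='{.spec.concurrencyPolicy}'
    # The label selector below works only if the jobTemplate sets this label
    kubectl get jobs -l "batch.kubernetes.io/cronjob-name=<name>" --sort-by=.metadata.creationTimestamp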
    
    # Check controller-manager logs for scheduling decisions
    kubectl logs -n kube-system -l component=kube-controller-manager --tail=200 | grep <name>

    Best Practices

    1. Always set activeDeadlineSeconds: without it, a hung Job runs forever and its pods accumulate. Set it to 2 to 3 times the expected runtime.
    2. Use ttlSecondsAfterFinished to prevent Job accumulation. Even with CronJob history limits, standalone Jobs need explicit cleanup.
    3. Set backoffLimit: 0 for non-idempotent operations (schema migrations, one-time data fixes). Retrying a migration that partially ran can corrupt data.
    4. Add podFailurePolicy with DisruptionTarget: Ignore on any Job that runs on spot or preemptible nodes. This prevents spot reclaims from wasting retry budget.
    5. Use Indexed completion mode for sharded workloads: deterministic index-to-shard mapping eliminates distributed coordination overhead.
    6. Use backoffLimitPerIndex for large Indexed jobs. It prevents one hot shard from exhausting global retries.
    7. Never set concurrencyPolicy: Replace on stateful jobs: the in-flight job is deleted without waiting for graceful termination. Only safe for truly idempotent, stateless work.
    8. Set explicit timeZone on CronJobs: UTC-only schedules create operational confusion for teams in non-UTC zones, and DST errors cause missed or doubled runs.
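Several of the practices above can be combined in one manifest. The following sketch writes such a manifest to a local file so it can be reviewed or validated before use. The Job name, image, file name, and all timing values are placeholders chosen for illustration, not recommendations for any particular workload.

```shell
# Write a sketch manifest; the name, image, and timings are placeholders.
cat > job-sketch.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-batch            # hypothetical name
spec:
  activeDeadlineSeconds: 1800    # practice 1: assumes a 10 to 15 minute expected runtime
  ttlSecondsAfterFinished: 3600  # practice 2: clean up one hour after the Job finishes
  backoffLimit: 3                # use 0 instead for non-idempotent work (practice 3)
  podFailurePolicy:              # practice 4: disruptions do not consume retries
    rules:
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never       # required when podFailurePolicy is set
      containers:
        - name: worker
          image: registry.example.com/batch-worker:latest  # placeholder image
EOF

# Validate client-side before applying (needs kubectl on the PATH):
# kubectl apply --dry-run=client -f job-sketch.yaml
```

Writing the file first and validating it with a client-side dry run keeps the sketch safe to experiment with, since nothing reaches the cluster until an explicit apply.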

    Job vs CronJob vs Deployment

    Dimension            Job                            CronJob                                              Deployment
    Lifecycle            Run-to-completion              Recurring run-to-completion                          Long-running (always on)
    Pod restartPolicy    Never or OnFailure             Never or OnFailure                                   Always
    Completion tracking  Yes (succeeded count)          Yes (per-Job)                                        No (desired replicas)
    Scheduling           Immediate on creation          Time-based (cron)                                    Continuous
    History              TTL or manual                  successfulJobsHistoryLimit / failedJobsHistoryLimit  ReplicaSet revisions
    Parallelism          spec.parallelism               Via Job template                                     spec.replicas
    Failure semantics    backoffLimit, podFailurePolicy Inherited from Job                                   Containers restarted in place (restartPolicy: Always)