Jobs & CronJobs

Run-to-completion workloads, indexed parallelism, and time-based scheduling

batch/v1 Kubernetes 1.24+ Platform Engineer

Unlike Deployments or StatefulSets, Jobs and CronJobs model finite work: they create Pods, track their completions, and succeed or fail as a whole unit. Understanding their controller mechanics — completion tracking, retry semantics, parallelism, and scheduling guarantees — is essential for building reliable batch pipelines, ETL jobs, database migrations, and any workload that must run once (or on a schedule) and stop.

Job Controller Mechanics

The Job controller lives in kube-controller-manager and reconciles the observed Pod states against the desired completion count. Its core loop:

Job spec: completions=3, parallelism=2, backoffLimit=4

Job Controller reconcile loop
┌────────────────────────────────────────────────────┐
│  Watch Job + owned Pods                            │
│        │                                           │
│        ▼                                           │
│  Count succeeded / failed / active Pods            │
│        │                                           │
│        ├─ succeeded >= completions → Complete ✓    │
│        ├─ failed > backoffLimit    → Failed ✗      │
│        ├─ active > parallelism     → delete excess │
│        └─ active < parallelism     → create more   │
│                                                    │
│  Pod failure → increment .status.failed            │
│  Exponential backoff: 10s → 20s → 40s → 80s        │
│  (capped at 6 minutes)                             │
└────────────────────────────────────────────────────┘

Ownership: Job owns Pods via ownerRef (GC on Job deletion)
Selector:  auto-generated (controller-uid label), immutable

Key invariant: the Job controller does not use a ReplicaSet intermediary. It owns Pods directly, identified by the auto-generated label controller-uid=<job-uid>. This selector is immutable once set.
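
The decision step of the loop above can be sketched in a few lines of Python. This is a simplified model, not the controller's actual code: JobState, reconcile, and backoff_delay are illustrative names, and capping the target at the number of completions still outstanding is an added refinement the diagram does not show.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    completions: int
    parallelism: int
    backoff_limit: int
    succeeded: int = 0
    failed: int = 0
    active: int = 0


def reconcile(job: JobState) -> str:
    """Return the action one simplified reconcile pass would take."""
    if job.succeeded >= job.completions:
        return "complete"
    if job.failed > job.backoff_limit:
        return "failed"
    # Assumption: never keep more Pods active than completions still outstanding.
    wanted = min(job.parallelism, job.completions - job.succeeded)
    if job.active > wanted:
        return f"delete {job.active - wanted}"
    if job.active < wanted:
        return f"create {wanted - job.active}"
    return "wait"


def backoff_delay(failures: int, base: int = 10, cap: int = 360) -> int:
    """Seconds before the next retry: 10s, 20s, 40s, ... capped at 6 minutes."""
    return min(base * 2 ** (failures - 1), cap)


print(reconcile(JobState(completions=3, parallelism=2, backoff_limit=4)))  # create 2
print([backoff_delay(n) for n in range(1, 8)])  # [10, 20, 40, 80, 160, 320, 360]
```

With backoffLimit=4 the Job only fails on the fifth Pod failure, because the check is failed > backoffLimit rather than >=.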

Job Spec Anatomy

apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration-v2
  namespace: platform
  labels:
    app: data-migration
    version: v2
spec:
  # --- Completion semantics ---
  completions: 5          # Total successful pods needed (default: 1)
  parallelism: 2          # Max pods running simultaneously (default: 1)
  completionMode: Indexed # NonIndexed (default) or Indexed

  # --- Failure handling ---
  backoffLimit: 4         # Max pod failures before Job fails (default: 6)
  backoffLimitPerIndex: 1 # Retries allowed per index before that index fails (alpha 1.28, beta 1.29)
  activeDeadlineSeconds: 600  # Job-level timeout (wall clock)

  # --- Cleanup ---
  ttlSecondsAfterFinished: 3600  # Auto-delete 1hr after completion

  # --- Suspension ---
  suspend: false          # Set true to pause (deletes active pods)

  # --- Pod replacement ---
  podReplacementPolicy: Failed  # Failed | TerminatingOrFailed; Failed is the only value allowed with podFailurePolicy

  # --- Pod failure policy (1.25+) ---
  podFailurePolicy:
    rules:
      - action: FailJob           # Fail entire Job immediately
        onExitCodes:
          containerName: worker
          operator: In
          values: [42]            # Exit code 42 = non-retriable error
      - action: Ignore            # Don't count toward backoffLimit
        onPodConditions:
          - type: DisruptionTarget # Node preemption / eviction
      - action: FailIndex         # Fail this index only (Indexed mode)
        onExitCodes:
          operator: In
          values: [1, 2, 127]

  # --- Selector (auto-generated; only set if manualSelector: true) ---
  # selector:
  #   matchLabels:
  #     controller-uid: <job-uid>

  template:
    metadata:
      labels:
        app: data-migration
    spec:
      restartPolicy: Never  # REQUIRED: Never or OnFailure
      containers:
        - name: worker
          image: registry.example.com/migration:v2@sha256:abc123
          env:
            - name: JOB_COMPLETION_INDEX   # Injected by controller (Indexed mode)
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 2
              memory: 1Gi
          volumeMounts:
            - name: work-dir
              mountPath: /work
      volumes:
        - name: work-dir
          emptyDir: {}
      # Batch pods typically don't need high availability
      tolerations:
        - key: batch
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        workload-type: batch

restartPolicy is mandatory and constrained
Jobs require restartPolicy: Never or restartPolicy: OnFailure. Always is forbidden — a pod that always restarts never counts as "succeeded." With Never, each failure creates a new Pod (counting against backoffLimit). With OnFailure, the same Pod is restarted in-place (container restart, not pod replacement).

Completion Modes

NonIndexed (default)

Pods are interchangeable. The controller creates pods until completions successful pods exist. Use this when each pod unit of work is identical (e.g., draining a shared queue).

No completion index assigned to pods
Any pod success counts toward total; pod failures are retried up to backoffLimit
Work distribution is external (e.g., message queue, database cursor)

Indexed (GA 1.24)

Each pod gets a unique stable index from 0 to completions-1. Exactly one pod must succeed at each index for the Job to complete.

Indexed Job: completions=4, parallelism=2

  Index 0: Pod-xkz → succeeded ✓
  Index 1: Pod-mnp → running...
  Index 2: Pod-qrs → pending
  Index 3: Pod-tuv → pending

JOB_COMPLETION_INDEX env var = "0", "1", "2", "3" respectively
Hostname: <job-name>-<index> (e.g., data-migration-v2-0)
DNS:      <job-name>-<index>.<svc>.<ns>.svc.cluster.local
          (requires headless Service with job-name selector)

# Inject the index from the Pod annotation via the downward API
# (in Indexed mode the controller also sets JOB_COMPLETION_INDEX automatically)
env:
  - name: JOB_COMPLETION_INDEX
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']

# Or from the Pod label of the same name (set by default from 1.28)
env:
  - name: JOB_COMPLETION_INDEX
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']

The index can also be read from a file if you expose the annotation through a downwardAPI volume, for example /etc/podinfo/job-completion-index when the volume is mounted at /etc/podinfo.

Indexed Job use cases

Data processing

Shard-parallel ETL

Each index processes one partition/shard. Index → shard mapping is deterministic from the env var.

Machine learning

Distributed training

Index maps to worker rank. Used by PyTorch distributed and MPI-based frameworks.

Testing

Parallel test matrix

Each index runs one test suite combination. Scatter-gather after all complete.
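
As a minimal sketch of the index-to-shard mapping these use cases rely on, the worker below reads JOB_COMPLETION_INDEX and takes every Nth item. The TOTAL_SHARDS variable, the file names, and shard_for_index are assumptions made for illustration; a real worker would list its inputs from storage.

```python
import os


def shard_for_index(items: list, index: int, total: int) -> list:
    """Return the items owned by completion index `index` (round-robin split)."""
    return items[index::total]


def main() -> None:
    # JOB_COMPLETION_INDEX is set on Pods of an Indexed Job; default to 0 for local runs.
    index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    total = int(os.environ.get("TOTAL_SHARDS", "4"))  # hypothetical variable
    files = [f"part-{n:03d}.csv" for n in range(10)]  # stand-in input list
    for name in shard_for_index(files, index, total):
        print(f"index {index} processing {name}")


if __name__ == "__main__":
    main()
```

Because the mapping depends only on the index and the total, a retried Pod for the same index picks up exactly the same inputs without any coordination between workers.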

Pod Failure Policy (1.25+)

The default behavior counts every pod failure toward backoffLimit. Pod failure policy lets you classify failures and take different actions. The key distinction is between non-retriable application errors (fail fast instead of retrying) and infrastructure disruptions (retry without burning the retry budget). It requires restartPolicy: Never in the pod template.

Action	Effect	Best for
`FailJob`	Immediately mark entire Job as failed, delete active pods	Non-retriable exit codes (config error, data corruption)
`FailIndex`	Fail only this index (Indexed mode), others continue	Per-shard non-retriable error without killing the whole job
`Ignore`	Don't count toward backoffLimit, pod failure is recorded but ignored	Node preemption, spot instance reclaim (DisruptionTarget condition)
`Count`	Default: increment failure counter toward backoffLimit	Retriable transient errors

podFailurePolicy:
  rules:
    # Rule 1: non-retriable application error
    - action: FailJob
      onExitCodes:
        containerName: worker
        operator: In
        values: [42, 43]         # Exit 42/43 = configuration / data error

    # Rule 2: OOM — infrastructure issue, retry
    - action: Count
      onExitCodes:
        operator: In
        values: [137]            # SIGKILL (OOM)

    # Rule 3: preemption / eviction — don't waste retries
    - action: Ignore
      onPodConditions:
        - type: DisruptionTarget

    # Rule 4: per-index isolation (Indexed Jobs with backoffLimitPerIndex set)
    - action: FailIndex
      onExitCodes:
        operator: NotIn
        values: [0]              # Any non-zero exit in this index = index fails

DisruptionTarget condition
Kubernetes 1.25+ sets the DisruptionTarget pod condition when a Pod is terminated due to node pressure eviction, preemption, or kubectl drain. Using Ignore on this condition prevents spot instance reclaims from burning your retry budget.
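
Rules are evaluated in order and the first match wins, with Count as the fallback. The sketch below models that evaluation for the four rules above under simplifying assumptions: one container per Pod, and containerName matching left out. The matches and decide helpers are illustrative, not Kubernetes APIs.

```python
def matches(rule: dict, exit_code: int, conditions: frozenset) -> bool:
    """True if one podFailurePolicy rule matches a failed Pod (single container assumed)."""
    if "onExitCodes" in rule:
        req = rule["onExitCodes"]
        listed = exit_code in req["values"]
        return listed if req["operator"] == "In" else not listed
    if "onPodConditions" in rule:
        return any(c["type"] in conditions for c in rule["onPodConditions"])
    return False


def decide(rules: list, exit_code: int, conditions: frozenset = frozenset()) -> str:
    """Return the action of the first matching rule, or the default 'Count'."""
    for rule in rules:
        if matches(rule, exit_code, conditions):
            return rule["action"]
    return "Count"


RULES = [
    {"action": "FailJob", "onExitCodes": {"operator": "In", "values": [42, 43]}},
    {"action": "Count", "onExitCodes": {"operator": "In", "values": [137]}},
    {"action": "Ignore", "onPodConditions": [{"type": "DisruptionTarget"}]},
    {"action": "FailIndex", "onExitCodes": {"operator": "NotIn", "values": [0]}},
]

print(decide(RULES, 42))                                      # FailJob
print(decide(RULES, 1, frozenset({"DisruptionTarget"})))      # Ignore
print(decide(RULES, 137, frozenset({"DisruptionTarget"})))    # Count
```

The last call shows why ordering matters: in this model a disrupted Pod whose container was killed with exit code 137 hits the Count rule before the Ignore rule, so placing the DisruptionTarget rule earlier may be preferable.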

Per-Index Backoff (1.28+)

backoffLimitPerIndex applies backoffLimit semantics independently to each index. Without it, failures across all indexes share a single global counter — one pathologically failing index can exhaust retries for all others.

spec:
  completions: 100
  parallelism: 10
  completionMode: Indexed
  backoffLimit: 10000         # Optional here: defaults to 2147483647 when backoffLimitPerIndex is set
  backoffLimitPerIndex: 2     # Each index tolerates 2 failures; the 3rd marks that index failed
  maxFailedIndexes: 5         # Job fails if more than 5 indexes fail (optional)

Without backoffLimitPerIndex (backoffLimit=6):
  Index 3 fails 7 times → entire Job fails (backoffLimit=6 exceeded)
  Indexes 0,1,2,4..99 may never get a chance to finish

With backoffLimitPerIndex: 2
  Index 3 fails 3 times → index 3 marked Failed
  All other indexes continue unaffected
  maxFailedIndexes: 5 → Job fails when the 6th index fails
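
The per-index accounting can be modeled as below. This is a sketch rather than controller code: run_indexed_job is an illustrative name, and failures_per_index is an assumed input stating how many times each index would fail before succeeding. It also reflects that a Job with at least one failed index ends as Failed once all indexes have finished.

```python
def run_indexed_job(failures_per_index: dict, completions: int,
                    backoff_limit_per_index: int, max_failed_indexes: int) -> tuple:
    """Return (job_outcome, failed_indexes) for a simplified Indexed Job run."""
    failed_indexes = []
    for index in range(completions):
        # An index fails once its failure count exceeds backoffLimitPerIndex.
        if failures_per_index.get(index, 0) > backoff_limit_per_index:
            failed_indexes.append(index)
            if len(failed_indexes) > max_failed_indexes:
                # Too many failed indexes: the whole Job is terminated early.
                return "Failed", failed_indexes
    # Any failed index makes the Job Failed, but only after the others have run.
    return ("Failed" if failed_indexes else "Complete"), failed_indexes


print(run_indexed_job({3: 3}, 100, 2, 5))  # ('Failed', [3])
print(run_indexed_job({3: 2}, 100, 2, 5))  # ('Complete', [])
```

In the first call the other 99 indexes still run to completion; only index 3 is recorded as failed.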

Suspend & Resume

Setting spec.suspend: true pauses a Job: all active Pods are deleted (terminated gracefully), and no new Pods are created. The Job's Suspended condition is set to True. Resuming (setting suspend: false) restores scheduling.

# Suspend a running job
kubectl patch job data-migration-v2 -p '{"spec":{"suspend":true}}'

# Resume it
kubectl patch job data-migration-v2 -p '{"spec":{"suspend":false}}'

# Check suspension status
kubectl get job data-migration-v2 -o jsonpath='{.status.conditions}'

Suspend/resume is the foundation for external queueing controllers such as Kueue, which hold Jobs in the suspended state so that no Pods are created until resources are available.

TTL-based Cleanup

Finished Jobs (succeeded or failed) accumulate indefinitely without cleanup. The TTL-after-finished controller (GA 1.23) auto-deletes Jobs and their owned Pods after a configurable delay.

spec:
  ttlSecondsAfterFinished: 86400  # Delete 24 hours after finish
  # 0 = delete immediately after finish (cascade deletes pods too)
  # omit = never auto-delete (manual cleanup required)

CronJob history vs TTL
CronJobs manage their own Job history via successfulJobsHistoryLimit and failedJobsHistoryLimit. If you also set ttlSecondsAfterFinished on the Job template, the TTL controller may delete Jobs before the CronJob history limits are evaluated. Use one mechanism or the other, not both.

Pod Replacement Policy

podReplacementPolicy (1.28+) controls when replacement pods are created:

Policy	Replacement created when	Use case
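`Failed` (default and only allowed value when podFailurePolicy is set)	Pod reaches Failed phase (all containers terminated)	Jobs using podFailurePolicy, or where a replacement must not overlap with the terminating pod
`TerminatingOrFailed` (default otherwise)	Pod has deletionTimestamp (terminating) OR Failed phase	Long graceful termination; don't wait for full shutdown before scheduling replacement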

Job Patterns

Pattern 1: Work Queue (NonIndexed)

Pods pull tasks from an external queue (Redis, SQS, RabbitMQ, Kafka) and exit 0 when the queue is empty. Leave completions unset and set parallelism to the number of workers you want running; the Job completes once at least one worker has succeeded and all workers have terminated.

spec:
  # completions is intentionally omitted (null): work queue mode
  parallelism: 5            # Number of concurrent workers
  # The Job succeeds once at least one pod has succeeded
  # and all pods have terminated
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: queue-worker:v1
          env:
            - name: QUEUE_URL
              value: redis://redis-svc:6379/0

completions: null semantics
When completions is null and completionMode: NonIndexed, the Job succeeds when at least one pod succeeds and all pods have terminated. This is the classic work-queue model where workers self-terminate when the queue is empty.
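
A worker for this pattern reduces to "drain until empty, then exit 0". The sketch below uses Python's in-memory queue.Queue as a stand-in for the external queue, so drain and job_succeeded are illustrative helpers and no Redis client API is implied.

```python
import queue


def drain(tasks: queue.Queue) -> int:
    """Process tasks until the queue is empty; return how many were handled."""
    done = 0
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            # Queue drained: the worker would now exit 0 so its Pod counts as succeeded.
            return done
        print(f"processing {task}")
        done += 1


def job_succeeded(pod_phases: list) -> bool:
    """Work-queue completion rule: at least one success and no Pod still active."""
    all_terminated = all(p in ("Succeeded", "Failed") for p in pod_phases)
    return all_terminated and "Succeeded" in pod_phases


work = queue.Queue()
for item in ("a", "b", "c"):
    work.put(item)
print(drain(work))                                # 3
print(job_succeeded(["Succeeded", "Running"]))    # False
```

A real worker also has to handle tasks that fail midway, for example by acknowledging a message only after processing succeeds, which this sketch leaves out.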

Pattern 2: Indexed Fan-out / Fan-in

# Stage 1: fan-out (Indexed Job)
apiVersion: batch/v1
kind: Job
metadata:
  name: process-shards
spec:
  completions: 32
  parallelism: 8
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: processor
          image: shard-processor:v1
          command: ["/bin/sh", "-c"]
          args:
            - |
              INDEX=${JOB_COMPLETION_INDEX}
              # Process shard $INDEX of 32 total shards
              ./process-shard --index=${INDEX} --total=32 --output=s3://bucket/shards/${INDEX}
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
---
# Stage 2: fan-in (separate Job triggered by CI/workflow engine after fan-out completes)
apiVersion: batch/v1
kind: Job
metadata:
  name: merge-shards
spec:
  completions: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: merger
          image: shard-merger:v1
          command: ["./merge-shards", "--input=s3://bucket/shards/", "--count=32"]

Pattern 3: Database Migration

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate-v3-2-0
  annotations:
    # Annotations for audit trail
    migration/version: "3.2.0"
    migration/type: "schema"
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 0           # Schema migrations are NOT idempotent — fail fast
  activeDeadlineSeconds: 300
  ttlSecondsAfterFinished: 604800  # Keep for 1 week for debugging
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -z postgres-svc 5432; do sleep 2; done']
      containers:
        - name: migrator
          image: myapp:v3.2.0
          command: ["./migrate", "--direction=up", "--target=3.2.0"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
          resources:
            requests:
              cpu: 100m
              memory: 128Mi

Sidecar Termination Problem

Before Kubernetes 1.28, running sidecars (e.g., Istio envoy, Datadog agent, Vault agent) alongside a Job was problematic: when the main container exits 0, the Job wants to complete — but the sidecar is still running, so the Pod never reaches the Succeeded phase.

Pre-1.28 sidecar problem:

  Pod: [main-container (exit 0)] [sidecar (still running)]
                    │
       Pod never reaches Succeeded phase
       Job never completes
       Manual workaround: shareProcessNamespace + kill sidecar

1.28+ Native sidecar solution:

  initContainers:
    - name: vault-agent
      restartPolicy: Always   # ← This makes it a sidecar container
      ...
    - name: envoy
      restartPolicy: Always
      ...
  containers:
    - name: main-worker
      ...

Shutdown order: main exits 0 → sidecars receive SIGTERM → Pod Succeeded ✓

# Native sidecar in a Job (1.28+, stable 1.33)
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: vault-agent
          image: vault:1.15
          restartPolicy: Always   # sidecar designation
          args: ["agent", "-config=/vault/config"]
          volumeMounts:
            - name: vault-config
              mountPath: /vault/config
        - name: datadog-agent
          image: datadog/agent:7
          restartPolicy: Always   # sidecar designation
          env:
            - name: DD_API_KEY
              valueFrom:
                secretKeyRef:
                  name: datadog-secret
                  key: api-key
      containers:
        - name: worker
          image: batch-worker:v2
          command: ["./process"]

Native sidecars in Jobs (stable 1.33)
With restartPolicy: Always in an initContainer, Kubernetes treats it as a sidecar: it starts before regular containers, receives SIGTERM when the main container exits, and its exit does not fail the Pod. This cleanly solves the Job sidecar problem without shell hacks.

CronJob

CronJob creates Job objects on a time-based schedule. The CronJob controller runs in kube-controller-manager and periodically checks whether a new Job should be spawned based on the schedule and concurrency policy.

CronJob lifecycle:

CronJob controller (runs every 10s by default)
      │
      ▼
Evaluate schedule: should a new Job run now?
      │
      ├─ Yes     → create Job → Job controller creates Pods
      ├─ No      → wait
      └─ Missed? → check startingDeadlineSeconds

CronJob owns Jobs via ownerRef
Jobs own Pods via ownerRef
CronJob GC: prune old Jobs per history limits

CronJob Spec

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
  namespace: analytics
spec:
  # --- Schedule ---
  schedule: "0 2 * * *"          # Every day at 02:00
  timeZone: "America/New_York"   # Named timezone (GA 1.27); unset = kube-controller-manager's local time zone

  # --- Concurrency ---
  concurrencyPolicy: Forbid      # Allow | Forbid | Replace

  # --- Missed schedule deadline ---
  startingDeadlineSeconds: 300   # Allow up to 5 min late start; nil = no deadline

  # --- History ---
  successfulJobsHistoryLimit: 3  # Keep last 3 successful Jobs (default: 3)
  failedJobsHistoryLimit: 1      # Keep last 1 failed Job (default: 1)

  # --- Suspension ---
  suspend: false                 # Suspend scheduling (existing Jobs unaffected)

  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 3600
      ttlSecondsAfterFinished: 86400   # Deletes each Job after 24h, so the history limits above rarely apply
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: reporter
              image: analytics/reporter:v3
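              # env values are not shell-evaluated: "$(date ...)" would reach the
              # container as a literal string, so compute the date in a shell instead.
              # ./reporter is a hypothetical entrypoint; date -d needs GNU date in the image.
              command: ["/bin/sh", "-c"]
              args:
                - |
                  export REPORT_DATE="$(date -d yesterday +%Y-%m-%d)"
                  exec ./reporter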
              resources:
                requests:
                  cpu: 500m
                  memory: 1Gi
                limits:
                  cpu: 2
                  memory: 4Gi

Cron Schedule Syntax

Field	Range	Special chars
Minute	0–59	`* , - /`
Hour	0–23	`* , - /`
Day of month	1–31	`* , - / ?`
Month	1–12 or JAN–DEC	`* , - /`
Day of week	0–6 (Sun=0) or SUN–SAT	`* , - / ?`

Example schedule	Meaning
`0 * * * *`	Every hour at :00
`*/15 * * * *`	Every 15 minutes
`0 2 * * *`	Daily at 02:00 (in the CronJob's time zone)
`0 9 * * 1-5`	Weekdays at 09:00
`0 0 1 * *`	Monthly, 1st day at midnight
`0 0 * * 0`	Every Sunday at midnight
`@hourly`	Macro = `0 * * * *`
`@daily`	Macro = `0 0 * * *`
`@weekly`	Macro = `0 0 * * 0`
`@monthly`	Macro = `0 0 1 * *`

Time Zones (GA 1.27)

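Before the timeZone field, schedules were always interpreted in the kube-controller-manager's local time zone (UTC on most clusters), which is still the behavior when timeZone is unset; teams worked around this by shifting schedule values manually. The timeZone field accepts IANA timezone names from the tz database.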

spec:
  schedule: "0 9 * * 1-5"         # 09:00 local time, weekdays
  timeZone: "Europe/Berlin"        # CET/CEST automatically handled

# Common zones
# America/New_York     UTC-5/UTC-4 (EST/EDT)
# America/Los_Angeles  UTC-8/UTC-7 (PST/PDT)
# Asia/Tokyo           UTC+9
# Australia/Sydney     UTC+10/UTC+11
# UTC                  Always UTC (explicit is better than implicit)

DST edge cases
When a clock change makes a local time ambiguous (e.g., 02:30 occurs twice during fall-back) or nonexistent (spring-forward), a run scheduled in that window may be skipped or repeated. Avoid scheduling inside the transition hour in DST-observing zones, and audit CronJobs whose business logic depends on local time, since their UTC offset shifts by one hour seasonally.

Concurrency Policy

Policy	Behavior when previous Job still running	Use case
`Allow` (default)	Create new Job anyway — multiple Jobs may run simultaneously	Independent periodic tasks; each run is isolated
`Forbid`	Skip this schedule tick; record a missed schedule	Non-reentrant jobs (DB maintenance, cache warm-up)
`Replace`	Delete the current running Job, create a new one	Stateless jobs that must always run on fresh data; old run is stale

Replace deletes the running Job
With concurrencyPolicy: Replace, the in-flight Job is forcefully deleted (all its Pods are terminated) before the new Job starts. Any partially completed work is lost. Only use Replace when jobs are fully idempotent and partial runs have no side effects.
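
The three policies differ only in what happens when a tick fires while an earlier Job is still active. A minimal decision sketch follows; on_schedule_tick and the Job names are hypothetical and this is not the controller's code.

```python
def on_schedule_tick(policy: str, active_jobs: list, new_job: str) -> list:
    """Return the actions taken when a schedule tick fires."""
    if not active_jobs or policy == "Allow":
        return [f"create {new_job}"]
    if policy == "Forbid":
        return ["skip"]
    if policy == "Replace":
        # Every still-running Job is deleted before the new one is created.
        return [f"delete {name}" for name in active_jobs] + [f"create {new_job}"]
    raise ValueError(f"unknown concurrencyPolicy: {policy}")


print(on_schedule_tick("Replace", ["report-100"], "report-101"))
# ['delete report-100', 'create report-101']
```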

Missed Schedules & startingDeadlineSeconds

If the CronJob controller is unavailable (controller-manager downtime, cluster upgrade) or a Job is stuck, schedule ticks may be missed. The controller catches up by counting missed schedules since the last successful run.

100-missed-schedule limit
If more than 100 schedule ticks have been missed since the last run (or since the CronJob was created), the controller stops scheduling entirely and logs:

Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.

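This does not clear on its own while the backlog stays above 100. Set or lower startingDeadlineSeconds so that only ticks inside that window are counted, or delete and recreate the CronJob, which starts the count again from the new creation time.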

spec:
  startingDeadlineSeconds: 300
  # If the job cannot start within 5 minutes of its scheduled time,
  # skip this occurrence and wait for the next schedule tick.
  # null (default) = no deadline; may trigger the 100-missed-schedule trap
  # if controller is down for many schedule periods
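
To see why a deadline avoids the trap, the sketch below counts missed ticks for a fixed-interval schedule. It is an approximation under stated assumptions: missed_ticks is an illustrative helper, real cron schedules are not always evenly spaced, and the controller's own bookkeeping may differ in detail.

```python
from datetime import datetime, timedelta
from typing import Optional


def missed_ticks(last_schedule: datetime, now: datetime, interval: timedelta,
                 starting_deadline: Optional[timedelta] = None) -> int:
    """Count schedule ticks in the window the controller looks back over."""
    earliest = last_schedule
    if starting_deadline is not None:
        # With a deadline, only ticks inside the last `starting_deadline` count.
        earliest = max(earliest, now - starting_deadline)
    if now <= earliest:
        return 0
    return int((now - earliest) / interval)


last = datetime(2024, 1, 1, 8, 29)
now = datetime(2024, 1, 1, 10, 21)
every_minute = timedelta(minutes=1)
print(missed_ticks(last, now, every_minute))                          # 112
print(missed_ticks(last, now, every_minute, timedelta(seconds=200)))  # 3
```

An every-minute CronJob that was blocked from 08:29 to 10:21 has 112 missed ticks, which is over the limit; with a 200-second deadline only 3 are counted and scheduling resumes.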

lastScheduleTime in the CronJob status shows when the last Job was successfully spawned. Use this to detect scheduling staleness:

# Check last schedule time
kubectl get cronjob nightly-report -o jsonpath='{.status.lastScheduleTime}'

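# List all Jobs owned by this CronJob (matched by ownerReference; Jobs only carry
# a CronJob label if you add one under jobTemplate.metadata.labels)
kubectl get jobs -o json | jq -r \
  '.items[] | select(any(.metadata.ownerReferences[]?; .kind == "CronJob" and .name == "nightly-report")) | .metadata.name'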

# Manually trigger a CronJob (create Job from template)
kubectl create job --from=cronjob/nightly-report nightly-report-manual-$(date +%s)

History Limits

CronJob prunes old Jobs to prevent unbounded accumulation. The controller keeps the N most recent Jobs of each type.

Field	Default	Recommendation
`successfulJobsHistoryLimit`	3	3–10 for debugging; 0 to disable (not recommended for prod)
`failedJobsHistoryLimit`	1	3–5 to preserve failure logs; higher if jobs are long-lived

Logs lost when Job is pruned
When a Job is deleted (by history limits or TTL), its Pods and their logs are deleted too unless you have a log aggregation pipeline (Loki, Elasticsearch, Datadog). Always ship logs to external storage for batch jobs before relying on history limits for debugging.

Resource Management for Batch

PriorityClass for Batch Isolation

# Batch priority class — lower than production workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100              # Production = 1000; system-cluster-critical = 2000000000
globalDefault: false
preemptionPolicy: Never # Batch should not preempt production pods
description: "Low-priority batch workloads"
---
# Interactive/short jobs
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-high
value: 500
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Time-sensitive batch jobs (e.g., SLA-bound reports)"
---
# Use in Job template
spec:
  template:
    spec:
      priorityClassName: batch-low

Dedicated Batch Nodes

# Taint batch nodes to prevent non-batch workloads from landing there
kubectl taint nodes batch-node-1 batch=true:NoSchedule
kubectl taint nodes batch-node-2 batch=true:NoSchedule

# Label batch nodes
kubectl label nodes batch-node-1 workload-type=batch
kubectl label nodes batch-node-2 workload-type=batch

# Job template tolerations + nodeSelector for batch nodes
spec:
  template:
    spec:
      tolerations:
        - key: batch
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        workload-type: batch
      # Low priority: batch pods yield to production and can be preempted by it
      priorityClassName: batch-low

Right-sizing Batch Resources

Batch jobs often have predictable resource profiles. Over-requesting wastes cluster capacity; under-requesting causes OOM kills that count as failures.

resources:
  requests:
    cpu: "500m"      # What the scheduler uses for placement
    memory: "1Gi"    # Set to p95 of observed usage
  limits:
    cpu: "4"         # Generous CPU limit (throttling is recoverable)
    memory: "2Gi"    # Tight memory limit = OOM = pod failure
                     # Memory limit should be >= p99.9 of usage

Memory OOM in batch jobs
For batch jobs processing variable-size inputs (e.g., large files), memory needs may vary per run. Either provision generously, use VPA recommendations (see VPA page), or implement input-size-based resource selection in your workflow engine. Pod OOM kills increment the backoff counter.

KEDA for Event-Driven Job Scaling

KEDA (Kubernetes Event-Driven Autoscaling) can trigger Jobs based on queue depth — creating zero Jobs when the queue is empty and scaling to N Jobs as messages accumulate. This complements CronJob (time-based) with event-driven batch execution.

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: queue-processor
  namespace: platform
spec:
  jobTargetRef:
    parallelism: 5
    completions: 5
    backoffLimit: 3
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: processor
            image: queue-processor:v2
            resources:
              requests:
                cpu: 200m
                memory: 256Mi
  pollingInterval: 30        # Check queue every 30 seconds
  maxReplicaCount: 50        # Max concurrent Jobs
  scalingStrategy:
    strategy: "accurate"     # For queues whose reported length excludes in-flight (locked) messages
  triggers:
    - type: redis
      metadata:
        address: redis-svc.platform.svc.cluster.local:6379
        listName: work-queue
        listLength: "5"      # One Job per 5 messages (batch processing)

Operational Commands

# --- Job operations ---
# List jobs with status
kubectl get jobs -n platform -o wide

# Watch a job complete
kubectl get job data-migration-v2 -w

# Get job status details
kubectl describe job data-migration-v2

# Get job completion ratio
kubectl get job data-migration-v2 \
  -o jsonpath='{.status.succeeded}/{.spec.completions} succeeded'

# View logs from all pods of a job
kubectl logs -l job-name=data-migration-v2 --all-containers

# Delete successfully completed jobs older than 1 day (manual cleanup; failed Jobs have no completionTime)
kubectl get jobs -o json | jq -r \
  '.items[] | select(.status.completionTime != null) |
   select(now - (.status.completionTime | fromdateiso8601) > 86400) |
   .metadata.name' | xargs -I{} kubectl delete job {}

# --- CronJob operations ---
# List cronjobs with schedule and last schedule time
kubectl get cronjobs -o wide

# Manually trigger a CronJob
kubectl create job --from=cronjob/nightly-report nightly-report-manual-$(date +%s)

# Suspend a CronJob (stop new jobs from being created)
kubectl patch cronjob nightly-report -p '{"spec":{"suspend":true}}'

# Resume a CronJob
kubectl patch cronjob nightly-report -p '{"spec":{"suspend":false}}'

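# List Jobs created by a specific CronJob (matched by ownerReference)
kubectl get jobs -o json | jq -r \
  '.items[] | select(any(.metadata.ownerReferences[]?; .kind == "CronJob" and .name == "nightly-report")) | .metadata.name'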

# View CronJob events (missed schedules, etc.)
kubectl describe cronjob nightly-report

# --- Indexed Job debugging ---
# Get pods for each index
kubectl get pods -l job-name=process-shards -L batch.kubernetes.io/job-completion-index

# Get logs for specific index (one comma-separated selector; the index label needs 1.28+)
kubectl logs -l job-name=process-shards,batch.kubernetes.io/job-completion-index=3

Job Status Fields

Field	Type	Description
`status.active`	int	Number of currently running pods
`status.succeeded`	int	Number of successfully completed pods
`status.failed`	int	Number of failed pods (total, not just toward backoffLimit)
`status.completedIndexes`	string	Compact range notation of completed indexes (Indexed mode)
`status.failedIndexes`	string	Compact range notation of failed indexes (with backoffLimitPerIndex)
`status.startTime`	time	When the Job was acknowledged by the controller
`status.completionTime`	time	When the Job completed successfully (not set for failed Jobs)
`status.conditions`	[]Condition	`Complete`, `Failed`, `Suspended`, `FailureTarget`
`status.uncountedTerminatedPods`	object	Pods that terminated but haven't been counted yet (transient)

Metrics

Metric	Labels	Use
`kube_job_status_active`	`job_name`, `namespace`	Currently running pods in a Job
`kube_job_status_succeeded`	`job_name`, `namespace`	Successful completions
`kube_job_status_failed`	`job_name`, `namespace`	Cumulative failures
`kube_job_complete`	`job_name`, `namespace`, `condition`	State of the Job's Complete condition (`condition` = true, false, or unknown); failed Jobs are reported by `kube_job_failed`
`kube_cronjob_next_schedule_time`	`cronjob`, `namespace`	Unix timestamp of next scheduled execution

Alerting Rules

groups:
  - name: jobs-cronjobs
    rules:
      # Job failed
      - alert: JobFailed
        expr: kube_job_failed{condition="true"} > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Job {{ $labels.namespace }}/{{ $labels.job_name }} has failed"
          description: "Check logs: kubectl logs -l job-name={{ $labels.job_name }} -n {{ $labels.namespace }}"

      # Job taking too long (no activeDeadlineSeconds set)
      - alert: JobStalled
        expr: |
          kube_job_status_active > 0
          and (time() - kube_job_status_start_time) > 7200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job_name }} has been running for > 2 hours"

      # CronJob not scheduled on time (the 3600s threshold assumes an hourly or more frequent schedule)
      - alert: CronJobMissedSchedule
        expr: |
          time() - kube_cronjob_status_last_schedule_time > 3600
          unless kube_cronjob_spec_suspend == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} missed last schedule"

      # CronJob suspended unexpectedly
      - alert: CronJobSuspended
        expr: kube_cronjob_spec_suspend == 1
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "CronJob {{ $labels.cronjob }} has been suspended for > 1 hour"

Runbooks

Job Stuck / Not Completing

# 1. Check job status and conditions
kubectl describe job <job-name> -n <namespace>

# 2. Check pod states
kubectl get pods -l job-name=<job-name> -n <namespace>

# 3. Look at logs from failed pods (add --previous for containers restarted in place under restartPolicy: OnFailure)
kubectl logs -l job-name=<job-name> -n <namespace>

# 4. Check if backoffLimit exhausted
kubectl get job <job-name> -o jsonpath='{.status.failed}/{.spec.backoffLimit}'

# 5. Check for resource constraints (Pending pods)
kubectl describe pods -l job-name=<job-name> -n <namespace> | grep -A5 Events

CronJob Stopped Scheduling (100-missed limit)

# Check events
kubectl describe cronjob <name> -n <namespace> | grep -A20 Events

# If "Too many missed start time" error:
# Option 1: delete and recreate (loses history)
kubectl delete cronjob <name> -n <namespace>
kubectl apply -f cronjob.yaml

# Option 2: add/reduce startingDeadlineSeconds to reset the window
kubectl patch cronjob <name> -p '{"spec":{"startingDeadlineSeconds":300}}'

# Option 3: check for controller-manager issues
kubectl logs -n kube-system -l component=kube-controller-manager | grep cronjob

Indexed Job Index Stuck

# Which indexes are complete?
kubectl get job <job-name> -o jsonpath='{.status.completedIndexes}'

# Which indexes failed (with backoffLimitPerIndex)?
kubectl get job <job-name> -o jsonpath='{.status.failedIndexes}'

# Get pods for a specific index
kubectl get pods -l job-name=<job-name>,batch.kubernetes.io/job-completion-index=5

# backoffLimitPerIndex is immutable, so a failed index cannot be retried in place.
# Re-run the failed indexes in a new Job once the root cause is fixed.

Manually Triggering a CronJob

# One-shot manual run from CronJob template
JOB_NAME=<cronjob-name>-manual-$(date +%s)
kubectl create job --from=cronjob/<cronjob-name> "$JOB_NAME" -n <namespace>

# Monitor the manually triggered Job
kubectl get job "$JOB_NAME" -n <namespace> -w

Job Pod Failures Due to Node Preemption

# Check if DisruptionTarget condition is set on failed pods
kubectl get pods -l job-name=<job-name> -o json | \
  jq '.items[].status.conditions[] | select(.type=="DisruptionTarget")'

# Solution: add podFailurePolicy to Ignore DisruptionTarget (see above)
# This prevents preemption from consuming backoffLimit retries

CronJob Job Not Appearing

# Check if CronJob is suspended
kubectl get cronjob <name> -o jsonpath='{.spec.suspend}'

# Check concurrencyPolicy blocking new jobs
kubectl get cronjob <name> -o jsonpath='{.spec.concurrencyPolicy}'
kubectl get jobs --sort-by=.metadata.creationTimestamp -o json | jq -r \
  '.items[] | select(any(.metadata.ownerReferences[]?; .kind == "CronJob" and .name == "<name>")) | .metadata.name'

# Check controller-manager logs for scheduling decisions
kubectl logs -n kube-system -l component=kube-controller-manager --tail=200 | grep <name>

Best Practices

Always set activeDeadlineSeconds — without it, a hung Job runs forever and its pods accumulate. Set it to 2–3× the expected runtime.
Use ttlSecondsAfterFinished to prevent Job accumulation. Even with CronJob history limits, standalone Jobs need explicit cleanup.
Set backoffLimit: 0 for non-idempotent operations (schema migrations, one-time data fixes). Retrying a migration that partially ran can corrupt data.
Add podFailurePolicy with DisruptionTarget: Ignore on any Job that runs on spot or preemptible nodes — prevents spot reclaims from wasting retry budget.
Use Indexed completion mode for sharded workloads — deterministic index-to-shard mapping eliminates distributed coordination overhead.
Use backoffLimitPerIndex for large Indexed jobs — prevents one hot shard from exhausting global retries.
Never set concurrencyPolicy: Replace on stateful jobs: the in-flight Job is deleted as soon as the next tick fires and any partial work is lost. Only safe for truly idempotent, stateless work.
Set explicit timeZone on CronJobs: without it the schedule follows the kube-controller-manager's local time zone (usually UTC), which confuses teams in other zones; in DST-observing zones, check for missed or doubled runs around clock changes.

Job vs CronJob vs Deployment

Dimension	Job	CronJob	Deployment
Lifecycle	Run-to-completion	Recurring run-to-completion	Long-running (always on)
Pod restartPolicy	Never or OnFailure	Never or OnFailure	Always
Completion tracking	Yes (succeeded count)	Yes (per-Job)	No (desired replicas)
Scheduling	Immediate on creation	Time-based (cron)	Continuous
History	TTL or manual	successfulJobsHistoryLimit / failedJobsHistoryLimit	ReplicaSet revisions
Parallelism	spec.parallelism	Via Job template	spec.replicas
Failure semantics	backoffLimit, podFailurePolicy	Inherited from Job	Pods restarted and replaced indefinitely