Scheduling

    How pods are placed on nodes: the scheduler framework, affinity, taints, topology spread, and priority

    kube-scheduler Kubernetes 1.19+ Platform Engineer

    The Kubernetes scheduler watches for unscheduled pods and selects the best node for each one by running the pod through a pipeline of filter and scoring plugins. Understanding this pipeline — and the pod-level constraints that influence it — is essential for building clusters that achieve high utilization, fault tolerance, zone awareness, and hardware affinity simultaneously.

    Scheduler Framework

    The scheduler framework (GA 1.19) is a plugin-based architecture where all scheduling logic is implemented as plugins hooked into named extension points. The default scheduler ships with plugins for resource fitting, affinity, topology spread, and more. Custom schedulers extend or replace these plugins.

    Scheduling pipeline for a single pod:

    Unscheduled pod enters activeQ (priority queue)
            │
            ▼
    ┌──────────────────────────────────────────────────────────┐
    │ FILTER PHASE: eliminate nodes that cannot host the pod   │
    │                                                          │
    │ PreFilter → pre-compute data for filter plugins          │
    │ Filter    → NodeResourcesFit, NodeAffinity,              │
    │             TaintToleration, NodePorts,                  │
    │             PodTopologySpread, VolumeBinding, etc.       │
    │                                                          │
    │ Feasible nodes = nodes that passed ALL filters           │
    │ If 0 feasible nodes → pod goes to unschedulableQ         │
    └──────────────────────────────────────────────────────────┘
            │
            ▼
    ┌──────────────────────────────────────────────────────────┐
    │ SCORE PHASE: rank feasible nodes (0–100 per plugin)      │
    │                                                          │
    │ PreScore       → pre-compute data for score plugins      │
    │ Score          → NodeResourcesBalancedAllocation,        │
    │                  ImageLocality, InterPodAffinity,        │
    │                  NodeAffinity (preferred), etc.          │
    │ NormalizeScore → normalize scores to 0–100               │
    │                                                          │
    │ Final score = weighted sum of all plugin scores          │
    └──────────────────────────────────────────────────────────┘
            │
            ▼
    Selected node (highest score)
            │
            ▼
    Reserve → Permit → PreBind → Bind → PostBind
    (reserve resources, wait for permits, write pod.spec.nodeName)

    | Extension point | Phase | Purpose |
    |---|---|---|
    | QueueSort | Queue | Order pods in the scheduling queue (default: PrioritySort) |
    | PreFilter | Filter | Pre-compute or validate data needed by Filter plugins |
    | Filter | Filter | Eliminate nodes that cannot satisfy pod constraints |
    | PostFilter | Filter | Runs only when no feasible node was found (e.g., preemption: find pods to evict) |
    | PreScore | Score | Pre-compute data for Score plugins |
    | Score | Score | Rank feasible nodes 0–100 per plugin |
    | NormalizeScore | Score | Normalize per-plugin scores before weighting |
    | Reserve | Bind | Reserve resources (VolumeBinding reserves PVCs here) |
    | Permit | Bind | Allow/deny/wait before binding (used by gang schedulers) |
    | PreBind | Bind | Pre-bind work (e.g., provision volumes) |
    | Bind | Bind | Write pod.spec.nodeName to API server |
    | PostBind | Bind | Informational: cleanup, metrics after bind |
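
    The filter-then-score flow can be illustrated with a toy model. This is a minimal sketch, not the kube-scheduler implementation: the node fields (free_cpu, has_image), the two scoring functions, and the tie-break rule are all hypothetical stand-ins chosen to show filtering, per-plugin normalization to 0–100, and the weighted sum.

```python
from typing import Dict, Optional

# Hypothetical node record: free CPU in millicores and whether the image is cached (0/1).
Node = Dict[str, int]


def schedule(pod_cpu: int, nodes: Dict[str, Node],
             weights: Dict[str, int]) -> Optional[str]:
    """Pick a node name for a pod, or None if no node is feasible."""
    # Filter phase: keep only nodes with enough free CPU.
    feasible = {name: n for name, n in nodes.items() if n["free_cpu"] >= pod_cpu}
    if not feasible:
        return None  # the real scheduler would run PostFilter plugins here

    # Score phase: each toy "plugin" produces a raw score per feasible node.
    raw = {
        "least_allocated": {name: n["free_cpu"] - pod_cpu for name, n in feasible.items()},
        "image_locality": {name: n["has_image"] for name, n in feasible.items()},
    }

    totals = {name: 0 for name in feasible}
    for plugin, scores in raw.items():
        top = max(scores.values())
        for name, score in scores.items():
            # NormalizeScore: rescale this plugin's scores to the 0..100 range.
            normalized = 100 * score // top if top > 0 else 0
            totals[name] += weights[plugin] * normalized

    # Highest weighted sum wins; ties go to the alphabetically first node
    # (an arbitrary choice made here only for determinism).
    return max(sorted(totals), key=lambda name: totals[name])
```

    Raising the weight of one plugin shifts the outcome toward the nodes that plugin favors, which is the same lever the weight field exposes in a scheduler profile.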

    Scheduling Queue

    The scheduler maintains three queues for pods:

    activeQ        → pods ready to be scheduled (priority-ordered)
    backoffQ       → pods that failed scheduling; waiting with exponential backoff
                     (by default 1s → 2s → 4s ... capped at 10s; set via
                     podInitialBackoffSeconds and podMaxBackoffSeconds)
    unschedulableQ → pods that cannot be scheduled with current cluster state
                     (re-checked when cluster state changes: new node, pod deleted, etc.)

    Pod enters activeQ on creation
    Scheduling fails → moved to backoffQ or unschedulableQ
    Cluster event (new node added, pod deleted) → unschedulableQ pods are
      re-evaluated and may move back to activeQ

    # See why a pod is unschedulable
    kubectl describe pod <pod> -n <namespace> | grep -A20 Events
    # Look for: "0/10 nodes are available: 3 Insufficient cpu, 7 node(s) had taint..."
    
    # Check scheduler logs
    kubectl logs -n kube-system -l component=kube-scheduler --tail=100 | grep <pod-name>

    nodeSelector and nodeName

    nodeSelector

    The simplest node constraint: a map of key-value labels that the node must have. All entries must match (AND logic).

    spec:
      nodeSelector:
        kubernetes.io/os: linux
        node.kubernetes.io/instance-type: m5.2xlarge
        topology.kubernetes.io/zone: us-east-1a

    nodeName (bypasses scheduler)

    Setting spec.nodeName directly assigns the pod to a specific node, bypassing the scheduler entirely — no filter or score plugins run. The pod is bound to the named node regardless of resource availability.

    spec:
      nodeName: worker-node-3    # Bypasses scheduler; skips all filters
    nodeName bypasses all scheduling checks
    A pod with spec.nodeName set skips every scheduler check: it is bound to that node even if the node lacks sufficient resources, carries NoSchedule taints the pod does not tolerate, or is cordoned. The kubelet still runs its own admission checks and may reject the pod (for example, the pod ends up Failed with reason OutOfcpu or OutOfmemory), and NoExecute taints still evict it. Only use nodeName for static system pods or testing, never in production workloads.

    Node Affinity

    Node affinity is a more expressive replacement for nodeSelector, supporting operators, weight-based preferences, and separation of scheduling-time vs runtime requirements.

    spec:
      affinity:
        nodeAffinity:
          # HARD requirement — pod will not schedule if not satisfied
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values: [amd64, arm64]
                  - key: node.kubernetes.io/instance-type
                    operator: NotIn
                    values: [t3.micro, t3.small]   # Exclude small instances
    
          # SOFT preference — scheduler prefers but doesn't require
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80          # Higher weight = stronger preference (1–100)
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: [us-east-1a]     # Prefer zone a
    
            - weight: 20
              preference:
                matchExpressions:
                  - key: node-type
                    operator: In
                    values: [compute-optimized]
    | Operator | Meaning | Example |
    |---|---|---|
    | In | Label value is in the set | zone In [a, b] |
    | NotIn | Label value is NOT in the set | type NotIn [spot] |
    | Exists | Label key exists (any value) | gpu Exists |
    | DoesNotExist | Label key does not exist | spot DoesNotExist |
    | Gt | Label value (integer) greater than | generation Gt [2] |
    | Lt | Label value (integer) less than | generation Lt [5] |
    IgnoredDuringExecution
    Both required and preferred node affinity are "IgnoredDuringExecution" — if a node's labels change after a pod is scheduled, the pod continues running. A future RequiredDuringSchedulingRequiredDuringExecution type would evict pods whose nodes no longer match, but this has not been implemented.
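
    To make the operator semantics concrete, here is a minimal sketch of how a required node affinity could be evaluated against a node's labels. It is an illustration, not the scheduler's code: the function names are hypothetical, and it assumes the documented rules that nodeSelectorTerms are ORed, matchExpressions inside one term are ANDed, and NotIn also matches a node that lacks the label.

```python
from typing import Any, Dict, List

Expr = Dict[str, Any]  # {"key": str, "operator": str, "values": [str, ...]}


def expr_matches(labels: Dict[str, str], expr: Expr) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op == "NotIn":
        return labels.get(key) not in values  # a missing label also satisfies NotIn
    if key not in labels:
        return False
    if op == "In":
        return labels[key] in values
    if op in ("Gt", "Lt"):
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (ValueError, IndexError):
            return False  # non-integer label or missing bound never matches
        return actual > bound if op == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {op}")


def node_matches(labels: Dict[str, str], terms: List[Dict[str, Any]]) -> bool:
    """Terms are ORed; expressions within a term are ANDed; an empty term matches nothing."""
    for term in terms:
        exprs = term.get("matchExpressions", [])
        if exprs and all(expr_matches(labels, e) for e in exprs):
            return True
    return False
```

    Because terms are ORed, adding a second nodeSelectorTerms entry widens the set of eligible nodes, while adding an expression to an existing term narrows it.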

    Pod Affinity and Anti-Affinity

    Pod affinity/anti-affinity places pods relative to other pods, using topology domains (zone, node, rack). This enables co-location (cache next to app) and spreading (replicas on different nodes).

    spec:
      affinity:
        # Co-locate with pods that have app=redis (same node)
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: redis
              topologyKey: kubernetes.io/hostname   # "same node"
              namespaces: [production]              # Optional: limit to namespace
    
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    tier: cache
                topologyKey: topology.kubernetes.io/zone  # Prefer same zone as cache
    
        # Anti-affinity: spread replicas across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: api-server        # Don't place on same node as another api-server pod
              topologyKey: kubernetes.io/hostname
    
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: api-server
                topologyKey: topology.kubernetes.io/zone  # Prefer different zones
    Required pod anti-affinity scales as O(n²) at large replica counts
    With requiredDuringScheduling pod anti-affinity and topologyKey: kubernetes.io/hostname, each new pod must check all existing pods. At hundreds of replicas, scheduler throughput drops significantly. For large-scale spreading, prefer topologySpreadConstraints (below) — it is specifically optimized for this case.

    Taints and Tolerations

    Taints mark nodes as unsuitable for pods that don't explicitly tolerate the taint. They are the node-side counterpart to node affinity.

    # Apply a taint to a node
    kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
    kubectl taint nodes spot-node-2 spot=true:PreferNoSchedule
    kubectl taint nodes maintenance-node maintenance=true:NoExecute   # custom key; node.kubernetes.io/* taints are managed by Kubernetes
    
    # Remove a taint
    kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule-
    | Effect | Scheduling behavior | Running pods behavior |
    |---|---|---|
    | NoSchedule | Pod will NOT be scheduled on this node unless it tolerates the taint | Existing pods on the node are NOT evicted |
    | PreferNoSchedule | Scheduler tries to avoid this node; will use it if no other option | Existing pods unaffected |
    | NoExecute | Pod will NOT be scheduled on this node unless it tolerates the taint | Pods without a matching toleration are evicted immediately; pods whose toleration sets tolerationSeconds are evicted after that many seconds |
    spec:
      tolerations:
        # Exact match toleration
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
    
        # Tolerate any value for this key
        - key: spot
          operator: Exists
          effect: PreferNoSchedule
    
        # Tolerate NoExecute with a grace period (stay up to 300s after taint added)
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 300   # Pod evicted 300s after node becomes NotReady
    
        # Catch-all: tolerate ALL taints on the node (use sparingly)
        - operator: Exists
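
    The matching rules behind these tolerations can be sketched in a few lines. This is a simplified illustration with hypothetical function names, assuming the documented semantics: an omitted effect matches every effect, an omitted key with operator Exists matches every taint, and only NoSchedule and NoExecute taints block placement (PreferNoSchedule only lowers a node's score).

```python
from typing import Dict, List

Taint = Dict[str, str]       # {"key": ..., "value": ..., "effect": ...}
Toleration = Dict[str, str]  # {"key": ..., "operator": ..., "value": ..., "effect": ...}


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    # An empty effect in the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    op = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        # An empty key is only valid with Exists and matches every taint key.
        return op == "Exists"
    if key != taint["key"]:
        return False
    if op == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def blocking_taints(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Taints that keep the pod off the node: untolerated NoSchedule or NoExecute."""
    return [
        t for t in taints
        if t["effect"] in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]
```

    A node is feasible for the pod only when blocking_taints returns an empty list, which is why the catch-all toleration makes a pod eligible for every tainted node.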

    Built-in Node Taints

    | Taint key | Added when | Effect |
    |---|---|---|
    | node.kubernetes.io/not-ready | Node condition NotReady | NoExecute |
    | node.kubernetes.io/unreachable | Node unreachable from controller | NoExecute |
    | node.kubernetes.io/memory-pressure | MemoryPressure condition | NoSchedule |
    | node.kubernetes.io/disk-pressure | DiskPressure condition | NoSchedule |
    | node.kubernetes.io/pid-pressure | PIDPressure condition | NoSchedule |
    | node.kubernetes.io/unschedulable | Node cordoned (kubectl cordon) | NoSchedule |
    | node.kubernetes.io/network-unavailable | Network not configured by CNI | NoSchedule |
    Default tolerations auto-injected at admission
    Pods that do not already specify them are automatically given tolerationSeconds: 300 tolerations for the not-ready and unreachable taints by the DefaultTolerationSeconds admission controller (the node lifecycle controller is what applies the taints themselves). This means pods survive 5 minutes of node unresponsiveness before being evicted and rescheduled elsewhere. Increase this for stateful workloads that need more time (e.g., databases renegotiating connections); decrease for stateless services that benefit from faster failover.

    Topology Spread Constraints

    Topology spread constraints (GA 1.19) spread pods evenly across topology domains — more efficient than pod anti-affinity and designed for large replica counts.

    spec:
      topologySpreadConstraints:
        # Spread across zones (hard constraint)
        - maxSkew: 1                        # Max difference between most and least loaded zone
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule  # Hard: block if skew would exceed maxSkew
          labelSelector:
            matchLabels:
              app: api-server
          minDomains: 3                     # Require at least 3 zones to exist (1.24+)
          nodeAffinityPolicy: Honor         # Consider node affinity in domain calculation (1.26+)
          nodeTaintsPolicy: Honor           # Consider node taints (1.26+)
    
        # Spread across nodes within each zone (soft constraint)
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway # Soft: schedule even if skew would exceed maxSkew
          labelSelector:
            matchLabels:
              app: api-server
          matchLabelKeys:                   # Also spread across rollout versions (1.27+)
            - pod-template-hash             # Each RS gets its own spread domain
    | Field | Default | Meaning |
    |---|---|---|
    | maxSkew | none (required, must be > 0) | Maximum allowed difference between a domain's matching pod count and the least-loaded domain's count |
    | topologyKey | none (required) | Node label key that defines the topology domain (zone, hostname, rack, etc.) |
    | whenUnsatisfiable | DoNotSchedule | DoNotSchedule (hard) or ScheduleAnyway (soft: schedule but penalize) |
    | labelSelector | none | Which pods to count when computing domain loads |
    | minDomains | nil (treated as 1) | Minimum number of eligible topology domains; while fewer exist, the least-loaded count is treated as 0, so each domain holds at most maxSkew pods. Requires DoNotSchedule |
    | matchLabelKeys | nil | Additional label keys (e.g., pod-template-hash) to scope spread per rollout revision |
    Topology spread example: maxSkew=1, 3 zones, 6 replicas

    Initial:  zone-a=2, zone-b=2, zone-c=2 → skew=0 ✓
    Add pod:  zone-a=3, zone-b=2, zone-c=2 → skew=1 ✓
    Add pod:  zone-a=3, zone-b=3, zone-c=2 → skew=1 ✓
    Add pod:  zone-a=4, zone-b=3, zone-c=2 → skew=2 ✗ (blocked with DoNotSchedule)
              → scheduler places in zone-c instead: a=3, b=3, c=3 → skew=0 ✓

    With ScheduleAnyway: pod is placed in the least-loaded zone as a best effort
    but is NOT blocked if all zones would exceed maxSkew
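
    The arithmetic in this example can be reproduced with a short sketch. It assumes the skew check described in the Kubernetes documentation (candidate domain's count after placement minus the current global minimum must not exceed maxSkew) and ignores node affinity, taints, and minDomains; the function name is hypothetical.

```python
from typing import Dict, List


def allowed_domains(counts: Dict[str, int], max_skew: int) -> List[str]:
    """Domains where one more matching pod keeps skew within max_skew.

    counts maps each topology domain to its current number of matching pods.
    """
    global_min = min(counts.values())
    return sorted(
        domain for domain, n in counts.items()
        if (n + 1) - global_min <= max_skew
    )
```

    With counts of 3, 3, and 2 and maxSkew=1, only the zone holding 2 pods is returned, matching the blocked placement in the walkthrough. With DoNotSchedule, an empty result would leave the pod Pending.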

    Cluster-Level Default Spread

    # KubeSchedulerConfiguration — apply default topology spread to all pods
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: default-scheduler
        pluginConfig:
          - name: PodTopologySpread
            args:
              defaultConstraints:
                - maxSkew: 3
                  topologyKey: topology.kubernetes.io/zone
                  whenUnsatisfiable: ScheduleAnyway
                - maxSkew: 5
                  topologyKey: kubernetes.io/hostname
                  whenUnsatisfiable: ScheduleAnyway
              defaultingType: List   # Only apply to pods without explicit topologySpreadConstraints

    Priority and Preemption

    PriorityClass assigns a numeric priority to pods. When a high-priority pod cannot be scheduled due to resource constraints, the scheduler may preempt (evict) lower-priority pods to make room.

    # Define priority classes
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: critical-production
    value: 1000000        # Higher number = higher priority
    globalDefault: false
    preemptionPolicy: PreemptLowerPriority  # Default: can preempt lower-priority pods
    description: "Critical production services"
    
    ---
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: batch-low
    value: 100
    globalDefault: false
    preemptionPolicy: Never  # This class cannot preempt other pods
    description: "Low-priority batch jobs"
    
    ---
    # Use in pod
    spec:
      priorityClassName: critical-production

    Built-in System Priority Classes

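    | Name | Value | Used by |
    |---|---|---|
    | system-cluster-critical | 2,000,000,000 | Cluster-wide add-ons such as cluster DNS (CoreDNS/kube-dns) |
    | system-node-critical | 2,000,001,000 | kubelet static pods, node-critical DaemonSets (typically kube-proxy and CNI agents) |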
    Preemption honors PodDisruptionBudgets only on a best-effort basis
    When the scheduler preempts pods to make room for a high-priority pod, it tries to choose victims whose removal does not violate a PodDisruptionBudget, but if no such victims exist it preempts anyway. PDBs are only guaranteed for the Eviction API (voluntary disruptions). A critical pod can therefore preempt multiple pods simultaneously, potentially violating quorum guarantees. Mitigate by setting appropriate priority values so high-priority pods don't preempt quorum-critical workloads.
    # Check pod priority
    kubectl get pod <pod> -o jsonpath='{.spec.priority} {.spec.priorityClassName}'
    
    # See which node the scheduler nominated for a pending pod after preemption
    kubectl get pod <high-priority-pod> -o jsonpath='{.status.nominatedNodeName}'
    
    # List all priority classes
    kubectl get priorityclasses --sort-by='.value'

    Multiple Schedulers and Profiles

    The scheduler supports multiple named profiles within a single binary. Each profile can enable/disable different plugins, allowing different scheduling behavior for different workload types.

    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      # Default profile: stock plugins (NodeResourcesFit scores with LeastAllocated, which spreads load)
      - schedulerName: default-scheduler
    
      # Batch profile: pack nodes tightly to maximize utilization
      - schedulerName: batch-scheduler
        plugins:
          score:
            disabled:
              - name: NodeResourcesBalancedAllocation  # Balancing works against packing
        pluginConfig:
          # In the v1 config API, MostAllocated is a scoring strategy of NodeResourcesFit, not a separate plugin
          - name: NodeResourcesFit
            args:
              scoringStrategy:
                type: MostAllocated
                resources:
                  - name: cpu
                    weight: 1
                  - name: memory
                    weight: 1
    # Use the batch scheduler for a specific pod
    spec:
      schedulerName: batch-scheduler   # Must match a profile name in KubeSchedulerConfiguration

    Descheduler

    The scheduler only places pods at creation time. Once running, pods stay on their initial node even if the cluster becomes imbalanced (new nodes added, workloads removed). The Descheduler is an add-on that periodically evicts pods to trigger rebalancing.

    Descheduler workflow (runs as CronJob or Deployment):
    1. Scan all pods against enabled strategies
    2. Identify pods that violate constraints or could be better placed
    3. Evict qualifying pods (via Eviction API, which respects PDBs)
    4. Pods are rescheduled by kube-scheduler to better nodes

    Key Strategies

    | Strategy | What it evicts | Use case |
    |---|---|---|
    | LowNodeUtilization | Pods from over-utilized nodes to under-utilized ones (thresholds: CPU/memory %) | Rebalance after node additions or workload changes |
    | RemoveDuplicates | Extra pods from the same ReplicaSet/Deployment on the same node (when topology spread was violated at creation) | Fix uneven distribution from rapid scale-outs |
    | RemovePodsViolatingNodeAffinity | Pods on nodes that no longer satisfy their node affinity (labels changed after scheduling) | Enforce node affinity rules retroactively |
    | RemovePodsViolatingTopologySpreadConstraint | Pods causing topology spread violations | Rebalance after node failures or additions |
    | RemovePodsHavingTooManyRestarts | Pods with excessive restarts (crash-looping) | Force reschedule of unstable pods to different nodes |
    | PodLifeTime | Pods older than a configured age | Enforce pod refresh cycle for security/version freshness |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    profiles:
      - name: default
        pluginConfig:
          - name: LowNodeUtilization
            args:
              thresholds:
                cpu: 20          # Underutilized = below ALL of these (% of allocatable, by requests)
                memory: 20
                pods: 20
              targetThresholds:
                cpu: 50          # Overutilized = above ANY of these; pods are evicted from such nodes
                memory: 50
                pods: 50
          - name: RemoveDuplicates
            args:
              excludeOwnerKinds: ["ReplicaSet"]  # Skip pods owned by these kinds; remove to deduplicate Deployment replicas
        plugins:
          balance:
            enabled:
              - LowNodeUtilization
              - RemoveDuplicates
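
    How LowNodeUtilization sorts nodes into buckets can be sketched as follows. This is a simplified illustration of the rule described in the Descheduler documentation (under-utilized only when below every threshold, over-utilized when above any target threshold); the function name and the usage dictionaries are hypothetical, and percentages are assumed to be computed from pod requests.

```python
from typing import Dict


def classify(usage: Dict[str, float],
             thresholds: Dict[str, float],
             target_thresholds: Dict[str, float]) -> str:
    """Classify a node by its usage percentages (cpu, memory, pods)."""
    # Underutilized: usage is below the threshold for ALL listed resources.
    if all(usage[r] < thresholds[r] for r in thresholds):
        return "underutilized"
    # Overutilized: usage is above the target threshold for ANY listed resource.
    if any(usage[r] > target_thresholds[r] for r in target_thresholds):
        return "overutilized"
    # Everything in between is left alone.
    return "appropriately-utilized"
```

    Pods are only moved when both buckets are non-empty: evicting from over-utilized nodes is pointless if no under-utilized node exists to absorb them.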

    Gang Scheduling

    Gang scheduling ensures that a group of pods is scheduled all-or-nothing. Without it, distributed ML training jobs (PyTorch, MPI) can deadlock: worker pods consume resources while the parameter server pod sits Pending, and no job makes progress.
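
    The deadlock is easy to reproduce in a toy model. The sketch below is illustrative only and does not reflect how Volcano or YuniKorn are implemented: capacity is reduced to a count of identical pod slots, and both function names are hypothetical. It contrasts pod-by-pod placement with all-or-nothing admission.

```python
from typing import Dict, List, Tuple

Gang = Tuple[str, int]  # (job name, minimum pods that must run together)


def naive_round_robin(free_slots: int, gangs: List[Gang]) -> Dict[str, int]:
    """Place one pod per job in turn until capacity runs out."""
    placed = {name: 0 for name, _ in gangs}
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for name, min_member in gangs:
            if free_slots > 0 and placed[name] < min_member:
                placed[name] += 1
                free_slots -= 1
                progress = True
    return placed


def admit_gangs(free_slots: int, gangs: List[Gang]) -> List[str]:
    """Admit a job only if all of its min_member pods fit; never admit partially."""
    admitted = []
    for name, min_member in gangs:
        if min_member <= free_slots:
            free_slots -= min_member
            admitted.append(name)
    return admitted
```

    With 8 free slots and two jobs that each need 6 pods, the round-robin placement gives each job 4 pods and neither can start, while all-or-nothing admission starts the first job and leaves the second waiting with no resources held.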

    Volcano

    # PodGroup — define the minimum pods needed to start
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: PodGroup
    metadata:
      name: pytorch-training-job
      namespace: ml-platform
    spec:
      minMember: 8          # All 8 pods must be schedulable before ANY start
      minResources:
        cpu: "32"
        memory: 64Gi
      queue: ml-training
      priorityClassName: ml-high
    
    ---
    # Reference the PodGroup from each pod (pod manifest fragment)
    metadata:
      annotations:
        scheduling.k8s.io/group-name: pytorch-training-job   # annotation Volcano reads to find the PodGroup
    spec:
      schedulerName: volcano

    Apache YuniKorn

    # YuniKorn queue-based gang scheduling (pod manifest fragment)
    metadata:
      labels:
        queue: root.ml-platform.training
      annotations:
        yunikorn.apache.org/task-group-name: pytorch-workers
        yunikorn.apache.org/task-groups: |
          [{
            "name": "pytorch-workers",
            "minMember": 8,
            "minResource": {"cpu": "4", "memory": "8Gi"}
          }]
    spec:
      schedulerName: yunikorn

    Scheduling Decision Guide

    | Goal | Mechanism | Notes |
    |---|---|---|
    | Require specific node type | nodeSelector or nodeAffinity.required | Use affinity for complex expressions |
    | Prefer certain nodes | nodeAffinity.preferred with weight | Multiple preferences sum their weights |
    | Keep pods off certain nodes | Taint + toleration; or nodeAffinity.required NotIn | Taint is node-side; affinity is pod-side |
    | Co-locate with another workload | podAffinity.required, topologyKey: hostname | E.g., app next to its sidecar cache |
    | Spread replicas across zones | topologySpreadConstraints with zone key | Prefer over anti-affinity for large fleets |
    | Spread replicas across nodes | topologySpreadConstraints with hostname key | Use maxSkew: 1 for tight spread |
    | Dedicated nodes for workload type | Taint nodes + add toleration to pods | nodeSelector ensures pods only go there |
    | Prioritize critical pods under pressure | PriorityClass with preemption | PDBs are honored only best-effort during preemption |
    | All-or-nothing batch scheduling | Volcano PodGroup / YuniKorn task groups | Prevents deadlock in distributed training |
    | Rebalance after cluster changes | Descheduler LowNodeUtilization | Evicts and reschedules, so it disrupts running pods |

    Metrics

    | Metric | Labels | Use |
    |---|---|---|
    | scheduler_pending_pods | queue (active/backoff/unschedulable) | Scheduling queue depth; rising unschedulable = cluster capacity issue |
    | scheduler_scheduling_attempt_duration_seconds | profile, result (scheduled/unschedulable/error) | Scheduling latency percentiles |
    | scheduler_preemption_victims | none (histogram: _bucket, _sum, _count) | Number of pods preempted per scheduling cycle |
    | scheduler_pod_scheduling_sli_duration_seconds | attempts | E2E scheduling SLI: time from the pod entering the scheduling queue until it is scheduled |
    | kube_pod_status_unschedulable | pod, namespace | Binary metric from kube-state-metrics: 1 if pod is currently unschedulable |

    Alerting Rules

    groups:
      - name: scheduling
        rules:
          # Pods stuck unschedulable for > 5 minutes
          - alert: PodsUnschedulable
            expr: scheduler_pending_pods{queue="unschedulable"} > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "{{ $value }} pods are unschedulable for > 5 minutes"
              description: "Check kubectl describe pod for scheduling failure reasons"
    
          # Scheduling latency too high (p99 > 1s)
          - alert: SchedulerHighLatency
            expr: |
              histogram_quantile(0.99,
                rate(scheduler_scheduling_attempt_duration_seconds_bucket{result="scheduled"}[5m])) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Scheduler p99 latency > 1s — large cluster or complex constraints"
    
          # High preemption rate (priority pods evicting others)
          - alert: HighPreemptionRate
            expr: rate(scheduler_preemption_victims_sum[5m]) > 0.5   # histogram: rate the _sum series (victims per second)
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Scheduler is preempting >30 pods/min — check priority class configuration"
    
          # Scheduler not running / unhealthy
          - alert: KubeSchedulerDown
            expr: absent(up{job="kube-scheduler"} == 1)
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "kube-scheduler is down — no new pods can be scheduled"

    Runbooks

    Pods Stuck in Pending (Unschedulable)

    # Get the failure reason
    kubectl describe pod <pod> -n <namespace> | grep -A20 Events
    
    # Common reasons and fixes:
    # "Insufficient cpu/memory" → scale up nodes or reduce requests
    # "node(s) had taint ... that the pod didn't tolerate" → add toleration
    # "node(s) didn't match Pod's node affinity" → check labels on nodes
    # "didn't match pod anti-affinity rules" → use topology spread instead
    # "0 nodes available" → check minDomains / zone availability
    
    # Check node resources
    kubectl describe nodes | grep -A5 "Allocated resources:"
    
    # Check if nodes exist with required labels
    kubectl get nodes -l kubernetes.io/os=linux,node-type=gpu

    Pod Scheduled to Wrong Node Type

    # Check what labels the scheduled node has
    kubectl get node $(kubectl get pod <pod> -o jsonpath='{.spec.nodeName}') \
      --show-labels
    
    # Check the pod's nodeSelector / affinity
    kubectl get pod <pod> -o jsonpath='{.spec.nodeSelector}'
    kubectl get pod <pod> -o yaml | grep -A30 affinity
    
    # If nodeSelector is missing/wrong, update the Deployment
    kubectl patch deployment <name> -n <namespace> --type=merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-type":"gpu"}}}}}'

    Topology Spread Constraint Blocking Scheduling

    # Check pod events for spread violations
    kubectl describe pod <pod> | grep "didn't match.*topology"
    
    # Check current pod distribution by node (the zone label lives on the node, not the pod)
    kubectl get pods -n <namespace> -l app=<app> \
      -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
    
    # Count pods per zone (look up each pod's node, then print that node's zone label)
    kubectl get pods -n <namespace> -l app=<app> \
      -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | \
      xargs -I{} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' | \
      sort | uniq -c
    
    # Temporarily loosen constraint
    kubectl patch deployment <name> -n <namespace> --type=json -p='[{
      "op":"replace",
      "path":"/spec/template/spec/topologySpreadConstraints/0/whenUnsatisfiable",
      "value":"ScheduleAnyway"
    }]'

    High-Priority Pod Not Triggering Preemption

    # Check if priority class is set correctly
    kubectl get pod <pod> -o jsonpath='{.spec.priority} {.spec.priorityClassName}'
    
    # Check if there are lower-priority pods to preempt
    kubectl get pods -A --sort-by='.spec.priority' -o custom-columns=\
    'NAMESPACE:.metadata.namespace,NAME:.metadata.name,PRIORITY:.spec.priority,NODE:.spec.nodeName' | head -20
    
    # Check scheduler logs for preemption decisions
    kubectl logs -n kube-system -l component=kube-scheduler | grep -i "preempt\|nominat"
    
    # Check nominated node (scheduler's preemption candidate)
    kubectl get pod <pod> -o jsonpath='{.status.nominatedNodeName}'

    Uneven Pod Distribution After Scale-Out

    # Check distribution (pods per node; jsonpath avoids column drift in -o wide output)
    kubectl get pods -l app=<app> \
      -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c
    
    # Trigger rebalancing via Descheduler (if installed)
    kubectl create job descheduler-manual --from=cronjob/descheduler -n kube-system
    
    # Manual rebalancing: rollout restart spreads pods fresh
    kubectl rollout restart deployment <name> -n <namespace>

    Best Practices

    1. Use topologySpreadConstraints over pod anti-affinity for spreading replicas — topology spread is O(n) while required pod anti-affinity is O(n²). At 50+ replicas, the scheduling latency difference is significant.
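    2. Set minDomains when you require multi-zone spread: without it, skew is computed only over the zones that currently have eligible nodes, so all pods can land in a single zone if that is the only one available. minDomains: 3 (valid only with whenUnsatisfiable: DoNotSchedule) limits each zone to maxSkew pods until at least 3 zones are eligible.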
    3. Use preferredDuring affinity for preferences, not requirements — requiredDuring constraints that are too strict cause pods to stay Pending indefinitely. Reserve required for genuine hardware requirements (GPUs, OS, architecture).
    4. Taint dedicated nodes AND use nodeSelector — a taint prevents non-tolerated pods from landing on a node, but doesn't force your pods to land there. Add nodeSelector or affinity to direct your pods to the dedicated nodes.
    5. Set preemptionPolicy: Never for batch workloads: a non-preempting class lets batch pods queue according to their priority without evicting anything that is already running, so a burst of batch jobs cannot displace production traffic. Production workloads should have a higher priority value, and batch should not preempt anyone.
    6. Run the Descheduler for long-running clusters — initial scheduling decisions become suboptimal as the cluster changes. Descheduler with LowNodeUtilization and RemoveDuplicates keeps placement healthy without manual intervention.
    7. Use gang scheduling for distributed ML/HPC jobs — partial scheduling of distributed training is worse than no scheduling. Without a PodGroup minimum, workers consume GPU resources while waiting for peers, blocking other jobs.
    8. Monitor scheduler_pending_pods{queue="unschedulable"} continuously — a growing unschedulable count is an early warning of capacity shortage or misconfigured constraints, often surfacing before users notice impact.