Scheduling

How pods are placed on nodes: the scheduler framework, affinity, taints, topology spread, and priority

kube-scheduler Kubernetes 1.19+ Platform Engineer

The Kubernetes scheduler watches for unscheduled pods and selects the best node for each one by running the pod through a pipeline of filter and scoring plugins. Understanding this pipeline — and the pod-level constraints that influence it — is essential for building clusters that achieve high utilization, fault tolerance, zone awareness, and hardware affinity simultaneously.

Scheduler Framework

The scheduler framework (GA 1.19) is a plugin-based architecture where all scheduling logic is implemented as plugins hooked into named extension points. The default scheduler ships with plugins for resource fitting, affinity, topology spread, and more. Custom schedulers extend or replace these plugins.

Scheduling pipeline for a single pod:

Unscheduled pod enters activeQ (priority queue)
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│ FILTER PHASE: eliminate nodes that cannot host the pod  │
│                                                         │
│ PreFilter → pre-compute data for filter plugins         │
│ Filter    → NodeResourcesFit, NodeAffinity,             │
│             TaintToleration, NodePorts,                 │
│             PodTopologySpread, VolumeBinding, etc.      │
│                                                         │
│ Feasible nodes = nodes that passed ALL filters          │
│ If 0 feasible nodes → pod goes to unschedulableQ        │
└─────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│ SCORE PHASE: rank feasible nodes (0–100 per plugin)     │
│                                                         │
│ PreScore       → pre-compute data for score plugins     │
│ Score          → NodeResourcesBalancedAllocation,       │
│                  ImageLocality, InterPodAffinity,       │
│                  NodeAffinity (preferred), etc.         │
│ NormalizeScore → normalize scores to 0–100              │
│                                                         │
│ Final score = weighted sum of all plugin scores         │
└─────────────────────────────────────────────────────────┘
        │
        ▼
Selected node (highest score)
        │
        ▼
Reserve → Permit → PreBind → Bind → PostBind
(reserve resources, wait for permits, write pod.spec.nodeName)

Extension point	Phase	Purpose
`QueueSort`	Queue	Order pods in the scheduling queue (default: PrioritySort)
`PreFilter`	Filter	Pre-compute or validate data needed by Filter plugins
`Filter`	Filter	Eliminate nodes that cannot satisfy pod constraints
`PostFilter`	Filter	Run after Filter (e.g., preemption — find pods to evict)
`PreScore`	Score	Pre-compute data for Score plugins
`Score`	Score	Rank feasible nodes 0–100 per plugin
`NormalizeScore`	Score	Normalize per-plugin scores before weighting
`Reserve`	Bind	Reserve resources (VolumeBinding reserves PVCs here)
`Permit`	Bind	Allow/deny/wait before binding (used by gang schedulers)
`PreBind`	Bind	Pre-bind work (e.g., provision volumes)
`Bind`	Bind	Write `pod.spec.nodeName` to API server
`PostBind`	Bind	Informational — cleanup, metrics after bind
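
The filter-then-score flow described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not kube-scheduler code: the `Node` class, the `fits_cpu` filter, the two scorers, and the weights are all hypothetical stand-ins for real plugins.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    cpu_capacity: float                 # total allocatable cores (illustrative)
    cpu_free: float                     # cores not yet requested
    labels: dict = field(default_factory=dict)

# Filter plugin stand-in: True if the node can host the pod
def fits_cpu(pod, node):
    return node.cpu_free >= pod["cpu"]

# Score plugin stand-ins: return 0-100, higher is better
def least_allocated(pod, node):
    return 100 * (node.cpu_free - pod["cpu"]) / node.cpu_capacity

def preferred_zone(pod, node):
    return 100 if node.labels.get("zone") == pod.get("preferred_zone") else 0

def schedule(pod, nodes, filters, scorers):
    """Return the chosen node name, or None if no node is feasible."""
    # Filter phase: a node must pass ALL filters
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None                     # pod would be marked unschedulable

    # Score phase: weighted sum of per-plugin scores
    def total(node):
        return sum(weight * score(pod, node) for score, weight in scorers)

    return max(feasible, key=total).name
```

With `scorers=[(least_allocated, 1)]` the emptiest feasible node wins; adding `(preferred_zone, 2)` lets a zone preference outweigh it, which mirrors how plugin weights shift the final ranking.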

Scheduling Queue

The scheduler maintains three queues for pods:

activeQ         → pods ready to be scheduled (priority-ordered)
backoffQ        → pods that failed scheduling; waiting with exponential backoff
                  (1s → 2s → 4s ... capped at 10s by default; tunable via
                  podInitialBackoffSeconds and podMaxBackoffSeconds)
unschedulableQ  → pods that cannot be scheduled with current cluster state
                  (re-checked when cluster state changes: new node, pod deleted, etc.)

Pod enters activeQ on creation
Scheduling fails → moved to backoffQ or unschedulableQ
Cluster event (new node added, pod deleted) → unschedulableQ pods are re-evaluated and may move back to activeQ
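
As a rough illustration of the backoff behavior, the sketch below doubles the wait after each failed attempt and caps it. It assumes the KubeSchedulerConfiguration defaults `podInitialBackoffSeconds: 1` and `podMaxBackoffSeconds: 10`; the function name and signature are hypothetical, not scheduler API.

```python
def backoff_seconds(attempts, initial=1.0, maximum=10.0):
    """Wait time after `attempts` failed scheduling attempts (attempts >= 1).

    Assumed defaults mirror podInitialBackoffSeconds=1 and
    podMaxBackoffSeconds=10; pass other values to model a tuned scheduler.
    """
    return min(initial * 2 ** (attempts - 1), maximum)

# Backoff for the first six consecutive failures
schedule = [backoff_seconds(n) for n in range(1, 7)]
```

Here `schedule` grows 1, 2, 4, 8 and then stays at the 10-second cap.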

# See why a pod is unschedulable
kubectl describe pod <pod> -n <namespace> | grep -A20 Events
# Look for: "0/10 nodes are available: 3 Insufficient cpu, 7 node(s) had taint..."

# Check scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler --tail=100 | grep <pod-name>

nodeSelector and nodeName

nodeSelector

The simplest node constraint: a map of key-value labels that the node must have. All entries must match (AND logic).

spec:
  nodeSelector:
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: m5.2xlarge
    topology.kubernetes.io/zone: us-east-1a

nodeName (bypasses scheduler)

Setting spec.nodeName directly assigns the pod to a specific node, bypassing the scheduler entirely — no filter or score plugins run. The pod is bound to the named node regardless of resource availability.

spec:
  nodeName: worker-node-3    # Bypasses scheduler; skips all filters

nodeName bypasses all scheduling checks
A pod with spec.nodeName set is bound to that node even if the node lacks sufficient resources, carries NoSchedule taints the pod does not tolerate, or is cordoned. The kubelet still runs its own admission checks and can reject the pod, which then fails with a reason such as OutOfcpu or OutOfmemory. NoExecute taints are still enforced after binding. Only use nodeName for static system pods or testing, never in production workloads.

Node Affinity

Node affinity is a more expressive replacement for nodeSelector, supporting operators, weight-based preferences, and separation of scheduling-time vs runtime requirements.

spec:
  affinity:
    nodeAffinity:
      # HARD requirement — pod will not schedule if not satisfied
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: [amd64, arm64]
              - key: node.kubernetes.io/instance-type
                operator: NotIn
                values: [t3.micro, t3.small]   # Exclude small instances

      # SOFT preference — scheduler prefers but doesn't require
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80          # Higher weight = stronger preference (1–100)
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: [us-east-1a]     # Prefer zone a

        - weight: 20
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values: [compute-optimized]

Operator	Meaning	Example
`In`	Label value is in the set	`zone In [a, b]`
`NotIn`	Label value is NOT in the set	`type NotIn [spot]`
`Exists`	Label key exists (any value)	`gpu Exists`
`DoesNotExist`	Label key does not exist	`spot DoesNotExist`
`Gt`	Label value (integer) greater than	`generation Gt [2]`
`Lt`	Label value (integer) less than	`generation Lt [5]`
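
The operator semantics in the table can be sketched as a small evaluator. This is an illustrative model rather than the scheduler's implementation; the function names are hypothetical. It encodes two rules worth remembering: expressions inside one term are ANDed while `nodeSelectorTerms` are ORed, and `NotIn` also matches nodes that lack the label entirely.

```python
def expr_matches(labels, expr):
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return labels.get(key) not in values      # missing label also matches
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        if key not in labels:
            return False
        try:
            have, want = int(labels[key]), int(values[0])
        except ValueError:
            return False                          # non-integer label never matches
        return have > want if op == "Gt" else have < want
    raise ValueError(f"unknown operator: {op}")

def node_matches(labels, node_selector_terms):
    """Terms are ORed; the expressions inside each term are ANDed."""
    return any(
        all(expr_matches(labels, e) for e in term["matchExpressions"])
        for term in node_selector_terms
    )
```

For the required rule shown earlier, an `arm64` node of type `m5.large` matches, while an `amd64` node of type `t3.micro` is rejected by the `NotIn` expression.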

IgnoredDuringExecution
Both required and preferred node affinity are "IgnoredDuringExecution" — if a node's labels change after a pod is scheduled, the pod continues running. A future RequiredDuringSchedulingRequiredDuringExecution type would evict pods whose nodes no longer match, but this has not been implemented.

Pod Affinity and Anti-Affinity

Pod affinity/anti-affinity places pods relative to other pods, using topology domains (zone, node, rack). This enables co-location (cache next to app) and spreading (replicas on different nodes).

spec:
  affinity:
    # Co-locate with pods that have app=redis (same node)
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: redis
          topologyKey: kubernetes.io/hostname   # "same node"
          namespaces: [production]              # Optional: limit to namespace

      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                tier: cache
            topologyKey: topology.kubernetes.io/zone  # Prefer same zone as cache

    # Anti-affinity: spread replicas across nodes
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: api-server        # Don't place on same node as another api-server pod
          topologyKey: kubernetes.io/hostname

      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: api-server
            topologyKey: topology.kubernetes.io/zone  # Prefer different zones

Required pod anti-affinity scales as O(n²) at large replica counts
With requiredDuringScheduling pod anti-affinity and topologyKey: kubernetes.io/hostname, each new pod must be checked against the existing pods that match the selector, so scheduling n replicas costs roughly O(n²) work in total. At hundreds of replicas, scheduler throughput can drop significantly. For large-scale spreading, prefer topologySpreadConstraints (below), which is designed for this case.

Taints and Tolerations

Taints mark nodes as unsuitable for pods that don't explicitly tolerate the taint. They are the node-side counterpart to node affinity.

# Apply a taint to a node
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
kubectl taint nodes spot-node-2 spot=true:PreferNoSchedule
kubectl taint nodes maintenance-node node.kubernetes.io/unschedulable:NoExecute

# Remove a taint
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule-

Effect	Scheduling behavior	Running pods behavior
`NoSchedule`	Pod will NOT be scheduled on this node unless it tolerates the taint	Existing pods on the node are NOT evicted
`PreferNoSchedule`	Scheduler tries to avoid this node; will use it if no other option	Existing pods unaffected
`NoExecute`	Pod will NOT be scheduled on this node unless it tolerates the taint	Pods without a matching toleration are evicted immediately; pods whose toleration sets tolerationSeconds are evicted after that delay

spec:
  tolerations:
    # Exact match toleration
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule

    # Tolerate any value for this key
    - key: spot
      operator: Exists
      effect: PreferNoSchedule

    # Tolerate NoExecute with a grace period (stay up to 300s after taint added)
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300   # Pod evicted 300s after node becomes NotReady

    # Catch-all: tolerate ALL taints on the node (use sparingly)
    - operator: Exists

Built-in Node Taints

Taint key	Added when	Effect
`node.kubernetes.io/not-ready`	Node condition NotReady	NoExecute
`node.kubernetes.io/unreachable`	Node unreachable from controller	NoExecute
`node.kubernetes.io/memory-pressure`	MemoryPressure condition	NoSchedule
`node.kubernetes.io/disk-pressure`	DiskPressure condition	NoSchedule
`node.kubernetes.io/pid-pressure`	PIDPressure condition	NoSchedule
`node.kubernetes.io/unschedulable`	Node cordoned (`kubectl cordon`)	NoSchedule
`node.kubernetes.io/network-unavailable`	Network not configured by CNI	NoSchedule

Default tolerations auto-injected by the DefaultTolerationSeconds admission controller
Unless a pod already declares them, the API server adds tolerations with tolerationSeconds: 300 for the not-ready and unreachable taints. This means pods survive 5 minutes of node unresponsiveness before being evicted and rescheduled elsewhere. Increase this for stateful workloads that need more time (e.g., databases renegotiating connections); decrease for stateless services that benefit from faster failover.

Topology Spread Constraints

Topology spread constraints (GA 1.19) spread pods evenly across topology domains — more efficient than pod anti-affinity and designed for large replica counts.

spec:
  topologySpreadConstraints:
    # Spread across zones (hard constraint)
    - maxSkew: 1                        # Max difference between most and least loaded zone
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule  # Hard: block if skew would exceed maxSkew
      labelSelector:
        matchLabels:
          app: api-server
      minDomains: 3                     # With fewer than 3 eligible zones, cap each zone at maxSkew pods (1.24+)
      nodeAffinityPolicy: Honor         # Consider node affinity in domain calculation (1.26+)
      nodeTaintsPolicy: Honor           # Consider node taints (1.26+)

    # Spread across nodes within each zone (soft constraint)
    - maxSkew: 2
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway # Soft: schedule even if skew would exceed maxSkew
      labelSelector:
        matchLabels:
          app: api-server
      matchLabelKeys:                   # Also spread across rollout versions (1.27+)
        - pod-template-hash             # Each RS gets its own spread domain

Field	Default	Meaning
`maxSkew`	1	Maximum allowed difference between the most- and least-loaded topology domain
`topologyKey`	—	Node label key that defines the topology domain (zone, hostname, rack, etc.)
`whenUnsatisfiable`	DoNotSchedule	`DoNotSchedule` (hard) or `ScheduleAnyway` (soft — schedule but penalize)
`labelSelector`	—	Which pods to count when computing domain loads
`minDomains`	nil	Minimum number of eligible topology domains; if fewer exist, the global minimum is treated as 0, so each domain can hold at most maxSkew matching pods (requires `DoNotSchedule`)
`matchLabelKeys`	nil	Additional label keys (e.g., `pod-template-hash`) to scope spread per rollout revision

Topology spread example: maxSkew=1, 3 zones, starting from 6 replicas

Initial:  zone-a=2, zone-b=2, zone-c=2 → skew=0 ✓
Add pod:  zone-a=3, zone-b=2, zone-c=2 → skew=1 ✓
Add pod:  zone-a=3, zone-b=3, zone-c=2 → skew=1 ✓
Add pod:  zone-a=4, zone-b=3, zone-c=2 → skew=2 ✗ (blocked with DoNotSchedule)
          → scheduler places in zone-c instead: a=3, b=3, c=3 → skew=0 ✓

With ScheduleAnyway: pod is placed in the least-loaded zone as a best effort but is NOT blocked if all zones would exceed maxSkew
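
The skew arithmetic in the example can be reproduced with a short sketch. It assumes the simplified rule that placing a pod in a domain yields a skew of (pods in that domain + 1) minus the current global minimum, and it ignores minDomains and the node affinity/taint policies. The function names are hypothetical.

```python
def skew_if_placed(counts, domain):
    """Skew that would result from adding one matching pod to `domain`."""
    return counts[domain] + 1 - min(counts.values())

def allowed_domains(counts, max_skew):
    """Domains that a DoNotSchedule constraint would still permit."""
    return sorted(d for d in counts if skew_if_placed(counts, d) <= max_skew)

# State before the blocked placement in the example above
counts = {"zone-a": 3, "zone-b": 3, "zone-c": 2}
permitted = allowed_domains(counts, max_skew=1)
```

With these counts only `zone-c` is permitted, because adding a pod to `zone-a` or `zone-b` would give a skew of 2.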

Cluster-Level Default Spread

# KubeSchedulerConfiguration: default topology spread for pods that define no constraints of their own
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 3
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
            - maxSkew: 5
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List   # Use the list above instead of the built-in system defaults

Priority and Preemption

PriorityClass assigns a numeric priority to pods. When a high-priority pod cannot be scheduled due to resource constraints, the scheduler may preempt (evict) lower-priority pods to make room.

# Define priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000        # Higher number = higher priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority  # Default: can preempt lower-priority pods
description: "Critical production services"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100
globalDefault: false
preemptionPolicy: Never  # This class cannot preempt other pods
description: "Low-priority batch jobs"

---
# Use in pod
spec:
  priorityClassName: critical-production

Built-in System Priority Classes

Name	Value	Used by
`system-cluster-critical`	2,000,000,000	Cluster add-ons such as CoreDNS (kube-dns)
`system-node-critical`	2,000,001,000	Node-level components such as kube-proxy, CNI agents, and static control-plane pods

Preemption honors PodDisruptionBudgets only on a best-effort basis
When the scheduler preempts pods to make room for a high-priority pod, it tries to pick victims whose PodDisruptionBudgets would not be violated, but if no such victims exist it preempts anyway. PDBs are only strictly enforced by the Eviction API (voluntary disruptions). A critical pod can preempt multiple pods simultaneously, potentially violating quorum guarantees. Mitigate by setting appropriate priority values so high-priority pods don't preempt quorum-critical workloads.

# Check pod priority
kubectl get pod <pod> -o jsonpath='{.spec.priority} {.spec.priorityClassName}'

# See which node the scheduler has nominated for a preempting pod (set once victims are chosen)
kubectl get pod <high-priority-pod> -o jsonpath='{.status.nominatedNodeName}'

# List all priority classes
kubectl get priorityclasses --sort-by='.value'

Multiple Schedulers and Profiles

The scheduler supports multiple named profiles within a single binary. Each profile can enable/disable different plugins, allowing different scheduling behavior for different workload types.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  # Default profile: keeps the default plugins (LeastAllocated scoring spreads load)
  - schedulerName: default-scheduler

  # Batch profile: pack pods onto fewer nodes to maximize node utilization
  - schedulerName: batch-scheduler
    plugins:
      score:
        disabled:
          - name: NodeResourcesBalancedAllocation  # Balancing works against packing
    pluginConfig:
      # In the v1 config API, packing is a scoring strategy of NodeResourcesFit;
      # the standalone NodeResourcesMostAllocated plugin no longer exists.
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated       # Default is LeastAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1

# Use the batch scheduler for a specific pod
spec:
  schedulerName: batch-scheduler   # Must match a profile name in KubeSchedulerConfiguration

Descheduler

The scheduler only places pods at creation time. Once running, pods stay on their initial node even if the cluster becomes imbalanced (new nodes added, workloads removed). The Descheduler is an add-on that periodically evicts pods to trigger rebalancing.

Descheduler workflow (runs as CronJob or Deployment):
1. Scan all pods against enabled strategies
2. Identify pods that violate constraints or could be better placed
3. Evict qualifying pods (via the Eviction API, which respects PDBs)
4. Pods are rescheduled by kube-scheduler to better nodes

Key Strategies

Strategy	What it evicts	Use case
`LowNodeUtilization`	Pods from over-utilized nodes to under-utilized ones (thresholds: CPU/memory %)	Rebalance after node additions or workload changes
`RemoveDuplicates`	Extra pods from the same ReplicaSet/Deployment on the same node (when topology spread was violated at creation)	Fix uneven distribution from rapid scale-outs
`RemovePodsViolatingNodeAffinity`	Pods on nodes that no longer satisfy their node affinity (labels changed after scheduling)	Enforce node affinity rules retroactively
`RemovePodsViolatingTopologySpreadConstraint`	Pods causing topology spread violations	Rebalance after node failures or additions
`RemovePodsHavingTooManyRestarts`	Pods with excessive restarts (crash-looping)	Force reschedule of unstable pods to different nodes
`PodLifeTime`	Pods older than a configured age	Enforce pod refresh cycle for security/version freshness

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 20          # Nodes below 20% CPU = underutilized
            memory: 20
            pods: 20
          targetThresholds:
            cpu: 50          # Nodes above 50% CPU = overutilized (pods are evicted from these)
            memory: 50
            pods: 50
      - name: RemoveDuplicates
        args:
          excludeOwnerKinds: ["ReplicaSet"]  # Skip pods owned by ReplicaSets (this includes Deployment pods)
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
          - RemoveDuplicates
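
To make the two threshold sets concrete, the sketch below classifies a node the way the LowNodeUtilization plugin is documented to: underutilized only if every resource is below `thresholds`, overutilized if any resource is above `targetThresholds`. The function name and the percentage inputs are hypothetical, and it is an assumption here that usage is measured as requested resources relative to allocatable.

```python
def classify(usage, thresholds, target_thresholds):
    """Classify a node from per-resource utilization percentages."""
    if all(usage[r] < thresholds[r] for r in thresholds):
        return "underutilized"            # candidate destination for pods
    if any(usage[r] > target_thresholds[r] for r in target_thresholds):
        return "overutilized"             # pods may be evicted from here
    return "appropriately utilized"       # left alone

# Values taken from the policy above
THRESHOLDS = {"cpu": 20, "memory": 20, "pods": 20}
TARGETS = {"cpu": 50, "memory": 50, "pods": 50}
```

Note that evictions only happen when both kinds of node exist: without at least one underutilized node, there is nowhere for the evicted pods to go.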

Gang Scheduling

Gang scheduling ensures that a group of pods is scheduled all-or-nothing. Without it, distributed ML training jobs (PyTorch, MPI) can deadlock: worker pods consume resources while the parameter server pod sits Pending, and no job makes progress.
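
The all-or-nothing rule can be illustrated with a toy admission check. This is a hypothetical first-fit sketch for identical pods, not how Volcano or YuniKorn are implemented: it either returns a placement for every member of the gang or returns nothing, so no partial set of workers ever holds resources.

```python
def gang_admit(free_cpu_per_node, pod_cpu, min_member):
    """Return one node name per pod, or None if the whole gang cannot fit."""
    free = dict(free_cpu_per_node)        # work on a copy; commit nothing yet
    placement = []
    for _ in range(min_member):
        node = next((n for n, cpu in sorted(free.items()) if cpu >= pod_cpu), None)
        if node is None:
            return None                   # discard the partial placement
        free[node] -= pod_cpu
        placement.append(node)
    return placement

cluster = {"n1": 8, "n2": 4}
three_workers = gang_admit(cluster, pod_cpu=4, min_member=3)
four_workers = gang_admit(cluster, pod_cpu=4, min_member=4)
```

Three 4-core workers fit (two on `n1`, one on `n2`), but a gang of four is rejected as a whole instead of starting three pods that would wait forever.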

Volcano

# PodGroup — define the minimum pods needed to start
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pytorch-training-job
  namespace: ml-platform
spec:
  minMember: 8          # All 8 pods must be schedulable before ANY start
  minResources:
    cpu: "32"
    memory: 64Gi
  queue: ml-training
  priorityClassName: ml-high

---
# Reference PodGroup from pods
metadata:
  annotations:
    scheduling.volcano.sh/pod-group: pytorch-training-job
spec:
  schedulerName: volcano

Apache YuniKorn

# YuniKorn queue-based scheduling
metadata:
  labels:
    queue: root.ml-platform.training
  annotations:
    yunikorn.apache.org/task-group-name: pytorch-workers
    yunikorn.apache.org/task-groups: |
      [{
        "name": "pytorch-workers",
        "minMember": 8,
        "minResource": {"cpu": "4", "memory": "8Gi"}
      }]
spec:
  schedulerName: yunikorn

Scheduling Decision Guide

Goal	Mechanism	Notes
Require specific node type	`nodeSelector` or `nodeAffinity.required`	Use affinity for complex expressions
Prefer certain nodes	`nodeAffinity.preferred` with weight	Multiple preferences sum their weights
Keep pods off certain nodes	Taint + toleration; or `nodeAffinity.required NotIn`	Taint is node-side; affinity is pod-side
Co-locate with another workload	`podAffinity.required`, `topologyKey: hostname`	E.g., app next to its sidecar cache
Spread replicas across zones	`topologySpreadConstraints` with zone key	Prefer over anti-affinity for large fleets
Spread replicas across nodes	`topologySpreadConstraints` with hostname key	Use `maxSkew: 1` for tight spread
Dedicated nodes for workload type	Taint nodes + add toleration to pods	nodeSelector ensures pods only go there
Prioritize critical pods under pressure	PriorityClass with preemption	PDBs are honored only on a best-effort basis during preemption
All-or-nothing batch scheduling	Volcano PodGroup / YuniKorn task groups	Prevents deadlock in distributed training
Rebalance after cluster changes	Descheduler LowNodeUtilization	Evicts and reschedules — disrupts running pods

Metrics

Metric	Labels	Use
`scheduler_pending_pods`	`queue` (active/backoff/unschedulable)	Scheduling queue depth — rising unschedulable = cluster capacity issue
`scheduler_scheduling_attempt_duration_seconds`	`profile`, `result` (scheduled/unschedulable/error)	Scheduling latency percentiles
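`scheduler_preemption_victims`	none (histogram)	Number of victims selected per preemption; use the `_sum` and `_count` series for rates
`scheduler_pod_scheduling_sli_duration_seconds`	`attempts`	E2E scheduling SLI: time from the pod entering the scheduling queue until it is scheduled, across retries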
`kube_pod_status_unschedulable`	`pod`, `namespace`	Binary metric: 1 if pod is currently unschedulable

Alerting Rules

groups:
  - name: scheduling
    rules:
      # Pods stuck unschedulable for > 5 minutes
      - alert: PodsUnschedulable
        expr: scheduler_pending_pods{queue="unschedulable"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} pods are unschedulable for > 5 minutes"
          description: "Check kubectl describe pod for scheduling failure reasons"

      # Scheduling latency too high (p99 > 1s)
      - alert: SchedulerHighLatency
        expr: |
          histogram_quantile(0.99,
            rate(scheduler_scheduling_attempt_duration_seconds_bucket{result="scheduled"}[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scheduler p99 latency > 1s — large cluster or complex constraints"

      # High preemption rate (priority pods evicting others)
      - alert: HighPreemptionRate
        expr: rate(scheduler_preemption_victims_sum[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scheduler is preempting >30 pods/min — check priority class configuration"

      # Scheduler not running / unhealthy
      - alert: KubeSchedulerDown
        expr: absent(up{job="kube-scheduler"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "kube-scheduler is down — no new pods can be scheduled"

Runbooks

Pods Stuck in Pending (Unschedulable)

# Get the failure reason
kubectl describe pod <pod> -n <namespace> | grep -A20 Events

# Common reasons and fixes:
# "Insufficient cpu/memory" → scale up nodes or reduce requests
# "node(s) had taint ... that the pod didn't tolerate" → add toleration
# "node(s) didn't match Pod's node affinity" → check labels on nodes
# "didn't match pod anti-affinity rules" → use topology spread instead
# "0 nodes available" → check minDomains / zone availability

# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources:"

# Check if nodes exist with required labels
kubectl get nodes -l kubernetes.io/os=linux,node-type=gpu

Pod Scheduled to Wrong Node Type

# Check what labels the scheduled node has
kubectl get node $(kubectl get pod <pod> -o jsonpath='{.spec.nodeName}') \
  --show-labels

# Check the pod's nodeSelector / affinity
kubectl get pod <pod> -o jsonpath='{.spec.nodeSelector}'
kubectl get pod <pod> -o yaml | grep -A30 affinity

# If nodeSelector is missing/wrong, update the Deployment
kubectl patch deployment <name> -n <namespace> --type=merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-type":"gpu"}}}}}'

Topology Spread Constraint Blocking Scheduling

# Check pod events for spread violations
kubectl describe pod <pod> | grep "didn't match.*topology"

# Check current pod distribution (the zone is a node label, not a pod label)
kubectl get pods -n <namespace> -l app=<app> \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

# Count pods per zone (look up each pod's node and print one zone per line)
kubectl get pods -n <namespace> -l app=<app> \
  -o jsonpath='{.items[*].spec.nodeName}' | tr ' ' '\n' | \
  xargs -I{} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' | \
  sort | uniq -c

# Temporarily loosen constraint
kubectl patch deployment <name> -n <namespace> --type=json -p='[{
  "op":"replace",
  "path":"/spec/template/spec/topologySpreadConstraints/0/whenUnsatisfiable",
  "value":"ScheduleAnyway"
}]'

High-Priority Pod Not Triggering Preemption

# Check if priority class is set correctly
kubectl get pod <pod> -o jsonpath='{.spec.priority} {.spec.priorityClassName}'

# Check if there are lower-priority pods to preempt
kubectl get pods -A --sort-by='.spec.priority' -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,PRIORITY:.spec.priority,NODE:.spec.nodeName' | head -20

# Check scheduler logs for preemption decisions
kubectl logs -n kube-system -l component=kube-scheduler | grep -i "preempt\|nominat"

# Check nominated node (scheduler's preemption candidate)
kubectl get pod <pod> -o jsonpath='{.status.nominatedNodeName}'

Uneven Pod Distribution After Scale-Out

# Check distribution
kubectl get pods -l app=<app> -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c

# Trigger rebalancing via Descheduler (if installed)
kubectl create job descheduler-manual --from=cronjob/descheduler -n kube-system

# Manual rebalancing: rollout restart spreads pods fresh
kubectl rollout restart deployment <name> -n <namespace>

Best Practices

Use topologySpreadConstraints over pod anti-affinity for spreading replicas: required pod anti-affinity becomes increasingly expensive for the scheduler as matching pods accumulate (roughly O(n²) total work for n replicas), while topology spread is designed for large replica counts.
Set minDomains when you require multi-zone spread: without it, topology spread will place all pods in a single zone if that is the only zone with eligible nodes. With whenUnsatisfiable: DoNotSchedule, minDomains: 3 caps each zone at maxSkew matching pods until at least 3 zones are available.
Use preferredDuring affinity for preferences, not requirements — requiredDuring constraints that are too strict cause pods to stay Pending indefinitely. Reserve required for genuine hardware requirements (GPUs, OS, architecture).
Taint dedicated nodes AND use nodeSelector — a taint prevents non-tolerated pods from landing on a node, but doesn't force your pods to land there. Add nodeSelector or affinity to direct your pods to the dedicated nodes.
Set preemptionPolicy: Never for batch workloads: batch pods then wait in the queue instead of evicting running pods, even pods with a lower priority than their own. Give production workloads a higher priority so they can still preempt batch when capacity is tight.
Run the Descheduler for long-running clusters — initial scheduling decisions become suboptimal as the cluster changes. Descheduler with LowNodeUtilization and RemoveDuplicates keeps placement healthy without manual intervention.
Use gang scheduling for distributed ML/HPC jobs — partial scheduling of distributed training is worse than no scheduling. Without a PodGroup minimum, workers consume GPU resources while waiting for peers, blocking other jobs.
Monitor scheduler_pending_pods{queue="unschedulable"} continuously — a growing unschedulable count is an early warning of capacity shortage or misconfigured constraints, often surfacing before users notice impact.