Scheduling
How pods are placed on nodes: the scheduler framework, affinity, taints, topology spread, and priority
The Kubernetes scheduler watches for unscheduled pods and selects the best node for each one by running the pod through a pipeline of filter and scoring plugins. Understanding this pipeline, and the pod-level constraints that influence it, is essential for building clusters that combine high utilization, fault tolerance, zone awareness, and hardware affinity.
Scheduler Framework
The scheduler framework (GA 1.19) is a plugin-based architecture where all scheduling logic is implemented as plugins hooked into named extension points. The default scheduler ships with plugins for resource fitting, affinity, topology spread, and more. Custom schedulers extend or replace these plugins.
| Extension point | Phase | Purpose |
|---|---|---|
| QueueSort | Queue | Order pods in the scheduling queue (default: PrioritySort) |
| PreFilter | Filter | Pre-compute or validate data needed by Filter plugins |
| Filter | Filter | Eliminate nodes that cannot satisfy pod constraints |
| PostFilter | Filter | Runs only when Filter finds no feasible node (e.g., preemption: find pods to evict) |
| PreScore | Score | Pre-compute data for Score plugins |
| Score | Score | Rank feasible nodes 0-100 per plugin |
| NormalizeScore | Score | Normalize per-plugin scores before weighting |
| Reserve | Bind | Reserve resources (VolumeBinding reserves PVCs here) |
| Permit | Bind | Allow/deny/wait before binding (used by gang schedulers) |
| PreBind | Bind | Pre-bind work (e.g., provision volumes) |
| Bind | Bind | Write pod.spec.nodeName to API server |
| PostBind | Bind | Informational: cleanup and metrics after bind |
Scheduling Queue
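The scheduler maintains three queues for pods:
- activeQ: pods ready to be scheduled, ordered by the QueueSort plugin.
- backoffQ: pods that failed a scheduling attempt and are waiting out a backoff period before being retried.
- unschedulable pool: pods that no node could fit; they move back to activeQ or backoffQ when a cluster event (for example, a node being added) might make them schedulable.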
# See why a pod is unschedulable
kubectl describe pod <pod> -n <namespace> | grep -A20 Events
# Look for: "0/10 nodes are available: 3 Insufficient cpu, 7 node(s) had taint..."
# Check scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler --tail=100 | grep <pod-name>
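How long a pod waits in the backoff state before it is retried is configurable. As a minimal sketch, assuming the v1 KubeSchedulerConfiguration API, the two fields below bound that delay. The values shown are believed to be the upstream defaults; verify them against the Kubernetes version in use before relying on them.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Backoff applied to a pod after a failed scheduling attempt.
podInitialBackoffSeconds: 1   # delay before the first retry
podMaxBackoffSeconds: 10      # ceiling for the growing retry delay
profiles:
  - schedulerName: default-scheduler
```

Raising the ceiling reduces retry churn in clusters with many unschedulable pods, at the cost of slower reaction once capacity appears.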
nodeSelector and nodeName
nodeSelector
The simplest node constraint: a map of key-value labels that the node must have. All entries must match (AND logic).
spec:
  nodeSelector:
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: m5.2xlarge
    topology.kubernetes.io/zone: us-east-1a
nodeName (bypasses scheduler)
Setting spec.nodeName directly assigns the pod to a specific node, bypassing the scheduler entirely: no filter or score plugins run. The pod is bound to the named node regardless of resource availability.
spec:
  nodeName: worker-node-3 # Bypasses scheduler; skips all filters
A pod with spec.nodeName set is bound to that node even if the node lacks sufficient resources, carries NoSchedule taints the pod does not tolerate, or is cordoned. The kubelet still runs its own admission checks and may reject the pod (for example, with reason OutOfcpu or OutOfmemory), and NoExecute taints still evict pods that do not tolerate them. Only use nodeName for static system pods or testing, never in production workloads.
Node Affinity
Node affinity is a more expressive replacement for nodeSelector, supporting operators, weight-based preferences, and separation of scheduling-time vs runtime requirements.
spec:
  affinity:
    nodeAffinity:
      # HARD requirement: pod will not schedule if not satisfied
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: [amd64, arm64]
              - key: node.kubernetes.io/instance-type
                operator: NotIn
                values: [t3.micro, t3.small] # Exclude small instances
      # SOFT preference: scheduler prefers but doesn't require
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80 # Higher weight = stronger preference (1-100)
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: [us-east-1a] # Prefer zone a
        - weight: 20
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values: [compute-optimized]
| Operator | Meaning | Example |
|---|---|---|
| In | Label value is in the set | zone In [a, b] |
| NotIn | Label value is NOT in the set | type NotIn [spot] |
| Exists | Label key exists (any value) | gpu Exists |
| DoesNotExist | Label key does not exist | spot DoesNotExist |
| Gt | Label value (integer) greater than | generation Gt [2] |
| Lt | Label value (integer) less than | generation Lt [5] |
Both required and preferred node affinity are "IgnoredDuringExecution": if a node's labels change after a pod is scheduled, the pod continues running. A proposed RequiredDuringSchedulingRequiredDuringExecution type would evict pods whose nodes no longer match, but it has not been implemented.
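One detail that often trips people up is how multiple terms combine: entries under nodeSelectorTerms are ORed, while the matchExpressions inside a single term are ANDed. The fragment below is an illustrative sketch of that rule; the capacity-type and gpu label keys are hypothetical and stand in for whatever labels the cluster actually uses.

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          # Term 1: amd64 AND on-demand (both expressions must match)
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: [amd64]
              - key: capacity-type        # hypothetical label
                operator: In
                values: [on-demand]
          # Term 2, ORed with term 1: any node that carries the gpu label
          - matchExpressions:
              - key: gpu                  # hypothetical label
                operator: Exists
```

A node qualifies if it satisfies either term, so a GPU node on arm64 spot capacity would still be eligible through term 2.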
Pod Affinity and Anti-Affinity
Pod affinity/anti-affinity places pods relative to other pods, using topology domains (zone, node, rack). This enables co-location (cache next to app) and spreading (replicas on different nodes).
spec:
  affinity:
    # Co-locate with pods that have app=redis (same node)
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: redis
          topologyKey: kubernetes.io/hostname # "same node"
          namespaces: [production] # Optional: limit to namespace
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                tier: cache
            topologyKey: topology.kubernetes.io/zone # Prefer same zone as cache
    # Anti-affinity: spread replicas across nodes
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: api-server # Don't place on same node as another api-server pod
          topologyKey: kubernetes.io/hostname
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: api-server
            topologyKey: topology.kubernetes.io/zone # Prefer different zones
With requiredDuringScheduling pod anti-affinity and topologyKey: kubernetes.io/hostname, each new pod must be checked against all existing pods. At hundreds of replicas, scheduler throughput drops significantly. For large-scale spreading, prefer topologySpreadConstraints (below), which is designed for this case.
Taints and Tolerations
Taints mark nodes as unsuitable for pods that don't explicitly tolerate the taint. They are the node-side counterpart to node affinity.
# Apply a taint to a node
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
kubectl taint nodes spot-node-2 spot=true:PreferNoSchedule
kubectl taint nodes maintenance-node node.kubernetes.io/unschedulable:NoExecute
# Remove a taint
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule-
| Effect | Scheduling behavior | Running pods behavior |
|---|---|---|
| NoSchedule | Pod will NOT be scheduled on this node unless it tolerates the taint | Existing pods on the node are NOT evicted |
| PreferNoSchedule | Scheduler tries to avoid this node; will use it if no other option | Existing pods unaffected |
| NoExecute | Pod will NOT be scheduled on this node unless it tolerates the taint | Pods without a matching toleration are evicted immediately; pods that tolerate the taint with tolerationSeconds are evicted once that period expires |
spec:
  tolerations:
    # Exact match toleration
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule
    # Tolerate any value for this key
    - key: spot
      operator: Exists
      effect: PreferNoSchedule
    # Tolerate NoExecute with a grace period (stay up to 300s after taint added)
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300 # Pod evicted 300s after node becomes NotReady
    # Catch-all: tolerate ALL taints on the node (use sparingly)
    - operator: Exists
Built-in Node Taints
| Taint key | Added when | Effect |
|---|---|---|
| node.kubernetes.io/not-ready | Node condition NotReady | NoExecute |
| node.kubernetes.io/unreachable | Node unreachable from controller | NoExecute |
| node.kubernetes.io/memory-pressure | MemoryPressure condition | NoSchedule |
| node.kubernetes.io/disk-pressure | DiskPressure condition | NoSchedule |
| node.kubernetes.io/pid-pressure | PIDPressure condition | NoSchedule |
| node.kubernetes.io/unschedulable | Node cordoned (kubectl cordon) | NoSchedule |
| node.kubernetes.io/network-unavailable | Network not configured by CNI | NoSchedule |
Pods are automatically given tolerationSeconds: 300 tolerations for the not-ready and unreachable taints. This means pods survive 5 minutes of node unresponsiveness before being evicted and rescheduled elsewhere. Increase this for stateful workloads that need more time (e.g., databases renegotiating connections); decrease it for stateless services that benefit from faster failover.
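Overriding the default is a matter of declaring the two tolerations explicitly in the pod template. The fragment below is a sketch for a stateless service that should fail over quickly; the 30-second value is an illustrative assumption, not a recommendation.

```yaml
spec:
  tolerations:
    # Explicit entries replace the automatically added 300s defaults
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30
```

The clock starts when the taint is applied, so total failover time also includes however long the control plane takes to mark the node NotReady or unreachable.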
Topology Spread Constraints
Topology spread constraints (GA 1.19) spread pods evenly across topology domains. They are more efficient than pod anti-affinity and designed for large replica counts.
spec:
  topologySpreadConstraints:
    # Spread across zones (hard constraint)
    - maxSkew: 1 # Max difference between most and least loaded zone
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule # Hard: block if skew would exceed maxSkew
      labelSelector:
        matchLabels:
          app: api-server
      minDomains: 3 # With fewer than 3 eligible zones, the global minimum counts as 0 (1.24+)
      nodeAffinityPolicy: Honor # Consider node affinity in domain calculation (1.26+)
      nodeTaintsPolicy: Honor # Consider node taints (1.26+)
    # Spread across nodes (soft constraint)
    - maxSkew: 2
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway # Soft: schedule even if skew would exceed maxSkew
      labelSelector:
        matchLabels:
          app: api-server
      matchLabelKeys: # Count only pods from the same rollout revision (1.27+)
        - pod-template-hash # Each ReplicaSet is spread independently
| Field | Default | Meaning |
|---|---|---|
| maxSkew | 1 | Maximum allowed difference between the most- and least-loaded topology domain |
| topologyKey | (required) | Node label key that defines the topology domain (zone, hostname, rack, etc.) |
| whenUnsatisfiable | DoNotSchedule | DoNotSchedule (hard) or ScheduleAnyway (soft: schedule but penalize) |
| labelSelector | (none) | Which pods to count when computing domain loads |
| minDomains | nil | Minimum number of eligible topology domains; if fewer exist, the global minimum is treated as 0, so each domain can hold at most maxSkew matching pods (requires DoNotSchedule) |
| matchLabelKeys | nil | Additional label keys (e.g., pod-template-hash) to scope spread per rollout revision |
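To make the maxSkew arithmetic concrete, the annotated fragment below walks through a single placement decision. The zone names and pod counts are invented for illustration, and the calculation is a simplified reading of how a hard constraint is evaluated: it ignores minDomains and the node inclusion policies.

```yaml
# Hypothetical cluster state: matching pods per zone
#   zone-a: 3   zone-b: 2   zone-c: 2      -> global minimum = 2
#
# For an incoming pod, skew(zone) = (pods in zone + 1) - global minimum
#   zone-a: (3 + 1) - 2 = 2  -> exceeds maxSkew 1, nodes in zone-a are filtered out
#   zone-b: (2 + 1) - 2 = 1  -> allowed
#   zone-c: (2 + 1) - 2 = 1  -> allowed
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api-server
```

With whenUnsatisfiable: ScheduleAnyway the same numbers would lower the score of zone-a nodes instead of excluding them.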
Cluster-Level Default Spread
# KubeSchedulerConfiguration: default topology spread for pods that declare no topologySpreadConstraints
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 3
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
            - maxSkew: 5
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List # Use the list above instead of the built-in system defaults
Priority and Preemption
PriorityClass assigns a numeric priority to pods. When a high-priority pod cannot be scheduled due to resource constraints, the scheduler may preempt (evict) lower-priority pods to make room.
# Define priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000 # Higher number = higher priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority # Default: can preempt lower-priority pods
description: "Critical production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100
globalDefault: false
preemptionPolicy: Never # This class cannot preempt other pods
description: "Low-priority batch jobs"
---
# Use in pod
spec:
  priorityClassName: critical-production
Built-in System Priority Classes
| Name | Value | Used by |
|---|---|---|
| system-cluster-critical | 2,000,000,000 | Cluster-critical add-ons (typically kube-dns/CoreDNS) |
| system-node-critical | 2,000,001,000 | kubelet static pods, node-critical DaemonSets (typically kube-proxy and CNI agents) |
When the scheduler preempts pods to make room for a high-priority pod, it honors PodDisruptionBudgets only on a best-effort basis: it prefers victims whose removal would not violate a PDB, but if no such victims exist it preempts anyway. PDBs are only guaranteed for the Eviction API (voluntary disruptions). A critical pod can preempt multiple pods simultaneously, potentially violating quorum guarantees. Mitigate by setting priority values so that high-priority pods cannot preempt quorum-critical workloads.
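For reference, a PodDisruptionBudget of the kind discussed above might look like the following sketch. The name, namespace, and label are hypothetical, and a three-member quorum workload is assumed. It constrains evictions issued through the Eviction API and should not be treated as protection against preemption.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-quorum          # hypothetical name
  namespace: production      # hypothetical namespace
spec:
  minAvailable: 2            # keep at least 2 of the assumed 3 members running
  selector:
    matchLabels:
      app: etcd              # hypothetical label
```

Pairing a budget like this with a PriorityClass at or above the level of any workload that could preempt it is what actually keeps the quorum safe.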
# Check pod priority
kubectl get pod <pod> -o jsonpath='{.spec.priority} {.spec.priorityClassName}'
# See the node on which preemption is making room for a pending pod (status.nominatedNodeName)
kubectl get pod <high-priority-pod> -o jsonpath='{.status.nominatedNodeName}'
# List all priority classes
kubectl get priorityclasses --sort-by='.value'
Multiple Schedulers and Profiles
The scheduler supports multiple named profiles within a single binary. Each profile can enable/disable different plugins, allowing different scheduling behavior for different workload types.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  # Default profile
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: NodeResourcesBalancedAllocation # Stop favoring nodes with balanced CPU/memory usage
  # Batch profile: maximize node utilization (bin-packing)
  - schedulerName: batch-scheduler
    pluginConfig:
      # In the v1 config API, MostAllocated is a scoring strategy of NodeResourcesFit, not a separate plugin
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated # Prefer nodes that are already heavily allocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
# Use the batch scheduler for a specific pod
spec:
  schedulerName: batch-scheduler # Must match a profile name in KubeSchedulerConfiguration
Descheduler
The scheduler only places pods at creation time. Once running, pods stay on their initial node even if the cluster becomes imbalanced (new nodes added, workloads removed). The Descheduler is an add-on that periodically evicts pods to trigger rebalancing.
Key Strategies
| Strategy | What it evicts | Use case |
|---|---|---|
| LowNodeUtilization | Pods from over-utilized nodes to under-utilized ones (thresholds: CPU/memory %) | Rebalance after node additions or workload changes |
| RemoveDuplicates | Extra pods from the same ReplicaSet/Deployment on the same node (when topology spread was violated at creation) | Fix uneven distribution from rapid scale-outs |
| RemovePodsViolatingNodeAffinity | Pods on nodes that no longer satisfy their node affinity (labels changed after scheduling) | Enforce node affinity rules retroactively |
| RemovePodsViolatingTopologySpreadConstraint | Pods causing topology spread violations | Rebalance after node failures or additions |
| RemovePodsHavingTooManyRestarts | Pods with excessive restarts (crash-looping) | Force reschedule of unstable pods to different nodes |
| PodLifeTime | Pods older than a configured age | Enforce pod refresh cycle for security/version freshness |
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: LowNodeUtilization
        args:
          thresholds: # Nodes below ALL of these count as underutilized
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds: # Nodes above ANY of these count as overutilized; pods are evicted from them
            cpu: 50
            memory: 50
            pods: 50
      - name: RemoveDuplicates
        args:
          excludeOwnerKinds: ["ReplicaSet"] # Skip pods owned by a ReplicaSet
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
          - RemoveDuplicates
Gang Scheduling
Gang scheduling ensures that a group of pods is scheduled all-or-nothing. Without it, distributed ML training jobs (PyTorch, MPI) can deadlock: worker pods consume resources while the parameter server pod sits Pending, and no job makes progress.
Volcano
# PodGroup: define the minimum pods needed to start
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pytorch-training-job
  namespace: ml-platform
spec:
  minMember: 8 # All 8 pods must be schedulable before ANY start
  minResources:
    cpu: "32"
    memory: 64Gi
  queue: ml-training
  priorityClassName: ml-high
---
# Reference PodGroup from pods
spec:
  schedulerName: volcano
metadata:
  annotations:
    scheduling.volcano.sh/pod-group: pytorch-training-job
Apache YuniKorn
# YuniKorn queue-based scheduling
spec:
  schedulerName: yunikorn
metadata:
  labels:
    queue: root.ml-platform.training
  annotations:
    yunikorn.apache.org/task-group-name: pytorch-workers
    yunikorn.apache.org/task-groups: |
      [{
        "name": "pytorch-workers",
        "minMember": 8,
        "minResource": {"cpu": "4", "memory": "8Gi"}
      }]
Scheduling Decision Guide
| Goal | Mechanism | Notes |
|---|---|---|
| Require specific node type | nodeSelector or nodeAffinity.required | Use affinity for complex expressions |
| Prefer certain nodes | nodeAffinity.preferred with weight | Multiple preferences sum their weights |
| Keep pods off certain nodes | Taint + toleration; or nodeAffinity.required NotIn | Taint is node-side; affinity is pod-side |
| Co-locate with another workload | podAffinity.required, topologyKey: hostname | E.g., app next to its sidecar cache |
| Spread replicas across zones | topologySpreadConstraints with zone key | Prefer over anti-affinity for large fleets |
| Spread replicas across nodes | topologySpreadConstraints with hostname key | Use maxSkew: 1 for tight spread |
| Dedicated nodes for workload type | Taint nodes + add toleration to pods | nodeSelector ensures pods only go there |
| Prioritize critical pods under pressure | PriorityClass with preemption | PDBs are honored only best-effort during preemption |
| All-or-nothing batch scheduling | Volcano PodGroup / YuniKorn task groups | Prevents deadlock in distributed training |
| Rebalance after cluster changes | Descheduler LowNodeUtilization | Evicts and reschedules; disrupts running pods |
Metrics
| Metric | Labels | Use |
|---|---|---|
| scheduler_pending_pods | queue (active/backoff/unschedulable) | Scheduling queue depth; rising unschedulable = cluster capacity issue |
| scheduler_scheduling_attempt_duration_seconds | profile, result (scheduled/unschedulable/error) | Scheduling latency percentiles |
| scheduler_preemption_victims | none (histogram) | Number of pods selected as victims per preemption attempt |
| scheduler_pod_scheduling_sli_duration_seconds | attempts | E2E scheduling SLI: time from a pod entering the scheduling queue until it is scheduled, possibly across several attempts |
| kube_pod_status_unschedulable | pod, namespace | Exposed by kube-state-metrics: 1 if pod is currently unschedulable |
Alerting Rules
groups:
  - name: scheduling
    rules:
      # Pods stuck unschedulable for > 5 minutes
      - alert: PodsUnschedulable
        expr: scheduler_pending_pods{queue="unschedulable"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} pods are unschedulable for > 5 minutes"
          description: "Check kubectl describe pod for scheduling failure reasons"
      # Scheduling latency too high (p99 > 1s)
      - alert: SchedulerHighLatency
        expr: |
          histogram_quantile(0.99,
            rate(scheduler_scheduling_attempt_duration_seconds_bucket{result="scheduled"}[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scheduler p99 latency > 1s: large cluster or complex constraints"
      # High preemption rate (priority pods evicting others)
      # scheduler_preemption_victims is a histogram, so rate the _sum series to count victims
      - alert: HighPreemptionRate
        expr: rate(scheduler_preemption_victims_sum[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scheduler is preempting >30 pods/min: check priority class configuration"
      # Scheduler not running / unhealthy
      - alert: KubeSchedulerDown
        expr: absent(up{job="kube-scheduler"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "kube-scheduler is down: no new pods can be scheduled"
Runbooks
Pods Stuck in Pending (Unschedulable)
# Get the failure reason
kubectl describe pod <pod> -n <namespace> | grep -A20 Events
# Common reasons and fixes:
# "Insufficient cpu/memory" -> scale up nodes or reduce requests
# "node(s) had taint ... that the pod didn't tolerate" -> add toleration
# "node(s) didn't match Pod's node affinity" -> check labels on nodes
# "didn't match pod anti-affinity rules" -> use topology spread instead
# "0 nodes available" -> check minDomains / zone availability
# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources:"
# Check if nodes exist with required labels
kubectl get nodes -l kubernetes.io/os=linux,node-type=gpu
Pod Scheduled to Wrong Node Type
# Check what labels the scheduled node has
kubectl get node $(kubectl get pod <pod> -o jsonpath='{.spec.nodeName}') \
--show-labels
# Check the pod's nodeSelector / affinity
kubectl get pod <pod> -o jsonpath='{.spec.nodeSelector}'
kubectl get pod <pod> -o yaml | grep -A30 affinity
# If nodeSelector is missing/wrong, update the Deployment
kubectl patch deployment <name> -n <namespace> --type=merge \
-p '{"spec":{"template":{"spec":{"nodeSelector":{"node-type":"gpu"}}}}}'
Topology Spread Constraint Blocking Scheduling
# Check pod events for spread violations
kubectl describe pod <pod> | grep "didn't match.*topology"
# Check current pod distribution
kubectl get pods -n <namespace> -l app=<app> \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
# Count pods per zone (the zone label lives on the node, not the pod)
kubectl get pods -n <namespace> -l app=<app> \
  -o jsonpath='{.items[*].spec.nodeName}' | tr ' ' '\n' | \
  xargs -I{} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' | \
  sort | uniq -c
# Temporarily loosen constraint
kubectl patch deployment <name> -n <namespace> --type=json -p='[{
"op":"replace",
"path":"/spec/template/spec/topologySpreadConstraints/0/whenUnsatisfiable",
"value":"ScheduleAnyway"
}]'
High-Priority Pod Not Triggering Preemption
# Check if priority class is set correctly
kubectl get pod <pod> -o jsonpath='{.spec.priority} {.spec.priorityClassName}'
# Check if there are lower-priority pods to preempt
kubectl get pods -A --sort-by='.spec.priority' -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,PRIORITY:.spec.priority,NODE:.spec.nodeName' | head -20
# Check scheduler logs for preemption decisions
kubectl logs -n kube-system -l component=kube-scheduler | grep -i "preempt\|nominat"
# Check nominated node (scheduler's preemption candidate)
kubectl get pod <pod> -o jsonpath='{.status.nominatedNodeName}'
Uneven Pod Distribution After Scale-Out
# Check distribution
kubectl get pods -l app=<app> -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c
# Trigger rebalancing via Descheduler (if installed)
kubectl create job descheduler-manual --from=cronjob/descheduler -n kube-system
# Manual rebalancing: rollout restart spreads pods fresh
kubectl rollout restart deployment <name> -n <namespace>
Best Practices
- Use topologySpreadConstraints over pod anti-affinity for spreading replicas: topology spread is roughly O(n) while required pod anti-affinity is roughly O(n²). At 50+ replicas, the scheduling latency difference is significant.
- Set minDomains when you require multi-zone spread: without it, topology spread will place all pods in a single zone when that is the only zone with eligible nodes. With minDomains: 3, no zone can hold more than maxSkew matching pods until at least 3 zones are available.
- Use preferredDuring affinity for preferences, not requirements: requiredDuring constraints that are too strict cause pods to stay Pending indefinitely. Reserve required for genuine hardware requirements (GPUs, OS, architecture).
- Taint dedicated nodes AND use nodeSelector: a taint prevents non-tolerated pods from landing on a node, but doesn't force your pods to land there. Add nodeSelector or affinity to direct your pods to the dedicated nodes.
- Set preemptionPolicy: Never for batch workloads: this prevents low-value batch jobs from preempting other pods when they momentarily spike. Production workloads should have higher priority, but batch should not preempt anyone.
- Run the Descheduler for long-running clusters: initial scheduling decisions become suboptimal as the cluster changes. Descheduler with LowNodeUtilization and RemoveDuplicates keeps placement healthy without manual intervention.
- Use gang scheduling for distributed ML/HPC jobs: partial scheduling of distributed training is worse than no scheduling. Without a PodGroup minimum, workers consume GPU resources while waiting for peers, blocking other jobs.
- Monitor scheduler_pending_pods{queue="unschedulable"} continuously: a growing unschedulable count is an early warning of capacity shortage or misconfigured constraints, often surfacing before users notice impact.
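
The dedicated-node practice above (taint plus nodeSelector) can be sketched as a single Deployment. Everything named here is hypothetical: the dedicated=ml taint and label, the trainer workload, and the placeholder image. The nodes are assumed to have been tainted and labeled beforehand, as shown in the leading comments.

```yaml
# Assumes nodes were prepared with a hypothetical key/value pair:
#   kubectl taint nodes <node> dedicated=ml:NoSchedule
#   kubectl label nodes <node> dedicated=ml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trainer                # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: trainer
  template:
    metadata:
      labels:
        app: trainer
    spec:
      tolerations:             # lets these pods onto the tainted nodes
        - key: dedicated
          operator: Equal
          value: ml
          effect: NoSchedule
      nodeSelector:            # keeps these pods off every other node
        dedicated: ml
      containers:
        - name: trainer
          image: registry.example.com/trainer:1.0   # placeholder image
```

The toleration alone would still let the pods land on untainted general-purpose nodes, and the nodeSelector alone would not keep other workloads off the dedicated pool, which is why both halves are needed.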