kube-scheduler
How Kubernetes decides which node runs each Pod — the scheduling framework, two-phase filter/score pipeline, 12 extension points, preemption, topology spread, custom profiles, and performance tuning for large clusters.
What kube-scheduler Does
kube-scheduler watches the API server for Pods whose spec.nodeName is empty (unbound Pods) and assigns each to the most suitable node by writing a Binding object — a POST to /api/v1/namespaces/{ns}/pods/{name}/binding. Once bound, kubelet on the target node picks up the Pod via its own watch stream and starts it.
The scheduler acts only on unbound Pods; once a Pod is bound, it has no ongoing role. If a node dies, the node-lifecycle controller (in kube-controller-manager) marks that node's Pods for deletion, and the scheduler places the replacement Pods created by the owning ReplicaSet or StatefulSet.
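As a rough illustration of the binding call described above, the sketch below builds the request path and body for a Binding. The helper name `binding_request` and the sample Pod and node names are hypothetical; the sketch only constructs the payload and does not talk to an API server.

```python
import json


def binding_request(namespace: str, pod_name: str, node_name: str):
    """Return (path, body) for the POST that binds a Pod to a node."""
    path = f"/api/v1/namespaces/{namespace}/pods/{pod_name}/binding"
    body = {
        "apiVersion": "v1",
        "kind": "Binding",
        "metadata": {"name": pod_name, "namespace": namespace},
        # The target names the node the Pod is being bound to.
        "target": {"apiVersion": "v1", "kind": "Node", "name": node_name},
    }
    return path, body


# Hypothetical example values.
path, body = binding_request("default", "my-pod", "worker-1")
print(path)
print(json.dumps(body, indent=2))
```

Once the API server accepts this object, the Pod's spec.nodeName is set and the scheduler's work for that Pod is done.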
Process identity
- Binary: kube-scheduler
- Static pod: /etc/kubernetes/manifests/kube-scheduler.yaml
- Secure port: :10259 (HTTPS, metrics + healthz)
- Leader-elected: only one active instance at a time
- Kubeconfig: /etc/kubernetes/scheduler.conf
Scheduling throughput
- Typical: ~100–200 pods/second on a 1000-node cluster
- Bottleneck: per-pod Filter + Score runs across all nodes
- Optimization: percentageOfNodesToScore limits nodes evaluated
- Parallel goroutines: parallelism field in KubeSchedulerConfiguration (default 16)
Scheduling Framework and Extension Points
The Scheduling Framework (introduced in 1.15, stable in 1.19) replaces the old predicates/priorities model. Every scheduling decision is decomposed into 12 ordered extension points, each implemented by one or more plugins.
| Extension Point | Called per | Purpose | On failure |
|---|---|---|---|
| QueueSort | Pod pair (compare) | Determines ordering in the activeQ heap. Default: by priority, then creation time. | N/A |
| PreFilter | Pod (once) | Compute and cache pod-level data (e.g., resource sum, affinity terms). Return an error to reject immediately. | Pod unschedulable |
| Filter | Pod × Node | Eliminates nodes that cannot run the Pod. Runs in parallel across nodes. | Node removed from feasible set |
| PostFilter | Pod (if Filter left no nodes) | Last resort: attempt preemption. Default plugin: DefaultPreemption. | Pod stays unschedulable |
| PreScore | Pod (once) | Compute and cache shared scoring state for feasible nodes. | Pod unschedulable |
| Score | Pod × feasible Node | Assign a 0–100 score. Runs in parallel. With multiple plugins, the weighted scores are summed. | Scheduling cycle aborted (a plugin error is not treated as score 0) |
| NormalizeScore | Per plugin, all nodes | Re-scale the plugin's raw scores to the 0–100 range before weighting. | Pod unschedulable |
| Reserve | Pod + selected node | Claim resources optimistically (e.g., volume binding). Rollback via Unreserve on failure. | → Unreserve |
| Permit | Pod + selected node | Approve / deny / wait. Wait is used for gang scheduling (hold until all pods in the group are ready to bind). | → Unreserve |
| PreBind | Pod + node | Perform pre-binding work (e.g., provision and attach volumes). | → Unreserve |
| Bind | Pod + node | Write the Binding object to kube-apiserver. Default: DefaultBinder (POST /binding). | → Unreserve |
| PostBind | Pod + node | Informational hook after successful bind. No failure handling. | N/A |
Built-in Scheduler Plugins
Filter Plugins
NodeUnschedulable
Rejects nodes with spec.unschedulable: true (cordoned nodes), unless Pod tolerates the node.kubernetes.io/unschedulable taint.
NodeName
Rejects all nodes except spec.nodeName if explicitly set. Short-circuits Filter — only one node evaluated.
NodeResourcesFit
Checks that the node's allocatable CPU, memory, and extended resources cover the Pod's requests. The check uses requests, not limits; for resources such as hugepages, where requests must equal limits, the two coincide. As a Score plugin it supports the LeastAllocated, MostAllocated, and RequestedToCapacityRatio strategies.
NodeAffinity
Enforces spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution. Preferred terms scored in the Score phase.
TaintToleration
Rejects nodes whose taints are not tolerated by the Pod's spec.tolerations.
InterPodAffinity
Required pod affinity/anti-affinity rules (hard). Checks existing pods' labels on candidate node's topology domain.
VolumeBinding
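Checks that the Pod's PVCs can be satisfied on the node: volume zone, PV node affinity, access mode. For a WaitForFirstConsumer StorageClass, binding and dynamic provisioning are triggered once a node has been selected (in PreBind), before the Pod itself is bound.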
VolumeRestrictions
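Checks volume-level conflicts, such as two Pods using the same disk in an incompatible way or a ReadWriteOncePod claim that is already in use. The per-node volume count limit (e.g., AWS EBS: 39, GCE PD: 127) is enforced by the separate NodeVolumeLimits plugin, which counts in-use and reserved-but-not-yet-bound volumes.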
PodTopologySpread
Enforces spec.topologySpreadConstraints with whenUnsatisfiable: DoNotSchedule. Also scores with ScheduleAnyway mode.
NodePorts
Rejects nodes that already have a Pod occupying the same hostPort.
NodeResourcesBalancedAllocation
Score-only plugin; it never rejects a node and is listed here for completeness. It favors nodes where CPU and memory utilization would be most balanced after the Pod is placed.
EBSLimits / GCEPDLimits / AzureDiskLimits
Cloud-specific volume count limits. Replaced by node-level CSI migration in 1.23+.
Score Plugins
NodeResourcesLeastAllocated (default)
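Scores nodes with more free capacity higher, which spreads pods evenly across nodes. In the v1 configuration API this is the LeastAllocated scoring strategy of NodeResourcesFit rather than a standalone plugin. Formula: score = (cpu_free/cpu_cap * w_cpu + mem_free/mem_cap * w_mem) / (w_cpu + w_mem) * 100. Default weights: CPU=1, Memory=1.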
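The formula above can be worked through with a small sketch. The function name `least_allocated_score` and the sample node are hypothetical, and `requested` is assumed to already include the incoming Pod's requests; this illustrates the arithmetic only, not the plugin's actual code.

```python
def least_allocated_score(requested, allocatable, weights=None):
    """Weighted average of the free fraction per resource, scaled to 0-100."""
    weights = weights or {"cpu": 1, "memory": 1}
    total_weight = sum(weights.values())
    weighted = 0.0
    for resource, weight in weights.items():
        capacity = allocatable[resource]
        free = max(capacity - requested[resource], 0)
        weighted += (free / capacity) * weight
    return weighted / total_weight * 100


GIB = 1024 ** 3
# Hypothetical node: 4000m CPU and 8 GiB allocatable; after placing the Pod,
# 1000m CPU and 4 GiB would be requested in total.
score = least_allocated_score(
    requested={"cpu": 1000, "memory": 4 * GIB},
    allocatable={"cpu": 4000, "memory": 8 * GIB},
)
print(score)  # CPU is 75% free, memory 50% free -> 62.5
```

Raising the memory weight would pull the score toward the memory term, which is how the per-resource weights in the scoring strategy change placement.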
NodeResourcesMostAllocated (bin-packing)
Scores nodes with less free capacity higher. Used when consolidating workloads to save cost (node autoscaler can drain underutilized nodes). Enable via scheduler profile config.
NodeAffinity (preferred)
Scores nodes matching preferredDuringSchedulingIgnoredDuringExecution rules. The weights of all matching preferred terms are summed per node, then normalized to 0–100.
InterPodAffinity (preferred)
Scores nodes where preferred affinity/anti-affinity rules are satisfied. Careful: O(pods × nodes) calculation — expensive on large clusters.
ImageLocality
Scores nodes that already have the container image cached. Score proportional to sum of image layer sizes cached on node. Reduces image pull latency for large images.
PodTopologySpread (preferred)
When whenUnsatisfiable: ScheduleAnyway, scores candidate nodes to minimize topology skew. Complementary to the Filter plugin variant.
Scheduling Queue Internals
The scheduler maintains three internal queues that govern when and in what order Pods are evaluated:
| Queue | Mechanism | Move condition |
|---|---|---|
| activeQ | Max-heap ordered by priority + creation time (via QueueSort plugin) | New pods, returned from backoffQ/unschedulableQ |
| backoffQ | Min-heap ordered by backoff expiry timestamp. Base: 1s, doubles each failure, max 10s | A previously failed Pod is due for a retry but its backoff timer has not yet expired |
| unschedulableQ | HashMap, re-queued on relevant cluster events (new Node, updated PV, changed Pod labels) or by a periodic flush | No feasible node found (all nodes filtered out) |
When a new Node registers, or a Pod is deleted (freeing resources), or a PV becomes available, the scheduler identifies which Pods in unschedulableQ might now be schedulable (using the cluster events each plugin registers for) and moves them to activeQ, or to backoffQ if their backoff has not yet expired. This avoids busy-waiting and scales well to thousands of pending pods.
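The backoffQ timing in the table (base 1s, doubling per failure, capped at 10s) can be sketched as follows. The function name `backoff_seconds` is hypothetical, and the defaults are simply the values quoted above.

```python
def backoff_seconds(attempts: int, initial: float = 1.0, maximum: float = 10.0) -> float:
    """Backoff applied after `attempts` failed scheduling attempts (attempts >= 1)."""
    duration = initial
    for _ in range(1, attempts):
        duration *= 2
        if duration >= maximum:
            return maximum
    return duration


schedule = [backoff_seconds(n) for n in range(1, 7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

A Pod that keeps failing therefore settles at one retry opportunity every 10 seconds, rather than hammering the Filter phase.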
Two-Phase Deep Dive: Filter and Score
Filter Phase
Each Filter plugin's Filter(pod, nodeInfo) method runs for every (pod, candidate node) pair. Runs are parallelized across nodes (default 16 goroutines). A node joins the feasible set only if every plugin returns success.
# See which plugins rejected a pod and why
kubectl describe pod my-pod | grep -A 20 "Events:"
# Events will show: "0/5 nodes are available: 2 node(s) had taint, 3 Insufficient memory"
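# Filter extension-point latency for attempts that returned an error (histogram; per-plugin timings are in scheduler_plugin_execution_duration_seconds)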
scheduler_framework_extension_point_duration_seconds{extension_point="Filter",status="Error"}
# Enable verbose scheduler logs (very noisy — use only in non-prod)
# Add to kube-scheduler pod: --v=10
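The percentageOfNodesToScore optimization: on clusters with 100 or more nodes, the scheduler stops filtering once it has found enough feasible nodes. With the default (0), the target is adaptive: roughly 50% of nodes on a 100-node cluster, shrinking linearly to 10% at 5,000 nodes, never below 5% and never fewer than 100 nodes. This is a heuristic that is good enough in practice; set the value to 100 to evaluate every node when placement quality matters more than throughput.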
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 0 # 0 = use adaptive default based on cluster size
profiles:
- schedulerName: default-scheduler
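To make the adaptive setting concrete, here is a sketch of how the number of feasible nodes to look for might be derived from cluster size. It follows the upstream heuristic as I understand it (50% minus one point per 125 nodes, floored at 5%, with a 100-node minimum); the function and constant names are hypothetical and the exact constants should be checked against your Kubernetes release.

```python
MIN_FEASIBLE_NODES = 100  # assumed floor: smaller clusters evaluate every node
MIN_PERCENTAGE = 5        # assumed floor for the adaptive percentage


def nodes_to_find(num_nodes: int, percentage: int = 0) -> int:
    """How many feasible nodes the Filter phase looks for before stopping."""
    if num_nodes < MIN_FEASIBLE_NODES or percentage >= 100:
        return num_nodes
    if percentage <= 0:
        # Adaptive default: starts near 50% and shrinks as the cluster grows.
        percentage = max(50 - num_nodes // 125, MIN_PERCENTAGE)
    return max(num_nodes * percentage // 100, MIN_FEASIBLE_NODES)


for size in (50, 200, 1000, 5000, 10000):
    print(size, nodes_to_find(size))
```

Under these assumptions a 1,000-node cluster stops after 420 feasible nodes and a 5,000-node cluster after 500, which is why the default scales well without an explicit setting.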
Score Phase
Each Score plugin returns 0–100 per feasible node. The final score for each node is the weighted sum: Σ(plugin_weight × normalized_score). NormalizeScore brings raw scores within a plugin into [0,100] range. The node with the highest final score wins. Ties broken by random selection (to prevent hot nodes).
# Example: default score plugin weights
# (default-scheduler profile, kubescheduler.config.k8s.io/v1; verify against your release)
plugins:
  score:
    enabled:
    - name: NodeResourcesBalancedAllocation
      weight: 1
    - name: ImageLocality
      weight: 1
    - name: InterPodAffinity
      weight: 2
    - name: NodeResourcesFit # LeastAllocated strategy
      weight: 1
    - name: NodeAffinity
      weight: 2
    - name: PodTopologySpread
      weight: 2 # higher weight = stronger influence
    - name: TaintToleration
      weight: 3
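The weighted sum and random tie-break described for the Score phase can be sketched like this. The function name `pick_node`, the node names, and the per-plugin scores are all hypothetical; the scores are assumed to be already normalized to 0–100.

```python
import random


def pick_node(normalized_scores, weights, rng=random):
    """normalized_scores: {plugin: {node: 0..100}}; weights: {plugin: int}."""
    totals = {}
    for plugin, per_node in normalized_scores.items():
        for node, score in per_node.items():
            totals[node] = totals.get(node, 0) + weights[plugin] * score
    best = max(totals.values())
    # Several nodes may share the top score; pick one of them at random.
    winners = sorted(node for node, total in totals.items() if total == best)
    return rng.choice(winners), totals


scores = {
    "NodeResourcesFit": {"node-a": 60, "node-b": 80},
    "PodTopologySpread": {"node-a": 100, "node-b": 50},
}
weights = {"NodeResourcesFit": 1, "PodTopologySpread": 2}
chosen, totals = pick_node(scores, weights)
print(chosen, totals)  # node-a wins: 60*1 + 100*2 = 260 vs 80*1 + 50*2 = 180
```

Note how node-b has more free resources but still loses, because the spread plugin carries twice the weight.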
Node Affinity, Pod Affinity, Taints and Tolerations
Node Affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # HARD — Filter
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values: [amd64]
          - key: node-role/gpu
            operator: Exists
      preferredDuringSchedulingIgnoredDuringExecution: # SOFT — Score
      - weight: 80
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: [us-east-1a]
Pod Affinity and Anti-Affinity
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # HARD anti-affinity
      - labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname # no two pods on same node
      preferredDuringSchedulingIgnoredDuringExecution: # SOFT spread across zones
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app
          topologyKey: topology.kubernetes.io/zone
Pod affinity/anti-affinity rules require examining ALL existing pods to determine which topology domains are occupied. On clusters with tens of thousands of pods this becomes an O(pods × nodes) operation per pending pod. Use PodTopologySpread instead wherever possible — it uses pre-computed per-topology counts and is significantly more efficient.
Taints and Tolerations
# Taint a node (effect: NoSchedule | PreferNoSchedule | NoExecute)
kubectl taint nodes worker-1 dedicated=gpu:NoSchedule
kubectl taint nodes worker-1 node.kubernetes.io/not-ready:NoExecute
# Remove a taint
kubectl taint nodes worker-1 dedicated=gpu:NoSchedule-
# Pod tolerates the taint
spec:
  tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
  - key: node.kubernetes.io/not-ready # tolerate node-not-ready for 300s
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
| Effect | Behavior | Common use |
|---|---|---|
| NoSchedule | New pods without toleration not scheduled here | Dedicated nodes (GPU, spot) |
| PreferNoSchedule | Scheduler tries to avoid, not hard requirement | Soft node preference |
| NoExecute | Existing pods evicted if not tolerating; new pods rejected | Node health management (node-lifecycle-controller) |
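The matching rules behind the table can be sketched as a small predicate. This is a simplified model, not the scheduler's code: the function names are hypothetical, taints and tolerations are plain dictionaries, and tolerationSeconds (which only affects eviction timing) is ignored.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key (used with operator Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def node_is_feasible(taints, tolerations) -> bool:
    """Only NoSchedule and NoExecute taints reject a node; PreferNoSchedule is soft."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
gpu_toleration = {"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}
print(node_is_feasible([gpu_taint], [gpu_toleration]))  # True
print(node_is_feasible([gpu_taint], []))                # False
```

Every hard taint needs at least one matching toleration, so a single untolerated NoSchedule taint is enough to remove the node from the feasible set.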
Topology Spread Constraints
Introduced as alpha in 1.16 (GA 1.19). Replaces complex pod anti-affinity patterns for even distribution. More efficient than InterPodAffinity because it uses pre-aggregated per-topology-domain counts.
spec:
  topologySpreadConstraints:
  - maxSkew: 1 # max allowed difference between any two domains
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule # hard constraint (Filter)
    labelSelector:
      matchLabels:
        app: my-app
    matchLabelKeys: # k8s 1.27+ — auto-exclude old revisions
    - pod-template-hash
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway # soft constraint (Score)
    labelSelector:
      matchLabels:
        app: my-app
maxSkew
Maximum allowed difference in pod count between any two topology domains. maxSkew: 1 means zones can differ by at most 1. Smaller = stricter spread = harder to schedule.
whenUnsatisfiable modes
DoNotSchedule: hard filter — pod stays pending if constraint cannot be met. ScheduleAnyway: soft score — scheduler picks the node that minimizes skew increase, but will schedule anywhere.
matchLabelKeys (1.27+)
Automatically includes only pods from the current ReplicaSet revision in skew calculations. Without this, a rolling update temporarily doubles pod count in one zone, causing false "skew exceeded" rejections.
Default constraints
Cluster-wide default topologySpreadConstraints can be set in KubeSchedulerConfiguration through the PodTopologySpread plugin's arguments (pluginConfig, args.defaultConstraints with defaultingType: List). They apply to pods that don't set their own constraints.
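A worked example of the maxSkew check with whenUnsatisfiable: DoNotSchedule may help. The sketch assumes the incoming pod matches the constraint's labelSelector (so it adds one to the domain it lands in) and compares against the least-populated domain; the function name and zone counts are hypothetical.

```python
def allowed_domains(pods_per_domain: dict, max_skew: int) -> list:
    """Domains where placing one more matching pod keeps skew within max_skew."""
    global_min = min(pods_per_domain.values())
    return sorted(
        domain
        for domain, count in pods_per_domain.items()
        if (count + 1) - global_min <= max_skew
    )


zones = {"us-east-1a": 3, "us-east-1b": 2, "us-east-1c": 2}
print(allowed_domains(zones, max_skew=1))  # ['us-east-1b', 'us-east-1c']
print(allowed_domains(zones, max_skew=2))  # all three zones
```

With maxSkew: 1, nodes in us-east-1a are filtered out because a fourth pod there would exceed the minimum (2) by two; relaxing maxSkew to 2 makes every zone eligible again.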
Preemption and PriorityClass
When no node can fit a pending pod after filtering, the DefaultPreemption plugin (running in the PostFilter extension point) tries to evict lower-priority pods to make room.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000 # user-defined classes: up to 1,000,000,000; larger values are reserved for system classes
globalDefault: false
preemptionPolicy: PreemptLowerPriority # PreemptLowerPriority | Never
description: "Critical production workloads"
---
# Use in pod spec
spec:
  priorityClassName: high-priority
Built-in system PriorityClasses:
| Name | Value | Used by |
|---|---|---|
| system-cluster-critical | 2,000,000,000 | kube-dns, kube-proxy, metrics-server |
| system-node-critical | 2,000,001,000 | kube-apiserver, etcd, kube-scheduler (static pods) |
Preemption Algorithm
- Collect candidate nodes where evicting some pods would make room for the pending pod.
- For each candidate node: identify the minimum set of pods to evict (lowest priority, fewest disruptions, no PDB violations).
- Pick the node that minimizes disruption (fewest pods evicted, respect PDB).
- Evict chosen pods (graceful termination via deletionTimestamp). Pending pod is NOT immediately scheduled: it waits for evicted pods to terminate, then re-enters the scheduling cycle.
Preemption honors PDBs on a best-effort basis: the scheduler prefers victims whose eviction would not violate a PDB's minAvailable or maxUnavailable, but if no such victims exist it will still preempt pods that a PDB covers. A PDB is therefore not a hard blocker for preemption. Note that preemptionPolicy: Never on a PriorityClass makes pods of that class non-preempting (they wait in the queue instead of evicting others); it does not protect them from being preempted by higher-priority pods.
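The victim-selection steps above can be sketched in a deliberately simplified form. This is not the DefaultPreemption implementation: it considers a single resource (CPU in millicores), ignores PDBs and affinity, and picks victims greedily from the lowest priority upward. All names and sample data are hypothetical.

```python
def victims_on_node(node, pod_cpu, pod_priority):
    """Pods to evict so that pod_cpu fits on the node, or None if it cannot fit."""
    free = node["allocatable_cpu"] - sum(p["cpu"] for p in node["pods"])
    # Only pods with lower priority than the pending pod may be evicted.
    candidates = sorted(
        (p for p in node["pods"] if p["priority"] < pod_priority),
        key=lambda p: p["priority"],
    )
    victims = []
    for pod in candidates:
        if free >= pod_cpu:
            break
        victims.append(pod)
        free += pod["cpu"]
    return victims if free >= pod_cpu else None


def pick_preemption_node(nodes, pod_cpu, pod_priority):
    """Prefer the node with the fewest victims, then the lowest victim priority."""
    best = None
    for node in nodes:
        victims = victims_on_node(node, pod_cpu, pod_priority)
        if not victims:  # None (cannot fit) or [] (fits without preemption)
            continue
        key = (len(victims), max(v["priority"] for v in victims))
        if best is None or key < best[0]:
            best = (key, node["name"], victims)
    return (best[1], [v["name"] for v in best[2]]) if best else None


nodes = [
    {"name": "n1", "allocatable_cpu": 4000, "pods": [
        {"name": "a", "cpu": 2000, "priority": 100},
        {"name": "b", "cpu": 2000, "priority": 1000}]},
    {"name": "n2", "allocatable_cpu": 4000, "pods": [
        {"name": "c", "cpu": 1000, "priority": 100},
        {"name": "d", "cpu": 1000, "priority": 100},
        {"name": "e", "cpu": 2000, "priority": 5000}]},
]
result = pick_preemption_node(nodes, pod_cpu=2000, pod_priority=2000)
print(result)  # ('n1', ['a'])
```

Node n1 wins because one eviction is enough there, while n2 would need two; pod e is never a candidate because its priority exceeds the pending pod's.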
Scheduler Profiles and Multiple Schedulers
A single kube-scheduler binary can run multiple profiles, each with its own schedulerName, plugin configuration, and plugin weights. Pods opt into a profile via spec.schedulerName.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
# Profile 1: default balanced scheduler
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 1
      - name: PodTopologySpread
        weight: 2
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
# Profile 2: bin-packing for batch jobs
- schedulerName: bin-pack-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 1
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated # pack pods tightly
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
# Pod using the bin-pack profile
spec:
  schedulerName: bin-pack-scheduler
  containers:
  - name: batch-job
    image: my-batch:latest
Running a completely separate scheduler binary is possible but operationally complex (leader election, RBAC, update coordination). Prefer scheduler profiles within the same binary — same leader election, same informer caches, far simpler to operate. Use a separate binary only when you need fundamentally different scheduling logic (e.g., gang scheduling, custom preemption) not achievable via plugin configuration.
Gang Scheduling and the Permit Extension Point
Gang scheduling (also called co-scheduling) ensures all pods in a group are scheduled together — or none are. This is critical for ML training jobs (all workers must start simultaneously) and MPI workloads.
The Permit extension point enables this: a plugin can return Wait instead of Allow. The pod is tentatively assigned to a node but held in a waiting state until the plugin explicitly allows it (or a timeout expires).
# Coscheduling via the scheduler-plugins project (sig-scheduling)
# https://github.com/kubernetes-sigs/scheduler-plugins
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: my-ml-job
spec:
  minMember: 4 # all 4 workers must be schedulable before any binds
  minResources:
    cpu: "4"
    memory: 8Gi
---
# Pod that joins the group (the label belongs under metadata, not spec)
metadata:
  labels:
    scheduling.sigs.k8s.io/pod-group: my-ml-job
spec:
  schedulerName: scheduler-plugins-scheduler
The Permit plugin tracks how many pods in the group have been tentatively placed. Once minMember is reached, all waiting pods are released to Bind simultaneously. If the timeout expires before all members are placed, waiting pods are rejected (→ Unreserve) and returned to the queue.
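The wait-then-release behavior described above can be modeled with a small counter. This is a toy model of the idea, not the scheduler-plugins Coscheduling code: the class name `GangPermit`, its methods, and the pod names are hypothetical, and real timeouts are replaced by an explicit `timeout` call.

```python
class GangPermit:
    """Holds pods at Permit until a group reaches its minMember count."""

    def __init__(self, min_member: dict):
        self.min_member = min_member  # group name -> required pod count
        self.waiting = {}             # group name -> pods currently held

    def permit(self, group: str, pod: str):
        held = self.waiting.setdefault(group, [])
        held.append(pod)
        if len(held) >= self.min_member[group]:
            released = list(held)
            held.clear()
            return "Allow", released  # the whole group proceeds to Bind
        return "Wait", []

    def timeout(self, group: str):
        """On timeout, every held pod is rejected and goes back to the queue."""
        return self.waiting.pop(group, [])


gang = GangPermit({"my-ml-job": 4})
results = [gang.permit("my-ml-job", f"worker-{i}") for i in range(4)]
print(results[-1])  # ('Allow', ['worker-0', 'worker-1', 'worker-2', 'worker-3'])
```

The first three workers receive Wait and keep their node reservations; only the fourth placement releases all of them together.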
Leader Election
In HA setups, multiple kube-scheduler replicas run but only one is active. Leadership is managed via a Lease object in the kube-system namespace. The details are identical to the kube-controller-manager pattern — see 00-control-plane-overview.html § Leader Election.
# Check current scheduler leader
kubectl -n kube-system get lease kube-scheduler -o yaml
# .spec.holderIdentity shows the active scheduler pod
# Scheduler logs showing leader election
kubectl -n kube-system logs kube-scheduler-master-1 | grep -i "leader"
# I0115 10:30:00 leaderelection.go:248] attempting to acquire leader lease...
# I0115 10:30:01 leaderelection.go:258] successfully acquired lease kube-system/kube-scheduler
Configuration Reference
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Leader election
leaderElection:
  leaderElect: true
  leaseDuration: 15s
  renewDeadline: 10s
  retryPeriod: 2s
  resourceNamespace: kube-system
  resourceName: kube-scheduler
# API server connection
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
  qps: 50
  burst: 100
# Performance tuning
percentageOfNodesToScore: 0 # 0 = adaptive (recommended)
parallelism: 16 # goroutines for filter/score
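# Health / metrics
# Served on the secure port (--secure-port, default 10259, with --bind-address).
# healthzBindAddress and metricsBindAddress belonged to older beta config versions and are not fields of the v1 API.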
| Flag / Config | Default | Purpose |
|---|---|---|
| percentageOfNodesToScore | 0 (adaptive) | Stop evaluating nodes once this % is found feasible. 0 = use the cluster-size heuristic (starts near 50%, scales down for large clusters, never fewer than 100 nodes) |
| parallelism | 16 | Number of goroutines running Filter/Score in parallel |
| --leader-elect | true | Enable leader election for HA |
| --kube-api-qps | 50 | Queries per second to kube-apiserver |
| --kube-api-burst | 100 | Burst capacity above QPS limit |
| --v | 0 | Log verbosity (deployments commonly set 2). 4+ shows scheduling decisions per pod |
Metrics and Alerting
# Scrape scheduler metrics
curl -sk https://localhost:10259/metrics \
--cert /etc/kubernetes/pki/scheduler-client.crt \
--key /etc/kubernetes/pki/scheduler-client.key | grep scheduler_
# Key metrics
scheduler_scheduling_attempt_duration_seconds{result="scheduled"} # e2e scheduling latency
scheduler_scheduling_attempt_duration_seconds{result="unschedulable"}
scheduler_pending_pods{queue="active"} # pods waiting to be scheduled
scheduler_pending_pods{queue="backoff"} # pods in backoff
scheduler_pending_pods{queue="unschedulable"}# pods with no node found
scheduler_preemption_attempts_total # total preemption attempts
scheduler_preemption_victims # victims selected per preemption (histogram; use _sum / _count)
scheduler_framework_extension_point_duration_seconds{extension_point="Filter"}
scheduler_framework_extension_point_duration_seconds{extension_point="Score"}
scheduler_volume_scheduling_stage_error_reason # volume binding errors
Prometheus Alerting Rules
groups:
- name: kube-scheduler
  rules:
  - alert: KubeSchedulerDown
    expr: absent(up{job="kube-scheduler"} == 1)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "kube-scheduler is down"
      description: "No kube-scheduler instance is up. Pods cannot be scheduled."
  - alert: KubeSchedulerHighPendingPods
    expr: scheduler_pending_pods{queue="unschedulable"} > 50
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High number of unschedulable pods"
      description: "{{ $value }} pods have been unschedulable for >15m."
  - alert: KubeSchedulerHighLatency
    expr: |
      histogram_quantile(0.99,
        rate(scheduler_scheduling_attempt_duration_seconds_bucket{result="scheduled"}[5m])
      ) > 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Scheduler p99 latency > 5s"
      description: "P99 scheduling latency is {{ $value }}s."
  - alert: KubeSchedulerPreemptionHigh
    # scheduler_preemption_victims is a histogram, so rate the _sum series
    expr: rate(scheduler_preemption_victims_sum[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High preemption rate"
      description: "Scheduler is evicting >10 pods/s. Check PriorityClass assignments."
Troubleshooting Scheduling Issues
Pod stuck in Pending — "0/N nodes are available"
# Full scheduling failure reason
kubectl describe pod my-pod | grep -A 30 "Events:"
# Common messages:
# "0/5 nodes are available: 2 Insufficient cpu, 3 node(s) had taint..."
# "0/5 nodes are available: 5 node(s) didn't match pod's node affinity/selector"
# "0/5 nodes are available: 5 pod has unbound immediate PersistentVolumeClaims"
# Check node resources and allocatable
kubectl describe nodes | grep -A 8 "Allocated resources"
# Check if nodes are cordoned
kubectl get nodes | grep SchedulingDisabled
# Check taints on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check if the pod's resource requests are too high
kubectl get pod my-pod -o jsonpath='{.spec.containers[*].resources}'
Pod stuck in Pending — PVC binding issue
# Check PVC status
kubectl get pvc
kubectl describe pvc my-pvc
# Look for: "waiting for first consumer to be created before binding"
# This is normal for WaitForFirstConsumer StorageClass — PVC binds AFTER scheduler picks node
# Check StorageClass binding mode
kubectl get storageclass standard -o jsonpath='{.volumeBindingMode}'
# Immediate = PVC bound before scheduling (may conflict with zone affinity)
# WaitForFirstConsumer = PVC bound after scheduling (node zone determines PVC zone)
# Volume scheduling logs
kubectl -n kube-system logs kube-scheduler-master-1 | grep -i "volume"
Pods not spreading evenly across zones
# Count pods per node (column 7 of -o wide is NODE); map nodes to zones with the label query below
kubectl get pods -l app=my-app -o wide --no-headers | awk '{print $7}' | sort | uniq -c
# Check if topologySpreadConstraints are set
kubectl get deploy my-deploy -o jsonpath='{.spec.template.spec.topologySpreadConstraints}'
# Check zone labels on nodes
kubectl get nodes --show-labels | grep topology.kubernetes.io/zone
# If using old pod anti-affinity instead:
# Migrate to topologySpreadConstraints for better performance and flexibility
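# Check scheduler logs for spread scoring (requires the scheduler itself to run with --v=4 or higher;
# a -v flag on kubectl only changes kubectl's own verbosity)
kubectl -n kube-system logs kube-scheduler-master-1 | grep TopologySpread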
Scheduler performance — high latency / slow scheduling
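# Check scheduling attempt duration. Scrape the scheduler's own endpoint on :10259;
# kubectl get --raw /metrics returns kube-apiserver metrics, not the scheduler's.
curl -sk https://localhost:10259/metrics \
--cert /etc/kubernetes/pki/scheduler-client.crt \
--key /etc/kubernetes/pki/scheduler-client.key | grep scheduler_scheduling_attempt_duration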
# Common causes of slow scheduling:
# 1. InterPodAffinity with large pod counts — migrate to topologySpreadConstraints
# 2. percentageOfNodesToScore too high — reduce for large clusters
# 3. Too many Score plugins with high weight — profile and trim
# 4. Custom webhook in Permit extension point — check latency
# Increase parallelism for large clusters
# In KubeSchedulerConfiguration:
parallelism: 32 # default 16, increase for 1000+ node clusters
# Check if scheduler leader is healthy
kubectl -n kube-system get lease kube-scheduler -o yaml | grep holderIdentity
Preemption not working as expected
# Check if priority classes are set correctly
kubectl get priorityclass
kubectl get pod high-priority-pod -o jsonpath='{.spec.priority}'
# Check if PDB is blocking preemption
kubectl get pdb -A
kubectl describe pdb my-pdb | grep -E "Allowed Disruptions|Status"
# Check scheduler logs for preemption decisions
kubectl -n kube-system logs kube-scheduler-master-1 | grep -i preempt
# Check if the pending pod's PriorityClass has preemptionPolicy: Never (such pods never preempt others)
kubectl get priorityclass high-priority -o jsonpath='{.preemptionPolicy}'
Scheduling Sequence: Pod to Bound
Production Best Practices
Use PriorityClasses for all workloads
Define at minimum 3 tiers: critical (1M), standard (100K), batch (1K). Set preemptionPolicy: Never on batch so batch pods never evict anything and simply wait in the queue for free capacity. Set globalDefault: true on standard so unclassified pods get a sane default.
Prefer PodTopologySpread over pod anti-affinity
For spreading replicas across zones/nodes, topologySpreadConstraints is orders of magnitude more efficient than requiredDuringSchedulingIgnoredDuringExecution anti-affinity on clusters with thousands of pods.
Set resource requests on all containers
Without resources.requests, pods are BestEffort class — scheduled anywhere, not accounted for in Filter/Score calculations. This leads to overcommit, OOM kills, and unpredictable scheduling. Every production container must have CPU + memory requests.
Tune percentageOfNodesToScore on large clusters
On 1000+ node clusters, the default adaptive heuristic evaluates only a fraction of nodes. If you have strict affinity requirements, consider raising this value. If you prioritize scheduling throughput, the default is fine.
Use WaitForFirstConsumer StorageClass
For zone-aware storage, set volumeBindingMode: WaitForFirstConsumer. This lets the scheduler pick the node first (respecting topology constraints), then provision the volume in the right zone. Immediate mode creates the PV in a random zone, which can conflict with node affinity.
Node labels for topology awareness
Ensure all nodes have standard topology labels: topology.kubernetes.io/zone, topology.kubernetes.io/region, kubernetes.io/hostname. Cloud providers set these automatically; on-premise clusters need them set explicitly (or use a Node Feature Discovery operator).
Avoid scheduling during node maintenance
Cordon nodes before maintenance (kubectl cordon). This sets spec.unschedulable: true and adds the node.kubernetes.io/unschedulable taint, preventing the scheduler from placing new pods. kubectl drain cordons the node as its first step, so new pods cannot arrive while the drain is running; if you evict pods by other means, cordon first.
Monitor pending pods
Alert on scheduler_pending_pods{queue="unschedulable"} > 0 for more than 5 minutes. This indicates a scheduling failure that won't self-resolve — requires human investigation of node resources, taints, or affinity rules.