What kube-scheduler Does

kube-scheduler watches the API server for Pods whose spec.nodeName is empty (unbound Pods) and assigns each to the most suitable node by writing a Binding object — a POST to /api/v1/namespaces/{ns}/pods/{name}/binding. Once bound, kubelet on the target node picks up the Pod via its own watch stream and starts it.

▶ Scheduler is Not in the Data Path

The scheduler acts only while a Pod is unbound; once a Pod is bound, it plays no further role. If a node dies, the node-lifecycle controller (in kube-controller-manager) taints the node and its Pods are evicted; the owning ReplicaSet or StatefulSet then creates replacement Pods, which the scheduler places like any other new Pod.

Process identity

  • Binary: kube-scheduler
  • Static pod: /etc/kubernetes/manifests/kube-scheduler.yaml
  • Secure port: :10259 (HTTPS, metrics + healthz)
  • Leader-elected: only one active instance at a time
  • Kubeconfig: /etc/kubernetes/scheduler.conf

Scheduling throughput

  • Typical: on the order of 100–200 pods/second on a 1000-node cluster (varies with enabled plugins and API server limits)
  • Bottleneck: per-pod Filter + Score runs across all nodes
  • Optimization: percentageOfNodesToScore limits nodes evaluated
  • Parallel goroutines: parallelism field in KubeSchedulerConfiguration (default 16)

[Figure: Scheduler High-Level Flow. Pending Pod → Filter Phase (all nodes → feasible nodes) → Score Phase (feasible nodes ranked 0–100) → Select Winner (highest score) → Bind to Node (POST /binding)]

Scheduling Framework and Extension Points

The Scheduling Framework (introduced in 1.15, stable in 1.19) replaces the old predicates/priorities model. Every scheduling decision is decomposed into 12 ordered extension points, each implemented by one or more plugins.

[Figure: Scheduling Framework, 12 extension points. Queue: QueueSort. Filter cycle: PreFilter → Filter → PostFilter (preemption). Score cycle: PreScore → Score → NormalizeScore. Then Reserve → Permit (gang scheduling) → Bind cycle: PreBind → Bind → PostBind. On Reserve/Permit failure → Unreserve (rollback).]

Extension Point | Called per | Purpose | On failure
QueueSort | Pod pair (compare) | Determines ordering in the activeQ heap. Default: by priority, then creation time. | N/A
PreFilter | Pod (once) | Compute and cache pod-level data (e.g., resource sum, affinity terms). Return an error to reject immediately. | Pod unschedulable
Filter | Pod × Node | Eliminates nodes that cannot run the Pod. Runs in parallel across nodes. | Node removed from feasible set
PostFilter | Pod (if Filter left no nodes) | Last resort: attempt preemption. Default plugin: DefaultPreemption. | Pod stays unschedulable
PreScore | Pod (once) | Compute and cache shared scoring state for feasible nodes. | Pod unschedulable
Score | Pod × feasible Node | Assign a 0–100 score. Runs in parallel. Multiple plugins, weighted scores summed. | Scheduling cycle aborted
NormalizeScore | Per plugin, all nodes | Re-scale the plugin's raw scores to the 0–100 range before weighting. | Pod unschedulable
Reserve | Pod + selected node | Claim resources optimistically (e.g., volume binding). Rollback via Unreserve on failure. | → Unreserve
Permit | Pod + selected node | Approve / deny / wait. Wait is used for gang scheduling (hold until all pods in the group are ready to bind). | → Unreserve
PreBind | Pod + node | Perform pre-binding work (e.g., provision and attach volumes). | → Unreserve
Bind | Pod + node | Write the Binding object to kube-apiserver. Default: DefaultBinder (POST /binding). | → Unreserve
PostBind | Pod + node | Informational hook after a successful bind. No failure handling. | N/A
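
The ordering in the table above, including the Unreserve rollback, can be traced with a small simulation. This is only a sketch: run_scheduling_cycle, fits, and score are hypothetical names, plugins are reduced to plain callables, and QueueSort is left out because it runs before a pod is dequeued.

```python
def run_scheduling_cycle(pod, nodes, fits, score, permit_ok=True, bind_ok=True):
    """Walk one pod through the extension points; return (node, trace)."""
    trace = ["PreFilter", "Filter"]
    feasible = [n for n in nodes if fits(pod, n)]
    if not feasible:
        trace.append("PostFilter")  # last resort: preemption would run here
        return None, trace
    trace += ["PreScore", "Score", "NormalizeScore"]
    chosen = max(feasible, key=lambda n: score(pod, n))
    trace += ["Reserve", "Permit"]
    if not permit_ok:
        trace.append("Unreserve")   # roll back what Reserve claimed
        return None, trace
    trace += ["PreBind", "Bind"]
    if not bind_ok:
        trace.append("Unreserve")
        return None, trace
    trace.append("PostBind")
    return chosen, trace


# Toy usage: nodes are dicts with free CPU; the pod asks for 2 CPUs.
nodes = [
    {"name": "n1", "free_cpu": 1},
    {"name": "n2", "free_cpu": 4},
    {"name": "n3", "free_cpu": 8},
]
pod = {"cpu": 2}
fits = lambda p, n: n["free_cpu"] >= p["cpu"]
score = lambda p, n: n["free_cpu"]

node, trace = run_scheduling_cycle(pod, nodes, fits, score)
print(node["name"], trace)
```

Passing permit_ok=False or bind_ok=False shows the rollback path: the trace ends in Unreserve and no node is returned.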

Built-in Scheduler Plugins

Filter Plugins

NodeUnschedulable

Rejects nodes with spec.unschedulable: true (cordoned nodes), unless Pod tolerates the node.kubernetes.io/unschedulable taint.

NodeName

Rejects all nodes except spec.nodeName if explicitly set. Short-circuits Filter — only one node evaluated.

NodeResourcesFit

Checks that the node's allocatable CPU, memory, and extended resources (e.g., hugepages), minus what is already requested, cover the Pod's requests. Only requests are considered; limits do not affect placement. As a Score plugin it supports the LeastAllocated, MostAllocated, and RequestedToCapacityRatio strategies.

NodeAffinity

Enforces spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution. Preferred terms scored in the Score phase.

TaintToleration

Rejects nodes whose taints are not tolerated by the Pod's spec.tolerations.

InterPodAffinity

Required pod affinity/anti-affinity rules (hard). Checks existing pods' labels on candidate node's topology domain.

VolumeBinding

Checks PVC satisfaction: volume zone, node selector, access mode. For a WaitForFirstConsumer StorageClass, binding or dynamic provisioning is triggered once a node has been selected, before the Pod itself is bound.

VolumeRestrictions

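Checks volume-specific conflicts, for example that a volume which cannot be shared is not already in use by another Pod on the node. Per-node volume count limits (e.g., AWS EBS: 39, GCE PD: 127) are enforced by the separate volume-limit plugins, which count in-use and reserved-but-not-yet-bound volumes.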

PodTopologySpread

Enforces spec.topologySpreadConstraints with whenUnsatisfiable: DoNotSchedule. Also scores with ScheduleAnyway mode.

NodePorts

Rejects nodes that already have a Pod occupying the same hostPort.

NodeResourcesBalancedAllocation

Score-only plugin (it does not reject nodes, despite being listed here). Favors nodes where CPU and memory utilization would end up closest to each other after placing the Pod.

EBSLimits / GCEPDLimits / AzureDiskLimits

Cloud-specific volume count limits. Replaced by node-level CSI migration in 1.23+.

Score Plugins

NodeResourcesLeastAllocated (default)

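Scores nodes with more free capacity higher, which spreads pods across nodes evenly. In kubescheduler.config.k8s.io/v1 this is a scoringStrategy of the NodeResourcesFit plugin rather than a standalone plugin. Formula: score = (cpu_free/cpu_cap * w_cpu + mem_free/mem_cap * w_mem) / (w_cpu + w_mem) * 100. Default weights: CPU=1, Memory=1.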

NodeResourcesMostAllocated (bin-packing)

Scores nodes with less free capacity higher. Used when consolidating workloads to save cost (a node autoscaler can then drain underutilized nodes). Enable it in a scheduler profile by setting scoringStrategy.type: MostAllocated in the NodeResourcesFit pluginConfig.
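
To make the two strategies concrete, here is the LeastAllocated formula quoted above written as Python. The function names are hypothetical, and most_allocated_score assumes MostAllocated is the same weighted average taken over used rather than free capacity. The real scheduler works in integer arithmetic and counts the incoming Pod's requests, so treat this as an approximation.

```python
def least_allocated_score(cpu_free, cpu_cap, mem_free, mem_cap, w_cpu=1, w_mem=1):
    """Weighted average of the free fraction per resource, scaled to 0-100."""
    weighted = (cpu_free / cpu_cap) * w_cpu + (mem_free / mem_cap) * w_mem
    return weighted / (w_cpu + w_mem) * 100


def most_allocated_score(cpu_free, cpu_cap, mem_free, mem_cap, w_cpu=1, w_mem=1):
    """Assumed mirror image: weighted average of the used fraction, 0-100."""
    cpu_used = (cpu_cap - cpu_free) / cpu_cap
    mem_used = (mem_cap - mem_free) / mem_cap
    return (cpu_used * w_cpu + mem_used * w_mem) / (w_cpu + w_mem) * 100


# Node with 2 of 4 CPUs free and 6 of 8 GiB free.
spread = least_allocated_score(2, 4, 6, 8)
pack = most_allocated_score(2, 4, 6, 8)
print(spread, pack)
```

A mostly empty node scores high under LeastAllocated and low under MostAllocated, which is why the latter packs pods tightly.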

NodeAffinity (preferred)

Scores nodes matching preferredDuringSchedulingIgnoredDuringExecution rules. The weights of all matching preferred terms are summed, then normalized to 0–100.

InterPodAffinity (preferred)

Scores nodes where preferred affinity/anti-affinity rules are satisfied. Careful: O(pods × nodes) calculation — expensive on large clusters.

ImageLocality

Scores nodes that already have the container image cached. The score grows with the total size of the Pod's images already present on the node. Reduces image pull latency for large images.

PodTopologySpread (preferred)

When whenUnsatisfiable: ScheduleAnyway, scores candidate nodes to minimize topology skew. Complementary to the Filter plugin variant.

Scheduling Queue Internals

The scheduler maintains three internal queues that govern when and in what order Pods are evaluated:

[Figure: Scheduling Queue, three sub-queues. activeQ (priority heap, popped by scheduler goroutines); backoffQ (exponential-backoff heap, base 1s, max 10s, retry after scheduling failure); unschedulableQ (map keyed by pod, no node found). When a backoff expires the pod returns to activeQ; a cluster event (new node, updated pod) triggers re-evaluation of unschedulable pods.]

Queue | Mechanism | Move condition
activeQ | Max-heap ordered by priority + creation time (via QueueSort plugin) | New pods; pods returned from backoffQ/unschedulableQ
backoffQ | Min-heap ordered by backoff expiry timestamp. Base: 1s, doubles each failure, max 10s | Scheduling was attempted but failed, and the pod's backoff has not yet expired
unschedulableQ | HashMap, flushed by cluster events (new Node, updated PV, changed Pod labels) | No feasible node found (all nodes filtered out)

▶ Cluster Event → unschedulableQ Flush

When a new Node registers, or a Pod is deleted (freeing resources), or a PV becomes available, the scheduler identifies which Pods in unschedulableQ might now be schedulable (using the cluster events each plugin has registered interest in) and moves them to activeQ, or to backoffQ if their backoff has not yet expired. This avoids busy-waiting and scales well to thousands of pending pods.
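
The backoffQ timing (base 1s, doubling per failure, capped at 10s) can be sketched as follows. backoff_seconds is a hypothetical helper; the defaults mirror the values quoted in the table, which correspond to the podInitialBackoffSeconds and podMaxBackoffSeconds configuration fields.

```python
def backoff_seconds(attempts, initial=1.0, maximum=10.0):
    """Backoff applied after `attempts` failed scheduling attempts (>= 1)."""
    duration = initial
    for _ in range(attempts - 1):
        duration *= 2
        if duration >= maximum:
            return maximum
    return duration


# Backoff after the 1st through 6th failed attempt.
schedule = [backoff_seconds(n) for n in range(1, 7)]
print(schedule)
```

After the fifth failure the pod retries at the 10s ceiling until a scheduling attempt succeeds.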

Two-Phase Deep Dive: Filter and Score

Filter Phase

Each Filter plugin's Filter(pod, nodeInfo) method runs for every (pod, candidate node) pair. Runs are parallelized across nodes (default 16 goroutines). A node is included in the feasible set only if all plugins return success.

# See which plugins rejected a pod and why
kubectl describe pod my-pod | grep -A 20 "Events:"
# Events will show: "0/5 nodes are available: 2 node(s) had taint, 3 Insufficient memory"

# Filter extension-point runs that ended in Error (Prometheus histogram; the _count series gives the count)
scheduler_framework_extension_point_duration_seconds_count{extension_point="Filter",status="Error"}

# Enable verbose scheduler logs (very noisy — use only in non-prod)
# Add to kube-scheduler pod: --v=10

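The percentageOfNodesToScore optimization: on clusters with >100 nodes, the scheduler stops evaluating remaining nodes once it has found enough feasible ones. With the default setting (0) the target is adaptive: roughly 50% of nodes for a 100-node cluster, shrinking linearly to a floor of 5% on very large clusters, and never fewer than 100 nodes. This is a heuristic that is good enough in practice; set the value to 100 to evaluate every node when placement precision matters more than throughput.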

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 0   # 0 = use adaptive default based on cluster size
profiles:
- schedulerName: default-scheduler
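
How the adaptive default scales with cluster size can be approximated in a few lines. This is a sketch, not the scheduler's code: num_nodes_to_find is a hypothetical name, and the constants (100-node minimum, 5% floor, one percentage point per 125 nodes) are assumptions based on the upstream heuristic that may differ between releases.

```python
MIN_FEASIBLE_NODES = 100   # assumed: never search for fewer feasible nodes
MIN_PERCENTAGE = 5         # assumed: floor of the adaptive percentage


def num_nodes_to_find(num_all_nodes, percentage=0):
    """Feasible nodes the scheduler looks for before it stops filtering."""
    if num_all_nodes < MIN_FEASIBLE_NODES or percentage >= 100:
        return num_all_nodes
    if percentage <= 0:
        # Adaptive default: starts near 50% and drops as the cluster grows.
        percentage = max(50 - num_all_nodes // 125, MIN_PERCENTAGE)
    return max(num_all_nodes * percentage // 100, MIN_FEASIBLE_NODES)


for size in (50, 100, 1000, 5000, 10000):
    print(size, num_nodes_to_find(size))
```

Under these assumptions a 1000-node cluster stops after 420 feasible nodes and a 5000-node cluster after 500, while setting the percentage to 100 always evaluates every node.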

Score Phase

Each Score plugin returns 0–100 per feasible node. The final score for each node is the weighted sum: Σ(plugin_weight × normalized_score). NormalizeScore brings raw scores within a plugin into [0,100] range. The node with the highest final score wins. Ties broken by random selection (to prevent hot nodes).

# Example score plugin weights for the default-scheduler profile
# (illustrative; shipped defaults vary by release, so check your version's KubeSchedulerConfiguration)
plugins:
  score:
    enabled:
    - name: NodeResourcesBalancedAllocation
      weight: 1
    - name: ImageLocality
      weight: 1
    - name: InterPodAffinity
      weight: 1
    - name: NodeResourcesFit       # LeastAllocated strategy
      weight: 1
    - name: NodeAffinity
      weight: 1
    - name: PodTopologySpread
      weight: 2                    # higher weight = stronger influence
    - name: TaintToleration
      weight: 1
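
The weighted sum and random tie-breaking from the Score phase can be sketched like this. normalize and pick_node are hypothetical names, and scaling by the plugin's maximum raw score is only one possible NormalizeScore behavior; each real plugin defines its own.

```python
import random


def normalize(raw):
    """Scale one plugin's raw per-node scores into 0-100 (max-scaling)."""
    top = max(raw.values())
    if top == 0:
        return {node: 0 for node in raw}
    return {node: s * 100 // top for node, s in raw.items()}


def pick_node(plugin_scores, weights, rng=random):
    """Return (winner, totals) using sum(weight * normalized score)."""
    totals = {}
    for plugin, raw in plugin_scores.items():
        for node, s in normalize(raw).items():
            totals[node] = totals.get(node, 0) + weights[plugin] * s
    best = max(totals.values())
    winners = sorted(n for n, t in totals.items() if t == best)
    return rng.choice(winners), totals   # ties broken at random


plugin_scores = {
    "NodeResourcesFit": {"a": 40, "b": 80},
    "PodTopologySpread": {"a": 10, "b": 5},
}
weights = {"NodeResourcesFit": 1, "PodTopologySpread": 2}
winner, totals = pick_node(plugin_scores, weights)
print(winner, totals)
```

Node b wins on resources alone, but the doubled PodTopologySpread weight tips the total toward node a.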

Node Affinity, Pod Affinity, Taints and Tolerations

Node Affinity

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # HARD — Filter
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values: [amd64]
          - key: node-role/gpu
            operator: Exists
      preferredDuringSchedulingIgnoredDuringExecution:  # SOFT — Score
      - weight: 80
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: [us-east-1a]

Pod Affinity and Anti-Affinity

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:  # HARD anti-affinity
      - labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname  # no two pods on same node
      preferredDuringSchedulingIgnoredDuringExecution:  # SOFT spread across zones
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app
          topologyKey: topology.kubernetes.io/zone
⚠ Pod Affinity Scalability

Pod affinity/anti-affinity rules require examining ALL existing pods to determine which topology domains are occupied. On clusters with tens of thousands of pods this becomes an O(pods × nodes) operation per pending pod. Use PodTopologySpread instead wherever possible — it uses pre-computed per-topology counts and is significantly more efficient.

Taints and Tolerations

# Taint a node (effect: NoSchedule | PreferNoSchedule | NoExecute)
kubectl taint nodes worker-1 dedicated=gpu:NoSchedule
kubectl taint nodes worker-1 node.kubernetes.io/not-ready:NoExecute

# Remove a taint
kubectl taint nodes worker-1 dedicated=gpu:NoSchedule-

# Pod tolerates the taint
spec:
  tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
  - key: node.kubernetes.io/not-ready   # tolerate node-not-ready for 300s
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300

Effect | Behavior | Common use
NoSchedule | New pods without toleration not scheduled here | Dedicated nodes (GPU, spot)
PreferNoSchedule | Scheduler tries to avoid, not hard requirement | Soft node preference
NoExecute | Existing pods evicted if not tolerating; new pods rejected | Node health management (node-lifecycle-controller)
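
The matching rule behind the TaintToleration filter can be sketched as below. tolerates and passes_taint_filter are hypothetical helpers operating on plain dicts shaped like the YAML above; the rule that an empty effect or key in a toleration matches anything follows the documented toleration semantics.

```python
def tolerates(toleration, taint):
    """True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def passes_taint_filter(node_taints, tolerations):
    """Hard effects need a matching toleration; PreferNoSchedule only scores."""
    for taint in node_taints:
        if taint["effect"] == "PreferNoSchedule":
            continue
        if not any(tolerates(t, taint) for t in tolerations):
            return False
    return True


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
gpu_toleration = {"key": "dedicated", "operator": "Equal",
                  "value": "gpu", "effect": "NoSchedule"}
print(passes_taint_filter([gpu_taint], [gpu_toleration]))
print(passes_taint_filter([gpu_taint], []))
```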

Topology Spread Constraints

Introduced as alpha in 1.16 (GA in 1.19). Replaces complex pod anti-affinity patterns for even distribution. More efficient than InterPodAffinity as it uses pre-aggregated per-topology-domain counts.

spec:
  topologySpreadConstraints:
  - maxSkew: 1                         # max allowed difference between any two domains
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule   # hard constraint (Filter)
    labelSelector:
      matchLabels:
        app: my-app
    matchLabelKeys:                    # k8s 1.27+ — auto-exclude old revisions
    - pod-template-hash
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway  # soft constraint (Score)
    labelSelector:
      matchLabels:
        app: my-app

maxSkew

Maximum allowed difference in pod count between any two topology domains. maxSkew: 1 means zones can differ by at most 1. Smaller = stricter spread = harder to schedule.

whenUnsatisfiable modes

DoNotSchedule: hard filter — pod stays pending if constraint cannot be met. ScheduleAnyway: soft score — scheduler picks the node that minimizes skew increase, but will schedule anywhere.

matchLabelKeys (1.27+)

Automatically includes only pods from the current ReplicaSet revision in skew calculations. Without this, a rolling update temporarily doubles pod count in one zone, causing false "skew exceeded" rejections.

Default constraints

Cluster-wide default constraints can be set through the PodTopologySpread plugin's args in KubeSchedulerConfiguration (pluginConfig, field defaultConstraints with defaultingType: List); they apply to all pods that don't set their own constraints.
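
The maxSkew check for a DoNotSchedule constraint can be illustrated with a few lines. placement_allowed is a hypothetical helper, and the sketch assumes skew is measured as the candidate domain's count after placement minus the smallest count across all domains.

```python
def placement_allowed(domain_counts, candidate, max_skew):
    """Would adding one matching pod to `candidate` keep skew within max_skew?"""
    after = domain_counts[candidate] + 1
    global_min = min(domain_counts.values())
    return after - global_min <= max_skew


# Matching pods per zone before placing the next replica.
zones = {"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}
for zone in zones:
    print(zone, placement_allowed(zones, zone, max_skew=1))
```

With maxSkew: 1 only the two emptier zones remain feasible; raising maxSkew to 2 would admit all three.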

Preemption and PriorityClass

When no node can fit a pending pod after filtering, the DefaultPreemption plugin (running in the PostFilter extension point) tries to evict lower-priority pods to make room.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000         # higher = more important; user classes max out at 1,000,000,000, larger values are system-reserved
globalDefault: false
preemptionPolicy: PreemptLowerPriority  # PreemptLowerPriority | Never
description: "Critical production workloads"
---
# Use in pod spec
spec:
  priorityClassName: high-priority

Built-in system PriorityClasses:

Name | Value | Used by
system-cluster-critical | 2,000,000,000 | kube-dns, metrics-server
system-node-critical | 2,000,001,000 | kube-apiserver, etcd, kube-scheduler (static pods), kube-proxy

Preemption Algorithm

  1. Collect candidate nodes where evicting some pods would make room for the pending pod.
  2. For each candidate node: identify the minimum set of pods to evict (lowest priority, fewest disruptions, no PDB violations).
  3. Pick the node that minimizes disruption (fewest pods evicted, respect PDB).
  4. Evict chosen pods (graceful termination via deletionTimestamp). Pending pod is NOT immediately scheduled — it waits for evicted pods to terminate, then re-enters the scheduling cycle.
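
Steps 1 to 3 can be sketched with a single resource. This is a deliberately simplified model: the helper names are hypothetical, only CPU is considered, PDBs are ignored, victims are chosen greedily from the lowest priority upward, and the node with the fewest victims wins. The real plugin uses a richer ordering.

```python
def victims_on_node(node, pod_cpu, pod_priority):
    """Lowest-priority pods to evict so the pending pod fits, or None."""
    free = node["allocatable"] - sum(p["cpu"] for p in node["pods"])
    lower = sorted(
        (p for p in node["pods"] if p["priority"] < pod_priority),
        key=lambda p: p["priority"],
    )
    victims = []
    for p in lower:
        if free >= pod_cpu:
            break
        victims.append(p)
        free += p["cpu"]
    return victims if free >= pod_cpu else None


def pick_preemption_node(nodes, pod_cpu, pod_priority):
    """Return (node name, victims) with the fewest victims, or None."""
    best = None
    for node in nodes:
        victims = victims_on_node(node, pod_cpu, pod_priority)
        if victims is None:
            continue
        if best is None or len(victims) < len(best[1]):
            best = (node["name"], victims)
    return best


nodes = [
    {"name": "n1", "allocatable": 4000, "pods": [
        {"name": "a", "cpu": 2000, "priority": 100},
        {"name": "b", "cpu": 2000, "priority": 1000}]},
    {"name": "n2", "allocatable": 4000, "pods": [
        {"name": "c", "cpu": 1000, "priority": 10},
        {"name": "d", "cpu": 1000, "priority": 10},
        {"name": "e", "cpu": 2000, "priority": 10}]},
]
choice = pick_preemption_node(nodes, pod_cpu=2000, pod_priority=500)
print(choice[0], [v["name"] for v in choice[1]])
```

Both nodes could host the pending pod, but n1 needs one eviction where n2 needs two, so n1 is chosen.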
⚠ PodDisruptionBudget + Preemption

Preemption respects PDBs on a best-effort basis: the scheduler prefers victims whose eviction does not violate a PDB's minAvailable or maxUnavailable. If no such victims exist, it still preempts, so a PDB is not a hard blocker for preemption. Note that preemptionPolicy: Never on a PriorityClass does not protect that class's pods from being preempted; it only stops pods of that class from preempting others.

Scheduler Profiles and Multiple Schedulers

A single kube-scheduler binary can run multiple profiles, each with its own schedulerName, plugin configuration, and plugin weights. Pods opt into a profile via spec.schedulerName.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
# Profile 1: default balanced scheduler
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 1
      - name: PodTopologySpread
        weight: 2
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1

# Profile 2: bin-packing for batch jobs
- schedulerName: bin-pack-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 1
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated   # pack pods tightly
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
---
# Pod using the bin-pack profile
spec:
  schedulerName: bin-pack-scheduler
  containers:
  - name: batch-job
    image: my-batch:latest
▶ Custom Scheduler vs Scheduler Profile

Running a completely separate scheduler binary is possible but operationally complex (leader election, RBAC, update coordination). Prefer scheduler profiles within the same binary — same leader election, same informer caches, far simpler to operate. Use a separate binary only when you need fundamentally different scheduling logic (e.g., gang scheduling, custom preemption) not achievable via plugin configuration.

Gang Scheduling and the Permit Extension Point

Gang scheduling (also called co-scheduling) ensures all pods in a group are scheduled together — or none are. This is critical for ML training jobs (all workers must start simultaneously) and MPI workloads.

The Permit extension point enables this: a plugin can return Wait instead of Allow. The pod is tentatively assigned to a node but held in a waiting state until the plugin explicitly allows it (or a timeout expires).

# Coscheduling via the scheduler-plugins project (sig-scheduling)
# https://github.com/kubernetes-sigs/scheduler-plugins
apiVersion: scheduling.x-k8s.io/v1alpha1   # older releases used scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: my-ml-job
spec:
  minMember: 4           # all 4 workers must be schedulable before any binds
  minResources:
    cpu: "4"
    memory: 8Gi
---
# Pod joining the group (fragment): the label goes under metadata, not spec
metadata:
  labels:
    scheduling.x-k8s.io/pod-group: my-ml-job   # label key must match the installed release
spec:
  schedulerName: scheduler-plugins-scheduler

The Permit plugin tracks how many pods in the group have been tentatively placed. Once minMember is reached, all waiting pods are released to Bind simultaneously. If the timeout expires before all members are placed, waiting pods are rejected (→ Unreserve) and returned to the queue.
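
The wait-then-release behavior can be modeled with a small class. GangPermit is a hypothetical illustration of the counting logic only; it is not the Coscheduling plugin's API, and timeouts are triggered by hand rather than by a clock.

```python
class GangPermit:
    """Hold pods at Permit until min_member of the group have arrived."""

    def __init__(self, min_member):
        self.min_member = min_member
        self.waiting = []

    def permit(self, pod_name):
        """Return ("Allow", released pods) or ("Wait", [])."""
        self.waiting.append(pod_name)
        if len(self.waiting) >= self.min_member:
            released, self.waiting = self.waiting, []
            return "Allow", released
        return "Wait", []

    def timeout(self):
        """Reject everything still waiting (these pods go to Unreserve)."""
        rejected, self.waiting = self.waiting, []
        return rejected


gang = GangPermit(min_member=4)
results = [gang.permit(f"worker-{i}") for i in range(4)]
print(results[-1])
```

The first three workers receive Wait; the fourth triggers Allow and releases the whole group to Bind together.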

Leader Election

In HA setups, multiple kube-scheduler replicas run but only one is active. Leadership is managed via a Lease object in the kube-system namespace. The details are identical to the kube-controller-manager pattern — see 00-control-plane-overview.html § Leader Election.

# Check current scheduler leader
kubectl -n kube-system get lease kube-scheduler -o yaml
# .spec.holderIdentity shows the active scheduler pod

# Scheduler logs showing leader election
kubectl -n kube-system logs kube-scheduler-master-1 | grep -i "leader"
# I0115 10:30:00 leaderelection.go:248] attempting to acquire leader lease...
# I0115 10:30:01 leaderelection.go:258] successfully acquired lease kube-system/kube-scheduler

Configuration Reference

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Leader election
leaderElection:
  leaderElect: true
  leaseDuration: 15s
  renewDeadline: 10s
  retryPeriod: 2s
  resourceNamespace: kube-system
  resourceName: kube-scheduler

# API server connection
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
  qps: 50
  burst: 100

# Performance tuning
percentageOfNodesToScore: 0     # 0 = adaptive (recommended)
parallelism: 16                 # goroutines for filter/score

# Health / metrics
# The v1 API has no healthzBindAddress / metricsBindAddress fields;
# set the serving address with the --bind-address and --secure-port (default 10259) flags.

Flag / Config | Default | Purpose
percentageOfNodesToScore | 0 (adaptive) | Stop evaluating nodes once this % is found feasible. 0 = use the cluster-size heuristic (about 50% at 100 nodes, shrinking toward a 5% floor on large clusters)
parallelism | 16 | Number of goroutines running Filter/Score in parallel
--leader-elect | true | Enable leader election for HA
--kube-api-qps | 50 | Queries per second to kube-apiserver
--kube-api-burst | 100 | Burst capacity above QPS limit
--v | 0 (commonly set to 2) | Log verbosity. 4+ shows scheduling decisions per pod

Metrics and Alerting

# Scrape scheduler metrics
curl -sk https://localhost:10259/metrics \
  --cert /etc/kubernetes/pki/scheduler-client.crt \
  --key /etc/kubernetes/pki/scheduler-client.key | grep scheduler_

# Key metrics
scheduler_scheduling_attempt_duration_seconds{result="scheduled"}   # e2e scheduling latency
scheduler_scheduling_attempt_duration_seconds{result="unschedulable"}
scheduler_pending_pods{queue="active"}       # pods waiting to be scheduled
scheduler_pending_pods{queue="backoff"}      # pods in backoff
scheduler_pending_pods{queue="unschedulable"}# pods with no node found
scheduler_preemption_attempts_total          # total preemption attempts
scheduler_preemption_victims                 # histogram: victims selected per preemption
scheduler_framework_extension_point_duration_seconds{extension_point="Filter"}
scheduler_framework_extension_point_duration_seconds{extension_point="Score"}
scheduler_volume_scheduling_stage_error_reason  # volume binding errors
Prometheus Alerting Rules
groups:
- name: kube-scheduler
  rules:
  - alert: KubeSchedulerDown
    expr: absent(up{job="kube-scheduler"} == 1)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "kube-scheduler is down"
      description: "No kube-scheduler instance is up. Pods cannot be scheduled."

  - alert: KubeSchedulerHighPendingPods
    expr: scheduler_pending_pods{queue="unschedulable"} > 50
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High number of unschedulable pods"
      description: "{{ $value }} pods have been unschedulable for >15m."

  - alert: KubeSchedulerHighLatency
    expr: |
      histogram_quantile(0.99,
        rate(scheduler_scheduling_attempt_duration_seconds_bucket{result="scheduled"}[5m])
      ) > 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Scheduler p99 latency > 5s"
      description: "P99 scheduling latency is {{ $value }}s."

  - alert: KubeSchedulerPreemptionHigh
    expr: rate(scheduler_preemption_victims_sum[5m]) > 10   # histogram: _sum accumulates victims
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High preemption rate"
      description: "Scheduler is evicting >10 pods/s. Check PriorityClass assignments."

Troubleshooting Scheduling Issues

Pod stuck in Pending — "0/N nodes are available"
# Full scheduling failure reason
kubectl describe pod my-pod | grep -A 30 "Events:"
# Common messages:
# "0/5 nodes are available: 2 Insufficient cpu, 3 node(s) had taint..."
# "0/5 nodes are available: 5 node(s) didn't match pod's node affinity/selector"
# "0/5 nodes are available: 5 pod has unbound immediate PersistentVolumeClaims"

# Check node resources and allocatable
kubectl describe nodes | grep -A 8 "Allocated resources"

# Check if nodes are cordoned
kubectl get nodes | grep SchedulingDisabled

# Check taints on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check if the pod's resource requests are too high
kubectl get pod my-pod -o jsonpath='{.spec.containers[*].resources}'
Pod stuck in Pending — PVC binding issue
# Check PVC status
kubectl get pvc
kubectl describe pvc my-pvc
# Look for: "waiting for first consumer to be created before binding"
# This is normal for WaitForFirstConsumer StorageClass — PVC binds AFTER scheduler picks node

# Check StorageClass binding mode
kubectl get storageclass standard -o jsonpath='{.volumeBindingMode}'
# Immediate = PVC bound before scheduling (may conflict with zone affinity)
# WaitForFirstConsumer = PVC bound after scheduling (node zone determines PVC zone)

# Volume scheduling logs
kubectl -n kube-system logs kube-scheduler-master-1 | grep -i "volume"
Pods not spreading evenly across zones
# Count pods per node (column 7 of -o wide is NODE); compare with the node zone labels below
kubectl get pods -l app=my-app -o wide --no-headers | awk '{print $7}' | sort | uniq -c

# Check if topologySpreadConstraints are set
kubectl get deploy my-deploy -o jsonpath='{.spec.template.spec.topologySpreadConstraints}'

# Check zone labels on nodes
kubectl get nodes --show-labels | grep topology.kubernetes.io/zone

# If using old pod anti-affinity instead:
# Migrate to topologySpreadConstraints for better performance and flexibility

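# Check scheduler logs for spread scoring (the scheduler itself must run with --v=4 or higher;
# a -v flag on kubectl only changes kubectl's own verbosity)
kubectl -n kube-system logs kube-scheduler-master-1 | grep TopologySpread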
Scheduler performance — high latency / slow scheduling
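# Check scheduling attempt duration (query the scheduler's own endpoint;
# kubectl get --raw /metrics returns the API server's metrics instead)
curl -sk https://localhost:10259/metrics \
  --cert /etc/kubernetes/pki/scheduler-client.crt \
  --key /etc/kubernetes/pki/scheduler-client.key | grep scheduler_scheduling_attempt_duration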

# Common causes of slow scheduling:
# 1. InterPodAffinity with large pod counts — migrate to topologySpreadConstraints
# 2. percentageOfNodesToScore too high — reduce for large clusters
# 3. Too many Score plugins with high weight — profile and trim
# 4. Scheduler extenders (HTTP webhooks) or slow custom Permit plugins: check latency

# Increase parallelism for large clusters
# In KubeSchedulerConfiguration:
parallelism: 32  # default 16, increase for 1000+ node clusters

# Check if scheduler leader is healthy
kubectl -n kube-system get lease kube-scheduler -o yaml | grep holderIdentity
Preemption not working as expected
# Check if priority classes are set correctly
kubectl get priorityclass
kubectl get pod high-priority-pod -o jsonpath='{.spec.priority}'

# Check if PDB is blocking preemption
kubectl get pdb -A
kubectl describe pdb my-pdb | grep -E "Allowed Disruptions|Status"

# Check scheduler logs for preemption decisions
kubectl -n kube-system logs kube-scheduler-master-1 | grep -i preempt

# Check if the pending pod's PriorityClass has preemptionPolicy: Never (such pods never preempt)
kubectl get priorityclass high-priority -o jsonpath='{.preemptionPolicy}'

Scheduling Sequence: Pod to Bound

[Figure: Pod Scheduling, full sequence between Scheduler, kube-apiserver, etcd, and kubelet]

  1. Scheduler: WATCH /pods?fieldSelector=spec.nodeName= and dequeue a pending pod from activeQ.
  2. Scheduler → kube-apiserver: GET /nodes (from watchCache, RV=0); 200 OK with the cached node list.
  3. Scheduler: Filter phase (run all Filter plugins), then Score phase (run Score/Normalize).
  4. Scheduler → kube-apiserver: POST /namespaces/{ns}/pods/{name}/binding.
  5. kube-apiserver → etcd: Put(pod.spec.nodeName = worker-1); etcd ACKs (rev=12345) and the API server returns 201 Created (binding).
  6. kube-apiserver → kubelet: WATCH event, pod MODIFIED with nodeName=worker-1; kubelet starts the pod.

Production Best Practices

Use PriorityClasses for all workloads

Define at minimum 3 tiers: critical (1M), standard (100K), batch (1K). Set preemptionPolicy: Never on batch so batch pods wait for free capacity instead of evicting anything. Set globalDefault: true on standard so unclassified pods get a sane default.

Prefer PodTopologySpread over pod anti-affinity

For spreading replicas across zones/nodes, topologySpreadConstraints is orders of magnitude more efficient than requiredDuringSchedulingIgnoredDuringExecution anti-affinity on clusters with thousands of pods.

Set resource requests on all containers

Without resources.requests (and limits), pods are BestEffort class: they pass the resource Filter on any node and reserve no capacity there. This leads to overcommit, OOM kills, and unpredictable scheduling. Every production container must have CPU + memory requests.

Tune percentageOfNodesToScore on large clusters

On 1000+ node clusters, the default adaptive heuristic evaluates only a fraction of nodes. If you have strict affinity requirements, consider raising this value. If you prioritize scheduling throughput, the default is fine.

Use WaitForFirstConsumer StorageClass

For zone-aware storage, set volumeBindingMode: WaitForFirstConsumer. This lets the scheduler pick the node first (respecting topology constraints), then provision the volume in the right zone. Immediate mode provisions the PV before the Pod is scheduled, in a zone chosen without regard to the Pod's constraints, which can conflict with node affinity.

Node labels for topology awareness

Ensure all nodes have standard topology labels: topology.kubernetes.io/zone, topology.kubernetes.io/region, kubernetes.io/hostname. Cloud providers set these automatically; on-premise clusters need them set explicitly (or use a Node Feature Discovery operator).

Avoid scheduling during node maintenance

Cordon nodes before maintenance (kubectl cordon). This sets spec.unschedulable: true and adds the node.kubernetes.io/unschedulable taint, preventing the scheduler from placing new pods. kubectl drain cordons the node as its first step; if you evict pods some other way, cordon first, or new pods may land on the node while the maintenance is running.

Monitor pending pods

Alert when scheduler_pending_pods{queue="unschedulable"} stays above 0 for more than 5 minutes (use a higher threshold on busy clusters). Unless new capacity appears, this usually indicates a scheduling failure that won't self-resolve and requires human investigation of node resources, taints, or affinity rules.