Scheduler Flow
Overview
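The Kubernetes scheduler (kube-scheduler) is a control loop that watches for pods with an empty spec.nodeName and selects the best node for each. Scheduling cycles run serially, one pod at a time, while binding cycles run asynchronously so that a slow bind does not block the next pod. This document traces the full scheduling cycle, from the pod entering the queue to the bind API call.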
Scheduler Architecture
┌─────────────────────────────────────────┐
│ kube-scheduler │
│ │
API Server ──watch──► Scheduling Queue │
(Pods with ├── Active Queue (ready to sched)│
nodeName=="") ├── Backoff Queue (failed, retry) │
│ └── Unschedulable (blocked) │
│ │ │
│ ▼ │
│ ┌─── Scheduling Cycle ──────────┐ │
│ │ 1. Filter plugins │ │
│ │ 2. Score plugins │ │
│ │ 3. Reserve │ │
│ │ 4. Permit (allow/deny/wait) │ │
│ └───────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─── Binding Cycle ──────────────┐ │
│ │ 5. PreBind │ │
│ │ 6. Bind (POST pods/binding) │ │
│ │ 7. PostBind │ │
│ └────────────────────────────────┘ │
└─────────────────────────────────────────┘
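The three queues in the diagram can be modelled in a few lines. What follows is a minimal Python sketch under stated assumptions, not the scheduler's actual implementation: the class `SchedulingQueue` and its method names are hypothetical, and the backoff policy (doubling from 1 s up to a 10 s cap) is chosen to resemble the podInitialBackoffSeconds and podMaxBackoffSeconds defaults. The real queue also decides per cluster event which pods are worth retrying; this sketch retries all of them.

```python
import heapq
import itertools


class SchedulingQueue:
    """Toy model of the active, backoff, and unschedulable queues."""

    def __init__(self, initial_backoff=1.0, max_backoff=10.0):
        self.initial_backoff = initial_backoff
        self.max_backoff = max_backoff
        self.active = []         # heap of (-priority, seq, name)
        self.backoff = []        # heap of (ready_at, seq, name, priority)
        self.unschedulable = {}  # name -> priority
        self.attempts = {}       # name -> number of failed attempts
        self._seq = itertools.count()

    def add(self, name, priority=0):
        heapq.heappush(self.active, (-priority, next(self._seq), name))

    def pop(self):
        """Highest priority first, FIFO among equal priorities."""
        return heapq.heappop(self.active)[2] if self.active else None

    def mark_unschedulable(self, name, priority=0):
        """Record a failed attempt and park the pod until something changes."""
        self.attempts[name] = self.attempts.get(name, 0) + 1
        self.unschedulable[name] = priority

    def backoff_for(self, name):
        """Exponential backoff: initial * 2^(attempts - 1), capped at max."""
        attempts = max(self.attempts.get(name, 1), 1)
        return min(self.initial_backoff * 2 ** (attempts - 1), self.max_backoff)

    def on_cluster_event(self, now):
        """A node or pod changed: parked pods may retry once their backoff ends."""
        for name, priority in list(self.unschedulable.items()):
            ready_at = now + self.backoff_for(name)
            heapq.heappush(self.backoff, (ready_at, next(self._seq), name, priority))
            del self.unschedulable[name]

    def flush_backoff(self, now):
        """Move pods whose backoff has expired back to the active queue."""
        while self.backoff and self.backoff[0][0] <= now:
            _, _, name, priority = heapq.heappop(self.backoff)
            self.add(name, priority)
```

A pod that fails once waits 1 s after the next cluster event before it is retried; after five failures the wait is capped at 10 s.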
Full Scheduling Sequence
API Server Scheduler etcd
│ │ │
│──WATCH event ──► │ │
│ (pod Added, │ │
│ nodeName=="") │ │
│ │ │
│ │ Pod → Active Queue │
│ │ │
│ │ ┌─── FILTER phase ──────────────┐ │
│ │ │ │ │
│ │ │ For each node in cluster: │ │
│ │ │ │ │
│ │ │ NodeUnschedulable │ │
│ │ │ └─ skip if node.spec.unschedulable=true │ │
│ │ │ │ │
│ │ │ NodeResourcesFit │ │
│ │ │ └─ CPU/memory request ≤ │ │
│ │ │ allocatable − requested │ │
│ │ │ │ │
│ │ │ NodeAffinity │ │
│ │ │ └─ required matchExpressions │ │
│ │ │ │ │
│ │ │ TaintToleration │ │
│ │ │ └─ tolerates NoSchedule taints│ │
│ │ │ │ │
│ │ │ PodTopologySpread │ │
│ │ │ └─ maxSkew not exceeded │ │
│ │ │ │ │
│ │ │ InterPodAffinity │ │
│ │ │ └─ required co-location rules│ │
│ │ │ │ │
│ │ │ VolumeBinding │ │
│ │ │ └─ PVC zone matches node AZ │ │
│ │ │ │ │
│ │ │ Result: feasible node list │ │
│ │ └───────────────────────────────┘ │
│ │ │
│ │ ┌─── SCORE phase ───────────────┐ │
│ │ │ │ │
│ │ │ LeastAllocated │ │
│ │ │ └─ score = (1 - alloc%) × 100│ │
│ │ │ │ │
│ │ │ ImageLocality │ │
│ │ │ └─ higher if images cached │ │
│ │ │ │ │
│ │ │ InterPodAffinity │ │
│ │ │ └─ preferred co-location │ │
│ │ │ │ │
│ │ │ NodeAffinity │ │
│ │ │ └─ preferred matchExpressions│ │
│ │ │ │ │
│ │ │ Scores weighted + summed │ │
│ │ │ Highest score wins │ │
│ │ │ Tiebreak: random │ │
│ │ └───────────────────────────────┘ │
│ │ │
│ │ Selected node: worker-3 │
│ │ │
│◄── POST binding ───│ │
│ target.name= │ │
│ "worker-3" │ │
│ │ │
│──WRITE ───────────────────────────────────────────────► │
│ pod.spec.nodeName = "worker-3" │
│ │
│──WATCH event (Modified) ───────────────────────────► │
│ [kubelet on worker-3 receives the update] │
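The filter, score, and select steps in the sequence above can be sketched as a small pipeline. This is an illustrative Python model only: `schedule_one`, the dictionary shapes, and the three plugin functions are hypothetical stand-ins, and only CPU is considered. The real scheduler also normalises scores and samples nodes in large clusters.

```python
import random


def schedule_one(pod, nodes, filters, scorers, rng=random):
    """Return the chosen node name, or None if no node is feasible.

    filters: callables (pod, node) -> bool
    scorers: (callable (pod, node) -> 0..100, weight) pairs
    """
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # the real scheduler would try preemption here

    totals = {
        n["name"]: sum(weight * score(pod, n) for score, weight in scorers)
        for n in feasible
    }
    best = max(totals.values())
    winners = [name for name, total in totals.items() if total == best]
    return rng.choice(winners)  # random tiebreak among the top-scoring nodes


def not_cordoned(pod, node):
    """NodeUnschedulable-style filter."""
    return not node.get("unschedulable", False)


def fits_resources(pod, node):
    """NodeResourcesFit-style filter on CPU (millicores)."""
    return pod["cpu"] <= node["allocatable_cpu"] - node["requested_cpu"]


def least_allocated(pod, node):
    """More free CPU after placement gives a higher score."""
    used = node["requested_cpu"] + pod["cpu"]
    return 100 * (node["allocatable_cpu"] - used) / node["allocatable_cpu"]


nodes = [
    {"name": "worker-1", "allocatable_cpu": 4000, "requested_cpu": 3800},
    {"name": "worker-2", "allocatable_cpu": 4000, "requested_cpu": 1000,
     "unschedulable": True},
    {"name": "worker-3", "allocatable_cpu": 4000, "requested_cpu": 1000},
    {"name": "worker-4", "allocatable_cpu": 4000, "requested_cpu": 2000},
]
pod = {"name": "web", "cpu": 500}

chosen = schedule_one(pod, nodes, [not_cordoned, fits_resources],
                      [(least_allocated, 1)])
print(chosen)  # prints worker-3
```

worker-1 lacks free CPU and worker-2 is cordoned, so only worker-3 (score 62.5) and worker-4 (score 37.5) reach the score phase.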
Filter Plugins Reference
| Plugin | What it checks | Fail condition |
|---|---|---|
| NodeUnschedulable | Node cordoned | node.spec.unschedulable=true |
| NodeResourcesFit | CPU/memory/extended resources | Requested > allocatable minus existing requests |
| NodeName | spec.nodeName explicit request | Name doesn't match |
| NodeAffinity | nodeSelector and required spec.affinity.nodeAffinity | Required terms not satisfied |
| TaintToleration | Node taints vs pod tolerations | Untolerated NoSchedule or NoExecute taint |
| PodTopologySpread | Hard topologySpreadConstraints | whenUnsatisfiable: DoNotSchedule + maxSkew exceeded |
| InterPodAffinity | Required pod affinity/anti-affinity | Required terms not satisfied |
| VolumeBinding | PVC binding, PV node affinity, volume topology | Unbound immediate PVC, zone/topology mismatch |
| NodePorts | HostPort availability | Port already in use on node |
| VolumeRestrictions | Volume conflicts (e.g. ReadWriteOncePod access mode) | Volume already in use by a conflicting pod |
| NodeVolumeLimits | Attachable volume count per node | Too many volumes |
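As one worked example from the table, the TaintToleration check can be sketched as follows. This is a Python illustration that follows the documented matching rules (an empty effect or key acts as a wildcard, and the operator defaults to Equal); the function names and the dictionary representation of taints and tolerations are assumptions made for the example.

```python
def tolerates(toleration, taint):
    """True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def untolerated_taints(pod_tolerations, node_taints):
    """Taints that would make the filter reject the node.

    Only NoSchedule and NoExecute are hard constraints; PreferNoSchedule
    is a soft preference and is ignored here.
    """
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return [t for t in hard
            if not any(tolerates(tol, t) for tol in pod_tolerations)]


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
soft_taint = {"key": "spot", "value": "true", "effect": "PreferNoSchedule"}

# A pod with no tolerations is rejected by the hard taint only.
print(untolerated_taints([], [gpu_taint, soft_taint]))
# A matching toleration clears it, so the node stays feasible.
print(untolerated_taints(
    [{"key": "dedicated", "operator": "Equal", "value": "gpu",
      "effect": "NoSchedule"}],
    [gpu_taint, soft_taint]))
```

A node passes the filter when `untolerated_taints` returns an empty list.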
Score Plugins Reference
| Plugin | Score logic | Default weight |
|---|---|---|
| LeastAllocated (NodeResourcesFit strategy, default) | Prefer nodes with most free CPU+memory | 1 |
| MostAllocated (NodeResourcesFit strategy) | Prefer nodes already loaded (bin-packing) | opt-in |
| ImageLocality | Prefer nodes with image layers cached | 1 |
| InterPodAffinity | Prefer nodes with matching pods nearby | 2 |
| NodeAffinity | Preferred node selector terms | 2 |
| NodeResourcesBalancedAllocation | Prefer balanced CPU/memory usage | 1 |
| PodTopologySpread | Soft spread (whenUnsatisfiable: ScheduleAnyway) | 2 |
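The resource-based rows of the table reduce to simple arithmetic on requested versus allocatable capacity. The Python sketch below is a simplification for illustration: the function names are hypothetical, resources are weighted equally, and the balanced-allocation formula is an approximation for the two-resource case rather than a copy of the plugin's code.

```python
MAX_NODE_SCORE = 100


def _fractions(requested, allocatable):
    """Fraction of each resource requested once the pod is placed."""
    return {r: requested[r] / allocatable[r] for r in allocatable}


def least_allocated(requested, allocatable):
    """Spreading: more free capacity gives a higher score."""
    fr = _fractions(requested, allocatable)
    return MAX_NODE_SCORE * sum(1 - f for f in fr.values()) / len(fr)


def most_allocated(requested, allocatable):
    """Bin-packing: fuller nodes get a higher score."""
    fr = _fractions(requested, allocatable)
    return MAX_NODE_SCORE * sum(fr.values()) / len(fr)


def balanced_allocation(requested, allocatable):
    """Penalise nodes whose CPU and memory usage diverge (two resources)."""
    fr = _fractions(requested, allocatable)
    return MAX_NODE_SCORE * (1 - abs(fr["cpu"] - fr["memory"]) / 2)


def weighted_total(scores_and_weights):
    """Final node score: sum of plugin score times plugin weight."""
    return sum(score * weight for score, weight in scores_and_weights)


allocatable = {"cpu": 4000, "memory": 16384}  # millicores, MiB
requested = {"cpu": 1000, "memory": 8192}     # existing pods plus the new pod

spread = least_allocated(requested, allocatable)       # 62.5
packed = most_allocated(requested, allocatable)        # 37.5
balance = balanced_allocation(requested, allocatable)  # 87.5
print(weighted_total([(spread, 1), (balance, 1)]))     # prints 150.0
```

LeastAllocated and MostAllocated are mirror images of each other, which is why switching the scoring strategy is enough to move from spreading to bin-packing.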
Preemption — When No Node Fits
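If no feasible node exists after filtering, the scheduler runs its PostFilter plugins; the default one (DefaultPreemption) attempts preemption:
1. Find nodes where evicting pods would make room for the pending pod
2. Victim candidates must have a lower priority than the pending pod
3. Candidate nodes are ranked to minimise disruption: fewest PodDisruptionBudget violations first, then lowest-priority victims, then fewest victims
4. Scheduler nominates the node (pod.status.nominatedNodeName)
5. Scheduler deletes the victim pods, which still get their graceful termination period
6. Nominated pod waits for victims to terminate
7. A later scheduling cycle usually places the pod on the nominated node, but this is not guaranteed: another pod may claim the space or a better node may appear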
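The victim selection in the steps above can be sketched as a greedy search. This is a deliberately simplified Python model, not the DefaultPreemption algorithm: it considers CPU only, ignores PodDisruptionBudgets and affinity, evicts lowest-priority pods first, and uses hypothetical names (`pick_victims`, `nominate_node`) and dictionary shapes.

```python
def pick_victims(pod, node):
    """Pods to evict from `node` so that `pod` fits, or None if impossible."""
    free = node["allocatable_cpu"] - sum(p["cpu"] for p in node["pods"])
    candidates = sorted(
        (p for p in node["pods"] if p["priority"] < pod["priority"]),
        key=lambda p: p["priority"],
    )
    victims = []
    for victim in candidates:
        if free >= pod["cpu"]:
            break
        victims.append(victim)
        free += victim["cpu"]
    return victims if free >= pod["cpu"] else None


def nominate_node(pod, nodes):
    """Choose the node whose victims are least important, then fewest."""
    if pod.get("preemption_policy") == "Never":
        return None  # this pod is not allowed to preempt others
    options = []
    for node in nodes:
        victims = pick_victims(pod, node)
        if victims:  # skip nodes where eviction cannot help
            highest = max(v["priority"] for v in victims)
            options.append((highest, len(victims), node["name"], victims))
    if not options:
        return None
    _, _, name, victims = min(options, key=lambda o: o[:3])
    return name, [v["name"] for v in victims]


nodes = [
    {"name": "worker-1", "allocatable_cpu": 4000, "pods": [
        {"name": "a", "cpu": 2000, "priority": 1000},
        {"name": "b", "cpu": 2000, "priority": 100}]},
    {"name": "worker-2", "allocatable_cpu": 4000, "pods": [
        {"name": "c", "cpu": 3000, "priority": 500},
        {"name": "d", "cpu": 1000, "priority": 2000000}]},
]
pending = {"name": "critical", "cpu": 1500, "priority": 1000000}

print(nominate_node(pending, nodes))  # prints ('worker-1', ['b'])
```

Both nodes could host the pod after one eviction, but worker-1 wins because its victim has the lower priority (100 versus 500).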
# PriorityClass example
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority # can preempt lower-priority pods
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100
preemptionPolicy: Never # will not preempt others
Scheduling Percentages (Performance Tuning)
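In a large cluster the scheduler does not evaluate every node. Once filtering has found enough feasible nodes it stops searching and scores only those. The threshold is set by percentageOfNodesToScore. When it is unset (0), the scheduler picks an adaptive value that falls from about 50% for a 100-node cluster to 10% for a 5000-node cluster, with a floor of 5%; in small clusters every node is still checked. Setting it explicitly trades placement quality for scheduling throughput: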
# KubeSchedulerConfiguration
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesFit
            weight: 1
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated # bin-packing (default: LeastAllocated)
percentageOfNodesToScore: 10 # stop searching once feasible nodes reach 10% of the cluster
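The effect of percentageOfNodesToScore on a given cluster size can be estimated with a few lines of arithmetic. The Python sketch below reflects one reading of the scheduler's behaviour and should be treated as an assumption: the constants (a minimum of 100 feasible nodes, a 5% floor, and the linear formula 50 − nodes/125) and the function name are not taken from this document.

```python
MIN_FEASIBLE_NODES = 100  # assumed: never stop before this many feasible nodes
MIN_PERCENTAGE = 5        # assumed: floor for the adaptive percentage


def num_feasible_nodes_to_find(num_all_nodes, percentage=0):
    """How many feasible nodes filtering looks for before it stops.

    percentage=0 means "unset", which selects the adaptive default.
    """
    if num_all_nodes < MIN_FEASIBLE_NODES or percentage >= 100:
        return num_all_nodes
    if percentage == 0:
        percentage = max(50 - num_all_nodes // 125, MIN_PERCENTAGE)
    return max(num_all_nodes * percentage // 100, MIN_FEASIBLE_NODES)


for size in (50, 1000, 5000, 10000):
    print(size, num_feasible_nodes_to_find(size))
```

Under these assumptions an explicit value of 10 only changes behaviour in clusters below roughly 5000 nodes, because the adaptive default is already at or below 10% beyond that size.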
Debugging Scheduling Issues
# Check why a pod is Unschedulable
kubectl describe pod <pod-name> -n <namespace>
# Look for: Events — "0/5 nodes are available: ..."
# Common messages and their meanings:
# "Insufficient cpu" → CPU request exceeds the unreserved allocatable CPU on every node
# "node(s) had taint ... NoSchedule" → add toleration or remove taint
# "node(s) didn't match Pod's node affinity/selector" → fix nodeSelector, nodeAffinity, or node labels
# "1 node(s) had volume node affinity conflict" → PVC zone ≠ node zone
# "pod has unbound immediate PersistentVolumeClaims" → PVC still Pending
# Check scheduler logs (self-managed control planes; managed services expose these elsewhere)
kubectl logs -n kube-system -l component=kube-scheduler --tail=100
# Simulate scheduling without touching the cluster
# Use kube-scheduler-simulator (kubernetes-sigs) to see how each plugin filters and scores
# Check node resources
kubectl describe node <node-name> | grep -A15 "Allocated resources"
kubectl resource-capacity --util # needs the resource-capacity krew plugin and metrics-server
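When many pods are Pending, it can help to tally the reasons in the FailedScheduling event text instead of reading each message. The Python sketch below assumes the "N/M nodes are available: ..." shape shown in the comments above, with comma-separated reasons and an optional trailing "preemption:" clause. The function name and the sample message are hypothetical, and the exact wording of events varies between Kubernetes versions.

```python
import re


def parse_unschedulable(message):
    """Split a FailedScheduling message into a per-reason node count."""
    head = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if not head:
        return None
    # Drop the optional preemption summary and the trailing period.
    reasons_part = head.group(3).split(". preemption:")[0].rstrip(".")
    reasons = {}
    for m in re.finditer(r"(\d+) ([^,]+)", reasons_part):
        reasons[m.group(2).strip()] = int(m.group(1))
    return {
        "available": int(head.group(1)),
        "total": int(head.group(2)),
        "reasons": reasons,
    }


# Hypothetical message assembled from the reason strings listed above.
sample = ("0/5 nodes are available: 3 Insufficient cpu, "
          "2 node(s) had volume node affinity conflict.")
print(parse_unschedulable(sample))
```

The per-reason counts add up to the total number of nodes, so the largest count usually points at the constraint to relax first.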
Related
- 01 — Pod Creation Flow — scheduling in context
- 01 — Capacity Planning — headroom and node pools
- 02 — Performance Tuning — KubeSchedulerConfiguration