Scheduler Flow

Overview

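The Kubernetes scheduler is a control loop that watches for pods with spec.nodeName == "" and selects the best feasible node for each. Scheduling cycles run serially, one pod at a time, while binding cycles run asynchronously so that a slow bind does not block the next pod. This document traces the full scheduling cycle, from a pod entering the queue to the bind API call.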

Scheduler Architecture

                    ┌─────────────────────────────────────────┐
                    │          kube-scheduler                  │
                    │                                         │
   API Server ──watch──► Scheduling Queue                     │
   (Pods with              ├── Active Queue   (ready to sched)│
    nodeName=="")           ├── Backoff Queue  (failed, retry) │
                    │      └── Unschedulable   (blocked)      │
                    │              │                          │
                    │              ▼                          │
                    │    ┌─── Scheduling Cycle ──────────┐    │
                    │    │  1. Filter plugins             │    │
                    │    │  2. Score plugins              │    │
                    │    │  3. Reserve                    │    │
                    │    │  4. Permit (approve/deny/wait) │    │
                    │    └───────────────────────────────┘    │
                    │              │                          │
                    │              ▼                          │
                    │    ┌─── Binding Cycle ──────────────┐   │
                    │    │  5. PreBind                     │   │
                    │    │  6. Bind (POST pods/binding)    │   │
                    │    │  7. PostBind                    │   │
                    │    └────────────────────────────────┘   │
                    └─────────────────────────────────────────┘
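
The three sub-queues in the diagram can be modelled with a short Python sketch. It assumes pods are ordered by priority and that a failed pod backs off exponentially from 1s up to a 10s cap (values intended to mirror the podInitialBackoffSeconds and podMaxBackoffSeconds defaults). The class and method names are hypothetical, and the handling of cluster events is heavily simplified.

```python
import heapq
import itertools

INITIAL_BACKOFF_S = 1.0   # assumption: mirrors podInitialBackoffSeconds
MAX_BACKOFF_S = 10.0      # assumption: mirrors podMaxBackoffSeconds


def backoff_seconds(attempts: int) -> float:
    """Backoff doubles with each failed attempt and is capped."""
    return min(INITIAL_BACKOFF_S * 2 ** (attempts - 1), MAX_BACKOFF_S)


class SchedulingQueue:
    """Toy model of the active, backoff, and unschedulable sub-queues."""

    def __init__(self):
        self._seq = itertools.count()   # FIFO tiebreak within one priority
        self.active = []                # heap of (-priority, seq, pod_name)
        self.backoff = []               # heap of (ready_at, seq, priority, pod_name)
        self.unschedulable = {}         # pod_name -> priority
        self.attempts = {}              # pod_name -> failed attempts so far

    def add(self, pod_name: str, priority: int = 0) -> None:
        heapq.heappush(self.active, (-priority, next(self._seq), pod_name))

    def pop(self) -> str:
        """Return the highest-priority pod that is ready to be scheduled."""
        return heapq.heappop(self.active)[2]

    def mark_failed(self, pod_name: str, priority: int, now: float) -> None:
        """A scheduling attempt failed: retry after a backoff period."""
        self.attempts[pod_name] = self.attempts.get(pod_name, 0) + 1
        ready_at = now + backoff_seconds(self.attempts[pod_name])
        heapq.heappush(self.backoff, (ready_at, next(self._seq), priority, pod_name))

    def mark_unschedulable(self, pod_name: str, priority: int) -> None:
        """No node can fit the pod until something in the cluster changes."""
        self.unschedulable[pod_name] = priority

    def on_cluster_event(self) -> None:
        """Simplification: any cluster change makes every blocked pod retry."""
        for pod_name, priority in self.unschedulable.items():
            self.add(pod_name, priority)
        self.unschedulable.clear()

    def flush_backoff(self, now: float) -> None:
        """Move pods whose backoff has expired back to the active queue."""
        while self.backoff and self.backoff[0][0] <= now:
            _, _, priority, pod_name = heapq.heappop(self.backoff)
            self.add(pod_name, priority)
```

Using heaps keeps both "next pod to schedule" and "next backoff to expire" cheap to find, which is the main property the sketch is meant to show.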

Full Scheduling Sequence

API Server           Scheduler                              etcd
    │                    │                                    │
    │──WATCH event ──►   │                                    │
    │  (pod Added,       │                                    │
    │   nodeName=="")    │                                    │
    │                    │                                    │
    │                    │  Pod → Active Queue                │
    │                    │                                    │
    │                    │  ┌─── FILTER phase ──────────────┐ │
    │                    │  │                               │ │
    │                    │  │  For each node in cluster:    │ │
    │                    │  │                               │ │
    │                    │  │  NodeUnschedulable            │ │
    │                    │  │  └─ skip if node.spec.unschedulable=true
    │                    │  │                               │ │
    │                    │  │  NodeResourcesFit             │ │
    │                    │  │  └─ CPU request ≤ free CPU    │ │
    │                    │  │     Memory request ≤ free mem │ │
    │                    │  │                               │ │
    │                    │  │  NodeAffinity                 │ │
    │                    │  │  └─ required matchExpressions │ │
    │                    │  │                               │ │
    │                    │  │  TaintToleration              │ │
    │                    │  │  └─ pod tolerates all taints  │ │
    │                    │  │                               │ │
    │                    │  │  PodTopologySpread            │ │
    │                    │  │  └─ maxSkew not exceeded      │ │
    │                    │  │                               │ │
    │                    │  │  InterPodAffinity             │ │
    │                    │  │  └─ required co-location rules│ │
    │                    │  │                               │ │
    │                    │  │  VolumeBinding                │ │
    │                    │  │  └─ PVC zone matches node AZ  │ │
    │                    │  │                               │ │
    │                    │  │  Result: feasible node list   │ │
    │                    │  └───────────────────────────────┘ │
    │                    │                                    │
    │                    │  ┌─── SCORE phase ───────────────┐ │
    │                    │  │                               │ │
    │                    │  │  LeastAllocated               │ │
    │                    │  │  └─ score = (1 - alloc%) × 100│ │
    │                    │  │                               │ │
    │                    │  │  ImageLocality                │ │
    │                    │  │  └─ higher if image is cached │ │
    │                    │  │                               │ │
    │                    │  │  InterPodAffinity             │ │
    │                    │  │  └─ preferred co-location     │ │
    │                    │  │                               │ │
    │                    │  │  NodeAffinity                 │ │
    │                    │  │  └─ preferred matchExpressions│ │
    │                    │  │                               │ │
    │                    │  │  Scores weighted + summed     │ │
    │                    │  │  Highest score wins           │ │
    │                    │  │  Tiebreak: random             │ │
    │                    │  └───────────────────────────────┘ │
    │                    │                                    │
    │                    │  Selected node: worker-3          │
    │                    │                                    │
    │◄── POST binding ───│                                    │
    │    target.name=    │                                    │
    │    "worker-3"      │                                    │
    │                    │                                    │
    │──WRITE ───────────────────────────────────────────────► │
    │  pod.spec.nodeName = "worker-3"                         │
    │                                                         │
    │──WATCH event (Modified) ───────────────────────────►    │
    │  [kubelet on worker-3 receives the update]              │
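
The default binder assigns the node by creating a Binding object on the pod's binding subresource, and the API server then sets spec.nodeName. The sketch below only builds that request, using the Python standard library, and sends nothing. The function name is hypothetical and the pod name in the usage note is made up for illustration.

```python
import json


def build_bind_request(namespace: str, pod_name: str, node_name: str):
    """Return (method, path, body) for binding a pod to a node."""
    path = f"/api/v1/namespaces/{namespace}/pods/{pod_name}/binding"
    body = {
        "apiVersion": "v1",
        "kind": "Binding",
        "metadata": {"name": pod_name, "namespace": namespace},
        # The target is the node chosen at the end of the scheduling cycle.
        "target": {"apiVersion": "v1", "kind": "Node", "name": node_name},
    }
    return "POST", path, json.dumps(body)
```

For example, build_bind_request("default", "web-1", "worker-3") yields a POST to /api/v1/namespaces/default/pods/web-1/binding whose target names worker-3.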

Filter Plugins Reference

Plugin	What it checks	Fail condition
`NodeUnschedulable`	Node cordoned	`node.spec.unschedulable=true`
`NodeResourcesFit`	CPU/memory/extended resources	Requested > allocatable minus requests of pods already on the node
`NodeName`	`spec.nodeName` explicit request	Name doesn't match
`NodeAffinity`	`spec.affinity.nodeAffinity` required	Required terms not satisfied
`TaintToleration`	Node taints vs pod tolerations	Untolerated `NoSchedule` or `NoExecute` taint
`PodTopologySpread`	`topologySpreadConstraints` required	`DoNotSchedule` + maxSkew exceeded
`InterPodAffinity`	Required pod affinity/anti-affinity	Required terms not satisfied
`VolumeBinding`	PVC binding, PV node affinity, volume topology	Bound PV not reachable from node (e.g. zone mismatch), unbound PVC cannot be satisfied
`NodePorts`	HostPort availability	Port already in use on node
`NodeVolumeLimits`	Attachable volume count limits per node	Too many volumes
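
A minimal Python sketch of how a filter pass over these plugins might look, covering only three of them. The data model (millicores, MiB, taints reduced to a set of keys) and every name here are assumptions made for illustration. The real plugins handle more cases, for example tolerations matched by operator and effect.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int            # millicores
    allocatable_mem_mi: int           # MiB
    requested_cpu_m: int = 0          # sum of requests of pods already placed
    requested_mem_mi: int = 0
    unschedulable: bool = False       # True when the node is cordoned
    no_schedule_taints: frozenset = frozenset()   # simplified: taint keys only


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    tolerated_taints: frozenset = frozenset()


def node_unschedulable(pod: Pod, node: Node) -> Optional[str]:
    return "node(s) were unschedulable" if node.unschedulable else None


def node_resources_fit(pod: Pod, node: Node) -> Optional[str]:
    # Compare against what is left after existing pods' requests, not raw allocatable.
    if pod.cpu_m > node.allocatable_cpu_m - node.requested_cpu_m:
        return "Insufficient cpu"
    if pod.mem_mi > node.allocatable_mem_mi - node.requested_mem_mi:
        return "Insufficient memory"
    return None


def taint_toleration(pod: Pod, node: Node) -> Optional[str]:
    untolerated = node.no_schedule_taints - pod.tolerated_taints
    if untolerated:
        return f"node(s) had untolerated taint {sorted(untolerated)}"
    return None


FILTERS = [node_unschedulable, node_resources_fit, taint_toleration]


def filter_nodes(pod: Pod, nodes: list):
    """Return (feasible nodes, {node name: first failure reason})."""
    feasible, reasons = [], {}
    for node in nodes:
        for check in FILTERS:
            reason = check(pod, node)
            if reason:
                reasons[node.name] = reason   # first failing plugin rejects the node
                break
        else:
            feasible.append(node)
    return feasible, reasons
```

Each filter returns None on success or a reason string on failure, so the per-node reasons can be aggregated into a summary similar to the "0/5 nodes are available" event.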

Score Plugins Reference

Plugin	Score logic	Weight
`NodeResourcesFit` (`LeastAllocated` strategy, default)	Prefer nodes with most free CPU+memory	1
`NodeResourcesFit` (`MostAllocated` strategy)	Prefer nodes already loaded (bin-packing)	optional
`ImageLocality`	Prefer nodes with image layers cached	1
`InterPodAffinity`	Prefer nodes with matching pods nearby	2
`NodeAffinity`	Preferred node selector terms	2
`NodeResourcesBalancedAllocation`	Prefer balanced CPU/memory usage	1
`PodTopologySpread`	Soft spread (`whenUnsatisfiable: ScheduleAnyway`)	2
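
The scoring step can be sketched in Python as follows: a least-allocated score per node, then a weighted sum across plugins with a random tiebreak. The function names, the integer arithmetic, and the equal weighting of resources are assumptions for illustration, not the scheduler's exact implementation.

```python
import random

MAX_NODE_SCORE = 100


def least_allocated(requested: dict, allocatable: dict) -> int:
    """Average free fraction across resources, scaled to 0..100.

    `requested` should already include the incoming pod's requests.
    """
    scores = []
    for resource, capacity in allocatable.items():
        used = requested.get(resource, 0)
        if capacity == 0 or used > capacity:
            scores.append(0)
        else:
            scores.append((capacity - used) * MAX_NODE_SCORE // capacity)
    return sum(scores) // len(scores)


def select_node(plugin_scores: dict, weights: dict, rng=random):
    """plugin_scores maps plugin name -> {node name: score in 0..100}.

    Returns (chosen node, weighted totals per node).
    """
    totals = {}
    for plugin, per_node in plugin_scores.items():
        for node, score in per_node.items():
            totals[node] = totals.get(node, 0) + weights[plugin] * score
    best = max(totals.values())
    winners = sorted(node for node, total in totals.items() if total == best)
    return rng.choice(winners), totals   # ties are broken at random
```

A higher weight lets one plugin's preference outvote another's: with weights of 1 and 2, a node scoring 40 and 100 totals 240 and beats a node scoring 62 and 0.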

Preemption — When No Node Fits

If no feasible node exists after filtering, the scheduler attempts preemption in its PostFilter phase (the DefaultPreemption plugin):

1. Find nodes where evicting lower-priority pods would make room
2. Victim candidates must have lower priority than the new pod
3. Candidate nodes are ranked: fewest PodDisruptionBudget violations first, then lowest-priority victims, then fewest victims
4. Scheduler nominates the node (pod.status.nominatedNodeName)
5. Scheduler deletes the victim pods, which terminate gracefully
6. Nominated pod returns to the queue while victims terminate
7. A later scheduling cycle usually places the pod on the nominated node, though another node may be chosen if it becomes feasible first
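
A simplified Python sketch of victim selection, assuming a single resource (CPU millicores) and ignoring PodDisruptionBudgets. It tentatively evicts every lower-priority pod, then reprieves as many as possible starting from the highest priority. All names are hypothetical.

```python
def victims_on_node(free_cpu, pods_on_node, preemptor_cpu, preemptor_priority):
    """pods_on_node: list of (name, priority, cpu).

    Returns a list of (name, priority) victims, or None if preemption
    on this node cannot make room.
    """
    lower = [p for p in pods_on_node if p[1] < preemptor_priority]
    # Tentatively evict every lower-priority pod and check whether the preemptor fits.
    free = free_cpu + sum(cpu for _, _, cpu in lower)
    if free < preemptor_cpu:
        return None
    victims = []
    # Reprieve pods from highest priority down while the preemptor still fits.
    for name, priority, cpu in sorted(lower, key=lambda p: -p[1]):
        if free - cpu >= preemptor_cpu:
            free -= cpu                    # this pod can stay
        else:
            victims.append((name, priority))
    return victims


def pick_node(candidates):
    """candidates: {node name: [(victim name, priority), ...]}.

    Simplified ranking: lowest highest-priority victim, then fewest victims.
    """
    def rank(item):
        _, victims = item
        return (max((p for _, p in victims), default=0), len(victims))

    return min(candidates.items(), key=rank)[0]
```

Reprieving from the top down keeps the most important of the lower-priority pods running whenever the incoming pod still fits.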

# PriorityClass example
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority   # can preempt lower-priority pods

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100
preemptionPolicy: Never                   # will not preempt others

Scheduling Percentages (Performance Tuning)

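By default (percentageOfNodesToScore: 0) the scheduler sizes its sample adaptively: clusters with fewer than 100 nodes are evaluated in full, and larger clusters use a percentage that falls linearly from about 50% at 100 nodes to 10% at 5000 nodes, with a floor of 5%. Filtering stops once that many feasible nodes have been found, and only those nodes are scored. Setting percentageOfNodesToScore explicitly overrides the adaptive value; a lower value speeds up scheduling in large clusters (1000+ nodes) at some cost in placement quality: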

# KubeSchedulerConfiguration
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 1
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated       # bin-packing (default: LeastAllocated)
percentageOfNodesToScore: 10      # stop filtering once 10% of nodes are found feasible; only those are scored
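
How the percentage translates into a node count can be sketched in Python. The constants (a minimum of 100 nodes, a 5% floor, and the linear 50 - nodes/125 formula) follow the scheduler's behavior as commonly documented, but treat them as assumptions and check them against your Kubernetes version. The function name is a Python rendering chosen for illustration.

```python
MIN_FEASIBLE_NODES_TO_FIND = 100     # assumption: never sample fewer nodes than this
MIN_FEASIBLE_NODES_PERCENTAGE = 5    # assumption: floor for the adaptive percentage


def num_feasible_nodes_to_find(num_all_nodes: int,
                               percentage_of_nodes_to_score: int = 0) -> int:
    """Number of feasible nodes after which filtering stops."""
    if num_all_nodes < MIN_FEASIBLE_NODES_TO_FIND:
        return num_all_nodes                      # small cluster: check everything
    percentage = percentage_of_nodes_to_score
    if percentage == 0:
        # Adaptive default: about 50% at 100 nodes, 10% at 5000 nodes.
        percentage = max(50 - num_all_nodes // 125, MIN_FEASIBLE_NODES_PERCENTAGE)
    return max(num_all_nodes * percentage // 100, MIN_FEASIBLE_NODES_TO_FIND)
```

Under these assumptions, a 1000-node cluster with the setting at 10 stops after 100 feasible nodes, while the adaptive default for the same cluster would look for 420.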

Debugging Scheduling Issues

# Check why a pod is Unschedulable
kubectl describe pod <pod-name> -n <namespace>
# Look for: Events — "0/5 nodes are available: ..."

# Common messages and their meanings:
# "Insufficient cpu"            → requests exceed allocatable on all nodes
# "node(s) had taint ... NoSchedule" → add toleration or remove taint
# "node(s) didn't match Pod's node affinity/selector" → fix nodeSelector
# "1 node(s) had volume node affinity conflict" → PVC zone ≠ node zone
# "pod has unbound immediate PersistentVolumeClaims" → PVC still Pending

# Check scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler --tail=100

# Simulate scheduling
# kubectl has no scheduling dry-run; try placements in a test environment
# with the kube-scheduler-simulator project instead

# Check node resources
kubectl describe node <node-name> | grep -A15 "Allocated resources"
kubectl resource-capacity --util   # needs the resource-capacity krew plugin and metrics-server
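
When triaging many pending pods, it can help to break the "0/5 nodes are available: ..." event message into per-reason counts. The Python sketch below assumes the message has the shape "<available>/<total> nodes are available: <count> <reason>, ...", optionally followed by a "preemption:" clause. The exact wording varies between Kubernetes versions, and the function name is hypothetical.

```python
import re


def parse_failed_scheduling(message: str):
    """Return (available, total, {reason: node count}) from an event message."""
    header = re.match(r"^(\d+)/(\d+) nodes are available: (.*)$",
                      message.strip(), re.S)
    if header is None:
        raise ValueError("not a FailedScheduling summary")
    available, total, rest = int(header[1]), int(header[2]), header[3]
    # Drop the trailing preemption summary, if present.
    rest = re.split(r"\.?\s*preemption:", rest, maxsplit=1)[0].rstrip(". ")
    reasons = {}
    for part in rest.split(", "):
        m = re.match(r"^(\d+) (.+)$", part.strip())
        if m:
            reasons[m[2]] = reasons.get(m[2], 0) + int(m[1])
    return available, total, reasons
```

The per-reason counts normally add up to the total node count, which makes it easy to see which filter is rejecting most nodes.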

01 — Pod Creation Flow — scheduling in context
01 — Capacity Planning — headroom and node pools
02 — Performance Tuning — KubeSchedulerConfiguration