Overview

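The Kubernetes scheduler (kube-scheduler) is a control loop that watches for pods whose spec.nodeName is empty and selects a node for each. Scheduling cycles run serially, one pod at a time, while binding cycles run asynchronously so the next pod does not wait on the API call. This document traces the full path, from the pod entering the queue to the bind API call.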

Scheduler Architecture

                    ┌─────────────────────────────────────────┐
                    │          kube-scheduler                  │
                    │                                         │
   API Server ──watch──► Scheduling Queue                     │
   (Pods with              ├── Active Queue   (ready to sched)│
    nodeName=="")           ├── Backoff Queue  (failed, retry) │
                    │      └── Unschedulable   (blocked)      │
                    │              │                          │
                    │              ▼                          │
                    │    ┌─── Scheduling Cycle ──────────┐    │
                    │    │  1. Filter plugins             │    │
                    │    │  2. Score plugins              │    │
                    │    │  3. Reserve                    │    │
                    │    │  4. Permit (approve/wait/deny) │    │
                    │    └───────────────────────────────┘    │
                    │              │                          │
                    │              ▼                          │
                    │    ┌─── Binding Cycle ──────────────┐   │
                    │    │  5. PreBind                     │   │
                    │    │  6. Bind (POST pods/binding)    │   │
                    │    │  7. PostBind                    │   │
                    │    └────────────────────────────────┘   │
                    └─────────────────────────────────────────┘
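The Backoff Queue in the diagram holds pods whose last attempt failed until a retry delay expires. A minimal sketch of that delay follows, assuming a simple doubling model; the 1 s and 10 s bounds mirror the podInitialBackoffSeconds and podMaxBackoffSeconds defaults in the scheduler configuration, and the function name is hypothetical.

```python
def backoff_seconds(attempts: int, initial: float = 1.0, maximum: float = 10.0) -> float:
    """Delay before a pod that has failed `attempts` times may re-enter the active queue.

    Assumption: the delay doubles per failed attempt and is capped at `maximum`.
    """
    if attempts < 1:
        return 0.0  # never failed: no backoff
    duration = initial
    for _ in range(attempts - 1):
        duration *= 2
        if duration >= maximum:
            return maximum
    return duration


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, backoff_seconds(n))
```

The cap matters in practice: a pod that keeps failing is retried at a steady interval rather than being starved by ever-growing delays.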

Full Scheduling Sequence

API Server           Scheduler                              etcd
    │                    │                                    │
    │──WATCH event ──►   │                                    │
    │  (pod Added,       │                                    │
    │   nodeName=="")    │                                    │
    │                    │                                    │
    │                    │  Pod → Active Queue                │
    │                    │                                    │
    │                    │  ┌─── FILTER phase ──────────────┐ │
    │                    │  │                               │ │
    │                    │  │  For each node in cluster:    │ │
    │                    │  │                               │ │
    │                    │  │  NodeUnschedulable            │ │
    │                    │  │  └─ skip cordoned nodes       │ │
    │                    │  │                               │ │
    │                    │  │  NodeResourcesFit             │ │
    │                    │  │  └─ pod requests ≤ allocatable│ │
    │                    │  │     − requests of placed pods │ │
    │                    │  │                               │ │
    │                    │  │  NodeAffinity                 │ │
    │                    │  │  └─ required matchExpressions │ │
    │                    │  │                               │ │
    │                    │  │  TaintToleration              │ │
    │                    │  │  └─ pod tolerates all taints  │ │
    │                    │  │                               │ │
    │                    │  │  PodTopologySpread            │ │
    │                    │  │  └─ maxSkew not exceeded      │ │
    │                    │  │                               │ │
    │                    │  │  InterPodAffinity             │ │
    │                    │  │  └─ required co-location rules│ │
    │                    │  │                               │ │
    │                    │  │  VolumeBinding                │ │
    │                    │  │  └─ PVC zone matches node AZ  │ │
    │                    │  │                               │ │
    │                    │  │  Result: feasible node list   │ │
    │                    │  └───────────────────────────────┘ │
    │                    │                                    │
    │                    │  ┌─── SCORE phase ───────────────┐ │
    │                    │  │                               │ │
    │                    │  │  LeastAllocated               │ │
    │                    │  │  └─ score = (1 - alloc%) × 100│ │
    │                    │  │                               │ │
    │                    │  │  ImageLocality                │ │
    │                    │  │  └─ by size of cached images  │ │
    │                    │  │                               │ │
    │                    │  │  InterPodAffinity             │ │
    │                    │  │  └─ preferred co-location     │ │
    │                    │  │                               │ │
    │                    │  │  NodeAffinity                 │ │
    │                    │  │  └─ preferred matchExpressions│ │
    │                    │  │                               │ │
    │                    │  │  Scores weighted + summed     │ │
    │                    │  │  Highest score wins           │ │
    │                    │  │  Tiebreak: random             │ │
    │                    │  └───────────────────────────────┘ │
    │                    │                                    │
    │                    │  Selected node: worker-3          │
    │                    │                                    │
    │◄── POST binding ───│                                    │
    │    target.name=    │                                    │
    │    "worker-3"      │                                    │
    │                    │                                    │
    │──WRITE ───────────────────────────────────────────────► │
    │  pod.spec.nodeName = "worker-3"                         │
    │                                                         │
    │──WATCH event (Modified) ───────────────────────────►    │
    │  [kubelet on worker-3 receives the update]              │
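The filter, score, and select steps in the sequence above can be sketched as a small pipeline. This is an illustrative model, not the scheduler's actual code: the Node and Pod types, the schedule_one function, and the example plugins are all hypothetical, and only CPU is modelled.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    name: str
    free_cpu_m: int            # allocatable minus requests of pods already placed
    unschedulable: bool = False


@dataclass
class Pod:
    name: str
    cpu_request_m: int


FilterFn = Callable[[Pod, Node], bool]
ScoreFn = Callable[[Pod, Node], float]


def schedule_one(pod: Pod, nodes: List[Node], filters: List[FilterFn],
                 scorers: List[Tuple[ScoreFn, int]],
                 rng: Optional[random.Random] = None) -> Optional[str]:
    """Return the chosen node name, or None when no node passes every filter."""
    rng = rng or random.Random()
    # FILTER phase: a node is feasible only if every filter accepts it.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None
    # SCORE phase: weighted sum of each plugin's score.
    totals = {n.name: sum(w * s(pod, n) for s, w in scorers) for n in feasible}
    best = max(totals.values())
    # Tiebreak: pick randomly among the top-scoring nodes.
    return rng.choice([name for name, total in totals.items() if total == best])


nodes = [
    Node("worker-1", free_cpu_m=500),
    Node("worker-2", free_cpu_m=4000, unschedulable=True),
    Node("worker-3", free_cpu_m=3000),
]
filters = [
    lambda p, n: not n.unschedulable,             # NodeUnschedulable-style check
    lambda p, n: p.cpu_request_m <= n.free_cpu_m,  # NodeResourcesFit-style check
]
scorers = [(lambda p, n: n.free_cpu_m / 40, 1)]    # more free CPU scores higher

print(schedule_one(Pod("web", 1000), nodes, filters, scorers))
```

In this example worker-1 lacks CPU and worker-2 is cordoned, so worker-3 is the only feasible node. A None result corresponds to the case where the real scheduler would consider preemption.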

Filter Plugins Reference

Plugin              What it checks                            Fail condition
------------------  ----------------------------------------  ----------------------------------------------------------
NodeUnschedulable   Node cordoned                             node.spec.unschedulable=true
NodeResourcesFit    CPU/memory/extended resources             Pod requests > allocatable minus requests already placed
NodeName            spec.nodeName explicit request            Name doesn't match
NodeAffinity        spec.affinity.nodeAffinity (required)     Required terms not satisfied
TaintToleration     Node taints vs pod tolerations            Untolerated NoSchedule or NoExecute taint
PodTopologySpread   topologySpreadConstraints (hard)          whenUnsatisfiable: DoNotSchedule and maxSkew exceeded
InterPodAffinity    Required pod affinity/anti-affinity       Required terms not satisfied
VolumeBinding       PVC binding and volume topology           PV node affinity (e.g. zone) mismatch, PVC cannot be bound
NodePorts           HostPort availability                     Port already in use on node
NodeVolumeLimits    Attachable volume count per node          Too many volumes
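Two of the checks in this table are easy to get subtly wrong, so here is a minimal sketch of each. It assumes simplified, hypothetical types (Taint, Toleration) and function names. The resource check compares requests against what is still free on the node rather than against raw allocatable, and the taint check ignores PreferNoSchedule, which only affects scoring.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass
class Toleration:
    key: str = ""            # an empty key is only valid with operator "Exists"
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # an empty effect matches every effect


def insufficient_resources(requests: Dict[str, int], allocatable: Dict[str, int],
                           already_requested: Dict[str, int]) -> List[str]:
    """Return the resources that do not fit; an empty list means the pod fits."""
    return [
        name for name, amount in requests.items()
        if amount > allocatable.get(name, 0) - already_requested.get(name, 0)
    ]


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """True if this single toleration covers this single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if not tol.key:
        return tol.operator == "Exists"  # wildcard toleration
    if tol.key != taint.key:
        return False
    return tol.operator == "Exists" or tol.value == taint.value


def passes_taint_filter(taints: List[Taint], tolerations: List[Toleration]) -> bool:
    """Every hard taint (NoSchedule / NoExecute) must be tolerated."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


if __name__ == "__main__":
    print(insufficient_resources(
        {"cpu": 500, "memory": 256},
        {"cpu": 2000, "memory": 1024},
        {"cpu": 1600, "memory": 512},
    ))
```

In the usage example the node has 2000m CPU allocatable, but 1600m is already requested by other pods, so a 500m request fails on CPU even though it is far below allocatable.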

Score Plugins Reference

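Plugin                             Score logic                                              Default weight (v1 config)
---------------------------------  -------------------------------------------------------  ----------------------------------
NodeResourcesFit (LeastAllocated)  Prefer nodes with most free CPU+memory (default)         1
NodeResourcesFit (MostAllocated)   Prefer nodes already loaded (bin-packing)                optional (replaces LeastAllocated)
ImageLocality                      Prefer nodes with the pod's images already cached        1
InterPodAffinity                   Prefer nodes with matching pods nearby                   2
NodeAffinity                       Preferred node selector terms                            2
NodeResourcesBalancedAllocation    Prefer balanced CPU/memory usage                         1
PodTopologySpread                  Soft spread (whenUnsatisfiable: ScheduleAnyway)          2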
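To make the resource-based rows concrete, the sketch below computes LeastAllocated, MostAllocated, and a balanced-allocation score for one node on the 0 to 100 scale, then combines them with weights. The formulas are simplified assumptions (equal weight per resource, two resources only) and the function names are hypothetical. The `requested` values are meant to include the incoming pod.

```python
from typing import Dict


def least_allocated(requested: Dict[str, int], allocatable: Dict[str, int]) -> float:
    """Average free fraction across resources, scaled to 0..100."""
    scores = [
        max(allocatable[r] - requested.get(r, 0), 0) * 100 / allocatable[r]
        for r in allocatable
    ]
    return sum(scores) / len(scores)


def most_allocated(requested: Dict[str, int], allocatable: Dict[str, int]) -> float:
    """Average used fraction across resources, scaled to 0..100 (bin-packing)."""
    scores = [
        min(requested.get(r, 0), allocatable[r]) * 100 / allocatable[r]
        for r in allocatable
    ]
    return sum(scores) / len(scores)


def balanced_allocation(requested: Dict[str, int], allocatable: Dict[str, int]) -> float:
    """Higher when CPU and memory utilisation are close to each other.

    Assumption: with exactly two resources, the imbalance is half the
    absolute difference between the two utilisation fractions.
    """
    cpu = requested["cpu"] / allocatable["cpu"]
    mem = requested["memory"] / allocatable["memory"]
    return (1 - abs(cpu - mem) / 2) * 100


def weighted_total(scores: Dict[str, float], weights: Dict[str, int]) -> float:
    """Sum of plugin scores multiplied by their configured weights."""
    return sum(weights[name] * score for name, score in scores.items())


allocatable = {"cpu": 4000, "memory": 8192}  # millicores, MiB
requested = {"cpu": 1000, "memory": 4096}    # includes the incoming pod

scores = {
    "NodeResourcesFit": least_allocated(requested, allocatable),
    "NodeResourcesBalancedAllocation": balanced_allocation(requested, allocatable),
}
weights = {"NodeResourcesFit": 1, "NodeResourcesBalancedAllocation": 1}
print(scores, weighted_total(scores, weights))
```

Note that LeastAllocated and MostAllocated are mirror images: a profile uses one or the other, depending on whether the goal is spreading load or packing nodes tightly.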

Preemption — When No Node Fits

If no feasible node exists after filtering, the scheduler attempts preemption:

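1. For each node, find running pods whose removal would let the pending pod fit
2. Victim candidates must have a lower priority (PriorityClass value) than the pending pod
3. Among candidate nodes, the scheduler prefers the one with the fewest PodDisruptionBudget violations, then the one whose victims have the lowest priority
4. The scheduler records the chosen node in pod.status.nominatedNodeName
5. The scheduler itself deletes the victim pods; there is no separate preemption controller
6. The nominated pod waits in the queue while the victims terminate gracefully
7. In a later scheduling cycle the pod is tried again; it usually lands on the nominated node, but this is not guaranteed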

# PriorityClass example
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority   # can preempt lower-priority pods

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100
preemptionPolicy: Never                   # will not preempt others
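
Returning to the preemption steps, victim selection on a single node can be sketched as a greedy loop. This is a deliberate simplification under stated assumptions: only CPU is considered, PodDisruptionBudgets are ignored, and victims are taken strictly lowest priority first. The RunningPod type and select_victims function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningPod:
    name: str
    priority: int
    cpu_m: int


def select_victims(preemptor_priority: int, preemptor_cpu_m: int, free_cpu_m: int,
                   running: List[RunningPod]) -> Optional[List[RunningPod]]:
    """Return the pods to evict so the preemptor fits, or None if it cannot fit.

    Only pods with strictly lower priority than the preemptor are eligible.
    """
    candidates = sorted(
        (p for p in running if p.priority < preemptor_priority),
        key=lambda p: p.priority,
    )
    victims: List[RunningPod] = []
    for pod in candidates:
        if free_cpu_m >= preemptor_cpu_m:
            break  # enough room already; spare the remaining pods
        victims.append(pod)
        free_cpu_m += pod.cpu_m
    return victims if free_cpu_m >= preemptor_cpu_m else None


running = [
    RunningPod("batch-a", priority=100, cpu_m=500),
    RunningPod("critical", priority=1000000, cpu_m=2000),
    RunningPod("batch-c", priority=50, cpu_m=300),
]
victims = select_victims(preemptor_priority=1000, preemptor_cpu_m=1000,
                         free_cpu_m=400, running=running)
print([v.name for v in victims] if victims is not None else "no fit")
```

The production-critical pod is never considered because its priority is higher than the preemptor's. A pod whose PriorityClass sets preemptionPolicy: Never would skip this step entirely and simply wait.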

Scheduling Percentages (Performance Tuning)

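The scheduler does not always evaluate every node. Filtering stops once enough feasible nodes have been found, and only those nodes are scored. How many is "enough" is controlled by percentageOfNodesToScore. When the setting is left unset (0), the scheduler derives a percentage from cluster size: roughly 50% for a 100-node cluster, falling linearly to 10% at 5000 nodes, with a floor of 5%. Small clusters are still evaluated in full. For large clusters (1000+ nodes), setting a lower value explicitly trades some placement quality for scheduling throughput: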

# KubeSchedulerConfiguration
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 1
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated       # bin-packing (default: LeastAllocated)
percentageOfNodesToScore: 10      # stop filtering once feasible nodes reach ~10% of the cluster; only those are scored
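
The effect of percentageOfNodesToScore can be worked through numerically. The sketch below is written from memory of the upstream behaviour and should be read as an assumption rather than a specification: the constants (a base of 50%, a reduction of one point per 125 nodes, a 5% floor, and a minimum of 100 feasible nodes) and the function name are illustrative.

```python
MIN_FEASIBLE_NODES_TO_FIND = 100   # assumed lower bound on nodes to find
MIN_PERCENTAGE = 5                 # assumed floor for the adaptive percentage
BASE_PERCENTAGE = 50               # assumed starting point for small clusters


def num_feasible_nodes_to_find(num_all_nodes: int, percentage: int = 0) -> int:
    """How many feasible nodes filtering looks for before it stops.

    percentage == 0 means "unset": derive an adaptive value from cluster size.
    """
    if num_all_nodes < MIN_FEASIBLE_NODES_TO_FIND or percentage >= 100:
        return num_all_nodes  # small cluster or 100%: check everything
    if percentage == 0:
        percentage = max(BASE_PERCENTAGE - num_all_nodes // 125, MIN_PERCENTAGE)
    return max(num_all_nodes * percentage // 100, MIN_FEASIBLE_NODES_TO_FIND)


if __name__ == "__main__":
    for size in (50, 1000, 5000, 6000):
        print(size, num_feasible_nodes_to_find(size))
    print(1000, "with 10% ->", num_feasible_nodes_to_find(1000, 10))
```

Under these assumptions an explicit 10% on a 1000-node cluster has the scheduler stop after 100 feasible nodes, whereas the adaptive default would look for several hundred.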

Debugging Scheduling Issues

# Check why a pod is Unschedulable
kubectl describe pod <pod-name> -n <namespace>
# Look for: Events — "0/5 nodes are available: ..."

# Common messages and their meanings:
# "Insufficient cpu"            → CPU requests exceed the unreserved allocatable CPU on every node
# "node(s) had taint ... NoSchedule" → add a matching toleration or remove the taint
# "node(s) didn't match Pod's node affinity/selector" → fix nodeSelector / required nodeAffinity, or label the nodes
# "1 node(s) had volume node affinity conflict" → the bound PV's zone ≠ node zone
# "pod has unbound immediate PersistentVolumeClaims" → PVC still Pending (no matching PV or StorageClass)

# Check scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler --tail=100

# Simulate scheduling
# kubectl has no built-in dry-run for scheduling decisions;
# the kube-scheduler-simulator project (kubernetes-sigs) can replay them offline

# Check node resources
kubectl describe node <node-name> | grep -A15 "Allocated resources"
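# Requires the resource-capacity kubectl plugin (installable via krew); --util also needs metrics-server
kubectl resource-capacity --util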
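
When many pods are Pending, it can help to tally the reasons from the "0/5 nodes are available: ..." event text instead of reading each one. The sketch below parses one such message. The exact message format varies between Kubernetes versions, so the regular expressions, the handling of a trailing "preemption:" clause, and the sample string are assumptions for illustration; parse_unschedulable is a hypothetical helper.

```python
import re
from typing import Dict, Tuple


def parse_unschedulable(message: str) -> Tuple[int, int, Dict[str, int]]:
    """Split a FailedScheduling message into (available, total, {reason: node_count})."""
    head = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if head is None:
        raise ValueError("not a recognised FailedScheduling message")
    available, total = int(head.group(1)), int(head.group(2))
    # Drop an optional trailing preemption summary, then the final period.
    body = head.group(3).split(" preemption:")[0].rstrip(". ")
    reasons: Dict[str, int] = {}
    for part in body.split(", "):
        item = re.match(r"(\d+) (.+)", part.strip())
        if item:
            reasons[item.group(2)] = int(item.group(1))
    return available, total, reasons


# Illustrative sample, not captured from a real cluster.
sample = ("0/5 nodes are available: 3 Insufficient cpu, "
          "2 node(s) had volume node affinity conflict.")
print(parse_unschedulable(sample))
```

The per-reason counts add up to the total number of nodes, which makes it easy to see which single constraint excludes most of the cluster.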