kube-controller-manager
The engine that drives Kubernetes self-healing — 30+ reconciliation loops running inside a single binary, continuously comparing desired state to observed state and taking corrective action via the Kubernetes API.
What kube-controller-manager Does
kube-controller-manager is a collection of independent control loops (controllers) compiled into one binary and launched as a single process. Each controller watches specific API objects and reconciles them toward their desired state. They share an in-process informer cache, so all 30+ controllers together make only a single set of watch streams to kube-apiserver.
Process identity
- Binary: kube-controller-manager
- Static pod: /etc/kubernetes/manifests/kube-controller-manager.yaml
- Secure port: :10257 (HTTPS, metrics + healthz)
- Kubeconfig: /etc/kubernetes/controller-manager.conf
- Service account: uses system:kube-controller-manager
- Leader-elected: one active instance per cluster
Design principles
- Level-triggered: reconcile based on current observed state, not event history
- Idempotent: reconcile can be called any number of times safely
- Optimistic: assume success, detect and fix divergence via watch
- API-only: all state mutations go through kube-apiserver, never direct to etcd
- Shared informers: one watch per resource type, all controllers share
The Reconciliation Loop Pattern
Every controller in kube-controller-manager follows the same fundamental observe → compare → act pattern. Understanding this pattern explains why Kubernetes converges to desired state even after partial failures.
Controllers are level-triggered: they do not replay missed events. When a controller runs, it looks at the current state of the world (from its informer cache) and calculates what needs to change — regardless of what events caused it to be triggered. This means a controller that missed 10 rapid updates will still converge correctly after one reconcile, because it sees the final state, not a sequence of deltas.
Key invariants that make this work safely:
- Reads from informer cache (resourceVersion=0), not live API: avoids thundering-herd on kube-apiserver
- Writes go through kube-apiserver with resourceVersion for optimistic concurrency
- Status is written separately via /status subresource: controller owns status, user owns spec
- Work queue deduplication: rapid spec changes produce one reconcile, not many
- Exponential backoff on errors: failed reconciles retry with increasing delays
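The invariants above can be illustrated with a small simulation. This is a minimal sketch, not the client-go implementation: `DedupQueue`, `reconcile`, `run_once`, and `backoff_delay` are hypothetical names, and the backoff base and cap are illustrative values rather than confirmed defaults.

```python
from collections import deque


class DedupQueue:
    """FIFO queue that holds each key at most once, like a controller work queue."""

    def __init__(self):
        self._order = deque()
        self._pending = set()

    def add(self, key):
        # A key that is already waiting is not queued a second time.
        if key not in self._pending:
            self._pending.add(key)
            self._order.append(key)

    def pop(self):
        key = self._order.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._order)


def backoff_delay(failures, base=0.005, cap=1000.0):
    """Per-item exponential backoff in seconds (base and cap are assumptions)."""
    return min(cap, base * (2 ** failures))


def reconcile(key, desired, observed):
    """Level-triggered: compare current desired and observed state, ignore history."""
    want = desired.get(key, 0)
    have = observed.get(key, 0)
    if have < want:
        return ("create", want - have)
    if have > want:
        return ("delete", have - want)
    return ("noop", 0)


def run_once(queue, desired, observed):
    """Drain the queue and return one action per queued key."""
    actions = []
    while len(queue):
        key = queue.pop()
        actions.append((key,) + reconcile(key, desired, observed))
    return actions


# Ten rapid spec updates to the same object collapse into a single reconcile
# that only sees the final desired state.
desired = {"default/web": 0}
observed = {"default/web": 3}
queue = DedupQueue()
for replicas in range(1, 11):
    desired["default/web"] = replicas
    queue.add("default/web")

actions = run_once(queue, desired, observed)
print(actions)  # [('default/web', 'create', 7)]
```

The queue stores keys rather than events, which is what lets a controller that missed intermediate updates still converge in one pass.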
Shared Informer Architecture
All controllers inside kube-controller-manager share a single SharedInformerFactory. This means that even though many controllers need to watch Pods, there is only one Pod watch stream to kube-apiserver, and one copy of the Pod cache in memory. Each controller registers its own event handlers against the shared informer.
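The fan-out can be sketched in a few lines. This is a simplified model under stated assumptions, not the client-go API: the class and method names (`SharedInformer`, `informer_for`, `on_watch_event`, `watch_stream_count`) are hypothetical, and the watch itself is simulated by calling `on_watch_event` directly.

```python
class SharedInformer:
    """One cache and one upstream watch per resource type, many handlers."""

    def __init__(self, resource):
        self.resource = resource
        self.cache = {}      # key -> latest object seen
        self.handlers = []   # callbacks registered by controllers

    def add_event_handler(self, handler):
        self.handlers.append(handler)

    def on_watch_event(self, key, obj):
        # Update the shared cache first, then fan the event out to every handler.
        self.cache[key] = obj
        for handler in self.handlers:
            handler(key, obj)


class SharedInformerFactory:
    """Hands out the same informer instance for the same resource type."""

    def __init__(self):
        self._informers = {}

    def informer_for(self, resource):
        if resource not in self._informers:
            self._informers[resource] = SharedInformer(resource)
        return self._informers[resource]

    def watch_stream_count(self):
        # One upstream watch per distinct resource type, however many handlers exist.
        return len(self._informers)


factory = SharedInformerFactory()
deployment_queue, replicaset_queue = [], []

# Two controllers ask for a Pod informer and each registers its own handler.
factory.informer_for("pods").add_event_handler(lambda key, obj: deployment_queue.append(key))
factory.informer_for("pods").add_event_handler(lambda key, obj: replicaset_queue.append(key))

# A single simulated watch event reaches both controllers.
factory.informer_for("pods").on_watch_event("default/web-1", {"phase": "Running"})

print(factory.watch_stream_count())         # 1
print(deployment_queue, replicaset_queue)   # ['default/web-1'] ['default/web-1']
```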
Complete Controller Inventory
As of Kubernetes 1.30, kube-controller-manager runs the following controllers. Each runs in its own goroutines with its own work queue and informer event handlers.
Workload Controllers
Deployment Controller
Watches Deployments and their child ReplicaSets. On spec change, creates a new RS with updated pod template hash, scales it up, and scales down old RS — implementing the rolling update strategy. Computes maxSurge/maxUnavailable constraints.
ReplicaSet Controller
Maintains the exact number of Pod replicas specified by spec.replicas. Watches Pods via label selector. On pod count divergence: creates or deletes Pods. Sets ownerReference on created Pods pointing back to the RS.
StatefulSet Controller
Manages Pods with stable network identity and persistent storage. Creates Pods in order (0, 1, 2…), ensures each is Running+Ready before creating the next. Handles PVC creation per pod, rolling updates with partition support.
DaemonSet Controller
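Ensures a Pod runs on every eligible node (respecting nodeSelector, affinity, taints). Creates one Pod per node with a node affinity term that pins it to that node; the default scheduler then binds it. (Before Kubernetes 1.12 the controller set spec.nodeName itself and bypassed the scheduler.) Handles node addition/removal and rolling updates.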
Job Controller
Manages batch jobs to completion. Tracks succeeded/failed pod counts against completions/parallelism. Handles pod failures via backoffLimit. Supports Indexed completion mode for parallelism with job index injection.
CronJob Controller
Creates Job objects on a cron schedule. Enforces concurrencyPolicy (Allow/Forbid/Replace) and startingDeadlineSeconds. Tracks history of completed/failed jobs up to successfulJobsHistoryLimit.
ReplicationController
Legacy predecessor to ReplicaSet. Functionally identical but does not support set-based selectors. Still active for backward compatibility. New workloads should use ReplicaSets via Deployments.
Node Controllers
Node Lifecycle Controller
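The most critical node-related controller. Monitors the NodeReady condition via kubelet heartbeats. After node-monitor-grace-period (default 40s) without a heartbeat, it sets the node's Ready condition to Unknown and applies the node.kubernetes.io/unreachable NoExecute taint (node.kubernetes.io/not-ready when the kubelet reports Ready=False). Pods are then evicted once their toleration for that taint expires, 300s by default via the DefaultTolerationSeconds admission plugin. The legacy pod-eviction-timeout flag (default 5m) is deprecated and does not control this under taint-based eviction.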
Node IPAM Controller
Assigns pod CIDR blocks to nodes from the cluster's --cluster-cidr range. Writes the assigned CIDR to node.spec.podCIDR and node.spec.podCIDRs. Cloud-specific IPAM may delegate to cloud-controller-manager instead.
Taint Manager
Part of the node lifecycle subsystem. Watches node taints with NoExecute effect. For each pod on a tainted node, checks tolerations: if not tolerated (or tolerationSeconds expired), marks pod for deletion. Runs the eviction rate limiter to avoid mass eviction storms.
Storage Controllers
PersistentVolume Controller
Binds PVCs to PVs via the claim binder loop. For static provisioning: finds a matching PV and binds. For dynamic provisioning with Immediate binding mode: invokes the StorageClass provisioner. Manages the PV lifecycle: Available → Bound → Released → (Retain/Delete/Recycle).
AttachDetach Controller
Manages volume attach/detach operations independently from kubelet. Watches pods, determines which volumes need to be attached to which nodes, calls the volume plugin's Attach()/Detach(). Tracks VolumeAttachment objects for CSI volumes.
PVC Protection Controller
Maintains the kubernetes.io/pvc-protection finalizer on PVCs. This prevents accidental deletion of in-use PVCs: the PVC goes to Terminating but isn't deleted until no pod uses it, at which point the controller removes the finalizer.
PV Protection Controller
Maintains the kubernetes.io/pv-protection finalizer on PVs and removes it only once the PV is no longer bound to a PVC. Ensures PVs aren't deleted while a PVC is bound to them.
Network Controllers
Endpoint Controller (legacy)
Watches Services and Pods. For each Service, maintains the Endpoints object with the IP:port of all Ready pods matching the Service selector. Replaced by EndpointSlice controller for scalability but still runs for backward compatibility.
EndpointSlice Controller
Manages EndpointSlice objects (max 100 endpoints per slice). Shards a Service's backends across multiple slices for scalability. Incremental updates — only changed slices are updated, reducing kube-apiserver and etcd write amplification on large Services.
EndpointSliceMirroring Controller
Mirrors manually-managed Endpoints objects to EndpointSlice objects for backward compatibility. Allows tools that still write to Endpoints (not EndpointSlice) to work with kube-proxy's EndpointSlice-based routing.
Service Controller
Watches Services of type LoadBalancer and delegates to the cloud provider (via cloud-controller-manager in modern setups, or in-tree cloud provider plugins) to create/update external load balancers. NodePort allocation is not part of this controller; kube-apiserver assigns NodePorts when the Service is created.
Namespace, RBAC, and Security Controllers
Namespace Controller
When a Namespace enters Terminating phase (after DELETE call), this controller deletes all resources within the namespace in the correct order, then removes the kubernetes finalizer to allow the Namespace object itself to be deleted.
ServiceAccount Controller
Creates a default ServiceAccount named default in every namespace. Also creates the legacy token Secret for each ServiceAccount (pre-1.24 behavior, now managed by TokenRequest API). Ensures the ServiceAccount exists before pods can reference it.
Token Controller
Watches ServiceAccounts and their token Secrets. Populates a token Secret with a signed token when its annotations name a ServiceAccount. Deletes tokens for ServiceAccounts that no longer exist. In 1.24+, the bound service account token volume projection largely replaces this flow.
RBAC Bootstrapping Controller
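Built-in ClusterRoles and ClusterRoleBindings (like cluster-admin, system:node) are reconciled by kube-apiserver each time it starts, which re-creates them if they were accidentally deleted; this is not a kube-controller-manager loop. The RBAC-related controller that does run here is clusterrole-aggregation, which merges rules into ClusterRoles that declare an aggregationRule.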
Certificate Signing Controllers
Processes CertificateSigningRequest (CSR) objects. Three sub-controllers: csrsigning (signs approved CSRs using the cluster CA), csrapproving (auto-approves node/kubelet CSRs), csrcleaner (deletes old CSRs after TTL).
Bootstrap Token Controller
Manages tokens used for node bootstrapping (e.g., kubeadm join). Cleans up expired bootstrap tokens, creates/updates the cluster-info ConfigMap in kube-public, and creates the RBAC bindings needed for bootstrapping nodes.
Garbage Collection and Lifecycle Controllers
Garbage Collector
The most complex controller. Traverses the ownerReferences graph to delete orphaned objects. Maintains a DAG of all API objects in memory. When an owner is deleted, orphans are queued for deletion. Handles foreground/background deletion cascades. See 13-garbage-collection.html.
TTL Controller
Sets the node.alpha.kubernetes.io/ttl annotation on Nodes, scaled to cluster size; kubelets use it as a hint for how long to cache Secrets and ConfigMaps. Despite the name, it does not delete Jobs.
TTL-After-Finished Controller
Handles the .spec.ttlSecondsAfterFinished field on Jobs. Watches for finished Jobs and deletes each Job and its pods once the TTL expires, which prevents finished Jobs from cluttering the cluster. This is a separate mechanism from successfulJobsHistoryLimit on CronJob: the TTL operates on individual Jobs regardless of their parent.
Resource Quota Controller
Enforces ResourceQuota objects. Watches all resource types and maintains usage counts. Replenishes quota when resources are deleted. Quota admission happens in the apiserver admission pipeline; this controller maintains the quota usage status.
HorizontalPodAutoscaler (HPA) Controller
Polls metrics (CPU, memory, custom metrics via metrics-server or custom metrics API) every --horizontal-pod-autoscaler-sync-period (default 15s). Calculates desired replicas, applies scale-up/scale-down stabilization windows, and PATCHes the target's /scale subresource.
Disruption Controller
Maintains PodDisruptionBudget status. Counts healthy pods against the budget and updates status.disruptionsAllowed. The eviction API (used by kubectl drain) checks this status before evicting pods; scheduler preemption deletes its victims directly and honors PDBs only on a best-effort basis.
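The arithmetic behind status.disruptionsAllowed can be sketched as follows. This is a minimal illustration that assumes an integer minAvailable or maxUnavailable (percentage values are left out); the function and parameter names are hypothetical.

```python
def disruptions_allowed(current_healthy, expected_pods,
                        min_available=None, max_unavailable=None):
    """How many more pods may be voluntarily evicted without breaking the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        # maxUnavailable is expressed relative to the expected pod count.
        desired_healthy = max(0, expected_pods - max_unavailable)
    # The value never goes negative: a budget that is already violated allows 0.
    return max(0, current_healthy - desired_healthy)


print(disruptions_allowed(3, 3, min_available=2))    # 1
print(disruptions_allowed(2, 3, min_available=2))    # 0
print(disruptions_allowed(5, 5, max_unavailable=1))  # 1
```

An eviction request is admitted only while this number is above zero, which is why a drain stalls when pods of the protected workload are already unhealthy.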
Deep Dive: Node Lifecycle Controller
The Node Lifecycle Controller is arguably the most operationally important controller — it determines when failed nodes trigger pod eviction and how quickly your cluster recovers from node failures.
| Parameter | Flag | Default | Description |
|---|---|---|---|
| Grace period | --node-monitor-grace-period | 40s | Time after last heartbeat before node is marked Unknown |
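| Eviction delay | tolerationSeconds on the pod's not-ready/unreachable tolerations | 300s | How long a pod stays bound to a NotReady or unreachable node before eviction. The legacy --pod-eviction-timeout flag (5m0s) is deprecated and does not control taint-based eviction |
| Eviction rate | --node-eviction-rate | 0.1/s | Nodes per second whose pods are evicted while the zone is healthy (one node every 10s) |
| Secondary rate | --secondary-node-eviction-rate | 0.01/s | Eviction rate when at least 55% of the zone's nodes are unhealthy (network partition); evictions stop entirely in zones of 50 nodes or fewer (--large-cluster-size-threshold) |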
| Unhealthy threshold | --unhealthy-zone-threshold | 55% | Fraction of unhealthy nodes to trigger secondary rate |
| Monitor period | --node-monitor-period | 5s | How often controller checks node status |
If more than 55% of nodes in a zone become unreachable simultaneously, the eviction rate drops to near-zero. This is intentional — it detects a network partition scenario where the control plane lost contact with an entire zone. In that case, mass-evicting pods would be counterproductive because the nodes (and their pods) are likely still running fine. The controller waits for the zone to recover rather than evicting healthy workloads.
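The rate selection described above can be sketched for a single zone. This is a simplified model, not the controller's code: the function names are hypothetical, the `not_ready > 2` guard is an assumption about how very small zones are treated, and the cross-zone rule (what happens when every zone is fully down) is left out.

```python
def zone_state(ready, not_ready, unhealthy_threshold=0.55):
    """Classify one zone from its ready and not-ready node counts."""
    if ready == 0 and not_ready > 0:
        return "FullDisruption"
    total = ready + not_ready
    # Assumption: a handful of failed nodes never counts as a zone-level disruption.
    if not_ready > 2 and not_ready / total >= unhealthy_threshold:
        return "PartialDisruption"
    return "Normal"


def eviction_rate(ready, not_ready, primary=0.1, secondary=0.01,
                  unhealthy_threshold=0.55, large_cluster_threshold=50):
    """Nodes per second whose pods may be evicted in this zone."""
    state = zone_state(ready, not_ready, unhealthy_threshold)
    if state == "PartialDisruption":
        # Large zones slow down to the secondary rate; small zones stop evicting.
        if ready + not_ready > large_cluster_threshold:
            return secondary
        return 0.0
    # Normal zones, and a single fully failed zone, use the primary rate.
    return primary


print(eviction_rate(ready=100, not_ready=0))  # 0.1
print(eviction_rate(ready=40, not_ready=60))  # 0.01
print(eviction_rate(ready=4, not_ready=6))    # 0.0
```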
Deep Dive: Garbage Collector
The Garbage Collector (GC) is the most architecturally complex controller. It maintains an in-memory owner-dependency graph (a DAG) of all API objects and propagates deletions through the graph.
GraphBuilder
Watches ALL resource types via a "meta-only" informer (fetches only metadata, not full objects — much cheaper). Builds a DAG: edges are ownerReferences. When an object is created/deleted, the graph is updated and affected nodes are enqueued for processing.
GarbageCollector
Processes the dirty-queue from GraphBuilder. For each object: if owner is missing/deleted and object lacks a foregroundDeletion finalizer → delete the object. For foreground deletion: remove dependents first, then remove the owner's foregroundDeletion finalizer.
# ownerReference chain: Deployment → ReplicaSet → Pod
kubectl get pod my-pod-abc12 -o jsonpath='{.metadata.ownerReferences}'
# [{apiVersion:apps/v1, kind:ReplicaSet, name:my-deploy-6799fc88d8, uid:..., controller:true}]
kubectl get rs my-deploy-6799fc88d8 -o jsonpath='{.metadata.ownerReferences}'
# [{apiVersion:apps/v1, kind:Deployment, name:my-deploy, uid:..., controller:true}]
# Delete deployment with background GC (default)
kubectl delete deployment my-deploy
# Deployment deleted immediately; RS and Pods deleted asynchronously by GC
# Delete with foreground GC (wait for dependents)
kubectl delete deployment my-deploy --cascade=foreground
# Orphan — delete deployment but keep ReplicaSets and Pods
kubectl delete deployment my-deploy --cascade=orphan
# Check if GC is processing (look for rapid object deletions)
kubectl get events --field-selector reason=SuccessfulDelete | head -20
GC Race Condition: Dangling ownerReferences
If a pod is created with an ownerReference pointing to a UID that no longer exists (e.g., due to a rapid delete+recreate of the owner under the same name), the GC will delete the pod because its owner is absent. Setting blockOwnerDeletion: true does not prevent this; it only makes foreground deletion of the owner wait for the dependent. Controllers should always verify the owner's UID is current before creating dependents.
# List pods with their first ownerReference (name and UID). A reference is dangling
# when that UID matches no live owner, which this listing alone does not prove.
kubectl get pods -A -o json | jq -r '
.items[] |
select(.metadata.ownerReferences != null) |
"\(.metadata.namespace)/\(.metadata.name): owner=\(.metadata.ownerReferences[0].name) uid=\(.metadata.ownerReferences[0].uid)"
' | head -20
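To actually flag dangling references, the owner UIDs have to be checked against the UIDs of live objects. The sketch below does that for an in-memory list of objects shaped like `kubectl get ... -o json` items; it assumes all candidate owners are in the same list, and `find_dangling` and the sample data are hypothetical.

```python
def find_dangling(objects):
    """Return (dependent name, owner name) pairs whose owner UID matches no live object."""
    live_uids = {obj["metadata"]["uid"] for obj in objects}
    dangling = []
    for obj in objects:
        for ref in obj["metadata"].get("ownerReferences", []):
            if ref["uid"] not in live_uids:
                dangling.append((obj["metadata"]["name"], ref["name"]))
    return dangling


objects = [
    {"kind": "Deployment", "metadata": {"name": "my-deploy", "uid": "d1"}},
    {"kind": "ReplicaSet", "metadata": {
        "name": "my-deploy-6799fc88d8", "uid": "r1",
        "ownerReferences": [{"kind": "Deployment", "name": "my-deploy", "uid": "d1"}]}},
    {"kind": "Pod", "metadata": {
        "name": "my-pod-abc12", "uid": "p1",
        "ownerReferences": [{"kind": "ReplicaSet", "name": "my-deploy-6799fc88d8", "uid": "r1"}]}},
    # This pod points at a ReplicaSet UID that no longer exists.
    {"kind": "Pod", "metadata": {
        "name": "my-pod-stale", "uid": "p2",
        "ownerReferences": [{"kind": "ReplicaSet", "name": "my-deploy-6799fc88d8", "uid": "r0"}]}},
]

print(find_dangling(objects))  # [('my-pod-stale', 'my-deploy-6799fc88d8')]
```

Matching on UID rather than name is the point: the stale pod's owner name still resolves to a live ReplicaSet, but the UID does not.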
Deep Dive: Deployment Controller
The Deployment controller orchestrates rolling updates by managing ReplicaSets. Understanding its algorithm is essential for debugging slow rollouts and unexpected pod terminations.
# Rolling update algorithm
# Given: maxSurge=1, maxUnavailable=1, replicas=3
#
# Initial state: RS-v1 has 3 pods (all Ready)
# After spec.template change:
#
# Step 1: Create RS-v2 with 0 replicas
# Step 2: Scale RS-v2 to 1 (maxSurge=1 caps total pods at 3+1=4)
# Step 3: Scale RS-v1 to 2 without waiting: maxUnavailable=1 allows it,
#         because 2 Ready pods still meet the minimum of 3-1=2
# Step 4: Scale RS-v2 to 2 (total is back at the cap of 4)
# Step 5: As each RS-v2 pod becomes Ready, scale RS-v1 down by one and RS-v2 up by one
# Step 6: Finish with RS-v2 at 3 and RS-v1 at 0
# RS-v1 kept at 0 replicas (not deleted) for rollback
kubectl rollout status deployment/my-deploy
kubectl rollout history deployment/my-deploy
kubectl rollout undo deployment/my-deploy # roll back to previous RS
kubectl rollout undo deployment/my-deploy --to-revision=2 # roll back to specific revision
| Field | Default | Effect |
|---|---|---|
| spec.strategy.rollingUpdate.maxSurge | 25% | Max pods above desired count during rollout (percentages round up). Higher = faster rollout, more resource burst |
| spec.strategy.rollingUpdate.maxUnavailable | 25% | Max unavailable pods during rollout (percentages round down). 0 = no capacity drop during rollout. Lower = slower rollout |
| spec.progressDeadlineSeconds | 600 | Seconds without progress before the Progressing condition turns False with reason ProgressDeadlineExceeded |
| spec.minReadySeconds | 0 | Seconds a new pod must stay Ready before it counts as available |
| spec.revisionHistoryLimit | 10 | Number of old ReplicaSets to keep for rollback |
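How maxSurge and maxUnavailable turn into concrete pod counts can be sketched as below. The rounding directions (surge up, unavailable down) follow the field documentation; the fallback to one unavailable pod when both resolve to zero reflects my understanding of the controller and should be read as an assumption. `resolve` and `rollout_bounds` are hypothetical names.

```python
def resolve(value, replicas, round_up):
    """Turn an absolute number or a percentage string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer arithmetic avoids floating-point surprises when rounding.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count ceiling and availability floor during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Assumption: the rollout must be able to make progress somehow.
        unavailable = 1
    return {"max_total": replicas + surge, "min_available": replicas - unavailable}


print(rollout_bounds(3))         # {'max_total': 4, 'min_available': 3}
print(rollout_bounds(10))        # {'max_total': 13, 'min_available': 8}
print(rollout_bounds(3, 1, 1))   # {'max_total': 4, 'min_available': 2}
```

With the defaults and 3 replicas, the floor equals the desired count, so the rollout can only proceed by surging; that is one common reason small Deployments roll out one pod at a time.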
Deployment conditions and status fields
# Key status fields
kubectl get deploy my-deploy -o yaml | grep -A 30 "status:"
# status:
# availableReplicas: 3 # Ready pods with minReadySeconds met
# readyReplicas: 3 # pods passing readinessProbe
# replicas: 3 # total pods (including not ready)
# updatedReplicas: 3 # pods with current template
# observedGeneration: 5 # last spec generation processed
# conditions:
# - type: Available # availableReplicas >= replicas - maxUnavailable
# status: "True"
# - type: Progressing # rollout is making progress
# status: "True"
# - type: ReplicaFailure # pod creation failing
# status: "False"
# Detect stalled rollout
kubectl get deploy my-deploy -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}'
Deep Dive: HPA Controller
The Horizontal Pod Autoscaler controller runs its own sync loop independently from the event-driven controllers. It polls metrics at a fixed interval and calculates desired replicas using a ratio formula.
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
# Example: target CPU = 50%, current CPU = 80%, current replicas = 3
# desiredReplicas = ceil(3 * (80 / 50)) = ceil(4.8) = 5
# Scale-up: immediate (no stabilization by default)
# Scale-down: 5-minute stabilization window (--horizontal-pod-autoscaler-downscale-stabilization)
# This prevents flapping on temporary load spikes
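The formula can be exercised with a short function. This is a sketch under stated assumptions: it adds the 10% tolerance band (the default of --horizontal-pod-autoscaler-tolerance) and min/max clamping, and it ignores stabilization windows, missing metrics, and not-yet-ready pods. `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=None, tolerance=0.1):
    """Apply the HPA ratio formula, then clamp to the configured bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: skip scaling to avoid churn.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    desired = max(min_replicas, desired)
    if max_replicas is not None:
        desired = min(max_replicas, desired)
    return desired


print(desired_replicas(3, 80, 50))                    # 5
print(desired_replicas(3, 52, 50))                    # 3
print(desired_replicas(3, 500, 50, max_replicas=20))  # 20
```

The second call shows the tolerance band at work: a ratio of 1.04 would round up to 4 replicas by the bare formula, but it is treated as on-target.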
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 500Mi
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # 5-minute cooldown on scale-down
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60 # max 10% pods removed per minute
    scaleUp:
      stabilizationWindowSeconds: 0 # immediate scale-up
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60 # max 4 pods added per minute
If you manage spec.replicas via Server-Side Apply in your manifests, the HPA will conflict with your manager — it sets spec.replicas via its own field manager. Best practice: omit spec.replicas from your Deployment manifest when HPA manages it (or set it to the initial value and let HPA take over). See 04-kubernetes-api-model.html § SSA.
Configuration and Key Flags
# kube-controller-manager static pod (kubeadm)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --cluster-cidr=10.244.0.0/16
    - --service-cluster-ip-range=10.96.0.0/12
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner # * = all default, plus extras
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --use-service-account-credentials=true
    - --node-monitor-grace-period=40s
    # --pod-eviction-timeout=5m0s (older releases only; the flag is deprecated
    # and should be checked against your version's flag reference before use)
    - --node-eviction-rate=0.1
    - --concurrent-deployment-syncs=5
    - --concurrent-replicaset-syncs=5
    - --concurrent-gc-syncs=20
| Flag | Default | Purpose |
|---|---|---|
| --controllers | * | Comma-separated list of controllers to enable. *=all on-by-default controllers. Prefix with - to disable (e.g., -ttl) |
| --concurrent-deployment-syncs | 5 | Number of Deployment objects synced in parallel |
| --concurrent-replicaset-syncs | 5 | Number of ReplicaSet objects synced in parallel |
| --concurrent-statefulset-syncs | 5 | Number of StatefulSet objects synced in parallel |
| --concurrent-daemonset-syncs | 2 | Number of DaemonSet objects synced in parallel |
| --concurrent-gc-syncs | 20 | GC goroutines for garbage collecting orphaned objects |
| --concurrent-endpoint-syncs | 5 | Endpoint reconciliation goroutines |
| --horizontal-pod-autoscaler-sync-period | 15s | HPA metrics poll interval |
| --horizontal-pod-autoscaler-downscale-stabilization | 5m0s | Scale-down stabilization window |
| --use-service-account-credentials | false | Each controller uses its own SA token (fine-grained RBAC). kubeadm sets it to true |
| --leader-elect | true | Enable leader election for HA |
| --kube-api-qps | 20 | API server QPS limit |
| --kube-api-burst | 30 | API server burst above QPS |
When --use-service-account-credentials is enabled (kubeadm sets it to true; the upstream default is false), each controller runs with its own ServiceAccount token instead of the kube-controller-manager credential. This enables fine-grained RBAC: the node lifecycle controller has Node/Pod access, the deployment controller has Deployment/ReplicaSet/Pod access, etc. Disabling this collapses all permissions to the single system:kube-controller-manager credential, which is a security regression.
Leader Election
kube-controller-manager uses a Lease object in the kube-system namespace for leader election. Only the leader runs the controllers; standby instances loop on lease acquisition attempts.
# Check current leader
kubectl -n kube-system get lease kube-controller-manager -o yaml
# .spec.holderIdentity: "master-1_..."
# .spec.acquireTime: "2024-01-15T10:30:00Z"
# .spec.renewTime: "2024-01-15T11:45:30Z" # renewed every --leader-elect-retry-period (2s by default)
# .spec.leaseDurationSeconds: 15
# Leader election logs
kubectl -n kube-system logs kube-controller-manager-master-1 | grep -i "leader"
# I0115 10:30:00 leaderelection.go:258] successfully acquired lease kube-system/kube-controller-manager
# Lease parameters (via flags):
# --leader-elect-lease-duration=15s # time a non-leader waits to acquire
# --leader-elect-renew-deadline=10s # time leader has to renew before losing
# --leader-elect-retry-period=2s # retry period for acquiring/renewing
If the leader fails to renew its lease (e.g., network partition from apiserver) but continues running, two instances may briefly believe they are leader. Kubernetes tolerates this because controllers are idempotent and use optimistic concurrency — conflicting writes return 409 and are retried. The Lease mechanism minimizes the window but cannot guarantee exactly-once execution. Controllers must be designed with this in mind.
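The takeover rule can be sketched against the Lease fields shown earlier. This is a simplification with hypothetical function names: it compares against the renewTime stored in the Lease, whereas a real elector would typically measure expiry from when it locally observed the last change, to avoid depending on synchronized clocks.

```python
from datetime import datetime, timedelta, timezone


def lease_expired(lease, now):
    """True once renewTime + leaseDurationSeconds has passed."""
    deadline = lease["renewTime"] + timedelta(seconds=lease["leaseDurationSeconds"])
    return now > deadline


def may_acquire(lease, candidate, now):
    """A candidate may take the lease if it is unheld, already its own, or expired."""
    holder = lease.get("holderIdentity")
    if not holder or holder == candidate:
        return True
    return lease_expired(lease, now)


lease = {
    "holderIdentity": "master-1",
    "renewTime": datetime(2024, 1, 15, 11, 45, 30, tzinfo=timezone.utc),
    "leaseDurationSeconds": 15,
}
ten_seconds_later = lease["renewTime"] + timedelta(seconds=10)
twenty_seconds_later = lease["renewTime"] + timedelta(seconds=20)

print(may_acquire(lease, "master-2", ten_seconds_later))     # False
print(may_acquire(lease, "master-2", twenty_seconds_later))  # True
```

Nothing in this check stops the old leader from still running its loops at the moment the standby takes over, which is the overlap window the paragraph above describes.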
Metrics and Alerting
# Scrape metrics
curl -sk https://localhost:10257/metrics \
--cert /etc/kubernetes/pki/controller-manager-client.crt \
--key /etc/kubernetes/pki/controller-manager-client.key
# Key metrics
workqueue_depth{name="deployment"} # items in work queue
workqueue_adds_total{name="deployment"} # total items added
workqueue_queue_duration_seconds{name="deployment"} # time items wait in queue
workqueue_work_duration_seconds{name="deployment"} # time to process each item
workqueue_retries_total{name="deployment"} # retries due to errors
# controller_runtime_reconcile_* metrics come from the controller-runtime library
# (operators built with it); kube-controller-manager does not expose them.
# Use workqueue_retries_total above to track reconcile errors.
# Node lifecycle
node_collector_evictions_total{zone="us-east-1a"} # node evictions since this instance started
node_collector_unhealthy_nodes_in_zone{zone="us-east-1a"}
# GC
garbage_collector_attempt_to_delete_queue_consumers # GC worker count
garbage_collector_dirty_processing_latency_seconds
# HPA
horizontal_pod_autoscaler_metadata_generation # HPA reconcile count
Prometheus Alerting Rules
groups:
- name: kube-controller-manager
  rules:
  - alert: KubeControllerManagerDown
    expr: absent(up{job="kube-controller-manager"} == 1)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "kube-controller-manager is down"
      description: "No kube-controller-manager instance is active. Self-healing is suspended."
  - alert: KubeControllerManagerHighWorkQueueDepth
    expr: workqueue_depth{job="kube-controller-manager",name=~"deployment|replicaset"} > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Controller work queue depth is high"
      description: "{{ $labels.name }} controller queue depth is {{ $value }}."
  - alert: KubeControllerManagerReconcileErrors
    expr: |
      rate(workqueue_retries_total{job="kube-controller-manager"}[5m]) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High controller reconcile error rate"
      description: "Controller {{ $labels.name }} is retrying at {{ $value }}/s."
  - alert: KubeNodeEvictionHigh
    expr: rate(node_collector_evictions_total[5m]) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High node eviction rate"
      description: "{{ $value }} nodes/s evicted in zone {{ $labels.zone }}."
  - alert: KubeControllerManagerLeaderElectionLost
    # Standby instances always report 0, so fire only when no instance reports 1
    expr: |
      max(leader_election_master_status{job="kube-controller-manager"}) == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "No kube-controller-manager leader"
Troubleshooting
Pods not being created / Deployment stuck
# Check if kube-controller-manager is running
kubectl -n kube-system get pod -l component=kube-controller-manager
kubectl -n kube-system logs kube-controller-manager-master-1 | tail -50
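# Check work queue depth on a control-plane node
# (kubectl get --raw /metrics returns kube-apiserver metrics, not the controller manager's)
curl -sk https://localhost:10257/metrics \
--cert /etc/kubernetes/pki/controller-manager-client.crt \
--key /etc/kubernetes/pki/controller-manager-client.key | grep 'workqueue_depth{name="deployment"}'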
# Check Deployment conditions
kubectl describe deploy my-deploy | grep -A 5 "Conditions:"
# Check ReplicaSet events
kubectl get events --field-selector involvedObject.kind=ReplicaSet
# Common causes:
# - kube-controller-manager in CrashLoopBackOff (check logs)
# - Resource quota exhausted (ResourceQuota admission plugin in kube-apiserver rejecting pods)
# - Image pull secret missing (pod creation succeeds but pod stuck)
# - RBAC: controller SA missing permissions to create pods
Pods not evicted from failed node
# Check node conditions
kubectl describe node failed-node | grep -A 10 "Conditions:"
# Check node taints (NoExecute should be added)
kubectl get node failed-node -o jsonpath='{.spec.taints}'
# Check pod toleration seconds
kubectl get pod my-pod -o jsonpath='{.spec.tolerations}'
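# Check eviction metrics on the controller manager's own endpoint (control-plane node)
curl -sk https://localhost:10257/metrics \
--cert /etc/kubernetes/pki/controller-manager-client.crt \
--key /etc/kubernetes/pki/controller-manager-client.key | grep node_collector_evictions_total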
# Force-evict a pod (bypasses normal eviction flow)
kubectl delete pod my-pod --grace-period=0 --force
# If node-monitor-grace-period not met yet, wait or reduce it
# (changing requires restarting kube-controller-manager)
Objects stuck in Terminating state
# Find what finalizers are blocking deletion
kubectl get <resource> <name> -o jsonpath='{.metadata.finalizers}'
# Find all objects with finalizers
kubectl get all -A -o json | jq -r '
.items[] |
select(.metadata.finalizers != null and (.metadata.finalizers | length) > 0) |
"\(.kind) \(.metadata.namespace)/\(.metadata.name): \(.metadata.finalizers)"
'
# Emergency: manually remove a stuck finalizer (DANGEROUS — only if controller is dead)
kubectl patch <resource> <name> --type=json \
-p '[{"op":"remove","path":"/metadata/finalizers"}]'
# Check if the relevant controller is running (e.g., pvc-protection requires running kcm)
kubectl -n kube-system logs kube-controller-manager-master-1 | grep -i "pvc-protection"
GC not cleaning up orphaned pods/replicasets
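# Check GC work queue on the controller manager's own endpoint (control-plane node)
curl -sk https://localhost:10257/metrics \
--cert /etc/kubernetes/pki/controller-manager-client.crt \
--key /etc/kubernetes/pki/controller-manager-client.key | grep garbage_collector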
# Check if GC is processing
kubectl -n kube-system logs kube-controller-manager-master-1 | grep -i "garbage"
# Find orphaned ReplicaSets (no ownerReference or owner missing)
kubectl get rs -A -o json | jq -r '
.items[] |
select(.metadata.ownerReferences == null or (.metadata.ownerReferences | length) == 0) |
"\(.metadata.namespace)/\(.metadata.name)"
'
# Check GC concurrent syncs (increase if cluster has many objects)
# --concurrent-gc-syncs=20 (default); increase to 50+ for very large clusters
HPA not scaling / wrong replica count
# Check HPA status
kubectl describe hpa my-app
# Look for: "unable to get metrics", "failed to get CPU utilization"
# Conditions section shows why scaling is blocked
# Check metrics-server is running and healthy
kubectl top pods -n kube-system
kubectl -n kube-system get pods -l k8s-app=metrics-server
# Check if resource requests are set (required for CPU% metric)
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources.requests}'
# Check HPA events
kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler
# Check if min/max bounds are blocking
kubectl get hpa my-app -o jsonpath='{.spec.minReplicas},{.spec.maxReplicas},{.status.currentReplicas}'
Production Best Practices
Always run ≥2 replicas with leader election
With a single kube-controller-manager instance, a crash means no self-healing until it restarts. Run 2–3 replicas. Only the leader is active; standby instances take over within leaseDurationSeconds (15s default) of leader failure.
Tune eviction parameters for your SLA
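Pods tolerate the not-ready and unreachable NoExecute taints for 300s by default, so workloads wait 5+ minutes to reschedule on node failure. For latency-sensitive services, set tolerationSeconds on those tolerations to 60–90s in the pod spec. For stateful workloads that should prefer node recovery over rescheduling, keep the default or increase it. The older pod-eviction-timeout flag does not control this under taint-based eviction.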
Set resource requests on every container
The HPA controller and the scheduler both depend on accurate resource accounting. Without CPU/memory requests, HPA cannot compute utilization, and the scheduler cannot make good placement decisions.
Use ResourceQuota per namespace
Without ResourceQuota, a rogue deployment can consume all cluster resources. Set CPU/memory quotas per namespace. The ResourceQuota admission plugin in kube-apiserver enforces these at admission time, not at scheduling time; the ResourceQuota controller keeps the usage status current.
Monitor work queue depth
High workqueue_depth for the deployment or replicaset controller indicates the controller is falling behind. This manifests as slow rollouts. Increase --concurrent-deployment-syncs or investigate if kube-apiserver is throttling the controller.
Keep revisionHistoryLimit reasonable
Default revisionHistoryLimit: 10 keeps 10 old ReplicaSets. On clusters with hundreds of Deployments, this creates thousands of zombie ReplicaSets. Reduce to 3–5 for most workloads; set 0 only if you never need rollback.
Use PodDisruptionBudgets for all stateful apps
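PDBs limit voluntary disruptions that go through the eviction API, such as kubectl drain during node maintenance: the API server refuses an eviction that would violate minAvailable, using the status the Disruption controller maintains. Without a PDB, a drain can evict all pods of a StatefulSet at once. PDBs do not stop taint-based eviction from failed nodes, which deletes pods directly.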
Audit finalizers regularly
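Finalizers that are never removed cause objects to accumulate in Terminating state, consuming etcd space and slowing GC. Audit for stuck Terminating objects monthly. Field selectors do not support metadata.deletionTimestamp, so filter client-side: kubectl get all -A -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | "\(.kind) \(.metadata.namespace)/\(.metadata.name)"'.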
Increase concurrent syncs for large clusters
On clusters with 1000+ nodes or 10,000+ pods, increase --concurrent-gc-syncs=50, --concurrent-deployment-syncs=20, and tune --kube-api-qps=100 --kube-api-burst=200. Monitor workqueue depth to know when tuning is needed.
Network partition: trust secondary eviction rate
If you see eviction rates drop to near-zero and many nodes show Unknown, this is likely a zone-level network partition, not individual node failures. Do NOT manually force-evict all pods — the workloads may be running fine in the partitioned zone. Wait for the partition to resolve.