kube-controller-manager | K8s Internals

What kube-controller-manager Does

kube-controller-manager is a collection of independent control loops (controllers) compiled into one binary and launched as a single process. Each controller watches specific API objects and reconciles them toward their desired state. They share an in-process informer cache, so all 30+ controllers together make only a single set of watch streams to kube-apiserver.

Process identity

Binary: kube-controller-manager
Static pod: /etc/kubernetes/manifests/kube-controller-manager.yaml
Secure port: :10257 (HTTPS, metrics + healthz)
Kubeconfig: /etc/kubernetes/controller-manager.conf
Identity: authenticates as the user system:kube-controller-manager
Leader-elected: one active instance per cluster

Design principles

Level-triggered: reconcile based on current observed state, not event history
Idempotent: reconcile can be called any number of times safely
Optimistic: assume success, detect and fix divergence via watch
API-only: all state mutations go through kube-apiserver, never direct to etcd
Shared informers: one watch per resource type, all controllers share

The Reconciliation Loop Pattern

Every controller in kube-controller-manager follows the same fundamental observe → compare → act pattern. Understanding this pattern explains why Kubernetes converges to desired state even after partial failures.

▶ Level-Triggered vs Edge-Triggered

Controllers are level-triggered: they do not replay missed events. When a controller runs, it looks at the current state of the world (from its informer cache) and calculates what needs to change — regardless of what events caused it to be triggered. This means a controller that missed 10 rapid updates will still converge correctly after one reconcile, because it sees the final state, not a sequence of deltas.
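
The idea can be illustrated with a minimal sketch. This is not the actual controller code; the `Action` type and `reconcile` function are hypothetical names, and the "cache" is just a list of pod names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    verb: str   # "create" or "delete"
    count: int


def reconcile(desired_replicas: int, observed_pods: list[str]) -> list[Action]:
    """Compare desired and observed state and return the actions needed.

    Only the current snapshot is read; the sequence of events that
    produced it is never consulted.
    """
    diff = desired_replicas - len(observed_pods)
    if diff > 0:
        return [Action("create", diff)]
    if diff < 0:
        return [Action("delete", -diff)]
    return []  # already converged, so reconcile is a no-op


# Ten rapid spec updates (4, 5, ..., 13) need only one reconcile,
# because only the final value is visible when the controller runs.
spec_history = list(range(4, 14))
print(reconcile(spec_history[-1], ["pod-a", "pod-b", "pod-c"]))
```

Calling `reconcile` again after the actions are applied returns an empty list, which is what makes repeated invocations safe.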

Key invariants that make this work safely:

Reads come from the local informer cache, not the live API (informers seed the cache with a resourceVersion=0 list and then watch), which avoids a thundering herd on kube-apiserver
Writes go through kube-apiserver with resourceVersion for optimistic concurrency
Status is written separately via /status subresource — controller owns status, user owns spec
Work queue deduplication — rapid spec changes produce one reconcile, not many
Exponential backoff on errors — failed reconciles retry with increasing delays
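
Two of these invariants, queue deduplication and exponential backoff, can be sketched in a few lines. This is a simplified model, not client-go's workqueue: `DedupQueue` and `backoff_delay` are hypothetical names, and the 5 ms base and 1000 s cap are assumed values chosen for illustration.

```python
from __future__ import annotations

from collections import deque


class DedupQueue:
    """FIFO queue that holds each key at most once while it is pending."""

    def __init__(self) -> None:
        self._order = deque()
        self._pending = set()

    def add(self, key: str) -> None:
        if key not in self._pending:      # duplicate adds are dropped
            self._pending.add(key)
            self._order.append(key)

    def get(self) -> str:
        key = self._order.popleft()
        self._pending.discard(key)
        return key

    def __len__(self) -> int:
        return len(self._order)


def backoff_delay(failures: int, base: float = 0.005, cap: float = 1000.0) -> float:
    """Seconds to wait before retry number `failures` (1-based), doubling each time."""
    return min(cap, base * (2 ** (failures - 1)))


queue = DedupQueue()
for _ in range(5):
    queue.add("default/web")   # five rapid spec changes to the same object
print(len(queue))              # 1
```

Because the queue stores keys rather than events, the worker that dequeues `default/web` reads the latest object from the cache and reconciles once.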

Shared Informer Architecture

All controllers inside kube-controller-manager share a single SharedInformerFactory. This means that even though many controllers need to watch Pods, there is only one Pod watch stream to kube-apiserver, and one copy of the Pod cache in memory. Each controller registers its own event handlers against the shared informer.
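
A toy model of this sharing, assuming a hypothetical Python class that borrows the `SharedInformerFactory` name (the real one lives in client-go and is considerably richer):

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable

Handler = Callable[[str, dict], None]


class SharedInformerFactory:
    """Toy model: one watch and one cache per resource type, many handlers."""

    def __init__(self) -> None:
        self.watches_opened = defaultdict(int)
        self.cache = defaultdict(dict)
        self._handlers = defaultdict(list)

    def add_handler(self, resource: str, handler: Handler) -> None:
        if not self._handlers[resource]:
            self.watches_opened[resource] += 1   # first subscriber opens the watch
        self._handlers[resource].append(handler)

    def deliver(self, resource: str, key: str, obj: dict) -> None:
        """Simulate one watch event arriving from the API server."""
        self.cache[resource][key] = obj          # single shared copy
        for handler in self._handlers[resource]:
            handler(key, obj)


factory = SharedInformerFactory()
factory.add_handler("pods", lambda key, obj: print("replicaset saw", key))
factory.add_handler("pods", lambda key, obj: print("job saw", key))
factory.deliver("pods", "default/web-1", {"phase": "Running"})
print(factory.watches_opened["pods"])  # 1
```

Both handlers fire for the same event, yet only one watch was opened and one object copy is cached.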

Complete Controller Inventory

As of Kubernetes 1.30, kube-controller-manager runs the following controllers. Each runs in its own worker goroutines with its own work queue and informer event handlers.

Workload Controllers

Deployment Controller

Watches Deployments and their child ReplicaSets. On spec change, creates a new RS with updated pod template hash, scales it up, and scales down old RS — implementing the rolling update strategy. Computes maxSurge/maxUnavailable constraints.

ReplicaSet Controller

Maintains the exact number of Pod replicas specified by spec.replicas. Watches Pods via label selector. On pod count divergence: creates or deletes Pods. Sets ownerReference on created Pods pointing back to the RS.

StatefulSet Controller

Manages Pods with stable network identity and persistent storage. Creates Pods in order (0, 1, 2…), ensures each is Running+Ready before creating the next. Handles PVC creation per pod, rolling updates with partition support.

DaemonSet Controller

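Ensures one Pod runs on every eligible node (respecting nodeSelector, affinity, taints). Creates one Pod per node with a node affinity that pins it to that node; the default scheduler then binds it (since v1.12; earlier versions set spec.nodeName directly and bypassed the scheduler). Handles node addition/removal and rolling updates.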

Job Controller

Manages batch jobs to completion. Tracks succeeded/failed pod counts against completions/parallelism. Handles pod failures via backoffLimit. Supports Indexed completion mode for parallelism with job index injection.

CronJob Controller

Creates Job objects on a cron schedule. Enforces concurrencyPolicy (Allow/Forbid/Replace) and startingDeadlineSeconds. Retains finished Jobs up to successfulJobsHistoryLimit and failedJobsHistoryLimit.

ReplicationController

Legacy predecessor to ReplicaSet. Functionally identical but does not support set-based selectors. Still active for backward compatibility. New workloads should use ReplicaSets via Deployments.

Node Controllers

Node Lifecycle Controller

The most critical node-related controller. Monitors the NodeReady condition via kubelet heartbeats. After node-monitor-grace-period (default 40s) without a heartbeat, it marks the node's Ready condition Unknown and applies the node.kubernetes.io/unreachable taint with the NoExecute effect (node.kubernetes.io/not-ready when the kubelet itself reports NotReady). Pods are then evicted once their tolerationSeconds for that taint expires (300s by default). The older --pod-eviction-timeout flag (default 5m) was removed in v1.27.

Node IPAM Controller

Assigns pod CIDR blocks to nodes from the cluster's --cluster-cidr range. Writes the assigned CIDR to node.spec.podCIDR and node.spec.podCIDRs. Cloud-specific IPAM may delegate to cloud-controller-manager instead.

Taint Manager

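Part of the node lifecycle subsystem (since v1.29 it can run as the separate taint-eviction-controller). Watches node taints with NoExecute effect. For each pod on a tainted node, checks tolerations: if the taint is not tolerated (or tolerationSeconds has expired), the pod is deleted. Mass eviction storms are limited upstream, because the node lifecycle controller rate-limits how fast it applies NoExecute taints to nodes.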
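
The toleration check can be sketched as follows. This is a simplified model under stated assumptions: taints and tolerations are plain dicts using the API field names, the first matching toleration wins for each taint, and `tolerates` and `eviction_delay` are hypothetical helper names.

```python
from __future__ import annotations

from typing import Optional


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def eviction_delay(tolerations: list[dict], taints: list[dict]) -> Optional[float]:
    """Seconds until eviction: 0.0 means now, None means never."""
    delays = []
    for taint in taints:
        if taint["effect"] != "NoExecute":
            continue                      # only NoExecute taints evict running pods
        match = next((t for t in tolerations if tolerates(t, taint)), None)
        if match is None:
            return 0.0                    # untolerated NoExecute taint
        if "tolerationSeconds" in match:
            delays.append(float(match["tolerationSeconds"]))
    return min(delays) if delays else None


unreachable = {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute"}
default_toleration = {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
                      "effect": "NoExecute", "tolerationSeconds": 300}
print(eviction_delay([default_toleration], [unreachable]))  # 300.0
```

A pod with no matching toleration gets 0.0 (evict immediately), while a toleration without tolerationSeconds keeps the pod bound indefinitely.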

Storage Controllers

PersistentVolume Controller

Binds PVCs to PVs via the claim binder loop. For static provisioning: finds a matching PV and binds. For dynamic provisioning with Immediate binding mode: invokes the StorageClass provisioner. Manages the PV lifecycle: Available → Bound → Released → (Retain/Delete/Recycle).

AttachDetach Controller

Manages volume attach/detach operations independently from kubelet. Watches pods, determines which volumes need to be attached to which nodes, calls the volume plugin's Attach()/Detach(). Tracks VolumeAttachment objects for CSI volumes.

PVC Protection Controller

Manages the kubernetes.io/pvc-protection finalizer, which is set on every PVC at creation. Prevents accidental deletion of in-use PVCs: a deleted PVC goes to Terminating, and the controller removes the finalizer only after no pod is using the claim.

PV Protection Controller

Manages the kubernetes.io/pv-protection finalizer, which is set on every PV at creation. A deleted PV stays in Terminating, and the controller removes the finalizer only once the PV is no longer bound to a PVC.

Network Controllers

Endpoint Controller (legacy)

Watches Services and Pods. For each Service, maintains the Endpoints object with the IP:port of all Ready pods matching the Service selector. Replaced by EndpointSlice controller for scalability but still runs for backward compatibility.

EndpointSlice Controller

Manages EndpointSlice objects (max 100 endpoints per slice). Shards a Service's backends across multiple slices for scalability. Incremental updates — only changed slices are updated, reducing kube-apiserver and etcd write amplification on large Services.
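
The write-amplification argument can be made concrete with a rough sketch. It assumes naive sequential packing of endpoints into slices (the real controller's packing strategy is more involved), and `shard_endpoints` and `changed_slices` are hypothetical helpers.

```python
from __future__ import annotations


def shard_endpoints(endpoints: list[str], max_per_slice: int = 100) -> list[list[str]]:
    """Split a Service's endpoints into slices of at most max_per_slice entries."""
    return [endpoints[i:i + max_per_slice]
            for i in range(0, len(endpoints), max_per_slice)]


def changed_slices(old: list[list[str]], new: list[list[str]]) -> list[int]:
    """Indices of slices that differ and would therefore need an API write."""
    return [i for i in range(max(len(old), len(new)))
            if i >= len(old) or i >= len(new) or old[i] != new[i]]


endpoints = [f"10.0.{i // 256}.{i % 256}" for i in range(250)]
before = shard_endpoints(endpoints)
after = shard_endpoints(endpoints + ["10.1.0.1"])   # one new backend pod
print([len(s) for s in before])      # [100, 100, 50]
print(changed_slices(before, after)) # [2]
```

Adding one backend rewrites a single 51-entry slice; with the legacy Endpoints object the whole 251-entry object would be rewritten and re-sent to every watcher.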

EndpointSliceMirroring Controller

Mirrors manually-managed Endpoints objects to EndpointSlice objects for backward compatibility. Allows tools that still write to Endpoints (not EndpointSlice) to work with kube-proxy's EndpointSlice-based routing.

Service Controller

Watches Services of type LoadBalancer and asks the cloud provider to create/update external load balancers. In modern setups this controller runs in cloud-controller-manager; it lived in kube-controller-manager only with the legacy in-tree cloud providers. NodePort allocation is not done here: kube-apiserver assigns NodePorts when the Service is created.

Namespace, RBAC, and Security Controllers

Namespace Controller

When a Namespace enters Terminating phase (after DELETE call), this controller deletes all resources within the namespace in the correct order, then removes the kubernetes finalizer to allow the Namespace object itself to be deleted.

ServiceAccount Controller

Creates a default ServiceAccount named default in every namespace and re-creates it if it is deleted, so pods always have a ServiceAccount to reference. Before 1.24, a legacy token Secret was also auto-generated for each ServiceAccount; pods now get short-lived tokens through the TokenRequest API instead.

Token Controller

Watches ServiceAccounts and Secrets of type kubernetes.io/service-account-token. Populates the token data for Secrets whose annotations name a ServiceAccount, and deletes token Secrets for ServiceAccounts that no longer exist. In 1.24+, the bound service account token volume projection largely replaces this flow.

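RBAC Bootstrapping (kube-apiserver)

Ensures built-in ClusterRoles and ClusterRoleBindings (like cluster-admin, system:node) exist and are up-to-date, re-creating them if accidentally deleted. Listed here for completeness: this reconciliation runs inside kube-apiserver at startup (the rbac/bootstrap-roles post-start hook), not as a kube-controller-manager control loop.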

Certificate Signing Controllers

Processes CertificateSigningRequest (CSR) objects. Three sub-controllers: csrsigning (signs approved CSRs using the cluster CA), csrapproving (auto-approves node/kubelet CSRs), csrcleaner (deletes old CSRs after TTL).

Bootstrap Token Controller

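Manages tokens used for node bootstrapping (e.g., kubeadm join). Two sub-controllers, both disabled by default: tokencleaner deletes expired bootstrap tokens, and bootstrapsigner signs the cluster-info ConfigMap in kube-public with each valid token. The RBAC bindings for bootstrapping nodes are created by the installer (kubeadm), not by this controller.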

Garbage Collection and Lifecycle Controllers

Garbage Collector

The most complex controller. Traverses the ownerReferences graph to delete orphaned objects. Maintains a DAG of all API objects in memory. When an owner is deleted, its dependents are queued for deletion. Handles foreground/background deletion cascades. See 13-garbage-collection.html.

TTL Controller

Sets the node.alpha.kubernetes.io/ttl annotation on Nodes, scaled to cluster size, as a hint for how long kubelets may cache Secrets and ConfigMaps. Despite the name, it is unrelated to Job cleanup; that is handled by the TTL-After-Finished controller below.

TTL-After-Finished Controller

Specifically handles the .spec.ttlSecondsAfterFinished field on Jobs. Watches for completed Jobs and sets up a timer to delete them. Prevents accumulation of finished Jobs cluttering the cluster.

Resource Quota Controller

Maintains ResourceQuota usage. Watches all quota-tracked resource types, recalculates usage counts, and replenishes quota when resources are deleted. Enforcement itself happens in the apiserver admission pipeline; this controller keeps the quota usage status current.

HorizontalPodAutoscaler (HPA) Controller

Polls metrics (CPU, memory, custom metrics via metrics-server or custom metrics API) every --horizontal-pod-autoscaler-sync-period (default 15s). Calculates desired replicas, applies scale-up/scale-down stabilization windows, and PATCHes the target's /scale subresource.

Disruption Controller

Maintains PodDisruptionBudget status. Tracks how many pods are currently healthy versus how many the budget requires, and updates status.disruptionsAllowed. The Eviction API (used by kubectl drain) checks this status before evicting pods; scheduler preemption tries to respect PDBs but only on a best-effort basis.
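
The status calculation reduces to simple arithmetic. The sketch below assumes an integer minAvailable only (percentages and maxUnavailable are left out), and the function names are hypothetical.

```python
def disruptions_allowed(current_healthy: int, min_available: int) -> int:
    """Voluntary disruptions the budget can still absorb (never negative)."""
    return max(0, current_healthy - min_available)


def eviction_permitted(current_healthy: int, min_available: int) -> bool:
    """Decision an Eviction API request would get from this budget."""
    return disruptions_allowed(current_healthy, min_available) > 0


print(disruptions_allowed(3, 2))   # 1
print(eviction_permitted(2, 2))    # False
```

With three healthy pods and minAvailable=2, one eviction succeeds and the next is rejected until a replacement pod becomes healthy.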

Deep Dive: Node Lifecycle Controller

The Node Lifecycle Controller is arguably the most operationally important controller — it determines when failed nodes trigger pod eviction and how quickly your cluster recovers from node failures.

Parameter	Flag	Default	Description
Grace period	`--node-monitor-grace-period`	40s	Time after last heartbeat before node is marked Unknown
Eviction delay	`tolerationSeconds` (per pod)	300s	How long a pod tolerates the not-ready/unreachable NoExecute taints before eviction; replaces `--pod-eviction-timeout`, which was removed in v1.27
Eviction rate	`--node-eviction-rate`	0.1/s	Max nodes per second whose pods are evicted in a healthy zone
Secondary rate	`--secondary-node-eviction-rate`	0.01/s	Eviction rate when at least 55% of a zone's nodes are unhealthy (network partition)
Unhealthy threshold	`--unhealthy-zone-threshold`	55%	Fraction of unhealthy nodes to trigger secondary rate
Monitor period	`--node-monitor-period`	5s	How often controller checks node status

⚠ Network Partition Avoidance

If at least 55% of the nodes in a zone become unhealthy at once, the controller treats the zone as partially disrupted and drops to the secondary eviction rate (0.01/s by default; in small clusters, evictions in that zone stop). If every zone is unhealthy, evictions stop altogether. This is intentional: the pattern points to a network partition in which the control plane lost contact with the nodes, and mass-evicting pods would be counterproductive because the nodes (and their pods) are likely still running fine. The controller waits for the zone to recover rather than evicting healthy workloads.
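
The rate selection can be sketched as below. This is a deliberately reduced model: it treats a zone as unhealthy once the unhealthy fraction reaches the threshold and ignores the small-cluster and all-zones-down cases. `eviction_rate` is a hypothetical function; its defaults mirror the flag defaults in the table above.

```python
def eviction_rate(total_nodes: int, unhealthy_nodes: int,
                  primary: float = 0.1, secondary: float = 0.01,
                  threshold: float = 0.55) -> float:
    """Pick the per-zone eviction rate (nodes per second) from zone health."""
    if total_nodes == 0:
        return primary
    unhealthy_fraction = unhealthy_nodes / total_nodes
    if unhealthy_fraction >= threshold:
        return secondary      # likely a partition: slow evictions right down
    return primary            # isolated failures: evict at the normal rate


print(eviction_rate(total_nodes=10, unhealthy_nodes=2))   # 0.1
print(eviction_rate(total_nodes=10, unhealthy_nodes=6))   # 0.01
```

At 0.01 nodes per second, clearing a 100-node zone would take close to three hours, which gives a transient partition ample time to heal.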

Deep Dive: Garbage Collector

The Garbage Collector (GC) is the most architecturally complex controller. It maintains an in-memory owner-dependency graph (a DAG) of all API objects and propagates deletions through the graph.

GraphBuilder

Watches ALL resource types via a "meta-only" informer (fetches only metadata, not full objects — much cheaper). Builds a DAG: edges are ownerReferences. When an object is created/deleted, the graph is updated and affected nodes are enqueued for processing.

GarbageCollector

Processes the dirty-queue from GraphBuilder. For each object: if owner is missing/deleted and object lacks a foregroundDeletion finalizer → delete the object. For foreground deletion: remove dependents first, then remove the owner's foregroundDeletion finalizer.
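
A compact model of background cascading deletion, assuming objects are plain dicts keyed by UID and that an object becomes garbage once it has owners but none of them still exists. The `collect` function and the UIDs are hypothetical; finalizers and foreground deletion are not modeled.

```python
from __future__ import annotations


def collect(objects: dict[str, dict]) -> list[str]:
    """Return names of objects deleted by cascading garbage collection.

    `objects` maps uid -> {"name": str, "owners": [owner uid, ...]}.
    """
    live = dict(objects)
    deleted = []
    changed = True
    while changed:                       # repeat until the graph is stable
        changed = False
        for uid, obj in list(live.items()):
            owners = obj["owners"]
            if owners and not any(owner in live for owner in owners):
                del live[uid]            # every owner is gone: collect it
                deleted.append(obj["name"])
                changed = True
    return deleted


# The Deployment (uid "d1") has already been deleted, so it is absent here.
cluster = {
    "rs1": {"name": "my-deploy-6799fc88d8", "owners": ["d1"]},
    "p1": {"name": "my-pod-abc12", "owners": ["rs1"]},
    "p2": {"name": "standalone-pod", "owners": []},
}
print(collect(cluster))  # ['my-deploy-6799fc88d8', 'my-pod-abc12']
```

Objects without any ownerReferences, such as `standalone-pod`, are never candidates for collection.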

# ownerReference chain: Deployment → ReplicaSet → Pod
kubectl get pod my-pod-abc12 -o jsonpath='{.metadata.ownerReferences}'
# [{apiVersion:apps/v1, kind:ReplicaSet, name:my-deploy-6799fc88d8, uid:..., controller:true}]

kubectl get rs my-deploy-6799fc88d8 -o jsonpath='{.metadata.ownerReferences}'
# [{apiVersion:apps/v1, kind:Deployment, name:my-deploy, uid:..., controller:true}]

# Delete deployment with background GC (default)
kubectl delete deployment my-deploy
# Deployment deleted immediately; RS and Pods deleted asynchronously by GC

# Delete with foreground GC (wait for dependents)
kubectl delete deployment my-deploy --cascade=foreground

# Orphan — delete deployment but keep ReplicaSets and Pods
kubectl delete deployment my-deploy --cascade=orphan

# Check if GC is processing (look for rapid object deletions)
kubectl get events --field-selector reason=SuccessfulDelete | head -20

GC Race Condition: Dangling ownerReferences

If a pod is created with an ownerReference pointing to a UID that no longer exists (e.g., due to a rapid delete+recreate of the owner), the GC will delete the pod as an orphan. Matching is by UID, not by name, so a recreated owner with the same name does not count. Controllers should always verify the owner's UID is current before creating dependents.

# List pods with their first owner (a starting point: to find dangling references, check that each owner UID still exists)
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.metadata.ownerReferences != null) |
  "\(.metadata.namespace)/\(.metadata.name): owner=\(.metadata.ownerReferences[0].name)"
' | head -20

Deep Dive: Deployment Controller

The Deployment controller orchestrates rolling updates by managing ReplicaSets. Understanding its algorithm is essential for debugging slow rollouts and unexpected pod terminations.

# Rolling update algorithm
# Given: maxSurge=1, maxUnavailable=1, replicas=3
#
# Initial state: RS-v1 has 3 pods (all Ready)
# After spec.template change:
#
# Step 1: Create RS-v2 with 0 replicas
# Step 2: Scale RS-v2 to 1 (maxSurge=1 allows 3+1=4 pods in total)
# Step 3: Scale RS-v1 to 2 without waiting (maxUnavailable=1 allows 3-1=2 available)
# Step 4: Scale RS-v2 to 2 (total back at 4), then wait for the new pods to become Ready
# Step 5: As each RS-v2 pod becomes available, scale RS-v1 down and RS-v2 up within the same bounds
# Step 6: Finish at RS-v2=3, RS-v1=0
# RS-v1 is kept at 0 replicas (not deleted) for rollback

kubectl rollout status deployment/my-deploy
kubectl rollout history deployment/my-deploy
kubectl rollout undo deployment/my-deploy          # roll back to previous RS
kubectl rollout undo deployment/my-deploy --to-revision=2  # roll back to specific revision

Field	Default	Effect
`spec.strategy.rollingUpdate.maxSurge`	25%	Max pods above desired count during rollout. Higher = faster rollout, more resource burst
`spec.strategy.rollingUpdate.maxUnavailable`	25%	Max unavailable pods during rollout. 0 = zero-downtime. Lower = slower rollout
`spec.progressDeadlineSeconds`	600	Seconds without progress before the Progressing condition becomes False (reason ProgressDeadlineExceeded)
`spec.minReadySeconds`	0	Seconds a new pod must stay Ready before it counts as Available
`spec.revisionHistoryLimit`	10	Number of old ReplicaSets to keep for rollback
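
How percentage values turn into concrete pod counts is easy to get wrong, so here is a small sketch. It assumes maxSurge rounds up and maxUnavailable rounds down, and that a rollout where both resolve to zero falls back to one unavailable pod; `resolve` and `rollout_bounds` are hypothetical helpers.

```python
from __future__ import annotations


def resolve(value: int | str, replicas: int, round_up: bool) -> int:
    """Turn an int or a percentage string such as '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas           # integer math, no float error
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int, max_surge: int | str = "25%",
                   max_unavailable: int | str = "25%") -> tuple[int, int]:
    """Return (max total pods, min available pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        unavailable = 1                               # otherwise the rollout cannot move
    return replicas + surge, replicas - unavailable


print(rollout_bounds(3))          # (4, 3)
print(rollout_bounds(10))         # (13, 8)
print(rollout_bounds(3, 1, 1))    # (4, 2)
```

With the defaults and three replicas, 25% resolves to a surge of 1 and an unavailability of 0, so the rollout only ever adds a pod before removing one.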

Deployment conditions and status fields

# Key status fields
kubectl get deploy my-deploy -o yaml | grep -A 30 "status:"
# status:
#   availableReplicas: 3     # Ready pods with minReadySeconds met
#   readyReplicas: 3          # pods passing readinessProbe
#   replicas: 3               # total pods (including not ready)
#   updatedReplicas: 3        # pods with current template
#   observedGeneration: 5     # last spec generation processed
#   conditions:
#   - type: Available          # availableReplicas >= replicas - maxUnavailable
#     status: "True"
#   - type: Progressing        # rollout is making progress
#     status: "True"
#   - type: ReplicaFailure     # pod creation failing
#     status: "False"

# Detect stalled rollout
kubectl get deploy my-deploy -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}'

Deep Dive: HPA Controller

The Horizontal Pod Autoscaler controller runs its own sync loop independently from the event-driven controllers. It polls metrics at a fixed interval and calculates desired replicas using a ratio formula.

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

# Example: target CPU = 50%, current CPU = 80%, current replicas = 3
# desiredReplicas = ceil(3 * (80 / 50)) = ceil(4.8) = 5

# Scale-up: immediate (no stabilization by default)
# Scale-down: 5-minute stabilization window (--horizontal-pod-autoscaler-downscale-stabilization)
# This prevents flapping on temporary load spikes
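
The formula above translates directly into code. The sketch below adds two things that are assumptions rather than part of the formula as quoted: a tolerance band of 0.1 around a ratio of 1.0 inside which no scaling happens, and clamping to min/max replicas. `desired_replicas` is a hypothetical function name.

```python
from __future__ import annotations

import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     tolerance: float = 0.1, min_replicas: int = 1,
                     max_replicas: int | None = None) -> int:
    """Ratio-based replica calculation with a tolerance band and clamping."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas            # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    if max_replicas is not None:
        desired = min(desired, max_replicas)
    return max(desired, min_replicas)


print(desired_replicas(3, 80, 50))                    # 5
print(desired_replicas(4, 52, 50))                    # 4 (ratio 1.04 is within tolerance)
print(desired_replicas(3, 500, 50, max_replicas=20))  # 20
```

The tolerance band is what keeps a workload hovering just above its target from gaining a replica on every sync.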

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 500Mi
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5-minute cooldown on scale-down
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60              # max 10% pods removed per minute
    scaleUp:
      stabilizationWindowSeconds: 0    # immediate scale-up
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60              # max 4 pods added per minute

▶ HPA and SSA Field Ownership Conflict

If you manage spec.replicas via Server-Side Apply in your manifests, the HPA will conflict with your manager — it sets spec.replicas via its own field manager. Best practice: omit spec.replicas from your Deployment manifest when HPA manages it (or set it to the initial value and let HPA take over). See 04-kubernetes-api-model.html § SSA.

Configuration and Key Flags

# kube-controller-manager static pod (kubeadm)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --cluster-cidr=10.244.0.0/16
    - --service-cluster-ip-range=10.96.0.0/12
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner  # * = all default, plus extras
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --use-service-account-credentials=true
    - --node-monitor-grace-period=40s
    # (--pod-eviction-timeout was removed in v1.27; the eviction delay now comes from pod tolerationSeconds)
    - --node-eviction-rate=0.1
    - --concurrent-deployment-syncs=5
    - --concurrent-replicaset-syncs=5
    - --concurrent-gc-syncs=20

Flag	Default	Purpose
`--controllers`	`*`	Comma-separated list of controllers to enable. `*`=all. Prefix with `-` to disable (e.g., `-ttl`)
`--concurrent-deployment-syncs`	5	Number of Deployment objects synced in parallel
`--concurrent-replicaset-syncs`	5	Number of ReplicaSet objects synced in parallel
`--concurrent-statefulset-syncs`	5	Number of StatefulSet objects synced in parallel
`--concurrent-daemonset-syncs`	2	Number of DaemonSet objects synced in parallel
`--concurrent-gc-syncs`	20	GC goroutines for garbage collecting orphaned objects
`--concurrent-endpoint-syncs`	5	Endpoint reconciliation goroutines
`--horizontal-pod-autoscaler-sync-period`	15s	HPA metrics poll interval
`--horizontal-pod-autoscaler-downscale-stabilization`	5m0s	Scale-down stabilization window
`--use-service-account-credentials`	false	Each controller uses its own SA token (fine-grained RBAC); kubeadm sets it to true
`--leader-elect`	true	Enable leader election for HA
`--kube-api-qps`	20	API server QPS limit
`--kube-api-burst`	30	API server burst above QPS

▶ --use-service-account-credentials

When enabled (the flag defaults to false, but kubeadm turns it on), each controller runs with its own ServiceAccount token instead of the kube-controller-manager credential. This enables fine-grained RBAC: the node lifecycle controller has Node/Pod access, the deployment controller has Deployment/ReplicaSet/Pod access, etc. Disabling this collapses all permissions to the single system:kube-controller-manager credential, which is a security regression.

Leader Election

kube-controller-manager uses a Lease object in kube-system namespace for leader election. Only the leader runs the controllers; standby instances loop on lease renewal attempts.

# Check current leader
kubectl -n kube-system get lease kube-controller-manager -o yaml
# .spec.holderIdentity: "master-1_..."
# .spec.acquireTime: "2024-01-15T10:30:00Z"
# .spec.renewTime: "2024-01-15T11:45:30Z"   # refreshed every --leader-elect-retry-period (2s)
# .spec.leaseDurationSeconds: 15

# Leader election logs
kubectl -n kube-system logs kube-controller-manager-master-1 | grep -i "leader"
# I0115 10:30:00 leaderelection.go:258] successfully acquired lease kube-system/kube-controller-manager

# Lease parameters (via flags):
# --leader-elect-lease-duration=15s    # time a non-leader waits to acquire
# --leader-elect-renew-deadline=10s    # time leader has to renew before losing
# --leader-elect-retry-period=2s       # retry period for acquiring/renewing

✖ Split-Brain During Leader Election

If the leader fails to renew its lease (e.g., network partition from apiserver) but continues running, two instances may briefly believe they are leader. Kubernetes tolerates this because controllers are idempotent and use optimistic concurrency — conflicting writes return 409 and are retried. The Lease mechanism minimizes the window but cannot guarantee exactly-once execution. Controllers must be designed with this in mind.
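
The acquire-or-renew decision behind the Lease can be sketched as below. This is a simplified model: it compares against the renew time stored in the lease, whereas a real implementation also has to cope with clock skew between instances. `Lease` and `try_acquire_or_renew` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str = ""
    renew_time: float = 0.0
    duration: float = 15.0     # seconds, matching leaseDurationSeconds


def try_acquire_or_renew(lease: Lease, identity: str, now: float) -> bool:
    """Take or refresh the lease; fail if another holder's lease is still valid."""
    expired = now >= lease.renew_time + lease.duration
    if lease.holder and lease.holder != identity and not expired:
        return False
    lease.holder = identity    # in a real cluster this write is a guarded update
    lease.renew_time = now
    return True


lease = Lease()
print(try_acquire_or_renew(lease, "master-1", now=0.0))    # True: lease was free
print(try_acquire_or_renew(lease, "master-2", now=10.0))   # False: still held
print(try_acquire_or_renew(lease, "master-2", now=15.0))   # True: lease expired
```

The split-brain window is the time between the old leader's last successful renewal and the moment it notices it can no longer renew; during that window a standby may already have taken the expired lease.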

Metrics and Alerting

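# Scrape metrics (run on the control-plane node; the certificate paths are illustrative, any client identity authorized to GET /metrics works)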
curl -sk https://localhost:10257/metrics \
  --cert /etc/kubernetes/pki/controller-manager-client.crt \
  --key /etc/kubernetes/pki/controller-manager-client.key

# Key metrics
workqueue_depth{name="deployment"}                    # items in work queue
workqueue_adds_total{name="deployment"}               # total items added
workqueue_queue_duration_seconds{name="deployment"}   # time items wait in queue
workqueue_work_duration_seconds{name="deployment"}    # time to process each item
workqueue_retries_total{name="deployment"}            # retries due to errors

# Note: controller_runtime_reconcile_* metrics are exported by operators built with controller-runtime,
# not by kube-controller-manager. For the built-in controllers, use the workqueue_* metrics above.

# Node lifecycle
node_collector_evictions_total{zone="us-east-1a"}     # node evictions (nodes whose pods were evicted)
node_collector_unhealthy_nodes_in_zone{zone="us-east-1a"}

# GC
garbage_collector_attempt_to_delete_queue_consumers   # GC worker count
garbage_collector_dirty_processing_latency_seconds

# HPA
horizontal_pod_autoscaler_controller_reconciliations_total   # HPA reconcile count

Prometheus Alerting Rules

groups:
- name: kube-controller-manager
  rules:
  - alert: KubeControllerManagerDown
    expr: absent(up{job="kube-controller-manager"} == 1)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "kube-controller-manager is down"
      description: "No kube-controller-manager instance is active. Self-healing is suspended."

  - alert: KubeControllerManagerHighWorkQueueDepth
    expr: workqueue_depth{job="kube-controller-manager",name=~"deployment|replicaset"} > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Controller work queue depth is high"
      description: "{{ $labels.name }} controller queue depth is {{ $value }}."

  - alert: KubeControllerManagerReconcileErrors
    expr: |
      rate(workqueue_retries_total{job="kube-controller-manager"}[5m]) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High controller reconcile error rate"
      description: "Controller {{ $labels.name }} is retrying at {{ $value }}/s."

  - alert: KubeNodeEvictionHigh
    expr: rate(node_collector_evictions_total[5m]) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High node eviction rate"
      description: "{{ $value }} node evictions/s in zone {{ $labels.zone }}."

  - alert: KubeControllerManagerLeaderElectionLost
    expr: |
      max(leader_election_master_status{job="kube-controller-manager"}) == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "No kube-controller-manager leader"

Troubleshooting

Pods not being created / Deployment stuck

# Check if kube-controller-manager is running
kubectl -n kube-system get pod -l component=kube-controller-manager
kubectl -n kube-system logs kube-controller-manager-master-1 | tail -50

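# Check work queue depth (kubectl get --raw /metrics would return kube-apiserver's metrics,
# so scrape kube-controller-manager itself on the control-plane node; $TOKEN is a bearer
# token authorized to GET /metrics)
curl -sk -H "Authorization: Bearer $TOKEN" https://127.0.0.1:10257/metrics | grep 'workqueue_depth{name="deployment"}'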

# Check Deployment conditions
kubectl describe deploy my-deploy | grep -A 5 "Conditions:"

# Check ReplicaSet events
kubectl get events --field-selector involvedObject.kind=ReplicaSet

# Common causes:
# - kube-controller-manager in CrashLoopBackOff (check logs)
# - Resource quota exhausted (ResourceQuota admission rejects pod creation; see ReplicaSet events)
# - Image pull secret missing (pod creation succeeds but pod stuck)
# - RBAC: controller SA missing permissions to create pods

Pods not evicted from failed node

# Check node conditions
kubectl describe node failed-node | grep -A 10 "Conditions:"

# Check node taints (NoExecute should be added)
kubectl get node failed-node -o jsonpath='{.spec.taints}'

# Check pod toleration seconds
kubectl get pod my-pod -o jsonpath='{.spec.tolerations}'

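# Check eviction rate limit metrics (kubectl get --raw /metrics would return kube-apiserver's
# metrics, so scrape kube-controller-manager on the control-plane node; $TOKEN is a bearer
# token authorized to GET /metrics)
curl -sk -H "Authorization: Bearer $TOKEN" https://127.0.0.1:10257/metrics | grep node_collector_evictions_total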

# Force-evict a pod (bypasses normal eviction flow)
kubectl delete pod my-pod --grace-period=0 --force

# If node-monitor-grace-period not met yet, wait or reduce it
# (changing requires restarting kube-controller-manager)

Objects stuck in Terminating state

# Find what finalizers are blocking deletion
kubectl get <resource> <name> -o jsonpath='{.metadata.finalizers}'

# Find objects with finalizers (note: 'kubectl get all' covers only common workload types)
kubectl get all -A -o json | jq -r '
  .items[] |
  select(.metadata.finalizers != null and (.metadata.finalizers | length) > 0) |
  "\(.kind) \(.metadata.namespace)/\(.metadata.name): \(.metadata.finalizers)"
'

# Emergency: manually remove a stuck finalizer (DANGEROUS — only if controller is dead)
kubectl patch <resource> <name> --type=json \
  -p '[{"op":"remove","path":"/metadata/finalizers"}]'

# Check if the relevant controller is running (e.g., pvc-protection requires running kcm)
kubectl -n kube-system logs kube-controller-manager-master-1 | grep -i "pvc-protection"

GC not cleaning up orphaned pods/replicasets

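# Check GC work queue (kubectl get --raw /metrics would return kube-apiserver's metrics,
# so scrape kube-controller-manager on the control-plane node; $TOKEN is a bearer
# token authorized to GET /metrics)
curl -sk -H "Authorization: Bearer $TOKEN" https://127.0.0.1:10257/metrics | grep garbage_collector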

# Check if GC is processing
kubectl -n kube-system logs kube-controller-manager-master-1 | grep -i "garbage"

# Find ReplicaSets with no ownerReference (standalone, or orphaned via --cascade=orphan)
kubectl get rs -A -o json | jq -r '
  .items[] |
  select(.metadata.ownerReferences == null or (.metadata.ownerReferences | length) == 0) |
  "\(.metadata.namespace)/\(.metadata.name)"
'

# Check GC concurrent syncs (increase if cluster has many objects)
# --concurrent-gc-syncs=20 (default); increase to 50+ for very large clusters

HPA not scaling / wrong replica count

# Check HPA status
kubectl describe hpa my-app
# Look for: "unable to get metrics", "failed to get CPU utilization"
# Conditions section shows why scaling is blocked

# Check metrics-server is running and healthy
kubectl top pods -n kube-system
kubectl -n kube-system get pods -l k8s-app=metrics-server

# Check if resource requests are set (required for CPU% metric)
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources.requests}'

# Check HPA events
kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler

# Check if min/max bounds are blocking
kubectl get hpa my-app -o jsonpath='{.spec.minReplicas},{.spec.maxReplicas},{.status.currentReplicas}'

Production Best Practices

Always run ≥2 replicas with leader election

With a single kube-controller-manager instance, a crash means no self-healing until it restarts. Run 2–3 replicas. Only the leader is active; standby instances take over within leaseDurationSeconds (15s default) of leader failure.

Tune eviction parameters for your SLA

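Pods on a failed node are evicted once their tolerationSeconds for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints expires. The default is 300s, so workloads wait 5+ minutes to reschedule on node failure. For latency-sensitive services, set tolerationSeconds to 60–90s on those pods. For stateful workloads that should prefer node recovery over rescheduling, keep the default or increase it. (The older --pod-eviction-timeout flag was removed in v1.27.)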

Set resource requests on every container

The HPA controller and the scheduler both depend on accurate resource accounting. Without CPU/memory requests, HPA cannot compute utilization, and the scheduler cannot make good placement decisions.

Use ResourceQuota per namespace

Without ResourceQuota, a rogue deployment can consume all cluster resources. Set CPU/memory quotas per namespace. Quotas are enforced at admission time by the ResourceQuota admission plugin in kube-apiserver, not at scheduling time; the ResourceQuota controller keeps the usage counts current.

Monitor work queue depth

High workqueue_depth for the deployment or replicaset controller indicates the controller is falling behind. This manifests as slow rollouts. Increase --concurrent-deployment-syncs or investigate if kube-apiserver is throttling the controller.

Keep revisionHistoryLimit reasonable

Default revisionHistoryLimit: 10 keeps 10 old ReplicaSets. On clusters with hundreds of Deployments, this creates thousands of zombie ReplicaSets. Reduce to 3–5 for most workloads; set 0 only if you never need rollback.

Use PodDisruptionBudgets for all stateful apps

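Without a PDB, a node drain or cluster upgrade can evict all pods of a StatefulSet at the same time. A PDB makes the Eviction API reject any request that would violate minAvailable. PDBs cover voluntary disruptions only; they do not limit taint-based evictions after a node or zone failure, so also spread replicas of stateful apps across nodes and zones.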

Audit finalizers regularly

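Finalizers that are never removed cause objects to accumulate in Terminating state, consuming etcd space and slowing GC. Audit for stuck Terminating objects monthly, for example: kubectl get all -A -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | "\(.kind) \(.metadata.namespace)/\(.metadata.name)"' (field selectors do not support metadata.deletionTimestamp, so the filter has to run client-side).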

Increase concurrent syncs for large clusters

On clusters with 1000+ nodes or 10,000+ pods, increase --concurrent-gc-syncs=50, --concurrent-deployment-syncs=20, and tune --kube-api-qps=100 --kube-api-burst=200. Monitor workqueue depth to know when tuning is needed.

Network partition: trust secondary eviction rate

If you see eviction rates drop to near-zero and many nodes show Unknown, this is likely a zone-level network partition, not individual node failures. Do NOT manually force-evict all pods — the workloads may be running fine in the partitioned zone. Wait for the partition to resolve.