File: 01-control-plane/00-control-plane-overview.html
Prerequisites: 00-foundations/03-cluster-architecture-overview.html · 00-foundations/04-kubernetes-api-model.html
Related: 01-kube-apiserver.html · 02-etcd.html · 03-kube-scheduler.html · 04-kube-controller-manager.html

Control Plane — Complete Internal Architecture

Deep internals of the five core control-plane components, their gRPC/REST APIs, reconciliation loops, leader election, HA topology, failure modes, and production operations.

What Is the Control Plane?

The control plane is the set of processes that implement the desired-state management loop of the cluster. It never runs your workloads; it only watches, decides, and instructs. Every cluster mutation flows through the control plane, and every node periodically reports status back to it.

Design Principle

The control plane is the "brain" of Kubernetes. It is intentionally separated from the data plane (where your workloads run) so that node failures never corrupt cluster state, and control-plane upgrades don't interrupt running containers.

The Five Core Components

kube-apiserver

:6443 (HTTPS)

The single, authoritative REST API gateway for the entire cluster. All state reads and writes go through it. Stateless and horizontally scalable. Persists every object to etcd.

etcd

:2379 (client) · :2380 (peer)

Strongly consistent, distributed key-value store. The ground truth for all cluster state. Uses Raft consensus. Loss of etcd = loss of the cluster.

kube-scheduler

:10259 (HTTPS metrics/healthz)

Watches for unscheduled Pods and assigns each to a Node. Two-phase: Filter (which nodes can run this pod?) → Score (which is optimal?). Highly extensible via plugins.

kube-controller-manager

:10257 (HTTPS metrics/healthz)

Runs 30+ reconciliation loops (controllers) in a single binary. Each controller drives actual cluster state toward desired state. Node controller, ReplicaSet controller, Job controller, etc.

cloud-controller-manager

:10258 (HTTPS metrics/healthz)

Optional. Integrates with cloud provider APIs for LoadBalancer provisioning, Node address annotation, Route programming. Decouples cloud logic from core controllers.

Control Plane Architecture Diagram

Figure 1: Control plane component communication topology. All external and internal traffic routes through kube-apiserver. etcd is only accessible by apiserver. The scheduler and controllers only talk to the apiserver — never directly to nodes.

Communication Matrix

Knowing which component talks to which, over what protocol and port, and with what authentication is critical for firewall rules, mTLS policy, and audit log interpretation.

Source	Destination	Port(s)	Protocol	AuthN Method	Direction	Notes
kubectl / API clients	kube-apiserver	6443	HTTPS (TLS 1.2+)	x509 cert / token / OIDC	→ CP	External-facing. Must be behind LB in HA.
kube-apiserver	etcd	2379	gRPC / TLS	mTLS (client cert)	→ etcd	Only apiserver talks to etcd. Dedicated cert.
kube-scheduler	kube-apiserver	6443	HTTPS + HTTP/2	in-cluster ServiceAccount / kubeconfig	→ CP	Watches unscheduled Pods; binds each via the pods/binding subresource (sets spec.nodeName)
kube-controller-manager	kube-apiserver	6443	HTTPS + HTTP/2	in-cluster ServiceAccount / kubeconfig	→ CP	Each controller has separate SA + RBAC
cloud-controller-manager	kube-apiserver	6443	HTTPS + HTTP/2	in-cluster ServiceAccount / kubeconfig	→ CP	Also calls cloud provider API externally
kubelet	kube-apiserver	6443	HTTPS + HTTP/2	TLS bootstrap → node cert	→ CP	Node authn group `system:nodes`
kube-apiserver	kubelet	10250	HTTPS	apiserver presents client cert to kubelet	→ Node	exec, log, portforward, attach
kube-proxy	kube-apiserver	6443	HTTPS + HTTP/2	ServiceAccount / kubeconfig	→ CP	Watches Services and EndpointSlices
etcd leader	etcd followers	2380	gRPC / TLS	mTLS (peer cert)	intra-etcd	Raft replication and heartbeats
kube-apiserver (HA)	kube-apiserver peers	—	—	—	independent	APIservers are stateless; they don't talk to each other. LB distributes.

Critical Security Rule

etcd must NEVER be accessible from outside the control-plane network. Firewall TCP 2379 and 2380 to only apiserver IPs. Direct etcd access bypasses all Kubernetes RBAC and audit logging.

Component Roles — Internal Mechanics

kube-apiserver — The API Gateway

The apiserver is the only component that reads from and writes to etcd. It is completely stateless: all state lives in etcd. Multiple apiserver replicas can run simultaneously because they share etcd as the source of truth.

Internal request lifecycle (full detail in 04-kubernetes-api-model.html §Request Lifecycle):

TLS Termination → Authentication → Authorization (RBAC) → Mutating Admission → Schema Validation → Validating Admission → etcd persist

The apiserver also proxies requests to webhooks (MutatingAdmissionWebhook, ValidatingAdmissionWebhook) and to aggregated API servers (metrics-server, custom API servers via APIService). It handles Watch via long-lived HTTP/2 streaming connections — see the Informer architecture in 04-kubernetes-api-model.html §Informers.

Key apiserver flags

# Minimal production-relevant flags
kube-apiserver \
  --advertise-address=<control-plane-ip>            # IP advertised to cluster members for reaching this apiserver
  --bind-address=0.0.0.0                            # Listen on all interfaces
  --secure-port=6443                                # HTTPS port (default)
  --etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379
  --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
  --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
  --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
  --client-ca-file=/etc/kubernetes/pki/ca.crt
  --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
  --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
  --authorization-mode=Node,RBAC
  --enable-admission-plugins=NodeRestriction,PodSecurity,MutatingAdmissionWebhook,ValidatingAdmissionWebhook
  --service-cluster-ip-range=10.96.0.0/12
  --service-node-port-range=30000-32767
  --allow-privileged=false
  --audit-log-path=/var/log/kubernetes/audit.log
  --audit-policy-file=/etc/kubernetes/audit-policy.yaml
  --feature-gates=...                               # feature flags
  --max-requests-inflight=800                       # concurrency limit
  --max-mutating-requests-inflight=400
  --request-timeout=60s

etcd — The Ground Truth

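etcd is a distributed key-value store using the Raft consensus algorithm. It guarantees strong consistency: reads are linearizable by default, so every read reflects the most recently committed write (given quorum). Kubernetes stores every API object under the /registry prefix (built-in types are protobuf-encoded), for example /registry/pods/{namespace}/{name} for a Pod. Custom resources and some API groups include the group in the key, as in /registry/{group}/{resource}/{namespace}/{name}.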

Raft requires a quorum of ⌊n/2⌋ + 1 members. A 3-member cluster tolerates 1 failure (quorum = 2). A 5-member cluster tolerates 2 failures (quorum = 3). Always run an odd number of members.
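The quorum arithmetic above can be checked with a short shell sketch. The function names `quorum` and `fault_tolerance` are illustrative helpers, not part of etcd or etcdctl.

```shell
# quorum N: votes needed for a majority in an N-member cluster (floor(N/2) + 1)
quorum() { echo $(( $1 / 2 + 1 )); }

# fault_tolerance N: members that can fail while a quorum still survives
fault_tolerance() { echo $(( $1 - $(quorum "$1") )); }

# Print a small table for common cluster sizes
for n in 1 2 3 4 5 6 7; do
  printf 'members=%d quorum=%d tolerates=%d\n' \
    "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

The table makes the odd-number rule concrete: a 4-member cluster tolerates one failure, the same as a 3-member cluster, so the extra member adds replication cost without adding resilience.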

etcd Raft Leader Election Flow

All members start as Followers. Each has a randomized election timeout (the Raft paper suggests 150–300ms; etcd's default --election-timeout is 1000ms).
When a Follower doesn't receive a heartbeat within its timeout, it transitions to Candidate and increments its term.
The Candidate sends RequestVote RPCs to all peers. Each peer grants one vote per term to the first Candidate that has an up-to-date log.
If the Candidate receives votes from a majority, it becomes Leader.
The Leader sends periodic heartbeat AppendEntries RPCs (etcd's default --heartbeat-interval is 100ms) to prevent elections.
All client writes (from apiserver) go to the Leader. It appends the entry to its log, replicates to followers, and commits once a majority acknowledges.
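The commit rule in the last step can be sketched as a majority check. This is a simplified illustration: `commit_state` is a hypothetical helper, and real Raft also tracks terms and log indexes, which are omitted here.

```shell
# commit_state ACKS MEMBERS
# ACKS counts every member that has appended the entry, the leader included.
# Prints "committed" once ACKS reaches a majority of MEMBERS, else "pending".
commit_state() {
  if [ "$1" -ge $(( $2 / 2 + 1 )) ]; then
    echo committed
  else
    echo pending
  fi
}

# 3-member cluster: leader plus one follower have the entry
commit_state 2 3
```

In a 5-member cluster the same two acknowledgements would leave the entry pending, because the majority threshold rises to three.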

# Inspect etcd leader
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  endpoint status --write-out=table

# Output columns: ENDPOINT, ID, VERSION, DB SIZE, IS LEADER, RAFT TERM, RAFT INDEX

kube-scheduler — Two-Phase Placement

The scheduler watches the apiserver for Pods in Pending state with spec.nodeName == "". For each such pod, it runs a two-phase algorithm:

Phase 1: Filter (Predicates)

Eliminate nodes that cannot run this pod. Built-in filter plugins include:

NodeResourcesFit: CPU/memory requests must fit within the node's allocatable resources
NodeAffinity: nodeSelector labels and requiredDuringSchedulingIgnoredDuringExecution rules must match
TaintToleration: pod must tolerate every NoSchedule/NoExecute taint on the node
PodTopologySpread: hard spread constraints (whenUnsatisfiable: DoNotSchedule)
VolumeBinding: PVCs must be bindable to volumes reachable from the node
InterPodAffinity: required pod affinity and anti-affinity rules
NodePorts: requested host ports must be free

Phase 2: Score (Priorities)

Rank remaining feasible nodes. Built-in score plugins include:

NodeResourcesFit (LeastAllocated strategy, the default): prefer nodes with the most free resources
NodeResourcesBalancedAllocation: balance CPU vs memory utilization
NodeAffinity: preferredDuringSchedulingIgnoredDuringExecution rules
InterPodAffinity: preferred (soft) affinity/anti-affinity
ImageLocality: prefer nodes that already have the image pulled
TaintToleration: deprioritize nodes with PreferNoSchedule taints the pod does not tolerate

The highest-scoring node wins. The scheduler records the decision by creating a Binding through the pod's binding subresource, which sets pod.spec.nodeName.
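The two phases can be sketched in a few lines of shell. This is a toy model, not the scheduler's implementation: the `NODES` inventory and the `pick_node` helper are hypothetical, only CPU and memory are considered, and the score is a crude stand-in for a least-allocated strategy.

```shell
# Hypothetical inventory, one node per line: name free_cpu_millicores free_mem_mib
NODES='node-a 500 1024
node-b 2000 4096
node-c 4000 512'

# pick_node REQ_CPU REQ_MEM: print the chosen node, or return 1 if none fits
pick_node() (
  best=""; best_score=-1
  while read -r name cpu mem; do
    # Filter phase: drop nodes that cannot fit the request
    [ "$cpu" -ge "$1" ] && [ "$mem" -ge "$2" ] || continue
    # Score phase: headroom left after placement (higher is better)
    score=$(( ($cpu - $1) + ($mem - $2) ))
    if [ "$score" -gt "$best_score" ]; then best=$name; best_score=$score; fi
  done <<EOF
$NODES
EOF
  [ -n "$best" ] && echo "$best"
)

pick_node 1000 1024   # node-a and node-c are filtered out, so node-b is chosen
```

When no node passes the filter the function prints nothing and returns non-zero, which corresponds to the pod staying Pending with a FailedScheduling event.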

The scheduler uses the Scheduling Framework (introduced v1.15, stable v1.19) — a plugin-based system where all logic is expressed as plugins implementing extension points: PreFilter, Filter, PostFilter, PreScore, Score, NormalizeScore, Reserve, Permit, PreBind, Bind, PostBind.

Scheduler Internals: Scheduling Queue and Backoff

The scheduler maintains a heap-based active queue ordered by pod priority (highest first). When a pod fails scheduling, it is held back with exponential backoff: the delay doubles with each failed attempt, from 1s up to a 10s cap by default (podInitialBackoffSeconds and podMaxBackoffSeconds in KubeSchedulerConfiguration). Failed pods are retried when cluster state changes (node added, resource freed, taint removed).
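The doubling-with-cap behaviour can be sketched as follows. The helper name `backoff_seconds` and its argument order are illustrative; the initial and maximum delays are passed in because they are configurable (kube-scheduler's documented defaults are podInitialBackoffSeconds: 1 and podMaxBackoffSeconds: 10).

```shell
# backoff_seconds ATTEMPT INITIAL MAX
# Delay before retry number ATTEMPT (1-based): INITIAL doubled per attempt, capped at MAX.
backoff_seconds() (
  delay=$2
  i=1
  while [ "$i" -lt "$1" ] && [ "$delay" -lt "$3" ]; do
    delay=$(( $delay * 2 ))
    i=$(( $i + 1 ))
  done
  if [ "$delay" -gt "$3" ]; then delay=$3; fi
  echo "$delay"
)

# Delays for the first six attempts with a 1s initial backoff and a 10s cap
for attempt in 1 2 3 4 5 6; do
  printf 'attempt=%d backoff=%ds\n' "$attempt" "$(backoff_seconds "$attempt" 1 10)"
done
```

The cap keeps a repeatedly unschedulable pod from disappearing from consideration for long periods, while the doubling limits how often it consumes a scheduling cycle.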

# Check scheduling failures
kubectl get events --field-selector reason=FailedScheduling
kubectl describe pod <pod-name>   # "Events" section shows scheduler decisions

# Enable verbose scheduler logs
kube-scheduler --v=10  # Logs filter/score results per node

# Inspect scheduler metrics
curl -sk https://localhost:10259/metrics | grep scheduler_

kube-controller-manager — The Reconciliation Engine

The kube-controller-manager runs 30+ controllers in a single binary with a single process. Each controller is an independent goroutine running a reconciliation loop:

// Generic reconciliation loop pattern (all controllers follow this)
for {
    desired := readDesiredState(apiserver)  // via Informer cache
    actual  := readActualState(apiserver)   // via Informer cache
    if desired != actual {
        makeChanges(apiserver)              // Create/Update/Delete child objects
    }
    // Sleep or wait for next Informer event
}

The key controllers and what they manage:

Controller	Watches	Creates / Manages	Key Action
ReplicationController	RC objects + Pods	Pods	Maintain pod count == spec.replicas
ReplicaSet controller	RS objects + Pods	Pods	Maintain pod count, owns pods via ownerRef
Deployment controller	Deployments + RS	ReplicaSets	Manage rolling updates, rollbacks
StatefulSet controller	StatefulSets + Pods + PVCs	Pods, PVCs	Ordered pod creation, stable network IDs
DaemonSet controller	DaemonSets + Nodes + Pods	Pods (one per node)	Schedule pod to every matching node
Job controller	Jobs + Pods	Pods	Run pods to completion, handle retries
CronJob controller	CronJobs	Jobs	Create Job objects on schedule
Node controller	Nodes	—	Taint nodes unreachable after 40s, evict pods after 5min
Endpoints controller	Services + Pods	Endpoints	Update endpoints when pod IPs change (legacy; EndpointSlice preferred)
EndpointSlice controller	Services + Pods	EndpointSlices	Scale-friendly endpoint tracking
Namespace controller	Namespaces	—	Delete all objects when namespace is deleted
ServiceAccount controller	Namespaces	ServiceAccounts	Create default SA in every namespace
Token controller	ServiceAccounts + Secrets	Secrets (token)	Manage SA token Secrets (legacy; no longer auto-created since v1.24)
PersistentVolume controller	PVs + PVCs	—	Bind PVCs to PVs, handle reclaim policies
ResourceQuota controller	ResourceQuotas	—	Track and enforce quota usage
GarbageCollection controller	All objects	—	Delete orphaned objects via ownerReferences
TTLAfterFinished controller	Jobs	—	Delete finished Jobs after TTL

Single Binary, Multiple Controllers

All controllers share one process and one process-wide leader election Lease. If the kube-controller-manager pod restarts, all controllers restart together. This simplifies deployment but means a single bug can impact all controllers. The --controllers flag can selectively disable specific controllers; keep the `*` entry so the remaining defaults stay enabled (e.g., --controllers=*,-ttl).

cloud-controller-manager — Cloud Integration

Introduced to decouple cloud-provider-specific logic from the core Kubernetes binaries. Before the CCM existed (it arrived as alpha in v1.6), cloud providers maintained their integration code inside kube-controller-manager and kubelet, which tied every provider change to the core Kubernetes release cycle.

The CCM runs three cloud-specific controllers:

Node controller: Initializes new Nodes with cloud metadata (instance type, zone, and region labels, provider ID, addresses). Deletes Node objects when the cloud instance is terminated.
Route controller: Programs cloud routing tables so pod-to-pod traffic flows across nodes. It is only needed where pod networking relies on cloud routes (for example, GCE routes-based clusters); CNI-based setups such as AWS VPC CNI do not use it.
Service controller: Watches Services of type: LoadBalancer. Creates/deletes cloud load balancers. Updates status.loadBalancer.ingress with the allocated IP/hostname.

High Availability Control Plane

Stacked vs External etcd Topology

Stacked etcd (Default kubeadm)

etcd runs on the same nodes as the control plane components. Simpler to manage. Used by default in kubeadm clusters.

Each CP node: apiserver + scheduler + controller-manager + etcd
etcd cluster: 3 members on 3 CP nodes
Risk: Losing a CP node loses both a control plane member AND an etcd member simultaneously
Minimum: 3 CP nodes for quorum tolerance

External etcd

etcd runs on separate dedicated nodes. More resilient, harder to operate.

CP nodes: apiserver + scheduler + controller-manager (no etcd)
etcd nodes: 3 or 5 dedicated etcd members
CP node failure doesn't affect etcd quorum
Requires more nodes: minimum 3 CP + 3 etcd = 6 nodes
Recommended for production clusters > 50 nodes

apiserver HA — Stateless Horizontal Scale

Since the apiserver is stateless (all state in etcd), you can run any number of replicas. A TCP load balancer (or VIP via keepalived/haproxy) distributes client connections across all healthy apiserver instances. There is no leader election for apiservers — all replicas serve reads and writes concurrently.

# Verify HA apiserver (kubeadm)
kubectl get pods -n kube-system -l component=kube-apiserver

# Show the API endpoint your kubeconfig targets (in HA this is the VIP/LB, not an individual replica)
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'

# List the individual apiserver addresses registered behind the default/kubernetes Service
kubectl get endpoints kubernetes -n default

Scheduler and Controller-Manager Leader Election

Unlike the apiserver, the scheduler and controller-manager are NOT safe to run as multiple active instances — two schedulers could assign the same pod to two nodes. They use Kubernetes Lease-based leader election:

The leader election mechanism:

Each replica tries to acquire a Lease object in the kube-system namespace.
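The Lease records leaseDurationSeconds (default 15s). Separately, the leader must succeed in renewing within the renew deadline (--leader-elect-renew-deadline, default 10s) or it gives up leadership.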
The leader continuously renews the Lease by updating renewTime.
If the leader fails to renew within leaseDuration, another replica acquires the Lease and becomes leader.
Non-leaders sleep and periodically attempt to acquire the Lease.
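The expiry test a standby applies can be sketched with epoch seconds. This is a simplification: `lease_state` is a hypothetical helper, and client-go's leader election measures elapsed time from when the standby last observed a change to the Lease rather than trusting the leader's clock.

```shell
# lease_state LAST_RENEW_EPOCH NOW_EPOCH LEASE_DURATION_SECONDS
# Prints "held" while the lease is still valid, "expired" once a standby may take over.
lease_state() {
  if [ $(( $2 - $1 )) -gt "$3" ]; then
    echo expired
  else
    echo held
  fi
}

lease_state 100 110 15   # renewed 10s ago with a 15s lease: still held
lease_state 100 116 15   # 16s without renewal: a standby may acquire it
```

With the default 2s retry period, a standby re-evaluates this condition every couple of seconds, so failover typically completes shortly after the lease duration elapses.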

# Check scheduler leader
kubectl get lease kube-scheduler -n kube-system -o yaml
# holderIdentity: kube-scheduler-node1_abc-uuid

# Check controller-manager leader
kubectl get lease kube-controller-manager -n kube-system -o yaml

# Watch for leadership changes
kubectl get lease -n kube-system --watch

# Controller-manager leader election flags
kube-controller-manager \
  --leader-elect=true \
  --leader-elect-lease-duration=15s \
  --leader-elect-renew-deadline=10s \
  --leader-elect-retry-period=2s

Split-Brain Risk

If the leader election Lease is not renewed due to a network partition (not a process crash), both old and new leaders may believe they are active for up to leaseDurationSeconds. This is the "split-brain" window. Kubernetes mitigates this with optimistic locking (resourceVersion conflicts) but double-scheduling is theoretically possible during this window.
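The optimistic-locking safeguard can be illustrated with a file standing in for the Lease object. This is only a model of the idea: `cas_update` is a hypothetical helper, the file format is invented for the example, and the check-then-write here is not atomic the way an apiserver update with a resourceVersion precondition is.

```shell
# Simulated object: a file whose single line is "<resourceVersion> <holder>".
# cas_update FILE EXPECTED_RV NEW_HOLDER
# Writes only if the stored version equals EXPECTED_RV, then bumps the version.
cas_update() (
  read -r rv rest < "$1"
  if [ "$rv" = "$2" ]; then
    echo "$(( $rv + 1 )) $3" > "$1"
    echo updated
  else
    echo conflict   # stale writer: analogous to an HTTP 409 from the apiserver
  fi
)

# Usage (paths are examples):
#   echo "7 old-leader" > /tmp/lease
#   cas_update /tmp/lease 7 new-leader   # prints "updated"; file now holds "8 new-leader"
#   cas_update /tmp/lease 7 old-leader   # prints "conflict"; the old leader's renewal is rejected
```

A partitioned former leader that comes back with a stale resourceVersion loses the write race, which is how it learns it no longer holds the Lease.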

Static Pods — How Control Plane Components Run

On clusters provisioned by kubeadm, all control plane components (apiserver, etcd, scheduler, controller-manager) run as static pods. Static pods are managed directly by the kubelet on the node — not by the apiserver or any controller.

The kubelet's staticPodPath setting (kubeadm sets it to /etc/kubernetes/manifests/; the equivalent --pod-manifest-path flag is deprecated) names a directory of YAML manifests. The kubelet watches this directory and creates, restarts, or removes pods whenever files are added, modified, or deleted.

# Static pod manifests location (kubeadm)
ls /etc/kubernetes/manifests/
# etcd.yaml
# kube-apiserver.yaml
# kube-controller-manager.yaml
# kube-scheduler.yaml

# Static pods appear as mirror pods in the API server
# (prefixed with node name, e.g. kube-apiserver-control-plane-1)
kubectl get pods -n kube-system

# Editing a static pod manifest immediately restarts the component
# (kubelet detects inotify change and recreates the pod)
vim /etc/kubernetes/manifests/kube-apiserver.yaml

Static Pod Bootstrap Problem

The kube-apiserver runs as a static pod managed by kubelet. But the kubelet itself needs to communicate with the apiserver for many operations. This is the chicken-and-egg bootstrap: kubelet can start and manage static pods before the apiserver is running. Once the apiserver comes up, the kubelet registers the static pods as "mirror pods" (read-only copies in etcd). The kubelet continues to manage these pods locally — deleting the mirror pod from the API server does NOT delete the actual static pod.

Component Startup Order and Dependencies

1. etcd → Must be fully up and serving before any other component starts.
2. kube-apiserver → Connects to etcd. Other components cannot start without a healthy apiserver.
3. kube-controller-manager → Connects to apiserver. Acquires leader election Lease.
4. kube-scheduler → Connects to apiserver. Acquires leader election Lease. Can start before or after controller-manager.
5. cloud-controller-manager → Optional. Connects to apiserver and cloud provider API.
6. kubelet (on each node) → Registers Node object. Requires apiserver to be reachable.
7. kube-proxy (on each node) → Watches Services and EndpointSlices. Requires apiserver.
8. CoreDNS (addon) → Scheduled as pods. Requires scheduler and kubelet to be running.

PKI and Certificate Architecture

Every communication in the control plane uses TLS. The control plane PKI is a hierarchy of Certificate Authorities managed by kubeadm (or manually for more complex setups).

# View all cluster certificates and their expiry
kubeadm certs check-expiration

# Renew all certificates (kubeadm)
kubeadm certs renew all

# View cert details
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A 5 "Subject Alternative"
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# Check the subject of the apiserver's etcd client cert
openssl x509 -in /etc/kubernetes/pki/apiserver-etcd-client.crt -noout -text | grep "Subject:"

Health Check Endpoints and Monitoring

Every control plane component exposes health check endpoints. These are polled by load balancers, monitoring systems, and readiness probes in the static pod manifests.

Component	Endpoint	Port	Expected Response
kube-apiserver	`/healthz`	6443	`ok`
kube-apiserver	`/readyz`	6443	`ok` (all checks pass)
kube-apiserver	`/livez`	6443	`ok`
kube-apiserver	`/metrics`	6443	Prometheus text format
kube-scheduler	`/healthz`	10259	`ok`
kube-scheduler	`/metrics`	10259	Prometheus text format
kube-controller-manager	`/healthz`	10257	`ok`
kube-controller-manager	`/metrics`	10257	Prometheus text format
etcd	`/health`	2379	`{"health":"true"}`
etcd	`/metrics`	2381	Prometheus text format

# Check apiserver health from within cluster
kubectl get --raw /healthz
kubectl get --raw /readyz
kubectl get --raw /livez
kubectl get --raw '/readyz?verbose'   # Shows each check's status (quoted so the shell does not glob the "?")

# Individual readiness checks
kubectl get --raw /readyz/poststarthook/rbac/bootstrap-roles
kubectl get --raw /readyz/etcd

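# From node directly (assumes an admin client cert and key exist at these paths; kubeadm embeds them in admin.conf instead)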
curl -sk https://localhost:6443/healthz --cert /etc/kubernetes/pki/admin.crt --key /etc/kubernetes/pki/admin.key

# Check scheduler and controller-manager from CP node
curl -sk https://127.0.0.1:10259/healthz
curl -sk https://127.0.0.1:10257/healthz

# etcd health check
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

Critical Metrics to Monitor

Metric	Component	Alert Threshold	Meaning
`apiserver_request_duration_seconds`	apiserver	p99 > 1s	API request latency by verb/resource
`apiserver_request_total`	apiserver	error rate > 1%	Total requests by code/verb/resource
`etcd_request_duration_seconds`	apiserver (etcd client)	p99 > 100ms	Latency of etcd calls as measured by the apiserver (critical for apiserver throughput)
`etcd_server_leader_changes_seen_total`	etcd	> 0 in 5min	etcd leader elections (indicates instability)
`etcd_mvcc_db_total_size_in_bytes`	etcd	> 80% of quota	etcd database size (default quota 2GB, configurable to 8GB)
`scheduler_schedule_attempts_total`	scheduler	unschedulable > 0	Pods that couldn't be scheduled
`scheduler_pending_pods`	scheduler	> 50 sustained	Pods waiting to be scheduled
`workqueue_depth`	controller-manager	> 100 sustained	Controller reconciliation queue backlog
`workqueue_queue_duration_seconds`	controller-manager	p99 > 5s	Time items wait in controller work queue
`apiserver_current_inflight_requests`	apiserver	near max-requests-inflight	Concurrency pressure on apiserver

Troubleshooting the Control Plane

apiserver Not Responding

# Step 1: Check if the static pod is running on CP node
ssh cp-node-1
crictl ps | grep kube-apiserver

# Step 2: Check kubelet logs (manages static pods)
journalctl -u kubelet --since "10 minutes ago" | grep apiserver

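# Step 3: Check apiserver logs directly on the node (kubectl itself needs a working apiserver)
crictl logs $(crictl ps --name kube-apiserver -q)
# If another apiserver replica is still reachable, kubectl also works:
kubectl logs -n kube-system kube-apiserver-cp-node-1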

# Step 4: Check manifest for syntax errors
cat /etc/kubernetes/manifests/kube-apiserver.yaml | python3 -c "import sys,yaml;yaml.safe_load(sys.stdin)"

# Step 5: Verify etcd connectivity
ETCDCTL_API=3 etcdctl endpoint health ...
# If etcd is down → apiserver will not serve writes (reads from cache possible)

# Step 6: Check certificates
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
kubeadm certs check-expiration

Scheduler Not Scheduling Pods

# Check if scheduler is running
kubectl get pods -n kube-system -l component=kube-scheduler

# Check leader election
kubectl get lease kube-scheduler -n kube-system

# Check scheduler logs for specific pod
kubectl logs -n kube-system kube-scheduler-cp-node-1 | grep "Failed to schedule"
kubectl logs -n kube-system kube-scheduler-cp-node-1 | grep <pod-name>

# Get scheduling failure events
kubectl get events --field-selector reason=FailedScheduling --all-namespaces

# Describe the unscheduled pod
kubectl describe pod <pod-name>   # Look at "Events:" section

# Common causes:
# - Insufficient CPU/memory on all nodes
# - Taint with no toleration
# - NodeSelector/Affinity that matches no nodes
# - PVC cannot be bound (VolumeBinding plugin failure)

Controller Not Reconciling

# Check controller-manager status
kubectl get pods -n kube-system -l component=kube-controller-manager
kubectl logs -n kube-system kube-controller-manager-cp-node-1 --tail=100

# Check leader
kubectl get lease kube-controller-manager -n kube-system

# Check work queue metrics (if prometheus is available)
kubectl port-forward -n kube-system kube-controller-manager-cp-node-1 10257:10257
curl -sk https://localhost:10257/metrics | grep workqueue

# Check for throttling / rate limiting
kubectl logs ... | grep "Throttling request"

# Check RBAC permissions for specific controller SA
kubectl auth can-i list deployments --as=system:serviceaccount:kube-system:deployment-controller

etcd Issues

# etcd cannot reach quorum
# Symptom: apiserver returns 503 for writes, etcd logs show "failed to reach quorum"
ETCDCTL_API=3 etcdctl endpoint status --write-out=table ...

# etcd database size too large
ETCDCTL_API=3 etcdctl endpoint status ...  # check DB SIZE column
# Compact and defragment:
ETCDCTL_API=3 etcdctl compact $(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
ETCDCTL_API=3 etcdctl defrag

# etcd slow writes / high fsync latency
# Check disk performance:
iostat -x 1 5
# etcd is sensitive to fsync latency; use SSDs; avoid NFS/network storage

# Check etcd alarms
ETCDCTL_API=3 etcdctl alarm list
ETCDCTL_API=3 etcdctl alarm disarm  # After resolving the root cause

Upgrading the Control Plane

Control plane upgrades must be done before upgrading worker nodes. The supported pattern is to upgrade one minor version at a time (e.g., 1.29 → 1.30, not 1.29 → 1.31).
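The one-minor-at-a-time rule means a multi-version jump becomes a sequence of separate upgrades. A small sketch, assuming major version 1 and using the hypothetical helper `upgrade_path`, lists the intermediate targets:

```shell
# upgrade_path FROM_MINOR TO_MINOR
# Prints each minor version that must be installed in turn, one per line.
upgrade_path() (
  m=$(( $1 + 1 ))
  while [ "$m" -le "$2" ]; do
    echo "1.$m"
    m=$(( $m + 1 ))
  done
)

upgrade_path 29 31   # prints 1.30 then 1.31: two upgrade rounds, not one
```

Each printed version corresponds to one full pass of the control plane upgrade flow, followed by the worker nodes, before the next hop begins.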

# Using kubeadm (standard upgrade flow)

# 1. Upgrade kubeadm on first CP node
apt-get update && apt-get install -y kubeadm=1.30.0-1.1

# 2. Verify upgrade plan
kubeadm upgrade plan

# 3. Apply upgrade to first CP node (upgrades apiserver, scheduler, controller-manager, etcd)
kubeadm upgrade apply v1.30.0

# 4. Upgrade kubeadm on remaining CP nodes
# Then on each additional CP node:
kubeadm upgrade node

# 5. Upgrade kubelet and kubectl on CP nodes
apt-get install -y kubelet=1.30.0-1.1 kubectl=1.30.0-1.1
systemctl daemon-reload && systemctl restart kubelet

# 6. Then upgrade worker nodes (drain → upgrade kubelet → uncordon)

Upgrade Order is Critical

NEVER upgrade worker node kubelets before the control plane. The kubelet must never be newer than the apiserver; it may lag behind by up to three minor versions (two for kubelets older than v1.25). kube-proxy follows the same rule relative to the apiserver. Running outside this version skew policy is unsupported and can cause unpredictable behavior.
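A pre-upgrade skew check can be sketched as below. The helpers `minor` and `skew_ok` are illustrative, both versions are assumed to share the same major version, and the allowed skew is passed as a parameter because it depends on the Kubernetes release in use.

```shell
# minor VERSION: extract the minor number from "1.30.2" or "v1.30.2"
minor() { echo "$1" | sed 's/^v//' | cut -d. -f2; }

# skew_ok APISERVER_VERSION KUBELET_VERSION MAX_SKEW
# Prints "ok" if the kubelet is not newer than the apiserver and at most
# MAX_SKEW minor versions older; otherwise prints "violation".
skew_ok() (
  diff=$(( $(minor "$1") - $(minor "$2") ))
  if [ "$diff" -ge 0 ] && [ "$diff" -le "$3" ]; then
    echo ok
  else
    echo violation
  fi
)

skew_ok v1.30.0 v1.29.5 1   # kubelet one minor behind
skew_ok v1.30.0 v1.31.0 1   # kubelet newer than the apiserver
```

In practice the two version strings could come from `kubectl version` and `kubectl get nodes`, with the check run for every node before the control plane is upgraded.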

Production Control Plane Checklist

20-Item Production Readiness Checklist

#	Item	Default	Production Setting
1	etcd node count	1 (single)	3 or 5 (odd, for quorum)
2	apiserver replicas	1	3 (behind VIP/LB)
3	etcd encryption at rest	Disabled	Enable with `--encryption-provider-config`
4	Audit logging	Disabled	Enable with policy covering Secrets, RBAC, exec
5	Certificate rotation	Manual	Use cert-manager or kubeadm auto-renewal + alerting
6	etcd backup	None	Automated hourly snapshots to off-cluster storage
7	etcd disk	Any	Dedicated NVMe SSD; separate from OS disk
8	API server request limits	400/800	Tune per cluster size; enable APF FlowSchemas
9	etcd quota	2GB	Increase to 8GB for large clusters; add compaction cron
10	Authorization mode	AlwaysAllow (dev)	`--authorization-mode=Node,RBAC`
11	Admission plugins	minimal	Enable PodSecurity, NodeRestriction, ResourceQuota
12	anonymous-auth	true	Disable: `--anonymous-auth=false`
13	insecure-port	0 (disabled v1.20+)	Flag removed in v1.24; on older releases ensure `--insecure-port=0`
14	profiling	Enabled	Disable in production: `--profiling=false`
15	etcd peer encryption	mTLS	Verify peer certs are separate from client certs
16	Control plane node isolation	Tainted `node-role.kubernetes.io/control-plane:NoSchedule` (kubeadm default)	Keep the taint; do not remove it to run workloads on CP nodes
17	Resource requests on CP pods	None (static pods)	Set requests in static pod manifests; use separate resource classes
18	etcd compaction	Auto (every 5min default)	Verify compaction is running; alert on DB growth rate
19	OIDC integration	x509 only	Integrate with org IdP (Dex, Okta) for human user authn
20	Monitoring coverage	None	Prometheus + Alertmanager for all 10 metrics above

Dependency Graph and Next Files

This File Covers

Control plane component roles
Communication topology
HA: stacked vs external etcd
Leader election mechanism
Static pods and bootstrap
PKI and certificate hierarchy
Health checks and metrics
Upgrade procedure