File Path: 00-foundations/03-cluster-architecture-overview.html

Prerequisites: 00-intro, 01-history, 02-containers

Concepts Covered

Cluster topology, Control plane components, Worker node components, HA architecture, Multi-master setup, Component communication, Port reference, Certificate topology, Failure domains, Network topology, Cloud vs bare metal, Stacked vs external etcd

Cluster Architecture Overview

This document is the architectural map of a Kubernetes cluster — every component, its role, how it communicates with other components, what ports it uses, and how the whole system maintains consistency under failures. Read this before diving into any individual component's deep-dive.

1 · The Two Planes

A Kubernetes cluster is logically split into two planes:

Control Plane — makes decisions (scheduling, reconciliation, admission), stores state. Brains of the cluster.
Data Plane (Worker Nodes) — executes workloads. Runs the actual Pods that serve traffic.

In a production cluster these planes run on separate physical/virtual machines. The control plane should never run user workloads — use NoSchedule taints.

Figure 1 — Production HA cluster: 3 control plane nodes (etcd stacked), N worker nodes. All components communicate via kube-apiserver.

2 · Control Plane Components In Depth

2.1 kube-apiserver

The API server is the central component that all other components talk to. It holds no state of its own (everything is persisted in etcd), but it is the gateway to the single source of truth: components do not talk to one another directly. The exceptions are that the API server calls back into the kubelet for exec/logs/port-forward, and that etcd is called only by the API server.

Property	Value
Port	6443 (HTTPS). The insecure port 8080 must never be exposed; the option to serve it was removed in Kubernetes 1.20.
Horizontally scalable	Yes — multiple replicas behind a load balancer. All replicas are active (no leader election).
Stateful?	No — completely stateless. All state is in etcd.
Auth mechanisms	x509 client certificates, bearer tokens, OIDC, webhook, service account tokens
Deep-dive	01-control-plane/01-kube-apiserver.html

2.2 etcd

etcd is the cluster's brain memory. Every Kubernetes object (Pod spec, Deployment state, Secret value, Node status, Service endpoint) is persisted as a key-value entry in etcd. The kube-apiserver is the only component that reads from or writes to etcd directly.

Property	Value
Client port	2379 (only API server connects)
Peer port	2380 (etcd-to-etcd Raft replication)
Consensus	Raft; a write commits only after a quorum of floor(n/2)+1 members has accepted it
Recommended cluster size	3 members (1 failure tolerance) or 5 members (2 failure tolerance)
Deep-dive	01-control-plane/02-etcd.html

2.3 kube-scheduler

The scheduler watches for Pods with spec.nodeName == "" (unscheduled) and writes a nodeName into the Pod spec via a binding API call. It runs a filter → score → bind pipeline for every Pod.

Property	Value
Metrics/health port	10259 (HTTPS)
Leader election	Yes — only one scheduler instance is active at a time, via Lease API object
Talks to	Only kube-apiserver (watch for unscheduled Pods, write bindings)
Deep-dive	01-control-plane/03-kube-scheduler.html
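The filter → score → bind pipeline can be illustrated with a toy model. This is only a sketch and not how kube-scheduler is implemented: it assumes free CPU is the sole criterion, and the input format (one "node free_millicpu" pair per line) and the function name pick_node are hypothetical.

```shell
#!/usr/bin/env bash
# Toy model of filter -> score -> bind. NOT the real scheduler:
# it looks only at free CPU in millicores (an assumption for illustration).
# Input on stdin: one "<node> <free_millicpu>" pair per line.
pick_node() {
  local request=$1 node free best_node="" best_free=-1
  while read -r node free; do
    [ -z "$node" ] && continue
    # Filter: drop nodes that cannot fit the Pod's CPU request.
    (( free < request )) && continue
    # Score: prefer the node with the most free CPU.
    if (( free > best_free )); then
      best_free=$free
      best_node=$node
    fi
  done
  # Bind: print the winner, or return non-zero if every node was filtered out
  # (the real scheduler would leave the Pod Pending in that case).
  [ -n "$best_node" ] && echo "$best_node"
}
```

For example, `printf 'n1 500\nn2 2000\nn3 1500\n' | pick_node 1000` filters out n1 and selects n2, the candidate with the most free CPU.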

2.4 kube-controller-manager

A single binary that embeds 30+ individual controllers, each running its own reconciliation loop. All controllers use the same watch-and-reconcile pattern.

Property	Value
Metrics/health port	10257 (HTTPS)
Leader election	Yes — one active instance at a time via Lease
Key controllers included	ReplicationController, ReplicaSet, Deployment, StatefulSet, DaemonSet, Job, CronJob, Node, Endpoint/EndpointSlice, Namespace, PersistentVolume, ServiceAccount, Token, Certificate
Deep-dive	01-control-plane/04-kube-controller-manager.html

2.5 cloud-controller-manager

Separates cloud-provider-specific logic from the core Kubernetes code. Manages: cloud load balancers (LoadBalancer type Services), cloud node metadata (instance types, zones), and cloud routes (for overlay-less networking on GCE/AWS).

Property	Value
Port	10258 (HTTPS)
Leader election	Yes
Absent on	bare metal clusters without a cloud provider integration
Deep-dive	01-control-plane/05-cloud-controller-manager.html

3 · Worker Node Components In Depth

3.1 kubelet

The kubelet is the primary node agent. It is the bridge between the Kubernetes API and the container runtime. It watches the API server for Pods assigned to its node, drives the CRI to create/destroy containers, runs probes, and reports status back.

Property	Value
HTTPS API port	10250 — API server calls back here for exec/logs/port-forward
Read-only port	10255 (unauthenticated HTTP; deprecated, and disabled when readOnlyPort is 0, as in kubeadm-generated kubelet configs)
Healthz port	10248
Talks to	kube-apiserver (watch + status updates), CRI socket (containerd/CRI-O), CNI plugins
Deep-dive	02-node-components/01-kubelet.html

3.2 kube-proxy

kube-proxy watches Service and EndpointSlice objects and programs the host's networking stack (iptables rules, IPVS virtual servers, or nftables) to implement the Service virtual IP abstraction.

Property	Value
Healthz port	10256
Modes	iptables (default on Linux), IPVS (large clusters), nftables (beta in 1.31). An eBPF dataplane such as Cilium can replace kube-proxy entirely.
Deep-dive	02-node-components/02-kube-proxy.html

3.3 Container Runtime (CRI)

The CRI-compliant runtime (containerd or CRI-O) receives instructions from the kubelet via gRPC, pulls images, manages container lifecycle, and interacts with the OCI low-level runtime (runc).

3.4 CNI Plugin

The CNI plugin (Calico, Cilium, Flannel, Weave, etc.) is invoked by the container runtime, at the kubelet's request through the CRI, when a Pod sandbox is created. It assigns a Pod IP, creates the veth pair, and programs routes. The CNI plugin itself is not a running daemon: it is a set of binary executables in /opt/cni/bin/ called by the runtime, although most implementations also run a node agent as a DaemonSet.

4 · Component Communication Matrix

A critical production security concept: understanding exactly which components talk to which, over which port, authenticated with which certificate.

Source	Destination	Protocol/Port	Auth method	Direction
kubectl / any client	kube-apiserver	HTTPS :6443	x509 cert, OIDC token, bearer token	Client → API server
kube-apiserver	etcd	HTTPS :2379	x509 client cert (etcd CA signed)	API server → etcd only
kube-scheduler	kube-apiserver	HTTPS :6443	x509 client cert (system:kube-scheduler)	scheduler → API server (watch + write binding)
kube-controller-manager	kube-apiserver	HTTPS :6443	x509 client cert (system:kube-controller-manager)	controller-manager → API server
cloud-controller-manager	kube-apiserver	HTTPS :6443	x509 client cert	ccm → API server
kubelet	kube-apiserver	HTTPS :6443	x509 client cert (system:node:<nodeName>, bootstrapped via TLS bootstrap or kubeadm)	kubelet → API server (watch Pods, update status)
kube-apiserver	kubelet	HTTPS :10250	x509 client cert (apiserver-kubelet-client); the kubelet's serving cert is verified only if --kubelet-certificate-authority is set	API server → kubelet (exec, logs, port-forward)
kube-proxy	kube-apiserver	HTTPS :6443	x509 client cert or ServiceAccount token	kube-proxy → API server (watch Services/EndpointSlices)
CoreDNS	kube-apiserver	HTTPS :6443	ServiceAccount token (RBAC: get/list/watch Services/Endpoints)	CoreDNS → API server
etcd peer	etcd peer	HTTPS :2380	Mutual TLS (etcd peer CA)	Raft replication between etcd members
kube-apiserver	Admission webhook	HTTPS (webhook TLS)	Webhook server cert trusted by API server caBundle field	API server → webhook server

The "Hub and Spoke" model: The kube-apiserver is at the centre of all communication. No two non-API-server components communicate directly with each other (except etcd peer replication). This is a deliberate design choice that simplifies security: authentication, authorization, and admission are enforced at one endpoint, and all state transitions are auditable in one place.

5 · Complete Port Reference

Component	Port	Protocol	Purpose	Firewall rule needed?
kube-apiserver	6443	HTTPS	Kubernetes API (all clients)	Yes — from all nodes + kubectl clients
etcd client	2379	HTTPS	etcd API (API server only)	Yes — from control plane nodes only
etcd peer	2380	HTTPS	Raft replication between etcd members	Yes — between control plane nodes only
kube-scheduler	10259	HTTPS	Healthz + metrics (Prometheus scrape)	Optional — monitoring only
kube-controller-manager	10257	HTTPS	Healthz + metrics	Optional — monitoring only
cloud-controller-manager	10258	HTTPS	Healthz + metrics	Optional — monitoring only
kubelet	10250	HTTPS	kubelet API (exec/logs/port-forward)	Yes — from control plane nodes
kubelet healthz	10248	HTTP	Local healthcheck only	No — local loopback only
kube-proxy healthz	10256	HTTP	Health probe	No — local
NodePort range	30000–32767	TCP/UDP	External Service exposure	Yes — from external clients
CoreDNS	53 (Pod IP)	UDP/TCP	DNS resolution for cluster DNS	No — internal Pod traffic only
metrics-server	443 (Service)	HTTPS	Resource metrics (kubectl top)	No — internal
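When auditing firewall rules it can help to have the table above in machine-readable form. The following is a minimal sketch, assuming bash 4+ (associative arrays) and the coreutils timeout command; the names K8S_PORTS, describe_port, port_open, and probe_host are hypothetical helpers, not part of any Kubernetes tooling.

```shell
#!/usr/bin/env bash
# Port-to-component map copied from the port reference table (bash 4+).
declare -A K8S_PORTS=(
  [6443]="kube-apiserver"
  [2379]="etcd client"
  [2380]="etcd peer"
  [10250]="kubelet"
  [10257]="kube-controller-manager"
  [10258]="cloud-controller-manager"
  [10259]="kube-scheduler"
)

# Print the component expected on a port, or "unknown".
describe_port() {
  echo "${K8S_PORTS[$1]:-unknown}"
}

# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
# Uses bash's built-in /dev/tcp redirection, so no nc or nmap is needed.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Probe the ports that the table marks as needing firewall rules.
probe_host() {
  local host=$1 port
  for port in 6443 2379 2380 10250; do
    if port_open "$host" "$port"; then
      echo "$port ($(describe_port "$port")): open"
    else
      echo "$port ($(describe_port "$port")): closed or filtered"
    fi
  done
}
```

Running `probe_host <control-plane-ip>` from a worker node should show 6443 open and 2379/2380 closed or filtered if the firewall follows the table.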

6 · High Availability Architecture

6.1 Stacked vs. External etcd

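There are two HA topologies for the control plane:

Stacked etcd: each control plane node runs an etcd member alongside the API server, scheduler, and controller-manager. Fewer machines to manage, but losing a node removes both a control plane instance and an etcd member.
External etcd: etcd runs on its own dedicated hosts and the API servers connect to it over the network. Requires more machines, but decouples etcd failures and disk I/O from the rest of the control plane.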

6.2 Quorum Math and Failure Tolerance

etcd members	Quorum required	Failure tolerance	Recommendation
1	1	0	Development only — any failure loses the cluster
2	2	0	Worse than 1 — both must agree, but 1 failure still breaks quorum
3	2	1	Minimum production HA — standard for most clusters
5	3	2	Large/critical clusters, concurrent failure scenarios
7	4	3	Very large clusters; write latency increases — rarely worth it

Never run an even number of etcd members: 2 members tolerate no more failures than 1 (zero), and 4 tolerate no more than 3 (one), while each extra member is another node that can fail. Raft prevents split-brain, but a network partition that divides an even-sized cluster into equal halves leaves neither half with quorum, so the whole cluster stops accepting writes. Always use odd numbers.
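The quorum table follows directly from integer arithmetic: quorum is floor(n/2)+1 and failure tolerance is n minus quorum. A minimal sketch (the function names are illustrative only):

```shell
#!/usr/bin/env bash
# Quorum size for an etcd cluster of n members: floor(n/2) + 1.
# Bash integer division already rounds down.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while quorum is still reachable.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the table for a list of cluster sizes.
quorum_table() {
  local n
  printf '%-8s %-7s %s\n' members quorum tolerance
  for n in "$@"; do
    printf '%-8s %-7s %s\n' "$n" "$(quorum "$n")" "$(failure_tolerance "$n")"
  done
}
```

Calling `quorum_table 3 4 5` shows that 4 members need a quorum of 3 and tolerate one failure, which is no better than 3 members.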

6.3 Controller/Scheduler Leader Election

While the API server is active-active (all replicas serve traffic), the scheduler and controller-manager use leader election to ensure only one instance makes decisions at a time. Leader election is implemented using a coordination.k8s.io/v1/Lease object in the kube-system namespace.

# Inspect leader election leases
kubectl get leases -n kube-system
# NAME                      HOLDER                    AGE
# kube-controller-manager   node1.cluster.internal    45d
# kube-scheduler            node1.cluster.internal    45d

# Who currently holds the scheduler lease?
kubectl get lease kube-scheduler -n kube-system -o jsonpath='{.spec.holderIdentity}'

# Leader election config (in controller-manager flags)
# --leader-elect=true (default)
# --leader-elect-lease-duration=15s  (time a lease is valid)
# --leader-elect-renew-deadline=10s  (how long leader retries to renew)
# --leader-elect-retry-period=2s    (how often non-leader tries to acquire)
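The three timing flags are not independent. As far as I am aware, the client-go leader election library rejects a configuration unless the lease duration exceeds the renew deadline and the renew deadline exceeds the retry period multiplied by a jitter factor of 1.2; treat those exact rules as an assumption to verify against your Kubernetes version. The sketch below checks them and gives a rough failover estimate (lease duration plus one retry period). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Arguments are whole seconds: lease-duration, renew-deadline, retry-period.
# Returns 0 if the combination satisfies the (assumed) client-go constraints.
valid_leader_election() {
  local lease=$1 renew=$2 retry=$3
  # lease-duration must be greater than renew-deadline
  (( lease > renew )) || return 1
  # renew-deadline must be greater than retry-period * 1.2
  # (scaled by 10 to stay in integer arithmetic)
  (( renew * 10 > retry * 12 )) || return 1
}

# Rough upper bound on how long a standby waits before taking over after the
# leader dies: the lease must expire, then the standby's next retry must fire.
max_failover_seconds() {
  local lease=$1 retry=$3
  echo $(( lease + retry ))
}
```

With the defaults shown above, `valid_leader_election 15 10 2` succeeds and `max_failover_seconds 15 10 2` prints 17, so a crashed leader is typically replaced within roughly that many seconds.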

7 · Network Topology

A Kubernetes cluster uses three distinct IP address spaces (node, Pod, and Service), plus one well-known address inside the Service range for cluster DNS:

Network	Default range	What it covers	Configured in
Node network	Depends on infrastructure	Physical/VM NIC IPs on each node	Cloud VPC or bare metal network
Pod CIDR	10.244.0.0/16 (Flannel) / 192.168.0.0/16 (Calico)	Pod IP addresses (one unique IP per Pod)	--pod-network-cidr on kubeadm / CNI config
Service CIDR	10.96.0.0/12	ClusterIP virtual IPs for Services	--service-cluster-ip-range on API server
DNS ClusterIP	10.96.0.10	CoreDNS Service IP (fixed)	--cluster-dns on kubelet

IP range overlap causes silent routing failures The Pod CIDR, Service CIDR, and Node network MUST NOT overlap with each other or with on-premises corporate networks that the nodes can reach. A Pod CIDR of 10.244.0.0/16 will cause routing issues if any corporate subnet uses 10.244.x.x. Plan your CIDR allocations carefully before cluster creation — changing them requires a cluster rebuild.
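Overlap between planned ranges can be checked before the cluster is created. The following is a minimal IPv4-only sketch in plain bash; the helper names are hypothetical, and it assumes well-formed input without leading zeros in the octets.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  local -a o
  read -r -a o <<< "$1"
  echo $(( (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3] ))
}

# Print "<first> <last>" address of a CIDR block as integers.
cidr_range() {
  local ip=${1%/*} bits=${1#*/}
  local base mask first last
  base=$(ip_to_int "$ip")
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  first=$(( base & mask ))
  last=$(( first | (~mask & 0xFFFFFFFF) ))
  echo "$first $last"
}

# Return 0 if the two CIDR blocks share at least one address.
cidrs_overlap() {
  local s1 e1 s2 e2
  read -r s1 e1 <<< "$(cidr_range "$1")"
  read -r s2 e2 <<< "$(cidr_range "$2")"
  (( s1 <= e2 && s2 <= e1 ))
}
```

For example, `cidrs_overlap 10.244.0.0/16 10.96.0.0/12` returns non-zero (the default Pod and Service ranges are disjoint), while `cidrs_overlap 10.244.0.0/16 10.244.5.0/24` returns zero, flagging the corporate-subnet conflict described above.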

7.1 Kubernetes Networking Requirements

The Kubernetes networking model mandates:

Every Pod gets a unique, cluster-routable IP (no NAT for Pod-to-Pod traffic within the cluster)
Every Pod can reach every other Pod by IP, regardless of which node they are on
Every node can reach every Pod IP
Pods see their own IP as their IP (no masquerading for inbound)

CNI plugins are responsible for implementing these rules. They achieve this via: VXLAN overlays (Flannel), BGP routing (Calico), eBPF dataplane (Cilium), etc.

8 · Failure Domains and Topology

8.1 Multi-AZ Worker Node Distribution

In cloud deployments, worker nodes should be spread across at least 3 availability zones. Kubernetes provides topology-aware scheduling to enforce this.

# Spread Pods evenly across AZs (topologySpreadConstraints)
spec:
  topologySpreadConstraints:
  - maxSkew: 1                                  # Max imbalance between zones
    topologyKey: topology.kubernetes.io/zone   # Use the zone label on nodes
    whenUnsatisfiable: DoNotSchedule           # Fail scheduling rather than violate
    labelSelector:
      matchLabels:
        app: web
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname        # Also spread across nodes within a zone
    whenUnsatisfiable: ScheduleAnyway          # Best-effort node spread
    labelSelector:                             # Required: without a selector the constraint matches no Pods
      matchLabels:
        app: web
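The effect of maxSkew can be worked through numerically. Skew for a candidate zone is the number of matching Pods that zone would hold after placement, minus the smallest count in any zone. The sketch below is a simplified model of that rule (it ignores node eligibility, minDomains, and other scheduler details), and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Usage: placement_allowed <maxSkew> <target_zone> <zone=count>...
# Returns 0 if adding one Pod to target_zone keeps the skew within maxSkew.
placement_allowed() {
  local max_skew=$1 target=$2
  shift 2
  local pair zone count min="" target_count=""
  for pair in "$@"; do
    zone=${pair%%=*}
    count=${pair#*=}
    # Track the smallest matching-Pod count across all zones.
    if [ -z "$min" ] || (( count < min )); then
      min=$count
    fi
    if [ "$zone" = "$target" ]; then
      target_count=$count
    fi
  done
  # Unknown target zone: report a distinct error code.
  [ -n "$target_count" ] || return 2
  # Skew after placement = (count in target zone + 1) - global minimum.
  (( target_count + 1 - min <= max_skew ))
}
```

With two Pods in zones a and b and one in zone c, `placement_allowed 1 c a=2 b=2 c=1` succeeds, while `placement_allowed 1 a a=2 b=2 c=1` fails because zone a would then exceed zone c by two.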

8.2 Control Plane AZ Distribution

# Verify control plane nodes are spread across AZs
kubectl get nodes -l node-role.kubernetes.io/control-plane \
  -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'

# Example output (good):
# NAME              ZONE
# cp-node-1         us-east-1a
# cp-node-2         us-east-1b
# cp-node-3         us-east-1c

9 · Cluster Sizing Reference

Cluster size	Nodes	Pods	Control plane	etcd	API server instances
Development	1–5	<100	1 node, all-in-one	Single (stacked)	1
Small production	5–20	<500	3 nodes (stacked)	3 members	3 (one per CP node)
Medium production	20–100	<5,000	3 nodes (external etcd recommended)	3–5 dedicated nodes	3
Large production	100–500	<25,000	5 nodes (external etcd)	5 dedicated nodes	3–5 behind LB
Very large	500–5,000	up to 150,000	5+ nodes, tuned API server	5 dedicated high-I/O nodes (SSD)	5+ behind LB + caching

Kubernetes scalability benchmarks (SIG Scalability): Kubernetes is designed and tested for clusters of up to 5,000 nodes, 150,000 total Pods, and 300,000 total containers, with no more than 110 Pods per node. API call latency SLOs (p99): at most 1s for mutating calls and single-object reads, and at most 30s for namespace- or cluster-scoped list calls. Operating near these limits requires careful etcd tuning, sufficient API server memory, and low-latency storage for etcd (NVMe SSD, fsync latency <10ms).
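A capacity plan can be sanity-checked against these thresholds with simple arithmetic. The sketch below uses the limits quoted in this document (5,000 nodes, 150,000 Pods, 110 Pods per node); the variable and function names are hypothetical.

```shell
#!/usr/bin/env bash
# Thresholds as quoted in this document.
MAX_NODES=5000
MAX_PODS=150000
MAX_PODS_PER_NODE=110

# Usage: check_cluster_plan <nodes> <pods>
# Prints one line per violated threshold; returns 0 only if none are violated.
check_cluster_plan() {
  local nodes=$1 pods=$2 status=0
  if (( nodes > MAX_NODES )); then
    echo "nodes: $nodes exceeds $MAX_NODES"
    status=1
  fi
  if (( pods > MAX_PODS )); then
    echo "pods: $pods exceeds $MAX_PODS"
    status=1
  fi
  # Average density must fit under the per-node Pod limit.
  if (( pods > nodes * MAX_PODS_PER_NODE )); then
    echo "density: $pods pods need more than $MAX_PODS_PER_NODE per node on $nodes nodes"
    status=1
  fi
  return $status
}
```

Note that the per-node limit is often the binding constraint: `check_cluster_plan 100 20000` fails on density even though both totals are far below the cluster-wide thresholds.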

10 · Cloud vs Bare Metal Architecture Differences

Concern	Cloud (GKE/EKS/AKS)	Bare Metal / On-Premises
Control plane management	Fully managed — you never see CP nodes	You manage CP nodes, upgrades, HA, backups
etcd management	Managed by cloud provider	You manage backup, recovery, disk sizing
Load balancer	cloud-controller-manager provisions cloud LB automatically for LoadBalancer Services	MetalLB or kube-vip or external F5/HAProxy required
Node provisioning	Cluster Autoscaler calls cloud API (ASG, MIG) to add nodes	Cluster API with MAAS/vSphere provider, or manual
Storage	EBS/GCE PD/Azure Disk CSI drivers built-in	Ceph/Longhorn/NFS CSI, or SAN with custom driver
Networking	VPC-native routing (no overlay needed on GKE, EKS VPC CNI)	Must configure overlay (VXLAN, BGP) or routed fabric
Certificate rotation	Automatic (managed control plane)	Must monitor expiry; kubelet rotates via RotateKubeletClientCertificate
OIDC identity integration	Workload Identity (GKE) / IRSA or Pod Identity (EKS) / Workload Identity (AKS)	OIDC provider + external Vault / Keycloak integration

11 · Component Startup Order

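When bootstrapping or recovering a cluster, components must become healthy in the dependency order below. In kubeadm clusters the kubelet process itself starts first, because it launches etcd and the other control plane components as static Pods: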

etcd: must be healthy first; API server cannot start without etcd
kube-apiserver: reads/writes etcd; all other components depend on it
kube-controller-manager: starts watching API server; leader election requires API server
kube-scheduler: starts watching API server; leader election requires API server
cloud-controller-manager (if present): starts after API server
kubelet on each node: registers with API server; starts the Pods scheduled to its node, including DaemonSet Pods (kube-proxy, CNI agent)
kube-proxy: DaemonSet Pod on the host network, starts once the kubelet is ready
CNI agent (if daemon-based, e.g., Calico node, Cilium agent): DaemonSet Pod on the host network; the node stays NotReady and Pods that need a Pod IP cannot start until it has installed the CNI configuration
CoreDNS: Deployment; its Pods need a Pod IP, so they start only after the CNI is ready, and its Service IP is reachable only once kube-proxy has programmed the Service rules

# Monitor startup sequence on a control plane node
# (applies only where the components run as systemd units)
journalctl -f -u etcd -u kube-apiserver -u kube-controller-manager -u kube-scheduler

# For kubeadm clusters (static Pods in /etc/kubernetes/manifests/):
# These are started by kubelet as static Pods — no systemd units
ls /etc/kubernetes/manifests/
# etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

# Check static Pod health
kubectl get pods -n kube-system | grep -E "etcd|apiserver|scheduler|controller"
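The dependency order above can be turned into a gated readiness check during recovery. This is a sketch under assumptions: it needs kubectl with admin access, and the object names daemonset/kube-proxy and deployment/coredns are the kubeadm defaults, which may differ in other distributions. wait_until and check_startup_order are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Retry a command once per second until it succeeds or the timeout expires.
# Usage: wait_until <timeout_seconds> <command> [args...]
wait_until() {
  local timeout=$1
  shift
  local deadline=$(( SECONDS + timeout ))
  until "$@" >/dev/null 2>&1; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep 1
  done
}

# Walk the dependency order; each step gates the next.
# Assumes kubeadm default object names in kube-system.
check_startup_order() {
  wait_until 120 kubectl get --raw /healthz/etcd \
    || { echo "etcd is not healthy as seen by the API server"; return 1; }
  wait_until 120 kubectl get --raw /readyz \
    || { echo "kube-apiserver is not ready"; return 1; }
  kubectl -n kube-system rollout status daemonset/kube-proxy --timeout=120s \
    || { echo "kube-proxy has not rolled out"; return 1; }
  kubectl -n kube-system rollout status deployment/coredns --timeout=120s \
    || { echo "CoreDNS has not rolled out"; return 1; }
  echo "core components are up"
}
```

Stopping at the first failing step points directly at the component that is blocking everything after it.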

12 · Certificate Topology

Kubernetes uses mutual TLS for all inter-component communication. Understanding the certificate chain is essential for troubleshooting and for manually bootstrapping clusters.

Full certificate authority hierarchy

Kubernetes PKI hierarchy (typical kubeadm cluster):

/etc/kubernetes/pki/
├── ca.crt / ca.key
│   └── Cluster CA — signs:
│       ├── apiserver.crt (server cert for kube-apiserver TLS)
│       ├── apiserver-kubelet-client.crt (API server → kubelet client cert)
│       ├── controller-manager.conf (embedded client cert for controller-manager)
│       ├── scheduler.conf (embedded client cert for scheduler)
│       └── kubelet client certs (in /var/lib/kubelet/pki/ on each node)
│
├── etcd/ca.crt / etcd/ca.key
│   └── etcd CA (separate) — signs:
│       ├── etcd/server.crt (etcd server TLS)
│       ├── etcd/peer.crt (etcd peer replication TLS)
│       └── apiserver-etcd-client.crt (API server → etcd client cert)
│
├── front-proxy-ca.crt / front-proxy-ca.key
│   └── Front-proxy CA — signs:
│       └── front-proxy-client.crt (used for API aggregation layer)
│
└── sa.pub / sa.key
    └── Service Account signing key pair
        (sa.key signs SA tokens; sa.pub used by API server to verify them)

Default certificate validity: 1 year (kubeadm)
CA validity: 10 years
Auto-rotation: kubelet client certs rotate automatically via
  RotateKubeletClientCertificate feature (enabled by default since 1.8)

# Check certificate expiry dates on a control plane node
kubeadm certs check-expiration

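# Rotate certificates manually (kubeadm clusters)
kubeadm certs renew all
# Then restart the control plane static Pods (kube-apiserver, kube-controller-manager,
# kube-scheduler, etcd) so that they load the renewed certificates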

# Check a specific cert
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
# notAfter=Jun  1 00:00:00 2026 GMT

# Check kubelet node cert (on each worker node)
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -subject -dates

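Certificate expiry is a production incident waiting to happen: Kubernetes certificates default to 1-year validity with kubeadm. Set up monitoring with the probe_ssl_earliest_cert_expiry metric from prometheus/blackbox_exporter, or run kubeadm certs check-expiration on a schedule (for example a cron job on each control plane node). Alert at 60 days remaining. Once the certificates expire, components can no longer authenticate to one another and the API server rejects kubectl, so the cluster is effectively inaccessible until they are renewed.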
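Where a metrics pipeline is not yet in place, the 60-day rule can be approximated with a small script. This is a sketch assuming GNU date and the openssl CLI are available on the node; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the notAfter date of a PEM certificate (requires openssl).
cert_not_after() {
  openssl x509 -in "$1" -noout -enddate | cut -d= -f2
}

# Whole days between "now" and a notAfter string such as
# "Jun  1 00:00:00 2026 GMT" (requires GNU date).
# The optional second argument overrides "now" as a Unix epoch, for testing.
days_until_expiry() {
  local not_after=$1 now=${2:-$(date +%s)}
  local expiry
  expiry=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Usage: check_cert <pem_file> [threshold_days]
# Returns non-zero when fewer than threshold_days (default 60) remain.
check_cert() {
  local file=$1 threshold=${2:-60} days
  days=$(days_until_expiry "$(cert_not_after "$file")") || return 2
  echo "$file: $days days remaining"
  (( days >= threshold ))
}
```

For example, `check_cert /etc/kubernetes/pki/apiserver.crt 60 || echo "renew soon"` can be run from cron on each control plane node.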

13 · Single-Node Development Clusters

For local development, all components run on a single machine. The control plane and worker role are combined.

Tool	Mechanism	Use case	Production parity
kind (K8s in Docker)	Each node is a Docker container; uses kubeadm internally	CI/CD pipelines, multi-node local testing	High — uses real kubeadm, real etcd
minikube	Single node VM or Docker driver	Developer laptop, quick prototyping	Medium — AddOns differ from production CNI/CSI
k3s	Single binary bundling the control plane, kubelet, and kube-proxy; SQLite or embedded etcd	Edge, IoT, CI, resource-constrained	Medium (some defaults differ, e.g. Traefik instead of nginx-ingress)
k3d	k3s in Docker containers	Fast local multi-node k3s clusters	Medium
Desktop (Docker/Rancher)	Bundled K8s in Docker Desktop or Rancher Desktop	Developer convenience, integrated with local registry	Low — opinionated defaults

# Create a multi-node kind cluster (3 nodes: 1 control-plane, 2 workers)
cat <<EOF | kind create cluster --name dev --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
EOF

# Load a local image into kind (no registry needed)
kind load docker-image myapp:dev --name dev

14 · Cluster Health Checks

# ---- Overall cluster health ----
kubectl get nodes -o wide                    # All nodes, status, version
kubectl get pods -n kube-system              # All control plane + system Pods

# ---- API server health ----
kubectl get --raw /healthz                   # "ok" (deprecated since 1.16; prefer /livez and /readyz)
kubectl get --raw /healthz/etcd              # "ok"
kubectl get --raw /readyz                    # Checks all health indicators
kubectl get --raw /livez                     # Liveness check

# ---- etcd health (on control plane node) ----
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health

ETCDCTL_API=3 etcdctl ... endpoint status --write-out=table
# Shows: DB size, Raft index, leader, version

# ---- Component health (modern) ----
kubectl get --raw /api/v1/nodes | jq .
kubectl describe node <cp-node> | grep -A5 "Conditions:"

# ---- Scheduler / controller-manager ----
# Their health endpoints are HTTPS-only and bound to localhost by kubeadm, and the
# component images typically ship no shell or wget, so query them from the
# control plane node itself rather than with kubectl exec:
curl -k https://127.0.0.1:10259/healthz      # kube-scheduler
curl -k https://127.0.0.1:10257/healthz      # kube-controller-manager

# ---- Event stream for recent issues ----
kubectl get events --sort-by='.lastTimestamp' -A | tail -30

15 · Architecture Production Checklist

15-point production architecture checklist

3 control plane nodes minimum, spread across 3 AZs. Prefer 5 for critical clusters.
External etcd for large/critical clusters (100+ nodes or high churn). Dedicate high-I/O SSDs with <10ms fsync latency.
Load balancer in front of API servers. The LB VIP must be included in the API server TLS SAN (Subject Alternative Name) or all kubectl connections fail.
Taint control plane nodes with node-role.kubernetes.io/control-plane:NoSchedule to prevent workloads from landing on CP nodes.
Size control plane nodes for API request volume: 4 CPU / 8 GB RAM minimum; 8 CPU / 16 GB for 100+ node clusters; etcd needs dedicated I/O.
Monitor certificate expiry with alerting at ≥60 days before expiry.
Back up etcd daily to off-cluster storage. Test restores quarterly.
Plan CIDR ranges before cluster creation — they cannot be changed without a rebuild.
Audit API server flags: ensure --anonymous-auth=false, --audit-log-path and --audit-policy-file are set, --enable-admission-plugins includes critical plugins.
Enable audit logging. Store logs in an external SIEM. Without an audit policy file no events are recorded, so start with a policy that logs at least request metadata for every request.
Worker node sizing: avoid >110 Pods/node (default kubelet limit). Use at least 4 vCPU / 8 GB RAM for general worker nodes.
Use Cluster Autoscaler with min/max node group bounds to prevent runaway scale-out.
Deploy PodDisruptionBudgets for all critical workloads before the first node drain.
Document your disaster recovery procedure and test it. etcd restore → API server restart → verify all Pods recover.
Use a managed K8s service unless you have a strong reason for self-hosted. The operational burden of managing control planes is significant.

Next Files

Dependency graph — recommended reading order

00-foundations/04-kubernetes-api-model.html — API machinery, resource versioning, watch protocol, etcd key format
01-control-plane/00-control-plane-overview.html — Control plane component deep dives
01-control-plane/01-kube-apiserver.html — API server request pipeline, admission chain
01-control-plane/02-etcd.html — Raft consensus, etcd internals, backup/restore
09-production-operations/01-high-availability.html — Detailed HA setup with kubeadm
11-api-flows/01-pod-creation-sequence.html — End-to-end Pod creation across all components

References

Kubernetes architecture docs — kubernetes.io/docs/concepts/architecture/
kubeadm HA docs — kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/
etcd documentation — etcd.io/docs/
Kubernetes scalability thresholds — github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
PKI certificates documentation — kubernetes.io/docs/setup/best-practices/certificates/
Production Kubernetes — Josh Rosso, Craig Tracey et al., O'Reilly 2021
