Cluster Architecture Overview
This document is the architectural map of a Kubernetes cluster — every component, its role, how it communicates with other components, what ports it uses, and how the whole system maintains consistency under failures. Read this before diving into any individual component's deep-dive.
1 · The Two Planes
A Kubernetes cluster is logically split into two planes:
- Control Plane — makes decisions (scheduling, reconciliation, admission), stores state. Brains of the cluster.
- Data Plane (Worker Nodes) — executes workloads. Runs the actual Pods that serve traffic.
In a production cluster these planes run on separate physical/virtual machines.
The control plane should never run user workloads — use NoSchedule taints.
2 · Control Plane Components In Depth
2.1 kube-apiserver
The API server is the central component that every other component talks to. It is itself stateless: all cluster state lives in etcd, and the API server is the only gateway to it. No component talks to another component directly (exception: the API server calls back to the kubelet for exec/logs/port-forward; etcd, in turn, is reached only through the API server).
| Property | Value |
|---|---|
| Port | 6443 (HTTPS). Never 8080 (insecure) in production. |
| Horizontally scalable | Yes — multiple replicas behind a load balancer. All replicas are active (no leader election). |
| Stateful? | No — completely stateless. All state is in etcd. |
| Auth mechanisms | x509 client certificates, bearer tokens, OIDC, webhook, service account tokens |
| Deep-dive | 01-control-plane/01-kube-apiserver.html |
2.2 etcd
etcd is the cluster's persistent memory. Every Kubernetes object (Pod spec, Deployment state, Secret value, Node status, Service endpoint) is persisted as a key-value entry in etcd. The kube-apiserver is the only component that reads from or writes to etcd directly.
| Property | Value |
|---|---|
| Client port | 2379 (only API server connects) |
| Peer port | 2380 (etcd-to-etcd Raft replication) |
| Consensus | Raft; a write commits only once a quorum of floor(n/2)+1 members has acknowledged it |
| Recommended cluster size | 3 members (1 failure tolerance) or 5 members (2 failure tolerance) |
| Deep-dive | 01-control-plane/02-etcd.html |
2.3 kube-scheduler
The scheduler watches for Pods with spec.nodeName == "" (unscheduled)
and writes a nodeName into the Pod spec via a binding API call.
It runs a filter → score → bind pipeline for every Pod.
| Property | Value |
|---|---|
| Metrics/health port | 10259 (HTTPS) |
| Leader election | Yes — only one scheduler instance is active at a time, via Lease API object |
| Talks to | Only kube-apiserver (watch for unscheduled Pods, write bindings) |
| Deep-dive | 01-control-plane/03-kube-scheduler.html |
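The filter → score → bind pipeline can be sketched in a few lines. This is a toy model, not the scheduler's actual plugin framework: the `Node` and `Pod` dataclasses, the CPU-only fit check, and the "most CPU left wins" score are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu_millis: int   # hypothetical: unallocated CPU on the node
    tainted: bool = False  # hypothetical: stands in for a NoSchedule taint


@dataclass
class Pod:
    name: str
    cpu_request_millis: int
    node_name: str = ""    # empty means unscheduled, like spec.nodeName


def schedule(pod: Pod, nodes: list[Node]) -> Optional[str]:
    """Return the chosen node name, or None if no node is feasible."""
    # Filter: drop nodes that cannot run the Pod at all.
    feasible = [n for n in nodes
                if not n.tainted and n.free_cpu_millis >= pod.cpu_request_millis]
    if not feasible:
        return None  # the Pod would stay Pending
    # Score: rank the feasible nodes (here: most CPU left after placement).
    best = max(feasible, key=lambda n: n.free_cpu_millis - pod.cpu_request_millis)
    # Bind: record the decision and account for the consumed resources.
    pod.node_name = best.name
    best.free_cpu_millis -= pod.cpu_request_millis
    return best.name
```

In the real scheduler the bind step is an API call to the kube-apiserver rather than an in-memory assignment, and filtering and scoring are each performed by many plugins.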
2.4 kube-controller-manager
A single binary that embeds 30+ individual controllers, each running its own reconciliation loop. All controllers use the same watch-and-reconcile pattern.
| Property | Value |
|---|---|
| Metrics/health port | 10257 (HTTPS) |
| Leader election | Yes — one active instance at a time via Lease |
| Key controllers included | ReplicationController, ReplicaSet, Deployment, StatefulSet, DaemonSet, Job, CronJob, Node, Endpoint/EndpointSlice, Namespace, PersistentVolume, ServiceAccount, Token, Certificate |
| Deep-dive | 01-control-plane/04-kube-controller-manager.html |
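The watch-and-reconcile pattern boils down to comparing desired state with observed state and emitting the actions that close the gap. The sketch below shows one reconciliation pass for a replica count; the `reconcile` function and its string "actions" are hypothetical simplifications, since real controllers are driven by watch events and work queues and act through the API server.

```python
def reconcile(desired_replicas: int, running_pods: list[str]) -> list[str]:
    """Return the actions needed to move actual state toward desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few Pods: create the missing ones.
        return ["create"] * diff
    if diff < 0:
        # Too many Pods: delete the surplus (here simply the last ones listed).
        return [f"delete {name}" for name in running_pods[diff:]]
    # Already converged: nothing to do.
    return []
```

Running the same pass again after the actions have been applied returns an empty list, which is the property that makes the loop safe to repeat.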
2.5 cloud-controller-manager
Separates cloud-provider-specific logic from the core Kubernetes code. Manages: cloud load balancers (LoadBalancer type Services), cloud node metadata (instance types, zones), and cloud routes (for overlay-less networking on GCE/AWS).
| Property | Value |
|---|---|
| Port | 10258 (HTTPS) |
| Leader election | Yes |
| Absent on | bare metal clusters without a cloud provider integration |
| Deep-dive | 01-control-plane/05-cloud-controller-manager.html |
3 · Worker Node Components In Depth
3.1 kubelet
The kubelet is the primary node agent. It is the bridge between the Kubernetes API and the container runtime. It watches the API server for Pods assigned to its node, drives the CRI to create/destroy containers, runs probes, and reports status back.
| Property | Value |
|---|---|
| HTTPS API port | 10250 — API server calls back here for exec/logs/port-forward |
| Read-only port | 10255 (deprecated, disabled by default since 1.16) |
| Healthz port | 10248 |
| Talks to | kube-apiserver (watch + status updates), CRI socket (containerd/CRI-O), CNI plugins |
| Deep-dive | 02-node-components/01-kubelet.html |
3.2 kube-proxy
kube-proxy watches Service and EndpointSlice objects and programs the host's networking stack (iptables rules, IPVS virtual servers, or nftables) to implement the Service virtual IP abstraction.
| Property | Value |
|---|---|
| Healthz port | 10256 |
| Modes | iptables (default on Linux), IPVS (large clusters), nftables (beta in 1.31). eBPF is not a kube-proxy mode: Cilium can replace kube-proxy entirely with an eBPF dataplane |
| Deep-dive | 02-node-components/02-kube-proxy.html |
3.3 Container Runtime (CRI)
The CRI-compliant runtime (containerd or CRI-O) receives instructions from the kubelet via gRPC, pulls images, manages container lifecycle, and interacts with the OCI low-level runtime (runc).
3.4 CNI Plugin
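The CNI plugin (Calico, Cilium, Flannel, Weave, etc.) is invoked by the container runtime when the kubelet asks it, over the CRI, to create a Pod sandbox. It assigns a Pod IP, creates the veth pair, and programs routes. The CNI plugin itself is not a running daemon: it is a set of binary executables in /opt/cni/bin/ called by the runtime. Many implementations additionally run a per-node agent as a DaemonSet (e.g., Calico node, Cilium agent).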
4 · Component Communication Matrix
A critical production security concept: understanding exactly which components talk to which, over which port, authenticated with which certificate.
| Source | Destination | Protocol/Port | Auth method | Direction |
|---|---|---|---|---|
| kubectl / any client | kube-apiserver | HTTPS :6443 | x509 cert, OIDC token, bearer token | Client → API server |
| kube-apiserver | etcd | HTTPS :2379 | x509 client cert (etcd CA signed) | API server → etcd only |
| kube-scheduler | kube-apiserver | HTTPS :6443 | x509 client cert (system:kube-scheduler) | scheduler → API server (watch + write binding) |
| kube-controller-manager | kube-apiserver | HTTPS :6443 | x509 client cert (system:kube-controller-manager) | controller-manager → API server |
| cloud-controller-manager | kube-apiserver | HTTPS :6443 | x509 client cert | ccm → API server |
| kubelet | kube-apiserver | HTTPS :6443 | x509 client cert (system:node:<nodeName>, bootstrapped via TLS bootstrap or kubeadm) | kubelet → API server (watch Pods, update status) |
| kube-apiserver | kubelet | HTTPS :10250 | x509 client cert (apiserver-kubelet-client.crt); the API server verifies the kubelet's serving cert only when --kubelet-certificate-authority is set | API server → kubelet (exec, logs, port-forward) |
| kube-proxy | kube-apiserver | HTTPS :6443 | x509 client cert or ServiceAccount token | kube-proxy → API server (watch Services/EndpointSlices) |
| CoreDNS | kube-apiserver | HTTPS :6443 | ServiceAccount token (RBAC: get/list/watch Services/Endpoints) | CoreDNS → API server |
| etcd peer | etcd peer | HTTPS :2380 | Mutual TLS (etcd peer CA) | Raft replication between etcd members |
| kube-apiserver | Admission webhook | HTTPS (webhook TLS) | Webhook server cert trusted by API server caBundle field | API server → webhook server |
5 · Complete Port Reference
| Component | Port | Protocol | Purpose | Firewall rule needed? |
|---|---|---|---|---|
| kube-apiserver | 6443 | HTTPS | Kubernetes API (all clients) | Yes — from all nodes + kubectl clients |
| etcd client | 2379 | HTTPS | etcd API (API server only) | Yes — from control plane nodes only |
| etcd peer | 2380 | HTTPS | Raft replication between etcd members | Yes — between control plane nodes only |
| kube-scheduler | 10259 | HTTPS | Healthz + metrics (Prometheus scrape) | Optional — monitoring only |
| kube-controller-manager | 10257 | HTTPS | Healthz + metrics | Optional — monitoring only |
| cloud-controller-manager | 10258 | HTTPS | Healthz + metrics | Optional — monitoring only |
| kubelet | 10250 | HTTPS | kubelet API (exec/logs/port-forward) | Yes — from control plane nodes |
| kubelet healthz | 10248 | HTTP | Local healthcheck only | No — local loopback only |
| kube-proxy healthz | 10256 | HTTP | Health probe | Usually no; allow it only from load balancers that health-check nodes on this port |
| NodePort range | 30000–32767 | TCP/UDP | External Service exposure | Yes — from external clients |
| CoreDNS | 53 (Pod IP) | UDP/TCP | DNS resolution for cluster DNS | No — internal Pod traffic only |
| metrics-server | 443 (Service) | HTTPS | Resource metrics (kubectl top) | No — internal |
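When validating firewall rules against the table above, a plain TCP connect test is often enough to tell "blocked" from "reachable". The helper below is a minimal sketch using only the standard library; the function name `port_open` is an assumption, and the check says nothing about TLS, authentication, or UDP services such as DNS on port 53.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, calling `port_open(cp_host, 6443)` from a worker node should return True, while `port_open(cp_host, 2379)` from the same node should return False if etcd is correctly restricted to control plane nodes (`cp_host` is a placeholder for your control plane address).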
6 · High Availability Architecture
6.1 Stacked vs. External etcd
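There are two HA topologies for the control plane:
- Stacked etcd: each control plane node runs an etcd member alongside the API server, scheduler, and controller-manager. Fewer machines are needed, but losing a node loses both a control plane instance and an etcd member.
- External etcd: etcd runs on its own dedicated nodes, and the API servers connect to it over the client port (2379). More machines are needed, but control plane and etcd failures are decoupled.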
6.2 Quorum Math and Failure Tolerance
| etcd members | Quorum required | Failure tolerance | Recommendation |
|---|---|---|---|
| 1 | 1 | 0 | Development only — any failure loses the cluster |
| 2 | 2 | 0 | Worse than 1 — both must agree, but 1 failure still breaks quorum |
| 3 | 2 | 1 | Minimum production HA — standard for most clusters |
| 5 | 3 | 2 | Large/critical clusters, concurrent failure scenarios |
| 7 | 4 | 3 | Very large clusters; write latency increases — rarely worth it |
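The table follows directly from the Raft majority rule. As a small sketch (the function names are illustrative), quorum is floor(n/2)+1 and failure tolerance is whatever remains:

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3, 5, 7):
    print(f"{n} members: quorum={quorum(n)}, tolerates {failure_tolerance(n)} failure(s)")
```

Note that an even member count never improves tolerance over the next lower odd count, which is why 2-member and 4-member clusters are avoided.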
6.3 Controller/Scheduler Leader Election
While the API server is active-active (all replicas serve traffic), the scheduler
and controller-manager use leader election to ensure only one instance
makes decisions at a time. Leader election is implemented using a
coordination.k8s.io/v1/Lease object in the kube-system namespace.
# Inspect leader election leases
kubectl get leases -n kube-system
# NAME HOLDER AGE
# kube-controller-manager node1.cluster.internal 45d
# kube-scheduler node1.cluster.internal 45d
# Who currently holds the scheduler lease?
kubectl get lease kube-scheduler -n kube-system -o jsonpath='{.spec.holderIdentity}'
# Leader election config (in controller-manager flags)
# --leader-elect=true (default)
# --leader-elect-lease-duration=15s (time a lease is valid)
# --leader-elect-renew-deadline=10s (how long leader retries to renew)
# --leader-elect-retry-period=2s (how often non-leader tries to acquire)
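The effect of the lease duration can be illustrated with a simplified model of the acquire-or-renew decision. This is a sketch under stated assumptions, not the client-go implementation: the `Lease` dataclass and `try_acquire_or_renew` function are hypothetical, time is passed in explicitly, and the optimistic-concurrency check that the real Lease update relies on is omitted.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str = ""
    renew_time: float = 0.0   # seconds on a caller-supplied clock
    duration: float = 15.0    # mirrors --leader-elect-lease-duration


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if candidate is the leader after this attempt."""
    expired = now - lease.renew_time > lease.duration
    if lease.holder in ("", candidate) or expired:
        # Unheld, already ours, or abandoned by a dead leader: take or renew it.
        lease.holder = candidate
        lease.renew_time = now
        return True
    # Someone else holds a still-valid lease: stay passive and retry later.
    return False
```

A standby instance calling this every retry period takes over only after the current holder has failed to renew for longer than the lease duration, which bounds failover time.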
7 · Network Topology
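A Kubernetes cluster uses three distinct IP address spaces, plus one well-known address inside the Service CIDR (the DNS ClusterIP). The ranges shown are common defaults (kubeadm, Flannel, Calico), not fixed values: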
| Network | Default range | What it covers | Configured in |
|---|---|---|---|
| Node network | Depends on infrastructure | Physical/VM NIC IPs on each node | Cloud VPC or bare metal network |
| Pod CIDR | 10.244.0.0/16 (Flannel) / 192.168.0.0/16 (Calico) | Pod IP addresses (one unique IP per Pod) | --pod-network-cidr on kubeadm / CNI config |
| Service CIDR | 10.96.0.0/12 | ClusterIP virtual IPs for Services | --service-cluster-ip-range on API server |
| DNS ClusterIP | 10.96.0.10 | CoreDNS Service IP (fixed) | --cluster-dns on kubelet |
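Because these ranges must not collide, it can help to sanity-check a CIDR plan before creating the cluster. The sketch below uses Python's standard `ipaddress` module; the function name `check_cluster_cidrs` and the example node network are assumptions for illustration.

```python
import ipaddress


def check_cluster_cidrs(node_cidr: str, pod_cidr: str,
                        service_cidr: str, dns_ip: str) -> list[str]:
    """Return a list of problems; an empty list means the plan is consistent."""
    nets = {
        "node": ipaddress.ip_network(node_cidr),
        "pod": ipaddress.ip_network(pod_cidr),
        "service": ipaddress.ip_network(service_cidr),
    }
    problems = []
    names = list(nets)
    # Every pair of networks must be disjoint.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(f"{a} network overlaps {b} network")
    # The DNS ClusterIP is allocated from the Service CIDR.
    if ipaddress.ip_address(dns_ip) not in nets["service"]:
        problems.append("DNS ClusterIP is outside the Service CIDR")
    return problems


# Hypothetical node network combined with the defaults from the table.
print(check_cluster_cidrs("172.16.0.0/24", "10.244.0.0/16", "10.96.0.0/12", "10.96.0.10"))
```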
7.1 Kubernetes Networking Requirements
The Kubernetes networking model mandates:
- Every Pod gets a unique, cluster-routable IP (no NAT for Pod-to-Pod traffic within the cluster)
- Every Pod can reach every other Pod by IP, regardless of which node they are on
- Every node can reach every Pod IP
- Pods see their own IP as their IP (no masquerading for inbound)
CNI plugins are responsible for implementing these rules. They achieve this via: VXLAN overlays (Flannel), BGP routing (Calico), eBPF dataplane (Cilium), etc.
8 · Failure Domains and Topology
8.1 Multi-AZ Worker Node Distribution
In cloud deployments, worker nodes should be spread across at least 3 availability zones. Kubernetes provides topology-aware scheduling to enforce this.
# Spread Pods evenly across AZs (topologySpreadConstraints)
spec:
  topologySpreadConstraints:
  - maxSkew: 1                                  # Max imbalance between zones
    topologyKey: topology.kubernetes.io/zone    # Use the zone label on nodes
    whenUnsatisfiable: DoNotSchedule            # Fail scheduling rather than violate
    labelSelector:
      matchLabels:
        app: web
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname         # Also spread across nodes within a zone
    whenUnsatisfiable: ScheduleAnyway           # Best-effort node spread
    labelSelector:                              # Selects which Pods are counted for this constraint
      matchLabels:
        app: web
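To see what maxSkew: 1 with DoNotSchedule permits, the check can be modelled as: a zone is acceptable if its matching-Pod count after placement exceeds the global minimum by at most maxSkew. The function below is a simplified sketch of that rule (its name is an assumption, and it ignores refinements such as minDomains and node affinity filtering).

```python
def zones_satisfying_max_skew(matching_pods_per_zone: dict[str, int],
                              max_skew: int = 1) -> list[str]:
    """Return the zones where one more matching Pod keeps skew within max_skew."""
    global_min = min(matching_pods_per_zone.values())
    return [zone for zone, count in matching_pods_per_zone.items()
            if count + 1 - global_min <= max_skew]


# With 2/1/1 Pods per zone, the next Pod may only land in the two emptier zones.
print(zones_satisfying_max_skew({"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}))
```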
8.2 Control Plane AZ Distribution
# Verify control plane nodes are spread across AZs
kubectl get nodes -l node-role.kubernetes.io/control-plane \
-o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
# Example output (good):
# NAME ZONE
# cp-node-1 us-east-1a
# cp-node-2 us-east-1b
# cp-node-3 us-east-1c
9 · Cluster Sizing Reference
| Cluster size | Nodes | Pods | Control plane | etcd | API server instances |
|---|---|---|---|---|---|
| Development | 1–5 | <100 | 1 node, all-in-one | Single (stacked) | 1 |
| Small production | 5–20 | <500 | 3 nodes (stacked) | 3 members | 3 (one per CP node) |
| Medium production | 20–100 | <5,000 | 3 nodes (external etcd recommended) | 3–5 dedicated nodes | 3 |
| Large production | 100–500 | <25,000 | 5 nodes (external etcd) | 5 dedicated nodes | 3–5 behind LB |
| Very large | 500–5,000 | up to 150,000 | 5+ nodes, tuned API server | 5 dedicated high-I/O nodes (SSD) | 5+ behind LB + caching |
10 · Cloud vs Bare Metal Architecture Differences
| Concern | Cloud (GKE/EKS/AKS) | Bare Metal / On-Premises |
|---|---|---|
| Control plane management | Fully managed — you never see CP nodes | You manage CP nodes, upgrades, HA, backups |
| etcd management | Managed by cloud provider | You manage backup, recovery, disk sizing |
| Load balancer | cloud-controller-manager provisions cloud LB automatically for LoadBalancer Services | MetalLB or kube-vip or external F5/HAProxy required |
| Node provisioning | Cluster Autoscaler calls cloud API (ASG, MIG) to add nodes | Cluster API with MAAS/vSphere provider, or manual |
| Storage | EBS/GCE PD/Azure Disk CSI drivers built-in | Ceph/Longhorn/NFS CSI, or SAN with custom driver |
| Networking | VPC-native routing (no overlay needed on GKE, EKS VPC CNI) | Must configure overlay (VXLAN, BGP) or routed fabric |
| Certificate rotation | Automatic (managed control plane) | Must monitor expiry; kubelet rotates via RotateKubeletClientCertificate |
| OIDC identity integration | Workload Identity (GKE, AKS) / IRSA or EKS Pod Identity (EKS) | OIDC provider + external Vault / Keycloak integration |
11 · Component Startup Order
When bootstrapping or recovering a cluster, components must start in the correct order:
- etcd: must be healthy first; API server cannot start without etcd
- kube-apiserver: reads/writes etcd; all other components depend on it
- kube-controller-manager: starts watching API server; leader election requires API server
- kube-scheduler: starts watching API server; leader election requires API server
- cloud-controller-manager (if present): starts after API server
- kubelet on each node: registers with API server, then starts the Pods assigned to its node, including DaemonSet Pods (kube-proxy, CNI agent)
- kube-proxy: DaemonSet Pod, starts after kubelet is ready
- CNI agent (if daemon-based, e.g., Calico node, Cilium agent): DaemonSet, started by kubelet; Pods that need a Pod IP cannot start until the CNI plugin is in place
- CoreDNS: Deployment, starts once Pod networking is available and kube-proxy is programming Service rules
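The ordering above is a dependency graph, and a valid startup sequence is any topological order of it. The sketch below encodes the dependencies with the standard library's `graphlib`; the edge set is an interpretation of the list (in particular, the CoreDNS dependency on the CNI agent is an assumption based on CoreDNS Pods needing a Pod network), and the names are plain labels rather than anything Kubernetes reads.

```python
from graphlib import TopologicalSorter

# component -> set of components that must be up before it
DEPENDS_ON = {
    "kube-apiserver": {"etcd"},
    "kube-controller-manager": {"kube-apiserver"},
    "kube-scheduler": {"kube-apiserver"},
    "kubelet": {"kube-apiserver"},
    "kube-proxy": {"kubelet"},
    "cni-agent": {"kubelet"},
    "coredns": {"kube-proxy", "cni-agent"},
}


def startup_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid startup order; raises CycleError on circular dependencies."""
    return list(TopologicalSorter(deps).static_order())


print(startup_order(DEPENDS_ON))
```

Components with no dependency between them (for example the scheduler and the controller-manager) may appear in either order.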
# Monitor startup sequence on a control plane node
journalctl -u etcd -f &
journalctl -u kube-apiserver -f &
journalctl -u kube-controller-manager -f &
journalctl -u kube-scheduler -f &
# For kubeadm clusters (static Pods in /etc/kubernetes/manifests/):
# These are started by kubelet as static Pods — no systemd units
ls /etc/kubernetes/manifests/
# etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml
# Check static Pod health
kubectl get pods -n kube-system | grep -E "etcd|apiserver|scheduler|controller"
12 · Certificate Topology
Kubernetes uses mutual TLS for all inter-component communication. Understanding the certificate chain is essential for troubleshooting and for manually bootstrapping clusters.
Full certificate authority hierarchy
Kubernetes PKI hierarchy (typical kubeadm cluster):
/etc/kubernetes/pki/
├── ca.crt / ca.key
│ └── Cluster CA — signs:
│ ├── apiserver.crt (server cert for kube-apiserver TLS)
│ ├── apiserver-kubelet-client.crt (API server → kubelet client cert)
│ ├── controller-manager.conf (embedded client cert for controller-manager)
│ ├── scheduler.conf (embedded client cert for scheduler)
│ └── kubelet client certs (in /var/lib/kubelet/pki/ on each node)
│
├── etcd/ca.crt / etcd/ca.key
│ └── etcd CA (separate) — signs:
│ ├── etcd/server.crt (etcd server TLS)
│ ├── etcd/peer.crt (etcd peer replication TLS)
│ └── apiserver-etcd-client.crt (API server → etcd client cert)
│
├── front-proxy-ca.crt / front-proxy-ca.key
│ └── Front-proxy CA — signs:
│ └── front-proxy-client.crt (used for API aggregation layer)
│
└── sa.pub / sa.key
└── Service Account signing key pair
(sa.key signs SA tokens; sa.pub used by API server to verify them)
Default certificate validity: 1 year (kubeadm)
CA validity: 10 years
Auto-rotation: kubelet client certs rotate automatically via
RotateKubeletClientCertificate feature (enabled by default since 1.8)
# Check certificate expiry dates on a control plane node
kubeadm certs check-expiration
# Rotate certificates manually (kubeadm clusters)
kubeadm certs renew all
# Then restart the control plane static Pods so they load the renewed certs
# Check a specific cert
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
# notAfter=Jun 1 00:00:00 2026 GMT
# Check kubelet node cert (on each worker node)
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -subject -dates
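Monitor certificate expiry, for example with the probe_ssl_earliest_cert_expiry metric from prometheus/blackbox_exporter, or by running kubeadm certs check-expiration on a schedule (e.g., in a CronJob).
Alert at 60 days remaining. Once the control plane certificates expire, components can no longer authenticate to each other and the entire cluster becomes inaccessible.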
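The 60-day alert threshold can be checked from the `notAfter` value that `openssl x509 -noout -dates` prints. This is a minimal sketch, assuming an English locale for month names; the function names are illustrative, and the current time is passed in explicitly so the calculation is reproducible.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Parse an openssl notAfter value (e.g. 'Jun  1 00:00:00 2026 GMT')."""
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    # openssl reports GMT, so attach UTC explicitly before subtracting.
    return (expiry.replace(tzinfo=timezone.utc) - now).days


def needs_alert(not_after: str, now: datetime, threshold_days: int = 60) -> bool:
    """True once the certificate is within threshold_days of expiring."""
    return days_until_expiry(not_after, now) <= threshold_days


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(days_until_expiry("Jun  1 00:00:00 2026 GMT", now))
```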
13 · Single-Node Development Clusters
For local development, all components run on a single machine. The control plane and worker role are combined.
| Tool | Mechanism | Use case | Production parity |
|---|---|---|---|
| kind (K8s in Docker) | Each node is a Docker container; uses kubeadm internally | CI/CD pipelines, multi-node local testing | High — uses real kubeadm, real etcd |
| minikube | Single node VM or Docker driver | Developer laptop, quick prototyping | Medium — AddOns differ from production CNI/CSI |
| k3s | Single binary; SQLite (default) or embedded etcd; bundles containerd, Flannel, and kube-proxy | Edge, IoT, CI, resource-constrained | Medium: some defaults differ (Traefik instead of nginx-ingress) |
| k3d | k3s in Docker containers | Fast local multi-node k3s clusters | Medium |
| Desktop (Docker/Rancher) | Bundled K8s in Docker Desktop or Rancher Desktop | Developer convenience, integrated with local registry | Low — opinionated defaults |
# Create a multi-node kind cluster (3 nodes: 1 control-plane, 2 workers)
cat <<EOF | kind create cluster --name dev --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
EOF
# Load a local image into kind (no registry needed)
kind load docker-image myapp:dev --name dev
14 · Cluster Health Checks
# ---- Overall cluster health ----
kubectl get nodes -o wide # All nodes, status, version
kubectl get pods -n kube-system # All control plane + system Pods
# ---- API server health ----
kubectl get --raw /healthz # "ok" (deprecated since v1.16; prefer /livez and /readyz)
kubectl get --raw /healthz/etcd # "ok"
kubectl get --raw /readyz # Checks all health indicators
kubectl get --raw /livez # Liveness check
# ---- etcd health (on control plane node) ----
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
endpoint health
ETCDCTL_API=3 etcdctl ... endpoint status --write-out=table
# Shows: DB size, Raft index, leader, version
# ---- Component health (modern) ----
kubectl get --raw /api/v1/nodes | jq .
kubectl describe node <cp-node> | grep -A5 "Conditions:"
# ---- Scheduler / controller-manager ----
# Both serve /healthz over HTTPS (not HTTP). Run these on the control plane node itself:
# the component images are minimal and typically ship no shell or wget for kubectl exec.
curl -k https://127.0.0.1:10259/healthz   # kube-scheduler
curl -k https://127.0.0.1:10257/healthz   # kube-controller-manager
# ---- Event stream for recent issues ----
kubectl get events --sort-by='.lastTimestamp' -A | tail -30
15 · Architecture Production Checklist
15-point production architecture checklist
- 3 control plane nodes minimum, spread across 3 AZs. Prefer 5 for critical clusters.
- External etcd for large/critical clusters (100+ nodes or high churn). Dedicate high-I/O SSDs with <10ms fsync latency.
- Load balancer in front of API servers. The LB VIP (or DNS name) must be included in the API server certificate's SANs (Subject Alternative Names), or clients connecting through the LB fail TLS verification.
- Taint control plane nodes with node-role.kubernetes.io/control-plane:NoSchedule to prevent workloads from landing on CP nodes.
- Size control plane nodes for API request volume: 4 CPU / 8 GB RAM minimum; 8 CPU / 16 GB for 100+ node clusters; etcd needs dedicated I/O.
- Monitor certificate expiry with alerting at ≥60 days before expiry.
- Back up etcd daily to off-cluster storage. Test restores quarterly.
- Plan CIDR ranges before cluster creation — they cannot be changed without a rebuild.
- Audit API server flags: ensure --anonymous-auth=false, --audit-log-path is set, and --enable-admission-plugins includes critical plugins.
- Enable audit logging and store logs in an external SIEM. No events are logged unless --audit-policy-file is set; a policy at the Metadata level is a reasonable baseline.
- Worker node sizing: avoid >110 Pods/node (default kubelet limit). Use at least 4 vCPU / 8 GB RAM for general worker nodes.
- Use Cluster Autoscaler with min/max node group bounds to prevent runaway scale-out.
- Deploy PodDisruptionBudgets for all critical workloads before the first node drain.
- Document your disaster recovery procedure and test it. etcd restore → API server restart → verify all Pods recover.
- Use a managed K8s service unless you have a strong reason for self-hosted. The operational burden of managing control planes is significant.
Next Files
Dependency graph — recommended reading order
- 00-foundations/04-kubernetes-api-model.html — API machinery, resource versioning, watch protocol, etcd key format
- 01-control-plane/00-control-plane-overview.html — Control plane component deep dives
- 01-control-plane/01-kube-apiserver.html — API server request pipeline, admission chain
- 01-control-plane/02-etcd.html — Raft consensus, etcd internals, backup/restore
- 09-production-operations/01-high-availability.html — Detailed HA setup with kubeadm
- 11-api-flows/01-pod-creation-sequence.html — End-to-end Pod creation across all components
References
- Kubernetes architecture docs — kubernetes.io/docs/concepts/architecture/
- kubeadm HA docs — kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/
- etcd documentation — etcd.io/docs/
- Kubernetes scalability thresholds — github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
- PKI certificates documentation — kubernetes.io/docs/setup/best-practices/certificates/
- Production Kubernetes — Josh Rosso, Craig Tracey et al., O'Reilly 2021