Cluster Provisioning
1. Provisioning Options
| Approach | When to Use | Control | Operational Burden |
|---|---|---|---|
| Managed K8s (EKS, GKE, AKS) | Most production workloads; cloud-native orgs | Medium (control plane managed by cloud) | Low (control plane patches automatic) |
| Cluster API | Multi-cloud / on-prem fleet; platform teams managing 10+ clusters | High (full lifecycle via K8s CRDs) | Medium (CAPI controllers to maintain) |
| kubeadm | On-prem bare metal; air-gapped; custom distributions | Full (every component) | High (etcd, certs, upgrades manual) |
| kOps | Self-managed clusters on AWS (alternative to EKS); GCP is also supported | High (AWS is the most mature target) | Medium |
| k3s / k0s / RKE2 | Edge, IoT, dev environments, resource-constrained nodes | High | Low (single binary, no external etcd required) |
| EKS Blueprints / GKE Autopilot | Opinionated starting point for cloud-native teams | Low–medium (opinionated defaults) | Very low |
2. Managed Kubernetes (EKS / GKE / AKS)
AWS EKS — Key Configuration Decisions
# eks cluster creation via eksctl (most common quickstart)
eksctl create cluster \
--name production \
--region us-east-1 \
--version 1.30 \
--nodegroup-name general \
--node-type m6i.2xlarge \
--nodes-min 3 \
--nodes-max 20 \
--managed \
--with-oidc \
--ssh-access=false \
--full-ecr-access \
--alb-ingress-access
EKS Networking Modes
| Mode | CNI | IP Source | Notes |
|---|---|---|---|
| VPC CNI (default) | aws-node (Amazon VPC CNI) | VPC subnet IPs (one secondary ENI IP per pod) | Simple; IP exhaustion risk in small subnets |
| VPC CNI + prefix delegation | aws-node with ENABLE_PREFIX_DELEGATION=true | /28 prefix per ENI slot | 16× more IPs per ENI; requires Nitro instances |
| Custom CNI (Cilium/Calico) | 3rd-party CNI installed as DaemonSet | Overlay or custom CIDR | Network policy, eBPF, encryption; replace aws-node |
# Enable prefix delegation on existing EKS cluster
kubectl set env daemonset aws-node \
-n kube-system \
ENABLE_PREFIX_DELEGATION=true \
WARM_PREFIX_TARGET=1 \
MINIMUM_IP_TARGET=5
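The effect of prefix delegation on pod density can be estimated with the max-pods arithmetic AWS uses for the VPC CNI: ENIs × (IPv4 addresses per ENI − 1) + 2 in secondary-IP mode, with each slot multiplied by 16 when it holds a /28 prefix. The sketch below is illustrative only; the function names are hypothetical, the ENI limits for m6i.2xlarge (4 ENIs, 15 addresses each) are an assumption to check against the EC2 instance-type limits, and the cap of 110 is an assumed kubelet max-pods setting.

```shell
#!/usr/bin/env bash
# Estimate VPC CNI pod capacity per node, with and without prefix delegation.
# Arguments: number of ENIs, IPv4 addresses per ENI (look both up per instance type).

max_pods_secondary_ip() {
  local enis=$1 ips_per_eni=$2
  # One address per ENI is the ENI's own primary IP; +2 covers host-network pods.
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

max_pods_prefix_delegation() {
  local enis=$1 ips_per_eni=$2 cap=${3:-110}
  # Each secondary slot holds a /28 prefix, i.e. 16 addresses.
  local raw=$(( enis * (ips_per_eni - 1) * 16 + 2 ))
  # Cap at the kubelet max-pods value the node will actually run with (assumed 110).
  if (( raw > cap )); then echo "$cap"; else echo "$raw"; fi
}

# Assumed limits for m6i.2xlarge: 4 ENIs, 15 IPv4 addresses per ENI.
echo "secondary IP mode: $(max_pods_secondary_ip 4 15)"
echo "prefix delegation: $(max_pods_prefix_delegation 4 15)"
```

With prefix delegation the address supply usually stops being the limit, and the kubelet max-pods setting becomes the effective ceiling.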
EKS OIDC for IRSA (IAM Roles for Service Accounts)
# Associate OIDC provider (required for IRSA)
eksctl utils associate-iam-oidc-provider \
--cluster production \
--region us-east-1 \
--approve
# Create IAM role for a service account
eksctl create iamserviceaccount \
--name pyroscope \
--namespace observability \
--cluster production \
--role-name pyroscope-s3-role \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
--approve \
--override-existing-serviceaccounts
# The ServiceAccount gets:
# annotations:
# eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/pyroscope-s3-role
EKS Add-ons (managed by AWS)
# List available add-ons
aws eks describe-addon-versions --kubernetes-version 1.30
# Install/update EKS managed add-ons
aws eks create-addon \
--cluster-name production \
--addon-name vpc-cni \
--addon-version v1.18.0-eksbuild.1 \
--resolve-conflicts OVERWRITE
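# Key managed add-ons:
# vpc-cni: AWS VPC CNI plugin
# coredns: cluster DNS (not upgraded automatically with the cluster; bump the add-on version after each control plane upgrade)
# kube-proxy: per-node kube-proxy DaemonSet
# aws-ebs-csi-driver: EBS persistent volumes
# aws-efs-csi-driver: EFS persistent volumes
# Not an EKS managed add-on: aws-load-balancer-controller (ALB/NLB via Ingress/Service) is installed separately, typically with Helm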
GKE — Autopilot vs Standard
| Feature | GKE Standard | GKE Autopilot |
|---|---|---|
| Node management | User manages node pools | Google manages all nodes |
| Billing | Per node (even idle) | Per pod resource request |
| DaemonSets | Full support | Supported with restrictions (no privileged or host-level access by default) |
| Node access | SSH available | No node access |
| Spot nodes | Spot node pool | Spot Pods via nodeSelector (cloud.google.com/gke-spot: "true") |
| Best for | Custom CNI, bare-metal-like control | Cost-optimised, managed clusters |
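# GKE cluster with Workload Identity and private nodes
# With --region the cluster is regional: --num-nodes is per zone (3 per zone, typically 9 nodes across 3 zones)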
gcloud container clusters create production \
--region us-central1 \
--release-channel regular \
--workload-pool=PROJECT_ID.svc.id.goog \
--enable-private-nodes \
--master-ipv4-cidr 172.16.0.0/28 \
--enable-ip-alias \
--enable-shielded-nodes \
--enable-autoupgrade \
--enable-autorepair \
--num-nodes 3 \
--machine-type n2-standard-4 \
--disk-type pd-ssd
AKS — Azure Kubernetes Service
# AKS cluster with system + user node pools
az aks create \
--resource-group myResourceGroup \
--name production \
--kubernetes-version 1.30 \
--node-count 3 \
--node-vm-size Standard_D4s_v3 \
--enable-oidc-issuer \
--enable-workload-identity \
--enable-managed-identity \
--network-plugin azure \
--network-policy calico \
--enable-cluster-autoscaler \
--min-count 3 \
--max-count 20 \
--zones 1 2 3
# Add user node pool (separate from system)
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name production \
--name workload \
--node-count 3 \
--node-vm-size Standard_D8s_v3 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 30 \
--zones 1 2 3 \
--labels workload=general \
--mode User
3. Cluster API (CAPI)
Cluster API is a Kubernetes-native framework for declarative cluster lifecycle management. You run a management cluster and define workload clusters as CRDs. CAPI controllers reconcile the desired cluster state, handling creation, scaling, upgrades, and deletion.
CAPI Architecture
Management Cluster
├── CAPI core controllers (cluster-api)
├── Bootstrap provider (kubeadm, k3s, talos)
├── Control Plane provider (kubeadm CP, RKE2 CP)
└── Infrastructure provider (AWS/CAPA, GCP/CAPG, Azure/CAPZ, vSphere/CAPV)
Object hierarchy:
Cluster ──────── owns ──────── MachineDeployment (workers)
  │                                  │
  ├── KubeadmControlPlane ────── Machine (one per replica)
  │         │                        │
  │         └── AWSMachineTemplate   └── AWSMachine (infra)
  │
  └── AWSCluster (infra: VPC, subnets, SGs, ELB)
CAPI Bootstrap (AWS / CAPA)
# Install clusterctl
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.7.0/clusterctl-linux-amd64 \
-o /usr/local/bin/clusterctl
chmod +x /usr/local/bin/clusterctl
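# Initialize management cluster (runs in existing cluster)
# AWS credentials via environment or IAM role
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
# CAPA needs its IAM prerequisites and base64-encoded credentials
# (clusterawsadm ships with the AWS provider)
clusterawsadm bootstrap iam create-cloudformation-stack
export AWS_B64ENCODED_CREDENTIALS=$(clusterawsadm bootstrap credentials encode-as-profile)
clusterctl init \
  --infrastructure aws \
  --bootstrap kubeadm \
  --control-plane kubeadm
# Generate workload cluster manifest
# (the default flavor emits MachineDeployments, matching the YAML in the next
# section; --flavor machinepool would emit MachinePools instead)
clusterctl generate cluster production \
  --kubernetes-version v1.30.0 \
  --control-plane-machine-count=3 \
  --worker-machine-count=3 \
  > production-cluster.yaml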
kubectl apply -f production-cluster.yaml
# Watch cluster come up
kubectl get cluster production -w
clusterctl describe cluster production
CAPI Cluster YAML (AWS)
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: production
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-control-plane
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: production
spec:
  region: us-east-1
  sshKeyName: platform-key
  network:
    vpc:
      availabilityZoneUsageLimit: 3
    subnets:
      - availabilityZone: us-east-1a
        cidrBlock: 10.0.1.0/24
        isPublic: false
      - availabilityZone: us-east-1b
        cidrBlock: 10.0.2.0/24
        isPublic: false
      - availabilityZone: us-east-1c
        cidrBlock: 10.0.3.0/24
        isPublic: false
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-control-plane
spec:
  replicas: 3
  version: v1.30.0
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSMachineTemplate
      name: production-control-plane
  kubeadmConfigSpec:
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-provider: external
          audit-log-path: /var/log/kubernetes/audit.log
          audit-log-maxage: "30"
          audit-log-maxbackup: "10"
          audit-log-maxsize: "100"
          feature-gates: "ServerSideApply=true"
      etcd:
        local:
          dataDir: /var/lib/etcddisk/etcd
          extraArgs:
            quota-backend-bytes: "8589934592" # 8 GiB
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: production-workers
spec:
  clusterName: production
  replicas: 3
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: production
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: production
    spec:
      version: v1.30.0
      clusterName: production
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: production-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachineTemplate
        name: production-workers
ClusterClass — Reusable Cluster Templates
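ClusterClass (CAPI v1.2+) lets you define a cluster topology once and instantiate it many times with variable overrides, much as a Helm chart does for applications. It sits behind the ClusterTopology feature gate, so set CLUSTER_TOPOLOGY=true before running clusterctl init. Declared variables only change the rendered templates once the ClusterClass also defines patches that consume them.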
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: aws-production-class
spec:
  controlPlane:
    ref:
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlaneTemplate
      name: aws-kcp-template
  infrastructure:
    ref:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSClusterTemplate
      name: aws-cluster-template
  workers:
    machineDeployments:
      - class: general
        template:
          bootstrap:
            ref:
              apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
              kind: KubeadmConfigTemplate
              name: aws-worker-bootstrap
          infrastructure:
            ref:
              apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
              kind: AWSMachineTemplate
              name: aws-worker-template
  variables:
    - name: region
      required: true
      schema:
        openAPIV3Schema:
          type: string
    - name: workerInstanceType
      required: false
      schema:
        openAPIV3Schema:
          type: string
          default: m6i.2xlarge
---
# Instantiate from ClusterClass:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: team-a-cluster
spec:
  topology:
    class: aws-production-class
    version: v1.30.0
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
        - class: general # worker class defined in the ClusterClass
          name: general
          replicas: 5
    variables:
      - name: region
        value: eu-west-1
      - name: workerInstanceType
        value: m6i.4xlarge
4. kubeadm: Self-Managed Clusters
Control Plane Bootstrap
# kubeadm-config.yaml: production control plane configuration
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.30.0
clusterName: production
controlPlaneEndpoint: "k8s-api.internal.example.com:6443" # Load balancer VIP
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
  dnsDomain: "cluster.local"
apiServer:
  extraArgs:
    audit-log-path: /var/log/kubernetes/audit.log
    audit-log-maxage: "30"
    audit-log-maxbackup: "10"
    audit-log-maxsize: "100"
    # Keep on one line: a folded (>-) scalar would insert spaces between plugin names
    enable-admission-plugins: "NodeRestriction,PodSecurity,ResourceQuota,LimitRanger,ServiceAccount,DefaultStorageClass,MutatingAdmissionWebhook,ValidatingAdmissionWebhook"
    oidc-issuer-url: "https://dex.example.com"
    oidc-client-id: "kubernetes"
    oidc-username-claim: "email"
    oidc-groups-claim: "groups"
  extraVolumes:
    - name: audit-log
      hostPath: /var/log/kubernetes
      mountPath: /var/log/kubernetes
      pathType: DirectoryOrCreate
controllerManager:
  extraArgs:
    bind-address: "0.0.0.0"
    node-cidr-mask-size: "24"
etcd:
  local:
    dataDir: /var/lib/etcd
    extraArgs:
      quota-backend-bytes: "8589934592"
      auto-compaction-mode: revision
      auto-compaction-retention: "1000"
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  # kubeadm derives the kubelet's runtime endpoint from criSocket
  criSocket: unix:///var/run/containerd/containerd.sock
  kubeletExtraArgs:
    cloud-provider: ""
# Initialize first control plane node
kubeadm init --config kubeadm-config.yaml --upload-certs
# Output includes:
# kubeadm join k8s-api.internal.example.com:6443 --token ... --discovery-token-ca-cert-hash sha256:...
# --control-plane --certificate-key ...
# JOIN_TOKEN, CA_CERT_HASH and CERTIFICATE_KEY below stand for the values printed by kubeadm init
# Join additional control plane nodes
kubeadm join k8s-api.internal.example.com:6443 \
--token "${JOIN_TOKEN}" \
--discovery-token-ca-cert-hash "sha256:${CA_CERT_HASH}" \
--control-plane \
--certificate-key "${CERTIFICATE_KEY}"
# Join worker nodes
kubeadm join k8s-api.internal.example.com:6443 \
--token "${JOIN_TOKEN}" \
--discovery-token-ca-cert-hash "sha256:${CA_CERT_HASH}"
etcd External Cluster (Production Best Practice)
# For large clusters: run etcd separately from control plane
# Stacked etcd (default): etcd on same nodes as kube-apiserver
# External etcd (recommended for prod): dedicated etcd nodes
# etcd cluster topology:
# - Minimum 3 nodes for quorum (tolerate 1 failure)
# - 5 nodes for higher availability (tolerate 2 failures)
# - Dedicated SSDs (fsync latency < 10ms critical for etcd)
# - Never share etcd nodes with any other workload
# kubeadm-config.yaml (external etcd):
etcd:
  external:
    endpoints:
      - https://etcd-0.internal:2379
      - https://etcd-1.internal:2379
      - https://etcd-2.internal:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
etcd persists every write synchronously (an fsync on each commit), so disk latency directly bounds API server latency. On AWS, use io1/io2 EBS volumes provisioned with 3000+ IOPS, or NVMe instance store (with replication, since instance store does not survive an instance stop or termination). On bare metal, use dedicated NVMe SSDs. Shared network storage (EFS, NFS) will cause etcd leader elections and API server timeouts. Monitor etcd_disk_wal_fsync_duration_seconds and keep p99 below 10ms.
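The member counts quoted above follow from majority-quorum arithmetic: a cluster of N members needs floor(N/2) + 1 of them to agree, so it tolerates N minus that many failures. A minimal sketch of that calculation, together with the byte value used for quota-backend-bytes (the function names are hypothetical helpers, not part of etcd or kubeadm):

```shell
#!/usr/bin/env bash
# Quorum arithmetic for an etcd cluster of N members.

etcd_quorum() {
  # Majority of members: floor(N/2) + 1
  echo $(( $1 / 2 + 1 ))
}

etcd_failure_tolerance() {
  # Members that can be lost while a majority remains: N - quorum
  echo $(( $1 - ($1 / 2 + 1) ))
}

gib_to_bytes() {
  # Byte value for --quota-backend-bytes, given a size in GiB
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

for n in 3 4 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerates=$(etcd_failure_tolerance "$n")"
done
echo "8 GiB quota = $(gib_to_bytes 8) bytes"
```

A fourth member raises the quorum to 3 without tolerating any more failures than three members do, which is why odd cluster sizes are the norm.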
Certificate Management
# Check certificate expiry
kubeadm certs check-expiration
# Renew all certificates (run on each control plane node)
kubeadm certs renew all
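# Auto-renewal: kubeadm renews all certificates during kubeadm upgrade
# For manual renewal, schedule it well before the 1-year expiry (monthly here).
# Restarting the kubelet alone does not restart static pods: kube-apiserver,
# kube-controller-manager, kube-scheduler and etcd must also be restarted
# before they pick up the renewed certificates.
0 0 1 * * /usr/bin/kubeadm certs renew all && systemctl restart kubelet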
5. EKS Blueprints (Terraform)
EKS Blueprints provides opinionated Terraform patterns for provisioning an EKS cluster with add-on management, team RBAC, and GitOps bootstrap. Since v5 the cluster itself is created with the terraform-aws-modules/eks module, whose inputs and outputs are used below, and add-ons come from the separate eks-blueprints-addons module.
# main.tf
module "eks_blueprints" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.0"
cluster_name = "production"
cluster_version = "1.30"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
# Control plane logging
cluster_enabled_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
# IRSA — required for add-ons
enable_irsa = true
# Managed node groups
eks_managed_node_groups = {
general = {
min_size = 3
max_size = 20
desired_size = 5
instance_types = ["m6i.2xlarge"]
capacity_type = "ON_DEMAND"
labels = { workload = "general" }
# No taint on this group: it is the default scheduling target
# (system add-ons are steered to the tainted "system" group below)
}
system = {
min_size = 3
max_size = 6
desired_size = 3
instance_types = ["m6i.xlarge"]
capacity_type = "ON_DEMAND"
labels = { "node.kubernetes.io/purpose" = "system" }
taints = [{
key = "node.kubernetes.io/purpose"
value = "system"
effect = "NO_SCHEDULE"
}]
}
spot = {
min_size = 0
max_size = 50
desired_size = 0
instance_types = ["m6i.2xlarge", "m5.2xlarge", "m5n.2xlarge"]
capacity_type = "SPOT"
labels = { "node.kubernetes.io/capacity-type" = "spot" }
taints = [{
key = "spot"
value = "true"
effect = "NO_SCHEDULE"
}]
}
}
}
# Add-ons module
module "eks_blueprints_addons" {
source = "aws-ia/eks-blueprints-addons/aws"
version = "~> 1.16"
cluster_name = module.eks_blueprints.cluster_name
cluster_endpoint = module.eks_blueprints.cluster_endpoint
cluster_version = module.eks_blueprints.cluster_version
oidc_provider_arn = module.eks_blueprints.oidc_provider_arn
# EKS managed add-ons
eks_addons = {
aws-ebs-csi-driver = {
most_recent = true
service_account_role_arn = module.ebs_csi_irsa_role.iam_role_arn
}
coredns = { most_recent = true }
vpc-cni = { most_recent = true }
kube-proxy = { most_recent = true }
}
# AWS Load Balancer Controller
enable_aws_load_balancer_controller = true
# Cluster Autoscaler (or use Karpenter — see below)
enable_cluster_autoscaler = true
# cert-manager
enable_cert_manager = true
# external-dns
enable_external_dns = true
# Karpenter (run one node autoscaler only: when enabling it, set enable_cluster_autoscaler = false)
enable_karpenter = false
# Metrics Server
enable_metrics_server = true
}
6. Node Pools & Node Groups
Node Group Strategy
Recommended node group topology for production clusters:
┌──────────────────────────────────────────────────────────┐
│ system node group (3 nodes, m6i.xlarge, ON_DEMAND) │
│ Runs: CoreDNS, kube-proxy, CNI, CSI, ingress controller │
│ Taint: node.kubernetes.io/purpose=system:NoSchedule │
├──────────────────────────────────────────────────────────┤
│ general node group (3-20 nodes, m6i.2xlarge, ON_DEMAND) │
│ Runs: most application workloads │
│ No taints — default scheduling target │
├──────────────────────────────────────────────────────────┤
│ observability node group (3 nodes, m6i.4xlarge) │
│ Runs: Prometheus, Loki, Tempo, Grafana (stateful) │
│ Taint: workload=observability:NoSchedule │
├──────────────────────────────────────────────────────────┤
│ spot node group (0-50 nodes, mixed instance types) │
│ Runs: batch jobs, CI runners, ML training │
│ Taint: spot=true:NoSchedule │
├──────────────────────────────────────────────────────────┤
│ gpu node group (0-10 nodes, p3.2xlarge or g5.xlarge) │
│ Runs: GPU workloads, ML inference │
│ Taint: nvidia.com/gpu=present:NoSchedule │
└──────────────────────────────────────────────────────────┘
Node Labels and Taints for Scheduling
# Well-known node labels (hostname, arch and os are set by the kubelet;
# instance type, zone and region come from the cloud provider)
kubernetes.io/hostname: worker-node-1
kubernetes.io/arch: amd64
kubernetes.io/os: linux
node.kubernetes.io/instance-type: m6i.2xlarge
topology.kubernetes.io/zone: us-east-1a
topology.kubernetes.io/region: us-east-1
# EKS specific
eks.amazonaws.com/nodegroup: general
eks.amazonaws.com/capacityType: ON_DEMAND
# Custom labels (set in node group config)
workload: general
node.kubernetes.io/purpose: system
# Targeting observability workloads:
nodeSelector:
  workload: observability
tolerations:
  - key: workload
    operator: Equal
    value: observability
    effect: NoSchedule
Multi-AZ Distribution for Stateful Workloads
# Force pods across AZs using topologySpreadConstraints
# (preferred over podAntiAffinity for large StatefulSets)
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: kafka-broker
# StorageClass with WaitForFirstConsumer (volume follows pod AZ)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3-zone-aware
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer # Critical for multi-AZ
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
reclaimPolicy: Retain
allowVolumeExpansion: true
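To see what the maxSkew: 1 constraint above permits, it helps to work the rule by hand: a pod may land in a zone only if that zone's matching-pod count after placement, minus the lowest count in any zone, stays within maxSkew. The sketch below models just that comparison; can_schedule is a hypothetical helper, and the real scheduler also accounts for minDomains, node affinity, and taints.

```shell
#!/usr/bin/env bash
# Sketch of the maxSkew check for a single topology key (e.g. zone).
# usage: can_schedule MAX_SKEW CANDIDATE_INDEX COUNT_0 COUNT_1 ...
can_schedule() {
  local max_skew=$1 idx=$2
  shift 2
  local counts=("$@")
  local min=${counts[0]} c
  for c in "${counts[@]}"; do
    if (( c < min )); then min=$c; fi
  done
  # Skew after placement: pods in the candidate zone (+1) minus the global minimum.
  if (( counts[idx] + 1 - min <= max_skew )); then
    echo schedulable
  else
    echo blocked
  fi
}

# Three zones currently holding 3, 2 and 2 matching pods, maxSkew=1:
echo "zone 0: $(can_schedule 1 0 3 2 2)"   # 4 - 2 = 2, exceeds maxSkew
echo "zone 1: $(can_schedule 1 1 3 2 2)"   # 3 - 2 = 1, within maxSkew
```

With whenUnsatisfiable: DoNotSchedule, a pod that is blocked in every zone stays Pending, so a zone outage can stall a rollout until capacity returns.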
7. Karpenter: Node Autoprovisioning
Karpenter is an alternative to the Cluster Autoscaler for node-level scaling. Instead of scaling fixed node groups, it watches for unschedulable pods and launches instances sized to fit them, typically in under a minute, driven by NodePool and EC2NodeClass CRDs.
Karpenter vs Cluster Autoscaler
| Feature | Cluster Autoscaler | Karpenter |
|---|---|---|
| Provisioning speed | 2–5 minutes (new ASG instance) | ~45 seconds (direct EC2 RunInstances API) |
| Instance selection | Fixed node group instance type | Flexible: picks optimal from NodePool list |
| Spot handling | Separate spot ASG; manual fallback config | Built-in: weighted spot + on-demand fallback |
| Consolidation | Removes underutilised nodes (basic) | Disruption consolidation: repack pods onto fewer nodes |
| Node lifecycle | ASG manages replacement | Karpenter replaces drifted/old nodes automatically |
| Configuration | Node group config in cloud console/IaC | NodePool + NodeClass CRDs in cluster |
Karpenter Install (EKS)
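# Environment used by the commands below
# (the controller's own IRSA role is assumed to exist already)
export CLUSTER_NAME=production
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export KARPENTER_NAMESPACE=karpenter
# Recent chart releases are tagged without a leading "v"
export KARPENTER_VERSION=0.37.0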
# Create node IAM role (used by nodes Karpenter provisions)
aws iam create-role \
--role-name KarpenterNodeRole-${CLUSTER_NAME} \
--assume-role-policy-document file://node-trust-policy.json
for policy in \
AmazonEKSWorkerNodePolicy \
AmazonEKS_CNI_Policy \
AmazonEC2ContainerRegistryReadOnly \
AmazonSSMManagedInstanceCore; do
aws iam attach-role-policy \
--role-name KarpenterNodeRole-${CLUSTER_NAME} \
--policy-arn arn:aws:iam::aws:policy/${policy}
done
# Create instance profile
aws iam create-instance-profile \
--instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME}
aws iam add-role-to-instance-profile \
--instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
--role-name KarpenterNodeRole-${CLUSTER_NAME}
# Install Karpenter via Helm
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
--version ${KARPENTER_VERSION} \
--namespace ${KARPENTER_NAMESPACE} \
--create-namespace \
--set "settings.clusterName=${CLUSTER_NAME}" \
--set "settings.interruptionQueue=${CLUSTER_NAME}" \
--set controller.resources.requests.cpu=1 \
--set controller.resources.requests.memory=1Gi \
--set controller.resources.limits.cpu=1 \
--set controller.resources.limits.memory=1Gi \
--wait
NodePool and EC2NodeClass
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        workload: general
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"] # prefer spot, fall back to on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"] # Only 6th gen or newer
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small", "medium", "large", "xlarge"] # min 2xlarge
      taints: []
      kubelet:
        maxPods: 110
  disruption:
    # consolidateAfter is omitted: in v1beta1 it is only valid with WhenEmpty
    consolidationPolicy: WhenUnderutilized
    # Replace nodes older than 720h (30 days) for security patching
    expireAfter: 720h
  limits:
    cpu: 1000 # max 1000 vCPUs across all NodePool nodes
    memory: 4000Gi
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Amazon Linux 2 (or Bottlerocket, Ubuntu)
  role: KarpenterNodeRole-production # IAM role name; Karpenter creates the instance profile from it
  subnetSelectorTerms:
    - tags:
        kubernetes.io/cluster/production: owned
        karpenter.sh/discovery: production
  securityGroupSelectorTerms:
    - tags:
        kubernetes.io/cluster/production: owned
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
  # With amiFamily AL2, Karpenter generates the bootstrap.sh call itself and
  # merges any custom userData ahead of it, so userData should not invoke
  # bootstrap.sh. Max pods is set through kubelet.maxPods on the NodePool above.
  tags:
    team: platform
    managed-by: karpenter
Spot Interruption Handling
# Karpenter consumes EC2 Spot interruption notices through an SQS queue
# AWS sends a 2-minute warning → Karpenter cordons and drains the node, honouring PodDisruptionBudgets where the window allows
# Create the SQS queue (its name must match settings.interruptionQueue in the Helm install):
aws sqs create-queue \
--queue-name ${CLUSTER_NAME} \
--attributes '{
"SqsManagedSseEnabled": "true",
"MessageRetentionPeriod": "300"
}'
# EventBridge rules route interruption events to SQS:
# - EC2 Spot Instance Interruption Warning
# - EC2 Instance Rebalance Recommendation
# - EC2 Instance State-change Notification
# (set up via CloudFormation template in Karpenter docs)
8. Cluster Add-ons
Every production cluster needs a standard set of add-ons installed immediately after bootstrap. These should be deployed via GitOps (Argo CD or Flux) — not manual Helm installs.
| Category | Add-on | Purpose | Install Method |
|---|---|---|---|
| Networking | CNI plugin (Cilium/Calico/VPC CNI) | Pod networking | Cluster bootstrap / EKS managed |
| Networking | CoreDNS | Service discovery DNS | kubeadm / EKS managed |
| Networking | Ingress NGINX / AWS ALB Controller | HTTP ingress | Helm via Argo CD |
| Networking | cert-manager | TLS certificate automation | Helm via Argo CD |
| Networking | external-dns | DNS record sync from Ingress/Service | Helm via Argo CD |
| Storage | EBS CSI Driver / GCE PD CSI | Cloud block storage | EKS managed / Helm |
| Storage | EFS CSI Driver | Shared file storage (RWX) | Helm via Argo CD |
| Autoscaling | Karpenter / Cluster Autoscaler | Node scaling | Helm via Argo CD |
| Autoscaling | Metrics Server | HPA/VPA resource metrics | Helm via Argo CD |
| Autoscaling | KEDA | Event-driven autoscaling | Helm via Argo CD |
| Security | Gatekeeper / Kyverno | Policy enforcement | Helm via Argo CD |
| Security | Falco | Runtime security (syscall audit) | Helm via Argo CD |
| Secrets | External Secrets Operator | Sync secrets from Vault/AWS SM | Helm via Argo CD |
| Observability | kube-prometheus-stack | Prometheus + Grafana + rules | Helm via Argo CD |
| Observability | Loki stack | Log aggregation | Helm via Argo CD |
| Observability | OpenTelemetry Operator | Tracing infrastructure | Helm via Argo CD |
Add-on Dependency Ordering (Argo CD Sync Waves)
# Use Argo CD sync waves to control add-on installation order.
# Annotations on Applications or Helm release namespaces:
# Wave -1: CRDs first (cert-manager, Gatekeeper, KEDA, Karpenter)
# Wave 0: Core networking (CNI config, CoreDNS tuning)
# Wave 1: Storage (CSI drivers, StorageClasses)
# Wave 2: Security (Gatekeeper policies, ESO, Falco)
# Wave 3: Ingress (NGINX, cert-manager issuers, external-dns)
# Wave 4: Autoscaling (Karpenter NodePools, KEDA ScaledObjects)
# Wave 5: Observability (Prometheus, Loki, Tempo, Grafana)
# Wave 10: Application namespaces and workloads
# Example Application annotation:
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "2"
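Argo CD applies resources in ascending sync-wave order, with negative waves first. As a rough illustration of the ordering listed above, the sketch below sorts a handful of add-ons by their wave number; the add-on names and the order_by_wave helper are hypothetical, and the input format ("wave name" per line) is only a convenience for this example.

```shell
#!/usr/bin/env bash
# Print add-ons in the order Argo CD would sync them: ascending sync-wave.
# Input lines: "<wave> <add-on>".

order_by_wave() {
  # Numeric sort on the first field so that -1 < 1 < 10; -s keeps ties stable.
  sort -n -k1,1 -s
}

printf '%s\n' \
  "5 kube-prometheus-stack" \
  "-1 cert-manager-crds" \
  "3 ingress-nginx" \
  "1 aws-ebs-csi-driver" \
  "2 gatekeeper-policies" \
  | order_by_wave
```

Leaving gaps between wave numbers, as the list above does with wave 10, makes it easier to slot in new add-ons later without renumbering.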
9. Bootstrap to GitOps
A newly provisioned cluster must be onboarded to the GitOps system so all future configuration is managed declaratively. The bootstrap process converts a "bare" cluster into a GitOps-managed cluster.
# Bootstrap sequence:
# 1. Cluster created (Terraform / CAPI / eksctl)
# 2. kubeconfig obtained and merged
# 3. Argo CD installed (bootstrap — only this one step is imperative)
# 4. Argo CD pointed at the GitOps repo
# 5. App-of-apps / ApplicationSet deploys everything else
# Step 3: Install Argo CD (one-time bootstrap)
kubectl create namespace argocd
kubectl apply -n argocd \
-f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Step 4: Create root Application pointing at cluster's app-of-apps directory
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/platform-gitops
    targetRevision: HEAD
    path: clusters/production/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
EOF
# Step 5: argocd now syncs all add-ons from Git
# The apps/ directory contains Application manifests for each add-on
GitOps Repository Structure for Multi-Cluster
platform-gitops/
├── clusters/
│ ├── production/
│ │ ├── apps/ # Argo CD Applications (root app-of-apps)
│ │ │ ├── cert-manager.yaml
│ │ │ ├── ingress-nginx.yaml
│ │ │ ├── karpenter.yaml
│ │ │ ├── kube-prometheus-stack.yaml
│ │ │ └── ...
│ │ └── values/ # Per-cluster Helm value overrides
│ │ ├── karpenter-values.yaml
│ │ └── prometheus-values.yaml
│ ├── staging/
│ │ └── apps/
│ └── dev/
│ └── apps/
├── add-ons/ # Reusable add-on Helm chart wrappers
│ ├── cert-manager/
│ ├── ingress-nginx/
│ └── kube-prometheus-stack/
└── base/ # Shared Kustomize bases
├── namespaces/
└── rbac/
10. Cluster Upgrade Strategies
As of Kubernetes v1.28, kubelets may run up to three minor versions behind the API server (n-3); earlier releases allowed n-2. Always upgrade the control plane first, then node groups. Never skip minor versions on the control plane (e.g., 1.28 → 1.30 is unsupported; go 1.28 → 1.29 → 1.30). Check API deprecations before each upgrade with pluto, and use the kubectl convert plugin to migrate manifests.
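The two rules above (one minor version per control plane hop, and a bounded kubelet skew) are easy to encode as a pre-flight check. The following is a minimal sketch with hypothetical helper names; it assumes 1.x version strings, and the permitted skew is passed in explicitly because it has changed between Kubernetes releases.

```shell
#!/usr/bin/env bash
# List the minor-version hops needed to move a control plane from one
# Kubernetes 1.x release to another without skipping a minor version.
upgrade_path() {
  local from_minor=${1#1.} to_minor=${2#1.}
  local path=() m
  for (( m = from_minor + 1; m <= to_minor; m++ )); do
    path+=("1.$m")
  done
  echo "${path[*]}"
}

# Succeeds when the kubelet is no newer than the API server and no more
# than MAX_SKEW minor versions behind it.
# usage: kubelet_skew_ok APISERVER_VERSION KUBELET_VERSION MAX_SKEW
kubelet_skew_ok() {
  local apiserver_minor=${1#1.} kubelet_minor=${2#1.} max_skew=$3
  (( kubelet_minor <= apiserver_minor && apiserver_minor - kubelet_minor <= max_skew ))
}

echo "1.28 -> 1.30 via: $(upgrade_path 1.28 1.30)"
if kubelet_skew_ok 1.30 1.28 2; then
  echo "1.28 kubelets are within a skew of 2 under a 1.30 API server"
fi
```

Running the skew check against every node group before each control plane hop shows which groups must be upgraded first.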
Pre-Upgrade Checklist
# 1. Check API deprecations with pluto
pluto detect-all-in-cluster --target-versions k8s=v1.30.0
# 2. Review release notes and deprecated APIs
# https://kubernetes.io/docs/reference/using-api/deprecation-guide/
# 3. Check add-on compatibility matrix
# cert-manager, ingress-nginx, kube-prometheus-stack all have K8s version tables
# 4. Test in non-prod cluster first (same version upgrade path)
# 5. Verify PodDisruptionBudgets allow node drains
kubectl get pdb -A
# PDBs with minAvailable=100% or maxUnavailable=0 will block drain!
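# 6. Check for pods still pulling images from the frozen k8s.gcr.io registry
# (images moved to registry.k8s.io)
kubectl get pods -A -o json | \
jq -r '.items[] | select(any(.spec.containers[]; .image | test("k8s\\.gcr\\.io"))) | "\(.metadata.namespace)/\(.metadata.name)"'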
EKS Control Plane Upgrade
# Upgrade EKS control plane (AWS manages etcd and control plane pods)
aws eks update-cluster-version \
--name production \
--kubernetes-version 1.30
# Watch upgrade status
aws eks describe-cluster \
--name production \
--query 'cluster.status'
# Typically takes tens of minutes. EKS replaces control plane instances in a
# rolling fashion, so the API endpoint stays reachable and existing pods continue running
Node Group Rolling Upgrade
# Option A: Rolling update via eksctl (managed node groups)
eksctl upgrade nodegroup \
--cluster=production \
--name=general \
--kubernetes-version=1.30
# Option B: Blue/green node group (zero downtime, more control)
# 1. Create new node group with new K8s version
eksctl create nodegroup \
--cluster production \
--name general-v130 \
--version 1.30 \
--node-type m6i.2xlarge \
--nodes-min 3 --nodes-max 20
# 2. Cordon old node group (prevent new scheduling)
kubectl cordon -l eks.amazonaws.com/nodegroup=general
# 3. Drain old nodes (respects PDBs)
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=general -o name); do
kubectl drain $node \
--ignore-daemonsets \
--delete-emptydir-data \
--timeout=5m \
--grace-period=30
done
# 4. Delete old node group after all pods migrated
eksctl delete nodegroup \
--cluster production \
--name general \
--drain=false # Already drained above
kubeadm Control Plane Upgrade
# On first control plane node:
# 1. Upgrade kubeadm itself (pkgs.k8s.io packages carry revisions such as 1.30.0-1.1)
apt-get update
apt-get install -y kubeadm='1.30.0-*'
# 2. Verify upgrade plan
kubeadm upgrade plan
# 3. Apply upgrade (upgrades kube-apiserver, kube-controller-manager, kube-scheduler, etcd)
kubeadm upgrade apply v1.30.0
# 4. Upgrade kubelet and kubectl on control plane node
apt-get install -y kubelet='1.30.0-*' kubectl='1.30.0-*'
systemctl daemon-reload
systemctl restart kubelet
# On additional control plane nodes:
apt-get update && apt-get install -y kubeadm='1.30.0-*'
kubeadm upgrade node
apt-get install -y kubelet='1.30.0-*' kubectl='1.30.0-*'
systemctl daemon-reload && systemctl restart kubelet
# Worker nodes, one at a time (NODE_NAME is the node being upgraded):
kubectl drain "${NODE_NAME}" --ignore-daemonsets --delete-emptydir-data
# SSH into worker node:
apt-get update && apt-get install -y kubeadm='1.30.0-*'
kubeadm upgrade node
apt-get install -y kubelet='1.30.0-*'
systemctl daemon-reload && systemctl restart kubelet
# Back on control plane:
kubectl uncordon "${NODE_NAME}"
Karpenter Node Drift — Automatic Replacement
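# Karpenter automatically replaces nodes when:
# 1. The node's AMI no longer matches the EC2NodeClass amiSelectorTerms (drift)
# 2. Node has expired (expireAfter in NodePool)
# 3. Node is underutilised and can be consolidated
# To roll a node on demand, delete its NodeClaim: Karpenter drains the node and
# provisions replacement capacity for any pods that become unschedulable.
# (The karpenter.sh/nodepool-hash-version annotation is maintained by Karpenter
# itself; do not set it by hand.)
kubectl delete nodeclaim "${NODECLAIM_NAME}"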
# Check node replacement progress:
kubectl get nodes -l karpenter.sh/nodepool=general
kubectl get nodeclaim # Karpenter's unit of node lifecycle
11. Multi-Cluster Fleet Management
Argo CD ApplicationSets for Fleet-Wide Add-ons
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-addons
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: production # target all production clusters
  template:
    metadata:
      name: "{{name}}-cert-manager"
    spec:
      project: platform
      source:
        repoURL: https://github.com/myorg/platform-gitops
        targetRevision: HEAD
        path: add-ons/cert-manager
        helm:
          valueFiles:
            - values.yaml
            # A leading "/" resolves from the repository root rather than from path
            - "/clusters/{{name}}/values/cert-manager-values.yaml"
      destination:
        server: "{{server}}"
        namespace: cert-manager
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
Cluster Registration Pattern
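# Register a new cluster with Argo CD management cluster and label it for
# ApplicationSet selection (the first argument is a kubeconfig context name;
# contexts written by aws eks update-kubeconfig are named after the cluster ARN)
argocd cluster add arn:aws:eks:eu-west-1:123456789:cluster/prod-eu \
--name prod-eu \
--label environment=production \
--label region=eu-west-1 \
--label tier=workload \
--grpc-web
# The labels are stored on the cluster Secret in the argocd namespace. That Secret
# has a generated name, so look it up by label rather than assuming it is called prod-eu:
kubectl get secret -n argocd \
-l argocd.argoproj.io/secret-type=cluster --show-labels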
# Now ApplicationSets with clusters generator will automatically
# target this cluster on next reconciliation
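To check which registered clusters a given clusters-generator selector would pick up, one option is to query the cluster Secrets with the same labels. The following is a minimal sketch, assuming Argo CD runs in the `argocd` namespace as in the commands above. The function name `clusters_matching` is hypothetical.

```shell
# clusters_matching LABEL=VALUE [LABEL=VALUE ...]
# Hypothetical helper: lists Argo CD cluster Secrets that carry every given
# label, mirroring what a clusters generator matchLabels selector would see.
clusters_matching() {
  # Argo CD marks cluster Secrets with this well-known label.
  selector="argocd.argoproj.io/secret-type=cluster"
  for label in "$@"; do
    selector="$selector,$label"
  done
  kubectl get secret -n argocd -l "$selector" -o name
}
```

Running `clusters_matching environment=production` after labelling should list the Secret for `prod-eu`. An empty result suggests the label was applied to the wrong object.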
Fleet Configuration Hierarchy
platform-gitops/
├── base/ # All clusters inherit
│ ├── cert-manager/
│ ├── gatekeeper/
│ └── kube-prometheus-stack/
├── overlays/
│ ├── production/ # Production-tier overrides (HA, large resources)
│ ├── staging/ # Staging-tier overrides (smaller, relaxed policies)
│ └── dev/ # Dev overrides (single replica, no PDB)
└── clusters/
├── prod-us-east-1/ # Cluster-specific (region, IRSA ARNs, node types)
├── prod-eu-west-1/
└── staging-us-east-1/
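One way to read this layout is as an ordered list of layers, from most general to most specific, where later layers override earlier ones. The sketch below prints that order for a given cluster and add-on. The function name `layer_paths`, the rule that the cluster name prefix selects the tier (`prod` maps to `production`), and the per-add-on subdirectories under `overlays/` and `clusters/` are assumptions for illustration; the tree above does not prescribe them.

```shell
# layer_paths CLUSTER ADDON
# Hypothetical helper: prints the configuration layers for one add-on on one
# cluster, least specific first (base, then tier overlay, then cluster).
layer_paths() {
  cluster="$1"
  addon="$2"
  prefix="${cluster%%-*}"   # "prod-us-east-1" -> "prod"

  # Assumed naming convention: the cluster name prefix identifies the tier.
  case "$prefix" in
    prod)    tier="production" ;;
    staging) tier="staging" ;;
    dev)     tier="dev" ;;
    *) echo "unknown tier prefix: $prefix" >&2; return 1 ;;
  esac

  printf '%s\n' \
    "base/$addon" \
    "overlays/$tier/$addon" \
    "clusters/$cluster/$addon"
}
```

Calling `layer_paths prod-eu-west-1 cert-manager` lists the base directory first and the cluster-specific directory last, which is the order in which values would be merged under this convention.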
12. Best Practices
1. Use managed K8s unless you have a reason not to
EKS, GKE, and AKS handle etcd backups, control plane upgrades, and API server HA. Self-managed clusters demand significant operational expertise, so choose them only for air-gapped or on-prem environments, or to meet specific compliance requirements.
2. Separate system and application node groups
Reserve a dedicated, tainted system node group for DNS, CNI and CSI controllers, and ingress, and give those add-ons matching tolerations. This keeps noisy application workloads from starving platform add-ons or getting them evicted under node pressure.
3. Never skip a minor version when upgrading
The Kubernetes version skew policy requires upgrading one minor version at a time (for example 1.28 → 1.29 → 1.30). Run pluto before each upgrade to find deprecated or removed API usage, and test every upgrade in staging with the same workloads first.
4. Bootstrap clusters to GitOps immediately
Install Argo CD/Flux as the first post-provision step. All add-ons, RBAC, policies, and namespaces should be managed from Git from day one. Manual configuration creates drift that compounds over time.
5. Use ClusterClass or blueprints for consistency
CAPI ClusterClass or EKS Blueprints modules encode organisational standards (audit logging, OIDC, node sizing, add-ons) once, so they can be instantiated consistently. Every cluster should be provisioned from the same template.
6. Enable etcd encryption at rest
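Configure --encryption-provider-config on kube-apiserver to encrypt Secrets at rest in etcd, using either a local provider such as aesgcm or KMS envelope encryption. On EKS, enable it by setting encryptionConfig with a KMS key on the cluster (secretsEncryption.keyARN in an eksctl config file).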
7. Use Karpenter over Cluster Autoscaler for new EKS clusters
Karpenter typically provisions nodes in about 45 seconds, versus 2–5 minutes for Cluster Autoscaler (CA). It also handles spot interruptions natively and consolidates underutilised nodes automatically. Use CA only if you need non-AWS provider support.
8. Test node drains with PDB validation before every upgrade
Run kubectl get pdb -A before the upgrade and look for any PDB showing 0 allowed disruptions, which usually means maxUnavailable=0, minAvailable=100%, or minAvailable equal to the replica count. These block node drains and stall upgrades. Fix them or temporarily relax them during the maintenance window.
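The PDB check in practice 8 can be scripted so that it runs as a gate before any drain. The following is a minimal sketch, assuming the input is three whitespace-separated columns (namespace, name, allowed disruptions), such as `kubectl get pdb` produces with custom columns. The function name `list_blocking_pdbs` is hypothetical.

```shell
# list_blocking_pdbs: reads "NAMESPACE NAME ALLOWED" rows on stdin and prints
# namespace/name for every PDB that currently allows 0 disruptions.
# Rows with a non-numeric value (e.g. "<none>") or too few columns are skipped.
list_blocking_pdbs() {
  awk 'NF >= 3 && $3 == 0 { print $1 "/" $2 }'
}

# Intended usage against a live cluster (not executed here):
# kubectl get pdb -A --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#   | list_blocking_pdbs
```

A PDB can also report 0 allowed disruptions temporarily, for instance while its pods are not yet ready, so a hit is a prompt to investigate and not necessarily a misconfiguration.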
Coverage Checklist
- Provisioning options comparison table (managed/CAPI/kubeadm/kOps/k3s/blueprints)
- EKS cluster creation via eksctl with key flags
- EKS networking modes: VPC CNI, prefix delegation, custom CNI comparison
- Prefix delegation enablement via kubectl set env
- OIDC / IRSA setup: eksctl associate-iam-oidc-provider + create iamserviceaccount
- EKS managed add-ons: describe-addon-versions + create-addon commands
- GKE Standard vs Autopilot comparison table
- GKE cluster creation with Workload Identity + private nodes
- AKS cluster creation with system + user node pools and availability zones
- CAPI architecture diagram: management cluster / providers / object hierarchy
- clusterctl init + generate cluster + apply commands
- Full CAPI Cluster + AWSCluster + KubeadmControlPlane + MachineDeployment YAML
- ClusterClass CRD for reusable cluster templates + Cluster instance with topology
- kubeadm ClusterConfiguration YAML (audit log, OIDC, etcd quota, admission plugins)
- kubeadm init + control plane join + worker join commands
- External etcd topology (stacked vs external) + etcd disk performance callout
- Certificate expiry check + renewal commands (kubeadm certs)
- EKS Blueprints Terraform module (eks_blueprints + eks_blueprints_addons) with ON_DEMAND/Spot/system node groups
- Node group topology diagram (system/general/observability/spot/GPU)
- Well-known node labels reference (K8s standard + EKS-specific)
- topologySpreadConstraints for multi-AZ StatefulSet distribution
- WaitForFirstConsumer StorageClass for zone-aware volume binding
- Karpenter vs Cluster Autoscaler comparison table
- Karpenter IAM setup + Helm install commands
- NodePool YAML (requirements, disruption, expireAfter, limits)
- EC2NodeClass YAML (AMI, role, subnet/SG selectors, block device, userData)
- Spot interruption SQS queue + EventBridge rules setup
- Cluster add-ons reference table (15 add-ons across 5 categories)
- Argo CD sync waves for add-on dependency ordering (-1 to 10)
- Argo CD bootstrap sequence (install → root app → app-of-apps)
- GitOps repository structure for multi-cluster (clusters/add-ons/base)
- Pre-upgrade checklist: pluto, API deprecations, PDB check, add-on compat
- K8s skew policy callout (no version skipping)
- EKS control plane upgrade (aws eks update-cluster-version)
- Node group blue/green upgrade (create new NG → cordon → drain → delete old)
- kubeadm control plane + worker node upgrade commands
- Karpenter node drift automatic replacement + nodeclaim commands
- ApplicationSet with clusters generator for fleet-wide add-on deployment
- Cluster registration with argocd CLI + label for ApplicationSet selection
- Fleet configuration hierarchy (base/overlays/clusters)
- 8 best practices cards