Cluster Provisioning

1. Provisioning Options

Approach | When to Use | Control | Operational Burden
Managed K8s (EKS, GKE, AKS) | Most production workloads; cloud-native orgs | Medium (control plane managed by cloud) | Low (control plane patches automatic)
Cluster API | Multi-cloud / on-prem fleet; platform teams managing 10+ clusters | High (full lifecycle via K8s CRDs) | Medium (CAPI controllers to maintain)
kubeadm | On-prem bare metal; air-gapped; custom distributions | Full (every component) | High (etcd, certs, upgrades manual)
kOps | AWS self-managed clusters (alternative to EKS) | High (but AWS-only focused) | Medium
k3s / k0s / RKE2 | Edge, IoT, dev environments, resource-constrained nodes | High | Low (single binary, no external etcd required)
EKS Blueprints / GKE Autopilot | Opinionated starting point for cloud-native teams | Low–medium (opinionated defaults) | Very low

2. Managed Kubernetes (EKS / GKE / AKS)

AWS EKS — Key Configuration Decisions

# eks cluster creation via eksctl (most common quickstart)
eksctl create cluster \
  --name production \
  --region us-east-1 \
  --version 1.30 \
  --nodegroup-name general \
  --node-type m6i.2xlarge \
  --nodes-min 3 \
  --nodes-max 20 \
  --managed \
  --with-oidc \
  --ssh-access=false \
  --full-ecr-access \
  --alb-ingress-access

EKS Networking Modes

Mode | CNI | IP Source | Notes
VPC CNI (default) | aws-node (Amazon VPC CNI) | VPC subnet IPs (one secondary ENI IP per pod) | Simple; IP exhaustion risk in small subnets
VPC CNI + prefix delegation | aws-node with ENABLE_PREFIX_DELEGATION=true | /28 prefix per ENI slot | 16× more IPs per ENI; requires Nitro instances
Custom CNI (Cilium/Calico) | 3rd-party CNI installed as DaemonSet | Overlay or custom CIDR | Network policy, eBPF, encryption; replaces aws-node

# Enable prefix delegation on existing EKS cluster
kubectl set env daemonset aws-node \
  -n kube-system \
  ENABLE_PREFIX_DELEGATION=true \
  WARM_PREFIX_TARGET=1 \
  MINIMUM_IP_TARGET=5
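
To see what prefix delegation buys, the pod ceiling per node can be estimated from the ENI limits of the instance type. The sketch below uses the commonly cited VPC CNI formula, ENIs × (IPs per ENI − 1) + 2. The function names are illustrative, and the ENI and IP-per-ENI figures passed in are assumptions to check against the EC2 documentation for the instance type in use.

```shell
# Estimate the VPC CNI pod ceiling for one node.
# $1 = max ENIs for the instance type, $2 = IPv4 addresses per ENI.
max_pods_secondary_ip() {
  # One address per ENI is the ENI's own primary IP; +2 covers
  # host-network pods (aws-node, kube-proxy) that need no pod IP.
  echo $(( $1 * ($2 - 1) + 2 ))
}

max_pods_prefix_delegation() {
  # Each secondary slot holds a /28 prefix (16 addresses) instead of one IP.
  echo $(( $1 * ($2 - 1) * 16 + 2 ))
}

# Example: an instance type with 3 ENIs and 10 addresses per ENI.
max_pods_secondary_ip 3 10
max_pods_prefix_delegation 3 10
```

For 3 ENIs with 10 addresses each this prints 29 and 434. Once prefix delegation is on, the kubelet max-pods setting rather than the IP count usually becomes the effective limit.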

EKS OIDC for IRSA (IAM Roles for Service Accounts)

# Associate OIDC provider (required for IRSA)
eksctl utils associate-iam-oidc-provider \
  --cluster production \
  --region us-east-1 \
  --approve

# Create IAM role for a service account
eksctl create iamserviceaccount \
  --name pyroscope \
  --namespace observability \
  --cluster production \
  --role-name pyroscope-s3-role \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
  --approve \
  --override-existing-serviceaccounts

# The ServiceAccount gets:
# annotations:
#   eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/pyroscope-s3-role
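
Behind that annotation, IRSA depends on the role's trust policy naming the cluster's OIDC provider and the exact ServiceAccount. eksctl generates this for you; the sketch below only prints an equivalent policy so the moving parts are visible. The function name is illustrative, and the account ID and issuer in the example call are placeholders.

```shell
# Print the IAM trust policy that lets one ServiceAccount assume a role.
# $1 = AWS account ID, $2 = OIDC issuer host/path without https://,
# $3 = namespace, $4 = ServiceAccount name.
irsa_trust_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Federated": "arn:aws:iam::$1:oidc-provider/$2"},
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {"StringEquals": {
      "$2:sub": "system:serviceaccount:$3:$4",
      "$2:aud": "sts.amazonaws.com"
    }}
  }]
}
EOF
}

# Placeholder account ID and issuer; substitute the cluster's real values.
irsa_trust_policy 123456789012 \
  oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE \
  observability pyroscope
```

The sub condition is what scopes the role to a single ServiceAccount; without it, any pod in the cluster that can obtain a projected token could assume the role.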

EKS Add-ons (managed by AWS)

# List available add-ons
aws eks describe-addon-versions --kubernetes-version 1.30

# Install/update EKS managed add-ons
aws eks create-addon \
  --cluster-name production \
  --addon-name vpc-cni \
  --addon-version v1.18.0-eksbuild.1 \
  --resolve-conflicts OVERWRITE

# Key managed add-ons:
# vpc-cni              - AWS VPC CNI plugin
# coredns              - cluster DNS (not upgraded automatically; update it after each cluster upgrade)
# kube-proxy           - per-node kube-proxy DaemonSet
# aws-ebs-csi-driver   - EBS persistent volumes
# aws-efs-csi-driver   - EFS persistent volumes
#
# Not an EKS managed add-on: aws-load-balancer-controller (ALB/NLB via
# Ingress/Service) is installed separately, typically with Helm.

GKE — Autopilot vs Standard

Feature | GKE Standard | GKE Autopilot
Node management | User manages node pools | Google manages all nodes
Billing | Per node (even idle) | Per pod resource request
DaemonSets | Full support | Supported with restrictions (no privileged or host-level access by default)
Node access | SSH available | No node access
Spot nodes | Spot node pool | Spot Pods via nodeSelector (cloud.google.com/gke-spot: "true")
Best for | Custom CNI, bare-metal-like control | Cost-optimised, managed clusters

# GKE cluster with Workload Identity and private nodes
gcloud container clusters create production \
  --region us-central1 \
  --release-channel regular \
  --workload-pool=PROJECT_ID.svc.id.goog \
  --enable-private-nodes \
  --master-ipv4-cidr 172.16.0.0/28 \
  --enable-ip-alias \
  --enable-shielded-nodes \
  --enable-autoupgrade \
  --enable-autorepair \
  --num-nodes 3 \
  --machine-type n2-standard-4 \
  --disk-type pd-ssd

AKS — Azure Kubernetes Service

# AKS cluster with system + user node pools
az aks create \
  --resource-group myResourceGroup \
  --name production \
  --kubernetes-version 1.30 \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --enable-managed-identity \
  --network-plugin azure \
  --network-policy calico \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 20 \
  --zones 1 2 3

# Add user node pool (separate from system)
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name production \
  --name workload \
  --node-count 3 \
  --node-vm-size Standard_D8s_v3 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 30 \
  --zones 1 2 3 \
  --labels workload=general \
  --mode User

3. Cluster API (CAPI)

Cluster API is a Kubernetes-native framework for declarative cluster lifecycle management. You run a management cluster and define workload clusters as CRDs. CAPI controllers reconcile the desired cluster state, handling creation, scaling, upgrades, and deletion.

CAPI Architecture

Management Cluster
├── CAPI core controllers (cluster-api)
├── Bootstrap provider (kubeadm, k3s, talos)
├── Control Plane provider (kubeadm CP, RKE2 CP)
└── Infrastructure provider (AWS/CAPA, GCP/CAPG, Azure/CAPZ, vSphere/CAPV)

Object hierarchy:
  Cluster ──────── owns ──────── MachineDeployment (workers)
      │                               │
      ├── KubeadmControlPlane ────── Machine (one per replica)
      │       │                          │
      │       └── AWSMachineTemplate     └── AWSMachine (infra)
      │
      └── AWSCluster (infra: VPC, subnets, SGs, ELB)

CAPI Bootstrap (AWS / CAPA)

# Install clusterctl
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.7.0/clusterctl-linux-amd64 \
  -o /usr/local/bin/clusterctl
chmod +x /usr/local/bin/clusterctl

# Initialize management cluster (runs in existing cluster)
# AWS credentials via environment or IAM role
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

clusterctl init \
  --infrastructure aws \
  --bootstrap kubeadm \
  --control-plane kubeadm

# Generate workload cluster manifest
clusterctl generate cluster production \
  --flavor machinepool \
  --kubernetes-version v1.30.0 \
  --control-plane-machine-count=3 \
  --worker-machine-count=3 \
  > production-cluster.yaml

kubectl apply -f production-cluster.yaml

# Watch cluster come up
kubectl get cluster production -w
clusterctl describe cluster production

CAPI Cluster YAML (AWS)

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: production
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-control-plane
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: production
spec:
  region: us-east-1
  sshKeyName: platform-key
  network:
    vpc:
      availabilityZoneUsageLimit: 3
    subnets:
      - availabilityZone: us-east-1a
        cidrBlock: 10.0.1.0/24
        isPublic: false
      - availabilityZone: us-east-1b
        cidrBlock: 10.0.2.0/24
        isPublic: false
      - availabilityZone: us-east-1c
        cidrBlock: 10.0.3.0/24
        isPublic: false
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-control-plane
spec:
  replicas: 3
  version: v1.30.0
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSMachineTemplate
      name: production-control-plane
  kubeadmConfigSpec:
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-provider: external
          audit-log-path: /var/log/kubernetes/audit.log
          audit-log-maxage: "30"
          audit-log-maxbackup: "10"
          audit-log-maxsize: "100"
          feature-gates: "ServerSideApply=true"
      etcd:
        local:
          dataDir: /var/lib/etcddisk/etcd
          extraArgs:
            quota-backend-bytes: "8589934592"  # 8 GiB
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: production-workers
spec:
  clusterName: production
  replicas: 3
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: production
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: production
    spec:
      version: v1.30.0
      clusterName: production
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: production-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachineTemplate
        name: production-workers

ClusterClass — Reusable Cluster Templates

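ClusterClass (CAPI v1.2+) lets you define a cluster topology once and instantiate it many times with variable overrides, much like a Helm chart for clusters. It sits behind the ClusterTopology feature gate, so set CLUSTER_TOPOLOGY=true in the environment before running clusterctl init.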

apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: aws-production-class
spec:
  controlPlane:
    ref:
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlaneTemplate
      name: aws-kcp-template
  infrastructure:
    ref:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSClusterTemplate
      name: aws-cluster-template
  workers:
    machineDeployments:
      - class: general
        template:
          bootstrap:
            ref:
              apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
              kind: KubeadmConfigTemplate
              name: aws-worker-bootstrap
          infrastructure:
            ref:
              apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
              kind: AWSMachineTemplate
              name: aws-worker-template
  # Variables only take effect when referenced from spec.patches (omitted here for brevity)
  variables:
    - name: region
      required: true
      schema:
        openAPIV3Schema:
          type: string
    - name: workerInstanceType
      required: false
      schema:
        openAPIV3Schema:
          type: string
          default: m6i.2xlarge
---
# Instantiate from ClusterClass:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: team-a-cluster
spec:
  topology:
    class: aws-production-class
    version: v1.30.0
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
        - class: general   # must match a class defined in the ClusterClass
          name: general
          replicas: 5
    variables:
      - name: region
        value: eu-west-1
      - name: workerInstanceType
        value: m6i.4xlarge

4. kubeadm: Self-Managed Clusters

Control Plane Bootstrap

# kubeadm-config.yaml — production control plane configuration
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.30.0
clusterName: production
controlPlaneEndpoint: "k8s-api.internal.example.com:6443"  # Load balancer VIP
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
  dnsDomain: "cluster.local"
apiServer:
  extraArgs:
    audit-log-path: /var/log/kubernetes/audit.log
    audit-log-maxage: "30"
    audit-log-maxbackup: "10"
    audit-log-maxsize: "100"
    # Keep the list on one line: a folded scalar would insert spaces after the commas
    enable-admission-plugins: "NodeRestriction,PodSecurity,ResourceQuota,LimitRanger,ServiceAccount,DefaultStorageClass,MutatingAdmissionWebhook,ValidatingAdmissionWebhook"
    oidc-issuer-url: "https://dex.example.com"
    oidc-client-id: "kubernetes"
    oidc-username-claim: "email"
    oidc-groups-claim: "groups"
  extraVolumes:
    - name: audit-log
      hostPath: /var/log/kubernetes
      mountPath: /var/log/kubernetes
      pathType: DirectoryOrCreate
controllerManager:
  extraArgs:
    bind-address: "0.0.0.0"
    node-cidr-mask-size: "24"
etcd:
  local:
    dataDir: /var/lib/etcd
    extraArgs:
      quota-backend-bytes: "8589934592"
      auto-compaction-mode: revision
      auto-compaction-retention: "1000"
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  kubeletExtraArgs:
    cloud-provider: ""

# Initialize first control plane node
kubeadm init --config kubeadm-config.yaml --upload-certs

# Output includes:
# kubeadm join k8s-api:6443 --token ... --discovery-token-ca-cert-hash sha256:...
#   --control-plane --certificate-key ...

# Join additional control plane nodes
kubeadm join k8s-api.internal.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <certificate-key>

# Join worker nodes
kubeadm join k8s-api.internal.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
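
The podSubnet and node-cidr-mask-size values in the kubeadm configuration above jointly cap the cluster size, which is worth checking before the first kubeadm init because the pod CIDR is hard to change later. A minimal sketch of the arithmetic for IPv4, with illustrative function names:

```shell
# Derive cluster capacity from the pod CIDR and the per-node mask.
# $1 = podSubnet prefix length, $2 = node-cidr-mask-size.
max_nodes() {
  # Each node receives one /$2 block carved out of the /$1 pod subnet.
  echo $(( 1 << ($2 - $1) ))
}

# $1 = node-cidr-mask-size.
pod_ips_per_node() {
  # Number of IPv4 addresses in a single per-node block.
  echo $(( 1 << (32 - $1) ))
}

# Values from the configuration above: 10.244.0.0/16 with a /24 per node.
max_nodes 16 24
pod_ips_per_node 24
```

With a /16 pod subnet and /24 node blocks this prints 256 and 256: at most 256 nodes, each with a 256-address block. Choosing a /25 per node would double the node count while halving the addresses available on each node.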

etcd External Cluster (Production Best Practice)

# For large clusters: run etcd separately from control plane
# Stacked etcd (default): etcd on same nodes as kube-apiserver
# External etcd (recommended for prod): dedicated etcd nodes

# etcd cluster topology:
# - Minimum 3 nodes for quorum (tolerate 1 failure)
# - 5 nodes for higher availability (tolerate 2 failures)
# - Dedicated SSDs (fsync latency < 10ms critical for etcd)
# - Never share etcd nodes with any other workload

# kubeadm-config.yaml (external etcd):
etcd:
  external:
    endpoints:
      - https://etcd-0.internal:2379
      - https://etcd-1.internal:2379
      - https://etcd-2.internal:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
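
The 3-node and 5-node figures in the topology notes above follow from Raft majority quorum. A short sketch of that arithmetic (function names are illustrative):

```shell
# Quorum size for an etcd cluster of $1 members: a strict majority.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while a majority still remains.
etcd_failure_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for members in 3 4 5; do
  echo "$members members: quorum $(etcd_quorum "$members"), tolerates $(etcd_failure_tolerance "$members") failure(s)"
done
```

A 4-member cluster tolerates only one failure, the same as 3 members, which is why odd member counts are the norm.
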

etcd disk performance is critical

etcd writes are synchronous (fsync after every commit). On AWS, use io1/io2 EBS volumes with 3000+ IOPS, or NVMe instance store (with replication). On bare metal, use a dedicated NVMe SSD. Shared network storage (EFS, NFS) adds enough write latency to cause etcd leader elections and API server timeouts. Monitor etcd_disk_wal_fsync_duration_seconds and keep p99 below 10 ms.

Certificate Management

# Check certificate expiry
kubeadm certs check-expiration

# Renew all certificates (run on each control plane node)
kubeadm certs renew all

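# Auto-renewal: kubeadm renews all certificates during `kubeadm upgrade`
# For manual renewal, add to cron before 1-year expiry.
# Note: renewed certificates are only picked up once the control plane static pods
# (kube-apiserver, kube-controller-manager, kube-scheduler, etcd) restart;
# restarting kubelet alone does not restart them.
0 0 1 * * /usr/bin/kubeadm certs renew all && systemctl restart kubelet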

5. EKS Blueprints (Terraform)

EKS Blueprints is a Terraform module that provisions an EKS cluster with opinionated add-on management, team RBAC, and GitOps bootstrap in one configuration block.

# main.tf
module "eks_blueprints" {
  source  = "aws-ia/eks-blueprints/aws"
  version = "~> 4.32"

  cluster_name    = "production"
  cluster_version = "1.30"

  vpc_id             = module.vpc.vpc_id
  private_subnet_ids = module.vpc.private_subnets

  # Control plane logging
  cluster_enabled_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]

  # IRSA — required for add-ons
  enable_irsa = true

  # Managed node groups
  eks_managed_node_groups = {
    general = {
      min_size     = 3
      max_size     = 20
      desired_size = 5

      instance_types = ["m6i.2xlarge"]
      capacity_type  = "ON_DEMAND"

      labels = { workload = "general" }

      # No taint on this group: it is the default scheduling target.
      # System pods are steered to the tainted "system" group below.
    }

    system = {
      min_size     = 3
      max_size     = 6
      desired_size = 3

      instance_types = ["m6i.xlarge"]
      capacity_type  = "ON_DEMAND"

      labels = { "node.kubernetes.io/purpose" = "system" }
      taints = [{
        key    = "node.kubernetes.io/purpose"
        value  = "system"
        effect = "NO_SCHEDULE"
      }]
    }

    spot = {
      min_size     = 0
      max_size     = 50
      desired_size = 0

      instance_types = ["m6i.2xlarge", "m5.2xlarge", "m5n.2xlarge"]
      capacity_type  = "SPOT"

      labels = { "node.kubernetes.io/capacity-type" = "spot" }
      taints = [{
        key    = "spot"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

# Add-ons module
module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.16"

  cluster_name      = module.eks_blueprints.cluster_name
  cluster_endpoint  = module.eks_blueprints.cluster_endpoint
  cluster_version   = module.eks_blueprints.cluster_version
  oidc_provider_arn = module.eks_blueprints.oidc_provider_arn

  # EKS managed add-ons
  eks_addons = {
    aws-ebs-csi-driver = {
      most_recent = true
      service_account_role_arn = module.ebs_csi_irsa_role.iam_role_arn
    }
    coredns    = { most_recent = true }
    vpc-cni    = { most_recent = true }
    kube-proxy = { most_recent = true }
  }

  # AWS Load Balancer Controller
  enable_aws_load_balancer_controller = true

  # Cluster Autoscaler (or use Karpenter — see below)
  enable_cluster_autoscaler = true

  # cert-manager
  enable_cert_manager = true

  # external-dns
  enable_external_dns = true

  # Karpenter (if enabled, set enable_cluster_autoscaler = false; run only one node autoscaler)
  enable_karpenter = false

  # Metrics Server
  enable_metrics_server = true
}

6. Node Pools & Node Groups

Node Group Strategy

Recommended node group topology for production clusters:

┌──────────────────────────────────────────────────────────┐
│  system node group  (3 nodes, m6i.xlarge, ON_DEMAND)     │
│  Runs: CoreDNS, kube-proxy, CNI, CSI, ingress controller │
│  Taint: node.kubernetes.io/purpose=system:NoSchedule     │
├──────────────────────────────────────────────────────────┤
│  general node group (3-20 nodes, m6i.2xlarge, ON_DEMAND) │
│  Runs: most application workloads                        │
│  No taints — default scheduling target                   │
├──────────────────────────────────────────────────────────┤
│  observability node group (3 nodes, m6i.4xlarge)         │
│  Runs: Prometheus, Loki, Tempo, Grafana (stateful)       │
│  Taint: workload=observability:NoSchedule                │
├──────────────────────────────────────────────────────────┤
│  spot node group (0-50 nodes, mixed instance types)      │
│  Runs: batch jobs, CI runners, ML training               │
│  Taint: spot=true:NoSchedule                             │
├──────────────────────────────────────────────────────────┤
│  gpu node group (0-10 nodes, p3.2xlarge or g5.xlarge)    │
│  Runs: GPU workloads, ML inference                       │
│  Taint: nvidia.com/gpu=present:NoSchedule                │
└──────────────────────────────────────────────────────────┘

Node Labels and Taints for Scheduling

# Standard K8s well-known node labels (set automatically by cloud providers)
kubernetes.io/hostname: worker-node-1
kubernetes.io/arch: amd64
kubernetes.io/os: linux
node.kubernetes.io/instance-type: m6i.2xlarge
topology.kubernetes.io/zone: us-east-1a
topology.kubernetes.io/region: us-east-1

# EKS specific
eks.amazonaws.com/nodegroup: general
eks.amazonaws.com/capacityType: ON_DEMAND

# Custom labels (set in node group config)
workload: general
node.kubernetes.io/purpose: system

# Targeting observability workloads:
nodeSelector:
  workload: observability
tolerations:
  - key: workload
    operator: Equal
    value: observability
    effect: NoSchedule

Multi-AZ Distribution for Stateful Workloads

# Force pods across AZs using topologySpreadConstraints
# (preferred over podAntiAffinity for large StatefulSets)
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: kafka-broker

# StorageClass with WaitForFirstConsumer (volume follows pod AZ)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3-zone-aware
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # Critical for multi-AZ
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
reclaimPolicy: Retain
allowVolumeExpansion: true
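
To make the maxSkew: 1 constraint above concrete: the scheduler compares the pod count the candidate zone would have after placement with the current minimum across zones, and rejects the zone if the difference exceeds maxSkew. The sketch below is a simplified model of that check, not the scheduler's implementation; the function name and argument layout are illustrative.

```shell
# Would placing one more pod in a zone violate maxSkew?
# $1 = maxSkew, $2 = 1-based index of the candidate zone,
# remaining args = current matching-pod counts per zone.
placement_allowed() {
  max_skew=$1; target=$2; shift 2
  i=0; min=-1; target_count=0
  for count in "$@"; do
    i=$(( i + 1 ))
    # Count in the candidate zone after adding the new pod.
    if [ "$i" -eq "$target" ]; then target_count=$(( count + 1 )); fi
    # Track the current minimum across all zones.
    if [ "$min" -lt 0 ] || [ "$count" -lt "$min" ]; then min=$count; fi
  done
  [ $(( target_count - min )) -le "$max_skew" ]
}

# Three zones currently holding 2, 1 and 1 kafka-broker pods:
placement_allowed 1 1 2 1 1 && echo "zone 1: allowed" || echo "zone 1: blocked"
placement_allowed 1 2 2 1 1 && echo "zone 2: allowed" || echo "zone 2: blocked"
```

With DoNotSchedule, a pod that is blocked in every zone stays Pending, so a zone without capacity can stall a StatefulSet rollout.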

7. Karpenter: Node Autoprovisioning

Karpenter replaces the Cluster Autoscaler for node-level scaling. Instead of managing fixed node groups, Karpenter watches for unschedulable pods and provisions exactly the right node type within seconds using NodePool and EC2NodeClass CRDs.

Karpenter vs Cluster Autoscaler

Feature | Cluster Autoscaler | Karpenter
Provisioning speed | 2–5 minutes (new ASG instance) | ~45 seconds (direct EC2 CreateFleet API call, no ASG)
Instance selection | Fixed node group instance type | Flexible: picks optimal from NodePool list
Spot handling | Separate spot ASG; manual fallback config | Built-in: weighted spot + on-demand fallback
Consolidation | Removes underutilised nodes (basic) | Disruption consolidation: repack pods onto fewer nodes
Node lifecycle | ASG manages replacement | Karpenter replaces drifted/old nodes automatically
Configuration | Node group config in cloud console/IaC | NodePool + NodeClass CRDs in cluster

Karpenter Install (EKS)

# Environment used by the IAM and Helm steps below
export CLUSTER_NAME=production
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export KARPENTER_NAMESPACE=karpenter
export KARPENTER_VERSION=0.37.0  # chart tags carry no "v" prefix since v0.35

# Create node IAM role (used by nodes Karpenter provisions)
aws iam create-role \
  --role-name KarpenterNodeRole-${CLUSTER_NAME} \
  --assume-role-policy-document file://node-trust-policy.json

for policy in \
  AmazonEKSWorkerNodePolicy \
  AmazonEKS_CNI_Policy \
  AmazonEC2ContainerRegistryReadOnly \
  AmazonSSMManagedInstanceCore; do
  aws iam attach-role-policy \
    --role-name KarpenterNodeRole-${CLUSTER_NAME} \
    --policy-arn arn:aws:iam::aws:policy/${policy}
done

# Create instance profile
aws iam create-instance-profile \
  --instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME}
aws iam add-role-to-instance-profile \
  --instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
  --role-name KarpenterNodeRole-${CLUSTER_NAME}

# Install Karpenter via Helm
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version ${KARPENTER_VERSION} \
  --namespace ${KARPENTER_NAMESPACE} \
  --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

NodePool and EC2NodeClass

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        workload: general
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]  # prefer spot, fall back to on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]  # Only 6th gen or newer
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small", "medium", "large", "xlarge"]  # min 2xlarge
      taints: []
  disruption:
    # In v1beta1, consolidateAfter is only valid with consolidationPolicy: WhenEmpty
    consolidationPolicy: WhenUnderutilized
    # Replace nodes older than 720h (30 days) for security patching
    expireAfter: 720h
  limits:
    cpu: 1000         # max 1000 vCPUs across all NodePool nodes
    memory: 4000Gi
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2  # Amazon Linux 2 (or Bottlerocket, Ubuntu)
  role: KarpenterNodeRole-production  # IAM role name; Karpenter creates the instance profile from it
  subnetSelectorTerms:
    - tags:
        kubernetes.io/cluster/production: owned
        karpenter.sh/discovery: production
  securityGroupSelectorTerms:
    - tags:
        kubernetes.io/cluster/production: owned
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
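  # With amiFamily AL2, Karpenter generates the bootstrap.sh call itself and
  # merges custom userData ahead of it, so do not call bootstrap.sh here.
  # Set max pods via spec.template.spec.kubelet.maxPods on the NodePool instead.
  userData: |
    #!/bin/bash
    echo "custom pre-bootstrap steps go here"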
  tags:
    team: platform
    managed-by: karpenter
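
The instance-category, instance-generation, and instance-size requirements in the NodePool above can be hard to read at a glance. The sketch below re-implements them for a single instance type name, following the stated intent of a 2xlarge minimum. It is an approximation for reasoning about the filter, not Karpenter's own logic: the name parsing is an assumption, and the arch, OS, and capacity-type requirements are not checked.

```shell
# Returns 0 if an instance type such as "m6i.2xlarge" would pass the
# category (c/m/r), generation (> 5) and size (>= 2xlarge) requirements.
instance_type_allowed() {
  family=${1%%.*}   # e.g. "m6i"
  size=${1#*.}      # e.g. "2xlarge"
  # Leading letters are the category, the digits that follow the generation.
  category=$(printf '%s' "$family" | sed 's/[0-9].*$//')
  generation=$(printf '%s' "$family" | sed 's/^[a-z]*\([0-9]*\).*$/\1/')

  case "$category" in c|m|r) ;; *) return 1 ;; esac
  [ -n "$generation" ] && [ "$generation" -gt 5 ] || return 1
  case "$size" in nano|micro|small|medium|large|xlarge) return 1 ;; esac
  return 0
}

for t in m6i.2xlarge m5.2xlarge m6i.xlarge t3.2xlarge; do
  if instance_type_allowed "$t"; then echo "$t: allowed"; else echo "$t: filtered out"; fi
done
```

Only m6i.2xlarge passes here: m5.2xlarge fails the generation check, m6i.xlarge the size check, and t3.2xlarge the category check.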

Spot Interruption Handling

# Karpenter consumes EC2 Spot interruption notices through an SQS queue
# AWS sends 2-minute warning → Karpenter cordons and drains the node and launches a replacement

# Create SQS queue for spot interruptions
# (the name must match settings.interruptionQueue from the Helm install):
aws sqs create-queue \
  --queue-name ${CLUSTER_NAME} \
  --attributes '{
    "SqsManagedSseEnabled": "true",
    "MessageRetentionPeriod": "300"
  }'

# EventBridge rules route interruption events to SQS:
# - EC2 Spot Instance Interruption Warning
# - EC2 Instance Rebalance Recommendation
# - EC2 Instance State-change Notification
# (set up via CloudFormation template in Karpenter docs)

8. Cluster Add-ons

Every production cluster needs a standard set of add-ons installed immediately after bootstrap. These should be deployed via GitOps (Argo CD or Flux) — not manual Helm installs.

Category | Add-on | Purpose | Install Method
Networking | CNI plugin (Cilium/Calico/VPC CNI) | Pod networking | Cluster bootstrap / EKS managed
Networking | CoreDNS | Service discovery DNS | kubeadm / EKS managed
Networking | Ingress NGINX / AWS ALB Controller | HTTP ingress | Helm via Argo CD
Networking | cert-manager | TLS certificate automation | Helm via Argo CD
Networking | external-dns | DNS record sync from Ingress/Service | Helm via Argo CD
Storage | EBS CSI Driver / GCE PD CSI | Cloud block storage | EKS managed / Helm
Storage | EFS CSI Driver | Shared file storage (RWX) | Helm via Argo CD
Autoscaling | Karpenter / Cluster Autoscaler | Node scaling | Helm via Argo CD
Autoscaling | Metrics Server | HPA/VPA resource metrics | Helm via Argo CD
Autoscaling | KEDA | Event-driven autoscaling | Helm via Argo CD
Security | Gatekeeper / Kyverno | Policy enforcement | Helm via Argo CD
Security | Falco | Runtime security (syscall audit) | Helm via Argo CD
Secrets | External Secrets Operator | Sync secrets from Vault/AWS SM | Helm via Argo CD
Observability | kube-prometheus-stack | Prometheus + Grafana + rules | Helm via Argo CD
Observability | Loki stack | Log aggregation | Helm via Argo CD
Observability | OpenTelemetry Operator | Tracing infrastructure | Helm via Argo CD

Add-on Dependency Ordering (Argo CD Sync Waves)

# Use Argo CD sync waves to control add-on installation order.
# Annotations on Applications or Helm release namespaces:

# Wave -1: CRDs first (cert-manager, Gatekeeper, KEDA, Karpenter)
# Wave  0: Core networking (CNI config, CoreDNS tuning)
# Wave  1: Storage (CSI drivers, StorageClasses)
# Wave  2: Security (Gatekeeper policies, ESO, Falco)
# Wave  3: Ingress (NGINX, cert-manager issuers, external-dns)
# Wave  4: Autoscaling (Karpenter NodePools, KEDA ScaledObjects)
# Wave  5: Observability (Prometheus, Loki, Tempo, Grafana)
# Wave 10: Application namespaces and workloads

# Example Application annotation:
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "2"

9. Bootstrap to GitOps

A newly provisioned cluster must be onboarded to the GitOps system so all future configuration is managed declaratively. The bootstrap process converts a "bare" cluster into a GitOps-managed cluster.

# Bootstrap sequence:
# 1. Cluster created (Terraform / CAPI / eksctl)
# 2. kubeconfig obtained and merged
# 3. Argo CD installed (bootstrap — only this one step is imperative)
# 4. Argo CD pointed at the GitOps repo
# 5. App-of-apps / ApplicationSet deploys everything else

# Step 3: Install Argo CD (one-time bootstrap)
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Step 4: Create root Application pointing at cluster's app-of-apps directory
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/platform-gitops
    targetRevision: HEAD
    path: clusters/production/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
EOF

# Step 5: argocd now syncs all add-ons from Git
# The apps/ directory contains Application manifests for each add-on

GitOps Repository Structure for Multi-Cluster

platform-gitops/
├── clusters/
│   ├── production/
│   │   ├── apps/                     # Argo CD Applications (root app-of-apps)
│   │   │   ├── cert-manager.yaml
│   │   │   ├── ingress-nginx.yaml
│   │   │   ├── karpenter.yaml
│   │   │   ├── kube-prometheus-stack.yaml
│   │   │   └── ...
│   │   └── values/                   # Per-cluster Helm value overrides
│   │       ├── karpenter-values.yaml
│   │       └── prometheus-values.yaml
│   ├── staging/
│   │   └── apps/
│   └── dev/
│       └── apps/
├── add-ons/                          # Reusable add-on Helm chart wrappers
│   ├── cert-manager/
│   ├── ingress-nginx/
│   └── kube-prometheus-stack/
└── base/                             # Shared Kustomize bases
    ├── namespaces/
    └── rbac/

10. Cluster Upgrade Strategies

Kubernetes skew policy

Kubelets may run up to three minor versions behind the API server (n-3 since v1.28; n-2 for older control planes) and never ahead of it. Always upgrade the control plane first, then node groups. Never skip minor versions on the control plane (e.g., 1.28 → 1.30 is unsupported; go 1.28 → 1.29 → 1.30). Check API deprecations before each upgrade with pluto or kubectl convert.
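
These rules are easy to encode as a pre-flight check in an upgrade script. A minimal sketch with illustrative function names; the kubelet skew limit is a parameter defaulting to 3, the upstream limit since v1.28 (pass 2 for older control planes):

```shell
# Minor number of a version string such as "1.28" or "v1.30.0".
minor_of() {
  v=${1#v}; v=${v#*.}; echo "${v%%.*}"
}

# Control plane upgrades may advance at most one minor version.
# $1 = current version, $2 = target version.
control_plane_step_ok() {
  diff=$(( $(minor_of "$2") - $(minor_of "$1") ))
  [ "$diff" -ge 0 ] && [ "$diff" -le 1 ]
}

# Kubelets may lag the API server by at most $3 minors and never lead it.
# $1 = API server version, $2 = kubelet version, $3 = max skew (default 3).
kubelet_skew_ok() {
  diff=$(( $(minor_of "$1") - $(minor_of "$2") ))
  [ "$diff" -ge 0 ] && [ "$diff" -le "${3:-3}" ]
}

control_plane_step_ok 1.28 1.30 || echo "1.28 -> 1.30: upgrade through 1.29 first"
kubelet_skew_ok v1.30.0 v1.27.4 && echo "1.27 kubelets can stay during a 1.30 control plane upgrade"
```

Running the first check before calling the cloud provider's upgrade API turns an unsupported jump into an early, readable failure.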

Pre-Upgrade Checklist

# 1. Check API deprecations with pluto
pluto detect-all-in-cluster --target-versions k8s=v1.30

# 2. Review release notes and deprecated APIs
# https://kubernetes.io/docs/reference/using-api/deprecation-guide/

# 3. Check add-on compatibility matrix
# cert-manager, ingress-nginx, kube-prometheus-stack all have K8s version tables

# 4. Test in non-prod cluster first (same version upgrade path)

# 5. Verify PodDisruptionBudgets allow node drains
kubectl get pdb -A
# PDBs with minAvailable=100% or maxUnavailable=0 will block drain!

# 6. Check for pods still pulling images from the deprecated k8s.gcr.io registry
kubectl get pods -A -o json | \
  jq '.items[] | select(any(.spec.containers[]; .image | test("k8s.gcr.io"))) | .metadata'
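
Step 5 deserves a closer look, since a single over-strict PDB can stall an entire node group upgrade. For a PDB that uses an absolute minAvailable, the number of evictions a drain may perform is the healthy pod count minus minAvailable, floored at zero. A small sketch of that arithmetic (the function name is illustrative):

```shell
# Voluntary disruptions a minAvailable-based PDB currently permits.
# $1 = healthy pods matched by the PDB, $2 = minAvailable (absolute number).
pdb_disruptions_allowed() {
  allowed=$(( $1 - $2 ))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}

pdb_disruptions_allowed 3 3   # minAvailable equals replicas: drains block
pdb_disruptions_allowed 3 2   # one pod may be evicted at a time
```

The first call prints 0 and the second prints 1. The same figure appears in the ALLOWED DISRUPTIONS column of kubectl get pdb -A, so any row showing 0 should be resolved before draining.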

EKS Control Plane Upgrade

# Upgrade EKS control plane (AWS manages etcd and control plane pods)
aws eks update-cluster-version \
  --name production \
  --kubernetes-version 1.30

# Watch upgrade status
aws eks describe-cluster \
  --name production \
  --query 'cluster.status'

# Takes ~15 minutes; API server is briefly unavailable during switchover
# (kube-proxy and CoreDNS tolerate this; existing pods continue running)

Node Group Rolling Upgrade

# Option A: Rolling update via eksctl (managed node groups)
eksctl upgrade nodegroup \
  --cluster=production \
  --name=general \
  --kubernetes-version=1.30

# Option B: Blue/green node group (zero downtime, more control)
# 1. Create new node group with new K8s version
eksctl create nodegroup \
  --cluster production \
  --name general-v130 \
  --version 1.30 \
  --node-type m6i.2xlarge \
  --nodes-min 3 --nodes-max 20

# 2. Cordon old node group (prevent new scheduling)
kubectl cordon -l eks.amazonaws.com/nodegroup=general

# 3. Drain old nodes (respects PDBs)
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=general -o name); do
  kubectl drain $node \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=5m \
    --grace-period=30
done

# 4. Delete old node group after all pods migrated
eksctl delete nodegroup \
  --cluster production \
  --name general \
  --drain=false  # Already drained above

kubeadm Control Plane Upgrade

# Package revisions below use the pkgs.k8s.io format (for example 1.30.0-1.1), matched with a wildcard.
# If the packages are held, run apt-mark unhold before each install and apt-mark hold afterwards.

# On first control plane node:
# 1. Upgrade kubeadm itself
apt-get update
apt-get install -y kubeadm='1.30.0-*'

# 2. Verify upgrade plan
kubeadm upgrade plan

# 3. Apply upgrade (upgrades kube-apiserver, kube-controller-manager, kube-scheduler, etcd)
kubeadm upgrade apply v1.30.0

# 4. Upgrade kubelet and kubectl on control plane node
apt-get install -y kubelet='1.30.0-*' kubectl='1.30.0-*'
systemctl daemon-reload
systemctl restart kubelet

# On additional control plane nodes (upgrade kubeadm first, then run "upgrade node"):
apt-get update && apt-get install -y kubeadm='1.30.0-*'
kubeadm upgrade node
apt-get install -y kubelet='1.30.0-*' kubectl='1.30.0-*'
systemctl daemon-reload && systemctl restart kubelet

# Worker nodes (one at a time):
# From a machine with cluster access:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# SSH into worker node:
apt-get update && apt-get install -y kubeadm='1.30.0-*'
kubeadm upgrade node
apt-get install -y kubelet='1.30.0-*'
systemctl daemon-reload && systemctl restart kubelet
# Back on control plane:
kubectl uncordon <node-name>
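
After the last node is uncordoned, it helps to confirm that no kubelet was left on the old version. The sketch below filters a two-column node listing for stragglers; the helper name nodes_not_at_version is hypothetical, and the version prefix is passed with a trailing dot so that v1.3 does not match v1.30.

```shell
# Print nodes whose kubelet version does not start with the target prefix.
# Input: one node per line as "<name> <kubeletVersion>"; $1 = prefix, e.g. "v1.30."
nodes_not_at_version() {
  awk -v want="$1" 'NF >= 2 && index($2, want) != 1 { print $1 " " $2 }'
}

# Example wiring (requires cluster access, so it is left commented out):
# kubectl get nodes --no-headers \
#   -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion \
#   | nodes_not_at_version v1.30.
```

Empty output suggests every node reports the target minor version.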

Karpenter Node Drift — Automatic Replacement

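# Karpenter automatically replaces nodes when:
# 1. The node has drifted from its NodePool or EC2NodeClass (e.g. a new AMI matches amiSelectorTerms)
# 2. Node has expired (expireAfter in NodePool)
# 3. Node is underutilised and can be consolidated

# Trigger a controlled rollout by changing the NodePool template; Karpenter should then
# mark existing nodes as drifted. The annotation key below is a placeholder. Hand-editing the
# Karpenter-managed karpenter.sh/nodepool-hash* annotations is not a documented trigger.
kubectl patch nodepool general --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"example.com/rollout-id":"1"}}}}}'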

# Check node replacement progress:
kubectl get nodes -l karpenter.sh/nodepool=general
kubectl get nodeclaim  # Karpenter's unit of node lifecycle

11. Multi-Cluster Fleet Management

Argo CD ApplicationSets for Fleet-Wide Add-ons

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-addons
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: production   # target all production clusters
  template:
    metadata:
      name: "{{name}}-cert-manager"
    spec:
      project: platform
      source:
        repoURL: https://github.com/myorg/platform-gitops
        targetRevision: HEAD
        path: add-ons/cert-manager
        helm:
          valueFiles:
            - values.yaml
            - "clusters/{{name}}/values/cert-manager-values.yaml"
      destination:
        server: "{{server}}"
        namespace: cert-manager
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
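
To preview what the clusters generator produces for one cluster, the {{name}} and {{server}} substitutions can be imitated locally. This is only a rough sketch of the substitution step: Argo CD performs the real rendering inside the ApplicationSet controller. The function name render_appset_field is hypothetical, and values containing "|" or "&" would need escaping for sed.

```shell
# Substitute the clusters-generator parameters into a template string.
# $1 = cluster name, $2 = cluster API server URL, $3 = template string
render_appset_field() {
  printf '%s\n' "$3" | sed -e "s|{{name}}|$1|g" -e "s|{{server}}|$2|g"
}

# Example: the Application name and per-cluster values file for cluster prod-eu
# render_appset_field prod-eu https://example.invalid '{{name}}-cert-manager'
# render_appset_field prod-eu https://example.invalid 'clusters/{{name}}/values/cert-manager-values.yaml'
```

Rendering the values-file path this way is a quick check that the expected per-cluster file exists in the repository before a new cluster is labelled.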

Cluster Registration Pattern

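# Register a new cluster with the Argo CD management cluster.
# The first argument is a kubeconfig context name; for contexts created by
# "aws eks update-kubeconfig" this is the cluster ARN by default.
argocd cluster add arn:aws:eks:eu-west-1:123456789012:cluster/prod-eu \
  --name prod-eu \
  --grpc-web

# Label the cluster Secret for ApplicationSet selection.
# Argo CD generates the Secret name, so look it up by its secret-type label first.
kubectl get secret -n argocd -l argocd.argoproj.io/secret-type=cluster
kubectl label secret <cluster-secret-name> \
  -n argocd \
  environment=production \
  region=eu-west-1 \
  tier=workload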

# Now ApplicationSets with clusters generator will automatically
# target this cluster on next reconciliation

Fleet Configuration Hierarchy

platform-gitops/
├── base/                     # All clusters inherit
│   ├── cert-manager/
│   ├── gatekeeper/
│   └── kube-prometheus-stack/
├── overlays/
│   ├── production/           # Production-tier overrides (HA, large resources)
│   ├── staging/              # Staging-tier overrides (smaller, relaxed policies)
│   └── dev/                  # Dev overrides (single replica, no PDB)
└── clusters/
    ├── prod-us-east-1/       # Cluster-specific (region, IRSA ARNs, node types)
    ├── prod-eu-west-1/
    └── staging-us-east-1/
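
The hierarchy implies a precedence order: base first, then the tier overlay, then the cluster directory. The sketch below prints that order for a given cluster. It assumes the tier can be inferred from the cluster name prefix (prod-, staging-, dev-), which the tree suggests but does not state; the function name config_layers is hypothetical.

```shell
# Print the configuration layers for a cluster, lowest precedence first.
# $1 = cluster directory name, e.g. prod-us-east-1
# ASSUMPTION: the tier is inferred from the name prefix (prod-, staging-, dev-).
config_layers() {
  case "$1" in
    prod-*)    cl_tier=production ;;
    staging-*) cl_tier=staging ;;
    dev-*)     cl_tier=dev ;;
    *) echo "cannot infer tier from cluster name: $1" >&2; return 1 ;;
  esac
  printf '%s\n' "base/" "overlays/$cl_tier/" "clusters/$1/"
}

# Example: config_layers prod-us-east-1
```

Later layers override earlier ones, so cluster-specific values win over tier and base defaults.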

12. Best Practices

1. Use managed K8s unless you have a reason not to

EKS/GKE/AKS handle etcd backups, control plane upgrades, and API server HA. Self-managed clusters require significant operational expertise; choose them only for air-gapped, on-prem, or specific compliance requirements.

2. Separate system and application node groups

Taint a dedicated system node group for DNS, CNI, CSI, and ingress. This prevents noisy application workloads from evicting platform add-ons under pressure.

3. Never skip minor version upgrades

K8s skew policy requires sequential minor version upgrades. Run pluto before each upgrade to find deprecated API usage. Test every upgrade in staging with the same workloads first.
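
A guard like the following can sit in an upgrade pipeline to reject a skipped minor version before anything is touched. It is a minimal sketch that only compares major and minor numbers; the function name is_sequential_upgrade is hypothetical, and it deliberately treats downgrades as invalid too.

```shell
# Succeed only if the target is the same minor version or exactly one minor ahead.
# $1 = current version, $2 = target version, both as MAJOR.MINOR[.PATCH]
# with an optional leading "v".
is_sequential_upgrade() {
  isu_cur=${1#v}; isu_tgt=${2#v}
  isu_cur_major=${isu_cur%%.*}; isu_tgt_major=${isu_tgt%%.*}
  isu_cur_rest=${isu_cur#*.};   isu_tgt_rest=${isu_tgt#*.}
  isu_cur_minor=${isu_cur_rest%%.*}; isu_tgt_minor=${isu_tgt_rest%%.*}
  # A major version change is out of scope for this check
  [ "$isu_cur_major" = "$isu_tgt_major" ] || return 1
  isu_step=$((isu_tgt_minor - isu_cur_minor))
  [ "$isu_step" -eq 0 ] || [ "$isu_step" -eq 1 ]
}

# Example: is_sequential_upgrade 1.29 1.30 || echo "upgrade through each minor version in turn"
```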

4. Bootstrap clusters to GitOps immediately

Install Argo CD/Flux as the first post-provision step. All add-ons, RBAC, policies, and namespaces should be managed from Git from day one. Manual configuration creates drift that compounds over time.

5. Use ClusterClass or blueprints for consistency

CAPI ClusterClass or EKS Blueprints modules encode organisational standards (audit logging, OIDC, node sizing, add-ons) once and instantiate consistently. Every cluster should be provisioned from the same template.

6. Enable etcd encryption at rest

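Configure --encryption-provider-config on kube-apiserver to encrypt Secrets at rest in etcd, preferably with a KMS provider (envelope encryption). Local-key providers such as secretbox or aescbc are the fallback, and aesgcm is only advisable with automated key rotation. EKS: enable with encryptionConfig in cluster spec.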

7. Use Karpenter over Cluster Autoscaler for new EKS clusters

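Karpenter launches EC2 capacity directly rather than resizing Auto Scaling groups, so nodes typically become ready faster than with CA (roughly 45 seconds vs 2–5 minutes). It also handles spot interruptions natively and consolidates underutilised nodes automatically. Prefer CA mainly when you need a provider that Karpenter does not support.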

8. Test node drains with PDB validation before every upgrade

Run kubectl get pdb -A before upgrade and identify any with maxUnavailable=0 or minAvailable=100%. These will block node drains and stall upgrades. Fix or temporarily relax them during maintenance windows.

Coverage Checklist
  • Provisioning options comparison table (managed/CAPI/kubeadm/kOps/k3s/blueprints)
  • EKS cluster creation via eksctl with key flags
  • EKS networking modes: VPC CNI, prefix delegation, custom CNI comparison
  • Prefix delegation enablement via kubectl set env
  • OIDC / IRSA setup: eksctl associate-iam-oidc-provider + create iamserviceaccount
  • EKS managed add-ons: describe-addon-versions + create-addon commands
  • GKE Standard vs Autopilot comparison table
  • GKE cluster creation with Workload Identity + private nodes
  • AKS cluster creation with system + user node pools and availability zones
  • CAPI architecture diagram: management cluster / providers / object hierarchy
  • clusterctl init + generate cluster + apply commands
  • Full CAPI Cluster + AWSCluster + KubeadmControlPlane + MachineDeployment YAML
  • ClusterClass CRD for reusable cluster templates + Cluster instance with topology
  • kubeadm ClusterConfiguration YAML (audit log, OIDC, etcd quota, admission plugins)
  • kubeadm init + control plane join + worker join commands
  • External etcd topology (stacked vs external) + etcd disk performance callout
  • Certificate expiry check + renewal commands (kubeadm certs)
  • EKS Blueprints Terraform module (eks_blueprints + eks_blueprints_addons) with ON_DEMAND/Spot/system node groups
  • Node group topology diagram (system/general/observability/spot/GPU)
  • Well-known node labels reference (K8s standard + EKS-specific)
  • topologySpreadConstraints for multi-AZ StatefulSet distribution
  • WaitForFirstConsumer StorageClass for zone-aware volume binding
  • Karpenter vs Cluster Autoscaler comparison table
  • Karpenter IAM setup + Helm install commands
  • NodePool YAML (requirements, disruption, expireAfter, limits)
  • EC2NodeClass YAML (AMI, role, subnet/SG selectors, block device, userData)
  • Spot interruption SQS queue + EventBridge rules setup
  • Cluster add-ons reference table (15 add-ons across 5 categories)
  • Argo CD sync waves for add-on dependency ordering (-1 to 10)
  • Argo CD bootstrap sequence (install → root app → app-of-apps)
  • GitOps repository structure for multi-cluster (clusters/add-ons/base)
  • Pre-upgrade checklist: pluto, API deprecations, PDB check, add-on compat
  • K8s skew policy callout (no version skipping)
  • EKS control plane upgrade (aws eks update-cluster-version)
  • Node group blue/green upgrade (create new NG → cordon → drain → delete old)
  • kubeadm control plane + worker node upgrade commands
  • Karpenter node drift automatic replacement + nodeclaim commands
  • ApplicationSet with clusters generator for fleet-wide add-on deployment
  • Cluster registration with argocd CLI + label for ApplicationSet selection
  • Fleet configuration hierarchy (base/overlays/clusters)
  • 8 best practices cards