Fundamentals

Capacity planning answers two questions: do we have enough resources today, and when do we need to add more? In Kubernetes, the complexity comes from five independent resource dimensions (CPU, memory, pod count, IP addresses, and volume attachments) layered on top of autoscaling, spot variability, and the gap between what workloads request and what they actually use.

Capacity Supply vs Demand
  Node capacity (allocatable = the part available for pods)
  ┌─────────────────────────────────────────────────────────────┐
  │  System reserved  │  Kube reserved  │  Available for pods   │
  └─────────────────────────────────────────────────────────────┘
                                         ┌─────────────────────┐
                                         │  Requested (sum)    │
                                         │  ├── Actual usage   │
                                         │  └── Slack (waste)  │
                                         │  Headroom buffer    │
                                         │  Free (unscheduled) │
                                         └─────────────────────┘
  Warning zones:
    Requests / Allocatable > 80%  → near saturation
    Actual usage / Requests < 25% → over-provisioned (right-size first)
    Headroom < 1 full node        → autoscaler can't absorb node failure
⚠️
Requests drive scheduling, not usage

The Kubernetes scheduler packs pods based on requests, not actual resource consumption. A pod that requests 4 CPU but uses 0.1 CPU blocks 4 CPU worth of scheduling capacity. Over-requested workloads cause artificial node saturation while actual utilization stays low — always right-size requests before adding nodes.
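To make the warning zones concrete, here is a minimal Python sketch that classifies a node or cluster from three numbers. The 0.80 and 0.25 thresholds come from the diagram above; the function name `capacity_zones` and the 4.5-core node are illustrative assumptions, not part of any tool.

```python
def capacity_zones(allocatable_cpu, requested_cpu, used_cpu):
    """Classify a node or cluster against the warning zones in the diagram.

    All arguments are CPU cores. Thresholds (0.80, 0.25) are taken from the
    diagram; everything else here is illustrative.
    """
    request_ratio = requested_cpu / allocatable_cpu
    usage_ratio = used_cpu / requested_cpu if requested_cpu else 0.0
    return {
        "request_ratio": round(request_ratio, 3),
        "usage_ratio": round(usage_ratio, 3),
        "near_saturation": request_ratio > 0.80,
        "over_provisioned": requested_cpu > 0 and usage_ratio < 0.25,
    }


# The pod from the text (requests 4 CPU, uses 0.1 CPU) on a hypothetical
# node with 4.5 allocatable cores.
print(capacity_zones(allocatable_cpu=4.5, requested_cpu=4.0, used_cpu=0.1))
```

The node looks nearly full to the scheduler while running almost idle, which is exactly the case where right-sizing should come before adding nodes.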

Capacity Dimensions

Kubernetes clusters can hit limits on any of five independent dimensions. Hitting any single one blocks further scheduling even when others are available:

| Dimension | What limits it | How to check | Expansion strategy |
|---|---|---|---|
| CPU (requests) | Sum of pod CPU requests vs node allocatable CPU | kubectl resource-capacity --sort cpu.request | Add nodes, right-size requests, enable VPA |
| Memory (requests) | Sum of pod memory requests vs node allocatable memory | kubectl resource-capacity --sort mem.request | Add nodes, right-size requests, enable VPA |
| Pod count | --max-pods per node (kubelet default 110; with the AWS VPC CNI the default depends on the instance's ENI limits) | kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods | Increase max-pods (VPC CNI prefix delegation), add nodes |
| IP addresses | Subnet size: a /24 has 251 usable IPs in AWS (5 are reserved), shared between nodes and pods | aws ec2 describe-subnets --query 'Subnets[].AvailableIpAddressCount' | Add subnets/secondary CIDRs, enable prefix delegation |
| Storage (PV) | EBS volume limits per instance type (28–128 attachments) | kubectl get volumeattachments (lists attachments with their node) | Use EFS/S3 for shared storage, increase instance size |

Node Sizing Strategy

Instance family selection

| Workload profile | AWS family | CPU:memory ratio | Use case |
|---|---|---|---|
| General purpose | m7i, m6i, m5 | 1:4 | Mixed microservices, most production workloads |
| CPU-intensive | c7i, c6i, c5 | 1:2 | API servers, data processing, encoding |
| Memory-intensive | r7i, r6i, r5 | 1:8 | JVM apps, in-memory caches, analytics |
| Burstable / dev | t3, t4g | 1:2–4 | Development, CI runners, low-traffic services |
| GPU / ML | p4, g5, g4dn | n/a | Model training, inference, video |
| ARM / cost-optimized | m7g, c7g, r7g (Graviton) | 1:4 | Same as above families; ~20% cheaper |

Node size tradeoffs

Fewer, larger nodes

  • Lower per-node overhead (daemon sets, kube-reserved)
  • Fewer node reboots during upgrades
  • Better bin-packing for large pods
  • Risk: larger blast radius per node failure
  • Risk: fragmentation for small pods
  • Use: m7i.4xlarge (16 vCPU / 64 GiB) as baseline

More, smaller nodes

  • Smaller blast radius per node failure
  • Better fault isolation with PodAntiAffinity
  • Higher overhead ratio (DaemonSets cost more)
  • Risk: more nodes to manage/upgrade
  • Risk: IP exhaustion faster
  • Use: m7i.xlarge (4 vCPU / 16 GiB) for isolation-critical

kube-reserved and system-reserved budgets

Kubernetes reserves resources on each node for the kubelet, system daemons, and eviction threshold before pods can be scheduled. Always set these explicitly in your node configuration:

# EKS managed node group bootstrap configuration
# /etc/eks/bootstrap.sh extra args (via launch template userData)
--kubelet-extra-args '
  --kube-reserved=cpu=250m,memory=1Gi,ephemeral-storage=1Gi
  --system-reserved=cpu=250m,memory=500Mi,ephemeral-storage=1Gi
  --eviction-hard=memory.available<500Mi,nodefs.available<10%,nodefs.inodesFree<5%
  --eviction-soft=memory.available<1Gi,nodefs.available<15%
  --eviction-soft-grace-period=memory.available=2m,nodefs.available=2m
  --max-pods=110
'

Effective allocatable capacity formula

allocatable_cpu = node_cpu − kube_reserved_cpu − system_reserved_cpu
allocatable_mem = node_mem − kube_reserved_mem − system_reserved_mem − eviction_threshold

Example: m7i.xlarge (4 vCPU / 16 GiB) with the flags above
CPU: 4000m − 250m − 250m = 3500m per node
Mem: 16384Mi − 1024Mi − 500Mi − 500Mi = 14360Mi per node

# Verify actual allocatable on a running node
kubectl get node ip-10-0-1-42.us-east-1.compute.internal \
  -o jsonpath='{.status.allocatable}' | jq .

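# Output (from a node running the EKS default reservations rather than the
# flags above, so the values differ from the worked example):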
# {
#   "cpu": "3920m",
#   "memory": "14902836Ki",
#   "pods": "110",
#   "ephemeral-storage": "99Gi"
# }
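The allocatable formula is simple enough to check by hand, but a small helper avoids slips when comparing instance sizes. This is a minimal sketch that plugs in the kubelet flag values shown above (250m + 250m CPU, 1Gi + 500Mi memory, 500Mi hard eviction threshold); the function name is an assumption for illustration.

```python
def allocatable(capacity, kube_reserved, system_reserved, eviction_hard=0):
    """allocatable = capacity - kube-reserved - system-reserved - eviction threshold."""
    return capacity - kube_reserved - system_reserved - eviction_hard


# m7i.xlarge with the flags above. CPU in millicores, memory in Mi.
cpu_m = allocatable(4000, 250, 250)              # no eviction threshold for CPU
mem_mi = allocatable(16 * 1024, 1024, 500, 500)
print(cpu_m, mem_mi)  # 3500 14360
```

A real node reports slightly less than the nominal 16 GiB as capacity, so treat the result as an upper bound and confirm with the kubectl query above.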

Measuring Current Utilization

kubectl resource-capacity (krew)

# Per-node CPU and memory requests vs allocatable
kubectl resource-capacity --sort cpu.request

# Output (example):
# NODE                              CPU REQUESTS   CPU LIMITS   MEMORY REQUESTS   MEMORY LIMITS
# ip-10-0-1-42.us-east-1.compute   2350m/3920m    4000m/3920m  8Gi/14.2Gi        16Gi/14.2Gi
# ip-10-0-1-55.us-east-1.compute   1100m/3920m    2000m/3920m  4Gi/14.2Gi        8Gi/14.2Gi

# Include actual usage (requires metrics-server)
kubectl resource-capacity --util --sort cpu.util

# Per-pod breakdown on a node
kubectl resource-capacity --pods \
  --node-labels kubernetes.io/hostname=ip-10-0-1-42.us-east-1.compute.internal \
  --sort cpu.request

PromQL — cluster-wide utilization

# Cluster CPU request utilization (requested / allocatable)
sum(kube_pod_container_resource_requests{resource="cpu"})
  /
sum(kube_node_status_allocatable{resource="cpu"})

# Cluster memory request utilization
sum(kube_pod_container_resource_requests{resource="memory"})
  /
sum(kube_node_status_allocatable{resource="memory"})

# Per-namespace CPU request share (top consumers)
sort_desc(
  sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
    /
  scalar(sum(kube_node_status_allocatable{resource="cpu"}))
)

# Actual CPU usage as % of allocatable
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  /
sum(kube_node_status_allocatable{resource="cpu"})

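# Node-level memory utilization (fraction of total memory in use)
# node_exporter series are keyed by instance, not node
max by (instance) (
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
    /
  node_memory_MemTotal_bytes
)

# Running pods per node vs max-pods limit
# (kube_pod_info has no phase label, so join with kube_pod_status_phase)
sum by (node) (
  kube_pod_info
    * on (namespace, pod) group_left ()
  kube_pod_status_phase{phase="Running"}
)
  /
max by (node) (kube_node_status_capacity{resource="pods"})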

Namespace-level capacity report

#!/bin/bash
# Weekly capacity report script
echo "=== Cluster Capacity Report $(date +%Y-%m-%d) ==="

echo ""
echo "--- Node Summary ---"
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,PODS:.status.allocatable.pods'

echo ""
echo "--- Top 20 Pods by CPU Usage ---"
# kubectl top reports live usage from metrics-server, not requests
kubectl top pods -A --sort-by=cpu 2>/dev/null | \
  head -n 21 | column -t

echo ""
echo "--- ResourceQuota Usage ---"
kubectl get resourcequota -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU-REQ:.status.used.requests\.cpu,CPU-LIM:.status.used.limits\.cpu,MEM-REQ:.status.used.requests\.memory'

echo ""
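echo "--- Nodes Near CPU Request Saturation (>75%) ---"
# Assumes the CPU REQUESTS column prints as used/allocatable in millicores,
# as in the example output above; adjust the parsing to your plugin version.
kubectl resource-capacity --sort cpu.request 2>/dev/null | \
  awk 'NR==1 {print; next} {split($2, c, "/"); if (c[2]+0 > 0 && (c[1]+0)/(c[2]+0) > 0.75) print}'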

Right-Sizing Workloads

Before adding nodes, eliminate request waste. CPU over-provisioning of 40–70% is common, and a VPA recommendation pass can often shrink the cluster by 20–40% without hurting reliability.

VPA recommendation workflow

# Step 1: Deploy VPA in Off mode (recommendation-only — no pod restarts)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"   # recommendation only
  resourcePolicy:
    containerPolicies:
      - containerName: payment-service
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledValues: RequestsAndLimits

# Step 2: Read recommendations after 24h+ of data
kubectl describe vpa payment-service-vpa -n payments

# Output:
#   Recommendation:
#     Container Recommendations:
#       Container Name:  payment-service
#         Lower Bound:
#           Cpu:     100m
#           Memory:  256Mi
#         Target:
#           Cpu:     350m      ← recommended request
#           Memory:  512Mi
#         Uncapped Target:
#           Cpu:     350m
#           Memory:  480Mi
#         Upper Bound:
#           Cpu:     1200m
#           Memory:  2Gi

# Step 3: Bulk export all VPA recommendations across cluster
kubectl get vpa -A -o json | jq -r '
  .items[] |
  .metadata.namespace + "/" + .metadata.name + "  " +
  (.status.recommendation.containerRecommendations[]? |
    .containerName + "  cpu:" +
    .target.cpu + "  mem:" + .target.memory)'

Goldilocks dashboard

Goldilocks (covered in depth in Section 08-07) installs VPAs for every Deployment in a labeled namespace and visualizes recommendations in a dashboard:

# Label a namespace for Goldilocks analysis
kubectl label namespace payments goldilocks.fairwinds.com/enabled=true

# Access dashboard (port-forward)
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80

Common right-sizing patterns

| Pattern | Symptom | VPA says | Action |
|---|---|---|---|
| CPU over-requested | CPU request 2000m, usage 80m | Target: 150m | Lower request to 200m (target plus ~20% safety margin, rounded up) |
| Memory under-requested | OOMKilled events, memory request 256Mi | Target: 1.2Gi | Raise request to 1.5Gi; set limit at 2Gi |
| Limits ≫ Requests | CPU limit 8000m on 100m request | Target: 200m | Align: request 200m, limit 400m (2× ratio) |
| No limits set | Pod using 100% node CPU during spike | N/A | Set CPU limit at 3× VPA target; memory limit at 1.5× target |
| Batch job with prod sizing | Daily job with 4 CPU / 8Gi running 5 min/day | Target: 1 CPU / 2Gi | Lower requests; consider PriorityClass: batch-low |
💡
VPA + HPA conflict rule

Do not use VPA updateMode: Auto with HPA on the same CPU/memory metric — they fight each other. The recommended pattern: VPA on memory (HPA rarely scales on memory), HPA on CPU or a custom metric (RPS, queue depth). See 08-07 for full VPA/HPA strategy.

Headroom & Safety Margins

Headroom is the slack capacity you deliberately keep available to absorb: node failures, traffic spikes before autoscaling reacts, and rolling updates that temporarily double pod count.

Headroom targets by workload criticality

| Cluster tier | CPU headroom | Memory headroom | Min free nodes | Rationale |
|---|---|---|---|---|
| Production | ≥ 20% | ≥ 25% | 2 nodes (N+2) | Absorb node failure + autoscale lag (2–4 min on spot) |
| Staging | ≥ 15% | ≥ 20% | 1 node (N+1) | Functional parity with some cost savings |
| Development | ≥ 5% | ≥ 10% | 0 | Autoscale freely; minor disruption acceptable |
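The "min free nodes" column can be turned into a quick check: do current requests still fit if the largest N nodes disappear? The sketch below is a coarse, single-dimension test under stated assumptions (it ignores bin-packing fragmentation, affinity rules, and replacement time); the function name and the ten-node example are hypothetical.

```python
def survives_node_loss(node_allocatable, total_requests, nodes_lost):
    """Return True if requests still fit after losing the largest nodes.

    node_allocatable: allocatable capacity per node (one resource, e.g. cores)
    total_requests:   sum of pod requests for the same resource
    nodes_lost:       how many of the largest nodes to remove (2 for N+2)
    """
    keep = max(len(node_allocatable) - nodes_lost, 0)
    remaining = sorted(node_allocatable)[:keep]   # drop the largest nodes
    return sum(remaining) >= total_requests


# Ten nodes with 3.5 allocatable cores each, 27 cores requested (about 23% headroom).
nodes = [3.5] * 10
print(survives_node_loss(nodes, 27.0, nodes_lost=2))  # True:  28.0 cores remain
print(survives_node_loss(nodes, 27.0, nodes_lost=3))  # False: 24.5 cores remain
```

Run the same check separately for CPU and memory, since either dimension can be the one that fails.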

Pause pods for guaranteed headroom

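Karpenter and Cluster Autoscaler scale down idle nodes. A pause pod (a pod running the pause image with real resource requests but no real workload) holds capacity in reserve: its requests keep nodes from looking idle, so the autoscaler does not reclaim them, and because it runs at low priority, real pods can preempt it the moment they need the space. The Deployment below reserves capacity per replica, not per node, so size replicas × requests to match your headroom target: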

apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reservation
  namespace: kube-system
spec:
  replicas: 2   # total reserve = replicas × requests; add topologySpreadConstraints if you need one per AZ
  selector:
    matchLabels:
      app: capacity-reservation
  template:
    metadata:
      labels:
        app: capacity-reservation
    spec:
      priorityClassName: batch-low   # must exist and rank below real workloads so these pods are preempted first
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1500m"      # each replica reserves 1.5 CPU
              memory: "3Gi"     # each replica reserves 3 GiB of memory
          securityContext:
            allowPrivilegeEscalation: false
            runAsNonRoot: true
            runAsUser: 65534
            seccompProfile:
              type: RuntimeDefault
            capabilities:
              drop: ["ALL"]
💡
How pause pods work with autoscaling

When a real pod is pending and no node has room, the scheduler preempts the low-priority pause pod (PriorityClass: batch-low) to make space. The preempted pause pod becomes pending, which triggers Karpenter or Cluster Autoscaler to provision a new node. Net result: the real pod starts immediately, and the new node restores the headroom for the next burst.

Demand Forecasting

Capacity planning without a forecast is reactive. Use historical Prometheus data to project future demand.

Linear regression with predict_linear()

# Predict CPU request total in 30 days based on last 14 days of growth
predict_linear(
  sum(kube_pod_container_resource_requests{resource="cpu"})[14d:1h],
  30 * 24 * 3600
)

# As % of current allocatable — will we saturate?
predict_linear(
  sum(kube_pod_container_resource_requests{resource="cpu"})[14d:1h],
  30 * 24 * 3600
)
  /
sum(kube_node_status_allocatable{resource="cpu"})

# Namespace-level forecast: payments team CPU growth
predict_linear(
  sum by (namespace) (
    kube_pod_container_resource_requests{resource="cpu", namespace="payments"}
  )[7d:1h],
  14 * 24 * 3600
)
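For intuition about what these queries compute, the sketch below fits an ordinary least-squares line through (timestamp, value) samples and extrapolates it. It mirrors the idea behind predict_linear() but is not Prometheus's implementation: it extrapolates from the last sample rather than the evaluation time, and the data is synthetic.

```python
def linear_forecast(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) pairs, extrapolated
    `seconds_ahead` past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var                       # units per second
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


DAY = 24 * 3600
# Synthetic history: 100 cores growing by exactly 0.5 cores/day for 14 days.
history = [(d * DAY, 100 + 0.5 * d) for d in range(15)]
forecast = linear_forecast(history, 30 * DAY)
print(round(forecast, 2))  # 122.0
```

Because the fit is a straight line, one deployment that doubles requests inside the lookback window will steepen the slope for the entire forecast.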

Seasonality and traffic patterns

predict_linear() fits a straight line, so it assumes a constant growth rate. Real traffic has weekly cycles (weekday vs weekend) and seasonal spikes (end-of-month billing, Black Friday). For these, use longer lookback windows and apply multipliers manually; the script below derives a weekly peak-to-average multiplier:

#!/bin/bash
# Extract weekly peak-to-average ratio from Prometheus
# Use this to size for peak, not average

PROM_URL="http://prometheus.monitoring.svc:9090"

# 7-day average CPU requests
AVG=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=avg_over_time(sum(kube_pod_container_resource_requests{resource="cpu"})[7d:1h])' | \
  jq -r '.data.result[0].value[1]')

# 7-day peak CPU requests
PEAK=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=max_over_time(sum(kube_pod_container_resource_requests{resource="cpu"})[7d:1h])' | \
  jq -r '.data.result[0].value[1]')

echo "Average CPU requests: ${AVG} cores"
echo "Peak CPU requests:    ${PEAK} cores"
echo "Peak-to-avg ratio:    $(echo "scale=2; $PEAK/$AVG" | bc)×"
echo ""
echo "Size cluster for PEAK × 1.3 safety margin = $(echo "scale=1; $PEAK*1.3" | bc) cores"

Capacity planning spreadsheet model

| Input | Value | Notes |
|---|---|---|
| Current CPU requests | 120 cores | From Prometheus |
| Monthly growth rate | 8% | From predict_linear 3-month trend |
| Peak-to-average multiplier | 1.4× | From weekly seasonality analysis |
| Headroom buffer | 20% | Production tier requirement |
| 6-month CPU target | 120 × (1.08^6) × 1.4 × 1.2 ≈ 320 cores | Order nodes 4 weeks before hitting limit |
| Node size (m7i.xlarge) | 3.5 allocatable CPU per node | After kube/system reserved |
| Nodes needed in 6 months | 320 / 3.5 ≈ 92 nodes | vs 40 today → plan for more than doubling |
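The spreadsheet model reduces to one compound-growth formula. The sketch below evaluates it with the table's inputs (120 cores, 8% monthly growth, 1.4× peak multiplier, 20% headroom applied as ×1.2, 3.5 allocatable cores per node); the function and parameter names are illustrative.

```python
import math


def nodes_needed(current_cores, monthly_growth, months, peak_multiplier,
                 headroom, allocatable_per_node):
    """Compound the growth rate, size for peak, add headroom, round up to nodes."""
    target = (current_cores * (1 + monthly_growth) ** months
              * peak_multiplier * (1 + headroom))
    return target, math.ceil(target / allocatable_per_node)


target, nodes = nodes_needed(120, 0.08, 6, 1.4, 0.20, 3.5)
print(f"{target:.0f} cores -> {nodes} nodes")  # 320 cores -> 92 nodes
```

Rerunning with a different growth rate shows how sensitive the plan is: at 5% monthly growth the same inputs give about 270 cores, roughly 15 fewer nodes.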

Autoscaling Strategy

Autoscaling handles the short-term spikes; capacity planning handles the long-term trend. Both are required — autoscaling alone is not a substitute for planning.

Three-layer autoscaling model

Autoscaling Layers
  Layer 3 — Node autoscaling (Karpenter / Cluster Autoscaler)
    Response time: 2–4 min (Cluster Autoscaler), 1–2 min (Karpenter)
    Trigger: Pending pods that can't be scheduled
    Action: Add / remove nodes

  Layer 2 — Horizontal pod autoscaling (HPA / KEDA)
    Response time: 15 sec–2 min (depends on metric scrape interval)
    Trigger: CPU%, memory%, custom metric (RPS, queue depth)
    Action: Scale replicas up/down

  Layer 1 — Vertical pod autoscaling (VPA)
    Response time: Minutes–hours (restarts pods)
    Trigger: VPA updateMode != Off
    Action: Right-size requests/limits per pod

  ──────────────────────────────────────────────────────────────
  Capacity planning: ensures Layer 3 has room to scale into
  (burst headroom, savings plan coverage for baseline)

Karpenter NodePool tuning for capacity planning

Karpenter (covered in depth in Section 08-01) consolidates and provisions nodes. For capacity planning purposes, configure disruption budgets to prevent over-aggressive consolidation during business hours:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: production
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m7i.xlarge
            - m7i.2xlarge
            - m7i.4xlarge
            - m6i.xlarge
            - m6i.2xlarge
            - m6i.4xlarge
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: production
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      # During business hours: consolidate at most 10% of nodes at once
      - schedule: "0 9 * * 1-5"    # Mon-Fri 09:00 (Karpenter evaluates schedules in UTC)
        duration: 8h
        nodes: "10%"
      # Off-hours: allow 20% consolidation
      - nodes: "20%"
  limits:
    cpu: "500"
    memory: "2000Gi"

HPA configuration for spike absorption

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3    # ≥ 3 for multi-AZ spread
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out at 60% CPU, not 80%
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"     # 1000 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # short window; the scaleUp default is 0 (immediate)
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60    # add up to 4 pods per minute
        - type: Percent
          value: 100
          periodSeconds: 60    # or double, whichever is larger
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min before scale-down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60    # remove at most 2 pods per minute
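To see how the target and the scaleUp policies interact, the sketch below applies the documented HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and then the per-period cap implied by the two policies with selectPolicy: Max. It is a simplification under stated assumptions: it ignores the HPA tolerance band, the stabilization window, and pod readiness, and the function names are illustrative.

```python
import math


def desired_replicas(current_replicas, current_value, target_value):
    """Core HPA formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_value / target_value)


def scale_up_cap(current_replicas, max_pods=4, max_percent=100):
    """Most replicas reachable in one 60s period under the policies above:
    add 4 pods or 100%, whichever is larger (selectPolicy: Max)."""
    by_pods = current_replicas + max_pods
    by_percent = math.ceil(current_replicas * (1 + max_percent / 100))
    return max(by_pods, by_percent)


# 3 replicas averaging 90% CPU against the 60% target.
want = desired_replicas(3, 90, 60)
allowed = min(want, scale_up_cap(3), 50)   # 50 = maxReplicas
print(want, allowed)  # 5 5

# 10 replicas at 180% CPU: the formula asks for 30, the policy allows 20 this minute.
print(min(desired_replicas(10, 180, 60), scale_up_cap(10), 50))  # 20
```

The second case shows why the node layer needs headroom: a large spike takes several policy periods to absorb, and each step can create pending pods.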

Node Pool Architecture

Separate node pools by workload class. This allows independent scaling, targeted instance selection, and cost isolation:

Node Pool Topology
  ┌──────────────────────────────────────────────────────────────────┐
  │  system pool  (2× m7i.xlarge, on-demand, multi-AZ)              │
  │  Taint: CriticalAddonsOnly=true:NoSchedule                       │
  │  Workloads: CoreDNS, kube-proxy, metrics-server, cert-manager    │
  ├──────────────────────────────────────────────────────────────────┤
  │  observability pool  (3× m7i.2xlarge, on-demand, multi-AZ)      │
  │  Taint: dedicated=observability:NoSchedule                       │
  │  Workloads: Prometheus, Loki, Tempo, Pyroscope, Grafana          │
  ├──────────────────────────────────────────────────────────────────┤
  │  production pool  (Karpenter managed, spot+on-demand, multi-AZ)  │
  │  No taint — default pool                                         │
  │  Workloads: All production services                              │
  ├──────────────────────────────────────────────────────────────────┤
  │  batch pool  (Karpenter managed, spot-only)                      │
  │  Taint: dedicated=batch:NoSchedule                               │
  │  Workloads: Argo Workflows, Spark, ML training, CI runners       │
  ├──────────────────────────────────────────────────────────────────┤
  │  gpu pool  (Karpenter managed, g5.xlarge, spot+on-demand)        │
  │  Taint: nvidia.com/gpu=true:NoSchedule                           │
  │  Workloads: ML inference, video encoding                         │
  └──────────────────────────────────────────────────────────────────┘

# Batch workload targeting the batch node pool
# (assumes batch nodes carry the label dedicated=batch in addition to the taint)
spec:
  tolerations:
    - key: dedicated
      value: batch
      effect: NoSchedule
  nodeSelector:
    dedicated: batch
  # Alternative to nodeSelector when you need more control (use one or the other):
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated
                operator: In
                values: ["batch"]

Stateful Workload Capacity

Stateful workloads (databases, Kafka, etcd) have additional capacity constraints beyond CPU and memory.

Prometheus StatefulSet sizing example

| Dimension | Calculation | Example value |
|---|---|---|
| Samples ingested/sec | Scrape targets × avg series per target ÷ scrape interval | 50,000 targets × 200 series ÷ 15s ≈ 667k samples/sec |
| CPU | ~100m per 100k active series (about 1 CPU per 1M) | 10M active series → 100 × 100m = 10 CPU |
| Memory (head chunk) | ~3 KiB per active series | 10M series × 3 KiB ≈ 29 GiB RAM (30 GB) |
| Storage (TSDB) | bytes/sample × samples/sec × retention seconds × 1.1 overhead | 1.5 B × 667k/s × 86,400 s/day × 15 days × 1.1 ≈ 1.3 TiB per replica |
| IOPS | WAL: ~1000 IOPS; compaction bursts to 5000 IOPS | gp3 with 4000 IOPS provisioned |
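The sizing rules in the table can be evaluated for other fleet sizes with a few lines of Python. This sketch uses the table's rules of thumb as defaults (1.5 bytes per sample, 3 KiB of head memory per active series, 1.1 storage overhead); they are heuristics rather than guarantees, and the function name is illustrative.

```python
def prometheus_sizing(targets, series_per_target, scrape_interval_s,
                      retention_days, bytes_per_sample=1.5,
                      kib_per_series=3, overhead=1.1):
    """Rough Prometheus sizing from the rules of thumb in the table."""
    active_series = targets * series_per_target
    samples_per_sec = active_series / scrape_interval_s
    head_memory_gib = active_series * kib_per_series / 1024 ** 2
    storage_bytes = (bytes_per_sample * samples_per_sec
                     * retention_days * 86400 * overhead)
    return {
        "active_series": active_series,
        "samples_per_sec": round(samples_per_sec),
        "head_memory_gib": round(head_memory_gib, 1),
        "storage_tib": round(storage_bytes / 1024 ** 4, 2),
    }


# The table's example: 50,000 targets, 200 series each, 15s scrapes, 15d retention.
print(prometheus_sizing(50_000, 200, 15, 15))
```

For the example inputs this gives 10M active series, about 667k samples/sec, 28.6 GiB of head memory, and 1.3 TiB of storage per replica.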

PostgreSQL/MySQL sizing heuristics

IP Exhaustion Planning

VPC CNI assigns a real VPC IP to every pod. With dozens of pods per node (and more once prefix delegation is enabled), subnets exhaust quickly. Plan IP capacity as carefully as compute.

# Check available IPs per subnet
aws ec2 describe-subnets \
  --filters "Name=tag:kubernetes.io/cluster/my-cluster,Values=shared" \
  --query 'Subnets[*].{Subnet:SubnetId,AZ:AvailabilityZone,Available:AvailableIpAddressCount,CIDR:CidrBlock}' \
  --output table

# Count pods holding an IP, per node
kubectl get pods -A --no-headers \
  -o custom-columns=NODE:.spec.nodeName,IP:.status.podIP | grep -v '<none>' | \
  awk '{print $1}' | sort | uniq -c | sort -rn | head -20

# Count pods on one node to understand IP density
kubectl get pods -A --no-headers \
  --field-selector spec.nodeName=ip-10-0-1-42.us-east-1.compute.internal | wc -l
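A back-of-the-envelope check helps when choosing subnet sizes. The sketch below uses Python's standard ipaddress module and the fact that AWS reserves 5 addresses in every subnet. It assumes one IP per node plus one per pod and no prefix delegation; the VPC CNI's warm IP pools consume more in practice, so read the node count as an upper bound.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four addresses and the last one


def usable_ips(cidr):
    """Usable IPv4 addresses in an AWS subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def nodes_supported(cidr, pods_per_node):
    """Upper bound on nodes per subnet: each node needs 1 IP plus 1 per pod."""
    return usable_ips(cidr) // (pods_per_node + 1)


print(usable_ips("10.0.1.0/24"))            # 251
print(nodes_supported("10.0.1.0/24", 58))   # 4
```

A /24 holds only four fully packed 58-pod nodes, which is why IP capacity belongs in the plan alongside compute.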

EKS VPC CNI prefix delegation

Prefix delegation assigns a /28 prefix (16 IPs) to each ENI slot instead of a single IP, raising the per-node IP ceiling 16× without changing subnet CIDRs (the subnet still needs free contiguous /28 blocks):

# Enable prefix delegation on existing cluster
kubectl set env daemonset aws-node \
  -n kube-system \
  ENABLE_PREFIX_DELEGATION=true \
  WARM_PREFIX_TARGET=1

# Verify (new nodes will get prefix IPs; existing nodes need rolling replacement)
kubectl describe node ip-10-0-1-42.us-east-1.compute.internal | \
  grep -A5 "Capacity"

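# max-pods on m7i.xlarge (4 ENIs, 15 IPv4 addresses per ENI):
#   default:           4 × (15 − 1) + 2      = 58
#   prefix delegation: 4 × (15 − 1) × 16 + 2 = 898 addressable pods,
#   but AWS recommends capping max-pods at 110 (< 30 vCPUs) or 250 (30+ vCPUs)

| Instance | Max pods (default) | Max pods (prefix delegation) |
|---|---|---|
| t3.medium | 17 | 110 |
| m7i.xlarge | 58 | 110 (898 before the recommended cap) |
| m7i.4xlarge | 234 | 110 (3,714 before the recommended cap) |
| c7i.2xlarge | 58 | 110 (898 before the recommended cap) |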
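The ENI arithmetic behind these pod limits can be sketched in Python. This follows the formula used by AWS's max-pods calculator as I understand it: ENIs × (IPs per ENI − 1) + 2 by default, with each slot multiplied by 16 under prefix delegation and the result capped at the recommended 110 or 250. The ENI and IP counts passed in are per-instance limits from the EC2 documentation and are assumptions here; verify them for your instance types.

```python
def max_pods(enis, ips_per_eni, vcpus, prefix_delegation=False):
    """ENI-based pod ceiling for the AWS VPC CNI.

    One address per ENI is the ENI's own primary IP; the +2 accounts for
    host-network pods (aws-node and kube-proxy).
    """
    slots = enis * (ips_per_eni - 1)
    if not prefix_delegation:
        return slots + 2
    raw = slots * 16 + 2                  # each slot holds a /28 prefix (16 IPs)
    cap = 110 if vcpus < 30 else 250      # AWS-recommended ceiling
    return min(raw, cap)


print(max_pods(3, 6, 2))                             # t3.medium   -> 17
print(max_pods(4, 15, 4))                            # m7i.xlarge  -> 58
print(max_pods(8, 30, 16))                           # m7i.4xlarge -> 234
print(max_pods(4, 15, 4, prefix_delegation=True))    # m7i.xlarge  -> 110
```

With prefix delegation the limiting factor moves from IP slots to the recommended kubelet ceiling, so node CPU and memory usually become the real constraint on pod density.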

Capacity Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capacity-planning-alerts
  namespace: monitoring
spec:
  groups:
    - name: capacity.planning
      interval: 5m
      rules:

        # CPU request saturation
        - alert: ClusterCPURequestSaturationHigh
          expr: |
            sum(kube_pod_container_resource_requests{resource="cpu"})
              /
            sum(kube_node_status_allocatable{resource="cpu"}) > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Cluster CPU request utilization > 80%"
            description: "CPU requests are {{ $value | humanizePercentage }} of allocatable. Autoscaler may not have enough headroom."
            runbook_url: https://runbooks.example.com/capacity/cpu-saturation

        - alert: ClusterCPURequestSaturationCritical
          expr: |
            sum(kube_pod_container_resource_requests{resource="cpu"})
              /
            sum(kube_node_status_allocatable{resource="cpu"}) > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Cluster CPU request utilization > 90% — pending pods likely"
            runbook_url: https://runbooks.example.com/capacity/cpu-saturation

        # Memory request saturation
        - alert: ClusterMemoryRequestSaturationHigh
          expr: |
            sum(kube_pod_container_resource_requests{resource="memory"})
              /
            sum(kube_node_status_allocatable{resource="memory"}) > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Cluster memory request utilization > 80%"
            runbook_url: https://runbooks.example.com/capacity/memory-saturation

        # Pending pods (scheduling failure — often a capacity signal)
        - alert: PodsPendingTooLong
          expr: |
            sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pods pending > 10 min in namespace {{ $labels.namespace }}"
            description: "{{ $value }} pods are pending. Check node capacity or resource quota."
            runbook_url: https://runbooks.example.com/capacity/pending-pods

        # Node CPU utilization (actual usage, not requests)
        - alert: NodeCPUUtilizationHigh
          expr: |
            (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} CPU utilization > 85%"
            runbook_url: https://runbooks.example.com/capacity/node-cpu-high

        # Node memory pressure
        - alert: NodeMemoryPressure
          expr: |
            (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
              /
            node_memory_MemTotal_bytes > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} memory usage > 90%"
            runbook_url: https://runbooks.example.com/capacity/node-memory-pressure

        # Pod count approaching max-pods limit
        - alert: NodePodCapacityHigh
          expr: |
            sum by (node) (
              kube_pod_info
                * on (namespace, pod) group_left ()
              kube_pod_status_phase{phase="Running"}
            )
              /
            max by (node) (kube_node_status_capacity{resource="pods"}) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} pod capacity > 85%"
            runbook_url: https://runbooks.example.com/capacity/node-pod-limit

        # Namespace quota near limit
        - alert: NamespaceCPUQuotaNearLimit
          expr: |
            kube_resourcequota{resource="requests.cpu", type="used"}
              / ignoring (type)
            kube_resourcequota{resource="requests.cpu", type="hard"} > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Namespace {{ $labels.namespace }} CPU quota > 85% used"
            runbook_url: https://runbooks.example.com/capacity/quota-near-limit

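        # Karpenter node provisioning failures
        # (metric names and reason labels differ between Karpenter versions;
        # confirm this series exists on your controller's /metrics endpoint)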
        - alert: KarpenterNodeProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_disrupted_total{reason="failed"}[30m]) > 0
          labels:
            severity: warning
          annotations:
            summary: "Karpenter failed to provision nodes in last 30 min"
            runbook_url: https://runbooks.example.com/capacity/karpenter-provisioning-failure

        # Capacity forecast: will saturate in < 7 days
        - alert: ClusterCPUForecastSaturation
          expr: |
            predict_linear(
              sum(kube_pod_container_resource_requests{resource="cpu"})[3d:1h],
              7 * 24 * 3600
            )
              /
            sum(kube_node_status_allocatable{resource="cpu"}) > 0.90
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Cluster CPU requests forecast to reach 90% within 7 days"
            runbook_url: https://runbooks.example.com/capacity/cpu-forecast

Capacity Review Cadence

| Cadence | Activity | Owner | Output |
|---|---|---|---|
| Daily | Check capacity alerts in Grafana/PagerDuty; review pending pods | On-call engineer | Immediate remediation if alerts firing |
| Weekly | Run capacity report script; review VPA recommendations top-10 namespaces; check Goldilocks dashboard | Platform team | Right-sizing tickets for top over-provisioned workloads |
| Monthly | CPU/memory trend analysis (30d); forecast next 90 days; review FinOps efficiency KPIs; update savings plan commitment | Platform + FinOps | Capacity plan document; savings plan adjustment |
| Quarterly | Full capacity audit: node pool sizing review; instance family refresh (new gen available?); VPC subnet headroom; storage PVC growth; multi-cluster capacity balance | Platform team lead | Architectural changes, subnet expansion, instance migration plan |

Monthly capacity review checklist

#!/bin/bash
# Monthly capacity review — run on first Monday of each month
set -euo pipefail

PROM_URL="${PROM_URL:-http://prometheus.monitoring.svc:9090}"
REPORT_DATE=$(date +%Y-%m-%d)

echo "======================================"
echo " Monthly Capacity Review: $REPORT_DATE"
echo "======================================"

echo ""
echo "## 1. Node Summary"
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type,READY:.status.conditions[?(@.type=="Ready")].status,CREATED:.metadata.creationTimestamp'

echo ""
echo "## 2. Cluster Utilization (30-day avg)"
for METRIC in cpu memory; do
  AVG=$(curl -s "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=avg_over_time((sum(kube_pod_container_resource_requests{resource=\"${METRIC}\"}) / sum(kube_node_status_allocatable{resource=\"${METRIC}\"}))[30d:1h])" | \
    jq -r '.data.result[0].value[1] // "N/A"')
  PEAK=$(curl -s "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=max_over_time((sum(kube_pod_container_resource_requests{resource=\"${METRIC}\"}) / sum(kube_node_status_allocatable{resource=\"${METRIC}\"}))[30d:1h])" | \
    jq -r '.data.result[0].value[1] // "N/A"')
  echo "  ${METRIC}: avg=${AVG} peak=${PEAK}"
done

echo ""
echo "## 3. Top 10 CPU-Requesting Namespaces (30d avg)"
curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=sort_desc(sum by (namespace) (avg_over_time(kube_pod_container_resource_requests{resource="cpu"}[30d])))' | \
  jq -r '.data.result[:10][] | "  \(.metric.namespace): \(.value[1]) cores"'

echo ""
echo "## 4. 90-day CPU Forecast"
FORECAST=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode "query=predict_linear(sum(kube_pod_container_resource_requests{resource=\"cpu\"})[30d:1h], 90*24*3600) / sum(kube_node_status_allocatable{resource=\"cpu\"})" | \
  jq -r '.data.result[0].value[1] // "N/A"')
echo "  Projected utilization in 90 days: ${FORECAST}"

echo ""
echo "## 5. VPAs With Recommendations (first 10)"
# Limit inside jq: piping into head can trip pipefail via SIGPIPE
kubectl get vpa -A -o json 2>/dev/null | jq -r '
  [.items[] |
   select(.status.recommendation != null) |
   .metadata.namespace + "/" + .metadata.name] | .[:10][]'

echo ""
echo "## 6. Unbound PVCs (Pending or Lost)"
# PVCs do not support a status.phase field selector, so filter client-side
kubectl get pvc -A --no-headers 2>/dev/null | \
  awk '$3 != "Bound"' | grep . || echo "  None found"

Best Practices

Right-size before scaling out

VPA recommendations often reveal 40–70% CPU over-provisioning. Always run a right-sizing pass before concluding you need more nodes.

Reserve headroom with pause pods

Never run at > 80% request utilization in production. Use low-priority pause pods to maintain N+2 node headroom that real workloads can claim instantly through preemption.

Forecast from Prometheus data

Use predict_linear() on 14–30 day windows for growth trends. Alert when the forecast projects saturation within 7 days.

Plan all five dimensions

CPU and memory are obvious. Also plan pod count, IP addresses (VPC subnets), and storage PV/EBS attachment limits — any one can block scheduling.

Separate pools by workload class

System, observability, production, batch, and GPU pools scale independently, use appropriate instance families, and provide blast-radius isolation.

Monthly review cadence

Capacity planning is a monthly discipline, not a one-time event. Schedule the review, track the trend, and adjust savings plan commitments accordingly.