Capacity Planning | Kubernetes Docs

Fundamentals

Capacity planning answers two questions: Do we have enough resources today? And when do we need to add more? In Kubernetes, the complexity comes from five independent capacity dimensions (CPU, memory, pod count, IP addresses, and storage attachments) layered on top of autoscaling, spot variability, and the gap between what workloads request and what they use.

Capacity Supply vs Demand

  Node capacity (allocatable = the "Available for pods" segment)
  ┌─────────────────────────────────────────────────────────────┐
  │  System reserved  │  Kube reserved  │  Available for pods   │
  └─────────────────────────────────────────────────────────────┘
                                         ┌─────────────────────┐
                                         │  Requested (sum)    │
                                         │  ├── Actual usage   │
                                         │  └── Slack (waste)  │
                                         │  Headroom buffer    │
                                         │  Free (unscheduled) │
                                         └─────────────────────┘
  Warning zones:
    Requests / Allocatable > 80%  → near saturation
    Actual usage / Requests < 25% → over-provisioned (right-size first)
    Headroom < 1 full node        → autoscaler can't absorb node failure

⚠️

Requests drive scheduling, not usage

The Kubernetes scheduler packs pods based on requests, not actual resource consumption. A pod that requests 4 CPU but uses 0.1 CPU blocks 4 CPU worth of scheduling capacity. Over-requested workloads cause artificial node saturation while actual utilization stays low — always right-size requests before adding nodes.
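To make the cost of an over-requested pod concrete, here is a minimal sketch that works through the example above in integer millicores. The function names are illustrative, and the 25% cut-off is taken from the warning zones listed earlier.

```shell
#!/usr/bin/env bash
# Integer millicore helpers; function names are illustrative, not part of any tool.

# usage_pct_of_request USAGE_M REQUEST_M -> whole percent, rounded down
usage_pct_of_request() {
  echo $(( $1 * 100 / $2 ))
}

# blocked_millicores USAGE_M REQUEST_M -> capacity requested but not used
blocked_millicores() {
  echo $(( $2 - $1 ))
}

# The pod from the text: requests 4 CPU (4000m), uses 0.1 CPU (100m)
pct=$(usage_pct_of_request 100 4000)
echo "usage is ${pct}% of request, $(blocked_millicores 100 4000)m held idle"
if [ "$pct" -lt 25 ]; then
  echo "over-provisioned: right-size before adding nodes"
fi
```

The scheduler only ever sees the 4000m request, so the 3900m of idle capacity stays unavailable to other pods until the request is lowered.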

Capacity Dimensions

Kubernetes clusters can hit limits on any of five independent dimensions. Hitting any single one blocks further scheduling even when others are available:

Dimension	What Limits It	How to Check	Expansion Strategy
CPU (requests)	Sum of pod CPU requests vs node allocatable CPU	`kubectl resource-capacity --sort cpu.request`	Add nodes, right-size requests, enable VPA
Memory (requests)	Sum of pod memory requests vs node allocatable memory	`kubectl resource-capacity --sort mem.request`	Add nodes, right-size requests, enable VPA
Pod count	`--max-pods` per node (default 110, AWS VPC CNI varies by instance)	`kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods`	Increase max-pods (VPC CNI prefix delegation), add nodes
IP addresses	VPC subnet CIDR /24 = 251 usable IPs; shared with nodes	`aws ec2 describe-subnets | jq '.Subnets[].AvailableIpAddressCount'`	Add subnets/secondary CIDRs, enable prefix delegation
Storage (PV)	EBS volume limits per instance type (28–128 attachments)	`kubectl get pvc -A | grep Bound`	Use EFS/S3 for shared storage, increase instance size
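The IP row is the easiest one to check by hand. The sketch below assumes the standard AWS rule that five addresses in every subnet are reserved, and it uses a deliberately simplified model in which each node takes one primary IP and everything else is left for pods (warm IP pools and extra ENIs are ignored). Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# usable_ips PREFIX_LEN -> usable IPv4 addresses in an AWS subnet.
# AWS reserves 5 addresses per subnet (first four and the last).
usable_ips() {
  local prefix_len=$1
  echo $(( (1 << (32 - prefix_len)) - 5 ))
}

# pod_ips_left PREFIX_LEN NODE_COUNT -> addresses left for pods after each
# node has taken one primary IP (simplified: ignores warm pools).
pod_ips_left() {
  local prefix_len=$1 nodes=$2
  echo $(( $(usable_ips "$prefix_len") - nodes ))
}

echo "/24 usable IPs:          $(usable_ips 24)"
echo "/24 pod IPs, 10 nodes:   $(pod_ips_left 24 10)"
echo "/20 usable IPs:          $(usable_ips 20)"
```

A single /24 therefore tops out well below what ten densely packed nodes can schedule, which is why subnet size belongs in the capacity plan next to CPU and memory.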

Node Sizing Strategy

Instance family selection

Workload Profile	AWS Family	CPU:Memory Ratio	Use Case
General purpose	m7i, m6i, m5	1:4	Mixed microservices, most production workloads
CPU-intensive	c7i, c6i, c5	1:2	API servers, data processing, encoding
Memory-intensive	r7i, r6i, r5	1:8	JVM apps, in-memory caches, analytics
Burstable / dev	t3, t4g	1:2–4	Development, CI runners, low-traffic services
GPU / ML	p4, g5, g4dn	—	Model training, inference, video
ARM / cost-optimized	m7g, c7g, r7g (Graviton)	1:4 (m), 1:2 (c), 1:8 (r)	Same use cases as the matching x86 families; ~20% cheaper

Node size tradeoffs

Fewer, larger nodes

Lower per-node overhead (DaemonSets, kube-reserved)
Fewer node reboots during upgrades
Better bin-packing for large pods
Risk: larger blast radius per node failure
Risk: fragmentation for small pods
Use: m7i.4xlarge (16 vCPU / 64 GiB) as baseline

More, smaller nodes

Smaller blast radius per node failure
Better fault isolation with PodAntiAffinity
Higher overhead ratio (DaemonSets cost more)
Risk: more nodes to manage/upgrade
Risk: IP exhaustion faster
Use: m7i.xlarge (4 vCPU / 16 GiB) for isolation-critical

kube-reserved and system-reserved budgets

Kubernetes reserves resources on each node for the kubelet, system daemons, and eviction threshold before pods can be scheduled. Always set these explicitly in your node configuration:

# EKS managed node group bootstrap configuration
# /etc/eks/bootstrap.sh extra args (via launch template userData)
--kubelet-extra-args '
  --kube-reserved=cpu=250m,memory=1Gi,ephemeral-storage=1Gi
  --system-reserved=cpu=250m,memory=500Mi,ephemeral-storage=1Gi
  --eviction-hard=memory.available<500Mi,nodefs.available<10%,nodefs.inodesFree<5%
  --eviction-soft=memory.available<1Gi,nodefs.available<15%
  --eviction-soft-grace-period=memory.available=2m,nodefs.available=2m
  --max-pods=110
'

Effective allocatable capacity formula

allocatable_cpu = node_cpu − kube_reserved_cpu − system_reserved_cpu
allocatable_mem = node_mem − kube_reserved_mem − system_reserved_mem − eviction_threshold

Example: m7i.xlarge (4 vCPU / 16 GiB)
CPU: 4000m − 250m − 250m = 3500m per node
Mem: 16384Mi − 1024Mi − 500Mi − 500Mi = 14360Mi per node
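The formula is simple enough to script when comparing instance types. This is a small sketch, with an illustrative function name, that plugs in the kubelet flags from the bootstrap example (250m + 250m CPU, 1Gi + 500Mi memory, 500Mi hard eviction threshold).

```shell
#!/usr/bin/env bash
# allocatable NODE KUBE_RESERVED SYSTEM_RESERVED [EVICTION_HARD]
# All arguments are integers in one unit (millicores for CPU, Mi for memory).
allocatable() {
  local node=$1 kube=$2 system=$3 eviction=${4:-0}
  echo $(( node - kube - system - eviction ))
}

# m7i.xlarge (4 vCPU / 16 GiB) with the reservations from the bootstrap flags
echo "cpu: $(allocatable 4000 250 250)m"
echo "mem: $(allocatable 16384 1024 500 500)Mi"
```

CPU has no eviction threshold, so the fourth argument is omitted for it.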

# Verify actual allocatable on a running node
kubectl get node ip-10-0-1-42.us-east-1.compute.internal \
  -o jsonpath='{.status.allocatable}' | jq .

# Example output (this node runs with EKS default reservations; the flags above would yield cpu 3500m):
# {
#   "cpu": "3920m",
#   "memory": "14902836Ki",
#   "pods": "110",
#   "ephemeral-storage": "99Gi"
# }

Measuring Current Utilization

kubectl resource-capacity (krew)

# Per-node CPU and memory requests vs allocatable
kubectl resource-capacity --sort cpu.request

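# Output (example; columns simplified to requested/allocatable, the plugin prints each value with a percentage, e.g. "560m (28%)"):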
# NODE                              CPU REQUESTS   CPU LIMITS   MEMORY REQUESTS   MEMORY LIMITS
# ip-10-0-1-42.us-east-1.compute   2350m/3920m    4000m/3920m  8Gi/14.2Gi        16Gi/14.2Gi
# ip-10-0-1-55.us-east-1.compute   1100m/3920m    2000m/3920m  4Gi/14.2Gi        8Gi/14.2Gi

# Include actual usage (requires metrics-server)
kubectl resource-capacity --util --sort cpu.util

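# Per-pod breakdown on a node (the plugin filters nodes by label, so match on the hostname label)
kubectl resource-capacity --pods \
  --node-labels kubernetes.io/hostname=ip-10-0-1-42.us-east-1.compute.internal \
  --sort cpu.request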

PromQL — cluster-wide utilization

# Cluster CPU request utilization (requested / allocatable)
sum(kube_pod_container_resource_requests{resource="cpu"})
  /
sum(kube_node_status_allocatable{resource="cpu"})

# Cluster memory request utilization
sum(kube_pod_container_resource_requests{resource="memory"})
  /
sum(kube_node_status_allocatable{resource="memory"})

# Per-namespace CPU request share (top consumers)
sort_desc(
  sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
    /
  scalar(sum(kube_node_status_allocatable{resource="cpu"}))
)

# Actual CPU usage as % of allocatable
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  /
sum(kube_node_status_allocatable{resource="cpu"})

# Node-level memory pressure (share of total node memory in use)
# node_exporter series carry an `instance` label rather than `node`
max by (instance) (
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
    /
  node_memory_MemTotal_bytes
)

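# Running pods per node vs max-pods limit
# kube_pod_info has no phase label, so join against kube_pod_status_phase
sum by (node) (
  kube_pod_info
    * on (namespace, pod) group_left ()
  (kube_pod_status_phase{phase="Running"} == 1)
)
  /
max by (node) (kube_node_status_capacity{resource="pods"})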

Namespace-level capacity report

#!/bin/bash
# Weekly capacity report script
echo "=== Cluster Capacity Report $(date +%Y-%m-%d) ==="

echo ""
echo "--- Node Summary ---"
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,PODS:.status.allocatable.pods'

echo ""
echo "--- Top CPU-Consuming Pods (actual usage, requires metrics-server) ---"
kubectl top pods -A --sort-by=cpu 2>/dev/null | \
  head -n 20 | column -t

echo ""
echo "--- ResourceQuota Usage ---"
kubectl get resourcequota -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU-REQ:.status.used.requests\.cpu,CPU-LIM:.status.used.limits\.cpu,MEM-REQ:.status.used.requests\.memory'

echo ""
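echo "--- Nodes Near CPU Request Saturation (>75%) ---"
# Assumes the plugin's default table output, where CPU requests print as
# "560m (28%)", so the percentage is the third whitespace-separated field.
kubectl resource-capacity --sort cpu.request 2>/dev/null | \
  awk 'NR==1 { print; next } { pct = $3; gsub(/[()%]/, "", pct); if (pct + 0 > 75) print }'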

Right-Sizing Workloads

Before adding nodes, eliminate request waste. Most clusters see 40–70% CPU over-provisioning — a VPA recommendation pass typically reduces cluster size by 20–40% with no reliability impact.

VPA recommendation workflow

# Step 1: Deploy VPA in Off mode (recommendation-only — no pod restarts)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"   # recommendation only
  resourcePolicy:
    containerPolicies:
      - containerName: payment-service
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledValues: RequestsAndLimits

# Step 2: Read recommendations after 24h+ of data
kubectl describe vpa payment-service-vpa -n payments

# Output:
#   Recommendation:
#     Container Recommendations:
#       Container Name:  payment-service
#         Lower Bound:
#           Cpu:     100m
#           Memory:  256Mi
#         Target:
#           Cpu:     350m      ← recommended request
#           Memory:  512Mi
#         Uncapped Target:
#           Cpu:     350m
#           Memory:  480Mi
#         Upper Bound:
#           Cpu:     1200m
#           Memory:  2Gi

# Step 3: Bulk export all VPA recommendations across cluster
kubectl get vpa -A -o json | jq -r '
  .items[] |
  .metadata.namespace + "/" + .metadata.name + "  " +
  (.status.recommendation.containerRecommendations[]? |
    .containerName + "  cpu:" +
    .target.cpu + "  mem:" + .target.memory)'

Goldilocks dashboard

Goldilocks (covered in depth in Section 08-07) installs VPAs for every Deployment in a labeled namespace and visualizes recommendations in a dashboard:

# Label a namespace for Goldilocks analysis
kubectl label namespace payments goldilocks.fairwinds.com/enabled=true

# Access dashboard (port-forward)
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80

Common right-sizing patterns

Pattern	Symptom	VPA says	Action
CPU over-requested	CPU request 2000m, usage 80m	Target: 150m	Lower request to 200m (target plus ~20% safety margin, rounded up)
Memory under-requested	OOMKilled events, memory request 256Mi	Target: 1.2Gi	Raise request to 1.5Gi; set limit at 2Gi
Limits ≫ Requests	CPU limit 8000m on 100m request	Target: 200m	Align: request 200m, limit 400m (2× ratio)
No limits set	Pod using 100% node CPU during spike	N/A	Set CPU limit at 3× VPA target; memory limit at 1.5× target
Batch job with prod sizing	Daily job with 4cpu/8Gi running 5 min/day	Target: 1cpu/2Gi	Lower requests; consider PriorityClass: batch-low

💡

VPA + HPA conflict rule

Do not use VPA updateMode: Auto with HPA on the same CPU/memory metric — they fight each other. The recommended pattern: VPA on memory (HPA rarely scales on memory), HPA on CPU or a custom metric (RPS, queue depth). See 08-07 for full VPA/HPA strategy.

Headroom & Safety Margins

Headroom is the slack capacity you deliberately keep available to absorb: node failures, traffic spikes before autoscaling reacts, and rolling updates that temporarily double pod count.

Headroom targets by workload criticality

Cluster Tier	CPU Headroom	Memory Headroom	Min Free Nodes	Rationale
Production	≥ 20%	≥ 25%	2 nodes (N+2)	Absorb node failure + autoscale lag (2–4 min on spot)
Staging	≥ 15%	≥ 20%	1 node (N+1)	Functional parity with some cost savings
Development	≥ 5%	≥ 10%	0	Autoscale freely; minor disruption acceptable
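A headroom target is only useful if it is checked. The following sketch, with hypothetical function names, computes headroom as the unrequested share of allocatable capacity and compares it with a tier target from the table. Inputs are integers in a single unit (millicores or Mi), and the sample numbers are made up for illustration.

```shell
#!/usr/bin/env bash
# headroom_pct REQUESTED ALLOCATABLE -> unrequested share of allocatable,
# as a whole percent rounded down.
headroom_pct() {
  local requested=$1 allocatable=$2
  echo $(( (allocatable - requested) * 100 / allocatable ))
}

# meets_target REQUESTED ALLOCATABLE TARGET_PCT -> exit 0 if headroom >= target
meets_target() {
  [ "$(headroom_pct "$1" "$2")" -ge "$3" ]
}

# Illustrative cluster: 120 cores requested out of 140 allocatable,
# checked against the 20% production CPU target.
echo "CPU headroom: $(headroom_pct 120000 140000)%"
if meets_target 120000 140000 20; then
  echo "production CPU target met"
else
  echo "below production CPU target: add capacity or right-size"
fi
```

The same two functions work for memory by passing Mi values and the memory target for the tier.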

Pause pods for guaranteed headroom

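Karpenter and Cluster Autoscaler scale down idle nodes. A pause pod (a pod running the pause image with real resource requests but negligible actual usage) holds schedulable capacity in reserve: each replica keeps its requested CPU and memory occupied on whichever node it lands on, so the autoscaler treats that capacity as in use and does not reclaim it: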

apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reservation
  namespace: kube-system
spec:
  replicas: 2   # one per AZ; adjust to match headroom target
  selector:
    matchLabels:
      app: capacity-reservation
  template:
    metadata:
      labels:
        app: capacity-reservation
    spec:
      priorityClassName: batch-low   # evicted first when real pods need space
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1500m"      # reserves 1.5 CPU per replica
              memory: "3Gi"     # reserves 3 GiB memory per replica
          securityContext:
            allowPrivilegeEscalation: false
            runAsNonRoot: true
            runAsUser: 65534
            seccompProfile:
              type: RuntimeDefault
            capabilities:
              drop: ["ALL"]

💡

How pause pods work with autoscaling

When a real pod is pending, the scheduler preempts the low-priority pause pod (PriorityClass: batch-low, which must exist and rank below every real workload) to make room. The preempted pause pod becomes pending, which triggers Karpenter to provision a new node. Net result: the new node is available for the next real workload burst, maintaining the headroom invariant.

Demand Forecasting

Capacity planning without a forecast is reactive. Use historical Prometheus data to project future demand.

Linear regression with predict_linear()

# Predict CPU request total in 30 days based on last 14 days of growth
predict_linear(
  sum(kube_pod_container_resource_requests{resource="cpu"})[14d:1h],
  30 * 24 * 3600
)

# As % of current allocatable — will we saturate?
predict_linear(
  sum(kube_pod_container_resource_requests{resource="cpu"})[14d:1h],
  30 * 24 * 3600
)
  /
sum(kube_node_status_allocatable{resource="cpu"})

# Namespace-level forecast: payments team CPU growth
predict_linear(
  sum by (namespace) (
    kube_pod_container_resource_requests{resource="cpu", namespace="payments"}
  )[7d:1h],
  14 * 24 * 3600
)

Seasonality and traffic patterns

predict_linear() assumes constant growth. Real traffic has weekly cycles (weekday vs weekend) and seasonal spikes (end-of-month billing, Black Friday). For these, use longer lookback windows and apply multipliers manually:

#!/bin/bash
# Extract weekly peak-to-average ratio from Prometheus
# Use this to size for peak, not average

PROM_URL="http://prometheus.monitoring.svc:9090"

# 7-day average CPU requests
AVG=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=avg_over_time(sum(kube_pod_container_resource_requests{resource="cpu"})[7d:1h])' | \
  jq -r '.data.result[0].value[1]')

# 7-day peak CPU requests
PEAK=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=max_over_time(sum(kube_pod_container_resource_requests{resource="cpu"})[7d:1h])' | \
  jq -r '.data.result[0].value[1]')

echo "Average CPU requests: ${AVG} cores"
echo "Peak CPU requests:    ${PEAK} cores"
echo "Peak-to-avg ratio:    $(echo "scale=2; $PEAK/$AVG" | bc)×"
echo ""
echo "Size cluster for PEAK × 1.3 safety margin = $(echo "scale=1; $PEAK*1.3" | bc) cores"

Capacity planning spreadsheet model

Input	Value	Notes
Current CPU requests	120 cores	From Prometheus
Monthly growth rate	8%	From predict_linear 3-month trend
Peak-to-average multiplier	1.4×	From weekly seasonality analysis
Headroom buffer	20%	Production tier requirement
6-month CPU target	120 × (1.08^6) × 1.4 × 1.2 ≈ 320 cores	Order nodes 4 weeks before hitting limit
Node size (m7i.xlarge)	3.5 allocatable CPU per node	After kube/system reserved
Nodes needed in 6 months	320 / 3.5 ≈ 92 nodes	vs 40 today → plan for more than doubling
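The spreadsheet arithmetic can be reproduced on the command line, which makes it easy to rerun with a different growth rate. This sketch uses awk for the floating-point math. The function names are illustrative, and the inputs are the table's own: 120 cores, 8% monthly growth, 6 months, a 1.4× peak multiplier, a 20% headroom buffer, and 3.5 allocatable CPU per node.

```shell
#!/usr/bin/env bash
# forecast_cores CURRENT MONTHLY_GROWTH MONTHS PEAK_MULT HEADROOM
# Compound growth, then scale for peak traffic and the headroom buffer.
forecast_cores() {
  awk -v c="$1" -v g="$2" -v m="$3" -v p="$4" -v h="$5" \
    'BEGIN { printf "%.0f\n", c * (1 + g) ^ m * p * (1 + h) }'
}

# nodes_needed CORES ALLOCATABLE_PER_NODE -> rounded up to whole nodes
nodes_needed() {
  awk -v c="$1" -v a="$2" \
    'BEGIN { n = c / a; r = int(n); if (r < n) r++; print r }'
}

cores=$(forecast_cores 120 0.08 6 1.4 0.2)
echo "6-month CPU target: ${cores} cores"
echo "Nodes needed:       $(nodes_needed "$cores" 3.5)"
```

Compounding matters here: six months at 8% is about 59% growth, not 48%, and the peak and headroom multipliers stack on top of that.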

Autoscaling Strategy

Autoscaling handles the short-term spikes; capacity planning handles the long-term trend. Both are required — autoscaling alone is not a substitute for planning.

Three-layer autoscaling model

Autoscaling Layers

  Layer 3 — Node autoscaling (Karpenter / Cluster Autoscaler)
    Response time: 2–4 min (Cluster Autoscaler), 1–2 min (Karpenter)
    Trigger: Pending pods that can't be scheduled
    Action: Add / remove nodes

  Layer 2 — Horizontal pod autoscaling (HPA / KEDA)
    Response time: 15 sec–2 min (depends on metric scrape interval)
    Trigger: CPU%, memory%, custom metric (RPS, queue depth)
    Action: Scale replicas up/down

  Layer 1 — Vertical pod autoscaling (VPA)
    Response time: Minutes–hours (restarts pods)
    Trigger: VPA updateMode != Off
    Action: Right-size requests/limits per pod

  ──────────────────────────────────────────────────────────────
  Capacity planning: ensures Layer 3 has room to scale into
  (burst headroom, savings plan coverage for baseline)

Karpenter NodePool tuning for capacity planning

Karpenter (covered in depth in Section 08-01) consolidates and provisions nodes. For capacity planning purposes, configure disruption budgets to prevent over-aggressive consolidation during business hours:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: production
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m7i.xlarge
            - m7i.2xlarge
            - m7i.4xlarge
            - m6i.xlarge
            - m6i.2xlarge
            - m6i.4xlarge
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: production
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      # During business hours: consolidate at most 10% of nodes at once
      - schedule: "0 9 * * 1-5"    # Mon-Fri 09:00 (Karpenter evaluates schedules in UTC)
        duration: 8h
        nodes: "10%"
      # Off-hours: allow 20% consolidation
      - nodes: "20%"
  limits:
    cpu: "500"
    memory: "2000Gi"

HPA configuration for spike absorption

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3    # ≥ 3 for multi-AZ spread
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out at 60% CPU, not 80%
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"     # 1000 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # quick scale-up
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60    # add up to 4 pods per minute
        - type: Percent
          value: 100
          periodSeconds: 60    # or double, whichever is larger
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min before scale-down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60    # remove at most 2 pods per minute

Node Pool Architecture

Separate node pools by workload class. This allows independent scaling, targeted instance selection, and cost isolation:

Node Pool Topology

  ┌──────────────────────────────────────────────────────────────────┐
  │  system pool  (2× m7i.xlarge, on-demand, multi-AZ)              │
  │  Taint: CriticalAddonsOnly=true:NoSchedule                       │
  │  Workloads: CoreDNS, kube-proxy, metrics-server, cert-manager    │
  ├──────────────────────────────────────────────────────────────────┤
  │  observability pool  (3× m7i.2xlarge, on-demand, multi-AZ)      │
  │  Taint: dedicated=observability:NoSchedule                       │
  │  Workloads: Prometheus, Loki, Tempo, Pyroscope, Grafana          │
  ├──────────────────────────────────────────────────────────────────┤
  │  production pool  (Karpenter managed, spot+on-demand, multi-AZ)  │
  │  No taint — default pool                                         │
  │  Workloads: All production services                              │
  ├──────────────────────────────────────────────────────────────────┤
  │  batch pool  (Karpenter managed, spot-only)                      │
  │  Taint: dedicated=batch:NoSchedule                               │
  │  Workloads: Argo Workflows, Spark, ML training, CI runners       │
  ├──────────────────────────────────────────────────────────────────┤
  │  gpu pool  (Karpenter managed, g5.xlarge, spot+on-demand)        │
  │  Taint: nvidia.com/gpu=true:NoSchedule                           │
  │  Workloads: ML inference, video encoding                         │
  └──────────────────────────────────────────────────────────────────┘

# Batch workload targeting the batch node pool
spec:
  tolerations:
    - key: dedicated
      value: batch
      effect: NoSchedule
  nodeSelector:
    dedicated: batch
  # Or with affinity for more control:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated
                operator: In
                values: ["batch"]

Stateful Workload Capacity

Stateful workloads (databases, Kafka, etcd) have additional capacity constraints beyond CPU and memory.

Prometheus StatefulSet sizing example

Dimension	Calculation	Example Value
Samples ingested/sec	Scrape targets × avg metrics × 1/scrape_interval	50,000 targets × 200 metrics × 1/15s = 667k samples/sec
CPU	~100m CPU per 100k active series	10M active series → 100 × 100m = 10 CPU
Memory (head chunk)	~3 KiB per active series	10M series × 3 KiB ≈ 29 GiB RAM (round up to 30 GiB)
Storage (TSDB)	bytes/sample × samples/sec × retention_seconds × 1.1 overhead	1.5 bytes × 667k/s × 86400 s/day × 15 d × 1.1 ≈ 1.3 TiB per replica
IOPS	WAL: ~1000 IOPS; compaction bursts to 5000 IOPS	gp3 with 4000 IOPS provisioned
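The ingestion and storage rows can be recomputed for other target counts or retention periods. This is a rough sketch under the table's own assumptions (1.5 bytes per sample, 10% overhead); the function names are hypothetical and the result is an estimate, not a guarantee of actual TSDB size.

```shell
#!/usr/bin/env bash
# samples_per_sec TARGETS METRICS_PER_TARGET SCRAPE_INTERVAL_SECONDS
samples_per_sec() {
  echo $(( $1 * $2 / $3 ))
}

# tsdb_tib BYTES_PER_SAMPLE SAMPLES_PER_SEC RETENTION_DAYS OVERHEAD
# -> estimated on-disk size in TiB, one decimal place
tsdb_tib() {
  awk -v b="$1" -v s="$2" -v d="$3" -v o="$4" \
    'BEGIN { printf "%.1f\n", b * s * d * 86400 * o / (1024 ^ 4) }'
}

rate=$(samples_per_sec 50000 200 15)
echo "Ingestion rate: ${rate} samples/sec"
echo "TSDB estimate:  $(tsdb_tib 1.5 "$rate" 15 1.1) TiB per replica"
```

Doubling retention or halving the scrape interval doubles the storage estimate, so both are worth revisiting before resizing volumes.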

PostgreSQL/MySQL sizing heuristics

Connections: size memory for max_connections × work_mem; use PgBouncer connection pooler
Shared buffers: 25% of instance RAM (OS will cache the rest via page cache)
Storage: plan for 3× current data size (growth + WAL + vacuuming space)
IOPS: OLTP writes need >3000 IOPS sustained; use io2 or gp3 with provisioned IOPS
CPU: OLTP is typically memory and I/O bound, not CPU bound; 4–8 vCPU is enough for most databases up to 100k QPS

IP Exhaustion Planning

VPC CNI assigns a real VPC IP to every pod. With hundreds of pods per node, subnets exhaust quickly. Plan IP capacity as carefully as compute.

# Check available IPs per subnet
aws ec2 describe-subnets \
  --filters "Name=tag:kubernetes.io/cluster/my-cluster,Values=shared" \
  --query 'Subnets[*].{Subnet:SubnetId,AZ:AvailabilityZone,Available:AvailableIpAddressCount,CIDR:CidrBlock}' \
  --output table

# Count pods with an assigned IP per node (pod IP consumption)
kubectl get pods -A --no-headers \
  -o custom-columns='NODE:.spec.nodeName,IP:.status.podIP' | \
  awk '$2 != "<none>" {print $1}' | sort | uniq -c | sort -rn | head -20

# Count pods on one node to understand IP density
kubectl get pods -A --no-headers \
  --field-selector spec.nodeName=ip-10-0-1-42.us-east-1.compute.internal | wc -l

EKS VPC CNI prefix delegation

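Prefix delegation assigns a /28 prefix (16 IPs) to each ENI slot instead of a single IP, raising the number of pod IPs a node can hold by up to 16× with no subnet CIDR changes. Subnets need free contiguous /28 blocks for this to work, and the kubelet max-pods value has to be raised before the extra addresses are usable: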

# Enable prefix delegation on existing cluster
kubectl set env daemonset aws-node \
  -n kube-system \
  ENABLE_PREFIX_DELEGATION=true \
  WARM_PREFIX_TARGET=1

# Verify (new nodes will get prefix IPs; existing nodes need rolling replacement)
kubectl describe node ip-10-0-1-42.us-east-1.compute.internal | \
  grep -A5 "Capacity"

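# max-pods with prefix delegation on m7i.xlarge:
# ENIs: 4, IPv4 addresses per ENI: 15, prefix size: /28 (16 IPs)
# IP-based limit = 4 × (15 − 1) × 16 + 2 = 898 pods (vs 4 × (15 − 1) + 2 = 58 without prefix)
# EKS recommends capping max-pods at 110 for instances under 30 vCPUs (250 otherwise)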

Instance	Max pods (default)	Max pods (prefix delegation)
t3.medium	17	110
m7i.xlarge	58	110 recommended (IP-based limit 898)
m7i.4xlarge	234	110 recommended (IP-based limit 3714)
c7i.2xlarge	58	110 recommended (IP-based limit 898)
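The table values follow from the VPC CNI formula: ENIs × (IPv4 addresses per ENI − 1) + 2, with each slot worth 16 addresses under prefix delegation. The sketch below is one way to reproduce them. The function names are illustrative, the ENI and per-ENI address counts are assumptions that should be looked up for each instance type, and the 110/250 cap reflects the EKS recommendation rather than a hard kubelet limit.

```shell
#!/usr/bin/env bash
# eni_max_pods ENIS IPS_PER_ENI [PREFIX_DELEGATION=true|false]
# One address per ENI is the ENI's own primary IP; +2 covers host-network pods.
eni_max_pods() {
  local enis=$1 ips_per_eni=$2 prefix=${3:-false}
  local per_slot=1
  if [ "$prefix" = "true" ]; then
    per_slot=16   # each slot holds a /28 prefix
  fi
  echo $(( enis * (ips_per_eni - 1) * per_slot + 2 ))
}

# recommended_max_pods RAW_LIMIT VCPUS -> apply the 110 / 250 cap
recommended_max_pods() {
  local raw=$1 vcpus=$2 cap=250
  if [ "$vcpus" -lt 30 ]; then
    cap=110
  fi
  echo $(( raw < cap ? raw : cap ))
}

# m7i.xlarge, assuming 4 ENIs with 15 IPv4 addresses each and 4 vCPUs
raw=$(eni_max_pods 4 15 true)
echo "default:           $(eni_max_pods 4 15)"
echo "prefix delegation: ${raw} by IP, $(recommended_max_pods "$raw" 4) recommended"
```

With prefix delegation the IP-based limit stops being the binding constraint on most instance sizes; the recommended cap, and CPU and memory per pod, take over.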

Capacity Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capacity-planning-alerts
  namespace: monitoring
spec:
  groups:
    - name: capacity.planning
      interval: 5m
      rules:

        # CPU request saturation
        - alert: ClusterCPURequestSaturationHigh
          expr: |
            sum(kube_pod_container_resource_requests{resource="cpu"})
              /
            sum(kube_node_status_allocatable{resource="cpu"}) > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Cluster CPU request utilization > 80%"
            description: "CPU requests are {{ $value | humanizePercentage }} of allocatable. Autoscaler may not have enough headroom."
            runbook_url: https://runbooks.example.com/capacity/cpu-saturation

        - alert: ClusterCPURequestSaturationCritical
          expr: |
            sum(kube_pod_container_resource_requests{resource="cpu"})
              /
            sum(kube_node_status_allocatable{resource="cpu"}) > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Cluster CPU request utilization > 90% — pending pods likely"
            runbook_url: https://runbooks.example.com/capacity/cpu-saturation

        # Memory request saturation
        - alert: ClusterMemoryRequestSaturationHigh
          expr: |
            sum(kube_pod_container_resource_requests{resource="memory"})
              /
            sum(kube_node_status_allocatable{resource="memory"}) > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Cluster memory request utilization > 80%"
            runbook_url: https://runbooks.example.com/capacity/memory-saturation

        # Pending pods (scheduling failure — often a capacity signal)
        - alert: PodsPendingTooLong
          expr: |
            sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pods pending > 10 min in namespace {{ $labels.namespace }}"
            description: "{{ $value }} pods are pending. Check node capacity or resource quota."
            runbook_url: https://runbooks.example.com/capacity/pending-pods

        # Node CPU utilization (actual usage, not requests)
        - alert: NodeCPUUtilizationHigh
          expr: |
            (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} CPU utilization > 85%"
            runbook_url: https://runbooks.example.com/capacity/node-cpu-high

        # Node memory pressure
        - alert: NodeMemoryPressure
          expr: |
            (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
              /
            node_memory_MemTotal_bytes > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} memory usage > 90%"
            runbook_url: https://runbooks.example.com/capacity/node-memory-pressure

        # Pod count approaching max-pods limit
        - alert: NodePodCapacityHigh
          expr: |
            sum by (node) (
              kube_pod_info
                * on (namespace, pod) group_left ()
              (kube_pod_status_phase{phase="Running"} == 1)
            )
              /
            max by (node) (kube_node_status_capacity{resource="pods"}) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} pod capacity > 85%"
            runbook_url: https://runbooks.example.com/capacity/node-pod-limit

        # Namespace quota near limit
        - alert: NamespaceCPUQuotaNearLimit
          expr: |
            kube_resourcequota{resource="requests.cpu", type="used"}
              / ignoring (type)
            kube_resourcequota{resource="requests.cpu", type="hard"} > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Namespace {{ $labels.namespace }} CPU quota > 85% used"
            runbook_url: https://runbooks.example.com/capacity/quota-near-limit

        # Karpenter node provisioning failures
        # Metric names vary across Karpenter versions; confirm against the controller's /metrics endpoint
        - alert: KarpenterNodeProvisioningFailed
          expr: |
            sum(increase(karpenter_cloudprovider_errors_total[30m])) > 0
          labels:
            severity: warning
          annotations:
            summary: "Karpenter cloud provider calls failed in the last 30 min; node provisioning may be affected"
            runbook_url: https://runbooks.example.com/capacity/karpenter-provisioning-failure

        # Capacity forecast: will saturate in < 7 days
        - alert: ClusterCPUForecastSaturation
          expr: |
            predict_linear(
              sum(kube_pod_container_resource_requests{resource="cpu"})[3d:1h],
              7 * 24 * 3600
            )
              /
            sum(kube_node_status_allocatable{resource="cpu"}) > 0.90
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Cluster CPU requests forecast to reach 90% within 7 days"
            runbook_url: https://runbooks.example.com/capacity/cpu-forecast

Capacity Review Cadence

Cadence	Activity	Owner	Output
Daily	Check capacity alerts in Grafana/PagerDuty; review pending pods	On-call engineer	Immediate remediation if alerts firing
Weekly	Run capacity report script; review VPA recommendations top-10 namespaces; check Goldilocks dashboard	Platform team	Right-sizing tickets for top over-provisioned workloads
Monthly	CPU/memory trend analysis (30d); forecast next 90 days; review FinOps efficiency KPIs; update savings plan commitment	Platform + FinOps	Capacity plan document; savings plan adjustment
Quarterly	Full capacity audit: node pool sizing review; instance family refresh (new gen available?); VPC subnet headroom; storage PVC growth; multi-cluster capacity balance	Platform team lead	Architectural changes, subnet expansion, instance migration plan

Monthly capacity review checklist

#!/bin/bash
# Monthly capacity review — run on first Monday of each month
set -euo pipefail

PROM_URL="${PROM_URL:-http://prometheus.monitoring.svc:9090}"
REPORT_DATE=$(date +%Y-%m-%d)

echo "======================================"
echo " Monthly Capacity Review: $REPORT_DATE"
echo "======================================"

echo ""
echo "## 1. Node Summary"
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type,READY:.status.conditions[?(@.type=="Ready")].status,AGE:.metadata.creationTimestamp'

echo ""
echo "## 2. Cluster Utilization (30-day avg)"
for METRIC in cpu memory; do
  AVG=$(curl -s "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=avg_over_time((sum(kube_pod_container_resource_requests{resource=\"${METRIC}\"}) / sum(kube_node_status_allocatable{resource=\"${METRIC}\"}))[30d:1h])" | \
    jq -r '.data.result[0].value[1] // "N/A"')
  PEAK=$(curl -s "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=max_over_time((sum(kube_pod_container_resource_requests{resource=\"${METRIC}\"}) / sum(kube_node_status_allocatable{resource=\"${METRIC}\"}))[30d:1h])" | \
    jq -r '.data.result[0].value[1] // "N/A"')
  echo "  ${METRIC}: avg=${AVG} peak=${PEAK}"
done

echo ""
echo "## 3. Top 10 CPU-Requesting Namespaces (30d avg)"
curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=sort_desc(sum by (namespace) (avg_over_time(kube_pod_container_resource_requests{resource="cpu"}[30d])))' | \
  jq -r '.data.result[:10][] | "  \(.metric.namespace): \(.value[1]) cores"'

echo ""
echo "## 4. 90-day CPU Forecast"
FORECAST=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode "query=predict_linear(sum(kube_pod_container_resource_requests{resource=\"cpu\"})[30d:1h], 90*24*3600) / sum(kube_node_status_allocatable{resource=\"cpu\"})" | \
  jq -r '.data.result[0].value[1] // "N/A"')
echo "  Projected utilization in 90 days: ${FORECAST}"

echo ""
echo "## 5. VPA Top Savings Opportunities"
kubectl get vpa -A -o json 2>/dev/null | jq -r '
  .items[] |
  select(.status.recommendation != null) |
  .metadata.namespace + "/" + .metadata.name' | head -10

echo ""
echo "## 6. Unbound PVCs (storage waste)"
# PVCs do not support a status.phase field selector, so filter on the STATUS column
UNBOUND=$(kubectl get pvc -A --no-headers 2>/dev/null | awk '$3 != "Bound"' || true)
echo "${UNBOUND:-  None found}"

Best Practices

Right-size before scaling out

VPA recommendations typically reveal 40–70% CPU over-provisioning. Always run a right-sizing pass before concluding you need more nodes.

Reserve headroom with pause pods

Avoid running above 80% request utilization in production. Use low-priority pause pods to maintain N+2 node headroom that real workloads can claim instantly through preemption.

Forecast from Prometheus data

Use predict_linear() on 14–30 day windows for growth trends. Alert when the forecast projects saturation within 7 days.

Plan all five dimensions

CPU and memory are obvious. Also plan pod count, IP addresses (VPC subnets), and storage PV/EBS attachment limits — any one can block scheduling.

Separate pools by workload class

System, observability, production, batch, and GPU pools scale independently, use appropriate instance families, and provide blast-radius isolation.

Monthly review cadence

Capacity planning is a monthly discipline, not a one-time event. Schedule the review, track the trend, and adjust savings plan commitments accordingly.