Capacity Planning
Sizing clusters to meet today's demand and tomorrow's growth — balancing headroom, cost efficiency, and autoscaling safety nets.
Fundamentals
Capacity planning answers two questions: Do we have enough resources today? and When do we need to add more? In Kubernetes, the complexity comes from several independent capacity dimensions (CPU, memory, pod count, IP addresses, and volume attachments) layered on top of autoscaling, spot variability, and the gap between what workloads request and what they actually use.
Node allocatable capacity
┌─────────────────────────────────────────────────────────────┐
│ System reserved │ Kube reserved │ Available for pods        │
└─────────────────────────────────────────────────────────────┘
                                  ┌─────────────────────┐
                                  │ Requested (sum)     │
                                  │  ├── Actual usage   │
                                  │  └── Slack (waste)  │
                                  │ Headroom buffer     │
                                  │ Free (unscheduled)  │
                                  └─────────────────────┘
Warning zones:
Requests / Allocatable > 80% → near saturation
Actual usage / Requests < 25% → over-provisioned (right-size first)
Headroom < 1 full node → a node failure cannot be absorbed until the autoscaler adds capacity
The Kubernetes scheduler packs pods based on requests, not actual resource consumption. A pod that requests 4 CPU but uses 0.1 CPU blocks 4 CPU worth of scheduling capacity. Over-requested workloads cause artificial node saturation while actual utilization stays low — always right-size requests before adding nodes.
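The three warning zones above reduce to simple ratio checks. The following is a minimal Python sketch, not a production tool: the function name `capacity_warnings` and its parameters are hypothetical, and all inputs are assumed to be in the same unit (for example CPU cores).

```python
def capacity_warnings(requested, used, allocatable, largest_node):
    """Evaluate the three warning zones for one resource dimension.

    All arguments share one unit (e.g. CPU cores or GiB of memory).
    largest_node is the allocatable capacity of the biggest node.
    """
    warnings = []
    # Requests / Allocatable > 80%: scheduler is running out of room
    if requested / allocatable > 0.80:
        warnings.append("near saturation")
    # Actual usage / Requests < 25%: requests are mostly slack
    if requested > 0 and used / requested < 0.25:
        warnings.append("over-provisioned: right-size first")
    # Unrequested capacity smaller than one node: no room for a node failure
    if allocatable - requested < largest_node:
        warnings.append("headroom below one full node")
    return warnings


# A cluster with 100 allocatable cores, 90 requested, 20 actually used
print(capacity_warnings(requested=90, used=20, allocatable=100, largest_node=16))
```

Running the same check separately for CPU and memory shows which dimension saturates first.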
Capacity Dimensions
Kubernetes clusters can hit limits on any of five independent dimensions. Hitting any single one blocks further scheduling even when others are available:
| Dimension | What Limits It | How to Check | Expansion Strategy |
|---|---|---|---|
| CPU (requests) | Sum of pod CPU requests vs node allocatable CPU | kubectl resource-capacity --sort cpu.request | Add nodes, right-size requests, enable VPA |
| Memory (requests) | Sum of pod memory requests vs node allocatable memory | kubectl resource-capacity --sort mem.request | Add nodes, right-size requests, enable VPA |
| Pod count | --max-pods per node (default 110; with the AWS VPC CNI it varies by instance type) | kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods | Increase max-pods (VPC CNI prefix delegation), add nodes |
| IP addresses | Subnet size: a /24 has 251 usable IPs on AWS (256 minus 5 reserved), shared between nodes and pods | aws ec2 describe-subnets --query 'Subnets[].AvailableIpAddressCount' | Add subnets/secondary CIDRs, enable prefix delegation |
| Storage (PV) | EBS volume attachment limits per instance type (28–128 attachments) | kubectl get volumeattachments -o custom-columns=NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName | Use EFS/S3 for shared storage, increase instance size |
Node Sizing Strategy
Instance family selection
| Workload Profile | AWS Family | CPU:Memory Ratio | Use Case |
|---|---|---|---|
| General purpose | m7i, m6i, m5 | 1:4 | Mixed microservices, most production workloads |
| CPU-intensive | c7i, c6i, c5 | 1:2 | API servers, data processing, encoding |
| Memory-intensive | r7i, r6i, r5 | 1:8 | JVM apps, in-memory caches, analytics |
| Burstable / dev | t3, t4g | 1:2–4 | Development, CI runners, low-traffic services |
| GPU / ML | p4, g5, g4dn | — | Model training, inference, video |
| ARM / cost-optimized | m7g, c7g, r7g (Graviton) | 1:4, 1:2, 1:8 (matches the x86 counterpart) | Same as the m, c, and r families above; typically ~20% cheaper |
Node size tradeoffs
Fewer, larger nodes
- Lower per-node overhead (DaemonSets, kube-reserved)
- Fewer node reboots during upgrades
- Better bin-packing for large pods
- Risk: larger blast radius per node failure
- Risk: fragmentation for small pods
- Use: m7i.4xlarge (16 vCPU / 64 GiB) as baseline
More, smaller nodes
- Smaller blast radius per node failure
- Better fault isolation with PodAntiAffinity
- Higher overhead ratio (DaemonSets cost more)
- Risk: more nodes to manage/upgrade
- Risk: IP exhaustion faster
- Use: m7i.xlarge (4 vCPU / 16 GiB) for isolation-critical
kube-reserved and system-reserved budgets
Kubernetes reserves resources on each node for the kubelet, system daemons, and eviction threshold before pods can be scheduled. Always set these explicitly in your node configuration:
# EKS managed node group bootstrap configuration
# /etc/eks/bootstrap.sh extra args (via launch template userData)
--kubelet-extra-args '
--kube-reserved=cpu=250m,memory=1Gi,ephemeral-storage=1Gi
--system-reserved=cpu=250m,memory=500Mi,ephemeral-storage=1Gi
--eviction-hard=memory.available<500Mi,nodefs.available<10%,nodefs.inodesFree<5%
--eviction-soft=memory.available<1Gi,nodefs.available<15%
--eviction-soft-grace-period=memory.available=2m,nodefs.available=2m
--max-pods=110
'
Effective allocatable capacity formula
allocatable_cpu = node_cpu − kube_reserved_cpu − system_reserved_cpu
allocatable_mem = node_mem − kube_reserved_mem − system_reserved_mem − eviction_hard_mem
Example: m7i.xlarge (4 vCPU / 16 GiB) with the flags above
CPU: 4000m − 250m − 250m = 3500m per node
Mem: 16384Mi − 1024Mi − 500Mi − 500Mi = 14360Mi per node
# Verify actual allocatable on a running node
kubectl get node ip-10-0-1-42.us-east-1.compute.internal \
-o jsonpath='{.status.allocatable}' | jq .
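# Example output (from a node still using the EKS default reservations, which
# hold back 80m CPU on a 4-vCPU node; with the custom flags above expect cpu: 3500m):
# {
#   "cpu": "3920m",
#   "memory": "14902836Ki",
#   "pods": "110",
#   "ephemeral-storage": "99Gi"
# }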
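The allocatable formula is plain subtraction, so it is easy to check for any instance size. A minimal sketch, using the reservation values from the kubelet flags shown earlier (250m/1Gi kube-reserved, 250m/500Mi system-reserved, 500Mi hard eviction threshold); the helper name `allocatable` is hypothetical, and real nodes report slightly less total memory than the nominal instance size.

```python
def allocatable(capacity, kube_reserved, system_reserved, eviction_hard=0):
    """Capacity left for pods after kubelet reservations (same unit throughout)."""
    return capacity - kube_reserved - system_reserved - eviction_hard


# m7i.xlarge: 4 vCPU / 16 GiB nominal
cpu_millicores = allocatable(4000, 250, 250)            # CPU has no eviction threshold
memory_mib = allocatable(16 * 1024, 1024, 500, 500)     # memory.available<500Mi

print(cpu_millicores, memory_mib)  # 3500 14360
```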
Measuring Current Utilization
kubectl resource-capacity (krew)
# Per-node CPU and memory requests vs allocatable
kubectl resource-capacity --sort cpu.request
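# Output (illustrative, shown as requested/allocatable; the plugin itself prints each value with a percentage, e.g. "2350m (59%)"):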
# NODE CPU REQUESTS CPU LIMITS MEMORY REQUESTS MEMORY LIMITS
# ip-10-0-1-42.us-east-1.compute 2350m/3920m 4000m/3920m 8Gi/14.2Gi 16Gi/14.2Gi
# ip-10-0-1-55.us-east-1.compute 1100m/3920m 2000m/3920m 4Gi/14.2Gi 8Gi/14.2Gi
# Include actual usage (requires metrics-server)
kubectl resource-capacity --util --sort cpu.util
# Per-pod breakdown on a node
kubectl resource-capacity --pods --node ip-10-0-1-42.us-east-1.compute \
--sort cpu.request
PromQL — cluster-wide utilization
# Cluster CPU request utilization (requested / allocatable)
sum(kube_pod_container_resource_requests{resource="cpu"})
/
sum(kube_node_status_allocatable{resource="cpu"})
# Cluster memory request utilization
sum(kube_pod_container_resource_requests{resource="memory"})
/
sum(kube_node_status_allocatable{resource="memory"})
# Per-namespace CPU request share (top consumers)
sort_desc(
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
/
scalar(sum(kube_node_status_allocatable{resource="cpu"}))
)
# Actual CPU usage as % of allocatable
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum(kube_node_status_allocatable{resource="cpu"})
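# Node-level memory pressure (node_exporter series are labelled by instance, not node)
max by (instance) (
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  /
  node_memory_MemTotal_bytes
)
# Running pods per node vs max-pods limit
# (kube_pod_info has no phase label, so join it with kube_pod_status_phase)
sum by (node) (
  kube_pod_info
  * on (namespace, pod) group_left()
  (kube_pod_status_phase{phase="Running"} == 1)
)
/
max by (node) (kube_node_status_capacity{resource="pods"})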
Namespace-level capacity report
#!/bin/bash
# Weekly capacity report script
echo "=== Cluster Capacity Report $(date +%Y-%m-%d) ==="
echo ""
echo "--- Node Summary ---"
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,PODS:.status.allocatable.pods'
echo ""
echo "--- Top 20 Pods by Actual CPU Usage ---"
kubectl top pods -A --sort-by=cpu 2>/dev/null | \
head -n 21 | column -t
echo ""
echo "--- ResourceQuota Usage ---"
kubectl get resourcequota -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU-REQ:.status.used.requests\.cpu,CPU-LIM:.status.used.limits\.cpu,MEM-REQ:.status.used.requests\.memory'
echo ""
echo "--- Nodes Near CPU Request Saturation (>75%) ---"
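# Assumes the plugin's default table layout, where field 3 is the
# CPU-request percentage in parentheses, e.g. "(59%)"
kubectl resource-capacity --sort cpu.request 2>/dev/null | \
awk 'NR==1 {print; next} {pct=$3; gsub(/[()%]/, "", pct); if (pct+0 > 75) print}'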
Right-Sizing Workloads
Before adding nodes, eliminate request waste. CPU over-provisioning of 40–70% is common, and a VPA recommendation pass can often shrink a cluster by 20–40% without hurting reliability, as long as the new requests are rolled out gradually and watched for throttling or OOM kills.
VPA recommendation workflow
# Step 1: Deploy VPA in Off mode (recommendation-only — no pod restarts)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off" # recommendation only
  resourcePolicy:
    containerPolicies:
      - containerName: payment-service
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledValues: RequestsAndLimits
# Step 2: Read recommendations after 24h+ of data
kubectl describe vpa payment-service-vpa -n payments
# Output:
# Recommendation:
# Container Recommendations:
# Container Name: payment-service
# Lower Bound:
# Cpu: 100m
# Memory: 256Mi
# Target:
# Cpu: 350m ← recommended request
# Memory: 512Mi
# Uncapped Target:
# Cpu: 350m
# Memory: 480Mi
# Upper Bound:
# Cpu: 1200m
# Memory: 2Gi
# Step 3: Bulk export all VPA recommendations across cluster
kubectl get vpa -A -o json | jq -r '
.items[] |
.metadata.namespace + "/" + .metadata.name + " " +
(.status.recommendation.containerRecommendations[]? |
.containerName + " cpu:" +
.target.cpu + " mem:" + .target.memory)'
Goldilocks dashboard
Goldilocks (covered in depth in Section 08-07) installs VPAs for every Deployment in a labeled namespace and visualizes recommendations in a dashboard:
# Label a namespace for Goldilocks analysis
kubectl label namespace payments goldilocks.fairwinds.com/enabled=true
# Access dashboard (port-forward)
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
Common right-sizing patterns
| Pattern | Symptom | VPA says | Action |
|---|---|---|---|
| CPU over-requested | CPU request 2000m, usage 80m | Target: 150m | Lower request to 200m (target plus a 20% safety margin, rounded up) |
| Memory under-requested | OOMKilled events, memory request 256Mi | Target: 1.2Gi | Raise request to 1.5Gi; set limit at 2Gi |
| Limits ≫ Requests | CPU limit 8000m on 100m request | Target: 200m | Align: request 200m, limit 400m (2× ratio) |
| No limits set | Pod using 100% node CPU during spike | N/A | Set CPU limit at 3× VPA target; memory limit at 1.5× target |
| Batch job with prod sizing | Daily job with 4cpu/8Gi running 5 min/day | Target: 1cpu/2Gi | Lower requests; consider PriorityClass: batch-low |
Do not use VPA updateMode: Auto with HPA on the same CPU/memory metric — they fight each other. The recommended pattern: VPA on memory (HPA rarely scales on memory), HPA on CPU or a custom metric (RPS, queue depth). See 08-07 for full VPA/HPA strategy.
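The "target plus safety margin" arithmetic from the right-sizing table can be scripted when processing many VPA recommendations at once. This is a minimal sketch under stated assumptions: the function name `padded_request`, the 20% default margin, and the 50m rounding step are illustrative choices, not VPA behavior.

```python
def padded_request(target_m: int, margin_pct: int = 20, step_m: int = 50) -> int:
    """Add a safety margin to a VPA target (millicores) and round up to a step.

    Integer arithmetic avoids floating-point surprises at step boundaries.
    """
    padded_x100 = target_m * (100 + margin_pct)      # target × (1 + margin), scaled by 100
    steps = -(-padded_x100 // (100 * step_m))        # ceiling division
    return steps * step_m


# VPA target 150m → 180m with margin → rounded up to 200m
print(padded_request(150))  # 200
```

Rounding to a fixed step keeps requests tidy across workloads and makes diffs in manifests easier to review.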
Headroom & Safety Margins
Headroom is the slack capacity you deliberately keep available to absorb: node failures, traffic spikes before autoscaling reacts, and rolling updates that temporarily double pod count.
Headroom targets by workload criticality
| Cluster Tier | CPU Headroom | Memory Headroom | Min Free Nodes | Rationale |
|---|---|---|---|---|
| Production | ≥ 20% | ≥ 25% | 2 nodes (N+2) | Absorb node failure + autoscale lag (typically 2–4 min to add a node) |
| Staging | ≥ 15% | ≥ 20% | 1 node (N+1) | Functional parity with some cost savings |
| Development | ≥ 5% | ≥ 10% | 0 | Autoscale freely; minor disruption acceptable |
Pause pods for guaranteed headroom
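Karpenter and Cluster Autoscaler remove nodes they consider idle or underutilized, which also removes your headroom. A pause pod (a pod running the pause image with real resource requests but no actual work) holds that capacity in reserve: each replica reserves its requested CPU and memory, and because it runs at low priority, real workloads can preempt it at any time: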
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reservation
  namespace: kube-system
spec:
  replicas: 2 # total reserve = replicas × requests; add topology spread constraints to place one per AZ
  selector:
    matchLabels:
      app: capacity-reservation
  template:
    metadata:
      labels:
        app: capacity-reservation
    spec:
      priorityClassName: batch-low # must already exist, with a priority below every real workload
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1500m" # each replica reserves 1.5 CPU
              memory: "3Gi" # each replica reserves 3 GiB memory
          securityContext:
            allowPrivilegeEscalation: false
            runAsNonRoot: true
            runAsUser: 65534
            seccompProfile:
              type: RuntimeDefault
            capabilities:
              drop: ["ALL"]
When a real pod cannot be scheduled, the scheduler preempts the low-priority pause pod (PriorityClass: batch-low) to make room. The preempted pause pod goes Pending, which triggers Karpenter or Cluster Autoscaler to provision a new node. Net result: the new node is available for the next real workload burst, maintaining the headroom invariant.
Demand Forecasting
Capacity planning without a forecast is reactive. Use historical Prometheus data to project future demand.
Linear regression with predict_linear()
# Predict CPU request total in 30 days based on last 14 days of growth
predict_linear(
sum(kube_pod_container_resource_requests{resource="cpu"})[14d:1h],
30 * 24 * 3600
)
# As % of current allocatable — will we saturate?
predict_linear(
sum(kube_pod_container_resource_requests{resource="cpu"})[14d:1h],
30 * 24 * 3600
)
/
sum(kube_node_status_allocatable{resource="cpu"})
# Namespace-level forecast: payments team CPU growth
predict_linear(
sum by (namespace) (
kube_pod_container_resource_requests{resource="cpu", namespace="payments"}
)[7d:1h],
14 * 24 * 3600
)
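predict_linear() fits a least-squares line through the samples in the window and extrapolates it. The sketch below reimplements that idea in plain Python to make the arithmetic visible. It is an illustration, not Prometheus's code: the function name and sample data are made up, and it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the query evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares linear extrapolation.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Returns the fitted value seconds_ahead after the last sample.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance            # growth per second
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hourly samples of total CPU requests growing by 1 core per hour
history = [(0, 100.0), (3600, 101.0), (7200, 102.0)]
forecast = predict_linear(history, seconds_ahead=3600)
print(round(forecast, 3))
```

Because the fit is a straight line, a single burst inside the window tilts the whole forecast, which is one reason to prefer windows of a week or more.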
Seasonality and traffic patterns
predict_linear() fits a straight line, so it assumes steady linear growth. Real traffic has weekly cycles (weekday vs weekend) and seasonal spikes (end-of-month billing, Black Friday). For these, use longer lookback windows and apply multipliers manually:
#!/bin/bash
# Extract weekly peak-to-average ratio from Prometheus
# Use this to size for peak, not average
PROM_URL="http://prometheus.monitoring.svc:9090"
# 7-day average CPU requests
AVG=$(curl -s "${PROM_URL}/api/v1/query" \
--data-urlencode 'query=avg_over_time(sum(kube_pod_container_resource_requests{resource="cpu"})[7d:1h])' | \
jq -r '.data.result[0].value[1]')
# 7-day peak CPU requests
PEAK=$(curl -s "${PROM_URL}/api/v1/query" \
--data-urlencode 'query=max_over_time(sum(kube_pod_container_resource_requests{resource="cpu"})[7d:1h])' | \
jq -r '.data.result[0].value[1]')
echo "Average CPU requests: ${AVG} cores"
echo "Peak CPU requests: ${PEAK} cores"
echo "Peak-to-avg ratio: $(echo "scale=2; $PEAK/$AVG" | bc)×"
echo ""
echo "Size cluster for PEAK × 1.3 safety margin = $(echo "scale=1; $PEAK*1.3" | bc) cores"
Capacity planning spreadsheet model
| Input | Value | Notes |
|---|---|---|
| Current CPU requests | 120 cores | From Prometheus |
| Monthly growth rate | 8% | From predict_linear 3-month trend |
| Peak-to-average multiplier | 1.4× | From weekly seasonality analysis |
| Headroom buffer | 20% | Production tier requirement |
| 6-month CPU target | 120 × (1.08^6) × 1.4 × 1.2 ≈ 320 cores | Order nodes 4 weeks before hitting limit |
| Node size (m7i.xlarge) | 3.5 allocatable CPU per node | After kube/system reserved |
| Nodes needed in 6 months | 320 / 3.5 ≈ 92 nodes | vs 40 today → plan for more than doubling |
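The spreadsheet model is compound growth multiplied by two safety factors, which is worth scripting so the inputs can be varied. A minimal sketch using the input values from the table (120 cores, 8% monthly growth, 1.4× peak ratio, 20% headroom, 3.5 allocatable CPU per node); the function name `nodes_needed` is hypothetical.

```python
import math


def nodes_needed(current_cores, monthly_growth, months, peak_ratio, headroom, cores_per_node):
    """Project the CPU target and the node count required to host it."""
    grown = current_cores * (1 + monthly_growth) ** months   # compound growth
    target = grown * peak_ratio * (1 + headroom)             # size for peak, plus buffer
    return target, math.ceil(target / cores_per_node)


target, nodes = nodes_needed(
    current_cores=120, monthly_growth=0.08, months=6,
    peak_ratio=1.4, headroom=0.20, cores_per_node=3.5,
)
print(f"{target:.0f} cores, {nodes} nodes")  # 320 cores, 92 nodes
```

Re-running with a growth rate of 5% or 10% gives a quick sensitivity range for the purchase decision.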
Autoscaling Strategy
Autoscaling handles the short-term spikes; capacity planning handles the long-term trend. Both are required — autoscaling alone is not a substitute for planning.
Three-layer autoscaling model
Layer 3: Node autoscaling (Karpenter / Cluster Autoscaler)
  Response time: typically 2–4 min (Cluster Autoscaler), 1–2 min (Karpenter)
  Trigger: Pending pods that can't be scheduled
  Action: Add / remove nodes
Layer 2: Horizontal pod autoscaling (HPA / KEDA)
  Response time: 15 sec–2 min (depends on metric scrape interval)
  Trigger: CPU%, memory%, custom metric (RPS, queue depth)
  Action: Scale replicas up/down
Layer 1: Vertical pod autoscaling (VPA)
  Response time: Minutes–hours (restarts pods)
  Trigger: Usage drifting away from current requests (applied only when updateMode is not "Off")
  Action: Right-size requests/limits per pod
──────────────────────────────────────────────────────────────
Capacity planning: ensures Layer 3 has room to scale into
(burst headroom, savings plan coverage for baseline)
Karpenter NodePool tuning for capacity planning
Karpenter (covered in depth in Section 08-01) consolidates and provisions nodes. For capacity planning purposes, configure disruption budgets to prevent over-aggressive consolidation during business hours:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: production
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m7i.xlarge
            - m7i.2xlarge
            - m7i.4xlarge
            - m6i.xlarge
            - m6i.2xlarge
            - m6i.4xlarge
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: production
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      # During business hours: disrupt at most 10% of nodes at once
      - schedule: "0 9 * * 1-5" # Mon-Fri 09:00 (cron schedules are evaluated in UTC)
        duration: 8h
        nodes: "10%"
      # Always active: 20% cap. The most restrictive active budget wins,
      # so this one governs off-hours
      - nodes: "20%"
  limits:
    cpu: "500"
    memory: "2000Gi"
HPA configuration for spike absorption
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3 # ≥ 3 for multi-AZ spread
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60 # scale out at 60% CPU, not 80%
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000" # 1000 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60 # quick scale-up
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60 # add up to 4 pods per minute
        - type: Percent
          value: 100
          periodSeconds: 60 # or double, whichever is larger
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300 # 5 min before scale-down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60 # remove at most 2 pods per minute
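To reason about how this HPA reacts to a spike, it helps to work through the replica calculation by hand. The sketch below applies the documented HPA formula, desired = ceil(current × currentMetric / targetMetric), together with the scale-up policies configured above. It is a simplification: it ignores the default 10% tolerance band, the stabilization window, and the fact that with several metrics the HPA takes the largest result. The function names are hypothetical.

```python
import math


def desired_replicas(current, current_metric, target_metric, min_replicas=3, max_replicas=50):
    """Core HPA formula, clamped to the min/max replica bounds."""
    raw = math.ceil(current * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


def scale_up_limit(current, pods=4, percent=100):
    """Most replicas reachable in one period with selectPolicy: Max."""
    return current + max(pods, math.ceil(current * percent / 100))


# 5 replicas at 90% average CPU against a 60% target
print(desired_replicas(5, 90, 60))   # 8
# From 3 replicas, one period allows +4 pods (larger than +100%)
print(scale_up_limit(3))             # 7
```

The second function shows why the Pods policy matters at low replica counts: doubling 3 replicas adds only 3, while the fixed policy adds 4.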
Node Pool Architecture
Separate node pools by workload class. This allows independent scaling, targeted instance selection, and cost isolation:
──────────────────────────────────────────────────────────────────
system pool (2× m7i.xlarge, on-demand, multi-AZ)
  Taint: CriticalAddonsOnly=true:NoSchedule
  Workloads: CoreDNS, kube-proxy, metrics-server, cert-manager
──────────────────────────────────────────────────────────────────
observability pool (3× m7i.2xlarge, on-demand, multi-AZ)
  Taint: dedicated=observability:NoSchedule
  Workloads: Prometheus, Loki, Tempo, Pyroscope, Grafana
──────────────────────────────────────────────────────────────────
production pool (Karpenter managed, spot+on-demand, multi-AZ)
  No taint (default pool)
  Workloads: All production services
──────────────────────────────────────────────────────────────────
batch pool (Karpenter managed, spot-only)
  Taint: dedicated=batch:NoSchedule
  Workloads: Argo Workflows, Spark, ML training, CI runners
──────────────────────────────────────────────────────────────────
gpu pool (Karpenter managed, g5.xlarge, spot+on-demand)
  Taint: nvidia.com/gpu=true:NoSchedule
  Workloads: ML inference, video encoding
──────────────────────────────────────────────────────────────────
# Batch workload targeting the batch node pool
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
  # Assumes batch nodes also carry the label dedicated=batch
  # (a taint alone does not add a label)
  nodeSelector:
    dedicated: batch
  # Or with affinity for more control:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated
                operator: In
                values: ["batch"]
Stateful Workload Capacity
Stateful workloads (databases, Kafka, etcd) have additional capacity constraints beyond CPU and memory.
Prometheus StatefulSet sizing example
| Dimension | Calculation | Example Value |
|---|---|---|
| Samples ingested/sec | Scrape targets × avg metrics × 1/scrape_interval | 50,000 targets × 200 metrics × 1/15s = 667k samples/sec |
| CPU | ~100m CPU per 100k active series | 10M active series → 100 × 100m = 10 CPU |
| Memory (head chunk) | ~3 KiB per active series | 10M series × 3 KiB ≈ 29 GiB RAM |
| Storage (TSDB) | bytes/sample × samples/sec × retention_seconds × 1.1 overhead | 1.5 bytes × 667k/s × 86,400 s/day × 15 days × 1.1 ≈ 1.4 TB (1.3 TiB) per replica |
| IOPS | WAL: ~1000 IOPS; compaction bursts to 5000 IOPS | gp3 with 4000 IOPS provisioned |
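The sizing rules in the table chain together, so a small calculator helps when the target count or retention changes. A minimal sketch using the table's rule-of-thumb constants (1.5 bytes per sample, 3 KiB per active series, 1.1× overhead) as defaults; the function name and dictionary keys are hypothetical, and the constants are heuristics to validate against a running Prometheus.

```python
def prometheus_sizing(targets, series_per_target, scrape_interval_s, retention_days,
                      bytes_per_sample=1.5, kib_per_series=3, overhead=1.1):
    """Estimate ingest rate, head memory, and TSDB storage for one replica."""
    active_series = targets * series_per_target
    samples_per_sec = active_series / scrape_interval_s
    memory_gib = active_series * kib_per_series / 1024**2        # KiB → GiB
    storage_bytes = (bytes_per_sample * samples_per_sec
                     * 86400 * retention_days * overhead)
    return {
        "active_series": active_series,
        "samples_per_sec": samples_per_sec,
        "memory_gib": memory_gib,
        "storage_tib": storage_bytes / 1024**4,
    }


# 50,000 targets × 200 series each, 15s scrape interval, 15-day retention
sizing = prometheus_sizing(50_000, 200, 15, 15)
for key, value in sizing.items():
    print(f"{key}: {value:,.2f}")
```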
PostgreSQL/MySQL sizing heuristics
- Connections: size memory for max_connections × work_mem; use a connection pooler such as PgBouncer
- Shared buffers: 25% of instance RAM (OS will cache the rest via page cache)
- Storage: plan for 3× current data size (growth + WAL + vacuuming space)
- IOPS: OLTP writes need >3000 IOPS sustained; use io2 or gp3 with provisioned IOPS
- CPU: OLTP is typically memory and I/O bound, not CPU bound; 4–8 vCPU is a reasonable starting point for many databases, but confirm with a load test before relying on throughput figures such as 100k QPS
IP Exhaustion Planning
VPC CNI assigns a real VPC IP to every pod. With hundreds of pods per node, subnets exhaust quickly. Plan IP capacity as carefully as compute.
# Check available IPs per subnet
aws ec2 describe-subnets \
--filters "Name=tag:kubernetes.io/cluster/my-cluster,Values=shared" \
--query 'Subnets[*].{Subnet:SubnetId,AZ:AvailabilityZone,Available:AvailableIpAddressCount,CIDR:CidrBlock}' \
--output table
# Count pods holding an IP, per node (top 20)
kubectl get pods -A --no-headers \
-o custom-columns='NODE:.spec.nodeName,IP:.status.podIP' | \
awk '$2 != "<none>" {print $1}' | sort | uniq -c | sort -rn | head -20
# Count pods on one node to understand IP density
kubectl get pods -A -o wide --no-headers --field-selector spec.nodeName=ip-10-0-1-42.us-east-1.compute.internal | wc -l
EKS VPC CNI prefix delegation
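Prefix delegation assigns a /28 prefix (16 IPs) per ENI slot instead of a single IP, raising the number of pod IPs a node can hold by up to 16× with no subnet CIDR changes. Each prefix needs a contiguous free /28 block, so heavily fragmented subnets can still fail to allocate: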
# Enable prefix delegation on existing cluster
kubectl set env daemonset aws-node \
-n kube-system \
ENABLE_PREFIX_DELEGATION=true \
WARM_PREFIX_TARGET=1
# Verify (new nodes will get prefix IPs; existing nodes need rolling replacement)
kubectl describe node ip-10-0-1-42.us-east-1.compute.internal | \
grep -A5 "Capacity"
# max-pods with prefix delegation on m7i.xlarge:
# ENIs: 4, IPv4 addresses per ENI: 15, prefix size: /28 (16 IPs)
# IP ceiling = 4 × (15 − 1) × 16 + 2 = 898 (vs 4 × (15 − 1) + 2 = 58 without prefix delegation)
# Recommended max-pods = 110 (AWS guidance caps it at 110 below 30 vCPUs, 250 otherwise)
| Instance | Max pods (default) | Max pods (prefix delegation) |
|---|---|---|
| t3.medium | 17 | 110 (IP ceiling 242) |
| m7i.xlarge | 58 | 110 (IP ceiling 898) |
| m7i.4xlarge | 234 | 110 (IP ceiling 3,714) |
| c7i.2xlarge | 58 | 110 (IP ceiling 898) |
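The per-instance limits follow from the ENI count and the number of IPv4 addresses per ENI. The sketch below encodes the standard VPC CNI formula, ENIs × (IPs per ENI − 1) + 2, with each slot worth 16 addresses under prefix delegation. The function names are hypothetical, the ENI figures in the examples are assumptions to check against the EC2 documentation for your instance types, and the 110/250 cap reflects AWS's published recommendation rather than a hard limit.

```python
def ip_ceiling(enis, ips_per_eni, prefix_delegation=False):
    """Maximum pods the node's ENIs can address.

    One address per ENI is the ENI's own primary IP; the +2 accounts for
    host-network pods that do not consume a pod IP.
    """
    ips_per_slot = 16 if prefix_delegation else 1   # a /28 prefix holds 16 IPs
    return enis * (ips_per_eni - 1) * ips_per_slot + 2


def capped_max_pods(enis, ips_per_eni, vcpus):
    """Recommended max-pods with prefix delegation enabled (assumed AWS cap)."""
    cap = 110 if vcpus < 30 else 250
    return min(ip_ceiling(enis, ips_per_eni, prefix_delegation=True), cap)


# Assumed limits: m7i.xlarge has 4 ENIs with 15 IPv4 addresses each
print(ip_ceiling(4, 15))                          # 58
print(ip_ceiling(4, 15, prefix_delegation=True))  # 898
print(capped_max_pods(4, 15, vcpus=4))            # 110
```

In practice the cap, not the IP ceiling, is what limits pod density once prefix delegation is on.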
Capacity Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capacity-planning-alerts
  namespace: monitoring
spec:
  groups:
    - name: capacity.planning
      interval: 5m
      rules:
        # CPU request saturation
        - alert: ClusterCPURequestSaturationHigh
          expr: |
            sum(kube_pod_container_resource_requests{resource="cpu"})
            /
            sum(kube_node_status_allocatable{resource="cpu"}) > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Cluster CPU request utilization > 80%"
            description: "CPU requests are {{ $value | humanizePercentage }} of allocatable. Autoscaler may not have enough headroom."
            runbook_url: https://runbooks.example.com/capacity/cpu-saturation
        - alert: ClusterCPURequestSaturationCritical
          expr: |
            sum(kube_pod_container_resource_requests{resource="cpu"})
            /
            sum(kube_node_status_allocatable{resource="cpu"}) > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Cluster CPU request utilization > 90%: pending pods likely"
            runbook_url: https://runbooks.example.com/capacity/cpu-saturation
        # Memory request saturation
        - alert: ClusterMemoryRequestSaturationHigh
          expr: |
            sum(kube_pod_container_resource_requests{resource="memory"})
            /
            sum(kube_node_status_allocatable{resource="memory"}) > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Cluster memory request utilization > 80%"
            runbook_url: https://runbooks.example.com/capacity/memory-saturation
        # Pending pods (scheduling failure, often a capacity signal).
        # kube_pod_status_phase is 0 or 1 per pod and phase, so sum it rather than count it.
        - alert: PodsPendingTooLong
          expr: |
            sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pods pending > 10 min in namespace {{ $labels.namespace }}"
            description: "{{ $value }} pods are pending. Check node capacity or resource quota."
            runbook_url: https://runbooks.example.com/capacity/pending-pods
        # Node CPU utilization (actual usage, not requests).
        # node_exporter series carry an instance label, not node.
        - alert: NodeCPUUtilizationHigh
          expr: |
            (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} CPU utilization > 85%"
            runbook_url: https://runbooks.example.com/capacity/node-cpu-high
        # Node memory pressure
        - alert: NodeMemoryPressure
          expr: |
            (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
            /
            node_memory_MemTotal_bytes > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} memory usage > 90%"
            runbook_url: https://runbooks.example.com/capacity/node-memory-pressure
        # Pod count approaching max-pods limit.
        # kube_pod_info has no phase label, so join it with kube_pod_status_phase.
        - alert: NodePodCapacityHigh
          expr: |
            sum by (node) (
              kube_pod_info
              * on (namespace, pod) group_left()
              (kube_pod_status_phase{phase="Running"} == 1)
            )
            /
            max by (node) (kube_node_status_capacity{resource="pods"}) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} pod capacity > 85%"
            runbook_url: https://runbooks.example.com/capacity/node-pod-limit
        # Namespace quota near limit.
        # The two sides differ only in the type label, so ignore it when matching.
        - alert: NamespaceCPUQuotaNearLimit
          expr: |
            kube_resourcequota{resource="requests.cpu", type="used"}
            / ignoring (type)
            kube_resourcequota{resource="requests.cpu", type="hard"} > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Namespace {{ $labels.namespace }} CPU quota > 85% used"
            runbook_url: https://runbooks.example.com/capacity/quota-near-limit
        # Karpenter node provisioning failures, approximated by cloud provider errors.
        # Metric names differ across Karpenter versions; confirm against the
        # controller's /metrics output before relying on this rule.
        - alert: KarpenterNodeProvisioningFailed
          expr: |
            sum(increase(karpenter_cloudprovider_errors_total[30m])) > 0
          labels:
            severity: warning
          annotations:
            summary: "Karpenter cloud provider errors in last 30 min; node provisioning may be failing"
            runbook_url: https://runbooks.example.com/capacity/karpenter-provisioning-failure
        # Capacity forecast: will saturate in < 7 days
        - alert: ClusterCPUForecastSaturation
          expr: |
            predict_linear(
              sum(kube_pod_container_resource_requests{resource="cpu"})[3d:1h],
              7 * 24 * 3600
            )
            /
            sum(kube_node_status_allocatable{resource="cpu"}) > 0.90
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Cluster CPU requests forecast to reach 90% within 7 days"
            runbook_url: https://runbooks.example.com/capacity/cpu-forecast
Capacity Review Cadence
| Cadence | Activity | Owner | Output |
|---|---|---|---|
| Daily | Check capacity alerts in Grafana/PagerDuty; review pending pods | On-call engineer | Immediate remediation if alerts firing |
| Weekly | Run capacity report script; review VPA recommendations top-10 namespaces; check Goldilocks dashboard | Platform team | Right-sizing tickets for top over-provisioned workloads |
| Monthly | CPU/memory trend analysis (30d); forecast next 90 days; review FinOps efficiency KPIs; update savings plan commitment | Platform + FinOps | Capacity plan document; savings plan adjustment |
| Quarterly | Full capacity audit: node pool sizing review; instance family refresh (new gen available?); VPC subnet headroom; storage PVC growth; multi-cluster capacity balance | Platform team lead | Architectural changes, subnet expansion, instance migration plan |
Monthly capacity review checklist
#!/bin/bash
# Monthly capacity review — run on first Monday of each month
set -euo pipefail
PROM_URL="${PROM_URL:-http://prometheus.monitoring.svc:9090}"
REPORT_DATE=$(date +%Y-%m-%d)
echo "======================================"
echo " Monthly Capacity Review: $REPORT_DATE"
echo "======================================"
echo ""
echo "## 1. Node Summary"
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type,READY:.status.conditions[?(@.type=="Ready")].status,AGE:.metadata.creationTimestamp'
echo ""
echo "## 2. Cluster Utilization (30-day avg)"
for METRIC in cpu memory; do
AVG=$(curl -s "${PROM_URL}/api/v1/query" \
--data-urlencode "query=avg_over_time((sum(kube_pod_container_resource_requests{resource=\"${METRIC}\"}) / sum(kube_node_status_allocatable{resource=\"${METRIC}\"}))[30d:1h])" | \
jq -r '.data.result[0].value[1] // "N/A"')
PEAK=$(curl -s "${PROM_URL}/api/v1/query" \
--data-urlencode "query=max_over_time((sum(kube_pod_container_resource_requests{resource=\"${METRIC}\"}) / sum(kube_node_status_allocatable{resource=\"${METRIC}\"}))[30d:1h])" | \
jq -r '.data.result[0].value[1] // "N/A"')
echo " ${METRIC}: avg=${AVG} peak=${PEAK}"
done
echo ""
echo "## 3. Top 10 CPU-Requesting Namespaces (30d avg)"
# Average the per-namespace sum over time, so short-lived pods are weighted
# by how long they actually existed
curl -s "${PROM_URL}/api/v1/query" \
--data-urlencode 'query=sort_desc(avg_over_time(sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})[30d:1h]))' | \
jq -r '.data.result[:10][] | " \(.metric.namespace): \(.value[1]) cores"'
echo ""
echo "## 4. 90-day CPU Forecast"
FORECAST=$(curl -s "${PROM_URL}/api/v1/query" \
--data-urlencode "query=predict_linear(sum(kube_pod_container_resource_requests{resource=\"cpu\"})[30d:1h], 90*24*3600) / sum(kube_node_status_allocatable{resource=\"cpu\"})" | \
jq -r '.data.result[0].value[1] // "N/A"')
echo " Projected utilization in 90 days: ${FORECAST}"
echo ""
echo "## 5. VPAs With Recommendations (first 10)"
kubectl get vpa -A -o json 2>/dev/null | jq -r '
[.items[] |
select(.status.recommendation != null) |
.metadata.namespace + "/" + .metadata.name] | .[:10][]'
echo ""
echo "## 6. Unbound PVCs (Pending or Lost)"
# PVCs do not support a status.phase field selector, so filter on the STATUS column
kubectl get pvc -A --no-headers 2>/dev/null | \
awk '$3 != "Bound" {print; n++} END {if (!n) print " None found"}'
Best Practices
Right-size before scaling out
VPA recommendations commonly reveal 40–70% CPU over-provisioning. Always run a right-sizing pass before concluding you need more nodes.
Reserve headroom with pause pods
Avoid running above 80% request utilization in production. Use low-priority pause pods to maintain N+2 node headroom that real workloads can claim immediately through preemption.
Forecast from Prometheus data
Use predict_linear() on 14–30 day windows for growth trends. Alert when the forecast projects saturation within 7 days.
Plan all five dimensions
CPU and memory are obvious. Also plan pod count, IP addresses (VPC subnets), and storage PV/EBS attachment limits — any one can block scheduling.
Separate pools by workload class
System, observability, production, batch, and GPU pools scale independently, use appropriate instance families, and provide blast-radius isolation.
Monthly review cadence
Capacity planning is a monthly discipline, not a one-time event. Schedule the review, track the trend, and adjust savings plan commitments accordingly.