Horizontal Pod Autoscaler
Scale workload replicas automatically based on resource utilization and custom metrics
The Horizontal Pod Autoscaler (HPA) adjusts the replicas field of a scalable workload (Deployment, StatefulSet, ReplicaSet, or any resource implementing the scale subresource) in response to observed metrics. It operates as a control loop in kube-controller-manager, periodically comparing current metric values against targets and computing a new desired replica count. The autoscaling/v2 API (GA since 1.23) supports multiple metric sources, fine-grained scaling behavior, and per-container metrics.
Control Loop & Metrics Pipeline
Replica Calculation Formula
For a utilization target (percentage of requests):
desiredReplicas = ceil(currentReplicas × (currentUtilization / targetUtilization))
For an average value target:
desiredReplicas = ceil(currentReplicas × (totalMetricValue / (targetAverageValue × currentReplicas)))
= ceil(totalMetricValue / targetAverageValue)
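As a worked example, the formula can be sketched in a few lines of Python. The function name is illustrative, and the 0.1 tolerance band is an addition not mentioned above: it is the kube-controller-manager default (`--horizontal-pod-autoscaler-tolerance`), inside which the controller leaves the replica count unchanged.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Core HPA formula, with the tolerance band applied first."""
    ratio = current_value / target_value
    # A ratio close enough to 1.0 does not trigger any scaling.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 90, 70))   # 4 x 90/70 = 5.14, rounded up
print(desired_replicas(4, 72, 70))   # ratio 1.03 is inside the band
print(desired_replicas(10, 35, 70))  # ratio 0.5, scale down
```

The three calls cover a scale-up, a no-op inside the tolerance band, and a scale-down.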
Pods in non-Ready state (Pending, Terminating, or recently started) are excluded from metric averaging to avoid false scale-ups during rollouts.
If metric data is unavailable for a pod (e.g., the pod just started or metrics-server is lagging), the controller recomputes the average conservatively: the pod is assumed to consume 100% of the target when the initial calculation points to a scale-down, and 0% when it points to a scale-up. This dampens the magnitude of any scaling in either direction and prevents premature scale-down during rollouts. If all pods' metrics are unavailable, the HPA takes no action (neither scales up nor down).
HPA Spec — Full Reference
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  # --- Target workload ---
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  # --- Replica bounds ---
  minReplicas: 3   # Never scale below this (default: 1)
  maxReplicas: 50  # Hard ceiling
  # --- Metrics (evaluated independently; max desired wins) ---
  metrics:
    # 1. CPU utilization (% of requests)
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Target 70% of CPU requests across all pods
    # 2. Memory: use AverageValue, not Utilization (memory doesn't release fast)
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: 800Mi  # Target 800Mi average per pod
    # 3. Per-container CPU (ContainerResource; alpha in 1.20, GA in 1.30)
    - type: ContainerResource
      containerResource:
        name: cpu
        container: proxy-sidecar  # Scale based on sidecar, not main container
        target:
          type: Utilization
          averageUtilization: 80
    # 4. Custom metric from Pods (per-pod, averaged across replicas)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"  # 500 RPS per pod target
    # 5. Custom metric from Object (single source, e.g., Ingress)
    - type: Object
      object:
        metric:
          name: ingress_requests_per_second
        describedObject:
          apiVersion: networking.k8s.io/v1
          kind: Ingress
          name: api-ingress
        target:
          type: Value
          value: "10000"  # Total 10k RPS on this Ingress
    # 6. External metric (cloud queue, external system)
    - type: External
      external:
        metric:
          name: sqs_messages_visible
          selector:
            matchLabels:
              queue: api-job-queue
        target:
          type: AverageValue
          averageValue: "30"  # 30 messages per worker pod
  # --- Scaling behavior ---
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately (no dampening)
      policies:
        - type: Pods
          value: 4  # Add at most 4 pods per period
          periodSeconds: 60
        - type: Percent
          value: 100  # Or double replicas per period
          periodSeconds: 60
      selectPolicy: Max  # Use whichever policy allows more pods
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min of consistently low metrics
      policies:
        - type: Pods
          value: 2  # Remove at most 2 pods per period
          periodSeconds: 60
        - type: Percent
          value: 10  # Or 10% of replicas per period
          periodSeconds: 60
      selectPolicy: Min  # Use whichever policy removes fewer pods
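The "max desired wins" rule from the metrics comment in the spec can be illustrated with a short Python sketch. The function name and the sample ratios are hypothetical, and the tolerance band and behavior policies are deliberately left out to keep the arithmetic visible.

```python
import math


def combine_metrics(current_replicas, metric_ratios, min_replicas, max_replicas):
    """Pick the largest per-metric proposal, then clamp to the replica bounds.

    metric_ratios maps a metric name to its current/target ratio.
    """
    proposals = {
        name: math.ceil(current_replicas * ratio)
        for name, ratio in metric_ratios.items()
    }
    desired = max(proposals.values())
    return max(min_replicas, min(max_replicas, desired))


# 8 replicas: CPU at 84% of a 70% target, memory at 600Mi of 800Mi,
# request rate at 900 RPS per pod against a 500 RPS target.
ratios = {"cpu": 84 / 70, "memory": 600 / 800, "rps": 900 / 500}
print(combine_metrics(8, ratios, min_replicas=3, max_replicas=50))
```

Memory alone would suggest scaling down, but the request-rate metric asks for the most replicas and therefore decides the outcome.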
Metric Types in Detail
Resource Metrics
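Resource metrics use the metrics.k8s.io API provided by metrics-server. Metrics-server scrapes each kubelet at a configurable interval (--metric-resolution; 15 seconds by default in current releases, 60 seconds in older ones) and serves the most recent usage computed over that short window, not a long-running average.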
| Target type | Formula | Best for |
|---|---|---|
| Utilization | currentCPU / requests.cpu × 100 | CPU (pods must have CPU requests set) |
| AverageValue | Total metric value across pods / replica count | Memory, custom per-pod metrics |
| Value | Raw single value (Object/External only) | Queue depth, global counters |
Using a Utilization target for memory is misleading: a JVM heap that is 90% allocated but not under GC pressure will trigger scale-up even if the application is healthy. Prefer AverageValue with generous headroom above the working set. Also note that scaling down does not reclaim memory already allocated by the JVM or Go runtime; the pod must be restarted. The scale-down stabilization window (300s default) is critical for memory-based scaling.
ContainerResource (alpha in 1.20, GA in 1.30)
When pods run multiple containers, Resource metrics aggregate all containers. ContainerResource targets a specific container, enabling independent scaling decisions:
- type: ContainerResource
  containerResource:
    name: memory
    container: app  # Only consider the app container's memory
    target:
      type: AverageValue
      averageValue: 512Mi
This is essential when a heavyweight sidecar (Istio Envoy, Datadog agent) consumes disproportionate resources — you don't want sidecar resource usage to drive scaling of the main application.
Pods Metric
The Pods metric type reads from custom.metrics.k8s.io and averages the named metric across all pods selected by the HPA's target. The Prometheus Adapter or a custom adapter must serve this API.
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
      selector:  # Optional: filter by label on the metric series
        matchLabels:
          route: /api/v2
    target:
      type: AverageValue
      averageValue: "1000"  # 1000 req/s per pod
Object Metric
The Object type reads a single metric value from a specific Kubernetes object. Common use: total request rate on an Ingress, queue depth on a Kafka topic CRD.
- type: Object
  object:
    metric:
      name: nginx_ingress_requests_per_second
    describedObject:
      apiVersion: networking.k8s.io/v1
      kind: Ingress
      name: frontend-ingress
    target:
      type: Value
      value: "5000"  # Total RPS on the Ingress drives replica count
External Metric
External metrics come from systems outside the cluster (cloud queues, monitoring systems). An adapter must bridge the external system to external.metrics.k8s.io.
- type: External
  external:
    metric:
      name: aws_sqs_approximate_number_of_messages_visible
      selector:
        matchLabels:
          queue_name: payment-jobs
    target:
      type: AverageValue
      averageValue: "10"  # 10 messages per worker pod ideal
Scaling Behavior
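The behavior block controls how fast scaling happens, independently for scale-up and scale-down. When it is omitted, the defaults apply: scale-up has no stabilization window and may add up to 4 pods or 100% of current replicas every 15 seconds (whichever is greater), while scale-down waits out a 300-second stabilization window and may then remove up to 100% of replicas every 15 seconds. These defaults favor fast reaction and can cause flapping for spiky workloads.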
| Field | Default (scale-up) | Default (scale-down) | Effect |
|---|---|---|---|
| stabilizationWindowSeconds | 0 | 300 | Seconds to look back over past recommendations; scale-down uses the highest in the window, scale-up the lowest (max 3600s) |
| selectPolicy | Max | Max | Which policy applies when several are defined: Max allows the largest change, Min the smallest, Disabled blocks scaling in that direction |
| policies[].type | Pods and Percent | Percent | Pods (absolute) or Percent (relative) |
| policies[].value | 4 pods, 100% | 100% | Max change allowed per period |
| policies[].periodSeconds | 15 | 15 | Time window for the policy (max 1800s) |
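The stabilization window can be sketched as follows. This is a simplified reading of the controller logic, not its actual code: the function name and the `(timestamp, replicas)` history format are assumptions, and rate-limiting policies are ignored.

```python
def stabilize(current_replicas, history, now, up_window=0, down_window=300):
    """Apply stabilization windows to a list of past recommendations.

    history is a list of (timestamp_seconds, recommended_replicas) pairs,
    oldest first; the last entry is the newest recommendation.
    """
    latest = history[-1][1]
    in_up = [r for t, r in history if now - t <= up_window]
    in_down = [r for t, r in history if now - t <= down_window]
    # Scale-up may go no higher than the lowest recent recommendation.
    up_floor = min([latest] + in_up)
    # Scale-down may go no lower than the highest recent recommendation.
    down_ceiling = max([latest] + in_down)
    result = current_replicas
    if result < up_floor:
        result = up_floor
    if result > down_ceiling:
        result = down_ceiling
    return result


history = [(0, 10), (60, 6), (120, 5), (180, 7)]
print(stabilize(10, history, now=180))              # a "10" is still in the window
print(stabilize(10, history + [(400, 6)], now=400)) # highest in the last 300s is 7
```

With the default windows, a single low reading never shrinks the workload on its own: the replica count only drops once every recommendation in the last 300 seconds is below the current count.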
Disabling Scale-Down
behavior:
  scaleDown:
    selectPolicy: Disabled  # Never scale down (scale-up only HPA)
Useful for workloads where scale-down is disruptive (e.g., stateful-ish services with warm caches) or when you want manual control over scale-down.
Deployment spec.replicas and Server-Side Apply
A common misconfiguration: your CI pipeline applies the Deployment manifest on every deploy, overwriting spec.replicas back to the value in your Git repo (e.g., 3), undoing what HPA set (e.g., 15). This causes a sudden drop in replicas on every deployment until the HPA scales the workload back up.
Remove spec.replicas from your Deployment manifest entirely (or set it only on first apply). Once HPA is managing replicas, it owns that field. If using Server-Side Apply (SSA), the HPA manager claims the replicas field; a subsequent kubectl apply from a different field manager will conflict. Use kubectl apply --server-side --force-conflicts only if you intentionally want to reclaim ownership, not on every deploy.
# Check which manager owns spec.replicas
kubectl get deployment api-server -o json | \
jq '.metadata.managedFields[] | select(.fieldsV1."f:spec"."f:replicas" != null) | .manager'
# Correct SSA-based workflow: strip replicas from manifests
# In your Deployment YAML:
spec:
  # replicas: 3  ← DELETE THIS LINE; HPA owns it
  selector:
    matchLabels:
      app: api-server
HPA Status
kubectl get hpa api-server-hpa -n production
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# api-server-hpa Deployment/api-server 72%/70%, 800Mi 3 50 8
kubectl describe hpa api-server-hpa -n production
| Status field | Meaning |
|---|---|
| status.currentReplicas | Replicas currently managed by the target |
| status.desiredReplicas | Replicas computed by the last HPA evaluation |
| status.currentMetrics | Last observed value for each metric source |
| status.lastScaleTime | Timestamp of last replica change |
| status.conditions[].type: AbleToScale | Can the HPA currently scale? (False if backoff active) |
| status.conditions[].type: ScalingActive | Is HPA actively watching metrics? |
| status.conditions[].type: ScalingLimited | Desired exceeds maxReplicas or violates behavior policy |
# Watch HPA in real time
kubectl get hpa -n production -w
# Get conditions (why isn't it scaling?)
kubectl get hpa api-server-hpa -n production -o jsonpath='{.status.conditions}' | jq .
# Events show recent scale decisions
kubectl describe hpa api-server-hpa -n production | grep -A30 Events
Prometheus Adapter
To use application-level metrics (request rate, queue depth, error rate) as HPA targets, you need a metrics adapter that translates Prometheus queries into the custom.metrics.k8s.io API. The Prometheus Adapter is the most common open-source solution.
# Prometheus Adapter ConfigMap: metric rules
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      # HTTP requests per second per pod
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^http_requests_total$"
          as: "http_requests_per_second"
        metricsQuery: |
          sum(rate(http_requests_total{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
      # Queue depth per pod (custom application metric)
      - seriesQuery: 'worker_queue_depth{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^worker_queue_depth$"
          as: "worker_queue_depth"
        metricsQuery: 'avg(worker_queue_depth{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
      # Ingress RPS (Object metric, scoped to Ingress)
      - seriesQuery: 'nginx_ingress_controller_requests{namespace!="",ingress!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            ingress: {group: "networking.k8s.io", resource: "ingress"}
        name:
          as: "nginx_ingress_requests_per_second"
        metricsQuery: |
          sum(rate(nginx_ingress_controller_requests{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
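To see what the metricsQuery templates turn into, the sketch below substitutes the two placeholders by hand in Python. It only imitates the adapter's templating: the function name and pod names are hypothetical, and the exact matcher text the adapter generates (for example, how it joins several pod names) is an assumption.

```python
def expand_metrics_query(template: str, label_matchers: dict, group_by: list) -> str:
    """Fill <<.LabelMatchers>> and <<.GroupBy>> the way an adapter might."""
    parts = []
    for label, values in label_matchers.items():
        if len(values) == 1:
            parts.append(f'{label}="{values[0]}"')
        else:
            # Several values collapse into one regex matcher.
            joined = "|".join(values)
            parts.append(f'{label}=~"{joined}"')
    query = template.replace("<<.LabelMatchers>>", ",".join(parts))
    return query.replace("<<.GroupBy>>", ",".join(group_by))


template = "sum(rate(http_requests_total{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)"
query = expand_metrics_query(
    template,
    {"namespace": ["production"], "pod": ["api-server-abc", "api-server-def"]},
    ["pod"],
)
print(query)
```

Pasting the expanded query into Prometheus is a quick way to check that a rule returns one series per pod before pointing an HPA at it.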
# Verify custom metrics are available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
# Check a specific metric
kubectl get --raw \
"/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" \
| jq .
# List all registered custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'
KEDA — Kubernetes Event-Driven Autoscaling
KEDA extends HPA with a rich library of built-in scalers and adds scale-to-zero capability. It deploys a metrics adapter and manages HPA objects on your behalf via the ScaledObject CRD.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-worker
  # Replica bounds
  minReplicaCount: 0  # Scale to zero when idle
  maxReplicaCount: 100
  # Polling and cooldown periods
  pollingInterval: 15  # Check trigger every 15s
  cooldownPeriod: 300  # Wait 300s before scaling to zero
  # Advanced scaling behavior (passed to managed HPA)
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 60
  triggers:
    # SQS queue depth
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-sqs-auth
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/api-jobs
        queueLength: "10"  # 10 messages per worker pod
        awsRegion: us-east-1
    # Kafka consumer lag
    - type: kafka
      metadata:
        bootstrapServers: kafka-svc.platform:9092
        consumerGroup: api-worker-group
        topic: api-events
        lagThreshold: "50"  # 50 messages lag per pod
    # Prometheus metric
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-svc.monitoring:9090
        metricName: http_requests_per_second
        threshold: "500"
        query: |
          sum(rate(http_requests_total{app="api-worker"}[2m]))
    # Cron-based pre-scaling
    - type: cron
      metadata:
        timezone: America/New_York
        start: "0 8 * * 1-5"   # Scale up at 8am weekdays
        end: "0 20 * * 1-5"    # Scale down at 8pm weekdays
        desiredReplicas: "10"  # Pre-scale to 10 during business hours
KEDA TriggerAuthentication
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-sqs-auth
  namespace: production
spec:
  podIdentity:
    provider: aws  # Use IRSA (IAM Roles for Service Accounts)
  # Or use secretTargetRef:
  # secretTargetRef:
  #   - parameter: awsAccessKeyID
  #     name: aws-credentials
  #     key: access-key-id
  #   - parameter: awsSecretAccessKey
  #     name: aws-credentials
  #     key: secret-access-key
Scale-to-Zero Considerations
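When scaling from 0→1, nothing serves the workload until the new pod is ready. Queue-backed workers only see added delay because messages wait in the queue, but KEDA core does not buffer HTTP requests; holding them requires a proxy in front of the workload, such as the KEDA HTTP add-on. For workloads with slow startup (JVM, ML model loading), this can mean 30–120 seconds of latency for the first request after idle. Mitigations: (1) keep at least one replica during business hours, for example through the cron trigger's desiredReplicas; (2) use fast-starting runtimes; (3) set a readinessProbe so traffic is only sent after the pod is truly ready.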
# Check KEDA ScaledObject status
kubectl get scaledobject api-worker-scaler -n production
kubectl describe scaledobject api-worker-scaler -n production
# View the HPA KEDA created
kubectl get hpa -n production -l "scaledobject.keda.sh/name=api-worker-scaler"
# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=50
HPA + VPA Interaction
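HPA and VPA act on the same workload along different dimensions: HPA changes the replica count, VPA changes container resource requests. Running both on the same workload with the same metric (e.g., CPU) causes conflict: a Utilization target is measured against requests, so every VPA adjustment to requests shifts the utilization the HPA observes and can trigger a replica change even though load has not moved.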
The safest production pattern when you need both: use HPA for custom/external metrics (request rate, queue depth) and VPA for right-sizing CPU/memory requests. Set VPA.spec.updatePolicy.updateMode: "Off" on workloads where HPA is managing replicas based on CPU/memory.
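A small arithmetic sketch makes the CPU conflict concrete. The numbers and function names are illustrative, and the tolerance band is ignored: per-pod usage stays constant while only the request changes, as it would after a VPA update.

```python
import math


def cpu_utilization(usage_millicores: float, request_millicores: float) -> float:
    """Utilization as the HPA sees it: usage as a percentage of requests."""
    return 100 * usage_millicores / request_millicores


def hpa_desired(current_replicas: int, utilization: float, target: float) -> int:
    return math.ceil(current_replicas * utilization / target)


usage = 450  # millicores actually consumed per pod; unchanged throughout

before = cpu_utilization(usage, request_millicores=500)   # 90%
after = cpu_utilization(usage, request_millicores=1000)   # 45% after VPA doubles requests

print(hpa_desired(10, before, target=70))  # HPA wants more replicas
print(hpa_desired(10, after, target=70))   # same load, HPA now wants fewer
```

The load never changed, yet the HPA's answer moved from scaling up to scaling down, which is why the two controllers should not share a CPU or memory signal.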
Common HPA Patterns
Pattern: Request-Rate Scaling
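# Scale based on total RPS via Ingress Object metric
metrics:
  - type: Object
    object:
      metric:
        name: nginx_ingress_requests_per_second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: api-ingress
      target:
        type: AverageValue
        averageValue: "2000"  # 2000 RPS per replica
# AverageValue divides the Ingress total by the pod count before comparing,
# so desiredReplicas = ceil(totalRPS / 2000)
# e.g., 10000 RPS → ceil(10000/2000) = 5 pods
# With type: Value the ratio is multiplied by the current replica count instead
# (3 pods at 10000 RPS → ceil(3 × 10000/2000) = 15), so the HPA keeps scaling up
# for as long as the total stays above the target.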
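The two target types an Object metric accepts lead to different arithmetic, which matters for a total-RPS signal like this one. The sketch below follows the general HPA formula; the function names are illustrative and tolerance is ignored.

```python
import math


def object_value_replicas(current_replicas: int, metric_value: float,
                          target_value: float) -> int:
    """type: Value compares the raw metric to the target, scaled by current replicas."""
    return math.ceil(current_replicas * metric_value / target_value)


def object_average_value_replicas(metric_value: float,
                                  target_average_value: float) -> int:
    """type: AverageValue divides the metric by the pod count first, so
    currentReplicas cancels out of the formula."""
    return math.ceil(metric_value / target_average_value)


total_rps = 10000
print(object_value_replicas(3, total_rps, 2000))       # ratio 5 applied to 3 pods
print(object_average_value_replicas(total_rps, 2000))  # 2000 RPS per pod
```

Because adding pods does not lower an Ingress-wide total, a Value target keeps asking for more replicas until maxReplicas, whereas AverageValue settles at the count where each pod handles the target rate.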
Pattern: Queue-Depth Scaling (without KEDA)
# Queue depth exposed as an External metric (e.g., by a metrics adapter's external rules)
metrics:
  - type: External
    external:
      metric:
        name: redis_list_length
        selector:
          matchLabels:
            list_name: job-queue
      target:
        type: AverageValue
        averageValue: "20"  # 20 items per worker pod
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # Don't scale down while queue drains
    policies:
      - type: Pods
        value: 1
        periodSeconds: 120  # Remove 1 pod every 2 min max
Pattern: Conservative Scale-Down for Stateful-ish Services
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 50  # Scale up by 50% at most per period
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 900  # 15 min window before scaling down
    policies:
      - type: Percent
        value: 5  # Remove at most 5% of pods per period
        periodSeconds: 300  # Once every 5 minutes
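To estimate how long a conservative scale-down like this takes, the policy limits can be simulated. This is a sketch under stated assumptions: the function names are hypothetical, each loop step stands for one full policy period, and the Percent floor is rounded down with integer arithmetic, which may differ from the controller in edge cases.

```python
def scale_down_floor(start_replicas, policies, select_policy="Max"):
    """Lowest replica count the policies allow within one period.

    policies is a list of (type, value) pairs, e.g. [("Percent", 5)].
    """
    if select_policy == "Disabled":
        return start_replicas
    floors = []
    for kind, value in policies:
        if kind == "Pods":
            floors.append(start_replicas - value)
        elif kind == "Percent":
            floors.append(start_replicas * (100 - value) // 100)
        else:
            raise ValueError(f"unknown policy type: {kind}")
    # Max permits the largest change (lowest floor); Min the smallest change.
    return min(floors) if select_policy == "Max" else max(floors)


def periods_to_reach(start, target, policies, select_policy="Max"):
    """Count policy periods needed to shrink from start to target replicas."""
    replicas, periods = start, 0
    while replicas > target:
        allowed = scale_down_floor(replicas, policies, select_policy)
        if allowed >= replicas:
            raise ValueError("policies do not permit any scale-down")
        replicas = max(target, allowed)
        periods += 1
    return periods


print(periods_to_reach(40, 20, [("Percent", 5)]))  # periods of 300s each
print(scale_down_floor(50, [("Pods", 2), ("Percent", 10)], "Min"))
```

Under these assumptions, halving a 40-replica workload takes ten 5-minute periods on top of the 15-minute stabilization window, which is worth weighing against the cost of the idle pods.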
Metrics
| Metric | Labels | Use |
|---|---|---|
| kube_horizontalpodautoscaler_status_current_replicas | horizontalpodautoscaler, namespace | Current replica count managed by HPA |
| kube_horizontalpodautoscaler_status_desired_replicas | horizontalpodautoscaler, namespace | Desired replica count from last evaluation |
| kube_horizontalpodautoscaler_spec_max_replicas | horizontalpodautoscaler, namespace | Configured maxReplicas ceiling |
| kube_horizontalpodautoscaler_spec_min_replicas | horizontalpodautoscaler, namespace | Configured minReplicas floor |
| kube_horizontalpodautoscaler_status_condition | horizontalpodautoscaler, namespace, condition, status | HPA condition health (AbleToScale, ScalingActive, ScalingLimited) |
Alerting Rules
groups:
  - name: hpa
    rules:
      # HPA at max replicas: may need to raise ceiling
      - alert: HPAAtMaxReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            == kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} at maxReplicas for 15m"
          description: "Consider raising maxReplicas or optimizing the workload"
      # HPA unable to scale (metric unavailable)
      - alert: HPAScalingInactive
        expr: |
          kube_horizontalpodautoscaler_status_condition{
            condition="ScalingActive",status="false"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} in {{ $labels.namespace }} is not scaling"
      # Desired replicas oscillating (flapping)
      - alert: HPAFlapping
        expr: |
          changes(kube_horizontalpodautoscaler_status_desired_replicas[30m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} is flapping (>5 changes in 30m)"
      # HPA at minReplicas for extended period (possible over-provisioning)
      - alert: HPAAtMinReplicasLong
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            == kube_horizontalpodautoscaler_spec_min_replicas
        for: 72h
        labels:
          severity: info
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} has been at minReplicas for 3 days; review minReplicas"
Runbooks
HPA Not Scaling Despite High Load
# 1. Check HPA conditions
kubectl describe hpa <name> -n <namespace> | grep -A5 Conditions
# 2. Verify metrics are available
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<ns>/pods" | jq .
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
# 3. Check metrics-server is running
kubectl get deployment metrics-server -n kube-system
# 4. Verify pods have resource requests (required for Utilization target)
kubectl get pods -n <namespace> -o json | \
jq '.items[].spec.containers[].resources.requests'
# 5. Check if at maxReplicas ceiling
kubectl get hpa <name> -n <namespace> -o jsonpath='{.spec.maxReplicas} {.status.currentReplicas}'
HPA Flapping / Oscillating
# Check scale events
kubectl describe hpa <name> -n <namespace> | grep -A30 Events
# Increase stabilization window to reduce flapping
kubectl patch hpa <name> -n <namespace> --type=merge -p '{
"spec": {
"behavior": {
"scaleDown": {"stabilizationWindowSeconds": 600},
"scaleUp": {"stabilizationWindowSeconds": 30}
}
}
}'
Custom Metrics Returning Errors
# Test custom metrics API directly
kubectl get --raw \
"/apis/custom.metrics.k8s.io/v1beta1/namespaces/<ns>/pods/*/<metric-name>"
# Check Prometheus Adapter logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-adapter --tail=100
# Verify Prometheus query returns data
curl -G prometheus-svc.monitoring:9090/api/v1/query \
--data-urlencode 'query=sum(rate(http_requests_total[2m])) by (pod)'
HPA Spec.Replicas Conflict (GitOps Override)
# Check which field manager owns replicas
kubectl get deployment <name> -n <namespace> -o json | \
  jq '.metadata.managedFields[] | {manager, fields: .fieldsV1."f:spec"."f:replicas"} |
  select(.fields != null)'
# Remove replicas from manifest (GitOps fix)
# In Deployment YAML, delete the spec.replicas line
# Then re-apply without spec.replicas
# Force-take ownership of conflicting fields if needed (SSA);
# the values in the manifest overwrite the live ones
kubectl apply --server-side --force-conflicts -f deployment.yaml
KEDA ScaledObject Not Scaling
# Check ScaledObject status
kubectl describe scaledobject <name> -n <namespace>
# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100 | grep ERROR
# Verify trigger authentication
kubectl describe triggerauthentication <auth-name> -n <namespace>
# Check the HPA KEDA manages
kubectl get hpa -n <namespace> -l "scaledobject.keda.sh/name=<name>"
kubectl describe hpa -n <namespace> -l "scaledobject.keda.sh/name=<name>"
Best Practices
- Always set CPU requests: HPA's Utilization target divides current CPU by requests.cpu. Without requests, utilization is undefined and the HPA takes no action for that metric.
- Prefer custom/external metrics over CPU for latency-sensitive services: CPU utilization lags behind request rate spikes. A metric like http_requests_per_second reacts faster to traffic surges.
- Remove spec.replicas from Deployment manifests managed by HPA: this prevents GitOps pipelines from overriding the replica count on every deploy.
- Set a meaningful minReplicas: minReplicas: 1 means a single point of failure during deployments. For HA, use at least 2; for critical paths, 3 (spread across zones).
- Tune scale-down conservatively: the default 300s stabilization window is often too short for services with warm caches or sticky connections. 600–900s is safer for stateful-ish services.
- Use KEDA for scale-to-zero and event-driven workloads: the native HPA minimum is 1 replica unless the alpha HPAScaleToZero feature gate is enabled. KEDA handles the 0↔1 transition and provides richer trigger sources out of the box.
- Test autoscaling in staging under realistic load: run load tests that ramp up and sustain traffic; verify the HPA correctly computes desired replicas, the behavior policies don't block needed scale-up, and scale-down doesn't happen during momentary lulls in the ramp.
- Monitor the ScalingLimited condition continuously: when the HPA is constrained by maxReplicas under real traffic, your ceiling is too low or the workload needs right-sizing. Alert on this and review periodically.