StatefulSets
What This Page Covers
This page covers StatefulSet controller mechanics: spec fields, pod identity, update strategies, and scaling behavior. For storage-focused content — volumeClaimTemplates in depth, persistent volume lifecycle, distributed databases (PostgreSQL, Kafka, Cassandra, MongoDB) on Kubernetes, Longhorn, and Rook-Ceph — see Stateful Storage Patterns.
Controller Mechanics
The StatefulSet controller is fundamentally different from the Deployment controller. Where Deployments treat pods as interchangeable and manage a fleet of ReplicaSets, the StatefulSet controller manages pods directly — there is no intermediate ReplicaSet layer. Each pod is individually tracked by its ordinal index, and the controller enforces strict lifecycle ordering.
Full StatefulSet Spec
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: data
spec:
  # ── Identity ──────────────────────────────────────────────────────
  serviceName: postgres-headless      # REQUIRED: name of the headless Service
                                      # that gives each pod a stable DNS name
  replicas: 3
  # Selector is IMMUTABLE after creation
  selector:
    matchLabels:
      app: postgres
  # ── Pod management ────────────────────────────────────────────────
  podManagementPolicy: OrderedReady   # OrderedReady (default) | Parallel; immutable after creation
  # ── Update strategy ───────────────────────────────────────────────
  updateStrategy:
    type: RollingUpdate               # RollingUpdate (default) | OnDelete
    rollingUpdate:
      partition: 0                    # only update pods with ordinal >= partition
      maxUnavailable: 1               # alpha in 1.24 (MaxUnavailableStatefulSet feature gate), not GA
                                      # default 1; percentage or absolute
  # ── PVC retention ─────────────────────────────────────────────────
  persistentVolumeClaimRetentionPolicy:   # beta in 1.27, GA in 1.32
    whenDeleted: Retain               # Retain | Delete
    whenScaled: Retain                # Retain | Delete
  # ── Revision history ──────────────────────────────────────────────
  revisionHistoryLimit: 10            # number of ControllerRevisions to keep (default 10)
                                      # unlike Deployment, StatefulSet uses ControllerRevision objects
  # ── Min ready ─────────────────────────────────────────────────────
  minReadySeconds: 0                  # same semantics as Deployment
  # ── Pod template ──────────────────────────────────────────────────
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60
      # subdomain is set on each pod by the controller from serviceName
      # (here: postgres-headless); do not set it manually
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - name: postgres
              containerPort: 5432
          env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name   # "postgres-0", "postgres-1", etc.
            - name: ORDINAL
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name   # full pod name; the app parses the ordinal from it
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres", "-h", "localhost"]
            periodSeconds: 5
            failureThreshold: 6
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              memory: "2Gi"
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  # ── Volume claim templates ────────────────────────────────────────
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: ebs-gp3
        resources:
          requests:
            storage: 100Gi
Stable Pod Identity
Each StatefulSet pod has three components of stable identity that persist across pod deletion and rescheduling:
| Identity Component | Format | Example (postgres, ordinal 1) | Persists Across Reschedule? |
|---|---|---|---|
| Pod name | {sts-name}-{ordinal} | postgres-1 | Yes: same ordinal, same name |
| DNS hostname | {pod-name}.{headless-svc}.{ns}.svc.cluster.local | postgres-1.postgres-headless.data.svc.cluster.local | Yes: DNS resolves to new pod IP after reschedule |
| PVC binding | {template-name}-{pod-name} | data-postgres-1 | Yes: same PVC re-attached to rescheduled pod |
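When the node running postgres-1 fails, Kubernetes marks the pod for deletion, but the pod object stays in Terminating while the kubelet is unreachable. Because the StatefulSet controller guarantees at most one pod per identity, it creates the replacement only after the old pod object is gone: the node comes back, the Node object is removed, or an operator force-deletes the pod. The new pod is also named postgres-1 and runs on a healthy node. The same PVC data-postgres-1 is attached to it (after the old VolumeAttachment is force-deleted if the node is truly gone). The DNS hostname postgres-1.postgres-headless.data.svc.cluster.local resolves to the new pod IP. From other pods' perspective, postgres-1 has the same address and data as before.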
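The naming rules in the table above can be sketched as a small shell helper. This is an illustrative sketch, not part of any Kubernetes tooling: the function name sts_identity is hypothetical, and the cluster.local suffix assumes the default cluster domain.

```shell
# Derive the three stable identifiers for one StatefulSet pod.
# Arguments: <sts-name> <ordinal> <headless-svc> <namespace> <claim-template-name>
# Assumes the default cluster domain "cluster.local".
sts_identity() {
  pod="$1-$2"
  printf 'pod=%s\n' "$pod"
  printf 'dns=%s.%s.%s.svc.cluster.local\n' "$pod" "$3" "$4"
  printf 'pvc=%s-%s\n' "$5" "$pod"
}

# Identity of ordinal 1 in the postgres StatefulSet from this page
sts_identity postgres 1 postgres-headless data data
```

Running it for ordinal 1 yields the same three values as the Example column, which is a quick way to predict which PVC and DNS name a given ordinal will use.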
Headless Service
The serviceName field in a StatefulSet spec references a headless service (clusterIP: None). This service must exist before the StatefulSet is created — the controller does not create it automatically. Without the headless service, pod DNS hostnames are not created.
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: data
  labels:
    app: postgres
spec:
  clusterIP: None                  # headless: no VIP, only DNS A records per pod
  publishNotReadyAddresses: true   # include not-yet-Ready pods in DNS
                                   # CRITICAL for init containers that use DNS for peer discovery
  selector:
    app: postgres
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
OrderedReady Pod Management
OrderedReady (the default) enforces strict sequential ordering for all pod lifecycle operations. The controller never creates the next pod until the current one is Running and Ready. It never deletes the next pod until the current one is fully terminated.
OrderedReady Deadlock
If pod-N is stuck not-Ready (readiness probe failing indefinitely), the StatefulSet controller blocks: it will not create pod-N+1 or proceed with any further operations. This is a known deadlock condition.
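# Diagnose: identify the stuck pod
kubectl get pods -n data -l app=postgres
# NAME         READY   STATUS    RESTARTS   AGE
# postgres-0   1/1     Running   0          10m
# postgres-1   0/1     Running   5          3m    ← stuck not-Ready
# (postgres-2 is not listed: the controller has not created it yet)
# Check why postgres-1 is not Ready
kubectl describe pod postgres-1 -n data   # → readiness probe failure events
kubectl logs postgres-1 -n data           # → application error
# Resolution options:
# Option A: Fix the underlying issue (app bug, missing config, dependency down)
#           Controller automatically proceeds once postgres-1 becomes Ready
# Option B: Temporary: delete the stuck pod to let it restart
#           (also needed after reverting a bad template; the controller does not
#           replace a not-Ready pod on its own)
kubectl delete pod postgres-1 -n data
# Option C: If the app can run with Parallel management, recreate the StatefulSet with it.
#           podManagementPolicy is immutable, so a patch is rejected by the API server.
kubectl delete sts postgres -n data --cascade=orphan
kubectl apply -f postgres-statefulset.yaml   # hypothetical file name: manifest edited to podManagementPolicy: Parallel
# WARNING: Parallel on a database that requires leader election is unsafe
# Option D: If a bad rollout caused the failure, raise partition so that no
#           lower-ordinal pods are updated while you fix or revert the template
kubectl patch sts postgres -n data -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
# NOTE: partition only limits which ordinals receive template updates;
#       it does not unblock creation of postgres-2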
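Finding the pod that blocks OrderedReady progression can be scripted. The sketch below is a hypothetical helper (first_blocking_pod is not a kubectl feature) that reads pod listings in the column layout of kubectl get pods --no-headers and prints the lowest-ordinal pod that is not fully Ready. It assumes pod names end in -<ordinal>, as StatefulSet pods do.

```shell
# Print the lowest-ordinal pod that is not fully Ready.
# stdin: lines in `kubectl get pods --no-headers` layout:
#   NAME  READY  STATUS  RESTARTS  AGE
first_blocking_pod() {
  awk '{
    split($2, r, "/")                    # READY column, e.g. "0/1"
    if (r[1] != r[2] || $3 != "Running") {
      n = $1; sub(/.*-/, "", n)          # ordinal = text after the last "-"
      if (name == "" || n + 0 < best + 0) { best = n; name = $1 }
    }
  } END { if (name != "") print name }'
}

# Demo with the listing shown above; against a live cluster one would pipe
#   kubectl get pods -n data -l app=postgres --no-headers
# into the function instead.
printf '%s\n' 'postgres-0 1/1 Running 0 10m' 'postgres-1 0/1 Running 5 3m' | first_blocking_pod
```

The function prints nothing when every listed pod is Ready, so an empty result means the controller is not blocked on readiness.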
Parallel Pod Management
Parallel pod management creates and deletes all pods simultaneously without waiting for each to be Running/Ready first. Pods still get stable names and PVCs — only the lifecycle ordering changes. Use for applications that can handle concurrent initialisation (stateless-ish shards, cache nodes that resync on startup).
spec:
  podManagementPolicy: Parallel
# All 3 pods created at the same time:
# postgres-0, postgres-1, postgres-2 all created simultaneously
# None waits for the others to be Ready before starting
# Faster cluster restart; dangerous for primary/replica databases
For PostgreSQL with Patroni, Galera Cluster, or Cassandra bootstrap, all replicas starting simultaneously can race for primary election or corrupt the cluster state, so use OrderedReady for these workloads. Two constraints apply to any plan to switch later. First, podManagementPolicy is immutable: changing it means deleting the StatefulSet with --cascade=orphan and recreating it. Second, Parallel affects only scaling operations (pod creation and deletion); rolling updates are not affected by the policy.
Update Strategies
RollingUpdate (Default)
StatefulSet RollingUpdate proceeds in reverse ordinal order (highest ordinal first), one pod at a time by default. This is the opposite of scale-up order: where ordinal 0 is the primary by convention, the replicas are updated before it, minimising the risk of primary disruption.
maxUnavailable in RollingUpdate (feature-gated; alpha in 1.24)
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 0
    maxUnavailable: 2   # allow 2 pods to be simultaneously unavailable during update
                        # honored only when the MaxUnavailableStatefulSet feature gate is enabled
                        # speeds up update of large StatefulSets
                        # default: 1 (one pod at a time)
                        # percentage: "33%" → floor(replicas × 0.33)
# WARNING: for quorum-based systems (Kafka, etcd), maxUnavailable must
# not exceed (replicas - quorum_size)
# For 3-node etcd (quorum=2): maxUnavailable must be 1
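The quorum warning above reduces to simple arithmetic. The following sketch assumes a plain majority quorum (floor(replicas / 2) + 1), which matches etcd; systems with configurable quorum rules need their own bound. The function name max_safe_unavailable is hypothetical.

```shell
# Largest maxUnavailable that still leaves a majority quorum running.
# Assumes majority quorum: quorum = floor(replicas / 2) + 1.
max_safe_unavailable() {
  replicas=$1
  quorum=$(( replicas / 2 + 1 ))
  echo $(( replicas - quorum ))
}

max_safe_unavailable 3   # prints 1
max_safe_unavailable 5   # prints 2
```

For 1 or 2 replicas the result is 0, which is not a valid maxUnavailable value: in those cases any pod restart loses quorum, whatever the setting.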
Partition — Canary Upgrades
The partition field divides the StatefulSet at a boundary: pods with ordinal ≥ partition are updated; pods with ordinal < partition are frozen at their current version. This enables controlled canary upgrades where you update one replica, validate it, then lower the partition.
# Scenario: postgres StatefulSet, 5 replicas (0-4), upgrading postgres:15 → postgres:16
# Step 1: Start canary: set the partition first, then change the template
kubectl patch sts postgres -n data \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":4}}}}'
kubectl set image sts/postgres postgres=postgres:16 -n data
# Only postgres-4 gets updated; postgres-0 through postgres-3 stay at old version
# Step 2: Validate postgres-4 is healthy
kubectl exec postgres-4 -n data -- psql -U postgres -c "SELECT version();"
kubectl logs postgres-4 -n data | tail -50
# Step 3: Expand canary to include pod-3 and pod-4
kubectl patch sts postgres -n data \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":3}}}}'
# Step 4: Full rollout (update all including primary postgres-0)
kubectl patch sts postgres -n data \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
# Emergency freeze: set partition to replicas count to prevent ANY further updates
kubectl patch sts postgres -n data \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":5}}}}'
# No pods will be updated; existing state is preserved
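To preview which pods a given partition value exposes to an update, the ordinal rule (update pods with ordinal ≥ partition, highest first) can be sketched in shell. pods_to_update is a hypothetical helper for planning only; it does not talk to the cluster.

```shell
# List the pods a RollingUpdate will touch for a given partition,
# in the order the controller processes them (highest ordinal first).
# Arguments: <sts-name> <replicas> <partition>
pods_to_update() {
  i=$(( $2 - 1 ))
  while [ "$i" -ge "$3" ]; do
    echo "$1-$i"
    i=$(( i - 1 ))
  done
}

pods_to_update postgres 5 3   # prints postgres-4, then postgres-3
```

With partition equal to the replica count (5 here) the function prints nothing, which corresponds to the emergency-freeze case.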
OnDelete Strategy
With OnDelete, the controller only updates a pod when you manually delete it. This gives complete control over update timing — essential for databases where you need to perform a controlled failover before restarting a node.
updateStrategy:
  type: OnDelete
# Update workflow for a 3-node Patroni PostgreSQL cluster (postgres-0 is the current primary):
# 1. Update the image in the StatefulSet spec (no pods restart yet)
kubectl set image sts/postgres postgres=postgres:16 -n data
# 2. Verify Patroni member roles (postgres-1 and postgres-2 are replicas)
kubectl exec postgres-0 -n data -- patronictl list
# 3. Restart the first replica (controller creates updated postgres-2)
kubectl delete pod postgres-2 -n data
# Wait for postgres-2 to rejoin as replica...
kubectl exec postgres-0 -n data -- patronictl list
# 4. Restart the second replica, then wait for it to rejoin as well
kubectl delete pod postgres-1 -n data
kubectl exec postgres-0 -n data -- patronictl list
# 5. Trigger a manual switchover to an updated replica (here postgres-1)
#    "postgres" is the Patroni cluster name; older Patroni releases spell --leader as --master
kubectl exec postgres-0 -n data -- patronictl switchover postgres \
  --leader postgres-0 --candidate postgres-1 --force
# 6. Now postgres-0 is a replica; restart it safely
kubectl delete pod postgres-0 -n data
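With OnDelete it helps to know which pods still run the old revision. StatefulSet pods carry a controller-revision-hash label whose value is the revision name, so comparing it with status.updateRevision identifies them. The filter below is a sketch; pods_needing_delete is a hypothetical name, and the kubectl pipeline in the comment is an assumption about how you would feed it.

```shell
# Print pods whose revision differs from the target revision.
# Argument: <updateRevision>; stdin: lines of "<pod-name> <controller-revision-hash>".
pods_needing_delete() {
  awk -v target="$1" '$2 != target { print $1 }'
}

# Against a live cluster (untested sketch):
#   target=$(kubectl get sts postgres -n data -o jsonpath='{.status.updateRevision}')
#   kubectl get pods -n data -l app=postgres \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.labels.controller-revision-hash}{"\n"}{end}' \
#     | pods_needing_delete "$target"

# Demo with sample data: postgres-2 is already on the new revision
printf '%s\n' 'postgres-0 postgres-7d9f' 'postgres-1 postgres-7d9f' 'postgres-2 postgres-9a2b' \
  | pods_needing_delete postgres-9a2b
```

The output lists pods still to be deleted; the order in which to delete them remains an operational decision, as in the Patroni workflow above.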
Scaling Behavior
Scale-Up Sequencing
# Scale from 3 to 5 replicas
kubectl scale sts postgres -n data --replicas=5
# With OrderedReady:
# postgres-3 created and must become Running+Ready before postgres-4 is created
# New PVCs created automatically: data-postgres-3, data-postgres-4
# With Parallel:
# postgres-3 and postgres-4 created simultaneously
Scale-Down and PVC Orphaning
# Scale from 5 to 3 replicas
kubectl scale sts postgres -n data --replicas=3
# With OrderedReady (scale-down is always sequential, highest-ordinal-first):
# postgres-4 deleted (pod only — PVC data-postgres-4 remains by default)
# Wait for postgres-4 Terminated
# postgres-3 deleted
# Wait for postgres-3 Terminated
# PVCs data-postgres-3 and data-postgres-4 are ORPHANED (not deleted)
# Check for orphaned PVCs after scale-down:
kubectl get pvc -n data -l app=postgres
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
# data-postgres-0 Bound pvc-abc123 100Gi RWO ebs-gp3
# data-postgres-1 Bound pvc-def456 100Gi RWO ebs-gp3
# data-postgres-2 Bound pvc-ghi789 100Gi RWO ebs-gp3
# data-postgres-3 Bound pvc-jkl012 100Gi RWO ebs-gp3 ← ORPHANED
# data-postgres-4 Bound pvc-mno345 100Gi RWO ebs-gp3 ← ORPHANED
# Scale back to 5: orphaned PVCs are automatically re-used (same names re-attached)
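Orphaned PVCs can be picked out mechanically, because a PVC is orphaned exactly when its trailing ordinal is at or above the current replica count. The helper below is a sketch under that assumption; orphaned_pvcs is a hypothetical name and the function only filters names, it never deletes anything.

```shell
# Print PVC names whose ordinal is >= the current replica count.
# Argument: <replicas>; stdin: one PVC name per line, e.g. data-postgres-3.
orphaned_pvcs() {
  awk -v replicas="$1" '{
    n = $1; sub(/.*-/, "", n)            # ordinal = text after the last "-"
    if (n ~ /^[0-9]+$/ && n + 0 >= replicas + 0) print $1
  }'
}

# Against a live cluster (untested sketch):
#   kubectl get pvc -n data -l app=postgres -o name | sed 's|.*/||' | orphaned_pvcs 3

# Demo with the five PVCs listed above and replicas=3
printf '%s\n' data-postgres-0 data-postgres-1 data-postgres-2 data-postgres-3 data-postgres-4 \
  | orphaned_pvcs 3
```

Review the output and back up any data you need before deleting these PVCs, since deletion may remove the underlying volume depending on the reclaim policy.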
volumeClaimTemplates
See Stateful Storage Patterns § volumeClaimTemplates for full coverage including naming convention, immutability, multiple templates, and the --cascade=orphan migration pattern. Key points summarised here:
| Aspect | Behavior |
|---|---|
| PVC naming | {template.metadata.name}-{statefulset-name}-{ordinal} |
| Immutability | volumeClaimTemplates cannot be modified after StatefulSet creation. Delete with --cascade=orphan and recreate to change. |
| PVC on pod delete | Retained — PVC persists; pod is recreated and re-attaches same PVC |
| PVC on StatefulSet delete | Retained by default; controlled by persistentVolumeClaimRetentionPolicy.whenDeleted |
| PVC on scale-down | Retained by default; controlled by persistentVolumeClaimRetentionPolicy.whenScaled |
| Scale-up re-bind | PVCs left behind by an earlier scale-down are re-used if names match |
ControllerRevisions and Rollback
Unlike Deployments (which use ReplicaSets for history), StatefulSets store revision history in ControllerRevision objects: lightweight objects that each hold a serialized snapshot of the pod template (spec.template) for one revision.
# List ControllerRevisions for a StatefulSet
kubectl get controllerrevision -n data -l app=postgres \
-o custom-columns='NAME:.metadata.name,REVISION:.revision,AGE:.metadata.creationTimestamp'
# Roll back StatefulSet (same command as Deployment)
kubectl rollout undo sts/postgres -n data
kubectl rollout undo sts/postgres -n data --to-revision=3
# Check rollout status
kubectl rollout status sts/postgres -n data
# Waiting for partitioned roll out to finish: 0 out of 3 new pods have been updated...
# partitioned roll out complete: 3 new pods have been updated...
# Watch pod updates in real time
kubectl get pods -n data -l app=postgres -w
Status Fields
kubectl get sts postgres -n data -o yaml | yq .status
# replicas: 3 # total pods (Running + Pending + Terminating)
# readyReplicas: 3 # pods with all containers Ready
# currentReplicas: 3 # pods at currentRevision (old version during update)
# updatedReplicas: 2 # pods at updateRevision (new version during update)
# availableReplicas: 3 # pods Ready for >= minReadySeconds (minReadySeconds is GA in 1.25)
# currentRevision: postgres-7d9f # ControllerRevision name of current spec
# updateRevision: postgres-9a2b # ControllerRevision name of target spec (during update)
# observedGeneration: 4 # last generation processed by controller
# compare with metadata.generation to detect lag
# collisionCount: 0 # hash collision counter for revision names
| Status Field | Meaning During Rolling Update |
|---|---|
| replicas | Total pod count (always equals desired unless scaling) |
| currentReplicas | Pods still running the old revision |
| updatedReplicas | Pods running the new revision (target) |
| readyReplicas | Pods that are Running + all probes passing |
| currentRevision == updateRevision | Update complete: all pods on same revision |
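The last row of the table can be turned into a one-line check. The sketch below takes the status values as arguments (for example, fetched with kubectl get sts ... -o jsonpath) and reports whether the rollout is finished. rollout_state is a hypothetical helper, and the revision names are the sample values from the status listing above.

```shell
# Classify rollout progress from StatefulSet status values.
# Arguments: <currentRevision> <updateRevision> <updatedReplicas> <replicas>
rollout_state() {
  if [ "$1" = "$2" ]; then
    echo "complete"
  else
    echo "in-progress: $3/$4 pods on $2"
  fi
}

rollout_state postgres-7d9f postgres-9a2b 2 3   # prints in-progress: 2/3 pods on postgres-9a2b
rollout_state postgres-9a2b postgres-9a2b 3 3   # prints complete
```

While partition is above 0, the two revisions stay different by design, so an in-progress result there reflects an intentionally held rollout, not a fault.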
Forced Spec Changes with --cascade=orphan
Because volumeClaimTemplates and selector are immutable, some changes require deleting and recreating the StatefulSet. The --cascade=orphan flag deletes the StatefulSet object while leaving its pods and PVCs running — a zero-downtime StatefulSet spec update.
# Scenario: need to change volumeClaimTemplates storage size or storageClass
# Step 1: Delete the StatefulSet object, leave pods running
kubectl delete sts postgres -n data --cascade=orphan
# Pods continue running: postgres-0, postgres-1, postgres-2 still alive
# PVCs still exist and bound
# No traffic interruption
# Step 2: Apply new StatefulSet manifest (with updated volumeClaimTemplates)
kubectl apply -f postgres-statefulset-v2.yaml
# Controller re-adopts existing pods (they already match the selector)
# New volumeClaimTemplates apply only to PVCs that do not exist yet (e.g., scale-up);
# a recreated pod re-binds to its existing PVC of the same name
# Step 3: For existing pods to get the new PVC spec, manually migrate data:
# - Scale up to get new pods with PVCs built from the new template
# - Copy data from old PVCs to new PVCs
# - Delete each old PVC together with its pod; the controller then recreates
#   both from the new template (deleting only the pod re-attaches the old PVC)
# Note: existing PVCs are NOT automatically resized by this process
# Use PVC expansion (kubectl edit pvc) for in-place size increases;
# the StorageClass must set allowVolumeExpansion: true
Operational Commands
# Watch pod creation/deletion during scaling or update
kubectl get pods -n data -l app=postgres -w
# Check rollout status (blocks until complete or times out)
kubectl rollout status sts/postgres -n data --timeout=10m
# Trigger a rolling restart without spec change
kubectl rollout restart sts/postgres -n data
# Pause a rolling update by freezing the partition at the current progress
# Updates run from the highest ordinal down, so the pods already updated are
# ordinals (REPLICAS - UPDATED) through (REPLICAS - 1)
UPDATED=$(kubectl get sts postgres -n data -o jsonpath='{.status.updatedReplicas}')
REPLICAS=$(kubectl get sts postgres -n data -o jsonpath='{.spec.replicas}')
UPDATED=${UPDATED:-0}   # the field is omitted from status when it is zero
# Freeze at current progress (pods 0..REPLICAS-UPDATED-1 stay on the old revision)
kubectl patch sts postgres -n data \
  -p "{\"spec\":{\"updateStrategy\":{\"rollingUpdate\":{\"partition\":$((REPLICAS - UPDATED))}}}}"
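# Force delete a stuck Terminating pod (node failure recovery)
# Only do this once the node is confirmed down: force deletion does not wait for
# the kubelet, and a still-running old pod could write to the same volume
kubectl delete pod postgres-1 -n data --grace-period=0 --force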
# Check all PVCs for a StatefulSet
kubectl get pvc -n data -l app=postgres --sort-by='.metadata.name'
# Describe to see events (scale, update, probe failures)
kubectl describe sts postgres -n data
# Get ControllerRevision history
kubectl get controllerrevision -n data -l app=postgres
StatefulSet vs Deployment — When to Use Each
| Use StatefulSet When… | Use Deployment When… |
|---|---|
| Each replica needs its own persistent data (per-pod PVC) | All replicas share a read volume, or use no persistent storage |
| Pods must discover each other by stable DNS name (Cassandra seeds, Kafka broker IDs, etcd peers) | Pod identity is irrelevant; any replica can serve any request |
| Ordered startup is required (replica must not start before primary) | All pods can start concurrently without coordination |
| Ordered shutdown is required (must drain highest-ordinal replica first) | Any pod can be terminated independently |
| Application uses its own hostname for peer registration (pod name embedded in config) | Application is stateless or uses external service discovery |
| Rolling upgrades must proceed in a controlled sequence with validation | Rolling updates can proceed at any speed with maxSurge/maxUnavailable |
Metrics, Alerts, and Runbooks
Key Metrics
| Metric | Source | Alert Condition |
|---|---|---|
| kube_statefulset_status_replicas_ready | kube-state-metrics | < kube_statefulset_replicas for > 5m |
| kube_statefulset_status_replicas_current != kube_statefulset_status_replicas_updated | kube-state-metrics | Rollout in progress; alert if stale > 30m |
| kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation | kube-state-metrics | Controller not processing updates; possible controller issue |
| kube_persistentvolumeclaim_status_phase{phase="Lost"} | kube-state-metrics | PVC Lost, pod stuck Pending; immediate alert |
| kube_pod_status_ready{condition="false"} for StatefulSet pods | kube-state-metrics | Not-Ready pod blocking OrderedReady progression |
Alerting Rules
groups:
  - name: statefulset-health
    rules:
      - alert: StatefulSetNotFullyReady
        expr: |
          kube_statefulset_status_replicas_ready
            < kube_statefulset_replicas
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} not fully ready"
      - alert: StatefulSetRolloutStuck
        expr: |
          kube_statefulset_status_replicas_updated
            != kube_statefulset_replicas
          and
          kube_statefulset_status_replicas_current
            != kube_statefulset_status_replicas_updated
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} rollout has not completed in 30 minutes"
      - alert: StatefulSetPVCLost
        expr: |
          kube_persistentvolumeclaim_status_phase{phase="Lost"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is Lost; StatefulSet pod will be stuck Pending"
      - alert: StatefulSetReplicasMismatch
        expr: |
          kube_statefulset_replicas
            != kube_statefulset_status_replicas_ready
        for: 10m
        labels:
          severity: warning
Runbooks
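Identify stuck pod: kubectl get pods -l app=NAME and look for a pod that has been not-Ready for a long time while higher-ordinal pods have not been created. Check: kubectl describe pod NAME-N for probe failures or image pull errors. Fix the root cause, or delete the stuck pod to trigger a restart. If the root cause cannot be found and the application tolerates concurrent startup, consider recreating the StatefulSet with podManagementPolicy: Parallel; the field is immutable, so this requires kubectl delete sts NAME --cascade=orphan followed by re-applying the manifest.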
Check current partition: kubectl get sts NAME -o jsonpath='{.spec.updateStrategy.rollingUpdate.partition}'. If partition is higher than 0, either lower it intentionally or reset to 0 to complete the rollout. Also check status.updatedReplicas vs spec.replicas to confirm rollout progress.
PVC phase Lost means the bound PV no longer exists. Check: kubectl describe pvc NAME. To recover: create a new PV with the same volumeHandle pointing to the original storage, patch the PV's claimRef to reference the PVC's name and UID. See Persistent Volumes § Lost recovery.
Pod stuck Terminating after node failure: kubectl delete pod NAME --grace-period=0 --force. Also delete the stale VolumeAttachment: kubectl delete volumeattachment NAME. The StatefulSet controller will create a replacement pod. Verify the new pod attaches the PVC cleanly before declaring recovery complete.
Orphaned PVCs after scale-down: list them with kubectl get pvc -n NS -l app=NAME and identify which PVCs have no corresponding running pod. If scaling back up is not planned, back up the data and delete the PVCs manually. If the data is needed first, create a temporary pod that mounts the PVC and copy the data out before deletion.
Best Practices
- Always create the headless service before the StatefulSet: the StatefulSet controller references the service by name in serviceName but does not create it. If the service is missing, pod DNS hostnames are not registered. Apply manifests in order: Service → StatefulSet. Use publishNotReadyAddresses: true for peer discovery during initialization.
- Use OrderedReady for clustered databases, and evaluate Parallel only for workloads that tolerate concurrent startup: OrderedReady prevents bootstrap races (e.g., two Cassandra nodes joining the ring simultaneously). Parallel speeds up scaling and recovery after a full outage, but it does not change rolling-update ordering, and podManagementPolicy cannot be changed on an existing StatefulSet without deleting it (--cascade=orphan) and recreating it.
- Set partition to replicas before risky upgrades: freeze all updates by setting partition: N (equal to replica count). Update the spec, validate in staging, then lower the partition incrementally. This is often more convenient than OnDelete because lowering the partition is a single patch and no pods need to be deleted manually.
- Use a terminationGracePeriodSeconds appropriate to your database: PostgreSQL needs time to finish checkpoints and close WAL (30–60s minimum). Cassandra needs time to flush memtables (60–120s). Kafka needs time to complete log segment flushes and gracefully hand off partition leadership. The default 30s is too short for most databases.
- Keep revisionHistoryLimit at the default (10): each ControllerRevision stores only a snapshot of the pod template, so ten of them consume little etcd space. There is little reason to reduce this, and doing so limits how far back you can roll back.
- Set persistentVolumeClaimRetentionPolicy.whenDeleted: Retain and whenScaled: Retain for production: the default behavior (both Retain) is the safest. Only use Delete for ephemeral test environments. Recovering accidentally deleted database PVCs is painful and may involve data loss.
- Monitor readyReplicas separately from replicas: a StatefulSet can show replicas: 3 while readyReplicas: 2 indefinitely without triggering any error. Set alerts on the difference. A long-running mismatch indicates a stuck pod that is blocking OrderedReady progression.
- Test failover and recovery procedures under realistic conditions: simulate node failure (for example with kubectl drain, which covers graceful eviction only), test that the pod reschedules and re-attaches its PVC, and verify the application recovers correctly. Many database-specific failure modes only surface under real I/O conditions, not in smoke tests.