StatefulSets — Kubernetes Docs

What This Page Covers

StatefulSet controller mechanics vs Deployment controller

Full annotated StatefulSet spec — every significant field

Stable pod identity: ordinal index, name format, DNS hostname

Headless service requirement: clusterIP:None, publishNotReadyAddresses

DNS record structure per pod — stable addresses across rescheduling

OrderedReady pod management: creation and deletion sequencing

Parallel pod management: concurrent creation/deletion, identity preserved

OrderedReady deadlock: causes, detection, resolution

RollingUpdate strategy: reverse-ordinal update order

Partition field: canary upgrades, freezing lower ordinals

OnDelete strategy: manual upgrade workflow

Scale-up and scale-down sequencing with OrderedReady

volumeClaimTemplates: naming, immutability, multiple templates

PVC lifecycle: orphaned PVCs, persistentVolumeClaimRetentionPolicy

StatefulSet vs Deployment: when to use each

Forced rollout: --cascade=orphan pattern

Controller status fields: currentReplicas, updatedReplicas, readyReplicas

StatefulSet conditions and observedGeneration

Operational commands reference

5 metrics + 4 alerts + 5 runbooks + 8 best practices

Companion Page: Stateful Storage Patterns

This page covers StatefulSet controller mechanics: spec fields, pod identity, update strategies, and scaling behavior. For storage-focused content — volumeClaimTemplates in depth, persistent volume lifecycle, distributed databases (PostgreSQL, Kafka, Cassandra, MongoDB) on Kubernetes, Longhorn, and Rook-Ceph — see Stateful Storage Patterns.

Controller Mechanics

The StatefulSet controller is fundamentally different from the Deployment controller. Where Deployments treat pods as interchangeable and manage a fleet of ReplicaSets, the StatefulSet controller manages pods directly — there is no intermediate ReplicaSet layer. Each pod is individually tracked by its ordinal index, and the controller enforces strict lifecycle ordering.

StatefulSet object model (no ReplicaSet intermediary):

StatefulSet: postgres (replicas: 3, serviceName: postgres-headless)
│
├── Pod: postgres-0   ←── stable name; PVC: data-postgres-0
│     ownerRef: StatefulSet/postgres
│
├── Pod: postgres-1   ←── stable name; PVC: data-postgres-1
│     ownerRef: StatefulSet/postgres
│
└── Pod: postgres-2   ←── stable name; PVC: data-postgres-2
      ownerRef: StatefulSet/postgres

Contrast with Deployment:
  Deployment  → ReplicaSet (hash suffix) → Pods (random hash suffix)
  StatefulSet → Pods (ordinal suffix) directly

Pod name format: {statefulset-name}-{ordinal}
PVC name format: {template-name}-{pod-name} = {template-name}-{statefulset-name}-{ordinal}
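The two name formats above can be expressed as a small sketch. The function names `pod_name` and `pvc_name` are illustrative helpers, not part of any Kubernetes client library.

```python
def pod_name(statefulset: str, ordinal: int) -> str:
    """Pod name format: {statefulset-name}-{ordinal}."""
    return f"{statefulset}-{ordinal}"


def pvc_name(template: str, statefulset: str, ordinal: int) -> str:
    """PVC name format: {template-name}-{pod-name}."""
    return f"{template}-{pod_name(statefulset, ordinal)}"


# Names for the 3-replica "postgres" example with a "data" claim template.
names = [(pod_name("postgres", i), pvc_name("data", "postgres", i)) for i in range(3)]
```

Because both names are pure functions of the StatefulSet name, template name, and ordinal, a recreated pod with the same ordinal always maps back to the same claim.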

Full StatefulSet Spec

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: data
spec:
  # ── Identity ──────────────────────────────────────────────────────
  serviceName: postgres-headless   # REQUIRED: name of headless Service
                                   # creates stable DNS per pod

  replicas: 3

  # Selector is IMMUTABLE after creation
  selector:
    matchLabels:
      app: postgres

  # ── Pod management ────────────────────────────────────────────────
  podManagementPolicy: OrderedReady   # OrderedReady (default) | Parallel

  # ── Update strategy ───────────────────────────────────────────────
  updateStrategy:
    type: RollingUpdate              # RollingUpdate (default) | OnDelete
    rollingUpdate:
      partition: 0                   # only update pods with ordinal >= partition
      maxUnavailable: 1              # introduced as alpha in 1.24 (MaxUnavailableStatefulSet feature gate),
                                     # not GA in 1.24; honored only when the gate is enabled
                                     # default 1; percentage or absolute

  # ── PVC retention ─────────────────────────────────────────────────
  persistentVolumeClaimRetentionPolicy:   # beta in 1.27, GA in 1.32
    whenDeleted: Retain              # Retain | Delete
    whenScaled: Retain               # Retain | Delete

  # ── Revision history ──────────────────────────────────────────────
  revisionHistoryLimit: 10           # number of ControllerRevisions to keep (default 10)
                                     # unlike Deployment, StatefulSet uses ControllerRevision objects

  # ── Min ready ─────────────────────────────────────────────────────
  minReadySeconds: 0                 # same semantics as Deployment

  # ── Pod template ──────────────────────────────────────────────────
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60
      # hostname and subdomain are set by the controller from the pod name
      # and serviceName; do not set them in the template
      containers:
      - name: postgres
        image: postgres:16
        ports:
        - name: postgres
          containerPort: 5432
        env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name   # "postgres-0", "postgres-1", etc.
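        - name: ORDINAL
          valueFrom:
            fieldRef:
              fieldPath: metadata.name   # yields the full pod name, not a number; the app must
                                         # parse the trailing ordinal (1.28+ also exposes it as
                                         # the apps.kubernetes.io/pod-index label)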
        readinessProbe:
          exec:
            command: ["pg_isready", "-U", "postgres", "-h", "localhost"]
          periodSeconds: 5
          failureThreshold: 6
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            memory: "2Gi"
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data

  # ── Volume claim templates ────────────────────────────────────────
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: ebs-gp3
      resources:
        requests:
          storage: 100Gi

Stable Pod Identity

Each StatefulSet pod has three components of stable identity that persist across pod deletion and rescheduling:

Identity Component	Format	Example (postgres, ordinal 1)	Persists Across Reschedule?
Pod name	`{sts-name}-{ordinal}`	`postgres-1`	Yes — same ordinal, same name
DNS hostname	`{pod-name}.{headless-svc}.{ns}.svc.cluster.local`	`postgres-1.postgres-headless.data.svc.cluster.local`	Yes — DNS resolves to new pod IP after reschedule
PVC binding	`{template-name}-{pod-name}`	`data-postgres-1`	Yes — same PVC re-attached to rescheduled pod

Identity Survives Node Failure

When the node running postgres-1 fails, the pod is marked for deletion but its API object stays in Terminating or Unknown state until the node comes back, the Node object is deleted, or the pod is force-deleted. Only once the old pod object is gone does the StatefulSet controller create a replacement, also named postgres-1, on a healthy node. The same PVC data-postgres-1 is attached to this new pod (after the old VolumeAttachment is removed, if the node is truly gone). The DNS hostname postgres-1.postgres-headless.data.svc.cluster.local resolves to the new pod IP. From other pods' perspective, postgres-1 has the same address and data as before.

Headless Service

The serviceName field in a StatefulSet spec references a headless service (clusterIP: None). The controller does not create this service, and the API server does not check that it exists, so create it before the StatefulSet. Until the headless service exists, the per-pod DNS hostnames do not resolve.

apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: data
  labels:
    app: postgres
spec:
  clusterIP: None              # headless: no VIP, only DNS A records per pod
  publishNotReadyAddresses: true  # include not-yet-Ready pods in DNS
                                  # CRITICAL for init containers that use DNS for peer discovery
  selector:
    app: postgres
  ports:
  - name: postgres
    port: 5432
    targetPort: 5432

DNS records created by headless service "postgres-headless" in namespace "data":

Service DNS (round-robin A records, all pod IPs):
  postgres-headless.data.svc.cluster.local → [10.0.1.10, 10.0.1.11, 10.0.1.12]

Per-pod DNS A records (stable, resolves to pod's current IP):
  postgres-0.postgres-headless.data.svc.cluster.local → 10.0.1.10
  postgres-1.postgres-headless.data.svc.cluster.local → 10.0.1.11
  postgres-2.postgres-headless.data.svc.cluster.local → 10.0.1.12

When postgres-1 is rescheduled to a new node (new IP: 10.0.2.15):
  postgres-1.postgres-headless.data.svc.cluster.local → 10.0.2.15 (updated)
  postgres-0 and postgres-2 still resolve to their original IPs

With publishNotReadyAddresses: true:
  → postgres-2 included in DNS even during init (before readiness probe passes)
  → Allows postgres-0 to discover postgres-2 as a peer during cluster bootstrap

Short hostname (within same namespace):
  postgres-0.postgres-headless → resolves within data namespace
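As a sketch of how an application might build its peer list from these records, the helper below composes the per-pod hostnames. `pod_fqdn` and `peer_fqdns` are illustrative names, and the `cluster.local` default is an assumption that holds only for clusters using the default cluster domain.

```python
from typing import List


def pod_fqdn(pod: str, service: str, namespace: str,
             cluster_domain: str = "cluster.local") -> str:
    """Per-pod record: {pod-name}.{headless-svc}.{ns}.svc.{cluster-domain}."""
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"


def peer_fqdns(statefulset: str, replicas: int, service: str,
               namespace: str) -> List[str]:
    """Stable hostnames for every ordinal 0..replicas-1."""
    return [
        pod_fqdn(f"{statefulset}-{ordinal}", service, namespace)
        for ordinal in range(replicas)
    ]


peers = peer_fqdns("postgres", 3, "postgres-headless", "data")
```

The list depends only on names and the replica count, never on pod IPs, which is why it stays valid after a pod is rescheduled.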

OrderedReady Pod Management

OrderedReady (the default) enforces strict sequential ordering for all pod lifecycle operations. The controller never creates the next pod until the current one is Running and Ready. It never deletes the next pod until the current one is fully terminated.

Scale-up (0 → 3 replicas) with OrderedReady:
  Step 1: Create postgres-0
          Wait for postgres-0 to be Running + Ready
  Step 2: Create postgres-1
          Wait for postgres-1 to be Running + Ready
  Step 3: Create postgres-2
          Wait for postgres-2 to be Running + Ready
  Done: all 3 pods running

Scale-down (3 → 1 replica) with OrderedReady:
  Step 1: Delete postgres-2
          Wait for postgres-2 to be fully Terminated
  Step 2: Delete postgres-1
          Wait for postgres-1 to be fully Terminated
  Done: only postgres-0 remains

Key invariant: at no point during scale-down is postgres-0 deleted before postgres-1
(protects primary in primary/replica setups)
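The sequencing above reduces to ascending ordinals on scale-up and descending ordinals on scale-down. The sketch below models only that ordering; `scale_plan` is a hypothetical helper, and the waits between steps (Running + Ready, fully Terminated) are not simulated.

```python
from typing import List, Tuple


def scale_plan(current: int, desired: int) -> List[Tuple[str, int]]:
    """Ordered (action, ordinal) steps for an OrderedReady scale operation."""
    if current < 0 or desired < 0:
        raise ValueError("replica counts must be non-negative")
    if desired > current:
        # Scale-up: lowest missing ordinal first.
        return [("create", ordinal) for ordinal in range(current, desired)]
    # Scale-down: highest ordinal first; empty when counts are equal.
    return [("delete", ordinal) for ordinal in range(current - 1, desired - 1, -1)]


up = scale_plan(0, 3)
down = scale_plan(3, 1)
```

In the scale-down plan ordinal 0 can only ever appear last, which is the invariant the diagram calls out.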

OrderedReady Deadlock

If pod-N is stuck not-Ready (readiness probe failing indefinitely), the StatefulSet controller blocks: it will not create pod-N+1 or proceed with any further operations. This is a known deadlock condition.

# Diagnose: identify the stuck pod
kubectl get pods -n data -l app=postgres
# NAME         READY   STATUS    RESTARTS   AGE
# postgres-0   1/1     Running   0          10m
# postgres-1   0/1     Running   5          3m   ← stuck not-Ready
# (postgres-2 is not listed: the controller has not created it yet)

# Check why postgres-1 is not Ready
kubectl describe pod postgres-1 -n data   # → readiness probe failure events
kubectl logs postgres-1 -n data           # → application error

# Resolution options:

# Option A: Fix the underlying issue (app bug, missing config, dependency down)
# Controller automatically proceeds once pod-1 becomes Ready

# Option B: Temporary — delete the stuck pod to let it restart
kubectl delete pod postgres-1 -n data

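# Option C: Parallel pod management removes the ordering gate, but
# podManagementPolicy is immutable on an existing StatefulSet, so a patch is rejected.
# Changing it means deleting the object with --cascade=orphan and recreating it
# (see "Forced Spec Changes with --cascade=orphan").
# WARNING: Parallel on a database that requires leader election is unsafe

# Option D: If the stuck pod came from a bad template rollout, revert the template,
# then delete the pod. A partition only limits which ordinals a rolling update
# touches; it does not unblock ordered creation.
kubectl rollout undo sts/postgres -n data
kubectl delete pod postgres-1 -n data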
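The diagnosis step can be sketched as a scan for the lowest ordinal that is either missing or not Ready, since that is the pod the controller is waiting on. `waiting_on` is a hypothetical helper; the input map of ordinal to readiness is assumed to be collected separately, for example from `kubectl get pods`.

```python
from typing import Dict, Optional, Tuple


def waiting_on(ready: Dict[int, bool],
               desired_replicas: int) -> Optional[Tuple[int, str]]:
    """Return the lowest ordinal holding up OrderedReady progress, if any.

    ready maps each existing pod's ordinal to its Ready condition.
    """
    for ordinal in range(desired_replicas):
        if ordinal not in ready:
            # Next pod to be created; nothing below it is blocking.
            return (ordinal, "not-created")
        if not ready[ordinal]:
            # Existing pod that is not Ready: higher ordinals wait on it.
            return (ordinal, "not-ready")
    return None


# The situation from the kubectl output above: postgres-1 is stuck.
blocker = waiting_on({0: True, 1: False}, desired_replicas=3)
```

A "not-ready" result that persists across several checks is the signal worth investigating; a "not-created" result usually just means the controller is about to create that pod.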

Parallel Pod Management

Parallel pod management creates and deletes all pods simultaneously without waiting for each to be Running/Ready first. Pods still get stable names and PVCs; only the lifecycle ordering changes. The policy affects scaling operations only: rolling updates still proceed in reverse ordinal order. Use it for applications that can handle concurrent initialisation (stateless-ish shards, cache nodes that resync on startup).

spec:
  podManagementPolicy: Parallel
  # All 3 pods created at the same time:
  # postgres-0, postgres-1, postgres-2 all created simultaneously
  # None waits for the others to be Ready before starting
  # Faster cluster restart; dangerous for primary/replica databases

Parallel Is Unsafe for Leader-Elected Databases

For PostgreSQL with Patroni, Galera Cluster, or Cassandra bootstrap, all replicas starting simultaneously can race for primary election or corrupt the cluster state. Because podManagementPolicy is immutable and affects only scaling (not rolling updates), choose it at creation time: prefer OrderedReady for these systems, and pick Parallel only when the application tolerates every replica starting at once.

Update Strategies

RollingUpdate (Default)

StatefulSet RollingUpdate proceeds in reverse ordinal order (highest ordinal first), one pod at a time. This is the opposite of scale-up order — it updates replicas before the primary (ordinal 0), minimising risk of primary disruption.

RollingUpdate with replicas=3 (updating postgres:15 → postgres:16):
  Step 1: Update postgres-2 (highest ordinal)
          Delete postgres-2 → wait Terminated
          Create new postgres-2 with new image → wait Running+Ready (+ minReadySeconds)
  Step 2: Update postgres-1
          Delete postgres-1 → wait Terminated
          Create new postgres-1 → wait Running+Ready
  Step 3: Update postgres-0 (primary / lowest ordinal, updated last)
          Delete postgres-0 → wait Terminated
          Create new postgres-0 → wait Running+Ready

Total pods running during update: always ≥ 2 (with default maxUnavailable=1)

Reverse order rationale:
  postgres-0 is typically the primary/leader
  Updating replicas first allows failover testing before touching the primary

maxUnavailable in RollingUpdate (alpha since 1.24, feature-gated)

updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 0
    maxUnavailable: 2    # allow 2 pods to be simultaneously unavailable during update
                         # speeds up update of large StatefulSets
                         # default: 1 (one pod at a time)
                         # percentage: "33%" → floor(replicas × 0.33)
    # WARNING: for quorum-based systems (Kafka, etcd), maxUnavailable must
    # not exceed (replicas - quorum_size)
    # For 3-node etcd (quorum=2): maxUnavailable must be 1
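The quorum warning above is simple arithmetic. The sketch below assumes a majority quorum (more than half of the members), which is how etcd behaves; other systems may define quorum differently, and `max_safe_unavailable` is an illustrative name.

```python
def quorum(replicas: int) -> int:
    """Majority quorum: the smallest count that is more than half."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def max_safe_unavailable(replicas: int) -> int:
    """Largest number of members that can be down while quorum survives."""
    return replicas - quorum(replicas)


# 3-node ensemble: quorum is 2, so at most 1 member may be unavailable.
three_node_limit = max_safe_unavailable(3)
```

For a single replica the result is 0, meaning any update interrupts service. The field itself cannot be set to 0, so treat a result of 0 as a sign that the update needs a maintenance window.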

Partition — Canary Upgrades

The partition field divides the StatefulSet at a boundary: pods with ordinal ≥ partition are updated; pods with ordinal < partition are frozen at their current version. This enables controlled canary upgrades where you update one replica, validate it, then lower the partition.

# Scenario: postgres StatefulSet, 5 replicas (0-4), upgrading postgres:15 → postgres:16

# Step 1: Start the canary by setting the partition first, then changing the image
kubectl patch sts postgres -n data \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":4}}}}'
kubectl set image sts/postgres postgres=postgres:16 -n data
# Only postgres-4 gets updated; postgres-0 through postgres-3 stay at old version

# Step 2: Validate postgres-4 is healthy
kubectl exec postgres-4 -n data -- psql -U postgres -c "SELECT version();"
kubectl logs postgres-4 -n data | tail -50

# Step 3: Expand canary to include pod-3 and pod-4
kubectl patch sts postgres -n data \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":3}}}}'

# Step 4: Full rollout (update all including primary postgres-0)
kubectl patch sts postgres -n data \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'

# Emergency freeze: set partition to replicas count to prevent ANY further updates
kubectl patch sts postgres -n data \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":5}}}}'
# No pods will be updated; existing state is preserved
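The effect of each partition value in the walkthrough can be checked with a short sketch that lists which ordinals a rolling update would touch, in the order it touches them. `rolling_update_order` is a hypothetical helper that models only the ordinal selection, not readiness waits or maxUnavailable.

```python
from typing import List


def rolling_update_order(replicas: int, partition: int = 0) -> List[int]:
    """Ordinals updated by RollingUpdate: highest first, stopping at partition."""
    if replicas < 0 or partition < 0:
        raise ValueError("replicas and partition must be non-negative")
    # Pods with ordinal >= partition are updated; lower ordinals stay frozen.
    return list(range(replicas - 1, partition - 1, -1))


canary = rolling_update_order(5, partition=4)
expanded = rolling_update_order(5, partition=3)
full = rolling_update_order(5, partition=0)
frozen = rolling_update_order(5, partition=5)
```

A partition equal to or greater than the replica count yields an empty list, which matches the emergency-freeze step.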

OnDelete Strategy

With OnDelete, the controller only updates a pod when you manually delete it. This gives complete control over update timing — essential for databases where you need to perform a controlled failover before restarting a node.

updateStrategy:
  type: OnDelete

# Update workflow for a 3-node Patroni PostgreSQL cluster:

# 1. Update the image in the StatefulSet spec (no pods restart yet)
kubectl set image sts/postgres postgres=postgres:16 -n data

# 2. Verify patroni replica status (postgres-2 and postgres-1 are replicas)
kubectl exec postgres-0 -n data -- patronictl list

# 3. Restart replica first (controller creates updated postgres-2)
kubectl delete pod postgres-2 -n data
# Wait for postgres-2 to rejoin as replica...
kubectl exec postgres-0 -n data -- patronictl list

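# 4. Restart second replica, then wait for it to rejoin before moving the primary
kubectl delete pod postgres-1 -n data
kubectl exec postgres-0 -n data -- patronictl list

# 5. Controlled switchover (promotes postgres-1 or postgres-2 to primary)
# Flag names vary by Patroni version; check `patronictl switchover --help`
kubectl exec postgres-0 -n data -- patronictl switchover postgres --master postgres-0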

# 6. Now postgres-0 is a replica; restart it safely
kubectl delete pod postgres-0 -n data

Scaling Behavior

Scale-Up Sequencing

# Scale from 3 to 5 replicas
kubectl scale sts postgres -n data --replicas=5

# With OrderedReady:
# postgres-3 created and must become Running+Ready before postgres-4 is created
# New PVCs created automatically: data-postgres-3, data-postgres-4

# With Parallel:
# postgres-3 and postgres-4 created simultaneously

Scale-Down and PVC Orphaning

# Scale from 5 to 3 replicas
kubectl scale sts postgres -n data --replicas=3

# With OrderedReady (scale-down is always sequential, highest-ordinal-first):
# postgres-4 deleted (pod only — PVC data-postgres-4 remains by default)
# Wait for postgres-4 Terminated
# postgres-3 deleted
# Wait for postgres-3 Terminated
# PVCs data-postgres-3 and data-postgres-4 are ORPHANED (not deleted)

# Check for orphaned PVCs after scale-down:
kubectl get pvc -n data -l app=postgres
# NAME             STATUS   VOLUME          CAPACITY   ACCESS MODES   STORAGECLASS
# data-postgres-0  Bound    pvc-abc123      100Gi      RWO            ebs-gp3
# data-postgres-1  Bound    pvc-def456      100Gi      RWO            ebs-gp3
# data-postgres-2  Bound    pvc-ghi789      100Gi      RWO            ebs-gp3
# data-postgres-3  Bound    pvc-jkl012      100Gi      RWO            ebs-gp3   ← ORPHANED
# data-postgres-4  Bound    pvc-mno345      100Gi      RWO            ebs-gp3   ← ORPHANED

# Scale back to 5: orphaned PVCs are automatically re-used (same names re-attached)
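Finding orphaned claims after a scale-down comes down to comparing each PVC's ordinal with the current replica count. The sketch below works on a plain list of names; `orphaned_pvcs` is a hypothetical helper, and fetching the names (for example with `kubectl get pvc -o name`) is left out.

```python
import re
from typing import Iterable, List


def orphaned_pvcs(pvc_names: Iterable[str], template: str,
                  statefulset: str, replicas: int) -> List[str]:
    """PVCs named {template}-{statefulset}-{ordinal} with ordinal >= replicas."""
    pattern = re.compile(
        rf"{re.escape(template)}-{re.escape(statefulset)}-([0-9]+)"
    )
    found = []
    for name in pvc_names:
        match = pattern.fullmatch(name)
        # Names that do not follow the template convention are ignored.
        if match and int(match.group(1)) >= replicas:
            found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]


claims = [f"data-postgres-{i}" for i in range(5)] + ["backup-scratch"]
orphans = orphaned_pvcs(claims, "data", "postgres", replicas=3)
```

Treat the result as a review list rather than a delete list: the same claims are re-attached if the StatefulSet is scaled back up.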

volumeClaimTemplates

See Stateful Storage Patterns § volumeClaimTemplates for full coverage including naming convention, immutability, multiple templates, and the --cascade=orphan migration pattern. Key points summarised here:

Aspect	Behavior
PVC naming	`{template.metadata.name}-{statefulset-name}-{ordinal}`
Immutability	`volumeClaimTemplates` cannot be modified after StatefulSet creation. Delete with `--cascade=orphan` and recreate to change.
PVC on pod delete	Retained — PVC persists; pod is recreated and re-attaches same PVC
PVC on StatefulSet delete	Retained by default; controlled by `persistentVolumeClaimRetentionPolicy.whenDeleted`
PVC on scale-down	Retained by default; controlled by `persistentVolumeClaimRetentionPolicy.whenScaled`
Scale-up re-bind	Existing PVCs left over from a previous scale-down are re-used if names match

ControllerRevisions and Rollback

Unlike Deployments (which use ReplicaSets for history), StatefulSets store revision history in ControllerRevision objects. Each one is a lightweight object holding a snapshot of the pod template for that revision, serialized as a patch against the StatefulSet.

# List ControllerRevisions for a StatefulSet
kubectl get controllerrevision -n data -l app=postgres \
  -o custom-columns='NAME:.metadata.name,REVISION:.revision,AGE:.metadata.creationTimestamp'

# Roll back StatefulSet (same command as Deployment)
kubectl rollout undo sts/postgres -n data
kubectl rollout undo sts/postgres -n data --to-revision=3

# Check rollout status
kubectl rollout status sts/postgres -n data
# Waiting for partitioned roll out to finish: 0 out of 3 new pods have been updated...
# partitioned roll out complete: 3 new pods have been updated...

# Watch pod updates in real time
kubectl get pods -n data -l app=postgres -w

Status Fields

kubectl get sts postgres -n data -o yaml | yq .status
# replicas: 3                     # total pods (Running + Pending + Terminating)
# readyReplicas: 3                 # pods with all containers Ready
# currentReplicas: 1               # pods at currentRevision (old version during update)
# updatedReplicas: 2               # pods at updateRevision (new version during update)
# availableReplicas: 3             # pods Ready for >= minReadySeconds (minReadySeconds is GA since 1.25)
# currentRevision: postgres-7d9f   # ControllerRevision name of current spec
# updateRevision: postgres-9a2b    # ControllerRevision name of target spec (during update)
# observedGeneration: 4            # last generation processed by controller
#                                    compare with metadata.generation to detect lag
# collisionCount: 0                # hash collision counter for revision names

Status Field	Meaning During Rolling Update
`replicas`	Total pod count (always equals desired unless scaling)
`currentReplicas`	Pods still running the old revision
`updatedReplicas`	Pods running the new revision (target)
`readyReplicas`	Pods that are Running + all probes passing
`currentRevision == updateRevision`	Update complete — all pods on same revision
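These fields can be combined into a single completion check. The sketch below operates on a plain dictionary shaped like `.status`; `rollout_complete` is a hypothetical helper and assumes partition is 0, since a non-zero partition intentionally leaves some pods on the old revision.

```python
from typing import Any, Dict


def rollout_complete(generation: int, spec_replicas: int,
                     status: Dict[str, Any]) -> bool:
    """True when the controller has seen the latest spec and all pods match it."""
    return (
        # Controller has processed the most recent spec change.
        status.get("observedGeneration", 0) >= generation
        # Both revisions converge once every pod runs the target template.
        and status.get("currentRevision") == status.get("updateRevision")
        and status.get("updatedReplicas", 0) == spec_replicas
        and status.get("readyReplicas", 0) == spec_replicas
    )


mid_update = {
    "observedGeneration": 4, "readyReplicas": 3, "updatedReplicas": 2,
    "currentRevision": "postgres-7d9f", "updateRevision": "postgres-9a2b",
}
done = dict(mid_update, updatedReplicas=3, currentRevision="postgres-9a2b")
```

Checking observedGeneration first matters: right after a spec change the remaining fields still describe the previous rollout and can look complete.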

Forced Spec Changes with --cascade=orphan

Because volumeClaimTemplates, selector, serviceName, and podManagementPolicy are immutable, some changes require deleting and recreating the StatefulSet. The --cascade=orphan flag deletes the StatefulSet object while leaving its pods and PVCs running, which makes a zero-downtime StatefulSet spec update possible.

# Scenario: need to change volumeClaimTemplates storage size or storageClass

# Step 1: Delete the StatefulSet object, leave pods running
kubectl delete sts postgres -n data --cascade=orphan
# Pods continue running: postgres-0, postgres-1, postgres-2 still alive
# PVCs still exist and bound
# No traffic interruption

# Step 2: Apply new StatefulSet manifest (with updated volumeClaimTemplates)
kubectl apply -f postgres-statefulset-v2.yaml
# Controller re-adopts existing pods (they already match the selector)
# New volumeClaimTemplates apply only to NEWLY CREATED pods (e.g., scale-up)

# Step 3: For existing pods to get new PVC spec, manually migrate data:
# - Scale up to get new pods with new PVC size
# - Copy data from old PVCs to new PVCs
# - Delete old pods (controller creates replacements with new PVCs)

# Note: existing PVCs are NOT automatically resized by this process
# For in-place size increases, raise spec.resources.requests.storage on each PVC
# (kubectl edit pvc); the StorageClass must set allowVolumeExpansion: true

Operational Commands

# Watch pod creation/deletion during scaling or update
kubectl get pods -n data -l app=postgres -w

# Check rollout status (blocks until complete or times out)
kubectl rollout status sts/postgres -n data --timeout=10m

# Trigger a rolling restart without spec change
kubectl rollout restart sts/postgres -n data

# Pause a rolling update (set partition to replicas minus updatedReplicas)
# Updates run from the highest ordinal down, so the already-updated pods are
# ordinals (REPLICAS - UPDATED) .. (REPLICAS - 1)
UPDATED=$(kubectl get sts postgres -n data -o jsonpath='{.status.updatedReplicas}')
REPLICAS=$(kubectl get sts postgres -n data -o jsonpath='{.spec.replicas}')
# Freeze at current progress (pods 0 .. REPLICAS-UPDATED-1 stay on the old revision)
kubectl patch sts postgres -n data \
  -p "{\"spec\":{\"updateStrategy\":{\"rollingUpdate\":{\"partition\":$((REPLICAS - UPDATED))}}}}"
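The partition arithmetic in the pause command can be sanity-checked with a short sketch. `freeze_partition` is a hypothetical helper mirroring the shell expression `REPLICAS - UPDATED`.

```python
def freeze_partition(replicas: int, updated: int) -> int:
    """Partition that pins a rolling update at its current progress.

    Updated pods occupy the top `updated` ordinals, so the lowest updated
    ordinal is replicas - updated; everything below it stays frozen.
    """
    if not 0 <= updated <= replicas:
        raise ValueError("updated must be between 0 and replicas")
    return replicas - updated


# 5 replicas with postgres-4 and postgres-3 already updated.
partition = freeze_partition(replicas=5, updated=2)
still_old = list(range(partition))
```

With nothing updated yet the result equals the replica count (a full freeze), and with everything updated it is 0, which leaves the rollout unrestricted.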

# Force delete a stuck Terminating pod (node failure recovery)
kubectl delete pod postgres-1 -n data --grace-period=0 --force

# Check all PVCs for a StatefulSet
kubectl get pvc -n data -l app=postgres --sort-by='.metadata.name'

# Describe to see events (scale, update, probe failures)
kubectl describe sts postgres -n data

# Get ControllerRevision history
kubectl get controllerrevision -n data -l app=postgres

StatefulSet vs Deployment — When to Use Each

Use StatefulSet When…	Use Deployment When…
Each replica needs its own persistent data (per-pod PVC)	All replicas share a read volume, or use no persistent storage
Pods must discover each other by stable DNS name (Cassandra seeds, Kafka broker IDs, etcd peers)	Pod identity is irrelevant; any replica can serve any request
Ordered startup is required (replica must not start before primary)	All pods can start concurrently without coordination
Ordered shutdown is required (must drain highest-ordinal replica first)	Any pod can be terminated independently
Application uses its own hostname for peer registration (pod name embedded in config)	Application is stateless or uses external service discovery
Rolling upgrades must proceed in a controlled sequence with validation	Rolling updates can proceed at any speed with maxSurge/maxUnavailable

Metrics, Alerts, and Runbooks

Key Metrics

Metric	Source	Alert Condition
`kube_statefulset_status_replicas_ready`	kube-state-metrics	< `kube_statefulset_replicas` for > 5m
`kube_statefulset_status_replicas_current != kube_statefulset_status_replicas_updated`	kube-state-metrics	Rollout in progress; alert if stale > 30m
`kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation`	kube-state-metrics	Controller not processing updates; possible controller issue
`kube_persistentvolumeclaim_status_phase{phase="Lost"}`	kube-state-metrics	PVC Lost — pod stuck Pending; immediate alert
`kube_pod_status_ready{condition="false"}` for StatefulSet pods	kube-state-metrics	Not-Ready pod blocking OrderedReady progression

Alerting Rules

groups:
- name: statefulset-health
  rules:
  - alert: StatefulSetNotFullyReady
    expr: |
      kube_statefulset_status_replicas_ready
        < kube_statefulset_replicas
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} not fully ready"

  - alert: StatefulSetRolloutStuck
    expr: |
      kube_statefulset_status_replicas_updated
        != kube_statefulset_replicas
      and
      kube_statefulset_status_replicas_current
        != kube_statefulset_status_replicas_updated
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} rollout has not completed in 30 minutes"

  - alert: StatefulSetPVCLost
    expr: |
      kube_persistentvolumeclaim_status_phase{phase="Lost"} == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is Lost — StatefulSet pod will be stuck Pending"

  - alert: StatefulSetReplicasMismatch
    expr: |
      kube_statefulset_replicas
        != kube_statefulset_status_replicas_ready
    for: 10m
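    labels:
      severity: warning
    annotations:
      summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} ready replica count has differed from desired for 10 minutes"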

Runbooks

Pod Stuck — OrderedReady Deadlock

Identify stuck pod: kubectl get pods -l app=NAME and look for a pod that has been not-Ready for a long time while higher ordinals have not been created. Check: kubectl describe pod NAME-N for probe failures or image pull errors. Fix the root cause, or delete the stuck pod to trigger a restart. If a bad template rollout caused it, revert the template and then delete the pod. Switching to Parallel is not a quick fix: podManagementPolicy is immutable, so it requires recreating the StatefulSet with --cascade=orphan.

Rollout Stuck at Partition

Check current partition: kubectl get sts NAME -o jsonpath='{.spec.updateStrategy.rollingUpdate.partition}'. If partition is higher than 0, either lower it intentionally or reset to 0 to complete the rollout. Also check status.updatedReplicas vs spec.replicas to confirm rollout progress.

PVC Lost — Pod Stuck Pending

PVC phase Lost means the bound PV no longer exists. Check: kubectl describe pvc NAME. To recover: create a new PV with the same volumeHandle pointing to the original storage, patch the PV's claimRef to reference the PVC's name and UID. See Persistent Volumes § Lost recovery.

Force Delete Stuck Terminating Pod

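Pod stuck Terminating after node failure: first confirm the node is really down, because force deletion while the old container is still running can leave two pods with the same identity writing to the same volume. Then run kubectl delete pod NAME --grace-period=0 --force. Also delete the stale VolumeAttachment: kubectl delete volumeattachment NAME. The StatefulSet controller will create a replacement pod. Verify the new pod attaches the PVC cleanly before declaring recovery complete.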

Orphaned PVCs After Scale-Down

List: kubectl get pvc -n NS -l app=NAME. Identify which PVCs have no corresponding running pod. If scaling back up is not planned, back up data and delete manually. If data is needed: kubectl exec into a pod with the PVC mounted, or create a temporary pod to access the PVC before deletion.

Best Practices

Always create the headless service before the StatefulSet — the StatefulSet controller references the service by name in serviceName but does not create it. If the service is missing, pod DNS hostnames are not registered. Apply manifests in order: Service → StatefulSet. Use publishNotReadyAddresses: true for peer-discovery during initialization.
Choose podManagementPolicy deliberately at creation time: OrderedReady prevents bootstrap races (e.g., two Cassandra nodes joining the ring simultaneously). Parallel speeds up scale-up and full-cluster restarts for applications that tolerate concurrent startup, but it does not change rolling-update order, and the field cannot be changed on an existing StatefulSet without recreating it.
Set partition to replicas before risky upgrades — freeze all updates by setting partition: N (equal to replica count). Update the spec, validate in staging, then lower the partition incrementally. This is safer than OnDelete because the partition can be lowered remotely without needing to delete pods manually.
Use terminationGracePeriodSeconds appropriate to your database — PostgreSQL needs time to finish checkpoints and close WAL (30–60s minimum). Cassandra needs time to flush memtables (60–120s). Kafka needs time to complete log segment flushes and gracefully hand off partition leadership. The default 30s is too short for most databases.
Keep revisionHistoryLimit at the default (10): each ControllerRevision stores only a snapshot of the pod template, so the history is cheap to keep. There is little reason to reduce this, and doing so limits how far back you can roll back.
Set persistentVolumeClaimRetentionPolicy.whenDeleted: Retain and whenScaled: Retain for production — the default behavior (both Retain) is the safest. Only use Delete for ephemeral test environments. Recovering accidentally deleted database PVCs is painful and may involve data loss.
Monitor readyReplicas separately from replicas — a StatefulSet can show replicas: 3 while readyReplicas: 2 indefinitely without triggering any error. Set alerts on the difference. A long-running mismatch indicates a stuck pod that is blocking OrderedReady progression.
Test failover and recovery procedures under realistic conditions: kubectl drain exercises planned eviction, while unplanned node failure needs the node actually stopped or isolated. In both cases, confirm that the pod reschedules and re-attaches its PVC, and verify the application recovers correctly. Many database-specific failure modes only surface under real I/O conditions, not in smoke tests.
