What This Page Covers
  • StatefulSet controller mechanics vs Deployment controller
  • Full annotated StatefulSet spec — every significant field
  • Stable pod identity: ordinal index, name format, DNS hostname
  • Headless service requirement: clusterIP:None, publishNotReadyAddresses
  • DNS record structure per pod — stable addresses across rescheduling
  • OrderedReady pod management: creation and deletion sequencing
  • Parallel pod management: concurrent creation/deletion, identity preserved
  • OrderedReady deadlock: causes, detection, resolution
  • RollingUpdate strategy: reverse-ordinal update order
  • Partition field: canary upgrades, freezing lower ordinals
  • OnDelete strategy: manual upgrade workflow
  • Scale-up and scale-down sequencing with OrderedReady
  • volumeClaimTemplates: naming, immutability, multiple templates
  • PVC lifecycle: orphaned PVCs, persistentVolumeClaimRetentionPolicy
  • StatefulSet vs Deployment: when to use each
  • Forced rollout: --cascade=orphan pattern
  • Controller status fields: currentReplicas, updatedReplicas, readyReplicas
  • StatefulSet conditions and observedGeneration
  • Operational commands reference
  • 5 metrics + 4 alerts + 5 runbooks + 8 best practices
  • Companion Page: Stateful Storage Patterns

    This page covers StatefulSet controller mechanics: spec fields, pod identity, update strategies, and scaling behavior. For storage-focused content — volumeClaimTemplates in depth, persistent volume lifecycle, distributed databases (PostgreSQL, Kafka, Cassandra, MongoDB) on Kubernetes, Longhorn, and Rook-Ceph — see Stateful Storage Patterns.

    Controller Mechanics

    The StatefulSet controller is fundamentally different from the Deployment controller. Where Deployments treat pods as interchangeable and manage a fleet of ReplicaSets, the StatefulSet controller manages pods directly — there is no intermediate ReplicaSet layer. Each pod is individually tracked by its ordinal index, and the controller enforces strict lifecycle ordering.

    StatefulSet object model (no ReplicaSet intermediary):

    StatefulSet: postgres (replicas: 3, serviceName: postgres-headless)
    │
    ├── Pod: postgres-0   ←── stable name; PVC: data-postgres-0
    │       ownerRef: StatefulSet/postgres
    │
    ├── Pod: postgres-1   ←── stable name; PVC: data-postgres-1
    │       ownerRef: StatefulSet/postgres
    │
    └── Pod: postgres-2   ←── stable name; PVC: data-postgres-2
            ownerRef: StatefulSet/postgres

    Contrast with Deployment:
      Deployment  → ReplicaSet (hash suffix) → Pods (random hash suffix)
      StatefulSet → Pods (ordinal suffix) directly

    Pod name format: {statefulset-name}-{ordinal}
    PVC name format: {template-name}-{pod-name} = {template-name}-{statefulset-name}-{ordinal}
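    The two name formats above can be sketched as a pair of shell helpers. The function names pod_name and pvc_name are illustrative only; they are not kubectl commands.

```shell
# Derive StatefulSet pod and PVC names from the formats described above.
# pod_name and pvc_name are hypothetical helper names for this sketch.

# pod_name <statefulset-name> <ordinal>
pod_name() {
  printf '%s-%s\n' "$1" "$2"
}

# pvc_name <template-name> <statefulset-name> <ordinal>
pvc_name() {
  printf '%s-%s\n' "$1" "$(pod_name "$2" "$3")"
}

pod_name postgres 1        # prints postgres-1
pvc_name data postgres 1   # prints data-postgres-1
```

    Because both names are pure functions of the StatefulSet name and ordinal, scripts can compute them without querying the cluster.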

    Full StatefulSet Spec

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: postgres
      namespace: data
    spec:
      # ── Identity ──────────────────────────────────────────────────────
      serviceName: postgres-headless   # REQUIRED: name of headless Service
                                       # creates stable DNS per pod
    
      replicas: 3
    
      # Selector is IMMUTABLE after creation
      selector:
        matchLabels:
          app: postgres
    
      # ── Pod management ────────────────────────────────────────────────
      podManagementPolicy: OrderedReady   # OrderedReady (default) | Parallel
    
      # ── Update strategy ───────────────────────────────────────────────
      updateStrategy:
        type: RollingUpdate              # RollingUpdate (default) | OnDelete
        rollingUpdate:
          partition: 0                   # only update pods with ordinal >= partition
          maxUnavailable: 1              # introduced as alpha in 1.24 behind the MaxUnavailableStatefulSet
                                         # feature gate; check your cluster version. Default 1; percentage or absolute
    
      # ── PVC retention ─────────────────────────────────────────────────
      persistentVolumeClaimRetentionPolicy:   # beta in 1.27, GA in 1.32
        whenDeleted: Retain              # Retain | Delete
        whenScaled: Retain               # Retain | Delete
    
      # ── Revision history ──────────────────────────────────────────────
      revisionHistoryLimit: 10           # number of ControllerRevisions to keep (default 10)
                                         # unlike Deployment, StatefulSet uses ControllerRevision objects
    
      # ── Min ready ─────────────────────────────────────────────────────
      minReadySeconds: 0                 # same semantics as Deployment
    
      # ── Pod template ──────────────────────────────────────────────────
      template:
        metadata:
          labels:
            app: postgres
        spec:
          terminationGracePeriodSeconds: 60
          # subdomain: the controller sets each pod's subdomain from serviceName; do not set it manually
          containers:
          - name: postgres
            image: postgres:16
            ports:
            - name: postgres
              containerPort: 5432
            env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name   # "postgres-0", "postgres-1", etc.
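            - name: ORDINAL                  # NOTE: this holds the full pod name, not a number;
              valueFrom:                     # the app must parse the trailing ordinal itself
                fieldRef:                    # (on 1.28+ the apps.kubernetes.io/pod-index pod label
                  fieldPath: metadata.name   #  carries the bare ordinal and can be used instead)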
            readinessProbe:
              exec:
                command: ["pg_isready", "-U", "postgres", "-h", "localhost"]
              periodSeconds: 5
              failureThreshold: 6
            resources:
              requests:
                cpu: "500m"
                memory: "1Gi"
              limits:
                memory: "2Gi"
            volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
    
      # ── Volume claim templates ────────────────────────────────────────
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes: [ReadWriteOnce]
          storageClassName: ebs-gp3
          resources:
            requests:
              storage: 100Gi

    Stable Pod Identity

    Each StatefulSet pod has three components of stable identity that persist across pod deletion and rescheduling:

    Identity Component | Format | Example (postgres, ordinal 1) | Persists Across Reschedule?
    Pod name | {sts-name}-{ordinal} | postgres-1 | Yes: same ordinal, same name
    DNS hostname | {pod-name}.{headless-svc}.{ns}.svc.cluster.local | postgres-1.postgres-headless.data.svc.cluster.local | Yes: DNS resolves to new pod IP after reschedule
    PVC binding | {template-name}-{pod-name} | data-postgres-1 | Yes: same PVC re-attached to rescheduled pod

    Identity Survives Node Failure

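    When the node running postgres-1 fails, the pod is marked for deletion once its not-ready/unreachable tolerations expire, but the pod object stays Terminating until the kubelet confirms shutdown, the Node object is removed, or the pod is force-deleted. The StatefulSet controller creates the replacement only after the old pod object is gone, because it guarantees at most one pod per identity. The replacement is also named postgres-1 and is scheduled to a healthy node. The same PVC data-postgres-1 is attached to this new pod (after the old VolumeAttachment is force-deleted if the node is truly gone). The DNS hostname postgres-1.postgres-headless.data.svc.cluster.local resolves to the new pod IP. From other pods' perspective, postgres-1 has the same address and data as before.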

    Headless Service

    The serviceName field in a StatefulSet spec references a headless service (clusterIP: None). This service must exist before the StatefulSet is created — the controller does not create it automatically. Without the headless service, pod DNS hostnames are not created.

    apiVersion: v1
    kind: Service
    metadata:
      name: postgres-headless
      namespace: data
      labels:
        app: postgres
    spec:
      clusterIP: None              # headless: no VIP, only DNS A records per pod
      publishNotReadyAddresses: true  # include not-yet-Ready pods in DNS
                                      # CRITICAL for init containers that use DNS for peer discovery
      selector:
        app: postgres
      ports:
      - name: postgres
        port: 5432
        targetPort: 5432

    DNS records created by headless service "postgres-headless" in namespace "data":

    Service DNS (round-robin A records, all pod IPs):
      postgres-headless.data.svc.cluster.local → [10.0.1.10, 10.0.1.11, 10.0.1.12]

    Per-pod DNS A records (stable, resolves to pod's current IP):
      postgres-0.postgres-headless.data.svc.cluster.local → 10.0.1.10
      postgres-1.postgres-headless.data.svc.cluster.local → 10.0.1.11
      postgres-2.postgres-headless.data.svc.cluster.local → 10.0.1.12

    When postgres-1 is rescheduled to a new node (new IP: 10.0.2.15):
      postgres-1.postgres-headless.data.svc.cluster.local → 10.0.2.15 (updated)
      postgres-0 and postgres-2 still resolve to their original IPs

    With publishNotReadyAddresses: true:
      → postgres-2 included in DNS even during init (before readiness probe passes)
      → Allows postgres-0 to discover postgres-2 as a peer during cluster bootstrap

    Short hostname (within same namespace):
      postgres-0.postgres-headless → resolves within data namespace
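
    Because the per-pod record format is fixed, a peer list can be generated rather than hard-coded. The sketch below assumes the default cluster domain cluster.local (it can differ per cluster); pod_fqdn and peer_list are hypothetical helper names.

```shell
# pod_fqdn <sts-name> <ordinal> <headless-svc> <namespace> [cluster-domain]
# The cluster domain defaults to cluster.local, which is an assumption.
pod_fqdn() {
  printf '%s-%s.%s.%s.svc.%s\n' "$1" "$2" "$3" "$4" "${5:-cluster.local}"
}

# peer_list <sts-name> <replicas> <headless-svc> <namespace>
# Prints one per-pod FQDN per line, ordinals 0 .. replicas-1.
peer_list() {
  i=0
  while [ "$i" -lt "$2" ]; do
    pod_fqdn "$1" "$i" "$3" "$4"
    i=$((i + 1))
  done
}

peer_list postgres 3 postgres-headless data
```

    The output can feed a seed-node or peer setting in the application's config; whether peers resolve before they are Ready depends on publishNotReadyAddresses as described above.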

    OrderedReady Pod Management

    OrderedReady (the default) enforces strict sequential ordering for all pod lifecycle operations. The controller never creates the next pod until the current one is Running and Ready. It never deletes the next pod until the current one is fully terminated.

    Scale-up (0 → 3 replicas) with OrderedReady:
      Step 1: Create postgres-0
              Wait for postgres-0 to be Running + Ready
      Step 2: Create postgres-1
              Wait for postgres-1 to be Running + Ready
      Step 3: Create postgres-2
              Wait for postgres-2 to be Running + Ready
      Done: all 3 pods running

    Scale-down (3 → 1 replica) with OrderedReady:
      Step 1: Delete postgres-2
              Wait for postgres-2 to be fully Terminated
      Step 2: Delete postgres-1
              Wait for postgres-1 to be fully Terminated
      Done: only postgres-0 remains

    Key invariant: at no point during scale-down is postgres-0 deleted before postgres-1
    (protects primary in primary/replica setups)
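
    The sequencing above can be expressed as a small planning function. This is only a model of the ordering for illustration, not controller code; scale_steps is a hypothetical name.

```shell
# scale_steps <sts-name> <current-replicas> <desired-replicas>
# Prints the order in which an OrderedReady StatefulSet creates or deletes pods.
scale_steps() {
  name=$1 cur=$2 want=$3
  while [ "$cur" -lt "$want" ]; do      # scale-up: lowest missing ordinal first
    echo "create $name-$cur, wait Running+Ready"
    cur=$((cur + 1))
  done
  while [ "$cur" -gt "$want" ]; do      # scale-down: highest ordinal first
    cur=$((cur - 1))
    echo "delete $name-$cur, wait Terminated"
  done
}

scale_steps postgres 0 3
scale_steps postgres 3 1
```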

    OrderedReady Deadlock

    If pod-N is stuck not-Ready (readiness probe failing indefinitely), the StatefulSet controller blocks: it will not create pod-N+1 or proceed with any further operations. This is a known deadlock condition.

    # Diagnose: identify the stuck pod
    kubectl get pods -n data -l app=postgres
    # NAME         READY   STATUS    RESTARTS   AGE
    # postgres-0   1/1     Running   0          10m
    # postgres-1   0/1     Running   5          3m   ← stuck not-Ready
    # (postgres-2 does not appear: the controller has not created it yet)
    
    # Check why postgres-1 is not Ready
    kubectl describe pod postgres-1 -n data   # → readiness probe failure events
    kubectl logs postgres-1 -n data           # → application error
    
    # Resolution options:
    
    # Option A: Fix the underlying issue (app bug, missing config, dependency down)
    # Controller automatically proceeds once pod-1 becomes Ready
    
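    # Option B: Temporary. Delete the stuck pod so the controller recreates it.
    # This is also required after fixing a bad pod template: under OrderedReady the
    # controller waits for the broken pod instead of replacing it.
    kubectl delete pod postgres-1 -n data
    
    # Option C: If the app can run with Parallel management, recreate the StatefulSet with it.
    # podManagementPolicy is immutable, so patching it on a live StatefulSet is rejected.
    # Delete the object with --cascade=orphan (pods keep running), then re-apply the
    # manifest with podManagementPolicy: Parallel.
    kubectl delete sts postgres -n data --cascade=orphan
    kubectl apply -f postgres-statefulset.yaml   # your manifest; the file name is illustrative
    # WARNING: Parallel on a database that requires leader election is unsafe
    
    # Option D: If the pod is stuck because of a bad update, raise the partition so that
    # ordinals below it stay on (or return to) the previous revision, then delete the stuck pod.
    # This does not unblock creation of higher ordinals during a scale-up.
    kubectl patch sts postgres -n data -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'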
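
    Detection can be scripted by scanning the pod list for the lowest-ordinal pod whose READY column is not n/n. The helper below is a sketch that parses the default `kubectl get pods` table format; first_not_ready is a hypothetical name.

```shell
# first_not_ready: read `kubectl get pods` table output on stdin and print the
# lowest-ordinal pod whose READY column (e.g. 0/1) shows fewer ready containers
# than total. Header lines and anything without an n/m READY field are ignored.
first_not_ready() {
  awk '
    $2 ~ /^[0-9]+\/[0-9]+$/ {
      split($2, ready, "/")
      ordinal = $1
      sub(/.*-/, "", ordinal)          # keep only the text after the last hyphen
      if (ready[1] != ready[2] && (found == 0 || ordinal + 0 < lowest)) {
        found = 1; lowest = ordinal + 0; name = $1
      }
    }
    END { if (found) print name }
  '
}

# Example with captured output; against a cluster, pipe in:
#   kubectl get pods -n data -l app=postgres
printf '%s\n' \
  'NAME         READY   STATUS    RESTARTS   AGE' \
  'postgres-0   1/1     Running   0          10m' \
  'postgres-1   0/1     Running   5          3m' | first_not_ready
```

    An empty result means every listed pod is Ready; it does not prove that all desired pods exist, so compare the pod count with spec.replicas as well.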

    Parallel Pod Management

    Parallel pod management creates and deletes all pods simultaneously without waiting for each to be Running/Ready first. Pods still get stable names and PVCs; only the lifecycle ordering changes. The policy applies to scaling operations; rolling updates still follow updateStrategy. Use it for applications that can handle concurrent initialisation (stateless-ish shards, cache nodes that resync on startup).

    spec:
      podManagementPolicy: Parallel
      # All 3 pods created at the same time:
      # postgres-0, postgres-1, postgres-2 all created simultaneously
      # None waits for the others to be Ready before starting
      # Faster cluster restart; dangerous for primary/replica databases

    Parallel Is Unsafe for Leader-Elected Databases

    For PostgreSQL with Patroni, Galera Cluster, or Cassandra bootstrap, all replicas starting simultaneously can race for primary election or corrupt the cluster state. Use OrderedReady for these workloads. Note that podManagementPolicy cannot be changed on an existing StatefulSet: switching to Parallel later means deleting the object with --cascade=orphan and recreating it, and it should only be considered for a healthy, fully-initialized cluster where the application can handle concurrent startup.

    Update Strategies

    RollingUpdate (Default)

    StatefulSet RollingUpdate proceeds in reverse ordinal order (highest ordinal first), one pod at a time. This is the opposite of scale-up order — it updates replicas before the primary (ordinal 0), minimising risk of primary disruption.

    RollingUpdate with replicas=3 (updating postgres:15 → postgres:16):
      Step 1: Update postgres-2 (highest ordinal)
              Delete postgres-2 → wait Terminated
              Create new postgres-2 with new image → wait Running+Ready (+ minReadySeconds)
      Step 2: Update postgres-1
              Delete postgres-1 → wait Terminated
              Create new postgres-1 → wait Running+Ready
      Step 3: Update postgres-0 (primary / lowest ordinal, updated last)
              Delete postgres-0 → wait Terminated
              Create new postgres-0 → wait Running+Ready

    Total pods running during update: always ≥ 2 (with default maxUnavailable=1)

    Reverse order rationale:
      postgres-0 is typically the primary/leader
      Updating replicas first allows failover testing before touching the primary

    maxUnavailable in RollingUpdate (alpha since 1.24, feature gate MaxUnavailableStatefulSet)

    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        partition: 0
        maxUnavailable: 2    # allow 2 pods to be simultaneously unavailable during update
                             # speeds up update of large StatefulSets
                             # default: 1 (one pod at a time)
                             # percentage: "33%" → ceil(replicas × 0.33); rounded up, never 0
        # WARNING: for quorum-based systems (Kafka, etcd), maxUnavailable must
        # not exceed (replicas - quorum_size)
        # For 3-node etcd (quorum=2): maxUnavailable must be 1
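
    The quorum bound in the warning above is simple arithmetic. A sketch, assuming a majority quorum of floor(n/2) + 1 (true for etcd; check your system's own rule); the function names are hypothetical:

```shell
# quorum_size <replicas>: majority quorum, floor(n/2) + 1
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# max_safe_unavailable <replicas>: replicas - quorum_size
# The largest maxUnavailable that still leaves a quorum running.
max_safe_unavailable() {
  echo $(( $1 - $(quorum_size "$1") ))
}

for n in 3 5 7; do
  echo "replicas=$n quorum=$(quorum_size "$n") max_unavailable=$(max_safe_unavailable "$n")"
done
```

    For 3 replicas this yields 1, matching the etcd example above. The bound assumes no other member is already down; any pre-existing failure reduces the safe value further.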

    Partition — Canary Upgrades

    The partition field divides the StatefulSet at a boundary: pods with ordinal ≥ partition are updated; pods with ordinal < partition are frozen at their current version. This enables controlled canary upgrades where you update one replica, validate it, then lower the partition.

    # Scenario: postgres StatefulSet, 5 replicas (0-4), upgrading postgres:15 → postgres:16
    
    # Step 1: Start canary — only update the highest ordinal
    kubectl patch sts postgres -n data \
      -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":4}}}}'
    # Only postgres-4 gets updated; postgres-0 through postgres-3 stay at old version
    
    # Step 2: Validate postgres-4 is healthy
    kubectl exec postgres-4 -n data -- psql -U postgres -c "SELECT version();"
    kubectl logs postgres-4 -n data | tail -50
    
    # Step 3: Expand canary to include pod-3 and pod-4
    kubectl patch sts postgres -n data \
      -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":3}}}}'
    
    # Step 4: Full rollout (update all including primary postgres-0)
    kubectl patch sts postgres -n data \
      -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
    
    # Emergency freeze: set partition to replicas count to prevent ANY further updates
    kubectl patch sts postgres -n data \
      -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":5}}}}'
    # No pods will be updated; existing state is preserved
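
    To preview which pods a given partition value will touch, and in what order, the rule "ordinal >= partition, highest first" can be sketched as follows. update_order is a hypothetical helper, not a kubectl feature.

```shell
# update_order <sts-name> <replicas> <partition>
# Prints the pods a RollingUpdate will replace, in the order it replaces them:
# highest ordinal first, stopping at the partition boundary.
update_order() {
  i=$(( $2 - 1 ))
  while [ "$i" -ge "$3" ]; do
    echo "$1-$i"
    i=$((i - 1))
  done
}

update_order postgres 5 4   # canary: postgres-4 only
update_order postgres 5 3   # postgres-4, then postgres-3
update_order postgres 5 5   # frozen: prints nothing
```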

    OnDelete Strategy

    With OnDelete, the controller only updates a pod when you manually delete it. This gives complete control over update timing — essential for databases where you need to perform a controlled failover before restarting a node.

    updateStrategy:
      type: OnDelete
    
    # Update workflow for a 3-node Patroni PostgreSQL cluster:
    
    # 1. Update the image in the StatefulSet spec (no pods restart yet)
    kubectl set image sts/postgres postgres=postgres:16 -n data
    
    # 2. Verify patroni replica status (postgres-2 and postgres-1 are replicas)
    kubectl exec postgres-0 -n data -- patronictl list
    
    # 3. Restart replica first (controller creates updated postgres-2)
    kubectl delete pod postgres-2 -n data
    # Wait for postgres-2 to rejoin as replica...
    kubectl exec postgres-0 -n data -- patronictl list
    
    # 4. Restart second replica
    kubectl delete pod postgres-1 -n data
    
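    # 5. Switch over so that postgres-1 or postgres-2 becomes primary
    #    (newer Patroni releases deprecate --master in favour of --leader;
    #     check `patronictl switchover --help` for your version)
    kubectl exec postgres-0 -n data -- patronictl switchover postgres --master postgres-0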
    
    # 6. Now postgres-0 is a replica; restart it safely
    kubectl delete pod postgres-0 -n data
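
    The workflow above assumes postgres-0 is the current leader. With Patroni the leader can be any ordinal, so the restart order is better derived than memorised. A sketch of that ordering; ondelete_plan is a hypothetical name and the function only prints a plan, it runs nothing:

```shell
# ondelete_plan <sts-name> <replicas> <leader-ordinal>
# Prints a restart order for an OnDelete update: replicas first
# (highest ordinal down), then a switchover, then the former leader.
ondelete_plan() {
  i=$(( $2 - 1 ))
  while [ "$i" -ge 0 ]; do
    [ "$i" -ne "$3" ] && echo "delete pod $1-$i (replica), wait for it to rejoin"
    i=$((i - 1))
  done
  echo "switch over away from $1-$3"
  echo "delete pod $1-$3 (former leader)"
}

ondelete_plan postgres 3 0
```

    Confirm the actual leader (for example with patronictl list) immediately before acting on the plan, since it can change between steps.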

    Scaling Behavior

    Scale-Up Sequencing

    # Scale from 3 to 5 replicas
    kubectl scale sts postgres -n data --replicas=5
    
    # With OrderedReady:
    # postgres-3 created and must become Running+Ready before postgres-4 is created
    # New PVCs created automatically: data-postgres-3, data-postgres-4
    
    # With Parallel:
    # postgres-3 and postgres-4 created simultaneously

    Scale-Down and PVC Orphaning

    # Scale from 5 to 3 replicas
    kubectl scale sts postgres -n data --replicas=3
    
    # With OrderedReady (scale-down is always sequential, highest-ordinal-first):
    # postgres-4 deleted (pod only — PVC data-postgres-4 remains by default)
    # Wait for postgres-4 Terminated
    # postgres-3 deleted
    # Wait for postgres-3 Terminated
    # PVCs data-postgres-3 and data-postgres-4 are ORPHANED (not deleted)
    
    # Check for orphaned PVCs after scale-down:
    kubectl get pvc -n data -l app=postgres
    # NAME             STATUS   VOLUME          CAPACITY   ACCESS MODES   STORAGECLASS
    # data-postgres-0  Bound    pvc-abc123      100Gi      RWO            ebs-gp3
    # data-postgres-1  Bound    pvc-def456      100Gi      RWO            ebs-gp3
    # data-postgres-2  Bound    pvc-ghi789      100Gi      RWO            ebs-gp3
    # data-postgres-3  Bound    pvc-jkl012      100Gi      RWO            ebs-gp3   ← ORPHANED
    # data-postgres-4  Bound    pvc-mno345      100Gi      RWO            ebs-gp3   ← ORPHANED
    
    # Scale back to 5: orphaned PVCs are automatically re-used (same names re-attached)
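
    Orphaned PVCs can be identified mechanically: any PVC whose ordinal is greater than or equal to the current replica count has no pod. A sketch that filters a list of PVC names; orphaned_pvcs is a hypothetical helper.

```shell
# orphaned_pvcs <template-name> <sts-name> <replicas>
# Reads PVC names on stdin (one per line) and prints those that follow the
# {template}-{sts}-{ordinal} format with ordinal >= replicas.
orphaned_pvcs() {
  prefix="$1-$2-"
  while IFS= read -r pvc; do
    case $pvc in
      "$prefix"*) ord=${pvc#"$prefix"} ;;
      *) continue ;;                       # not from this template/StatefulSet
    esac
    case $ord in
      ''|*[!0-9]*) continue ;;             # suffix is not a plain ordinal
    esac
    [ "$ord" -ge "$3" ] && echo "$pvc"
  done
  return 0
}

# Example with a fixed list; against a cluster, pipe in:
#   kubectl get pvc -n data -l app=postgres --no-headers -o custom-columns=NAME:.metadata.name
printf '%s\n' data-postgres-0 data-postgres-1 data-postgres-2 \
  data-postgres-3 data-postgres-4 | orphaned_pvcs data postgres 3
```

    The function only lists candidates. Deleting a PVC is irreversible when the PV reclaim policy is Delete, so review the list and back up before removing anything.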

    volumeClaimTemplates

    See Stateful Storage Patterns § volumeClaimTemplates for full coverage including naming convention, immutability, multiple templates, and the --cascade=orphan migration pattern. Key points summarised here:

    Aspect | Behavior
    PVC naming | {template.metadata.name}-{statefulset-name}-{ordinal}
    Immutability | volumeClaimTemplates cannot be modified after StatefulSet creation. Delete with --cascade=orphan and recreate to change.
    PVC on pod delete | Retained: the PVC persists; the pod is recreated and re-attaches the same PVC
    PVC on StatefulSet delete | Retained by default; controlled by persistentVolumeClaimRetentionPolicy.whenDeleted
    PVC on scale-down | Retained by default; controlled by persistentVolumeClaimRetentionPolicy.whenScaled
    Scale-up re-bind | Existing PVCs from previous scale-up are re-used if names match

    ControllerRevisions and Rollback

    Unlike Deployments (which use ReplicaSets for history), StatefulSets store revision history in ControllerRevision objects: lightweight objects that each hold a serialized snapshot of the pod template for one revision.

    # List ControllerRevisions for a StatefulSet
    kubectl get controllerrevision -n data -l app=postgres \
      -o custom-columns='NAME:.metadata.name,REVISION:.revision,AGE:.metadata.creationTimestamp'
    
    # Roll back StatefulSet (same command as Deployment)
    kubectl rollout undo sts/postgres -n data
    kubectl rollout undo sts/postgres -n data --to-revision=3
    
    # Check rollout status
    kubectl rollout status sts/postgres -n data
    # Waiting for partitioned roll out to finish: 0 out of 3 new pods have been updated...
    # partitioned roll out complete: 3 new pods have been updated...
    
    # Watch pod updates in real time
    kubectl get pods -n data -l app=postgres -w

    Status Fields

    kubectl get sts postgres -n data -o yaml | yq .status
    # replicas: 3                     # total pods (Running + Pending + Terminating)
    # readyReplicas: 3                 # pods with all containers Ready
    # currentReplicas: 3               # pods at currentRevision (old version during update)
    # updatedReplicas: 2               # pods at updateRevision (new version during update)
    # availableReplicas: 3             # pods Ready for >= minReadySeconds (GA in 1.25)
    # currentRevision: postgres-7d9f   # ControllerRevision name of current spec
    # updateRevision: postgres-9a2b    # ControllerRevision name of target spec (during update)
    # observedGeneration: 4            # last generation processed by controller
    #                                    compare with metadata.generation to detect lag
    # collisionCount: 0                # hash collision counter for revision names

    Status Field | Meaning During Rolling Update
    replicas | Total pod count (always equals desired unless scaling)
    currentReplicas | Pods still running the old revision
    updatedReplicas | Pods running the new revision (target)
    readyReplicas | Pods that are Running + all probes passing
    currentRevision == updateRevision | Update complete: all pods on same revision
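
    The status fields combine into a simple decision. The sketch below classifies a StatefulSet from values you would read out of metadata.generation and .status; rollout_state and its four result labels are hypothetical names chosen for this example.

```shell
# rollout_state <generation> <observedGeneration> <currentRevision> \
#               <updateRevision> <readyReplicas> <replicas>
rollout_state() {
  if [ "$2" -lt "$1" ]; then
    echo "controller-lagging"    # controller has not processed the latest spec
  elif [ "$3" != "$4" ]; then
    echo "updating"              # some pods are still on the old revision
  elif [ "$5" -lt "$6" ]; then
    echo "not-ready"             # same revision everywhere, but pods are not Ready
  else
    echo "complete"
  fi
}

# Values taken from the sample status output above (generation assumed to be 4):
rollout_state 4 4 postgres-7d9f postgres-9a2b 3 3   # prints updating
```

    Each argument can be fetched with kubectl get sts NAME -o jsonpath=..., as the operational commands on this page do for updatedReplicas.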

    Forced Spec Changes with --cascade=orphan

    Because volumeClaimTemplates and selector are immutable, some changes require deleting and recreating the StatefulSet. The --cascade=orphan flag deletes the StatefulSet object while leaving its pods and PVCs running — a zero-downtime StatefulSet spec update.

    # Scenario: need to change volumeClaimTemplates storage size or storageClass
    
    # Step 1: Delete the StatefulSet object, leave pods running
    kubectl delete sts postgres -n data --cascade=orphan
    # Pods continue running: postgres-0, postgres-1, postgres-2 still alive
    # PVCs still exist and bound
    # No traffic interruption
    
    # Step 2: Apply new StatefulSet manifest (with updated volumeClaimTemplates)
    kubectl apply -f postgres-statefulset-v2.yaml
    # Controller re-adopts existing pods (they already match the selector)
    # New volumeClaimTemplates apply only to NEWLY CREATED pods (e.g., scale-up)
    
    # Step 3: For existing pods to get new PVC spec, manually migrate data:
    # - Scale up to get new pods with new PVC size
    # - Copy data from old PVCs to new PVCs
    # - Delete old pods (controller creates replacements with new PVCs)
    
    # Note: existing PVCs are NOT automatically resized by this process
    # Use PVC expansion (kubectl edit pvc) for in-place size increases;
    # this requires allowVolumeExpansion: true on the StorageClass

    Operational Commands

    # Watch pod creation/deletion during scaling or update
    kubectl get pods -n data -l app=postgres -w
    
    # Check rollout status (blocks until complete or times out)
    kubectl rollout status sts/postgres -n data --timeout=10m
    
    # Trigger a rolling restart without spec change
    kubectl rollout restart sts/postgres -n data
    
    # Pause a rolling update (set partition to replicas minus updatedReplicas)
    # Updates run from the highest ordinal down, so the updated pods are the top UPDATED ordinals
    UPDATED=$(kubectl get sts postgres -n data -o jsonpath='{.status.updatedReplicas}')
    REPLICAS=$(kubectl get sts postgres -n data -o jsonpath='{.spec.replicas}')
    # Freeze at current progress (don't update pods 0..REPLICAS-UPDATED-1)
    kubectl patch sts postgres -n data \
      -p "{\"spec\":{\"updateStrategy\":{\"rollingUpdate\":{\"partition\":$((REPLICAS - UPDATED))}}}}"
    
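    # Force delete a stuck Terminating pod (node failure recovery)
    # Only do this once the node is confirmed down: if the old container is still running,
    # two pods with the same identity can end up writing to the same data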
    kubectl delete pod postgres-1 -n data --grace-period=0 --force
    
    # Check all PVCs for a StatefulSet
    kubectl get pvc -n data -l app=postgres --sort-by='.metadata.name'
    
    # Describe to see events (scale, update, probe failures)
    kubectl describe sts postgres -n data
    
    # Get ControllerRevision history
    kubectl get controllerrevision -n data -l app=postgres

    StatefulSet vs Deployment — When to Use Each

    Use StatefulSet When… | Use Deployment When…
    Each replica needs its own persistent data (per-pod PVC) | All replicas share a read volume, or use no persistent storage
    Pods must discover each other by stable DNS name (Cassandra seeds, Kafka broker IDs, etcd peers) | Pod identity is irrelevant; any replica can serve any request
    Ordered startup is required (replica must not start before primary) | All pods can start concurrently without coordination
    Ordered shutdown is required (must drain highest-ordinal replica first) | Any pod can be terminated independently
    Application uses its own hostname for peer registration (pod name embedded in config) | Application is stateless or uses external service discovery
    Rolling upgrades must proceed in a controlled sequence with validation | Rolling updates can proceed at any speed with maxSurge/maxUnavailable

    Metrics, Alerts, and Runbooks

    Key Metrics

    Metric | Source | Alert Condition
    kube_statefulset_status_replicas_ready | kube-state-metrics | < kube_statefulset_replicas for > 5m
    kube_statefulset_status_replicas_current != kube_statefulset_status_replicas_updated | kube-state-metrics | Rollout in progress; alert if stale > 30m
    kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation | kube-state-metrics | Controller not processing updates; possible controller issue
    kube_persistentvolumeclaim_status_phase{phase="Lost"} | kube-state-metrics | PVC Lost, pod stuck Pending; immediate alert
    kube_pod_status_ready{condition="false"} for StatefulSet pods | kube-state-metrics | Not-Ready pod blocking OrderedReady progression

    Alerting Rules

    groups:
    - name: statefulset-health
      rules:
      - alert: StatefulSetNotFullyReady
        expr: |
          kube_statefulset_status_replicas_ready
            < kube_statefulset_replicas
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} not fully ready"
    
      - alert: StatefulSetRolloutStuck
        expr: |
          kube_statefulset_status_replicas_updated
            != kube_statefulset_replicas
          and
          kube_statefulset_status_replicas_current
            != kube_statefulset_status_replicas_updated
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} rollout has not completed in 30 minutes"
    
      - alert: StatefulSetPVCLost
        expr: |
          kube_persistentvolumeclaim_status_phase{phase="Lost"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is Lost — StatefulSet pod will be stuck Pending"
    
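      - alert: StatefulSetReplicasMismatch
        # Compares the number of pods that exist with the desired count, so it does not
        # duplicate StatefulSetNotFullyReady (which compares Ready pods with desired)
        expr: |
          kube_statefulset_status_replicas
            != kube_statefulset_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} pod count has differed from spec.replicas for 10 minutes"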

    Runbooks

    Pod Stuck — OrderedReady Deadlock

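    Identify stuck pod: kubectl get pods -l app=NAME. Look for a pod that has been not-Ready for a long time while higher ordinals are missing (under OrderedReady they are not created until it is Ready). Check: kubectl describe pod NAME-N for probe failures or image pull errors. Fix the root cause, or delete the stuck pod to trigger a restart. If the root cause cannot be reproduced and the application tolerates concurrent startup, consider recreating the StatefulSet with podManagementPolicy: Parallel; the field is immutable, so this means kubectl delete sts NAME --cascade=orphan followed by re-applying the manifest.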

    Rollout Stuck at Partition

    Check current partition: kubectl get sts NAME -o jsonpath='{.spec.updateStrategy.rollingUpdate.partition}'. If partition is higher than 0, either lower it intentionally or reset to 0 to complete the rollout. Also check status.updatedReplicas vs spec.replicas to confirm rollout progress.

    PVC Lost — Pod Stuck Pending

    PVC phase Lost means the bound PV no longer exists. Check: kubectl describe pvc NAME. To recover: create a new PV with the same volumeHandle pointing to the original storage, patch the PV's claimRef to reference the PVC's name and UID. See Persistent Volumes § Lost recovery.

    Force Delete Stuck Terminating Pod

    Pod stuck Terminating after node failure: kubectl delete pod NAME --grace-period=0 --force. Also delete the stale VolumeAttachment: kubectl delete volumeattachment NAME. The StatefulSet controller will create a replacement pod. Verify the new pod attaches the PVC cleanly before declaring recovery complete.

    Orphaned PVCs After Scale-Down

    List: kubectl get pvc -n NS -l app=NAME. Identify which PVCs have no corresponding running pod. If scaling back up is not planned, back up data and delete manually. If data is needed: kubectl exec into a pod with the PVC mounted, or create a temporary pod to access the PVC before deletion.

    Best Practices

    1. Always create the headless service before the StatefulSet — the StatefulSet controller references the service by name in serviceName but does not create it. If the service is missing, pod DNS hostnames are not registered. Apply manifests in order: Service → StatefulSet. Use publishNotReadyAddresses: true for peer-discovery during initialization.
    2. Use OrderedReady for bootstrap-sensitive workloads, evaluate Parallel otherwise. OrderedReady prevents bootstrap races (e.g., two Cassandra nodes joining the ring simultaneously). Parallel speeds up scale-up and recreation of missing pods for applications that can handle concurrent startup, but it does not change rolling update order. podManagementPolicy cannot be changed in place, so choose it at creation time or recreate the StatefulSet with --cascade=orphan.
    3. Set partition to replicas before risky upgrades — freeze all updates by setting partition: N (equal to replica count). Update the spec, validate in staging, then lower the partition incrementally. This is safer than OnDelete because the partition can be lowered remotely without needing to delete pods manually.
    4. Use terminationGracePeriodSeconds appropriate to your database — PostgreSQL needs time to finish checkpoints and close WAL (30–60s minimum). Cassandra needs time to flush memtables (60–120s). Kafka needs time to complete log segment flushes and gracefully hand off partition leadership. The default 30s is too short for most databases.
    5. Keep revisionHistoryLimit at the default (10). Each ControllerRevision stores a snapshot of the pod template and is small compared with the pods and PVCs it describes. There is little reason to reduce this, and doing so removes rollback targets. Ten revisions do not consume meaningful etcd space.
    6. Set persistentVolumeClaimRetentionPolicy.whenDeleted: Retain and whenScaled: Retain for production — the default behavior (both Retain) is the safest. Only use Delete for ephemeral test environments. Recovering accidentally deleted database PVCs is painful and may involve data loss.
    7. Monitor readyReplicas separately from replicas — a StatefulSet can show replicas: 3 while readyReplicas: 2 indefinitely without triggering any error. Set alerts on the difference. A long-running mismatch indicates a stuck pod that is blocking OrderedReady progression.
    8. Test failover and recovery procedures under realistic conditions. kubectl drain exercises graceful eviction (planned maintenance); to simulate a node failure, stop the kubelet or power off a node in a test cluster. Test that the pod reschedules and re-attaches its PVC, and verify the application recovers correctly. Many database-specific failure modes only surface under real I/O conditions, not in smoke tests.