▶ What This Page Covers
  • StatefulSet vs Deployment storage semantics: stable identity, ordered management
  • volumeClaimTemplates anatomy — naming convention, multiple templates, immutability
  • PVC lifecycle: orphaned PVCs, no auto-delete on StatefulSet delete
  • persistentVolumeClaimRetentionPolicy (beta in 1.27, GA in 1.32): whenDeleted/whenScaled
  • podManagementPolicy: OrderedReady vs Parallel with storage implications
  • Ordered provisioning sequence and scaling constraints for stateful apps
  • RollingUpdate with partition field — canary upgrades for databases
  • OnDelete upgrade strategy for manual-controlled database rolling upgrades
  • Pod anti-affinity patterns — hostname and zone topology keys
  • TopologySpreadConstraints for even replica distribution
  • PodDisruptionBudgets for quorum-based and leader-elected systems
  • Headless service deep-dive — DNS records per pod, stable DNS names
  • PostgreSQL on Kubernetes — CloudNativePG operator, patroni, pg_basebackup replica init
  • Cassandra on Kubernetes — rack-awareness, seed discovery via headless DNS
  • Kafka on Kubernetes — KRaft vs ZooKeeper, broker ID = ordinal, topic replication
  • MongoDB on Kubernetes — replica set via MongoDB Community Operator
  • Redis Cluster and Redis Sentinel topologies
  • Local PVs with StatefulSets — NVMe performance, node failure data-loss warning
  • Longhorn architecture — manager DaemonSet, volume replication, S3 backup
  • Rook-Ceph architecture — CephCluster CRD, OSD placement, CRUSH rack-awareness
  • Rook-Ceph storage classes: RBD (block) and CephFS (shared filesystem)
  • Init container patterns — replica initialization, seed node bootstrap, data migration
  • Backup strategies — PVC snapshot + application-level + Velero integration
  • 5 metrics + 4 alerting rules + 5 runbooks
  • 8 best practices for production stateful workloads

    StatefulSet vs Deployment: Storage Semantics

    Deployments treat pods as interchangeable: any pod can serve any request, and a PVC referenced in the pod template is shared by every replica (which in practice means ReadWriteMany) or storage is not used at all. StatefulSets assign each pod a stable identity: a fixed ordinal (0, 1, 2…), a stable DNS hostname, and a dedicated PVC that follows the pod across rescheduling. The PVC is not re-created when the pod is; by default it persists even after pod or StatefulSet deletion.

    | Characteristic | Deployment | StatefulSet |
    |---|---|---|
    | Pod identity | Random hash suffix (random-a1b2c) | Ordinal suffix (db-0, db-1) |
    | DNS hostname | Unpredictable | Stable: pod-name.headless-svc.ns.svc.cluster.local |
    | PVC per pod | Shared PVC or no PVC | Dedicated PVC per pod via volumeClaimTemplates |
    | Pod creation order | All at once (parallel) | Sequential 0→N-1 (OrderedReady) or parallel |
    | Pod deletion order | Random | Sequential N-1→0 |
    | PVC on pod delete | N/A | PVC retained (not deleted with pod) |
    | PVC on StatefulSet delete | N/A | PVCs orphaned by default (Retain policy) |
    | Rolling update direction | New pods before old | Reverse ordinal (N-1→0) |

    volumeClaimTemplates

    volumeClaimTemplates defines PVC templates that the StatefulSet controller instantiates for each pod. Each template creates a PVC named {template.metadata.name}-{pod-name}. For a StatefulSet named postgres with template name data, pod postgres-0 gets PVC data-postgres-0.

    Full Anatomy

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: postgres
      namespace: data
    spec:
      serviceName: postgres-headless   # required headless service name
      replicas: 3
      selector:
        matchLabels:
          app: postgres
      podManagementPolicy: OrderedReady  # default; sequential pod lifecycle
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          partition: 0   # update all pods; set to N to freeze pods 0..N-1
      template:
        metadata:
          labels:
            app: postgres
        spec:
          terminationGracePeriodSeconds: 60
          containers:
          - name: postgres
            image: postgres:16
            env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
            - name: wal
              mountPath: /var/lib/postgresql/wal
      volumeClaimTemplates:
      - metadata:
          name: data
          # No annotation is needed to keep this PVC: PVCs survive StatefulSet deletion by default.
          # persistentVolumeClaimRetentionPolicy (beta in 1.27, GA in 1.32) can change that.
        spec:
          accessModes: [ReadWriteOnce]
          storageClassName: ebs-gp3
          resources:
            requests:
              storage: 100Gi
      - metadata:
          name: wal
        spec:
          accessModes: [ReadWriteOnce]
          storageClassName: ebs-io2   # separate faster disk for WAL
          resources:
            requests:
              storage: 20Gi
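
    volumeClaimTemplates Are Immutable

    Once a StatefulSet is created, volumeClaimTemplates cannot be modified in place. Changing the storage size, storage class, or access modes in the template requires deleting and re-creating the StatefulSet (use --cascade=orphan to keep the pods running; the PVCs are retained by default either way) or migrating data to new PVCs. Existing PVCs can still be expanded individually if their StorageClass sets allowVolumeExpansion: true.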

    PVC Naming Convention

    | StatefulSet | Template Name | Replica 0 | Replica 1 | Replica 2 |
    |---|---|---|---|---|
    | postgres | data | data-postgres-0 | data-postgres-1 | data-postgres-2 |
    | postgres | wal | wal-postgres-0 | wal-postgres-1 | wal-postgres-2 |
    | kafka | data | data-kafka-0 | data-kafka-1 | data-kafka-2 |
    | cassandra | cassandra-data | cassandra-data-cassandra-0 | cassandra-data-cassandra-1 | cassandra-data-cassandra-2 |
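
    The naming rule can be sketched as two small shell helpers. This is a minimal illustration rather than anything the StatefulSet controller exposes: pvc_name and pod_from_pvc are hypothetical names, and the template name has to be supplied by the caller because it may itself contain hyphens.

```shell
# Hypothetical helpers illustrating the {template}-{statefulset}-{ordinal} rule.

# PVC name for one replica of a StatefulSet.
pvc_name() (
  template=$1 sts=$2 ordinal=$3
  printf '%s-%s-%s\n' "$template" "$sts" "$ordinal"
)

# Pod name recovered from a PVC name by stripping the "{template}-" prefix.
pod_from_pvc() (
  pvc=$1 template=$2
  printf '%s\n' "${pvc#"${template}-"}"
)

pvc_name data postgres 0                                  # data-postgres-0
pod_from_pvc cassandra-data-cassandra-1 cassandra-data    # cassandra-1
```

    Passing the template name explicitly is what keeps cassandra-data-cassandra-1 from being split at the wrong hyphen.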

    PVC Lifecycle and Orphaned PVCs

    Default Behavior: PVCs Are Not Deleted

    When you delete a StatefulSet, the pods are deleted but the PVCs are left behind — orphaned. This is intentional: it protects against accidental data loss. A new StatefulSet with the same name and template names will re-bind to the existing PVCs and continue from where the data left off.

    Deleting with --cascade=orphan

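    Use kubectl delete sts postgres --cascade=orphan to delete the StatefulSet object without deleting its pods. You can then recreate the StatefulSet (e.g., with updated volumeClaimTemplates) while the pods keep running with their PVCs attached, which makes this a zero-downtime way to change otherwise immutable StatefulSet fields. The recreated StatefulSet adopts the pods that match its selector; existing PVCs keep their original size and class, and the new templates apply only to PVCs created afterwards.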

    Detecting Orphaned PVCs

    # List PVCs in a namespace that are not mounted by any pod
    used=$(kubectl get pods -n data -o json |
      jq -r '.items[].spec.volumes[]?.persistentVolumeClaim.claimName // empty' | sort -u)
    kubectl get pvc -n data -o json | jq -r '.items[].metadata.name' | sort |
      comm -23 - <(printf '%s\n' "$used")
    
    # More targeted: for one volumeClaimTemplate, find PVCs whose pod no longer exists
    template=data   # volumeClaimTemplates[].metadata.name; it may itself contain hyphens
    kubectl get pvc -n data --no-headers | awk '{print $1}' | grep "^${template}-" | while read -r pvc; do
      pod=${pvc#"${template}-"}   # PVC name is {template}-{pod-name}
      kubectl get pod "$pod" -n data &>/dev/null || echo "ORPHANED: $pvc"
    done

    persistentVolumeClaimRetentionPolicy (Beta 1.27, GA 1.32)

    Introduced as alpha in 1.23, enabled by default as beta in 1.27, and stable in 1.32, this field controls automated PVC deletion:

    spec:
      persistentVolumeClaimRetentionPolicy:
        whenDeleted: Delete   # Delete PVCs when StatefulSet is deleted
        whenScaled: Retain    # Keep PVCs when scaling down (safe default)
        # whenScaled: Delete  # DANGER: permanently deletes PVCs on scale-down

    | Policy (whenDeleted/whenScaled) | whenDeleted | whenScaled | Use Case |
    |---|---|---|---|
    | Retain/Retain (default) | Keep PVCs | Keep PVCs | Production DBs; manual cleanup |
    | Delete/Retain | Delete PVCs on StatefulSet delete | Keep PVCs on scale-down | CI/staging: easy teardown, safe scale |
    | Retain/Delete | Keep PVCs | Delete PVCs on scale-down | Stateless-ish caches, re-seed on scale-up |
    | Delete/Delete | Delete all PVCs | Delete PVCs on scale-down | Ephemeral test environments only |

    whenScaled: Delete Is Irreversible

    If you scale a StatefulSet from 3→1 with whenScaled: Delete, PVCs for pods 1 and 2 are permanently deleted. Scaling back to 3 creates brand-new empty PVCs — all data from replicas 1 and 2 is gone. Never use this for databases without verified backup/restore testing.
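
    To make the blast radius concrete before changing replicas, the affected PVC names can be listed from the naming convention alone. The helper below is a sketch; pvcs_deleted_on_scale_down is a hypothetical name, and it only computes names without contacting the cluster.

```shell
# Hypothetical helper: PVCs that whenScaled: Delete would remove when scaling
# StatefulSet $2 (template $1) from $3 replicas down to $4.
# Names are printed highest ordinal first, matching scale-down order.
pvcs_deleted_on_scale_down() (
  template=$1 sts=$2 from=$3 to=$4
  i=$((from - 1))
  while [ "$i" -ge "$to" ]; do
    printf '%s-%s-%s\n' "$template" "$sts" "$i"
    i=$((i - 1))
  done
)

pvcs_deleted_on_scale_down data postgres 3 1
# data-postgres-2
# data-postgres-1
```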

    Pod Management and Update Strategies

    OrderedReady vs Parallel

    podManagementPolicy: OrderedReady  # default
    # Pods created 0 → 1 → 2 (each must be Running+Ready before next)
    # Pods deleted 2 → 1 → 0 (each must be Terminated before next)
    # Safe for databases that require leader election before follower starts
    
    podManagementPolicy: Parallel
    # All pods created simultaneously (no sequencing)
    # All pods deleted simultaneously
    # Faster for stateless-ish work or when app handles concurrent init
    # Still maintains stable identity and PVC binding
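
    OrderedReady Can Deadlock

    If pod-0 never becomes Ready (e.g., the database waits for pod-1 to form a quorum), the StatefulSet controller blocks: it will not create pod-1 until pod-0 is Ready. Break the deadlock by relaxing the readiness probe so a pod can report Ready before quorum forms, or by fixing the pod template and then deleting the stuck pod manually, since the controller does not replace a pod that is merely not Ready. If the application supports parallel start, use podManagementPolicy: Parallel instead; the field is immutable, so an existing StatefulSet must be deleted and re-created to change it.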

    RollingUpdate with Partition (Canary Upgrades)

    The partition field causes the controller to update only pods with ordinal ≥ partition. Pods with ordinal < partition keep their current revision.

    # Update only pod-2 (highest ordinal) to test new version
    kubectl patch sts postgres -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
    kubectl rollout status sts/postgres
    
    # If postgres-2 is healthy, expand to pod-1 and pod-2
    kubectl patch sts postgres -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'
    
    # Full rollout
    kubectl patch sts postgres -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
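
    The partition rule (update ordinals ≥ partition, highest first) can be checked offline before patching. This is a minimal sketch; ordinals_to_update is a hypothetical helper, not a kubectl feature.

```shell
# Hypothetical helper: ordinals a RollingUpdate touches, in update order
# (highest ordinal first), for $1 replicas and partition $2.
ordinals_to_update() (
  replicas=$1 partition=$2
  i=$((replicas - 1))
  while [ "$i" -ge "$partition" ]; do
    printf '%s\n' "$i"
    i=$((i - 1))
  done
)

ordinals_to_update 3 2   # 2
ordinals_to_update 3 1   # 2, then 1
```

    A partition equal to the replica count yields no output, which is the "freeze everything" setting.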

    OnDelete Strategy (Manual DB Rolling Upgrade)

    spec:
      updateStrategy:
        type: OnDelete
    # Pods are only updated when manually deleted — you control timing
    # Essential for: Patroni (switch the primary over before restarting it), Galera, Cassandra
    # Workflow:
    #   1. kubectl set image sts/postgres postgres=postgres:16.2
    #   2. kubectl delete pod postgres-2  # kills replica, controller creates new pod with new image
    #   3. Verify postgres-2 healthy, then delete postgres-1
    #   4. For primary (postgres-0): trigger manual failover first
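
    The workflow above amounts to "replicas from the highest ordinal down, primary last". A sketch of that ordering follows; ondelete_order is a hypothetical helper, and the primary's ordinal is an input because after a failover it is not necessarily 0.

```shell
# Hypothetical helper: pod names in the order to delete them under OnDelete.
# $1 = StatefulSet name, $2 = replica count, $3 = ordinal of the current primary.
ondelete_order() (
  sts=$1 replicas=$2 primary=$3
  i=$((replicas - 1))
  while [ "$i" -ge 0 ]; do
    if [ "$i" -ne "$primary" ]; then
      printf '%s-%s\n' "$sts" "$i"
    fi
    i=$((i - 1))
  done
  # The primary goes last, after a manual failover or switchover.
  printf '%s-%s\n' "$sts" "$primary"
)

ondelete_order postgres 3 0
# postgres-2
# postgres-1
# postgres-0
```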

    Headless Service and DNS

    A headless service (clusterIP: None) is required for StatefulSets. It creates per-pod DNS A records rather than a single VIP, enabling direct pod addressing.

    apiVersion: v1
    kind: Service
    metadata:
      name: postgres-headless
      namespace: data
    spec:
      clusterIP: None         # headless — no VIP
      publishNotReadyAddresses: true  # include NotReady pods (important for init)
      selector:
        app: postgres
      ports:
      - name: postgres
        port: 5432

    DNS records created for StatefulSet "postgres" with headless service "postgres-headless" in namespace "data":

    Pod hostname format: {pod-name}.{service}.{namespace}.svc.cluster.local
      postgres-0.postgres-headless.data.svc.cluster.local → 10.0.1.10
      postgres-1.postgres-headless.data.svc.cluster.local → 10.0.1.11
      postgres-2.postgres-headless.data.svc.cluster.local → 10.0.1.12

    Service DNS (round-robin A records):
      postgres-headless.data.svc.cluster.local → [10.0.1.10, 10.0.1.11, 10.0.1.12]

    With publishNotReadyAddresses: true → DNS includes not-yet-ready pods (pod-0 during init can find pod-1 seed)
    Without it → pods invisible until Ready (can deadlock peer-discovery)
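
    The per-pod hostname format can be generated mechanically, which is handy when templating connection strings. The sketch below assumes the default cluster domain cluster.local (overridable through a fifth argument); pod_fqdn is a hypothetical helper name.

```shell
# Hypothetical helper: stable DNS name of one StatefulSet pod.
# $1 = StatefulSet, $2 = ordinal, $3 = headless service, $4 = namespace,
# $5 = cluster domain (assumed to be cluster.local when omitted).
pod_fqdn() (
  sts=$1 ordinal=$2 service=$3 namespace=$4
  domain=${5:-cluster.local}
  printf '%s-%s.%s.%s.svc.%s\n' "$sts" "$ordinal" "$service" "$namespace" "$domain"
)

pod_fqdn postgres 0 postgres-headless data
# postgres-0.postgres-headless.data.svc.cluster.local
```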

    Anti-Affinity and Topology

    Hard Anti-Affinity (One Pod Per Node)

    spec:
      template:
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: postgres
                topologyKey: kubernetes.io/hostname
                # Prevents any two postgres pods on the same node
                # Blocks scheduling if insufficient nodes — use "preferred" for flexibility

    Zone-Spread Anti-Affinity (HA Across AZs)

          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app: postgres
                  topologyKey: topology.kubernetes.io/zone

    TopologySpreadConstraints (Preferred over Anti-Affinity)

          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule   # hard: block if can't satisfy
            labelSelector:
              matchLabels:
                app: postgres
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway  # soft: spread but don't block
            labelSelector:                     # needed here too; without it no pods are counted
              matchLabels:
                app: postgres
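
    The maxSkew check can be reasoned about with simple arithmetic: a placement is acceptable when the candidate domain's count plus the incoming pod, minus the smallest count across eligible domains, does not exceed maxSkew. The sketch below models only that arithmetic (no node filtering, taints, or minDomains); can_place is a hypothetical name.

```shell
# Hypothetical helper: succeeds if adding one pod to a domain that currently
# holds $2 matching pods keeps the spread within maxSkew $1.
# Remaining arguments: current matching-pod counts of all eligible domains.
can_place() (
  max_skew=$1 candidate=$2
  shift 2
  min=$1
  for n in "$@"; do
    if [ "$n" -lt "$min" ]; then min=$n; fi
  done
  [ $((candidate + 1 - min)) -le "$max_skew" ]
)

# Zones currently hold 2, 2, and 1 postgres pods; maxSkew is 1.
can_place 1 1 2 2 1 && echo "zone with 1 pod: allowed"
can_place 1 2 2 2 1 || echo "zone with 2 pods: blocked"
```

    With whenUnsatisfiable: DoNotSchedule a "blocked" domain is filtered out; with ScheduleAnyway it is only scored lower.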

    PodDisruptionBudgets for Stateful Apps

    # PostgreSQL primary/replica: always keep at least 1 pod running.
    # A PDB counts pods, not roles, so it cannot guarantee that the survivor is the primary.
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: postgres-pdb
      namespace: data
    spec:
      minAvailable: 1
      selector:
        matchLabels:
          app: postgres
    ---
    # Kafka 3-broker cluster with min.insync.replicas=2:
    # Must keep at least 2 brokers to accept writes
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: kafka-pdb
    spec:
      minAvailable: 2
      selector:
        matchLabels:
          app: kafka
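
    For quorum-based systems, minAvailable usually follows from majority arithmetic. The helpers below are a sketch of that calculation only; quorum and allowed_disruptions are hypothetical names, and the right minAvailable for a given system still depends on its own replication settings (for Kafka, min.insync.replicas).

```shell
# Smallest majority of $1 voting members.
quorum() ( echo $(( $1 / 2 + 1 )) )

# Pods that may be evicted at once: replicas ($1) minus minAvailable ($2).
allowed_disruptions() ( echo $(( $1 - $2 )) )

quorum 3                  # 2
quorum 5                  # 3
allowed_disruptions 3 2   # 1 (the Kafka PDB above)
allowed_disruptions 3 1   # 2 (the PostgreSQL PDB above)
```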

    PostgreSQL on Kubernetes

    CloudNativePG Operator (Recommended)

    CloudNativePG is the CNCF-sandbox operator for PostgreSQL. It manages primary/replica topology, automatic failover, backup integration, and connection pooling via PgBouncer. Use it instead of manual StatefulSet configuration for production.

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: postgres-cluster
      namespace: data
    spec:
      instances: 3
      imageName: ghcr.io/cloudnative-pg/postgresql:16.2
    
      storage:
        size: 100Gi
        storageClass: ebs-gp3
    
      walStorage:
        size: 20Gi
        storageClass: ebs-io2    # separate fast storage for WAL
    
      backup:
        retentionPolicy: "30d"
        barmanObjectStore:
          destinationPath: s3://my-bucket/postgres
          s3Credentials:
            accessKeyId:
              name: aws-creds
              key: ACCESS_KEY_ID
            secretAccessKey:
              name: aws-creds
              key: SECRET_ACCESS_KEY
    
      affinity:
        enablePodAntiAffinity: true
        topologyKey: topology.kubernetes.io/zone
    
      resources:
        requests:
          memory: 4Gi
          cpu: "2"
        limits:
          memory: 8Gi

    Patroni-Based StatefulSet (Manual)

    For teams not using an operator, Patroni provides HA PostgreSQL with etcd/consul/Kubernetes leader election. The key pattern is an init container that runs pg_basebackup on replicas to initialize from the primary.

    initContainers:
    - name: init-replica
      image: postgres:16
      command:
      - sh
      - -c
      - |
        # Skip if primary (ordinal 0) or if data dir already initialized
        ORDINAL=${HOSTNAME##*-}   # pod name ends in -<ordinal>
        if [ "$ORDINAL" = "0" ] || [ -f "/var/lib/postgresql/data/pgdata/PG_VERSION" ]; then
          exit 0
        fi
        # Wait for primary to be ready
        until pg_isready -h postgres-0.postgres-headless.data.svc.cluster.local -p 5432; do
          echo "Waiting for primary..."; sleep 2
        done
        # Stream base backup from primary
        pg_basebackup \
          -h postgres-0.postgres-headless.data.svc.cluster.local \
          -U replication \
          -D /var/lib/postgresql/data/pgdata \
          -Xs -P -R    # -R writes standby.signal and primary_conninfo (PostgreSQL 12+; older releases used recovery.conf)
      volumeMounts:
      - name: data
        mountPath: /var/lib/postgresql/data

    Cassandra on Kubernetes

    Cassandra uses a gossip protocol and a token ring. Kubernetes StatefulSets map well because Cassandra nodes require stable DNS names for seed discovery and stable storage for SSTables.

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: cassandra
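    spec:
      serviceName: cassandra
      replicas: 3
      selector:              # required; must match the pod template labels
        matchLabels:
          app: cassandra     # label name chosen for this example
      template:
        metadata:
          labels:
            app: cassandra
        spec: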
          containers:
          - name: cassandra
            image: cassandra:4.1
            env:
            - name: CASSANDRA_SEEDS
              # First two pods as seeds; headless DNS provides stable addresses
              value: "cassandra-0.cassandra.data.svc.cluster.local,cassandra-1.cassandra.data.svc.cluster.local"
            - name: CASSANDRA_CLUSTER_NAME
              value: "production"
            - name: CASSANDRA_DC
              value: "dc1"
            - name: CASSANDRA_RACK
              # Use node label for rack awareness; requires Downward API or env injection
              value: "rack1"
            - name: MAX_HEAP_SIZE
              value: "8192M"
            - name: HEAP_NEWSIZE
              value: "2048M"
            resources:
              requests:
                memory: 16Gi    # Cassandra is memory-hungry
                cpu: "4"
            readinessProbe:
              exec:
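                # Double quotes inside bash -c so the address expands (single quotes would leave it literal);
                # hostname -i is assumed to return the pod's single IP
                command: ["/bin/bash", "-c", "nodetool status | grep -E \"^UN\\s+$(hostname -i)\""]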
              initialDelaySeconds: 90
              periodSeconds: 10
            volumeMounts:
            - name: cassandra-data
              mountPath: /var/lib/cassandra
      volumeClaimTemplates:
      - metadata:
          name: cassandra-data
        spec:
          accessModes: [ReadWriteOnce]
          storageClassName: local-nvme   # local NVMe for compaction I/O
          resources:
            requests:
              storage: 500Gi

    Cassandra Bootstrap Order

    Use podManagementPolicy: OrderedReady (the default) for initial cluster bootstrap: each node must join the ring and reach UN (Up/Normal) state before the next node starts. The field cannot be changed on an existing StatefulSet, and Parallel affects only scaling operations, not rolling updates, so rolling restarts proceed one pod at a time in reverse ordinal order under either policy.
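
    The CASSANDRA_SEEDS value in the manifest is just the headless DNS names of the first pods joined by commas, so it can be generated instead of hand-written. A minimal sketch, assuming the default cluster.local domain; cassandra_seeds is a hypothetical helper name.

```shell
# Hypothetical helper: comma-separated seed list built from the first $4 pods
# of StatefulSet $1 behind headless service $2 in namespace $3.
cassandra_seeds() (
  sts=$1 service=$2 namespace=$3 count=$4
  seeds=""
  i=0
  while [ "$i" -lt "$count" ]; do
    # ${seeds:+$seeds,} adds the separator only when the list is non-empty.
    seeds="${seeds:+$seeds,}${sts}-${i}.${service}.${namespace}.svc.cluster.local"
    i=$((i + 1))
  done
  printf '%s\n' "$seeds"
)

cassandra_seeds cassandra cassandra data 2
# cassandra-0.cassandra.data.svc.cluster.local,cassandra-1.cassandra.data.svc.cluster.local
```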

    Kafka on Kubernetes

    Kafka brokers maintain persistent log segments. The broker ID maps naturally to the StatefulSet pod ordinal. Strimzi is a widely used Kubernetes operator for Kafka.

    Strimzi Kafka Cluster (Operator Pattern)

    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
      name: production-kafka
    spec:
      kafka:
        version: 3.7.0
        replicas: 3
        listeners:
        - name: plain
          port: 9092
          type: internal
          tls: false
        - name: tls
          port: 9093
          type: internal
          tls: true
        config:
          offsets.topic.replication.factor: 3
          transaction.state.log.replication.factor: 3
          transaction.state.log.min.isr: 2
          default.replication.factor: 3
          min.insync.replicas: 2     # producers with acks=all require 2 ISR
          log.retention.hours: 168   # 7 days
        storage:
          type: persistent-claim
          size: 500Gi
          class: ebs-gp3
          deleteClaim: false         # retain PVCs on Kafka cluster delete
        resources:
          requests:
            memory: 8Gi
            cpu: "2"
          limits:
            memory: 16Gi
      zookeeper:
        replicas: 3
        storage:
          type: persistent-claim
          size: 20Gi
          class: ebs-gp3

    KRaft Mode (ZooKeeper-Free, Kafka 3.3+ Production-Ready)

      kafka:
        version: 3.7.0
        metadataVersion: 3.7-IV4
        replicas: 3
        # No zookeeper section — KRaft uses internal metadata topic
        # Combined mode: broker + controller roles in same pods
        # Separate mode (recommended for large clusters): dedicated controller nodes

    MongoDB on Kubernetes

    apiVersion: mongodbcommunity.mongodb.com/v1
    kind: MongoDBCommunity
    metadata:
      name: mongodb-replica-set
    spec:
      members: 3
      type: ReplicaSet
      version: "7.0.4"
      security:
        authentication:
          modes: ["SCRAM"]
      users:
      - name: admin
        db: admin
        passwordSecretRef:
          name: admin-password
        roles:
        - name: clusterAdmin
          db: admin
      statefulSet:
        spec:
          volumeClaimTemplates:
          - metadata:
              name: data-volume
            spec:
              accessModes: [ReadWriteOnce]
              storageClassName: ebs-gp3
              resources:
                requests:
                  storage: 200Gi
          - metadata:
              name: logs-volume
            spec:
              accessModes: [ReadWriteOnce]
              storageClassName: ebs-gp3
              resources:
                requests:
                  storage: 10Gi

    Redis on Kubernetes

    Redis Sentinel (HA for Single Shard)

    # Redis StatefulSet with persistent storage
    # Sentinel monitors primary and triggers failover
    # Primary: redis-0; Replicas: redis-1, redis-2
    # Sentinel quorum: 2 of 3 sentinels must agree for failover
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: redis
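    spec:
      serviceName: redis-headless
      replicas: 3
      selector:                      # required; must match the pod template labels
        matchLabels:
          app: redis                 # label name chosen for this example
      template:
        metadata:
          labels:
            app: redis
        spec:
          initContainers:
          - name: config
            image: redis:7.2
            command: ["sh", "-c"]
            args:
            - |
              # Every pod starts from the base config; replicas then append replicaof
              cp /tmp/redis-default.conf /etc/redis/redis.conf
              ORDINAL=${HOSTNAME##*-}
              if [ "$ORDINAL" != "0" ]; then
                echo "replicaof redis-0.redis-headless.data.svc.cluster.local 6379" >> /etc/redis/redis.conf
              fi
            volumeMounts:
            - name: config
              mountPath: /etc/redis
            - name: default-config
              mountPath: /tmp/redis-default.conf
              subPath: redis-default.conf
          containers:
          - name: redis
            image: redis:7.2
            command: ["redis-server", "/etc/redis/redis.conf"]
            volumeMounts:
            - name: data
              mountPath: /data
            - name: config            # generated by the init container
              mountPath: /etc/redis
          volumes:
          - name: config
            emptyDir: {}
          - name: default-config
            configMap:
              name: redis-default     # assumed ConfigMap holding redis-default.conf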
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes: [ReadWriteOnce]
          storageClassName: ebs-gp3
          resources:
            requests:
              storage: 50Gi

    Local PVs with StatefulSets

    Local PVs bind directly to a disk on a specific node (NVMe in this example), which gives the lowest latency of the options covered here. The tradeoff: if the node fails, the data is inaccessible until the node recovers, and it is lost if the disk itself fails.

    # Local PV — manually provisioned, node-affinity required
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: local-pv-node1-nvme
    spec:
      capacity:
        storage: 1Ti
      accessModes: [ReadWriteOnce]
      persistentVolumeReclaimPolicy: Retain
      storageClassName: local-nvme
      local:
        path: /mnt/nvme0n1
      nodeAffinity:
        required:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: [node1]
    ---
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-nvme
    provisioner: kubernetes.io/no-provisioner
    volumeBindingMode: WaitForFirstConsumer  # critical: bind only when pod is scheduled

    Local PV Node Failure Means Data Inaccessibility

    When the node hosting a local PV goes down, the pod cannot reschedule to another node — it is stuck in Pending indefinitely because no other node has the required PV. Recovery requires either restoring the node, manually moving data, or re-seeding the replica. Use Longhorn or Rook-Ceph for automatic replication if node failure tolerance is required.

    Longhorn

    Longhorn (CNCF incubating) is a distributed block storage system built for Kubernetes. It replicates volumes across nodes, handles node failures automatically, and provides S3-backed snapshots and backups.

    Longhorn Architecture:

    ┌─────────────────────────────────────────────────────────┐
    │ Kubernetes Cluster                                      │
    │                                                         │
    │  ┌──────────┐   ┌──────────┐   ┌──────────┐             │
    │  │ Node 1   │   │ Node 2   │   │ Node 3   │             │
    │  │          │   │          │   │          │             │
    │  │ longhorn-│   │ longhorn-│   │ longhorn-│             │
    │  │ manager  │   │ manager  │   │ manager  │ ← DaemonSet │
    │  │ (agent)  │   │ (agent)  │   │ (agent)  │             │
    │  │          │   │          │   │          │             │
    │  │ Engine   │   │ Replica  │   │ Replica  │             │
    │  │ Process  │◄──│ Process  │   │ Process  │             │
    │  │ (active) │   │(replica1)│   │(replica2)│             │
    │  │    ▲     │   │ /dev/    │   │ /dev/    │             │
    │  │    │     │   │ longhorn │   │ longhorn │             │
    │  └────│─────┘   └──────────┘   └──────────┘             │
    │       │ iSCSI/NVMe-oF                                   │
    │  ┌────┴──────────────────────────────────────────────┐  │
    │  │ Application Pod (postgres-0)                      │  │
    │  │ PVC: data-postgres-0 → Longhorn Volume            │  │
    │  └───────────────────────────────────────────────────┘  │
    └─────────────────────────────────────────────────────────┘

    # Longhorn StorageClass
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: longhorn
      annotations:
        storageclass.kubernetes.io/is-default-class: "true"
    provisioner: driver.longhorn.io
    allowVolumeExpansion: true
    parameters:
      numberOfReplicas: "3"          # replicate across 3 nodes
      staleReplicaTimeout: "2880"    # 48 hours before treating stale replica as failed
      fromBackup: ""
      fsType: ext4
      diskSelector: ""               # target specific disk tags
      nodeSelector: ""               # target specific node tags
      recurringJobSelector: '[{"name":"backup", "isGroup":true}]'

    Volume Replication

    Each Longhorn volume has N replicas on N distinct nodes. The engine process runs on the node hosting the pod; replicas sync over the network. The replica count is configurable per volume.

    Backup to S3

    Incremental snapshots are pushed to S3, MinIO, or NFS. The RecurringJob CRD schedules automatic backups. A restore creates a new PVC from a backup, and cross-cluster recovery is supported.

    Node Drain Safe

    Longhorn respects node cordon/drain — engine migrates to another node automatically when replicas exist elsewhere. PodDisruptionBudget enforced during drain.

    UI Dashboard

    Built-in web UI shows volume health, replica placement, backup status. Expose via Service or Ingress. RBAC controls available via Longhorn RBAC resources.

    Rook-Ceph

    Rook is a CNCF-graduated cloud-native storage orchestrator. Rook deploys and manages Ceph — a production-grade distributed storage system providing block (RBD), file (CephFS), and object (RGW/S3) storage from the same cluster.

    Rook-Ceph Architecture:

    ┌───────────────────────────────────────────────────────────────┐
    │ Rook-Ceph Namespace                                           │
    │                                                               │
    │ rook-ceph-operator (Deployment)                               │
    │ └── watches CephCluster/CephBlockPool/CephFilesystem CRDs     │
    │                                                               │
    │ Ceph Daemons (managed by operator):                           │
    │ ┌─────────┐ ┌────────────────────────────────────────────┐    │
    │ │ MON     │ │ OSD (one Deployment per device)            │    │
    │ │ (x3)    │ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐        │    │
    │ │ cluster │ │ │OSD-0 │ │OSD-1 │ │OSD-2 │ │OSD-3 │        │    │
    │ │ state   │ │ │node1 │ │node1 │ │node2 │ │node3 │        │    │
    │ │ quorum  │ │ │/dev/ │ │/dev/ │ │/dev/ │ │/dev/ │        │    │
    │ └─────────┘ │ │nvme0 │ │nvme1 │ │nvme0 │ │nvme0 │        │    │
    │             │ └──────┘ └──────┘ └──────┘ └──────┘        │    │
    │ ┌─────────┐ └────────────────────────────────────────────┘    │
    │ │ MGR     │                                                   │
    │ │ (x2 HA) │ ┌──────────────┐ ┌──────────────────────────┐     │
    │ │dashboard│ │ MDS (x2)     │ │ RGW (S3 endpoint)        │     │
    │ └─────────┘ │ (CephFS MDS) │ │ (object storage)         │     │
    │             └──────────────┘ └──────────────────────────┘     │
    └───────────────────────────────────────────────────────────────┘

    CephCluster CRD

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        image: quay.io/ceph/ceph:v18.2.2   # Reef
      dataDirHostPath: /var/lib/rook       # persisted on host for mon data
      mon:
        count: 3
        allowMultiplePerNode: false
      mgr:
        count: 2
        modules:
        - name: pg_autoscaler
          enabled: true
      dashboard:
        enabled: true
        ssl: true
      storage:
        useAllNodes: false
        useAllDevices: false
        nodes:
        - name: "node1"
          devices:
          - name: "nvme0n1"
          - name: "nvme1n1"
        - name: "node2"
          devices:
          - name: "nvme0n1"
        - name: "node3"
          devices:
          - name: "nvme0n1"
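      placement:
        osd:
          # Soft rule: a required hostname anti-affinity would block the second OSD on node1
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: rook-ceph-osd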

    Ceph Block Storage (RBD) StorageClass

    apiVersion: ceph.rook.io/v1
    kind: CephBlockPool
    metadata:
      name: replicapool
      namespace: rook-ceph
    spec:
      failureDomain: host     # CRUSH: tolerate full host failure
      replicated:
        size: 3
        requireSafeReplicaSize: true
    ---
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: rook-ceph-block
    provisioner: rook-ceph.rbd.csi.ceph.com
    parameters:
      clusterID: rook-ceph
      pool: replicapool
      imageFormat: "2"
      imageFeatures: layering   # required for snapshots; add exclusive-lock,object-map,fast-diff for better performance
      csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
      csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
      csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
      csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
      csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
      csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
    allowVolumeExpansion: true
    reclaimPolicy: Delete

    CephFS StorageClass (ReadWriteMany)

    apiVersion: ceph.rook.io/v1
    kind: CephFilesystem
    metadata:
      name: myfs
      namespace: rook-ceph
    spec:
      metadataPool:
        replicated:
          size: 3
      dataPools:
      - name: data0
        replicated:
          size: 3
      preserveFilesystemOnDelete: true   # don't wipe filesystem on CRD delete
      metadataServer:
        activeCount: 1
        activeStandby: true
    ---
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: rook-cephfs
    provisioner: rook-ceph.cephfs.csi.ceph.com
    parameters:
      clusterID: rook-ceph
      fsName: myfs
      pool: myfs-data0
      csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
      csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
      csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
      csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
    allowVolumeExpansion: true
    reclaimPolicy: Delete

    CRUSH Map for Rack-Awareness

    Set failureDomain: rack in CephBlockPool and label nodes with topology.rook.io/rack=rack1 (rack2, rack3, and so on) to spread replicas across physical racks. With 3 racks and replica size 3, a full rack failure is tolerated with no data loss. The minimum is one OSD per rack; several OSDs per rack leave CRUSH room to rebalance after a disk failure.

    Backup Strategies for Stateful Workloads

    | Strategy | Mechanism | Pros | Cons | Best For |
    |---|---|---|---|---|
    | PVC Volume Snapshot | CSI CreateSnapshot RPC | Fast, crash-consistent | Not app-consistent without quiesce | Caches, replicas with WAL replay |
    | App-level dump | pg_dump / mongodump / redis-cli BGSAVE | Fully consistent, portable SQL | Slow for large DBs, CPU overhead | Primary source of truth DBs |
    | pg_basebackup / pgBackRest | Streaming replication protocol | Fast physical backup, WAL archiving, PITR | PostgreSQL-specific | PostgreSQL production backups |
    | Velero + CSI snapshots | Velero + VolumeSnapshot API | Cluster-wide, namespace backup, manifest + data together | Requires CSI snapshot support | Full namespace DR |
    | Longhorn backup | Longhorn recurring job → S3 | Incremental, integrated with Longhorn volumes | Longhorn-specific | Longhorn-managed volumes |
    | Rook-Ceph RBD mirroring | Ceph RBD mirroring to remote cluster | Async replication to DR cluster | Complex setup, separate Ceph cluster | Active-passive cross-cluster DR |

    Metrics, Alerts, and Runbooks

    Key Metrics

    | Metric | Source | What to Watch |
    |---|---|---|
    | kube_statefulset_status_replicas_ready | kube-state-metrics | Should equal kube_statefulset_replicas; alert on mismatch |
    | kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision | kube-state-metrics | Revisions differ (the revision is a label, so compare label sets): incomplete rolling update; alert if stale > 15m |
    | kube_persistentvolumeclaim_status_phase{phase="Lost"} == 1 | kube-state-metrics | PVC Lost = pod stuck Pending; page immediately |
    | kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes | kubelet | Alert at 80% and 90% utilization |
    | kube_pod_container_status_restarts_total | kube-state-metrics | StatefulSet pod restarts may indicate data corruption or OOM |

    Alerting Rules

    groups:
    - name: stateful-storage
      rules:
      - alert: StatefulSetNotFullyReady
        expr: |
          kube_statefulset_status_replicas_ready
            != kube_statefulset_replicas
        for: 5m
        annotations:
          summary: "StatefulSet {{ $labels.statefulset }} not fully ready"
    
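      - alert: StatefulSetRolloutStuck
        # Both series have value 1 and carry the revision as a label, so "!="
        # never matches when revisions differ; "unless" returns the current
        # revision series only while no identical update revision exists.
        expr: |
          kube_statefulset_status_current_revision
            unless kube_statefulset_status_update_revision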
        for: 15m
        annotations:
          summary: "StatefulSet {{ $labels.statefulset }} rollout stuck"
    
      - alert: PVCDiskPressureWarning
        expr: |
          (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.80
        for: 5m
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} >80% full"
    
      - alert: PVCDiskPressureCritical
        expr: |
          (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.90
        for: 2m
        labels:
          severity: critical

    Runbooks

    StatefulSet Pod Stuck Pending

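    Check PVC status (kubectl get pvc). If Pending: StorageClass mismatch or quota exhaustion; with volumeBindingMode: WaitForFirstConsumer the PVC also stays Pending until the pod can be scheduled, so check the pod's scheduling events as well. If Lost: the bound PV was deleted; recreate the PV with the original volumeHandle and a claimRef carrying the PVC's UID. If no PVC exists: the StatefulSet controller could not create it from volumeClaimTemplates; check the events in kubectl describe statefulset.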

    StatefulSet Rollout Stuck

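    Check the readiness of the stuck pod: kubectl describe pod sts-N. Common causes: the new image crashes on startup or the readiness probe fails. With OrderedReady the controller waits for that pod to become Ready before it touches any other pod, so reverting the template alone does not unblock the rollout: run kubectl rollout undo statefulset/sts (or fix the template), then delete the stuck pod so it is recreated from the good revision. While you investigate, set partition to the stuck pod's ordinal so that lower ordinals stay on the old revision.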

    PVC Full — Online Expansion

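    Edit spec.resources.requests.storage on each PVC to a larger value (the StorageClass must have allowVolumeExpansion: true). Edit the PVCs directly: volumeClaimTemplates is immutable, so the StatefulSet cannot be updated to the new size in place. Drivers that support online expansion resize the filesystem while the pod keeps running; otherwise the PVC carries the FileSystemResizePending condition until the pod is restarted. Check that the CSI driver implements ControllerExpandVolume.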
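
    The resize itself is a one-field change. A minimal sketch, assuming a hypothetical PVC named data-pg-0 and a hypothetical target size of 200Gi: save the fragment below as expand.yaml and apply it with kubectl patch pvc data-pg-0 --patch-file expand.yaml, once per PVC.

    ```yaml
    # expand.yaml: patch fragment, not a full PVC manifest
    spec:
      resources:
        requests:
          storage: 200Gi   # hypothetical new size; must be larger than the current request
    ```

    Volumes can only grow. A request smaller than the current size is rejected by the API server.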

    Rook-Ceph OSD Down

    Check the OSD pods: kubectl -n rook-ceph get pod -l app=rook-ceph-osd. If the node failed, Ceph marks the OSD out and starts re-replicating its data after mon_osd_down_out_interval (10 min default). Check PG status: kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status. Alert if PGs stay in undersized+degraded for more than 30 min.

    Orphaned PVCs After StatefulSet Delete

    List them by label selector: kubectl get pvc -n ns -l app=deleted-sts. Verify that no running pod uses them: kubectl get pod -n ns -o json | jq -r '.items[].spec.volumes[]? | .persistentVolumeClaim.claimName | select(. != null)'. Delete them manually only after confirming the data is backed up.

    Best Practices

    1. Always use a headless service (clusterIP: None) with publishNotReadyAddresses: true for peer discovery during init. Never route intra-cluster database replication through a shared ClusterIP service, which load-balances across pods and hides the identity of each peer.
    2. Set terminationGracePeriodSeconds appropriately — databases need time to flush WAL, finish checkpoints, and gracefully close connections. 60–120 seconds is typical; never leave at default 30s for PostgreSQL or Cassandra.
    3. Choose podManagementPolicy deliberately at creation time. The field is immutable, and it affects only scaling operations (pod creation and deletion), not rolling updates. Use OrderedReady for databases that require a leader or seed node before followers start; use Parallel only when pods can bootstrap independently, where it speeds up scale-out of large clusters.
    4. Pin partition to the highest ordinal during upgrades — canary one replica before rolling all. Especially critical for major Kafka, PostgreSQL, and Cassandra version upgrades with format changes.
    5. Never rely on StatefulSet PVC auto-cleanup for production data — keep persistentVolumeClaimRetentionPolicy.whenDeleted: Retain and whenScaled: Retain for all production databases. Automate orphaned PVC detection and manual review.
    6. Separate data and WAL/log volumes — put WAL on a separate higher-IOPS StorageClass (io2 vs gp3). This allows independent sizing and prevents WAL I/O from contending with table data reads.
    7. Use an operator for complex databases — CloudNativePG for PostgreSQL, Strimzi for Kafka, MongoDB Community Operator for MongoDB. Manual StatefulSet management misses critical details (failover logic, backup scheduling, rolling upgrade ordering, connection pooling).
    8. Test failover and recovery procedures regularly: simulate node failure, test backup restore, and validate that replicas rejoin after a restart. Many storage failure modes surface only during real incidents, not in routine monitoring, so rehearse them beforehand.
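
    Several of these practices map directly onto manifest fields. The following is a structural sketch only, not a working database deployment: the names (db, db-headless), image, port, mount paths, sizes, and the gp3/io2 StorageClass names are all assumptions, and the container configuration a real database needs is omitted.

    ```yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: db-headless                  # hypothetical name
    spec:
      clusterIP: None                    # practice 1: headless, one DNS record per pod
      publishNotReadyAddresses: true     # peers resolvable before they are Ready
      selector:
        app: db
      ports:
      - name: db
        port: 5432                       # hypothetical port
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: db
    spec:
      serviceName: db-headless
      replicas: 3
      podManagementPolicy: OrderedReady  # practice 3: fixed at creation time
      selector:
        matchLabels:
          app: db
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          partition: 2                   # practice 4: only ordinal 2 gets the new revision
      persistentVolumeClaimRetentionPolicy:
        whenDeleted: Retain              # practice 5: never auto-delete production PVCs
        whenScaled: Retain
      template:
        metadata:
          labels:
            app: db
        spec:
          terminationGracePeriodSeconds: 120   # practice 2: time to flush and checkpoint
          containers:
          - name: db
            image: example.com/db:1.0    # hypothetical image
            volumeMounts:
            - name: data
              mountPath: /data           # hypothetical path
            - name: wal
              mountPath: /wal            # hypothetical path
      volumeClaimTemplates:              # practice 6: separate data and WAL volumes
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp3          # hypothetical StorageClass name
          resources:
            requests:
              storage: 100Gi
      - metadata:
          name: wal
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: io2          # hypothetical higher-IOPS StorageClass name
          resources:
            requests:
              storage: 20Gi
    ```

    Lowering partition step by step (2, then 1, then 0) rolls the new revision out one ordinal at a time once the canary replica has proven healthy.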