Stateful Storage Patterns — Kubernetes Docs

What This Page Covers

StatefulSet vs Deployment storage semantics: stable identity, ordered management

volumeClaimTemplates anatomy — naming convention, multiple templates, immutability

PVC lifecycle: orphaned PVCs, no auto-delete on StatefulSet delete

persistentVolumeClaimRetentionPolicy (beta 1.27, GA 1.32): whenDeleted/whenScaled

podManagementPolicy: OrderedReady vs Parallel with storage implications

Ordered provisioning sequence and scaling constraints for stateful apps

RollingUpdate with partition field — canary upgrades for databases

OnDelete upgrade strategy for manual-controlled database rolling upgrades

Pod anti-affinity patterns — hostname and zone topology keys

TopologySpreadConstraints for even replica distribution

PodDisruptionBudgets for quorum-based and leader-elected systems

Headless service deep-dive — DNS records per pod, stable DNS names

PostgreSQL on Kubernetes: CloudNativePG operator, Patroni, pg_basebackup replica init

Cassandra on Kubernetes — rack-awareness, seed discovery via headless DNS

Kafka on Kubernetes — KRaft vs ZooKeeper, broker ID = ordinal, topic replication

MongoDB on Kubernetes — replica set via MongoDB Community Operator

Redis Cluster and Redis Sentinel topologies

Local PVs with StatefulSets — NVMe performance, node failure data-loss warning

Longhorn architecture — manager DaemonSet, volume replication, S3 backup

Rook-Ceph architecture — CephCluster CRD, OSD placement, CRUSH rack-awareness

Rook-Ceph storage classes: RBD (block) and CephFS (shared filesystem)

Init container patterns — replica initialization, seed node bootstrap, data migration

Backup strategies — PVC snapshot + application-level + Velero integration

5 metrics + 4 alerting rules + 5 runbooks

8 best practices for production stateful workloads

StatefulSet vs Deployment: Storage Semantics

Deployments treat pods as interchangeable: any pod can serve any request, and storage is either a single PVC shared by all replicas (typically ReadWriteMany) or absent altogether. StatefulSets assign each pod a stable identity: a fixed ordinal (0, 1, 2…), a stable DNS hostname, and a dedicated PVC that is reattached whenever the pod is rescheduled. By default the controller never deletes or re-creates that PVC; it persists even after the pod or the StatefulSet itself is deleted.

Characteristic	Deployment	StatefulSet
Pod identity	Random hash suffix (random-a1b2c)	Ordinal suffix (db-0, db-1)
DNS hostname	Unpredictable	Stable: pod-name.headless-svc.ns.svc.cluster.local
PVC per pod	Shared PVC or no PVC	Dedicated PVC per pod via volumeClaimTemplates
Pod creation order	All at once (parallel)	Sequential 0→N (OrderedReady) or parallel
Pod deletion order	Random	Sequential N→0
PVC on pod delete	N/A	PVC retained (not deleted with pod)
PVC on StatefulSet delete	N/A	PVCs orphaned by default (Retain policy)
Rolling update direction	New pods before old	Reverse ordinal (N→0)

volumeClaimTemplates

volumeClaimTemplates defines PVC templates that the StatefulSet controller instantiates for each pod. Each template creates a PVC named {template.metadata.name}-{pod-name}. For a StatefulSet named postgres with template name data, pod postgres-0 gets PVC data-postgres-0.

Full Anatomy

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: data
spec:
  serviceName: postgres-headless   # required headless service name
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  podManagementPolicy: OrderedReady  # default; sequential pod lifecycle
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0   # update all pods; set to N to freeze pods 0..N-1
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: postgres
        image: postgres:16
        env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        - name: wal
          mountPath: /var/lib/postgresql/wal
  volumeClaimTemplates:
  - metadata:
      name: data
      # PVCs survive StatefulSet deletion by default; persistentVolumeClaimRetentionPolicy can change that
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: ebs-gp3
      resources:
        requests:
          storage: 100Gi
  - metadata:
      name: wal
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: ebs-io2   # separate faster disk for WAL
      resources:
        requests:
          storage: 20Gi

volumeClaimTemplates Are Immutable

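Once a StatefulSet is created, volumeClaimTemplates cannot be modified; the API server rejects the update. Changing the storage class or access modes means migrating data to new PVCs. Changing the storage size means expanding the existing PVCs directly (if the StorageClass allows expansion) and, to make the template match, deleting the StatefulSet with --cascade=orphan and re-creating it. The orphan flag keeps the pods running; the PVCs are retained by default either way.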

PVC Naming Convention

StatefulSet	Template Name	Replica 0	Replica 1	Replica 2
postgres	data	data-postgres-0	data-postgres-1	data-postgres-2
postgres	wal	wal-postgres-0	wal-postgres-1	wal-postgres-2
kafka	data	data-kafka-0	data-kafka-1	data-kafka-2
cassandra	cassandra-data	cassandra-data-cassandra-0	cassandra-data-cassandra-1	—
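The naming rule in the table can be sketched as a small Python helper. The function name pvc_names is hypothetical and the code only mirrors the {template}-{statefulset}-{ordinal} convention; it does not talk to a cluster.

```python
def pvc_names(statefulset: str, templates: list[str], replicas: int) -> list[str]:
    """Return the PVC names expected for a StatefulSet, ordered by pod ordinal."""
    names = []
    for ordinal in range(replicas):
        pod_name = f"{statefulset}-{ordinal}"
        for template in templates:
            # PVC name = volumeClaimTemplate name + "-" + pod name
            names.append(f"{template}-{pod_name}")
    return names


if __name__ == "__main__":
    for name in pvc_names("postgres", ["data", "wal"], 3):
        print(name)
```

Note that the template name itself may contain hyphens (cassandra-data), so a PVC name cannot be split back into template and pod name by cutting at the first hyphen.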

PVC Lifecycle and Orphaned PVCs

Default Behavior: PVCs Are Not Deleted

When you delete a StatefulSet, the pods are deleted but the PVCs are left behind — orphaned. This is intentional: it protects against accidental data loss. A new StatefulSet with the same name and template names will re-bind to the existing PVCs and continue from where the data left off.

Deleting with --cascade=orphan

Use kubectl delete sts postgres --cascade=orphan to delete the StatefulSet object without deleting its pods. This lets you recreate the StatefulSet (e.g., with updated volumeClaimTemplates) while keeping pods and PVCs running — zero-downtime migration of StatefulSet spec.

Detecting Orphaned PVCs

# List PVCs with no ownerReferences (the default for StatefulSet PVCs unless a Delete retention policy is set)
kubectl get pvc -n data -o json | jq -r '
  .items[] |
  select(.metadata.ownerReferences == null) |
  "\(.metadata.name) \(.status.phase) \(.spec.resources.requests.storage)"
'

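# More targeted: list PVCs that no pod in the namespace currently mounts
# (matching on the PVC name alone breaks for template names that contain hyphens)
used=$(kubectl get pod -n data -o json |
  jq -r '.items[].spec.volumes[]?.persistentVolumeClaim.claimName // empty' | sort -u)
kubectl get pvc -n data --no-headers | awk '{print $1}' | while read -r pvc; do
  echo "$used" | grep -qxF "$pvc" || echo "ORPHANED: $pvc"
done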
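The same check can be expressed as a pure function over the JSON that kubectl get pvc -o json and kubectl get pod -o json return. This is a minimal sketch: orphaned_pvcs is a hypothetical name, and the inputs are assumed to be already-parsed dictionaries in the usual List shape.

```python
def orphaned_pvcs(pvc_list: dict, pod_list: dict) -> list[str]:
    """Return names of PVCs that no pod in pod_list references."""
    used = set()
    for pod in pod_list.get("items", []):
        # Pods without volumes may omit the key or carry null
        for volume in pod.get("spec", {}).get("volumes") or []:
            claim = volume.get("persistentVolumeClaim")
            if claim:
                used.add(claim["claimName"])
    return sorted(
        pvc["metadata"]["name"]
        for pvc in pvc_list.get("items", [])
        if pvc["metadata"]["name"] not in used
    )


if __name__ == "__main__":
    pvcs = {"items": [{"metadata": {"name": "data-postgres-0"}},
                      {"metadata": {"name": "data-postgres-1"}}]}
    pods = {"items": [{"spec": {"volumes": [
        {"name": "data", "persistentVolumeClaim": {"claimName": "data-postgres-0"}}]}}]}
    print(orphaned_pvcs(pvcs, pods))
```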

persistentVolumeClaimRetentionPolicy (Beta 1.27, GA 1.32)

Introduced as alpha in 1.23, enabled by default as beta in 1.27, and stable since 1.32, this field controls whether the controller deletes PVCs automatically:

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # Delete PVCs when StatefulSet is deleted
    whenScaled: Retain    # Keep PVCs when scaling down (safe default)
    # whenScaled: Delete  # DANGER: permanently deletes PVCs on scale-down

Policy	whenDeleted	whenScaled	Use Case
Retain/Retain (default)	Keep PVCs	Keep PVCs	Production DBs — manual cleanup
Delete/Retain	Delete PVCs on StatefulSet delete	Keep PVCs on scale-down	CI/staging: easy teardown, safe scale
Retain/Delete	Keep PVCs	Delete PVCs on scale-down	Stateless-ish caches, re-seed on scale-up
Delete/Delete	Delete all PVCs	Delete PVCs on scale-down	Ephemeral test environments only

whenScaled: Delete Is Irreversible

If you scale a StatefulSet from 3→1 with whenScaled: Delete, PVCs for pods 1 and 2 are permanently deleted. Scaling back to 3 creates brand-new empty PVCs — all data from replicas 1 and 2 is gone. Never use this for databases without verified backup/restore testing.
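To make the scale-down behaviour concrete, here is a small Python model of which PVCs still exist after a scaling operation. It is a sketch of the rule described above, not controller code; pvcs_after_scale and its parameters are hypothetical names.

```python
def pvcs_after_scale(statefulset: str, template: str, old_replicas: int,
                     new_replicas: int, when_scaled: str = "Retain") -> list[str]:
    """Return the PVC names that exist after scaling old_replicas -> new_replicas."""
    if when_scaled == "Retain":
        # PVCs of removed pods stay behind, so the highest ordinal ever used still counts
        highest = max(old_replicas, new_replicas)
    elif when_scaled == "Delete":
        # PVCs of removed pods are deleted together with the pods
        highest = new_replicas
    else:
        raise ValueError(f"unknown whenScaled policy: {when_scaled}")
    return [f"{template}-{statefulset}-{i}" for i in range(highest)]


if __name__ == "__main__":
    print(pvcs_after_scale("postgres", "data", 3, 1, "Retain"))
    print(pvcs_after_scale("postgres", "data", 3, 1, "Delete"))
```

With Delete, scaling back up produces PVCs with the same names, but they are newly provisioned and empty.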

Pod Management and Update Strategies

OrderedReady vs Parallel

podManagementPolicy: OrderedReady  # default
# Pods created 0 → 1 → 2 (each must be Running+Ready before next)
# Pods deleted 2 → 1 → 0 (each must be Terminated before next)
# Safe for databases that require leader election before follower starts

podManagementPolicy: Parallel
# All pods created simultaneously (no sequencing)
# All pods deleted simultaneously
# Faster for stateless-ish work or when app handles concurrent init
# Still maintains stable identity and PVC binding

OrderedReady Can Deadlock

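If pod-0 never becomes Ready (for example, a database whose readiness probe waits for pod-1 to form a quorum), the StatefulSet controller blocks: it will not create pod-1 until pod-0 is Ready. Break the deadlock by relaxing the readiness probe so that a lone member can report Ready, then deleting the stuck pod so it is re-created from the corrected template. kubectl rollout restart is usually not enough on its own, because the controller keeps waiting on the pod that is not Ready. If the application supports parallel start, podManagementPolicy: Parallel avoids the problem, but the field is immutable, so switching means deleting the StatefulSet (--cascade=orphan keeps the pods) and re-creating it.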
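The deadlock can be illustrated with a toy model in Python. It assumes a simplified readiness rule, namely that a pod reports Ready only once a given number of cluster members exist; pods_created and members_needed_for_ready are hypothetical names, and real readiness probes are of course more involved.

```python
def pods_created(replicas: int, members_needed_for_ready: int) -> int:
    """Model OrderedReady: the next ordinal is created only when all existing pods are Ready."""
    created = 0
    while created < replicas:
        created += 1  # controller creates the next ordinal
        if created < members_needed_for_ready:
            # The existing pods wait for peers that will never be created
            break
    return created


if __name__ == "__main__":
    print(pods_created(3, 1))  # each pod is Ready on its own
    print(pods_created(3, 2))  # pod-0 waits for a peer, so pod-1 is never created
```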

RollingUpdate with Partition (Canary Upgrades)

The partition field causes the controller to update only pods with ordinal ≥ partition. Pods with ordinal < partition keep their current revision.

# Update only pod-2 (highest ordinal) to test new version
kubectl patch sts postgres -n data -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl rollout status sts/postgres -n data

# If postgres-2 is healthy, expand to pod-1 and pod-2
kubectl patch sts postgres -n data -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'

# Full rollout
kubectl patch sts postgres -n data -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
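The partition rule is easy to check with a few lines of Python. This sketch only encodes "ordinal ≥ partition, highest ordinal first"; pods_to_update is a hypothetical helper name.

```python
def pods_to_update(replicas: int, partition: int) -> list[int]:
    """Ordinals a RollingUpdate touches, in the order the controller processes them."""
    return [ordinal for ordinal in range(replicas - 1, -1, -1) if ordinal >= partition]


if __name__ == "__main__":
    for partition in (2, 1, 0):
        print(partition, pods_to_update(3, partition))
```

A partition greater than or equal to the replica count freezes every pod, which is a common way to stage a template change without rolling anything.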

OnDelete Strategy (Manual DB Rolling Upgrade)

spec:
  updateStrategy:
    type: OnDelete
# Pods are only updated when manually deleted, so you control the timing
# Useful for: Patroni (switch over the leader before restarting it), Galera, Cassandra
# Workflow:
#   1. kubectl set image sts/postgres postgres=postgres:16.2
#   2. kubectl delete pod postgres-2  # kills replica, controller creates new pod with new image
#   3. Verify postgres-2 healthy, then delete postgres-1
#   4. For the primary (postgres-0): trigger a controlled switchover first, then delete the pod
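The ordering in that workflow (replicas first, highest ordinal first, current primary last) can be sketched as follows. upgrade_order is a hypothetical helper; it assumes you already know which ordinal currently holds the primary role.

```python
def upgrade_order(replicas: int, primary_ordinal: int) -> list[int]:
    """Return pod ordinals in the order they should be deleted under OnDelete."""
    if not 0 <= primary_ordinal < replicas:
        raise ValueError("primary_ordinal must be an existing pod ordinal")
    # Replicas go first, from the highest ordinal down
    order = [o for o in range(replicas - 1, -1, -1) if o != primary_ordinal]
    # The primary goes last, after a switchover has moved the role elsewhere
    order.append(primary_ordinal)
    return order


if __name__ == "__main__":
    print(upgrade_order(3, 0))
```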

Headless Service and DNS

A headless service (clusterIP: None) is required for StatefulSets. It creates per-pod DNS A records rather than a single VIP, enabling direct pod addressing.

apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: data
spec:
  clusterIP: None         # headless — no VIP
  publishNotReadyAddresses: true  # include NotReady pods (important for init)
  selector:
    app: postgres
  ports:
  - name: postgres
    port: 5432

DNS records created for StatefulSet "postgres" with headless service "postgres-headless" in namespace "data":

Pod hostname format: {pod-name}.{service}.{namespace}.svc.cluster.local
  postgres-0.postgres-headless.data.svc.cluster.local → 10.0.1.10
  postgres-1.postgres-headless.data.svc.cluster.local → 10.0.1.11
  postgres-2.postgres-headless.data.svc.cluster.local → 10.0.1.12

Service DNS (round-robin A records):
  postgres-headless.data.svc.cluster.local → [10.0.1.10, 10.0.1.11, 10.0.1.12]

With publishNotReadyAddresses: true → DNS includes not-yet-ready pods (pod-0 during init can find pod-1 seed)
Without it → pods invisible until Ready (can deadlock peer-discovery)
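Applications often need to build their peer list from these names. A minimal Python sketch of the hostname format follows; pod_dns_names is a hypothetical helper, and cluster.local is assumed as the cluster domain (it is configurable per cluster).

```python
def pod_dns_names(statefulset: str, service: str, namespace: str,
                  replicas: int, cluster_domain: str = "cluster.local") -> list[str]:
    """Build the stable per-pod DNS names behind a headless service."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


if __name__ == "__main__":
    for host in pod_dns_names("postgres", "postgres-headless", "data", 3):
        print(host)
```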

Anti-Affinity and Topology

Hard Anti-Affinity (One Pod Per Node)

spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: postgres
            topologyKey: kubernetes.io/hostname
            # Prevents any two postgres pods on the same node
            # Blocks scheduling if insufficient nodes — use "preferred" for flexibility

Zone-Spread Anti-Affinity (HA Across AZs)

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: postgres
              topologyKey: topology.kubernetes.io/zone

TopologySpreadConstraints (Preferred over Anti-Affinity)

      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule   # hard: block if can't satisfy
        labelSelector:
          matchLabels:
            app: postgres
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway  # soft: spread but don't block
        labelSelector:                     # without a selector the constraint counts no pods
          matchLabels:
            app: postgres
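How maxSkew filters candidate domains can be sketched in Python. The model assumes the incoming pod matches the label selector and that every domain in the input is eligible; schedulable_domains is a hypothetical name and the real scheduler considers more inputs (node affinity, taints, minDomains).

```python
def schedulable_domains(pod_counts: dict[str, int], max_skew: int) -> list[str]:
    """Return domains where placing one more matching pod keeps skew <= max_skew."""
    if not pod_counts:
        return []
    global_min = min(pod_counts.values())
    return sorted(
        domain
        for domain, count in pod_counts.items()
        # skew after placement = pods in this domain (incl. the new one) - global minimum
        if (count + 1) - global_min <= max_skew
    )


if __name__ == "__main__":
    print(schedulable_domains({"zone-a": 2, "zone-b": 1}, max_skew=1))
```

With DoNotSchedule, a pod stays Pending when this list is empty; with ScheduleAnyway the same calculation only influences scoring.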

PodDisruptionBudgets for Stateful Apps

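# PostgreSQL primary/replica: always keep at least 1 pod available during voluntary disruptions
# (a PDB only counts pods; it cannot guarantee that the remaining pod is the primary)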
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
  namespace: data
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: postgres
---
# Kafka 3-broker cluster with min.insync.replicas=2:
# Must keep at least 2 brokers to accept writes
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: kafka
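The arithmetic behind these budgets is simple enough to sketch in Python: a quorum-based system needs a majority of members, and a minAvailable budget allows evictions only while more healthy pods exist than the minimum. Both function names are hypothetical.

```python
def quorum_min_available(members: int) -> int:
    """Smallest number of members that still forms a majority quorum."""
    return members // 2 + 1


def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a minAvailable budget currently permits."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    need = quorum_min_available(3)
    print(need, allowed_disruptions(3, need))
```

If one pod is already unhealthy, the budget drops to zero and a node drain blocks until the pod recovers.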

PostgreSQL on Kubernetes

CloudNativePG Operator (Recommended)

CloudNativePG is a CNCF Sandbox operator for PostgreSQL. It manages primary/replica topology, automatic failover, backup integration, and connection pooling via PgBouncer. For production, use it instead of hand-built StatefulSet configuration.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-cluster
  namespace: data
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.2

  storage:
    size: 100Gi
    storageClass: ebs-gp3

  walStorage:
    size: 20Gi
    storageClass: ebs-io2    # separate fast storage for WAL

  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://my-bucket/postgres
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY

  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone

  resources:
    requests:
      memory: 4Gi
      cpu: "2"
    limits:
      memory: 8Gi

Patroni-Based StatefulSet (Manual)

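For teams not using an operator, Patroni provides HA PostgreSQL with leader election backed by etcd, Consul, or the Kubernetes API. Patroni normally clones replicas itself; the init container below shows the same step done by hand for a plain streaming-replication StatefulSet: it runs pg_basebackup on each replica to initialize it from the primary.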

initContainers:
- name: init-replica
  image: postgres:16
  command:
  - sh
  - -c
  - |
    # Skip if primary (ordinal 0) or if data dir already initialized
    ORDINAL="${HOSTNAME##*-}"   # pod ordinal is the suffix after the last hyphen
    if [ "$ORDINAL" = "0" ] || [ -f "/var/lib/postgresql/data/pgdata/PG_VERSION" ]; then
      exit 0
    fi
    # Wait for primary to be ready
    until pg_isready -h postgres-0.postgres-headless.data.svc.cluster.local -p 5432; do
      echo "Waiting for primary..."; sleep 2
    done
    # Stream base backup from primary
    pg_basebackup \
      -h postgres-0.postgres-headless.data.svc.cluster.local \
      -U replication \
      -D /var/lib/postgresql/data/pgdata \
      -Xs -P -R    # -R writes standby.signal and primary_conninfo (recovery.conf before PostgreSQL 12)
  volumeMounts:
  - name: data
    mountPath: /var/lib/postgresql/data

Cassandra on Kubernetes

Cassandra uses a gossip protocol and a token ring. Kubernetes StatefulSets map well because Cassandra nodes require stable DNS names for seed discovery and stable storage for SSTables.

apiVersion: apps/v1
kind: StatefulSet
metadata:
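  name: cassandra
  namespace: data            # matches the namespace used in the seed DNS names
spec:
  serviceName: cassandra
  replicas: 3
  selector:                  # required; must match the pod template labels
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec: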
      containers:
      - name: cassandra
        image: cassandra:4.1
        env:
        - name: CASSANDRA_SEEDS
          # First two pods as seeds; headless DNS provides stable addresses
          value: "cassandra-0.cassandra.data.svc.cluster.local,cassandra-1.cassandra.data.svc.cluster.local"
        - name: CASSANDRA_CLUSTER_NAME
          value: "production"
        - name: CASSANDRA_DC
          value: "dc1"
        - name: CASSANDRA_RACK
          # Static here; node labels are not exposed through the Downward API, so real
          # rack awareness needs one StatefulSet per rack or an init step that looks up the node
          value: "rack1"
        - name: MAX_HEAP_SIZE
          value: "8192M"
        - name: HEAP_NEWSIZE
          value: "2048M"
        - name: POD_IP             # referenced by the readiness probe
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        resources:
          requests:
            memory: 16Gi    # Cassandra is memory-hungry
            cpu: "4"
        readinessProbe:
          exec:
            command: ["/bin/bash", "-c", "nodetool status | grep -E '^UN\\s+$POD_IP'"]
          initialDelaySeconds: 90
          periodSeconds: 10
        volumeMounts:
        - name: cassandra-data
          mountPath: /var/lib/cassandra
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: local-nvme   # local NVMe for compaction I/O
      resources:
        requests:
          storage: 500Gi

Cassandra Bootstrap Order

Use podManagementPolicy: OrderedReady for initial cluster bootstrap: each node must join the ring and reach UN (Up/Normal) state before the next node starts. The policy is immutable and affects only pod creation and scale-down, not rolling updates, so choose Parallel at creation time only if the cluster tolerates several nodes joining at once.

Kafka on Kubernetes

Kafka brokers maintain persistent log segments. The broker ID maps naturally to the StatefulSet pod ordinal. Strimzi is a widely used Kubernetes operator for Kafka.

Strimzi Kafka Cluster (Operator Pattern)

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-kafka
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    listeners:
    - name: plain
      port: 9092
      type: internal
      tls: false
    - name: tls
      port: 9093
      type: internal
      tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2     # producers with acks=all require 2 ISR
      log.retention.hours: 168   # 7 days
    storage:
      type: persistent-claim
      size: 500Gi
      class: ebs-gp3
      deleteClaim: false         # retain PVCs on Kafka cluster delete
    resources:
      requests:
        memory: 8Gi
        cpu: "2"
      limits:
        memory: 16Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
      class: ebs-gp3
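The replication settings in the config block determine how many brokers can be down before acks=all producers start failing. A small Python sketch of that arithmetic, with the hypothetical helper name tolerated_broker_failures:

```python
def tolerated_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Brokers that can be offline while acks=all writes to a partition still succeed."""
    if min_insync_replicas > replication_factor:
        raise ValueError("min.insync.replicas cannot exceed the replication factor")
    return replication_factor - min_insync_replicas


if __name__ == "__main__":
    # Values from the config above: default.replication.factor=3, min.insync.replicas=2
    print(tolerated_broker_failures(3, 2))
```

This is why the Kafka PodDisruptionBudget for such a cluster should allow only one broker to be evicted at a time.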

KRaft Mode (ZooKeeper-Free, Kafka 3.3+ Production-Ready)

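  kafka:
    version: 3.7.0
    metadataVersion: 3.7-IV4
    replicas: 3
    # No zookeeper section: KRaft stores metadata in an internal Raft log (__cluster_metadata)
    # Combined mode: broker + controller roles in same pods
    # Separate mode (recommended for large clusters): dedicated controller nodes
    # Note: Strimzi defines KRaft node roles, replica counts, and storage in KafkaNodePool resources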

MongoDB on Kubernetes

apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: mongodb-replica-set
spec:
  members: 3
  type: ReplicaSet
  version: "7.0.4"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
  - name: admin
    db: admin
    passwordSecretRef:
      name: admin-password
    roles:
    - name: clusterAdmin
      db: admin
  statefulSet:
    spec:
      volumeClaimTemplates:
      - metadata:
          name: data-volume
        spec:
          accessModes: [ReadWriteOnce]
          storageClassName: ebs-gp3
          resources:
            requests:
              storage: 200Gi
      - metadata:
          name: logs-volume
        spec:
          accessModes: [ReadWriteOnce]
          storageClassName: ebs-gp3
          resources:
            requests:
              storage: 10Gi

Redis on Kubernetes

Redis Sentinel (HA for Single Shard)

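# Redis StatefulSet with persistent storage (Sentinel pods not shown)
# Sentinel monitors the primary and triggers failover
# Initial primary: redis-0; replicas: redis-1, redis-2
# Sentinel quorum: 2 of 3 sentinels must agree for failover
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: data
spec:
  serviceName: redis-headless
  replicas: 3
  selector:                  # required; must match the pod template labels
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      initContainers:
      - name: config
        image: redis:7.2
        command: ["sh", "-c"]
        args:
        - |
          ORDINAL="${HOSTNAME##*-}"
          # Minimal base config written inline (assumption: replace with your own ConfigMap)
          echo "dir /data" > /etc/redis/redis.conf
          if [ "$ORDINAL" != "0" ]; then
            echo "replicaof redis-0.redis-headless.data.svc.cluster.local 6379" >> /etc/redis/redis.conf
          fi
        volumeMounts:
        - name: config
          mountPath: /etc/redis
      containers:
      - name: redis
        image: redis:7.2
        command: ["redis-server", "/etc/redis/redis.conf"]
        volumeMounts:
        - name: config
          mountPath: /etc/redis
        - name: data
          mountPath: /data
      volumes:
      - name: config
        emptyDir: {}         # shared between the init container and the redis container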
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: ebs-gp3
      resources:
        requests:
          storage: 50Gi

Local PVs with StatefulSets

Local PVs bind directly to a disk on a specific node (typically NVMe), which gives the lowest latency of the options on this page. The tradeoff: if the node goes down, its data is inaccessible until the node recovers, and if the disk itself fails, the data is lost.

# Local PV — manually provisioned, node-affinity required
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node1-nvme
spec:
  capacity:
    storage: 1Ti
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /mnt/nvme0n1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: [node1]
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer  # critical: bind only when pod is scheduled

Local PV Node Failure Means Data Inaccessibility

When the node hosting a local PV goes down, the pod cannot reschedule to another node — it is stuck in Pending indefinitely because no other node has the required PV. Recovery requires either restoring the node, manually moving data, or re-seeding the replica. Use Longhorn or Rook-Ceph for automatic replication if node failure tolerance is required.

Longhorn

Longhorn (CNCF incubating) is a distributed block storage system built for Kubernetes. It replicates volumes across nodes, handles node failures automatically, and provides S3-backed snapshots and backups.

Longhorn Architecture:
┌─────────────────────────────────────────────────────────┐
│  Kubernetes Cluster                                     │
│                                                         │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐             │
│  │ Node 1   │   │ Node 2   │   │ Node 3   │             │
│  │          │   │          │   │          │             │
│  │ longhorn-│   │ longhorn-│   │ longhorn-│             │
│  │ manager  │   │ manager  │   │ manager  │ ← DaemonSet │
│  │ (agent)  │   │ (agent)  │   │ (agent)  │             │
│  │          │   │          │   │          │             │
│  │ Engine   │   │ Replica  │   │ Replica  │             │
│  │ Process  │◄──│ Process  │   │ Process  │             │
│  │ (active) │   │(replica1)│   │(replica2)│             │
│  │    ▲     │   │ /dev/    │   │ /dev/    │             │
│  │    │     │   │ longhorn │   │ longhorn │             │
│  └────│─────┘   └──────────┘   └──────────┘             │
│       │ iSCSI/NVMe-oF                                   │
│  ┌────┴──────────────────────────────────────────────┐  │
│  │ Application Pod (postgres-0)                      │  │
│  │ PVC: data-postgres-0 → Longhorn Volume            │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

# Longhorn StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"          # replicate across 3 nodes
  staleReplicaTimeout: "2880"    # 48 hours before treating stale replica as failed
  fromBackup: ""
  fsType: ext4
  diskSelector: ""               # target specific disk tags
  nodeSelector: ""               # target specific node tags
  recurringJobSelector: '[{"name":"backup", "isGroup":true}]'

Volume Replication

Each Longhorn volume has N replicas on N distinct nodes. The engine process runs on the node hosting the pod; replicas sync over the network. The replica count is configurable per volume.

Backup to S3

Incremental snapshots pushed to S3/MinIO/NFS. RecurringJob CRD schedules automatic backups. Restores create new PVC from backup — cross-cluster recovery supported.

Node Drain Safe

Longhorn respects node cordon/drain — engine migrates to another node automatically when replicas exist elsewhere. PodDisruptionBudget enforced during drain.

UI Dashboard

Built-in web UI shows volume health, replica placement, and backup status. Expose it via a Service or Ingress. The UI does not authenticate users by default, so put access control (for example Ingress basic auth) in front of it.

Rook-Ceph

Rook is a CNCF-graduated cloud-native storage orchestrator. Rook deploys and manages Ceph — a production-grade distributed storage system providing block (RBD), file (CephFS), and object (RGW/S3) storage from the same cluster.

Rook-Ceph Architecture:
┌───────────────────────────────────────────────────────────────┐
│  Rook-Ceph Namespace                                          │
│                                                               │
│  rook-ceph-operator (Deployment)                              │
│  └── watches CephCluster/CephBlockPool/CephFilesystem CRDs    │
│                                                               │
│  Ceph Daemons (managed by operator):                          │
│  ┌─────────┐  ┌────────────────────────────────────────────┐  │
│  │ MON     │  │ OSD (one Deployment per device)            │  │
│  │ (x3)    │  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐       │  │
│  │ cluster │  │  │OSD-0 │ │OSD-1 │ │OSD-2 │ │OSD-3 │       │  │
│  │ state   │  │  │node1 │ │node1 │ │node2 │ │node3 │       │  │
│  │ quorum  │  │  │/dev/ │ │/dev/ │ │/dev/ │ │/dev/ │       │  │
│  └─────────┘  │  │nvme0 │ │nvme1 │ │nvme0 │ │nvme0 │       │  │
│               │  └──────┘ └──────┘ └──────┘ └──────┘       │  │
│  ┌─────────┐  └────────────────────────────────────────────┘  │
│  │ MGR     │                                                  │
│  │ (x2 HA) │  ┌──────────────┐  ┌──────────────────────────┐  │
│  │dashboard│  │ MDS (x2)     │  │ RGW (S3 endpoint)        │  │
│  └─────────┘  │ (CephFS MDS) │  │ (object storage)         │  │
│               └──────────────┘  └──────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘

CephCluster CRD

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.2   # Reef
  dataDirHostPath: /var/lib/rook       # persisted on host for mon data
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    modules:
    - name: pg_autoscaler
      enabled: true
  dashboard:
    enabled: true
    ssl: true
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
    - name: "node1"
      devices:
      - name: "nvme0n1"
      - name: "nvme1n1"
    - name: "node2"
      devices:
      - name: "nvme0n1"
    - name: "node3"
      devices:
      - name: "nvme0n1"
  # No required hostname anti-affinity for OSDs: node1 runs two OSDs (one per device),
  # and such a rule would leave the second one Pending. Spreading replicas across
  # hosts is handled by CRUSH through failureDomain in the pool spec.

Ceph Block Storage (RBD) StorageClass

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host     # CRUSH: tolerate full host failure
  replicated:
    size: 3
    requireSafeReplicaSize: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering   # required for snapshots; add exclusive-lock,object-map,fast-diff for better performance
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
allowVolumeExpansion: true
reclaimPolicy: Delete

CephFS StorageClass (ReadWriteMany)

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
  - name: data0
    replicated:
      size: 3
  preserveFilesystemOnDelete: true   # don't wipe filesystem on CRD delete
  metadataServer:
    activeCount: 1
    activeStandby: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-data0
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
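  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner   # needed for allowVolumeExpansion
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph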
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
allowVolumeExpansion: true
reclaimPolicy: Delete

CRUSH Map for Rack-Awareness

Set failureDomain: rack in CephBlockPool and label nodes with topology.rook.io/rack=rack1 (rack2, rack3, and so on) so that CRUSH places each replica in a different physical rack. With 3 racks and replica size 3, a full rack failure is tolerated with no data loss. Every rack needs at least one OSD, and several OSDs of similar capacity per rack keep placement balanced.
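The placement constraint can be sanity-checked before creating a pool. The Python sketch below only models the "one replica per rack" rule; both function names are hypothetical, and real CRUSH placement also weighs OSD capacity.

```python
def can_place_replicas(osds_per_rack: dict[str, int], replica_size: int) -> bool:
    """True if enough racks contain at least one OSD to hold one replica each."""
    racks_with_osds = sum(1 for count in osds_per_rack.values() if count > 0)
    return racks_with_osds >= replica_size


def replicas_after_rack_failure(replica_size: int) -> int:
    """Copies that remain when one rack (holding exactly one replica) is lost."""
    return replica_size - 1


if __name__ == "__main__":
    racks = {"rack1": 2, "rack2": 1, "rack3": 1}
    print(can_place_replicas(racks, 3), replicas_after_rack_failure(3))
```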

Backup Strategies for Stateful Workloads

Strategy	Mechanism	Pros	Cons	Best For
PVC Volume Snapshot	CSI CreateSnapshot RPC	Fast, crash-consistent	Not app-consistent without quiesce	Caches, replicas with WAL replay
App-level dump	pg_dump / mongodump / redis-cli BGSAVE	Fully consistent, portable SQL	Slow for large DBs, CPU overhead	Primary source of truth DBs
pg_basebackup / pgBackRest	Streaming replication protocol	Fast physical backup, WAL archiving, PITR	PostgreSQL-specific	PostgreSQL production backups
Velero + CSI snapshots	Velero + VolumeSnapshot API	Cluster-wide, namespace backup, manifest + data together	Requires CSI snapshot support	Full namespace DR
Longhorn backup	Longhorn recurring job → S3	Incremental, integrated with Longhorn volumes	Longhorn-specific	Longhorn-managed volumes
Rook-Ceph RBD mirroring	Ceph RBD mirroring to remote cluster	Async replication to DR cluster	Complex setup, separate Ceph cluster	Active-passive cross-cluster DR

Metrics, Alerts, and Runbooks

Key Metrics

Metric	Source	What to Watch
`kube_statefulset_status_replicas_ready`	kube-state-metrics	Should equal `kube_statefulset_replicas`; alert on mismatch
`kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision`	kube-state-metrics	Returns a series while the current and update revisions differ, which indicates an incomplete rolling update; alert if stale > 15m
`kube_persistentvolumeclaim_status_phase{phase="Lost"} == 1`	kube-state-metrics	PVC Lost = pod stuck Pending; page immediately
`kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes`	kubelet	Alert at 80% and 90% utilization
`kube_pod_container_status_restarts_total`	kube-state-metrics	StatefulSet pod restarts may indicate data corruption or OOM

Alerting Rules

groups:
- name: stateful-storage
  rules:
  - alert: StatefulSetNotFullyReady
    expr: |
      kube_statefulset_status_replicas_ready
        != kube_statefulset_replicas
    for: 5m
    annotations:
      summary: "StatefulSet {{ $labels.statefulset }} not fully ready"

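  - alert: StatefulSetRolloutStuck
    # Both metrics always have value 1 and carry the revision as a label, so "!=" never matches;
    # "unless" keeps the current-revision series only while the two revisions differ
    expr: |
      kube_statefulset_status_current_revision
        unless kube_statefulset_status_update_revision
    for: 15m
    annotations:
      summary: "StatefulSet {{ $labels.statefulset }} rollout stuck"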

  - alert: PVCDiskPressureWarning
    expr: |
      (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.80
    for: 5m
    annotations:
      summary: "PVC {{ $labels.persistentvolumeclaim }} >80% full"

  - alert: PVCDiskPressureCritical
    expr: |
      (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.90
    for: 2m
    labels:
      severity: critical
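The two disk-pressure thresholds can be mirrored in a few lines of Python, which is handy for unit-testing capacity reports outside Prometheus. pvc_alert_level is a hypothetical helper; the 80% and 90% thresholds are the ones used in the rules above.

```python
from typing import Optional


def pvc_alert_level(used_bytes: int, capacity_bytes: int) -> Optional[str]:
    """Classify PVC utilization with the same thresholds as the alert rules."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    ratio = used_bytes / capacity_bytes
    if ratio > 0.90:
        return "critical"
    if ratio > 0.80:
        return "warning"
    return None


if __name__ == "__main__":
    gib = 1024 ** 3
    print(pvc_alert_level(85 * gib, 100 * gib))
```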

Runbooks

StatefulSet Pod Stuck Pending

Check PVC status (kubectl get pvc). If Pending: storage class mismatch or quota. If Lost: the PV was deleted; recreate the PV with the original volumeHandle and claimRef UID. If no PVC exists: look at the StatefulSet events (kubectl describe sts) for PVC creation failures.

StatefulSet Rollout Stuck

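Check pod-N readiness: kubectl describe pod sts-N. Common cause: new image crashes on startup, readiness probe fails. If OrderedReady deadlock: roll back with kubectl rollout undo and then delete the stuck pod, because the controller does not replace a pod that never became Ready. Alternatively, raise the partition value so the stuck ordinal is excluded from the update.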

PVC Full — Online Expansion

Edit PVC spec.resources.requests.storage to a larger value (storage class must have allowVolumeExpansion: true). For filesystem resize: if the PVC shows the FileSystemResizePending condition, it clears once the volume is next mounted, which may require a pod restart when the driver lacks online expansion. Check that the CSI driver supports ControllerExpandVolume.

Rook-Ceph OSD Down

Check OSD pod: kubectl -n rook-ceph get pod -l app=rook-ceph-osd. If node failure: Ceph will auto-heal after mon_osd_down_out_interval (10 min default). Check PG status: kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status. Alert if PGs stuck in undersized+degraded > 30min.

Orphaned PVCs After StatefulSet Delete

List with label selector: kubectl get pvc -n ns -l app=deleted-sts. Verify no running pods use them: kubectl get pod -n ns -o json | jq -r '.items[].spec.volumes[]?.persistentVolumeClaim.claimName // empty'. Delete manually after confirming the data is backed up.

Best Practices

Always use a headless service — clusterIP: None with publishNotReadyAddresses: true for peer discovery during init. Never share a ClusterIP service for intra-cluster database replication.
Set terminationGracePeriodSeconds appropriately — databases need time to flush WAL, finish checkpoints, and gracefully close connections. 60–120 seconds is typical; never leave at default 30s for PostgreSQL or Cassandra.
Use podManagementPolicy: OrderedReady for initial bootstrap: choose Parallel only when the application handles concurrent startup (it speeds up scale-out for large clusters). The field is immutable and does not change rolling-update order. Never use Parallel with databases that require leader election before followers start.
Pin partition to the highest ordinal during upgrades — canary one replica before rolling all. Especially critical for major Kafka, PostgreSQL, and Cassandra version upgrades with format changes.
Never rely on StatefulSet PVC auto-cleanup for production data — keep persistentVolumeClaimRetentionPolicy.whenDeleted: Retain and whenScaled: Retain for all production databases. Automate orphaned PVC detection and manual review.
Separate data and WAL/log volumes — put WAL on a separate higher-IOPS StorageClass (io2 vs gp3). This allows independent sizing and prevents WAL I/O from contending with table data reads.
Use an operator for complex databases — CloudNativePG for PostgreSQL, Strimzi for Kafka, MongoDB Community Operator for MongoDB. Manual StatefulSet management misses critical details (failover logic, backup scheduling, rolling upgrade ordering, connection pooling).
Test failover and recovery procedures regularly — simulate node failure, test backup restore, validate that replicas rejoin after restart. Storage failures surface during actual incidents, not just monitoring.
