Stateful Storage Patterns
StatefulSet vs Deployment: Storage Semantics
Deployments treat pods as interchangeable cattle: any pod can serve any request, and a PVC, if one is used at all, is shared by every replica (which in practice requires ReadWriteMany). StatefulSets assign each pod a stable identity: a fixed ordinal (0, 1, 2…), a stable DNS hostname, and a dedicated PVC that follows the pod across rescheduling. By default the controller never deletes or re-creates that PVC; it persists after the pod, and even the StatefulSet, is deleted.
| Characteristic | Deployment | StatefulSet |
|---|---|---|
| Pod identity | Random hash suffix (random-a1b2c) | Ordinal suffix (db-0, db-1) |
| DNS hostname | Unpredictable | Stable: pod-name.headless-svc.ns.svc.cluster.local |
| PVC per pod | Shared PVC or no PVC | Dedicated PVC per pod via volumeClaimTemplates |
| Pod creation order | All at once (parallel) | Sequential 0→N-1 (OrderedReady) or parallel |
| Pod deletion order | Random | Sequential N-1→0 |
| PVC on pod delete | N/A | PVC retained (not deleted with pod) |
| PVC on StatefulSet delete | N/A | PVCs orphaned by default (Retain policy) |
| Rolling update direction | New pods before old | Reverse ordinal (N-1→0), one pod at a time |
volumeClaimTemplates
volumeClaimTemplates defines PVC templates that the StatefulSet controller instantiates for each pod. Each template creates a PVC named {template.metadata.name}-{pod-name}. For a StatefulSet named postgres with template name data, pod postgres-0 gets PVC data-postgres-0.
Full Anatomy
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: data
spec:
  serviceName: postgres-headless      # required headless service name
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  podManagementPolicy: OrderedReady   # default; sequential pod lifecycle
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0                    # update all pods; set to N to freeze pods 0..N-1
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
            - name: wal
              mountPath: /var/lib/postgresql/wal
  volumeClaimTemplates:
    - metadata:
        name: data
        annotations:
          # Optional: survive StatefulSet deletion (persistentVolumeClaimRetentionPolicy covers this in 1.27+)
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: ebs-gp3
        resources:
          requests:
            storage: 100Gi
    - metadata:
        name: wal
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: ebs-io2     # separate faster disk for WAL
        resources:
          requests:
            storage: 20Gi
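Once a StatefulSet is created, volumeClaimTemplates cannot be modified. Changing the storage size, storage class, or access modes requires either migrating data to new PVCs or deleting and re-creating the StatefulSet. Use --cascade=orphan for the delete so the pods keep running; the PVCs are retained either way. Re-creating the StatefulSet does not resize existing PVCs, so to grow a volume you must also edit each PVC directly, which works only if the StorageClass allows volume expansion.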
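The resize path described above can be sketched as a small helper that prints the kubectl commands instead of running them, so the plan can be reviewed first. This is only an illustration under stated assumptions: `resize_plan` is a hypothetical function, the manifest file name is made up, and the StorageClass is assumed to have allowVolumeExpansion: true.

```shell
#!/bin/sh
# Print (do not run) the commands for growing one volumeClaimTemplate.
# Arguments: namespace, StatefulSet name, template name, replicas, new size.
resize_plan() {
  local ns sts template replicas size i
  ns="$1"; sts="$2"; template="$3"; replicas="$4"; size="$5"
  i=0
  # 1. Expand every existing PVC in place.
  while [ "$i" -lt "$replicas" ]; do
    printf 'kubectl -n %s patch pvc %s-%s-%s -p %s\n' \
      "$ns" "$template" "$sts" "$i" \
      "'{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$size\"}}}}'"
    i=$((i + 1))
  done
  # 2. Remove only the StatefulSet object; pods and PVCs keep running.
  printf 'kubectl -n %s delete sts %s --cascade=orphan\n' "$ns" "$sts"
  # 3. Re-create it from a manifest whose template requests the new size.
  printf 'kubectl -n %s apply -f %s-sts.yaml   # hypothetical manifest file\n' "$ns" "$sts"
}

resize_plan data postgres data 3 200Gi
```

Printing the plan keeps the sketch safe to run anywhere; piping the output to `sh` would execute it, which is worth doing only after checking each line against the real cluster.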
PVC Naming Convention
| StatefulSet | Template Name | Replica 0 | Replica 1 | Replica 2 |
|---|---|---|---|---|
| postgres | data | data-postgres-0 | data-postgres-1 | data-postgres-2 |
| postgres | wal | wal-postgres-0 | wal-postgres-1 | wal-postgres-2 |
| kafka | data | data-kafka-0 | data-kafka-1 | data-kafka-2 |
| cassandra | cassandra-data | cassandra-data-cassandra-0 | cassandra-data-cassandra-1 | cassandra-data-cassandra-2 |
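The naming rule in the table is mechanical, so it can be reproduced with a few lines of shell. The sketch below assumes a hypothetical helper named `pvc_names`; it only formats strings and does not talk to a cluster.

```shell
#!/bin/sh
# Print the PVC names a StatefulSet creates for one volumeClaimTemplate.
# Rule from the table: {template name}-{StatefulSet name}-{ordinal}
pvc_names() {
  local template sts replicas i
  template="$1"; sts="$2"; replicas="$3"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf '%s-%s-%s\n' "$template" "$sts" "$i"
    i=$((i + 1))
  done
}

pvc_names data postgres 3
```

Because template names may themselves contain hyphens (cassandra-data), building names forward like this is more reliable than trying to split an existing PVC name back into its parts.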
PVC Lifecycle and Orphaned PVCs
Default Behavior: PVCs Are Not Deleted
When you delete a StatefulSet, the pods are deleted but the PVCs are left behind — orphaned. This is intentional: it protects against accidental data loss. A new StatefulSet with the same name and template names will re-bind to the existing PVCs and continue from where the data left off.
Use kubectl delete sts postgres --cascade=orphan to delete the StatefulSet object without deleting its pods. This lets you recreate the StatefulSet (e.g., with updated volumeClaimTemplates) while keeping pods and PVCs running — zero-downtime migration of StatefulSet spec.
Detecting Orphaned PVCs
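# PVCs without ownerReferences. Note: PVCs created from volumeClaimTemplates carry
# no ownerReferences unless a Delete retention policy is set, so most StatefulSet
# PVCs show up here whether or not a pod still mounts them.
kubectl get pvc -n data -o json | jq -r '
  .items[] |
  select(.metadata.ownerReferences == null) |
  "\(.metadata.name) \(.status.phase) \(.spec.resources.requests.storage)"
'
# More targeted: report PVCs that no pod in the namespace mounts.
# Comparing claim names avoids parsing the PVC name, which breaks when the
# template name itself contains a hyphen (e.g. cassandra-data).
used=$(kubectl get pod -n data -o json |
  jq -r '.items[].spec.volumes[]? | .persistentVolumeClaim.claimName // empty' | sort -u)
kubectl get pvc -n data --no-headers -o custom-columns=NAME:.metadata.name | while read -r pvc; do
  printf '%s\n' "$used" | grep -qxF -- "$pvc" || echo "ORPHANED: $pvc"
done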
persistentVolumeClaimRetentionPolicy (Beta 1.27, GA 1.32)
Introduced as alpha in 1.23, enabled by default as beta in 1.27, and stable in 1.32, this field controls the automated PVC deletion lifecycle:
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # Delete PVCs when StatefulSet is deleted
    whenScaled: Retain    # Keep PVCs when scaling down (safe default)
    # whenScaled: Delete  # DANGER: permanently deletes PVCs on scale-down
| Policy | whenDeleted | whenScaled | Use Case |
|---|---|---|---|
| Retain/Retain (default) | Keep PVCs | Keep PVCs | Production DBs — manual cleanup |
| Delete/Retain | Delete PVCs on StatefulSet delete | Keep PVCs on scale-down | CI/staging: easy teardown, safe scale |
| Retain/Delete | Keep PVCs | Delete PVCs on scale-down | Stateless-ish caches, re-seed on scale-up |
| Delete/Delete | Delete all PVCs | Delete PVCs on scale-down | Ephemeral test environments only |
If you scale a StatefulSet from 3→1 with whenScaled: Delete, PVCs for pods 1 and 2 are permanently deleted. Scaling back to 3 creates brand-new empty PVCs — all data from replicas 1 and 2 is gone. Never use this for databases without verified backup/restore testing.
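Before scaling down with whenScaled: Delete, it can help to list exactly which claims would disappear. The following sketch uses a hypothetical helper, `pvcs_lost_on_scale_down`, and covers a single volumeClaimTemplate; with several templates (for example data and wal), run it once per template name.

```shell
#!/bin/sh
# List the PVCs that whenScaled: Delete removes when replicas drop from $3 to $4.
# Pods with ordinal >= the new replica count are removed, and their PVCs with them.
pvcs_lost_on_scale_down() {
  local template sts from to i
  template="$1"; sts="$2"; from="$3"; to="$4"
  i="$to"
  while [ "$i" -lt "$from" ]; do
    printf '%s-%s-%s\n' "$template" "$sts" "$i"
    i=$((i + 1))
  done
}

pvcs_lost_on_scale_down data postgres 3 1
```

For the 3→1 example in the text this prints data-postgres-1 and data-postgres-2, the two claims that would be permanently deleted.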
Pod Management and Update Strategies
OrderedReady vs Parallel
podManagementPolicy: OrderedReady # default
# Pods created 0 → 1 → 2 (each must be Running+Ready before next)
# Pods deleted 2 → 1 → 0 (each must be Terminated before next)
# Safe for databases that require leader election before follower starts
podManagementPolicy: Parallel
# All pods created simultaneously (no sequencing)
# All pods deleted simultaneously
# Faster for stateless-ish work or when app handles concurrent init
# Still maintains stable identity and PVC binding
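If pod-0 stays not-Ready (e.g., a database waiting for pod-1 to form quorum), the StatefulSet controller blocks: under OrderedReady it will not create pod-1 until pod-0 is Ready. Break the deadlock by making the readiness probe independent of peer availability (change the probe in the pod template, then delete the stuck pod so it is re-created), or by using podManagementPolicy: Parallel if the application supports parallel start. podManagementPolicy is immutable on an existing StatefulSet, so switching means deleting the StatefulSet with --cascade=orphan and re-creating it. kubectl rollout restart on its own usually does not help, because the controller waits for the not-Ready pod before it moves on to any other pod.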
RollingUpdate with Partition (Canary Upgrades)
The partition field causes the controller to update only pods with ordinal ≥ partition. Pods with ordinal < partition keep their current revision.
# Update only pod-2 (highest ordinal) to test new version
kubectl patch sts postgres -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl rollout status sts/postgres
# If postgres-2 is healthy, expand to pod-1 and pod-2
kubectl patch sts postgres -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'
# Full rollout
kubectl patch sts postgres -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
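To check a partition value before patching, the rule "ordinal ≥ partition gets updated" can be evaluated offline. The sketch below assumes a hypothetical helper, `pods_updated`, and lists pods in the order a RollingUpdate visits them (highest ordinal first).

```shell
#!/bin/sh
# Print the pods a RollingUpdate touches for a given partition.
# Arguments: StatefulSet name, replicas, partition.
pods_updated() {
  local sts replicas partition i
  sts="$1"; replicas="$2"; partition="$3"
  i=$((replicas - 1))
  while [ "$i" -ge "$partition" ]; do
    printf '%s-%s\n' "$sts" "$i"
    i=$((i - 1))
  done
}

pods_updated postgres 3 2   # canary: only postgres-2
pods_updated postgres 3 0   # full rollout: postgres-2, postgres-1, postgres-0
```

A partition equal to or larger than the replica count prints nothing, which matches the controller behaviour of freezing every pod at its current revision.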
OnDelete Strategy (Manual DB Rolling Upgrade)
spec:
  updateStrategy:
    type: OnDelete
# Pods are only updated when manually deleted — you control timing
# Essential for: Patroni (must demote replica first), Galera, Cassandra
# Workflow:
# 1. kubectl set image sts/postgres postgres=postgres:16.2
# 2. kubectl delete pod postgres-2 # kills replica, controller creates new pod with new image
# 3. Verify postgres-2 healthy, then delete postgres-1
# 4. For primary (postgres-0): trigger manual failover first
Headless Service and DNS
A headless service (clusterIP: None) is required for StatefulSets. It creates per-pod DNS A records rather than a single VIP, enabling direct pod addressing.
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: data
spec:
  clusterIP: None                  # headless: no VIP
  publishNotReadyAddresses: true   # include NotReady pods (important for init)
  selector:
    app: postgres
  ports:
    - name: postgres
      port: 5432
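Each pod behind this headless service gets a DNS name of the form pod-name.headless-svc.ns.svc.cluster.local. A minimal sketch for building those names, assuming a hypothetical helper called `pod_fqdn` and the default cluster domain cluster.local (clusters can be configured with a different one, hence the optional fifth argument):

```shell
#!/bin/sh
# Build the stable DNS name of one StatefulSet pod.
# Arguments: StatefulSet name, ordinal, headless service, namespace, [cluster domain].
pod_fqdn() {
  local sts ordinal svc ns domain
  sts="$1"; ordinal="$2"; svc="$3"; ns="$4"
  domain="${5:-cluster.local}"
  printf '%s-%s.%s.%s.svc.%s\n' "$sts" "$ordinal" "$svc" "$ns" "$domain"
}

pod_fqdn postgres 0 postgres-headless data
```

The example call prints postgres-0.postgres-headless.data.svc.cluster.local, the address replicas would use to reach the primary directly.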
Anti-Affinity and Topology
Hard Anti-Affinity (One Pod Per Node)
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: postgres
              topologyKey: kubernetes.io/hostname
# Prevents any two postgres pods on the same node
# Blocks scheduling if insufficient nodes; use "preferred" for flexibility
Zone-Spread Anti-Affinity (HA Across AZs)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: postgres
          topologyKey: topology.kubernetes.io/zone
TopologySpreadConstraints (Preferred over Anti-Affinity)
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule    # hard: block if can't satisfy
    labelSelector:
      matchLabels:
        app: postgres
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # soft: spread but don't block
PodDisruptionBudgets for Stateful Apps
# PostgreSQL primary/replica: always keep at least 1 pod available.
# A PDB counts pods only; it cannot guarantee that the survivor is the primary.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
  namespace: data
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: postgres
---
# Kafka 3-broker cluster with min.insync.replicas=2:
# Must keep at least 2 brokers to accept writes
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: kafka
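A minAvailable budget translates directly into how many pods a drain may evict at the same time: healthy pods minus minAvailable, never below zero. The sketch below is a plain arithmetic illustration with a hypothetical helper name, `allowed_disruptions`.

```shell
#!/bin/sh
# Pods that may be voluntarily evicted at once under a minAvailable PDB.
# Arguments: currently healthy pods, minAvailable.
allowed_disruptions() {
  local healthy min_available n
  healthy="$1"; min_available="$2"
  n=$((healthy - min_available))
  if [ "$n" -lt 0 ]; then
    n=0
  fi
  echo "$n"
}

allowed_disruptions 3 1   # postgres-pdb: prints 2
allowed_disruptions 3 2   # kafka-pdb: prints 1
```

With three healthy PostgreSQL pods, minAvailable: 1 lets two pods be evicted together, which may be looser than intended; a value of 2 would restrict drains to one pod at a time.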
PostgreSQL on Kubernetes
CloudNativePG Operator (Recommended)
CloudNativePG is the CNCF-sandbox operator for PostgreSQL. It manages primary/replica topology, automatic failover, backup integration, and connection pooling via PgBouncer. Use it instead of manual StatefulSet configuration for production.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-cluster
  namespace: data
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.2
  storage:
    size: 100Gi
    storageClass: ebs-gp3
  walStorage:
    size: 20Gi
    storageClass: ebs-io2   # separate fast storage for WAL
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://my-bucket/postgres
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone
  resources:
    requests:
      memory: 4Gi
      cpu: "2"
    limits:
      memory: 8Gi
Patroni-Based StatefulSet (Manual)
For teams not using an operator, Patroni provides HA PostgreSQL with etcd/consul/Kubernetes leader election. The key pattern is an init container that runs pg_basebackup on replicas to initialize from the primary.
initContainers:
  - name: init-replica
    image: postgres:16
    command:
      - sh
      - -c
      - |
        # Skip if primary (ordinal 0) or if data dir already initialized
        ORDINAL=$(echo $HOSTNAME | awk -F'-' '{print $NF}')
        if [ "$ORDINAL" = "0" ] || [ -f "/var/lib/postgresql/data/pgdata/PG_VERSION" ]; then
          exit 0
        fi
        # Wait for primary to be ready
        until pg_isready -h postgres-0.postgres-headless.data.svc.cluster.local -p 5432; do
          echo "Waiting for primary..."; sleep 2
        done
        # Stream base backup from primary
        # -R writes standby.signal and the connection settings (recovery.conf before PostgreSQL 12)
        pg_basebackup \
          -h postgres-0.postgres-headless.data.svc.cluster.local \
          -U replication \
          -D /var/lib/postgresql/data/pgdata \
          -Xs -P -R
    volumeMounts:
      - name: data
        mountPath: /var/lib/postgresql/data
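The init container above derives the pod's role from the ordinal at the end of its hostname. The same decision can be written with shell parameter expansion, which avoids spawning awk and still works when the StatefulSet name contains hyphens. This is a sketch with a hypothetical helper name, `role_for_host`, and it assumes, as the text does, that ordinal 0 is the primary.

```shell
#!/bin/sh
# Decide the role of a StatefulSet pod from its hostname.
# "${1##*-}" strips everything up to the last hyphen, leaving the ordinal.
role_for_host() {
  local ordinal
  ordinal="${1##*-}"
  if [ "$ordinal" = "0" ]; then
    echo primary
  else
    echo replica
  fi
}

role_for_host postgres-0   # prints primary
role_for_host postgres-2   # prints replica
```

Tying the role to ordinal 0 is only valid at first bootstrap; after a failover the primary can be any pod, which is one reason the text recommends an operator or Patroni for production.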
Cassandra on Kubernetes
Cassandra uses a gossip protocol and a token ring. Kubernetes StatefulSets map well because Cassandra nodes require stable DNS names for seed discovery and stable storage for SSTables.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  namespace: data
spec:
  serviceName: cassandra
  replicas: 3
  selector:                # required; must match the template labels
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:4.1
          env:
            - name: POD_IP   # consumed by the readiness probe below
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: CASSANDRA_SEEDS
              # First two pods as seeds; headless DNS provides stable addresses
              value: "cassandra-0.cassandra.data.svc.cluster.local,cassandra-1.cassandra.data.svc.cluster.local"
            - name: CASSANDRA_CLUSTER_NAME
              value: "production"
            - name: CASSANDRA_DC
              value: "dc1"
            - name: CASSANDRA_RACK
              # Use node label for rack awareness; requires Downward API or env injection
              value: "rack1"
            - name: MAX_HEAP_SIZE
              value: "8192M"
            - name: HEAP_NEWSIZE
              value: "2048M"
          resources:
            requests:
              memory: 16Gi   # Cassandra is memory-hungry
              cpu: "4"
          readinessProbe:
            exec:
              command: ["/bin/bash", "-c", "nodetool status | grep -E '^UN\\s+$POD_IP'"]
            initialDelaySeconds: 90
            periodSeconds: 10
          volumeMounts:
            - name: cassandra-data
              mountPath: /var/lib/cassandra
  volumeClaimTemplates:
    - metadata:
        name: cassandra-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: local-nvme   # local NVMe for compaction I/O
        resources:
          requests:
            storage: 500Gi
Use podManagementPolicy: OrderedReady for initial cluster bootstrap: each node must join the ring and reach UN (Up/Normal) state before the next node starts. The policy is immutable on an existing StatefulSet and governs only pod creation and deletion during scaling; rolling updates and restarts proceed one pod at a time in reverse ordinal order regardless of the setting.
Kafka on Kubernetes
Kafka brokers maintain persistent log segments. The broker ID maps naturally to the StatefulSet pod ordinal. Strimzi is the dominant Kubernetes operator for Kafka.
Strimzi Kafka Cluster (Operator Pattern)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-kafka
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2     # producers with acks=all require 2 ISR
      log.retention.hours: 168   # 7 days
    storage:
      type: persistent-claim
      size: 500Gi
      class: ebs-gp3
      deleteClaim: false         # retain PVCs on Kafka cluster delete
    resources:
      requests:
        memory: 8Gi
        cpu: "2"
      limits:
        memory: 16Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
      class: ebs-gp3
KRaft Mode (ZooKeeper-Free, Kafka 3.3+ Production-Ready)
kafka:
  version: 3.7.0
  metadataVersion: 3.7-IV4
  replicas: 3
# No zookeeper section; KRaft uses internal metadata topic
# Combined mode: broker + controller roles in same pods
# Separate mode (recommended for large clusters): dedicated controller nodes
MongoDB on Kubernetes
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: mongodb-replica-set
spec:
  members: 3
  type: ReplicaSet
  version: "7.0.4"
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: admin
      db: admin
      passwordSecretRef:
        name: admin-password
      roles:
        - name: clusterAdmin
          db: admin
  statefulSet:
    spec:
      volumeClaimTemplates:
        - metadata:
            name: data-volume
          spec:
            accessModes: [ReadWriteOnce]
            storageClassName: ebs-gp3
            resources:
              requests:
                storage: 200Gi
        - metadata:
            name: logs-volume
          spec:
            accessModes: [ReadWriteOnce]
            storageClassName: ebs-gp3
            resources:
              requests:
                storage: 10Gi
Redis on Kubernetes
Redis Sentinel (HA for Single Shard)
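# Redis StatefulSet with persistent storage
# Sentinel monitors primary and triggers failover
# Primary: redis-0; Replicas: redis-1, redis-2
# Sentinel quorum: 2 of 3 sentinels must agree for failover
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: data
spec:
  serviceName: redis-headless
  replicas: 3
  selector:                 # required; must match the template labels
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      initContainers:
        - name: config
          image: redis:7.2
          command: ["sh", "-c"]
          args:
            - |
              # Every pod starts from the base config; replicas also point at redis-0
              ORDINAL=$(echo $HOSTNAME | awk -F'-' '{print $NF}')
              cp /tmp/redis-default.conf /etc/redis/redis.conf
              if [ "$ORDINAL" != "0" ]; then
                echo "replicaof redis-0.redis-headless.data.svc.cluster.local 6379" >> /etc/redis/redis.conf
              fi
          volumeMounts:
            - name: base-config   # assumed source of /tmp/redis-default.conf
              mountPath: /tmp
            - name: config
              mountPath: /etc/redis
      containers:
        - name: redis
          image: redis:7.2
          command: ["redis-server", "/etc/redis/redis.conf"]
          volumeMounts:
            - name: data
              mountPath: /data
            - name: config
              mountPath: /etc/redis
      volumes:
        - name: base-config
          configMap:
            name: redis-base-config   # hypothetical ConfigMap with key redis-default.conf
        - name: config
          emptyDir: {}                # hands the generated redis.conf to the main container
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: ebs-gp3
        resources:
          requests:
            storage: 50Gi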
Local PVs with StatefulSets
Local PVs bind directly to a node's NVMe disk, providing the lowest possible latency. The tradeoff: if the node fails, the data on that node is inaccessible (or lost if the disk fails) until the node recovers.
# Local PV: manually provisioned, node affinity required
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node1-nvme
spec:
  capacity:
    storage: 1Ti
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /mnt/nvme0n1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: [node1]
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer   # critical: bind only when pod is scheduled
When the node hosting a local PV goes down, the pod cannot reschedule to another node — it is stuck in Pending indefinitely because no other node has the required PV. Recovery requires either restoring the node, manually moving data, or re-seeding the replica. Use Longhorn or Rook-Ceph for automatic replication if node failure tolerance is required.
Longhorn
Longhorn (CNCF incubating) is a distributed block storage system built for Kubernetes. It replicates volumes across nodes, handles node failures automatically, and provides S3-backed snapshots and backups.
# Longhorn StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"         # replicate across 3 nodes
  staleReplicaTimeout: "2880"   # 48 hours before treating stale replica as failed
  fromBackup: ""
  fsType: ext4
  diskSelector: ""              # target specific disk tags
  nodeSelector: ""              # target specific node tags
  recurringJobSelector: '[{"name":"backup", "isGroup":true}]'
Each Longhorn volume has N replicas on N distinct nodes. The engine process runs on the node hosting the pod, and replicas sync over the network. The replica count is configurable per volume.
Incremental snapshots are pushed to S3, MinIO, or NFS. The RecurringJob CRD schedules automatic backups. A restore creates a new PVC from a backup, and cross-cluster recovery is supported.
Longhorn respects node cordon and drain: the engine migrates to another node automatically when replicas exist elsewhere, and PodDisruptionBudgets are enforced during the drain.
The built-in web UI shows volume health, replica placement, and backup status. Expose it via a Service or Ingress. Access controls are available via Longhorn RBAC resources.
Rook-Ceph
Rook is a CNCF-graduated cloud-native storage orchestrator. Rook deploys and manages Ceph — a production-grade distributed storage system providing block (RBD), file (CephFS), and object (RGW/S3) storage from the same cluster.
CephCluster CRD
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.2   # Reef
  dataDirHostPath: /var/lib/rook       # persisted on host for mon data
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    modules:
      - name: pg_autoscaler
        enabled: true
  dashboard:
    enabled: true
    ssl: true
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: "node1"
        devices:
          - name: "nvme0n1"
          - name: "nvme1n1"
      - name: "node2"
        devices:
          - name: "nvme0n1"
      - name: "node3"
        devices:
          - name: "nvme0n1"
  placement:
    osd:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: rook-ceph-osd
Ceph Block Storage (RBD) StorageClass
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host   # CRUSH: tolerate full host failure
  replicated:
    size: 3
    requireSafeReplicaSize: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering   # required for snapshots; add exclusive-lock,object-map,fast-diff for better performance
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
allowVolumeExpansion: true
reclaimPolicy: Delete
CephFS StorageClass (ReadWriteMany)
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - name: data0
      replicated:
        size: 3
  preserveFilesystemOnDelete: true   # don't wipe filesystem on CRD delete
  metadataServer:
    activeCount: 1
    activeStandby: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-data0
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  # Expansion needs its own secret reference, as in the RBD class above
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
allowVolumeExpansion: true
reclaimPolicy: Delete
Set failureDomain: rack in CephBlockPool and label nodes with topology.rook.io/rack=rack1 to spread OSD replicas across physical racks. With 3 racks and replica size 3, a full rack failure is tolerated with no data loss. Requires at least 3 OSDs per rack for optimal placement.
Backup Strategies for Stateful Workloads
| Strategy | Mechanism | Pros | Cons | Best For |
|---|---|---|---|---|
| PVC Volume Snapshot | CSI CreateSnapshot RPC | Fast, crash-consistent | Not app-consistent without quiesce | Caches, replicas with WAL replay |
| App-level dump | pg_dump / mongodump / redis-cli BGSAVE | Fully consistent, portable SQL | Slow for large DBs, CPU overhead | Primary source of truth DBs |
| pg_basebackup / pgBackRest | Streaming replication protocol | Fast physical backup, WAL archiving, PITR | PostgreSQL-specific | PostgreSQL production backups |
| Velero + CSI snapshots | Velero + VolumeSnapshot API | Cluster-wide, namespace backup, manifest + data together | Requires CSI snapshot support | Full namespace DR |
| Longhorn backup | Longhorn recurring job → S3 | Incremental, integrated with Longhorn volumes | Longhorn-specific | Longhorn-managed volumes |
| Rook-Ceph RBD mirroring | Ceph RBD mirroring to remote cluster | Async replication to DR cluster | Complex setup, separate Ceph cluster | Active-passive cross-cluster DR |
Metrics, Alerts, and Runbooks
Key Metrics
| Metric | Source | What to Watch |
|---|---|---|
| kube_statefulset_status_replicas_ready | kube-state-metrics | Should equal kube_statefulset_replicas; alert on mismatch |
| kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision | kube-state-metrics | Revision is exposed as a label, so a non-empty result indicates an incomplete rolling update; alert if stale > 15m |
| kube_persistentvolumeclaim_status_phase{phase="Lost"} | kube-state-metrics | PVC Lost = pod stuck Pending; page immediately |
| kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes | kubelet | Alert at 80% and 90% utilization |
| kube_pod_container_status_restarts_total | kube-state-metrics | StatefulSet pod restarts may indicate data corruption or OOM |
Alerting Rules
groups:
  - name: stateful-storage
    rules:
      - alert: StatefulSetNotFullyReady
        expr: |
          kube_statefulset_status_replicas_ready
            != kube_statefulset_replicas
        for: 5m
        annotations:
          summary: "StatefulSet {{ $labels.statefulset }} not fully ready"
      - alert: StatefulSetRolloutStuck
        # Both metrics carry the revision as a label with a constant value of 1,
        # so compare label sets with "unless" rather than values with "!=".
        expr: |
          kube_statefulset_status_current_revision
            unless kube_statefulset_status_update_revision
        for: 15m
        annotations:
          summary: "StatefulSet {{ $labels.statefulset }} rollout stuck"
      - alert: PVCDiskPressureWarning
        expr: |
          (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.80
        for: 5m
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} >80% full"
      - alert: PVCDiskPressureCritical
        expr: |
          (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} >90% full"
Runbooks
Pod stuck in Pending: check PVC status (kubectl get pvc). If the PVC is Pending, suspect a storage class mismatch or an exhausted quota. If it is Lost, the PV was deleted; recreate the PV with the original volumeHandle and a claimRef carrying the PVC's UID. If no PVC exists, check the volumeClaimTemplates and the StorageClass volume binding mode.
Rolling update stuck: check pod-N readiness with kubectl describe pod sts-N. Common causes are a new image that crashes on startup or a failing readiness probe. For an OrderedReady deadlock, temporarily raise the partition value to skip the stuck pod, or use kubectl rollout undo and then delete the stuck pod so it is re-created from the restored revision.
Volume expansion: edit the PVC's spec.resources.requests.storage to a larger value (the storage class must have allowVolumeExpansion: true, and the CSI driver must support ControllerExpandVolume). If the PVC reports the FileSystemResizePending condition, restart the pod and wait for the condition to clear.
Ceph OSD down: check the OSD pod with kubectl -n rook-ceph get pod -l app=rook-ceph-osd. After a node failure, Ceph auto-heals once mon_osd_down_out_interval (10 min default) has elapsed. Check PG status with kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status. Alert if PGs stay in undersized+degraded for more than 30 min.
Orphaned PVC cleanup: list candidates with a label selector, kubectl get pvc -n ns -l app=deleted-sts. Verify no running pods use them: kubectl get pod -n ns -o json | jq -r '.items[].spec.volumes[]? | .persistentVolumeClaim.claimName // empty'. Delete manually after confirming the data is backed up.
Best Practices
- Always use a headless service: set clusterIP: None with publishNotReadyAddresses: true for peer discovery during init. Never share a ClusterIP service for intra-cluster database replication.
- Set terminationGracePeriodSeconds appropriately: databases need time to flush WAL, finish checkpoints, and gracefully close connections. 60–120 seconds is typical; never leave it at the default 30s for PostgreSQL or Cassandra.
- Use podManagementPolicy: OrderedReady when bootstrap order matters: choose Parallel only when the application tolerates all replicas starting at once. The field is immutable and affects scaling only, so decide before creating the StatefulSet. Never use Parallel with databases that require leader election before followers start.
- Pin partition to the highest ordinal during upgrades: canary one replica before rolling all. This is especially critical for major Kafka, PostgreSQL, and Cassandra version upgrades with format changes.
- Never rely on StatefulSet PVC auto-cleanup for production data: keep persistentVolumeClaimRetentionPolicy.whenDeleted: Retain and whenScaled: Retain for all production databases. Automate orphaned PVC detection and review candidates manually before deleting.
- Separate data and WAL/log volumes: put WAL on a separate higher-IOPS StorageClass (io2 vs gp3). This allows independent sizing and prevents WAL I/O from contending with table data reads.
- Use an operator for complex databases: CloudNativePG for PostgreSQL, Strimzi for Kafka, MongoDB Community Operator for MongoDB. Manual StatefulSet management misses critical details (failover logic, backup scheduling, rolling upgrade ordering, connection pooling).
- Test failover and recovery procedures regularly: simulate node failure, test backup restore, and validate that replicas rejoin after restart. Storage weaknesses tend to surface during real incidents rather than in monitoring, so rehearse them beforehand.