Storage Operations | Kubernetes Docs

Kubernetes Storage Stack

Storage in Kubernetes is layered: the Container Storage Interface (CSI) decouples driver logic from the kubelet, PersistentVolumes (PV) represent the physical resource, PersistentVolumeClaims (PVC) represent a request for storage, and StorageClasses define how PVs are dynamically provisioned.

Storage Architecture — Request to Mount Path

  Pod spec
    └─► PersistentVolumeClaim  (namespace-scoped claim)
          └─► StorageClass      (binding mode / provisioner / params)
                └─► CSI Driver  (provisioner plugin in kube-system)
                      ├─► Controller Plugin  (runs as Deployment)
                      │     ├─ CreateVolume             (AWS EBS CreateVolume API call)
                      │     ├─ ControllerPublishVolume  (attach EBS vol to EC2 node)
                      │     └─ DeleteVolume             (GC on PVC delete when reclaim policy is Delete)
                      └─► Node Plugin         (runs as DaemonSet on every node)
                            ├─ NodeStageVolume  (format + mount to staging path)
                            └─ NodePublishVolume (bind-mount into pod's fs)

  On-disk path:
    /var/lib/kubelet/plugins/kubernetes.io/csi/...  (staging)
    /var/lib/kubelet/pods/<UID>/volumes/...         (pod volume mount)

Storage Object States

Object	State	Meaning	Common Cause
PVC	`Pending`	No matching PV exists yet	WaitForFirstConsumer / no capacity / wrong SC
PVC	`Bound`	PV attached and ready	Normal operating state
PVC	`Lost`	Bound PV deleted or unreachable	Manual PV deletion, zone mismatch
PV	`Available`	Not yet bound to any PVC	Pre-provisioned static PV
PV	`Bound`	Claimed by a PVC	Normal
PV	`Released`	PVC deleted, PV not yet reclaimed	Retain reclaim policy
PV	`Failed`	Automated reclamation failed	Cloud volume deletion error
VolumeAttachment	`Attached: false`	CSI attach in progress or stuck	Node failure, CSI pod crash, AZ mismatch

Quick-Reference Diagnostic Commands

# List all PVCs with their phase and storage class
kubectl get pvc -A -o wide

# List PVs showing reclaim policy, capacity, access modes
kubectl get pv -o custom-columns=\
NAME:.metadata.name,\
CAPACITY:.spec.capacity.storage,\
ACCESS:.spec.accessModes[0],\
POLICY:.spec.persistentVolumeReclaimPolicy,\
STATUS:.status.phase,\
CLAIM:.spec.claimRef.namespace

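# Find all PVCs that are NOT Bound
# (PVCs do not support a status.phase field selector, so filter client-side;
#  with -A the STATUS column is the third field)
kubectl get pvc -A --no-headers | awk '$3 != "Bound"'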

# Describe a specific PVC to see binding events and CSI details
kubectl describe pvc <pvc-name> -n <namespace>

# List VolumeAttachment objects (CSI controller attach status)
kubectl get volumeattachment

# Check CSI node driver registration
kubectl get csinode <node-name> -o yaml

# List StorageClasses and their provisioners
kubectl get sc -o wide

PV Lifecycle Operations

Reclaim Policy Behaviour

⚠ Delete vs Retain

The default reclaim policy for dynamically provisioned PVs is Delete: the underlying cloud volume is destroyed when the PVC is deleted. For production databases, use a StorageClass with reclaimPolicy: Retain (or patch existing PVs to Retain), and confirm that Velero backups exist before deleting any PVC.

# Patch a PV to change reclaim policy from Delete to Retain
kubectl patch pv <pv-name> \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# Patch ALL PVs that use gp3 storage class to Retain
kubectl get pv -o json | \
  jq -r '.items[] | select(.spec.storageClassName=="gp3") | .metadata.name' | \
  xargs -I{} kubectl patch pv {} \
    -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

Reclaiming a Released PV

When a PVC is deleted and the PV reclaim policy is Retain, the PV enters Released state. It cannot be automatically rebound until the claimRef is cleared:

# Step 1: verify PV is in Released state
kubectl get pv <pv-name>

# Step 2: remove the claimRef so the PV becomes Available again
kubectl patch pv <pv-name> --type json \
  -p '[{"op":"remove","path":"/spec/claimRef"}]'

# Step 3: create a new PVC that binds to this specific PV
# Use volumeName in the PVC spec:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-recovered
  namespace: production
spec:
  storageClassName: gp3
  volumeName: pvc-abc123        # pin to the specific PV
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
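
The recovery steps above can also be scripted. The following is a minimal sketch, assuming you build the patch and the manifest in Python and pass them to kubectl yourself; the helper names claimref_removal_patch and pinned_pvc are hypothetical and not part of any Kubernetes client library.

```python
import json


def claimref_removal_patch():
    """JSON Patch that clears spec.claimRef so a Released PV becomes Available."""
    return [{"op": "remove", "path": "/spec/claimRef"}]


def pinned_pvc(name, namespace, pv_name, storage_class, size):
    """Build a PVC manifest pinned to one specific PV through spec.volumeName."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "volumeName": pv_name,  # pin to the specific PV
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Example values taken from the manifest above.
    print(json.dumps(claimref_removal_patch()))
    print(json.dumps(
        pinned_pvc("postgres-data-recovered", "production",
                   "pvc-abc123", "gp3", "100Gi"),
        indent=2,
    ))
```

The first line of output can be passed to kubectl patch pv <pv-name> --type json -p, and the second document can be saved and applied with kubectl apply -f, since kubectl accepts JSON manifests as well as YAML.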

Reclaim Policy on StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain          # <-- preserve volume on PVC delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"

CSI Driver Operations

CSI Component Health Check

# AWS EBS CSI Driver (typical install via EKS add-on or Helm)
kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl get pods -n kube-system -l app=ebs-csi-node

# Check EBS CSI controller logs for provisioning errors
kubectl logs -n kube-system \
  -l app=ebs-csi-controller -c csi-provisioner --tail=50

# On EKS, check the status of the aws-ebs-csi-driver managed add-on
aws eks describe-addon \
  --cluster-name <cluster> \
  --addon-name aws-ebs-csi-driver \
  --query 'addon.{status:status,version:addonVersion}'

# Check CSI node plugin status on a specific node
kubectl describe csinode <node-name>

# EFS CSI Driver
kubectl get pods -n kube-system -l app=efs-csi-controller
kubectl get pods -n kube-system -l app=efs-csi-node

CSI Controller vs Node Plugin Responsibilities

Operation	Plugin	What it does	Failure symptom
CreateVolume	Controller	Calls AWS EBS CreateVolume	PVC stuck Pending — "failed to provision volume"
ControllerPublishVolume	Controller	Attaches EBS vol to EC2 instance	VolumeAttachment stuck, pod stays Pending
DeleteVolume	Controller	Destroys cloud volume on PVC delete	Orphaned EBS volumes in AWS console
NodeStageVolume	Node	Formats filesystem + mounts to staging	Pod event: "failed to stage volume"
NodePublishVolume	Node	Bind-mounts staged path into pod	Pod stuck ContainerCreating, mount error
NodeExpandVolume	Node	Resizes filesystem after PVC expand	PVC resize stuck FileSystemResizePending

Diagnosing a Stuck Pod (ContainerCreating / Volume Mount Failure)

# Step 1: describe the stuck pod — look for events
kubectl describe pod <pod-name> -n <namespace>
# Common events:
#   "Unable to attach or mount volumes"
#   "Multi-Attach error for volume" — vol attached to another node
#   "timed out waiting for the condition"

# Step 2: check the VolumeAttachment object for the PV
kubectl get volumeattachment | grep <pv-name>
kubectl describe volumeattachment <attachment-name>

# Step 3: if VolumeAttachment is stuck with the old node:
# This happens when a node dies ungracefully — the vol is still
# "attached" to the dead node from K8s perspective
# Force-delete the VolumeAttachment (CSI controller will re-attach)
kubectl delete volumeattachment <attachment-name>

# Step 4: check the CSI controller pod for errors
kubectl logs -n kube-system \
  -l app=ebs-csi-controller -c csi-attacher --tail=100

# Step 5: verify the EBS volume state in AWS
aws ec2 describe-volumes \
  --volume-ids <vol-id> \
  --query 'Volumes[0].{State:State,Attachments:Attachments}'

Multi-Attach Error (ReadWriteOnce)

EBS volumes with accessModes: ReadWriteOnce can be attached to only one node at a time. If a pod is rescheduled to a different node while the old VolumeAttachment still exists (for example, the node crashed rather than being drained), the new attachment fails. Drain nodes gracefully (kubectl drain) before terminating them so the VolumeAttachment is cleaned up.

EBS / Block Storage Operations

gp3 vs gp2 StorageClass

AWS EBS gp3 provides a baseline of 3000 IOPS and 125 MB/s throughput at no extra cost, independent of volume size. gp2 ties baseline IOPS to size (3 IOPS/GB, max 16,000). For most workloads, migrating from gp2 to gp3 reduces cost and makes performance more predictable.
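
To make the size dependence concrete, here is a small sketch of the gp2 baseline formula next to the fixed gp3 baseline. It assumes the figures quoted above plus the 100 IOPS floor that gp2 applies to small volumes; it ignores gp2 burst credits, and the function names are illustrative only.

```python
GP3_BASELINE_IOPS = 3000  # fixed baseline quoted above, independent of size


def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, floor of 100, ceiling of 16,000."""
    return max(100, min(16000, 3 * size_gib))


def gp3_baseline_is_higher(size_gib):
    """True when an unmodified gp3 volume out-performs gp2 at this size."""
    return GP3_BASELINE_IOPS > gp2_baseline_iops(size_gib)


if __name__ == "__main__":
    for size in (10, 100, 500, 1000, 6000):
        print(size, "GiB ->", gp2_baseline_iops(size), "gp2 baseline IOPS;",
              "gp3 baseline higher:", gp3_baseline_is_higher(size))
```

Under these assumptions, gp2 only reaches the gp3 baseline once the volume is 1000 GiB or larger, which is why small and medium volumes tend to benefit most from the migration.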

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # only provision in the AZ where pod lands
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  type: gp3
  iops: "3000"           # baseline — can go up to 16000
  throughput: "125"      # MB/s — can go up to 1000
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:123456789012:key/mrk-..."

High-Performance StorageClass for Databases

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-high-iops
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
parameters:
  type: gp3
  iops: "16000"          # max gp3 IOPS
  throughput: "1000"     # max gp3 throughput
  encrypted: "true"
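
Provisioned gp3 performance is bounded by ratios as well as by the absolute maximums. The sketch below assumes the limits AWS has documented for gp3 (at most 500 IOPS per GiB of volume size, and at most 0.25 MiB/s of throughput per provisioned IOPS); treat both constants as assumptions to verify against current AWS documentation. The helper names are hypothetical.

```python
import math

# Assumed gp3 ratio limits; verify against current AWS documentation.
MAX_IOPS_PER_GIB = 500
MIN_IOPS_PER_MIBPS = 4  # equivalent to 0.25 MiB/s per provisioned IOPS


def min_size_gib_for_iops(iops):
    """Smallest volume size (GiB) that may carry the requested IOPS."""
    return math.ceil(iops / MAX_IOPS_PER_GIB)


def min_iops_for_throughput(throughput_mibps):
    """Fewest provisioned IOPS that allow the requested throughput."""
    return math.ceil(throughput_mibps * MIN_IOPS_PER_MIBPS)


def gp3_problems(size_gib, iops, throughput_mibps):
    """Return a list of human-readable ratio violations (empty if none)."""
    problems = []
    if size_gib < min_size_gib_for_iops(iops):
        problems.append(
            f"{iops} IOPS needs at least {min_size_gib_for_iops(iops)} GiB")
    if iops < min_iops_for_throughput(throughput_mibps):
        problems.append(
            f"{throughput_mibps} MiB/s needs at least "
            f"{min_iops_for_throughput(throughput_mibps)} IOPS")
    return problems


if __name__ == "__main__":
    print(gp3_problems(size_gib=20, iops=16000, throughput_mibps=1000))
    print(gp3_problems(size_gib=100, iops=16000, throughput_mibps=1000))
```

If these ratios hold, a PVC smaller than 32Gi that uses the gp3-high-iops class would fail to provision, so it is worth checking claim sizes before pointing workloads at that class.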

Checking EBS Volume Metrics

# Get the EBS volume ID from a PV
kubectl get pv <pv-name> \
  -o jsonpath='{.spec.csi.volumeHandle}'

# Check CloudWatch EBS metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeQueueLength \
  --dimensions Name=VolumeId,Value=vol-0abc123 \
  --start-time $(date -u -d '1 hour ago' +%FT%TZ) \
  --end-time $(date -u +%FT%TZ) \
  --period 300 \
  --statistics Average

PromQL: PVC Disk Saturation

# PVC disk usage percentage
(
  kubelet_volume_stats_used_bytes
    /
  kubelet_volume_stats_capacity_bytes
) * 100

# PVCs above 80% full
(
  kubelet_volume_stats_used_bytes
    /
  kubelet_volume_stats_capacity_bytes
) * 100 > 80

# PVC inode usage percentage
(
  kubelet_volume_stats_inodes_used
    /
  kubelet_volume_stats_inodes
) * 100

# Available bytes
kubelet_volume_stats_available_bytes{namespace="production"}

# PVCs with less than 10% free
(
  kubelet_volume_stats_available_bytes
    /
  kubelet_volume_stats_capacity_bytes
) * 100 < 10
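
The same ratio can be evaluated outside Prometheus, for example in a capacity report. This is a minimal sketch that assumes you already have used and capacity byte counts for a volume; the 80% and 95% thresholds mirror the warning and critical alert rules in the Storage Alerting section, and the function names are illustrative.

```python
def usage_percent(used_bytes, capacity_bytes):
    """Percentage of the volume in use, same ratio as the PromQL above."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return used_bytes * 100 / capacity_bytes


def severity(used_bytes, capacity_bytes, warn=80.0, crit=95.0):
    """Classify a volume as ok, warning, or critical by usage percentage."""
    pct = usage_percent(used_bytes, capacity_bytes)
    if pct > crit:
        return "critical"
    if pct > warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    gib = 2 ** 30
    for used in (50, 85, 97):
        print(used, "GiB of 100 GiB ->", severity(used * gib, 100 * gib))
```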

EFS / Shared Storage Operations

Amazon EFS provides ReadWriteMany (RWX) access across multiple pods and nodes. It is NFS-based, with throughput automatically scaling. Use EFS for shared content, ML training data, and CMS uploads — not for databases (latency is 1-10ms vs EBS 0.1-1ms).

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap          # use EFS Access Points (isolated dirs per PVC)
  fileSystemId: fs-0abc12345
  directoryPerms: "700"
  gidRangeStart: "1000"
  gidRangeEnd: "2000"
  basePath: "/dynamic_provisioning"

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets
  namespace: production
spec:
  storageClassName: efs-sc
  accessModes: [ReadWriteMany]      # multiple pods can mount simultaneously
  resources:
    requests:
      storage: 5Gi                  # EFS ignores this — it scales elastically

EFS Performance Troubleshooting

# EFS NFS mount options — tuned for K8s (in EFS CSI driver StorageClass)
# noresvport: don't use reserved source port (allows reconnect after timeout)
# soft: NFS soft mount — return EIO on timeout instead of hanging forever
# timeo=600: 60 second NFS timeout (10ths of seconds)
# These are set in EFS CSI driver ConfigMap:
kubectl describe configmap -n kube-system efs-csi-driver-config 2>/dev/null || true

# Check EFS mount on a node running an EFS-backed pod
NODE=$(kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}')
kubectl debug node/$NODE -it --image=nicolaka/netshoot -- \
  df -h | grep nfs

# EFS throughput utilization via CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/EFS \
  --metric-name MeteredIOBytes \
  --dimensions Name=FileSystemId,Value=fs-0abc12345 \
  --start-time $(date -u -d '1 hour ago' +%FT%TZ) \
  --end-time $(date -u +%FT%TZ) \
  --period 300 \
  --statistics Sum

Storage Performance Analysis

fio Benchmark — Block Storage (EBS)

# Run fio benchmarks inside a pod using a PVC
apiVersion: v1
kind: Pod
metadata:
  name: fio-benchmark
  namespace: default
spec:
  containers:
  - name: fio
    image: nixery.dev/shell/fio/python3   # python3 is required by the JSON summaries below
    command:
    - /bin/bash
    - -c
    - |
      echo "=== Random Read 4K IOPS ==="
      # --group_reporting merges the 4 jobs into jobs[0]; p99 is read from completion latency (clat_ns)
      fio --name=randread --ioengine=libaio --iodepth=32 \
        --rw=randread --bs=4k --direct=1 --numjobs=4 --group_reporting \
        --size=1G --runtime=60 --filename=/data/test \
        --output-format=json | \
        python3 -c "import json,sys; d=json.load(sys.stdin); \
          r=d['jobs'][0]['read']; \
          print(f'IOPS: {r[\"iops\"]:.0f}, BW: {r[\"bw\"]/1024:.1f} MiB/s, clat_p99: {r[\"clat_ns\"][\"percentile\"][\"99.000000\"]/1e6:.2f}ms')"

      echo "=== Random Write 4K IOPS ==="
      # --group_reporting merges the 4 jobs into jobs[0]; p99 is read from completion latency (clat_ns)
      fio --name=randwrite --ioengine=libaio --iodepth=32 \
        --rw=randwrite --bs=4k --direct=1 --numjobs=4 --group_reporting \
        --size=1G --runtime=60 --filename=/data/test \
        --output-format=json | \
        python3 -c "import json,sys; d=json.load(sys.stdin); \
          w=d['jobs'][0]['write']; \
          print(f'IOPS: {w[\"iops\"]:.0f}, BW: {w[\"bw\"]/1024:.1f} MiB/s, clat_p99: {w[\"clat_ns\"][\"percentile\"][\"99.000000\"]/1e6:.2f}ms')"

      echo "=== Sequential Read 1M throughput ==="
      fio --name=seqread --ioengine=libaio --iodepth=8 \
        --rw=read --bs=1M --direct=1 --numjobs=1 \
        --size=2G --runtime=30 --filename=/data/test
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: <pvc-name>
  restartPolicy: Never

Expected I/O Latencies by Storage Type

Storage Type	Read p50	Read p99	Write p50	IOPS Limit	Use Case
EBS gp3 (NVMe)	0.1ms	0.5ms	0.2ms	16,000	Databases, etcd
EBS io2 Block Express	<0.1ms	0.2ms	<0.1ms	256,000	Critical OLTP
EFS (NFS)	1ms	10ms	2ms	—	Shared, RWX
Local NVMe (instance store)	0.05ms	0.1ms	0.05ms	1M+	Kafka, ephemeral caches
emptyDir (tmpfs)	<0.01ms	<0.01ms	<0.01ms	RAM-bound	Temp data, shared mem

Node Storage Metrics

# Check node disk utilization
kubectl top nodes  # doesn't show disk — use node_exporter

# node_exporter disk metrics via PromQL:
# Disk read/write bytes per second
rate(node_disk_read_bytes_total{device=~"nvme.*|xvd.*"}[5m])
rate(node_disk_written_bytes_total{device=~"nvme.*|xvd.*"}[5m])

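# Average I/O latency ("await", ms per operation); high values indicate I/O saturation
rate(node_disk_read_time_seconds_total{device=~"nvme.*|xvd.*"}[5m])
  / rate(node_disk_reads_completed_total{device=~"nvme.*|xvd.*"}[5m]) * 1000
rate(node_disk_write_time_seconds_total{device=~"nvme.*|xvd.*"}[5m])
  / rate(node_disk_writes_completed_total{device=~"nvme.*|xvd.*"}[5m]) * 1000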

# Disk utilization percentage
rate(node_disk_io_time_seconds_total{device=~"nvme.*|xvd.*"}[5m]) * 100

# Root filesystem usage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"})
  / node_filesystem_size_bytes{mountpoint="/"} * 100

PVC Resize Operations

ℹ Prerequisites for Online Resize

The StorageClass must have allowVolumeExpansion: true, and the CSI driver must support the EXPAND_VOLUME capability. Once the cloud volume has grown, the filesystem resize (NodeExpandVolume) runs online if the node plugin supports online expansion for that filesystem; otherwise it runs the next time the pod restarts. EBS CSI driver v1.0+ supports online resize for ext4 and XFS without a pod restart.

Resize a PVC

# Step 1: patch the PVC spec.resources.requests.storage
kubectl patch pvc postgres-data -n production \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Step 2: watch for the resize to complete
kubectl get pvc postgres-data -n production -w
# Status transitions:
#   Bound (100Gi) → Bound (100Gi, FileSystemResizePending) → Bound (200Gi)

# Step 3: if status stays at FileSystemResizePending,
# a pod restart triggers the node-side filesystem expansion
kubectl rollout restart deployment/postgres -n production

# Step 4: verify new size inside the pod
kubectl exec -n production deploy/postgres -- df -h /var/lib/postgresql/data

Resize a StatefulSet PVC (requires manual steps)

StatefulSet VolumeClaimTemplates are immutable — Kubernetes will not resize them automatically. The workaround is to resize each PVC individually and then update the StatefulSet template.

# Resize all PVCs in a StatefulSet (e.g., kafka-data-kafka-0,1,2...)
STS_NAME=kafka
NAMESPACE=kafka
NEW_SIZE=200Gi

# Step 1: patch each PVC
kubectl get pvc -n $NAMESPACE -l app=$STS_NAME \
  -o jsonpath='{.items[*].metadata.name}' | \
  tr ' ' '\n' | \
  while read -r pvc; do
    echo "Resizing $pvc..."
    kubectl patch pvc "$pvc" -n "$NAMESPACE" \
      -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$NEW_SIZE\"}}}}"
  done

# Step 2: delete the StatefulSet WITHOUT deleting pods (--cascade=orphan)
kubectl delete sts $STS_NAME -n $NAMESPACE --cascade=orphan

# Step 3: re-apply the StatefulSet with updated volumeClaimTemplates size
# Edit your manifest to reflect NEW_SIZE, then:
kubectl apply -f kafka-statefulset.yaml -n $NAMESPACE

# Step 4: rolling restart to trigger filesystem expansion
kubectl rollout restart statefulset/$STS_NAME -n $NAMESPACE

⚠ You Cannot Shrink a PVC

PVC storage requests can only grow, never shrink. To reduce storage size you must create a new smaller PVC, copy data, and swap the mount — there is no in-place shrink in Kubernetes or EBS.
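
A resize script can reject shrink requests before they reach the API server. The sketch below assumes plain Kubernetes quantity strings such as 100Gi or 2Ti; it handles only the common binary and decimal suffixes, and parse_quantity and check_resize are hypothetical helper names, not part of kubectl or any client library.

```python
import re

_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50}
_DECIMAL = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}


def parse_quantity(quantity):
    """Convert a quantity string such as '100Gi' into a byte count."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", quantity.strip())
    if not match:
        raise ValueError(f"unparseable quantity: {quantity!r}")
    number, suffix = match.groups()
    if suffix in _BINARY:
        factor = _BINARY[suffix]
    elif suffix in _DECIMAL:
        factor = _DECIMAL[suffix]
    else:
        raise ValueError(f"unsupported suffix: {suffix!r}")
    return int(float(number) * factor)


def check_resize(current, requested):
    """Return the number of bytes added; raise if the request would shrink."""
    delta = parse_quantity(requested) - parse_quantity(current)
    if delta < 0:
        raise ValueError(f"cannot shrink PVC from {current} to {requested}")
    return delta


if __name__ == "__main__":
    print(check_resize("100Gi", "200Gi"))
```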

StorageClass Tuning

WaitForFirstConsumer — Why It Matters

With volumeBindingMode: Immediate, a PVC triggers volume creation before a pod is scheduled. The volume may land in AZ-A while the pod can only run in AZ-B; because an EBS volume cannot be attached across AZs, the pod then stays unschedulable or fails to mount the volume. WaitForFirstConsumer delays provisioning until the pod's target node is known, aligning the volume's AZ with the pod's AZ.

WaitForFirstConsumer — Binding Flow

  Immediate (BAD for multi-AZ):
    PVC created → CSI provisions vol in AZ-A → Pod scheduled to AZ-B → MOUNT FAIL

  WaitForFirstConsumer (CORRECT):
    PVC created → status: Pending (no vol yet)
    Pod scheduled → Scheduler picks node in AZ-B
    CSI notified of target node AZ → provisions vol in AZ-B → PVC binds → pod mounts ✓
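
The difference between the two flows can be illustrated with a toy model. This is a deliberately simplified sketch, not how the scheduler or the CSI driver is implemented: it assumes the provisioner picks a zone on its own in Immediate mode, and that the pod can only run in a given set of zones. All names are hypothetical.

```python
def bind(mode, pod_zones, provisioner_zone):
    """Return (volume_zone, pod_zone); pod_zone is None if the pod cannot run.

    mode             -- "Immediate" or "WaitForFirstConsumer"
    pod_zones        -- set of zones where the pod is allowed to run
    provisioner_zone -- zone the provisioner picks without pod information
    """
    if mode == "Immediate":
        # The volume is created first, without knowing where the pod can run.
        volume_zone = provisioner_zone
        pod_zone = volume_zone if volume_zone in pod_zones else None
        return volume_zone, pod_zone
    if mode == "WaitForFirstConsumer":
        # The pod is placed first; the volume follows it into the same zone.
        if not pod_zones:
            return None, None
        pod_zone = sorted(pod_zones)[0]
        return pod_zone, pod_zone
    raise ValueError(f"unknown binding mode: {mode}")


if __name__ == "__main__":
    zones = {"us-east-1b"}
    print(bind("Immediate", zones, "us-east-1a"))
    print(bind("WaitForFirstConsumer", zones, "us-east-1a"))
```

In the first call the volume ends up in a zone the pod cannot use, so the pod has nowhere to run; in the second call the volume is created in the pod's zone.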

StorageClass Reference Table

Name	Provisioner	Use Case	Reclaim	Binding Mode
`gp3` (default)	ebs.csi.aws.com	General workloads	Delete	WaitForFirstConsumer
`gp3-retain`	ebs.csi.aws.com	Databases (stateful)	Retain	WaitForFirstConsumer
`gp3-high-iops`	ebs.csi.aws.com	High-throughput DBs	Retain	WaitForFirstConsumer
`efs-sc`	efs.csi.aws.com	Shared RWX access	Delete	Immediate
`local-nvme`	kubernetes.io/no-provisioner	Local NVMe (Kafka)	Retain	WaitForFirstConsumer

Local StorageClass for High-Performance Workloads

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain

---
# Pre-provision a Local PV (one per NVMe disk per node)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-nvme-node1
spec:
  capacity:
    storage: 1000Gi
  accessModes: [ReadWriteOnce]
  storageClassName: local-nvme
  persistentVolumeReclaimPolicy: Retain
  local:
    path: /mnt/nvme0n1       # pre-formatted and mounted on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-with-nvme-1"]

StatefulSet Operations

StatefulSet Rolling Updates

StatefulSets update pods in reverse ordinal order (N-1, N-2, ..., 0) by default. Each pod is updated, waits to become Ready, then the next is updated. This respects quorum for distributed systems like Kafka, Zookeeper, and etcd.

# Trigger a rolling update (after editing the StatefulSet spec)
kubectl rollout restart statefulset/kafka -n kafka

# Monitor rollout progress
kubectl rollout status statefulset/kafka -n kafka

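# StatefulSets do not support "kubectl rollout pause" or "resume" (Deployments only).
# To hold a rollout partway for a canary, raise updateStrategy.rollingUpdate.partition;
# pods with an ordinal below the partition keep the old revision:
kubectl patch statefulset kafka -n kafka \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'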

# Rollback to previous revision
kubectl rollout undo statefulset/kafka -n kafka

Partition-Based Canary for StatefulSets

The partition field in updateStrategy.rollingUpdate limits updates to pods with ordinal ≥ partition. This lets you update a subset first:

# Set partition=2 on a 3-replica StatefulSet (only pod-2 gets updated)
kubectl patch statefulset kafka -n kafka \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'

# After validating pod-2 is healthy, lower partition to 1 (update pod-2, pod-1)
kubectl patch statefulset kafka -n kafka \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'

# Lower to 0 to complete the update
kubectl patch statefulset kafka -n kafka \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
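
The effect of each partition value can be worked out ahead of time. This small sketch applies the rule stated above (pods with ordinal ≥ partition are updated, highest ordinal first); the function name is illustrative.

```python
def pods_updated(sts_name, replicas, partition):
    """Pods that receive the new revision, in the order they are updated."""
    return [f"{sts_name}-{ordinal}"
            for ordinal in range(replicas - 1, partition - 1, -1)]


if __name__ == "__main__":
    for partition in (3, 2, 1, 0):
        print("partition", partition, "->", pods_updated("kafka", 3, partition))
```

A partition equal to or greater than the replica count yields an empty list, which is a convenient way to stage a spec change without updating any pod yet.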

Force-Delete a Stuck StatefulSet Pod

Force-Delete Is Risky for Stateful Applications

Force-deleting a StatefulSet pod (e.g., after node failure) may result in two pods with the same identity running simultaneously if the original pod is still alive on a partitioned node. Only force-delete when the node is confirmed dead (terminated in cloud console, not just NotReady).

# Standard delete (waits for graceful termination — may block if node is dead)
kubectl delete pod kafka-2 -n kafka

# Force delete (bypasses graceful shutdown — only when node is confirmed dead)
kubectl delete pod kafka-2 -n kafka --force --grace-period=0

Headless Service and DNS for StatefulSets

apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
  namespace: kafka
spec:
  clusterIP: None                      # headless — no VIP, DNS returns pod IPs
  publishNotReadyAddresses: true       # important: include unready pods in DNS
  selector:
    app: kafka
  ports:
  - name: kafka
    port: 9092

# DNS format for StatefulSet pods:
# <pod-name>.<service-name>.<namespace>.svc.cluster.local
# kafka-0.kafka-headless.kafka.svc.cluster.local
# kafka-1.kafka-headless.kafka.svc.cluster.local

# Verify headless DNS from within the cluster
kubectl run dns-test --image=nicolaka/netshoot --rm -it -- \
  dig kafka-0.kafka-headless.kafka.svc.cluster.local
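
Client configuration such as a bootstrap server list can be generated from the DNS format above. A minimal sketch, assuming the default cluster domain cluster.local and the names used in this section; the function name is hypothetical.

```python
def statefulset_dns_names(sts_name, service, namespace, replicas,
                          cluster_domain="cluster.local"):
    """Per-pod DNS names: <pod-name>.<service>.<namespace>.svc.<domain>."""
    return [
        f"{sts_name}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


if __name__ == "__main__":
    names = statefulset_dns_names("kafka", "kafka-headless", "kafka", 3)
    # Join with the Kafka port from the Service above.
    print(",".join(f"{name}:9092" for name in names))
```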

StatefulSet Minimum Availability with PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: kafka
spec:
  selector:
    matchLabels:
      app: kafka
  minAvailable: 2          # for a 3-replica Kafka — always keep quorum (2/3)
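
The minAvailable value follows from the replica count. The sketch below assumes a simple majority quorum, which matches the 2-of-3 example above; systems with different availability rules need their own calculation, and the function names are illustrative.

```python
def majority_quorum(replicas):
    """Smallest number of members that forms a strict majority."""
    return replicas // 2 + 1


def allowed_disruptions(replicas, min_available):
    """Pods that may be evicted voluntarily at once under the PDB."""
    return max(0, replicas - min_available)


if __name__ == "__main__":
    for n in (3, 5, 7):
        quorum = majority_quorum(n)
        print(n, "replicas -> minAvailable", quorum,
              "-> allowed disruptions", allowed_disruptions(n, quorum))
```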

Storage Backup Operations

See Disaster Recovery for the full Velero setup. This section focuses on storage-specific backup patterns and volume snapshot operations.

VolumeSnapshot — CSI Snapshot

VolumeSnapshots create a point-in-time snapshot of a PVC using the cloud provider's native snapshot API (EBS snapshot for gp3 volumes). They are faster and cheaper than full Velero backups for block storage.

# Install the CSI snapshot controller CRDs and controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
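kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml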

# Create a VolumeSnapshotClass for EBS
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-vsc
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Retain         # keep the EBS snapshot even if the VolumeSnapshot object is deleted
parameters:
  tagSpecification_1: "ManagedBy=K8s"   # EBS CSI driver tag format is key=value
EOF

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap-20260524
  namespace: production
spec:
  volumeSnapshotClassName: ebs-vsc
  source:
    persistentVolumeClaimName: postgres-data

# Check snapshot status
kubectl get volumesnapshot -n production
kubectl describe volumesnapshot postgres-data-snap-20260524 -n production
# readyToUse: true means the EBS snapshot is complete

# Restore from a VolumeSnapshot to a new PVC
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
  namespace: production
spec:
  storageClassName: gp3-retain
  dataSource:
    name: postgres-data-snap-20260524
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
EOF

Automated Snapshot CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-snapshot
  namespace: production
spec:
  schedule: "0 */6 * * *"      # every 6 hours
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-creator
          restartPolicy: OnFailure
          containers:
          - name: snap
            image: bitnami/kubectl:latest   # the prune step below also needs python3 in this image
            command:
            - /bin/bash
            - -c
            - |
              SNAP_NAME="postgres-data-snap-$(date +%Y%m%d-%H%M%S)"
              kubectl apply -f - <<EOF
              apiVersion: snapshot.storage.k8s.io/v1
              kind: VolumeSnapshot
              metadata:
                name: $SNAP_NAME
                namespace: production
                labels:
                  app: postgres
                  managed-by: cronjob
              spec:
                volumeSnapshotClassName: ebs-vsc
                source:
                  persistentVolumeClaimName: postgres-data
              EOF
              echo "Created snapshot: $SNAP_NAME"

              # Prune VolumeSnapshot objects older than 7 days
              # (with deletionPolicy: Retain on ebs-vsc, the underlying EBS snapshots are kept)
              kubectl get volumesnapshot -n production \
                -l managed-by=cronjob \
                -o json | \
              python3 -c "
              import json, sys, subprocess
              from datetime import datetime, timezone, timedelta
              items = json.load(sys.stdin)['items']
              cutoff = datetime.now(timezone.utc) - timedelta(days=7)
              for s in items:
                  ts = datetime.fromisoformat(s['metadata']['creationTimestamp'].replace('Z','+00:00'))
                  if ts < cutoff:
                      name = s['metadata']['name']
                      print(f'Deleting old snapshot: {name}')
                      subprocess.run(['kubectl','delete','volumesnapshot',name,'-n','production'])
              "

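The pruning logic embedded in the CronJob is easier to verify as a standalone function. This is a sketch of the same selection rule (creationTimestamp older than a cutoff), assuming input shaped like the items array from kubectl get volumesnapshot -o json; the function name and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(items, now, max_age=timedelta(days=7)):
    """Names of snapshots whose creationTimestamp is older than max_age."""
    cutoff = now - max_age
    expired = []
    for item in items:
        meta = item["metadata"]
        # Kubernetes timestamps end in 'Z'; fromisoformat wants an offset.
        created = datetime.fromisoformat(
            meta["creationTimestamp"].replace("Z", "+00:00"))
        if created < cutoff:
            expired.append(meta["name"])
    return expired


if __name__ == "__main__":
    sample = [
        {"metadata": {"name": "snap-old",
                      "creationTimestamp": "2026-05-10T00:00:00Z"}},
        {"metadata": {"name": "snap-new",
                      "creationTimestamp": "2026-05-23T00:00:00Z"}},
    ]
    now = datetime(2026, 5, 24, 12, 0, tzinfo=timezone.utc)
    print(expired_snapshots(sample, now))
```

Passing the current time in as an argument keeps the function deterministic, so the retention rule can be tested without a cluster.
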
Storage Troubleshooting Playbook

PVC Stuck in Pending

# Describe PVC for events
kubectl describe pvc <pvc-name> -n <ns>

# Common causes and checks:
# 1. WaitForFirstConsumer — normal until a pod is scheduled
kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.metadata.annotations}'

# 2. No StorageClass matching the requested class
kubectl get sc                    # verify the SC exists
kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.spec.storageClassName}'

# 3. CSI provisioner pod crashed / not running
kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl logs -n kube-system -l app=ebs-csi-controller -c csi-provisioner

# 4. IRSA permissions — EBS CSI controller needs ec2:CreateVolume
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/ebs-csi-role \
  --action-names ec2:CreateVolume \
  --query 'EvaluationResults[0].EvalDecision'

# 5. Insufficient capacity in the AZ (for gp3 this is rare)
# Check AWS service health dashboard

Pod Stuck in ContainerCreating (Mount Failure)

# Get the full error
kubectl describe pod <pod-name> -n <ns> | grep -A 30 Events

# Scenario A: "Multi-Attach error" — volume attached to another node
kubectl get volumeattachment | grep <pv-name>
# Check if old attachment is on a dead node — if so, force-delete it:
kubectl delete volumeattachment <attachment-name>

# Scenario B: "Unable to mount volumes" — node plugin issue
# Check node plugin pod on the target node:
TARGET_NODE=$(kubectl get pod <pod-name> -n <ns> -o jsonpath='{.spec.nodeName}')
kubectl get pods -n kube-system -l app=ebs-csi-node \
  -o wide | grep $TARGET_NODE
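# kubectl logs has no --field-selector flag, so resolve the pod name first
kubectl logs -n kube-system \
  $(kubectl get pods -n kube-system -l app=ebs-csi-node \
    --field-selector spec.nodeName=$TARGET_NODE -o name) \
  -c ebs-plugin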

# Scenario C: "timeout expired waiting for volumes to be attached/mounted"
# kubelet may be having issues — check kubelet status on node
# The node's root filesystem is mounted at /host inside the debug pod
kubectl debug node/$TARGET_NODE -it --image=nicolaka/netshoot -- \
  chroot /host journalctl -u kubelet --no-pager -n 100

# Scenario D: filesystem corrupted — check dmesg for I/O errors
kubectl debug node/$TARGET_NODE -it --image=nicolaka/netshoot -- \
  dmesg | tail -50 | grep -i "error\|fail\|corrupt"

PVC Resize Stuck at FileSystemResizePending

# Check PVC conditions
kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.status.conditions}'

# FileSystemResizePending means cloud volume is resized,
# but filesystem inside has not been expanded yet.
# This requires the CSI node plugin to run NodeExpandVolume.

# Option 1: rolling restart the pod (triggers NodePublishVolume → NodeExpandVolume)
kubectl rollout restart deployment/<name> -n <ns>

# Option 2: check if node plugin is healthy on the pod's node
TARGET_NODE=$(kubectl get pod <pod-name> -n <ns> -o jsonpath='{.spec.nodeName}')
kubectl logs -n kube-system \
  $(kubectl get pods -n kube-system -l app=ebs-csi-node \
    -o wide | grep $TARGET_NODE | awk '{print $1}') \
  -c ebs-plugin --tail=50

# Manual resize from inside the pod (last resort — ext4 only)
kubectl exec -n <ns> <pod-name> -- resize2fs /dev/<device>

Orphaned EBS Volume Detection

# Find EBS volumes tagged with kubernetes.io/created-for/pvc that have no attachment
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
             "Name=tag-key,Values=kubernetes.io/created-for/pvc/name" \
  --query 'Volumes[*].{VolumeId:VolumeId,Size:Size,Created:CreateTime,PVC:Tags[?Key==`kubernetes.io/created-for/pvc/name`].Value|[0]}' \
  --output table

# Cross-reference with K8s PVs to find truly orphaned volumes
# (PV deleted but cloud volume wasn't cleaned up)
PVOLS=$(kubectl get pv -o jsonpath='{.items[*].spec.csi.volumeHandle}' | tr ' ' '\n' | sort)
aws ec2 describe-volumes \
  --filters "Name=tag-key,Values=kubernetes.io/created-for/pvc/name" \
  --query 'Volumes[*].VolumeId' \
  --output text | tr '\t' '\n' | sort | \
  comm -23 - <(echo "$PVOLS")
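
The cross-reference above is a set difference, which can also be expressed in a few lines of Python. This sketch assumes you have already collected the two lists (EBS volume IDs carrying the PVC tag, and spec.csi.volumeHandle values from the cluster's PVs); the function name and sample IDs are hypothetical.

```python
def orphaned_volumes(aws_volume_ids, pv_volume_handles):
    """Volume IDs present in AWS but not referenced by any PV."""
    return sorted(set(aws_volume_ids) - set(pv_volume_handles))


if __name__ == "__main__":
    aws_ids = ["vol-0aaa", "vol-0bbb", "vol-0ccc"]   # made-up IDs
    pv_handles = ["vol-0aaa", "vol-0ccc"]
    print(orphaned_volumes(aws_ids, pv_handles))
```

Review the result by hand before deleting anything: a volume from another cluster in the same account carries the same tag key and would show up here as well.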

Storage Alerting

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: storage-operations-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: storage.pvc
    interval: 60s
    rules:

    - alert: PVCDiskUsageHigh
      expr: |
        (
          kubelet_volume_stats_used_bytes
            /
          kubelet_volume_stats_capacity_bytes
        ) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "PVC disk usage above 80%"
        description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is {{ $value | printf \"%.1f\" }}% full."

    - alert: PVCDiskUsageCritical
      expr: |
        (
          kubelet_volume_stats_used_bytes
            /
          kubelet_volume_stats_capacity_bytes
        ) * 100 > 95
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "PVC disk usage above 95%, volume is about to run out of space"
        description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is {{ $value | printf \"%.1f\" }}% full. Immediate action required."

    - alert: PVCInodeUsageHigh
      expr: |
        (
          kubelet_volume_stats_inodes_used
            /
          kubelet_volume_stats_inodes
        ) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "PVC inode usage above 80%"
        description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ $value | printf \"%.1f\" }}% inodes used. Many small files may exhaust inodes before disk space."

    - alert: PVCPendingTooLong
      expr: |
        kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "PVC stuck in Pending for > 10 minutes"
        description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has been Pending for over 10 minutes. Check CSI provisioner logs."

    - alert: PVCInLostState
      expr: |
        kube_persistentvolumeclaim_status_phase{phase="Lost"} == 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "PVC is in Lost state — data may be inaccessible"
        description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is Lost. The underlying PV may have been deleted. Immediate investigation required."

  - name: storage.node
    rules:

    - alert: NodeDiskHighIOWait
      expr: |
        rate(node_disk_io_time_seconds_total{device=~"nvme.*|xvd.*|sd.*"}[5m]) * 100 > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node disk I/O utilization above 80%"
        description: "Node {{ $labels.instance }} disk {{ $labels.device }} I/O utilization is {{ $value | printf \"%.1f\" }}%."

    - alert: NodeRootDiskFull
      expr: |
        (
          node_filesystem_size_bytes{mountpoint="/"}
            - node_filesystem_free_bytes{mountpoint="/"}
        ) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node root disk above 85% full"
        description: "Node {{ $labels.instance }} root filesystem is {{ $value | printf \"%.1f\" }}% full. Container image layers and logs may be filling disk."

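    # NOTE: kube_volumesnapshot_spec_source_persistent_volume_claim_name must be exported
    # separately (for example via a kube-state-metrics custom resource state config).
    - alert: PersistentVolumeClaimWithoutSnapshot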
      expr: |
        kube_persistentvolumeclaim_info{storageclass=~"gp3-retain|gp3-high-iops"}
          unless on (namespace, persistentvolumeclaim)
        (
          label_replace(
            kube_volumesnapshot_spec_source_persistent_volume_claim_name,
            "persistentvolumeclaim", "$1", "source_pvc_name", "(.*)"
          )
        )
      for: 24h
      labels:
        severity: warning
      annotations:
        summary: "Retain-policy PVC has no VolumeSnapshot in 24 hours"
        description: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} uses a retain policy but has no recent snapshot."

Best Practices Summary

Use WaitForFirstConsumer

Always set volumeBindingMode: WaitForFirstConsumer on StorageClasses in multi-AZ clusters. Immediate binding can provision a volume in a different AZ from the pod, and the mismatch only surfaces when the pod is scheduled and the volume cannot be attached.

Retain Policy for Databases

Set reclaimPolicy: Retain on StorageClasses used by databases and stateful workloads. Accidental PVC deletion should not destroy production data — a human must explicitly release the PV.

Snapshot Before Resize

Always create a VolumeSnapshot before expanding a PVC. Filesystem expansion is usually safe, but if a node crashes during resize the volume may be left in a partially expanded state. The snapshot provides a rollback point.

Monitor Inodes Too

Disk space metrics are not sufficient. A volume can fill its inode table (too many small files) while still having GBs of free space. Alert on both kubelet_volume_stats_used_bytes and kubelet_volume_stats_inodes_used.

Graceful Node Drain

Never hard-terminate nodes with attached volumes. Use kubectl drain so volumes are unmounted and VolumeAttachment objects are removed cleanly. If a node disappears without a drain, replacement pods can wait roughly 6 minutes for the attach/detach controller to force-detach the volume before it can be attached elsewhere.

Local Storage for Kafka

Kafka brokers get dramatically better throughput on instance-store NVMe (1M+ IOPS) vs EBS (16K IOPS). Use the local-nvme StorageClass with pre-provisioned local PVs (for example, created by a DaemonSet) and node affinity so that pods land on NVMe nodes.