Kubernetes Storage Stack

Storage in Kubernetes is layered: the Container Storage Interface (CSI) decouples driver logic from the kubelet, PersistentVolumes (PV) represent the physical resource, PersistentVolumeClaims (PVC) represent a request for storage, and StorageClasses define how PVs are dynamically provisioned.

Storage Architecture — Request to Mount Path
  Pod spec
    └─► PersistentVolumeClaim  (namespace-scoped claim)
          └─► StorageClass      (binding mode / provisioner / params)
                └─► CSI Driver  (provisioner plugin in kube-system)
                      ├─► Controller Plugin  (runs as Deployment)
                      │     ├─ CreateVolume   (AWS EBS CreateVolume API call)
                      │     ├─ ControllerPublishVolume  (attach EBS vol to EC2 node)
                      │     └─ DeleteVolume   (delete EBS vol when reclaim policy is Delete)
                      └─► Node Plugin         (runs as DaemonSet on every node)
                            ├─ NodeStageVolume  (format + mount to staging path)
                            └─ NodePublishVolume (bind-mount into pod's fs)

  On-disk path:
    /var/lib/kubelet/plugins/kubernetes.io/csi/...  (staging)
    /var/lib/kubelet/pods/<UID>/volumes/...         (pod volume mount)

Storage Object States

| Object | State | Meaning | Common Cause |
|--------|-------|---------|--------------|
| PVC | Pending | No matching PV exists yet | WaitForFirstConsumer / no capacity / wrong SC |
| PVC | Bound | Bound to a PV and ready for use | Normal operating state |
| PVC | Lost | Bound PV deleted or unreachable | Manual PV deletion, zone mismatch |
| PV | Available | Not yet bound to any PVC | Pre-provisioned static PV |
| PV | Bound | Claimed by a PVC | Normal |
| PV | Released | PVC deleted, PV not yet reclaimed | Retain reclaim policy |
| PV | Failed | Automated reclamation failed | Cloud volume deletion error |
| VolumeAttachment | Attached: false | CSI attach in progress or stuck | Node failure, CSI pod crash, AZ mismatch |

Quick-Reference Diagnostic Commands

# List all PVCs with their phase and storage class
kubectl get pvc -A -o wide

# List PVs showing reclaim policy, capacity, access modes
kubectl get pv -o custom-columns=\
NAME:.metadata.name,\
CAPACITY:.spec.capacity.storage,\
ACCESS:.spec.accessModes[0],\
POLICY:.spec.persistentVolumeReclaimPolicy,\
STATUS:.status.phase,\
CLAIM_NS:.spec.claimRef.namespace,\
CLAIM:.spec.claimRef.name

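# Find all PVCs that are NOT Bound
# (PVCs do not support a status.phase field selector, so filter client-side;
#  with -A the STATUS column is the third field)
kubectl get pvc -A --no-headers | awk '$3 != "Bound"'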

# Describe a specific PVC to see binding events and CSI details
kubectl describe pvc <pvc-name> -n <namespace>

# List VolumeAttachment objects (CSI controller attach status)
kubectl get volumeattachment

# Check CSI node driver registration
kubectl get csinode <node-name> -o yaml

# List StorageClasses and their provisioners
kubectl get sc -o wide
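
The same listing can be filtered offline once it is saved as JSON. The following is a minimal sketch, assuming the input has the shape of `kubectl get pvc -A -o json`; the function name `unbound_pvcs` and the sample data are illustrative, not part of any tool.

```python
import json


def unbound_pvcs(pvc_list):
    """Return (namespace, name, phase) for every PVC whose phase is not Bound."""
    rows = []
    for item in pvc_list.get("items", []):
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            meta = item["metadata"]
            rows.append((meta["namespace"], meta["name"], phase))
    return rows


# Hand-written sample in the shape of `kubectl get pvc -A -o json` (not real cluster data).
sample = json.loads("""
{"items": [
  {"metadata": {"namespace": "production", "name": "postgres-data"},
   "status": {"phase": "Bound"}},
  {"metadata": {"namespace": "kafka", "name": "kafka-data-kafka-0"},
   "status": {"phase": "Pending"}}
]}
""")

for ns, name, phase in unbound_pvcs(sample):
    print(f"{ns}/{name}: {phase}")
```

To run it against a live cluster, replace the sample with `json.load(sys.stdin)` and pipe the kubectl output into the script.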

PV Lifecycle Operations

Reclaim Policy Behaviour

Delete vs Retain

The default reclaim policy for dynamically provisioned PVs is Delete — the underlying cloud volume is destroyed when the PVC is deleted. For production databases, always use Retain or set a finalizer, and ensure Velero backups exist before any PVC delete.

# Patch a PV to change reclaim policy from Delete to Retain
kubectl patch pv <pv-name> \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# Patch ALL PVs that use gp3 storage class to Retain
kubectl get pv -o json | \
  jq -r '.items[] | select(.spec.storageClassName=="gp3") | .metadata.name' | \
  xargs -I{} kubectl patch pv {} \
    -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

Reclaiming a Released PV

When a PVC is deleted and the PV reclaim policy is Retain, the PV enters Released state. It cannot be automatically rebound until the claimRef is cleared:

# Step 1: verify PV is in Released state
kubectl get pv <pv-name>

# Step 2: remove the claimRef so the PV becomes Available again
kubectl patch pv <pv-name> --type json \
  -p '[{"op":"remove","path":"/spec/claimRef"}]'

# Step 3: create a new PVC that binds to this specific PV
# Use volumeName in the PVC spec:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-recovered
  namespace: production
spec:
  storageClassName: gp3
  volumeName: pvc-abc123        # pin to the specific PV
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
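
The step 3 manifest has to agree with the PV on storage class, access modes and capacity, or the claim will not bind. As a rough sketch, the claim could be derived from the PV object itself; `pvc_for_pv` is a hypothetical helper and the PV below is a hand-written stand-in for `kubectl get pv <pv-name> -o json`.

```python
import json


def pvc_for_pv(pv, name, namespace):
    """Build a PVC manifest (as a dict) that pins itself to an existing PV."""
    spec = pv["spec"]
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageClassName": spec["storageClassName"],
            "volumeName": pv["metadata"]["name"],  # pin to this specific PV
            "accessModes": list(spec["accessModes"]),
            "resources": {"requests": {"storage": spec["capacity"]["storage"]}},
        },
    }


# Illustrative PV, trimmed to the fields the helper reads.
released_pv = {
    "metadata": {"name": "pvc-abc123"},
    "spec": {
        "storageClassName": "gp3",
        "accessModes": ["ReadWriteOnce"],
        "capacity": {"storage": "100Gi"},
        "persistentVolumeReclaimPolicy": "Retain",
    },
}

claim = pvc_for_pv(released_pv, "postgres-data-recovered", "production")
print(json.dumps(claim, indent=2))
```

kubectl accepts JSON manifests, so the printed output can be piped to `kubectl apply -f -`.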

Reclaim Policy on StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain          # <-- preserve volume on PVC delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"

CSI Driver Operations

CSI Component Health Check

# AWS EBS CSI Driver (typical install via EKS add-on or Helm)
kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl get pods -n kube-system -l app=ebs-csi-node

# Check EBS CSI controller logs for provisioning errors
kubectl logs -n kube-system \
  -l app=ebs-csi-controller -c csi-provisioner --tail=50

# On EKS, check the status of the managed aws-ebs-csi-driver add-on
aws eks describe-addon \
  --cluster-name <cluster> \
  --addon-name aws-ebs-csi-driver \
  --query 'addon.{status:status,version:addonVersion}'

# Check CSI node plugin status on a specific node
kubectl describe csinode <node-name>

# EFS CSI Driver
kubectl get pods -n kube-system -l app=efs-csi-controller
kubectl get pods -n kube-system -l app=efs-csi-node

CSI Controller vs Node Plugin Responsibilities

| Operation | Plugin | What it does | Failure symptom |
|-----------|--------|--------------|-----------------|
| CreateVolume | Controller | Calls AWS EBS CreateVolume | PVC stuck Pending: "failed to provision volume" |
| ControllerPublishVolume (attach) | Controller | Attaches EBS vol to EC2 instance | VolumeAttachment stuck, pod stays Pending |
| DeleteVolume | Controller | Destroys cloud volume on PVC delete | Orphaned EBS volumes in AWS console |
| NodeStageVolume | Node | Formats filesystem + mounts to staging | Pod event: "failed to stage volume" |
| NodePublishVolume | Node | Bind-mounts staged path into pod | Pod stuck ContainerCreating, mount error |
| NodeExpandVolume | Node | Resizes filesystem after PVC expand | PVC resize stuck FileSystemResizePending |

Diagnosing a Stuck Pod (ContainerCreating / Volume Mount Failure)

# Step 1: describe the stuck pod — look for events
kubectl describe pod <pod-name> -n <namespace>
# Common events:
#   "Unable to attach or mount volumes"
#   "Multi-Attach error for volume" — vol attached to another node
#   "timed out waiting for the condition"

# Step 2: check the VolumeAttachment object for the PV
kubectl get volumeattachment | grep <pv-name>
kubectl describe volumeattachment <attachment-name>

# Step 3: if VolumeAttachment is stuck with the old node:
# This happens when a node dies ungracefully — the vol is still
# "attached" to the dead node from K8s perspective
# Force-delete the VolumeAttachment (CSI controller will re-attach)
kubectl delete volumeattachment <attachment-name>

# Step 4: check the CSI controller pod for errors
kubectl logs -n kube-system \
  -l app=ebs-csi-controller -c csi-attacher --tail=100

# Step 5: verify the EBS volume state in AWS
aws ec2 describe-volumes \
  --volume-ids <vol-id> \
  --query 'Volumes[0].{State:State,Attachments:Attachments}'

Multi-Attach Error (ReadWriteOnce)

EBS volumes with accessModes: ReadWriteOnce can be attached to only one node at a time. If a pod is rescheduled to a different node while the old VolumeAttachment still exists (for example, the node crashed rather than being drained), the new attachment fails. Drain nodes gracefully (kubectl drain) before terminating them so the VolumeAttachment is cleaned up.
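
One way to spot the stale attachments described above is to compare each VolumeAttachment's node with the set of nodes that are currently Ready. This is a sketch only: `stale_attachments` is a hypothetical helper, and the sample objects are trimmed stand-ins for `kubectl get volumeattachment -o json` items.

```python
def stale_attachments(attachments, ready_nodes):
    """Return (attachment, PV, node) for attachments that point at a node that is not Ready."""
    stale = []
    for va in attachments:
        node = va["spec"]["nodeName"]
        if node not in ready_nodes:
            pv = va["spec"]["source"].get("persistentVolumeName")
            stale.append((va["metadata"]["name"], pv, node))
    return stale


# Illustrative data: node-dead is absent from the Ready set.
attachments = [
    {"metadata": {"name": "csi-aaa"},
     "spec": {"nodeName": "node-a", "source": {"persistentVolumeName": "pvc-abc123"}},
     "status": {"attached": True}},
    {"metadata": {"name": "csi-bbb"},
     "spec": {"nodeName": "node-dead", "source": {"persistentVolumeName": "pvc-def456"}},
     "status": {"attached": True}},
]
ready = {"node-a", "node-b"}

for name, pv, node in stale_attachments(attachments, ready):
    print(f"{name}: PV {pv} still attached to {node}")
```

A hit here is a candidate for investigation, not proof: confirm the node is really gone before deleting any VolumeAttachment.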

EBS / Block Storage Operations

gp3 vs gp2 StorageClass

AWS EBS gp3 provides a baseline of 3000 IOPS and 125 MB/s throughput at no extra cost, independent of volume size. gp2 ties IOPS to size (3 IOPS/GB, max 16,000). For most workloads, migrating from gp2 to gp3 reduces cost and makes performance more predictable.
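
The size dependence of gp2 is easy to work through numerically. The sketch below uses the 3 IOPS/GB rate and 16,000 cap quoted above, plus the 100 IOPS floor that AWS documents for gp2 (an addition not stated in this section); burst credits are deliberately not modeled, and the function names are illustrative.

```python
GP3_BASELINE_IOPS = 3000


def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, floored at 100 and capped at 16,000 (burst ignored)."""
    return max(100, min(3 * size_gib, 16000))


def gp3_beats_gp2_baseline(size_gib):
    """True when the flat gp3 baseline exceeds what gp2 sustains at this size."""
    return GP3_BASELINE_IOPS > gp2_baseline_iops(size_gib)


for size in (10, 100, 500, 1000, 6000):
    print(size, "GiB:", gp2_baseline_iops(size), "gp2 IOPS,",
          "gp3 baseline is higher" if gp3_beats_gp2_baseline(size) else "gp2 matches or exceeds")
```

Under these assumptions the break-even point is 1000 GiB: below that size a gp3 volume sustains more IOPS than gp2 without any extra provisioning.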

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # only provision in the AZ where pod lands
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  type: gp3
  iops: "3000"           # baseline — can go up to 16000
  throughput: "125"      # MB/s — can go up to 1000
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:123456789012:key/mrk-..."

High-Performance StorageClass for Databases

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-high-iops
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
parameters:
  type: gp3
  iops: "16000"          # max gp3 IOPS
  throughput: "1000"     # max gp3 throughput
  encrypted: "true"

Checking EBS Volume Metrics

# Get the EBS volume ID from a PV
kubectl get pv <pv-name> \
  -o jsonpath='{.spec.csi.volumeHandle}'

# Check CloudWatch EBS metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeQueueLength \
  --dimensions Name=VolumeId,Value=vol-0abc123 \
  --start-time $(date -u -d '1 hour ago' +%FT%TZ) \
  --end-time $(date -u +%FT%TZ) \
  --period 300 \
  --statistics Average

PromQL: PVC Disk Saturation

# PVC disk usage percentage
(
  kubelet_volume_stats_used_bytes
    /
  kubelet_volume_stats_capacity_bytes
) * 100

# PVCs above 80% full
(
  kubelet_volume_stats_used_bytes
    /
  kubelet_volume_stats_capacity_bytes
) * 100 > 80

# PVC inode usage percentage
(
  kubelet_volume_stats_inodes_used
    /
  kubelet_volume_stats_inodes
) * 100

# Available bytes
kubelet_volume_stats_available_bytes{namespace="production"}

# PVCs with less than 10% free
(
  kubelet_volume_stats_available_bytes
    /
  kubelet_volume_stats_capacity_bytes
) * 100 < 10

EFS / Shared Storage Operations

Amazon EFS provides ReadWriteMany (RWX) access across multiple pods and nodes. It is NFS-based, with throughput automatically scaling. Use EFS for shared content, ML training data, and CMS uploads — not for databases (latency is 1-10ms vs EBS 0.1-1ms).

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap          # use EFS Access Points (isolated dirs per PVC)
  fileSystemId: fs-0abc12345
  directoryPerms: "700"
  gidRangeStart: "1000"
  gidRangeEnd: "2000"
  basePath: "/dynamic_provisioning"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets
  namespace: production
spec:
  storageClassName: efs-sc
  accessModes: [ReadWriteMany]      # multiple pods can mount simultaneously
  resources:
    requests:
      storage: 5Gi                  # EFS ignores this — it scales elastically

EFS Performance Troubleshooting

# EFS NFS mount options commonly tuned for K8s
# noresvport: don't use a reserved source port (allows reconnect after timeout)
# soft: NFS soft mount, returns EIO on timeout instead of hanging forever
#       (AWS recommends hard for EFS; soft risks data corruption on interrupted writes)
# timeo=600: 60 second NFS timeout (value is in tenths of a second)
# Custom options go in mountOptions on the StorageClass or PV; if your install
# ships a driver ConfigMap, inspect it:
kubectl describe configmap -n kube-system efs-csi-driver-config 2>/dev/null || true

# Check EFS mount on a node running an EFS-backed pod
NODE=$(kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}')
kubectl debug node/$NODE -it --image=nicolaka/netshoot -- \
  df -h | grep nfs

# EFS throughput utilization via CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/EFS \
  --metric-name MeteredIOBytes \
  --dimensions Name=FileSystemId,Value=fs-0abc12345 \
  --start-time $(date -u -d '1 hour ago' +%FT%TZ) \
  --end-time $(date -u +%FT%TZ) \
  --period 300 \
  --statistics Sum

Storage Performance Analysis

fio Benchmark — Block Storage (EBS)

# Run fio benchmarks inside a pod using a PVC
apiVersion: v1
kind: Pod
metadata:
  name: fio-benchmark
  namespace: default
spec:
  containers:
  - name: fio
    image: nixery.dev/shell/fio/python3   # shell + fio + python3 (the script pipes fio JSON into python3)
    command:
    - /bin/bash
    - -c
    - |
      # --group_reporting folds the 4 jobs into one entry, so jobs[0] is the aggregate.
      # --time_based keeps each test running for the full --runtime.
      # fio reports percentiles under clat_ns unless --lat_percentiles=1 is set.
      echo "=== Random Read 4K IOPS ==="
      fio --name=randread --ioengine=libaio --iodepth=32 \
        --rw=randread --bs=4k --direct=1 --numjobs=4 --group_reporting \
        --size=1G --runtime=60 --time_based --filename=/data/test \
        --output-format=json | \
        python3 -c "import json,sys; d=json.load(sys.stdin); \
          r=d['jobs'][0]['read']; \
          print(f'IOPS: {r[\"iops\"]:.0f}, BW: {r[\"bw\"]/1024:.1f} MiB/s, clat_p99: {r[\"clat_ns\"][\"percentile\"][\"99.000000\"]/1e6:.2f}ms')"

      echo "=== Random Write 4K IOPS ==="
      fio --name=randwrite --ioengine=libaio --iodepth=32 \
        --rw=randwrite --bs=4k --direct=1 --numjobs=4 --group_reporting \
        --size=1G --runtime=60 --time_based --filename=/data/test \
        --output-format=json | \
        python3 -c "import json,sys; d=json.load(sys.stdin); \
          w=d['jobs'][0]['write']; \
          print(f'IOPS: {w[\"iops\"]:.0f}, BW: {w[\"bw\"]/1024:.1f} MiB/s, clat_p99: {w[\"clat_ns\"][\"percentile\"][\"99.000000\"]/1e6:.2f}ms')"

      echo "=== Sequential Read 1M throughput ==="
      fio --name=seqread --ioengine=libaio --iodepth=8 \
        --rw=read --bs=1M --direct=1 --numjobs=1 \
        --size=2G --runtime=30 --filename=/data/test
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: <pvc-name>
  restartPolicy: Never
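
When fio runs several jobs, its JSON output can also be summarized outside the pod. The sketch below assumes the fio 3.x JSON layout (`jobs[].read` or `jobs[].write` with `iops`, `bw` in KiB/s, and `clat_ns.percentile`); the helper name `summarize` and the sample numbers are made up for illustration and are not benchmark results.

```python
def summarize(fio_json, direction):
    """Sum IOPS and bandwidth across jobs; report the worst p99 completion latency."""
    jobs = [job[direction] for job in fio_json["jobs"]]
    iops = sum(j["iops"] for j in jobs)
    bw_mib_s = sum(j["bw"] for j in jobs) / 1024  # fio reports bw in KiB/s
    p99_ms = max(j["clat_ns"]["percentile"]["99.000000"] for j in jobs) / 1e6
    return {"iops": iops, "bw_mib_s": bw_mib_s, "p99_ms": p99_ms}


# Synthetic two-job result in fio's JSON shape (values invented for the example).
sample = {
    "jobs": [
        {"read": {"iops": 4000, "bw": 16000,
                  "clat_ns": {"percentile": {"99.000000": 500000}}}},
        {"read": {"iops": 4100, "bw": 16400,
                  "clat_ns": {"percentile": {"99.000000": 750000}}}},
    ]
}

result = summarize(sample, "read")
print(f"IOPS: {result['iops']}, BW: {result['bw_mib_s']:.1f} MiB/s, p99: {result['p99_ms']:.2f} ms")
```

Taking the maximum p99 across jobs is a conservative choice; percentiles cannot be averaged exactly without the underlying histograms.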

Expected I/O Latencies by Storage Type

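| Storage Type | Read p50 | Read p99 | Write p50 | IOPS Limit | Use Case |
|--------------|----------|----------|-----------|------------|----------|
| EBS gp3 (NVMe) | 0.1ms | 0.5ms | 0.2ms | 16,000 | Databases, etcd |
| EBS io2 Block Express | <0.1ms | 0.2ms | <0.1ms | 256,000 | Critical OLTP |
| EFS (NFS) | 1ms | 10ms | 2ms | | Shared, RWX |
| Local NVMe (instance store) | 0.05ms | 0.1ms | 0.05ms | 1M+ | Kafka, ephemeral caches |
| emptyDir (tmpfs) | <0.01ms | <0.01ms | <0.01ms | RAM-bound | Temp data, shared mem |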

Node Storage Metrics

# Check node disk utilization
kubectl top nodes  # doesn't show disk — use node_exporter

# node_exporter disk metrics via PromQL:
# Disk read/write bytes per second
rate(node_disk_read_bytes_total{device=~"nvme.*|xvd.*"}[5m])
rate(node_disk_written_bytes_total{device=~"nvme.*|xvd.*"}[5m])

# Average read latency per operation in ms (await); high values indicate I/O saturation
rate(node_disk_read_time_seconds_total{device=~"nvme.*|xvd.*"}[5m])
  / rate(node_disk_reads_completed_total{device=~"nvme.*|xvd.*"}[5m]) * 1000

# Disk utilization percentage
rate(node_disk_io_time_seconds_total{device=~"nvme.*|xvd.*"}[5m]) * 100

# Root filesystem usage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"})
  / node_filesystem_size_bytes{mountpoint="/"} * 100

PVC Resize Operations

Prerequisites for Online Resize

The StorageClass must have allowVolumeExpansion: true, and the CSI driver must advertise the EXPAND_VOLUME capability. The filesystem resize (NodeExpandVolume) runs online for ext4 and XFS when the node plugin supports online expansion; otherwise it runs the next time the pod restarts. EBS CSI driver v1.0+ supports online resize for ext4 and XFS without a pod restart.

Resize a PVC

# Step 1: patch the PVC spec.resources.requests.storage
kubectl patch pvc postgres-data -n production \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Step 2: watch for the resize to complete
kubectl get pvc postgres-data -n production -w
# Status transitions:
#   Bound (100Gi) → Bound (100Gi, FileSystemResizePending) → Bound (200Gi)

# Step 3: if status stays at FileSystemResizePending,
# a pod restart triggers the node-side filesystem expansion
kubectl rollout restart deployment/postgres -n production

# Step 4: verify new size inside the pod
kubectl exec -n production deploy/postgres -- df -h /var/lib/postgresql/data

Resize a StatefulSet PVC (requires manual steps)

StatefulSet VolumeClaimTemplates are immutable — Kubernetes will not resize them automatically. The workaround is to resize each PVC individually and then update the StatefulSet template.

# Resize all PVCs in a StatefulSet (e.g., kafka-data-kafka-0,1,2...)
STS_NAME=kafka
NAMESPACE=kafka
NEW_SIZE=200Gi

# Step 1: patch each PVC
kubectl get pvc -n $NAMESPACE -l app=$STS_NAME \
  -o jsonpath='{.items[*].metadata.name}' | \
  tr ' ' '\n' | \
  while read pvc; do
    echo "Resizing $pvc..."
    kubectl patch pvc $pvc -n $NAMESPACE \
      -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$NEW_SIZE\"}}}}"
  done

# Step 2: delete the StatefulSet WITHOUT deleting pods (--cascade=orphan)
kubectl delete sts $STS_NAME -n $NAMESPACE --cascade=orphan

# Step 3: re-apply the StatefulSet with updated volumeClaimTemplates size
# Edit your manifest to reflect NEW_SIZE, then:
kubectl apply -f kafka-statefulset.yaml -n $NAMESPACE

# Step 4: rolling restart to trigger filesystem expansion
kubectl rollout restart statefulset/$STS_NAME -n $NAMESPACE

You Cannot Shrink a PVC

PVC storage requests can only grow, never shrink. To reduce storage size you must create a new smaller PVC, copy data, and swap the mount — there is no in-place shrink in Kubernetes or EBS.
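
A resize request can be sanity-checked before patching. The following sketch parses a subset of Kubernetes quantity suffixes (Ki/Mi/Gi/Ti and k/M/G/T) and accepts only strict growth; `to_bytes` and `is_expansion` are hypothetical helper names, and other quantity forms (exponents, milli-units) are not handled.

```python
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary suffixes
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,     # decimal suffixes
}


def to_bytes(quantity):
    """Convert a quantity such as '100Gi' to bytes (subset of the Kubernetes format)."""
    # Check two-letter suffixes first so 'Gi' is not mistaken for a bare number plus 'i'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)


def is_expansion(current, requested):
    """True only if the requested size is strictly larger than the current size."""
    return to_bytes(requested) > to_bytes(current)


print(is_expansion("100Gi", "200Gi"))  # growing is allowed
print(is_expansion("200Gi", "100Gi"))  # shrinking is rejected
```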

StorageClass Tuning

WaitForFirstConsumer — Why It Matters

With volumeBindingMode: Immediate, a PVC triggers volume creation before a pod is scheduled. The volume may land in AZ-A while the only nodes that can run the pod are in AZ-B; because an EBS volume cannot be attached across AZs, the pod is left unschedulable or unable to mount. WaitForFirstConsumer delays provisioning until the pod's target node is known, aligning the volume's AZ with the pod's AZ.

WaitForFirstConsumer — Binding Flow
  Immediate (BAD for multi-AZ):
    PVC created → CSI provisions vol in AZ-A → Pod can only run in AZ-B → UNSCHEDULABLE / MOUNT FAIL

  WaitForFirstConsumer (CORRECT):
    PVC created → status: Pending (no vol yet)
    Pod scheduled → Scheduler picks node in AZ-B
    CSI notified of target node AZ → provisions vol in AZ-B → PVC binds → pod mounts ✓
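
The zone alignment above is enforced through the PV's node affinity. The sketch below checks whether a node's labels satisfy a PV's required affinity; it handles only the `In` operator, and the zone label key `topology.kubernetes.io/zone` is an assumption (some driver versions use a driver-specific key). `node_matches_pv` is an illustrative name.

```python
def node_matches_pv(pv, node_labels):
    """True if the node satisfies the PV's required node affinity (In operator only)."""
    required = pv["spec"].get("nodeAffinity", {}).get("required", {})
    terms = required.get("nodeSelectorTerms", [])
    if not terms:
        return True  # no affinity: the PV is usable from any node
    for term in terms:  # terms are ORed together
        exprs = term.get("matchExpressions", [])
        # Expressions inside one term are ANDed.
        if exprs and all(
            e["operator"] == "In" and node_labels.get(e["key"]) in e["values"]
            for e in exprs
        ):
            return True
    return False


# Illustrative PV pinned to one zone.
pv = {"spec": {"nodeAffinity": {"required": {"nodeSelectorTerms": [
    {"matchExpressions": [
        {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-east-1a"]}
    ]}
]}}}}

print(node_matches_pv(pv, {"topology.kubernetes.io/zone": "us-east-1a"}))
print(node_matches_pv(pv, {"topology.kubernetes.io/zone": "us-east-1b"}))
```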

StorageClass Reference Table

| Name | Provisioner | Use Case | Reclaim | Binding Mode |
|------|-------------|----------|---------|--------------|
| gp3 (default) | ebs.csi.aws.com | General workloads | Delete | WaitForFirstConsumer |
| gp3-retain | ebs.csi.aws.com | Databases (stateful) | Retain | WaitForFirstConsumer |
| gp3-high-iops | ebs.csi.aws.com | High-throughput DBs | Retain | WaitForFirstConsumer |
| efs-sc | efs.csi.aws.com | Shared RWX access | Delete | Immediate |
| local-nvme | kubernetes.io/no-provisioner | Local NVMe (Kafka) | Retain | WaitForFirstConsumer |

Local StorageClass for High-Performance Workloads

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain

---
# Pre-provision a Local PV (one per NVMe disk per node)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-nvme-node1
spec:
  capacity:
    storage: 1000Gi
  accessModes: [ReadWriteOnce]
  storageClassName: local-nvme
  persistentVolumeReclaimPolicy: Retain
  local:
    path: /mnt/nvme0n1       # pre-formatted and mounted on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-with-nvme-1"]

StatefulSet Operations

StatefulSet Rolling Updates

StatefulSets update pods in reverse ordinal order (N-1, N-2, ..., 0) by default. Each pod is updated, waits to become Ready, then the next is updated. This respects quorum for distributed systems like Kafka, Zookeeper, and etcd.

# Trigger a rolling update (after editing the StatefulSet spec)
kubectl rollout restart statefulset/kafka -n kafka

# Monitor rollout progress
kubectl rollout status statefulset/kafka -n kafka

# StatefulSets cannot be paused: kubectl rollout pause/resume only supports Deployments.
# To hold a rollout partway (for example to canary), set
# updateStrategy.rollingUpdate.partition as shown in the next section.

# Rollback to previous revision
kubectl rollout undo statefulset/kafka -n kafka

Partition-Based Canary for StatefulSets

The partition field in updateStrategy.rollingUpdate limits updates to pods with ordinal ≥ partition. This lets you update a subset first:

# Set partition=2 on a 3-replica StatefulSet (only pod-2 gets updated)
kubectl patch statefulset kafka -n kafka \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'

# After validating pod-2 is healthy, lower partition to 1 (update pod-2, pod-1)
kubectl patch statefulset kafka -n kafka \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'

# Lower to 0 to complete the update
kubectl patch statefulset kafka -n kafka \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
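
The effect of each partition value can be worked out ahead of time. A minimal sketch, assuming the default reverse-ordinal update order described earlier; `pods_to_update` is a hypothetical helper.

```python
def pods_to_update(replicas, partition):
    """Ordinals that receive the new revision, highest ordinal first."""
    return [o for o in range(replicas - 1, -1, -1) if o >= partition]


for partition in (2, 1, 0):
    print(f"partition={partition}:", pods_to_update(3, partition))
```

Setting partition equal to the replica count yields an empty list, which stages a new revision without updating any pod.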

Force-Delete a Stuck StatefulSet Pod

Force-Delete Is Risky for Stateful Applications

Force-deleting a StatefulSet pod (e.g., after node failure) may result in two pods with the same identity running simultaneously if the original pod is still alive on a partitioned node. Only force-delete when the node is confirmed dead (terminated in cloud console, not just NotReady).

# Standard delete (waits for graceful termination — may block if node is dead)
kubectl delete pod kafka-2 -n kafka

# Force delete (bypasses graceful shutdown — only when node is confirmed dead)
kubectl delete pod kafka-2 -n kafka --force --grace-period=0

Headless Service and DNS for StatefulSets

apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
  namespace: kafka
spec:
  clusterIP: None                      # headless — no VIP, DNS returns pod IPs
  publishNotReadyAddresses: true       # important: include unready pods in DNS
  selector:
    app: kafka
  ports:
  - name: kafka
    port: 9092
# DNS format for StatefulSet pods:
# <pod-name>.<service-name>.<namespace>.svc.cluster.local
# kafka-0.kafka-headless.kafka.svc.cluster.local
# kafka-1.kafka-headless.kafka.svc.cluster.local

# Verify headless DNS from within the cluster
kubectl run dns-test --image=nicolaka/netshoot --rm -it -- \
  dig kafka-0.kafka-headless.kafka.svc.cluster.local
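
The per-pod DNS pattern above is mechanical, so client configuration such as a bootstrap list can be generated from it. This is a sketch; `pod_fqdns` is an illustrative helper and `cluster.local` is assumed to be the cluster domain.

```python
def pod_fqdns(sts, service, namespace, replicas, cluster_domain="cluster.local"):
    """Build <pod-name>.<service-name>.<namespace>.svc.<cluster-domain> for each ordinal."""
    return [
        f"{sts}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


hosts = pod_fqdns("kafka", "kafka-headless", "kafka", 3)
bootstrap = ",".join(f"{h}:9092" for h in hosts)  # port taken from the Service above
print(bootstrap)
```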

StatefulSet Minimum Availability with PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: kafka
spec:
  selector:
    matchLabels:
      app: kafka
  minAvailable: 2          # for a 3-replica Kafka — always keep quorum (2/3)

Storage Backup Operations

See Disaster Recovery for the full Velero setup. This section focuses on storage-specific backup patterns and volume snapshot operations.

VolumeSnapshot — CSI Snapshot

VolumeSnapshots create a point-in-time snapshot of a PVC using the cloud provider's native snapshot API (EBS snapshot for gp3 volumes). They are faster and cheaper than full Velero backups for block storage.

# Install the CSI snapshot controller CRDs and controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
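# raw.githubusercontent.com cannot serve a directory, so apply the controller manifests individually
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml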

# Create a VolumeSnapshotClass for EBS
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-vsc
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Retain         # keep the EBS snapshot even if the VolumeSnapshot object is deleted
parameters:
  tagSpecification_1: "ManagedBy=K8s"   # EBS CSI tag format is key=value
EOF

# Create a VolumeSnapshot of the postgres-data PVC
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap-20260524
  namespace: production
spec:
  volumeSnapshotClassName: ebs-vsc
  source:
    persistentVolumeClaimName: postgres-data
EOF

# Check snapshot status
kubectl get volumesnapshot -n production
kubectl describe volumesnapshot postgres-data-snap-20260524 -n production
# readyToUse: true means the EBS snapshot is complete

# Restore from a VolumeSnapshot to a new PVC
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
  namespace: production
spec:
  storageClassName: gp3-retain
  dataSource:
    name: postgres-data-snap-20260524
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
EOF

Automated Snapshot CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-snapshot
  namespace: production
spec:
  schedule: "0 */6 * * *"      # every 6 hours
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-creator
          restartPolicy: OnFailure
          containers:
          - name: snap
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              SNAP_NAME="postgres-data-snap-$(date +%Y%m%d-%H%M%S)"
              kubectl apply -f - <<EOF
              apiVersion: snapshot.storage.k8s.io/v1
              kind: VolumeSnapshot
              metadata:
                name: $SNAP_NAME
                namespace: production
                labels:
                  app: postgres
                  managed-by: cronjob
              spec:
                volumeSnapshotClassName: ebs-vsc
                source:
                  persistentVolumeClaimName: postgres-data
              EOF
              echo "Created snapshot: $SNAP_NAME"

              # Prune VolumeSnapshot objects older than 7 days
              # (uses bash and GNU date only, so the image does not need python3;
              #  with deletionPolicy: Retain the EBS snapshots themselves are kept)
              CUTOFF=$(date -u -d '7 days ago' +%s)
              kubectl get volumesnapshot -n production -l managed-by=cronjob \
                -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' | \
              while read -r name ts; do
                if [ "$(date -u -d "$ts" +%s)" -lt "$CUTOFF" ]; then
                  echo "Deleting old snapshot: $name"
                  kubectl delete volumesnapshot "$name" -n production
                fi
              done

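The 7-day retention rule used by the CronJob can be checked offline before it is trusted to delete anything. The sketch below takes the current time as a parameter so the cutoff is testable; `expired_snapshots` is a hypothetical helper and the items are hand-written stand-ins for `kubectl get volumesnapshot -o json`.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(items, now, max_age=timedelta(days=7)):
    """Return names of snapshots whose creationTimestamp is older than max_age."""
    cutoff = now - max_age
    names = []
    for snap in items:
        stamp = snap["metadata"]["creationTimestamp"].replace("Z", "+00:00")
        if datetime.fromisoformat(stamp) < cutoff:
            names.append(snap["metadata"]["name"])
    return names


# Illustrative objects; only the fields the helper reads are present.
items = [
    {"metadata": {"name": "postgres-data-snap-20260510-000000",
                  "creationTimestamp": "2026-05-10T00:00:00Z"}},
    {"metadata": {"name": "postgres-data-snap-20260523-060000",
                  "creationTimestamp": "2026-05-23T06:00:00Z"}},
]
now = datetime(2026, 5, 24, 12, 0, tzinfo=timezone.utc)

print(expired_snapshots(items, now))
```
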
Storage Troubleshooting Playbook

PVC Stuck in Pending

# Describe PVC for events
kubectl describe pvc <pvc-name> -n <ns>

# Common causes and checks:
# 1. WaitForFirstConsumer — normal until a pod is scheduled
kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.metadata.annotations}'

# 2. No StorageClass matching the requested class
kubectl get sc                    # verify the SC exists
kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.spec.storageClassName}'

# 3. CSI provisioner pod crashed / not running
kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl logs -n kube-system -l app=ebs-csi-controller -c csi-provisioner

# 4. IRSA permissions — EBS CSI controller needs ec2:CreateVolume
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/ebs-csi-role \
  --action-names ec2:CreateVolume \
  --query 'EvaluationResults[0].EvalDecision'

# 5. Insufficient capacity in the AZ (for gp3 this is rare)
# Check AWS service health dashboard

Pod Stuck in ContainerCreating (Mount Failure)

# Get the full error
kubectl describe pod <pod-name> -n <ns> | grep -A 30 Events

# Scenario A: "Multi-Attach error" — volume attached to another node
kubectl get volumeattachment | grep <pv-name>
# Check if old attachment is on a dead node — if so, force-delete it:
kubectl delete volumeattachment <attachment-name>

# Scenario B: "Unable to mount volumes" — node plugin issue
# Check node plugin pod on the target node:
TARGET_NODE=$(kubectl get pod <pod-name> -n <ns> -o jsonpath='{.spec.nodeName}')
# (kubectl logs has no --field-selector flag, so resolve the pod name first)
NODE_POD=$(kubectl get pods -n kube-system -l app=ebs-csi-node \
  --field-selector spec.nodeName="$TARGET_NODE" -o name)
kubectl logs -n kube-system "$NODE_POD" -c ebs-plugin --tail=100

# Scenario C: "timeout expired waiting for volumes to be attached/mounted"
# kubelet may be having issues — check kubelet status on node
# (the node's root filesystem is mounted at /host inside the debug pod)
kubectl debug node/$TARGET_NODE -it --image=nicolaka/netshoot -- \
  chroot /host journalctl -u kubelet --no-pager -n 100

# Scenario D: filesystem corrupted — check dmesg for I/O errors
kubectl debug node/$TARGET_NODE -it --image=nicolaka/netshoot -- \
  dmesg | tail -50 | grep -i "error\|fail\|corrupt"

PVC Resize Stuck at FileSystemResizePending

# Check PVC conditions
kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.status.conditions}'

# FileSystemResizePending means cloud volume is resized,
# but filesystem inside has not been expanded yet.
# This requires the CSI node plugin to run NodeExpandVolume.

# Option 1: rolling restart the pod (triggers NodePublishVolume → NodeExpandVolume)
kubectl rollout restart deployment/<name> -n <ns>

# Option 2: check if node plugin is healthy on the pod's node
TARGET_NODE=$(kubectl get pod <pod-name> -n <ns> -o jsonpath='{.spec.nodeName}')
kubectl logs -n kube-system \
  $(kubectl get pods -n kube-system -l app=ebs-csi-node \
    -o wide | grep $TARGET_NODE | awk '{print $1}') \
  -c ebs-plugin --tail=50

# Manual resize from inside the pod (last resort — ext4 only)
kubectl exec -n <ns> <pod-name> -- resize2fs /dev/<device>

Orphaned EBS Volume Detection

# Find EBS volumes tagged with kubernetes.io/created-for/pvc that have no attachment
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
             "Name=tag-key,Values=kubernetes.io/created-for/pvc/name" \
  --query 'Volumes[*].{VolumeId:VolumeId,Size:Size,Created:CreateTime,PVC:Tags[?Key==`kubernetes.io/created-for/pvc/name`].Value|[0]}' \
  --output table

# Cross-reference with K8s PVs to find truly orphaned volumes
# (PV deleted but cloud volume wasn't cleaned up)
PVOLS=$(kubectl get pv -o jsonpath='{.items[*].spec.csi.volumeHandle}' | tr ' ' '\n' | sort)
aws ec2 describe-volumes \
  --filters "Name=tag-key,Values=kubernetes.io/created-for/pvc/name" \
  --query 'Volumes[*].VolumeId' \
  --output text | tr '\t' '\n' | sort | \
  comm -23 - <(echo "$PVOLS")
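
The cross-reference above is a set difference, which can also be expressed directly on the JSON. A minimal sketch, assuming a list of volume IDs from AWS and a PV list in the shape of `kubectl get pv -o json`; `orphaned_volumes` and the IDs below are illustrative.

```python
def orphaned_volumes(cloud_volume_ids, pv_list):
    """Volume IDs present in the cloud but not referenced by any CSI PV."""
    handles = {
        item["spec"]["csi"]["volumeHandle"]
        for item in pv_list["items"]
        if "csi" in item["spec"]  # skip non-CSI PVs such as local volumes
    }
    return sorted(set(cloud_volume_ids) - handles)


# Made-up IDs for the example.
cloud_ids = ["vol-0aaa", "vol-0bbb", "vol-0ccc"]
pvs = {"items": [
    {"spec": {"csi": {"volumeHandle": "vol-0aaa"}}},
    {"spec": {"csi": {"volumeHandle": "vol-0ccc"}}},
    {"spec": {"local": {"path": "/mnt/nvme0n1"}}},
]}

print(orphaned_volumes(cloud_ids, pvs))
```

Volumes belonging to another cluster in the same account would also show up here, so treat the result as candidates to review rather than a deletion list.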

Storage Alerting

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: storage-operations-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: storage.pvc
    interval: 60s
    rules:

    - alert: PVCDiskUsageHigh
      expr: |
        (
          kubelet_volume_stats_used_bytes
            /
          kubelet_volume_stats_capacity_bytes
        ) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "PVC disk usage above 80%"
        description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is {{ $value | printf \"%.1f\" }}% full."

    - alert: PVCDiskUsageCritical
      expr: |
        (
          kubelet_volume_stats_used_bytes
            /
          kubelet_volume_stats_capacity_bytes
        ) * 100 > 95
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "PVC disk usage above 95%: writes will soon fail (ENOSPC)"
        description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is {{ $value | printf \"%.1f\" }}% full. Immediate action required."

    - alert: PVCInodeUsageHigh
      expr: |
        (
          kubelet_volume_stats_inodes_used
            /
          kubelet_volume_stats_inodes
        ) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "PVC inode usage above 80%"
        description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ $value | printf \"%.1f\" }}% inodes used. Many small files may exhaust inodes before disk space."

    - alert: PVCPendingTooLong
      expr: |
        kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "PVC stuck in Pending for > 10 minutes"
        description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has been Pending for over 10 minutes. Check CSI provisioner logs."

    - alert: PVCInLostState
      expr: |
        kube_persistentvolumeclaim_status_phase{phase="Lost"} == 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "PVC is in Lost state — data may be inaccessible"
        description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is Lost. The underlying PV may have been deleted. Immediate investigation required."

  - name: storage.node
    rules:

    - alert: NodeDiskHighIOWait
      expr: |
        rate(node_disk_io_time_seconds_total{device=~"nvme.*|xvd.*|sd.*"}[5m]) * 100 > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node disk I/O utilization above 80%"
        description: "Node {{ $labels.instance }} disk {{ $labels.device }} I/O utilization is {{ $value | printf \"%.1f\" }}%."

    - alert: NodeRootDiskFull
      expr: |
        (
          node_filesystem_size_bytes{mountpoint="/"}
            - node_filesystem_free_bytes{mountpoint="/"}
        ) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node root disk above 85% full"
        description: "Node {{ $labels.instance }} root filesystem is {{ $value | printf \"%.1f\" }}% full. Container image layers and logs may be filling disk."

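    # Assumes VolumeSnapshot metrics are exported (for example through a
    # kube-state-metrics custom-resource-state config); kube-state-metrics
    # does not emit kube_volumesnapshot_* series by default.
    - alert: PersistentVolumeClaimWithoutSnapshot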
      expr: |
        kube_persistentvolumeclaim_info{storageclass=~"gp3-retain|gp3-high-iops"}
          unless on (namespace, persistentvolumeclaim)
        (
          label_replace(
            kube_volumesnapshot_spec_source_persistent_volume_claim_name,
            "persistentvolumeclaim", "$1", "source_pvc_name", "(.*)"
          )
        )
      for: 24h
      labels:
        severity: warning
      annotations:
        summary: "Retain-policy PVC has had no VolumeSnapshot for 24 hours"
        description: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} uses a Retain-policy StorageClass, but no VolumeSnapshot has referenced it for 24 hours."

Best Practices Summary

Use WaitForFirstConsumer

Always set volumeBindingMode: WaitForFirstConsumer on StorageClasses in multi-AZ clusters. With Immediate binding the volume's AZ is chosen before the pod is scheduled, so the pod can end up stuck because no suitable node exists in that AZ.

Retain Policy for Databases

Set reclaimPolicy: Retain on StorageClasses used by databases and stateful workloads. Accidental PVC deletion should not destroy production data — a human must explicitly release the PV.

Snapshot Before Resize

Always create a VolumeSnapshot before expanding a PVC. Filesystem expansion is usually safe, but if a node crashes during resize the volume may be left in a partially-expanded state. Snapshot provides rollback.

Monitor Inodes Too

Disk space metrics are not sufficient. A volume can fill its inode table (too many small files) while still having GBs of free space. Alert on both kubelet_volume_stats_used_bytes and kubelet_volume_stats_inodes_used.

Graceful Node Drain

Never hard-terminate nodes with attached volumes. Use kubectl drain to cleanly unmount volumes and delete VolumeAttachment objects. After an ungraceful node loss, stale VolumeAttachments block replacement pods with Multi-Attach errors until the attach/detach controller force-detaches the volume, by default after about 6 minutes.

Local Storage for Kafka

Kafka brokers get dramatically better throughput on instance-store NVMe (1M+ IOPS) than on EBS gp3 (16K IOPS). Use the local-nvme StorageClass with DaemonSet pre-provisioning and topology constraints to ensure pods land on NVMe nodes.