Storage Operations
Managing persistent volumes, CSI drivers, StatefulSets, performance tuning, and storage incident response in production Kubernetes clusters.
Kubernetes Storage Stack
Storage in Kubernetes is layered: the Container Storage Interface (CSI) decouples driver logic from the kubelet, PersistentVolumes (PV) represent the physical resource, PersistentVolumeClaims (PVC) represent a request for storage, and StorageClasses define how PVs are dynamically provisioned.
Pod spec
 └─► PersistentVolumeClaim (namespace-scoped claim)
      └─► StorageClass (binding mode / provisioner / params)
           └─► CSI Driver (provisioner plugin in kube-system)
                ├─► Controller Plugin (runs as Deployment)
                │    ├─ CreateVolume (AWS EBS CreateVolume API call)
                │    ├─ ControllerPublishVolume (attach EBS vol to EC2 node)
                │    └─ DeleteVolume (GC on PVC delete when reclaim policy is Delete)
                └─► Node Plugin (runs as DaemonSet on every node)
                     ├─ NodeStageVolume (format + mount to staging path)
                     └─ NodePublishVolume (bind-mount into pod's fs)
On-disk path:
/var/lib/kubelet/plugins/kubernetes.io/csi/... (staging)
/var/lib/kubelet/pods/<UID>/volumes/... (pod volume mount)
Storage Object States
| Object | State | Meaning | Common Cause |
|---|---|---|---|
| PVC | Pending | No matching PV exists yet | WaitForFirstConsumer / no capacity / wrong SC |
| PVC | Bound | PV attached and ready | Normal operating state |
| PVC | Lost | Bound PV deleted or unreachable | Manual PV deletion, zone mismatch |
| PV | Available | Not yet bound to any PVC | Pre-provisioned static PV |
| PV | Bound | Claimed by a PVC | Normal |
| PV | Released | PVC deleted, PV not yet reclaimed | Retain reclaim policy |
| PV | Failed | Automated reclamation failed | Cloud volume deletion error |
| VolumeAttachment | Attached: false | CSI attach in progress or stuck | Node failure, CSI pod crash, AZ mismatch |
Quick-Reference Diagnostic Commands
# List all PVCs with their phase and storage class
kubectl get pvc -A -o wide
# List PVs showing reclaim policy, capacity, access modes
kubectl get pv -o custom-columns=\
NAME:.metadata.name,\
CAPACITY:.spec.capacity.storage,\
ACCESS:.spec.accessModes[0],\
POLICY:.spec.persistentVolumeReclaimPolicy,\
STATUS:.status.phase,\
CLAIM:.spec.claimRef.namespace
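# Find all PVCs that are NOT Bound
# (PVCs do not support a status.phase field selector, so filter client-side)
kubectl get pvc -A -o json | \
jq -r '.items[] | select(.status.phase != "Bound") | "\(.metadata.namespace)/\(.metadata.name)\t\(.status.phase)"'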
# Describe a specific PVC to see binding events and CSI details
kubectl describe pvc <pvc-name> -n <namespace>
# List VolumeAttachment objects (CSI controller attach status)
kubectl get volumeattachment
# Check CSI node driver registration
kubectl get csinode <node-name> -o yaml
# List StorageClasses and their provisioners
kubectl get sc -o wide
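The same filtering can be done offline on saved output. The following is a minimal Python sketch, assuming the input is the parsed result of `kubectl get pvc -A -o json`; the function name `unbound_pvcs` is hypothetical and not part of any tool.

```python
def unbound_pvcs(pvc_list: dict) -> list[tuple[str, str, str]]:
    """Return (namespace, name, phase) for every PVC whose phase is not Bound.

    pvc_list is assumed to be the parsed JSON of `kubectl get pvc -A -o json`.
    """
    result = []
    for item in pvc_list.get("items", []):
        # A PVC without a reported phase is treated as Unknown, not Bound.
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            meta = item["metadata"]
            result.append((meta["namespace"], meta["name"], phase))
    return result
```

Feeding it a saved snapshot of the cluster state makes it easy to diff the set of unbound claims before and after a change.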
PV Lifecycle Operations
Reclaim Policy Behaviour
The default reclaim policy for dynamically provisioned PVs is Delete: the underlying cloud volume is destroyed when the PVC is deleted. For production databases, set the reclaim policy to Retain (on the StorageClass or by patching the PV) or block deletion with a finalizer, and confirm that a Velero backup exists before deleting any PVC.
# Patch a PV to change reclaim policy from Delete to Retain
kubectl patch pv <pv-name> \
-p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
# Patch ALL PVs that use gp3 storage class to Retain
kubectl get pv -o json | \
jq -r '.items[] | select(.spec.storageClassName=="gp3") | .metadata.name' | \
xargs -I{} kubectl patch pv {} \
-p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
Reclaiming a Released PV
When a PVC is deleted and the PV reclaim policy is Retain, the PV enters Released state. It cannot be automatically rebound until the claimRef is cleared:
# Step 1: verify PV is in Released state
kubectl get pv <pv-name>
# Step 2: remove the claimRef so the PV becomes Available again
kubectl patch pv <pv-name> --type json \
-p '[{"op":"remove","path":"/spec/claimRef"}]'
# Step 3: create a new PVC that binds to this specific PV
# Use volumeName in the PVC spec:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-recovered
  namespace: production
spec:
  storageClassName: gp3
  volumeName: pvc-abc123 # pin to the specific PV
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
Reclaim Policy on StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain # <-- preserve volume on PVC delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
CSI Driver Operations
CSI Component Health Check
# AWS EBS CSI Driver (typical install via EKS add-on or Helm)
kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl get pods -n kube-system -l app=ebs-csi-node
# Check EBS CSI controller logs for provisioning errors
kubectl logs -n kube-system \
-l app=ebs-csi-controller -c csi-provisioner --tail=50
# On EKS, check the status of the managed aws-ebs-csi-driver add-on
aws eks describe-addon \
--cluster-name <cluster> \
--addon-name aws-ebs-csi-driver \
--query 'addon.{status:status,version:addonVersion}'
# Check CSI node plugin status on a specific node
kubectl describe csinode <node-name>
# EFS CSI Driver
kubectl get pods -n kube-system -l app=efs-csi-controller
kubectl get pods -n kube-system -l app=efs-csi-node
CSI Controller vs Node Plugin Responsibilities
| Operation | Plugin | What it does | Failure symptom |
|---|---|---|---|
| CreateVolume | Controller | Calls AWS EBS CreateVolume | PVC stuck Pending — "failed to provision volume" |
| ControllerPublishVolume (attach) | Controller | Attaches EBS vol to EC2 instance | VolumeAttachment stuck, pod stays in ContainerCreating |
| DeleteVolume | Controller | Destroys cloud volume on PVC delete | Orphaned EBS volumes in AWS console |
| NodeStageVolume | Node | Formats filesystem + mounts to staging | Pod event: "failed to stage volume" |
| NodePublishVolume | Node | Bind-mounts staged path into pod | Pod stuck ContainerCreating, mount error |
| NodeExpandVolume | Node | Resizes filesystem after PVC expand | PVC resize stuck FileSystemResizePending |
Diagnosing a Stuck Pod (ContainerCreating / Volume Mount Failure)
# Step 1: describe the stuck pod — look for events
kubectl describe pod <pod-name> -n <namespace>
# Common events:
# "Unable to attach or mount volumes"
# "Multi-Attach error for volume" — vol attached to another node
# "timed out waiting for the condition"
# Step 2: check the VolumeAttachment object for the PV
kubectl get volumeattachment | grep <pv-name>
kubectl describe volumeattachment <attachment-name>
# Step 3: if VolumeAttachment is stuck with the old node:
# This happens when a node dies ungracefully: the vol is still
# "attached" to the dead node from the K8s perspective.
# Only after confirming the node is really gone, force-delete the
# VolumeAttachment (the CSI controller will re-attach to the new node)
kubectl delete volumeattachment <attachment-name>
# Step 4: check the CSI controller pod for errors
kubectl logs -n kube-system \
-l app=ebs-csi-controller -c csi-attacher --tail=100
# Step 5: verify the EBS volume state in AWS
aws ec2 describe-volumes \
--volume-ids <vol-id> \
--query 'Volumes[0].{State:State,Attachments:Attachments}'
EBS volumes with accessModes: ReadWriteOnce can only be attached to one node at a time. If a pod is rescheduled to a different node while the old VolumeAttachment exists (node crashed, not drained), new attachment fails. Always drain nodes gracefully (kubectl drain) before terminating to allow clean VolumeAttachment cleanup.
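Stale attachments can be spotted by comparing VolumeAttachment objects against the nodes that still exist. The following is a minimal Python sketch, assuming the inputs come from `kubectl get volumeattachment -o json` and `kubectl get nodes`; the function name `stale_attachments` is hypothetical. A missing Node object is only a hint, so confirm that the instance is really gone before deleting anything.

```python
def stale_attachments(attachments: list[dict], live_nodes: set[str]) -> list[str]:
    """Return names of VolumeAttachments that point at a node no longer in the cluster.

    attachments: the `items` array of `kubectl get volumeattachment -o json`.
    live_nodes:  node names currently known to the API server.
    """
    return [
        va["metadata"]["name"]
        for va in attachments
        if va["spec"]["nodeName"] not in live_nodes
    ]
```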
EBS / Block Storage Operations
gp3 vs gp2 StorageClass
AWS EBS gp3 provides a baseline of 3,000 IOPS and 125 MB/s throughput at no extra cost, independent of volume size. gp2 ties IOPS to size (3 IOPS per GB, up to a maximum of 16,000), so a gp2 volume needs about 1,000 GB to match the gp3 baseline. For most workloads, migrating from gp2 to gp3 reduces cost and makes performance more predictable.
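The size-dependent gp2 baseline can be compared with the flat gp3 baseline using a little arithmetic. The following is a minimal Python sketch based on the figures quoted above; the 100 IOPS floor for small gp2 volumes is an addition taken from AWS documentation, and the function names are hypothetical.

```python
GP3_BASELINE_IOPS = 3000  # flat gp3 baseline, independent of size


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, floor of 100 (assumption from AWS docs), cap of 16,000."""
    return min(max(3 * size_gib, 100), 16000)


def gp3_baseline_is_higher(size_gib: int) -> bool:
    """True when an unmodified gp3 volume has more baseline IOPS than gp2 at this size."""
    return GP3_BASELINE_IOPS > gp2_baseline_iops(size_gib)
```

Under these assumptions, any volume smaller than 1,000 GiB gets more baseline IOPS on gp3 than on gp2.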
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer # only provision in the AZ where pod lands
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  type: gp3
  iops: "3000" # baseline; can go up to 16000
  throughput: "125" # MB/s; can go up to 1000
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:123456789012:key/mrk-..."
High-Performance StorageClass for Databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-high-iops
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
parameters:
  type: gp3
  iops: "16000" # max gp3 IOPS
  throughput: "1000" # max gp3 throughput
  encrypted: "true"
Checking EBS Volume Metrics
# Get the EBS volume ID from a PV
kubectl get pv <pv-name> \
-o jsonpath='{.spec.csi.volumeHandle}'
# Check CloudWatch EBS metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS \
--metric-name VolumeQueueLength \
--dimensions Name=VolumeId,Value=vol-0abc123 \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Average
PromQL: PVC Disk Saturation
# PVC disk usage percentage
(
kubelet_volume_stats_used_bytes
/
kubelet_volume_stats_capacity_bytes
) * 100
# PVCs above 80% full
(
kubelet_volume_stats_used_bytes
/
kubelet_volume_stats_capacity_bytes
) * 100 > 80
# PVC inode usage percentage
(
kubelet_volume_stats_inodes_used
/
kubelet_volume_stats_inodes
) * 100
# Available bytes
kubelet_volume_stats_available_bytes{namespace="production"}
# PVCs with less than 10% free
(
kubelet_volume_stats_available_bytes
/
kubelet_volume_stats_capacity_bytes
) * 100 < 10
EFS / Shared Storage Operations
Amazon EFS provides ReadWriteMany (RWX) access across multiple pods and nodes. It is NFS-based, with throughput automatically scaling. Use EFS for shared content, ML training data, and CMS uploads — not for databases (latency is 1-10ms vs EBS 0.1-1ms).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap # use EFS Access Points (isolated dirs per PVC)
  fileSystemId: fs-0abc12345
  directoryPerms: "700"
  gidRangeStart: "1000"
  gidRangeEnd: "2000"
  basePath: "/dynamic_provisioning"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets
  namespace: production
spec:
  storageClassName: efs-sc
  accessModes: [ReadWriteMany] # multiple pods can mount simultaneously
  resources:
    requests:
      storage: 5Gi # EFS ignores this; it scales elastically
EFS Performance Troubleshooting
# EFS NFS mount options relevant to Kubernetes
# noresvport: use a non-reserved source port so the client can reconnect after a network interruption
# soft: NFS soft mount; return EIO on timeout instead of hanging forever
# timeo=600: 60 second NFS timeout (the value is in tenths of a second)
# Mount options are normally set through mountOptions on the StorageClass or PV.
# Some installs also ship a driver ConfigMap (the name below varies by install):
kubectl describe configmap -n kube-system efs-csi-driver-config 2>/dev/null || true
# Check EFS mount on a node running an EFS-backed pod
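NODE=$(kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}')
# -T prints the filesystem type; without it the "nfs" match depends on the mount source name
kubectl debug node/$NODE -it --image=nicolaka/netshoot -- \
df -hT | grep nfs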
# EFS throughput utilization via CloudWatch
aws cloudwatch get-metric-statistics \
--namespace AWS/EFS \
--metric-name MeteredIOBytes \
--dimensions Name=FileSystemId,Value=fs-0abc12345 \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Sum
Storage Performance Analysis
fio Benchmark — Block Storage (EBS)
# Run fio benchmarks inside a pod using a PVC
apiVersion: v1
kind: Pod
metadata:
name: fio-benchmark
namespace: default
spec:
containers:
- name: fio
image: nixery.dev/shell/fio
command:
- /bin/bash
- -c
- |
echo "=== Random Read 4K IOPS ==="
fio --name=randread --ioengine=libaio --iodepth=32 \
--rw=randread --bs=4k --direct=1 --numjobs=4 \
--size=1G --runtime=60 --filename=/data/test \
--output-format=json | \
python3 -c "import json,sys; d=json.load(sys.stdin); \
r=d['jobs'][0]['read']; \
print(f'IOPS: {r[\"iops\"]:.0f}, BW: {r[\"bw\"]/1024:.1f} MB/s, lat_p99: {r[\"lat_ns\"][\"percentile\"][\"99.000000\"]/1e6:.2f}ms')"
echo "=== Random Write 4K IOPS ==="
fio --name=randwrite --ioengine=libaio --iodepth=32 \
--rw=randwrite --bs=4k --direct=1 --numjobs=4 \
--size=1G --runtime=60 --filename=/data/test \
--output-format=json | \
python3 -c "import json,sys; d=json.load(sys.stdin); \
w=d['jobs'][0]['write']; \
print(f'IOPS: {w[\"iops\"]:.0f}, BW: {w[\"bw\"]/1024:.1f} MB/s, lat_p99: {w[\"lat_ns\"][\"percentile\"][\"99.000000\"]/1e6:.2f}ms')"
echo "=== Sequential Read 1M throughput ==="
fio --name=seqread --ioengine=libaio --iodepth=8 \
--rw=read --bs=1M --direct=1 --numjobs=1 \
--size=2G --runtime=30 --filename=/data/test
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: <pvc-name>
restartPolicy: Never
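fio's JSON output contains one entry per job when group reporting is not enabled, so a multi-job run needs to be aggregated before the numbers are compared with a volume's limits. The following is a minimal Python sketch, assuming the default fio JSON layout in which completion-latency percentiles sit under `clat_ns`; the function name `summarize_fio` is hypothetical.

```python
def summarize_fio(result: dict, direction: str) -> dict:
    """Aggregate a parsed fio JSON result across all jobs.

    direction is "read" or "write". IOPS and bandwidth are summed;
    the p99 completion latency reported is the worst one across jobs.
    """
    jobs = [job[direction] for job in result["jobs"]]
    iops = sum(j["iops"] for j in jobs)
    bw_mib_s = sum(j["bw"] for j in jobs) / 1024  # fio reports bw in KiB/s
    p99_ms = max(j["clat_ns"]["percentile"]["99.000000"] for j in jobs) / 1e6
    return {"iops": iops, "bw_mib_s": bw_mib_s, "p99_ms": p99_ms}
```

Taking the maximum p99 across jobs is a conservative choice; it is not a true combined percentile, which fio only computes itself when group reporting is on.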
Expected I/O Latencies by Storage Type
| Storage Type | Read p50 | Read p99 | Write p50 | IOPS Limit | Use Case |
|---|---|---|---|---|---|
| EBS gp3 (NVMe) | 0.1ms | 0.5ms | 0.2ms | 16,000 | Databases, etcd |
| EBS io2 Block Express | <0.1ms | 0.2ms | <0.1ms | 256,000 | Critical OLTP |
| EFS (NFS) | 1ms | 10ms | 2ms | — | Shared, RWX |
| Local NVMe (instance store) | 0.05ms | 0.1ms | 0.05ms | 1M+ | Kafka, ephemeral caches |
| emptyDir (tmpfs) | <0.01ms | <0.01ms | <0.01ms | RAM-bound | Temp data, shared mem |
Node Storage Metrics
# Check node disk utilization
kubectl top nodes # doesn't show disk — use node_exporter
# node_exporter disk metrics via PromQL:
# Disk read/write bytes per second
rate(node_disk_read_bytes_total{device=~"nvme.*|xvd.*"}[5m])
rate(node_disk_written_bytes_total{device=~"nvme.*|xvd.*"}[5m])
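# Average I/O latency (ms per completed operation); high values indicate I/O saturation
rate(node_disk_read_time_seconds_total{device=~"nvme.*|xvd.*"}[5m])
/ rate(node_disk_reads_completed_total{device=~"nvme.*|xvd.*"}[5m]) * 1000
rate(node_disk_write_time_seconds_total{device=~"nvme.*|xvd.*"}[5m])
/ rate(node_disk_writes_completed_total{device=~"nvme.*|xvd.*"}[5m]) * 1000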
# Disk utilization percentage
rate(node_disk_io_time_seconds_total{device=~"nvme.*|xvd.*"}[5m]) * 100
# Root filesystem usage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"})
/ node_filesystem_size_bytes{mountpoint="/"} * 100
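Per-operation latency and device utilization are different quantities derived from different counters. The following is a minimal Python sketch of the arithmetic, assuming counter deltas sampled over one interval; the function names are hypothetical.

```python
from typing import Optional


def avg_await_ms(io_time_delta_s: float, ops_delta: float) -> Optional[float]:
    """Average latency per completed operation, in milliseconds.

    io_time_delta_s: increase of a *_time_seconds_total counter over the interval.
    ops_delta:       increase of the matching *_completed_total counter.
    Returns None when no operations completed in the interval.
    """
    if ops_delta <= 0:
        return None
    return io_time_delta_s * 1000 / ops_delta


def utilization_pct(busy_time_delta_s: float, interval_s: float) -> float:
    """Share of the interval during which the device was busy, as a percentage."""
    return busy_time_delta_s / interval_s * 100
```

For example, 0.5 s spent on 1,000 completed reads is 0.5 ms per read, while 30 s of busy time in a 60 s window is 50% utilization.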
PVC Resize Operations
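The StorageClass must have allowVolumeExpansion: true, and the CSI driver must support the EXPAND_VOLUME capability. Expansion happens in two phases: the controller plugin grows the cloud volume, then the node plugin grows the filesystem (NodeExpandVolume). If the node plugin supports online expansion, the filesystem is resized while the pod keeps running; otherwise it is resized the next time the pod restarts and the volume is remounted. EBS CSI driver v1.0+ supports online resize for ext4 and XFS without pod restart.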
Resize a PVC
# Step 1: patch the PVC spec.resources.requests.storage
kubectl patch pvc postgres-data -n production \
-p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
# Step 2: watch for the resize to complete
kubectl get pvc postgres-data -n production -w
# Status transitions:
# Bound (100Gi) → Bound (100Gi, FileSystemResizePending) → Bound (200Gi)
# Step 3: if status stays at FileSystemResizePending,
# a pod restart triggers the node-side filesystem expansion
kubectl rollout restart deployment/postgres -n production
# Step 4: verify new size inside the pod
kubectl exec -n production deploy/postgres -- df -h /var/lib/postgresql/data
Resize a StatefulSet PVC (requires manual steps)
StatefulSet VolumeClaimTemplates are immutable — Kubernetes will not resize them automatically. The workaround is to resize each PVC individually and then update the StatefulSet template.
# Resize all PVCs in a StatefulSet (e.g., kafka-data-kafka-0,1,2...)
STS_NAME=kafka
NAMESPACE=kafka
NEW_SIZE=200Gi
# Step 1: patch each PVC (assumes the PVCs carry the label app=$STS_NAME)
kubectl get pvc -n "$NAMESPACE" -l app="$STS_NAME" \
-o jsonpath='{.items[*].metadata.name}' | \
tr ' ' '\n' | \
while read -r pvc; do
  echo "Resizing $pvc..."
  kubectl patch pvc "$pvc" -n "$NAMESPACE" \
  -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$NEW_SIZE\"}}}}"
done
# Step 2: delete the StatefulSet WITHOUT deleting pods (--cascade=orphan)
kubectl delete sts $STS_NAME -n $NAMESPACE --cascade=orphan
# Step 3: re-apply the StatefulSet with updated volumeClaimTemplates size
# Edit your manifest to reflect NEW_SIZE, then:
kubectl apply -f kafka-statefulset.yaml -n $NAMESPACE
# Step 4: rolling restart to trigger filesystem expansion
kubectl rollout restart statefulset/$STS_NAME -n $NAMESPACE
PVC storage requests can only grow, never shrink. To reduce storage size you must create a new smaller PVC, copy data, and swap the mount — there is no in-place shrink in Kubernetes or EBS.
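Because a shrink request is rejected, it can help to validate the requested size before patching. The following is a minimal Python sketch, assuming sizes use the binary suffixes Ki, Mi, Gi, or Ti, or are plain byte counts; it does not cover the full Kubernetes quantity grammar, and the function names are hypothetical.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a size such as '100Gi' into bytes (binary suffixes and plain integers only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def is_valid_expansion(current: str, requested: str) -> bool:
    """A resize is only meaningful when the requested size is strictly larger."""
    return parse_quantity(requested) > parse_quantity(current)
```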
StorageClass Tuning
WaitForFirstConsumer — Why It Matters
With volumeBindingMode: Immediate, a PVC triggers volume creation before a pod is scheduled. The volume may land in AZ-A while the only nodes with room for the pod are in AZ-B. EBS volumes cannot be attached across AZs, so the pod either fails to mount the volume or stays Pending with a volume node affinity conflict. WaitForFirstConsumer delays provisioning until the pod's target node is known, aligning the volume's AZ with the pod's AZ.
Immediate (BAD for multi-AZ):
PVC created → CSI provisions vol in AZ-A → Pod scheduled to AZ-B → MOUNT FAIL
WaitForFirstConsumer (CORRECT):
PVC created → status: Pending (no vol yet)
Pod scheduled → Scheduler picks node in AZ-B
CSI notified of target node AZ → provisions vol in AZ-B → PVC binds → pod mounts ✓
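The two flows above can be expressed as a toy model. The following Python sketch is purely illustrative and deliberately simplified: it assumes the provisioner picks a fixed zone under Immediate binding and the pod's zone under WaitForFirstConsumer. The function names are hypothetical and no Kubernetes API is involved.

```python
def provision_zone(binding_mode: str, provisioner_choice: str, pod_zone: str) -> str:
    """Zone in which the volume is created under each binding mode."""
    if binding_mode == "WaitForFirstConsumer":
        # Provisioning is delayed until the scheduler has picked a node.
        return pod_zone
    if binding_mode == "Immediate":
        # The volume is created before any pod exists, so the pod's zone is unknown.
        return provisioner_choice
    raise ValueError(f"unknown volumeBindingMode: {binding_mode}")


def pod_can_mount(volume_zone: str, pod_zone: str) -> bool:
    """A zonal volume can only be attached to a node in the same zone."""
    return volume_zone == pod_zone
```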
StorageClass Reference Table
| Name | Provisioner | Use Case | Reclaim | Binding Mode |
|---|---|---|---|---|
| gp3 (default) | ebs.csi.aws.com | General workloads | Delete | WaitForFirstConsumer |
| gp3-retain | ebs.csi.aws.com | Databases (stateful) | Retain | WaitForFirstConsumer |
| gp3-high-iops | ebs.csi.aws.com | High-throughput DBs | Retain | WaitForFirstConsumer |
| efs-sc | efs.csi.aws.com | Shared RWX access | Delete | Immediate |
| local-storage | kubernetes.io/no-provisioner | Local NVMe (Kafka) | Retain | WaitForFirstConsumer |
Local StorageClass for High-Performance Workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
---
# Pre-provision a Local PV (one per NVMe disk per node)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-nvme-node1
spec:
  capacity:
    storage: 1000Gi
  accessModes: [ReadWriteOnce]
  storageClassName: local-storage
  persistentVolumeReclaimPolicy: Retain
  local:
    path: /mnt/nvme0n1 # pre-formatted and mounted on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-with-nvme-1"]
StatefulSet Operations
StatefulSet Rolling Updates
StatefulSets update pods in reverse ordinal order (N-1, N-2, ..., 0) by default. Each pod is updated, waits to become Ready, then the next is updated. This respects quorum for distributed systems like Kafka, Zookeeper, and etcd.
# Trigger a rolling update (after editing the StatefulSet spec)
kubectl rollout restart statefulset/kafka -n kafka
# Monitor rollout progress
kubectl rollout status statefulset/kafka -n kafka
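# StatefulSets do not support `kubectl rollout pause` / `resume` (Deployments only).
# To hold a rollout at its current point, raise the partition to the lowest
# ordinal that has already been updated (see the partition section below):
kubectl patch statefulset kafka -n kafka \
-p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":<ordinal>}}}}'
# To resume, lower the partition back to 0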
# Rollback to previous revision
kubectl rollout undo statefulset/kafka -n kafka
Partition-Based Canary for StatefulSets
The partition field in updateStrategy.rollingUpdate limits updates to pods with ordinal ≥ partition. This lets you update a subset first:
# Set partition=2 on a 3-replica StatefulSet (only pod-2 gets updated)
kubectl patch statefulset kafka -n kafka \
-p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
# After validating pod-2 is healthy, lower partition to 1 (update pod-2, pod-1)
kubectl patch statefulset kafka -n kafka \
-p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'
# Lower to 0 to complete the update
kubectl patch statefulset kafka -n kafka \
-p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
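The effect of a given partition value can be worked out ahead of time. The following is a minimal Python sketch of the rule stated above (pods with ordinal ≥ partition are updated, highest ordinal first); the function name `ordinals_to_update` is hypothetical.

```python
def ordinals_to_update(replicas: int, partition: int) -> list[int]:
    """Pod ordinals that a rolling update touches, in the order they are updated."""
    return [i for i in range(replicas - 1, -1, -1) if i >= partition]
```

For a 3-replica StatefulSet this gives `[2]` for partition 2, `[2, 1]` for partition 1, and `[2, 1, 0]` for partition 0.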
Force-Delete a Stuck StatefulSet Pod
Force-deleting a StatefulSet pod (e.g., after node failure) may result in two pods with the same identity running simultaneously if the original pod is still alive on a partitioned node. Only force-delete when the node is confirmed dead (terminated in cloud console, not just NotReady).
# Standard delete (waits for graceful termination — may block if node is dead)
kubectl delete pod kafka-2 -n kafka
# Force delete (bypasses graceful shutdown — only when node is confirmed dead)
kubectl delete pod kafka-2 -n kafka --force --grace-period=0
Headless Service and DNS for StatefulSets
apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
  namespace: kafka
spec:
  clusterIP: None # headless: no VIP, DNS returns pod IPs
  publishNotReadyAddresses: true # important: include unready pods in DNS
  selector:
    app: kafka
  ports:
    - name: kafka
      port: 9092
# DNS format for StatefulSet pods:
# <pod-name>.<service-name>.<namespace>.svc.cluster.local
# kafka-0.kafka-headless.kafka.svc.cluster.local
# kafka-1.kafka-headless.kafka.svc.cluster.local
# Verify headless DNS from within the cluster
kubectl run dns-test --image=nicolaka/netshoot --rm -it -- \
dig kafka-0.kafka-headless.kafka.svc.cluster.local
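The DNS format above is regular enough to generate, for example when building a broker list for client configuration. The following is a minimal Python sketch; the function name `pod_dns_names` is hypothetical, and `cluster.local` is assumed as the cluster domain.

```python
def pod_dns_names(statefulset: str, service: str, namespace: str,
                  replicas: int, cluster_domain: str = "cluster.local") -> list[str]:
    """Stable per-pod DNS names for a StatefulSet behind a headless Service."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]
```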
StatefulSet Minimum Availability with PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: kafka
spec:
  selector:
    matchLabels:
      app: kafka
  minAvailable: 2 # for a 3-replica Kafka, always keep a majority (2/3) available
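How many pods a drain may evict at once follows directly from minAvailable. The following is a minimal Python sketch of that arithmetic for an integer minAvailable (percentages are not handled); the function name `allowed_disruptions` is hypothetical.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of voluntary evictions the budget permits right now."""
    return max(0, healthy_pods - min_available)
```

With three healthy brokers and minAvailable: 2, one eviction is allowed; once a broker is already down, drains block until it recovers.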
Storage Backup Operations
See Disaster Recovery for the full Velero setup. This section focuses on storage-specific backup patterns and volume snapshot operations.
VolumeSnapshot — CSI Snapshot
VolumeSnapshots create a point-in-time snapshot of a PVC using the cloud provider's native snapshot API (EBS snapshot for gp3 volumes). They are faster and cheaper than full Velero backups for block storage.
# Install the CSI snapshot controller CRDs and controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
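# kubectl cannot apply a raw GitHub directory URL, so apply the controller manifests individually
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v8.0.1/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml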
# Create a VolumeSnapshotClass for EBS
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-vsc
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Retain # keep the EBS snapshot even if the VolumeSnapshot object is deleted
parameters:
  tagSpecification_1: "ManagedBy=K8s" # the driver expects key=value pairs
EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap-20260524
  namespace: production
spec:
  volumeSnapshotClassName: ebs-vsc
  source:
    persistentVolumeClaimName: postgres-data
# Check snapshot status
kubectl get volumesnapshot -n production
kubectl describe volumesnapshot postgres-data-snap-20260524 -n production
# readyToUse: true means the EBS snapshot is complete
# Restore from a VolumeSnapshot to a new PVC
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
  namespace: production
spec:
  storageClassName: gp3-retain
  dataSource:
    name: postgres-data-snap-20260524
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
EOF
Automated Snapshot CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-snapshot
  namespace: production
spec:
  schedule: "0 */6 * * *" # every 6 hours
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-creator
          restartPolicy: OnFailure
          containers:
            - name: snap
              # Assumes the image provides bash, kubectl, and GNU date; pin a version tag in production
              image: bitnami/kubectl:latest
              command:
                - /bin/bash
                - -c
                - |
                  SNAP_NAME="postgres-data-snap-$(date +%Y%m%d-%H%M%S)"
                  kubectl apply -f - <<EOF
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: $SNAP_NAME
                    namespace: production
                    labels:
                      app: postgres
                      managed-by: cronjob
                  spec:
                    volumeSnapshotClassName: ebs-vsc
                    source:
                      persistentVolumeClaimName: postgres-data
                  EOF
                  echo "Created snapshot: $SNAP_NAME"
                  # Prune snapshots older than 7 days (bash only, so the image does not need python3).
                  # With deletionPolicy: Retain on the VolumeSnapshotClass, this removes only the
                  # Kubernetes objects; the EBS snapshots themselves remain in AWS.
                  CUTOFF=$(date -u -d '7 days ago' +%s)
                  kubectl get volumesnapshot -n production -l managed-by=cronjob \
                    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' | \
                  while read -r name created; do
                    if [ "$(date -u -d "$created" +%s)" -lt "$CUTOFF" ]; then
                      echo "Deleting old snapshot: $name"
                      kubectl delete volumesnapshot "$name" -n production
                    fi
                  done
Storage Troubleshooting Playbook
PVC Stuck in Pending
# Describe PVC for events
kubectl describe pvc <pvc-name> -n <ns>
# Common causes and checks:
# 1. WaitForFirstConsumer — normal until a pod is scheduled
kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.metadata.annotations}'
# 2. No StorageClass matching the requested class
kubectl get sc # verify the SC exists
kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.spec.storageClassName}'
# 3. CSI provisioner pod crashed / not running
kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl logs -n kube-system -l app=ebs-csi-controller -c csi-provisioner
# 4. IRSA permissions — EBS CSI controller needs ec2:CreateVolume
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:role/ebs-csi-role \
--action-names ec2:CreateVolume \
--query 'EvaluationResults[0].EvalDecision'
# 5. Insufficient capacity in the AZ (for gp3 this is rare)
# Check AWS service health dashboard
Pod Stuck in ContainerCreating (Mount Failure)
# Get the full error
kubectl describe pod <pod-name> -n <ns> | grep -A 30 Events
# Scenario A: "Multi-Attach error" — volume attached to another node
kubectl get volumeattachment | grep <pv-name>
# Check if old attachment is on a dead node — if so, force-delete it:
kubectl delete volumeattachment <attachment-name>
# Scenario B: "Unable to mount volumes" — node plugin issue
# Check node plugin pod on the target node:
TARGET_NODE=$(kubectl get pod <pod-name> -n <ns> -o jsonpath='{.spec.nodeName}')
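CSI_NODE_POD=$(kubectl get pods -n kube-system -l app=ebs-csi-node \
--field-selector spec.nodeName=$TARGET_NODE -o name)
kubectl get -n kube-system $CSI_NODE_POD -o wide
# kubectl logs has no --field-selector flag, so pass the pod name resolved above
kubectl logs -n kube-system $CSI_NODE_POD -c ebs-plugin --tail=100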
# Scenario C: "timeout expired waiting for volumes to be attached/mounted"
# kubelet may be having issues; check the kubelet logs on the node
# (the node's root filesystem is mounted at /host inside the debug pod)
kubectl debug node/$TARGET_NODE -it --image=nicolaka/netshoot -- \
chroot /host journalctl -u kubelet --no-pager -n 100
# Scenario D: filesystem corrupted; check dmesg for I/O errors
# (reading the kernel log may require a privileged debug pod on some nodes)
kubectl debug node/$TARGET_NODE -it --image=nicolaka/netshoot -- \
chroot /host sh -c 'dmesg | grep -iE "error|fail|corrupt" | tail -50'
PVC Resize Stuck at FileSystemResizePending
# Check PVC conditions
kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.status.conditions}'
# FileSystemResizePending means cloud volume is resized,
# but filesystem inside has not been expanded yet.
# This requires the CSI node plugin to run NodeExpandVolume.
# Option 1: rolling restart the pod (triggers NodePublishVolume → NodeExpandVolume)
kubectl rollout restart deployment/<name> -n <ns>
# Option 2: check if node plugin is healthy on the pod's node
TARGET_NODE=$(kubectl get pod <pod-name> -n <ns> -o jsonpath='{.spec.nodeName}')
kubectl logs -n kube-system \
$(kubectl get pods -n kube-system -l app=ebs-csi-node \
-o wide | grep $TARGET_NODE | awk '{print $1}') \
-c ebs-plugin --tail=50
# Manual resize from inside the pod (last resort; resize2fs is ext4 only,
# XFS needs xfs_growfs on the mount point, and the container needs enough privileges)
kubectl exec -n <ns> <pod-name> -- resize2fs /dev/<device>
Orphaned EBS Volume Detection
# Find EBS volumes tagged with kubernetes.io/created-for/pvc that have no attachment
aws ec2 describe-volumes \
--filters "Name=status,Values=available" \
"Name=tag-key,Values=kubernetes.io/created-for/pvc/name" \
--query 'Volumes[*].{VolumeId:VolumeId,Size:Size,Created:CreateTime,PVC:Tags[?Key==`kubernetes.io/created-for/pvc/name`].Value|[0]}' \
--output table
# Cross-reference with K8s PVs to find truly orphaned volumes
# (PV deleted but cloud volume wasn't cleaned up)
PVOLS=$(kubectl get pv -o jsonpath='{.items[*].spec.csi.volumeHandle}' | tr ' ' '\n' | sort)
aws ec2 describe-volumes \
--filters "Name=tag-key,Values=kubernetes.io/created-for/pvc/name" \
--query 'Volumes[*].VolumeId' \
--output text | tr '\t' '\n' | sort | \
comm -23 - <(echo "$PVOLS")
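The cross-reference above is a set difference, which is easy to reproduce when the two ID lists have been saved to files or fetched by other tooling. The following is a minimal Python sketch; the function name `orphaned_volumes` is hypothetical.

```python
def orphaned_volumes(cloud_volume_ids: list[str], pv_volume_handles: list[str]) -> list[str]:
    """Volume IDs that exist in the cloud but are not referenced by any PV."""
    return sorted(set(cloud_volume_ids) - set(pv_volume_handles))
```

Treat the result as a list of candidates to review, not as a deletion list: a volume may belong to another cluster that shares the same AWS account.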
Storage Alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: storage-operations-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: storage.pvc
      interval: 60s
      rules:
        - alert: PVCDiskUsageHigh
          expr: |
            (
              kubelet_volume_stats_used_bytes
              /
              kubelet_volume_stats_capacity_bytes
            ) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PVC disk usage above 80%"
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is {{ $value | printf \"%.1f\" }}% full."
        - alert: PVCDiskUsageCritical
          expr: |
            (
              kubelet_volume_stats_used_bytes
              /
              kubelet_volume_stats_capacity_bytes
            ) * 100 > 95
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "PVC disk usage above 95%, writes will fail when the volume fills"
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is {{ $value | printf \"%.1f\" }}% full. Immediate action required."
        - alert: PVCInodeUsageHigh
          expr: |
            (
              kubelet_volume_stats_inodes_used
              /
              kubelet_volume_stats_inodes
            ) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PVC inode usage above 80%"
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ $value | printf \"%.1f\" }}% inodes used. Many small files may exhaust inodes before disk space."
        - alert: PVCPendingTooLong
          expr: |
            kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PVC stuck in Pending for > 10 minutes"
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has been Pending for over 10 minutes. Check CSI provisioner logs."
        - alert: PVCInLostState
          expr: |
            kube_persistentvolumeclaim_status_phase{phase="Lost"} == 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "PVC is in Lost state, data may be inaccessible"
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is Lost. The underlying PV may have been deleted. Immediate investigation required."
    - name: storage.node
      rules:
        - alert: NodeDiskHighIOWait
          expr: |
            rate(node_disk_io_time_seconds_total{device=~"nvme.*|xvd.*|sd.*"}[5m]) * 100 > 80
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node disk I/O utilization above 80%"
            description: "Node {{ $labels.instance }} disk {{ $labels.device }} I/O utilization is {{ $value | printf \"%.1f\" }}%."
        - alert: NodeRootDiskFull
          expr: |
            (
              node_filesystem_size_bytes{mountpoint="/"}
              - node_filesystem_free_bytes{mountpoint="/"}
            ) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node root disk above 85% full"
            description: "Node {{ $labels.instance }} root filesystem is {{ $value | printf \"%.1f\" }}% full. Container image layers and logs may be filling disk."
        # The VolumeSnapshot metric below depends on how kube-state-metrics is configured;
        # verify that it exists in your cluster before relying on this alert.
        - alert: PersistentVolumeClaimWithoutSnapshot
          expr: |
            kube_persistentvolumeclaim_info{storageclass=~"gp3-retain|gp3-high-iops"}
            unless on (namespace, persistentvolumeclaim)
            (
              label_replace(
                kube_volumesnapshot_spec_source_persistent_volume_claim_name,
                "persistentvolumeclaim", "$1", "source_pvc_name", "(.*)"
              )
            )
          for: 24h
          labels:
            severity: warning
          annotations:
            summary: "Retain-policy PVC has no VolumeSnapshot in 24 hours"
            description: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} uses a retain policy but has no recent snapshot."
Best Practices Summary
Use WaitForFirstConsumer
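Always set volumeBindingMode: WaitForFirstConsumer on StorageClasses in multi-AZ clusters. With Immediate binding, the volume can be provisioned in a different AZ from the node the pod needs, and the problem only surfaces later, when the pod cannot be scheduled or cannot mount the volume.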
Retain Policy for Databases
Set reclaimPolicy: Retain on StorageClasses used by databases and stateful workloads. Accidental PVC deletion should not destroy production data — a human must explicitly release the PV.
Snapshot Before Resize
Always create a VolumeSnapshot before expanding a PVC. Filesystem expansion is usually safe, but if a node crashes during resize the volume may be left in a partially-expanded state. Snapshot provides rollback.
Monitor Inodes Too
Disk space metrics are not sufficient. A volume can fill its inode table (too many small files) while still having GBs of free space. Alert on both kubelet_volume_stats_used_bytes and kubelet_volume_stats_inodes_used.
Graceful Node Drain
Never hard-terminate nodes with attached volumes. Use kubectl drain to cleanly unmount volumes and delete VolumeAttachment objects. After an ungraceful termination, the stale VolumeAttachment blocks re-attachment, and replacement pods can wait for the roughly 6-minute force-detach timeout before the volume attaches to a new node.
Local Storage for Kafka
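Kafka brokers can get far higher throughput on instance-store NVMe (1M+ IOPS) than on gp3 EBS (up to 16K IOPS). Use the local-storage StorageClass with DaemonSet pre-provisioning and topology constraints to ensure pods land on NVMe nodes. Instance-store data is lost when the instance is stopped or terminated, so rely on Kafka replication for durability.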