Storage Issues
Overview
Diagnosis and resolution of Kubernetes storage failures: stuck PVC provisioning, attach errors, read-only mounts, disk pressure, resize and snapshot problems, and StatefulSet volume management.
Storage Failure Decision Tree
Pod stuck in ContainerCreating
        │
        ▼
kubectl describe pod <pod>
Events say "waiting for volume to be attached"?
        │
        ├── Yes → VolumeAttachment issue
        │         → see Attach section
        │
        └── No  → check CNI
                  │
                  ▼
            PVC status?
            kubectl get pvc -n <ns>
                  │
                  ├── Pending → Provisioning failed
                  │             → check StorageClass
                  │             → check CSI controller logs
                  │
                  └── Bound   → VolumeMount error
                                → filesystem / path issue
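The branches of the tree can be sketched as a small shell function. This is only an illustration: `triage_storage` is a hypothetical helper name, and the matched strings are the event messages quoted in this runbook, which may be worded differently on other Kubernetes versions.

```shell
# Hypothetical helper: maps the two observations from the decision tree
# (pod event text, PVC phase) to the next troubleshooting step.
# Intended use (placeholders as elsewhere in this runbook):
#   triage_storage "$(kubectl describe pod <pod> -n <ns>)" \
#     "$(kubectl get pvc <pvc> -n <ns> -o jsonpath='{.status.phase}')"
triage_storage() {
  local events="$1" pvc_phase="$2"
  # Attach problems take priority, as in the tree.
  case "$events" in
    *"waiting for volume to be attached"*|*"Multi-Attach error"*)
      echo "attach: inspect VolumeAttachment objects"
      return 0 ;;
  esac
  case "$pvc_phase" in
    Pending) echo "provisioning: check StorageClass and CSI controller logs" ;;
    Bound)   echo "mount: check filesystem / path on the node" ;;
    *)       echo "unknown: PVC phase '$pvc_phase' is not covered by the tree" ;;
  esac
}
```

The function only classifies; it runs no cluster commands itself, so it is safe to try against saved `kubectl describe` output.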
PVC Stuck Pending
kubectl get pvc -n <ns>
# NAME STATUS VOLUME CAPACITY STORAGECLASS
# payments-data Pending <none> 0 gp3-retain
kubectl describe pvc payments-data -n <ns>
# Events: "waiting for a volume to be created, either by external provisioner... or manually"
# Cause 1: StorageClass doesn't exist
kubectl get storageclass
# Fix: create StorageClass or correct the storageClassName in PVC
# Cause 2: StorageClass has WaitForFirstConsumer and no pod yet
kubectl get storageclass gp3-retain -o jsonpath='{.volumeBindingMode}'
# If WaitForFirstConsumer: PVC stays Pending until a pod is scheduled that uses it
# Cause 3: external-provisioner pod crashed / not running
kubectl get pod -n kube-system -l app=ebs-csi-controller
kubectl logs -n kube-system <ebs-csi-controller-pod> -c csi-provisioner --tail=50
# Cause 4: CSI controller cannot reach cloud API
kubectl logs -n kube-system <ebs-csi-controller-pod> -c ebs-plugin --tail=50
# Look for: permission denied (IAM role), API throttling, quota exceeded
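# Cause 5: Requested capacity is outside what the backend or namespace allows
# A StorageClass has no min/max size field; limits come from the volume type,
# a LimitRange, or a ResourceQuota
kubectl describe storageclass gp3-retain
kubectl describe limitrange -n <ns>
kubectl describe resourcequota -n <ns>
# Cause 6: Zone mismatch with WaitForFirstConsumer
# Compare allowedTopologies (if set) with the zones that have schedulable nodes
kubectl get storageclass gp3-retain -o jsonpath='{.allowedTopologies}'
kubectl get nodes -L topology.kubernetes.io/zone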
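To find every claim that needs the checks above, the Pending ones can be filtered out of a cluster-wide listing. The sketch below assumes the input comes from `kubectl get pvc` with custom columns in the order namespace, name, phase, StorageClass; `pending_pvcs` is a hypothetical helper name.

```shell
# Reads "namespace name phase storageclass" rows on stdin and prints the
# claims whose phase is Pending, together with their StorageClass.
# Intended use:
#   kubectl get pvc -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,SC:.spec.storageClassName \
#     | pending_pvcs
pending_pvcs() {
  awk '$3 == "Pending" { printf "%s/%s storageclass=%s\n", $1, $2, $4 }'
}
```

Claims without a storageClassName show `<none>` in the last column, which points at Cause 1.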
VolumeAttachment Stuck
# Pod stuck ContainerCreating with:
# "waiting for volume to be attached"
# OR "Multi-Attach error: volume can only be attached to one node"
# Check VolumeAttachment
kubectl get volumeattachment
kubectl describe volumeattachment <va-name>
# Case 1: attached=false, no error
# → AWS API call in progress or backing off
kubectl logs -n kube-system <ebs-csi-controller-pod> -c csi-attacher --tail=100
# Case 2: Multi-Attach error
# Another node/pod has the volume attached (EBS RWO)
kubectl get volumeattachment -o json | \
jq '.items[] | select(.spec.source.persistentVolumeName=="<pv>") |
{node:.spec.nodeName, attached:.status.attached}'
# The old pod/node still has the attachment. Fix:
# a) Wait for old pod to terminate and old VA to be deleted
# b) If old node is dead/stuck: kubectl delete node <old-node>
# → AD controller removes old VA → new attach proceeds
# (delete the node object only once the instance is confirmed stopped or terminated)
# Case 3: Stuck VA with finalizer
# Delete the VA; the external-attacher detaches the volume, then removes its finalizer
kubectl delete volumeattachment <va-name>
# If still stuck after 60s, and the AWS-side check below shows the volume detached,
# remove the finalizer by hand (this skips the detach call):
kubectl patch volumeattachment <va-name> \
  -p '{"metadata":{"finalizers":[]}}' --type=merge
# Verify AWS-side state
PV_HANDLE=$(kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}')
aws ec2 describe-volumes --volume-ids "$PV_HANDLE" \
  --query 'Volumes[0].{State:State,Attachments:Attachments}'
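When several volumes are affected, it can help to list every PV that has VolumeAttachment objects on more than one node. This is a jq-free sketch; `multi_attached_pvs` is a hypothetical helper, and it assumes its input is `kubectl get volumeattachment` rendered with custom columns in the order PV, node, attached.

```shell
# Reads "pv node attached" rows on stdin and prints each PV that has
# VolumeAttachment objects on more than one node, with the node list.
# Intended use:
#   kubectl get volumeattachment --no-headers \
#     -o custom-columns=PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
#     | multi_attached_pvs
multi_attached_pvs() {
  awk '
    { nodes[$1] = (nodes[$1] == "" ? $2 : nodes[$1] "," $2); count[$1]++ }
    END { for (pv in count) if (count[pv] > 1) print pv, nodes[pv] }
  ' | sort
}
```

A PV listed here is a candidate for the Multi-Attach case; the stale attachment is usually the one on the node the pod has left.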
PVC Won't Delete (Stuck Terminating)
# PVC has deletionTimestamp but not deleted
kubectl get pvc payments-data -n production \
-o jsonpath='{.metadata.finalizers}'
# ["kubernetes.io/pvc-protection"]
# Protection finalizer: the PVC is still in use by a pod
# List only the pods that reference the claim
kubectl get pod -n production -o json | jq -r '.items[]
  | select(any(.spec.volumes[]?; .persistentVolumeClaim.claimName == "payments-data"))
  | .metadata.name'
# Fix: delete the pod using the PVC first
# If no pods are using it but finalizer is stuck:
kubectl patch pvc payments-data -n production \
  -p '{"metadata":{"finalizers":[]}}' --type=merge
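Where jq is not available, the same "which pods still reference the claim" check can be approximated with custom columns and awk. This sketch assumes that `kubectl` prints multiple claim names in one column separated by commas and `<none>` for pods without PVC volumes; `pods_using_claim` is a hypothetical helper name.

```shell
# Reads "pod-name claim[,claim...]" rows on stdin and prints the pods that
# reference the given claim name.
# Intended use:
#   kubectl get pod -n production --no-headers \
#     -o custom-columns='NAME:.metadata.name,CLAIMS:.spec.volumes[*].persistentVolumeClaim.claimName' \
#     | pods_using_claim payments-data
pods_using_claim() {
  local claim="$1"
  awk -v claim="$claim" '{
    n = split($2, claims, ",")
    for (i = 1; i <= n; i++) if (claims[i] == claim) { print $1; break }
  }'
}
```

An empty result suggests that no pod references the claim, which is the precondition for removing the finalizer by hand.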
Read-Only Filesystem / Mount Errors
# Container seeing read-only filesystem errors
kubectl describe pod <pod> -n <ns>
# Events: "MountVolume.SetUp failed for volume ... read-only filesystem"
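# Cause 1: Node's volume mount path is on a read-only filesystem
# The node's root filesystem is mounted at /host in the debug pod; quoting the
# pipeline makes it run on the node rather than in the local shell
kubectl debug node/<node> -it --image=ubuntu -- \
  chroot /host sh -c 'mount | grep /var/lib/kubelet'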
# Cause 2: Underlying EBS volume is in error state
aws ec2 describe-volumes --volume-ids <vol-id> \
--query 'Volumes[0].{State:State,MultiAttachEnabled:MultiAttachEnabled}'
# Cause 3: NodeStageVolume failed: filesystem not mounted to staging path
# Find the CSI node pod running on the same node as the affected pod
TARGET_NODE=$(kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.nodeName}')
CSI_NODE=$(kubectl get pod -n kube-system -l app=ebs-csi-node \
  --field-selector spec.nodeName="$TARGET_NODE" \
  -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system "$CSI_NODE" -c ebs-plugin --tail=100 | grep -i error
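# Cause 4: fsGroup mismatch (securityContext)
# Pod writes to /data but files are owned by root and the app runs as non-root;
# this surfaces as "permission denied" rather than a read-only filesystem
# Fix:
# spec:
#   securityContext:
#     fsGroup: 1000                        # group ownership applied to the volume on mount
#     fsGroupChangePolicy: OnRootMismatch  # skips the recursive chown when the root dir already matches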
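For Cause 1, the `mount` output on a busy node is long, so a filter that keeps only read-only mounts under the kubelet directory can help. This is a sketch that assumes the Linux `mount` line format (`<device> on <path> type <fs> (<options>)`) with `ro` as the first option and no spaces in the path; `ro_mounts` is a hypothetical helper name.

```shell
# Reads `mount` output on stdin and prints the mount points under the given
# prefix (default /var/lib/kubelet) whose option list starts with "ro".
# Intended use:
#   kubectl debug node/<node> -it --image=ubuntu -- chroot /host mount | ro_mounts
ro_mounts() {
  local prefix="${1:-/var/lib/kubelet}"
  awk -v p="$prefix" '
    $2 == "on" && index($3, p) == 1 && $6 ~ /^\(ro[,)]/ { print $3 }
  '
}
```

A volume that appears here while its PVC is not mounted `readOnly` usually means the kernel remounted it read-only after an I/O or filesystem error, which the node's kernel log should confirm.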
Disk Full / Ephemeral Storage
# Pod evicted due to ephemeral storage pressure
kubectl describe pod <pod> -n <ns>
# "The node was low on resource: ephemeral-storage"
# OR: "disk quota exceeded" in logs
# Check ephemeral storage usage
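kubectl exec <pod> -n <ns> -- df -h
# Run the glob inside the container; unquoted, /* would expand in the local shell
kubectl exec <pod> -n <ns> -- sh -c 'du -sh /* 2>/dev/null' | sort -rh | head -10
# Check logs consuming space (container log files)
# containerd/CRI logs live under /var/log/pods; Docker uses /var/lib/docker/containers
kubectl debug node/<node> -it --image=ubuntu -- \
  chroot /host sh -c 'du -sh /var/log/pods/* | sort -rh | head -10'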
# Fix 1: set ephemeral-storage limit
resources:
  limits:
    ephemeral-storage: 2Gi
  requests:
    ephemeral-storage: 500Mi
# Fix 2: rotate logs aggressively
# In container: configure logrotate or structured logging with rotation
# Fix 3: use emptyDir for large temp files (separate from overlay fs)
volumes:
  - name: tmp-space
    emptyDir:
      sizeLimit: 5Gi # enforced as ephemeral storage
# Check node disk usage
kubectl describe node <node> | grep -A5 "ephemeral-storage"
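The `df` check above can be turned into a quick threshold filter. The sketch below assumes POSIX `df -P` output (one line per filesystem, usage in column 5, mount point in column 6, no spaces in mount points); `df_over_threshold` is a hypothetical helper name and the 80% default is only an example.

```shell
# Reads `df -P` output on stdin and prints the mount points whose usage is at
# or above the given percentage (default 80).
# Intended use:
#   kubectl exec <pod> -n <ns> -- df -P | df_over_threshold 80
df_over_threshold() {
  local threshold="${1:-80}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)          # strip the trailing percent sign
    if (use + 0 >= t + 0) print $6, use "%"
  }'
}
```

Running it with `df -Pi` instead of `df -P` applies the same threshold to inode usage.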
PVC Resize Issues
# Resize PVC (StorageClass must have allowVolumeExpansion: true; shrinking is not supported)
kubectl patch pvc payments-data -n production \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
# Check resize progress
kubectl get pvc payments-data -n production \
  -o jsonpath='{.status.capacity.storage}{"\n"}{.status.conditions[*].type}{"\n"}'
# STATUS stays Bound; progress shows up as the Resizing and then
# FileSystemResizePending conditions, and .status.capacity changes once the resize completes
# Stuck in FileSystemResizePending
# → CSI has expanded the block device but filesystem not resized yet
# → kubelet calls NodeExpandVolume while the volume is mounted; drivers without
#   online expansion only do this on the next pod start
# Force by restarting the pod:
kubectl rollout restart deployment payments-api -n production
# Check CSI node logs for resize
kubectl logs -n kube-system "$CSI_NODE" -c ebs-plugin | grep -i -E 'resize|expand'
# StatefulSet PVC resize (volumeClaimTemplates are immutable, so patch each PVC directly)
# Scaling to 0 is only required when the driver cannot expand a mounted volume
kubectl scale statefulset postgres -n production --replicas=0
for i in 0 1 2; do
  kubectl patch pvc "data-postgres-$i" -n production \
    -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
done
kubectl scale statefulset postgres -n production --replicas=3
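Because a PVC can only grow, it can be worth checking the requested size against the current one before patching. The following is a minimal sketch with hypothetical helper names (`to_mib`, `is_expansion`); it handles only whole-number quantities with the `Mi`, `Gi`, and `Ti` suffixes, not the full Kubernetes quantity syntax.

```shell
# Converts a whole-number quantity with a Mi, Gi, or Ti suffix to MiB.
to_mib() {
  local q="$1" n="${1%[MGT]i}"
  case "$q" in
    *Mi) echo "$n" ;;
    *Gi) echo $(( n * 1024 )) ;;
    *Ti) echo $(( n * 1024 * 1024 )) ;;
    *)   echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# Succeeds only if the requested size is strictly larger than the current one.
# Returns 2 if either quantity cannot be parsed.
is_expansion() {
  local cur new
  cur=$(to_mib "$1") || return 2
  new=$(to_mib "$2") || return 2
  [ "$new" -gt "$cur" ]
}
```

For example, `is_expansion "$(kubectl get pvc payments-data -n production -o jsonpath='{.status.capacity.storage}')" 200Gi` succeeds when 200Gi is larger than the current capacity.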
VolumeSnapshot Failures
# VolumeSnapshot stuck "ReadyToUse: false"
kubectl describe volumesnapshot payments-snapshot -n production
kubectl describe volumesnapshotcontent <vsc-name>
# Check snapshot controller logs
kubectl logs -n kube-system -l app=snapshot-controller --tail=50
# Check CSI driver supports snapshots
kubectl get volumesnapshotclass
kubectl describe volumesnapshotclass ebs-vsc
# Common errors:
# "VolumeSnapshotContent not created" → provisioner doesn't support snapshots
# "Failed to create snapshot" → IAM missing ec2:CreateSnapshot permission
# "snapshot size mismatch" → quota limit on EBS snapshots
# Restore from snapshot
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: payments-data-restored
  namespace: production
spec:
  storageClassName: gp3-retain
  dataSource:
    name: payments-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
EOF
StatefulSet Volume Issues
# StatefulSet PVCs not being created
kubectl describe statefulset postgres -n production
# Check: volumeClaimTemplates section
# Events: "PVC not found" or "Error creating PVC"
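# PVCs orphaned after StatefulSet delete
# By default StatefulSets deliberately DON'T own PVCs (data safety)
# They must be manually deleted; list first to confirm the selector matches only these claims:
kubectl get pvc -l app=postgres -n production
kubectl delete pvc -l app=postgres -n production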
# StatefulSet pod stuck due to PVC in wrong AZ
# Fix: delete pod + PVC, let StatefulSet recreate in correct AZ
# (only works if data loss is acceptable or backup exists)
# Headless service DNS for StatefulSet pods
kubectl run debug --image=nicolaka/netshoot --rm -it --restart=Never -- \
  nslookup postgres-0.postgres.production.svc.cluster.local
# Format: <pod-name>.<service-name>.<namespace>.svc.cluster.local
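StatefulSet PVC and DNS names are both derived from the ordinal, so the expected names can be generated and compared against what exists in the cluster. This sketch uses the PVC pattern `<template>-<statefulset>-<ordinal>` (as in `data-postgres-0` above) and the DNS format shown above; `sts_names` is a hypothetical helper, and `cluster.local` is assumed to be the cluster domain.

```shell
# Prints, for each ordinal, the expected PVC name and the pod's DNS name.
# Arguments: volumeClaimTemplate name, StatefulSet name, headless service
# name, namespace, replica count.
sts_names() {
  local template="$1" sts="$2" svc="$3" ns="$4" replicas="$5" i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${template}-${sts}-${i} ${sts}-${i}.${svc}.${ns}.svc.cluster.local"
    i=$(( i + 1 ))
  done
}
```

Comparing the first column with `kubectl get pvc -n production` shows which ordinals are missing a claim or have a leftover one.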
Prometheus Alerts for Storage
PVCDiskUsageHigh → pvc usage > 80%
PVCDiskUsageCritical → pvc usage > 95%
PVCInodeUsageHigh → inode usage > 80% (many small files)
PVCPendingTooLong → PVC Pending > 5 min
PVCInLostState → PVC in Lost phase
NodeDiskHighIOWait → io_wait > 20% (disk bottleneck)
NodeRootDiskFull → root disk > 90% used
# Check inode usage (files, not space)
kubectl exec <pod> -n <ns> -- df -i
# Check PVC disk usage via Prometheus
# Port-forward to the service's "web" port and let curl URL-encode the PromQL query
kubectl port-forward -n monitoring svc/prometheus-operated 9090:web >/dev/null &
PF_PID=$!
sleep 2
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=kubelet_volume_stats_used_bytes{namespace="production"} / kubelet_volume_stats_capacity_bytes{namespace="production"}' | jq .
kill "$PF_PID"
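The PVC usage thresholds in the alert list (80% and 95%) can be applied by hand to the used and capacity byte values, for example when Prometheus is unreachable. This is only a sketch: `classify_pvc_usage` is a hypothetical helper, and it mirrors the thresholds listed above rather than any actual alert rule definition.

```shell
# Classifies PVC usage from used and capacity bytes, following the
# PVCDiskUsageHigh (> 80%) and PVCDiskUsageCritical (> 95%) thresholds.
classify_pvc_usage() {
  local used="$1" capacity="$2"
  awk -v u="$used" -v c="$capacity" 'BEGIN {
    if (c <= 0) { print "unknown"; exit }
    pct = 100 * u / c
    if (pct > 95)      print "critical"
    else if (pct > 80) print "high"
    else               print "ok"
  }'
}
```

The two arguments correspond to `kubelet_volume_stats_used_bytes` and `kubelet_volume_stats_capacity_bytes` for one volume.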
Related
- 07 — CSI Flow — CSI provisioning and mount sequence
- 14 — Volume Attach Flow — attach/detach controller details
- 06 — Storage Operations — storage ops playbook