Volume Snapshots
Complete coverage of the Kubernetes volume snapshot API — the three custom resources, dynamic and pre-provisioned snapshot workflows, restore to new PVC, cross-namespace cloning, VolumeGroupSnapshot, application-consistent snapshots, scheduled snapshot automation, and production backup strategies with Velero.
Overview
Volume snapshots are not part of the core Kubernetes API. They are delivered as Custom Resource Definitions (CRDs) maintained by Kubernetes SIG Storage, alongside a snapshot controller that must be installed separately. Many managed Kubernetes offerings (EKS, GKE, AKS) ship these preinstalled or as an optional add-on; on self-managed clusters you must install them yourself.
Three components are required: (1) the snapshot CRDs from kubernetes-csi/external-snapshotter, (2) the common snapshot controller Deployment, and (3) the CSI driver's external-snapshotter sidecar. Without the controller and the sidecar, VolumeSnapshot objects are accepted by the API server but never acted on.
Three Resources: The PV/PVC/StorageClass Analogy
| Snapshot Resource | Scope | Analogous To | Purpose |
|---|---|---|---|
| VolumeSnapshotClass | Cluster | StorageClass | Defines which CSI driver handles snapshots and the deletion policy |
| VolumeSnapshot | Namespace | PersistentVolumeClaim | User request for a snapshot of a specific PVC |
| VolumeSnapshotContent | Cluster | PersistentVolume | The actual snapshot on the storage backend; created by the controller or pre-provisioned |
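Before relying on any of these resources, it helps to confirm that all three CRDs exist in the cluster. The sketch below is one way to do that; the function name `missing_snapshot_crds` is hypothetical, and it assumes its input is the output of `kubectl get crd -o name` (one CRD per line).

```shell
# Print every snapshot CRD that is absent from the list read on stdin.
# Input lines may carry the "customresourcedefinition.../" prefix that
# `kubectl get crd -o name` emits; it is stripped before comparison.
missing_snapshot_crds() {
  installed=$(sed 's|^.*/||')
  for crd in \
    volumesnapshotclasses.snapshot.storage.k8s.io \
    volumesnapshots.snapshot.storage.k8s.io \
    volumesnapshotcontents.snapshot.storage.k8s.io
  do
    # -x whole-line match, -F fixed string, -q no output
    printf '%s\n' "$installed" | grep -qxF "$crd" || printf '%s\n' "$crd"
  done
}

# Typical use against a live cluster (empty output means all three exist):
#   kubectl get crd -o name | missing_snapshot_crds
```

An empty result only proves the CRDs are present; the snapshot controller and the CSI sidecar still have to be running for snapshots to be processed.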
Snapshot Controller Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ COMMON SNAPSHOT CONTROLLER (Deployment, cluster-wide) │
│ │
│ Watches: VolumeSnapshot, VolumeSnapshotContent │
│ Responsibilities: │
│ - Binds VolumeSnapshot ↔ VolumeSnapshotContent (like PV binder) │
│ - Manages VSC finalizers │
│ - Handles deletion policy enforcement │
│ - Does NOT call CSI directly │
└──────────────────────────────┬──────────────────────────────────────┘
│ updates VSC status
▼
┌─────────────────────────────────────────────────────────────────────┐
│ CSI DRIVER: external-snapshotter SIDECAR (in controller Deployment) │
│ │
│ Watches: VolumeSnapshotContent (content.status.snapshotHandle=="") │
│ Calls: CSI CreateSnapshot / DeleteSnapshot / ListSnapshots │
│ Updates: VSC.status.snapshotHandle, VSC.status.readyToUse │
└─────────────────────────────────────────────────────────────────────┘
Dynamic snapshot creation flow:
User creates VolumeSnapshot (ns: production, source: data-postgres-0)
│
▼
Common controller creates VolumeSnapshotContent (cluster-scoped)
with snapshotHandle="" (pending)
│
▼
external-snapshotter calls CSI CreateSnapshot(sourceVolumeId, parameters)
│ CSI driver calls cloud API (e.g., AWS ec2:CreateSnapshot)
▼
VSC.status.snapshotHandle = "snap-0abc123"
VSC.status.readyToUse = true
│
▼
VS.status.readyToUse = true
VS.status.restoreSize = 100Gi
VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: ebs.csi.aws.com        # MUST match the CSI driver name
deletionPolicy: Delete         # Delete | Retain
parameters:
  # Driver-specific snapshot parameters
  tagSpecification_1: "environment=production"
  tagSpecification_2: "managed-by=kubernetes"
  # For Ceph RBD: csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
deletionPolicy
| Policy | When VolumeSnapshot Deleted | When to Use |
|---|---|---|
| Delete | VolumeSnapshotContent and the backing cloud snapshot are both deleted | Default for most use cases; ensures no orphaned cloud snapshots |
| Retain | VolumeSnapshotContent and backing snapshot are preserved; admin must clean up | Compliance/audit requirements; long-lived snapshots that outlive the K8s object |
VolumeSnapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-2024-01-15
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-vsc           # which VolumeSnapshotClass to use
  source:
    persistentVolumeClaimName: data-postgres-0   # dynamic: snapshot this PVC
    # OR for pre-provisioned (static):
    # volumeSnapshotContentName: existing-vsc-name
VolumeSnapshot Status Fields
kubectl get volumesnapshot postgres-snap-2024-01-15 -n production -o yaml
status:
  boundVolumeSnapshotContentName: snapcontent-abc-123-def   # bound VSC name
  creationTime: "2024-01-15T10:30:00Z"
  readyToUse: true       # true when snapshot is usable for restore
  restoreSize: 100Gi     # minimum PVC size for restore
  error: null            # populated if snapshot creation failed
  #   message: "..."
  #   time: "..."
readyToUse: true means the EBS snapshot has reached the completed state in AWS and can be used for restore. EBS snapshots are stored incrementally in S3, and a volume created from a snapshot loads its blocks lazily, so first reads on a freshly restored volume can be slow. Enable Fast Snapshot Restore (FSR) on the snapshot for latency-sensitive restore paths.
VolumeSnapshotContent
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: snapcontent-abc-123-def
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
spec:
  deletionPolicy: Delete
  driver: ebs.csi.aws.com
  volumeSnapshotClassName: csi-aws-vsc
  source:
    volumeHandle: vol-0abc123def456789     # for dynamic (set by controller)
    # OR for pre-provisioned:
    # snapshotHandle: snap-0existingsnap   # existing cloud snapshot ID
  volumeSnapshotRef:                       # binding to the VolumeSnapshot
    name: postgres-snap-2024-01-15
    namespace: production
    uid: abc-def-123
status:
  snapshotHandle: snap-0new123def456       # cloud snapshot ID (set after creation)
  readyToUse: true
  restoreSize: 107374182400                # bytes
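The VolumeSnapshotContent reports restoreSize in bytes, while PVC requests are usually written in Gi. A small conversion helper makes the comparison explicit; `bytes_to_gib_ceil` is a hypothetical name, and the sketch assumes the shell has 64-bit integer arithmetic (true for bash and dash on common platforms).

```shell
# Round a byte count up to whole GiB (1 GiB = 1073741824 bytes),
# so the result is always a safe lower bound for a restore PVC request.
bytes_to_gib_ceil() {
  echo $(( ($1 + 1073741823) / 1073741824 ))
}

bytes_to_gib_ceil 107374182400   # prints 100
```

Rounding up rather than truncating matters here: a restore PVC that is even one byte smaller than restoreSize does not satisfy the size constraint.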
Dynamic Snapshot Workflow
- Create a VolumeSnapshot referencing the source PVC and a VolumeSnapshotClass
- Common snapshot controller sees unbound VolumeSnapshot → creates VolumeSnapshotContent with empty snapshotHandle
- external-snapshotter sidecar sees VSC with empty handle → calls CSI CreateSnapshot(sourceVolumeId=vol-0abc123, parameters=...)
- CSI driver calls cloud API (e.g., ec2:CreateSnapshot)
- external-snapshotter polls until snapshot is ready → updates VSC.status.snapshotHandle and readyToUse=true
- Common controller copies readyToUse status to VolumeSnapshot
- VolumeSnapshot.status.readyToUse = true: the snapshot is available for restore
# Watch snapshot progress
kubectl get volumesnapshot postgres-snap-2024-01-15 -n production -w
# NAME READYTOUSE SOURCEPVC RESTORESIZE SNAPSHOTCONTENT AGE
# postgres-snap-2024-01-15 false data-postgres-0 snapcontent-abc 5s
# postgres-snap-2024-01-15 true data-postgres-0 100Gi snapcontent-abc 45s
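In scripts, watching interactively is not an option, so the readiness check is usually wrapped in a polling loop with a timeout. The following is a minimal sketch; `wait_ready` is a hypothetical helper that runs whatever command it is given and treats the output `true` as success.

```shell
# Poll a command until it prints "true" or the timeout expires.
# Usage: wait_ready <timeout_seconds> <interval_seconds> <command...>
# Returns 0 once the command prints "true", 1 on timeout.
# The interval should be positive, otherwise the timeout never advances.
wait_ready() {
  timeout=$1; interval=$2; shift 2
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    if [ "$("$@" 2>/dev/null)" = "true" ]; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Typical use against a live cluster:
#   wait_ready 600 5 kubectl get volumesnapshot postgres-snap-2024-01-15 \
#     -n production -o jsonpath='{.status.readyToUse}'
```

On kubectl 1.23 or newer, `kubectl wait --for=jsonpath='{.status.readyToUse}'=true volumesnapshot/<name> --timeout=10m` should achieve the same result without a custom loop.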
Pre-Provisioned (Static) Snapshots
Import an existing cloud snapshot into Kubernetes without creating a new one. Useful for disaster recovery scenarios where snapshots were created outside Kubernetes (e.g., AWS Backup, cloud-scheduled snapshots).
# Step 1: Create VolumeSnapshotContent pointing to existing cloud snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: imported-prod-snap
spec:
  deletionPolicy: Retain                 # Retain: don't delete cloud snapshot when VSC is deleted
  driver: ebs.csi.aws.com
  volumeSnapshotClassName: csi-aws-vsc
  source:
    snapshotHandle: snap-0existing123abc # existing AWS snapshot ID
  volumeSnapshotRef:
    name: imported-snap                  # VolumeSnapshot to bind to
    namespace: production
    # uid: omit; the snapshot controller fills it in when it binds the pair
---
# Step 2: Create VolumeSnapshot referencing the pre-provisioned VSC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: imported-snap
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    volumeSnapshotContentName: imported-prod-snap   # bind to existing VSC
volumeSnapshotRef.uid does not have to be supplied by hand. Omit it when creating the VolumeSnapshotContent: only name and namespace are required, and the snapshot controller fills in the UID when it binds the pair. If a uid is set, it must match the VolumeSnapshot's actual UID, otherwise the two objects will not bind.
Restore to New PVC
Restore a VolumeSnapshot to a new PVC using dataSource. The restored PVC is a new independent volume pre-populated with the snapshot's data.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restored
  namespace: production
spec:
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: postgres-snap-2024-01-15   # VolumeSnapshot in same namespace
  accessModes: [ReadWriteOnce]
  storageClassName: gp3-encrypted    # can be different from source SC
                                     # but must use the same CSI driver
  resources:
    requests:
      storage: 100Gi                 # must be ≥ snapshot's restoreSize
Restore Constraints
- The VolumeSnapshot must be in the same namespace as the target PVC (unless using cross-namespace via dataSourceRef)
- The StorageClass used for restore must use the same CSI driver as the VolumeSnapshotClass
- Target PVC capacity must be ≥ restoreSize from the snapshot status
- The snapshot must be readyToUse: true before the PVC creation request arrives
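The driver, readiness, and size constraints above can be checked before the restore PVC is submitted. The sketch below takes the values as plain arguments so it stays independent of how they are fetched; `restore_preflight` is a hypothetical name, and sizes are assumed to be whole Gi.

```shell
# Check the restore constraints; print one line per violated constraint.
# Args: <storageclass_provisioner> <snapshotclass_driver> <ready_to_use>
#       <pvc_request_gib> <restore_size_gib>
# Returns 0 when every constraint holds, 1 otherwise.
restore_preflight() {
  rc=0
  if [ "$1" != "$2" ]; then
    echo "driver mismatch: StorageClass provisioner $1, VolumeSnapshotClass driver $2"
    rc=1
  fi
  if [ "$3" != "true" ]; then
    echo "snapshot not readyToUse"
    rc=1
  fi
  if [ "$4" -lt "$5" ]; then
    echo "PVC request ${4}Gi is below restoreSize ${5}Gi"
    rc=1
  fi
  return "$rc"
}

# Example: snapshot still pending and PVC too small; prints two findings
restore_preflight ebs.csi.aws.com ebs.csi.aws.com false 50 100 || true
```

The arguments map onto `kubectl get storageclass <sc> -o jsonpath='{.provisioner}'`, `kubectl get volumesnapshotclass <vsc> -o jsonpath='{.driver}'`, and the snapshot's `status.readyToUse` and `status.restoreSize` fields.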
PVC Clone vs Snapshot Restore
| Aspect | PVC Clone (dataSource: PVC) | Snapshot Restore (dataSource: VolumeSnapshot) |
|---|---|---|
| Source must be Bound | Yes | No — snapshot is independent |
| Point-in-time | Best-effort (source still changing) | Yes — frozen moment in time |
| Cross-namespace | No (same namespace required) | Via dataSourceRef.namespace + ReferenceGrant (alpha, 1.26+) |
| Cloud cost | Full volume copy (same price as new volume) | Incremental from last snapshot (much cheaper over time) |
| Provisioning time | Fast (COW at cloud layer) | Can be slow (data restored from S3/object store) |
| Source StorageClass | Must match | Can differ (same driver, different parameters) |
Snapshot Deletion Semantics
Three finalizers control the deletion order of snapshot objects:
- snapshot.storage.kubernetes.io/volumesnapshot-bound-protection: on VolumeSnapshot; prevents VS deletion while it is bound to a VSC
- snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection: on VolumeSnapshot; prevents VS deletion while a PVC is being restored from it
- snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection: on VSC; prevents VSC deletion while a VS is still bound to it
Deletion flow (deletionPolicy: Delete):
kubectl delete volumesnapshot postgres-snap-2024-01-15
│
▼
VS deletion timestamp set; VS moves to "Deleting"
│ (common controller sees VS being deleted)
▼
Common controller issues a delete on the bound VSC and removes the finalizers from VS
VS is deleted from API server
│
▼
external-snapshotter calls CSI DeleteSnapshot(snapshotHandle=snap-0abc)
│
▼
Cloud snapshot deleted
bound-protection finalizer removed from VSC; VSC deleted from API server
Deletion flow (deletionPolicy: Retain):
kubectl delete volumesnapshot postgres-snap-2024-01-15
→ VS deleted
→ VSC remains, cloud snapshot remains
→ Admin must manually: kubectl delete vsc snapcontent-abc-123
→ external-snapshotter does NOT call DeleteSnapshot (Retain policy)
→ Cloud snapshot must be deleted manually in AWS/GCP/Azure console
Cross-Namespace Snapshot Sharing (1.26+)
By default, a VolumeSnapshot can only be used as a restore source within its own namespace. Cross-namespace restore is an alpha feature (Kubernetes 1.26+) behind the CrossNamespaceVolumeDataSource feature gate, which must be enabled on kube-apiserver, kube-controller-manager, and the CSI external-provisioner sidecar. The PVC names the source namespace in dataSourceRef.namespace, and a ReferenceGrant in the source namespace authorizes the reference.
# In the source namespace: allow PVCs in target-ns to reference the snapshot
# (ReferenceGrant comes from the Gateway API CRDs, which must be installed;
#  the grant's name is arbitrary)
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-target-ns-restore
  namespace: production
spec:
  from:
  - group: ""
    kind: PersistentVolumeClaim
    namespace: target-ns
  to:
  - group: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: postgres-snap-2024-01-15
---
# In the target namespace: create a PVC that references the snapshot across namespaces
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-from-other-ns
  namespace: target-ns
spec:
  dataSourceRef:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: postgres-snap-2024-01-15
    namespace: production            # source namespace
  accessModes: [ReadWriteOnce]
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 100Gi
The ReferenceGrant is the authorization gate. The owner of the source namespace must explicitly allow the reference, which prevents workloads from restoring arbitrary snapshots out of other namespaces.
VolumeGroupSnapshot (Alpha, 1.27+)
VolumeGroupSnapshot takes crash-consistent snapshots of multiple PVCs simultaneously — all in a single atomic operation at the storage backend. This is critical for databases that span multiple volumes (e.g., separate data and WAL volumes for PostgreSQL, or a Kafka broker with multiple partition volumes).
# Install group snapshot CRDs (separate from regular snapshot CRDs)
# kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/groupsnapshot.storage.k8s.io_volumegroupsnapshotclasses.yaml
# ... and VolumeGroupSnapshotContent / VolumeGroupSnapshot CRDs
apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshotClass
metadata:
  name: csi-aws-group-vsc
driver: ebs.csi.aws.com
deletionPolicy: Delete
---
apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshot
metadata:
  name: postgres-group-snap
  namespace: production
spec:
  volumeGroupSnapshotClassName: csi-aws-group-vsc
  source:
    selector:
      matchLabels:
        app: postgres          # snapshot ALL PVCs with this label simultaneously
        component: storage
The CSI driver must implement CreateVolumeGroupSnapshot RPC. The driver issues all individual snapshots in a single API call to the storage backend — guaranteeing consistency across volumes (no torn writes between data and WAL).
Application-Consistent Snapshots
Storage-level snapshots are crash-consistent (like pulling the power plug) — they capture the disk state atomically, but in-flight writes may be incomplete. For most databases this is acceptable with WAL replay on restart. For strict consistency, the application must be quiesced first.
Filesystem Freeze
# Quiesce filesystem before snapshot, then unfreeze
kubectl exec -it postgres-0 -n production -- bash -c "
# Flush and freeze filesystem
psql -c 'CHECKPOINT;' # PostgreSQL: flush dirty buffers
fsfreeze --freeze /var/lib/postgresql/data
# Signal readiness (e.g., write a file that triggers snapshot via hook)
echo 'frozen' > /tmp/snapshot-ready
"
# After snapshot completes:
kubectl exec -it postgres-0 -n production -- fsfreeze --unfreeze /var/lib/postgresql/data
Database Quiesce Patterns
| Database | Quiesce Command | Unquiesce |
|---|---|---|
| PostgreSQL | SELECT pg_backup_start('snap'); (pg_start_backup before PostgreSQL 15) or CHECKPOINT; | SELECT pg_backup_stop(); (pg_stop_backup before PostgreSQL 15) |
| MySQL / MariaDB | FLUSH TABLES WITH READ LOCK; | UNLOCK TABLES; |
| MongoDB | db.fsyncLock() | db.fsyncUnlock() |
| General (Linux) | fsfreeze --freeze <mountpoint> | fsfreeze --unfreeze <mountpoint> |
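Whichever quiesce pair is used, the important property is that the unquiesce step runs even when the snapshot step fails, otherwise the filesystem stays frozen. The wrapper below sketches that ordering; `run_quiesced` is a hypothetical helper that takes the three steps as command strings.

```shell
# Run an action between a quiesce and an unquiesce command.
# Once quiesce has succeeded, unquiesce always runs, even if the action fails.
# The action's exit status is returned to the caller.
# Usage: run_quiesced '<quiesce cmd>' '<action cmd>' '<unquiesce cmd>'
run_quiesced() {
  sh -c "$1" || { echo "quiesce failed; nothing to undo" >&2; return 1; }
  sh -c "$2"
  action_rc=$?
  sh -c "$3" || echo "WARNING: unquiesce failed; volume may still be frozen" >&2
  return "$action_rc"
}

# Typical use with the filesystem-freeze pair (commands shown for illustration):
#   run_quiesced \
#     'kubectl exec postgres-0 -n production -- fsfreeze --freeze /var/lib/postgresql/data' \
#     'kubectl create -f snapshot.yaml' \
#     'kubectl exec postgres-0 -n production -- fsfreeze --unfreeze /var/lib/postgresql/data'
```

This shape suits fsfreeze-style quiescing, where each step is an independent command. Session-scoped locks such as FLUSH TABLES WITH READ LOCK are released when the database session ends, so they need a client that keeps one session open across the snapshot instead.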
Velero Pre/Post Snapshot Hooks
Velero supports executing commands in pod containers before and after taking volume snapshots, enabling application-consistent backups without custom tooling:
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: postgres-consistent-backup
  namespace: velero
spec:
  includedNamespaces: [production]
  labelSelector:
    matchLabels:
      app: postgres
  snapshotVolumes: true
  hooks:
    resources:
    - name: postgres-hooks
      includedNamespaces: [production]
      labelSelector:
        matchLabels:
          app: postgres
      pre:
      - exec:
          container: postgres
          command:
          - /bin/bash
          - -c
          - "psql -c 'CHECKPOINT;' && fsfreeze --freeze /var/lib/postgresql/data"
          timeout: 30s
          onError: Fail
      post:
      - exec:
          container: postgres
          command:
          - /bin/bash
          - -c
          - "fsfreeze --unfreeze /var/lib/postgresql/data"
          timeout: 10s
          onError: Continue    # always unfreeze even if backup had issues
Scheduled Snapshots
Kubernetes has no built-in snapshot scheduling — implement it with a CronJob or a dedicated snapshot controller (Kasten K10, Velero schedules, or cloud-native solutions).
CronJob Snapshot Pattern
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-snapshot-daily
  namespace: production
spec:
  schedule: "0 2 * * *"              # daily at 2 AM UTC
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-creator
          restartPolicy: OnFailure
          containers:
          - name: snapshot
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Create a timestamped VolumeSnapshot of the PostgreSQL data PVC
              DATE=$(date +%Y-%m-%d-%H%M)
              cat <<EOF | kubectl create -f -
              apiVersion: snapshot.storage.k8s.io/v1
              kind: VolumeSnapshot
              metadata:
                name: postgres-snap-${DATE}
                namespace: production
              spec:
                volumeSnapshotClassName: csi-aws-vsc
                source:
                  persistentVolumeClaimName: data-postgres-0
              EOF
---
# RBAC for snapshot-creator ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: snapshot-creator
  namespace: production
rules:
- apiGroups: [snapshot.storage.k8s.io]
  resources: [volumesnapshots]
  verbs: [get, list, create, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: snapshot-creator
  namespace: production
subjects:
- kind: ServiceAccount
  name: snapshot-creator
  namespace: production
roleRef:
  kind: Role
  name: snapshot-creator
  apiGroup: rbac.authorization.k8s.io
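The Role above grants list and delete, which is what a retention step needs. One possible way to pick the snapshots to remove is sketched below; `snapshots_to_prune` is a hypothetical helper, and it assumes each input line is a creation timestamp followed by the snapshot name (RFC 3339 timestamps in UTC sort correctly as plain text).

```shell
# Print the names of snapshots to delete, keeping the newest <keep> entries.
# Input: "<creationTimestamp> <name>" lines on stdin, in any order.
snapshots_to_prune() {
  keep=$1
  # Newest first, then emit the name of everything past the first <keep> lines
  LC_ALL=C sort -r | awk -v keep="$keep" 'NR > keep { print $2 }'
}

# Typical use inside the CronJob script, keeping the 7 newest snapshots:
#   kubectl get volumesnapshot -n production \
#     -o jsonpath='{range .items[*]}{.metadata.creationTimestamp} {.metadata.name}{"\n"}{end}' \
#     | snapshots_to_prune 7 \
#     | xargs -r kubectl delete volumesnapshot -n production
```

In a namespace that holds snapshots from several sources, the `kubectl get` call would also need a label selector so that only this job's snapshots are considered.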
Velero Backup Strategy
Velero provides a complete backup solution for Kubernetes — it backs up both object metadata (Deployments, Services, ConfigMaps, Secrets) and PV data (via CSI snapshots or Restic/Kopia file-level backup).
Installation
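# Install Velero with AWS EBS CSI snapshot support
# --features=EnableCSI enables CSI snapshot integration
# --secret-file points to AWS credentials with s3 + ec2 permissions
# (Velero releases before 1.14 also need the separate velero-plugin-for-csi plugin)
# Comments must not follow a trailing backslash, or the line continuation breaks
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --features=EnableCSI \
  --use-volume-snapshots=true \
  --secret-file ./credentials-velero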
Backup Schedule
# Create a scheduled backup for production namespace
velero schedule create production-daily \
--schedule="0 3 * * *" \
--include-namespaces production \
--snapshot-volumes \
--ttl 720h0m0s # retain backups for 30 days
# List schedules and their status
velero schedule get
# Create an on-demand backup
velero backup create prod-backup-manual \
--include-namespaces production \
--snapshot-volumes \
--wait
Restore Workflow
# List available backups
velero backup get
# Restore entire namespace
velero restore create --from-backup prod-backup-manual \
--include-namespaces production \
--wait
# Restore to a different namespace
velero restore create --from-backup prod-backup-manual \
--namespace-mappings production:production-restore \
--wait
# Restore specific resources only
velero restore create --from-backup prod-backup-manual \
--include-resources persistentvolumeclaims,pods \
--selector app=postgres \
--wait
# Check restore status
velero restore describe <restore-name>
velero restore logs <restore-name>
File-Level Backup with Kopia (No CSI Snapshot)
# For volumes without CSI snapshot support, use Kopia (built into Velero 1.10+)
# --uploader-type accepts kopia or restic (legacy)
# --use-node-agent deploys the node-agent DaemonSet
velero install \
  --uploader-type kopia \
  --use-node-agent \
  ...
# Annotate pods to include specific volumes in file-level backup
kubectl annotate pod postgres-0 \
backup.velero.io/backup-volumes=data,wal # comma-separated volume names
Snapshot Storage Costs
Understanding snapshot cost models prevents surprise bills:
| Cloud | Snapshot Cost Model | Key Insight |
|---|---|---|
| AWS EBS | Incremental — only blocks changed since last snapshot stored in S3 | First snapshot = full volume; subsequent snapshots = delta. Cost grows with churn rate, not volume size. |
| GCE PD | Incremental (similar to EBS) | Regional snapshots cost 2× standard; standard snapshots replicated globally anyway. |
| Azure Disk | Incremental by default (Standard HDD billing for snapshot data) | Full snapshots available but rarely needed; incremental is the default since 2020. |
| Ceph RBD | COW (copy-on-write) — snapshot is just a reference; new writes stored separately | Snapshots "free" initially but flattening large old snapshots is expensive I/O. |
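To make the "cost grows with churn, not volume size" point concrete, the stored volume for an incremental chain can be approximated as the first full snapshot plus one daily delta per additional snapshot. This is a rough upper-bound sketch under that simplifying assumption (each day's changed blocks are distinct); `incremental_snapshot_gib` is a hypothetical name and no provider pricing is implied.

```shell
# Estimate GiB stored for a chain of incremental snapshots:
# the first snapshot holds the used data, each later one only the changed blocks.
# Args: <used_gib> <daily_change_gib> <snapshots_retained>
incremental_snapshot_gib() {
  echo $(( $1 + $2 * ($3 - 1) ))
}

# 100 GiB volume, 5 GiB changed per day, 30 daily snapshots retained
incremental_snapshot_gib 100 5 30   # prints 245
```

Thirty full copies of the same volume would be 3000 GiB, so under these assumptions the incremental chain stores roughly a twelfth of that.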
Audit existing snapshots regularly with aws ec2 describe-snapshots and automate rotation.
Cloud Provider Snapshot Characteristics
| Driver | Snapshot Type | Consistency | Cross-Region | Max Snapshots/Volume |
|---|---|---|---|---|
| ebs.csi.aws.com | Incremental S3-backed | Crash-consistent | Manual copy-snapshot | No per-volume limit; 100,000 per Region by default (account quota) |
| pd.csi.storage.gke.io | Incremental (standard) or Regional | Crash-consistent | Standard snapshots are global | 100 per disk |
| disk.csi.azure.com | Incremental by default | Crash-consistent | Manual copy to other region | 500 per disk |
| rbd.csi.ceph.com | COW (Ceph RBD snapshot) | Crash-consistent | Mirror with rbd-mirror | Unlimited (practical: <1000) |
Metrics and Alerting
| Metric | Source | Alert Threshold |
|---|---|---|
| kube_volumesnapshot_info | kube-state-metrics | readyToUse=false for >30m |
| kube_volumesnapshot_status_readytouse | kube-state-metrics | 0 (not ready) for >30m |
| snapshot_controller_operation_total_seconds | snapshot controller | P99 CreateSnapshot > 5m |
| velero_backup_success_total | Velero | No successful backup in 25h (missed daily) |
| velero_backup_failure_total | Velero | > 0 in any window |
Alerting Rules
groups:
- name: volume-snapshots
  rules:
  - alert: VolumeSnapshotNotReady
    expr: |
      kube_volumesnapshot_info{ready_to_use="false"} == 1
    for: 30m
    labels: {severity: warning}
    annotations:
      summary: "VolumeSnapshot {{ $labels.namespace }}/{{ $labels.volumesnapshot }} not ready after 30m"
  - alert: VolumeSnapshotFailed
    # PromQL cannot compare a sample value to a string, so the error text is
    # matched as a label here. The `message` label name is an assumption;
    # adjust it to how your kube-state-metrics configuration exports the error.
    expr: |
      kube_volumesnapshot_info{ready_to_use="false"} == 1
      and on(namespace, volumesnapshot)
      kube_volumesnapshot_status_error_message{message!=""}
    for: 2m
    labels: {severity: critical}
    annotations:
      summary: "VolumeSnapshot creation failed; check external-snapshotter logs"
  - alert: VeleroBackupMissed
    expr: |
      time() - velero_backup_last_successful_timestamp{schedule="production-daily"} > 90000
    labels: {severity: critical}
    annotations:
      summary: "Velero daily backup for production has not completed in >25h"
  - alert: VeleroBackupFailed
    expr: increase(velero_backup_failure_total[1h]) > 0
    labels: {severity: warning}
    annotations:
      summary: "Velero backup failure detected"
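The 90000 in the VeleroBackupMissed rule is 25 hours expressed in seconds (25 × 3600), which leaves a one-hour grace period on a daily schedule. The same comparison can be sketched outside Prometheus, for example in a smoke test; `backup_is_stale` is a hypothetical helper that takes epoch seconds as arguments.

```shell
# Succeed (exit 0) when the last successful backup is older than max_age seconds.
# Args: <now_epoch> <last_success_epoch> <max_age_seconds>
backup_is_stale() {
  [ $(( $1 - $2 )) -gt "$3" ]
}

MAX_AGE=$((25 * 3600))   # 90000 seconds, the threshold used in the alert rule

# Example: last success exactly 26h before "now" is stale
if backup_is_stale 1705400000 $((1705400000 - 26 * 3600)) "$MAX_AGE"; then
  echo "stale"
fi
```

Like the alert expression, the comparison is strict, so a backup that is exactly 25 hours old does not yet count as missed.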
Troubleshooting Runbooks
Runbook: VolumeSnapshot Stuck Not Ready
# 1. Check VS status and events
kubectl describe volumesnapshot <name> -n <ns>
# status.error.message will indicate failure reason
# 2. Check VSC status
VSC=$(kubectl get vs <name> -n <ns> -o jsonpath='{.status.boundVolumeSnapshotContentName}')
kubectl describe volumesnapshotcontent $VSC
# 3. Check external-snapshotter logs (in CSI controller pod)
kubectl logs -n kube-system \
$(kubectl get pod -n kube-system -l app=ebs-csi-controller -o name | head -1) \
-c csi-snapshotter --tail=100
# Common errors:
# "failed to create snapshot: ... ResourceNotFound" → source PVC PV no longer exists
# "context deadline exceeded" → CSI driver unresponsive
# "InvalidSnapshot.InUse" → cloud snapshot in use by AMI (AWS-specific)
Runbook: Restore PVC Stuck Pending
# PVC with dataSource VolumeSnapshot is Pending
kubectl describe pvc <restore-pvc> -n <ns>
# Common causes:
# 1. VolumeSnapshot not readyToUse yet
kubectl get vs <snap-name> -n <ns> -o jsonpath='{.status.readyToUse}'
# → false: wait for snapshot to complete before creating restore PVC
# 2. StorageClass uses different driver than VolumeSnapshotClass
kubectl get storageclass <sc> -o jsonpath='{.provisioner}'
kubectl get volumesnapshotclass <vsc> -o jsonpath='{.driver}'
# → must match
# 3. PVC capacity less than restoreSize
kubectl get vs <snap-name> -n <ns> -o jsonpath='{.status.restoreSize}'
# → increase PVC resources.requests.storage to match
Runbook: Snapshot Deletion Stuck (VSC in Terminating)
# VolumeSnapshotContent stuck in Terminating state
kubectl describe vsc <name>
# Check finalizers:
kubectl get vsc <name> -o jsonpath='{.metadata.finalizers}'
# Common cause: VS was deleted but VSC finalizer not removed
# Check if bound VS still exists
kubectl get vs --all-namespaces | grep <vsc-name>
# If VS is truly gone but VSC is stuck:
kubectl patch vsc <name> --type=json \
-p '[{"op":"remove","path":"/metadata/finalizers"}]'
# WARNING: This may orphan the cloud snapshot — delete it manually in AWS/GCP/Azure
Runbook: Velero Backup Failing
# Check backup details
velero backup describe <backup-name> --details
# Check Velero server logs
kubectl logs -n velero deployment/velero --tail=100 | grep -i error
# Common failures:
# "no matches for kind VolumeSnapshot" → snapshot CRDs not installed; missing --features=EnableCSI
# "backup storage location not available" → S3 bucket permissions or region mismatch
# "timeout waiting for PodVolumeBackup" → Kopia/Restic agent timeout on large volumes;
# increase --pod-volume-operation-timeout flag
# Test backup location connectivity
velero backup-location get
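# PHASE should be Available; the same field can be read from the resource directly
kubectl get backupstoragelocation default -n velero -o jsonpath='{.status.phase}'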
Runbook: Pre-Provisioned Snapshot Not Binding
# VS and VSC created but VS shows "invalid" or "error setting reference"
kubectl describe vs <name> -n <ns>
# Common cause: VSC.volumeSnapshotRef.uid does not match actual VS uid
VS_UID=$(kubectl get vs <name> -n <ns> -o jsonpath='{.metadata.uid}')
kubectl patch vsc <vsc-name> --type=merge \
-p "{\"spec\":{\"volumeSnapshotRef\":{\"uid\":\"$VS_UID\"}}}"
# Also verify VSC driver matches VolumeSnapshotClass driver
kubectl get vsc <name> -o jsonpath='{.spec.driver}'
kubectl get volumesnapshotclass <class-name> -o jsonpath='{.driver}'
Best Practices
- Take a snapshot before every schema migration, data transformation, or major deployment. Incremental cloud snapshots are cheap and usually finish within minutes, while rolling back without one can cost hours of data recovery.
- Use application-consistent snapshots for transactional databases. Crash-consistent snapshots are sufficient for databases with WAL (PostgreSQL, MySQL with InnoDB) — but quiescing removes any uncertainty and avoids long recovery replays.
- Test restores regularly. A snapshot that has never been tested is not a backup. Include a monthly restore drill in your SLA documentation — restore to a staging namespace and verify data integrity.
- Set deletionPolicy: Retain for compliance snapshots. Snapshots created for regulatory requirements should not be deletable by namespace developers. Use Retain and restrict VSC deletion to cluster-admin only.
- Implement snapshot retention with automation. Manual snapshot accumulation is the most common cause of unexpected cloud storage costs. Automate retention (the CronJob pattern above, or Velero TTL) from day one.
- Align VolumeSnapshotClass driver to StorageClass provisioner. Mismatched drivers fail silently until restore time — the worst moment to discover the issue.
- Use VolumeGroupSnapshot for multi-volume databases. A PostgreSQL cluster whose data and WAL volumes are snapshotted at different times has a torn snapshot: the two volumes no longer describe the same point in time. In the best case recovery just replays the extra WAL, but if WAL that the data files depend on is missing or already recycled in the WAL snapshot, recovery can fail. Group snapshots eliminate this window.
- Monitor snapshot readyToUse lag. A snapshot that takes 30+ minutes to become ready indicates a storage backend health issue. Alert on this before it becomes a restore-time surprise.