Volume Snapshots

Complete coverage of the Kubernetes volume snapshot API — the three custom resources, dynamic and pre-provisioned snapshot workflows, restore to new PVC, cross-namespace cloning, VolumeGroupSnapshot, application-consistent snapshots, scheduled snapshot automation, and production backup strategies with Velero.


What This Page Covers

Snapshot API overview — not in core Kubernetes; requires CRD + snapshot controller installation

Three snapshot resources — VolumeSnapshotClass, VolumeSnapshot, VolumeSnapshotContent and their analogies to StorageClass/PVC/PV

VolumeSnapshotClass — driver, deletionPolicy (Delete/Retain), parameters, default annotation

VolumeSnapshot spec — volumeSnapshotClassName, source (PVC or VSC), status fields (readyToUse, restoreSize, creationTime, error)

VolumeSnapshotContent spec — deletionPolicy, driver, volumeSnapshotClassName, source (volumeHandle or snapshotHandle), volumeSnapshotRef

Dynamic snapshot workflow — full lifecycle from VolumeSnapshot creation to readyToUse

Pre-provisioned (static) snapshot — import existing cloud snapshot into Kubernetes

Snapshot controller architecture — common snapshot controller + CSI driver sidecar (external-snapshotter)

Restore workflow — dataSource PVC from VolumeSnapshot; capacity constraints; cross-StorageClass limitations

PVC clone vs snapshot restore — when to use each; cost and speed tradeoffs

Snapshot deletion semantics — Delete vs Retain policy; VSC finalizer; orphaned VSC

Cross-namespace snapshot sharing: dataSourceRef.namespace + ReferenceGrant (alpha feature gate); pre-provisioned VSC alternative

VolumeGroupSnapshot (alpha 1.27) — crash-consistent multi-PVC snapshot; VolumeGroupSnapshotClass/Content; labelSelector grouping

Application-consistent snapshots — quiesce hooks; pre/post snapshot hooks with Velero; filesystem freeze (fsfreeze); database flush patterns (PostgreSQL checkpoint, MySQL FLUSH TABLES)

Scheduled snapshots — CronJob pattern; snapshot retention policy; cleanup automation

Velero backup strategy — installation, backup schedule, PV snapshot + object backup, restore workflow, namespace mapping

Snapshot storage costs — incremental vs full snapshot billing; retention impact

Cloud provider snapshot support — EBS, GCE PD, Azure Disk, Ceph RBD snapshot characteristics

5 metrics + 4 alerting rules + 5 troubleshooting runbooks

8 best practices

Overview

Volume snapshots are not part of the core Kubernetes API. They are delivered as Custom Resource Definitions (CRDs) maintained by the Kubernetes SIG Storage team, alongside a snapshot controller that must be installed separately. Some managed Kubernetes offerings pre-install these (GKE and AKS do in current versions), while others ship the controller as an optional add-on (EKS); self-managed clusters must install them manually.

⚠️

Installation required. Before using snapshots, install: (1) the snapshot CRDs from kubernetes-csi/external-snapshotter, (2) the common snapshot controller Deployment, and (3) the CSI driver's external-snapshotter sidecar. Without the CRDs, the API server rejects VolumeSnapshot objects outright; without the controller and sidecar, the objects are accepted but never acted on.
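
A quick pre-flight check can confirm that the three CRDs are present before any snapshot is requested. This is a minimal sketch, assuming kubectl is already configured for the target cluster; check_snapshot_crds is a hypothetical helper name, not part of any tool.

```shell
# Pre-flight check: report any snapshot CRD that is missing from the cluster.
# check_snapshot_crds is a hypothetical helper name; the CRD names are the
# ones published by kubernetes-csi/external-snapshotter.
check_snapshot_crds() {
  missing=0
  for crd in \
    volumesnapshotclasses.snapshot.storage.k8s.io \
    volumesnapshotcontents.snapshot.storage.k8s.io \
    volumesnapshots.snapshot.storage.k8s.io
  do
    if ! kubectl get crd "$crd" >/dev/null 2>&1; then
      echo "missing CRD: $crd"
      missing=1
    fi
  done
  return "$missing"
}
```

The function returns non-zero when anything is missing, so it can gate a deployment script. It does not verify that the snapshot controller or the sidecar is running.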

Three Resources — The PV/PVC/StorageClass Analogy

Snapshot Resource	Scope	Analogous To	Purpose
`VolumeSnapshotClass`	Cluster	StorageClass	Defines which CSI driver handles snapshots and deletion policy
`VolumeSnapshot`	Namespace	PersistentVolumeClaim	User request for a snapshot of a specific PVC
`VolumeSnapshotContent`	Cluster	PersistentVolume	The actual snapshot on the storage backend; created by controller or pre-provisioned

Snapshot Controller Architecture

┌─────────────────────────────────────────────────────────────────────┐
│  COMMON SNAPSHOT CONTROLLER (Deployment, cluster-wide)              │
│                                                                     │
│  Watches: VolumeSnapshot, VolumeSnapshotContent                     │
│  Responsibilities:                                                  │
│  - Binds VolumeSnapshot ↔ VolumeSnapshotContent (like PV binder)   │
│  - Manages VSC finalizers                                           │
│  - Handles deletion policy enforcement                              │
│  - Does NOT call CSI directly                                       │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ updates VSC status
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  CSI DRIVER: external-snapshotter SIDECAR (in controller Deployment) │
│                                                                     │
│  Watches: VolumeSnapshotContent (content.status.snapshotHandle=="") │
│  Calls:   CSI CreateSnapshot / DeleteSnapshot / ListSnapshots       │
│  Updates: VSC.status.snapshotHandle, VSC.status.readyToUse         │
└─────────────────────────────────────────────────────────────────────┘

Dynamic snapshot creation flow:
  User creates VolumeSnapshot (ns: production, source: data-postgres-0)
    │
    ▼
  Common controller creates VolumeSnapshotContent (cluster-scoped)
  with snapshotHandle="" (pending)
    │
    ▼
  external-snapshotter calls CSI CreateSnapshot(sourceVolumeId, parameters)
    │ CSI driver calls cloud API (e.g., AWS ec2:CreateSnapshot)
    ▼
  VSC.status.snapshotHandle = "snap-0abc123"
  VSC.status.readyToUse = true
    │
    ▼
  VS.status.readyToUse = true
  VS.status.restoreSize = 100Gi

VolumeSnapshotClass

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: ebs.csi.aws.com           # MUST match the CSI driver name
deletionPolicy: Delete            # Delete | Retain
parameters:
  # Driver-specific snapshot parameters
  tagSpecification_1: "environment=production"
  tagSpecification_2: "managed-by=kubernetes"
  # For Ceph RBD: csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner

deletionPolicy

Policy	When VolumeSnapshot Deleted	When to Use
`Delete`	VolumeSnapshotContent and the backing cloud snapshot are both deleted	Default for most use cases; ensures no orphaned cloud snapshots
`Retain`	VolumeSnapshotContent and backing snapshot are preserved; admin must clean up	Compliance/audit requirements; long-lived snapshots that outlive the K8s object

⚠️

Default VolumeSnapshotClass per driver. As with StorageClasses, designate at most one default VolumeSnapshotClass per CSI driver. If your cluster has both EBS and Ceph, define one default for each driver. If several classes for the same driver are marked default, a VolumeSnapshot created without an explicit class fails.
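
Conflicting defaults are easy to spot with a small filter. The sketch below assumes its input is one line per class in the form "name driver is-default"; find_duplicate_default_drivers is a hypothetical helper name.

```shell
# Reads lines of "<class-name> <driver> <is-default>" on stdin and prints each
# driver that has more than one class marked as default.
# find_duplicate_default_drivers is a hypothetical helper name.
#
# Possible input source (annotation dots escaped for kubectl jsonpath):
#   kubectl get volumesnapshotclass -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.driver}{" "}{.metadata.annotations.snapshot\.storage\.kubernetes\.io/is-default-class}{"\n"}{end}' \
#     | find_duplicate_default_drivers
find_duplicate_default_drivers() {
  awk '$3 == "true" { count[$2]++ }
       END { for (d in count) if (count[d] > 1) print d }' | sort
}
```

Empty output means every driver has at most one default class.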

VolumeSnapshot

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-2024-01-15
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-vsc    # which VolumeSnapshotClass to use
  source:
    persistentVolumeClaimName: data-postgres-0   # dynamic: snapshot this PVC
    # OR for pre-provisioned (static):
    # volumeSnapshotContentName: existing-vsc-name

VolumeSnapshot Status Fields

kubectl get volumesnapshot postgres-snap-2024-01-15 -n production -o yaml

status:
  boundVolumeSnapshotContentName: snapcontent-abc-123-def   # bound VSC name
  creationTime: "2024-01-15T10:30:00Z"
  readyToUse: true              # true when snapshot is usable for restore
  restoreSize: 100Gi            # minimum PVC size for restore
  error: null                   # populated if snapshot creation failed
    # message: "..."
    # time: "..."

ℹ️

readyToUse and cloud snapshot availability. For EBS snapshots, readyToUse: true means the snapshot is in the completed state in AWS and can be used for restore. However, a volume created from an EBS snapshot loads its blocks lazily from S3, so first reads on the restored volume can be slow until the data has been pulled down. Use Fast Snapshot Restore (FSR) on EBS for latency-sensitive restore paths.

VolumeSnapshotContent

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: snapcontent-abc-123-def
  finalizers:
    - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
spec:
  deletionPolicy: Delete
  driver: ebs.csi.aws.com
  volumeSnapshotClassName: csi-aws-vsc
  source:
    volumeHandle: vol-0abc123def456789    # for dynamic (set by controller)
    # OR for pre-provisioned:
    # snapshotHandle: snap-0existingsnap  # existing cloud snapshot ID
  volumeSnapshotRef:                       # binding to the VolumeSnapshot
    name: postgres-snap-2024-01-15
    namespace: production
    uid: abc-def-123
status:
  snapshotHandle: snap-0new123def456      # cloud snapshot ID (set after creation)
  readyToUse: true
  restoreSize: 107374182400               # bytes

Dynamic Snapshot Workflow

1. Create a VolumeSnapshot referencing the source PVC and a VolumeSnapshotClass
2. Common snapshot controller sees the unbound VolumeSnapshot → creates a VolumeSnapshotContent with an empty snapshotHandle
3. external-snapshotter sidecar sees the VSC with an empty handle → calls CSI CreateSnapshot(sourceVolumeId=vol-0abc123, parameters=...)
4. CSI driver calls the cloud API (e.g., ec2:CreateSnapshot)
5. external-snapshotter polls until the snapshot is ready → updates VSC.status.snapshotHandle and readyToUse=true
6. Common controller copies the readyToUse status to the VolumeSnapshot
7. VolumeSnapshot.status.readyToUse = true: the snapshot is available for restore

# Watch snapshot progress
kubectl get volumesnapshot postgres-snap-2024-01-15 -n production -w
# NAME                       READYTOUSE  SOURCEPVC          RESTORESIZE  SNAPSHOTCONTENT    AGE
# postgres-snap-2024-01-15   false       data-postgres-0          snapcontent-abc    5s
# postgres-snap-2024-01-15   true        data-postgres-0    100Gi        snapcontent-abc    45s
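
In scripts, blocking until the snapshot is ready is usually more convenient than watching. A minimal sketch, assuming kubectl 1.23 or newer (for --for=jsonpath); wait_for_snapshot and the 300s default timeout are illustrative choices, not part of the snapshot API.

```shell
# Block until a VolumeSnapshot reports readyToUse=true or the timeout expires.
# wait_for_snapshot is a hypothetical helper; the 300s default is an assumption.
wait_for_snapshot() {
  name="$1"
  namespace="$2"
  timeout="${3:-300s}"
  kubectl wait "volumesnapshot/$name" -n "$namespace" \
    --for=jsonpath='{.status.readyToUse}'=true \
    --timeout="$timeout"
}
```

Example: wait_for_snapshot postgres-snap-2024-01-15 production 10m. The exit status is that of kubectl wait, so a timeout surfaces as a non-zero return.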

Pre-Provisioned (Static) Snapshots

Import an existing cloud snapshot into Kubernetes without creating a new one. Useful for disaster recovery scenarios where snapshots were created outside Kubernetes (e.g., AWS Backup, cloud-scheduled snapshots).

# Step 1: Create VolumeSnapshotContent pointing to existing cloud snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: imported-prod-snap
spec:
  deletionPolicy: Retain          # Retain: don't delete cloud snapshot when VSC is deleted
  driver: ebs.csi.aws.com
  volumeSnapshotClassName: csi-aws-vsc
  source:
    snapshotHandle: snap-0existing123abc    # existing AWS snapshot ID
  volumeSnapshotRef:
    name: imported-snap            # VolumeSnapshot to bind to
    namespace: production
    # uid is left unset; the snapshot controller fills it in when it binds the pair
---
# Step 2: Create VolumeSnapshot referencing the pre-provisioned VSC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: imported-snap
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    volumeSnapshotContentName: imported-prod-snap   # bind to existing VSC

⚠️

No UID needed. For a pre-provisioned VolumeSnapshotContent, set only name and namespace in volumeSnapshotRef and leave uid unset. The snapshot controller matches the pair by name and namespace and records the VolumeSnapshot's UID itself. If a VSC carries a stale uid (for example, one copied from an exported manifest), binding fails until it is corrected: kubectl patch vsc imported-prod-snap --type=merge -p '{"spec":{"volumeSnapshotRef":{"uid":"<vs-uid>"}}}'

Restore to New PVC

Restore a VolumeSnapshot to a new PVC using dataSource. The restored PVC is a new independent volume pre-populated with the snapshot's data.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restored
  namespace: production
spec:
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: postgres-snap-2024-01-15     # VolumeSnapshot in same namespace
  accessModes: [ReadWriteOnce]
  storageClassName: gp3-encrypted      # can be different from source SC
                                       # but must use the same CSI driver
  resources:
    requests:
      storage: 100Gi                   # must be ≥ snapshot's restoreSize

Restore Constraints

The VolumeSnapshot must be in the same namespace as the target PVC (unless using cross-namespace via dataSourceRef)
The StorageClass used for restore must use the same CSI driver as the VolumeSnapshotClass
Target PVC capacity must be ≥ restoreSize from the snapshot status
The snapshot must be readyToUse: true for provisioning to succeed; a restore PVC created earlier stays Pending until then
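
The capacity constraint is easiest to satisfy by deriving the PVC request from the snapshot itself. VolumeSnapshotContent reports restoreSize in bytes, so a small conversion helps; the sketch below is illustrative, and restore_size_gib is a hypothetical helper name.

```shell
# Convert a restoreSize in bytes to whole GiB, rounding up so the requested
# PVC size is never smaller than the snapshot. Helper name is hypothetical.
#
# Possible input source:
#   kubectl get volumesnapshotcontent <name> -o jsonpath='{.status.restoreSize}'
restore_size_gib() {
  bytes="$1"
  gib=1073741824   # 1 GiB in bytes
  echo $(( (bytes + gib - 1) / gib ))
}
```

For the example VSC above, restore_size_gib 107374182400 prints 100, which matches the 100Gi request in the restore PVC.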

PVC Clone vs Snapshot Restore

Aspect	PVC Clone (dataSource: PVC)	Snapshot Restore (dataSource: VolumeSnapshot)
Source must be Bound	Yes	No — snapshot is independent
Point-in-time	Best-effort (source still changing)	Yes — frozen moment in time
Cross-namespace	No (same namespace required)	Via dataSourceRef.namespace + ReferenceGrant (alpha, 1.26+)
Cloud cost	Full volume copy (same price as new volume)	Incremental from last snapshot (much cheaper over time)
Provisioning time	Fast (COW at cloud layer)	Can be slow (data restored from S3/object store)
Source StorageClass	Must match	Can differ (same driver, different parameters)

Snapshot Deletion Semantics

Snapshot objects carry protective finalizers that control deletion order:

snapshot.storage.kubernetes.io/volumesnapshot-bound-protection: on VolumeSnapshot; prevents VS deletion while it is bound to a VSC
snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection: on VolumeSnapshot; prevents VS deletion while a PVC is being restored from it
snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection: on VSC; prevents VSC deletion while a VS is bound

Deletion flow (deletionPolicy: Delete):

kubectl delete volumesnapshot postgres-snap-2024-01-15
  │
  ▼
VS deletion timestamp set; VS moves to "Deleting"
  │ (common controller sees VS being deleted)
  ▼
Common controller removes bound-protection finalizer from VS
VS is deleted from API server
  │
  ▼
Common controller issues a delete for the bound VSC (deletionPolicy: Delete)
external-snapshotter calls CSI DeleteSnapshot(snapshotHandle=snap-0abc)
  │
  ▼
Cloud snapshot deleted
VSC deleted from API server

Deletion flow (deletionPolicy: Retain):

kubectl delete volumesnapshot postgres-snap-2024-01-15
  → VS deleted
  → VSC remains, cloud snapshot remains
  → Admin must manually: kubectl delete vsc snapcontent-abc-123
  → external-snapshotter does NOT call DeleteSnapshot (Retain policy)
  → Cloud snapshot must be deleted manually in AWS/GCP/Azure console
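
With Retain, leftover VolumeSnapshotContent objects accumulate silently. The sketch below lists VSCs whose referenced VolumeSnapshot can no longer be fetched, assuming kubectl access to both resource types; list_orphaned_vscs is a hypothetical helper name. Any failed lookup (including a transient API error) is reported as orphaned, so treat the output as candidates to review rather than a delete list.

```shell
# Print the name of every VolumeSnapshotContent whose referenced
# VolumeSnapshot cannot be fetched. list_orphaned_vscs is hypothetical.
list_orphaned_vscs() {
  kubectl get volumesnapshotcontent \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.volumeSnapshotRef.namespace}{" "}{.spec.volumeSnapshotRef.name}{"\n"}{end}' |
  while read -r vsc ns vs; do
    [ -n "$vsc" ] || continue
    # </dev/null keeps the lookup from consuming the loop's input
    if ! kubectl get volumesnapshot "$vs" -n "$ns" >/dev/null 2>&1 </dev/null; then
      echo "$vsc"
    fi
  done
}
```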

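Cross-Namespace Snapshot Sharing (Alpha, 1.26+)

By default, a VolumeSnapshot can only be used as a restore source within its own namespace. Kubernetes 1.26 added an alpha mechanism for cross-namespace restore: the PVC sets dataSourceRef.namespace, and the owner of the source namespace authorizes the reference with a ReferenceGrant. It requires the CrossNamespaceVolumeDataSource feature gate on kube-apiserver and kube-controller-manager, the same gate on the CSI external-provisioner sidecar, and the ReferenceGrant CRD from the Gateway API project.

# In the SOURCE namespace: allow PVCs in target-ns to reference this snapshot
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-target-ns-restore
  namespace: production
spec:
  from:
  - group: ""
    kind: PersistentVolumeClaim
    namespace: target-ns
  to:
  - group: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: postgres-snap-2024-01-15
---
# In the TARGET namespace: PVC referencing the snapshot across namespaces
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-from-other-ns
  namespace: target-ns
spec:
  dataSourceRef:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: postgres-snap-2024-01-15
    namespace: production             # cross-namespace reference (alpha field)
  accessModes: [ReadWriteOnce]
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 100Gi

ℹ️

Security model for cross-namespace. The ReferenceGrant is the authorization gate. It lives in the source namespace, so the snapshot's owner must explicitly consent before another namespace can restore from it, which prevents workloads from restoring arbitrary snapshots. On clusters without the alpha gate, the stable alternative is to create a pre-provisioned VolumeSnapshotContent (deletionPolicy: Retain) that points at the same snapshotHandle and bind it to a new VolumeSnapshot in the target namespace.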

VolumeGroupSnapshot (Alpha, 1.27+)

VolumeGroupSnapshot takes crash-consistent snapshots of multiple PVCs simultaneously — all in a single atomic operation at the storage backend. This is critical for databases that span multiple volumes (e.g., separate data and WAL volumes for PostgreSQL, or a Kafka broker with multiple partition volumes).

# Install group snapshot CRDs (separate from regular snapshot CRDs)
# kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/groupsnapshot.storage.k8s.io_volumegroupsnapshotclasses.yaml
# ... and VolumeGroupSnapshotContent / VolumeGroupSnapshot CRDs

apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshotClass
metadata:
  name: csi-aws-group-vsc
driver: ebs.csi.aws.com
deletionPolicy: Delete
---
apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshot
metadata:
  name: postgres-group-snap
  namespace: production
spec:
  volumeGroupSnapshotClassName: csi-aws-group-vsc
  source:
    selector:
      matchLabels:
        app: postgres              # snapshot ALL PVCs with this label simultaneously
        component: storage

The CSI driver must implement CreateVolumeGroupSnapshot RPC. The driver issues all individual snapshots in a single API call to the storage backend — guaranteeing consistency across volumes (no torn writes between data and WAL).

Application-Consistent Snapshots

Storage-level snapshots are crash-consistent (like pulling the power plug) — they capture the disk state atomically, but in-flight writes may be incomplete. For most databases this is acceptable with WAL replay on restart. For strict consistency, the application must be quiesced first.

Filesystem Freeze

# Quiesce filesystem before snapshot, then unfreeze
# fsfreeze needs CAP_SYS_ADMIN in the container; database writes block while frozen
kubectl exec postgres-0 -n production -- bash -c "
  # Flush and freeze filesystem
  psql -c 'CHECKPOINT;'        # PostgreSQL: flush dirty buffers
  fsfreeze --freeze /var/lib/postgresql/data

  # Signal readiness (e.g., write a file that triggers snapshot via hook)
  echo 'frozen' > /tmp/snapshot-ready
"

# After the snapshot has been cut, unfreeze promptly:
kubectl exec postgres-0 -n production -- fsfreeze --unfreeze /var/lib/postgresql/data
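
A freeze that is never released blocks all writes, so the unfreeze step should run even when the snapshot command fails. A minimal sketch, assuming kubectl access and an fsfreeze binary in the pod's container; run_with_frozen_fs is a hypothetical helper name, and the wrapped command is expected to return only after the storage backend has actually taken the snapshot.

```shell
# Freeze a filesystem inside a pod, run a command, and always attempt the
# unfreeze afterwards. run_with_frozen_fs is a hypothetical helper name.
# Usage: run_with_frozen_fs <pod> <namespace> <mountpoint> <command...>
run_with_frozen_fs() {
  pod="$1"
  namespace="$2"
  mountpoint="$3"
  shift 3
  kubectl exec "$pod" -n "$namespace" -- fsfreeze --freeze "$mountpoint" || return 1
  # Capture the command's status without aborting, so the unfreeze still runs
  "$@" && status=0 || status=$?
  kubectl exec "$pod" -n "$namespace" -- fsfreeze --unfreeze "$mountpoint" || return 1
  return "$status"
}
```

The function returns the wrapped command's exit status, or 1 if either freeze step fails. Which signal indicates that the snapshot has been cut depends on the driver, so the wrapped command is left to the caller.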

Database Quiesce Patterns

Database	Quiesce Command	Unquiesce
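PostgreSQL	`SELECT pg_backup_start('snap');` (`pg_start_backup` before PostgreSQL 15); `CHECKPOINT;` alone only flushes dirty buffers	`SELECT pg_backup_stop();` (`pg_stop_backup` before PostgreSQL 15)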
MySQL / MariaDB	`FLUSH TABLES WITH READ LOCK;`	`UNLOCK TABLES;`
MongoDB	`db.fsyncLock()`	`db.fsyncUnlock()`
General (Linux)	`fsfreeze --freeze <mountpoint>`	`fsfreeze --unfreeze <mountpoint>`

Velero Pre/Post Snapshot Hooks

Velero supports executing commands in pod containers before and after taking volume snapshots, enabling application-consistent backups without custom tooling:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: postgres-consistent-backup
  namespace: velero
spec:
  includedNamespaces: [production]
  labelSelector:
    matchLabels:
      app: postgres
  snapshotVolumes: true
  hooks:
    resources:
    - name: postgres-hooks
      includedNamespaces: [production]
      labelSelector:
        matchLabels:
          app: postgres
      pre:
      - exec:
          container: postgres
          command:
            - /bin/bash
            - -c
            - "psql -c 'CHECKPOINT;' && fsfreeze --freeze /var/lib/postgresql/data"
          timeout: 30s
          onError: Fail
      post:
      - exec:
          container: postgres
          command:
            - /bin/bash
            - -c
            - "fsfreeze --unfreeze /var/lib/postgresql/data"
          timeout: 10s
          onError: Continue    # always unfreeze even if backup had issues

Scheduled Snapshots

Kubernetes has no built-in snapshot scheduling — implement it with a CronJob or a dedicated snapshot controller (Kasten K10, Velero schedules, or cloud-native solutions).

CronJob Snapshot Pattern

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-snapshot-daily
  namespace: production
spec:
  schedule: "0 2 * * *"          # daily at 2 AM UTC
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-creator
          restartPolicy: OnFailure
          containers:
          - name: snapshot
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              DATE=$(date +%Y-%m-%d-%H%M)
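              cat <<EOF | kubectl create -f -
              apiVersion: snapshot.storage.k8s.io/v1
              kind: VolumeSnapshot
              metadata:
                name: postgres-snap-${DATE}
                namespace: production
                labels:
                  created-by: postgres-snapshot-daily   # assumed label, used for cleanup
              spec:
                volumeSnapshotClassName: csi-aws-vsc
                source:
                  persistentVolumeClaimName: data-postgres-0
              EOF
              # Retention: keep the newest 7 snapshots (assumed count), delete older ones
              kubectl get volumesnapshot -n production -l created-by=postgres-snapshot-daily \
                --sort-by=.metadata.creationTimestamp -o name | head -n -7 |
                while read -r snap; do kubectl delete -n production "$snap"; done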
---
# RBAC for snapshot-creator ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: snapshot-creator
  namespace: production
rules:
- apiGroups: [snapshot.storage.k8s.io]
  resources: [volumesnapshots]
  verbs: [get, list, create, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: snapshot-creator
  namespace: production
subjects:
- kind: ServiceAccount
  name: snapshot-creator
  namespace: production
roleRef:
  kind: Role
  name: snapshot-creator
  apiGroup: rbac.authorization.k8s.io
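
The retention half of the pattern comes down to one selection step: given snapshots sorted oldest first, pick everything outside the newest N. A minimal sketch of that step; select_expired_snapshots is a hypothetical helper name, and the retention count is whatever your policy dictates.

```shell
# Reads snapshot names on stdin, oldest first, and prints those that fall
# outside the newest KEEP entries. select_expired_snapshots is hypothetical.
#
# Possible wiring:
#   kubectl get volumesnapshot -n production \
#     --sort-by=.metadata.creationTimestamp -o name |
#     select_expired_snapshots 7 |
#     while read -r snap; do kubectl delete -n production "$snap"; done
select_expired_snapshots() {
  keep="$1"
  awk -v keep="$keep" '
    { names[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print names[i] }'
}
```

Keeping the selection separate from the delete makes it possible to dry-run the policy by printing the names first.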


Velero Backup Strategy

Velero provides a complete backup solution for Kubernetes — it backs up both object metadata (Deployments, Services, ConfigMaps, Secrets) and PV data (via CSI snapshots or Restic/Kopia file-level backup).

Installation

# Install Velero with AWS EBS CSI snapshot support
# --features=EnableCSI enables CSI snapshot integration
# --secret-file points at AWS credentials with s3 + ec2 permissions
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --features=EnableCSI \
  --use-volume-snapshots=true \
  --secret-file ./credentials-velero

Backup Schedule

# Create a scheduled backup for production namespace
velero schedule create production-daily \
  --schedule="0 3 * * *" \
  --include-namespaces production \
  --snapshot-volumes \
  --ttl 720h0m0s                # retain backups for 30 days

# List schedules and their status
velero schedule get

# Create an on-demand backup
velero backup create prod-backup-manual \
  --include-namespaces production \
  --snapshot-volumes \
  --wait

Restore Workflow

# List available backups
velero backup get

# Restore entire namespace
velero restore create --from-backup prod-backup-manual \
  --include-namespaces production \
  --wait

# Restore to a different namespace
velero restore create --from-backup prod-backup-manual \
  --namespace-mappings production:production-restore \
  --wait

# Restore specific resources only
velero restore create --from-backup prod-backup-manual \
  --include-resources persistentvolumeclaims,pods \
  --selector app=postgres \
  --wait

# Check restore status
velero restore describe <restore-name>
velero restore logs <restore-name>

File-Level Backup with Kopia (No CSI Snapshot)

# For volumes without CSI snapshot support, use Kopia (built-in to Velero 1.10+)
# --uploader-type accepts kopia or restic (legacy)
# --use-node-agent deploys the node-agent DaemonSet
velero install \
  --uploader-type kopia \
  --use-node-agent \
  ...

# Annotate pods to include specific volumes in file-level backup
kubectl annotate pod postgres-0 -n production \
  backup.velero.io/backup-volumes=data,wal   # comma-separated volume names


Snapshot Storage Costs

Understanding snapshot cost models prevents surprise bills:


Cloud	Snapshot Cost Model	Key Insight
AWS EBS	Incremental: only blocks changed since the last snapshot are stored in S3	First snapshot = full volume; subsequent snapshots = delta. Cost grows with churn rate, not volume size.
GCE PD	Incremental (similar to EBS)	Snapshots are global resources; price depends on snapshot type and storage location (regional vs multi-regional).
Azure Disk	Incremental by default (Standard HDD billing for snapshot data)	Full snapshots available but rarely needed; incremental is the default since 2020.
Ceph RBD	COW (copy-on-write): snapshot is just a reference; new writes stored separately	Snapshots "free" initially but flattening large old snapshots is expensive I/O.



  💡
  Snapshot retention policy. Retention is mainly a cost and hygiene concern, and the mechanics differ by backend. EBS snapshots are incremental, but each one references every block needed for a full restore, so a restore does not replay a chain of deltas and older snapshots can be deleted independently (only blocks no longer referenced by any snapshot are freed). On copy-on-write backends such as Ceph RBD, deep snapshot and clone chains can add I/O overhead until images are flattened. List existing snapshots with aws ec2 describe-snapshots and automate rotation.



Cloud Provider Snapshot Characteristics


Driver	Snapshot Type	Consistency	Cross-Region	Snapshot Limit
ebs.csi.aws.com	Incremental S3-backed	Crash-consistent	Manual copy-snapshot	100,000 per Region (default quota, not per volume)
pd.csi.storage.gke.io	Incremental (standard) or Regional	Crash-consistent	Standard snapshots are global	100 per disk
disk.csi.azure.com	Incremental by default	Crash-consistent	Manual copy to other region	500 per disk
rbd.csi.ceph.com	COW (Ceph RBD snapshot)	Crash-consistent	Mirror with rbd-mirror	Unlimited (practical: <1000)



Metrics and Alerting


Metric	Source	Alert Threshold
`kube_volumesnapshot_info`	kube-state-metrics (requires custom resource state configuration)	readyToUse=false for >30m
`kube_volumesnapshot_status_readytouse`	kube-state-metrics (requires custom resource state configuration)	0 (not ready) for >30m
`snapshot_controller_operation_total_seconds`	snapshot controller	P99 CreateSnapshot > 5m
`velero_backup_success_total`	Velero	No successful backup in 25h (missed daily)
`velero_backup_failure_total`	Velero	> 0 in any window


Alerting Rules
groups:
- name: volume-snapshots
  rules:
  - alert: VolumeSnapshotNotReady
    expr: |
      kube_volumesnapshot_info{ready_to_use="false"} == 1
    for: 30m
    labels: {severity: warning}
    annotations:
      summary: "VolumeSnapshot {{ $labels.namespace }}/{{ $labels.volumesnapshot }} not ready after 30m"

  - alert: VolumeSnapshotFailed
    expr: |
      # Assumes an error metric exposed through your kube-state-metrics custom resource config
      kube_volumesnapshot_info{ready_to_use="false"} == 1
        and on(namespace, volumesnapshot)
      kube_volumesnapshot_status_error_message
    for: 2m
    labels: {severity: critical}
    annotations:
      summary: "VolumeSnapshot creation failed — check external-snapshotter logs"

  - alert: VeleroBackupMissed
    expr: |
      time() - velero_backup_last_successful_timestamp{schedule="production-daily"} > 90000
    labels: {severity: critical}
    annotations:
      summary: "Velero daily backup for production has not completed in >25h"

  - alert: VeleroBackupFailed
    expr: increase(velero_backup_failure_total[1h]) > 0
    labels: {severity: warning}
    annotations:
      summary: "Velero backup failure detected"
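
The VeleroBackupMissed threshold is plain arithmetic: 25 hours is 25 × 3600 = 90000 seconds, and the alert fires once the age of the last success exceeds it. The same check as a shell sketch, useful in ad hoc scripts; backup_is_stale is a hypothetical helper name and both timestamps are Unix epoch seconds.

```shell
# Succeeds (exit 0) when the last successful backup is older than max_age.
# backup_is_stale is a hypothetical helper; timestamps are epoch seconds.
# Usage: backup_is_stale <last_success_ts> <now_ts> [max_age_seconds]
backup_is_stale() {
  last_success="$1"
  now="$2"
  max_age="${3:-90000}"   # 25h in seconds, matching the alert threshold
  [ $(( now - last_success )) -gt "$max_age" ]
}
```

Example: backup_is_stale "$last_ts" "$(date +%s)" && echo "backup overdue", where last_ts is read from wherever you record the last successful run.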


Troubleshooting Runbooks

Runbook: VolumeSnapshot Stuck Not Ready
# 1. Check VS status and events
kubectl describe volumesnapshot <name> -n <ns>
# status.error.message will indicate failure reason

# 2. Check VSC status
VSC=$(kubectl get vs <name> -n <ns> -o jsonpath='{.status.boundVolumeSnapshotContentName}')
kubectl describe volumesnapshotcontent $VSC

# 3. Check external-snapshotter logs (in CSI controller pod)
kubectl logs -n kube-system \
  $(kubectl get pod -n kube-system -l app=ebs-csi-controller -o name | head -1) \
  -c csi-snapshotter --tail=100

# Common errors:
# "failed to create snapshot: ... ResourceNotFound" → source PVC PV no longer exists
# "context deadline exceeded" → CSI driver unresponsive
# "InvalidSnapshot.InUse" → cloud snapshot in use by AMI (AWS-specific)

Runbook: Restore PVC Stuck Pending
# PVC with dataSource VolumeSnapshot is Pending
kubectl describe pvc <restore-pvc> -n <ns>
# Common causes:

# 1. VolumeSnapshot not readyToUse yet
kubectl get vs <snap-name> -n <ns> -o jsonpath='{.status.readyToUse}'
# → false: wait for snapshot to complete before creating restore PVC

# 2. StorageClass uses different driver than VolumeSnapshotClass
kubectl get storageclass <sc> -o jsonpath='{.provisioner}'
kubectl get volumesnapshotclass <vsc> -o jsonpath='{.driver}'
# → must match

# 3. PVC capacity less than restoreSize
kubectl get vs <snap-name> -n <ns> -o jsonpath='{.status.restoreSize}'
# → increase PVC resources.requests.storage to match

Runbook: Snapshot Deletion Stuck (VSC in Terminating)
# VolumeSnapshotContent stuck in Terminating state
kubectl describe vsc <name>
# Check finalizers:
kubectl get vsc <name> -o jsonpath='{.metadata.finalizers}'

# Common cause: VS was deleted but VSC finalizer not removed
# Check if bound VS still exists
kubectl get vs --all-namespaces | grep <vsc-name>

# If VS is truly gone but VSC is stuck:
kubectl patch vsc <name> --type=json \
  -p '[{"op":"remove","path":"/metadata/finalizers"}]'
# WARNING: This may orphan the cloud snapshot — delete it manually in AWS/GCP/Azure

Runbook: Velero Backup Failing
# Check backup details
velero backup describe <backup-name> --details

# Check Velero server logs
kubectl logs -n velero deployment/velero --tail=100 | grep -i error

# Common failures:
# "no matches for kind VolumeSnapshot" → snapshot CRDs not installed; missing --features=EnableCSI
# "backup storage location not available" → S3 bucket permissions or region mismatch
# "timeout waiting for PodVolumeBackup" → Kopia/Restic agent timeout on large volumes;
#   increase --pod-volume-operation-timeout flag

# Test backup location connectivity (PHASE column should read Available)
velero backup-location get

Runbook: Pre-Provisioned Snapshot Not Binding
# VS and VSC created but VS shows "invalid" or "error setting reference"
kubectl describe vs <name> -n <ns>

# Common cause: VSC.volumeSnapshotRef.uid does not match actual VS uid
VS_UID=$(kubectl get vs <name> -n <ns> -o jsonpath='{.metadata.uid}')
kubectl patch vsc <vsc-name> --type=merge \
  -p "{\"spec\":{\"volumeSnapshotRef\":{\"uid\":\"$VS_UID\"}}}"

# Also verify VSC driver matches VolumeSnapshotClass driver
kubectl get vsc <name> -o jsonpath='{.spec.driver}'
kubectl get volumesnapshotclass <class-name> -o jsonpath='{.driver}'


Best Practices


  Take a snapshot before every schema migration, data transformation, or major deployment. Incremental cloud snapshots fix their point in time as soon as they are initiated and are usually cheap, although time to completion grows with the amount of changed data. The worst-case rollback cost of not having one is measured in hours of data recovery.
  Use application-consistent snapshots for transactional databases. Crash-consistent snapshots are sufficient for databases with WAL (PostgreSQL, MySQL with InnoDB) — but quiescing removes any uncertainty and avoids long recovery replays.
  Test restores regularly. A snapshot that has never been tested is not a backup. Include a monthly restore drill in your SLA documentation — restore to a staging namespace and verify data integrity.
  Set deletionPolicy: Retain for compliance snapshots. Snapshots created for regulatory requirements should not be deletable by namespace developers. Use Retain and restrict VSC deletion to cluster-admin only.
  Implement snapshot retention with automation. Manual snapshot accumulation is the most common cause of unexpected cloud storage costs. Automate retention (the CronJob pattern above, or Velero TTL) from day one.
  Align VolumeSnapshotClass driver to StorageClass provisioner. Mismatched drivers fail silently until restore time — the worst moment to discover the issue.
  Use VolumeGroupSnapshot for multi-volume databases. When a PostgreSQL cluster's data and WAL volumes are snapshotted at different times, the result is a torn snapshot: the two volumes reflect different moments, and crash recovery is only trustworthy when they were captured simultaneously. Group snapshots eliminate this window.
  Monitor snapshot readyToUse lag. A snapshot that takes 30+ minutes to become ready indicates a storage backend health issue. Alert on this before it becomes a restore-time surprise.




  
