Volume Snapshots

Complete coverage of the Kubernetes volume snapshot API — the three custom resources, dynamic and pre-provisioned snapshot workflows, restore to new PVC, cross-namespace cloning, VolumeGroupSnapshot, application-consistent snapshots, scheduled snapshot automation, and production backup strategies with Velero.

What This Page Covers
  • Snapshot API overview — not in core Kubernetes; requires CRD + snapshot controller installation
  • Three snapshot resources — VolumeSnapshotClass, VolumeSnapshot, VolumeSnapshotContent and their analogies to StorageClass/PVC/PV
  • VolumeSnapshotClass — driver, deletionPolicy (Delete/Retain), parameters, default annotation
  • VolumeSnapshot spec — volumeSnapshotClassName, source (PVC or VSC), status fields (readyToUse, restoreSize, creationTime, error)
  • VolumeSnapshotContent spec — deletionPolicy, driver, volumeSnapshotClassName, source (volumeHandle or snapshotHandle), volumeSnapshotRef
  • Dynamic snapshot workflow — full lifecycle from VolumeSnapshot creation to readyToUse
  • Pre-provisioned (static) snapshot — import existing cloud snapshot into Kubernetes
  • Snapshot controller architecture — common snapshot controller + CSI driver sidecar (external-snapshotter)
  • Restore workflow — dataSource PVC from VolumeSnapshot; capacity constraints; cross-StorageClass limitations
  • PVC clone vs snapshot restore — when to use each; cost and speed tradeoffs
  • Snapshot deletion semantics — Delete vs Retain policy; VSC finalizer; orphaned VSC
  • Cross-namespace snapshot sharing: dataSourceRef with a namespace field; ReferenceGrant authorization; alpha feature gate
  • VolumeGroupSnapshot (alpha 1.27) — crash-consistent multi-PVC snapshot; VolumeGroupSnapshotClass/Content; labelSelector grouping
  • Application-consistent snapshots — quiesce hooks; pre/post snapshot hooks with Velero; filesystem freeze (fsfreeze); database flush patterns (PostgreSQL checkpoint, MySQL FLUSH TABLES)
  • Scheduled snapshots — CronJob pattern; snapshot retention policy; cleanup automation
  • Velero backup strategy — installation, backup schedule, PV snapshot + object backup, restore workflow, namespace mapping
  • Snapshot storage costs — incremental vs full snapshot billing; retention impact
  • Cloud provider snapshot support — EBS, GCE PD, Azure Disk, Ceph RBD snapshot characteristics
  • 5 metrics + 4 alerting rules + 5 troubleshooting runbooks
  • 8 best practices

Overview

    Volume snapshots are not part of the core Kubernetes API. They are delivered as Custom Resource Definitions (CRDs) maintained by Kubernetes SIG Storage, alongside a snapshot controller that must be installed separately. Some managed Kubernetes offerings pre-install these components or provide them as an add-on (check your provider's documentation for EKS, GKE, or AKS); self-managed clusters must install them manually.

    ⚠️ Installation required: Before using snapshots, install (1) the snapshot CRDs from kubernetes-csi/external-snapshotter, (2) the common snapshot controller Deployment, and (3) the CSI driver's external-snapshotter sidecar. Without the CRDs, VolumeSnapshot objects cannot be created at all; without the controller or the sidecar, they are created but never acted on.

    Three Resources — The PV/PVC/StorageClass Analogy

    Snapshot Resource     | Scope     | Analogous To          | Purpose
    ----------------------|-----------|-----------------------|--------
    VolumeSnapshotClass   | Cluster   | StorageClass          | Defines which CSI driver handles snapshots and the deletion policy
    VolumeSnapshot        | Namespace | PersistentVolumeClaim | User request for a snapshot of a specific PVC
    VolumeSnapshotContent | Cluster   | PersistentVolume      | The actual snapshot on the storage backend; created by the controller or pre-provisioned

    Snapshot Controller Architecture

    ┌─────────────────────────────────────────────────────────────────────┐
    │  COMMON SNAPSHOT CONTROLLER (Deployment, cluster-wide)              │
    │                                                                     │
    │  Watches: VolumeSnapshot, VolumeSnapshotContent                     │
    │  Responsibilities:                                                  │
    │  - Binds VolumeSnapshot ↔ VolumeSnapshotContent (like PV binder)   │
    │  - Manages VSC finalizers                                           │
    │  - Handles deletion policy enforcement                              │
    │  - Does NOT call CSI directly                                       │
    └──────────────────────────────┬──────────────────────────────────────┘
                                   │ updates VSC status
                                   ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │  CSI DRIVER: external-snapshotter SIDECAR (in controller Deployment) │
    │                                                                     │
    │  Watches: VolumeSnapshotContent (content.status.snapshotHandle=="") │
    │  Calls:   CSI CreateSnapshot / DeleteSnapshot / ListSnapshots       │
    │  Updates: VSC.status.snapshotHandle, VSC.status.readyToUse         │
    └─────────────────────────────────────────────────────────────────────┘
    
    Dynamic snapshot creation flow:
      User creates VolumeSnapshot (ns: production, source: data-postgres-0)
        │
        ▼
      Common controller creates VolumeSnapshotContent (cluster-scoped)
      with snapshotHandle="" (pending)
        │
        ▼
      external-snapshotter calls CSI CreateSnapshot(sourceVolumeId, parameters)
        │ CSI driver calls cloud API (e.g., AWS ec2:CreateSnapshot)
        ▼
      VSC.status.snapshotHandle = "snap-0abc123"
      VSC.status.readyToUse = true
        │
        ▼
      VS.status.readyToUse = true
      VS.status.restoreSize = 100Gi
    

    VolumeSnapshotClass

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotClass
    metadata:
      name: csi-aws-vsc
      annotations:
        snapshot.storage.kubernetes.io/is-default-class: "true"
    driver: ebs.csi.aws.com           # MUST match the CSI driver name
    deletionPolicy: Delete            # Delete | Retain
    parameters:
      # Driver-specific snapshot parameters
      tagSpecification_1: "environment=production"
      tagSpecification_2: "managed-by=kubernetes"
      # For Ceph RBD: csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner

    deletionPolicy

    Policy | When VolumeSnapshot Is Deleted | When to Use
    -------|--------------------------------|------------
    Delete | VolumeSnapshotContent and the backing cloud snapshot are both deleted | Default for most use cases; ensures no orphaned cloud snapshots
    Retain | VolumeSnapshotContent and backing snapshot are preserved; admin must clean up | Compliance/audit requirements; long-lived snapshots that outlive the K8s object
    ⚠️ Default VolumeSnapshotClass per driver: Like StorageClasses, you should designate exactly one default VolumeSnapshotClass per CSI driver. If your cluster has both EBS and Ceph, define one default for each driver. If several defaults exist for the same driver, creating a snapshot without an explicit class fails.
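
The one-default-per-driver rule can be checked before it bites. The following is a minimal sketch, assuming the VolumeSnapshotClass objects have already been loaded as plain dictionaries (for example from kubectl get volumesnapshotclass -o json); the function name pick_default_class is hypothetical and not part of any Kubernetes client.

```python
from typing import Optional

DEFAULT_ANNOTATION = "snapshot.storage.kubernetes.io/is-default-class"


def pick_default_class(classes: list[dict], driver: str) -> Optional[str]:
    """Return the name of the default VolumeSnapshotClass for a driver.

    Returns None when the driver has no default class and raises
    ValueError when more than one default is defined for it.
    """
    defaults = [
        c["metadata"]["name"]
        for c in classes
        if c.get("driver") == driver
        and c["metadata"].get("annotations", {}).get(DEFAULT_ANNOTATION) == "true"
    ]
    if len(defaults) > 1:
        raise ValueError(f"multiple default classes for {driver}: {sorted(defaults)}")
    return defaults[0] if defaults else None
```

Running a check like this in CI against the cluster's class list surfaces a duplicate default before a snapshot request without an explicit class fails.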

    VolumeSnapshot

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: postgres-snap-2024-01-15
      namespace: production
    spec:
      volumeSnapshotClassName: csi-aws-vsc    # which VolumeSnapshotClass to use
      source:
        persistentVolumeClaimName: data-postgres-0   # dynamic: snapshot this PVC
        # OR for pre-provisioned (static):
        # volumeSnapshotContentName: existing-vsc-name

    VolumeSnapshot Status Fields

    kubectl get volumesnapshot postgres-snap-2024-01-15 -n production -o yaml
    
    status:
      boundVolumeSnapshotContentName: snapcontent-abc-123-def   # bound VSC name
      creationTime: "2024-01-15T10:30:00Z"
      readyToUse: true              # true when snapshot is usable for restore
      restoreSize: 100Gi            # minimum PVC size for restore
      error: null                   # populated if snapshot creation failed
        # message: "..."
        # time: "..."
    ℹ️ readyToUse and cloud snapshot availability: For EBS snapshots, readyToUse: true means the snapshot is in the completed state in AWS and can be used for restore. However, EBS snapshots are stored incrementally in S3, and a volume created from a snapshot loads its blocks lazily, so the first read of each block on a restored volume can be slow. Use Fast Snapshot Restore (FSR) on EBS for latency-sensitive restore paths.

    VolumeSnapshotContent

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotContent
    metadata:
      name: snapcontent-abc-123-def
      finalizers:
        - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
    spec:
      deletionPolicy: Delete
      driver: ebs.csi.aws.com
      volumeSnapshotClassName: csi-aws-vsc
      source:
        volumeHandle: vol-0abc123def456789    # for dynamic (set by controller)
        # OR for pre-provisioned:
        # snapshotHandle: snap-0existingsnap  # existing cloud snapshot ID
      volumeSnapshotRef:                       # binding to the VolumeSnapshot
        name: postgres-snap-2024-01-15
        namespace: production
        uid: abc-def-123
    status:
      snapshotHandle: snap-0new123def456      # cloud snapshot ID (set after creation)
      readyToUse: true
      restoreSize: 107374182400               # bytes
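
The two restoreSize values shown so far (100Gi on the VolumeSnapshot, 107374182400 on the VolumeSnapshotContent) are the same size in different notations. Below is a small sketch of the conversion, assuming only the common binary and decimal suffixes; it is not a full Kubernetes quantity parser, and quantity_to_bytes is a hypothetical helper name.

```python
# Suffix tables for a subset of Kubernetes quantity notation.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '100Gi' or '1G' to a byte count.

    Binary suffixes are checked first; a plain number is taken as bytes.
    """
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


print(quantity_to_bytes("100Gi"))  # 107374182400
```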

    Dynamic Snapshot Workflow

    1. Create VolumeSnapshot referencing source PVC and VolumeSnapshotClass
    2. Common snapshot controller sees unbound VolumeSnapshot → creates VolumeSnapshotContent with empty snapshotHandle
    3. external-snapshotter sidecar sees VSC with empty handle → calls CSI CreateSnapshot(sourceVolumeId=vol-0abc123, parameters=...)
    4. CSI driver calls cloud API (e.g., ec2:CreateSnapshot)
    5. external-snapshotter polls until snapshot is ready → updates VSC.status.snapshotHandle and readyToUse=true
    6. Common controller copies readyToUse status to VolumeSnapshot
    7. VolumeSnapshot.status.readyToUse = true — snapshot is available for restore

    # Watch snapshot progress
    kubectl get volumesnapshot postgres-snap-2024-01-15 -n production -w
    # NAME                       READYTOUSE  SOURCEPVC          RESTORESIZE  SNAPSHOTCONTENT    AGE
    # postgres-snap-2024-01-15   false       data-postgres-0          snapcontent-abc    5s
    # postgres-snap-2024-01-15   true        data-postgres-0    100Gi        snapcontent-abc    45s
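
For automation, the same wait can be expressed as a polling loop. The sketch below assumes a caller-supplied get_snapshot function that returns the VolumeSnapshot as a dictionary (how it is fetched is left open); wait_until_ready and its parameters are hypothetical names, and the injected sleep and clock exist only to make the loop testable.

```python
import time
from typing import Callable


def wait_until_ready(
    get_snapshot: Callable[[], dict],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll a VolumeSnapshot until status.readyToUse is true.

    Errors reported in status.error are not treated as fatal, because the
    snapshotter may retry; the last message is included if the wait times out.
    """
    deadline = clock() + timeout_s
    while True:
        snapshot = get_snapshot()
        status = snapshot.get("status") or {}
        if status.get("readyToUse"):
            return snapshot
        if clock() >= deadline:
            last_error = (status.get("error") or {}).get("message", "none reported")
            raise TimeoutError(f"snapshot not ready; last error: {last_error}")
        sleep(interval_s)
```

Choosing a timeout in line with the 30-minute alert threshold used later on this page keeps scripts and alerts consistent.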

    Pre-Provisioned (Static) Snapshots

    Import an existing cloud snapshot into Kubernetes without creating a new one. Useful for disaster recovery scenarios where snapshots were created outside Kubernetes (e.g., AWS Backup, cloud-scheduled snapshots).

    # Step 1: Create VolumeSnapshotContent pointing to existing cloud snapshot
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotContent
    metadata:
      name: imported-prod-snap
    spec:
      deletionPolicy: Retain          # Retain: don't delete cloud snapshot when VSC is deleted
      driver: ebs.csi.aws.com
      volumeSnapshotClassName: csi-aws-vsc
      source:
        snapshotHandle: snap-0existing123abc    # existing AWS snapshot ID
      volumeSnapshotRef:
        name: imported-snap            # VolumeSnapshot to bind to
        namespace: production
        # uid is omitted: the snapshot controller fills it in when it binds
    ---
    # Step 2: Create VolumeSnapshot referencing the pre-provisioned VSC
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: imported-snap
      namespace: production
    spec:
      volumeSnapshotClassName: csi-aws-vsc
      source:
        volumeSnapshotContentName: imported-prod-snap   # bind to existing VSC
    ⚠️ volumeSnapshotRef.uid: For a pre-provisioned VolumeSnapshotContent, only name and namespace are required in volumeSnapshotRef; the snapshot controller records the VolumeSnapshot's UID when it binds the pair. If a uid is set and does not match the VolumeSnapshot (for example, after the VolumeSnapshot was deleted and recreated), binding fails. Compare the two with kubectl get vs imported-snap -n production -o jsonpath='{.metadata.uid}' and kubectl get vsc imported-prod-snap -o jsonpath='{.spec.volumeSnapshotRef.uid}'.
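
The two objects must reference each other by name for the binding to succeed, which is easy to get wrong by hand. The following sketch generates a matching pair as dictionaries; static_snapshot_pair is a hypothetical helper, and leaving volumeSnapshotRef.uid unset rests on the assumption that the snapshot controller fills it in at bind time.

```python
def static_snapshot_pair(
    snapshot_handle: str,
    driver: str,
    content_name: str,
    snapshot_name: str,
    namespace: str,
    deletion_policy: str = "Retain",
) -> tuple[dict, dict]:
    """Build a pre-provisioned VolumeSnapshotContent and its VolumeSnapshot."""
    content = {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshotContent",
        "metadata": {"name": content_name},
        "spec": {
            "deletionPolicy": deletion_policy,
            "driver": driver,
            "source": {"snapshotHandle": snapshot_handle},
            # uid intentionally omitted (assumption: set by the controller on bind)
            "volumeSnapshotRef": {"name": snapshot_name, "namespace": namespace},
        },
    }
    snapshot = {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": snapshot_name, "namespace": namespace},
        "spec": {"source": {"volumeSnapshotContentName": content_name}},
    }
    return content, snapshot
```

Serializing both dictionaries to YAML or JSON and applying them together avoids the name mismatches that leave a pre-provisioned snapshot unbound.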

    Restore to New PVC

    Restore a VolumeSnapshot to a new PVC using dataSource. The restored PVC is a new independent volume pre-populated with the snapshot's data.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgres-restored
      namespace: production
    spec:
      dataSource:
        apiGroup: snapshot.storage.k8s.io
        kind: VolumeSnapshot
        name: postgres-snap-2024-01-15     # VolumeSnapshot in same namespace
      accessModes: [ReadWriteOnce]
      storageClassName: gp3-encrypted      # can be different from source SC
                                           # but must use the same CSI driver
      resources:
        requests:
          storage: 100Gi                   # must be ≥ snapshot's restoreSize

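    Restore Constraints

    • The VolumeSnapshot must report readyToUse: true before the restore PVC can be provisioned.
    • The PVC's requested storage must be at least the snapshot's restoreSize.
    • The StorageClass may differ from the source volume's class, but its provisioner must be the same CSI driver that took the snapshot.
    • With dataSource, the VolumeSnapshot must be in the same namespace as the PVC.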
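
A restore PVC that violates one of these conditions simply stays Pending, so a pre-flight check can save a debugging round. This is a minimal sketch with sizes already expressed in bytes; restore_problems and its parameter names are hypothetical, and it covers only readiness, capacity, and driver match.

```python
def restore_problems(
    *,
    ready_to_use: bool,
    restore_size_bytes: int,
    requested_bytes: int,
    storage_class_provisioner: str,
    snapshot_driver: str,
) -> list[str]:
    """Return human-readable reasons a restore PVC would stay Pending."""
    problems = []
    if not ready_to_use:
        problems.append("snapshot is not readyToUse yet")
    if requested_bytes < restore_size_bytes:
        problems.append(
            f"requested {requested_bytes} bytes is below restoreSize {restore_size_bytes}"
        )
    if storage_class_provisioner != snapshot_driver:
        problems.append(
            f"StorageClass provisioner {storage_class_provisioner} "
            f"differs from snapshot driver {snapshot_driver}"
        )
    return problems
```

An empty list means none of the three checks failed; it does not guarantee the backend will accept the restore.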

    PVC Clone vs Snapshot Restore

    Aspect | PVC Clone (dataSource: PVC) | Snapshot Restore (dataSource: VolumeSnapshot)
    -------|-----------------------------|----------------------------------------------
    Source must be Bound | Yes | No, the snapshot is independent of the source PVC
    Point-in-time | Best-effort (source still changing) | Yes, a frozen moment in time
    Cross-namespace | No (same namespace required) | Alpha: dataSourceRef.namespace + ReferenceGrant (1.26+)
    Cloud cost | Full volume copy (same price as new volume) | Incremental from last snapshot (much cheaper over time)
    Provisioning time | Fast where the backend supports COW clones | Can be slow (data restored from S3/object store)
    Source StorageClass | Must match | Can differ (same driver, different parameters)

    Snapshot Deletion Semantics

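    Snapshot objects carry protective finalizers that control deletion order: the VolumeSnapshot has snapshot.storage.kubernetes.io/volumesnapshot-bound-protection (plus volumesnapshot-as-source-protection while a PVC is being restored from it), and the VolumeSnapshotContent has snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection.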

    Deletion flow (deletionPolicy: Delete):
    
    kubectl delete volumesnapshot postgres-snap-2024-01-15
      │
      ▼
    VS deletion timestamp set; finalizers keep the object until cleanup is done
      │ (common controller sees VS being deleted)
      ▼
    Common controller requests deletion of the bound VSC
    and removes the bound-protection finalizer from VS
    VS is deleted from API server
      │
      ▼
    external-snapshotter calls CSI DeleteSnapshot(snapshotHandle=snap-0abc)
      │
      ▼
    Cloud snapshot deleted
    bound-protection finalizer removed from VSC; VSC deleted from API server
    
    Deletion flow (deletionPolicy: Retain):
    
    kubectl delete volumesnapshot postgres-snap-2024-01-15
      → VS deleted
      → VSC remains, cloud snapshot remains
      → Admin must manually: kubectl delete vsc snapcontent-abc-123
      → external-snapshotter does NOT call DeleteSnapshot (Retain policy)
      → Cloud snapshot must be deleted manually in AWS/GCP/Azure console
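
With Retain, VolumeSnapshotContent objects accumulate after their VolumeSnapshots are gone. The sketch below lists such candidates from two lists of object dictionaries; find_orphaned_contents is a hypothetical helper. Note the assumption that matching is done on namespace, name, and uid, so a pre-provisioned VSC that has not been bound yet also appears in the result and should be reviewed before anything is deleted.

```python
def find_orphaned_contents(contents: list[dict], snapshots: list[dict]) -> list[str]:
    """Return names of Retain-policy VSCs whose VolumeSnapshot no longer exists."""
    live = {
        (s["metadata"]["namespace"], s["metadata"]["name"], s["metadata"]["uid"])
        for s in snapshots
    }
    orphans = []
    for content in contents:
        spec = content["spec"]
        ref = spec.get("volumeSnapshotRef", {})
        key = (ref.get("namespace"), ref.get("name"), ref.get("uid"))
        if spec.get("deletionPolicy") == "Retain" and key not in live:
            orphans.append(content["metadata"]["name"])
    return orphans
```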
    

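    Cross-Namespace Snapshot Sharing (Alpha, 1.26+)

    By default, a VolumeSnapshot can only be used as a restore source within its own namespace. Since Kubernetes 1.26, the alpha CrossNamespaceVolumeDataSource feature lets a PVC name a VolumeSnapshot in another namespace through dataSourceRef.namespace, provided the owner of the source namespace authorizes it with a ReferenceGrant. The feature gate must be enabled on kube-apiserver and kube-controller-manager as well as on the CSI external-provisioner sidecar, and the ReferenceGrant CRD (from the Gateway API project) must be installed.

    # In the SOURCE namespace: allow PVCs in target-ns to reference this snapshot
    apiVersion: gateway.networking.k8s.io/v1beta1
    kind: ReferenceGrant
    metadata:
      name: allow-target-ns-restore
      namespace: production
    spec:
      from:
      - group: ""
        kind: PersistentVolumeClaim
        namespace: target-ns
      to:
      - group: snapshot.storage.k8s.io
        kind: VolumeSnapshot
        name: postgres-snap-2024-01-15
    ---
    # In the target namespace, create a PVC referencing the snapshot across namespaces
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: restored-from-other-ns
      namespace: target-ns
    spec:
      dataSourceRef:
        apiGroup: snapshot.storage.k8s.io
        kind: VolumeSnapshot
        name: postgres-snap-2024-01-15
        namespace: production             # requires the alpha feature gate
      accessModes: [ReadWriteOnce]
      storageClassName: gp3-encrypted
      resources:
        requests:
          storage: 100Gi
    ℹ️ Security model for cross-namespace: The ReferenceGrant is the authorization gate. It lives in the source namespace, so the snapshot's owner must explicitly allow each consuming namespace, which prevents workloads from restoring arbitrary snapshots from other namespaces. On clusters without the feature gate, a common alternative is to import the same snapshotHandle as a pre-provisioned VolumeSnapshotContent bound to a new VolumeSnapshot in the target namespace.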

    VolumeGroupSnapshot (Alpha, 1.27+)

    VolumeGroupSnapshot takes crash-consistent snapshots of multiple PVCs simultaneously — all in a single atomic operation at the storage backend. This is critical for databases that span multiple volumes (e.g., separate data and WAL volumes for PostgreSQL, or a Kafka broker with multiple partition volumes).

    # Install group snapshot CRDs (separate from regular snapshot CRDs)
    # kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/groupsnapshot.storage.k8s.io_volumegroupsnapshotclasses.yaml
    # ... and VolumeGroupSnapshotContent / VolumeGroupSnapshot CRDs
    
    apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
    kind: VolumeGroupSnapshotClass
    metadata:
      name: csi-aws-group-vsc
    driver: ebs.csi.aws.com
    deletionPolicy: Delete
    ---
    apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
    kind: VolumeGroupSnapshot
    metadata:
      name: postgres-group-snap
      namespace: production
    spec:
      volumeGroupSnapshotClassName: csi-aws-group-vsc
      source:
        selector:
          matchLabels:
            app: postgres              # snapshot ALL PVCs with this label simultaneously
            component: storage

    The CSI driver must implement CreateVolumeGroupSnapshot RPC. The driver issues all individual snapshots in a single API call to the storage backend — guaranteeing consistency across volumes (no torn writes between data and WAL).
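
Because the group is defined by a label selector, it is worth previewing which PVCs a selector would capture before creating the group snapshot. Here is a minimal sketch of matchLabels semantics (every listed label must match exactly); select_pvcs is a hypothetical helper operating on PVC dictionaries, and matchExpressions is not covered.

```python
def select_pvcs(pvcs: list[dict], match_labels: dict[str, str]) -> list[str]:
    """Return the sorted names of PVCs whose labels satisfy every matchLabels entry."""
    return sorted(
        pvc["metadata"]["name"]
        for pvc in pvcs
        if all(
            pvc["metadata"].get("labels", {}).get(key) == value
            for key, value in match_labels.items()
        )
    )
```

A PVC that carries app: postgres but lacks component: storage is excluded by the selector shown above, which is a common reason for a volume missing from a group snapshot.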

    Application-Consistent Snapshots

    Storage-level snapshots are crash-consistent (like pulling the power plug) — they capture the disk state atomically, but in-flight writes may be incomplete. For most databases this is acceptable with WAL replay on restart. For strict consistency, the application must be quiesced first.

    Filesystem Freeze

    # Quiesce filesystem before snapshot, then unfreeze.
    # fsfreeze needs a privileged container (CAP_SYS_ADMIN); all writes to the
    # mount block while it is frozen, so keep the window short.
    kubectl exec postgres-0 -n production -- bash -c "
      set -e
      # Flush and freeze filesystem
      psql -c 'CHECKPOINT;'        # PostgreSQL: flush dirty buffers
      fsfreeze --freeze /var/lib/postgresql/data

      # Signal readiness (e.g., write a file that triggers snapshot via hook)
      echo 'frozen' > /tmp/snapshot-ready
    "

    # After snapshot completes:
    kubectl exec postgres-0 -n production -- fsfreeze --unfreeze /var/lib/postgresql/data

    Database Quiesce Patterns

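    Database | Quiesce Command | Unquiesce
    ---------|-----------------|----------
    PostgreSQL | SELECT pg_backup_start('snap'); (pg_start_backup before PostgreSQL 15) or CHECKPOINT; | SELECT pg_backup_stop(); (pg_stop_backup before PostgreSQL 15)
    MySQL / MariaDB | FLUSH TABLES WITH READ LOCK; | UNLOCK TABLES;
    MongoDB | db.fsyncLock() | db.fsyncUnlock()
    General (Linux) | fsfreeze --freeze <mountpoint> | fsfreeze --unfreeze <mountpoint>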
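
Whatever the quiesce command, the unquiesce step has to run even when the snapshot fails, or the application stays blocked. The following sketch wraps the fsfreeze pair in a context manager; quiesced is a hypothetical name, and run stands for any caller-supplied function that executes a command in the target container (for example via kubectl exec).

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence


@contextmanager
def quiesced(run: Callable[[Sequence[str]], None], mountpoint: str) -> Iterator[None]:
    """Freeze a filesystem for the duration of the with-block.

    If the freeze itself fails, nothing is unfrozen; once it succeeds,
    the unfreeze runs even when the block raises.
    """
    run(["fsfreeze", "--freeze", mountpoint])
    try:
        yield
    finally:
        run(["fsfreeze", "--unfreeze", mountpoint])
```

Usage would look like: with quiesced(run, "/var/lib/postgresql/data"): create_snapshot(), where create_snapshot is whatever triggers the VolumeSnapshot.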

    Velero Pre/Post Snapshot Hooks

    Velero supports executing commands in pod containers before and after taking volume snapshots, enabling application-consistent backups without custom tooling:

    apiVersion: velero.io/v1
    kind: Backup
    metadata:
      name: postgres-consistent-backup
      namespace: velero
    spec:
      includedNamespaces: [production]
      labelSelector:
        matchLabels:
          app: postgres
      snapshotVolumes: true
      hooks:
        resources:
        - name: postgres-hooks
          includedNamespaces: [production]
          labelSelector:
            matchLabels:
              app: postgres
          pre:
          - exec:
              container: postgres
              command:
                - /bin/bash
                - -c
                - "psql -c 'CHECKPOINT;' && fsfreeze --freeze /var/lib/postgresql/data"
              timeout: 30s
              onError: Fail
          post:
          - exec:
              container: postgres
              command:
                - /bin/bash
                - -c
                - "fsfreeze --unfreeze /var/lib/postgresql/data"
              timeout: 10s
              onError: Continue    # always unfreeze even if backup had issues

    Scheduled Snapshots

    Kubernetes has no built-in snapshot scheduling — implement it with a CronJob or a dedicated snapshot controller (Kasten K10, Velero schedules, or cloud-native solutions).

    CronJob Snapshot Pattern

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: postgres-snapshot-daily
      namespace: production
    spec:
      schedule: "0 2 * * *"          # daily at 2 AM UTC
      concurrencyPolicy: Forbid
      successfulJobsHistoryLimit: 7
      failedJobsHistoryLimit: 3
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: snapshot-creator
              restartPolicy: OnFailure
              containers:
              - name: snapshot
                image: bitnami/kubectl:latest
                command:
                - /bin/sh
                - -c
                - |
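                  set -eu
                  DATE=$(date +%Y-%m-%d-%H%M)
                  # Assumed manifest: PVC and class names follow the earlier
                  # examples; the created-by label is illustrative.
                  cat <<EOF | kubectl apply -f -
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: postgres-snap-${DATE}
                    namespace: production
                    labels:
                      created-by: postgres-snapshot-daily
                  spec:
                    volumeSnapshotClassName: csi-aws-vsc
                    source:
                      persistentVolumeClaimName: data-postgres-0
                  EOF
                  # Retention: keep the 7 newest snapshots created by this job
                  kubectl get volumesnapshot -n production \
                    -l created-by=postgres-snapshot-daily \
                    --sort-by=.metadata.creationTimestamp -o name \
                    | head -n -7 | xargs -r kubectl delete -n production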
    ---
    # RBAC for snapshot-creator ServiceAccount
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: snapshot-creator
      namespace: production
    rules:
    - apiGroups: [snapshot.storage.k8s.io]
      resources: [volumesnapshots]
      verbs: [get, list, create, delete]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: snapshot-creator
      namespace: production
    subjects:
    - kind: ServiceAccount
      name: snapshot-creator
      namespace: production
    roleRef:
      kind: Role
      name: snapshot-creator
      apiGroup: rbac.authorization.k8s.io
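
The retention half of a scheduled-snapshot job comes down to "keep the newest N, delete the rest". Below is a minimal sketch of that selection, assuming each snapshot is represented as a dictionary with a name and an RFC 3339 creationTime; snapshots_to_delete is a hypothetical helper and performs no deletion itself.

```python
from datetime import datetime


def snapshots_to_delete(snapshots: list[dict], keep: int) -> list[str]:
    """Return names of snapshots beyond the `keep` newest, newest first."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(
        snapshots,
        # fromisoformat needs an explicit offset instead of a trailing 'Z'
        key=lambda s: datetime.fromisoformat(s["creationTime"].replace("Z", "+00:00")),
        reverse=True,
    )
    return [s["name"] for s in ordered[keep:]]
```

Restricting the input to snapshots that are readyToUse avoids counting a failed snapshot toward the retained set.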

    Velero Backup Strategy

    Velero provides a complete backup solution for Kubernetes — it backs up both object metadata (Deployments, Services, ConfigMaps, Secrets) and PV data (via CSI snapshots or Restic/Kopia file-level backup).

    Installation

    # Install Velero with AWS EBS CSI snapshot support.
    # --features=EnableCSI enables CSI snapshot integration; the secret file holds
    # AWS credentials with S3 and EC2 permissions. Velero releases before 1.14 also
    # need the separate velero-plugin-for-csi plugin added to --plugins.
    velero install \
      --provider aws \
      --plugins velero/velero-plugin-for-aws:v1.8.0 \
      --bucket my-velero-backups \
      --backup-location-config region=us-east-1 \
      --snapshot-location-config region=us-east-1 \
      --features=EnableCSI \
      --use-volume-snapshots=true \
      --secret-file ./credentials-velero

    Backup Schedule

    # Create a scheduled backup for production namespace
    velero schedule create production-daily \
      --schedule="0 3 * * *" \
      --include-namespaces production \
      --snapshot-volumes \
      --ttl 720h0m0s                # retain backups for 30 days
    
    # List schedules and their status
    velero schedule get
    
    # Create an on-demand backup
    velero backup create prod-backup-manual \
      --include-namespaces production \
      --snapshot-volumes \
      --wait

    Restore Workflow

    # List available backups
    velero backup get
    
    # Restore entire namespace
    velero restore create --from-backup prod-backup-manual \
      --include-namespaces production \
      --wait
    
    # Restore to a different namespace
    velero restore create --from-backup prod-backup-manual \
      --namespace-mappings production:production-restore \
      --wait
    
    # Restore specific resources only
    velero restore create --from-backup prod-backup-manual \
      --include-resources persistentvolumeclaims,pods \
      --selector app=postgres \
      --wait
    
    # Check restore status
    velero restore describe <restore-name>
    velero restore logs <restore-name>

    File-Level Backup with Kopia (No CSI Snapshot)

    # For volumes without CSI snapshot support, use Kopia (built-in to Velero 1.10+).
    # --uploader-type accepts kopia or restic (legacy); --use-node-agent deploys the
    # node-agent DaemonSet.
    velero install \
      --uploader-type kopia \
      --use-node-agent \
      ...
    
    # Annotate pods to include specific volumes in file-level backup
    kubectl annotate pod postgres-0 \
      backup.velero.io/backup-volumes=data,wal   # comma-separated volume names

    Snapshot Storage Costs

    Understanding snapshot cost models prevents surprise bills:

    Cloud | Snapshot Cost Model | Key Insight
    ------|---------------------|------------
    AWS EBS | Incremental: only blocks changed since the last snapshot are stored in S3 | First snapshot = full volume; subsequent snapshots = delta. Cost grows with churn rate, not volume size.
    GCE PD | Incremental (similar to EBS) | The storage location (regional or multi-regional) is chosen per snapshot and affects price; check current pricing.
    Azure Disk | Incremental by default (Standard HDD billing for snapshot data) | Full snapshots are available but rarely needed.
    Ceph RBD | COW (copy-on-write): the snapshot is just a reference; new writes are stored separately | Snapshots are "free" initially, but flattening large old snapshots is expensive I/O.
    💡 Snapshot retention policy: Retention drives cost, because every retained snapshot keeps the blocks that only it references. How retention affects restore speed depends on the backend. EBS restores do not replay a chain of deltas: each snapshot references every block it needs, so older snapshots can be deleted safely and restore time does not grow with snapshot count. On copy-on-write backends such as Ceph RBD, many snapshots or deep clone chains can add I/O overhead until images are flattened. List snapshots per volume (for example with aws ec2 describe-snapshots) and automate rotation.
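
A rough estimate makes the "cost grows with churn" point concrete. The sketch below uses a deliberately simplified model (one full copy plus one delta per additional retained snapshot, with no block overlap between deltas); the function names and the price used in the example are assumptions, not provider figures.

```python
def estimate_stored_gib(volume_data_gib: float, daily_change_gib: float, retained: int) -> float:
    """Estimate stored snapshot data for `retained` daily incremental snapshots.

    Simplified model: the oldest snapshot holds the full data set and each
    later one adds a full day's worth of changed blocks.
    """
    if retained <= 0:
        return 0.0
    return volume_data_gib + daily_change_gib * (retained - 1)


def monthly_cost(stored_gib: float, price_per_gib_month: float) -> float:
    """Multiply stored GiB by a per-GiB monthly price supplied by the caller."""
    return stored_gib * price_per_gib_month


# Example: 100 GiB of data, 5 GiB changed per day, 7 snapshots retained,
# at a hypothetical price of 0.05 per GiB-month.
stored = estimate_stored_gib(100, 5, 7)
print(stored, monthly_cost(stored, 0.05))
```

Real bills are usually lower than this estimate when the same blocks change repeatedly, since a block rewritten on several days is not stored once per day after older snapshots are deleted.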

    Cloud Provider Snapshot Characteristics

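    Driver | Snapshot Type | Consistency | Cross-Region | Max Snapshots/Volume
    -------|---------------|-------------|--------------|---------------------
    ebs.csi.aws.com | Incremental, S3-backed | Crash-consistent | Manual copy-snapshot | No per-volume cap; limited by the account's per-Region snapshot quota
    pd.csi.storage.gke.io | Incremental (standard) or Regional | Crash-consistent | Standard snapshots are global | Subject to project quotas and snapshot frequency limits
    disk.csi.azure.com | Incremental by default | Crash-consistent | Manual copy to other region | 500 per disk
    rbd.csi.ceph.com | COW (Ceph RBD snapshot) | Crash-consistent | Mirror with rbd-mirror | Unlimited (practical: <1000)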

    Metrics and Alerting

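    Metric | Source | Alert Threshold
    -------|--------|----------------
    kube_volumesnapshot_info | kube-state-metrics (requires a custom resource state configuration; the metric name depends on that configuration) | readyToUse=false for >30m
    kube_volumesnapshot_status_readytouse | kube-state-metrics (same custom resource state configuration) | 0 (not ready) for >30m
    snapshot_controller_operation_total_seconds | snapshot controller | P99 CreateSnapshot > 5m
    velero_backup_success_total | Velero | No successful backup in 25h (missed daily)
    velero_backup_failure_total | Velero | > 0 in any window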

    Alerting Rules

    groups:
    - name: volume-snapshots
      rules:
      - alert: VolumeSnapshotNotReady
        expr: |
          kube_volumesnapshot_info{ready_to_use="false"} == 1
        for: 30m
        labels: {severity: warning}
        annotations:
          summary: "VolumeSnapshot {{ $labels.namespace }}/{{ $labels.volumesnapshot }} not ready after 30m"
    
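      # The kube_volumesnapshot_* metric and label names assume a kube-state-metrics
      # custom resource state configuration that exports them; adjust to your setup.
      # PromQL cannot compare a sample value to a string, so the error message is
      # matched as a label (the "message" label is an assumption).
      - alert: VolumeSnapshotFailed
        expr: |
          kube_volumesnapshot_info{ready_to_use="false"} == 1
            and on(namespace, volumesnapshot)
          kube_volumesnapshot_status_error_message{message!=""}
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "VolumeSnapshot creation failed: check external-snapshotter logs"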
    
      - alert: VeleroBackupMissed
        expr: |
          time() - velero_backup_last_successful_timestamp{schedule="production-daily"} > 90000
        labels: {severity: critical}
        annotations:
          summary: "Velero daily backup for production has not completed in >25h"
    
      - alert: VeleroBackupFailed
        expr: increase(velero_backup_failure_total[1h]) > 0
        labels: {severity: warning}
        annotations:
          summary: "Velero backup failure detected"

    Troubleshooting Runbooks

    Runbook: VolumeSnapshot Stuck Not Ready

    # 1. Check VS status and events
    kubectl describe volumesnapshot <name> -n <ns>
    # status.error.message will indicate failure reason
    
    # 2. Check VSC status
    VSC=$(kubectl get vs <name> -n <ns> -o jsonpath='{.status.boundVolumeSnapshotContentName}')
    kubectl describe volumesnapshotcontent $VSC
    
    # 3. Check external-snapshotter logs (in CSI controller pod)
    kubectl logs -n kube-system \
      $(kubectl get pod -n kube-system -l app=ebs-csi-controller -o name | head -1) \
      -c csi-snapshotter --tail=100
    
    # Common errors:
    # "failed to create snapshot: ... ResourceNotFound" → source PVC PV no longer exists
    # "context deadline exceeded" → CSI driver unresponsive
    # "InvalidSnapshot.InUse" → cloud snapshot in use by AMI (AWS-specific)

    Runbook: Restore PVC Stuck Pending

    # PVC with dataSource VolumeSnapshot is Pending
    kubectl describe pvc <restore-pvc> -n <ns>
    # Common causes:
    
    # 1. VolumeSnapshot not readyToUse yet
    kubectl get vs <snap-name> -n <ns> -o jsonpath='{.status.readyToUse}'
    # → false: wait for snapshot to complete before creating restore PVC
    
    # 2. StorageClass uses different driver than VolumeSnapshotClass
    kubectl get storageclass <sc> -o jsonpath='{.provisioner}'
    kubectl get volumesnapshotclass <vsc> -o jsonpath='{.driver}'
    # → must match
    
    # 3. PVC capacity less than restoreSize
    kubectl get vs <snap-name> -n <ns> -o jsonpath='{.status.restoreSize}'
    # → increase PVC resources.requests.storage to match

    Runbook: Snapshot Deletion Stuck (VSC in Terminating)

    # VolumeSnapshotContent stuck in Terminating state
    kubectl describe vsc <name>
    # Check finalizers:
    kubectl get vsc <name> -o jsonpath='{.metadata.finalizers}'
    
    # Common cause: VS was deleted but VSC finalizer not removed
    # Check if bound VS still exists
    kubectl get vs --all-namespaces | grep <vsc-name>
    
    # If VS is truly gone but VSC is stuck:
    kubectl patch vsc <name> --type=json \
      -p '[{"op":"remove","path":"/metadata/finalizers"}]'
    # WARNING: This may orphan the cloud snapshot — delete it manually in AWS/GCP/Azure

    Runbook: Velero Backup Failing

    # Check backup details
    velero backup describe <backup-name> --details
    
    # Check Velero server logs
    kubectl logs -n velero deployment/velero --tail=100 | grep -i error
    
    # Common failures:
    # "no matches for kind VolumeSnapshot" → snapshot CRDs not installed; missing --features=EnableCSI
    # "backup storage location not available" → S3 bucket permissions or region mismatch
    # "timeout waiting for PodVolumeBackup" → Kopia/Restic agent timeout on large volumes;
    #   increase --pod-volume-operation-timeout flag
    
    # Check backup location status (PHASE should be Available)
    velero backup-location get

    Runbook: Pre-Provisioned Snapshot Not Binding

    # VS and VSC created but VS shows "invalid" or "error setting reference"
    kubectl describe vs <name> -n <ns>

    # Common cause: VSC.spec.volumeSnapshotRef does not point at this VS
    # (wrong name or namespace, or a uid left over from a deleted VS)
    kubectl get vsc <vsc-name> -o jsonpath='{.spec.volumeSnapshotRef}'

    # If the uid is stale, set it to the current VS uid
    VS_UID=$(kubectl get vs <name> -n <ns> -o jsonpath='{.metadata.uid}')
    kubectl patch vsc <vsc-name> --type=merge \
      -p "{\"spec\":{\"volumeSnapshotRef\":{\"uid\":\"$VS_UID\"}}}"

    # Also verify VSC driver matches VolumeSnapshotClass driver
    kubectl get vsc <vsc-name> -o jsonpath='{.spec.driver}'
    kubectl get volumesnapshotclass <class-name> -o jsonpath='{.driver}'

    Best Practices

    1. Take a snapshot before every schema migration, data transformation, or major deployment. Incremental cloud snapshots are usually cheap and are initiated within seconds, although the time to reach readyToUse depends on how much data changed. The worst-case rollback cost of not having one is measured in hours of data recovery.
    2. Use application-consistent snapshots for transactional databases. Crash-consistent snapshots are sufficient for databases with WAL (PostgreSQL, MySQL with InnoDB) — but quiescing removes any uncertainty and avoids long recovery replays.
    3. Test restores regularly. A snapshot that has never been tested is not a backup. Include a monthly restore drill in your SLA documentation — restore to a staging namespace and verify data integrity.
    4. Set deletionPolicy: Retain for compliance snapshots. Snapshots created for regulatory requirements should not be deletable by namespace developers. Use Retain and restrict VSC deletion to cluster-admin only.
    5. Implement snapshot retention with automation. Manual snapshot accumulation is the most common cause of unexpected cloud storage costs. Automate retention (the CronJob pattern above, or Velero TTL) from day one.
    6. Align VolumeSnapshotClass driver to StorageClass provisioner. Mismatched drivers fail silently until restore time — the worst moment to discover the issue.
    7. Use VolumeGroupSnapshot for multi-volume databases. A PostgreSQL cluster whose data and WAL volumes are snapshotted at different times has a torn snapshot. If the WAL volume is captured after the data volume, the extra WAL records are simply replayed on recovery; if it is captured before, the data volume can contain changes whose WAL records are missing, and recovery may fail or lose consistency. Group snapshots eliminate this window.
    8. Monitor snapshot readyToUse lag. A snapshot that takes 30+ minutes to become ready indicates a storage backend health issue. Alert on this before it becomes a restore-time surprise.