CSI Drivers
The Container Storage Interface in depth — the gRPC API contract between Kubernetes and storage backends, the sidecar architecture, every RPC call in the provision/attach/mount/resize/snapshot lifecycle, the CSIDriver object, volume health monitoring, driver upgrade strategy, and a minimal driver implementation walkthrough.
Why CSI Exists
Before CSI, storage drivers were compiled into the Kubernetes binary as in-tree plugins. Adding or updating a driver required a Kubernetes release, which took months. Bug fixes in a storage driver couldn't ship independently of Kubernetes. This created three problems:
- Release coupling — storage vendors blocked on Kubernetes release cadence
- Security surface — all storage code ran in kube-controller-manager and kubelet, with full node privileges
- Maintenance burden — the core team had to maintain drivers for dozens of storage backends
CSI (Container Storage Interface) decouples storage drivers from Kubernetes. Drivers run as ordinary pods with only the permissions they need. They communicate with Kubernetes via a gRPC interface over a Unix socket. Kubernetes 1.13 declared CSI GA; the last major in-tree drivers were removed in 1.27–1.28.
CSI Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ KUBERNETES CONTROL PLANE │
│ kube-controller-manager: │
│ AttachDetach controller ──────────────────────────────┐ │
│ PV bind controller ──────────────────────────────┤ │
│ kube-scheduler │ │
└────────────────────────────────────────────────────────────┼────────┘
│ watches K8s API
┌─────────────────────────────────────────────────────── ───┼────────┐
│ CSI CONTROLLER PLUGIN (Deployment, 1-3 replicas) │ │
│ │ │
│ ┌──────────────────────┐ ┌───────────────────────────────┤ │
│ │ Your CSI Driver │ │ Kubernetes Sidecar Containers │ │
│ │ (controller plugin) │◄──│ │ │
│ │ │ │ external-provisioner ───────┘ │
│ │ gRPC Unix socket: │ │ external-attacher │
│ │ /csi/csi.sock │ │ external-resizer │
│ │ │ │ external-snapshotter │
│ │ Implements: │ │ liveness-probe │
│ │ Identity service │ └───────────────────────────────────────┘
│ │ Controller service │
│ └──────────────────────┘
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ CSI NODE PLUGIN (DaemonSet — one pod per node) │
│ │
│ ┌──────────────────────┐ ┌───────────────────────────────────┐ │
│ │ Your CSI Driver │ │ Kubernetes Sidecar Containers │ │
│ │ (node plugin) │◄──│ │ │
│ │ │ │ node-driver-registrar │ │
│ │ gRPC Unix socket: │ │ (registers with kubelet at │ │
│ │ /csi/csi.sock │ │ /var/lib/kubelet/plugins_ │ │
│ │ │ │ registry/) │ │
│ │ Implements: │ │ liveness-probe │ │
│ │ Identity service │ │ external-health-monitor-agent │ │
│ │ Node service │ └───────────────────────────────────┘ │
│ └──────────────────────┘ │
│ ▲ │
│ │ kubelet calls via gRPC Unix socket │
└─────────────────────────────────────────────────────────────────────┘
The split into controller and node plugins is deliberate. The controller plugin runs anywhere in the cluster (usually on control plane nodes via node selector) and makes API calls to the storage backend (cloud APIs, Ceph, etc.). The node plugin runs on every worker node where volumes may be mounted, and performs the low-level format/mount/unmount operations that require direct disk access.
External Sidecar Responsibilities
Kubernetes ships a set of standard sidecar containers that bridge the Kubernetes API to CSI gRPC calls. You never write these — your driver implements the gRPC interface, and the sidecars translate Kubernetes events into gRPC calls.
| Sidecar | Watches | Calls CSI | Required |
|---|---|---|---|
| external-provisioner | PVC with matching StorageClass | CreateVolume / DeleteVolume | Yes (dynamic provisioning) |
| external-attacher | VolumeAttachment objects | ControllerPublishVolume / ControllerUnpublishVolume | If attachRequired: true |
| external-resizer | PVC resize requests (status conditions) | ControllerExpandVolume | If expansion supported |
| external-snapshotter | VolumeSnapshotContent objects (created by the snapshot controller for each VolumeSnapshot) | CreateSnapshot / DeleteSnapshot / ListSnapshots | If snapshots supported |
| node-driver-registrar | (none) | GetPluginInfo (to register socket path with kubelet) | Yes (node plugin) |
| liveness-probe | (none) | Probe (health check; exposes /healthz HTTP endpoint) | Recommended |
| external-health-monitor-controller | PV objects | ControllerGetVolume / ListVolumes | Optional (volume health) |
| external-health-monitor-agent | Node-local volumes | NodeGetVolumeStats (health fields) | Optional (volume health) |
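When the controller plugin runs with more than one replica, enable leader election on every sidecar (the --leader-election flag). Only the elected leader processes events. Without leader election, multiple replicas would act on the same object at once and could issue duplicate or conflicting calls, such as concurrent CreateVolume requests. The sidecars use a Lease object for leader election.
CSIDriver Object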
The CSIDriver cluster-scoped object declares a driver's capabilities to Kubernetes. It is created by the driver deployment (usually via Helm), not dynamically registered.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com            # MUST match the provisioner name in StorageClass
spec:
  # ── Attach behavior ───────────────────────────────────────────
  attachRequired: true             # true: driver manages attach/detach (block drivers)
                                   # false: no VolumeAttachment created (NFS, CephFS)
  # ── Pod metadata injection ────────────────────────────────────
  podInfoOnMount: true             # kubelet passes pod UID/name/namespace in NodePublishVolume
                                   # enables per-pod volume access tracking
  # ── Volume lifecycle modes ────────────────────────────────────
  volumeLifecycleModes:
    - Persistent                   # supports PV/PVC (standard)
    # - Ephemeral                  # supports inline CSI ephemeral volumes (no PVC)
  # ── fsGroup handling ──────────────────────────────────────────
  fsGroupPolicy: File              # None | File | ReadWriteOnceWithFSType
                                   # File: kubelet always chowns mounted volume to fsGroup
                                   # None: kubelet never applies fsGroup (e.g., NFS with server-side squash)
                                   # ReadWriteOnceWithFSType: chown only when fsType set + RWO
  # ── Token injection for cloud auth ────────────────────────────
  tokenRequests:
    - audience: sts.amazonaws.com  # AWS IRSA audience
      expirationSeconds: 86400     # token TTL
    # - audience: ""               # default K8s service account token
  # ── Secrets Store CSI or drivers that re-publish volumes ──────
  requiresRepublish: false         # true: kubelet calls NodePublishVolume periodically
                                   # used by Secrets Store CSI to refresh secret values
  # ── SELinux optimization (beta since 1.27) ────────────────────
  seLinuxMount: true               # driver can mount with SELinux label directly
                                   # avoids recursive relabeling of volume files
fsGroupPolicy Values
| Value | Behavior | Use For |
|---|---|---|
| File | kubelet recursively chowns all files to fsGroup on every mount | Block CSI drivers (EBS, Azure Disk, GCE PD) |
| None | kubelet does not chown; driver is responsible | NFS, CephFS (server-side uid/gid management) |
| ReadWriteOnceWithFSType | chown only when fsType is set AND accessMode is RWO | Drivers that support both block and file |
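The table above can be read as a small decision function. The following is a minimal sketch, not kubelet's actual code: the helper name `shouldApplyFSGroup` and the plain-string access modes are assumptions made for illustration.

```go
package main

import "fmt"

// FSGroupPolicy mirrors the three values of CSIDriver.spec.fsGroupPolicy.
type FSGroupPolicy string

const (
	PolicyNone                    FSGroupPolicy = "None"
	PolicyFile                    FSGroupPolicy = "File"
	PolicyReadWriteOnceWithFSType FSGroupPolicy = "ReadWriteOnceWithFSType"
)

// shouldApplyFSGroup is a hypothetical helper: it reports whether the volume's
// ownership would be changed for a pod that sets fsGroup, following the table.
func shouldApplyFSGroup(policy FSGroupPolicy, fsType string, accessModes []string) bool {
	switch policy {
	case PolicyFile:
		return true // always chown
	case PolicyReadWriteOnceWithFSType:
		if fsType == "" {
			return false // no filesystem type declared
		}
		for _, mode := range accessModes {
			if mode == "ReadWriteOnce" {
				return true
			}
		}
		return false
	default: // PolicyNone and unknown values: leave ownership to the driver
		return false
	}
}

func main() {
	fmt.Println(shouldApplyFSGroup(PolicyFile, "", []string{"ReadWriteMany"}))
	fmt.Println(shouldApplyFSGroup(PolicyNone, "ext4", []string{"ReadWriteOnce"}))
	fmt.Println(shouldApplyFSGroup(PolicyReadWriteOnceWithFSType, "ext4", []string{"ReadWriteOnce"}))
}
```

The third call shows why block drivers that always format a filesystem behave the same under File and ReadWriteOnceWithFSType for RWO volumes, while shared-filesystem volumes do not.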
Identity Service RPCs
Every CSI driver must implement the Identity service — it is called during driver registration and health checks.
| RPC | Called By | Purpose |
|---|---|---|
| GetPluginInfo | node-driver-registrar, sidecars | Returns driver name and version. Name must match CSIDriver object name. |
| GetPluginCapabilities | All sidecars | Returns which services the plugin supports: CONTROLLER_SERVICE, VOLUME_ACCESSIBILITY_CONSTRAINTS (topology), ONLINE expansion |
| Probe | liveness-probe sidecar (periodic) | Returns driver readiness. Used for liveness/readiness probes. Return ready: false (or a FAILED_PRECONDITION error) if not yet initialized. |
Controller Service RPCs
The controller plugin runs in a Deployment and calls cloud/storage APIs. Not all RPCs are required — declare which you implement via ControllerGetCapabilities.
CreateVolume
Called by external-provisioner when a PVC needs a new volume. The most complex RPC — it must handle topology, secrets, cloning, and snapshot restore.
// CreateVolumeRequest (gRPC, simplified Go-like pseudocode)
{
Name: "pvc-abc-123", // unique name; idempotency key
CapacityRange: {
RequiredBytes: 20 * 1024^3, // 20 GiB minimum
LimitBytes: 0, // no upper limit
},
VolumeCapabilities: [{
AccessMode: { Mode: SINGLE_NODE_WRITER },
Mount: { FsType: "ext4", MountFlags: ["noatime"] },
}],
Parameters: { // from StorageClass.parameters
"type": "gp3",
"encrypted": "true",
},
Secrets: { ... }, // from csi.storage.k8s.io/provisioner-secret-name
AccessibilityRequirements: { // from scheduler topology hint
Requisite: [{ Segments: {"topology.ebs.csi.aws.com/zone": "us-east-1a"} }],
Preferred: [{ Segments: {"topology.ebs.csi.aws.com/zone": "us-east-1a"} }],
},
VolumeContentSource: { // optional: clone or snapshot restore
Type: &Snapshot{ SnapshotId: "snap-0abc123" },
// OR
Type: &Volume{ VolumeId: "vol-0source" },
},
}
// CreateVolumeResponse
{
Volume: {
VolumeId: "vol-0abc123def456", // opaque to K8s; stored in PV.spec.csi.volumeHandle
CapacityBytes: 20 * 1024^3,
VolumeContext: { "throughput": "250" }, // stored in PV.spec.csi.volumeAttributes
AccessibleTopology: [{ Segments: {"topology.ebs.csi.aws.com/zone": "us-east-1a"} }],
}
}
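Because the request name is the idempotency key, a retried CreateVolume has to return the volume that already exists instead of creating a second one. The sketch below models that rule with an in-memory backend. `Backend`, `Volume`, and `ErrAlreadyExists` are hypothetical names, and a real driver would query its storage API and return a gRPC ALREADY_EXISTS status instead.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyExists stands in for the gRPC ALREADY_EXISTS status code.
var ErrAlreadyExists = errors.New("volume exists with incompatible capacity")

// Volume is a simplified stand-in for the CSI Volume message.
type Volume struct {
	ID            string
	Name          string
	CapacityBytes int64
}

// Backend is a hypothetical in-memory storage backend keyed by request name.
type Backend struct {
	byName map[string]Volume
	nextID int
}

func NewBackend() *Backend { return &Backend{byName: map[string]Volume{}} }

// CreateVolume returns the existing volume when a request with the same name
// is repeated and the existing capacity satisfies the requested range.
func (b *Backend) CreateVolume(name string, requiredBytes, limitBytes int64) (Volume, error) {
	if v, ok := b.byName[name]; ok {
		fits := v.CapacityBytes >= requiredBytes &&
			(limitBytes == 0 || v.CapacityBytes <= limitBytes)
		if fits {
			return v, nil // retry of an earlier call: same answer, no new disk
		}
		return Volume{}, ErrAlreadyExists
	}
	b.nextID++
	v := Volume{ID: fmt.Sprintf("vol-%04d", b.nextID), Name: name, CapacityBytes: requiredBytes}
	b.byName[name] = v
	return v, nil
}

func main() {
	b := NewBackend()
	first, _ := b.CreateVolume("pvc-abc-123", 20<<30, 0)
	retry, _ := b.CreateVolume("pvc-abc-123", 20<<30, 0)
	fmt.Println(first.ID == retry.ID)

	_, err := b.CreateVolume("pvc-abc-123", 50<<30, 0)
	fmt.Println(err)
}
```

The same pattern applies to CreateSnapshot, where the snapshot name plays the role of the key.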
ControllerPublishVolume / ControllerUnpublishVolume
Called by external-attacher when a VolumeAttachment object is created/deleted. For cloud block storage, this translates to attaching/detaching the block device to the VM instance.
// ControllerPublishVolumeRequest
{
VolumeId: "vol-0abc123def456", // EBS volume ID
NodeId: "i-0abc123def456789", // EC2 instance ID (from NodeGetInfo)
VolumeCapability: { ... },
Readonly: false,
}
// Response contains PublishContext, e.g.: {"devicePath": "/dev/xvdf"}
// This devicePath is passed to NodeStageVolume
ControllerExpandVolume
Called by external-resizer when PVC storage is increased. Resizes the backing cloud volume. Returns the new capacity and whether NodeExpansionRequired is true (meaning the filesystem on the node also needs resizing).
CreateSnapshot / DeleteSnapshot
Called by external-snapshotter. Parameters come from the VolumeSnapshotClass. The driver must handle idempotency — if a snapshot with the same name already exists, return it rather than failing.
Node Service RPCs
The node plugin runs as a DaemonSet. Kubelet calls it directly via the Unix socket registered at /var/lib/kubelet/plugins/<driver-name>/csi.sock.
NodeStageVolume vs NodePublishVolume
This is the most subtle distinction in CSI. The two-phase mount exists to support multiple pods sharing the same block device on a single node:
BLOCK VOLUME MOUNT FLOW (two-phase)
Cloud storage backend
vol-0abc123 attached to node as /dev/xvdf
│
▼
NodeStageVolume (called once per volume per node)
stagingTargetPath: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-abc/globalmount
Operations:
- fsck (filesystem check)
- mkfs.ext4 /dev/xvdf (if new volume)
- mount /dev/xvdf /var/lib/kubelet/.../globalmount
Result: device formatted and mounted at a GLOBAL path on the node
│
▼ (for each pod using this volume)
NodePublishVolume (called once per pod per volume)
targetPath: /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<vol-name>/mount
Operations:
- mount --bind /globalmount /var/lib/kubelet/pods/.../mount
Result: bind-mount from global mount into the specific pod's directory
│
▼
Container sees /var/data (mountPath) backed by the bind-mounted filesystem
UNMOUNT FLOW (reverse):
Pod terminates → NodeUnpublishVolume (remove bind-mount from pod dir)
Last pod using volume → NodeUnstageVolume (unmount global mount)
VolumeAttachment deleted → ControllerUnpublishVolume (detach from node)
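For NFS/CephFS (filesystem volumes without attachment), attachRequired is false in the CSIDriver object, so no ControllerPublishVolume call is made. A driver that does not advertise the STAGE_UNSTAGE_VOLUME node capability also skips NodeStageVolume entirely and mounts directly in NodePublishVolume.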
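The two-phase flow amounts to reference counting: stage on the first pod, publish per pod, and unstage after the last pod leaves. The sketch below records the order in which the node RPCs would be issued. `NodeMounts` is a hypothetical bookkeeping type and performs no real mounts.

```go
package main

import "fmt"

// NodeMounts tracks which pods have a volume published on this node and
// records the node RPCs that would be issued, in order.
type NodeMounts struct {
	published map[string]map[string]bool // volumeID -> set of pod UIDs
	Calls     []string
}

func NewNodeMounts() *NodeMounts {
	return &NodeMounts{published: map[string]map[string]bool{}}
}

// Publish stages the volume if this is its first pod on the node, then
// publishes it for the pod. Repeating the call for the same pod is a no-op.
func (n *NodeMounts) Publish(volumeID, podUID string) {
	pods, staged := n.published[volumeID]
	if !staged {
		pods = map[string]bool{}
		n.published[volumeID] = pods
		n.Calls = append(n.Calls, "NodeStageVolume "+volumeID)
	}
	if pods[podUID] {
		return
	}
	pods[podUID] = true
	n.Calls = append(n.Calls, "NodePublishVolume "+volumeID+" "+podUID)
}

// Unpublish removes the pod's bind mount and unstages the volume once no pod
// on the node uses it any more.
func (n *NodeMounts) Unpublish(volumeID, podUID string) {
	pods := n.published[volumeID]
	if !pods[podUID] {
		return // already unpublished: idempotent
	}
	delete(pods, podUID)
	n.Calls = append(n.Calls, "NodeUnpublishVolume "+volumeID+" "+podUID)
	if len(pods) == 0 {
		delete(n.published, volumeID)
		n.Calls = append(n.Calls, "NodeUnstageVolume "+volumeID)
	}
}

func main() {
	n := NewNodeMounts()
	n.Publish("vol-0abc123", "pod-a")
	n.Publish("vol-0abc123", "pod-b")
	n.Unpublish("vol-0abc123", "pod-a")
	n.Unpublish("vol-0abc123", "pod-b")
	for _, c := range n.Calls {
		fmt.Println(c)
	}
}
```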
NodeExpandVolume
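Called by kubelet after ControllerExpandVolume completes and reports that node expansion is required. With online expansion the call happens while the volume is still mounted; with offline expansion it happens the next time the volume is mounted on a node. Runs the filesystem-level resize command: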
// NodeExpandVolumeRequest
{
VolumeId: "vol-0abc123",
VolumePath: "/var/lib/kubelet/pods/.../mount", // path where the volume is published on the node
StagingTargetPath: "/var/lib/.../globalmount",
CapacityRange: { RequiredBytes: 50 * 1024^3 }, // 50 GiB target
VolumeCapability: { Mount: { FsType: "ext4" } },
}
// Driver implementation:
// 1. Find the block device from stagingTargetPath
// 2. Run: resize2fs /dev/xvdf (for ext4)
// Or: xfs_growfs /var/lib/.../globalmount (for xfs — uses mount path, not device)
// 3. Return new capacity
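The choice of resize tool in step 2 depends only on the filesystem type, and the two tools take different arguments. A minimal sketch of that dispatch follows. The function name `resizeCommand` is hypothetical, and it only builds the argument list instead of executing anything.

```go
package main

import "fmt"

// resizeCommand returns the command line that would grow the filesystem.
// resize2fs operates on the block device, xfs_growfs on the mount point.
func resizeCommand(fsType, devicePath, mountPath string) ([]string, error) {
	switch fsType {
	case "ext3", "ext4":
		return []string{"resize2fs", devicePath}, nil
	case "xfs":
		return []string{"xfs_growfs", mountPath}, nil
	default:
		return nil, fmt.Errorf("unsupported filesystem type %q", fsType)
	}
}

func main() {
	for _, fs := range []string{"ext4", "xfs", "btrfs"} {
		argv, err := resizeCommand(fs, "/dev/xvdf", "/staging/globalmount")
		fmt.Println(argv, err)
	}
}
```

A driver would pass the returned slice to `exec.Command(argv[0], argv[1:]...)` and map a failure to a gRPC error.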
NodeGetInfo
Returns the node's unique ID (used in ControllerPublishVolume) and its topology segments. The topology information is used by the scheduler to place pods near their volumes.
// NodeGetInfoResponse
{
NodeId: "i-0abc123def456789", // instance ID for ControllerPublishVolume
MaxVolumesPerNode: 25, // scheduler enforces this limit
AccessibleTopology: {
Segments: {
"topology.ebs.csi.aws.com/zone": "us-east-1a"
}
}
}
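Topology matching compares the node's segments with each entry in the volume's AccessibleTopology. The sketch below shows one way to express that check. `Topology` and `accessibleFrom` are hypothetical names, and the real scheduler works on node labels and PV node affinity, not on these types.

```go
package main

import "fmt"

// Topology is a set of key/value segments, as in the CSI Topology message.
type Topology map[string]string

// accessibleFrom reports whether a node can reach a volume. The node must
// match every segment of at least one entry. An empty list means the volume
// carries no topology constraint.
func accessibleFrom(node Topology, volume []Topology) bool {
	if len(volume) == 0 {
		return true
	}
	for _, entry := range volume {
		match := true
		for key, value := range entry {
			if node[key] != value {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

func main() {
	node := Topology{"topology.ebs.csi.aws.com/zone": "us-east-1a"}
	sameZone := []Topology{{"topology.ebs.csi.aws.com/zone": "us-east-1a"}}
	otherZone := []Topology{{"topology.ebs.csi.aws.com/zone": "us-east-1b"}}
	fmt.Println(accessibleFrom(node, sameZone), accessibleFrom(node, otherZone))
}
```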
NodeGetVolumeStats
Returns usage statistics for a mounted volume. Kubelet calls this periodically to populate kubelet_volume_stats_* metrics and to check volume health conditions.
// NodeGetVolumeStatsResponse
{
Usage: [
{ Available: 15 * 1024^3, Total: 20 * 1024^3, Used: 5 * 1024^3, Unit: BYTES },
{ Available: 6500000, Total: 6553600, Used: 53600, Unit: INODES },
],
VolumeCondition: { // volume health monitoring (optional)
Abnormal: false,
Message: "",
// Abnormal: true when driver detects I/O errors, corruption, etc.
}
}
Volume Health Monitoring
CSI volume health monitoring (an alpha feature, introduced in 1.19 and reworked in 1.21) provides a mechanism for drivers to report that a volume is unhealthy. The health signal bubbles up as a Kubernetes Event on the PVC and Pod objects.
Health Monitoring Components
| Component | Where | Checks |
|---|---|---|
external-health-monitor-controller | Controller Deployment sidecar | Calls ControllerGetVolume / ListVolumes periodically; detects cloud-side anomalies (corrupted snapshot, degraded volume) |
external-health-monitor-agent | Node DaemonSet sidecar | Calls NodeGetVolumeStats on mounted volumes; checks VolumeCondition.Abnormal |
When a volume is detected as abnormal, the controller emits a Kubernetes Event on the PVC:
kubectl describe pvc data-postgres-0 -n production
# Events:
# Type Reason Age From Message
# ---- ------ --- ---- -------
# Warning VolumeConditionAbnormal 2m external-health-monitor Volume condition is abnormal: I/O error detected
Pod Info on Mount
When podInfoOnMount: true is set in the CSIDriver, kubelet injects the requesting pod's metadata into NodePublishVolume's VolumeContext. This enables drivers to implement per-pod access control, logging, or billing.
// NodePublishVolumeRequest.VolumeContext when podInfoOnMount:true
{
"csi.storage.k8s.io/pod.name": "postgres-0",
"csi.storage.k8s.io/pod.namespace": "production",
"csi.storage.k8s.io/pod.uid": "abc-123-def",
"csi.storage.k8s.io/serviceAccount.name": "postgres",
// plus any volumeAttributes from the PV spec
}
The Secrets Store CSI Driver uses this to look up the requesting pod's service account and assume the appropriate cloud IAM role for secret retrieval — without any cluster-wide credentials in the driver.
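A driver reads these values straight out of the request's VolumeContext map. The following sketch uses the keys listed above. The `PodInfo` type and the `podInfoFromContext` helper are hypothetical.

```go
package main

import "fmt"

// PodInfo holds the pod metadata that kubelet injects when podInfoOnMount is true.
type PodInfo struct {
	Name, Namespace, UID, ServiceAccount string
}

// podInfoFromContext extracts pod metadata from a NodePublishVolume
// VolumeContext. ok is false when name, namespace, or UID is missing,
// for example because podInfoOnMount is not enabled.
func podInfoFromContext(volumeContext map[string]string) (info PodInfo, ok bool) {
	info = PodInfo{
		Name:           volumeContext["csi.storage.k8s.io/pod.name"],
		Namespace:      volumeContext["csi.storage.k8s.io/pod.namespace"],
		UID:            volumeContext["csi.storage.k8s.io/pod.uid"],
		ServiceAccount: volumeContext["csi.storage.k8s.io/serviceAccount.name"],
	}
	ok = info.Name != "" && info.Namespace != "" && info.UID != ""
	return info, ok
}

func main() {
	ctx := map[string]string{
		"csi.storage.k8s.io/pod.name":            "postgres-0",
		"csi.storage.k8s.io/pod.namespace":       "production",
		"csi.storage.k8s.io/pod.uid":             "abc-123-def",
		"csi.storage.k8s.io/serviceAccount.name": "postgres",
	}
	info, ok := podInfoFromContext(ctx)
	fmt.Println(info.Namespace+"/"+info.Name, info.ServiceAccount, ok)
}
```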
Secrets Store CSI Driver
The most widely-deployed CSI ephemeral driver. It mounts secrets from external secret stores (AWS Secrets Manager / Parameter Store, Azure Key Vault, GCP Secret Manager, HashiCorp Vault) as files in pods — without the secrets ever being stored in Kubernetes etcd.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: aws-secrets
  namespace: production
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "prod/db/password"
        objectType: "secretsmanager"
        objectAlias: "db-password"         # filename in the pod
      - objectName: "/prod/api-key"
        objectType: "ssmparameter"
        objectAlias: "api-key"
  secretObjects:                           # optional: sync to K8s Secret for env var use
    - secretName: db-secret
      type: Opaque
      data:
        - objectName: db-password
          key: password
---
# Pod spec excerpt that mounts the secrets
volumes:
  - name: secrets
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: aws-secrets   # reference to SecretProviderClass
containers:
  - name: app
    volumeMounts:
      - name: secrets
        mountPath: /mnt/secrets
        readOnly: true
    env:
      - name: DB_PASSWORD                  # also available as env var via synced K8s Secret
        valueFrom:
          secretKeyRef:
            name: db-secret
            key: password
Token Rotation with requiresRepublish
When the CSIDriver has requiresRepublish: true, kubelet calls NodePublishVolume periodically (default 1 minute). The Secrets Store driver uses this to re-fetch secrets from the external store and update the files — enabling seamless secret rotation without pod restart.
Node Volume Limits
Each node can attach a limited number of block volumes. The CSI driver reports this via NodeGetInfo.MaxVolumesPerNode, and the kube-scheduler's NodeVolumeLimits plugin enforces it.
| Cloud / Instance | Max Volumes | Notes |
|---|---|---|
| AWS (most instance types) | 25–39 (type-dependent) | EBS CSI reads from instance metadata; c5.4xlarge = 25, i3.metal = 39 |
| AWS (nitro instances, NVMe) | 28 | Separate limit for NVMe instance store volumes |
| GCE (most machine types) | 16 | pd.csi.storage.gke.io reports 16 by default |
| Azure (Standard_D series) | 4–64 (SKU-dependent) | disk.csi.azure.com reads from instance metadata |
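# Check node attachment capacity (CSI drivers publish the limit on the CSINode object)
kubectl get csinode <name> -o jsonpath='{.spec.drivers[*].allocatable.count}'
# In-tree plugins reported the limit under node Allocatable instead:
kubectl describe node <name> | grep -A 5 "Allocatable"
# attachable-volumes-aws-ebs: 25
# Check current usage on a node
kubectl get volumeattachment | grep <name>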
# Identify pods that could exhaust volume limits
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.spec.nodeName=="<node>") |
"\(.metadata.namespace)/\(.metadata.name): \(.spec.volumes | length) volumes"' | sort -t: -k2 -rn
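The limit check itself is simple arithmetic over the distinct volumes already attached to the node. Here is a minimal sketch of that reasoning, not the scheduler plugin's implementation. `attachedVolumes` and `fits` are hypothetical helpers, and treating a reported limit of zero as "no driver limit" is an assumption made for this example.

```go
package main

import "fmt"

// attachedVolumes counts distinct volumes used by the pods on one node.
// A volume shared by several pods is attached to the node only once.
func attachedVolumes(podVolumes map[string][]string) int {
	seen := map[string]bool{}
	for _, volumes := range podVolumes {
		for _, v := range volumes {
			seen[v] = true
		}
	}
	return len(seen)
}

// fits reports whether a pod that needs `requested` new volumes can be placed
// on a node with `attached` volumes and a limit of maxVolumes.
// Assumption: maxVolumes <= 0 means the driver reported no limit.
func fits(attached, requested, maxVolumes int) bool {
	if maxVolumes <= 0 {
		return true
	}
	return attached+requested <= maxVolumes
}

func main() {
	node := map[string][]string{
		"production/postgres-0": {"vol-a", "vol-b"},
		"production/reporting":  {"vol-b"}, // shares vol-b with postgres-0
	}
	attached := attachedVolumes(node)
	fmt.Println(attached, fits(attached, 1, 25), fits(attached, 24, 25))
}
```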
Driver Deployment Patterns
A typical CSI driver Helm chart deploys two workloads and several supporting objects:
# Typical CSI driver RBAC requirements
# Controller ServiceAccount needs:
# - Secrets: get (for StorageClass secrets)
# - PersistentVolumes: get, list, watch, create, delete, patch
# - PersistentVolumeClaims: get, list, watch, update (for resize status)
# - StorageClasses: get, list, watch
# - VolumeAttachments: get, list, watch, patch
# - CSINodes: get, list, watch
# - VolumeSnapshots / VolumeSnapshotContents: get, list, watch, create, delete, patch
# - Events: create, patch
# - Leases: get, create, update (for leader election)
# Node ServiceAccount needs:
# - Secrets: get (for node-stage secrets)
# - Pods: get (when podInfoOnMount: true)
# - PersistentVolumeClaims: get, list, watch
# - CSINodes: get, create, update, patch
# - Nodes: get, list, watch, patch
Controller HA with Leader Election
# Controller Deployment sidecar flags for HA
containers:
  - name: csi-provisioner
    image: registry.k8s.io/sig-storage/csi-provisioner:v3.6.0
    args:
      - --csi-address=/csi/csi.sock
      - --feature-gates=Topology=true
      - --leader-election                 # enable leader election
      - --leader-election-namespace=kube-system
      - --timeout=60s                     # RPC timeout
      - --retry-interval-start=1s
      - --retry-interval-max=5m
      - --worker-threads=100              # parallel reconciliation workers
  - name: csi-attacher
    image: registry.k8s.io/sig-storage/csi-attacher:v4.4.0
    args:
      - --csi-address=/csi/csi.sock
      - --leader-election
      - --leader-election-namespace=kube-system
      - --timeout=300s                    # longer timeout for attach operations
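The --retry-interval-start and --retry-interval-max flags bound how long a failed operation waits before it is retried. The sketch below assumes the delay doubles after each consecutive failure until it reaches the maximum, which is how an exponential failure rate limiter usually behaves. `retryDelay` is a hypothetical helper, not the sidecar's code.

```go
package main

import (
	"fmt"
	"time"
)

// Values taken from the provisioner flags above.
var (
	retryStart = 1 * time.Second
	retryMax   = 5 * time.Minute
)

// retryDelay returns the wait before the next attempt after `failures`
// consecutive failures: start, 2*start, 4*start, ... capped at max.
func retryDelay(start, max time.Duration, failures int) time.Duration {
	delay := start
	for i := 0; i < failures; i++ {
		delay *= 2
		if delay >= max {
			return max
		}
	}
	if delay > max {
		return max
	}
	return delay
}

func main() {
	for failures := 0; failures <= 10; failures++ {
		fmt.Println(failures, retryDelay(retryStart, retryMax, failures))
	}
}
```

Under this assumption a volume that keeps failing is retried at most once every five minutes, which keeps a broken StorageClass from flooding the storage API.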
Driver Upgrade Strategy
Upgrading a CSI driver without disrupting running pods requires careful ordering. The node plugin (DaemonSet) and controller plugin (Deployment) can be upgraded independently.
- Check compatibility: verify that the new driver version supports existing PV volumeHandles and StorageClass parameters. Read the driver changelog for breaking changes.
- Upgrade controller plugin first: scale down the old Deployment, then apply the new Deployment. New provisioner/attacher sidecars start. Existing mounts are unaffected (the controller handles only create/delete/attach/detach).
- Upgrade node plugin (DaemonSet): set maxUnavailable: 1 in the DaemonSet update strategy. Pods on each node are replaced one at a time. Running pods continue using existing mounts during the upgrade window.
- Verify: create a test PVC, mount it in a pod, write data, delete. Check CSI driver logs for errors.
- Monitor: watch for a csi_operations_seconds P99 latency increase during rollout; watch for storage_operation_errors_total.
# Safe DaemonSet update strategy for CSI node plugin
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # replace one node plugin at a time
                          # avoids all nodes losing CSI capability simultaneously
Minimal CSI Driver Skeleton
A minimal read/write block CSI driver requires implementing three services. This outline shows the structure — full implementations use the csi-lib-utils library.
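// main.go: CSI driver entry point
package main

import (
	"context"
	"flag"
	"log"
	"net"
	"os"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// MyDriver implements all three services. The embedded Unimplemented*Server
// types answer every RPC not defined below with codes.Unimplemented.
type MyDriver struct {
	csi.UnimplementedIdentityServer
	csi.UnimplementedControllerServer
	csi.UnimplementedNodeServer
	nodeID, zone string
}

// logInterceptor logs every failed RPC.
func logInterceptor(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
	resp, err := handler(ctx, req)
	if err != nil {
		log.Printf("%s failed: %v", info.FullMethod, err)
	}
	return resp, err
}

func main() {
	// 1. Parse flags: endpoint (socket path), node-id, zone
	endpoint := flag.String("endpoint", "/csi/csi.sock", "CSI Unix socket path")
	nodeID := flag.String("node-id", "", "unique ID of this node")
	zone := flag.String("zone", "", "topology zone of this node")
	flag.Parse()
	driver := &MyDriver{nodeID: *nodeID, zone: *zone}
	// 2. Create gRPC server
	server := grpc.NewServer(grpc.UnaryInterceptor(logInterceptor))
	// 3. Register all three services on the same server
	csi.RegisterIdentityServer(server, driver)
	csi.RegisterControllerServer(server, driver) // in controller plugin
	csi.RegisterNodeServer(server, driver)       // in node plugin
	// 4. Start listening (remove a stale socket from a previous run first)
	if err := os.Remove(*endpoint); err != nil && !os.IsNotExist(err) {
		log.Fatalf("remove stale socket: %v", err)
	}
	lis, err := net.Listen("unix", *endpoint)
	if err != nil {
		log.Fatalf("listen on %s: %v", *endpoint, err)
	}
	log.Fatal(server.Serve(lis))
}

// Identity Service (required by all)
func (d *MyDriver) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error) {
	return &csi.GetPluginInfoResponse{
		Name:          "my.csi.driver.example",
		VendorVersion: "v1.0.0",
	}, nil
}

func (d *MyDriver) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error) {
	return &csi.GetPluginCapabilitiesResponse{
		Capabilities: []*csi.PluginCapability{{
			Type: &csi.PluginCapability_Service_{
				Service: &csi.PluginCapability_Service{
					Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
				},
			},
		}},
	}, nil
}

func (d *MyDriver) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error) {
	return &csi.ProbeResponse{Ready: &wrapperspb.BoolValue{Value: true}}, nil
}

// Controller Service (minimum for dynamic provisioning). A real driver also
// implements ControllerGetCapabilities and advertises CREATE_DELETE_VOLUME.
func (d *MyDriver) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	// 1. Validate parameters
	// 2. Call backend API to create disk (idempotent: check if exists first)
	// 3. Return volume ID and topology
	return nil, status.Error(codes.Unimplemented, "CreateVolume: add backend logic")
}

func (d *MyDriver) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error) {
	// 1. Call backend API to delete disk
	// 2. Idempotent: if volume not found, return success
	return nil, status.Error(codes.Unimplemented, "DeleteVolume: add backend logic")
}

// Node Service (minimum for mounting). A real driver also implements
// NodeGetCapabilities and advertises STAGE_UNSTAGE_VOLUME.
func (d *MyDriver) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	// 1. Find the block device (from PublishContext devicePath)
	// 2. Format if needed: exec.Command("mkfs.ext4", "-F", devicePath)
	// 3. Mount: exec.Command("mount", devicePath, stagingTargetPath)
	return nil, status.Error(codes.Unimplemented, "NodeStageVolume: add mount logic")
}

func (d *MyDriver) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	// 1. Bind-mount: exec.Command("mount", "--bind", stagingTargetPath, targetPath)
	// 2. Handle readOnly: remount with "-o", "bind,ro"
	return nil, status.Error(codes.Unimplemented, "NodePublishVolume: add mount logic")
}

func (d *MyDriver) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error) {
	// 1. Unmount: exec.Command("umount", targetPath)
	// 2. Idempotent: if not mounted, return success
	return nil, status.Error(codes.Unimplemented, "NodeUnpublishVolume: add unmount logic")
}

func (d *MyDriver) NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest) (*csi.NodeUnstageVolumeResponse, error) {
	// 1. Unmount: exec.Command("umount", stagingTargetPath)
	return nil, status.Error(codes.Unimplemented, "NodeUnstageVolume: add unmount logic")
}

func (d *MyDriver) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		NodeId:            d.nodeID,
		MaxVolumesPerNode: 20,
		AccessibleTopology: &csi.Topology{
			Segments: map[string]string{"zone": d.zone},
		},
	}, nil
}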
CSI Migration
The in-tree volume plugin migration to CSI was a multi-year project. Each plugin had a feature gate; migration is now complete for all major cloud providers. The migration is transparent: existing PVs with in-tree types are silently routed to the corresponding CSI driver.
| In-Tree Plugin | CSI Driver | Migration Status |
|---|---|---|
| kubernetes.io/aws-ebs | ebs.csi.aws.com | Beta 1.17; GA 1.25; in-tree removed 1.27 |
| kubernetes.io/gce-pd | pd.csi.storage.gke.io | Beta 1.17; GA 1.25; in-tree removed 1.28 |
| kubernetes.io/azure-disk | disk.csi.azure.com | Beta 1.19; GA 1.24; in-tree removed 1.27 |
| kubernetes.io/azure-file | file.csi.azure.com | Beta 1.21; GA 1.26; in-tree removed 1.27 |
| kubernetes.io/cinder (OpenStack) | cinder.csi.openstack.org | Beta 1.18; GA 1.24; in-tree removed 1.26 |
| kubernetes.io/vsphere-volume | csi.vsphere.vmware.com | Beta 1.19; GA 1.26; in-tree removed 1.26 |
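Conceptually, the routing is a lookup from the in-tree plugin name to the CSI driver name. The sketch below covers a subset of the table to illustrate the idea. `migratedDriver` is a hypothetical helper, and the real translation also rewrites the volume source fields, which is not shown here.

```go
package main

import "fmt"

// inTreeToCSI maps in-tree plugin names to the CSI drivers that replace them
// (a subset of the table above).
var inTreeToCSI = map[string]string{
	"kubernetes.io/aws-ebs":    "ebs.csi.aws.com",
	"kubernetes.io/gce-pd":     "pd.csi.storage.gke.io",
	"kubernetes.io/azure-disk": "disk.csi.azure.com",
	"kubernetes.io/azure-file": "file.csi.azure.com",
	"kubernetes.io/cinder":     "cinder.csi.openstack.org",
}

// migratedDriver returns the CSI driver that serves volumes of the given
// in-tree plugin, or ok=false when the name is not a migrated plugin.
func migratedDriver(plugin string) (driver string, ok bool) {
	driver, ok = inTreeToCSI[plugin]
	return driver, ok
}

func main() {
	for _, plugin := range []string{"kubernetes.io/aws-ebs", "kubernetes.io/nfs"} {
		driver, ok := migratedDriver(plugin)
		fmt.Printf("%s -> %q (migrated: %v)\n", plugin, driver, ok)
	}
}
```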
Metrics and Alerting
| Metric | Source | Alert Threshold |
|---|---|---|
| csi_operations_seconds{driver,method_name} | External sidecars | P99 CreateVolume >60s; NodePublish >30s |
| storage_operation_duration_seconds{operation_name} | kubelet | P99 volume_attach >60s, volume_mount >30s |
| storage_operation_errors_total | kubelet | Any non-zero rate |
| attachdetach_controller_total_volumes{state="detached"} | kube-controller-manager | Growing detached count indicates stuck detach |
| volume_manager_total_volumes{state="desired_state_of_world"} | kubelet | DSW vs ASW divergence for >5min |
Alerting Rules
groups:
  - name: csi-drivers
    rules:
      - alert: CSIProvisioningErrors
        expr: rate(storage_operation_errors_total{operation_name="provision"}[5m]) > 0
        for: 2m
        labels: {severity: warning}
        annotations:
          summary: "CSI provisioning errors on {{ $labels.volume_plugin }}"
      - alert: CSINodePublishSlow
        expr: |
          histogram_quantile(0.99,
            rate(storage_operation_duration_seconds_bucket{operation_name="volume_mount"}[10m])
          ) > 30
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "CSI NodePublishVolume P99 > 30s: pods taking too long to start"
      - alert: VolumeAttachDetachStuck
        expr: |
          attachdetach_controller_total_volumes{state="detached"} > 0
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "Volumes stuck in detached state for >10 minutes"
      - alert: CSIDriverDown
        expr: |
          up{job=~".*csi.*"} == 0
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "CSI driver {{ $labels.job }} is down: new mounts will fail"
Troubleshooting Runbooks
Runbook: Pod Stuck ContainerCreating — CSI NodePublish Failed
# 1. Get the error
kubectl describe pod <name> -n <ns>
# Events: "MountVolume.SetUp failed for volume... NodePublishVolume failed"
# 2. Check node plugin logs (DaemonSet pod on the same node)
#    Pod names do not contain the node name, so select by spec.nodeName
NODE=$(kubectl get pod <name> -n <ns> -o jsonpath='{.spec.nodeName}')
kubectl logs -n kube-system \
  $(kubectl get pod -n kube-system -l app=ebs-csi-node \
      --field-selector spec.nodeName=$NODE -o name) \
  -c csi-driver --tail=50
# 3. Check if the staging path exists (NodeStage may have partially completed)
#    kubectl debug mounts the node's root filesystem at /host
kubectl debug node/$NODE -it --image=busybox -- \
  ls /host/var/lib/kubelet/plugins/kubernetes.io/csi/
# 4. Check if the block device is available on the node
kubectl debug node/$NODE -it --image=busybox -- lsblk
Runbook: CSI Provisioner Not Creating PVs
# 1. Describe PVC — check events
kubectl describe pvc <name> -n <ns>
# 2. Check external-provisioner logs
kubectl logs -n kube-system \
$(kubectl get pod -n kube-system -l app=ebs-csi-controller -o name | head -1) \
-c csi-provisioner --tail=100
# Common errors:
# "error calling CreateVolume: ... InvalidParameterValue" → bad StorageClass parameters
# "failed to create volume: ... UnauthorizedOperation" → missing IAM permissions
# "context deadline exceeded" → CSI driver not responding (check driver pod health)
# 3. Check driver pod is healthy
kubectl get pods -n kube-system -l app=ebs-csi-controller
# If CrashLoopBackOff: kubectl logs -n kube-system <pod> -c csi-driver --previous
Runbook: VolumeAttachment Stuck in Attaching State
# Attachment has been in progress for >5 minutes
kubectl get volumeattachment
# NAME ATTACHER PV NODE ATTACHED
# csi-abc123 ebs.csi.aws.com pv-prod-db node-1 false
# 1. Check external-attacher logs
kubectl logs -n kube-system \
$(kubectl get pod -n kube-system -l app=ebs-csi-controller -o name | head -1) \
-c csi-attacher --tail=50
# 2. Check cloud-side: is the volume actually attached to the node?
# AWS: aws ec2 describe-volumes --volume-ids vol-0abc123 | jq '.Volumes[].Attachments'
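# 3. If cloud shows detached but attachment object is stuck, delete and recreate:
kubectl delete volumeattachment csi-abc123
# The attach/detach controller in kube-controller-manager recreates the object
# while a pod on the node still needs the volume, which triggers a fresh attach attempt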
Runbook: NodeGetVolumeStats Returns Abnormal — Volume Health Alert
# PVC Event shows VolumeConditionAbnormal
kubectl describe pvc <name> -n <ns>
# Warning: VolumeConditionAbnormal ...
# 1. Check what the driver is reporting
kubectl logs -n kube-system <node-plugin-pod> -c external-health-monitor-agent | tail -50
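# 2. Check for kernel I/O errors on the node
#    (application containers usually cannot read the kernel log and do not ship journalctl;
#    reading the kernel log may require a privileged debug pod)
kubectl debug node/<node> -it --image=busybox -- chroot /host dmesg | tail -20
kubectl debug node/<node> -it --image=busybox -- chroot /host journalctl -k | grep -i "I/O error"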
# 3. If filesystem corruption is suspected:
# - Take snapshot immediately (before further damage)
# - Stop writes to the volume
# - Run fsck on an unmounted clone of the snapshot
Runbook: Node Volume Limit Reached — Pod Pending
# Pod stuck Pending with:
# "0/3 nodes are available: 3 node(s) exceed max volume count"
# 1. Check current attachment count per node
kubectl describe nodes | grep "attachable-volumes" -A 3
# 2. Find which pods are using the most volumes on congested nodes
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.spec.nodeName=="<node>") |
"\(.metadata.namespace)/\(.metadata.name): \(.spec.volumes | map(select(.persistentVolumeClaim)) | length) PVCs"' | \
sort -t: -k2 -rn | head -10
# 3. Options:
# a) Add more nodes to the cluster
# b) Consolidate workloads (fewer, larger PVCs per pod)
# c) Use EBS multi-attach (io1/io2) for shared access where appropriate
# d) AWS: request limit increase via AWS Support for certain instance types
Best Practices
- Always deploy the liveness-probe sidecar. Without it, a hung CSI driver process is invisible to Kubernetes: the pod appears healthy but all storage operations silently fail. Pair with a Prometheus alert on up{job="csi-driver"} == 0.
- Enable leader election on all controller sidecars when running more than one controller replica. Without it, you get duplicate volumes, duplicate snapshots, or conflicting resize operations.
- Use RBAC least-privilege for driver ServiceAccounts. The controller ServiceAccount does not need node access; the node ServiceAccount does not need PV create/delete. Follow the driver's recommended RBAC manifest exactly, and don't grant cluster-admin to the driver.
- Pin sidecar versions to your Kubernetes version. The external-provisioner, external-attacher, etc. have compatibility matrices. A sidecar built for K8s 1.28 may use API fields removed or changed in 1.31. Check the sidecar compatibility table before upgrading Kubernetes.
- Set maxUnavailable: 1 on node plugin DaemonSet updates. Rolling the node plugin too fast (all nodes at once) means all nodes simultaneously lose the ability to perform new mount/unmount operations.
- Set fsGroupPolicy: File for block drivers. When the field is unset, the default is ReadWriteOnceWithFSType, so kubelet skips ownership changes for volumes with no fsType or with an access mode other than ReadWriteOnce. A wrong policy means fsGroup in the pod SecurityContext has no effect, and containers get permission denied on newly provisioned volumes.
- Pre-install the CSI driver before migrating in-tree volumes. If you upgrade Kubernetes past the in-tree removal version without having the CSI driver installed and migration enabled, existing PVs become unserviceable. There is no easy rollback.
- Monitor storage_operation_errors_total by operation name. A rate of >0 for provision, volume_attach, or volume_mount means real workloads are experiencing failures. These operations are synchronous blockers for pod startup, so every error translates directly to pod latency or unavailability.