▶ What This Page Covers
  • The pre-GA capacity problem: scheduler blindness and ProvisioningFailed pods
  • CSIStorageCapacity API (GA 1.24) — object model, fields, owner references
  • How external-provisioner publishes CSIStorageCapacity objects
  • Capacity-aware scheduling flow: the kube-scheduler VolumeBinding plugin
  • storageCapacity: true on the CSIDriver object: the required opt-in gate
  • WaitForFirstConsumer interaction — why Immediate mode defeats capacity tracking
  • Topology segments and how they map to CSIStorageCapacity objects
  • Multi-zone capacity planning: zone-scoped vs node-scoped capacity
  • Node volume attachment limits per cloud provider (AWS/GCE/Azure)
  • CSINode object — allocatable volume count, driver topology keys
  • kube-scheduler VolumeBinding plugin internals
  • Capacity staleness: resync period, cache TTL, worst-case scheduling race
  • Local storage capacity with TopoLVM operator
  • ResourceQuota for storage: PVC count, storage size per StorageClass
  • LimitRange for PVC minimum/maximum sizes
  • StorageClass ResourceQuota scoping
  • Capacity monitoring: available vs allocated vs used
  • 5 metrics + 4 alerting rules + 5 runbooks
  • 8 best practices for capacity planning at scale

    The Pre-GA Capacity Problem

    Before storage capacity tracking (alpha in 1.19, beta in 1.21, GA in 1.24), the scheduler had no knowledge of how much storage was available in any topology segment. When a pod with a WaitForFirstConsumer PVC was scheduled to a node in zone us-east-1a, the scheduler committed the pod to that zone before the CSI driver attempted to provision the volume. If the provisioner then discovered that us-east-1a had no capacity (quota exhausted, regional limits hit, pool full), the PVC stayed Pending with a ProvisioningFailed event and the pod stayed Pending with it. It could remain stuck indefinitely, because nothing steered a later scheduling attempt away from the full zone.

    Without capacity tracking (pre-1.24):

      1. Scheduler places the pod in zone us-east-1a (no capacity info).
      2. Pod placement triggers PVC binding.
      3. The CSI provisioner issues a CreateVolume RPC.
      4. The driver returns RESOURCE_EXHAUSTED (zone full).
      5. The PVC records ProvisioningFailed.
      6. The pod is stuck Pending (no retry to another zone).

    With capacity tracking (1.24+ GA), the scheduler reads CSIStorageCapacity objects BEFORE scheduling:

      us-east-1a: 0 Gi available      ← skip this zone
      us-east-1b: 500 Gi available    ← schedule here
      us-east-1c: 1200 Gi available

      → Pod scheduled to us-east-1b or us-east-1c immediately

    CSIStorageCapacity Object

    CSIStorageCapacity is a namespace-scoped API object (in the same namespace as the CSI controller pod) that represents the available storage capacity for a specific StorageClass within a specific topology segment (zone, rack, node). The external-provisioner sidecar creates and updates these objects periodically.

    Object Structure

    apiVersion: storage.k8s.io/v1
    kind: CSIStorageCapacity
    metadata:
      name: csi-sc-ebs-gp3-us-east-1b-a1b2c3   # auto-generated name
      namespace: kube-system                      # same ns as CSI controller
      ownerReferences:                            # object is garbage-collected when this owner is deleted
      - apiVersion: apps/v1
        kind: ReplicaSet
        name: ebs-csi-controller-7d9f8b
        uid: a1b2c3d4-...
        controller: true
        blockOwnerDeletion: true
    storageClassName: ebs-gp3                    # the StorageClass this applies to
    nodeTopology:                                # which topology segment
      matchLabels:
        topology.kubernetes.io/zone: us-east-1b
    capacity: "2Ti"                              # available capacity in this segment
    maximumVolumeSize: "16Ti"                    # maximum single-volume size allowed
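
    Field | Type | Description
    ------|------|------------
    storageClassName | string (required) | Name of the StorageClass the capacity applies to. The scheduler consults the object only when the driver's CSIDriver object sets storageCapacity: true.
    nodeTopology | LabelSelector (optional) | Nodes that can access this storage. Unset means no node can access it; an empty selector ({}) means every node can (e.g., NFS). Typically matchLabels with a zone key.
    capacity | Quantity (optional) | Available storage as reported by GetCapacity. Unset means the value is currently unknown, so the object cannot show that a request fits.
    maximumVolumeSize | Quantity (optional) | Largest single volume the driver can create in this segment. When set, the scheduler compares the PVC request against this value; a larger request filters the node out.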

    Listing CSIStorageCapacity Objects

    # List all capacity objects across all namespaces
    kubectl get csistoragecapacity -A
    
    # Show capacity per zone for a specific storage class
    kubectl get csistoragecapacity -A \
      -o custom-columns='NAMESPACE:.metadata.namespace,SC:.storageClassName,TOPOLOGY:.nodeTopology,CAPACITY:.capacity' \
      --sort-by='.storageClassName'
    
    # Example output:
    # NAMESPACE     SC          TOPOLOGY                                    CAPACITY
    # kube-system   ebs-gp3     map[topology.kubernetes.io/zone:us-east-1a] 5497558138880
    # kube-system   ebs-gp3     map[topology.kubernetes.io/zone:us-east-1b] 2199023255552
    # kube-system   ebs-gp3     map[topology.kubernetes.io/zone:us-east-1c] 10995116277760
    
    # Watch for capacity updates in real time
    kubectl get csistoragecapacity -A -w
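
    The custom-columns output above prints capacity as raw byte counts, which are hard to compare by eye. The following is a minimal sketch of a helper that renders such a count with a binary suffix; the function name format_binary_quantity is hypothetical and is not part of kubectl or any Kubernetes client library.

```python
def format_binary_quantity(num_bytes: int) -> str:
    """Render a byte count with the largest binary suffix that divides it evenly.

    Values that are not an exact multiple of 1024 are returned unchanged, so the
    conversion never loses precision.
    """
    suffixes = ["Ki", "Mi", "Gi", "Ti", "Pi", "Ei"]
    value, chosen = num_bytes, ""
    for suffix in suffixes:
        if value == 0 or value % 1024 != 0:
            break
        value //= 1024
        chosen = suffix
    return f"{value}{chosen}"


# Byte counts taken from the example output above.
for raw in (5497558138880, 2199023255552, 10995116277760):
    print(raw, "->", format_binary_quantity(raw))
```

    With the three values from the example output, this prints 5Ti, 2Ti, and 10Ti for us-east-1a, us-east-1b, and us-east-1c respectively.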

    How external-provisioner Publishes Capacity

    The external-provisioner sidecar is responsible for creating and refreshing CSIStorageCapacity objects. It does this by calling the CSI GetCapacity RPC on the controller plugin for each topology segment it knows about, then writing or updating the corresponding object.

    external-provisioner capacity publication loop (runs in the CSI controller pod):

      1. Build the list of topology segments from CSINode objects
         (each node's accessible topology comes from NodeGetInfo).

      2. For each (StorageClass, topology segment) pair, call GetCapacity(parameters, topology)
         on the CSI controller plugin (the driver process). The driver queries its storage
         backend and returns available_capacity and maximum_volume_size.

      3. Create or update the CSIStorageCapacity objects in the controller pod's namespace.

      Repeat every --capacity-poll-interval (default: 1m).

      The owner reference (a ReplicaSet in the example object above) ensures the objects are
      garbage-collected when that owner is deleted.
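
    The publication loop can be sketched as a single poll cycle over every (StorageClass, segment) pair. This is an illustrative model only, not the external-provisioner source: poll_once, the published dictionary, and the fake get_capacity callback are hypothetical names, and the removal of entries for pairs that no longer exist is an assumption about how obsolete objects are cleaned up.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Segment = Tuple[Tuple[str, str], ...]   # e.g. (("topology.kubernetes.io/zone", "us-east-1a"),)
Key = Tuple[str, Segment]               # (StorageClass name, topology segment)


def poll_once(
    storage_classes: Iterable[str],
    segments: Iterable[Segment],
    get_capacity: Callable[[str, Segment], Optional[int]],
    published: Dict[Key, Optional[int]],
) -> Dict[Key, Optional[int]]:
    """Run one poll cycle: refresh every wanted pair, then drop obsolete ones."""
    segments = list(segments)
    wanted = set()
    for sc in storage_classes:
        for seg in segments:
            key = (sc, seg)
            wanted.add(key)
            # Stand-in for the GetCapacity RPC followed by create-or-update.
            published[key] = get_capacity(sc, seg)
    for key in list(published):
        if key not in wanted:
            del published[key]          # class or segment no longer exists
    return published


ZONE = "topology.kubernetes.io/zone"
backend = {"us-east-1a": 0, "us-east-1b": 2 * 1024 ** 4}   # fake driver state, in bytes


def fake_get_capacity(sc: str, seg: Segment) -> Optional[int]:
    return backend.get(dict(seg)[ZONE])


state: Dict[Key, Optional[int]] = {}
zones = [((ZONE, "us-east-1a"),), ((ZONE, "us-east-1b"),)]
poll_once(["ebs-gp3"], zones, fake_get_capacity, state)
print(len(state), "objects published")
```

    Each call corresponds to one --capacity-poll-interval tick; between calls the published values are exactly as stale as the interval allows.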

    Enabling Capacity Tracking on the Provisioner

    # In the CSI controller Deployment, external-provisioner container:
    containers:
    - name: external-provisioner
      image: registry.k8s.io/sig-storage/csi-provisioner:v4.0.0
      args:
      - --csi-address=/var/lib/csi/sockets/pluginproxy/csi.sock
      - --leader-election
      - --feature-gates=Topology=true
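      - --enable-capacity                    # enable CSIStorageCapacity publishing
      - --capacity-ownerref-level=2          # 0=pod, 1=pod's owner (ReplicaSet/StatefulSet), 2=Deployment
      - --capacity-poll-interval=1m          # how often to refresh capacity (default 1m)
      env:                                   # --enable-capacity uses these to find the owner object
      - name: NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      # RBAC needed: create/update/delete csistoragecapacities in the controller namespace,
      # plus read access to the pod and its owners to resolve the owner reference

    GetCapacity RPC is Optional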

    Not all CSI drivers implement GetCapacity. If the driver does not advertise the GET_CAPACITY controller capability or returns UNIMPLEMENTED, no CSIStorageCapacity objects are published for it. Leave the capacity opt-in disabled for such a driver so that the scheduler keeps the pre-1.24 optimistic behavior. Check the driver's release notes for GET_CAPACITY support in ControllerGetCapabilities.

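    CSIDriver storageCapacity Gate

    Capacity-aware scheduling is opt-in per CSI driver via the spec.storageCapacity field of the driver's CSIDriver object; StorageClass has no such field. Without the opt-in, the scheduler ignores CSIStorageCapacity objects entirely even if they exist. With it, a node passes only when a matching object reports enough space, so turn the gate on after the provisioner has started publishing.

    apiVersion: storage.k8s.io/v1
    kind: CSIDriver
    metadata:
      name: ebs.csi.aws.com
    spec:
      storageCapacity: true                   # enable capacity-aware scheduling for this driver
    ---
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: ebs-gp3
    provisioner: ebs.csi.aws.com
    volumeBindingMode: WaitForFirstConsumer   # REQUIRED for capacity tracking to be useful
    parameters:
      type: gp3
      iops: "3000"
      throughput: "125"

    Immediate Binding Mode Defeats Capacity Tracking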

    With volumeBindingMode: Immediate, PVCs are bound before any pod is scheduled, so the scheduler never gets the chance to filter nodes by capacity. By default external-provisioner does not even publish CSIStorageCapacity objects for Immediate-mode StorageClasses; the --capacity-for-immediate-binding flag turns publishing on, mainly for debugging or for other consumers. Always use WaitForFirstConsumer for topology-aware drivers when capacity tracking matters.

    Capacity-Aware Scheduling Flow

    When a pod with an unbound WaitForFirstConsumer PVC is scheduled, the VolumeBinding plugin in kube-scheduler filters nodes using CSIStorageCapacity data.

    Capacity-aware scheduling (1.24+), kube-scheduler VolumeBinding plugin:

      Pod created with an unbound PVC requesting 200Gi on SC "ebs-gp3".

      Filter phase, for each candidate node:
        1. Determine the node's topology: zone=us-east-1b
        2. Find CSIStorageCapacity objects where:
             storageClassName = "ebs-gp3"
             nodeTopology matches the node's labels
        3. Compare the PVC request against maximumVolumeSize if it is set, otherwise against capacity
        4. If no matching object reports enough space (none exists, or its size is unset),
           the node is filtered out

      Nodes in us-east-1a (0 Gi)  → filtered out
      Nodes in us-east-1b (2 Ti)  → pass (200Gi fits)
      Nodes in us-east-1c (10 Ti) → pass

      Score phase: the GA feature only filters. Preferring segments with more headroom depends on
      the separate StorageCapacityScoring feature gate (alpha in 1.33), so do not assume it by default.

      Pod scheduled to a passing node, here one in us-east-1c
      VolumeBinding plugin annotates the PVC with the selected node
      external-provisioner calls CreateVolume in us-east-1c

    VolumeBinding Plugin Behavior with Nil Capacity

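    CSIStorageCapacity State | Scheduler Behavior | Risk
    -------------------------|--------------------|-----
    Object exists, capacity >= PVC size | Node passes filter | Low: capacity confirmed at the last poll
    Object exists, capacity < PVC size | Node filtered out | None: bad zone avoided
    Object exists, capacity and maximumVolumeSize unset | Node filtered out | Pod stays Pending until a size is published
    No object for this (SC, topology) | Node filtered out | Pod stays Pending until the provisioner publishes
    maximumVolumeSize < PVC size | Node filtered out | None: volume would never fit

    These rows apply when the driver's CSIDriver object sets storageCapacity: true. Without the opt-in the scheduler skips the check and every node passes.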
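
    The filter step can be sketched in a few lines. This is a simplified model written for illustration, not the kube-scheduler source: the Capacity dataclass and the function names are hypothetical, only matchLabels selectors are handled, and it assumes the behavior with the capacity opt-in enabled, where the request is compared against maximumVolumeSize when that is set (otherwise capacity) and a node with no sufficient matching object is rejected.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

GI = 1024 ** 3
TI = 1024 ** 4


@dataclass
class Capacity:
    """Simplified stand-in for a CSIStorageCapacity object (sizes in bytes)."""
    storage_class: str
    match_labels: Optional[Dict[str, str]]   # None: no node has access; {}: every node
    capacity: Optional[int] = None
    maximum_volume_size: Optional[int] = None


def volume_limit(cap: Capacity) -> Optional[int]:
    """Size the request is compared against: maximumVolumeSize if set, else capacity."""
    if cap.maximum_volume_size is not None:
        return cap.maximum_volume_size
    return cap.capacity


def node_has_access(node_labels: Dict[str, str], cap: Capacity) -> bool:
    """True if every label in the selector matches the node (matchLabels only)."""
    if cap.match_labels is None:
        return False
    return all(node_labels.get(k) == v for k, v in cap.match_labels.items())


def node_passes(node_labels: Dict[str, str], storage_class: str,
                request: int, capacities: List[Capacity]) -> bool:
    """A node passes if at least one accessible object reports enough space."""
    for cap in capacities:
        limit = volume_limit(cap)
        if (cap.storage_class == storage_class
                and limit is not None
                and limit >= request
                and node_has_access(node_labels, cap)):
            return True
    return False


ZONE = "topology.kubernetes.io/zone"
published = [
    Capacity("ebs-gp3", {ZONE: "us-east-1a"}, capacity=0),
    Capacity("ebs-gp3", {ZONE: "us-east-1b"}, capacity=2 * TI),
    Capacity("ebs-gp3", {ZONE: "us-east-1c"}, capacity=10 * TI),
]
for zone in ("us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"):
    ok = node_passes({ZONE: zone}, "ebs-gp3", 200 * GI, published)
    print(zone, "passes" if ok else "filtered out")
```

    For the 200Gi request, nodes in us-east-1b and us-east-1c pass, us-east-1a is filtered out for lack of space, and the hypothetical zone us-east-1d is filtered out because no object covers it.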

    Topology Segments

    A topology segment is a set of node labels that define a storage domain. For zone-scoped drivers (AWS EBS, GCE PD, Azure Disk), each availability zone is one segment. For node-scoped drivers (local volumes, TopoLVM), each node is one segment.

    # Zone-scoped: one CSIStorageCapacity per zone per StorageClass
    nodeTopology:
      matchLabels:
        topology.kubernetes.io/zone: us-east-1a
    
    # Node-scoped: one CSIStorageCapacity per node per StorageClass (e.g., TopoLVM)
    nodeTopology:
      matchLabels:
        kubernetes.io/hostname: node1
    
    # Multi-label (rack-aware Ceph):
    nodeTopology:
      matchLabels:
        topology.kubernetes.io/zone: us-east-1a
        topology.rook.io/rack: rack2
    
    # Cluster-wide (NFS, CephFS with global pool): an empty selector, not an omitted one
    # nodeTopology: {}
    # capacity: "50Ti"

    How Topology Keys Are Discovered

    The external-provisioner discovers topology keys from CSINode objects. Each node running the CSI node plugin has a CSINode object: the node-driver-registrar sidecar registers the plugin with the kubelet, and the kubelet then calls NodeGetInfo and records the result in CSINode.

    # Inspect CSINode for topology keys a driver exposes
    kubectl get csinode node1 -o yaml
    
    # Relevant section:
    spec:
      drivers:
      - name: ebs.csi.aws.com
        nodeID: i-0abc123def456789
        topologyKeys:
        - topology.kubernetes.io/zone      # driver announces this key
        allocatable:
          count: 25                        # max volumes this node can attach
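
    Given the topology keys a driver announces and the labels on each node, the distinct topology segments follow directly. The sketch below is a simplified illustration under the assumption that every node reports the same topology keys; topology_segments is a hypothetical helper, not a Kubernetes API.

```python
from typing import Dict, List, Tuple

Segment = Tuple[Tuple[str, str], ...]


def topology_segments(nodes: Dict[str, Dict[str, str]],
                      topology_keys: List[str]) -> List[Segment]:
    """Return the distinct (key, value) combinations found across nodes.

    nodes maps a node name to its labels. Nodes missing any of the keys are
    skipped, because they cannot be placed in a segment.
    """
    segments = set()
    for labels in nodes.values():
        if all(key in labels for key in topology_keys):
            segments.add(tuple((key, labels[key]) for key in sorted(topology_keys)))
    return sorted(segments)


ZONE = "topology.kubernetes.io/zone"
cluster = {
    "node1": {ZONE: "us-east-1a"},
    "node2": {ZONE: "us-east-1a"},
    "node3": {ZONE: "us-east-1b"},
    "node4": {},                      # CSI node plugin not registered yet
}
for segment in topology_segments(cluster, [ZONE]):
    print(dict(segment))
```

    Four nodes collapse into two zone segments here, so a zone-scoped driver would need two CSIStorageCapacity objects per StorageClass.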

    Node Volume Attachment Limits

    Every cloud provider imposes a hard limit on how many block volumes can be attached to a single VM instance. Kubernetes enforces these limits via the CSINode.spec.drivers[*].allocatable.count field, which the kubelet fills from the max_volumes_per_node value in the driver's NodeGetInfo response, and via the NodeVolumeLimits scheduler plugin.

    Cloud Provider | Default Limit | Instance-Specific Limits | Notes
    ---------------|---------------|--------------------------|------
    AWS EBS | 25 volumes | Nitro-based: up to 28 (NVMe + EBS); older: 39 incl. root | CSI driver enforces per-node; VOLUMES_LIMIT env or auto-detect
    GCE PD | 16 volumes | N2/C2/M2: up to 128 (NVMe local + PD) | Shared-core (f1/g1): max 16 always
    Azure Disk | 16 volumes | DS-series: up to 64; LS-series: up to 64 | Ultra Disk counts separately from Standard/Premium
    vSphere | 59 volumes | Configurable via vCenter | Includes SCSI controller limit (4 controllers × 15 devices)
    Local (hostPath/local) | Unlimited (disk-based) | Constrained by physical disks | No attachment limit; capacity is the constraint

    Checking Node Volume Limits in the Cluster

    # Check allocatable volume count per node per driver
    kubectl get csinode -o json | jq -r '
      .items[] |
      .metadata.name as $node |
      .spec.drivers[]? |
      select(.allocatable != null) |
      "\($node)\t\(.name)\t\(.allocatable.count // "unlimited")"
    ' | column -t
    
    # Expected output:
    # node1   ebs.csi.aws.com   25
    # node2   ebs.csi.aws.com   25
    # node3   ebs.csi.aws.com   25
    
    # Show the attach limit per node for one driver (compare with the attached counts below)
    kubectl get csinode -o json | jq -r '
      .items[] | .metadata.name as $node |
      .spec.drivers[]? | select(.name=="ebs.csi.aws.com") |
      "\($node) max=\(.allocatable.count)"
    '
    
    # Count currently attached volumes per node
    kubectl get volumeattachments -o json | jq -r '
      [.items[] | select(.status.attached==true) | .spec.nodeName] |
      group_by(.) | map({node: .[0], count: length}) | .[]
      | "\(.node): \(.count) attached"
    '

    Volume Limit Exhaustion Causes Pending Pods

    When a node's volume attachment limit is reached, any pod requiring a new EBS/PD/Azure Disk volume cannot be scheduled to that node. The scheduler event reads: 0/10 nodes are available: 10 node(s) exceed max volume count. Mitigate by right-sizing instances, using NVMe local storage for scratch, or horizontally distributing volumes across more nodes.
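
    Combining the two kubectl queries above gives per-node attachment headroom. The following sketch assumes the limits and the attached-volume node names have already been extracted into plain Python values; attachment_headroom, nodes_near_limit, and the min_free threshold are hypothetical names chosen for the example.

```python
from collections import Counter
from typing import Dict, List


def attachment_headroom(allocatable: Dict[str, int],
                        attached_nodes: List[str]) -> Dict[str, int]:
    """Free attachment slots per node.

    allocatable maps node name to CSINode allocatable.count for one driver;
    attached_nodes holds one node name per attached VolumeAttachment.
    """
    used = Counter(attached_nodes)
    return {node: limit - used.get(node, 0) for node, limit in allocatable.items()}


def nodes_near_limit(allocatable: Dict[str, int],
                     attached_nodes: List[str],
                     min_free: int = 3) -> List[str]:
    """Nodes with fewer than min_free slots left, sorted by name."""
    headroom = attachment_headroom(allocatable, attached_nodes)
    return sorted(node for node, free in headroom.items() if free < min_free)


limits = {"node1": 25, "node2": 25, "node3": 25}
attached = ["node1"] * 24 + ["node2"] * 10
print(attachment_headroom(limits, attached))
print(nodes_near_limit(limits, attached))
```

    In this made-up cluster node1 has a single free slot and is the only node flagged, which is the situation that precedes the "exceed max volume count" scheduling event.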

    CSINode Object Deep Dive

    apiVersion: storage.k8s.io/v1
    kind: CSINode
    metadata:
      name: ip-10-0-1-100.ec2.internal
      ownerReferences:
      - apiVersion: v1
        kind: Node
        name: ip-10-0-1-100.ec2.internal
        uid: abc123...
    spec:
      drivers:
      - name: ebs.csi.aws.com
        nodeID: i-0abc123def456789   # driver's internal ID for this node (EC2 instance ID)
        topologyKeys:
        - topology.kubernetes.io/zone
        allocatable:
          count: 25                  # max EBS volumes attachable; set by NodeGetInfo MaxVolumesPerNode
      - name: efs.csi.aws.com
        nodeID: i-0abc123def456789
        topologyKeys: []             # EFS is regional, no topology key
        # allocatable: not set for NFS-based drivers (no per-node attachment limit)

    Capacity Staleness and Race Conditions

    CSIStorageCapacity objects are snapshots of available capacity at the time the external-provisioner last polled the driver. Between poll cycles, actual capacity can decrease (other PVCs provisioned outside this cluster, quota changes) or increase (volumes deleted). This means the scheduler may make decisions based on stale data.


    Scenario | Effect | Mitigation
    ---------|--------|-----------
    Capacity decreases between poll cycles | Scheduler sends the pod to a zone that is now full; provisioning fails and the PVC stays Pending with ProvisioningFailed | Reduce --capacity-poll-interval; the provisioner retries with exponential backoff
    Capacity increases between poll cycles | Pod is not scheduled to a zone that now has capacity | Wait for the next poll cycle
    Two clusters sharing the same storage backend | Cluster A schedules against capacity that Cluster B is about to consume | Reserve capacity quotas per cluster at the storage layer
    Owner object deleted (a pod restart when --capacity-ownerref-level=0) | All CSIStorageCapacity objects are garbage-collected; capacity-gated pods cannot be scheduled until the objects are republished | Own the objects at Deployment level (--capacity-ownerref-level=2) so they survive pod restarts

    # Reduce poll interval for environments with rapid capacity changes
    args:
    - --capacity-poll-interval=30s   # more aggressive polling (increases driver API calls)
    
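    # Show when each object was created (creationTimestamp does not change on refresh,
    # so it is a lower bound on age, not a last-updated time)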
    kubectl get csistoragecapacity -A \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'

    Local Storage Capacity with TopoLVM

    TopoLVM is a CSI driver that provisions LVM logical volumes from local NVMe disks on each node. It publishes per-node CSIStorageCapacity objects, enabling the scheduler to choose nodes with sufficient local disk space before committing a pod.

    TopoLVM capacity flow:

      Node 1: 800 Gi available in LVM VG
      Node 2: 200 Gi available in LVM VG
      Node 3: 1.5 Ti available in LVM VG

    CSIStorageCapacity objects (node-scoped):

      storageClassName: topolvm-provisioner
      nodeTopology: {kubernetes.io/hostname: node1}
      capacity: 800Gi

      storageClassName: topolvm-provisioner
      nodeTopology: {kubernetes.io/hostname: node2}
      capacity: 200Gi

      storageClassName: topolvm-provisioner
      nodeTopology: {kubernetes.io/hostname: node3}
      capacity: 1.5Ti

    Pod requesting a 400Gi PVC on topolvm-provisioner:

      → Node 2 filtered out (200Gi < 400Gi)
      → Node 1 passes (800Gi >= 400Gi)
      → Node 3 passes (1.5Ti >= 400Gi)
      → Pod scheduled to node1 or node3; the capacity filter alone does not prefer the node with more headroom

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: topolvm-provisioner
    provisioner: topolvm.io
    volumeBindingMode: WaitForFirstConsumer
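    # The per-node capacity filter is enabled by spec.storageCapacity: true on the
    # topolvm.io CSIDriver object; StorageClass has no storageCapacity field
    parameters:
      "csi.storage.k8s.io/fstype": xfs
      # "topolvm.io/device-class": ssd   # target a specific device class (LVM volume group) on the node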

    ResourceQuota for Storage

    Kubernetes ResourceQuota can limit the total storage consumed by PVCs in a namespace, both in aggregate and per StorageClass. This is the primary capacity governance mechanism for multi-tenant clusters.

    Basic Storage Quota

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: storage-quota
      namespace: team-alpha
    spec:
      hard:
        # Total PVC count across all storage classes
        persistentvolumeclaims: "50"
    
        # Total storage across all storage classes
        requests.storage: "10Ti"
    
        # Per-StorageClass limits (format: {storageclass}.storageclass.storage.k8s.io/{resource})
        ebs-gp3.storageclass.storage.k8s.io/persistentvolumeclaims: "20"
        ebs-gp3.storageclass.storage.k8s.io/requests.storage: "5Ti"
    
        ebs-io2.storageclass.storage.k8s.io/persistentvolumeclaims: "5"
        ebs-io2.storageclass.storage.k8s.io/requests.storage: "500Gi"
    
        # Ephemeral storage (not PVC-backed): caps on the sum of pod requests and limits
        requests.ephemeral-storage: "100Gi"
        limits.ephemeral-storage: "200Gi"
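
    The quota keys above combine in a simple way when a new PVC is admitted: the claim adds 1 to each PVC counter and its requested size to each storage total, and every affected key must stay within its hard limit. The sketch below models that arithmetic only; exceeded_quota_keys is a hypothetical helper, not the API server's admission code, and the used values are illustrative.

```python
from typing import Dict, List

TI = 1024 ** 4
GI = 1024 ** 3


def exceeded_quota_keys(hard: Dict[str, int], used: Dict[str, int],
                        storage_class: str, request_bytes: int) -> List[str]:
    """Return the quota keys a new PVC would exceed; an empty list means it fits."""
    prefix = f"{storage_class}.storageclass.storage.k8s.io/"
    increments = {
        "persistentvolumeclaims": 1,
        "requests.storage": request_bytes,
        prefix + "persistentvolumeclaims": 1,
        prefix + "requests.storage": request_bytes,
    }
    return [
        key for key, inc in increments.items()
        if key in hard and used.get(key, 0) + inc > hard[key]
    ]


hard = {
    "persistentvolumeclaims": 50,
    "requests.storage": 10 * TI,
    "ebs-gp3.storageclass.storage.k8s.io/persistentvolumeclaims": 20,
    "ebs-gp3.storageclass.storage.k8s.io/requests.storage": 5 * TI,
}
used = {
    "persistentvolumeclaims": 18,
    "requests.storage": int(3.2 * TI),
    "ebs-gp3.storageclass.storage.k8s.io/requests.storage": 3 * TI,
}
print(exceeded_quota_keys(hard, used, "ebs-gp3", 2 * TI))   # fits exactly
print(exceeded_quota_keys(hard, used, "ebs-gp3", 3 * TI))   # breaks the per-class total
```

    A 2Ti request lands exactly on the 5Ti per-class limit and is admitted; a 3Ti request is rejected by the per-class key even though the namespace-wide 10Ti total still has room.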

    LimitRange for PVC Sizes

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: storage-limits
      namespace: team-alpha
    spec:
      limits:
      - type: PersistentVolumeClaim
        max:
          storage: 1Ti    # no single PVC can request more than 1Ti
        min:
          storage: 1Gi    # no PVC smaller than 1Gi (prevents waste from tiny claims)

    Checking Quota Consumption

    # See current storage quota usage in a namespace
    kubectl describe resourcequota storage-quota -n team-alpha
    
    # Example output:
    # Resource                                               Used   Hard
    # --------                                               ----   ----
    # ebs-gp3.storageclass.storage.k8s.io/requests.storage  3Ti    5Ti
    # persistentvolumeclaims                                 18     50
    # requests.storage                                       3.2Ti  10Ti
    
    # Across all namespaces: show used vs hard for every storage-related quota entry
    kubectl get resourcequota -A -o json | jq -r '
      .items[] |
      .metadata.namespace as $ns |
      .status as $status |
      ($status.hard // {}) | to_entries[] |
      select(.key | contains("storage")) |
      "\($ns)\t\(.key)\t\($status.used[.key] // "0")\t\(.value)"
    ' | column -t

    Multi-Zone Capacity Planning

    For clusters spanning multiple availability zones with zone-scoped storage (EBS, GCE PD, Azure Disk), each zone has independent storage capacity. Imbalanced workloads or a zone outage can push more demand onto the remaining zones than their capacity or quotas allow.

    Zone Capacity Health Check Script

    #!/bin/bash
    # Print available capacity per zone per StorageClass
    echo "StorageClass | Zone | Available Capacity"
    echo "-------------|------|-------------------"
    kubectl get csistoragecapacity -A -o json | jq -r '
      .items[] |
      [
        .storageClassName,
        (.nodeTopology.matchLabels // {} | to_entries | map("\(.key)=\(.value)") | join(",")),
        (.capacity // "unknown")
      ] | @tsv
    ' | sort | column -t -s $'\t'
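
    One way to use the per-zone numbers from the script is an N-1 check: for each zone, ask whether the spare capacity of the other zones could hold the storage currently provisioned in it. The sketch below is an illustrative planning aid under stated assumptions: the provisioned-per-zone figures must come from elsewhere (for example PV node affinity), zone-scoped volumes do not move on their own, and zone_failure_shortfall is a hypothetical name.

```python
from typing import Dict

TI = 1024 ** 4


def zone_failure_shortfall(available: Dict[str, int],
                           provisioned: Dict[str, int]) -> Dict[str, int]:
    """For each zone, bytes that would NOT fit elsewhere if that zone were lost.

    available: free capacity per zone (from CSIStorageCapacity).
    provisioned: bytes currently provisioned per zone (assumed input).
    Zones whose data would fit in the remaining zones are omitted.
    """
    shortfalls = {}
    for zone, needed in provisioned.items():
        spare = sum(cap for other, cap in available.items() if other != zone)
        if needed > spare:
            shortfalls[zone] = needed - spare
    return shortfalls


available = {"us-east-1a": 0, "us-east-1b": 2 * TI, "us-east-1c": 10 * TI}
provisioned = {"us-east-1a": 4 * TI, "us-east-1b": 13 * TI, "us-east-1c": 1 * TI}
print(zone_failure_shortfall(available, provisioned))
```

    With these made-up figures only us-east-1b is flagged: the other two zones have 10Ti spare between them, 3Ti short of what us-east-1b holds.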

    Topology-Aware PVC Placement

    # Force a PVC to a specific zone using allowedTopologies on StorageClass
    # (Useful for StatefulSets with zone-pinned nodes)
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: ebs-gp3-us-east-1a
    provisioner: ebs.csi.aws.com
    volumeBindingMode: WaitForFirstConsumer
    # capacity filtering is enabled on the CSIDriver object (spec.storageCapacity: true), not here
    parameters:
      type: gp3
    allowedTopologies:
    - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
        - us-east-1a    # constrain to single zone for data locality

    Zone-Pinned StorageClasses Reduce Availability

    Constraining a StorageClass to a single zone means if that zone runs out of capacity or goes down, provisioning will fail. Use zone-specific StorageClasses only for workloads with strict data locality requirements (e.g., a database replica that must co-locate with a specific Kafka broker). For general workloads, let the scheduler distribute across zones.

    Capacity Monitoring

    Key Metrics

    Metric | Source | What to Watch
    -------|--------|--------------
    kubelet_volume_stats_capacity_bytes | kubelet | Total filesystem capacity of each mounted PVC (from NodeGetVolumeStats)
    kubelet_volume_stats_used_bytes | kubelet | Used bytes; ratio used/capacity alerts at 80% and 90%
    kubelet_volume_stats_available_bytes | kubelet | Remaining bytes; complements used_bytes
    kubelet_volume_stats_inodes_used | kubelet | Inode exhaustion is independent of byte usage; many small files can exhaust inodes
    kube_persistentvolumeclaim_resource_requests_storage_bytes | kube-state-metrics | Requested storage per PVC; sum by namespace for allocation accounting

    Alerting Rules

    groups:
    - name: storage-capacity
      rules:
      - alert: PVCDiskUsageWarning
        expr: |
          (
            kubelet_volume_stats_used_bytes
            / kubelet_volume_stats_capacity_bytes
          ) > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is >80% full"
          description: "{{ $value | humanizePercentage }} used. Expand PVC or clean up data."
    
      - alert: PVCDiskUsageCritical
        expr: |
          (
            kubelet_volume_stats_used_bytes
            / kubelet_volume_stats_capacity_bytes
          ) > 0.90
        for: 2m
        labels:
          severity: critical
    
      - alert: PVCInodeExhaustion
        expr: |
          (
            kubelet_volume_stats_inodes_used
            / kubelet_volume_stats_inodes
          ) > 0.90
        for: 5m
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} inode usage >90%"
          description: "Inode exhaustion causes 'no space left on device' even with bytes available."
    
      - alert: NodeVolumeAttachmentLimit
        expr: |
          # Custom metric requiring VolumeAttachment count vs CSINode allocatable
          # Proxy: alert when a node has >20 VolumeAttachments (adjust per instance type)
          # kube-state-metrics labels the attachment's node as "node"
          count by (node) (
            kube_volumeattachment_info{attacher="ebs.csi.aws.com"}
          ) > 20
        for: 1m
        annotations:
          summary: "Node {{ $labels.node }} has >20 EBS volumes attached"

    Grafana Dashboard Queries

    # Top 10 largest PVCs by requested storage
    topk(10,
      kube_persistentvolumeclaim_resource_requests_storage_bytes
    )
    
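    # PVC fill rate: estimated time to full (hours).
    # The "> 0" filter drops volumes that are flat or shrinking, which would
    # otherwise yield infinite or negative estimates.
    (
      kubelet_volume_stats_available_bytes
      / (deriv(kubelet_volume_stats_used_bytes[1h]) > 0)
    ) / 3600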
    
    # Total provisioned storage per StorageClass across cluster
    sum by (storageclass) (
      kube_persistentvolumeclaim_resource_requests_storage_bytes
      * on(persistentvolumeclaim, namespace)
      group_left(storageclass) kube_persistentvolumeclaim_info
    )
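
    The time-to-full query divides the remaining bytes by a growth rate that deriv() estimates with a least-squares line over the window. The sketch below reproduces that arithmetic offline for a handful of samples; it is an approximation for illustration, the function names are hypothetical, and the sample values are invented.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]   # (unix timestamp in seconds, used bytes)


def least_squares_slope(samples: List[Sample]) -> Optional[float]:
    """Slope of the best-fit line through the samples, in bytes per second."""
    if len(samples) < 2:
        return None
    mean_t = sum(t for t, _ in samples) / len(samples)
    mean_v = sum(v for _, v in samples) / len(samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        return None                       # all samples share one timestamp
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return numerator / denominator


def hours_to_full(samples: List[Sample], available_bytes: float) -> Optional[float]:
    """Estimated hours until the volume fills, or None if it is not growing."""
    slope = least_squares_slope(samples)
    if slope is None or slope <= 0:
        return None
    return available_bytes / slope / 3600


# A volume growing by 1 MB per second with 7.2 GB left.
history = [(0.0, 0.0), (600.0, 600e6), (1200.0, 1200e6)]
print(hours_to_full(history, 7.2e9))
```

    Returning None for flat or shrinking volumes mirrors the need to exclude non-positive slopes before dividing.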

    Runbooks

    Pod Stuck: "exceed max volume count"

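    Compare the node's VolumeAttachment count with CSINode.spec.drivers[*].allocatable.count. If the node is at its limit, cordon it and reschedule pods to free attachments, or move to an instance type with a higher attachment limit. For AWS, check the documented limit for the specific instance type, since on Nitro-based instances EBS volumes share attachment slots with other devices such as instance-store NVMe.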

    PVC Stuck Pending: ProvisioningFailed

    Describe the PVC for events: kubectl describe pvc NAME. If the event reports ProvisioningFailed with no capacity in zone X, check kubectl get csistoragecapacity -A for that zone's capacity. If the zone is full, expand quota at the storage layer, delete unused PVs, or use a different zone via allowedTopologies.

    PVC >80% Full — Online Expansion

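    Edit the PVC with kubectl edit pvc NAME and increase spec.resources.requests.storage. The StorageClass needs allowVolumeExpansion: true. Watch for the condition FileSystemResizePending: it clears once the kubelet has resized the filesystem on the node, which happens while the pod is running if the driver supports online expansion and otherwise at the next pod restart. Monitor kubelet_volume_stats_capacity_bytes to confirm the expansion.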

    Inode Exhaustion

    Confirm: kubectl exec POD -- df -i /mount/path. If inodes are exhausted but bytes are free, the volume holds many small files (logs, cache). Fix: delete small files, or expand the volume. On ext4 the bytes-per-inode ratio is fixed at mkfs time: growing the filesystem adds inodes in proportion to the added space, but changing the ratio means recreating the filesystem on a replacement volume. xfs allocates inodes dynamically as files are created.
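
    A quick estimate shows why small files exhaust inodes first on ext4. The sketch below assumes the common mke2fs default of one inode per 16384 bytes; the actual ratio depends on the mkfs options and configuration used when the volume was formatted, real filesystems round per block group, and both function names are hypothetical.

```python
GI = 1024 ** 3
DEFAULT_BYTES_PER_INODE = 16384   # assumed mke2fs default; verify for your image


def estimated_ext4_inodes(fs_bytes: int,
                          bytes_per_inode: int = DEFAULT_BYTES_PER_INODE) -> int:
    """Approximate inode count for a filesystem of the given size."""
    return fs_bytes // bytes_per_inode


def inodes_run_out_first(avg_file_bytes: int,
                         bytes_per_inode: int = DEFAULT_BYTES_PER_INODE) -> bool:
    """True if files this small would use up inodes before they use up bytes."""
    return avg_file_bytes < bytes_per_inode


print(estimated_ext4_inodes(100 * GI))     # about 6.5 million inodes on 100Gi
print(inodes_run_out_first(4 * 1024))      # 4 KiB average file size
print(inodes_run_out_first(64 * 1024))     # 64 KiB average file size
```

    Under that assumption a 100Gi volume has roughly 6.5 million inodes, so a workload averaging 4 KiB per file would hit the inode limit at about a quarter of the byte capacity.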

    CSIStorageCapacity Objects Stale / Missing

    Check provisioner pod: kubectl logs -n kube-system deploy/ebs-csi-controller -c csi-provisioner | grep capacity. If GetCapacity: Unimplemented: driver doesn't support it — capacity tracking unavailable. If objects missing after provisioner restart: wait for next poll cycle or reduce --capacity-poll-interval.

    Best Practices

    1. Enable storageCapacity: true on the CSIDriver object of every zone-scoped driver that publishes capacity, and pair it with volumeBindingMode: WaitForFirstConsumer on its StorageClasses. This prevents the most common cause of ProvisioningFailed in multi-zone clusters: the scheduler committing to a full zone before the provisioner discovers there is no space.
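    2. Verify your CSI driver implements GetCapacity: check the ControllerGetCapabilities response for GET_CAPACITY or the driver's release notes. Drivers that manage finite local or pooled storage, such as TopoLVM, implement it; many cloud block-storage and in-tree-replaced drivers do not, so confirm support for your driver and version rather than assuming it.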
    3. Set ResourceQuota per namespace per StorageClass — prevent single teams from exhausting shared storage pools. Use both requests.storage (bytes) and persistentvolumeclaims (count) limits. Enforce a LimitRange minimum (e.g., 1Gi) to prevent dozens of trivially small PVCs that waste API objects.
    4. Monitor inode usage alongside byte usage: track kubelet_volume_stats_inodes_used / kubelet_volume_stats_inodes. Inode exhaustion produces the same error as byte exhaustion but is invisible to byte-only monitoring. This is especially relevant for log directories and package caches.
    5. Use xfs instead of ext4 for large volumes where inode density matters — xfs dynamically allocates inodes from free space; ext4 fixes inode count at mkfs time. For volumes holding many small files (log aggregators, CI artifact stores), xfs avoids inode exhaustion surprises.
    6. Plan for node volume attachment headroom — for AWS, budget 20 EBS volumes per node as a safe limit (leaving headroom for OS volumes, instance store NVMe). Use larger instance types (m5.4xlarge vs m5.large) or NVMe local storage for scratch to free attachment slots for persistent data.
    7. Reduce --capacity-poll-interval in rapidly-changing environments — default 1 minute is acceptable for most clusters. In high-churn test environments where PVCs are created and deleted constantly, shorten to 30 seconds to keep capacity data fresh. Weigh against increased driver API calls (AWS EC2 DescribeVolumes rate limits).
    8. Alert on fill rate, not just utilization — a volume going from 50% to 90% in one hour is more urgent than a volume at 85% stable. Use deriv(kubelet_volume_stats_used_bytes[1h]) to estimate time-to-full and page on <4 hours remaining regardless of current percentage.