Storage Capacity
The Pre-GA Capacity Problem
Before Kubernetes 1.24, the scheduler had no knowledge of how much storage capacity was available in any given topology zone. When a pod with a WaitForFirstConsumer PVC was scheduled to a node in zone us-east-1a, the scheduler committed the pod to that zone before the CSI driver attempted to provision the volume. If the provisioner then discovered that zone us-east-1a had no capacity (quota exhausted, regional limits hit, pool full), the PVC remained Pending with a ProvisioningFailed event, and the pod could stay stuck until capacity was freed or an operator intervened.
CSIStorageCapacity Object
CSIStorageCapacity is a namespace-scoped API object (in the same namespace as the CSI controller pod) that represents the available storage capacity for a specific StorageClass within a specific topology segment (zone, rack, node). The external-provisioner sidecar creates and updates these objects periodically.
Object Structure
apiVersion: storage.k8s.io/v1
kind: CSIStorageCapacity
metadata:
  name: csi-sc-ebs-gp3-us-east-1b-a1b2c3   # auto-generated name
  namespace: kube-system                   # same ns as CSI controller
  ownerReferences:                         # object is garbage-collected when this owner is deleted
  - apiVersion: apps/v1
    kind: ReplicaSet                       # a ReplicaSet owner corresponds to --capacity-ownerref-level=1
    name: ebs-csi-controller-7d9f8b
    uid: a1b2c3d4-...
    controller: true
    blockOwnerDeletion: true
storageClassName: ebs-gp3                  # the StorageClass this applies to
nodeTopology:                              # which topology segment
  matchLabels:
    topology.kubernetes.io/zone: us-east-1b
capacity: "2Ti"                            # available capacity in this segment
maximumVolumeSize: "16Ti"                  # maximum single-volume size allowed
| Field | Type | Description |
|---|---|---|
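| storageClassName | string (required) | Name of the StorageClass the reported capacity applies to. The scheduler consults the object only when the driver's CSIDriver object sets spec.storageCapacity: true. |
| nodeTopology | LabelSelector (optional) | Nodes that can access the storage. An empty selector means every node (e.g., NFS); an unset (nil) selector means no node has access. Typically matchLabels with a zone key. |
| capacity | Quantity (optional) | Available storage. Nil means the information is unavailable; if maximumVolumeSize is also nil, the scheduler treats the segment as having insufficient capacity. |
| maximumVolumeSize | Quantity (optional) | Largest single volume this driver can create in this segment. When set, the scheduler compares the PVC size against this value in preference to capacity. |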
Listing CSIStorageCapacity Objects
# List all capacity objects across all namespaces
kubectl get csistoragecapacity -A
# Show capacity per zone for a specific storage class
kubectl get csistoragecapacity -A \
-o custom-columns='NAMESPACE:.metadata.namespace,SC:.storageClassName,TOPOLOGY:.nodeTopology,CAPACITY:.capacity' \
--sort-by='.storageClassName'
# Example output:
# NAMESPACE SC TOPOLOGY CAPACITY
# kube-system ebs-gp3 map[topology.kubernetes.io/zone:us-east-1a] 5497558138880
# kube-system ebs-gp3 map[topology.kubernetes.io/zone:us-east-1b] 2199023255552
# kube-system ebs-gp3 map[topology.kubernetes.io/zone:us-east-1c] 10995116277760
# Watch for capacity updates in real time
kubectl get csistoragecapacity -A -w
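The CAPACITY column above prints raw byte counts, while the manifests use binary-suffixed quantities such as 2Ti. A minimal Python sketch for converting between the two follows. The function names are illustrative, and it handles only whole numbers with binary suffixes (Ki through Ei), which is a simplifying assumption rather than the full Kubernetes quantity grammar.

```python
# Binary (power-of-two) suffixes used by Kubernetes quantities.
_BINARY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}


def parse_quantity(text: str) -> int:
    """Convert a plain integer or a binary-suffixed quantity ('2Ti') to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def humanize_bytes(value: int) -> str:
    """Render a byte count with the largest binary suffix that divides it evenly."""
    for suffix, factor in sorted(_BINARY_SUFFIXES.items(), key=lambda kv: -kv[1]):
        if value >= factor and value % factor == 0:
            return f"{value // factor}{suffix}"
    return str(value)


if __name__ == "__main__":
    # Byte counts taken from the example output above.
    for raw in (5497558138880, 2199023255552, 10995116277760):
        print(raw, "=", humanize_bytes(raw))
```

Values that are not an exact multiple of a suffix are returned unchanged as a byte count, which avoids implying precision the input does not have.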
How external-provisioner Publishes Capacity
The external-provisioner sidecar is responsible for creating and refreshing CSIStorageCapacity objects. It does this by calling the CSI GetCapacity RPC on the controller plugin for each topology segment it knows about, then writing or updating the corresponding object.
Enabling Capacity Tracking on the Provisioner
# In the CSI controller Deployment, external-provisioner container:
containers:
- name: external-provisioner
  image: registry.k8s.io/sig-storage/csi-provisioner:v4.0.0
  args:
  - --csi-address=/var/lib/csi/sockets/pluginproxy/csi.sock
  - --leader-election
  - --feature-gates=Topology=true
  - --enable-capacity                # enable CSIStorageCapacity publishing
  - --capacity-ownerref-level=2      # 0=pod, 1=replicaset, 2=deployment owner ref level
  - --capacity-poll-interval=1m      # how often to refresh capacity (default 1m)
# RBAC needed: create/update/delete csistoragecapacities in controller namespace
Not all CSI drivers implement GetCapacity. If the driver returns UNIMPLEMENTED, the external-provisioner cannot publish capacity for that driver and no objects are created. Leave CSIDriver.spec.storageCapacity unset (or false) for such a driver so the scheduler keeps the pre-1.24 optimistic behavior; if the field is true while no objects exist, the scheduler treats every node as lacking capacity. Check driver release notes for GET_CAPACITY support in ControllerGetCapabilities.
CSIDriver storageCapacity Gate
Capacity-aware scheduling is opt-in per driver via the spec.storageCapacity field of the CSIDriver object; StorageClass has no such field. Without it, the scheduler ignores CSIStorageCapacity objects entirely even if they exist.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com
spec:
  storageCapacity: true                    # enable capacity-aware scheduling for this driver
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer    # REQUIRED for capacity tracking to be useful
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
With volumeBindingMode: Immediate, PVCs are bound before any pod is scheduled, so the scheduler never has the chance to filter nodes based on capacity. The CSIStorageCapacity objects are published but ignored by the scheduling pipeline. Always use WaitForFirstConsumer for topology-aware drivers when capacity tracking matters.
Capacity-Aware Scheduling Flow
When a pod with an unbound WaitForFirstConsumer PVC is scheduled, the VolumeBinding plugin in kube-scheduler filters and scores nodes using CSIStorageCapacity data.
VolumeBinding Plugin Behavior with Nil Capacity
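| CSIStorageCapacity State | Scheduler Behavior | Risk |
|---|---|---|
| Object exists, maximumVolumeSize (or capacity, if maximumVolumeSize is nil) >= PVC size | Node passes filter | Low: capacity confirmed at last poll |
| Object exists, maximumVolumeSize (or capacity, if maximumVolumeSize is nil) < PVC size | Node filtered out | None: avoided bad zone |
| Object exists, capacity and maximumVolumeSize both nil | Node filtered out (treated as insufficient) | Pod stays Pending if no segment reports capacity |
| No object for this (SC, topology) | Node filtered out (treated as insufficient) | Pod stays Pending until the provisioner publishes objects |
| CSIDriver.spec.storageCapacity unset or false | Capacity check skipped; node passes (optimistic) | May fail at provision time |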
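The filter can be modeled in a few lines. The following is a minimal Python sketch based on the CSIStorageCapacity API documentation, which states that the scheduler compares the request against maximumVolumeSize, falls back to capacity when that is unset, and assumes insufficient capacity when both are unset or no matching object exists. It applies only when the driver has opted in to capacity-aware scheduling. The class and function names are illustrative and this is not the kube-scheduler source.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Capacity:
    """Simplified stand-in for one CSIStorageCapacity object (sizes in bytes)."""
    storage_class: str
    match_labels: Optional[Dict[str, str]]  # None: no node has access; {}: every node
    capacity: Optional[int] = None
    maximum_volume_size: Optional[int] = None


def volume_limit(cap: Capacity) -> Optional[int]:
    """Prefer maximumVolumeSize; fall back to the less precise capacity."""
    if cap.maximum_volume_size is not None:
        return cap.maximum_volume_size
    return cap.capacity


def node_has_access(node_labels: Dict[str, str], cap: Capacity) -> bool:
    """A node belongs to the segment when every matchLabels entry matches."""
    if cap.match_labels is None:
        return False
    return all(node_labels.get(k) == v for k, v in cap.match_labels.items())


def node_passes(node_labels: Dict[str, str], storage_class: str,
                request_bytes: int, capacities: List[Capacity]) -> bool:
    """Return True if any capacity object confirms room for the request."""
    for cap in capacities:
        limit = volume_limit(cap)
        if (cap.storage_class == storage_class
                and limit is not None
                and limit >= request_bytes
                and node_has_access(node_labels, cap)):
            return True
    return False  # no object, nil values, or too small: treated as insufficient
```

Calling node_passes for a node labeled with zone us-east-1a, StorageClass ebs-gp3, and a 50Gi request returns True only if some object for that class and zone reports a limit of at least 50Gi.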
Topology Segments
A topology segment is a set of node labels that define a storage domain. For zone-scoped drivers (AWS EBS, GCE PD, Azure Disk), each availability zone is one segment. For node-scoped drivers (local volumes, TopoLVM), each node is one segment.
# Zone-scoped: one CSIStorageCapacity per zone per StorageClass
nodeTopology:
  matchLabels:
    topology.kubernetes.io/zone: us-east-1a
# Node-scoped: one CSIStorageCapacity per node per StorageClass (e.g., TopoLVM)
nodeTopology:
  matchLabels:
    kubernetes.io/hostname: node1
# Multi-label (rack-aware Ceph):
nodeTopology:
  matchLabels:
    topology.kubernetes.io/zone: us-east-1a
    topology.rook.io/rack: rack2
# Cluster-wide (NFS, CephFS with global pool): use an empty selector.
# An omitted nodeTopology means no node can access the storage.
# nodeTopology: {}
# capacity: "50Ti"
How Topology Keys Are Discovered
The external-provisioner discovers topology keys from CSINode objects. Each node running the CSI node plugin has a CSINode object updated by the node-driver-registrar sidecar after NodeGetInfo is called.
# Inspect CSINode for topology keys a driver exposes
kubectl get csinode node1 -o yaml
# Relevant section:
spec:
  drivers:
  - name: ebs.csi.aws.com
    nodeID: i-0abc123def456789
    topologyKeys:
    - topology.kubernetes.io/zone    # driver announces this key
    allocatable:
      count: 25                      # max volumes this node can attach
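To make the link between topology keys and segments concrete, here is a minimal Python sketch that groups nodes into distinct segments given the keys a driver announces. It only illustrates the grouping idea and is not how external-provisioner is implemented; the node names and labels are made up.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

Segment = FrozenSet[Tuple[str, str]]


def segment_for_node(node_labels: Dict[str, str],
                     topology_keys: Iterable[str]) -> Optional[Segment]:
    """Return the node's segment, or None if it lacks one of the keys."""
    keys = list(topology_keys)
    if not all(k in node_labels for k in keys):
        return None
    return frozenset((k, node_labels[k]) for k in keys)


def distinct_segments(nodes: Dict[str, Dict[str, str]],
                      topology_keys: Iterable[str]) -> List[Dict[str, str]]:
    """Collect the unique segments across all nodes, in a stable order."""
    keys = list(topology_keys)
    seen = {segment_for_node(labels, keys) for labels in nodes.values()}
    seen.discard(None)
    return sorted((dict(s) for s in seen), key=lambda d: sorted(d.items()))


if __name__ == "__main__":
    nodes = {
        "node1": {"topology.kubernetes.io/zone": "us-east-1a", "kubernetes.io/hostname": "node1"},
        "node2": {"topology.kubernetes.io/zone": "us-east-1a", "kubernetes.io/hostname": "node2"},
        "node3": {"topology.kubernetes.io/zone": "us-east-1b", "kubernetes.io/hostname": "node3"},
    }
    print(distinct_segments(nodes, ["topology.kubernetes.io/zone"]))
    print(distinct_segments(nodes, ["kubernetes.io/hostname"]))
```

With the zone key the three nodes collapse into two segments, while the hostname key yields one segment per node, which matches the zone-scoped and node-scoped cases described above.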
Node Volume Attachment Limits
Every cloud provider imposes a hard limit on how many block volumes can be attached to a single VM instance. Kubernetes enforces these limits via the CSINode.spec.drivers[*].allocatable.count field, which the NodeVolumeLimits scheduler plugin checks. The driver reports the value as max_volumes_per_node in its NodeGetInfo response.
| Cloud Provider | Default Limit | Instance-Specific Limits | Notes |
|---|---|---|---|
| AWS EBS | 25 volumes | Nitro-based: up to 28 (NVMe + EBS); older: 39 incl. root | CSI driver enforces per-node; VOLUMES_LIMIT env or auto-detect |
| GCE PD | 16 volumes | N2/C2/M2: up to 128 (NVMe local + PD) | Shared-core (f1/g1): max 16 always |
| Azure Disk | 16 volumes | DS-series: up to 64; LS-series: up to 64 | Ultra Disk counts separately from Standard/Premium |
| vSphere | 59 volumes | Configurable via vCenter | Includes SCSI controller limit (4 controllers × 15 devices) |
| Local (hostPath/local) | Unlimited (disk-based) | Constrained by physical disks | No attachment limit; capacity is the constraint |
Checking Node Volume Limits in the Cluster
# Check allocatable volume count per node per driver
kubectl get csinode -o json | jq -r '
.items[] |
.metadata.name as $node |
.spec.drivers[]? |
select(.allocatable != null) |
"\($node)\t\(.name)\t\(.allocatable.count // "unlimited")"
' | column -t
# Expected output:
# node1 ebs.csi.aws.com 25
# node2 ebs.csi.aws.com 25
# node3 ebs.csi.aws.com 25
# Find nodes approaching volume limit
kubectl get csinode -o json | jq -r '
.items[] | .metadata.name as $node |
.spec.drivers[]? | select(.name=="ebs.csi.aws.com") |
"\($node) max=\(.allocatable.count)"
'
# Count currently attached volumes per node
kubectl get volumeattachments -o json | jq -r '
[.items[] | select(.status.attached==true) | .spec.nodeName] |
group_by(.) | map({node: .[0], count: length}) | .[]
| "\(.node): \(.count) attached"
'
When a node's volume attachment limit is reached, any pod requiring a new EBS/PD/Azure Disk volume cannot be scheduled to that node. The scheduler event reads: 0/10 nodes are available: 10 node(s) exceed max volume count. Mitigate by right-sizing instances, using NVMe local storage for scratch, or horizontally distributing volumes across more nodes.
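The two kubectl queries above can be combined to compute remaining attachment slots per node. The following Python sketch assumes the inputs are the items arrays from kubectl get csinode -o json and kubectl get volumeattachments -o json, already parsed into dictionaries; the function name is illustrative.

```python
from collections import Counter
from typing import Any, Dict, List


def attachment_headroom(csinodes: List[Dict[str, Any]],
                        volume_attachments: List[Dict[str, Any]],
                        driver: str) -> Dict[str, int]:
    """Return allocatable.count minus attached volumes, per node, for one driver."""
    # Count only attachments that belong to this driver and are reported attached.
    attached = Counter(
        va["spec"]["nodeName"]
        for va in volume_attachments
        if va["spec"].get("attacher") == driver
        and (va.get("status") or {}).get("attached") is True
    )
    headroom: Dict[str, int] = {}
    for node in csinodes:
        for drv in node["spec"].get("drivers") or []:
            count = (drv.get("allocatable") or {}).get("count")
            # Drivers without an allocatable count have no attachment limit to track.
            if drv["name"] == driver and count is not None:
                name = node["metadata"]["name"]
                headroom[name] = count - attached[name]
    return headroom
```

A node whose headroom reaches zero is the one the scheduler will reject with the exceed max volume count message, so sorting the result ascending gives a quick list of nodes to watch.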
CSINode Object Deep Dive
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: ip-10-0-1-100.ec2.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-10-0-1-100.ec2.internal
    uid: abc123...
spec:
  drivers:
  - name: ebs.csi.aws.com
    nodeID: i-0abc123def456789       # driver's internal ID for this node (EC2 instance ID)
    topologyKeys:
    - topology.kubernetes.io/zone
    allocatable:
      count: 25                      # max EBS volumes attachable; set by NodeGetInfo MaxVolumesPerNode
  - name: efs.csi.aws.com
    nodeID: i-0abc123def456789
    topologyKeys: []                 # EFS is regional, no topology key
    # allocatable: not set for NFS-based drivers (no per-node attachment limit)
Capacity Staleness and Race Conditions
CSIStorageCapacity objects are snapshots of available capacity at the time the external-provisioner last polled the driver. Between poll cycles, actual capacity can decrease (other PVCs provisioned outside this cluster, quota changes) or increase (volumes deleted). This means the scheduler may make decisions based on stale data.
| Scenario | Effect | Mitigation |
|---|---|---|
| Capacity decreases between poll cycles | Scheduler sends pod to zone, provisioner fails — PVC stuck Pending with ProvisioningFailed | Reduce --capacity-poll-interval; provisioner retries with exponential backoff |
| Capacity increases between poll cycles | Pod not scheduled to zone that now has capacity | Wait for next poll cycle; or manually trigger provisioner resync |
| Two clusters sharing same storage backend | Cluster A schedules based on capacity Cluster B is about to consume | Reserve capacity quotas per cluster at storage layer |
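| Provisioner pod restarts | With a pod-level owner ref (--capacity-ownerref-level=0), all CSIStorageCapacity objects are GC'd with the pod; pods needing new volumes are treated as lacking capacity until the objects are republished | Fast provisioner restart; owner ref at Deployment level (--capacity-ownerref-level=2) so objects survive pod restarts |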
# Reduce poll interval for environments with rapid capacity changes
args:
- --capacity-poll-interval=30s # more aggressive polling (increases driver API calls)
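# Check when each capacity object was created. creationTimestamp does not change on
# refresh; to see the most recent write, add --show-managed-fields and read
# .metadata.managedFields[*].time instead.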
kubectl get csistoragecapacity -A \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'
Local Storage Capacity with TopoLVM
TopoLVM is a CSI driver that provisions LVM logical volumes from local NVMe disks on each node. It publishes per-node CSIStorageCapacity objects, enabling the scheduler to choose nodes with sufficient local disk space before committing a pod.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topolvm-provisioner
provisioner: topolvm.io
volumeBindingMode: WaitForFirstConsumer    # required for the per-node capacity filter
# The capacity filter itself is enabled by spec.storageCapacity: true on the
# topolvm.io CSIDriver object, not by a StorageClass field.
parameters:
  "csi.storage.k8s.io/fstype": xfs
  # "topolvm.io/device-class": ssd         # target a specific device class (LVM VG) on the node
ResourceQuota for Storage
Kubernetes ResourceQuota can limit the total storage consumed by PVCs in a namespace, both in aggregate and per StorageClass. This is the primary capacity governance mechanism for multi-tenant clusters.
Basic Storage Quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: team-alpha
spec:
  hard:
    # Total PVC count across all storage classes
    persistentvolumeclaims: "50"
    # Total storage across all storage classes
    requests.storage: "10Ti"
    # Per-StorageClass limits (format: {storageclass}.storageclass.storage.k8s.io/{resource})
    ebs-gp3.storageclass.storage.k8s.io/persistentvolumeclaims: "20"
    ebs-gp3.storageclass.storage.k8s.io/requests.storage: "5Ti"
    ebs-io2.storageclass.storage.k8s.io/persistentvolumeclaims: "5"
    ebs-io2.storageclass.storage.k8s.io/requests.storage: "500Gi"
    # Ephemeral storage (sum of container requests and sum of container limits)
    requests.ephemeral-storage: "100Gi"
    limits.ephemeral-storage: "200Gi"
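The quota keys above combine as follows when a new PVC is admitted: the claim counts against the namespace-wide totals and against the totals for its StorageClass. The Python sketch below reproduces that arithmetic for the four PVC-related keys. The helper names are illustrative, the quantity parser handles only binary suffixes, and the real check is performed by the API server's quota admission, not by code like this.

```python
from decimal import Decimal
from typing import Dict, List

_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50}


def to_bytes(quantity: str) -> int:
    """Parse a plain number or binary-suffixed quantity such as '5Ti'."""
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))


def exceeded_quotas(hard: Dict[str, str], used: Dict[str, str],
                    storage_class: str, request: str) -> List[str]:
    """Return the quota keys that one new PVC of size `request` would exceed."""
    prefix = f"{storage_class}.storageclass.storage.k8s.io/"
    size = to_bytes(request)
    # Each new PVC adds one to the count keys and its size to the storage keys.
    deltas = {
        "persistentvolumeclaims": 1,
        "requests.storage": size,
        prefix + "persistentvolumeclaims": 1,
        prefix + "requests.storage": size,
    }
    exceeded = []
    for key, delta in deltas.items():
        if key in hard and to_bytes(used.get(key, "0")) + delta > to_bytes(hard[key]):
            exceeded.append(key)
    return exceeded
```

An empty list means the claim fits under every limit that is defined; keys absent from hard are simply not enforced.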
LimitRange for PVC Sizes
apiVersion: v1
kind: LimitRange
metadata:
  name: storage-limits
  namespace: team-alpha
spec:
  limits:
  - type: PersistentVolumeClaim
    max:
      storage: 1Ti   # no single PVC can request more than 1Ti
    min:
      storage: 1Gi   # no PVC smaller than 1Gi (prevents waste from tiny claims)
Checking Quota Consumption
# See current storage quota usage in a namespace
kubectl describe resourcequota storage-quota -n team-alpha
# Example output:
# Resource Used Hard
# -------- ---- ----
# ebs-gp3.storageclass.storage.k8s.io/requests.storage 3Ti 5Ti
# persistentvolumeclaims 18 50
# requests.storage 3.2Ti 10Ti
# Across all namespaces: list used vs. hard for every storage-related quota key
# Columns: namespace, quota key, used, hard
kubectl get resourcequota -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .status as $status |
  ($status.hard // {}) | to_entries[] |
  select(.key | contains("storage")) |
  "\($ns)\t\(.key)\t\($status.used[.key] // "0")\t\(.value)"
' | column -t
Multi-Zone Capacity Planning
For clusters spanning multiple availability zones with zone-scoped storage (EBS, GCE PD, Azure Disk), each zone has independent storage capacity. Imbalanced workloads or a zone outage can push more volumes into the remaining zones than their capacity or storage quotas allow.
Zone Capacity Health Check Script
#!/bin/bash
# Print available capacity per zone per StorageClass
echo "StorageClass | Zone | Available Capacity"
echo "-------------|------|-------------------"
kubectl get csistoragecapacity -A -o json | jq -r '
.items[] |
[
.storageClassName,
(.nodeTopology.matchLabels // {} | to_entries | map("\(.key)=\(.value)") | join(",")),
(.capacity // "unknown")
] | @tsv
' | sort | column -t -s $'\t'
Topology-Aware PVC Placement
# Force a PVC to a specific zone using allowedTopologies on StorageClass
# (Useful for StatefulSets with zone-pinned nodes)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3-us-east-1a
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - us-east-1a   # constrain to single zone for data locality
Constraining a StorageClass to a single zone means if that zone runs out of capacity or goes down, provisioning will fail. Use zone-specific StorageClasses only for workloads with strict data locality requirements (e.g., a database replica that must co-locate with a specific Kafka broker). For general workloads, let the scheduler distribute across zones.
Capacity Monitoring
Key Metrics
| Metric | Source | What to Watch |
|---|---|---|
| kubelet_volume_stats_capacity_bytes | kubelet | Total filesystem capacity of each mounted PVC (from NodeGetVolumeStats) |
| kubelet_volume_stats_used_bytes | kubelet | Used bytes; ratio used/capacity alerts at 80% and 90% |
| kubelet_volume_stats_available_bytes | kubelet | Remaining bytes; complements used_bytes |
| kubelet_volume_stats_inodes_used | kubelet | Inode exhaustion is independent of byte usage; many small files can exhaust inodes |
| kube_persistentvolumeclaim_resource_requests_storage_bytes | kube-state-metrics | Requested storage per PVC; sum by namespace for allocation accounting |
Alerting Rules
groups:
- name: storage-capacity
  rules:
  - alert: PVCDiskUsageWarning
    expr: |
      (
        kubelet_volume_stats_used_bytes
        / kubelet_volume_stats_capacity_bytes
      ) > 0.80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is >80% full"
      description: "{{ $value | humanizePercentage }} used. Expand PVC or clean up data."
  - alert: PVCDiskUsageCritical
    expr: |
      (
        kubelet_volume_stats_used_bytes
        / kubelet_volume_stats_capacity_bytes
      ) > 0.90
    for: 2m
    labels:
      severity: critical
  - alert: PVCInodeExhaustion
    expr: |
      (
        kubelet_volume_stats_inodes_used
        / kubelet_volume_stats_inodes
      ) > 0.90
    for: 5m
    annotations:
      summary: "PVC {{ $labels.persistentvolumeclaim }} inode usage >90%"
      description: "Inode exhaustion causes 'no space left on device' even with bytes available."
  - alert: NodeVolumeAttachmentLimit
    expr: |
      # Custom metric requiring VolumeAttachment count vs CSINode allocatable
      # Proxy: alert when a node has >20 VolumeAttachments (adjust per instance type)
      # kube-state-metrics exposes the node name in the "node" label
      count by (node) (
        kube_volumeattachment_info{attacher="ebs.csi.aws.com"}
      ) > 20
    for: 1m
    annotations:
      summary: "Node {{ $labels.node }} has >20 EBS volumes attached"
Grafana Dashboard Queries
# Top 10 largest PVCs by requested storage
topk(10,
kube_persistentvolumeclaim_resource_requests_storage_bytes
)
# PVC fill rate: estimated time to full (hours), only for volumes whose usage is growing
(
  kubelet_volume_stats_available_bytes
  / (deriv(kubelet_volume_stats_used_bytes[1h]) > 0)
) / 3600
# Total provisioned storage per StorageClass across cluster
sum by (storageclass) (
kube_persistentvolumeclaim_resource_requests_storage_bytes
* on(persistentvolumeclaim, namespace)
group_left(storageclass) kube_persistentvolumeclaim_info
)
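The time-to-full query divides the remaining bytes by the per-second growth rate and converts seconds to hours. A short Python sketch of the same arithmetic, with made-up numbers for a 100Gi volume and a hypothetical four-hour paging threshold, shows why a fast-filling volume is more urgent than a fuller but stable one.

```python
from typing import Optional

GIB = 2**30


def hours_to_full(available_bytes: float, fill_rate_bytes_per_sec: float) -> Optional[float]:
    """Estimate hours until the volume is full; None if usage is flat or shrinking."""
    if fill_rate_bytes_per_sec <= 0:
        return None
    return available_bytes / fill_rate_bytes_per_sec / 3600


def should_page(hours: Optional[float], threshold_hours: float = 4.0) -> bool:
    """Page only when a finite estimate falls below the threshold."""
    return hours is not None and hours < threshold_hours


if __name__ == "__main__":
    # Volume A: went from 50% to 90% in one hour, so 10Gi remain and 40Gi/hour is being written.
    fast = hours_to_full(10 * GIB, 40 * GIB / 3600)
    # Volume B: stable at 85%, so 15Gi remain and usage is not growing.
    stable = hours_to_full(15 * GIB, 0)
    print("fast-filling volume:", fast, "hours, page:", should_page(fast))
    print("stable volume:", stable, "page:", should_page(stable))
```

The fast-filling volume has roughly a quarter of an hour left despite showing a utilization similar to the stable one, which a percentage threshold alone would not distinguish.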
Runbooks
Check node VolumeAttachment count vs CSINode.spec.drivers[*].allocatable.count. If at limit: cordon the node and reschedule pods to free attachments, or use a larger instance type with higher volume limit. For AWS: Nitro-based instances allow more NVMe EBS volumes.
Describe PVC for events: kubectl describe pvc NAME. If ProvisioningFailed: no capacity in zone X: check kubectl get csistoragecapacity -A for zone capacity. If zone is full: expand quota at storage layer, delete unused PVs, or use a different zone via allowedTopologies.
Edit PVC: kubectl edit pvc NAME, increase spec.resources.requests.storage. StorageClass needs allowVolumeExpansion: true. Watch for condition FileSystemResizePending, which clears once the kubelet has resized the filesystem; drivers that support online expansion do this while the pod is running, others only after the pod restarts. Monitor kubelet_volume_stats_capacity_bytes to confirm expansion.
Confirm: kubectl exec POD -- df -i /mount/path. If inodes are full but bytes are free, the volume holds many small files (logs, cache). Fix: delete small files. On ext4 the bytes-per-inode ratio is fixed at mkfs time, so expanding the volume adds inodes only in proportion to the added space; a denser inode table requires recreating the filesystem, which means replacing the volume. xfs allocates inodes dynamically from free space.
Check provisioner pod: kubectl logs -n kube-system deploy/ebs-csi-controller -c csi-provisioner | grep capacity. If GetCapacity: Unimplemented: driver doesn't support it — capacity tracking unavailable. If objects missing after provisioner restart: wait for next poll cycle or reduce --capacity-poll-interval.
Best Practices
- Set storageCapacity: true on the CSIDriver object of every zone-scoped driver that implements GetCapacity, and pair it with volumeBindingMode: WaitForFirstConsumer on its StorageClasses. This prevents the most common cause of ProvisioningFailed in multi-zone clusters: the scheduler committing to a full zone before the provisioner discovers there is no space.
- Verify your CSI driver implements GetCapacity: check the ControllerGetCapabilities response or the driver release notes. TopoLVM supports it; for cloud block-storage drivers such as AWS EBS CSI and GCE PD CSI, confirm support for your driver version, because many drivers return UNIMPLEMENTED.
- Set ResourceQuota per namespace per StorageClass to prevent single teams from exhausting shared storage pools. Use both requests.storage (bytes) and persistentvolumeclaims (count) limits. Enforce a LimitRange minimum (e.g., 1Gi) to prevent dozens of trivially small PVCs that waste API objects.
- Monitor inode usage alongside byte usage: kubelet_volume_stats_inodes_used / kubelet_volume_stats_inodes. Inode exhaustion produces the same error as byte exhaustion but is invisible to byte-only monitoring. Especially relevant for log directories and package caches.
- Use xfs instead of ext4 for large volumes where inode density matters. xfs dynamically allocates inodes from free space; ext4 fixes the inode ratio at mkfs time. For volumes holding many small files (log aggregators, CI artifact stores), xfs avoids inode exhaustion surprises.
- Plan for node volume attachment headroom. For AWS, budget 20 EBS volumes per node as a safe limit (leaving headroom for OS volumes and instance store NVMe). Use larger instance types (m5.4xlarge vs m5.large) or NVMe local storage for scratch to free attachment slots for persistent data.
- Reduce --capacity-poll-interval in rapidly changing environments. The default of 1 minute is acceptable for most clusters. In high-churn test environments where PVCs are created and deleted constantly, shorten it to 30 seconds to keep capacity data fresh. Weigh this against increased driver API calls (AWS EC2 DescribeVolumes rate limits).
- Alert on fill rate, not just utilization. A volume going from 50% to 90% in one hour is more urgent than a volume stable at 85%. Use deriv(kubelet_volume_stats_used_bytes[1h]) to estimate time-to-full and page on <4 hours remaining regardless of current percentage.