cloud-controller-manager
The bridge between Kubernetes and cloud provider APIs — how nodes get addresses, load balancers get provisioned, routes get programmed, and zones get discovered. Covers the CCM architecture, the cloud provider interface, out-of-tree provider pattern, and all four built-in controllers.
Why cloud-controller-manager Exists
Before Kubernetes 1.6, cloud-specific logic (AWS ELB creation, GCE route management, Azure VM metadata) was compiled directly into kube-apiserver, kube-controller-manager, and kubelet. This created a tight coupling: every cloud provider had to submit code to the upstream Kubernetes repo, and a bug in the AWS provider could break GKE clusters.
CCM extracts all cloud-specific code into a separate binary that runs alongside — not inside — the core Kubernetes components. This enables:
- Cloud providers to release fixes independently of the Kubernetes release cycle
- Clusters to run without any cloud provider (bare-metal, on-premises)
- Third-party cloud providers (Hetzner, DigitalOcean, OVH) to integrate without upstream changes
- The core Kubernetes binary to shrink and stabilize
Process identity
- Binary: cloud-controller-manager (cloud-provider-specific build)
- Secure port: :10258 (HTTPS, metrics + healthz)
- Leader-elected: one active instance per cluster
- Kubeconfig: typically /etc/kubernetes/cloud-controller-manager.conf
- Cloud credentials: IAM role (preferred), static key file, or Workload Identity
- Deployed as: DaemonSet on control-plane nodes, or Deployment
What it manages
- Node controller: cloud metadata on Node objects (addresses, zones, instance type)
- Route controller: cloud VPC routing tables for pod CIDR blocks
- Service controller: cloud load balancers for type: LoadBalancer Services
- Cloud node lifecycle: detect and handle cloud instance termination
CCM Architecture and the Cloud Provider Interface
CCM implements the same reconciliation loop pattern as kube-controller-manager (see 04-kube-controller-manager.html § Reconciliation Loop). It watches Kubernetes API objects and calls cloud provider APIs to converge state.
The Cloud Provider Interface (Go)
Every CCM implementation must implement the cloudprovider.Interface Go interface from k8s.io/cloud-provider. This defines the contract between Kubernetes and any cloud platform.
// k8s.io/cloud-provider/cloud.go (simplified)
type Interface interface {
    Initialize(clientBuilder ControllerClientBuilder, stop <-chan struct{})
    LoadBalancer() (LoadBalancer, bool) // nil, false if not supported
    Instances() (Instances, bool)       // VM metadata lookup
    InstancesV2() (InstancesV2, bool)   // newer V2 interface
    Zones() (Zones, bool)               // zone/region metadata
    Clusters() (Clusters, bool)         // cluster management (rarely used)
    Routes() (Routes, bool)             // VPC route table management
    ProviderName() string               // "aws", "gce", "azure", etc.
    HasClusterID() bool
}

type LoadBalancer interface {
    GetLoadBalancer(ctx, clusterName, service) (*LoadBalancerStatus, bool, error)
    GetLoadBalancerName(ctx, clusterName, service) string
    EnsureLoadBalancer(ctx, clusterName, service, nodes) (*LoadBalancerStatus, error)
    UpdateLoadBalancer(ctx, clusterName, service, nodes) error
    EnsureLoadBalancerDeleted(ctx, clusterName, service) error
}

type Instances interface {
    NodeAddresses(ctx, nodeName) ([]NodeAddress, error)
    NodeAddressesByProviderID(ctx, providerID) ([]NodeAddress, error)
    InstanceID(ctx, nodeName) (string, error)
    InstanceType(ctx, nodeName) (string, error)
    InstanceTypeByProviderID(ctx, providerID) (string, error)
    InstanceExistsByProviderID(ctx, providerID) (bool, error)
    InstanceShutdownByProviderID(ctx, providerID) (bool, error)
}
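The `(capability, bool)` accessors are what let a provider opt out of a feature it cannot support. The following is a minimal, self-contained sketch of that pattern, not the real k8s.io/cloud-provider types: `Provider`, `lbOnlyProvider`, `fakeLB`, and `enabledControllers` are hypothetical names invented for illustration, and the interfaces are reduced to a single method each.

```go
package main

import "fmt"

// LoadBalancer and Routes are stand-ins for the much larger interfaces in
// k8s.io/cloud-provider; only their shape matters for this sketch.
type LoadBalancer interface{ Name() string }
type Routes interface{ Name() string }

// Provider mirrors the "(capability, supported)" accessor style of
// cloudprovider.Interface, reduced to two capabilities.
type Provider interface {
	ProviderName() string
	LoadBalancer() (LoadBalancer, bool)
	Routes() (Routes, bool)
}

type fakeLB struct{}

func (fakeLB) Name() string { return "fake-lb" }

// lbOnlyProvider is a hypothetical provider that offers load balancers
// but has no route table API.
type lbOnlyProvider struct{}

func (lbOnlyProvider) ProviderName() string               { return "example" }
func (lbOnlyProvider) LoadBalancer() (LoadBalancer, bool) { return fakeLB{}, true }
func (lbOnlyProvider) Routes() (Routes, bool)             { return nil, false }

// enabledControllers lists the controllers that could start for a provider,
// skipping any whose capability accessor reports false.
func enabledControllers(p Provider) []string {
	var out []string
	if _, ok := p.LoadBalancer(); ok {
		out = append(out, "service")
	}
	if _, ok := p.Routes(); ok {
		out = append(out, "route")
	}
	return out
}

func main() {
	fmt.Println(enabledControllers(lbOnlyProvider{})) // prints [service]
}
```

A provider that returns `false` for a capability simply never has the matching controller started against it, which is how smaller providers can ship a CCM without a route table API.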
The Four Built-in Controllers
Node Controller
The Node Controller populates cloud-specific metadata on Node objects after they register. When a new Node appears carrying the node.cloudprovider.kubernetes.io/uninitialized taint (so it has no cloud-derived addresses or zone labels yet), this controller calls the cloud API to fetch instance details and patches the Node.
Fields populated by the Node Controller:
# After CCM processes the new node:
status:
  addresses:
    - type: InternalIP
      address: "10.0.1.42"
    - type: ExternalIP
      address: "54.123.45.67" # AWS: EIP or public IP
    - type: InternalDNS
      address: "ip-10-0-1-42.ec2.internal"
    - type: ExternalDNS
      address: "ec2-54-123-45-67.compute-1.amazonaws.com"
    - type: Hostname
      address: "ip-10-0-1-42.ec2.internal"
# Labels added by cloud provider
metadata:
  labels:
    topology.kubernetes.io/zone: "us-east-1a"
    topology.kubernetes.io/region: "us-east-1"
    node.kubernetes.io/instance-type: "m5.xlarge"
    failure-domain.beta.kubernetes.io/zone: "us-east-1a" # legacy label
    failure-domain.beta.kubernetes.io/region: "us-east-1" # legacy label
# ProviderID set by kubelet (or filled in by CCM), used by CCM to look up the instance
spec:
  providerID: "aws:///us-east-1a/i-0abc123def456"
Each cloud has its own providerID format. AWS: aws:///<zone>/<instance-id>. GCE: gce://<project>/<zone>/<instance-name>. Azure: azure:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<name>. The providerID is set by kubelet via the --provider-id flag; if it is empty when the Node registers, the CCM Node Controller looks the instance up by node name and fills it in.
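To make the three formats concrete, here is a small sketch of how a providerID could be split into a provider name and an instance identifier. `parseProviderID` is a hypothetical helper written for this page; real providers ship their own, stricter parsers, and this version assumes the instance identifier is always the last path segment.

```go
package main

import (
	"fmt"
	"strings"
)

// parseProviderID splits "<provider>://<path>" and returns the provider name
// plus the last path segment, which is the instance ID or VM name in the
// AWS, GCE, and Azure formats shown above.
func parseProviderID(id string) (string, string, error) {
	provider, rest, found := strings.Cut(id, "://")
	if !found || provider == "" {
		return "", "", fmt.Errorf("malformed providerID %q", id)
	}
	rest = strings.Trim(rest, "/")
	if rest == "" {
		return "", "", fmt.Errorf("providerID %q has no instance path", id)
	}
	segments := strings.Split(rest, "/")
	return provider, segments[len(segments)-1], nil
}

func main() {
	ids := []string{
		"aws:///us-east-1a/i-0abc123def456",
		"gce://my-project/us-central1-a/worker-1",
		"azure:///subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm-1",
	}
	for _, id := range ids {
		provider, instance, err := parseProviderID(id)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(provider, instance)
	}
}
```

Note that the AWS form has an empty authority (three slashes), which is why the helper trims leading slashes before splitting.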
Route Controller
Kubernetes requires every Pod to be directly reachable from every other Pod without NAT. On cloud platforms, this means each node's pod CIDR must be programmed into the cloud's VPC routing table so that inter-node pod traffic is routed correctly.
# Example VPC route entry created by CCM (AWS)
# Destination: 10.244.1.0/24 (pod CIDR for node worker-1)
# Target: i-0abc123def456 (EC2 instance ID for worker-1)
# This allows pod-on-worker-2 (10.244.2.5) to reach pod-on-worker-1 (10.244.1.3)
# without SNAT — traffic routes directly through the VPC.
# Check routes created by CCM
aws ec2 describe-route-tables --filters "Name=tag:kubernetes.io/cluster/my-cluster,Values=owned" \
  --query 'RouteTables[].Routes[]' | jq '.[] | select((.DestinationCidrBlock // "") | startswith("10.244"))'
# On GCP:
gcloud compute routes list --filter="name~k8s-*"
Not all CNI plugins need the Route Controller. Flannel in VXLAN mode encapsulates inter-node traffic in UDP, so no VPC routes are needed. Calico in BGP mode programs routes via BGP peers, with no CCM involvement. The AWS VPC CNI does not need it either, because it assigns pods IP addresses from the VPC subnet via ENIs, which the VPC already routes. The Route Controller is specifically for setups such as kubenet or GCE routes-based networking, where each node owns a pod CIDR that the cloud network must route to that node without encapsulation.
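Route reconciliation boils down to comparing the routes that should exist (one per node pod CIDR) with what the cloud route table currently holds. The sketch below shows that comparison only; `Route` and `diffRoutes` are hypothetical names, and it assumes every route in `actual` belongs to this cluster, whereas a real controller would first filter out routes it does not own.

```go
package main

import "fmt"

// Route maps a pod CIDR to the instance that should receive its traffic.
type Route struct {
	CIDR     string
	Instance string
}

// diffRoutes compares desired routes with the routes present in the cloud
// route table and returns what to create and what to delete.
func diffRoutes(desired, actual []Route) (create, remove []Route) {
	want := make(map[Route]bool, len(desired))
	for _, r := range desired {
		want[r] = true
	}
	have := make(map[Route]bool, len(actual))
	for _, r := range actual {
		have[r] = true
		if !want[r] {
			remove = append(remove, r) // stale: node gone or CIDR reassigned
		}
	}
	for _, r := range desired {
		if !have[r] {
			create = append(create, r) // missing: new node or deleted route
		}
	}
	return create, remove
}

func main() {
	desired := []Route{
		{CIDR: "10.244.1.0/24", Instance: "i-worker1"},
		{CIDR: "10.244.2.0/24", Instance: "i-worker2"},
	}
	actual := []Route{
		{CIDR: "10.244.1.0/24", Instance: "i-worker1"},
		{CIDR: "10.244.9.0/24", Instance: "i-terminated"},
	}
	create, remove := diffRoutes(desired, actual)
	fmt.Println("create:", create)
	fmt.Println("remove:", remove)
}
```

Because the comparison is keyed on the full (CIDR, instance) pair, a pod CIDR that moves to a different instance shows up as one delete plus one create.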
Service Controller (LoadBalancer Provisioning)
When a Service of type: LoadBalancer is created, the Service Controller calls the cloud's load balancer API to provision an external LB, then writes the allocated IP/hostname back to service.status.loadBalancer.ingress.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    # AWS-specific annotations controlling LB behavior
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    # scheme and nlb-target-type are read by the separate AWS Load Balancer Controller, not by CCM
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:us-east-1:123:certificate/abc"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "ssl"
    # GCP-specific (shown side by side for illustration; a real Service carries one cloud's annotations)
    cloud.google.com/load-balancer-type: "Internal"
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: LoadBalancer
  selector:
    app: nginx
  ports:
    - port: 443
      targetPort: 8443
status:
  loadBalancer:
    ingress:
      - hostname: a1b2c3d4e5.elb.amazonaws.com # set by CCM after provisioning
Cloud Node Lifecycle Controller
This controller handles the case where a cloud VM is terminated externally (e.g., AWS terminates a spot instance, or an admin deletes a VM in the cloud console). Without this controller, the deleted VM's Node object would remain in the cluster indefinitely as NotReady. Its pods would be marked for eviction but stay in Terminating until the Node object is removed, which can block StatefulSet replacements and volume detachment.
# The cloud-node-lifecycle controller periodically checks the cloud API for each NotReady node:
# - Instance no longer exists: delete the Node object
# - Instance exists but is shut down: add taint node.cloudprovider.kubernetes.io/shutdown:NoSchedule
# Once the Node object is gone, the pod garbage collector in kcm removes pods still bound to it
# Check for nodes that are not Ready (-w keeps "NotReady" lines from being filtered out)
kubectl get nodes -o wide | grep -vw Ready
kubectl describe node <node> | grep -A5 "Taints:"
# Spot instance interruption handling (AWS):
# AWS sends a 2-minute warning via IMDS: /latest/meta-data/spot/instance-action
# A drain agent on the node (for example AWS Node Termination Handler) cordons and drains it;
# cloud-node-lifecycle then deletes the Node object once the instance is gone
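The per-node decision can be written as a small pure function. This is a sketch under a stated assumption: it follows the upstream behavior in which only NotReady nodes are checked, a missing instance leads to Node deletion, and a stopped instance gets the shutdown taint. `Action` and `decide` are hypothetical names; the two boolean inputs stand for the results of `InstanceExistsByProviderID` and `InstanceShutdownByProviderID`.

```go
package main

import "fmt"

// Action is what the lifecycle loop would do for a single node.
type Action string

const (
	NoAction      Action = "none"
	TaintShutdown Action = "taint-shutdown"
	DeleteNode    Action = "delete-node"
)

// decide maps a node's Ready condition and two cloud lookups to an action.
func decide(ready, exists, shutdown bool) Action {
	switch {
	case ready:
		return NoAction // healthy nodes are not checked against the cloud
	case !exists:
		return DeleteNode // instance was terminated or deleted
	case shutdown:
		return TaintShutdown // instance is stopped but may come back
	default:
		return NoAction // instance is running; the kubelet may just be unhealthy
	}
}

func main() {
	fmt.Println(decide(false, false, false)) // prints delete-node
	fmt.Println(decide(false, true, true))   // prints taint-shutdown
	fmt.Println(decide(false, true, false))  // prints none
}
```

Keeping the last case as a no-op matters: a NotReady node whose instance is still running is a kubelet or network problem, and deleting the Node object would hide it.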
Out-of-Tree Provider Pattern
Every major cloud provider now ships its own CCM binary, versioned and released independently from Kubernetes core. The provider's CCM binary imports k8s.io/cloud-provider and implements the interface.
| Cloud Provider | Repository | Notes |
|---|---|---|
| AWS | kubernetes/cloud-provider-aws | Classic ELB/NLB provisioning. AWS Load Balancer Controller (separate) handles ALB via Ingress |
| GCP | kubernetes/cloud-provider-gcp | GCE persistent disk, GKE integrated LB, Cloud NAT routes |
| Azure | kubernetes-sigs/cloud-provider-azure | Azure Load Balancer, Azure Disk/File CSI, AKS nodepools |
| OpenStack | kubernetes/cloud-provider-openstack | Octavia LB, Cinder volumes, Nova instance metadata |
| vSphere | kubernetes/cloud-provider-vsphere | vCenter VMs, vSAN storage, NSX-T networking |
| Hetzner | hetznercloud/hcloud-cloud-controller-manager | Community provider; Hcloud LB, Floating IPs |
| DigitalOcean | digitalocean/digitalocean-cloud-controller-manager | DO Load Balancers, Floating IPs |
| Bare metal / None | N/A | Leave --cloud-provider unset. With --cloud-provider=external and no CCM deployed, nodes keep the uninitialized taint forever. No route/LB controllers run |
In-Tree to Out-of-Tree Migration
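Clusters created before CCM existed may still use the in-tree cloud providers via --cloud-provider=aws on kube-apiserver, kube-controller-manager, and kubelet. That path is closed in current releases: in-tree providers were disabled by default in 1.29 and the remaining in-tree cloud provider code was removed in 1.31. Migration steps: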
# Step 1: Update kube-apiserver and kube-controller-manager flags
# Remove: --cloud-provider=aws
# Add: --cloud-provider=external
# Step 2: Deploy the out-of-tree CCM DaemonSet
# (example: AWS CCM via Helm)
helm repo add aws-cloud-controller-manager https://kubernetes.github.io/cloud-provider-aws
helm upgrade --install aws-cloud-controller-manager \
  aws-cloud-controller-manager/aws-cloud-controller-manager \
  --namespace kube-system \
  --set 'args={--v=2,--cloud-provider=aws}'  # quoted so the shell does not brace-expand the list
# Step 3: Verify nodes get cloud metadata from CCM
kubectl get nodes -o yaml | grep -A 10 "addresses:"
# Step 4: CSI migration — replace in-tree volume plugins
# In-tree: kubernetes.io/aws-ebs (deprecated)
# Out-of-tree: ebs.csi.aws.com (CSI driver)
# Migration controlled by feature gate: CSIMigrationAWS=true (default in 1.23+)
# Check if CSI migration is active
kubectl get csidriver ebs.csi.aws.com
kubectl get storageclass | grep ebs
Kubelet and the --cloud-provider=external Flag
Kubelet must be told to register the Node with a taint that keeps it unschedulable until CCM has initialized it. Without this, pods may be scheduled to a node whose zone labels and addresses haven't been populated yet.
# kubelet flag required when using CCM:
--cloud-provider=external
# Effect: kubelet adds taint to newly registered node:
# node.cloudprovider.kubernetes.io/uninitialized:NoSchedule
# CCM Node Controller removes this taint after populating cloud metadata
# Only then does the scheduler consider the node for pod placement
# Verify the taint is removed after CCM processes the node
kubectl describe node new-node | grep -A5 Taints:
# Should show: <none> (or only user-defined taints)
# If node stays with uninitialized taint:
kubectl -n kube-system logs <ccm-pod> | grep "node controller"
If kubelet is not started with --cloud-provider=external, nodes register without the uninitialized taint and become schedulable as soon as they report Ready. Pods may land on nodes whose zone labels are still empty, breaking topology-aware routing and zone-aware scheduling. Always set this flag on all kubelets when deploying CCM.
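The taint handshake itself is simple list surgery on the Node's taints. The sketch below models only that step; `Taint` and `removeUninitialized` are hypothetical stand-ins, and a real controller would apply the result with a patch against the API server after it has populated addresses and labels.

```go
package main

import "fmt"

// Taint is a reduced stand-in for the core/v1 Taint type.
type Taint struct {
	Key, Value, Effect string
}

const uninitializedKey = "node.cloudprovider.kubernetes.io/uninitialized"

// removeUninitialized returns the taints without the cloud-provider
// uninitialized taint, and reports whether anything was removed.
func removeUninitialized(taints []Taint) ([]Taint, bool) {
	kept := make([]Taint, 0, len(taints))
	removed := false
	for _, t := range taints {
		if t.Key == uninitializedKey {
			removed = true
			continue
		}
		kept = append(kept, t)
	}
	return kept, removed
}

func main() {
	taints := []Taint{
		{Key: uninitializedKey, Value: "true", Effect: "NoSchedule"},
		{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"},
	}
	kept, removed := removeUninitialized(taints)
	fmt.Println(removed, len(kept), kept[0].Key)
}
```

User-defined taints pass through untouched, which matches the expectation above that a node may still show its own taints after initialization.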
RBAC, Credentials, and Security
Kubernetes RBAC
CCM needs broad Kubernetes API access to watch and patch Nodes, Services, and Events. A ClusterRole is required:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:cloud-controller-manager
rules:
  - apiGroups: [""]
    resources: [events]
    verbs: [create, patch, update]
  - apiGroups: [""]
    resources: [nodes]
    verbs: [get, list, watch, delete, patch, update]
  - apiGroups: [""]
    resources: [nodes/status]
    verbs: [patch]
  - apiGroups: [""]
    resources: [services]
    verbs: [get, list, watch, patch]
  - apiGroups: [""]
    resources: [services/status]
    verbs: [update, patch]
  - apiGroups: [""]
    resources: [serviceaccounts]
    verbs: [create]
  - apiGroups: [""]
    resources: [persistentvolumes]
    verbs: [get, list, update, watch]
  - apiGroups: [""]
    resources: [endpoints]
    verbs: [create, get, list, watch, update]
  - apiGroups: ["coordination.k8s.io"]
    resources: [leases]
    verbs: [get, create, update]
Cloud Provider Credentials
IAM Role (Recommended — AWS)
Attach an IAM role to the EC2 instances running the control plane. CCM uses the EC2 instance metadata service (IMDS) to fetch credentials. No key files on disk — credentials rotate automatically.
# Required IAM permissions for AWS CCM:
ec2:DescribeInstances
ec2:DescribeRegions
ec2:DescribeRouteTables
ec2:CreateRoute
ec2:DeleteRoute
ec2:ModifyInstanceAttribute
elasticloadbalancing:* (for Service controller)
autoscaling:DescribeAutoScalingGroups
Workload Identity (GCP)
GKE uses Workload Identity to bind a Kubernetes ServiceAccount to a GCP ServiceAccount. CCM's pod SA is annotated with the GCP SA — no JSON key files needed.
# GKE Workload Identity annotation
metadata:
  annotations:
    iam.gke.io/gcp-service-account: ccm@my-project.iam.gserviceaccount.com
# Required GCP roles:
roles/compute.instanceAdmin.v1
roles/compute.networkAdmin
roles/iam.serviceAccountUser
Azure Managed Identity
AKS uses Managed Identity (system-assigned or user-assigned) for the CCM. No client secrets — IMDS provides a token. Ensure the managed identity has Contributor on the MC_ resource group and Network Contributor on the VNet.
# Azure CCM config (cloud-config secret)
apiVersion: v1
kind: Secret
metadata:
  name: cloud-config
  namespace: kube-system
stringData: # stringData takes plain text; the data field would require base64
  cloud-config: |
    {
      "cloud": "AzurePublicCloud",
      "useManagedIdentityExtension": true,
      "subscriptionId": "...",
      "resourceGroup": "...",
      "vnetName": "..."
    }
Static Credentials (Not Recommended)
A JSON/YAML config file containing API keys, client secrets, or access keys mounted as a Secret into the CCM pod. Acceptable for dev/test but not production — credentials don't auto-rotate and are stored in etcd (encrypted at rest required).
# Flag to pass static credentials
--cloud-config=/etc/kubernetes/cloud-config.json
# Ensure Secrets are encrypted at rest!
# See 01-kube-apiserver.html § Encryption at Rest
Deployment Patterns
DaemonSet on Control Plane Nodes (most common)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloud-controller-manager
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: cloud-controller-manager
  template:
    metadata:
      labels:
        component: cloud-controller-manager
    spec:
      serviceAccountName: cloud-controller-manager
      tolerations:
        - key: node-role.kubernetes.io/control-plane # run on control-plane nodes
          effect: NoSchedule
        - key: node.cloudprovider.kubernetes.io/uninitialized # tolerate uninitialized nodes
          effect: NoSchedule
          value: "true"
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      priorityClassName: system-cluster-critical
      hostNetwork: true # access cloud IMDS on link-local (169.254.169.254)
      containers:
        - name: cloud-controller-manager
          image: registry.k8s.io/provider-aws/cloud-controller-manager:v1.30.0
          command:
            - /bin/aws-cloud-controller-manager
            - --cloud-provider=aws
            - --leader-elect=true
            - --use-service-account-credentials=true
            - --configure-cloud-routes=true # enable Route Controller
            - --v=2
The CCM pod itself must tolerate the node.cloudprovider.kubernetes.io/uninitialized taint. Otherwise there is a chicken-and-egg deadlock: the taint prevents pods from being scheduled, but the CCM pod (which removes the taint) cannot be scheduled either. hostNetwork: true lets CCM start before the CNI has set up pod networking on the node and gives it a direct path to the cloud IMDS at 169.254.169.254. On AWS, IMDSv2 with the default hop limit of 1 is not reachable across the extra hop that the pod network adds.
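Whether a pod can land on a tainted node comes down to a per-taint match against its tolerations. The sketch below models the matching rules as they are documented for Kubernetes (an empty operator means Equal, an empty effect matches any effect); `Taint`, `Toleration`, `tolerates`, and `toleratesAll` are simplified stand-ins rather than the real API types.

```go
package main

import "fmt"

// Taint and Toleration are reduced stand-ins for the core/v1 types.
type Taint struct{ Key, Value, Effect string }

type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether one toleration matches one taint.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal":
		return tol.Value == taint.Value
	case "Exists":
		return true
	}
	return false
}

// toleratesAll reports whether every taint is matched by some toleration.
func toleratesAll(tols []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		matched := false
		for _, tol := range tols {
			if tolerates(tol, taint) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	// Taints on a freshly registered control-plane node.
	nodeTaints := []Taint{
		{Key: "node-role.kubernetes.io/control-plane", Effect: "NoSchedule"},
		{Key: "node.cloudprovider.kubernetes.io/uninitialized", Value: "true", Effect: "NoSchedule"},
	}
	// Tolerations from the DaemonSet above.
	ccm := []Toleration{
		{Key: "node-role.kubernetes.io/control-plane", Effect: "NoSchedule"},
		{Key: "node.cloudprovider.kubernetes.io/uninitialized", Value: "true", Effect: "NoSchedule"},
	}
	fmt.Println(toleratesAll(ccm, nodeTaints)) // prints true
	fmt.Println(toleratesAll(nil, nodeTaints)) // prints false
}
```

The second call is the deadlock case: a pod without these tolerations can never be scheduled onto the only nodes that exist before CCM has run.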
Configuration Reference
| Flag | Default | Purpose |
|---|---|---|
| --cloud-provider | (none) | Cloud provider name (must match ProviderName() implementation) |
| --cloud-config | (none) | Path to cloud provider config file (credentials, region, VPC IDs) |
| --leader-elect | true | Enable leader election; only one CCM active at a time |
| --use-service-account-credentials | false | Use individual SA tokens per controller (mirrors kcm behavior) |
| --configure-cloud-routes | true | Enable Route Controller (set false for encapsulated CNIs like Calico VXLAN) |
| --cluster-name | kubernetes | Cluster name used to tag cloud resources; must match the value used when provisioning |
| --cidr-allocator-type | RangeAllocator | CIDR allocation strategy: RangeAllocator or CloudAllocator |
| --cluster-cidr | (none) | Pod CIDR range (required for Route Controller) |
| --allocate-node-cidrs | false | Enable CIDR allocation by CCM instead of kube-controller-manager |
| --node-status-update-frequency | 5m | How often the Node Controller refreshes node status from the cloud API (the older --node-sync-period flag is deprecated) |
| --route-reconciliation-period | 10s | How often Route Controller reconciles VPC routes |
| --concurrent-service-syncs | 1 | Number of Service LB provisioning goroutines |
| --v | 0 | Log verbosity; 2 is a common production setting. 4+ shows cloud API calls |
Metrics and Alerting
# Scrape CCM metrics
curl -sk https://localhost:10258/metrics
# Key metrics
cloudprovider_aws_api_request_duration_seconds{request="DescribeInstances"}
cloudprovider_aws_api_request_errors_total{request="CreateLoadBalancer"}
# Controller-generic metrics (same workqueue metrics as kcm)
workqueue_depth{name="cloud-node"}
workqueue_depth{name="service"}
workqueue_depth{name="cloud-node-lifecycle"}
workqueue_retries_total{name="service"}
# Load balancer specific (varies by provider)
cloudprovider_aws_api_throttled_requests_total # AWS throttling events
cloud_provider_reconcile_attempts_total{provider="aws",controller="service"}
# Node initialization health
# Monitor: nodes stuck with uninitialized taint > 2 minutes
kubectl get nodes -o json | jq -r '
.items[] |
select(.spec.taints != null) |
select(.spec.taints[] | .key == "node.cloudprovider.kubernetes.io/uninitialized") |
.metadata.name
'
Prometheus Alerting Rules
groups:
  - name: cloud-controller-manager
    rules:
      - alert: CCMDown
        expr: absent(up{job="cloud-controller-manager"} == 1)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "cloud-controller-manager is down"
          description: "No CCM is running. LoadBalancer Services will not be provisioned and new nodes will stay uninitialized."
      - alert: CCMNodeUninitializedStuck
        expr: |
          kube_node_spec_taint{key="node.cloudprovider.kubernetes.io/uninitialized"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node stuck with cloud-uninitialized taint"
          description: "Node {{ $labels.node }} has been uninitialized for >5m. Check CCM logs."
      - alert: CCMCloudAPIErrors
        expr: rate(cloudprovider_aws_api_request_errors_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CCM cloud API error rate"
          description: "CCM is seeing {{ $value }} cloud API errors/s on {{ $labels.request }}."
      - alert: CCMLoadBalancerSyncFailed
        expr: rate(workqueue_retries_total{job="cloud-controller-manager",name="service"}[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service LB sync failing repeatedly"
          description: "CCM service controller is retrying LB sync at {{ $value }}/s."
Troubleshooting
Service type LoadBalancer stuck in <pending>
# Check CCM is running
kubectl -n kube-system get pods -l component=cloud-controller-manager
# Check CCM logs for the service
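# --tail=-1 is needed because kubectl logs with a label selector returns only the last 10 lines by default
kubectl -n kube-system logs -l component=cloud-controller-manager --tail=-1 | grep -i "service\|loadbalancer\|error" | tail -30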
# Check service events
kubectl describe service my-svc | grep -A 10 Events:
# Look for: "Error creating load balancer", "Timeout", "Throttling"
# AWS: check IAM permissions
# "AccessDenied" errors indicate missing IAM policy on control plane node role
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::ACCOUNT:role/k8s-control-plane \
--action-names "elasticloadbalancing:CreateLoadBalancer" \
--resource-arns "*"
# GCP: check service account permissions
gcloud projects get-iam-policy my-project \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:ccm@my-project.iam.gserviceaccount.com"
Nodes stuck with cloud-uninitialized taint
# Check CCM logs for node initialization
kubectl -n kube-system logs <ccm-pod> | grep "node controller\|initialize\|providerID"
# Verify providerID is set on the node (required for CCM lookup)
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'
# Should be: aws:///us-east-1a/i-0abc123def456
# If empty, kubelet is missing --provider-id or --cloud-provider=external
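# Check IMDS access from CCM pod (for IAM role fetching)
# Assumes the image ships curl (many CCM images are distroless; if so, run curl on the node instead)
# and that IMDSv1 is allowed; IMDSv2 requires fetching a session token first
kubectl -n kube-system exec <ccm-pod> -- \
  curl -s http://169.254.169.254/latest/meta-data/instance-id
# Must return an instance ID; failure = network/IMDS issue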
# Check if Node has valid providerID format for the cloud
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.spec.providerID)"'
VPC routes not created / pod cross-node connectivity broken
# Verify Route Controller is enabled
kubectl -n kube-system logs <ccm-pod> | grep -i "route"
# Check nodes have podCIDR assigned
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.spec.podCIDR)"'
# AWS: verify routes in VPC route table
aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=vpc-0abc123 \
  --query 'RouteTables[].Routes[?InstanceId!=null]'
# If using Calico VXLAN or Flannel VXLAN, routes are NOT needed:
# Set --configure-cloud-routes=false to disable Route Controller
# Check if pod CIDR conflicts with VPC CIDR
# Pod CIDR should not overlap with VPC subnet CIDRs
Cloud API rate limiting / throttling
# AWS: EC2 API has per-region throttling limits
# Reduce polling frequency:
# --node-status-update-frequency=10m (default 5m)
# --route-reconciliation-period=30s (default 10s)
# Check for throttling in CCM logs
kubectl -n kube-system logs <ccm-pod> | grep -i "throttl\|rateLim\|RequestLimitExceeded"
# AWS: request a limit increase for EC2 API calls via AWS console
# GCP: check Compute Engine API quotas
# Azure: enable API rate limiting in cloud config:
# rateLimitConfig:
# cloudProviderRateLimit: true
# cloudProviderRateLimitQPS: 3
# cloudProviderRateLimitBucket: 5
# Monitor cloud API call rate (query CCM's own metrics endpoint; /metrics via kubectl get --raw is the API server's)
curl -sk https://localhost:10258/metrics | grep 'cloudprovider_.*_api_request_duration'
On-Premises Clusters Without CCM
Clusters running on bare metal, VMware, or private cloud without a CCM must handle the functionality themselves or accept limitations.
Load Balancer alternatives
- MetalLB: BGP or L2 mode load balancer for bare metal. Acts as a CCM Service controller substitute. Watches type: LoadBalancer Services and allocates IPs from configured address pools.
- kube-vip: VIP-based LB using ARP/BGP. Works well for small clusters.
- External DNS + NodePort: Route traffic to NodePort services via external DNS A records pointing to node IPs.
Node metadata alternatives
Without CCM, manually label nodes with topology information:
kubectl label node worker-1 topology.kubernetes.io/zone=dc1-rack-a
kubectl label node worker-1 topology.kubernetes.io/region=dc1
Or use Node Feature Discovery (NFD) to auto-label nodes based on CPU features, PCI devices, and kernel capabilities.
Route controller alternatives
On bare metal, pod routing is handled entirely by the CNI plugin:
- Calico BGP: advertises pod CIDRs via BGP to ToR switches
- Cilium: eBPF-based routing without VPC routes needed
- Flannel VXLAN: encapsulates pod traffic, no underlay routing required
Node deletion handling
Without CCM's cloud-node-lifecycle controller, deleted VMs leave stale Node objects. Use node problem detector + custom scripts, or run kubectl delete node manually. Some node managers (like Cluster Autoscaler) have built-in cleanup logic.
Production Best Practices
Use IAM roles / Workload Identity, never static credentials
Static credentials in a Secret or config file require manual rotation, are stored in etcd, and are a security liability. IAM roles (AWS), Workload Identity (GCP), or Managed Identity (Azure) auto-rotate and follow least-privilege without keys on disk.
Match CCM version to Kubernetes version
The Kubernetes version skew policy treats CCM like kube-controller-manager: it must not be newer than kube-apiserver and may be at most one minor version older. Deploy the CCM version that matches your Kubernetes minor version. Cloud providers typically release a new CCM tag within days of a Kubernetes release.
Disable Route Controller for overlay CNIs
If using Calico VXLAN, Flannel VXLAN, or Cilium with VXLAN/Geneve encapsulation, set --configure-cloud-routes=false. Creating unused VPC routes wastes cloud API quota and can cause routing conflicts if pod CIDRs overlap with subnet CIDRs.
Run on control-plane nodes with system-cluster-critical priority
CCM must run before worker nodes can be initialized. Place it on control-plane nodes (taints tolerated) with priorityClassName: system-cluster-critical to prevent eviction during resource pressure.
Tag cloud resources for cluster isolation
CCM uses --cluster-name to tag all cloud resources (LBs, routes, security groups) it creates. On clusters sharing a VPC, mismatched cluster names cause CCM to fight over the same resources. Always set a unique, stable cluster name at cluster creation.
Monitor node initialization latency
Alert if any node has the uninitialized taint for more than 2 minutes. This indicates CCM is having trouble reaching the cloud API (throttling, IAM permission issue, or CCM pod is not running). Nodes stuck uninitialized cannot receive workloads.
Use Service annotations for LB customization
Cloud-specific Service annotations (AWS, GCP, Azure) control LB type (NLB vs CLB), scheme (internal vs internet-facing), health check paths, SSL certificates, and more. Document your organization's standard LB annotations in a runbook — they vary significantly between clouds.
Plan for LB provisioning latency
Cloud LB provisioning takes 30 seconds to 3 minutes depending on provider and LB type. Do not treat a newly-created Service's EXTERNAL-IP as instantly ready. Use health check endpoints and readiness gates in your CI/CD pipeline rather than a fixed sleep.