CNI Plugins — Deep Dive
The Container Network Interface (CNI) is the pluggable layer that gives every Kubernetes pod a routable IP address. This page covers the CNI specification in full, chained plugin architecture, and detailed internals of Calico, Cilium, Flannel, AWS VPC CNI, Azure CNI, and Antrea, including eBPF dataplanes, BGP configuration, Hubble observability, ENI management, and production CNI selection criteria.
CNI Specification
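CNI is a CNCF specification (not Kubernetes-specific) that defines how container runtimes invoke network plugins. The spec version history: 0.1 → 0.2 → 0.3.0/0.3.1 → 0.4.0 → 1.0.0 (2021) → 1.1.0. Support for older spec versions is determined by the container runtime (containerd, CRI-O) and the libcni it bundles rather than by the kubelet, so check the runtime's release notes before relying on configs older than 0.4.0.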
Operations: ADD / DEL / CHECK / GC / VERSION
| Operation | Trigger | Plugin Must Do | Return |
|---|---|---|---|
| ADD | Container created; pod scheduled to node | Create veth, assign IP (IPAM), configure routes, set up iptables | CNI Result (interfaces[], ips[], routes[], dns{}) |
| DEL | Container removed; pod deleted | Remove veth, release IP back to IPAM, tear down routes | Success (errors tolerated if container already gone) |
| CHECK | Runtime health verification | Verify container network matches last ADD result | Success or error with reason |
| GC | Spec 1.1+; periodic cleanup | Remove orphaned network state (attachments not listed in cni.dev/valid-attachments) | Success |
| VERSION | Plugin discovery | Report supported CNI spec versions | {"cniVersion":"1.0.0","supportedVersions":[...]} |
Environment Variables
The container runtime passes configuration via environment variables before exec-ing the CNI binary:
CNI_COMMAND=ADD
CNI_CONTAINERID=abc123def456... # container ID (unique per sandbox)
CNI_NETNS=/var/run/netns/abc123 # path to network namespace
CNI_IFNAME=eth0 # interface name inside container
CNI_ARGS=K8S_POD_NAMESPACE=default;K8S_POD_NAME=nginx-abc;K8S_POD_INFRA_CONTAINER_ID=abc123
CNI_PATH=/opt/cni/bin # colon-separated dirs to search for plugins
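Network config JSON is passed on stdin, not as CLI args. The plugin reads the config from stdin and writes the CNI result to stdout. On failure it exits non-zero and writes a JSON error object (code, msg, details) to stdout; stderr is reserved for unstructured log output.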
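The environment-plus-stdin contract described above can be sketched as a tiny dispatcher. This is a minimal illustration, not a working plugin: the `handle` function, the `SUPPORTED_VERSIONS` list, and the empty `ips` result are all hypothetical, and the numeric error codes (4 for bad environment, 6 for undecodable config) are assumed to mirror the spec's reserved codes, so verify them against the spec version you target.

```python
import json
from typing import Dict, Tuple

# Assumption: the spec versions this toy plugin claims to support.
SUPPORTED_VERSIONS = ["0.4.0", "1.0.0"]


def _error(code: int, msg: str, details: str = "") -> Tuple[int, str]:
    """Build a (non-zero exit code, JSON error body for stdout) pair."""
    body = {"cniVersion": "1.0.0", "code": code, "msg": msg, "details": details}
    return 1, json.dumps(body)


def handle(env: Dict[str, str], stdin_text: str) -> Tuple[int, str]:
    """Return (exit_code, stdout_text) for a single CNI invocation."""
    command = env.get("CNI_COMMAND", "")
    if command == "VERSION":
        return 0, json.dumps(
            {"cniVersion": "1.0.0", "supportedVersions": SUPPORTED_VERSIONS}
        )
    try:
        conf = json.loads(stdin_text)
    except json.JSONDecodeError as exc:
        return _error(6, "failed to decode network config", str(exc))
    if command == "ADD":
        required = ("CNI_CONTAINERID", "CNI_NETNS", "CNI_IFNAME")
        missing = [name for name in required if not env.get(name)]
        if missing:
            return _error(4, "missing environment variables", ", ".join(missing))
        # A real plugin would create the veth pair and call IPAM here.
        result = {
            "cniVersion": conf.get("cniVersion", "1.0.0"),
            "interfaces": [
                {"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}
            ],
            "ips": [],
            "routes": [],
            "dns": {},
        }
        return 0, json.dumps(result)
    if command in ("DEL", "CHECK"):
        return 0, ""  # nothing to tear down or verify in this sketch
    return _error(4, "unknown CNI_COMMAND", command)
```

A real entry point would call `handle(dict(os.environ), sys.stdin.read())`, print the returned text to stdout, and exit with the returned code.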
Full ADD Request / Response
// stdin → plugin (network config)
{
"cniVersion": "1.0.0",
"name": "k8s-pod-network",
"type": "calico",
"ipam": {
"type": "calico-ipam"
},
"kubernetes": {
"k8s_api_root": "https://10.96.0.1:443",
"kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
},
"policy": {
"type": "k8s"
},
"nodename": "worker-1",
"datastore_type": "kubernetes",
"log_level": "info",
"prevResult": null // null for first ADD; populated for chained plugins
}
// stdout ← plugin (CNI result v1.0.0)
{
"cniVersion": "1.0.0",
"interfaces": [
{
"name": "veth8a3f12b",
"mac": "aa:bb:cc:dd:ee:ff",
"sandbox": "" // host side: no sandbox path
},
{
"name": "eth0",
"mac": "aa:bb:cc:dd:ee:00",
"sandbox": "/var/run/netns/abc123" // container side
}
],
"ips": [
{
"interface": 1, // index into interfaces[]
"address": "10.244.1.15/24",
"gateway": "10.244.1.1"
}
],
"routes": [
{ "dst": "0.0.0.0/0", "gw": "10.244.1.1" }
],
"dns": {
"nameservers": ["10.96.0.10"],
"domain": "cluster.local",
"search": ["default.svc.cluster.local","svc.cluster.local","cluster.local"]
}
}
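Because each entry in `ips[]` points at an interface by index, consumers of a CNI result usually need to join the two arrays. The sketch below shows one way to do that; the `ip_assignments` helper is hypothetical and the embedded result is a trimmed copy of the example values above.

```python
import json

# Trimmed copy of the example result (comments removed so it is valid JSON).
RESULT = json.loads("""
{
  "cniVersion": "1.0.0",
  "interfaces": [
    {"name": "veth8a3f12b", "mac": "aa:bb:cc:dd:ee:ff", "sandbox": ""},
    {"name": "eth0", "mac": "aa:bb:cc:dd:ee:00", "sandbox": "/var/run/netns/abc123"}
  ],
  "ips": [
    {"interface": 1, "address": "10.244.1.15/24", "gateway": "10.244.1.1"}
  ]
}
""")


def ip_assignments(result):
    """Return (interface name, sandbox path, address) for every IP in the result."""
    rows = []
    for ip in result.get("ips", []):
        index = ip.get("interface")
        if index is None:
            # The interface index is optional; keep the address anyway.
            rows.append((None, "", ip["address"]))
            continue
        iface = result["interfaces"][index]
        rows.append((iface["name"], iface.get("sandbox", ""), ip["address"]))
    return rows


for name, sandbox, address in ip_assignments(RESULT):
    where = sandbox or "host"
    print(f"{address} is assigned to {name} in {where}")
```

An empty `sandbox` value marks the host side of the veth pair, which is why the helper falls back to the label "host" when printing.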
Chained Plugins — conflist
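A .conflist file defines an ordered list of plugins. The runtime calls them sequentially on ADD (and in reverse order on DEL); each plugin receives the prevResult from the prior plugin in the chain. This is how bandwidth shaping, port mapping, and firewall rules are layered on top of the main plugin, which in turn delegates address assignment to its IPAM plugin: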
{
"cniVersion": "1.0.0",
"name": "k8s-pod-network",
"plugins": [
{
"type": "calico", // main plugin: sets up veth + routes
"ipam": { "type": "calico-ipam" },
"kubernetes": { "kubeconfig": "/etc/cni/net.d/calico-kubeconfig" }
},
{
"type": "bandwidth", // meta plugin: shapes traffic
"ingressRate": 104857600, // 100 Mbps in bps
"ingressBurst": 209715200,
"egressRate": 104857600,
"egressBurst": 209715200
},
{
"type": "portmap", // meta plugin: iptables DNAT for hostPort
"capabilities": { "portMappings": true },
"snat": true
},
{
"type": "firewall" // meta plugin: iptables allow/deny rules
}
]
}
The runtime uses the first .conf or .conflist file found in /etc/cni/net.d/ sorted lexicographically. Files beginning with 10- load before 99-. Stale configs left by old CNI installations can cause the wrong plugin to load.
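The selection rule is easy to reproduce when auditing a node. The following sketch assumes the rule exactly as stated above (first `.conf` or `.conflist` in lexicographic order); the `pick_cni_config` helper and the file names are hypothetical.

```python
from typing import Iterable, Optional


def pick_cni_config(filenames: Iterable[str]) -> Optional[str]:
    """Return the config file a runtime would load, or None if there is none."""
    candidates = sorted(
        name for name in filenames if name.endswith((".conf", ".conflist"))
    )
    return candidates[0] if candidates else None


# Hypothetical directory listing of /etc/cni/net.d/ on a node that once ran Flannel.
listing = [
    "99-loopback.conf",
    "10-calico.conflist",
    "05-old-flannel.conflist",
    "calico-kubeconfig",
]
print(pick_cni_config(listing))
```

With this listing the stale `05-old-flannel.conflist` wins over `10-calico.conflist`, which is the failure mode the paragraph warns about.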
CNI Invocation Flow
CNI Reference Plugins
The containernetworking/plugins repo provides reference plugins that ship with most distributions:
Main Plugins
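- bridge: Linux bridge
- host-device: moves an existing host device into the pod (e.g. an SR-IOV VF)
- ipvlan: ipvlan L2/L3
- macvlan: macvlan
- ptp: veth pair (no bridge)
- vlan: VLAN sub-interface on a trunk
- dummy: dummy (loopback-like) interface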
IPAM Plugins
- host-local: file-based
- dhcp: DHCP daemon
- static: hardcoded IPs
- whereabouts: cluster-wide (maintained as a separate project, not part of containernetworking/plugins)
Meta Plugins
- portmap: hostPort DNAT
- bandwidth: tc tbf shaping
- firewall: iptables rules
- sbr: source-based routing
- tuning: sysctl / hwaddr
- vrf: VRF routing table
Calico
Calico (Project Calico / Tigera) is one of the most widely deployed CNIs in production. It supports three dataplanes: standard Linux (iptables), eBPF, and Windows HNS. Network policy is a first-class citizen implemented via Felix and iptables/eBPF rules, not through a separate plugin.
Architecture Components
Felix (DaemonSet)
The policy enforcement agent on every node. Felix programs iptables/eBPF rules, manages routing (kernel FIB or BGP), writes interface routes, enforces NetworkPolicy. Talks to the datastore (etcd or Kubernetes CRD API).
Pod: calico-node in kube-system
BIRD (BGP Daemon)
Runs inside the calico-node pod. Advertises pod CIDRs via BGP to upstream routers and to peer nodes. In full-mesh mode each node peers with every other node (n(n−1)/2 sessions, so growth is O(n²)); in Route Reflector mode designated RR nodes relay routes.
Port 179 TCP (BGP)
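A quick calculation shows why large clusters move from full mesh to Route Reflectors. Counting each BGP session once, a full mesh of n nodes has n(n−1)/2 sessions. The Route Reflector figure below assumes every non-RR node peers with every RR and that the RRs are fully meshed with each other; real topologies may differ.

```python
def full_mesh_sessions(nodes: int) -> int:
    """BGP sessions when every node peers with every other node."""
    return nodes * (nodes - 1) // 2


def route_reflector_sessions(nodes: int, reflectors: int) -> int:
    """Sessions when clients peer with each RR and RRs mesh among themselves."""
    clients = nodes - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


for n in (10, 100, 1000):
    print(n, full_mesh_sessions(n), route_reflector_sessions(n, reflectors=3))
```

For 100 nodes this gives 4950 sessions in full mesh versus 294 with three Route Reflectors.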
confd (template engine)
Watches the datastore for BGP configuration changes and renders BIRD config files dynamically. Eliminates the need to restart BIRD on config changes — confd triggers a graceful reload.
Typha (optional)
Fan-out cache between the datastore and Felix daemons. In large clusters (>100 nodes), Typha absorbs the watch fan-out: 1,000 Felix instances watch Typha instead of the API server directly. Reduces API server load dramatically.
Deploy 1 Typha replica per 100-200 nodes
calico-kube-controllers
Deployment (not DaemonSet). Reconciles Kubernetes objects (Namespaces, Pods, NetworkPolicies, ServiceAccounts) into Calico CRDs. Also runs node cleanup when nodes are deleted.
calico-apiserver (optional)
Aggregated API extension exposing Calico CRDs through the Kubernetes API server. Required for Tigera Enterprise features and for applying Calico resources with kubectl via API aggregation.
Routing Modes
| Mode | Encapsulation | Requires | Overhead | Use When |
|---|---|---|---|---|
| BGP (native) | None | Routable pod CIDRs; BGP-capable fabric | ~0 | On-prem with BGP ToR switches; maximum performance |
| VXLAN | VXLAN (UDP 4789) | UDP connectivity between nodes | ~50 bytes/pkt | Cloud VPCs that block BGP; most common in EKS/AKS self-managed |
| IP-in-IP (IPIP) | IP-in-IP (proto 4) | IP connectivity between nodes | ~20 bytes/pkt | Legacy; lighter than VXLAN but less universally supported |
| WireGuard | WireGuard (UDP 51820) | WireGuard kernel module | ~60 bytes/pkt | Encrypted pod-to-pod traffic without service mesh |
| None (CrossSubnet) | None within subnet, VXLAN/IPIP across | BGP for intra-subnet, overlay for cross-subnet | Mixed | AWS multi-AZ or hybrid setups |
Key Calico CRDs
# IPPool: defines allocatable pod CIDR range
# (projectcalico.org/v3 resources are applied with calicoctl, or with kubectl when calico-apiserver is installed)
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16
  ipipMode: Never          # Never | Always | CrossSubnet
  vxlanMode: Always        # Never | Always | CrossSubnet
  natOutgoing: true        # SNAT pods accessing non-cluster IPs
  nodeSelector: all()      # which nodes can use this pool
  blockSize: 26            # /26 = 64 IPs per block (default /26 for IPv4)
  disabled: false
---
# IPAMBlock: auto-created per node; not edited manually
# kubectl get ipamblocks -o yaml
# Each block is a /26 subnet allocated from an IPPool
---
# BGPConfiguration: global BGP settings
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true   # full-mesh BGP (disable for RR topology)
  asNumber: 64512               # AS number for all nodes
  serviceClusterIPs:
    - cidr: 10.96.0.0/12        # advertise service CIDRs to BGP peers
  serviceExternalIPs:
    - cidr: 203.0.113.0/24
---
# BGPPeer: configure external BGP peer (Route Reflector or ToR)
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: tor-switch-1
spec:
  peerIP: 192.168.1.1
  asNumber: 65000
  nodeSelector: rack == "rack-1"
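To see how an IPPool is carved into IPAM blocks, the arithmetic can be checked with the standard `ipaddress` module. This is only a sketch of the subnet math for the pool and block size shown above; the `block_stats` helper is hypothetical and says nothing about how Calico actually assigns blocks to nodes.

```python
import ipaddress
from typing import Tuple


def block_stats(pool_cidr: str, block_size: int) -> Tuple[int, int]:
    """Return (number of blocks in the pool, addresses per block)."""
    pool = ipaddress.ip_network(pool_cidr)
    if block_size < pool.prefixlen:
        raise ValueError("block prefix must not be shorter than the pool prefix")
    blocks = 2 ** (block_size - pool.prefixlen)
    addresses = 2 ** (pool.max_prefixlen - block_size)
    return blocks, addresses


count, per_block = block_stats("192.168.0.0/16", 26)
print(f"{count} blocks of {per_block} addresses")

# The first few blocks, as they would appear in `kubectl get ipamblocks`.
pool = ipaddress.ip_network("192.168.0.0/16")
first_three = [str(net) for _, net in zip(range(3), pool.subnets(new_prefix=26))]
print(first_three)
```

A /16 pool with /26 blocks yields 1024 blocks of 64 addresses each, so block count rather than raw address count is often the first limit a large cluster hits.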
Calico eBPF Dataplane
Enable with calicoNetwork.linuxDataplane: BPF (Operator) or FELIX_BPFENABLED=true. Requirements: Linux 5.3+ (5.8+ recommended), kube-proxy disabled (--skip-phases=addon/kube-proxy).
eBPF advantages vs iptables
- O(1) service lookup via BPF maps vs O(n) iptables chains
- Direct server return (DSR): response packets bypass kube-proxy SNAT
- Preserved source IP for NodePort (no SNAT)
- Lower latency at scale (10k+ services)
- TC hook (traffic control) replaces netfilter in the fast path
eBPF limitations
- Requires kube-proxy to be disabled (full replacement)
- Kernel < 5.3 not supported
- No support for some exotic iptables rules
- More complex troubleshooting (bpftool required)
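# Inspect Calico eBPF maps and programs on a node (Calico map names start with "cali")
bpftool map list | grep -i cali
bpftool prog show | grep -i cali
# Collect node diagnostics
calicoctl node diags
# Check eBPF conntrack and NAT tables (run inside the calico-node container)
kubectl exec -n kube-system calico-node-xyz -c calico-node -- calico-node -bpf conntrack dump
kubectl exec -n kube-system calico-node-xyz -c calico-node -- calico-node -bpf nat dump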
Cilium
Cilium is an eBPF-native CNI that implements the network dataplane with eBPF programs loaded into the Linux kernel and can also replace kube-proxy entirely. The minimum kernel depends on the Cilium release (4.9.17+ for older releases; 5.10+ recommended), and deep observability is provided by Hubble.
Architecture
cilium-agent (DaemonSet)
Core daemon on every node. Loads eBPF programs, manages BPF maps, handles IPAM (CiliumNode CRD), enforces network policies. Communicates with the Kubernetes API server and (optionally) Cilium KVStore (etcd).
cilium-operator (Deployment)
Cluster-scoped operations: IPAM allocation for large clusters, CiliumNetworkPolicy sync, and garbage collection of stale CiliumEndpoints left behind by terminated pods.
Hubble (observability)
Ring buffer-based eBPF observability layer. hubble-relay aggregates per-node Hubble servers into a cluster-wide gRPC API. hubble-ui provides a real-time network flow visualization graph. Overhead is low when no observer is connected.
cilium-envoy (optional)
Envoy proxy for L7 policy enforcement (HTTP, gRPC, Kafka). Runs as a sub-process of cilium-agent (newer releases can run it as a separate DaemonSet). L7 policies are expressed in CiliumNetworkPolicy under spec.ingress[].toPorts[].rules with HTTP match expressions.
eBPF Dataplane Internals
Cilium loads eBPF programs at multiple kernel hooks:
| Hook Point | Direction | Purpose |
|---|---|---|
| tc ingress on veth host side | Pod → host | Enforce egress NetworkPolicy from pod's perspective; SNAT for masquerade |
| tc egress on veth host side | Host → pod | Enforce ingress NetworkPolicy from pod's perspective; DNAT for service |
| XDP (or tc ingress) on physical NIC | External → node | NodePort load balancing; early drop for DDoS; DSR redirect |
| cgroup/connect4 (cgroup sock_addr) | Socket level | Transparent socket-level load balancing (bypasses kernel netfilter entirely) |
| kprobe (optional) | Kernel events | Process-level visibility for Hubble |
IPAM Modes
| Mode | Config | Where IPs Come From | Use Case |
|---|---|---|---|
| cluster-pool (default) | ipam: cluster-pool | Operator assigns per-node PodCIDR from clusterPoolIPv4PodCIDR | Generic on-prem / self-managed |
| kubernetes | ipam: kubernetes | Uses node.spec.podCIDR set by kube-controller-manager | kubeadm clusters |
| aws-eni | ipam: eni | Cilium operator attaches ENIs and assigns secondary IPs via EC2 API | EKS with native VPC routing |
| azure | ipam: azure | Cilium operator assigns IPs from Azure VNET subnet | AKS with native Azure networking |
| crd | ipam: crd | CiliumNode CRD spec.ipam.pools | Multi-homing, custom IPAM |
CiliumNetworkPolicy
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: api-server                 # applies to pods with this label
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend             # allow from frontend pods
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:                     # L7 HTTP rules
              - method: "GET"
                path: "/api/v1/.*"
              - method: "POST"
                path: "/api/v1/items"
    - fromEntities:
        - cluster                     # allow all intra-cluster traffic
  egress:
    - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns   # allow DNS
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
          rules:
            dns:
              - matchPattern: "*"     # L7 DNS visibility; must admit the names used by toFQDNs below
    - toFQDNs:                        # FQDN-based policy
        - matchName: "api.stripe.com"
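The HTTP rules in the policy are method and path patterns, and a request is allowed when any one rule matches both. The sketch below models that evaluation for the two rules shown; it assumes anchored, whole-string regex matching and uses Python's `re` as a stand-in for the proxy's regex engine, so edge cases may differ from what Envoy actually does.

```python
import re
from typing import Iterable, Tuple

# (method pattern, path pattern) pairs copied from the policy above.
HTTP_RULES = [
    ("GET", "/api/v1/.*"),
    ("POST", "/api/v1/items"),
]


def l7_allowed(method: str, path: str,
               rules: Iterable[Tuple[str, str]] = HTTP_RULES) -> bool:
    """Return True if any rule matches both the method and the full path."""
    for method_pattern, path_pattern in rules:
        if re.fullmatch(method_pattern, method) and re.fullmatch(path_pattern, path):
            return True
    return False


for request in [("GET", "/api/v1/users"), ("POST", "/api/v1/items"),
                ("POST", "/api/v1/users"), ("DELETE", "/api/v1/items")]:
    verdict = "allow" if l7_allowed(*request) else "deny"
    print(request, verdict)
```

Note that the rules are an allow-list: a POST to any path other than `/api/v1/items` is denied even though GET is allowed on the whole `/api/v1/` prefix.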
Hubble Observability
# Install Hubble CLI
export HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --remote-name-all https://github.com/cilium/hubble/releases/download/$HUBBLE_VERSION/hubble-linux-amd64.tar.gz
# Enable Hubble in Cilium
helm upgrade cilium cilium/cilium \
--namespace kube-system \
--set hubble.enabled=true \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true
# Port-forward Hubble relay
cilium hubble port-forward &
# Real-time flow inspection
hubble observe --namespace production --follow
hubble observe --pod frontend/frontend-abc --verdict DROPPED
hubble observe --protocol TCP --port 8080 --output json
# Policy troubleshooting
hubble observe --namespace production --verdict DROPPED --last 100 --output json | \
jq 'select(.flow.l4.TCP.destination_port == 5432)'
# Network flow statistics
hubble observe --namespace production --type trace --last 1000 --output json | \
jq -r '.flow | [.source.pod_name, .destination.pod_name, .verdict] | @tsv' | \
sort | uniq -c | sort -rn | head -20
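The same flow statistics can be computed without jq when the JSON output is piped into a script. This sketch assumes one JSON object per line with the `flow.source.pod_name`, `flow.destination.pod_name`, and `flow.verdict` fields used in the pipeline above; the sample lines and pod names are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical sample of JSON flow lines.
SAMPLE = """\
{"flow": {"source": {"pod_name": "frontend-abc"}, "destination": {"pod_name": "api-server-def"}, "verdict": "FORWARDED"}}
{"flow": {"source": {"pod_name": "frontend-abc"}, "destination": {"pod_name": "postgres-0"}, "verdict": "DROPPED"}}
{"flow": {"source": {"pod_name": "frontend-abc"}, "destination": {"pod_name": "api-server-def"}, "verdict": "FORWARDED"}}
"""


def summarize(lines: Iterable[str]) -> List[Tuple[Tuple[str, str, str], int]]:
    """Count flows per (source pod, destination pod, verdict), most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        flow = json.loads(line).get("flow", {})
        key = (
            flow.get("source", {}).get("pod_name", ""),
            flow.get("destination", {}).get("pod_name", ""),
            flow.get("verdict", ""),
        )
        counts[key] += 1
    return counts.most_common()


for (src, dst, verdict), n in summarize(SAMPLE.splitlines()):
    print(f"{n:5d}  {src} -> {dst}  {verdict}")
```

In practice the input would come from `sys.stdin` rather than the embedded sample.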
Cluster Mesh
Cilium Cluster Mesh connects up to 255 clusters with a shared service discovery and network policy model. Each cluster runs a clustermesh-apiserver that exposes its Cilium KVStore over TLS. Remote clusters are registered as Kubernetes Secrets:
# Enable cluster mesh on cluster-1
cilium clustermesh enable --context cluster-1
# Connect cluster-2 to cluster-1
cilium clustermesh connect \
--context cluster-1 \
--destination-context cluster-2
# Global service — load balance across clusters
kubectl annotate service my-svc \
service.cilium.io/global=true \
service.cilium.io/shared=true
Flannel
Flannel (CoreOS → flannel-io) is the simplest CNI plugin: a single binary that assigns a subnet to each node and wraps packets in a configurable backend. It has NO network policy support — Calico is often chained with Flannel (Canal) for policy enforcement.
Backend Types
| Backend | Mechanism | Overhead | Notes |
|---|---|---|---|
| vxlan (default) | VXLAN over UDP 8472 | ~50 bytes | Works in most cloud environments; DirectRouting bypasses VXLAN within same subnet |
| host-gw | Static kernel routes (no encapsulation) | ~0 | Requires all nodes in same L2 domain; no cloud NAT support |
| wireguard | WireGuard UDP 51820 | ~60 bytes | Encrypted; kernel 5.6+ or wireguard-go |
| udp (deprecated) | Userspace TUN + UDP | High | Legacy; extremely slow; only use for debugging |
| ipip | IP-in-IP | ~20 bytes | Some cloud providers block protocol 4 |
| alloc | IPAM only (no dataplane) | N/A | When another component handles the dataplane |
Flannel Configuration
// /etc/kube-flannel/net-conf.json (ConfigMap kube-flannel-cfg)
{
"Network": "10.244.0.0/16",
"Backend": {
"Type": "vxlan",
"VNI": 1,
"Port": 8472,
"DirectRouting": true // use host-gw within same subnet, VXLAN across
}
}
Flannel does not implement Kubernetes NetworkPolicy. Use Canal (Flannel + Calico policy engine) for NetworkPolicy support with Flannel-style routing, or migrate to Calico/Cilium entirely.
AWS VPC CNI
The AWS VPC CNI (amazon-vpc-cni-k8s) places pod IPs directly in the VPC subnet — pods get real VPC IP addresses, not overlay addresses. This enables native VPC routing, security groups per pod, and direct connectivity to AWS services without NAT.
ENI Secondary IP Model
Max Pods Calculation
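# Formula: (ENIs × (IPs_per_ENI - 1)) + 2
# -1 because each ENI's primary IP is not handed to pods; +2 for host-network pods (aws-node, kube-proxy)
# m5.xlarge: 4 ENIs × 15 IPs/ENI
# Max pods = (4 × (15-1)) + 2 = 58
# With prefix delegation (/28 = 16 IPs per slot): (4 × (15-1) × 16) + 2 = 898
# In practice max pods is capped well below this (for example 110 on smaller instance types)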
# View instance type limits
aws ec2 describe-instance-types \
--instance-types m5.xlarge \
--query 'InstanceTypes[].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]'
# Check current pod count vs limit on node
kubectl describe node ip-10-0-1-5.ec2.internal | grep "Allocated resources" -A 5
kubectl get node ip-10-0-1-5.ec2.internal -o jsonpath='{.status.allocatable.pods}'
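The max-pods formula is simple enough to script when comparing instance types. The helper below is a sketch of the formula as written above, with the ENI and IP counts passed in as parameters (they would come from `describe-instance-types`); the `max_pods` name and the fixed factor of 16 addresses per /28 prefix are the only assumptions.

```python
def max_pods(enis: int, ips_per_eni: int, prefix_delegation: bool = False) -> int:
    """Apply (ENIs x (IPs_per_ENI - 1)) + 2, optionally with /28 prefix delegation."""
    slots = enis * (ips_per_eni - 1)  # one IP per ENI is its primary address
    if prefix_delegation:
        slots *= 16  # each slot holds a /28 prefix = 16 addresses
    return slots + 2  # two host-network pods do not consume pod IPs


print(max_pods(3, 10))                          # 3 ENIs, 10 IPs each
print(max_pods(4, 15))                          # 4 ENIs, 15 IPs each
print(max_pods(4, 15, prefix_delegation=True))  # same instance with /28 prefixes
```

The prefix-delegation figure is an address ceiling, not a recommended kubelet `--max-pods` value; CPU and memory usually become the limit first.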
AWS VPC CNI Configuration
# Enable prefix delegation (dramatically increases pod density)
kubectl set env daemonset aws-node -n kube-system \
ENABLE_PREFIX_DELEGATION=true \
WARM_PREFIX_TARGET=1
# Security Groups for Pods
kubectl set env daemonset aws-node -n kube-system \
ENABLE_POD_ENI=true
# Pod-level security group (per-pod ENI trunking)
# Uses ENI trunking — requires Nitro-based instances
# SecurityGroupPolicy CRD
cat <
AWS VPC CNI consumes real VPC subnet IPs. A /24 subnet (251 usable IPs after the 5 addresses AWS reserves) shared across 10 nodes with 15 IPs/ENI can be exhausted quickly. Plan separate subnets for worker nodes and use prefix delegation or RFC 1918 large private CIDRs. Consider ENABLE_SUBNET_DISCOVERY=true for multi-subnet IPAM.
Azure CNI
Azure CNI has two modes: flat/overlay-free (pods get VNET IPs directly) and overlay (pods get private IPs from an overlay space, nodes NAT to VNET). The traditional flat mode has the same IP exhaustion concern as AWS VPC CNI.
| Mode | Pod IPs | Pros | Cons |
|---|---|---|---|
| Azure CNI (flat) | Real VNET IPs from subnet | Direct VNET routing; no NAT; Azure NSG enforcement | Requires large subnets; 250 pods = 250 IPs consumed |
| Azure CNI Overlay | Private overlay CIDRs (e.g. 10.244.0.0/16) | No subnet IP exhaustion; scale to 50,000 pods | NAT to VNET; latency overhead; less direct NSG |
| kubenet | Private CIDRs + host NAT | Simple; no VNET IP consumption | No network policy; UDR required for cross-node; deprecated in AKS |
# AKS with Azure CNI Overlay
az aks create \
--resource-group <resource-group> \
--name <cluster-name> \
--network-plugin azure \
--network-plugin-mode overlay \
--pod-cidr 192.168.0.0/16 \
--service-cidr 10.96.0.0/12 \
--dns-service-ip 10.96.0.10
# AKS with Azure CNI + Cilium dataplane
az aks create \
--resource-group <resource-group> \
--name <cluster-name> \
--network-plugin azure \
--network-dataplane cilium \
--network-policy cilium
Antrea
Antrea (VMware) uses Open vSwitch (OVS) as its dataplane. It supports Geneve encapsulation, eBPF (experimental), and integrates with NSX-T for enterprise policy management. Traceflow is Antrea's killer feature: trace a packet through the entire network pipeline.
Architecture
antrea-agent (DaemonSet)
Runs OVS daemon + agent per node. Manages OVS flows, IPAM, NetworkPolicy enforcement. Communicates with antrea-controller via gRPC over the Antrea network.
antrea-controller (Deployment)
Computes and distributes NetworkPolicy rules. Watches Kubernetes API server; distributes computed policies to agents. Single instance with leader election.
Traceflow
# Inject a synthetic probe packet and trace its path
apiVersion: ops.antrea.io/v1alpha1
kind: Traceflow
metadata:
name: frontend-to-api
spec:
source:
namespace: production
pod: frontend-abc-123
destination:
namespace: production
pod: api-server-def-456
port: 8080
packet:
ipHeader:
protocol: 6 # TCP
transportHeader:
tcp:
srcPort: 12345
dstPort: 8080
flags: 2 # SYN
# kubectl get traceflow frontend-to-api -o yaml
# Results show each hop: ingress/egress NetworkPolicy matches, OVS flow actions
CNI Selection Guide
| Requirement | Recommendation | Why |
|---|---|---|
| Max performance, eBPF, no kube-proxy | Cilium | eBPF native; socket-level LB; O(1) service lookup; Hubble observability |
| NetworkPolicy + BGP on-prem | Calico | Native BGP; Felix policy engine; WireGuard encryption; battle-tested at scale |
| AWS EKS native VPC routing | AWS VPC CNI | ENI-native IPs; SG per pod; no overlay needed in AWS VPC |
| AKS with Azure VNET integration | Azure CNI | Native VNET routing; NSG per pod; AKS managed |
| Simple cluster, no policy needed | Flannel | Simple, low overhead, easy to debug; pair with Calico policy (Canal) if needed |
| VMware / NSX-T integration | Antrea | OVS dataplane; NSX-T integration; Traceflow; enterprise support |
| Multi-network (multiple interfaces) | Multus | Meta-CNI; chains multiple CNIs per pod; SR-IOV for NFV/telco |
| Very large clusters (>5k nodes) | Cilium | eBPF maps O(1) scaling; Typha-like fan-out; Cluster Mesh for multi-cluster |
| GKE Autopilot | GKE Dataplane V2 (Cilium) | Managed; deeply integrated; no user choice needed |
Feature Matrix
| Feature | Calico | Cilium | Flannel | AWS VPC CNI | Antrea |
|---|---|---|---|---|---|
| NetworkPolicy (K8s) | ✓ | ✓ | ✗ | ✗* | ✓ |
| Extended NetworkPolicy (L7) | Limited | ✓ (HTTP/DNS/Kafka) | ✗ | ✗ | Limited |
| eBPF dataplane | ✓ opt-in | ✓ default | ✗ | ✗ | Experimental |
| kube-proxy replacement | ✓ eBPF mode | ✓ default | ✗ | ✗ | ✗ |
| Encryption | WireGuard/IPSec | WireGuard/IPSec | WireGuard | TLS app layer | IPSec |
| Observability | Metrics | Hubble (flows) | Minimal | VPC Flow Logs | Traceflow |
| BGP support | ✓ native | ✓ BGP Control Plane | ✗ | ✗ | ✗ |
| IPv6 / dual-stack | ✓ | ✓ | Limited | ✓ | ✓ |
| Windows nodes | ✓ | ✗ | ✓ | ✓ | ✓ |
| Multi-cluster | Federation | Cluster Mesh | ✗ | ✗ | Multi-cluster GW |
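* Not enabled by default. AWS VPC CNI v1.14+ can enforce Kubernetes NetworkPolicy natively when the feature is turned on; earlier versions need Calico for NetworkPolicy on EKS.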
CNI Debugging
crictl — CRI-Level CNI Inspection
# List pod sandboxes and their network namespace status
crictl pods
# Inspect a specific pod sandbox
SANDBOX_ID=$(crictl pods --name nginx-abc --quiet)
crictl inspectp $SANDBOX_ID | jq '.status.network'
# Check CNI logs (containerd)
journalctl -u containerd --since "10 minutes ago" | grep -i cni
# Enable CNI debug logging
export CNI_DEBUG=1 # set in containerd config or systemd environment
Manual CNI Plugin Invocation
# Create a test network namespace
ip netns add test-ns
# Manually invoke CNI ADD (for debugging)
cat <<EOF | CNI_COMMAND=ADD CNI_CONTAINERID=test123 \
CNI_NETNS=/var/run/netns/test-ns \
CNI_IFNAME=eth0 \
CNI_PATH=/opt/cni/bin \
/opt/cni/bin/bridge
{
"cniVersion": "1.0.0",
"name": "test",
"type": "bridge",
"bridge": "cni-test0",
"ipam": {
"type": "host-local",
"subnet": "172.19.0.0/24",
"gateway": "172.19.0.1"
}
}
EOF
# Verify result
ip netns exec test-ns ip addr show eth0
ip netns exec test-ns ip route
# Clean up
cat <<EOF | CNI_COMMAND=DEL CNI_CONTAINERID=test123 \
CNI_NETNS=/var/run/netns/test-ns \
CNI_IFNAME=eth0 \
CNI_PATH=/opt/cni/bin \
/opt/cni/bin/bridge
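{ "cniVersion": "1.0.0", "name": "test", "type": "bridge", "bridge": "cni-test0", "ipam": { "type": "host-local", "subnet": "172.19.0.0/24", "gateway": "172.19.0.1" } }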
EOF
ip netns del test-ns
Calico-Specific Debugging
# Check Felix status
kubectl exec -n kube-system -it $(kubectl get pod -n kube-system -l k8s-app=calico-node -o jsonpath='{.items[0].metadata.name}') -c calico-node -- calico-node -felix-live
kubectl exec -n kube-system calico-node-xyz -c calico-node -- calico-node -bird-live
# calicoctl commands (install separately)
calicoctl node status # Felix + BIRD status
calicoctl get ippool -o wide
calicoctl get ipamblock # per-node blocks
calicoctl ipam show --show-blocks
calicoctl ipam check # detect IPAM inconsistencies
# Felix logs (verbose)
kubectl logs -n kube-system calico-node-xyz -c calico-node --since=5m | grep -E "ERROR|WARN|policy"
# Datastore connectivity
DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig calicoctl get nodes
Cilium-Specific Debugging
# Cilium status
cilium status --verbose
# Endpoint (pod) status
cilium endpoint list
cilium endpoint get $(cilium endpoint list | grep 10.244.1.15 | awk '{print $1}')
# Policy verdict for a specific connection
cilium policy trace --src-k8s-pod default:frontend-abc --dst-k8s-pod default:api-server-def --dport 8080/tcp
# BPF map inspection
cilium bpf lb list # service load balancer map
cilium bpf ct list global # conntrack entries
cilium bpf tunnel list # VXLAN tunnel endpoints
cilium bpf nat list # NAT map
# Connectivity test
cilium connectivity test --test pod-to-pod,pod-to-service
# Hubble (if enabled)
cilium hubble port-forward &
hubble observe --namespace default --verdict DROPPED --last 50
Troubleshooting Runbooks
Runbook 1: Pod stuck in ContainerCreating — CNI failure
# 1. Get pod events
kubectl describe pod <name> -n <ns>
# Look for: "Failed to create pod sandbox" or "network plugin is not ready"
# 2. Check kubelet logs on the node
journalctl -u kubelet --since "5 minutes ago" | grep -E "cni|network|sandbox"
# 3. Verify CNI binary exists and is executable
ls -la /opt/cni/bin/
ls -la /etc/cni/net.d/
# 4. Check containerd CNI config
cat /etc/containerd/config.toml | grep -A5 cni
systemctl status containerd
# 5. If Calico: check calico-node pod on that node
NODE=$(kubectl get pod <name> -n <ns> -o jsonpath='{.spec.nodeName}')
kubectl get pod -n kube-system -l k8s-app=calico-node --field-selector spec.nodeName=$NODE
kubectl logs -n kube-system <calico-node-pod> -c calico-node --tail=50
# 6. If Cilium: check cilium agent on that node
kubectl exec -n kube-system cilium-xyz -- cilium status
Runbook 2: IP address exhaustion (IPAM out of IPs)
# Calico: check IPAM utilization
calicoctl ipam show --show-blocks
calicoctl ipam check
# Look for "blocks with no matching node" — orphaned blocks
# Release leaked IPs
calicoctl ipam release --ip=192.168.5.20 # release specific IP
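# Calico: larger blocks for dense nodes (disruptive)
# blockSize cannot be changed on an existing IPPool. Create a new pool with the desired
# blockSize (e.g. /24 = 256 IPs per block instead of /26 = 64), disable the old pool,
# then recreate pods so they draw addresses from the new pool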
# AWS VPC CNI: check ENI limits
kubectl describe node <node> | grep "vpc.amazonaws.com"
kubectl get node <node> -o jsonpath='{.metadata.annotations}' | jq
# Enable prefix delegation on AWS
kubectl set env ds aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
# Cilium: check pool utilization
kubectl get ciliumnodes.cilium.io -o json | jq '.items[] | {name: .metadata.name, used: (.status.ipam.used // {} | length), pool: (.spec.ipam.pool // {} | length)}'
Runbook 3: Cross-node pod connectivity failure
# 1. Confirm pod IPs and node assignment
kubectl get pods -o wide -n default
# 2. Test from netshoot debug pod on each node
kubectl debug node/worker-1 -it --image=nicolaka/netshoot -- bash
# Inside: ping <pod-ip-on-worker-2>
# Inside: traceroute <pod-ip-on-worker-2>
# 3. Check VXLAN tunnel (Flannel/Calico VXLAN mode)
ip link show flannel.1 # or vxlan.calico
bridge fdb show dev flannel.1 | grep <destination-node-mac>
ip route | grep <pod-cidr-of-remote-node>
# 4. Check BGP routes (Calico BGP mode)
calicoctl node status
# Should show "Established" for all peer connections
ip route | grep blackhole # blackhole routes = IPAM reserved
# 5. Check Cilium tunnel endpoints
cilium bpf tunnel list
# Verify destination node IP is present
# 6. Check MTU mismatch (common overlay issue)
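kubectl exec -it netshoot -- ping -M do -s 1422 <remote-pod-ip>
# 1422 = 1450 (VXLAN pod MTU) − 28 bytes of IP/ICMP headers; with jumbo frames use pod MTU − 28 (e.g. 8922 for MTU 8950)
# If this fails but small pings work → MTU problem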
# Calico: set MTU in CNI config / Felix MTUIfacePattern
# Flannel: set backend.MTU in net-conf.json
Runbook 4: CNI plugin upgrade procedure
# Calico upgrade (Operator-managed): apply the operator manifest for the target release;
# the operator then rolls out the matching calico-node and controller images
kubectl apply --server-side --force-conflicts -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml
# Watch rollout (operator-managed installs run in calico-system)
kubectl rollout status ds/calico-node -n calico-system
# Cilium upgrade (Helm)
helm upgrade cilium cilium/cilium \
--namespace kube-system \
--version 1.15.0 \
--reuse-values \
--set upgradeCompatibility=1.14
# Monitor
cilium status --wait
# Critical: upgrade one minor version at a time (e.g. 1.14 → 1.15); never skip minor versions
# Check upgrade notes: https://docs.cilium.io/en/stable/operations/upgrade/
Key CNI Metrics
| Metric | Source | Alert Threshold |
|---|---|---|
| felix_ipset_errors_total | Calico Felix | >0 sustained |
| felix_route_table_list_seconds | Calico Felix | p99 > 1s |
| ipam_ips_in_use / ipam_ips_total | Calico | ratio > 0.85 (85% exhaustion) |
| cilium_endpoint_state{endpoint_state="not-ready"} | Cilium | >0 |
| cilium_drop_count_total | Cilium | spike > baseline × 3 |
| cilium_bpf_map_ops_total{outcome="fail"} | Cilium | >0 |
| awscni_assigned_ip_addresses / awscni_total_ip_addresses | AWS VPC CNI | ratio > 0.90 |
| network_plugin_operations_latency_microseconds | kubelet | p99 > 5s |
Production Best Practices
IPAM Planning
- Size pod CIDR for 2× peak capacity
- For Calico: /26 blocks (default) → 64 IPs per block, with nodes claiming extra blocks as needed; use /24 for dense nodes
- Monitor IPAM utilization at 70% → start planning expansion
- Never overlap pod CIDR with service CIDR or node CIDR
- Reserve separate subnets in cloud VPCs for nodes
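The no-overlap rule in the list above can be checked mechanically before a cluster is built. The sketch below uses the standard `ipaddress` module; the `find_overlaps` helper and the example CIDR values are illustrative assumptions, not recommended ranges.

```python
import ipaddress
import itertools
from typing import Dict, List, Tuple


def find_overlaps(cidrs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of named CIDRs whose address ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in itertools.combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


good_plan = {"pod": "10.244.0.0/16", "service": "10.96.0.0/12", "node": "10.0.0.0/24"}
bad_plan = {"pod": "10.100.0.0/16", "service": "10.96.0.0/12", "node": "10.0.0.0/24"}

print(find_overlaps(good_plan))  # no conflicts expected
print(find_overlaps(bad_plan))   # pod range falls inside the service range
```

The second plan fails because 10.96.0.0/12 spans 10.96.0.0 through 10.111.255.255, which contains 10.100.0.0/16.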
MTU Configuration
- Overlay (VXLAN): set pod MTU = NIC MTU − 50 (typically 1450)
- IPIP: pod MTU = NIC MTU − 20 (typically 1480)
- No overlay: pod MTU = NIC MTU (1500 or 9000 with jumbo frames)
- Test with: ping -M do -s 1472 (1472 + 28 IP/ICMP = 1500)
- Jumbo frames: set NIC MTU 9000 + Calico mtu: 8950
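The MTU rules above reduce to two subtractions: encapsulation overhead from the NIC MTU, then 28 bytes of IPv4 and ICMP headers to get the largest ping payload that fits. The sketch below encodes the per-encapsulation overheads quoted on this page as approximate values; the helper names are hypothetical.

```python
# Approximate per-packet overhead in bytes, as quoted in the tables on this page.
ENCAP_OVERHEAD = {"none": 0, "ipip": 20, "vxlan": 50, "wireguard": 60}

IP_ICMP_HEADERS = 28  # 20-byte IPv4 header + 8-byte ICMP header


def pod_mtu(nic_mtu: int, encap: str) -> int:
    """MTU to configure on pod interfaces for a given NIC MTU and encapsulation."""
    return nic_mtu - ENCAP_OVERHEAD[encap]


def ping_payload(mtu: int) -> int:
    """Largest `ping -M do -s` payload that fits in one packet of this MTU."""
    return mtu - IP_ICMP_HEADERS


for nic in (1500, 9000):
    mtu = pod_mtu(nic, "vxlan")
    print(f"NIC {nic}: VXLAN pod MTU {mtu}, test with ping -M do -s {ping_payload(mtu)}")
```

A ping with the computed payload should succeed between pods on different nodes, and one byte more should fail with a fragmentation-needed error if the path MTU is configured as expected.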
Upgrade Safety
- Test CNI upgrades in staging first
- Use PodDisruptionBudgets on critical workloads
- Calico: rolling upgrade via DaemonSet maxUnavailable=1
- Cilium: use the --set upgradeCompatibility flag
- Never upgrade CNI and Kubernetes simultaneously
Security
- Enable NetworkPolicy deny-by-default in all namespaces
- Use Calico GlobalNetworkPolicy for cluster-wide baseline rules
- Enable WireGuard encryption for inter-node pod traffic
- Restrict CNI DaemonSet to only needed host mounts
- Enable Hubble/Traceflow for anomaly detection
Observability
- Scrape Felix / cilium-agent metrics in Prometheus
- Alert on IPAM exhaustion at 80% threshold
- Use Hubble UI to visualize network flows in staging
- Enable Cilium drop metrics to catch NetworkPolicy rejections
- Monitor network_plugin_operations_latency_microseconds
Performance
- Prefer eBPF (Cilium or Calico eBPF) for >1k services or >500 nodes
- Enable Typha for Calico at >100 nodes
- Use BGP underlay when possible (zero encapsulation overhead)
- Jumbo frames (9000 MTU) for storage-intensive pods
- Pin NUMA-sensitive workloads; check CNI NUMA awareness