Pod Networking

Pod networking is the foundation of all Kubernetes communication. It is the layer that makes every pod a first-class network citizen with its own IP address, reachable from any other pod in the cluster without NAT. This page covers the kernel mechanisms that make this possible: Linux network namespaces, veth pairs, the pause container, bridges, routes, and the two major cross-node routing strategies (overlay and underlay). It ends with a full packet trace, CNI invocation details, and diagnostic commands.

Linux Network Namespaces

Every pod lives in its own Linux network namespace — a complete, isolated instance of the kernel network stack including its own interfaces, routing table, iptables rules, and port number space. Namespaces are the kernel primitive that makes "one IP per pod" possible.

# Create a network namespace manually (what the container runtime does for pods)
ip netns add pod-demo

# Show interfaces inside the namespace
ip netns exec pod-demo ip link show
# Output: only loopback (lo) until a veth pair is added

# Verify isolation: routes inside the namespace are independent
ip netns exec pod-demo ip route show

# List all network namespaces on a node
ls -la /var/run/netns/
# Pod network namespaces are always reachable via /proc/<pid>/ns/net; some runtimes
# (containerd, for example) also bind-mount them under /var/run/netns as cni-<id>
# Enter one by PID: nsenter --target <pid> --net ip addr show

Network Namespace vs Other Namespaces

A pod's network namespace is shared by all containers in the pod: they all have the same IP address and see the same network interfaces, which is why two containers in the same pod cannot both listen on port 8080. The IPC and UTS namespaces are shared pod-wide as well. Mount namespaces are always per-container, and PID namespaces are per-container unless shareProcessNamespace: true is set.

The Pause Container

The pause container (infra container) is a minimal process whose sole job is to hold the network namespace open. It is started first by the kubelet (via CRI RunPodSandbox) before any application containers. Application containers then join its network namespace.

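The pause container image (registry.k8s.io/pause:3.9) is a ~700 KB binary that blocks in pause() and, when the pod shares a PID namespace, acts as PID 1 to reap zombie processes. Because it owns the sandbox, the pod's network identity survives application container restarts. If the pause container itself exits, the kubelet treats the sandbox as dead and recreates it, which restarts every container in the pod, usually with a new IP.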

veth Pairs

A veth pair is a connected pair of virtual Ethernet interfaces. Traffic sent into one end emerges from the other. One end lives inside the pod's network namespace (eth0); the other lives in the host network namespace (named vethXXXXXX or cali... depending on CNI).

# View veth pairs on a node
ip link show | grep veth

# Find which veth belongs to a pod (by pod IP or container PID)
# Method 1: via container PID
CONTAINER_PID=$(crictl inspect <container-id> | jq .info.pid)
nsenter --target $CONTAINER_PID --net ip addr show eth0
# shows: eth0: <ip> — check ifindex

# Method 2: match ifindex
# Inside pod netns:
ip link show eth0   # e.g., 23: eth0@if24
# On host:
ip link show | grep "^24:"   # finds veth peer at index 24

# Trace veth to bridge:
bridge link show    # shows which veth interfaces are attached to cni0
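
The ifindex matching in Method 2 can be scripted. The sketch below is only an illustration: peer_ifindex is a hypothetical helper name, and it assumes the usual "N: eth0@ifM: ..." first line printed by ip link show for a veth interface.

```shell
# Extract the peer ifindex from the first line of `ip link show eth0`,
# e.g. "23: eth0@if24: <BROADCAST,...>" yields 24.
# peer_ifindex is a hypothetical helper name used for illustration.
peer_ifindex() {
  local line="$1"
  local after="${line#*@if}"             # drop everything through "@if"
  [ "$after" != "$line" ] || return 1    # no "@if" marker: not a veth end
  echo "${after%%:*}"                    # keep the digits before the next ":"
}

peer_ifindex '23: eth0@if24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450'   # prints 24
```

On the node, the printed index can then be passed to ip link show | grep "^24:" as above. Reading /sys/class/net/eth0/iflink inside the pod's namespace returns the same peer index without any parsing.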

Within-Node Pod Networking

When Pod A (10.244.1.5) sends to Pod B (10.244.1.8) on the same node:

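1. The packet leaves Pod A's eth0 and crosses the veth pair into the host network namespace.
2. The host forwards it locally: on bridge-based CNIs (cni0) both pods sit in the 10.244.1.0/24 subnet, so the bridge switches the frame at L2; in Calico's host-route mode the kernel matches a per-pod route that points at Pod B's veth.
3. The bridge (or host route) delivers the packet to Pod B's host-side veth, and it enters Pod B's namespace as eth0.
4. No encapsulation and no NAT are involved; forwarding stays entirely within the host.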
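
Whether traffic takes this same-node path or one of the cross-node paths described next comes down to whether the destination falls inside the node's own pod CIDR. A minimal sketch of that check in plain shell arithmetic follows; the function names are hypothetical, and the kernel makes this decision through its routing table rather than through code like this.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1                               # split the address on dots
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 lies inside CIDR $2 (e.g. 10.244.1.0/24).
in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}" mask
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Hypothetical helper: classify a destination relative to the local pod CIDR.
forwarding_path() {
  if in_cidr "$1" "$2"; then echo "same-node"; else echo "cross-node"; fi
}

forwarding_path 10.244.1.8 10.244.1.0/24   # same-node
forwarding_path 10.244.2.3 10.244.1.0/24   # cross-node
```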

Cross-Node Pod Networking

Cross-node pod routing is where CNI plugins differ fundamentally. There are two strategies:

Overlay (encapsulation)

Pod packets wrapped in an outer IP/UDP or IP/IP header
Outer packet routed normally by the physical network
Outer src/dst = node IPs (always routable)
Inner src/dst = pod IPs (not known to physical network)
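VXLAN: UDP port 8472 with Flannel and Cilium (the Linux kernel default; the IANA-assigned port is 4789, which Calico uses by default), up to 16M virtual networks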
GENEVE: UDP port 6081 (used by OVN/Antrea)
IPIP: IP-in-IP, lower overhead than VXLAN
Works on any network with node-to-node IP connectivity
~5–10% overhead for encap/decap (negligible on modern hardware)

Underlay (BGP/native routing)

No encapsulation — pod IPs are natively routed
Node advertises its podCIDR to BGP peers (router or peers)
Physical network learns pod routes and forwards directly
Calico BGP: each node is a BGP speaker; route reflectors for scale
AWS VPC CNI: pod IPs are real ENI secondary IPs; VPC routing table knows them
Azure CNI: pod IPs are real VNET IPs via IPAM
Zero overhead — packets are native L3
Requires network infrastructure support (BGP capability or cloud native routing)

VXLAN Packet Walkthrough

# Pod A (10.244.1.5 on node1) → Pod C (10.244.2.3 on node2)

# Inner packet (original):
# src: 10.244.1.5   dst: 10.244.2.3  proto: TCP  dport: 8080

# VXLAN encapsulation added by CNI dataplane (flannel/calico/cilium):
# Outer IP header:  src: 10.0.1.10 (node1)   dst: 10.0.1.11 (node2)
# UDP header:       sport: random              dport: 8472
# VXLAN header:     VNI (24-bit virtual network ID)
# Inner Ethernet:   L2 frame header (VXLAN tunnels whole Ethernet frames)
# Inner IP header:  src: 10.244.1.5            dst: 10.244.2.3

# Physical network sees: node1 → node2 UDP/8472
# Node2 VTEP (Virtual Tunnel EndPoint) decapsulates → inner packet routed to Pod C veth

# Verify VXLAN tunnels on a node:
ip -d link show flannel.1   # Flannel VXLAN interface
bridge fdb show dev flannel.1  # VTEP → node IP mappings

BGP Underlay Walkthrough

# Calico BGP mode: each node announces its podCIDR
# node1 announces: 10.244.1.0/24 via 10.0.1.10
# node2 announces: 10.244.2.0/24 via 10.0.1.11

# Route table on node1 (Calico BGP):
ip route show
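# 10.244.1.5 dev cali<hash> scope link              # local pod: per-pod host route to its veth (Calico uses no bridge)
# blackhole 10.244.1.0/24 proto bird                # rest of the local block, so unassigned IPs are dropped
# 10.244.2.0/24 via 10.0.1.11 dev eth0 proto bird   # learned via BGP: next-hop = node2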

# Packet from Pod A → Pod C:
# src: 10.244.1.5  dst: 10.244.2.3
# Routed to 10.0.1.11 via eth0 — NO encapsulation
# Node2 receives, routes to Pod C via veth — standard L3

# Calico BGP peering status:
calicoctl node status
# Shows peer IPs and BGP session state (Established)

CNI Plugin Invocation

The CNI is not a daemon — it is a set of binaries called by the container runtime at specific lifecycle events. The kubelet instructs containerd (via CRI) to set up networking, and containerd calls the CNI binary:

kubelet RunPodSandbox → containerd CRI plugin → exec /opt/cni/bin/<plugin> ADD → pod gets IP

# CNI binary locations
ls /opt/cni/bin/
# bandwidth  bridge  dhcp  firewall  flannel  host-device  host-local
# ipvlan  loopback  macvlan  portmap  ptp  sbr  static  tuning  vlan  vrf

# CNI configuration
ls /etc/cni/net.d/
# 10-calico.conflist   (Calico)
# 10-flannel.conflist  (Flannel)

# Example Flannel CNI config:
cat /etc/cni/net.d/10-flannel.conflist
# {
#   "cniVersion": "0.3.1",
#   "name": "cbr0",
#   "plugins": [
#     { "type": "flannel", "delegate": { "isDefaultGateway": true } },
#     { "type": "portmap", "capabilities": {"portMappings": true} }
#   ]
# }

# CNI ADD called with environment variables:
# CNI_COMMAND=ADD
# CNI_CONTAINERID=<sandbox container ID>
# CNI_NETNS=/proc/<pid>/ns/net   (the pause container's netns)
# CNI_IFNAME=eth0                  (interface to create inside the netns)
# CNI_PATH=/opt/cni/bin
# stdin: JSON config from /etc/cni/net.d/

# CNI ADD returns JSON with allocated IP:
# {
#   "ips": [{"address": "10.244.1.5/24", "gateway": "10.244.1.1"}],
#   "routes": [{"dst": "0.0.0.0/0"}]
# }

CNI Operations

CNI Command	Trigger	What the plugin does
`ADD`	Pod sandbox created (RunPodSandbox)	Create veth pair; assign IP from IPAM; set up routes; add to bridge/VXLAN; return allocated IP
`DEL`	Pod sandbox deleted (RemovePodSandbox)	Remove veth pair; release IP back to IPAM; remove routes; clean up bridge/VXLAN entries
`CHECK`	Runtime-initiated verification (CNI spec 0.4.0+; not every runtime calls it)	Verify the network is still correctly configured for the pod; return an error if not
`VERSION`	At startup	Return supported CNI spec versions
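
To make the calling convention concrete, here is a minimal sketch of how a plugin dispatches on CNI_COMMAND. cni_stub is a hypothetical stand-in, not a real plugin: it configures nothing, its ADD result is a canned value matching the example on this page, and the error code 100 is taken from the plugin-specific range.

```shell
# Hypothetical stub showing the CNI contract: command in the environment,
# network config on stdin, JSON result on stdout, non-zero exit on error.
cni_stub() {
  case "${CNI_COMMAND:-}" in
    VERSION)
      echo '{"cniVersion":"0.3.1","supportedVersions":["0.3.0","0.3.1"]}'
      ;;
    ADD)
      cat >/dev/null   # a real plugin parses this config, then wires up $CNI_IFNAME in $CNI_NETNS
      echo '{"cniVersion":"0.3.1","ips":[{"version":"4","address":"10.244.1.5/24","gateway":"10.244.1.1"}],"routes":[{"dst":"0.0.0.0/0"}]}'
      ;;
    DEL|CHECK)
      cat >/dev/null   # success is signalled by exit 0 with no output
      ;;
    *)
      echo '{"cniVersion":"0.3.1","code":100,"msg":"unsupported CNI_COMMAND"}'
      return 1
      ;;
  esac
}

echo '{"cniVersion":"0.3.1","name":"cbr0","type":"stub"}' | CNI_COMMAND=ADD CNI_IFNAME=eth0 cni_stub
```

The runtime never talks to a long-running process here: each lifecycle event is a fresh exec of the binary with a different CNI_COMMAND.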

IP Assignment (IPAM)

The IP Address Management (IPAM) component of the CNI plugin handles IP allocation. Common IPAM plugins:

IPAM Plugin	Storage	Scope	Used by
`host-local`	Files in `/var/lib/cni/networks/`	Per-node (from node's podCIDR)	Flannel, simple setups
`calico-ipam`	etcd / Kubernetes CRDs	Cluster-wide pools; cross-node IP blocks	Calico
`cilium-ipam`	Kubernetes CiliumNode CRD	Per-node blocks from global pool	Cilium
`aws-cni`	EC2 ENI secondary IPs	Per-ENI (tied to instance type)	AWS VPC CNI (EKS)
`azure-vnet-ipam`	Azure IPAM API	Per-subnet (VNET)	Azure CNI (AKS)
`dhcp`	External DHCP server	Whatever DHCP assigns	Bare metal, legacy

# host-local IPAM: check allocated IPs
ls /var/lib/cni/networks/cbr0/
# Files named by IP: 10.244.1.5, 10.244.1.8, etc.
cat /var/lib/cni/networks/cbr0/10.244.1.5
# Contains: container ID that holds the IP

# Calico IPAM: check IP pools and blocks
calicoctl get ippool -o wide
calicoctl get ipamblock -o wide   # per-node blocks

# Find IP allocations
calicoctl ipam show --show-blocks

Pod Network Configuration Files

The kubelet, not the CNI plugin, writes two files into each container's filesystem that control how the pod resolves names:

# /etc/resolv.conf (injected by kubelet, based on cluster DNS config):
nameserver 10.96.0.10       # CoreDNS ClusterIP
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

# /etc/hosts (managed by kubelet):
127.0.0.1   localhost
::1         localhost ip6-localhost ip6-loopback
10.244.1.5  my-pod-name

# Inspect from inside a running pod:
kubectl exec my-pod -- cat /etc/resolv.conf
kubectl exec my-pod -- cat /etc/hosts
kubectl exec my-pod -- ip addr show
kubectl exec my-pod -- ip route show

hostNetwork Pods

Pods with spec.hostNetwork: true skip the pod network namespace entirely and run in the host's network namespace. They see all host interfaces, use the node IP, and share port space with the host.

# Pod using host network (e.g., kube-proxy, node exporters, CNI DaemonSets)
spec:
  hostNetwork: true
  containers:
  - name: node-exporter
    image: prom/node-exporter
    ports:
    - containerPort: 9100
      hostPort: 9100   # binds directly on the node's port 9100

# Implications:
# - Pod IP == Node IP (no isolation)
# - Port conflicts with other host processes
# - Required for: CNI plugins, kube-proxy, node monitoring agents
# - Security risk: full access to host network interfaces

# Check which pods use hostNetwork:
kubectl get pods -A -o jsonpath='{range .items[?(@.spec.hostNetwork==true)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'

hostNetwork Security Boundary

hostNetwork: true is a significant security escalation. The pod can sniff all traffic on the node's interfaces, bind to privileged ports, and interact with any service listening on localhost of the node. Restrict it to trusted system DaemonSets (CNI, kube-proxy, monitoring agents). Never allow it for user workloads — enforce with PodSecurity admission or OPA Gatekeeper.

Full Packet Trace: Pod A → Pod C (Cross-Node)

# Setup:
# Pod A: 10.244.1.5 on node1 (10.0.1.10) — CNI: Flannel VXLAN
# Pod C: 10.244.2.3 on node2 (10.0.1.11)

# Step 1: Pod A kernel routing
# ip route inside pod A:
# default via 10.244.1.1 dev eth0   (or 169.254.1.1 for Calico)
# → packet exits eth0 of pod A

# Step 2: veth pair — packet enters host netns
# on node1: ip route:
# 10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
# → packet handed to flannel.1 VTEP

# Step 3: VXLAN encapsulation
# flannel.1 resolves the next hop 10.244.2.0 to node2's VTEP MAC (static neighbor entry)
# bridge fdb: node2's VTEP MAC → dst 10.0.1.11 (programmed by flanneld from its etcd/k8s watch)
# Outer header: src=10.0.1.10 dst=10.0.1.11 UDP/8472
# Inner: src=10.244.1.5 dst=10.244.2.3

# Step 4: physical network delivers to node2 eth0

# Step 5: node2 kernel receives UDP/8472
# flannel.1 VTEP decapsulates
# inner packet: dst=10.244.2.3

# Step 6: node2 routing
# 10.244.2.0/24 dev cni0 proto kernel scope link   (connected route for node2's pod subnet)
# → cni0 bridge forwards the frame to Pod C's veth

# Step 7: Pod C receives
# src=10.244.1.5 (original pod A IP — no NAT anywhere in the path)

# Verify at each step:
# node1: tcpdump -i flannel.1 "host 10.244.2.3"
# node2: tcpdump -i flannel.1 "host 10.244.1.5"
# pod: kubectl exec pod-c -- tcpdump -i eth0

Diagnostics and Debugging

# Get pod IP and node
kubectl get pod my-pod -o wide

# Verify pod network from inside the pod
kubectl exec -it my-pod -- ip addr show
kubectl exec -it my-pod -- ip route show
kubectl exec -it my-pod -- cat /etc/resolv.conf

# Test pod-to-pod connectivity
kubectl exec pod-a -- ping -c3 10.244.2.3
kubectl exec pod-a -- curl -v http://10.244.2.3:8080

# Test from a debug pod with network tools
kubectl run netshoot --image=nicolaka/netshoot --rm -it -- bash
# Inside: ping, traceroute, tcpdump, nmap, ss, ip, dig all available

# Inspect veth pairs on the node
ssh node1 "ip link show type veth"
ssh node1 "bridge link show"   # bridge-attached veth interfaces

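# Find the host interface for a specific pod IP
ssh node1 "ip route get 10.244.1.5"
# Host-route CNIs (Calico): 10.244.1.5 dev <pod veth>
# Bridge CNIs (Flannel): 10.244.1.5 dev cni0 src 10.244.1.1; use "bridge fdb show br cni0" to find the veth port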

# Check CNI configuration
ssh node1 "cat /etc/cni/net.d/*.conflist"
ssh node1 "ls /opt/cni/bin/"

# Check IP allocations (host-local IPAM)
ssh node1 "ls /var/lib/cni/networks/"

# Calico: check node routes and IPAM
calicoctl node status
calicoctl get workloadendpoint -A

# Cilium: endpoint and connectivity status
cilium status
cilium endpoint list
cilium connectivity test   # runs full connectivity test suite

Troubleshooting Runbooks

Pod has no IP address — stuck in ContainerCreating

# Symptom: kubectl get pod shows no IP; describe shows CNI error

# 1. Check kubelet for CNI errors
journalctl -u kubelet | grep -i "cni\|network\|failed" | tail -30

# 2. Check if CNI binaries exist
ls /opt/cni/bin/
ls /etc/cni/net.d/

# 3. Check if CNI DaemonSet pod is running on this node
kubectl get pods -n kube-system -o wide | grep -E "calico|flannel|cilium"

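# 4. Try CNI ADD manually to see exact error (advanced)
# Use a scratch netns (never the host's), and pass one plugin's config, not the whole conflist
ip netns add cni-test
jq '{cniVersion, name} + .plugins[0]' /etc/cni/net.d/10-flannel.conflist | \
  CNI_COMMAND=ADD CNI_CONTAINERID=cni-test CNI_NETNS=/var/run/netns/cni-test \
  CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin /opt/cni/bin/flannel
# Clean up: rerun with CNI_COMMAND=DEL, then: ip netns del cni-test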

# 5. Check IPAM pool exhaustion
# host-local: count files in /var/lib/cni/networks/cbr0/
ls /var/lib/cni/networks/cbr0/ | wc -l
# Compare to node's podCIDR size

# 6. Check NetworkReady condition
crictl info | grep -i -A3 networkready

Pod can ping same-node pods but not cross-node pods

# Symptom: intra-node works, cross-node fails

# 1. Check VXLAN interface
ip -d link show flannel.1    # or cilium_vxlan (Cilium), vxlan.calico (Calico)
# If missing: CNI agent not running

# 2. Check VXLAN FDB (forwarding database)
bridge fdb show dev flannel.1
# Each remote node should have an entry mapping its VTEP MAC address to its node IP
# Missing entries: flannel/calico can't reach apiserver to learn remote nodes

# 3. Check firewall rules
iptables -L FORWARD -n | grep ACCEPT
# Must have ACCEPT rules for pod CIDRs in FORWARD chain
# Some cloud environments have restrictive firewall policies

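# 4. Check UDP/8472 is not blocked
# The kernel VXLAN device already owns UDP/8472 on each node, so capture instead of binding a test listener:
tcpdump -ni eth0 udp port 8472   # on node2, while pinging a node2 pod from a node1 pod
# No packets arriving: a firewall or security group between the nodes is dropping VXLAN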

# 5. MTU mismatch (common cause)
ip link show flannel.1 | grep mtu
# VXLAN MTU should be node MTU - 50 bytes (VXLAN overhead)
# If too high: small packets (ping, DNS) still work, but large transfers stall because oversized packets are silently dropped
# Fix: set flannel MTU in ConfigMap and restart flannel pods
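
The 50-byte figure is the sum of the headers VXLAN adds over IPv4: 20 (outer IP) + 8 (UDP) + 8 (VXLAN) + 14 (inner Ethernet). A small sketch of the arithmetic follows; tunnel_mtu is a hypothetical helper, and the overheads assume an IPv4 outer header.

```shell
# Compute the MTU a tunnel interface should use for a given node MTU.
# Overheads assume an IPv4 outer header.
tunnel_mtu() {
  local node_mtu="$1" encap="$2" overhead
  case "$encap" in
    vxlan) overhead=50 ;;   # 20 outer IPv4 + 8 UDP + 8 VXLAN + 14 inner Ethernet
    ipip)  overhead=20 ;;   # 20 outer IPv4 only
    none)  overhead=0 ;;    # native routing, no encapsulation
    *) echo "unknown encapsulation: $encap" >&2; return 1 ;;
  esac
  echo $(( node_mtu - overhead ))
}

tunnel_mtu 1500 vxlan   # 1450
tunnel_mtu 1500 ipip    # 1480
tunnel_mtu 9001 vxlan   # 8951
```

To confirm the path really carries that size, send a don't-fragment ping from one pod to a pod on another node with a payload of the tunnel MTU minus 28 bytes (20 IP + 8 ICMP), for example ping -M do -s 1422 10.244.2.3 for a 1450-byte MTU.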

Pod IP conflicts — two pods with the same IP

# Symptom: traffic routes to wrong pod; connections fail intermittently

# 1. Check for duplicate IPs
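kubectl get pods -A --no-headers -o custom-columns=IP:.status.podIP,HOSTNET:.spec.hostNetwork \
  | awk '$2 != "true" && $1 != "<none>" {print $1}' | sort | uniq -d
# hostNetwork pods are excluded because they legitimately share the node IP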

# 2. Check IPAM state
# host-local: stale reservation files if pod was force-deleted
ls /var/lib/cni/networks/cbr0/
# Remove stale file for the IP if the pod sandbox no longer exists:
cat /var/lib/cni/networks/cbr0/10.244.1.5    # prints the pod sandbox ID holding the IP
crictl pods --no-trunc | grep <sandbox-id>   # if not found, IP is orphaned
rm /var/lib/cni/networks/cbr0/10.244.1.5     # release the IP

# 3. Calico: check workload endpoints
calicoctl get workloadendpoint -A | grep 10.244.1.5
# Multiple entries = conflict; delete the stale one

# 4. Force pod recreation on affected node
kubectl drain node1 --ignore-daemonsets --force
# Pods will reschedule with fresh IP assignments

Production Best Practices

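Size the pod CIDR for growth: a /16 gives about 65K pod IPs across the cluster, and with the default /24 per node (254 usable IPs) it covers at most 256 nodes. If you need more than 254 pods per node, set --node-cidr-mask-size=23 on kube-controller-manager before cluster creation, and raise the kubelet's max-pods limit (default 110) to match.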
Never overlap CIDRs — confirm pod CIDR, service CIDR, and VPC/physical CIDR don't intersect before provisioning. This cannot be fixed post-creation without rebuilding.
Choose the right encapsulation for your environment — overlay (VXLAN/IPIP) works anywhere but adds overhead; BGP underlay is zero-overhead but requires network infrastructure cooperation. Cloud CNIs (AWS VPC CNI, Azure CNI) are best for cloud-native; Calico BGP for bare metal.
Watch MTU carefully with VXLAN — VXLAN adds 50 bytes of overhead. If node MTU is 1500, VXLAN interface MTU should be 1450. Mismatched MTU causes silent packet drops (ICMP fragmentation needed, but many clouds block ICMP). Set explicitly in CNI config.
Enable IP masquerade only at the node boundary — masquerading pod traffic when leaving the node (for egress to external IPs) is correct. Masquerading pod-to-pod traffic is wrong — it hides source IPs and breaks NetworkPolicy and observability.
Monitor IPAM pool exhaustion — alert when a node's IPAM block is >80% full. Full IPAM → new pods fail to get IPs → ContainerCreating → application outage.
Use netshoot for debugging — keep nicolaka/netshoot available as a debug pod image; it contains every network diagnostic tool you need (tcpdump, dig, curl, nmap, traceroute, iperf3).
Avoid hostNetwork unless essential: use it only for system components that explicitly need it (CNI, kube-proxy, node exporters). Enforce this with Pod Security admission (the baseline and restricted profiles both forbid hostNetwork) or a Gatekeeper/Kyverno policy.
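
The CIDR sizing trade-off in the first item is simple power-of-two arithmetic. The sketch below uses a hypothetical helper, pod_cidr_plan, and counts raw addresses, so the usable pod count per node is slightly lower.

```shell
# Given the cluster pod CIDR prefix length and the per-node mask size,
# print how many nodes fit and how many addresses each node receives.
pod_cidr_plan() {
  local cluster_len="$1" node_len="$2"
  local nodes=$(( 1 << (node_len - cluster_len) ))
  local per_node=$(( 1 << (32 - node_len) ))
  echo "max_nodes=$nodes addresses_per_node=$per_node"
}

pod_cidr_plan 16 24   # max_nodes=256 addresses_per_node=256
pod_cidr_plan 16 23   # max_nodes=128 addresses_per_node=512
```

Moving from /24 to /23 doubles per-node capacity but halves the number of nodes the same cluster CIDR can hold, which is why the choice has to be made before the cluster is created.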