Kubernetes Networking Overview
Kubernetes networking is one of the most frequently misunderstood areas of the platform — not because it is conceptually hard, but because it spans multiple abstraction layers, each implemented by a different pluggable component. This overview establishes the four fundamental networking problems Kubernetes solves, the components responsible for each, and the invariants that all implementations must satisfy. Every subsequent page in this section builds on this foundation.
The Four Networking Problems
Kubernetes networking addresses four distinct communication scenarios:
1. Container-to-Container
Containers in the same pod share a network namespace (same IP, same port space). They communicate via localhost. This is provided by the container runtime (pause container) — no Kubernetes networking layer needed.
2. Pod-to-Pod
Every pod gets a unique, routable IP address. Any pod can reach any other pod by IP without NAT. This is implemented by the CNI plugin and is a hard requirement of the Kubernetes networking model.
3. Pod-to-Service
Services provide stable virtual IPs (ClusterIPs) that load-balance to a set of pods. Implemented by kube-proxy (iptables/IPVS/nftables/eBPF) or a CNI replacement. Decouples consumers from pod lifecycle.
4. External-to-Service
External traffic enters the cluster via NodePort, LoadBalancer, or Ingress/Gateway. This is implemented by cloud load balancers (provisioned through the cloud-controller-manager, CCM), Ingress controllers, and Gateway API implementations.
Every Kubernetes-conformant cluster must guarantee: (1) all pods can communicate with all other pods without NAT; (2) all nodes can communicate with all pods without NAT; (3) the IP a pod sees for itself is the same IP other pods see for it. These requirements are stated in the Kubernetes networking model, and every CNI plugin is expected to satisfy them.
The Kubernetes Network Model
Address Spaces
A Kubernetes cluster uses three distinct, non-overlapping CIDR ranges:
| Address Space | Typical Range | What uses it | Configured by |
|---|---|---|---|
| Node CIDR | 10.0.0.0/16 (cloud VPC) | Node IP addresses (eth0); kubelet advertises this IP | Cloud VPC / on-prem network |
| Pod CIDR | 10.244.0.0/16 | All pod IPs cluster-wide; each node gets a /24 slice (spec.podCIDR) | --cluster-cidr on kube-controller-manager; CNI plugin |
| Service CIDR | 10.96.0.0/12 | ClusterIP virtual IPs; exist only in iptables/IPVS rules, never routed on the wire | --service-cluster-ip-range on kube-apiserver |
Overlapping any two of these three ranges causes silent routing failures. A common mistake is a pod CIDR that overlaps the VPC CIDR, which misroutes traffic for the overlapping addresses. Plan all three ranges before cluster creation: changing them afterwards is disruptive and in many setups requires rebuilding the cluster.
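One way to catch overlaps before cluster creation is a small pre-flight check. The sketch below uses Python's standard ipaddress module; the helper name find_overlaps and the bad_plan values are illustrative and not part of any Kubernetes tooling.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return (name_a, name_b) pairs whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        # Only compare ranges of the same IP family.
        if net_a.version == net_b.version and net_a.overlaps(net_b)
    ]


# Ranges taken from the address-space table above.
plan = {"node": "10.0.0.0/16", "pod": "10.244.0.0/16", "service": "10.96.0.0/12"}
print(find_overlaps(plan))      # []

# Deliberately bad plan (hypothetical): the pod CIDR sits inside the VPC range.
bad_plan = {"node": "10.0.0.0/8", "pod": "10.244.0.0/16", "service": "172.20.0.0/16"}
print(find_overlaps(bad_plan))  # [('node', 'pod')]
```

For a dual-stack cluster, each address family can be checked by passing its ranges in the same dictionary under distinct names.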
Networking Layers and Responsible Components
Component Roles at a Glance
| Component | Layer | Responsibility | Pluggable? |
|---|---|---|---|
| pause container | Container | Creates and holds the pod network namespace; all app containers join it | No (built into kubelet/CRI) |
| CNI plugin | Pod L3 | Assigns pod IP from node's podCIDR; sets up veth pair, routes, and cross-node routing | Yes (Calico, Cilium, Flannel, Weave, AWS VPC CNI, Azure CNI, …) |
| kube-proxy | Service L4 | Programs iptables/IPVS rules for ClusterIP DNAT; handles NodePort and LoadBalancer | Yes (replaceable by Cilium kube-proxy replacement, Antrea, etc.) |
| CoreDNS | Service DNS | Resolves service.namespace.svc.cluster.local → ClusterIP; pod DNS | Yes (any cluster DNS server that implements the Kubernetes DNS spec; NodeLocal DNSCache adds a per-node cache in front of it) |
| Ingress controller | L7 HTTP | Terminates TLS, routes HTTP by host/path to Services | Yes (nginx, Envoy, Traefik, HAProxy, …) |
| Gateway API impl. | L7 HTTP/TCP | Next-gen Ingress; richer routing (header, weight, TCP/UDP) | Yes (Envoy Gateway, Cilium, Istio, nginx Gateway Fabric) |
| Cloud LB (CCM) | L4 external | Provisions cloud load balancer for type: LoadBalancer Services | Via cloud-controller-manager |
| Network Policy engine | L3/L4 firewall | Enforces NetworkPolicy objects (allow/deny by pod selector/port) | Yes (Calico, Cilium, Weave; Flannel alone cannot enforce policies) |
| Service mesh | L7 mTLS | Mutual TLS, observability, traffic management between pods | Yes (Istio, Linkerd, Consul Connect, Cilium Service Mesh) |
What Each Page in This Section Covers
01 · Pod Networking
veth pairs, network namespaces, pod IP assignment, the pause container, cross-node routing with overlays vs. BGP underlay, packet walk end-to-end.
02 · CNI Plugins
CNI spec, plugin invocation by kubelet, Calico BGP + eBPF, Cilium eBPF, Flannel VXLAN, AWS VPC CNI, Azure CNI — when to use each.
03 · Services
ClusterIP, NodePort, LoadBalancer, ExternalName, Headless — object model, EndpointSlices, session affinity, external traffic policy.
04 · kube-proxy Internals
iptables chain hierarchy, IPVS scheduler algorithms, nftables mode (beta in 1.31, GA in 1.33), EndpointSlice sync, topology-aware routing.
05 · DNS
CoreDNS architecture, pod DNS policy, search domains, ndots, stub zones, NodeLocal DNSCache, custom CoreDNS config.
06 · Ingress
IngressClass, nginx ingress controller, TLS termination, annotations, path types, multi-tenant patterns.
07 · Gateway API
Gateway, HTTPRoute, GRPCRoute, TCPRoute, ReferenceGrant — the successor to Ingress with role-based separation.
08 · Network Policies
Default deny patterns, ingress/egress rules, podSelector + namespaceSelector, CIDR blocks, policy-as-code.
09 · IPv4/IPv6 Dual Stack
Dual-stack cluster setup, pod dual IPs, Service ipFamilyPolicy: PreferDualStack, IPv6-first policies.
10 · Load Balancing
Cloud LB integration, MetalLB for bare metal, BGP peering, L2 mode, LoadBalancer IP pools.
11 · Service Mesh
Istio control plane, Envoy sidecar, Linkerd micro-proxy, Cilium Mesh — mTLS, traffic shifting, observability.
Key Networking Invariants
These invariants hold regardless of which CNI, kube-proxy mode, or Ingress controller you choose:
- No NAT between pods. Pod A reaching pod B's IP sees B's real IP, and pod B sees A's real pod IP as the source. NAT applies only at the node boundary for external-to-pod traffic (NodePort/LoadBalancer with externalTrafficPolicy: Cluster).
- Pod IPs are ephemeral. A pod's IP is released when the pod is deleted. Services abstract this: they track healthy pod IPs via EndpointSlices, and kube-proxy programs new rules within seconds.
- ClusterIPs are virtual. In iptables mode a ClusterIP is not assigned to any network interface; it exists only as DNAT rules. (IPVS mode binds ClusterIPs to a dummy interface, kube-ipvs0, on each node, but the address is still never routed on the wire.) Because the rules match only the Service's protocol and ports, pinging a ClusterIP typically times out in iptables mode; test with a TCP or UDP connection to a Service port instead.
- DNS is the primary discovery mechanism. Kubernetes services are primarily discovered via DNS (<service>.<namespace>.svc.cluster.local), not by hard-coding ClusterIPs. Applications should use DNS names, because a ClusterIP changes when the Service is recreated.
- Network Policies are additive and default-allow. Without any NetworkPolicy, all pod-to-pod traffic is allowed. Adding a NetworkPolicy affects only the pods it selects. For real isolation, you need a default-deny base policy.
CIDR Planning Reference
# Common CIDR allocation example
--cluster-cidr=10.244.0.0/16 # pod CIDR (65536 pod IPs total)
--service-cluster-ip-range=10.96.0.0/12 # service CIDR (1M virtual IPs)
# Node IPs: 10.0.0.0/16 (cloud VPC)
# Per-node pod subnet (auto-assigned from cluster CIDR):
# node-1: 10.244.1.0/24 → 254 usable pod IPs
# node-2: 10.244.2.0/24 → 254 usable pod IPs
# (maxPods=110 default in kubelet limits actual pod count regardless of subnet size)
# Large cluster example:
--cluster-cidr=10.0.0.0/8 # 16M pod IPs (node and service CIDRs must then sit outside 10.0.0.0/8)
--node-cidr-mask-size=24 # /24 per node (254 pod IPs each)
# Address space allows: 65536 node subnets × 254 IPs = ~16.6M pod IPs
# Tight cloud VPC example (avoid pod/VPC overlap):
# VPC: 172.31.0.0/16 (AWS default)
# Pod CIDR: 192.168.0.0/16 ← must not overlap with 172.31.0.0/16
# Service CIDR: 10.100.0.0/16 ← must not overlap with either
# Dual-stack:
--cluster-cidr=10.244.0.0/16,fd00::/48
--service-cluster-ip-range=10.96.0.0/12,fd01::/108
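The per-node arithmetic in the comments above can be reproduced with a few lines of Python. This is a rough sketch: capacity and node_subnet are hypothetical helper names, the "minus 2" usable-address convention simply mirrors the 254 figure used above, and the subnet a given node actually receives is decided by the controller manager or the CNI's IPAM.

```python
import ipaddress
from itertools import islice


def capacity(cluster_cidr, node_mask, max_pods=110):
    """Return (node subnets, usable IPs per node, schedulable pods)."""
    net = ipaddress.ip_network(cluster_cidr)
    nodes = 2 ** (node_mask - net.prefixlen)
    ips_per_node = 2 ** (net.max_prefixlen - node_mask) - 2
    # The kubelet pod limit caps what each node can actually run.
    return nodes, ips_per_node, nodes * min(ips_per_node, max_pods)


def node_subnet(cluster_cidr, node_mask, index):
    """Return the index-th per-node slice of the cluster CIDR."""
    net = ipaddress.ip_network(cluster_cidr)
    return next(islice(net.subnets(new_prefix=node_mask), index, None))


print(capacity("10.244.0.0/16", 24))        # (256, 254, 28160)
print(node_subnet("10.244.0.0/16", 24, 1))  # 10.244.1.0/24
print(capacity("10.0.0.0/8", 24))           # (65536, 254, 7208960)
```

With the kubelet default of 110 pods per node, the pod limit rather than the /24 is the binding constraint in both examples.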
Packet Lifecycle: Pod-to-Pod Cross-Node
A brief end-to-end trace for a packet from Pod A (node 1) to Pod C (node 2):
| Step | Component | What happens |
|---|---|---|
| 1 | Pod A kernel | Sends IP packet: src=10.244.1.5, dst=10.244.2.3 |
| 2 | veth pair | Packet exits pod netns, enters host netns via veth peer |
| 3 | iptables (PREROUTING) | kube-proxy rules checked — dst is pod IP, not ClusterIP, so no DNAT |
| 4 | Route table | Node 1 has route: 10.244.2.0/24 via overlay (VXLAN) or BGP peer |
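| 5 | CNI encapsulation | VXLAN: inner packet wrapped in outer UDP/IP (src=node1-IP, dst=node2-IP, UDP port 8472 in Flannel's default configuration; the IANA-assigned VXLAN port is 4789). BGP: no encapsulation, the packet is routed as-is. |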
| 6 | Physical network | Outer packet delivered to node 2's eth0 |
| 7 | CNI decapsulation | VXLAN: outer header stripped; inner packet revealed with dst=10.244.2.3 |
| 8 | Route table node 2 | Route: 10.244.2.3 → veth of Pod C |
| 9 | Pod C kernel | Receives packet; sees src=10.244.1.5 (no NAT) |
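Step 4 of the trace is an ordinary longest-prefix-match route lookup. The sketch below models it with a hypothetical routing table for node 1; the next-hop strings are descriptive labels chosen for illustration, not output from any real tool.

```python
import ipaddress

# Hypothetical routing table on node 1 (illustrative entries only).
ROUTES = [
    ("10.244.1.0/24", "local veth/bridge (pods on node 1)"),
    ("10.244.2.0/24", "node 2 (VXLAN tunnel or BGP-learned route)"),
    ("0.0.0.0/0", "default gateway"),
]


def lookup(dst, routes=ROUTES):
    """Return the next hop of the most specific route that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(cidr), hop)
        for cidr, hop in routes
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        raise LookupError(f"no route to {dst}")
    # Longest prefix wins, as in the kernel's routing decision.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(lookup("10.244.2.3"))  # Pod C: forwarded towards node 2
print(lookup("10.244.1.9"))  # a pod on the same node: stays local
```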
Packet Lifecycle: Pod-to-Service
For traffic from a pod, the ClusterIP is intercepted in the kernel's PREROUTING chain before any routing decision (traffic originating on the node itself is matched in OUTPUT instead). The DNAT rule rewrites the destination to one of the backing pod IPs, and routing then proceeds exactly as for a direct pod-to-pod packet. On the return path, connection tracking reverses the DNAT: it rewrites the response's source address back to the ClusterIP before the packet reaches the calling pod, so the caller always sees the service IP as the responder.
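The translation in both directions can be modelled in a few lines. This is a toy sketch, not how the kernel implements it: the class name ServiceNat is hypothetical, backend selection is a plain random choice, and connections are keyed on client IP and port only, whereas real connection tracking uses the full 5-tuple.

```python
import random


class ServiceNat:
    """Toy model of ClusterIP DNAT plus conntrack reverse translation."""

    def __init__(self, cluster_ip, backends, rng=None):
        self.cluster_ip = cluster_ip
        self.backends = list(backends)
        self.rng = rng or random.Random()
        self.conntrack = {}  # (client_ip, client_port) -> chosen backend IP

    def outbound(self, src, sport, dst):
        """Rewrite dst to a backend if it is the ClusterIP."""
        if dst != self.cluster_ip:
            return src, sport, dst  # plain pod-to-pod traffic is untouched
        key = (src, sport)
        if key not in self.conntrack:
            self.conntrack[key] = self.rng.choice(self.backends)
        return src, sport, self.conntrack[key]

    def reply(self, src, dst, dport):
        """Restore the ClusterIP as source for replies of tracked connections."""
        if self.conntrack.get((dst, dport)) == src:
            return self.cluster_ip, dst, dport
        return src, dst, dport


nat = ServiceNat("10.96.0.100", ["10.244.1.7", "10.244.2.3"])
_, _, backend = nat.outbound("10.244.1.5", 40000, "10.96.0.100")
print("request delivered to", backend)
print("reply appears to come from", nat.reply(backend, "10.244.1.5", 40000)[0])
```

The second print always shows 10.96.0.100, which is why the calling pod never learns which backend served it.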
Quick Reference: Common kubectl Networking Commands
# List all services with ClusterIPs
kubectl get svc -A -o wide
# List all endpoints (backing pods for each service)
kubectl get endpointslices -A
# Check pod IPs
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName
# Inspect DNS for a service (from within a pod)
kubectl run dnstest --image=busybox:1.36 --rm -it -- \
nslookup kubernetes.default.svc.cluster.local
# Test pod-to-pod connectivity
kubectl exec -it pod-a -- curl http://10.244.2.3:8080
# Test pod-to-service connectivity
kubectl exec -it pod-a -- curl http://my-service.my-namespace.svc.cluster.local
# Show the pod CIDR assigned to each node
kubectl get node -o jsonpath='{range .items[*]}{.metadata.name}: podCIDR={.spec.podCIDR}{"\n"}{end}'
# Watch endpoint changes for a service
kubectl get endpointslices -l kubernetes.io/service-name=my-service -w
# Debug: find the rules for a ClusterIP (run as root on a node; use the command matching the kube-proxy mode)
iptables-save | grep -F "10.96.0.100"
ipvsadm -Ln | grep -F -A3 "10.96.0.100"
CNI Plugin Selection Guide
| CNI | Dataplane | Network Policy | Best for | Trade-offs |
|---|---|---|---|---|
| Calico | eBPF or iptables; BGP or VXLAN | Yes (full) | On-prem, cloud, enterprises needing BGP peering | More complex; BGP config required for underlay mode |
| Cilium | eBPF (minimum kernel version depends on the Cilium release) | Yes (L3–L7) | Cloud-native, observability, kube-proxy replacement | Requires recent kernel; higher learning curve |
| Flannel | VXLAN (default) | No (needs Calico for policies) | Simple clusters, learning, dev environments | No network policy; limited features |
| AWS VPC CNI | Native ENI/VPC routing | Yes (via Calico or policy addon) | EKS; native VPC IPs, no overlay | AWS-only; IP exhaustion risk per-node |
| Azure CNI | Native VNET | Yes (Azure NPM) | AKS; native VNET IPs, direct routing | Azure-only; subnet size planning required |
| Antrea | OVS + eBPF | Yes (L3–L7) | VMware/vSphere, Windows support | Complex OVS dependency |
| Weave | VXLAN/mesh | Yes | Simple setup, multi-cloud mesh | Performance overhead; less active development |
Common Networking Pitfalls
Overlapping CIDRs
Pod CIDR overlapping VPC CIDR causes traffic to be delivered to real VMs instead of pods. Always allocate pod and service CIDRs from private ranges that don't conflict with your infrastructure.
Missing Network Policy default deny
Without a default-deny policy, all pods can talk to all pods. A single compromised pod can reach every database in the cluster. Add namespace-level default-deny ingress/egress policies as a baseline.
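A namespace-level default-deny policy is small enough to generate programmatically. The sketch below builds the manifest as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML; the function name default_deny, the policy name default-deny-all, and the namespace my-namespace are placeholders.

```python
import json


def default_deny(namespace):
    """Build a NetworkPolicy that denies all ingress and egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Both directions are listed and no allow rules follow,
            # so all ingress and egress for the selected pods is denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


print(json.dumps(default_deny("my-namespace"), indent=2))
```

The output can be piped to kubectl apply -f -. Note that denying egress also blocks DNS lookups, so a companion policy allowing traffic to the cluster DNS service is normally needed, and the policy only takes effect if the CNI enforces NetworkPolicy.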
DNS ndots=5 causing latency
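With the default ndots:5, any name containing fewer than five dots, including short names like mydb and most external hostnames, is first tried against each entry in the pod's search list (typically three cluster domains plus any inherited from the node) before being resolved as an absolute name. Each attempt is a separate DNS query, usually issued for both A and AAAA records. Use FQDNs with a trailing dot for latency-sensitive lookups, or lower ndots in the pod's dnsConfig.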
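The lookup order can be made concrete with a short model of glibc-style resolver behaviour. This is a simplified sketch: query_order is a hypothetical helper, the search list assumes a pod in a namespace called my-namespace with no extra domains inherited from the node, and other resolver implementations (musl, for example) behave differently.

```python
# Assumed search list for a pod in namespace "my-namespace".
SEARCH = [
    "my-namespace.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]


def query_order(name, search=SEARCH, ndots=5):
    """Return the fully qualified names tried, in order, for one lookup."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is skipped
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try as-is first
    return expanded + [absolute]      # too few dots: search list first


for candidate in query_order("api.example.com"):
    print(candidate)
print(query_order("api.example.com."))  # ['api.example.com.']
```

For api.example.com the loop prints three cluster-suffixed names that are expected to fail before the real name is tried; the trailing-dot form goes straight to a single query.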
kube-proxy / CNI race on node startup
On node startup, kube-proxy may program Service rules before the CNI has finished setting up pod networking. Result: for a short window, ClusterIP DNAT rules can point at pod IPs that are not yet reachable. Both components reconcile continuously and normally converge without intervention, but monitor startup events.
externalTrafficPolicy SNAT
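externalTrafficPolicy: Cluster (default) replaces the client's source IP with a node IP (SNAT) before routing to backend pods. Applications that need real client IPs must use externalTrafficPolicy: Local. With Local, a node forwards external traffic only to backend pods running on that same node; traffic arriving at a node with no local backend is dropped, so the load balancer must health-check nodes, and load can be spread unevenly across pods.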
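The difference between the two policies comes down to which pods a receiving node may forward to. The following toy model illustrates that rule; eligible_backends and the node and pod names are hypothetical, and real load-balancer health checking is not modelled.

```python
def eligible_backends(policy, entry_node, pods_by_node):
    """Return the pods that may receive traffic arriving at entry_node."""
    if policy == "Local":
        # Only pods on the receiving node; an empty list means the traffic is dropped.
        return list(pods_by_node.get(entry_node, []))
    if policy == "Cluster":
        # Any pod in the cluster; the client source IP is SNAT'd on the way.
        return [pod for pods in pods_by_node.values() for pod in pods]
    raise ValueError(f"unknown externalTrafficPolicy: {policy}")


# Hypothetical placement: node-3 runs no backend for this Service.
placement = {"node-1": ["web-a", "web-b"], "node-2": ["web-c"], "node-3": []}

print(eligible_backends("Cluster", "node-3", placement))  # ['web-a', 'web-b', 'web-c']
print(eligible_backends("Local", "node-3", placement))    # []
print(eligible_backends("Local", "node-2", placement))    # ['web-c']
```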
NodePort port conflicts
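NodePorts (30000–32767 by default) are allocated cluster-wide: each allocated port is opened on every node. The API server rejects a Service that requests a NodePort already allocated to another Service, so hard-coded nodePort values fail at creation time rather than conflicting silently. Conflicts can still occur with host processes listening on ports inside the NodePort range. Keep a port inventory, let Kubernetes auto-assign ports, or use a LoadBalancer type instead of NodePort in production.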
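NodePort allocations are tracked cluster-wide by the API server, and a port can belong to only one Service at a time. The sketch below is a toy model of that bookkeeping, useful for reasoning about a port inventory; NodePortAllocator is a hypothetical class, and picking the lowest free port is a simplification of how ports are auto-assigned.

```python
class NodePortAllocator:
    """Toy model of cluster-wide NodePort bookkeeping (not the real allocator)."""

    def __init__(self, low=30000, high=32767):
        self.low, self.high = low, high
        self.owners = {}  # port -> name of the Service holding it

    def allocate(self, service, port=None):
        """Reserve a port for a Service, or raise ValueError if that is impossible."""
        if port is None:
            # Simplification: take the lowest free port in the range.
            port = next(
                (p for p in range(self.low, self.high + 1) if p not in self.owners),
                None,
            )
            if port is None:
                raise ValueError("NodePort range exhausted")
        if not self.low <= port <= self.high:
            raise ValueError(f"{port} is outside {self.low}-{self.high}")
        if port in self.owners:
            raise ValueError(f"{port} is already allocated to {self.owners[port]}")
        self.owners[port] = service
        return port


alloc = NodePortAllocator()
print(alloc.allocate("web", 30080))  # explicit request succeeds
print(alloc.allocate("api"))         # auto-assigned from the range
try:
    alloc.allocate("metrics", 30080)
except ValueError as err:
    print("rejected:", err)
```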