
Kubernetes Networking Overview

Kubernetes networking is one of the most frequently misunderstood areas of the platform — not because it is conceptually hard, but because it spans multiple abstraction layers, each implemented by a different pluggable component. This overview establishes the four fundamental networking problems Kubernetes solves, the components responsible for each, and the invariants that all implementations must satisfy. Every subsequent page in this section builds on this foundation.

The Four Networking Problems

Kubernetes networking addresses four distinct communication scenarios:

1. Container-to-Container

Containers in the same pod share a network namespace (same IP, same port space). They communicate via localhost. This is provided by the container runtime (pause container) — no Kubernetes networking layer needed.

2. Pod-to-Pod

Every pod gets a unique, routable IP address. Any pod can reach any other pod by IP without NAT. This is implemented by the CNI plugin and is a hard requirement of the Kubernetes networking model.

3. Pod-to-Service

Services provide stable virtual IPs (ClusterIPs) that load-balance to a set of pods. This is implemented by kube-proxy (iptables, IPVS, or nftables mode) or by an eBPF-based replacement such as Cilium, and it decouples consumers from the pod lifecycle.

4. External-to-Service

External traffic entering the cluster via NodePort, LoadBalancer, or Ingress/Gateway. Implemented by cloud load balancers (CCM), Ingress controllers, and Gateway API implementations.

The Fundamental Networking Contract

Every Kubernetes-conformant cluster must guarantee: (1) all pods can communicate with all other pods without NAT; (2) all nodes can communicate with all pods without NAT; (3) the IP a pod sees for itself is the same IP other pods see for it. This is stated in the Kubernetes networking model and exercised by the Kubernetes conformance (e2e) tests; the CNI specification itself only defines the plugin interface.

The Kubernetes Network Model

Address Spaces

A Kubernetes cluster uses three distinct, non-overlapping CIDR ranges:

Address Space	Typical Range	What uses it	Configured by
Node CIDR	`10.0.0.0/16` (cloud VPC)	Node IP addresses (eth0); kubelet advertises this IP	Cloud VPC / on-prem network
Pod CIDR	`10.244.0.0/16`	All pod IPs cluster-wide; each node gets a /24 slice (`spec.podCIDR`)	`--cluster-cidr` on kube-controller-manager; CNI plugin
Service CIDR	`10.96.0.0/12`	ClusterIP virtual IPs; only exist in iptables/IPVS rules, never routed on wire	`--service-cluster-ip-range` on kube-apiserver

CIDRs Must Not Overlap

Overlapping any two of these three ranges causes routing failures that are hard to diagnose. A common mistake: the pod CIDR overlaps the VPC CIDR, so traffic for the overlapping addresses is delivered to the wrong destination. Plan all three ranges before cluster creation; changing them afterwards is disruptive and in most setups means rebuilding the cluster.
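The overlap rule is easy to check mechanically before a cluster is created. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `find_overlaps` and the "broken" pod range are illustrative assumptions, not part of any Kubernetes tooling.

```python
import ipaddress
from itertools import combinations

def find_overlaps(ranges):
    """Return (name_a, name_b) pairs whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]

# The three ranges from the table above: no overlaps expected.
planned = {
    "node": "10.0.0.0/16",
    "pod": "10.244.0.0/16",
    "service": "10.96.0.0/12",
}
print(find_overlaps(planned))  # []

# Hypothetical mistake: a pod CIDR carved out of the VPC range.
broken = dict(planned, pod="10.0.128.0/17")
print(find_overlaps(broken))  # [('node', 'pod')]
```

Running a check like this against the planned ranges (plus any peered VPCs or on-prem networks) is cheaper than discovering the overlap after workloads are deployed.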

Networking Layers and Responsible Components

Service Mesh / mTLS (Istio, Linkerd, Cilium) L7 — optional, sidecar or eBPF

Ingress / Gateway API (nginx, envoy, traefik) L7 HTTP routing — external traffic

Services / kube-proxy (ClusterIP, NodePort, LoadBalancer) L4 virtual IP — iptables / IPVS / eBPF

Pod Networking / CNI (Calico, Cilium, Flannel, Weave) L3 pod-to-pod routing — overlay or underlay

Node Network (VPC, physical fabric) L2/L3 — cloud or on-prem infrastructure

Container Networking (pause container, veth pairs) L2 — within-pod namespace sharing

Component Roles at a Glance

Component	Layer	Responsibility	Pluggable?
pause container	Container	Creates and holds the pod network namespace; all app containers join it	No (built into kubelet/CRI)
CNI plugin	Pod L3	Assigns pod IP from node's podCIDR; sets up veth pair, routes, and cross-node routing	Yes (Calico, Cilium, Flannel, Weave, AWS VPC CNI, Azure CNI, …)
kube-proxy	Service L4	Programs iptables/IPVS rules for ClusterIP DNAT; handles NodePort and LoadBalancer	Yes (replaceable by Cilium kube-proxy replacement, Antrea, etc.)
CoreDNS	Service DNS	Resolves `service.namespace.svc.cluster.local` → ClusterIP; pod DNS	Yes (any DNS server that implements the Kubernetes DNS specification; NodeLocal DNSCache can sit in front of it as a per-node cache)
Ingress controller	L7 HTTP	Terminates TLS, routes HTTP by host/path to Services	Yes (nginx, Envoy, Traefik, HAProxy, …)
Gateway API impl.	L7 HTTP/TCP	Next-gen Ingress; richer routing (header, weight, TCP/UDP)	Yes (Envoy Gateway, Cilium, Istio, nginx Gateway Fabric)
Cloud LB (CCM)	L4 external	Provisions cloud load balancer for `type: LoadBalancer` Services	Via cloud-controller-manager
Network Policy engine	L3/L4 firewall	Enforces NetworkPolicy objects (allow/deny by pod selector/port)	Yes (Calico, Cilium, Weave; Flannel alone cannot enforce policies)
Service mesh	L7 mTLS	Mutual TLS, observability, traffic management between pods	Yes (Istio, Linkerd, Consul Connect, Cilium Service Mesh)

What Each Page in This Section Covers

01 · Pod Networking

veth pairs, network namespaces, pod IP assignment, the pause container, cross-node routing with overlays vs. BGP underlay, packet walk end-to-end.

02 · CNI Plugins

CNI spec, plugin invocation by kubelet, Calico BGP + eBPF, Cilium eBPF, Flannel VXLAN, AWS VPC CNI, Azure CNI — when to use each.

03 · Services

ClusterIP, NodePort, LoadBalancer, ExternalName, Headless — object model, EndpointSlices, session affinity, external traffic policy.

04 · kube-proxy Internals

iptables chain hierarchy, IPVS scheduler algorithms, nftables mode (beta in 1.31, GA in 1.33), EndpointSlice sync, topology-aware routing.

05 · DNS

CoreDNS architecture, pod DNS policy, search domains, ndots, stub zones, NodeLocal DNSCache, custom CoreDNS config.

06 · Ingress

IngressClass, nginx ingress controller, TLS termination, annotations, path types, multi-tenant patterns.

07 · Gateway API

Gateway, HTTPRoute, GRPCRoute, TCPRoute, ReferenceGrant — the successor to Ingress with role-based separation.

08 · Network Policies

Default deny patterns, ingress/egress rules, podSelector + namespaceSelector, CIDR blocks, policy-as-code.

09 · IPv4/IPv6 Dual Stack

Dual-stack cluster setup, pod dual IPs, Service `ipFamilyPolicy: PreferDualStack`, IPv6-first policies.

10 · Load Balancing

Cloud LB integration, MetalLB for bare metal, BGP peering, L2 mode, LoadBalancer IP pools.

11 · Service Mesh

Istio control plane, Envoy sidecar, Linkerd micro-proxy, Cilium Mesh — mTLS, traffic shifting, observability.

Key Networking Invariants

These invariants hold regardless of which CNI, kube-proxy mode, or Ingress controller you choose:

No NAT between pods. Pod A reaching pod B's IP sees B's real IP. Pod B sees A's real pod IP as the source. NAT applies only at the cluster edge: for external-to-pod traffic (NodePort/LoadBalancer with externalTrafficPolicy: Cluster) and, in most CNI configurations, for pod traffic leaving the cluster, which is masqueraded to the node IP.
Pod IPs are ephemeral. A pod's IP is released when the pod is deleted. Services abstract this: they track healthy pod IPs via EndpointSlices and kube-proxy programs new rules within seconds.
ClusterIPs are virtual. In iptables mode a ClusterIP is not assigned to any network interface; it exists only as DNAT rules that match the Service's protocol and port, so pinging it usually times out because ICMP matches no rule. In IPVS mode kube-proxy binds each ClusterIP to the kube-ipvs0 dummy interface, so a ping is answered by the local node, which says nothing about the health of the backends.
DNS is the primary discovery mechanism. Kubernetes services are primarily discovered via DNS (<service>.<namespace>.svc.cluster.local), not by hard-coding ClusterIPs. Applications should always use DNS names; a ClusterIP changes whenever the Service is deleted and recreated.
Network Policies are additive and default-allow. Without any NetworkPolicy, all pod-to-pod traffic is allowed. Adding a NetworkPolicy affects only the pods it selects. For real isolation, you need a default-deny base policy.

CIDR Planning Reference

# Common CIDR allocation example
--cluster-cidr=10.244.0.0/16         # pod CIDR (65536 pod IPs total)
--service-cluster-ip-range=10.96.0.0/12  # service CIDR (1M virtual IPs)
# Node IPs: 10.0.0.0/16 (cloud VPC)

# Per-node pod subnet (auto-assigned from cluster CIDR):
# node-1: 10.244.1.0/24  → 254 usable pod IPs
# node-2: 10.244.2.0/24  → 254 usable pod IPs
# (maxPods=110 default in kubelet limits actual pod count regardless of subnet size)

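# Large cluster example:
--cluster-cidr=10.0.0.0/8            # 16M pod IPs
--node-cidr-mask-size=24             # /24 per node (254 pod IPs each)
# Address capacity: 65536 node subnets × 254 pods = ~16M pod IPs
# (this pod range covers all of 10.0.0.0/8, so the node and service ranges
#  shown above would overlap it and must be moved, e.g. into 172.16.0.0/12)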

# Tight cloud VPC example (avoid pod/VPC overlap):
# VPC: 172.31.0.0/16 (AWS default)
# Pod CIDR: 192.168.0.0/16    ← must not overlap with 172.31.0.0/16
# Service CIDR: 10.100.0.0/16 ← must not overlap with either

# Dual-stack:
--cluster-cidr=10.244.0.0/16,fd00::/48
--service-cluster-ip-range=10.96.0.0/12,fd01::/108
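The arithmetic behind these figures can be reproduced with the standard `ipaddress` module. This is a sketch for IPv4 only; `pod_capacity` is a hypothetical helper name, and the "usable" count simply subtracts the network and broadcast addresses to match the 254 figure used above (a CNI may reserve further addresses, such as a gateway).

```python
import ipaddress
from itertools import islice

def pod_capacity(cluster_cidr, node_mask_size):
    """Return (node_subnets, usable_ips_per_node) for an IPv4 pod CIDR."""
    cluster = ipaddress.ip_network(cluster_cidr)
    node_subnets = 2 ** (node_mask_size - cluster.prefixlen)
    # Subtract the network and broadcast addresses of each node subnet.
    usable_per_node = 2 ** (cluster.max_prefixlen - node_mask_size) - 2
    return node_subnets, usable_per_node

print(pod_capacity("10.244.0.0/16", 24))  # (256, 254)
print(pod_capacity("10.0.0.0/8", 24))     # (65536, 254)

# First few per-node slices of the default pod CIDR.
cluster = ipaddress.ip_network("10.244.0.0/16")
print([str(s) for s in islice(cluster.subnets(new_prefix=24), 3)])
# ['10.244.0.0/24', '10.244.1.0/24', '10.244.2.0/24']
```

Note that a /16 pod CIDR with /24 node slices caps the cluster at 256 nodes, which is usually the first limit a growing cluster hits.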

Packet Lifecycle: Pod-to-Pod Cross-Node

A brief end-to-end trace for a packet from Pod A (node 1) to Pod C (node 2):

Pod A sends to 10.244.2.3 → veth → cni0 bridge → Node 1 routing table → Overlay/BGP: encap + route to Node 2 → Node 2 decap + cni0 bridge → veth → Pod C

Step	Component	What happens
1	Pod A kernel	Sends IP packet: src=10.244.1.5, dst=10.244.2.3
2	veth pair	Packet exits pod netns, enters host netns via veth peer
3	iptables (PREROUTING)	kube-proxy rules checked — dst is pod IP, not ClusterIP, so no DNAT
4	Route table	Node 1 has route: 10.244.2.0/24 via overlay (VXLAN) or BGP peer
5	CNI encapsulation	VXLAN: inner packet wrapped in outer UDP/IP (src=node1-IP, dst=node2-IP, UDP port 8472 with Flannel's defaults; the IANA-assigned VXLAN port is 4789). BGP: no encap, just routed.
6	Physical network	Outer packet delivered to node 2's eth0
7	CNI decapsulation	VXLAN: outer header stripped; inner packet revealed with dst=10.244.2.3
8	Route table node 2	Route: 10.244.2.3 → veth of Pod C
9	Pod C kernel	Receives packet; sees src=10.244.1.5 (no NAT)
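Steps 4 to 7 of the trace can be modelled in a few lines. The sketch below is a toy model, not CNI code: the route table, the next-hop labels, and the node IPs `10.0.0.11` and `10.0.0.12` are hypothetical values chosen to match the example addresses above.

```python
import ipaddress

# Hypothetical route table for node 1: (destination CIDR, next hop).
NODE1_ROUTES = [
    ("10.244.1.0/24", "cni0"),          # pods on this node
    ("10.244.2.0/24", "vxlan:node-2"),  # pods on node 2, via the overlay
    ("0.0.0.0/0", "eth0"),              # everything else
]

def next_hop(dst_ip, routes):
    """Step 4: longest-prefix match over the node's routes."""
    addr = ipaddress.ip_address(dst_ip)
    matching = [
        (ipaddress.ip_network(cidr), hop)
        for cidr, hop in routes
        if addr in ipaddress.ip_network(cidr)
    ]
    return max(matching, key=lambda entry: entry[0].prefixlen)[1]

def encapsulate(inner, node_src, node_dst):
    """Step 5 (VXLAN): wrap the pod packet in an outer node-to-node header."""
    return {"outer_src": node_src, "outer_dst": node_dst, "inner": inner}

def decapsulate(outer):
    """Step 7: strip the outer header; the inner packet is untouched."""
    return outer["inner"]

packet = {"src": "10.244.1.5", "dst": "10.244.2.3"}
print(next_hop(packet["dst"], NODE1_ROUTES))  # vxlan:node-2
wire = encapsulate(packet, "10.0.0.11", "10.0.0.12")
print(decapsulate(wire))  # {'src': '10.244.1.5', 'dst': '10.244.2.3'}
```

The point of the model is the last line: the inner source address survives the trip unchanged, which is what "no NAT between pods" means in practice.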

Packet Lifecycle: Pod-to-Service

Pod A sends to ClusterIP 10.96.0.100:80 → iptables PREROUTING DNAT → dst rewritten to 10.244.2.3:8080 → cross-node routing (as above) → Pod C receives (sees src=pod A IP)

The ClusterIP is intercepted in the nat table's PREROUTING chain before the routing decision (traffic that originates in the node's own network namespace hits the OUTPUT chain instead). The DNAT rule rewrites the destination to one of the backing pod IPs, and routing then proceeds exactly as for a direct pod-to-pod packet. On the reply, connection tracking reverses the DNAT and rewrites the response source back to the ClusterIP before the packet reaches the calling pod, so the caller always sees the service IP as the responder.
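The interplay of DNAT and connection tracking can be illustrated with a toy model. This is a sketch of the idea only, not how the kernel or kube-proxy is implemented; the class `ServiceNat`, the second backend `10.244.3.7`, and the client port `40000` are hypothetical.

```python
import random

class ServiceNat:
    """Toy model of ClusterIP DNAT plus connection tracking."""

    def __init__(self, backends, rng=None):
        self.backends = backends  # {(cluster_ip, port): [(pod_ip, port), ...]}
        self.conntrack = {}       # {(client, vip): chosen backend}
        self.rng = rng or random.Random()

    def outbound(self, src, dst):
        """DNAT a packet addressed to a ClusterIP; pass other traffic through."""
        if dst not in self.backends:
            return src, dst
        backend = self.conntrack.get((src, dst))
        if backend is None:
            # First packet of the connection: pick a backend and remember it.
            backend = self.rng.choice(self.backends[dst])
            self.conntrack[(src, dst)] = backend
        return src, backend

    def reply(self, src, dst):
        """Reverse the DNAT on a reply so the source is the ClusterIP again."""
        for (client, vip), backend in self.conntrack.items():
            if backend == src and client == dst:
                return vip, dst
        return src, dst

VIP = ("10.96.0.100", 80)
PODS = [("10.244.2.3", 8080), ("10.244.3.7", 8080)]
nat = ServiceNat({VIP: PODS})

client = ("10.244.1.5", 40000)
src, dst = nat.outbound(client, VIP)
print(src == client, dst in PODS)   # True True: source kept, destination rewritten
reply_src, reply_dst = nat.reply(dst, client)
print(reply_src == VIP)             # True: caller sees the ClusterIP as responder
```

Because the chosen backend is stored per connection, later packets of the same flow reach the same pod even though the initial choice was random.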

Quick Reference: Common kubectl Networking Commands

# List all services with ClusterIPs
kubectl get svc -A -o wide

# List all endpoints (backing pods for each service)
kubectl get endpointslices -A

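# Check pod IPs (custom-columns avoids awk miscounting multi-word fields such as RESTARTS)
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName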

# Inspect DNS for a service (from within a pod)
kubectl run dnstest --image=busybox:1.36 --rm -it -- \
  nslookup kubernetes.default.svc.cluster.local

# Test pod-to-pod connectivity
kubectl exec -it pod-a -- curl http://10.244.2.3:8080

# Test pod-to-service connectivity
kubectl exec -it pod-a -- curl http://my-service.my-namespace.svc.cluster.local

# Show the pod CIDR assigned to each node
kubectl get node -o jsonpath='{range .items[*]}{.metadata.name}: podCIDR={.spec.podCIDR}{"\n"}{end}'

# Watch endpoint changes for a service
kubectl get endpointslices -l kubernetes.io/service-name=my-service -w

# Debug: trace rules for a ClusterIP (run as root on a node; first line for iptables mode, second for IPVS mode)
iptables-save | grep "10.96.0.100"
ipvsadm -Ln | grep -A3 "10.96.0.100"

CNI Plugin Selection Guide

CNI	Dataplane	Network Policy	Best for	Trade-offs
Calico	eBPF or iptables; BGP or VXLAN	Yes (full)	On-prem, cloud, enterprises needing BGP peering	More complex; BGP config required for underlay mode
Cilium	eBPF (minimum kernel version depends on the Cilium release; recent releases require a 5.x kernel)	Yes (L3–L7)	Cloud-native, observability, kube-proxy replacement	Requires recent kernel; steeper learning curve
Flannel	VXLAN (default)	No (needs Calico for policies)	Simple clusters, learning, dev environments	No network policy; limited features
AWS VPC CNI	Native ENI/VPC routing	Yes (via Calico or policy addon)	EKS; native VPC IPs, no overlay	AWS-only; IP exhaustion risk per-node
Azure CNI	Native VNET	Yes (Azure NPM)	AKS; native VNET IPs, direct routing	Azure-only; subnet size planning required
Antrea	OVS + eBPF	Yes (L3–L7)	VMware/vSphere, Windows support	Complex OVS dependency
Weave	VXLAN/mesh	Yes	Simple setup, multi-cloud mesh	Performance overhead; no longer actively maintained

Common Networking Pitfalls

Overlapping CIDRs

Pod CIDR overlapping VPC CIDR causes traffic to be delivered to real VMs instead of pods. Always allocate pod and service CIDRs from private ranges that don't conflict with your infrastructure.

Missing Network Policy default deny

Without a default-deny policy, all pods can talk to all pods. A single compromised pod can reach every database in the cluster. Add namespace-level default-deny ingress/egress policies as a baseline.
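A namespace-level default-deny policy is small enough to generate programmatically. The sketch below builds the manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts alongside YAML; the function name, policy name `default-deny-all`, and namespace `my-namespace` are placeholders.

```python
import json

def default_deny(namespace):
    """Build a NetworkPolicy that denies all ingress and egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Listing both types with no rules denies both directions.
            "policyTypes": ["Ingress", "Egress"],
        },
    }

print(json.dumps(default_deny("my-namespace"), indent=2))
```

Denying egress also blocks DNS lookups, so a baseline like this is normally paired with an explicit allow rule for the cluster DNS service. The policy has no effect unless the CNI in use enforces NetworkPolicy.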

DNS ndots=5 causing latency

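The default ndots:5 means any name with fewer than five dots, such as mydb or api.example.com, is first tried against every search domain (typically three cluster suffixes plus any inherited from the node) before being resolved as an absolute name. Each attempt is a separate DNS query, usually doubled for A and AAAA records. Use FQDNs with a trailing dot for latency-sensitive lookups, or lower ndots in the pod's dnsConfig.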
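The order in which a resolver tries names can be sketched as a small function. This models glibc-style behaviour as an assumption (other resolvers, such as musl's, differ in detail); `candidate_names` is a hypothetical helper, and the search list is the one a pod in a namespace called `my-namespace` would typically receive.

```python
def candidate_names(name, search, ndots=5):
    """Return the query names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is skipped
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + searched  # enough dots: try as-is first
    return searched + [absolute]      # too few dots: search list first

SEARCH = ["my-namespace.svc.cluster.local", "svc.cluster.local", "cluster.local"]

print(candidate_names("mydb", SEARCH))
# ['mydb.my-namespace.svc.cluster.local.', 'mydb.svc.cluster.local.',
#  'mydb.cluster.local.', 'mydb.']

print(len(candidate_names("api.example.com", SEARCH)))  # 4
print(candidate_names("api.example.com.", SEARCH))      # ['api.example.com.']
```

Resolution stops at the first name that resolves, so in-cluster short names usually succeed on the first attempt, while an external name such as api.example.com pays for three failed lookups before the absolute query is sent.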

kube-proxy / CNI race on node startup

On node startup, kube-proxy may program Service rules before the CNI has finished setting up pod interfaces. Result: early pods get ClusterIP DNAT rules pointing to IPs with no route. Both components reconcile continuously (kube-proxy from EndpointSlices, the CNI agent from its own node and pod state) and normally self-heal, but monitor startup events.

externalTrafficPolicy SNAT

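externalTrafficPolicy: Cluster (default) SNATs the original source IP before routing to backend pods. Applications that need real client IPs must use externalTrafficPolicy: Local, which delivers traffic only to backends on the node that received it. Nodes without a local backend drop the traffic, so the load balancer's health checks must steer around them, and load can be spread unevenly across pods.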

NodePort port conflicts

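NodePorts (30000–32767 by default) are allocated cluster-wide. The API server rejects a Service that requests a nodePort already in use, so hard-coded ports fail at apply time rather than silently, and a NodePort can still collide with a host process listening on the same port. Let Kubernetes auto-assign ports, keep an inventory of the ones you pin, or use a LoadBalancer type instead of NodePort in production.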
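A port inventory can be checked before manifests are applied. The following is a minimal sketch, assuming the planned assignments are available as (service, nodePort) pairs; the helper `check_nodeports` and the service names are hypothetical, and the range constant reflects the default quoted above.

```python
from collections import defaultdict

NODEPORT_RANGE = range(30000, 32768)  # default range: 30000-32767 inclusive

def check_nodeports(assignments):
    """Return (duplicates, out_of_range) for (service, node_port) pairs."""
    by_port = defaultdict(list)
    for service, port in assignments:
        by_port[port].append(service)
    duplicates = {p: s for p, s in by_port.items() if len(s) > 1}
    out_of_range = [(s, p) for s, p in assignments if p not in NODEPORT_RANGE]
    return duplicates, out_of_range

planned = [
    ("shop/frontend", 30080),
    ("shop/admin", 30080),   # collides with shop/frontend
    ("ops/metrics", 8080),   # outside the default range
]
dups, bad = check_nodeports(planned)
print(dups)  # {30080: ['shop/frontend', 'shop/admin']}
print(bad)   # [('ops/metrics', 8080)]
```

A check like this fits naturally into CI for the repository that holds the Service manifests.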