
Kubernetes Networking Overview

Kubernetes networking is one of the most frequently misunderstood areas of the platform — not because it is conceptually hard, but because it spans multiple abstraction layers, each implemented by a different pluggable component. This overview establishes the four fundamental networking problems Kubernetes solves, the components responsible for each, and the invariants that all implementations must satisfy. Every subsequent page in this section builds on this foundation.

The Four Networking Problems

Kubernetes networking addresses four distinct communication scenarios:

1. Container-to-Container

Containers in the same pod share a network namespace (same IP, same port space). They communicate via localhost. This is provided by the container runtime (pause container) — no Kubernetes networking layer needed.

2. Pod-to-Pod

Every pod gets a unique, routable IP address. Any pod can reach any other pod by IP without NAT. This is implemented by the CNI plugin and is a hard requirement of the Kubernetes networking model.

3. Pod-to-Service

Services provide stable virtual IPs (ClusterIPs) that load-balance to a set of pods. Implemented by kube-proxy (iptables, IPVS, or nftables mode) or by an eBPF-based replacement shipped with the CNI, such as Cilium. Decouples consumers from pod lifecycle.

4. External-to-Service

External traffic entering the cluster via NodePort, LoadBalancer, or Ingress/Gateway. Implemented by cloud load balancers (CCM), Ingress controllers, and Gateway API implementations.

The Fundamental Networking Contract

Every Kubernetes-conformant cluster must guarantee: (1) all pods can communicate with all other pods without NAT; (2) all nodes can communicate with all pods without NAT; (3) the IP a pod sees for itself is the same IP other pods see for it. These requirements come from the Kubernetes networking model; the CNI plugin is responsible for meeting them, and the Kubernetes conformance (e2e) tests exercise them.

The Kubernetes Network Model

Figure: two-node example of the Kubernetes network model.

Node 1 (eth0, node IP 10.0.1.10)
  Pod A, IP 10.244.1.5: containers ctr-1 and ctr-2, which talk over localhost:PORT
  Pod B, IP 10.244.1.8: container ctr-1
  Each pod attaches through a veth pair to the cni0 / cbr0 bridge (10.244.1.1/24)
  iptables / IPVS: kube-proxy Service rules

Node 2 (eth0, node IP 10.0.1.11)
  Pod C, IP 10.244.2.3: container ctr-1, attached through a veth pair to the cni0 / cbr0 bridge (10.244.2.1/24)
  iptables / IPVS

Between the nodes: the CNI routes cross-node traffic over an underlay or overlay.
Pod A → Pod C: direct IP, no NAT.
ClusterIP 10.96.0.100 → DNAT to pod IPs.

Address Spaces

A Kubernetes cluster uses three distinct, non-overlapping CIDR ranges:

| Address Space | Typical Range | What uses it | Configured by |
| --- | --- | --- | --- |
| Node CIDR | 10.0.0.0/16 (cloud VPC) | Node IP addresses (eth0); kubelet advertises this IP | Cloud VPC / on-prem network |
| Pod CIDR | 10.244.0.0/16 | All pod IPs cluster-wide; each node gets a /24 slice (spec.podCIDR) | --cluster-cidr on kube-controller-manager; CNI plugin |
| Service CIDR | 10.96.0.0/12 | ClusterIP virtual IPs; only exist in iptables/IPVS rules, never routed on wire | --service-cluster-ip-range on kube-apiserver |

CIDRs Must Not Overlap

Overlapping any two of these three ranges causes silent routing failures. A common mistake: the pod CIDR overlaps the VPC CIDR, so traffic to the overlapping addresses is misrouted. Plan all three ranges before cluster creation; changing them afterwards is disruptive and in most setups means rebuilding the cluster.
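The overlap rule above is easy to check before any cluster exists. The following is a minimal sketch using Python's standard ipaddress module; the function name find_overlaps and the "bad" plan are hypothetical, chosen only for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return sorted (name_a, name_b) pairs whose CIDR blocks share any address."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return sorted(
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    )


# Ranges from the address-space table: no two of them overlap.
good_plan = {"node": "10.0.0.0/16", "pod": "10.244.0.0/16", "service": "10.96.0.0/12"}

# Hypothetical mistake: the pod CIDR is carved out of the VPC range.
bad_plan = {"node": "10.0.0.0/16", "pod": "10.0.128.0/17", "service": "10.96.0.0/12"}

print(find_overlaps(good_plan))  # []
print(find_overlaps(bad_plan))   # [('node', 'pod')]
```

Running a check like this against the planned VPC, pod, and service ranges (plus any peered networks) catches the misrouting case before it becomes a rebuild.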

Networking Layers and Responsible Components

Service Mesh / mTLS (Istio, Linkerd, Cilium) L7 — optional, sidecar or eBPF
Ingress / Gateway API (nginx, envoy, traefik) L7 HTTP routing — external traffic
Services / kube-proxy (ClusterIP, NodePort, LoadBalancer) L4 virtual IP — iptables / IPVS / eBPF
Pod Networking / CNI (Calico, Cilium, Flannel, Weave) L3 pod-to-pod routing — overlay or underlay
Node Network (VPC, physical fabric) L2/L3 — cloud or on-prem infrastructure
Container Networking (pause container, veth pairs) L2 — within-pod namespace sharing

Component Roles at a Glance

| Component | Layer | Responsibility | Pluggable? |
| --- | --- | --- | --- |
| pause container | Container | Creates and holds the pod network namespace; all app containers join it | No (built into kubelet/CRI) |
| CNI plugin | Pod L3 | Assigns pod IP from node's podCIDR; sets up veth pair, routes, and cross-node routing | Yes (Calico, Cilium, Flannel, Weave, AWS VPC CNI, Azure CNI, …) |
| kube-proxy | Service L4 | Programs iptables/IPVS rules for ClusterIP DNAT; handles NodePort and LoadBalancer | Yes (replaceable by Cilium kube-proxy replacement, Antrea, etc.) |
| CoreDNS | Service DNS | Resolves service.namespace.svc.cluster.local → ClusterIP; pod DNS | Yes (extensible with custom CoreDNS plugins; NodeLocal DNSCache adds a per-node cache in front of it) |
| Ingress controller | L7 HTTP | Terminates TLS, routes HTTP by host/path to Services | Yes (nginx, Envoy, Traefik, HAProxy, …) |
| Gateway API impl. | L7 HTTP/TCP | Next-gen Ingress; richer routing (header, weight, TCP/UDP) | Yes (Envoy Gateway, Cilium, Istio, nginx Gateway Fabric) |
| Cloud LB (CCM) | L4 external | Provisions cloud load balancer for type: LoadBalancer Services | Via cloud-controller-manager |
| Network Policy engine | L3/L4 firewall | Enforces NetworkPolicy objects (allow/deny by pod selector/port) | Yes (Calico, Cilium, Weave; Flannel alone cannot enforce policies) |
| Service mesh | L7 mTLS | Mutual TLS, observability, traffic management between pods | Yes (Istio, Linkerd, Consul Connect, Cilium Service Mesh) |

What Each Page in This Section Covers

01 · Pod Networking

veth pairs, network namespaces, pod IP assignment, the pause container, cross-node routing with overlays vs. BGP underlay, packet walk end-to-end.

02 · CNI Plugins

CNI spec, plugin invocation by kubelet, Calico BGP + eBPF, Cilium eBPF, Flannel VXLAN, AWS VPC CNI, Azure CNI — when to use each.

03 · Services

ClusterIP, NodePort, LoadBalancer, ExternalName, Headless — object model, EndpointSlices, session affinity, external traffic policy.

04 · kube-proxy Internals

iptables chain hierarchy, IPVS scheduler algorithms, nftables mode (beta in 1.31, GA in 1.33), EndpointSlice sync, topology-aware routing.

05 · DNS

CoreDNS architecture, pod DNS policy, search domains, ndots, stub zones, NodeLocal DNSCache, custom CoreDNS config.

06 · Ingress

IngressClass, nginx ingress controller, TLS termination, annotations, path types, multi-tenant patterns.

07 · Gateway API

Gateway, HTTPRoute, GRPCRoute, TCPRoute, ReferenceGrant — the successor to Ingress with role-based separation.

08 · Network Policies

Default deny patterns, ingress/egress rules, podSelector + namespaceSelector, CIDR blocks, policy-as-code.

09 · IPv4/IPv6 Dual Stack

Dual-stack cluster setup, pod dual IPs, Service ipFamilyPolicy: PreferDualStack, IPv6-first policies.

10 · Load Balancing

Cloud LB integration, MetalLB for bare metal, BGP peering, L2 mode, LoadBalancer IP pools.

11 · Service Mesh

Istio control plane, Envoy sidecar, Linkerd micro-proxy, Cilium Mesh — mTLS, traffic shifting, observability.

Key Networking Invariants

These invariants hold regardless of which CNI, kube-proxy mode, or Ingress controller you choose:

  1. No NAT between pods. Pod A reaching pod B's IP sees B's real IP. Pod B sees A's real pod IP as the source. NAT only applies at the node boundary for external-to-pod traffic (NodePort/LoadBalancer with externalTrafficPolicy: Cluster).
  2. Pod IPs are ephemeral. A pod's IP is released when the pod is deleted. Services abstract this: they track healthy pod IPs via EndpointSlices and kube-proxy programs new rules within seconds.
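  3. ClusterIPs are virtual. In iptables and nftables mode a ClusterIP is not assigned to any network interface; it exists only as a match in DNAT rules. Those rules match the Service's protocol and port, so an ICMP ping to a ClusterIP normally goes unanswered even when the Service is healthy. IPVS mode is the exception: kube-proxy binds each ClusterIP to a dummy interface (kube-ipvs0), so the address does answer ping there.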
  4. DNS is the primary discovery mechanism. Kubernetes services are primarily discovered via DNS (<service>.<namespace>.svc.cluster.local), not by hard-coding ClusterIPs. Applications should always use DNS names; a ClusterIP changes whenever the Service is deleted and recreated.
  5. Network Policies are additive and default-allow. Without any NetworkPolicy, all pod-to-pod traffic is allowed. Adding a NetworkPolicy affects only the pods it selects. For real isolation, you need a default-deny base policy.

CIDR Planning Reference

# Common CIDR allocation example
--cluster-cidr=10.244.0.0/16         # pod CIDR (65536 pod IPs total)
--service-cluster-ip-range=10.96.0.0/12  # service CIDR (1M virtual IPs)
# Node IPs: 10.0.0.0/16 (cloud VPC)

# Per-node pod subnet (auto-assigned from cluster CIDR):
# node-1: 10.244.1.0/24  → 254 usable pod IPs
# node-2: 10.244.2.0/24  → 254 usable pod IPs
# (maxPods=110 default in kubelet limits actual pod count regardless of subnet size)

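# Large cluster example:
--cluster-cidr=10.0.0.0/8            # 16M pod IPs
--node-cidr-mask-size=24             # /24 per node (254 pod IPs each)
# Address space: 65536 node subnets × 254 IPs = ~16M pod IPs
# (maxPods still caps pods per node; the default 110 gives ~7.2M schedulable pods)
# Caution: 10.0.0.0/8 contains 10.96.0.0/12 and 10.0.0.0/16, so the node and
# service ranges must then come from outside 10.0.0.0/8 (e.g. 172.16.0.0/12)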

# Tight cloud VPC example (avoid pod/VPC overlap):
# VPC: 172.31.0.0/16 (AWS default)
# Pod CIDR: 192.168.0.0/16    ← must not overlap with 172.31.0.0/16
# Service CIDR: 10.100.0.0/16 ← must not overlap with either

# Dual-stack:
--cluster-cidr=10.244.0.0/16,fd00::/48
--service-cluster-ip-range=10.96.0.0/12,fd01::/108
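The arithmetic in the comments above can be reproduced with a few lines of Python. This is a rough sketch, not a sizing tool: the function name plan_capacity is hypothetical, it handles a single IPv4 range, and it follows the comments' convention of subtracting two addresses per node subnet.

```python
import ipaddress


def plan_capacity(cluster_cidr, node_mask_size, max_pods=110):
    """Estimate node-subnet count and pod capacity for a pod CIDR plan."""
    cluster = ipaddress.ip_network(cluster_cidr)
    if node_mask_size < cluster.prefixlen:
        raise ValueError("node mask must not be shorter than the cluster prefix")
    # Each extra prefix bit doubles the number of per-node subnets.
    node_subnets = 2 ** (node_mask_size - cluster.prefixlen)
    # Addresses per node subnet, minus network and broadcast (as in the comments above).
    usable_ips = 2 ** (cluster.max_prefixlen - node_mask_size) - 2
    # kubelet's maxPods limits pods per node regardless of subnet size.
    pods_per_node = min(usable_ips, max_pods)
    return {
        "node_subnets": node_subnets,
        "usable_ips_per_node": usable_ips,
        "pods_per_node": pods_per_node,
        "max_schedulable_pods": node_subnets * pods_per_node,
    }


print(plan_capacity("10.244.0.0/16", 24))
print(plan_capacity("10.0.0.0/8", 24))
```

With the default maxPods of 110, the /16 plan yields 256 node subnets and 28,160 schedulable pods; the IP count alone overstates real capacity.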

Packet Lifecycle: Pod-to-Pod Cross-Node

A brief end-to-end trace for a packet from Pod A (node 1) to Pod C (node 2):

Pod A sends to 10.244.2.3
veth → cni0 bridge
Node 1 routing table
Overlay/BGP: encap + route to Node 2
Node 2 decap + cni0 bridge
veth → Pod C

| Step | Component | What happens |
| --- | --- | --- |
| 1 | Pod A kernel | Sends IP packet: src=10.244.1.5, dst=10.244.2.3 |
| 2 | veth pair | Packet exits pod netns, enters host netns via veth peer |
| 3 | iptables (PREROUTING) | kube-proxy rules checked; dst is a pod IP, not a ClusterIP, so no DNAT |
| 4 | Route table | Node 1 has route: 10.244.2.0/24 via overlay (VXLAN) or BGP peer |
| 5 | CNI encapsulation | VXLAN: inner packet wrapped in outer UDP/IP (src=node1-IP, dst=node2-IP; UDP port 8472 with Flannel, while some plugins use the IANA-assigned 4789). BGP: no encap, just routed. |
| 6 | Physical network | Outer packet delivered to node 2's eth0 |
| 7 | CNI decapsulation | VXLAN: outer header stripped; inner packet revealed with dst=10.244.2.3 |
| 8 | Route table node 2 | Route: 10.244.2.3 → veth of Pod C |
| 9 | Pod C kernel | Receives packet; sees src=10.244.1.5 (no NAT) |
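The routing decision in step 4 is a longest-prefix match. The sketch below models it with Python's ipaddress module. The routing table is a hypothetical one for node 1 of the example, and the device names (cni0, flannel.1, eth0) are illustrative assumptions; a real table depends on the CNI plugin.

```python
import ipaddress

# Hypothetical routes on node 1; the targets are descriptive strings only.
NODE1_ROUTES = [
    ("10.244.1.0/24", "cni0 (local bridge)"),
    ("10.244.2.0/24", "flannel.1 via 10.0.1.11 (VXLAN)"),
    ("0.0.0.0/0", "eth0 (default gateway)"),
]


def lookup(routes, dst):
    """Return the target of the most specific route that contains dst."""
    dst_ip = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if dst_ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        raise LookupError(f"no route to {dst}")
    # Longest prefix wins, as in the kernel's route selection.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


print(lookup(NODE1_ROUTES, "10.244.2.3"))  # Pod C: sent to the overlay device
print(lookup(NODE1_ROUTES, "10.244.1.8"))  # Pod B: stays on the local bridge
```

Same-node traffic never leaves the bridge, while anything in another node's /24 is handed to the overlay or BGP next hop.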

Packet Lifecycle: Pod-to-Service

Pod A sends to ClusterIP 10.96.0.100:80
iptables PREROUTING DNAT
dst rewritten to 10.244.2.3:8080
cross-node routing (as above)
Pod C receives (sees src=pod A IP)

The ClusterIP is intercepted in the kernel's PREROUTING chain before any routing decision (traffic that originates in the node's own network namespace is caught in OUTPUT instead). The DNAT rule rewrites the destination to one of the backing pod IPs. Routing then proceeds exactly as for a direct pod-to-pod packet. On the reply, connection tracking reverses the DNAT and rewrites the source back to the ClusterIP before the packet reaches the calling pod, so the caller always sees the service IP as the responder.
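The DNAT and reply rewriting described above can be illustrated with a toy model. This is not how netfilter is implemented: the class ToyServiceNat, the "ip:port" string addresses, and the client port 40000 are all hypothetical, and backends are picked round-robin for determinism, whereas kube-proxy's iptables mode picks one at random.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str  # "ip:port"
    dst: str  # "ip:port"


class ToyServiceNat:
    """Toy model of ClusterIP DNAT plus conntrack-style reply rewriting."""

    def __init__(self, services):
        self.services = services  # ClusterIP endpoint -> list of pod endpoints
        self.conntrack = {}       # (client, backend) -> original ClusterIP endpoint
        self._counter = {}        # per-service round-robin position

    def outbound(self, pkt):
        backends = self.services.get(pkt.dst)
        if not backends:
            return pkt  # destination is not a Service: leave the packet untouched
        index = self._counter.get(pkt.dst, 0)
        backend = backends[index % len(backends)]
        self._counter[pkt.dst] = index + 1
        # Remember the translation so the reply can be mapped back.
        self.conntrack[(pkt.src, backend)] = pkt.dst
        return Packet(src=pkt.src, dst=backend)

    def reply(self, pkt):
        original = self.conntrack.get((pkt.dst, pkt.src))
        if original is None:
            return pkt
        # Reverse the DNAT: the caller sees the ClusterIP as the responder.
        return Packet(src=original, dst=pkt.dst)


nat = ToyServiceNat({"10.96.0.100:80": ["10.244.2.3:8080"]})
request = nat.outbound(Packet("10.244.1.5:40000", "10.96.0.100:80"))
response = nat.reply(Packet("10.244.2.3:8080", "10.244.1.5:40000"))
print(request)
print(response)
```

The request leaves with the pod IP as its destination and the original pod source; the reply arrives at the client with the ClusterIP as its source.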

Quick Reference: Common kubectl Networking Commands

# List all services with ClusterIPs
kubectl get svc -A -o wide

# List all endpoints (backing pods for each service)
kubectl get endpointslices -A

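# Check pod IPs (custom columns avoid awk miscounting fields such as RESTARTS)
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName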

# Inspect DNS for a service (from within a pod)
kubectl run dnstest --image=busybox:1.36 --rm -it -- \
  nslookup kubernetes.default.svc.cluster.local

# Test pod-to-pod connectivity
kubectl exec -it pod-a -- curl http://10.244.2.3:8080

# Test pod-to-service connectivity
kubectl exec -it pod-a -- curl http://my-service.my-namespace.svc.cluster.local

# Show the pod CIDR assigned to each node
kubectl get node -o jsonpath='{range .items[*]}{.metadata.name}: podCIDR={.spec.podCIDR}{"\n"}{end}'

# Watch endpoint changes for a service
kubectl get endpointslices -l kubernetes.io/service-name=my-service -w

# Debug: trace rules for a ClusterIP (run as root on a node, not through kubectl;
# use the command that matches the kube-proxy mode)
iptables-save | grep "10.96.0.100"
ipvsadm -Ln | grep -A3 "10.96.0.100"

CNI Plugin Selection Guide

| CNI | Dataplane | Network Policy | Best for | Trade-offs |
| --- | --- | --- | --- | --- |
| Calico | eBPF or iptables; BGP or VXLAN | Yes (full) | On-prem, cloud, enterprises needing BGP peering | More complex; BGP config required for underlay mode |
| Cilium | eBPF (kernel 4.9+ for older releases; current releases need a newer kernel) | Yes (L3–L7) | Cloud-native, observability, kube-proxy replacement | Requires recent kernel; higher learning curve |
| Flannel | VXLAN (default) | No (needs Calico for policies) | Simple clusters, learning, dev environments | No network policy; limited features |
| AWS VPC CNI | Native ENI/VPC routing | Yes (via Calico or policy addon) | EKS; native VPC IPs, no overlay | AWS-only; IP exhaustion risk per-node |
| Azure CNI | Native VNET | Yes (Azure NPM) | AKS; native VNET IPs, direct routing | Azure-only; subnet size planning required |
| Antrea | OVS + eBPF | Yes (L3–L7) | VMware/vSphere, Windows support | Complex OVS dependency |
| Weave | VXLAN/mesh | Yes | Simple setup, multi-cloud mesh | Performance overhead; less active development |

Common Networking Pitfalls

Overlapping CIDRs

Pod CIDR overlapping VPC CIDR causes traffic to be delivered to real VMs instead of pods. Always allocate pod and service CIDRs from private ranges that don't conflict with your infrastructure.

Missing Network Policy default deny

Without a default-deny policy, all pods can talk to all pods. A single compromised pod can reach every database in the cluster. Add namespace-level default-deny ingress/egress policies as a baseline.
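A namespace-level baseline of this kind can be sketched as a manifest generator. The Python below builds a NetworkPolicy object that selects every pod in a namespace and allows nothing; the function name default_deny_policy, the policy name default-deny-all, and the namespace my-namespace are placeholders chosen for illustration.

```python
import json


def default_deny_policy(namespace):
    """Build a NetworkPolicy that denies all ingress and egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Listing both types with no rules means nothing is allowed.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


# kubectl accepts JSON as well as YAML, so the output can be piped to
# `kubectl apply -f -`.
print(json.dumps(default_deny_policy("my-namespace"), indent=2))
```

Denying egress also blocks DNS lookups, so a policy like this is normally paired with an explicit allow rule for the cluster DNS service, and it only takes effect when the CNI enforces NetworkPolicy.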

DNS ndots=5 causing latency

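The default ndots:5 means any name with fewer than five dots is first tried with each search-domain suffix appended (typically <namespace>.svc.cluster.local, svc.cluster.local, and cluster.local, plus any node-level domains) before it is tried as an absolute name. Each attempt is a separate DNS query, usually for both A and AAAA records, so an external name such as api.example.com costs several failed lookups first. Use FQDNs with trailing dots for latency-sensitive services, or lower ndots in the pod's dnsConfig.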
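The query expansion can be made concrete with a small model of glibc-style resolver behaviour. This is a simplified sketch: the function name candidate_queries and the namespace my-namespace are hypothetical, and other resolvers (musl, for example) differ in detail.

```python
def candidate_queries(name, search, ndots=5):
    """List the names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is skipped
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # too few dots: walk the search list first


# Typical search list for a pod in the (hypothetical) namespace "my-namespace".
SEARCH = ["my-namespace.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in candidate_queries("api.example.com", SEARCH):
    print(query)

print(candidate_queries("api.example.com.", SEARCH))
```

The external name produces three cluster-suffixed lookups before the real one, while the trailing-dot form goes straight to a single query.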

kube-proxy / CNI race on node startup

On node startup, kube-proxy may program Service rules before the CNI has finished setting up pod networking. Result: for a short window, ClusterIP DNAT rules can point at pod IPs that are not yet routable. Both components reconcile continuously (kube-proxy by watching Services and EndpointSlices, the CNI through its own agent), so this normally self-heals, but monitor startup events.

externalTrafficPolicy SNAT

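externalTrafficPolicy: Cluster (default) may forward external traffic to a backend on another node and SNATs it on the way, so pods see a node IP instead of the client IP. Applications that need real client IPs must use externalTrafficPolicy: Local, which preserves the source address but only delivers to backends on the node where the traffic arrived. Nodes without a local backend drop the traffic, so the load balancer has to health-check nodes, and load can be spread unevenly.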

NodePort port conflicts

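NodePorts (30000–32767 by default) are allocated cluster-wide: the port is opened on every node. The API server rejects a Service that requests a nodePort already allocated to another Service, so two Services cannot silently share one. The practical risks are exhausting the range, hard-coding ports that are already taken in another cluster, and clashing with host processes or hostPort pods that listen on the same port. Prefer auto-assigned ports and keep a port inventory, or use a LoadBalancer type instead of NodePort in production.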