Kube-Proxy Internals

kube-proxy is the network rule engine that makes Kubernetes Services reachable. It watches the API server for Service and Endpoints/EndpointSlice changes and programs the local node's kernel to implement virtual IP load balancing. This page dissects every proxy mode — iptables, IPVS, and nftables — down to the individual rules, chains, and data structures, covering Service types, session affinity, connection tracking, and replacement options.

What kube-proxy Does (and Does Not Do)

Key insight: ClusterIPs never hit the wire

A Service ClusterIP (e.g. 10.96.14.3) is a virtual IP that no pod or physical network interface owns. It is a destination address matched by kernel packet-processing rules (IPVS mode additionally binds it to a local dummy interface so the kernel accepts the packets). kube-proxy creates those rules; the kernel rewrites matching packets to a real pod IP before they leave the host.

What kube-proxy programs

ClusterIP → pod IP DNAT (iptables/IPVS/nftables)
NodePort host port → ClusterIP forwarding
LoadBalancer external IP → ClusterIP forwarding
ExternalIP routing rules
Session affinity (client-IP stickiness)
Source IP masquerade (SNAT) for Service traffic that needs it (hairpin, NodePort/external traffic under the Cluster policy)

What kube-proxy does NOT do

Pod-to-pod routing (CNI's job)
Ingress / L7 HTTP routing (Ingress controller's job)
Actual load balancer provisioning (CCM's job)
DNS (CoreDNS's job)
NetworkPolicy enforcement (CNI's job)
TLS termination

Watch Loop

kube-proxy uses client-go informers (list + watch) to track three object types:

Object	API Group / Version	What Changes Trigger
`Service`	core/v1	New ClusterIP; port/protocol change; type change; sessionAffinity change
`EndpointSlice`	discovery.k8s.io/v1	Pod ready/not-ready; pod IP change; new pods; deleted pods
`Node`	core/v1 (self only)	Node address changes for NodePort binding

kube-proxy uses EndpointSlice (GA 1.21) by default. Each EndpointSlice holds up to 100 endpoints. For a service with 1,000 pods, 10 EndpointSlices are created rather than one monolithic Endpoints object, dramatically reducing watch event churn.
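
The slice arithmetic above can be sketched as follows. This is a minimal illustration, assuming the default limit of 100 endpoints per slice; `min_slices` is a hypothetical helper name, and the real EndpointSlice controller may end up with more slices than this minimum because it does not always repack them tightly.

```python
import math

def min_slices(endpoint_count: int, max_per_slice: int = 100) -> int:
    """Smallest number of EndpointSlices that can hold endpoint_count endpoints."""
    if endpoint_count < 0 or max_per_slice < 1:
        raise ValueError("endpoint_count must be >= 0 and max_per_slice >= 1")
    return math.ceil(endpoint_count / max_per_slice)

print(min_slices(1000))  # 10
print(min_slices(1001))  # 11
```

A single pod change then touches one slice of at most 100 endpoints instead of rewriting one object that lists all 1,000.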

Service Types Recap

Type	ClusterIP	NodePort	External LB	Use Case
ClusterIP	Assigned from service CIDR	No	No	Internal communication; most services
NodePort	Assigned	30000–32767 on every node	No	External access without cloud LB; bare metal
LoadBalancer	Assigned	Assigned	Cloud LB via CCM	Production external access on cloud
ExternalName	None	No	No	DNS CNAME alias; no proxying
Headless (`clusterIP: None`)	None	No	No	StatefulSets; direct pod DNS; bypasses kube-proxy

Headless Services bypass kube-proxy entirely

When clusterIP: None is set, no ClusterIP is assigned and kube-proxy programs nothing. CoreDNS returns the individual pod A records for the service DNS name, so clients connect to pod IPs directly.

iptables Mode (Default on Linux)

iptables mode uses Linux netfilter's PREROUTING and OUTPUT chains to DNAT service IPs to pod IPs. The first packet of each new connection is matched against a linear chain of rules, while conntrack handles the remaining packets of the flow. This linear matching is the fundamental limitation that makes iptables mode struggle above ~10,000 services.

Chain Structure

Annotated Rule Walkthrough

# Dump all kube-proxy iptables rules
iptables-save | grep -E "KUBE|kubernetes"

# ------ PREROUTING: jump to KUBE-SERVICES ------
-A PREROUTING -m comment --comment "kubernetes service portals" \
  -j KUBE-SERVICES

# ------ KUBE-SERVICES: one rule per service ------
# Match destination 10.96.14.3:80 (ClusterIP of nginx service) → jump to SVC chain
-A KUBE-SERVICES -d 10.96.14.3/32 -p tcp -m comment \
  --comment "default/nginx cluster IP" \
  -m tcp --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# ------ KUBE-SVC-XXXXX: load balancing via statistic module ------
# 3 endpoints: each rule selects one with probability 1/n, 1/(n-1), 1/1

# First endpoint: probability 0.33333333349 (= 1/3)
-A KUBE-SVC-XPGD46QRK7WJZT7O -m comment --comment "default/nginx" \
  -m statistic --mode random --probability 0.33333333349 \
  -j KUBE-SEP-ABCDEF1234567890

# Second endpoint: probability 0.50000000000 (= 1/2 of remaining)
-A KUBE-SVC-XPGD46QRK7WJZT7O -m comment --comment "default/nginx" \
  -m statistic --mode random --probability 0.50000000000 \
  -j KUBE-SEP-BCDEF12345678901

# Third endpoint: probability 1.0 (catch-all for last)
-A KUBE-SVC-XPGD46QRK7WJZT7O -m comment --comment "default/nginx" \
  -j KUBE-SEP-CDEF123456789012

# ------ KUBE-SEP-XXXXX: DNAT to actual pod IP ------
# Mark for masquerade if source is the pod itself (hairpin)
-A KUBE-SEP-ABCDEF1234567890 -s 10.244.1.5/32 -m comment \
  --comment "default/nginx" -j KUBE-MARK-MASQ

# DNAT: rewrite dst IP:port to pod IP:port
-A KUBE-SEP-ABCDEF1234567890 -p tcp -m comment \
  --comment "default/nginx" -m tcp \
  -j DNAT --to-destination 10.244.1.5:80

# ------ KUBE-POSTROUTING: masquerade marked packets ------
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" \
  -j KUBE-POSTROUTING
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --xor-mark 0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" \
  -j MASQUERADE --random-fully
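
The 1/n, 1/(n-1), ..., 1/1 probabilities in the KUBE-SVC chain look uneven, but they give every endpoint the same overall share, because each rule only sees the traffic that earlier rules did not claim. The sketch below checks this with exact fractions; the function names are hypothetical and this is a model of the arithmetic, not kube-proxy code.

```python
from fractions import Fraction

def cascade_probabilities(n: int) -> list[Fraction]:
    """Per-rule match probability for n endpoints: 1/n, 1/(n-1), ..., 1/1."""
    return [Fraction(1, n - i) for i in range(n)]

def effective_shares(n: int) -> list[Fraction]:
    """Overall chance that each endpoint is chosen, given rules are tried in order."""
    shares = []
    remaining = Fraction(1)          # fraction of traffic not yet claimed
    for p in cascade_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares

print([str(p) for p in cascade_probabilities(3)])  # ['1/3', '1/2', '1']
print([str(s) for s in effective_shares(3)])       # ['1/3', '1/3', '1/3']
```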

NodePort Rules

# NodePort rule lives in KUBE-NODEPORTS chain (called from KUBE-SERVICES)
# Matches any destination port 30080 on any interface
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports" \
  -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS

-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx nodePort" \
  -m tcp --dport 30080 -j KUBE-EXT-XPGD46QRK7WJZT7O

# KUBE-EXT chain: decides masquerade for external traffic
-A KUBE-EXT-XPGD46QRK7WJZT7O -m comment \
  --comment "masquerade traffic for default/nginx external destinations" \
  -j KUBE-MARK-MASQ
-A KUBE-EXT-XPGD46QRK7WJZT7O -j KUBE-SVC-XPGD46QRK7WJZT7O

Session Affinity

apiVersion: v1
kind: Service
metadata:
  name: sticky-service
spec:
  selector:
    app: api
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800   # 3 hours; max 86400 (1 day)
  ports:
  - port: 80
    targetPort: 8080

# Session affinity adds a 'recent' module match before the statistic rules
# First packet → choose endpoint normally; mark client IP in 'recent' table
# Subsequent packets from same client IP → jump directly to chosen SEP

-A KUBE-SVC-XXXXX -m comment --comment "default/sticky-service" \
  -m recent --name KUBE-SEP-ABCDEF1234 --rcheck --seconds 10800 \
  --reap -j KUBE-SEP-ABCDEF1234567890

-A KUBE-SEP-ABCDEF1234567890 -m comment --comment "default/sticky-service" \
  -m recent --name KUBE-SEP-ABCDEF1234 --set \
  -j DNAT --to-destination 10.244.1.5:8080   # targetPort of sticky-service
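
The rcheck/set pair behaves roughly like a per-client table with an expiry. The following is a toy model under that assumption: `AffinityTable` and its methods are hypothetical names, time is passed in explicitly so the behavior is reproducible, and the real `recent` module tracks more state than this.

```python
import random

class AffinityTable:
    """Toy model of ClientIP affinity: remembers client -> endpoint with a timeout."""

    def __init__(self, endpoints, timeout_seconds=10800, rng=None):
        self.endpoints = list(endpoints)
        self.timeout = timeout_seconds
        self.rng = rng or random.Random()
        self.entries = {}  # client IP -> (endpoint, last_seen)

    def pick(self, client_ip, now):
        entry = self.entries.get(client_ip)
        if entry is not None and now - entry[1] <= self.timeout:
            endpoint = entry[0]                          # like --rcheck hitting a fresh entry
        else:
            endpoint = self.rng.choice(self.endpoints)   # normal random selection
        self.entries[client_ip] = (endpoint, now)        # like --set refreshing the entry
        return endpoint

table = AffinityTable(["10.244.1.5:8080", "10.244.2.7:8080", "10.244.3.9:8080"])
first = table.pick("203.0.113.7", now=0)
print(table.pick("203.0.113.7", now=3600) == first)  # True: still within the timeout
```

Because each matched connection refreshes the entry, a client that keeps reconnecting stays pinned indefinitely; the timeout only starts counting once the client goes quiet.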

iptables Scaling Problem

O(n) rule traversal

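The first packet of every new connection traverses the KUBE-SERVICES chain linearly (later packets of the flow are handled by conntrack). With 10,000 services × 10 endpoints = 100,000 KUBE-SEP rules, a new connection may be compared against thousands of rules before it matches. iptables-restore time also grows with ruleset size: a full restore of 50,000 rules can take 10+ seconds. Because each table is committed atomically, the cost shows up as delayed endpoint updates (traffic keeps following the old rules until the restore finishes) rather than as a gap in the ruleset.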

Scale	Services	Total Rules (approx)	iptables-restore time	Rule lookup latency
Small	100	~2,000	< 0.1s	Negligible
Medium	1,000	~20,000	~1s	Low
Large	10,000	~200,000	~60s	Measurable (ms)
Very large	50,000	~1,000,000	Minutes	Significant

IPVS Mode

IPVS (IP Virtual Server) is a Linux kernel module originally designed for load balancer appliances. kube-proxy IPVS mode programs kernel IPVS virtual servers instead of iptables chains. Lookups use hash tables: O(1) per packet regardless of service count.
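
The difference between a chain walk and a hash lookup can be made concrete with a small model. This sketch is an analogy only: `linear_lookup` and `build_services` are hypothetical helpers, and a Python list and dict stand in for an iptables chain and an IPVS hash table.

```python
def linear_lookup(rules, dst):
    """Scan rules in order, as a chain walk does; return (target, rules_examined)."""
    for examined, (match, target) in enumerate(rules, start=1):
        if match == dst:
            return target, examined
    return None, len(rules)

def build_services(count):
    """Generate hypothetical ((ip, port), backend-name) pairs."""
    return [((f"10.96.{i // 256}.{i % 256}", 80), f"svc-{i}") for i in range(count)]

rules = build_services(10_000)
table = dict(rules)                  # hash table keyed by (ip, port)

last = rules[-1][0]
print(linear_lookup(rules, last))    # ('svc-9999', 10000)
print(table[last])                   # svc-9999, found with a single hash probe
```

The linear cost depends on where the service sits in the chain; the hash lookup cost does not depend on the number of services at all.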

Enabling IPVS

# KubeProxyConfiguration (kube-proxy ConfigMap)
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: rr              # rr | lc | dh | sh | sed | nq
  syncPeriod: 30s
  minSyncPeriod: 2s
  strictARP: true            # default false; needed when e.g. MetalLB L2 mode announces Service IPs
  tcpTimeout: 900s
  tcpFinTimeout: 16s
  udpTimeout: 300s
iptables:
  masqueradeAll: false
  masqueradeBit: 14          # mark bit 0x4000 for masquerade
  minSyncPeriod: 1s
  syncPeriod: 30s
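
The relationship between masqueradeBit: 14 and the 0x4000 mark is a single bit shift. The sketch below models how the mark is set and later consumed; the function names are hypothetical and only mirror the intent of the KUBE-MARK-MASQ and KUBE-POSTROUTING chains.

```python
MASQUERADE_BIT = 14                    # value of iptables.masqueradeBit
MASQ_MARK = 1 << MASQUERADE_BIT        # 0x4000

def mark_for_masquerade(fwmark: int) -> int:
    """Model KUBE-MARK-MASQ: set the masquerade bit, keep all other bits."""
    return fwmark | MASQ_MARK

def postrouting(fwmark: int) -> tuple[bool, int]:
    """Model KUBE-POSTROUTING: return (masquerade?, resulting mark)."""
    if (fwmark & MASQ_MARK) == 0:
        return False, fwmark           # RETURN: packet was never marked
    return True, fwmark ^ MASQ_MARK    # clear the bit, then MASQUERADE

print(hex(MASQ_MARK))                           # 0x4000
print(postrouting(mark_for_masquerade(0x1)))    # (True, 1)
print(postrouting(0x1))                         # (False, 1)
```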

Kernel module requirements

IPVS mode requires the ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh, and nf_conntrack kernel modules. Load them with modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack (without -a, modprobe treats the extra names as parameters of the first module) and persist them in /etc/modules-load.d/ipvs.conf.

IPVS Virtual Server Model

# IPVS creates a 'virtual server' (VS) per service port
# and 'real servers' (RS) per endpoint

# Install ipvsadm
apt-get install ipvsadm    # Ubuntu

# List all virtual servers
ipvsadm -Ln

# Example output for nginx service (ClusterIP 10.96.14.3:80, 3 pods)
# IP Virtual Server version 1.2.1 (size=4096)
# Prot LocalAddress:Port Scheduler Flags
#   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
# TCP  10.96.14.3:80 rr
#   -> 10.244.1.5:80                Masq    1      5          0
#   -> 10.244.2.7:80                Masq    1      3          0
#   -> 10.244.3.9:80                Masq    1      4          0

# NodePort creates additional virtual server on node IP
# TCP  192.168.1.10:30080 rr
#   -> 10.244.1.5:80                Masq    1      0          0
#   -> 10.244.2.7:80                Masq    1      0          0
#   -> 10.244.3.9:80                Masq    1      0          0

# kube-proxy also creates a dummy interface 'kube-ipvs0'
# and assigns ALL ClusterIPs to it (so kernel accepts the packets)
ip addr show kube-ipvs0
# inet 10.96.14.3/32 scope host kube-ipvs0
# inet 10.96.0.1/32 scope host kube-ipvs0    (kubernetes API server SVC)
# ... one address per Service ClusterIP

IPVS Scheduling Algorithms

Scheduler	Algorithm	Use Case
`rr`	Round Robin	Default; equal distribution; stateless workloads
`lc`	Least Connections	Long-lived connections; databases; minimize hot pods
`dh`	Destination Hashing	Cache affinity; same destination always hits same backend
`sh`	Source Hashing	Client-IP affinity (like sessionAffinity: ClientIP)
`sed`	Shortest Expected Delay	Weighted round robin factoring active connections
`nq`	Never Queue	SED variant; sends to idle server first
`wrr`	Weighted Round Robin	Heterogeneous pods with different capacities
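
To make the differences between schedulers tangible, here is a simplified sketch of three of them. The function names are hypothetical, weights are ignored, and the kernel's lc scheduler also factors in inactive connections, so treat this as an illustration of the selection idea only.

```python
import itertools
import zlib

def round_robin(backends):
    """rr: cycle through backends in order."""
    return itertools.cycle(backends)

def least_connections(active_conns):
    """lc (simplified): pick the backend with the fewest active connections."""
    return min(active_conns, key=active_conns.get)

def source_hash(backends, client_ip):
    """sh-like: the same client IP always maps to the same backend."""
    return backends[zlib.crc32(client_ip.encode()) % len(backends)]

active = {"10.244.1.5:80": 5, "10.244.2.7:80": 3, "10.244.3.9:80": 4}
backends = list(active)

rr = round_robin(backends)
print([next(rr) for _ in range(4)])   # wraps around to the first backend
print(least_connections(active))      # 10.244.2.7:80
print(source_hash(backends, "203.0.113.7") == source_hash(backends, "203.0.113.7"))  # True
```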

IPVS + iptables Coexistence

Even in IPVS mode, kube-proxy still uses iptables for scenarios IPVS cannot handle natively:

# iptables rules still present in IPVS mode:
iptables-save | grep KUBE | grep -v "KUBE-SVC\|KUBE-SEP"

# 1. KUBE-MARK-MASQ — mark packets needing SNAT
# 2. KUBE-POSTROUTING — MASQUERADE marked packets
# 3. KUBE-FIREWALL — drop packets with invalid marks
# 4. KUBE-FORWARD — allow forwarding for established connections
# 5. NodePort masquerade (KUBE-NODEPORTS still uses iptables for source IP mark)

# strictARP=true sets arp_ignore=1 and arp_announce=2 so the node does not answer ARP for kube-ipvs0 addresses
# Needed when another component must own those IPs on the LAN (e.g. MetalLB in L2 mode); default is false
cat /proc/sys/net/ipv4/conf/all/arp_ignore    # should be 1
cat /proc/sys/net/ipv4/conf/all/arp_announce  # should be 2

nftables Mode (GA 1.33)

nftables is the successor to iptables in the Linux kernel, using a bytecode VM instead of linear rule matching. kube-proxy gained nftables mode as alpha in 1.29, beta in 1.31, and GA in 1.33. It offers the same semantics as iptables mode but with better performance characteristics and atomic ruleset updates.

nftables advantages over iptables

Atomic ruleset updates (entire table replaced atomically)
Native set/map data structures (O(1) lookup for IPs)
No iptables-restore latency spikes during updates
Cleaner, more readable rule syntax
Better integration with modern kernels (5.2+)
Single framework for IPv4, IPv6, ARP, bridge rules

nftables limitations

Requires Linux kernel 5.13+ (full feature set)
Cannot coexist with iptables mode (same hook points)
Less tooling familiarity than iptables
Some distributions ship older nftables versions

# Enable nftables mode
# In KubeProxyConfiguration:
# mode: nftables

# Inspect nftables rules created by kube-proxy
# (kube-proxy uses a table named kube-proxy in the ip family, and in ip6 for IPv6)
nft list table ip kube-proxy

# kube-proxy keeps active Service ClusterIPs in a set
nft list set ip kube-proxy cluster-ips
# { 10.96.14.3, 10.96.0.1, ... }

# kube-proxy creates a verdict map for service→endpoint selection
nft list map ip kube-proxy service-ips
# { 10.96.14.3 . tcp . 80 : goto <service chain for default/nginx>, ... }

# nftables map lookup replaces the linear iptables chain walk
# O(1) per packet vs O(n) in iptables mode at same service count

Proxy Mode Comparison

Dimension	iptables	IPVS	nftables	eBPF (Cilium/Calico)
Lookup complexity	O(n) linear chain	O(1) hash table	O(1) set/map	O(1) BPF map
Rule update model	Full flush + restore	Incremental	Atomic table replace	Incremental map update
Update latency at 10k services	~10-60s flush	<1s	<1s atomic	<100ms
Load balancing algorithms	Random probability	7 algorithms (rr/lc/sh…)	Random probability	Maglev hash (Cilium)
Session affinity	recent module	sh scheduler or persistence	nft timeout maps	BPF affinity map
Source IP preservation	No (SNAT)	No (SNAT); DSR possible	No (SNAT)	Yes (DSR, no SNAT)
kube-proxy required	Yes	Yes	Yes	No (replacement)
Kernel requirement	Any	ip_vs module	5.13+	4.9.17+ (5.10+ for full)
Windows support	No	No	No	No
Stability (1.31)	GA (stable)	GA (stable)	Beta (GA in 1.33)	CNI-dependent

End-to-End Traffic Flows

ClusterIP: Pod-to-Service Packet Trace

Step	Location	Action	Packet state
1	Pod (app code)	connect("nginx.default.svc.cluster.local", 80)	DNS lookup → 10.96.14.3
2	Pod kernel	SYN packet sent; dst=10.96.14.3:80	src=10.244.1.5, dst=10.96.14.3:80
3	Pod veth → host veth	Packet enters host network namespace	Unchanged
4	netfilter PREROUTING	KUBE-SERVICES rule matches 10.96.14.3:80 → KUBE-SVC chain	Unchanged
5	KUBE-SVC chain	statistic module selects pod-2; jumps to KUBE-SEP-BBB	Unchanged
6	KUBE-SEP-BBB	DNAT: dst rewritten to 10.244.2.7:80	src=10.244.1.5, dst=10.244.2.7:80
7	conntrack	Entry recorded: 10.244.1.5→10.96.14.3:80 ↔ 10.244.1.5→10.244.2.7:80	Conntrack entry created
8	Routing decision	Route lookup: 10.244.2.7 → via bridge or tunnel	Forwarded normally via CNI
9	Pod-2 receives	SYN arrives at pod-2:80	src=10.244.1.5, dst=10.244.2.7:80
10	Return path	SYN-ACK from pod-2; conntrack reverse-translates src back to 10.96.14.3:80	Leaves pod-2 as src=10.244.2.7, dst=10.244.1.5; delivered to the client with src=10.96.14.3
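
Steps 6, 7, and 10 of the trace can be modeled with a tiny NAT table. This is a toy sketch: `Packet` and `Conntrack` are hypothetical names, client ports are omitted, and real conntrack keys on the full 5-tuple.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src: str
    dst: str

class Conntrack:
    """Toy NAT table: remembers DNAT decisions so replies can be un-translated."""

    def __init__(self):
        self.reverse = {}  # (backend, client) -> original service address

    def dnat(self, pkt, backend):
        """Rewrite dst to the chosen backend and record the original destination."""
        self.reverse[(backend, pkt.src)] = pkt.dst
        return replace(pkt, dst=backend)

    def reply(self, pkt):
        """Restore the service address as src on a reply from a known backend."""
        original_dst = self.reverse.get((pkt.src, pkt.dst))
        return replace(pkt, src=original_dst) if original_dst else pkt

ct = Conntrack()
syn = ct.dnat(Packet("10.244.1.5", "10.96.14.3:80"), "10.244.2.7:80")
print(syn)      # Packet(src='10.244.1.5', dst='10.244.2.7:80')
syn_ack = ct.reply(Packet("10.244.2.7:80", "10.244.1.5"))
print(syn_ack)  # Packet(src='10.96.14.3:80', dst='10.244.1.5')
```

The client only ever sees the ClusterIP as its peer, which is why the application never notices that a pod was selected on its behalf.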

NodePort: External-to-Service Packet Trace

Step	Location	Action
1	External client	TCP SYN to node IP 192.168.1.10:30080
2	Node PREROUTING	KUBE-SERVICES → dst-type LOCAL → KUBE-NODEPORTS
3	KUBE-NODEPORTS	dport 30080 → KUBE-EXT chain
4	KUBE-EXT	KUBE-MARK-MASQ (mark 0x4000) + jump to KUBE-SVC
5	KUBE-SVC	Select endpoint → DNAT to pod IP (e.g. 10.244.3.9:80)
6	POSTROUTING	Mark 0x4000 → MASQUERADE: src rewritten to node IP
7	Pod-3 receives	src=192.168.1.10 (node IP), dst=10.244.3.9:80 — original client IP lost!

Source IP loss with NodePort

The default NodePort flow SNATs traffic through the node, losing the original client IP. Use externalTrafficPolicy: Local to skip the masquerade and preserve client IP — but this means only pods on the same node as the receiving NodePort can be selected (pods on other nodes are unreachable for that NodePort).

externalTrafficPolicy

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # Cluster (default) | Local
  # Local: only route to pods on the receiving node; preserve client IP
  # Cluster: SNAT through any node; lose client IP but full pod distribution

  # internalTrafficPolicy controls ClusterIP traffic routing
  internalTrafficPolicy: Local   # Cluster (default) | Local
  # Local: only route to pods on same node as client (topology-aware)

  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
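
The trade-off between the two policies can be sketched as an endpoint filter. `eligible_endpoints` is a hypothetical helper, and the node names below are illustrative.

```python
def eligible_endpoints(endpoints, policy, receiving_node):
    """Return pod IPs that may receive traffic arriving at receiving_node.

    endpoints is a list of (pod_ip, node_name) pairs.
    """
    if policy == "Local":
        return [ip for ip, node in endpoints if node == receiving_node]
    if policy == "Cluster":
        return [ip for ip, _ in endpoints]
    raise ValueError(f"unknown traffic policy: {policy}")

endpoints = [("10.244.1.5", "worker-1"), ("10.244.2.7", "worker-2")]

print(eligible_endpoints(endpoints, "Cluster", "worker-3"))  # ['10.244.1.5', '10.244.2.7']
print(eligible_endpoints(endpoints, "Local", "worker-1"))    # ['10.244.1.5']
print(eligible_endpoints(endpoints, "Local", "worker-3"))    # []
```

The empty result for worker-3 is the case load balancer health checks are meant to catch: a node with no local pod should be taken out of rotation rather than receive traffic it has to drop.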

EndpointSlices

EndpointSlices (GA 1.21) replaced the monolithic Endpoints object. Each EndpointSlice holds up to maxEndpointsPerSlice (default 100) endpoints. kube-proxy watches EndpointSlices exclusively since 1.22.

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: nginx-abc12
  namespace: default
  labels:
    kubernetes.io/service-name: nginx   # links to Service
addressType: IPv4                       # IPv4 | IPv6 | FQDN
endpoints:
- addresses:
  - "10.244.1.5"
  conditions:
    ready: true          # pod passed readinessProbe
    serving: true        # pod passes readiness, regardless of terminating state
    terminating: false   # pod is being deleted
  nodeName: worker-1
  targetRef:
    kind: Pod
    name: nginx-abc-xyz
    namespace: default
- addresses:
  - "10.244.2.7"
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: worker-2
  targetRef:
    kind: Pod
    name: nginx-def-uvw
    namespace: default
ports:
- name: http
  port: 80
  protocol: TCP

Terminating endpoints (graceful connection draining)

When a pod is terminating, its endpoint reports terminating: true and ready: false, but serving: true for as long as it still passes its readiness probe. kube-proxy (and Cilium) can fall back to such serving, terminating endpoints when no ready endpoints remain, enabling graceful connection draining. In kube-proxy this fallback was gated by ProxyTerminatingEndpoints (GA 1.28); the conditions themselves come from EndpointSliceTerminatingCondition (GA 1.26).
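
The fallback described above can be sketched as a two-step selection. `usable_endpoints` is a hypothetical helper working on plain dicts that mirror the three EndpointSlice conditions; it is a model of the rule, not kube-proxy's implementation.

```python
def usable_endpoints(endpoints):
    """Prefer ready endpoints; fall back to serving, terminating ones when none are ready."""
    ready = [e for e in endpoints if e["ready"]]
    if ready:
        return ready
    return [e for e in endpoints if e["serving"] and e["terminating"]]

draining = {"ip": "10.244.1.5", "ready": False, "serving": True, "terminating": True}
healthy = {"ip": "10.244.2.7", "ready": True, "serving": True, "terminating": False}

print([e["ip"] for e in usable_endpoints([draining, healthy])])  # ['10.244.2.7']
print([e["ip"] for e in usable_endpoints([draining])])           # ['10.244.1.5']
```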

Topology-Aware Routing

Topology-Aware Routing (beta since 1.23 as Topology Aware Hints; the topology-mode annotation arrived in 1.27) instructs kube-proxy to prefer endpoints in the same zone as the node the traffic originates from, reducing cross-zone traffic costs. The EndpointSlice controller populates zone hints on each endpoint.

apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    service.kubernetes.io/topology-mode: Auto   # enables topology-aware routing
spec:
  selector:
    app: api
  ports:
  - port: 80

# EndpointSlice with hints populated by controller
kubectl get endpointslice -l kubernetes.io/service-name=api -o yaml | \
  grep -A3 hints
# hints:
#   forZones:
#   - name: us-east-1a   # only route to this endpoint from zone us-east-1a
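
How a proxy might consume these hints can be sketched as a filter with safe fallbacks. This is a simplified model: `filter_by_zone` and the `for_zones` key are hypothetical names, and the fallback conditions shown (missing hints, no endpoint for the zone) are the ones assumed here.

```python
def filter_by_zone(endpoints, node_zone):
    """Keep endpoints hinted for node_zone; fall back to all endpoints when hints are unusable.

    endpoints is a list of dicts with "ip" and "for_zones" (list of zone names, may be empty).
    """
    if not node_zone or any(not e["for_zones"] for e in endpoints):
        return endpoints                 # hints incomplete: ignore them entirely
    local = [e for e in endpoints if node_zone in e["for_zones"]]
    return local or endpoints            # nothing hinted for this zone: use everything

eps = [
    {"ip": "10.244.1.5", "for_zones": ["us-east-1a"]},
    {"ip": "10.244.2.7", "for_zones": ["us-east-1b"]},
]
print([e["ip"] for e in filter_by_zone(eps, "us-east-1a")])  # ['10.244.1.5']
print([e["ip"] for e in filter_by_zone(eps, "us-east-1c")])  # ['10.244.1.5', '10.244.2.7']
```

Falling back to all endpoints keeps the Service reachable when hints are missing or stale, at the price of some cross-zone traffic.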

Full KubeProxyConfiguration

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration

# Bind and client
bindAddress: "0.0.0.0"
clientConnection:
  kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
  qps: 5
  burst: 10

# Mode: iptables | ipvs | nftables
mode: ipvs

# iptables tuning (used even in IPVS mode for masquerade)
iptables:
  masqueradeAll: false     # SNAT all service traffic (not just NodePort)
  masqueradeBit: 14        # bit 14 (0x4000) used as mark
  minSyncPeriod: 1s        # don't sync more often than this
  syncPeriod: 30s          # full sync interval

# IPVS tuning
ipvs:
  scheduler: lc            # least connections for persistent services
  syncPeriod: 30s
  minSyncPeriod: 2s
  strictARP: true
  tcpTimeout: 900s         # idle TCP connection timeout in IPVS
  tcpFinTimeout: 16s       # TCP FIN_WAIT timeout
  udpTimeout: 300s

# NodePort listen addresses
nodePortAddresses:
- "192.168.0.0/16"        # restrict NodePort binding to these CIDRs
                          # default: all interfaces

# Feature gates
featureGates:
  TopologyAwareHints: true

# Healthz and metrics
healthzBindAddress: "0.0.0.0:10256"
metricsBindAddress: "0.0.0.0:10249"

# Logging
logging:
  verbosity: 2

kube-proxy Replacement

Several CNIs offer full kube-proxy replacement by programming service routing inside eBPF, eliminating the kube-proxy DaemonSet entirely:

Replacement	Mechanism	How to Disable kube-proxy	Benefits vs kube-proxy
Cilium	eBPF BPF maps + socket-level LB	`--skip-phases=addon/kube-proxy` in kubeadm; or delete kube-proxy DaemonSet	O(1) lookup; DSR; preserved src IP; Maglev hashing; 30-100% lower latency at scale
Calico eBPF	eBPF TC hooks	Disable kube-proxy; Felix's `bpfKubeProxyIptablesCleanupEnabled` setting controls removal of leftover kube-proxy iptables rules	Same as Cilium eBPF mode; DSR capable
kube-router	IPVS + BGP	Deploy kube-router as DaemonSet; disable kube-proxy	Integrated BGP router + IPVS; no external CNI needed

# Disable kube-proxy when using Cilium kube-proxy replacement
# Option 1: skip during kubeadm init
kubeadm init --skip-phases=addon/kube-proxy

# Option 2: delete after cluster creation
kubectl -n kube-system delete daemonset kube-proxy
iptables-save | grep -v KUBE | iptables-restore   # clean up leftover rules

# Configure Cilium to replace kube-proxy
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=API_SERVER_IP \
  --set k8sServicePort=6443

# Verify
cilium status | grep KubeProxyReplacement
# KubeProxyReplacement: True  [NodePort (SNAT, 1 NumDevices), ExternalIPs, HostPort, SessionAffinity, Maglev, XDP (NATIVE)]

Key kube-proxy Metrics

Metric	Type	Alert Threshold
`kubeproxy_sync_proxy_rules_duration_seconds`	histogram	p99 > 5s (iptables mode); > 1s (IPVS/nftables)
`kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds`	gauge	age > 30s → stale rules not being applied
`kubeproxy_network_programming_duration_seconds`	histogram	p99 > 10s
`kubeproxy_sync_proxy_rules_iptables_total`	gauge	> 200,000 → consider IPVS or nftables mode
`kubeproxy_ipvs_services_total`	gauge	Informational (track growth)
`rest_client_requests_total{code=~"5.."}`	counter	> 0.1/s sustained → API server connectivity problem
`kubeproxy_sync_proxy_rules_no_local_endpoints_total`	gauge	> 0 → Services with a Local traffic policy have no endpoints on this node

Alerting Rules

groups:
- name: kube-proxy
  rules:
  - alert: KubeProxyRuleSyncSlow
    expr: histogram_quantile(0.99, rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "kube-proxy rule sync taking >5s (iptables mode scalability issue)"

  - alert: KubeProxyStaleRules
    expr: time() - kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds > 60
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "kube-proxy has not synced rules in >60s — service endpoints may be stale"

  - alert: KubeProxyTooManyRules
    expr: sum by (instance) (kubeproxy_sync_proxy_rules_iptables_total) > 150000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "iptables rule count exceeds 150k — consider switching to IPVS or nftables"

  - alert: KubeProxyAPIServerErrors
    expr: rate(rest_client_requests_total{job="kube-proxy",code=~"5.."}[5m]) > 0.1
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "kube-proxy experiencing API server 5xx errors"

Troubleshooting Runbooks

Runbook 1: Service ClusterIP not reachable from pod

# 1. Verify service exists and has endpoints
kubectl get svc nginx -o wide
kubectl get endpoints nginx    # or: kubectl get endpointslices -l kubernetes.io/service-name=nginx

# 2. Confirm kube-proxy is running on the node
NODE=$(kubectl get pod <client-pod> -o jsonpath='{.spec.nodeName}')
kubectl get pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=$NODE

# 3. Check iptables rules on the node (for iptables mode)
kubectl debug node/$NODE -it --image=ubuntu -- bash
  chroot /host                      # node filesystem is mounted at /host; use the node's own iptables
  iptables-save | grep 10.96.14.3   # ClusterIP
  # If empty → kube-proxy not syncing rules

# 4. Check IPVS virtual servers (for IPVS mode)
ipvsadm -Ln | grep -A5 "10.96.14.3"

# 5. Check kube-proxy logs
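PROXY_POD=$(kubectl get pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=$NODE -o name)
kubectl logs -n kube-system $PROXY_POD --tail=50 | grep -Ei "error|sync|failed"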

# 6. Check conntrack table (if DNAT happening but response lost)
conntrack -L | grep 10.96.14.3

# 7. Verify kube-proxy endpoints match pod IPs
kubectl get pod -l app=nginx -o wide   # compare pod IPs to KUBE-SEP rules

Runbook 2: NodePort accessible on one node but not another

# NodePort should work on ALL nodes regardless of where pods run (Cluster policy)

# 1. Test from each node
curl http://<node1-ip>:30080    # should work
curl http://<node2-ip>:30080    # should also work

# 2. Check firewall/security groups (cloud or on-prem)
# AWS: check Security Group allows port 30080 TCP inbound
# GCP: check firewall rules for node tag
# On-prem: check iptables INPUT chain on the node

# 3. Verify kube-proxy is running on the failing node
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide

# 4. Check if externalTrafficPolicy=Local (only pods on THAT node are reachable)
kubectl get svc nginx -o jsonpath='{.spec.externalTrafficPolicy}'
# If "Local", NodePort only works on nodes that have a matching pod

# 5. Check kube-proxy rule on failing node
kubectl debug node/<failing-node> -it --image=ubuntu -- bash
  chroot /host                      # node filesystem is mounted at /host; use the node's own iptables
  iptables-save | grep "30080"
  # If rule missing → kube-proxy not syncing on this node

Runbook 3: kube-proxy high CPU / slow sync (iptables mode at scale)

# Symptom: kube-proxy pod using high CPU; endpoints take minutes to update

# 1. Measure current sync duration
kubectl exec -n kube-system kube-proxy-xyz -- \
  curl -s localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_duration

# 2. Count total iptables rules
kubectl debug node/worker-1 -it --image=ubuntu -- bash
  chroot /host                      # node filesystem is mounted at /host; use the node's own iptables
  iptables-save | wc -l
  # > 100,000 lines = likely bottleneck

# 3. Count services and endpoints
kubectl get svc --all-namespaces | wc -l
kubectl get endpointslices --all-namespaces | wc -l

# 4. Switch to IPVS mode (rolling, not disruptive)
# Edit kube-proxy ConfigMap
kubectl edit cm -n kube-system kube-proxy
# Change: mode: "ipvs" (was "iptables")
# Add: ipvs.strictARP: true
# Then restart kube-proxy pods
kubectl rollout restart daemonset -n kube-system kube-proxy

# 5. Or migrate to Cilium kube-proxy replacement for maximum scale

Runbook 4: Client IP lost — source IP not preserved through LoadBalancer

# Symptom: application receives node IP instead of client IP in X-Forwarded-For or logs

# Root cause: externalTrafficPolicy: Cluster (default) SNATs traffic through node

# Solution 1: externalTrafficPolicy: Local (preserves client IP, but uneven distribution)
kubectl patch svc web -p '{"spec":{"externalTrafficPolicy":"Local"}}'
# Warning: pods only receive traffic on nodes where they're running
# Health check port needed for cloud LB health probes
kubectl get svc web -o jsonpath='{.spec.healthCheckNodePort}'

# Solution 2: Use Cilium with DSR (Direct Server Return)
# Cilium can preserve source IP even with Cluster policy via DSR
# Requires: --set loadBalancer.mode=dsr (loadBalancer.dsrDispatch, e.g. opt, selects how the client address is carried)

# Solution 3: Proxy Protocol (for L4 LBs)
# Cloud LB sends PROXY protocol header with original client IP
# Application or ingress must decode PROXY protocol header

# Verify client IP is now preserved
kubectl logs -l app=web | grep "X-Real-IP\|remote_addr"

Runbook 5: IPVS mode — connection resets after pod deletion

# Symptom: established connections RST when a pod is deleted and IPVS removes the real server

# Root cause: IPVS removes the RS immediately; conntrack entries still point to dead pod

# 1. Check IPVS connection table
ipvsadm -Lnc | grep "10.244.1.5"   # connections to deleted pod IP

# 2. Check conntrack for stale entries
conntrack -L | grep "10.244.1.5"
conntrack -D --dst 10.244.1.5       # manually remove (emergency)

# 3. Proper solution: use graceful termination
# - Pod should handle SIGTERM and drain connections
# - preStop hook gives time before endpoint removal:
# preStop:
#   exec:
#     command: ["sleep", "15"]      # 15s for LB to drain connections

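# 4. IPVS persistence (session affinity; not a substitute for graceful termination)
ipvsadm --edit-service --tcp-service 10.96.14.3:80 --persistent 300
# New connections from a client stick to the same RS for 300s
# kube-proxy owns these virtual servers and may revert manual edits on its next sync;
# for a durable setting use sessionAffinity: ClientIP on the Service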

Production Best Practices

Mode Selection

<500 services: iptables mode is fine
500–10,000 services: switch to IPVS or nftables
>10,000 services: use IPVS or Cilium kube-proxy replacement
Linux 5.13+: use nftables for better atomicity
Maximum performance: Cilium eBPF kube-proxy replacement

IPVS Best Practices

Set strictARP: true when another component announces Service IPs over ARP (e.g. MetalLB L2 mode)
Load ip_vs modules via /etc/modules-load.d/
Use lc (least-connections) for long-lived connections
Monitor ip_vs connection table size
Set appropriate TCP/UDP timeouts to prevent stale IPVS entries

Source IP Handling

Use externalTrafficPolicy: Local for apps needing client IP
Ensure pods run on every node the load balancer targets; nodes without a local pod fail the LB health check and receive no traffic
Document the SNAT behavior for security/audit teams
For internal traffic: internalTrafficPolicy: Local saves cross-node hops

Graceful Endpoint Handling

Use preStop hook + sleep to drain connections before pod stops
Set terminationGracePeriodSeconds appropriately
Monitor kubeproxy_sync_proxy_rules_no_local_endpoints_total
Test rolling deployments with active connections under load
