Kube-Proxy Internals
kube-proxy is the network rule engine that makes Kubernetes Services reachable. It watches the API server for Service and Endpoints/EndpointSlice changes and programs the local node's kernel to implement virtual IP load balancing. This page dissects every proxy mode — iptables, IPVS, and nftables — down to the individual rules, chains, and data structures, covering Service types, session affinity, connection tracking, and replacement options.
What kube-proxy Does (and Does Not Do)
A Service ClusterIP (e.g. 10.96.14.3) does not exist on any network interface. It is purely a destination address programmed into kernel packet-processing rules. kube-proxy creates those rules; the kernel rewrites matching packets to a real pod IP before they leave the host.
What kube-proxy programs
- ClusterIP → pod IP DNAT (iptables/IPVS/nftables)
- NodePort host port → ClusterIP forwarding
- LoadBalancer external IP → ClusterIP forwarding
- ExternalIP routing rules
- Session affinity (client-IP stickiness)
- Source IP masquerade (SNAT) for traffic leaving cluster
What kube-proxy does NOT do
- Pod-to-pod routing (CNI's job)
- Ingress / L7 HTTP routing (Ingress controller's job)
- Actual load balancer provisioning (CCM's job)
- DNS (CoreDNS's job)
- NetworkPolicy enforcement (CNI's job)
- TLS termination
Watch Loop
kube-proxy runs a single reflector/informer loop watching three object types:
| Object | API Group / Version | What Changes Trigger |
|---|---|---|
| Service | core/v1 | New ClusterIP; port/protocol change; type change; sessionAffinity change |
| EndpointSlice | discovery.k8s.io/v1 | Pod ready/not-ready; pod IP change; new pods; deleted pods |
| Node | core/v1 (self only) | Node address changes for NodePort binding |
kube-proxy uses EndpointSlice (GA 1.21) by default. Each EndpointSlice holds up to 100 endpoints. For a service with 1,000 pods, 10 EndpointSlices are created rather than one monolithic Endpoints object, dramatically reducing watch event churn.
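The slice arithmetic above can be checked with a small sketch. The function name `slices_needed` is hypothetical, and the result is a lower bound: it assumes every slice is packed to capacity, which the real EndpointSlice controller does not guarantee.

```python
import math


def slices_needed(endpoint_count: int, max_per_slice: int = 100) -> int:
    """Minimum number of EndpointSlices if every slice is filled to capacity."""
    if endpoint_count <= 0:
        return 0
    return math.ceil(endpoint_count / max_per_slice)


print(slices_needed(1000))  # 1,000 pods at 100 endpoints per slice
print(slices_needed(1001))  # one extra pod spills into an additional slice
```

A single pod change then touches one slice of at most 100 entries rather than one object listing all 1,000.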
Service Types Recap
| Type | ClusterIP | NodePort | External LB | Use Case |
|---|---|---|---|---|
| ClusterIP | Assigned from service CIDR | No | No | Internal communication; most services |
| NodePort | Assigned | 30000–32767 on every node | No | External access without cloud LB; bare metal |
| LoadBalancer | Assigned | Assigned | Cloud LB via CCM | Production external access on cloud |
| ExternalName | None | No | No | DNS CNAME alias; no proxying |
| Headless (clusterIP: None) | None | No | No | StatefulSets; direct pod DNS; bypasses kube-proxy |
When clusterIP: None, no ClusterIP is assigned and kube-proxy programs nothing. CoreDNS returns individual pod A records for the service DNS name. kube-proxy is irrelevant for headless services.
iptables Mode (Default on Linux)
iptables mode uses Linux netfilter's PREROUTING and OUTPUT chains to DNAT service IPs to pod IPs. The first packet of every new connection is matched against a linear chain of rules (later packets of the same connection are handled by conntrack), which is the fundamental limitation that makes iptables mode struggle above ~10,000 services.
Chain Structure
Annotated Rule Walkthrough
# Dump all kube-proxy iptables rules
iptables-save | grep -E "KUBE|kubernetes"
# ------ PREROUTING: jump to KUBE-SERVICES ------
-A PREROUTING -m comment --comment "kubernetes service portals" \
-j KUBE-SERVICES
# ------ KUBE-SERVICES: one rule per service ------
# Match destination 10.96.14.3:80 (ClusterIP of nginx service) → jump to SVC chain
-A KUBE-SERVICES -d 10.96.14.3/32 -p tcp -m comment \
--comment "default/nginx cluster IP" \
-m tcp --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O
# ------ KUBE-SVC-XXXXX: load balancing via statistic module ------
# 3 endpoints: each rule selects one with probability 1/n, 1/(n-1), 1/1
# First endpoint: probability 0.33333333349 (= 1/3)
-A KUBE-SVC-XPGD46QRK7WJZT7O -m comment --comment "default/nginx" \
-m statistic --mode random --probability 0.33333333349 \
-j KUBE-SEP-ABCDEF1234567890
# Second endpoint: probability 0.50000000000 (= 1/2 of remaining)
-A KUBE-SVC-XPGD46QRK7WJZT7O -m comment --comment "default/nginx" \
-m statistic --mode random --probability 0.50000000000 \
-j KUBE-SEP-BCDEF12345678901
# Third endpoint: probability 1.0 (catch-all for last)
-A KUBE-SVC-XPGD46QRK7WJZT7O -m comment --comment "default/nginx" \
-j KUBE-SEP-CDEF123456789012
# ------ KUBE-SEP-XXXXX: DNAT to actual pod IP ------
# Mark for masquerade if source is the pod itself (hairpin)
-A KUBE-SEP-ABCDEF1234567890 -s 10.244.1.5/32 -m comment \
--comment "default/nginx" -j KUBE-MARK-MASQ
# DNAT: rewrite dst IP:port to pod IP:port
-A KUBE-SEP-ABCDEF1234567890 -p tcp -m comment \
--comment "default/nginx" -m tcp \
-j DNAT --to-destination 10.244.1.5:80
# ------ KUBE-POSTROUTING: masquerade marked packets ------
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" \
-j KUBE-POSTROUTING
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --xor-mark 0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" \
-j MASQUERADE --random-fully
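The probabilities in the KUBE-SVC chain look uneven (1/3, then 1/2, then 1), yet each endpoint ends up equally likely because a rule is only evaluated when all earlier rules did not match. The sketch below works this out with exact fractions; the function names are illustrative and not part of kube-proxy.

```python
from fractions import Fraction


def cascade_probabilities(n: int) -> list[Fraction]:
    """Per-rule match probabilities as written in the chain: 1/n, 1/(n-1), ..., 1/1."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_probabilities(n: int) -> list[Fraction]:
    """Overall chance that each endpoint is selected for a new connection."""
    remaining = Fraction(1)  # probability that no earlier rule has matched
    result = []
    for p in cascade_probabilities(n):
        result.append(remaining * p)
        remaining *= 1 - p
    return result


for i, (rule_p, eff_p) in enumerate(
    zip(cascade_probabilities(3), effective_probabilities(3)), start=1
):
    print(f"endpoint {i}: rule probability {float(rule_p):.5f}, effective {float(eff_p):.5f}")
```

For three endpoints every effective probability is exactly 1/3, which is why the last rule needs no statistic match at all.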
NodePort Rules
# NodePort rule lives in KUBE-NODEPORTS chain (called from KUBE-SERVICES)
# Matches any destination port 30080 on any interface
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports" \
-m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx nodePort" \
-m tcp --dport 30080 -j KUBE-EXT-XPGD46QRK7WJZT7O
# KUBE-EXT chain: decides masquerade for external traffic
-A KUBE-EXT-XPGD46QRK7WJZT7O -m comment \
--comment "masquerade traffic for default/nginx external destinations" \
-j KUBE-MARK-MASQ
-A KUBE-EXT-XPGD46QRK7WJZT7O -j KUBE-SVC-XPGD46QRK7WJZT7O
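KUBE-MARK-MASQ and KUBE-POSTROUTING communicate through a single bit of the packet mark (0x4000 in the rules above). As a rough model, assuming the iptables `mark` match tests `(mark & mask) == value`, the decision can be sketched as follows; the function names are hypothetical.

```python
MASQUERADE_BIT = 14
MASQ_MARK = 1 << MASQUERADE_BIT  # 0x4000


def mark_matches(packet_mark: int, value: int, mask: int) -> bool:
    """Model of '-m mark --mark value/mask': compare only the masked bits."""
    return (packet_mark & mask) == value


def kube_mark_masq(packet_mark: int) -> int:
    """Model of KUBE-MARK-MASQ: set the masquerade bit, leave other bits alone."""
    return packet_mark | MASQ_MARK


def needs_masquerade(packet_mark: int) -> bool:
    """Model of KUBE-POSTROUTING: SNAT only packets that carry the bit."""
    return mark_matches(packet_mark, MASQ_MARK, MASQ_MARK)


print(hex(MASQ_MARK))
print(needs_masquerade(0x0000), needs_masquerade(kube_mark_masq(0x0000)))
```

Because only one bit is tested, marks set by other components (for example a CNI plugin using a different bit) do not change the outcome.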
Session Affinity
apiVersion: v1
kind: Service
metadata:
  name: sticky-service
spec:
  selector:
    app: api
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # 3 hours; max 86400 (1 day)
  ports:
  - port: 80
    targetPort: 8080
# Session affinity adds a 'recent' module match before the statistic rules
# First connection → choose endpoint normally; record client IP in the 'recent' table
# Later connections from the same client IP → jump directly to the chosen SEP
-A KUBE-SVC-XXXXX -m comment --comment "default/sticky-service" \
-m recent --name KUBE-SEP-ABCDEF1234567890 --rcheck --seconds 10800 \
--reap -j KUBE-SEP-ABCDEF1234567890
# DNAT with a port needs an explicit protocol match; targetPort is 8080
-A KUBE-SEP-ABCDEF1234567890 -p tcp -m comment --comment "default/sticky-service" \
-m recent --name KUBE-SEP-ABCDEF1234567890 --set -m tcp \
-j DNAT --to-destination 10.244.1.5:8080
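The interplay of `--rcheck` (look up the client) and `--set` (refresh its timestamp) can be modeled in a few lines. This is a toy sketch, not kube-proxy code: the class name `RecentAffinity` is hypothetical, time is passed in explicitly, and endpoint removal is ignored.

```python
import random


class RecentAffinity:
    """Toy model of ClientIP session affinity with an idle timeout."""

    def __init__(self, endpoints, timeout_seconds=10800, rng=None):
        self.endpoints = list(endpoints)
        self.timeout = timeout_seconds
        self.rng = rng or random.Random()
        self.table = {}  # client IP -> (endpoint, last-seen timestamp)

    def pick(self, client_ip, now):
        """Choose the endpoint for a new connection arriving at time `now`."""
        entry = self.table.get(client_ip)
        if entry is not None and now - entry[1] <= self.timeout:
            endpoint = entry[0]  # like --rcheck: client seen recently, stay sticky
        else:
            endpoint = self.rng.choice(self.endpoints)  # miss or expired entry
        self.table[client_ip] = (endpoint, now)  # like --set: refresh timestamp
        return endpoint


affinity = RecentAffinity(["10.244.1.5", "10.244.2.7", "10.244.3.9"], timeout_seconds=10800)
first = affinity.pick("203.0.113.7", now=0)
print(first == affinity.pick("203.0.113.7", now=3600))
```

Each new connection refreshes the timestamp, so the timeout behaves as an idle timeout rather than an absolute lifetime.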
iptables Scaling Problem
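The first packet of each new connection walks the KUBE-SERVICES chain linearly until a rule matches. With 10,000 services × 10 endpoints, a node holds 10,000 KUBE-SERVICES rules plus 100,000 KUBE-SEP chains, so a connection to a service listed late in the chain may be compared against thousands of rules before it is DNATed. iptables-restore time also grows super-linearly: a full ruleset restore of 50,000 rules can take 10+ seconds, during which endpoint changes are not yet in effect on the node.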
| Scale | Services | Total Rules (approx) | iptables-restore time | Rule lookup latency |
|---|---|---|---|---|
| Small | 100 | ~2,000 | < 0.1s | Negligible |
| Medium | 1,000 | ~20,000 | ~1s | Low |
| Large | 10,000 | ~200,000 | ~60s | Measurable (ms) |
| Very large | 50,000 | ~1,000,000 | Minutes | Significant |
IPVS Mode
IPVS (IP Virtual Server) is a Linux kernel module originally designed for load balancer appliances. kube-proxy IPVS mode programs kernel IPVS virtual servers instead of iptables chains. Lookups use hash tables: O(1) per packet regardless of service count.
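The difference between a linear rule walk and a hash lookup can be made concrete with a toy comparison. The service addresses and names below are generated placeholders, and Python's `dict` merely stands in for the kernel's hash table.

```python
def linear_lookup(rules, key):
    """Scan (key, target) rules in order; return the target and comparisons made."""
    for comparisons, (rule_key, target) in enumerate(rules, start=1):
        if rule_key == key:
            return target, comparisons
    return None, len(rules)


def hashed_lookup(table, key):
    """Single hash probe whose cost does not depend on table size."""
    return table.get(key)


# Hypothetical cluster: 10,000 virtual IPs, all on TCP port 80.
rules = [((f"10.96.{i // 256}.{i % 256}", "tcp", 80), f"svc-{i}") for i in range(10_000)]
table = dict(rules)

last_key = rules[-1][0]
print(linear_lookup(rules, last_key))   # ('svc-9999', 10000)
print(hashed_lookup(table, last_key))   # svc-9999
```

The worst case for the linear walk is the last rule (or no match at all), which is exactly the case that gets slower as services are added.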
Enabling IPVS
# KubeProxyConfiguration (kube-proxy ConfigMap)
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: rr # rr | wrr | lc | dh | sh | sed | nq
  syncPeriod: 30s
  minSyncPeriod: 2s
  strictARP: true # sets arp_ignore=1 / arp_announce=2; needed when an L2 announcer (e.g. MetalLB) answers ARP for Service IPs
  tcpTimeout: 900s
  tcpFinTimeout: 16s
  udpTimeout: 300s
iptables:
  masqueradeAll: false
  masqueradeBit: 14 # mark bit 0x4000 for masquerade
  minSyncPeriod: 1s
  syncPeriod: 30s
IPVS mode requires the ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh, and nf_conntrack kernel modules. Load them with: modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack (without -a, modprobe treats the extra names as parameters of the first module) and persist them in /etc/modules-load.d/ipvs.conf.
IPVS Virtual Server Model
# IPVS creates a 'virtual server' (VS) per service port
# and 'real servers' (RS) per endpoint
# Install ipvsadm
apt-get install ipvsadm # Ubuntu
# List all virtual servers
ipvsadm -Ln
# Example output for nginx service (ClusterIP 10.96.14.3:80, 3 pods)
# IP Virtual Server version 1.2.1 (size=4096)
# Prot LocalAddress:Port Scheduler Flags
# -> RemoteAddress:Port Forward Weight ActiveConn InActConn
# TCP 10.96.14.3:80 rr
# -> 10.244.1.5:80 Masq 1 5 0
# -> 10.244.2.7:80 Masq 1 3 0
# -> 10.244.3.9:80 Masq 1 4 0
# NodePort creates additional virtual server on node IP
# TCP 192.168.1.10:30080 rr
# -> 10.244.1.5:80 Masq 1 0 0
# -> 10.244.2.7:80 Masq 1 0 0
# -> 10.244.3.9:80 Masq 1 0 0
# kube-proxy also creates a dummy interface 'kube-ipvs0'
# and assigns ALL ClusterIPs to it (so kernel accepts the packets)
ip addr show kube-ipvs0
# inet 10.96.14.3/32 scope host kube-ipvs0
# inet 10.96.0.1/32 scope host kube-ipvs0 (kubernetes API server SVC)
# ... one address per Service ClusterIP
IPVS Scheduling Algorithms
| Scheduler | Algorithm | Use Case |
|---|---|---|
| rr | Round Robin | Default; equal distribution; stateless workloads |
| lc | Least Connections | Long-lived connections; databases; minimize hot pods |
| dh | Destination Hashing | Cache affinity; same destination always hits same backend |
| sh | Source Hashing | Client-IP affinity (like sessionAffinity: ClientIP) |
| sed | Shortest Expected Delay | Picks the server with the lowest (active connections + 1) / weight |
| nq | Never Queue | SED variant; sends to idle server first |
| wrr | Weighted Round Robin | Heterogeneous pods with different capacities |
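To make the table concrete, here is a simplified sketch of four of the schedulers. These are illustrative models only: the real kernel implementations also account for inactive connections and use their own hash functions, and CRC32 is used here purely as a stand-in hash.

```python
import itertools
import zlib


def round_robin(servers):
    """rr: hand out servers in a fixed rotating order."""
    return itertools.cycle(servers)


def least_connections(active):
    """lc: pick the server with the fewest active connections."""
    return min(active, key=active.get)


def shortest_expected_delay(active, weight):
    """sed: pick the server with the lowest (active + 1) / weight."""
    return min(active, key=lambda s: (active[s] + 1) / weight[s])


def source_hash(servers, client_ip):
    """sh: the same client IP maps to the same server while the set is unchanged."""
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]


servers = ["10.244.1.5", "10.244.2.7", "10.244.3.9"]
active = {"10.244.1.5": 5, "10.244.2.7": 3, "10.244.3.9": 4}  # ActiveConn column above

rr = round_robin(servers)
print([next(rr) for _ in range(4)])
print(least_connections(active))
print(shortest_expected_delay(active, {"10.244.1.5": 4, "10.244.2.7": 1, "10.244.3.9": 1}))
```

With equal weights, sed and lc agree; giving the first pod a weight of 4 makes sed prefer it even though it already has the most connections.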
IPVS + iptables Coexistence
Even in IPVS mode, kube-proxy still uses iptables for scenarios IPVS cannot handle natively:
# iptables rules still present in IPVS mode:
iptables-save | grep KUBE | grep -v "KUBE-SVC\|KUBE-SEP"
# 1. KUBE-MARK-MASQ — mark packets needing SNAT
# 2. KUBE-POSTROUTING — MASQUERADE marked packets
# 3. KUBE-FIREWALL — drop packets with invalid marks
# 4. KUBE-FORWARD — allow forwarding for established connections
# 5. NodePort masquerade (KUBE-NODEPORTS still uses iptables for source IP mark)
# strictARP=true sets arp_ignore=1 and arp_announce=2 so the node stops answering ARP
# for the Service IPs bound to kube-ipvs0; needed when another component (e.g. MetalLB L2) announces them
cat /proc/sys/net/ipv4/conf/all/arp_ignore # should be 1
cat /proc/sys/net/ipv4/conf/all/arp_announce # should be 2
nftables Mode (GA 1.33)
nftables is the successor to iptables in the Linux kernel, using a bytecode VM instead of linear rule matching. kube-proxy gained nftables mode as alpha in 1.29, beta in 1.31, and GA in 1.33. It offers the same semantics as iptables mode but with better performance characteristics and atomic ruleset updates.
nftables advantages over iptables
- Atomic ruleset updates (entire table replaced atomically)
- Native set/map data structures (O(1) lookup for IPs)
- No iptables-restore latency spikes during updates
- Cleaner, more readable rule syntax
- Better integration with modern kernels (5.2+)
- Single framework for IPv4, IPv6, ARP, bridge rules
nftables limitations
- Requires Linux kernel 5.13+ (full feature set)
- Cannot coexist with iptables mode (same hook points)
- Less tooling familiarity than iptables
- Some distributions ship older nftables versions
# Enable nftables mode
# In KubeProxyConfiguration:
# mode: nftables
# Inspect nftables rules created by kube-proxy
# (kube-proxy uses a table named kube-proxy in the ip and ip6 families)
nft list ruleset | grep -A 30 "table ip kube-proxy"
# kube-proxy creates sets for service VIPs
nft list set ip kube-proxy cluster-ips
# { 10.96.14.3, 10.96.0.1, ... }
# kube-proxy creates maps for service→endpoint selection
nft list map ip kube-proxy service-ips
# { 10.96.14.3 . tcp . 80 : goto chain-svc-nginx, ... }
# nftables map lookup replaces the linear iptables chain walk
# O(1) per packet vs O(n) in iptables mode at same service count
Proxy Mode Comparison
| Dimension | iptables | IPVS | nftables | eBPF (Cilium/Calico) |
|---|---|---|---|---|
| Lookup complexity | O(n) linear chain | O(1) hash table | O(1) set/map | O(1) BPF map |
| Rule update model | Full flush + restore | Incremental | Atomic table replace | Incremental map update |
| Update latency at 10k services | ~10-60s flush | <1s | <1s atomic | <100ms |
| Load balancing algorithms | Random probability | 7 algorithms (rr/lc/sh…) | Random probability | Maglev hash (Cilium) |
| Session affinity | recent module | sh scheduler or persistence | nft timeout maps | BPF affinity map |
| Source IP preservation | No (SNAT) | No (SNAT); DSR possible | No (SNAT) | Yes (DSR, no SNAT) |
| kube-proxy required | Yes | Yes | Yes | No (replacement) |
| Kernel requirement | Any | ip_vs module | 5.13+ | 4.9.17+ (5.10+ for full) |
| Windows support | No | No | No | No |
| Stability | GA (stable) | GA (stable) | GA (1.33) | CNI-dependent |
End-to-End Traffic Flows
ClusterIP: Pod-to-Service Packet Trace
| Step | Location | Action | Packet state |
|---|---|---|---|
| 1 | Pod (app code) | connect("nginx.default.svc.cluster.local", 80) | DNS lookup → 10.96.14.3 |
| 2 | Pod kernel | SYN packet sent; dst=10.96.14.3:80 | src=10.244.1.5, dst=10.96.14.3:80 |
| 3 | Pod veth → host veth | Packet enters host network namespace | Unchanged |
| 4 | netfilter PREROUTING | KUBE-SERVICES rule matches 10.96.14.3:80 → KUBE-SVC chain | Unchanged |
| 5 | KUBE-SVC chain | statistic module selects pod-2; jumps to KUBE-SEP-BBB | Unchanged |
| 6 | KUBE-SEP-BBB | DNAT: dst rewritten to 10.244.2.7:80 | src=10.244.1.5, dst=10.244.2.7:80 |
| 7 | conntrack | Entry recorded: 10.244.1.5→10.96.14.3:80 ↔ 10.244.1.5→10.244.2.7:80 | Conntrack entry created |
| 8 | Routing decision | Route lookup: 10.244.2.7 → via bridge or tunnel | Forwarded normally via CNI |
| 9 | Pod-2 receives | SYN arrives at pod-2:80 | src=10.244.1.5, dst=10.244.2.7:80 |
| 10 | Return path | SYN-ACK from pod-2; conntrack reverse-translates src back to 10.96.14.3:80 | Leaves pod-2 as src=10.244.2.7, dst=10.244.1.5; delivered to the client as src=10.96.14.3, dst=10.244.1.5 |
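Steps 6, 7, and 10 of the trace hinge on conntrack remembering the DNAT decision so that the reply can be rewritten. A toy model of that bookkeeping follows; the `Conntrack` class and the client source port 40000 are assumptions made for illustration, not kernel data structures.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int


class Conntrack:
    """Toy NAT table: remembers DNAT decisions so replies can be un-NATed."""

    def __init__(self):
        self.reply_map = {}  # expected reply 4-tuple -> original destination

    def dnat(self, pkt, new_dst, new_dport):
        """Rewrite the destination and record how to undo it for replies."""
        translated = pkt._replace(dst=new_dst, dport=new_dport)
        reply_key = (new_dst, new_dport, pkt.src, pkt.sport)  # endpoints swapped
        self.reply_map[reply_key] = (pkt.dst, pkt.dport)
        return translated

    def reverse(self, reply):
        """Restore the Service address as the source of a tracked reply."""
        orig = self.reply_map.get((reply.src, reply.sport, reply.dst, reply.dport))
        if orig is None:
            return reply  # untracked traffic is left unchanged
        return reply._replace(src=orig[0], sport=orig[1])


ct = Conntrack()
syn = Packet("10.244.1.5", 40000, "10.96.14.3", 80)
forwarded = ct.dnat(syn, "10.244.2.7", 80)
syn_ack = Packet("10.244.2.7", 80, "10.244.1.5", 40000)
returned = ct.reverse(syn_ack)
print(forwarded)
print(returned)
```

The client only ever sees 10.96.14.3 as its peer, which is why the application never learns which pod served it.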
NodePort: External-to-Service Packet Trace
| Step | Location | Action |
|---|---|---|
| 1 | External client | TCP SYN to node IP 192.168.1.10:30080 |
| 2 | Node PREROUTING | KUBE-SERVICES → dst-type LOCAL → KUBE-NODEPORTS |
| 3 | KUBE-NODEPORTS | dport 30080 → KUBE-EXT chain |
| 4 | KUBE-EXT | KUBE-MARK-MASQ (mark 0x4000) + jump to KUBE-SVC |
| 5 | KUBE-SVC | Select endpoint → DNAT to pod IP (e.g. 10.244.3.9:80) |
| 6 | POSTROUTING | Mark 0x4000 → MASQUERADE: src rewritten to node IP |
| 7 | Pod-3 receives | src=192.168.1.10 (node IP), dst=10.244.3.9:80 — original client IP lost! |
The default NodePort flow SNATs traffic through the node, losing the original client IP. Use externalTrafficPolicy: Local to skip the masquerade and preserve client IP — but this means only pods on the same node as the receiving NodePort can be selected (pods on other nodes are unreachable for that NodePort).
externalTrafficPolicy
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local # Cluster (default) | Local
  # Local: only route to pods on the receiving node; preserve client IP
  # Cluster: SNAT through any node; lose client IP but full pod distribution
  # internalTrafficPolicy controls ClusterIP traffic routing
  internalTrafficPolicy: Local # Cluster (default) | Local
  # Local: only route to pods on same node as client (topology-aware)
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
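Both traffic policies reduce to the same filtering step on each node: with Local, only endpoints on the node itself are candidates. The sketch below models that step with plain dictionaries; the field names (`ip`, `node`, `ready`) and the function name are hypothetical.

```python
def eligible_endpoints(endpoints, this_node, policy="Cluster"):
    """Return the endpoint IPs a node may choose from under a traffic policy."""
    ready = [e for e in endpoints if e["ready"]]
    if policy == "Local":
        return [e["ip"] for e in ready if e["node"] == this_node]
    return [e["ip"] for e in ready]


endpoints = [
    {"ip": "10.244.1.5", "node": "worker-1", "ready": True},
    {"ip": "10.244.2.7", "node": "worker-2", "ready": True},
]

print(eligible_endpoints(endpoints, "worker-1", "Cluster"))
print(eligible_endpoints(endpoints, "worker-1", "Local"))
print(eligible_endpoints(endpoints, "worker-3", "Local"))
```

The empty result for worker-3 is the case to plan for: with Local, a node without a matching pod has nothing to forward to, so load balancer health checks must steer traffic away from it.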
EndpointSlices
EndpointSlices (GA 1.21) replaced the monolithic Endpoints object. Each EndpointSlice holds up to maxEndpointsPerSlice (default 100) endpoints. kube-proxy watches EndpointSlices exclusively since 1.22.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: nginx-abc12
  namespace: default
  labels:
    kubernetes.io/service-name: nginx # links to Service
addressType: IPv4 # IPv4 | IPv6 | FQDN
endpoints:
- addresses:
  - "10.244.1.5"
  conditions:
    ready: true # pod passed readinessProbe
    serving: true # like ready, but stays true while a still-healthy pod is terminating
    terminating: false # pod is being deleted
  nodeName: worker-1
  targetRef:
    kind: Pod
    name: nginx-abc-xyz
    namespace: default
- addresses:
  - "10.244.2.7"
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: worker-2
  targetRef:
    kind: Pod
    name: nginx-def-uvw
    namespace: default
ports:
- name: http
  port: 80
  protocol: TCP
When a pod is terminating, its endpoint reports terminating: true and ready: false, while serving stays true as long as the pod still passes its readiness probe. kube-proxy (and Cilium) can fall back to serving, terminating endpoints when no ready endpoints remain, enabling graceful connection draining. The conditions come from the EndpointSliceTerminatingCondition feature gate (GA 1.26); the kube-proxy fallback is governed by ProxyTerminatingEndpoints (GA 1.28).
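The fallback described above amounts to a two-tier selection: prefer ready endpoints, and only when there are none, use endpoints that are terminating but still serving. A minimal sketch, with hypothetical dictionary fields mirroring the EndpointSlice conditions:

```python
def usable_endpoints(endpoints):
    """Prefer ready endpoints; otherwise fall back to serving+terminating ones."""
    ready = [e for e in endpoints if e["ready"]]
    if ready:
        return ready
    return [e for e in endpoints if e["serving"] and e["terminating"]]


draining = {"ip": "10.244.1.5", "ready": False, "serving": True, "terminating": True}
healthy = {"ip": "10.244.2.7", "ready": True, "serving": True, "terminating": False}
failed = {"ip": "10.244.3.9", "ready": False, "serving": False, "terminating": False}

print([e["ip"] for e in usable_endpoints([draining, healthy, failed])])
print([e["ip"] for e in usable_endpoints([draining, failed])])
```

During a rolling update where the last old pod is shutting down before a new one is ready, the second case keeps traffic flowing instead of dropping it.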
Topology-Aware Routing
Topology-Aware Routing (called Topology Aware Hints before 1.27) instructs kube-proxy to prefer endpoints in the same zone as the node where the traffic originates, reducing cross-zone traffic costs. The EndpointSlice controller populates zone hints on each endpoint.
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    service.kubernetes.io/topology-mode: Auto # enables topology-aware routing
spec:
  selector:
    app: api
  ports:
  - port: 80
# EndpointSlice with hints populated by controller
kubectl get endpointslice -l kubernetes.io/service-name=api -o yaml | \
grep -A3 hints
# hints:
#   forZones:
#   - name: us-east-1a # only route to this endpoint from zone us-east-1a
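How a node might consume these hints can be sketched as a filter with two safety fallbacks: ignore hints if any endpoint lacks them, and use all endpoints if none is hinted for the local zone. This mirrors the documented safeguards only loosely; the `for_zones` field and the function name are assumptions for illustration.

```python
def endpoints_for_zone(endpoints, node_zone):
    """Filter endpoints by zone hints, falling back to all when hints are unusable."""
    if any(not e.get("for_zones") for e in endpoints):
        return endpoints  # at least one endpoint has no hint: ignore hints entirely
    matching = [e for e in endpoints if node_zone in e["for_zones"]]
    return matching or endpoints  # nothing hinted for this zone: use everything


hinted = [
    {"ip": "10.244.1.5", "for_zones": ["us-east-1a"]},
    {"ip": "10.244.2.7", "for_zones": ["us-east-1b"]},
]

print([e["ip"] for e in endpoints_for_zone(hinted, "us-east-1a")])
print([e["ip"] for e in endpoints_for_zone(hinted, "us-east-1c")])
```

The fallbacks trade cost savings for availability: a zone with no local endpoints still reaches the Service, just across zones.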
Full KubeProxyConfiguration
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Bind and client
bindAddress: "0.0.0.0"
clientConnection:
  kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
  qps: 5
  burst: 10
# Mode: iptables | ipvs | nftables
mode: ipvs
# iptables tuning (used even in IPVS mode for masquerade)
iptables:
  masqueradeAll: false # SNAT all service traffic (not just NodePort)
  masqueradeBit: 14 # bit 14 (0x4000) used as mark
  minSyncPeriod: 1s # don't sync more often than this
  syncPeriod: 30s # full sync interval
# IPVS tuning
ipvs:
  scheduler: lc # least connections for persistent services
  syncPeriod: 30s
  minSyncPeriod: 2s
  strictARP: true
  tcpTimeout: 900s # idle TCP connection timeout in IPVS
  tcpFinTimeout: 16s # TCP FIN_WAIT timeout
  udpTimeout: 300s
# NodePort bind addresses
nodePortAddresses:
- "192.168.0.0/16" # restrict NodePort binding to these CIDRs
# default: all interfaces
# Feature gates
featureGates:
  TopologyAwareHints: true
# Healthz and metrics
healthzBindAddress: "0.0.0.0:10256"
metricsBindAddress: "0.0.0.0:10249"
# Logging
logging:
  verbosity: 2
kube-proxy Replacement
Several CNIs offer full kube-proxy replacement by programming service routing inside eBPF, eliminating the kube-proxy DaemonSet entirely:
| Replacement | Mechanism | How to Disable kube-proxy | Benefits vs kube-proxy |
|---|---|---|---|
| Cilium | eBPF BPF maps + socket-level LB | --skip-phases=addon/kube-proxy in kubeadm; or delete kube-proxy DaemonSet | O(1) lookup; DSR; preserved src IP; Maglev hashing; lower latency at scale |
| Calico eBPF | eBPF TC hooks | Disable kube-proxy + leave Felix's bpfKubeProxyIptablesCleanupEnabled at true (env FELIX_BPFKUBEPROXYIPTABLESCLEANUPENABLED) | Same as Cilium eBPF mode; DSR capable |
| kube-router | IPVS + BGP | Deploy kube-router as DaemonSet; disable kube-proxy | Integrated BGP router + IPVS; no external CNI needed |
# Disable kube-proxy when using Cilium kube-proxy replacement
# Option 1: skip during kubeadm init
kubeadm init --skip-phases=addon/kube-proxy
# Option 2: delete after cluster creation
kubectl -n kube-system delete daemonset kube-proxy
iptables-save | grep -v KUBE | iptables-restore # clean up leftover rules
# Configure Cilium to replace kube-proxy
helm upgrade cilium cilium/cilium \
--namespace kube-system \
--set kubeProxyReplacement=true \
--set k8sServiceHost=API_SERVER_IP \
--set k8sServicePort=6443
# Verify
cilium status | grep KubeProxyReplacement
# KubeProxyReplacement: True [NodePort (SNAT, 1 NumDevices), ExternalIPs, HostPort, SessionAffinity, Maglev, XDP (NATIVE)]
Key kube-proxy Metrics
| Metric | Type | Alert Threshold |
|---|---|---|
| kubeproxy_sync_proxy_rules_duration_seconds | histogram | p99 > 5s (iptables mode); > 1s (IPVS/nftables) |
| kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds | gauge | age > 30s → stale rules not being applied |
| kubeproxy_network_programming_duration_seconds | histogram | p99 > 10s |
| kubeproxy_iptables_rules_total | gauge | > 200,000 → consider IPVS or nftables mode |
| kubeproxy_ipvs_services_total | gauge | Informational (track growth) |
| rest_client_requests_total{code=~"5.."} | counter | > 0.1/s sustained → API server connectivity problem |
| kubeproxy_sync_proxy_rules_no_local_endpoints_total | counter | Rate > 0 with externalTrafficPolicy=Local → endpoints missing on node |
Alerting Rules
groups:
- name: kube-proxy
  rules:
  - alert: KubeProxyRuleSyncSlow
    expr: histogram_quantile(0.99, rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "kube-proxy rule sync taking >5s (iptables mode scalability issue)"
  - alert: KubeProxyStaleRules
    expr: time() - kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds > 60
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "kube-proxy has not synced rules in >60s; service endpoints may be stale"
  - alert: KubeProxyTooManyRules
    expr: kubeproxy_iptables_rules_total > 150000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "iptables rule count exceeds 150k; consider switching to IPVS or nftables"
  - alert: KubeProxyAPIServerErrors
    expr: rate(rest_client_requests_total{job="kube-proxy",code=~"5.."}[5m]) > 0.1
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "kube-proxy experiencing API server 5xx errors"
Troubleshooting Runbooks
Runbook 1: Service ClusterIP not reachable from pod
# 1. Verify service exists and has endpoints
kubectl get svc nginx -o wide
kubectl get endpoints nginx # or: kubectl get endpointslices -l kubernetes.io/service-name=nginx
# 2. Confirm kube-proxy is running on the node
NODE=$(kubectl get pod <client-pod> -o jsonpath='{.spec.nodeName}')
kubectl get pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=$NODE
# 3. Check iptables rules on the node (for iptables mode)
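# The ubuntu image ships without iptables: chroot into the node's filesystem (mounted at /host)
# and use a privileged profile (available in recent kubectl) so the host's own tools can read the rules
kubectl debug node/$NODE -it --image=ubuntu --profile=sysadmin -- chroot /host bash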
iptables-save | grep 10.96.14.3 # ClusterIP
# If empty → kube-proxy not syncing rules
# 4. Check IPVS virtual servers (for IPVS mode)
ipvsadm -Ln | grep -A5 "10.96.14.3"
# 5. Check kube-proxy logs
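# DaemonSet pod names carry a random suffix, so resolve the pod running on $NODE first
POD=$(kubectl get pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=$NODE -o name)
kubectl logs -n kube-system $POD --tail=50 | grep -E "ERROR|sync|failed"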
# 6. Check conntrack table (if DNAT happening but response lost)
conntrack -L | grep 10.96.14.3
# 7. Verify kube-proxy endpoints match pod IPs
kubectl get pod -l app=nginx -o wide # compare pod IPs to KUBE-SEP rules
Runbook 2: NodePort accessible on one node but not another
# NodePort should work on ALL nodes regardless of where pods run (Cluster policy)
# 1. Test from each node
curl http://<node1-ip>:30080 # should work
curl http://<node2-ip>:30080 # should also work
# 2. Check firewall/security groups (cloud or on-prem)
# AWS: check Security Group allows port 30080 TCP inbound
# GCP: check firewall rules for node tag
# On-prem: check iptables INPUT chain on the node
# 3. Verify kube-proxy is running on the failing node
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
# 4. Check if externalTrafficPolicy=Local (only pods on THAT node are reachable)
kubectl get svc nginx -o jsonpath='{.spec.externalTrafficPolicy}'
# If "Local", NodePort only works on nodes that have a matching pod
# 5. Check kube-proxy rule on failing node
kubectl debug node/<failing-node> -it --image=ubuntu --profile=sysadmin -- chroot /host bash # use the node's own iptables binaries
iptables-save | grep "30080"
# If rule missing → kube-proxy not syncing on this node
Runbook 3: kube-proxy high CPU / slow sync (iptables mode at scale)
# Symptom: kube-proxy pod using high CPU; endpoints take minutes to update
# 1. Measure current sync duration
kubectl exec -n kube-system kube-proxy-xyz -- \
curl -s localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_duration
# 2. Count total iptables rules
kubectl debug node/worker-1 -it --image=ubuntu --profile=sysadmin -- chroot /host bash # use the node's own iptables binaries
iptables-save | wc -l
# > 100,000 lines = likely bottleneck
# 3. Count services and endpoints
kubectl get svc --all-namespaces --no-headers | wc -l
kubectl get endpointslices --all-namespaces --no-headers | wc -l
# 4. Switch to IPVS mode (rolling, not disruptive)
# Edit kube-proxy ConfigMap
kubectl edit cm -n kube-system kube-proxy
# Change: mode: "ipvs" (was "iptables")
# Add: ipvs.strictARP: true
# Then restart kube-proxy pods
kubectl rollout restart daemonset -n kube-system kube-proxy
# 5. Or migrate to Cilium kube-proxy replacement for maximum scale
Runbook 4: Client IP lost — source IP not preserved through LoadBalancer
# Symptom: application receives node IP instead of client IP in X-Forwarded-For or logs
# Root cause: externalTrafficPolicy: Cluster (default) SNATs traffic through node
# Solution 1: externalTrafficPolicy: Local (preserves client IP, but uneven distribution)
kubectl patch svc web -p '{"spec":{"externalTrafficPolicy":"Local"}}'
# Warning: pods only receive traffic on nodes where they're running
# Health check port needed for cloud LB health probes
kubectl get svc web -o jsonpath='{.spec.healthCheckNodePort}'
# Solution 2: Use Cilium with DSR (Direct Server Return)
# Cilium can preserve source IP even with Cluster policy via DSR
# Requires: --set loadBalancer.mode=dsr (loadBalancer.dsrDispatch=opt selects IP-option dispatch)
# Solution 3: Proxy Protocol (for L4 LBs)
# Cloud LB sends PROXY protocol header with original client IP
# Application or ingress must decode PROXY protocol header
# Verify client IP is now preserved
kubectl logs -l app=web | grep "X-Real-IP\|remote_addr"
Runbook 5: IPVS mode — connection resets after pod deletion
# Symptom: established connections RST when a pod is deleted and IPVS removes the real server
# Root cause: IPVS removes the RS immediately; conntrack entries still point to dead pod
# 1. Check IPVS connection table
ipvsadm -Lnc | grep "10.244.1.5" # connections to deleted pod IP
# 2. Check conntrack for stale entries
conntrack -L | grep "10.244.1.5"
conntrack -D --reply-src 10.244.1.5 # manually remove entries DNATed to the dead pod (emergency)
# 3. Proper solution: use graceful termination
# - Pod should handle SIGTERM and drain connections
# - preStop hook gives time before endpoint removal:
# preStop:
# exec:
# command: ["sleep", "15"] # 15s for LB to drain connections
# 4. IPVS persistence (alternative to graceful termination for session affinity)
ipvsadm --edit-service --tcp-service 10.96.14.3:80 --persistent 300
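# New connections from the same client keep going to the same RS for 300s
# Note: kube-proxy reconciles IPVS state on its next sync and may revert manual edits;
# sessionAffinity: ClientIP on the Service is the durable way to get persistence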
Production Best Practices
Mode Selection
- <500 services: iptables mode is fine
- 500–10,000 services: switch to IPVS or nftables
- >10,000 services: use IPVS or Cilium kube-proxy replacement
- Linux 5.13+: use nftables for better atomicity
- Maximum performance: Cilium eBPF kube-proxy replacement
IPVS Best Practices
- Always set strictARP: true (ARP proxy conflicts)
- Load ip_vs modules via /etc/modules-load.d/
- Use lc (least-connections) for long-lived connections
- Monitor ip_vs connection table size
- Set appropriate TCP/UDP timeouts to prevent stale IPVS entries
Source IP Handling
- Use externalTrafficPolicy: Local for apps needing client IP
- Ensure pods exist on all nodes or LB health checks will fail
- Document the SNAT behavior for security/audit teams
- For internal traffic: internalTrafficPolicy: Local saves cross-node hops
Graceful Endpoint Handling
- Use preStop hook + sleep to drain connections before pod stops
- Set terminationGracePeriodSeconds appropriately
- Monitor kubeproxy_sync_proxy_rules_no_local_endpoints_total
- Test rolling deployments with active connections under load