Kube-Proxy Internals

kube-proxy is the network rule engine that makes Kubernetes Services reachable. It watches the API server for Service and Endpoints/EndpointSlice changes and programs the local node's kernel to implement virtual IP load balancing. This page dissects every proxy mode — iptables, IPVS, and nftables — down to the individual rules, chains, and data structures, covering Service types, session affinity, connection tracking, and replacement options.

What kube-proxy Does (and Does Not Do)

Key insight: ClusterIPs never hit the wire

A Service ClusterIP (e.g. 10.96.14.3) does not exist on any network interface. It is purely a destination address programmed into kernel packet-processing rules. kube-proxy creates those rules; the kernel rewrites matching packets to a real pod IP before they leave the host.

What kube-proxy programs

  • ClusterIP → pod IP DNAT (iptables/IPVS/nftables)
  • NodePort host port → ClusterIP forwarding
  • LoadBalancer external IP → ClusterIP forwarding
  • ExternalIP routing rules
  • Session affinity (client-IP stickiness)
  • Source IP masquerade (SNAT) for traffic leaving cluster

What kube-proxy does NOT do

  • Pod-to-pod routing (CNI's job)
  • Ingress / L7 HTTP routing (Ingress controller's job)
  • Actual load balancer provisioning (CCM's job)
  • DNS (CoreDNS's job)
  • NetworkPolicy enforcement (CNI's job)
  • TLS termination

Watch Loop

kube-proxy runs a single reflector/informer loop watching three object types:

| Object | API Group / Version | Changes That Trigger a Sync |
|---|---|---|
| Service | core/v1 | New ClusterIP; port/protocol change; type change; sessionAffinity change |
| EndpointSlice | discovery.k8s.io/v1 | Pod ready/not-ready; pod IP change; new pods; deleted pods |
| Node | core/v1 (self only) | Node address changes for NodePort binding |

kube-proxy uses EndpointSlice (GA 1.21) by default. Each EndpointSlice holds up to 100 endpoints. For a service with 1,000 pods, 10 EndpointSlices are created rather than one monolithic Endpoints object, dramatically reducing watch event churn.
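
The arithmetic above can be sketched as follows. This is only an illustration, assuming the default limit of 100 endpoints per slice and perfectly packed slices (the real controller does not guarantee full slices); `split_into_slices` is a hypothetical helper, not a Kubernetes API.

```python
MAX_ENDPOINTS_PER_SLICE = 100  # default limit stated above (assumed not overridden)

def split_into_slices(endpoints, max_per_slice=MAX_ENDPOINTS_PER_SLICE):
    """Hypothetical helper: chunk endpoint IPs into slice-sized groups."""
    return [endpoints[i:i + max_per_slice]
            for i in range(0, len(endpoints), max_per_slice)]

# 1,000 made-up pod IPs
pods = [f"10.244.{i // 250}.{i % 250 + 1}" for i in range(1000)]
slices = split_into_slices(pods)

print(len(slices))      # number of slices needed for 1,000 pods
print(len(slices[0]))   # endpoints held by the first slice
```

A change to one pod then touches a single 100-endpoint slice instead of one object listing all 1,000 endpoints.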

Service Types Recap

| Type | ClusterIP | NodePort | External LB | Use Case |
|---|---|---|---|---|
| ClusterIP | Assigned from service CIDR | No | No | Internal communication; most services |
| NodePort | Assigned | 30000–32767 on every node | No | External access without cloud LB; bare metal |
| LoadBalancer | Assigned | Assigned | Cloud LB via CCM | Production external access on cloud |
| ExternalName | None | No | No | DNS CNAME alias; no proxying |
| Headless (clusterIP: None) | None | No | No | StatefulSets; direct pod DNS; bypasses kube-proxy |

Headless Services bypass kube-proxy entirely

When clusterIP: None, no ClusterIP is assigned and kube-proxy programs nothing. CoreDNS returns individual pod A records for the service DNS name. kube-proxy is irrelevant for headless services.

iptables Mode (Default on Linux)

iptables mode uses Linux netfilter's PREROUTING and OUTPUT chains to DNAT service IPs to pod IPs. The first packet of each new connection is matched against a linear chain of rules (conntrack handles the rest of the flow), and that linear walk is the fundamental limitation that makes iptables mode struggle above ~10,000 services.

Chain Structure

[Diagram: PREROUTING and OUTPUT jump to KUBE-SERVICES, which dispatches to one KUBE-SVC-* chain per Service (nginx, api, db, …). Each KUBE-SVC chain picks a per-endpoint KUBE-SEP-* chain using the statistic module with probabilities 1/n, 1/(n-1), …, 1/1. Each KUBE-SEP chain DNATs to its pod (pod-1 10.244.1.5:80, pod-2 10.244.2.7:80, pod-3 10.244.3.9:80). KUBE-MARK-MASQ and KUBE-POSTROUTING handle masquerading.]

Annotated Rule Walkthrough

# Dump all kube-proxy iptables rules
iptables-save | grep -E "KUBE|kubernetes"

# ------ PREROUTING: jump to KUBE-SERVICES ------
-A PREROUTING -m comment --comment "kubernetes service portals" \
  -j KUBE-SERVICES

# ------ KUBE-SERVICES: one rule per service ------
# Match destination 10.96.14.3:80 (ClusterIP of nginx service) → jump to SVC chain
-A KUBE-SERVICES -d 10.96.14.3/32 -p tcp -m comment \
  --comment "default/nginx cluster IP" \
  -m tcp --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# ------ KUBE-SVC-XXXXX: load balancing via statistic module ------
# 3 endpoints: each rule selects one with probability 1/n, 1/(n-1), 1/1

# First endpoint: probability 0.33333333349 (= 1/3)
-A KUBE-SVC-XPGD46QRK7WJZT7O -m comment --comment "default/nginx" \
  -m statistic --mode random --probability 0.33333333349 \
  -j KUBE-SEP-ABCDEF1234567890

# Second endpoint: probability 0.50000000000 (= 1/2 of remaining)
-A KUBE-SVC-XPGD46QRK7WJZT7O -m comment --comment "default/nginx" \
  -m statistic --mode random --probability 0.50000000000 \
  -j KUBE-SEP-BCDEF12345678901

# Third endpoint: probability 1.0 (catch-all for last)
-A KUBE-SVC-XPGD46QRK7WJZT7O -m comment --comment "default/nginx" \
  -j KUBE-SEP-CDEF123456789012

# ------ KUBE-SEP-XXXXX: DNAT to actual pod IP ------
# Mark for masquerade if source is the pod itself (hairpin)
-A KUBE-SEP-ABCDEF1234567890 -s 10.244.1.5/32 -m comment \
  --comment "default/nginx" -j KUBE-MARK-MASQ

# DNAT: rewrite dst IP:port to pod IP:port
-A KUBE-SEP-ABCDEF1234567890 -p tcp -m comment \
  --comment "default/nginx" -m tcp \
  -j DNAT --to-destination 10.244.1.5:80

# ------ KUBE-POSTROUTING: masquerade marked packets ------
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" \
  -j KUBE-POSTROUTING
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --xor-mark 0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" \
  -j MASQUERADE --random-fully
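
The cascading probabilities in the KUBE-SVC chain (1/3, then 1/2, then a catch-all) can look uneven at first glance. The sketch below, which uses hypothetical helper names and exact fractions rather than the rounded values iptables prints, works out the chance that each rule is the one that fires.

```python
from fractions import Fraction

def rule_probabilities(n):
    """Probabilities written on successive rules: 1/n, 1/(n-1), ..., 1/1."""
    return [Fraction(1, n - i) for i in range(n)]

def effective_share(probabilities):
    """Chance that each rule is the one that fires when rules are tried in order."""
    reach = Fraction(1)          # probability that evaluation reaches this rule
    shares = []
    for p in probabilities:
        shares.append(reach * p)
        reach *= 1 - p           # otherwise the packet falls through to the next rule
    return shares

probs = rule_probabilities(3)
shares = effective_share(probs)
print([str(p) for p in probs])    # ['1/3', '1/2', '1']
print([str(s) for s in shares])   # ['1/3', '1/3', '1/3']
```

Each endpoint ends up with an equal share, which is why the per-rule probability has to grow as fewer rules remain.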

NodePort Rules

# NodePort rules live in the KUBE-NODEPORTS chain (called from KUBE-SERVICES)
# KUBE-SERVICES sends traffic addressed to a local node address to KUBE-NODEPORTS,
# which then matches on destination port 30080
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports" \
  -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS

-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx nodePort" \
  -m tcp --dport 30080 -j KUBE-EXT-XPGD46QRK7WJZT7O

# KUBE-EXT chain: decides masquerade for external traffic
-A KUBE-EXT-XPGD46QRK7WJZT7O -m comment \
  --comment "masquerade traffic for default/nginx external destinations" \
  -j KUBE-MARK-MASQ
-A KUBE-EXT-XPGD46QRK7WJZT7O -j KUBE-SVC-XPGD46QRK7WJZT7O
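
KUBE-MARK-MASQ and KUBE-POSTROUTING cooperate through a single bit of the packet mark. The following is a minimal model of that handshake, assuming mark bit 14 (0x4000) as in the rules shown here; the function names are illustrative and do not correspond to kube-proxy code.

```python
MASQUERADE_BIT = 14
MASQ_MARK = 1 << MASQUERADE_BIT   # 0x4000

def kube_mark_masq(mark):
    """KUBE-MARK-MASQ: set the masquerade bit, leave other mark bits alone."""
    return mark | MASQ_MARK

def kube_postrouting(mark):
    """Model of KUBE-POSTROUTING. Returns (new_mark, masquerade?)."""
    if (mark & MASQ_MARK) != MASQ_MARK:
        return mark, False        # unmarked packet: RETURN untouched
    mark ^= MASQ_MARK             # clear the bit so it is not acted on twice
    return mark, True             # MASQUERADE: source becomes the node IP

print(hex(MASQ_MARK))                         # 0x4000
print(kube_postrouting(kube_mark_masq(0)))    # (0, True)
print(kube_postrouting(0))                    # (0, False)
```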

Session Affinity

apiVersion: v1
kind: Service
metadata:
  name: sticky-service
spec:
  selector:
    app: api
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800   # 3 hours; max 86400 (1 day)
  ports:
  - port: 80
    targetPort: 8080

# Session affinity adds a 'recent' module match before the statistic rules
# First packet → choose endpoint normally; mark client IP in 'recent' table
# Subsequent packets from same client IP → jump directly to chosen SEP

-A KUBE-SVC-XXXXX -m comment --comment "default/sticky-service" \
  -m recent --name KUBE-SEP-ABCDEF1234 --rcheck --seconds 10800 \
  --reap -j KUBE-SEP-ABCDEF1234567890

-A KUBE-SEP-ABCDEF1234567890 -m comment --comment "default/sticky-service" \
  -m recent --name KUBE-SEP-ABCDEF1234 --set \
  -j DNAT --to-destination 10.244.1.5:8080
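
The behavior of the two `recent` rules can be approximated with a small lookup table. This is a toy model under simplifying assumptions (one shared table instead of one list per KUBE-SEP chain, and an explicit `now` argument instead of the kernel clock); `RecentTable` is a hypothetical name.

```python
class RecentTable:
    """Toy model of ClientIP affinity: client IP -> (endpoint, last seen)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.entries = {}

    def pick(self, client_ip, now, choose):
        entry = self.entries.get(client_ip)
        if entry is not None and now - entry[1] <= self.timeout:
            endpoint = entry[0]          # like --rcheck --seconds: reuse the endpoint
        else:
            endpoint = choose()          # no fresh entry: normal random selection
        self.entries[client_ip] = (endpoint, now)   # like --set: refresh the timestamp
        return endpoint

table = RecentTable(timeout_seconds=10800)
first = table.pick("203.0.113.7", now=0, choose=lambda: "pod-1")
again = table.pick("203.0.113.7", now=600, choose=lambda: "pod-2")
late = table.pick("203.0.113.7", now=600 + 10801, choose=lambda: "pod-3")
print(first, again, late)   # pod-1 pod-1 pod-3
```

The second call ignores its `choose` callback because the client was seen within the timeout; the third call arrives after the entry has expired and selects afresh.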

iptables Scaling Problem

O(n) rule traversal

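The first packet of every new connection walks the KUBE-SERVICES chain linearly (conntrack handles later packets in the flow). With 10,000 services × 10 endpoints = 100,000 KUBE-SEP rules, such a packet may be compared against thousands of rules before it matches. iptables-restore time also grows with ruleset size: a full restore of 50,000 rules can take 10+ seconds. The kernel swaps in the new table atomically, so forwarding continues on the old rules in the meantime, but endpoint changes are not applied until the restore finishes and traffic can keep flowing to pods that are already gone.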

| Scale | Services | Total Rules (approx) | iptables-restore time | Rule lookup latency |
|---|---|---|---|---|
| Small | 100 | ~2,000 | < 0.1s | Negligible |
| Medium | 1,000 | ~20,000 | ~1s | Low |
| Large | 10,000 | ~200,000 | ~60s | Measurable (ms) |
| Very large | 50,000 | ~1,000,000 | Minutes | Significant |
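
To make the O(n) cost concrete, the sketch below counts how many rules a connection-opening packet is compared against in a chain with one rule per service. It is a simplified model that assumes every service is equally likely to be the target and ignores the KUBE-SVC and KUBE-SEP hops; the function names are made up for illustration.

```python
def rules_evaluated(chain, dst):
    """Count rules compared before one matches (whole chain if none does)."""
    for position, rule_dst in enumerate(chain, start=1):
        if rule_dst == dst:
            return position
    return len(chain)

def average_rules_evaluated(num_services):
    """Average over all services, each one being the target once."""
    chain = list(range(num_services))   # stand-in for one rule per ClusterIP
    return sum(rules_evaluated(chain, d) for d in chain) / num_services

for n in (10, 100, 1000):
    print(n, average_rules_evaluated(n))
# 10 5.5
# 100 50.5
# 1000 500.5
```

The average grows as (n + 1) / 2, so ten times more services means roughly ten times more comparisons per new connection.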

IPVS Mode

IPVS (IP Virtual Server) is a Linux kernel module originally designed for load balancer appliances. kube-proxy IPVS mode programs kernel IPVS virtual servers instead of iptables chains. Lookups use hash tables: O(1) per packet regardless of service count.

Enabling IPVS

# KubeProxyConfiguration (kube-proxy ConfigMap)
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: rr              # rr | wrr | lc | dh | sh | sed | nq
  syncPeriod: 30s
  minSyncPeriod: 2s
  strictARP: true            # default false; needed when another component (e.g. MetalLB L2) must own ARP for Service IPs
  tcpTimeout: 900s
  tcpFinTimeout: 16s
  udpTimeout: 300s
iptables:
  masqueradeAll: false
  masqueradeBit: 14          # mark bit 0x4000 for masquerade
  minSyncPeriod: 1s
  syncPeriod: 30s

Kernel module requirements

IPVS mode requires the ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh, and nf_conntrack kernel modules. Load them with: modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack (without -a, modprobe treats the extra names as parameters of the first module) and persist them in /etc/modules-load.d/ipvs.conf.

IPVS Virtual Server Model

# IPVS creates a 'virtual server' (VS) per service port
# and 'real servers' (RS) per endpoint

# Install ipvsadm
apt-get install ipvsadm    # Ubuntu

# List all virtual servers
ipvsadm -Ln

# Example output for nginx service (ClusterIP 10.96.14.3:80, 3 pods)
# IP Virtual Server version 1.2.1 (size=4096)
# Prot LocalAddress:Port Scheduler Flags
#   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
# TCP  10.96.14.3:80 rr
#   -> 10.244.1.5:80                Masq    1      5          0
#   -> 10.244.2.7:80                Masq    1      3          0
#   -> 10.244.3.9:80                Masq    1      4          0

# NodePort creates additional virtual server on node IP
# TCP  192.168.1.10:30080 rr
#   -> 10.244.1.5:80                Masq    1      0          0
#   -> 10.244.2.7:80                Masq    1      0          0
#   -> 10.244.3.9:80                Masq    1      0          0

# kube-proxy also creates a dummy interface 'kube-ipvs0'
# and assigns ALL ClusterIPs to it (so kernel accepts the packets)
ip addr show kube-ipvs0
# inet 10.96.14.3/32 scope host kube-ipvs0
# inet 10.96.0.1/32 scope host kube-ipvs0    (kubernetes API server SVC)
# ... one address per Service ClusterIP

IPVS Scheduling Algorithms

| Scheduler | Algorithm | Use Case |
|---|---|---|
| rr | Round Robin | Default; equal distribution; stateless workloads |
| lc | Least Connections | Long-lived connections; databases; minimize hot pods |
| dh | Destination Hashing | Cache affinity; same destination always hits same backend |
| sh | Source Hashing | Client-IP affinity (like sessionAffinity: ClientIP) |
| sed | Shortest Expected Delay | Picks the backend with the lowest expected delay, (active connections + 1) / weight |
| nq | Never Queue | SED variant; sends to idle server first |
| wrr | Weighted Round Robin | Heterogeneous pods with different capacities |
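
Three of these schedulers are easy to model in a few lines. The sketch below is illustrative only: it uses the backend addresses and connection counts from the ipvsadm example, and CRC32 stands in for the kernel's own source hash, which is an assumption made purely for demonstration.

```python
import itertools
import zlib

BACKENDS = ["10.244.1.5:80", "10.244.2.7:80", "10.244.3.9:80"]

def round_robin(backends):
    """rr: hand out backends in a fixed rotation."""
    return itertools.cycle(backends)

def least_connections(active_conns):
    """lc: pick the backend with the fewest active connections."""
    return min(active_conns, key=active_conns.get)

def source_hash(client_ip, backends):
    """sh: the same client IP always maps to the same backend (CRC32 stand-in)."""
    return backends[zlib.crc32(client_ip.encode()) % len(backends)]

rr = round_robin(BACKENDS)
print([next(rr) for _ in range(4)])                        # wraps back to the first backend
print(least_connections(dict(zip(BACKENDS, [5, 3, 4]))))   # 10.244.2.7:80
print(source_hash("203.0.113.7", BACKENDS) == source_hash("203.0.113.7", BACKENDS))  # True
```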

IPVS + iptables Coexistence

Even in IPVS mode, kube-proxy still uses iptables for scenarios IPVS cannot handle natively:

# iptables rules still present in IPVS mode:
iptables-save | grep KUBE | grep -v "KUBE-SVC\|KUBE-SEP"

# 1. KUBE-MARK-MASQ — mark packets needing SNAT
# 2. KUBE-POSTROUTING — MASQUERADE marked packets
# 3. KUBE-FIREWALL — drop packets with invalid marks
# 4. KUBE-FORWARD — allow forwarding for established connections
# 5. NodePort masquerade (KUBE-NODEPORTS still uses iptables for source IP mark)

# strictARP=true sets arp_ignore=1 and arp_announce=2 so the node does not
# answer ARP for addresses bound to kube-ipvs0; this matters when another
# component (e.g. MetalLB in L2 mode) announces Service IPs on the network
cat /proc/sys/net/ipv4/conf/all/arp_ignore    # 1 when strictARP is enabled
cat /proc/sys/net/ipv4/conf/all/arp_announce  # 2 when strictARP is enabled

nftables Mode (GA 1.33)

nftables is the successor to iptables in the Linux kernel, using a bytecode VM instead of linear rule matching. kube-proxy gained nftables mode as alpha in 1.29; it went beta in 1.31 and GA in 1.33. It offers the same semantics as iptables mode but with better performance characteristics and atomic ruleset updates.

nftables advantages over iptables

  • Atomic ruleset updates (entire table replaced atomically)
  • Native set/map data structures (O(1) lookup for IPs)
  • No iptables-restore latency spikes during updates
  • Cleaner, more readable rule syntax
  • Better integration with modern kernels (5.2+)
  • Single framework for IPv4, IPv6, ARP, bridge rules

nftables limitations

  • Requires Linux kernel 5.13+ (full feature set)
  • Cannot coexist with iptables mode (same hook points)
  • Less tooling familiarity than iptables
  • Some distributions ship older nftables versions

# Enable nftables mode
# In KubeProxyConfiguration:
# mode: nftables

# Inspect nftables rules created by kube-proxy
# (kube-proxy keeps separate "ip" and "ip6" family tables, both named kube-proxy)
nft list table ip kube-proxy

# kube-proxy creates a set of active ClusterIPs
nft list set ip kube-proxy cluster-ips
# { 10.96.0.1, 10.96.14.3, ... }

# kube-proxy creates a verdict map for service→endpoint selection
nft list map ip kube-proxy service-ips
# { 10.96.14.3 . tcp . 80 : goto <per-service chain>, ... }

# nftables map lookup replaces the linear iptables chain walk
# O(1) per packet vs O(n) in iptables mode at same service count
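
A verdict map behaves much like a dictionary keyed on the (address, protocol, port) tuple. The sketch below models that dispatch; the chain names are placeholders invented for illustration and do not match the names kube-proxy generates.

```python
# Key: (destination IP, L4 protocol, destination port) -> chain to jump to
service_ips = {
    ("10.96.14.3", "tcp", 80): "svc-nginx",        # placeholder chain name
    ("10.96.0.1", "tcp", 443): "svc-kubernetes",   # placeholder chain name
}

def verdict(dst_ip, proto, dport):
    """Single hashed lookup; None means the packet is not Service traffic."""
    return service_ips.get((dst_ip, proto, dport))

print(verdict("10.96.14.3", "tcp", 80))    # svc-nginx
print(verdict("10.96.14.3", "udp", 80))    # None
```

Adding a Service adds one map element, so lookup cost stays flat as the map grows.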

Proxy Mode Comparison

| Dimension | iptables | IPVS | nftables | eBPF (Cilium/Calico) |
|---|---|---|---|---|
| Lookup complexity | O(n) linear chain | O(1) hash table | O(1) set/map | O(1) BPF map |
| Rule update model | Full flush + restore | Incremental | Atomic table replace | Incremental map update |
| Update latency at 10k services | ~10-60s flush | <1s | <1s atomic | <100ms |
| Load balancing algorithms | Random probability | 7 algorithms (rr/lc/sh…) | Random probability | Maglev hash (Cilium) |
| Session affinity | recent module | sh scheduler or persistence | nft timeout maps | BPF affinity map |
| Source IP preservation | No (SNAT) | No (SNAT); DSR possible | No (SNAT) | Yes (DSR, no SNAT) |
| kube-proxy required | Yes | Yes | Yes | No (replacement) |
| Kernel requirement | Any | ip_vs module | 5.13+ | 4.9.17+ (5.10+ for full) |
| Windows support | No | No | No | No |
| Stability | GA (stable) | GA (stable) | GA since 1.33 (beta in 1.31) | CNI-dependent |

End-to-End Traffic Flows

ClusterIP: Pod-to-Service Packet Trace

| Step | Location | Action | Packet state |
|---|---|---|---|
| 1 | Pod (app code) | connect("nginx.default.svc.cluster.local", 80) | DNS lookup → 10.96.14.3 |
| 2 | Pod kernel | SYN packet sent; dst=10.96.14.3:80 | src=10.244.1.5, dst=10.96.14.3:80 |
| 3 | Pod veth → host veth | Packet enters host network namespace | Unchanged |
| 4 | netfilter PREROUTING | KUBE-SERVICES rule matches 10.96.14.3:80 → KUBE-SVC chain | Unchanged |
| 5 | KUBE-SVC chain | statistic module selects pod-2; jumps to KUBE-SEP-BBB | Unchanged |
| 6 | KUBE-SEP-BBB | DNAT: dst rewritten to 10.244.2.7:80 | src=10.244.1.5, dst=10.244.2.7:80 |
| 7 | conntrack | Entry recorded: 10.244.1.5→10.96.14.3:80 ↔ 10.244.1.5→10.244.2.7:80 | Conntrack entry created |
| 8 | Routing decision | Route lookup: 10.244.2.7 → via bridge or tunnel | Forwarded normally via CNI |
| 9 | Pod-2 receives | SYN arrives at pod-2:80 | src=10.244.1.5, dst=10.244.2.7:80 |
| 10 | Return path | SYN-ACK from pod-2; conntrack reverse-translates src back to 10.96.14.3:80 | Leaves pod-2 as src=10.244.2.7, dst=10.244.1.5; reaches the client as src=10.96.14.3, dst=10.244.1.5 |
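
Steps 6, 7 and 10 hinge on conntrack remembering how to undo the DNAT. The following toy model shows the idea with the addresses from the trace; the client source port 40000 is a made-up value and `ConntrackTable` is a hypothetical class, far simpler than the kernel's connection tracker.

```python
class ConntrackTable:
    """Toy NAT table: remembers how to undo a DNAT for reply packets."""

    def __init__(self):
        self._reply_map = {}   # (backend, client) -> original service address

    def dnat(self, client, service, backend):
        """Rewrite the destination of a new flow and record the reverse mapping."""
        self._reply_map[(backend, client)] = service
        return (client, backend)             # (src, dst) after DNAT

    def translate_reply(self, src, dst):
        """Restore the service address as the source of a reply packet."""
        original = self._reply_map.get((src, dst))
        return (original or src, dst)

client = ("10.244.1.5", 40000)    # source port is an invented example
service = ("10.96.14.3", 80)
backend = ("10.244.2.7", 80)

ct = ConntrackTable()
forward = ct.dnat(client, service, backend)
reply = ct.translate_reply(backend, client)
print(forward)   # (('10.244.1.5', 40000), ('10.244.2.7', 80))
print(reply)     # (('10.96.14.3', 80), ('10.244.1.5', 40000))
```

The client only ever sees the ClusterIP as its peer, even though pod-2 sent the reply.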

NodePort: External-to-Service Packet Trace

| Step | Location | Action |
|---|---|---|
| 1 | External client | TCP SYN to node IP 192.168.1.10:30080 |
| 2 | Node PREROUTING | KUBE-SERVICES → dst-type LOCAL → KUBE-NODEPORTS |
| 3 | KUBE-NODEPORTS | dport 30080 → KUBE-EXT chain |
| 4 | KUBE-EXT | KUBE-MARK-MASQ (mark 0x4000) + jump to KUBE-SVC |
| 5 | KUBE-SVC | Select endpoint → DNAT to pod IP (e.g. 10.244.3.9:80) |
| 6 | POSTROUTING | Mark 0x4000 → MASQUERADE: src rewritten to node IP |
| 7 | Pod-3 receives | src=192.168.1.10 (node IP), dst=10.244.3.9:80; original client IP lost! |

Source IP loss with NodePort

The default NodePort flow SNATs traffic through the node, losing the original client IP. Use externalTrafficPolicy: Local to skip the masquerade and preserve client IP — but this means only pods on the same node as the receiving NodePort can be selected (pods on other nodes are unreachable for that NodePort).

externalTrafficPolicy

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # Cluster (default) | Local
  # Local: only route to pods on the receiving node; preserve client IP
  # Cluster: SNAT through any node; lose client IP but full pod distribution

  # internalTrafficPolicy controls ClusterIP traffic routing
  internalTrafficPolicy: Local   # Cluster (default) | Local
  # Local: only route to pods on same node as client (topology-aware)

  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
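
The trade-off between the two externalTrafficPolicy values can be sketched as an endpoint filter. This is a simplified illustration with hypothetical names; real kube-proxy also handles terminating endpoints and health-check ports.

```python
def select_external(endpoints, receiving_node, policy):
    """endpoints: list of (pod_ip, node_name). Returns (candidate IPs, snat?)."""
    if policy == "Local":
        local = [ip for ip, node in endpoints if node == receiving_node]
        return local, False    # client IP preserved; an empty list means drop
    return [ip for ip, _ in endpoints], True   # Cluster: any pod, SNAT applied

endpoints = [("10.244.1.5", "worker-1"), ("10.244.2.7", "worker-2")]
print(select_external(endpoints, "worker-1", "Cluster"))  # (['10.244.1.5', '10.244.2.7'], True)
print(select_external(endpoints, "worker-1", "Local"))    # (['10.244.1.5'], False)
print(select_external(endpoints, "worker-3", "Local"))    # ([], False)
```

The last call shows why a node without a local pod cannot serve the Service under the Local policy.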

EndpointSlices

EndpointSlices (GA 1.21) replaced the monolithic Endpoints object. Each EndpointSlice holds up to maxEndpointsPerSlice (default 100) endpoints. kube-proxy watches EndpointSlices exclusively since 1.22.

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: nginx-abc12
  namespace: default
  labels:
    kubernetes.io/service-name: nginx   # links to Service
addressType: IPv4                       # IPv4 | IPv6 | FQDN
endpoints:
- addresses:
  - "10.244.1.5"
  conditions:
    ready: true          # serving and not terminating
    serving: true        # passes readiness checks, whether or not the pod is terminating
    terminating: false   # pod is being deleted
  nodeName: worker-1
  targetRef:
    kind: Pod
    name: nginx-abc-xyz
    namespace: default
- addresses:
  - "10.244.2.7"
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: worker-2
  targetRef:
    kind: Pod
    name: nginx-def-uvw
    namespace: default
ports:
- name: http
  port: 80
  protocol: TCP

Terminating endpoints (graceful connection draining)

When a pod is terminating, its endpoint has terminating: true and ready: false, but serving stays true for as long as the pod still passes its readiness checks. kube-proxy (and Cilium) can fall back to serving, terminating endpoints when no ready endpoints remain, enabling graceful connection draining. In kube-proxy this fallback was controlled by the ProxyTerminatingEndpoints feature gate (GA 1.28).
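
The fallback described above amounts to a two-tier selection. A minimal sketch, assuming endpoints are plain dictionaries carrying the three condition flags; `usable_endpoints` is a hypothetical helper, not kube-proxy's implementation.

```python
def usable_endpoints(endpoints):
    """Prefer ready endpoints; otherwise fall back to serving + terminating ones."""
    ready = [e for e in endpoints if e["ready"]]
    if ready:
        return ready
    return [e for e in endpoints if e["serving"] and e["terminating"]]

draining = {"ip": "10.244.1.5", "ready": False, "serving": True, "terminating": True}
healthy = {"ip": "10.244.2.7", "ready": True, "serving": True, "terminating": False}
failed = {"ip": "10.244.3.9", "ready": False, "serving": False, "terminating": True}

print([e["ip"] for e in usable_endpoints([draining, healthy])])   # ['10.244.2.7']
print([e["ip"] for e in usable_endpoints([draining, failed])])    # ['10.244.1.5']
```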

Topology-Aware Routing

Topology-Aware Routing (beta since 1.23; renamed from Topology Aware Hints in 1.27) instructs kube-proxy to prefer endpoints in the same zone as the client's node, reducing cross-zone traffic costs. The EndpointSlice controller annotates endpoints with zone hints.

apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    service.kubernetes.io/topology-mode: Auto   # enables topology-aware routing
spec:
  selector:
    app: api
  ports:
  - port: 80

# EndpointSlice with hints populated by controller
kubectl get endpointslice -l kubernetes.io/service-name=api -o yaml | \
  grep -A3 hints
# hints:
#   forZones:
#   - name: us-east-1a   # only route to this endpoint from zone us-east-1a
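
How a proxy might consume these hints can be sketched as a zone filter with a safety fallback. The fallback rules here (ignore hints if any endpoint lacks them, or if no endpoint is hinted for the local zone) are a simplified reading of kube-proxy's documented safeguards; the data layout and function name are assumptions.

```python
def filter_by_zone(endpoints, node_zone):
    """endpoints: list of dicts with 'ip' and 'for_zones' (list, possibly empty)."""
    if any(not e.get("for_zones") for e in endpoints):
        return endpoints                   # incomplete hints: use every endpoint
    matching = [e for e in endpoints if node_zone in e["for_zones"]]
    return matching or endpoints           # no endpoint for this zone: use all

eps = [
    {"ip": "10.244.1.5", "for_zones": ["us-east-1a"]},
    {"ip": "10.244.2.7", "for_zones": ["us-east-1b"]},
]
print([e["ip"] for e in filter_by_zone(eps, "us-east-1a")])   # ['10.244.1.5']
print([e["ip"] for e in filter_by_zone(eps, "us-east-1c")])   # ['10.244.1.5', '10.244.2.7']
```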

Full KubeProxyConfiguration

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration

# Bind and client
bindAddress: "0.0.0.0"
clientConnection:
  kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
  qps: 5
  burst: 10

# Mode: iptables | ipvs | nftables
mode: ipvs

# iptables tuning (used even in IPVS mode for masquerade)
iptables:
  masqueradeAll: false     # SNAT all service traffic (not just NodePort)
  masqueradeBit: 14        # bit 14 (0x4000) used as mark
  minSyncPeriod: 1s        # don't sync more often than this
  syncPeriod: 30s          # full sync interval

# IPVS tuning
ipvs:
  scheduler: lc            # least connections for persistent services
  syncPeriod: 30s
  minSyncPeriod: 2s
  strictARP: true
  tcpTimeout: 900s         # idle TCP connection timeout in IPVS
  tcpFinTimeout: 16s       # TCP FIN_WAIT timeout
  udpTimeout: 300s

# NodePort addresses
nodePortAddresses:
- "192.168.0.0/16"        # restrict NodePort binding to these CIDRs
                          # default: all interfaces

# Feature gates
featureGates:
  TopologyAwareHints: true

# Healthz and metrics
healthzBindAddress: "0.0.0.0:10256"
metricsBindAddress: "0.0.0.0:10249"

# Logging
logging:
  verbosity: 2

kube-proxy Replacement

Several CNIs offer full kube-proxy replacement by programming service routing inside eBPF, eliminating the kube-proxy DaemonSet entirely:

| Replacement | Mechanism | How to Disable kube-proxy | Benefits vs kube-proxy |
|---|---|---|---|
| Cilium | eBPF maps + socket-level LB | --skip-phases=addon/kube-proxy in kubeadm; or delete kube-proxy DaemonSet | O(1) lookup; DSR; preserved src IP; Maglev hashing; lower latency at scale |
| Calico eBPF | eBPF TC hooks | Disable kube-proxy and enable Felix's eBPF data plane (bpfEnabled: true); Felix's bpfKubeProxyIptablesCleanupEnabled setting removes leftover kube-proxy rules | Same as Cilium eBPF mode; DSR capable |
| kube-router | IPVS + BGP | Deploy kube-router as DaemonSet; disable kube-proxy | Integrated BGP router + IPVS; no external CNI needed |

# Disable kube-proxy when using Cilium kube-proxy replacement
# Option 1: skip during kubeadm init
kubeadm init --skip-phases=addon/kube-proxy

# Option 2: delete after cluster creation
kubectl -n kube-system delete daemonset kube-proxy
iptables-save | grep -v KUBE | iptables-restore   # clean up leftover rules

# Configure Cilium to replace kube-proxy
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=API_SERVER_IP \
  --set k8sServicePort=6443

# Verify
cilium status | grep KubeProxyReplacement
# KubeProxyReplacement: True  [NodePort (SNAT, 1 NumDevices), ExternalIPs, HostPort, SessionAffinity, Maglev, XDP (NATIVE)]

Key kube-proxy Metrics

| Metric | Type | Alert Threshold |
|---|---|---|
| kubeproxy_sync_proxy_rules_duration_seconds | histogram | p99 > 5s (iptables mode); > 1s (IPVS/nftables) |
| kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds | gauge | age > 30s → stale rules not being applied |
| kubeproxy_network_programming_duration_seconds | histogram | p99 > 10s |
| kubeproxy_sync_proxy_rules_iptables_total | gauge | > 200,000 → consider IPVS or nftables mode |
| kubeproxy_ipvs_services_total | gauge | Informational (track growth) |
| rest_client_requests_total{code=~"5.."} | counter | > 0.1/s sustained → API server connectivity problem |
| kubeproxy_sync_proxy_rules_no_local_endpoints_total | gauge | > 0 with externalTrafficPolicy=Local → endpoints missing on node |

Alerting Rules

groups:
- name: kube-proxy
  rules:
  - alert: KubeProxyRuleSyncSlow
    expr: histogram_quantile(0.99, rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "kube-proxy rule sync taking >5s (iptables mode scalability issue)"

  - alert: KubeProxyStaleRules
    expr: time() - kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds > 60
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "kube-proxy has not synced rules in >60s — service endpoints may be stale"

  - alert: KubeProxyTooManyRules
    expr: kubeproxy_sync_proxy_rules_iptables_total > 150000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "iptables rule count exceeds 150k — consider switching to IPVS or nftables"

  - alert: KubeProxyAPIServerErrors
    expr: rate(rest_client_requests_total{job="kube-proxy",code=~"5.."}[5m]) > 0.1
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "kube-proxy experiencing API server 5xx errors"

Troubleshooting Runbooks

Runbook 1: Service ClusterIP not reachable from pod

# 1. Verify service exists and has endpoints
kubectl get svc nginx -o wide
kubectl get endpoints nginx    # or: kubectl get endpointslices -l kubernetes.io/service-name=nginx

# 2. Confirm kube-proxy is running on the node
NODE=$(kubectl get pod <client-pod> -o jsonpath='{.spec.nodeName}')
kubectl get pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=$NODE

# 3. Check iptables rules on the node (for iptables mode)
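kubectl debug node/$NODE -it --image=ubuntu -- bash
  chroot /host                      # the ubuntu image ships no iptables; use the node's own binaries
  iptables-save | grep 10.96.14.3   # ClusterIP
  # If empty → kube-proxy not syncing rules
  # If permission is denied, re-run kubectl debug with a privileged profile
  # (e.g. --profile=sysadmin on recent kubectl versions)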

# 4. Check IPVS virtual servers (for IPVS mode)
ipvsadm -Ln | grep -A5 "10.96.14.3"

# 5. Check kube-proxy logs
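# kube-proxy pods carry a random suffix, not the node name, so look the pod up first
POD=$(kubectl get pod -n kube-system -l k8s-app=kube-proxy \
  --field-selector spec.nodeName=$NODE -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system "$POD" --tail=50 | grep -E "ERROR|sync|failed"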

# 6. Check conntrack table (if DNAT happening but response lost)
conntrack -L | grep 10.96.14.3

# 7. Verify kube-proxy endpoints match pod IPs
kubectl get pod -l app=nginx -o wide   # compare pod IPs to KUBE-SEP rules

Runbook 2: NodePort accessible on one node but not another

# NodePort should work on ALL nodes regardless of where pods run (Cluster policy)

# 1. Test from each node
curl http://<node1-ip>:30080    # should work
curl http://<node2-ip>:30080    # should also work

# 2. Check firewall/security groups (cloud or on-prem)
# AWS: check Security Group allows port 30080 TCP inbound
# GCP: check firewall rules for node tag
# On-prem: check iptables INPUT chain on the node

# 3. Verify kube-proxy is running on the failing node
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide

# 4. Check if externalTrafficPolicy=Local (only pods on THAT node are reachable)
kubectl get svc nginx -o jsonpath='{.spec.externalTrafficPolicy}'
# If "Local", NodePort only works on nodes that have a matching pod

# 5. Check kube-proxy rule on failing node
kubectl debug node/<failing-node> -it --image=ubuntu -- bash
  chroot /host                      # use the node's own iptables binary
  iptables-save | grep "30080"
  # If rule missing → kube-proxy not syncing on this node

Runbook 3: kube-proxy high CPU / slow sync (iptables mode at scale)

# Symptom: kube-proxy pod using high CPU; endpoints take minutes to update

# 1. Measure current sync duration
kubectl exec -n kube-system kube-proxy-xyz -- \
  curl -s localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_duration

# 2. Count total iptables rules
kubectl debug node/worker-1 -it --image=ubuntu -- bash
  chroot /host                      # use the node's own iptables binary
  iptables-save | wc -l
  # > 100,000 lines = likely bottleneck

# 3. Count services and endpoints
kubectl get svc --all-namespaces --no-headers | wc -l
kubectl get endpointslices --all-namespaces --no-headers | wc -l

# 4. Switch to IPVS mode (rolling, not disruptive)
# Edit kube-proxy ConfigMap
kubectl edit cm -n kube-system kube-proxy
# Change: mode: "ipvs" (was "iptables")
# Add: ipvs.strictARP: true
# Then restart kube-proxy pods
kubectl rollout restart daemonset -n kube-system kube-proxy

# 5. Or migrate to Cilium kube-proxy replacement for maximum scale

Runbook 4: Client IP lost — source IP not preserved through LoadBalancer

# Symptom: application receives node IP instead of client IP in X-Forwarded-For or logs

# Root cause: externalTrafficPolicy: Cluster (default) SNATs traffic through node

# Solution 1: externalTrafficPolicy: Local (preserves client IP, but uneven distribution)
kubectl patch svc web -p '{"spec":{"externalTrafficPolicy":"Local"}}'
# Warning: pods only receive traffic on nodes where they're running
# Health check port needed for cloud LB health probes
kubectl get svc web -o jsonpath='{.spec.healthCheckNodePort}'

# Solution 2: Use Cilium with DSR (Direct Server Return)
# Cilium can preserve source IP even with Cluster policy via DSR
# Requires: --set loadBalancer.mode=dsr
# (loadBalancer.dsrDispatch=opt selects how the client address is carried to the backend)

# Solution 3: Proxy Protocol (for L4 LBs)
# Cloud LB sends PROXY protocol header with original client IP
# Application or ingress must decode PROXY protocol header

# Verify client IP is now preserved
kubectl logs -l app=web | grep "X-Real-IP\|remote_addr"

Runbook 5: IPVS mode — connection resets after pod deletion

# Symptom: established connections RST when a pod is deleted and IPVS removes the real server

# Root cause: IPVS removes the RS immediately; conntrack entries still point to dead pod

# 1. Check IPVS connection table
ipvsadm -Lnc | grep "10.244.1.5"   # connections to deleted pod IP

# 2. Check conntrack for stale entries
conntrack -L | grep "10.244.1.5"
conntrack -D --dst 10.244.1.5       # manually remove (emergency)

# 3. Proper solution: use graceful termination
# - Pod should handle SIGTERM and drain connections
# - preStop hook gives time before endpoint removal:
# preStop:
#   exec:
#     command: ["sleep", "15"]      # 15s for LB to drain connections

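# 4. IPVS persistence (alternative to graceful termination for session affinity)
ipvsadm --edit-service --tcp-service 10.96.14.3:80 --persistent 300
# Persistence pins each client to the same RS for 300s.
# kube-proxy owns these virtual servers and may revert manual edits on its next sync;
# sessionAffinity: ClientIP on the Service is the durable way to get the same effect.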

Production Best Practices

Mode Selection

  • <500 services: iptables mode is fine
  • 500–10,000 services: switch to IPVS or nftables
  • >10,000 services: use IPVS or Cilium kube-proxy replacement
  • Linux 5.13+: use nftables for better atomicity
  • Maximum performance: Cilium eBPF kube-proxy replacement
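
The thresholds in this list can be read as a small decision function. The sketch below simply encodes the bullets as written; the cut-offs are this page's rules of thumb rather than hard limits, and `suggest_modes` and its labels are hypothetical.

```python
def suggest_modes(num_services, kernel=(5, 4)):
    """Return candidate proxy modes for a cluster size and (major, minor) kernel."""
    nftables_ok = kernel >= (5, 13)          # nftables mode needs Linux 5.13+
    if num_services < 500:
        return ["iptables"]
    if num_services <= 10_000:
        return ["ipvs"] + (["nftables"] if nftables_ok else [])
    return ["ipvs", "cilium-ebpf"]

print(suggest_modes(200))                     # ['iptables']
print(suggest_modes(3000, kernel=(5, 15)))    # ['ipvs', 'nftables']
print(suggest_modes(3000, kernel=(4, 19)))    # ['ipvs']
print(suggest_modes(25_000))                  # ['ipvs', 'cilium-ebpf']
```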

IPVS Best Practices

  • Set strictARP: true when another component (e.g. MetalLB in L2 mode) announces Service IPs via ARP
  • Load ip_vs modules via /etc/modules-load.d/
  • Use lc (least-connections) for long-lived connections
  • Monitor ip_vs connection table size
  • Set appropriate TCP/UDP timeouts to prevent stale IPVS entries

Source IP Handling

  • Use externalTrafficPolicy: Local for apps needing client IP
  • Expect LB health checks to fail on nodes without a local pod (that is how the LB learns where to send traffic); spread pods across nodes to avoid uneven load
  • Document the SNAT behavior for security/audit teams
  • For internal traffic: internalTrafficPolicy: Local saves cross-node hops

Graceful Endpoint Handling

  • Use preStop hook + sleep to drain connections before pod stops
  • Set terminationGracePeriodSeconds appropriately
  • Monitor kubeproxy_sync_proxy_rules_no_local_endpoints_total
  • Test rolling deployments with active connections under load