kube-proxy
kube-proxy is the network rules agent that runs on every node. Its sole job is to implement the Service abstraction at the network level: when a packet destined for a Service ClusterIP arrives at the node, kube-proxy's rules translate (DNAT) that virtual IP into a real pod IP and load-balance across healthy endpoints. kube-proxy itself never handles any packet in user-space — it programs the kernel to do the work.
For a deep treatment of how Services work end-to-end, see Networking §Services and kube-proxy Internals. This page focuses on kube-proxy as a node component: its process, modes, configuration, and operations.
What kube-proxy Does (and Doesn't Do)
What it does
- Watches Service and EndpointSlice objects from the API server
- Programs iptables NAT rules, IPVS virtual servers, or nftables rules to implement ClusterIP load balancing
- Implements NodePort — opens host ports and forwards to pod IPs
- Implements ExternalIP — adds DNAT for externally-routable IPs
- Handles sessionAffinity: ClientIP via iptables recent-match or IPVS persistence
- Handles externalTrafficPolicy: Local by forwarding NodePort/LB traffic only to endpoints on the local node, which preserves the source IP
What it does NOT do
- Does not forward packets itself — all forwarding is done by the kernel
- Does not implement pod-to-pod networking — that's the CNI plugin's job
- Does not enforce NetworkPolicy — that's the CNI plugin (e.g., Calico, Cilium)
- Does not provide DNS — that's CoreDNS
- Does not run on clusters using eBPF-based Service implementations (Cilium kube-proxy replacement, Calico eBPF)
Proxy Modes
kube-proxy supports three implementation modes on Linux, selected via --proxy-mode or the mode field of its configuration file:
iptables (default)
Programs chains in the PREROUTING and OUTPUT netfilter hooks. Each Service gets a chain; each endpoint gets a rule using --probability for random load balancing. Rule count grows linearly with the total number of endpoints across all Services.
Scale limit: ~10,000 Services (rule traversal is O(n)).
Used when: default on most managed clusters; no kernel IPVS module required.
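The --probability values are not all equal, because each rule is only evaluated when the rules before it did not match. The sketch below models that arithmetic with exact fractions. The function names are hypothetical, and it assumes the common layout in which rule i of n matches with probability 1/(n-i) and the last rule is unconditional.

```python
from fractions import Fraction

def statistic_probabilities(n_endpoints):
    """Per-rule probability for a per-Service chain with n endpoints.

    Rule i (0-based) is reached only if all earlier rules did not match,
    so it matches with probability 1 / (endpoints still remaining).
    The final rule therefore has probability 1 (unconditional).
    """
    return [Fraction(1, n_endpoints - i) for i in range(n_endpoints)]

def effective_share(probabilities):
    """Overall chance that each rule ends up handling a new connection."""
    shares = []
    not_matched_yet = Fraction(1)
    for p in probabilities:
        shares.append(not_matched_yet * p)
        not_matched_yet *= 1 - p
    return shares

probs = statistic_probabilities(4)
print([str(p) for p in probs])                   # ['1/4', '1/3', '1/2', '1']
print([str(s) for s in effective_share(probs)])  # ['1/4', '1/4', '1/4', '1/4']
```

Every endpoint ends up with the same overall share even though the per-rule probabilities differ.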
ipvs
Uses the Linux IPVS kernel module (LVS). Creates a virtual server per Service ClusterIP:port and real servers per endpoint. Hash table lookup: O(1) regardless of Service count.
Scale limit: 100,000+ Services with negligible per-rule overhead.
Used when: large clusters (>1000 Services) or when advanced LB algorithms are needed.
nftables (alpha 1.29, beta 1.31, GA 1.33)
Uses the nftables API, the successor to iptables. Verdict maps provide O(1) lookups similar to IPVS but without requiring a separate kernel module. nftables itself has been in Linux since 3.13, but kube-proxy's nftables mode requires kernel 5.13 or later.
Used when: modern distros (RHEL 9, Ubuntu 22.04+) where iptables-legacy is deprecated.
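The O(n) versus O(1) distinction between the modes can be illustrated with a toy model: an ordered rule list that is scanned until a match, against a keyed map. This is only an analogy for the kernel data structures, and the Service addresses and names below are made up.

```python
def linear_match(rules, dst):
    """iptables-style: test rules in order; cost grows with the rule count."""
    for checked, (match, target) in enumerate(rules, start=1):
        if match == dst:
            return target, checked
    return None, len(rules)

def map_match(verdict_map, dst):
    """Verdict-map / hash-table style: one keyed lookup, whatever the size."""
    return verdict_map.get(dst), 1

# 5,000 hypothetical Service entries keyed by (ClusterIP, port).
rules = [((f"10.96.{i // 250}.{i % 250 + 1}", 80), f"svc-{i}") for i in range(5000)]
verdict_map = dict(rules)

last = rules[-1][0]
print(linear_match(rules, last))     # ('svc-4999', 5000)
print(map_match(verdict_map, last))  # ('svc-4999', 1)
```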
iptables Mode — Rule Structure
In iptables mode, kube-proxy creates a predictable chain hierarchy in the nat table. Understanding this structure is essential for debugging connectivity issues.
Key iptables Chains
| Chain | Table | Purpose |
|---|---|---|
| KUBE-SERVICES | nat | Entry point; one rule per Service port matching dst ClusterIP:port; jumps to the per-Service chain |
| KUBE-SVC-<hash> | nat | Per-Service chain; random load balancing across endpoints using the statistic match with --probability |
| KUBE-SEP-<hash> | nat | Per-endpoint chain; performs DNAT to the pod IP:port |
| KUBE-NODEPORTS | nat | Matches each allocated NodePort (default range 30000-32767); jumps to the per-Service chain |
| KUBE-POSTROUTING | nat | MASQUERADE for packets carrying the masquerade mark (e.g., a pod reaching itself through its own Service) so that return traffic routes correctly |
| KUBE-FORWARD | filter | Allows forwarding for established connections and explicitly accepted traffic |
| KUBE-FIREWALL | filter | Drops packets that kube-proxy has marked to be dropped |
# Inspect kube-proxy's iptables rules
iptables -t nat -L KUBE-SERVICES -n --line-numbers | head -30
# Find the chain for a specific Service
SVC_IP=$(kubectl get svc my-service -o jsonpath='{.spec.clusterIP}')
iptables -t nat -L KUBE-SERVICES -n | grep -wF "$SVC_IP"
# Follow the chain for that Service
iptables -t nat -L KUBE-SVC-XXXXXXXXXX -n --line-numbers
# Count total kube-proxy rules (can be large)
iptables-save | grep -c KUBE
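A single total says little about where the rules come from. As a rough sketch, the output of iptables-save can be broken down per chain family. The helper name and the sample excerpt are hypothetical, and the chain hashes are placeholders.

```python
from collections import Counter

def count_kube_rules(iptables_save_text):
    """Count appended rules (-A lines) per kube-proxy chain family."""
    counts = Counter()
    for line in iptables_save_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0] != "-A" or not parts[1].startswith("KUBE-"):
            continue
        # KUBE-SVC-<hash> and KUBE-SEP-<hash> collapse into one family each.
        family = "-".join(parts[1].split("-")[:2])
        counts[family] += 1
    return counts

# Hypothetical, heavily shortened iptables-save excerpt.
sample = """\
*nat
:KUBE-SERVICES - [0:0]
-A KUBE-SERVICES -d 10.100.50.20/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-AAAAAAAAAAAAAAAA
-A KUBE-SVC-AAAAAAAAAAAAAAAA -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-BBBBBBBBBBBBBBBB
-A KUBE-SVC-AAAAAAAAAAAAAAAA -j KUBE-SEP-CCCCCCCCCCCCCCCC
-A KUBE-SEP-BBBBBBBBBBBBBBBB -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:8080
-A KUBE-SEP-CCCCCCCCCCCCCCCC -p tcp -m tcp -j DNAT --to-destination 10.244.2.8:8080
COMMIT
"""
print(dict(count_kube_rules(sample)))
# {'KUBE-SERVICES': 1, 'KUBE-SVC': 2, 'KUBE-SEP': 2}
```

On a node, the same function could be fed the real dump, for example by reading sys.stdin after piping iptables-save into the script.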
IPVS Mode — Virtual Servers
In IPVS mode, kube-proxy creates a dummy interface called kube-ipvs0 and assigns all Service ClusterIPs to it. This makes the kernel recognize those IPs as local, enabling IPVS to intercept packets before they leave the node. IPVS maintains a hash table of virtual servers and real servers in kernel space.
# Check IPVS virtual servers (requires ipvsadm)
ipvsadm -Ln
# Output example:
# IP Virtual Server version 1.2.1 (size=4096)
# Prot LocalAddress:Port Scheduler Flags
# -> RemoteAddress:Port Forward Weight ActiveConn InActConn
# TCP 10.96.0.1:443 rr
# -> 10.0.1.10:6443 Masq 1 5 0
# TCP 10.96.0.10:53 rr
# -> 10.244.0.5:53 Masq 1 0 0
# -> 10.244.0.6:53 Masq 1 0 0
# TCP 10.100.50.20:80 rr
# -> 10.244.1.5:8080 Masq 1 12 0
# -> 10.244.2.8:8080 Masq 1 8 0
# Show the kube-ipvs0 dummy interface with all ClusterIPs
ip addr show kube-ipvs0 | grep "inet " | head -20
IPVS Scheduling Algorithms
| Algorithm | Flag | Description |
|---|---|---|
| Round Robin | rr | Default. Distributes connections equally across backends. No state tracking. |
| Least Connection | lc | Sends to backend with fewest active connections. Better for heterogeneous request durations. |
| Destination Hashing | dh | Hashes the destination IP to pick a backend, so the same destination always maps to the same backend. Not client-sticky; use sh for that. |
| Source Hashing | sh | Consistent src IP → backend mapping. Implements sessionAffinity without explicit timeout tracking. |
| Shortest Expected Delay | sed | Considers both active connections and weight. Approximates least-loaded server. |
| Weighted Round Robin | wrr | Weight-based round robin. kube-proxy currently assigns equal weight to all endpoints. |
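To make the differences between the schedulers concrete, here is a simplified model of rr, lc, and sh. It is not how the kernel implements them: the real lc scheduler also accounts for inactive connections, and CRC32 is only a stand-in for the kernel's own hash. The backend addresses mirror the ipvsadm example above, and the client IP is made up.

```python
import itertools
import zlib

def round_robin(backends):
    """rr: hand out backends in a fixed rotation."""
    cycle = itertools.cycle(backends)
    return lambda: next(cycle)

def least_connection(active_conns):
    """lc (simplified): pick the backend with the fewest active connections."""
    return min(active_conns, key=active_conns.get)

def source_hash(backends, client_ip):
    """sh: the same client IP always maps to the same backend.

    CRC32 is a stand-in; the kernel uses its own hash function.
    """
    return backends[zlib.crc32(client_ip.encode()) % len(backends)]

backends = ["10.244.1.5:8080", "10.244.2.8:8080"]

pick = round_robin(backends)
print([pick() for _ in range(4)])   # alternates between the two backends
print(least_connection({"10.244.1.5:8080": 12, "10.244.2.8:8080": 8}))  # 10.244.2.8:8080
print(source_hash(backends, "203.0.113.7") == source_hash(backends, "203.0.113.7"))  # True
```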
IPVS mode requires the ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh, and nf_conntrack kernel modules. In addition, the conntrack table size (nf_conntrack_max) must be sized appropriately, because IPVS still uses conntrack for NAT tracking. A conntrack max of 65536, a common default, is insufficient for large clusters; set it to at least 1M on large nodes.
# Check required kernel modules for IPVS
lsmod | grep -e ip_vs -e nf_conntrack
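# Load modules if needed (-a treats every argument as a module name;
# without it, the extra names are passed to ip_vs as module parameters)
modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack
# Make persistent (systemd modules-load.d, e.g. Ubuntu/Debian)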
cat >> /etc/modules-load.d/ipvs.conf << 'EOF'
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF
# Tune conntrack (critical for IPVS at scale)
sysctl -w net.netfilter.nf_conntrack_max=1048576
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
Service Type Implementation
| Service type | What kube-proxy does | What it does NOT do |
|---|---|---|
| ClusterIP | Programs DNAT rules for the ClusterIP:port → endpoint IPs | Does not provision cloud load balancers (that's CCM) |
| NodePort | Programs rules for the NodePort on ALL nodes (KUBE-NODEPORTS chain); forwards to endpoints | Does not ensure the port is accessible through firewalls |
| LoadBalancer | Same as NodePort (backend), plus rules for the load balancer IPs reported in the Service status; the cloud LB is provisioned by CCM | Does not program the cloud LB itself |
| ExternalIP | Programs DNAT for the explicitly specified external IP addresses | Does not route external IPs to the node (that's BGP/cloud routing) |
| Headless (clusterIP: None) | No rules; DNS returns pod IPs directly, with no kube-proxy involvement | N/A |
externalTrafficPolicy
This field on a Service controls how traffic arriving at a NodePort or LoadBalancer is handled:
externalTrafficPolicy: Cluster (default)
Traffic arriving at any node's NodePort is forwarded to any healthy endpoint in the cluster, which may add an extra hop to another node. The source IP is SNAT'd to the node's IP (the client's source IP is lost).
Client → Node A :30080
→ SNAT to Node A IP
→ Pod on Node B :8080
(src IP = Node A, not client)
externalTrafficPolicy: Local
Traffic is only forwarded to endpoints on the same node. If no local endpoints exist, the connection is dropped (health checks will fail, causing the LB to route elsewhere). Source IP is preserved.
Client → Node A :30080
→ Pod on Node A :8080 only
→ src IP = Client IP preserved
(Node A with no local pods:
connection dropped — LB
will not route here)
internalTrafficPolicy works like externalTrafficPolicy but for traffic originating from inside the cluster. When set to Local, ClusterIP traffic is only forwarded to endpoints on the same node. Useful for node-local caches (DaemonSets) where you want pods to always hit the local instance.
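The endpoint selection behind both policies can be sketched as a small filter. This is a simplified model with a hypothetical function name; it only captures the Cluster versus Local distinction described above and ignores readiness and termination handling.

```python
def eligible_endpoints(endpoints, policy, local_node):
    """Return the endpoints a node may forward to under a traffic policy.

    endpoints:  list of (pod_ip, node_name) pairs that are ready
    policy:     "Cluster" or "Local"
    local_node: name of the node evaluating the rules
    """
    if policy == "Cluster":
        return list(endpoints)
    if policy == "Local":
        # May be empty: traffic arriving at this node is then dropped.
        return [ep for ep in endpoints if ep[1] == local_node]
    raise ValueError(f"unknown traffic policy: {policy!r}")

endpoints = [("10.244.1.5", "node-a"), ("10.244.2.8", "node-b")]
print(eligible_endpoints(endpoints, "Cluster", "node-a"))  # both endpoints
print(eligible_endpoints(endpoints, "Local", "node-a"))    # [('10.244.1.5', 'node-a')]
print(eligible_endpoints(endpoints, "Local", "node-c"))    # []
```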
EndpointSlices
kube-proxy has fully migrated from watching Endpoints objects to watching EndpointSlice objects (API GA in 1.21; used by kube-proxy on Linux by default since 1.19). EndpointSlices shard endpoints into slices of up to 100 endpoints each by default, drastically reducing the size of individual watch events.
With old Endpoints: a single endpoint change in a Service with 1000 pods triggered a 1000-entry object update broadcast to every kube-proxy in the cluster. With EndpointSlice: only the slice containing the changed endpoint is updated — at most 100 endpoints per event.
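The arithmetic behind that comparison can be sketched as follows. The chunking here is idealized (the real controller does not necessarily pack slices perfectly), and the pod IPs are generated purely for illustration.

```python
MAX_ENDPOINTS_PER_SLICE = 100  # slice size quoted in the text above

def shard(endpoints, max_per_slice=MAX_ENDPOINTS_PER_SLICE):
    """Split a flat endpoint list into slice-sized chunks."""
    return [endpoints[i:i + max_per_slice]
            for i in range(0, len(endpoints), max_per_slice)]

def update_size(slices, changed_endpoint):
    """Number of endpoints carried by the watch event for one changed endpoint."""
    for s in slices:
        if changed_endpoint in s:
            return len(s)
    raise KeyError(changed_endpoint)

pods = [f"10.244.{i // 250}.{i % 250 + 1}" for i in range(1000)]  # hypothetical pod IPs
slices = shard(pods)

print(len(pods))                       # 1000 entries resent per change with Endpoints
print(len(slices))                     # 10 slices
print(update_size(slices, pods[345]))  # 100 entries resent per change with EndpointSlice
```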
# View EndpointSlices for a Service
kubectl get endpointslices -l kubernetes.io/service-name=my-service
# EndpointSlice object
kubectl get endpointslice my-service-abc12 -o yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-service-abc12
  labels:
    kubernetes.io/service-name: my-service
    endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
addressType: IPv4
ports:
  - name: http
    protocol: TCP
    port: 8080
endpoints:
  - addresses: ["10.244.1.5"]
    conditions:
      ready: true
      serving: true       # Is currently serving (even during termination)
      terminating: false  # Is in graceful termination
    nodeName: worker-1
    zone: us-east-1a
    hints:
      forZones:
        - name: us-east-1a  # Topology Aware Routing hint
Topology Aware Routing
When the service.kubernetes.io/topology-mode: auto annotation is set on a Service (the annotation was introduced in 1.27), the EndpointSlice controller populates the hints.forZones field of each endpoint. kube-proxy then prefers endpoints hinted for the zone of the node it runs on, reducing cross-zone traffic costs.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/topology-mode: "auto"  # Enable topology-aware routing
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
Monitor endpoint_slice_controller_endpoints_added_per_sync and check the EndpointSlice hints fields to verify topology routing is active.
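How a node might apply the hints can be sketched as a filter with a fallback. This is a simplified reading of the documented safeguards (ignore hints when any endpoint lacks them, or when none targets the local zone), not kube-proxy's actual code. The function name and dictionary layout are hypothetical.

```python
def filter_by_zone_hints(endpoints, node_zone):
    """Pick the endpoints a node in node_zone would use.

    endpoints: list of dicts with "ip" and "for_zones" (list of zone names,
               empty when the endpoint carries no hint)
    """
    # If any endpoint has no hint, treat the hints as incomplete and ignore them.
    if any(not ep["for_zones"] for ep in endpoints):
        return list(endpoints)
    hinted = [ep for ep in endpoints if node_zone in ep["for_zones"]]
    # If nothing is hinted for this zone, fall back to all endpoints.
    return hinted or list(endpoints)

endpoints = [
    {"ip": "10.244.1.5", "for_zones": ["us-east-1a"]},
    {"ip": "10.244.2.8", "for_zones": ["us-east-1b"]},
]
print([ep["ip"] for ep in filter_by_zone_hints(endpoints, "us-east-1a")])  # ['10.244.1.5']
print([ep["ip"] for ep in filter_by_zone_hints(endpoints, "us-east-1c")])  # both IPs
```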
KubeProxyConfiguration Reference
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# --- Core settings ---
bindAddress: "0.0.0.0"
clusterCIDR: "10.244.0.0/16"  # Pod CIDR; used to detect traffic that needs SNAT
hostnameOverride: ""          # Override node name if different from hostname
# --- Proxy mode ---
mode: "ipvs"                  # iptables | ipvs | nftables | "" (platform default; iptables on Linux)
# --- IPVS settings (only relevant when mode=ipvs) ---
ipvs:
  scheduler: "rr"             # rr | lc | dh | sh | sed | wrr
  syncPeriod: "30s"           # Full sync interval
  minSyncPeriod: "1s"         # Minimum time between partial syncs
  tcpTimeout: "0s"            # 0 = use kernel default (900s)
  tcpFinTimeout: "0s"
  udpTimeout: "0s"
  excludeCIDRs: []            # CIDRs to exclude from IPVS (e.g., cloud metadata)
  strictARP: true             # Required for MetalLB; sets arp_ignore=1 and arp_announce=2
# --- iptables settings ---
iptables:
  masqueradeAll: false        # SNAT all traffic (not just pod-to-Service)
  masqueradeBit: 14           # Bit in fwmark used to identify traffic for masquerade
  syncPeriod: "30s"
  minSyncPeriod: "1s"
# --- nftables settings ---
nftables:
  masqueradeAll: false
  masqueradeBit: 14
  syncPeriod: "30s"
  minSyncPeriod: "1s"
# --- API server connection ---
clientConnection:
  kubeconfig: "/var/lib/kube-proxy/kubeconfig.conf"
  acceptContentTypes: ""
  contentType: "application/vnd.kubernetes.protobuf"
  qps: 10
  burst: 20                   # Increase for large clusters with frequent endpoint churn
# --- Ports ---
metricsBindAddress: "127.0.0.1:10249"  # Prometheus metrics
healthzBindAddress: "0.0.0.0:10256"    # Health check
# --- Feature gates ---
featureGates:
  TopologyAwareHints: true
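The masqueradeBit setting is easier to recognize in rule dumps once it is converted to the packet mark it produces. A minimal sketch of that conversion, with a hypothetical helper name:

```python
def fwmark_for_bit(bit):
    """Return the 32-bit packet mark with only the given bit set."""
    if not 0 <= bit <= 31:
        raise ValueError("fwmark bit must be in the range 0-31")
    return 1 << bit

mark = fwmark_for_bit(14)
print(f"{mark:#x}/{mark:#x}")  # 0x4000/0x4000
```

With masqueradeBit: 14, this mark/mask pair is the value to look for when searching the masquerade-related rules in an iptables dump.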
kube-proxy DaemonSet
kube-proxy runs as a DaemonSet in kube-system. It requires elevated host privileges to program kernel networking rules:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-proxy
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        k8s-app: kube-proxy
    spec:
      priorityClassName: system-node-critical
      tolerations:
        - operator: Exists        # Run on ALL nodes including control plane
      hostNetwork: true           # Needs host network namespace for iptables
      serviceAccountName: kube-proxy
      containers:
        - name: kube-proxy
          image: registry.k8s.io/kube-proxy:v1.29.3
          command:
            - /usr/local/bin/kube-proxy
            - --config=/var/lib/kube-proxy/config.conf
            - --hostname-override=$(NODE_NAME)
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            privileged: true      # Required for iptables/IPVS/nftables
          volumeMounts:
            - mountPath: /var/lib/kube-proxy
              name: kube-proxy
            - mountPath: /run/xtables.lock
              name: xtables-lock
              readOnly: false
            - mountPath: /lib/modules
              name: lib-modules
              readOnly: true
      volumes:
        - name: kube-proxy
          configMap:
            name: kube-proxy
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
            type: FileOrCreate
        - name: lib-modules
          hostPath:
            path: /lib/modules
The /run/xtables.lock volume mount ensures kube-proxy and any other process (e.g., a CNI plugin) that modifies iptables use the same lock file. Without this, concurrent iptables modifications can cause partial rule application or iptables: Resource temporarily unavailable errors.
Prometheus Metrics
| Metric | Type | Description |
|---|---|---|
| kubeproxy_sync_proxy_rules_duration_seconds | Histogram | Time to sync all proxy rules. Alert if p99 > 10s. |
| kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds | Gauge | When the last sync was triggered. Large lag = kube-proxy falling behind. |
| kubeproxy_sync_proxy_rules_iptables_total | Gauge | Total number of iptables rules currently programmed (iptables mode) |
| kubeproxy_sync_proxy_rules_iptables_restore_failures_total | Counter | Failed iptables-restore calls. Any increase means rules are not being applied. |
| kubeproxy_network_programming_duration_seconds | Histogram | End-to-end latency from endpoint change detected to rules programmed |
Alerting Rules
groups:
  - name: kube-proxy
    rules:
      - alert: KubeProxyRuleSyncLatencyHigh
        expr: |
          histogram_quantile(0.99,
            rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "kube-proxy rule sync p99 > 10s on {{ $labels.instance }}"
          description: "May indicate iptables rule count too high; consider IPVS or eBPF mode"
      - alert: KubeProxyIptablesRestoreFailure
        expr: increase(kubeproxy_sync_proxy_rules_iptables_restore_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "iptables-restore failing on {{ $labels.instance }}: Service rules not being applied"
      - alert: KubeProxyNotRunning
        expr: kube_daemonset_status_number_ready{daemonset="kube-proxy"} < kube_daemonset_status_desired_number_scheduled{daemonset="kube-proxy"}
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "kube-proxy not running on all nodes: Service connectivity broken"
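To see what the first alert's histogram_quantile expression evaluates, the sketch below reimplements the usual bucket interpolation for classic histograms: find the bucket containing the target rank, then interpolate linearly inside it. It is an approximation for illustration, not Prometheus' code, and the bucket boundaries and counts are invented.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
             and ending with (math.inf, total_count)
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count

# Hypothetical sync-duration buckets (seconds, cumulative observations).
buckets = [(1, 900), (5, 980), (10, 995), (math.inf, 1000)]
p99 = histogram_quantile(0.99, buckets)
print(round(p99, 2), p99 > 10)  # 8.33 False
```

With these sample buckets the p99 lands inside the 5s to 10s bucket, so the > 10 threshold would not fire.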
Troubleshooting Runbooks
Runbook 1: Service ClusterIP not reachable from pod
# 1. Verify Service and endpoints exist
kubectl get svc my-service
kubectl get endpoints my-service
# If ENDPOINTS shows <none>: the selector doesn't match any ready pods
# 2. Verify kube-proxy is running on target node
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
kubectl logs -n kube-system kube-proxy-xxxxx --tail=50
# 3. Check iptables rules for the ClusterIP
SVC_IP=$(kubectl get svc my-service -o jsonpath='{.spec.clusterIP}')
iptables -t nat -L KUBE-SERVICES -n | grep $SVC_IP
# If missing: kube-proxy hasn't synced yet or is failing
# 4. Check IPVS (if in IPVS mode)
ipvsadm -Ln | grep -A5 $SVC_IP
# 5. Test connectivity directly to pod IP (bypassing Service)
POD_IP=$(kubectl get endpoints my-service -o jsonpath='{.subsets[0].addresses[0].ip}')
kubectl exec -it test-pod -- curl http://$POD_IP:8080
# 6. Check kube-proxy sync errors
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep -i error | tail -20
Runbook 2: NodePort not accessible from outside cluster
# 1. Get the NodePort
kubectl get svc my-service -o jsonpath='{.spec.ports[0].nodePort}'
# e.g. 30080
# 2. Verify KUBE-NODEPORTS chain has the rule
iptables -t nat -L KUBE-NODEPORTS -n | grep 30080
# Should see: tcp dpt:30080 -> KUBE-SVC-...
# 3. Check firewall / security groups (cloud)
# The node's firewall must allow TCP/UDP 30000-32767 from external
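# 4. Verify which addresses kube-proxy serves NodePorts on
kubectl get cm -n kube-system kube-proxy -o yaml | grep -e bindAddress -e nodePortAddresses
# 5. Test from inside the node (select the InternalIP explicitly;
#    the first listed address is not guaranteed to be an IP)
NODE_IP=$(kubectl get node worker-1 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
curl http://$NODE_IP:30080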
# 6. If externalTrafficPolicy: Local, verify local endpoints exist
kubectl get endpoints my-service -o yaml | grep -A5 nodeName
Runbook 3: iptables rule sync is slow — kube-proxy lagging
# Symptom: new Services take minutes to become reachable
# or kubeproxy_sync_proxy_rules_duration_seconds p99 > 10s
# 1. Count current iptables rules
iptables-save | wc -l
iptables-save | grep -c KUBE
# If > 10000 rules: time to consider IPVS or eBPF
# 2. Check Services with many endpoints (each endpoint adds its own chain and rules)
# Emits "count<TAB>namespace<TAB>name" so that sort can order numerically
kubectl get endpoints --all-namespaces -o json | jq -r '
.items[] | [
([.subsets[]? | .addresses[]?] | length),
.metadata.namespace,
.metadata.name
] | @tsv' | sort -rn | head -10
# 3. Switch to IPVS mode
# Edit kube-proxy ConfigMap:
kubectl edit cm kube-proxy -n kube-system
# Change: mode: "ipvs"
# Add: ipvs.strictARP: true (if using MetalLB)
# 4. Restart kube-proxy DaemonSet to apply
kubectl rollout restart daemonset/kube-proxy -n kube-system
# 5. Clean up legacy iptables rules after switching to IPVS
# WARNING: this rewrites the node's entire ruleset; review carefully before running in production
iptables-save | grep -v KUBE | iptables-restore
Runbook 4: Source IP lost — need to preserve client IP
# Symptom: application sees node IP instead of client IP
# 1. Check current externalTrafficPolicy
kubectl get svc my-service -o jsonpath='{.spec.externalTrafficPolicy}'
# Returns: Cluster (source IP is SNAT'd)
# 2. Change to Local to preserve source IP
kubectl patch svc my-service \
-p '{"spec":{"externalTrafficPolicy":"Local"}}'
# 3. Verify the Service has local endpoints on target nodes
kubectl get endpoints my-service -o yaml | grep nodeName
# 4. Configure your cloud LB health check to respect Local policy
# Most cloud LBs will stop routing to nodes with 0 local endpoints
# Health check: GET /healthz on the Service's spec.healthCheckNodePort
# Kubernetes allocates this port automatically for LoadBalancer Services with externalTrafficPolicy=Local
# 5. To keep in-cluster traffic on the caller's node (internalTrafficPolicy, GA in 1.26):
kubectl patch svc my-service \
-p '{"spec":{"internalTrafficPolicy":"Local"}}'
Runbook 5: IPVS conntrack table exhaustion
# Symptom: connections randomly fail; dmesg shows "nf_conntrack: table full, dropping packet"
# 1. Check current conntrack usage
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# If count approaching max: need to increase max
# 2. Increase conntrack max (immediately, not persistent)
sysctl -w net.netfilter.nf_conntrack_max=2097152
sysctl -w net.core.netdev_max_backlog=250000
# 3. Make persistent
cat >> /etc/sysctl.d/99-conntrack.conf << 'EOF'
net.netfilter.nf_conntrack_max = 2097152
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
EOF
sysctl --system
# 4. Tune IPVS timeouts to reduce stale entries
kubectl edit cm kube-proxy -n kube-system
# Set: ipvs.tcpTimeout: "900s" (15 minutes, the same as the kernel default; lower it to expire idle entries sooner)
# ipvs.tcpFinTimeout: "30s"
# 5. Monitor conntrack with Prometheus
# node_nf_conntrack_entries vs node_nf_conntrack_entries_limit
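The check in step 1 boils down to a ratio between the two counters. A small sketch of that classification follows; the helper name and the 80% warning threshold are assumptions for illustration, not values from kube-proxy or the kernel.

```python
def conntrack_status(count, maximum, warn_ratio=0.8):
    """Classify conntrack table usage from nf_conntrack_count and nf_conntrack_max.

    warn_ratio is an illustrative threshold, not a kernel or kube-proxy default.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    ratio = count / maximum
    if ratio >= 1.0:
        return "full"     # new connections are being dropped
    if ratio >= warn_ratio:
        return "warning"  # approaching the limit; raise nf_conntrack_max
    return "ok"

print(conntrack_status(60000, 65536))    # warning
print(conntrack_status(60000, 1048576))  # ok
```

The same ratio can be alerted on in Prometheus by dividing node_nf_conntrack_entries by node_nf_conntrack_entries_limit.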
kube-proxy Alternatives
| Alternative | Mechanism | Advantage over kube-proxy | Requirement |
|---|---|---|---|
| Cilium (kube-proxy replacement) | eBPF programs on tc/XDP hooks | O(1) lookup, no conntrack for Service traffic, lower latency | Kernel 4.19+; Cilium CNI installed |
| Calico eBPF | eBPF | Same as Cilium; integrates with Calico networking | Calico CNI in eBPF mode |
| kube-router | IPVS + BGP | Combines kube-proxy (IPVS) + CNI + BGP in one binary | Replaces kube-proxy DaemonSet entirely |
| MetalLB (speaker) | BGP/L2 advertisement | Handles LoadBalancer type on bare metal (not a kube-proxy replacement) | Bare metal or non-cloud clusters |
Production Best Practices
- Use IPVS mode for clusters with >1000 Services. iptables rule count grows with the total number of endpoints across all Services, and rule traversal is O(n). IPVS uses hash tables (O(1)) and handles 100,000+ Services without degradation. Verify the kernel modules are loaded on all nodes before switching.
- Set strictARP: true in IPVS mode when using MetalLB or kube-vip. Without it, ARP requests for Service IPs may be answered by the wrong interface, breaking L2 advertisement.
- Tune the conntrack table size on IPVS nodes. IPVS still uses conntrack for NAT. A max of 65536, a common default, is exhausted quickly on busy nodes. Start with nf_conntrack_max = 1048576 and monitor node_nf_conntrack_entries.
- Use externalTrafficPolicy: Local for services where client IP matters (rate limiting by IP, geo-routing). Accept that this requires the load balancer to health-check nodes and stop routing to nodes with no local endpoints.
- Consider replacing kube-proxy with Cilium's kube-proxy replacement for new clusters. eBPF-based Service implementation has measurably lower latency (no conntrack NAT for service traffic) and is simpler to operate, with one less DaemonSet to manage.
- Monitor kubeproxy_sync_proxy_rules_duration_seconds p99. Values above 1s in iptables mode indicate the rule table is growing too large. Alert at 5s; act at 10s.
- Set minSyncPeriod to at least 1s. Without a minimum sync period, kube-proxy re-programs all rules for every single endpoint change event, causing CPU spikes during rolling deployments.
- Use Topology Aware Routing (service.kubernetes.io/topology-mode: auto) in multi-zone clusters to reduce cross-zone traffic costs, but verify the endpoint distribution is balanced enough for hints to be active.
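The effect of minSyncPeriod on sync frequency can be estimated with a toy model. It assumes a sync runs as soon as a change arrives, but never sooner than minSyncPeriod after the previous sync, and that every change arriving before a scheduled sync is folded into it. The function name and the event pattern are made up.

```python
def count_syncs(event_times, min_sync_period):
    """Count rule syncs triggered by a series of endpoint-change timestamps (seconds)."""
    syncs = 0
    last_sync = float("-inf")
    for t in sorted(event_times):
        if t <= last_sync:
            continue  # folded into a sync already scheduled at or after t
        # Run now, or as soon as the minimum period since the last sync allows.
        last_sync = max(t, last_sync + min_sync_period)
        syncs += 1
    return syncs

# Hypothetical rolling deployment: one endpoint change every 100 ms for 10 s.
events = [i / 10 for i in range(100)]
print(count_syncs(events, 0.0))  # 100
print(count_syncs(events, 1.0))  # 11
```

In this model, a 1s minimum turns one hundred full syncs into eleven, at the cost of up to one second of extra programming latency per change.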