Load Balancing
Kubernetes load balancing operates at multiple layers simultaneously — from iptables/IPVS ClusterIP distribution inside the cluster to cloud provider L4 NLBs and L7 Ingress controllers at the edge. This page covers every layer, how they interact, and how to tune them for production.
What This Page Covers
- Load balancing layer model: L3/L4 kube-proxy, L7 Ingress/Gateway API, cloud LB
- ClusterIP load balancing: iptables random distribution, IPVS scheduling algorithms
- Session affinity: ClientIP-based, timeoutSeconds; limitations
- Service type LoadBalancer: CCM provisioning lifecycle, status.loadBalancer.ingress
- externalTrafficPolicy: Local vs Cluster — source IP preservation and uneven load tradeoffs
- internalTrafficPolicy: Local vs Cluster — node-local traffic optimization
- Topology-Aware Routing (topology-mode annotation since 1.27): zone hints, auto vs disabled annotation
- Traffic Distribution (beta 1.31, GA 1.33): PreferClose value of spec.trafficDistribution, replacing topology hints
- AWS NLB: IP vs Instance target types, cross-zone load balancing, health checks, annotations reference
- AWS Classic ELB vs NLB vs ALB — when to use each from Kubernetes
- GCE/GKE: L4 regional vs global LB, BackendConfig CRD, NEGs (Network Endpoint Groups)
- Azure: Standard Load Balancer, internal LB, annotations
- MetalLB: BGP mode and L2 mode, IPAddressPool, L2Advertisement, BGPAdvertisement CRDs
- kube-vip: ARP/BGP, control plane HA and service VIP
- Cilium LB IPAM: replacing cloud CCM for bare metal
- NodePort: port range, externalIPs, hairpin mode
- ECMP and BGP-based load balancing at the network layer
- Connection draining: terminationGracePeriodSeconds, preStop hooks, EndpointSlice terminating condition
- Health checks: readiness probes gate endpoint inclusion; custom health check annotations per cloud
- 7 metrics + 4 alerting rules + 5 troubleshooting runbooks
- 8 best practices for production load balancing
Load Balancing Layer Model
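Kubernetes load balancing is not a single mechanism. It is a stack of complementary layers, each operating at a different network level: kube-proxy at L3/L4 for ClusterIP traffic inside the cluster, Ingress or Gateway API controllers at L7, and the cloud provider's load balancer at the edge.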
These layers are independent and composable. A typical production setup uses all three: a cloud NLB → NGINX Ingress (ClusterIP) → application Service (ClusterIP) → pods.
ClusterIP Load Balancing
ClusterIP is the most common service type. kube-proxy translates a virtual ClusterIP:Port to a real pod IP:Port via DNAT rules. For details on iptables chains and IPVS virtual servers, see 03-kube-proxy-internals.html. This section focuses on the load balancing behavior.
iptables: Random Distribution
In iptables mode, kube-proxy uses the statistic iptables module to implement per-connection random load balancing. For a service with N endpoints, the probability for each endpoint is set so that each gets an equal share:
# For a 3-endpoint service (probability math):
# Rule 1: 1/3 probability → KUBE-SEP-AAAA (pod-1)
# Rule 2: 1/2 probability → KUBE-SEP-BBBB (pod-2) (of remaining 2/3)
# Rule 3: 1/1 probability → KUBE-SEP-CCCC (pod-3) (all remaining)
# Net result: each pod gets ~33% of connections
# View the rules for a service:
sudo iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXXX -n --line-numbers
# Chain KUBE-SVC-XXXX (1 references)
# num target prot opt source destination
# 1 KUBE-SEP-AA all -- 0.0.0.0/0 0.0.0.0/0 statistic mode random probability 0.33333
# 2 KUBE-SEP-BB all -- 0.0.0.0/0 0.0.0.0/0 statistic mode random probability 0.50000
# 3 KUBE-SEP-CC all -- 0.0.0.0/0 0.0.0.0/0
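The cascading probabilities above can be checked with a few lines of arithmetic. The following is a minimal sketch, not kube-proxy code; the function names `rule_probabilities` and `effective_shares` are made up for illustration. It shows that giving rule i a match probability of 1/(N - i) yields an equal overall share for every endpoint.

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list[Fraction]:
    """Per-rule match probability for n endpoints: rule i matches with 1/(n - i)."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_shares(rule_probs: list[Fraction]) -> list[Fraction]:
    """Overall share of new connections that each endpoint receives."""
    shares = []
    remaining = Fraction(1)  # probability that no earlier rule matched
    for p in rule_probs:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


probs = rule_probabilities(3)
shares = effective_shares(probs)
print([str(p) for p in probs])   # ['1/3', '1/2', '1']
print([str(s) for s in shares])  # ['1/3', '1/3', '1/3']
```

The last rule always has probability 1, which is why the final KUBE-SEP rule in the chain carries no statistic match at all.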
iptables DNAT is applied at connection establishment. All packets in a TCP connection go to the same pod — there is no request-level load balancing. For HTTP/1.1 with persistent connections or HTTP/2 multiplexing, a single long-lived connection can send thousands of requests to one pod while others are idle. Use HTTP/2-aware proxies (Envoy, NGINX) for per-request load balancing.
IPVS: Scheduling Algorithms
IPVS mode provides 7 scheduling algorithms and maintains a proper connection table (not just iptables rules). See 03-kube-proxy-internals.html for full IPVS configuration. The scheduler is set globally via KubeProxyConfiguration.ipvs.scheduler:
| Scheduler | Algorithm | Best For |
|---|---|---|
| rr | Round Robin | Default; uniform request cost; even distribution |
| lc | Least Connection | Variable-cost requests; routes to pod with fewest active connections |
| dh | Destination Hashing | Cache affinity; same destination IP always goes to same backend |
| sh | Source Hashing | Client affinity; same source IP always goes to same backend |
| sed | Shortest Expected Delay | Weighted least-connection variant |
| nq | Never Queue | Always routes to idle server first; SED otherwise |
| wrr | Weighted Round Robin | Heterogeneous backend capacity; kube-proxy gives every endpoint the same weight, so in practice it behaves like rr |
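To make the difference between rr and lc in the table concrete, here is a toy model of the two selection rules. It is a sketch under simplifying assumptions (no weights, no connection expiry) and does not reproduce the IPVS kernel implementation; the pod names and connection counts are hypothetical.

```python
from itertools import cycle


def round_robin(backends):
    """Yield backends in a fixed rotation, ignoring their current load."""
    return cycle(backends)


def least_connection(active):
    """Return the backend with the fewest active connections (ties: first listed)."""
    return min(active, key=active.get)


# Hypothetical state: pod-a is holding many long-lived connections.
active = {"pod-a": 40, "pod-b": 3, "pod-c": 4}

rr = round_robin(list(active))
rr_picks = [next(rr) for _ in range(3)]

lc_picks = []
for _ in range(3):
    choice = least_connection(active)
    active[choice] += 1  # the new connection now counts against that backend
    lc_picks.append(choice)

print(rr_picks)  # ['pod-a', 'pod-b', 'pod-c']
print(lc_picks)  # ['pod-b', 'pod-b', 'pod-c']
```

Round robin keeps sending every third connection to the already busy pod-a, while least connection steers new connections away from it until the counts even out.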
Session Affinity
Kubernetes supports ClientIP-based session affinity: connections from the same source IP always go to the same pod. kube-proxy enforces it in whichever proxy mode is active (the iptables recent module in iptables mode, persistent connections in IPVS mode). It is enabled per Service:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - port: 80
  sessionAffinity: ClientIP   # None (default) or ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800   # 3 hours (default); max 86400
- Source IP is the node IP for traffic through NodePort/cloud LB: all traffic from a cloud LB appears to come from a small set of node IPs, causing all sessions to pin to a few pods. Use externalTrafficPolicy: Local to preserve the real client IP (at the cost of uneven distribution).
- Not HTTP cookie-based: Kubernetes session affinity is purely L4 IP-based. For cookie-based sticky sessions, use Ingress/Gateway API annotations (nginx: nginx.ingress.kubernetes.io/affinity: cookie).
- Does not survive pod restarts: if the pinned pod is replaced, the affinity breaks and clients are redistributed.
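The first limitation is easy to see in a toy model. The sketch below is not how kube-proxy stores affinity (it keeps per-source state rather than hashing); it only assumes that the same source IP always maps to the same pod. The IP addresses and the `pick_pod` helper are hypothetical.

```python
import hashlib


def pick_pod(source_ip: str, pods: list[str]) -> str:
    """Toy stand-in for ClientIP affinity: one source IP always maps to one pod."""
    digest = hashlib.sha256(source_ip.encode()).digest()
    return pods[int.from_bytes(digest[:8], "big") % len(pods)]


pods = [f"pod-{i}" for i in range(10)]
clients = [f"198.51.100.{i}" for i in range(1, 201)]  # 200 distinct client IPs
node_ips = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]    # what pods see after SNAT

# Client IP preserved: affinity keys on 200 distinct addresses.
pods_direct = {pick_pod(ip, pods) for ip in clients}

# Client IP replaced by a node IP: affinity keys on at most 3 addresses.
pods_snat = {pick_pod(node_ips[i % len(node_ips)], pods) for i in range(len(clients))}

print(len(pods_direct), "pods used with real client IPs")
print(len(pods_snat), "pods used when only node IPs are visible")
```

With only three visible source addresses, at most three of the ten pods can ever receive traffic, no matter how many real clients there are.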
externalTrafficPolicy
Controls how NodePort and LoadBalancer services handle external traffic at the node level. This is one of the most impactful load balancing settings in production:
| Policy | Behavior | Source IP | Tradeoff |
|---|---|---|---|
| Cluster (default) | Traffic forwarded to any healthy pod cluster-wide, even on other nodes (extra hop) | Lost: SNAT replaces client IP with node IP | Even distribution; any node can serve; extra hop possible |
| Local | Traffic only forwarded to pods on the receiving node; drops if no local pods | Preserved: no SNAT; real client IP visible | Uneven distribution if pods not evenly spread; node must have local pod |
apiVersion: v1
kind: Service
metadata:
  name: my-lb-service
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve client IP; no SNAT
  selector:
    app: my-app
  ports:
  - port: 443
    targetPort: 8443
When externalTrafficPolicy: Local is set on a LoadBalancer Service, the API server allocates a healthCheckNodePort from the NodePort range (default 30000–32767) and kube-proxy serves a health endpoint on it. Cloud load balancers probe this port to check whether a node has ready local pods before sending traffic. Nodes without local pods return HTTP 503, so the LB skips them. AWS NLB, GCE, and Azure all support this automatically.
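The uneven-load tradeoff of Local can be worked through with simple arithmetic. This sketch assumes the cloud LB spreads connections evenly across the nodes that pass the health check, which is a simplification; the node names and pod counts are hypothetical.

```python
from fractions import Fraction


def per_pod_share(pods_per_node: dict[str, int]) -> dict[str, Fraction]:
    """Share of total traffic each pod on a node receives under Local policy.

    Assumes the LB splits traffic evenly across nodes that have at least one
    ready local pod (the nodes that pass the healthCheckNodePort probe).
    """
    healthy = {node: n for node, n in pods_per_node.items() if n > 0}
    node_share = Fraction(1, len(healthy))
    return {node: node_share / n for node, n in healthy.items()}


# Hypothetical placement: 4 pods on 3 nodes, one node has none.
shares = per_pod_share({"node-a": 1, "node-b": 3, "node-c": 0})
for node, share in shares.items():
    print(f"{node}: each pod gets {share} of total traffic")
# node-a: each pod gets 1/2 of total traffic
# node-b: each pod gets 1/6 of total traffic
```

The single pod on node-a carries three times the load of each pod on node-b, which is why Local is usually paired with an even pod spread.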
internalTrafficPolicy
Controls how ClusterIP traffic is handled for connections originating inside the cluster:
spec:
  internalTrafficPolicy: Local   # route only to pods on same node as client
                                 # Cluster (default) = any pod anywhere
internalTrafficPolicy: Local is useful for node-local caching services (e.g., a per-node cache DaemonSet) — clients always hit the local replica without a network hop. If no local pod exists, the connection is dropped.
Topology-Aware Routing & Traffic Distribution
Topology-Aware Routing (topology-mode annotation since 1.27)
Topology-Aware Routing uses zone hints on EndpointSlices to prefer in-zone endpoints, reducing cross-zone traffic costs and latency. It is activated per-service with an annotation:
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/topology-mode: "auto"   # auto | disabled
spec:
  selector:
    app: my-app
  ports:
  - port: 80
When set to auto, the EndpointSlice controller adds hints.forZones entries to each endpoint. kube-proxy then prefers endpoints hinted for the zone of its own node. The controller assigns hints only when its safeguards pass; otherwise traffic falls back to cluster-wide routing. Conditions for auto to activate:
- All nodes have the topology.kubernetes.io/zone label set
- Endpoints can be allocated across zones roughly in proportion to each zone's allocatable CPU
- Every endpoint has a zone and is Ready
- There are at least as many endpoints as zones
Traffic Distribution (beta 1.31, GA 1.33)
Traffic Distribution replaces Topology-Aware Routing with a cleaner API — spec.trafficDistribution on the Service object itself:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - port: 80
  trafficDistribution: PreferClose   # prefer topologically close endpoints
                                     # falls back to global if none available locally
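spec.trafficDistribution: PreferClose (beta in 1.31, GA in 1.33) is the preferred way to express topology preference for new clusters. The annotation service.kubernetes.io/topology-mode: auto continues to work but is considered legacy. PreferClose is simpler and more predictable: traffic stays in the client's zone whenever that zone has a ready endpoint and falls back to cluster-wide endpoints only when it has none, so traffic is never dropped for lack of a local endpoint. It does not apply the proportional-allocation safeguards of the annotation, so a zone with few pods and many clients can be overloaded.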
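The selection rule behind PreferClose can be sketched in a few lines. This is a simplified model of the behavior described above, not kube-proxy's implementation; the `candidate_endpoints` helper, the zone names, and the IPs are illustrative assumptions.

```python
import random


def candidate_endpoints(endpoints, client_zone):
    """Endpoints eligible for a new connection under a PreferClose-style policy."""
    same_zone = [ep for ep in endpoints if ep["zone"] == client_zone]
    # Fall back to every endpoint only when the client's zone has none.
    return same_zone or list(endpoints)


endpoints = [
    {"ip": "10.0.1.5", "zone": "us-east-1a"},
    {"ip": "10.0.1.6", "zone": "us-east-1a"},
    {"ip": "10.0.2.7", "zone": "us-east-1b"},
]

in_zone = candidate_endpoints(endpoints, "us-east-1a")
fallback = candidate_endpoints(endpoints, "us-east-1c")
print([ep["ip"] for ep in in_zone])    # ['10.0.1.5', '10.0.1.6']
print(len(fallback))                   # 3
print(random.choice(in_zone)["zone"])  # us-east-1a
```

A client in us-east-1b would be served only by 10.0.2.7 in this model, which illustrates how a zone with a single pod can become a hotspot.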
Service type=LoadBalancer
When you create a Service with type: LoadBalancer, the Cloud Controller Manager (CCM) watches the Service and provisions a cloud load balancer. The provisioned LB address is written back into status.loadBalancer.ingress:
kubectl get svc my-lb-service -o yaml
# status:
#   loadBalancer:
#     ingress:
#     - hostname: abc123.us-east-1.elb.amazonaws.com   # AWS NLB
#     # OR for GCP/Azure:
#     - ip: 203.0.113.42

# Track provisioning:
kubectl get events --field-selector involvedObject.name=my-lb-service
#   EnsuredLoadBalancer → LB provisioned
#   UpdatedLoadBalancer → LB updated (port change, etc.)
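For scripts that wait on provisioning, the status field can be read from the JSON form of the Service. The sketch below works on an abbreviated, hand-written sample of `kubectl get svc my-lb-service -o json` output; the `lb_addresses` helper is hypothetical.

```python
import json


def lb_addresses(service: dict) -> list[str]:
    """Return hostnames/IPs from status.loadBalancer.ingress; empty while pending."""
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    return [entry.get("hostname") or entry.get("ip") for entry in ingress]


# Abbreviated sample documents (only the status subtree is shown).
provisioned = json.loads(
    '{"status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.42"}]}}}'
)
pending = json.loads('{"status": {"loadBalancer": {}}}')

print(lb_addresses(provisioned))  # ['203.0.113.42']
print(lb_addresses(pending))      # []
```

An empty list corresponds to the `<pending>` state shown by `kubectl get svc`.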
AWS Network Load Balancer (NLB)
AWS provides three LB types accessible from Kubernetes. The recommended path for TCP/UDP is NLB via the AWS Load Balancer Controller.
| Type | Layer | Kubernetes Integration | Use Case |
|---|---|---|---|
| Classic ELB | L4 (TCP) or L7 (HTTP) | Legacy in-tree CCM; deprecated | Legacy only; use NLB instead |
| NLB | L4 (TCP/UDP/TLS) | AWS Load Balancer Controller OR in-tree CCM | TCP services, TLS passthrough, static IP |
| ALB | L7 (HTTP/HTTPS) | AWS Load Balancer Controller (Ingress) | HTTP routing — use Ingress, not Service type=LB |
NLB Target Types
target-type: instance (default)
- NLB targets: EC2 instances (nodes)
- Traffic: NLB → NodePort → kube-proxy → pod
- Source IP: lost when kube-proxy SNATs the traffic (externalTrafficPolicy: Cluster); preserved with Local
- Works with any CNI
- Extra hop inside cluster
target-type: ip (recommended)
- NLB targets: pod IPs directly
- Traffic: NLB → pod (no kube-proxy hop)
- Source IP: preserved when preserve_client_ip.enabled=true is set on the target group (or via Proxy Protocol v2)
- Requires AWS VPC CNI (pod IP in VPC)
- No extra hop; lower latency
NLB Annotations Reference
apiVersion: v1
kind: Service
metadata:
  name: my-nlb
  annotations:
    # Controller selection
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"   # ip | instance
    # Scheme
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"   # or internal
    # Cross-zone load balancing (default: false for NLB)
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    # Target group attributes: connection draining and client IP preservation
    # (client IP preservation applies to target-type: ip).
    # An annotation key may appear only once, so both attributes share one
    # comma-separated value.
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=30,preserve_client_ip.enabled=true"
    # Health check
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "TCP"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    # TLS termination at NLB
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:us-east-1:123:certificate/abc"
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    # Access logs
    service.beta.kubernetes.io/aws-load-balancer-access-log-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name: "my-lb-logs"
    # Dual-stack
    service.beta.kubernetes.io/aws-load-balancer-ip-address-type: "dualstack"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: my-app
  ports:
  - port: 443
    targetPort: 8443
    protocol: TCP
GCE / GKE Load Balancing
L4 Regional Load Balancer
# Default Service type=LoadBalancer on GKE creates a regional External L4 LB
# For internal LB (no public IP):
metadata:
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
Network Endpoint Groups (NEGs)
NEGs allow GKE to route directly to pod IPs instead of NodePort, similar to AWS target-type: ip. NEGs are used automatically when:
- Using container-native load balancing via BackendConfig
- Using GKE Ingress (automatically provisions NEG-backed backends)
- Annotation cloud.google.com/neg: '{"ingress": true}' on the Service
# Enable NEG for a service:
metadata:
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80": {}}}'
# BackendConfig CRD: custom health check, CDN, IAP, timeout
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-backend-config
spec:
  healthCheck:
    checkIntervalSec: 15
    port: 8080
    type: HTTP
    requestPath: /healthz
  timeoutSec: 30              # backend response timeout
  connectionDraining:
    drainingTimeoutSec: 60
  cdn:
    enabled: true
    cachePolicy:
      includeHost: true
      includeProtocol: true
Azure Load Balancing
# External Standard Load Balancer (default):
spec:
  type: LoadBalancer
# Internal Load Balancer (no public IP):
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "my-subnet"
# Custom frontend IP (static):
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-ipv4: "10.0.0.100"
# High-availability ports (internal Standard LB: one rule forwards all ports and protocols):
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-enable-high-availability-ports: "true"
# Health probe protocol:
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: "https"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"
MetalLB — Bare Metal Load Balancing
MetalLB gives bare metal clusters (no cloud provider) the ability to provision LoadBalancer services by advertising VIPs via BGP or ARP (L2). It runs as a DaemonSet (speaker) and a Deployment (controller).
L2 Mode (ARP/NDP)
- One node "owns" the VIP and responds to ARP requests
- Failover: leader election + gratuitous ARP on takeover (~10s)
- No ECMP — all traffic to one node (single point of ingress)
- Works on any network without BGP router support
- Good for: small clusters, home labs, on-premises without BGP
BGP Mode
- All nodes advertise the VIP; router does ECMP across all nodes
- True equal-cost multi-path distribution
- Requires BGP-capable ToR/spine router
- Session per node to router(s); fast failover via BGP withdraw
- Good for: production bare metal, data center deployments
MetalLB v0.13+ CRD Configuration
# Step 1: Define an IP address pool
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.10.0/24   # IPv4 pool for LB services
  - fd00:10::/120     # IPv6 pool (dual-stack); example prefix, must be valid hex
  autoAssign: true    # auto-assign from pool; false = manual annotation
---
# Step 2a: L2 advertisement
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-advert
  namespace: metallb-system
spec:
  ipAddressPools:
  - production-pool
  nodeSelectors:      # which nodes can own the VIP
  - matchLabels:
      kubernetes.io/os: linux
---
# Step 2b: BGP advertisement (alternative to L2)
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgp-advert
  namespace: metallb-system
spec:
  ipAddressPools:
  - production-pool
  communities:
  - 65000:1               # BGP community tag
  aggregationLength: 32   # /32 per VIP (not aggregated)
  localPref: 100
---
# Step 3: BGP peer configuration
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: spine-router
  namespace: metallb-system
spec:
  myASN: 64512     # MetalLB ASN
  peerASN: 65001   # Router ASN
  peerAddress: 192.168.1.1
  keepaliveTime: 30s
  holdTime: 90s
---
# Step 4: Service (MetalLB auto-assigns from pool)
apiVersion: v1
kind: Service
metadata:
  name: my-lb-service
  annotations:
    # optional: pin to specific pool
    # (older releases use the metallb.universe.tf/address-pool key)
    metallb.io/address-pool: "production-pool"
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
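Pool exhaustion is a common reason for a LoadBalancer Service to stay pending, so it helps to know how many VIPs a pool can hold. The sketch below uses only the standard library; the helper names and the list of assigned addresses are hypothetical, and it assumes every address in the CIDR is assignable (MetalLB can be told to skip .0 and .255 with the pool's avoidBuggyIPs setting, which this sketch ignores).

```python
import ipaddress


def pool_capacity(cidrs):
    """Total number of addresses covered by the given CIDR ranges."""
    return sum(ipaddress.ip_network(c).num_addresses for c in cidrs)


def addresses_remaining(cidrs, assigned):
    """Addresses left in the pool after removing assigned ones that fall inside it."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    used = {ipaddress.ip_address(a) for a in assigned}
    in_pool = {a for a in used if any(a in n for n in nets)}
    return pool_capacity(cidrs) - len(in_pool)


ipv4_pool = ["192.168.10.0/24"]
print(pool_capacity(ipv4_pool))  # 256
# 10.0.0.1 lies outside the pool and is not counted against it.
print(addresses_remaining(ipv4_pool, ["192.168.10.1", "192.168.10.2", "10.0.0.1"]))  # 254
```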
kube-vip
kube-vip provides both control plane HA (API server VIP) and service LoadBalancer VIPs using ARP (L2) or BGP. It is often used alongside or instead of MetalLB on bare metal clusters.
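# kube-vip in BGP mode for services
# Illustrative sketch only: kube-vip is normally configured through environment
# variables on its DaemonSet or static Pod manifest (for example bgp_enable,
# bgp_as, bgp_peeraddress, bgp_peeras, svc_enable, vip_interface). Check the
# field names below against the kube-vip documentation for your version.
# ConfigMap configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-vip
  namespace: kube-system
data:
  config: |
    localAS: 64512
    bgpConfig:
    - peerAddress: 192.168.1.1
      peerAS: 65001
    sourceIF: eth0
    enableServicesElection: true
    vipInterface: eth0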
NodePort Deep Dive
NodePort opens a port (default range: 30000–32767) on every node. External traffic to any node's IP on that port reaches the service:
apiVersion: v1
kind: Service
metadata:
  name: my-nodeport
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - port: 80           # ClusterIP port
    targetPort: 8080   # Pod container port
    nodePort: 30080    # explicit NodePort (optional; auto-assigned if omitted)

# Expand NodePort range (kube-apiserver flag):
#   --service-node-port-range=30000-32767   (default)
# Can widen to include standard ports: --service-node-port-range=80-32767
# CAUTION: a wider range can collide with ports already used on the host
# (SSH, kubelet, node-local daemons); avoid overlapping system ports
externalIPs
spec.externalIPs binds a service to specific IP addresses that are owned by cluster nodes. Traffic to those IPs is load balanced to pods — without provisioning a cloud LB:
spec:
  externalIPs:
  - 203.0.113.10   # must be an IP that routes to one of the cluster nodes
  ports:
  - port: 80
Any user with permission to create Services can set externalIPs to any IP, including IPs owned by other services in the cluster. This can be used to intercept traffic (CVE-2020-8554). Restrict Service creation permissions in multi-tenant clusters, enable the built-in DenyServiceExternalIPs admission controller, or use a validating admission webhook to block arbitrary externalIPs.
Connection Draining
Graceful connection draining ensures in-flight requests complete before a pod is removed from load balancing. Kubernetes implements this through a chain of mechanisms:
spec:
  terminationGracePeriodSeconds: 65   # total time before SIGKILL; default 30s
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]   # delay SIGTERM; let LB drain

# Best practice for HTTP servers:
#   preStop sleep >= LB health check interval * unhealthy threshold
#   (ensures LB removes pod from rotation before SIGTERM kills the server)
# Example with AWS NLB (30s drain) + application shutdown:
#   preStop sleep: 5s (LB health check fails, pod marked unhealthy)
#   terminationGracePeriodSeconds: 65s (5s preStop + 30s LB drain + 30s app shutdown)
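EndpointSlice entries carry serving and terminating conditions, so consumers can tell a pod that is shutting down from one that has failed. Since 1.28 (GA), kube-proxy uses endpoints that are terminating but still serving as a fallback when a Service has no other ready endpoints, which matters most for externalTrafficPolicy: Local during rollouts. In the normal case, new connections stop being routed to a terminating pod as soon as its endpoint is marked not ready, while established connections keep flowing until the pod exits. This lets in-flight requests complete without routing new requests to a shutting-down pod.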
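The timing rules from the example above reduce to two small formulas. The sketch below restates them in code; the function names are made up, and the health check values (10s interval, threshold 2) are taken from the NLB annotation example earlier on this page.

```python
def min_prestop_sleep(health_check_interval_s: int, unhealthy_threshold: int) -> int:
    """Seconds the LB needs to notice a failing target: interval x threshold."""
    return health_check_interval_s * unhealthy_threshold


def termination_grace_period(prestop_s: int, lb_drain_s: int, app_shutdown_s: int) -> int:
    """Sum of the phases that must finish before SIGKILL."""
    return prestop_s + lb_drain_s + app_shutdown_s


# Values from the AWS NLB example: 5s preStop + 30s LB drain + 30s app shutdown.
print(termination_grace_period(5, 30, 30))  # 65

# With a 10s health check interval and an unhealthy threshold of 2,
# the preStop rule of thumb asks for at least 20s.
print(min_prestop_sleep(10, 2))  # 20
```

Under those health check settings a 5s preStop sleep would be shorter than the rule of thumb suggests, so the sleep and the grace period should be sized together.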
Health Checks and Readiness
A pod is included in Service endpoints only when its readiness probe passes. This is the primary health-check gate for Kubernetes load balancing — no special configuration is needed for in-cluster load balancing.
spec:
  containers:
  - name: app
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3   # 3 consecutive failures → remove from LB
      successThreshold: 1   # 1 success → add back to LB

# For cloud LBs, configure custom health checks via Service annotations
# (these go under metadata.annotations):
# AWS NLB health check:
service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/healthz"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "HTTP"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
# GKE BackendConfig health check (see GKE section above)
# Azure health probe:
service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"
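The probe settings above determine how long a broken pod keeps receiving traffic. The following back-of-the-envelope sketch ignores probe timeouts and EndpointSlice propagation delay, so real numbers will be somewhat higher; the function name is hypothetical.

```python
def readiness_removal_window(period_s: int, failure_threshold: int) -> tuple[int, int]:
    """(best, worst) seconds from a pod turning unhealthy to being marked not ready.

    Best case: the pod fails just before a probe, so only the remaining
    (threshold - 1) probe intervals are needed. Worst case: it fails just
    after a probe, adding one more full interval.
    """
    best = (failure_threshold - 1) * period_s
    worst = failure_threshold * period_s
    return best, worst


# periodSeconds: 5, failureThreshold: 3 (the readinessProbe shown above)
print(readiness_removal_window(5, 3))  # (10, 15)
```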
ECMP and BGP Load Balancing
In data center environments, Equal-Cost Multi-Path (ECMP) routing at the network layer provides load balancing without any Kubernetes-specific configuration. With BGP mode MetalLB or Calico BGP, all nodes advertise the same VIP, and the router distributes connections across nodes via ECMP hashing.
ECMP hashing is typically per-flow (5-tuple hash). When a node fails and is withdrawn from BGP, the set of next hops changes and the router may rehash existing flows onto different nodes. A node that receives a flow mid-connection has no connection-tracking state for it, so those connections are reset. With plain modulo hashing, most flows can move, not only the ones that were on the failed node. This is a fundamental limitation of ECMP. Mitigations: consistent (resilient) hashing where the router supports it, fast failure detection (shorter BGP hold timers or BFD) to shorten the outage, or connection retry in the application.
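The effect of a next-hop change on existing flows can be simulated. The sketch below compares naive modulo hashing with rendezvous hashing, used here as one example of a consistent hashing scheme; real routers implement their own variants, and the flow strings and node names are made up for illustration.

```python
import hashlib


def h(*parts: str) -> int:
    """Stable 64-bit hash of the given strings."""
    digest = hashlib.sha256("|".join(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def modulo_pick(flow: str, nodes: list[str]) -> str:
    """Naive ECMP: hash the flow and index into the next-hop list."""
    return nodes[h(flow) % len(nodes)]


def rendezvous_pick(flow: str, nodes: list[str]) -> str:
    """Rendezvous (consistent) hashing: the node with the highest score wins."""
    return max(nodes, key=lambda n: h(flow, n))


def moved_fraction(pick, flows, before, after):
    """Fraction of flows on surviving nodes that land on a different node afterwards."""
    surviving = [f for f in flows if pick(f, before) in after]
    moved = sum(1 for f in surviving if pick(f, before) != pick(f, after))
    return moved / len(surviving)


flows = [f"198.51.100.{i % 250}:{40000 + i}->203.0.113.10:443/tcp" for i in range(5000)]
before = ["node-1", "node-2", "node-3", "node-4"]
after = ["node-1", "node-2", "node-4"]  # node-3 withdrawn from BGP

print(f"modulo:     {moved_fraction(modulo_pick, flows, before, after):.0%} of surviving flows moved")
print(f"rendezvous: {moved_fraction(rendezvous_pick, flows, before, after):.0%} of surviving flows moved")
```

In this model, modulo hashing moves roughly two thirds of the flows that were on healthy nodes, while rendezvous hashing moves none of them; only the flows that were on the withdrawn node are affected.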
Metrics, Alerting & Troubleshooting
Key Metrics
| Metric | Source | What It Tells You |
|---|---|---|
| kube_service_status_load_balancer_ingress | kube-state-metrics | One series per provisioned ingress IP/hostname; absent while a LoadBalancer Service is still pending |
| kube_endpoint_address_available | kube-state-metrics | Ready endpoints per service; drops = pods failing readiness |
| kube_endpoint_address_not_ready | kube-state-metrics | Not-ready endpoints; rises during rolling deployments |
| kubeproxy_sync_proxy_rules_duration_seconds | kube-proxy | Time to sync iptables/IPVS rules; spikes indicate rule explosion |
| metallb_bgp_session_up | MetalLB speaker | BGP session state per peer; 0 = no LB traffic for that node |
| metallb_bgp_announced_prefixes_total | MetalLB speaker | Number of prefixes (VIPs) being advertised; an unexpected drop means VIPs were withdrawn |
| aws_nlb_target_group_unhealthy_host_count | CloudWatch / exporter | NLB targets failing health checks (CloudWatch UnHealthyHostCount; the exact metric name depends on the exporter); should stay near 0 |
Alerting Rules
# Alert: Service has no ready endpoints
- alert: ServiceNoReadyEndpoints
  expr: kube_endpoint_address_available{endpoint!="kubernetes"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.endpoint }} has no ready endpoints"
    description: "All pods may be failing readiness probes"
# Alert: LoadBalancer IP not provisioned
# (match on namespace/service: the two metrics carry different label sets,
#  so a bare "unless" would never exclude anything)
- alert: LoadBalancerNotProvisioned
  expr: |
    kube_service_spec_type{type="LoadBalancer"}
      unless on (namespace, service)
    kube_service_status_load_balancer_ingress
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "LoadBalancer service has no external IP after 10m"
# Alert: MetalLB BGP session down
- alert: MetalLBBGPSessionDown
  expr: metallb_bgp_session_up == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "MetalLB BGP session to {{ $labels.peer }} is down"
# Alert: High kube-proxy sync latency
- alert: KubeProxySyncLatencyHigh
  expr: |
    histogram_quantile(0.99,
      rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])
    ) > 5
  for: 10m
  labels:
    severity: warning
Troubleshooting Runbooks
Runbook 1: LoadBalancer Service Stuck in Pending (no external IP)
# Check service status
kubectl get svc my-lb -o wide
# EXTERNAL-IP: <pending> → CCM has not provisioned the LB yet
# Check CCM logs
kubectl logs -n kube-system -l component=cloud-controller-manager -f
# Common causes:
# - CCM not running (check kube-system pods)
# - IAM permissions missing (AWS: check ec2/elasticloadbalancing permissions)
# - Subnet missing tag (AWS: kubernetes.io/role/elb: "1")
# - MetalLB: pool exhausted or L2/BGP advertisement not created
kubectl get ipaddresspools -n metallb-system
kubectl get l2advertisements -n metallb-system
# Check events for the service
kubectl describe svc my-lb | grep -A10 Events
Runbook 2: Traffic Not Reaching Pods (LB IP reachable but 503/504)
# Step 1: Check endpoint health
kubectl get endpoints my-service
# If empty → no ready pods; check readiness probes
kubectl describe pod my-pod | grep -A10 "Readiness"
kubectl get events -n production | grep Readiness
# Step 2: Check with externalTrafficPolicy: Local
# If Local: no pods on nodes that LB is targeting → all 503
kubectl get pods -o wide | grep my-app # are pods spread across nodes?
kubectl get svc my-service -o jsonpath='{.spec.healthCheckNodePort}'
curl http://node-ip:HEALTH_CHECK_PORT # should return 200 if local pods exist
# Step 3: Test ClusterIP directly (bypasses cloud LB)
kubectl exec debug-pod -- curl http://my-service.namespace.svc.cluster.local/
# If this works: issue is in cloud LB → node path
# If this fails: issue is in kube-proxy or pod readiness
Runbook 3: Uneven Load Distribution Across Pods
# Compare per-pod resource usage as a rough proxy for load
# (kubectl top reports CPU/memory, not request rate)
kubectl top pods -l app=my-app
# Causes of uneven distribution:
# 1. HTTP/2 multiplexing (all requests on one connection → one pod)
#    Fix: put an HTTP/2-aware L7 proxy in front of the pods; IPVS lc still
#    balances per connection, so it cannot split a multiplexed connection
# 2. sessionAffinity: ClientIP with cloud LB source IP clustering
#    Fix: externalTrafficPolicy: Local, or drop sessionAffinity and use
#    cookie-based affinity at L7
# 3. Topology-Aware Routing over-concentrating in one zone
#    kubectl get endpointslices -o yaml | grep -A5 hints
#    Fix: set the topology-mode annotation to disabled (or remove
#    spec.trafficDistribution), or add more pods per zone
# Switch kube-proxy to IPVS least-connection mode:
kubectl edit cm kube-proxy -n kube-system
# Change: mode: "ipvs", ipvs.scheduler: "lc"
kubectl rollout restart daemonset kube-proxy -n kube-system
Runbook 4: Connection Drops During Rolling Update
# Symptoms: HTTP 502/503 spikes during deployments
# Cause: pod removed from LB before finishing in-flight requests
# Fix 1: Add preStop hook to delay SIGTERM
spec:
  containers:
  - lifecycle:
      preStop:
        exec:
          command: ["sleep", "10"]
# Fix 2: Increase terminationGracePeriodSeconds
spec:
  terminationGracePeriodSeconds: 60
# Fix 3: AWS NLB: tune the deregistration (connection drain) delay
service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=30"
# Fix 4: Ensure maxSurge allows overlap during rollout
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0   # never remove old pod before new pod is ready
Runbook 5: MetalLB BGP Session Down
# Check speaker pod logs
kubectl logs -n metallb-system -l component=speaker | grep -i bgp
# Check BGP peer status (FRR mode: the speaker pod runs an frr container)
kubectl exec -n metallb-system speaker-xxxx -c frr -- vtysh -c "show bgp summary"
# Common causes:
# - Router ASN mismatch: check BGPPeer spec.peerASN
# - MD5 auth mismatch: check BGPPeer spec.password
# - Firewall blocking TCP 179 between nodes and router
#   Test from a node (the speaker image may not ship nc): nc -zv router-ip 179
# - Router prefix limit exceeded: router rejecting BGP updates
# - MTU mismatch on the BGP session interface
# Check MetalLB logs for specific error
kubectl logs -n metallb-system -l app=metallb,component=speaker -f | grep -i "peer\|error\|bgp"
Best Practices
- Use target-type: ip on AWS (VPC CNI) and NEGs on GKE: direct pod routing eliminates the extra kube-proxy hop, makes it possible to preserve source IPs, and lowers latency, most noticeably for short-lived connections.
- Set externalTrafficPolicy: Local only when pods are spread evenly: uneven pod distribution causes severe load imbalance under Local policy. Pair with topologySpreadConstraints to ensure uniform pod distribution.
- Configure connection draining for every production service: add a preStop sleep and set terminationGracePeriodSeconds longer than the LB drain timeout. The formula: terminationGracePeriodSeconds > LB_drain + preStop_sleep + app_shutdown_time.
- Use IPVS mode with lc for workloads with variable request cost: iptables mode picks endpoints at random, which spreads connections uniformly but ignores pod load. IPVS least-connection routes to the pod with the fewest active connections. wrr only helps when backend weights differ, and kube-proxy gives every endpoint the same weight.
- Use trafficDistribution: PreferClose for multi-zone cost reduction: keeping traffic in-zone reduces cross-zone data transfer costs significantly on AWS, GCP, and Azure.
- Test health check paths under load: cloud LBs use health checks to decide which nodes receive traffic. If your health check path is slow under load, nodes are falsely marked unhealthy and the LB oscillates.
- For bare metal, prefer BGP mode MetalLB over L2 mode: L2 mode has a single-node bottleneck and slower failover. BGP mode provides true ECMP distribution and fast failover via BGP withdraw; a crashed node is detected only after the hold timer expires unless BFD is configured.
- Restrict externalIPs usage via RBAC or webhook: externalIPs is a traffic-interception vector (CVE-2020-8554). Use ValidatingWebhookConfiguration to block it in multi-tenant clusters.