Load Balancing

Kubernetes load balancing operates at multiple layers simultaneously — from iptables/IPVS ClusterIP distribution inside the cluster to cloud provider L4 NLBs and L7 Ingress controllers at the edge. This page covers every layer, how they interact, and how to tune them for production.

What This Page Covers

Load balancing layer model: L3/L4 kube-proxy, L7 Ingress/Gateway API, cloud LB
ClusterIP load balancing: iptables random distribution, IPVS scheduling algorithms
Session affinity: ClientIP-based, timeoutSeconds; limitations
Service type LoadBalancer: CCM provisioning lifecycle, status.loadBalancer.ingress
externalTrafficPolicy: Local vs Cluster — source IP preservation and uneven load tradeoffs
internalTrafficPolicy: Local vs Cluster — node-local traffic optimization
Topology-Aware Routing (annotation since 1.27, GA 1.33): zone hints, auto vs disabled annotation
Traffic Distribution (beta 1.31, GA 1.33): PreferClose value of spec.trafficDistribution, successor to the topology annotation
AWS NLB: IP vs Instance target types, cross-zone load balancing, health checks, annotations reference
AWS Classic ELB vs NLB vs ALB — when to use each from Kubernetes
GCE/GKE: L4 regional vs global LB, BackendConfig CRD, NEGs (Network Endpoint Groups)
Azure: Standard Load Balancer, internal LB, annotations
MetalLB: BGP mode and L2 mode, IPAddressPool, L2Advertisement, BGPAdvertisement CRDs
kube-vip: ARP/BGP, control plane HA and service VIP
Cilium LB IPAM: replacing cloud CCM for bare metal
NodePort: port range, externalIPs, hairpin mode
ECMP and BGP-based load balancing at the network layer
Connection draining: terminationGracePeriodSeconds, preStop hooks, EndpointSlice terminating condition
Health checks: readiness probes gate endpoint inclusion; custom health check annotations per cloud
7 metrics + 4 alerting rules + 5 troubleshooting runbooks
8 best practices for production load balancing

Load Balancing Layer Model

Kubernetes load balancing is not a single mechanism — it is a stack of complementary layers, each operating at a different network level:

External Client
      │
      ▼
┌─────────────────────────────────────────────────┐
│ Layer 4 (L4): Cloud Load Balancer
│ AWS NLB / GCE L4 / Azure SLB
│ Service type=LoadBalancer
│ Provisioned by Cloud CCM
│ Distributes TCP/UDP connections to nodes
│ No TLS termination (NLB passthrough)
└────────────────────┬────────────────────────────┘
                     │ to NodePort on any node
                     ▼
┌─────────────────────────────────────────────────┐
│ Layer 4 (L4): kube-proxy (iptables/IPVS)
│ On every node
│ Service type=ClusterIP / NodePort
│ Virtual IP → real pod IPs
│ Distributes connections across healthy pods
│ DNAT, SNAT, session affinity
└────────────────────┬────────────────────────────┘
                     │ to Pod IP:ContainerPort
                     ▼
┌─────────────────────────────────────────────────┐
│ Layer 7 (L7): Ingress / Gateway API Controller
│ NGINX, Traefik, Envoy, etc.
│ HTTP routing, TLS termination, canary
│ Runs as pods; reached via ClusterIP
│ Upstream: application pods via ClusterIP
│ or via cloud LB directly (IP mode)
└─────────────────────────────────────────────────┘

These layers are independent and composable. A typical production setup uses all three: a cloud NLB → NGINX Ingress controller (exposed through a type=LoadBalancer Service) → application Service (ClusterIP) → pods.

ClusterIP Load Balancing

ClusterIP is the most common service type. kube-proxy translates a virtual ClusterIP:Port to a real pod IP:Port via DNAT rules. For details on iptables chains and IPVS virtual servers, see 03-kube-proxy-internals.html. This section focuses on the load balancing behavior.

iptables: Random Distribution

In iptables mode, kube-proxy uses the statistic iptables module to implement per-connection random load balancing. For a service with N endpoints, the probability for each endpoint is set so that each gets an equal share:

# For a 3-endpoint service (probability math):
# Rule 1: 1/3 probability  → KUBE-SEP-AAAA (pod-1)
# Rule 2: 1/2 probability  → KUBE-SEP-BBBB (pod-2)   (of remaining 2/3)
# Rule 3: 1/1 probability  → KUBE-SEP-CCCC (pod-3)   (all remaining)
# Net result: each pod gets ~33% of connections

# View the rules for a service:
sudo iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXXX -n --line-numbers
# Chain KUBE-SVC-XXXX (1 references)
# num  target        prot  opt  source    destination
# 1    KUBE-SEP-AA   all   --   0.0.0.0/0  0.0.0.0/0  statistic mode random probability 0.33333
# 2    KUBE-SEP-BB   all   --   0.0.0.0/0  0.0.0.0/0  statistic mode random probability 0.50000
# 3    KUBE-SEP-CC   all   --   0.0.0.0/0  0.0.0.0/0
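
The cascading probabilities above can be checked with a few lines of arithmetic. This is a minimal sketch, not kube-proxy code: the function names are illustrative, and it only assumes what the rules above show, namely that rule i (0-based) of an N-endpoint chain matches with probability 1/(N - i) and that unmatched connections fall through to the next rule.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Probability written on each rule of an n-endpoint chain: 1 / (n - i)."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_shares(probs):
    """Share of all connections each rule ends up with when rules are tried in order."""
    shares, remaining = [], Fraction(1)
    for p in probs:
        shares.append(remaining * p)  # connections matched by this rule
        remaining *= 1 - p            # connections that fall through
    return shares


probs = rule_probabilities(3)
print([str(p) for p in probs])                    # ['1/3', '1/2', '1']
print([str(s) for s in effective_shares(probs)])  # ['1/3', '1/3', '1/3']
```

The same calculation holds for any N, which is why the last rule never needs a statistic match: it takes whatever is left.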

⚠️

iptables mode is per-connection, not per-request

iptables DNAT is applied at connection establishment. All packets in a TCP connection go to the same pod — there is no request-level load balancing. For HTTP/1.1 with persistent connections or HTTP/2 multiplexing, a single long-lived connection can send thousands of requests to one pod while others are idle. Use HTTP/2-aware proxies (Envoy, NGINX) for per-request load balancing.

IPVS: Scheduling Algorithms

IPVS mode provides 7 scheduling algorithms and maintains a proper connection table (not just iptables rules). See 03-kube-proxy-internals.html for full IPVS configuration. The scheduler is set globally via KubeProxyConfiguration.ipvs.scheduler:

Scheduler	Algorithm	Best For
`rr`	Round Robin	Default; uniform request cost; even distribution
`lc`	Least Connection	Variable-cost requests; routes to pod with fewest active connections
`dh`	Destination Hashing	Cache affinity; same destination IP always goes to same backend
`sh`	Source Hashing	Client affinity; same source IP always goes to same backend
`sed`	Shortest Expected Delay	Weighted least-connection variant
`nq`	Never Queue	Always routes to idle server first; SED otherwise
`wrr`	Weighted Round Robin	Heterogeneous backend capacity; kube-proxy assigns every endpoint the same weight, so in practice it behaves like `rr`
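
To make the difference between `rr` and `lc` concrete, here is a toy simulation. It is a sketch under simplifying assumptions (one new connection per tick, two backends, hypothetical pod names) and does not model the kernel's IPVS implementation; it only illustrates the two selection rules described in the table.

```python
from itertools import cycle


def round_robin(backends):
    """rr: hand out backends in a fixed rotation, ignoring current load."""
    rotation = cycle(backends)
    return lambda active: next(rotation)


def least_connection(backends):
    """lc: pick the backend with the fewest active connections (first wins ties)."""
    return lambda active: min(backends, key=lambda b: active[b])


def simulate(make_scheduler, backends, durations):
    """One connection arrives per tick and stays open for durations[i] ticks.

    Returns the peak number of concurrent connections seen on each backend.
    """
    pick = make_scheduler(backends)
    open_conns = []  # list of (end_tick, backend)
    peak = {b: 0 for b in backends}
    for tick, duration in enumerate(durations):
        open_conns = [(end, b) for end, b in open_conns if end > tick]
        active = {b: 0 for b in backends}
        for _, b in open_conns:
            active[b] += 1
        chosen = pick(active)
        open_conns.append((tick + duration, chosen))
        peak[chosen] = max(peak[chosen], active[chosen] + 1)
    return peak


pods = ["pod-a", "pod-b"]
durations = [10, 1, 10, 1, 10, 1]  # long and short connections alternate
print(simulate(round_robin, pods, durations))       # {'pod-a': 3, 'pod-b': 1}
print(simulate(least_connection, pods, durations))  # {'pod-a': 2, 'pod-b': 2}
```

With alternating long and short connections, the fixed rotation stacks every long connection on the same pod, while least-connection evens out the concurrent load.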

Session Affinity

Kubernetes supports ClientIP-based session affinity: connections from the same source IP go to the same pod until the affinity entry times out. It is configured per Service:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - port: 80
  sessionAffinity: ClientIP          # None (default) or ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800          # 3 hours (default); max 86400

⚠️

Session affinity limitations

Source IP is the node IP for traffic through NodePort/cloud LB — all traffic from a cloud LB appears to come from a small set of node IPs, causing all sessions to pin to a few pods. Use externalTrafficPolicy: Local to preserve the real client IP (at the cost of uneven distribution).
Not HTTP cookie-based — Kubernetes session affinity is purely L4 IP-based. For cookie-based sticky sessions, use Ingress/Gateway API annotations (nginx: nginx.ingress.kubernetes.io/affinity: cookie).
Does not survive pod restarts — if the pinned pod is replaced, the affinity breaks and clients are redistributed.
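
The limitations above follow directly from how an IP-keyed affinity table behaves. The sketch below is an illustrative model, not kube-proxy's implementation: the class name, the refresh-on-use behavior, and the random initial choice are assumptions made for the example.

```python
import random


class ClientIPAffinity:
    """Toy affinity table: client IP -> (backend, time last seen)."""

    def __init__(self, timeout_seconds=10800):
        self.timeout = timeout_seconds
        self.table = {}

    def pick(self, client_ip, backends, now):
        entry = self.table.get(client_ip)
        if entry:
            backend, last_seen = entry
            # Reuse the pinned backend only if the entry is fresh and the pod still exists.
            if now - last_seen <= self.timeout and backend in backends:
                self.table[client_ip] = (backend, now)
                return backend
        backend = random.choice(backends)  # new or expired client: pick again
        self.table[client_ip] = (backend, now)
        return backend


affinity = ClientIPAffinity(timeout_seconds=60)
pods = ["pod-1", "pod-2", "pod-3"]

# Every connection that arrives SNATed to the same node IP lands on one pod.
print({affinity.pick("10.0.0.7", pods, now=t) for t in range(0, 50, 5)})
```

Because the key is the source IP seen by the node, many real clients hidden behind one SNAT address share a single table entry, and an entry pointing at a pod that no longer exists is simply replaced.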

externalTrafficPolicy

Controls how NodePort and LoadBalancer services handle external traffic at the node level. This is one of the most impactful load balancing settings in production:

Policy	Behavior	Source IP	Tradeoff
`Cluster` (default)	Traffic forwarded to any healthy pod cluster-wide, even on other nodes (extra hop)	Lost — SNAT replaces client IP with node IP	Even distribution; any node can serve; extra hop possible
`Local`	Traffic only forwarded to pods on the receiving node; drops if no local pods	Preserved — no SNAT; real client IP visible	Uneven distribution if pods not evenly spread; node must have local pod

apiVersion: v1
kind: Service
metadata:
  name: my-lb-service
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local       # preserve client IP; no SNAT
  selector:
    app: my-app
  ports:
    - port: 443
      targetPort: 8443
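
The uneven-load tradeoff of the Local policy is easy to quantify. The following sketch assumes the cloud load balancer spreads traffic evenly across the nodes that have at least one ready local pod; the node names and pod counts are hypothetical.

```python
from fractions import Fraction


def per_pod_share(pods_per_node):
    """Share of total traffic each pod receives under externalTrafficPolicy: Local.

    pods_per_node maps node name -> number of ready local pods.
    Nodes with zero local pods fail the LB health check and get no traffic.
    """
    serving = {node: count for node, count in pods_per_node.items() if count > 0}
    node_share = Fraction(1, len(serving))
    return {node: node_share / count for node, count in serving.items()}


shares = per_pod_share({"node-a": 1, "node-b": 3, "node-c": 0})
print({node: str(share) for node, share in shares.items()})
# {'node-a': '1/2', 'node-b': '1/6'}
```

In this example the single pod on node-a handles three times the traffic of each pod on node-b, even though all four pods back the same Service.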

ℹ️

Health check node port with externalTrafficPolicy: Local

When a LoadBalancer Service sets externalTrafficPolicy: Local, the API server allocates a healthCheckNodePort from the NodePort range (default 30000–32767), and kube-proxy serves an HTTP health endpoint on that port on every node. Cloud load balancers probe it to learn whether a node has ready local pods: nodes with at least one return HTTP 200, nodes with none return HTTP 503 and are skipped. The AWS, GCE, and Azure cloud providers configure this probe automatically.

internalTrafficPolicy

Controls how ClusterIP traffic is handled for connections originating inside the cluster:

spec:
  internalTrafficPolicy: Local     # route only to pods on same node as client
                                   # Cluster (default) = any pod anywhere

internalTrafficPolicy: Local is useful for node-local caching services (e.g., a per-node cache DaemonSet) — clients always hit the local replica without a network hop. If no local pod exists, the connection is dropped.

Topology-Aware Routing & Traffic Distribution

Topology-Aware Routing (annotation since 1.27, GA 1.33)

Topology-Aware Routing uses zone hints on EndpointSlices to prefer in-zone endpoints, reducing cross-zone traffic costs and latency. It is activated per-service with an annotation:

apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/topology-mode: "auto"   # auto | disabled
spec:
  selector:
    app: my-app
  ports:
    - port: 80

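When set to auto, the EndpointSlice controller adds hints.forZones entries to each endpoint, allocating endpoints to zones in proportion to each zone's allocatable CPU. kube-proxy then prefers endpoints hinted for its node's zone. Routing falls back to cluster-wide (hints are not set or are ignored) when any of these safeguards trips:

A node is missing the topology.kubernetes.io/zone label or does not report allocatable CPU
The Service has fewer endpoints than the cluster has zones
Endpoints cannot be allocated to zones in a reasonably balanced way
An endpoint has no zone hint, or a zone is not represented in the hints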

Traffic Distribution (beta 1.31, GA 1.33)

Traffic Distribution expresses the same preference through a cleaner API, the spec.trafficDistribution field on the Service object itself:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - port: 80
  trafficDistribution: PreferClose   # prefer topologically close endpoints
                                     # falls back to global if none available locally

✅

Use trafficDistribution over topology annotation

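spec.trafficDistribution: PreferClose (beta and enabled by default in 1.31, GA in 1.33) is the preferred way to express topology preference for new clusters. The annotation service.kubernetes.io/topology-mode: auto continues to work, but its safeguards can silently switch zone routing off. PreferClose is more predictable: traffic stays in-zone whenever the zone has a ready endpoint and falls back to cluster-wide routing only when it has none, so traffic is never dropped, although a zone with few endpoints can be overloaded.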
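
The prefer-close-with-fallback rule can be sketched as a simple filter. This is a simplified model: in a real cluster the preference is carried as zone hints on EndpointSlices and applied by kube-proxy, whereas the function and the sample endpoints below are made up for illustration.

```python
def candidate_endpoints(endpoints, client_zone):
    """Prefer endpoints in the client's zone; fall back to all endpoints if there are none.

    endpoints is a list of (ip, zone) tuples for ready endpoints.
    """
    same_zone = [ip for ip, zone in endpoints if zone == client_zone]
    return same_zone or [ip for ip, _ in endpoints]


ready = [("10.0.1.5", "zone-a"), ("10.0.1.6", "zone-a"), ("10.0.2.9", "zone-b")]
print(candidate_endpoints(ready, "zone-a"))  # ['10.0.1.5', '10.0.1.6']
print(candidate_endpoints(ready, "zone-c"))  # ['10.0.1.5', '10.0.1.6', '10.0.2.9']
```

A client in zone-c has no local endpoint, so it is served from the full set instead of being dropped.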

Service type=LoadBalancer

When you create a Service with type: LoadBalancer, the Cloud Controller Manager (CCM) watches the Service and provisions a cloud load balancer. The provisioned LB address is written back into status.loadBalancer.ingress:

kubectl get svc my-lb-service -o yaml
# status:
#   loadBalancer:
#     ingress:
#       - hostname: abc123.elb.us-east-1.amazonaws.com  # AWS NLB
#       # OR for GCP/Azure:
#       - ip: 203.0.113.42

# Track provisioning:
kubectl get events --field-selector involvedObject.name=my-lb-service
# EnsuredLoadBalancer  → LB provisioned
# UpdatedLoadBalancer  → LB updated (port change, etc.)
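
When scripting around provisioning, the address has to be read out of status.loadBalancer.ingress, which is empty while the Service is pending. The helper below is a small sketch that works on the JSON form of a Service (for example the output of kubectl get svc my-lb-service -o json); the function name and the sample documents are illustrative.

```python
import json


def lb_addresses(service):
    """Return the hostnames or IPs in status.loadBalancer.ingress (empty while pending)."""
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    addresses = []
    for entry in ingress:
        address = entry.get("hostname") or entry.get("ip")
        if address:
            addresses.append(address)
    return addresses


pending = json.loads('{"status": {"loadBalancer": {}}}')
provisioned = json.loads('{"status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.42"}]}}}')
print(lb_addresses(pending))      # []
print(lb_addresses(provisioned))  # ['203.0.113.42']
```

AWS typically reports a hostname and GCP or Azure an IP, so the helper accepts either field.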

AWS Network Load Balancer (NLB)

AWS provides three LB types accessible from Kubernetes. The recommended path for TCP/UDP is NLB via the AWS Load Balancer Controller.

Type	Layer	Kubernetes Integration	Use Case
Classic ELB	L4 (TCP) or L7 (HTTP)	Legacy in-tree CCM; deprecated	Legacy only; use NLB instead
NLB	L4 (TCP/UDP/TLS)	AWS Load Balancer Controller OR in-tree CCM	TCP services, TLS passthrough, static IP
ALB	L7 (HTTP/HTTPS)	AWS Load Balancer Controller (Ingress)	HTTP routing — use Ingress, not Service type=LB

NLB Target Types

target-type: instance (default)

NLB targets: EC2 instances (nodes)
Traffic: NLB → NodePort → kube-proxy → pod
Source IP: preserved by the NLB, but lost when kube-proxy SNATs to a pod on another node (externalTrafficPolicy: Cluster)
Works with any CNI
Extra hop inside cluster

target-type: ip (recommended)

NLB targets: pod IPs directly
Traffic: NLB → pod (no kube-proxy hop)
Source IP: preserved when preserve_client_ip.enabled=true or Proxy Protocol v2 is configured (off by default for TCP/TLS IP targets)
Requires AWS VPC CNI (pod IP in VPC)
No extra hop; lower latency

NLB Annotations Reference

apiVersion: v1
kind: Service
metadata:
  name: my-nlb
  annotations:
    # Controller selection
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"  # ip | instance

    # Scheme
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"  # or internal

    # Cross-zone load balancing (default: false for NLB)
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"

    # Target group attributes: connection draining and client IP preservation.
    # Use one annotation with comma-separated values; repeating the key would
    # override the first value.
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=30,preserve_client_ip.enabled=true"

    # Health check
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "TCP"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"

    # TLS termination at NLB
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:us-east-1:123:certificate/abc"
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"

    # Access logs
    service.beta.kubernetes.io/aws-load-balancer-access-log-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name: "my-lb-logs"

    # Dual-stack
    service.beta.kubernetes.io/aws-load-balancer-ip-address-type: "dualstack"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: my-app
  ports:
    - port: 443
      targetPort: 8443
      protocol: TCP

GCE / GKE Load Balancing

L4 Regional Load Balancer

# Default Service type=LoadBalancer on GKE creates a regional External L4 LB
# For internal LB (no public IP):
metadata:
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
spec:
  type: LoadBalancer

Network Endpoint Groups (NEGs)

NEGs allow GKE to route directly to pod IPs instead of NodePort, similar to AWS target-type:ip. NEGs are used automatically when:

Using container-native load balancing via BackendConfig
Using GKE Ingress (automatically provisions NEG-backed backends)
Annotation cloud.google.com/neg: '{"ingress": true}' on the Service

# Enable NEG for a service:
metadata:
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80": {}}}'

# BackendConfig CRD — custom health check, CDN, IAP, timeout:
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-backend-config
spec:
  healthCheck:
    checkIntervalSec: 15
    port: 8080
    type: HTTP
    requestPath: /healthz
  timeoutSec: 30
  connectionDraining:
    drainingTimeoutSec: 60
  cdn:
    enabled: true
    cachePolicy:
      includeHost: true
      includeProtocol: true

Azure Load Balancing

# External Standard Load Balancer (default):
spec:
  type: LoadBalancer

# Internal Load Balancer (no public IP):
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "my-subnet"

# Custom frontend IP (static):
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-ipv4: "10.0.0.100"

# HA ports (internal Standard LB only): one rule load-balances all ports and protocols
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-enable-high-availability-ports: "true"

# Health probe protocol:
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: "https"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"

MetalLB — Bare Metal Load Balancing

MetalLB gives bare metal clusters (no cloud provider) the ability to provision LoadBalancer services by advertising VIPs via BGP or ARP (L2). It runs as a DaemonSet (speaker) and a Deployment (controller).

L2 Mode (ARP/NDP)

One node "owns" the VIP and responds to ARP requests
Failover: leader election + gratuitous ARP on takeover (~10s)
No ECMP — all traffic to one node (single point of ingress)
Works on any network without BGP router support
Good for: small clusters, home labs, on-premises without BGP

BGP Mode

All nodes advertise the VIP; router does ECMP across all nodes
True equal-cost multi-path distribution
Requires BGP-capable ToR/spine router
Session per node to router(s); fast failover via BGP withdraw
Good for: production bare metal, data center deployments

MetalLB v0.13+ CRD Configuration

# Step 1: Define an IP address pool
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.0/24          # IPv4 pool for LB services
    - fd00:10::/120             # IPv6 pool (dual-stack); placeholder ULA prefix
  autoAssign: true              # auto-assign from pool; false = manual annotation

---
# Step 2a: L2 advertisement
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-advert
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
  nodeSelectors:                # which nodes can own the VIP
    - matchLabels:
        kubernetes.io/os: linux

---
# Step 2b: BGP advertisement (alternative to L2)
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgp-advert
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
  communities:
    - "65000:1"                 # BGP community tag (quoted so YAML keeps it a string)
  aggregationLength: 32         # /32 per VIP (not aggregated)
  localPref: 100

---
# Step 3: BGP peer configuration
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: spine-router
  namespace: metallb-system
spec:
  myASN: 64512                  # MetalLB ASN
  peerASN: 65001                # Router ASN
  peerAddress: 192.168.1.1
  keepaliveTime: 30s
  holdTime: 90s

---
# Step 4: Service — MetalLB auto-assigns from pool
apiVersion: v1
kind: Service
metadata:
  name: my-lb-service
  annotations:
    metallb.io/address-pool: "production-pool"   # optional: pin to specific pool
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80

kube-vip

kube-vip provides both control plane HA (API server VIP) and service LoadBalancer VIPs using ARP (L2) or BGP. It is often used alongside or instead of MetalLB on bare metal clusters.

# kube-vip in BGP mode for services
# ConfigMap configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-vip
  namespace: kube-system
data:
  config: |
    localAS: 64512
    bgpConfig:
      - peerAddress: 192.168.1.1
        peerAS: 65001
        sourceIF: eth0
    enableServicesElection: true
    vipInterface: eth0

NodePort Deep Dive

NodePort opens a port (default range: 30000–32767) on every node. External traffic to any node's IP on that port reaches the service:

apiVersion: v1
kind: Service
metadata:
  name: my-nodeport
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
    - port: 80           # ClusterIP port
      targetPort: 8080   # Pod container port
      nodePort: 30080    # explicit NodePort (optional; auto-assigned if omitted)

# Change the NodePort range (kube-apiserver flag):
# --service-node-port-range=30000-32767  (default)
# Can widen to include standard ports: --service-node-port-range=80-32767
# CAUTION: a widened range can collide with ports already used by host
# services (SSH, kubelet, etc.), because a NodePort is claimed on every node

externalIPs

spec.externalIPs binds a service to specific IP addresses that are owned by cluster nodes. Traffic to those IPs is load balanced to pods — without provisioning a cloud LB:

spec:
  externalIPs:
    - 203.0.113.10    # must be an IP on one of the cluster nodes
  ports:
    - port: 80

🚨

externalIPs security risk

Any user with permission to create Services can set externalIPs to any IP, including IPs owned by other services in the cluster. This can be used to intercept traffic (CVE-2020-8554). In multi-tenant clusters, restrict Service creation permissions, enable the DenyServiceExternalIPs admission controller, or use a validating webhook that rejects externalIPs outside an approved range.
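
The core check such a validating webhook performs is small. The sketch below shows only that check, with no admission-review plumbing; the function name, the allowlist, and the sample Service are assumptions made for the example.

```python
import ipaddress


def disallowed_external_ips(service, allowed_cidrs):
    """Return the externalIPs of a Service that fall outside the allowed CIDRs."""
    allowed = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    rejected = []
    for raw in service.get("spec", {}).get("externalIPs") or []:
        ip = ipaddress.ip_address(raw)
        if not any(ip in network for network in allowed):
            rejected.append(raw)
    return rejected


svc = {"spec": {"externalIPs": ["203.0.113.10", "10.96.0.10"]}}
print(disallowed_external_ips(svc, ["203.0.113.0/24"]))  # ['10.96.0.10']
```

A webhook would deny the request whenever the returned list is non-empty; an empty allowlist blocks externalIPs entirely.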

Connection Draining

Graceful connection draining ensures in-flight requests complete before a pod is removed from load balancing. Kubernetes implements this through a chain of mechanisms:

Pod deletion triggered (kubectl delete pod / rolling update)
│
├─ 1. API server sets deletionTimestamp; the pod shows as Terminating and the
│     terminationGracePeriodSeconds countdown starts
├─ 2. EndpointSlice controller marks the endpoint as "terminating"
│     (ready:false, serving:true, terminating:true); kube-proxy stops
│     sending new connections to it
├─ 3. In parallel, kubelet executes the preStop hook (if defined); SIGTERM is
│     held back until the hook returns
├─ 4. Cloud LB deregisters the backend and drains connections (deregistration_delay)
│     AWS NLB: 300s by default; often lowered, e.g. to 30s
├─ 5. kubelet sends SIGTERM → app handles shutdown
├─ 6. If the grace period expires first: SIGKILL
└─ 7. Pod removed; endpoint removed from EndpointSlice

spec:
  terminationGracePeriodSeconds: 60   # total time before SIGKILL; default 30s
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]   # delay SIGTERM; let LB drain

# Best practice for HTTP servers:
# preStop sleep >= LB health check interval * unhealthy threshold
# (ensures LB removes pod from rotation before SIGTERM kills the server)

# Example with AWS NLB (30s drain) + application shutdown:
# preStop sleep: 5s  (LB health check fails, pod marked unhealthy)
# terminationGracePeriodSeconds: 65s (5s preStop + 30s LB drain + 30s app shutdown)
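
The two rules of thumb in the comments above reduce to simple arithmetic. This is a sketch of that budget calculation with hypothetical function names; the inputs are the values used in this section (10s health check interval, unhealthy threshold 2, 30s LB drain, 30s app shutdown).

```python
def min_prestop_sleep(health_check_interval_s, unhealthy_threshold):
    """Shortest preStop sleep that lets the LB health check mark the target unhealthy."""
    return health_check_interval_s * unhealthy_threshold


def min_grace_period(prestop_sleep_s, lb_drain_s, app_shutdown_s):
    """Lower bound for terminationGracePeriodSeconds; the countdown includes the preStop hook."""
    return prestop_sleep_s + lb_drain_s + app_shutdown_s


print(min_grace_period(prestop_sleep_s=5, lb_drain_s=30, app_shutdown_s=30))    # 65
print(min_prestop_sleep(health_check_interval_s=10, unhealthy_threshold=2))     # 20
```

With a 10s interval and a threshold of 2, the health-check rule asks for a preStop sleep of at least 20s, so a 5s sleep only suffices when the load balancer learns about the termination by another route, such as controller-driven deregistration.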

ℹ️

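EndpointSlice terminating condition (GA 1.26)

The serving and terminating conditions on EndpointSlice entries have been GA since 1.26, and kube-proxy's handling of terminating endpoints (ProxyTerminatingEndpoints) since 1.28. A terminating pod is reported as ready:false, so it stops receiving new connections while established connections are left to finish. The exception is the fallback case: if every endpoint available for a Service (or, with externalTrafficPolicy: Local, every endpoint on the node) is terminating, kube-proxy keeps forwarding to terminating endpoints that are still serving rather than dropping traffic. Ingress and Gateway API controllers can read the same conditions.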

Health Checks and Readiness

A pod is included in Service endpoints only when its readiness probe passes. This is the primary health-check gate for Kubernetes load balancing — no special configuration is needed for in-cluster load balancing.

spec:
  containers:
    - name: app
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3           # 3 consecutive failures → remove from LB
        successThreshold: 1           # 1 success → add back to LB

# For cloud LBs, configure custom health checks via annotations:
# AWS NLB health check:
service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/healthz"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "HTTP"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"

# GKE BackendConfig health check (see GKE section above)
# Azure health probe:
service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"

ECMP and BGP Load Balancing

In data center environments, Equal-Cost Multi-Path (ECMP) routing at the network layer provides load balancing without any Kubernetes-specific configuration. With BGP mode MetalLB or Calico BGP, all nodes advertise the same VIP — the router distributes connections across nodes via ECMP hashing:

Internet/WAN
      │
      ▼
ToR/Spine Router (ECMP hash across all nodes)
├── Node 1 (advertises 203.0.113.0/32 via BGP, ASN 64512)
├── Node 2 (advertises 203.0.113.0/32 via BGP, ASN 64512)
└── Node 3 (advertises 203.0.113.0/32 via BGP, ASN 64512)

ECMP hash: per-flow (src IP + dst IP + src port + dst port)
→ same flow always goes to same node (connection-level affinity)
→ flows distributed across all nodes that advertise the VIP

⚠️

ECMP and connection hashing pitfall

ECMP hashing is typically per-flow (5-tuple hash). When a node fails and its route is withdrawn from BGP, the set of next hops changes. Flows that were pinned to the failed node break (about 50% of flows in a 2-node cluster), and with plain modulo hashing many flows on the surviving nodes are also rehashed to a node that holds no connection state for them, which resets those connections too. This is a fundamental limitation of ECMP. Mitigations: resilient or consistent hashing on the router (where supported), faster failure detection (shorter BGP hold timers or BFD), or connection retry in the application.
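
The size of the rehashing problem can be estimated with a small simulation. This sketch compares plain modulo hashing with rendezvous (highest-random-weight) hashing, used here as a simple stand-in for the consistent or resilient hashing some routers offer. The hash function, flow tuples, and node names are assumptions for illustration; real router hashing differs by vendor.

```python
import hashlib


def _h(value):
    """Stable 64-bit hash of any printable value."""
    return int.from_bytes(hashlib.sha256(repr(value).encode()).digest()[:8], "big")


def pick_modulo(flow, nodes):
    """Plain ECMP-style choice: hash the flow, index into the next-hop list."""
    return nodes[_h(flow) % len(nodes)]


def pick_rendezvous(flow, nodes):
    """Rendezvous hashing: each node scores the flow, highest score wins."""
    return max(nodes, key=lambda node: _h((flow, node)))


def moved_fraction(pick, flows, before, after):
    """Fraction of flows on surviving nodes that land on a different node afterwards."""
    survivors = [f for f in flows if pick(f, before) in after]
    moved = [f for f in survivors if pick(f, after) != pick(f, before)]
    return len(moved) / len(survivors)


flows = [("198.51.100.%d" % (i % 250), 40000 + i, "203.0.113.0", 443) for i in range(2000)]
before = ["node-1", "node-2", "node-3", "node-4"]
after = ["node-1", "node-2", "node-4"]  # node-3 withdrawn from BGP

print(moved_fraction(pick_modulo, flows, before, after))      # roughly two thirds
print(moved_fraction(pick_rendezvous, flows, before, after))  # 0.0
```

With modulo hashing, most flows that were on healthy nodes are moved anyway; with rendezvous hashing, only the flows that were on the failed node are affected.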

Metrics, Alerting & Troubleshooting

Key Metrics

Metric	Source	What It Tells You
`kube_service_status_load_balancer_ingress`	kube-state-metrics	Count of LB services with a provisioned ingress IP/hostname
`kube_endpoint_address_available`	kube-state-metrics	Ready endpoints per service; drops = pods failing readiness
`kube_endpoint_address_not_ready`	kube-state-metrics	Not-ready endpoints; rises during rolling deployments
`kubeproxy_sync_proxy_rules_duration_seconds`	kube-proxy	Time to sync iptables/IPVS rules; spikes indicate rule explosion
`metallb_bgp_session_up`	MetalLB speaker	BGP session state per peer; 0 = no LB traffic for that node
`metallb_bgp_announced_prefixes_total`	MetalLB speaker	Number of VIPs being advertised; drops = pool exhaustion
`aws_nlb_target_group_unhealthy_host_count`	CloudWatch / exporter	NLB targets failing health checks; should stay near 0

Alerting Rules

# Alert: Service has no ready endpoints
- alert: ServiceNoReadyEndpoints
  expr: kube_endpoint_address_available{endpoint!="kubernetes"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.endpoint }} has no ready endpoints"
    description: "All pods may be failing readiness probes"

# Alert: LoadBalancer IP not provisioned
- alert: LoadBalancerNotProvisioned
  expr: |
    kube_service_spec_type{type="LoadBalancer"}
      unless on (namespace, service)
    kube_service_status_load_balancer_ingress > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "LoadBalancer service has no external IP after 10m"

# Alert: MetalLB BGP session down
- alert: MetalLBBGPSessionDown
  expr: metallb_bgp_session_up == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "MetalLB BGP session to {{ $labels.peer }} is down"

# Alert: High kube-proxy sync latency
- alert: KubeProxySyncLatencyHigh
  expr: |
    histogram_quantile(0.99,
      rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])
    ) > 5
  for: 10m
  labels:
    severity: warning

Troubleshooting Runbooks

Runbook 1: LoadBalancer Service Stuck in Pending (no external IP)

# Check service status
kubectl get svc my-lb -o wide
# EXTERNAL-IP: <pending> → CCM has not provisioned the LB yet

# Check CCM logs
kubectl logs -n kube-system -l component=cloud-controller-manager -f

# Common causes:
# - CCM not running (check kube-system pods)
# - IAM permissions missing (AWS: check ec2/elasticloadbalancing permissions)
# - Subnet missing tag (AWS: kubernetes.io/role/elb: "1")
# - MetalLB: pool exhausted or L2/BGP advertisement not created
kubectl get ipaddresspools -n metallb-system
kubectl get l2advertisements -n metallb-system

# Check events for the service
kubectl describe svc my-lb | grep -A10 Events

Runbook 2: Traffic Not Reaching Pods (LB IP reachable but 503/504)

# Step 1: Check endpoint health
kubectl get endpoints my-service
# If empty → no ready pods; check readiness probes
kubectl describe pod my-pod | grep -A10 "Readiness"
kubectl get events -n production | grep Readiness

# Step 2: Check with externalTrafficPolicy: Local
# If Local: no pods on nodes that LB is targeting → all 503
kubectl get pods -o wide | grep my-app  # are pods spread across nodes?
kubectl get svc my-service -o jsonpath='{.spec.healthCheckNodePort}'
curl -i http://node-ip:HEALTH_CHECK_PORT/healthz  # 200 if ready local pods exist, 503 otherwise

# Step 3: Test ClusterIP directly (bypasses cloud LB)
kubectl exec debug-pod -- curl http://my-service.namespace.svc.cluster.local/
# If this works: issue is in cloud LB → node path
# If this fails: issue is in kube-proxy or pod readiness

Runbook 3: Uneven Load Distribution Across Pods

# Compare per-pod load (kubectl top shows CPU/memory, only a proxy for request
# rate; use application or ingress metrics for actual requests per pod)
kubectl top pods -l app=my-app

# Causes of uneven distribution:
# 1. HTTP/2 multiplexing or keep-alive (all requests on one connection → one pod)
#    Fix: put an L7 proxy (Envoy, NGINX) or service mesh in the path for
#    per-request balancing; IPVS lc only balances new connections, not requests
# 2. sessionAffinity: ClientIP with cloud LB source IP clustering
#    Fix: externalTrafficPolicy: Local to preserve client IPs, or sessionAffinity: None
# 3. Topology-Aware Routing over-concentrating in one zone
#    kubectl get endpointslices -o yaml | grep -A5 hints
#    Fix: set service.kubernetes.io/topology-mode: disabled (or remove
#    spec.trafficDistribution), or add more pods per zone

# Switch kube-proxy to IPVS least-connection mode:
kubectl edit cm kube-proxy -n kube-system
# Change: mode: "ipvs", ipvs.scheduler: "lc"
kubectl rollout restart daemonset kube-proxy -n kube-system

Runbook 4: Connection Drops During Rolling Update

# Symptoms: HTTP 502/503 spikes during deployments
# Cause: pod shut down before the LB stopped sending it traffic, or before in-flight requests finished

# Fix 1: Add preStop hook to delay SIGTERM
spec:
  containers:
    - lifecycle:
        preStop:
          exec:
            command: ["sleep", "10"]

# Fix 2: Increase terminationGracePeriodSeconds
spec:
  terminationGracePeriodSeconds: 60

# Fix 3: AWS NLB: align the deregistration delay (default 300s) with the preStop and grace period budget
service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: |
  deregistration_delay.timeout_seconds=30

# Fix 4: Ensure maxSurge allows overlap during rollout
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0    # never remove old pod before new pod is ready

Runbook 5: MetalLB BGP Session Down

# Check speaker pod logs
kubectl logs -n metallb-system -l component=speaker | grep -i bgp

# Check BGP peer status (FRR mode; the native speaker has no BGP CLI, so rely on its logs)
kubectl exec -n metallb-system speaker-xxxx -c frr -- vtysh -c "show bgp summary"

# Common causes:
# - Router ASN mismatch: check BGPPeer spec.peerASN
# - MD5 auth mismatch: check BGPPeer spec.password
# - Firewall blocking TCP 179 between nodes and router
#   Test: kubectl exec -n metallb-system speaker-xxxx -- nc -zv router-ip 179
# - Router prefix limit exceeded: router rejecting BGP updates
# - MTU mismatch on the BGP session interface

# Check MetalLB logs for specific error
kubectl logs -n metallb-system -l app=metallb,component=speaker -f | grep -i "peer\|error\|bgp"

Best Practices

Use target-type: ip on AWS (VPC CNI) and NEGs on GKE. Direct pod routing removes the extra kube-proxy hop and the latency it adds, and it keeps the client IP available to pods (on AWS, with preserve_client_ip.enabled=true or Proxy Protocol v2).
Set externalTrafficPolicy: Local only when pods are spread evenly — uneven pod distribution causes severe load imbalance under Local policy. Pair with topologySpreadConstraints to ensure uniform pod distribution.
Configure connection draining for every production service — add a preStop sleep and set terminationGracePeriodSeconds longer than the LB drain timeout. The formula: terminationGracePeriodSeconds > LB_drain + preStop_sleep + app_shutdown_time.
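Use IPVS mode with lc for variable-cost workloads. iptables mode picks a backend at random per connection and ignores pod load, while IPVS least-connection sends each new connection to the pod with the fewest active connections. wrr brings no benefit here because kube-proxy gives every endpoint the same weight.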
Use trafficDistribution: PreferClose for multi-zone cost reduction — keeping traffic in-zone reduces cross-zone data transfer costs significantly on AWS, GCP, and Azure.
Test health check paths under load — cloud LBs use health checks to decide which nodes receive traffic. If your health check path is slow under load, nodes are falsely marked unhealthy and the LB oscillates.
For bare metal, prefer BGP mode MetalLB over L2 mode. L2 mode has a single-node bottleneck and slower failover. BGP mode provides true ECMP distribution and fast failover via BGP withdraw; pair it with BFD or short hold timers so crashed nodes are detected quickly.
Restrict externalIPs usage via RBAC or admission control. externalIPs lets a tenant intercept traffic meant for other addresses (CVE-2020-8554). Enable the DenyServiceExternalIPs admission controller or use a ValidatingWebhookConfiguration to block it in multi-tenant clusters.