Kubernetes Network Stack

Understanding the full packet path is essential for diagnosing network issues. A request from pod A to pod B traverses multiple layers — each is a potential failure point.

Packet Path: Pod to Pod (Same Node)
  Pod A (eth0: 10.0.1.5)
      │
      ▼ veth pair
  Node Linux bridge / eBPF (cilium0 or cni0)
      │ NetworkPolicy evaluated here (iptables or eBPF)
      ▼
  Pod B (eth0: 10.0.1.6)

Packet Path: Pod to Pod (Different Node)
  Pod A (10.0.1.5) → veth → CNI bridge → node routing table
      → tunnel/VXLAN/BGP encapsulation → eth0 → Node B eth0
      → de-encapsulate → CNI bridge → veth → Pod B

Packet Path: Pod to Service ClusterIP
  Pod A → DNS lookup → CoreDNS → 10.96.5.42 (ClusterIP)
      → kube-proxy iptables DNAT / eBPF load balancing
      → selected endpoint pod IP (10.0.2.8)
      → Pod B (cross-node routing)

| Layer | Component | Failure Mode | Debug Command |
|---|---|---|---|
| Pod networking | veth pair, CNI plugin | Pod can't reach any destination | kubectl exec <pod> -- sh -c 'ip addr; ip route' |
| Service resolution | CoreDNS, kube-proxy/eBPF | DNS failure or DNAT not working | kubectl exec <pod> -- nslookup svc.namespace |
| NetworkPolicy | CNI enforcement | Connection refused/timeout unexpectedly | hubble observe, kubectl describe netpol |
| Node routing | Linux kernel routing table | Pods on different nodes can't reach each other | ip route show; ip neigh show |
| Conntrack | nf_conntrack table | Intermittent connection drops under high RPS | wc -l < /proc/net/nf_conntrack |
| MTU | NIC, overlay tunnel | Large packets silently dropped | ping -M do -s 1400 <pod-ip> |
| Ingress | NGINX/Envoy/ALB | 502/504 errors, timeouts | NGINX error logs, upstream check |

CNI Operations

Checking CNI health

# Check CNI DaemonSet status (for all major CNIs)
kubectl get ds -n kube-system | grep -E "aws-node|cilium|calico|flannel|weave"

# AWS VPC CNI — check node IP assignment
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
kubectl logs -n kube-system -l k8s-app=aws-node --tail=50

# VPC CNI specific: check IP address pool and ENI attachment
kubectl describe ds -n kube-system aws-node | grep -A5 "Environment"
kubectl exec -n kube-system ds/aws-node -- /app/grpc-health-probe \
  -addr=:11191 -connect-timeout=5s

# Cilium: check agent status
cilium status --wait
cilium connectivity test   # full end-to-end test (takes 2-3 min)

# Check for pods stuck in ContainerCreating (CNI failure)
kubectl get pods -A | grep ContainerCreating
# If stuck, check kubelet logs on the node:
journalctl -u kubelet | grep -i cni | tail -50

AWS VPC CNI warm pool tuning

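# VPC CNI maintains a warm pool of pre-assigned IPs
# Tune to balance startup speed vs IP waste
#   WARM_IP_TARGET=5               keep 5 IPs pre-allocated per node
#   MINIMUM_IP_TARGET=10           minimum IPs to hold even when idle
#   WARM_ENI_TARGET=1              keep 1 ENI with available IPs
#                                  (overridden by WARM_IP_TARGET when both are set)
#   MAX_ENI=4                      max ENIs per node (instance type limit)
#   ENABLE_PREFIX_DELEGATION=true  /28 prefix (16 IPs) per ENI slot

kubectl set env daemonset aws-node -n kube-system \
  WARM_IP_TARGET=5 \
  MINIMUM_IP_TARGET=10 \
  WARM_ENI_TARGET=1 \
  MAX_ENI=4 \
  ENABLE_PREFIX_DELEGATION=true

# After changing: with the default RollingUpdate strategy the DaemonSet
# restarts aws-node on each node (kubectl rollout status ds/aws-node -n kube-system).
# Existing nodes keep the IPs they already hold, so roll the nodes when
# switching ENABLE_PREFIX_DELEGATION.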

# Monitor IP pool health
kubectl exec -n kube-system ds/aws-node -- \
  curl -s http://localhost:61679/v1/networkinterfaces | jq '.'
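
The warm-pool settings only matter within the node's hard IP ceiling, so it can help to compute that ceiling first. The sketch below assumes the usual VPC CNI accounting: one slot per ENI is taken by the ENI's own primary IP, and with prefix delegation each remaining slot holds a /28 (16 addresses). The function name and the example inputs (3 ENIs, 10 slots each) are illustrative; look up the real limits for your instance type.

```python
def max_pod_ips(enis: int, slots_per_eni: int, prefix_delegation: bool = False) -> int:
    """Upper bound on pod IPs a node can hold under the assumptions above."""
    if enis < 1 or slots_per_eni < 2:
        raise ValueError("need at least 1 ENI and 2 slots per ENI")
    # One slot on every ENI is used by the ENI's primary IP.
    usable_slots = enis * (slots_per_eni - 1)
    # Prefix delegation turns each slot into a /28 prefix (16 addresses).
    return usable_slots * 16 if prefix_delegation else usable_slots


# Example inputs (assumed): 3 ENIs with 10 slots each.
print(max_pod_ips(3, 10))                          # secondary-IP mode
print(max_pod_ips(3, 10, prefix_delegation=True))  # prefix-delegation mode
```

The kubelet max-pods setting still caps how many pods are actually scheduled, so treat the result as an address ceiling rather than a pod count.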

Cilium node-to-node connectivity

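# Check Cilium node mesh (every node should see every other node)
# (node list and endpoint list are agent commands, so run them in the agent pod)
kubectl exec -n kube-system ds/cilium -- cilium node list

# Check endpoint health
kubectl exec -n kube-system ds/cilium -- cilium endpoint list | grep -v "ready"

# Verify BPF datapath is active (not falling back to iptables)
kubectl exec -n kube-system ds/cilium -- cilium status | grep "KubeProxyReplacement"
# Should show: KubeProxyReplacement: True (older releases report Strict)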

# Check BPF map usage (high usage → map too small)
kubectl exec -n kube-system ds/cilium -- cilium bpf ct list global | wc -l

# Restart a specific Cilium agent (node-targeted)
kubectl delete pod -n kube-system -l k8s-app=cilium \
  --field-selector spec.nodeName=ip-10-0-1-42.us-east-1.compute.internal

CoreDNS Operations

Diagnosing DNS failures

# Test DNS resolution from a pod
kubectl run dns-test --image=busybox:1.36 --restart=Never -it --rm -- \
  nslookup payment-service.payments.svc.cluster.local

# Test external DNS resolution
kubectl run dns-test --image=busybox:1.36 --restart=Never -it --rm -- \
  nslookup api.stripe.com

# Test with dig (more detail)
kubectl run dns-test --image=tutum/dnsutils --restart=Never -it --rm -- \
  dig +short payment-service.payments.svc.cluster.local

# Check CoreDNS pods and logs
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100 | grep -i "error\|refused\|timeout"

# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml

# Watch CoreDNS query log (enable log plugin temporarily)
kubectl edit configmap coredns -n kube-system
# Add: log
# between ready and kubernetes blocks
# REMOVE after debugging — generates huge volume of logs

CoreDNS performance metrics

# CoreDNS request rate
sum(rate(coredns_dns_requests_total[5m])) by (server, zone, type)

# CoreDNS error rate (SERVFAIL, NXDOMAIN)
sum by (rcode) (rate(coredns_dns_responses_total{rcode=~"SERVFAIL|NXDOMAIN"}[5m]))

# CoreDNS p99 latency (target: < 2ms for cached; < 10ms for forwarded)
histogram_quantile(0.99,
  sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m]))
)

# CoreDNS cache hit ratio
sum(rate(coredns_cache_hits_total[5m]))
  /
(sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))

CoreDNS common issues and fixes

| Symptom | Cause | Fix |
|---|---|---|
| SERVFAIL for external names | Upstream resolver unreachable | Check /etc/resolv.conf on nodes; verify VPC DNS (169.254.169.253) accessible |
| Intermittent NXDOMAIN for cluster services | ndots:5 search domain exhausted; race condition | Use FQDNs; set ndots: 3; install NodeLocal DNSCache |
| High CoreDNS CPU | Too many queries; no caching | Increase cache TTL in Corefile; add prefetch; install NodeLocal DNSCache |
| CoreDNS pod OOMKilled | Memory leak (known in older versions) | Upgrade CoreDNS; set memory limit 256Mi+; check for large zone data |
| 5-second DNS delay | conntrack race for UDP (parallel A/AAAA queries) | Add single-request-reopen to pod dnsConfig; or use NodeLocal DNSCache |
| DNS resolution works once, then fails | CoreDNS pod restarted; DNS negative cache not cleared | Check CoreDNS pod restarts; client-side DNS cache TTL |

5-second DNS delay fix (conntrack race)

# This is a well-known Linux kernel issue:
# Parallel A + AAAA queries share the same UDP 5-tuple, causing conntrack collision
# Result: one query is dropped; OS waits 5s for timeout before retry

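# Fix 1: single-request-reopen in pod dnsConfig (glibc only: opens a new socket,
# and so a new source port, for the second query; musl-based images ignore it)
# (Add to pod spec; see 02-performance-tuning.html)

# Fix 2: NodeLocal DNSCache (runs on each node, bypasses conntrack for DNS)
# Install: the upstream manifest contains __PILLAR__ placeholders that must be
# substituted before applying (values below assume kube-proxy in iptables mode)
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
curl -sLO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
sed -i "s/__PILLAR__LOCAL__DNS__/169.254.20.10/g; s/__PILLAR__DNS__DOMAIN__/cluster.local/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml

# Fix 3: CoreDNS autopath plugin (reduces search domain queries)
# Add to Corefile (requires "pods verified" in the kubernetes block):
# autopath @kubernetes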

# Verify NodeLocal DNSCache running on all nodes
kubectl get ds -n kube-system node-local-dns
kubectl get pods -n kube-system -l k8s-app=node-local-dns

Ingress Operations

NGINX Ingress controller operations

# Check NGINX Ingress controller status
kubectl get pods -n ingress-nginx -l app.kubernetes.io/component=controller
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller --tail=100

# Inspect effective NGINX config for a specific Ingress
CONTROLLER_POD=$(kubectl get pods -n ingress-nginx -l app.kubernetes.io/component=controller \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ingress-nginx $CONTROLLER_POD -- cat /etc/nginx/nginx.conf | \
  grep -A20 "server_name.*payment"

# Check upstream health (NGINX upstream list)
kubectl exec -n ingress-nginx $CONTROLLER_POD -- \
  curl -s http://localhost:10246/configuration/backends | \
  jq '.[] | select(.name | contains("payments")) | {name, endpoints}'

# Reload NGINX configuration (triggers graceful reload)
kubectl exec -n ingress-nginx $CONTROLLER_POD -- nginx -s reload

# Check NGINX error logs (502/504 root cause)
kubectl logs -n ingress-nginx $CONTROLLER_POD --tail=200 | \
  grep -E "error|upstream|connect"

Ingress annotation tuning for production

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-api
  namespace: payments
  annotations:
    # Timeouts
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"

    # Connection handling: upstream keepalive is a controller-wide setting,
    # not a per-Ingress annotation. Set these keys in the ingress-nginx
    # controller ConfigMap instead:
    #   upstream-keepalive-connections: "100"
    #   upstream-keepalive-requests: "1000"
    #   upstream-keepalive-time: "60s"

    # Body size
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"

    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "100"
    nginx.ingress.kubernetes.io/limit-connections: "10"

    # SSL
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod

    # CORS (if needed)
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://app.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"

    # Canary (Argo Rollouts manages this for progressive delivery)
    # nginx.ingress.kubernetes.io/canary: "true"
    # nginx.ingress.kubernetes.io/canary-weight: "10"

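    # Security headers (snippet annotations must be allowed in the controller
    # ConfigMap via allow-snippet-annotations; recent releases disable them by default)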
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-Frame-Options: DENY";
      more_set_headers "X-Content-Type-Options: nosniff";
      more_set_headers "Referrer-Policy: strict-origin-when-cross-origin";
      more_set_headers "Permissions-Policy: camera=(), microphone=(), geolocation=()";
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-example-com-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 80

NGINX Ingress metrics

# Request rate per Ingress
sum by (ingress, namespace) (
  rate(nginx_ingress_controller_requests[5m])
)

# Error ratio (4xx + 5xx); both sides aggregate by the same label so the division matches
sum by (ingress) (
  rate(nginx_ingress_controller_requests{status=~"[45].."}[5m])
)
  /
sum by (ingress) (
  rate(nginx_ingress_controller_requests[5m])
)

# p99 response time
histogram_quantile(0.99,
  sum by (ingress, le) (
    rate(nginx_ingress_controller_response_duration_seconds_bucket[5m])
  )
)

# Upstream connection errors
sum by (ingress) (
  rate(nginx_ingress_controller_upstream_latency_seconds_count{status="error"}[5m])
)

AWS Load Balancer Controller (ALB)

# Check ALB controller health
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=50

# Describe Ingress for ALB events
kubectl describe ingress payment-api -n payments

# Check target group health directly (ALB target group → pod IPs)
ALB_ARN=$(aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[?contains(LoadBalancerName, `payment`)].LoadBalancerArn' \
  --output text)
TG_ARN=$(aws elbv2 describe-target-groups \
  --load-balancer-arn $ALB_ARN \
  --query 'TargetGroups[0].TargetGroupArn' --output text)
aws elbv2 describe-target-health --target-group-arn $TG_ARN | \
  jq '.TargetHealthDescriptions[] | {target:.Target, health:.TargetHealth}'

Service Mesh Operations

Istio / Envoy sidecar diagnostics

# Check Istio control plane health
istioctl verify-install
kubectl get pods -n istio-system

# Check proxy status for all pods in a namespace
istioctl proxy-status -n payments

# Analyze potential configuration issues
istioctl analyze -n payments

# Dump Envoy config for a specific pod (very detailed)
istioctl proxy-config all payment-pod-xyz.payments

# Check listener configuration (what ports Envoy is listening on)
istioctl proxy-config listener payment-pod-xyz.payments

# Check cluster configuration (upstreams Envoy knows about), filtered to one service
istioctl proxy-config cluster payment-pod-xyz.payments \
  --fqdn payment-service.payments.svc.cluster.local

# Check route configuration
istioctl proxy-config route payment-pod-xyz.payments

# View real-time Envoy access logs
kubectl logs payment-pod-xyz -n payments -c istio-proxy --tail=50

# Check the mTLS and routing configuration that applies to a pod
istioctl x describe pod payment-pod-xyz -n payments

Envoy stats for latency debugging

# Access Envoy admin interface (run in the background so the curl commands below can use it)
kubectl port-forward pod/payment-pod-xyz -n payments 15000:15000 &

# Get upstream latency histograms from Envoy stats
curl -s http://localhost:15000/stats | \
  grep "upstream_rq_time\|upstream_cx_connect_ms" | \
  grep -v "^$"

# Get circuit breaker status
curl -s http://localhost:15000/stats | \
  grep "overflow\|pending_overflow"

# Get retry stats
curl -s http://localhost:15000/stats | \
  grep "upstream_rq_retry"

# Live request logging (Envoy tap API; only works if the tap HTTP filter is
# configured on the listener with a matching config_id)
curl -X POST http://localhost:15000/tap \
  -H "Content-Type: application/json" \
  -d '{"config_id":"test","tap_config":{"match_config":{"any_match":{}},"output_config":{"sinks":[{"format":"JSON_BODY_AS_STRING","streaming_admin":{}}]}}}'

Network Debugging Toolkit

netshoot — full network debug pod

# Run netshoot in target namespace
kubectl run netshoot -n payments \
  --image=nicolaka/netshoot \
  --restart=Never -it --rm \
  -- bash

# Inside netshoot — useful commands:
# TCP connectivity test
nc -zv payment-service 80
nc -zv postgresql.databases.svc.cluster.local 5432

# HTTP test with headers
curl -v http://payment-service/health
curl -v --resolve "api.example.com:443:10.100.5.20" https://api.example.com/health

# Trace route to a pod
traceroute 10.0.2.5

# DNS trace
dig +trace payment-service.payments.svc.cluster.local
dig +stats @169.254.20.10 payment-service.payments.svc.cluster.local

# TCP packet capture (requires NET_ADMIN or node access)
tcpdump -i eth0 -n 'host 10.0.2.5 and port 80' -w /tmp/capture.pcap

# Show socket stats
ss -tunaep
netstat -tunaep

# Show connection count to a specific IP
ss -tn | grep 10.0.2.5 | wc -l

kubectl debug — ephemeral container for network issues

# Attach netshoot as ephemeral container to a running pod
kubectl debug -it payment-pod-xyz \
  --image=nicolaka/netshoot \
  --target=payment-service \
  -n payments

# Once inside, you share the pod's network namespace:
# - Same IP as the target pod
# - Can tcpdump the pod's actual traffic
tcpdump -i eth0 -n port 5432

# Capture and analyze with tshark
tshark -i eth0 -Y "tcp.port==5432" -T fields \
  -e frame.time -e ip.src -e ip.dst -e tcp.flags

Node-level network debugging

# Check iptables rules (kube-proxy managed)
iptables-save | grep -E "KUBE-SVC|KUBE-SEP" | grep payment | head -20

# Check kube-proxy endpoints for a service
kubectl get endpoints payment-service -n payments -o yaml

# Verify service ClusterIP is reachable from node
curl -v http://10.96.5.42/health   # ClusterIP

# Check if IPVS is being used (alternative to iptables)
ipvsadm -Ln | grep 10.96.5.42

# Check BPF (Cilium eBPF)
kubectl exec -n kube-system ds/cilium -- \
  cilium service list | grep payment

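# Check BGP peers and routes (Cilium BGP control plane; for Calico use calicoctl node status)
kubectl exec -n kube-system ds/cilium -- cilium bgp peers
kubectl exec -n kube-system ds/cilium -- \
  cilium bgp routes available ipv4 unicast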

Conntrack Table Exhaustion

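The Linux conntrack table tracks every active TCP/UDP connection for NAT and stateful firewall purposes. Under high connection rates the table fills up, and the kernel then drops packets for new connections (logging "nf_conntrack: table full, dropping packet"), so new connections fail without any application-level error. This is one of the most insidious Kubernetes networking problems.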

Diagnosing conntrack exhaustion

# Check current conntrack table usage
cat /proc/sys/net/netfilter/nf_conntrack_count      # current entries
cat /proc/sys/net/netfilter/nf_conntrack_max        # maximum

# Usage percentage
echo "scale=2; $(cat /proc/sys/net/netfilter/nf_conntrack_count) * 100 / \
  $(cat /proc/sys/net/netfilter/nf_conntrack_max)" | bc

# Stream conntrack events (requires conntrack-tools); heavy DESTROY churn points to short-lived connections
conntrack -E -e DESTROY | head -20

# Check kernel drop counters (per CPU): watch insert_failed, drop, early_drop
conntrack -S
# The kernel also logs table exhaustion
dmesg | grep "nf_conntrack: table full"

# Monitor continuously
watch -n1 'echo "Used: $(cat /proc/sys/net/netfilter/nf_conntrack_count) / \
  Max: $(cat /proc/sys/net/netfilter/nf_conntrack_max)"'
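
The same check can be scripted when it needs to feed a report or a cron job. This is a minimal sketch that reads the two /proc files used above and classifies the ratio against 80% and 95% thresholds (the same values this runbook uses for its alerts). The function names are illustrative, not part of any existing tool.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/netfilter")


def utilization(count: int, maximum: int) -> float:
    """Fraction of the conntrack table in use (0.0 to 1.0)."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def severity(ratio: float, warn: float = 0.80, crit: float = 0.95) -> str:
    """Map a utilization ratio to ok / warning / critical."""
    if ratio >= crit:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


def read_conntrack(proc_dir: Path = PROC_DIR) -> tuple:
    """Read (current entries, maximum entries) from the kernel."""
    count = int((proc_dir / "nf_conntrack_count").read_text())
    maximum = int((proc_dir / "nf_conntrack_max").read_text())
    return count, maximum


# Only runs on a node where the nf_conntrack module is loaded.
if __name__ == "__main__" and (PROC_DIR / "nf_conntrack_count").exists():
    used, limit = read_conntrack()
    ratio = utilization(used, limit)
    print(f"{used}/{limit} ({ratio:.1%}) -> {severity(ratio)}")
```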

Conntrack table increase (kernel tuning DaemonSet)

# In the node-tuner DaemonSet initContainer (see 04-security-hardening.html):
sysctl -w net.netfilter.nf_conntrack_max=1048576        # 1M entries
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_close_wait=15

# Also increase hashsize (requires module parameter)
echo 262144 > /sys/module/nf_conntrack/parameters/hashsize

# Verify
cat /proc/sys/net/netfilter/nf_conntrack_max

Tip: Cilium eBPF bypasses conntrack for pod traffic

Cilium in eBPF mode performs load balancing and policy enforcement entirely in the BPF datapath, bypassing iptables and the netfilter conntrack table for east-west (pod-to-pod) traffic. This removes netfilter conntrack exhaustion as a failure mode for internal traffic; Cilium tracks connections in its own BPF maps instead, which have their own size limits. Only external traffic (ingress) still goes through conntrack. If conntrack exhaustion is a recurring problem, migrating from kube-proxy iptables to Cilium eBPF is the architectural fix.

Conntrack metrics

# Conntrack table utilization per node
node_nf_conntrack_entries / node_nf_conntrack_entries_limit

# Conntrack entries rate (growth trend)
rate(node_nf_conntrack_entries[5m])

MTU Problems

MTU (Maximum Transmission Unit) mismatches cause large packets to be silently dropped. In overlay networks (VXLAN/Geneve), the encapsulation overhead reduces the effective MTU. MTU issues are notoriously hard to diagnose because small packets work fine while large ones fail.

MTU Overhead in Overlay Networks
  Physical NIC MTU: 9000 (jumbo frames) or 1500 (standard)

  VXLAN encapsulation overhead:
  Outer Ethernet: 14B + Outer IP: 20B + UDP: 8B + VXLAN: 8B = 50B overhead

  Effective pod MTU: 1500 - 50 = 1450 (with standard frames)
                    9000 - 50 = 8950 (with jumbo frames)

  If pod MTU is set to 1500 and the physical MTU is also 1500:
  Encapsulated packets reach 1550 bytes and get fragmented or dropped
  → HTTP works (small packets) but uploads/downloads fail
  → gRPC streams stall after first few kilobytes
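
The overhead arithmetic above can be written out as a small helper, which is also handy for working out the largest ping payload that should pass a given MTU. This is a sketch using the header sizes listed in the diagram for IPv4 VXLAN; the function names are illustrative.

```python
# Header sizes in bytes, as listed in the diagram above (IPv4 outer header).
VXLAN_OVERHEAD = 14 + 20 + 8 + 8   # outer Ethernet + outer IP + UDP + VXLAN
ICMP_OVERHEAD = 20 + 8             # IPv4 header + ICMP echo header


def pod_mtu(physical_mtu: int, overhead: int = VXLAN_OVERHEAD) -> int:
    """Largest pod MTU that still fits in one physical frame after encapsulation."""
    return physical_mtu - overhead


def max_ping_payload(mtu: int) -> int:
    """Largest value for ping -s that fits in the given MTU with DF set."""
    return mtu - ICMP_OVERHEAD


for nic_mtu in (1500, 9000):
    mtu = pod_mtu(nic_mtu)
    print(f"NIC {nic_mtu}: pod MTU {mtu}, max ping -s {max_ping_payload(mtu)}")
```

Other encapsulations (Geneve, IPv6 outer headers, WireGuard) have different overheads, so pass the appropriate value as the overhead argument.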

Diagnosing MTU issues

# Check current MTU settings inside a pod
kubectl exec -n payments pod/payment-pod-xyz -- ip link show eth0
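# Look for: mtu 1450 (VXLAN over standard frames) or mtu 8950 (VXLAN over jumbo frames);
# mtu 1500 on a VXLAN overlay with a 1500-byte NIC indicates a misconfiguration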

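# Test with increasing packet sizes (PMTUD: Path MTU Discovery)
# -s sets the ICMP payload; the packet on the wire is 28 bytes larger (IP + ICMP headers)
# Single quotes keep $size and $(...) from expanding in the local shell;
# netshoot is used because busybox ping has no -M option
kubectl run mtu-test --image=nicolaka/netshoot --restart=Never -it --rm -- sh -c '
for size in 1400 1450 1472 1500 1510 8900 8950; do
  result=$(ping -M do -c1 -W2 -s $size 10.0.1.5 2>&1 | tail -1)
  echo "Size $size: $result"
done'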

# Packets > MTU with DF bit set should fail with "Message too long"
# If they don't fail but application hangs — MTU black hole (no ICMP returned)

# Check for MTU black hole: try large HTTP body
kubectl exec -n payments pod/payment-pod-xyz -- \
  curl -v --max-time 5 http://order-service/api/data \
  -d "$(dd if=/dev/urandom bs=2048 count=1 2>/dev/null | base64)"
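
Rather than trying a fixed list of sizes, the largest payload that passes can be found with a binary search. The sketch below separates the search from the probe so the search can be exercised without a network; `ping_probe` is an illustrative wrapper around the same iputils `ping -M do` invocation used above, and it assumes larger packets never succeed where smaller ones fail.

```python
import subprocess
from typing import Callable, Optional


def largest_passing_size(probe: Callable[[int], bool], low: int, high: int) -> Optional[int]:
    """Largest size in [low, high] for which probe(size) is True, or None.

    Assumes the probe is monotonic: once a size fails, all larger sizes fail.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def ping_probe(target: str) -> Callable[[int], bool]:
    """Build a probe that sends one DF-flagged ping with the given payload size."""
    def probe(size: int) -> bool:
        cmd = ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(size), target]
        return subprocess.run(cmd, capture_output=True).returncode == 0
    return probe


# Offline demonstration with a simulated path whose limit is a 1422-byte payload.
print(largest_passing_size(lambda size: size <= 1422, 1000, 9000))
```

On a real path, call `largest_passing_size(ping_probe("10.0.1.5"), 1000, 9000)` from a pod with iputils ping and add 28 bytes to the result to get the path MTU.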

MTU configuration fixes

# Cilium: set MTU explicitly in Helm values (the chart key is upper-case MTU)
MTU: 1450   # 1500 - 50 (VXLAN) = 1450; or auto-detect with 0

# AWS VPC CNI: pod MTU comes from the AWS_VPC_ENI_MTU setting (default 9001)
# Verify:
kubectl describe ds -n kube-system aws-node | grep -i mtu

# Flannel: configure in ConfigMap
kubectl edit configmap kube-flannel-cfg -n kube-flannel
# In net-conf.json set: "Backend": {"Type": "vxlan", "MTU": 1450}

# For jumbo frames (9000 MTU on AWS enhanced networking):
# Set pod MTU to 8950 (9000 - 50 overhead)
# Ensure ALL nodes use the same MTU — mixed MTU is the worst case

Cilium & Hubble Observability

Hubble is Cilium's eBPF-based network observability platform. It provides a real-time view of every packet flowing through the cluster — who is talking to whom, what was allowed or dropped, and HTTP-level details — without any application instrumentation.

Hubble setup

# Enable Hubble in Cilium Helm values
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set hubble.metrics.enableOpenMetrics=true \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2}"

# Install Hubble CLI
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -LO "https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz"
tar xzvf hubble-linux-amd64.tar.gz
sudo mv hubble /usr/local/bin/

# Port-forward Hubble relay
kubectl port-forward -n kube-system svc/hubble-relay 4245:80 &

# Access Hubble UI
kubectl port-forward -n kube-system svc/hubble-ui 12000:80

Hubble observe — real-time flow analysis

# Watch all flows in the payments namespace
hubble observe --namespace payments

# Watch only dropped flows (NetworkPolicy denials)
hubble observe --namespace payments --verdict DROPPED

# Watch flows from a specific pod
hubble observe --from-pod payments/payment-pod-xyz

# Watch HTTP flows (L7 visibility requires Cilium L7 policy)
hubble observe --namespace payments --protocol http

# Filter flows between specific pods
hubble observe \
  --from-pod payments/payment-service \
  --to-pod databases/postgresql

# Get flow statistics (top talkers)
# (.flow // .) also handles output where each flow is wrapped in a top-level "flow" key
hubble observe --namespace payments \
  --output json | \
  jq -r '(.flow // .) | [.source.pod_name, .destination.pod_name, .l4.TCP.destination_port] | @tsv' | \
  sort | uniq -c | sort -rn | head -20

# Find all unique destination ports from payments namespace
hubble observe --namespace payments --output json | \
  jq -r '(.flow // .) | .l4.TCP.destination_port // .l4.UDP.destination_port' | \
  sort -n | uniq -c | sort -rn
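
When the jq pipelines get unwieldy, the same aggregation can be done in a short script. This sketch counts (source pod, destination pod, destination port) triples from JSON lines, using the field names referenced in the jq filters above. It accepts both a bare flow object and one wrapped in a top-level "flow" key, which is an assumption about how output may differ across Hubble CLI versions. The sample records are made up for illustration.

```python
import json
from collections import Counter


def top_talkers(lines, limit=20):
    """Count flows per (source pod, destination pod, destination port)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        flow = record.get("flow", record)      # unwrap if nested under "flow"
        src = flow.get("source", {}).get("pod_name", "")
        dst = flow.get("destination", {}).get("pod_name", "")
        l4 = flow.get("l4", {})
        port = (l4.get("TCP") or l4.get("UDP") or {}).get("destination_port")
        counts[(src, dst, port)] += 1
    return counts.most_common(limit)


# Made-up sample records in both shapes.
sample = [
    '{"flow": {"source": {"pod_name": "payment-a"}, "destination": {"pod_name": "postgresql-0"}, "l4": {"TCP": {"destination_port": 5432}}}}',
    '{"source": {"pod_name": "payment-a"}, "destination": {"pod_name": "postgresql-0"}, "l4": {"TCP": {"destination_port": 5432}}}',
    '{"flow": {"source": {"pod_name": "payment-a"}, "destination": {"pod_name": "coredns-1"}, "l4": {"UDP": {"destination_port": 53}}}}',
]
for (src, dst, port), n in top_talkers(sample):
    print(n, src, dst, port)
```

To use it on live data, feed the function the lines of `hubble observe --namespace payments --output json` (for example by passing `sys.stdin`).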

Hubble for NetworkPolicy debugging

# Which pods are being denied by NetworkPolicy?
# (.flow // .) also handles output where each flow is wrapped in a top-level "flow" key
hubble observe --verdict DROPPED --namespace payments --output json | \
  jq -r '(.flow // .) | "DROP: \(.source.pod_name) → \(.destination.pod_name):\(.l4.TCP.destination_port // .l4.UDP.destination_port) [\(.drop_reason_desc)]"'

# Common drop reasons:
# POLICY_DENIED      — NetworkPolicy explicitly blocks
# UNKNOWN_CONNECTION — conntrack miss (usually timeout)
# PORT_UNREACHABLE   — no listener on target port

# Find which NetworkPolicy is blocking a specific flow
cilium policy trace \
  --src-identity 12345 \
  --dst-identity 67890 \
  --dport 5432/tcp

# Get Cilium endpoint identity numbers
kubectl exec -n kube-system ds/cilium -- \
  cilium endpoint list | grep payment

NetworkPolicy Operations

Testing NetworkPolicy effectiveness

# Test that default-deny is working (should timeout/refuse)
kubectl run test-deny \
  --image=busybox --restart=Never -n payments -it --rm \
  -- nc -zv -w3 order-service.orders 80
# Expected: a timeout after 3s (packets dropped by policy), or
# nc: bad address 'order-service.orders' if egress DNS is blocked as well

# Test that allowed traffic works
kubectl run test-allow \
  --image=busybox --restart=Never -n payments -it --rm \
  -- nc -zv -w3 postgresql.databases 5432
# Expected: the port is reported open, e.g. postgresql.databases (<ip>:5432) open

# List all NetworkPolicies and their selectors
kubectl get networkpolicies -n payments -o json | \
  jq -r '.items[] | "\(.metadata.name): podSelector=\(.spec.podSelector | tostring)"'

# Check which pods are selected by a policy
kubectl get pods -n payments -l "$(
  kubectl get networkpolicy default-deny -n payments \
    -o jsonpath='{.spec.podSelector.matchLabels}' | \
    jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")' 2>/dev/null || echo "app=payment-service"
)"

Network policy visualization

# Generate a network policy diagram with netpol-viewer (krew plugin)
kubectl netpol graph -n payments

# Or export to Graphviz DOT format:
kubectl get networkpolicies -n payments -o json | \
  python3 -c "
import json, sys
data = json.load(sys.stdin)
print('digraph G {')
for policy in data['items']:
    name = policy['metadata']['name']
    ns = policy['metadata']['namespace']
    for rule in policy.get('spec', {}).get('ingress', []):
        for src in rule.get('from', []):
            ns_sel = src.get('namespaceSelector', {}).get('matchLabels', {})
            pod_sel = src.get('podSelector', {}).get('matchLabels', {})
            label = str(ns_sel) + '/' + str(pod_sel)
            print(f'  \"{label}\" -> \"{ns}/{name}\"')
print('}')
"

Cross-AZ Traffic Optimization

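Cross-AZ traffic in AWS costs $0.01/GB in each direction and adds 1–3ms latency. For services with high RPS, this is significant. Kubernetes has built-in topology-aware routing that prefers same-AZ endpoints: topology hints arrived as alpha in 1.21, and the service.kubernetes.io/topology-mode annotation used here requires 1.27 or later.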

Topology-aware service routing

apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: payments
spec:
  selector:
    app: payment-service
  ports:
    - port: 80
      targetPort: 8080
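  # Node-local routing: in-cluster traffic only goes to endpoints on the
  # caller's own node (for DaemonSet-backed, node-local services)
  internalTrafficPolicy: Local
  # The older spec.topologyKeys field was deprecated in 1.21 and removed in
  # 1.22; use the topology annotation in the next manifest instead:
  # topologyKeys:
  #   - topology.kubernetes.io/zone
  #   - "*"
---
# K8s 1.27+: EndpointSlice topology hints (preferred approach);
# earlier releases use service.kubernetes.io/topology-aware-hints: auto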
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: payments
  annotations:
    service.kubernetes.io/topology-mode: Auto   # enable topology-aware routing
spec:
  selector:
    app: payment-service
  ports:
    - port: 80
      targetPort: 8080

# Verify topology hints are set on EndpointSlice
kubectl get endpointslice -n payments -l kubernetes.io/service-name=payment-service -o yaml | \
  grep -A5 "hints:"

# Monitor cross-AZ traffic (CloudWatch / VPC Flow Logs)
aws ec2 describe-flow-logs \
  --filter Name=resource-id,Values=vpc-0123456789abcdef0 \
  --query 'FlowLogs[*].{LogGroupName:LogGroupName,Status:FlowLogStatus}'

# Estimate cross-AZ cost reduction
# Before topology hints: traffic is spread across endpoints in all AZs
# After topology hints: ~70% same-AZ, ~30% cross-AZ
# Each 1TB/day of traffic kept inside the AZ saves about $10/day at $0.01/GB
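
The estimate above can be made concrete with a few lines of arithmetic. This sketch assumes three AZs with evenly spread endpoints (so roughly two thirds of traffic crosses an AZ boundary before hints), the ~30% cross-AZ share quoted above after hints, 1 TB counted as 1000 GB, and the $0.01/GB rate. The `billed_directions` parameter is there because AWS charges the rate on both the sending and receiving side; it defaults to 1 to match the single-rate figure used in this section.

```python
def daily_cross_az_cost(total_gb_per_day: float,
                        cross_az_fraction: float,
                        usd_per_gb: float = 0.01,
                        billed_directions: int = 1) -> float:
    """Daily data-transfer cost for the share of traffic that crosses AZs."""
    if not 0.0 <= cross_az_fraction <= 1.0:
        raise ValueError("cross_az_fraction must be between 0 and 1")
    return total_gb_per_day * cross_az_fraction * usd_per_gb * billed_directions


TRAFFIC_GB = 1000                                      # assumed: 1 TB/day of service traffic
before = daily_cross_az_cost(TRAFFIC_GB, 2 / 3)        # 3 AZs, even spread
after = daily_cross_az_cost(TRAFFIC_GB, 0.30)          # with topology hints
print(f"before ${before:.2f}/day, after ${after:.2f}/day, saved ${before - after:.2f}/day")
```

Substitute measured byte counts from VPC Flow Logs for the assumed fractions to get a figure for your own cluster.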

Using local traffic for kube-proxy / Cilium

# For NodePort services: route only to local pods (avoids cross-node hop)
spec:
  type: NodePort
  externalTrafficPolicy: Local   # only route to pods on THIS node
  # Tradeoff: uneven load distribution if pods aren't evenly distributed
  # Use with topologySpreadConstraints to ensure even pod distribution
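
The uneven-load tradeoff is easier to see with numbers. This sketch assumes the external load balancer spreads traffic evenly across the nodes that have at least one local pod, and that each node then splits its share evenly among its local pods. The node names and pod counts are made up for illustration.

```python
from typing import Dict


def per_pod_share(pods_per_node: Dict[str, int]) -> Dict[str, float]:
    """Fraction of total traffic each pod on a node receives under Local policy."""
    # Nodes without a local pod receive no traffic under the assumptions above.
    active = {node: pods for node, pods in pods_per_node.items() if pods > 0}
    if not active:
        return {}
    node_share = 1.0 / len(active)
    return {node: node_share / pods for node, pods in active.items()}


# Made-up placement: one pod on node-a, three on node-b, none on node-c.
shares = per_pod_share({"node-a": 1, "node-b": 3, "node-c": 0})
for node, share in shares.items():
    print(f"{node}: each pod handles {share:.1%} of traffic")
```

In this example the single pod on node-a handles three times the load of each pod on node-b, which is why even pod placement matters when this policy is enabled.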

Network Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: network-operations-alerts
  namespace: monitoring
spec:
  groups:
    - name: network.operations
      rules:

        # Conntrack table near exhaustion
        - alert: ConntrackTableNearFull
          expr: |
            node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Conntrack table {{ $value | humanizePercentage }} full on {{ $labels.instance }}"
            description: "Risk of connection drops if table fills. Increase nf_conntrack_max."
            runbook_url: https://runbooks.example.com/network/conntrack-exhaustion

        - alert: ConntrackTableCritical
          expr: |
            node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.95
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Conntrack table 95%+ full — new connections may be dropped"
            runbook_url: https://runbooks.example.com/network/conntrack-exhaustion

        # CoreDNS SERVFAIL rate
        - alert: CoreDNSServfailHigh
          expr: |
            sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
              /
            sum(rate(coredns_dns_responses_total[5m])) > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "CoreDNS SERVFAIL rate > 1%"
            runbook_url: https://runbooks.example.com/network/coredns-servfail

        # CoreDNS pod down
        - alert: CoreDNSDown
          expr: |
            kube_deployment_status_replicas_available{deployment="coredns",namespace="kube-system"} < 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "CoreDNS has no available replicas — cluster DNS is down"
            runbook_url: https://runbooks.example.com/network/coredns-down

        # Hubble: high drop rate in namespace
        - alert: NetworkPolicyDropRateHigh
          expr: |
            sum by (namespace) (
              rate(hubble_drop_total{namespace!="kube-system"}[5m])
            ) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High NetworkPolicy drop rate in namespace {{ $labels.namespace }}"
            description: "{{ $value | humanize }} drops/sec. Check NetworkPolicy rules."
            runbook_url: https://runbooks.example.com/network/policy-drops

        # NGINX Ingress 5xx error rate
        - alert: IngressHighErrorRate
          expr: |
            sum by (ingress, namespace) (
              rate(nginx_ingress_controller_requests{status=~"5.."}[5m])
            )
              /
            sum by (ingress, namespace) (
              rate(nginx_ingress_controller_requests[5m])
            ) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Ingress {{ $labels.namespace }}/{{ $labels.ingress }} 5xx rate > 5%"
            runbook_url: https://runbooks.example.com/network/ingress-errors

        # Node network TX drops
        - alert: NodeNetworkTxDropsHigh
          expr: |
            rate(node_network_transmit_drop_total{device!~"lo|veth.*|docker.*|flannel.*|cni.*|cilium.*"}[5m]) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} TX drops > 100/sec on {{ $labels.device }}"
            runbook_url: https://runbooks.example.com/network/node-tx-drops

        # VPC CNI IP pool exhaustion (AWS)
        - alert: VPCCNIIPPoolLow
          expr: |
            awscni_total_ip_addresses - awscni_assigned_ip_addresses < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "VPC CNI IP pool has fewer than 5 available IPs on {{ $labels.instance }}"
            runbook_url: https://runbooks.example.com/network/vpc-cni-ip-pool

Best Practices

NodeLocal DNSCache everywhere

Install NodeLocal DNSCache on every cluster. It avoids the 5-second conntrack DNS race, can reduce CoreDNS load by 60–80%, and can cut DNS p99 from around 2ms to 0.1ms for cached names.

Monitor conntrack utilization

Set an alert at 80% conntrack table utilization. Conntrack exhaustion causes silent connection drops that are extremely hard to diagnose after the fact.

Hubble for zero-trust debugging

When a NetworkPolicy blocks traffic, hubble observe --verdict DROPPED shows exactly which flow is dropped and why — saving hours compared to trial-and-error policy editing.

Set MTU explicitly

Never rely on auto-detection for overlay MTU. Set it explicitly in CNI configuration. Mixed-MTU clusters cause mysterious large-payload failures that don't show up in small-packet tests.

Topology-aware routing

Enable service.kubernetes.io/topology-mode: Auto on high-RPS services. At scale, this meaningfully reduces cross-AZ data transfer costs and latency.

Use FQDNs for cross-namespace calls

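Short service names trigger multiple DNS search-domain lookups. Use service.namespace.svc.cluster.local for cross-namespace calls, and add a trailing dot (service.namespace.svc.cluster.local.) where the client supports it: the name has only four dots, so with the default ndots:5 the resolver still walks the search list first unless the name is marked absolute. This reduces DNS query volume and avoids race conditions.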
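
To see how many lookups a given name can generate, the search-list expansion can be modelled in a few lines. This is a sketch of glibc-style behaviour (names with fewer dots than ndots go through the search list first; a trailing dot skips the list entirely). The search list shown is what a pod in the payments namespace would typically get and is an assumption; cloud providers often append extra domains. Each candidate may be queried for both A and AAAA, and resolution stops at the first answer.

```python
def candidate_queries(name, search, ndots=5):
    """Names a glibc-style resolver may try, in order, for one lookup."""
    if name.endswith("."):
        return [name]                      # absolute name: search list is skipped
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded       # enough dots: try as-is first
    return expanded + [absolute]           # too few dots: search list first


# Assumed search list for a pod in the payments namespace.
SEARCH = ["payments.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for name in ("api.stripe.com",
             "payment-service.payments.svc.cluster.local",
             "payment-service.payments.svc.cluster.local."):
    queries = candidate_queries(name, SEARCH)
    print(f"{name}: up to {len(queries)} names tried")
```

An external name such as api.stripe.com is the costly case: every search-domain candidate returns NXDOMAIN before the real name is tried.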