Network Operations
Day-2 operations for CNI, CoreDNS, Ingress controllers, service mesh, conntrack exhaustion, MTU problems, and eBPF network observability with Hubble.
Kubernetes Network Stack
Understanding the full packet path is essential for diagnosing network issues. A request from pod A to pod B traverses multiple layers — each is a potential failure point.
Pod A (eth0: 10.0.1.5)
│
▼ veth pair
Node Linux bridge / eBPF (cilium0 or cni0)
│ NetworkPolicy evaluated here (iptables or eBPF)
▼
Pod B (eth0: 10.0.1.6)
Packet Path: Pod to Pod (Different Node)
Pod A (10.0.1.5) → veth → CNI bridge → node routing table
→ tunnel/VXLAN/BGP encapsulation → eth0 → Node B eth0
→ de-encapsulate → CNI bridge → veth → Pod B
Packet Path: Pod to Service ClusterIP
Pod A → DNS lookup → CoreDNS → 10.96.5.42 (ClusterIP)
→ kube-proxy iptables DNAT / eBPF load balancing
→ selected endpoint pod IP (10.0.2.8)
→ Pod B (cross-node routing)
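The DNAT step in iptables mode can be illustrated with a little arithmetic. kube-proxy writes one rule per endpoint and evaluates them in order, each with a match probability chosen so that every endpoint receives an equal share overall. The sketch below is a minimal model of that scheme, not kube-proxy's code; the function names are hypothetical.

```python
from fractions import Fraction


def rule_probabilities(n_endpoints):
    """Match probability of each rule when the rules are evaluated in order.

    Rule i only sees traffic that earlier rules did not take, so it must
    match 1/(n - i) of what reaches it; the last rule always matches.
    """
    return [Fraction(1, n_endpoints - i) for i in range(n_endpoints)]


def effective_share(probabilities):
    """Overall share of traffic that each endpoint ends up receiving."""
    shares = []
    remaining = Fraction(1)
    for p in probabilities:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


if __name__ == "__main__":
    probs = rule_probabilities(3)
    print("per-rule probability:", [str(p) for p in probs])
    print("effective share:     ", [str(s) for s in effective_share(probs)])
```

Because the rules are evaluated sequentially, the rule count (and lookup cost) grows with the number of endpoints, which is one reason large clusters move to IPVS or eBPF load balancing.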
| Layer | Component | Failure Mode | Debug Command |
|---|---|---|---|
| Pod networking | veth pair, CNI plugin | Pod can't reach any destination | kubectl exec <pod> -- ip addr; ip route |
| Service resolution | CoreDNS, kube-proxy/eBPF | DNS failure or DNAT not working | kubectl exec <pod> -- nslookup svc.namespace |
| NetworkPolicy | CNI enforcement | Connection refused/timeout unexpectedly | Hubble observe, kubectl describe netpol |
| Node routing | Linux kernel routing table | Pods on different nodes can't reach each other | ip route show; ip neigh show |
| Conntrack | nf_conntrack table | Intermittent connection drops under high RPS | cat /proc/net/nf_conntrack \| wc -l |
| MTU | NIC, overlay tunnel | Large packets silently dropped | ping -M do -s 1400 <pod-ip> |
| Ingress | NGINX/Envoy/ALB | 502/504 errors, timeouts | NGINX error logs, upstream check |
CNI Operations
Checking CNI health
# Check CNI DaemonSet status (for all major CNIs)
kubectl get ds -n kube-system | grep -E "aws-node|cilium|calico|flannel|weave"
# AWS VPC CNI — check node IP assignment
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
kubectl logs -n kube-system -l k8s-app=aws-node --tail=50
# VPC CNI specific: check IP address pool and ENI attachment
kubectl describe ds -n kube-system aws-node | grep -A5 "Environment"
kubectl exec -n kube-system ds/aws-node -- /app/grpc-health-probe \
-addr=:50051 -connect-timeout=5s
# Cilium: check agent status
cilium status --wait
cilium connectivity test # full end-to-end test (takes 2-3 min)
# Check for pods stuck in ContainerCreating (CNI failure)
kubectl get pods -A | grep ContainerCreating
# If stuck, check kubelet logs on the node:
journalctl -u kubelet | grep CNI | tail -50
AWS VPC CNI warm pool tuning
# VPC CNI maintains a warm pool of pre-assigned IPs
# Tune to balance startup speed vs IP waste
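# WARM_IP_TARGET=5: keep 5 IPs pre-allocated per node
# MINIMUM_IP_TARGET=10: minimum IPs to hold even when idle
# WARM_ENI_TARGET=1: keep 1 ENI with available IPs (ignored while WARM_IP_TARGET is set)
# MAX_ENI=4: max ENIs per node (capped by the instance type limit)
# ENABLE_PREFIX_DELEGATION=true: assign /28 prefixes, 16 IPs per ENI slot
kubectl set env daemonset aws-node -n kube-system \
WARM_IP_TARGET=5 \
MINIMUM_IP_TARGET=10 \
WARM_ENI_TARGET=1 \
MAX_ENI=4 \
ENABLE_PREFIX_DELEGATION=true
# set env triggers a rolling restart of the aws-node pods, which picks up the new values
# Prefix delegation takes full effect only on nodes launched after the change,
# so plan a rolling node replacement when enabling it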
# Monitor IP pool health (ipamd introspection endpoint; requires curl in the aws-node image)
kubectl exec -n kube-system ds/aws-node -- \
curl -s http://localhost:61679/v1/enis | jq '.'
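To see what the warm-pool and prefix-delegation settings mean for capacity, the arithmetic can be sketched as follows. This is a rough model, assuming the first slot on each ENI is taken by the ENI's primary address and that each delegated prefix is a /28; the function name and the example figures (3 ENIs with 10 slots each) are hypothetical inputs, not values for any particular instance type.

```python
def assignable_pod_ips(enis, slots_per_eni, prefix_delegation=False):
    """Rough upper bound on pod IPs one node can hold.

    enis:           number of ENIs the node may attach (see MAX_ENI)
    slots_per_eni:  IPv4 address slots per ENI for the instance type
    """
    if enis < 0 or slots_per_eni < 1:
        raise ValueError("enis must be >= 0 and slots_per_eni >= 1")
    # One slot per ENI holds the ENI's own primary address.
    usable_slots = enis * (slots_per_eni - 1)
    # With prefix delegation every slot carries a /28 prefix (16 addresses).
    return usable_slots * (16 if prefix_delegation else 1)


if __name__ == "__main__":
    print("secondary IP mode:", assignable_pod_ips(3, 10))
    print("prefix delegation:", assignable_pod_ips(3, 10, prefix_delegation=True))
```

The kubelet max-pods setting still caps how many pods are actually scheduled, so the larger figure is address headroom rather than a pod count.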
Cilium node-to-node connectivity
# Check Cilium node mesh (every node should see every other node)
cilium node list
# Check endpoint health
cilium endpoint list | grep -v "ready"
# Verify BPF datapath is active (not falling back to iptables)
kubectl exec -n kube-system ds/cilium -- cilium status | grep "KubeProxyReplacement"
# Should show: KubeProxyReplacement: True
# Check BPF map usage (high usage → map too small)
kubectl exec -n kube-system ds/cilium -- cilium bpf ct list global | wc -l
# Restart a specific Cilium agent (node-targeted)
kubectl delete pod -n kube-system -l k8s-app=cilium \
--field-selector spec.nodeName=ip-10-0-1-42.us-east-1.compute.internal
CoreDNS Operations
Diagnosing DNS failures
# Test DNS resolution from a pod
kubectl run dns-test --image=busybox:1.36 --restart=Never -it --rm -- \
nslookup payment-service.payments.svc.cluster.local
# Test external DNS resolution
kubectl run dns-test --image=busybox:1.36 --restart=Never -it --rm -- \
nslookup api.stripe.com
# Test with dig (more detail)
kubectl run dns-test --image=tutum/dnsutils --restart=Never -it --rm -- \
dig +short payment-service.payments.svc.cluster.local
# Check CoreDNS pods and logs
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100 | grep -i "error\|refused\|timeout"
# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml
# Watch CoreDNS query log (enable log plugin temporarily)
kubectl edit configmap coredns -n kube-system
# Add: log
# between ready and kubernetes blocks
# REMOVE after debugging — generates huge volume of logs
CoreDNS performance metrics
# CoreDNS request rate
sum(rate(coredns_dns_requests_total[5m])) by (server, zone, type)
# CoreDNS response rate by rcode (watch the SERVFAIL and NXDOMAIN series)
sum by (rcode) (rate(coredns_dns_responses_total[5m]))
# CoreDNS p99 latency (target: < 2ms for cached; < 10ms for forwarded)
histogram_quantile(0.99,
sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m]))
)
# CoreDNS cache hit ratio
sum(rate(coredns_cache_hits_total[5m]))
/
(sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))
CoreDNS common issues and fixes
| Symptom | Cause | Fix |
|---|---|---|
| SERVFAIL for external names | Upstream resolver unreachable | Check /etc/resolv.conf on nodes; verify VPC DNS (169.254.169.253) accessible |
| Intermittent NXDOMAIN for cluster services | ndots:5 search domain exhausted; race condition | Use FQDNs; set ndots: 3; install NodeLocal DNSCache |
| High CoreDNS CPU | Too many queries; no caching | Increase cache TTL in Corefile; add prefetch; install NodeLocal DNSCache |
| CoreDNS pod OOMKilled | Memory leak (known in older versions) | Upgrade CoreDNS; set memory limit 256Mi+; check for large zone data |
| 5-second DNS delay | conntrack race for UDP (parallel A/AAAA queries) | Add single-request-reopen to pod dnsConfig; or use NodeLocal DNSCache |
| DNS resolution works once, then fails | CoreDNS pod restarted; DNS negative cache not cleared | Check CoreDNS pod restarts; client-side DNS cache TTL |
5-second DNS delay fix (conntrack race)
# This is a well-known Linux kernel issue:
# Parallel A + AAAA queries share the same UDP 5-tuple, causing conntrack collision
# Result: one query is dropped; OS waits 5s for timeout before retry
# Fix 1: single-request-reopen in pod dnsConfig (serializes A/AAAA queries)
# (Add to pod spec — see 02-performance-tuning.html)
# Fix 2: NodeLocal DNSCache (runs on each node, bypasses conntrack for DNS)
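# Install: the upstream manifest contains __PILLAR__ placeholders (local DNS IP,
# cluster domain, kube-dns ClusterIP) that must be substituted before applying
curl -sLO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
# ...substitute the placeholders, for example with sed...
kubectl apply -f nodelocaldns.yaml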
# Fix 3: CoreDNS autopath plugin (reduces search domain queries)
# Add to Corefile (the kubernetes plugin must be set to "pods verified"):
# autopath @kubernetes
# Verify NodeLocal DNSCache running on all nodes
kubectl get ds -n kube-system node-local-dns
kubectl get pods -n kube-system -l k8s-app=node-local-dns
Ingress Operations
NGINX Ingress controller operations
# Check NGINX Ingress controller status
kubectl get pods -n ingress-nginx -l app.kubernetes.io/component=controller
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller --tail=100
# Inspect effective NGINX config for a specific Ingress
CONTROLLER_POD=$(kubectl get pods -n ingress-nginx -l app.kubernetes.io/component=controller \
-o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ingress-nginx $CONTROLLER_POD -- cat /etc/nginx/nginx.conf | \
grep -A20 "server_name.*payment"
# Check upstream health (NGINX upstream list)
kubectl exec -n ingress-nginx $CONTROLLER_POD -- \
curl -s http://localhost:10246/configuration/backends | \
jq '.[] | select(.name | contains("payments")) | {name, endpoints}'
# Reload NGINX configuration (triggers graceful reload)
kubectl exec -n ingress-nginx $CONTROLLER_POD -- nginx -s reload
# Check NGINX error logs (502/504 root cause)
kubectl logs -n ingress-nginx $CONTROLLER_POD --tail=200 | \
grep -E "error|upstream|connect"
Ingress annotation tuning for production
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-api
  namespace: payments
  annotations:
    # Timeouts
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    # Connection handling: upstream keepalive is tuned in the controller ConfigMap,
    # not per Ingress (keys: upstream-keepalive-connections: "100",
    # upstream-keepalive-requests: "1000", upstream-keepalive-time: "60s")
    # Body size
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "100"
    nginx.ingress.kubernetes.io/limit-connections: "10"
    # SSL
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # CORS (if needed)
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://app.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    # Canary (Argo Rollouts manages this for progressive delivery)
    # nginx.ingress.kubernetes.io/canary: "true"
    # nginx.ingress.kubernetes.io/canary-weight: "10"
    # Security headers (snippet annotations must be allowed on the controller)
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-Frame-Options: DENY";
      more_set_headers "X-Content-Type-Options: nosniff";
      more_set_headers "Referrer-Policy: strict-origin-when-cross-origin";
      more_set_headers "Permissions-Policy: camera=(), microphone=(), geolocation=()";
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-example-com-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 80
NGINX Ingress metrics
# Request rate per Ingress
sum by (ingress, namespace) (
rate(nginx_ingress_controller_requests[5m])
)
# Error rate (4xx + 5xx)
sum by (ingress, status) (
rate(nginx_ingress_controller_requests{status=~"[45].."}[5m])
)
/
sum by (ingress) (
rate(nginx_ingress_controller_requests[5m])
)
# p99 response time
histogram_quantile(0.99,
sum by (ingress, le) (
rate(nginx_ingress_controller_response_duration_seconds_bucket[5m])
)
)
# Upstream connection errors
sum by (ingress) (
rate(nginx_ingress_controller_upstream_latency_seconds_count{status="error"}[5m])
)
AWS Load Balancer Controller (ALB)
# Check ALB controller health
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=50
# Describe Ingress for ALB events
kubectl describe ingress payment-api -n payments
# Check target group health directly (ALB target group → pod IPs)
ALB_ARN=$(aws elbv2 describe-load-balancers \
--query 'LoadBalancers[?contains(LoadBalancerName, `payment`)].LoadBalancerArn' \
--output text)
TG_ARN=$(aws elbv2 describe-target-groups \
--load-balancer-arn $ALB_ARN \
--query 'TargetGroups[0].TargetGroupArn' --output text)
aws elbv2 describe-target-health --target-group-arn $TG_ARN | \
jq '.TargetHealthDescriptions[] | {target:.Target, health:.TargetHealth}'
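When a target group has many pod IPs, a summary is easier to read than the raw JSON. The sketch below groups targets by health state and lists the ones that are not healthy. It assumes the response shape returned by `aws elbv2 describe-target-health` (`TargetHealthDescriptions` entries with `Target` and `TargetHealth`); the function name and the sample data are hypothetical.

```python
from collections import Counter


def summarize_target_health(response):
    """Return (count per state, list of non-healthy targets)."""
    counts = Counter()
    problems = []
    for entry in response.get("TargetHealthDescriptions", []):
        target = entry.get("Target", {})
        health = entry.get("TargetHealth", {})
        state = health.get("State", "unknown")
        counts[state] += 1
        if state != "healthy":
            # Reason is only present for targets that are not healthy.
            problems.append((target.get("Id"), state, health.get("Reason")))
    return counts, problems


if __name__ == "__main__":
    # Hypothetical sample in the shape of the describe-target-health output.
    sample = {
        "TargetHealthDescriptions": [
            {"Target": {"Id": "10.0.1.5", "Port": 8080},
             "TargetHealth": {"State": "healthy"}},
            {"Target": {"Id": "10.0.2.8", "Port": 8080},
             "TargetHealth": {"State": "unhealthy",
                              "Reason": "Target.Timeout"}},
        ]
    }
    counts, problems = summarize_target_health(sample)
    print(dict(counts))
    for target_id, state, reason in problems:
        print(f"{target_id}: {state} ({reason})")
```

To use it against a live target group, pipe the CLI output into a script that calls `json.load(sys.stdin)` and passes the result to `summarize_target_health`.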
Service Mesh Operations
Istio / Envoy sidecar diagnostics
# Check Istio control plane health
istioctl verify-install
kubectl get pods -n istio-system
# Check proxy status for all pods in a namespace
istioctl proxy-status -n payments
# Analyze potential configuration issues
istioctl analyze -n payments
# Dump Envoy config for a specific pod (very detailed)
istioctl proxy-config all payment-pod-xyz.payments
# Check listener configuration (what ports Envoy is listening on)
istioctl proxy-config listener payment-pod-xyz.payments
# Check cluster configuration (upstreams Envoy knows about)
istioctl proxy-config cluster payment-pod-xyz.payments | \
grep -E "HOSTNAME|PORT|STATUS"
# Check route configuration
istioctl proxy-config route payment-pod-xyz.payments
# View real-time Envoy access logs
kubectl logs payment-pod-xyz -n payments -c istio-proxy --tail=50
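# Check whether sidecar injection applies in the namespace (this does not report mTLS)
istioctl x check-inject -n payments
# Inspect the mTLS mode and policies that apply to a specific pod
istioctl x describe pod payment-pod-xyz -n payments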
Envoy stats for latency debugging
# Access Envoy admin interface (run in the background, or in a separate terminal)
kubectl port-forward pod/payment-pod-xyz -n payments 15000:15000 &
# Get upstream latency histograms from Envoy stats
curl -s http://localhost:15000/stats | \
grep "upstream_rq_time\|upstream_cx_connect_ms" | \
grep -v "^$"
# Get circuit breaker status
curl -s http://localhost:15000/stats | \
grep "overflow\|pending_overflow"
# Get retry stats
curl -s http://localhost:15000/stats | \
grep "upstream_rq_retry"
# Live request logging (Envoy tap API)
curl -X POST http://localhost:15000/tap \
-H "Content-Type: application/json" \
-d '{"config_id":"test","tap_config":{"match_config":{"any_match":{}},"output_config":{"sinks":[{"format":"JSON_BODY_AS_STRING","streaming_admin":{}}]}}}'
Network Debugging Toolkit
netshoot — full network debug pod
# Run netshoot in target namespace
kubectl run netshoot -n payments \
--image=nicolaka/netshoot \
--restart=Never -it --rm \
-- bash
# Inside netshoot — useful commands:
# TCP connectivity test
nc -zv payment-service 80
nc -zv postgresql.databases.svc.cluster.local 5432
# HTTP test with headers
curl -v http://payment-service/health
curl -v --resolve "api.example.com:443:10.100.5.20" https://api.example.com/health
# Trace route to a pod
traceroute 10.0.2.5
# DNS trace
dig +trace payment-service.payments.svc.cluster.local
dig +stats @169.254.20.10 payment-service.payments.svc.cluster.local
# TCP packet capture (requires NET_ADMIN or node access)
tcpdump -i eth0 -n 'host 10.0.2.5 and port 80' -w /tmp/capture.pcap
# Show socket stats
ss -tunaep
netstat -tunaep
# Show connection count to a specific IP
ss -tn | grep 10.0.2.5 | wc -l
kubectl debug — ephemeral container for network issues
# Attach netshoot as ephemeral container to a running pod
kubectl debug -it payment-pod-xyz \
--image=nicolaka/netshoot \
--target=payment-service \
-n payments
# Once inside, you share the pod's network namespace:
# - Same IP as the target pod
# - Can tcpdump the pod's actual traffic
tcpdump -i eth0 -n port 5432
# Capture and analyze with tshark
tshark -i eth0 -Y "tcp.port==5432" -T fields \
-e frame.time -e ip.src -e ip.dst -e tcp.flags
Node-level network debugging
# Check iptables rules (kube-proxy managed)
iptables-save | grep -E "KUBE-SVC|KUBE-SEP" | grep payment | head -20
# Check kube-proxy endpoints for a service
kubectl get endpoints payment-service -n payments -o yaml
# Verify service ClusterIP is reachable from node
curl -v http://10.96.5.42/health # ClusterIP
# Check if IPVS is being used (alternative to iptables)
ipvsadm -Ln | grep 10.96.5.42
# Check BPF (Cilium eBPF)
kubectl exec -n kube-system ds/cilium -- \
cilium service list | grep payment
# Check BGP routes (Cilium BGP or Calico BGP)
kubectl exec -n kube-system ds/cilium -- \
cilium bgp routes --peer all
Conntrack Table Exhaustion
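The Linux conntrack table tracks every active TCP/UDP connection for NAT and stateful firewalling. Under high connection rates the table fills up, and the kernel drops packets belonging to new connections. The only trace is an "nf_conntrack: table full, dropping packet" message in the kernel log, so applications just see unexplained timeouts. This is one of the most insidious Kubernetes networking problems.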
Diagnosing conntrack exhaustion
# Check current conntrack table usage
cat /proc/sys/net/netfilter/nf_conntrack_count # current entries
cat /proc/sys/net/netfilter/nf_conntrack_max # maximum
# Usage percentage
echo "scale=2; $(cat /proc/sys/net/netfilter/nf_conntrack_count) * 100 / \
$(cat /proc/sys/net/netfilter/nf_conntrack_max)" | bc
# Watch conntrack teardown events (requires conntrack-tools)
conntrack -E -e DESTROY | head -20
# Check per-CPU kernel counters: drop, early_drop, and insert_failed should stay flat
conntrack -S
# A full table is also reported in the kernel log
dmesg | grep "nf_conntrack: table full"
# Monitor continuously
watch -n1 'echo "Used: $(cat /proc/sys/net/netfilter/nf_conntrack_count) / \
Max: $(cat /proc/sys/net/netfilter/nf_conntrack_max)"'
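The same check can be scripted for use in a node health probe. The sketch below reads the two counters shown above and classifies the ratio. The 80% and 95% thresholds mirror the alert rules later in this document; the function names are hypothetical, and the script simply reports that the counters are missing on hosts without the nf_conntrack module loaded.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/netfilter")


def utilization(count, maximum):
    """Fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def severity(ratio, warn=0.80, crit=0.95):
    """Map a utilization ratio onto the alert levels used in this document."""
    if ratio > crit:
        return "critical"
    if ratio > warn:
        return "warning"
    return "ok"


def read_counter(name, base=PROC_DIR):
    """Read one integer counter such as nf_conntrack_count."""
    return int((base / name).read_text().strip())


if __name__ == "__main__":
    try:
        count = read_counter("nf_conntrack_count")
        maximum = read_counter("nf_conntrack_max")
    except OSError:
        print("conntrack counters are not available on this host")
    else:
        ratio = utilization(count, maximum)
        print(f"{count}/{maximum} ({ratio:.1%}) -> {severity(ratio)}")
```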
Conntrack table increase (kernel tuning DaemonSet)
# In the node-tuner DaemonSet initContainer (see 04-security-hardening.html):
sysctl -w net.netfilter.nf_conntrack_max=1048576 # 1M entries
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_close_wait=15
# Also increase hashsize (requires module parameter)
echo 262144 > /sys/module/nf_conntrack/parameters/hashsize
# Verify
cat /proc/sys/net/netfilter/nf_conntrack_max
Cilium in eBPF mode performs load balancing and policy enforcement entirely in the BPF datapath, bypassing iptables and netfilter conntrack for east-west (pod-to-pod) traffic. Cilium tracks these connections in its own BPF maps, which are sized separately, so internal traffic no longer fills the kernel nf_conntrack table. Only external traffic (ingress) still goes through conntrack. If conntrack exhaustion is a recurring problem, migrating from kube-proxy iptables to Cilium eBPF is the architectural fix.
Conntrack metrics
# Conntrack table utilization per node
node_nf_conntrack_entries / node_nf_conntrack_entries_limit
# Conntrack entries growth trend (node_nf_conntrack_entries is a gauge, so use deriv() rather than rate())
deriv(node_nf_conntrack_entries[5m])
MTU Problems
MTU (Maximum Transmission Unit) mismatches cause large packets to be silently dropped. In overlay networks (VXLAN/Geneve), the encapsulation overhead reduces the effective MTU. MTU issues are notoriously hard to diagnose because small packets work fine while large ones fail.
Physical NIC MTU: 9000 (jumbo frames) or 1500 (standard)
VXLAN encapsulation overhead:
  Outer Ethernet: 14B + Outer IP: 20B + UDP: 8B + VXLAN: 8B = 50B overhead
Effective pod MTU: 1500 - 50 = 1450 (with standard frames)
                   9000 - 50 = 8950 (with jumbo frames)
If pod MTU is set to 1500 and the physical MTU is also 1500 (overhead not subtracted):
  Encapsulated packets reach 1550B, so large packets get fragmented or dropped
  → HTTP works (small packets) but uploads/downloads fail
  → gRPC streams stall after first few kilobytes
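The arithmetic above is easy to get wrong when choosing ping sizes, because `ping -s` sets the ICMP payload rather than the packet size. The following sketch works through the numbers, assuming IPv4 and the 50-byte VXLAN overhead from the breakdown above; the function names are hypothetical.

```python
VXLAN_OVERHEAD = 14 + 20 + 8 + 8   # outer Ethernet + outer IP + UDP + VXLAN
ICMP_IPV4_HEADERS = 20 + 8         # IPv4 header + ICMP echo header


def pod_mtu(nic_mtu, overhead=VXLAN_OVERHEAD):
    """Largest pod MTU that still fits the NIC after encapsulation."""
    return nic_mtu - overhead


def max_ping_payload(mtu):
    """Largest 'ping -M do -s' value that fits in one packet at this MTU."""
    return mtu - ICMP_IPV4_HEADERS


def exceeds_nic(pod_mtu_value, nic_mtu, overhead=VXLAN_OVERHEAD):
    """True when a full-size pod packet no longer fits the NIC once wrapped."""
    return pod_mtu_value + overhead > nic_mtu


if __name__ == "__main__":
    for nic in (1500, 9000):
        mtu = pod_mtu(nic)
        print(f"NIC {nic}: pod MTU {mtu}, max ping payload {max_ping_payload(mtu)}")
    print("pod MTU 1500 on a 1500 NIC overflows:", exceeds_nic(1500, 1500))
```

With a pod MTU of 1450, a ping payload of 1422 should pass with the DF bit set and 1423 should fail, which gives a precise boundary to test against.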
Diagnosing MTU issues
# Check current MTU settings inside a pod
kubectl exec -n payments pod/payment-pod-xyz -- ip link show eth0
# Look for: mtu 1500 or mtu 8950
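# Test with increasing packet sizes (PMTUD: Path MTU Discovery)
# -s sets the ICMP payload; the packet on the wire is 28 bytes larger (IP + ICMP headers)
# netshoot is used because busybox ping has no -M option; single quotes keep
# $size and $(...) from being expanded by the local shell
kubectl run mtu-test --image=nicolaka/netshoot --restart=Never -it --rm -- sh -c '
for size in 1400 1450 1472 1500 1510 8900 8950; do
  result=$(ping -M do -c1 -s $size 10.0.1.5 2>&1 | tail -1)
  echo "Size $size: $result"
done'
# Packets > MTU with DF bit set should fail with "Message too long"
# If they don't fail but the application hangs, suspect an MTU black hole (no ICMP returned)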
# Check for MTU black hole: try large HTTP body
kubectl exec -n payments pod/payment-pod-xyz -- \
curl -v --max-time 5 http://order-service/api/data \
-d "$(dd if=/dev/urandom bs=2048 count=1 2>/dev/null | base64)"
MTU configuration fixes
# Cilium: set MTU explicitly in Helm values (top-level key in the cilium chart)
MTU: 1450 # 1500 - 50 (VXLAN) = 1450; 0 means auto-detect
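# AWS VPC CNI: no overlay, so there is no encapsulation overhead;
# the MTU is controlled by the AWS_VPC_ENI_MTU setting on the aws-node DaemonSet
# Verify the configured value:
kubectl describe ds aws-node -n kube-system | grep AWS_VPC_ENI_MTU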
# Flannel: configure in ConfigMap
kubectl edit configmap kube-flannel-cfg -n kube-flannel
# Set "Backend": {"Type": "vxlan"}, "mtu": 1450
# For jumbo frames (9000 MTU on AWS enhanced networking):
# Set pod MTU to 8950 (9000 - 50 overhead)
# Ensure ALL nodes use the same MTU — mixed MTU is the worst case
Cilium & Hubble Observability
Hubble is Cilium's eBPF-based network observability platform. It provides a real-time view of every packet flowing through the cluster — who is talking to whom, what was allowed or dropped, and HTTP-level details — without any application instrumentation.
Hubble setup
# Enable Hubble in Cilium Helm values
helm upgrade cilium cilium/cilium \
--namespace kube-system \
--reuse-values \
--set hubble.enabled=true \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true \
--set hubble.metrics.enableOpenMetrics=true \
--set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2}"
# Install Hubble CLI
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -LO "https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz"
tar xzvf hubble-linux-amd64.tar.gz
sudo mv hubble /usr/local/bin/
# Port-forward Hubble relay
kubectl port-forward -n kube-system svc/hubble-relay 4245:80 &
# Access Hubble UI
kubectl port-forward -n kube-system svc/hubble-ui 12000:80
Hubble observe — real-time flow analysis
# Watch all flows in the payments namespace
hubble observe --namespace payments
# Watch only dropped flows (NetworkPolicy denials)
hubble observe --namespace payments --verdict DROPPED
# Watch flows from a specific pod
hubble observe --from-pod payments/payment-pod-xyz
# Watch HTTP flows (L7 visibility requires Cilium L7 policy)
hubble observe --namespace payments --protocol http
# Filter flows between specific pods
hubble observe \
--from-pod payments/payment-service \
--to-pod databases/postgresql
# Get flow statistics (top talkers)
hubble observe --namespace payments \
--output json | \
jq -r '[.source.pod_name, .destination.pod_name, .l4.TCP.destination_port] | @tsv' | \
sort | uniq -c | sort -rn | head -20
# Find all unique destination ports from payments namespace
hubble observe --namespace payments --output json | \
jq -r '.l4.TCP.destination_port // .l4.UDP.destination_port' | \
sort -n | uniq -c | sort -rn
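For longer captures, the jq pipelines above can be replaced with a small script that tolerates malformed lines and missing fields. This is a sketch under assumptions: it accepts flow records either at the top level or wrapped in a `flow` key (the shape differs between Hubble CLI output formats), and the pod names in the sample are hypothetical.

```python
import json
from collections import Counter


def flow_key(record):
    """Extract (source pod, destination pod, destination port) from one record."""
    flow = record.get("flow", record)  # assumption: records may be wrapped
    l4 = flow.get("l4") or {}
    proto = l4.get("TCP") or l4.get("UDP") or {}
    src = (flow.get("source") or {}).get("pod_name", "")
    dst = (flow.get("destination") or {}).get("pod_name", "")
    return (src, dst, proto.get("destination_port"))


def top_talkers(lines, limit=20):
    """Count flows per (source, destination, port) over JSON-lines input."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial lines from an interrupted stream
        if isinstance(record, dict):
            counts[flow_key(record)] += 1
    return counts.most_common(limit)


# Hypothetical sample: two wrapped TCP flows and one unwrapped UDP flow.
SAMPLE = [
    '{"flow": {"source": {"pod_name": "payment-a"}, "destination": {"pod_name": "postgresql-0"}, "l4": {"TCP": {"destination_port": 5432}}}}',
    '{"flow": {"source": {"pod_name": "payment-a"}, "destination": {"pod_name": "postgresql-0"}, "l4": {"TCP": {"destination_port": 5432}}}}',
    '{"source": {"pod_name": "payment-a"}, "destination": {"pod_name": "coredns-x"}, "l4": {"UDP": {"destination_port": 53}}}',
]

if __name__ == "__main__":
    for (src, dst, port), n in top_talkers(SAMPLE):
        print(f"{n:6d}  {src} -> {dst}:{port}")
```

To run it on live data, pass `sys.stdin` to `top_talkers` and pipe in the output of `hubble observe --namespace payments --output json`.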
Hubble for NetworkPolicy debugging
# Which pods are being denied by NetworkPolicy?
hubble observe --verdict DROPPED --namespace payments --output json | \
jq -r '"DROP: \(.source.pod_name) → \(.destination.pod_name):\(.l4.TCP.destination_port // .l4.UDP.destination_port) [\(.drop_reason_desc)]"'
# Common drop reasons:
# POLICY_DENIED — NetworkPolicy explicitly blocks
# UNKNOWN_CONNECTION — conntrack miss (usually timeout)
# PORT_UNREACHABLE — no listener on target port
# Find which NetworkPolicy is blocking a specific flow
cilium policy trace \
--src-identity 12345 \
--dst-identity 67890 \
--dport 5432 \
--proto tcp
# Get Cilium endpoint identity numbers
kubectl exec -n kube-system ds/cilium -- \
cilium endpoint list | grep payment
NetworkPolicy Operations
Testing NetworkPolicy effectiveness
# Test that default-deny is working (should timeout/refuse)
kubectl run test-deny \
--image=busybox --restart=Never -n payments -it --rm \
-- nc -zv -w3 order-service.orders 80
# Expected: a timeout after 3s (policy drops are silent), or
# nc: bad address 'order-service.orders' if DNS egress is denied as well
# Test that allowed traffic works
kubectl run test-allow \
--image=busybox --restart=Never -n payments -it --rm \
-- nc -zv -w3 postgresql.databases 5432
# Expected: the port is reported as open (exact wording depends on the nc build;
# OpenBSD nc prints "Connection to postgresql.databases 5432 port [tcp/*] succeeded!")
# List all NetworkPolicies and their selectors
kubectl get networkpolicies -n payments -o json | \
jq -r '.items[] | "\(.metadata.name): podSelector=\(.spec.podSelector | tostring)"'
# Check which pods are selected by a policy
kubectl get pods -n payments -l "$(
kubectl get networkpolicy default-deny -n payments \
-o jsonpath='{.spec.podSelector.matchLabels}' | \
jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")' 2>/dev/null || echo "app=payment-service"
)"
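The selection rule behind the command above is simple enough to model directly, which helps when reasoning about a policy before applying it. The sketch below implements `matchLabels` semantics only: every key/value pair must match, and an empty selector (as in a typical default-deny policy) selects every pod in the namespace. `matchExpressions` is deliberately not handled, and the pod names and labels are hypothetical.

```python
def labels_match(match_labels, pod_labels):
    """True when every selector label is present on the pod with the same value."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


def selected_pods(pod_selector, pods):
    """Names of pods selected by a NetworkPolicy podSelector.

    pod_selector: the spec.podSelector object ({} selects all pods)
    pods:         mapping of pod name -> labels dict
    """
    match_labels = pod_selector.get("matchLabels") or {}
    return sorted(name for name, labels in pods.items()
                  if labels_match(match_labels, labels))


if __name__ == "__main__":
    pods = {
        "payment-7d9f": {"app": "payment-service", "tier": "backend"},
        "worker-55c1": {"app": "payment-worker", "tier": "backend"},
    }
    print(selected_pods({}, pods))  # empty selector: all pods
    print(selected_pods({"matchLabels": {"app": "payment-service"}}, pods))
```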
Network policy visualization
# Generate a network policy diagram with netpol-viewer (krew plugin)
kubectl netpol graph -n payments
# Or export to Graphviz DOT format:
kubectl get networkpolicies -n payments -o json | \
python3 -c "
import json, sys
data = json.load(sys.stdin)
print('digraph G {')
for policy in data['items']:
    name = policy['metadata']['name']
    ns = policy['metadata']['namespace']
    for rule in policy.get('spec', {}).get('ingress', []):
        for src in rule.get('from', []):
            ns_sel = src.get('namespaceSelector', {}).get('matchLabels', {})
            pod_sel = src.get('podSelector', {}).get('matchLabels', {})
            label = str(ns_sel) + '/' + str(pod_sel)
            print(f'  \"{label}\" -> \"{ns}/{name}\"')
print('}')
"
Cross-AZ Traffic Optimization
Cross-AZ traffic in AWS costs $0.01/GB in each direction and adds 1–3ms latency. For services with high RPS, this is significant. Kubernetes has shipped topology-aware routing since 1.21 (initially as alpha topology-aware hints), which prefers same-AZ endpoints.
Topology-aware service routing
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: payments
spec:
  selector:
    app: payment-service
  ports:
    - port: 80
      targetPort: 8080
  # Node-local routing: only send in-cluster traffic to endpoints on the caller's node.
  # Intended for DaemonSet-backed services; traffic is dropped when no local endpoint exists.
  internalTrafficPolicy: Local
  # spec.topologyKeys was deprecated in 1.21 and removed in 1.22; do not use it.
  # externalTrafficPolicy applies only to NodePort and LoadBalancer services.
# EndpointSlice topology hints (preferred approach)
# K8s 1.27+ uses the topology-mode annotation below;
# on 1.21 to 1.26 use service.kubernetes.io/topology-aware-hints: auto instead
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: payments
  annotations:
    service.kubernetes.io/topology-mode: Auto # enable topology-aware routing
spec:
  selector:
    app: payment-service
  ports:
    - port: 80
      targetPort: 8080
# Verify topology hints are set on EndpointSlice
kubectl get endpointslice -n payments -l kubernetes.io/service-name=payment-service -o yaml | \
grep -A5 "hints:"
# Monitor cross-AZ traffic (CloudWatch / VPC Flow Logs)
aws ec2 describe-flow-logs \
--filter Name=resource-id,Values=vpc-0123456789abcdef0 \
--query 'FlowLogs[*].{LogGroupName:LogGroupName,Status:FlowLogStatus}'
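# Estimate cross-AZ cost reduction
# Before topology hints: traffic is spread across endpoints in all AZs
# (roughly two thirds cross-AZ with three evenly loaded AZs)
# After topology hints: ~70% same-AZ, ~30% cross-AZ
# 1 TB/day of cross-AZ traffic costs about $10/day at $0.01/GB,
# so the saving scales with the share of traffic kept in-zone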
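The estimate above can be turned into a small calculator to plug in your own traffic figures. This is a back-of-the-envelope sketch: the $0.01/GB price and the ~30% residual cross-AZ share come from this section, the even-spread model for the "before" case is an assumption, and `billed_directions` is a hypothetical knob for cases where both sides of the transfer are billed.

```python
def even_spread_cross_az_fraction(zones):
    """Share of traffic that leaves the zone when endpoints are picked evenly."""
    if zones < 1:
        raise ValueError("zones must be >= 1")
    return (zones - 1) / zones


def cross_az_cost_per_day(total_gb_per_day, cross_az_fraction,
                          price_per_gb=0.01, billed_directions=1):
    """Daily data-transfer cost for the cross-AZ share of the traffic."""
    return total_gb_per_day * cross_az_fraction * price_per_gb * billed_directions


def daily_saving(total_gb_per_day, zones=3, cross_az_after=0.30, **pricing):
    """Difference between even spreading and topology-aware routing."""
    before = cross_az_cost_per_day(
        total_gb_per_day, even_spread_cross_az_fraction(zones), **pricing)
    after = cross_az_cost_per_day(total_gb_per_day, cross_az_after, **pricing)
    return before - after


if __name__ == "__main__":
    total = 1000  # GB/day of service-to-service traffic (hypothetical)
    print(f"before: ${cross_az_cost_per_day(total, even_spread_cross_az_fraction(3)):.2f}/day")
    print(f"after:  ${cross_az_cost_per_day(total, 0.30):.2f}/day")
    print(f"saving: ${daily_saving(total):.2f}/day")
```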
Using local traffic for kube-proxy / Cilium
# For NodePort services: route only to local pods (avoids cross-node hop)
spec:
  type: NodePort
  externalTrafficPolicy: Local # only route to pods on THIS node
# Tradeoff: uneven load distribution if pods aren't evenly distributed
# Use with topologySpreadConstraints to ensure even pod distribution
Network Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: network-operations-alerts
  namespace: monitoring
spec:
  groups:
    - name: network.operations
      rules:
        # Conntrack table near exhaustion
        - alert: ConntrackTableNearFull
          expr: |
            node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Conntrack table {{ $value | humanizePercentage }} full on {{ $labels.instance }}"
            description: "Risk of connection drops if table fills. Increase nf_conntrack_max."
            runbook_url: https://runbooks.example.com/network/conntrack-exhaustion
        - alert: ConntrackTableCritical
          expr: |
            node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.95
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Conntrack table 95%+ full: new connections may be dropped"
            runbook_url: https://runbooks.example.com/network/conntrack-exhaustion
        # CoreDNS SERVFAIL rate
        - alert: CoreDNSServfailHigh
          expr: |
            sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
            /
            sum(rate(coredns_dns_responses_total[5m])) > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "CoreDNS SERVFAIL rate > 1%"
            runbook_url: https://runbooks.example.com/network/coredns-servfail
        # CoreDNS pod down
        - alert: CoreDNSDown
          expr: |
            kube_deployment_status_replicas_available{deployment="coredns",namespace="kube-system"} < 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "CoreDNS has no available replicas: cluster DNS is down"
            runbook_url: https://runbooks.example.com/network/coredns-down
        # Hubble: high drop rate in namespace
        - alert: NetworkPolicyDropRateHigh
          expr: |
            sum by (namespace) (
              rate(hubble_drop_total{namespace!="kube-system"}[5m])
            ) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High NetworkPolicy drop rate in namespace {{ $labels.namespace }}"
            description: "{{ $value | humanize }} drops/sec. Check NetworkPolicy rules."
            runbook_url: https://runbooks.example.com/network/policy-drops
        # NGINX Ingress 5xx error rate
        - alert: IngressHighErrorRate
          expr: |
            sum by (ingress, namespace) (
              rate(nginx_ingress_controller_requests{status=~"5.."}[5m])
            )
            /
            sum by (ingress, namespace) (
              rate(nginx_ingress_controller_requests[5m])
            ) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Ingress {{ $labels.namespace }}/{{ $labels.ingress }} 5xx rate > 5%"
            runbook_url: https://runbooks.example.com/network/ingress-errors
        # Node network TX drops
        - alert: NodeNetworkTxDropsHigh
          expr: |
            rate(node_network_transmit_drop_total{device!~"lo|veth.*|docker.*|flannel.*|cni.*|cilium.*"}[5m]) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} TX drops > 100/sec on {{ $labels.device }}"
            runbook_url: https://runbooks.example.com/network/node-tx-drops
        # VPC CNI IP pool exhaustion (AWS): total minus assigned addresses per node
        - alert: VPCCNIIPPoolLow
          expr: |
            awscni_total_ip_addresses - awscni_assigned_ip_addresses < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "VPC CNI IP pool has fewer than 5 available IPs on {{ $labels.instance }}"
            runbook_url: https://runbooks.example.com/network/vpc-cni-ip-pool
Best Practices
NodeLocal DNSCache everywhere
Install NodeLocal DNSCache on every cluster. It eliminates the 5-second conntrack DNS race bug, reduces CoreDNS load by 60–80%, and cuts DNS p99 from 2ms to 0.1ms.
Monitor conntrack utilization
Set an alert at 80% conntrack table utilization. Conntrack exhaustion causes silent connection drops that are extremely hard to diagnose after the fact.
Hubble for zero-trust debugging
When a NetworkPolicy blocks traffic, hubble observe --verdict DROPPED shows exactly which flow is dropped and why — saving hours compared to trial-and-error policy editing.
Set MTU explicitly
Never rely on auto-detection for overlay MTU. Set it explicitly in CNI configuration. Mixed-MTU clusters cause mysterious large-payload failures that don't show up in small-packet tests.
Topology-aware routing
Enable service.kubernetes.io/topology-mode: Auto on high-RPS services. At scale, this meaningfully reduces cross-AZ data transfer costs and latency.
Use FQDNs for cross-namespace calls
Short service names trigger multiple DNS search-domain lookups. Always use service.namespace.svc.cluster.local for cross-namespace calls to reduce DNS query volume and avoid race conditions.
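To make the search-domain cost concrete, the sketch below models how a stub resolver expands a name using the `search` list and `ndots` option from a pod's /etc/resolv.conf. It is a simplified model of glibc-style behaviour, not a resolver; the search list shown is what a pod in the payments namespace would typically get, and the function name is hypothetical.

```python
def candidate_queries(name, search_domains, ndots=5):
    """Names a resolver tries, in order, until one of them resolves."""
    if name.endswith("."):
        return [name]  # fully qualified: the search list is skipped
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # too few dots: search list first


SEARCH = ["payments.svc.cluster.local", "svc.cluster.local", "cluster.local"]

if __name__ == "__main__":
    for name in ("order-service.orders",
                 "api.stripe.com",
                 "order-service.orders.svc.cluster.local",
                 "order-service.orders.svc.cluster.local."):
        tries = candidate_queries(name, SEARCH)
        print(f"{name}: up to {len(tries)} names, first {tries[0]}")
```

Each candidate is usually queried for both A and AAAA records, so the query count is double the number of names. Under this model a name with four dots and no trailing dot is still expanded through the search list when ndots is 5; only the trailing dot, or a lower ndots value, guarantees a single lookup.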