DNS Issues
Overview
Diagnosis and resolution of Kubernetes DNS failures — NXDOMAIN errors, intermittent resolution failures, CoreDNS crashes, and ndots misconfiguration.
Kubernetes DNS Architecture
Every pod gets:
/etc/resolv.conf:
nameserver 10.96.0.10 ← ClusterIP of kube-dns Service
search production.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
DNS query path:
pod → kube-dns ClusterIP (10.96.0.10) → iptables/BPF → CoreDNS pod(s) → cache/upstream
CoreDNS:
- Deployment in the kube-system namespace (some distributions run it as a DaemonSet)
- Serves: *.cluster.local zone (in-cluster names)
- Forwards: external names to upstream resolvers
- Configurable via Corefile (ConfigMap: coredns)
FQDN formats:
<svc>.<ns>.svc.cluster.local ← Service
<pod-ip>.<ns>.pod.cluster.local ← Pod (dashes for dots)
<pod>.<svc>.<ns>.svc.cluster.local ← StatefulSet pod
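The name formats above can be sketched as small shell helpers. This is only an illustration: the function names (svc_fqdn, pod_fqdn, sts_pod_fqdn) and the sample IP are hypothetical, and the cluster domain is assumed to be cluster.local unless a different one is passed.

```shell
#!/bin/sh
# Hypothetical helpers that assemble in-cluster DNS names.
# The last argument (cluster domain) is optional and defaults to cluster.local.

# svc_fqdn <service> <namespace> [cluster-domain]
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# pod_fqdn <pod-ip> <namespace> [cluster-domain]
# Dots in the IP are replaced with dashes.
pod_fqdn() {
  dashed=$(printf '%s' "$1" | tr '.' '-')
  printf '%s.%s.pod.%s\n' "$dashed" "$2" "${3:-cluster.local}"
}

# sts_pod_fqdn <pod-name> <headless-service> <namespace> [cluster-domain]
sts_pod_fqdn() {
  printf '%s.%s.%s.svc.%s\n' "$1" "$2" "$3" "${4:-cluster.local}"
}
```

For example, `pod_fqdn 10.244.1.7 production` prints `10-244-1-7.production.pod.cluster.local`.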
DNS Diagnosis Steps
# Step 1: Test DNS from inside a pod
kubectl run dnstest --image=nicolaka/netshoot --rm -it -- \
nslookup payments-api.production.svc.cluster.local
# If NXDOMAIN: service doesn't exist or wrong namespace
kubectl get svc payments-api -n production
# Step 2: Test with FQDN vs short name
kubectl run dnstest --image=nicolaka/netshoot --rm -it -n production -- \
nslookup payments-api # short name (uses search domains)
# vs
kubectl run dnstest --image=nicolaka/netshoot --rm -it -n production -- \
nslookup payments-api.production.svc.cluster.local. # fully qualified (trailing dot)
# Step 3: Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Step 4: Check CoreDNS Service
kubectl get svc kube-dns -n kube-system
# ClusterIP should match the nameserver in pod resolv.conf (10.96.0.10 with the default kubeadm service CIDR)
# Step 5: Verify pod's resolv.conf
kubectl exec <pod> -n <ns> -- cat /etc/resolv.conf
# nameserver should be kube-dns ClusterIP
# search should include <ns>.svc.cluster.local
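The two checks in Step 5 can be automated with a small function. This is a sketch under stated assumptions: check_resolv_conf is a hypothetical helper name, the cluster domain is assumed to be cluster.local, and only the first nameserver line is inspected.

```shell
#!/bin/sh
# check_resolv_conf <expected-nameserver-ip> <namespace> < resolv.conf
# Hypothetical helper: reads a resolv.conf on stdin and verifies the two
# properties from Step 5. Assumes the cluster domain is cluster.local.
check_resolv_conf() {
  expected_ip=$1
  namespace=$2
  conf=$(cat)
  status=0

  # The first nameserver entry must be the expected DNS IP.
  actual_ip=$(printf '%s\n' "$conf" | awk '$1 == "nameserver" { print $2; exit }')
  if [ "$actual_ip" != "$expected_ip" ]; then
    echo "FAIL: nameserver is ${actual_ip:-unset}, expected $expected_ip"
    status=1
  fi

  # The search list must contain <namespace>.svc.cluster.local.
  if ! printf '%s\n' "$conf" | awk -v want="$namespace.svc.cluster.local" '
      $1 == "search" { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
      END { exit !found }'; then
    echo "FAIL: search list is missing $namespace.svc.cluster.local"
    status=1
  fi

  if [ "$status" -eq 0 ]; then
    echo "OK"
  fi
  return "$status"
}

# Example (requires cluster access):
#   kubectl exec <pod> -n production -- cat /etc/resolv.conf | check_resolv_conf 10.96.0.10 production
```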
NXDOMAIN — Name Not Found
# "NXDOMAIN" for an in-cluster service
# Cause 1: Typo in service name
kubectl get svc -n production | grep payment
# Cause 2: Wrong namespace in query
# payments-api is in "production" but query is from "default" namespace
# From default namespace:
# payments-api → looks in default.svc.cluster.local → NXDOMAIN
# payments-api.production → resolves
# payments-api.production.svc.cluster.local → always resolves (FQDN)
# Cause 3: Service not created
kubectl get svc payments-api -n production
# Cause 4: Headless service for StatefulSet — pod-specific DNS
# Correct format: postgres-0.postgres.production.svc.cluster.local
# ^pod-name ^headless-svc-name
# Cause 5: ExternalName service misconfiguration
kubectl get svc external-db -n production -o jsonpath='{.spec.type}'
# If ExternalName: check .spec.externalName is a valid DNS name
# Cause 6: CoreDNS not running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# If 0 pods running: CoreDNS is down → ALL DNS lookups from pods fail (in-cluster and external)
kubectl describe deployment coredns -n kube-system
Intermittent DNS Failures
# DNS sometimes works, sometimes NXDOMAIN or timeout
# Very common with containerized apps that open many connections
# Cause 1: UDP packet loss / timeout at 5s
# Default resolver settings (glibc): 5s timeout per attempt, 2 attempts
# Under load: DNS UDP packets dropped by conntrack table overflow
# Check conntrack table size
kubectl debug node/<node> -it --image=ubuntu -- \
sysctl net.netfilter.nf_conntrack_max
kubectl debug node/<node> -it --image=ubuntu -- \
cat /proc/sys/net/netfilter/nf_conntrack_count
# If count ≈ max: table full, packets dropped
# Fix: increase conntrack table size (run as root on the affected node, not in a pod)
sysctl -w net.netfilter.nf_conntrack_max=524288
# Persist the setting across reboots
echo "net.netfilter.nf_conntrack_max=524288" >> /etc/sysctl.d/99-conntrack.conf
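The "count ≈ max" comparison can be turned into a percentage check. This is a minimal sketch: conntrack_usage is a hypothetical helper name, and the default 90% threshold is an arbitrary illustrative value, not a recommendation from any tool.

```shell
#!/bin/sh
# conntrack_usage <count> <max> [threshold-percent]
# Hypothetical helper: prints the table utilization and flags it when it
# reaches the threshold (default 90, an arbitrary illustrative value).
conntrack_usage() {
  count=$1
  max=$2
  threshold=${3:-90}
  pct=$(( count * 100 / max ))   # integer percentage, rounded down
  if [ "$pct" -ge "$threshold" ]; then
    echo "CRITICAL ${pct}%"
  else
    echo "OK ${pct}%"
  fi
}

# Example, run on the node:
#   conntrack_usage "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                   "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```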
# Cause 2: CoreDNS CPU throttled (too many requests)
kubectl top pod -n kube-system -l k8s-app=kube-dns
# Check CPU: if at limit, CoreDNS is throttled
# Fix: increase CoreDNS CPU limit
kubectl edit deployment coredns -n kube-system
# resources:
#   limits:
#     cpu: 500m      → increase to 1000m
#     memory: 170Mi
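# Cause 3: ndots:5 causing excessive lookups
# With ndots:5, any name with fewer than 5 dots is tried against every search
# domain before it is tried as-is. The resolver stops at the first answer, so a
# short in-namespace name like "payments-api" resolves on the first try, but an
# external name like "api.stripe.com" walks the whole list (A and AAAA each):
# api.stripe.com.production.svc.cluster.local → NXDOMAIN
# api.stripe.com.svc.cluster.local → NXDOMAIN
# api.stripe.com.cluster.local → NXDOMAIN
# api.stripe.com.us-east-1.compute.internal → NXDOMAIN (node search domain, if inherited)
# api.stripe.com. → answer
# Fix: use an absolute name (trailing dot) in app code:
# "api.stripe.com." or "payments-api.production.svc.cluster.local." → 1 DNS lookup
# (without the trailing dot, the 4-dot service FQDN still goes through the search list first)
# OR set ndots:2 in pod dnsConfig: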
dnsConfig:
  options:
    - name: ndots
      value: "2"
    - name: single-request-reopen   # avoid A/AAAA race condition
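The effect of ndots on lookup order can be sketched as a shell function that prints the names a resolver would try. It is an approximation of glibc-style behavior as described in resolv.conf(5), not a resolver implementation; dns_candidates is a hypothetical helper name.

```shell
#!/bin/sh
# dns_candidates <name> <ndots> <search-domain>...
# Hypothetical helper: prints, in order, the absolute names a glibc-style
# resolver would try. A real resolver stops at the first successful answer.
dns_candidates() {
  name=$1
  ndots=$2
  shift 2

  # A trailing dot marks the name as absolute: the search list is skipped.
  case $name in
    *.) echo "$name"; return 0 ;;
  esac

  # Count the dots in the name.
  without_dots=$(printf '%s' "$name" | tr -d '.')
  dots=$(( ${#name} - ${#without_dots} ))

  # At or above the ndots threshold, the name is tried as-is first.
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name."
  fi
  for domain in "$@"; do
    echo "$name.$domain."
  done
  # Below the threshold, the name is tried as-is only after the search list.
  if [ "$dots" -lt "$ndots" ]; then
    echo "$name."
  fi
}
```

Running `dns_candidates api.stripe.com 5 production.svc.cluster.local svc.cluster.local cluster.local` lists the three search-domain candidates before `api.stripe.com.`, while passing `2` as the threshold puts `api.stripe.com.` first.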
# Cause 4: CoreDNS serving from single pod — restart or crash causes brief outage
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Fix: scale to 2+ replicas
kubectl scale deployment coredns -n kube-system --replicas=3
CoreDNS Configuration
# View and edit Corefile
kubectl get configmap coredns -n kube-system -o yaml
# Default Corefile
cat << 'EOF'
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}
EOF
# Add custom domain forwarding (e.g., internal company DNS)
kubectl edit configmap coredns -n kube-system
# Add in Corefile:
# mycompany.internal:53 {
#     errors
#     cache 30
#     forward . 10.0.0.2 10.0.0.3   # internal DNS servers
# }
# Increase cache TTL to reduce upstream queries
# cache 300 (caps cached TTLs at 5 minutes; the default Corefile uses 30 seconds)
# Enable CoreDNS debug logging (high verbosity — use temporarily)
# log
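# errors {
#     stacktrace
# }
# (the opening brace must end its line; stacktrace is supported in recent CoreDNS releases)
# The reload plugin picks up ConfigMap changes automatically (allow a minute or two
# for the kubelet to sync the ConfigMap); restart only to force an immediate reload
kubectl rollout restart deployment coredns -n kube-system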
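Whether a restart is needed after a ConfigMap edit depends on whether the reload plugin is enabled in the Corefile. A rough check is sketched below, assuming one plugin directive per line as in the default Corefile; corefile_has_plugin is a hypothetical helper name.

```shell
#!/bin/sh
# corefile_has_plugin <plugin> < Corefile
# Hypothetical helper: exits 0 if any line's first word is the plugin name.
# Commented-out directives are ignored because their first word starts with '#'.
corefile_has_plugin() {
  awk -v plugin="$1" '$1 == plugin { found = 1 } END { exit !found }'
}

# Example (requires cluster access):
#   kubectl get configmap coredns -n kube-system -o jsonpath='{.data.Corefile}' \
#     | corefile_has_plugin reload && echo "reload enabled"
```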
CoreDNS Metrics
# CoreDNS exposes Prometheus metrics on :9153
# Check DNS request rate
kubectl get --raw '/api/v1/namespaces/kube-system/services/kube-dns:9153/proxy/metrics' | \
grep coredns_dns_requests_total | head -5
# DNS error rate
# PromQL: rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m])
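# Latency
# PromQL: histogram_quantile(0.99, sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m])))
# Cache hit rate (hit and miss counters carry different labels, so aggregate before dividing)
# PromQL: sum(rate(coredns_cache_hits_total[5m])) / (sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))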
# Prometheus rules for CoreDNS alerting
# - alert: CoreDNSDown: up{job="coredns"} == 0
# - alert: CoreDNSHighErrorRate: rate(NXDOMAIN) / rate(total) > 0.05
# - alert: CoreDNSHighLatency: p99 > 2000ms
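Without Prometheus, a point-in-time cache hit ratio can be computed from the raw metrics text. This is a sketch that sums the cumulative `coredns_cache_hits_total` and `coredns_cache_misses_total` counters since each CoreDNS process started, so it is not a rate; cache_hit_ratio is a hypothetical helper name.

```shell
#!/bin/sh
# cache_hit_ratio < metrics.txt
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints hits / (hits + misses) as a percentage, or "n/a" with no samples.
cache_hit_ratio() {
  awk '
    /^coredns_cache_hits_total/   { hits += $NF }     # sample value is the last field
    /^coredns_cache_misses_total/ { misses += $NF }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.1f%%\n", 100 * hits / total
    }'
}

# Example (requires cluster access):
#   kubectl get --raw '/api/v1/namespaces/kube-system/services/kube-dns:9153/proxy/metrics' \
#     | cache_hit_ratio
```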
External DNS Issues
# Pod can't resolve external names (e.g., api.stripe.com)
# Step 1: Test external resolution from pod
kubectl run dnstest --image=nicolaka/netshoot --rm -it -- \
nslookup api.stripe.com
# Step 2: Check CoreDNS forward configuration
kubectl get configmap coredns -n kube-system -o yaml
# forward . /etc/resolv.conf ← should forward to node's resolver
# OR: forward . 8.8.8.8 1.1.1.1 ← explicit public DNS
# Step 3: Verify node can resolve external DNS
# (the stock ubuntu image does not ship nslookup; use an image that includes DNS tools)
kubectl debug node/<node> -it --image=nicolaka/netshoot -- \
  nslookup api.stripe.com 8.8.8.8
# Step 4: Check NetworkPolicy isn't blocking DNS
# CoreDNS pods need egress to upstream resolvers on UDP/TCP 53;
# application pods need egress to kube-dns on UDP/TCP 53
kubectl get networkpolicy -n kube-system
kubectl get networkpolicy -n <ns>
# Cause: corporate network requires specific DNS servers
# Fix: update CoreDNS ConfigMap forward to use internal resolvers
# Cause: NodeLocal DNSCache not set up correctly
kubectl get pods -n kube-system -l k8s-app=node-local-dns
kubectl logs -n kube-system -l k8s-app=node-local-dns --tail=20
NodeLocal DNSCache
# NodeLocal DNSCache runs on each node (DaemonSet), answers DNS locally
# Reduces latency and conntrack issues for DNS
# Check if NodeLocal DNSCache is installed
kubectl get daemonset -n kube-system node-local-dns
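# If not installed (common optimization for large clusters):
# Install via addon. With kube-proxy in iptables mode the cache listens on both the
# kube-dns ClusterIP and the link-local IP (169.254.20.10), so pods need no change;
# in IPVS mode, kubelet --cluster-dns must point to 169.254.20.10
# Verify pod is using local DNS
kubectl exec <pod> -n <ns> -- cat /etc/resolv.conf
# IPVS mode: nameserver 169.254.20.10 ← node-local DNS (not 10.96.0.10)
# iptables mode: nameserver stays 10.96.0.10; queries are intercepted on the node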
# Check NodeLocal DNSCache logs
kubectl logs -n kube-system -l k8s-app=node-local-dns --tail=50
# Benefits:
# - Pod → cache traffic skips conntrack (NOTRACK rules); cache misses go to CoreDNS over TCP
# - Caching at node level (can substantially reduce CoreDNS load)
# - Lower latency (answers come from the same node instead of a CoreDNS pod elsewhere)
Related
- 02 — Network Issues — general network troubleshooting
- 03 — Networking — cluster network architecture
- 09 — Ingress Issues — DNS and TLS at ingress layer