DNS & Service Discovery
CoreDNS has been the default cluster DNS server in Kubernetes since 1.13 (replacing kube-dns). It resolves service names, pod hostnames, and external FQDNs for every pod in the cluster. This page covers CoreDNS architecture, the Corefile plugin pipeline, every DNS record type Kubernetes creates, ndots and search-domain resolution mechanics, autopath, stub zones, forward zones, negative caching, DNS-based service discovery patterns, NodeLocal DNSCache, and production tuning.
CoreDNS Architecture
CoreDNS runs as a Deployment (typically 2 replicas) in the kube-system namespace. The Service kube-dns holds the stable ClusterIP (e.g. 10.96.0.10) written into every pod's /etc/resolv.conf at pod creation time by kubelet.
CoreDNS Process Model
Single binary, plugin chain. Each DNS query traverses the plugin chain in order. Plugins can handle, pass-through, or modify the query. No forking — goroutine per query.
kubernetes plugin
Watches the API server for Services and Pods via a shared informer. Answers in-cluster queries from an in-memory cache — no API server hit per DNS query. Serves cluster.local zone.
forward plugin
Forwards unresolved queries upstream. Default: node's /etc/resolv.conf (inherits node DNS). Can be overridden to point at specific resolvers (8.8.8.8, corporate DNS, etc.).
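The plugin chain described above can be pictured with a small model. This is only a sketch of the idea (handle the query or pass it to the next plugin), not CoreDNS's Go implementation; the names Query, kubernetes_plugin, forward_plugin, and build_chain are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    name: str   # fully qualified name with trailing dot
    qtype: str  # "A", "AAAA", ...


def kubernetes_plugin(records):
    """Answer names under cluster.local from an in-memory map."""
    def handle(query, next_handler):
        if query.name.endswith("cluster.local."):
            # Authoritative for the zone: a miss is NXDOMAIN (None here),
            # it is not passed on to the next plugin.
            return records.get((query.name, query.qtype))
        return next_handler(query)
    return handle


def forward_plugin(upstream):
    """Send everything that reaches this plugin to an upstream resolver."""
    def handle(query, next_handler):
        return upstream(query)
    return handle


def _bind(plugin, next_handler):
    def handler(query):
        return plugin(query, next_handler)
    return handler


def build_chain(plugins):
    """Link plugins so each one can delegate to the one after it."""
    def terminal(query):
        return None
    handler = terminal
    for plugin in reversed(plugins):
        handler = _bind(plugin, handler)
    return handler
```

With a chain of [kubernetes_plugin(...), forward_plugin(...)], an in-cluster name is answered from memory and anything else falls through to the upstream callable.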
The Corefile — Plugin Pipeline
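The Corefile is CoreDNS's configuration file, mounted from a ConfigMap. Each server block defines a zone and the plugins enabled for it. The order in which plugins run is fixed when CoreDNS is built (plugin.cfg), not by their order in the Corefile: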
# kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}'
.:53 {
    errors # log errors to stdout
    health { # /health endpoint (HTTP :8080)
        lameduck 5s # on shutdown, report unhealthy but keep serving for 5s
    }
    ready # /ready endpoint (HTTP :8181)
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure # create A records for pods (insecure=no verification)
        fallthrough in-addr.arpa ip6.arpa # pass PTR queries not found to next plugin
        ttl 30 # TTL for synthesized records (default 5s)
    }
    prometheus :9153 # expose metrics at :9153/metrics
    forward . /etc/resolv.conf { # forward all other queries to node DNS
        max_concurrent 1000
    }
    cache 30 # cap cached TTLs at 30s (applies to positive and negative answers)
    loop # detect forwarding loops
    reload # hot-reload Corefile without restart (checks every 30s by default; 2s is the minimum)
    loadbalance # randomize A/AAAA/MX record order (poor-man's LB)
}
CoreDNS Plugin Reference
| Plugin | Purpose | Key Config Options |
|---|---|---|
| kubernetes | Serve in-cluster DNS (Services, Pods, StatefulSets) | pods (insecure, verified, or disabled), ttl, endpoint_pod_names, namespaces, fallthrough |
| forward | Proxy queries upstream | . 8.8.8.8 8.8.4.4, max_concurrent, health_check, prefer_udp, tls (DoT) |
| cache | In-memory DNS cache | cache 30 (max TTL), success 9984 30 (capacity, TTL), denial 9984 5 (capacity, NXDOMAIN TTL) |
| errors | Log SERVFAIL/REFUSED to stdout | consolidate 5m ".*" (rate-limit identical errors) |
| health | HTTP liveness probe at :8080/health | lameduck Xs for graceful shutdown |
| ready | HTTP readiness probe at :8181/ready | No options; returns 200 when all plugins are ready |
| prometheus | Expose metrics at :9153/metrics | Port configurable |
| rewrite | Rewrite query names / types | rewrite name suffix .legacy.svc.cluster.local .svc.cluster.local |
| hosts | Serve records from a hosts-file | hosts /etc/hosts { fallthrough } |
| file | Serve a zone from a RFC 1035 zone file | file db.example.com example.com |
| etcd | SkyDNS-compatible etcd backend (legacy) | Mostly replaced by kubernetes plugin |
| autopath | Resolve short names server-side (collapses the client's search-list queries into 1 query) | autopath @kubernetes (requires pods verified) |
| loop | Detect forwarding loops and halt the process (CoreDNS then crash-loops until the loop is fixed) | No options |
| reload | Watch Corefile for changes and reload without restart | reload 10s |
| loadbalance | Rotate A/AAAA/MX records for naive load balancing | loadbalance round_robin |
| log | Log all DNS queries (verbose; use sparingly in production) | log . {class denial error} |
DNS Records Kubernetes Creates
Service DNS Records
| Record Type | Name Pattern | Resolves To | Condition |
|---|---|---|---|
| A / AAAA | <svc>.<ns>.svc.cluster.local | ClusterIP (IPv4 or IPv6) | All services with ClusterIP |
| A / AAAA | <svc>.<ns>.svc.cluster.local | All ready pod IPs (multiple A records) | Headless service (clusterIP: None) |
| SRV | _<port-name>._<proto>.<svc>.<ns>.svc.cluster.local | Port number + target A record | Named ports only |
| CNAME | <svc>.<ns>.svc.cluster.local | External hostname | ExternalName services |
| PTR | <reversed-ip>.in-addr.arpa | Service FQDN | ClusterIP services (reverse DNS) |
Pod DNS Records
| Record Type | Name Pattern | Resolves To | Notes |
|---|---|---|---|
| A | <ip-dashed>.<ns>.pod.cluster.local | Pod IP | IP with dots replaced by dashes: 10-244-1-5.default.pod.cluster.local |
| A | <hostname>.<subdomain>.<ns>.svc.cluster.local | Pod IP | Only when pod has hostname + subdomain and a matching headless service |
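The name patterns in the two tables above can be written as small helper functions. This is a sketch that assumes the default cluster domain cluster.local and IPv4 pod addresses; the function names are hypothetical.

```python
import ipaddress

CLUSTER_DOMAIN = "cluster.local"  # assumption: default cluster domain


def service_fqdn(service, namespace, domain=CLUSTER_DOMAIN):
    """A/AAAA (or CNAME for ExternalName) record name of a Service."""
    return f"{service}.{namespace}.svc.{domain}"


def srv_name(port_name, proto, service, namespace, domain=CLUSTER_DOMAIN):
    """SRV record name for a named Service port."""
    return f"_{port_name}._{proto.lower()}.{service_fqdn(service, namespace, domain)}"


def pod_ip_name(pod_ip, namespace, domain=CLUSTER_DOMAIN):
    """Pod A record name: the IPv4 address with dots replaced by dashes."""
    ip = ipaddress.IPv4Address(pod_ip)  # raises ValueError on bad input
    return f"{str(ip).replace('.', '-')}.{namespace}.pod.{domain}"


def stable_pod_name(hostname, subdomain, namespace, domain=CLUSTER_DOMAIN):
    """Per-pod record for pods with hostname + subdomain (headless Service)."""
    return f"{hostname}.{subdomain}.{namespace}.svc.{domain}"


def ptr_name(ip):
    """Reverse-lookup (PTR) name for an IP address."""
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    print(service_fqdn("nginx", "production"))
    print(pod_ip_name("10.244.1.5", "default"))
    print(ptr_name("10.96.0.10"))
```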
StatefulSet Pod Records
StatefulSets combine hostname (pod ordinal name) with subdomain (headless service name) to create stable per-pod DNS entries:
apiVersion: v1
kind: Service
metadata:
  name: cassandra # headless service; must match the pod subdomain
spec:
  clusterIP: None
  selector:
    app: cassandra
  ports:
  - name: cql # SRV records are only created for named ports
    port: 9042
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra # sets pod subdomain
  replicas: 3
  # selector and pod template omitted for brevity
# Pods: cassandra-0, cassandra-1, cassandra-2
# DNS: cassandra-0.cassandra.default.svc.cluster.local → pod-0 IP
#      cassandra-1.cassandra.default.svc.cluster.local → pod-1 IP
#      cassandra-2.cassandra.default.svc.cluster.local → pod-2 IP
Full DNS Name Resolution Table
| Query (from pod in default namespace) | Resolves To | Search domain hit |
|---|---|---|
| nginx | nginx ClusterIP (same namespace) | nginx.default.svc.cluster.local |
| nginx.production | nginx ClusterIP in production ns | nginx.production.svc.cluster.local |
| nginx.production.svc | nginx ClusterIP in production ns | nginx.production.svc.cluster.local |
| nginx.production.svc.cluster.local | nginx ClusterIP | None: 4 dots is still fewer than ndots:5, so the search domains are tried first (NXDOMAIN) and the name then resolves as-is |
| cassandra-0.cassandra | cassandra-0 pod IP | cassandra-0.cassandra.default.svc.cluster.local |
| google.com | Forwarded to upstream → 142.250.x.x | None: every search domain is tried first (NXDOMAIN), then google.com as-is |
| google.com. (trailing dot) | Forwarded immediately (FQDN) | No search domain expansion |
Pod /etc/resolv.conf and ndots
# /etc/resolv.conf inside a pod in 'default' namespace
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
ndots:5 Resolution Mechanics
The ndots:5 option means: if the query name has fewer than 5 dots, the resolver tries each search domain first and only then treats the name as absolute. With the three search domains above, a simple external hostname like google.com (1 dot) costs 4 lookups:
1. google.com.default.svc.cluster.local → NXDOMAIN
2. google.com.svc.cluster.local → NXDOMAIN
3. google.com.cluster.local → NXDOMAIN
4. google.com → actual answer
Most resolvers look up A and AAAA together, so each step is typically two queries on the wire. Nodes that have their own search domains (common on cloud providers) add one more step per extra domain, because kubelet appends the node's search list to the pod's.
Each NXDOMAIN adds latency. In high-throughput services making frequent external DNS lookups, this multiplies CoreDNS query load by roughly 4×.
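The lookup order can be reproduced with a short model of the resolver's search-list logic. This is a simplified sketch of glibc-style behavior, not the resolver's actual code; parse_resolv_conf and candidate_names are hypothetical helper names.

```python
def parse_resolv_conf(text):
    """Extract the search list and ndots value from resolv.conf text."""
    search, ndots = [], 1  # the resolver default is ndots:1
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "search":
            search = fields[1:]
        elif fields[0] == "options":
            for opt in fields[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return search, ndots


def candidate_names(name, search, ndots):
    """Return the names a resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already absolute: no search expansion
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try as-is first
    return expanded + [absolute]      # too few dots: search list first


if __name__ == "__main__":
    conf = (
        "nameserver 10.96.0.10\n"
        "search default.svc.cluster.local svc.cluster.local cluster.local\n"
        "options ndots:5\n"
    )
    search, ndots = parse_resolv_conf(conf)
    for candidate in candidate_names("google.com", search, ndots):
        print(candidate)
```

Calling candidate_names with a trailing-dot name returns a single entry, which is why the trailing dot removes the extra lookups.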
Mitigating ndots:5 Overhead
# Option 1: use trailing dot for FQDNs in application code
# app connects to "stripe.com." (trailing dot) → no search expansion → 1 query
# Option 2: lower ndots per pod via dnsConfig
apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2" # only expand if fewer than 2 dots (affects in-cluster too)
    - name: timeout
      value: "2"
    - name: attempts
      value: "3"
  containers:
  - name: app
    image: myapp
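# Option 3: enable autopath plugin in CoreDNS (server-side search expansion)
# CoreDNS walks the search path itself, so the client's first query already returns the answer
# Add to the server block (next to the kubernetes block, not inside it) and
# set "pods verified" in the kubernetes block, which autopath requires:
# autopath @kubernetes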
# Option 4: NodeLocal DNSCache (caches NXDOMAIN responses too)
# Reduces repeated NXDOMAIN round trips to CoreDNS pods
Pod dnsPolicy
| dnsPolicy | resolv.conf behavior | Use Case |
|---|---|---|
| ClusterFirst (default) | CoreDNS IP as nameserver; cluster search domains | All normal pods; in-cluster service resolution |
| ClusterFirstWithHostNet | Same as ClusterFirst but for hostNetwork: true pods | DaemonSet pods needing both host network + cluster DNS |
| Default | Inherits node's /etc/resolv.conf exactly | Pods that must use node's DNS (corporate resolver, etc.) |
| None | Completely custom; must provide dnsConfig | Full control; custom nameservers + search domains |
CoreDNS Configuration Patterns
Stub Zones (Split-Horizon DNS)
# Route corporate.internal queries to internal DNS server
# kubectl edit cm -n kube-system coredns
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
    # Stub zone: forward corporate.internal to internal resolver
    corporate.internal:53 {
        errors
        cache 30
        forward . 10.0.0.2 10.0.0.3 {
            prefer_udp
        }
    }
    # Stub zone for on-prem services
    on-prem.example.com:53 {
        errors
        cache 60
        forward . 192.168.1.53
    }
Custom In-Cluster Records
# Serve custom A records for hostnames not backed by Services
# Useful for external endpoints that pods reference by name
data:
  Corefile: |
    .:53 {
        # ... standard config ...
    }
    # Custom zone with static records
    custom.cluster.local:53 {
        errors
        hosts /etc/coredns/custom-hosts {
            10.0.100.50 db-primary.custom.cluster.local
            10.0.100.51 db-replica.custom.cluster.local
            fallthrough
        }
    }
  # Mount custom-hosts from a separate ConfigMap key
  custom-hosts: |
    10.0.100.50 db-primary.custom.cluster.local
    10.0.100.51 db-replica.custom.cluster.local
DNS-over-TLS Upstream
.:53 {
    forward . tls://8.8.8.8 tls://8.8.4.4 {
        tls_servername dns.google
        health_check 5s
    }
    cache 30
}
Negative Caching
# cache plugin controls positive and negative TTLs
cache {
    success 9984 30 0 # capacity=9984, max TTL=30s, min TTL=0
    denial 9984 5 0 # NXDOMAIN/NODATA: capacity=9984, max TTL=5s, min TTL=0
    # A short denial TTL (5s) means a name that was just created becomes resolvable sooner,
    # but it increases CoreDNS query load for persistent NXDOMAIN patterns
    prefetch 10 1m 10% # refresh names seen 10+ times (gaps under 1m) once their TTL drops below 10% of the original
}
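The effect of separate success and denial TTL caps can be illustrated with a toy cache. This is a sketch of the concept only, not the CoreDNS cache plugin; the DnsCache class and its injectable clock are hypothetical, and capacity limits and prefetch are left out.

```python
import time


class DnsCache:
    """Toy cache with separate TTL caps for answers and NXDOMAIN."""

    def __init__(self, success_ttl=30, denial_ttl=5, clock=time.monotonic):
        self.success_ttl = success_ttl
        self.denial_ttl = denial_ttl
        self._clock = clock
        self._entries = {}  # name -> (answer or None, expiry time)

    def put(self, name, answer, record_ttl):
        # answer=None models a negative response (NXDOMAIN).
        cap = self.denial_ttl if answer is None else self.success_ttl
        ttl = min(record_ttl, cap)
        self._entries[name] = (answer, self._clock() + ttl)

    def get(self, name):
        """Return (hit, answer); expired entries count as misses."""
        entry = self._entries.get(name)
        if entry is None:
            return (False, None)
        answer, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]
            return (False, None)
        return (True, answer)
```

A negative entry stored with these defaults expires after 5 seconds even if the upstream suggested a longer TTL, while a positive entry survives up to 30 seconds.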
NodeLocal DNSCache
NodeLocal DNSCache (GA 1.18) runs a DNS caching agent (node-local-dns DaemonSet) on every node using a link-local IP (169.254.20.10). Pod queries are answered by this local cache instead of crossing the network to a CoreDNS pod (transparently with kube-proxy in iptables mode, or through a changed kubelet clusterDNS in IPVS mode), which avoids the UDP conntrack race condition and reduces latency for cached queries.
Why NodeLocal DNSCache?
Problem: UDP Conntrack Race (without NodeLocal)
Each node runs many pods. UDP DNS queries to the CoreDNS ClusterIP go through conntrack DNAT. When a resolver sends its A and AAAA queries in parallel from the same socket (same source port), the two packets can race to insert the same conntrack entry. The loser is dropped, and the client waits out its 5-second timeout before retrying. This is the infamous "5s DNS timeout" bug endemic to iptables-mode Kubernetes.
Solution: NodeLocal DNSCache
node-local-dns listens on 169.254.20.10:53 (link-local, no conntrack needed — no DNAT). Pods send queries to this local address. Cache hits return immediately (sub-millisecond). Cache misses are forwarded to CoreDNS pods over TCP (not UDP), avoiding the conntrack race entirely.
Installing NodeLocal DNSCache
# Download and apply the NodeLocal DNSCache DaemonSet manifest
# Replace __PILLAR__DNS__SERVER__ with CoreDNS ClusterIP
# Replace __PILLAR__LOCAL__DNS__ with 169.254.20.10
# Replace __PILLAR__DNS__DOMAIN__ with cluster.local
COREDNS_IP=$(kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}')
curl -sL "https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml" | \
sed "s/__PILLAR__DNS__SERVER__/${COREDNS_IP}/g" | \
sed "s/__PILLAR__LOCAL__DNS__/169.254.20.10/g" | \
sed "s/__PILLAR__DNS__DOMAIN__/cluster.local/g" | \
kubectl apply -f -
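# kube-proxy in iptables mode: node-local-dns also listens on the kube-dns ClusterIP,
# so pods use the cache without any kubelet change
# kube-proxy in IPVS mode: kubelet must be reconfigured to use 169.254.20.10 as clusterDNS
# In KubeletConfiguration:
#   clusterDNS:
#   - 169.254.20.10
# Or set at cluster creation:
#   kubeadm init --config kubeadm.yaml
#   KubeletConfiguration.clusterDNS: ["169.254.20.10"]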
NodeLocal DNSCache Corefile
cluster.local:53 {
    errors
    cache {
        success 9984 30
        denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10 # bind to link-local address
    forward . __PILLAR__DNS__SERVER__ {
        force_tcp # forward to CoreDNS over TCP (avoids conntrack)
    }
    prometheus :9253
    health 169.254.20.10:8080
}
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10
    forward . __PILLAR__DNS__SERVER__ {
        force_tcp
    }
    prometheus :9253
}
ExternalDNS — Automatic External DNS
ExternalDNS is a Kubernetes add-on that synchronizes Service and Ingress resources with external DNS providers (Route53, Cloud DNS, Cloudflare, etc.). It watches Services of type LoadBalancer and Ingress objects and creates/updates DNS records automatically.
# Annotate a LoadBalancer Service to create a Route53 record
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    external-dns.alpha.kubernetes.io/hostname: api.example.com
    external-dns.alpha.kubernetes.io/ttl: "300"
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - port: 443
---
# ExternalDNS Deployment (Route53 example)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: external-dns # label chosen for this example
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: registry.k8s.io/external-dns/external-dns:v0.14.2
        args:
        - --source=service
        - --source=ingress
        - --domain-filter=example.com
        - --provider=aws
        - --aws-zone-type=public
        - --registry=txt
        - --txt-owner-id=my-cluster # prevents multiple clusters clobbering records
Service Discovery Patterns
Environment Variable Discovery (Legacy)
Before DNS was the primary mechanism, Kubernetes injected service information as environment variables into pods. This is still present but order-dependent (services must exist before pods) and generally discouraged:
# In a pod, if service 'nginx' exists in the same namespace:
env | grep NGINX
# NGINX_SERVICE_HOST=10.96.14.3
# NGINX_SERVICE_PORT=80
# NGINX_PORT=tcp://10.96.14.3:80
# NGINX_PORT_80_TCP=tcp://10.96.14.3:80
# NGINX_PORT_80_TCP_PROTO=tcp
# NGINX_PORT_80_TCP_PORT=80
# NGINX_PORT_80_TCP_ADDR=10.96.14.3
# Limitation: services created AFTER the pod don't get env vars
# Use DNS instead — DNS is dynamic and doesn't have this ordering issue
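The variable names follow a simple rule: the Service name is upper-cased and dashes become underscores. A minimal sketch of reading them, assuming only the _SERVICE_HOST and _SERVICE_PORT pair is needed; service_env_prefix and lookup_service are hypothetical helper names.

```python
import os


def service_env_prefix(service_name):
    """Map a Service name to its environment variable prefix."""
    return service_name.upper().replace("-", "_")


def lookup_service(service_name, environ=os.environ):
    """Return (host, port) from injected env vars, or None if absent.

    None is the expected result for Services created after the pod
    started, which is why DNS is the preferred mechanism.
    """
    prefix = service_env_prefix(service_name)
    host = environ.get(f"{prefix}_SERVICE_HOST")
    port = environ.get(f"{prefix}_SERVICE_PORT")
    if host and port:
        return host, int(port)
    return None
```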
Headless Service Discovery
Headless services are the foundation for stateful workload discovery. DNS returns all ready pod IPs, enabling clients to implement their own load balancing or connect to specific replicas:
# From inside a pod, querying a headless service returns all pod IPs
nslookup cassandra.default.svc.cluster.local
# Server: 10.96.0.10
# Address: 10.96.0.10#53
# Name: cassandra.default.svc.cluster.local
# Address: 10.244.3.9
# Address: 10.244.1.5
# Address: 10.244.2.7
# Round-robin across all IPs (client-side load balancing)
# Individual pod DNS (StatefulSet)
nslookup cassandra-0.cassandra.default.svc.cluster.local
# Address: 10.244.1.5 (always same pod — stable address)
# SRV record (port discovery)
nslookup -type=SRV _cql._tcp.cassandra.default.svc.cluster.local
# _cql._tcp.cassandra.default.svc.cluster.local service = 0 50 9042 cassandra-0.cassandra.default.svc.cluster.local.
# _cql._tcp.cassandra.default.svc.cluster.local service = 0 50 9042 cassandra-1.cassandra.default.svc.cluster.local.
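On the client side, the multiple A records of a headless Service can feed a simple round-robin picker. The sketch below uses only the standard library; resolve_all and RoundRobin are hypothetical names, and a real client would also need to re-resolve periodically because pod IPs change.

```python
import itertools
import socket


def resolve_all(hostname, port):
    """Return every distinct IP address the name resolves to."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class RoundRobin:
    """Cycle through a fixed list of backend addresses."""

    def __init__(self, addresses):
        addresses = list(addresses)
        if not addresses:
            raise ValueError("at least one address is required")
        self._cycle = itertools.cycle(addresses)

    def next(self):
        return next(self._cycle)
```

Inside the cluster, RoundRobin(resolve_all("cassandra.default.svc.cluster.local", 9042)) would rotate across the ready pod IPs returned by DNS.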
Multi-Cluster DNS (Clusterset)
The Multicluster Services API (KEP-1645) defines ServiceImport and ServiceExport objects. CoreDNS with the multicluster plugin resolves cross-cluster service names:
# Multi-cluster DNS name format:
# <service>.<namespace>.svc.clusterset.local
# Export a service from cluster-1
kubectl apply -f - <<EOF
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
name: nginx
namespace: production
EOF
# In cluster-2, the service is discoverable as:
# nginx.production.svc.clusterset.local
CoreDNS Scaling
Autoscaling CoreDNS
# Default CoreDNS deployment has 2 replicas
# For large clusters, scale CoreDNS with static replicas or an autoscaler
# Option 1: Static scale (simple)
kubectl scale deployment coredns -n kube-system --replicas=4
# Option 2: cluster-proportional-autoscaler
# Scales CoreDNS based on node and core count
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: dns-autoscaler # label chosen for this example
  template:
    metadata:
      labels:
        k8s-app: dns-autoscaler
    spec:
      containers:
      - name: autoscaler
        image: registry.k8s.io/cpa/cluster-proportional-autoscaler:v1.8.8
        command:
        - /cluster-proportional-autoscaler
        - --namespace=kube-system
        - --configmap=dns-autoscaler
        - --target=Deployment/coredns
        - --default-params={"linear":{"coresPerReplica":256,"nodesPerReplica":16,"min":2,"max":20}}
        - --logtostderr=true
        - --v=2
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler
  namespace: kube-system
data:
  linear: |
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 2,
      "max": 20,
      "preventSinglePointOfFailure": true
    }
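To see what replica count these parameters produce, the linear mode can be reproduced in a few lines. This is a sketch based on the formula in the cluster-proportional-autoscaler README (the larger of the core-based and node-based counts, clamped to min and max); linear_replicas is a hypothetical helper name.

```python
import math


def linear_replicas(cores, nodes, cores_per_replica=256, nodes_per_replica=16,
                    min_replicas=2, max_replicas=20, prevent_spof=True):
    """Replica count for the autoscaler's linear mode."""
    replicas = max(
        math.ceil(cores / cores_per_replica),
        math.ceil(nodes / nodes_per_replica),
    )
    replicas = min(replicas, max_replicas)
    replicas = max(replicas, min_replicas)
    # preventSinglePointOfFailure: at least 2 replicas on multi-node clusters.
    if prevent_spof and nodes > 1:
        replicas = max(replicas, 2)
    return replicas


if __name__ == "__main__":
    # 100 nodes with 400 cores: the node term dominates (ceil(100 / 16) = 7).
    print(linear_replicas(cores=400, nodes=100))
```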
CoreDNS Resource Tuning
# CoreDNS pod resource requests/limits
# Defaults are too conservative for large clusters
resources:
  requests:
    cpu: 100m
    memory: 70Mi
  limits:
    cpu: 500m # increase for high-QPS clusters
    memory: 170Mi # increase for large service count
# CoreDNS cache tuning for large clusters
cache {
    success 9984 60 # 9984 entries, 60s TTL (longer reduces CoreDNS load)
    denial 9984 5 # keep NXDOMAIN TTL short
    prefetch 10 1m 10% # prefetch entries when 10% of TTL remains
    serve_stale 15s # serve entries expired for up to 15s while refreshing (covers brief upstream outages)
}
DNS Debugging
# 1. Test basic DNS resolution from a pod
kubectl run dns-test --image=nicolaka/netshoot --rm -it -- bash
# Inside:
nslookup kubernetes.default.svc.cluster.local # should return ClusterIP
nslookup kubernetes # should expand via search domain
dig @10.96.0.10 kubernetes.default.svc.cluster.local # explicit query to CoreDNS
dig kubernetes.default.svc.cluster.local +search +ndots=5
# 2. Check CoreDNS pod logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Enable query logging (verbose — use only temporarily):
# Add to Corefile: log .
# 3. Check CoreDNS metrics
# The upstream CoreDNS image typically contains only the coredns binary (no shell, no curl),
# so use a port-forward instead of kubectl exec
kubectl port-forward -n kube-system deploy/coredns 9153:9153 8080:8080 8181:8181 &
curl -s localhost:9153/metrics | grep -E "coredns_dns_request|coredns_cache"
# 4. CoreDNS health/ready check (through the same port-forward)
curl -s localhost:8080/health
curl -s localhost:8181/ready
# 5. Test external resolution
kubectl run dns-test --image=nicolaka/netshoot --rm -it -- \
dig google.com @10.96.0.10 +short
# 6. Check pod's resolv.conf
kubectl exec my-pod -- cat /etc/resolv.conf
# 7. Measure DNS latency
kubectl run dns-bench --image=nicolaka/netshoot --rm -it -- \
bash -c 'for i in $(seq 50); do time nslookup nginx.default; done 2>&1 | grep real'
# 8. Check if NodeLocal DNSCache is active
kubectl get pods -n kube-system -l k8s-app=node-local-dns -o wide
kubectl exec my-pod -- cat /etc/resolv.conf
# IPVS mode (kubelet clusterDNS changed): shows 169.254.20.10 as nameserver
# iptables mode: still shows the kube-dns ClusterIP, which node-local-dns answers on the node
Key CoreDNS Metrics
| Metric | Type | Alert Threshold |
|---|---|---|
| coredns_dns_requests_total | counter | Rate growth > 3× baseline |
| coredns_dns_responses_total{rcode="SERVFAIL"} | counter | > 1% of total responses |
| coredns_dns_responses_total{rcode="NXDOMAIN"} | counter | High rate → ndots:5 waste or misconfigured apps |
| coredns_dns_request_duration_seconds | histogram | p99 > 500ms → CoreDNS overloaded or upstream slow |
| coredns_cache_hits_total | counter | Hit ratio = hits / (hits + misses); < 50% → increase cache TTL |
| coredns_cache_misses_total | counter | Informational; high = low cache hit rate |
| coredns_panics_total | counter | > 0 → immediate investigation |
| coredns_kubernetes_dns_programming_duration_seconds | histogram | p99 > 5s → API server pressure |
| process_open_fds | gauge | > 80% of ulimit → file descriptor exhaustion |
Alerting Rules
groups:
- name: coredns
  rules:
  - alert: CoreDNSHighErrorRate
    # sum() on both sides: the two metrics carry different label sets,
    # so an unaggregated division would match no series
    expr: |
      sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
        / sum(rate(coredns_dns_requests_total[5m])) > 0.01
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "CoreDNS SERVFAIL rate >1%: DNS resolution failures in cluster"
  - alert: CoreDNSHighLatency
    expr: histogram_quantile(0.99, sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m]))) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CoreDNS p99 latency >500ms: DNS slow for cluster workloads"
  - alert: CoreDNSPanic
    expr: increase(coredns_panics_total[5m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "CoreDNS panic detected: immediate investigation required"
  - alert: CoreDNSLowCacheHitRate
    expr: |
      sum(rate(coredns_cache_hits_total[10m]))
        / (sum(rate(coredns_cache_hits_total[10m])) + sum(rate(coredns_cache_misses_total[10m]))) < 0.5
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "CoreDNS cache hit rate <50%: consider increasing cache TTL"
Troubleshooting Runbooks
Runbook 1: Pod cannot resolve service names (DNS broken)
# 1. Verify CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# 2. Test DNS from a debug pod
kubectl run dns-test --image=nicolaka/netshoot --rm -it -- nslookup kubernetes
# Expected: Server 10.96.0.10, Address: 10.96.0.1
# 3. If nslookup fails — check if CoreDNS ClusterIP is reachable
kubectl run dns-test --image=nicolaka/netshoot --rm -it -- \
nc -vz 10.96.0.10 53
# If timeout → kube-proxy issue (check 03-kube-proxy-internals.html)
# 4. Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --since=5m | grep -i "error\|refused\|failed"
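# 5. Verify the Corefile parses (the coredns binary does not document a standalone validate flag)
kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}'
# A syntax error shows up in the CoreDNS logs when the reload plugin picks up the change
kubectl logs -n kube-system -l k8s-app=kube-dns --since=10m | grep -i "corefile\|reload\|plugin/"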
# 6. Check CoreDNS resource limits (OOMKilled?)
kubectl describe pod -n kube-system -l k8s-app=kube-dns | grep -A5 "Last State\|OOM\|Limits"
# 7. Restart CoreDNS (last resort)
kubectl rollout restart deployment -n kube-system coredns
Runbook 2: DNS lookup slow (intermittent 5s timeouts)
# Root cause: UDP conntrack collision (classic Kubernetes DNS bug)
# Fix: deploy NodeLocal DNSCache (see above)
# Diagnose: check if timeouts correlate with conntrack drops
kubectl debug node/worker-1 -it --image=ubuntu -- bash
cat /proc/net/stat/nf_conntrack | head -2 # look for the insert_failed column (hexadecimal, one row per CPU)
# insert_failed > 0 and growing = conntrack collision confirmed
# If the kernel log also reports "nf_conntrack: table full", raise the table size.
# This needs root on the node (SSH or a privileged debug pod) and does not fix the race itself:
sysctl -w net.netfilter.nf_conntrack_max=1048576
echo "net.netfilter.nf_conntrack_max=1048576" >> /etc/sysctl.d/99-coredns.conf
# Long-term fix 1: NodeLocal DNSCache (preferred)
# Long-term fix 2: make pod resolvers avoid the race via dnsConfig options:
#   single-request-reopen (separate sockets for A and AAAA) or use-vc (DNS over TCP, adds latency)
#   Both are glibc options; musl-based images such as Alpine ignore them
# Also check: are pods using ndots:5 for external lookups?
kubectl exec my-pod -- cat /etc/resolv.conf
# If yes, add ndots:2 via dnsConfig or deploy autopath plugin
Runbook 3: External DNS not resolving (SERVFAIL for external names)
# 1. Test directly
kubectl run dns-test --image=nicolaka/netshoot --rm -it -- \
dig google.com @10.96.0.10
# 2. Check CoreDNS forward plugin config
kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | grep -A3 forward
# 3. Verify CoreDNS can reach upstream from its node (netshoot ships dig; the ubuntu image does not)
kubectl debug node/$(kubectl get pod -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[0].spec.nodeName}') \
  -it --image=nicolaka/netshoot -- dig google.com @8.8.8.8
# 4. If using /etc/resolv.conf as upstream: check node resolv.conf
# (node debug pods mount the host filesystem under /host)
kubectl debug node/worker-1 -it --image=ubuntu -- cat /host/etc/resolv.conf
# 5. NetworkPolicy blocking CoreDNS egress?
kubectl get networkpolicies -n kube-system
# Ensure kube-system allows egress to upstream DNS (port 53)
# 6. Check CoreDNS loop plugin (crash loop if forwarding to itself)
# Add --previous if the container has already restarted
kubectl logs -n kube-system coredns-abc123 | grep "Loop"
# "Loop ... detected" → CoreDNS is forwarding to itself; check node resolv.conf
# (often a local stub resolver address such as 127.0.0.53)
# Fix: explicitly set forward target: forward . 8.8.8.8
Runbook 4: CoreDNS high CPU / memory (overloaded)
# 1. Check CoreDNS QPS
# (the CoreDNS image typically has no curl, so go through a port-forward)
kubectl port-forward -n kube-system deploy/coredns 9153:9153 &
curl -s localhost:9153/metrics | grep "coredns_dns_requests_total"
# rate() in Prometheus:
# rate(coredns_dns_requests_total[1m]) — queries per second
# 2. Identify top query patterns (enable log plugin temporarily)
kubectl edit cm -n kube-system coredns
# Add under .:53 block: log .
# Watch: kubectl logs -n kube-system -l k8s-app=kube-dns -f | head -200
# 3. Check for NXDOMAIN storms (misconfigured apps hammering CoreDNS)
# (needs a port-forward: kubectl port-forward -n kube-system deploy/coredns 9153:9153 &)
curl -s localhost:9153/metrics | grep 'rcode="NXDOMAIN"'
# 4. Scale CoreDNS horizontally
kubectl scale deployment coredns -n kube-system --replicas=6
# 5. Increase cache TTL to reduce upstream queries
kubectl edit cm -n kube-system coredns
# Change: cache 30 → cache 120
# 6. Enable NodeLocal DNSCache to offload CoreDNS
# See NodeLocal DNSCache section above
Production Best Practices
Always Run NodeLocal DNSCache
- Eliminates the conntrack UDP race (5s timeout bug)
- Sub-millisecond cache hits for popular records
- Reduces CoreDNS pod load by 60-80%
- Use force_tcp for cache misses to CoreDNS
Scale CoreDNS Proportionally
- Use cluster-proportional-autoscaler (not HPA)
- 1 replica per 16 nodes or 256 cores as baseline
- Spread across failure domains with anti-affinity
- Set resource limits to prevent OOMKill on traffic spikes
Reduce ndots Overhead
- Use trailing dots for external FQDNs in app config
- Set ndots: 2 for pods making many external calls
- Enable autopath @kubernetes for server-side search
- Use fully-qualified service names where latency matters
Security
- Enable NetworkPolicy to restrict who can query CoreDNS
- Use DNS-over-TLS for upstream forwarding
- Avoid pods insecure if pod spoofing is a concern
- Monitor for DNS exfiltration (high TXT/NULL query rate)
- Use Cilium L7 DNS policy to restrict FQDN access per pod