DaemonSets
Controller Mechanics
A DaemonSet ensures exactly one pod runs on every eligible node in the cluster. When a new node joins, the controller automatically creates a pod on it. When a node is removed, the pod is garbage-collected. There is no replicas field — the replica count is implicitly the number of eligible nodes.
Full DaemonSet Spec
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter              # IMMUTABLE after creation
  # ── Update strategy ────────────────────────────────────────────
  updateStrategy:
    type: RollingUpdate               # RollingUpdate (default) | OnDelete
    rollingUpdate:
      maxUnavailable: 1               # default 1; max pods down at once during update
                                      # absolute or percentage of desired count
      maxSurge: 0                     # 1.22+: allow temporary extra pod per node during update
                                      # default 0; set to 1 together with maxUnavailable: 0
                                      # for zero-downtime agent updates
  # ── Revision history ───────────────────────────────────────────
  revisionHistoryLimit: 10
  # ── Min ready ──────────────────────────────────────────────────
  minReadySeconds: 0
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      # ── Node targeting ─────────────────────────────────────────
      nodeSelector:
        kubernetes.io/os: linux       # only run on Linux nodes (skip Windows)
      # Fine-grained with nodeAffinity:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values: [linux]
              - key: node.kubernetes.io/instance-type
                operator: NotIn
                values: [t3.nano, t3.micro]   # skip under-resourced nodes
      # ── Tolerations ────────────────────────────────────────────
      tolerations:
      # Run on control-plane nodes (not added by default)
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      # Keep running on not-ready / unreachable nodes. The DaemonSet controller
      # injects these two automatically with no tolerationSeconds (tolerated
      # indefinitely) and overwrites any tolerationSeconds set here.
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
      # ── Host access ────────────────────────────────────────────
      hostNetwork: false              # true for CNI plugins, network agents
      hostPID: false                  # true for process-level monitoring (e.g., eBPF agents)
      hostIPC: false
      # ── Priority ───────────────────────────────────────────────
      priorityClassName: system-node-critical   # highest built-in priority; can preempt lower-priority pods
      # ── Service account ────────────────────────────────────────
      serviceAccountName: node-exporter-sa
      # ── Security ───────────────────────────────────────────────
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534              # nobody
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.7.0
        args:
        - --path.rootfs=/host
        - --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run)($|/)
        ports:
        - name: metrics
          containerPort: 9100
          hostPort: 9100              # bind directly to node port (optional; use Service instead)
        resources:
          requests:
            cpu: "50m"
            memory: "64Mi"
          limits:
            memory: "128Mi"
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: rootfs
          mountPath: /host
          readOnly: true
          mountPropagation: HostToContainer
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      # ── Termination ────────────────────────────────────────────
      terminationGracePeriodSeconds: 30
How DaemonSet Scheduling Works
DaemonSet pods are placed by the default kube-scheduler (the default behavior since Kubernetes 1.12, GA in 1.17). For each eligible node, the DaemonSet controller creates a pod whose required node affinity is a single matchFields term on metadata.name that pins it to that node, and the scheduler then binds it. Combined with the tolerations the controller injects automatically, DaemonSet pods are still placed on, or kept running on, nodes that are:
- Unschedulable (kubectl cordon): cordoning blocks ordinary pods, but DaemonSet pods tolerate node.kubernetes.io/unschedulable:NoSchedule
- Under disk, memory, or PID pressure: the matching NoSchedule taints are tolerated
- Not-ready or unreachable: running DaemonSet pods tolerate these NoExecute taints indefinitely and are not evicted
Because the scheduler performs the placement, the normal resource-fit checks apply: if a node lacks allocatable CPU or memory for the pod's requests, the DaemonSet pod stays Pending on that node. A high priority class lets the scheduler preempt lower-priority pods to make room and makes the pod a late candidate for kubelet node-pressure eviction. Always set resource requests conservatively on DaemonSet pods, and use priorityClassName: system-node-critical for essential infrastructure agents.
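Since Kubernetes 1.12 the DaemonSet controller does not set spec.nodeName itself; it pins each pod to its node with a required node affinity term and lets kube-scheduler bind it. The Python sketch below is a simplified model of that rewrite, using plain dicts shaped like the Pod API JSON. The function name pin_to_node is hypothetical, and the real controller (Go code in kube-controller-manager) also adds labels and tolerations that are omitted here.

```python
import copy
from typing import Any, Dict

PodSpec = Dict[str, Any]


def pin_to_node(pod_spec: PodSpec, node_name: str) -> PodSpec:
    """Return a copy of pod_spec whose required node affinity selects only node_name.

    Simplified model: the required nodeSelectorTerms are replaced (the controller
    has already checked that the node satisfies them); nodeSelector and any
    preferred affinity terms are left as they are.
    """
    spec = copy.deepcopy(pod_spec)  # never mutate the DaemonSet's pod template
    pin_term = {
        "matchFields": [
            {"key": "metadata.name", "operator": "In", "values": [node_name]}
        ]
    }
    node_affinity = spec.setdefault("affinity", {}).setdefault("nodeAffinity", {})
    node_affinity["requiredDuringSchedulingIgnoredDuringExecution"] = {
        "nodeSelectorTerms": [pin_term]
    }
    return spec


template_spec: PodSpec = {
    "nodeSelector": {"kubernetes.io/os": "linux"},
    "affinity": {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {"matchExpressions": [
                        {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]}
                    ]}
                ]
            }
        }
    },
}

pod_spec = pin_to_node(template_spec, "worker-1")
terms = pod_spec["affinity"]["nodeAffinity"][
    "requiredDuringSchedulingIgnoredDuringExecution"]["nodeSelectorTerms"]
print(terms)
```

Because the pinned term is an ordinary scheduling constraint, the scheduler's resource-fit and taint checks still run, just against a single candidate node.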
Node Targeting
nodeSelector (Simple)
# Run only on GPU nodes
spec:
  template:
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-t4
# Run only on Linux (important in mixed Windows/Linux clusters)
nodeSelector:
  kubernetes.io/os: linux
# Run only on nodes in a specific availability zone
nodeSelector:
  topology.kubernetes.io/zone: us-east-1a
nodeAffinity (Complex Expressions)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # Must be Linux
        - key: kubernetes.io/os
          operator: In
          values: [linux]
        # Must NOT be a spot/preemptible node (for critical monitoring agents)
        - key: cloud.google.com/gke-preemptible
          operator: DoesNotExist
    # Preferred terms do not change where a DaemonSet runs: it still gets a pod
    # on every node that satisfies the required terms above
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node-role
          operator: In
          values: [worker]
Label-Based Opt-In/Opt-Out
# Opt-in: only run on nodes with a specific label
nodeSelector:
  monitoring: "enabled"
# Add the label to specific nodes:
kubectl label node worker-1 monitoring=enabled
kubectl label node worker-2 monitoring=enabled
# Remove to stop DaemonSet pod on a node:
kubectl label node worker-1 monitoring- # removes the label → pod deleted
# Opt-out: run on all nodes EXCEPT those with a specific label
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: exclude-monitoring
          operator: DoesNotExist      # run on nodes that do NOT have this label
# Add label to exclude a node:
kubectl label node worker-3 exclude-monitoring=true # → pod deleted from worker-3
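Both patterns depend on how node labels are evaluated. The Python sketch below is a simplified evaluator for nodeSelector plus the In, NotIn, Exists, and DoesNotExist operators of required node affinity: nodeSelector and affinity must both match, terms are ORed, and expressions inside one term are ANDed. Gt/Lt and matchFields are deliberately left out, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

Labels = Dict[str, str]


def expression_matches(labels: Labels, expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # NotIn also matches nodes that do not carry the key at all
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not covered by this sketch: {op}")


def node_is_eligible(labels: Labels,
                     node_selector: Optional[Labels] = None,
                     required_terms: Optional[List[dict]] = None) -> bool:
    """nodeSelector AND (term1 OR term2 ...); expressions within a term are ANDed."""
    for key, value in (node_selector or {}).items():
        if labels.get(key) != value:
            return False
    if required_terms is None:  # no required node affinity at all
        return True
    return any(
        bool(term.get("matchExpressions"))  # an empty term matches no node
        and all(expression_matches(labels, e) for e in term["matchExpressions"])
        for term in required_terms
    )


opt_in = {"monitoring": "enabled"}
opt_out = [{"matchExpressions": [{"key": "exclude-monitoring", "operator": "DoesNotExist"}]}]

print(node_is_eligible({"monitoring": "enabled"}, node_selector=opt_in))       # True
print(node_is_eligible({}, node_selector=opt_in))                              # False
print(node_is_eligible({"kubernetes.io/os": "linux"}, required_terms=opt_out)) # True
print(node_is_eligible({"exclude-monitoring": "true"}, required_terms=opt_out))  # False
```

When a label change flips a node from eligible to ineligible under these rules, the DaemonSet controller deletes the pod already running there.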
Tolerations
Default Tolerations Injected by DaemonSet Controller
The DaemonSet controller automatically injects several tolerations into every DaemonSet pod to ensure agent availability during node problems:
| Auto-Injected Toleration (key:effect) | Behavior | Purpose |
|---|---|---|
| node.kubernetes.io/not-ready:NoExecute | Tolerated indefinitely (no tolerationSeconds) | Pod is not evicted while the node is NotReady (e.g., a network partition) |
| node.kubernetes.io/unreachable:NoExecute | Tolerated indefinitely (no tolerationSeconds) | Pod is not evicted while the node is unreachable |
| node.kubernetes.io/disk-pressure:NoSchedule | Tolerated | Agents still run on disk-pressured nodes |
| node.kubernetes.io/memory-pressure:NoSchedule | Tolerated | Agents still run on memory-pressured nodes |
| node.kubernetes.io/pid-pressure:NoSchedule | Tolerated | Agents still run on nodes under PID pressure |
| node.kubernetes.io/unschedulable:NoSchedule | Tolerated | DaemonSet pods are still created on cordoned nodes |
| node.kubernetes.io/network-unavailable:NoSchedule | Tolerated (added only when hostNetwork: true) | CNI plugin DaemonSets can run before the pod network is ready |
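The injection works as an add-or-replace over the pod's toleration list. In the Python sketch below, a user toleration with the same key, operator, value, and effect is replaced by the controller's version, which carries no tolerationSeconds; the network-unavailable toleration is added only for hostNetwork pods. This is a simplified model with hypothetical function names, not the controller's actual code.

```python
from typing import Dict, List

Toleration = Dict[str, object]

# Tolerations the DaemonSet controller adds to every DaemonSet pod (all operator: Exists)
DEFAULT_DAEMON_TOLERATIONS = [
    ("node.kubernetes.io/not-ready", "NoExecute"),
    ("node.kubernetes.io/unreachable", "NoExecute"),
    ("node.kubernetes.io/disk-pressure", "NoSchedule"),
    ("node.kubernetes.io/memory-pressure", "NoSchedule"),
    ("node.kubernetes.io/pid-pressure", "NoSchedule"),
    ("node.kubernetes.io/unschedulable", "NoSchedule"),
]


def same_identity(a: Toleration, b: Toleration) -> bool:
    """Key, operator, value and effect must match; tolerationSeconds is not compared."""
    return all(a.get(f, "") == b.get(f, "") for f in ("key", "operator", "value", "effect"))


def add_or_update(tolerations: List[Toleration], new: Toleration) -> List[Toleration]:
    """Replace every toleration with the same identity, otherwise append."""
    result = [dict(new) if same_identity(t, new) else t for t in tolerations]
    if not any(same_identity(t, new) for t in tolerations):
        result.append(dict(new))
    return result


def inject_daemon_tolerations(tolerations: List[Toleration], host_network: bool) -> List[Toleration]:
    pairs = list(DEFAULT_DAEMON_TOLERATIONS)
    if host_network:  # only hostNetwork pods get this one
        pairs.append(("node.kubernetes.io/network-unavailable", "NoSchedule"))
    result = list(tolerations)
    for key, effect in pairs:
        result = add_or_update(result, {"key": key, "operator": "Exists", "effect": effect})
    return result


user = [{"key": "node.kubernetes.io/not-ready", "operator": "Exists",
         "effect": "NoExecute", "tolerationSeconds": 300}]
final = inject_daemon_tolerations(user, host_network=False)
print([t for t in final if t["key"] == "node.kubernetes.io/not-ready"])
```

The printed entry has no tolerationSeconds: the user's 300-second limit was replaced by the controller's indefinite toleration.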
Control-Plane Tolerations
# Control-plane nodes carry a taint that blocks regular pods:
# node-role.kubernetes.io/control-plane:NoSchedule   (kubeadm 1.24+)
# node-role.kubernetes.io/master:NoSchedule          (legacy; kubeadm clusters before 1.25)
# To run a DaemonSet on control-plane nodes (e.g., logging agent, monitoring):
tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule
- key: node-role.kubernetes.io/master         # for compatibility with older clusters
  operator: Exists
  effect: NoSchedule
# Example use cases requiring control-plane coverage:
# - Audit log collector (reads from /var/log/kubernetes/audit.log)
# - etcd backup agent
# - Node-level security scanner
# - Falco (eBPF syscall monitor)
Custom Taint Tolerations
# Node tainted for GPU workloads only:
# kubectl taint node gpu-node-1 dedicated=gpu:NoSchedule
# DaemonSet for GPU metrics (DCGM exporter) — must tolerate the GPU taint:
tolerations:
- key: dedicated
  operator: Equal
  value: gpu
  effect: NoSchedule
# Tolerate ANY taint (run on ALL nodes regardless of taints):
tolerations:
- operator: Exists       # matches any key, value, and effect; use with caution
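Whether a toleration matches a taint follows a short set of rules: an empty effect matches every effect, an empty key (valid only with Exists) matches every key, Exists ignores the value, and Equal (the default operator) requires the value to match. The Python sketch below models those rules to check which taints keep a DaemonSet pod off a node; function names are hypothetical, and PreferNoSchedule taints are ignored because they never block placement.

```python
from typing import Dict, List

Taint = Dict[str, str]
Toleration = Dict[str, str]


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Does one toleration match one taint?"""
    if tol.get("effect") and tol["effect"] != taint["effect"]:
        return False  # an empty effect matches every effect
    if tol.get("key") and tol["key"] != taint["key"]:
        return False  # an empty key (with Exists) matches every key
    operator = tol.get("operator") or "Equal"  # the default operator is Equal
    if operator == "Exists":
        return True
    if operator == "Equal":
        return tol.get("value", "") == taint.get("value", "")
    return False


def blocking_taints(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Taints that keep a DaemonSet pod off the node (PreferNoSchedule never blocks)."""
    return [
        t for t in taints
        if t["effect"] in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
dcgm = [{"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}]
catch_all = [{"operator": "Exists"}]

print(blocking_taints([gpu_taint], []))         # [{'key': 'dedicated', 'value': 'gpu', 'effect': 'NoSchedule'}]
print(blocking_taints([gpu_taint], dcgm))       # []
print(blocking_taints([gpu_taint], catch_all))  # []
```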
Update Strategies
RollingUpdate (Default)
DaemonSet RollingUpdate terminates old pods and starts new ones one node at a time (by default). Unlike Deployment, there is no ReplicaSet intermediary — the controller directly manages the per-node pod replacement.
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1     # default; one node's pod down at a time
                          # absolute: 2 = two nodes simultaneously updated
                          # percentage: "10%" = 10% of nodes simultaneously (rounded up)
    maxSurge: 0           # default; no extra pod during update
                          # maxSurge: 1 (with maxUnavailable: 0) = create new pod BEFORE deleting old
                          # requires node to have capacity for two pods briefly
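When maxUnavailable or maxSurge is a percentage, it is resolved against desiredNumberScheduled and rounded up, so "10%" on a 25-node DaemonSet allows 3 nodes at a time, and any non-zero percentage of a non-empty DaemonSet allows at least one. A small Python sketch of that conversion; scale_int_or_percent is a hypothetical helper, not a Kubernetes client API.

```python
from typing import Union


def scale_int_or_percent(value: Union[int, str], desired: int) -> int:
    """Resolve an int-or-percent rollingUpdate value against desiredNumberScheduled, rounding up."""
    if isinstance(value, bool):
        raise TypeError("expected an int or a percentage string")
    if isinstance(value, int):
        return value
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        # integer ceiling division avoids floating-point surprises
        return -(-percent * desired // 100)
    raise ValueError(f"unsupported value: {value!r}")


for desired in (5, 10, 25, 100):
    print(desired, scale_int_or_percent("10%", desired))
```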
maxSurge for Zero-Downtime Agent Updates
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0     # never remove old pod before new is ready
    maxSurge: 1           # create new pod alongside old; delete old after new is available
# Both old and new pod run on same node briefly (double the per-node cost)
# Requires node to have capacity for 2× the pod's resource requests temporarily
# Does not work with hostPort: the new pod cannot bind a port the old pod still holds
# Essential for: network agents (brief coverage gap is unacceptable),
#                security scanners (must not miss any window)
OnDelete Strategy
updateStrategy:
  type: OnDelete
# Pods are only updated when manually deleted
# Use for: CNI plugins (network disruption during update must be controlled),
#          critical security agents (manual per-node validation required)
# Workflow:
# 1. Update DaemonSet spec (kubectl set image or kubectl apply)
# 2. Find the pod on the target node (DaemonSet pod names are generated, not
#    derived from the node name), then delete it to trigger the update:
kubectl get pods -n logging -o wide --field-selector spec.nodeName=worker-1
kubectl delete pod <fluentd-pod-name> -n logging
# 3. Verify new pod is healthy before proceeding to next node:
kubectl get pods -n logging -o wide --field-selector spec.nodeName=worker-1
# 4. Continue node by node
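With OnDelete, the practical question is which nodes still run a pod built from an older template. DaemonSet pods carry a controller-revision-hash label, so comparing it with the current revision's hash identifies them. The Python sketch below works on the parsed output of kubectl get pods -o json; obtaining the current hash is left as an input (one option is the controller-revision-hash label on the newest ControllerRevision). The function name and the sample pod names and hashes are hypothetical.

```python
import json
from typing import Dict


def pods_on_old_revision(pod_list_json: str, current_hash: str) -> Dict[str, str]:
    """Map node name -> pod name for DaemonSet pods not built from the current template.

    Input is `kubectl get pods -n NS -l <daemonset selector> -o json`; with OnDelete
    there is one pod per node, so a node -> pod mapping is enough.
    """
    stale: Dict[str, str] = {}
    for pod in json.loads(pod_list_json)["items"]:
        labels = pod["metadata"].get("labels", {})
        if labels.get("controller-revision-hash") != current_hash:
            node = pod["spec"].get("nodeName", "<not scheduled>")
            stale[node] = pod["metadata"]["name"]
    return dict(sorted(stale.items()))


# Hypothetical sample: two fluentd pods, one still on an old revision
sample = json.dumps({"items": [
    {"metadata": {"name": "fluentd-a1b2c", "labels": {"controller-revision-hash": "6d9f7c8b4"}},
     "spec": {"nodeName": "worker-1"}},
    {"metadata": {"name": "fluentd-d3e4f", "labels": {"controller-revision-hash": "5c8d6f7a9"}},
     "spec": {"nodeName": "worker-2"}},
]})
print(pods_on_old_revision(sample, current_hash="6d9f7c8b4"))  # {'worker-2': 'fluentd-d3e4f'}
```

Each entry in the result corresponds to one manual delete in step 2 of the workflow.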
Host Namespaces
Infrastructure agents often need privileged access to the node. The following patterns cover the most common host-access requirements while keeping the security surface as narrow as possible.
hostNetwork for Network Agents
# CNI plugins, network monitoring agents
spec:
  template:
    spec:
      hostNetwork: true                    # pod uses node's network namespace
      dnsPolicy: ClusterFirstWithHostNet   # REQUIRED with hostNetwork to still resolve cluster DNS
# Effect: pod sees all node interfaces (eth0, lo, tunnel interfaces)
# Pod IP is the node IP (not a pod CIDR IP)
# Port conflicts: if node already uses port 9100, the pod will fail to bind
hostPID for Process-Level Agents
# eBPF-based tracing, process monitoring (Falco, Pixie, Tetragon)
spec:
  template:
    spec:
      hostPID: true                       # pod sees all processes on the node via /proc
      containers:
      - name: falco
        image: falcosecurity/falco:0.37.0
        securityContext:
          privileged: true                # required for kernel module / eBPF loading
        volumeMounts:
        - name: dev
          mountPath: /dev
        - name: proc
          mountPath: /host/proc
          readOnly: true
      volumes:
      - name: dev
        hostPath: {path: /dev}
      - name: proc
        hostPath: {path: /proc}
hostPath Volume Patterns
# Log collection (Fluentd, Filebeat, Vector)
volumes:
- name: varlog
  hostPath:
    path: /var/log
- name: varlibdockercontainers
  hostPath:
    path: /var/lib/docker/containers   # Docker runtime (dockershim, removed in 1.24): the
                                       # log files under /var/log/pods are symlinks into here
# containerd / CRI-O: container logs are written directly under /var/log/pods
- name: containerd-logs
  hostPath:
    path: /var/log/pods
# Node filesystem inspection (node-exporter, security scanners)
- name: rootfs
  hostPath:
    path: /
    type: Directory   # "" (no check) | DirectoryOrCreate | Directory | FileOrCreate | File | Socket | CharDevice | BlockDevice
hostPort
hostPort binds a container port directly to the node's network interface on the same port number. The port is reachable at NODE_IP:PORT without needing a Service. It is an alternative to using hostNetwork: true when only specific ports need to be exposed.
containers:
- name: node-exporter
  ports:
  - containerPort: 9100
    hostPort: 9100          # bind to node IP:9100
    protocol: TCP
# Prometheus scrape config targeting node IPs directly:
# - targets: ['node-1:9100', 'node-2:9100', ...]
# OR use a Service of type ClusterIP with targetPort: 9100 (preferred)
hostPort reserves the port on the node. Only one pod can use a given hostPort on each node, and the scheduler enforces this. For a DaemonSet that runs one pod per node this is usually fine, but it rules out maxSurge rolling updates (the surge pod cannot get the port while the old pod holds it), and any application pod requesting the same hostPort will conflict with the DaemonSet pod on every node. Prefer a ClusterIP Service to expose DaemonSet pod metrics/APIs rather than hostPort.
Real-World DaemonSet Examples
Prometheus node-exporter
# Minimal production-grade node-exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      hostNetwork: false
      hostPID: false
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
      - operator: Exists              # run on all nodes including control-plane
      priorityClassName: system-cluster-critical
      serviceAccountName: node-exporter
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.7.0
        args: ["--path.rootfs=/host", "--path.procfs=/host/proc", "--path.sysfs=/host/sys"]
        ports:
        - containerPort: 9100
        resources:
          requests: {cpu: 50m, memory: 64Mi}
          limits: {memory: 128Mi}
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities: {drop: [ALL]}
        volumeMounts:
        - {name: root, mountPath: /host, readOnly: true, mountPropagation: HostToContainer}
        - {name: proc, mountPath: /host/proc, readOnly: true}
        - {name: sys, mountPath: /host/sys, readOnly: true}
      volumes:
      - {name: root, hostPath: {path: /}}
      - {name: proc, hostPath: {path: /proc}}
      - {name: sys, hostPath: {path: /sys}}
Fluentd Log Collector
containers:
- name: fluentd
  image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8-1
  env:
  - name: FLUENT_ELASTICSEARCH_HOST
    value: elasticsearch.logging.svc.cluster.local
  - name: FLUENT_ELASTICSEARCH_PORT
    value: "9200"
  - name: K8S_NODE_NAME               # inject node name for log enrichment
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  resources:
    requests: {cpu: 100m, memory: 200Mi}
    limits: {memory: 500Mi}
  volumeMounts:
  - name: varlog
    mountPath: /var/log
  - name: pods-logs
    mountPath: /var/log/pods
    readOnly: true
  - name: fluentd-config
    mountPath: /fluentd/etc/fluent.conf
    subPath: fluent.conf
volumes:
- {name: varlog, hostPath: {path: /var/log}}
- {name: pods-logs, hostPath: {path: /var/log/pods}}
- {name: fluentd-config, configMap: {name: fluentd-config}}
CNI Plugin (Calico node)
# CNI plugins require hostNetwork + privileged + control-plane tolerations
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      tolerations:
      - operator: Exists                  # run on every node including control-plane
      priorityClassName: system-node-critical
      initContainers:
      - name: install-cni
        image: calico/cni:v3.27.0
        command: ["/opt/cni/bin/install"]
        volumeMounts:
        # Host directories are mounted under /host so they do not hide the
        # image's own /opt/cni/bin/install binary
        - name: cni-bin-dir
          mountPath: /host/opt/cni/bin
        - name: cni-net-dir
          mountPath: /host/etc/cni/net.d
      containers:
      - name: calico-node
        image: calico/node:v3.27.0
        securityContext:
          privileged: true                # required: manages iptables, routes, network interfaces
        env:
        - name: NODENAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: lib-modules
          mountPath: /lib/modules
          readOnly: true
        - name: var-run-calico
          mountPath: /var/run/calico
        - name: cni-bin-dir
          mountPath: /opt/cni/bin
      volumes:
      - {name: lib-modules, hostPath: {path: /lib/modules}}
      - {name: var-run-calico, hostPath: {path: /var/run/calico}}
      - {name: cni-bin-dir, hostPath: {path: /opt/cni/bin}}
      - {name: cni-net-dir, hostPath: {path: /etc/cni/net.d}}
CSI Node Plugin
# CSI node plugins run as DaemonSets to provide node-local volume operations
# (NodeStageVolume, NodePublishVolume, NodeGetVolumeStats)
containers:
- name: ebs-plugin
  image: public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.28.0
  args: ["node", "--endpoint=$(CSI_ENDPOINT)", "--logtostderr", "--v=5"]
  env:
  - name: CSI_ENDPOINT
    value: unix:///var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock
  securityContext:
    privileged: true                  # required: mount/unmount block devices, create device nodes
  volumeMounts:
  - name: kubelet-dir
    mountPath: /var/lib/kubelet
    mountPropagation: Bidirectional   # propagate mounts back to host
  - name: plugin-dir
    mountPath: /var/lib/kubelet/plugins/ebs.csi.aws.com/
  - name: device-dir
    mountPath: /dev
- name: node-driver-registrar
  image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.0
  args:
  - --csi-address=$(ADDRESS)
  - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
  env:                                # $(VAR) references in args need matching env vars
  - name: ADDRESS
    value: /csi/csi.sock              # driver socket as seen inside this container
  - name: DRIVER_REG_SOCK_PATH
    value: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock   # same socket as seen by kubelet
  volumeMounts:
  - name: plugin-dir
    mountPath: /csi
  - name: registration-dir
    mountPath: /registration          # kubelet plugin registration directory
volumes:
- {name: kubelet-dir, hostPath: {path: /var/lib/kubelet, type: Directory}}
- {name: plugin-dir, hostPath: {path: /var/lib/kubelet/plugins/ebs.csi.aws.com/, type: DirectoryOrCreate}}
- {name: device-dir, hostPath: {path: /dev, type: Directory}}
- {name: registration-dir, hostPath: {path: /var/lib/kubelet/plugins_registry/, type: Directory}}
Node Join and Leave Lifecycle
Without --ignore-daemonsets, kubectl drain exits with an error if any DaemonSet-managed pods are on the node (which is almost always the case). Always include this flag during node maintenance. The flag makes drain skip DaemonSet pods: they keep running throughout the drain and after cordoning, because they tolerate the unschedulable taint. They are removed only when the node is shut down or deleted, or when the DaemonSet itself is updated or deleted. When a replacement node joins the cluster, the controller creates DaemonSet pods on it automatically.
Resource Sizing for DaemonSet Pods
DaemonSet pods consume resources from every node they run on. Over-provisioning DaemonSet requests reduces the available allocatable resources for application pods cluster-wide. Under-provisioning causes eviction during node pressure.
# Node allocatable = node capacity - kube-reserved - system-reserved - eviction threshold
# DaemonSet pods consume from allocatable on EVERY node
# Example: 100-node cluster, node-exporter requests 50m CPU / 64Mi memory
# Total cluster-wide cost: 100 × (50m CPU + 64Mi) = 5000m (5 vCPU) CPU + 6400Mi (6.25Gi) memory
# This is a "hidden tax" that must be accounted for in capacity planning
# Resource sizing guidelines for common agents:
# node-exporter: 50m CPU / 64Mi memory (minimal, read-only host metrics)
# fluentd: 100m CPU / 200Mi (scales with log volume; add VPA)
# calico-node: 100m CPU / 256Mi (network data path; critical path)
# datadog-agent: 200m CPU / 512Mi (full observability; significant cost)
# falco: 100m CPU / 512Mi (eBPF kernel overhead varies)
# CSI node plugin: 50m CPU / 128Mi (per-node volume operations)
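The per-node figures multiply quickly, and mixing binary memory units (1Gi = 1024Mi) is an easy place to slip. A short Python sketch that totals one DaemonSet's requests across a cluster; the quantity parser is an assumption that only handles the m, Mi, and Gi suffixes used on this page, not the full Kubernetes quantity grammar.

```python
from fractions import Fraction
from typing import Tuple


def parse_cpu(quantity: str) -> Fraction:
    """CPU quantity in cores: '50m' -> 1/20, '2' -> 2."""
    if quantity.endswith("m"):
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def parse_memory_mi(quantity: str) -> Fraction:
    """Memory quantity in MiB; only the Mi and Gi suffixes are handled in this sketch."""
    if quantity.endswith("Gi"):
        return Fraction(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return Fraction(quantity[:-2])
    raise ValueError(f"unsupported memory quantity: {quantity}")


def daemonset_tax(nodes: int, cpu: str, memory: str) -> Tuple[Fraction, Fraction, Fraction]:
    """Cluster-wide requests for one DaemonSet: (cores, MiB, GiB)."""
    total_cores = parse_cpu(cpu) * nodes
    total_mi = parse_memory_mi(memory) * nodes
    return total_cores, total_mi, total_mi / 1024


for cpu, mem in (("50m", "64Mi"), ("200m", "512Mi")):
    cores, mi, gi = daemonset_tax(100, cpu, mem)
    print(f"100 nodes x {cpu}/{mem}: {float(cores)} cores, {float(mi)}Mi ({float(gi)}Gi)")
```

Exact fractions keep the totals free of rounding error; convert to float only for display.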
DaemonSet vs Deployment vs Static Pods
| Aspect | DaemonSet | Deployment | Static Pod |
|---|---|---|---|
| One per node | Yes (automatic) | No (replica count) | Yes (manual per node) |
| Managed by | DaemonSet controller | Deployment controller | kubelet directly |
| API server required | Yes | Yes | No (kubelet reads local file) |
| Auto new-node coverage | Yes | No | No |
| Rolling update | Yes (RollingUpdate/OnDelete) | Yes (RollingUpdate/Recreate) | No (manual file update per node) |
| kubectl visibility | Yes (appears in kubectl get pods) | Yes | Yes (mirror pod in API) |
| Survives API server outage | Partially (running pods keep running; no new pods or updates until the API is back) | Partially (same as DaemonSet) | Yes (kubelet manages locally) |
| Use case | Node-level infrastructure agents | Stateless applications | Mainly control-plane components (etcd, kube-apiserver) |
Operational Commands
# Check DaemonSet rollout status
kubectl rollout status ds/node-exporter -n monitoring
# Watch pod replacement during RollingUpdate
kubectl get pods -n monitoring -l app=node-exporter -o wide -w
# Check desired vs ready vs available counts
kubectl get ds node-exporter -n monitoring
# NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR
# node-exporter 10 10 9 8 9 kubernetes.io/os=linux
# Force rollout restart (without spec change)
kubectl rollout restart ds/node-exporter -n monitoring
# Check which nodes have/don't have the DaemonSet pod
kubectl get nodes -o wide
kubectl get pods -n monitoring -l app=node-exporter -o wide
# Compare: any node without a pod = targeting mismatch or toleration missing
# Update image
kubectl set image ds/node-exporter node-exporter=prom/node-exporter:v1.8.0 -n monitoring
# Rollback
kubectl rollout undo ds/node-exporter -n monitoring
# Get ControllerRevision history
kubectl get controllerrevision -n monitoring -l app=node-exporter
# Describe to see events
kubectl describe ds node-exporter -n monitoring
# Check DaemonSet status fields
kubectl get ds node-exporter -n monitoring -o jsonpath='{.status}' | jq .
# {
#   "currentNumberScheduled": 10,   # nodes running a daemon pod that should run one
#   "desiredNumberScheduled": 10,   # nodes that should have a pod
#   "numberAvailable": 9,           # pods Ready for at least minReadySeconds
#   "numberMisscheduled": 0,        # nodes running a daemon pod that should not
#   "numberReady": 9,               # pods passing readiness
#   "numberUnavailable": 1,         # nodes that should run a pod but have none available
#   "updatedNumberScheduled": 8     # nodes running a pod from the latest template
# }
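The same status fields lend themselves to a scripted health check. The Python sketch below reads the JSON printed by kubectl get ds NAME -n NS -o json and reports the conditions the alerting rules look for, without their for: durations; treating any difference as a problem is a simplifying assumption, and the function name is hypothetical.

```python
import json
from typing import List


def daemonset_problems(ds_json: str) -> List[str]:
    """Summarize DaemonSet status fields from `kubectl get ds NAME -n NS -o json`."""
    status = json.loads(ds_json).get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    problems = []
    if status.get("currentNumberScheduled", 0) != desired:
        problems.append("not fully scheduled")
    if status.get("updatedNumberScheduled", 0) != desired:
        problems.append("rollout incomplete")
    if status.get("numberMisscheduled", 0) > 0:
        problems.append("pods on ineligible nodes")
    unavailable = status.get("numberUnavailable", 0)  # may be omitted when zero
    if unavailable > 0:
        problems.append(f"{unavailable} unavailable")
    return problems


example = json.dumps({"status": {
    "currentNumberScheduled": 10, "desiredNumberScheduled": 10,
    "numberAvailable": 9, "numberMisscheduled": 0, "numberReady": 9,
    "numberUnavailable": 1, "updatedNumberScheduled": 8,
}})
print(daemonset_problems(example))  # ['rollout incomplete', '1 unavailable']
```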
Metrics, Alerts, and Runbooks
Key Metrics
| Metric | Source | Alert Condition |
|---|---|---|
| kube_daemonset_status_desired_number_scheduled | kube-state-metrics | Baseline: total eligible nodes |
| kube_daemonset_status_number_ready | kube-state-metrics | < desired for > 5m |
| kube_daemonset_status_number_misscheduled | kube-state-metrics | > 0 (pods on ineligible nodes, e.g. after a new untolerated NoSchedule taint) |
| kube_daemonset_status_updated_number_scheduled | kube-state-metrics | < desired for > 30m → rollout stalled |
| kube_daemonset_status_number_unavailable | kube-state-metrics | > 0 for > 10m (node unreachable or pod crash-looping) |
Alerting Rules
groups:
- name: daemonset-health
  rules:
  - alert: DaemonSetNotFullyScheduled
    expr: |
      kube_daemonset_status_desired_number_scheduled
        != kube_daemonset_status_current_number_scheduled
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} not fully scheduled"
  - alert: DaemonSetRolloutStuck
    expr: |
      kube_daemonset_status_updated_number_scheduled
        != kube_daemonset_status_desired_number_scheduled
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} rollout not complete after 30 minutes"
  - alert: DaemonSetMisscheduled
    expr: kube_daemonset_status_number_misscheduled > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has pods on ineligible nodes"
  - alert: DaemonSetPodNotReady
    expr: |
      kube_daemonset_status_number_unavailable > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has {{ $value }} unavailable pods"
Runbooks
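Check: kubectl get ds NAME -n NS and compare DESIRED vs CURRENT. If CURRENT < DESIRED, the controller is failing to create pods: check kubectl describe ds NAME -n NS for FailedCreate events (for example, ResourceQuota or admission rejections). If DESIRED itself is lower than the number of nodes you expect, some nodes are not eligible: check node labels against the nodeSelector/affinity, check taints on the node missing a pod (kubectl describe node NAME), and add a matching toleration. Pods that were created but stay Pending (e.g., insufficient CPU or memory) count toward CURRENT but not READY. Check numberMisscheduled, which may indicate pods left on nodes that gained an untolerated taint.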
Check which pod is not updating: compare updatedNumberScheduled vs desiredNumberScheduled. Get pods and check age/image: kubectl get pods -n NS -l app=NAME -o wide. If a pod is stuck NotReady: describe pod for probe failure events. If OnDelete strategy: check if pods were manually deleted. Fix root cause and the controller retries.
Logs: kubectl logs POD -n NS --previous. Common causes for node-specific crashes: hostPath volume doesn't exist on that node (DirectoryOrCreate vs Directory), kernel module not available (eBPF agents), node OS version incompatibility. Check if it's isolated to one node vs all nodes to identify node-specific vs image issues.
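numberMisscheduled > 0 means pods are running on nodes where the DaemonSet should no longer run. If node labels changed so the node no longer matches the nodeSelector or required affinity, the controller deletes the pod and the count clears on its own. A persistent count usually means the node gained a NoSchedule taint the pod does not tolerate: NoSchedule does not evict running pods, so the controller leaves them in place. Add a toleration if the pod should stay; otherwise delete the misscheduled pods manually (kubectl delete pod NAME -n NS). They will not be recreated on a node that no longer matches.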
kubectl drain node-1 exits with an error about DaemonSet pods. Always use: kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data. The --ignore-daemonsets flag skips DaemonSet pods during eviction. DaemonSet pods remain running until the node is shut down or deleted; neither cordon nor drain removes them.
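The split between pods that are deleted and pods that are merely counted as misscheduled comes from two separate per-node decisions: should a new pod run here, and may an existing pod keep running. The Python sketch below models them, assuming only nodeSelector labels and taints matter (node affinity and resource fit are left out); function names are hypothetical. A pod that exists on a node where should_run is False is what numberMisscheduled counts.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Taint = Dict[str, str]
Toleration = Dict[str, str]


def tolerated(taint: Taint, tolerations: List[Toleration]) -> bool:
    for tol in tolerations:
        if tol.get("effect") and tol["effect"] != taint["effect"]:
            continue
        if tol.get("key") and tol["key"] != taint["key"]:
            continue
        operator = tol.get("operator") or "Equal"
        if operator == "Exists" or (
                operator == "Equal" and tol.get("value", "") == taint.get("value", "")):
            return True
    return False


def node_decision(node_labels: Labels, node_taints: List[Taint],
                  node_selector: Labels, tolerations: List[Toleration]) -> Tuple[bool, bool]:
    """Return (should_run, should_continue_running) for one node."""
    if any(node_labels.get(k) != v for k, v in node_selector.items()):
        return False, False  # selector no longer matches: existing pod is deleted
    untolerated = [t for t in node_taints
                   if t["effect"] in ("NoSchedule", "NoExecute") and not tolerated(t, tolerations)]
    if not untolerated:
        return True, True
    # An untolerated NoSchedule taint keeps the existing pod; NoExecute removes it
    return False, all(t["effect"] != "NoExecute" for t in untolerated)


selector = {"monitoring": "enabled"}
labels = {"monitoring": "enabled"}
maintenance = [{"key": "maintenance", "value": "true", "effect": "NoSchedule"}]

print(node_decision(labels, maintenance, selector, []))  # (False, True): kept, counted misscheduled
print(node_decision({}, [], selector, []))               # (False, False): pod deleted
```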
Best Practices
- Always set priorityClassName: system-node-critical for essential agents. Agents such as node-exporter, CNI plugins, CSI node plugins, and security agents must survive node pressure. Without a priority class, these pods can be evicted under memory or disk pressure, or preempted by higher-priority pods, leaving nodes unmonitored or without network. Use system-cluster-critical for cluster-wide critical infrastructure.
- Use tolerations: [{operator: Exists}] for truly universal agents. Monitoring agents and CNI plugins need to run on every node, including control-plane, GPU-tainted, and spot nodes. Enumerate all expected taints or use the catch-all operator: Exists toleration. Missing tolerations are the most common reason a DaemonSet is not fully scheduled.
- Set nodeSelector: {kubernetes.io/os: linux} in mixed clusters. Windows nodes cannot run Linux containers. Without the OS selector, the DaemonSet will attempt to run Linux images on Windows nodes and fail with an ImagePullBackOff or container runtime error.
- Keep DaemonSet resource requests conservative but accurate. DaemonSet pods multiply across every node: in a 100-node cluster, a DaemonSet requesting 200m CPU and 512Mi reserves 20 vCPUs and 50Gi of memory cluster-wide for that single DaemonSet. Profile actual usage with VPA recommendations in Off mode before setting final requests.
- Use maxSurge: 1, maxUnavailable: 0 for network-critical agents. A network agent that goes offline during its own rolling update can cause brief packet loss or missed connections on that node. The surge pattern (create the new agent first, then delete the old one) ensures continuous coverage at the cost of briefly doubling the per-node resource usage.
- Use OnDelete for CNI plugin updates. Updating a CNI plugin can briefly disrupt network connectivity on the node. OnDelete lets you control timing: schedule maintenance windows, drain workloads from the node first, then delete the old CNI pod to trigger the update, validate connectivity, and proceed to the next node.
- Inject node identity via the Downward API, not hostname lookups. Use spec.nodeName via fieldRef to get the node name rather than relying on hostname or DNS resolution; inside a pod, hostname returns the pod name unless hostNetwork: true is set. The node name is stable and available immediately, while DNS may not resolve correctly, especially during init.
- Audit DaemonSet pod security contexts regularly. DaemonSet pods are the most likely to run with privileged: true or hostPID: true. These settings are often necessary (CNI, eBPF agents) but must be the minimum required. Regularly review whether privileged can be replaced with specific capabilities, and whether readOnlyRootFilesystem: true can be applied with explicit writable mounts.
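A periodic audit can start from the JSON output of kubectl get ds -A -o json. The Python sketch below flags DaemonSets whose pod template uses host namespaces, privileged containers, added capabilities, or a writable root filesystem; this set of checks is an illustrative assumption rather than a complete policy, and the function name and sample objects are hypothetical.

```python
import json
from typing import Dict, List


def audit_daemonsets(ds_list_json: str) -> Dict[str, List[str]]:
    """Flag risky pod-template settings in `kubectl get ds -A -o json` output."""
    findings: Dict[str, List[str]] = {}
    for ds in json.loads(ds_list_json)["items"]:
        meta = ds["metadata"]
        spec = ds["spec"]["template"]["spec"]
        issues = [f"{field}: true" for field in ("hostNetwork", "hostPID", "hostIPC")
                  if spec.get(field)]
        for c in spec.get("initContainers", []) + spec.get("containers", []):
            sc = c.get("securityContext", {})
            if sc.get("privileged"):
                issues.append(f"{c['name']}: privileged")
            added = sc.get("capabilities", {}).get("add", [])
            if added:
                issues.append(f"{c['name']}: adds capabilities {sorted(added)}")
            if not sc.get("readOnlyRootFilesystem"):
                issues.append(f"{c['name']}: writable root filesystem")
        if issues:
            findings[f"{meta['namespace']}/{meta['name']}"] = issues
    return findings


sample = json.dumps({"items": [
    {"metadata": {"namespace": "kube-system", "name": "calico-node"},
     "spec": {"template": {"spec": {
         "hostNetwork": True,
         "containers": [{"name": "calico-node", "securityContext": {"privileged": True}}]}}}},
    {"metadata": {"namespace": "monitoring", "name": "node-exporter"},
     "spec": {"template": {"spec": {
         "containers": [{"name": "node-exporter",
                         "securityContext": {"readOnlyRootFilesystem": True,
                                             "capabilities": {"drop": ["ALL"]}}}]}}}},
]})
for name, issues in audit_daemonsets(sample).items():
    print(name, issues)
```

A DaemonSet that appears in the output is not necessarily misconfigured (calico-node legitimately needs hostNetwork and privileged); the point is to review each finding deliberately.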