DaemonSets — Kubernetes Docs

What This Page Covers

DaemonSet controller mechanics: one pod per eligible node

Full annotated DaemonSet spec

How DaemonSet pods are scheduled: controller-set node affinity and the default scheduler

Node targeting: nodeSelector, nodeAffinity, label-based subset selection

Tolerations for control-plane and tainted nodes

Default tolerations injected by DaemonSet controller

RollingUpdate strategy: maxUnavailable, maxSurge (1.22+)

OnDelete strategy: manual update workflow

Pod name format and host-based identity

hostNetwork, hostPID, hostIPC for infrastructure agents

hostPort: direct node port binding without a Service

Resource sizing for DaemonSet pods: avoiding node starvation

Real-world DaemonSet examples: node-exporter, Fluentd, CNI plugin, CSI node plugin

DaemonSet vs Deployment vs static pods

Node join/leave lifecycle: automatic pod creation and deletion

DaemonSet and node drain behavior

Updating node labels to change DaemonSet coverage

5 metrics + 4 alerts + 5 runbooks + 8 best practices

Controller Mechanics

A DaemonSet ensures exactly one pod runs on every eligible node in the cluster. When a new node joins, the controller automatically creates a pod on it. When a node is removed, the pod is garbage-collected. There is no replicas field — the replica count is implicitly the number of eligible nodes.

DaemonSet "node-exporter" pod placement:

  Node                             Pod
  worker-1 (linux/amd64)      ──►  node-exporter-<suffix>
  worker-2 (linux/amd64)      ──►  node-exporter-<suffix>
  worker-3 (linux/arm64)      ──►  node-exporter-<suffix>
  control-plane-1             ──►  no pod (taint: node-role.kubernetes.io/control-plane:NoSchedule)
                                   UNLESS DaemonSet has matching toleration

  Pod names are the DaemonSet name plus a generated suffix; the node is recorded in spec.nodeName, not in the pod name.

  New node joins  → DaemonSet controller creates a pod for it
  Node deleted    → Pod garbage-collected (no finalizer needed)
  Node cordoned   → Existing pod continues running; a replacement pod can still be scheduled there

Full DaemonSet Spec

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter     # IMMUTABLE after creation

  # ── Update strategy ────────────────────────────────────────────
  updateStrategy:
    type: RollingUpdate      # RollingUpdate (default) | OnDelete
    rollingUpdate:
      maxUnavailable: 1      # default 1; max pods down at once during update
                             # absolute or percentage of desired count
      maxSurge: 0            # 1.22+: allow temporary extra pod per node during update
                             # default 0; set to 1 (with maxUnavailable: 0) for zero-downtime agent updates

  # ── Revision history ───────────────────────────────────────────
  revisionHistoryLimit: 10

  # ── Min ready ──────────────────────────────────────────────────
  minReadySeconds: 0

  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      # ── Node targeting ─────────────────────────────────────────
      nodeSelector:
        kubernetes.io/os: linux   # only run on Linux nodes (skip Windows)

      # Fine-grained with nodeAffinity:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values: [linux]
              - key: node.kubernetes.io/instance-type
                operator: NotIn
                values: [t3.nano, t3.micro]  # skip under-resourced nodes

      # ── Tolerations ────────────────────────────────────────────
      tolerations:
      # Run on control-plane nodes (not added by default)
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      # not-ready / unreachable: the DaemonSet controller already adds these
      # NoExecute tolerations without tolerationSeconds, so the pod is not
      # evicted for these taints; listed here only to make that explicit
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute

      # ── Host access ────────────────────────────────────────────
      hostNetwork: false        # true for CNI plugins, network agents
      hostPID: false            # true for process-level monitoring (e.g., eBPF agents)
      hostIPC: false

      # ── Priority ───────────────────────────────────────────────
      priorityClassName: system-node-critical   # ensures scheduling on stressed nodes

      # ── Service account ────────────────────────────────────────
      serviceAccountName: node-exporter-sa

      # ── Security ───────────────────────────────────────────────
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534        # nobody
        seccompProfile:
          type: RuntimeDefault

      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.7.0
        args:
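        - --path.rootfs=/host
        - --path.procfs=/host/proc   # read the host /proc mounted below
        - --path.sysfs=/host/sys     # read the host /sys mounted below
        - --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run)($|/)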
        ports:
        - name: metrics
          containerPort: 9100
          hostPort: 9100        # bind directly to node port (optional; use Service instead)
        resources:
          requests:
            cpu: "50m"
            memory: "64Mi"
          limits:
            memory: "128Mi"
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: rootfs
          mountPath: /host
          readOnly: true
          mountPropagation: HostToContainer
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true

      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys

      # ── Termination ────────────────────────────────────────────
      terminationGracePeriodSeconds: 30

How DaemonSet Scheduling Works

The DaemonSet controller does not bind pods to nodes itself. For each eligible node it creates a pod whose spec.affinity.nodeAffinity contains a required term matching that node's metadata.name, and the default kube-scheduler then binds the pod by setting spec.nodeName. (In older releases, before scheduling of DaemonSet pods by the default scheduler became the default in 1.12, the controller set spec.nodeName directly and bypassed the scheduler.) DaemonSet pods still land on nodes that reject ordinary pods, because of the tolerations the controller adds:

Unschedulable (kubectl cordon): the node.kubernetes.io/unschedulable:NoSchedule toleration lets DaemonSet pods schedule onto cordoned nodes
Under memory, disk, or PID pressure: the matching NoSchedule tolerations keep agents schedulable on pressured nodes
Not-ready or unreachable: the NoExecute tolerations keep DaemonSet pods from being evicted while the node is unhealthy

DaemonSet Pods Are Still Subject to Node Capacity Checks

Because the default scheduler places DaemonSet pods, it verifies that the node has enough allocatable CPU and memory for the pod's resource requests. On a saturated node the pod stays Pending unless its priority lets the scheduler preempt lower-priority pods to make room. Always set resource requests conservatively on DaemonSet pods, and use priorityClassName: system-node-critical for essential infrastructure agents so they can preempt other workloads and are among the last candidates for node-pressure eviction.

Node Targeting

nodeSelector (Simple)

# Run only on GPU nodes
spec:
  template:
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-t4

# Run only on Linux (important in mixed Windows/Linux clusters)
      nodeSelector:
        kubernetes.io/os: linux

# Run only on nodes in a specific availability zone
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a

nodeAffinity (Complex Expressions)

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # Must be Linux
        - key: kubernetes.io/os
          operator: In
          values: [linux]
        # Must NOT be a spot/preemptible node (for critical monitoring agents)
        - key: cloud.google.com/gke-preemptible
          operator: DoesNotExist
    # preferredDuringSchedulingIgnoredDuringExecution has no practical effect
    # on a DaemonSet: each pod is pinned to one specific node, so only the
    # required terms above decide which nodes get a pod

Label-Based Opt-In/Opt-Out

# Opt-in: only run on nodes with a specific label
nodeSelector:
  monitoring: "enabled"

# Add the label to specific nodes:
kubectl label node worker-1 monitoring=enabled
kubectl label node worker-2 monitoring=enabled

# Remove to stop DaemonSet pod on a node:
kubectl label node worker-1 monitoring-    # removes the label → pod deleted

# Opt-out: run on all nodes EXCEPT those with a specific label
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: exclude-monitoring
          operator: DoesNotExist   # run on nodes that do NOT have this label

# Add label to exclude a node:
kubectl label node worker-3 exclude-monitoring=true  # → pod deleted from worker-3
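
The opt-in and opt-out rules above can be modelled in a few lines of Python. This is a minimal sketch of label matching, not the controller's implementation: the function names and the sample node labels are hypothetical, only the In, NotIn, Exists, and DoesNotExist operators are handled (Gt and Lt are omitted), and a single nodeSelectorTerm is assumed (multiple terms are ORed in Kubernetes).

```python
def expression_matches(labels, expr):
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A node without the key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def node_is_selected(labels, node_selector=None, match_expressions=None):
    """True when the node satisfies both nodeSelector and every expression."""
    selector_ok = all(labels.get(k) == v for k, v in (node_selector or {}).items())
    return selector_ok and all(
        expression_matches(labels, e) for e in (match_expressions or [])
    )


# Hypothetical cluster state for illustration.
nodes = {
    "worker-1": {"kubernetes.io/os": "linux", "monitoring": "enabled"},
    "worker-2": {"kubernetes.io/os": "linux"},
    "worker-3": {"kubernetes.io/os": "linux", "monitoring": "enabled",
                 "exclude-monitoring": "true"},
}

opt_in = [n for n, labels in nodes.items()
          if node_is_selected(labels, node_selector={"monitoring": "enabled"})]
opt_out = [n for n, labels in nodes.items()
           if node_is_selected(labels, match_expressions=[
               {"key": "exclude-monitoring", "operator": "DoesNotExist"}])]

print("opt-in nodes: ", opt_in)   # ['worker-1', 'worker-3']
print("opt-out nodes:", opt_out)  # ['worker-1', 'worker-2']
```

With the opt-in selector only labelled nodes get a pod; with the opt-out expression every node gets one except those carrying the exclusion label.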

Tolerations

Default Tolerations Injected by DaemonSet Controller

The DaemonSet controller automatically injects several tolerations into every DaemonSet pod to ensure agent availability during node problems:

Auto-Injected Toleration	Effect	Purpose
`node.kubernetes.io/not-ready:NoExecute`	Tolerated indefinitely (no tolerationSeconds)	Pod is not evicted while the node is not-ready
`node.kubernetes.io/unreachable:NoExecute`	Tolerated indefinitely (no tolerationSeconds)	Pod is not evicted while the node is unreachable
`node.kubernetes.io/disk-pressure:NoSchedule`	Tolerated	Monitoring agents still run on disk-pressured nodes
`node.kubernetes.io/memory-pressure:NoSchedule`	Tolerated	Monitoring agents still run on memory-pressured nodes
`node.kubernetes.io/pid-pressure:NoSchedule`	Tolerated	PID exhaustion doesn't block infrastructure agents
`node.kubernetes.io/unschedulable:NoSchedule`	Tolerated	DaemonSet pods created on cordoned nodes
`node.kubernetes.io/network-unavailable:NoSchedule`	Tolerated (added only for hostNetwork pods)	CNI plugin DaemonSets can run before network is ready

Control-Plane Tolerations

# Control-plane nodes carry a taint that blocks regular pods:
# node-role.kubernetes.io/control-plane:NoSchedule (applied by kubeadm since 1.24)
# node-role.kubernetes.io/master:NoSchedule (deprecated; kubeadm stopped applying it in 1.25)

# To run a DaemonSet on control-plane nodes (e.g., logging agent, monitoring):
tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule
- key: node-role.kubernetes.io/master    # for compatibility with older clusters
  operator: Exists
  effect: NoSchedule

# Example use cases requiring control-plane coverage:
# - Audit log collector (reads from /var/log/kubernetes/audit.log)
# - etcd backup agent
# - Node-level security scanner
# - Falco (eBPF syscall monitor)

Custom Taint Tolerations

# Node tainted for GPU workloads only:
# kubectl taint node gpu-node-1 dedicated=gpu:NoSchedule

# DaemonSet for GPU metrics (DCGM exporter) — must tolerate the GPU taint:
tolerations:
- key: dedicated
  operator: Equal
  value: gpu
  effect: NoSchedule

# Tolerate ANY taint (run on ALL nodes regardless of taints):
tolerations:
- operator: Exists   # matches any key, value, effect — use with caution
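
To see why the GPU toleration above is not enough for control-plane nodes while the catch-all is, the matching rules can be sketched in Python. This is an illustrative model under stated assumptions, not Kubernetes source: the function names are hypothetical, taints and tolerations are plain dicts shaped like the YAML, and only NoSchedule and NoExecute taints are treated as blocking.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key (valid only with operator Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    # operator Equal: values must be identical.
    return toleration.get("value", "") == taint.get("value", "")


def untolerated(taints, tolerations):
    """Taints that would still keep the pod off the node."""
    blocking = ("NoSchedule", "NoExecute")
    return [
        t for t in taints
        if t["effect"] in blocking
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
cp_taint = {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"}

gpu_toleration = {"key": "dedicated", "operator": "Equal",
                  "value": "gpu", "effect": "NoSchedule"}
catch_all = {"operator": "Exists"}

left_gpu_only = untolerated([gpu_taint, cp_taint], [gpu_toleration])
left_catch_all = untolerated([gpu_taint, cp_taint], [catch_all])

print([t["key"] for t in left_gpu_only])   # control-plane taint still blocks
print([t["key"] for t in left_catch_all])  # nothing blocks
```

A node receives the DaemonSet pod only when the list of untolerated taints is empty, which is why a missing toleration shows up as a node without a pod.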

Update Strategies

RollingUpdate (Default)

DaemonSet RollingUpdate terminates old pods and starts new ones one node at a time (by default). Unlike Deployment, there is no ReplicaSet intermediary — the controller directly manages the per-node pod replacement.

updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1     # default; one node's pod down at a time
                          # absolute: 2 = two nodes simultaneously updated
                          # percentage: "10%" = 10% of nodes simultaneously
    maxSurge: 0           # default; no extra pod during update
                          # maxSurge: 1 = create new pod BEFORE deleting old
                          # requires node to have capacity for two pods briefly

RollingUpdate timeline (10 nodes, maxUnavailable=1, maxSurge=0):

  node-1: [old] → delete old → [new] (wait Ready + minReadySeconds)
  node-2:          → [old] → delete → [new]
  node-3:                   → ...
  ...

  One node updated at a time; monitoring coverage never drops by more than 1 node.

With maxUnavailable=3:

  node-1: [old] → [new] ┐
  node-2: [old] → [new] ├ simultaneously
  node-3: [old] → [new] ┘
  node-4: [old] → [new] ┐
  ...                   ├ next batch after above Ready

  Faster rollout; 3 nodes temporarily without agent coverage.
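
The effect of maxUnavailable on rollout length can be estimated with a short calculation. The sketch below assumes percentages are rounded up (as the DaemonSet API documents for maxUnavailable) and that each wave becomes Ready before the next starts; the real controller replaces pods continuously rather than in strict waves, so treat the wave count as a rough estimate. The function names are hypothetical.

```python
def resolve_int_or_percent(value, desired):
    """Resolve an absolute count or an "N%" string against the desired pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (desired * percent + 99) // 100  # integer ceiling
    return int(value)


def estimated_waves(desired, max_unavailable=1):
    """Rough number of update waves for a given maxUnavailable setting."""
    budget = max(1, resolve_int_or_percent(max_unavailable, desired))
    return -(-desired // budget)  # ceiling division


for nodes, setting in [(10, 1), (10, 3), (10, "10%"), (25, "10%")]:
    budget = resolve_int_or_percent(setting, nodes)
    waves = estimated_waves(nodes, setting)
    print(f"{nodes} nodes, maxUnavailable={setting}: "
          f"{budget} pod(s) at a time, about {waves} waves")
```

For 10 nodes, maxUnavailable=1 gives about 10 waves and maxUnavailable=3 gives about 4; "10%" of 25 nodes rounds up to 3 pods at a time.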

maxSurge for Zero-Downtime Agent Updates

updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never remove old pod before new is ready
    maxSurge: 1         # create new pod alongside old; delete old after new is Ready
    # Both old and new pod run on same node briefly (double the per-node cost)
    # Requires node to have capacity for 2× the pod's resource requests temporarily
    # Essential for: network agents (brief coverage gap is unacceptable),
    #                security scanners (must not miss any window)

OnDelete Strategy

updateStrategy:
  type: OnDelete
# Pods are only updated when manually deleted
# Use for: CNI plugins (network disruption during update must be controlled),
#           critical security agents (manual per-node validation required)

# Workflow:
# 1. Update DaemonSet spec (kubectl set image or kubectl apply)
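# 2. Find the pod on the target node (pod names carry a generated suffix, not the node name):
kubectl get pods -n logging -o wide --field-selector spec.nodeName=worker-1
# 3. Delete that pod to trigger the update (substitute the name printed in step 2):
kubectl delete pod POD_NAME -n logging
# 4. Verify new pod is healthy before proceeding to next node:
kubectl get pods -n logging -o wide --field-selector spec.nodeName=worker-1
# 5. Continue node by node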

Host Namespaces

Infrastructure agents often need privileged access to the node. The following patterns cover the most common host-access requirements while keeping the security surface as narrow as possible.

hostNetwork for Network Agents

# CNI plugins, network monitoring agents
spec:
  template:
    spec:
      hostNetwork: true        # pod uses node's network namespace
      dnsPolicy: ClusterFirstWithHostNet  # REQUIRED with hostNetwork to still resolve cluster DNS

# Effect: pod sees all node interfaces (eth0, lo, tunnel interfaces)
# Pod IP is the node IP (not a pod CIDR IP)
# Port conflicts: if node already uses port 9100, the pod will fail to bind

hostPID for Process-Level Agents

# eBPF-based tracing, process monitoring (Falco, Pixie, Tetragon)
spec:
  template:
    spec:
      hostPID: true    # pod sees all processes on the node via /proc
      containers:
      - name: falco
        image: falcosecurity/falco:0.37.0
        securityContext:
          privileged: true    # required for kernel module / eBPF loading
        volumeMounts:
        - name: dev
          mountPath: /dev
        - name: proc
          mountPath: /host/proc
          readOnly: true
      volumes:
      - name: dev
        hostPath: {path: /dev}
      - name: proc
        hostPath: {path: /proc}

hostPath Volume Patterns

# Log collection (Fluentd, Filebeat, Vector)
volumes:
- name: varlog
  hostPath:
    path: /var/log
- name: varlibdockercontainers
  hostPath:
    path: /var/lib/docker/containers   # container log symlink targets

# For containerd and other CRI runtimes (pod logs live under /var/log/pods):
- name: containerd-logs
  hostPath:
    path: /var/log/pods

# Node filesystem inspection (node-exporter, security scanners)
- name: rootfs
  hostPath:
    path: /
    type: Directory   # Directory | File | DirectoryOrCreate | FileOrCreate | Socket

hostPort

hostPort maps a port on the node to a container port; the two numbers usually match but do not have to. The port is reachable at NODE_IP:PORT without needing a Service. It is an alternative to using hostNetwork: true when only specific ports need to be exposed.

containers:
- name: node-exporter
  ports:
  - containerPort: 9100
    hostPort: 9100          # bind to node IP:9100
    protocol: TCP

# Prometheus scrape config targeting node IPs directly:
# - targets: ['node-1:9100', 'node-2:9100', ...]
# OR use a Service of type ClusterIP with targetPort: 9100 (preferred)

hostPort Limits Pod Scheduling

hostPort reserves the port on the node. Only one pod can use a given hostPort (per host IP and protocol) on a node, which the scheduler enforces. For DaemonSets this is normally fine since exactly one pod runs per node, but it conflicts with maxSurge: the surge pod needs the same hostPort on the same node and cannot be scheduled while the old pod holds it. Mixing hostPort DaemonSet pods with hostPort application pods that request the same port also causes scheduling conflicts. Prefer a ClusterIP Service to expose DaemonSet pod metrics/APIs rather than hostPort.

Real-World DaemonSet Examples

Prometheus node-exporter

# Minimal production-grade node-exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      hostNetwork: false
      hostPID: false
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
      - operator: Exists     # run on all nodes including control-plane
      priorityClassName: system-cluster-critical
      serviceAccountName: node-exporter
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.7.0
        args: ["--path.rootfs=/host", "--path.procfs=/host/proc", "--path.sysfs=/host/sys"]
        ports:
        - containerPort: 9100
        resources:
          requests: {cpu: 50m, memory: 64Mi}
          limits: {memory: 128Mi}
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities: {drop: [ALL]}
        volumeMounts:
        - {name: root, mountPath: /host, readOnly: true, mountPropagation: HostToContainer}
        - {name: proc, mountPath: /host/proc, readOnly: true}
        - {name: sys, mountPath: /host/sys, readOnly: true}
      volumes:
      - {name: root, hostPath: {path: /}}
      - {name: proc, hostPath: {path: /proc}}
      - {name: sys, hostPath: {path: /sys}}

Fluentd Log Collector

containers:
- name: fluentd
  image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8-1
  env:
  - name: FLUENT_ELASTICSEARCH_HOST
    value: elasticsearch.logging.svc.cluster.local
  - name: FLUENT_ELASTICSEARCH_PORT
    value: "9200"
  - name: K8S_NODE_NAME      # inject node name for log enrichment
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  resources:
    requests: {cpu: 100m, memory: 200Mi}
    limits: {memory: 500Mi}
  volumeMounts:
  - name: varlog
    mountPath: /var/log
  - name: pods-logs
    mountPath: /var/log/pods
    readOnly: true
  - name: fluentd-config
    mountPath: /fluentd/etc/fluent.conf
    subPath: fluent.conf
volumes:
- {name: varlog, hostPath: {path: /var/log}}
- {name: pods-logs, hostPath: {path: /var/log/pods}}
- {name: fluentd-config, configMap: {name: fluentd-config}}

CNI Plugin (Calico node)

# CNI plugins require hostNetwork + privileged + control-plane tolerations
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      tolerations:
      - operator: Exists    # run on every node including control-plane
      priorityClassName: system-node-critical
      initContainers:
      - name: install-cni
        image: calico/cni:v3.27.0
        command: ["/opt/cni/bin/install"]
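        volumeMounts:
        # Mounted under /host so the host directories do not hide the installer shipped in the image
        - name: cni-bin-dir
          mountPath: /host/opt/cni/bin
        - name: cni-net-dir
          mountPath: /host/etc/cni/net.d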
      containers:
      - name: calico-node
        image: calico/node:v3.27.0
        securityContext:
          privileged: true     # required: manages iptables, routes, network interfaces
        env:
        - name: NODENAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: lib-modules
          mountPath: /lib/modules
          readOnly: true
        - name: var-run-calico
          mountPath: /var/run/calico
        - name: cni-bin-dir
          mountPath: /opt/cni/bin
      volumes:
      - {name: lib-modules, hostPath: {path: /lib/modules}}
      - {name: var-run-calico, hostPath: {path: /var/run/calico}}
      - {name: cni-bin-dir, hostPath: {path: /opt/cni/bin}}
      - {name: cni-net-dir, hostPath: {path: /etc/cni/net.d}}

CSI Node Plugin

# CSI node plugins run as DaemonSets to provide node-local volume operations
# (NodeStageVolume, NodePublishVolume, NodeGetVolumeStats)
containers:
- name: ebs-plugin
  image: public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.28.0
  args: ["node", "--endpoint=$(CSI_ENDPOINT)", "--logtostderr", "--v=5"]
  env:
  - name: CSI_ENDPOINT
    value: unix:///var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock
  securityContext:
    privileged: true   # required: mount/unmount block devices, create device nodes
  volumeMounts:
  - name: kubelet-dir
    mountPath: /var/lib/kubelet
    mountPropagation: Bidirectional  # propagate mounts back to host
  - name: plugin-dir
    mountPath: /var/lib/kubelet/plugins/ebs.csi.aws.com/
  - name: device-dir
    mountPath: /dev
- name: node-driver-registrar
  image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.0
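  args:
  # ADDRESS and DRIVER_REG_SOCK_PATH must be defined under env: for this container
  # (omitted from this excerpt)
  - --csi-address=$(ADDRESS)
  - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)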
volumes:
- {name: kubelet-dir, hostPath: {path: /var/lib/kubelet, type: Directory}}
- {name: plugin-dir, hostPath: {path: /var/lib/kubelet/plugins/ebs.csi.aws.com/, type: DirectoryOrCreate}}
- {name: device-dir, hostPath: {path: /dev, type: Directory}}

Node Join and Leave Lifecycle

Node lifecycle and DaemonSet response:

New node joins cluster:
  1. Node registers with API server (kubelet --register-node)
  2. DaemonSet controller watches Node objects via LIST/WATCH
  3. Controller detects new node matches DaemonSet selector
  4. Controller creates Pod with node affinity pinned to new-node; the scheduler binds it
  5. Kubelet on new-node picks up pod and starts container
  Typical latency: 2-10 seconds after node registers

Node drain / cordon:
  kubectl cordon node-1
    → node marked Unschedulable
    → existing DaemonSet pod continues running (not evicted)
    → new DaemonSet pods CAN still be created (they tolerate the unschedulable taint)

  kubectl drain node-1
    → evicts regular pods (respecting PDB)
    → exits with an error if DaemonSet-managed pods are present
    → add --ignore-daemonsets to skip DaemonSet pods and continue; they keep running
    → add --delete-emptydir-data to allow evicting pods that use emptyDir (their data is lost)

Node removed from cluster:
  Node object deleted → DaemonSet pod garbage collected
  No manual cleanup required

kubectl drain --ignore-daemonsets

Without --ignore-daemonsets, kubectl drain exits with an error if DaemonSet pods are present (which they almost always are). Always include this flag during node maintenance. The flag tells drain to skip DaemonSet-managed pods: they are left running during the drain and are still there when the node is uncordoned, so nothing needs to be recreated. They are removed only when the node is deleted, when the node stops matching the DaemonSet's targeting rules, or when the DaemonSet itself is updated or deleted.

Resource Sizing for DaemonSet Pods

DaemonSet pods consume resources from every node they run on. Over-provisioning DaemonSet requests reduces the available allocatable resources for application pods cluster-wide. Under-provisioning causes eviction during node pressure.

# Node allocatable = node capacity - kube-reserved - system-reserved - eviction threshold
# DaemonSet pods consume from allocatable on EVERY node

# Example: 100-node cluster, node-exporter requests 50m CPU / 64Mi memory
# Total cluster-wide cost: 100 × (50m CPU + 64Mi) = 5000m CPU + 6400Mi (6.25Gi) memory
# This is "hidden tax" that must be accounted for in capacity planning

# Resource sizing guidelines for common agents:
# node-exporter:    50m CPU / 64Mi memory  (minimal, read-only host metrics)
# fluentd:          100m CPU / 200Mi        (scales with log volume; add VPA)
# calico-node:      100m CPU / 256Mi        (network data path; critical path)
# datadog-agent:    200m CPU / 512Mi        (full observability; significant cost)
# falco:            100m CPU / 512Mi        (eBPF kernel overhead varies)
# CSI node plugin:  50m CPU / 128Mi         (per-node volume operations)
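
The per-node figures above turn into a cluster-wide reservation with simple multiplication. The sketch below uses the guideline requests from the list and a 100-node cluster as assumptions; the dictionary and function names are hypothetical, and real requests should come from profiling your own agents.

```python
# name: (CPU request in millicores, memory request in MiB), from the guideline list
AGENT_REQUESTS = {
    "node-exporter": (50, 64),
    "fluentd": (100, 200),
    "calico-node": (100, 256),
    "datadog-agent": (200, 512),
    "falco": (100, 512),
    "csi-node-plugin": (50, 128),
}


def cluster_cost(node_count, cpu_millicores, memory_mib):
    """Return (CPU cores, memory GiB) reserved across the whole cluster."""
    return node_count * cpu_millicores / 1000, node_count * memory_mib / 1024


NODES = 100  # assumed cluster size

for name, (cpu, mem) in AGENT_REQUESTS.items():
    cores, gib = cluster_cost(NODES, cpu, mem)
    print(f"{name:16s} {cores:5.1f} cores  {gib:7.2f} GiB")

# Running all six agents on every node:
total_cpu = sum(cpu for cpu, _ in AGENT_REQUESTS.values())
total_mem = sum(mem for _, mem in AGENT_REQUESTS.values())
total_cores, total_gib = cluster_cost(NODES, total_cpu, total_mem)
print(f"{'all agents':16s} {total_cores:5.1f} cores  {total_gib:7.2f} GiB")
```

For node-exporter alone this gives 5 cores and 6.25 GiB across 100 nodes; the combined line shows how quickly several agents add up.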

DaemonSet vs Deployment vs Static Pods

Aspect	DaemonSet	Deployment	Static Pod
One per node	Yes (automatic)	No (replica count)	Yes (manual per node)
Managed by	DaemonSet controller	Deployment controller	kubelet directly
API server required	Yes	Yes	No (kubelet reads local file)
Auto new-node coverage	Yes	No	No
Rolling update	Yes (RollingUpdate/OnDelete)	Yes (RollingUpdate/Recreate)	No (manual file update per node)
kubectl visibility	Yes (appears in `kubectl get pods`)	Yes	Yes (mirror pod in API)
Survives API server outage	Running pods keep running; no new or replacement pods	Running pods keep running; no new or replacement pods	Yes (kubelet manages locally)
Use case	All infrastructure agents	Stateless applications	Control-plane components (etcd, apiserver) only

Operational Commands

# Check DaemonSet rollout status
kubectl rollout status ds/node-exporter -n monitoring

# Watch pod replacement during RollingUpdate
kubectl get pods -n monitoring -l app=node-exporter -o wide -w

# Check desired vs ready vs available counts
kubectl get ds node-exporter -n monitoring
# NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR
# node-exporter   10        10        9       8            9           kubernetes.io/os=linux

# Force rollout restart (without spec change)
kubectl rollout restart ds/node-exporter -n monitoring

# Check which nodes have/don't have the DaemonSet pod
kubectl get nodes -o wide
kubectl get pods -n monitoring -l app=node-exporter -o wide
# Compare: any node without a pod = targeting mismatch or toleration missing

# Update image
kubectl set image ds/node-exporter node-exporter=prom/node-exporter:v1.8.0 -n monitoring

# Rollback
kubectl rollout undo ds/node-exporter -n monitoring

# Get ControllerRevision history
kubectl get controllerrevision -n monitoring -l app=node-exporter

# Describe to see events
kubectl describe ds node-exporter -n monitoring

# Check DaemonSet status fields
kubectl get ds node-exporter -n monitoring -o jsonpath='{.status}' | jq .
# {
#   "currentNumberScheduled": 10,   # nodes where pod is running
#   "desiredNumberScheduled": 10,   # nodes that should have a pod
#   "numberAvailable": 9,           # pods passing readiness
#   "numberMisscheduled": 0,        # pods running on ineligible nodes
#   "numberReady": 9,
#   "numberUnavailable": 1,         # pods not yet ready
#   "updatedNumberScheduled": 8     # pods on latest revision
# }
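
The status fields can be read programmatically to decide whether a rollout has finished. The following is a minimal sketch, assuming the status object has been loaded into a dict (for example from the JSON printed by the command above); the function name and message wording are hypothetical, and it ignores observedGeneration, which a complete check would also compare.

```python
def rollout_state(status):
    """Summarise DaemonSet rollout progress from its .status fields."""
    desired = status.get("desiredNumberScheduled", 0)
    updated = status.get("updatedNumberScheduled", 0)
    available = status.get("numberAvailable", 0)
    if updated < desired:
        return f"updating: {updated} of {desired} pods on the latest revision"
    if available < desired:
        return f"waiting: {available} of {desired} updated pods available"
    return "rolled out"


# The sample status shown above.
sample = {
    "currentNumberScheduled": 10,
    "desiredNumberScheduled": 10,
    "numberAvailable": 9,
    "numberMisscheduled": 0,
    "numberReady": 9,
    "numberUnavailable": 1,
    "updatedNumberScheduled": 8,
}

print(rollout_state(sample))
print(rollout_state({**sample, "updatedNumberScheduled": 10}))
print(rollout_state({**sample, "updatedNumberScheduled": 10, "numberAvailable": 10}))
```

The three calls walk through the stages of a rollout: pods still on the old revision, all pods updated but one not yet available, and fully rolled out.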

Metrics, Alerts, and Runbooks

Key Metrics

Metric	Source	Alert Condition
`kube_daemonset_status_desired_number_scheduled`	kube-state-metrics	Baseline: total eligible nodes
`kube_daemonset_status_number_ready`	kube-state-metrics	< desired for > 5m
`kube_daemonset_status_number_misscheduled`	kube-state-metrics	> 0 (pods on wrong nodes — selector changed?)
`kube_daemonset_status_updated_number_scheduled`	kube-state-metrics	< desired for > 30m → rollout stalled
`kube_daemonset_status_number_unavailable`	kube-state-metrics	> 0 for > 10m (node unreachable or pod crash-looping)

Alerting Rules

groups:
- name: daemonset-health
  rules:
  - alert: DaemonSetNotFullyScheduled
    expr: |
      kube_daemonset_status_desired_number_scheduled
        != kube_daemonset_status_current_number_scheduled
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} not fully scheduled"

  - alert: DaemonSetRolloutStuck
    expr: |
      kube_daemonset_status_updated_number_scheduled
        != kube_daemonset_status_desired_number_scheduled
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} rollout not complete after 30 minutes"

  - alert: DaemonSetMisscheduled
    expr: kube_daemonset_status_number_misscheduled > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has pods on ineligible nodes"

  - alert: DaemonSetPodNotReady
    expr: |
      kube_daemonset_status_number_unavailable > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has {{ $value }} unavailable pods"

Runbooks

Pod Missing from Node

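Check: kubectl get ds NAME -n NS and compare DESIRED vs CURRENT. If CURRENT < DESIRED: check that node labels match nodeSelector; check taints on the missing node (kubectl describe node NAME) and ensure the DaemonSet has a matching toleration. If the pod exists but is Pending, kubectl describe pod shows FailedScheduling events, typically because the node lacks the requested CPU or memory. Also check numberMisscheduled, which may indicate stale pods on nodes that were relabeled.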

Rollout Stuck

Check which pod is not updating: compare updatedNumberScheduled vs desiredNumberScheduled. Get pods and check age/image: kubectl get pods -n NS -l app=NAME -o wide. If a pod is stuck NotReady: describe pod for probe failure events. If OnDelete strategy: check if pods were manually deleted. Fix root cause and the controller retries.

Pod CrashLooping on Specific Node

Logs: kubectl logs POD -n NS --previous. Common causes for node-specific crashes: hostPath volume doesn't exist on that node (DirectoryOrCreate vs Directory), kernel module not available (eBPF agents), node OS version incompatibility. Check if it's isolated to one node vs all nodes to identify node-specific vs image issues.

Misscheduled Pods

numberMisscheduled > 0 means pods run on nodes they shouldn't. Happens when node labels change after pod creation. The controller will eventually delete misscheduled pods. To force immediate cleanup: delete the misscheduled pods manually (kubectl delete pod NAME). They will not be recreated on that node if it no longer matches.

Node Drain Blocked by DaemonSet

kubectl drain node-1 exits with an error about DaemonSet-managed pods. Always use: kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data. The --ignore-daemonsets flag skips DaemonSet pods during eviction. DaemonSet pods remain running until the node is deleted or no longer matches the DaemonSet; neither cordon nor drain removes them.

Best Practices

Always set priorityClassName: system-node-critical for essential agents: node-exporter, CNI plugins, CSI node plugins, and security agents must survive node pressure. Without a priority class, these pods can be evicted under memory or disk pressure (CPU is throttled rather than a cause of eviction), leaving nodes unmonitored or without network. Use system-cluster-critical for cluster-wide critical infrastructure.
Use tolerations: [{operator: Exists}] for truly universal agents — monitoring agents and CNI plugins need to run on every node including control-plane, GPU-tainted, and spot nodes. Enumerate all expected taints or use the catch-all operator: Exists toleration. Missing tolerations are the most common reason a DaemonSet is not fully scheduled.
Set nodeSelector: {kubernetes.io/os: linux} in mixed clusters — Windows nodes cannot run Linux containers. Without the OS selector, the DaemonSet will attempt to schedule Linux images on Windows nodes and fail with an ImagePullBackOff or container runtime error.
Keep DaemonSet resource requests conservative but accurate — DaemonSet pods multiply across every node. A 100-node cluster with a DaemonSet requesting 200m CPU and 512Mi means 20 vCPUs and 50Gi memory reserved cluster-wide for that single DaemonSet. Profile actual usage with VPA recommendations in Off mode before setting final requests.
Use maxSurge: 1, maxUnavailable: 0 for network-critical agents — a network agent that goes offline during its own rolling update can cause brief packet loss or missed connections on that node. The surge pattern (create new agent first, then delete old) ensures continuous coverage at the cost of briefly doubling the per-node resource usage.
Use OnDelete for CNI plugin updates — updating a CNI plugin can briefly disrupt network connectivity on the node. OnDelete lets you control timing: schedule maintenance windows, drain workloads from the node first, then delete the old CNI pod to trigger the update, validate connectivity, and proceed to the next node.
Inject node identity via Downward API, not hostname lookups — use spec.nodeName via fieldRef to get the node name rather than relying on hostname or DNS resolution. The node name is stable and available immediately; DNS may not resolve correctly especially during init.
Audit DaemonSet pod security contexts regularly — DaemonSet pods are the most likely to run privileged: true or with hostPID: true. These settings are often necessary (CNI, eBPF agents) but must be the minimum required. Regularly review whether privileged can be replaced with specific capabilities, and whether readOnlyRootFilesystem: true can be applied with explicit writable mounts.
