▶ What This Page Covers
  • DaemonSet controller mechanics: one pod per eligible node
  • Full annotated DaemonSet spec
  • How DaemonSet pods are scheduled: controller-set node affinity plus kube-scheduler
  • Node targeting: nodeSelector, nodeAffinity, label-based subset selection
  • Tolerations for control-plane and tainted nodes
  • Default tolerations injected by DaemonSet controller
  • RollingUpdate strategy: maxUnavailable, maxSurge (1.22+)
  • OnDelete strategy: manual update workflow
  • Pod name format and host-based identity
  • hostNetwork, hostPID, hostIPC for infrastructure agents
  • hostPort: direct node port binding without a Service
  • Resource sizing for DaemonSet pods: avoiding node starvation
  • Real-world DaemonSet examples: node-exporter, Fluentd, CNI plugin, CSI node plugin
  • DaemonSet vs Deployment vs static pods
  • Node join/leave lifecycle: automatic pod creation and deletion
  • DaemonSet and node drain behavior
  • Updating node labels to change DaemonSet coverage
  • 5 metrics + 4 alerts + 5 runbooks + 8 best practices

    Controller Mechanics

    A DaemonSet ensures exactly one pod runs on every eligible node in the cluster. When a new node joins, the controller automatically creates a pod on it. When a node is removed, the pod is garbage-collected. There is no replicas field — the replica count is implicitly the number of eligible nodes.

    DaemonSet pod placement:

    Cluster nodes:              DaemonSet "node-exporter" pods:

    ┌──────────────┐
    │ worker-1     │ ──────►  node-exporter-xxxxx on worker-1
    │ (linux/amd64)│
    └──────────────┘
    ┌──────────────┐
    │ worker-2     │ ──────►  node-exporter-xxxxx on worker-2
    │ (linux/amd64)│
    └──────────────┘
    ┌──────────────┐
    │ worker-3     │ ──────►  node-exporter-xxxxx on worker-3
    │ (linux/arm64)│
    └──────────────┘
    ┌──────────────┐
    │ control-     │ ──────►  no pod (taint: node-role.kubernetes.io/control-plane:NoSchedule)
    │ plane-1      │          UNLESS DaemonSet has matching toleration
    └──────────────┘

    (xxxxx is a generated suffix; pod names do not contain the node name.)

    New node joins → DaemonSet controller creates a pod for it immediately
    Node deleted   → Pod garbage-collected (no finalizer needed)
    Node cordoned  → Existing pod continues running; DaemonSet pods tolerate the unschedulable taint
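
    The reconciliation idea above (one pod per eligible node, created and removed as nodes come and go) can be sketched in a few lines of Python. This is a simplified model, not the controller's actual code: the function name, the input shapes, and the pod names are all hypothetical, and rolling updates are ignored.

```python
def reconcile(eligible_nodes, pods_by_node):
    """Return (nodes needing a new pod, pods to delete) for one DaemonSet.

    eligible_nodes: set of node names that should run the pod.
    pods_by_node:   dict mapping node name -> list of pod names found there.
    """
    to_create = sorted(n for n in eligible_nodes if not pods_by_node.get(n))
    to_delete = []
    for node, pods in sorted(pods_by_node.items()):
        if node not in eligible_nodes:
            to_delete.extend(pods)      # node left the cluster or became ineligible
        else:
            to_delete.extend(pods[1:])  # keep exactly one pod per eligible node
    return to_create, to_delete


eligible = {"worker-1", "worker-2", "worker-3"}
observed = {"worker-1": ["node-exporter-abcde"], "old-node": ["node-exporter-fghij"]}
create_on, delete = reconcile(eligible, observed)
print(create_on)  # ['worker-2', 'worker-3']
print(delete)     # ['node-exporter-fghij']
```

    There is no replica count anywhere in the sketch: the size of eligible_nodes is the desired count.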

    Full DaemonSet Spec

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter
      namespace: monitoring
      labels:
        app: node-exporter
    spec:
      selector:
        matchLabels:
          app: node-exporter     # IMMUTABLE after creation
    
      # ── Update strategy ────────────────────────────────────────────
      updateStrategy:
        type: RollingUpdate      # RollingUpdate (default) | OnDelete
        rollingUpdate:
          maxUnavailable: 1      # default 1; max pods down at once during update
                                 # absolute or percentage of desired count
          maxSurge: 0            # 1.22+: allow temporary extra pod per node during update
                                 # default 0; set to 1 together with maxUnavailable: 0 for zero-downtime agent updates
    
      # ── Revision history ───────────────────────────────────────────
      revisionHistoryLimit: 10
    
      # ── Min ready ──────────────────────────────────────────────────
      minReadySeconds: 0
    
      template:
        metadata:
          labels:
            app: node-exporter
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "9100"
        spec:
          # ── Node targeting ─────────────────────────────────────────
          nodeSelector:
            kubernetes.io/os: linux   # only run on Linux nodes (skip Windows)
    
          # Fine-grained with nodeAffinity:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values: [linux]
                  - key: node.kubernetes.io/instance-type
                    operator: NotIn
                    values: [t3.nano, t3.micro]  # skip under-resourced nodes
    
          # ── Tolerations ────────────────────────────────────────────
          tolerations:
          # Run on control-plane nodes (not added by default)
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          # Tolerate not-ready / unreachable nodes (unrelated to drain); the controller also adds these tolerations automatically
          - key: node.kubernetes.io/not-ready
            operator: Exists
            effect: NoExecute
            tolerationSeconds: 300
          - key: node.kubernetes.io/unreachable
            operator: Exists
            effect: NoExecute
            tolerationSeconds: 300
    
          # ── Host access ────────────────────────────────────────────
          hostNetwork: false        # true for CNI plugins, network agents
          hostPID: false            # true for process-level monitoring (e.g., eBPF agents)
          hostIPC: false
    
          # ── Priority ───────────────────────────────────────────────
          priorityClassName: system-node-critical   # highest built-in priority; the scheduler may preempt other pods to fit it
    
          # ── Service account ────────────────────────────────────────
          serviceAccountName: node-exporter-sa
    
          # ── Security ───────────────────────────────────────────────
          securityContext:
            runAsNonRoot: true
            runAsUser: 65534        # nobody
            seccompProfile:
              type: RuntimeDefault
    
          containers:
          - name: node-exporter
            image: prom/node-exporter:v1.7.0
            args:
            - --path.rootfs=/host
            - --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run)($|/)
            ports:
            - name: metrics
              containerPort: 9100
              hostPort: 9100        # bind directly to node port (optional; use Service instead)
            resources:
              requests:
                cpu: "50m"
                memory: "64Mi"
              limits:
                memory: "128Mi"
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              capabilities:
                drop: ["ALL"]
            volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
              mountPropagation: HostToContainer
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
    
          volumes:
          - name: rootfs
            hostPath:
              path: /
          - name: proc
            hostPath:
              path: /proc
          - name: sys
            hostPath:
              path: /sys
    
          # ── Termination ────────────────────────────────────────────
          terminationGracePeriodSeconds: 30

    How DaemonSet Scheduling Works

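    DaemonSet pods do not bypass kube-scheduler. Early releases had the DaemonSet controller set spec.nodeName itself, but since Kubernetes 1.12 the default behavior is different: the controller creates each pod with a required node affinity on metadata.name that pins the pod to its target node, and the default scheduler performs the binding. Combined with the tolerations the controller adds automatically, this means DaemonSet pods can be scheduled on nodes that are cordoned (unschedulable), under disk, memory, or PID pressure, or temporarily not-ready or unreachable.

    DaemonSet Pods Are Still Subject to Node Capacity Checks

    Because the default scheduler places DaemonSet pods, it verifies that the node has enough allocatable CPU and memory for the pod's resource requests. On a saturated node the pod stays Pending until capacity frees up, unless its priority lets the scheduler preempt lower-priority pods. Always set resource requests conservatively on DaemonSet pods, and use priorityClassName: system-node-critical for essential infrastructure agents so they can preempt other workloads and are among the last pods evicted under node pressure.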
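
    In current Kubernetes releases, the controller pins each DaemonSet pod to its node with a node-affinity term on the metadata.name field and leaves the binding to the default scheduler. The sketch below shows the shape of that term on a plain Python dict. The helper name pin_to_node is hypothetical, and replacing any existing required terms mirrors the documented behavior only as far as I understand it.

```python
import copy


def pin_to_node(pod_spec, node_name):
    """Return a copy of pod_spec whose required node affinity targets one node."""
    spec = copy.deepcopy(pod_spec)
    term = {
        "matchFields": [
            {"key": "metadata.name", "operator": "In", "values": [node_name]}
        ]
    }
    node_affinity = spec.setdefault("affinity", {}).setdefault("nodeAffinity", {})
    # Any pre-existing required terms are replaced: eligibility was already
    # evaluated before the target node was chosen.
    node_affinity["requiredDuringSchedulingIgnoredDuringExecution"] = {
        "nodeSelectorTerms": [term]
    }
    return spec


template = {"containers": [{"name": "node-exporter"}]}
pinned = pin_to_node(template, "worker-1")
print(pinned["affinity"]["nodeAffinity"])
```

    The template itself is left untouched, so the same template can be pinned to every eligible node in turn.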

    Node Targeting

    nodeSelector (Simple)

    # Each nodeSelector below is a separate alternative; use one per DaemonSet.
    
    # Run only on GPU nodes
    spec:
      template:
        spec:
          nodeSelector:
            accelerator: nvidia-tesla-t4
    
    # Run only on Linux (important in mixed Windows/Linux clusters)
    spec:
      template:
        spec:
          nodeSelector:
            kubernetes.io/os: linux
    
    # Run only on nodes in a specific availability zone
    spec:
      template:
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-east-1a

    nodeAffinity (Complex Expressions)

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            # Must be Linux
            - key: kubernetes.io/os
              operator: In
              values: [linux]
            # Must NOT be a spot/preemptible node (for critical monitoring agents)
            - key: cloud.google.com/gke-preemptible
              operator: DoesNotExist
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
            - key: node-role
              operator: In
              values: [worker]   # little practical effect for a DaemonSet: each pod targets one fixed node

    Label-Based Opt-In/Opt-Out

    # Opt-in: only run on nodes with a specific label
    nodeSelector:
      monitoring: "enabled"
    
    # Add the label to specific nodes:
    kubectl label node worker-1 monitoring=enabled
    kubectl label node worker-2 monitoring=enabled
    
    # Remove to stop DaemonSet pod on a node:
    kubectl label node worker-1 monitoring-    # removes the label → pod deleted
    
    # Opt-out: run on all nodes EXCEPT those with a specific label
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: exclude-monitoring
              operator: DoesNotExist   # run on nodes that do NOT have this label
    
    # Add label to exclude a node:
    kubectl label node worker-3 exclude-monitoring=true  # → pod deleted from worker-3
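
    To make the opt-in and opt-out semantics concrete, here is a small Python model of how matchExpressions are evaluated against node labels. It covers only the four operators used on this page (In, NotIn, Exists, DoesNotExist); the function name and the sample node labels are illustrative assumptions.

```python
def matches(labels, expressions):
    """True if the node labels satisfy every expression (expressions are ANDed)."""
    for expr in expressions:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if op == "In":
            ok = labels.get(key) in values
        elif op == "NotIn":
            ok = labels.get(key) not in values  # an absent label also passes
        elif op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"operator not modelled here: {op}")
        if not ok:
            return False
    return True


nodes = {
    "worker-1": {"kubernetes.io/os": "linux"},
    "worker-3": {"kubernetes.io/os": "linux", "exclude-monitoring": "true"},
}
opt_out = [{"key": "exclude-monitoring", "operator": "DoesNotExist"}]
selected = sorted(name for name, labels in nodes.items() if matches(labels, opt_out))
print(selected)  # ['worker-1']
```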

    Tolerations

    Default Tolerations Injected by DaemonSet Controller

    The DaemonSet controller automatically injects several tolerations into every DaemonSet pod to ensure agent availability during node problems:

    Auto-Injected Toleration | Effect | Purpose
    -------------------------|--------|--------
    node.kubernetes.io/not-ready:NoExecute | Tolerated indefinitely (no tolerationSeconds) | Pod is not evicted while the node is not-ready (for example, during a network blip)
    node.kubernetes.io/unreachable:NoExecute | Tolerated indefinitely (no tolerationSeconds) | Pod is not evicted while the node is unreachable
    node.kubernetes.io/disk-pressure:NoSchedule | Tolerated | Monitoring agents still run on disk-pressured nodes
    node.kubernetes.io/memory-pressure:NoSchedule | Tolerated | Monitoring agents still run on memory-pressured nodes
    node.kubernetes.io/pid-pressure:NoSchedule | Tolerated | PID exhaustion doesn't block infrastructure agents
    node.kubernetes.io/unschedulable:NoSchedule | Tolerated | DaemonSet pods can be created on cordoned nodes
    node.kubernetes.io/network-unavailable:NoSchedule | Tolerated (hostNetwork pods only) | CNI plugin DaemonSets can run before the network is ready
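
    The matching rule behind the table can be modelled briefly. The sketch below is a simplified reading of toleration semantics (an empty key with operator Exists matches every taint, an empty effect matches every effect); the function names are hypothetical and PreferNoSchedule handling is left out.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated(taints, tolerations):
    """Return the taints that no toleration in the list matches."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


control_plane = {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"}
gpu = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
gpu_only = [{"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}]

print(untolerated([control_plane, gpu], gpu_only))                # control-plane taint remains
print(untolerated([control_plane, gpu], [{"operator": "Exists"}]))  # []
```

    A non-empty result for a node is the usual explanation for a missing DaemonSet pod there.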

    Control-Plane Tolerations

    # Control-plane nodes carry a taint that blocks regular pods:
    # node-role.kubernetes.io/control-plane:NoSchedule (1.24+)
    # node-role.kubernetes.io/master:NoSchedule (legacy; applied by kubeadm through 1.24, removed in 1.25)
    
    # To run a DaemonSet on control-plane nodes (e.g., logging agent, monitoring):
    tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/master    # for compatibility with older clusters
      operator: Exists
      effect: NoSchedule
    
    # Example use cases requiring control-plane coverage:
    # - Audit log collector (reads from /var/log/kubernetes/audit.log)
    # - etcd backup agent
    # - Node-level security scanner
    # - Falco (eBPF syscall monitor)

    Custom Taint Tolerations

    # Node tainted for GPU workloads only:
    # kubectl taint node gpu-node-1 dedicated=gpu:NoSchedule
    
    # DaemonSet for GPU metrics (DCGM exporter) — must tolerate the GPU taint:
    tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule
    
    # Tolerate ANY taint (run on ALL nodes regardless of taints):
    tolerations:
    - operator: Exists   # matches any key, value, effect — use with caution

    Update Strategies

    RollingUpdate (Default)

    DaemonSet RollingUpdate terminates old pods and starts new ones one node at a time (by default). Unlike Deployment, there is no ReplicaSet intermediary — the controller directly manages the per-node pod replacement.

    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 1     # default; one node's pod down at a time
                              # absolute: 2 = two nodes simultaneously updated
                              # percentage: "10%" = 10% of nodes simultaneously
        maxSurge: 0           # default; no extra pod during update
                              # maxSurge: 1 = create new pod BEFORE deleting old
                              # requires node to have capacity for two pods briefly

    RollingUpdate timeline (10 nodes, maxUnavailable=1, maxSurge=0):

      node-1: [old] → delete old → [new]  (wait Ready + minReadySeconds)
      node-2:          → [old] → delete → [new]
      node-3:                    → ...
      ...
      One node updated at a time; monitoring coverage never drops by more than 1 node.

    With maxUnavailable=3:

      node-1: [old] → [new] ┐
      node-2: [old] → [new] ├ simultaneously
      node-3: [old] → [new] ┘
      node-4: [old] → [new] ┐
      ...                   ├ next batch after above Ready
      Faster rollout; 3 nodes temporarily without agent coverage.
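
    As a rough way to reason about rollout pacing, the following Python sketch turns maxUnavailable (absolute or percentage) into a pod count and splits nodes into idealized batches. It assumes percentages round up, as the DaemonSet API reference describes. The real controller refills slots as pods become available rather than waiting for whole batches, so treat the batches as an approximation. Function names are hypothetical.

```python
def resolve_max_unavailable(value, desired):
    """Turn an int or 'N%' string into an absolute pod count (percent rounds up)."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return -(-desired * percent // 100)  # ceiling division
    return int(value)


def rollout_batches(nodes, max_unavailable):
    """Split nodes into consecutive groups updated at the same time."""
    size = max(1, resolve_max_unavailable(max_unavailable, len(nodes)))
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


nodes = [f"node-{i}" for i in range(1, 11)]
print(resolve_max_unavailable("10%", 25))                  # 3
print([len(b) for b in rollout_batches(nodes, 3)])         # [3, 3, 3, 1]
print(len(rollout_batches(nodes, 1)))                      # 10
```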

    maxSurge for Zero-Downtime Agent Updates

    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 0   # never remove old pod before new is ready
        maxSurge: 1         # create new pod alongside old; delete old after new is Ready
        # Both old and new pod run on same node briefly (double the per-node cost)
        # Requires node to have capacity for 2× the pod's resource requests temporarily
        # Essential for: network agents (brief coverage gap is unacceptable),
        #                security scanners (must not miss any window)
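
    The two rollingUpdate fields are not independent: as far as I know, API validation accepts a DaemonSet only when exactly one of maxUnavailable and maxSurge is non-zero. A minimal sketch of that rule for integer values follows; the function name and the message strings are illustrative, not copied from the API server.

```python
def check_rolling_update(max_unavailable, max_surge):
    """Return None if the pair looks valid, otherwise a short reason string."""
    if max_unavailable and max_surge:
        return "maxSurge may not be set when maxUnavailable is non-zero"
    if not max_unavailable and not max_surge:
        return "maxUnavailable cannot be 0 when maxSurge is 0"
    return None


print(check_rolling_update(1, 0))  # None (classic delete-then-create rollout)
print(check_rolling_update(0, 1))  # None (surge: create first, then delete)
print(check_rolling_update(0, 0))  # rejected: nothing could ever be replaced
print(check_rolling_update(1, 1))  # rejected: the two modes cannot be combined
```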

    OnDelete Strategy

    updateStrategy:
      type: OnDelete
    # Pods are only updated when manually deleted
    # Use for: CNI plugins (network disruption during update must be controlled),
    #           critical security agents (manual per-node validation required)
    
    # Workflow:
    # 1. Update DaemonSet spec (kubectl set image or kubectl apply)
    # 2. Manually delete pod on specific node to trigger update:
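    kubectl get pods -n logging -o wide --field-selector spec.nodeName=worker-1   # find the pod running on worker-1
    kubectl delete pod POD_NAME -n logging   # pod names end in a generated suffix, not the node name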
    # 3. Verify new pod is healthy before proceeding to next node:
    kubectl get pods -n logging -o wide | grep worker-1
    # 4. Continue node by node

    Host Namespaces

    Infrastructure agents often need privileged access to the node. The following patterns cover the most common host-access requirements while keeping the security surface as narrow as possible.

    hostNetwork for Network Agents

    # CNI plugins, network monitoring agents
    spec:
      template:
        spec:
          hostNetwork: true        # pod uses node's network namespace
          dnsPolicy: ClusterFirstWithHostNet  # REQUIRED with hostNetwork to still resolve cluster DNS
    
    # Effect: pod sees all node interfaces (eth0, lo, tunnel interfaces)
    # Pod IP is the node IP (not a pod CIDR IP)
    # Port conflicts: if node already uses port 9100, the pod will fail to bind

    hostPID for Process-Level Agents

    # eBPF-based tracing, process monitoring (Falco, Pixie, Tetragon)
    spec:
      template:
        spec:
          hostPID: true    # pod sees all processes on the node via /proc
          containers:
          - name: falco
            image: falcosecurity/falco:0.37.0
            securityContext:
              privileged: true    # required for kernel module / eBPF loading
            volumeMounts:
            - name: dev
              mountPath: /dev
            - name: proc
              mountPath: /host/proc
              readOnly: true
          volumes:
          - name: dev
            hostPath: {path: /dev}
          - name: proc
            hostPath: {path: /proc}

    hostPath Volume Patterns

    # Log collection (Fluentd, Filebeat, Vector)
    volumes:
    - name: varlog
      hostPath:
        path: /var/log
    - name: varlibdockercontainers
      hostPath:
        path: /var/lib/docker/containers   # container log symlink targets
    
    # For containerd and other CRI runtimes (kubelet keeps pod logs here):
    - name: containerd-logs
      hostPath:
        path: /var/log/pods
    
    # Node filesystem inspection (node-exporter, security scanners)
    - name: rootfs
      hostPath:
        path: /
        type: Directory   # Directory | File | DirectoryOrCreate | FileOrCreate | Socket

    hostPort

    hostPort maps a port on the node's network interfaces to a container port (the two numbers may differ, although they usually match). The port is reachable at NODE_IP:PORT without needing a Service. It is an alternative to hostNetwork: true when only specific ports need to be exposed.

    containers:
    - name: node-exporter
      ports:
      - containerPort: 9100
        hostPort: 9100          # bind to node IP:9100
        protocol: TCP
    
    # Prometheus scrape config targeting node IPs directly:
    # - targets: ['node-1:9100', 'node-2:9100', ...]
    # OR use a Service of type ClusterIP with targetPort: 9100 (preferred)

    hostPort Limits Pod Scheduling

    hostPort reserves the port on the node. Only one pod can use a given hostPort per node (enforced by the scheduler). For DaemonSets this is normally fine since one pod runs per node, but it rules out maxSurge: a surge pod cannot be scheduled while the old pod still holds the port. Mixing hostPort DaemonSet pods with hostPort application pods that request the same port will also cause scheduling conflicts. Prefer a ClusterIP Service to expose DaemonSet pod metrics/APIs rather than hostPort.
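
    The one-claim-per-node rule is easy to check offline. Below is a small Python sketch that reports hostPort collisions from a list of pod descriptions; the input shape (node, name, hostPorts, protocol) and the pod names are hypothetical simplifications of the real pod spec.

```python
from collections import defaultdict


def host_port_conflicts(pods):
    """Map (node, hostPort, protocol) -> pod names, for keys claimed by 2+ pods."""
    claims = defaultdict(list)
    for pod in pods:
        for port in pod["hostPorts"]:
            key = (pod["node"], port, pod.get("protocol", "TCP"))
            claims[key].append(pod["name"])
    return {key: names for key, names in claims.items() if len(names) > 1}


pods = [
    {"name": "node-exporter-abcde", "node": "worker-1", "hostPorts": [9100]},
    {"name": "legacy-agent-fghij", "node": "worker-1", "hostPorts": [9100]},
    {"name": "node-exporter-klmno", "node": "worker-2", "hostPorts": [9100]},
]
print(host_port_conflicts(pods))
# {('worker-1', 9100, 'TCP'): ['node-exporter-abcde', 'legacy-agent-fghij']}
```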

    Real-World DaemonSet Examples

    Prometheus node-exporter

    # Minimal production-grade node-exporter DaemonSet
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: node-exporter
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
      template:
        metadata:
          labels:
            app.kubernetes.io/name: node-exporter
        spec:
          hostNetwork: false
          hostPID: false
          nodeSelector:
            kubernetes.io/os: linux
          tolerations:
          - operator: Exists     # run on all nodes including control-plane
          priorityClassName: system-cluster-critical
          serviceAccountName: node-exporter
          securityContext:
            runAsNonRoot: true
            runAsUser: 65534
          containers:
          - name: node-exporter
            image: prom/node-exporter:v1.7.0
            args: ["--path.rootfs=/host", "--path.procfs=/host/proc", "--path.sysfs=/host/sys"]
            ports:
            - containerPort: 9100
            resources:
              requests: {cpu: 50m, memory: 64Mi}
              limits: {memory: 128Mi}
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              capabilities: {drop: [ALL]}
            volumeMounts:
            - {name: root, mountPath: /host, readOnly: true, mountPropagation: HostToContainer}
            - {name: proc, mountPath: /host/proc, readOnly: true}
            - {name: sys, mountPath: /host/sys, readOnly: true}
          volumes:
          - {name: root, hostPath: {path: /}}
          - {name: proc, hostPath: {path: /proc}}
          - {name: sys, hostPath: {path: /sys}}

    Fluentd Log Collector

    containers:
    - name: fluentd
      image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8-1
      env:
      - name: FLUENT_ELASTICSEARCH_HOST
        value: elasticsearch.logging.svc.cluster.local
      - name: FLUENT_ELASTICSEARCH_PORT
        value: "9200"
      - name: K8S_NODE_NAME      # inject node name for log enrichment
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
      resources:
        requests: {cpu: 100m, memory: 200Mi}
        limits: {memory: 500Mi}
      volumeMounts:
      - name: varlog
        mountPath: /var/log
      - name: pods-logs
        mountPath: /var/log/pods
        readOnly: true
      - name: fluentd-config
        mountPath: /fluentd/etc/fluent.conf
        subPath: fluent.conf
    volumes:
    - {name: varlog, hostPath: {path: /var/log}}
    - {name: pods-logs, hostPath: {path: /var/log/pods}}
    - {name: fluentd-config, configMap: {name: fluentd-config}}

    CNI Plugin (Calico node)

    # CNI plugins require hostNetwork + privileged + control-plane tolerations
    spec:
      template:
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          tolerations:
          - operator: Exists    # run on every node including control-plane
          priorityClassName: system-node-critical
          initContainers:
          - name: install-cni
            image: calico/cni:v3.27.0
            command: ["/opt/cni/bin/install"]
            volumeMounts:
            - name: cni-bin-dir
              mountPath: /host/opt/cni/bin    # mount host dirs under /host so the image's own /opt/cni/bin/install is not hidden
            - name: cni-net-dir
              mountPath: /host/etc/cni/net.d
          containers:
          - name: calico-node
            image: calico/node:v3.27.0
            securityContext:
              privileged: true     # required: manages iptables, routes, network interfaces
            env:
            - name: NODENAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            volumeMounts:
            - name: lib-modules
              mountPath: /lib/modules
              readOnly: true
            - name: var-run-calico
              mountPath: /var/run/calico
            - name: cni-bin-dir
              mountPath: /opt/cni/bin
          volumes:
          - {name: lib-modules, hostPath: {path: /lib/modules}}
          - {name: var-run-calico, hostPath: {path: /var/run/calico}}
          - {name: cni-bin-dir, hostPath: {path: /opt/cni/bin}}
          - {name: cni-net-dir, hostPath: {path: /etc/cni/net.d}}

    CSI Node Plugin

    # CSI node plugins run as DaemonSets to provide node-local volume operations
    # (NodeStageVolume, NodePublishVolume, NodeGetVolumeStats)
    containers:
    - name: ebs-plugin
      image: public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.28.0
      args: ["node", "--endpoint=$(CSI_ENDPOINT)", "--logtostderr", "--v=5"]
      env:
      - name: CSI_ENDPOINT
        value: unix:///var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock
      securityContext:
        privileged: true   # required: mount/unmount block devices, create device nodes
      volumeMounts:
      - name: kubelet-dir
        mountPath: /var/lib/kubelet
        mountPropagation: Bidirectional  # propagate mounts back to host
      - name: plugin-dir
        mountPath: /var/lib/kubelet/plugins/ebs.csi.aws.com/
      - name: device-dir
        mountPath: /dev
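    - name: node-driver-registrar
      image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.0
      args:
      - --csi-address=$(ADDRESS)
      - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
      env:
      - name: ADDRESS                 # driver socket as seen inside this container
        value: /csi/csi.sock
      - name: DRIVER_REG_SOCK_PATH    # the same socket as seen by kubelet on the host
        value: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock
      volumeMounts:
      - name: plugin-dir
        mountPath: /csi
      - name: registration-dir
        mountPath: /registration
    volumes:
    - {name: kubelet-dir, hostPath: {path: /var/lib/kubelet, type: Directory}}
    - {name: plugin-dir, hostPath: {path: /var/lib/kubelet/plugins/ebs.csi.aws.com/, type: DirectoryOrCreate}}
    - {name: registration-dir, hostPath: {path: /var/lib/kubelet/plugins_registry/, type: Directory}}
    - {name: device-dir, hostPath: {path: /dev, type: Directory}}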

    Node Join and Leave Lifecycle

    Node lifecycle and DaemonSet response:

    New node joins cluster:
      1. Node registers with API server (kubelet --register-node)
      2. DaemonSet controller watches Node objects via LIST/WATCH
      3. Controller detects that the new node matches the DaemonSet's node selection and tolerations
      4. Controller creates a Pod with node affinity for new-node
      5. kube-scheduler binds the Pod to new-node
      6. Kubelet on new-node picks up the pod and starts the container
      Typical latency: 2-10 seconds after node registers

    Node drain / cordon:
      kubectl cordon node-1
        → node marked Unschedulable
        → existing DaemonSet pod continues running (not evicted)
        → new DaemonSet pods CAN still be created (they tolerate the unschedulable taint)

      kubectl drain node-1
        → evicts regular pods (respecting PDB)
        → refuses to proceed while DaemonSet-managed pods are present
        → add --ignore-daemonsets to skip DaemonSet pods (they keep running)
        → add --delete-emptydir-data to allow evicting pods that use emptyDir (their data is lost)

    Node removed from cluster:
      Node object deleted → DaemonSet pod garbage collected
      No manual cleanup required

    kubectl drain --ignore-daemonsets

    Without --ignore-daemonsets, kubectl drain exits with an error if DaemonSet pods are present (which they almost always are). Always include this flag during node maintenance. With the flag, drain skips DaemonSet pods: they are left running on the draining node and do not need to be recreated when the node is uncordoned. They are removed only when the node is deleted, when the node stops matching the DaemonSet, or when the DaemonSet itself is updated or deleted. If the node is replaced, the controller creates a pod on the replacement as soon as it joins.
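
    The drain behavior described here can be summarized as a tiny decision function. This Python sketch is a model for reasoning only: the owner_kind field, the pod names, and the error text are hypothetical stand-ins, and PodDisruptionBudgets and emptyDir checks are left out.

```python
def plan_drain(pods, ignore_daemonsets=False):
    """Return the pod names a drain would evict, or raise if it would refuse."""
    daemon_pods = [p["name"] for p in pods if p.get("owner_kind") == "DaemonSet"]
    if daemon_pods and not ignore_daemonsets:
        raise RuntimeError(f"DaemonSet-managed pods present: {daemon_pods}")
    # DaemonSet pods are skipped, not evicted: they keep running on the node.
    return [p["name"] for p in pods if p.get("owner_kind") != "DaemonSet"]


pods_on_node = [
    {"name": "web-7d9f", "owner_kind": "ReplicaSet"},
    {"name": "node-exporter-abcde", "owner_kind": "DaemonSet"},
]
print(plan_drain(pods_on_node, ignore_daemonsets=True))  # ['web-7d9f']
```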

    Resource Sizing for DaemonSet Pods

    DaemonSet pods consume resources from every node they run on. Over-provisioning DaemonSet requests reduces the available allocatable resources for application pods cluster-wide. Under-provisioning causes eviction during node pressure.

    # Node allocatable = node capacity - system-reserved - kube-reserved - eviction threshold
    # DaemonSet pods consume from allocatable on EVERY node
    
    # Example: 100-node cluster, node-exporter requests 50m CPU / 64Mi memory
    # Total cluster-wide cost: 100 × (50m CPU + 64Mi) = 5000m CPU + 6400Mi (6.25Gi) memory
    # This is a "hidden tax" that must be accounted for in capacity planning
    
    # Resource sizing guidelines for common agents:
    # node-exporter:    50m CPU / 64Mi memory  (minimal, read-only host metrics)
    # fluentd:          100m CPU / 200Mi        (scales with log volume; add VPA)
    # calico-node:      100m CPU / 256Mi        (network data path; critical path)
    # datadog-agent:    200m CPU / 512Mi        (full observability; significant cost)
    # falco:            100m CPU / 512Mi        (eBPF kernel overhead varies)
    # CSI node plugin:  50m CPU / 128Mi         (per-node volume operations)
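
    The cluster-wide cost is simple multiplication, and scripting it makes it easy to compare agents. A minimal Python sketch follows; the function name is hypothetical and the inputs are the per-pod requests in millicores and Mi.

```python
def daemonset_footprint(node_count, cpu_millicores, memory_mib):
    """Return (total CPU cores, total memory in GiB) requested cluster-wide."""
    return node_count * cpu_millicores / 1000, node_count * memory_mib / 1024


cores, gib = daemonset_footprint(100, 50, 64)      # node-exporter on 100 nodes
print(cores, gib)                                  # 5.0 6.25

cores, gib = daemonset_footprint(100, 200, 512)    # a heavier agent on 100 nodes
print(cores, gib)                                  # 20.0 50.0
```

    Note that 100 × 64Mi is 6400Mi, which is 6.25Gi.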

    DaemonSet vs Deployment vs Static Pods

    Aspect | DaemonSet | Deployment | Static Pod
    -------|-----------|------------|-----------
    One per node | Yes (automatic) | No (replica count) | Yes (manual per node)
    Managed by | DaemonSet controller | Deployment controller | kubelet directly
    API server required | Yes | Yes | No (kubelet reads local file)
    Auto new-node coverage | Yes | No | No
    Rolling update | Yes (RollingUpdate/OnDelete) | Yes (RollingUpdate/Recreate) | No (manual file update per node)
    kubectl visibility | Yes (appears in kubectl get pods) | Yes | Yes (mirror pod in API)
    Survives API server outage | No (controller needs API) | No | Yes (kubelet manages locally)
    Use case | All infrastructure agents | Stateless applications | Control-plane components (etcd, apiserver) only

    Operational Commands

    # Check DaemonSet rollout status
    kubectl rollout status ds/node-exporter -n monitoring
    
    # Watch pod replacement during RollingUpdate
    kubectl get pods -n monitoring -l app=node-exporter -o wide -w
    
    # Check desired vs ready vs available counts
    kubectl get ds node-exporter -n monitoring
    # NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR
    # node-exporter   10        10        9       8            9           kubernetes.io/os=linux
    
    # Force rollout restart (without spec change)
    kubectl rollout restart ds/node-exporter -n monitoring
    
    # Check which nodes have/don't have the DaemonSet pod
    kubectl get nodes -o wide
    kubectl get pods -n monitoring -l app=node-exporter -o wide
    # Compare: any node without a pod = targeting mismatch or toleration missing
    
    # Update image
    kubectl set image ds/node-exporter node-exporter=prom/node-exporter:v1.8.0 -n monitoring
    
    # Rollback
    kubectl rollout undo ds/node-exporter -n monitoring
    
    # Get ControllerRevision history
    kubectl get controllerrevision -n monitoring -l app=node-exporter
    
    # Describe to see events
    kubectl describe ds node-exporter -n monitoring
    
    # Check DaemonSet status fields
    kubectl get ds node-exporter -n monitoring -o jsonpath='{.status}' | jq .
    # {
    #   "currentNumberScheduled": 10,   # nodes where pod is running
    #   "desiredNumberScheduled": 10,   # nodes that should have a pod
    #   "numberAvailable": 9,           # pods passing readiness
    #   "numberMisscheduled": 0,        # pods running on ineligible nodes
    #   "numberReady": 9,
    #   "numberUnavailable": 1,         # pods not yet ready
    #   "updatedNumberScheduled": 8     # pods on latest revision
    # }
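
    Comparing the node list with the pod list by eye gets tedious on larger clusters. The sketch below does the comparison on the parsed JSON that kubectl get nodes -o json and kubectl get pods -o json return; the function name is hypothetical and the sample data is invented for illustration.

```python
def nodes_without_pod(node_list, pod_list):
    """Return the names of nodes that have no pod from the given pod list."""
    nodes = {n["metadata"]["name"] for n in node_list["items"]}
    covered = {p["spec"].get("nodeName") for p in pod_list["items"]}
    return sorted(nodes - covered)


# In practice these would come from json.loads() over the kubectl output.
node_list = {"items": [{"metadata": {"name": n}} for n in ("worker-1", "worker-2", "cp-1")]}
pod_list = {"items": [
    {"metadata": {"name": "node-exporter-abcde"}, "spec": {"nodeName": "worker-1"}},
    {"metadata": {"name": "node-exporter-fghij"}, "spec": {"nodeName": "worker-2"}},
]}
print(nodes_without_pod(node_list, pod_list))  # ['cp-1']
```

    Any node in the result is worth checking for a label mismatch or an untolerated taint.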

    Metrics, Alerts, and Runbooks

    Key Metrics

    Metric | Source | Alert Condition
    -------|--------|----------------
    kube_daemonset_status_desired_number_scheduled | kube-state-metrics | Baseline: total eligible nodes
    kube_daemonset_status_number_ready | kube-state-metrics | < desired for > 5m
    kube_daemonset_status_number_misscheduled | kube-state-metrics | > 0 (pods on wrong nodes; selector changed?)
    kube_daemonset_status_updated_number_scheduled | kube-state-metrics | < desired for > 30m → rollout stalled
    kube_daemonset_status_number_unavailable | kube-state-metrics | > 0 for > 10m (node unreachable or pod crash-looping)

    Alerting Rules

    groups:
    - name: daemonset-health
      rules:
      - alert: DaemonSetNotFullyScheduled
        expr: |
          kube_daemonset_status_desired_number_scheduled
            != kube_daemonset_status_current_number_scheduled
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} not fully scheduled"
    
      - alert: DaemonSetRolloutStuck
        expr: |
          kube_daemonset_status_updated_number_scheduled
            != kube_daemonset_status_desired_number_scheduled
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} rollout not complete after 30 minutes"
    
      - alert: DaemonSetMisscheduled
        expr: kube_daemonset_status_number_misscheduled > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has pods on ineligible nodes"
    
      - alert: DaemonSetPodNotReady
        expr: |
          kube_daemonset_status_number_unavailable > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has {{ $value }} unavailable pods"
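
    To sanity-check the rule expressions against a concrete status object, the conditions can be mirrored in Python. This sketch ignores the for: durations and only reports which expressions would currently be true; the function name is hypothetical, and the sample status reuses the counts from the status example under Operational Commands.

```python
def firing_conditions(status):
    """Return the alert names whose expression is true for this DaemonSet status."""
    desired = status["desiredNumberScheduled"]
    alerts = []
    if desired != status["currentNumberScheduled"]:
        alerts.append("DaemonSetNotFullyScheduled")
    if status.get("updatedNumberScheduled", 0) != desired:
        alerts.append("DaemonSetRolloutStuck")
    if status.get("numberMisscheduled", 0) > 0:
        alerts.append("DaemonSetMisscheduled")
    if status.get("numberUnavailable", 0) > 0:
        alerts.append("DaemonSetPodNotReady")
    return alerts


status = {
    "currentNumberScheduled": 10, "desiredNumberScheduled": 10,
    "numberAvailable": 9, "numberMisscheduled": 0, "numberReady": 9,
    "numberUnavailable": 1, "updatedNumberScheduled": 8,
}
print(firing_conditions(status))  # ['DaemonSetRolloutStuck', 'DaemonSetPodNotReady']
```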

    Runbooks

    Pod Missing from Node

    Check: kubectl get ds NAME -n NS — compare DESIRED vs CURRENT. If CURRENT < DESIRED: check node labels match nodeSelector; check taints on missing node (kubectl describe node NAME) and ensure DaemonSet has matching toleration. Check numberMisscheduled — may indicate stale pods on nodes that were relabeled.

    Rollout Stuck

    Check which pod is not updating: compare updatedNumberScheduled vs desiredNumberScheduled. Get pods and check age/image: kubectl get pods -n NS -l app=NAME -o wide. If a pod is stuck NotReady: describe pod for probe failure events. If OnDelete strategy: check if pods were manually deleted. Fix root cause and the controller retries.

    Pod CrashLooping on Specific Node

    Logs: kubectl logs POD -n NS --previous. Common causes for node-specific crashes: hostPath volume doesn't exist on that node (DirectoryOrCreate vs Directory), kernel module not available (eBPF agents), node OS version incompatibility. Check if it's isolated to one node vs all nodes to identify node-specific vs image issues.

    Misscheduled Pods

    numberMisscheduled > 0 means pods run on nodes they shouldn't. Happens when node labels change after pod creation. The controller will eventually delete misscheduled pods. To force immediate cleanup: delete the misscheduled pods manually (kubectl delete pod NAME). They will not be recreated on that node if it no longer matches.

    Node Drain Blocked by DaemonSet

    kubectl drain node-1 exits with error about DaemonSet pods. Always use: kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data. The --ignore-daemonsets flag skips DaemonSet pods during eviction. DaemonSet pods remain running until the node is deleted (cordoning alone does not stop them); they are not impacted by drain itself.

    Best Practices

    1. Always set priorityClassName: system-node-critical for essential agents: node-exporter, CNI plugins, CSI node plugins, and security agents must survive node pressure. Without a priority class, these pods can be evicted under memory or disk pressure or preempted by higher-priority pods, leaving nodes unmonitored or without network. Use system-cluster-critical for cluster-wide critical infrastructure.
    2. Use tolerations: [{operator: Exists}] for truly universal agents — monitoring agents and CNI plugins need to run on every node including control-plane, GPU-tainted, and spot nodes. Enumerate all expected taints or use the catch-all operator: Exists toleration. Missing tolerations are the most common reason a DaemonSet is not fully scheduled.
    3. Set nodeSelector: {kubernetes.io/os: linux} in mixed clusters — Windows nodes cannot run Linux containers. Without the OS selector, the DaemonSet will attempt to schedule Linux images on Windows nodes and fail with an ImagePullBackOff or container runtime error.
    4. Keep DaemonSet resource requests conservative but accurate — DaemonSet pods multiply across every node. A 100-node cluster with a DaemonSet requesting 200m CPU and 512Mi means 20 vCPUs and 50Gi memory reserved cluster-wide for that single DaemonSet. Profile actual usage with VPA recommendations in Off mode before setting final requests.
    5. Use maxSurge: 1, maxUnavailable: 0 for network-critical agents — a network agent that goes offline during its own rolling update can cause brief packet loss or missed connections on that node. The surge pattern (create new agent first, then delete old) ensures continuous coverage at the cost of briefly doubling the per-node resource usage.
    6. Use OnDelete for CNI plugin updates — updating a CNI plugin can briefly disrupt network connectivity on the node. OnDelete lets you control timing: schedule maintenance windows, drain workloads from the node first, then delete the old CNI pod to trigger the update, validate connectivity, and proceed to the next node.
    7. Inject node identity via Downward API, not hostname lookups — use spec.nodeName via fieldRef to get the node name rather than relying on hostname or DNS resolution. The node name is stable and available immediately; DNS may not resolve correctly especially during init.
    8. Audit DaemonSet pod security contexts regularly — DaemonSet pods are the most likely to run privileged: true or with hostPID: true. These settings are often necessary (CNI, eBPF agents) but must be the minimum required. Regularly review whether privileged can be replaced with specific capabilities, and whether readOnlyRootFilesystem: true can be applied with explicit writable mounts.
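
    For practice 7, the agent side of the Downward API pattern is just an environment lookup. A minimal Python sketch follows, assuming the variable is named K8S_NODE_NAME as in the Fluentd example; the function name and the error text are hypothetical.

```python
import os


def node_name(env=os.environ):
    """Return the node name injected through the Downward API, or fail loudly."""
    name = env.get("K8S_NODE_NAME")
    if not name:
        # Failing fast is deliberate: falling back to the hostname would
        # reintroduce the lookup this practice is meant to avoid.
        raise RuntimeError("K8S_NODE_NAME is not set; add a fieldRef for spec.nodeName")
    return name


print(node_name({"K8S_NODE_NAME": "worker-1"}))  # worker-1
```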