Kubernetes Events — Observability

Kubernetes Events

The fourth observability signal: event structure, retention, kubectl usage, event exporters, kubernetes-event-exporter, custom event recording, alerting on events, and production event pipeline design.

What Are Kubernetes Events

Kubernetes Events are first-class API objects (core/v1 Event) written to the API server by controllers, the kubelet, the scheduler, and other system components. They describe what happened to a Kubernetes object — a pod was scheduled, an image pull failed, a container was OOMKilled, a deployment scaled up.

Events are the "fourth observability signal" alongside metrics, logs, and traces. They are unique because they are generated automatically by the cluster itself with no application code changes. Most significant state transitions handled by the control plane and the kubelet produce an event.

Events Are Not Logs — They Expire

Events are stored in etcd alongside other API objects, not in a log store. By default, the API server expires events after 1 hour (configurable via --event-ttl on kube-apiserver). In high-activity clusters, rate limiting and aggregation in the event recorder can also merge or drop events before they are written. If you need event history beyond 1 hour, for incident retrospectives, auditing, or trend alerting, you must run an event exporter that streams events to a durable backend as they arrive.

Events vs Audit Logs vs Application Logs

| Aspect | Events | Audit Logs | Application Logs |
|---|---|---|---|
| Source | K8s controllers, kubelet, scheduler | kube-apiserver (all API calls) | Application containers |
| Describes | Object state transitions | Who did what via API | Application behavior |
| Storage | etcd (1h TTL) | Files / webhook backend | /var/log/containers/ |
| Access | kubectl get events / API | Log files / SIEM | kubectl logs / Loki |
| Volume | Low-Medium (state changes only) | High (every API call) | High |
| App changes needed | None (for system events) | None | Yes (to emit structured logs) |

Event Object Structure

kubectl get event -n production -o yaml | head -80
apiVersion: v1
kind: Event
metadata:
  name: order-service-7d5f9-xk2pq.17c4a3f0b8e2d1a5   # pod-name.hex-suffix
  namespace: production
  creationTimestamp: "2024-01-15T10:23:45Z"

# The resource this event is about
involvedObject:
  apiVersion: v1
  kind: Pod
  name: order-service-7d5f9-xk2pq
  namespace: production
  uid: a1b2c3d4-e5f6-7890-abcd-ef1234567890
  resourceVersion: "98765"
  fieldPath: spec.containers{order-service}   # specific container, if applicable

# Classification
reason: OOMKilling             # machine-readable reason (CamelCase)
message: "Memory limit reached: Killing container with allowance 256Mi, used 256Mi"
type: Warning                  # Normal or Warning

# Deduplication and aggregation
count: 3                       # how many times this event has occurred
firstTimestamp: "2024-01-15T09:45:00Z"
lastTimestamp:  "2024-01-15T10:23:45Z"

# Who generated this event
source:
  component: kubelet
  host: ip-10-0-1-45.us-east-1.compute.internal

# v1.19+ fields (EventSeries for high-frequency events)
series:
  count: 3
  lastObservedTime: "2024-01-15T10:23:45.000000Z"

# v1.19+ reporting fields (replaces source)
reportingComponent: kubelet
reportingInstance: ip-10-0-1-45.us-east-1.compute.internal

# Related object (action target, if different from involvedObject)
related:
  apiVersion: v1
  kind: Node
  name: ip-10-0-1-45.us-east-1.compute.internal

action: Killing                # verb describing the action taken
eventTime: "2024-01-15T10:23:45.000000Z"   # MicroTime (v1.19+)

Key Fields Reference

| Field | Type | Description | Notes |
|---|---|---|---|
| involvedObject | ObjectReference | The K8s object this event is about | kind + name + namespace + uid |
| reason | string | Short CamelCase reason code | Machine-readable; used for filtering and alerting |
| message | string | Human-readable description | May contain variable data (pod names, resource values) |
| type | string | Normal or Warning | Warning events indicate problems |
| count | int | Occurrence count (deduplicated) | Identical events within 10min window are merged |
| firstTimestamp | Time | First occurrence time | Deprecated in v1.19+ in favor of eventTime |
| lastTimestamp | Time | Most recent occurrence | This is what you see in the LAST SEEN column of kubectl get events |
| source | EventSource | Component + host that generated event | Deprecated in v1.19+ in favor of reportingComponent |
| reportingComponent | string | Component name (v1.19+) | e.g., kubelet, deployment-controller |
| action | string | Action taken | e.g., Killing, Pulling, Scheduled |
| series | EventSeries | High-frequency event aggregation (v1.19+) | Replaces count for rapid-fire events |
| related | ObjectReference | Secondary object involved | e.g., Node when killing a pod |

Event Deduplication

The Kubernetes event recorder deduplicates events that share the same involvedObject, reason, message, source, and type within a 10-minute window. Duplicate events increment count and update lastTimestamp rather than creating new Event objects. This prevents etcd flooding from hot loops but means you lose individual timestamps for high-frequency events.

Event Types and Sources

Event Types

Normal — Informational

Routine state transitions that are expected. Examples: pod scheduled, container started, image pulled, deployment scaled, volume bound. These are part of normal cluster operation.

In production monitoring, Normal events are useful for change tracking (when did a deployment roll out?) but rarely require alerting.

Warning — Requires Attention

Indicates a problem or degraded state. Examples: OOMKilling, BackOff (crash loop), FailedScheduling, FailedMount, Evicted, Failed (for example, a failed image pull). These are actionable signals.

Warning events should flow to your alerting pipeline and trigger investigation. High-frequency Warning events indicate cluster health problems.

Event Sources (Reporting Components)

| Component | Typical Events Generated |
|---|---|
| kubelet | Pulling/Pulled/Failed image, Created/Started/Killing container, OOMKilling, Evicting pod, VolumeMount errors, NodeNotReady |
| kube-scheduler | Scheduled pod to node, FailedScheduling (insufficient resources, affinity mismatch, taints) |
| deployment-controller | ScalingReplicaSet, Scaled up/down ReplicaSet |
| replicaset-controller | SuccessfulCreate/Delete pod, FailedCreate pod |
| statefulset-controller | SuccessfulCreate/Delete pod |
| job-controller | SuccessfulCreate, BackoffLimitExceeded, Completed |
| cronjob-controller | SuccessfulCreate job, MissSchedule, TooManyMissedTimes |
| horizontal-pod-autoscaler | SuccessfulRescale, FailedGetScale, DesiredReplicas, ScalingLimited |
| endpoint-slice-controller | FailedToCreateEndpointSlices, SuccessfulCreate/Update |
| node-controller | NodeNotReady, NodeReady, RegisteredNode, RemovingNode |
| volume-controller | FailedBinding, VolumeReleased, ProvisioningSucceeded/Failed |
| default-scheduler | Preempting, Scheduled, FailedScheduling |
| Custom controllers/operators | Reconciled, SyncFailed, Updated, any custom reason |

kubectl events Usage

# List all events in a namespace (server order; not guaranteed to be chronological)
kubectl get events -n production

# Sort chronologically (oldest first, most recent last)
kubectl get events -n production --sort-by='.lastTimestamp'

# Only Warning events
kubectl get events -n production --field-selector type=Warning

# Events for a specific pod
kubectl get events -n production --field-selector involvedObject.name=order-service-7d5f9-xk2pq

# Events for a specific object kind (Node events are recorded in the default namespace, so search all namespaces)
kubectl get events -A --field-selector involvedObject.kind=Node

# Events by reason
kubectl get events -n production --field-selector reason=OOMKilling

# Cross-namespace (all namespaces)
kubectl get events -A --sort-by='.lastTimestamp'

# Only Warning events cluster-wide (most useful for incident triage)
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'

# Watch events as they arrive
kubectl get events -n production -w

# Describe a specific resource to see associated events inline
kubectl describe pod order-service-7d5f9-xk2pq -n production
# Events section at the bottom shows events for this pod

# Custom columns (useful for quick triage)
kubectl get events -n production \
  -o custom-columns='TIME:.lastTimestamp,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MSG:.message' \
  --sort-by='.lastTimestamp' | tail -30

# JSON output for scripting
kubectl get events -n production -o json | \
  jq '.items[] | select(.type=="Warning") | {time:.lastTimestamp, reason:.reason, object:.involvedObject.name, msg:.message}'

kubectl events (1.26+)

Kubernetes 1.26 added kubectl events as a dedicated subcommand with improved output: kubectl events -n production --for pod/order-service-7d5f9-xk2pq. It supports --for (filter by object), --types (filter by Warning/Normal), and outputs in a human-friendly table sorted by time by default.

# kubectl events (1.26+)
kubectl events -n production
kubectl events -n production --for pod/order-service-7d5f9-xk2pq
kubectl events -n production --types=Warning
kubectl events -n production --for deployment/order-service --types=Warning
kubectl events -A --types=Warning --no-headers | sort -k1

Important Event Reasons

The reason field is the most important field for programmatic filtering and alerting. These are the reasons that matter most for production operations:

Warning Events (Action Required)

OOMKilling

Container exceeded memory limit and was killed by the kernel OOM killer. Increase memory limit or find memory leak.

BackOff

Container in CrashLoopBackOff. Application is crashing on startup. Check container logs with --previous.

Evicted

Pod evicted due to node resource pressure (memory, disk, PID). Check eviction threshold and node capacity.

FailedScheduling

Scheduler cannot place pod. Reason: insufficient CPU/memory, no nodes match affinity, all nodes tainted. Check describe pod for details.

FailedMount

Volume cannot be mounted. Common causes: PVC not bound, StorageClass missing, CSI driver error, wrong access mode.

FailedAttachVolume

Volume attach failed. CSI or in-tree driver error. Often transient; persists if detach from previous node is stuck.

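Failed

Generic kubelet failure reason. For image pulls the message begins "Failed to pull image"; the same reason also covers container create and start errors. Check imagePullSecret, registry reachability, and that the image tag exists.

ErrImagePull

Initial image pull error before back-off begins. Often a registry auth or network issue. It is reported as the container's waiting reason in pod status and usually appears inside the message of the accompanying Failed event.

ImagePullBackOff

Repeated image pull failures with exponential backoff, indicating a persistent registry or authentication problem. Like ErrImagePull, it is a container waiting reason that usually shows up in Failed event messages rather than as its own event reason.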

NodeNotReady

Node condition changed to NotReady. All pods on this node may be evicted or rescheduled after tolerationSeconds.

Preempting

Scheduler preempting lower-priority pods to make room for a higher-priority pod. Check PriorityClass assignments.

FailedCreatePodContainer

Runtime could not create container. Common: invalid image entrypoint, securityContext conflict, init container failure.

Unhealthy

Liveness or readiness probe failed. Check probe config and application health endpoint.

BackoffLimitExceeded

Job exhausted its retry budget. All pod attempts failed. Check Job pod logs for root cause.

MissSchedule

CronJob missed its scheduled run (controller was down or schedule changed). Check for TooManyMissedTimes.

FailedGetScale

HPA could not read current replica count or metric. Check RBAC and metrics-server health.

Normal Events (Operational Visibility)

Scheduled

Pod assigned to a node by the scheduler. Contains which node was chosen.

Pulling / Pulled

Image pull started / completed. Pulled includes how long it took — useful for diagnosing slow cold starts.

Created / Started

Container created and started. Time gap between Pulled and Started indicates container startup overhead.

Killing

The kubelet is stopping a container (pod deletion, failed liveness probe, eviction, or preemption). The message states the cause; a kernel OOM kill is reported separately as OOMKilling.

ScalingReplicaSet

Deployment scaled a ReplicaSet up or down. Useful for deployment change timeline.

SuccessfulRescale

HPA successfully changed replica count. Contains old and new replica count.

ProvisioningSucceeded

PVC dynamically provisioned. Contains StorageClass and volume name.

Retention and the TTL Problem

The 1-hour event TTL is the single most important operational fact about Kubernetes events. Incidents often outlast it, or are investigated only after it has passed, and retrospective analysis usually needs events from before the incident started.

Changing the Default TTL

# /etc/kubernetes/manifests/kube-apiserver.yaml (static pod)
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --event-ttl=4h        # extend to 4 hours
        # WARNING: longer TTL = more etcd storage consumed by events
        # Events are stored in etcd under /registry/events/
        # High-event clusters can accumulate a large volume of event data within the TTL

TTL Extension Has Costs

Extending --event-ttl to 24h in a large cluster can consume hundreds of MB to several GB of etcd storage, slowing etcd list operations and increasing backup size. The correct solution is to run an event exporter that ships events to a cheap long-term store (Loki, Elasticsearch, S3) in real-time, keeping the API server TTL at 1–2 hours.

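Event Rate Limits

kube-apiserver has no flag that caps the number of stored events per namespace. Events can still be lost before the TTL expires: the event recorder rate-limits and aggregates bursts from a single source, and clusters can enable the EventRateLimit admission plugin, which rejects event writes that exceed a configured rate. In a namespace with many pods crashing, either mechanism can mean that some events never reach etcd.

- --enable-admission-plugins=NodeRestriction,EventRateLimit
- --admission-control-config-file=/etc/kubernetes/admission-config.yaml   # hypothetical path; this file holds the EventRateLimit limits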

kubernetes-event-exporter

kubernetes-event-exporter (originally from Opsgenie, now maintained by Resmo) is a widely used event exporter with a broad set of outputs. It watches the Events API and routes events to multiple backends based on configurable filters.

Architecture

Events API (watch /api/v1/events?watch=true)
        │
        ▼
kubernetes-event-exporter (Deployment, single active replica)
        │
        ├── Router rules (match on type/reason/namespace/labels)
        │
        ├── Output: Elasticsearch
        ├── Output: Loki (HTTP push)
        ├── Output: Slack webhook (Warning events only)
        ├── Output: Opsgenie / PagerDuty (critical reasons)
        ├── Output: Kafka
        └── Output: stdout / file

Helm Install

helm repo add deliveryhero https://charts.deliveryhero.io
helm upgrade --install kubernetes-event-exporter deliveryhero/kubernetes-event-exporter \
  --namespace monitoring \
  --create-namespace \
  --set config.logLevel=warn \
  --set config.logFormat=json \
  --values event-exporter-values.yaml

Production Configuration

# event-exporter-config.yaml (mounted as ConfigMap)
logLevel: warn
logFormat: json

# Cluster-wide label added to all exported events
clusterName: prod-us-east-1

receivers:
  # --- Loki: all events ---
  - name: loki
    loki:
      streamLabels:
        app: kubernetes-event-exporter
        cluster: prod-us-east-1
      url: http://loki-gateway.monitoring.svc/loki/api/v1/push
      headers:
        Content-Type: application/json
      layout:
        # Map event fields to Loki log labels and body
        namespace: "{{ .InvolvedObject.Namespace }}"
        name: "{{ .InvolvedObject.Name }}"
        kind: "{{ .InvolvedObject.Kind }}"
        reason: "{{ .Reason }}"
        type: "{{ .Type }}"
        message: "{{ .Message }}"
        component: "{{ .ReportingComponent }}"
        count: "{{ .Count }}"

  # --- Elasticsearch: all events ---
  - name: elasticsearch
    elasticsearch:
      hosts:
        - https://prod-logs-es-http.logging.svc:9200
      index: kubernetes-events
      indexFormat: kubernetes-events-{2006.01.02}   # daily index
      username: elastic
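      # valueFrom/secretKeyRef is pod-spec syntax and is not read from this file.
      # Assumes the exporter expands environment variables in its config; set
      # ELASTIC_PASSWORD on the container from Secret elastic-credentials (key: password).
      password: "${ELASTIC_PASSWORD}"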
      tls:
        insecureSkipVerify: false
        caFile: /es-certs/ca.crt
      layout:
        timestamp: "{{ .LastTimestamp }}"
        namespace: "{{ .InvolvedObject.Namespace }}"
        name: "{{ .InvolvedObject.Name }}"
        kind: "{{ .InvolvedObject.Kind }}"
        reason: "{{ .Reason }}"
        message: "{{ .Message }}"
        type: "{{ .Type }}"
        count: "{{ .Count }}"
        component: "{{ .ReportingComponent }}"
        cluster: prod-us-east-1

  # --- Slack: Warning events only ---
  - name: slack-warnings
    slack:
      # Assumes environment variable expansion in the config; set SLACK_TOKEN on the
      # container from Secret slack-token (key: token).
      token: "${SLACK_TOKEN}"
      channel: "#k8s-warnings"
      message: |
        *[{{ .Type }}]* `{{ .Reason }}` on `{{ .InvolvedObject.Kind }}/{{ .InvolvedObject.Name }}`
        *Namespace:* `{{ .InvolvedObject.Namespace }}`
        *Message:* {{ .Message }}
        *Count:* {{ .Count }} | *Last seen:* {{ .LastTimestamp }}

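  # --- Paging: critical Warning reasons (sent through the Opsgenie receiver) ---
  - name: pagerduty-critical
    opsgenie:
      # Assumes environment variable expansion in the config; set OPSGENIE_API_KEY on
      # the container from Secret opsgenie-key (key: apiKey).
      apiKey: "${OPSGENIE_API_KEY}"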
      message: "K8s {{ .Reason }}: {{ .InvolvedObject.Kind }}/{{ .InvolvedObject.Name }}"
      description: "{{ .Message }}"
      alias: "{{ .InvolvedObject.Namespace }}/{{ .InvolvedObject.Name }}/{{ .Reason }}"
      priority: P2

route:
  # Rule values are regular expressions. Each match rule names its own receiver,
  # and a route's drop rules are checked before its match rules.
  routes:
    # Warning events → Loki + Elasticsearch + Slack
    - match:
        - type: Warning
          receiver: loki
        - type: Warning
          receiver: elasticsearch
        - type: Warning
          receiver: slack-warnings

    # Critical Warning reasons → also page
    - match:
        - type: Warning
          reason: "^(OOMKilling|Evicted|BackoffLimitExceeded|NodeNotReady)$"
          receiver: pagerduty-critical

    # Normal events → Loki only (cheaper, not in Elasticsearch),
    # excluding noisy Normal events from kube-system
    - drop:
        - type: Normal
          namespace: kube-system
          reason: "^(LeaderElection|Pulling|Pulled)$"
      match:
        - type: Normal
          receiver: loki

RBAC for Event Exporter

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kubernetes-event-exporter
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubernetes-event-exporter
rules:
  - apiGroups: [""]
    resources: [events, namespaces, nodes, pods]
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [deployments, replicasets, statefulsets, daemonsets]
    verbs: [get, list, watch]
  # Needed when leader election is enabled (Lease-based lock)
  - apiGroups: [coordination.k8s.io]
    resources: [leases]
    verbs: [get, create, update]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubernetes-event-exporter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubernetes-event-exporter
subjects:
  - kind: ServiceAccount
    name: kubernetes-event-exporter
    namespace: monitoring

Deployment with Leader Election

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-event-exporter
  namespace: monitoring
spec:
  replicas: 2   # HA: leader election prevents duplicate events
  selector:
    matchLabels:
      app: kubernetes-event-exporter
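  template:
    metadata:
      labels:
        app: kubernetes-event-exporter   # must match spec.selector.matchLabels
    spec:
      serviceAccountName: kubernetes-event-exporter
      containers:
        - name: exporter
          image: ghcr.io/resmoio/kubernetes-event-exporter:v1.7
          args:
            - -conf=/data/config.yaml
          # Leader election is enabled in the config file (a leaderElection block with
          # enabled: true) rather than by flag; check the key names for your version.
          env:   # available for ${VAR} references in the config file
            - name: ELASTIC_PASSWORD
              valueFrom:
                secretKeyRef: {name: elastic-credentials, key: password}
            - name: SLACK_TOKEN
              valueFrom:
                secretKeyRef: {name: slack-token, key: token}
            - name: OPSGENIE_API_KEY
              valueFrom:
                secretKeyRef: {name: opsgenie-key, key: apiKey}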
          resources:
            requests: {cpu: 50m, memory: 64Mi}
            limits: {cpu: 200m, memory: 256Mi}
          volumeMounts:
            - name: config
              mountPath: /data
      volumes:
        - name: config
          configMap:
            name: kubernetes-event-exporter-config

kube-eventer

kube-eventer (Alibaba Cloud) is an alternative event exporter with native support for DingTalk, Kafka, Elasticsearch, InfluxDB, and more. Useful when kubernetes-event-exporter lacks a specific output plugin.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-eventer
  namespace: monitoring
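spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-eventer
  template:
    metadata:
      labels:
        app: kube-eventer   # must match spec.selector.matchLabels
    spec:
      serviceAccountName: kube-eventer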
      containers:
        - name: kube-eventer
          image: registry.aliyuncs.com/acs/kube-eventer-amd64:v1.2.7-358d0d3-aliyun
          command:
            - /kube-eventer
            - --source=kubernetes:https://kubernetes.default
            # Elasticsearch sink
            - --sink=elasticsearch:https://prod-es:9200?sniff=false&index=kube-events&x-pack-enabled=true
            # DingTalk sink (optional)
            # - --sink=dingtalk:https://oapi.dingtalk.com/robot/send?level=warning&label=prod
          resources:
            requests: {cpu: 50m, memory: 64Mi}
            limits: {cpu: 200m, memory: 128Mi}

Event Router Patterns

Fluent Bit as Event Collector

Rather than a dedicated event exporter, Fluent Bit can watch the Kubernetes Events API directly and route to the same Loki/Elasticsearch pipeline used for container logs:

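[INPUT]
    Name              kubernetes_events
    # Tag must agree with the Match patterns in the filter and output below
    Tag               kube_events
    # Connects to the K8s API and watches /api/v1/events.
    # Run this input in a single Fluent Bit instance, not on every DaemonSet pod,
    # otherwise each instance exports the same events.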
    Kube_URL          https://kubernetes.default.svc:443
    Kube_CA_File      /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File   /var/run/secrets/kubernetes.io/serviceaccount/token
    DB                /var/log/flb-events.db
    Interval_Sec      5

[FILTER]
    Name   grep
    Match  kube_events
    # Only Warning events
    Regex  type Warning

[OUTPUT]
    Name  loki
    Match kube_events
    Host  loki-gateway.monitoring.svc
    Labels job=k8s-events,cluster=prod

OTel Collector k8sevents Receiver

receivers:
  k8s_events:
    auth_type: serviceAccount
    namespaces: []   # empty = all namespaces

processors:
  resource:
    attributes:
      - key: cluster
        value: prod-us-east-1
        action: insert
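  filter/warnings_only:
    # Records that match a condition are dropped, so this keeps Warning events only.
    # Assumes the receiver maps the event type to log severity (Normal → INFO, Warning → WARN).
    # Add filter/warnings_only to the pipeline's processors list to activate it.
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'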
  batch:
    timeout: 10s

exporters:
  loki:
    endpoint: http://loki-gateway.monitoring.svc/loki/api/v1/push
    labels:
      resource:
        cluster: ""
        k8s.namespace.name: ""

service:
  pipelines:
    logs/events:
      receivers: [k8s_events]
      processors: [resource, batch]
      exporters: [loki]

Recording Custom Events

Kubernetes controllers and operators should emit events to report the outcome of reconciliation loops. This follows the standard Kubernetes operational model and makes operator behavior observable via kubectl describe and event exporters without any extra tooling.

Go: client-go EventRecorder

package controller

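import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    myv1 "example.com/my-operator/api/v1" // hypothetical module path for the MyResource API types
)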

type MyReconciler struct {
    client.Client
    Scheme   *runtime.Scheme
    Recorder record.EventRecorder
}

func (r *MyReconciler) SetupWithManager(mgr ctrl.Manager) error {
    // Create recorder with the component name (appears in event source)
    r.Recorder = mgr.GetEventRecorderFor("my-controller")
    return ctrl.NewControllerManagedBy(mgr).
        For(&myv1.MyResource{}).
        Complete(r)
}

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var resource myv1.MyResource
    if err := r.Get(ctx, req.NamespacedName, &resource); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Warning event: something went wrong
    if err := r.doSomething(ctx, &resource); err != nil {
        r.Recorder.Eventf(&resource, corev1.EventTypeWarning,
            "SyncFailed",          // reason
            "Failed to sync resource: %v", err,  // formatted message
        )
        return ctrl.Result{}, err
    }

    // Normal event: informational, emitted only after the sync succeeded
    r.Recorder.Event(&resource, corev1.EventTypeNormal,
        "Reconciled",              // reason (CamelCase)
        "Resource reconciled successfully",  // message
    )

    // Event with annotations: the map is stored as annotations on the Event object itself
    r.Recorder.AnnotatedEventf(&resource, map[string]string{
        "reconcile-version": "v2",
    }, corev1.EventTypeNormal, "Updated", "Configuration updated to version %s", "v2")

    return ctrl.Result{}, nil
}

Event Reason Naming Conventions

Event Reason Best Practices

Reasons should be CamelCase, short (one word or two), and stable (don't change them — external systems filter on them). Use positive names for success and past-tense verbs for state transitions: Reconciled, Updated, Deleted, SyncFailed, ValidationFailed. Avoid including variable data in the reason; put it in the message field instead.
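These conventions are easy to enforce in a unit test or lint step. The Go sketch below is one possible check, not a Kubernetes API: validReason and the 32-character cap are hypothetical choices, and the pattern simply rejects spaces, hyphens, and other characters that usually signal variable data such as pod names.

```go
package main

import (
	"fmt"
	"regexp"
)

// reasonPattern accepts an upper-case letter followed by letters and digits only.
var reasonPattern = regexp.MustCompile(`^[A-Z][A-Za-z0-9]*$`)

// maxReasonLen is a hypothetical project-level cap that keeps reasons short.
const maxReasonLen = 32

// validReason reports whether reason looks like a short, stable CamelCase code.
func validReason(reason string) bool {
	return len(reason) <= maxReasonLen && reasonPattern.MatchString(reason)
}

func main() {
	for _, r := range []string{"SyncFailed", "Reconciled", "sync failed", "Failed-order-service-7d5f9"} {
		fmt.Printf("%-28q valid=%v\n", r, validReason(r))
	}
}
```

A check like this catches reasons that embed object names before external filters and alert rules start depending on them.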

controller-runtime Event Recorder (operator-sdk style)

// In main.go / cmd/main.go (operator-sdk scaffold)
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme: scheme,
    // EventBroadcaster is configured automatically
    // EventRecorder is available via mgr.GetEventRecorderFor()
})

// Use in reconciler:
type DatabaseReconciler struct {
    client.Client
    Scheme   *runtime.Scheme
    Recorder record.EventRecorder
}

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    db := &dbv1.Database{}
    if err := r.Get(ctx, req.NamespacedName, db); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

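    // Record event on a *related* object (not the resource being reconciled)
    // e.g., record event on the Secret created by this controller.
    // The object must be populated (name, namespace, UID), so load it first.
    secret := &corev1.Secret{}
    key := client.ObjectKey{Namespace: db.Namespace, Name: db.Name + "-credentials"} // hypothetical naming scheme
    if err := r.Get(ctx, key, secret); err != nil {
        return ctrl.Result{}, err
    }
    r.Recorder.Event(secret, corev1.EventTypeNormal, "Created",
        "Secret created for database credentials")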

    return ctrl.Result{}, nil
}

Alerting on Events

There are two approaches to alerting on Kubernetes events: converting events to Prometheus metrics for PromQL alerting, and alerting directly from the event exporter pipeline (e.g., Loki alert rules or webhook routing).

Event Metrics in Prometheus

The rules in this section assume a counter named kube_event_count with type, reason, namespace, and name labels. Upstream kube-state-metrics (KSM) does not list Events among the resources it exports, so in practice this metric comes from a dedicated events-to-metrics exporter. Confirm the metric name and labels against the /metrics output of whatever you deploy before relying on these rules.

# kube_event_count: count of events by type/reason/namespace/name
# Verify that your exporter exposes this metric with these labels
kube_event_count{type="Warning", reason="OOMKilling"}
kube_event_count{type="Warning", reason="BackOff"}
kube_event_count{type="Warning", reason="Evicted"}
kube_event_count{type="Warning", reason="FailedScheduling"}

# Rate of Warning events (increase = new events)
increase(kube_event_count{type="Warning"}[5m])

# Find namespaces with OOM kills in last 15 minutes
sum by (namespace) (increase(kube_event_count{type="Warning", reason="OOMKilling"}[15m]))

kube_event_count Cardinality Warning

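Each unique combination of type/reason/namespace/name/uid creates a separate time series. In large clusters with many pod names, this can create cardinality problems. If cardinality is a concern, aggregate to namespace + reason level with a recording rule (for example, sum by (namespace, reason)) and alert on the aggregate. Simply dropping the name and uid labels with metricRelabelings in the ServiceMonitor collapses distinct series onto identical label sets, which Prometheus treats as duplicate samples.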
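The effect of aggregating to namespace + reason can be shown with a small Go sketch. It only models the arithmetic of summing away the name and uid labels: the series and aggKey types, the byNamespaceReason helper, and the sample values are all hypothetical.

```go
package main

import "fmt"

// series models one time series of the event counter with its identifying labels.
type series struct {
	Type, Reason, Namespace, Name, UID string
	Value                              float64
}

// aggKey is the reduced label set kept after aggregation.
type aggKey struct{ Namespace, Reason string }

// byNamespaceReason sums series values after discarding type, name, and uid.
func byNamespaceReason(in []series) map[aggKey]float64 {
	out := make(map[aggKey]float64)
	for _, s := range in {
		out[aggKey{Namespace: s.Namespace, Reason: s.Reason}] += s.Value
	}
	return out
}

func main() {
	in := []series{
		{"Warning", "BackOff", "production", "order-service-7d5f9-xk2pq", "uid-1", 4},
		{"Warning", "BackOff", "production", "order-service-7d5f9-abcde", "uid-2", 2},
		{"Warning", "OOMKilling", "production", "order-service-7d5f9-xk2pq", "uid-1", 1},
	}
	out := byNamespaceReason(in)
	fmt.Printf("series before: %d, after: %d\n", len(in), len(out))
	fmt.Println("production/BackOff =", out[aggKey{"production", "BackOff"}])
}
```

The per-pod detail is traded for a series count that grows with the number of namespaces and reasons rather than with pod churn.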

PrometheusRule Alerts on Events

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-event-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-events
      rules:
        - alert: KubernetesOOMKilling
          expr: |
            increase(kube_event_count{type="Warning", reason="OOMKilling"}[10m]) > 0
          labels:
            severity: warning
          annotations:
            summary: "OOMKill in {{ $labels.namespace }} ({{ $labels.name }})"
            description: "Container was OOMKilled. Increase memory limit or investigate memory leak."
            runbook_url: "https://wiki/runbooks/oomkilling"

        - alert: KubernetesCrashLoopBackOff
          expr: |
            increase(kube_event_count{type="Warning", reason="BackOff"}[5m]) > 3
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "CrashLoopBackOff in {{ $labels.namespace }} ({{ $labels.name }})"
            description: "Pod is in CrashLoopBackOff. Check container logs with --previous flag."

        - alert: KubernetesPodEvicted
          expr: |
            increase(kube_event_count{type="Warning", reason="Evicted"}[10m]) > 0
          labels:
            severity: warning
          annotations:
            summary: "Pod evicted in {{ $labels.namespace }}"
            description: "A pod was evicted due to node resource pressure."

        - alert: KubernetesFailedScheduling
          expr: |
            increase(kube_event_count{type="Warning", reason="FailedScheduling"}[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod cannot be scheduled in {{ $labels.namespace }}"
            description: "Scheduler cannot find a suitable node. Check resource requests and node capacity."

        - alert: KubernetesNodeNotReady
          expr: |
            increase(kube_event_count{type="Warning", reason="NodeNotReady"}[5m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Node entered NotReady state"
            description: "A node transitioned to NotReady. Pods may be evicted after tolerationSeconds."

        - alert: KubernetesImagePullFailure
          expr: |
            increase(kube_event_count{type="Warning", reason=~"Failed|ErrImagePull|ImagePullBackOff"}[5m]) > 0
          for: 3m
          labels:
            severity: warning
          annotations:
            summary: "Image pull failing in {{ $labels.namespace }}"
            description: "Container image cannot be pulled. Check imagePullSecret and registry reachability."

Loki-Based Event Alerting

When events are forwarded to Loki via kubernetes-event-exporter or Fluent Bit, you can write LogQL alerts. This captures the full event message (not just counts) and allows richer filtering:

# Loki ruler rule file: LogQL expressions are evaluated by the Loki ruler, not by Prometheus,
# so these groups are loaded through the ruler's rule storage rather than a PrometheusRule.
groups:
  - name: loki-events
    interval: 1m
    rules:
      - alert: KubernetesWarningEventSpike
        expr: |
          sum by (namespace) (
            rate({job="k8s-events"} | json | type="Warning" [5m])
          ) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate of Warning events in {{ $labels.namespace }}"

      - alert: CriticalEventPattern
        expr: |
          count_over_time(
            {job="k8s-events"} | json | type="Warning"
              | reason =~ "OOMKilling|Evicted|BackoffLimitExceeded|NodeNotReady" [5m]
          ) > 0
        labels:
          severity: critical
        annotations:
          summary: "Critical Kubernetes event detected"

Slack Alert Routing for Events

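# kubernetes-event-exporter routing: critical reasons → slack-critical, all warnings → slack-warnings
# (slack-critical is assumed to be defined like slack-warnings, with a different channel)
route:
  routes:
    - match:
        - type: Warning
          reason: "^(OOMKilling|NodeNotReady|Evicted)$"   # rule values are regular expressions
          receiver: slack-critical
        - type: Warning
          receiver: slack-warnings
        - type: Warning
          receiver: elasticsearch
# Flood control (for example, at most 10 messages per 5 minutes per reason) is not
# expressed in route rules; apply it in the receiving system or in a small relay
# placed in front of the Slack webhook.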

Metrics, Alerts & Runbooks

Key Metrics for the Event Pipeline

Exporter metric names differ between versions and depend on the configured metrics prefix, so check the names below against the exporter's /metrics endpoint.

| Metric | Source | Alert Threshold | Meaning |
|---|---|---|---|
| kube_event_count{type="Warning"} | event metrics exporter | increase > 0 for critical reasons | Warning event counts by reason/namespace |
| kubernetes_event_exporter_sends_total | event-exporter | (none) | Events successfully exported per receiver |
| kubernetes_event_exporter_send_errors_total | event-exporter | >0 | Export failures; events may be lost |
| kubernetes_event_exporter_watch_errors_total | event-exporter | >0 | API watch errors; events may be missed |
| kubernetes_event_exporter_events_total | event-exporter | (none) | Total events observed (before routing) |
| kubernetes_event_exporter_dropped_total | event-exporter | >0 | Events dropped by routing rules |

Operational Alert Rules

groups:
  - name: event-exporter-health
    rules:
      - alert: EventExporterSendErrors
        expr: rate(kubernetes_event_exporter_send_errors_total[5m]) > 0
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "kubernetes-event-exporter failing to send events — check receiver connectivity"

      - alert: EventExporterWatchErrors
        expr: increase(kubernetes_event_exporter_watch_errors_total[5m]) > 3
        labels: {severity: warning}
        annotations:
          summary: "Event exporter watch API errors — events may be missed"

      - alert: EventExporterDown
        expr: absent(kubernetes_event_exporter_events_total)
        for: 5m
        labels: {severity: critical}
        annotations:
          summary: "kubernetes-event-exporter is not running — no events are being exported"

Runbooks

OOMKilling Events

  1. Identify pod: kubectl get events -A --field-selector reason=OOMKilling
  2. Check current memory usage: kubectl top pod <pod> -n <ns>
  3. Check memory limit: kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].resources}'
  4. Check if limit is appropriate: compare working set to limit
  5. Short-term: increase limit in Deployment spec; Long-term: investigate memory leak
  6. Consider VPA for automatic limit recommendation
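
The working-set-versus-limit comparison in step 4 comes down to simple arithmetic. Below is a minimal sketch, assuming the values are copied from kubectl top and the pod spec as strings such as 900Mi and 1Gi. parseBytes and memoryUtilization are hypothetical helpers and handle only a subset of Kubernetes quantity suffixes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// units lists a subset of Kubernetes quantity suffixes with their byte multipliers.
var units = []struct {
	suffix string
	factor float64
}{
	{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}, {"Ti", 1 << 40},
	{"k", 1e3}, {"M", 1e6}, {"G", 1e9}, {"T", 1e12},
}

// parseBytes converts a quantity such as "512Mi" into bytes.
func parseBytes(q string) (float64, error) {
	for _, u := range units {
		if strings.HasSuffix(q, u.suffix) {
			n, err := strconv.ParseFloat(strings.TrimSuffix(q, u.suffix), 64)
			if err != nil {
				return 0, fmt.Errorf("parse %q: %w", q, err)
			}
			return n * u.factor, nil
		}
	}
	return strconv.ParseFloat(q, 64) // no suffix: plain byte count
}

// memoryUtilization returns working set divided by limit (1.0 means at the limit).
func memoryUtilization(workingSet, limit string) (float64, error) {
	ws, err := parseBytes(workingSet)
	if err != nil {
		return 0, err
	}
	lim, err := parseBytes(limit)
	if err != nil {
		return 0, err
	}
	if lim <= 0 {
		return 0, fmt.Errorf("limit must be positive, got %q", limit)
	}
	return ws / lim, nil
}

func main() {
	ratio, err := memoryUtilization("900Mi", "1Gi")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("working set is %.0f%% of the limit\n", ratio*100)
}
```

A ratio that sits close to 1.0 under normal load suggests the limit is too tight. A ratio that climbs steadily until the kill points toward a leak.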

FailedScheduling Events

  1. kubectl describe pod <pod> -n <ns> — see Events section for exact reason
  2. Insufficient CPU/memory: kubectl describe nodes | grep -A5 "Allocated resources"
  3. Taint mismatch: kubectl describe node | grep Taints
  4. Affinity mismatch: review nodeAffinity/podAntiAffinity rules in deployment
  5. Topology spread: too many pods on some nodes, spread constraint too strict
  6. If persistent: scale cluster (add nodes or node pool)

CrashLoopBackOff

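  1. kubectl logs <pod> -n <ns> --previous (logs from the crashed container)
  2. Check exit code: kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
  3. Exit 137 = SIGKILL (128 + 9), usually OOMKilled (confirm with reason: OOMKilled in the same output); Exit 1 = application error; Exit 2 = misuse of a shell builtin or bad arguments
  4. Check ConfigMap / Secret contents and mounts (a missing object normally blocks startup with CreateContainerConfigError or FailedMount; wrong or incomplete values are what cause an immediate crash)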
  5. Check readiness/liveness probe configuration
  6. Test container locally with same environment variables
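
The exit-code rules in step 3 follow the common convention that a process killed by signal N exits with 128 + N. A minimal sketch of that decoding is shown here. signalFromExit and describeExit are hypothetical helpers, and the descriptions are first-pass hints, not a diagnosis.

```go
package main

import "fmt"

// signalFromExit returns the signal number when the exit code follows the
// 128+N convention used for processes terminated by signal N.
func signalFromExit(code int) (int, bool) {
	if code > 128 && code < 256 {
		return code - 128, true
	}
	return 0, false
}

// describeExit gives a first-pass interpretation of a container exit code.
func describeExit(code int) string {
	if sig, ok := signalFromExit(code); ok {
		switch sig {
		case 9:
			return "SIGKILL: check lastState.terminated.reason for OOMKilled"
		case 15:
			return "SIGTERM: container was asked to stop and exited"
		default:
			return fmt.Sprintf("terminated by signal %d", sig)
		}
	}
	switch code {
	case 0:
		return "clean exit"
	case 126:
		return "command found but not executable"
	case 127:
		return "command not found (check image entrypoint and args)"
	default:
		return fmt.Sprintf("application-defined error code %d", code)
	}
}

func main() {
	for _, c := range []int{1, 137, 143} {
		fmt.Printf("%d: %s\n", c, describeExit(c))
	}
}
```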

ImagePullBackOff

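  1. kubectl describe pod <pod> -n <ns> (see the exact pull error message in Events)
  2. Verify image tag exists in registry
  3. Check imagePullSecret is attached to ServiceAccount or pod spec
  4. Inspect the credentials the pod uses: kubectl get secret <secret> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d (note that kubectl create secret docker-registry test --docker-server=... --dry-run=client only renders a manifest and never contacts the registry)
  5. Check network policy allows egress to registry IP/port 443
  6. If private registry: verify the secret is in the same namespace as the pod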

Event Exporter Not Running

  1. kubectl get deploy kubernetes-event-exporter -n monitoring
  2. Check pod status and logs: kubectl logs -l app=kubernetes-event-exporter -n monitoring
  3. Verify RBAC: kubectl auth can-i list events --all-namespaces --as=system:serviceaccount:monitoring:kubernetes-event-exporter
  4. Check ConfigMap format: kubectl get cm kubernetes-event-exporter-config -n monitoring -o yaml
  5. Restart deployment: kubectl rollout restart deploy/kubernetes-event-exporter -n monitoring

Querying Events in Loki

# All Warning events (set the query time range to the last 1 hour)
{job="k8s-events"} | json | type="Warning"

# OOMKilling events with details
{job="k8s-events"} | json | reason="OOMKilling"
  | line_format "{{.namespace}}/{{.name}}: {{.message}}"

# Events for a specific namespace
{job="k8s-events"} | json | namespace="payments"

# Count Warning events by reason (last 24h)
sum by (reason) (count_over_time({job="k8s-events"} | json | type="Warning" [24h]))

# Find all events for a specific pod
{job="k8s-events"} | json | name=~"order-service-.*"

# Events that correlate with a deployment (around rollout time)
{job="k8s-events"} | json | reason="ScalingReplicaSet" | name=~"order-service-.*"

Best Practices

  1. Deploy a persistent event exporter from day one. With the default 1-hour TTL, events are gone before most retrospectives start, and any incident that runs longer than an hour loses its earliest evidence. Run kubernetes-event-exporter (or equivalent) before you need it; expired events cannot be recovered retroactively.
  2. Send Warning events to an alerting channel. Not every Warning event needs a page, but they should appear in a low-urgency Slack channel where on-call engineers can spot patterns. Aggregate by reason to identify systemic issues.
  3. Route critical event reasons to PagerDuty. OOMKilling, NodeNotReady, Evicted, and BackoffLimitExceeded should trigger alerting. Use throttling (max N per 5 minutes) to prevent alert fatigue from cascading failures.
  4. Emit events from custom controllers. Operator reconcile loops should emit a Normal event when a reconcile changes state successfully and a Warning event on failure. This makes controller behavior observable via kubectl describe with no additional tooling required.
  5. Filter out LeaderElection and health-probe events. These fire every few seconds in kube-system and add noise without value. Exclude them in the event exporter routing rules before they reach Elasticsearch or Loki.
  6. Do not store events in Elasticsearch unless you need full-text search or complex aggregation queries on event messages. Loki is 10–20× cheaper per event record, so forward to Loki for standard operational use.
  7. Enable leader election in the event exporter. A single-replica event exporter is a single point of failure for your event pipeline. Run 2+ replicas with leader election to survive pod restarts without missing events.
  8. Add a cluster label to all exported events. In multi-cluster environments, this label is what lets you tell clusters apart once events land in a shared Loki/Elasticsearch instance.

Coverage Details

  • Events as first-class API objects (core/v1 Event) — the fourth observability signal
  • Events vs Audit Logs vs Application Logs comparison table
  • 1-hour default TTL (--event-ttl) and etcd storage — critical retention problem
  • Full Event object YAML: involvedObject, reason, message, type, count, firstTimestamp, lastTimestamp, source, series, reportingComponent, related, action, eventTime
  • Key fields reference table (12 fields with types and notes)
  • Event deduplication: 10-minute window, count increment, identical field matching
  • Normal vs Warning event types (two cards)
  • Event sources table: kubelet, scheduler, deployment-controller, replicaset, statefulset, job, cronjob, HPA, endpoint-slice, node, volume controllers
  • kubectl get events reference: --field-selector (type/reason/involvedObject), --sort-by, -w watch, -A all-namespaces, custom columns, jq JSON processing
  • kubectl events subcommand (1.26+): --for, --types flags
  • Warning event reasons: OOMKilling, BackOff, Evicted, FailedScheduling, FailedMount, FailedAttachVolume, Failed/ErrImagePull/ImagePullBackOff, NodeNotReady, Preempting, FailedCreatePodContainer, Unhealthy, BackoffLimitExceeded, MissSchedule, FailedGetScale
  • Normal event reasons: Scheduled, Pulling/Pulled, Created/Started, Killing, ScalingReplicaSet, SuccessfulRescale, ProvisioningSucceeded
  • --event-ttl kube-apiserver flag and cost warning for extension
  • etcd event quota: --max-events-per-namespace (v1.21+)
  • kubernetes-event-exporter architecture diagram
  • Helm install for kubernetes-event-exporter
  • Production event-exporter config: Loki output (template layout), Elasticsearch output (daily index, TLS, credentials from secret), Slack (Warning events, message template), PagerDuty/Opsgenie (critical reasons)
  • Route rules: Warning→Loki+ES+Slack, critical reasons→page, Normal→Loki only, drop noisy kube-system Normal events
  • RBAC for event-exporter: ClusterRole (get/list/watch events/namespaces/nodes/pods + apps)
  • Deployment with leader election (2 replicas, -leader-election flag)
  • kube-eventer alternative: install YAML with Elasticsearch sink
  • Fluent Bit kubernetes_events input plugin config (DB, Interval_Sec, filter, Loki output)
  • OTel Collector k8s_events receiver config (auth_type, namespaces, filter processor, Loki exporter)
  • Go: client-go EventRecorder via controller-runtime: SetupWithManager, Recorder.Event, Recorder.Eventf, Recorder.AnnotatedEventf
  • Event reason naming conventions: CamelCase, stable, positive/past-tense, no variables in reason
  • controller-runtime operator-sdk scaffold pattern for EventRecorder
  • kube-state-metrics kube_event_count metric (v2.8+): type/reason/namespace labels
  • kube_event_count cardinality warning: drop name/uid labels via metricRelabelings
  • PrometheusRule alerts: OOMKilling, CrashLoopBackOff, Evicted, FailedScheduling, NodeNotReady, ImagePullFailure
  • Loki-based event alerting via LogQL metric queries (Warning event spike, critical pattern count)
  • Slack routing with throttle (max N per 5 minutes per reason) to prevent flooding
  • 6 event pipeline metrics with thresholds (kube_event_count, sends_total, send_errors, watch_errors, events_total, dropped)
  • 3 PrometheusRule alerts: EventExporterSendErrors, WatchErrors, EventExporterDown (absent)
  • 5 runbooks: OOMKilling, FailedScheduling, CrashLoopBackOff, ImagePullBackOff, event exporter not running
  • LogQL event queries: Warning filter, OOMKilling, namespace filter, count by reason, pod filter, ScalingReplicaSet correlation
  • 8 best practices: day-one exporter, Warning→Slack, critical→PagerDuty with throttle, emit from controllers, filter LeaderElection, Loki vs ES cost, leader election, cluster label