Kubernetes Events
The fourth observability signal: event structure, retention, kubectl usage, event exporters, kubernetes-event-exporter, custom event recording, alerting on events, and production event pipeline design.
What Are Kubernetes Events
Kubernetes Events are first-class API objects (core/v1 Event) written to the API server by controllers, the kubelet, the scheduler, and other system components. They describe what happened to a Kubernetes object — a pod was scheduled, an image pull failed, a container was OOMKilled, a deployment scaled up.
Events are often called the "fourth observability signal", alongside metrics, logs, and traces. They are unique because the cluster generates them automatically, with no application code changes, for most significant state transitions.
Events are stored in etcd alongside other API objects, not in a log backend. By default, the API server expires events after 1 hour (configurable via --event-ttl on kube-apiserver). In high-activity clusters, event rate limiting can also discard events before the TTL expires. If you need event history beyond 1 hour (for incident retrospectives, auditing, or trend alerting) you must run an event exporter that streams events to a durable backend as they arrive.
Events vs Audit Logs vs Application Logs
| Aspect | Events | Audit Logs | Application Logs |
|---|---|---|---|
| Source | K8s controllers, kubelet, scheduler | kube-apiserver (all API calls) | Application containers |
| Describes | Object state transitions | Who did what via API | Application behavior |
| Storage | etcd (1h TTL) | Files / webhook backend | /var/log/containers/ |
| Access | kubectl get events / API | Log files / SIEM | kubectl logs / Loki |
| Volume | Low-Medium (state changes only) | High (every API call) | High |
| App changes needed | None (for system events) | None | Yes (structured logging) |
Event Object Structure
kubectl get event -n production -o yaml | head -80
apiVersion: v1
kind: Event
metadata:
  name: order-service-7d5f9-xk2pq.17c4a3f0b8e2d1a5 # pod-name.hex-suffix
  namespace: production
  creationTimestamp: "2024-01-15T10:23:45Z"
# The resource this event is about
involvedObject:
  apiVersion: v1
  kind: Pod
  name: order-service-7d5f9-xk2pq
  namespace: production
  uid: a1b2c3d4-e5f6-7890-abcd-ef1234567890
  resourceVersion: "98765"
  fieldPath: spec.containers{order-service} # specific container, if applicable
# Classification
reason: OOMKilling # machine-readable reason (CamelCase)
message: "Memory limit reached: Killing container with allowance 256Mi, used 256Mi"
type: Warning # Normal or Warning
# Deduplication and aggregation
count: 3 # how many times this event has occurred
firstTimestamp: "2024-01-15T09:45:00Z"
lastTimestamp: "2024-01-15T10:23:45Z"
# Who generated this event
source:
  component: kubelet
  host: ip-10-0-1-45.us-east-1.compute.internal
# v1.19+ fields (EventSeries for high-frequency events)
series:
  count: 3
  lastObservedTime: "2024-01-15T10:23:45.000000Z"
# v1.19+ reporting fields (replaces source)
reportingComponent: kubelet
reportingInstance: ip-10-0-1-45.us-east-1.compute.internal
# Related object (action target, if different from involvedObject)
related:
  apiVersion: v1
  kind: Node
  name: ip-10-0-1-45.us-east-1.compute.internal
action: Killing # verb describing the action taken
eventTime: "2024-01-15T10:23:45.000000Z" # MicroTime (v1.19+)
Key Fields Reference
| Field | Type | Description | Notes |
|---|---|---|---|
| involvedObject | ObjectReference | The K8s object this event is about | kind + name + namespace + uid |
| reason | string | Short CamelCase reason code | Machine-readable; used for filtering and alerting |
| message | string | Human-readable description | May contain variable data (pod names, resource values) |
| type | string | Normal or Warning | Warning events indicate problems |
| count | int | Occurrence count (deduplicated) | Identical events within a 10-minute window are merged |
| firstTimestamp | Time | First occurrence time | Deprecated in v1.19+ in favor of eventTime |
| lastTimestamp | Time | Most recent occurrence | Shown in the LAST SEEN column of kubectl get events |
| source | EventSource | Component + host that generated event | Deprecated in v1.19+ in favor of reportingComponent |
| reportingComponent | string | Component name (v1.19+) | e.g., kubelet, deployment-controller |
| action | string | Action taken | e.g., Killing, Pulling, Scheduled |
| series | EventSeries | High-frequency event aggregation (v1.19+) | Replaces count for rapid-fire events |
| related | ObjectReference | Secondary object involved | e.g., Node when killing a pod |
Event Deduplication
The Kubernetes event recorder deduplicates events that share the same involvedObject, reason, message, source, and type within a 10-minute window. Duplicate events increment count and update lastTimestamp rather than creating new Event objects. This prevents etcd flooding from hot loops but means you lose individual timestamps for high-frequency events.
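The aggregation described above can be sketched with a small in-memory model. This is a minimal illustration, not client-go's implementation: the names `dedupKey`, `aggregated`, and `record` are hypothetical, and the window handling is simplified to "merge if the last occurrence is at most 10 minutes old".

```go
package main

import (
	"fmt"
	"time"
)

// dedupKey holds the fields that must all match for two events to be merged.
type dedupKey struct {
	Object, Reason, Message, Source, Type string
}

// aggregated mirrors the count, firstTimestamp, and lastTimestamp fields.
type aggregated struct {
	Count       int
	First, Last time.Time
}

// window is the 10-minute deduplication window described in the text.
const window = 10 * time.Minute

// start is an arbitrary reference time used by the demo.
var start = time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC)

// record merges an occurrence into the store, or starts a new entry when the
// key is new or the previous occurrence is older than the window.
func record(store map[dedupKey]*aggregated, k dedupKey, at time.Time) *aggregated {
	if e, ok := store[k]; ok && at.Sub(e.Last) <= window {
		e.Count++
		e.Last = at
		return e
	}
	e := &aggregated{Count: 1, First: at, Last: at}
	store[k] = e
	return e
}

func main() {
	store := map[dedupKey]*aggregated{}
	k := dedupKey{
		Object: "pod/order-service", Reason: "BackOff",
		Message: "Back-off restarting failed container",
		Source:  "kubelet", Type: "Warning",
	}
	// Three identical events, one minute apart, collapse into one entry.
	for i := 0; i < 3; i++ {
		record(store, k, start.Add(time.Duration(i)*time.Minute))
	}
	e := store[k]
	fmt.Println(e.Count, e.Last.Sub(e.First)) // 3 2m0s
}
```

Only the first and last timestamps survive the merge, which is exactly the information loss the paragraph above warns about.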
Event Types and Sources
Event Types
Normal — Informational
Routine state transitions that are expected. Examples: pod scheduled, container started, image pulled, deployment scaled, volume bound. These are part of normal cluster operation.
In production monitoring, Normal events are useful for change tracking (when did a deployment roll out?) but rarely require alerting.
Warning — Requires Attention
Indicates a problem or degraded state. Examples: OOMKilling, BackOff (crash loop), FailedScheduling, FailedMount, Evicted, Failed (image pull errors). These are actionable signals.
Warning events should flow to your alerting pipeline and trigger investigation. High-frequency Warning events indicate cluster health problems.
Event Sources (Reporting Components)
| Component | Typical Events Generated |
|---|---|
| kubelet | Pulling/Pulled/Failed image, Created/Started/Killing container, OOMKilling, Evicting pod, VolumeMount errors, NodeNotReady |
| kube-scheduler | Scheduled pod to node, FailedScheduling (insufficient resources, affinity mismatch, taints) |
| deployment-controller | ScalingReplicaSet, Scaled up/down ReplicaSet |
| replicaset-controller | SuccessfulCreate/Delete pod, FailedCreate pod |
| statefulset-controller | SuccessfulCreate/Delete pod |
| job-controller | SuccessfulCreate, BackoffLimitExceeded, Completed |
| cronjob-controller | SuccessfulCreate job, MissSchedule, TooManyMissedTimes |
| horizontal-pod-autoscaler | SuccessfulRescale, FailedGetScale, DesiredReplicas, ScalingLimited |
| endpoint-slice-controller | FailedToCreateEndpointSlices, SuccessfulCreate/Update |
| node-controller | NodeNotReady, NodeReady, RegisteredNode, RemovingNode |
| volume-controller | FailedBinding, VolumeReleased, ProvisioningSucceeded/Failed |
| default-scheduler | Preempting, Scheduled, FailedScheduling |
| Custom controllers/operators | Reconciled, SyncFailed, Updated, any custom reason |
kubectl events Usage
# List all events in a namespace (default output is not guaranteed to be in time order)
kubectl get events -n production
# Sort by time (oldest first, most recent last)
kubectl get events -n production --sort-by='.lastTimestamp'
# Only Warning events
kubectl get events -n production --field-selector type=Warning
# Events for a specific pod
kubectl get events -n production --field-selector involvedObject.name=order-service-7d5f9-xk2pq
# Events for a specific object kind (events about cluster-scoped objects such as Nodes are stored in the default namespace)
kubectl get events -n default --field-selector involvedObject.kind=Node
# Events by reason
kubectl get events -n production --field-selector reason=OOMKilling
# Cross-namespace (all namespaces)
kubectl get events -A --sort-by='.lastTimestamp'
# Only Warning events cluster-wide (most useful for incident triage)
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'
# Watch events as they arrive
kubectl get events -n production -w
# Describe a specific resource to see associated events inline
kubectl describe pod order-service-7d5f9-xk2pq -n production
# Events section at the bottom shows events for this pod
# Custom columns (useful for quick triage)
kubectl get events -n production \
-o custom-columns='TIME:.lastTimestamp,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MSG:.message' \
--sort-by='.lastTimestamp' | tail -30
# JSON output for scripting
kubectl get events -n production -o json | \
jq '.items[] | select(.type=="Warning") | {time:.lastTimestamp, reason:.reason, object:.involvedObject.name, msg:.message}'
Kubernetes 1.26 added kubectl events as a dedicated subcommand with improved output: kubectl events -n production --for pod/order-service-7d5f9-xk2pq. It supports --for (filter by object), --types (filter by Warning/Normal), and outputs in a human-friendly table sorted by time by default.
# kubectl events (1.26+)
kubectl events -n production
kubectl events -n production --for pod/order-service-7d5f9-xk2pq
kubectl events -n production --types=Warning
kubectl events -n production --for deployment/order-service --types=Warning
kubectl events -A --types=Warning --no-headers | sort -k1
Important Event Reasons
The reason field is the most important field for programmatic filtering and alerting. These are the reasons that matter most for production operations:
Warning Events (Action Required)
OOMKilling: Container exceeded memory limit and was killed by the kernel OOM killer. Increase memory limit or find memory leak.
BackOff: Container in CrashLoopBackOff. Application is crashing on startup. Check container logs with --previous.
Evicted: Pod evicted due to node resource pressure (memory, disk, PID). Check eviction threshold and node capacity.
FailedScheduling: Scheduler cannot place pod. Reason: insufficient CPU/memory, no nodes match affinity, all nodes tainted. Check describe pod for details.
FailedMount: Volume cannot be mounted. Common causes: PVC not bound, StorageClass missing, CSI driver error, wrong access mode.
FailedAttachVolume: Volume attach failed. CSI or in-tree driver error. Often transient; persists if detach from previous node is stuck.
Failed (image pull): Image pull failed. Check imagePullSecret, registry reachability, and that the image tag exists.
ErrImagePull: Initial image pull error before entering BackOff. Often registry auth or network issue.
ImagePullBackOff: Repeated image pull failures with exponential backoff. Persistent registry or authentication problem.
NodeNotReady: Node condition changed to NotReady. All pods on this node may be evicted or rescheduled after tolerationSeconds.
Preempted: Scheduler preempting lower-priority pods to make room for a higher-priority pod. Check PriorityClass assignments.
Failed (container create or start): Runtime could not create container. Common: invalid image entrypoint, securityContext conflict, init container failure.
Unhealthy: Liveness or readiness probe failed. Check probe config and application health endpoint.
BackoffLimitExceeded: Job exhausted its retry budget. All pod attempts failed. Check Job pod logs for root cause.
MissSchedule: CronJob missed its scheduled run (controller was down or schedule changed). Check for TooManyMissedTimes.
FailedGetScale: HPA could not read current replica count or metric. Check RBAC and metrics-server health.
Normal Events (Operational Visibility)
Scheduled: Pod assigned to a node by the scheduler. Contains which node was chosen.
Pulling / Pulled: Image pull started / completed. Pulled includes how long it took, which is useful for diagnosing slow cold starts.
Created / Started: Container created and started. Time gap between Pulled and Started indicates container startup overhead.
Killing: Container being killed (graceful shutdown or OOMKill). Check whether an OOMKilling event accompanies it.
ScalingReplicaSet: Deployment scaled a ReplicaSet up or down. Useful for deployment change timeline.
SuccessfulRescale: HPA successfully changed replica count. Contains old and new replica count.
ProvisioningSucceeded: PVC dynamically provisioned. Contains StorageClass and volume name.
Retention and the TTL Problem
The 1-hour event TTL is the single most important operational fact about Kubernetes events. Incidents often outlast it, and retrospective analysis almost always requires events from before the incident started.
Changing the Default TTL
# /etc/kubernetes/manifests/kube-apiserver.yaml (static pod)
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --event-ttl=4h # extend to 4 hours
# WARNING: longer TTL = more etcd storage consumed by events
# Events are stored in etcd under /registry/events/
# High-event clusters can accumulate hundreds of MB of event data within hours
Extending --event-ttl to 24h in a large cluster can consume hundreds of MB to several GB of etcd storage, slowing etcd list operations and increasing backup size. The correct solution is to run an event exporter that ships events to a cheap long-term store (Loki, Elasticsearch, S3) in real-time, keeping the API server TTL at 1–2 hours.
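The storage claim above follows from simple arithmetic: steady-state event storage is roughly the rate of new Event objects times the TTL times the average object size. The sketch below works that out; the 20 events/s rate and 1.5 KiB object size are assumed illustration values, not measurements, and `eventStorageBytes` is a hypothetical helper.

```go
package main

import "fmt"

// eventStorageBytes estimates steady-state etcd usage for events:
// new Event objects per second x TTL in seconds x average object size.
func eventStorageBytes(eventsPerSec, ttlHours, avgEventBytes float64) float64 {
	return eventsPerSec * ttlHours * 3600 * avgEventBytes
}

func main() {
	const rate = 20.0   // assumed: new Event objects per second
	const size = 1536.0 // assumed: about 1.5 KiB per serialized Event
	for _, ttl := range []float64{1, 4, 24} {
		mib := eventStorageBytes(rate, ttl, size) / (1 << 20)
		fmt.Printf("ttl=%2.0fh  ~%.0f MiB\n", ttl, mib)
	}
}
```

With these assumed inputs a 1h TTL stays near 100 MiB while a 24h TTL reaches roughly 2.5 GiB, which is the range the paragraph above describes. Measure your own event rate before drawing conclusions.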
etcd Event Quota
Events can also be lost before the TTL expires. The client-side event recorder rate-limits each source and object pair, so a pod stuck in a crash loop stops producing new Event objects once its budget is spent. Clusters can additionally enable the EventRateLimit admission plugin to cap event writes per namespace or per source and object at the API server. Neither mechanism is a storage quota; both simply discard events, which is another reason to export events as they arrive rather than rely on what remains in etcd.
kubernetes-event-exporter
kubernetes-event-exporter (maintained by Resmo, originally an Opsgenie project) is a widely used event exporter. It watches the Events API and routes events to multiple backends based on configurable filters.
Helm Install
helm repo add deliveryhero https://charts.deliveryhero.io
helm upgrade --install kubernetes-event-exporter deliveryhero/kubernetes-event-exporter \
--namespace monitoring \
--create-namespace \
--set config.logLevel=warn \
--set config.logFormat=json \
--values event-exporter-values.yaml
Production Configuration
# event-exporter-config.yaml (mounted as ConfigMap)
logLevel: warn
logFormat: json
# Cluster-wide label added to all exported events
clusterName: prod-us-east-1
receivers:
  # --- Loki: all events ---
  - name: loki
    loki:
      streamLabels:
        app: kubernetes-event-exporter
        cluster: prod-us-east-1
      url: http://loki-gateway.monitoring.svc/loki/api/v1/push
      headers:
        Content-Type: application/json
      layout:
        # Map event fields to Loki log labels and body
        namespace: "{{ .InvolvedObject.Namespace }}"
        name: "{{ .InvolvedObject.Name }}"
        kind: "{{ .InvolvedObject.Kind }}"
        reason: "{{ .Reason }}"
        type: "{{ .Type }}"
        message: "{{ .Message }}"
        component: "{{ .ReportingComponent }}"
        count: "{{ .Count }}"
  # --- Elasticsearch: all events ---
  - name: elasticsearch
    elasticsearch:
      hosts:
        - https://prod-logs-es-http.logging.svc:9200
      index: kubernetes-events
      indexFormat: "kubernetes-events-{2006.01.02}" # daily index
      username: elastic
      # Expanded from the environment at startup; inject ES_PASSWORD into the
      # exporter pod from the elastic-credentials Secret (key: password)
      password: "${ES_PASSWORD}"
      tls:
        insecureSkipVerify: false
        caFile: /es-certs/ca.crt
      layout:
        timestamp: "{{ .LastTimestamp }}"
        namespace: "{{ .InvolvedObject.Namespace }}"
        name: "{{ .InvolvedObject.Name }}"
        kind: "{{ .InvolvedObject.Kind }}"
        reason: "{{ .Reason }}"
        message: "{{ .Message }}"
        type: "{{ .Type }}"
        count: "{{ .Count }}"
        component: "{{ .ReportingComponent }}"
        cluster: prod-us-east-1
  # --- Slack: Warning events only ---
  - name: slack-warnings
    slack:
      # Inject SLACK_TOKEN from the slack-token Secret (key: token)
      token: "${SLACK_TOKEN}"
      channel: "#k8s-warnings"
      message: |
        *[{{ .Type }}]* `{{ .Reason }}` on `{{ .InvolvedObject.Kind }}/{{ .InvolvedObject.Name }}`
        *Namespace:* `{{ .InvolvedObject.Namespace }}`
        *Message:* {{ .Message }}
        *Count:* {{ .Count }} | *Last seen:* {{ .LastTimestamp }}
  # --- Opsgenie: page on critical Warning reasons ---
  - name: opsgenie-critical
    opsgenie:
      # Inject OPSGENIE_API_KEY from the opsgenie-key Secret (key: apiKey)
      apiKey: "${OPSGENIE_API_KEY}"
      message: "K8s {{ .Reason }}: {{ .InvolvedObject.Kind }}/{{ .InvolvedObject.Name }}"
      description: "{{ .Message }}"
      alias: "{{ .InvolvedObject.Namespace }}/{{ .InvolvedObject.Name }}/{{ .Reason }}"
      priority: P2
route:
  # Drop noisy Normal events from kube-system before any receiver sees them
  drop:
    - type: "Normal"
      namespace: "kube-system"
      reason: "LeaderElection|Pulling|Pulled"
  routes:
    # Warning events → Loki + Elasticsearch + Slack
    - match:
        - type: "Warning"
          receiver: "loki"
        - type: "Warning"
          receiver: "elasticsearch"
        - type: "Warning"
          receiver: "slack-warnings"
    # Critical Warning reasons → also page
    - match:
        - type: "Warning"
          reason: "OOMKilling|Evicted|BackoffLimitExceeded|NodeNotReady"
          receiver: "opsgenie-critical"
    # Normal events → Loki only (cheaper, not in Elasticsearch)
    - match:
        - type: "Normal"
          receiver: "loki"
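To make the routing behaviour concrete, here is a simplified model of filter-based routing: drop rules are checked first, then every matching rule contributes its receiver. It is a sketch under stated assumptions, not the exporter's source code; the `event`, `rule`, and `route` names and the sample rule sets are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// event carries only the fields this sketch routes on.
type event struct{ Type, Reason, Namespace string }

// rule matches when every non-empty pattern matches the corresponding field.
type rule struct {
	Type, Reason, Namespace string // regular expressions; empty means "any"
	Receiver                string
}

func (r rule) matches(e event) bool {
	ok := func(pattern, value string) bool {
		return pattern == "" || regexp.MustCompile("^(?:"+pattern+")$").MatchString(value)
	}
	return ok(r.Type, e.Type) && ok(r.Reason, e.Reason) && ok(r.Namespace, e.Namespace)
}

// route returns the receivers for e: none if any drop rule matches,
// otherwise the receiver of every matching rule, in order.
func route(e event, drop, match []rule) []string {
	for _, d := range drop {
		if d.matches(e) {
			return nil
		}
	}
	var out []string
	for _, m := range match {
		if m.matches(e) {
			out = append(out, m.Receiver)
		}
	}
	return out
}

// Sample rule sets (assumed for illustration).
var dropRules = []rule{{Type: "Normal", Namespace: "kube-system", Reason: "LeaderElection|Pulling|Pulled"}}
var matchRules = []rule{
	{Receiver: "loki"},
	{Type: "Warning", Reason: "OOMKilling|Evicted", Receiver: "pager"},
}

func main() {
	fmt.Println(route(event{"Warning", "OOMKilling", "production"}, dropRules, matchRules)) // [loki pager]
	fmt.Println(route(event{"Normal", "Pulled", "kube-system"}, dropRules, matchRules))     // []
}
```

Evaluating drop rules before match rules is the design choice that keeps noisy events out of every backend at once, rather than filtering per receiver.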
RBAC for Event Exporter
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kubernetes-event-exporter
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubernetes-event-exporter
rules:
  - apiGroups: [""]
    resources: [events, namespaces, nodes, pods]
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [deployments, replicasets, statefulsets, daemonsets]
    verbs: [get, list, watch]
  # Needed only when leader election is enabled (lock is held in a Lease)
  - apiGroups: [coordination.k8s.io]
    resources: [leases]
    verbs: [get, list, watch, create, update, patch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubernetes-event-exporter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubernetes-event-exporter
subjects:
  - kind: ServiceAccount
    name: kubernetes-event-exporter
    namespace: monitoring
Deployment with Leader Election
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-event-exporter
  namespace: monitoring
spec:
  replicas: 2 # HA: leader election prevents duplicate events
  selector:
    matchLabels:
      app: kubernetes-event-exporter
  template:
    metadata:
      labels:
        app: kubernetes-event-exporter # must match spec.selector
    spec:
      serviceAccountName: kubernetes-event-exporter
      containers:
        - name: exporter
          image: ghcr.io/resmoio/kubernetes-event-exporter:v1.7
          # Verify these flags against your exporter version; some releases
          # configure leader election in the config file instead
          args:
            - -conf=/data/config.yaml
            - -leader-election # ensures only one replica exports at a time
            - -leader-election-id=event-exporter-lock
          resources:
            requests: {cpu: 50m, memory: 64Mi}
            limits: {cpu: 200m, memory: 256Mi}
          volumeMounts:
            - name: config
              mountPath: /data
      volumes:
        - name: config
          configMap:
            name: kubernetes-event-exporter-config
kube-eventer
kube-eventer (Alibaba Cloud) is an alternative event exporter with native support for DingTalk, Kafka, Elasticsearch, InfluxDB, and more. Useful when kubernetes-event-exporter lacks a specific output plugin.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-eventer
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-eventer
  template:
    metadata:
      labels:
        app: kube-eventer # must match spec.selector
    spec:
      serviceAccountName: kube-eventer
      containers:
        - name: kube-eventer
          image: registry.aliyuncs.com/acs/kube-eventer-amd64:v1.2.7-358d0d3-aliyun
          command:
            - /kube-eventer
            - --source=kubernetes:https://kubernetes.default
            # Elasticsearch sink
            - --sink=elasticsearch:https://prod-es:9200?sniff=false&index=kube-events&x-pack-enabled=true
            # DingTalk sink (optional)
            # - --sink=dingtalk:https://oapi.dingtalk.com/robot/send?level=warning&label=prod
          resources:
            requests: {cpu: 50m, memory: 64Mi}
            limits: {cpu: 200m, memory: 128Mi}
Event Router Patterns
Fluent Bit as Event Collector
Rather than a dedicated event exporter, Fluent Bit can watch the Kubernetes Events API directly and route to the same Loki/Elasticsearch pipeline used for container logs:
[INPUT]
    Name             kubernetes_events
    # Tag must be set explicitly so the Match rules below select these records
    Tag              kube_events
    # Connects to the K8s API and watches /api/v1/events
    Kube_URL         https://kubernetes.default.svc:443
    Kube_CA_File     /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token
    DB               /var/log/flb-events.db
    Interval_Sec     5
[FILTER]
    Name   grep
    Match  kube_events
    # Only Warning events
    Regex  type Warning
[OUTPUT]
    Name    loki
    Match   kube_events
    Host    loki-gateway.monitoring.svc
    Labels  job=k8s-events,cluster=prod
OTel Collector k8sevents Receiver
receivers:
  k8s_events:
    auth_type: serviceAccount
    namespaces: [] # empty = all namespaces
processors:
  resource:
    attributes:
      - key: cluster
        value: prod-us-east-1
        action: insert
  # The filter processor drops records that match a condition, so this keeps
  # only Warning events (the receiver maps event type Warning to WARN severity)
  filter/warnings_only:
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'
  batch:
    timeout: 10s
exporters:
  # Recent Collector releases may require the otlphttp exporter for Loki instead
  loki:
    endpoint: http://loki-gateway.monitoring.svc/loki/api/v1/push
    labels:
      resource:
        cluster: ""
        k8s.namespace.name: ""
service:
  pipelines:
    logs/events:
      receivers: [k8s_events]
      # Remove filter/warnings_only to export Normal events as well
      processors: [resource, filter/warnings_only, batch]
      exporters: [loki]
Recording Custom Events
Kubernetes controllers and operators should emit events to report the outcome of reconciliation loops. This follows the standard Kubernetes operational model and makes operator behavior observable via kubectl describe and event exporters without any extra tooling.
Go: client-go EventRecorder
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	myv1 "example.com/my-operator/api/v1" // hypothetical API package defining MyResource
)

type MyReconciler struct {
	client.Client
	Scheme   *runtime.Scheme
	Recorder record.EventRecorder
}

func (r *MyReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Create recorder with the component name (appears in event source)
	r.Recorder = mgr.GetEventRecorderFor("my-controller")
	return ctrl.NewControllerManagedBy(mgr).
		For(&myv1.MyResource{}).
		Complete(r)
}

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var resource myv1.MyResource
	if err := r.Get(ctx, req.NamespacedName, &resource); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Warning event: something went wrong.
	// doSomething is a placeholder for the controller's real sync logic.
	if err := r.doSomething(ctx, &resource); err != nil {
		r.Recorder.Eventf(&resource, corev1.EventTypeWarning,
			"SyncFailed",                       // reason
			"Failed to sync resource: %v", err, // formatted message
		)
		return ctrl.Result{}, err
	}
	// Normal event: informational, emitted only after the sync succeeded
	r.Recorder.Event(&resource, corev1.EventTypeNormal,
		"Reconciled",                       // reason (CamelCase)
		"Resource reconciled successfully", // message
	)
	// Annotated event: the annotations are attached to the Event object itself
	r.Recorder.AnnotatedEventf(&resource, map[string]string{
		"reconcile-version": "v2",
	}, corev1.EventTypeNormal, "Updated", "Configuration updated to version %s", "v2")
	return ctrl.Result{}, nil
}
Event Reason Naming Conventions
Reasons should be CamelCase, short (one word or two), and stable (don't change them — external systems filter on them). Use positive names for success and past-tense verbs for state transitions: Reconciled, Updated, Deleted, SyncFailed, ValidationFailed. Avoid including variable data in the reason; put it in the message field instead.
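One way to enforce these conventions is a small check in unit tests or at recorder call sites. The sketch below is one possible encoding of the rules; the `validReason` helper and the 32-character limit are assumptions for illustration, not a Kubernetes requirement.

```go
package main

import (
	"fmt"
	"regexp"
)

// reasonPattern accepts UpperCamelCase identifiers made of letters and digits,
// e.g. "SyncFailed". It rejects spaces, punctuation, and lowercase starts.
var reasonPattern = regexp.MustCompile(`^[A-Z][A-Za-z0-9]*$`)

// maxReasonLen is an assumed house limit that keeps reasons short.
const maxReasonLen = 32

// validReason reports whether s is usable as a stable, machine-readable reason.
func validReason(s string) bool {
	return len(s) <= maxReasonLen && reasonPattern.MatchString(s)
}

func main() {
	for _, r := range []string{"Reconciled", "SyncFailed", "sync failed", "Failed-order-42"} {
		fmt.Printf("%-16s %v\n", r, validReason(r))
	}
}
```

The last sample is rejected because it embeds variable data (an order number) in the reason; that detail belongs in the message field.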
controller-runtime Event Recorder (operator-sdk style)
// In main.go / cmd/main.go (operator-sdk scaffold)
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme: scheme,
	// EventBroadcaster is configured automatically
	// EventRecorder is available via mgr.GetEventRecorderFor()
})
if err != nil {
	panic(err) // the scaffold logs the error and exits here
}

// Use in reconciler:
type DatabaseReconciler struct {
	client.Client
	Scheme   *runtime.Scheme
	Recorder record.EventRecorder
}

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	db := &dbv1.Database{}
	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Record event on a *related* object (not the resource being reconciled),
	// e.g., on the Secret created by this controller. The recorder needs a
	// populated object (name, namespace, UID), so fetch it first.
	// The "-credentials" suffix is a hypothetical naming scheme.
	secret := &corev1.Secret{}
	key := client.ObjectKey{Namespace: db.Namespace, Name: db.Name + "-credentials"}
	if err := r.Get(ctx, key, secret); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	r.Recorder.Event(secret, corev1.EventTypeNormal, "Created",
		"Secret created for database credentials")
	return ctrl.Result{}, nil
}
Alerting on Events
There are two approaches to alert on Kubernetes events: using kube-state-metrics (converts events to Prometheus metrics for PromQL alerting) and direct alerting from the event exporter pipeline (e.g., Loki alert rules or webhook routing).
kube-state-metrics: Event Metrics
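The queries below assume a Prometheus metric named kube_event_count that counts events by type, reason, namespace, and object name, which enables event-based alerting via a standard PrometheusRule. Upstream kube-state-metrics (KSM) does not list Events among its supported resources, so confirm that your KSM build or a dedicated event-to-metrics exporter actually exposes this metric before building alerts on it:
# kube_event_count: count of events by type/reason/namespace/name
# Assumed metric name; check your exporter's /metrics output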
kube_event_count{type="Warning", reason="OOMKilling"}
kube_event_count{type="Warning", reason="BackOff"}
kube_event_count{type="Warning", reason="Evicted"}
kube_event_count{type="Warning", reason="FailedScheduling"}
# Rate of Warning events (increase = new events)
increase(kube_event_count{type="Warning"}[5m])
# Find namespaces with OOM kills in last 15 minutes
sum by (namespace) (increase(kube_event_count{type="Warning", reason="OOMKilling"}[15m]))
Each unique combination of type/reason/namespace/name/uid creates a separate time series. In large clusters with many pod names, this can create cardinality problems. Use metricRelabelings in the ServiceMonitor to drop the name and uid labels if cardinality is a concern, aggregating to namespace + reason level only.
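The effect of dropping the per-object labels can be illustrated by counting distinct label sets before and after. This is a sketch with an assumed workload of 200 crash-looping pods; the `series` type and `countSeries` function are hypothetical and only emulate what relabeling does at scrape time.

```go
package main

import "fmt"

// series is one label set of the event-count metric discussed above.
type series struct {
	Type, Reason, Namespace, Name, UID string
}

// countSeries returns the number of distinct label sets with and without
// the per-object labels (name, uid).
func countSeries(in []series) (full, reduced int) {
	fullSet := map[series]struct{}{}
	reducedSet := map[series]struct{}{}
	for _, s := range in {
		fullSet[s] = struct{}{}
		s.Name, s.UID = "", "" // emulate dropping the labels at scrape time
		reducedSet[s] = struct{}{}
	}
	return len(fullSet), len(reducedSet)
}

func main() {
	var in []series
	// Assumed workload: 200 pods in one namespace, each with a BackOff warning.
	for i := 0; i < 200; i++ {
		in = append(in, series{
			Type: "Warning", Reason: "BackOff", Namespace: "production",
			Name: fmt.Sprintf("order-service-%d", i), UID: fmt.Sprintf("uid-%d", i),
		})
	}
	full, reduced := countSeries(in)
	fmt.Println(full, reduced) // 200 1
}
```

The trade-off is that alerts can then name only the namespace and reason, not the specific pod.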
PrometheusRule Alerts on Events
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-event-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-events
      rules:
        - alert: KubernetesOOMKilling
          expr: |
            increase(kube_event_count{type="Warning", reason="OOMKilling"}[10m]) > 0
          labels:
            severity: warning
          annotations:
            summary: "OOMKill in {{ $labels.namespace }} ({{ $labels.name }})"
            description: "Container was OOMKilled. Increase memory limit or investigate memory leak."
            runbook_url: "https://wiki/runbooks/oomkilling"
        - alert: KubernetesCrashLoopBackOff
          expr: |
            increase(kube_event_count{type="Warning", reason="BackOff"}[5m]) > 3
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "CrashLoopBackOff in {{ $labels.namespace }} ({{ $labels.name }})"
            description: "Pod is in CrashLoopBackOff. Check container logs with --previous flag."
        - alert: KubernetesPodEvicted
          expr: |
            increase(kube_event_count{type="Warning", reason="Evicted"}[10m]) > 0
          labels:
            severity: warning
          annotations:
            summary: "Pod evicted in {{ $labels.namespace }}"
            description: "A pod was evicted due to node resource pressure."
        - alert: KubernetesFailedScheduling
          expr: |
            increase(kube_event_count{type="Warning", reason="FailedScheduling"}[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod cannot be scheduled in {{ $labels.namespace }}"
            description: "Scheduler cannot find a suitable node. Check resource requests and node capacity."
        - alert: KubernetesNodeNotReady
          expr: |
            increase(kube_event_count{type="Warning", reason="NodeNotReady"}[5m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Node entered NotReady state"
            description: "A node transitioned to NotReady. Pods may be evicted after tolerationSeconds."
        - alert: KubernetesImagePullFailure
          expr: |
            increase(kube_event_count{type="Warning", reason=~"Failed|ErrImagePull|ImagePullBackOff"}[5m]) > 0
          for: 3m
          labels:
            severity: warning
          annotations:
            summary: "Image pull failing in {{ $labels.namespace }}"
            description: "Container image cannot be pulled. Check imagePullSecret and registry reachability."
Loki-Based Event Alerting
When events are forwarded to Loki via kubernetes-event-exporter or Fluent Bit, you can write LogQL alerts. This captures the full event message (not just counts) and allows richer filtering:
# LogQL rules are evaluated by the Loki ruler, not by Prometheus. This object
# must be synced into Loki (or its groups copied into a Loki ruler rule file).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-event-alerts
  namespace: monitoring
spec:
  groups:
    - name: loki-events
      interval: 1m
      rules:
        - alert: KubernetesWarningEventSpike
          expr: |
            sum by (namespace) (
              rate({job="k8s-events"} | json | type="Warning" [5m])
            ) > 5
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "High rate of Warning events in {{ $labels.namespace }}"
        - alert: CriticalEventPattern
          expr: |
            count_over_time(
              {job="k8s-events"} | json | type="Warning"
              | reason =~ "OOMKilling|Evicted|BackoffLimitExceeded|NodeNotReady" [5m]
            ) > 0
          labels:
            severity: critical
          annotations:
            summary: "Critical Kubernetes event detected"
Slack Alert Routing for Events
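# kubernetes-event-exporter routing: critical reasons → critical Slack channel, other warnings → #k8s-warnings
# slack-critical is assumed to be defined under receivers, like slack-warnings
route:
  routes:
    - match:
        - type: "Warning"
          reason: "OOMKilling|NodeNotReady|Evicted"
          receiver: "slack-critical"
        - type: "Warning"
          reason: "OOMKilling|NodeNotReady|Evicted"
          receiver: "elasticsearch"
      # Assumed per-route throttle (max 10 messages per 5 minutes per reason);
      # confirm your exporter version supports it, or rate-limit downstream instead
      throttle:
        period: 5m
        count: 10
    - match:
        - type: "Warning"
          receiver: "slack-warnings"
        - type: "Warning"
          receiver: "elasticsearch"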
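The "max 10 messages per 5 minutes per reason" idea can be sketched as a fixed-window limiter keyed by reason. This is a generic illustration of the technique, not the exporter's implementation; the `throttle` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// period and t0 are demo values: a 5-minute window and an arbitrary start time.
const period = 5 * time.Minute

var t0 = time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC)

// throttle allows at most limit notifications per key within each period.
type throttle struct {
	period time.Duration
	limit  int
	start  map[string]time.Time // start of the current window per key
	sent   map[string]int       // notifications sent in the current window
}

func newThrottle(period time.Duration, limit int) *throttle {
	return &throttle{
		period: period, limit: limit,
		start: map[string]time.Time{}, sent: map[string]int{},
	}
}

// allow reports whether a notification for key at time now may be sent.
func (t *throttle) allow(key string, now time.Time) bool {
	if s, ok := t.start[key]; !ok || now.Sub(s) >= t.period {
		t.start[key], t.sent[key] = now, 0 // open a new window
	}
	if t.sent[key] >= t.limit {
		return false
	}
	t.sent[key]++
	return true
}

func main() {
	th := newThrottle(period, 10)
	allowed := 0
	for i := 0; i < 25; i++ { // 25 OOMKilling events, one per second
		if th.allow("OOMKilling", t0.Add(time.Duration(i)*time.Second)) {
			allowed++
		}
	}
	fmt.Println(allowed) // 10
}
```

Keying the budget by reason means a flood of one reason cannot silence notifications for another.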
Metrics, Alerts & Runbooks
Key Metrics for the Event Pipeline
| Metric | Source | Alert Threshold | Meaning |
|---|---|---|---|
| kube_event_count{type="Warning"} | kube-state-metrics | increase > 0 for critical reasons | Warning event counts by reason/namespace |
| kubernetes_event_exporter_sends_total | event-exporter | — | Events successfully exported per receiver |
| kubernetes_event_exporter_send_errors_total | event-exporter | >0 | Export failures — events may be lost |
| kubernetes_event_exporter_watch_errors_total | event-exporter | >0 | API watch errors — events may be missed |
| kubernetes_event_exporter_events_total | event-exporter | — | Total events observed (before routing) |
| kubernetes_event_exporter_dropped_total | event-exporter | >0 | Events dropped by routing rules |
Operational Alert Rules
groups:
  - name: event-exporter-health
    rules:
      - alert: EventExporterSendErrors
        expr: rate(kubernetes_event_exporter_send_errors_total[5m]) > 0
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "kubernetes-event-exporter failing to send events; check receiver connectivity"
      - alert: EventExporterWatchErrors
        expr: increase(kubernetes_event_exporter_watch_errors_total[5m]) > 3
        labels: {severity: warning}
        annotations:
          summary: "Event exporter watch API errors; events may be missed"
      - alert: EventExporterDown
        expr: absent(kubernetes_event_exporter_events_total)
        for: 5m
        labels: {severity: critical}
        annotations:
          summary: "kubernetes-event-exporter is not running; no events are being exported"
Runbooks
OOMKilling Events
- Identify pod: kubectl get events -A --field-selector reason=OOMKilling
- Check current memory usage: kubectl top pod <pod> -n <ns>
- Check memory limit: kubectl get pod <pod> -o jsonpath='{.spec.containers[0].resources}'
- Check if limit is appropriate: compare working set to limit
- Short-term: increase limit in Deployment spec; Long-term: investigate memory leak
- Consider VPA for automatic limit recommendation
FailedScheduling Events
- kubectl describe pod <pod> -n <ns>: see Events section for exact reason
- Insufficient CPU/memory: kubectl describe nodes | grep -A5 "Allocated resources"
- Taint mismatch: kubectl describe node | grep Taints
- Affinity mismatch: review nodeAffinity/podAntiAffinity rules in deployment
- Topology spread: too many pods on some nodes, spread constraint too strict
- If persistent: scale cluster (add nodes or node pool)
CrashLoopBackOff
- kubectl logs <pod> --previous: logs from crashed container
- Check exit code: kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
- Exit 137 = killed by SIGKILL (128 + 9), usually OOMKilled; Exit 1 = application error; Exit 2 = misuse of command
- Check ConfigMap / Secret mounts (missing secret causes immediate crash)
- Check readiness/liveness probe configuration
- Test container locally with same environment variables
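The exit-code step above relies on the shell convention that codes above 128 mean "terminated by signal (code minus 128)". A minimal sketch of that decoding follows; `describeExit` is a hypothetical helper and its wording is a first guess for triage, not a definitive diagnosis.

```go
package main

import "fmt"

// describeExit gives a first-guess interpretation of a container exit code.
// Codes above 128 conventionally mean "terminated by signal (code - 128)".
func describeExit(code int) string {
	switch {
	case code == 0:
		return "completed successfully"
	case code == 137:
		return "killed by SIGKILL (signal 9): often the OOM killer, check lastState.terminated.reason"
	case code == 143:
		return "terminated by SIGTERM (signal 15): usually a normal shutdown request"
	case code > 128:
		return fmt.Sprintf("terminated by signal %d", code-128)
	default:
		return "application-defined error"
	}
}

func main() {
	for _, c := range []int{0, 1, 137, 139, 143} {
		fmt.Printf("%3d  %s\n", c, describeExit(c))
	}
}
```

Exit 137 alone does not prove an OOM kill; the terminated state's reason field is what confirms it.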
ImagePullBackOff
- kubectl describe pod <pod>: see exact pull error message in Events
- Verify image tag exists in registry
- Check imagePullSecret is attached to ServiceAccount or pod spec
- Test credentials: kubectl create secret docker-registry test --docker-server=... --dry-run=client
- Check network policy allows egress to registry IP/port 443
- If private registry: verify secret is in correct namespace
Event Exporter Not Running
- kubectl get deploy kubernetes-event-exporter -n monitoring
- Check pod status and logs: kubectl logs -l app=kubernetes-event-exporter -n monitoring
- Verify RBAC: kubectl auth can-i list events --as=system:serviceaccount:monitoring:kubernetes-event-exporter
- Check ConfigMap format: kubectl get cm kubernetes-event-exporter-config -o yaml
- Restart deployment: kubectl rollout restart deploy/kubernetes-event-exporter -n monitoring
Querying Events in Loki
# All Warning events in last 1 hour
{job="k8s-events"} | json | type="Warning"
# OOMKilling events with details
{job="k8s-events"} | json | reason="OOMKilling"
| line_format "{{.namespace}}/{{.name}}: {{.message}}"
# Events for a specific namespace
{job="k8s-events"} | json | namespace="payments"
# Count Warning events by reason (last 24h)
sum by (reason) (count_over_time({job="k8s-events"} | json | type="Warning" [24h]))
# Find all events for a specific pod
{job="k8s-events"} | json | name=~"order-service-.*"
# Events that correlate with a deployment (around rollout time)
{job="k8s-events"} | json | reason="ScalingReplicaSet" | name=~"order-service-.*"
Best Practices
- Deploy a persistent event exporter from day one. The 1-hour TTL makes events useless for any incident that lasts more than an hour. Run kubernetes-event-exporter (or equivalent) before you need it — you cannot retroactively recover expired events.
- Send Warning events to an alerting channel. Not every Warning event needs a page, but they should appear in a low-urgency Slack channel where on-call engineers can spot patterns. Aggregate by reason to identify systemic issues.
- Route critical event reasons to PagerDuty. OOMKilling, NodeNotReady, Evicted, and BackoffLimitExceeded should trigger alerting. Use throttling (max N per 5 minutes) to prevent alert fatigue from cascading failures.
- Emit events from custom controllers. Every operator reconcile loop should emit a Normal event on success and a Warning event on failure. This makes controller behavior observable via kubectl describe with no additional tooling required.
- Filter out LeaderElection and health-probe events. These fire every few seconds in kube-system and add noise without value. Exclude them in the event exporter routing rules before they reach Elasticsearch or Loki.
- Do not store events in Elasticsearch unless you need full-text search. Loki is 10–20× cheaper per event record. Use Elasticsearch only if you need complex aggregation queries or full-text search on event messages. Forward to Loki for standard operational use.
- Enable leader election in the event exporter. A single-replica event exporter is a single point of failure for your event pipeline. Run 2+ replicas with leader election to survive pod restarts without missing events.
- Add a cluster label to all exported events. In multi-cluster environments, always include a cluster label in exported events so you can distinguish events from different clusters when they land in a shared Loki/Elasticsearch instance.
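Three of these practices (dropping noisy reasons, throttling to at most N alerts per reason per 5 minutes, and tagging every alert with the cluster) can be illustrated together. The Go sketch below is a simplified in-memory model of that routing logic, not the implementation of kubernetes-event-exporter or any other exporter. The event struct, the route function, and the cluster name prod-eu1 are hypothetical, and timestamps are plain Unix seconds to keep the example short.

```go
package main

import "fmt"

// event holds the few fields this sketch needs from an exported Event.
type event struct {
	Type, Reason, Namespace, Name, Message string
}

// noisyReasons lists reasons dropped before routing (illustrative choice).
var noisyReasons = map[string]bool{"LeaderElection": true}

// throttle forwards at most limit alerts per reason within any
// windowSec-long sliding window.
type throttle struct {
	limit     int
	windowSec int64
	sent      map[string][]int64 // reason -> Unix seconds of forwarded alerts
}

func newThrottle(limit int, windowSec int64) *throttle {
	return &throttle{limit: limit, windowSec: windowSec, sent: make(map[string][]int64)}
}

// allow reports whether an alert for reason may be forwarded at time now
// (Unix seconds) and records it if so.
func (t *throttle) allow(reason string, now int64) bool {
	// Keep only timestamps still inside the window (in-place filter).
	kept := t.sent[reason][:0]
	for _, ts := range t.sent[reason] {
		if now-ts < t.windowSec {
			kept = append(kept, ts)
		}
	}
	if len(kept) >= t.limit {
		t.sent[reason] = kept
		return false
	}
	t.sent[reason] = append(kept, now)
	return true
}

// route returns the alert text to send, or "" when the event is dropped
// (noisy reason, non-Warning type, or throttled).
func route(t *throttle, cluster string, e event, now int64) string {
	if noisyReasons[e.Reason] || e.Type != "Warning" || !t.allow(e.Reason, now) {
		return ""
	}
	return fmt.Sprintf("[%s] %s %s/%s: %s", cluster, e.Reason, e.Namespace, e.Name, e.Message)
}

func main() {
	th := newThrottle(3, 300) // at most 3 alerts per reason per 5 minutes
	e := event{Type: "Warning", Reason: "OOMKilling", Namespace: "payments",
		Name: "order-service-abc", Message: "container killed"}
	for i := int64(0); i < 5; i++ {
		now := i * 10
		if msg := route(th, "prod-eu1", e, now); msg != "" {
			fmt.Printf("t=%ds send: %s\n", now, msg)
		} else {
			fmt.Printf("t=%ds suppressed\n", now)
		}
	}
}
```

Throttling per reason rather than globally means a flood of one reason during a cascading failure cannot crowd out the first alert for a different reason.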
Coverage Details
- Events as first-class API objects (core/v1 Event) — the fourth observability signal
- Events vs Audit Logs vs Application Logs comparison table
- 1-hour default TTL (--event-ttl) and etcd storage — critical retention problem
- Full Event object YAML: involvedObject, reason, message, type, count, firstTimestamp, lastTimestamp, source, series, reportingComponent, related, action, eventTime
- Key fields reference table (12 fields with types and notes)
- Event deduplication: 10-minute window, count increment, identical field matching
- Normal vs Warning event types (two cards)
- Event sources table: kubelet, scheduler, deployment-controller, replicaset, statefulset, job, cronjob, HPA, endpoint-slice, node, volume controllers
- kubectl get events reference: --field-selector (type/reason/involvedObject), --sort-by, -w watch, -A all-namespaces, custom columns, jq JSON processing
- kubectl events subcommand (1.26+): --for, --types flags
- Warning event reasons: OOMKilling, BackOff, Evicted, FailedScheduling, FailedMount, FailedAttachVolume, Failed/ErrImagePull/ImagePullBackOff, NodeNotReady, Preempting, FailedCreatePodContainer, Unhealthy, BackoffLimitExceeded, MissSchedule, FailedGetScale
- Normal event reasons: Scheduled, Pulling/Pulled, Created/Started, Killing, ScalingReplicaSet, SuccessfulRescale, ProvisioningSucceeded
- --event-ttl kube-apiserver flag and cost warning for extension
- etcd event quota: --max-events-per-namespace (v1.21+)
- kubernetes-event-exporter architecture diagram
- Helm install for kubernetes-event-exporter
- Production event-exporter config: Loki output (template layout), Elasticsearch output (daily index, TLS, credentials from secret), Slack (Warning events, message template), PagerDuty/Opsgenie (critical reasons)
- Route rules: Warning→Loki+ES+Slack, critical reasons→page, Normal→Loki only, drop noisy kube-system Normal events
- RBAC for event-exporter: ClusterRole (get/list/watch events/namespaces/nodes/pods + apps)
- Deployment with leader election (2 replicas, -leader-election flag)
- kube-eventer alternative: install YAML with Elasticsearch sink
- Fluent Bit kubernetes_events input plugin config (DB, Interval_Sec, filter, Loki output)
- OTel Collector k8s_events receiver config (auth_type, namespaces, filter processor, Loki exporter)
- Go: client-go EventRecorder via controller-runtime: SetupWithManager, Recorder.Event, Recorder.Eventf, Recorder.AnnotatedEventf
- Event reason naming conventions: CamelCase, stable, positive/past-tense, no variables in reason
- controller-runtime operator-sdk scaffold pattern for EventRecorder
- kube-state-metrics kube_event_count metric (v2.8+): type/reason/namespace labels
- kube_event_count cardinality warning: drop name/uid labels via metricRelabelings
- PrometheusRule alerts: OOMKilling, CrashLoopBackOff, Evicted, FailedScheduling, NodeNotReady, ImagePullFailure
- Loki-based event alerting via LogQL metric queries (Warning event spike, critical pattern count)
- Slack routing with throttle (max N per 5 minutes per reason) to prevent flooding
- 6 event pipeline metrics with thresholds (sends_total, send_errors, watch_errors, events_total, dropped)
- 3 PrometheusRule alerts: EventExporterSendErrors, WatchErrors, EventExporterDown (absent)
- 5 runbooks: OOMKilling, FailedScheduling, CrashLoopBackOff, ImagePullBackOff, event exporter not running
- LogQL event queries: Warning filter, OOMKilling, namespace filter, count by reason, pod filter, ScalingReplicaSet correlation
- 8 best practices: day-one exporter, Warning→Slack, critical→PagerDuty with throttle, emit from controllers, filter LeaderElection, Loki vs ES cost, leader election, cluster label