Audit Logging | Kubernetes Documentation

Coverage Checklist

Audit logging architecture: kube-apiserver only
4 policy levels: None/Metadata/Request/RequestResponse
4 audit stages: RequestReceived/ResponseStarted/ResponseComplete/Panic
AuditPolicy YAML: rules, verbs, resources, namespaces, users
omitStages and omitManagedFields
Log backend: file rotation, path, maxAge, maxBackups
Webhook backend: config, batchType, throttle, truncate
Audit event JSON: all fields documented
Noise reduction: system components, health checks, watch events
Production policy: tiered rules pattern
SIEM: Fluentd/Fluent Bit → Elasticsearch, Splunk, Loki
Falco audit rules via webhook
Forensic queries: secret access, exec, privilege escalation
jq query patterns for log analysis
Dynamic audit webhooks (auditregistration.k8s.io)
Compliance: PCI DSS, SOC 2, CIS requirements
5 metrics, 4 alerts, 5 runbooks, 8 best practices

Audit Logging Overview

Kubernetes audit logging records every request processed by the API server — who made the request, what they requested, and what the outcome was. It is the authoritative trail for security investigation, compliance, and forensics.

API Server Only

Kubernetes audit logs only capture activity that goes through the API server. Direct kubelet API calls (port 10250), container runtime operations, and in-process activity within pods are not captured here. Use Falco for syscall-level runtime visibility alongside audit logs.

── Request lifecycle through the audit subsystem ────────────────────

Client
│ (kubectl / controller / service account / external)
│
▼
kube-apiserver
│
├── Stage: RequestReceived ← logged as soon as request arrives
│
├── AuthN + AuthZ + Admission
│
├── Stage: ResponseStarted ← for long-running responses (watch, exec)
│
├── Handler executes (etcd read/write)
│
└── Stage: ResponseComplete ← most important; includes response code
    Stage: Panic            ← if handler panics

    │                 │
    ▼                 ▼
Log Backend       Webhook Backend
(local file)      (external audit sink)

What Audit Logs Answer

Who accessed what?

Track which user, service account, or node read a Secret, ConfigMap, or other sensitive resource.

What changed and when?

Full timeline of creates, updates, and deletes to any resource, including the request body at Request level and above.

Failed authentication attempts

Unauthorized (401) and Forbidden (403) responses indicate credential probing or misconfigured RBAC.

Privilege escalation paths

Track creation of ClusterRoleBindings, ServiceAccounts, and impersonation attempts.

Policy Levels & Stages

Audit Levels

Each audit rule assigns a level that controls how much detail is recorded. Higher levels capture more data but increase log volume and API server overhead.

Level	What Is Logged	Overhead	Use For
`None`	Nothing — request is not logged at all	Zero	High-volume noise: health checks, metrics scrapes, leader election
`Metadata`	Request metadata only: user, timestamp, resource, verb, response code. No request or response body.	Low	Most resources: read operations, watch events
`Request`	Metadata + request body (but not response body)	Medium	Writes to important resources where you need to see what was submitted
`RequestResponse`	Metadata + request body + response body	High	RBAC changes, exec, portforward: highest-value forensic data (avoid for Secrets, because the response body contains the values)

RequestResponse for Secrets Contains the Secret Value

Logging Secrets at RequestResponse level writes the base64-encoded secret data into the log. Base64 is an encoding, not encryption, so anyone who can read the log can recover the values. Encrypt logs at rest and restrict access to the audit log backend; leaving these logs unprotected commonly violates compliance requirements.
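
One possible mitigation is to mask Secret payloads before events leave the control plane. The sketch below is a minimal illustration in Python, not a Kubernetes feature: the function name `redact_secret_payload` and the `<redacted>` placeholder are assumptions made for this example, and it only handles the `data` and `stringData` maps of Secret objects.

```python
import copy

REDACTED = "<redacted>"  # placeholder string chosen for this example


def redact_secret_payload(event: dict) -> dict:
    """Return a copy of an audit event with Secret data values masked.

    Metadata (who, what, when) is preserved; only the `data` and
    `stringData` maps inside requestObject/responseObject are replaced.
    Events for other resources are returned unchanged.
    """
    ref = event.get("objectRef") or {}
    if ref.get("resource") != "secrets":
        return event

    cleaned = copy.deepcopy(event)
    for body_key in ("requestObject", "responseObject"):
        body = cleaned.get(body_key)
        if not isinstance(body, dict):
            continue
        # A list response carries Secrets under "items"; a single object does not.
        items = body.get("items")
        objects = items if isinstance(items, list) else [body]
        for obj in objects:
            if not isinstance(obj, dict):
                continue
            for field in ("data", "stringData"):
                if isinstance(obj.get(field), dict):
                    obj[field] = {key: REDACTED for key in obj[field]}
    return cleaned
```

Key names are kept so that investigators can still see which entries of a Secret were touched, while the values never reach the log pipeline.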

Audit Stages

Stage	When Emitted	Response Code Available?	Use
`RequestReceived`	As soon as the audit handler receives the request, before it is delegated down the handler chain	No	Detecting requests that crash the handler
`ResponseStarted`	After headers sent, before body	Yes	Long-running: watch, exec, port-forward
`ResponseComplete`	After full response sent	Yes	Standard request completion — the primary stage to capture
`Panic`	On handler panic (500)	Yes (500)	Bug detection and intrusion via panic exploitation

omitStages for Watch Connections

Long-running requests such as watch emit ResponseStarted when the response headers are sent and ResponseComplete when the connection closes. Individual change notifications delivered over the watch are not audited as separate events. Set omitStages: [ResponseStarted] globally to drop the stream-start entries while keeping completion events.

Writing Audit Policies

The audit policy is a YAML file passed to the API server via --audit-policy-file. Rules are evaluated in order; the first matching rule determines the level. If no rule matches, the request is not logged.
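
The first-match behaviour can be sketched in a few lines of Python. This is a deliberately simplified model for reasoning about rule order, not the API server's implementation: the request dictionary shape (`user`, `verb`, `group`, `resource`, `path`) is hypothetical, and fields such as `userGroups`, `namespaces`, and subresource wildcards are ignored.

```python
def _matches(rule: dict, req: dict) -> bool:
    """Simplified matcher: a field absent from the rule matches everything."""
    if "users" in rule and req.get("user") not in rule["users"]:
        return False
    if "verbs" in rule and req.get("verb") not in rule["verbs"]:
        return False
    if "nonResourceURLs" in rule:
        # Only a trailing "*" wildcard is modelled here.
        path = req.get("path", "")
        return any(
            path.startswith(pattern[:-1]) if pattern.endswith("*") else path == pattern
            for pattern in rule["nonResourceURLs"]
        )
    if "resources" in rule:
        if "resource" not in req:
            return False  # non-resource request cannot match a resource rule
        return any(
            entry.get("group", "") == req.get("group", "")
            and (not entry.get("resources") or req["resource"] in entry["resources"])
            for entry in rule["resources"]
        )
    return True  # catch-all rule such as "- level: Metadata"


def audit_level(rules: list, req: dict) -> str:
    """Return the level of the first matching rule, or "None" if nothing matches."""
    for rule in rules:
        if _matches(rule, req):
            return rule["level"]
    return "None"
```

Because evaluation stops at the first match, a broad rule placed early hides every more specific rule below it; this is why suppression rules and high-detail rules must be ordered deliberately.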

Minimal Policy (Baseline)

apiVersion: audit.k8s.io/v1
kind: Policy

# Omit the RequestReceived stage so each request is not logged twice
omitStages:
- RequestReceived

rules:
# Rule 1: Don't log read-only requests from system components
- level: None
  users:
  - system:kube-scheduler
  - system:kube-proxy
  - system:apiserver
  - system:kube-controller-manager
  - system:serviceaccount:kube-system:endpoint-controller
  verbs: [get, watch, list]

# Rule 2: Don't log health/readiness probes
- level: None
  nonResourceURLs:
  - /healthz*
  - /readyz*
  - /livez*
  - /version
  - /swagger*
  - /openapi*

# Rule 3: Don't log metrics scrape
- level: None
  nonResourceURLs:
  - /metrics
  - /metrics/cadvisor

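# Rule 4: Request bodies for secrets (no response body, so reads do not expose values;
# create/update request bodies still contain the data, so use Metadata to exclude it)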
- level: Request
  resources:
  - group: ""
    resources: [secrets, configmaps, serviceaccounts/token]

# Rule 5: Full detail for RBAC changes
- level: RequestResponse
  resources:
  - group: rbac.authorization.k8s.io
    resources: [roles, clusterroles, rolebindings, clusterrolebindings]

# Rule 6: Log exec/attach/portforward at Request level (the exec command is recorded in requestURI)
- level: Request
  resources:
  - group: ""
    resources: [pods/exec, pods/attach, pods/portforward]

# Rule 7: Metadata for most other resources
- level: Metadata
  resources:
  - group: ""
  - group: apps
  - group: batch
  - group: networking.k8s.io
  - group: policy
  - group: storage.k8s.io

Production Policy (Tiered)

A production policy uses a tiered approach: suppress noise aggressively, capture sensitive operations in full, and use Metadata as the safe default for everything else.

apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
- RequestReceived
omitManagedFields: true   # Removes managedFields from request/response bodies

rules:

# ── TIER 0: Complete suppression (high volume, no security value) ────

- level: None
  userGroups: [system:nodes]
  verbs: [get, watch, list]
  resources:
  - group: ""
    resources: [endpoints, services, pods, nodes]

- level: None
  nonResourceURLs: [/healthz*, /readyz*, /livez*, /version, /metrics*]

# Suppress leader election coordination (very high volume, no security value)
- level: None
  resources:
  - group: coordination.k8s.io
    resources: [leases]

# Suppress event updates (informational only, not security relevant)
- level: None
  resources:
  - group: ""
    resources: [events]

# ── TIER 1: RequestResponse — highest security value ─────────────────

# RBAC mutations — full request+response for privilege escalation detection
- level: RequestResponse
  verbs: [create, update, patch, delete]
  resources:
  - group: rbac.authorization.k8s.io
    resources:
    - roles
    - clusterroles
    - rolebindings
    - clusterrolebindings

# ServiceAccount mutations
- level: RequestResponse
  resources:
  - group: ""
    resources: [serviceaccounts]

# Token requests: Metadata only, because the response body contains the issued token
- level: Metadata
  resources:
  - group: ""
    resources: [serviceaccounts/token]

# Admission webhook configurations: changes here affect all cluster security
- level: RequestResponse
  resources:
  - group: admissionregistration.k8s.io
    resources: [mutatingwebhookconfigurations, validatingwebhookconfigurations]

# ── TIER 2: Request — important writes ───────────────────────────────

# Pod exec/attach/portforward: the exec command appears in the request URI parameters
- level: Request
  resources:
  - group: ""
    resources: [pods/exec, pods/attach, pods/portforward]

# Secret reads and writes (NOT RequestResponse, which would log secret values on reads).
# Request bodies of create/update/patch still contain secret data; use Metadata to exclude it.
- level: Request
  verbs: [get, list, create, update, patch, delete]
  resources:
  - group: ""
    resources: [secrets]

# ── TIER 3: Metadata — default for everything else ───────────────────
- level: Metadata

API Server Flags

# kube-apiserver flags for audit logging
--audit-log-path=/var/log/kubernetes/audit/audit.log
--audit-log-maxage=30          # Days to retain rotated log files
--audit-log-maxbackup=10       # Max number of old log files to keep
--audit-log-maxsize=100        # Max size in MB before rotation
--audit-log-compress           # Compress rotated files with gzip
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-format=json        # json (default) or legacy (text)

Audit Backends

Log Backend

The log backend writes audit events to a file on the API server node in JSON Lines format (one JSON object per line). This is the simplest setup and suitable for environments with a log collection agent (Fluentd, Fluent Bit, Filebeat) on the node.

Flag	Default	Description
`--audit-log-path`	—	File path; required to enable log backend. Use `-` for stdout.
`--audit-log-maxage`	0	Max days to retain rotated files (0 = unlimited)
`--audit-log-maxbackup`	0	Max number of rotated backup files (0 = unlimited)
`--audit-log-maxsize`	100	Max size in megabytes before rotation
`--audit-log-compress`	false	gzip compress rotated files
`--audit-log-format`	json	`json` (structured) or `legacy` (text)
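
Because the log backend writes JSON Lines and can gzip rotated files, offline analysis usually needs to read both plain and compressed files. The Python sketch below shows one way to do that; the function name `read_audit_file` is an assumption for this example, and it simply treats any path ending in `.gz` as compressed.

```python
import gzip
import json


def read_audit_file(path):
    """Yield audit events from a plain or gzip-compressed JSON Lines file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Reading lazily with a generator keeps memory use flat even for files near the rotation size limit.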

Webhook Backend

The webhook backend sends audit events to an external HTTP endpoint in batches. Used when you want real-time delivery to a SIEM or an audit aggregator without first writing to disk.

# webhook-audit-config.yaml — kubeconfig-format endpoint configuration
apiVersion: v1
kind: Config
clusters:
- name: audit-webhook
  cluster:
    server: https://audit-sink.internal:8888/audit
    certificate-authority: /etc/kubernetes/audit-webhook-ca.crt
contexts:
- name: webhook
  context:
    cluster: audit-webhook
    user: ""
current-context: webhook

# API server flags for webhook backend
--audit-webhook-config-file=/etc/kubernetes/webhook-audit-config.yaml
--audit-webhook-mode=batch          # batch (default), blocking, or blocking-strict
--audit-webhook-batch-max-size=400  # Events per batch
--audit-webhook-batch-max-wait=30s  # Max time before sending incomplete batch
--audit-webhook-initial-backoff=10s # Initial retry delay
--audit-webhook-truncate-enabled    # Truncate oversized events instead of dropping
--audit-webhook-truncate-max-event-size=10485760  # 10MB max event
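
On the receiving side, each webhook POST carries a batch of events wrapped in an `audit.k8s.io/v1` `EventList` object. The following Python sketch shows how a receiver might unpack one request body; the function name `events_from_webhook_body` is hypothetical, and HTTP serving, TLS, and authentication are left out.

```python
import json


def events_from_webhook_body(body: bytes) -> list:
    """Extract the individual audit events from a webhook POST body."""
    payload = json.loads(body)
    if payload.get("kind") != "EventList":
        raise ValueError(f"unexpected kind: {payload.get('kind')!r}")
    # "items" holds one audit Event object per entry.
    return payload.get("items") or []
```

A receiver should acknowledge quickly and process events asynchronously, since slow responses cause the API server's batch buffer to fill.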

Webhook Mode: blocking vs batch

batch mode (default) is asynchronous: the API server does not wait for the webhook to confirm receipt, and events can be dropped if the webhook is down and the buffer fills. blocking mode sends each event synchronously, so the API server waits for the webhook before it responds. This adds the webhook's latency to every API call and should only be used when the webhook responds consistently fast.

Audit Event Format

Each audit event is a JSON object with a well-defined schema. Understanding the fields is essential for writing effective SIEM queries and Falco rules.

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",

  // Unique ID for this event (UUID)
  "auditID": "f23a4b56-1234-5678-abcd-000000000001",

  // Audit level that matched this request, and the stage the event was emitted from
  "level": "RequestResponse",
  "stage": "ResponseComplete",

  // The full request URI
  "requestURI": "/api/v1/namespaces/production/secrets/db-password",

  // Kubernetes request verb (get, list, watch, create, ...); lower-cased HTTP method for non-resource requests
  "verb": "get",

  // Authenticated user information
  "user": {
    "username": "system:serviceaccount:production:myapp",
    "uid": "abc123",
    "groups": ["system:serviceaccounts", "system:serviceaccounts:production"]
  },

  // If request was made via impersonation
  "impersonatedUser": {
    "username": "admin@example.com"
  },

  // Source IPs (first is original, rest are proxies)
  "sourceIPs": ["10.244.1.5", "192.168.1.100"],

  // User-Agent of the client
  "userAgent": "kubectl/v1.29.0 (linux/amd64)",

  // What resource was accessed
  "objectRef": {
    "resource": "secrets",
    "namespace": "production",
    "name": "db-password",
    "apiVersion": "v1"
  },

  // HTTP response code
  "responseStatus": {
    "code": 200
  },

  // Request body (if level=Request or RequestResponse)
  "requestObject": { "...": "..." },

  // Response body (if level=RequestResponse)
  "responseObject": { "...": "..." },

  // Timestamps
  "requestReceivedTimestamp": "2024-01-15T10:30:00.000000Z",
  "stageTimestamp": "2024-01-15T10:30:00.005000Z",

  // Annotations set by the authorizer, admission plugins, and other components in the request chain
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding..."
  }
}
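
As a small illustration of how these fields are typically consumed, the Python sketch below flattens one event into the who/what/when/outcome record that most queries start from. The function name `summarize` and the output keys are choices made for this example; only the input field names come from the schema above.

```python
def summarize(event: dict) -> dict:
    """Flatten an audit event into a compact who/what/when/outcome record."""
    ref = event.get("objectRef") or {}      # absent for non-resource requests
    user = event.get("user") or {}
    status = event.get("responseStatus") or {}

    resource = ref.get("resource")
    if resource and ref.get("subresource"):
        resource = f"{resource}/{ref['subresource']}"  # e.g. pods/exec

    return {
        "when": event.get("stageTimestamp"),
        "who": user.get("username"),
        "impersonating": (event.get("impersonatedUser") or {}).get("username"),
        "verb": event.get("verb"),
        "resource": resource,
        "namespace": ref.get("namespace"),
        "name": ref.get("name"),
        "code": status.get("code"),
        "source_ip": (event.get("sourceIPs") or [None])[0],
    }
```

Every lookup tolerates a missing field, because optional parts of the schema (objectRef, impersonatedUser, responseStatus) are simply omitted from events where they do not apply.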

Noise Reduction

An untuned audit policy on a production cluster can generate hundreds of megabytes of logs per hour. Most of this is noise. The goal is to capture high-value events without drowning the pipeline.

High-Volume Low-Value Sources

Source	Why It's Noisy	Recommendation
kube-controller-manager	Constantly lists/watches almost all resources	`None` for system:kube-controller-manager reads
kube-scheduler	Lists pods and nodes constantly	`None` for system:kube-scheduler reads
kube-proxy / CNI	Watches endpoints, services, nodes	`None` for these system components
Lease updates	Leader election emits update every 2–5s per component	`None` for coordination.k8s.io/leases
Watch requests	Each long-running watch emits a ResponseStarted entry when the stream opens, in addition to ResponseComplete when it closes	`omitStages: [ResponseStarted]` globally
Health checks	/healthz, /readyz, /livez hit every few seconds from load balancers	`None` for nonResourceURLs matching /healthz*, /readyz*, /livez*
Metrics scrapes	Prometheus scrapes /metrics every 15–30s	`None` for nonResourceURLs matching /metrics*
Events resource	High churn; informational only	`None` for core/events

Volume Estimation

# Estimate audit log volume before deploying (dry-run with --audit-log-path=-)
# Then count events per second during normal operation:

# Count events per minute from audit log
# (unbuffered output so counts appear as each minute rolls over)
tail -f /var/log/kubernetes/audit/audit.log | \
  jq --unbuffered -r '.stageTimestamp' | \
  awk -F: '{print $1":"$2; fflush()}' | uniq -c

# Top users by event count (last 1000 events)
tail -1000 /var/log/kubernetes/audit/audit.log | \
  jq -r '.user.username' | sort | uniq -c | sort -rn | head -20

# Top resources by event count
tail -1000 /var/log/kubernetes/audit/audit.log | \
  jq -r '.objectRef.resource // "non-resource"' | sort | uniq -c | sort -rn | head -20
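
The same three counts can be produced in a single pass, which is convenient when profiling a large file. This is a sketch in Python under the assumption that the input is an iterable of JSON Lines strings; the function name `volume_profile` is made up for the example.

```python
import json
from collections import Counter


def volume_profile(lines):
    """Count audit events per minute, per user, and per resource in one pass."""
    per_minute, per_user, per_resource = Counter(), Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written final line
        # "2024-01-15T10:30:00.005000Z"[:16] -> "2024-01-15T10:30"
        per_minute[(event.get("stageTimestamp") or "")[:16]] += 1
        per_user[(event.get("user") or {}).get("username", "unknown")] += 1
        per_resource[(event.get("objectRef") or {}).get("resource", "non-resource")] += 1
    return per_minute, per_user, per_resource
```

Calling `per_user.most_common(20)` on the result gives the same ranking as the `sort | uniq -c | sort -rn | head -20` pipeline.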

SIEM Integration

kube-apiserver
│
├── audit.log (JSON Lines on node)
│     │
│   Fluent Bit DaemonSet
│   (tail /var/log/kubernetes/audit/audit.log)
│     │
│     ├──▶ Elasticsearch / OpenSearch (Kibana dashboards)
│     ├──▶ Splunk HEC (Splunk SIEM)
│     └──▶ Loki (Grafana dashboards)
│
└── Webhook backend
      │
      └──▶ Falco gRPC / HTTP sink (real-time alerting)

Fluent Bit Configuration for Audit Logs

# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Parsers_File  parsers.conf

    # Read audit log file — must be on control plane node
    [INPUT]
        Name              tail
        Tag               kube.audit
        Path              /var/log/kubernetes/audit/audit.log
        Parser            json
        DB                /var/log/flb_kube_audit.db
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    # Add cluster metadata
    [FILTER]
        Name  record_modifier
        Match kube.audit
        Record cluster ${CLUSTER_NAME}
        Record environment ${ENVIRONMENT}

    # Forward to Elasticsearch
    [OUTPUT]
        Name  es
        Match kube.audit
        Host  elasticsearch.logging.svc.cluster.local
        Port  9200
        Index k8s-audit
        Type  _doc
        Logstash_Format On
        Logstash_Prefix k8s-audit

Elasticsearch Query Examples

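# Elasticsearch: find secret get requests in the last hour (add list and watch for full read coverage)
# term queries need keyword-mapped fields; with default dynamic mapping, query the .keyword subfields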
GET k8s-audit-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "objectRef.resource": "secrets" } },
        { "term": { "verb": "get" } },
        { "range": { "stageTimestamp": { "gte": "now-1h" } } }
      ],
      "must_not": [
        { "term": { "user.username": "system:serviceaccount:kube-system:cert-manager" } }
      ]
    }
  }
}

Forensic Query Patterns

These jq patterns work against raw audit log files and translate directly to SIEM queries.

Secret Access Investigation

# All users who read any secret in the last 24h
cat audit.log | jq -r '
  select(.stageTimestamp >= (now - 86400 | todate)) |
  select(.objectRef.resource == "secrets") |
  select(.verb == "get") |
  select(.responseStatus.code == 200) |
  [.stageTimestamp, .user.username, .objectRef.namespace, .objectRef.name] |
  @tsv
'

# Secrets read by service accounts outside their own namespace
cat audit.log | jq -r '
  select(.objectRef.resource == "secrets") |
  select(.verb == "get") |
  select(.user.username | startswith("system:serviceaccount:")) |
  select(
    (.user.username | split(":")[2]) !=
    (.objectRef.namespace // "cluster-scoped")
  ) |
  [.stageTimestamp, .user.username, .objectRef.namespace, .objectRef.name] |
  @tsv
'
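
The cross-namespace check depends on parsing the service account username correctly: the format is `system:serviceaccount:<namespace>:<name>`, so the namespace is the third colon-separated field. A Python sketch of the same check follows; both function names are hypothetical, and it also treats `list` and `watch` as reads.

```python
SA_PREFIX = "system:serviceaccount:"


def service_account_namespace(username: str):
    """Return the namespace of a service account username, or None."""
    if not username.startswith(SA_PREFIX):
        return None
    parts = username.split(":")
    # system : serviceaccount : <namespace> : <name>
    return parts[2] if len(parts) == 4 else None


def is_cross_namespace_secret_read(event: dict) -> bool:
    """True if a service account read a Secret outside its own namespace."""
    ref = event.get("objectRef") or {}
    if ref.get("resource") != "secrets":
        return False
    if event.get("verb") not in ("get", "list", "watch"):
        return False
    username = (event.get("user") or {}).get("username", "")
    sa_namespace = service_account_namespace(username)
    # A cluster-wide list has no objectRef.namespace and is also flagged.
    return sa_namespace is not None and sa_namespace != ref.get("namespace")
```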

Privilege Escalation Detection

# New ClusterRoleBindings created (potential privilege escalation)
cat audit.log | jq -r '
  select(.objectRef.resource == "clusterrolebindings") |
  select(.verb == "create") |
  select(.responseStatus.code == 201) |
  [.stageTimestamp, .user.username, .objectRef.name,
   (.requestObject.roleRef.name // "unknown")] |
  @tsv
'

# Any binding to cluster-admin role
cat audit.log | jq -r '
  select((.objectRef.resource // "") | test("rolebinding")) |
  select(.verb == "create") |
  select(.requestObject.roleRef.name == "cluster-admin") |
  [.stageTimestamp, .user.username, .objectRef.namespace // "cluster",
   .objectRef.name, (.requestObject.subjects[0].name // "unknown")] |
  @tsv
'
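
For pipelines that post-process events in code, the cluster-admin check might look like the Python sketch below. It assumes the binding was logged at Request level or higher, so that `requestObject` is present; the function names are illustrative. Unlike the jq version, it reports every subject rather than only the first.

```python
def grants_cluster_admin(event: dict) -> bool:
    """True if the event creates a (Cluster)RoleBinding to cluster-admin."""
    ref = event.get("objectRef") or {}
    if ref.get("resource") not in ("rolebindings", "clusterrolebindings"):
        return False
    if event.get("verb") != "create":
        return False
    # requestObject is only present at Request or RequestResponse level.
    role_ref = (event.get("requestObject") or {}).get("roleRef") or {}
    return role_ref.get("name") == "cluster-admin"


def binding_subjects(event: dict) -> list:
    """List all subjects of the binding as "Kind:name" strings."""
    subjects = (event.get("requestObject") or {}).get("subjects") or []
    return [f"{s.get('kind')}:{s.get('name')}" for s in subjects]
```

If bindings are only logged at Metadata level, `grants_cluster_admin` returns False for every event, which is one reason the policies above record RBAC changes with their bodies.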

Pod Exec Investigation

# All exec sessions (command in URI params)
cat audit.log | jq -r '
  select(.objectRef.subresource == "exec") |
  [.stageTimestamp, .user.username, .objectRef.namespace,
   .objectRef.name, .requestURI] |
  @tsv
'

# Exec by humans (non-service-accounts) — should be rare in prod
cat audit.log | jq -r '
  select(.objectRef.subresource == "exec") |
  select(.user.username | startswith("system:serviceaccount") | not) |
  [.stageTimestamp, .user.username, .objectRef.namespace,
   .objectRef.name, .requestURI] |
  @tsv
'

Unauthorized Access Patterns

# All 403 Forbidden responses — RBAC denials
cat audit.log | jq -r '
  select(.responseStatus.code == 403) |
  [.stageTimestamp, .user.username, .verb,
   .objectRef.resource, .objectRef.namespace, .objectRef.name] |
  @tsv
'

# Top users by 403 count (credential probing detection)
cat audit.log | jq -r '
  select(.responseStatus.code == 403) |
  .user.username
' | sort | uniq -c | sort -rn | head -10

# 401 Unauthorized (authentication failures)
cat audit.log | jq -r '
  select(.responseStatus.code == 401) |
  [.stageTimestamp, .user.username, .sourceIPs[0], .requestURI] |
  @tsv
'
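
To separate a misconfigured application from probing, it can help to group denials by user and source IP together and apply a threshold. The Python sketch below illustrates the idea; the function name `denial_hotspots` and the default threshold of 20 are arbitrary choices for this example and should be tuned per cluster.

```python
from collections import Counter


def denial_hotspots(events, threshold=20):
    """Return (user, source_ip, code) groups with at least `threshold` 401/403s."""
    counts = Counter()
    for event in events:
        code = (event.get("responseStatus") or {}).get("code")
        if code not in (401, 403):
            continue
        user = (event.get("user") or {}).get("username", "unknown")
        source_ip = (event.get("sourceIPs") or ["unknown"])[0]
        counts[(user, source_ip, code)] += 1
    # most_common() orders groups from highest to lowest count.
    return [(key, n) for key, n in counts.most_common() if n >= threshold]
```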

Node and Kubelet Activity

# Nodes accessing other nodes' resources (NodeRestriction violation)
cat audit.log | jq -r '
  select(any(.user.groups[]?; . == "system:nodes")) |
  select(.user.username != ("system:node:" + .objectRef.name)) |
  select(.objectRef.resource == "nodes") |
  [.stageTimestamp, .user.username, .verb, .objectRef.name] |
  @tsv
'

Dynamic Audit Webhooks

Dynamic audit webhooks (auditregistration.k8s.io) allowed audit sinks to be configured through AuditSink API objects without restarting the API server. The feature never progressed beyond alpha, so check whether your cluster version still includes it before relying on it.

Dynamic Audit Sink Availability

The AuditSink API was introduced in 1.13 as alpha and was removed in 1.19. In modern clusters (1.25+), configure the static webhook backend at API server startup; dynamic audit webhook registration via the API is no longer available.

Falco as an Audit Sink

Falco can receive Kubernetes audit events via webhook and apply its rule engine to detect security-relevant patterns in real time.

# Configure audit webhook to send to Falco's audit endpoint
# webhook-audit-config.yaml
apiVersion: v1
kind: Config
clusters:
- name: falco
  cluster:
    server: http://falco.falco-system.svc.cluster.local:8765/k8s-audit
contexts:
- name: falco
  context:
    cluster: falco
    user: ""
current-context: falco

# Falco rule triggered by audit event: detect anonymous kubectl access
- rule: K8s Anonymous Request
  desc: Detect requests from system:anonymous or system:unauthenticated
  condition: >
    ka.user.name in ("system:anonymous", "system:unauthenticated")
  output: >
    Anonymous request to API server
    (user=%ka.user.name verb=%ka.verb uri=%ka.uri
     response=%ka.response.code sourceip=%ka.source.ip)
  priority: CRITICAL
  source: k8s_audit

- rule: K8s ClusterAdmin Binding Created
  desc: Detect creation of ClusterRoleBindings granting cluster-admin
  condition: >
    ka.verb = "create" and
    ka.target.resource = "clusterrolebindings" and
    ka.req.binding.role = "cluster-admin"
  output: >
    cluster-admin ClusterRoleBinding created
    (user=%ka.user.name binding=%ka.target.name subject=%ka.req.binding.subjects)
  priority: CRITICAL
  source: k8s_audit

- rule: K8s Secret Access
  desc: Detect reads of Kubernetes Secrets
  condition: >
    ka.target.resource = "secrets" and
    ka.verb in ("get", "list") and
    not ka.user.name startswith "system:serviceaccount:kube-system:"
  output: >
    Kubernetes Secret accessed
    (user=%ka.user.name secret=%ka.target.name ns=%ka.target.namespace)
  priority: WARNING
  source: k8s_audit

Compliance Requirements

Framework	Requirement	Kubernetes Audit Coverage
PCI DSS 10	Log all access to cardholder data, all auth attempts, privileged actions	Secrets access at Request level; 403/401 logging; RBAC change logging
SOC 2 CC6.1	Logical access controls, monitoring of access	All read/write to sensitive resources; user activity timeline
HIPAA § 164.312(b)	Audit controls: record and examine activity on systems with ePHI	Full audit trail for any namespace containing ePHI workloads
CIS Benchmark 3.2.1	Ensure audit log enabled	`--audit-log-path` and `--audit-policy-file` must be set
CIS Benchmark 3.2.2	Ensure audit policy covers required events	Policy must include secrets, RBAC, exec, auth failures
NIST 800-53 AU-2	Audit event logging	All authentication, authorization, and privileged actions
ISO 27001 A.12.4	Event logging and protection of log information	Immutable log storage; log access controls

Retention Requirements

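Retention periods vary by framework. PCI DSS, for example, requires at least 12 months of audit log history with the most recent three months immediately available for analysis, so 1 year total with 90 days readily accessible is a common planning baseline. Plan your log storage accordingly: a busy cluster at Metadata level for most resources generates ~50–200GB/month. Use compressed cold storage (S3 Glacier, GCS Nearline) for logs older than 90 days.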
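
The storage arithmetic can be worked through with a short Python sketch. It assumes 30-day months and treats the compression ratio as a hypothetical input (1.0 means uncompressed); the function name `retention_estimate_gb` is made up for this example.

```python
def retention_estimate_gb(monthly_gb, total_months=12, hot_days=90, compression_ratio=1.0):
    """Estimate hot and cold storage (in GB) for an audit log retention plan.

    monthly_gb:        uncompressed log volume per month
    hot_days:          days kept readily accessible (assumes 30-day months)
    compression_ratio: size reduction applied to cold storage (assumed)
    """
    hot_months = hot_days / 30
    hot_gb = monthly_gb * hot_months
    cold_gb = monthly_gb * max(total_months - hot_months, 0) / compression_ratio
    return hot_gb, cold_gb
```

With the figures above, 50 GB/month works out to 150 GB hot and 450 GB cold before compression, and 200 GB/month to 600 GB hot and 1800 GB cold.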

Audit Log Integrity

# Audit logs must be protected from modification
# Ship to immutable append-only storage immediately:

# 1. S3 Object Lock (compliance mode)
aws s3api put-object-lock-configuration \
  --bucket k8s-audit-logs \
  --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Years":1}}}'

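# 2. Restrict who can read audit logs (file permissions on the node, IAM on the bucket or SIEM)
# Kubernetes RBAC does not protect files on disk; audit logs should NOT be readable by workloads

# 3. WORM storage at OS level (if keeping on-disk)
# chattr +a /var/log/kubernetes/audit/audit.log  (append-only; may interfere with API server log rotation)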

Metrics, Alerts & Runbooks

Key Metrics

Metric	Source	Description
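`apiserver_audit_event_total`	kube-apiserver	Total audit events generated and sent to the audit backend
`apiserver_audit_requests_rejected_total`	kube-apiserver	API requests rejected due to an audit backend error; events that failed to be written are counted in `apiserver_audit_error_total`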
`apiserver_audit_level_total`	kube-apiserver	Events per audit level (None/Metadata/Request/RequestResponse)
`falco_events_total`	Falco	Falco events triggered by k8s_audit rules
`node_filesystem_avail_bytes`	Node exporter	Free space on the filesystem that holds the audit log

Alerts

# Alert: Audit events being dropped
- alert: AuditEventsDropped
  expr: increase(apiserver_audit_requests_rejected_total[5m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Audit events being dropped — compliance gap"

# Alert: High rate of 403 responses (potential attack probe)
- alert: HighForbiddenRate
  expr: >
    sum(rate(apiserver_request_total{code="403"}[5m])) > 10
  for: 5m
  annotations:
    summary: "High rate of 403 responses — possible RBAC probing"

# Alert: cluster-admin binding created
- alert: ClusterAdminBindingCreated
  expr: increase(falco_events_total{rule="K8s ClusterAdmin Binding Created"}[5m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "cluster-admin ClusterRoleBinding created — immediate investigation required"

# Alert: Audit log file not growing (logging broken)
- alert: AuditLogStale
  expr: rate(apiserver_audit_event_total[5m]) == 0
  for: 10m
  annotations:
    summary: "No audit events generated in 10 minutes — audit logging may be broken"

Runbooks

Audit Events Being Dropped

1. Check apiserver_audit_requests_rejected_total for trend
2. Check webhook backend latency and availability
3. Increase --audit-webhook-batch-buffer-size and --audit-webhook-batch-max-size, or reduce policy verbosity
4. If log backend: check disk space on control plane node

Secret Accessed Unexpectedly

1. Query audit log: who accessed the secret and when
2. Check if access was from expected service account
3. If unexpected: rotate the secret immediately
4. Check if pod was compromised (Falco events, exec history)

cluster-admin Binding Alert

1. Identify who created it: jq query on audit log
2. Determine if authorized (planned infra change vs incident)
3. If unauthorized: delete binding, investigate originating pod/user
4. Review RBAC for how the creator had permission to create the binding

High 403 Rate

1. Identify source: top users/IPs by 403 count
2. Determine if misconfigured app (wrong SA permissions) vs attack
3. For misconfigured app: fix RBAC
4. For attack: block source IP at ingress/firewall level

Audit Log Pipeline Failure

1. Check Fluent Bit pod health: kubectl logs -n logging fluent-bit-xxx
2. Verify audit log file is being written on control plane
3. Check Elasticsearch/SIEM receiver is accepting connections
4. Verify TLS certificates for webhook backend haven't expired

Best Practices

Always set an audit policy — never use the default (no policy = no logging)

A missing --audit-policy-file flag means nothing is logged. The absence of audit logs is a compliance failure and makes incident investigation impossible.

Use tiered rules: None → Metadata → Request → RequestResponse

Start with aggressive noise suppression, escalate to higher levels only for security-sensitive resources. A flat "log everything at RequestResponse" policy will generate terabytes of data and obscure the signals you need.

Never log secrets at RequestResponse level

The response body of a secret read contains base64-encoded secret data. Log secrets at Request level (reads are recorded without the returned value, although create and update request bodies still carry the data) or at Metadata level, which never records values, if access patterns are all you need.

Ship logs to immutable external storage immediately

Audit logs on the control plane node can be deleted by a cluster administrator. Stream to an external SIEM or object storage with Object Lock enabled before an attacker can cover their tracks.

Alert on audit event drops

Dropped audit events are a compliance gap. Monitor apiserver_audit_requests_rejected_total and treat any non-zero value as a critical alert requiring immediate investigation.

Add Falco as a webhook audit sink for real-time alerting

File-based audit logs are useful for investigation but not for real-time detection. Route audit events to Falco via webhook backend to trigger alerts within seconds of suspicious activity.

Include authorization annotations in your policy

The authorization.k8s.io/decision annotation in audit events records whether access was allowed or denied. It is included in events at Metadata level and above, so a Metadata default for all resources costs almost nothing but enables powerful RBAC misconfiguration analysis.

Set omitManagedFields: true

Managed fields metadata in request/response bodies can be extremely verbose (often larger than the actual object). Setting omitManagedFields: true in the audit policy reduces log size by 30–60% with no loss of security-relevant information.