Observability Overview | Kubernetes Documentation

Why Observability

Monitoring asks "is this thing healthy?" — a binary answer based on known failure modes. Observability asks "why is this behaving this way?" — allowing exploration of unknown unknowns without deploying new instrumentation.

Reduce MTTR

The primary value of observability is reducing mean time to recovery. Correlated signals (metrics → logs → traces) let engineers identify root cause in minutes instead of hours of blind kubectl commands.

Understand Distributed Systems

Kubernetes workloads are inherently distributed. A latency spike in one service may be caused by a slow database, a noisy neighbor, a GC pause, or network congestion. Distributed traces are usually the most direct way to see which hop in the request path actually contributed the delay.

Capacity and Cost Planning

Historical metrics enable right-sizing (CPU/memory requests), predicting when the next node is needed, and identifying expensive operations before they impact SLOs.

Kubernetes Adds Complexity

Ephemeral pods, dynamic IPs, rolling deployments, and auto-scaling make traditional host-based monitoring insufficient. Observability must work at the workload level, not the instance level.

Observability vs Monitoring

Monitoring is checking for known failure states (is memory > 90%?). Observability is the ability to ask arbitrary questions about system behavior using the data it emits — without deploying new code. In practice, you need both: monitoring for alerting on known-bad states, observability for investigating unknown causes.

The Three Pillars

Metrics

Numeric measurements over time. Aggregated, low-cardinality, cheap to store. Best for alerting on trends, capacity, and SLO burn rate.

Logs

Timestamped, structured or unstructured text events. High fidelity, high volume. Best for debugging specific incidents and auditing.

Traces

Distributed call graphs with timing. Shows the path of a single request across all services. Best for understanding latency and service dependencies.

Signal Comparison

Property	Metrics	Logs	Traces
Granularity	Aggregated (counters, gauges, histograms)	Per-event	Per-request
Cardinality	Low (by design)	Very high	High (per trace ID)
Storage cost	Low	High	Medium (with sampling)
Query pattern	Aggregations, rate/range queries	Full-text search, filter	Trace ID lookup, service dependency maps
Best for	Alerting, dashboards, SLOs	Debugging, auditing, context	Latency analysis, dependency mapping
Kubernetes source	cAdvisor, kube-state-metrics, app /metrics	stdout/stderr → log collector	App instrumentation (OTel SDK)
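
The Metrics column lists histograms, and latency alerts such as "p99 > 500ms" are estimated from cumulative bucket counts rather than from raw request durations. A minimal sketch of that estimate follows, using linear interpolation within a bucket in the spirit of PromQL's histogram_quantile; the function name and the bucket values are illustrative assumptions, not taken from any library.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by bound;
    the last entry is the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate into +Inf: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency histogram: 100 requests in total.
example = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
p95 = estimate_quantile(0.95, example)
```

The estimate can never be more precise than the bucket boundaries, which is why bucket layout should bracket the SLO threshold you intend to alert on.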

The Fourth Signal: Events

Kubernetes Events are a native signal type — structured records of state changes in the cluster (pod scheduled, container crashed, image pulled, node pressure). They are often overlooked but are invaluable for correlating infrastructure changes with application symptoms. See Kubernetes Events for the full treatment.
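
Correlating events with a symptom usually comes down to asking "what changed in the cluster shortly before this started?". The sketch below shows that lookup over an in-memory list. The ClusterEvent fields are a simplified, hypothetical shape and not the real Event API schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class ClusterEvent:
    timestamp: datetime
    reason: str    # e.g. "OOMKilling", "FailedScheduling"
    involved: str  # e.g. "pod/inventory-svc-7d9f"


def events_before(events: Iterable[ClusterEvent],
                  symptom_at: datetime,
                  window: timedelta = timedelta(minutes=5)) -> List[ClusterEvent]:
    """Return events in the window leading up to the symptom, oldest first."""
    start = symptom_at - window
    hits = [e for e in events if start <= e.timestamp <= symptom_at]
    return sorted(hits, key=lambda e: e.timestamp)
```

In a real cluster the event list would come from an exporter or log store, since the API server only retains events for a short time by default.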

Kubernetes Observability Layers

Observability in Kubernetes spans four distinct layers, each with different data sources, tooling, and ownership.

── Observability layers (top to bottom = user-visible to infrastructure) ──

Layer 4: Application
Signals: custom business metrics, structured logs, distributed traces
Ownership: application teams
Tools: OTel SDK (auto + manual instrumentation), app /metrics endpoint
Examples: request rate, payment latency, cart abandonment rate
│
Layer 3: Workload / Platform
Signals: pod CPU/memory, container restarts, HPA scaling events
Ownership: platform team
Tools: kube-state-metrics, cAdvisor, Prometheus
Examples: OOMKill rate, pod pending duration, deployment rollout status
│
Layer 2: Kubernetes Control Plane
Signals: API server latency, etcd health, scheduler queue depth
Ownership: platform/SRE team
Tools: component /metrics endpoints, audit logs
Examples: apiserver_request_duration_seconds, apiserver_storage_objects (formerly etcd_object_counts)
│
Layer 1: Infrastructure / Nodes
Signals: node CPU/memory/disk/network, kernel metrics
Ownership: infrastructure team
Tools: node_exporter, cloud provider metrics
Examples: node_cpu_seconds_total, node_filesystem_avail_bytes

Key Data Sources per Layer

Layer	Component	Exposes	Scraped By
Application	App /metrics endpoint	Custom Prometheus metrics	Prometheus ServiceMonitor
Application	App stdout/stderr	Structured JSON logs	Fluent Bit / Fluentd
Application	OTel SDK	Traces (OTLP)	OTel Collector
Workload	cAdvisor (in kubelet)	Container CPU/mem/net	Prometheus (kubelet /metrics/cadvisor)
Workload	kube-state-metrics	Kubernetes object state	Prometheus
Control Plane	kube-apiserver /metrics	API request latency/count	Prometheus
Control Plane	etcd /metrics	etcd health, latency, size	Prometheus
Control Plane	kube-scheduler /metrics	Scheduling latency, queue depth	Prometheus
Node	node_exporter	OS-level metrics (CPU, mem, disk, net)	Prometheus
Node	kubelet /metrics	kubelet operations, pod lifecycle	Prometheus

OpenTelemetry

OpenTelemetry (OTel) is the CNCF project that has become the de facto standard for vendor-neutral instrumentation and telemetry collection. It unifies metrics, logs, and traces under a common set of SDKs, a wire protocol (OTLP), and a collection pipeline (OTel Collector).

── OpenTelemetry architecture ─────────────────────────────────────────

Application
└── OTel SDK (Go/Java/Python/Node/Rust/...)
    ├── Auto-instrumentation: HTTP, gRPC, DB, messaging frameworks
    ├── Manual instrumentation: custom spans, metrics, logs
    └── Exports via OTLP (gRPC or HTTP)
                │
                ▼
OTel Collector (DaemonSet or Deployment)
    ├── Receivers: OTLP, Prometheus, Jaeger, Zipkin, Fluent Forward, hostmetrics
    ├── Processors: batch, memory_limiter, resource, attributes, filter, probabilistic_sampler / tail_sampling
    └── Exporters: Prometheus remote write, OTLP (to Tempo, Jaeger, or another backend), Loki, debug (console)
                │
┌───────────────┼─────────────────┐
▼               ▼                 ▼
Prometheus      Tempo / Jaeger    Loki
(metrics)       (traces)          (logs)
│               │                 │
└───────────────▼─────────────────┘
             Grafana

OTel Collector Pipeline

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: otel-collector
        static_configs:
        - targets: [localhost:8888]
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      filesystem: {}

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  resource:
    attributes:
    - key: k8s.cluster.name
      value: production
      action: upsert

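exporters:
  prometheusremotewrite:
    # Prometheus must be started with --web.enable-remote-write-receiver to accept this
    endpoint: http://prometheus:9090/api/v1/write
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true   # plaintext in-cluster; enable TLS when crossing trust boundaries
  loki:
    # Newer Loki versions also accept OTLP directly; check that your
    # Collector distribution still ships the loki exporter
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    # Processor order matters: memory_limiter first so limits apply early,
    # batch last so batches contain the final attributes
    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]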
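
To make the batch settings above concrete, here is a simplified model of what a batching processor does with send_batch_size and timeout: it forwards a batch when enough items have accumulated, or when the oldest buffered item has waited long enough. This is an illustrative sketch with hypothetical class and method names, not the Collector's actual implementation.

```python
import time
from typing import Any, Callable, List, Optional


class BatchProcessor:
    """Buffers items and hands them to `export` in batches."""

    def __init__(self, export: Callable[[List[Any]], None],
                 send_batch_size: int = 1000, timeout_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._export = export
        self._size = send_batch_size
        self._timeout = timeout_s
        self._clock = clock
        self._items: List[Any] = []
        self._started: Optional[float] = None

    def consume(self, item: Any) -> None:
        if not self._items:
            self._started = self._clock()  # timer starts with the first item
        self._items.append(item)
        if len(self._items) >= self._size:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes a partial batch once the timeout passes."""
        if self._items and self._clock() - self._started >= self._timeout:
            self.flush()

    def flush(self) -> None:
        if self._items:
            batch, self._items = self._items, []
            self._export(batch)
```

Larger batches reduce per-request overhead toward the backend, while the timeout bounds how stale telemetry can get when traffic is low.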

OTel Operator for Kubernetes

# Install OTel Operator
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator-system \
  --create-namespace

# OpenTelemetryCollector CRD — operator manages collector lifecycle
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: daemonset   # daemonset | deployment | sidecar | statefulset
  config: |
    # inline OTel Collector config (see above)

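---
# Instrumentation CRD: auto-instrument pods without code changes
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: production
spec:
  exporter:
    # The operator names the collector Service "<collector name>-collector",
    # so the collector above is reached at otel-collector-collector.
    # Append ".<namespace>" if the collector runs outside this namespace.
    # Some language agents export OTLP over HTTP and need port 4318 instead.
    endpoint: http://otel-collector-collector:4317
  propagators: [tracecontext, baggage, b3]
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"   # sample 10% of new traces; child spans follow the parent's decision
  # Image tags are pinned examples; check the operator releases for current versions
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.32.0
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.41.1
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.41b0
  go:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-go:v0.9.0-alpha

# Opt a pod into auto-instrumentation with an annotation on the pod template:
# instrumentation.opentelemetry.io/inject-java: "true"
# instrumentation.opentelemetry.io/inject-nodejs: "true"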
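
The parentbased_traceidratio sampler combines two ideas: a deterministic keep/drop decision derived from the trace ID, and deference to whatever the calling service already decided. The sketch below illustrates both. It follows the general approach used by OTel SDKs (comparing the low 64 bits of the trace ID against a threshold), but the function names are hypothetical and details vary between SDKs.

```python
import random
from typing import Optional

TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a 128-bit trace ID


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def parent_based_sampled(trace_id: int, ratio: float,
                         parent_sampled: Optional[bool]) -> bool:
    """Follow the parent's decision if there is one; otherwise apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)


# Roughly 10% of randomly generated root traces are kept at ratio 0.1.
rng = random.Random(42)
kept = sum(parent_based_sampled(rng.getrandbits(128), 0.1, None) for _ in range(10_000))
```

Because every service derives the same answer from the same trace ID and children follow their parent, a sampled trace is kept end to end instead of arriving with missing spans.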

Signal Correlation

The power of observability comes from correlating signals — jumping from a metric alert to the relevant logs and traces for the same time window and request.

── Signal correlation flow ────────────────────────────────────────────

Alert fires: p99 latency > 500ms
│
▼ (click on metric data point in Grafana)
Metrics panel: latency spike at 14:32:15
│
├── Exemplar: trace_id=abc123 ← embedded in Prometheus histogram
│   └──▶ Tempo trace: full request path, 847ms total
│        ├── payment-svc: 12ms
│        ├── inventory-svc: 823ms ← SLOW
│        └── notification-svc: 8ms
│
└── Derived log query: trace_id=abc123
    └──▶ Loki logs: "DB connection pool exhausted" at inventory-svc

Exemplars: Linking Metrics to Traces

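// Prometheus exemplar: attach the trace ID to a histogram observation
// (Go example with prometheus/client_golang; httpDuration is a *prometheus.HistogramVec)
// With() returns a prometheus.Observer, so assert to ExemplarObserver first.
// The exemplar label name must match the one configured in the Grafana data source.
// Exemplars are only exposed in the OpenMetrics format
// (promhttp.HandlerOpts{EnableOpenMetrics: true}).
httpDuration.With(labels).(prometheus.ExemplarObserver).ObserveWithExemplar(
    duration,
    prometheus.Labels{"trace_id": traceID},
)

# Enable exemplar storage in Prometheus:
# start Prometheus with --enable-feature=exemplar-storage, then set in prometheus.yaml
storage:
  exemplars:
    max_exemplars: 100000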
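
On the wire, an exemplar is extra data appended to a sample line in the OpenMetrics exposition format. The sketch below renders one histogram bucket line with an exemplar so the link between a metric and a trace ID is visible. The function name and sample values are illustrative assumptions; client libraries produce this output for you.

```python
from typing import Dict


def render_bucket_with_exemplar(metric: str, le: str, cumulative_count: int,
                                exemplar_labels: Dict[str, str],
                                exemplar_value: float) -> str:
    """Render a histogram bucket sample followed by an OpenMetrics exemplar."""
    # OpenMetrics caps the combined length of exemplar label names and values.
    length = sum(len(k) + len(v) for k, v in exemplar_labels.items())
    if length > 128:
        raise ValueError("exemplar label set exceeds 128 characters")
    exemplar = ",".join(f'{k}="{v}"' for k, v in sorted(exemplar_labels.items()))
    return (f'{metric}_bucket{{le="{le}"}} {cumulative_count}'
            f' # {{{exemplar}}} {exemplar_value}')


line = render_bucket_with_exemplar(
    "http_request_duration_seconds", "1.0", 42, {"trace_id": "abc123"}, 0.847
)
```

The exemplar records one concrete observation (here a 0.847 s request) that fell into the bucket, which is what lets a dashboard jump from a latency spike to a representative trace.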

Trace ID in Logs

// Inject trace ID into structured logs (Go + Zap + OTel)
span := trace.SpanFromContext(ctx)
logger.Info("processing request",
    zap.String("trace_id", span.SpanContext().TraceID().String()),
    zap.String("span_id",  span.SpanContext().SpanID().String()),
    zap.String("service",  "payment-svc"),
)

# Log output (JSON):
{
  "level": "info",
  "ts": "2024-01-15T14:32:15.123Z",
  "msg": "processing request",
  "trace_id": "abc123def456...",
  "span_id": "0102030405060708",
  "service": "payment-svc"
}

# Loki query: find logs for a specific trace
{namespace="production"} | json | trace_id = "abc123def456..."
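
The LogQL query above parses each line as JSON and keeps those whose trace_id field matches. The same logic, sketched against an in-memory list of log lines, looks like this; the function name and sample lines are hypothetical and only meant to show why a consistent trace_id field in every log record matters.

```python
import json
from typing import Dict, Iterable, List


def logs_for_trace(lines: Iterable[str], trace_id: str) -> List[Dict]:
    """Return parsed JSON log records that carry the given trace ID."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured lines cannot be correlated by field
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matches.append(record)
    return matches


sample_lines = [
    '{"level": "info", "msg": "processing request", "trace_id": "abc123", "service": "payment-svc"}',
    '{"level": "error", "msg": "DB connection pool exhausted", "trace_id": "abc123", "service": "inventory-svc"}',
    '{"level": "info", "msg": "health check", "trace_id": "zzz999", "service": "payment-svc"}',
    "plain text line without structure",
]
hits = logs_for_trace(sample_lines, "abc123")
```

Note that the unstructured line is silently skipped, which mirrors what happens in practice: logs without a parseable trace ID field drop out of the correlation flow.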

Tooling Landscape

Open Source Stack (CNCF projects and Grafana LGTM)

Signal	Collection	Storage	Query / Visualization
Metrics	Prometheus, OTel Collector	Prometheus TSDB, Thanos, Cortex, Mimir	Grafana, PromQL
Logs	Fluent Bit, Fluentd, Promtail, OTel Collector	Loki	Grafana, LogQL
Traces	OTel SDK, OTel Collector	Tempo, Jaeger, Zipkin	Grafana, Jaeger UI
Events	kube-events-exporter, eventrouter	Loki, Elasticsearch	Grafana, Kibana
Profiles	Pyroscope agent, eBPF profilers	Pyroscope	Grafana (Pyroscope plugin)

Stack Comparison

Stack	Components	Strengths	Weaknesses
LGTM (self-hosted)	Loki + Grafana + Tempo + Mimir	Fully open source, tightly integrated, cost-effective at scale	Operational complexity; each component needs HA
Prometheus + ELK	Prometheus + Elasticsearch + Kibana	Mature, large community, full-text search	Elasticsearch resource-intensive; two separate UIs
Grafana Cloud	Hosted LGTM stack	Zero ops for backend; generous free tier	Data leaves cluster; cost at scale
Datadog	Unified SaaS platform	Best-in-class UX; APM + infra in one product	Expensive; data sent to vendor; vendor lock-in
New Relic	Unified SaaS platform	Simple pricing; NRQL query language	Vendor lock-in; data sent to vendor

kube-prometheus-stack: The Quick Start

The kube-prometheus-stack Helm chart (formerly prometheus-operator) installs Prometheus Operator, Prometheus, Alertmanager, Grafana, kube-state-metrics, and node_exporter in one command with pre-built dashboards for Kubernetes. It is the standard starting point for Kubernetes metrics observability. See Metrics & Prometheus for the full installation guide.

Instrumentation Strategy

Auto-Instrumentation vs Manual Instrumentation

Approach	How	Coverage	Effort	When to Use
Auto-instrumentation	OTel Operator Instrumentation CRD; language agents (Java agent, Node.js require hook)	HTTP, gRPC, DB, messaging — framework-level	Low (annotation per pod)	All new services; existing services without time budget
Manual instrumentation	OTel SDK: `tracer.Start()`, `span.SetAttribute()`, custom metrics	Business logic, custom spans, domain metrics	High (per-operation code changes)	Business-critical paths where auto-instrumentation misses context
eBPF-based	Pixie, Hubble, Tetragon (kernel-level tracing)	Network, syscalls, DNS, plaintext HTTP (encrypted traffic needs tool-specific TLS hooks)	Low (no application code changes, but requires privileged node agents)	Observing workloads you cannot instrument (third-party, legacy)

What to Instrument First

1. RED Metrics (Every Service)

Rate: requests per second
Errors: failed requests per second
Duration: request latency distribution (p50/p99)
These three metrics cover most request-driven SLO alerting.
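
As a worked illustration of the RED definitions above, the sketch below computes rate, errors, and duration percentiles from a list of request records. The Request type, the 5xx error rule, and the nearest-rank percentile are assumptions made for the example; in production these values come from counters and histograms rather than raw request lists.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_s: float
    status: int


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Compute Rate, Errors, and Duration (p50/p99) over one time window."""
    n = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)

    def percentile(p: int) -> float:
        if not durations:
            return float("nan")
        # Nearest-rank percentile using integer arithmetic.
        idx = max(0, (p * n + 99) // 100 - 1)
        return durations[idx]

    return {
        "rate_per_s": n / window_s,
        "errors_per_s": errors / window_s,
        "p50_s": percentile(50),
        "p99_s": percentile(99),
    }
```

For example, 100 requests in a 10-second window with 5 server errors give a rate of 10 requests/s and 0.5 errors/s.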

2. USE Metrics (Every Resource)

Utilization: % time busy
Saturation: queue depth / wait time
Errors: error rate
Apply to: CPU, memory, disk I/O, network.

3. Business Metrics

Domain-specific KPIs: orders per minute, payment success rate, active users. These are the metrics your business cares about when SLOs are expressed in business terms.

4. Deep-Dive Traces

Add manual spans for complex multi-step business operations (checkout flow, authentication chain) where auto-instrumentation doesn't capture enough context.

SLOs, SLIs, and Error Budgets

SLOs (Service Level Objectives) transform raw metrics into business-facing reliability targets. They are the bridge between engineering observability and product/business requirements.

Term	Definition	Example
SLI (Service Level Indicator)	A metric that measures reliability from the user's perspective	Percentage of requests with latency < 200ms
SLO (Service Level Objective)	Target value or range for an SLI over a rolling window	99.9% of requests complete in < 200ms over 30 days
Error Budget	1 - SLO = allowed failure budget per window	0.1% of requests in 30 days may fail; for a time-based SLO that is about 43 minutes of downtime
Burn Rate	Rate at which error budget is consumed	Burn rate 2 = consuming budget twice as fast as allowed
SLA (Service Level Agreement)	Contractual commitment to external customers, usually with penalties if missed	99.5% uptime SLA (looser than the 99.9% internal SLO, leaving a safety margin)
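
The arithmetic behind the Error Budget and Burn Rate rows can be checked with a few lines of code. This is a small sketch with hypothetical function names; it assumes a time-based reading of the budget for the minutes calculation.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for a time-based SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1 - slo)


def days_until_exhausted(rate: float, window_days: int) -> float:
    """At a constant burn rate, how long a full budget lasts."""
    return window_days / rate


budget = error_budget_minutes(0.999, 30)       # about 43.2 minutes
rate = burn_rate(0.002, 0.999)                 # about 2: twice the allowed pace
lasts = days_until_exhausted(rate, 30)         # about 15 days
```

A burn rate of 1 spends the budget exactly at the end of the window; anything above 1 exhausts it early.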

# Multi-window burn rate alert (Google SRE approach)
# Fires when the error budget is being consumed faster than is sustainable.
# Assumes a 99.9% SLO over 30 days (error budget 0.001) and recording rules
# job:slo_errors:rate1h / rate5m that hold the error ratio for each window.

# Long window (burn is significant) + short window (burn is still happening, so the alert resets quickly)
- alert: SLOBurnRateHigh
  expr: |
    (
      # 1h burn rate > 14.4x (uses 2% of the 30-day budget in 1h at this rate)
      job:slo_errors:rate1h{job="payment-svc"} / 0.001 > 14.4
    ) and (
      # 5m burn rate > 14.4x (confirms the burn has not already stopped)
      job:slo_errors:rate5m{job="payment-svc"} / 0.001 > 14.4
    )
  labels:
    severity: page
  annotations:
    summary: "SLO burn rate critical, paging"
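
The 14.4 threshold and the two-window condition in this rule can be reproduced numerically. The sketch below assumes a 30-day (720-hour) SLO window and a 99.9% target, as in the alert; the function names are hypothetical.

```python
def budget_consumed(burn_rate: float, alert_window_h: float,
                    slo_window_h: float = 720.0) -> float:
    """Fraction of the total error budget spent during the alert window."""
    return burn_rate * alert_window_h / slo_window_h


def should_page(error_ratio_1h: float, error_ratio_5m: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    budget = 1 - slo
    long_burning = error_ratio_1h / budget > threshold
    still_burning = error_ratio_5m / budget > threshold
    return long_burning and still_burning


spent = budget_consumed(14.4, 1.0)           # about 0.02, i.e. 2% of the budget in one hour
ongoing = should_page(0.02, 0.02)            # True: 20x burn in both windows
recovered = should_page(0.02, 0.0005)        # False: the last 5 minutes are healthy again
```

The second case shows why the short window is there: once the incident is over, the 1h ratio stays elevated for a while, but the 5m ratio drops and the page clears.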

Observability Cost Management

Observability infrastructure can become one of the largest operational costs in a Kubernetes platform. Understanding the cost drivers enables targeted optimization.

Cost Drivers and Mitigations

Driver	Impact	Mitigation
High-cardinality metrics	Prometheus memory/storage explosion; slow queries	Drop high-cardinality labels; use recording rules; Prometheus label limits
Log volume	Loki/Elasticsearch storage and ingestion cost	Log sampling for high-volume debug logs; drop noisy log lines at Fluent Bit level
100% trace sampling	Trace storage scales linearly with traffic	Head-based sampling (10%); tail-based sampling (OTel Collector); always-sample errors
Metrics retention	Long retention = large disk; rarely queried after 90d	Tiered retention: 15d hot, 90d warm, 1y cold (Thanos object storage)
Series churn from pod replacement	Each new pod name creates a fresh set of time series; old series remain until retention expires	kube-state-metrics: limit label cardinality; drop pod-hash labels where per-pod detail is not needed
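
The trace sampling row mentions tail-based sampling with an "always-sample errors" rule. Unlike head-based sampling, the decision is made after the whole trace is available, so it can depend on what actually happened. A minimal sketch of such a policy follows; the Span type, the 500 ms slow threshold, and the 10% baseline are illustrative assumptions rather than the Collector's tail_sampling configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    trace_id: str       # hex string
    duration_ms: float
    is_error: bool


def keep_trace(spans: List[Span], slow_ms: float = 500.0,
               baseline_percent: int = 10) -> bool:
    """Decide whether to store a completed trace."""
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True  # always keep traces containing errors
    if any(s.duration_ms >= slow_ms for s in spans):
        return True  # always keep slow traces
    # Otherwise keep a deterministic baseline share of healthy traffic.
    return int(spans[0].trace_id, 16) % 100 < baseline_percent
```

The trade-off is memory: the sampler has to buffer every span of a trace until it decides, which is why this typically runs in a dedicated Collector tier.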

Cardinality Explosion

Cardinality explosion is one of the most common Prometheus operational crises. Adding a high-cardinality label (user ID, request ID, session ID) to a metric multiplies its series count by up to the number of distinct label values. A metric with 1,000 time series can become 1,000,000 time series if you add a user_id label with 1,000 users. Because Prometheus keeps active series in memory, growth like this can get the Prometheus pod OOM-killed. See Metrics & Prometheus for cardinality management techniques.
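
The multiplication described above is easy to verify: the worst-case series count for a metric is the product of the distinct values of each label. The label names and counts in this sketch are hypothetical.

```python
from math import prod
from typing import Dict


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


base = {"method": 5, "status": 10, "pod": 20}
before = worst_case_series(base)                          # 1,000 series
after = worst_case_series({**base, "user_id": 1000})      # 1,000,000 series
```

The real count is often lower because not every combination occurs, but the bound is a useful review check before a new label ships.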

Section Guide

Metrics & Prometheus

Prometheus architecture, ServiceMonitor, PromQL, recording rules, kube-prometheus-stack, Thanos/Mimir for scale

Logging

Fluent Bit DaemonSet, Loki, LogQL, structured logging, log aggregation patterns, retention

Distributed Tracing

OTel SDK, Tempo/Jaeger, sampling strategies, trace propagation, service dependency maps

Kubernetes Events

Event anatomy, kube-events-exporter, event-driven alerting, incident correlation

Dashboards & Grafana

Grafana setup, dashboard-as-code (Grafonnet), USE/RED dashboard templates, multi-cluster views

Alerting

Alertmanager, routing, grouping, inhibition, silence, PagerDuty integration, burn rate alerts

Profiling

Continuous profiling with Pyroscope, pprof endpoints, eBPF-based profiling, flame graphs