Observability Overview
The three pillars of observability in Kubernetes — metrics, logs, and traces — plus events, dashboards, alerting, and profiling. Architecture, signal relationships, tooling landscape, and the instrumentation strategy for production clusters.
Coverage Checklist
- Why observability: MTTR, cardinality, unknown unknowns
- Three pillars: metrics, logs, traces
- Fourth signal: events
- Kubernetes observability layers: infra/platform/workload/app
- OpenTelemetry: OTel SDK, Collector, OTLP protocol
- OTel Collector pipeline: receivers/processors/exporters
- Signal correlation: exemplars, trace ID in logs
- Tooling landscape: Prometheus, Loki, Tempo, Jaeger, Grafana
- PLG stack vs LGTM stack vs vendor solutions
- Instrumentation strategy: auto vs manual
- SLIs, SLOs, error budgets, burn rate alerts
- Observability cost: cardinality explosion, sampling, retention
- Section guide with links to detail pages
Why Observability
Monitoring asks "is this thing healthy?" — a binary answer based on known failure modes. Observability asks "why is this behaving this way?" — allowing exploration of unknown unknowns without deploying new instrumentation.
Reduce MTTR
The primary value of observability is reducing mean time to recovery. Correlated signals (metrics → logs → traces) let engineers identify root cause in minutes instead of hours of blind kubectl commands.
Understand Distributed Systems
Kubernetes workloads are inherently distributed. A latency spike in one service may be caused by a slow database, a noisy neighbor, a GC pause, or network congestion. Distributed traces are usually the most direct way to see the actual causal chain.
Capacity and Cost Planning
Historical metrics enable right-sizing (CPU/memory requests), predicting when the next node is needed, and identifying expensive operations before they impact SLOs.
Kubernetes Adds Complexity
Ephemeral pods, dynamic IPs, rolling deployments, and auto-scaling make traditional host-based monitoring insufficient. Observability must work at the workload level, not the instance level.
Monitoring is checking for known failure states (is memory > 90%?). Observability is the ability to ask arbitrary questions about system behavior using the data it emits — without deploying new code. In practice, you need both: monitoring for alerting on known-bad states, observability for investigating unknown causes.
The Three Pillars
Signal Comparison
| Property | Metrics | Logs | Traces |
|---|---|---|---|
| Granularity | Aggregated (counters, gauges, histograms) | Per-event | Per-request |
| Cardinality | Low (by design) | Very high | High (per trace ID) |
| Storage cost | Low | High | Medium (with sampling) |
| Query pattern | Aggregations, rate/range queries | Full-text search, filter | Trace ID lookup, service dependency maps |
| Best for | Alerting, dashboards, SLOs | Debugging, auditing, context | Latency analysis, dependency mapping |
| Kubernetes source | cAdvisor, kube-state-metrics, app /metrics | stdout/stderr → log collector | App instrumentation (OTel SDK) |
The Fourth Signal: Events
Kubernetes Events are a native signal type — structured records of state changes in the cluster (pod scheduled, container crashed, image pulled, node pressure). They are often overlooked but are invaluable for correlating infrastructure changes with application symptoms. See Kubernetes Events for the full treatment.
Kubernetes Observability Layers
Observability in Kubernetes spans four distinct layers, each with different data sources, tooling, and ownership.
Layer 4: Application
Signals: custom business metrics, structured logs, distributed traces
Ownership: application teams
Tools: OTel SDK (auto + manual instrumentation), app /metrics endpoint
Examples: request rate, payment latency, cart abandonment rate
│
Layer 3: Workload / Platform
Signals: pod CPU/memory, container restarts, HPA scaling events
Ownership: platform team
Tools: kube-state-metrics, cAdvisor, Prometheus
Examples: OOMKill rate, pod pending duration, deployment rollout status
│
Layer 2: Kubernetes Control Plane
Signals: API server latency, etcd health, scheduler queue depth
Ownership: platform/SRE team
Tools: component /metrics endpoints, audit logs
Examples: apiserver_request_duration_seconds, apiserver_storage_objects (replaces the deprecated etcd_object_counts)
│
Layer 1: Infrastructure / Nodes
Signals: node CPU/memory/disk/network, kernel metrics
Ownership: infrastructure team
Tools: node_exporter, cloud provider metrics
Examples: node_cpu_seconds_total, node_filesystem_avail_bytes
Key Data Sources per Layer
| Layer | Component | Exposes | Scraped By |
|---|---|---|---|
| Application | App /metrics endpoint | Custom Prometheus metrics | Prometheus ServiceMonitor |
| Application | App stdout/stderr | Structured JSON logs | Fluent Bit / Fluentd |
| Application | OTel SDK | Traces (OTLP) | OTel Collector |
| Workload | cAdvisor (in kubelet) | Container CPU/mem/net | Prometheus (kubelet /metrics/cadvisor) |
| Workload | kube-state-metrics | Kubernetes object state | Prometheus |
| Control Plane | kube-apiserver /metrics | API request latency/count | Prometheus |
| Control Plane | etcd /metrics | etcd health, latency, size | Prometheus |
| Control Plane | kube-scheduler /metrics | Scheduling latency, queue depth | Prometheus |
| Node | node_exporter | OS-level metrics (CPU, mem, disk, net) | Prometheus |
| Node | kubelet /metrics | kubelet operations, pod lifecycle | Prometheus |
OpenTelemetry
OpenTelemetry (OTel) is the CNCF standard for vendor-neutral instrumentation and telemetry collection. It unifies metrics, logs, and traces under a single SDK, wire protocol (OTLP), and collection pipeline (OTel Collector).
Application
  └── OTel SDK (Go/Java/Python/Node/Rust/...)
        ├── Auto-instrumentation: HTTP, gRPC, DB, messaging frameworks
        ├── Manual instrumentation: custom spans, metrics, logs
        └── Exports via OTLP (gRPC or HTTP)
                    │
                    ▼
OTel Collector (DaemonSet or Deployment)
  ├── Receivers:  OTLP, Prometheus, Jaeger, Zipkin, Fluent Forward, hostmetrics
  ├── Processors: batch, memory_limiter, resource, attributes, filter, sampling
  └── Exporters:  Prometheus, OTLP (to backend), Loki, Jaeger, Tempo, stdout
                    │
    ┌───────────────┼─────────────────┐
    ▼               ▼                 ▼
Prometheus    Tempo / Jaeger        Loki
(metrics)       (traces)           (logs)
    │               │                 │
    └───────────────▼─────────────────┘
                 Grafana
OTel Collector Pipeline
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        # Scrape the Collector's own internal metrics
        - job_name: otel-collector
          static_configs:
            - targets: [localhost:8888]
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      filesystem: {}
processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  resource:
    attributes:
      - key: k8s.cluster.name
        value: production
        action: upsert
exporters:
  prometheusremotewrite:
    # Prometheus must run with --web.enable-remote-write-receiver
    endpoint: http://prometheus:9090/api/v1/write
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    # memory_limiter runs first and batch last (the commonly recommended order)
    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
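The receiver → processor → exporter flow configured above can be pictured as a chain of functions. The following is a toy Python model, not the Collector's implementation; `Pipeline`, `add_resource_attribute`, and `batch` are hypothetical names invented for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]  # one telemetry item (span, metric point, or log record)
Processor = Callable[[List[Record]], List[Record]]


def add_resource_attribute(key: str, value: str) -> Processor:
    """Mimics the 'resource' processor with action=upsert."""
    def process(records: List[Record]) -> List[Record]:
        return [{**record, key: value} for record in records]
    return process


def batch(records: List[Record], size: int) -> List[List[Record]]:
    """Mimics the 'batch' processor: group records into fixed-size chunks."""
    return [records[i:i + size] for i in range(0, len(records), size)]


class Pipeline:
    """Receives records, runs processors in order, then 'exports' batches."""

    def __init__(self, processors: List[Processor], batch_size: int):
        self.processors = processors
        self.batch_size = batch_size
        self.exported: List[List[Record]] = []  # stands in for an exporter

    def receive(self, records: List[Record]) -> None:
        for processor in self.processors:
            records = processor(records)
        self.exported.extend(batch(records, self.batch_size))


pipeline = Pipeline(
    processors=[add_resource_attribute("k8s.cluster.name", "production")],
    batch_size=2,
)
pipeline.receive([{"name": "span-a"}, {"name": "span-b"}, {"name": "span-c"}])
print(pipeline.exported)
```

Three incoming records leave as two batches, each record carrying the cluster attribute, which mirrors why `resource` is placed before `batch` in the pipeline.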
OTel Operator for Kubernetes
# Install OTel Operator
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator-system \
  --create-namespace
# OpenTelemetryCollector CRD: operator manages collector lifecycle
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: daemonset  # daemonset | deployment | sidecar | statefulset
  config: |
    # inline OTel Collector config (see above)
# Instrumentation CRD: auto-instrument pods without code changes
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: production
spec:
  exporter:
    # Adjust to your collector Service; operator-managed collectors
    # are exposed as <collector-name>-collector in their namespace
    endpoint: http://otel-collector:4317
  propagators: [tracecontext, baggage, b3]
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"  # 10% sampling
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.32.0
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.41.1
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.41b0
  go:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-go:v0.9.0-alpha
# Opt pod into auto-instrumentation via annotation:
# instrumentation.opentelemetry.io/inject-java: "true"
# instrumentation.opentelemetry.io/inject-nodejs: "true"
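To make the `parentbased_traceidratio` sampler with argument `0.1` more concrete, here is a minimal Python sketch of the idea: the decision is derived deterministically from the trace ID, and a span with a parent simply follows the parent's decision. The function names are hypothetical, and real SDK implementations may differ in detail.

```python
from typing import Optional

MASK_64 = (1 << 64) - 1


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Sample when the low 64 bits of the trace ID fall below ratio * 2^64."""
    bound = round(ratio * (1 << 64))
    return (trace_id & MASK_64) < bound


def parent_based_sampled(
    trace_id: int, ratio: float, parent_sampled: Optional[bool]
) -> bool:
    """Follow the parent's decision if there is one, else use the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)


# A root span (no parent) is decided by its trace ID alone.
print(parent_based_sampled(trace_id=0x01, ratio=0.1, parent_sampled=None))
# A child span inherits the decision, keeping traces complete.
print(parent_based_sampled(trace_id=MASK_64, ratio=0.1, parent_sampled=True))
```

Because every service derives the same answer from the same trace ID, a trace is either recorded end to end or not at all, rather than arriving with missing spans.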
Signal Correlation
The power of observability comes from correlating signals — jumping from a metric alert to the relevant logs and traces for the same time window and request.
Alert fires: p99 latency > 500ms
    │
    ▼  (click on metric data point in Grafana)
Metrics panel: latency spike at 14:32:15
    │
    ├── Exemplar: trace_id=abc123   ← embedded in Prometheus histogram
    │     └──▶ Tempo trace: full request path, 847ms total
    │            ├── payment-svc:      12ms
    │            ├── inventory-svc:    823ms   ← SLOW
    │            └── notification-svc: 8ms
    │
    └── Derived log query: trace_id=abc123
          └──▶ Loki logs: "DB connection pool exhausted" at inventory-svc
Exemplars: Linking Metrics to Traces
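// Prometheus exemplar: attach the trace ID to a histogram observation
// (Go example with prometheus/client_golang).
// With() returns a prometheus.Observer; exemplars need the
// prometheus.ExemplarObserver interface, so assert to it first.
httpDuration.With(labels).(prometheus.ExemplarObserver).ObserveWithExemplar(
    duration,
    prometheus.Labels{"traceID": traceID},
)

# Enable exemplar storage in Prometheus:
# start Prometheus with --enable-feature=exemplar-storage
# prometheus.yaml
storage:
  exemplars:
    max_exemplars: 100000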
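For orientation, an exemplar travels on the scrape as a suffix on a histogram bucket line. The sketch below builds such a line in Python, following the OpenMetrics text format as commonly documented; the helper names and sample values are hypothetical, and the exact layout should be checked against the specification.

```python
def format_labels(labels: dict) -> str:
    """Render labels as key="value" pairs joined by commas."""
    return ",".join(f'{key}="{value}"' for key, value in labels.items())


def bucket_line_with_exemplar(
    metric: str,
    labels: dict,
    count: int,
    exemplar_labels: dict,
    exemplar_value: float,
) -> str:
    """Build one bucket sample followed by '# {exemplar labels} value'."""
    return (
        f"{metric}{{{format_labels(labels)}}} {count}"
        f" # {{{format_labels(exemplar_labels)}}} {exemplar_value}"
    )


line = bucket_line_with_exemplar(
    "http_request_duration_seconds_bucket",
    {"le": "1"},
    42,
    {"traceID": "abc123"},
    0.847,
)
print(line)
```

The exemplar label (`traceID` here) is what a dashboard uses to link the data point to the matching trace.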
Trace ID in Logs
// Inject trace ID into structured logs (Go + Zap + OTel)
span := trace.SpanFromContext(ctx)
logger.Info("processing request",
    zap.String("trace_id", span.SpanContext().TraceID().String()),
    zap.String("span_id", span.SpanContext().SpanID().String()),
    zap.String("service", "payment-svc"),
)

# Log output (JSON):
{
  "level": "info",
  "ts": "2024-01-15T14:32:15.123Z",
  "msg": "processing request",
  "trace_id": "abc123def456...",
  "span_id": "0102030405060708",
  "service": "payment-svc"
}
# Loki query: find logs for a specific trace
{namespace="production"} | json | trace_id = "abc123def456..."
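The effect of that query can be reproduced locally on raw log lines, which is handy when testing a log format before shipping it. This is a minimal Python sketch; `logs_for_trace` and the sample lines are hypothetical, and it is not how Loki evaluates LogQL internally.

```python
import json
from typing import Dict, Iterable, List


def logs_for_trace(lines: Iterable[str], trace_id: str) -> List[Dict]:
    """Return parsed JSON log records whose trace_id field matches."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matches.append(record)
    return matches


sample_lines = [
    '{"level": "info", "msg": "processing request", "trace_id": "abc123"}',
    '{"level": "error", "msg": "DB connection pool exhausted", "trace_id": "abc123"}',
    '{"level": "info", "msg": "processing request", "trace_id": "zzz999"}',
    "plain text line without structure",
]

for record in logs_for_trace(sample_lines, "abc123"):
    print(record["level"], record["msg"])
```

The unstructured line is silently skipped, which illustrates why trace correlation depends on every service emitting the trace ID under the same field name.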
Tooling Landscape
Open Source Stack (CNCF projects and Grafana LGTM)
| Signal | Collection | Storage | Query / Visualization |
|---|---|---|---|
| Metrics | Prometheus, OTel Collector | Prometheus TSDB, Thanos, Cortex, Mimir | Grafana, PromQL |
| Logs | Fluent Bit, Fluentd, Promtail, OTel Collector | Loki | Grafana, LogQL |
| Traces | OTel SDK, OTel Collector | Tempo, Jaeger, Zipkin | Grafana, Jaeger UI |
| Events | kube-events-exporter, eventrouter | Loki, Elasticsearch | Grafana, Kibana |
| Profiles | Pyroscope agent, eBPF profilers | Pyroscope | Grafana (Pyroscope plugin) |
Stack Comparison
| Stack | Components | Strengths | Weaknesses |
|---|---|---|---|
| LGTM (self-hosted) | Loki + Grafana + Tempo + Mimir | Fully open source, tightly integrated, cost-effective at scale | Operational complexity; each component needs HA |
| Prometheus + ELK | Prometheus + Elasticsearch + Kibana | Mature, large community, full-text search | Elasticsearch resource-intensive; two separate UIs |
| Grafana Cloud | Hosted LGTM stack | Zero ops for backend; generous free tier | Data leaves cluster; cost at scale |
| Datadog | Unified SaaS platform | Best-in-class UX; APM + infra in one product | Expensive; data sent to vendor; vendor lock-in |
| New Relic | Unified SaaS platform | Simple pricing; NRQL query language | Vendor lock-in; data sent to vendor |
The kube-prometheus-stack Helm chart (formerly prometheus-operator) installs Prometheus Operator, Prometheus, Alertmanager, Grafana, kube-state-metrics, and node_exporter in one command with pre-built dashboards for Kubernetes. It is the standard starting point for Kubernetes metrics observability. See Metrics & Prometheus for the full installation guide.
Instrumentation Strategy
Auto-Instrumentation vs Manual Instrumentation
| Approach | How | Coverage | Effort | When to Use |
|---|---|---|---|---|
| Auto-instrumentation | OTel Operator Instrumentation CRD; language agents (Java agent, Node.js require hook) | HTTP, gRPC, DB, messaging — framework-level | Low (annotation per pod) | All new services; existing services without time budget |
| Manual instrumentation | OTel SDK: tracer.Start(), span.SetAttribute(), custom metrics | Business logic, custom spans, domain metrics | High (per-operation code changes) | Business-critical paths where auto-instrumentation misses context |
| eBPF-based | Pixie, Hubble, Tetragon — kernel-level tracing | Network, syscalls, DNS, HTTP (without TLS decryption) | Zero (no code changes) | Observing workloads you cannot instrument (third-party, legacy) |
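To show what manual instrumentation adds over the automatic kind, here is a toy Python span recorder. It is deliberately not the OpenTelemetry API: `start_span` and `finished_spans` are hypothetical names, and the sketch only illustrates that a manual span wraps one business operation and carries domain attributes that a framework-level agent cannot know.

```python
import time
from contextlib import contextmanager

finished_spans = []  # stands in for an exporter


@contextmanager
def start_span(name, **attributes):
    """Record the duration and attributes of the wrapped block."""
    span = {"name": name, "attributes": dict(attributes)}
    started = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        # Mark the span as failed, then let the error propagate.
        span["attributes"]["error"] = repr(exc)
        raise
    finally:
        span["duration_ms"] = (time.perf_counter() - started) * 1000
        finished_spans.append(span)


with start_span("checkout", cart_items=3) as span:
    # Domain context that auto-instrumentation would not capture.
    span["attributes"]["payment.method"] = "card"

print(finished_spans[0]["name"], finished_spans[0]["attributes"])
```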
What to Instrument First
1. RED Metrics (Every Service)
Rate: requests per second
Errors: failed requests per second
Duration: request latency distribution (p50/p99)
These three metrics cover most request-driven SLO alerting.
2. USE Metrics (Every Resource)
Utilization: % time busy
Saturation: queue depth / wait time
Errors: error rate
Apply to: CPU, memory, disk I/O, network.
3. Business Metrics
Domain-specific KPIs: orders per minute, payment success rate, active users. These are the metrics your business cares about when SLOs are expressed in business terms.
4. Deep-Dive Traces
Add manual spans for complex multi-step business operations (checkout flow, authentication chain) where auto-instrumentation doesn't capture enough context.
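The RED metrics from step 1 can be computed from raw request records as a sanity check on what a dashboard should show. This is a minimal Python sketch; the record shape, the nearest-rank percentile method, and the rule that a status of 500 or above counts as an error are assumptions for illustration.

```python
from typing import Dict, List


def percentile(sorted_values: List[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100 (non-empty input)."""
    rank = (q * len(sorted_values) + 99) // 100  # integer ceil(q * n / 100)
    return sorted_values[max(rank, 1) - 1]


def red_metrics(requests: List[Dict], window_seconds: float) -> Dict[str, float]:
    """Rate, Errors, Duration for one service over one time window."""
    durations = sorted(r["duration_ms"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": errors / window_seconds,
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }


# 100 requests in a 10-second window; every 20th request fails.
requests = [
    {"duration_ms": i, "status": 500 if i % 20 == 0 else 200}
    for i in range(1, 101)
]
print(red_metrics(requests, window_seconds=10))
```

In production these values normally come from histogram and counter queries rather than raw records, since storing every request as a metric sample would defeat the low-cardinality design of metrics.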
SLOs, SLIs, and Error Budgets
SLOs (Service Level Objectives) transform raw metrics into business-facing reliability targets. They are the bridge between engineering observability and product/business requirements.
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A metric that measures reliability from the user's perspective | Percentage of requests with latency < 200ms |
| SLO (Service Level Objective) | Target value or range for an SLI over a rolling window | 99.9% of requests complete in < 200ms over 30 days |
| Error Budget | 1 - SLO = allowed failure budget per window | 0.1% of requests in a 30-day window; for a time-based SLO, about 43 minutes of downtime |
| Burn Rate | Rate at which error budget is consumed | Burn rate 2 = consuming budget twice as fast as allowed |
| SLA (Service Level Agreement) | Contractual commitment to external customers | 99.5% uptime SLA (usually looser than the internal SLO, which leaves a safety margin) |
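The error budget arithmetic in the table is easy to verify. The sketch below, with hypothetical function names, computes the budget both as minutes of downtime (time-based SLO) and as a count of failed requests (request-based SLO).

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for a time-based SLO."""
    return (1 - slo) * window_days * 24 * 60


def allowed_failures(slo: float, total_requests: int) -> float:
    """Allowed failed requests for a request-based SLO."""
    return (1 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1 - failed / allowed_failures(slo, total_requests)


print(round(error_budget_minutes(0.999, 30), 1))         # minutes per 30 days
print(round(allowed_failures(0.999, 1_000_000)))         # failed requests allowed
print(round(budget_remaining(0.999, 1_000_000, 250), 2)) # fraction left
```

A 99.9% target over 30 days leaves roughly 43 minutes, and each additional nine divides that budget by ten.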
# Multi-window burn rate alert (Google SRE approach)
# Fires when error budget is being consumed faster than sustainable
# Long window (significant budget spent) + short window (burn still ongoing)
# job:slo_errors:rate1h / rate5m are assumed to be recording rules for the
# error ratio; 0.001 is the error budget of a 99.9% SLO
- alert: SLOBurnRateHigh
  expr: |
    (
      # 1h burn rate > 14.4x (uses 2% of a 30-day budget in 1h at this rate)
      job:slo_errors:rate1h{job="payment-svc"} / 0.001 > 14.4
    ) and (
      # 5m burn rate > 14.4x (confirms the burn is still happening now)
      job:slo_errors:rate5m{job="payment-svc"} / 0.001 > 14.4
    )
  labels:
    severity: page
  annotations:
    summary: "SLO burn rate critical: paging"
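The 14.4 threshold follows from simple arithmetic: burn rate is the observed error ratio divided by the error budget, and a burn rate of 14.4 sustained for one hour spends 2% of a 30-day budget. The Python sketch below reproduces that calculation and the two-window condition; the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo)


def budget_consumed(
    rate: float, window_hours: float, slo_window_hours: float = 30 * 24
) -> float:
    """Fraction of the whole error budget spent at this rate in the window."""
    return rate * window_hours / slo_window_hours


def should_page(
    error_ratio_1h: float,
    error_ratio_5m: float,
    slo: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    return (
        burn_rate(error_ratio_1h, slo) > threshold
        and burn_rate(error_ratio_5m, slo) > threshold
    )


print(round(budget_consumed(14.4, window_hours=1), 4))  # share of 30-day budget
print(should_page(0.02, 0.03, slo=0.999))    # sustained and ongoing
print(should_page(0.02, 0.0005, slo=0.999))  # already recovered
```

The second case shows why the short window matters: the hourly average is still high, but the problem has stopped, so paging someone would be noise.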
Observability Cost Management
Observability infrastructure can become one of the largest operational costs in a Kubernetes platform. Understanding the cost drivers enables targeted optimization.
Cost Drivers and Mitigations
| Driver | Impact | Mitigation |
|---|---|---|
| High-cardinality metrics | Prometheus memory/storage explosion; slow queries | Drop high-cardinality labels; use recording rules; Prometheus label limits |
| Log volume | Loki/Elasticsearch storage and ingestion cost | Log sampling for high-volume debug logs; drop noisy log lines at Fluent Bit level |
| 100% trace sampling | Trace storage scales linearly with traffic | Head-based sampling (10%); tail-based sampling (OTel Collector); always-sample errors |
| Metrics retention | Long retention = large disk; rarely queried after 90d | Tiered retention: 15d hot, 90d warm, 1y cold (Thanos object storage) |
| Prometheus cardinality per pod | Each pod replacement (new pod name) adds new time series | kube-state-metrics: limit label cardinality; drop pod-template-hash labels |
Cardinality explosion is one of the most common Prometheus operational crises. Adding a high-cardinality label (user ID, request ID, session ID) to a metric multiplies the series count by up to the number of distinct label values. A metric with 1,000 time series can become 1,000,000 time series if you add a user_id label with 1,000 users. The resulting memory growth can get the Prometheus pod OOM-killed. See Metrics & Prometheus for cardinality management techniques.
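The multiplication is worth running before adding any label. This minimal Python sketch estimates the worst-case series count as the product of per-label cardinalities; the label names and counts are hypothetical, and real counts are usually lower because not every combination occurs.

```python
from math import prod
from typing import Dict


def max_series(label_cardinalities: Dict[str, int]) -> int:
    """Worst-case series count: product of distinct values per label."""
    return prod(label_cardinalities.values())


base_labels = {"method": 5, "status": 10, "pod": 20}
with_user_id = {**base_labels, "user_id": 1000}

print(f"{max_series(base_labels):,}")   # before adding user_id
print(f"{max_series(with_user_id):,}")  # after adding user_id
```

A useful rule of thumb that follows from this: anything unbounded (users, requests, sessions) belongs in logs or traces, not in metric labels.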