Why Observability

Monitoring asks "is this thing healthy?" — a binary answer based on known failure modes. Observability asks "why is this behaving this way?" — allowing exploration of unknown unknowns without deploying new instrumentation.

Reduce MTTR

The primary value of observability is reducing mean time to recovery (MTTR). Correlated signals (metrics → logs → traces) let engineers identify the root cause in minutes, rather than spending hours running kubectl commands without knowing where to look.

Understand Distributed Systems

Kubernetes workloads are inherently distributed. A latency spike in one service may be caused by a slow database, a noisy neighbor, a GC pause, or network congestion. Traces show where in the request path the time is spent; correlated metrics and logs then explain why.

Capacity and Cost Planning

Historical metrics enable right-sizing (CPU/memory requests), predicting when the next node is needed, and identifying expensive operations before they impact SLOs.

Kubernetes Adds Complexity

Ephemeral pods, dynamic IPs, rolling deployments, and auto-scaling make traditional host-based monitoring insufficient. Observability must work at the workload level, not the instance level.

Observability vs Monitoring

Monitoring is checking for known failure states (is memory > 90%?). Observability is the ability to ask arbitrary questions about system behavior using the data it emits — without deploying new code. In practice, you need both: monitoring for alerting on known-bad states, observability for investigating unknown causes.

The Three Pillars

Metrics
Numeric measurements over time. Aggregated, low-cardinality, cheap to store. Best for alerting on trends, capacity, and SLO burn rate.

Logs
Timestamped, structured or unstructured text events. High fidelity, high volume. Best for debugging specific incidents and auditing.

Traces
Distributed call graphs with timing. Shows the path of a single request across all services. Best for understanding latency and service dependencies.

Signal Comparison

| Property | Metrics | Logs | Traces |
| --- | --- | --- | --- |
| Granularity | Aggregated (counters, gauges, histograms) | Per-event | Per-request |
| Cardinality | Low (by design) | Very high | High (per trace ID) |
| Storage cost | Low | High | Medium (with sampling) |
| Query pattern | Aggregations, rate/range queries | Full-text search, filter | Trace ID lookup, service dependency maps |
| Best for | Alerting, dashboards, SLOs | Debugging, auditing, context | Latency analysis, dependency mapping |
| Kubernetes source | cAdvisor, kube-state-metrics, app /metrics | stdout/stderr → log collector | App instrumentation (OTel SDK) |

The Fourth Signal: Events

Kubernetes Events are a native signal type — structured records of state changes in the cluster (pod scheduled, container crashed, image pulled, node pressure). They are often overlooked but are invaluable for correlating infrastructure changes with application symptoms. See Kubernetes Events for the full treatment.

Kubernetes Observability Layers

Observability in Kubernetes spans four distinct layers, each with different data sources, tooling, and ownership.

── Observability layers (top to bottom = user-visible to infrastructure) ──

Layer 4: Application
Signals: custom business metrics, structured logs, distributed traces
Ownership: application teams
Tools: OTel SDK (auto + manual instrumentation), app /metrics endpoint
Examples: request rate, payment latency, cart abandonment rate

Layer 3: Workload / Platform
Signals: pod CPU/memory, container restarts, HPA scaling events
Ownership: platform team
Tools: kube-state-metrics, cAdvisor, Prometheus
Examples: OOMKill rate, pod pending duration, deployment rollout status

Layer 2: Kubernetes Control Plane
Signals: API server latency, etcd health, scheduler queue depth
Ownership: platform/SRE team
Tools: component /metrics endpoints, audit logs
Examples: apiserver_request_duration_seconds, apiserver_storage_objects (formerly etcd_object_counts)

Layer 1: Infrastructure / Nodes
Signals: node CPU/memory/disk/network, kernel metrics
Ownership: infrastructure team
Tools: node_exporter, cloud provider metrics
Examples: node_cpu_seconds_total, node_filesystem_avail_bytes

Key Data Sources per Layer

| Layer | Component | Exposes | Collected By |
| --- | --- | --- | --- |
| Application | App /metrics endpoint | Custom Prometheus metrics | Prometheus (via ServiceMonitor) |
| Application | App stdout/stderr | Structured JSON logs | Fluent Bit / Fluentd |
| Application | OTel SDK | Traces (OTLP) | OTel Collector |
| Workload | cAdvisor (in kubelet) | Container CPU/mem/net | Prometheus (kubelet /metrics/cadvisor) |
| Workload | kube-state-metrics | Kubernetes object state | Prometheus |
| Control Plane | kube-apiserver /metrics | API request latency/count | Prometheus |
| Control Plane | etcd /metrics | etcd health, latency, size | Prometheus |
| Control Plane | kube-scheduler /metrics | Scheduling latency, queue depth | Prometheus |
| Node | node_exporter | OS-level metrics (CPU, mem, disk, net) | Prometheus |
| Node | kubelet /metrics | kubelet operations, pod lifecycle | Prometheus |
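
Most of the components in the table above expose their metrics in the Prometheus text exposition format. As a rough illustration of what a scraper reads from a /metrics endpoint, the sketch below parses a few sample lines. The sample values and label sets are made up for illustration, and the parser covers only the simple cases (timestamps are ignored, malformed lines are skipped); it is not a replacement for a real client library.

```python
import re
from typing import Dict, Iterator, Tuple

# Simplified shape of one sample line: name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (metric name, labels, value) for each sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        try:
            value = float(match.group("value"))
        except ValueError:
            continue
        labels = {
            m.group("key"): m.group("val")
            for m in LABEL_RE.finditer(match.group("labels") or "")
        }
        yield match.group("name"), labels, value


# Hypothetical scrape output; the numbers are illustrative only.
EXAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
node_filesystem_avail_bytes{mountpoint="/"} 5.0e+09
"""

if __name__ == "__main__":
    for name, labels, value in parse_samples(EXAMPLE):
        print(name, labels, value)
```

Every distinct combination of metric name and label values in this output becomes one time series in Prometheus, which is why label choice matters for cost.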

OpenTelemetry

OpenTelemetry (OTel) is a CNCF project and the de facto standard for vendor-neutral instrumentation and telemetry collection. It unifies metrics, logs, and traces under a common set of APIs and SDKs, a wire protocol (OTLP), and a collection pipeline (the OTel Collector).

── OpenTelemetry architecture ─────────────────────────────────────────

Application
└── OTel SDK (Go/Java/Python/Node/Rust/...)
    ├── Auto-instrumentation: HTTP, gRPC, DB, messaging frameworks
    ├── Manual instrumentation: custom spans, metrics, logs
    └── Exports via OTLP (gRPC or HTTP)
                    │
                    ▼
OTel Collector (DaemonSet or Deployment)
├── Receivers: OTLP, Prometheus, Jaeger, Zipkin, Fluent Forward, hostmetrics
├── Processors: batch, memory_limiter, resource, attributes, filter, sampling
└── Exporters: Prometheus, OTLP (to backend), Loki, Jaeger, Tempo, stdout
                    │
    ┌───────────────┼─────────────────┐
    ▼               ▼                 ▼
Prometheus   Tempo / Jaeger         Loki
(metrics)       (traces)           (logs)
    │               │                 │
    └───────────────▼─────────────────┘
                 Grafana

OTel Collector Pipeline

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: otel-collector
        static_configs:
        - targets: [localhost:8888]
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      filesystem: {}

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  resource:
    attributes:
    - key: k8s.cluster.name
      value: production
      action: upsert

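exporters:
  # Prometheus must run with --web.enable-remote-write-receiver to accept these writes
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true   # plaintext gRPC; only appropriate inside a trusted cluster network
  # The dedicated loki exporter is deprecated in recent collector-contrib releases;
  # recent Loki versions can also ingest OTLP directly through an otlphttp exporter
  loki:
    endpoint: http://loki:3100/loki/api/v1/push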

service:
  pipelines:
    # memory_limiter goes first; batch goes last so batches carry fully processed data
    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
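
To make the batch settings above more concrete, here is a toy model of size-or-timeout batching: a batch is exported as soon as it reaches send_batch_size items, or once timeout has elapsed since the first buffered item. The Batcher class, its method names, and the injected clock are hypothetical and exist only for illustration; the collector's real batch processor is more involved.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Batcher:
    """Toy size-or-timeout batcher (illustrative, not the collector's code)."""
    send_batch_size: int
    timeout_s: float
    export: Callable[[List[dict]], None]
    _items: List[dict] = field(default_factory=list)
    _first_item_at: float = 0.0

    def add(self, item: dict, now: float) -> None:
        if not self._items:
            self._first_item_at = now  # the timer starts with the first buffered item
        self._items.append(item)
        if len(self._items) >= self.send_batch_size:
            self.flush()

    def tick(self, now: float) -> None:
        """Called periodically; flushes a partial batch once the timeout elapses."""
        if self._items and now - self._first_item_at >= self.timeout_s:
            self.flush()

    def flush(self) -> None:
        if self._items:
            batch, self._items = self._items, []
            self.export(batch)


if __name__ == "__main__":
    sent: List[List[dict]] = []
    batcher = Batcher(send_batch_size=3, timeout_s=5.0, export=sent.append)
    for i in range(4):
        batcher.add({"span": i}, now=0.0)  # the third item triggers a size-based flush
    batcher.tick(now=4.0)                  # timeout not reached yet
    batcher.tick(now=5.0)                  # the leftover item is flushed by timeout
    print([len(batch) for batch in sent])
```

Larger batches mean fewer export requests at the price of added latency before data reaches the backend, which is the trade-off that timeout and send_batch_size tune.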

OTel Operator for Kubernetes

# Install OTel Operator (by default the chart's admission webhook expects cert-manager in the cluster)
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator-system \
  --create-namespace

# OpenTelemetryCollector CRD: the operator manages the collector lifecycle
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: daemonset   # daemonset | deployment | sidecar | statefulset
  config: |
    # inline OTel Collector config (see above)
---
# Instrumentation CRD: auto-instrument pods without code changes
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: production
spec:
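  exporter:
    # Must resolve to the collector Service. Operator-managed collectors are
    # exposed as <name>-collector, so adjust the host (and namespace) to match.
    endpoint: http://otel-collector:4317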
  propagators: [tracecontext, baggage, b3]
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"   # 10% sampling
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.32.0
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.41.1
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.41b0
  go:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-go:v0.9.0-alpha

# Opt pod into auto-instrumentation via annotation:
# instrumentation.opentelemetry.io/inject-java: "true"
# instrumentation.opentelemetry.io/inject-nodejs: "true"
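
The parentbased_traceidratio sampler configured above combines two ideas: a child span follows its parent's sampling decision, and a root span is sampled based on its trace ID so that every service reaches the same verdict for the same trace. The sketch below illustrates that logic in simplified form; the function names are hypothetical and the exact algorithm in each OTel SDK may differ in detail.

```python
from typing import Optional

# Assumption: only the low 64 bits of the 128-bit trace ID take part in the decision.
TRACE_ID_MASK = (1 << 64) - 1


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gives the same answer."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


def should_sample(trace_id: int, ratio: float,
                  parent_sampled: Optional[bool] = None) -> bool:
    """Parent-based wrapper: honor the parent's decision when there is one."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)


if __name__ == "__main__":
    print(should_sample(trace_id=0x0123, ratio=0.1))                      # root span
    print(should_sample(trace_id=0x0123, ratio=0.1, parent_sampled=False))  # child span
```

Because the decision depends only on the trace ID and the propagated parent flag, a trace is either kept in full or dropped in full, rather than ending up with missing spans.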

Signal Correlation

The power of observability comes from correlating signals — jumping from a metric alert to the relevant logs and traces for the same time window and request.

── Signal correlation flow ────────────────────────────────────────────

Alert fires: p99 latency > 500ms
    │
    ▼  (click on metric data point in Grafana)
Metrics panel: latency spike at 14:32:15
    │
    ├── Exemplar: trace_id=abc123  ← embedded in Prometheus histogram
    │     └──▶ Tempo trace: full request path, 847ms total
    │            ├── payment-svc:       12ms
    │            ├── inventory-svc:    823ms  ← SLOW
    │            └── notification-svc:   8ms
    │
    └── Derived log query: trace_id=abc123
          └──▶ Loki logs: "DB connection pool exhausted" at inventory-svc

Exemplars: Linking Metrics to Traces

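// Prometheus exemplar: attach trace_id to histogram observation
// (Go example with prometheus/client_golang; httpDuration is a *prometheus.HistogramVec)
// With() returns an Observer, which must be asserted to ExemplarObserver.
if eo, ok := httpDuration.With(labels).(prometheus.ExemplarObserver); ok {
    eo.ObserveWithExemplar(duration, prometheus.Labels{"traceID": traceID})
}
// Exemplars are only exposed in the OpenMetrics format:
// promhttp.HandlerFor(registry, promhttp.HandlerOpts{EnableOpenMetrics: true})

# Enable exemplar storage in Prometheus (also requires --enable-feature=exemplar-storage)
# prometheus.yml
storage:
  exemplars:
    max_exemplars: 100000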

Trace ID in Logs

// Inject trace ID into structured logs (Go + Zap + OTel; trace is go.opentelemetry.io/otel/trace)
span := trace.SpanFromContext(ctx)
logger.Info("processing request",
    zap.String("trace_id", span.SpanContext().TraceID().String()),
    zap.String("span_id",  span.SpanContext().SpanID().String()),
    zap.String("service",  "payment-svc"),
)

# Log output (JSON):
{
  "level": "info",
  "ts": "2024-01-15T14:32:15.123Z",
  "msg": "processing request",
  "trace_id": "abc123def456...",
  "span_id": "0102030405060708",
  "service": "payment-svc"
}

# Loki query: find logs for a specific trace
{namespace="production"} | json | trace_id = "abc123def456..."
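
The Loki query above boils down to parsing each line as JSON and keeping the records whose trace_id field matches. As a minimal sketch of that filtering logic, the function below does the same over an in-memory list of log lines. The function name and the sample records are hypothetical, and the short trace ID abc123 is borrowed from the correlation flow earlier purely for illustration.

```python
import json
from typing import Iterable, Iterator


def logs_for_trace(lines: Iterable[str], trace_id: str) -> Iterator[dict]:
    """Yield structured log records that carry the given trace ID."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured lines cannot be correlated by field
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            yield record


# Hypothetical log lines from two services plus one unstructured line.
SAMPLE_LINES = [
    '{"level": "info", "msg": "processing request", "trace_id": "abc123", "service": "payment-svc"}',
    '{"level": "error", "msg": "DB connection pool exhausted", "trace_id": "abc123", "service": "inventory-svc"}',
    '{"level": "info", "msg": "processing request", "trace_id": "zzz999", "service": "payment-svc"}',
    "plain text line without structure",
]

if __name__ == "__main__":
    for record in logs_for_trace(SAMPLE_LINES, "abc123"):
        print(record["service"], "-", record["msg"])
```

The unstructured line is silently skipped, which shows why correlation depends on every service logging the trace ID as a structured field with a consistent name.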

Tooling Landscape

CNCF Open Source Stack (LGTM)

| Signal | Collection | Storage | Query / Visualization |
| --- | --- | --- | --- |
| Metrics | Prometheus, OTel Collector | Prometheus TSDB, Thanos, Cortex, Mimir | Grafana, PromQL |
| Logs | Fluent Bit, Fluentd, Promtail, OTel Collector | Loki | Grafana, LogQL |
| Traces | OTel SDK, OTel Collector | Tempo, Jaeger, Zipkin | Grafana, Jaeger UI |
| Events | kube-events-exporter, eventrouter | Loki, Elasticsearch | Grafana, Kibana |
| Profiles | Pyroscope agent, eBPF profilers | Pyroscope | Grafana (Pyroscope plugin) |

Stack Comparison

| Stack | Components | Strengths | Weaknesses |
| --- | --- | --- | --- |
| LGTM (self-hosted) | Loki + Grafana + Tempo + Mimir | Fully open source, tightly integrated, cost-effective at scale | Operational complexity; each component needs HA |
| Prometheus + ELK | Prometheus + Elasticsearch + Kibana | Mature, large community, full-text search | Elasticsearch resource-intensive; two separate UIs |
| Grafana Cloud | Hosted LGTM stack | Zero ops for backend; generous free tier | Data leaves cluster; cost at scale |
| Datadog | Unified SaaS platform | Best-in-class UX; APM + infra in one product | Expensive; data sent to vendor; vendor lock-in |
| New Relic | Unified SaaS platform | Simple pricing; NRQL query language | Vendor lock-in; data sent to vendor |

kube-prometheus-stack: The Quick Start

The kube-prometheus-stack Helm chart (formerly the stable/prometheus-operator chart) installs Prometheus Operator, Prometheus, Alertmanager, Grafana, kube-state-metrics, and node_exporter in one command, with pre-built dashboards for Kubernetes. It is the usual starting point for Kubernetes metrics observability. See Metrics & Prometheus for the full installation guide.

Instrumentation Strategy

Auto-Instrumentation vs Manual Instrumentation

| Approach | How | Coverage | Effort | When to Use |
| --- | --- | --- | --- | --- |
| Auto-instrumentation | OTel Operator Instrumentation CRD; language agents (Java agent, Node.js require hook) | HTTP, gRPC, DB, messaging (framework-level) | Low (annotation per pod) | All new services; existing services without time budget |
| Manual instrumentation | OTel SDK: tracer.Start(), span.SetAttribute(), custom metrics | Business logic, custom spans, domain metrics | High (per-operation code changes) | Business-critical paths where auto-instrumentation misses context |
| eBPF-based | Pixie, Hubble, Tetragon (kernel-level tracing) | Network, syscalls, DNS, HTTP (without TLS decryption) | Zero (no code changes) | Observing workloads you cannot instrument (third-party, legacy) |

What to Instrument First

1. RED Metrics (Every Service)

Rate: requests per second
Errors: failed requests per second
Duration: request latency distribution (p50/p99)
These three metrics are enough to drive most request-based SLO alerting.

2. USE Metrics (Every Resource)

Utilization: % time busy
Saturation: queue depth / wait time
Errors: error rate
Apply to: CPU, memory, disk I/O, network.

3. Business Metrics

Domain-specific KPIs: orders per minute, payment success rate, active users. These are the metrics your business cares about when SLOs are expressed in business terms.

4. Deep-Dive Traces

Add manual spans for complex multi-step business operations (checkout flow, authentication chain) where auto-instrumentation doesn't capture enough context.
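
To make the RED metrics from step 1 concrete, the sketch below derives rate, errors, and duration percentiles from a list of raw request records observed in one time window. In production these values normally come from Prometheus counters and histograms rather than from raw records; the Request type, the function names, and the sample data here are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Request:
    duration_s: float
    failed: bool


def percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: Sequence[Request], window_s: float) -> Dict[str, float]:
    """Rate, Errors, Duration for the requests seen in a window of window_s seconds."""
    if not requests:
        raise ValueError("no requests observed in the window")
    durations = [r.duration_s for r in requests]
    return {
        "rate_per_s": len(requests) / window_s,
        "errors_per_s": sum(r.failed for r in requests) / window_s,
        "p50_s": percentile(durations, 50),
        "p99_s": percentile(durations, 99),
    }


if __name__ == "__main__":
    # Illustrative data: 100 requests in 60 s, latencies 10 ms to 1 s, every 10th fails.
    sample = [Request(duration_s=i / 100, failed=(i % 10 == 0)) for i in range(1, 101)]
    for key, value in red_summary(sample, window_s=60).items():
        print(f"{key}: {value:.3f}")
```

The gap between p50 and p99 in the output is the reason the duration signal is tracked as a distribution rather than as an average.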

SLOs, SLIs, and Error Budgets

SLOs (Service Level Objectives) transform raw metrics into business-facing reliability targets. They are the bridge between engineering observability and product/business requirements.

| Term | Definition | Example |
| --- | --- | --- |
| SLI (Service Level Indicator) | A metric that measures reliability from the user's perspective | Percentage of requests with latency < 200ms |
| SLO (Service Level Objective) | Target value or range for an SLI over a rolling window | 99.9% of requests complete in < 200ms over 30 days |
| Error Budget | 1 - SLO = allowed failure budget per window | 0.1% of requests in the 30-day window may miss the target; for a time-based 99.9% SLO this is ~43 min of downtime |
| Burn Rate | Rate at which error budget is consumed, relative to the rate that would use it up exactly at the end of the window | Burn rate 2 = consuming budget twice as fast as allowed |
| SLA (Service Level Agreement) | Contractual commitment to external customers | 99.5% uptime SLA (typically looser than the internal SLO, so the SLO is breached first) |

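
The error budget arithmetic in the table can be checked with a few lines of code. This is a minimal sketch with hypothetical function names: it converts an SLO into minutes of allowed downtime per window, and into the fraction of a request-based budget that is still unspent.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a time-based SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% (or no traffic) leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f} min")  # 30-day window at 99.9%
    # Illustrative traffic numbers: 1,000,000 requests, 250 of them failed.
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget left")
```
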
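# Multi-window burn rate alert (Google SRE approach)
# Fires when error budget is being consumed faster than sustainable.
# Assumes job:slo_errors:rate1h and job:slo_errors:rate5m are recording rules that
# hold the error ratio (failed / total requests); 0.001 is the budget of a 99.9% SLO.

# Long window (significant budget already spent) + short window (still burning now)
- alert: SLOBurnRateHigh
  expr: |
    (
      # 1h burn rate > 14.4x (uses 2% of a 30-day budget in 1h at this rate)
      job:slo_errors:rate1h{job="payment-svc"} / 0.001 > 14.4
    ) and (
      # 5m burn rate > 14.4x (confirms the burn is still happening, so the alert clears soon after recovery)
      job:slo_errors:rate5m{job="payment-svc"} / 0.001 > 14.4
    )
  labels:
    severity: page
  annotations:
    summary: "SLO burn rate critical: paging"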
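
The 14.4 threshold in the rule above follows from the window lengths: spending 2% of a 30-day budget within 1 hour means burning 0.02 × 720 h / 1 h = 14.4 times faster than the sustainable rate. The sketch below reproduces that arithmetic and the two-window paging condition. The function names are hypothetical, and the error ratios passed in are illustrative values rather than measurements.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1.0 - slo)


def burn_rate_threshold(budget_fraction: float, alert_window_h: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that spends budget_fraction of the budget within alert_window_h."""
    return budget_fraction * slo_window_days * 24 / alert_window_h


def hours_to_exhaustion(rate: float, slo_window_days: int = 30) -> float:
    """Time until the whole budget is gone if the burn rate stays constant."""
    return slo_window_days * 24 / rate


def should_page(err_ratio_1h: float, err_ratio_5m: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold."""
    return (burn_rate(err_ratio_1h, slo) > threshold
            and burn_rate(err_ratio_5m, slo) > threshold)


if __name__ == "__main__":
    print(round(burn_rate_threshold(0.02, alert_window_h=1), 1))
    print(round(hours_to_exhaustion(14.4), 1), "hours until the budget is exhausted")
    print(should_page(err_ratio_1h=0.02, err_ratio_5m=0.02))   # still burning
    print(should_page(err_ratio_1h=0.02, err_ratio_5m=0.001))  # already recovered
```

The same helper gives the threshold for other window pairs, for example a 6-hour window that should catch 5% budget consumption.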

Observability Cost Management

Observability infrastructure can become one of the largest operational costs in a Kubernetes platform. Understanding the cost drivers enables targeted optimization.

Cost Drivers and Mitigations

| Driver | Impact | Mitigation |
| --- | --- | --- |
| High-cardinality metrics | Prometheus memory/storage explosion; slow queries | Drop high-cardinality labels; use recording rules; Prometheus label limits |
| Log volume | Loki/Elasticsearch storage and ingestion cost | Log sampling for high-volume debug logs; drop noisy log lines at Fluent Bit level |
| 100% trace sampling | Trace storage scales linearly with traffic | Head-based sampling (10%); tail-based sampling (OTel Collector); always-sample errors |
| Metrics retention | Long retention = large disk; rarely queried after 90d | Tiered retention: 15d hot, 90d warm, 1y cold (Thanos object storage) |
| Prometheus cardinality per pod | Each pod restart adds new time series | kube-state-metrics: limit label cardinality; drop pod-hash labels |

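
The trace sampling mitigations in the table can be combined into one tail-based policy: decide after the trace has completed, always keep traces that contain an error or a slow span, and keep only a fixed share of the rest. The sketch below is a simplified illustration of that decision. The Span type, the keep_trace function, and the 500 ms and 10% defaults are assumptions chosen for the example, not the OTel Collector's tail sampling configuration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Span:
    service: str
    duration_ms: float
    is_error: bool = False


def keep_trace(trace_id: int, spans: Sequence[Span],
               slow_ms: float = 500.0, baseline_ratio: float = 0.1) -> bool:
    """Tail-based decision made once all spans of a trace are available."""
    if any(span.is_error for span in spans):
        return True   # always keep traces that contain an error
    if any(span.duration_ms >= slow_ms for span in spans):
        return True   # always keep slow traces
    # Otherwise keep a deterministic share of healthy traces, keyed on the trace ID.
    bound = round(baseline_ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


if __name__ == "__main__":
    healthy = [Span("payment-svc", 12), Span("notification-svc", 8)]
    slow = [Span("payment-svc", 12), Span("inventory-svc", 823)]
    failed = [Span("payment-svc", 12, is_error=True)]
    dropped_id = (1 << 64) - 1  # falls outside the 10% baseline share
    print(keep_trace(dropped_id, healthy), keep_trace(dropped_id, slow),
          keep_trace(dropped_id, failed))
```

The cost of this approach is that the component making the decision has to buffer every span of a trace until the trace is complete, which is why it usually runs in a dedicated collector tier.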
Cardinality Explosion

Cardinality explosion is the most common Prometheus operational crisis. Adding a high-cardinality label (user ID, request ID, session ID) to a metric multiplies its series count by up to the label's cardinality: a metric with 1,000 time series can become 1,000,000 time series if you add a user_id label with 1,000 distinct values. Because Prometheus keeps every active series in memory, this can lead to OOM kills on the Prometheus pod. See Metrics & Prometheus for cardinality management techniques.
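
The multiplication behind a cardinality explosion is easy to reproduce. In the sketch below, the worst-case series count of a metric is the product of the number of distinct values of each label. The label names and counts are hypothetical, and real series counts are often lower because not every label combination occurs.

```python
from math import prod
from typing import Mapping


def worst_case_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on time series: one per combination of label values."""
    return prod(label_cardinalities.values())


if __name__ == "__main__":
    # Hypothetical HTTP metric: 5 methods x 10 status codes x 20 pods.
    labels = {"method": 5, "status": 10, "pod": 20}
    print(f"{worst_case_series(labels):,} series before")
    labels["user_id"] = 1_000  # adding a per-user label
    print(f"{worst_case_series(labels):,} series after")
```

Running the same estimate before merging a new label is a cheap way to catch a problematic metric during code review.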

Section Guide

Metrics & Prometheus
Prometheus architecture, ServiceMonitor, PromQL, recording rules, kube-prometheus-stack, Thanos/Mimir for scale
Logging
Fluent Bit DaemonSet, Loki, LogQL, structured logging, log aggregation patterns, retention
Distributed Tracing
OTel SDK, Tempo/Jaeger, sampling strategies, trace propagation, service dependency maps
Kubernetes Events
Event anatomy, kube-events-exporter, event-driven alerting, incident correlation
Dashboards & Grafana
Grafana setup, dashboard-as-code (Grafonnet), USE/RED dashboard templates, multi-cluster views
Alerting
Alertmanager, routing, grouping, inhibition, silence, PagerDuty integration, burn rate alerts
Profiling
Continuous profiling with Pyroscope, pprof endpoints, eBPF-based profiling, flame graphs