Distributed Tracing in Kubernetes

Complete guide to OpenTelemetry tracing, W3C Trace Context, Jaeger, Grafana Tempo, sampling strategies, auto-instrumentation, service mesh tracing, and production tail sampling.

Core Concepts

Distributed tracing tracks a single request as it flows through multiple services, capturing timing and causal relationships. Without tracing, diagnosing latency in a microservices system requires grepping logs across dozens of services with no way to correlate them to a specific request.

Trace, Span, and Span Context

Trace

A directed acyclic graph (DAG) of spans representing one end-to-end request. Identified by a globally unique trace ID (128-bit / 16 bytes). All spans in a trace share the same trace ID.

Span

A named, timed operation representing a unit of work. Has a span ID (64-bit / 8 bytes), start time, end time, status (OK/ERROR/UNSET), and zero or more attributes, events, and links.

Span Context

The propagated portion of a span: trace ID + span ID + trace flags (sampled bit) + trace state. Propagated across process boundaries via HTTP headers or message queue metadata.

Parent–Child Relationship

A child span records its parent's span ID. The root span has no parent ID. Causal relationships form a tree; async fan-out or messaging creates a DAG via span links.

Span Attributes

Key-value pairs on a span (indexed for search). OTel semantic conventions define standard names: http.method, db.system, rpc.service, k8s.pod.name.

Span Events

Timestamped log-like messages attached to a span (not propagated). Use for: exception recording, cache miss, retry attempt. Cheaper than creating a child span.

Example Trace: Order Service

POST /orders [gateway]                      234ms
└ order-service.CreateOrder                 212ms
  ├ redis.GET inventory                      11ms
  ├ postgres.INSERT orders                   43ms
  ├ payment-service.Charge                   96ms
  │ └ stripe.CreateCharge                    81ms
  └ kafka.produce order.created              17ms

Waterfall view: each bar represents a span's start/end relative to trace start. Gaps reveal time spent between spans (serialization, network, queue wait). This trace immediately shows payment-service (96ms) is the dominant latency contributor.
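To make the "gaps" idea concrete, the sketch below computes each span's self time (its duration minus the time covered by its direct children) for the example trace. It assumes the children of a span run sequentially, which the trace listing does not state; with overlapping children the subtraction would over-count. The `Span` class and helper names are illustrative, not part of any tracing SDK.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list[Span] = field(default_factory=list)


def self_time_ms(span: Span) -> int:
    """Duration not covered by direct children (assumes sequential children)."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


def walk(span: Span, depth: int = 0):
    """Yield (depth, span) pairs in waterfall order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# The example trace from this section.
trace = Span("POST /orders [gateway]", 234, [
    Span("order-service.CreateOrder", 212, [
        Span("redis.GET inventory", 11),
        Span("postgres.INSERT orders", 43),
        Span("payment-service.Charge", 96, [Span("stripe.CreateCharge", 81)]),
        Span("kafka.produce order.created", 17),
    ]),
])

for depth, span in walk(trace):
    print(f"{'  ' * depth}{span.name}: "
          f"total={span.duration_ms}ms self={self_time_ms(span)}ms")
```

Under the sequential assumption, order-service spends 45ms outside its children and payment-service only 15ms outside the Stripe call, so most of the 96ms is the external charge request rather than payment-service's own work.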

Span Status and Error Recording

Status Code	Meaning	When to Set
`UNSET`	Default — operation not explicitly classified	Default for all new spans
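`OK`	Operation explicitly marked successful; final once set	Set only when the application needs to override the default, e.g. to record success after a handled failure
`ERROR`	Operation failed	On an unhandled exception, or a 5xx response in a server span (4xx responses count as errors only on client spans)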

Do Not Set OK Proactively

Setting status to OK on every successful span adds nothing over UNSET, and because OK is final, later attempts to mark that span as ERROR are ignored. Status is also per span: a child's ERROR does not automatically change its parent's status. Set OK explicitly only when you want to mark a span as definitively successful (e.g., a retry that ultimately succeeded). Leave UNSET for normal successful operations.

W3C Trace Context & Propagation

The W3C Trace Context specification (a W3C Recommendation, not an IETF RFC) defines standard HTTP headers for propagating span context across service boundaries. OpenTelemetry SDKs use it as the default propagation format, and most other modern tracing libraries support it.

traceparent Header

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
             |  trace-id (32 hex chars, 128-bit) span-id (16 hex) flags
             version=00                                            01=sampled
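The field layout above can be checked mechanically. The following is a minimal sketch of a version-00 `traceparent` parser; `parse_traceparent` and `SpanContext` are names made up for this example, and the parser deliberately ignores the forward-compatibility rules the specification defines for future versions.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - span-id - flags, all lowercase hex (version 00 layout).
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class SpanContext(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the sampled flag
    return SpanContext(version, trace_id, span_id, sampled)


ctx = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(ctx)
```

In practice the SDK's propagator does this for you; the sketch is only meant to show which bits carry the sampling decision.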

tracestate Header

# Vendor-specific metadata, preserved through the call chain
tracestate: vendorname1=opaqueValue1,vendorname2=opaqueValue2

# Illustrative single-vendor entry (values may not contain ',' or '='):
tracestate: jaeger=sampled:1

# For comparison, B3 single-header (legacy, still common) is a separate header:
b3: 4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1

Propagation Formats Comparison

Format	Header(s)	Origin	Status
W3C TraceContext	`traceparent`, `tracestate`	W3C standard	Recommended
B3 Multi-header	`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-Sampled`	Zipkin	Legacy / widely used
B3 Single-header	`b3`	Zipkin	Legacy
Jaeger	`uber-trace-id`	Uber/Jaeger	Legacy
AWS X-Ray	`X-Amzn-Trace-Id`	AWS	AWS environments
Baggage	`baggage`	W3C standard	User-defined context
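In mixed environments a gateway or shim sometimes has to translate between these formats. As a rough sketch, the function below derives the B3 single-header and multi-header values from a W3C `traceparent`. `traceparent_to_b3` is a hypothetical helper, it assumes a well-formed version-00 header, and it does not cover B3's optional parent span ID or debug flag.

```python
def traceparent_to_b3(traceparent: str) -> dict[str, str]:
    """Map a version-00 traceparent onto the equivalent B3 headers."""
    _version, trace_id, span_id, flags = traceparent.strip().split("-")
    sampled = "1" if int(flags, 16) & 0x01 else "0"
    return {
        # B3 single-header: {trace-id}-{span-id}-{sampled}
        "b3": f"{trace_id}-{span_id}-{sampled}",
        # B3 multi-header equivalents
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": sampled,
    }


headers = traceparent_to_b3(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

In an OTel SDK you would normally configure a composite propagator instead of translating by hand; the sketch only shows that the formats carry the same three pieces of information.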

W3C Baggage

# Baggage propagates user-defined key-value pairs across the entire call chain
# Use for: tenant ID, A/B experiment ID, user ID for debug sessions
baggage: userId=12345,tenantId=acme-corp,ab-experiment=new-checkout

# WARNING: Baggage is propagated to ALL downstream services — never put secrets here.
# Baggage adds network overhead proportional to its size on every HTTP call.
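Because every hop pays for baggage, it can help to enforce a size budget where the header is built. The sketch below encodes and decodes a baggage header with percent-encoded values. `encode_baggage`, `decode_baggage`, and the `MAX_BAGGAGE_BYTES` budget are assumptions for illustration (the 8192-byte figure is taken to match the limit in the W3C Baggage specification; verify it against the spec before relying on it), and keys are assumed to be plain ASCII tokens.

```python
from urllib.parse import quote, unquote

MAX_BAGGAGE_BYTES = 8192  # assumed budget; see the W3C Baggage limits


def encode_baggage(entries: dict[str, str]) -> str:
    """Serialize key/value pairs, percent-encoding the values."""
    header = ",".join(f"{k}={quote(v, safe='')}" for k, v in entries.items())
    if len(header.encode("utf-8")) > MAX_BAGGAGE_BYTES:
        raise ValueError("baggage header exceeds the size budget")
    return header


def decode_baggage(header: str) -> dict[str, str]:
    """Parse a baggage header, dropping optional ';property' suffixes."""
    entries: dict[str, str] = {}
    for member in header.split(","):
        member = member.split(";", 1)[0].strip()
        if "=" not in member:
            continue  # skip empty or malformed members
        key, value = member.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


print(decode_baggage("userId=12345,tenantId=acme-corp,ab-experiment=new-checkout"))
print(encode_baggage({"tenantId": "acme corp"}))
```

The OTel Baggage API and propagator handle this encoding for you; a guard like the size check is mainly useful in code that accepts baggage values from untrusted input.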

OpenTelemetry SDK & API

OTel Architecture Layers

┌─────────────────────────────────────────────────────┐
│  OTel API (stable contract)                         │
│  Tracer / Span / Context / Propagator interfaces    │
│  Safe to use in libraries: no vendor coupling       │
└──────────────────┬──────────────────────────────────┘
                   │ implements
┌──────────────────▼──────────────────────────────────┐
│  OTel SDK (configurable)                            │
│  TracerProvider, Sampler, SpanProcessor,            │
│  SpanExporter: configured by application operator   │
└──────────────────┬──────────────────────────────────┘
                   │ exports via
┌──────────────────▼──────────────────────────────────┐
│  OTLP Exporter (gRPC :4317 / HTTP :4318)            │
│  Sends to OTel Collector or directly to backend     │
└─────────────────────────────────────────────────────┘

Go: Manual Instrumentation

package main

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
    "go.opentelemetry.io/otel/trace"
    "google.golang.org/grpc"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector.monitoring.svc:4317"),
        otlptracegrpc.WithInsecure(),
        otlptracegrpc.WithDialOption(grpc.WithBlock()),
    )
    if err != nil {
        return nil, err
    }

    res := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName("order-service"),
        semconv.ServiceVersion("v2.4.1"),
        semconv.DeploymentEnvironment("production"),
        attribute.String("k8s.namespace.name", "payments"),
    )

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        // Head-based sampling: 10% of traces
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.1),
        )),
    )
    otel.SetTracerProvider(tp)
    // Set W3C TraceContext + Baggage propagators
    otel.SetTextMapPropagator(
        propagation.NewCompositeTextMapPropagator(
            propagation.TraceContext{},
            propagation.Baggage{},
        ),
    )
    return tp, nil
}

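// OrderRequest and Order are placeholder types so the example compiles;
// substitute your own domain types.
type OrderRequest struct {
    CustomerID string
    Items      []string
    Total      float64
}

type Order struct{ ID string }

var tracer = otel.Tracer("order-service")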

func createOrder(ctx context.Context, req OrderRequest) (*Order, error) {
    ctx, span := tracer.Start(ctx, "CreateOrder",
        trace.WithAttributes(
            attribute.String("order.customer_id", req.CustomerID),
            attribute.Int("order.item_count", len(req.Items)),
            attribute.Float64("order.total_amount", req.Total),
        ),
        trace.WithSpanKind(trace.SpanKindServer),
    )
    defer span.End()

    order, err := insertOrderDB(ctx, req)
    if err != nil {
        // Record exception — adds span event with stack trace
        span.RecordError(err, trace.WithStackTrace(true))
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }

    // Add span event (lightweight log attached to this span)
    span.AddEvent("order persisted",
        trace.WithAttributes(attribute.String("order.id", order.ID)),
    )
    return order, nil
}

func insertOrderDB(ctx context.Context, req OrderRequest) (*Order, error) {
    ctx, span := tracer.Start(ctx, "db.insert",
        trace.WithAttributes(
            semconv.DBSystemPostgreSQL,
            semconv.DBNameKey.String("orders"),
            semconv.DBOperationKey.String("INSERT"),
            semconv.DBStatementKey.String("INSERT INTO orders (customer_id, total) VALUES (?, ?)"),
        ),
        trace.WithSpanKind(trace.SpanKindClient),
    )
    defer span.End()
    // ... actual DB operation
    return &Order{}, nil
}

Python: Manual Instrumentation

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

def init_tracer():
    resource = Resource.create({
        SERVICE_NAME: "payment-service",
        SERVICE_VERSION: "v1.2.0",
        "deployment.environment": "production",
    })

    exporter = OTLPSpanExporter(
        endpoint="http://otel-collector.monitoring.svc:4317",
        insecure=True,
    )

    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Support both W3C and B3 for interop with legacy services
    set_global_textmap(CompositePropagator([
        TraceContextTextMapPropagator(),
        B3MultiFormat(),
    ]))

tracer = trace.get_tracer("payment-service")

# stripe_charge() and StripeError are placeholders for your payment client library.
def charge_customer(ctx, customer_id: str, amount: float):
    with tracer.start_as_current_span(
        "PaymentService.Charge",
        context=ctx,
        kind=trace.SpanKind.SERVER,
        attributes={
            "payment.customer_id": customer_id,
            "payment.amount": amount,
            "payment.currency": "USD",
        }
    ) as span:
        try:
            result = stripe_charge(ctx, customer_id, amount)
            span.set_attribute("payment.charge_id", result.id)
            return result
        except StripeError as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise

Java: OpenTelemetry with Spring Boot

<!-- pom.xml -->
<dependency>
  <groupId>io.opentelemetry.instrumentation</groupId>
  <artifactId>opentelemetry-spring-boot-starter</artifactId>
  <version>2.3.0-alpha</version>
</dependency>

# application.yml — OTel Spring Boot auto-config
otel:
  service:
    name: inventory-service
  exporter:
    otlp:
      endpoint: http://otel-collector.monitoring.svc:4317
      protocol: grpc
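  traces:
    sampler: parentbased_traceidratio
    # A YAML mapping cannot repeat the "sampler" key, so the argument
    # is written as a dotted key (otel.traces.sampler.arg)
    sampler.arg: "0.1"    # 10% head sampling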
  metrics:
    exporter: otlp   # also export metrics via OTel
  logs:
    exporter: otlp   # also export logs via OTel
  propagators: tracecontext,baggage

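import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.stereotype.Service;

@Service
public class InventoryService {
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("inventory-service");
    // InventoryRepository is a placeholder for your data-access component.
    private final InventoryRepository inventoryRepo;

    public InventoryService(InventoryRepository inventoryRepo) {
        this.inventoryRepo = inventoryRepo;
    }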

    public int checkStock(String productId) {
        Span span = tracer.spanBuilder("InventoryService.checkStock")
            .setSpanKind(SpanKind.INTERNAL)
            .setAttribute("product.id", productId)
            .startSpan();
        try (Scope scope = span.makeCurrent()) {
            int qty = inventoryRepo.findQuantity(productId);
            span.setAttribute("inventory.quantity", qty);
            return qty;
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.end();
        }
    }
}

Semantic Conventions

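OTel semantic conventions define standardized attribute names across language SDKs. Using them enables consistent cross-service querying in backends like Tempo. The table uses the older HTTP attribute names that many deployed SDKs still emit; the stable HTTP conventions rename http.method to http.request.method, http.status_code to http.response.status_code, and http.url to url.full.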

Signal	Key Attributes
HTTP Server	`http.method`, `http.route`, `http.status_code`, `http.url`, `server.address`, `server.port`
HTTP Client	`http.method`, `http.url`, `http.status_code`, `http.request.body.size`
Database	`db.system` (postgresql/mysql/redis), `db.name`, `db.operation`, `db.statement`, `db.user`, `server.address`
Messaging	`messaging.system` (kafka/rabbitmq), `messaging.destination.name`, `messaging.operation` (publish/receive/process)
RPC	`rpc.system` (grpc), `rpc.service`, `rpc.method`, `rpc.grpc.status_code`
Kubernetes	`k8s.pod.name`, `k8s.namespace.name`, `k8s.deployment.name`, `k8s.node.name`, `k8s.cluster.name`
Exceptions	`exception.type`, `exception.message`, `exception.stacktrace`

Auto-Instrumentation

The OTel Operator provides zero-code-change auto-instrumentation for Java, Node.js, Python, .NET, and Go via the Instrumentation CRD. For Java, Node.js, Python, and .NET it injects an init container that copies the OTel agent/SDK into the pod and configures it via environment variables; Go is instrumented by an eBPF-based container instead.

OTel Operator Install

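# Prerequisite: cert-manager must already be installed (the operator's admission webhooks need TLS certificates)
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml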

# Or via Helm:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator-system \
  --create-namespace \
  --set "manager.collectorImage.repository=otel/opentelemetry-collector-contrib"

Instrumentation CRD

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-instrumentation
  namespace: production
spec:
  # Where to send traces (OTel Collector endpoint)
  exporter:
    endpoint: http://otel-collector.monitoring.svc:4317

  propagators:
    - tracecontext
    - baggage
    - b3multi           # also support legacy B3 for mixed environments

  sampler:
    type: parentbased_traceidratio
    argument: "0.1"     # 10% head sample; adjust per service via annotation

  # Resource attributes added to all signals from instrumented pods
  resource:
    addK8sUIDAttributes: true
    attributes:
      cluster: prod-us-east-1

  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.33.0
    env:
      - name: OTEL_INSTRUMENTATION_JDBC_ENABLED
        value: "true"
      - name: OTEL_INSTRUMENTATION_SPRING_WEB_ENABLED
        value: "true"

  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.49.1
    env:
      - name: OTEL_NODE_ENABLED_INSTRUMENTATIONS
        value: "http,express,pg,redis,kafkajs"

  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.45b0
    env:
      - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
        value: "true"

  go:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-go:v0.11.0
    # Go uses eBPF — no init container; requires privileged mode
    env:
      - name: OTEL_GO_AUTO_SHOW_VERIFIER_LOG
        value: "false"

  dotnet:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:1.7.0

Enabling Auto-Instrumentation via Pod Annotations

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  template:
    metadata:
      annotations:
        # Enable auto-instrumentation for this deployment.
        # "true" uses the Instrumentation resource in the pod's namespace;
        # a resource name (e.g. "otel-instrumentation") selects a specific
        # one instead. Use only one inject-java key: YAML keys must be unique.
        instrumentation.opentelemetry.io/inject-java: "true"
        # Or: inject-nodejs, inject-python, inject-dotnet, inject-go

        # Override container to inject (default: first container)
        instrumentation.opentelemetry.io/container-names: "order-service"
    spec:
      containers:
        - name: order-service
          image: myregistry/order-service:v2.4.1
          # OTel Operator injects these env vars automatically:
          # OTEL_SERVICE_NAME=order-service
          # OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector...:4317
          # JAVA_TOOL_OPTIONS=-javaagent:/otel-auto-instrumentation/javaagent.jar
          # OTEL_RESOURCE_ATTRIBUTES=k8s.pod.name=$(POD_NAME),k8s.namespace.name=$(NAMESPACE)...

Auto-Instrumentation Coverage by Language

Language	Mechanism	Frameworks Covered	Limitation
Java	javaagent JAR (bytecode)	Spring Boot, Quarkus, Micronaut, JDBC, gRPC, Kafka, Redis, Mongo	Lambda / GraalVM native require manual SDK
Node.js	require() hook at startup	Express, Fastify, HTTP, gRPC, pg, mysql, Redis, Kafka	Custom async hooks may conflict
Python	sitecustomize.py + PYTHONPATH	Django, Flask, FastAPI, aiohttp, SQLAlchemy, Redis, Kafka	Celery async tasks need manual context
Go	eBPF (no code change needed)	net/http, gRPC, database/sql	Requires a privileged instrumentation container; limited framework depth
.NET	CLR profiler	ASP.NET Core, HttpClient, EF Core, gRPC, Redis, Kafka	Profiler API limitations in some scenarios

OTel Collector for Traces

The OTel Collector decouples application instrumentation from backend details. Applications export to the Collector via OTLP; the Collector transforms, samples, and fans out to one or more backends. This enables backend changes with zero application restarts.

Collector Deployment Modes for Tracing

Mode	Topology	Use For
Agent (DaemonSet)	One collector per node; apps send to their node's IP (host port)	Initial enrichment, local buffering, forwarding to the gateway
Gateway (Deployment)	Centralized collectors; agents forward to gateway	Tail sampling (requires seeing all spans for a trace), fan-out to multiple backends
Sidecar	One collector per pod	Strict per-pod isolation; rarely needed for tracing

Tail Sampling Requires Gateway Mode

Tail sampling decisions require seeing all spans of a trace before deciding whether to keep or drop it. This means all spans for a trace must route to the same collector instance. Use a consistent hash on trace_id in the load balancing exporter to route all spans of a trace to the same gateway instance.
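The routing idea can be sketched with a small consistent-hash ring. This is an illustration of the technique only, not the load-balancing exporter's actual implementation: `TraceRing`, the SHA-256 hash, and the virtual-node count are all assumptions chosen for the example.

```python
import bisect
import hashlib


class TraceRing:
    """Consistent-hash ring mapping trace IDs to gateway instances."""

    def __init__(self, backends: list[str], vnodes: int = 100):
        # Each backend owns `vnodes` points on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def backend_for(self, trace_id: str) -> str:
        # First ring point clockwise from the trace ID's hash (wraps around).
        idx = bisect.bisect_right(self._points, self._hash(trace_id))
        return self._ring[idx % len(self._ring)][1]


ring = TraceRing(["otel-gateway-0", "otel-gateway-1", "otel-gateway-2"])
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
# Every span of this trace carries the same trace ID, so every agent
# that uses the same ring picks the same gateway instance.
print(ring.backend_for(trace_id))
```

The point of using a ring rather than `hash(trace_id) % n` is that when a gateway replica is added or removed, only the traces that mapped to the affected replica move; traces in flight on the other replicas keep arriving at the same instance.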

Full Collector Config: Agent + Gateway

# --- Agent ConfigMap (DaemonSet — one per node) ---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317      # apps send to node IP:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      memory_limiter:
        limit_mib: 400
        spike_limit_mib: 100
        check_interval: 5s
      batch:
        timeout: 5s
        send_batch_size: 1024
      k8sattributes:                    # enrich with pod metadata
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.container.name
          labels:
            - tag_name: app
              key: app
              from: pod
        pod_association:
          - sources:
              - from: connection
      resource:
        attributes:
          - key: cluster
            value: prod-us-east-1
            action: insert

    exporters:
      # Forward to gateway using load-balancing on trace_id
      loadbalancing:
        protocol:
          otlp:
            tls:
              insecure: true
        resolver:
          dns:
            hostname: otel-gateway-headless.monitoring.svc   # headless Service
            port: 4317

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, resource, batch]
          exporters: [loadbalancing]

---
# --- Gateway ConfigMap (Deployment, 3+ replicas with tail sampler) ---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-gateway-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

    processors:
      memory_limiter:
        limit_mib: 1500
        spike_limit_mib: 400
        check_interval: 5s
      tail_sampling:
        decision_wait: 30s            # wait up to 30s for all spans to arrive
        num_traces: 50000             # in-memory trace buffer
        expected_new_traces_per_sec: 1000
        policies:
          # Keep ALL traces with errors
          - name: error-traces
            type: status_code
            status_code: {status_codes: [ERROR]}
          # Keep slow traces (> 2s)
          - name: slow-traces
            type: latency
            latency: {threshold_ms: 2000}
          # Keep traces from specific services (always)
          - name: payment-always
            type: string_attribute
            string_attribute: {key: "service.name", values: ["payment-service"]}
          # Sample 5% of everything else
          - name: probabilistic-base
            type: probabilistic
            probabilistic: {sampling_percentage: 5}
      batch:
        timeout: 5s
        send_batch_size: 2048

    exporters:
      otlp/tempo:
        endpoint: tempo.monitoring.svc:4317
        tls:
          insecure: true
      otlp/jaeger:
        endpoint: jaeger-collector.monitoring.svc:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, tail_sampling, batch]
          exporters: [otlp/tempo]   # add otlp/jaeger here to fan out to both backends

Load-Balancing Exporter and Headless Service

# Headless Service for gateway Pods: DNS returns one A record per Pod,
# which the loadbalancing exporter's dns resolver uses to discover backends
apiVersion: v1
kind: Service
metadata:
  name: otel-gateway-headless
  namespace: monitoring
spec:
  clusterIP: None           # headless
  selector:
    app: otel-gateway
  ports:
    - port: 4317
      name: grpc
---
# Gateway Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway
  namespace: monitoring
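spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-gateway
  template:
    metadata:
      labels:
        app: otel-gateway        # must match the headless Service selector
    spec:
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector-contrib:0.96.0
          resources:
            requests: {cpu: 500m, memory: 1Gi}
            limits: {cpu: 2, memory: 3Gi}
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol-contrib
      volumes:
        - name: config
          configMap:
            name: otel-gateway-config   # assumed name of the gateway ConfigMap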

Grafana Tempo

Grafana Tempo is a high-scale, cost-efficient distributed tracing backend. Like Loki, it uses object storage (S3/GCS/Azure Blob) for trace data and indexes only trace ID and span attributes needed for search — not full-text. It integrates natively with Grafana for trace visualization and metric/log correlation.

Tempo Architecture

OTel Collector → Tempo Distributor
                        │ (consistent hash ring)
               ┌────────▼────────┐
               │    Ingester     │  (WAL → object store)
               │  (3 replicas)   │
               └────────┬────────┘
                        │
            ┌───────────▼───────────┐
            │  Object Store (S3)    │
            │  Compactor (merges)   │
            └───────────┬───────────┘
                        │
            ┌───────────▼───────────┐
            │  Querier / Query      │ ← Grafana / Tempo API
            │  Frontend             │
            └───────────────────────┘

Optional, fed by the Distributor:
            ┌───────────────────────┐
            │  Metrics Generator    │  (derives RED metrics from traces)
            │  from span attributes │ → Prometheus remote_write
            └───────────────────────┘

Tempo Helm Install

helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install tempo grafana/tempo-distributed \
  --namespace monitoring \
  --values tempo-values.yaml

Tempo Distributed Values

# tempo-values.yaml
tempo:
  reportingEnabled: false
  storage:
    trace:
      backend: s3
      s3:
        bucket: prod-tempo-traces
        region: us-east-1
        # IRSA annotation on ServiceAccount instead of static credentials

  # Enable search over span attributes
  search:
    enabled: true
    max_duration: 0      # no limit on trace duration for search

  # Metrics generator: derive RED metrics from trace spans
  metricsGenerator:
    enabled: true
    # Prometheus must run with --web.enable-remote-write-receiver to accept these writes
    remoteWriteUrl: http://prometheus-operated.monitoring.svc:9090/api/v1/write
    processors:
      - service-graphs    # service dependency graph from traces
      - span-metrics      # RED metrics (rate, error, duration) per operation

  # TraceQL search requires ingester/querier config
  ingester:
    config:
      replication_factor: 3
      trace_idle_period: 30s
      max_block_bytes: 104857600    # 100 MiB blocks

ingester:
  replicas: 3
  resources:
    requests: {cpu: 500m, memory: 2Gi}
    limits: {cpu: 2, memory: 4Gi}
  extraEnv:
    - name: GOMEMLIMIT
      value: "3500MiB"

distributor:
  replicas: 2
  resources:
    requests: {cpu: 200m, memory: 256Mi}

querier:
  replicas: 2
  resources:
    requests: {cpu: 500m, memory: 1Gi}

compactor:
  resources:
    requests: {cpu: 500m, memory: 512Mi}

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/tempo-s3-role

TraceQL

TraceQL is Tempo's query language for searching and filtering traces by span attributes, duration, status, and structural properties. Available in Tempo 2.0+.

# Find all error traces for order-service
{ resource.service.name = "order-service" && status = error }

# Find slow database spans (> 500ms)
{ span.db.system = "postgresql" && duration > 500ms }

# Find traces that hit the payment service with errors
{ resource.service.name = "payment-service" && status = error }

# Find traces by HTTP path and status
{ span.http.route = "/api/v1/orders" && span.http.status_code >= 500 }

# Find traces that include an inventory-service span and are slow overall
# (select() only chooses which attributes to return; it does not filter)
{ resource.service.name = "inventory-service" && traceDuration > 2s }

# Span co-existence: trace has both an error and a DB call
{ status = error } && { span.db.system != nil }

# Structural query: slow PostgreSQL spans that descend from an order-service span
# (>> matches any descendant; use > for direct children only)
{ resource.service.name = "order-service" } >> { span.db.system = "postgresql" && duration > 200ms }

# Find traces by custom attribute
{ span.order.customer_id = "cust-12345" }

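# Metrics queries (TraceQL metrics: experimental, introduced in Tempo 2.4 and extended in later releases)
{ status = error } | rate()
# avg(duration) is only valid as a spanset filter, e.g. `| avg(duration) > 1s`;
# time-series aggregation uses the *_over_time functions:
{ resource.service.name = "payment-service" } | quantile_over_time(duration, .99) by (resource.service.name)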

Tempo Metrics Generator — Derived RED Metrics

# Prometheus metrics automatically generated from trace data
# (names as emitted by Tempo's metrics generator; the OTel Collector
# spanmetrics connector uses different names):
# traces_spanmetrics_calls_total{service, span_name, span_kind, status_code}
# traces_spanmetrics_latency_bucket{...}            (seconds)
# traces_service_graph_request_total{client, server}
# traces_service_graph_request_failed_total{client, server}
# traces_service_graph_request_server_seconds_bucket{...}

# Use in Prometheus/Grafana without requiring manual metric instrumentation:
# Error rate per operation from traces. Both sides are aggregated to the same
# labels; otherwise only the error series match and the ratio is always 1.
sum by (service, span_name) (rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
  / sum by (service, span_name) (rate(traces_spanmetrics_calls_total[5m]))

# p99 latency per service from traces:
histogram_quantile(0.99,
  sum by (le, service) (
    rate(traces_spanmetrics_latency_bucket[5m])
  )
)

Jaeger

Jaeger is a CNCF graduated distributed tracing system originally developed at Uber. It provides a rich UI for trace search, comparison, and dependency graph visualization. Many teams use Jaeger as the tracing UI frontend with Tempo or Elasticsearch as the storage backend.

Jaeger Operator Install

kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml \
  -n observability

Production Jaeger CR (Elasticsearch backend)

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: prod-jaeger
  namespace: observability
spec:
  strategy: production    # separate collector / query / agent components

  collector:
    maxReplicas: 5
    resources:
      requests: {cpu: 500m, memory: 512Mi}
      limits: {cpu: 1, memory: 1Gi}
    options:
      collector:
        num-workers: 50
        queue-size: 2000

  query:
    replicas: 2
    resources:
      requests: {cpu: 200m, memory: 256Mi}
    metricsStorage:
      type: prometheus    # span RED metrics from Prometheus

  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://prod-logs-es-http.logging.svc:9200
        tls:
          ca: /es/certificates/ca.crt
        index-prefix: jaeger
        num-shards: 3
        num-replicas: 1
    secretName: jaeger-es-secret   # contains ES credentials

  ingress:
    enabled: true
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      nginx.ingress.kubernetes.io/auth-secret: jaeger-basic-auth

Jaeger vs Tempo Comparison

Aspect	Jaeger	Grafana Tempo
Storage	Cassandra / Elasticsearch	Object storage (S3/GCS/Azure Blob)
Storage cost	High (full indexing)	Low (only trace-ID + attribute index)
Query language	Tag search, service filter	TraceQL (powerful)
Derived metrics	SPM (with Prometheus)	Metrics generator (built-in)
Grafana integration	Via data source plugin	Native (first-class)
UI	Dedicated trace UI (excellent)	Grafana Explore (good)
Service dependency graph	Built-in (Jaeger UI)	Via Metrics Generator
Scalability	Moderate (ES/Cassandra limits)	High (object store scales)
Tail sampling	Via OTel Collector (Jaeger's adaptive sampling is head-based)	Via OTel Collector

Sampling Strategies

Tracing every request at 100% is prohibitively expensive at production scale. Sampling reduces the volume of trace data while preserving statistical accuracy for performance analysis and retaining 100% of traces for important paths (errors, slow requests).

Head-Based vs Tail-Based Sampling

Head-Based Sampling

Decision made at the start of a trace (before any spans are collected). Deterministic — all services in a trace see the same sampling decision via the sampled flag in traceparent.

Pros: Zero overhead for dropped traces. No collector buffering required.

Cons: Cannot guarantee that rare errors are kept, because the outcome of a trace is unknown when the decision is made.

Use: Traffic >10k RPS where tail sampling is operationally expensive.

Tail-Based Sampling

Decision made after all spans arrive at the collector (typically 10–30s window). Can inspect trace outcome: status, duration, service names.

Pros: Can keep 100% of error/slow traces while sampling normal traffic.

Cons: Requires buffering all in-flight spans in collector memory (50,000+ traces). All spans of a trace must route to the same collector instance.

Use: Preferred when you need error trace fidelity. Requires OTel Collector gateway.
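The decision flow can be sketched in a few lines: buffer spans per trace, then decide once the window has elapsed. This is a simplified model, not the Collector's tail_sampling processor. `TailSampler` and `BufferedSpan` are hypothetical names, the caller is assumed to invoke `decide` after the wait window, and latency is approximated by the longest single span rather than the full trace duration.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class BufferedSpan:
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool = False


class TailSampler:
    def __init__(self, latency_threshold_ms: float = 2000,
                 base_rate: float = 0.05,
                 rng: Optional[random.Random] = None):
        self.latency_threshold_ms = latency_threshold_ms
        self.base_rate = base_rate
        self.rng = rng or random.Random()
        self._buffer: dict[str, list[BufferedSpan]] = defaultdict(list)

    def add(self, span: BufferedSpan) -> None:
        """Buffer a span until its trace's decision window elapses."""
        self._buffer[span.trace_id].append(span)

    def decide(self, trace_id: str) -> str:
        """Evaluate policies in order; the first match keeps the trace."""
        spans = self._buffer.pop(trace_id, [])
        if not spans:
            return "drop"
        if any(s.is_error for s in spans):
            return "keep:error"
        if max(s.duration_ms for s in spans) >= self.latency_threshold_ms:
            return "keep:slow"
        if self.rng.random() < self.base_rate:
            return "keep:sampled"
        return "drop"


sampler = TailSampler()
sampler.add(BufferedSpan("t1", "POST /orders", 234))
sampler.add(BufferedSpan("t1", "payment-service.Charge", 96, is_error=True))
print(sampler.decide("t1"))  # error policy matches, so the trace is kept
```

The buffer is what makes tail sampling expensive: its size grows with the trace arrival rate multiplied by the decision window, which is why the Collector exposes both as settings.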

Sampling Strategies Reference

Strategy	Type	Description	OTel Sampler
Always On	Head	Sample 100% of traces. Development/debug only.	`always_on`
Always Off	Head	Drop 100%. Effectively disables tracing.	`always_off`
TraceID Ratio	Head	Deterministic 0–100% based on trace ID hash.	`traceidratio`
ParentBased (ratio)	Head	Respects parent's sampling decision; falls back to ratio for root spans. Recommended default.	`parentbased_traceidratio`
Status Code	Tail	Keep all ERROR traces.	OTel Collector tail_sampling policy
Latency	Tail	Keep traces over a duration threshold.	OTel Collector tail_sampling policy
Composite	Tail	AND/OR combination of multiple policies.	OTel Collector composite policy
Adaptive	Head	Collector adjusts per-service/endpoint sampling probabilities to meet a target rate of traces per second; SDKs fetch them via remote sampling. Jaeger-specific.	Jaeger adaptive sampling

ParentBased Sampler — Why It Matters

Never Use TraceIDRatioBased Without ParentBased Wrapper

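TraceIDRatioBased(0.1) used on its own ignores the parent's sampled flag: every service re-evaluates the decision from the trace ID. The decisions agree only if every service uses the same ratio and the same hashing algorithm, which is not guaranteed across SDK languages or per-service overrides. When they disagree, service A samples a trace that service B drops, producing broken traces in which some spans exist and others are missing. Always use ParentBased(TraceIDRatioBased(0.1)) so downstream services respect the root's sampling decision.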
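A small model makes the failure mode visible. The functions below are a sketch of the two strategies, not any SDK's actual algorithm: the ratio sampler here compares the low 64 bits of the trace ID against a threshold, which is one plausible scheme and an assumption of this example.

```python
from typing import Optional


def trace_id_ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic decision derived only from the trace ID (sketch)."""
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound


def parent_based_sampled(trace_id_hex: str,
                         parent_sampled: Optional[bool],
                         ratio: float) -> bool:
    """Follow the parent's flag if there is one; otherwise use the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return trace_id_ratio_sampled(trace_id_hex, ratio)


# Low 64 bits are 0x4000..., i.e. 25% of the way through the ID space.
trace_id = "4bf92f3577b34da64000000000000000"

# Service A (root) samples at 50%, service B is configured for 10%.
a_decision = trace_id_ratio_sampled(trace_id, 0.5)

# Without ParentBased, B re-decides and drops its spans of a sampled trace.
b_bare = trace_id_ratio_sampled(trace_id, 0.1)

# With ParentBased, B follows the sampled flag it received from A.
b_wrapped = parent_based_sampled(trace_id, a_decision, 0.1)

print(a_decision, b_bare, b_wrapped)
```

For this trace ID the bare ratio sampler in service B disagrees with service A, while the parent-based variant follows A's decision, which is the behavior the wrapper exists to guarantee.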

Production Tail Sampling Policy (OTel Collector)

processors:
  tail_sampling:
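    decision_wait: 30s
    # Buffer must hold at least expected_new_traces_per_sec × decision_wait
    # (5000 × 30 = 150000 traces), or traces are evicted before a decision
    num_traces: 200000
    expected_new_traces_per_sec: 5000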
    policies:
      # --- Always keep ---
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 3000}
      - name: user-marked-important
        type: string_attribute
        string_attribute:
          key: sampling.priority
          values: ["1", "high", "critical"]
      # --- Noise reduction ---
      - name: drop-health-checks
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/health", "/healthz", "/readyz", "/livez", "/metrics"]
          invert_match: true
        # With invert_match: true, traces on these routes receive an inverted
        # "not sampled" decision, which overrides the keep policies above.
        # With invert_match: false this policy would keep every probe trace.
      # --- Baseline: sample 5% of remaining traffic ---
      - name: base-rate
        type: and
        and:
          and_sub_policy:
            - name: no-health-check
              type: string_attribute
              string_attribute:
                key: http.route
                values: ["/health", "/healthz", "/readyz", "/livez", "/metrics"]
                invert_match: true
            - name: probabilistic-5pct
              type: probabilistic
              probabilistic: {sampling_percentage: 5}

Service Mesh Tracing

Service meshes (Istio, Linkerd) provide automatic trace span creation for all inter-pod communication at the sidecar proxy layer — without any code changes. This is a form of infrastructure-level instrumentation covering all HTTP/gRPC traffic.

Istio Tracing

# Enable tracing in IstioOperator
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
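        sampling: 1.0      # 1%; Istio uses a 0-100 percentage range (not 0-1)
        # Legacy OpenCensus agent export; prefer the OTLP extension provider below
        openCensusAgent:
          address: otel-agent.monitoring.svc:55678
    # OTLP export to the OTel Collector (Istio 1.16+), referenced by the Telemetry resource
    extensionProviders: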
      - name: otel-tracing
        opentelemetry:
          service: otel-collector.monitoring.svc.cluster.local
          port: 4317

# Enable tracing for a specific namespace via Telemetry API
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-tracing
  namespace: production
spec:
  tracing:
    - providers:
        - name: otel-tracing
      randomSamplingPercentage: 5.0   # 5% for this namespace
      customTags:
        env:
          literal:
            value: production
        cluster:
          environment:
            name: CLUSTER_NAME

Istio Tracing Requires Header Forwarding

Istio's Envoy sidecar creates a new child span for each incoming request but cannot automatically propagate the trace context to outgoing requests made by your application code. Your application must still forward the traceparent (and optionally b3, x-b3-*) headers from incoming to outgoing requests. Failure to do this breaks the trace tree — Istio spans appear disconnected from application spans.

Header Forwarding in Go

// Forward trace headers from incoming request to outgoing call
func forwardHeaders(outReq *http.Request, inReq *http.Request) {
    for _, h := range []string{
        "traceparent", "tracestate", "baggage",
        // B3 headers (for Istio/Zipkin compatibility):
        "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid", "x-b3-sampled",
        "x-b3-flags", "b3",
    } {
        if v := inReq.Header.Get(h); v != "" {
            outReq.Header.Set(h, v)
        }
    }
}

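// With the OTel SDK, the otelhttp transport injects the current span context
// automatically, provided a global propagator is registered and each outgoing
// request carries the incoming context (http.NewRequestWithContext):
import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}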

Linkerd Tracing

# Linkerd does NOT inject headers by default — uses opt-in tracing
# Configure via Helm values:
linkerd upgrade \
  --set jaeger.enabled=true \
  --set jaeger.collector.otlp_otlp_grpc_enabled=true \
  --set jaeger.collector.otlp_grpc_addr="otel-collector.monitoring.svc:4317"

# Enable tracing per namespace via annotation (restart workloads so the proxy injector applies it):
kubectl annotate namespace production \
  config.linkerd.io/trace-collector=otel-collector.monitoring.svc:4317 \
  config.alpha.linkerd.io/trace-collector-service-account=otel-collector

Signal Correlation

The value of tracing multiplies when trace IDs are available across all three observability signals. A single trace ID lets you navigate from a Grafana alert → Prometheus metric exemplar → Tempo trace → Loki log line for the exact same request.

Exemplars: Metrics → Traces

// Prometheus exemplar on a histogram observation
import (
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
    "go.opentelemetry.io/otel/trace"
)

var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration in seconds.",
        Buckets: prometheus.DefBuckets,
        // Native histograms also support exemplars:
        NativeHistogramBucketFactor: 1.1,
    },
    []string{"method", "route", "status"},
)

func instrumentHandler(span trace.Span, method, route string, status int, dur float64) {
    obs := requestDuration.With(prometheus.Labels{
        "method": method, "route": route, "status": strconv.Itoa(status),
    })
    sc := span.SpanContext()
    if !sc.IsSampled() {
        // An unsampled trace is never exported, so an exemplar would be a dead link.
        obs.Observe(dur)
        return
    }
    obs.(prometheus.ExemplarObserver).ObserveWithExemplar(dur, prometheus.Labels{
        "traceID": sc.TraceID().String(),
        "spanID":  sc.SpanID().String(),
    })
}

Grafana: Linking Logs to Traces (Derived Fields)

// Grafana Loki data source derived fields config (in provisioning)
{
  "name": "Loki",
  "type": "loki",
  "url": "http://loki-gateway.monitoring.svc",
  "jsonData": {
    "derivedFields": [
      {
        "matcherRegex": "trace_id=(\\w+)",
        "name": "TraceID",
        "url": "$${__value.raw}",
        "datasourceUid": "tempo-uid",
        "urlDisplayLabel": "View Trace in Tempo"
      }
    ]
  }
}
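To see what the matcherRegex captures, the same pattern can be run against sample log lines in Go. The log lines below are made up for illustration, and extractTraceID is a hypothetical helper; Grafana applies the regex itself, so this is only a way to check the pattern against your own log format.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as matcherRegex: the JSON string "trace_id=(\\w+)" unescapes
// to trace_id=(\w+).
var traceIDPattern = regexp.MustCompile(`trace_id=(\w+)`)

// extractTraceID returns the first capture group, or "" when the line has no
// trace_id=<value> field.
func extractTraceID(line string) string {
	m := traceIDPattern.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	logfmt := `level=error msg="stripe API timeout after 3000ms" trace_id=4bf92f3577b34da6 span_id=00f067aa0ba902b7`
	fmt.Println(extractTraceID(logfmt)) // 4bf92f3577b34da6

	// A JSON-formatted line uses "trace_id":"..." and does not match.
	jsonLine := `{"msg":"stripe API timeout","trace_id":"4bf92f3577b34da6"}`
	fmt.Printf("%q\n", extractTraceID(jsonLine)) // ""
}
```

The second case shows why the regex has to follow the log format: services that log JSON need a pattern written for the quoted key and value instead.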

Grafana: Linking Traces to Logs (Tempo → Loki)

// Tempo data source config — link to Loki for logs from trace
{
  "name": "Tempo",
  "type": "tempo",
  "url": "http://tempo-query-frontend.monitoring.svc:3100",
  "jsonData": {
    "tracesToLogsV2": {
      "datasourceUid": "loki-uid",
      "spanStartTimeShift": "-1m",
      "spanEndTimeShift": "1m",
      "filterByTraceID": true,
      "filterBySpanID": false,
      "customQuery": true,
      "query": "{cluster=\"prod\", pod=\"$${__span.tags[\"k8s.pod.name\"]}\"} | json | trace_id = \"$${__trace.traceId}\""
    },
    "tracesToMetrics": {
      "datasourceUid": "prometheus-uid",
      "queries": [
        {
          "name": "Request Rate",
          "query": "rate(traces_spanmetrics_calls_total{service_name=\"$${__span.tags[\"service.name\"]}\"}[5m])"
        }
      ]
    },
    "serviceMap": {
      "datasourceUid": "prometheus-uid"
    }
  }
}

Correlation Flow Diagram

Alert fires (PrometheusRule)
    │
    ▼  Click alert → Grafana Explore
Prometheus metric: http_request_duration_seconds p99 spike
    │
    ▼  Click exemplar dot on metric graph
Tempo trace: 4bf92f3577b34da6... (the slow request)
    │
    ├── Waterfall shows payment-service span took 3.2s
    │
    ▼  Click "Logs for this span" in Tempo UI
Loki logs: all log lines for that pod in the span time window
    │
    ▼  Find log line:
"stripe API timeout after 3000ms" trace_id=4bf92f3577b34da6 span_id=00f067aa0ba902b7

Total time to root cause: ~2 minutes instead of ~2 hours

Metrics, Alerts & Runbooks

Key Tracing Infrastructure Metrics

Metric	Source	Alert Threshold	Meaning
`otelcol_receiver_accepted_spans_total`	OTel Collector	—	Spans successfully received
`otelcol_receiver_refused_spans_total`	OTel Collector	rate > 0	Spans rejected (pipeline full/misconfigured)
`otelcol_exporter_queue_size`	OTel Collector	>80% capacity	Export queue filling up; backend slow or down
`otelcol_exporter_send_failed_spans_total`	OTel Collector	rate > 0	Export failures; traces are being dropped
`tempo_ingester_live_traces`	Tempo	—	Active traces in ingester memory
`tempo_distributor_bytes_received_total`	Tempo	—	Ingestion throughput
`tempo_query_frontend_duration_seconds`	Tempo	p99 > 10s	TraceQL query latency
`traces_spanmetrics_calls_total`	Tempo metrics gen	—	RED metrics derived from traces

Alert Rules

groups:
  - name: tracing-infrastructure
    rules:
      - alert: OTelCollectorDroppingSpans
        expr: rate(otelcol_exporter_send_failed_spans_total[5m]) > 0
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "OTel Collector dropping spans — traces incomplete"
          description: "Failed span export rate: {{ $value | humanize }}/s"
          runbook: "Check collector logs; verify backend (Tempo/Jaeger) health"

      - alert: OTelCollectorQueueFull
        expr: (otelcol_exporter_queue_size / otelcol_exporter_queue_capacity) > 0.8
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "OTel Collector export queue > 80% full"
          description: "Queue filling — backend may be slow. Consider scaling gateway."

      - alert: TempoIngestionFailing
        expr: rate(tempo_distributor_ingester_appends_failures_total[5m]) > 0
        for: 3m
        labels: {severity: critical}
        annotations:
          summary: "Tempo ingestion failures — traces may be lost"

      - alert: TempoQueryLatencyHigh
        expr: |
          histogram_quantile(0.99,
            rate(tempo_query_frontend_duration_seconds_bucket[5m])
          ) > 10
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "Tempo query p99 > 10s — trace searches degraded"

Runbooks

Broken / Incomplete Traces

Check if all services use ParentBased sampler (not raw ratio)
Verify all services forward traceparent header on outgoing calls
Check load-balancer: spans for same trace must route to same gateway
Check decision_wait in tail_sampling — increase if spans arrive late
Verify the rate of otelcol_receiver_refused_spans_total is zero on the agent

High Trace Drop Rate

Check otelcol_exporter_send_failed_spans_total for failed exports
Verify Tempo / Jaeger backend is healthy and reachable
Check gateway CPU/memory — may need horizontal scaling
Check otelcol_exporter_queue_size vs capacity
Reduce ingestion rate by lowering sampling percentage temporarily

Traces Not Appearing in Tempo

Verify app is sending spans: otelcol_receiver_accepted_spans_total > 0
Check Tempo distributor: kubectl logs -l app.kubernetes.io/component=distributor
Verify S3 write permissions (Tempo needs s3:PutObject)
Check Tempo ingester WAL disk space
Query Tempo directly: curl http://tempo:3100/api/traces/<trace-id>

Auto-Instrumentation Not Working

Verify operator is running: kubectl get pods -n opentelemetry-operator-system
Check Instrumentation CRD exists in correct namespace
Verify pod annotation: instrumentation.opentelemetry.io/inject-java: "true"
Check init container ran: kubectl describe pod <pod> | grep -A5 Init
Check env vars injected: kubectl exec <pod> -- env | grep OTEL

TraceQL Queries Timing Out

Narrow time range — TraceQL scans all blocks in range
Add resource.service.name predicate to reduce scan scope
Check querier memory: kubectl top pod -l app=tempo-distributed-querier
Enable cache_results in Tempo query frontend config
Increase query_timeout in Tempo config (default: 30s)

Best Practices

Use ParentBased sampler universally. All services must wrap their rate-based sampler in ParentBased. A service that breaks the parent chain creates orphaned spans that don't appear in trace waterfalls, making traces useless for debugging.
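Always record exceptions with span.RecordError() and set the span status explicitly. RecordError adds an exception event to the span (in Go, pass trace.WithStackTrace(true) to capture the stack trace) but does not change the span status, so follow it with span.SetStatus(codes.Error, err.Error()). The ERROR status is what makes error traces searchable in tail sampling policies and TraceQL.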
Use OTel semantic conventions for attribute names. Non-standard attribute names like database_host instead of server.address fragment queries across services and prevent generic dashboards from working.
Deploy tail sampling for error fidelity. Head sampling decides before the outcome is known, so at 1–10% it discards 90–99% of error traces, and errors are typically rare to begin with. Use an OTel Collector gateway with tail sampling: keep 100% of error and slow traces, and sample a low percentage (1–5%) of normal traffic.
Exclude health-check and readiness probe spans. Kubernetes probes generate continuous noise traces. Add a tail sampling policy or head-based filter to drop spans where http.route matches /health, /readyz, /livez, /metrics.
Set span names at the route level, not URL level. /api/orders/12345 as a span name creates unbounded cardinality. Use /api/orders/{id} (the route template). OTel HTTP instrumentation libraries do this automatically when configured correctly.
Forward trace context in async paths. When a trace crosses a message queue (Kafka, SQS, RabbitMQ), inject the span context into message headers and extract it on the consumer side. Use OTel's messaging semantic conventions and SpanKind.PRODUCER / SpanKind.CONSUMER.
Configure Tempo metrics generator for service graphs. Deriving RED metrics and service dependency graphs from trace data gives you automatic service maps and SLO metrics without requiring manual metric instrumentation in every service.
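The async propagation practice above comes down to writing a traceparent value into message headers on the producer and parsing it on the consumer. The sketch below hand-rolls that with the standard library to show what actually travels with the message. The spanContext type and the map used as message headers are hypothetical; in real code, prefer the OTel propagator API with a carrier for your messaging client.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// spanContext is a hypothetical stand-in for the SDK's span context.
type spanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool
}

// inject writes a W3C traceparent value into message headers (producer side).
func inject(sc spanContext, headers map[string]string) {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	headers["traceparent"] = fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, flags)
}

// extract parses the traceparent header (consumer side). It reports false when
// the header is missing or malformed, in which case the consumer would start
// a new trace. Simplification: only version 00 is accepted and the IDs are
// checked for length, not for being valid hex.
func extract(headers map[string]string) (spanContext, bool) {
	parts := strings.Split(headers["traceparent"], "-")
	if len(parts) != 4 || parts[0] != "00" || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return spanContext{}, false
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil || len(parts[3]) != 2 {
		return spanContext{}, false
	}
	return spanContext{TraceID: parts[1], SpanID: parts[2], Sampled: flags&1 == 1}, true
}

func main() {
	headers := map[string]string{} // stands in for Kafka record headers

	inject(spanContext{
		TraceID: "0af7651916cd43dd8448eb211c80319c",
		SpanID:  "b7ad6b7169203331",
		Sampled: true,
	}, headers)
	fmt.Println(headers["traceparent"])
	// 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

	parent, ok := extract(headers)
	fmt.Println(parent.TraceID, parent.SpanID, parent.Sampled, ok)
	// 0af7651916cd43dd8448eb211c80319c b7ad6b7169203331 true true
}
```

On the consumer, the extracted context becomes the parent (or a link target) of the CONSUMER span, so the trace continues across the queue instead of starting over.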

Coverage Details

Core concepts: Trace (DAG), Span (timing + attributes), Span Context (trace ID + span ID + flags)
Parent-child span relationships; span links for async/messaging
Span attributes, span events (lightweight log on span)
Trace waterfall visualization example (order service, 7 spans)
Span status codes: UNSET / OK / ERROR — with anti-pattern callout (do not set OK proactively)
W3C traceparent header format: version + trace-id + span-id + flags
tracestate header for vendor-specific metadata
Propagation format comparison: W3C TraceContext / B3 Multi / B3 Single / Jaeger / AWS X-Ray / Baggage
W3C Baggage: use cases (tenant ID, A/B) and warning (no secrets, network overhead)
OTel architecture layers: API (library contract) → SDK (application config) → OTLP Exporter
Go: TracerProvider + OTLP gRPC exporter + ParentBased sampler + W3C+Baggage propagator + manual span creation with attributes, RecordError, AddEvent
Python: structlog integration, OTel TracerProvider, OTLP exporter, composite propagator (W3C + B3)
Java: Spring Boot auto-config via opentelemetry-spring-boot-starter, application.yml properties, manual tracer usage with Scope
OTel semantic conventions table: HTTP server/client, database, messaging, RPC, Kubernetes, exceptions
OTel Operator install (kubectl + Helm)
Instrumentation CRD: exporter, propagators, sampler, resource, per-language image + env config (Java/Node.js/Python/Go/dotnet)
Pod annotation for auto-instrumentation opt-in: inject-java/nodejs/python/go/dotnet
Auto-instrumentation coverage table by language: mechanism, frameworks, limitations
OTel Collector deployment modes: agent DaemonSet / gateway Deployment / sidecar
Tail sampling requires gateway callout; consistent hash on trace_id for routing
Agent ConfigMap: OTLP receiver, k8sattributes processor, loadbalancing exporter (DNS resolver)
Gateway ConfigMap: tail_sampling processor (error + latency + string_attribute + probabilistic policies)
Headless Service for gateway DNS discovery by load-balancing exporter
Grafana Tempo architecture: distributor / ingester (WAL) / compactor / querier / query frontend / metrics generator
Tempo Helm install (tempo-distributed)
Tempo distributed values: S3 backend, replication_factor 3, metricsGenerator (service-graphs + span-metrics), remote_write to Prometheus
TraceQL: span selectors, structural queries (>>), co-existence (&&), aggregation (rate/avg by)
Tempo metrics generator: derived RED metrics (traces_spanmetrics_calls_total, duration bucket, service_graph_request_total)
Jaeger Operator install; production Jaeger CR (strategy: production, Elasticsearch backend, SPM)
Jaeger vs Tempo comparison table (storage, cost, query, UI, scalability, tail sampling)
Head-based vs tail-based sampling: pros/cons/use cases
Sampling strategies reference table: always_on/off, traceidratio, parentbased, status_code, latency, composite, adaptive
ParentBased sampler critical: broken traces from independent sampling decisions
Production tail sampling policy: errors + slow + user-marked + health-check exclusion + 5% base rate
Istio tracing: IstioOperator config, Telemetry API for per-namespace sampling, OTLP endpoint
Istio header forwarding requirement callout + Go code pattern + otelhttp transport
Linkerd tracing via opt-in annotations + collector config
Exemplars: Prometheus histogram ExemplarObserver with traceID/spanID labels
Grafana derived fields: Loki data source config linking trace_id to Tempo
Grafana Tempo data source: tracesToLogsV2 (Loki link), tracesToMetrics (Prometheus link), serviceMap
Signal correlation flow diagram: alert → metric exemplar → trace → logs → root cause
8 tracing infrastructure metrics with thresholds
4 PrometheusRule alert rules: OTelCollectorDroppingSpans, QueueFull, TempoIngestionFailing, TempoQueryLatencyHigh
5 runbooks: broken traces, high drop rate, traces not in Tempo, auto-instrumentation not working, TraceQL timeout
8 best practices: ParentBased universally, RecordError, semantic conventions, tail sampling for errors, health-check exclusion, route-level span names, async context propagation, Tempo metrics generator