Distributed Tracing in Kubernetes
Complete guide to OpenTelemetry tracing, W3C Trace Context, Jaeger, Grafana Tempo, sampling strategies, auto-instrumentation, service mesh tracing, and production tail sampling.
Core Concepts
Distributed tracing tracks a single request as it flows through multiple services, capturing timing and causal relationships. Without tracing, diagnosing latency in a microservices system requires grepping logs across dozens of services with no way to correlate them to a specific request.
Trace, Span, and Span Context
Trace
A directed acyclic graph (DAG) of spans representing one end-to-end request. Identified by a globally unique trace ID (128-bit / 16 bytes). All spans in a trace share the same trace ID.
Span
A named, timed operation representing a unit of work. Has a span ID (64-bit / 8 bytes), start time, end time, status (OK/ERROR/UNSET), and zero or more attributes, events, and links.
Span Context
The propagated portion of a span: trace ID + span ID + trace flags (sampled bit) + trace state. Propagated across process boundaries via HTTP headers or message queue metadata.
Parent–Child Relationship
A child span records its parent's span ID. The root span has no parent ID. Causal relationships form a tree; async fan-out or messaging creates a DAG via span links.
Span Attributes
Key-value pairs on a span (indexed for search). OTel semantic conventions define standard names: http.method, db.system, rpc.service, k8s.pod.name.
Span Events
Timestamped log-like messages attached to a span (not propagated). Use for: exception recording, cache miss, retry attempt. Cheaper than creating a child span.
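The parent/child relationship described above can be sketched with a small stand-in data structure. This is an illustration only: the `Span` class, `build_tree`, `self_time_ms`, and the three sample spans are hypothetical and are not the OTel SDK types. It assumes a tree-shaped trace (no span links) and non-overlapping children.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Minimal stand-in for a finished span (not the OTel SDK type)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int
    children: list = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_tree(spans: list) -> Span:
    """Link each span to its parent via parent_id and return the root."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, assuming children do not overlap."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


# Hypothetical spans, for illustration only.
example_spans = [
    Span("a1", None, "POST /orders", 0, 120),
    Span("b2", "a1", "db.insert", 5, 25),
    Span("c3", "a1", "PaymentService.Charge", 30, 110),
]
example_root = build_tree(example_spans)
```

Comparing each span's duration with its self time is what a waterfall view shows visually: time not covered by children is spent in the span itself.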
Example Trace: Order Service
Waterfall view: each bar represents a span's start/end relative to trace start. Gaps reveal time spent between spans (serialization, network, queue wait). This trace immediately shows payment-service (96ms) is the dominant latency contributor.
Span Status and Error Recording
| Status Code | Meaning | When to Set |
|---|---|---|
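| UNSET | Default: operation not explicitly classified | Default for all new spans; leave as-is for normal successful operations |
| OK | Operation succeeded, explicitly confirmed | Set only when the application explicitly confirms success; OK is final and later attempts to set ERROR on the same span are ignored |
| ERROR | Operation failed | On an unhandled exception, a 5xx response in a server span, or a 4xx/5xx response in a client span (a 4xx in a server span is normally left UNSET) |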
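Status is recorded per span: SDKs do not copy a child's ERROR status onto its parent. Once a span's status is set to OK it is final, and later attempts to set ERROR on that span are ignored, so setting OK on every successful span can hide a failure recorded later in the same operation. Only set OK explicitly when you want to mark a span as definitively successful despite child errors (e.g., a retry that ultimately succeeded). Leave UNSET for normal successful operations.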
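A minimal sketch of status selection for HTTP spans, assuming the rule from the OTel HTTP semantic conventions that a 5xx response marks a server span as ERROR while a 4xx response marks only client spans as ERROR. The `Status` enum and `http_span_status` function are hypothetical helpers, not SDK APIs.

```python
from enum import Enum


class Status(Enum):
    UNSET = "UNSET"
    OK = "OK"
    ERROR = "ERROR"


def http_span_status(status_code: int, span_kind: str) -> Status:
    """Map an HTTP response code to a span status.

    span_kind is "server" or "client". Successful responses stay UNSET;
    OK is never chosen automatically.
    """
    if status_code >= 500:
        return Status.ERROR
    if status_code >= 400 and span_kind == "client":
        return Status.ERROR
    return Status.UNSET
```

The function never returns OK on purpose: that value is reserved for code that has explicitly confirmed success.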
W3C Trace Context & Propagation
The W3C Trace Context specification (a W3C Recommendation, not an IETF RFC) defines standard HTTP headers for propagating span context across service boundaries. OpenTelemetry SDKs use it as the default propagation format.
traceparent Header
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
             |  trace-id (32 hex chars, 128-bit) span-id (16 hex) flags
             version=00                                           01=sampled
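The field layout above can be checked with a small parser. This is a sketch that handles only the version `00` layout shown here; `parse_traceparent` and `SpanContext` are hypothetical names, not an SDK API.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - span-id - flags, all lowercase hex
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the parsed context, or None if the header is malformed."""
    m = _TRACEPARENT_RE.match(header.strip())
    if m is None:
        return None
    version, trace_id, span_id, flags = m.groups()
    # Version "ff" is forbidden; all-zero trace or span IDs are invalid.
    if version == "ff" or int(trace_id, 16) == 0 or int(span_id, 16) == 0:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the sampled flag
    return SpanContext(version, trace_id, span_id, sampled)
```

A receiver that gets `None` back should start a new trace rather than fail the request.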
tracestate Header
# Vendor-specific metadata, preserved through the call chain
tracestate: vendorname1=opaqueValue1,vendorname2=opaqueValue2
# Jaeger example:
tracestate: jaeger=sampled=1
# B3 single-header (legacy, still common):
b3: 4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1
Propagation Formats Comparison
| Format | Header(s) | Origin | Status |
|---|---|---|---|
| W3C TraceContext | traceparent, tracestate | W3C standard | Recommended |
| B3 Multi-header | X-B3-TraceId, X-B3-SpanId, X-B3-Sampled | Zipkin | Legacy / widely used |
| B3 Single-header | b3 | Zipkin | Legacy |
| Jaeger | uber-trace-id | Uber/Jaeger | Legacy |
| AWS X-Ray | X-Amzn-Trace-Id | AWS | AWS environments |
| Baggage | baggage | W3C standard | User-defined context |
W3C Baggage
# Baggage propagates user-defined key-value pairs across the entire call chain
# Use for: tenant ID, A/B experiment ID, user ID for debug sessions
baggage: userId=12345,tenantId=acme-corp,ab-experiment=new-checkout
# WARNING: Baggage is propagated to ALL downstream services — never put secrets here.
# Baggage adds network overhead proportional to its size on every HTTP call.
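To make the header format and its per-request cost concrete, here is a rough sketch of encoding and decoding a `baggage` header. It covers only simple `key=value` members (optional `;property` suffixes are dropped) and percent-encodes values; `encode_baggage` and `decode_baggage` are hypothetical helper names.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict) -> str:
    """Serialize key/value pairs into a baggage header value."""
    return ",".join(f"{k}={quote(v, safe='')}" for k, v in entries.items())


def decode_baggage(header: str) -> dict:
    """Parse a baggage header value, ignoring malformed members."""
    out = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0].strip()  # drop optional properties
        if "=" not in key_value:
            continue
        key, value = key_value.split("=", 1)
        out[key.strip()] = unquote(value.strip())
    return out


def baggage_overhead_bytes(header: str) -> int:
    """Bytes added to every outgoing request that carries this header."""
    return len("baggage: ") + len(header.encode("utf-8"))
```

Multiplying `baggage_overhead_bytes` by the number of downstream calls per request gives a rough idea of the extra network cost.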
OpenTelemetry SDK & API
OTel Architecture Layers
Go: Manual Instrumentation
package main
import (
"context"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/propagation" // needed for the propagators set in initTracer
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
"go.opentelemetry.io/otel/trace"
"google.golang.org/grpc"
)
func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
exporter, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("otel-collector.monitoring.svc:4317"),
otlptracegrpc.WithInsecure(),
otlptracegrpc.WithDialOption(grpc.WithBlock()),
)
if err != nil {
return nil, err
}
res := resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceName("order-service"),
semconv.ServiceVersion("v2.4.1"),
semconv.DeploymentEnvironment("production"),
attribute.String("k8s.namespace.name", "payments"),
)
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(res),
// Head-based sampling: 10% of traces
sdktrace.WithSampler(sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.1),
)),
)
otel.SetTracerProvider(tp)
// Set W3C TraceContext + Baggage propagators
otel.SetTextMapPropagator(
propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
),
)
return tp, nil
}
var tracer = otel.Tracer("order-service")
func createOrder(ctx context.Context, req OrderRequest) (*Order, error) {
ctx, span := tracer.Start(ctx, "CreateOrder",
trace.WithAttributes(
attribute.String("order.customer_id", req.CustomerID),
attribute.Int("order.item_count", len(req.Items)),
attribute.Float64("order.total_amount", req.Total),
),
trace.WithSpanKind(trace.SpanKindServer),
)
defer span.End()
order, err := insertOrderDB(ctx, req)
if err != nil {
// Record exception — adds span event with stack trace
span.RecordError(err, trace.WithStackTrace(true))
span.SetStatus(codes.Error, err.Error())
return nil, err
}
// Add span event (lightweight log attached to this span)
span.AddEvent("order persisted",
trace.WithAttributes(attribute.String("order.id", order.ID)),
)
return order, nil
}
func insertOrderDB(ctx context.Context, req OrderRequest) (*Order, error) {
ctx, span := tracer.Start(ctx, "db.insert",
trace.WithAttributes(
semconv.DBSystemPostgreSQL,
semconv.DBNameKey.String("orders"),
semconv.DBOperationKey.String("INSERT"),
semconv.DBStatementKey.String("INSERT INTO orders (customer_id, total) VALUES (?, ?)"),
),
trace.WithSpanKind(trace.SpanKindClient),
)
defer span.End()
// ... actual DB operation
return &Order{}, nil
}
Python: Manual Instrumentation
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

def init_tracer():
    resource = Resource.create({
        SERVICE_NAME: "payment-service",
        SERVICE_VERSION: "v1.2.0",
        "deployment.environment": "production",
    })
    exporter = OTLPSpanExporter(
        endpoint="http://otel-collector.monitoring.svc:4317",
        insecure=True,
    )
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    # Support both W3C and B3 for interop with legacy services
    set_global_textmap(CompositePropagator([
        TraceContextTextMapPropagator(),
        B3MultiFormat(),
    ]))

tracer = trace.get_tracer("payment-service")

def charge_customer(ctx, customer_id: str, amount: float):
    with tracer.start_as_current_span(
        "PaymentService.Charge",
        context=ctx,
        kind=trace.SpanKind.SERVER,
        attributes={
            "payment.customer_id": customer_id,
            "payment.amount": amount,
            "payment.currency": "USD",
        }
    ) as span:
        try:
            result = stripe_charge(ctx, customer_id, amount)
            span.set_attribute("payment.charge_id", result.id)
            return result
        except StripeError as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise
Java: OpenTelemetry with Spring Boot
<!-- pom.xml -->
<dependency>
<groupId>io.opentelemetry.instrumentation</groupId>
<artifactId>opentelemetry-spring-boot-starter</artifactId>
<version>2.3.0-alpha</version>
</dependency>
# application.yml: OTel Spring Boot auto-config
otel:
  service:
    name: inventory-service
  exporter:
    otlp:
      endpoint: http://otel-collector.monitoring.svc:4317
      protocol: grpc
  traces:
    sampler: parentbased_traceidratio
    # Dotted key maps to otel.traces.sampler.arg and avoids a duplicate `sampler:` key
    sampler.arg: "0.1" # 10% head sampling
  metrics:
    exporter: otlp # also export metrics via OTel
  logs:
    exporter: otlp # also export logs via OTel
  propagators: tracecontext,baggage
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
@Service
public class InventoryService {
private final Tracer tracer = GlobalOpenTelemetry.getTracer("inventory-service");
public int checkStock(String productId) {
Span span = tracer.spanBuilder("InventoryService.checkStock")
.setSpanKind(SpanKind.INTERNAL)
.setAttribute("product.id", productId)
.startSpan();
try (Scope scope = span.makeCurrent()) {
int qty = inventoryRepo.findQuantity(productId);
span.setAttribute("inventory.quantity", qty);
return qty;
} catch (Exception e) {
span.recordException(e);
span.setStatus(StatusCode.ERROR, e.getMessage());
throw e;
} finally {
span.end();
}
}
}
Semantic Conventions
OTel semantic conventions define standardized attribute names across language SDKs. Using them enables consistent cross-service querying in backends like Tempo.
| Signal | Key Attributes |
|---|---|
| HTTP Server | http.method, http.route, http.status_code, http.url, server.address, server.port |
| HTTP Client | http.method, http.url, http.status_code, http.request.body.size |
| Database | db.system (postgresql/mysql/redis), db.name, db.operation, db.statement, db.user, server.address |
| Messaging | messaging.system (kafka/rabbitmq), messaging.destination.name, messaging.operation (publish/receive/process) |
| RPC | rpc.system (grpc), rpc.service, rpc.method, rpc.grpc.status_code |
| Kubernetes | k8s.pod.name, k8s.namespace.name, k8s.deployment.name, k8s.node.name, k8s.cluster.name |
| Exceptions | exception.type, exception.message, exception.stacktrace |
Auto-Instrumentation
The OTel Operator provides zero-code-change auto-instrumentation for Java, Node.js, Python, .NET, and Go via the Instrumentation CRD. For Java, Node.js, Python, and .NET it injects an init container that copies the OTel agent/SDK into the pod and configures it via environment variables; Go is instrumented by an eBPF-based sidecar instead.
OTel Operator Install
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
# Or via Helm:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install opentelemetry-operator open-telemetry/opentelemetry-operator \
--namespace opentelemetry-operator-system \
--create-namespace \
--set "manager.collectorImage.repository=otel/opentelemetry-collector-contrib"
Instrumentation CRD
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
name: otel-instrumentation
namespace: production
spec:
# Where to send traces (OTel Collector endpoint)
exporter:
endpoint: http://otel-collector.monitoring.svc:4317
propagators:
- tracecontext
- baggage
- b3multi # also support legacy B3 for mixed environments
sampler:
type: parentbased_traceidratio
argument: "0.1" # 10% head sample; adjust per service via annotation
# Resource attributes added to all signals from instrumented pods
resource:
addK8sUIDAttributes: true
attributes:
cluster: prod-us-east-1
java:
image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.33.0
env:
- name: OTEL_INSTRUMENTATION_JDBC_ENABLED
value: "true"
- name: OTEL_INSTRUMENTATION_SPRING_WEB_ENABLED
value: "true"
nodejs:
image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.49.1
env:
- name: OTEL_NODE_ENABLED_INSTRUMENTATIONS
value: "http,express,pg,redis,kafkajs"
python:
image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.45b0
env:
- name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
value: "true"
go:
image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-go:v0.11.0
# Go uses eBPF — no init container; requires privileged mode
env:
- name: OTEL_GO_AUTO_SHOW_VERIFIER_LOG
value: "false"
dotnet:
image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:1.7.0
Enabling Auto-Instrumentation via Pod Annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  template:
    metadata:
      annotations:
        # Enable auto-instrumentation for this deployment
        instrumentation.opentelemetry.io/inject-java: "true"
        # Or: inject-nodejs, inject-python, inject-dotnet, inject-go
        # To use different settings for this service (e.g., another sampler),
        # name a specific Instrumentation resource instead of "true".
        # An annotation key may appear only once, so this replaces the line above:
        # instrumentation.opentelemetry.io/inject-java: "otel-instrumentation"
        # Override container to inject (default: first container)
        instrumentation.opentelemetry.io/container-names: "order-service"
    spec:
      containers:
        - name: order-service
          image: myregistry/order-service:v2.4.1
          # OTel Operator injects these env vars automatically:
          # OTEL_SERVICE_NAME=order-service
          # OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector...:4317
          # JAVA_TOOL_OPTIONS=-javaagent:/otel-auto-instrumentation/javaagent.jar
          # OTEL_RESOURCE_ATTRIBUTES=k8s.pod.name=$(POD_NAME),k8s.namespace.name=$(NAMESPACE)...
Auto-Instrumentation Coverage by Language
| Language | Mechanism | Frameworks Covered | Limitation |
|---|---|---|---|
| Java | javaagent JAR (bytecode) | Spring Boot, Quarkus, Micronaut, JDBC, gRPC, Kafka, Redis, Mongo | Lambda / GraalVM native require manual SDK |
| Node.js | require() hook at startup | Express, Fastify, HTTP, gRPC, pg, mysql, Redis, Kafka | Custom async hooks may conflict |
| Python | sitecustomize.py + PYTHONPATH | Django, Flask, FastAPI, aiohttp, SQLAlchemy, Redis, Kafka | Celery async tasks need manual context |
| Go | eBPF (no code change needed) | net/http, gRPC, database/sql | Requires a privileged sidecar container injected by the Operator; limited framework depth |
| .NET | CLR profiler | ASP.NET Core, HttpClient, EF Core, gRPC, Redis, Kafka | Profiler API limitations in some scenarios |
OTel Collector for Traces
The OTel Collector decouples application instrumentation from backend details. Applications export to the Collector via OTLP; the Collector transforms, samples, and fans out to one or more backends. This enables backend changes with zero application restarts.
Collector Deployment Modes for Tracing
| Mode | Topology | Use For |
|---|---|---|
| Agent (DaemonSet) | One collector per node; apps send to the collector on their own node (node IP) | Head sampling decisions, initial enrichment, local buffering |
| Gateway (Deployment) | Centralized collectors; agents forward to gateway | Tail sampling (requires seeing all spans for a trace), fan-out to multiple backends |
| Sidecar | One collector per pod | Strict per-pod isolation; rarely needed for tracing |
Tail sampling decisions require seeing all spans of a trace before deciding whether to keep or drop it. This means all spans for a trace must route to the same collector instance. Use a consistent hash on trace_id in the load balancing exporter to route all spans of a trace to the same gateway instance.
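The routing idea can be illustrated with a small consistent-hash ring keyed on the trace ID. This is a sketch of the general technique, not the load-balancing exporter's actual implementation; `HashRing`, the virtual-node count, and the backend names are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring mapping trace IDs to gateway instances."""

    def __init__(self, backends: list, vnodes: int = 100):
        # Each backend is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def backend_for(self, trace_id: str) -> str:
        # First ring point clockwise from the trace ID's hash (wrapping around).
        idx = bisect.bisect(self._points, _hash(trace_id)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical gateway pod names.
example_ring = HashRing(["otel-gateway-0", "otel-gateway-1", "otel-gateway-2"])
```

Because the mapping depends only on the trace ID and the set of backends, every agent picks the same gateway for a given trace, and removing one gateway only remaps the traces that were assigned to it.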
Full Collector Config: Agent + Gateway
# --- Agent ConfigMap (DaemonSet — one per node) ---
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-agent-config
namespace: monitoring
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317 # apps send to node IP:4317
http:
endpoint: 0.0.0.0:4318
processors:
memory_limiter:
limit_mib: 400
spike_limit_mib: 100
check_interval: 5s
batch:
timeout: 5s
send_batch_size: 1024
k8sattributes: # enrich with pod metadata
extract:
metadata:
- k8s.pod.name
- k8s.pod.uid
- k8s.deployment.name
- k8s.namespace.name
- k8s.node.name
- k8s.container.name
labels:
- tag_name: app
key: app
from: pod
pod_association:
- sources:
- from: connection
resource:
attributes:
- key: cluster
value: prod-us-east-1
action: insert
exporters:
# Forward to gateway using load-balancing on trace_id
loadbalancing:
protocol:
otlp:
tls:
insecure: true
resolver:
dns:
hostname: otel-gateway-headless.monitoring.svc # headless Service
port: 4317
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, k8sattributes, resource, batch]
exporters: [loadbalancing]
# --- Gateway ConfigMap (Deployment — 3+ replicas with tail sampler) ---
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
processors:
memory_limiter:
limit_mib: 1500
spike_limit_mib: 400
check_interval: 5s
tail_sampling:
decision_wait: 30s # wait up to 30s for all spans to arrive
num_traces: 50000 # in-memory trace buffer
expected_new_traces_per_sec: 1000
policies:
# Keep ALL traces with errors
- name: error-traces
type: status_code
status_code: {status_codes: [ERROR]}
# Keep slow traces (> 2s)
- name: slow-traces
type: latency
latency: {threshold_ms: 2000}
# Keep traces from specific services (always)
- name: payment-always
type: string_attribute
string_attribute: {key: "service.name", values: ["payment-service"]}
# Sample 5% of everything else
- name: probabilistic-base
type: probabilistic
probabilistic: {sampling_percentage: 5}
batch:
timeout: 5s
send_batch_size: 2048
exporters:
otlp/tempo:
endpoint: tempo.monitoring.svc:4317
tls:
insecure: true
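# Defined for fan-out but not referenced by the traces pipeline below;
# add otlp/jaeger to the pipeline's `exporters` list to send to both backends.
otlp/jaeger: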
endpoint: jaeger-collector.monitoring.svc:4317
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, tail_sampling, batch]
exporters: [otlp/tempo]
Load-Balancing Exporter and Headless Service
# Headless Service for gateway Pods: DNS returns one A record per Pod,
# which the loadbalancing exporter's dns resolver uses to discover backends
apiVersion: v1
kind: Service
metadata:
name: otel-gateway-headless
namespace: monitoring
spec:
clusterIP: None # headless
selector:
app: otel-gateway
ports:
- port: 4317
name: grpc
---
# Gateway Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-gateway
namespace: monitoring
spec:
replicas: 3
template:
spec:
containers:
- name: otelcol
image: otel/opentelemetry-collector-contrib:0.96.0
resources:
requests: {cpu: 500m, memory: 1Gi}
limits: {cpu: 2, memory: 3Gi}
volumeMounts:
- name: config
mountPath: /etc/otelcol-contrib
Grafana Tempo
Grafana Tempo is a high-scale, cost-efficient distributed tracing backend. Like Loki, it uses object storage (S3/GCS/Azure Blob) for trace data and indexes only trace ID and span attributes needed for search — not full-text. It integrates natively with Grafana for trace visualization and metric/log correlation.
Tempo Architecture
Tempo Helm Install
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install tempo grafana/tempo-distributed \
--namespace monitoring \
--values tempo-values.yaml
Tempo Distributed Values
# tempo-values.yaml
tempo:
reportingEnabled: false
storage:
trace:
backend: s3
s3:
bucket: prod-tempo-traces
region: us-east-1
# IRSA annotation on ServiceAccount instead of static credentials
# Enable search over span attributes
search:
enabled: true
max_duration: 0 # no limit on trace duration for search
# Metrics generator: derive RED metrics from trace spans
metricsGenerator:
enabled: true
remoteWriteUrl: http://prometheus-operated.monitoring.svc:9090/api/v1/write
processors:
- service-graphs # service dependency graph from traces
- span-metrics # RED metrics (rate, error, duration) per operation
# TraceQL search requires ingester/querier config
ingester:
config:
replication_factor: 3
trace_idle_period: 30s
max_block_bytes: 104857600 # 100MB chunks
ingester:
replicas: 3
resources:
requests: {cpu: 500m, memory: 2Gi}
limits: {cpu: 2, memory: 4Gi}
extraEnv:
- name: GOMEMLIMIT
value: "3500MiB"
distributor:
replicas: 2
resources:
requests: {cpu: 200m, memory: 256Mi}
querier:
replicas: 2
resources:
requests: {cpu: 500m, memory: 1Gi}
compactor:
resources:
requests: {cpu: 500m, memory: 512Mi}
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/tempo-s3-role
TraceQL
TraceQL is Tempo's query language for searching and filtering traces by span attributes, duration, status, and structural properties. Available in Tempo 2.0+.
# Find all error traces for order-service
{ resource.service.name = "order-service" && status = error }
# Find slow database spans (> 500ms)
{ span.db.system = "postgresql" && duration > 500ms }
# Find traces that hit the payment service with errors
{ resource.service.name = "payment-service" && status = error }
# Find traces by HTTP path and status
{ span.http.route = "/api/v1/orders" && span.http.status_code >= 500 }
# Find traces that contain an inventory-service span AND are slow overall
{ resource.service.name = "inventory-service" && traceDuration > 2s }
# Span co-existence: trace has both an error and a DB call
{ status = error } && { span.db.system != nil }
# Structural query (>> is the descendant operator): slow PostgreSQL spans anywhere below an order-service span
{ resource.service.name = "order-service" } >> { span.db.system = "postgresql" && duration > 200ms }
# Find traces by custom attribute
{ span.order.customer_id = "cust-12345" }
# Aggregation (Tempo 2.4+)
{ status = error } | rate()
{ resource.service.name = "payment-service" } | avg(duration) by(resource.service.name)
Tempo Metrics Generator — Derived RED Metrics
# Prometheus metrics automatically generated from trace data:
# traces_spanmetrics_calls_total{service, span_name, span_kind, status_code}
# traces_spanmetrics_latency_bucket{...} (seconds)
# traces_service_graph_request_total{client, server}
# traces_service_graph_request_failed_total{client, server}
# traces_service_graph_request_server_seconds_bucket{...}
# Use in Prometheus/Grafana without requiring manual metric instrumentation:
# Error rate per operation from traces (aggregate both sides so the
# status_code label does not prevent the series from matching):
sum by (service, span_name) (
  rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])
)
/
sum by (service, span_name) (
  rate(traces_spanmetrics_calls_total[5m])
)
# p99 latency per service from traces:
histogram_quantile(0.99,
  sum by (le, service) (
    rate(traces_spanmetrics_latency_bucket[5m])
  )
)
Jaeger
Jaeger is a CNCF graduated distributed tracing system originally developed at Uber. It provides a rich UI for trace search, comparison, and dependency graph visualization. Many teams use Jaeger as the tracing UI frontend with Tempo or Elasticsearch as the storage backend.
Jaeger Operator Install
kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml \
-n observability
Production Jaeger CR (Elasticsearch backend)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: prod-jaeger
namespace: observability
spec:
strategy: production # separate collector / query / agent components
collector:
maxReplicas: 5
resources:
requests: {cpu: 500m, memory: 512Mi}
limits: {cpu: 1, memory: 1Gi}
options:
collector:
num-workers: 50
queue-size: 2000
query:
replicas: 2
resources:
requests: {cpu: 200m, memory: 256Mi}
metricsStorage:
type: prometheus # span RED metrics from Prometheus
storage:
type: elasticsearch
options:
es:
server-urls: https://prod-logs-es-http.logging.svc:9200
tls:
ca: /es/certificates/ca.crt
index-prefix: jaeger
num-shards: 3
num-replicas: 1
secretName: jaeger-es-secret # contains ES credentials
ingress:
enabled: true
annotations:
nginx.ingress.kubernetes.io/auth-type: basic
nginx.ingress.kubernetes.io/auth-secret: jaeger-basic-auth
Jaeger vs Tempo Comparison
| Aspect | Jaeger | Grafana Tempo |
|---|---|---|
| Storage | Cassandra / Elasticsearch | Object storage (S3/GCS/Azure Blob) |
| Storage cost | High (full indexing) | Low (only trace-ID + attribute index) |
| Query language | Tag search, service filter | TraceQL (powerful) |
| Derived metrics | SPM (with Prometheus) | Metrics generator (built-in) |
| Grafana integration | Via data source plugin | Native (first-class) |
| UI | Dedicated trace UI (excellent) | Grafana Explore (good) |
| Service dependency graph | Built-in (Jaeger UI) | Via Metrics Generator |
| Scalability | Moderate (ES/Cassandra limits) | High (object store scales) |
| Tail sampling | Via OTel Collector (Jaeger's built-in adaptive sampling is head-based) | Via OTel Collector |
Sampling Strategies
Tracing every request at 100% is prohibitively expensive at production scale. Sampling reduces the volume of trace data while preserving statistical accuracy for performance analysis and retaining 100% of traces for important paths (errors, slow requests).
Head-Based vs Tail-Based Sampling
Head-Based Sampling
Decision made at the start of a trace (before any spans are collected). Deterministic — all services in a trace see the same sampling decision via the sampled flag in traceparent.
Pros: Zero overhead for dropped traces. No collector buffering required.
Cons: Cannot keep low-volume errors (you don't know a trace will error at trace start).
Use: Traffic >10k RPS where tail sampling is operationally expensive.
Tail-Based Sampling
Decision made after all spans arrive at the collector (typically 10–30s window). Can inspect trace outcome: status, duration, service names.
Pros: Can keep 100% of error/slow traces while sampling normal traffic.
Cons: Requires buffering all in-flight spans in collector memory (50,000+ traces). All spans of a trace must route to the same collector instance.
Use: Preferred when you need error trace fidelity. Requires OTel Collector gateway.
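The tail-based decision can be sketched as a function over the buffered spans of one trace. This is a simplified illustration of the idea, not the Collector's implementation; `FinishedSpan`, `keep_trace`, and the default thresholds are assumptions chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class FinishedSpan:
    service: str
    status: str  # "UNSET", "OK" or "ERROR"
    start_ms: int
    end_ms: int


def keep_trace(spans, latency_threshold_ms=2000, base_rate=0.05, rng=random.random):
    """Decide whether to keep a fully buffered trace (spans must be non-empty).

    Policies are OR-ed: any matching policy keeps the trace.
    """
    # Policy 1: keep every trace that contains an error span.
    if any(s.status == "ERROR" for s in spans):
        return True
    # Policy 2: keep traces whose end-to-end duration exceeds the threshold.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > latency_threshold_ms:
        return True
    # Policy 3: keep a small random share of everything else.
    return rng() < base_rate
```

The function needs every span of the trace as input, which is why the collector must buffer spans for the decision window and why all spans of a trace must reach the same instance.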
Sampling Strategies Reference
| Strategy | Type | Description | OTel Sampler |
|---|---|---|---|
| Always On | Head | Sample 100% of traces. Development/debug only. | always_on |
| Always Off | Head | Drop 100%. Effectively disables tracing. | always_off |
| TraceID Ratio | Head | Deterministic 0–100% based on trace ID hash. | traceidratio |
| ParentBased (ratio) | Head | Respects parent's sampling decision; falls back to ratio for root spans. Recommended default. | parentbased_traceidratio |
| Status Code | Tail | Keep all ERROR traces. | OTel Collector tail_sampling policy |
| Latency | Tail | Keep traces over a duration threshold. | OTel Collector tail_sampling policy |
| Composite | Tail | AND/OR combination of multiple policies. | OTel Collector composite policy |
| Adaptive | Head | Backend automatically adjusts per-service and per-operation sampling probabilities to meet a target rate of traces per second. Jaeger-specific. | Jaeger adaptive sampling (remote sampler) |
ParentBased Sampler — Why It Matters
If you use TraceIDRatioBased(0.1) directly, each service ignores the caller's decision and re-evaluates the trace ID against its own configuration. The decision is deterministic per trace ID, but as soon as services use different ratios (or SDKs that evaluate the trace ID differently), service A may sample a trace that service B drops. This creates broken traces: some spans exist, others are missing. Always use ParentBased(TraceIDRatioBased(0.1)) so downstream services respect the root's sampling decision.
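A simplified sketch of the two samplers shows where the broken traces come from. It assumes the decision compares the low 64 bits of the trace ID against a bound; real SDKs differ in exactly which bits they use, and `ratio_sampled` and `parent_based` are hypothetical names, not SDK functions.

```python
from typing import Optional


def ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic ratio decision based on the low 64 bits of the trace ID."""
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound


def parent_based(trace_id_hex: str, parent_sampled: Optional[bool], ratio: float) -> bool:
    """Follow the parent's decision if there is one; otherwise use the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id_hex, ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# Without ParentBased: service A (ratio 0.7) keeps the trace, B (ratio 0.1) drops it.
a_keeps = ratio_sampled(trace_id, 0.7)
b_keeps = ratio_sampled(trace_id, 0.1)

# With ParentBased: B follows the sampled flag it received from A.
b_keeps_parent_based = parent_based(trace_id, parent_sampled=a_keeps, ratio=0.1)
```

In the first case the trace would contain A's spans but none of B's; in the second, B's own ratio only applies to traces that B itself starts.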
Production Tail Sampling Policy (OTel Collector)
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 5000
    policies:
      # --- Always keep ---
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 3000}
      - name: user-marked-important
        type: string_attribute
        string_attribute:
          key: sampling.priority
          values: ["1", "high", "critical"]
      # --- Noise reduction ---
      # Policies are OR-ed: a standalone string_attribute policy that matches
      # health-check routes (invert_match: false) would KEEP those traces,
      # not drop them. Health checks are therefore excluded inside the
      # `and` policy below.
      # --- Composite: sample 5% of remaining traffic ---
      - name: base-rate
        type: and
        and:
          and_sub_policy:
            - name: no-health-check
              type: string_attribute
              string_attribute:
                key: http.route
                values: ["/health", "/healthz", "/readyz", "/livez", "/metrics"]
                invert_match: true
            - name: probabilistic-5pct
              type: probabilistic
              probabilistic: {sampling_percentage: 5}
Service Mesh Tracing
Service meshes (Istio, Linkerd) provide automatic trace span creation for all inter-pod communication at the sidecar proxy layer — without any code changes. This is a form of infrastructure-level instrumentation covering all HTTP/gRPC traffic.
Istio Tracing
# Enable tracing in IstioOperator
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 1.0 # 1%: Istio uses a 0-100 range (not 0-1)
        # Legacy option: OpenCensus agent receiver (this is not OTLP)
        openCensusAgent:
          address: otel-agent.monitoring.svc:55678
    # Send to OTel Collector via OTLP (Istio 1.16+)
    extensionProviders:
      - name: otel-tracing
        opentelemetry:
          service: otel-collector.monitoring.svc.cluster.local
          port: 4317
# Enable tracing for a specific namespace via Telemetry API
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-tracing
namespace: production
spec:
tracing:
- providers:
- name: otel-tracing
randomSamplingPercentage: 5.0 # 5% for this namespace
customTags:
env:
literal:
value: production
cluster:
environment:
name: CLUSTER_NAME
Istio's Envoy sidecar creates a new child span for each incoming request but cannot automatically propagate the trace context to outgoing requests made by your application code. Your application must still forward the traceparent (and optionally b3, x-b3-*) headers from incoming to outgoing requests. Failure to do this breaks the trace tree — Istio spans appear disconnected from application spans.
Header Forwarding in Go
// Forward trace headers from incoming request to outgoing call
func forwardHeaders(outReq *http.Request, inReq *http.Request) {
for _, h := range []string{
"traceparent", "tracestate", "baggage",
// B3 headers (for Istio/Zipkin compatibility):
"x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid", "x-b3-sampled",
"x-b3-flags", "b3",
} {
if v := inReq.Header.Get(h); v != "" {
outReq.Header.Set(h, v)
}
}
}
// With OTel SDK — inject propagates automatically if using otelhttp transport:
import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
Linkerd Tracing
# Linkerd does NOT inject headers by default — uses opt-in tracing
# Configure via Helm values:
linkerd upgrade \
--set jaeger.enabled=true \
--set jaeger.collector.otlp_otlp_grpc_enabled=true \
--set jaeger.collector.otlp_grpc_addr="otel-collector.monitoring.svc:4317"
# Enable tracing per namespace via annotation:
kubectl annotate namespace production \
config.linkerd.io/trace-collector=otel-collector.monitoring.svc:4317 \
config.linkerd.io/trace-collector-service-account=otel-collector
Signal Correlation
The value of tracing multiplies when trace IDs are available across all three observability signals. A single trace ID lets you navigate from a Grafana alert → Prometheus metric exemplar → Tempo trace → Loki log line for the exact same request.
Exemplars: Metrics → Traces
// Prometheus exemplar on a histogram observation
import (
"strconv" // used for strconv.Itoa below
"github.com/prometheus/client_golang/prometheus"
"go.opentelemetry.io/otel/trace"
)
var requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Buckets: prometheus.DefBuckets,
// Native histograms also support exemplars:
NativeHistogramBucketFactor: 1.1,
},
[]string{"method", "route", "status"},
)
func instrumentHandler(span trace.Span, method, route string, status int, dur float64) {
sc := span.SpanContext()
requestDuration.With(prometheus.Labels{
"method": method, "route": route, "status": strconv.Itoa(status),
}).(prometheus.ExemplarObserver).ObserveWithExemplar(dur, prometheus.Labels{
"traceID": sc.TraceID().String(),
"spanID": sc.SpanID().String(),
})
}
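Exemplar labels have a size budget: OpenMetrics caps the combined length of exemplar label names and values at 128 runes, and client_golang's ObserveWithExemplar panics when the set exceeds it. The sketch below is a standard-library-only check of that budget; the helper names exemplarRunes and fitsExemplar are hypothetical, and the constant is assumed to mirror prometheus.ExemplarMaxRunes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// exemplarMaxRunes is assumed to mirror prometheus.ExemplarMaxRunes (128).
const exemplarMaxRunes = 128

// exemplarRunes returns the total rune count of all label names and values.
func exemplarRunes(labels map[string]string) int {
	total := 0
	for name, value := range labels {
		total += utf8.RuneCountInString(name) + utf8.RuneCountInString(value)
	}
	return total
}

// fitsExemplar reports whether the label set stays within the rune budget.
func fitsExemplar(labels map[string]string) bool {
	return exemplarRunes(labels) <= exemplarMaxRunes
}

func main() {
	labels := map[string]string{
		"traceID": "4bf92f3577b34da6a3ce929d0e0e4736", // 32 hex characters
		"spanID":  "00f067aa0ba902b7",                 // 16 hex characters
	}
	fmt.Println(exemplarRunes(labels), fitsExemplar(labels))
}
```

With only traceID and spanID the set uses 61 runes, so there is room for one or two short extra labels, but not for free-form values such as URLs or user agents.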
Grafana: Linking Logs to Traces (Derived Fields)
// Grafana Loki data source derived fields config (in provisioning)
{
"name": "Loki",
"type": "loki",
"url": "http://loki-gateway.monitoring.svc",
"jsonData": {
"derivedFields": [
{
"matcherRegex": "trace_id=(\\w+)",
"name": "TraceID",
"url": "$${__value.raw}",
"datasourceUid": "tempo-uid",
"urlDisplayLabel": "View Trace in Tempo"
}
]
}
}
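The matcherRegex in the derived field only fires when log lines carry the trace ID in logfmt style (trace_id=<hex>). A minimal sketch for checking a sample log line against the same expression, using Go's regexp package; the function name extractTraceID and the sample lines are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// traceIDPattern is the same expression as matcherRegex in the Loki config.
var traceIDPattern = regexp.MustCompile(`trace_id=(\w+)`)

// extractTraceID returns the first captured trace ID, or "" if nothing matches.
func extractTraceID(line string) string {
	m := traceIDPattern.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	logfmtLine := `level=info msg="order created" trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7`
	jsonLine := `{"level":"info","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}`

	fmt.Println(extractTraceID(logfmtLine))     // logfmt line: ID is captured
	fmt.Println(extractTraceID(jsonLine) == "") // JSON line: no "trace_id=" substring
}
```

If the services emit JSON logs, the regex would need to match the quoted form instead (for example "trace_id":"(\w+)"), otherwise no link is rendered.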
Grafana: Linking Traces to Logs (Tempo → Loki)
// Tempo data source config — link to Loki for logs from trace
{
"name": "Tempo",
"type": "tempo",
"url": "http://tempo-query-frontend.monitoring.svc:3100",
"jsonData": {
"tracesToLogsV2": {
"datasourceUid": "loki-uid",
"spanStartTimeShift": "-1m",
"spanEndTimeShift": "1m",
"filterByTraceID": true,
"filterBySpanID": false,
"customQuery": true,
"query": "{cluster=\"prod\", pod=\"$${__span.tags[\"k8s.pod.name\"]}\"} | json | trace_id = \"$${__trace.traceId}\""
},
"tracesToMetrics": {
"datasourceUid": "prometheus-uid",
"queries": [
{
"name": "Request Rate",
"query": "rate(traces_spanmetrics_calls_total{service_name=\"$${__span.tags[\"service.name\"]}\"}[5m])"
}
]
},
"serviceMap": {
"datasourceUid": "prometheus-uid"
}
}
}
Correlation Flow Diagram
[Diagram: alert → metric exemplar → trace → logs → root cause]
Metrics, Alerts & Runbooks
Key Tracing Infrastructure Metrics
| Metric | Source | Alert Threshold | Meaning |
|---|---|---|---|
| otelcol_receiver_accepted_spans | OTel Collector | none | Spans successfully received |
| otelcol_receiver_refused_spans | OTel Collector | >0 | Spans rejected (pipeline full/misconfigured) |
| otelcol_exporter_queue_size | OTel Collector | >80% capacity | Export queue filling up; backend slow or down |
| otelcol_exporter_send_failed_spans | OTel Collector | >0 | Export failures; traces being dropped |
| tempo_ingester_live_traces | Tempo | none | Active traces in ingester memory |
| tempo_distributor_bytes_received_total | Tempo | none | Ingestion throughput |
| tempo_query_frontend_duration_seconds | Tempo | p99 > 10s | TraceQL query latency |
| traces_spanmetrics_calls_total | Tempo metrics generator | none | RED metrics derived from traces |
Alert Rules
groups:
  - name: tracing-infrastructure
    rules:
      - alert: OTelCollectorDroppingSpans
        expr: rate(otelcol_exporter_send_failed_spans_total[5m]) > 0
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "OTel Collector dropping spans; traces incomplete"
          description: "Failed span export rate: {{ $value | humanize }}/s"
          runbook: "Check collector logs; verify backend (Tempo/Jaeger) health"
      - alert: OTelCollectorQueueFull
        expr: (otelcol_exporter_queue_size / otelcol_exporter_queue_capacity) > 0.8
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "OTel Collector export queue > 80% full"
          description: "Queue filling; backend may be slow. Consider scaling gateway."
      - alert: TempoIngestionFailing
        expr: rate(tempo_distributor_ingester_append_failures_total[5m]) > 0
        for: 3m
        labels: {severity: critical}
        annotations:
          summary: "Tempo ingestion failures; traces may be lost"
      - alert: TempoQueryLatencyHigh
        expr: |
          histogram_quantile(0.99,
            rate(tempo_query_frontend_duration_seconds_bucket[5m])
          ) > 10
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "Tempo query p99 > 10s; trace searches degraded"
Runbooks
Broken / Incomplete Traces
- Check that all services use the ParentBased sampler (not a raw ratio sampler)
- Verify all services forward the traceparent header on outgoing calls
- Check load balancing: spans for the same trace must route to the same gateway
- Check decision_wait in tail_sampling; increase it if spans arrive late
- Verify otelcol_receiver_refused_spans is zero on the agent
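When checking whether a service forwards traceparent correctly, it helps to validate the header value itself, since a malformed or all-zero value is treated as absent and starts a new trace. The following is a minimal sketch of a version-00 validator using only the standard library; the type and function names are hypothetical, and it deliberately rejects any version other than 00 rather than implementing the spec's forward-compatibility rules.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Traceparent holds the fields of a version-00 W3C traceparent header.
type Traceparent struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool   // lowest bit of the trace-flags field
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')) {
			return false
		}
	}
	return true
}

// ParseTraceparent validates a header of the form 00-<trace-id>-<span-id>-<flags>.
func ParseTraceparent(header string) (Traceparent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 || parts[0] != "00" {
		return Traceparent{}, errors.New("traceparent: expected 00-<trace-id>-<span-id>-<flags>")
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	// All-zero IDs are invalid and must not be propagated.
	if !isLowerHex(traceID, 32) || strings.Trim(traceID, "0") == "" {
		return Traceparent{}, errors.New("traceparent: invalid trace-id")
	}
	if !isLowerHex(spanID, 16) || strings.Trim(spanID, "0") == "" {
		return Traceparent{}, errors.New("traceparent: invalid span-id")
	}
	if !isLowerHex(flags, 2) {
		return Traceparent{}, errors.New("traceparent: invalid trace-flags")
	}
	f, err := strconv.ParseUint(flags, 16, 8)
	if err != nil {
		return Traceparent{}, fmt.Errorf("traceparent: invalid trace-flags: %w", err)
	}
	return Traceparent{TraceID: traceID, SpanID: spanID, Sampled: f&1 == 1}, nil
}

func main() {
	tp, err := ParseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp.TraceID, tp.SpanID, tp.Sampled, err)
}
```

Logging the parsed trace ID on both the inbound and outbound side of a suspect service shows quickly whether the ID changes in between, which is the signature of a broken propagation hop.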
High Trace Drop Rate
- Check otelcol_exporter_send_failed_spans_total for failed exports
- Verify the Tempo / Jaeger backend is healthy and reachable
- Check gateway CPU/memory; it may need horizontal scaling
- Check otelcol_exporter_queue_size against capacity
- Reduce ingestion rate by temporarily lowering the sampling percentage
Traces Not Appearing in Tempo
- Verify the app is sending spans: otelcol_receiver_accepted_spans_total > 0
- Check the Tempo distributor: kubectl logs -l app.kubernetes.io/component=distributor
- Verify S3 write permissions (Tempo needs s3:PutObject)
- Check Tempo ingester WAL disk space
- Query Tempo directly: curl http://tempo:3100/api/traces/<trace-id>
Auto-Instrumentation Not Working
- Verify the operator is running: kubectl get pods -n opentelemetry-operator-system
- Check the Instrumentation CRD exists in the correct namespace
- Verify the pod annotation: instrumentation.opentelemetry.io/inject-java: "true"
- Check the init container ran: kubectl describe pod <pod> | grep -A5 Init
- Check env vars were injected: kubectl exec <pod> -- env | grep OTEL
TraceQL Queries Timing Out
- Narrow the time range: TraceQL scans all blocks in range
- Add a resource.service.name predicate to reduce scan scope
- Check querier memory: kubectl top pod -l app=tempo-distributed-querier
- Enable cache_results in the Tempo query frontend config
- Increase query_timeout in the Tempo config (default: 30s)
Best Practices
- Use the ParentBased sampler universally. All services must wrap their rate-based sampler in ParentBased. A service that breaks the parent chain creates orphaned spans that don't appear in trace waterfalls, making traces useless for debugging.
- Always record exceptions with span.RecordError() and set the span status explicitly. RecordError adds an exception span event (including a stack trace when that option is enabled). In the Go SDK it does not change the span status, so also call span.SetStatus(codes.Error, ...); the ERROR status is what makes error traces searchable in tail sampling policies and TraceQL.
- Use OTel semantic conventions for attribute names. Non-standard attribute names like database_host instead of server.address fragment queries across services and prevent generic dashboards from working.
- Deploy tail sampling for error fidelity. Head sampling at 1–10% will miss most errors (which are typically rare). Use an OTel Collector gateway with tail sampling: keep 100% of error traces and slow traces, and use a low percentage (1–5%) for normal traffic.
- Exclude health-check and readiness probe spans. Kubernetes probes generate continuous noise traces. Add a tail sampling policy or head-based filter to drop spans where http.route matches /health, /readyz, /livez, or /metrics.
- Set span names at the route level, not the URL level. /api/orders/12345 as a span name creates unbounded cardinality. Use /api/orders/{id} (the route template). OTel HTTP instrumentation libraries do this automatically when configured correctly.
- Forward trace context in async paths. When a trace crosses a message queue (Kafka, SQS, RabbitMQ), inject the span context into message headers and extract it on the consumer side. Use OTel's messaging semantic conventions and SpanKind.PRODUCER / SpanKind.CONSUMER.
- Configure the Tempo metrics generator for service graphs. Deriving RED metrics and service dependency graphs from trace data gives you automatic service maps and SLO metrics without requiring manual metric instrumentation in every service.
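Two of the practices above, excluding probe routes and naming spans by route template, reduce to small pure functions. The sketch below shows one possible shape using only the standard library; isProbeRoute and spanName are hypothetical helpers, and the exact-match route list is an assumption taken from the paths listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// probeRoutes lists the probe and scrape routes named in the best practices.
var probeRoutes = map[string]struct{}{
	"/health":  {},
	"/readyz":  {},
	"/livez":   {},
	"/metrics": {},
}

// isProbeRoute reports whether an http.route value belongs to a probe or
// scrape endpoint that should not produce traces (exact match only).
func isProbeRoute(route string) bool {
	_, ok := probeRoutes[route]
	return ok
}

// spanName builds a low-cardinality span name from the HTTP method and the
// route template, never from the concrete URL path.
func spanName(method, routeTemplate string) string {
	return strings.ToUpper(method) + " " + routeTemplate
}

func main() {
	fmt.Println(isProbeRoute("/readyz"))           // probe traffic: skip tracing
	fmt.Println(isProbeRoute("/api/orders/{id}"))  // business traffic: trace it
	fmt.Println(spanName("get", "/api/orders/{id}"))
}
```

In a Go service instrumented with otelhttp, a predicate along these lines could be wired in through otelhttp.WithFilter, which skips tracing for requests where the filter returns false; dropping the spans at the source is cheaper than discarding them later in the Collector.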
Coverage Details
- Core concepts: Trace (DAG), Span (timing + attributes), Span Context (trace ID + span ID + flags)
- Parent-child span relationships; span links for async/messaging
- Span attributes, span events (lightweight log on span)
- Trace waterfall visualization example (order service, 7 spans)
- Span status codes: UNSET / OK / ERROR — with anti-pattern callout (do not set OK proactively)
- W3C traceparent header format: version + trace-id + span-id + flags
- tracestate header for vendor-specific metadata
- Propagation format comparison: W3C TraceContext / B3 Multi / B3 Single / Jaeger / AWS X-Ray / Baggage
- W3C Baggage: use cases (tenant ID, A/B) and warning (no secrets, network overhead)
- OTel architecture layers: API (library contract) → SDK (application config) → OTLP Exporter
- Go: TracerProvider + OTLP gRPC exporter + ParentBased sampler + W3C+Baggage propagator + manual span creation with attributes, RecordError, AddEvent
- Python: structlog integration, OTel TracerProvider, OTLP exporter, composite propagator (W3C + B3)
- Java: Spring Boot auto-config via opentelemetry-spring-boot-starter, application.yml properties, manual tracer usage with Scope
- OTel semantic conventions table: HTTP server/client, database, messaging, RPC, Kubernetes, exceptions
- OTel Operator install (kubectl + Helm)
- Instrumentation CRD: exporter, propagators, sampler, resource, per-language image + env config (Java/Node.js/Python/Go/dotnet)
- Pod annotation for auto-instrumentation opt-in: inject-java/nodejs/python/go/dotnet
- Auto-instrumentation coverage table by language: mechanism, frameworks, limitations
- OTel Collector deployment modes: agent DaemonSet / gateway Deployment / sidecar
- Tail sampling requires gateway callout; consistent hash on trace_id for routing
- Agent ConfigMap: OTLP receiver, k8sattributes processor, loadbalancing exporter (DNS resolver)
- Gateway ConfigMap: tail_sampling processor (error + latency + string_attribute + probabilistic policies)
- Headless Service for gateway DNS discovery by load-balancing exporter
- Grafana Tempo architecture: distributor / ingester (WAL) / compactor / querier / query frontend / metrics generator
- Tempo Helm install (tempo-distributed)
- Tempo distributed values: S3 backend, replication_factor 3, metricsGenerator (service-graphs + span-metrics), remote_write to Prometheus
- TraceQL: span selectors, structural queries (>>), co-existence (&&), aggregation (rate/avg by)
- Tempo metrics generator: derived RED metrics (traces_spanmetrics_calls_total, duration bucket, service_graph_request_total)
- Jaeger Operator install; production Jaeger CR (strategy: production, Elasticsearch backend, SPM)
- Jaeger vs Tempo comparison table (storage, cost, query, UI, scalability, tail sampling)
- Head-based vs tail-based sampling: pros/cons/use cases
- Sampling strategies reference table: always_on/off, traceidratio, parentbased, status_code, latency, composite, adaptive
- ParentBased sampler critical: broken traces from independent sampling decisions
- Production tail sampling policy: errors + slow + user-marked + health-check exclusion + 5% base rate
- Istio tracing: IstioOperator config, Telemetry API for per-namespace sampling, OTLP endpoint
- Istio header forwarding requirement callout + Go code pattern + otelhttp transport
- Linkerd tracing via opt-in annotations + collector config
- Exemplars: Prometheus histogram ExemplarObserver with traceID/spanID labels
- Grafana derived fields: Loki data source config linking trace_id to Tempo
- Grafana Tempo data source: tracesToLogsV2 (Loki link), tracesToMetrics (Prometheus link), serviceMap
- Signal correlation flow diagram: alert → metric exemplar → trace → logs → root cause
- 8 tracing infrastructure metrics with thresholds
- 4 PrometheusRule alert rules: OTelCollectorDroppingSpans, QueueFull, TempoIngestionFailing, TempoQueryLatencyHigh
- 5 runbooks: broken traces, high drop rate, traces not in Tempo, auto-instrumentation not working, TraceQL timeout
- 8 best practices: ParentBased universally, RecordError, semantic conventions, tail sampling for errors, health-check exclusion, route-level span names, async context propagation, Tempo metrics generator