Continuous Profiling
1. What Is Continuous Profiling?
Profiling answers the question "Where are time and memory being spent?", which metrics, logs, and traces cannot fully answer. Metrics tell you a service is slow. Traces narrow it to a span or function call. Profiling shows which exact line of code consumed 37% of CPU.
Continuous profiling means collecting profiles in production constantly, not just during a debug session. Profiles are stored indexed by service/version/time, enabling comparison across deploys and correlation with incidents.
Metrics tell you what
CPU 90%, p99 latency 2s, memory 4 GiB. Aggregate, low-cardinality, no code-level detail.
Traces tell you where
Request took 2s: 1.8s in db.Query(). Span-level, single-request granularity.
Profiles tell you why
db.Query() spent 1.4s in JSON serialisation of a 50-field struct. Function-level hot path.
Continuous vs On-Demand Profiling
| Dimension | On-Demand | Continuous |
|---|---|---|
| Trigger | Manual (incident or debug session) | Always running — timed samples |
| Overhead | High during collection (can be >10%) | Low constant overhead (0.5–2%) |
| Reproducibility | Difficult — must reproduce the issue | Profiles captured during the original event |
| Historical comparison | None | Compare any two timestamps or commits |
| Kubernetes fit | Poor — pods are ephemeral | Excellent — samples labelled by pod/node |
The Observability Stack with Profiling
Alert fires (Alertmanager)
│
▼
Metric spike ──────── Grafana / Prometheus
│ (exemplar)
▼
Trace (Tempo / Jaeger) — which function was slow?
│ (profile_id link)
▼
Profile (Pyroscope) — which line consumed CPU/memory?
│
▼
Code fix + deploy — compare profiles before/after
2. Profile Types Reference
| Profile Type | What It Measures | Languages | Typical Use Case |
|---|---|---|---|
| cpu | CPU time consumed by each function (sampled at 100 Hz) | Go, Java, Python, Rust, .NET, Ruby | High CPU usage, latency regression |
| heap | Live heap allocations (bytes allocated / objects in use) | Go, Java (heap dump), .NET | Memory leak, OOM root cause |
| allocs | All memory allocations including those already freed | Go | GC pressure, excessive short-lived objects |
| goroutine | Stack traces of all current goroutines | Go | Goroutine leak, deadlock investigation |
| mutex | Contended mutex lock wait time | Go | Lock contention under concurrency |
| block | Time goroutines spend blocked on synchronisation primitives (channel operations, select, mutexes, sync.Cond) | Go | Channel stalls, goroutines waiting on each other |
| wall | Wall-clock time (CPU + I/O wait) | Go, Java (async-profiler), eBPF | Distinguishing I/O-bound vs CPU-bound work |
| threadcreate | Stack traces that led to OS thread creation | Go | Excessive cgo or syscall thread spawning |
| eBPF CPU | CPU samples from OS kernel (zero instrumentation) | Any (Go/C/C++/Rust/Java) | Cross-language profiling, kernel overhead |
Sampling profilers (pprof, async-profiler, eBPF) interrupt the process at a fixed rate (e.g., 100 Hz) and record the call stack. Overhead is proportional to sample rate. Instrumented profilers (JVM TI, .NET ETW) inject code at every method entry/exit — exact counts but 5–20× overhead. Always prefer sampling for production continuous profiling.
3. Go: pprof Endpoints & Analysis
Enabling the pprof HTTP Server
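// main.go: serve pprof on its own port, separate from application traffic
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux as a side effect
)

func main() {
	// pprof on :6060. A nil handler means http.DefaultServeMux.
	go func() {
		log.Println(http.ListenAndServe(":6060", nil))
	}()

	// Application server on :8080 with its own mux, so the pprof
	// handlers are not reachable on the application port.
	app := http.NewServeMux()
	// ... register application routes on app ...
	log.Fatal(http.ListenAndServe(":8080", app))
}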
pprof endpoints reveal heap contents and goroutine stacks (which may contain request data), and anyone who can reach them can trigger CPU-heavy profile collection. Bind them to a separate port (e.g., :6060) and restrict access via NetworkPolicy to only the profiling scraper DaemonSet or admin namespace.
pprof HTTP Endpoints
| Endpoint | Profile | Duration |
|---|---|---|
| /debug/pprof/profile?seconds=30 | CPU (30s sample) | Blocks for the sample window; seconds is optional (default 30) |
| /debug/pprof/heap | Heap (live allocations) | Instant snapshot |
| /debug/pprof/allocs | All allocations since start | Instant snapshot |
| /debug/pprof/goroutine | All goroutine stacks | Instant snapshot |
| /debug/pprof/mutex | Mutex contention | Instant (requires mutex fraction) |
| /debug/pprof/block | Block/channel waits | Instant (requires block rate) |
| /debug/pprof/threadcreate | Thread creation stack traces | Instant snapshot |
| /debug/pprof/trace?seconds=5 | Go runtime execution trace | Blocks for the sample window; seconds is optional (default 1) |
Enabling Mutex and Block Profiling
import "runtime"

func init() {
	// Report on average 1/5 of all mutex contention events (fraction = 5)
	runtime.SetMutexProfileFraction(5)
	// Sample every blocking event (rate = 1); a larger value samples
	// roughly one event per that many nanoseconds spent blocked
	runtime.SetBlockProfileRate(1)
}
Collecting and Analysing Profiles with go tool pprof
# Forward the pprof port from a pod
kubectl port-forward pod/myapp-7d9f8b6c4-xk2p9 6060:6060 &
# CPU profile (30s sample); quote the URL so the shell does not interpret "?"
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap
# Flame graph and other views in the built-in web UI
go tool pprof -http=:8081 http://localhost:6060/debug/pprof/heap
# Interactive pprof commands:
(pprof) top10 # top 10 functions by flat time (top10 -cum sorts by cumulative)
(pprof) web # open the call graph as SVG in a browser (requires graphviz)
(pprof) list FunctionName # annotated source listing
(pprof) traces # sample-level call stacks
(pprof) tree # tree view of cumulative costs
(pprof) svg # output SVG call graph
# Save for later comparison (-proto writes the profile in pprof protobuf format):
go tool pprof -proto http://localhost:6060/debug/pprof/heap > before.pb.gz
# After deploy:
go tool pprof -proto http://localhost:6060/debug/pprof/heap > after.pb.gz
# Diff:
go tool pprof -diff_base before.pb.gz after.pb.gz
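When port-forwarding is not an option, a service can also write its own snapshots for a later diff. This is a minimal sketch using runtime/pprof from the standard library; heapSnapshot and isGzip are hypothetical helper names, and writing the bytes to a file such as before.pb.gz is left to the caller.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// heapSnapshot returns the current heap profile in the gzip-compressed
// protobuf format that go tool pprof reads.
func heapSnapshot() ([]byte, error) {
	runtime.GC() // heap profiles reflect the state as of the last completed GC
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// isGzip checks for the two-byte gzip magic number.
func isGzip(b []byte) bool {
	return len(b) > 2 && b[0] == 0x1f && b[1] == 0x8b
}

func main() {
	snap, err := heapSnapshot()
	if err != nil {
		fmt.Println("snapshot failed:", err)
		return
	}
	// In a real workflow: os.WriteFile("before.pb.gz", snap, 0o600)
	fmt.Println("bytes:", len(snap), "gzip:", isGzip(snap))
}
```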
Continuous pprof Collection via Pyroscope SDK (Go)
import (
	"log"
	"os"

	"github.com/grafana/pyroscope-go"
)

func initProfiling() {
	// The mutex and block profile types only produce data if
	// runtime.SetMutexProfileFraction / SetBlockProfileRate are enabled.
	_, err := pyroscope.Start(pyroscope.Config{
		ApplicationName: "order-service",
		ServerAddress:   "http://pyroscope.observability.svc:4040",
		Logger:          pyroscope.StandardLogger,
		Tags: map[string]string{
			"pod":       os.Getenv("POD_NAME"),
			"namespace": os.Getenv("POD_NAMESPACE"),
			"version":   os.Getenv("APP_VERSION"),
		},
		ProfileTypes: []pyroscope.ProfileType{
			pyroscope.ProfileCPU,
			pyroscope.ProfileAllocObjects,
			pyroscope.ProfileAllocSpace,
			pyroscope.ProfileInuseObjects,
			pyroscope.ProfileInuseSpace,
			pyroscope.ProfileGoroutines,
			pyroscope.ProfileMutexCount,
			pyroscope.ProfileMutexDuration,
			pyroscope.ProfileBlockCount,
			pyroscope.ProfileBlockDuration,
		},
	})
	if err != nil {
		log.Printf("pyroscope: profiling not started: %v", err)
	}
}
Labels for Per-Request Profiling
Pyroscope labels (and pprof runtime labels) allow attributing CPU time to a specific tenant, endpoint, or user — without separate profiling runs.
import (
	"context"
	"net/http"

	"github.com/grafana/pyroscope-go"
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
	tenantID := r.Header.Get("X-Tenant-ID")
	// Prefer a route template over the raw path if paths embed IDs;
	// otherwise every distinct URL becomes its own label value.
	endpoint := r.URL.Path
	// Dynamic labels: profile data is segmented by these in Pyroscope
	pyroscope.TagWrapper(r.Context(), pyroscope.Labels(
		"tenant_id", tenantID,
		"endpoint", endpoint,
	), func(ctx context.Context) {
		// All CPU time while this closure executes is tagged
		processRequest(ctx, w, r)
	})
}
4. Language-Specific Profilers
Java: async-profiler & JFR
async-profiler is an async-safe sampling profiler for the JVM. It uses the AsyncGetCallTrace API, which avoids the safepoint bias present in older profilers (such as YourKit in sampling mode), and perf_events for CPU samples.
# Attach async-profiler to a running JVM (via agent)
# In Dockerfile or K8s initContainer, download async-profiler
RUN wget https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz
# Kubernetes: mount via initContainer, attach via profiler container
# Or use Pyroscope Java agent for continuous collection:
java -javaagent:/opt/pyroscope.jar \
-Dpyroscope.server.address=http://pyroscope.observability.svc:4040 \
-Dpyroscope.application.name=payment-service \
-Dpyroscope.format=jfr \
-Dpyroscope.profiler.event=cpu \
-jar app.jar
# application.properties for Pyroscope Spring Boot integration
pyroscope.server.address=http://pyroscope.observability.svc:4040
pyroscope.application.name=${spring.application.name}
pyroscope.format=jfr
pyroscope.profiling.interval=10ms
pyroscope.labels.pod=${POD_NAME}
pyroscope.labels.namespace=${POD_NAMESPACE}
JDK Flight Recorder (JFR): Built-in since JDK 11
# Start JFR recording in a running pod
kubectl exec -it pod/payment-svc-abc123 -- \
jcmd 1 JFR.start duration=60s filename=/tmp/recording.jfr settings=profile
# Copy and analyse with JDK Mission Control
kubectl cp payment-svc-abc123:/tmp/recording.jfr ./recording.jfr
jmc -open recording.jfr
Python: py-spy
py-spy is a sampling profiler for Python that reads the CPython process memory without requiring code changes or a Python interpreter restart.
# Install py-spy in your Python container (or as ephemeral container)
pip install py-spy
# Attach to a running Python process (PID 1 in container)
py-spy record --output profile.svg --duration 30 --pid 1
# For continuous collection with Pyroscope:
pip install pyroscope-io
# Python app — Pyroscope SDK
import os

import pyroscope

pyroscope.configure(
    application_name="ml-inference",
    server_address="http://pyroscope.observability.svc:4040",
    tags={
        "pod": os.environ.get("POD_NAME", ""),
        "version": os.environ.get("APP_VERSION", ""),
    },
)

# Tag per-request
with pyroscope.tag_wrapper({"endpoint": "/predict", "model": model_name}):
    result = run_inference(payload)
Node.js: Clinic.js & 0x
# Install profiling tools
npm install -g clinic 0x
# Production-safe: Pyroscope Node.js SDK
npm install @pyroscope/nodejs
// index.js — Pyroscope SDK for Node.js
const Pyroscope = require('@pyroscope/nodejs');
Pyroscope.init({
serverAddress: 'http://pyroscope.observability.svc:4040',
appName: 'api-gateway',
tags: {
pod: process.env.POD_NAME || '',
namespace: process.env.POD_NAMESPACE || '',
},
});
Pyroscope.start();
Rust: pprof-rs
# Cargo.toml
[dependencies]
pprof = { version = "0.13", features = ["flamegraph", "protobuf-codec"] }
# Expose /debug/pprof endpoint via actix-web or axum handler
use std::time::Duration;

use actix_web::{HttpResponse, Responder};
use pprof::protos::Message; // brings encode() into scope
use pprof::ProfilerGuardBuilder;

// Collects a 30-second CPU profile and returns it in pprof protobuf format
async fn cpu_profile() -> impl Responder {
    let guard = ProfilerGuardBuilder::default()
        .frequency(100)
        .blocklist(&["libc", "libgcc", "pthread"])
        .build()
        .unwrap();
    tokio::time::sleep(Duration::from_secs(30)).await;
    let report = guard.report().build().unwrap();
    let mut body = Vec::new();
    report.pprof().unwrap().encode(&mut body).unwrap();
    HttpResponse::Ok().content_type("application/octet-stream").body(body)
}
.NET: dotnet-trace & Pyroscope
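# dotnet-trace (.NET diagnostic tool; install with: dotnet tool install --global dotnet-trace)
# --duration uses the dd:hh:mm:ss format, so this collects for 30 seconds
dotnet-trace collect --process-id 1 --duration 00:00:00:30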
# Pyroscope .NET agent (via environment variable injection)
# In Kubernetes Pod spec:
env:
  - name: CORECLR_ENABLE_PROFILING
    value: "1"
  - name: CORECLR_PROFILER
    value: "{BD1A650D-AC5D-4896-B64F-D6FA25D6B26A}"
  - name: CORECLR_PROFILER_PATH
    value: /pyroscope/Pyroscope.Profiler.Native.so
  - name: PYROSCOPE_SERVER_ADDRESS
    value: http://pyroscope.observability.svc:4040
  - name: PYROSCOPE_APPLICATION_NAME
    value: cart-service
5. Grafana Pyroscope
Architecture
┌─────────────────────────────────────────────────────────┐
│ Pyroscope Cluster │
│ │
│ Push (SDK) Pull (scrape) │
│ ┌────────┐ ┌────────────────┐ │
│ │ app │──push──▶│ distributor │ │
│ │ SDK │ └───────┬────────┘ │
│ └────────┘ │ fan-out │
│ ┌────────▼────────┐ │
│ ┌────────────┐ │ ingester │ (WAL + ring) │
│ │ Pyroscope │───│ (3 replicas) │ │
│ │ scrape │ └────────┬─────────┘ │
│ │ (pprof pull│ │ flush │
│ │ targets) │ ┌────────▼────────┐ │
│ └────────────┘ │ object store │ (S3 / GCS) │
│ │ (blocks) │ │
│ └────────┬────────┘ │
│ ┌────────▼────────┐ │
│ │ store-gateway │ (cache + query) │
│ └────────┬────────┘ │
│ ┌────────▼────────┐ │
│ │ query-frontend │◀── Grafana │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────┘
Helm Install — Pyroscope Distributed
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install pyroscope grafana/pyroscope \
--namespace observability \
--create-namespace \
--values pyroscope-values.yaml
pyroscope-values.yaml (Production)
pyroscope:
replicationFactor: 3
storage:
backend: s3
s3:
bucket_name: my-pyroscope-profiles
region: us-east-1
# Use IRSA — no access keys in cluster
components:
distributor:
replicas: 2
resources:
requests: {cpu: 500m, memory: 512Mi}
limits: {memory: 1Gi}
ingester:
replicas: 3
persistence:
enabled: true
size: 20Gi
resources:
requests: {cpu: 1, memory: 2Gi}
limits: {memory: 4Gi}
querier:
replicas: 2
resources:
requests: {cpu: 1, memory: 1Gi}
limits: {memory: 2Gi}
query-frontend:
replicas: 2
store-gateway:
replicas: 3
persistence:
enabled: true
size: 50Gi
compactor:
replicas: 1
persistence:
enabled: true
size: 50Gi
limits:
# Global defaults
max_sample_age: 24h
# Per-tenant overrides via ConfigMap
retention:
default: 720h # 30 days
# Scrape pprof endpoints from pods with annotation
scrapeConfigs:
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_profiles_grafana_com_cpu_scrape]
action: keep
regex: "true"
- source_labels: [__address__, __meta_kubernetes_pod_annotation_profiles_grafana_com_cpu_port]
action: replace
target_label: __address__
# Keep the pod IP, drop any existing port, append the annotation port
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: "$1:$2"
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_label_app]
target_label: service_name
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/pyroscope-s3-role
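The address rewrite in the scrape config is easiest to reason about with a concrete input. The sketch below mimics how Prometheus-style relabelling treats one rule: source label values are joined with ";" and the regex is anchored at both ends. The pattern shown is the commonly used address-plus-port form and the sample addresses are made up; it is an illustration in Go, not the scraper's own code.

```go
package main

import (
	"fmt"
	"regexp"
)

// addrPort matches "<host>[:<port>];<annotationPort>", fully anchored the
// way relabelling anchors its regexes.
var addrPort = regexp.MustCompile(`^(?:([^:]+)(?::\d+)?;(\d+))$`)

// rewriteAddress returns the new __address__ value for a pod address and
// the port taken from the pod annotation.
func rewriteAddress(address, annotationPort string) string {
	joined := address + ";" + annotationPort
	if !addrPort.MatchString(joined) {
		return address // rule does not match: the label is left unchanged
	}
	return addrPort.ReplaceAllString(joined, "${1}:${2}")
}

func main() {
	fmt.Println(rewriteAddress("10.0.0.5:8080", "6060"))
	fmt.Println(rewriteAddress("10.0.0.5", "6060"))
	fmt.Println(rewriteAddress("10.0.0.5:8080", ""))
}
```

A replacement that references only one capture group twice would produce a value such as 6060:6060, which is not a reachable address.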
Pod Annotations for Pull-Mode Scraping
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    metadata:
      annotations:
        # Enable CPU profiling scrape
        profiles.grafana.com/cpu.scrape: "true"
        profiles.grafana.com/cpu.port: "6060"
        profiles.grafana.com/cpu.path: "/debug/pprof/profile"
        # Enable memory profiling scrape
        profiles.grafana.com/memory.scrape: "true"
        profiles.grafana.com/memory.port: "6060"
        profiles.grafana.com/memory.path: "/debug/pprof/heap"
        # Enable goroutine profiling scrape
        profiles.grafana.com/goroutine.scrape: "true"
        profiles.grafana.com/goroutine.port: "6060"
        profiles.grafana.com/goroutine.path: "/debug/pprof/goroutine"
Grafana Data Source for Pyroscope
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasource-pyroscope
  namespace: observability
  labels:
    grafana_datasource: "1"
data:
  pyroscope.yaml: |
    apiVersion: 1
    datasources:
      - name: Pyroscope
        type: grafana-pyroscope-datasource
        url: http://pyroscope.observability.svc:4040
        isDefault: false
        jsonData:
          minStep: "15s"
Pyroscope Query Language (FlameQL)
# Select CPU profiles for the order-service
{service_name="order-service"}
# Filter by namespace and pod
{service_name="order-service", namespace="production", pod=~"order-service-.*"}
# Filter by dynamic label (set via SDK TagWrapper)
{service_name="order-service", endpoint="/api/checkout"}
# Compare two time ranges (diff mode in Grafana UI):
# baseline: 2024-01-15T10:00:00Z / 2024-01-15T10:30:00Z
# comparison: 2024-01-15T11:00:00Z / 2024-01-15T11:30:00Z
6. eBPF-Based Profiling
eBPF profilers run in the Linux kernel and capture CPU stack traces for every process on a node — including native code, JVM internals, and kernel functions — without any application instrumentation. This makes them ideal for profiling languages where SDK injection is impractical (C/C++, Rust) or for getting a full-system view.
Grafana Alloy (eBPF profiler)
Part of the Grafana observability stack. DaemonSet agent that uses eBPF perf events + DWARF unwinding. Ships profiles directly to Pyroscope.
Parca Agent
CNCF project. eBPF CPU profiling DaemonSet. Stores profiles in Parca server (similar to Pyroscope). Open source, BPF CO-RE (no kernel header deps).
Pixie
CNCF project. eBPF-based auto-telemetry (metrics + traces + profiles + network maps) with no instrumentation. Strong Go/Java/C++ support.
Elastic Universal Profiling
Commercial eBPF profiler in Elastic Stack. Full-stack profiling from kernel to application. Correlates with Elastic APM traces.
Grafana Alloy eBPF Profiling DaemonSet
helm upgrade --install alloy grafana/alloy \
--namespace observability \
--values alloy-values.yaml
# alloy-values.yaml (eBPF profiling config)
alloy:
configMap:
content: |
// eBPF CPU profiler — no app instrumentation required
pyroscope.ebpf "all_pods" {
targets_only = false // profile all processes
demangle = "full" // C++ symbol demangling
python_enabled = true // Python frame unwinding
collect_interval = "15s"
forward_to = [pyroscope.write.default.receiver]
}
// Kubernetes pod enrichment — add namespace/pod/service labels
discovery.kubernetes "pods" {
role = "pod"
}
pyroscope.write "default" {
endpoint {
url = "http://pyroscope.observability.svc:4040"
}
external_labels = {
"cluster" = "production",
}
}
daemonset:
enabled: true
# Required Linux capabilities for eBPF
podSecurityContext:
runAsUser: 0
containerSecurityContext:
privileged: true # eBPF requires elevated privileges
capabilities:
add:
- SYS_ADMIN
- SYS_PTRACE
- NET_ADMIN
eBPF profilers require SYS_ADMIN and SYS_PTRACE capabilities. They must run as root (UID 0) and use hostPID: true to see all container processes. This is expected for node-level agents — apply strict NetworkPolicy and RBAC so the DaemonSet ServiceAccount cannot access application secrets or APIs beyond what's needed for pod label enrichment.
eBPF vs SDK Profiling Comparison
| Dimension | SDK / Pull (pprof) | eBPF Agent (Alloy/Parca) |
|---|---|---|
| Instrumentation required | Yes (import SDK or add agent JVM arg) | None |
| Languages supported | Per-SDK (Go, Java, Python, .NET, Node) | All (C, Go, Rust, Java, Python, Node) |
| Profile types | CPU, heap, allocs, goroutine, mutex, block | CPU (wall), some memory via USDT probes |
| Per-request labels | Yes (TagWrapper / runtime labels) | No (pod/container level only) |
| Kernel visibility | Userspace only | Full kernel + userspace stacks |
| JVM Java frames | Full (JVM knows method names) | Partial without javaagent; use perf-map-agent |
| Container overhead | Low per app (~1–2% CPU) | Low per node (~0.5% per profiled process) |
| Zero-day coverage | Only opted-in services | Every process on the node automatically |
7. Kubernetes Component Profiling
All core Kubernetes components expose pprof endpoints natively — enabling profiling of the control plane itself when diagnosing API server latency, scheduler queue depth, or etcd compaction pauses.
Accessing Component pprof Endpoints
# kube-apiserver — requires authentication
# Port-forward via kubectl proxy or direct pod port-forward
kubectl -n kube-system port-forward pod/kube-apiserver-controlplane-0 6443:6443
# CPU profile of the API server (30s)
curl -sk --cert /etc/kubernetes/pki/admin.crt \
--key /etc/kubernetes/pki/admin.key \
"https://localhost:6443/debug/pprof/profile?seconds=30" \
> apiserver.pprof
go tool pprof apiserver.pprof
# kube-scheduler (secure port 10259; the insecure port 10251 has been removed in current Kubernetes releases)
kubectl -n kube-system port-forward pod/kube-scheduler-controlplane-0 10259:10259
curl -sk "https://localhost:10259/debug/pprof/heap" --cert ... > scheduler-heap.pprof
# kube-controller-manager (port 10257)
kubectl -n kube-system port-forward pod/kube-controller-manager-controlplane-0 10257:10257
curl -sk "https://localhost:10257/debug/pprof/goroutine?debug=1" --cert ...
# kubelet (port 10250 on each node); the token's ServiceAccount needs RBAC access to nodes/proxy
NODE_IP=$(kubectl get node worker-1 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
curl -sk "https://${NODE_IP}:10250/debug/pprof/profile?seconds=30" \
--header "Authorization: Bearer $(kubectl create token default)" > kubelet.pprof
# etcd (port 2379; requires an etcd client cert, etcd started with --enable-pprof, and curl available in the container)
kubectl -n kube-system exec etcd-controlplane-0 -- \
curl -sk "https://localhost:2379/debug/pprof/heap" \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key > etcd-heap.pprof
Common Control Plane Profiling Scenarios
| Symptom | Component | Profile Type | What to Look For |
|---|---|---|---|
| API server high latency | kube-apiserver | CPU + goroutine | etcd calls, admission webhook latency, LIST serialisation |
| Scheduler queue growing | kube-scheduler | CPU + goroutine | Predicate/scoring plugins, priority queue operations |
| etcd high memory | etcd | heap | Large objects in watch cache, compaction lag, mvcc index |
| Controller manager slow | kube-controller-manager | CPU | GC loops, requeue storms, informer cache resync |
| Node CPU spike | kubelet | CPU + goroutine | Image pulls, pod lifecycle, CRI calls, eviction checks |
8. Reading Flame Graphs
A flame graph shows all stack traces collected during the profiling period, merged and sorted alphabetically at each level. The x-axis represents proportion of sampled time (not wall-clock order). The y-axis represents call depth (root at bottom in traditional flame graphs, or top in Pyroscope's icicle graphs).
Reading a flame graph:
┌──────────────────────────────────────────────────────┐
│ runtime.main                                         │ ← root (widest = most time)
├──────────────────────────────┬───────────────────────┤
│ http.(*ServeMux).            │ runtime.gcBgMarkWorker│ ← GC taking ~30% CPU!
│ ServeHTTP (70%)              │ (30%)                 │
├──────────────┬───────────────┤                       │
│ handler      │ middleware.Do │                       │
│ (45%)        │ (25%)         │                       │
├────┬─────────┤               │                       │
│ db │ json.   │               │                       │
│    │ Marshal │               │                       │
└────┴─────────┴───────────────┴───────────────────────┘
Width = proportion of total samples where this function was on the stack
Tall = deep call stack (not necessarily slow)
Wide = this function (or its callees) uses a lot of CPU
Plateau = the function itself (not its callees) is using the CPU
Color = arbitrary (typically indicates package/module/type in Pyroscope)
Look for wide frames at the top of the flame graph (or bottom of an icicle graph) — these are "leaf" functions that consume time themselves rather than passing it to callees. A wide frame in the middle means a common call path, but the cost may be in its children. Use the diff mode in Pyroscope to highlight functions that grew between a before/after comparison.
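The difference between a wide frame and a plateau is the difference between total and self cost. As a rough illustration, the sketch below folds a handful of made-up stack samples into per-function counts: total is the number of samples where the function appears anywhere on the stack (frame width), self is the number where it is the leaf (plateau width). The type and function names are hypothetical.

```go
package main

import "fmt"

// cost holds sample counts for one function.
type cost struct{ self, total int }

// sampleStacks returns example stacks, root first and leaf last.
func sampleStacks() [][]string {
	return [][]string{
		{"main", "handler", "json.Marshal"},
		{"main", "handler", "json.Marshal"},
		{"main", "handler", "db.Query"},
		{"main", "gcBgMarkWorker"},
	}
}

// fold aggregates sampled stacks into self and total counts per function.
func fold(stacks [][]string) map[string]cost {
	out := map[string]cost{}
	for _, st := range stacks {
		seen := map[string]bool{}
		for i, fn := range st {
			c := out[fn]
			if !seen[fn] { // count a recursive function once per sample
				c.total++
				seen[fn] = true
			}
			if i == len(st)-1 { // leaf frame: the function itself was on CPU
				c.self++
			}
			out[fn] = c
		}
	}
	return out
}

func main() {
	costs := fold(sampleStacks())
	for _, fn := range []string{"main", "handler", "json.Marshal", "db.Query"} {
		fmt.Printf("%-14s self=%d total=%d\n", fn, costs[fn].self, costs[fn].total)
	}
}
```

Here handler is wide (3 of 4 samples) but has no self time, so the cost sits in its callees, while json.Marshal is a plateau.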
Pyroscope Diff Workflow
# Using Pyroscope HTTP API to compare two time ranges
# Baseline: before deploy (T-1h to T-30m)
# Comparison: after deploy (T-10m to now)
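# -G sends the parameters as a URL query string; --data-urlencode handles the escaping
curl -sG "http://pyroscope.observability.svc:4040/render" \
--data-urlencode 'query=order-service.cpu{namespace="production"}' \
--data-urlencode 'from=now-1h' \
--data-urlencode 'until=now-30m' \
--data-urlencode 'format=json' > baseline.json
curl -sG "http://pyroscope.observability.svc:4040/render" \
--data-urlencode 'query=order-service.cpu{namespace="production"}' \
--data-urlencode 'from=now-10m' \
--data-urlencode 'until=now' \
--data-urlencode 'format=json' > after.json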
# Or use Grafana Explore → Pyroscope datasource → Diff mode (select two date ranges)
9. Linking Traces to Profiles
Pyroscope's tracing integration attaches a span identifier as a label on profiling samples while a trace span is active (named profile_id in older integrations; recent otel-profiling-go releases use span_id). This allows Grafana Tempo to show a "View Profile" link directly from a trace span, drilling into the CPU usage recorded during that request.
Go: Profiling a Specific Trace Span
import (
	"net/http"

	otelpyroscope "github.com/grafana/otel-profiling-go"
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// Call once at startup, after the Pyroscope SDK has been started.
func initTracing() {
	tp := sdktrace.NewTracerProvider() // add your exporter and sampler options here
	// Wrap the provider so that each span annotates the profiling samples
	// taken while it is active with the span's identifier.
	otel.SetTracerProvider(otelpyroscope.NewTracerProvider(tp))
}

func handleCheckout(w http.ResponseWriter, r *http.Request) {
	ctx, span := otel.Tracer("order-service").Start(r.Context(), "checkout")
	defer span.End()
	// CPU time spent in processCheckout is now labelled with this span's
	// identifier, which Tempo uses to link the span to its profile.
	processCheckout(ctx, w, r)
}
Grafana Tempo → Pyroscope Integration
# grafana/provisioning/datasources/tempo.yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo-query-frontend.observability.svc:3100
    jsonData:
      tracesToProfiles:
        datasourceUid: pyroscope
        tags:
          - key: service.name
            value: service_name
        profileTypeId: "process_cpu:cpu:nanoseconds:cpu:nanoseconds"
        customQuery: false
With this configuration, Tempo trace spans show a "Profile" button. Clicking it opens Pyroscope filtered to the same service_name and time window as the trace span — giving a CPU flame graph for the exact duration of the slow request.
Correlation Flow
1. Alert: "p99 checkout latency > 3s for last 10 min"
│
2. Grafana → Tempo trace search
→ Find a slow checkout trace (2.8s)
→ Identify slow span: "process_payment" (2.4s)
│
3. Click "View Profile" on the span
→ Pyroscope flame graph for that 2.4s window
→ 68% of CPU in json.Marshal(*PaymentResponse)
→ PaymentResponse has 120 fields, most nil
│
4. Fix: return only populated fields / use proto instead of JSON
→ Deploy → compare Pyroscope profiles before/after
→ json.Marshal reduced from 68% → 4% of CPU
→ p99 latency: 2.8s → 180ms
10. Alerting & Anomaly Detection
Pyroscope Self-Metrics (Prometheus)
# Pyroscope exposes Prometheus metrics on :4040/metrics
pyroscope_ingester_profiles_received_total # profiles being pushed
pyroscope_distributor_received_samples_total # samples received
pyroscope_querier_query_duration_seconds # query latency histogram
pyroscope_compactor_block_cleanup_failures_total
pyroscope_ring_members{name="ingester",state="ACTIVE"} # ring health
PrometheusRule — Pyroscope Health
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pyroscope-alerts
  namespace: observability
spec:
  groups:
    - name: pyroscope
      interval: 1m
      rules:
        - alert: PyroscopeIngesterDown
          # The gauge value is the member count as seen by each reporter,
          # so take the max rather than counting series
          expr: |
            max(pyroscope_ring_members{name="ingester",state="ACTIVE"}) < 2
          for: 5m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Pyroscope ingester ring below quorum"
            description: "Only {{ $value }} ingesters active, expected ≥ 2"
        - alert: PyroscopeDroppingProfiles
          expr: |
            rate(pyroscope_distributor_received_samples_total{status="dropped"}[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pyroscope distributor is dropping profiles"
        - alert: PyroscopeQueryLatencyHigh
          expr: |
            histogram_quantile(0.99,
              rate(pyroscope_querier_query_duration_seconds_bucket[5m])
            ) > 30
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pyroscope p99 query latency > 30s"
Detecting CPU Regressions via Pyroscope API
Because Pyroscope stores profiles as time series, you can query aggregated CPU time and alert on regressions using Prometheus-style metric queries if you enable Pyroscope's metric export.
# pyroscope-values.yaml — enable metric exporter
pyroscope:
extraConfig:
metric_store:
enabled: true
# Aggregates CPU samples as prometheus metrics:
# pyroscope_app_cpu_seconds_total{service_name, namespace}
# Then alert on CPU regression:
- alert: ServiceCPURegressionDetected
  expr: |
    (
      rate(pyroscope_app_cpu_seconds_total{namespace="production"}[30m])
      /
      rate(pyroscope_app_cpu_seconds_total{namespace="production"}[30m] offset 1h)
    ) > 1.5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.service_name }} CPU increased >50% vs 1h ago"
    description: "CPU regression detected — compare profiles in Pyroscope"
    runbook: "https://runbooks.internal/cpu-regression"
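The alert expression reduces to a ratio test between the current CPU rate and the rate one hour earlier. A minimal sketch of that arithmetic, with a hypothetical function name and made-up rates; unlike PromQL, it simply skips the comparison when there is no baseline:

```go
package main

import "fmt"

// regressed reports whether the current CPU rate exceeds the baseline rate
// by more than the given ratio (1.5 means "more than 50% higher").
func regressed(current, baseline, ratio float64) bool {
	if baseline <= 0 {
		return false // no baseline data: nothing to compare against
	}
	return current/baseline > ratio
}

func main() {
	// CPU-seconds per second, now vs one hour ago (example values).
	fmt.Println(regressed(0.9, 0.5, 1.5)) // ratio 1.8
	fmt.Println(regressed(0.6, 0.5, 1.5)) // ratio 1.2
}
```

A ratio-based test fires on relative change, so a service going from 0.02 to 0.04 cores also alerts; pairing it with a minimum absolute rate may reduce noise.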
NetworkPolicy: Restrict pprof Access
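apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-pprof-only-from-pyroscope
  namespace: production
spec:
  podSelector:
    matchLabels:
      # podSelector matches pod labels, not annotations:
      # the pods must also carry this key as a label
      profiles.grafana.com/cpu.scrape: "true"
  policyTypes: [Ingress]
  ingress:
    # pprof port: only from the profiling agent in the observability namespace
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: observability
          podSelector:
            matchLabels:
              app.kubernetes.io/name: alloy
      ports:
        - port: 6060
          protocol: TCP
    # Application port stays reachable; without this rule the policy
    # would block all other ingress to the selected pods
    - ports:
        - port: 8080
          protocol: TCP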
11. Best Practices
1. Use SDK for heap + goroutine; eBPF for CPU
eBPF gives zero-instrumentation CPU visibility across all services. SDK (pprof/Pyroscope) gives heap, allocs, goroutine, mutex profiles that eBPF cannot collect from userspace.
2. Add dynamic labels at request boundaries
Use pyroscope.TagWrapper / runtime/pprof.Do to tag CPU time by endpoint, tenant, queue. Otherwise all requests merge into an unfiltered flame graph.
3. Separate pprof port from app port
Bind pprof to :6060 (or similar). Add a NetworkPolicy that only allows the Pyroscope scraper to reach it. Never expose pprof through Ingress.
4. Enable mutex and block profiling deliberately
Call runtime.SetMutexProfileFraction(5) and SetBlockProfileRate(1) only in services where you suspect contention. These add ~1–3% overhead.
5. Compare profiles across deploys, not just time
Store the git SHA as a Pyroscope label. Use Pyroscope diff mode with version="v1.4.2" vs version="v1.4.3" to attribute CPU changes to a specific commit.
6. Profile control plane components during incidents
API server, scheduler, and etcd have built-in pprof. High API latency is often a LIST cardinality issue visible as json.Marshal in the API server heap profile.
7. Use Pyroscope diff for post-deploy regression check
Make profile comparison part of your deployment runbook: collect a 10-minute CPU profile before the deploy and after, confirm no function gained >20% CPU share.
8. Set retention to 30 days minimum
Profiles older than the incident won't help diagnose root cause. 30 days retains enough history for seasonal pattern analysis (e.g., end-of-month batch jobs).
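The post-deploy check from practice 7 can be automated once both profiles are reduced to per-function sample counts. The sketch below assumes "gained >20% CPU share" means an absolute increase of more than 20 percentage points, which is one possible reading; the function name, the threshold interpretation, and the sample counts are all illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// shareGains returns the functions whose share of total samples grew by
// more than minGain (0.20 = 20 percentage points) between two profiles.
func shareGains(before, after map[string]int, minGain float64) []string {
	bt, at := 0, 0
	for _, n := range before {
		bt += n
	}
	for _, n := range after {
		at += n
	}
	if bt == 0 || at == 0 {
		return nil // an empty profile cannot be compared
	}
	var out []string
	for fn, n := range after {
		gain := float64(n)/float64(at) - float64(before[fn])/float64(bt)
		if gain > minGain {
			out = append(out, fn)
		}
	}
	sort.Strings(out) // stable output regardless of map iteration order
	return out
}

func main() {
	before := map[string]int{"json.Marshal": 100, "db.Query": 300, "other": 600}
	after := map[string]int{"json.Marshal": 350, "db.Query": 310, "other": 340}
	fmt.Println(shareGains(before, after, 0.20))
}
```

Comparing shares rather than raw counts keeps the check meaningful when the two windows contain different numbers of samples.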
Profiling Overhead Reference
| Profiler | CPU Overhead | Memory Overhead | Notes |
|---|---|---|---|
| Go pprof CPU (pull, 100 Hz) | ~1–2% | Negligible | Only during active sample collection |
| Go pprof heap (pull) | ~0.5% | Proportional to live heap | Always active when enabled |
| Go mutex profile (fraction=5) | ~1–3% | Low | Reports 1/5 contention events |
| Pyroscope Go SDK (push, 100 Hz) | ~1–3% | ~50 MiB | Continuous with periodic upload |
| Java async-profiler (100 Hz) | ~1–2% | ~30 MiB agent | JFR format reduces overhead vs JVMTI |
| Python py-spy (100 Hz) | ~2–5% | Negligible (external) | Does not require code changes |
| eBPF CPU (Alloy/Parca) | ~0.5–1% per node | ~100 MiB agent | Covers all processes on node |
Coverage Checklist
- Continuous vs on-demand profiling comparison
- Profiling as an observability signal alongside metrics, logs, and traces
- Metric → Trace → Profile correlation flow
- Profile types reference table (CPU/heap/allocs/goroutine/mutex/block/wall/eBPF)
- Sampling vs instrumented profiler overhead comparison
- Go net/http/pprof HTTP endpoints reference
- pprof server on separate port + danger callout
- go tool pprof CLI commands (top10/web/list/traces/diff_base)
- Mutex and block profiling enablement (SetMutexProfileFraction/SetBlockProfileRate)
- Pyroscope Go SDK (Config/ProfileTypes/Tags/TagWrapper)
- Dynamic per-request labels with pyroscope.TagWrapper
- Java async-profiler attach + JFR recording commands
- Pyroscope Java javaagent config (application.properties)
- Python py-spy + Pyroscope SDK with tag_wrapper
- Node.js @pyroscope/nodejs SDK
- Rust pprof-rs HTTP endpoint via actix-web
- .NET dotnet-trace + Pyroscope env var injection
- Pyroscope architecture diagram (distributor/ingester/S3/store-gateway/querier)
- Pyroscope Helm install (distributed mode)
- pyroscope-values.yaml (S3/replication/retention/limits/per-component resources/IRSA)
- Pod annotations for pull-mode scraping (cpu/memory/goroutine)
- Grafana Pyroscope data source provisioning YAML
- FlameQL query examples (service/namespace/dynamic labels/diff)
- eBPF profiling overview and tool comparison (Alloy/Parca/Pixie/Elastic)
- Grafana Alloy eBPF DaemonSet values YAML
- SYS_ADMIN/SYS_PTRACE privilege warning
- eBPF vs SDK profiling comparison table
- kube-apiserver pprof access (cert auth curl command)
- kube-scheduler/controller-manager/kubelet/etcd pprof commands
- Control plane profiling scenarios table
- Flame graph anatomy and reading guide
- Pyroscope diff workflow (HTTP API before/after comparison)
- Trace → Profile linking (otelpyroscope middleware in Go)
- Grafana Tempo → Pyroscope data source tracesToProfiles config
- End-to-end correlation flow (alert → trace → profile → code fix)
- Pyroscope self-metrics reference
- PrometheusRule for Pyroscope health (ingester/drop/query latency)
- CPU regression alerting via metric export
- NetworkPolicy restricting pprof access to Pyroscope scraper only
- 8 best practices with cards
- Profiling overhead reference table