Grafana Dashboards for Kubernetes
Complete guide to Grafana architecture, dashboard provisioning as code, essential Kubernetes dashboard designs, variable templating, SLO panels, cross-signal navigation, and production dashboard governance.
Grafana Architecture
Grafana is the standard visualization layer for the Kubernetes observability stack. It connects to multiple data sources simultaneously and correlates signals across metrics (Prometheus/Thanos/Mimir), logs (Loki), traces (Tempo/Jaeger), and events — all in a unified interface.
Grafana in the Kubernetes Stack
Grafana Helm Install
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install grafana grafana/grafana \
--namespace monitoring \
--values grafana-values.yaml
Production grafana-values.yaml
# grafana-values.yaml
# NOTE: more than one replica needs an external database (MySQL/PostgreSQL) configured
# under grafana.ini "database"; the default SQLite file on a PVC cannot be shared.
replicas: 2
persistence:
  enabled: true
  storageClassName: gp3
  size: 10Gi
# Admin credentials from secret (not plain text)
admin:
  existingSecret: grafana-admin-secret
  userKey: admin-user
  passwordKey: admin-password
# Grafana config overrides
grafana.ini:
  server:
    root_url: https://grafana.company.com
    domain: grafana.company.com
  auth.generic_oauth:
    enabled: true
    name: SSO
    allow_sign_up: true
    client_id: grafana-client
    client_secret: $__env{OAUTH_CLIENT_SECRET}
    scopes: openid profile email groups
    auth_url: https://sso.company.com/oauth2/auth
    token_url: https://sso.company.com/oauth2/token
    api_url: https://sso.company.com/oauth2/userinfo
    role_attribute_path: "contains(groups[*], 'grafana-admin') && 'Admin' || contains(groups[*], 'grafana-editor') && 'Editor' || 'Viewer'"
  users:
    auto_assign_org_role: Viewer
    allow_sign_up: false
  feature_toggles:
    enable: correlations # enable cross-signal correlations
  analytics:
    reporting_enabled: false
  security:
    disable_gravatar: true
    cookie_secure: true
    cookie_samesite: strict
# Data source provisioning (auto-configured on startup)
# In provisioning files "$$" escapes "$", so Grafana variables such as ${__value.raw}
# are not expanded as environment variables.
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus-uid
        url: http://prometheus-operated.monitoring.svc:9090
        isDefault: true
        jsonData:
          timeInterval: 30s
          queryTimeout: 60s
          exemplarTraceIdDestinations:
            - name: traceID
              datasourceUid: tempo-uid
              urlDisplayLabel: "View trace in Tempo"
      - name: Loki
        type: loki
        uid: loki-uid
        url: http://loki-gateway.monitoring.svc
        jsonData:
          maxLines: 1000
          derivedFields:
            - name: TraceID
              matcherRegex: '"trace_id":"(\w+)"'
              url: "$${__value.raw}"
              datasourceUid: tempo-uid
              urlDisplayLabel: "View trace"
      - name: Tempo
        type: tempo
        uid: tempo-uid
        url: http://tempo-query-frontend.monitoring.svc:3100
        jsonData:
          tracesToLogsV2:
            datasourceUid: loki-uid
            spanStartTimeShift: "-1m"
            spanEndTimeShift: "1m"
            filterByTraceID: true
            customQuery: true
            query: '{cluster="prod", pod="$${__span.tags["k8s.pod.name"]}"} | json | trace_id = "$${__trace.traceId}"'
          tracesToMetrics:
            datasourceUid: prometheus-uid
            queries:
              - name: "Request Rate"
                query: 'rate(traces_spanmetrics_calls_total{service_name="$${__span.tags["service.name"]}"}[5m])'
          serviceMap:
            datasourceUid: prometheus-uid
          nodeGraph:
            enabled: true
# Dashboard provisioning (auto-load from ConfigMaps / filesystem)
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: default
        orgId: 1
        folder: Kubernetes
        type: file
        disableDeletion: true # keep the dashboard in Grafana even if its source file disappears
        updateIntervalSeconds: 30
        options:
          path: /var/lib/grafana/dashboards/default
      - name: slo
        orgId: 1
        folder: SLOs
        type: file
        disableDeletion: true
        options:
          path: /var/lib/grafana/dashboards/slo
# Pre-built dashboards from ConfigMaps (loaded by provider above)
dashboardsConfigMaps:
  default: grafana-dashboards-kubernetes
  slo: grafana-dashboards-slo
resources:
  requests: {cpu: 200m, memory: 256Mi}
  limits: {cpu: 1, memory: 512Mi}
serviceMonitor:
  enabled: true # expose /metrics for Prometheus scraping
Dashboard Provisioning as Code
Never create dashboards only through the Grafana UI. UI-created dashboards are not version-controlled, will be lost if Grafana's PVC is accidentally deleted, and cannot be reviewed or rolled back. The correct approach is to store dashboard JSON in Git and provision via ConfigMaps or Grafana Operator.
Three Approaches to Dashboard as Code
| Approach | Mechanism | Pros | Cons |
|---|---|---|---|
| ConfigMap + file provisioner | Dashboard JSON in ConfigMap, Grafana sidecar watches for changes | Simple, no CRDs needed | Raw JSON is hard to read/diff; no templating |
| GrafanaDashboard CRD (Grafana Operator) | Custom resource contains dashboard JSON; operator syncs to Grafana API | Kubernetes-native; supports namespaced ownership | Requires Grafana Operator installation |
| Grafonnet / dashboard-as-code libraries | Jsonnet/CUE/Python generates dashboard JSON; committed to Git | Full abstraction; reusable panels; type-safe | Learning curve; Jsonnet toolchain required |
ConfigMap Dashboard Provisioning
# Store dashboard JSON in a ConfigMap
# The Grafana sidecar (container grafana-sc-dashboard, image kiwigrid/k8s-sidecar) watches for labeled ConfigMaps
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-k8s-workloads
  namespace: monitoring
  labels:
    grafana_dashboard: "1" # label that the sidecar watches for
data:
  k8s-workloads.json: |
    {
      "title": "Kubernetes / Workloads",
      "uid": "k8s-workloads",
      "tags": ["kubernetes", "workloads"],
      ...dashboard JSON...
    }
# Enable sidecar in grafana Helm values
sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard
    labelValue: "1"
    folder: /var/lib/grafana/dashboards
    searchNamespace: ALL # watch all namespaces for labeled ConfigMaps
    provider:
      disableDelete: true # the chart's sidecar provider spells this key disableDelete
      allowUiUpdates: false # prevent dashboard drift
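Wrapping each exported dashboard in a labeled ConfigMap by hand gets tedious once there are more than a few. The following is a minimal Python sketch of that wrapping step, assuming the `grafana_dashboard: "1"` label and `monitoring` namespace used above; the function name `dashboard_configmap` and the `grafana-dashboard-<uid>` naming scheme are illustrative choices, not part of any Grafana tooling. It emits the manifest as JSON, which `kubectl apply -f` accepts just like YAML.

```python
import json


def dashboard_configmap(dashboard: dict, namespace: str = "monitoring") -> dict:
    """Wrap a dashboard dict in a ConfigMap manifest that the sidecar will pick up."""
    uid = dashboard["uid"]  # the stable UID doubles as ConfigMap suffix and file name
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"grafana-dashboard-{uid}",  # hypothetical naming scheme
            "namespace": namespace,
            "labels": {"grafana_dashboard": "1"},  # label the sidecar watches for
        },
        # sort_keys keeps the serialized JSON stable, so Git diffs stay small
        "data": {f"{uid}.json": json.dumps(dashboard, indent=2, sort_keys=True)},
    }


if __name__ == "__main__":
    example = {
        "title": "Kubernetes / Workloads",
        "uid": "k8s-workloads",
        "tags": ["kubernetes", "workloads"],
        "panels": [],
    }
    print(json.dumps(dashboard_configmap(example), indent=2))
```

Running the script prints a manifest that can be committed next to the dashboard source or piped into `kubectl apply -f -`.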
Grafana Operator (CRD-based)
# Install Grafana Operator (the chart is published as an OCI artifact)
helm upgrade --install grafana-operator oci://ghcr.io/grafana/helm-charts/grafana-operator \
--namespace monitoring
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: k8s-namespace-overview
  namespace: payments # team-owned dashboard in their namespace
spec:
  allowCrossNamespaceImport: true # needed when the Grafana instance runs in another namespace (e.g. monitoring)
  instanceSelector:
    matchLabels:
      dashboards: grafana # which Grafana instance to target
  folder: "Kubernetes"
  json: |
    {
      "title": "Namespace Overview - payments",
      "uid": "ns-payments",
      "tags": ["kubernetes", "namespace", "payments"],
      ...
    }
Export Dashboard from UI to JSON
# Export via Grafana HTTP API (use to bootstrap from existing dashboards)
curl -s http://admin:password@grafana.monitoring.svc:3000/api/dashboards/uid/k8s-workloads \
| jq '.dashboard' > k8s-workloads.json
# Export all dashboards, one file per UID
# (the URL is quoted so the shell does not treat "?" as a glob character)
curl -s "http://admin:password@grafana/api/search?type=dash-db" \
| jq -r '.[].uid' \
| while read -r uid; do
curl -s "http://admin:password@grafana/api/dashboards/uid/${uid}" \
| jq '.dashboard' > "${uid}.json"
done
Grafonnet & Dashboard as Code
Grafonnet is a Jsonnet library for generating Grafana dashboards programmatically. It eliminates copy-paste panel duplication, enforces consistent layouts, and enables dashboard templating with real programming constructs (functions, loops, conditionals).
Grafonnet Quickstart
# Install jsonnet and jsonnet-bundler
go install github.com/google/go-jsonnet/cmd/jsonnet@latest
go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
# Initialize a dashboard project
mkdir dashboards && cd dashboards
jb init
jb install github.com/grafana/grafonnet/gen/grafonnet-latest@main
# Generate dashboard JSON
jsonnet -J vendor my-dashboard.jsonnet > my-dashboard.json
Grafonnet Dashboard Example
// k8s-service-dashboard.jsonnet
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
local prometheusQuery(expr, legendFormat) =
g.query.prometheus.new('Prometheus', expr)
+ g.query.prometheus.withLegendFormat(legendFormat);
local requestRatePanel(service) =
g.panel.timeSeries.new('Request Rate')
+ g.panel.timeSeries.queryOptions.withTargets([
prometheusQuery(
'sum by (status_code) (rate(http_requests_total{service="%s"}[5m]))' % service,
'{{status_code}}'
),
])
+ g.panel.timeSeries.standardOptions.withUnit('reqps')
+ g.panel.timeSeries.options.withLegend({ displayMode: 'table', placement: 'bottom' })
+ g.panel.timeSeries.gridPos.withW(12)
+ g.panel.timeSeries.gridPos.withH(8);
local errorRatePanel(service) =
g.panel.timeSeries.new('Error Rate')
+ g.panel.timeSeries.queryOptions.withTargets([
prometheusQuery(
'sum(rate(http_requests_total{service="%s",status_code=~"5.."}[5m])) / sum(rate(http_requests_total{service="%s"}[5m]))' % [service, service],
'Error Rate'
),
])
+ g.panel.timeSeries.standardOptions.withUnit('percentunit')
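+ g.panel.timeSeries.standardOptions.thresholds.withSteps([
// step builders are combined with '+'; chaining them with '.' is not valid Jsonnet
g.panel.timeSeries.standardOptions.threshold.step.withColor('green')
+ g.panel.timeSeries.standardOptions.threshold.step.withValue(null),
g.panel.timeSeries.standardOptions.threshold.step.withColor('yellow')
+ g.panel.timeSeries.standardOptions.threshold.step.withValue(0.01),
g.panel.timeSeries.standardOptions.threshold.step.withColor('red')
+ g.panel.timeSeries.standardOptions.threshold.step.withValue(0.05),
])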
+ g.panel.timeSeries.gridPos.withW(12)
+ g.panel.timeSeries.gridPos.withH(8);
g.dashboard.new('Service Overview')
+ g.dashboard.withUid('service-overview')
+ g.dashboard.withTags(['kubernetes', 'service'])
+ g.dashboard.withRefresh('30s')
+ g.dashboard.time.withFrom('now-1h')
+ g.dashboard.withVariables([
g.dashboard.variable.query.new('namespace')
+ g.dashboard.variable.query.withDatasource('Prometheus')
+ g.dashboard.variable.query.queryTypes.withLabelValues('namespace', 'kube_namespace_labels')
+ g.dashboard.variable.query.selectionOptions.withMulti(true)
+ g.dashboard.variable.query.selectionOptions.withIncludeAll(true),
g.dashboard.variable.query.new('service')
+ g.dashboard.variable.query.withDatasource('Prometheus')
+ g.dashboard.variable.query.queryTypes.withLabelValues('service', 'kube_service_labels{namespace=~"$namespace"}')
+ g.dashboard.variable.query.selectionOptions.withMulti(false),
])
+ g.dashboard.withPanels([
requestRatePanel('$service'),
errorRatePanel('$service'),
])
Template Variables
Template variables make dashboards reusable across namespaces, clusters, services, and time ranges without duplicating dashboard JSON. They appear as dropdowns at the top of the dashboard and are interpolated into panel queries as $variable_name or ${variable_name}.
Variable Types
| Type | Source | Use Case |
|---|---|---|
| Query | Label values from data source (Prometheus, Loki) | cluster, namespace, pod, service — dynamic from live data |
| Custom | Static comma-separated list | env (prod,staging,dev), region (us-east-1,eu-west-1) |
| Constant | Fixed string | Cluster name, base URL — inject into all panel queries |
| Interval | Duration options | User-selectable aggregation window (e.g. 1m,5m,1h), referenced in queries as [$interval]; separate from the built-in $__rate_interval |
| Data source | List of data source instances | Multi-cluster Prometheus federation |
| Text box | Free-text user input | Pod name filter, trace ID search |
| Ad hoc filter | Label key=value pairs | Add arbitrary label filters to all queries on a dashboard |
Standard Variable Hierarchy for Kubernetes Dashboards
# Variable chain: datasource → cluster → namespace → workload → pod
# Each variable depends on the previous for correct scoping
# 1. datasource (for multi-cluster Prometheus)
type: datasource
query: prometheus
# 2. cluster
type: query
datasource: $datasource
query: label_values(kube_node_info, cluster)
multi: false
includeAll: false
# 3. namespace
type: query
datasource: $datasource
query: label_values(kube_namespace_labels{cluster="$cluster"}, namespace)
multi: true
includeAll: true
# 4. workload (deployment/statefulset)
type: query
datasource: $datasource
query: label_values(kube_deployment_labels{cluster="$cluster",namespace=~"$namespace"}, deployment)
multi: false
includeAll: true
# 5. pod (scoped to workload)
type: query
datasource: $datasource
query: label_values(kube_pod_info{cluster="$cluster",namespace=~"$namespace",created_by_name=~"$workload.*"}, pod)
multi: true
includeAll: true
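The chained queries above use `=~` for the multi-value `namespace` variable because a multi-select is expanded into a regex alternation rather than a single string. The Python sketch below models that expansion in simplified form; the helper names and the exact set of escaped characters are assumptions for illustration, not Grafana's implementation.

```python
import re

# Characters treated as regex metacharacters in this sketch (an assumption,
# not Grafana's exact escaping table).
REGEX_SPECIALS = set("\\.^$*+?()[]{}|")


def escape_regex(value: str) -> str:
    """Backslash-escape regex metacharacters in a single variable value."""
    return "".join("\\" + ch if ch in REGEX_SPECIALS else ch for ch in value)


def render_matcher(label: str, selected: list[str]) -> str:
    """Render label=~"..." roughly the way a multi-value variable is expanded."""
    if not selected:
        raise ValueError("at least one value must be selected")
    escaped = [escape_regex(v) for v in selected]
    pattern = escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"
    return f'{label}=~"{pattern}"'


if __name__ == "__main__":
    print(render_matcher("namespace", ["payments", "checkout"]))
    # The joined pattern only works with a regex matcher: compared as a
    # literal string it equals neither of the selected namespaces.
    pattern = "(payments|checkout)"
    print(re.fullmatch(pattern, "payments") is not None, pattern == "payments")
```

With an equality matcher (`namespace="$namespace"`) the same expansion would be compared literally and match no series, which is why panels go empty as soon as a second value is selected.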
$__rate_interval vs $__interval
$__interval is the panel's resolution interval (the time range divided by the number of points drawn). When you zoom in, it can drop below Prometheus's scrape_interval, so the rate() window holds fewer than two samples and returns no data. $__rate_interval is computed as max($__interval + scrape interval, 4 × scrape interval), so it never falls below four scrape intervals and the window always contains enough samples. Use rate(metric[$__rate_interval]) in all time series panels instead of hardcoded windows like rate(metric[5m]).
# CORRECT: adapts to selected time range and scrape interval
rate(http_requests_total{namespace="$namespace",service="$service"}[$__rate_interval])
# INCORRECT: a fixed window ignores the panel step; zoomed out, samples between steps are skipped and spikes disappear
rate(http_requests_total{namespace="$namespace",service="$service"}[5m])
# For Loki metric queries, use [$__interval] (or [$__auto] in recent Grafana versions) in time series panels.
# [$__range] spans the whole dashboard time range and suits single-value stat panels:
rate({namespace="$namespace"} | json | level="error" [$__range])
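To make the clamping concrete, here is a small Python sketch of the documented rule `$__rate_interval = max($__interval + scrape interval, 4 × scrape interval)`. The function name is illustrative, and the 30s scrape interval matches the `timeInterval: 30s` data source setting used earlier in this guide.

```python
def rate_interval(interval_s: float, scrape_interval_s: float) -> float:
    """Return the rate window in seconds that $__rate_interval resolves to."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


if __name__ == "__main__":
    scrape = 30  # seconds, the data source's configured scrape interval
    for interval in (5, 15, 60, 300):
        window = rate_interval(interval, scrape)
        print(f"$__interval={interval}s -> $__rate_interval={window}s")
```

For small panel intervals the window stays at four scrape intervals (120s here); once `$__interval` grows past three scrape intervals, the window tracks the panel step plus one scrape interval, so no samples fall between evaluation points.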
Essential Kubernetes Dashboards
The following is the minimum set of dashboards every production Kubernetes cluster should have. For each, the key panels and PromQL queries are specified so you can build them or verify imported ones are complete.
1. Cluster Overview Dashboard
Stat panels + time series. Primary audience: on-call engineers, SRE leads.
# Cluster CPU utilization %
sum(rate(node_cpu_seconds_total{mode!="idle",cluster="$cluster"}[$__rate_interval]))
/ sum(machine_cpu_cores{cluster="$cluster"}) * 100
# Cluster memory utilization %
1 - sum(node_memory_MemAvailable_bytes{cluster="$cluster"})
/ sum(node_memory_MemTotal_bytes{cluster="$cluster"})
# Total pods by phase
sum by (phase) (kube_pod_status_phase{cluster="$cluster"})
# Non-running pods (Pending + Failed + Unknown)
sum(kube_pod_status_phase{cluster="$cluster", phase!~"Running|Succeeded"})
# CPU usage per namespace (top 10)
topk(10, sum by (namespace) (
rate(container_cpu_usage_seconds_total{cluster="$cluster",container!=""}[$__rate_interval])
))
# Memory usage per namespace (working set)
sum by (namespace) (
container_memory_working_set_bytes{cluster="$cluster",container!=""}
)
2. Namespace / Workload Overview Dashboard
Variables: cluster, namespace, workload. Shows RED metrics for selected workload.
# Deployment available vs desired replicas
kube_deployment_status_replicas_available{namespace="$namespace",deployment="$workload"}
kube_deployment_spec_replicas{namespace="$namespace",deployment="$workload"}
# Pod restart count (sorted by most restarts)
topk(10, sum by (pod, container) (
increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1h])
))
# CPU throttling % for workload pods (share of CFS periods that were throttled)
sum by (pod) (
rate(container_cpu_cfs_throttled_periods_total{namespace="$namespace",pod=~"$workload.*"}[$__rate_interval])
) / sum by (pod) (
rate(container_cpu_cfs_periods_total{namespace="$namespace",pod=~"$workload.*"}[$__rate_interval])
) * 100
# Memory usage vs limit
container_memory_working_set_bytes{namespace="$namespace",pod=~"$workload.*"}
/ on (pod, container)
kube_pod_container_resource_limits{namespace="$namespace",resource="memory",pod=~"$workload.*"}
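# Containers OOM-killed in the last hour
# (last_terminated_reason is a 0/1 gauge, so pair it with the restart counter instead of calling increase() on it)
(kube_pod_container_status_last_terminated_reason{namespace="$namespace",reason="OOMKilled"} == 1)
and on (namespace, pod, container)
(increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1h]) > 0)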
3. Node Dashboard
# Node CPU utilization (node_exporter series carry an instance label; group by node only if you relabel)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle",cluster="$cluster"}[$__rate_interval]))
# Node memory pressure
1 - node_memory_MemAvailable_bytes{cluster="$cluster"}
/ node_memory_MemTotal_bytes{cluster="$cluster"}
# Node disk I/O (read + write throughput)
rate(node_disk_read_bytes_total{cluster="$cluster",device=~"nvme.*|sd.*"}[$__rate_interval])
rate(node_disk_written_bytes_total{cluster="$cluster",device=~"nvme.*|sd.*"}[$__rate_interval])
# Node network throughput
rate(node_network_receive_bytes_total{cluster="$cluster",device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[$__rate_interval])
rate(node_network_transmit_bytes_total{cluster="$cluster",device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[$__rate_interval])
# Allocatable vs requested (resource headroom)
kube_node_status_allocatable{cluster="$cluster",resource="cpu"}
- on (node)
sum by (node) (kube_pod_container_resource_requests{cluster="$cluster",resource="cpu",node=~".*"})
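# Node pressure conditions currently true (DiskPressure, MemoryPressure, PIDPressure)
kube_node_status_condition{condition=~"DiskPressure|MemoryPressure|PIDPressure",status="true"} == 1
# Nodes that are not Ready
kube_node_status_condition{condition="Ready",status="true"} == 0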
4. Service RED Metrics Dashboard
Variables: namespace, service. Shows Rate, Errors, Duration for HTTP/gRPC services.
# Request rate (RPS)
sum by (method, route) (
rate(http_requests_total{namespace="$namespace",service="$service"}[$__rate_interval])
)
# Error rate (5xx / total)
sum(rate(http_requests_total{namespace="$namespace",service="$service",status_code=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{namespace="$namespace",service="$service"}[$__rate_interval]))
# p99 latency from the histogram (swap 0.99 for 0.50 or 0.95 to get p50 or p95)
histogram_quantile(0.99,
sum by (le, method, route) (
rate(http_request_duration_seconds_bucket{namespace="$namespace",service="$service"}[$__rate_interval])
)
)
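# Derived from Tempo traces (if metrics generator enabled):
# RED metrics from spans, so no separate metrics instrumentation is needed (the service must still emit traces).
# Metric and label names vary with the span-metrics source and version, so confirm them in Explore first.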
sum by (span_name) (rate(traces_spanmetrics_calls_total{service_name="$service"}[$__rate_interval]))
histogram_quantile(0.99, sum by (le, span_name) (rate(traces_spanmetrics_duration_milliseconds_bucket{service_name="$service"}[$__rate_interval])))
5. Kubernetes Control Plane Dashboard
# API server request rate by verb
sum by (verb, resource) (rate(apiserver_request_total{cluster="$cluster"}[$__rate_interval]))
# API server error rate (4xx + 5xx)
sum(rate(apiserver_request_total{cluster="$cluster",code=~"4..|5.."}[$__rate_interval]))
/ sum(rate(apiserver_request_total{cluster="$cluster"}[$__rate_interval]))
# API server latency p99 by resource/verb
histogram_quantile(0.99, sum by (le, verb, resource) (
rate(apiserver_request_duration_seconds_bucket{cluster="$cluster",subresource!="log"}[$__rate_interval])
))
# etcd request duration p99
histogram_quantile(0.99, sum by (le, operation) (
rate(etcd_request_duration_seconds_bucket{cluster="$cluster"}[$__rate_interval])
))
# Scheduler queue depth (pending pods)
scheduler_pending_pods{cluster="$cluster"}
# Controller manager work queue depth
workqueue_depth{cluster="$cluster",name=~"deployment|replicaset|statefulset"}
6. Persistent Volume Dashboard
# PV capacity utilization
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100
# PVCs near full (> 80%)
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.80
# PVC inode usage
(kubelet_volume_stats_inodes_used / kubelet_volume_stats_inodes) * 100
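# PV capacity by StorageClass (the storageclass label lives on kube_persistentvolume_info)
sum by (storageclass) (
kube_persistentvolume_capacity_bytes
* on (persistentvolume) group_left (storageclass) kube_persistentvolume_info
)
# Unbound PVCs (the metric is 0/1 per phase, so keep only the active phase)
kube_persistentvolumeclaim_status_phase{phase!="Bound"} == 1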
SLO / Error Budget Panels
SLO dashboards are the single most important dashboard type for production teams. They answer: "Are we within budget?", "How fast are we burning?", and "How many good minutes do we have left this month?"
SLO Dashboard Structure
SLO Panel PromQL Queries
# --- SLI: availability (successful requests / total requests) ---
# Using recording rule (recommended for performance):
# namespace_service:sli_requests:ratio_rate5m
# namespace_service:sli_requests:ratio_rate30m
# namespace_service:sli_requests:ratio_rate1h
# namespace_service:sli_requests:ratio_rate6h
# Raw SLI (30-day rolling window)
sum(rate(http_requests_total{namespace="$namespace",service="$service",code!~"5.."}[30d]))
/
sum(rate(http_requests_total{namespace="$namespace",service="$service"}[30d]))
# --- Error budget remaining (%) ---
# Budget = 1 - SLO objective (e.g., 1 - 0.999 = 0.001 = 0.1%)
# Consumed = 1 - current_SLI
# Remaining = 1 - (consumed / budget)
(
1 - (
(1 - (
sum(rate(http_requests_total{namespace="$namespace",service="$service",code!~"5.."}[30d]))
/ sum(rate(http_requests_total{namespace="$namespace",service="$service"}[30d]))
)) / (1 - 0.999) # 0.999 = SLO objective
)
) * 100
# --- Burn rate (1h window shown; repeat with [6h] for the multi-window pair) ---
# Burn rate = error rate / (1 - SLO)
# Alert thresholds: 14.4× (1h), 6× (6h), 3× (1d), 1× (3d)
(
1 - (
sum(rate(http_requests_total{service="$service",code!~"5.."}[1h]))
/ sum(rate(http_requests_total{service="$service"}[1h]))
)
) / (1 - 0.999)
# --- Time to exhaustion (hours) ---
# hours left = remaining budget fraction × window length (30d = 720h) / current burn rate
(
1 - (
1 - sum(rate(http_requests_total{service="$service",code!~"5.."}[30d]))
/ sum(rate(http_requests_total{service="$service"}[30d]))
) / (1 - 0.999) # remaining budget fraction over the 30d window
) * (30 * 24)
/ (
(
1 - sum(rate(http_requests_total{service="$service",code!~"5.."}[1h]))
/ sum(rate(http_requests_total{service="$service"}[1h]))
) / (1 - 0.999) # 1h burn rate
)
# --- SLO compliance status (green/red) ---
# Uses stat panel with threshold: green if > SLO, red if below
sum(rate(http_requests_total{service="$service",code!~"5.."}[30d]))
/ sum(rate(http_requests_total{service="$service"}[30d]))
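The arithmetic behind the budget, burn-rate, and exhaustion panels is easier to check with concrete numbers. The Python sketch below recomputes the three quantities from request counts; the function name and the sample figures are made up for illustration, and it assumes the current burn rate stays constant until the budget runs out.

```python
import math


def error_budget_report(slo: float, window_hours: float,
                        good: float, total: float,
                        recent_good: float, recent_total: float) -> dict:
    """Compute remaining budget, burn rate, and hours until the budget is spent."""
    budget = 1.0 - slo                      # allowed error ratio, e.g. 0.001
    consumed = 1.0 - good / total           # error ratio over the SLO window
    remaining = 1.0 - consumed / budget     # fraction of the budget still left
    burn_rate = (1.0 - recent_good / recent_total) / budget
    if burn_rate > 0:
        hours_left = max(remaining, 0.0) * window_hours / burn_rate
    else:
        hours_left = math.inf               # no errors right now, nothing burns
    return {"remaining": remaining, "burn_rate": burn_rate, "hours_left": hours_left}


if __name__ == "__main__":
    report = error_budget_report(
        slo=0.999, window_hours=30 * 24,
        good=999_500, total=1_000_000,      # 30-day counts (illustrative)
        recent_good=9_980, recent_total=10_000,  # last-hour counts (illustrative)
    )
    print(report)
```

In this example half the budget is already consumed and the last hour burns at twice the sustainable rate, so the remaining half of a 720-hour window is spent in roughly 180 hours.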
Grafana Stat Panel Thresholds for SLO
{
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 0.99},
{"color": "green", "value": 0.999}
]
},
"mappings": [],
"unit": "percentunit",
"decimals": 4
}
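In absolute mode, a threshold step applies from its value upward, and the first step (value null) is the base color. The short Python sketch below mirrors that lookup for the steps above; `threshold_color` is an illustrative helper, and it assumes the steps are listed in ascending order with the base step first.

```python
def threshold_color(value: float, steps: list[dict]) -> str:
    """Return the color of the highest step whose value is <= the given value."""
    color = steps[0]["color"]  # base step, its value is None
    for step in steps:
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


SLO_STEPS = [
    {"color": "red", "value": None},
    {"color": "yellow", "value": 0.99},
    {"color": "green", "value": 0.999},
]

if __name__ == "__main__":
    for sli in (0.9995, 0.995, 0.98):
        print(sli, threshold_color(sli, SLO_STEPS))
```

An SLI of exactly 0.999 lands on green because the comparison is inclusive, which matches a "meets the objective" reading of the SLO.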
Cross-Signal Navigation
The highest-value Grafana feature for production operations is navigating from a dashboard panel directly to the relevant logs, traces, or related dashboards — without manually copying query parameters. This is achieved through data links, panel links, and Grafana correlations.
Data Links (Panel-Level)
// In a time series panel showing error rate per pod
// (in current panel JSON, data links live under fieldConfig.defaults.links):
// Clicking a data point opens Loki Explore for that pod's logs
{
"fieldConfig": {
"defaults": {
"links": [
{
"title": "Logs for ${__field.labels.pod}",
"url": "/explore?orgId=1&left={\"datasource\":\"loki-uid\",\"queries\":[{\"expr\":\"{namespace=\\\"${namespace}\\\",pod=\\\"${__field.labels.pod}\\\"} | json | level=\\\"error\\\"\",\"refId\":\"A\"}],\"range\":{\"from\":\"${__from}\",\"to\":\"${__to}\"}}",
"targetBlank": true
},
{
"title": "Error traces for ${service}",
"url": "/explore?orgId=1&left={\"datasource\":\"tempo-uid\",\"queries\":[{\"queryType\":\"traceql\",\"query\":\"{resource.service.name=\\\"${service}\\\" && status = error}\",\"refId\":\"A\"}]}",
"targetBlank": true
}
]
}
}
}
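Hand-escaping the nested JSON in an Explore URL is error-prone. When a link is generated outside Grafana with concrete values (for example in an alert annotation or a runbook), the state can be serialized and percent-encoded programmatically. The Python sketch below assumes the same `left=` Explore state shape used in the data links above; the helper names are illustrative. It is not meant for data-link templates, where `${...}` variables must stay unencoded so Grafana can interpolate them.

```python
import json
from urllib.parse import parse_qs, quote, urlparse


def explore_url(datasource_uid: str, expr: str,
                time_from: str = "now-1h", time_to: str = "now") -> str:
    """Build a relative Explore URL with the query state percent-encoded."""
    state = {
        "datasource": datasource_uid,
        "queries": [{"expr": expr, "refId": "A"}],
        "range": {"from": time_from, "to": time_to},
    }
    encoded = quote(json.dumps(state, separators=(",", ":")), safe="")
    return f"/explore?orgId=1&left={encoded}"


def decode_explore_state(url: str) -> dict:
    """Round-trip helper: recover the Explore state from a generated URL."""
    return json.loads(parse_qs(urlparse(url).query)["left"][0])


if __name__ == "__main__":
    url = explore_url("loki-uid", '{namespace="payments",pod="api-0"} | json | level="error"')
    print(url)
    print(decode_explore_state(url))
```

Serializing with `json.dumps` first and encoding afterwards means quotes inside the LogQL expression are escaped exactly once.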
Panel Links (Dashboard Navigation)
// Data link on a per-namespace panel: cluster overview → namespace detail dashboard
// (${__data.fields.*} resolves only in data links; panel and dashboard links see dashboard variables only)
{
"links": [
{
"title": "Namespace Details",
"url": "/d/ns-detail?var-cluster=${cluster}&var-namespace=${__data.fields.namespace}&${__url_time_range}",
"targetBlank": false
}
]
}
Grafana Correlations (Global)
// Defined at the data source level — applies across all dashboards
// Maps a field in query results to a target exploration
{
"label": "View logs for this service",
"description": "Navigate to Loki logs filtered by service name",
"sourceUID": "prometheus-uid",
"targetUID": "loki-uid",
"config": {
"type": "query",
"field": "service",
"target": {
"expr": "{namespace=\"${__data.fields.namespace}\", app=\"${__data.fields.service}\"} | json | level=\"error\""
},
"transformations": [
{
"type": "regex",
"field": "service",
"expression": "(.*)",
"mapValue": "$1"
}
]
}
}
Exemplar Drill-Down (Metric → Trace)
# In Prometheus data source config (grafana-values.yaml):
# exemplarTraceIdDestinations maps trace ID exemplars to Tempo
# (Prometheus itself must run with --enable-feature=exemplar-storage)
datasources:
  datasources.yaml:
    datasources:
      - name: Prometheus
        type: prometheus
        jsonData:
          exemplarTraceIdDestinations:
            - name: traceID # label name on the exemplar
              datasourceUid: tempo-uid
              urlDisplayLabel: "View in Tempo"
            - name: TraceID # some apps use different casing
              datasourceUid: tempo-uid
# In Grafana: enable exemplars on the time series panel's query:
# Query editor → Options → Exemplars (toggle)
# Exemplar dots appear on the graph; click one to open the trace in Tempo
Community Dashboards
Grafana hosts a public dashboard library at grafana.com/grafana/dashboards. The table lists widely used dashboards: those marked Yes ship with kube-prometheus-stack and load through the Helm chart's built-in provisioning; those marked Import are imported by dashboard ID.
| Dashboard | Grafana ID | Description | Included in kps |
|---|---|---|---|
| Kubernetes / Cluster | 7249 | Cluster-wide CPU/memory/pods overview | Yes |
| Kubernetes / Nodes | 11074 | Node CPU, memory, disk, network per node | Yes |
| Kubernetes / Workloads | 12122 | Deployment/StatefulSet RED metrics | Yes |
| Kubernetes / Pods | 6417 | Per-pod CPU, memory, restarts | Yes |
| Kubernetes / Namespaces | 7249 | Namespace resource usage | Yes |
| Kubernetes / API Server | 12006 | kube-apiserver request rate, latency, errors | Yes |
| Kubernetes / PVs | 13646 | Persistent volume utilization and inode usage | Yes |
| Node Exporter Full | 1860 | Comprehensive node metrics (system, disk, network) | Import |
| Alertmanager | 9578 | Alert routing, inhibition, silence status | Yes |
| Prometheus / Overview | 3662 | Prometheus self-metrics (scrape, TSDB, rules) | Yes |
| Loki / Chunks | 14055 | Loki ingestion and storage metrics | Import |
| Tempo / Overview | 16698 | Tempo ingestion, query, storage | Import |
| Fluent Bit | 7752 | Fluent Bit input/output/filter metrics | Import |
| Cert-Manager | 11001 | Certificate expiry and renewal status | Import |
| NGINX Ingress | 9614 | Ingress controller request rate, latency, errors | Import |
Importing via Helm (kube-prometheus-stack)
# kube-prometheus-stack already includes most of the above
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--set grafana.defaultDashboardsEnabled=true \
--set grafana.defaultDashboardsTimezone=UTC \
--namespace monitoring
# Import community dashboards not in kps:
kubectl create configmap grafana-dashboard-node-exporter \
--from-file=node-exporter-full.json=./1860_rev37.json \
-n monitoring
kubectl label configmap grafana-dashboard-node-exporter grafana_dashboard=1 -n monitoring
Dashboard Governance
Dashboard Organization
Folder Structure
Organize dashboards by audience and scope: Kubernetes (cluster-level infra), Platform (shared platform services), Services (per-team application dashboards), SLOs (error budgets), Oncall (triage dashboards for on-call).
Naming Conventions
Use consistent naming: [Scope] / [Subject] — e.g., Kubernetes / Nodes, Payments / Order Service, SLO / Checkout API. Use stable UIDs (e.g., k8s-nodes, svc-order-api) for reliable linking.
Version Control
All dashboard JSON stored in Git. No manual UI edits in production — use a review process for all dashboard changes. CI validates JSON syntax. Deployed via GitOps (ArgoCD/Flux) using ConfigMap or GrafanaDashboard CRD.
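A CI validation step can go slightly beyond syntax at little cost. The Python sketch below checks that each file parses, that `uid` and `title` are present, and that no two files share a UID. The function name and the choice of checks are assumptions, not an established tool; it takes file contents as a mapping so it is easy to test, and a CI job would populate that mapping from the repository's dashboard directory.

```python
import json


def validate_dashboards(files: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems found in dashboard JSON files."""
    errors: list[str] = []
    seen_uids: dict[str, str] = {}  # uid -> first file that used it
    for name, text in sorted(files.items()):
        try:
            dashboard = json.loads(text)
        except json.JSONDecodeError as exc:
            errors.append(f"{name}: invalid JSON at line {exc.lineno}")
            continue
        if not isinstance(dashboard, dict):
            errors.append(f"{name}: top level must be a JSON object")
            continue
        for key in ("uid", "title"):
            if not dashboard.get(key):
                errors.append(f"{name}: missing '{key}'")
        uid = dashboard.get("uid")
        if uid:
            if uid in seen_uids:
                errors.append(f"{name}: uid '{uid}' already used by {seen_uids[uid]}")
            else:
                seen_uids[uid] = name
    return errors


if __name__ == "__main__":
    sample = {
        "nodes.json": '{"uid": "k8s-nodes", "title": "Kubernetes / Nodes"}',
        "broken.json": "{not json",
    }
    for problem in validate_dashboards(sample):
        print(problem)
```

A pipeline would fail the build whenever the returned list is non-empty.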
Access Control
Use Grafana Teams + RBAC: Viewer role for most engineers (read/explore only), Editor for platform team (can modify dashboards), Admin for Grafana operators only. SSO via OIDC/SAML maps group membership to Grafana roles automatically.
Dashboard Review Checklist
- Variables defined. Every dashboard has at minimum: cluster, namespace. All panel queries use $cluster and $namespace variables to avoid hardcoding.
- $__rate_interval used. All rate(), irate(), and increase() calls use [$__rate_interval], not hardcoded windows.
- Units configured. Every panel has the correct unit set: bytes for memory (not "short"), reqps for request rate, seconds or milliseconds for latency, percentunit for ratios.
- Thresholds set. Stat and gauge panels have color thresholds aligned with SLOs or alert thresholds (green/yellow/red at the same boundaries as PrometheusRule alerts).
- Dashboard UID set. Custom UIDs prevent collisions and make cross-dashboard linking reliable. Auto-generated UIDs are not stable across imports.
- No hardcoded namespace/cluster. Dashboards should be reusable across all namespaces and clusters via variables, not tied to one environment.
- Data links configured. Error rate / latency panels should link to Loki logs and Tempo traces for the selected service/pod.
- Refresh interval appropriate. Operational dashboards: 30s or 1m. Historical analysis dashboards: off (manual refresh). SLO dashboards: 5m.
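Parts of this checklist can be automated. As one example, the Python sketch below flags panel queries that call `rate()` or `irate()` with a hardcoded window. The regular expression is a deliberately simple heuristic (it does not handle parentheses inside label values), the helper names are illustrative, and it assumes the usual dashboard JSON layout of `panels[].targets[].expr`, including panels nested inside collapsed rows.

```python
import re

# rate( or irate( followed by a selector and a literal duration such as [5m]
HARDCODED_WINDOW = re.compile(r"\b(?:rate|irate)\s*\([^()]*\[\d+[smhdw]\]")


def iter_panels(panels: list[dict]):
    """Yield every panel, descending into rows that nest their own panels."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def hardcoded_rate_windows(dashboard: dict) -> list[tuple[str, str]]:
    """Return (panel title, expression) pairs that use a fixed rate window."""
    findings = []
    for panel in iter_panels(dashboard.get("panels", [])):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if HARDCODED_WINDOW.search(expr):
                findings.append((panel.get("title", "<untitled>"), expr))
    return findings


if __name__ == "__main__":
    dashboard = {"panels": [
        {"title": "Request Rate",
         "targets": [{"expr": 'sum(rate(http_requests_total{service="$service"}[5m]))'}]},
        {"title": "Error Rate",
         "targets": [{"expr": 'sum(rate(http_requests_total{service="$service"}[$__rate_interval]))'}]},
    ]}
    print(hardcoded_rate_windows(dashboard))
```

The check leaves `increase()` alone on purpose, since fixed windows such as `[1h]` are often intentional there.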
Detecting Dashboard Drift
# Detect dashboards modified in UI but not in Git (drift detection)
# Export current state from Grafana API
curl -s "http://grafana.monitoring.svc:3000/api/search?type=dash-db&limit=500" \
| jq -r '.[].uid' | while read uid; do
curl -s "http://grafana.monitoring.svc:3000/api/dashboards/uid/$uid" \
| jq '{uid: .dashboard.uid, version: .dashboard.version, title: .dashboard.title}'
done > current-state.json
# Compare against Git-tracked versions (CI check)
# Any dashboard with version > what was last provisioned has been modified in UI
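The comparison step can be sketched in Python. Rather than relying on version numbers alone, this variant compares dashboard content after dropping fields that Grafana rewrites on save. The helper names and the list of volatile keys are assumptions; both inputs are mappings from UID to dashboard JSON, one built from the API export and one from the Git checkout.

```python
# Fields Grafana changes on every save; ignored when comparing content
# (an assumed list, extend it for your Grafana version if needed).
VOLATILE_KEYS = ("id", "version", "iteration")


def normalize(dashboard: dict) -> dict:
    """Drop volatile top-level fields so only real content is compared."""
    return {k: v for k, v in dashboard.items() if k not in VOLATILE_KEYS}


def find_drift(live: dict[str, dict], tracked: dict[str, dict]) -> dict[str, list[str]]:
    """Classify dashboards by UID as modified, untracked, or missing."""
    shared = live.keys() & tracked.keys()
    return {
        "modified": sorted(u for u in shared if normalize(live[u]) != normalize(tracked[u])),
        "untracked": sorted(live.keys() - tracked.keys()),  # exists only in Grafana
        "missing": sorted(tracked.keys() - live.keys()),    # in Git, not provisioned
    }


if __name__ == "__main__":
    live = {
        "k8s-nodes": {"uid": "k8s-nodes", "title": "Nodes", "version": 7, "id": 12},
        "svc": {"uid": "svc", "title": "Service (edited)", "version": 3, "id": 13},
    }
    tracked = {
        "k8s-nodes": {"uid": "k8s-nodes", "title": "Nodes"},
        "svc": {"uid": "svc", "title": "Service"},
    }
    print(find_drift(live, tracked))
```

A CI job could fail when any of the three lists is non-empty, or only on `modified` and `untracked` if provisioning lag is expected.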
Metrics, Alerts & Runbooks
Grafana Self-Metrics
| Metric | Alert Threshold | Meaning |
|---|---|---|
| grafana_http_request_duration_seconds | p99 > 5s | Grafana API latency (slow dashboards) |
| grafana_datasource_request_duration_seconds | p99 > 30s | Data source query latency (Prometheus/Loki) |
| grafana_datasource_request_total{status_code=~"5.."} | >0 | Data source query errors |
| grafana_stat_totals_dashboard | Unexpected decrease | Dashboard count (detects accidental deletion) |
| grafana_alerting_result_total | n/a | Alert evaluation results (active/normal/error) |
| up{job="grafana"} | == 0 | Grafana process down |
Alert Rules
groups:
  - name: grafana-health
    rules:
      - alert: GrafanaDown
        expr: absent(up{job="grafana"}) or up{job="grafana"} == 0
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "Grafana is down: dashboards and alerts unavailable"
      - alert: GrafanaDatasourceErrors
        expr: rate(grafana_datasource_request_total{status_code=~"5.."}[5m]) > 0
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "Grafana data source {{ $labels.datasource }} returning errors"
          description: "Dashboard queries to {{ $labels.datasource }} are failing. Check data source connectivity."
      - alert: GrafanaDashboardDeleted
        expr: grafana_stat_totals_dashboard < (grafana_stat_totals_dashboard offset 1h) * 0.9
        labels: {severity: warning}
        annotations:
          summary: "Grafana dashboard count dropped by >10% (possible accidental deletion)"
Runbooks
Dashboard Shows "No Data"
- Check time range — may be zoomed into a period with no data
- Verify data source connection: Data Sources → Test
- Open Explore and run the panel query manually
- Check variable values: an empty variable matches nothing if the query uses {namespace="$namespace"}
- Verify metric name exists: curl prometheus:9090/api/v1/label/__name__/values | grep metric_name
Dashboard Loading Slowly
- Reduce time range (shorter ranges = less data to scan)
- Check panel count — reduce to <20 panels per dashboard
- Check Prometheus query duration: Explore → run panel query → check execution time
- Add recording rules for expensive PromQL expressions
- Enable query caching in Grafana (a Grafana Enterprise / Grafana Cloud feature), or use the query-frontend results cache in Mimir
- Check if Prometheus is overloaded: p99 of prometheus_engine_query_duration_seconds
Dashboard Drift (UI vs Git)
- Set disableDeletion: true and allowUiUpdates: false in provisioner config
- Export current dashboard JSON via API and diff against Git
- Revert to Git version by re-applying ConfigMap / triggering GitOps sync
- Add CI drift check job to detect version discrepancies
Data Source Authentication Failure
- Grafana → Configuration → Data Sources → select source → Test
- Check if ServiceAccount token is expired (rotate if using static tokens)
- For Prometheus: verify NetworkPolicy allows grafana → prometheus port 9090
- For Loki: check Loki gateway auth configuration
- Restart Grafana pod after updating secrets: kubectl rollout restart deploy/grafana -n monitoring
Best Practices
- All dashboards in Git, zero manual UI edits in production. Enable disableDeletion: true and allowUiUpdates: false on the provisioner. Use a staging Grafana instance for dashboard development before promoting to production.
- Use $__rate_interval for all rate-based queries. A hardcoded window is wrong at both extremes: zoomed in, a window shorter than about two scrape intervals holds too few samples and rate() returns nothing; zoomed out to a multi-day range, the panel step exceeds the window, so samples between steps are skipped and spikes disappear.
- Build a hierarchy: cluster → namespace → workload → pod. Navigation between dashboard levels via data links reduces mean time to investigate by eliminating manual query reconstruction. A team-level dashboard should always link to the cluster overview and the pod-level details.
- Align dashboard thresholds with alert thresholds. If your alert fires at error rate > 1%, the stat panel on the dashboard should turn yellow at 0.5% and red at 1%. Misaligned thresholds confuse on-call engineers who see green dashboards while alerts are firing.
- Include an SLO/error budget panel on every service dashboard. Application RED metrics (rate, error, duration) are necessary but not sufficient. Teams need to see whether they are within their error budget to make correct prioritization decisions.
- Configure exemplar drill-down on latency histograms. A p99 latency spike is much faster to investigate if you can click the spike and jump directly to a representative slow trace in Tempo without switching tools and manually searching.
- Avoid dashboards wider than the screen. Panels that require horizontal scrolling are rarely read. Grafana's grid is 24 columns wide; keep panels to spans of 6–12 columns so that two to four fit per row.
- Provision one "golden signals" dashboard per team. Teams should own and maintain a single dashboard showing their service's four golden signals (latency, traffic, errors, saturation) plus SLO status. This becomes the starting point for all incident investigations.
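The $__rate_interval recommendation above can be made concrete with a little arithmetic. The sketch below reproduces Grafana's documented rule, max($__interval + scrape interval, 4 × scrape interval), and compares it with a hardcoded [1m] window. The function names are hypothetical, the 15s scrape interval is an assumed default, and $__interval is approximated by the query step.

```python
def rate_interval_seconds(interval_s: float, scrape_interval_s: float = 15.0) -> float:
    """Grafana's documented rule: max($__interval + scrape, 4 * scrape)."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


def window_coverage(window_s: float, step_s: float) -> float:
    """Fraction of the time axis a rate window of window_s seconds reads
    when the query is evaluated once every step_s seconds."""
    return min(1.0, window_s / step_s)


if __name__ == "__main__":
    scrape = 15.0  # assumed scrape interval
    for step in (15.0, 60.0, 600.0, 3600.0):
        fixed = window_coverage(60.0, step)  # hardcoded rate(...[1m])
        adaptive = window_coverage(rate_interval_seconds(step, scrape), step)
        print(f"step={step:>6.0f}s  [1m] covers {fixed:.0%}  "
              f"$__rate_interval covers {adaptive:.0%}")
```

At a 10-minute step, a fixed one-minute window reads only a tenth of the samples, so short spikes between evaluation points can vanish from the graph, whereas the adaptive window always spans the full step.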
Coverage Details
- Grafana architecture: data sources (Prometheus/Loki/Tempo/Jaeger/CloudWatch), features (dashboards/alerting/explore/correlations)
- Grafana Helm install with production grafana-values.yaml
- Production grafana.ini config: root_url, OAuth/OIDC SSO with role mapping from groups, feature_toggles correlations, security (cookie_secure/samesite)
- Data source provisioning in YAML: Prometheus (exemplarTraceIdDestinations→Tempo), Loki (derivedFields regex→Tempo), Tempo (tracesToLogsV2/tracesToMetrics/serviceMap/nodeGraph)
- Dashboard provider config: file provider, disableDeletion, allowUiUpdates, folder, updateIntervalSeconds
- dashboardsConfigMaps mapping to provisioner folders
- Three dashboard-as-code approaches: ConfigMap+file provisioner, GrafanaDashboard CRD (Grafana Operator), Grafonnet/Jsonnet
- ConfigMap provisioning: grafana_dashboard: "1" label, sidecar config (searchNamespace: ALL)
- Grafana Operator: GrafanaDashboard CRD with instanceSelector and folder
- Export dashboard JSON via Grafana HTTP API (single UID + bulk all dashboards)
- Grafonnet install: go-jsonnet + jb + jsonnet-bundler initialization
- Grafonnet dashboard example: prometheusQuery function, requestRatePanel + errorRatePanel functions, dashboard with query variables (namespace/service chained), panels array
- Template variable types: Query / Custom / Constant / Interval / Data source / Text box / Ad hoc filter
- Standard variable hierarchy: datasource → cluster → namespace → workload → pod with scoped label_values() queries
- $__rate_interval vs $__interval explanation and callout; Loki $__range equivalent
- Cluster Overview dashboard: stat panels (CPU%, memory%, pod count, non-running) + namespace time series
- Cluster overview PromQL: node CPU/memory utilization, pod phase sum, non-running pods, namespace topk
- Namespace/Workload dashboard: deployment replicas available/desired, pod restart topk, CPU throttling %, memory vs limit ratio, OOM kill count
- Node dashboard: CPU/memory per node, disk I/O (read+write), network throughput, allocatable vs requested headroom, node condition checks
- Service RED dashboard: request rate by method/route, error rate 5xx/total, p99 latency histogram_quantile, derived from Tempo metrics generator
- Control plane dashboard: API server request rate by verb/resource, error rate, p99 latency by resource/verb, etcd request duration, scheduler queue, controller workqueue depth
- PV dashboard: capacity utilization (used/capacity), near-full alert (>80%), inode usage, by StorageClass, unbound PVCs
- SLO dashboard mockup: current SLI, error budget remaining %, burn rate ×, time to exhaustion
- SLO PromQL: 30d rolling SLI, error budget remaining formula, burn rate (1h window / 1-SLO), time to exhaustion in hours
- Stat panel threshold JSON for SLO compliance (red/yellow/green at objective boundary)
- Data links JSON: click series → Loki Explore with time range; click → Tempo trace search
- Panel links JSON: cluster overview → namespace detail with var interpolation + keepTime
- Grafana correlations JSON config: field mapping, target query template, transformations
- Exemplar drill-down: Prometheus datasource exemplarTraceIdDestinations config, panel exemplars toggle
- Community dashboard table: 15 dashboards with IDs, descriptions, kube-prometheus-stack (kps) inclusion status
- Importing non-kps dashboards via labeled ConfigMap
- Dashboard governance: folder structure (Kubernetes/Platform/Services/SLOs/Oncall)
- Naming conventions: [Scope] / [Subject] pattern, stable UIDs
- Version control: no UI edits in production, CI JSON validation, GitOps deployment
- Access control: Grafana Teams + Viewer/Editor/Admin roles, SSO group→role mapping
- 8-point dashboard review checklist: variables, $__rate_interval, units, thresholds, UID, no hardcoding, data links, refresh interval
- Dashboard drift detection via Grafana API version comparison script
- 6 Grafana self-metrics with thresholds
- 3 PrometheusRule alerts: GrafanaDown, DatasourceErrors, DashboardDeleted
- 4 runbooks: No Data, slow loading, drift, auth failure
- 8 best practices: Git-only, $__rate_interval, hierarchy with data links, aligned thresholds, SLO panel, exemplar drill-down, panel width discipline, golden signals dashboard per team