Dashboards — Kubernetes Observability

Grafana Dashboards for Kubernetes

Complete guide to Grafana architecture, dashboard provisioning as code, essential Kubernetes dashboard designs, variable templating, SLO panels, cross-signal navigation, and production dashboard governance.

Grafana Architecture

Grafana is the standard visualization layer for the Kubernetes observability stack. It connects to multiple data sources simultaneously and correlates signals across metrics (Prometheus/Thanos/Mimir), logs (Loki), traces (Tempo/Jaeger), and events — all in a unified interface.

Grafana in the Kubernetes Stack

┌──────────────────────────────────────────────────────────────────┐
│                             Grafana                              │
│                                                                  │
│  Data Sources:                                                   │
│  ┌──────────┐  ┌──────┐  ┌───────┐  ┌─────────┐  ┌──────────┐    │
│  │Prometheus│  │ Loki │  │ Tempo │  │ Jaeger  │  │CloudWatch│    │
│  │ /Thanos  │  │      │  │       │  │         │  │ /Datadog │    │
│  └──────────┘  └──────┘  └───────┘  └─────────┘  └──────────┘    │
│                                                                  │
│  Features:                                                       │
│  • Dashboards (panels + variables + annotations)                 │
│  • Alerting (unified alert rules + contact points)               │
│  • Explore (ad-hoc query across any data source)                 │
│  • Correlations (link metric → trace → log)                      │
│  • Playlists, Snapshots, Reporting                               │
└──────────────────────────────────────────────────────────────────┘

Grafana Helm Install

helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install grafana grafana/grafana \
  --namespace monitoring \
  --values grafana-values.yaml

Production grafana-values.yaml

# grafana-values.yaml
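replicas: 2   # more than one replica requires a shared external database (PostgreSQL/MySQL) instead of the default SQLite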

persistence:
  enabled: true
  storageClassName: gp3
  size: 10Gi

# Admin credentials from secret (not plain text)
admin:
  existingSecret: grafana-admin-secret
  userKey: admin-user
  passwordKey: admin-password

# Grafana config overrides
grafana.ini:
  server:
    root_url: https://grafana.company.com
    domain: grafana.company.com
  auth.generic_oauth:
    enabled: true
    name: SSO
    allow_sign_up: true
    client_id: grafana-client
    client_secret: $__env{OAUTH_CLIENT_SECRET}
    scopes: openid profile email groups
    auth_url: https://sso.company.com/oauth2/auth
    token_url: https://sso.company.com/oauth2/token
    api_url: https://sso.company.com/oauth2/userinfo
    role_attribute_path: "contains(groups[*], 'grafana-admin') && 'Admin' || contains(groups[*], 'grafana-editor') && 'Editor' || 'Viewer'"
  users:
    auto_assign_org_role: Viewer
    allow_sign_up: false
  feature_toggles:
    enable: correlations    # enable cross-signal correlations
  analytics:
    reporting_enabled: false
  security:
    disable_gravatar: true
    cookie_secure: true
    cookie_samesite: strict

# Data source provisioning (auto-configured on startup)
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus-uid
        url: http://prometheus-operated.monitoring.svc:9090
        isDefault: true
        jsonData:
          timeInterval: 30s
          queryTimeout: 60s
          exemplarTraceIdDestinations:
            - name: traceID
              datasourceUid: tempo-uid
              urlDisplayLabel: "View trace in Tempo"

      - name: Loki
        type: loki
        uid: loki-uid
        url: http://loki-gateway.monitoring.svc
        jsonData:
          maxLines: 1000
          derivedFields:
            - name: TraceID
              matcherRegex: '"trace_id":"(\w+)"'
              url: "$${__value.raw}"   # $$ escapes env-var expansion in provisioning files
              datasourceUid: tempo-uid
              urlDisplayLabel: "View trace"

      - name: Tempo
        type: tempo
        uid: tempo-uid
        url: http://tempo-query-frontend.monitoring.svc:3100
        jsonData:
          tracesToLogsV2:
            datasourceUid: loki-uid
            spanStartTimeShift: "-1m"
            spanEndTimeShift: "1m"
            filterByTraceID: true
            customQuery: true
            # $$ escapes env-var expansion in provisioning files
            query: '{cluster="prod", pod="$${__span.tags["k8s.pod.name"]}"} | json | trace_id = "$${__trace.traceId}"'
          tracesToMetrics:
            datasourceUid: prometheus-uid
            queries:
              - name: "Request Rate"
                query: 'rate(traces_spanmetrics_calls_total{service_name="$${__span.tags["service.name"]}"}[5m])'
          serviceMap:
            datasourceUid: prometheus-uid
          nodeGraph:
            enabled: true

# Dashboard provisioning (auto-load from ConfigMaps / filesystem)
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: default
        orgId: 1
        folder: Kubernetes
        type: file
        disableDeletion: true      # keep dashboards in Grafana even if their source file is removed
        updateIntervalSeconds: 30
        options:
          path: /var/lib/grafana/dashboards/default
      - name: slo
        orgId: 1
        folder: SLOs
        type: file
        disableDeletion: true
        options:
          path: /var/lib/grafana/dashboards/slo

# Pre-built dashboards from ConfigMaps (loaded by provider above)
dashboardsConfigMaps:
  default: grafana-dashboards-kubernetes
  slo: grafana-dashboards-slo

resources:
  requests: {cpu: 200m, memory: 256Mi}
  limits: {cpu: 1, memory: 512Mi}

serviceMonitor:
  enabled: true   # expose /metrics for Prometheus scraping
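
The role_attribute_path value in the OAuth block above is a JMESPath expression evaluated against the userinfo payload. As a rough illustration of the precedence it encodes (the first matching group wins, with Viewer as the fallback), here is a plain-Python sketch. resolve_role is a hypothetical helper, not part of Grafana, and it only mirrors the logic for the two group names used above.

```python
def resolve_role(groups):
    """Mirror the precedence of the role_attribute_path expression (illustrative only)."""
    if "grafana-admin" in groups:
        return "Admin"
    if "grafana-editor" in groups:
        return "Editor"
    return "Viewer"  # fallback when no known group matches


if __name__ == "__main__":
    for groups in (["grafana-admin", "grafana-editor"], ["grafana-editor"], ["dev"]):
        print(groups, "->", resolve_role(groups))
```

A user in both groups resolves to Admin because the admin check comes first, which matches the left-to-right evaluation of the && / || chain.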

Dashboard Provisioning as Code

Never create dashboards only through the Grafana UI. UI-created dashboards are not version-controlled, will be lost if Grafana's PVC is accidentally deleted, and cannot be reviewed or rolled back. The correct approach is to store dashboard JSON in Git and provision via ConfigMaps or Grafana Operator.

Three Approaches to Dashboard as Code

Approach | Mechanism | Pros | Cons
ConfigMap + file provisioner | Dashboard JSON in ConfigMap, Grafana sidecar watches for changes | Simple, no CRDs needed | Raw JSON is hard to read/diff; no templating
GrafanaDashboard CRD (Grafana Operator) | Custom resource contains dashboard JSON; operator syncs to Grafana API | Kubernetes-native; supports namespaced ownership | Requires Grafana Operator installation
Grafonnet / dashboard-as-code libraries | Jsonnet/CUE/Python generates dashboard JSON; committed to Git | Full abstraction; reusable panels; type-safe | Learning curve; Jsonnet toolchain required

ConfigMap Dashboard Provisioning

# Store dashboard JSON in a ConfigMap
# The Grafana sidecar (container grafana-sc-dashboard, image kiwigrid/k8s-sidecar) watches for labeled ConfigMaps
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-k8s-workloads
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # label that the sidecar watches for
data:
  k8s-workloads.json: |
    {
      "title": "Kubernetes / Workloads",
      "uid": "k8s-workloads",
      "tags": ["kubernetes", "workloads"],
      ...dashboard JSON...
    }

# Enable the sidecar in the Grafana Helm values (grafana-values.yaml)
sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard
    labelValue: "1"
    folder: /var/lib/grafana/dashboards
    searchNamespace: ALL   # watch all namespaces for labeled ConfigMaps
    provider:
      disableDelete: true     # the chart's sidecar provider key is disableDelete
      allowUiUpdates: false   # prevent dashboard drift
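
To keep the ConfigMap in step with the dashboard JSON in Git, the wrapping can be generated rather than hand-written. The following is a minimal sketch, assuming the sidecar label shown above; build_dashboard_configmap is a hypothetical helper name, and the output is JSON (which is also valid YAML) so no third-party YAML library is needed.

```python
import json


def build_dashboard_configmap(name, namespace, filename, dashboard):
    """Wrap a dashboard dict in a ConfigMap manifest carrying the sidecar label."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"grafana_dashboard": "1"},  # label the sidecar watches for
        },
        # Sorted keys keep the embedded JSON stable between runs, which helps diffs.
        "data": {filename: json.dumps(dashboard, indent=2, sort_keys=True)},
    }


if __name__ == "__main__":
    dashboard = {"title": "Kubernetes / Workloads", "uid": "k8s-workloads", "panels": []}
    manifest = build_dashboard_configmap(
        "grafana-dashboard-k8s-workloads", "monitoring", "k8s-workloads.json", dashboard
    )
    # The printed manifest can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```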

Grafana Operator (CRD-based)

# Install Grafana Operator (the chart is published as an OCI artifact)
helm upgrade --install grafana-operator \
  oci://ghcr.io/grafana/helm-charts/grafana-operator \
  --namespace monitoring

# Team-owned dashboard as a GrafanaDashboard custom resource
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: k8s-namespace-overview
  namespace: payments    # team-owned dashboard in their namespace
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana    # which Grafana instance to target
  folder: "Kubernetes"
  json: |
    {
      "title": "Namespace Overview — payments",
      "uid": "ns-payments",
      "tags": ["kubernetes", "namespace", "payments"],
      ...
    }

Export Dashboard from UI to JSON

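# Export via Grafana HTTP API (use to bootstrap from existing dashboards)
# GRAFANA_USER / GRAFANA_PASSWORD are assumed to be set in the environment,
# which keeps credentials out of the URL and shell history
GRAFANA_URL="http://grafana.monitoring.svc:3000"
curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" \
  "$GRAFANA_URL/api/dashboards/uid/k8s-workloads" \
  | jq '.dashboard' > k8s-workloads.json

# Export all dashboards, one file per UID
curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" "$GRAFANA_URL/api/search?type=dash-db" \
  | jq -r '.[].uid' \
  | while read -r uid; do
      curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" "$GRAFANA_URL/api/dashboards/uid/$uid" \
        | jq '.dashboard' > "$uid.json"
    done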
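
Exported JSON carries the numeric id assigned by the Grafana database it came from, which is not meaningful on another instance. A small normalization step before committing is one way to handle that; the sketch below nulls the id and serializes with a stable key order so Git diffs stay readable. normalize_dashboard is a hypothetical helper, and which fields to reset is a team convention rather than a Grafana requirement.

```python
import json


def normalize_dashboard(dashboard):
    """Return a Git-friendly JSON string for an exported dashboard dict."""
    cleaned = dict(dashboard)   # shallow copy; the input is left untouched
    cleaned["id"] = None        # instance-specific database id
    return json.dumps(cleaned, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    exported = {"id": 42, "uid": "k8s-workloads", "title": "Kubernetes / Workloads"}
    print(normalize_dashboard(exported), end="")
```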

Grafonnet & Dashboard as Code

Grafonnet is a Jsonnet library for generating Grafana dashboards programmatically. It eliminates copy-paste panel duplication, enforces consistent layouts, and enables dashboard templating with real programming constructs (functions, loops, conditionals).

Grafonnet Quickstart

# Install jsonnet and jsonnet-bundler
go install github.com/google/go-jsonnet/cmd/jsonnet@latest
go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest

# Initialize a dashboard project
mkdir dashboards && cd dashboards
jb init
jb install github.com/grafana/grafonnet/gen/grafonnet-latest@main

# Generate dashboard JSON
jsonnet -J vendor my-dashboard.jsonnet > my-dashboard.json

Grafonnet Dashboard Example

// k8s-service-dashboard.jsonnet
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

local prometheusQuery(expr, legendFormat) =
  g.query.prometheus.new('Prometheus', expr)
  + g.query.prometheus.withLegendFormat(legendFormat);

local requestRatePanel(service) =
  g.panel.timeSeries.new('Request Rate')
  + g.panel.timeSeries.queryOptions.withTargets([
      prometheusQuery(
        'sum by (status_code) (rate(http_requests_total{service="%s"}[5m]))' % service,
        '{{status_code}}'
      ),
    ])
  + g.panel.timeSeries.standardOptions.withUnit('reqps')
  + g.panel.timeSeries.options.withLegend({ displayMode: 'table', placement: 'bottom' })
  + g.panel.timeSeries.gridPos.withW(12)
  + g.panel.timeSeries.gridPos.withH(8);

local errorRatePanel(service) =
  g.panel.timeSeries.new('Error Rate')
  + g.panel.timeSeries.queryOptions.withTargets([
      prometheusQuery(
        'sum(rate(http_requests_total{service="%s",status_code=~"5.."}[5m])) / sum(rate(http_requests_total{service="%s"}[5m]))' % [service, service],
        'Error Rate'
      ),
    ])
  + g.panel.timeSeries.standardOptions.withUnit('percentunit')
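  + g.panel.timeSeries.standardOptions.thresholds.withSteps([
      // Grafonnet builders return plain objects, so fields are combined with +
      g.panel.timeSeries.standardOptions.threshold.step.withColor('green')
      + g.panel.timeSeries.standardOptions.threshold.step.withValue(null),
      g.panel.timeSeries.standardOptions.threshold.step.withColor('yellow')
      + g.panel.timeSeries.standardOptions.threshold.step.withValue(0.01),
      g.panel.timeSeries.standardOptions.threshold.step.withColor('red')
      + g.panel.timeSeries.standardOptions.threshold.step.withValue(0.05),
    ])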
  + g.panel.timeSeries.gridPos.withW(12)
  + g.panel.timeSeries.gridPos.withH(8);

g.dashboard.new('Service Overview')
+ g.dashboard.withUid('service-overview')
+ g.dashboard.withTags(['kubernetes', 'service'])
+ g.dashboard.withRefresh('30s')
+ g.dashboard.time.withFrom('now-1h')
+ g.dashboard.withVariables([
    g.dashboard.variable.query.new('namespace')
    + g.dashboard.variable.query.withDatasource('Prometheus')
    + g.dashboard.variable.query.queryTypes.withLabelValues('namespace', 'kube_namespace_labels')
    + g.dashboard.variable.query.selectionOptions.withMulti(true)
    + g.dashboard.variable.query.selectionOptions.withIncludeAll(true),
    g.dashboard.variable.query.new('service')
    + g.dashboard.variable.query.withDatasource('Prometheus')
    + g.dashboard.variable.query.queryTypes.withLabelValues('service', 'kube_service_labels{namespace=~"$namespace"}')  // =~ because namespace is multi-value
    + g.dashboard.variable.query.selectionOptions.withMulti(false),
  ])
+ g.dashboard.withPanels([
    requestRatePanel('$service'),
    errorRatePanel('$service'),
  ])
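
The panels in the example above set only a width and height. When dashboards are generated from code, panel positions can be computed rather than hand-placed. The sketch below shows that idea in Python, one of the dashboard-as-code options listed earlier; place_panels is a hypothetical helper that fills Grafana's 24-column grid left to right and wraps to a new row when a panel would overflow.

```python
GRID_COLUMNS = 24  # Grafana lays panels out on a 24-column grid


def place_panels(panels, width=12, height=8):
    """Assign gridPos to each panel, left to right, wrapping at the grid edge."""
    placed = []
    x = y = 0
    for panel in panels:
        if x + width > GRID_COLUMNS:   # no room left on this row
            x, y = 0, y + height
        placed.append({**panel, "gridPos": {"x": x, "y": y, "w": width, "h": height}})
        x += width
    return placed


if __name__ == "__main__":
    titles = ["Request Rate", "Error Rate", "Latency"]
    for panel in place_panels([{"title": t} for t in titles]):
        print(panel["title"], panel["gridPos"])
```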

Template Variables

Template variables make dashboards reusable across namespaces, clusters, services, and time ranges without duplicating dashboard JSON. They appear as dropdowns at the top of the dashboard and are interpolated into panel queries as $variable_name or ${variable_name}.

Variable Types

Type | Source | Use Case
Query | Label values from data source (Prometheus, Loki) | cluster, namespace, pod, service; populated dynamically from live data
Custom | Static comma-separated list | env (prod,staging,dev), region (us-east-1,eu-west-1)
Constant | Fixed string | Cluster name, base URL; injected into all panel queries
Interval | Duration options | User-selectable aggregation window for queries (e.g., 1m, 5m, 1h)
Data source | List of data source instances | Multi-cluster Prometheus federation
Text box | Free-text user input | Pod name filter, trace ID search
Ad hoc filter | Label key=value pairs | Add arbitrary label filters to all queries on a dashboard

Standard Variable Hierarchy for Kubernetes Dashboards

# Variable chain: datasource → cluster → namespace → workload → pod
# Each variable depends on the previous for correct scoping

# 1. datasource (for multi-cluster Prometheus)
type: datasource
query: prometheus

# 2. cluster
type: query
datasource: $datasource
query: label_values(kube_node_info, cluster)
multi: false
includeAll: false

# 3. namespace
type: query
datasource: $datasource
query: label_values(kube_namespace_labels{cluster="$cluster"}, namespace)
multi: true
includeAll: true

# 4. workload (deployment/statefulset)
type: query
datasource: $datasource
query: label_values(kube_deployment_labels{cluster="$cluster",namespace=~"$namespace"}, deployment)
multi: false
includeAll: true

# 5. pod (scoped to workload)
type: query
datasource: $datasource
query: label_values(kube_pod_info{cluster="$cluster",namespace=~"$namespace",created_by_name=~"$workload.*"}, pod)
multi: true
includeAll: true
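
The chained queries above match multi-value variables with =~ rather than =. To see why, it helps to look at roughly what a multi-value selection expands to in a Prometheus query. The sketch below approximates the regex-style formatting (values escaped and joined with |); interpolate_multi and render are hypothetical helpers, and Grafana's exact escaping rules may differ from Python's re.escape.

```python
import re


def interpolate_multi(values):
    """Approximate how a multi-value variable is rendered for a regex matcher."""
    escaped = [re.escape(v) for v in values]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def render(query, name, values):
    """Substitute $name in a query template with the interpolated selection."""
    return query.replace("$" + name, interpolate_multi(values))


if __name__ == "__main__":
    template = 'kube_pod_info{namespace=~"$namespace"}'
    print(render(template, "namespace", ["payments"]))
    print(render(template, "namespace", ["payments", "checkout"]))
```

With an equality matcher, the second expansion would be compared literally against the namespace label and match nothing, which is a common cause of empty panels after enabling multi-select.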

$__rate_interval vs $__interval

Always Use $__rate_interval for rate() and increase()

$__interval is the panel's resolution interval and can be smaller than the Prometheus scrape interval; a rate() window that holds fewer than two samples returns no data. $__rate_interval is computed as max($__interval + scrape interval, 4 × scrape interval), so the window always spans at least four scrapes. The scrape interval comes from the data source's timeInterval setting (30s in the provisioning above). Use rate(metric[$__rate_interval]) in time series panels instead of hardcoded windows such as rate(metric[5m]).

# CORRECT — adapts to selected time range and scrape interval
rate(http_requests_total{namespace="$namespace",service="$service"}[$__rate_interval])

# INCORRECT — hardcoded interval may be wrong for zoomed-out time ranges
rate(http_requests_total{namespace="$namespace",service="$service"}[5m])

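# For Loki metric queries: use [$__interval] in time series panels and reserve
# [$__range] for stat/instant queries that summarize the whole dashboard time range
rate({namespace="$namespace"} | json | level="error" [$__interval])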
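
As a worked example of the rule Grafana documents for $__rate_interval, max($__interval + scrape interval, 4 × scrape interval), the sketch below computes the window for two zoom levels. rate_interval is a hypothetical helper working in seconds; the 30s scrape interval is the timeInterval value assumed from the data source provisioning earlier.

```python
def rate_interval(interval_s, scrape_interval_s):
    """Compute $__rate_interval in seconds from $__interval and the scrape interval."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


if __name__ == "__main__":
    scrape = 30
    # Zoomed in: $__interval (15s) is below the scrape interval, so the 4x floor applies.
    print(rate_interval(15, scrape))    # 120
    # Zoomed out: $__interval (300s) dominates and one scrape interval is added.
    print(rate_interval(300, scrape))   # 330
```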

Essential Kubernetes Dashboards

The following is the minimum set of dashboards every production Kubernetes cluster should have. For each, the key panels and PromQL queries are specified so you can build them or verify imported ones are complete.

1. Cluster Overview Dashboard

Stat panels + time series. Primary audience: on-call engineers, SRE leads.

Panels: Cluster CPU Usage (stat, cores allocated), Cluster Memory Usage (stat, GiB allocated), Pod Count (Running / Total), Non-Running Pods (Pending / Failed / Unknown), CPU Usage by Namespace (time series), Memory Usage by Namespace (time series).

# Cluster CPU utilization %
sum(rate(node_cpu_seconds_total{mode!="idle",cluster="$cluster"}[$__rate_interval]))
  / sum(machine_cpu_cores{cluster="$cluster"}) * 100

# Cluster memory utilization %
(1 - sum(node_memory_MemAvailable_bytes{cluster="$cluster"})
    / sum(node_memory_MemTotal_bytes{cluster="$cluster"})) * 100

# Total pods by phase
sum by (phase) (kube_pod_status_phase{cluster="$cluster"})

# Non-running pods (Pending + Failed + Unknown)
sum(kube_pod_status_phase{cluster="$cluster", phase!~"Running|Succeeded"})

# CPU usage per namespace (top 10)
topk(10, sum by (namespace) (
  rate(container_cpu_usage_seconds_total{cluster="$cluster",container!=""}[$__rate_interval])
))

# Memory usage per namespace (working set)
sum by (namespace) (
  container_memory_working_set_bytes{cluster="$cluster",container!=""}
)

2. Namespace / Workload Overview Dashboard

Variables: cluster, namespace, workload. Shows replica health, restarts, and resource saturation for the selected workload.

# Deployment available vs desired replicas
kube_deployment_status_replicas_available{namespace="$namespace",deployment="$workload"}
kube_deployment_spec_replicas{namespace="$namespace",deployment="$workload"}

# Pod restart count (sorted by most restarts)
topk(10, sum by (pod, container) (
  increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1h])
))

# CPU throttling % for workload pods (throttled CFS periods / total CFS periods)
sum by (pod) (
  rate(container_cpu_cfs_throttled_periods_total{namespace="$namespace",pod=~"$workload.*"}[$__rate_interval])
) / sum by (pod) (
  rate(container_cpu_cfs_periods_total{namespace="$namespace",pod=~"$workload.*"}[$__rate_interval])
) * 100

# Memory usage vs limit (container!="" drops the pod-level aggregate series)
container_memory_working_set_bytes{namespace="$namespace",pod=~"$workload.*",container!=""}
  / on (pod, container)
kube_pod_container_resource_limits{namespace="$namespace",resource="memory",pod=~"$workload.*"}

# OOM kills: restarts in the last hour for containers whose last termination was OOMKilled
# (kube_pod_container_status_last_terminated_reason is a 0/1 gauge, so increase() on it is not meaningful)
sum by (pod, container) (
  increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1h])
  * on (namespace, pod, container) group_left ()
  kube_pod_container_status_last_terminated_reason{namespace="$namespace",reason="OOMKilled"}
)

3. Node Dashboard

# Node CPU utilization (node_exporter series are keyed by instance unless relabeled to node)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle",cluster="$cluster"}[$__rate_interval]))

# Node memory pressure
1 - node_memory_MemAvailable_bytes{cluster="$cluster"}
    / node_memory_MemTotal_bytes{cluster="$cluster"}

# Node disk I/O (read + write throughput)
rate(node_disk_read_bytes_total{cluster="$cluster",device=~"nvme.*|sd.*"}[$__rate_interval])
rate(node_disk_written_bytes_total{cluster="$cluster",device=~"nvme.*|sd.*"}[$__rate_interval])

# Node network throughput
rate(node_network_receive_bytes_total{cluster="$cluster",device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[$__rate_interval])
rate(node_network_transmit_bytes_total{cluster="$cluster",device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[$__rate_interval])

# Allocatable vs requested (resource headroom)
kube_node_status_allocatable{cluster="$cluster",resource="cpu"}
  - on (node)
sum by (node) (kube_pod_container_resource_requests{cluster="$cluster",resource="cpu",node=~".*"})

# Node pressure conditions currently true (DiskPressure, MemoryPressure, PIDPressure)
kube_node_status_condition{cluster="$cluster",condition=~"DiskPressure|MemoryPressure|PIDPressure",status="true"} == 1

4. Service RED Metrics Dashboard

Variables: namespace, service. Shows Rate, Errors, Duration for HTTP/gRPC services.

# Request rate (RPS)
sum by (method, route) (
  rate(http_requests_total{namespace="$namespace",service="$service"}[$__rate_interval])
)

# Error rate (5xx / total)
sum(rate(http_requests_total{namespace="$namespace",service="$service",status_code=~"5.."}[$__rate_interval]))
  /
sum(rate(http_requests_total{namespace="$namespace",service="$service"}[$__rate_interval]))

# p99 latency (histogram); use 0.50 and 0.95 for p50 and p95
histogram_quantile(0.99,
  sum by (le, method, route) (
    rate(http_request_duration_seconds_bucket{namespace="$namespace",service="$service"}[$__rate_interval])
  )
)

# Derived from Tempo traces (if the metrics generator is enabled):
# RED metrics computed from spans, so no separate metrics instrumentation is needed
sum by (span_name) (rate(traces_spanmetrics_calls_total{service_name="$service"}[$__rate_interval]))
histogram_quantile(0.99, sum by (le, span_name) (rate(traces_spanmetrics_duration_milliseconds_bucket{service_name="$service"}[$__rate_interval])))

5. Kubernetes Control Plane Dashboard

# API server request rate by verb
sum by (verb, resource) (rate(apiserver_request_total{cluster="$cluster"}[$__rate_interval]))

# API server error rate (4xx + 5xx)
sum(rate(apiserver_request_total{cluster="$cluster",code=~"4..|5.."}[$__rate_interval]))
  / sum(rate(apiserver_request_total{cluster="$cluster"}[$__rate_interval]))

# API server latency p99 by resource/verb
histogram_quantile(0.99, sum by (le, verb, resource) (
  rate(apiserver_request_duration_seconds_bucket{cluster="$cluster",subresource!="log"}[$__rate_interval])
))

# etcd request duration p99
histogram_quantile(0.99, sum by (le, operation) (
  rate(etcd_request_duration_seconds_bucket{cluster="$cluster"}[$__rate_interval])
))

# Scheduler queue depth (pending pods)
scheduler_pending_pods{cluster="$cluster"}

# Controller manager work queue depth
workqueue_depth{cluster="$cluster",name=~"deployment|replicaset|statefulset"}

6. Persistent Volume Dashboard

# PV capacity utilization
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100

# PVCs near full (> 80%)
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.80

# PVC inode usage
(kubelet_volume_stats_inodes_used / kubelet_volume_stats_inodes) * 100

# PVC by StorageClass
sum by (storageclass) (kube_persistentvolume_capacity_bytes)

# Unbound PVCs
kube_persistentvolumeclaim_status_phase{phase!="Bound"}

SLO / Error Budget Panels

SLO dashboards are the single most important dashboard type for production teams. They answer: "Are we within budget?", "How fast are we burning?", and "How many good minutes do we have left this month?"

SLO Dashboard Structure

Panels: Current SLI (30d) against target, Error Budget Remaining, Burn Rate (1h), Time to Exhaustion (at current burn rate), Error Budget Burn (30d window), SLI Over Time.

SLO Panel PromQL Queries

# --- SLI: availability (successful requests / total requests) ---
# Using recording rule (recommended for performance):
# namespace_service:sli_requests:ratio_rate5m
# namespace_service:sli_requests:ratio_rate30m
# namespace_service:sli_requests:ratio_rate1h
# namespace_service:sli_requests:ratio_rate6h

# Raw SLI (30-day rolling window)
sum(rate(http_requests_total{namespace="$namespace",service="$service",code!~"5.."}[30d]))
  /
sum(rate(http_requests_total{namespace="$namespace",service="$service"}[30d]))

# --- Error budget remaining (%) ---
# Budget = 1 - SLO objective (e.g., 1 - 0.999 = 0.001 = 0.1%)
# Consumed = 1 - current_SLI
# Remaining = 1 - (consumed / budget)
(
  1 - (
    (1 - (
      sum(rate(http_requests_total{namespace="$namespace",service="$service",code!~"5.."}[30d]))
      / sum(rate(http_requests_total{namespace="$namespace",service="$service"}[30d]))
    )) / (1 - 0.999)   # 0.999 = SLO objective
  )
) * 100

# --- Burn rate (1h window; pair with a 5m window for multi-window alerts) ---
# Burn rate = error rate / (1 - SLO)
# Alert thresholds: 14.4× (1h), 6× (6h), 3× (1d), 1× (3d)
(
  1 - (
    sum(rate(http_requests_total{service="$service",code!~"5.."}[1h]))
    / sum(rate(http_requests_total{service="$service"}[1h]))
  )
) / (1 - 0.999)

# --- Time to exhaustion (hours) ---
# hours left = remaining budget fraction * window hours / current burn rate
(
  (
    1 - (
      1 - sum(rate(http_requests_total{service="$service",code!~"5.."}[30d]))
          / sum(rate(http_requests_total{service="$service"}[30d]))
    ) / (1 - 0.999)
  ) * 30 * 24          # remaining share of the 30d budget, expressed in hours
)
/
(
  (
    1 - sum(rate(http_requests_total{service="$service",code!~"5.."}[1h]))
        / sum(rate(http_requests_total{service="$service"}[1h]))
  ) / (1 - 0.999)      # burn rate over the last hour
)

# --- SLO compliance status (green/red) ---
# Uses stat panel with threshold: green if > SLO, red if below
sum(rate(http_requests_total{service="$service",code!~"5.."}[30d]))
  / sum(rate(http_requests_total{service="$service"}[30d]))
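
The arithmetic behind these panels is easier to sanity-check outside PromQL. The sketch below reproduces the formulas from the comments above (budget = 1 - SLO, remaining = 1 - consumed / budget, burn rate = error rate / budget) for a 99.9% objective over 30 days. The function names are hypothetical, and the input values in the usage section are made-up illustrations, not measurements.

```python
def error_budget_minutes(slo, window_days=30):
    """Total allowed 'bad' minutes in the window for a given objective."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(sli, slo):
    """Fraction of the error budget still unspent (negative once overspent)."""
    return 1 - (1 - sli) / (1 - slo)


def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def hours_to_exhaustion(remaining_fraction, current_burn_rate, window_days=30):
    """Hours until the budget is gone if the current burn rate persists."""
    if current_burn_rate <= 0:
        return float("inf")
    return remaining_fraction * window_days * 24 / current_burn_rate


if __name__ == "__main__":
    slo = 0.999
    print(round(error_budget_minutes(slo), 1))             # 43.2 minutes per 30 days
    print(round(budget_remaining(0.9995, slo), 3))         # half the budget left
    print(round(burn_rate(0.0144, slo), 1))                # 14.4x
    print(round(hours_to_exhaustion(1.0, 14.4), 1))        # 50.0 hours from a full budget
```

The last line shows where the 14.4× threshold comes from: sustained for one hour, it spends 1/50 (2%) of a 30-day budget.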

Grafana Stat Panel Thresholds for SLO

{
  "thresholds": {
    "mode": "absolute",
    "steps": [
      {"color": "red",    "value": null},
      {"color": "yellow", "value": 0.99},
      {"color": "green",  "value": 0.999}
    ]
  },
  "mappings": [],
  "unit": "percentunit",
  "decimals": 4
}
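
Absolute thresholds are evaluated as "the last step whose value the reading has reached", with the null step acting as the base color. The sketch below applies that rule to the steps defined above, which can be handy for checking boundaries before committing a panel; threshold_color is a hypothetical helper and assumes the steps are listed in ascending order.

```python
def threshold_color(value, steps):
    """Return the color of the highest step whose value is <= the reading."""
    color = None
    for step in steps:  # steps are assumed to be sorted ascending, base step first
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


SLO_STEPS = [
    {"color": "red", "value": None},
    {"color": "yellow", "value": 0.99},
    {"color": "green", "value": 0.999},
]

if __name__ == "__main__":
    for sli in (0.9995, 0.995, 0.98):
        print(sli, threshold_color(sli, SLO_STEPS))
```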

Cross-Signal Navigation

The highest-value Grafana feature for production operations is navigating from a dashboard panel directly to the relevant logs, traces, or related dashboards — without manually copying query parameters. This is achieved through data links, panel links, and Grafana correlations.

Data Links (Panel-Level)

// In a time series panel showing error rate per pod:
// Clicking a data point opens Loki Explore for that pod's logs
{
  "dataLinks": [
    {
      "title": "Logs for ${__series.name}",
      "url": "/explore?orgId=1&left={\"datasource\":\"loki-uid\",\"queries\":[{\"expr\":\"{namespace=\\\"${namespace}\\\",pod=\\\"${__series.name}\\\"} | json | level=\\\"error\\\"\",\"refId\":\"A\"}],\"range\":{\"from\":\"${__from}\",\"to\":\"${__to}\"}}",
      "targetBlank": true
    },
    {
      "title": "Traces for ${__series.name}",
      "url": "/explore?orgId=1&left={\"datasource\":\"tempo-uid\",\"queries\":[{\"query\":\"{resource.service.name=\\\"${service}\\\" && status = error}\",\"refId\":\"A\"}]}",
      "targetBlank": true
    }
  ]
}
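
Hand-escaping the nested JSON in these URLs is error-prone. When dashboards are generated from code, the Explore link can be built by serializing the query object and percent-encoding it. The sketch below does that for the Loki link, using the same left= parameter layout as the example above; loki_explore_url is a hypothetical helper, and the URL format is taken from this document rather than from a guaranteed-stable Grafana API.

```python
import json
from urllib.parse import parse_qs, quote, urlsplit


def loki_explore_url(datasource_uid, logql, time_from, time_to, org_id=1):
    """Build an Explore link with the query state JSON-encoded into `left`."""
    left = {
        "datasource": datasource_uid,
        "queries": [{"expr": logql, "refId": "A"}],
        "range": {"from": time_from, "to": time_to},
    }
    encoded = quote(json.dumps(left, separators=(",", ":")), safe="")
    return f"/explore?orgId={org_id}&left={encoded}"


if __name__ == "__main__":
    url = loki_explore_url(
        "loki-uid",
        '{namespace="${namespace}",pod="${__series.name}"} | json | level="error"',
        "${__from}",
        "${__to}",
    )
    print(url)
    # Round trip: decoding the parameter yields the original query object.
    print(json.loads(parse_qs(urlsplit(url).query)["left"][0])["queries"][0]["expr"])
```

Note that the ${...} placeholders end up percent-encoded in the output, so this approach fits best when generating links with concrete values; whether Grafana still interpolates encoded placeholders is not something this sketch verifies.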

Panel Links (Dashboard Navigation)

// Link from cluster overview → namespace detail dashboard
{
  "links": [
    {
      "title": "Namespace Details",
      "url": "/d/ns-detail?var-cluster=${cluster}&var-namespace=${__data.fields.namespace}",
      "targetBlank": false,
      "includeVars": true,
      "keepTime": true
    }
  ]
}

Grafana Correlations (Global)

// Defined at the data source level — applies across all dashboards
// Maps a field in query results to a target exploration
{
  "label": "View logs for this service",
  "description": "Navigate to Loki logs filtered by service name",
  "sourceUID": "prometheus-uid",
  "targetUID": "loki-uid",
  "config": {
    "type": "query",
    "field": "service",
    "target": {
      "expr": "{namespace=\"${__data.fields.namespace}\", app=\"${__data.fields.service}\"} | json | level=\"error\""
    },
    "transformations": [
      {
        "type": "regex",
        "field": "service",
        "expression": "(.*)",
        "mapValue": "$1"
      }
    ]
  }
}

Exemplar Drill-Down (Metric → Trace)

# In Prometheus data source config (grafana-values.yaml):
# exemplarTraceIdDestinations maps trace ID exemplars to Tempo
datasources:
  datasources.yaml:
    datasources:
      - name: Prometheus
        type: prometheus
        jsonData:
          exemplarTraceIdDestinations:
            - name: traceID           # label name on the exemplar
              datasourceUid: tempo-uid
              urlDisplayLabel: "View in Tempo"
            - name: TraceID           # some apps use different casing
              datasourceUid: tempo-uid

# In Grafana, enable exemplars per query on the time series panel:
# Query editor → Options → Exemplars (toggle)
# Exemplar dots appear on the graph; clicking one opens the Tempo trace

Community Dashboards

Grafana hosts a public dashboard library at grafana.com/grafana/dashboards. The table below lists commonly used dashboards: those marked Yes ship with kube-prometheus-stack (kps) and need no import, while those marked Import can be imported by their numeric ID or provisioned through the sidecar.

Dashboard | Grafana ID | Description | Included in kps
Kubernetes / Cluster | 7249 | Cluster-wide CPU/memory/pods overview | Yes
Kubernetes / Nodes | 11074 | Node CPU, memory, disk, network per node | Yes
Kubernetes / Workloads | 12122 | Deployment/StatefulSet RED metrics | Yes
Kubernetes / Pods | 6417 | Per-pod CPU, memory, restarts | Yes
Kubernetes / Namespaces | 7249 | Namespace resource usage | Yes
Kubernetes / API Server | 12006 | kube-apiserver request rate, latency, errors | Yes
Kubernetes / PVs | 13646 | Persistent volume utilization and inode usage | Yes
Node Exporter Full | 1860 | Comprehensive node metrics (system, disk, network) | Import
Alertmanager | 9578 | Alert routing, inhibition, silence status | Yes
Prometheus / Overview | 3662 | Prometheus self-metrics (scrape, TSDB, rules) | Yes
Loki / Chunks | 14055 | Loki ingestion and storage metrics | Import
Tempo / Overview | 16698 | Tempo ingestion, query, storage | Import
Fluent Bit | 7752 | Fluent Bit input/output/filter metrics | Import
Cert-Manager | 11001 | Certificate expiry and renewal status | Import
NGINX Ingress | 9614 | Ingress controller request rate, latency, errors | Import

Importing via Helm (kube-prometheus-stack)

# kube-prometheus-stack already includes most of the above
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --set grafana.defaultDashboardsEnabled=true \
  --set grafana.defaultDashboardsTimezone=UTC \
  --namespace monitoring

# Import community dashboards not in kps:
kubectl create configmap grafana-dashboard-node-exporter \
  --from-file=node-exporter-full.json=./1860_rev37.json \
  -n monitoring
kubectl label configmap grafana-dashboard-node-exporter grafana_dashboard=1 -n monitoring

Dashboard Governance

Dashboard Organization

Folder Structure

Organize dashboards by audience and scope: Kubernetes (cluster-level infra), Platform (shared platform services), Services (per-team application dashboards), SLOs (error budgets), Oncall (triage dashboards for on-call).

Naming Conventions

Use consistent naming: [Scope] / [Subject] — e.g., Kubernetes / Nodes, Payments / Order Service, SLO / Checkout API. Use stable UIDs (e.g., k8s-nodes, svc-order-api) for reliable linking.

Version Control

All dashboard JSON stored in Git. No manual UI edits in production — use a review process for all dashboard changes. CI validates JSON syntax. Deployed via GitOps (ArgoCD/Flux) using ConfigMap or GrafanaDashboard CRD.
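
A CI validation step for this workflow can stay very small. The sketch below checks that every dashboard file parses as JSON, has a uid, and that no uid is used twice; find_uid_problems is a hypothetical helper that takes file contents as a mapping so it can be tested without touching the filesystem.

```python
import json
from collections import defaultdict


def find_uid_problems(dashboards):
    """dashboards maps file name -> raw JSON text; returns a list of error strings."""
    errors = []
    files_by_uid = defaultdict(list)
    for filename, text in sorted(dashboards.items()):
        try:
            doc = json.loads(text)
        except json.JSONDecodeError as exc:
            errors.append(f"{filename}: invalid JSON ({exc.msg})")
            continue
        uid = doc.get("uid") if isinstance(doc, dict) else None
        if not uid:
            errors.append(f"{filename}: missing uid")
        else:
            files_by_uid[uid].append(filename)
    for uid, files in sorted(files_by_uid.items()):
        if len(files) > 1:
            errors.append(f"uid {uid!r} used by {', '.join(files)}")
    return errors


if __name__ == "__main__":
    sample = {
        "nodes.json": '{"uid": "k8s-nodes", "title": "Kubernetes / Nodes"}',
        "nodes-copy.json": '{"uid": "k8s-nodes", "title": "Kubernetes / Nodes (copy)"}',
    }
    for problem in find_uid_problems(sample):
        print(problem)
```

In a pipeline, the mapping would be filled by reading the tracked dashboard files, and a non-empty result would fail the job.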

Access Control

Use Grafana Teams + RBAC: Viewer role for most engineers (read/explore only), Editor for platform team (can modify dashboards), Admin for Grafana operators only. SSO via OIDC/SAML maps group membership to Grafana roles automatically.

Dashboard Review Checklist

  1. Variables defined. Every dashboard has at minimum: cluster, namespace. All panel queries use $cluster and $namespace variables to avoid hardcoding.
  2. $__rate_interval used. All rate(), irate(), and increase() calls use [$__rate_interval] not hardcoded windows.
  3. Units configured. Every panel has the correct unit set: bytes for memory (not "short"), reqps for request rate, seconds or milliseconds for latency, percentunit for ratios.
  4. Thresholds set. Stat and gauge panels have color thresholds aligned with SLOs or alert thresholds (green/yellow/red at the same boundaries as PrometheusRule alerts).
  5. Dashboard UID set. Custom UIDs prevent collisions and make cross-dashboard linking reliable. Auto-generated UIDs are not stable across imports.
  6. No hardcoded namespace/cluster. Dashboards should be reusable across all namespaces and clusters via variables, not tied to one environment.
  7. Data links configured. Error rate / latency panels should link to Loki logs and Tempo traces for the selected service/pod.
  8. Refresh interval appropriate. Operational dashboards: 30s or 1m. Historical analysis dashboards: off (manual refresh). SLO dashboards: 5m.
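
Several items on this checklist can be checked mechanically. The sketch below covers three of them (explicit uid, required variables, hardcoded rate windows) against the dashboard JSON model, where variables live under templating.list and queries under panels[].targets[].expr. lint_dashboard is a hypothetical helper; the regular expression is deliberately simple, does not descend into collapsed rows, and will miss queries whose label matchers contain square brackets.

```python
import re

# rate(...[5m]) style windows; [$__rate_interval] does not match because of \d+
HARDCODED_WINDOW = re.compile(r"\b(?:rate|irate|increase)\([^\[\]]*\[\d+[smhdw]\]")


def lint_dashboard(dashboard, required_vars=("cluster", "namespace")):
    """Return a list of human-readable findings for one dashboard dict."""
    findings = []
    if not dashboard.get("uid"):
        findings.append("dashboard has no explicit uid")
    defined = {v.get("name") for v in dashboard.get("templating", {}).get("list", [])}
    for name in required_vars:
        if name not in defined:
            findings.append(f"missing variable: {name}")
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            if HARDCODED_WINDOW.search(target.get("expr", "")):
                findings.append(f"panel {panel.get('title', '?')!r}: hardcoded rate window")
    return findings


if __name__ == "__main__":
    dashboard = {
        "uid": "svc-order-api",
        "templating": {"list": [{"name": "cluster"}, {"name": "namespace"}]},
        "panels": [
            {"title": "RPS", "targets": [{"expr": "rate(http_requests_total[5m])"}]},
        ],
    }
    print(lint_dashboard(dashboard))
```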

Detecting Dashboard Drift

# Detect dashboards modified in UI but not in Git (drift detection)
# Export current state from Grafana API
curl -s "http://grafana.monitoring.svc:3000/api/search?type=dash-db&limit=500" \
  | jq -r '.[].uid' | while read uid; do
  curl -s "http://grafana.monitoring.svc:3000/api/dashboards/uid/$uid" \
    | jq '{uid: .dashboard.uid, version: .dashboard.version, title: .dashboard.title}'
done > current-state.json

# Compare against Git-tracked versions (CI check)
# Any dashboard with version > what was last provisioned has been modified in UI
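
The comparison step described in the comments above can be sketched as follows. find_drift is a hypothetical helper; it assumes the CI job has two mappings of dashboard uid to version number, one recorded when dashboards were last provisioned from Git and one built from the API export.

```python
def find_drift(provisioned_versions, live_versions):
    """Both arguments map dashboard uid -> version; returns uid -> reason."""
    drifted = {}
    for uid, live in sorted(live_versions.items()):
        expected = provisioned_versions.get(uid)
        if expected is None:
            drifted[uid] = "exists in Grafana but is not tracked in Git"
        elif live > expected:
            drifted[uid] = f"version {live} in Grafana, {expected} last provisioned"
    return drifted


if __name__ == "__main__":
    provisioned = {"k8s-nodes": 3, "k8s-workloads": 5}
    live = {"k8s-nodes": 3, "k8s-workloads": 7, "scratch-test": 1}
    for uid, reason in find_drift(provisioned, live).items():
        print(f"{uid}: {reason}")
```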

Metrics, Alerts & Runbooks

Grafana Self-Metrics

Metric | Alert Threshold | Meaning
grafana_http_request_duration_seconds | p99 > 5s | Grafana API latency (slow dashboards)
grafana_datasource_request_duration_seconds | p99 > 30s | Data source query latency (Prometheus/Loki)
grafana_datasource_request_total{status_code=~"5.."} | > 0 | Data source query errors
grafana_stat_totals_dashboard | Unexpected decrease | Dashboard count (detects accidental deletion)
grafana_alerting_result_total | (none) | Alert evaluation results (active/normal/error)
up{job="grafana"} | == 0 | Grafana process down

Alert Rules

groups:
  - name: grafana-health
    rules:
      - alert: GrafanaDown
        expr: absent(up{job="grafana"}) or up{job="grafana"} == 0
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "Grafana is down — dashboards and alerts unavailable"

      - alert: GrafanaDatasourceErrors
        expr: rate(grafana_datasource_request_total{status_code=~"5.."}[5m]) > 0
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "Grafana data source {{ $labels.datasource }} returning errors"
          description: "Dashboard queries to {{ $labels.datasource }} are failing. Check data source connectivity."

      - alert: GrafanaDashboardDeleted
        expr: grafana_stat_totals_dashboard < (grafana_stat_totals_dashboard offset 1h) * 0.9
        labels: {severity: warning}
        annotations:
          summary: "Grafana dashboard count dropped by >10% — possible accidental deletion"

Runbooks

Dashboard Shows "No Data"

  1. Check time range — may be zoomed into a period with no data
  2. Verify data source connection: Data Sources → Test
  3. Open Explore and run the panel query manually
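  4. Check variable values: with a query such as {namespace="$namespace"}, an empty variable renders as {namespace=""}, which matches only series without a namespace label, so the panel usually returns nothing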
  5. Verify metric name exists: curl prometheus:9090/api/v1/label/__name__/values | grep metric_name
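
The metric-name check in step 5 matches substrings, so a search for http_requests also succeeds when only http_requests_total exists. A minimal sketch of an exact-match variant follows. The helper name metric_exists is hypothetical, and it assumes the standard response shape of the Prometheus label-values endpoint, read from stdin.

```shell
#!/usr/bin/env bash
# metric_exists NAME   (hypothetical helper)
# Reads the JSON body of /api/v1/label/__name__/values on stdin and succeeds
# only if NAME appears as a complete quoted string, not as a substring.
# Caveat: the envelope strings "status", "success" and "data" also match;
# a real JSON parser such as jq avoids that edge case.
metric_exists() {
  # Extract every quoted string, then require a whole-line fixed-string match.
  grep -o '"[^"]*"' | grep -qxF "\"$1\""
}
```

For example: curl -s prometheus:9090/api/v1/label/__name__/values | metric_exists my_metric && echo present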

Dashboard Loading Slowly

  1. Reduce time range (shorter ranges = less data to scan)
  2. Check panel count — reduce to <20 panels per dashboard
  3. Check Prometheus query duration: Explore → run panel query → check execution time
  4. Add recording rules for expensive PromQL expressions
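  5. Enable query caching: Grafana's built-in query cache requires Grafana Enterprise or Grafana Cloud; with Mimir, enable results caching in the query-frontend instead
  6. Check if Prometheus is overloaded: prometheus_engine_query_duration_seconds{quantile="0.99"}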

Dashboard Drift (UI vs Git)

  1. Set disableDeletion: true and allowUiUpdates: false in provisioner config
  2. Export current dashboard JSON via API and diff against Git
  3. Revert to Git version by re-applying ConfigMap / triggering GitOps sync
  4. Add CI drift check job to detect version discrepancies

Data Source Authentication Failure

  1. Grafana → Configuration → Data Sources (Connections → Data sources in newer Grafana versions) → select source → Test
  2. Check if ServiceAccount token is expired (rotate if using static tokens)
  3. For Prometheus: verify NetworkPolicy allows grafana → prometheus port 9090
  4. For Loki: check Loki gateway auth configuration
  5. Restart Grafana pod after updating secrets: kubectl rollout restart deploy/grafana -n monitoring

Best Practices

  1. All dashboards in Git, zero manual UI edits in production. Enable disableDeletion: true and allowUiUpdates: false on the provisioner. Use a staging Grafana instance for dashboard development before promoting to production.
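  2. Use $__rate_interval for all rate-based queries. Hardcoded windows fail in both directions: a window shorter than about two scrape intervals can contain fewer than two samples, so rate() returns no data, and a window shorter than the query step (which grows as the user zooms out to a multi-day range) skips the samples between points and hides spikes. $__rate_interval avoids both by always being at least four scrape intervals wide and at least as wide as the step.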
  3. Build a hierarchy: cluster → namespace → workload → pod. Navigation between dashboard levels via data links reduces mean time to investigate by eliminating manual query reconstruction. A team-level dashboard should always link to the cluster overview and the pod-level details.
  4. Align dashboard thresholds with alert thresholds. If your alert fires at error rate > 1%, the stat panel on the dashboard should turn yellow at 0.5% and red at 1%. Misaligned thresholds confuse on-call engineers who see green dashboards while alerts are firing.
  5. Include an SLO/error budget panel on every service dashboard. Application RED metrics (rate, errors, duration) are necessary but not sufficient. Teams need to see whether they are within their error budget to make correct prioritization decisions.
  6. Configure exemplar drill-down on latency histograms. A p99 latency spike is much faster to investigate if you can click the spike and jump directly to a representative slow trace in Tempo without switching tools and manually searching.
  7. Avoid dashboards wider than the screen. Panels that require horizontal scrolling are never read. Prefer 24-column grid layouts (Grafana default) with panels spanning 6–12 columns.
  8. Provision one "golden signals" dashboard per team. Teams should own and maintain a single dashboard showing their service's four golden signals (latency, traffic, errors, saturation) plus SLO status. This becomes the starting point for all incident investigations.
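
To make the $__rate_interval recommendation in item 2 concrete, the sketch below reproduces the calculation Grafana documents for it, max($__interval + scrape interval, 4 × scrape interval), in shell arithmetic. The function name rate_interval is hypothetical, and the 15s scrape interval and the example $__interval values are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# rate_interval INTERVAL_SECONDS SCRAPE_SECONDS   (hypothetical helper)
# Mirrors the documented rule:
#   $__rate_interval = max($__interval + scrape, 4 * scrape)
# All values are whole seconds.
rate_interval() {
  local interval=$1
  local scrape=$2
  local widened=$(( interval + scrape ))   # step plus one scrape interval
  local floor=$(( 4 * scrape ))            # never fewer than four scrapes
  echo $(( widened > floor ? widened : floor ))
}

# Zoomed in: a 15s step with 15s scrapes falls back to the 4x floor.
rate_interval 15 15    # prints 60
# Zoomed out: a 600s step needs a window wider than the step itself.
rate_interval 600 15   # prints 615
```

In the zoomed-out case a hardcoded [5m] window (300s) would cover only half of each 600s step, which is the undersampling that item 2 warns about.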

Coverage Details

  • Grafana architecture: data sources (Prometheus/Loki/Tempo/Jaeger/CloudWatch), features (dashboards/alerting/explore/correlations)
  • Grafana Helm install with production grafana-values.yaml
  • Production grafana.ini config: root_url, OAuth/OIDC SSO with role mapping from groups, feature_toggles correlations, security (cookie_secure/samesite)
  • Data source provisioning in YAML: Prometheus (exemplarTraceIdDestinations→Tempo), Loki (derivedFields regex→Tempo), Tempo (tracesToLogsV2/tracesToMetrics/serviceMap/nodeGraph)
  • Dashboard provider config: file provider, disableDeletion, allowUiUpdates, folder, updateIntervalSeconds
  • dashboardsConfigMaps mapping to provisioner folders
  • Three dashboard-as-code approaches: ConfigMap+file provisioner, GrafanaDashboard CRD (Grafana Operator), Grafonnet/Jsonnet
  • ConfigMap provisioning: grafana_dashboard: "1" label, sidecar config (searchNamespace: ALL)
  • Grafana Operator: GrafanaDashboard CRD with instanceSelector and folder
  • Export dashboard JSON via Grafana HTTP API (single UID + bulk all dashboards)
  • Grafonnet install: go-jsonnet + jb + jsonnet-bundler initialization
  • Grafonnet dashboard example: prometheusQuery function, requestRatePanel + errorRatePanel functions, dashboard with query variables (namespace/service chained), panels array
  • Template variable types: Query / Custom / Constant / Interval / Data source / Text box / Ad hoc filter
  • Standard variable hierarchy: datasource → cluster → namespace → workload → pod with scoped label_values() queries
  • $__rate_interval vs $__interval explanation and callout; Loki $__range equivalent
  • Cluster Overview dashboard: stat panels (CPU%, memory%, pod count, non-running) + namespace time series
  • Cluster overview PromQL: node CPU/memory utilization, pod phase sum, non-running pods, namespace topk
  • Namespace/Workload dashboard: deployment replicas available/desired, pod restart topk, CPU throttling %, memory vs limit ratio, OOM kill count
  • Node dashboard: CPU/memory per node, disk I/O (read+write), network throughput, allocatable vs requested headroom, node condition checks
  • Service RED dashboard: request rate by method/route, error rate 5xx/total, p99 latency histogram_quantile, derived from Tempo metrics generator
  • Control plane dashboard: API server request rate by verb/resource, error rate, p99 latency by resource/verb, etcd request duration, scheduler queue, controller workqueue depth
  • PV dashboard: capacity utilization (used/capacity), near-full alert (>80%), inode usage, by StorageClass, unbound PVCs
  • SLO dashboard mockup: current SLI, error budget remaining %, burn rate ×, time to exhaustion
  • SLO PromQL: 30d rolling SLI, error budget remaining formula, burn rate (1h window / 1-SLO), time to exhaustion in hours
  • Stat panel threshold JSON for SLO compliance (red/yellow/green at objective boundary)
  • Data links JSON: click series → Loki Explore with time range; click → Tempo trace search
  • Panel links JSON: cluster overview → namespace detail with var interpolation + keepTime
  • Grafana correlations JSON config: field mapping, target query template, transformations
  • Exemplar drill-down: Prometheus datasource exemplarTraceIdDestinations config, panel exemplars toggle
  • Community dashboard table: 15 dashboards with IDs, descriptions, kps inclusion status
  • Importing non-kps dashboards via labeled ConfigMap
  • Dashboard governance: folder structure (Kubernetes/Platform/Services/SLOs/Oncall)
  • Naming conventions: [Scope] / [Subject] pattern, stable UIDs
  • Version control: no UI edits in production, CI JSON validation, GitOps deployment
  • Access control: Grafana Teams + Viewer/Editor/Admin roles, SSO group→role mapping
  • 8-point dashboard review checklist: variables, $__rate_interval, units, thresholds, UID, no hardcoding, data links, refresh interval
  • Dashboard drift detection via Grafana API version comparison script
  • 6 Grafana self-metrics with thresholds
  • 3 PrometheusRule alerts: GrafanaDown, DatasourceErrors, DashboardDeleted
  • 4 runbooks: No Data, slow loading, drift, auth failure
  • 8 best practices: Git-only, $__rate_interval, hierarchy with data links, aligned thresholds, SLO panel, exemplar drill-down, panel width discipline, golden signals dashboard per team