Grafana Dashboards for Kubernetes

Complete guide to Grafana architecture, dashboard provisioning as code, essential Kubernetes dashboard designs, variable templating, SLO panels, cross-signal navigation, and production dashboard governance.

Grafana Architecture

Grafana is the standard visualization layer for the Kubernetes observability stack. It connects to multiple data sources simultaneously and correlates signals across metrics (Prometheus/Thanos/Mimir), logs (Loki), traces (Tempo/Jaeger), and events — all in a unified interface.

Grafana in the Kubernetes Stack

┌──────────────────────────────────────────────────────────────────┐
│                             Grafana                              │
│                                                                  │
│  Data Sources:                                                   │
│  ┌──────────┐  ┌──────┐  ┌───────┐  ┌─────────┐  ┌──────────┐    │
│  │Prometheus│  │ Loki │  │ Tempo │  │ Jaeger  │  │CloudWatch│    │
│  │ /Thanos  │  │      │  │       │  │         │  │ /Datadog │    │
│  └──────────┘  └──────┘  └───────┘  └─────────┘  └──────────┘    │
│                                                                  │
│  Features:                                                       │
│  • Dashboards (panels + variables + annotations)                 │
│  • Alerting (unified alert rules + contact points)               │
│  • Explore (ad-hoc query across any data source)                 │
│  • Correlations (link metric → trace → log)                      │
│  • Playlists, Snapshots, Reporting                               │
└──────────────────────────────────────────────────────────────────┘

Grafana Helm Install

helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install grafana grafana/grafana \
  --namespace monitoring \
  --values grafana-values.yaml

Production grafana-values.yaml

# grafana-values.yaml
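# More than one replica needs a shared external database (PostgreSQL/MySQL);
# the default SQLite file on a PVC cannot be shared between pods.
replicas: 2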

persistence:
  enabled: true
  storageClassName: gp3
  size: 10Gi

# Admin credentials from secret (not plain text)
admin:
  existingSecret: grafana-admin-secret
  userKey: admin-user
  passwordKey: admin-password

# Grafana config overrides
grafana.ini:
  server:
    root_url: https://grafana.company.com
    domain: grafana.company.com
  auth.generic_oauth:
    enabled: true
    name: SSO
    allow_sign_up: true
    client_id: grafana-client
    client_secret: $__env{OAUTH_CLIENT_SECRET}
    scopes: openid profile email groups
    auth_url: https://sso.company.com/oauth2/auth
    token_url: https://sso.company.com/oauth2/token
    api_url: https://sso.company.com/oauth2/userinfo
    role_attribute_path: "contains(groups[*], 'grafana-admin') && 'Admin' || contains(groups[*], 'grafana-editor') && 'Editor' || 'Viewer'"
  users:
    auto_assign_org_role: Viewer
    allow_sign_up: false
  feature_toggles:
    enable: correlations    # enable cross-signal correlations
  analytics:
    reporting_enabled: false
  security:
    disable_gravatar: true
    cookie_secure: true
    cookie_samesite: lax    # "strict" can drop the OAuth state cookie on the redirect back from the IdP

# Data source provisioning (auto-configured on startup)
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus-uid
        url: http://prometheus-operated.monitoring.svc:9090
        isDefault: true
        jsonData:
          timeInterval: 30s
          queryTimeout: 60s
          exemplarTraceIdDestinations:
            - name: traceID
              datasourceUid: tempo-uid
              urlDisplayLabel: "View trace in Tempo"

      - name: Loki
        type: loki
        uid: loki-uid
        url: http://loki-gateway.monitoring.svc
        jsonData:
          maxLines: 1000
          derivedFields:
            - name: TraceID
              matcherRegex: '"trace_id":"(\w+)"'
              url: "$${__value.raw}"   # "$$" escapes "$" so provisioning does not expand it as an env var
              datasourceUid: tempo-uid
              urlDisplayLabel: "View trace"

      - name: Tempo
        type: tempo
        uid: tempo-uid
        url: http://tempo-query-frontend.monitoring.svc:3100
        jsonData:
          tracesToLogsV2:
            datasourceUid: loki-uid
            spanStartTimeShift: "-1m"
            spanEndTimeShift: "1m"
            filterByTraceID: true
            customQuery: true
            # "$$" escapes "$" so provisioning does not expand the macros as env vars
            query: '{cluster="prod", pod="$${__span.tags["k8s.pod.name"]}"} | json | trace_id = "$${__trace.traceId}"'
          tracesToMetrics:
            datasourceUid: prometheus-uid
            queries:
              - name: "Request Rate"
                query: 'rate(traces_spanmetrics_calls_total{service_name="$${__span.tags["service.name"]}"}[5m])'
          serviceMap:
            datasourceUid: prometheus-uid
          nodeGraph:
            enabled: true

# Dashboard provisioning (auto-load from ConfigMaps / filesystem)
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: default
        orgId: 1
        folder: Kubernetes
        type: file
        disableDeletion: true      # keep dashboards in Grafana even if their source file is removed
        updateIntervalSeconds: 30
        options:
          path: /var/lib/grafana/dashboards/default
      - name: slo
        orgId: 1
        folder: SLOs
        type: file
        disableDeletion: true
        options:
          path: /var/lib/grafana/dashboards/slo

# Pre-built dashboards from ConfigMaps (loaded by provider above)
dashboardsConfigMaps:
  default: grafana-dashboards-kubernetes
  slo: grafana-dashboards-slo

resources:
  requests: {cpu: 200m, memory: 256Mi}
  limits: {cpu: 1, memory: 512Mi}

serviceMonitor:
  enabled: true   # expose /metrics for Prometheus scraping

Dashboard Provisioning as Code

Never create dashboards only through the Grafana UI. UI-created dashboards are not version-controlled, will be lost if Grafana's PVC is accidentally deleted, and cannot be reviewed or rolled back. The correct approach is to store dashboard JSON in Git and provision via ConfigMaps or Grafana Operator.

Three Approaches to Dashboard as Code

Approach	Mechanism	Pros	Cons
ConfigMap + file provisioner	Dashboard JSON in ConfigMap, Grafana sidecar watches for changes	Simple, no CRDs needed	Raw JSON is hard to read/diff; no templating
GrafanaDashboard CRD (Grafana Operator)	Custom resource contains dashboard JSON; operator syncs to Grafana API	Kubernetes-native; supports namespaced ownership	Requires Grafana Operator installation
Grafonnet / dashboard-as-code libraries	Jsonnet/CUE/Python generates dashboard JSON; committed to Git	Full abstraction; reusable panels; type-safe	Learning curve; Jsonnet toolchain required

ConfigMap Dashboard Provisioning

# Store dashboard JSON in a ConfigMap
# The Grafana chart's sidecar (kiwigrid/k8s-sidecar) watches for labeled ConfigMaps
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-k8s-workloads
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # label that the sidecar watches for
data:
  k8s-workloads.json: |
    {
      "title": "Kubernetes / Workloads",
      "uid": "k8s-workloads",
      "tags": ["kubernetes", "workloads"],
      ...dashboard JSON...
    }

# Enable sidecar in grafana Helm values
sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard
    labelValue: "1"
    folder: /var/lib/grafana/dashboards
    searchNamespace: ALL   # watch all namespaces for labeled ConfigMaps
    provider:
      disableDeletion: true
      allowUiUpdates: false   # prevent dashboard drift
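
As a rough illustration of the ConfigMap approach, the following Python sketch wraps a dashboard JSON document in a labeled ConfigMap manifest. The helper name build_dashboard_configmap and the naming scheme derived from the dashboard UID are assumptions for illustration; only the grafana_dashboard: "1" label and the monitoring namespace come from the example above.

```python
import json


def build_dashboard_configmap(dashboard: dict, namespace: str = "monitoring") -> dict:
    """Wrap a dashboard dict in a ConfigMap manifest the sidecar can discover.

    Hypothetical helper: the ConfigMap name and file name are derived from the UID.
    """
    uid = dashboard["uid"]
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"grafana-dashboard-{uid}",
            "namespace": namespace,
            # Must match sidecar.dashboards.label / labelValue in the Helm values.
            "labels": {"grafana_dashboard": "1"},
        },
        # sort_keys keeps Git diffs stable between regenerations.
        "data": {f"{uid}.json": json.dumps(dashboard, indent=2, sort_keys=True)},
    }


if __name__ == "__main__":
    manifest = build_dashboard_configmap(
        {"uid": "k8s-workloads", "title": "Kubernetes / Workloads"}
    )
    # JSON is valid YAML, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from the JSON file, rather than pasting JSON into YAML by hand, avoids indentation mistakes and keeps the dashboard file itself as the single source of truth.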

Grafana Operator (CRD-based)

# Install Grafana Operator (the chart is published as an OCI artifact)
helm upgrade --install grafana-operator oci://ghcr.io/grafana/helm-charts/grafana-operator \
  --namespace monitoring

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: k8s-namespace-overview
  namespace: payments    # team-owned dashboard in their namespace
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana    # which Grafana instance to target
  folder: "Kubernetes"
  json: |
    {
      "title": "Namespace Overview — payments",
      "uid": "ns-payments",
      "tags": ["kubernetes", "namespace", "payments"],
      ...
    }

Export Dashboard from UI to JSON

# Export via Grafana HTTP API (use to bootstrap from existing dashboards)
# URLs are quoted so the shell does not interpret "?" and "&"
curl -s -u admin:password "http://grafana.monitoring.svc:3000/api/dashboards/uid/k8s-workloads" \
  | jq '.dashboard' > k8s-workloads.json

# Export all dashboards, one file per UID
curl -s -u admin:password "http://grafana.monitoring.svc:3000/api/search?type=dash-db" \
  | jq -r '.[].uid' \
  | while read -r uid; do
      curl -s -u admin:password "http://grafana.monitoring.svc:3000/api/dashboards/uid/$uid" \
        | jq '.dashboard' > "$uid.json"
    done
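
Exported dashboards carry fields that are specific to the Grafana instance they came from. A minimal sketch, assuming the input is the JSON returned by /api/dashboards/uid/<uid>, of how they might be normalized before being committed to Git (the function name normalize_dashboard is hypothetical):

```python
import json


def normalize_dashboard(api_response: dict) -> dict:
    """Return the dashboard model with instance-specific fields neutralized."""
    dashboard = dict(api_response["dashboard"])  # shallow copy; input stays untouched
    dashboard["id"] = None          # numeric id differs between Grafana instances
    dashboard.pop("version", None)  # incremented on every save, so it only adds diff noise
    return dashboard


if __name__ == "__main__":
    response = {
        "meta": {"provisioned": False},
        "dashboard": {"id": 42, "uid": "k8s-workloads", "version": 7, "title": "Kubernetes / Workloads"},
    }
    print(json.dumps(normalize_dashboard(response), indent=2, sort_keys=True))
```

The uid is kept as-is because links and provisioning rely on it staying stable.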

Grafonnet & Dashboard as Code

Grafonnet is a Jsonnet library for generating Grafana dashboards programmatically. It eliminates copy-paste panel duplication, enforces consistent layouts, and enables dashboard templating with real programming constructs (functions, loops, conditionals).

Grafonnet Quickstart

# Install jsonnet and jsonnet-bundler
go install github.com/google/go-jsonnet/cmd/jsonnet@latest
go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest

# Initialize a dashboard project
mkdir dashboards && cd dashboards
jb init
jb install github.com/grafana/grafonnet/gen/grafonnet-latest@main

# Generate dashboard JSON
jsonnet -J vendor my-dashboard.jsonnet > my-dashboard.json

Grafonnet Dashboard Example

// k8s-service-dashboard.jsonnet
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

local prometheusQuery(expr, legendFormat) =
  g.query.prometheus.new('prometheus-uid', expr)  // first argument is the data source UID
  + g.query.prometheus.withLegendFormat(legendFormat);

local requestRatePanel(service) =
  g.panel.timeSeries.new('Request Rate')
  + g.panel.timeSeries.queryOptions.withTargets([
      prometheusQuery(
        'sum by (status_code) (rate(http_requests_total{service="%s"}[$__rate_interval]))' % service,
        '{{status_code}}'
      ),
    ])
  + g.panel.timeSeries.standardOptions.withUnit('reqps')
  + g.panel.timeSeries.options.withLegend({ displayMode: 'table', placement: 'bottom' })
  + g.panel.timeSeries.gridPos.withW(12)
  + g.panel.timeSeries.gridPos.withH(8);

local errorRatePanel(service) =
  g.panel.timeSeries.new('Error Rate')
  + g.panel.timeSeries.queryOptions.withTargets([
      prometheusQuery(
        'sum(rate(http_requests_total{service="%s",status_code=~"5.."}[$__rate_interval])) / sum(rate(http_requests_total{service="%s"}[$__rate_interval]))' % [service, service],
        'Error Rate'
      ),
    ])
  + g.panel.timeSeries.standardOptions.withUnit('percentunit')
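  + g.panel.timeSeries.standardOptions.thresholds.withSteps([
      // Setters return plain objects, so combine them with "+" instead of chaining with "."
      g.panel.timeSeries.standardOptions.threshold.step.withColor('green')
      + g.panel.timeSeries.standardOptions.threshold.step.withValue(null),
      g.panel.timeSeries.standardOptions.threshold.step.withColor('yellow')
      + g.panel.timeSeries.standardOptions.threshold.step.withValue(0.01),
      g.panel.timeSeries.standardOptions.threshold.step.withColor('red')
      + g.panel.timeSeries.standardOptions.threshold.step.withValue(0.05),
    ])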
  + g.panel.timeSeries.gridPos.withW(12)
  + g.panel.timeSeries.gridPos.withH(8);

g.dashboard.new('Service Overview')
+ g.dashboard.withUid('service-overview')
+ g.dashboard.withTags(['kubernetes', 'service'])
+ g.dashboard.withRefresh('30s')
+ g.dashboard.time.withFrom('now-1h')
+ g.dashboard.withVariables([
    g.dashboard.variable.query.new('namespace')
    + g.dashboard.variable.query.withDatasource('prometheus', 'prometheus-uid')  // (type, uid)
    + g.dashboard.variable.query.queryTypes.withLabelValues('namespace', 'kube_namespace_labels')
    + g.dashboard.variable.query.selectionOptions.withMulti(true)
    + g.dashboard.variable.query.selectionOptions.withIncludeAll(true),
    g.dashboard.variable.query.new('service')
    + g.dashboard.variable.query.withDatasource('prometheus', 'prometheus-uid')
    // namespace is multi-value, so match it with =~ rather than =
    + g.dashboard.variable.query.queryTypes.withLabelValues('service', 'kube_service_labels{namespace=~"$namespace"}')
    + g.dashboard.variable.query.selectionOptions.withMulti(false),
  ])
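+ g.dashboard.withPanels(
    // makeGrid assigns x/y positions so the two 12-wide panels sit side by side
    g.util.grid.makeGrid([
      requestRatePanel('$service'),
      errorRatePanel('$service'),
    ], panelWidth=12, panelHeight=8)
  )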

Template Variables

Template variables make dashboards reusable across namespaces, clusters, services, and time ranges without duplicating dashboard JSON. They appear as dropdowns at the top of the dashboard and are interpolated into panel queries as $variable_name or ${variable_name}.

Variable Types

Type	Source	Use Case
Query	Label values from data source (Prometheus, Loki)	cluster, namespace, pod, service — dynamic from live data
Custom	Static comma-separated list	env (prod,staging,dev), region (us-east-1,eu-west-1)
Constant	Fixed string	Cluster name, base URL — inject into all panel queries
Interval	Duration options (e.g. 1m,5m,1h)	User-selectable aggregation window, used in queries as [$interval]; distinct from the built-in $__rate_interval
Data source	List of data source instances	Multi-cluster Prometheus federation
Text box	Free-text user input	Pod name filter, trace ID search
Ad hoc filter	Label key=value pairs	Add arbitrary label filters to all queries on a dashboard

Standard Variable Hierarchy for Kubernetes Dashboards

# Variable chain: datasource → cluster → namespace → workload → pod
# Each variable depends on the previous for correct scoping

# 1. datasource (for multi-cluster Prometheus)
type: datasource
query: prometheus

# 2. cluster
type: query
datasource: $datasource
query: label_values(kube_node_info, cluster)
multi: false
includeAll: false

# 3. namespace
type: query
datasource: $datasource
query: label_values(kube_namespace_labels{cluster="$cluster"}, namespace)
multi: true
includeAll: true

# 4. workload (deployment/statefulset)
type: query
datasource: $datasource
query: label_values(kube_deployment_labels{cluster="$cluster",namespace=~"$namespace"}, deployment)
multi: false
includeAll: true

# 5. pod (scoped to workload)
type: query
datasource: $datasource
query: label_values(kube_pod_info{cluster="$cluster",namespace=~"$namespace",created_by_name=~"$workload.*"}, pod)
multi: true
includeAll: true
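
The chain above uses =~ wherever a multi-value variable is referenced. A simplified Python model of why, assuming multi-value selections are rendered as a regex alternation (the function interpolate is illustrative only, not Grafana's implementation):

```python
import re


def interpolate(query: str, variables: dict[str, list[str]]) -> str:
    """Substitute $name placeholders, joining multiple values as (a|b).

    Simplified model: plain string replacement, so a variable name that is a
    prefix of another name is not handled.
    """
    for name, values in variables.items():
        escaped = [re.escape(v) for v in values]
        rendered = escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"
        query = query.replace("$" + name, rendered)
    return query


if __name__ == "__main__":
    selection = {"cluster": ["prod"], "namespace": ["payments", "checkout"]}
    print(interpolate(
        'label_values(kube_deployment_labels{cluster="$cluster",namespace=~"$namespace"}, deployment)',
        selection,
    ))
```

With an equality matcher (namespace="$namespace") the rendered alternation would be compared as a literal string and match nothing, which is a common cause of empty dropdowns further down the chain.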

$__rate_interval vs $__interval

Always Use $__rate_interval for rate() and increase()

$__interval is the panel's resolution interval but can be smaller than Prometheus's scrape_interval, causing incorrect rate calculations. $__rate_interval is automatically clamped to at least 4× the scrape interval, ensuring at least 4 data points per window. Use rate(metric[$__rate_interval]) in all time series panels instead of hardcoded intervals like rate(metric[5m]).

# CORRECT — adapts to selected time range and scrape interval
rate(http_requests_total{namespace="$namespace",service="$service"}[$__rate_interval])

# INCORRECT — hardcoded interval may be wrong for zoomed-out time ranges
rate(http_requests_total{namespace="$namespace",service="$service"}[5m])

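# For Loki metric queries, [$__range] spans the whole dashboard time range, which suits single-value stat panels;
# in time series panels use [$__auto] or [$__interval] so each point covers one step:
rate({namespace="$namespace"} | json | level="error" [$__range])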
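
The clamping rule can be written out as a one-line calculation. This sketch assumes the documented definition, max($__interval + scrape interval, 4 × scrape interval), with the scrape interval taken from the data source's timeInterval setting (30s in the values file above); the function name is illustrative.

```python
def rate_interval(interval_s: float, scrape_interval_s: float) -> float:
    """Model of $__rate_interval, in seconds."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


if __name__ == "__main__":
    # Zoomed in: $__interval (15s) is below the scrape interval, so the 4x floor applies.
    print(rate_interval(15, 30))
    # Zoomed out: $__interval (5m) dominates, plus one scrape interval of padding.
    print(rate_interval(300, 30))
```

The floor is what keeps rate() from returning empty results when a panel is zoomed in far enough that $__interval drops below the scrape interval.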

Essential Kubernetes Dashboards

The following is the minimum set of dashboards every production Kubernetes cluster should have. For each, the key panels and PromQL queries are specified so you can build them or verify imported ones are complete.

1. Cluster Overview Dashboard

Stat panels + time series. Primary audience: on-call engineers, SRE leads.

Panel layout (values are mock-up examples):

Stat panels: Cluster CPU Usage (67%, 34 / 50 cores allocated); Cluster Memory Usage (81%, 198 / 244 GiB allocated); Pod Count (847, Running / Total); Non-Running Pods (Pending / Failed / Unknown).

Time series panels: CPU Usage by Namespace; Memory Usage by Namespace.

# Cluster CPU utilization %
sum(rate(node_cpu_seconds_total{mode!="idle",cluster="$cluster"}[$__rate_interval]))
  / sum(machine_cpu_cores{cluster="$cluster"}) * 100

# Cluster memory utilization %
(1 - sum(node_memory_MemAvailable_bytes{cluster="$cluster"})
    / sum(node_memory_MemTotal_bytes{cluster="$cluster"})) * 100

# Total pods by phase
sum by (phase) (kube_pod_status_phase{cluster="$cluster"})

# Non-running pods (Pending + Failed + Unknown)
sum(kube_pod_status_phase{cluster="$cluster", phase!~"Running|Succeeded"})

# CPU usage per namespace (top 10)
topk(10, sum by (namespace) (
  rate(container_cpu_usage_seconds_total{cluster="$cluster",container!=""}[$__rate_interval])
))

# Memory usage per namespace (working set)
sum by (namespace) (
  container_memory_working_set_bytes{cluster="$cluster",container!=""}
)

2. Namespace / Workload Overview Dashboard

Variables: cluster, namespace, workload. Shows replica health, restarts, CPU throttling, and memory pressure for the selected workload.

# Deployment available vs desired replicas
kube_deployment_status_replicas_available{namespace="$namespace",deployment="$workload"}
kube_deployment_spec_replicas{namespace="$namespace",deployment="$workload"}

# Pod restart count (sorted by most restarts)
topk(10, sum by (pod, container) (
  increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1h])
))

# CPU throttling % for workload pods (share of CFS periods in which the container was throttled)
sum by (pod) (
  rate(container_cpu_cfs_throttled_periods_total{namespace="$namespace",pod=~"$workload.*"}[$__rate_interval])
) / sum by (pod) (
  rate(container_cpu_cfs_periods_total{namespace="$namespace",pod=~"$workload.*"}[$__rate_interval])
) * 100

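# Memory usage vs limit (ratio; container!="" skips the pod-level cgroup series)
sum by (pod, container) (container_memory_working_set_bytes{namespace="$namespace",pod=~"$workload.*",container!=""})
  / on (pod, container)
sum by (pod, container) (kube_pod_container_resource_limits{namespace="$namespace",resource="memory",pod=~"$workload.*"})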

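# OOM kills in the last hour: restarts of containers whose last termination reason was OOMKilled
# (kube_pod_container_status_last_terminated_reason is a 0/1 gauge, so increase() on it is not meaningful)
sum by (pod, container) (
  increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1h])
  * on (namespace, pod, container) group_left
  kube_pod_container_status_last_terminated_reason{namespace="$namespace",reason="OOMKilled"}
)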

3. Node Dashboard

# Node CPU utilization (node_exporter series are keyed by instance unless a node label is added via relabeling)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle",cluster="$cluster"}[$__rate_interval]))

# Node memory pressure
1 - node_memory_MemAvailable_bytes{cluster="$cluster"}
    / node_memory_MemTotal_bytes{cluster="$cluster"}

# Node disk I/O (read + write throughput)
rate(node_disk_read_bytes_total{cluster="$cluster",device=~"nvme.*|sd.*"}[$__rate_interval])
rate(node_disk_written_bytes_total{cluster="$cluster",device=~"nvme.*|sd.*"}[$__rate_interval])

# Node network throughput
rate(node_network_receive_bytes_total{cluster="$cluster",device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[$__rate_interval])
rate(node_network_transmit_bytes_total{cluster="$cluster",device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[$__rate_interval])

# Allocatable vs requested (resource headroom)
kube_node_status_allocatable{cluster="$cluster",resource="cpu"}
  - on (node)
sum by (node) (kube_pod_container_resource_requests{cluster="$cluster",resource="cpu",node=~".*"})

# Node pressure conditions currently true (DiskPressure, MemoryPressure, PIDPressure)
kube_node_status_condition{condition=~"DiskPressure|MemoryPressure|PIDPressure",status="true"} == 1

4. Service RED Metrics Dashboard

Variables: namespace, service. Shows Rate, Errors, Duration for HTTP/gRPC services.

# Request rate (RPS)
sum by (method, route) (
  rate(http_requests_total{namespace="$namespace",service="$service"}[$__rate_interval])
)

# Error rate (5xx / total)
sum(rate(http_requests_total{namespace="$namespace",service="$service",status_code=~"5.."}[$__rate_interval]))
  /
sum(rate(http_requests_total{namespace="$namespace",service="$service"}[$__rate_interval]))

# p99 latency from a histogram (use 0.50 or 0.95 for p50/p95)
histogram_quantile(0.99,
  sum by (le, method, route) (
    rate(http_request_duration_seconds_bucket{namespace="$namespace",service="$service"}[$__rate_interval])
  )
)

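# Derived from traces (span metrics): RED metrics without separate metrics instrumentation.
# Metric and label names depend on the generator; Tempo's metrics-generator typically exposes
# traces_spanmetrics_latency_bucket with a "service" label instead of the names below.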
sum by (span_name) (rate(traces_spanmetrics_calls_total{service_name="$service"}[$__rate_interval]))
histogram_quantile(0.99, sum by (le, span_name) (rate(traces_spanmetrics_duration_milliseconds_bucket{service_name="$service"}[$__rate_interval])))
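
To make the latency panels easier to interpret, here is a small Python sketch of how a quantile is estimated from cumulative histogram buckets by linear interpolation. It is a simplified model of the idea behind histogram_quantile, not Prometheus's implementation, and the bucket values are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (le, cumulative_count) pairs sorted by le.

    The last bucket is expected to be +Inf, as in a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le) or count == prev_count:
                # No finite upper bound (or empty bucket): report the last finite bound.
                return prev_le
            # Assume observations are spread evenly inside the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le


if __name__ == "__main__":
    buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.99, buckets))
```

Because the estimate is interpolated within a bucket, its accuracy is bounded by the bucket layout: a p99 that lands in a wide bucket can be off by a large fraction of that bucket's width.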

5. Kubernetes Control Plane Dashboard

# API server request rate by verb
sum by (verb, resource) (rate(apiserver_request_total{cluster="$cluster"}[$__rate_interval]))

# API server error rate (4xx + 5xx)
sum(rate(apiserver_request_total{cluster="$cluster",code=~"4..|5.."}[$__rate_interval]))
  / sum(rate(apiserver_request_total{cluster="$cluster"}[$__rate_interval]))

# API server latency p99 by resource/verb
histogram_quantile(0.99, sum by (le, verb, resource) (
  rate(apiserver_request_duration_seconds_bucket{cluster="$cluster",subresource!="log"}[$__rate_interval])
))

# etcd request duration p99
histogram_quantile(0.99, sum by (le, operation) (
  rate(etcd_request_duration_seconds_bucket{cluster="$cluster"}[$__rate_interval])
))

# Scheduler queue depth (pending pods)
scheduler_pending_pods{cluster="$cluster"}

# Controller manager work queue depth
workqueue_depth{cluster="$cluster",name=~"deployment|replicaset|statefulset"}

6. Persistent Volume Dashboard

# PV capacity utilization
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100

# PVCs near full (> 80%)
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.80

# PVC inode usage
(kubelet_volume_stats_inodes_used / kubelet_volume_stats_inodes) * 100

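# PV capacity by StorageClass (the storageclass label lives on kube_persistentvolume_info)
sum by (storageclass) (
  kube_persistentvolume_capacity_bytes
  * on (persistentvolume) group_left (storageclass)
  kube_persistentvolume_info
)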

# Unbound PVCs (== 1 keeps only the phase each claim is currently in)
kube_persistentvolumeclaim_status_phase{phase!="Bound"} == 1

SLO / Error Budget Panels

SLO dashboards are the single most important dashboard type for production teams. They answer: "Are we within budget?", "How fast are we burning?", and "How many good minutes do we have left this month?"

SLO Dashboard Structure

Panel layout:

Stat panels: Current SLI (30d), shown against the target (e.g. 99.9%); Error Budget Remaining (percent, plus minutes left out of the 30d budget); Burn Rate (1h), target ≤ 1×; Time to Exhaustion at the current burn rate.

Time series panels: Error Budget Burn (30d window); SLI Over Time.

SLO Panel PromQL Queries

# --- SLI: availability (successful requests / total requests) ---
# Using recording rule (recommended for performance):
# namespace_service:sli_requests:ratio_rate5m
# namespace_service:sli_requests:ratio_rate30m
# namespace_service:sli_requests:ratio_rate1h
# namespace_service:sli_requests:ratio_rate6h

# Raw SLI (30-day rolling window)
sum(rate(http_requests_total{namespace="$namespace",service="$service",code!~"5.."}[30d]))
  /
sum(rate(http_requests_total{namespace="$namespace",service="$service"}[30d]))

# --- Error budget remaining (%) ---
# Budget = 1 - SLO objective (e.g., 1 - 0.999 = 0.001 = 0.1%)
# Consumed = 1 - current_SLI
# Remaining = 1 - (consumed / budget)
(
  1 - (
    (1 - (
      sum(rate(http_requests_total{namespace="$namespace",service="$service",code!~"5.."}[30d]))
      / sum(rate(http_requests_total{namespace="$namespace",service="$service"}[30d]))
    )) / (1 - 0.999)   # 0.999 = SLO objective
  )
) * 100

# --- Burn rate (1h window; pair with a 5m short window for multi-window alerts) ---
# Burn rate = error rate / (1 - SLO)
# Alert thresholds: 14.4× (1h), 6× (6h), 3× (1d), 1× (3d)
(
  1 - (
    sum(rate(http_requests_total{service="$service",code!~"5.."}[1h]))
    / sum(rate(http_requests_total{service="$service"}[1h]))
  )
) / (1 - 0.999)

# --- Time to exhaustion (hours) ---
# = remaining budget fraction * window length (hours) / current burn rate
(
  (
    1 - (
      1 - sum(rate(http_requests_total{service="$service",code!~"5.."}[30d]))
          / sum(rate(http_requests_total{service="$service"}[30d]))
    ) / (1 - 0.999)
  ) * (30 * 24)   # remaining share of the 30d budget, expressed in hours
)
/
(
  (
    1 - sum(rate(http_requests_total{service="$service",code!~"5.."}[1h]))
        / sum(rate(http_requests_total{service="$service"}[1h]))
  ) / (1 - 0.999)   # 1h burn rate
)

# --- SLO compliance status (green/red) ---
# Uses stat panel with threshold: green if > SLO, red if below
sum(rate(http_requests_total{service="$service",code!~"5.."}[30d]))
  / sum(rate(http_requests_total{service="$service"}[30d]))

Grafana Stat Panel Thresholds for SLO

{
  "thresholds": {
    "mode": "absolute",
    "steps": [
      {"color": "red",    "value": null},
      {"color": "yellow", "value": 0.99},
      {"color": "green",  "value": 0.999}
    ]
  },
  "mappings": [],
  "unit": "percentunit",
  "decimals": 4
}
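
In absolute mode each step applies from its value upward, and the first step (value null) is the base color. A short Python sketch of that lookup, assuming steps are sorted in ascending order as in the JSON above (threshold_color is an illustrative name):

```python
def threshold_color(value: float, steps: list[dict]) -> str:
    """Return the color of the highest step whose value is <= the given value."""
    color = steps[0]["color"]  # base step: its value is null/None
    for step in steps[1:]:
        if value >= step["value"]:
            color = step["color"]
    return color


if __name__ == "__main__":
    steps = [
        {"color": "red", "value": None},
        {"color": "yellow", "value": 0.99},
        {"color": "green", "value": 0.999},
    ]
    for sli in (0.98, 0.9987, 0.9995):
        print(sli, threshold_color(sli, steps))
```

Note that the values are ratios (0.999), not percentages (99.9), because the panel unit is percentunit.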

Cross-Signal Navigation

The highest-value Grafana feature for production operations is navigating from a dashboard panel directly to the relevant logs, traces, or related dashboards — without manually copying query parameters. This is achieved through data links, panel links, and Grafana correlations.

Data Links (Panel-Level)

// In a time series panel showing error rate per pod (data links live under fieldConfig.defaults.links):
// Clicking a data point opens Loki Explore for that pod's logs
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "Logs for ${__field.labels.pod}",
          "url": "/explore?orgId=1&left={\"datasource\":\"loki-uid\",\"queries\":[{\"expr\":\"{namespace=\\\"${namespace}\\\",pod=\\\"${__field.labels.pod}\\\"} | json | level=\\\"error\\\"\",\"refId\":\"A\"}],\"range\":{\"from\":\"${__from}\",\"to\":\"${__to}\"}}",
          "targetBlank": true
        },
        {
          "title": "Traces for ${service}",
          "url": "/explore?orgId=1&left={\"datasource\":\"tempo-uid\",\"queries\":[{\"queryType\":\"traceql\",\"query\":\"{resource.service.name=\\\"${service}\\\" %26%26 status = error}\",\"refId\":\"A\"}]}",
          "targetBlank": true
        }
      ]
    }
  }
}
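
Hand-escaping the Explore URL is error-prone, so it can help to generate it. The following Python sketch builds the same style of /explore?left=... URL used above from a plain dictionary; the helper explore_url is hypothetical, and the left parameter layout simply mirrors the example rather than documenting a stable Grafana API.

```python
import json
from urllib.parse import parse_qs, quote, urlsplit


def explore_url(datasource_uid: str, query: dict,
                time_from: str = "${__from}", time_to: str = "${__to}") -> str:
    """Build an Explore link whose state is passed as JSON in the `left` parameter."""
    left = {
        "datasource": datasource_uid,
        "queries": [{"refId": "A", **query}],
        "range": {"from": time_from, "to": time_to},
    }
    # Keep "$", "{" and "}" unescaped so ${...} macros remain recognizable;
    # everything else (quotes, "&", "=", "|", spaces) is percent-encoded.
    encoded = quote(json.dumps(left, separators=(",", ":")), safe="${}")
    return "/explore?orgId=1&left=" + encoded


if __name__ == "__main__":
    url = explore_url(
        "loki-uid",
        {"expr": '{namespace="${namespace}",pod="${__field.labels.pod}"} | json | level="error"'},
    )
    print(url)
    # Round-trip check: decoding the URL yields the original query again.
    state = json.loads(parse_qs(urlsplit(url).query)["left"][0])
    print(state["queries"][0]["expr"])
```

Encoding "&" matters in particular: an unescaped "&&" inside a TraceQL query would otherwise be parsed as a query-string separator and truncate the link.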

Panel Links (Dashboard Navigation)

// Link from cluster overview → namespace detail dashboard
// Panel links can only use dashboard variables; per-row values such as
// ${__data.fields.namespace} are available in data links instead
{
  "links": [
    {
      "title": "Namespace Details",
      "url": "/d/ns-detail?${cluster:queryparam}&${namespace:queryparam}&${__url_time_range}",
      "targetBlank": false
    }
  ]
}

Grafana Correlations (Global)

// Defined on the source data source and surfaced in Explore
// Maps a field in query results to a query against the target data source
{
  "label": "View logs for this service",
  "description": "Navigate to Loki logs filtered by service name",
  "sourceUID": "prometheus-uid",
  "targetUID": "loki-uid",
  "config": {
    "type": "query",
    "field": "service",
    "target": {
      "expr": "{namespace=\"${namespace}\", app=\"${service}\"} | json | level=\"error\""
    },
    "transformations": [
      {
        "type": "regex",
        "field": "service",
        "expression": "(.*)",
        "mapValue": "service"
      }
    ]
  }
}

Exemplar Drill-Down (Metric → Trace)

# In Prometheus data source config (grafana-values.yaml):
# exemplarTraceIdDestinations maps trace ID exemplars to Tempo
datasources:
  datasources.yaml:
    datasources:
      - name: Prometheus
        type: prometheus
        jsonData:
          exemplarTraceIdDestinations:
            - name: traceID           # label name on the exemplar
              datasourceUid: tempo-uid
              urlDisplayLabel: "View in Tempo"
            - name: TraceID           # some apps use different casing
              datasourceUid: tempo-uid

# In Grafana: enable exemplars on the query of the time series panel:
# Query editor → Options → Exemplars (toggle)
# Exemplar dots appear on the graph; click one → opens the Tempo trace

Community Dashboards

Grafana has a public dashboard library at grafana.com/grafana/dashboards. The table below lists commonly used production dashboards and whether kube-prometheus-stack (kps) already ships an equivalent. Import the others by dashboard ID, or load them through the Helm chart's built-in provisioning.

Dashboard	Grafana ID	Description	Included in kps
Kubernetes / Cluster	7249	Cluster-wide CPU/memory/pods overview	Yes
Kubernetes / Nodes	11074	Node CPU, memory, disk, network per node	Yes
Kubernetes / Workloads	12122	Deployment/StatefulSet RED metrics	Yes
Kubernetes / Pods	6417	Per-pod CPU, memory, restarts	Yes
Kubernetes / Namespaces	7249	Namespace resource usage	Yes
Kubernetes / API Server	12006	kube-apiserver request rate, latency, errors	Yes
Kubernetes / PVs	13646	Persistent volume utilization and inode usage	Yes
Node Exporter Full	1860	Comprehensive node metrics (system, disk, network)	Import
Alertmanager	9578	Alert routing, inhibition, silence status	Yes
Prometheus / Overview	3662	Prometheus self-metrics (scrape, TSDB, rules)	Yes
Loki / Chunks	14055	Loki ingestion and storage metrics	Import
Tempo / Overview	16698	Tempo ingestion, query, storage	Import
Fluent Bit	7752	Fluent Bit input/output/filter metrics	Import
Cert-Manager	11001	Certificate expiry and renewal status	Import
NGINX Ingress	9614	Ingress controller request rate, latency, errors	Import

Importing via Helm (kube-prometheus-stack)

# kube-prometheus-stack already includes most of the above
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --set grafana.defaultDashboardsEnabled=true \
  --set grafana.defaultDashboardsTimezone=UTC \
  --namespace monitoring

# Import community dashboards not in kps:
kubectl create configmap grafana-dashboard-node-exporter \
  --from-file=node-exporter-full.json=./1860_rev37.json \
  -n monitoring
kubectl label configmap grafana-dashboard-node-exporter grafana_dashboard=1 -n monitoring

Dashboard Governance

Dashboard Organization

Folder Structure

Organize dashboards by audience and scope: Kubernetes (cluster-level infra), Platform (shared platform services), Services (per-team application dashboards), SLOs (error budgets), Oncall (triage dashboards for on-call).

Naming Conventions

Use consistent naming: [Scope] / [Subject] — e.g., Kubernetes / Nodes, Payments / Order Service, SLO / Checkout API. Use stable UIDs (e.g., k8s-nodes, svc-order-api) for reliable linking.

Version Control

All dashboard JSON stored in Git. No manual UI edits in production — use a review process for all dashboard changes. CI validates JSON syntax. Deployed via GitOps (ArgoCD/Flux) using ConfigMap or GrafanaDashboard CRD.

Access Control

Use Grafana Teams + RBAC: Viewer role for most engineers (read/explore only), Editor for platform team (can modify dashboards), Admin for Grafana operators only. SSO via OIDC/SAML maps group membership to Grafana roles automatically.

Dashboard Review Checklist

Variables defined. Every dashboard has at minimum: cluster, namespace. All panel queries use $cluster and $namespace variables to avoid hardcoding.
$__rate_interval used. All rate(), irate(), and increase() calls use [$__rate_interval] not hardcoded windows.
Units configured. Every panel has the correct unit set: bytes for memory (not "short"), reqps for request rate, seconds or milliseconds for latency, percentunit for ratios.
Thresholds set. Stat and gauge panels have color thresholds aligned with SLOs or alert thresholds (green/yellow/red at the same boundaries as PrometheusRule alerts).
Dashboard UID set. Custom UIDs prevent collisions and make cross-dashboard linking reliable. Auto-generated UIDs are not stable across imports.
No hardcoded namespace/cluster. Dashboards should be reusable across all namespaces and clusters via variables, not tied to one environment.
Data links configured. Error rate / latency panels should link to Loki logs and Tempo traces for the selected service/pod.
Refresh interval appropriate. Operational dashboards: 30s or 1m. Historical analysis dashboards: off (manual refresh). SLO dashboards: 5m.
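
Parts of this checklist can be automated in CI. A minimal sketch in Python, assuming the standard dashboard JSON layout (templating.list for variables, panels[].targets[].expr for queries); the function name lint_dashboard and the exact rules are illustrative and cover only three of the items above.

```python
import re

# Matches hardcoded range selectors such as [5m] or [1h].
HARDCODED_WINDOW = re.compile(r"\[\d+[smhdwy]\]")


def lint_dashboard(dashboard: dict, required_vars=("cluster", "namespace")) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if not dashboard.get("uid"):
        problems.append("dashboard uid is not set")
    names = {v.get("name") for v in dashboard.get("templating", {}).get("list", [])}
    for var in required_vars:
        if var not in names:
            problems.append(f"missing variable: {var}")
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if HARDCODED_WINDOW.search(expr):
                problems.append(f"panel {panel.get('title')!r}: hardcoded range in {expr!r}")
    return problems


if __name__ == "__main__":
    sample = {
        "uid": "",
        "templating": {"list": [{"name": "cluster"}]},
        "panels": [{"title": "Request Rate", "targets": [{"expr": "rate(http_requests_total[5m])"}]}],
    }
    for problem in lint_dashboard(sample):
        print(problem)
```

SLO dashboards that intentionally use fixed windows such as [30d] would need an allow-list, since this simple pattern flags every literal duration.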

Detecting Dashboard Drift

# Detect dashboards modified in UI but not in Git (drift detection)
# Export current state from Grafana API
curl -s "http://grafana.monitoring.svc:3000/api/search?type=dash-db&limit=500" \
  | jq -r '.[].uid' | while read uid; do
  curl -s "http://grafana.monitoring.svc:3000/api/dashboards/uid/$uid" \
    | jq '{uid: .dashboard.uid, version: .dashboard.version, title: .dashboard.title}'
done > current-state.json

# Compare against Git-tracked versions (CI check)
# Any dashboard with version > what was last provisioned has been modified in UI
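
The comparison step could look like the following Python sketch. It assumes CI keeps a mapping of UID to the last provisioned version (git_versions, a hypothetical artifact) and that the live state has been loaded as a list of objects, for example by slurping current-state.json with jq -s.

```python
def find_drift(git_versions: dict[str, int], live: list[dict]) -> list[str]:
    """Return UIDs that were edited in the UI or are unknown to Git."""
    drifted = []
    for dashboard in live:
        expected = git_versions.get(dashboard["uid"])
        # Unknown UID: created in the UI. Higher version: saved since provisioning.
        if expected is None or dashboard["version"] > expected:
            drifted.append(dashboard["uid"])
    return sorted(drifted)


if __name__ == "__main__":
    git_versions = {"k8s-nodes": 3, "svc-order-api": 5}
    live = [
        {"uid": "k8s-nodes", "version": 3, "title": "Kubernetes / Nodes"},
        {"uid": "svc-order-api", "version": 7, "title": "Payments / Order Service"},
        {"uid": "scratch-test", "version": 1, "title": "Scratch"},
    ]
    print(find_drift(git_versions, live))
```

A non-empty result can fail the CI job, which turns drift into something that gets reviewed instead of silently accumulating.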

Metrics, Alerts & Runbooks

Grafana Self-Metrics

Metric	Alert Threshold	Meaning
`grafana_http_request_duration_seconds`	p99 > 5s	Grafana API latency — slow dashboards
`grafana_datasource_request_duration_seconds`	p99 > 30s	Data source query latency (Prometheus/Loki)
`grafana_datasource_request_total{status_code=~"5.."}`	>0	Data source query errors
`grafana_stat_totals_dashboard`	Unexpected decrease	Dashboard count — detects accidental deletion
`grafana_alerting_result_total`	—	Alert evaluation results (active/normal/error)
`up{job="grafana"}`	== 0	Grafana process down

Alert Rules

groups:
  - name: grafana-health
    rules:
      - alert: GrafanaDown
        expr: absent(up{job="grafana"}) or up{job="grafana"} == 0
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "Grafana is down — dashboards and alerts unavailable"

      - alert: GrafanaDatasourceErrors
        expr: rate(grafana_datasource_request_total{status_code=~"5.."}[5m]) > 0
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "Grafana data source {{ $labels.datasource }} returning errors"
          description: "Dashboard queries to {{ $labels.datasource }} are failing. Check data source connectivity."

      - alert: GrafanaDashboardDeleted
        expr: grafana_stat_totals_dashboard < (grafana_stat_totals_dashboard offset 1h) * 0.9
        labels: {severity: warning}
        annotations:
          summary: "Grafana dashboard count dropped by >10% — possible accidental deletion"

Runbooks

Dashboard Shows "No Data"

Check time range — may be zoomed into a period with no data
Verify data source connection: Data Sources → Test
Open Explore and run the panel query manually
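Check variable values: an empty variable turns {namespace="$namespace"} into {namespace=""}, which matches only series without that label, so namespaced metrics return nothing
Verify the metric name exists: curl -s prometheus:9090/api/v1/label/__name__/values | jq -r '.data[]' | grep metric_name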

Dashboard Loading Slowly

Reduce time range (shorter ranges = less data to scan)
Check panel count — reduce to <20 panels per dashboard
Check Prometheus query duration: Explore → run panel query → check execution time
Add recording rules for expensive PromQL expressions
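Enable query caching: Grafana's built-in query caching requires Grafana Enterprise or Grafana Cloud; with Mimir or Thanos, results caching in the query frontend serves the same purpose
Check if Prometheus is overloaded: prometheus_engine_query_duration_seconds{quantile="0.99"} (p99 query duration)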

Dashboard Drift (UI vs Git)

Set disableDeletion: true and allowUiUpdates: false in provisioner config
Export current dashboard JSON via API and diff against Git
Revert to Git version by re-applying ConfigMap / triggering GitOps sync
Add CI drift check job to detect version discrepancies

Data Source Authentication Failure

Grafana → Configuration → Data Sources → select source → Test
Check if ServiceAccount token is expired (rotate if using static tokens)
For Prometheus: verify NetworkPolicy allows grafana → prometheus port 9090
For Loki: check Loki gateway auth configuration
Restart Grafana pod after updating secrets: kubectl rollout restart deploy/grafana -n monitoring

Best Practices

All dashboards in Git, zero manual UI edits in production. Enable disableDeletion: true and allowUiUpdates: false on the provisioner. Use a staging Grafana instance for dashboard development before promoting to production.
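Use $__rate_interval for all rate-based queries. A hardcoded window such as [1m] fails in both directions. When the user zooms out to a multi-day time range, the query step becomes larger than the window, so samples between steps are silently skipped and short spikes disappear. When the window covers fewer than two samples (roughly, when it is shorter than twice the scrape interval), rate() returns no data at all. $__rate_interval avoids both problems: it never drops below four scrape intervals and it grows with the step.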
Build a hierarchy: cluster → namespace → workload → pod. Navigation between dashboard levels via data links reduces mean time to investigate by eliminating manual query reconstruction. A team-level dashboard should always link to the cluster overview and the pod-level details.
Align dashboard thresholds with alert thresholds. If your alert fires at error rate > 1%, the stat panel on the dashboard should turn yellow at 0.5% and red at 1%. Misaligned thresholds confuse on-call engineers who see green dashboards while alerts are firing.
Include an SLO/error budget panel on every service dashboard. Application RED metrics (rate, error, duration) are necessary but not sufficient. Teams need to see whether they are within their error budget to make correct prioritization decisions.
Configure exemplar drill-down on latency histograms. A p99 latency spike is much faster to investigate if you can click the spike and jump directly to a representative slow trace in Tempo without switching tools and manually searching.
Avoid dashboards wider than the screen. Panels that require horizontal scrolling are never read. Prefer 24-column grid layouts (Grafana default) with panels spanning 6–12 columns.
Provision one "golden signals" dashboard per team. Teams should own and maintain a single dashboard showing their service's four golden signals (latency, traffic, errors, saturation) plus SLO status. This becomes the starting point for all incident investigations.
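
To make the $__rate_interval recommendation above concrete: Grafana documents the variable as max($__interval + scrape interval, 4 × scrape interval). The shell sketch below only reproduces that arithmetic. The 15s scrape interval and the function name rate_interval are assumptions for illustration.

```shell
# rate_interval INTERVAL_SECONDS SCRAPE_SECONDS
# Prints max(interval + scrape, 4 * scrape), the documented
# $__rate_interval formula, in seconds.
rate_interval() {
  with_step=$(( $1 + $2 ))   # window that always spans one full query step
  floor=$(( 4 * $2 ))        # lower bound of four scrape intervals
  if [ "$with_step" -gt "$floor" ]; then
    echo "$with_step"
  else
    echo "$floor"
  fi
}

rate_interval 2 15     # zoomed in,  $__interval = 2s   -> prints 60
rate_interval 600 15   # zoomed out, $__interval = 10m  -> prints 615
```

The floor of four scrape intervals keeps at least two samples in the window even when a scrape is missed, while the interval term makes the window at least as wide as the step, so no samples are skipped when the dashboard is zoomed out.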

Coverage Details

Grafana architecture: data sources (Prometheus/Loki/Tempo/Jaeger/CloudWatch), features (dashboards/alerting/explore/correlations)
Grafana Helm install with production grafana-values.yaml
Production grafana.ini config: root_url, OAuth/OIDC SSO with role mapping from groups, feature_toggles correlations, security (cookie_secure/samesite)
Data source provisioning in YAML: Prometheus (exemplarTraceIdDestinations→Tempo), Loki (derivedFields regex→Tempo), Tempo (tracesToLogsV2/tracesToMetrics/serviceMap/nodeGraph)
Dashboard provider config: file provider, disableDeletion, allowUiUpdates, folder, updateIntervalSeconds
dashboardsConfigMaps mapping to provisioner folders
Three dashboard-as-code approaches: ConfigMap+file provisioner, GrafanaDashboard CRD (Grafana Operator), Grafonnet/Jsonnet
ConfigMap provisioning: grafana_dashboard: "1" label, sidecar config (searchNamespace: ALL)
Grafana Operator: GrafanaDashboard CRD with instanceSelector and folder
Export dashboard JSON via Grafana HTTP API (single UID + bulk all dashboards)
Grafonnet install: go-jsonnet + jb + jsonnet-bundler initialization
Grafonnet dashboard example: prometheusQuery function, requestRatePanel + errorRatePanel functions, dashboard with query variables (namespace/service chained), panels array
Template variable types: Query / Custom / Constant / Interval / Data source / Text box / Ad hoc filter
Standard variable hierarchy: datasource → cluster → namespace → workload → pod with scoped label_values() queries
$__rate_interval vs $__interval explanation and callout; Loki $__range equivalent
Cluster Overview dashboard: stat panels (CPU%, memory%, pod count, non-running) + namespace time series
Cluster overview PromQL: node CPU/memory utilization, pod phase sum, non-running pods, namespace topk
Namespace/Workload dashboard: deployment replicas available/desired, pod restart topk, CPU throttling %, memory vs limit ratio, OOM kill count
Node dashboard: CPU/memory per node, disk I/O (read+write), network throughput, allocatable vs requested headroom, node condition checks
Service RED dashboard: request rate by method/route, error rate 5xx/total, p99 latency histogram_quantile, derived from Tempo metrics generator
Control plane dashboard: API server request rate by verb/resource, error rate, p99 latency by resource/verb, etcd request duration, scheduler queue, controller workqueue depth
PV dashboard: capacity utilization (used/capacity), near-full alert (>80%), inode usage, by StorageClass, unbound PVCs
SLO dashboard mockup: current SLI, error budget remaining %, burn rate ×, time to exhaustion
SLO PromQL: 30d rolling SLI, error budget remaining formula, burn rate (1h window / 1-SLO), time to exhaustion in hours
Stat panel threshold JSON for SLO compliance (red/yellow/green at objective boundary)
Data links JSON: click series → Loki Explore with time range; click → Tempo trace search
Panel links JSON: cluster overview → namespace detail with var interpolation + keepTime
Grafana correlations JSON config: field mapping, target query template, transformations
Exemplar drill-down: Prometheus datasource exemplarTraceIdDestinations config, panel exemplars toggle
Community dashboard table: 15 dashboards with IDs, descriptions, kps inclusion status
Importing non-kps dashboards via labeled ConfigMap
Dashboard governance: folder structure (Kubernetes/Platform/Services/SLOs/Oncall)
Naming conventions: [Scope] / [Subject] pattern, stable UIDs
Version control: no UI edits in production, CI JSON validation, GitOps deployment
Access control: Grafana Teams + Viewer/Editor/Admin roles, SSO group→role mapping
8-point dashboard review checklist: variables, $__rate_interval, units, thresholds, UID, no hardcoding, data links, refresh interval
Dashboard drift detection via Grafana API version comparison script
6 Grafana self-metrics with thresholds
3 PrometheusRule alerts: GrafanaDown, DatasourceErrors, DashboardDeleted
4 runbooks: No Data, slow loading, drift, auth failure
8 best practices: Git-only, $__rate_interval, hierarchy with data links, aligned thresholds, SLO panel, exemplar drill-down, panel width discipline, golden signals dashboard per team