Alerting — Kubernetes Observability

Alerting in Kubernetes

Complete guide to Alertmanager architecture, PrometheusRule CRD, alert routing and inhibition, silences, multi-window SLO burn rate alerts, notification channels, Grafana unified alerting, and production alert hygiene.

Alert Lifecycle

An alert progresses through defined states from the moment its expression becomes true to the moment it resolves. Understanding this lifecycle prevents confusion about why alerts page (or fail to page) during incidents.

Inactive (expr = false) → Pending (expr = true, for: not elapsed) → Firing (for: elapsed, sent to Alertmanager) → Resolved (expr = false, resolution sent to Alertmanager)
| State | Meaning | Visible In | Notification Sent |
|---|---|---|---|
| Inactive | Expression evaluates to false or returns no data | Prometheus /alerts (rule listed as inactive, no alert instances) | No |
| Pending | Expression true but for duration not yet elapsed | Prometheus /alerts (pending) | No |
| Firing | Expression true for the full for duration; alert sent to Alertmanager | Prometheus /alerts, Alertmanager UI | Yes (after group_wait) |
| Resolved | Expression returned false after firing | Alertmanager UI briefly, then cleared | Yes (resolve notification, if send_resolved is enabled) |
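The rule-side part of this lifecycle (Inactive, Pending, Firing) can be sketched as a small simulation. This is a minimal illustration, not Prometheus's implementation: the function name `alert_states` and its inputs are hypothetical, and the Resolved state is left out because it is something Alertmanager observes after the rule goes back to inactive.

```python
from datetime import timedelta


def alert_states(evaluations, interval, for_duration):
    """Return the rule state after each evaluation cycle.

    evaluations:  list of booleans, one per cycle (True = expression matched).
    interval:     timedelta between evaluations (assumed constant).
    for_duration: timedelta from the rule's `for:` field.
    """
    states = []
    active_for = None  # how long the expression has been continuously true
    for matched in evaluations:
        if not matched:
            active_for = None
            states.append("inactive")
            continue
        # The first true evaluation starts the clock at zero;
        # each later true evaluation adds one interval.
        active_for = timedelta(0) if active_for is None else active_for + interval
        states.append("firing" if active_for >= for_duration else "pending")
    return states


minute = timedelta(minutes=1)
timeline = alert_states([False, True, True, True, False], minute, 2 * minute)
print(timeline)
```

With a 1-minute interval and `for: 2m`, the alert spends two evaluations in pending before it fires, and a single false evaluation resets it to inactive.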

The for: Duration — When to Use It

for: Prevents Noise But Delays Detection

The for duration prevents alerts from firing on transient spikes, but it also delays notification. With for: 5m and a 1-minute evaluation interval, an incident can go undetected for up to 6 minutes (one evaluation interval plus the for duration), and Alertmanager's group_wait is added on top of that. For critical alerts (full service outage, data loss risk), use for: 0m or for: 1m. Use longer durations (5–15m) only for warnings about gradual degradation.

| Severity | Recommended for: | Rationale |
|---|---|---|
| critical | for: 0m or for: 1m | Immediate detection required; tolerate rare false positives |
| warning | for: 5m to for: 15m | Reduce noise from transient spikes; trend confirmation |
| info | for: 30m to for: 1h | Capacity and trend signals; no urgency |
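The 6-minute figure quoted above is just the evaluation interval added to the for duration. A rough sketch of that arithmetic follows; the helper names `parse_duration` and `worst_case_detection_seconds` are made up for illustration, only single-unit durations such as "30s", "5m", or "1h" are handled, and scrape delay is ignored.

```python
# Seconds per unit for the simple single-unit durations used in this guide.
UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a duration like '5m' or '30s' to seconds (single unit only)."""
    return int(text[:-1]) * UNITS[text[-1]]


def worst_case_detection_seconds(for_duration, eval_interval="1m", group_wait="0s"):
    """Upper bound on the time from 'expression becomes true' to 'first notification'.

    Assumes the expression turned true just after an evaluation, so a full
    interval passes before it is first seen, then `for:` must elapse, then
    Alertmanager waits group_wait before sending.
    """
    return (
        parse_duration(eval_interval)
        + parse_duration(for_duration)
        + parse_duration(group_wait)
    )


for severity, for_duration in [("critical", "1m"), ("warning", "5m"), ("info", "30m")]:
    delay = worst_case_detection_seconds(for_duration, group_wait="30s")
    print(f"{severity}: up to {delay} seconds before the first notification")
```

Running the numbers per severity tier makes the trade-off concrete before a for value is committed to a rule.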

PrometheusRule CRD

The PrometheusRule CRD (provided by Prometheus Operator) allows alert and recording rules to be stored as Kubernetes objects and automatically loaded by Prometheus without restarts. The Prometheus Operator watches PrometheusRule objects and reconciles them into the Prometheus configuration.

PrometheusRule Structure

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-alerts
  namespace: payments           # team-owned namespace
  labels:
    prometheus: kube-prometheus  # must match Prometheus .spec.ruleSelector
    role: alert-rules
spec:
  groups:
    - name: payments.availability    # group name — appears in Prometheus UI
      interval: 1m                   # optional: override evaluation interval for this group
      rules:
        # --- Recording Rule ---
        - record: namespace_service:http_request_errors:rate5m
          expr: |
            sum by (namespace, service) (
              rate(http_requests_total{namespace="payments",code=~"5.."}[5m])
            )
          labels:
            team: payments

        # --- Alerting Rule ---
        - alert: PaymentsHighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{namespace="payments",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{namespace="payments"}[5m]))
            ) > 0.01
          for: 2m
          labels:
            severity: critical        # used by Alertmanager routing
            team: payments            # used for routing to team channel
            service: payment-api
          annotations:
            summary: "High error rate in payments ({{ $value | humanizePercentage }})"
            description: |
              Error rate has been above 1% for 2 minutes in the payments namespace.
              Current value: {{ $value | humanizePercentage }}
              Namespace: {{ $labels.namespace }}
            runbook_url: "https://wiki.company.com/runbooks/payments/high-error-rate"
            dashboard_url: "https://grafana.company.com/d/payments-sre?var-namespace=payments"

Prometheus Selector for PrometheusRules

# In Prometheus CR — which PrometheusRule objects to load
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  ruleSelector:
    matchLabels:
      prometheus: kube-prometheus    # only load rules with this label
  ruleNamespaceSelector:
    matchLabels: {}                  # empty = all namespaces
    # Or restrict to specific namespaces:
    # matchExpressions:
    #   - key: kubernetes.io/metadata.name
    #     operator: In
    #     values: [monitoring, payments, orders]

Alert Labels — What Each Field Does

| Label | Purpose | Used By |
|---|---|---|
| severity | Alert urgency: critical, warning, info | Alertmanager routing, PagerDuty urgency, Slack channel selection |
| team | Owning team identifier | Alertmanager routing to team-specific receiver |
| service | Service name | Grouping, deduplication, notification context |
| namespace | Kubernetes namespace (often from metric labels) | Routing, inhibition, notification context |
| env | Environment: production, staging | Suppress staging alerts from paging on-call |
| page | Boolean: true if this should page on-call | Fine-grained routing: some critical alerts don't page at night |

Alert Annotation Variables

# Available template variables in annotations:
annotations:
  summary: "{{ $labels.namespace }}/{{ $labels.pod }}: {{ $value | humanize }}"
  # $labels      : labels of the series returned by the alert expression
  # $value       : current numeric value of the expression
  # $externalURL : external URL of the Prometheus server (--web.external-url)

  # Humanize filters:
  # humanize           : 1234567 → "1.235M"
  # humanizePercentage : 0.01234 → "1.234%"
  # humanizeDuration   : 93542 → "1d 1h 59m 2s"
  # humanize1024       : bytes: 1073741824 → "1Gi"

  # Direct Grafana/runbook links (golden practice).
  # $labels does not contain alertname or rule-level labels, and there is no
  # $startsAt variable, so hard-code the alert name and use a relative time range:
  runbook_url: "https://wiki/runbooks/PaymentsHighErrorRate"
  dashboard_url: "https://grafana/d/{{ $labels.namespace }}-sre?from=now-1h&to=now"

Alertmanager Architecture

Prometheus (fires alerts → HTTP POST)
        │
        ▼
┌────────────────────────────────────────────────┐
│                  Alertmanager                  │
│                                                │
│  ┌──────────┐   ┌────────────┐                 │
│  │ Ingress  │   │   Gossip   │ ← HA cluster    │
│  │ (alerts) │   │  Protocol  │   (3 replicas)  │
│  └────┬─────┘   └────────────┘                 │
│       │                                        │
│  ┌────▼──────────────────────────────────┐     │
│  │ Pipeline                              │     │
│  │ 1. Silence check (muted → drop)       │     │
│  │ 2. Inhibition check (inhibited→drop)  │     │
│  │ 3. Routing (match labels → receiver)  │     │
│  │ 4. Grouping (batch into groups)       │     │
│  │ 5. Timing (group_wait/interval/       │     │
│  │    repeat_interval)                   │     │
│  └───────────────────────┬───────────────┘     │
└──────────────────────────┼─────────────────────┘
                           │
         ┌─────────────────┼─────────────────┐
         ▼                 ▼                 ▼
     PagerDuty           Slack           OpsGenie
    (critical)        (warnings)         (backup)

HA Alertmanager — Gossip and Deduplication

Alertmanager uses a gossip protocol (memberlist) to synchronize state across a cluster of replicas. Every Prometheus instance sends alerts to all Alertmanager replicas directly, not through a load balancer. Each replica processes alerts independently, and gossip shares silences and the notification log between replicas, so normally only one replica sends each notification.

# Alertmanager CR for HA (kube-prometheus-stack)
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: kube-prometheus-stack-alertmanager
  namespace: monitoring
spec:
  replicas: 3
  retention: 120h   # keep alert state for 5 days (for silence persistence)
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: gp3
        resources:
          requests:
            storage: 10Gi
  resources:
    requests: {cpu: 100m, memory: 128Mi}
    limits: {cpu: 500m, memory: 256Mi}
  alertmanagerConfigSelector:
    matchLabels:
      alertmanagerConfig: kube-prometheus    # load AlertmanagerConfig CRDs

Alertmanager Configuration

Core Configuration Structure

# alertmanager.yaml — full production configuration
global:
  resolve_timeout: 5m       # applies only to alerts sent without an end time; Prometheus always sets one
  smtp_from: alerts@company.com
  smtp_smarthost: smtp.company.com:587
  slack_api_url: https://hooks.slack.com/services/...
  pagerduty_url: https://events.pagerduty.com/v2/enqueue

# Notification templates (Go text/template)
templates:
  - /etc/alertmanager/templates/*.tmpl

# Root route — catches all alerts not matched by child routes
route:
  receiver: default-receiver
  group_by: [alertname, cluster, namespace]
  group_wait: 30s            # wait 30s to batch initial alerts in a group
  group_interval: 5m         # wait 5m before sending updates to existing groups
  repeat_interval: 4h        # re-notify if alert is still firing after 4h
  routes:
    # Suppress staging environment pages. This route must come first: routes are
    # evaluated in order, and the critical route below would otherwise page for staging.
    - matchers:
        - env = staging
      receiver: slack-staging
      repeat_interval: 24h

    # Route critical alerts to PagerDuty
    - matchers:
        - severity = critical
      receiver: pagerduty-critical
      group_by: [alertname, cluster, namespace, service]
      group_wait: 10s
      repeat_interval: 1h
      continue: false          # stop matching further routes

    # Route team-specific alerts
    - matchers:
        - team = payments
      receiver: slack-payments-team
      routes:
        # Reached only if the critical route above sets continue: true
        - matchers:
            - severity = critical
          receiver: pagerduty-payments
          repeat_interval: 30m

    - matchers:
        - team = platform
      receiver: slack-platform-team

receivers:
  - name: default-receiver
    slack_configs:
      - channel: "#alerts-default"
        send_resolved: true

  - name: pagerduty-critical
    pagerduty_configs:
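      # Alertmanager does not expand environment variables in its config:
      # substitute this placeholder at deploy time, or use routing_key_file.
      - routing_key: "${PAGERDUTY_INTEGRATION_KEY}"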
        severity: critical
        description: '{{ template "pagerduty.description" . }}'
        details:
          namespace: '{{ .CommonLabels.namespace }}'
          service: '{{ .CommonLabels.service }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
        links:
          - href: '{{ .CommonAnnotations.dashboard_url }}'
            text: Grafana Dashboard
          - href: '{{ .CommonAnnotations.runbook_url }}'
            text: Runbook

  - name: slack-payments-team
    slack_configs:
      - api_url: "${SLACK_PAYMENTS_WEBHOOK}"
        channel: "#payments-alerts"
        send_resolved: true
        # severity is not in this route's group_by, so read it from CommonLabels
        color: '{{ if eq .Status "firing" }}{{ if eq .CommonLabels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}'
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'
        actions:
          - type: button
            text: "Runbook"
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: "Dashboard"
            url: '{{ (index .Alerts 0).Annotations.dashboard_url }}'
          - type: button
            text: "Silence"
            url: '{{ template "slack.silence_url" . }}'

  - name: pagerduty-payments
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_PAYMENTS_KEY}"
        severity: '{{ .CommonLabels.severity }}'   # severity is not a group label on this route

  - name: slack-platform-team
    slack_configs:
      - channel: "#platform-alerts"
        send_resolved: true

  - name: slack-staging
    slack_configs:
      - channel: "#staging-noise"
        send_resolved: false     # don't flood with resolved notifications

inhibit_rules:
  # If a critical alert is firing, suppress warnings for the same service
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: [alertname, namespace, service]

  # If a node is down, suppress all pod alerts on that node
  - source_matchers:
      - alertname = KubernetesNodeNotReady
    target_matchers:
      - alertname =~ "KubernetesPod.*|KubernetesContainer.*"
    equal: [node]

AlertmanagerConfig CRD (Namespaced)

The AlertmanagerConfig CRD (Prometheus Operator v0.43+) allows teams to define routing rules and receivers in their own namespace without access to the global Alertmanager configuration:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: payments-alertconfig
  namespace: payments
  labels:
    alertmanagerConfig: kube-prometheus   # matches Alertmanager .spec.alertmanagerConfigSelector
spec:
  route:
    receiver: payments-slack
    matchers:
      - name: namespace
        value: payments
    groupBy: [alertname, service]
    groupWait: 30s
    repeatInterval: 4h

  receivers:
    - name: payments-slack
      slackConfigs:
        - apiURL:
            name: slack-secret
            key: url
          channel: "#payments-alerts"
          sendResolved: true
          title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
          text: |
            *Service:* {{ .CommonLabels.service }}
            *Namespace:* {{ .CommonLabels.namespace }}
            *Description:* {{ .CommonAnnotations.description }}

  inhibitRules:
    - sourceMatch:
        - name: severity
          value: critical
      targetMatch:
        - name: severity
          value: warning
      equal: [service]

Custom Notification Templates

{{/* /etc/alertmanager/templates/slack.tmpl */}}
{{ define "slack.title" -}}
  [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
  {{ .CommonLabels.alertname }}
  — {{ .CommonLabels.namespace }}/{{ .CommonLabels.service }}
{{- end }}

{{ define "slack.text" -}}
{{ range .Alerts -}}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Description:* {{ .Annotations.description }}
*Since:* {{ .StartsAt | since }}
{{ end }}
{{- end }}

{{ define "pagerduty.description" -}}
  [{{ .CommonLabels.severity | toUpper }}] {{ .CommonLabels.alertname }}
  Namespace: {{ .CommonLabels.namespace }}
  {{ .CommonAnnotations.summary }}
{{- end }}

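{{/* Convenience: build Alertmanager silence URL.
     Matchers are comma-separated (%2C%20) and quotes are URL-encoded (%22);
     alertname is written last so the list never ends with a trailing comma. */}}
{{ define "slack.silence_url" -}}
  {{ .ExternalURL }}/#/silences/new?filter=%7B
  {{- range .CommonLabels.SortedPairs -}}
    {{- if and (ne .Name "severity") (ne .Name "alertname") -}}
      {{- .Name }}%3D%22{{ .Value | urlquery }}%22%2C%20
    {{- end -}}
  {{- end -}}
  alertname%3D%22{{ .CommonLabels.alertname | urlquery }}%22%7D
{{- end }}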

Routing, Grouping & Inhibition

Routing Decision Tree

Incoming alert: {alertname="HighErrorRate", severity="critical", team="payments", namespace="payments"}

Route tree traversal (first match wins unless continue: true):
│
├─ matchers: severity=critical              ✓ MATCH
│    receiver: pagerduty-critical
│    continue: false
│    └─ STOP: alert sent to pagerduty-critical
│
│  (if continue: true was set above, would also check:)
├─ matchers: team=payments                  (would match if continue: true)
│    └─ routes:
│         └─ matchers: severity=critical    ✓ would also match
│              receiver: pagerduty-payments
│
└─ default: default-receiver (only reached if no route matched)
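The traversal above can be sketched in a few lines of Python. This is an illustrative model, not Alertmanager's code: routes are plain dictionaries, matchers are limited to exact equality, and the names `match_routes` and `build_tree` are invented for the example.

```python
def match_routes(route, labels):
    """Return the receivers an alert is delivered to, walking the tree depth-first."""
    matchers = route.get("matchers", {})
    if any(labels.get(name) != value for name, value in matchers.items()):
        return []  # this route (and its whole subtree) does not apply
    receivers = []
    for child in route.get("routes", []):
        child_receivers = match_routes(child, labels)
        receivers.extend(child_receivers)
        # A matching child ends the search among its siblings unless continue is set.
        if child_receivers and not child.get("continue", False):
            break
    # If no child matched, the route itself handles the alert.
    return receivers or [route["receiver"]]


def build_tree(critical_continue=False):
    """Simplified version of the route tree from the configuration above."""
    return {
        "receiver": "default-receiver",
        "routes": [
            {
                "matchers": {"severity": "critical"},
                "receiver": "pagerduty-critical",
                "continue": critical_continue,
            },
            {
                "matchers": {"team": "payments"},
                "receiver": "slack-payments-team",
                "routes": [
                    {"matchers": {"severity": "critical"}, "receiver": "pagerduty-payments"},
                ],
            },
        ],
    }


alert = {
    "alertname": "HighErrorRate",
    "severity": "critical",
    "team": "payments",
    "namespace": "payments",
}
print(match_routes(build_tree(), alert))
print(match_routes(build_tree(critical_continue=True), alert))
```

The first call stops at the critical route; the second, with continue enabled, also descends into the payments subtree. Walking a real configuration through a model like this is a cheap way to catch unreachable routes before deploying.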

Grouping — Batch vs Flood Control

group_by Controls Notification Batching

group_by determines which label dimensions create separate notification groups. group_by: [alertname, namespace] sends one notification per unique (alertname, namespace) combination, batching all pods in that namespace together. An empty group_by: [] puts every alert on the route into one notification, which is useful for catch-all receivers. group_by: [alertname, namespace, pod] sends separate notifications per pod, which is usually too noisy. The special value group_by: ['...'] goes further and groups by every label, which effectively disables batching.
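To see how the choice of group_by changes the number of notifications, here is a minimal sketch that buckets alerts the way a route would. The function name `group_alerts` and the sample alerts are assumptions made for the example; timing is ignored.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels.

    alerts:   list of label dictionaries.
    group_by: list of label names; an empty list yields a single group.
    """
    groups = defaultdict(list)
    for labels in alerts:
        # Missing labels are treated as empty strings.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "PodCrashLooping", "namespace": "payments", "pod": "api-1"},
    {"alertname": "PodCrashLooping", "namespace": "payments", "pod": "api-2"},
    {"alertname": "PodCrashLooping", "namespace": "orders", "pod": "worker-1"},
]

for group_by in (["alertname", "namespace"], [], ["alertname", "namespace", "pod"]):
    count = len(group_alerts(alerts, group_by))
    print(f"group_by={group_by}: {count} notification group(s)")
```

Each group becomes one notification, so adding pod to group_by turns one message per namespace into one message per pod.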

Timing Parameters

| Parameter | Scope | Default | Effect |
|---|---|---|---|
| group_wait | Route | 30s | How long to wait before sending the first notification for a new group, which allows batching of simultaneous alerts |
| group_interval | Route | 5m | How long to wait before sending updates about an existing group (new alerts added, or alerts resolved within the group) |
| repeat_interval | Route | 4h | How long to wait before re-notifying about a group that is still firing and unchanged |
| resolve_timeout | Global | 5m | How long Alertmanager waits before declaring an alert resolved when it arrived without an end time and has not been updated. Prometheus always sets an end time, so this does not affect alerts sent by Prometheus |

Inhibition Rules

Inhibition suppresses lower-priority alerts when a higher-priority alert for the same system is already firing. This prevents alert floods during cascading failures where a root-cause alert fires alongside dozens of downstream symptom alerts.

inhibit_rules:
  # 1. Critical suppresses warning for same service
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, namespace, service]

  # 2. Node down suppresses all pod/container alerts on that node
  - source_matchers: [alertname="KubernetesNodeNotReady"]
    target_matchers: [alertname=~"Kubernetes(Pod|Container|Deployment).*"]
    equal: [node]

  # 3. Cluster degraded suppresses individual namespace alerts
  - source_matchers: [alertname="ClusterHighCPU"]
    target_matchers: [alertname="NamespaceCPUQuotaExceeded"]
    equal: [cluster]

  # 4. Watchdog (deadman's switch) alert — when present suppresses nothing
  # But absence of Watchdog alert should page (see DeadmanSwitch section)

  # 5. Maintenance window suppression via custom label
  - source_matchers: [alertname="MaintenanceWindow"]
    target_matchers: [severity=~"warning|info"]
    equal: [namespace]
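The matching logic behind an inhibition rule can be sketched as follows. This is a simplified model under stated assumptions: only equality matchers are supported (no regex), the rule is a plain dictionary, and `is_inhibited` is a hypothetical helper, not an Alertmanager API.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` is muted by some firing source alert under `rule`.

    rule has three keys:
      source: label values a source alert must have
      target: label values the muted alert must have
      equal:  label names whose values must be identical on both alerts
    """
    if any(target.get(k) != v for k, v in rule["target"].items()):
        return False  # the rule does not apply to this alert
    for source in firing_alerts:
        if source is target:
            continue
        if any(source.get(k) != v for k, v in rule["source"].items()):
            continue
        # A missing label and an empty label are treated as the same value.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


rule = {
    "source": {"severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ["alertname", "namespace", "service"],
}
critical = {"alertname": "PaymentsHighErrorRate", "severity": "critical",
            "namespace": "payments", "service": "payment-api"}
same_service = dict(critical, severity="warning")
other_service = dict(same_service, service="ledger")
firing = [critical, same_service, other_service]

print(is_inhibited(same_service, firing, rule))
print(is_inhibited(other_service, firing, rule))
```

The warning for the same service is muted while the warning for a different service still notifies, because the equal labels scope the suppression to alerts that describe the same thing.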

Silences & Maintenance Windows

Silences temporarily suppress notifications for matching alerts without disabling the alert rule itself. The alert continues to fire and be visible in the Prometheus and Alertmanager UIs — it just doesn't page anyone during the silence window.

amtool — Alertmanager CLI

# Install amtool
go install github.com/prometheus/alertmanager/cmd/amtool@latest
# Or: brew install alertmanager (macOS)

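# Configure amtool via its config file (or pass --alertmanager.url on each command)
mkdir -p ~/.config/amtool
echo 'alertmanager.url: http://alertmanager.monitoring.svc:9093' > ~/.config/amtool/config.yml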

# List active silences
amtool silence query

# Create a silence (maintenance window)
amtool silence add \
  --author="jane.doe@company.com" \
  --comment="Planned maintenance: payments database upgrade" \
  --duration=2h \
  alertname=~"Payments.*" namespace=payments

# Create silence with specific expiry time
amtool silence add \
  --author="jane.doe" \
  --comment="Batch job running — expected high CPU" \
  --start="2024-01-15T22:00:00Z" \
  --end="2024-01-16T02:00:00Z" \
  namespace=batch

# Expire (end) a silence early (use the ID printed by "amtool silence query")
amtool silence expire <silence-id>

# List current firing alerts (optionally filtered by matchers)
amtool alert query
amtool alert query alertname=KubernetesOOMKilling

# Preview which firing alerts a silence would cover by querying with the same matchers
amtool alert query namespace=payments severity=warning

Silence via Alertmanager API

# Create silence programmatically (e.g., from deploy script)
curl -s -X POST http://alertmanager:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "namespace", "value": "payments", "isRegex": false},
      {"name": "alertname", "value": ".*", "isRegex": true}
    ],
    "startsAt": "2024-01-15T22:00:00.000Z",
    "endsAt": "2024-01-15T23:00:00.000Z",
    "createdBy": "deploy-bot",
    "comment": "Automated silence during payments-v2.4 deployment"
  }'

# List silences
curl -s http://alertmanager:9093/api/v2/silences | jq '.[] | {id:.id, comment:.comment, endsAt:.endsAt}'

# Expire silence (append the silence ID returned by the POST or list call)
curl -s -X DELETE http://alertmanager:9093/api/v2/silence/<silence-id>
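A deploy script usually needs the silence window computed relative to "now" rather than hard-coded. The sketch below builds the same JSON body as the curl example; `build_silence` is a hypothetical helper, it only constructs the payload, and sending it to the API is left to whatever HTTP client the script already uses.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(namespace, duration_minutes, created_by, comment, now=None):
    """Build a silence payload for POST /api/v2/silences.

    `now` is injectable for testing; it should be a timezone-aware datetime.
    """
    start = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    end = start + timedelta(minutes=duration_minutes)
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"  # same timestamp shape as the curl example
    return {
        "matchers": [
            {"name": "namespace", "value": namespace, "isRegex": False},
        ],
        "startsAt": start.strftime(fmt),
        "endsAt": end.strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    }


payload = build_silence(
    namespace="payments",
    duration_minutes=60,
    created_by="deploy-bot",
    comment="Automated silence during payments-v2.4 deployment",
)
print(json.dumps(payload, indent=2))
```

Keeping the duration short and tied to the deployment means a forgotten silence expires on its own instead of hiding a real incident.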

Deadman's Switch (Watchdog Alert)

# Always-firing alert: confirms the alerting pipeline is working end-to-end.
# If Prometheus or Alertmanager is down, this alert stops arriving and its absence triggers a page.

# --- PrometheusRule entry (under spec.groups[].rules) ---
- alert: Watchdog
  expr: vector(1)   # always true
  labels:
    severity: none  # alertname is set automatically from the rule name
  annotations:
    summary: "Alertmanager pipeline is healthy"
    description: "This alert is always firing and routed to a deadman receiver. If it stops, the pipeline is broken."

# --- Alertmanager route (under route.routes): Watchdog → external heartbeat service ---
# (Dead Man's Snitch, OpsGenie Heartbeat, VictorOps, etc.)
- matchers:
    - alertname = Watchdog
  receiver: deadmans-snitch
  group_wait: 0s
  group_interval: 1m
  repeat_interval: 50s    # must be less than heartbeat timeout

# --- Alertmanager receiver ---
receivers:
  - name: deadmans-snitch
    webhook_configs:
      - url: https://nosnch.in/YOUR_SNITCH_ID    # Dead Man's Snitch URL
        send_resolved: false
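The receiving side of a deadman's switch inverts the usual logic: it pages when notifications stop. A minimal sketch of that check is shown below. The `HeartbeatMonitor` class is hypothetical and stands in for whatever the external heartbeat service does internally; timestamps are plain seconds to keep it self-contained.

```python
class HeartbeatMonitor:
    """Pages when no heartbeat has arrived within the timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_ping = None  # time of the most recent Watchdog notification

    def ping(self, now):
        """Record a Watchdog notification received at time `now` (seconds)."""
        self.last_ping = now

    def should_page(self, now):
        """True once the pipeline has been silent for longer than the timeout."""
        if self.last_ping is None:
            return False  # not armed until the first heartbeat arrives
        return now - self.last_ping > self.timeout_seconds


# Assumed numbers: notifications arrive about once a minute, timeout is two minutes.
monitor = HeartbeatMonitor(timeout_seconds=120)
monitor.ping(now=0)
monitor.ping(now=60)
print(monitor.should_page(now=120))  # one interval after the last ping
print(monitor.should_page(now=300))  # pipeline has gone quiet
```

The timeout needs headroom over the notification cadence so that one delayed heartbeat does not page, which is why the route's repeat interval is kept below it.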

SLO Burn Rate Alerts

Standard threshold alerts (e.g., "error rate > 1%") cannot detect slow burns that will exhaust your error budget before the end of the month. Multi-window burn rate alerting (described in the "Alerting on SLOs" chapter of Google's Site Reliability Workbook) detects both fast burns (urgent) and slow burns (non-urgent) while keeping false positive rates low.

Burn Rate Formula

Burn Rate = How Fast You're Consuming Error Budget

A burn rate of 1× means you're consuming budget at exactly the sustainable rate (budget exhausted at exactly the SLO period end). A burn rate of 14.4× means you will exhaust your entire 30-day budget in 30/14.4 = 2.08 days.

burn_rate = error_rate / (1 - SLO)
Example: SLO = 99.9%, current error rate = 5%: burn_rate = 0.05 / 0.001 = 50×
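The formula and the time-to-exhaustion arithmetic can be checked with a short sketch. The function names are illustrative, and a 30-day SLO period is assumed as in the rest of this section.

```python
def burn_rate(error_rate, slo):
    """How many times faster than sustainable the error budget is being spent.

    error_rate: observed fraction of failed requests (0.05 = 5%).
    slo:        availability target as a fraction (0.999 = 99.9%).
    """
    return error_rate / (1 - slo)


def days_to_exhaustion(rate, period_days=30):
    """Days until the whole budget is gone if the burn rate stays constant."""
    return period_days / rate


rate = burn_rate(error_rate=0.05, slo=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget gone in {days_to_exhaustion(rate) * 24:.1f} hours")
print(f"at 14.4x the budget lasts {days_to_exhaustion(14.4):.2f} days")
```

A 5% error rate against a 99.9% SLO burns the monthly budget in well under a day, which is why fast-burn tiers page immediately.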

Multi-Window Burn Rate Alert — Complete Implementation

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-slo-alerts
  namespace: payments
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    # --- Recording rules for efficiency ---
    - name: payments.slo.recording
      interval: 1m
      rules:
        - record: job:http_requests_total:rate5m
          expr: sum by (job, code) (rate(http_requests_total{namespace="payments"}[5m]))
        - record: job:http_requests_total:rate30m
          expr: sum by (job, code) (rate(http_requests_total{namespace="payments"}[30m]))
        - record: job:http_requests_total:rate1h
          expr: sum by (job, code) (rate(http_requests_total{namespace="payments"}[1h]))
        - record: job:http_requests_total:rate2h
          expr: sum by (job, code) (rate(http_requests_total{namespace="payments"}[2h]))
        - record: job:http_requests_total:rate6h
          expr: sum by (job, code) (rate(http_requests_total{namespace="payments"}[6h]))
        - record: job:http_requests_total:rate1d
          expr: sum by (job, code) (rate(http_requests_total{namespace="payments"}[1d]))
        - record: job:http_requests_total:rate3d
          expr: sum by (job, code) (rate(http_requests_total{namespace="payments"}[3d]))

        # Error ratio recording rules (error_requests / total_requests)
        - record: job:http_request_error_ratio:rate5m
          expr: |
            sum(job:http_requests_total:rate5m{code=~"5.."})
              / sum(job:http_requests_total:rate5m)

    # --- SLO Burn Rate Alerts ---
    # SLO: 99.9% availability over 30 days
    # Error budget: 0.1% = 43.2 minutes/30d
    - name: payments.slo.alerts
      rules:
        # TIER 1: Fast burn — page immediately
        # 14.4× burn rate: exhausts budget in 2.08 days
        # Two windows: 1h + 5m (both must be firing → reduces false positives)
        - alert: PaymentsSLOBurnRateCritical
          expr: |
            (
              (
                sum(rate(http_requests_total{namespace="payments",code=~"5.."}[1h]))
                / sum(rate(http_requests_total{namespace="payments"}[1h]))
              ) > (14.4 * (1 - 0.999))
            )
            and
            (
              (
                sum(rate(http_requests_total{namespace="payments",code=~"5.."}[5m]))
                / sum(rate(http_requests_total{namespace="payments"}[5m]))
              ) > (14.4 * (1 - 0.999))
            )
          for: 2m
          labels:
            severity: critical
            team: payments
            slo: availability
          annotations:
            summary: "Payments SLO: Fast burn rate — budget exhausts in ~2 days"
            description: "Error rate is >{{ 14.4 | humanize }}× the sustainable rate. Current: {{ $value | humanizePercentage }}. SLO: 99.9%."
            runbook_url: "https://wiki/runbooks/payments/slo-burn-rate"

        # TIER 2: Moderate burn — page (lower urgency / business hours)
        # 6× burn rate: exhausts budget in 5 days
        # Windows: 6h + 30m
        - alert: PaymentsSLOBurnRateWarning
          expr: |
            (
              (
                sum(rate(http_requests_total{namespace="payments",code=~"5.."}[6h]))
                / sum(rate(http_requests_total{namespace="payments"}[6h]))
              ) > (6 * (1 - 0.999))
            )
            and
            (
              (
                sum(rate(http_requests_total{namespace="payments",code=~"5.."}[30m]))
                / sum(rate(http_requests_total{namespace="payments"}[30m]))
              ) > (6 * (1 - 0.999))
            )
          for: 15m
          labels:
            severity: warning
            team: payments
            slo: availability
          annotations:
            summary: "Payments SLO: Moderate burn rate — budget exhausts in ~5 days"

        # TIER 3: Slow burn — ticket (no page)
        # 3× burn rate: exhausts budget in 10 days
        # Windows: 1d + 2h
        - alert: PaymentsSLOBurnRateLow
          expr: |
            (
              (
                sum(rate(http_requests_total{namespace="payments",code=~"5.."}[1d]))
                / sum(rate(http_requests_total{namespace="payments"}[1d]))
              ) > (3 * (1 - 0.999))
            )
            and
            (
              (
                sum(rate(http_requests_total{namespace="payments",code=~"5.."}[2h]))
                / sum(rate(http_requests_total{namespace="payments"}[2h]))
              ) > (3 * (1 - 0.999))
            )
          for: 1h
          labels:
            severity: info
            team: payments
            slo: availability
          annotations:
            summary: "Payments SLO: Slow burn — create reliability ticket"

        # Budget exhaustion check (absolute remaining)
        - alert: PaymentsSLOBudgetExhausted
          expr: |
            (
              1 - (
                (1 - (
                  sum(rate(http_requests_total{namespace="payments",code!~"5.."}[30d]))
                  / sum(rate(http_requests_total{namespace="payments"}[30d]))
                )) / (1 - 0.999)
              )
            ) * 100 < 10
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payments SLO: Error budget below 10% remaining (30-day window)"
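The nested arithmetic in the PaymentsSLOBudgetExhausted expression is easier to follow step by step. The sketch below mirrors it with request counts instead of rates; `budget_remaining_percent` is a made-up helper, and treating zero traffic as a full budget is an assumption of this example (the PromQL expression would return no usable value in that case).

```python
def budget_remaining_percent(good_requests, total_requests, slo):
    """Percentage of the error budget still unspent over the SLO window."""
    if total_requests == 0:
        return 100.0  # assumption: no traffic means nothing was spent
    error_ratio = 1 - good_requests / total_requests  # 1 - good/total
    consumed = error_ratio / (1 - slo)                # fraction of budget used
    return (1 - consumed) * 100


# 99.9% SLO over one million requests allows 1,000 failures.
half_spent = budget_remaining_percent(999_500, 1_000_000, slo=0.999)
nearly_gone = budget_remaining_percent(999_050, 1_000_000, slo=0.999)

print(f"{half_spent:.1f}% remaining")
print(f"{nearly_gone:.1f}% remaining, alert fires: {nearly_gone < 10}")
```

Note that the result goes negative once more than the full budget is spent, which the `< 10` comparison in the alert still catches.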

Burn Rate Alert Thresholds Reference

| Tier | Burn Rate | Time to Exhaustion | Windows | Action | Severity |
|---|---|---|---|---|---|
| 1 (Fast) | 14.4× | ~2 days | 1h + 5m | Page immediately | critical |
| 2 (Moderate) | 6× | ~5 days | 6h + 30m | Page (business hours) | warning |
| 3 (Slow) | 3× | ~10 days | 1d + 2h | Create ticket | info |
| 4 (Steady) | 1× | 30 days (budget exhausted at period end) | 3d + 6h | Monitor | |
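Each tier's PromQL threshold and the share of budget spent by the time the long window fills follow directly from the burn rate. A rough sketch, with hypothetical helper names and the same 99.9% SLO and 30-day period assumed above:

```python
def error_rate_threshold(burn, slo):
    """Error ratio at which a given burn rate is reached (the PromQL threshold)."""
    return burn * (1 - slo)


def budget_spent_in_window(burn, window_hours, period_days=30):
    """Fraction of the period's budget consumed if `burn` is sustained for the window."""
    return burn * window_hours / (period_days * 24)


# (tier, burn rate, long window in hours), taken from the table above
tiers = [("fast", 14.4, 1), ("moderate", 6, 6), ("slow", 3, 24)]

for name, burn, window_hours in tiers:
    threshold = error_rate_threshold(burn, slo=0.999)
    spent = budget_spent_in_window(burn, window_hours)
    print(f"{name}: alert above {threshold:.4%} errors, "
          f"{spent:.0%} of budget spent after {window_hours}h")
```

Seen this way, the tiers differ mainly in how much budget is allowed to disappear before someone is told: a small slice for the paging tier, a larger one for the ticket tier.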

Notification Channels

PagerDuty Integration

receivers:
  - name: pagerduty-platform
    pagerduty_configs:
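      - routing_key: "$PAGERDUTY_INTEGRATION_KEY"   # placeholder: Alertmanager does not expand environment variables
        severity: '{{ .CommonLabels.severity }}'    # GroupLabels only holds the labels listed in group_by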
        description: '{{ .CommonAnnotations.summary }}'
        details:
          cluster: '{{ .CommonLabels.cluster }}'
          namespace: '{{ .CommonLabels.namespace }}'
          service: '{{ .CommonLabels.service }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
        links:
          - href: '{{ .CommonAnnotations.dashboard_url }}'
            text: Dashboard
          - href: '{{ .CommonAnnotations.runbook_url }}'
            text: Runbook
        # Auto-resolve in PD when alert resolves
        send_resolved: true

Slack Integration

receivers:
  - name: slack-engineering
    slack_configs:
      - api_url: "$SLACK_WEBHOOK_URL"
        channel: "#on-call-alerts"
        send_resolved: true
        icon_emoji: ':alert:'
        username: 'Alertmanager'
        title: |-
          {{ if eq .Status "firing" -}}
            :red_circle: [{{ .CommonLabels.severity | toUpper }}] {{ .CommonLabels.alertname }}
          {{- else -}}
            :large_green_circle: [RESOLVED] {{ .CommonLabels.alertname }}
          {{- end }}
        text: |-
          *Cluster:* {{ or .CommonLabels.cluster "n/a" }}
          *Namespace:* {{ or .CommonLabels.namespace "n/a" }}
          *Summary:* {{ .CommonAnnotations.summary }}
          *Firing alerts:* {{ .Alerts.Firing | len }}
          {{ range .Alerts.Firing -}}
          > {{ .Annotations.description }}
          {{ end }}
        actions:
          - type: button
            text: ':notebook: Runbook'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: ':grafana: Dashboard'
            url: '{{ (index .Alerts 0).Annotations.dashboard_url }}'

Email

receivers:
  - name: email-platform
    email_configs:
      - to: platform-on-call@company.com
        from: alertmanager@company.com
        smarthost: smtp.company.com:587
        auth_username: alertmanager@company.com
        auth_password: "$SMTP_PASSWORD"
        require_tls: true
        send_resolved: true
        html: '{{ template "email.html" . }}'
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} — {{ .CommonLabels.namespace }}'

OpsGenie

receivers:
  - name: opsgenie-critical
    opsgenie_configs:
      - api_key: "$OPSGENIE_API_KEY"
        message: '{{ .CommonLabels.alertname }} — {{ .CommonLabels.namespace }}'
        description: '{{ .CommonAnnotations.description }}'
        priority: |-
          {{ if eq .CommonLabels.severity "critical" }}P1
          {{- else if eq .CommonLabels.severity "warning" }}P3
          {{- else }}P5{{ end }}
        tags: '{{ .CommonLabels.team }},{{ .CommonLabels.namespace }}'
        details:
          runbook: '{{ .CommonAnnotations.runbook_url }}'
        responders:
          - name: platform-team
            type: team
        send_resolved: true

Webhook (Generic)

receivers:
  - name: webhook-jira
    webhook_configs:
      - url: https://automation.company.com/alertmanager-webhook
        send_resolved: true
        http_config:
          bearer_token: "$WEBHOOK_TOKEN"
        max_alerts: 10    # limit alerts per webhook call to prevent payload explosion

Grafana Unified Alerting

Grafana 9.0+ includes a unified alerting engine that supports multi-datasource alert rules (Prometheus, Loki, Tempo, CloudWatch, Elasticsearch) through a single interface. Notifications are routed by an Alertmanager built into Grafana, and Grafana can also forward alerts to an external Alertmanager. Teams that need log-based or trace-based alerts (not just metric-based) should consider Grafana alerting for those rule types.

Grafana vs Prometheus Alerting Comparison

| Aspect | PrometheusRule + Alertmanager | Grafana Unified Alerting |
|---|---|---|
| Rule storage | Kubernetes CRD (PrometheusRule) | Grafana database or provisioned YAML |
| Data sources | Prometheus only | Any Grafana data source |
| Log-based alerts | Via Loki ruler (PrometheusRule) | Native (Loki, Elasticsearch) |
| Trace-based alerts | Via Tempo metrics generator | Native (Tempo) |
| GitOps/CI support | Excellent (native K8s objects) | Via file provisioning or Terraform |
| Multi-condition alerts | Limited (single PromQL) | Yes (A && B conditions from different sources) |
| Alert state persistence | In Prometheus TSDB | In Grafana database |
| Routing | Alertmanager | Built-in or external Alertmanager |

Grafana Alert Rule Provisioning

# provisioning/alerting/rules.yaml (loaded by Grafana on startup)
apiVersion: 1
groups:
  - orgId: 1
    name: Payments SLO
    folder: SLOs
    interval: 1m
    rules:
      - uid: payments-slo-burn
        title: Payments High Burn Rate
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 3600
              to: 0
            datasourceUid: prometheus-uid
            model:
              expr: |
                sum(rate(http_requests_total{namespace="payments",code=~"5.."}[1h]))
                / sum(rate(http_requests_total{namespace="payments"}[1h]))
              instant: true
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus-uid
            model:
              expr: |
                sum(rate(http_requests_total{namespace="payments",code=~"5.."}[5m]))
                / sum(rate(http_requests_total{namespace="payments"}[5m]))
              instant: true
          # Threshold condition
          - refId: C
            datasourceUid: "-100"
            model:
              type: math
              expression: "$A > 0.0144 && $B > 0.0144"
        noDataState: NoData
        execErrState: Alerting
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Payments burn rate critical — dual window firing"
          runbook_url: "https://wiki/runbooks/payments/slo-burn-rate"
        isPaused: false

# Contact points provisioning
contactPoints:
  - orgId: 1
    name: PagerDuty
    receivers:
      - uid: pagerduty-receiver
        type: pagerduty
        settings:
          integrationKey: "$PAGERDUTY_KEY"
          severity: critical

# Notification policies provisioning
policies:
  - orgId: 1
    receiver: PagerDuty
    group_by: [alertname, namespace]
    group_wait: 30s
    repeat_interval: 4h
    routes:
      - receiver: PagerDuty
        matchers:
          - severity = critical

Alert Quality & Hygiene

Alert Anti-Patterns

Alerting on Symptoms That Don't Require Human Action

An alert that fires, gets checked by on-call, turns out to have auto-recovered, and is then closed wastes everyone's time. Every alert must require a human decision that a machine cannot make automatically. If the correct response is "wait for it to recover," don't alert.

Threshold Not Matched to SLO

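Alerting at a 0.1% error rate when your SLO allows 0.1% is meaningless: at that rate you are spending the error budget exactly as fast as the SLO permits (burn rate 1), so by definition you are within budget. Alert thresholds must be derived from SLO burn rates, not arbitrary percentages.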
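As a rough illustration of deriving a threshold from a burn rate, the rule below assumes a 99.9% SLO and a 14.4x fast-burn tier. The rule name, group name, and runbook URL are placeholders for this sketch, not values taken from a real deployment.

```yaml
# Sketch only: the 99.9% SLO, rule name, and URL are assumptions.
groups:
  - name: payments-slo-example
    rules:
      - alert: PaymentsErrorBudgetFastBurn
        # Error budget = 1 - 0.999 = 0.001.
        # A 14.4x burn rate means an error ratio above 14.4 * 0.001 = 0.0144.
        # Requiring both windows keeps the alert from lingering after recovery.
        expr: |
          (
            sum(rate(http_requests_total{namespace="payments",code=~"5.."}[1h]))
              / sum(rate(http_requests_total{namespace="payments"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{namespace="payments",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{namespace="payments"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Payments is burning error budget at more than 14.4x"
          runbook_url: "https://wiki/runbooks/payments/slo-burn-rate"
```

Writing the threshold as a burn rate times the budget keeps the link to the SLO visible: if the SLO changes to 99.5%, only the 0.001 factor changes to 0.005.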

Missing Runbook Links

An alert without a runbook URL forces the on-call engineer to spend 10–20 minutes figuring out what to do. Every alert must include runbook_url and dashboard_url annotations. A missing runbook should block a new alert from shipping.

High False Positive Rate

Once an on-call team learns that >20% of alert pages are false positives, they begin ignoring alerts. Track false positive rate per alert and remove or retune any alert above 10% false positive rate.

Alerting on Every Individual Pod

Alerting when one pod has high CPU in a 20-pod deployment is almost always noise. Alert on service-level aggregates (sum/avg across pods) unless the use case specifically requires per-pod alerting.
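A service-level aggregate might look like the sketch below. It assumes cAdvisor and kube-state-metrics are scraped, and that the deployment's pods match the hypothetical name pattern api-.*; the 85% threshold and alert name are likewise illustrative.

```yaml
# Sketch only: pod name pattern, threshold, and alert name are assumptions.
groups:
  - name: payments-saturation-example
    rules:
      - alert: PaymentsApiCpuSaturated
        # CPU used divided by CPU limit, each summed over every pod in the
        # deployment, so one hot pod out of 20 does not trip the alert.
        # container!="" drops the pod-level cgroup series that cAdvisor
        # also exports, which would otherwise double-count usage.
        expr: |
          sum(rate(container_cpu_usage_seconds_total{namespace="payments",pod=~"api-.*",container!=""}[5m]))
            /
          sum(kube_pod_container_resource_limits{namespace="payments",pod=~"api-.*",resource="cpu"})
            > 0.85
        for: 10m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "payments api is using more than 85% of its total CPU limit"
```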

No Severity Tiers

Without severity levels, everything is equally urgent. This leads to either alert fatigue (every warning pages) or missed incidents (people stop looking). Use at minimum: critical (page), warning (Slack), info (ticket).
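The three tiers can be expressed as a small Alertmanager route tree. This is a minimal sketch: the receiver names are hypothetical and their integration settings are left out, so the fragment only shows how the severity label selects a destination.

```yaml
# alertmanager.yaml fragment (sketch; receiver names are assumptions)
route:
  receiver: default-slack              # fallback for alerts with no severity label
  group_by: [alertname, namespace]
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall       # pages a human
    - matchers: ['severity="warning"']
      receiver: team-slack             # visible in chat, but no page
    - matchers: ['severity="info"']
      receiver: ticket-webhook         # opens a ticket through a webhook bridge
      repeat_interval: 24h             # re-notify at most once a day

receivers:
  # Integration settings (pagerduty_configs, slack_configs, webhook_configs)
  # are omitted in this sketch.
  - name: default-slack
  - name: pagerduty-oncall
  - name: team-slack
  - name: ticket-webhook
```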

Alert Review Checklist

  1. Does this alert require immediate human action? If the correct response is "watch for 30 minutes and if it gets worse, act," it is a warning, not a critical page. If it is a page, on-call must be able to act within 5 minutes of receiving it.
  2. Is there a runbook? The runbook must exist before the alert is enabled. The runbook must include: symptoms, potential causes, diagnostic commands, and resolution steps. A runbook that just says "check the logs" is not a runbook.
  3. Is the threshold calibrated? Use 2–4 weeks of historical data to set thresholds. A threshold at the 99.9th percentile of normal traffic will have near-zero false positives. A threshold at the 90th percentile will page constantly.
  4. Does this duplicate an existing alert? Two alerts that fire simultaneously for the same root cause double the on-call burden. Use inhibition rules so only the most specific alert pages.
  5. Is for: appropriate? A for: 0m on a noisy metric causes immediate paging on transient spikes. A for: 15m on a critical service outage causes 15 minutes of undetected downtime.
  6. Has this been reviewed by on-call? The engineers who receive pages are the correct reviewers for alert quality. A platform engineer who never gets paged should not unilaterally decide what gets paged.
  7. Is there a test? Alert rules should be tested via promtool or in a staging environment. Untested alert rules often contain subtle PromQL bugs that prevent them from ever firing when they should.
  8. Is there an owner? Every alert must have a team label so it routes to the correct team. Alerts without a team label fall through to a generic "default" receiver, where they are routinely ignored.

Testing Alert Rules with promtool

# Run the unit tests for the PromQL alert expressions
promtool test rules payments-alert-tests.yaml

# payments-alert-tests.yaml
rule_files:
  - payments-alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate 5% error rate (should trigger PaymentsHighErrorRate)
      - series: 'http_requests_total{code="200",namespace="payments"}'
        values: '0+95x10'    # 95 successful requests per minute for 10 minutes
      - series: 'http_requests_total{code="500",namespace="payments"}'
        values: '0+5x10'     # 5 errors per minute (5% error rate)

    alert_rule_test:
      - eval_time: 2m        # check state at 2 minutes
        alertname: PaymentsHighErrorRate
        exp_alerts: []       # Pending, not firing: rate() has no value until the second sample at 1m, so for: 2m cannot elapse before 3m

      - eval_time: 4m
        alertname: PaymentsHighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: payments
              namespace: payments
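            # promtool compares the full label and annotation sets, so list
            # every annotation the rule defines (summary, description, ...)
            exp_annotations:
              runbook_url: "https://wiki.company.com/runbooks/payments/high-error-rate"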
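A negative case is worth testing as well, so that a rule which fires on healthy traffic is caught before it pages anyone. The sketch below uses a hypothetical file name and assumes only that PaymentsHighErrorRate fires on a positive error ratio. With zero errors in the input it should stay silent whatever the exact threshold is.

```yaml
# payments-alert-negative-tests.yaml (hypothetical file name)
rule_files:
  - payments-alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Healthy traffic: 100 successful requests per minute, no errors at all
      - series: 'http_requests_total{code="200",namespace="payments"}'
        values: '0+100x15'
      - series: 'http_requests_total{code="500",namespace="payments"}'
        values: '0+0x15'

    alert_rule_test:
      - eval_time: 15m
        alertname: PaymentsHighErrorRate
        exp_alerts: []   # error ratio is 0, so nothing should be firing
```

Run it the same way as the positive test: promtool test rules payments-alert-negative-tests.yaml.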

Metrics, Self-Monitoring & Runbooks

Alertmanager Self-Metrics

Metric | Alert Threshold | Meaning
alertmanager_alerts{state="active"} | (none) | Current count of firing alerts
alertmanager_notifications_total | (none) | Total notifications sent, per integration
alertmanager_notifications_failed_total | rate > 0 | Failed notification deliveries; on-call may not receive pages
alertmanager_cluster_members | < 3 for HA | Number of Alertmanager cluster members; watch for split-brain
alertmanager_silences{state="active"} | (none) | Active silences; an unexpectedly high count may indicate over-suppression
alertmanager_config_hash | Changes unexpectedly | Configuration reload detection
prometheus_rule_evaluation_failures_total | rate > 0 | Rule evaluation errors; alerts may not be firing correctly
prometheus_rule_evaluation_duration_seconds | p99 > eval interval | Rules taking longer than the evaluation interval; evaluations are missed

Meta-Alerts (Alerts About Alerting)

groups:
  - name: alerting-infrastructure
    rules:
      - alert: AlertmanagerNotificationsFailing
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 5m
        labels: {severity: critical}
        annotations:
          summary: "Alertmanager failing to send notifications — on-call may not receive pages"
          description: "Integration {{ $labels.integration }} failing. Check credentials and endpoint connectivity."

      - alert: AlertmanagerClusterMembersMissing
        expr: alertmanager_cluster_members < 3
        for: 3m
        labels: {severity: warning}
        annotations:
          summary: "Alertmanager cluster has fewer than 3 members — HA degraded"

      - alert: PrometheusRuleEvaluationFailures
        expr: rate(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 3m
        labels: {severity: warning}
        annotations:
          summary: "Prometheus rule evaluation failures — some alerts may not fire"

      - alert: PrometheusRuleEvaluationSlow
        expr: |
          prometheus_rule_group_last_duration_seconds
            > prometheus_rule_group_interval_seconds
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "Rule group {{ $labels.rule_group }} evaluation exceeds its interval"
          description: "Evaluation taking {{ $value | humanizeDuration }}. Simplify expressions or add recording rules."

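      # ALERTS is the synthetic series Prometheus writes for every pending or firing alert.
      # This catches a missing or broken Watchdog rule. A full Prometheus or Alertmanager
      # outage must be caught by the external heartbeat service that receives the Watchdog.
      - alert: WatchdogSilent
        expr: absent(ALERTS{alertname="Watchdog", alertstate="firing"})
        for: 5m
        labels: {severity: critical}
        annotations:
          summary: "Watchdog alert is not firing; the dead man's switch heartbeat will stop"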

Runbooks

Alert Not Firing When Expected

  1. Check Prometheus /alerts UI — is it Pending or Inactive?
  2. Run alert expression in Prometheus /graph — does it return values above threshold?
  3. Check PrometheusRule has correct label matching Prometheus ruleSelector
  4. Check prometheus_rule_evaluation_failures_total for errors
  5. Verify the metric name and labels against live data: run kubectl port-forward svc/prometheus 9090, then query the raw metric at http://localhost:9090
  6. Run promtool test rules alerts-test.yaml to validate rule logic

Alert Firing But No Notification Received

  1. Check Alertmanager UI — is the alert shown in Active Alerts?
  2. Check if alert is silenced: amtool silence query
  3. Check if alert is inhibited: amtool alert query --inhibited, or tick the Inhibited checkbox on the Alertmanager UI Alerts page
  4. Check Alertmanager logs: kubectl logs -l app.kubernetes.io/name=alertmanager
  5. Check alertmanager_notifications_failed_total for the receiver
  6. Verify webhook/Slack URL and credentials in Alertmanager secret

Alert Fatigue / Too Many Pages

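  1. Run an alert frequency report in Prometheus: sort_desc(sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d]))). This ranks alerts by time spent firing; alertmanager_notifications_total carries no alertname label, so it can only be broken down per integration.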
  2. Identify top-5 noisiest alerts
  3. For each: increase for duration, raise threshold, or convert to info/warning
  4. Check if symptom alerts should be inhibited by a root-cause alert
  5. Review false positive rate: did each page require human action?

Alertmanager Split-Brain (HA)

  1. Check alertmanager_cluster_members on each replica
  2. Check gossip connectivity: amtool cluster show
  3. Verify that NetworkPolicy allows pod-to-pod traffic on the mesh port (9094, both TCP and UDP)
  4. If peers still do not rejoin, restart the Alertmanager pods so they re-resolve their peers and reform the cluster
  5. Check PV mounts — volume contention can cause split-brain

Best Practices

  1. Every alert must have a runbook. A runbook URL in annotations is not optional. An alert without a runbook is not production-ready. The runbook must describe: what triggered, likely causes, diagnostic steps, and resolution.
  2. Use multi-window burn rate for SLO alerts. Simple threshold alerts (error rate > X%) miss slow burns. Multi-window burn rate alerts pair a long window with a short one at several burn-rate tiers (for example 1h + 5m for fast burns and 3d + 6h for slow ones), catching both fast and slow budget consumption while maintaining low false positive rates.
  3. Deduplicate with inhibition rules. Every cascading failure (node down → pods failing → app errors) generates dozens of alerts. Define inhibition rules so only the root-cause alert pages, suppressing all downstream symptoms.
  4. Test alert rules before deploying. Use promtool test rules to write unit tests for alert expressions. An untested alert rule that never fires during an actual incident is worse than no alert.
  5. Implement a deadman's switch. The Watchdog alert (always-firing) confirms end-to-end pipeline health. Route it to an external heartbeat service. If Prometheus or Alertmanager fails, you detect it within minutes instead of hours.
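  6. Store Alertmanager configuration in Git, and keep credentials out of it so they can be rotated. Credentials (PagerDuty integration keys, Slack webhooks) belong in Kubernetes Secrets referenced by the config, never in plaintext YAML. Alertmanager does not expand environment variables in its configuration file, so use the *_file fields (for example api_url_file or routing_key_file) pointing at mounted Secrets, or the Secret key references in the AlertmanagerConfig CRD.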
  7. Review alert noise weekly. Track notifications per alert over the past 7 days. Any alert generating >5 pages/week that did not result in actual user-impacting incidents should be retuned or removed.
  8. Separate alert routing from alert rules. Use PrometheusRule for expressions and AlertmanagerConfig CRD for routing. Teams own their alert rules (in their namespace); platform owns the global Alertmanager config. This separation enables self-service alerting.
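
One way to keep credentials out of the Git-tracked config (practice 6) is to mount a Secret into the Alertmanager pods and point the receivers at the mounted files. The sketch below assumes the Prometheus Operator and an Alertmanager release recent enough to support the *_file fields. The Secret name, key names, and channel are hypothetical.

```yaml
# Alertmanager custom resource: mount a Secret into the pods.
# The operator mounts each listed Secret at /etc/alertmanager/secrets/<secret-name>/
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 3
  secrets:
    - alertmanager-credentials   # hypothetical Secret name
---
# alertmanager.yaml fragment: read credentials from the mounted files
# (key names below are assumptions; route tree omitted)
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/alertmanager-credentials/pagerduty-routing-key
  - name: team-slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/alertmanager-credentials/slack-webhook-url
        channel: "#payments-alerts"
```

Rotating a credential then means updating the Secret; the Git-tracked configuration does not change.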

Coverage Details

  • Alert lifecycle states: Inactive → Pending → Firing → Resolved with timeline diagram
  • Alert state table: visibility, notification behavior per state
  • for: duration guidance table by severity (critical:0–1m, warning:5–15m, info:30m–1h)
  • PrometheusRule CRD: full YAML with recording rule + alerting rule, labels, annotations with humanize filters
  • Prometheus ruleSelector and ruleNamespaceSelector in Prometheus CR
  • Alert labels reference table: severity, team, service, namespace, env, page
  • Annotation template variables: $labels, $value, humanize/humanizePercentage/humanizeDuration/humanize1024 filters
  • Alertmanager architecture diagram: ingress → pipeline (silence/inhibition/routing/grouping/timing) → receivers
  • HA gossip protocol: all Prometheus instances send to all Alertmanager replicas, gossip deduplicates
  • Alertmanager CR: replicas 3, retention 120h, storage PVC, alertmanagerConfigSelector
  • Full alertmanager.yaml: global (resolve_timeout), templates, route tree (critical→PD, team→slack, staging→suppress)
  • Receiver configs: PagerDuty (routing_key, severity, details, links), Slack (color/title/text templates, action buttons with runbook/dashboard/silence links), email, OpsGenie, webhook
  • Inhibition rules: critical suppresses warning (same service), node-down suppresses pod alerts, cluster CPU suppresses namespace quota, maintenance window label
  • AlertmanagerConfig CRD: namespaced routing for team self-service, slackConfigs with secret reference, inhibitRules
  • Custom notification templates: slack.title, slack.text, pagerduty.description, slack.silence_url (Go text/template)
  • Routing decision tree diagram: first-match traversal, continue:true behavior
  • group_by cardinality control callout
  • Timing parameters table: group_wait/group_interval/repeat_interval/resolve_timeout
  • 5 inhibition rule patterns with YAML
  • amtool CLI: silence add/query/expire, alert query/filter, dry-run flag
  • Alertmanager API: POST /api/v2/silences, GET silences, DELETE silence
  • Deadman's switch: Watchdog alert (vector(1)), route to external heartbeat, repeat_interval 50s
  • Burn rate formula: error_rate / (1 - SLO) with example
  • Multi-window burn rate alert: 4 tiers (14.4×/6×/3×/1×) with window pairs (1h+5m, 6h+30m, 1d+2h, 3d+6h)
  • Full PrometheusRule for SLO burn rate: recording rules (5m/30m/1h/2h/6h/1d/3d), 3-tier alert rules + budget exhausted alert
  • Burn rate thresholds reference table (tier/rate/time-to-exhaustion/windows/action/severity)
  • PagerDuty receiver: routing_key, severity, description, details, links, send_resolved
  • Slack receiver: color conditional, title/text template, action buttons (runbook/dashboard)
  • Email, OpsGenie, and webhook receiver configs
  • Grafana vs PrometheusRule alerting comparison table (8 aspects)
  • Grafana alert rule provisioning YAML: multi-condition (A && B from Prometheus), contact points, notification policies
  • 6 alert anti-patterns: no human action, threshold not from SLO, missing runbook, false positives, per-pod alerting, no severity tiers
  • 8-point alert review checklist: human action, runbook, threshold, deduplication, for: duration, on-call review, testing, owner
  • promtool test rules: payments-alert-tests.yaml with input_series, alert_rule_test, eval_time, exp_alerts
  • 8 Alertmanager self-metrics with thresholds
  • 5 meta-alert PrometheusRule rules: NotificationsFailing, ClusterMembersMissing, RuleEvaluationFailures, RuleEvaluationSlow, WatchdogSilent
  • 4 runbooks: alert not firing, no notification received, alert fatigue, Alertmanager split-brain
  • 8 best practices: runbook required, multi-window burn rate, inhibition deduplication, promtool testing, deadman's switch, secrets rotation, weekly noise review, rule/routing separation