Runbooks

Overview

A runbook is the authoritative guide for operating a service in production. It answers the question every on-call engineer faces at 2 AM: "This alert fired — what do I do now?"

Good runbooks are:

- Linked from alerts: the alert annotation should contain a direct URL to the relevant runbook section
- Action-oriented: commands to copy-paste, not paragraphs to read
- Verified: someone has actually followed the steps and they work
- Maintained: updated whenever the system changes

A runbook that is not linked from the alert that triggered it has near-zero value in an incident.

Runbook Standard Template

# [Service Name] — [Alert Name] Runbook

**Service:** payments-api
**Alert:** PaymentProcessingErrorRateHigh
**Severity:** P1 (Critical) / P2 (High) / P3 (Medium)
**Owner:** team-payments
**Last verified:** 2026-04-01

---

## Summary

One-paragraph description of what this alert means and why it fires.

## Impact

- Who is affected: end users making payments / downstream reconciliation service
- Business impact: revenue loss at ~$X/minute when error rate > 5%
- SLO budget burn: this alert fires at 5x burn rate → budget exhausted in one fifth of the SLO window (~6 days for a 30-day window)

## Diagnosis

### Step 1: Verify the alert is real

kubectl exec -n production deploy/payments-api -- \
  curl -s localhost:8080/metrics | grep http_requests_total

kubectl get pods -n production -l app=payments-api


### Step 2: Check recent deployments

kubectl rollout history deployment/payments-api -n production
argocd app history payments-api


### Step 3: Check error type in logs

stern -l app=payments-api -n production --since=10m | grep ERROR


Common error patterns:
| Log pattern | Likely cause |
|-------------|--------------|
| `connection refused` to postgres | Database connection pool exhausted |
| `upstream connect error` | Stripe API unavailable |
| `context deadline exceeded` | Pod overloaded / memory pressure |

## Resolution

### Case A: Recent deployment caused regression

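# If the service is deployed with Argo Rollouts: abort the in-progress rollout
kubectl argo rollouts abort payments-api -n production

# If it is a plain Deployment: roll back to the previous revision
kubectl rollout undo deployment/payments-api -n production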


### Case B: Database connection exhaustion

kubectl exec -n production deploy/payments-api -- \
  curl -s localhost:8080/debug/db/stats | jq .

kubectl rollout restart deployment/payments-api -n production


### Case C: Stripe API outage (external)

1. Check https://status.stripe.com
2. If Stripe is down: enable maintenance mode via feature flag `payment-processing-enabled`
3. Notify customer success team (#cs-team Slack channel)
4. Set up Stripe webhook replay once service recovers

## Escalation

- P1: Page team-payments lead + engineering manager immediately
- Not resolved in 15 min: escalate to Director of Engineering
- Contacts: [PagerDuty service P1234AB](https://acme.pagerduty.com/service-directory/P1234AB)

## Post-Incident

- File incident report within 24 hours in Linear (project: INCIDENTS)
- Update this runbook if any steps were missing or wrong
- Add any new diagnostic commands discovered during incident
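
The "SLO budget burn" line in the Impact section can be computed rather than estimated. A minimal sketch, assuming the usual definitions (burn rate = observed error ratio / (1 - SLO target); time to exhaustion = SLO window / burn rate) and a full budget at the start of the window. The function names and the 99.9% / 30-day figures in the comments are illustrative assumptions, not values from this service.

```shell
# burn_rate ERROR_RATIO SLO_TARGET
# Prints how many times faster than sustainable the error budget is burning.
burn_rate() {
  awk -v err="$1" -v slo="$2" 'BEGIN { printf "%.1f\n", err / (1 - slo) }'
}

# hours_to_exhaustion BURN_RATE WINDOW_DAYS
# Prints the hours until a full budget is spent at the given burn rate.
hours_to_exhaustion() {
  awk -v rate="$1" -v days="$2" 'BEGIN { printf "%.1f\n", days * 24 / rate }'
}

# Example (assumed 99.9% SLO, 30-day window, 5% error ratio):
#   burn_rate 0.05 0.999          # 50.0
#   hours_to_exhaustion 50 30     # 14.4
```

Fill the template with the numbers this produces for the alert's actual threshold, so the stated burn rate and time to exhaustion agree with each other.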

Runbook for Common Kubernetes Operations

The following are generic runbook sections applicable to any service. Adapt to your service.

Pod Crash Loop (CrashLoopBackOff)

# Step 1: Identify the failing pod
kubectl get pods -n <namespace> | grep CrashLoop

# Step 2: Get crash reason from the terminated container
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Last State:"
# Look for: Exit Code, Reason (OOMKilled / Error / Completed)

# Step 3: Read logs from the last crash (not current attempt)
kubectl logs <pod-name> -n <namespace> --previous

# Step 4: Diagnosis by exit code
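# Exit 0   → Process exited cleanly; the container's main command is not long-running (check command/args)
# Exit 1   → Application error (check logs)
# Exit 137 → Killed by SIGKILL (128+9); usually OOMKilled, confirm with Reason (increase memory limit)
# Exit 143 → Terminated by SIGTERM (128+15); check for failed liveness probes or evictions, and fix graceful shutdown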

# Step 5: OOMKilled → increase memory limit
kubectl patch deployment <name> -n <namespace> \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"512Mi"}}}]}}}}'
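
The exit-code table in Step 4 follows one rule: codes above 128 mean the process was killed by signal number (code - 128). A small sketch of that rule as a helper; `explain_exit_code` is a hypothetical name, and the hints it prints are starting points, not a diagnosis.

```shell
# explain_exit_code CODE
# Maps a container exit code to a first diagnostic hint.
explain_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))                 # exit code = 128 + signal number
    case "$sig" in
      9)  echo "killed by SIGKILL (signal 9): often the OOM killer, confirm Reason is OOMKilled" ;;
      15) echo "terminated by SIGTERM (signal 15): check liveness probes, evictions, shutdown handling" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  elif [ "$code" -eq 0 ]; then
    echo "exited cleanly: main process finished, check the container command"
  else
    echo "application error (exit $code): check logs from the previous container"
  fi
}

# Usage against a live pod (reads the last terminated state of the first container):
#   explain_exit_code "$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```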

Service Not Receiving Traffic

# Step 1: Verify pods are Ready
kubectl get pods -n <namespace> -l app=<service> -o wide

# Step 2: Verify Service selector matches pod labels
kubectl get svc <service> -n <namespace> -o yaml | grep selector -A5
kubectl get pods -n <namespace> --show-labels | grep <service>

# Step 3: Verify Endpoints are populated
kubectl get endpoints <service> -n <namespace>
# Empty endpoints = selector mismatch or no Ready pods

# Step 4: Test connectivity from another pod
kubectl run test --image=nicolaka/netshoot --rm -it -- \
  curl -v http://<service>.<namespace>.svc.cluster.local:<port>/healthz

# Step 5: Check NetworkPolicy isn't blocking
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <name> -n <namespace>
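
Step 3 (empty endpoints) is the check most worth scripting, since it separates "no Ready pods or selector mismatch" from network-level problems. A minimal sketch, assuming the classic Endpoints object is available in the cluster; `has_ready_endpoints` is a hypothetical helper name.

```shell
# has_ready_endpoints SERVICE NAMESPACE
# Succeeds only if the Service's Endpoints object lists at least one ready address.
# Returns 1 when the list is empty, 2 when kubectl itself fails.
has_ready_endpoints() {
  ips=$(kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}') || return 2
  if [ -z "$ips" ]; then
    echo "no ready endpoints for $1 in $2: check selector and pod readiness" >&2
    return 1
  fi
  echo "ready endpoints: $ips"
}

# Usage:
#   has_ready_endpoints <service> <namespace> || echo "go back to Steps 1 and 2"
```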

High Memory Usage / OOM Risk

# Step 1: Check current memory usage vs limit
kubectl top pods -n <namespace> -l app=<service>

# Step 2: Check if approaching limit (>80% is concerning)
kubectl get pod <pod-name> -n <namespace> -o json | \
  jq '.spec.containers[0].resources'

# Step 3: Enable heap profiling (Go)
kubectl port-forward <pod-name> 6060:6060 -n <namespace> &
curl http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof -http=:8081 heap.prof

# Step 4: Temporary relief — rolling restart to reclaim leaked memory
kubectl rollout restart deployment/<name> -n <namespace>
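
To turn Steps 1 and 2 into a number that can be compared against the 80% guideline, the usage from `kubectl top` and the configured limit both need converting to bytes. A rough sketch; `to_bytes` and `mem_pct` are hypothetical helpers that handle only the common Kubernetes suffixes (Ki, Mi, Gi, k, M, G).

```shell
# to_bytes QUANTITY   (e.g. 512Mi, 1Gi, 500M, 1048576)
# Converts a Kubernetes memory quantity to bytes; fails on unknown suffixes.
to_bytes() {
  echo "$1" | awk '{
    n = $0; sub(/[A-Za-z]+$/, "", n)      # numeric part
    u = $0; sub(/^[0-9.]+/, "", u)        # unit suffix, may be empty
    m["Ki"] = 1024; m["Mi"] = 1024^2; m["Gi"] = 1024^3
    m["k"]  = 1000; m["M"]  = 1000^2; m["G"]  = 1000^3
    m[""]   = 1
    if (!(u in m)) exit 1
    printf "%.0f\n", n * m[u]
  }'
}

# mem_pct USED LIMIT
# Prints usage as a whole-number percentage of the limit.
mem_pct() {
  used=$(to_bytes "$1") || return 1
  limit=$(to_bytes "$2") || return 1
  awk -v u="$used" -v l="$limit" 'BEGIN { printf "%.0f\n", 100 * u / l }'
}

# Usage: values copied from `kubectl top pods` and the pod spec
#   [ "$(mem_pct 412Mi 512Mi)" -ge 80 ] && echo "above 80% of limit"
```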

Deployment Stuck (Rollout Not Progressing)

# Step 1: Check rollout status
kubectl rollout status deployment/<name> -n <namespace>

# Step 2: Check ReplicaSet events
kubectl describe deployment/<name> -n <namespace> | tail -30

# Step 3: Common causes
# - Image pull failure → check imagePullSecrets, registry auth
kubectl get events -n <namespace> --field-selector reason=Failed | grep Pull

# - Readiness probe failing → new pods start but never become Ready
kubectl describe pod <new-pod-name> -n <namespace> | grep -A10 Conditions

# - PodDisruptionBudget blocking → can't remove old pods
kubectl get pdb -n <namespace>

# Step 4: Rollback if stuck > 10 minutes
kubectl rollout undo deployment/<name> -n <namespace>
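
The "stuck > 10 minutes" rule in Step 4 can be enforced with the `--timeout` flag of `kubectl rollout status`, which exits non-zero when the rollout does not finish in time. A hedged sketch; `rollout_or_undo` is a hypothetical wrapper, and automatic rollback is a policy choice that should be agreed on before it is scripted.

```shell
# rollout_or_undo DEPLOYMENT NAMESPACE [TIMEOUT]
# Waits for the rollout to finish; rolls back if it does not complete in time.
rollout_or_undo() {
  deploy="$1"; ns="$2"; timeout="${3:-10m}"
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    echo "rollout of $deploy completed"
  else
    echo "rollout of $deploy did not complete within $timeout, rolling back" >&2
    kubectl rollout undo "deployment/$deploy" -n "$ns"
  fi
}

# Usage:
#   rollout_or_undo <name> <namespace> 10m
```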

Runbook Publishing with TechDocs (Backstage)

TechDocs renders Markdown from the service repo into a searchable portal page.

# mkdocs.yml (at repo root)
site_name: Payments API Docs
docs_dir: docs/
nav:
  - Home: index.md
  - Architecture: architecture.md
  - Runbooks:
    - Overview: runbooks/index.md
    - Alerts: runbooks/alerts.md
    - Database: runbooks/database.md
    - Deployment: runbooks/deployment.md
  - API Reference: api.md

plugins:
  - techdocs-core

# Repo layout for TechDocs
repo-root/
  catalog-info.yaml        ← backstage.io/techdocs-ref: dir:.
  mkdocs.yml
  docs/
    index.md
    runbooks/
      index.md
      alerts.md
      database.md

# Preview TechDocs locally
npx @techdocs/cli serve

Runbook Quality Checklist

Before publishing a runbook, verify:

- [ ] Alert name in title matches exactly the Prometheus alert name
- [ ] Severity level matches PagerDuty routing policy
- [ ] Every command is copy-pasteable (no placeholders requiring mental substitution)
- [ ] Steps are numbered and ordered (diagnosis before resolution)
- [ ] Escalation contacts are current (check every quarter)
- [ ] Runbook URL is in the PrometheusRule alert annotation:
      annotations:
        runbook_url: https://backstage.internal/docs/payments-api/runbooks/alerts/#payment-processing-error-rate-high
- [ ] Someone other than the author has dry-run the steps
- [ ] "Last verified" date is within 6 months
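
The "Last verified" item is easy to check mechanically. A minimal sketch, assuming runbooks carry a `**Last verified:** YYYY-MM-DD` line as in the template above; `months_between` and `is_stale` are hypothetical helpers, and "stale" here means six or more whole months have elapsed.

```shell
# months_between FROM TO   (both YYYY-MM-DD); prints whole months elapsed
months_between() {
  awk -v from="$1" -v to="$2" 'BEGIN {
    split(from, a, "-"); split(to, b, "-")
    months = (b[1] - a[1]) * 12 + (b[2] - a[2])
    if (b[3] + 0 < a[3] + 0) months--     # current month not yet complete
    print months
  }'
}

# is_stale RUNBOOK_FILE TODAY
# Exit 0 if the runbook has no "Last verified" date or it is 6+ months old.
is_stale() {
  verified=$(sed -n 's/^\*\*Last verified:\*\* *\([0-9-]*\).*/\1/p' "$1" | head -n 1)
  [ -n "$verified" ] || return 0          # a missing date counts as stale
  [ "$(months_between "$verified" "$2")" -ge 6 ]
}

# Usage:
#   is_stale docs/runbooks/alerts.md "$(date +%Y-%m-%d)" && echo "needs re-verification"
```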

Alert Annotation — Linking Runbooks

# PrometheusRule — always include runbook_url
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-alerts
  namespace: monitoring
spec:
  groups:
  - name: payments.api
    rules:
    - alert: PaymentProcessingErrorRateHigh
      expr: |
        sum(rate(http_requests_total{
          service="payments-api",
          status_code=~"5.."
        }[5m])) /
        sum(rate(http_requests_total{service="payments-api"}[5m]))
        > 0.05
      for: 2m
      labels:
        severity: critical
        team: payments
        service: payments-api
      annotations:
        summary: "Payments API error rate above 5%"
        description: "Error rate is {{ $value | humanizePercentage }} (threshold 5%)"
        runbook_url: "https://backstage.internal/docs/payments-api/runbooks/alerts/#payment-processing-error-rate-high"
        dashboard_url: "https://grafana.internal/d/payments-api?var-namespace=production"
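
Whether every alert actually carries a `runbook_url` can be checked in CI. The sketch below is a line-based heuristic, not a YAML parser: it assumes one `- alert:` line per rule and a `runbook_url:` key on its own line, as in the rule above. `alerts_missing_runbook` is a hypothetical name.

```shell
# alerts_missing_runbook RULES_FILE
# Prints the name of every alert rule with no runbook_url annotation.
# Exit status is 1 if any such rule is found, 0 otherwise.
alerts_missing_runbook() {
  awk '
    function report() {
      if (name != "" && !has_url) { print name; missing = 1 }
    }
    /^[ \t]*-[ \t]*alert:/ { report(); name = $NF; has_url = 0 }
    /^[ \t]*runbook_url:/  { has_url = 1 }
    END { report(); exit (missing ? 1 : 0) }
  ' "$1"
}

# Usage:
#   alerts_missing_runbook payments-alerts.yaml || echo "add runbook_url to the alerts listed above"
```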

On-Call — how runbooks are used during incidents
09 — Production Overview — runbook standard template
09 — Incident Response — incident lifecycle