On-Call

Overview

On-call is the system by which a team ensures someone is always available to respond to production incidents. A well-designed on-call rotation is sustainable: engineers are not burned out, alerts are actionable, and incidents are resolved and learned from systematically.

Alert fires (PagerDuty)
      │
      ▼
On-call engineer receives page
      │
      ▼
Acknowledge within SLA (5 min P1 / 30 min P2)
      │
      ▼
Diagnose using runbook + dashboards
      │
      ├─ Resolved quickly → close, write timeline note
      │
      └─ Not resolved within 15 min → escalate
                                            │
                                            ▼
                                     Secondary on-call
                                     / Eng manager
                                            │
                                            ▼
                                     Incident declared
                                     Incident commander assigned
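
The decision points in the flow above can be sketched as a small function. This is only an illustration: the thresholds come from the diagram, while the function name, arguments, and returned strings are hypothetical and not part of any PagerDuty API.

```python
from datetime import timedelta

# Thresholds taken from the flow above; all names here are illustrative.
ACK_SLA = {"P1": timedelta(minutes=5), "P2": timedelta(minutes=30)}
ESCALATE_AFTER = timedelta(minutes=15)


def next_step(severity: str, since_page: timedelta, acked: bool, resolved: bool) -> str:
    """Return the next action for a page, following the flow diagram."""
    if resolved:
        return "close and write timeline note"
    if not acked and since_page > ACK_SLA[severity]:
        return "escalate: ack SLA missed"
    if since_page > ESCALATE_AFTER:
        return "escalate: not resolved within 15 min"
    return "diagnose using runbook and dashboards"


print(next_step("P1", timedelta(minutes=6), acked=False, resolved=False))
```

Treating a missed acknowledgement as an escalation trigger is an assumption based on the severity table ("escalate if not acked"); the diagram itself only shows the 15-minute resolution branch.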

PagerDuty Setup

Service and Escalation Policy

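# Create a service via PagerDuty CLI (pd)
# The escalation policy referenced here must already exist (created below).
#   --acknowledgement-timeout 300  : an acknowledged incident re-triggers after 5 min if still unresolved
#   --auto-resolve-timeout 14400   : incidents left open for 4h are auto-resolved
pd service create \
  --name "payments-api" \
  --escalation-policy "P1 Critical Payments" \
  --acknowledgement-timeout 300 \
  --auto-resolve-timeout 14400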

# Create escalation policy
pd escalation create \
  --name "P1 Critical Payments" \
  --rule "level=1,delay=0,targets=schedule:payments-primary" \
  --rule "level=2,delay=15,targets=schedule:payments-secondary" \
  --rule "level=3,delay=30,targets=user:eng-manager-id"
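
To reason about who has been paged at a given moment, the three rules above can be modelled in a few lines. This sketch assumes each `delay` is measured in minutes from the moment the incident triggers (PagerDuty itself may interpret delays per level), and the `Rule` class and `paged_targets` function are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    level: int
    delay_min: int  # assumed: minutes after trigger before this level is paged
    target: str


# Mirrors the --rule arguments of the escalation policy above.
RULES = [
    Rule(1, 0, "schedule:payments-primary"),
    Rule(2, 15, "schedule:payments-secondary"),
    Rule(3, 30, "user:eng-manager-id"),
]


def paged_targets(minutes_unacked: int) -> list[str]:
    """Targets that have been paged once an incident has gone unacknowledged this long."""
    ordered = sorted(RULES, key=lambda r: r.level)
    return [r.target for r in ordered if minutes_unacked >= r.delay_min]


print(paged_targets(20))
```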

Alertmanager → PagerDuty Integration

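# alertmanager/config.yaml
# Alertmanager does not expand $VARS in its config; render this file first
# (for example with envsubst) or use routing_key_file / api_url_file.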
global:
  pagerduty_url: https://events.pagerduty.com/v2/enqueue

route:
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: pagerduty-critical

  routes:
  - match:
      severity: critical
    receiver: pagerduty-critical
    continue: false

  - match:
      severity: warning
    receiver: pagerduty-warning
    continue: false

  - match:
      severity: info
    receiver: slack-info

receivers:
- name: pagerduty-critical
  pagerduty_configs:
  - routing_key: $PAGERDUTY_INTEGRATION_KEY_CRITICAL
    severity: critical
    client: "Alertmanager"
    client_url: "{{ .ExternalURL }}"
    description: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
    details:
      firing: "{{ .Alerts.Firing | len }}"
      resolved: "{{ .Alerts.Resolved | len }}"
      runbook: "{{ (index .Alerts 0).Annotations.runbook_url }}"
      dashboard: "{{ (index .Alerts 0).Annotations.dashboard_url }}"

- name: pagerduty-warning
  pagerduty_configs:
  - routing_key: $PAGERDUTY_INTEGRATION_KEY_WARNING
    severity: warning

- name: slack-info
  slack_configs:
  - api_url: $SLACK_WEBHOOK_URL
    channel: "#alerts-info"
    text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
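
The routing tree above can be checked mentally with a simplified model: child routes are tried in order, the first match wins (every route sets `continue: false` or relies on that default), and anything unmatched goes to the root receiver. The sketch below is a rough approximation of that behaviour, not Alertmanager's actual matcher, and `receiver_for` is a hypothetical name.

```python
# Child routes in the order they appear in the config above.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "pagerduty-warning"),
    ({"severity": "info"}, "slack-info"),
]
DEFAULT_RECEIVER = "pagerduty-critical"  # the root route's receiver


def receiver_for(labels: dict[str, str]) -> str:
    """Return the receiver of the first child route whose labels all match."""
    for match, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER


print(receiver_for({"alertname": "DiskFull", "severity": "warning"}))
print(receiver_for({"alertname": "Unlabelled"}))
```

Under this model an alert with no `severity` label falls through to the root receiver and pages `pagerduty-critical`, which is worth keeping in mind when adding new alert rules.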

Alert Severity Levels

Severity	Response SLA	Examples	Action
P1 Critical	Ack 5 min, resolve 30 min	SLO exhausted, payment outage, data loss risk	Page primary + escalate if not acked
P2 High	Ack 30 min, resolve 4h	Error rate elevated, latency degraded	Page primary on-call
P3 Medium	Business hours	Disk > 80%, cert expiry < 30d	Ticket created, no page
P4 Low	Next sprint	Non-critical config drift	Backlog item
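
One way to make the table above checkable is to encode it as data and compare incident timings against it. This is a minimal sketch; the `Policy` type and `sla_breaches` function are illustrative names, and P3/P4 are modelled as having no timed SLA because the table gives only "business hours" and "next sprint".

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Policy(NamedTuple):
    ack: Optional[timedelta]      # None means no timed ack SLA
    resolve: Optional[timedelta]  # None means no timed resolve SLA
    pages: bool


POLICIES = {
    "P1": Policy(timedelta(minutes=5), timedelta(minutes=30), True),
    "P2": Policy(timedelta(minutes=30), timedelta(hours=4), True),
    "P3": Policy(None, None, False),
    "P4": Policy(None, None, False),
}


def sla_breaches(severity: str, time_to_ack: timedelta, time_to_resolve: timedelta) -> list[str]:
    """Return which SLAs ("ack", "resolve") an incident missed."""
    policy = POLICIES[severity]
    breaches = []
    if policy.ack is not None and time_to_ack > policy.ack:
        breaches.append("ack")
    if policy.resolve is not None and time_to_resolve > policy.resolve:
        breaches.append("resolve")
    return breaches


print(sla_breaches("P1", timedelta(minutes=7), timedelta(minutes=25)))
```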

On-Call Rotation Schedule

# PagerDuty schedule via Terraform (team of 4 rotating weekly)
resource "pagerduty_schedule" "payments_primary" {
  name      = "Payments Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2026-01-06T09:00:00-05:00"
    rotation_virtual_start       = "2026-01-06T09:00:00-05:00"
    rotation_turn_length_seconds = 604800   # 1 week

    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.carol.id,
      pagerduty_user.dave.id,
    ]

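    # One 8-hour window per weekday: Mon-Fri 09:00-17:00.
    dynamic "restriction" {
      for_each = [1, 2, 3, 4, 5] # 1 = Monday ... 5 = Friday
      content {
        type              = "weekly_restriction"
        start_day_of_week = restriction.value
        start_time_of_day = "09:00:00"
        duration_seconds  = 28800 # 8 hours
      }
    }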
  }
}
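
For a quick sanity check of who the rotation above puts on call, the weekly turn arithmetic can be reproduced locally. This sketch uses the start time, turn length, and user order from the Terraform layer; it uses a fixed UTC-5 offset, so it ignores daylight saving time, overrides, and the layer's time restrictions, and `on_call` is a hypothetical helper rather than a PagerDuty call.

```python
from datetime import datetime, timedelta, timezone

USERS = ["alice", "bob", "carol", "dave"]  # same order as the Terraform layer
ROTATION_START = datetime(2026, 1, 6, 9, 0, tzinfo=timezone(timedelta(hours=-5)))
TURN = timedelta(seconds=604800)  # 1 week


def on_call(at: datetime) -> str:
    """Return the user whose turn covers the given timezone-aware datetime."""
    if at < ROTATION_START:
        raise ValueError("rotation has not started yet")
    turns_elapsed = (at - ROTATION_START) // TURN  # whole turns since the start
    return USERS[turns_elapsed % len(USERS)]


print(on_call(ROTATION_START + timedelta(days=8)))
```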

On-Call Hygiene Rules

No single point of failure: always have primary + secondary coverage
Handoff ritual: 15-minute sync between outgoing and incoming on-call; review open incidents, known fragile systems, pending deploys
No paging without runbook: every alert must have a runbook_url annotation before routing to PagerDuty
Weekday business hours first: new services start with P2-only alerting (no nighttime pages) until runbooks are complete and alerts are validated
4-week ramp for new engineers: shadow + secondary before primary on-call
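
The "no paging without runbook" rule lends itself to an automated check in CI. The sketch below assumes alert rules have already been loaded into dictionaries shaped like Prometheus rule entries (`alert`, `labels`, `annotations`) and that `critical` and `warning` are the severities routed to PagerDuty; the function name and sample rules are hypothetical.

```python
PAGING_SEVERITIES = {"critical", "warning"}  # assumed: severities routed to PagerDuty


def alerts_missing_runbook(rules: list[dict]) -> list[str]:
    """Return names of paging alerts that lack a non-empty runbook_url annotation."""
    missing = []
    for rule in rules:
        severity = rule.get("labels", {}).get("severity")
        runbook = rule.get("annotations", {}).get("runbook_url", "").strip()
        if severity in PAGING_SEVERITIES and not runbook:
            missing.append(rule["alert"])
    return missing


# Illustrative rules; the URL is a placeholder.
SAMPLE_RULES = [
    {
        "alert": "PaymentProcessingErrorRateHigh",
        "labels": {"severity": "critical"},
        "annotations": {"runbook_url": "https://runbooks.example.com/payments"},
    },
    {"alert": "DiskAlmostFull", "labels": {"severity": "warning"}, "annotations": {}},
    {"alert": "ConfigDrift", "labels": {"severity": "info"}, "annotations": {}},
]

print(alerts_missing_runbook(SAMPLE_RULES))
```

Failing the build when the returned list is non-empty would enforce the rule before an alert ever reaches PagerDuty.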

Incident Response Lifecycle

Severity Classification

P1 — Customer-facing service completely down, data loss risk, or SLO
     exhaustion within 1 hour
P2 — Significant degradation affecting >10% of users, or P1 trajectory
P3 — Minor degradation, single user affected, or warning threshold crossed
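
The classification above can be expressed as a short decision function, which helps keep triage consistent across responders. This is a sketch only: the parameter names are invented for illustration, and "P1 trajectory" is modelled as an explicit flag because the text does not define how it is measured.

```python
from typing import Optional


def classify(
    *,
    fully_down: bool = False,
    data_loss_risk: bool = False,
    hours_to_slo_exhaustion: Optional[float] = None,
    pct_users_degraded: float = 0.0,
    trending_to_p1: bool = False,
) -> str:
    """Map incident signals to P1/P2/P3 using the thresholds listed above."""
    slo_at_risk = hours_to_slo_exhaustion is not None and hours_to_slo_exhaustion <= 1
    if fully_down or data_loss_risk or slo_at_risk:
        return "P1"
    if pct_users_degraded > 10 or trending_to_p1:
        return "P2"
    return "P3"


print(classify(pct_users_degraded=25))
```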

Incident Commander Responsibilities

For P1 incidents, assign an Incident Commander (IC) separate from the engineers debugging the issue. The IC:

Declares the incident in Slack (/incident declare)
Assigns the remaining roles: Tech Lead and Comms Lead
Posts initial status update within 10 minutes
Keeps the war room (Slack channel or Zoom) on track
Drives the response toward resolution and leaves hands-on diagnosis to the Tech Lead
Coordinates customer communication via Comms Lead

Incident Slack Workflow

# Using Rootly / FireHydrant / Slack workflow
/incident declare
# → Creates #incident-2026-0524-payments channel
# → Pages IC and Tech Lead
# → Opens incident timeline

# During incident — keep timeline updated
/incident update "Identified root cause: connection pool exhaustion introduced in payments-api v1.14.1"
/incident update "Mitigation: rolled back to v1.14.0, error rate returning to baseline"
/incident resolved
# → Sends resolution notification
# → Creates postmortem doc

Postmortem Process

Blameless postmortems focus on systems and processes, not individuals. The goal is to prevent recurrence, not assign fault.

Postmortem Template

# Postmortem: [Incident Title]

**Date:** 2026-05-24
**Duration:** 47 minutes (14:23 – 15:10 UTC)
**Severity:** P1
**Services affected:** payments-api, reconciliation-worker
**Authors:** @alice, @bob
**Status:** Draft / In Review / Final

---

## Summary

One paragraph: what happened, what the user impact was, how it was resolved.

## Timeline (UTC)

| Time | Event |
|------|-------|
| 14:23 | PagerDuty alert: PaymentProcessingErrorRateHigh |
| 14:25 | @alice acknowledged, began investigation |
| 14:31 | Identified: payments-api v1.14.1 deployed at 14:15 |
| 14:34 | Rollback initiated via Argo Rollouts |
| 14:38 | Error rate normalising |
| 15:10 | Error rate < 0.1%, incident resolved |

## Root Cause

payments-api v1.14.1 introduced a change to database connection pool
initialisation that caused connection exhaustion under moderate load. The
issue was not caught in staging because staging uses 1/10th the connection
pool size.

## Contributing Factors

- Staging connection pool size not representative of production
- No connection pool exhaustion metric was being alerted on
- Code review did not catch the pool sizing change

## Detection

Alert fired 8 minutes after deploy. MTTD: 8 minutes.
The alert threshold (5% error rate) was reached before most users noticed.

## Resolution

Rolled back to v1.14.0 via `kubectl argo rollouts abort payments-api`.
Error rate began returning to baseline within 4 minutes of the rollback and fell below 0.1% by 15:10.

## Impact

- 23 failed payment requests (0.03% of traffic in the window)
- 0 data loss events
- Error budget consumed: 12 minutes of P1-equivalent downtime

## Action Items

| Action | Owner | Due |
|--------|-------|-----|
| Add connection pool utilization alert (warn > 70%, critical > 90%) | @alice | 2026-05-31 |
| Match staging connection pool size to production (scaled) | @bob | 2026-06-07 |
| Add DB connection pool integration test to CI | @carol | 2026-06-14 |
| Add pool sizing to code review checklist | @dave | 2026-05-28 |

## Lessons Learned

### What went well
- Rollback was fast (error rate began recovering < 5 minutes after the rollback started)
- Runbook correctly identified rollback as the resolution path
- Team coordination was effective

### What could be improved
- Staging environment does not reflect production connection pool limits
- We lacked visibility into connection pool health before the incident
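
The timing figures in the example postmortem (duration, MTTD) can be derived directly from the timeline rather than computed by hand. The sketch below uses the timestamps from the template; the dictionary keys and `minutes_between` helper are illustrative, and it assumes all events fall on the same UTC day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps copied from the example timeline (UTC).
TIMELINE = {
    "deployed": "14:15",
    "alert_fired": "14:23",
    "acknowledged": "14:25",
    "rollback_started": "14:34",
    "resolved": "15:10",
}

mttd = minutes_between(TIMELINE["deployed"], TIMELINE["alert_fired"])
mtta = minutes_between(TIMELINE["alert_fired"], TIMELINE["acknowledged"])
duration = minutes_between(TIMELINE["alert_fired"], TIMELINE["resolved"])

print(f"MTTD {mttd} min, MTTA {mtta} min, duration {duration} min")
```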

On-Call Metrics and Health

Track these to identify on-call burnout and alert quality issues:

Metric	Target	Action if breached
Pages per engineer per week	< 5	Tune alert thresholds, fix flapping alerts
Actionable page rate	> 90%	Audit and suppress noisy alerts
MTTR (P1)	< 30 min	Improve runbooks, add automation
MTTA (P1)	< 5 min	Ensure PagerDuty mobile app installed and configured
Postmortem completion rate	> 90% for P1/P2	Enforce process via incident tooling
Action items closed within SLA	> 80%	Weekly review in team meeting

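# PagerDuty analytics via API (POST with a JSON filter body).
# Response field names follow the Analytics API docs; verify them against your account.
curl -s -X POST "https://api.pagerduty.com/analytics/metrics/incidents/all" \
  -H "Authorization: Token token=$PD_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"filters\": {\"service_ids\": [\"$SERVICE_ID\"]}, \"aggregate_unit\": \"month\"}" \
  | jq '.data[] | {total: .total_incident_count, mtta: .mean_seconds_to_first_ack, mttr: .mean_seconds_to_resolve}'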
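
If incident data is exported rather than queried live, the first four metrics in the table can be computed from a list of page records. This is a rough sketch: the record fields (`severity`, `actionable`, `ack_min`, `resolve_min`) and the sample values are made up for illustration and do not correspond to a PagerDuty export format.

```python
from statistics import mean


def _mean_or_none(values):
    """Mean of the values, or None when there are none."""
    values = list(values)
    return mean(values) if values else None


def on_call_health(pages: list[dict], engineers: int, weeks: float) -> dict:
    """Summarise page volume, actionable rate, and P1 MTTA/MTTR in minutes."""
    p1 = [p for p in pages if p["severity"] == "P1"]
    return {
        "pages_per_engineer_week": len(pages) / (engineers * weeks),
        "actionable_rate": sum(p["actionable"] for p in pages) / len(pages) if pages else None,
        "mtta_p1_min": _mean_or_none(p["ack_min"] for p in p1),
        "mttr_p1_min": _mean_or_none(p["resolve_min"] for p in p1),
    }


# Illustrative data only.
SAMPLE_PAGES = [
    {"severity": "P1", "actionable": True, "ack_min": 4, "resolve_min": 20},
    {"severity": "P1", "actionable": True, "ack_min": 2, "resolve_min": 30},
    {"severity": "P2", "actionable": False, "ack_min": 10, "resolve_min": 60},
    {"severity": "P2", "actionable": True, "ack_min": 12, "resolve_min": 90},
]

print(on_call_health(SAMPLE_PAGES, engineers=4, weeks=1))
```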

On-Call Toolbox

# Fast context — what is currently broken?
kubectl get events --all-namespaces --field-selector type=Warning \
  --sort-by='.lastTimestamp' | tail -20

# What deployed recently? (each Deployment rollout creates a ReplicaSet; newest are listed last)
kubectl get replicasets --all-namespaces \
  --sort-by=.metadata.creationTimestamp | tail -20

# Which nodes are unhealthy? (a plain grep -v Ready would also hide NotReady nodes)
kubectl get nodes | awk 'NR==1 || $2 != "Ready"'

# Which pods are not running?
kubectl get pods -A | grep -v Running | grep -v Completed

# Check the 1h error ratio (Prometheus); divide by the error budget (1 - SLO) to get the burn rate
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{status_code=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))' \
  | jq '.data.result[0].value[1]'

# Quick node resource check
kubectl top nodes

# Full cluster health snapshot (k9s shortcut)
k9s  # press 0 to show all namespaces, :pulse for cluster health
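
The Prometheus query in the toolbox above returns a 1-hour error ratio; turning that into a burn rate and a time-to-exhaustion estimate takes two small formulas. The sketch below assumes a 99.9% SLO and a 30-day budget window purely as examples, since the document does not state the actual SLO, and both function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Error ratio divided by the error budget (1 - SLO)."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 1.0")
    return error_ratio / budget


def hours_to_exhaustion(rate: float, budget_remaining: float = 1.0,
                        window_hours: float = 30 * 24) -> float:
    """Hours until the remaining budget is spent at the current burn rate."""
    if rate <= 0:
        return float("inf")
    return window_hours * budget_remaining / rate


# Example: 1.44% of requests failing against an assumed 99.9% SLO.
rate = burn_rate(0.0144, slo=0.999)
print(round(rate, 1), round(hours_to_exhaustion(rate), 1))
```

An estimate of one hour or less would meet the P1 criterion of SLO exhaustion within 1 hour.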

Runbooks — what on-call engineers read during incidents
09 — Production Overview — SLO error budget policy
09 — Incident Response — full incident lifecycle
07 — Alerting — alert configuration