Overview

On-call is the system by which a team ensures someone is always available to respond to production incidents. A well-designed on-call rotation is sustainable — engineers are not burned out, alerts are actionable, and incidents are resolved and learned from systematically.

Alert fires (PagerDuty)
      │
      ▼
On-call engineer receives page
      │
      ▼
Acknowledge within SLA (5 min P1 / 30 min P2)
      │
      ▼
Diagnose using runbook + dashboards
      │
      ├─ Resolved quickly → close, write timeline note
      │
      └─ Not resolved within 15 min → escalate
                                            │
                                            ▼
                                     Secondary on-call
                                     / Eng manager
                                            │
                                            ▼
                                     Incident declared
                                     Incident commander assigned
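
The flow above can be read as a small decision function. The following is a minimal sketch, assuming the ack SLAs and the 15-minute escalation threshold shown in the diagram; `PageState` and `next_step` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Ack SLAs from the flow above, in minutes.
ACK_SLA_MIN = {"P1": 5, "P2": 30}
# Escalate if the page is still unresolved after this many minutes.
ESCALATE_AFTER_MIN = 15


@dataclass
class PageState:
    severity: str              # "P1" or "P2"
    minutes_since_page: float
    acknowledged: bool
    resolved: bool


def next_step(state: PageState) -> str:
    """Return the next action in the flow for a page in the given state."""
    if state.resolved:
        return "close and write timeline note"
    if state.minutes_since_page >= ESCALATE_AFTER_MIN:
        return "escalate to secondary on-call / eng manager"
    if not state.acknowledged:
        if state.minutes_since_page > ACK_SLA_MIN[state.severity]:
            return "ack SLA breached: acknowledge now"
        return "acknowledge"
    return "diagnose using runbook + dashboards"
```

Note that with these numbers an unresolved P2 escalates at 15 minutes, before its 30-minute ack SLA expires.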

PagerDuty Setup

Service and Escalation Policy

# Create the escalation policy first; the service below references it by name
pd escalation create \
  --name "P1 Critical Payments" \
  --rule "level=1,delay=0,targets=schedule:payments-primary" \
  --rule "level=2,delay=15,targets=schedule:payments-secondary" \
  --rule "level=3,delay=30,targets=user:eng-manager-id"

# Create a service via PagerDuty CLI (pd)
# --acknowledgement-timeout 300: an acknowledged incident returns to triggered
#   after 5 min if still unresolved (time-to-ack is set by the escalation delays)
# --auto-resolve-timeout 14400: an incident left open for 4h is auto-resolved
pd service create \
  --name "payments-api" \
  --escalation-policy "P1 Critical Payments" \
  --acknowledgement-timeout 300 \
  --auto-resolve-timeout 14400
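
To make the escalation rules concrete, here is a minimal sketch of who has been paged after a given time without an acknowledgement. It assumes each `delay` is counted in minutes from the moment the incident triggers; `RULES` and `notified_targets` are hypothetical names, not part of any PagerDuty tooling.

```python
# Rules mirror the escalation policy above: (level, delay in minutes, target).
# Assumption: each delay is measured from the moment the incident triggers.
RULES = [
    (1, 0, "schedule:payments-primary"),
    (2, 15, "schedule:payments-secondary"),
    (3, 30, "user:eng-manager-id"),
]


def notified_targets(minutes_unacknowledged: float) -> list[str]:
    """Return every target paged so far for an incident nobody has acked."""
    return [
        target
        for _level, delay, target in RULES
        if minutes_unacknowledged >= delay
    ]
```

For example, `notified_targets(20)` returns the primary and secondary schedules, and the engineering manager is added once 30 minutes have passed.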

Alertmanager → PagerDuty Integration

# alertmanager/config.yaml
global:
  pagerduty_url: https://events.pagerduty.com/v2/enqueue

route:
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: pagerduty-critical

  routes:
  - match:
      severity: critical
    receiver: pagerduty-critical
    continue: false

  - match:
      severity: warning
    receiver: pagerduty-warning
    continue: false

  - match:
      severity: info
    receiver: slack-info

receivers:
- name: pagerduty-critical
  pagerduty_configs:
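  # Alertmanager does not expand environment variables itself; the $VAR
  # placeholders in this file must be substituted at deploy time (e.g. envsubst)
  - routing_key: $PAGERDUTY_INTEGRATION_KEY_CRITICAL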
    severity: critical
    client: "Alertmanager"
    client_url: "{{ .ExternalURL }}"
    description: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
    details:
      firing: "{{ .Alerts.Firing | len }}"
      resolved: "{{ .Alerts.Resolved | len }}"
      runbook: "{{ (index .Alerts 0).Annotations.runbook_url }}"
      dashboard: "{{ (index .Alerts 0).Annotations.dashboard_url }}"

- name: pagerduty-warning
  pagerduty_configs:
  - routing_key: $PAGERDUTY_INTEGRATION_KEY_WARNING
    severity: warning

- name: slack-info
  slack_configs:
  - api_url: $SLACK_WEBHOOK_URL
    channel: "#alerts-info"
    text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
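
The routing tree above sends each alert to the first child route whose matchers all match (no route sets `continue: true`), and falls back to the top-level receiver otherwise. A minimal sketch of that behaviour, with `ROUTES` and `pick_receiver` as hypothetical names and no attempt to model grouping or timing:

```python
DEFAULT_RECEIVER = "pagerduty-critical"

# Child routes in config order: (label matchers, receiver).
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "pagerduty-warning"),
    ({"severity": "info"}, "slack-info"),
]


def pick_receiver(labels: dict[str, str]) -> str:
    """Return the receiver of the first child route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    # No child route matched: the alert is handled by the top-level route.
    return DEFAULT_RECEIVER
```

One consequence worth checking: an alert with no `severity` label, or an unexpected value, falls through to `pagerduty-critical` and pages the on-call engineer.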

Alert Severity Levels

| Severity | Response SLA | Examples | Action |
|----------|--------------|----------|--------|
| P1 Critical | Ack 5 min, resolve 30 min | SLO exhausted, payment outage, data loss risk | Page primary + escalate if not acked |
| P2 High | Ack 30 min, resolve 4h | Error rate elevated, latency degraded | Page primary on-call |
| P3 Medium | Business hours | Disk > 80%, cert expiry < 30d | Ticket created, no page |
| P4 Low | Next sprint | Non-critical config drift | Backlog item |
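
The severity levels can be encoded as data so tooling can compute deadlines from the alert time. This is a minimal sketch using only the SLAs listed for each level; `SLA`, `sla_deadlines`, and `pages_on_call` are hypothetical names, and P3/P4 are given no clock-based deadline because none is stated.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# (ack SLA, resolve SLA) per severity; None where no clock-based SLA is given.
SLA = {
    "P1": (timedelta(minutes=5), timedelta(minutes=30)),
    "P2": (timedelta(minutes=30), timedelta(hours=4)),
    "P3": (None, None),  # handled in business hours via a ticket
    "P4": (None, None),  # backlog item for the next sprint
}


def sla_deadlines(
    severity: str, fired_at: datetime
) -> tuple[Optional[datetime], Optional[datetime]]:
    """Return (ack deadline, resolve deadline) for an alert fired at fired_at."""
    ack, resolve = SLA[severity]
    ack_deadline = fired_at + ack if ack is not None else None
    resolve_deadline = fired_at + resolve if resolve is not None else None
    return ack_deadline, resolve_deadline


def pages_on_call(severity: str) -> bool:
    """Only P1 and P2 page a human; P3 and P4 become tickets or backlog items."""
    return severity in ("P1", "P2")
```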

On-Call Rotation Schedule

# PagerDuty schedule via Terraform (team of 4 rotating weekly)
resource "pagerduty_schedule" "payments_primary" {
  name      = "Payments Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2026-01-06T09:00:00-05:00"
    rotation_virtual_start       = "2026-01-06T09:00:00-05:00"
    rotation_turn_length_seconds = 604800   # 1 week

    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.carol.id,
      pagerduty_user.dave.id,
    ]

    restriction {
      type              = "weekly_restriction"
      start_day_of_week = 1      # Monday
      start_time_of_day = "09:00:00"
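      # 432000 s = 5 days, so this covers Mon 09:00 through Sat 09:00 around the clock.
      # Mon-Fri 9am-5pm only would need one 28800 s (8h) restriction per weekday.
      duration_seconds  = 432000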
    }
  }
}
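
To sanity-check the rotation, the person on call at a given time can be derived from the start time and turn length. This is a minimal sketch that ignores the `restriction` block and daylight-saving shifts (it uses the fixed -05:00 offset from the config); `on_call_at` is a hypothetical helper, not a PagerDuty API.

```python
from datetime import datetime, timedelta, timezone

# Users in rotation order, as listed in the Terraform layer.
USERS = ["alice", "bob", "carol", "dave"]
ROTATION_START = datetime(2026, 1, 6, 9, 0, tzinfo=timezone(timedelta(hours=-5)))
TURN_LENGTH = timedelta(seconds=604800)  # 1 week


def on_call_at(when: datetime) -> str:
    """Return who holds the weekly turn at the given timezone-aware time."""
    if when < ROTATION_START:
        raise ValueError("rotation has not started yet")
    # Whole turns elapsed since the rotation started.
    turn_index = (when - ROTATION_START) // TURN_LENGTH
    return USERS[turn_index % len(USERS)]
```

With four engineers, each person is on call one week in four, and the cycle repeats every 28 days.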

On-Call Hygiene Rules

Incident Response Lifecycle

Severity Classification

P1 — Customer-facing service completely down, data loss risk, or SLO
     exhaustion within 1 hour
P2 — Significant degradation affecting >10% of users, or P1 trajectory
P3 — Minor degradation, single user affected, or warning threshold crossed
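
The classification rules above can be sketched as a function, which is handy for incident tooling or for checking edge cases. The parameter names are hypothetical, and treating everything that is neither P1 nor P2 as P3 is an assumption of this sketch.

```python
from typing import Optional


def classify(
    fully_down: bool = False,
    data_loss_risk: bool = False,
    hours_to_slo_exhaustion: Optional[float] = None,
    pct_users_affected: float = 0.0,
    trending_to_p1: bool = False,
) -> str:
    """Map incident facts to P1/P2/P3 using the thresholds listed above."""
    if fully_down or data_loss_risk:
        return "P1"
    if hours_to_slo_exhaustion is not None and hours_to_slo_exhaustion <= 1:
        return "P1"
    if pct_users_affected > 10 or trending_to_p1:
        return "P2"
    # Assumption: anything below the P2 thresholds is treated as P3.
    return "P3"
```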

Incident Commander Responsibilities

For P1 incidents, assign an Incident Commander (IC) separate from the engineers debugging the issue. The IC:

  1. Declares the incident in Slack (/incident declare)
  2. Assigns the remaining roles: Tech Lead, Comms Lead
  3. Posts initial status update within 10 minutes
  4. Keeps the war room (Slack channel or Zoom) on track
  5. Drives toward resolution and leaves diagnosis to the engineers debugging
  6. Coordinates customer communication via Comms Lead

Incident Slack Workflow

# Using Rootly / FireHydrant / Slack workflow
/incident declare
# → Creates #incident-2026-0524-payments channel
# → Pages IC and Tech Lead
# → Opens incident timeline

# During incident — keep timeline updated
/incident update "Identified root cause: memory leak in payments-api v1.14.1"
/incident update "Mitigation: rolling restart deployed, error rate returning to baseline"
/incident resolved
# → Sends resolution notification
# → Creates postmortem doc

Postmortem Process

Blameless postmortems focus on systems and processes, not individuals. The goal is to prevent recurrence, not assign fault.

Postmortem Template

# Postmortem: [Incident Title]

**Date:** 2026-05-24
**Duration:** 47 minutes (14:23 – 15:10 UTC)
**Severity:** P1
**Services affected:** payments-api, reconciliation-worker
**Authors:** @alice, @bob
**Status:** Draft / In Review / Final

---

## Summary

One paragraph: what happened, what the user impact was, how it was resolved.

## Timeline (UTC)

| Time | Event |
|------|-------|
| 14:23 | PagerDuty alert: PaymentProcessingErrorRateHigh |
| 14:25 | @alice acknowledged, began investigation |
| 14:31 | Identified: payments-api v1.14.1 deployed at 14:15 |
| 14:34 | Rollback initiated via Argo Rollouts |
| 14:38 | Error rate normalising |
| 15:10 | Error rate < 0.1%, incident resolved |

## Root Cause

payments-api v1.14.1 introduced a change to database connection pool
initialisation that caused connection exhaustion under moderate load. The
issue was not caught in staging because staging uses 1/10th the connection
pool size.

## Contributing Factors

- Staging connection pool size not representative of production
- No connection pool exhaustion metric was being alerted on
- Code review did not catch the pool sizing change

## Detection

Alert fired 8 minutes after deploy. MTTD: 8 minutes.
The alert threshold (5% error rate) was reached before most users noticed.

## Resolution

Rolled back to v1.14.0 via `kubectl argo rollouts abort payments-api`.
Error rate began normalising within 4 minutes of rollback (14:38) and fell below 0.1% by 15:10.

## Impact

- 23 failed payment requests (0.03% of traffic in the window)
- 0 data loss events
- Error budget consumed: 12 minutes of P1-equivalent downtime

## Action Items

| Action | Owner | Due |
|--------|-------|-----|
| Add connection pool utilization alert (warn > 70%, critical > 90%) | @alice | 2026-05-31 |
| Match staging connection pool size to production (scaled) | @bob | 2026-06-07 |
| Add DB connection pool integration test to CI | @carol | 2026-06-14 |
| Add pool sizing to code review checklist | @dave | 2026-05-28 |

## Lessons Learned

### What went well
- Rollback was fast (< 5 minutes from initiating the rollback to error rate normalising)
- Runbook correctly identified rollback as the resolution path
- Team coordination was effective

### What could be improved
- Staging environment does not reflect production connection pool limits
- We lacked visibility into connection pool health before the incident
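
The Duration and MTTD figures in the template follow directly from the timeline. A minimal sketch of that arithmetic, assuming all timestamps fall on the same UTC day; `minutes_between` is a hypothetical helper.

```python
from datetime import datetime


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the example timeline (UTC).
deployed, alerted, acked, resolved = "14:15", "14:23", "14:25", "15:10"

mttd = minutes_between(deployed, alerted)      # deploy to first alert
mtta = minutes_between(alerted, acked)         # alert to acknowledgement
duration = minutes_between(alerted, resolved)  # alert to resolution

print(f"MTTD={mttd} min, MTTA={mtta} min, duration={duration} min")
```

For the example incident this gives an MTTD of 8 minutes, an MTTA of 2 minutes, and a duration of 47 minutes, matching the header and Detection section.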

On-Call Metrics and Health

Track these to identify on-call burnout and alert quality issues:

| Metric | Target | Action if breached |
|--------|--------|--------------------|
| Pages per engineer per week | < 5 | Tune alert thresholds, fix flapping alerts |
| Actionable page rate | > 90% | Audit and suppress noisy alerts |
| MTTR (P1) | < 30 min | Improve runbooks, add automation |
| MTTA (P1) | < 5 min | Ensure PagerDuty mobile app installed and configured |
| Postmortem completion rate | > 90% for P1/P2 | Enforce process via incident tooling |
| Action items closed within SLA | > 80% | Weekly review in team meeting |

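# PagerDuty analytics via API. The aggregated-metrics endpoint takes a POST with a
# JSON body; verify field names against the current Analytics API reference.
curl -s -X POST "https://api.pagerduty.com/analytics/metrics/incidents/all" \
  -H "Authorization: Token token=$PD_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"filters\": {\"service_ids\": [\"$SERVICE_ID\"]}, \"aggregate_unit\": \"month\"}" \
  | jq '.data[] | {total: .total_incident_count, mtta: .mean_seconds_to_first_ack, mttr: .mean_seconds_to_resolve}'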
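
Once the numbers are collected, checking them against the targets is mechanical. The sketch below encodes the targets from this section; the metric keys and the `breached` helper are hypothetical names chosen for illustration, with percentages given as 0-100 and times in minutes.

```python
import operator

# (metric key, comparison the observed value must satisfy, target, action if breached)
TARGETS = [
    ("pages_per_engineer_per_week", operator.lt, 5,
     "Tune alert thresholds, fix flapping alerts"),
    ("actionable_page_rate_pct", operator.gt, 90,
     "Audit and suppress noisy alerts"),
    ("mttr_p1_min", operator.lt, 30,
     "Improve runbooks, add automation"),
    ("mtta_p1_min", operator.lt, 5,
     "Ensure PagerDuty mobile app installed and configured"),
    ("postmortem_completion_pct", operator.gt, 90,
     "Enforce process via incident tooling"),
    ("action_items_closed_in_sla_pct", operator.gt, 80,
     "Weekly review in team meeting"),
]


def breached(observed: dict[str, float]) -> dict[str, str]:
    """Map each breached metric to its follow-up action; skip unreported metrics."""
    return {
        name: action
        for name, meets_target, target, action in TARGETS
        if name in observed and not meets_target(observed[name], target)
    }
```

The comparisons are strict, so a value that sits exactly on a target (for example an actionable page rate of exactly 90%) counts as breached.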

On-Call Toolbox

# Fast context — what is currently broken?
kubectl get events --all-namespaces --field-selector type=Warning \
  --sort-by='.lastTimestamp' | tail -20

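# What deployed recently? `kubectl rollout history` works on one namespace at a
# time, so list ReplicaSets by creation time instead (each new rollout creates one)
kubectl get replicasets --all-namespaces \
  --sort-by=.metadata.creationTimestamp | tail -20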

# Which nodes are unhealthy? Compare STATUS exactly: `grep -v Ready` also drops NotReady
kubectl get nodes --no-headers | awk '$2 != "Ready"'

# Which pods are not running?
kubectl get pods -A | grep -v Running | grep -v Completed

# Check the 1h error ratio (Prometheus); divide by the error budget (1 - SLO target) for burn rate
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{status_code=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))' \
  | jq '.data.result[0].value[1]'

# Quick node resource check
kubectl top nodes

# Full cluster health snapshot (k9s shortcut)
k9s  # press 0 to show all namespaces, :pulse for cluster health
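
The Prometheus query in the toolbox returns an error ratio, which becomes a burn rate once divided by the error budget. A minimal sketch of that conversion follows; the 99.9% SLO target in the usage note and the 30-day window are assumptions for illustration, not values from this guide.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Error ratio divided by the error budget (1 - SLO target).

    A burn rate of 1.0 spends the budget exactly over the SLO window.
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_ratio / error_budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full, untouched budget is spent at a constant burn rate.

    Assumption: a 30-day SLO window unless stated otherwise.
    """
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate
```

For example, a 0.5% error ratio against an assumed 99.9% target is a burn rate of roughly 5, which would spend a full 30-day budget in about 6 days. A burn rate of 720 would spend it in one hour, the P1 threshold used in the severity classification.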