On-Call
Overview
On-call is the system by which a team ensures someone is always available to respond to production incidents. A well-designed on-call rotation is sustainable — engineers are not burned out, alerts are actionable, and incidents are resolved and learned from systematically.
Alert fires (PagerDuty)
        │
        ▼
On-call engineer receives page
        │
        ▼
Acknowledge within SLA (5 min P1 / 30 min P2)
        │
        ▼
Diagnose using runbook + dashboards
        │
        ├─ Resolved quickly → close, write timeline note
        │
        └─ Not resolved within 15 min → escalate
                │
                ▼
        Secondary on-call / Eng manager
                │
                ▼
        Incident declared
        Incident commander assigned
PagerDuty Setup
Service and Escalation Policy
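# Create the escalation policy first; the service references it by name
pd escalation create \
  --name "P1 Critical Payments" \
  --rule "level=1,delay=0,targets=schedule:payments-primary" \
  --rule "level=2,delay=15,targets=schedule:payments-secondary" \
  --rule "level=3,delay=30,targets=user:eng-manager-id"
# Create a service via PagerDuty CLI (pd)
# Comments sit above the command: text after a trailing backslash breaks the line continuation.
# acknowledgement timeout 300: an acknowledged incident returns to triggered after 5 min
# auto-resolve timeout 14400: an incident left open is auto-resolved after 4 h
pd service create \
  --name "payments-api" \
  --escalation-policy "P1 Critical Payments" \
  --acknowledgement-timeout 300 \
  --auto-resolve-timeout 14400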
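The escalation rules above can be read as a timeline. The following is a minimal sketch, assuming each `delay` is the number of minutes after the alert triggers at which that level is notified while the incident remains unacknowledged. How your tooling interprets the value should be checked, and `RULES` and `notified_targets` are hypothetical names, not part of any PagerDuty API.

```python
# Hypothetical model of the three-level policy above; not a PagerDuty API call.
# Assumption: delay = minutes after trigger at which the level is notified.
RULES = [
    (1, 0, "schedule:payments-primary"),
    (2, 15, "schedule:payments-secondary"),
    (3, 30, "user:eng-manager-id"),
]


def notified_targets(minutes_unacked: float) -> list[str]:
    """Return every target paged so far for an incident that is still unacknowledged."""
    return [target for _level, delay, target in RULES if minutes_unacked >= delay]


if __name__ == "__main__":
    for minute in (0, 15, 45):
        print(minute, notified_targets(minute))
```

Under that assumption, an unacknowledged P1 reaches the secondary schedule at minute 15 and the engineering manager at minute 30.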
Alertmanager → PagerDuty Integration
# alertmanager/config.yaml
# Alertmanager does not expand environment variables in this file; substitute
# the $VARIABLE placeholders at deploy time (for example with envsubst).
global:
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
route:
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: pagerduty-critical  # default: alerts matching no child route land here
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: false
    - match:
        severity: warning
      receiver: pagerduty-warning
      continue: false
    - match:
        severity: info
      receiver: slack-info
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: $PAGERDUTY_INTEGRATION_KEY_CRITICAL
        severity: critical
        client: "Alertmanager"
        client_url: "{{ .ExternalURL }}"
        description: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
        details:
          firing: "{{ .Alerts.Firing | len }}"
          resolved: "{{ .Alerts.Resolved | len }}"
          runbook: "{{ (index .Alerts 0).Annotations.runbook_url }}"
          dashboard: "{{ (index .Alerts 0).Annotations.dashboard_url }}"
  - name: pagerduty-warning
    pagerduty_configs:
      - routing_key: $PAGERDUTY_INTEGRATION_KEY_WARNING
        severity: warning
  - name: slack-info
    slack_configs:
      - api_url: $SLACK_WEBHOOK_URL
        channel: "#alerts-info"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
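To make the routing behaviour concrete, here is a simplified sketch of how the route tree above selects a receiver. It models only the `severity` label matching and the fallback to the top-level receiver; grouping, timing, and templating are ignored, and `receiver_for` is a hypothetical helper, not Alertmanager code.

```python
# Simplified model of the route tree above. No child route sets
# `continue: true`, so the first matching child wins; alerts that match no
# child fall back to the top-level receiver.
CHILD_ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "pagerduty-warning"),
    ({"severity": "info"}, "slack-info"),
]
DEFAULT_RECEIVER = "pagerduty-critical"


def receiver_for(labels: dict[str, str]) -> str:
    """Return the receiver name an alert with these labels would reach."""
    for match, receiver in CHILD_ROUTES:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER


if __name__ == "__main__":
    print(receiver_for({"severity": "warning"}))
    print(receiver_for({"alertname": "NoSeverityLabel"}))
```

One consequence worth noticing: an alert with no `severity` label falls through to `pagerduty-critical` and pages the primary on-call.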
Alert Severity Levels
| Severity | Response SLA | Examples | Action |
|---|---|---|---|
| P1 Critical | Ack 5 min, resolve 30 min | SLO error budget exhausted, payment outage, data loss risk | Page primary + escalate if not acked |
| P2 High | Ack 30 min, resolve 4h | Error rate elevated, latency degraded | Page primary on-call |
| P3 Medium | Business hours | Disk > 80%, cert expiry < 30d | Ticket created, no page |
| P4 Low | Next sprint | Non-critical config drift | Backlog item |
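The ack and resolve targets in the table can be checked mechanically after an incident. This is a minimal sketch, assuming both targets are measured from the moment the alert triggers (the table does not say which clock the resolve target uses); `SLA` and `sla_breaches` are hypothetical names.

```python
from datetime import datetime, timedelta

# Targets copied from the table above; P3 and P4 do not page, so they have no entry.
SLA = {
    "P1": {"ack": timedelta(minutes=5), "resolve": timedelta(minutes=30)},
    "P2": {"ack": timedelta(minutes=30), "resolve": timedelta(hours=4)},
}


def sla_breaches(severity: str, triggered: datetime,
                 acked: datetime, resolved: datetime) -> list[str]:
    """Return which targets ("ack", "resolve") were missed for one incident."""
    targets = SLA.get(severity)
    if targets is None:
        return []  # no paging SLA defined for this severity
    breaches = []
    if acked - triggered > targets["ack"]:
        breaches.append("ack")
    if resolved - triggered > targets["resolve"]:
        breaches.append("resolve")
    return breaches


if __name__ == "__main__":
    day = datetime(2026, 5, 24)
    print(sla_breaches("P1", day.replace(hour=14, minute=23),
                       day.replace(hour=14, minute=25),
                       day.replace(hour=15, minute=10)))
```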
On-Call Rotation Schedule
# PagerDuty schedule via Terraform (team of 4 rotating weekly)
resource "pagerduty_schedule" "payments_primary" {
  name      = "Payments Primary On-Call"
  time_zone = "America/New_York"
  layer {
    name                         = "Weekly Rotation"
    start                        = "2026-01-06T09:00:00-05:00"
    rotation_virtual_start       = "2026-01-06T09:00:00-05:00"
    rotation_turn_length_seconds = 604800 # 1 week
    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.carol.id,
      pagerduty_user.dave.id,
    ]
    restriction {
      type              = "weekly_restriction"
      start_day_of_week = 1 # Monday
      start_time_of_day = "09:00:00"
      # 432000 s = 5 days: one continuous window from Mon 09:00 to Sat 09:00.
      # Coverage limited to Mon-Fri 9am-5pm would instead need one
      # 28800-second restriction per weekday.
      duration_seconds = 432000
    }
  }
}
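The rotation arithmetic behind the layer above is simple enough to sketch. This is an illustration under stated assumptions, not PagerDuty's scheduling logic: it assumes turns advance in list order every `rotation_turn_length_seconds` from `rotation_virtual_start`, it ignores the `restriction` block, and the user names stand in for the Terraform user IDs.

```python
from datetime import datetime, timedelta

# Values copied from the Terraform layer above; names stand in for user IDs.
ROTATION_START = datetime.fromisoformat("2026-01-06T09:00:00-05:00")
TURN_LENGTH = timedelta(seconds=604800)  # 1 week
USERS = ["alice", "bob", "carol", "dave"]


def on_call_at(moment: datetime) -> str:
    """Return whose turn it is at a timezone-aware `moment`, ignoring restrictions."""
    if moment < ROTATION_START:
        raise ValueError("moment is before the rotation starts")
    turns_elapsed = (moment - ROTATION_START) // TURN_LENGTH
    return USERS[turns_elapsed % len(USERS)]


if __name__ == "__main__":
    print(on_call_at(ROTATION_START + timedelta(days=10)))
```

With four engineers and one-week turns, each person is primary once every four weeks.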
On-Call Hygiene Rules
- No single point of failure: always have primary + secondary coverage
- Handoff ritual: 15-minute sync between outgoing and incoming on-call; review open incidents, known fragile systems, pending deploys
- No paging without runbook: every alert must have a `runbook_url` annotation before routing to PagerDuty
- Weekday business hours first: new services start with P2-only alerting (no nighttime pages) until runbooks are complete and alerts are validated
- 4-week ramp for new engineers: shadow + secondary before primary on-call
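The "no paging without runbook" rule lends itself to an automated check. Below is a minimal sketch, assuming the Prometheus rule files have already been parsed into Python dictionaries (for example by a YAML loader); `alerts_missing_runbook`, the `DiskAlmostFull` alert, and the example URL are hypothetical.

```python
def alerts_missing_runbook(rule_groups: list[dict]) -> list[str]:
    """Return names of alerting rules that lack a non-empty runbook_url annotation."""
    missing = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules never page anyone
            annotations = rule.get("annotations") or {}
            if not annotations.get("runbook_url"):
                missing.append(rule["alert"])
    return missing


if __name__ == "__main__":
    # Hypothetical parsed rule file: one compliant alert, one without a runbook.
    groups = [{
        "name": "payments",
        "rules": [
            {"alert": "PaymentProcessingErrorRateHigh",
             "annotations": {"runbook_url": "https://runbooks.example.com/payments"}},
            {"alert": "DiskAlmostFull", "annotations": {"summary": "Disk > 80%"}},
        ],
    }]
    print(alerts_missing_runbook(groups))
```

Run in CI, a non-empty result could fail the build before the alert is ever routed to PagerDuty.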
Incident Response Lifecycle
Severity Classification
P1: Customer-facing service completely down, data loss risk, or SLO error
    budget projected to be exhausted within 1 hour
P2: Significant degradation affecting >10% of users, or on a trajectory to P1
P3: Minor degradation, single user affected, or warning threshold crossed
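The classification rules above can be written as a small decision function, which is useful when incident tooling has to suggest a severity. This is a sketch only: the parameter names are hypothetical, and the "P1 trajectory" judgement is passed in as a flag because the rules do not define how to compute it.

```python
def classify(fully_down: bool, data_loss_risk: bool,
             hours_to_budget_exhaustion: float,
             percent_users_affected: float,
             trending_to_p1: bool = False) -> str:
    """Map incident facts to P1/P2/P3 following the rules above."""
    if fully_down or data_loss_risk or hours_to_budget_exhaustion <= 1:
        return "P1"
    if percent_users_affected > 10 or trending_to_p1:
        return "P2"
    return "P3"


if __name__ == "__main__":
    print(classify(fully_down=False, data_loss_risk=False,
                   hours_to_budget_exhaustion=24, percent_users_affected=12))
```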
Incident Commander Responsibilities
For P1 incidents, assign an Incident Commander (IC) separate from the engineers debugging the issue. The IC:
- Declares the incident in Slack (`/incident declare`)
- Assigns roles: IC, Tech Lead, Comms Lead
- Posts initial status update within 10 minutes
- Keeps the war room (Slack channel or Zoom) on track
- Drives the response toward resolution and leaves hands-on diagnosis to the Tech Lead
- Coordinates customer communication via Comms Lead
Incident Slack Workflow
# Using Rootly / FireHydrant / Slack workflow
/incident declare
# → Creates #incident-2026-0524-payments channel
# → Pages IC and Tech Lead
# → Opens incident timeline
# During incident — keep timeline updated
/incident update "Identified root cause: memory leak in payments-api v1.14.1"
/incident update "Mitigation: rolling restart deployed, error rate returning to baseline"
/incident resolved
# → Sends resolution notification
# → Creates postmortem doc
Postmortem Process
Blameless postmortems focus on systems and processes, not individuals. The goal is to prevent recurrence, not assign fault.
Postmortem Template
# Postmortem: [Incident Title]
**Date:** 2026-05-24
**Duration:** 47 minutes (14:23 – 15:10 UTC)
**Severity:** P1
**Services affected:** payments-api, reconciliation-worker
**Authors:** @alice, @bob
**Status:** Draft / In Review / Final
---
## Summary
One paragraph: what happened, what the user impact was, how it was resolved.
## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:23 | PagerDuty alert: PaymentProcessingErrorRateHigh |
| 14:25 | @alice acknowledged, began investigation |
| 14:31 | Identified: payments-api v1.14.1 deployed at 14:15 |
| 14:34 | Rollback initiated via Argo Rollouts |
| 14:38 | Error rate normalising |
| 15:10 | Error rate < 0.1%, incident resolved |
## Root Cause
payments-api v1.14.1 introduced a change to database connection pool
initialisation that caused connection exhaustion under moderate load. The
issue was not caught in staging because staging uses 1/10th the connection
pool size.
## Contributing Factors
- Staging connection pool size not representative of production
- No connection pool exhaustion metric was being alerted on
- Code review did not catch the pool sizing change
## Detection
Alert fired 8 minutes after deploy. MTTD: 8 minutes.
The alert threshold (5% error rate) was reached before most users noticed.
## Resolution
Rolled back to v1.14.0 via `kubectl argo rollouts abort payments-api`.
Error rate returned to baseline within 4 minutes of rollback.
## Impact
- 23 failed payment requests (0.03% of traffic in the window)
- 0 data loss events
- Error budget consumed: 12 minutes of P1-equivalent downtime
## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add connection pool utilization alert (warn > 70%, critical > 90%) | @alice | 2026-05-31 |
| Match staging connection pool size to production (scaled) | @bob | 2026-06-07 |
| Add DB connection pool integration test to CI | @carol | 2026-06-14 |
| Add pool sizing to code review checklist | @dave | 2026-05-28 |
## Lessons Learned
### What went well
- Rollback was fast (< 5 minutes from decision to resolution)
- Runbook correctly identified rollback as the resolution path
- Team coordination was effective
### What could be improved
- Staging environment does not reflect production connection pool limits
- We lacked visibility into connection pool health before the incident
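The Duration and MTTD figures in the template follow directly from the timeline timestamps. A minimal sketch of that arithmetic, assuming all times fall on the same UTC day; `minutes_between` is a hypothetical helper.

```python
from datetime import datetime

# Timestamps taken from the example timeline in the template (UTC, same day).
DEPLOYED = "14:15"
ALERT_FIRED = "14:23"
ACKNOWLEDGED = "14:25"
RESOLVED = "15:10"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("MTTD:", minutes_between(DEPLOYED, ALERT_FIRED))         # 8
    print("Time to ack:", minutes_between(ALERT_FIRED, ACKNOWLEDGED))  # 2
    print("Duration:", minutes_between(ALERT_FIRED, RESOLVED))     # 47
```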
On-Call Metrics and Health
Track these to identify on-call burnout and alert quality issues:
| Metric | Target | Action if breached |
|---|---|---|
| Pages per engineer per week | < 5 | Tune alert thresholds, fix flapping alerts |
| Actionable page rate | > 90% | Audit and suppress noisy alerts |
| MTTR (P1) | < 30 min | Improve runbooks, add automation |
| MTTA (P1) | < 5 min | Ensure PagerDuty mobile app installed and configured |
| Postmortem completion rate | > 90% for P1/P2 | Enforce process via incident tooling |
| Action items closed within SLA | > 80% | Weekly review in team meeting |
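# PagerDuty analytics via API (aggregated incident metrics are requested with POST and a JSON filter body)
# $START and $END are ISO 8601 timestamps bounding the reporting period
curl -s -X POST "https://api.pagerduty.com/analytics/metrics/incidents/all" \
  -H "Authorization: Token token=$PD_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"filters\": {\"created_at_start\": \"$START\", \"created_at_end\": \"$END\", \"service_ids\": [\"$SERVICE_ID\"]}}" \
  | jq '.data[0] | {total: .total_incident_count, mtta_seconds: .mean_seconds_to_first_ack, mttr_seconds: .mean_seconds_to_resolve}'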
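The first two rows of the metrics table can be computed from a week of page records. The sketch below assumes each record is an `(engineer, was_actionable)` pair exported from your paging tool; the record shape and `weekly_page_health` are hypothetical, and the thresholds default to the targets in the table.

```python
from collections import Counter


def weekly_page_health(pages: list[tuple[str, bool]],
                       max_pages_per_engineer: int = 5,
                       min_actionable_rate: float = 0.90) -> dict:
    """Summarise one week of pages against the targets in the table above."""
    per_engineer = Counter(engineer for engineer, _ in pages)
    actionable = sum(1 for _, was_actionable in pages if was_actionable)
    rate = actionable / len(pages) if pages else 1.0
    return {
        # Target is "< 5" pages, so 5 or more counts as a breach.
        "overloaded": sorted(e for e, n in per_engineer.items()
                             if n >= max_pages_per_engineer),
        "actionable_rate": rate,
        "alert_quality_ok": rate > min_actionable_rate,
    }


if __name__ == "__main__":
    week = [("alice", True)] * 4 + [("bob", True)] * 3 + [("bob", False)] * 2
    print(weekly_page_health(week))
```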
On-Call Toolbox
# Fast context: what is currently broken?
kubectl get events --all-namespaces --field-selector type=Warning \
  --sort-by='.lastTimestamp' | tail -20
# What deployed recently? (a rollout with a new pod template creates a new ReplicaSet; newest last)
kubectl get replicasets --all-namespaces \
  --sort-by='.metadata.creationTimestamp' | tail -20
# Which nodes are unhealthy? (compare STATUS exactly; "NotReady" contains "Ready", so grep -v Ready hides it)
kubectl get nodes --no-headers | awk '$2 != "Ready"'
# Which pods are not running?
kubectl get pods -A | grep -Ev 'Running|Completed'
# Check the 1h error ratio (Prometheus); divide by the error budget (1 - SLO target) to get the burn rate
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{status_code=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))' \
  | jq '.data.result[0].value[1]'
# Quick node resource check
kubectl top nodes
# Full cluster health snapshot (k9s shortcut)
k9s # press 0 to show all namespaces, :pulse for cluster health
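The Prometheus query in the toolbox returns an error ratio, which becomes a burn rate once it is divided by the error budget. A worked sketch follows, assuming a 99.9% SLO over a 30-day window; both figures are illustrative assumptions, since the SLO targets themselves are defined elsewhere.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Error budget burn rate; 1.0 means the budget lasts exactly the SLO window."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def hours_to_exhaustion(rate: float, budget_remaining: float,
                        window_hours: float) -> float:
    """Hours until the remaining budget fraction (0..1) is spent at this burn rate."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_hours / rate


if __name__ == "__main__":
    # Assumed example: 1.44% of requests failing against a 99.9% SLO.
    rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
    print(round(rate, 2))
    print(round(hours_to_exhaustion(rate, budget_remaining=1.0, window_hours=720), 1))
```

In this example the burn rate is about 14.4, so a full 30-day budget would last roughly 50 hours; a result of one hour or less would meet the P1 criterion in the severity classification.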
Related
- Runbooks — what on-call engineers read during incidents
- 09 — Production Overview — SLO error budget policy
- 09 — Incident Response — full incident lifecycle
- 07 — Alerting — alert configuration