
Incident Response

Structured procedures for detecting, triaging, communicating, and resolving production incidents, and for running blameless postmortems afterward.

Severity Classification

Severity	Definition	Response Time	Escalation
P1 (Critical)	Complete outage; all users affected; revenue impact	Immediate (24/7)	On-call + Engineering Manager + VP
P2 (High)	Major feature unavailable; >20% users affected	<15 min (24/7)	On-call + Engineering Manager
P3 (Medium)	Degraded performance; <20% users affected	<2h (business hours)	On-call engineer
P4 (Low)	Minor issue; workaround available	Next business day	Ticket + team Slack
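
To make the table mechanical during triage, the mapping can be sketched as a small shell helper. The function name `classify_severity` and its three inputs are hypothetical, not part of any tool named in this section. The table leaves exactly 20% undefined, so this sketch assumes 20% counts as P2, and it assumes an issue below 20% with a workaround is P4.

```shell
# Hypothetical helper: map triage answers to a severity from the table above.
# Usage: classify_severity <full_outage: yes|no> <percent_users_affected> <workaround: yes|no>
# Assumption: percent is a whole number; exactly 20 is treated as P2.
classify_severity() {
  local full_outage=$1 pct=$2 workaround=$3
  if [ "$full_outage" = "yes" ]; then
    echo "P1"
  elif [ "$pct" -ge 20 ]; then
    echo "P2"
  elif [ "$workaround" = "yes" ]; then
    echo "P4"
  else
    echo "P3"
  fi
}
```

For example, `classify_severity no 35 no` prints `P2`. When in doubt, declaring the higher severity and downgrading later is usually the cheaper mistake.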

Incident Response Lifecycle

1. DETECT      Alert fires → PagerDuty pages on-call engineer
2. ACKNOWLEDGE On-call acknowledges within 5 min (P1/P2) or 15 min (P3)
3. TRIAGE      Assess severity, impact, blast radius
4. COMMUNICATE Open incident channel, post initial status update
5. INVESTIGATE Identify root cause while mitigating symptoms
6. MITIGATE    Apply fix or workaround to restore service
7. RESOLVE     Verify service restored, close incident
8. POSTMORTEM  Blameless review within 5 business days (P1/P2)
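
The acknowledgement deadlines in step 2 can be checked after the fact with a sketch like the following. Both function names are hypothetical, and timestamps are assumed to be Unix epoch seconds exported from the paging tool.

```shell
# Hypothetical helper: acknowledgement deadline in minutes for a severity (step 2).
ack_deadline_minutes() {
  case "$1" in
    P1|P2) echo 5 ;;
    P3)    echo 15 ;;
    *)     return 1 ;;  # this section defines no paging deadline for other severities
  esac
}

# ack_within_sla <severity> <alert_epoch_seconds> <ack_epoch_seconds>
# Exit status 0 if the page was acknowledged within the deadline, 1 if late,
# 2 if the severity has no deadline.
ack_within_sla() {
  local deadline
  deadline=$(ack_deadline_minutes "$1") || return 2
  [ $(( $3 - $2 )) -le $(( deadline * 60 )) ]
}
```

`ack_within_sla P1 1000 1120 && echo on-time` prints `on-time`, because 120 seconds is inside the 5-minute window.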

First 5 Minutes — P1 Response

# Immediately when paged:

# 1. Acknowledge PagerDuty
# 2. Open incident Slack channel: /incident P1 "payment service down"
# 3. Post initial update in #incidents:
#    "⚠️ INCIDENT: Payments service returning 503. Investigating. IC: @alice"

# Quick diagnosis commands (run in parallel):
kubectl get pods -n production | grep -v Running
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
kubectl top nodes
kubectl get nodes -o wide

# Check Grafana dashboard (bookmark these):
# - API p99 latency
# - Error rate
# - Pod restart count
# - Node CPU/memory
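
Running the diagnosis commands in parallel can be sketched with background jobs and `wait`. The helpers `snapshot` and `collect_diagnostics` are hypothetical names. Unlike the commands above, this sketch saves each command's full output to a file instead of filtering it with `grep` or `tail`, so the evidence is also available for the postmortem.

```shell
# snapshot <outdir> <name> <command...>
# Runs the command in the background and writes stdout+stderr to <outdir>/<name>.txt.
snapshot() {
  local outdir=$1 name=$2
  shift 2
  "$@" >"$outdir/$name.txt" 2>&1 &
}

# collect_diagnostics <outdir>
# Starts the four quick-diagnosis commands at once, then waits for all of them.
collect_diagnostics() {
  local outdir=$1
  mkdir -p "$outdir"
  snapshot "$outdir" pods   kubectl get pods -n production
  snapshot "$outdir" events kubectl get events -n production --sort-by=.lastTimestamp
  snapshot "$outdir" top    kubectl top nodes
  snapshot "$outdir" nodes  kubectl get nodes -o wide
  wait  # returns once every background job has finished
}
```

A call such as `collect_diagnostics /tmp/incident-payments` leaves four text files that can be inspected with `grep -v Running` or `tail -20` as above.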

Incident Commander Responsibilities

# Incident Commander (IC) is the single decision-maker during an incident.
# IC does NOT have to be the most technical person — they coordinate.

IC responsibilities:
  - Open the incident channel and Zoom bridge
  - Assign roles: Investigator, Communicator, Scribe
  - Drive to mitigation (not necessarily root cause)
  - Post status updates every 15 minutes (P1) or 30 minutes (P2)
  - Decide when to escalate (engage more engineers)
  - Decide when incident is resolved
  - Ensure postmortem is scheduled

Status update template (every 15 min during P1):
  Status: INVESTIGATING | MITIGATING | RESOLVED
  Impact: [describe what's broken and who's affected]
  Last Action: [what we just tried]
  Next Action: [what we're doing now]
  ETA: [estimate if known, "unknown" if not]
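
A consistent update format is easier to keep under pressure if it is generated. The following is a minimal sketch; `format_status_update` is a hypothetical helper, and how the text gets posted to the incident channel is left out on purpose.

```shell
# format_status_update <status> <impact> <last_action> <next_action> [eta]
# Prints the status update template; ETA defaults to "unknown".
format_status_update() {
  local status=$1 impact=$2 last=$3 next=$4 eta=${5:-unknown}
  case "$status" in
    INVESTIGATING|MITIGATING|RESOLVED) ;;
    *) echo "invalid status: $status" >&2; return 1 ;;
  esac
  printf 'Status: %s\nImpact: %s\nLast Action: %s\nNext Action: %s\nETA: %s\n' \
    "$status" "$impact" "$last" "$next" "$eta"
}
```

Rejecting any status outside the three allowed values keeps updates comparable across the timeline.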

Common Kubernetes Incident Playbooks

All Pods Crashing

kubectl get pods -A | grep CrashLoop
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous

# Common cause: bad deploy (wrong config, broken image)
# Immediate mitigation:
kubectl rollout undo deployment <deploy> -n <ns>
# Verify rollback:
kubectl rollout status deployment <deploy> -n <ns>

# If Argo CD is auto-syncing back to broken state:
kubectl patch application <app> -n argocd \
  --type merge -p '{"spec":{"syncPolicy":{"automated":null}}}'
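
To see the blast radius quickly, the `kubectl get pods -A` output can be reduced to a list of affected pods. This is a sketch: `crashlooping_pods` is a hypothetical helper, and it assumes the default column layout (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# crashlooping_pods
# Reads `kubectl get pods -A` output on stdin and prints "<namespace>/<pod>" for
# every pod whose STATUS column contains CrashLoopBackOff (including init containers).
crashlooping_pods() {
  awk 'NR > 1 && $4 ~ /CrashLoopBackOff/ { print $1 "/" $2 }'
}
```

Usage: `kubectl get pods -A | crashlooping_pods`. If every line shares one Deployment prefix, a bad deploy is the likely cause; if the list spans namespaces, look at nodes or shared dependencies first.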

Service Traffic Dropped to Zero

# Check endpoints
kubectl get endpoints <svc> -n <ns>

# Check pods
kubectl get pods -n <ns> -l app=<app>

# Trace the path
kubectl run debug --image=nicolaka/netshoot --restart=Never --rm -it -n <ns> -- \
  curl -v http://<svc>.<ns>.svc.cluster.local:<port>/healthz

# If pods are unhealthy: rollback or restart
kubectl rollout restart deployment <deploy> -n <ns>

# If NetworkPolicy changed recently:
kubectl get networkpolicy -n <ns>
kubectl describe networkpolicy <np> -n <ns>
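
The checks above narrow the problem down in a fixed order, which can be sketched as a decision helper. `next_step_for_zero_traffic` is a hypothetical function; the two counts are assumed to come from the endpoint and pod checks above.

```shell
# next_step_for_zero_traffic <ready_endpoint_count> <running_pod_count>
# Prints a short label and a hint for where to look next.
next_step_for_zero_traffic() {
  local endpoints=$1 pods=$2
  if [ "$pods" -eq 0 ]; then
    echo "no-pods: check the Deployment and recent rollouts"
  elif [ "$endpoints" -eq 0 ]; then
    echo "no-endpoints: pods exist but none are Ready, or the Service selector does not match"
  else
    echo "endpoints-ok: check NetworkPolicy and the path in front of the Service"
  fi
}

# One way to obtain the counts (placeholders as in the commands above):
#   kubectl get endpoints <svc> -n <ns> -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w
#   kubectl get pods -n <ns> -l app=<app> --field-selector=status.phase=Running --no-headers | wc -l
```

An empty endpoint list with running pods usually points at failing readiness probes or a label mismatch, so a restart alone may not help.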

Node Group Failure

kubectl get nodes
# Several nodes showing NotReady

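# Check which pods are still bound to the dead node. Controller-managed pods are
# evicted after the NotReady toleration expires (5 minutes by default) and are
# rescheduled if other nodes have capacity.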
kubectl get pods -A -o wide | grep <dead-node>

# For cloud: check if ASG is replacing instances
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names workers-asg \
  --query "AutoScalingGroups[0].{Desired:DesiredCapacity,Running:length(Instances[?LifecycleState=='InService'])}"

# Force pod migration from a dead node (only if pods are stuck and the node is confirmed gone)
# Deleting the Node object removes its pods from the API server; controllers recreate them elsewhere
kubectl delete node <dead-node>
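
To judge how much of the node group is gone, the `kubectl get nodes` output can be summarized. This is a sketch with hypothetical helper names, assuming the default column layout where STATUS is the second column.

```shell
# not_ready_nodes
# Reads `kubectl get nodes` output on stdin and prints the names of nodes whose
# STATUS starts with NotReady (this also matches "NotReady,SchedulingDisabled").
not_ready_nodes() {
  awk 'NR > 1 && $2 ~ /^NotReady/ { print $1 }'
}

# not_ready_summary
# Reads the same input and prints "<bad>/<total> nodes NotReady".
not_ready_summary() {
  awk 'NR > 1 { total++; if ($2 ~ /^NotReady/) bad++ }
       END { printf "%d/%d nodes NotReady\n", bad, total }'
}
```

Usage: `kubectl get nodes | not_ready_summary`. The ratio is a useful number for the first status update.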

etcd / API Server Issues

# kubectl not responding: "The connection to the server was refused"

# Check control plane (from a node that can reach it)
curl -k https://<api-server-ip>:6443/healthz

# EKS: check in console or:
aws eks describe-cluster --name <cluster> --query 'cluster.status'

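# kubeadm: check if etcd is running
kubectl get pod -n kube-system etcd-<cp-node>        # only works if the API server still responds
sudo crictl ps -a | grep -E 'etcd|kube-apiserver'    # on control plane node: kubeadm runs both as static pods
systemctl status etcd   # on control plane node, only if etcd runs as an external systemd service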

# Emergency: if API server is down, kubelet continues running existing pods
# Focus on restoring API server before touching any workloads
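
While the control plane is being restored, polling the health endpoint avoids touching workloads too early. The following is a minimal retry sketch; `retry_until` is a hypothetical helper, not a kubectl or curl feature.

```shell
# retry_until <max_attempts> <delay_seconds> <command...>
# Runs the command until it succeeds. Returns 0 on the first success,
# 1 if the command still fails after <max_attempts> tries.
retry_until() {
  local max=$1 delay=$2 attempt=1
  shift 2
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Example (placeholder as in the commands above), up to 30 tries, 10 s apart:
#   retry_until 30 10 curl -ksf -o /dev/null https://<api-server-ip>:6443/healthz
```

The `-f` flag makes `curl` exit non-zero on HTTP errors, so the loop only stops once the endpoint answers successfully.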

Rollback Procedures

# Application rollback (Deployment)
kubectl rollout history deployment <deploy> -n <ns>
kubectl rollout undo deployment <deploy> -n <ns>
kubectl rollout undo deployment <deploy> --to-revision=<N> -n <ns>

# Argo CD rollback (disable auto-sync first)
kubectl patch application <app> -n argocd \
  --type merge -p '{"spec":{"syncPolicy":{"automated":null}}}'
argocd app history <app>                 # list deployment history IDs
argocd app rollback <app> <history-id>   # the history ID is a positional argument

# Helm rollback
helm history payments-api -n production
helm rollback payments-api <revision> -n production

# Rollback a ConfigMap change (restore from git)
# Pods that read the ConfigMap as environment variables need a restart to pick up the change
git show HEAD~1:k8s/production/configmap.yaml | kubectl apply -f -

# Emergency image pin (bypass GitOps temporarily)
kubectl set image deployment/payments-api \
  payments-api=myrepo/payments-api:sha-abc123 \
  -n production
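
Because a rollback is often typed under stress, it can help to preview the exact commands first. This is a sketch of a dry-run wrapper; `run`, `rollback_deployment`, and the `DRY_RUN` variable are hypothetical names for illustration.

```shell
# run <command...>
# Prints the command prefixed with "+" when DRY_RUN=1, otherwise executes it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+'
    printf ' %s' "$@"
    printf '\n'
  else
    "$@"
  fi
}

# rollback_deployment <deployment> <namespace> [revision]
# Undoes the last rollout (or returns to a given revision), then waits for it.
rollback_deployment() {
  local deploy=$1 ns=$2 rev=${3:-}
  if [ -n "$rev" ]; then
    run kubectl rollout undo "deployment/$deploy" --to-revision="$rev" -n "$ns" || return
  else
    run kubectl rollout undo "deployment/$deploy" -n "$ns" || return
  fi
  run kubectl rollout status "deployment/$deploy" -n "$ns"
}
```

With `DRY_RUN=1` set, `rollback_deployment payments-api production 3` only prints the two `kubectl` commands, which can be pasted into the incident channel for a second pair of eyes before running them for real.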

Communication Templates

# Initial status page update (within 5 min of incident declaration):
"We are currently investigating reports of [brief description].
Our team has been alerted and is actively working on a resolution.
We will provide an update in 30 minutes or sooner."

# Mid-incident update:
"We have identified the root cause as [brief, non-technical].
We are currently [action being taken].
Estimated time to resolution: [ETA or unknown].
Next update in 15 minutes."

# Resolution:
"This incident has been resolved as of [time].
The issue was [brief explanation].
All services are now operating normally.
We will publish a full postmortem within 5 business days."

Blameless Postmortem Template

# Postmortem required for all P1 and P2 incidents
# Due: within 5 business days

## Incident Summary
- Date/Time:
- Duration:
- Severity: P1/P2
- Services Affected:
- Impact: (users affected, requests failed, revenue)

## Timeline
| Time  | Event | Who |
|-------|-------|-----|
| 14:00 | Alert fired: error rate > 5% | PagerDuty |
| 14:02 | On-call acknowledged | alice |
| 14:08 | Identified bad deploy | alice, bob |
| 14:15 | Rolled back deployment | alice |
| 14:20 | Service restored | - |

## Root Cause
[Technical explanation of what broke and why]

## Contributing Factors
- [Factor 1: missing staging test]
- [Factor 2: no canary deployment]

## What Went Well
- Alert fired immediately
- Rollback was fast (2 min)

## Action Items
| Action | Owner | Due | Priority |
|--------|-------|-----|----------|
| Add integration test for X | Bob | 2025-02-01 | P1 |
| Enable canary deployment | Alice | 2025-02-15 | P2 |
| Improve runbook for Y | Charlie | 2025-02-08 | P3 |
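
The Duration field and the response metrics can be derived from the timeline rather than estimated. This is a sketch, assuming all timestamps are same-day `HH:MM` values as in the example timeline; `to_minutes` and `minutes_between` are hypothetical helpers.

```shell
# to_minutes <HH:MM>
# Converts a clock time to minutes since midnight.
to_minutes() {
  local h=${1%%:*} m=${1##*:}
  # Drop one leading zero so values like "08" are not parsed as octal.
  h=${h#0}
  m=${m#0}
  echo $(( h * 60 + m ))
}

# minutes_between <start HH:MM> <end HH:MM>
minutes_between() {
  echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}
```

For the example timeline, `minutes_between 14:00 14:02` gives a time to acknowledge of 2 minutes and `minutes_between 14:00 14:20` gives a total duration of 20 minutes. Incidents that cross midnight need full timestamps instead.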

On-Call Hygiene

# After every incident:
# 1. Ensure all action items are tracked (Linear/Jira tickets)
# 2. Update runbook if gaps found
# 3. Add/improve alerts if detection was slow
# 4. Schedule postmortem within 24h (for P1)

# Toil tracking:
# Log every page that required manual action
# Aggregate monthly: if any alert fires > 5 times → automate or fix

# On-call metrics to review weekly:
# - Mean Time to Acknowledge (MTTA): target < 5 min
# - Mean Time to Resolve (MTTR): track by severity
# - False positive alert rate: target < 5%
# - Pages per on-call rotation: target < 5 actionable pages/week
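
The monthly toil aggregation and the MTTA figure can be computed from a plain page log. This is a sketch under stated assumptions: the log has one alert name per line (one line per page, names without spaces), acknowledgement times are given in minutes, and both helper names are hypothetical.

```shell
# noisy_alerts
# Reads one alert name per line and prints the alerts that fired more than 5 times.
noisy_alerts() {
  sort | uniq -c | awk '$1 > 5 { print $2 }'
}

# mean_minutes
# Reads one number per line and prints the mean with one decimal place.
# Prints nothing for empty input.
mean_minutes() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%.1f\n", sum / n }'
}
```

Feeding a month of page names into `noisy_alerts` yields the candidates for "automate or fix", and piping acknowledgement times into `mean_minutes` gives the MTTA to compare against the 5-minute target.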

10 — SRE Practices — SLOs, error budgets, toil
10.09 — On-Call — PagerDuty setup and rotation
10.08 — Runbooks — runbook standards
12 — Troubleshooting — per-layer diagnostic playbooks