Production Operations Overview
Running Kubernetes reliably at scale — capacity, performance, resilience, security, and incident response in a unified operational framework.
What Is Production Operations
Production operations is the discipline of keeping Kubernetes clusters and their workloads available, performant, secure, and cost-efficient 24/7. While the platform engineering section (Section 08) covers what to build, this section covers how to run it — day-2 through day-N operations.
Plan → Build → Deploy → Operate → Improve

| Phase | Activities |
|---|---|
| Plan | Capacity planning, SLO design, budget, on-call rotation |
| Build | Platform provisioning, policy, security, secrets, portal |
| Deploy | GitOps pipeline, canary, PDB, smoke test, gate |
| Operate | Monitor, alert, runbooks, incident response, escalation |
| Improve | Post-mortems, SLO review, chaos tests, capacity plan, efficiency, upgrade cycle |
The gap between a cluster that works and a cluster that runs production spans five concerns:
Reliability
SLOs, error budgets, PDBs, multi-AZ topology, chaos engineering, disaster recovery with RTO/RPO targets.
Performance
Right-sizing, kernel tuning, JVM profiling, etcd/apiserver latency, HPA/KEDA autoscaling response time.
Security
CIS hardening, runtime threat detection, audit log analysis, CVE patching cadence, zero-trust networking.
Efficiency
Capacity planning, VPA right-sizing, spot adoption, FinOps reviews, idle resource detection.
Operability
Runbooks, on-call rotations, change management, upgrade cadence, certificate lifecycle, incident playbooks.
Operations Domains
| Domain | Core Concern | Key Tools | Primary Signal |
|---|---|---|---|
| Capacity Planning | Right cluster size today and 6 months forward; headroom for spikes | VPA/Goldilocks, OpenCost, kube-capacity | CPU/mem allocatable vs requested |
| Performance Tuning | Latency, throughput, kernel/JVM/GC optimization | pprof, Pyroscope, perf, strace, netstat | Request p99, GC pause, saturation |
| Disaster Recovery | etcd backup/restore, Velero workload backup, RTO/RPO | Velero, etcdctl, cluster snapshots | Restore time, backup age |
| Security Hardening | CIS benchmarks, runtime detection, supply chain | kube-bench, Falco, Trivy, network policies | kube-bench score, Falco alert rate |
| Network Operations | DNS reliability, CNI health, Ingress/Gateway throughput | CoreDNS, Cilium, NGINX Ingress, Hubble | DNS latency, connection errors, drop rate |
| Storage Operations | PVC lifecycle, StorageClass tuning, backup, CSI health | CSI drivers, Velero, Restic, etcd compaction | PVC bind time, IOPS saturation |
| Certificate Management | TLS expiry, rotation automation, PKI hierarchy | cert-manager, Vault PKI, cfssl | Certificate days-to-expiry |
| Cluster Maintenance | Version upgrades, node rotation, etcd compaction | eksctl/gcloud/kubeadm, Karpenter drift | K8s version skew, etcd DB size |
| Incident Response | On-call triage, runbooks, post-mortems, MTTR | PagerDuty, runbooks, kubectl debug | MTTR, alert fatigue rate, post-mortem count |
| SRE Practices | SLI/SLO/error budget, toil reduction, reliability reviews | Sloth, pyrra, Grafana SLO, chaos-mesh | Error budget burn rate |
Production Readiness Checklist
Before a workload enters production, each item below should be confirmed. This acts as a gate — teams self-certify against this checklist as part of their launch process (optionally enforced via Kyverno policies or Backstage scorecards, as covered in Section 08-05 and 08-04).
Reliability
- `replicas >= 2` for stateless workloads; zero single points of failure
- PodDisruptionBudget defined (`minAvailable: 1` or `maxUnavailable: 1`)
- Pod topology spread across zones (`topologyKey: topology.kubernetes.io/zone`)
- Liveness, readiness, and startup probes configured
- Graceful shutdown: `preStop` hook + `terminationGracePeriodSeconds` matches SIGTERM handling
- HPA or KEDA ScaledObject defined; scale-up headroom tested
Observability
- ServiceMonitor or PodMonitor present; Prometheus scraping confirmed
- Structured JSON logs with `level`, `msg`, `trace_id`, `service`, `env` fields
- OpenTelemetry traces emitted; sampling rate configured
- Grafana dashboard committed to repo alongside code
- Alerts: SLO burn rate + saturation + error rate PrometheusRules in repo
- PagerDuty service registered; runbook URL in alert annotations
Security
- Pod Security Standard ≥ baseline enforced on namespace
- `runAsNonRoot: true`, `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false`
- All capabilities dropped; seccomp profile `RuntimeDefault` or `Localhost`
- Container image scanned (no CRITICAL/HIGH CVEs); signed with Cosign
- Secrets sourced from ESO or Vault — no plaintext in manifests or env vars
- NetworkPolicy: default-deny + explicit allow rules
- RBAC: least-privilege ServiceAccount; no cluster-admin binding
Resource Management
- CPU and memory requests and limits set on every container
- VerticalPodAutoscaler (VPA) in `Off` mode for recommendation data
- Namespace ResourceQuota and LimitRange present
- PriorityClass assigned (defaults to `production-default`)
- Cost labels: `team`, `env`, `cost-center`, `service`
GitOps & CI/CD
- All manifests in Git; no manual `kubectl apply` to production
- Argo CD Application configured with `selfHeal: true`, `prune: true`
- Rollout strategy: canary or blue-green with AnalysisTemplate
- Rollback procedure documented and tested in staging
- SBOM generated and attested per image build
Operations
- Runbook written and linked from alert annotations
- On-call rotation set up; escalation policy defined
- SLOs defined: availability target + latency target + error budget window
- Chaos/failure injection tested in staging (pod kill, AZ failure)
- DR restore procedure documented with RTO/RPO targets
The above can be implemented as Backstage scorecard checks (tech-insights plugin) or as Kyverno ClusterPolicy validate rules that fail admission if required fields or annotations are missing. Automating the gate catches regressions without manual review for standard workloads.
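As a reference point for the Reliability, Security, and Resource Management items above, the following sketch writes a Deployment and PodDisruptionBudget that satisfy several of them. The workload name, namespace, image reference, port 8080, `/healthz` path, and resource figures are placeholders assumed for illustration, and the `preStop` hook assumes the image ships a `sleep` binary.

```shell
# Write an illustrative manifest to a temp file. All names, the image, the
# port, the probe path, and the resource sizes are hypothetical placeholders.
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: payments
  labels: {app: payment-service, team: payments, env: production}
spec:
  replicas: 2                      # no single point of failure
  selector:
    matchLabels: {app: payment-service}
  template:
    metadata:
      labels: {app: payment-service}
    spec:
      terminationGracePeriodSeconds: 30
      topologySpreadConstraints:   # spread replicas across zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: {app: payment-service}
      securityContext:
        runAsNonRoot: true
        seccompProfile: {type: RuntimeDefault}
      containers:
        - name: payment-service
          image: registry.example.com/payments/payment-service:1.0.0
          ports: [{containerPort: 8080}]
          readinessProbe:
            httpGet: {path: /healthz, port: 8080}
          livenessProbe:
            httpGet: {path: /healthz, port: 8080}
          lifecycle:
            preStop:
              exec: {command: ["sleep", "5"]}   # give endpoints time to drain
          resources:
            requests: {cpu: 250m, memory: 256Mi}
            limits: {cpu: "1", memory: 512Mi}
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities: {drop: ["ALL"]}
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service
  namespace: payments
spec:
  minAvailable: 1
  selector:
    matchLabels: {app: payment-service}
EOF
echo "wrote $manifest"
# Validate against a real cluster before committing, for example:
#   kubectl apply --dry-run=server -f "$manifest"
```

The manifest leaves out the startup probe, HPA, NetworkPolicy, and ServiceMonitor, so it covers only part of the checklist.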
Runbook Framework
Every production alert must link to a runbook. A runbook without a clear structure becomes a wall of text nobody reads under pressure. Use this standard template:
# Alert: PaymentServiceHighErrorRate
## Summary
Payment service error rate exceeds 5% for 5+ minutes.
## Impact
Customers may be unable to complete purchases. Revenue impact estimated at $X/minute.
## Severity: P1
## Diagnosis Steps
### 1. Triage
```bash
kubectl get pods -n payments -l app=payment-service
kubectl top pods -n payments --sort-by=cpu
```
### 2. Check recent changes
```bash
kubectl rollout history deploy/payment-service -n payments
# Check Argo CD sync history
argocd app history payment-service
```
### 3. Check logs for errors
```bash
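# --tail=-1: with a label selector, kubectl logs returns only the last 10 lines per pod by default
kubectl logs -n payments -l app=payment-service --since=10m --tail=-1 | \
jq 'select(.level=="error")' | tail -50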
```
### 4. Check downstream dependencies
```bash
# Database connectivity
# Quote the command so $DB_HOST expands inside the container, not in your local shell
kubectl exec -n payments deploy/payment-service -- \
sh -c 'pg_isready -h "$DB_HOST" -p 5432'
# Check ESO secret sync
kubectl get externalsecret -n payments
```
## Resolution Steps
### Option A: Rollback
```bash
kubectl rollout undo deploy/payment-service -n payments
# Verify
kubectl rollout status deploy/payment-service -n payments
```
### Option B: Scale up
```bash
kubectl scale deploy/payment-service --replicas=10 -n payments
```
## Escalation
- After 15 min unresolved: page payments-team-lead
- After 30 min: page VP Engineering
## Post-Incident
File post-mortem within 48h using template at /post-mortems/template.md
Store runbooks in the same Git repository as the application. Link them from Backstage catalog-info.yaml under annotations.pagerduty.com/service-id and from every PrometheusRule alert annotation:
- alert: PaymentServiceHighErrorRate
  annotations:
    runbook_url: https://github.com/org/payments/blob/main/runbooks/high-error-rate.md
    summary: Payment service error rate > 5%
    description: "Namespace {{ $labels.namespace }}, job {{ $labels.job }}: {{ $value | humanizePercentage }} error rate"
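One way to keep the runbook link from going missing is a small CI check over the rule files. The sketch below is a line-based scan, not a YAML parser: it assumes each rule starts with a `- alert:` key and that the annotation key is literally `runbook_url:`. The function name `check_runbook_urls` and the second alert in the sample are hypothetical.

```shell
# check_runbook_urls FILE
# Prints the name of every alert that has no runbook_url annotation and
# returns non-zero if any are found.
check_runbook_urls() {
  awk '
    function report() { if (name != "" && !seen) { print name; bad = 1 } }
    /^[[:space:]]*-[[:space:]]*alert:/ {
      report()                       # close out the previous alert
      name = $0
      sub(/^[[:space:]]*-[[:space:]]*alert:[[:space:]]*/, "", name)
      seen = 0
      next
    }
    /^[[:space:]]*runbook_url:[[:space:]]*[^[:space:]]/ { seen = 1 }
    END { report(); exit bad }
  ' "$1"
}

# Sample rule file: the first alert has a runbook, the second does not.
rules="$(mktemp)"
cat > "$rules" <<'EOF'
groups:
  - name: payments
    rules:
      - alert: PaymentServiceHighErrorRate
        annotations:
          runbook_url: https://github.com/org/payments/blob/main/runbooks/high-error-rate.md
      - alert: PaymentServiceHighLatency
        annotations:
          summary: p99 latency above target
EOF
check_runbook_urls "$rules" || echo "alerts listed above are missing runbook_url"
```

A stricter check would parse the YAML properly; the scan above is only meant to show the shape of the gate.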
Operations Tooling Reference
| Category | Tool | Purpose | Install |
|---|---|---|---|
| Cluster introspection | k9s | Terminal UI for real-time cluster browsing | brew install k9s |
| Cluster introspection | kube-capacity | Node resource usage vs allocatable | kubectl krew install resource-capacity |
| Cluster introspection | kubectl-tree | OwnerReference hierarchy tree | kubectl krew install tree |
| Cluster introspection | kubectl-neat | Strip generated fields from kubectl output | kubectl krew install neat |
| Cluster introspection | stern | Multi-pod log tailing with regex filter | brew install stern |
| Debugging | kubectl debug | Ephemeral debug container (distroless-safe) | Built-in (K8s 1.23+) |
| Debugging | inspektor-gadget | eBPF-based in-cluster debugging (tcpdump, top, trace) | kubectl krew install gadget |
| Debugging | netshoot | nicolaka/netshoot: full network debug toolkit pod | kubectl run tmp --image=nicolaka/netshoot -it --rm |
| Security scanning | kube-bench | CIS Kubernetes Benchmark checks | Job YAML in cluster |
| Security scanning | Falco | Runtime syscall threat detection | Helm install |
| Security scanning | Trivy Operator | Continuous vulnerability + config scanning in-cluster | Helm install |
| Backup | Velero | Workload + PV backup/restore/migration | CLI + Helm chart |
| Backup | etcdctl | etcd snapshot save/restore | Bundled with etcd |
| Chaos | chaos-mesh | Fault injection: pod-kill, network, I/O, stress | Helm install |
| Chaos | kube-monkey | Chaos Monkey for Kubernetes (opt-in via labels) | Helm install |
| Cost | OpenCost / Kubecost | Namespace cost allocation and savings recommendations | Helm install (see 08-07) |
| Profiling | Pyroscope | Continuous profiling aggregation and flame graphs | Helm install (see 06-07) |
Essential krew plugins
# Install krew plugin manager first
(
set -x; cd "$(mktemp -d)" &&
OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
KREW="krew-${OS}_${ARCH}" &&
curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
tar zxvf "${KREW}.tar.gz" &&
./"${KREW}" install krew
)
# Add to PATH
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
# Install essential plugins
kubectl krew install \
resource-capacity \
tree \
neat \
ctx \
ns \
stern \
gadget \
view-secret \
who-can \
access-matrix \
hns
Change Management in Production
Production changes fall into three tiers with different approval and rollout requirements:
| Tier | Examples | Approval | Rollout | Rollback SLA |
|---|---|---|---|---|
| T1 — Standard | Application code change, config update, image bump | PR review + CI pass | GitOps canary/blue-green (automated) | < 5 min automated rollback |
| T2 — Elevated | HPA bounds change, resource quota increase, new Ingress rule, cluster add-on update | PR + team-lead approval | Staged: dev → staging → prod with manual gate | < 15 min via GitOps revert |
| T3 — High Risk | K8s control plane upgrade, etcd migration, CNI change, StorageClass migration, security policy change | Change Advisory Board + runbook review | Maintenance window + dry-run in staging | Manual restore procedure + RTO target |
GitOps as change ledger
Every production change must originate from a Git commit. This provides an automatic audit trail — who changed what, when, and via which PR. The Argo CD sync history additionally records which Git SHA was applied to each cluster and at what time.
# Which revisions were deployed, and when (Argo CD sync history)
argocd app get payment-service -o json | \
jq '.status.history[] | {id, revision, deployedAt}'
# Diff the live state against a specific Git revision
argocd app diff payment-service --revision HEAD~1
# Roll back to a previous history entry (automated sync must be disabled first)
argocd app rollback payment-service 42 # where 42 is the history ID
Establish change freeze periods: Friday 3pm–Monday 9am for T2/T3 changes; 48 hours before/after major holidays. Document these in the GitOps repo README and enforce via CI checks that block merges to the production branch outside approved windows.
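The weekly freeze window can be expressed as a small check that a CI job runs before allowing a production merge. This is a sketch under stated assumptions: the function name `in_change_freeze` is hypothetical, times are the CI runner's local time, and the holiday freezes are not covered (they would need a separate date list).

```shell
# in_change_freeze DOW HOUR
# DOW is the ISO day of week (1 = Monday ... 7 = Sunday), HOUR is 0-23.
# Returns 0 (true) inside the Friday 15:00 to Monday 09:00 freeze window.
in_change_freeze() {
  dow="$1"; hour="$2"
  if [ "$dow" -eq 5 ] && [ "$hour" -ge 15 ]; then return 0; fi   # Friday afternoon
  if [ "$dow" -eq 6 ] || [ "$dow" -eq 7 ]; then return 0; fi     # weekend
  if [ "$dow" -eq 1 ] && [ "$hour" -lt 9 ]; then return 0; fi    # Monday morning
  return 1
}

# In CI, feed it the current day and hour.
if in_change_freeze "$(date +%u)" "$(date +%H)"; then
  echo "change freeze active: T2/T3 merges to production are blocked"
else
  echo "outside freeze window"
fi
```

A real pipeline would also need to know the change tier, for example from a PR label, so that T1 changes are not blocked.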
On-Call Engineering
Sustainable on-call requires alert quality before rotation size. Before growing the rotation, reduce alert noise.
PagerDuty page
│
▼
Is SLO burning?
├─ Yes → Is error budget <10%? → P1: wake secondary + escalate
│ Is error budget <50%? → P2: respond within 30 min
│ Otherwise → P3: next business day
└─ No → Is customer-visible?
├─ Yes → P2
└─ No → Investigate during business hours; silence if noisy
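The decision tree above is simple enough to encode directly, which can be useful for alert routing or for checking that severities are assigned consistently. In this sketch the function name `classify_page` and its argument format are assumptions for illustration.

```shell
# classify_page SLO_BURNING BUDGET_REMAINING_PCT CUSTOMER_VISIBLE
# SLO_BURNING and CUSTOMER_VISIBLE are "yes" or "no"; the budget is an
# integer percentage of error budget remaining.
classify_page() {
  slo_burning="$1"; budget_pct="$2"; customer_visible="$3"
  if [ "$slo_burning" = "yes" ]; then
    if [ "$budget_pct" -lt 10 ]; then echo "P1"
    elif [ "$budget_pct" -lt 50 ]; then echo "P2"
    else echo "P3"
    fi
  elif [ "$customer_visible" = "yes" ]; then
    echo "P2"
  else
    echo "business-hours"
  fi
}

classify_page yes 8 yes    # prints P1
classify_page yes 35 no    # prints P2
classify_page no 90 no     # prints business-hours
```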
Alert quality metrics to track
| Metric | Healthy Target | Warning | Action if Warning |
|---|---|---|---|
| Pages per on-call shift (8h) | < 2 | > 5 | Alert audit: silence/tune noisy alerts |
| Actionable page % | > 80% | < 60% | Review alert conditions; raise thresholds |
| MTTR (P1 incidents) | < 30 min | > 60 min | Improve runbooks; add diagnostic steps |
| Post-mortem completion rate | 100% of P1/P2 | < 80% | Block new incidents until post-mortem filed |
| Alert flap rate | < 5% | > 15% | Add a `for: 5m` duration to flapping alerts |
On-call rotation setup (PagerDuty pattern)
# Backstage catalog-info.yaml: link to PagerDuty service
metadata:
  annotations:
    pagerduty.com/integration-key: abc123xyz
    pagerduty.com/service-id: PXYZ123
    # On-call schedule visible in Backstage
    pagerduty.com/escalation-policy-id: EP123456
kubectl debug — ephemeral container triage
# Attach an ephemeral debug container to a running pod (works even when the target image is distroless)
kubectl debug -it payment-pod-xyz \
--image=nicolaka/netshoot \
--target=payment-service \
-n payments
# Debug a CrashLoopBackOff pod by copying it with a shell
kubectl debug payment-pod-xyz \
-it \
--copy-to=payment-debug \
--image=nicolaka/netshoot \
--share-processes \
-n payments
# Node-level debug (requires privileged — for platform SREs only)
kubectl debug node/ip-10-0-1-42.us-east-1.compute.internal \
-it \
--image=nicolaka/netshoot
The Four Golden Signals
The four golden signals (Google SRE Book) apply directly to Kubernetes workloads. Every production service should have Prometheus alerts covering all four.
Latency
Time to handle a request. Track p50/p95/p99 — not just mean. A slow success and a fast failure are both important.
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket
{job="payment-service"}[5m])))
Traffic
How much demand is on the system. Requests per second, messages per second, active connections.
sum(rate(http_requests_total
{job="payment-service"}[1m]))
by (method, route)
Errors
Rate of failed requests — explicit (5xx) or implicit (wrong data). Track 4xx separately to detect client-side abuse.
sum(rate(http_requests_total
{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
Saturation
How full the service is. CPU throttling, queue depth, memory pressure, connection pool exhaustion.
sum(rate(container_cpu_cfs_throttled_periods_total
{namespace="payments"}[5m])) /
sum(rate(container_cpu_cfs_periods_total
{namespace="payments"}[5m]))
Production SLOs
SLOs are agreements with your users about service reliability. Without an error budget, every outage feels like a catastrophe. With one, you know how much risk you can take.
SLI → SLO → Error Budget chain
SLI: ratio of successful requests
SLO: 99.9% availability over 30 days
Error budget: 0.1% × 30d × 24h × 60min = 43.2 minutes

| Day | Event | Budget remaining |
|---|---|---|
| 1–5 | No incidents | 43.2 min |
| 12 | 5 min incident | 38.2 min |
| 20 | 30 min incident | 8.2 min (about 19%) |

Once the remaining budget falls below 10% (under about 4.3 minutes here): feature freeze (only reliability work allowed), no T2/T3 changes, and a lower on-call P1 threshold.
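The budget arithmetic above can be reproduced for any SLO target and window. The helper names `error_budget_minutes` and `budget_remaining_pct` are hypothetical; the sketch uses `awk` for the floating-point math.

```shell
# error_budget_minutes SLO_PERCENT WINDOW_DAYS
# Total allowed "bad" minutes for an availability SLO over the window.
error_budget_minutes() {
  LC_ALL=C awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (1 - slo / 100) * days * 24 * 60 }'
}

# budget_remaining_pct BUDGET_MINUTES SPENT_MINUTES
# Share of the budget still available, rounded to a whole percent.
budget_remaining_pct() {
  LC_ALL=C awk -v budget="$1" -v spent="$2" \
    'BEGIN { printf "%.0f\n", (budget - spent) / budget * 100 }'
}

budget="$(error_budget_minutes 99.9 30)"
echo "budget: ${budget} min"                                         # 43.2 min
echo "remaining after 35 min of incidents: $(budget_remaining_pct "$budget" 35)%"   # 19%
```

Tightening the target to 99.99% over the same window shrinks the budget to about 4.3 minutes, which is why the SLO target should be chosen before the alerting thresholds.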
Sloth SLO definition (declarative)
Sloth generates multi-window multi-burn-rate alerts from a simple SLO YAML, following the Google SRE workbook's alerting strategy:
# The Kubernetes CRD uses camelCase field names (errorQuery, totalQuery, pageAlert, ticketAlert)
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payment-service-slo
  namespace: payments
spec:
  service: payment-service
  labels:
    team: payments
    env: production
  slos:
    - name: requests-availability
      objective: 99.9
      description: 99.9% of payment API requests succeed
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{job="payment-service",status=~"5.."}[{{.window}}]))
          totalQuery: |
            sum(rate(http_requests_total{job="payment-service"}[{{.window}}]))
      alerting:
        name: PaymentServiceHighErrorBudgetBurn
        annotations:
          runbook_url: https://github.com/org/payments/blob/main/runbooks/slo-burn.md
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning
    - name: requests-latency
      objective: 99.0
      description: 99% of payment API requests complete in < 500ms
      sli:
        events:
          # Error events are requests slower than 500ms: total minus the le="0.5" bucket
          errorQuery: |
            sum(rate(http_request_duration_seconds_count{job="payment-service"}[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{job="payment-service",le="0.5"}[{{.window}}]))
          totalQuery: |
            sum(rate(http_request_duration_seconds_count{job="payment-service"}[{{.window}}]))
# Generate PrometheusRules from Sloth SLO YAML
sloth generate -i slo.yaml -o prometheus-rules.yaml
# Or deploy Sloth as an operator (watches PrometheusServiceLevel CRs)
helm repo add sloth https://slok.github.io/sloth
helm install sloth sloth/sloth \
--namespace monitoring \
--set customSLIs.enabled=true
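The multi-burn-rate alerts mentioned above rest on one ratio: burn rate is the observed error ratio divided by the error ratio the SLO allows. A burn rate of 1 spends the budget exactly over the window. The SRE workbook's fast-burn example uses 14.4, which consumes 2% of a 30-day budget in one hour. The helper names below are hypothetical, and the sketch does not reproduce the rules Sloth generates.

```shell
# burn_rate ERROR_RATIO SLO_PERCENT
# How many times faster than "exactly on budget" errors are being consumed.
burn_rate() {
  LC_ALL=C awk -v err="$1" -v slo="$2" \
    'BEGIN { printf "%.1f\n", err / (1 - slo / 100) }'
}

# budget_consumed_pct BURN_RATE HOURS WINDOW_DAYS
# Share of the window's error budget used by HOURS at a constant burn rate.
budget_consumed_pct() {
  LC_ALL=C awk -v rate="$1" -v hours="$2" -v days="$3" \
    'BEGIN { printf "%.1f\n", rate * hours / (days * 24) * 100 }'
}

burn_rate 0.0144 99.9            # 1.44% errors against a 99.9% SLO: prints 14.4
budget_consumed_pct 14.4 1 30    # one hour at that rate, 30-day window: prints 2.0
```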
Error budget policy
| Budget Remaining | Action |
|---|---|
| > 50% | Normal operations; T1/T2/T3 changes allowed |
| 25–50% | T3 changes require extra review; increase canary duration |
| 10–25% | No T3 changes; T2 changes require CISO/VP approval |
| < 10% | Feature freeze: only reliability/security work merges; all-hands reliability sprint |
| Exhausted | Incident declared; post-mortem mandatory; SLO review within 1 week |
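The policy table can be turned into a lookup that a merge gate or dashboard calls with the current budget figure. This is a minimal sketch: the function name `budget_policy` is hypothetical, and because the table lists 25% in two bands, the code assumes the boundary value falls in the 25–50% band.

```shell
# budget_policy REMAINING_PCT
# REMAINING_PCT is an integer percentage of error budget remaining.
budget_policy() {
  pct="$1"
  if   [ "$pct" -le 0 ];  then echo "exhausted: incident declared, post-mortem mandatory"
  elif [ "$pct" -lt 10 ]; then echo "feature freeze: reliability/security work only"
  elif [ "$pct" -lt 25 ]; then echo "no T3 changes; T2 needs CISO/VP approval"
  elif [ "$pct" -le 50 ]; then echo "T3 needs extra review; longer canary"
  else echo "normal operations"
  fi
}

budget_policy 19   # prints: no T3 changes; T2 needs CISO/VP approval
```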
Operations Maturity Model
| Level | Characteristics |
|---|---|
| Level 1 | No runbooks; alert on every metric; single point of failure; no post-mortems; manual cert renewal |
| Level 2 | Basic runbooks; PDBs defined; on-call rotation; ad-hoc post-mortems; cert-manager installed |
| Level 3 | Error budgets tracked; canary deployments; Velero backups; kube-bench enforced; DR tested quarterly |
| Level 4 | Chaos engineering; auto-remediation; capacity forecasting; toil < 50%; multi-cluster DR |
| Level 5 | Predictive scaling; zero-touch upgrades; continuous chaos; toil < 20%; autonomous DR |
Most teams should target Level 3 before optimizing for Level 4. Skipping to Level 5 without solid Level 3 foundations (SLOs, DR, runbooks) results in fragile automation.
Section 09 — Topics in This Section
The eleven files in this section move from planning to hands-on operations to reliability engineering.
Best Practices
SLOs before alerts
Define SLOs first; derive alerts from error budget burn rate. This eliminates alert fatigue and focuses on user-visible impact.
Runbooks are mandatory
No alert ships to production without a runbook URL in its annotations. Enforce this in CI with a lint check over the PrometheusRule manifests.
Test your DR regularly
An untested restore procedure is not a DR plan. Run quarterly restore drills from Velero backups and etcd snapshots in a staging cluster.
Production readiness gates
Automate the readiness checklist as a Backstage scorecard or Kyverno policy. Manual checklists are forgotten under delivery pressure.
Toil budget
Track on-call toil per sprint. If toil exceeds 50% of engineering time, halt feature work and invest in automation. This is the SRE contract.
Change tier discipline
Classify every change before it ships. T3 changes pushed through during a change-freeze window are a common cause of weekend incidents.