What Is Production Operations

Production operations is the discipline of keeping Kubernetes clusters and their workloads available, performant, secure, and cost-efficient 24/7. While the platform engineering section (Section 08) covers what to build, this section covers how to run it — day-2 through day-N operations.

Operational Lifecycle
   Plan          Build          Deploy         Operate         Improve
  ────────      ───────        ────────        ────────        ─────────
  Capacity   →  Platform   →   GitOps     →   Monitor    →   Post-mortems
  Planning      Provisioning   Pipeline        Alert           SLO review
  SLO design    Policy         Canary           Runbooks        Chaos tests
  Budget        Security       PDB              Incident        Capacity plan
  On-call       Secrets        Smoke test       Response        Efficiency
  rotation      Portal         Gate             Escalation      upgrade cycle

The gap between a cluster that works and a cluster that runs production spans five concerns:

Reliability

SLOs, error budgets, PDBs, multi-AZ topology, chaos engineering, disaster recovery with RTO/RPO targets.

Performance

Right-sizing, kernel tuning, JVM profiling, etcd/apiserver latency, HPA/KEDA autoscaling response time.

Security

CIS hardening, runtime threat detection, audit log analysis, CVE patching cadence, zero-trust networking.

Efficiency

Capacity planning, VPA right-sizing, spot adoption, FinOps reviews, idle resource detection.

Operability

Runbooks, on-call rotations, change management, upgrade cadence, certificate lifecycle, incident playbooks.

Operations Domains

| Domain | Core Concern | Key Tools | Primary Signal |
| --- | --- | --- | --- |
| Capacity Planning | Right cluster size today and 6 months forward; headroom for spikes | VPA/Goldilocks, OpenCost, kube-capacity | CPU/mem allocatable vs requested |
| Performance Tuning | Latency, throughput, kernel/JVM/GC optimization | pprof, Pyroscope, perf, strace, netstat | Request p99, GC pause, saturation |
| Disaster Recovery | etcd backup/restore, Velero workload backup, RTO/RPO | Velero, etcdctl, cluster snapshots | Restore time, backup age |
| Security Hardening | CIS benchmarks, runtime detection, supply chain | kube-bench, Falco, Trivy, network policies | kube-bench score, Falco alert rate |
| Network Operations | DNS reliability, CNI health, Ingress/Gateway throughput | CoreDNS, Cilium, NGINX Ingress, Hubble | DNS latency, connection errors, drop rate |
| Storage Operations | PVC lifecycle, StorageClass tuning, backup, CSI health | CSI drivers, Velero, Restic, etcd compaction | PVC bind time, IOPS saturation |
| Certificate Management | TLS expiry, rotation automation, PKI hierarchy | cert-manager, Vault PKI, cfssl | Certificate days-to-expiry |
| Cluster Maintenance | Version upgrades, node rotation, etcd compaction | eksctl/gcloud/kubeadm, Karpenter drift | K8s version skew, etcd DB size |
| Incident Response | On-call triage, runbooks, post-mortems, MTTR | PagerDuty, runbooks, kubectl debug | MTTR, alert fatigue rate, post-mortem count |
| SRE Practices | SLI/SLO/error budget, toil reduction, reliability reviews | Sloth, pyrra, Grafana SLO, chaos-mesh | Error budget burn rate |
SRE PracticesSLI/SLO/error budget, toil reduction, reliability reviewsSloth, pyrra, Grafana SLO, chaos-meshError budget burn rate

Production Readiness Checklist

Before a workload enters production, each item below should be confirmed. This acts as a gate — teams self-certify against this checklist as part of their launch process (optionally enforced via Kyverno policies or Backstage scorecards, as covered in Section 08-05 and 08-04).

Reliability

Observability

Security

Resource Management

GitOps & CI/CD

Operations

💡 Automate the checklist

The checklist above can be implemented as Backstage scorecard checks (tech-insights plugin) or as Kyverno ClusterPolicy validate rules that reject admission when required annotations are missing. An automated gate catches regressions without manual review for standard workloads.
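The annotation gate described above can be prototyped outside the cluster before it is encoded as a policy. The following is a minimal sketch, not a Kyverno or Backstage API: the annotation keys under `example.com/` are hypothetical placeholders, and the manifest is assumed to be an already-parsed dictionary.

```python
from typing import Dict, Iterable, List

# Hypothetical annotation keys; substitute whatever your platform team requires.
REQUIRED_ANNOTATIONS = (
    "example.com/runbook-url",
    "example.com/owner",
    "example.com/slo",
)


def missing_annotations(manifest: Dict, required: Iterable[str] = REQUIRED_ANNOTATIONS) -> List[str]:
    """Return the required annotation keys that are absent or empty on a manifest."""
    annotations = (manifest.get("metadata") or {}).get("annotations") or {}
    return [key for key in required if not annotations.get(key)]


deployment = {
    "kind": "Deployment",
    "metadata": {
        "name": "payment-service",
        "annotations": {"example.com/owner": "payments"},
    },
}

# A CI job could fail the build whenever this list is non-empty.
print(missing_annotations(deployment))
# ['example.com/runbook-url', 'example.com/slo']
```

Running the same check in CI and at admission keeps the feedback loop short: developers see the failure in the pull request rather than at deploy time.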

Runbook Framework

Every production alert must link to a runbook. A runbook without a clear structure becomes a wall of text nobody reads under pressure. Use this standard template:

# Alert: PaymentServiceHighErrorRate

## Summary
Payment service error rate exceeds 5% for 5+ minutes.

## Impact
Customers may be unable to complete purchases. Revenue impact estimated at $X/minute.

## Severity: P1

## Diagnosis Steps

### 1. Triage
```bash
kubectl get pods -n payments -l app=payment-service
kubectl top pods -n payments --sort-by=cpu
```

### 2. Check recent changes
```bash
kubectl rollout history deploy/payment-service -n payments
# Check Argo CD sync history
argocd app history payment-service
```

### 3. Check logs for errors
```bash
# --tail=-1 overrides the 10-line default that applies when a label selector is used
kubectl logs -n payments -l app=payment-service --since=10m --tail=-1 | \
  jq -c 'select(.level=="error")' | tail -50
```

### 4. Check downstream dependencies
```bash
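# Database connectivity (run via sh -c so $DB_HOST expands inside the container;
# assumes the image ships a shell and pg_isready)
kubectl exec -n payments deploy/payment-service -- \
  sh -c 'pg_isready -h "$DB_HOST" -p 5432'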
# Check ESO secret sync
kubectl get externalsecret -n payments
```

## Resolution Steps

### Option A: Rollback
```bash
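# Emergency only: with Argo CD self-heal enabled this is reverted on the next
# sync, so follow up with a Git revert of the offending commit
kubectl rollout undo deploy/payment-service -n payments
# Verify
kubectl rollout status deploy/payment-service -n payments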
```

### Option B: Scale up
```bash
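# If an HPA manages this Deployment, raise its minReplicas instead; the HPA overrides manual replica counts
kubectl scale deploy/payment-service --replicas=10 -n payments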
```

## Escalation
- After 15 min unresolved: page payments-team-lead
- After 30 min: page VP Engineering

## Post-Incident
File post-mortem within 48h using template at /post-mortems/template.md

Store runbooks in the same Git repository as the application. Surface them in Backstage (for example as a metadata.links entry in catalog-info.yaml, next to the pagerduty.com/service-id annotation) and link them from every PrometheusRule alert annotation:

- alert: PaymentServiceHighErrorRate
  annotations:
    runbook_url: https://github.com/org/payments/blob/main/runbooks/high-error-rate.md
    summary: Payment service error rate > 5%
    description: "Namespace {{ $labels.namespace }}, job {{ $labels.job }}: {{ $value | humanizePercentage }} error rate"
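
The requirement that every alert carries a runbook link can be checked mechanically. The sketch below assumes the rule file has already been parsed into Python dictionaries (for example by a YAML library) and that `rule_groups` is its `groups` list. The second alert name and the recording rule are made up for illustration.

```python
from typing import Dict, List


def alerts_missing_runbook(rule_groups: List[Dict]) -> List[str]:
    """Return names of alerting rules that lack a non-empty runbook_url annotation.

    Recording rules (entries without an `alert` key) are skipped.
    """
    missing = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rule, no runbook needed
            annotations = rule.get("annotations") or {}
            if not annotations.get("runbook_url"):
                missing.append(rule["alert"])
    return missing


groups = [{
    "name": "payments",
    "rules": [
        {"alert": "PaymentServiceHighErrorRate",
         "annotations": {"runbook_url": "https://github.com/org/payments/blob/main/runbooks/high-error-rate.md"}},
        # Hypothetical alert that forgot its runbook link.
        {"alert": "PaymentServiceHighLatency", "annotations": {"summary": "p99 too high"}},
        {"record": "job:http_requests:rate5m", "expr": "sum(rate(http_requests_total[5m])) by (job)"},
    ],
}]

print(alerts_missing_runbook(groups))
# ['PaymentServiceHighLatency']
```

A CI step can exit non-zero when the returned list is not empty, which blocks the merge until the runbook exists.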

Operations Tooling Reference

| Category | Tool | Purpose | Install |
| --- | --- | --- | --- |
| Cluster introspection | k9s | Terminal UI for real-time cluster browsing | brew install k9s |
| Cluster introspection | kube-capacity | Node resource usage vs allocatable | kubectl krew install resource-capacity |
| Cluster introspection | kubectl-tree | OwnerReference hierarchy tree | kubectl krew install tree |
| Cluster introspection | kubectl-neat | Strip generated fields from kubectl output | kubectl krew install neat |
| Cluster introspection | stern | Multi-pod log tailing with regex filter | brew install stern |
| Debugging | kubectl debug | Ephemeral debug container (distroless-safe) | Built-in (K8s 1.23+) |
| Debugging | inspektor-gadget | eBPF-based in-cluster debugging (tcpdump, top, trace) | kubectl krew install gadget |
| Debugging | netshoot | nicolaka/netshoot: full network debug toolkit pod | kubectl run tmp --image=nicolaka/netshoot -it --rm |
| Security scanning | kube-bench | CIS Kubernetes Benchmark checks | Job YAML in cluster |
| Security scanning | Falco | Runtime syscall threat detection | Helm install |
| Security scanning | Trivy Operator | Continuous vulnerability + config scanning in-cluster | Helm install |
| Backup | Velero | Workload + PV backup/restore/migration | CLI + Helm chart |
| Backup | etcdctl | etcd snapshot save/restore | Bundled with etcd |
| Chaos | chaos-mesh | Fault injection: pod-kill, network, I/O, stress | Helm install |
| Chaos | kube-monkey | Chaos Monkey for Kubernetes (opt-in via labels) | Helm install |
| Cost | OpenCost / Kubecost | Namespace cost allocation and savings recommendations | Helm install (see 08-07) |
| Profiling | Pyroscope | Continuous profiling aggregation and flame graphs | Helm install (see 06-07) |

Essential krew plugins

# Install krew plugin manager first
(
  set -x; cd "$(mktemp -d)" &&
  OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
  ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
  KREW="krew-${OS}_${ARCH}" &&
  curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
  tar zxvf "${KREW}.tar.gz" &&
  ./"${KREW}" install krew
)

# Add to PATH
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"

# Install essential plugins
kubectl krew install \
  resource-capacity \
  tree \
  neat \
  ctx \
  ns \
  stern \
  gadget \
  view-secret \
  who-can \
  access-matrix \
  hns

Change Management in Production

Production changes fall into three tiers with different approval and rollout requirements:

| Tier | Examples | Approval | Rollout | Rollback SLA |
| --- | --- | --- | --- | --- |
| T1: Standard | Application code change, config update, image bump | PR review + CI pass | GitOps canary/blue-green (automated) | < 5 min automated rollback |
| T2: Elevated | HPA bounds change, resource quota increase, new Ingress rule, cluster add-on update | PR + team-lead approval | Staged: dev → staging → prod with manual gate | < 15 min via GitOps revert |
| T3: High Risk | K8s control plane upgrade, etcd migration, CNI change, StorageClass migration, security policy change | Change Advisory Board + runbook review | Maintenance window + dry-run in staging | Manual restore procedure + RTO target |

GitOps as change ledger

Every production change must originate from a Git commit. This provides an automatic audit trail — who changed what, when, and via which PR. The Argo CD sync history additionally records which Git SHA was applied to each cluster and at what time.

# Recent deployment history: history ID, Git revision, deploy time
# (the author and PR behind each revision live in the Git log)
argocd app get payment-service -o json | \
  jq '.status.history[] | {id, revision, deployedAt}'

# Diff live state against a specific Git revision (branch, tag, or commit SHA)
argocd app diff payment-service --revision <git-sha>

# Roll back to a previous history entry (not permitted while automated sync is enabled)
argocd app rollback payment-service 42   # where 42 is the history ID

⚠️ Change freeze windows

Establish change freeze periods: Friday 3pm–Monday 9am for T2/T3 changes; 48 hours before/after major holidays. Document these in the GitOps repo README and enforce via CI checks that block merges to the production branch outside approved windows.
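
A CI check for the weekly freeze can be very small. The sketch below covers only the Friday 3pm to Monday 9am window for T2/T3 changes; the holiday rule is left out, and the function names are illustrative rather than part of any CI product. Times are interpreted in whatever timezone the supplied `datetime` already represents.

```python
from datetime import datetime

FROZEN_TIERS = {"T2", "T3"}


def in_weekly_freeze(now: datetime) -> bool:
    """True between Friday 15:00 and Monday 09:00."""
    weekday, hour = now.weekday(), now.hour  # Monday == 0 ... Sunday == 6
    if weekday == 4:        # Friday: frozen from 15:00
        return hour >= 15
    if weekday in (5, 6):   # Saturday and Sunday: always frozen
        return True
    if weekday == 0:        # Monday: frozen until 09:00
        return hour < 9
    return False


def change_allowed(tier: str, now: datetime) -> bool:
    """T1 changes always pass; T2/T3 are blocked during the weekly freeze."""
    return tier not in FROZEN_TIERS or not in_weekly_freeze(now)


friday_afternoon = datetime(2024, 1, 5, 16, 0)  # 5 January 2024 was a Friday
print(change_allowed("T1", friday_afternoon))  # True
print(change_allowed("T3", friday_afternoon))  # False
```

In a pipeline, the tier would typically come from a PR label and the check would run as a required status on the production branch.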

On-Call Engineering

Sustainable on-call requires alert quality before rotation size. Before growing the rotation, reduce alert noise.

On-Call Alert Triage Flow
  PagerDuty page
       │
       ▼
  Is SLO burning?
  ├─ Yes → Is error budget <10%? → P1: wake secondary + escalate
  │        Is error budget <50%? → P2: respond within 30 min
  │        Otherwise            → P3: next business day
  └─ No  → Is customer-visible?
            ├─ Yes → P2
            └─ No  → Investigate during business hours; silence if noisy
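
The decision tree above translates directly into a function, which is handy for unit-testing routing rules before wiring them into alert routing. This is a sketch under the assumption that the remaining error budget is available as a fraction between 0 and 1; the return labels are illustrative.

```python
def classify_page(slo_burning: bool, budget_remaining: float, customer_visible: bool) -> str:
    """Map the triage questions to a priority label.

    budget_remaining is the unspent share of the error budget (0.0 to 1.0).
    """
    if slo_burning:
        if budget_remaining < 0.10:
            return "P1"  # wake secondary and escalate
        if budget_remaining < 0.50:
            return "P2"  # respond within 30 minutes
        return "P3"      # next business day
    if customer_visible:
        return "P2"
    return "business-hours"  # investigate later; silence if noisy


print(classify_page(True, 0.08, True))    # P1
print(classify_page(True, 0.30, False))   # P2
print(classify_page(False, 0.90, False))  # business-hours
```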

Alert quality metrics to track

| Metric | Healthy Target | Warning | Action if Warning |
| --- | --- | --- | --- |
| Pages per on-call shift (8h) | < 2 | > 5 | Alert audit: silence/tune noisy alerts |
| Actionable page % | > 80% | < 60% | Review alert conditions; raise thresholds |
| MTTR (P1 incidents) | < 30 min | > 60 min | Improve runbooks; add diagnostic steps |
| Post-mortem completion rate | 100% of P1/P2 | < 80% | Block new feature work until the post-mortem is filed |
| Alert flap rate | < 5% | > 15% | Add for: 5m duration to flapping alerts |
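
The healthy and warning thresholds above can be evaluated automatically at the end of each rotation. The helper below is a sketch: it infers whether lower or higher is better from the order of the two thresholds, and the "watch" label for values between them is an assumption, since the table only names the two ends.

```python
def assess(value: float, healthy: float, warning: float) -> str:
    """Classify a metric as 'healthy', 'warning', or 'watch' (in between).

    If healthy < warning, lower values are better (e.g. pages per shift);
    otherwise higher values are better (e.g. actionable page %).
    """
    lower_is_better = healthy < warning
    if lower_is_better:
        if value < healthy:
            return "healthy"
        if value > warning:
            return "warning"
    else:
        if value > healthy:
            return "healthy"
        if value < warning:
            return "warning"
    return "watch"


print(assess(6, healthy=2, warning=5))     # pages per shift -> warning
print(assess(85, healthy=80, warning=60))  # actionable page % -> healthy
print(assess(45, healthy=30, warning=60))  # MTTR in minutes -> watch
```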

On-call rotation setup (PagerDuty pattern)

# Backstage catalog-info.yaml — link to PagerDuty service
metadata:
  annotations:
    pagerduty.com/integration-key: abc123xyz
    pagerduty.com/service-id: PXYZ123
    # On-call schedule visible in Backstage
    pagerduty.com/escalation-policy-id: EP123456

kubectl debug — ephemeral container triage

# Attach an ephemeral debug container to a running pod (works with distroless images)
kubectl debug -it payment-pod-xyz \
  --image=nicolaka/netshoot \
  --target=payment-service \
  -n payments

# Debug a CrashLoopBackOff pod by copying it with a shell
kubectl debug payment-pod-xyz \
  -it \
  --copy-to=payment-debug \
  --image=nicolaka/netshoot \
  --share-processes \
  -n payments

# Node-level debug (requires privileged — for platform SREs only)
kubectl debug node/ip-10-0-1-42.us-east-1.compute.internal \
  -it \
  --image=nicolaka/netshoot

The Four Golden Signals

The four golden signals (Google SRE Book) apply directly to Kubernetes workloads. Every production service should have Prometheus alerts covering all four.

Latency

Time to handle a request. Track p50/p95/p99, not just the mean, and track successful and failed requests separately, because fast failures can mask slow successes.

histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket
    {job="payment-service"}[5m])))
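
It helps to know what the quantile query actually computes: an estimate interpolated from cumulative bucket counts, not an exact percentile. The sketch below is a simplified re-implementation of that linear interpolation for classic histograms, written for illustration only. It assumes 0 < q <= 1, at least one finite bucket, and a final +Inf bucket; the bucket values are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower_bound = buckets[idx - 1][0] if idx > 0 else 0.0
    lower_count = buckets[idx - 1][1] if idx > 0 else 0.0
    upper_bound, upper_count = buckets[idx]
    # Assume observations are spread evenly inside the bucket.
    return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / (upper_count - lower_count)


# Hypothetical request-duration buckets (seconds, cumulative count).
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))
print(histogram_quantile(0.9, buckets))
```

Because the estimate can never be finer than the bucket boundaries, bucket bounds should bracket the latency target you care about (for example a 0.5 bucket for a 500 ms objective).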

Traffic

How much demand is on the system. Requests per second, messages per second, active connections.

sum(rate(http_requests_total
  {job="payment-service"}[1m]))
by (method, route)

Errors

Rate of failed requests — explicit (5xx) or implicit (wrong data). Track 4xx separately to detect client-side abuse.

sum(rate(http_requests_total
  {job="payment-service",status=~"5.."}[5m]))
/ sum(rate(http_requests_total
  {job="payment-service"}[5m]))

Saturation

How full the service is. CPU throttling, queue depth, memory pressure, connection pool exhaustion.

sum(rate(container_cpu_cfs_throttled_periods_total
  {namespace="payments"}[5m])) /
sum(rate(container_cpu_cfs_periods_total
  {namespace="payments"}[5m]))

Production SLOs

SLOs are explicit reliability targets set from the user's point of view. Without an error budget, every outage feels like a catastrophe. With one, you know how much risk you can take.

SLI → SLO → Error Budget chain

SLO Error Budget Flow
  SLI: ratio of successful requests
  SLO: 99.9% availability over 30 days
  Error budget: 0.1% × 30d × 24h × 60min = 43.2 minutes

  ┌─────────────────────────────────────────────────────────────┐
  │  Day 1-5:   No incidents         Budget remaining: 43.2 min │
  │  Day 12:    5 min incident       Budget remaining: 38.2 min │
  │  Day 20:    35 min incident      Budget remaining:  3.2 min │
  │  Day 22:    Budget < 10%         → Feature freeze           │
  │             (only reliability    → No T2/T3 changes         │
  │              work allowed)       → On-call P1 threshold ↓   │
  └─────────────────────────────────────────────────────────────┘
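
The budget arithmetic in the diagram is easy to reproduce, which is useful when the SLO target or window differs from the 99.9% over 30 days used here. A minimal sketch with illustrative function names:

```python
from typing import List


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


def remaining_fraction(budget_minutes: float, incident_minutes: List[float]) -> float:
    """Share of the error budget still unspent after the given incidents."""
    return (budget_minutes - sum(incident_minutes)) / budget_minutes


budget = error_budget_minutes(99.9)
print(round(budget, 1))                           # 43.2
print(round(remaining_fraction(budget, [5]), 3))  # 0.884 after the 5 min incident
```

Note how quickly the budget shrinks as the target tightens: 99.99% over the same 30 days leaves only about 4.3 minutes.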

Sloth SLO definition (declarative)

Sloth generates multi-window multi-burn-rate alerts from a simple SLO YAML, following the Google SRE workbook's alerting strategy:

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payment-service-slo
  namespace: payments
spec:
  service: payment-service
  labels:
    team: payments
    env: production
  slos:
    - name: requests-availability
      objective: 99.9
      description: 99.9% of payment API requests succeed
      sli:
        events:
          error_query: |
            sum(rate(http_requests_total{job="payment-service",status=~"5.."}[{{.window}}]))
          total_query: |
            sum(rate(http_requests_total{job="payment-service"}[{{.window}}]))
      alerting:
        name: PaymentServiceHighErrorBudgetBurn
        annotations:
          runbook_url: https://github.com/org/payments/blob/main/runbooks/slo-burn.md
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning

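    - name: requests-latency
      objective: 99.0
      description: 99% of payment API requests complete in < 500ms
      sli:
        events:
          # Errors are requests slower than 500ms: total minus the le="0.5" bucket
          error_query: |
            sum(rate(http_request_duration_seconds_count{job="payment-service"}[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{job="payment-service",le="0.5"}[{{.window}}]))
          total_query: |
            sum(rate(http_request_duration_seconds_count{job="payment-service"}[{{.window}}]))
      alerting:
        name: PaymentServiceLatencyBudgetBurn   # illustrative alert name
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning

# Generate PrometheusRules from Sloth SLO YAML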
sloth generate -i slo.yaml -o prometheus-rules.yaml

# Or deploy Sloth as an operator (watches PrometheusServiceLevel CRs)
helm repo add sloth https://slok.github.io/sloth
helm install sloth sloth/sloth \
  --namespace monitoring \
  --set customSLIs.enabled=true
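
The multi-window multi-burn-rate strategy mentioned above rests on one ratio: how fast the budget is being spent relative to a pace that would exhaust it exactly at the end of the SLO window. The sketch below shows that arithmetic only; it is not Sloth's implementation. The 2% in 1 hour and 5% in 6 hours pairs are the example values from the Google SRE Workbook.

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1 - slo_percent / 100)


def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that consumes budget_fraction of the budget within the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


# Page when 2% of a 30-day budget would be gone within 1 hour.
print(round(burn_rate_threshold(0.02, 1), 1))  # 14.4
# Page when 5% would be gone within 6 hours.
print(round(burn_rate_threshold(0.05, 6), 1))  # 6.0
# A 1.44% error ratio against a 99.9% SLO burns budget at roughly 14.4x.
print(round(burn_rate(0.0144, 99.9), 1))       # 14.4
```

An alert fires when the measured burn rate exceeds the threshold over both a long and a short window, which is what keeps it fast to trigger and fast to reset.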

Error budget policy

| Budget Remaining | Action |
| --- | --- |
| > 50% | Normal operations; T1/T2/T3 changes allowed |
| 25–50% | T3 changes require extra review; increase canary duration |
| 10–25% | No T3 changes; T2 changes require CISO/VP approval |
| < 10% | Feature freeze: only reliability/security work merges; all-hands reliability sprint |
| Exhausted | Incident declared; post-mortem mandatory; SLO review within 1 week |
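
Encoding the policy as a function makes it enforceable from CI or a deploy gate. In this sketch the band labels are illustrative, and the handling of exact boundaries (25% and 50% fall in the 25–50% band, 10% in the 10–25% band) is an assumption, since the table does not say which side the boundaries belong to.

```python
def budget_policy(remaining: float) -> str:
    """Map the remaining error budget (fraction, 0.0 to 1.0) to a policy band."""
    if remaining <= 0:
        return "exhausted"          # incident declared, post-mortem mandatory
    if remaining < 0.10:
        return "feature-freeze"     # only reliability/security work merges
    if remaining < 0.25:
        return "no-t3"              # T2 needs CISO/VP approval
    if remaining <= 0.50:
        return "t3-extra-review"    # longer canaries, extra review for T3
    return "normal"                 # T1/T2/T3 changes allowed


for fraction in (0.88, 0.40, 0.19, 0.07, 0.0):
    print(fraction, budget_policy(fraction))
```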

Operations Maturity Model

Level 1: Reactive
- Manual deploys
- No runbooks
- Alert on every metric
- Single point of failure
- No post-mortems
- Manual cert renewal

Level 2: Managed
- GitOps in place
- Basic runbooks
- PDBs defined
- On-call rotation
- Ad-hoc post-mortems
- cert-manager installed

Level 3: Defined
- SLOs defined
- Error budgets tracked
- Canary deployments
- Velero backups
- kube-bench enforced
- DR tested quarterly

Level 4: Measured
- DORA metrics tracked
- Chaos engineering
- Auto-remediation
- Capacity forecasting
- Toil < 50%
- Multi-cluster DR

Level 5: Optimizing
- Self-healing operators
- Predictive scaling
- Zero-touch upgrades
- Continuous chaos
- Toil < 20%
- Autonomous DR

Most teams should target Level 3 before optimizing for Level 4. Skipping to Level 5 without solid Level 3 foundations (SLOs, DR, runbooks) results in fragile automation.

Section 09 — Topics in This Section

The eleven files in this section move from planning to hands-on operations to reliability engineering:

- 09-00 Production Overview: This file. Readiness checklist, runbook framework, golden signals, SLOs, maturity model.
- 09-01 Capacity Planning: Demand forecasting, node sizing, headroom calculation, VPA/Goldilocks, cluster autoscaling strategy.
- 09-02 Performance Tuning: Kernel tuning, JVM GC, CPU throttling, network optimization, etcd and apiserver performance.
- 09-03 Disaster Recovery: etcd backup/restore, Velero workload backup, RTO/RPO targets, failover automation, DR drills.
- 09-04 Security Hardening: CIS benchmarks, kube-bench, Falco runtime detection, Trivy Operator, audit log analysis.
- 09-05 Network Operations: CoreDNS tuning, CNI troubleshooting, Ingress operations, eBPF network observability with Hubble.
- 09-06 Storage Operations: PVC lifecycle, CSI driver ops, StorageClass tuning, etcd compaction, backup with Restic.
- 09-07 Certificate Management: cert-manager ops, PKI hierarchy, rotation automation, mutual TLS, expiry monitoring.
- 09-08 Cluster Maintenance: K8s version upgrades, node rotation, etcd maintenance, add-on lifecycle, maintenance windows.
- 09-09 Incident Response: On-call triage, incident lifecycle, post-mortem process, MTTR reduction, escalation playbooks.
- 09-10 SRE Practices: Toil elimination, error budget policies, chaos engineering, reliability reviews, automation patterns.

Best Practices

SLOs before alerts

Define SLOs first; derive alerts from error budget burn rate. This eliminates alert fatigue and focuses on user-visible impact.

Runbooks are mandatory

No alert ships to production without a runbook URL in its annotation. Enforce this in CI with a rule lint check that rejects alerting rules lacking a runbook_url annotation.

Test your DR regularly

An untested restore procedure is not a DR plan. Run quarterly restore drills from Velero backups and etcd snapshots in a staging cluster.

Production readiness gates

Automate the readiness checklist as a Backstage scorecard or Kyverno policy. Manual checklists are forgotten under delivery pressure.

Toil budget

Track on-call toil per sprint. If toil exceeds 50% of engineering time, halt feature work and invest in automation. This is the SRE contract.

Change tier discipline

Classify every change before it ships. T3 changes pushed through during a change-freeze window are a common cause of weekend incidents.