Production Operations Overview

What Is Production Operations

Production operations is the discipline of keeping Kubernetes clusters and their workloads available, performant, secure, and cost-efficient 24/7. While the platform engineering section (Section 08) covers what to build, this section covers how to run it — day-2 through day-N operations.

Operational Lifecycle

   Plan          Build          Deploy         Operate         Improve
  ────────      ───────        ────────        ────────        ─────────
  Capacity   →  Platform   →   GitOps     →   Monitor    →   Post-mortems
  Planning      Provisioning   Pipeline        Alert           SLO review
  SLO design    Policy         Canary           Runbooks        Chaos tests
  Budget        Security       PDB              Incident        Capacity plan
  On-call       Secrets        Smoke test       Response        Efficiency
  rotation      Portal         Gate             Escalation      upgrade cycle

The gap between a cluster that works and a cluster that runs production spans five concerns:

Reliability

SLOs, error budgets, PDBs, multi-AZ topology, chaos engineering, disaster recovery with RTO/RPO targets.

Performance

Right-sizing, kernel tuning, JVM profiling, etcd/apiserver latency, HPA/KEDA autoscaling response time.

Security

CIS hardening, runtime threat detection, audit log analysis, CVE patching cadence, zero-trust networking.

Efficiency

Capacity planning, VPA right-sizing, spot adoption, FinOps reviews, idle resource detection.

Operability

Runbooks, on-call rotations, change management, upgrade cadence, certificate lifecycle, incident playbooks.

Operations Domains

Domain	Core Concern	Key Tools	Primary Signal
Capacity Planning	Right cluster size today and 6 months forward; headroom for spikes	VPA/Goldilocks, OpenCost, kube-capacity	CPU/mem allocatable vs requested
Performance Tuning	Latency, throughput, kernel/JVM/GC optimization	pprof, Pyroscope, perf, strace, netstat	Request p99, GC pause, saturation
Disaster Recovery	etcd backup/restore, Velero workload backup, RTO/RPO	Velero, etcdctl, cluster snapshots	Restore time, backup age
Security Hardening	CIS benchmarks, runtime detection, supply chain	kube-bench, Falco, Trivy, network policies	kube-bench score, Falco alert rate
Network Operations	DNS reliability, CNI health, Ingress/Gateway throughput	CoreDNS, Cilium, NGINX Ingress, Hubble	DNS latency, connection errors, drop rate
Storage Operations	PVC lifecycle, StorageClass tuning, backup, CSI health	CSI drivers, Velero, Restic, etcd compaction	PVC bind time, IOPS saturation
Certificate Management	TLS expiry, rotation automation, PKI hierarchy	cert-manager, Vault PKI, cfssl	Certificate days-to-expiry
Cluster Maintenance	Version upgrades, node rotation, etcd compaction	eksctl/gcloud/kubeadm, Karpenter drift	K8s version skew, etcd DB size
Incident Response	On-call triage, runbooks, post-mortems, MTTR	PagerDuty, runbooks, kubectl debug	MTTR, alert fatigue rate, post-mortem count
SRE Practices	SLI/SLO/error budget, toil reduction, reliability reviews	Sloth, pyrra, Grafana SLO, chaos-mesh	Error budget burn rate

Production Readiness Checklist

Before a workload enters production, each item below should be confirmed. This acts as a gate — teams self-certify against this checklist as part of their launch process (optionally enforced via Kyverno policies or Backstage scorecards, as covered in Section 08-05 and 08-04).

Reliability

replicas >= 2 for stateless workloads; zero single points of failure
PodDisruptionBudget defined (minAvailable: 1 or maxUnavailable: 1)
Pod topology spread across zones (topologyKey: topology.kubernetes.io/zone)
Liveness, readiness, and startup probes configured
Graceful shutdown: preStop hook defined; terminationGracePeriodSeconds long enough to cover the preStop hook plus SIGTERM handling
HPA or KEDA ScaledObject defined; scale-up headroom tested

Observability

ServiceMonitor or PodMonitor present; Prometheus scraping confirmed
Structured JSON logs with level, msg, trace_id, service, env fields
OpenTelemetry traces emitted; sampling rate configured
Grafana dashboard committed to repo alongside code
Alerts: SLO burn rate + saturation + error rate PrometheusRules in repo
PagerDuty service registered; runbook URL in alert annotations

Security

Pod Security Standard ≥ baseline enforced on namespace
runAsNonRoot: true, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false
All capabilities dropped; seccomp profile RuntimeDefault or Localhost
Container image scanned (no CRITICAL/HIGH CVEs); signed with Cosign
Secrets sourced from ESO or Vault — no plaintext in manifests or env vars
NetworkPolicy: default-deny + explicit allow rules
RBAC: least-privilege ServiceAccount; no cluster-admin binding

Resource Management

CPU and memory requests and limits set on every container
VerticalPodAutoscaler (VPA) deployed in Off mode to collect recommendation data
Namespace ResourceQuota and LimitRange present
PriorityClass assigned (defaults to production-default)
Cost labels: team, env, cost-center, service

GitOps & CI/CD

All manifests in Git; no manual kubectl apply to production
Argo CD Application configured with selfHeal: true, prune: true
Rollout strategy: canary or blue-green with AnalysisTemplate
Rollback procedure documented and tested in staging
SBOM generated and attested per image build

Operations

Runbook written and linked from alert annotations
On-call rotation set up; escalation policy defined
SLOs defined: availability target + latency target + error budget window
Chaos/failure injection tested in staging (pod kill, AZ failure)
DR restore procedure documented with RTO/RPO targets

💡

Automate the checklist

The checklist above can be implemented as Backstage scorecard checks (tech-insights plugin) or as Kyverno ClusterPolicy validate rules that reject admission when required fields or annotations are missing. Automating the gate catches regressions without manual review for standard workloads.
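To make the idea of an automated gate concrete, here is a minimal sketch that checks a handful of the checklist items against a Deployment manifest that has already been parsed into a dictionary. The function name `readiness_gaps` and the choice of checks are assumptions for illustration; this is not the Kyverno or Backstage API, and it only inspects container-level settings (a pod-level securityContext would need extra handling).

```python
def readiness_gaps(deployment: dict) -> list[str]:
    """Return human-readable gaps for a parsed apps/v1 Deployment manifest."""
    gaps = []
    spec = deployment.get("spec", {})
    if spec.get("replicas", 1) < 2:
        gaps.append("replicas < 2")
    pod_spec = spec.get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        # Requests and limits must both be set for CPU and memory.
        for kind in ("requests", "limits"):
            for res in ("cpu", "memory"):
                if res not in resources.get(kind, {}):
                    gaps.append(f"{name}: missing {res} {kind}")
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in container:
                gaps.append(f"{name}: missing {probe}")
        sc = container.get("securityContext", {})
        if sc.get("runAsNonRoot") is not True:
            gaps.append(f"{name}: runAsNonRoot is not true")
        if sc.get("allowPrivilegeEscalation") is not False:
            gaps.append(f"{name}: allowPrivilegeEscalation is not false")
    return gaps


# Hypothetical manifest with several gaps, for illustration only.
example = {
    "spec": {
        "replicas": 1,
        "template": {"spec": {"containers": [{
            "name": "app",
            "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
            "readinessProbe": {},
        }]}},
    }
}
for gap in readiness_gaps(example):
    print(gap)
```

A CI job could fail the build whenever the returned list is non-empty, which gives the same "gate" behaviour as an admission policy but earlier in the pipeline.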

Runbook Framework

Every production alert must link to a runbook. A runbook without a clear structure becomes a wall of text nobody reads under pressure. Use this standard template:

# Alert: PaymentServiceHighErrorRate

## Summary
Payment service error rate exceeds 5% for 5+ minutes.

## Impact
Customers may be unable to complete purchases. Revenue impact estimated at $X/minute.

## Severity: P1

## Diagnosis Steps

### 1. Triage
```bash
kubectl get pods -n payments -l app=payment-service
kubectl top pods -n payments --sort-by=cpu
```

### 2. Check recent changes
```bash
kubectl rollout history deploy/payment-service -n payments
# Check Argo CD sync history
argocd app history payment-service
```

### 3. Check logs for errors
```bash
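# --tail=-1: with a label selector, kubectl logs returns only the last 10 lines per pod by default
# jq -R + fromjson? skips any log lines that are not valid JSON
kubectl logs -n payments -l app=payment-service --since=10m --tail=-1 | \
  jq -Rc 'fromjson? | select(.level=="error")' | tail -50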
```

### 4. Check downstream dependencies
```bash
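# Database connectivity (assumes the image ships sh and pg_isready;
# single quotes make $DB_HOST expand inside the container, not in your local shell)
kubectl exec -n payments deploy/payment-service -- \
  sh -c 'pg_isready -h "$DB_HOST" -p 5432'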
# Check ESO secret sync
kubectl get externalsecret -n payments
```

## Resolution Steps

### Option A: Rollback
```bash
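# Emergency only: with Argo CD selfHeal enabled, a manual undo is reverted
# on the next sync. Prefer reverting the offending commit in Git.
kubectl rollout undo deploy/payment-service -n payments
# Verify
kubectl rollout status deploy/payment-service -n payments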
```

### Option B: Scale up
```bash
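# If an HPA manages this Deployment it will override a manual replica count;
# raise the HPA minReplicas instead.
kubectl scale deploy/payment-service --replicas=10 -n payments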
```

## Escalation
- After 15 min unresolved: page payments-team-lead
- After 30 min: page VP Engineering

## Post-Incident
File post-mortem within 48h using template at /post-mortems/template.md

Store runbooks in the same Git repository as the application. Link the owning PagerDuty service from the Backstage catalog-info.yaml (the pagerduty.com/service-id annotation), and set runbook_url in the annotations of every PrometheusRule alert:

- alert: PaymentServiceHighErrorRate
  annotations:
    runbook_url: https://github.com/org/payments/blob/main/runbooks/high-error-rate.md
    summary: Payment service error rate > 5%
    description: "Namespace {{ $labels.namespace }}, job {{ $labels.job }}: {{ $value | humanizePercentage }} error rate"
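The rule that every alert carries a runbook link is easy to check mechanically. The sketch below assumes the rule groups of a PrometheusRule (its `spec.groups` list) have already been parsed into Python dictionaries; the function name `alerts_missing_runbook` and the second alert name are hypothetical.

```python
def alerts_missing_runbook(rule_groups: list[dict]) -> list[str]:
    """Return names of alerting rules without a non-empty runbook_url annotation."""
    missing = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules use "record" and need no runbook
            url = rule.get("annotations", {}).get("runbook_url", "")
            if not url.strip():
                missing.append(rule["alert"])
    return missing


groups = [{
    "name": "payments",
    "rules": [
        {"alert": "PaymentServiceHighErrorRate",
         "annotations": {"runbook_url": "https://github.com/org/payments/blob/main/runbooks/high-error-rate.md"}},
        # Hypothetical alert that forgot its runbook link.
        {"alert": "PaymentServiceHighLatency",
         "annotations": {"summary": "p99 above target"}},
        {"record": "job:http_requests:rate5m",
         "expr": "sum(rate(http_requests_total[5m])) by (job)"},
    ],
}]
print(alerts_missing_runbook(groups))
```

In CI, a non-empty result would fail the pipeline, so an alert without a runbook never reaches production.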

Operations Tooling Reference

Category	Tool	Purpose	Install
Cluster introspection	k9s	Terminal UI for real-time cluster browsing	`brew install k9s`
	kube-capacity	Node resource usage vs allocatable	`kubectl krew install resource-capacity`
	kubectl-tree	OwnerReference hierarchy tree	`kubectl krew install tree`
	kubectl-neat	Strip generated fields from kubectl output	`kubectl krew install neat`
	stern	Multi-pod log tailing with regex filter	`brew install stern`
Debugging	kubectl debug	Ephemeral debug container (distroless-safe)	Built-in (ephemeral containers beta in 1.23, GA in 1.25)
	inspektor-gadget	eBPF-based in-cluster debugging (tcpdump, top, trace)	`kubectl krew install gadget`
	netshoot	nicolaka/netshoot: full network debug toolkit pod	`kubectl run tmp --image=nicolaka/netshoot -it --rm`
Security scanning	kube-bench	CIS Kubernetes Benchmark checks	Job YAML in cluster
	Falco	Runtime syscall threat detection	Helm install
	Trivy Operator	Continuous vulnerability + config scanning in-cluster	Helm install
Backup	Velero	Workload + PV backup/restore/migration	CLI + Helm chart
	etcdctl	etcd snapshot save/restore	Bundled with etcd
Chaos	chaos-mesh	Fault injection: pod-kill, network, I/O, stress	Helm install
	kube-monkey	Chaos Monkey for Kubernetes (opt-in via labels)	Helm install
Cost	OpenCost / Kubecost	Namespace cost allocation and savings recommendations	Helm install (see 08-07)
Profiling	Pyroscope	Continuous profiling aggregation and flame graphs	Helm install (see 06-07)

Essential krew plugins

# Install krew plugin manager first
(
  set -x; cd "$(mktemp -d)" &&
  OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
  ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
  KREW="krew-${OS}_${ARCH}" &&
  curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
  tar zxvf "${KREW}.tar.gz" &&
  ./"${KREW}" install krew
)

# Add to PATH
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"

# Install essential plugins
kubectl krew install \
  resource-capacity \
  tree \
  neat \
  ctx \
  ns \
  stern \
  gadget \
  view-secret \
  who-can \
  access-matrix \
  hns

Change Management in Production

Production changes fall into three tiers with different approval and rollout requirements:

Tier	Examples	Approval	Rollout	Rollback SLA
T1 — Standard	Application code change, config update, image bump	PR review + CI pass	GitOps canary/blue-green (automated)	< 5 min automated rollback
T2 — Elevated	HPA bounds change, resource quota increase, new Ingress rule, cluster add-on update	PR + team-lead approval	Staged: dev → staging → prod with manual gate	< 15 min via GitOps revert
T3 — High Risk	K8s control plane upgrade, etcd migration, CNI change, StorageClass migration, security policy change	Change Advisory Board + runbook review	Maintenance window + dry-run in staging	Manual restore procedure + RTO target

GitOps as change ledger

Every production change must originate from a Git commit. This provides an automatic audit trail — who changed what, when, and via which PR. The Argo CD sync history additionally records which Git SHA was applied to each cluster and at what time.

# Which revision was deployed when (Argo CD history; authorship lives in the Git log)
argocd app get payment-service -o json | \
  jq '.status.history[] | {id, revision, deployedAt}'

# Diff the live state against a specific Git revision
argocd app diff payment-service --revision HEAD~1

# Roll back to a previous history entry (Argo CD refuses this while auto-sync is enabled)
argocd app rollback payment-service 42   # where 42 is the history ID

⚠️

Change freeze windows

Establish change freeze periods: Friday 3pm–Monday 9am for T2/T3 changes; 48 hours before/after major holidays. Document these in the GitOps repo README and enforce via CI checks that block merges to the production branch outside approved windows.
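A CI check for the weekly freeze can be very small. The following is a minimal sketch under the assumption that the pipeline knows the change tier and evaluates the current time in the team's local timezone; the function names are hypothetical, and the holiday freeze is left out because it needs a calendar source.

```python
from datetime import datetime


def in_weekly_freeze(now: datetime) -> bool:
    """True if `now` falls between Friday 15:00 and Monday 09:00 (local time)."""
    weekday = now.weekday()  # Monday=0 ... Sunday=6
    if weekday == 4:         # Friday: freeze starts at 15:00
        return now.hour >= 15
    if weekday in (5, 6):    # Saturday and Sunday are fully frozen
        return True
    if weekday == 0:         # Monday: freeze ends at 09:00
        return now.hour < 9
    return False


def change_allowed(tier: str, now: datetime) -> bool:
    """T1 changes are not subject to the freeze; T2/T3 are blocked during it."""
    return tier == "T1" or not in_weekly_freeze(now)


# 2024-01-05 was a Friday.
print(change_allowed("T2", datetime(2024, 1, 5, 14, 30)))
print(change_allowed("T2", datetime(2024, 1, 5, 16, 0)))
print(change_allowed("T1", datetime(2024, 1, 6, 12, 0)))
```

The pipeline would call `change_allowed` with the tier declared in the PR and block the merge when it returns False.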

On-Call Engineering

Sustainable on-call requires alert quality before rotation size. Before growing the rotation, reduce alert noise.

On-Call Alert Triage Flow

  PagerDuty page
       │
       ▼
  Is SLO burning?
  ├─ Yes → Is error budget <10%? → P1: wake secondary + escalate
  │        Is error budget <50%? → P2: respond within 30 min
  │        Otherwise            → P3: next business day
  └─ No  → Is customer-visible?
            ├─ Yes → P2
            └─ No  → Investigate during business hours; silence if noisy
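The triage flow above is a small decision tree, and writing it down as code can help keep paging rules and documentation in sync. This is a sketch only: the function name `triage` and the representation of the remaining error budget as a fraction between 0 and 1 are assumptions.

```python
def triage(slo_burning: bool, budget_remaining: float, customer_visible: bool) -> str:
    """Map a page to a priority; budget_remaining is a fraction in [0, 1]."""
    if slo_burning:
        if budget_remaining < 0.10:
            return "P1"  # wake secondary and escalate
        if budget_remaining < 0.50:
            return "P2"  # respond within 30 minutes
        return "P3"      # next business day
    # SLO is not burning: priority depends on customer visibility.
    return "P2" if customer_visible else "business-hours"


print(triage(slo_burning=True, budget_remaining=0.07, customer_visible=True))
print(triage(slo_burning=False, budget_remaining=0.90, customer_visible=False))
```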

Alert quality metrics to track

Metric	Healthy Target	Warning	Action if Warning
Pages per on-call shift (8h)	< 2	> 5	Alert audit: silence/tune noisy alerts
Actionable page %	> 80%	< 60%	Review alert conditions; raise thresholds
MTTR (P1 incidents)	< 30 min	> 60 min	Improve runbooks; add diagnostic steps
Post-mortem completion rate	100% of P1/P2	< 80%	Pause feature work for the owning team until the post-mortem is filed
Alert flap rate	< 5%	> 15%	Add `for: 5m` duration to flapping alerts

On-call rotation setup (PagerDuty pattern)

# Backstage catalog-info.yaml — link to PagerDuty service
metadata:
  annotations:
    pagerduty.com/integration-key: abc123xyz
    pagerduty.com/service-id: PXYZ123
    # On-call schedule visible in Backstage
    pagerduty.com/escalation-policy-id: EP123456

kubectl debug — ephemeral container triage

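# Attach an ephemeral debug container to a running pod (works for distroless images that have no shell);
# --target is the name of the container whose process namespace to join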
kubectl debug -it payment-pod-xyz \
  --image=nicolaka/netshoot \
  --target=payment-service \
  -n payments

# Debug a CrashLoopBackOff pod by copying it with a shell
kubectl debug payment-pod-xyz \
  -it \
  --copy-to=payment-debug \
  --image=nicolaka/netshoot \
  --share-processes \
  -n payments

# Node-level debug (requires privileged — for platform SREs only)
kubectl debug node/ip-10-0-1-42.us-east-1.compute.internal \
  -it \
  --image=nicolaka/netshoot

The Four Golden Signals

The four golden signals (Google SRE Book) apply directly to Kubernetes workloads. Every production service should have Prometheus alerts covering all four.

Latency

Time to handle a request. Track p50/p95/p99, not just the mean. Measure successful and failed requests separately: fast failures can mask a latency problem, and a slow error is worse than a fast one.

histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket
    {job="payment-service"}[5m])))
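A p99 taken from a histogram is an estimate, not an exact measurement: the value is interpolated inside the bucket that contains the target rank, so its accuracy depends on the bucket boundaries. The sketch below illustrates that linear interpolation for a classic (cumulative-bucket) histogram. It is a simplified model for intuition, not Prometheus source code; the function name `estimate_quantile` and the sample bucket counts are made up.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if not 0 < q <= 1 or total <= 0:
        raise ValueError("need 0 < q <= 1 and at least one observation")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for le=0.1, 0.25, 0.5, 1.0, +Inf.
sample = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(round(estimate_quantile(0.99, sample), 3))
```

With these sample counts the 99th request lands halfway through the 0.5s to 1.0s bucket, so the estimate is about 0.75s regardless of where in that bucket the real requests finished. Buckets placed near the SLO threshold make the estimate more trustworthy.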

Traffic

How much demand is on the system. Requests per second, messages per second, active connections.

sum(rate(http_requests_total
  {job="payment-service"}[1m]))
by (method, route)

Errors

Rate of failed requests — explicit (5xx) or implicit (wrong data). Track 4xx separately to detect client-side abuse.

sum(rate(http_requests_total
  {job="payment-service",status=~"5.."}[5m]))
/ sum(rate(http_requests_total
  {job="payment-service"}[5m]))

Saturation

How full the service is. CPU throttling, queue depth, memory pressure, connection pool exhaustion.

sum(rate(container_cpu_cfs_throttled_periods_total
  {namespace="payments"}[5m])) /
sum(rate(container_cpu_cfs_periods_total
  {namespace="payments"}[5m]))

Production SLOs

SLOs are internal reliability targets, set from the user's point of view; the contractual agreement with customers is the SLA. Without an error budget, every outage feels like a catastrophe. With one, you know how much risk you can take.

SLI → SLO → Error Budget chain

SLO Error Budget Flow

  SLI: ratio of successful requests
  SLO: 99.9% availability over 30 days
  Error budget: 0.1% × 30d × 24h × 60min = 43.2 minutes

  ┌─────────────────────────────────────────────────────────────┐
  │  Day 1-5:   No incidents         Budget remaining: 43.2 min │
  │  Day 12:    5 min incident       Budget remaining: 38.2 min │
  │  Day 20:    35 min incident      Budget remaining:  3.2 min │
  │  Day 22:    Budget < 10%         → Feature freeze           │
  │             (only reliability    → No T2/T3 changes         │
  │              work allowed)       → On-call P1 threshold ↓   │
  └─────────────────────────────────────────────────────────────┘
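The budget arithmetic in the flow above can be reproduced in a few lines. This is a minimal sketch; the function names are hypothetical, and it treats the error budget as pure downtime minutes, as the diagram does.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Total allowed downtime in minutes for an availability SLO."""
    window_minutes = window_days * 24 * 60
    return (100.0 - slo_percent) / 100.0 * window_minutes


def budget_remaining(slo_percent: float, window_days: int,
                     incident_minutes: list[float]) -> tuple[float, float]:
    """Return (minutes left, fraction of budget left) after the given incidents."""
    total = error_budget_minutes(slo_percent, window_days)
    left = total - sum(incident_minutes)
    return left, left / total


total = error_budget_minutes(99.9, 30)
left, fraction = budget_remaining(99.9, 30, [5])  # the 5 min incident on day 12
print(f"{total:.1f} min total, {left:.1f} min left ({fraction:.0%})")
# prints: 43.2 min total, 38.2 min left (88%)
```

The fraction is the number to compare against the policy thresholds; minutes alone are hard to interpret across SLOs with different targets or windows.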

Sloth SLO definition (declarative)

Sloth generates multi-window multi-burn-rate alerts from a simple SLO YAML, following the Google SRE workbook's alerting strategy:

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payment-service-slo
  namespace: payments
spec:
  service: payment-service
  labels:
    team: payments
    env: production
  slos:
    - name: requests-availability
      objective: 99.9
      description: 99.9% of payment API requests succeed
      sli:
        events:
          error_query: |
            sum(rate(http_requests_total{job="payment-service",status=~"5.."}[{{.window}}]))
          total_query: |
            sum(rate(http_requests_total{job="payment-service"}[{{.window}}]))
      alerting:
        name: PaymentServiceHighErrorBudgetBurn
        annotations:
          runbook_url: https://github.com/org/payments/blob/main/runbooks/slo-burn.md
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning

    - name: requests-latency
      objective: 99.0
      description: 99% of payment API requests complete in < 500ms
      sli:
        events:
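          # Error events are the slow requests: all requests minus those that finished within 0.5s
          error_query: |
            sum(rate(http_request_duration_seconds_count{
              job="payment-service"}[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{
              job="payment-service",le="0.5"}[{{.window}}]))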
          total_query: |
            sum(rate(http_request_duration_seconds_count{
              job="payment-service"}[{{.window}}]))

# Generate PrometheusRules from Sloth SLO YAML
sloth generate -i slo.yaml -o prometheus-rules.yaml

# Or deploy Sloth as an operator (watches PrometheusServiceLevel CRs)
helm repo add sloth https://slok.github.io/sloth
helm install sloth sloth/sloth \
  --namespace monitoring \
  --set customSLIs.enabled=true
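The multi-window multi-burn-rate strategy mentioned above rests on one formula: a burn rate of 1 spends exactly the whole budget over the SLO period, so the burn rate that spends a given share of the budget inside a shorter alert window is `share × period / window`. The sketch below works through the example thresholds from the Google SRE workbook (2% in 1 hour, 5% in 6 hours, 10% in 3 days, for a 30-day period). The function names are hypothetical, and the windows and thresholds that Sloth actually generates should be checked in its output rather than assumed from this sketch.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_period_days: int = 30) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent in the window."""
    return budget_fraction * (slo_period_days * 24) / alert_window_hours


def error_ratio_threshold(burn_rate: float, slo_percent: float) -> float:
    """Error ratio that corresponds to a burn rate for a given SLO."""
    return burn_rate * (100.0 - slo_percent) / 100.0


for fraction, hours in [(0.02, 1), (0.05, 6), (0.10, 72)]:
    rate = burn_rate_threshold(fraction, hours)
    ratio = error_ratio_threshold(rate, 99.9)
    print(f"{fraction:.0%} of budget in {hours}h -> burn rate {rate:.1f}, "
          f"error ratio above {ratio:.4f}")
```

For the 99.9% SLO this gives burn rates of 14.4, 6 and 1: a page fires when errors run at roughly 1.44% for an hour, while a slow leak at 0.1% only produces a ticket.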

Error budget policy

Budget Remaining	Action
> 50%	Normal operations; T1/T2/T3 changes allowed
25–50%	T3 changes require extra review; increase canary duration
10–25%	No T3 changes; T2 changes require CISO/VP approval
< 10%	Feature freeze: only reliability/security work merges; all-hands reliability sprint
Exhausted	Incident declared; post-mortem mandatory; SLO review within 1 week
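The policy table can be encoded as a lookup so that CI gates and dashboards apply the same thresholds. This is a sketch with a hypothetical function name; the table leaves the exact boundaries open, so assigning 25% to the 25-50% tier and 10% to the 10-25% tier is an assumption made here.

```python
def budget_policy(remaining_fraction: float) -> str:
    """Map remaining error budget (fraction of total) to the policy action."""
    if remaining_fraction <= 0:
        return "exhausted: incident declared, post-mortem mandatory"
    if remaining_fraction < 0.10:
        return "feature freeze: only reliability/security work merges"
    if remaining_fraction < 0.25:
        return "no T3 changes; T2 requires CISO/VP approval"
    if remaining_fraction <= 0.50:
        return "T3 requires extra review; longer canary"
    return "normal operations: T1/T2/T3 allowed"


for remaining in (0.88, 0.40, 0.19, 0.07, 0.0):
    print(f"{remaining:.0%} left -> {budget_policy(remaining)}")
```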

Operations Maturity Model

Level 1: Reactive

Manual deploys
No runbooks
Alert on every metric
Single point of failure
No post-mortems
Manual cert renewal

Level 2: Managed

GitOps in place
Basic runbooks
PDBs defined
On-call rotation
Ad-hoc post-mortems
cert-manager installed

Level 3: Defined

SLOs defined
Error budgets tracked
Canary deployments
Velero backups
kube-bench enforced
DR tested quarterly

Level 4: Measured

DORA metrics tracked
Chaos engineering
Auto-remediation
Capacity forecasting
Toil < 50%
Multi-cluster DR

Level 5: Optimizing

Self-healing operators
Predictive scaling
Zero-touch upgrades
Continuous chaos
Toil < 20%
Autonomous DR

Most teams should target Level 3 before optimizing for Level 4. Skipping to Level 5 without solid Level 3 foundations (SLOs, DR, runbooks) results in fragile automation.

Section 09 — Topics in This Section

The eleven files in this section move from planning to hands-on operations to reliability engineering:

09-00

Production Overview

This file — readiness checklist, runbook framework, golden signals, SLOs, maturity model.

09-01

Capacity Planning

Demand forecasting, node sizing, headroom calculation, VPA/Goldilocks, cluster autoscaling strategy.

09-02

Performance Tuning

Kernel tuning, JVM GC, CPU throttling, network optimization, etcd and apiserver performance.

09-03

Disaster Recovery

etcd backup/restore, Velero workload backup, RTO/RPO targets, failover automation, DR drills.

09-04

Security Hardening

CIS benchmarks, kube-bench, Falco runtime detection, Trivy Operator, audit log analysis.

09-05

Network Operations

CoreDNS tuning, CNI troubleshooting, Ingress operations, eBPF network observability with Hubble.

09-06

Storage Operations

PVC lifecycle, CSI driver ops, StorageClass tuning, etcd compaction, backup with Restic.

09-07

Certificate Management

cert-manager ops, PKI hierarchy, rotation automation, mutual TLS, expiry monitoring.

09-08

Cluster Maintenance

K8s version upgrades, node rotation, etcd maintenance, add-on lifecycle, maintenance windows.

09-09

Incident Response

On-call triage, incident lifecycle, post-mortem process, MTTR reduction, escalation playbooks.

09-10

SRE Practices

Toil elimination, error budget policies, chaos engineering, reliability reviews, automation patterns.

Best Practices

SLOs before alerts

Define SLOs first; derive alerts from error budget burn rate. This eliminates alert fatigue and focuses on user-visible impact.

Runbooks are mandatory

No alert ships to production without a runbook URL in its annotations. Enforce this in CI with a rule lint check that rejects any alerting rule lacking a runbook_url annotation.

Test your DR regularly

An untested restore procedure is not a DR plan. Run quarterly restore drills from Velero backups and etcd snapshots in a staging cluster.

Production readiness gates

Automate the readiness checklist as a Backstage scorecard or Kyverno policy. Manual checklists are forgotten under delivery pressure.

Toil budget

Track on-call toil per sprint. If toil exceeds 50% of engineering time, halt feature work and invest in automation. This is the SRE contract.

Change tier discipline

Classify every change before it ships. T3 changes pushed through during a change-freeze window are a common cause of weekend incidents.
