SRE Practices

Service Level Objectives, error budgets, toil reduction, and reliability engineering practices for Kubernetes-based platforms.

SLI / SLO / SLA Framework

SLI (Service Level Indicator)
  → A quantitative measure of a service dimension
  → Examples: request success rate, p99 latency, uptime percentage

SLO (Service Level Objective)
  → Target for an SLI over a time window
  → Examples: 99.9% success rate over 30 days, p99 latency < 200ms
  → Internal commitment — the goal your team operates to

SLA (Service Level Agreement)
  → Contractual commitment with external consequences (refunds, penalties)
  → Set the SLA looser than the SLO (e.g. SLA of 99.5% when the SLO is 99.9%) so the team reacts before a contractual breach

Error Budget
  → 100% - SLO = allowed unreliability (failed requests or downtime)
  → 99.9% SLO → 0.1% error budget → 43.2 min per 30 days (about 43.8 min per average calendar month)
  → Budget governs how fast you can move (deploy, change) vs stabilize
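
The budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of any SLO tool; the function name `error_budget_minutes` is hypothetical, and it assumes the whole budget is spent as full downtime.

```python
def error_budget_minutes(slo_percent: float, window_days: float = 30.0) -> float:
    """Return the minutes of full outage an SLO allows over a window."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be in (0, 100]")
    budget_fraction = 1.0 - slo_percent / 100.0   # e.g. 99.9% -> 0.001
    return budget_fraction * window_days * 24 * 60


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        print(f"{slo}% -> {error_budget_minutes(slo):.1f} min per 30 days")
```

The result depends on the window: a 99.9% SLO allows 43.2 minutes over a 30-day window and roughly 43.8 minutes over an average calendar month (30.4375 days).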

Defining SLOs

# Good SLIs are user-facing, measurable, and meaningful:

# Availability SLI (success rate)
SLI = successful_requests / total_requests
# Where "successful" = any response that is not a 5xx or a timeout (2xx and 3xx; 4xx are client errors and are usually counted as good or excluded)

# Latency SLI
SLI = requests_served_faster_than_threshold / total_requests
# "What % of requests completed within 200ms?"
# Better than "average latency" which hides tail problems

# Freshness SLI (for data pipelines)
SLI = checks_where_data_age_is_below_threshold / total_checks
# "Is the data less than 5 minutes old?"

# Saturation SLI
SLI = current_utilization / capacity
# "Is the service below 80% resource utilization?"
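
As a rough sketch, the availability and latency SLIs above can be computed from raw request records like this. The `Request` type, the use of status `0` for a timeout, and the choice to count only 5xx and timeouts as failures are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code; 0 marks a timeout (assumption)
    duration_s: float  # time to serve the request, in seconds


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Fraction of requests that were neither a 5xx nor a timeout."""
    if not requests:
        return None  # no traffic: the SLI is undefined, not 0 or 1
    good = sum(1 for r in requests if r.status != 0 and not 500 <= r.status < 600)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_s: float) -> Optional[float]:
    """Fraction of requests served at or under the latency threshold."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.duration_s <= threshold_s)
    return fast / len(requests)
```

Both functions return a ratio between 0 and 1, so they can be compared directly against a target such as 0.999.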

# Example SLO document:
service: payments-api
slos:
  - name: availability
    sli: sum(rate(http_requests_total{status!~"5.."}[5m])) /
         sum(rate(http_requests_total[5m]))
    target: 99.9
    window: 30d
  - name: latency
    sli: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
    target: 0.2   # 200ms
    window: 30d
    alerting: burn_rate

Error Budget Policy

# Error budget policy — what happens when budget runs low:

# Budget remaining > 50%: normal operations, new features allowed
# Budget remaining 25-50%: feature freeze optional, focus on reliability
# Budget remaining < 25%: feature freeze, prioritize reliability work
# Budget remaining 0% (exhausted): no new features, fix first
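
The policy tiers above could be encoded as a small lookup, sketched here in Python. The function names are hypothetical, and the handling of the exact 25% and 50% boundaries is an assumption, since the tiers above do not say which side they fall on.

```python
def budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of error budget left: 1.0 = untouched, 0.0 = exhausted, < 0 = overspent."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    error_ratio = 1.0 - good / total
    return 1.0 - error_ratio / (1.0 - slo)


def policy_action(remaining: float) -> str:
    """Map remaining budget (as a fraction) to the policy tier."""
    if remaining <= 0.0:
        return "exhausted: no new features, fix first"
    if remaining < 0.25:
        return "feature freeze, prioritize reliability work"
    if remaining <= 0.50:
        return "feature freeze optional, focus on reliability"
    return "normal operations, new features allowed"


if __name__ == "__main__":
    left = budget_remaining(good=999_200, total=1_000_000, slo=0.999)
    print(f"{left:.0%} remaining -> {policy_action(left)}")
```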

# Track error budget burn rate (how fast budget is consumed):
# Burn rate = error_rate / (1 - SLO)
# Example: SLO = 99.9%, current error rate = 1%
# Burn rate = 1% / 0.1% = 10x
# → Exhausts 30-day budget in 3 days

# Prometheus SLO alerting (multi-window burn rate):
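# Fast burn (1h window confirmed by a 5m window, 14.4x burn = 2% of the 30-day budget per hour): pages immediately
# Slow burn (6h window confirmed by a 30m window, 6x burn = 5% of the 30-day budget per 6h): tickets + notify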
# See: sloth.dev, pyrra.dev for SLO tooling
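
The burn-rate arithmetic above, worked through in Python as a minimal sketch. The function names are hypothetical; the 30-day window matches the SLO window used in this document.

```python
import math


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'allowed' the budget is being consumed."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the full budget is gone if the burn rate stays constant."""
    return math.inf if rate <= 0 else window_days / rate


def budget_fraction_consumed(rate: float, hours: float, window_days: float = 30.0) -> float:
    """Share of the window's budget spent by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24.0)


if __name__ == "__main__":
    rate = burn_rate(error_rate=0.01, slo=0.999)
    print(f"burn rate {rate:.1f}x, budget gone in {days_to_exhaustion(rate):.1f} days")
```

The last helper shows where the common alert thresholds come from: a burn rate of 14.4 sustained for 1 hour spends 2% of a 30-day budget, and a burn rate of 6 sustained for 6 hours spends 5%.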

Prometheus SLO Rules

# PrometheusRule for payments-api SLO
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slo
  namespace: monitoring
spec:
  groups:
  - name: payments-api.slo
    interval: 30s
    rules:
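    # Availability recording rules: one request/error pair per window used by the alerts below
    - record: job:http_requests:rate5m
      expr: sum(rate(http_requests_total{job="payments-api"}[5m]))
    - record: job:http_errors:rate5m
      expr: sum(rate(http_requests_total{job="payments-api",status=~"5.."}[5m]))
    - record: job:http_requests:rate30m
      expr: sum(rate(http_requests_total{job="payments-api"}[30m]))
    - record: job:http_errors:rate30m
      expr: sum(rate(http_requests_total{job="payments-api",status=~"5.."}[30m]))
    - record: job:http_requests:rate1h
      expr: sum(rate(http_requests_total{job="payments-api"}[1h]))
    - record: job:http_errors:rate1h
      expr: sum(rate(http_requests_total{job="payments-api",status=~"5.."}[1h]))
    - record: job:http_requests:rate6h
      expr: sum(rate(http_requests_total{job="payments-api"}[6h]))
    - record: job:http_errors:rate6h
      expr: sum(rate(http_requests_total{job="payments-api",status=~"5.."}[6h]))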

    # Error budget burn rate alerts (multi-window, multi-burn-rate)
    - alert: PaymentsAPIHighErrorRate
      expr: |
        (
          job:http_errors:rate5m / job:http_requests:rate5m > (14.4 * 0.001)
        ) and (
          job:http_errors:rate1h / job:http_requests:rate1h > (14.4 * 0.001)
        )
      labels:
        severity: critical
        slo: availability
      annotations:
        summary: "Payments API burning error budget at 14.4x rate (P1)"
        runbook_url: "https://wiki.example.com/runbooks/payments-api-high-error-rate"

    - alert: PaymentsAPIElevatedErrorRate
      expr: |
        (
          job:http_errors:rate30m / job:http_requests:rate30m > (6 * 0.001)
        ) and (
          job:http_errors:rate6h / job:http_requests:rate6h > (6 * 0.001)
        )
      labels:
        severity: warning
        slo: availability
      annotations:
        summary: "Payments API burning error budget at 6x rate"
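
To make the `and` in the alert expressions concrete, here is a small Python model of the same decision. It is only a sketch of the logic, not how Prometheus evaluates rules; the `evaluate` function and the dictionary keys are hypothetical.

```python
from typing import Dict, Optional

SLO = 0.999  # 99.9% availability target, as in the rules above


def exceeds_burn(error_ratio: float, factor: float, slo: float = SLO) -> bool:
    """True when the error ratio is above `factor` times the allowed error ratio."""
    return error_ratio > factor * (1.0 - slo)


def evaluate(ratios: Dict[str, float]) -> Optional[str]:
    """ratios maps a window name ("5m", "30m", "1h", "6h") to its error ratio."""
    # The long window proves the burn is significant; the short window proves it is still happening.
    if exceeds_burn(ratios["5m"], 14.4) and exceeds_burn(ratios["1h"], 14.4):
        return "critical"
    if exceeds_burn(ratios["30m"], 6.0) and exceeds_burn(ratios["6h"], 6.0):
        return "warning"
    return None
```

Requiring both windows means an alert stops firing shortly after errors stop, because the short window recovers long before the long one does.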

Toil Reduction

# Toil = manual, repetitive, automatable operational work that scales with service growth
# SRE principle: if toil > 50% of time, team is failing to improve the system

# Identify toil:
# - Ask in every sprint retrospective: "What did we do manually this week?"
# - On-call log review: what pages required human action?
# - Categories: manual deploys, manual scaling, manual restarts, manual certificate renewal

# Toil examples and automation:
# Manual cert renewal → cert-manager auto-renewal
# Manual scaling → HPA + Karpenter auto-scaling
# Manual secret rotation → External Secrets + Vault auto-rotation
# Manual runbook execution → runbook automation (Ansible, scripts)
# Manual deploy approval → automated canary analysis (Flagger/Argo Rollouts)

# Measure toil:
# Track pages per week that required human action
# Target: reduce by 10% each quarter
# Automate anything that fires more than 3 times per month
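
The toil measurements above could be tracked with a few helpers like these. This is a sketch under assumed inputs: the function names and the alert names in the usage example are hypothetical, and the 10% reduction is treated as compounding quarter over quarter.

```python
from typing import Dict, List


def toil_percent(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil, as a percentage."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * toil_hours / total_hours


def pages_target(pages_per_week: float, quarters: int = 1, reduction: float = 0.10) -> float:
    """Pages-per-week target after reducing by `reduction` each quarter."""
    return pages_per_week * (1.0 - reduction) ** quarters


def automation_candidates(monthly_counts: Dict[str, int], threshold: int = 3) -> List[str]:
    """Alerts that fired more than `threshold` times in a month, noisiest first."""
    noisy = [name for name, count in monthly_counts.items() if count > threshold]
    return sorted(noisy, key=lambda name: monthly_counts[name], reverse=True)


if __name__ == "__main__":
    counts = {"DiskFull": 5, "CertExpiring": 1, "PodCrashLoop": 4}
    print(automation_candidates(counts))
```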

Production Readiness Review (PRR)

# PRR checklist before a service goes to production:

Observability:
  [ ] SLOs defined and SLO alerts configured
  [ ] Structured logging (JSON with request ID, user ID)
  [ ] Distributed tracing (OpenTelemetry)
  [ ] Grafana dashboard exists with golden signals
  [ ] Prometheus metrics exported (/metrics)

Reliability:
  [ ] Readiness + liveness probes configured
  [ ] Resource requests and limits set
  [ ] PodDisruptionBudget configured (minAvailable: 1)
  [ ] HPA or KEDA configured (no manual scaling)
  [ ] Graceful shutdown (SIGTERM handler, preStop hook)

Security:
  [ ] Non-root container (runAsNonRoot: true)
  [ ] Resource limits set (memory limit required)
  [ ] NetworkPolicy defined (default deny + explicit allow)
  [ ] Secrets via External Secrets (not hardcoded)
  [ ] RBAC scoped to minimum necessary

Deployment:
  [ ] CI/CD pipeline with automated tests
  [ ] Canary or blue-green deployment strategy
  [ ] Rollback procedure documented
  [ ] Feature flags for risky changes

Runbooks:
  [ ] Runbook exists for top 3 alert scenarios
  [ ] Runbook URL in Prometheus alert annotations
  [ ] On-call rotation includes this service

Capacity Planning

# Forecast resource usage to stay ahead of growth

# 1. Baseline current usage
kubectl top pods -n production --sort-by=memory
# Or via Prometheus:
# avg_over_time(container_memory_working_set_bytes{namespace="production"}[7d])

# 2. Growth rate
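# Compare this month's request volume with the previous month's:
# sum(increase(http_requests_total[30d])) / sum(increase(http_requests_total[30d] offset 30d))
# Result > 1 means traffic grew (1.10 = 10% month-over-month growth)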

# 3. Headroom targets
# CPU: keep P99 usage < 60% of request (leaves burst headroom)
# Memory: keep working set < 70% of limit (avoid OOMKill)
# Nodes: keep allocatable at >20% free (allows scheduling spikes)

# 4. Karpenter / cluster autoscaler
# Ensure node pool min/max bounds cover forecasted peak
kubectl get nodepool default \
  -o jsonpath='{.spec.limits}'   # Karpenter CPU/memory limits

# 5. Cloud quota
# Check AWS EC2 service quotas for instance types you use
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A   # Running On-Demand Standard instances

# Review quarterly: peak load × 3 = headroom target
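
A rough Python sketch of the forecasting steps above. The function names are hypothetical, and the compound-growth model is an assumption; real traffic is rarely this smooth, so treat the output as a planning estimate.

```python
import math


def monthly_growth(previous: float, current: float) -> float:
    """Month-over-month growth as a fraction (0.10 = 10%)."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return current / previous - 1.0


def months_until(current: float, limit: float, growth: float) -> float:
    """Months until `current` reaches `limit` under constant compound growth."""
    if current >= limit:
        return 0.0
    if growth <= 0:
        return math.inf
    return math.log(limit / current) / math.log(1.0 + growth)


def within_headroom(usage: float, bound: float, max_fraction: float) -> bool:
    """True when usage stays under the target fraction of a request or limit."""
    return usage < max_fraction * bound


if __name__ == "__main__":
    growth = monthly_growth(previous=1000, current=1100)
    print(f"growth {growth:.0%}, capacity reached in {months_until(600, 1000, growth):.1f} months")
```

For the headroom targets listed in step 3, `within_headroom(cpu_p99, cpu_request, 0.60)` and `within_headroom(working_set, memory_limit, 0.70)` express the CPU and memory checks.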

SLO Dashboard

# Grafana dashboard queries for SLO tracking:

# 30-day availability
sum(increase(http_requests_total{job="payments-api",status!~"5.."}[30d]))
/
sum(increase(http_requests_total{job="payments-api"}[30d]))

# Error budget remaining (30-day)
1 - (
  (
    sum(increase(http_requests_total{job="payments-api",status=~"5.."}[30d]))
    / sum(increase(http_requests_total{job="payments-api"}[30d]))
  ) / (1 - 0.999)   # observed error ratio divided by the allowed error ratio (SLO = 99.9%)
)
# Result: 1.0 = full budget, 0 = exhausted, negative = SLO violated

# p99 latency trend (7 days)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="payments-api"}[5m]))
  by (le)
)

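# Burn rate over the last 5m (use longer-window recording rules, e.g. 6h or 24h, to compare)
job:http_errors:rate5m / job:http_requests:rate5m / 0.001
# 1 = consuming budget exactly fast enough to exhaust it at the end of the 30-day window
# 10 = burning 10x too fast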

Key SRE Metrics

Metric                              Formula                                 Target
MTTA (Mean Time to Acknowledge)     Time from alert → acknowledgement       <5 min (P1)
MTTR (Mean Time to Recover)         Time from incident start → resolution   <30 min (P1)
MTBF (Mean Time Between Failures)   Time between P1/P2 incidents            >30 days
Change Failure Rate                 Failed deploys / total deploys          <5%
Deployment Frequency                Deploys per day/week                    Multiple/day (elite)
Pages per rotation                  Actionable alerts per on-call week      <5/week
Toil %                              Toil hours / total work hours           <50%
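
The time-based metrics above can be derived from incident timestamps. The following is a minimal sketch; the `Incident` record and its field names are hypothetical and would map to whatever an incident tracker exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    started: datetime       # when the incident began
    alerted: datetime       # when the alert fired
    acknowledged: datetime  # when a human acknowledged the page
    resolved: datetime      # when service was restored


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60.0


def mtta_minutes(incidents: Sequence[Incident]) -> float:
    """Mean time from alert to acknowledgement, in minutes."""
    return mean(_minutes(i.acknowledged - i.alerted) for i in incidents)


def mttr_minutes(incidents: Sequence[Incident]) -> float:
    """Mean time from incident start to resolution, in minutes."""
    return mean(_minutes(i.resolved - i.started) for i in incidents)


def change_failure_rate(failed_deploys: int, total_deploys: int) -> float:
    """Failed deploys as a fraction of all deploys."""
    if total_deploys <= 0:
        raise ValueError("total_deploys must be positive")
    return failed_deploys / total_deploys
```

Both mean functions raise `statistics.StatisticsError` on an empty list, which is a reasonable signal that there were no incidents to average.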
