Disaster Recovery | Kubernetes Docs

RTO & RPO Targets

RTO (Recovery Time Objective) is the maximum acceptable downtime. RPO (Recovery Point Objective) is the maximum acceptable data loss, measured as the age of the newest restorable data. Define both per workload class before designing DR procedures, not after an incident.

Tier	Example Workloads	RTO Target	RPO Target	DR Strategy
Tier 0 — Mission Critical	Payment processing, authentication, core API	< 15 min	< 1 min	Active-active multi-cluster; real-time replication
Tier 1 — Business Critical	Order management, user accounts, notifications	< 1 hour	< 5 min	Active-passive; warm standby cluster; DB replica
Tier 2 — Important	Reporting, analytics, internal tools	< 4 hours	< 1 hour	Velero backup + GitOps re-deploy to standby
Tier 3 — Standard	Batch jobs, dev environments, CI	< 24 hours	< 24 hours	GitOps re-deploy; data rebuilt from source
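
One way to make the tier table actionable is to check a proposed backup interval against a tier's RPO. The sketch below is illustrative only: the function names `rpo_seconds` and `meets_rpo` are hypothetical, the targets are copied from the table and treated as inclusive upper bounds, and worst-case data loss is approximated as one full backup interval.

```shell
#!/bin/sh
# rpo_seconds TIER: print the RPO target (in seconds) for a tier from the table.
# Hypothetical helper; the numbers mirror the table above.
rpo_seconds() {
  case "$1" in
    0) echo 60 ;;      # Tier 0: < 1 min
    1) echo 300 ;;     # Tier 1: < 5 min
    2) echo 3600 ;;    # Tier 2: < 1 hour
    3) echo 86400 ;;   # Tier 3: < 24 hours
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# meets_rpo TIER INTERVAL_SECONDS: succeed when a backup (or replication lag)
# of INTERVAL_SECONDS stays within the tier's RPO target.
meets_rpo() {
  target=$(rpo_seconds "$1") || return 2
  [ "$2" -le "$target" ]
}
```

For example, `meets_rpo 2 86400` fails: a once-a-day backup alone cannot satisfy the Tier 2 RPO of one hour, so data in that tier would need hourly backups or some form of replication.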

🚨

An untested DR plan is not a DR plan

Every RTO/RPO target in the table above is only achievable if the restore procedure has been tested end-to-end in the last 90 days. A restore that has never been practiced can take 3–5× longer than estimated under incident pressure. Schedule and run DR drills; they are as important as the backup jobs themselves.

DR Architecture Patterns

DR Strategy Spectrum

  Cost →                                                     ←  RTO/RPO

  Backup/Restore        Pilot Light        Warm Standby        Active-Active
  ─────────────         ──────────         ─────────────        ─────────────
  Periodic backups      Minimal infra      Scaled-down          Full capacity
  to S3/GCS             always running     replica running      in both regions
  Restore on DR         Scale up on DR     Scale up on DR       Traffic split
                                                                always
  RTO: hours            RTO: 30–60 min     RTO: 10–30 min       RTO: < 1 min
  RPO: backup age       RPO: ~15 min       RPO: ~5 min          RPO: near-zero
  Cost: $               Cost: $$           Cost: $$$            Cost: $$$$

  Most K8s clusters:    ← Use Velero + GitOps for Tier 2/3  →   Tier 0 only

What needs backup

etcd: all K8s API objects (Deployments, ConfigMaps, Secrets, CRDs, RBAC)
PersistentVolumes: stateful app data (databases, file stores)
Container registry: images (ECR replication or OCI artifact copy)
Certificates & secrets: Vault raft snapshots; KMS key policy backups
Git repositories: source of truth for manifests (GitHub/GitLab already replicated)

What does NOT need backup

Stateless Deployment pods (re-create from image)
ConfigMaps/Secrets that ESO regenerates from Vault/ASM
Kubernetes node data (nodes are ephemeral by design)
Prometheus TSDB data older than the retention window (it would be expired anyway; current metrics repopulate once scraping resumes)
Anything stored in the Git-managed manifest repo

etcd Backup

etcd is the authoritative store for all Kubernetes API objects. Without an etcd backup, a total control plane loss means re-creating every resource manually. Snapshot-based backup is the standard approach.

Manual etcd snapshot

# On a control plane node (or from a pod with etcdctl access)
SNAPSHOT="/tmp/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Verify snapshot integrity (pass one file; a glob could match several snapshots)
# etcd 3.5 deprecates this subcommand in favor of `etcdutl snapshot status`
ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" \
  --write-out=table

# Output:
# +---------+----------+------------+------------+
# |  HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
# +---------+----------+------------+------------+
# | 12ab34cd|   847291 |      15432 | 62 MB      |
# +---------+----------+------------+------------+

Automated etcd backup CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"   # every 6 hours
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: etcd-backup
          hostNetwork: true
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          restartPolicy: OnFailure
          containers:
            - name: etcd-backup
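              # Assumes a custom image: the stock etcd image does not include the AWS CLI
              # used below, so build or choose an image that bundles etcdctl and aws
              image: registry.k8s.io/etcd:3.5.12-0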
              env:
                - name: ETCDCTL_API
                  value: "3"
                - name: S3_BUCKET
                  value: my-cluster-etcd-backups
                - name: AWS_DEFAULT_REGION
                  value: us-east-1
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  SNAPSHOT_FILE="/tmp/etcd-$(date +%Y%m%d-%H%M%S).db"

                  echo "Taking etcd snapshot..."
                  etcdctl snapshot save "$SNAPSHOT_FILE" \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
                    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

                  echo "Verifying snapshot..."
                  etcdctl snapshot status "$SNAPSHOT_FILE" --write-out=table

                  echo "Uploading to S3..."
                  aws s3 cp "$SNAPSHOT_FILE" \
                    "s3://${S3_BUCKET}/$(date +%Y/%m/%d)/$(basename $SNAPSHOT_FILE)" \
                    --sse aws:kms

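                  # Keep the newest 120 objects: 4 snapshots/day x 30 days.
                  # Keys sort chronologically because they embed the date; head -n -N needs GNU head.
                  echo "Pruning old snapshots (keep 30 days)..."
                  aws s3 ls "s3://${S3_BUCKET}/" --recursive | \
                    awk '{print $4}' | \
                    sort | \
                    head -n -120 | \
                    xargs -I{} aws s3 rm "s3://${S3_BUCKET}/{}" || true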

                  echo "Backup complete: $(basename $SNAPSHOT_FILE)"
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 500m
                  memory: 512Mi
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd

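# ServiceAccount for S3 access via an IAM role annotation (IRSA). IRSA is built into EKS;
# on a self-managed cluster it assumes you run your own OIDC provider and the pod identity webhook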
apiVersion: v1
kind: ServiceAccount
metadata:
  name: etcd-backup
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/etcd-backup-role
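
The pruning step in the CronJob keeps a fixed number of objects rather than a number of days, so the count has to be derived from the schedule. The following is a minimal sketch of that arithmetic and of a selection step that avoids the GNU-only `head -n -N`; `keep_count` and `prune_candidates` are hypothetical helper names, and the sketch assumes object keys sort chronologically, as the date-based names above do.

```shell
#!/bin/sh
# keep_count INTERVAL_HOURS DAYS: number of snapshots to retain.
# Integer division: assumes INTERVAL_HOURS divides 24 evenly.
keep_count() {
  echo $(( 24 / $1 * $2 ))
}

# prune_candidates KEEP: read object keys on stdin and print every key
# except the KEEP keys that sort last (the newest ones).
prune_candidates() {
  sort | awk -v keep="$1" '
    { lines[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print lines[i] }
  '
}
```

With a 6-hour schedule and 30 days of retention, `keep_count 6 30` prints 120, which matches the value hard-coded in the CronJob. The output of `prune_candidates` could then be piped into the same `xargs ... aws s3 rm` step.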

EKS etcd backup considerations

💡

Managed Kubernetes (EKS/GKE/AKS) — etcd is managed

On managed Kubernetes, the control plane (including etcd) is managed by the cloud provider. You cannot access etcd directly. Instead:
• EKS: Use Velero for workload-level backup; AWS handles etcd; use cluster backup via AWS Backup (EKS resources).
• GKE: Automated etcd snapshots; use Config Connector + Velero.
• Self-managed (kubeadm): Full etcd access; use the CronJob above.
The etcd backup CronJob applies to self-managed clusters only.

etcd Restore

🚨

etcd restore is destructive and cluster-wide

Restoring etcd overwrites all current cluster state. Every resource created after the backup timestamp will be lost. Restore etcd only when the control plane is completely unrecoverable. For partial data loss (accidentally deleted namespace), use Velero or kubectl to restore individual resources.

Single-node etcd restore procedure

Download the snapshot from S3

aws s3 cp s3://my-cluster-etcd-backups/2026/05/20/etcd-20260520-060000.db \
  /tmp/etcd-snapshot.db

Stop the API server and etcd (prevent writes during restore)

# Move static pod manifests out of /etc/kubernetes/manifests/
# kubelet stops the pods when manifests are removed
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# Verify pods stopped
crictl pods | grep -E "etcd|apiserver"   # should return empty

Restore snapshot to a new data directory

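# etcd 3.5 still accepts this command but deprecates it in favor of
# `etcdutl snapshot restore` (same flags); etcd 3.6 removes it from etcdctl
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \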
  --data-dir=/var/lib/etcd-restore \
  --name=master-0 \
  --initial-cluster=master-0=https://10.0.1.10:2380 \
  --initial-cluster-token=etcd-cluster-restore \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

Swap the data directory

mv /var/lib/etcd /var/lib/etcd-old-$(date +%Y%m%d)
mv /var/lib/etcd-restore /var/lib/etcd
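# Only needed when etcd runs as a dedicated etcd user (e.g. under systemd);
# kubeadm static pods run etcd as root, so skip this step there
chown -R etcd:etcd /var/lib/etcd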

Restore the static pod manifests

mv /tmp/etcd.yaml /etc/kubernetes/manifests/
# Wait for etcd to start
sleep 10
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Verify
kubectl get nodes
kubectl get pods -A | head -20
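
A fixed `sleep 10` may be too short on a slow node. Polling until etcd reports healthy is one alternative. The following is a minimal sketch; `wait_until` is a hypothetical helper, not part of etcd or kubeadm.

```shell
#!/bin/sh
# wait_until TRIES DELAY CMD...: run CMD until it succeeds,
# at most TRIES times, sleeping DELAY seconds between attempts.
wait_until() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example (not run here): poll etcd for up to about 60 seconds before
# moving kube-apiserver.yaml back into /etc/kubernetes/manifests/.
# wait_until 30 2 etcdctl endpoint health \
#   --endpoints=https://127.0.0.1:2379 \
#   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
#   --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
#   --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
```

The helper returns non-zero when every attempt fails, so a restore script running under `set -e` would stop instead of starting the API server against an unhealthy etcd.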

Multi-node etcd cluster restore

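# Run the restore on EVERY control plane node from the same snapshot file,
# with identical --initial-cluster and --initial-cluster-token values
# (the date-based token below must resolve to the same value on every node).
# Per node, change only --name and --initial-advertise-peer-urls.
# Stop the API server and etcd on all nodes first; afterwards, swap the data directory
# and restore the static pod manifests on each node as in the single-node procedure.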

# Node 1
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restore \
  --name=master-0 \
  --initial-cluster=master-0=https://10.0.1.10:2380,master-1=https://10.0.1.11:2380,master-2=https://10.0.1.12:2380 \
  --initial-cluster-token=etcd-cluster-restore-$(date +%Y%m%d) \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# Node 2
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restore \
  --name=master-1 \
  --initial-cluster=master-0=https://10.0.1.10:2380,master-1=https://10.0.1.11:2380,master-2=https://10.0.1.12:2380 \
  --initial-cluster-token=etcd-cluster-restore-$(date +%Y%m%d) \
  --initial-advertise-peer-urls=https://10.0.1.11:2380

# Node 3
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restore \
  --name=master-2 \
  --initial-cluster=master-0=https://10.0.1.10:2380,master-1=https://10.0.1.11:2380,master-2=https://10.0.1.12:2380 \
  --initial-cluster-token=etcd-cluster-restore-$(date +%Y%m%d) \
  --initial-advertise-peer-urls=https://10.0.1.12:2380
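
Because the `--initial-cluster` value must be identical on every node, generating it from one member list is less error-prone than pasting it three times. A minimal sketch, assuming peer port 2380 and HTTPS as in the commands above; `initial_cluster` is a hypothetical helper name.

```shell
#!/bin/sh
# initial_cluster NAME=IP...: build the --initial-cluster string
# from name=ip pairs, using https and peer port 2380.
initial_cluster() {
  out=""
  for member in "$@"; do
    name=${member%%=*}   # text before the first "="
    ip=${member#*=}      # text after the first "="
    out="${out:+$out,}${name}=https://${ip}:2380"
  done
  printf '%s\n' "$out"
}

initial_cluster master-0=10.0.1.10 master-1=10.0.1.11 master-2=10.0.1.12
```

The printed string can be stored in a variable and passed to `--initial-cluster` on each node, leaving only `--name` and `--initial-advertise-peer-urls` to vary.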

Velero Workload Backup

Velero backs up Kubernetes API objects and PersistentVolume data. It is the standard tool for application-level backup/restore and cluster-to-cluster migration. Unlike etcd snapshots, Velero backups are selective (namespace, label, resource type) and support incremental PV backup via Restic/Kopia.

Velero installation (AWS S3 + EBS)

# Install Velero CLI
brew install velero

# Create S3 bucket for Velero backups
aws s3api create-bucket \
  --bucket my-cluster-velero-backups \
  --region us-east-1

aws s3api put-bucket-versioning \
  --bucket my-cluster-velero-backups \
  --versioning-configuration Status=Enabled

aws s3api put-bucket-encryption \
  --bucket my-cluster-velero-backups \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms"}}]}'

# Install Velero with IRSA (no static credentials)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-cluster-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --use-node-agent \
  --default-volumes-to-fs-backup \
  --sa-annotations eks.amazonaws.com/role-arn=arn:aws:iam::123456789:role/velero-role \
  --no-secret \
  --wait

# BackupStorageLocation — verify Velero can reach S3
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-cluster-velero-backups
    prefix: prod-cluster
  config:
    region: us-east-1
    s3ForcePathStyle: "false"
    # s3Url omitted: the AWS SDK resolves the default endpoint (VPC endpoint preferred)
  # credential omitted: Velero falls back to the pod's IAM role (IRSA)
  default: true

---
# VolumeSnapshotLocation — EBS snapshots
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  config:
    region: us-east-1

Scheduled backups

# Daily backup of all namespaces (Kubernetes objects + PV snapshots)
velero schedule create daily-full \
  --schedule="0 2 * * *" \
  --ttl 720h \
  --include-namespaces "*" \
  --exclude-namespaces velero,kube-system,kube-public,kube-node-lease \
  --snapshot-volumes=true \
  --volume-snapshot-locations default

# Hourly backup of critical namespace (Kubernetes objects only — no PV snap)
velero schedule create hourly-payments \
  --schedule="0 * * * *" \
  --ttl 48h \
  --include-namespaces payments,auth \
  --snapshot-volumes=false

# Every 15 minutes for stateful data using file-system backup (Kopia/Restic)
velero schedule create frequent-payments-stateful \
  --schedule="*/15 * * * *" \
  --ttl 24h \
  --include-namespaces payments \
  --default-volumes-to-fs-backup=true \
  --snapshot-volumes=false
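
The `--schedule` and `--ttl` flags together determine how many backups each schedule holds at steady state, which in turn drives storage cost. A rough sketch of that arithmetic follows; `retained_backups` is a hypothetical helper, and it assumes every scheduled run succeeds and expires exactly at its TTL.

```shell
#!/bin/sh
# retained_backups TTL_HOURS INTERVAL_MINUTES:
# approximate number of backups kept at steady state.
retained_backups() {
  echo $(( $1 * 60 / $2 ))
}

retained_backups 720 1440   # daily-full: daily, 720h TTL
retained_backups 48 60      # hourly-payments: hourly, 48h TTL
retained_backups 24 15      # frequent-payments-stateful: every 15 min, 24h TTL
```

Under these assumptions the three schedules above retain roughly 30, 48, and 96 backups respectively.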

# Schedule resource, declarative (GitOps-managed)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - velero
      - kube-system
      - kube-public
      - kube-node-lease
    includedResources:
      - "*"
    excludedResources:
      - events
      - events.events.k8s.io
    snapshotVolumes: true
    volumeSnapshotLocations:
      - default
    ttl: 720h0m0s
    storageLocation: default
    hooks:
      resources:
        - name: database-quiesce
          includedNamespaces:
            - payments
          labelSelector:
            matchLabels:
              app: postgresql
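          # pg_start_backup/pg_stop_backup exist only in PostgreSQL 14 and earlier.
          # PostgreSQL 15+ renames them to pg_backup_start/pg_backup_stop and requires both
          # calls in one session, which separate exec hooks cannot provide; on those
          # versions prefer WAL-based backup such as pgBackRest.
          pre: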
            - exec:
                container: postgresql
                command:
                  - /bin/bash
                  - -c
                  - psql -U postgres -c "SELECT pg_start_backup('velero', true);"
                onError: Fail
                timeout: 60s
          post:
            - exec:
                container: postgresql
                command:
                  - /bin/bash
                  - -c
                  - psql -U postgres -c "SELECT pg_stop_backup();"
                onError: Continue
                timeout: 60s

Backup verification

# List all backups with status
velero backup get

# Describe a specific backup (shows included resources, errors, warnings)
velero backup describe daily-full-20260520020000 --details

# Check backup logs for errors
velero backup logs daily-full-20260520020000 | grep -i error

# Verify S3 storage (backup content)
aws s3 ls s3://my-cluster-velero-backups/prod-cluster/backups/ --recursive | \
  sort -k1,2 | tail -20

Velero Restore

Full namespace restore

# List available backups
velero backup get

# Restore a specific namespace from backup
velero restore create payments-restore-$(date +%Y%m%d-%H%M) \
  --from-backup daily-full-20260520020000 \
  --include-namespaces payments \
  --restore-volumes=true \
  --wait

# Check restore status
velero restore describe payments-restore-20260524-1430

# Check for failures
velero restore logs payments-restore-20260524-1430 | grep -i "error\|warn"

Selective resource restore

# Restore only Deployments and Services (not PVCs — data already present)
velero restore create deployments-only \
  --from-backup daily-full-20260520020000 \
  --include-namespaces payments \
  --include-resources deployments,services,configmaps \
  --restore-volumes=false

# Restore only the resources labeled app=payment-service
velero restore create single-deployment \
  --from-backup daily-full-20260520020000 \
  --include-namespaces payments \
  --selector "app=payment-service"

# Restore to a different namespace (namespace mapping)
velero restore create payments-to-staging \
  --from-backup daily-full-20260520020000 \
  --include-namespaces payments \
  --namespace-mappings payments:payments-staging

Cross-cluster restore (cluster migration)

# Scenario: migrate from cluster-A to cluster-B (different region)
# Both clusters must be able to access the same S3 backup bucket

# On cluster-B: create a BackupStorageLocation pointing to cluster-A's bucket
velero backup-location create cluster-a-backups \
  --provider aws \
  --bucket my-cluster-velero-backups \
  --prefix prod-cluster \
  --config region=us-east-1 \
  --access-mode ReadOnly   # cluster-B only reads, doesn't overwrite

# Velero syncs backup metadata from the new location automatically;
# confirm the location is Available and cluster-A's backups are listed
velero backup-location get
velero backup get

# Restore from cluster-A's backup into cluster-B
velero restore create migrate-payments \
  --from-backup daily-full-20260520020000 \
  --include-namespaces payments \
  --restore-volumes=true

Stateful Data DR

Database backup strategies

Database	Backup Method	RPO	Tool
PostgreSQL	pg_dump (logical) + WAL archiving (PITR) + EBS snapshot	~5 min (WAL) or backup age (pg_dump)	pgBackRest, Barman, Velero hook
MySQL/MariaDB	mysqldump + binary log + EBS snapshot	~5 min (binlog)	XtraBackup, Velero hook
MongoDB	mongodump + oplog + Atlas backup	~1 min (oplog)	Percona Backup for MongoDB
Redis	RDB snapshot (configurable) + AOF persistence	Configurable (1s–1h)	Redis built-in; Velero file-system backup
etcd (app-level)	etcdctl snapshot (same as control plane etcd)	Snapshot frequency	etcdctl + CronJob
Kafka	MirrorMaker 2 replication + topic offset backup	~minutes (MirrorMaker lag)	MirrorMaker 2, Confluent Replicator
AWS RDS (via Crossplane)	Automated RDS snapshots + cross-region copy	5 min (automated backup window)	AWS RDS automated backups + Crossplane MR status

PostgreSQL PITR with pgBackRest

# pgBackRest sidecar pattern in PostgreSQL StatefulSet
spec:
  containers:
    - name: postgres
      image: postgres:16
      env:
        - name: PGBACKREST_STANZA
          value: prod
        - name: PGBACKREST_REPO1_TYPE
          value: s3
        - name: PGBACKREST_REPO1_S3_BUCKET
          value: my-cluster-pgbackrest
        - name: PGBACKREST_REPO1_S3_REGION
          value: us-east-1
        - name: PGBACKREST_REPO1_RETENTION_FULL
          value: "7"
        - name: PGBACKREST_ARCHIVE_PUSH_QUEUE_MAX
          value: 4GiB

    - name: pgbackrest
      image: pgbackrest/pgbackrest:2.50
      command:
        - /bin/sh
        - -c
        - |
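          # Take a full backup daily at 02:00 UTC
          while true; do
            # Seconds until the next 02:00 UTC (1..86400), so a start at 01:00 does not skip a day
            sleep $(( (93599 - $(date +%s) % 86400) % 86400 + 1 ))
            pgbackrest --stanza=prod backup --type=full
          done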

# Point-in-time recovery: restore to a specific timestamp
# (--type=time is required for --target to be read as a timestamp)
pgbackrest --stanza=prod restore \
  --delta \
  --type=time \
  --target="2026-05-24 14:30:00" \
  --target-action=promote

# Restore latest backup
pgbackrest --stanza=prod restore --delta
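
A sidecar loop that runs a daily backup has to compute how long to sleep until the next run. The sketch below shows one way to do that in plain shell arithmetic, assuming times are in UTC and expressed as seconds since midnight; `seconds_until` is a hypothetical helper name.

```shell
#!/bin/sh
# seconds_until NOW_EPOCH TARGET_SECONDS_OF_DAY:
# seconds from NOW_EPOCH until the next occurrence of the target time (UTC).
# Returns 86400, not 0, when NOW_EPOCH is exactly on the target.
seconds_until() {
  now=$1
  target=$2
  delta=$(( (target - now % 86400 + 86400) % 86400 ))
  if [ "$delta" -eq 0 ]; then
    delta=86400
  fi
  echo "$delta"
}

# 02:00 UTC is 7200 seconds after midnight.
seconds_until "$(date +%s)" 7200
```

Starting at 01:00 UTC this yields 3600 seconds, so the first backup runs the same night rather than a day later.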

Multi-Cluster Failover

For Tier 0 workloads with <15 min RTO, a single-cluster restore procedure is too slow. Multi-cluster active-passive or active-active architecture is required.

Active-passive architecture

Active-Passive Failover

                   Route 53 / Global Load Balancer
                          │
           ┌──────────────┴─────────────┐
           │                            │
     Primary (us-east-1)           Standby (us-west-2)
     All traffic → 100%            Cold/warm standby → 0%
     ┌──────────────┐               ┌──────────────┐
     │ K8s cluster  │               │ K8s cluster  │
     │ PostgreSQL   │──replication──►│ PostgreSQL   │
     │  primary     │               │  replica     │
     │ Redis        │──replication──►│ Redis        │
     └──────────────┘               └──────────────┘
           │
           ▼ Failure detected (health check fails 3×)
           │
     Route 53 routes 100% to us-west-2
     Promote PostgreSQL replica → primary
     Scale up standby K8s workloads (if warm standby)
     DNS TTL of 60s (set in advance) → fast propagation

Route 53 health check failover

# Create health check for primary cluster
aws route53 create-health-check \
  --caller-reference "primary-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "api.primary.example.com",
    "Port": 443,
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "EnableSNI": true
  }'

# Create primary DNS record (failover routing)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": "1.2.3.4"}],
        "HealthCheckId": "abc123"
      }
    },{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": "5.6.7.8"}]
      }
    }]
  }'
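
The health check and record settings above put a floor on how quickly clients can move to the standby. A back-of-the-envelope sketch, assuming detection takes `RequestInterval × FailureThreshold` seconds and that resolvers may cache the old answer for one full TTL; `failover_seconds` is a hypothetical helper, and the estimate ignores application-level steps such as database promotion.

```shell
#!/bin/sh
# failover_seconds REQUEST_INTERVAL FAILURE_THRESHOLD TTL:
# rough worst-case seconds before clients resolve the secondary record.
failover_seconds() {
  echo $(( $1 * $2 + $3 ))
}

failover_seconds 10 3 60   # values from the commands above
```

With the values used here the DNS portion of a failover is on the order of 90 seconds, which leaves most of a 15-minute RTO budget for promoting the database and scaling the standby.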

Argo CD multi-cluster failover

# ApplicationSet targeting standby cluster for DR workloads
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-dr
  namespace: argocd
spec:
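  generators:
    # weight and suspended are custom list parameters with no built-in meaning in Argo CD;
    # they take effect only if the template references them (this template does not)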
    - list:
        elements:
          - cluster: prod-us-east-1
            weight: "100"
            suspended: "false"
          - cluster: dr-us-west-2
            weight: "0"
            suspended: "true"     # suspended in normal ops; activate on DR
  template:
    metadata:
      name: "payments-{{cluster}}"
    spec:
      project: production
      source:
        repoURL: https://github.com/org/gitops
        targetRevision: HEAD
        path: "clusters/{{cluster}}/payments"
      destination:
        server: "https://{{cluster}}.example.com"
        namespace: payments
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - ServerSideApply=true

# Activate DR cluster (unsuspend Argo CD application)
# Step 1: Update ApplicationSet in Git (remove suspended: "true")
# Step 2: Argo CD syncs and deploys to DR cluster automatically

# Or manually unsuspend via CLI for immediate failover:
argocd app set payments-dr-us-west-2 --sync-policy automated
argocd app sync payments-dr-us-west-2 --force

GitOps-Driven Recovery

For Tier 2/3 workloads, GitOps is the DR strategy. The Git repository contains the complete desired state of every namespace. Re-deploying to a new or wiped cluster is a matter of pointing Argo CD at the repo.

#!/bin/bash
# Bootstrap a new cluster from GitOps in < 30 minutes
# Assumes: new cluster exists, kubectl configured, AWS credentials present

set -euo pipefail

CLUSTER_NAME="${1:-disaster-recovery-$(date +%Y%m%d)}"
GIT_REPO="https://github.com/org/gitops"
ARGOCD_VERSION="v2.12.0"

echo "=== GitOps DR Bootstrap: $CLUSTER_NAME ==="

# 1. Install Argo CD (the bootstrap operator)
echo "Installing Argo CD..."
kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n argocd \
  -f "https://raw.githubusercontent.com/argoproj/argo-cd/${ARGOCD_VERSION}/manifests/install.yaml"
kubectl wait --for=condition=available --timeout=300s \
  deployment/argocd-server -n argocd

# 2. Apply root App-of-Apps (points to cluster-specific overlay in Git)
echo "Applying root application..."
kubectl apply -f - <




✅

What GitOps recovers automatically

When Argo CD syncs to a new cluster, it restores: all Deployments, StatefulSets, Services, Ingresses, ConfigMaps, RBAC, NetworkPolicies, HPA, PodDisruptionBudgets, CRDs, and operator installations. What it does NOT restore: PersistentVolume data (needs Velero), Secrets not managed by ESO (need manual re-seeding from Vault), and in-flight workload state.
  



DR Testing & Drills

DR testing must be scheduled, documented, and measured. An ad-hoc restore under pressure is not a test — it is an incident.

DR drill schedule



Drill Type	Frequency	Scope	Success Criteria	Duration
Backup verification	Weekly (automated)	Restore a backup to a temp namespace; verify resource count matches	Resource count matches source; no errors	15 min
Namespace restore drill	Monthly	Full Velero restore of one non-critical namespace to a staging cluster	All pods running; smoke test passes; RTO achieved	1–2 hours
GitOps bootstrap drill	Quarterly	Bootstrap a complete cluster from Git; route test traffic to it	All Tier 2+ workloads deployed; latency SLO met	2–4 hours
etcd restore drill	Quarterly	Restore etcd snapshot to a spare control plane node	kubectl functional; all resources present; completed in RTO	1 hour
Full failover drill	Annually (or after major changes)	Full production failover to DR cluster; traffic routed; RTO/RPO verified	RTO target achieved; no data loss beyond RPO; post-mortem filed	4–8 hours
  



Automated backup verification job

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-verify
  namespace: velero
spec:
  schedule: "0 6 * * 1"   # every Monday 6am
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: velero
          restartPolicy: OnFailure
          containers:
            - name: verify
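              # Assumes a custom image: the script needs a shell, velero, kubectl, and jq,
              # and the stock velero image does not bundle kubectl or jq
              image: velero/velero:v1.14.0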
              command:
                - /bin/sh
                - -c
                - |
                  set -e

                  # Get latest successful backup
                  BACKUP=$(velero backup get -o json | \
                    jq -r '[.items[] | select(.status.phase=="Completed")] | sort_by(.status.completionTimestamp) | last | .metadata.name')

                  echo "Verifying backup: $BACKUP"

                  # Restore to verification namespace
                  RESTORE_NS="backup-verify-$(date +%Y%m%d)"
                  velero restore create "$RESTORE_NS" \
                    --from-backup "$BACKUP" \
                    --include-namespaces payments \
                    --namespace-mappings payments:$RESTORE_NS \
                    --restore-volumes=false \
                    --wait

                  # Check restore status
                  STATUS=$(velero restore get "$RESTORE_NS" -o json | \
                    jq -r '.status.phase')

                  if [ "$STATUS" != "Completed" ]; then
                    echo "FAIL: Restore status = $STATUS"
                    exit 1
                  fi

                  # Verify resource count
                  PODS=$(kubectl get pods -n "$RESTORE_NS" --no-headers | wc -l)
                  echo "Pods in restored namespace: $PODS"

                  if [ "$PODS" -lt 1 ]; then
                    echo "FAIL: No pods restored"
                    exit 1
                  fi

                  # Cleanup verification namespace
                  kubectl delete namespace "$RESTORE_NS" --wait=false

                  echo "PASS: Backup $BACKUP verified successfully"

Chaos engineering for DR validation

# chaos-mesh: simulate AZ failure (drain all pods from one zone)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: az-failure-simulation
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: all
  selector:
    namespaces:
      - payments
    nodeSelectors:
      topology.kubernetes.io/zone: us-east-1a   # target specific AZ
  duration: "10m"   # 10 minute chaos window
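  # The inline scheduler field is Chaos Mesh 1.x syntax;
  # Chaos Mesh 2.x moves recurring runs into a separate Schedule resource
  scheduler:
    cron: "0 14 * * 2"   # Tuesday 2pm, a controlled DR drill time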

# chaos-mesh: etcd leader election disruption
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: etcd-partition
  namespace: chaos-testing
spec:
  action: partition
  mode: one   # partition one etcd member
  selector:
    namespaces:
      - kube-system
    labelSelectors:
      component: etcd
  direction: both
  duration: "5m"


DR Monitoring Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: disaster-recovery-alerts
  namespace: monitoring
spec:
  groups:
    - name: dr.backup
      rules:

        # Velero backup not completed in 25 hours (daily backup missed)
        - alert: VeleroBackupMissed
          expr: |
            time() - max(velero_backup_last_successful_timestamp{schedule="daily-full"}) > 90000
          labels:
            severity: critical
          annotations:
            summary: "Velero daily backup not completed in 25 hours"
            description: "Last successful backup: {{ $value | humanizeDuration }} ago"
            runbook_url: https://runbooks.example.com/dr/velero-backup-missed

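        # Velero backup failed within the last hour
        # (increase() is used because the raw counter stays above 0 forever after one failure)
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total[1h]) > 0
          for: 1m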
          labels:
            severity: warning
          annotations:
            summary: "Velero backup {{ $labels.schedule }} failed"
            runbook_url: https://runbooks.example.com/dr/velero-backup-failed

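        # etcd snapshot stale (> 7 hours since last backup)
        # etcd_backup_last_success_timestamp is not a built-in metric; the backup job
        # must publish it itself (for example via a Pushgateway)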
        - alert: EtcdSnapshotStale
          expr: |
            time() - max(etcd_backup_last_success_timestamp) > 25200
          labels:
            severity: critical
          annotations:
            summary: "etcd snapshot backup stale — last backup > 7 hours ago"
            runbook_url: https://runbooks.example.com/dr/etcd-snapshot-stale

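        # etcd quorum lost: no leader, or fewer than 2 members up (assumes a 3-member cluster)
        # sum() with "or vector(0)" still yields a value when no series match
        - alert: EtcdQuorumLost
          expr: |
            (sum(etcd_server_is_leader) or vector(0)) < 1
              or
            (sum(up{job="etcd"}) or vector(0)) < 2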
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "etcd quorum lost — cluster control plane at risk"
            runbook_url: https://runbooks.example.com/dr/etcd-quorum

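        # PVC not backed up (no velero.io/backup label)
        # Requires kube-state-metrics to export the label (--metric-labels-allowlist);
        # "or vector(0)" keeps the alert working when no PVC carries the label at all
        - alert: PVCWithoutBackupPolicy
          expr: |
            count(kube_persistentvolumeclaim_info{namespace!~"velero|kube-system"}) -
            (count(kube_persistentvolumeclaim_labels{namespace!~"velero|kube-system",label_velero_io_backup="true"}) or vector(0)) > 0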
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "PVCs exist without backup policy labels"
            runbook_url: https://runbooks.example.com/dr/pvc-backup-policy

        # Velero node agent not running on all nodes
        - alert: VeleroNodeAgentNotRunning
          expr: |
            kube_daemonset_status_number_ready{daemonset="node-agent",namespace="velero"}
              /
            kube_daemonset_status_desired_number_scheduled{daemonset="node-agent",namespace="velero"}
              < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Velero node-agent DaemonSet not fully running — file-system backups affected"
            runbook_url: https://runbooks.example.com/dr/velero-node-agent
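
The staleness thresholds in these rules (90000 and 25200 seconds) follow from the backup interval plus a grace period. A small sketch of that derivation, assuming a one-hour grace period as the rule comments imply; `stale_threshold` is a hypothetical helper name.

```shell
#!/bin/sh
# stale_threshold INTERVAL_HOURS GRACE_HOURS:
# alert threshold in seconds for "no successful backup since".
stale_threshold() {
  echo $(( ($1 + $2) * 3600 ))
}

stale_threshold 24 1   # daily Velero backup + 1h grace
stale_threshold 6 1    # 6-hourly etcd snapshot + 1h grace
```

Deriving the numbers this way makes it easier to keep alert thresholds in step when a schedule changes.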


DR Runbook Template

Every cluster should have a printed (or offline-accessible) DR runbook. When the cluster is down, the wiki might be too.

# Cluster Disaster Recovery Runbook
Last tested: 2026-03-15 | Next scheduled test: 2026-06-15
Owner: platform-team@company.com | Emergency: +1-555-ONCALL

## Prerequisites
- [ ] kubectl access to DR cluster (context: dr-us-west-2)
- [ ] AWS credentials with S3 + EC2 + Route53 access
- [ ] Vault root token in break-glass secret (1Password: DR-Vault-Root)
- [ ] This runbook (offline copy at /runbooks/dr.md on all SRE laptops)

## Decision Tree
1. Is the ENTIRE cluster unrecoverable?
   YES → Go to Section A: Full Cluster Restore
   NO  → Is it a namespace/application issue?
         YES → Go to Section B: Velero Namespace Restore
         NO  → Is it a control plane issue?
               YES → Go to Section C: Control Plane Recovery

## Section A: Full Cluster Restore (RTO: 60–90 min)
...etcd restore procedure + GitOps bootstrap...

## Section B: Velero Namespace Restore (RTO: 15–30 min)
...velero restore commands...

## Section C: Control Plane Recovery (RTO: 30–60 min)
...kubeadm reset + rejoin + certificate rotation...

## Post-Recovery Checklist
- [ ] All pods Running (kubectl get pods -A | grep -v Running)
- [ ] Route 53 health checks passing
- [ ] Vault unsealed (kubectl exec -n vault vault-0 -- vault status)
- [ ] ESO syncing secrets (kubectl get externalsecrets -A)
- [ ] Argo CD healthy (argocd app list | grep -v Synced)
- [ ] SLO burn rate normal (check Grafana SLO dashboard)
- [ ] File post-mortem within 48 hours
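
The first checklist item uses `grep -v Running`, which also prints the header row and pods of finished Jobs. A slightly stricter filter can be sketched with awk; `unhealthy_pods` is a hypothetical helper that assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# unhealthy_pods: read `kubectl get pods -A` output on stdin and print
# "namespace/name status" for pods that are neither Running nor Completed.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Example (not run here):
# kubectl get pods -A | unhealthy_pods
```

An empty result means every pod is either running or finished; note that a pod can be Running without being Ready, so the READY column still deserves a look.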



Best Practices


  
Test every quarter
An untested restore is not a DR plan. Run namespace restore drills monthly, etcd restore and GitOps bootstrap drills quarterly, and a full failover drill annually. Measure actual RTO against the target.

3-2-1 backup rule
Keep 3 copies of data on 2 different media, with 1 offsite. For example: etcd snapshots in S3 with a cross-region copy, Velero backups in a different S3 bucket, and EBS snapshots.

GitOps is your fast path
For stateless workloads, a GitOps re-deploy is faster than a Velero restore. Bootstrap Argo CD first; it re-creates everything in the Git repo automatically.

Automate backup verification
A CronJob that does a weekly restore to a verification namespace and counts resources catches silent backup failures before the real DR event.

Separate DR credentials
Break-glass credentials (Vault root token, kubeconfig, AWS DR role) must be stored outside the cluster (1Password, printed sheet in a safe). If the cluster is down, so is any secret manager running in it.

Match DR tier to cost
Active-active is expensive, and most services don't need it. Use the tier table to assign the right strategy: warm standby for Tier 1, GitOps re-deploy for Tier 3.
