Disaster Recovery
etcd backup and restore, Velero workload backup, RTO/RPO targets, multi-cluster failover automation, and regular DR drill procedures.
RTO & RPO Targets
RTO (Recovery Time Objective) is the maximum acceptable downtime. RPO (Recovery Point Objective) is the maximum acceptable data loss (how old a restore can be). These must be defined per workload class before designing DR procedures — not after an incident.
| Tier | Example Workloads | RTO Target | RPO Target | DR Strategy |
|---|---|---|---|---|
| Tier 0 — Mission Critical | Payment processing, authentication, core API | < 15 min | < 1 min | Active-active multi-cluster; real-time replication |
| Tier 1 — Business Critical | Order management, user accounts, notifications | < 1 hour | < 5 min | Active-passive; warm standby cluster; DB replica |
| Tier 2 — Important | Reporting, analytics, internal tools | < 4 hours | < 1 hour | Velero backup + GitOps re-deploy to standby |
| Tier 3 — Standard | Batch jobs, dev environments, CI | < 24 hours | < 24 hours | GitOps re-deploy; data rebuilt from source |
Every RTO/RPO target in the table above is achievable only if the restore procedure has been tested end-to-end within the last 90 days. A restore that has never been practiced can take several times longer than estimated under incident pressure (3–5× is a common rule of thumb). Schedule and run DR drills; they are as important as the backup jobs themselves.
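A drill result can be checked against the tier table mechanically. The following is a minimal sketch, not part of any tool: the function names `rto_target_minutes` and `drill_meets_rto` are hypothetical, and the targets are the upper bounds from the table above expressed in minutes.

```shell
#!/bin/sh
# Hypothetical helper: map a tier number (0-3) to its RTO target in minutes,
# using the targets from the tier table.
rto_target_minutes() {
  case "$1" in
    0) echo 15 ;;
    1) echo 60 ;;
    2) echo 240 ;;
    3) echo 1440 ;;
    *) echo "unknown tier: $1" >&2; return 2 ;;
  esac
}

# Succeeds (exit 0) only when the measured drill time is strictly below the target.
drill_meets_rto() {
  target=$(rto_target_minutes "$1") || return 2
  [ "$2" -lt "$target" ]
}

# Example: a Tier 1 drill that took 42 minutes.
if drill_meets_rto 1 42; then
  echo "Tier 1 drill: PASS"
else
  echo "Tier 1 drill: FAIL"
fi
```

Recording the measured time after every drill and running it through a check like this makes RTO regressions visible before an incident does.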
DR Architecture Patterns
Cost rises and RTO/RPO shrink from left to right:
| Attribute | Backup/Restore | Pilot Light | Warm Standby | Active-Active |
|---|---|---|---|---|
| Setup | Periodic backups to S3/GCS | Minimal infra always running | Scaled-down replica running | Full capacity in both regions |
| On DR | Restore | Scale up | Scale up | Traffic split always |
| RTO | hours | 30–60 min | 10–30 min | < 1 min |
| RPO | backup age | ~15 min | ~5 min | near-zero |
| Cost | $ | $$ | $$$ | $$$$ |
Most K8s clusters: use Velero + GitOps (the Backup/Restore end) for Tier 2/3; reserve Active-Active for Tier 0 only.
What needs backup
- etcd: all K8s API objects (Deployments, ConfigMaps, Secrets, CRDs, RBAC)
- PersistentVolumes: stateful app data (databases, file stores)
- Container registry: images (ECR replication or OCI artifact copy)
- Certificates & secrets: Vault raft snapshots; KMS key policy backups
- Git repositories: source of truth for manifests (GitHub/GitLab already replicated)
What does NOT need backup
- Stateless Deployment pods (re-create from image)
- ConfigMaps/Secrets that ESO regenerates from Vault/ASM
- Kubernetes node data (nodes are ephemeral by design)
- Prometheus TSDB data older than retention window (rebuilds from metrics stream)
- Anything stored in the Git-managed manifest repo
etcd Backup
etcd is the authoritative store for all Kubernetes API objects. Without an etcd backup, a total control plane loss means re-creating every resource manually. Snapshot-based backup is the standard approach.
Manual etcd snapshot
# On a control plane node (or from a pod with etcdctl access)
SNAPSHOT="/tmp/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Verify snapshot integrity (pass exactly one file; a glob may match several)
ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" \
  --write-out=table
# Output:
# +---------+----------+------------+------------+
# | HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
# +---------+----------+------------+------------+
# | 12ab34cd| 847291 | 15432 | 62 MB |
# +---------+----------+------------+------------+
Automated etcd backup CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *" # every 6 hours
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: etcd-backup
          hostNetwork: true
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          restartPolicy: OnFailure
          containers:
            - name: etcd-backup
              # NOTE: the script needs etcdctl, a shell, and the AWS CLI in one image.
              # The upstream etcd image does not include the AWS CLI and may not
              # include a shell; substitute an image that bundles all three.
              image: registry.k8s.io/etcd:3.5.12-0
              env:
                - name: ETCDCTL_API
                  value: "3"
                - name: S3_BUCKET
                  value: my-cluster-etcd-backups
                - name: AWS_DEFAULT_REGION
                  value: us-east-1
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  SNAPSHOT_FILE="/tmp/etcd-$(date +%Y%m%d-%H%M%S).db"
                  echo "Taking etcd snapshot..."
                  etcdctl snapshot save "$SNAPSHOT_FILE" \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
                    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
                  echo "Verifying snapshot..."
                  etcdctl snapshot status "$SNAPSHOT_FILE" --write-out=table
                  echo "Uploading to S3..."
                  aws s3 cp "$SNAPSHOT_FILE" \
                    "s3://${S3_BUCKET}/$(date +%Y/%m/%d)/$(basename "$SNAPSHOT_FILE")" \
                    --sse aws:kms
                  # Keep the newest 120 objects (30 days at 4 snapshots per day).
                  # head -n -120 requires GNU head.
                  echo "Pruning old snapshots (keep 30 days)..."
                  aws s3 ls "s3://${S3_BUCKET}/" --recursive | \
                    awk '{print $4}' | \
                    sort | \
                    head -n -120 | \
                    xargs -I{} aws s3 rm "s3://${S3_BUCKET}/{}" || true
                  echo "Backup complete: $(basename "$SNAPSHOT_FILE")"
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 500m
                  memory: 512Mi
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
---
# IRSA ServiceAccount for S3 access (needs the EKS pod identity webhook,
# or an equivalent OIDC setup on a self-managed cluster)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: etcd-backup
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/etcd-backup-role
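The pruning step in the CronJob keeps the newest 120 objects by relying on `head -n -120`, which is a GNU extension. A portable way to express the same selection is sketched below, assuming object keys sort chronologically by name (true for the `YYYY/MM/DD/etcd-YYYYmmdd-HHMMSS.db` layout used above). The function name `keys_to_prune` is hypothetical, and the sketch only prints keys; it deletes nothing.

```shell
#!/bin/sh
# Print every key except the newest $1 keys. Keys are read from stdin,
# one per line, and must sort chronologically by name.
keys_to_prune() {
  keep="$1"
  sort | awk -v keep="$keep" '
    { lines[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print lines[i] }
  '
}

# Demo: four snapshot keys, keep the newest two, print the two older ones.
printf '%s\n' \
  "2026/05/20/etcd-20260520-060000.db" \
  "2026/05/19/etcd-20260519-180000.db" \
  "2026/05/20/etcd-20260520-000000.db" \
  "2026/05/19/etcd-20260519-120000.db" |
  keys_to_prune 2
```

In the CronJob the printed keys would be piped to the existing `xargs ... aws s3 rm` step. An S3 lifecycle rule is another way to get the same retention without any script.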
Managed Kubernetes etcd backup considerations
On managed Kubernetes, the control plane (including etcd) is managed by the cloud provider. You cannot access etcd directly. Instead:
• EKS: Use Velero for workload-level backup; AWS handles etcd; use cluster backup via AWS Backup (EKS resources).
• GKE: Automated etcd snapshots; use Config Connector + Velero.
• Self-managed (kubeadm): Full etcd access; use the CronJob above.
The etcd backup CronJob applies to self-managed clusters only.
etcd Restore
Restoring etcd overwrites all current cluster state. Every resource created after the backup timestamp will be lost. Restore etcd only when the control plane is completely unrecoverable. For partial data loss (accidentally deleted namespace), use Velero or kubectl to restore individual resources.
Single-node etcd restore procedure
1. Download the snapshot from S3
   aws s3 cp s3://my-cluster-etcd-backups/2026/05/20/etcd-20260520-060000.db \
     /tmp/etcd-snapshot.db
2. Stop the API server and etcd (prevent writes during restore)
   # Move static pod manifests out of /etc/kubernetes/manifests/
   # kubelet stops the pods when manifests are removed
   mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
   mv /etc/kubernetes/manifests/etcd.yaml /tmp/
   # Verify pods stopped
   crictl pods | grep -E "etcd|apiserver" # should return empty
3. Restore snapshot to a new data directory
   # etcd 3.5 deprecates `etcdctl snapshot restore` in favor of
   # `etcdutl snapshot restore`, which takes the same flags
   ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
     --data-dir=/var/lib/etcd-restore \
     --name=master-0 \
     --initial-cluster=master-0=https://10.0.1.10:2380 \
     --initial-cluster-token=etcd-cluster-restore \
     --initial-advertise-peer-urls=https://10.0.1.10:2380
4. Swap the data directory
   mv /var/lib/etcd /var/lib/etcd-old-$(date +%Y%m%d)
   mv /var/lib/etcd-restore /var/lib/etcd
   # Only needed when etcd runs as a dedicated etcd user
   chown -R etcd:etcd /var/lib/etcd
5. Restore the static pod manifests
   mv /tmp/etcd.yaml /etc/kubernetes/manifests/
   # Wait for etcd to start
   sleep 10
   mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
   # Verify
   kubectl get nodes
   kubectl get pods -A | head -20
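The fixed `sleep 10` in the last step is a guess at how long etcd takes to start. A bounded retry loop is usually more reliable. The sketch below is a generic helper, not part of etcd or kubeadm; the name `wait_for` and its argument order are assumptions.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt budget is used up.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  max="$1"; delay="$2"; shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "not ready after $max attempt(s)" >&2
  return 1
}

# Demo with a stand-in check. In the restore procedure the check would be an
# etcd health probe such as `etcdctl endpoint health` with the cluster's certs.
wait_for 3 0 true
```

Moving the API server manifest back only after the probe succeeds avoids starting kube-apiserver against an etcd that is still loading the restored data.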
Multi-node etcd cluster restore
# On ALL control plane nodes simultaneously (must use same snapshot + same token)
# node-specific: adjust --name and --initial-advertise-peer-urls per node
# Node 1
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restore \
--name=master-0 \
--initial-cluster=master-0=https://10.0.1.10:2380,master-1=https://10.0.1.11:2380,master-2=https://10.0.1.12:2380 \
--initial-cluster-token=etcd-cluster-restore-$(date +%Y%m%d) \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# Node 2
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restore \
--name=master-1 \
--initial-cluster=master-0=https://10.0.1.10:2380,master-1=https://10.0.1.11:2380,master-2=https://10.0.1.12:2380 \
--initial-cluster-token=etcd-cluster-restore-$(date +%Y%m%d) \
--initial-advertise-peer-urls=https://10.0.1.11:2380
# Node 3
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restore \
--name=master-2 \
--initial-cluster=master-0=https://10.0.1.10:2380,master-1=https://10.0.1.11:2380,master-2=https://10.0.1.12:2380 \
--initial-cluster-token=etcd-cluster-restore-$(date +%Y%m%d) \
--initial-advertise-peer-urls=https://10.0.1.12:2380
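The three commands above differ only in `--name` and `--initial-advertise-peer-urls`, and the token must be identical on every node. One way to avoid copy-paste drift is to generate the commands from a single member list, as in the sketch below. The variable and function names are hypothetical, the token is a fixed example value chosen once rather than computed per node, and the script only prints commands.

```shell
#!/bin/sh
# Member list as "name=peer-url" pairs; values mirror the example above.
MEMBERS="master-0=https://10.0.1.10:2380 master-1=https://10.0.1.11:2380 master-2=https://10.0.1.12:2380"
# One fixed token for the whole restore, chosen once and copied to every node.
TOKEN="etcd-cluster-restore-20260520"

# Join the pairs with commas to form the --initial-cluster value.
initial_cluster() {
  echo "$MEMBERS" | tr ' ' ','
}

# Print the restore command for each member; nothing is executed.
print_restore_commands() {
  for member in $MEMBERS; do
    name=${member%%=*}   # text before the first "="
    url=${member#*=}     # text after the first "="
    printf 'etcdctl snapshot restore /tmp/etcd-snapshot.db --data-dir=/var/lib/etcd-restore --name=%s --initial-cluster=%s --initial-cluster-token=%s --initial-advertise-peer-urls=%s\n' \
      "$name" "$(initial_cluster)" "$TOKEN" "$url"
  done
}

print_restore_commands
```

After the restore, each node still needs the data directory swap and manifest steps from the single-node procedure.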
Velero Workload Backup
Velero backs up Kubernetes API objects and PersistentVolume data. It is the standard tool for application-level backup/restore and cluster-to-cluster migration. Unlike etcd snapshots, Velero backups are selective (namespace, label, resource type) and support incremental PV backup via Restic/Kopia.
Velero installation (AWS S3 + EBS)
# Install Velero CLI
brew install velero
# Create S3 bucket for Velero backups
aws s3api create-bucket \
--bucket my-cluster-velero-backups \
--region us-east-1
aws s3api put-bucket-versioning \
--bucket my-cluster-velero-backups \
--versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
--bucket my-cluster-velero-backups \
--server-side-encryption-configuration \
'{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms"}}]}'
# Install Velero with IRSA (no static credentials)
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket my-cluster-velero-backups \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--use-node-agent \
--default-volumes-to-fs-backup \
--sa-annotations eks.amazonaws.com/role-arn=arn:aws:iam::123456789:role/velero-role \
--no-secret \
--wait
# BackupStorageLocation (check availability with: velero backup-location get)
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-cluster-velero-backups
    prefix: prod-cluster
  config:
    region: us-east-1
    s3ForcePathStyle: "false"
    # s3Url left unset: use the AWS SDK default endpoint (VPC endpoint preferred)
  # credential left unset: with IRSA no Secret is referenced
  default: true
---
# VolumeSnapshotLocation: EBS snapshots
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  config:
    region: us-east-1
Scheduled backups
# Daily backup of all namespaces (Kubernetes objects + PV snapshots)
velero schedule create daily-full \
--schedule="0 2 * * *" \
--ttl 720h \
--include-namespaces "*" \
--exclude-namespaces velero,kube-system,kube-public,kube-node-lease \
--snapshot-volumes=true \
--volume-snapshot-locations default
# Hourly backup of critical namespaces (Kubernetes objects only; no PV snapshots)
velero schedule create hourly-payments \
--schedule="0 * * * *" \
--ttl 48h \
--include-namespaces payments,auth \
--snapshot-volumes=false
# Every 15 minutes for stateful data using file-system backup (Kopia/Restic)
velero schedule create frequent-payments-stateful \
--schedule="*/15 * * * *" \
--ttl 24h \
--include-namespaces payments \
--default-volumes-to-fs-backup=true \
--snapshot-volumes=false
# Schedule CRD: declarative (GitOps-managed)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - velero
      - kube-system
      - kube-public
      - kube-node-lease
    includedResources:
      - "*"
    excludedResources:
      - events
      - events.events.k8s.io
    snapshotVolumes: true
    volumeSnapshotLocations:
      - default
    ttl: 720h0m0s
    storageLocation: default
    hooks:
      resources:
        - name: database-quiesce
          includedNamespaces:
            - payments
          labelSelector:
            matchLabels:
              app: postgresql
          # NOTE: pg_start_backup/pg_stop_backup exist only in PostgreSQL 14 and
          # earlier; they were removed in PostgreSQL 15. On 15+ rely on WAL
          # archiving (for example pgBackRest) for consistent backups instead.
          pre:
            - exec:
                container: postgresql
                command:
                  - /bin/bash
                  - -c
                  - psql -U postgres -c "SELECT pg_start_backup('velero', true);"
                onError: Fail
                timeout: 60s
          post:
            - exec:
                container: postgresql
                command:
                  - /bin/bash
                  - -c
                  - psql -U postgres -c "SELECT pg_stop_backup();"
                onError: Continue
                timeout: 60s
Backup verification
# List all backups with status
velero backup get
# Describe a specific backup (shows included resources, errors, warnings)
velero backup describe daily-full-20260520020000 --details
# Check backup logs for errors
velero backup logs daily-full-20260520020000 | grep -i error
# Verify S3 storage (backup content)
aws s3 ls s3://my-cluster-velero-backups/prod-cluster/backups/ --recursive | \
sort -k1,2 | tail -20
Velero Restore
Full namespace restore
# List available backups
velero backup get
# Restore a specific namespace from backup
velero restore create payments-restore-$(date +%Y%m%d-%H%M) \
--from-backup daily-full-20260520020000 \
--include-namespaces payments \
--restore-volumes=true \
--wait
# Check restore status
velero restore describe payments-restore-20260524-1430
# Check for failures
velero restore logs payments-restore-20260524-1430 | grep -i "error\|warn"
Selective resource restore
# Restore only Deployments, Services, and ConfigMaps (not PVCs; data already present)
velero restore create deployments-only \
--from-backup daily-full-20260520020000 \
--include-namespaces payments \
--include-resources deployments,services,configmaps \
--restore-volumes=false
# Restore one application's resources by label selector
velero restore create single-deployment \
--from-backup daily-full-20260520020000 \
--include-namespaces payments \
--selector "app=payment-service"
# Restore to a different namespace (namespace mapping)
velero restore create payments-to-staging \
--from-backup daily-full-20260520020000 \
--include-namespaces payments \
--namespace-mappings payments:payments-staging
Cross-cluster restore (cluster migration)
# Scenario: migrate from cluster-A to cluster-B (different region)
# Both clusters must be able to access the same S3 backup bucket
# On cluster-B: create a BackupStorageLocation pointing to cluster-A's bucket
velero backup-location create cluster-a-backups \
--provider aws \
--bucket my-cluster-velero-backups \
--prefix prod-cluster \
--config region=us-east-1 \
--access-mode ReadOnly # cluster-B only reads, doesn't overwrite
# Velero syncs backup metadata from the new location automatically
# (roughly once a minute by default); confirm cluster-A's backups are visible
velero backup get
# Restore from cluster-A's backup into cluster-B
velero restore create migrate-payments \
--from-backup daily-full-20260520020000 \
--include-namespaces payments \
--restore-volumes=true
Stateful Data DR
Database backup strategies
| Database | Backup Method | RPO | Tool |
|---|---|---|---|
| PostgreSQL | pg_dump (logical) + WAL archiving (PITR) + EBS snapshot | ~5 min (WAL) or backup age (pg_dump) | pgBackRest, Barman, Velero hook |
| MySQL/MariaDB | mysqldump + binary log + EBS snapshot | ~5 min (binlog) | XtraBackup, Velero hook |
| MongoDB | mongodump + oplog + Atlas backup | ~1 min (oplog) | Percona Backup for MongoDB |
| Redis | RDB snapshot (configurable) + AOF persistence | Configurable (1s–1h) | Redis built-in; Velero file-system backup |
| etcd (app-level) | etcdctl snapshot (same as control plane etcd) | Snapshot frequency | etcdctl + CronJob |
| Kafka | MirrorMaker 2 replication + topic offset backup | ~minutes (MirrorMaker lag) | MirrorMaker 2, Confluent Replicator |
| AWS RDS (via Crossplane) | Automated RDS snapshots + cross-region copy | 5 min (automated backup window) | AWS RDS automated backups + Crossplane MR status |
PostgreSQL PITR with pgBackRest
# pgBackRest sidecar pattern in PostgreSQL StatefulSet
spec:
  containers:
    - name: postgres
      image: postgres:16
      env:
        - name: PGBACKREST_STANZA
          value: prod
        - name: PGBACKREST_REPO1_TYPE
          value: s3
        - name: PGBACKREST_REPO1_S3_BUCKET
          value: my-cluster-pgbackrest
        - name: PGBACKREST_REPO1_S3_REGION
          value: us-east-1
        - name: PGBACKREST_REPO1_RETENTION_FULL
          value: "7"
        - name: PGBACKREST_ARCHIVE_PUSH_QUEUE_MAX
          value: 4GiB
    - name: pgbackrest
      image: pgbackrest/pgbackrest:2.50
      command:
        - /bin/sh
        - -c
        - |
          # Full backup daily at 02:00 UTC
          while true; do
            sleep $(( 86400 - $(date +%s) % 86400 + 7200 ))
            pgbackrest --stanza=prod backup --type=full
          done
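The sidecar's sleep expression waits until the next midnight UTC plus two hours, so a container that starts between 00:00 and 02:00 UTC waits past the first 02:00 and backs up a day late. A variant that always targets the next occurrence of the hour is sketched below. The function name `seconds_until_hour_utc` is hypothetical, and this is one possible formulation rather than the original author's method.

```shell
#!/bin/sh
# Seconds from epoch time $1 until the next occurrence of hour $2 (0-23) UTC.
# The result is in 1..86400, so a call exactly at the target hour waits a full day.
seconds_until_hour_utc() {
  now="$1"; hour="$2"
  target=$((hour * 3600))
  since_midnight=$((now % 86400))
  echo $(( (target - since_midnight + 86399) % 86400 + 1 ))
}

# Demo: time left until the next 02:00 UTC from the current moment.
echo "sleep for $(seconds_until_hour_utc "$(date +%s)" 2) seconds"
```

In the loop this would replace the arithmetic inside `sleep $(( ... ))`. A Kubernetes CronJob that runs `pgbackrest backup` is another option that avoids hand-rolled scheduling.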
# Point-in-time recovery: restore to specific timestamp
pgbackrest --stanza=prod restore \
  --delta \
  --type=time \
  --target="2026-05-24 14:30:00" \
  --target-action=promote
# Restore latest backup
pgbackrest --stanza=prod restore --delta
Multi-Cluster Failover
For Tier 0 workloads with <15 min RTO, a single-cluster restore procedure is too slow. Multi-cluster active-passive or active-active architecture is required.
Active-passive architecture
            Route 53 / Global Load Balancer
                           │
           ┌───────────────┴───────────────┐
           │                               │
  Primary (us-east-1)             Standby (us-west-2)
  All traffic → 100%              Cold/warm standby → 0%
    ┌──────────────┐                ┌──────────────┐
    │ K8s cluster  │                │ K8s cluster  │
    │ PostgreSQL   │──replication──►│ PostgreSQL   │
    │ primary      │                │ replica      │
    │ Redis        │──replication──►│ Redis        │
    └──────────────┘                └──────────────┘
           │
           ▼ Failure detected (health check fails 3×)
           │
    Route 53 routes 100% to us-west-2
    Promote PostgreSQL replica → primary
    Scale up standby K8s workloads (if warm standby)
    Update DNS TTL 60s → fast propagation
Route 53 health check failover
# Create health check for primary cluster
aws route53 create-health-check \
--caller-reference "primary-$(date +%s)" \
--health-check-config '{
"Type": "HTTPS",
"FullyQualifiedDomainName": "api.primary.example.com",
"Port": 443,
"ResourcePath": "/health",
"RequestInterval": 10,
"FailureThreshold": 3,
"EnableSNI": true
}'
# Create primary DNS record (failover routing)
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "primary",
"Failover": "PRIMARY",
"TTL": 60,
"ResourceRecords": [{"Value": "1.2.3.4"}],
"HealthCheckId": "abc123"
}
},{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "secondary",
"Failover": "SECONDARY",
"TTL": 60,
"ResourceRecords": [{"Value": "5.6.7.8"}]
}
}]
}'
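The health check and record settings above put a rough ceiling on how long DNS failover takes: detection needs about RequestInterval × FailureThreshold seconds, and resolvers may serve the old answer for up to one TTL after that. The sketch below does only this arithmetic. The function name `dns_failover_seconds` is hypothetical, and the estimate ignores database promotion, workload scale-up, and resolvers that disregard TTLs.

```shell
#!/bin/sh
# Rough upper bound on DNS failover time, in seconds:
#   detection (request interval * failure threshold) + cached DNS answers (TTL)
dns_failover_seconds() {
  echo $(( $1 * $2 + $3 ))
}

# Values from the example: RequestInterval=10, FailureThreshold=3, TTL=60
echo "approx. DNS failover: $(dns_failover_seconds 10 3 60)s"
```

With the example values the DNS portion stays well inside a 15-minute Tier 0 RTO, which leaves the remaining budget for replica promotion and verification.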
Argo CD multi-cluster failover
# ApplicationSet targeting standby cluster for DR workloads
# NOTE: weight and suspended are custom list parameters. Argo CD does not act
# on them by itself; they take effect only if the template or external tooling
# references them.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-dr
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: prod-us-east-1
            weight: "100"
            suspended: "false"
          - cluster: dr-us-west-2
            weight: "0"
            suspended: "true" # suspended in normal ops; activate on DR
  template:
    metadata:
      name: "payments-{{cluster}}"
    spec:
      project: production
      source:
        repoURL: https://github.com/org/gitops
        targetRevision: HEAD
        path: "clusters/{{cluster}}/payments"
      destination:
        server: "https://{{cluster}}.example.com"
        namespace: payments
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - ServerSideApply=true
# Activate DR cluster
# Step 1: Update ApplicationSet in Git (set suspended to "false" for dr-us-west-2)
# Step 2: Argo CD syncs and deploys to DR cluster automatically
# Or trigger the sync from the CLI for immediate failover
# (the generated Application is named payments-<cluster>):
argocd app set payments-dr-us-west-2 --sync-policy automated
argocd app sync payments-dr-us-west-2 --force
GitOps-Driven Recovery
For Tier 2/3 workloads, GitOps is the DR strategy. The Git repository contains the complete desired state of every namespace. Re-deploying to a new or wiped cluster is a matter of pointing Argo CD at the repo.
#!/bin/bash
# Bootstrap a new cluster from GitOps in < 30 minutes
# Assumes: new cluster exists, kubectl configured, AWS credentials present
set -euo pipefail
CLUSTER_NAME="${1:-disaster-recovery-$(date +%Y%m%d)}"
GIT_REPO="https://github.com/org/gitops"
ARGOCD_VERSION="v2.12.0"
echo "=== GitOps DR Bootstrap: $CLUSTER_NAME ==="
# 1. Install Argo CD (the bootstrap operator)
echo "Installing Argo CD..."
kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n argocd \
-f "https://raw.githubusercontent.com/argoproj/argo-cd/${ARGOCD_VERSION}/manifests/install.yaml"
kubectl wait --for=condition=available --timeout=300s \
deployment/argocd-server -n argocd
# 2. Apply root App-of-Apps (points to cluster-specific overlay in Git)
echo "Applying root application..."
kubectl apply -f - <
When Argo CD syncs to a new cluster, it restores: all Deployments, StatefulSets, Services, Ingresses, ConfigMaps, RBAC, NetworkPolicies, HPA, PodDisruptionBudgets, CRDs, and operator installations. What it does NOT restore: PersistentVolume data (needs Velero), Secrets not managed by ESO (need manual re-seeding from Vault), and in-flight workload state.
DR Testing & Drills
DR testing must be scheduled, documented, and measured. An ad-hoc restore under pressure is not a test — it is an incident.
DR drill schedule
| Drill Type | Frequency | Scope | Success Criteria | Duration |
|---|---|---|---|---|
| Backup verification | Weekly (automated) | Restore a backup to a temp namespace; verify resource count matches | Resource count matches source; no errors | 15 min |
| Namespace restore drill | Monthly | Full Velero restore of one non-critical namespace to a staging cluster | All pods running; smoke test passes; RTO achieved | 1–2 hours |
| GitOps bootstrap drill | Quarterly | Bootstrap a complete cluster from Git; route test traffic to it | All Tier 2+ workloads deployed; latency SLO met | 2–4 hours |
| etcd restore drill | Quarterly | Restore etcd snapshot to a spare control plane node | kubectl functional; all resources present; completed in RTO | 1 hour |
| Full failover drill | Annually (or after major changes) | Full production failover to DR cluster; traffic routed; RTO/RPO verified | RTO target achieved; no data loss beyond RPO; post-mortem filed | 4–8 hours |
Automated backup verification job
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-verify
  namespace: velero
spec:
  schedule: "0 6 * * 1" # every Monday 6am
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: velero
          restartPolicy: OnFailure
          containers:
            - name: verify
              # NOTE: the script needs sh, velero, jq, and kubectl. The upstream
              # velero image is minimal and may lack a shell, jq, and kubectl;
              # substitute an image that bundles all four.
              image: velero/velero:v1.14.0
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  # Get latest successful backup
                  BACKUP=$(velero backup get -o json | \
                    jq -r '[.items[] | select(.status.phase=="Completed")] | sort_by(.status.completionTimestamp) | last | .metadata.name')
                  echo "Verifying backup: $BACKUP"
                  # Restore to verification namespace
                  RESTORE_NS="backup-verify-$(date +%Y%m%d)"
                  velero restore create "$RESTORE_NS" \
                    --from-backup "$BACKUP" \
                    --include-namespaces payments \
                    --namespace-mappings "payments:$RESTORE_NS" \
                    --restore-volumes=false \
                    --wait
                  # Check restore status
                  STATUS=$(velero restore get "$RESTORE_NS" -o json | \
                    jq -r '.status.phase')
                  if [ "$STATUS" != "Completed" ]; then
                    echo "FAIL: Restore status = $STATUS"
                    exit 1
                  fi
                  # Verify resource count
                  PODS=$(kubectl get pods -n "$RESTORE_NS" --no-headers | wc -l)
                  echo "Pods in restored namespace: $PODS"
                  if [ "$PODS" -lt 1 ]; then
                    echo "FAIL: No pods restored"
                    exit 1
                  fi
                  # Cleanup verification namespace
                  kubectl delete namespace "$RESTORE_NS" --wait=false
                  echo "PASS: Backup $BACKUP verified successfully"
Chaos engineering for DR validation
# chaos-mesh: simulate AZ failure (kill all pods in one zone)
# NOTE: the inline scheduler field is Chaos Mesh 1.x syntax; Chaos Mesh 2.x
# moves recurring runs into a separate Schedule resource.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: az-failure-simulation
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: all
  selector:
    namespaces:
      - payments
    nodeSelectors:
      topology.kubernetes.io/zone: us-east-1a # target specific AZ
  duration: "10m" # 10 minute chaos window
  scheduler:
    cron: "0 14 * * 2" # Tuesday 2pm: controlled DR drill time
---
# chaos-mesh: etcd leader election disruption
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: etcd-partition
  namespace: chaos-testing
spec:
  action: partition
  mode: one # partition one etcd member
  selector:
    namespaces:
      - kube-system
    labelSelectors:
      component: etcd
  direction: both
  duration: "5m"
DR Monitoring Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: disaster-recovery-alerts
  namespace: monitoring
spec:
  groups:
    - name: dr.backup
      rules:
        # Velero backup not completed in 25 hours (daily backup missed)
        - alert: VeleroBackupMissed
          expr: |
            time() - max(velero_backup_last_successful_timestamp{schedule="daily-full"}) > 90000
          labels:
            severity: critical
          annotations:
            summary: "Velero daily backup not completed in 25 hours"
            description: "Last successful backup: {{ $value | humanizeDuration }} ago"
            runbook_url: https://runbooks.example.com/dr/velero-backup-missed
        # Velero backup failed (the failure counter increased in the last hour;
        # a bare "counter > 0" would fire forever after the first failure)
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total[1h]) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Velero backup {{ $labels.schedule }} failed"
            runbook_url: https://runbooks.example.com/dr/velero-backup-failed
        # etcd snapshot stale (> 7 hours since last backup)
        # etcd_backup_last_success_timestamp is a custom metric; the backup job
        # has to publish it (for example through a Pushgateway)
        - alert: EtcdSnapshotStale
          expr: |
            time() - max(etcd_backup_last_success_timestamp) > 25200
          labels:
            severity: critical
          annotations:
            summary: "etcd snapshot backup stale: last backup > 7 hours ago"
            runbook_url: https://runbooks.example.com/dr/etcd-snapshot-stale
        # etcd quorum lost (no leader, or fewer than 2 members up)
        # absent() is used because count() over an empty result returns nothing
        - alert: EtcdQuorumLost
          expr: |
            absent(etcd_server_is_leader == 1)
            or
            sum(up{job="etcd"}) < 2
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "etcd quorum lost: cluster control plane at risk"
            runbook_url: https://runbooks.example.com/dr/etcd-quorum
        # PVC not backed up (no Velero backup label)
        # Requires kube-state-metrics to export the velero.io/backup label;
        # "or vector(0)" keeps the alert working when no PVC carries the label
        - alert: PVCWithoutBackupPolicy
          expr: |
            count(kube_persistentvolumeclaim_info{namespace!~"velero|kube-system"})
            - (count(kube_persistentvolumeclaim_labels{namespace!~"velero|kube-system",label_velero_io_backup="true"}) or vector(0))
            > 0
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "PVCs exist without backup policy labels"
            runbook_url: https://runbooks.example.com/dr/pvc-backup-policy
        # Velero node agent not running on all nodes
        - alert: VeleroNodeAgentNotRunning
          expr: |
            kube_daemonset_status_number_ready{daemonset="node-agent",namespace="velero"}
            /
            kube_daemonset_status_desired_number_scheduled{daemonset="node-agent",namespace="velero"}
            < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Velero node-agent DaemonSet not fully running; file-system backups affected"
            runbook_url: https://runbooks.example.com/dr/velero-node-agent
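The staleness thresholds in the rules above (90000 and 25200 seconds) follow one pattern: the backup interval plus a one-hour grace period, converted to seconds. A small helper makes that derivation explicit when schedules change. The function name `stale_threshold_seconds` is hypothetical, and the one-hour grace period is inferred from the two values in the rules.

```shell
#!/bin/sh
# Staleness threshold in seconds = (backup interval + grace period) in hours * 3600.
stale_threshold_seconds() {
  echo $(( ($1 + $2) * 3600 ))
}

echo "daily Velero backup (24h + 1h grace): $(stale_threshold_seconds 24 1)"
echo "etcd snapshot (6h + 1h grace): $(stale_threshold_seconds 6 1)"
```

If a schedule is tightened, for example to an hourly backup, the matching alert threshold should be recomputed the same way so the alert does not lag behind the new RPO.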
DR Runbook Template
Every cluster should have a printed (or offline-accessible) DR runbook. When the cluster is down, the wiki might be too.
# Cluster Disaster Recovery Runbook
Last tested: 2026-03-15 | Next scheduled test: 2026-06-15
Owner: platform-team@company.com | Emergency: +1-555-ONCALL
## Prerequisites
- [ ] kubectl access to DR cluster (context: dr-us-west-2)
- [ ] AWS credentials with S3 + EC2 + Route53 access
- [ ] Vault root token in break-glass secret (1Password: DR-Vault-Root)
- [ ] This runbook (offline copy at /runbooks/dr.md on all SRE laptops)
## Decision Tree
1. Is the ENTIRE cluster unrecoverable?
   YES → Go to Section A: Full Cluster Restore
   NO → Is it a namespace/application issue?
      YES → Go to Section B: Velero Namespace Restore
      NO → Is it a control plane issue?
         YES → Go to Section C: Control Plane Recovery
## Section A: Full Cluster Restore (RTO: 60–90 min)
...etcd restore procedure + GitOps bootstrap...
## Section B: Velero Namespace Restore (RTO: 15–30 min)
...velero restore commands...
## Section C: Control Plane Recovery (RTO: 30–60 min)
...kubeadm reset + rejoin + certificate rotation...
## Post-Recovery Checklist
- [ ] All pods Running (kubectl get pods -A | grep -v Running)
- [ ] Route 53 health checks passing
- [ ] Vault unsealed (kubectl exec -n vault vault-0 -- vault status)
- [ ] ESO syncing secrets (kubectl get externalsecrets -A)
- [ ] Argo CD healthy (argocd app list | grep -v Synced)
- [ ] SLO burn rate normal (check Grafana SLO dashboard)
- [ ] File post-mortem within 48 hours
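The decision tree in the runbook template can also be kept as a small script next to the offline copy, so the on-call engineer gets the same answer every time. The sketch below assumes answers are given as "yes" or "no" in the order the tree asks them. The function name `dr_section` is hypothetical, and the final fallback branch is an addition that the template does not define.

```shell
#!/bin/sh
# Answers are "yes" or "no", in the order the decision tree asks them:
#   $1 entire cluster unrecoverable?  $2 namespace/application issue?
#   $3 control plane issue?
dr_section() {
  cluster_unrecoverable="$1"; namespace_issue="$2"; control_plane_issue="$3"
  if [ "$cluster_unrecoverable" = "yes" ]; then
    echo "Section A: Full Cluster Restore"
  elif [ "$namespace_issue" = "yes" ]; then
    echo "Section B: Velero Namespace Restore"
  elif [ "$control_plane_issue" = "yes" ]; then
    echo "Section C: Control Plane Recovery"
  else
    # Not covered by the template's tree; assumed fallback.
    echo "No section matches: escalate to the runbook owner"
    return 1
  fi
}

# Example: the cluster is up, but one namespace is broken.
dr_section no yes no
```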
Best Practices
Test on a schedule
An untested restore is not a DR plan. Run namespace restore drills monthly, etcd restore and GitOps bootstrap drills quarterly, and a full failover drill annually. Measure actual RTO against the target.
3-2-1 backup rule
Keep 3 copies of data on 2 different media, with 1 copy offsite. For example: etcd snapshots in S3 with a cross-region copy, Velero backups in a separate S3 bucket, and EBS snapshots.
GitOps is your fast path
For stateless workloads, a GitOps re-deploy is faster than a Velero restore. Bootstrap Argo CD first; it re-creates everything in the Git repo automatically.
Automate backup verification
A CronJob that does a weekly restore to a verification namespace and counts resources catches silent backup failures before the real DR event.
Separate DR credentials
Break-glass credentials (Vault root token, kubeconfig, AWS DR role) must be stored outside the cluster (1Password, or a printed sheet in a safe). If the cluster is down, any secret manager running inside it is down too.
Match DR tier to cost
Active-active is expensive. Most services don't need it. Use the tier table to assign the right strategy — warm standby for Tier 1, GitOps re-deploy for Tier 3.