20. Disaster Recovery
Chapter 20 of 24 · 20 min
Disaster recovery planning addresses catastrophic failures: region outages, data corruption, or infrastructure loss. Two recovery objectives bound the acceptable impact. RTO (Recovery Time Objective) is the longest service interruption the business can tolerate, and RPO (Recovery Point Objective) is the largest window of data loss it can tolerate, expressed as a duration.
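As a small worked example, both objectives reduce to comparing a measured duration against a target. The sketch below uses a hypothetical helper, meets_objective, and illustrative numbers; only the 5-minute RPO comes from this chapter, and the 1800-second RTO is an assumed value.

```shell
#!/bin/sh
# Hypothetical helper: compare a measured duration with an objective.
# Both arguments are whole seconds.
meets_objective() {
    measured=$1
    objective=$2
    if [ "$measured" -le "$objective" ]; then
        echo "met"
    else
        echo "missed by $((measured - objective))s"
    fi
}

# RPO: the last good backup finished 420 s before the failure; target is 300 s (5 min).
meets_objective 420 300     # prints: missed by 120s

# RTO: service was restored 1500 s after the failure; 1800 s is an assumed target.
meets_objective 1500 1800   # prints: met
```

Expressing both objectives in seconds keeps the comparison a single integer test, which is easy to reuse in drill scripts.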
Prerequisites
- New infrastructure provisioned
- Network connectivity verified
- Access credentials validated
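The prerequisites above can be checked with a short preflight script before any restore step runs. This is a minimal sketch: the function names check and preflight are hypothetical, and the mapping of each prerequisite to a specific command is an assumption about this environment rather than a prescribed procedure.

```shell
#!/bin/sh
# Run one check quietly and report the outcome.
# Returns non-zero when the underlying command fails.
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

# Assumed mapping of prerequisites to commands; adjust for your environment.
preflight() {
    failed=0
    check "Access credentials valid" aws sts get-caller-identity || failed=1
    check "Backup bucket reachable" aws s3 ls s3://inference-backups/ || failed=1
    check "Cluster nodes provisioned" kubectl get nodes || failed=1
    return "$failed"
}

# Usage: abort the restore if any prerequisite is not met.
#   preflight || exit 1
```

Running every check before returning, instead of stopping at the first failure, shows all missing prerequisites in one pass.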
Restore Order
1. Database (RPO: 5 minutes target)
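# Assumes PostgreSQL is stopped and base_backup.tar.gz is in the synced db/ directory
mkdir -p /tmp/restoration && cd /tmp/restoration
aws s3 sync s3://inference-backups/db/latest/ ./db/
# Clear the old data directory only after the backup has been downloaded
rm -rf /var/lib/postgresql/data/*
tar -xzf ./db/base_backup.tar.gz -C /var/lib/postgresql/data/
# Run as the postgres user; the data directory must be owned by that user
pg_ctl start -D /var/lib/postgresql/data/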
2. Model Artifacts
aws s3 sync s3://inference-backups/models/latest/ \
    s3://model-artifacts/production/
3. Configuration State
kubectl apply -f ./configs/namespace.yaml
kubectl apply -f ./configs/secrets.yaml
kubectl apply -f ./configs/configmaps.yaml
4. Inference Services
kubectl apply -f ./inference/deployment.yaml
kubectl apply -f ./inference/service.yaml
kubectl rollout status deployment/inference-server
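The restore order matters because each step depends on the one before it, so a failed step should halt the run. The sketch below shows one way to enforce that; run_in_order is a hypothetical helper, not part of any tool used in this chapter.

```shell
#!/bin/sh
# Run each command string in order and stop at the first failure,
# so later steps never run against a half-restored dependency.
run_in_order() {
    for step in "$@"; do
        echo "==> $step"
        if ! sh -c "$step"; then
            echo "step failed: $step" >&2
            return 1
        fi
    done
}

# Usage with commands from this chapter:
#   run_in_order \
#       "kubectl apply -f ./configs/namespace.yaml" \
#       "kubectl apply -f ./configs/secrets.yaml" \
#       "kubectl apply -f ./configs/configmaps.yaml" \
#       "kubectl apply -f ./inference/deployment.yaml" \
#       "kubectl apply -f ./inference/service.yaml" \
#       "kubectl rollout status deployment/inference-server"
```

Passing each step as a single string keeps the call site readable. The trade-off is that every step is evaluated by a fresh sh -c, so shell variables set in one step are not visible to the next.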
Verification
- Health endpoints responding
- Basic inference test passes
- Prometheus metrics flowing
- Alert channels active
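Services rarely answer the instant a rollout completes, so verification checks usually need a retry loop. The following sketch assumes a hypothetical helper, wait_for, and a placeholder health URL; neither the URL nor the port is defined by this chapter.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempts run out.
# Arguments: <attempts> <delay-seconds> <command> [args...]
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while :; do
        "$@" >/dev/null 2>&1 && return 0
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}

# Placeholder endpoint (an assumption); replace with the service's real health URL.
HEALTH_URL="${HEALTH_URL:-http://localhost:8080/healthz}"

# Usage: up to 30 attempts, 2 seconds apart.
#   wait_for 30 2 curl -fsS "$HEALTH_URL" || echo "health check failed"
```

The same helper can wrap the basic inference test or a metrics query, since it only cares about the exit status of the command it is given.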
EXERCISE
Perform a full disaster recovery drill: back up the model artifacts to local storage, destroy the inference deployment, then restore from the backup. Measure the actual recovery time and document any discrepancy between the planned RTO and the achieved recovery time.
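One way to capture the achieved recovery time during the drill is to wrap the restore in a timer. This is a sketch only: time_cmd is a hypothetical helper, restore.sh stands in for whatever script runs your restore steps, and the RTO value in the usage comment is assumed.

```shell
#!/bin/sh
# Time a command and print the elapsed whole seconds on stdout.
# The command's own output goes to stderr so stdout carries only the number.
# Returns the exit status of the timed command.
time_cmd() {
    start=$(date +%s)
    status=0
    "$@" >&2 || status=$?
    end=$(date +%s)
    echo "$((end - start))"
    return "$status"
}

# Usage (restore.sh and the 1800 s RTO are assumptions):
#   RTO_SECONDS=1800
#   achieved=$(time_cmd ./restore.sh)
#   echo "planned=${RTO_SECONDS}s achieved=${achieved}s gap=$((achieved - RTO_SECONDS))s"
```

A positive gap means the drill exceeded the planned RTO. Timing each restore step separately with the same helper shows where the extra time was spent.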