Production Local AI Deployment

20. Disaster Recovery

Chapter 20 of 24 · 20 min

KEY INSIGHT

Disaster recovery testing must include actual restoration procedures, not just backup verification; plans that have never been executed contain undocumented failure modes.

Disaster recovery planning addresses catastrophic failures: region outages, data corruption, or infrastructure loss. Two recovery objectives bound the acceptable impact: RTO (Recovery Time Objective) is the longest tolerable service interruption, and RPO (Recovery Point Objective) is the longest tolerable window of data loss.

### Backup Strategy

```bash
#!/bin/bash
# backup-models.sh: model artifact backup script
set -euo pipefail

SOURCE="s3://model-artifacts/production/"
S3_BUCKET="s3://inference-backups/models"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DEST="${S3_BUCKET}/point-in-time/${TIMESTAMP}/"

# Create point-in-time backup
aws s3 sync "${SOURCE}" "${DEST}" --storage-class GLACIER

# Verify the backup by comparing object counts; a dry-run download
# would only list objects and proves nothing about completeness
count_objects() {
  aws s3 ls "$1" --recursive --summarize | awk '/Total Objects/ {print $3}'
}
SRC_COUNT=$(count_objects "${SOURCE}")
DEST_COUNT=$(count_objects "${DEST}")
if [ "${SRC_COUNT}" != "${DEST_COUNT}" ]; then
  echo "Backup incomplete: ${DEST_COUNT} of ${SRC_COUNT} objects copied" >&2
  exit 1
fi

# Create backup manifest (an ETag equals the MD5 only for non-multipart uploads)
MODELS=$(aws s3api list-objects-v2 \
  --bucket model-artifacts --prefix production/ \
  --query 'Contents[].{key: Key, size: Size, etag: ETag}' --output json)

cat > /tmp/backup_manifest.json <<EOF
{
  "timestamp": "${TIMESTAMP}",
  "models": ${MODELS},
  "object_count": ${DEST_COUNT}
}
EOF

aws s3 cp /tmp/backup_manifest.json "${S3_BUCKET}/manifests/backup_manifest_${TIMESTAMP}.json"
```

### Database Backup

```bash
#!/bin/bash
# backup-db.sh: database backup for point-in-time recovery
set -euo pipefail

PGHOST="pg-primary.internal"
PGDATABASE="inference"
WAL_S3_PATH="s3://inference-backups/wal/"
BACKUP_DIR="/tmp/basebackup_$(date +%Y%m%d)"

# Configure continuous archiving (wal_level, max_wal_senders, and
# archive_mode take effect only after a server restart)
psql -h "$PGHOST" -U postgres <<EOF
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET max_wal_senders = 3;
ALTER SYSTEM SET wal_keep_size = 1024;
ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 'aws s3 cp %p ${WAL_S3_PATH}%f';
EOF

# Base backup: writes base.tar.gz and pg_wal.tar.gz into BACKUP_DIR
pg_basebackup \
  -h "$PGHOST" \
  -U postgres \
  -D "$BACKUP_DIR" \
  -Ft \
  -z \
  -P \
  -Xs

# Upload to the location the recovery runbook restores from
aws s3 sync "$BACKUP_DIR" s3://inference-backups/db/latest/
```

### Recovery Runbook

````markdown
# DR-001: Full System Recovery

## Prerequisites
- [ ] New infrastructure provisioned
- [ ] Network connectivity verified
- [ ] Access credentials validated

## Restore Order

### 1. Database (RPO: 5 minutes target)
```bash
mkdir -p /tmp/restoration && cd /tmp/restoration
aws s3 sync s3://inference-backups/db/latest/ ./db/
rm -rf /var/lib/postgresql/data/*
tar -xzf ./db/base.tar.gz -C /var/lib/postgresql/data/
tar -xzf ./db/pg_wal.tar.gz -C /var/lib/postgresql/data/pg_wal/
# Replay archived WAL so recovery reaches the latest archived segment
echo "restore_command = 'aws s3 cp s3://inference-backups/wal/%f %p'" \
  >> /var/lib/postgresql/data/postgresql.auto.conf
touch /var/lib/postgresql/data/recovery.signal
pg_ctl start -D /var/lib/postgresql/data/
```

### 2. Model Artifacts
```bash
# Objects stored as GLACIER must first be restored with
# `aws s3api restore-object` before they can be copied
aws s3 sync s3://inference-backups/models/latest/ \
  s3://model-artifacts/production/
```

### 3. Configuration State
```bash
kubectl apply -f ./configs/namespace.yaml
kubectl apply -f ./configs/secrets.yaml
kubectl apply -f ./configs/configmaps.yaml
```

### 4. Inference Services
```bash
kubectl apply -f ./inference/deployment.yaml
kubectl apply -f ./inference/service.yaml
kubectl rollout status deployment/inference-server
```

## Verification
- [ ] Health endpoints responding
- [ ] Basic inference test passes
- [ ] Prometheus metrics flowing
- [ ] Alert channels active
````
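To make the two recovery objectives (RTO and RPO) concrete, the sketch below computes the achieved data-loss window and downtime for a single incident and compares them with the targets. The function name `check_objectives`, its argument order, and the 30-minute RTO used in the usage note are illustrative assumptions; only the 5-minute RPO target comes from the runbook.

```bash
#!/bin/bash
# Hypothetical helper (not part of any tool): compare one incident's
# achieved recovery numbers against RPO/RTO targets. All timestamps are
# Unix epoch seconds; both targets are in seconds.
check_objectives() {
  local last_backup=$1 failure=$2 restored=$3 rpo=$4 rto=$5
  local data_loss=$(( failure - last_backup ))  # work lost since the last recoverable point
  local downtime=$(( restored - failure ))      # time the service was unavailable
  local status=0

  if (( data_loss > rpo )); then
    echo "RPO missed: lost ${data_loss}s of data (target ${rpo}s)"
    status=1
  else
    echo "RPO met: lost ${data_loss}s of data (target ${rpo}s)"
  fi

  if (( downtime > rto )); then
    echo "RTO missed: down for ${downtime}s (target ${rto}s)"
    status=1
  else
    echo "RTO met: down for ${downtime}s (target ${rto}s)"
  fi

  return "$status"
}
```

For example, `check_objectives 1700000000 1700000240 1700001500 300 1800` describes an incident where the last recoverable point was 240 seconds before the failure and service returned 1260 seconds after it, so both a 300-second RPO and an 1800-second RTO are met and the function returns 0.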


EXERCISE

Perform a full disaster recovery drill by backing up model artifacts to local storage, destroying the inference deployment, then restoring from backup. Measure the actual recovery time and document discrepancies between planned RTO and achieved recovery time.
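One way to capture the measurement this exercise asks for is to wrap the restore procedure in a timer. The following is a minimal sketch, not part of the chapter's tooling: `run_drill`, the log line format, and the `./restore.sh` script named in the usage note are all hypothetical.

```bash
#!/bin/bash
# Hypothetical drill timer: runs a restore command, measures wall-clock
# recovery time, and records the gap against the planned RTO.
# Usage: run_drill <planned_rto_seconds> <log_file> <command> [args...]
run_drill() {
  local planned_rto=$1 log_file=$2
  shift 2
  local start end elapsed delta rc=0

  start=$(date +%s)
  "$@" || rc=$?                      # run the restore procedure, keep its exit code
  end=$(date +%s)

  elapsed=$(( end - start ))
  delta=$(( elapsed - planned_rto )) # positive means the RTO was exceeded

  # Append one line per drill so discrepancies can be tracked over time
  printf '%s exit=%d achieved=%ds planned=%ds delta=%ds\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$rc" "$elapsed" "$planned_rto" "$delta" \
    >> "$log_file"

  if (( rc != 0 )); then
    echo "drill failed: restore command exited with ${rc}"
    return 1
  fi
  if (( delta > 0 )); then
    echo "RTO exceeded by ${delta}s"
    return 1
  fi
  echo "recovered in ${elapsed}s (RTO ${planned_rto}s)"
}
```

Calling `run_drill 1800 drill.log ./restore.sh` would time a restore script against a 30-minute RTO and append the result to `drill.log`. Logging every run, including failed ones, keeps a record of how the achieved recovery time drifts from the plan between drills.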