RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Production Local AI Deployment

20. Disaster Recovery

Chapter 20 of 24 · 20 min
KEY INSIGHT

Disaster recovery testing must include actual restoration procedures, not just backup verification; plans that have never been executed contain undocumented failure modes.

### Backup Strategy

```bash
#!/bin/bash
# backup-models.sh: point-in-time backup of production model artifacts
set -euo pipefail

S3_BUCKET="s3://inference-backups/models"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="${S3_BUCKET}/point-in-time/${TIMESTAMP}/"

# Create point-in-time backup
aws s3 sync \
  s3://model-artifacts/production/ \
  "${BACKUP_PATH}" \
  --storage-class GLACIER

# Sanity-check the copy by comparing object counts.
# This is not a restore test; only an actual restore proves the backup is usable.
count_objects() {
  aws s3 ls "$1" --recursive --summarize | awk '/Total Objects/ {print $3}'
}
SOURCE_COUNT=$(count_objects s3://model-artifacts/production/)
BACKUP_COUNT=$(count_objects "${BACKUP_PATH}")
[ "${SOURCE_COUNT}" = "${BACKUP_COUNT}" ] || { echo "Object count mismatch" >&2; exit 1; }

# Create backup manifest (an ETag equals the MD5 only for non-multipart uploads)
MODELS_JSON=$(aws s3api list-objects-v2 --bucket model-artifacts --prefix production/ \
  --query 'Contents[].{key: Key, size: Size, etag: ETag}' --output json)
cat > /tmp/backup_manifest.json <<EOF
{
  "timestamp": "${TIMESTAMP}",
  "models": ${MODELS_JSON},
  "object_count": ${BACKUP_COUNT}
}
EOF

# Timestamped name so earlier manifests are not overwritten
aws s3 cp /tmp/backup_manifest.json "${S3_BUCKET}/manifests/${TIMESTAMP}.json"
```

### Database Backup

```bash
#!/bin/bash
# backup-db.sh: continuous WAL archiving plus a base backup for point-in-time recovery
set -euo pipefail

PGHOST="pg-primary.internal"
PGDATABASE="inference"
WAL_S3_PATH="s3://inference-backups/wal/"
BACKUP_DIR="/tmp/basebackup_$(date +%Y%m%d)"

# Configure continuous archiving.
# wal_level, max_wal_senders, and archive_mode take effect only after a server restart.
psql -h "$PGHOST" -U postgres -d "$PGDATABASE" <<EOF
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET max_wal_senders = 3;
ALTER SYSTEM SET wal_keep_size = 1024;
ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 'aws s3 cp %p ${WAL_S3_PATH}%f';
EOF

# Base backup: tar format (-Ft), gzip (-z), progress (-P), streamed WAL (-Xs).
# Produces base.tar.gz and pg_wal.tar.gz in BACKUP_DIR.
pg_basebackup \
  -h "$PGHOST" \
  -U postgres \
  -D "$BACKUP_DIR" \
  -Ft \
  -z \
  -P \
  -Xs

# Upload to the location the recovery runbook restores from
aws s3 sync "$BACKUP_DIR" s3://inference-backups/db/latest/
```
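The backup manifest records S3 metadata, but the key insight calls for checking what actually comes back from a restore. A minimal sketch of that check, assuming GNU coreutils (`sha256sum`, `realpath`) and a local copy of the artifacts; the function names `write_checksums` and `verify_restore` and the manifest file name are hypothetical, not part of the chapter's scripts.

```bash
#!/bin/bash
# Sketch: record SHA-256 checksums for a directory of model files, then
# verify a restored copy against that record. Names are illustrative.

# write_checksums <source_dir> <manifest_file>
# Write the manifest outside <source_dir> so it does not list itself.
write_checksums() {
  local src="$1" manifest="$2"
  # Paths are stored relative to the source dir so the manifest can be
  # checked against a copy restored to any other location.
  (cd "$src" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum) > "$manifest"
}

# verify_restore <restored_dir> <manifest_file>
# Returns 0 only if every file listed in the manifest exists and matches.
verify_restore() {
  local restored="$1" manifest
  manifest="$(realpath "$2")"
  (cd "$restored" && sha256sum --check --quiet "$manifest")
}
```

Run `write_checksums` at backup time and `verify_restore` at the end of every drill; a non-zero exit means at least one restored file is missing or differs from what was backed up.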

Disaster recovery planning addresses catastrophic failures: region outages, data corruption, or infrastructure loss. Two recovery objectives bound the plan: RTO (Recovery Time Objective) is the longest acceptable service interruption, and RPO (Recovery Point Objective) is the longest acceptable window of lost data.
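An RPO can be monitored as a simple age check: if the newest recoverable backup or archived WAL segment is older than the objective, a failure right now would lose more data than planned. A minimal sketch of that comparison, assuming timestamps are passed as epoch seconds; `rpo_status` and `last_wal_epoch` are hypothetical names.

```bash
#!/bin/bash
# Sketch: compare the age of the newest backup against an RPO target.
# rpo_status and its arguments are illustrative assumptions.

# rpo_status <last_backup_epoch> <now_epoch> <rpo_seconds>
# Prints "OK <age>s" or "VIOLATED <age>s"; returns 0 or 1 accordingly.
rpo_status() {
  local last="$1" now="$2" rpo="$3"
  local age=$(( now - last ))
  if (( age <= rpo )); then
    echo "OK ${age}s"
  else
    echo "VIOLATED ${age}s"
    return 1
  fi
}

# Example with a 5-minute target (300 s), where last_wal_epoch is the
# timestamp of the newest archived WAL segment (hypothetical variable):
#   rpo_status "$last_wal_epoch" "$(date +%s)" 300
```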

### Recovery Runbook

DR-001: Full System Recovery

Prerequisites

  • New infrastructure provisioned
  • Network connectivity verified
  • Access credentials validated

Restore Order

1. Database (RPO: 5 minutes target)

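mkdir -p /tmp/restoration && cd /tmp/restoration
aws s3 sync s3://inference-backups/db/latest/ ./db/
rm -rf /var/lib/postgresql/data/*
# pg_basebackup -Ft -z -Xs writes base.tar.gz and pg_wal.tar.gz
tar -xzf ./db/base.tar.gz -C /var/lib/postgresql/data/
tar -xzf ./db/pg_wal.tar.gz -C /var/lib/postgresql/data/pg_wal/
# Replay archived WAL so recovery can approach the RPO target (PostgreSQL 12+)
echo "restore_command = 'aws s3 cp s3://inference-backups/wal/%f %p'" >> /var/lib/postgresql/data/postgresql.auto.conf
touch /var/lib/postgresql/data/recovery.signal
pg_ctl start -D /var/lib/postgresql/data/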

2. Model Artifacts

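# Objects stored in the GLACIER class must be restored first (aws s3api restore-object);
# retrieval can take hours, so count that time toward the RTO.
aws s3 sync s3://inference-backups/models/latest/ \
  s3://model-artifacts/production/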

3. Configuration State

kubectl apply -f ./configs/namespace.yaml
kubectl apply -f ./configs/secrets.yaml
kubectl apply -f ./configs/configmaps.yaml

4. Inference Services

kubectl apply -f ./inference/deployment.yaml
kubectl apply -f ./inference/service.yaml
kubectl rollout status deployment/inference-server
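The restore order implies that each step should be healthy before the next one starts, for example the database accepting connections before the inference services roll out. A minimal sketch of a generic wait-and-retry helper that could gate the steps; `retry` is a hypothetical helper, and the host and port in the usage comment are assumptions.

```bash
#!/bin/bash
# Sketch: retry a command a fixed number of times with a delay between
# attempts. The helper name and its arguments are illustrative.

# retry <attempts> <delay_seconds> <command...>
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # No need to sleep after the final attempt
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: wait up to about a minute for PostgreSQL before step 2
# (pg_isready ships with PostgreSQL; host and port are assumptions):
#   retry 30 2 pg_isready -h localhost -p 5432 || exit 1
```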

Verification

  • Health endpoints responding
  • Basic inference test passes
  • Prometheus metrics flowing
  • Alert channels active
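The verification checklist is easier to repeat across drills if each item is a command with a pass/fail result. A minimal sketch of a checklist runner; `run_check` and `summarize` are hypothetical helpers, and the host names in the usage comment are placeholders rather than endpoints named by the runbook.

```bash
#!/bin/bash
# Sketch: run named verification checks and summarize pass/fail counts.
# Helper names and example URLs are illustrative assumptions.

PASSED=0
FAILED=0

# run_check <label> <command...>
# Runs the command quietly and records whether it succeeded.
run_check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "[PASS] ${label}"
    PASSED=$(( PASSED + 1 ))
  else
    echo "[FAIL] ${label}"
    FAILED=$(( FAILED + 1 ))
  fi
}

# summarize: prints totals and returns non-zero if any check failed.
summarize() {
  echo "passed=${PASSED} failed=${FAILED}"
  (( FAILED == 0 ))
}

# Example wiring (hypothetical hosts):
#   run_check "Health endpoint" curl -fsS http://inference.internal/health
#   run_check "Prometheus healthy" curl -fsS http://prometheus.internal:9090/-/healthy
#   summarize || exit 1
```

Call `run_check` directly rather than inside `$(...)`, since a subshell would discard the counter updates.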

EXERCISE

Perform a full disaster recovery drill by backing up model artifacts to local storage, destroying the inference deployment, then restoring from backup. Measure the actual recovery time and document discrepancies between planned RTO and achieved recovery time.
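For the timing part of the drill, a small wrapper can record wall-clock recovery time and report the gap against the planned RTO. A minimal sketch, assuming the restore is driven by a single command or script; `rto_report`, `time_drill`, and `./restore.sh` are hypothetical names.

```bash
#!/bin/bash
# Sketch: time a recovery drill and compare it with the planned RTO.
# Function names and the restore script are illustrative assumptions.

# rto_report <planned_rto_seconds> <start_epoch> <end_epoch>
# Prints planned vs. actual time; returns 0 if the RTO was met.
rto_report() {
  local planned="$1" start="$2" end="$3"
  local actual=$(( end - start ))
  local delta=$(( actual - planned ))
  echo "planned=${planned}s actual=${actual}s delta=${delta}s"
  (( actual <= planned ))
}

# time_drill <planned_rto_seconds> <restore_command...>
# Runs the restore command, then reports against the planned RTO.
# Returns 2 if the restore command itself fails.
time_drill() {
  local planned="$1"
  shift
  local start end
  start=$(date +%s)
  "$@" || { echo "restore command failed" >&2; return 2; }
  end=$(date +%s)
  rto_report "$planned" "$start" "$end"
}

# Example: a 15-minute planned RTO against a hypothetical restore script:
#   time_drill 900 ./restore.sh
```

A positive `delta` is the discrepancy the exercise asks you to document; a negative one means the drill finished inside the planned RTO.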
