RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Enterprise-Scale RAG

16. Disaster Recovery

Chapter 16 of 24 · 15 min
KEY INSIGHT

RAG disaster recovery requires backing up both the vector embeddings and the Redis metadata index separately. Exporting vector embeddings to Parquet on S3 gives a portable, queryable backup, but a once-a-day export is older than the RPO window, so it complements the Redis snapshots rather than replacing them.

Disaster recovery planning for RAG systems addresses data loss, system failure, and extended outages. The recovery point objective (RPO) sets how much data loss is acceptable, measured as the maximum age of the most recent recoverable backup; 5 minutes is a typical target for production systems.
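
To make the RPO concrete: with periodic snapshots, the worst-case data loss is roughly the interval between snapshots plus the time a snapshot needs to be saved and uploaded off-host. The sketch below shows that arithmetic only; the function names and the sample numbers are illustrative assumptions, not measurements.

```python
from datetime import timedelta


def worst_case_data_loss(snapshot_interval: timedelta,
                         save_and_upload_time: timedelta) -> timedelta:
    """Worst case: failure strikes just before the next snapshot is durable.

    The newest durable backup then reflects state from one full interval
    plus one save-and-upload duration ago.
    """
    return snapshot_interval + save_and_upload_time


def meets_rpo(snapshot_interval: timedelta,
              save_and_upload_time: timedelta,
              rpo: timedelta) -> bool:
    """True if the schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(snapshot_interval, save_and_upload_time) <= rpo


if __name__ == "__main__":
    rpo = timedelta(minutes=5)
    # 4 min interval + 30 s upload = 4.5 min, within a 5 min RPO
    print(meets_rpo(timedelta(minutes=4), timedelta(seconds=30), rpo))  # True
    # 5 min interval + 30 s upload = 5.5 min, exceeds a 5 min RPO
    print(meets_rpo(timedelta(minutes=5), timedelta(seconds=30), rpo))  # False
```

The practical consequence is that the snapshot interval has to be shorter than the RPO by at least the save-and-upload time.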

The backup strategy combines Redis RDB snapshots with S3 object storage:

import boto3
from datetime import datetime
from redis import Redis

class RAGDisasterRecovery:
    def __init__(self, redis_primary: Redis, s3_bucket: str):
        self.redis = redis_primary
        self.s3 = boto3.client("s3")
        self.bucket = s3_bucket
        self.backup_prefix = "rag/backups"
    
    def take_consistent_snapshot(self) -> dict:
        """Create point-in-time snapshot for disaster recovery"""
        timestamp = datetime.utcnow().isoformat()
        
        # Trigger Redis BGSAVE for non-blocking snapshot
        self.redis.bgsave()
        
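        # Wait for save completion (monitor via INFO persistence).
        # rdb_changes_since_last_save may never reach 0 under live writes,
        # so poll the in-progress flag instead.
        import time
        while self.redis.info("persistence")["rdb_bgsave_in_progress"]:
            time.sleep(0.1)
        if self.redis.info("persistence")["rdb_last_bgsave_status"] != "ok":
            raise RuntimeError("BGSAVE failed; not uploading dump.rdb")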
        
        rdb_path = "/var/lib/redis/dump.rdb"
        s3_key = f"{self.backup_prefix}/{timestamp}/dump.rdb"
        
        self.s3.upload_file(rdb_path, self.bucket, s3_key)
        
        return {
            "timestamp": timestamp,
            "s3_key": s3_key,
            "status": "completed"
        }
    
    def restore_from_backup(self, s3_key: str, target_host: str):
        """Restore Redis from S3 backup to specified host"""
        import subprocess
        
        download_path = f"/tmp/restore_{datetime.utcnow().timestamp()}.rdb"
        self.s3.download_file(self.bucket, s3_key, download_path)
        
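        # Stop Redis, replace dump file, restart. These commands act on the
        # local machine, so run this on the host being restored (target_host
        # is not used here). check=True aborts if any step fails.
        # If AOF is enabled, Redis loads the AOF instead of dump.rdb on start.
        subprocess.run(["sudo", "systemctl", "stop", "redis"], check=True)
        subprocess.run(["sudo", "mv", download_path, "/var/lib/redis/dump.rdb"], check=True)
        subprocess.run(["sudo", "systemctl", "start", "redis"], check=True)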
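
A restore is only as good as the file it loads, so it can help to record a checksum when the snapshot is taken and compare it after download, before Redis is stopped. The following is a minimal sketch using only the standard library; `sha256_of_file` and `verify_download` are hypothetical helper names, and where the expected digest is stored (for example as S3 user-defined object metadata) is left as an assumption.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large RDB files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_sha256: str) -> None:
    """Raise ValueError if the downloaded file does not match the recorded digest."""
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

Calling `verify_download` between the S3 download and the `systemctl stop` step means a truncated or corrupted file is caught while the old instance is still running.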

Vector embeddings stored separately require their own backup process:

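    def backup_vector_index(self, index_name: str = "idx:embeddings") -> str:
        """Export vector index to Parquet for S3 archival"""
        import numpy as np
        import pandas as pd
        from redis.commands.search.query import Query
        
        # Assumes documents are stored as Redis hashes and the client was
        # created with decode_responses=False, so embeddings are raw bytes.
        records = []
        offset, page_size = 0, 1000
        while True:
            # FT.SEARCH returns only 10 results by default, so page explicitly.
            # no_content() fetches ids only; fields are read from the hash.
            query = Query("*").no_content().paging(offset, page_size)
            page = self.redis.ft(index_name).search(query)
            if not page.docs:
                break
            for doc in page.docs:
                chunk_id, embedding, metadata = self.redis.hmget(
                    doc.id, "chunk_id", "text_embedding", "metadata")
                records.append({
                    "chunk_id": chunk_id.decode() if chunk_id else doc.id,
                    "embedding": np.frombuffer(embedding, dtype=np.float32).tolist(),
                    "metadata": metadata.decode() if metadata else None
                })
            offset += page_size
        
        df = pd.DataFrame(records)
        buffer = df.to_parquet()  # returns bytes when no path is given
        
        s3_key = f"{self.backup_prefix}/embeddings/{datetime.utcnow().date()}.parquet"
        self.s3.put_object(Bucket=self.bucket, Key=s3_key, Body=buffer)
        
        return s3_key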
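
Restoring from the Parquet export means turning each stored embedding back into the raw float32 bytes that the vector field expects. The sketch below shows that round trip with the standard library only. It assumes little-endian float32 values, which matches what `np.frombuffer(..., dtype=np.float32)` reads on common x86 and ARM hosts; the helper names are hypothetical.

```python
import struct
from typing import List


def embedding_to_bytes(values: List[float]) -> bytes:
    """Pack floats as little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(values)}f", *values)


def bytes_to_embedding(blob: bytes) -> List[float]:
    """Unpack little-endian float32 bytes back into a list of floats."""
    if len(blob) % 4 != 0:
        # A length that is not a multiple of 4 signals truncation or a
        # client that decoded the binary field as text.
        raise ValueError(f"embedding blob has invalid length {len(blob)}")
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))
```

The length check is a cheap way to spot embeddings that were damaged during backup, for example when binary fields were read through a client configured to decode responses as strings.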

Failure Modes:

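  • Failed or stale RDB snapshots: BGSAVE forks the process and writes a point-in-time snapshot, so a completed file is internally consistent, but writes that arrive after the fork are not captured. Under heavy write traffic, copy-on-write can also raise memory use sharply and cause the save to fail. Solution: check rdb_last_bgsave_status before uploading and validate the file (for example with redis-check-rdb).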
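  • S3 consistency and replication lag: Amazon S3 has provided strong read-after-write consistency since December 2020, so a newly uploaded snapshot is retrievable immediately from the bucket it was written to. Cross-region replication is asynchronous, however, so a replica bucket may lag behind. Object versioning does not change consistency, but it protects backups against overwrites and accidental deletes.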
  • Restore timeline: TB-scale indexes take hours to restore. The recovery time objective (RTO) must account for full restoration time.
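
The restore timeline can be estimated before an incident rather than discovered during one. The following back-of-the-envelope sketch adds download time and load time for a given dataset size. The function name is hypothetical and the throughput figures are placeholder assumptions; real values should come from measurements taken during a DR drill.

```python
def estimate_restore_hours(dataset_gb: float,
                           download_mb_per_s: float,
                           load_mb_per_s: float) -> float:
    """Rough restore time: sequential S3 download followed by Redis load.

    Ignores index rebuild time and any parallelism, so treat the result
    as a lower bound to be checked against a real drill.
    """
    if download_mb_per_s <= 0 or load_mb_per_s <= 0:
        raise ValueError("throughput values must be positive")
    dataset_mb = dataset_gb * 1024
    download_seconds = dataset_mb / download_mb_per_s
    load_seconds = dataset_mb / load_mb_per_s
    return (download_seconds + load_seconds) / 3600


if __name__ == "__main__":
    # Placeholder numbers: 2 TB dataset, 500 MB/s download, 200 MB/s load
    hours = estimate_restore_hours(2048, 500, 200)
    print(f"{hours:.1f} hours")  # about 4.1 hours with these assumptions
```

If the estimate is already above the RTO, the options are to shrink what must be restored first (for example, hot partitions before cold ones) or to keep a warm replica instead of relying on a cold restore.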

Runbook documentation should include exact commands, IP addresses, and escalation contacts. Annual DR drills validate restoration procedures.
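
A drill is easier to judge when "restoration worked" has a measurable definition. One possible approach, sketched below, is to record the top-k result ids for a small set of golden queries before the drill and compare them with the results from the restored instance. The helper names, the dictionary layout, and the 0.9 threshold are illustrative assumptions; overlap is used instead of exact equality because approximate vector indexes may return slightly different neighbors after a rebuild.

```python
from typing import Dict, List


def topk_overlap(expected: List[str], actual: List[str]) -> float:
    """Fraction of expected result ids that also appear in the actual results."""
    expected_set = set(expected)
    if not expected_set:
        return 1.0
    return len(expected_set & set(actual)) / len(expected_set)


def validate_restore(golden: Dict[str, List[str]],
                     restored: Dict[str, List[str]],
                     min_overlap: float = 0.9) -> List[str]:
    """Return the names of golden queries whose overlap is below the threshold.

    A query that is missing from the restored results counts as zero overlap.
    """
    failures = []
    for name, expected_ids in golden.items():
        overlap = topk_overlap(expected_ids, restored.get(name, []))
        if overlap < min_overlap:
            failures.append(name)
    return failures
```

An empty return value means every golden query met the threshold; any listed query name is a concrete item to investigate and record in the drill report.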

EXERCISE

Create a restore procedure that downloads a backup from S3 and loads it into a local Redis instance. Verify that vector search returns expected results after restore.
