16. Disaster Recovery
Chapter 16 of 24 · 15 min
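Disaster recovery planning for RAG systems addresses data loss, system failure, and extended outages. The recovery point objective (RPO) is the maximum window of data loss the business can accept, measured in time; production systems often target around 5 minutes. The RPO bounds how often backups must run: a snapshot taken every 15 minutes cannot meet a 5-minute RPO on its own.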
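To make the RPO concrete, the following is a minimal sketch of a schedule check. It assumes a snapshot only counts once it is fully saved and uploaded, so the worst case is a failure just before the next snapshot becomes durable. The function names and the example durations are hypothetical and not part of the chapter's code.

```python
from datetime import timedelta


def worst_case_data_loss(snapshot_interval: timedelta,
                         snapshot_duration: timedelta) -> timedelta:
    """Upper bound on lost writes for a periodic snapshot schedule.

    A snapshot captures state at the moment it starts but is only usable
    once saved and uploaded. If the system fails just before the next
    snapshot becomes durable, everything written since the previous
    snapshot started is lost: one full interval plus one save/upload.
    """
    return snapshot_interval + snapshot_duration


def meets_rpo(snapshot_interval: timedelta,
              snapshot_duration: timedelta,
              rpo: timedelta) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(snapshot_interval, snapshot_duration) <= rpo


if __name__ == "__main__":
    rpo = timedelta(minutes=5)
    # Hypothetical numbers: snapshot every 4 minutes, 30 s to save and upload.
    print(meets_rpo(timedelta(minutes=4), timedelta(seconds=30), rpo))
```

A schedule whose interval equals the RPO fails this check as soon as the save and upload take any time at all, which is why the interval usually needs to be shorter than the RPO itself.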
The backup strategy combines Redis RDB snapshots with S3 object storage:
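import subprocess
import time
from datetime import datetime, timezone

import boto3
from redis import Redis


class RAGDisasterRecovery:
    def __init__(self, redis_primary: Redis, s3_bucket: str):
        self.redis = redis_primary
        self.s3 = boto3.client("s3")
        self.bucket = s3_bucket
        self.backup_prefix = "rag/backups"

    def take_consistent_snapshot(self, timeout_s: float = 600.0) -> dict:
        """Create point-in-time snapshot for disaster recovery"""
        timestamp = datetime.now(timezone.utc).isoformat()

        # Trigger Redis BGSAVE for non-blocking snapshot
        self.redis.bgsave()

        # Wait for save completion (monitor via INFO persistence).
        # rdb_bgsave_in_progress drops to 0 when the background save ends;
        # rdb_changes_since_last_save may never reach 0 while writes continue.
        deadline = time.monotonic() + timeout_s
        while self.redis.info("persistence")["rdb_bgsave_in_progress"]:
            if time.monotonic() > deadline:
                raise TimeoutError("BGSAVE did not finish in time")
            time.sleep(0.1)
        if self.redis.info("persistence")["rdb_last_bgsave_status"] != "ok":
            raise RuntimeError("BGSAVE failed; refusing to upload a stale dump")

        # Assumes this code runs on the Redis host with the default dump location
        rdb_path = "/var/lib/redis/dump.rdb"
        s3_key = f"{self.backup_prefix}/{timestamp}/dump.rdb"
        self.s3.upload_file(rdb_path, self.bucket, s3_key)
        return {
            "timestamp": timestamp,
            "s3_key": s3_key,
            "status": "completed"
        }

    def restore_from_backup(self, s3_key: str, target_host: str):
        """Restore Redis from S3 backup to specified host"""
        download_path = f"/tmp/restore_{time.time()}.rdb"
        self.s3.download_file(self.bucket, s3_key, download_path)

        # NOTE: the commands below run on the local machine and do not use
        # target_host, so execute this method on the target host itself.
        # If AOF is enabled, Redis loads the AOF on startup instead of dump.rdb.
        # The Redis service user must be able to read the replaced file.
        # Stop Redis, replace dump file, restart (check=True aborts on failure)
        subprocess.run(["sudo", "systemctl", "stop", "redis"], check=True)
        subprocess.run(["sudo", "mv", download_path, "/var/lib/redis/dump.rdb"], check=True)
        subprocess.run(["sudo", "systemctl", "start", "redis"], check=True)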
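Before a restore can start, someone has to choose which snapshot to restore. The sketch below picks the most recent snapshot from a list of S3 keys that follow the `rag/backups/<ISO timestamp>/dump.rdb` layout used above. The helper names are hypothetical; in practice the key list could come from the `Contents` entries returned by boto3's `list_objects_v2`. Timestamps without an offset are assumed to be UTC.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def snapshot_time(key: str, prefix: str = "rag/backups") -> Optional[datetime]:
    """Parse the timestamp from '<prefix>/<iso timestamp>/dump.rdb', else None."""
    if not key.startswith(prefix + "/"):
        return None
    parts = key[len(prefix) + 1:].split("/")
    if len(parts) != 2 or parts[1] != "dump.rdb":
        return None  # e.g. the embeddings Parquet exports under the same prefix
    try:
        ts = datetime.fromisoformat(parts[0])
    except ValueError:
        return None
    # Assumption: timestamps written without an offset are UTC.
    return ts if ts.tzinfo is not None else ts.replace(tzinfo=timezone.utc)


def latest_snapshot_key(keys: Iterable[str],
                        prefix: str = "rag/backups") -> Optional[str]:
    """Return the key of the newest RDB snapshot, or None if there is none."""
    dated = [(snapshot_time(k, prefix), k) for k in keys]
    dated = [(ts, k) for ts, k in dated if ts is not None]
    return max(dated)[1] if dated else None


if __name__ == "__main__":
    keys = [
        "rag/backups/2024-05-01T12:00:00/dump.rdb",
        "rag/backups/2024-05-01T12:00:00.500000/dump.rdb",
        "rag/backups/embeddings/2024-05-02.parquet",
    ]
    print(latest_snapshot_key(keys))
```

Parsing the timestamp instead of sorting the key strings avoids a subtle ordering problem: `isoformat()` omits the fractional part when microseconds are zero, so keys written in the same second do not always sort chronologically as text.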
Vector embeddings stored separately require their own backup process:
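from datetime import datetime, timezone

import numpy as np
import pandas as pd
from redis.commands.search.query import Query

# Method of RAGDisasterRecovery; place it inside the class body.
# Assumes chunks are stored as Redis hashes and that the client was created
# with decode_responses=False, so embedding bytes come back unmodified.
def backup_vector_index(self, index_name: str = "idx:embeddings",
                        page_size: int = 1000) -> str:
    """Export vector index to Parquet for S3 archival"""
    records = []
    offset = 0
    while True:
        # search() returns only 10 results by default, so page explicitly.
        # no_content() returns document ids only; fields are read from the hash.
        query = Query("*").no_content().paging(offset, page_size)
        page = self.redis.ft(index_name).search(query)
        if not page.docs:
            break
        for doc in page.docs:
            chunk_id, embedding, metadata = self.redis.hmget(
                doc.id, ["chunk_id", "text_embedding", "metadata"])
            records.append({
                "chunk_id": chunk_id.decode(),
                "embedding": np.frombuffer(embedding, dtype=np.float32).tolist(),
                "metadata": metadata.decode() if metadata is not None else None
            })
        offset += page_size
    df = pd.DataFrame(records)
    buffer = df.to_parquet()  # returns bytes when no path is given
    s3_key = f"{self.backup_prefix}/embeddings/{datetime.now(timezone.utc).date()}.parquet"
    self.s3.put_object(Bucket=self.bucket, Key=s3_key, Body=buffer)
    return s3_key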
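Restoring from the Parquet export needs the inverse conversion: each embedding arrives as a list of floats and has to be turned back into raw bytes before it is written to Redis. The sketch below uses only the standard library and assumes a FLOAT32 vector field stored little-endian, which matches what `np.frombuffer(..., dtype=np.float32)` reads on common x86 and ARM hosts. The helper names are hypothetical.

```python
import struct
from typing import List, Sequence


def embedding_to_bytes(values: Sequence[float]) -> bytes:
    """Pack floats as little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(values)}f", *values)


def bytes_to_embedding(raw: bytes) -> List[float]:
    """Unpack little-endian float32 bytes back into a list of floats."""
    if len(raw) % 4 != 0:
        raise ValueError("embedding byte length must be a multiple of 4")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


if __name__ == "__main__":
    # Values chosen to be exactly representable in float32.
    original = [0.5, -1.25, 2.0]
    raw = embedding_to_bytes(original)
    print(len(raw), bytes_to_embedding(raw) == original)
```

Checking the byte length against the index's expected dimension (4 bytes per dimension for FLOAT32) is a cheap way to catch truncated or mis-typed embeddings before they are loaded.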
Failure Modes:
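- Stale or failed RDB snapshots: BGSAVE forks the server and writes a point-in-time copy, so a completed snapshot is internally consistent even under heavy write traffic. The real risks are uploading dump.rdb before the save has finished, or after a save that failed (for example, when memory pressure prevents the fork). Solution: wait until rdb_bgsave_in_progress is 0 and confirm rdb_last_bgsave_status is ok before uploading.
- S3 consistency and replication lag: Amazon S3 has provided strong read-after-write consistency since December 2020, so a newly uploaded snapshot is immediately retrievable from the bucket it was written to. Copies made by cross-region replication are asynchronous and may lag behind. Object versioning does not change consistency, but it protects existing backups from being overwritten or deleted.
- Restore timeline: TB-scale indexes take hours to restore. Recovery time objective (RTO) must account for full restoration time, including download, load into Redis, and verification.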
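The restore-timeline point can be turned into a rough estimate. The sketch below adds download time, load time, and a fixed overhead, then compares the total with the RTO. All throughput figures are hypothetical placeholders; real values should come from a measured restore drill.

```python
def estimate_restore_seconds(snapshot_bytes: float,
                             download_bytes_per_s: float,
                             load_bytes_per_s: float,
                             fixed_overhead_s: float = 0.0) -> float:
    """Rough restore time: download from S3, load into Redis, plus overhead.

    The steps are treated as sequential, which is pessimistic if the
    download and load can overlap.
    """
    if download_bytes_per_s <= 0 or load_bytes_per_s <= 0:
        raise ValueError("throughput values must be positive")
    download_s = snapshot_bytes / download_bytes_per_s
    load_s = snapshot_bytes / load_bytes_per_s
    return download_s + load_s + fixed_overhead_s


def within_rto(estimated_s: float, rto_s: float) -> bool:
    """True if the estimated restore time fits inside the RTO."""
    return estimated_s <= rto_s


if __name__ == "__main__":
    # Hypothetical: 1 TB snapshot, 200 MB/s download, 100 MB/s load.
    seconds = estimate_restore_seconds(1e12, 2e8, 1e8)
    print(f"{seconds / 3600:.1f} hours, within 4h RTO: {within_rto(seconds, 4 * 3600)}")
```

With these placeholder numbers, a 1 TB snapshot would not fit a 4-hour RTO. That is the kind of gap a DR drill is meant to expose before an actual outage.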
Runbook documentation should include exact commands, IP addresses, and escalation contacts. Annual DR drills validate restoration procedures.
EXERCISE
Create a restore procedure that downloads a backup from S3 and loads it into a local Redis instance. Verify that vector search returns expected results after restore.