16. Deployment Runbook

Chapter 16 of 18 · 25 min

A deployment runbook provides step-by-step instructions for operators. It covers normal operations, common failures, and escalation paths. Runbooks should be detailed enough that someone unfamiliar with the system can complete the deployment.

Pre-Deployment Checklist

  • Database migrations reviewed and tested
  • New environment variables documented
  • Rollback plan documented
  • Monitoring dashboards ready
  • On-call engineer notified

Standard Deployment

1. Prepare the deployment server

# SSH to server (substitute your deployment user and host)
ssh <user>@<deployment-host>

# Navigate to application directory
cd /app/ai-doc-qa

# Pull latest changes
git pull origin main

# Review configuration changes
diff .env.production .env.staging
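
The diff above prints every differing value, secrets included, to the terminal. A key-only comparison is often enough to spot a variable that one environment defines and the other lacks. The following is a minimal sketch, assuming simple `KEY=value` dotenv files; the function names `env_keys` and `missing_env_keys` are illustrative and not part of the project.

```shell
# env_keys FILE: print the sorted variable names defined in a dotenv-style file.
# Comments and blank lines are skipped; values are never printed.
env_keys() {
    grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u
}

# missing_env_keys REFERENCE TARGET: print names defined in REFERENCE
# but absent from TARGET.
missing_env_keys() {
    mek_ref=$(mktemp)
    mek_tgt=$(mktemp)
    env_keys "$1" > "$mek_ref"
    env_keys "$2" > "$mek_tgt"
    # comm -23 keeps only lines unique to the first (sorted) input
    comm -23 "$mek_ref" "$mek_tgt"
    rm -f "$mek_ref" "$mek_tgt"
}
```

For example, `missing_env_keys .env.staging .env.production` would list variables that staging defines but production does not. Lines written as `export KEY=value` are not matched by this sketch.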

2. Run database migrations

# Start database if stopped
docker compose up -d postgres redis

# Run pending migrations
docker compose exec backend alembic upgrade head

# Verify migration status
docker compose exec backend alembic current
docker compose exec backend alembic history
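
`alembic current` marks the active revision with `(head)` when no migrations are pending, so the verification step can be scripted instead of read by eye. This is a minimal sketch, assuming a single migration head; `at_head` is an illustrative helper name.

```shell
# at_head: read `alembic current` output on stdin and succeed only if
# a revision line is marked "(head)".
at_head() {
    grep -q '(head)'
}

# Example (not run here): -T disables TTY allocation so the output can be piped.
# docker compose exec -T backend alembic current | at_head \
#     || echo "migrations are not at head" >&2
```

If the project uses several migration branches, each branch has its own head and this check would need to count them rather than look for one.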

3. Deploy services

# Pull new images
docker compose -f docker-compose.production.yml pull

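# Recreate containers whose image or configuration changed
# (a plain `up -d` restarts each changed service, so expect a brief interruption)
docker compose -f docker-compose.production.yml up -d

# Check container status and health
docker compose ps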

# Verify backend health
curl -f http://localhost:8000/health
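
`docker compose ps` reports the current state once rather than waiting, and a single `curl` can fail simply because the backend is still starting. A polling loop makes the health check repeatable. The sketch below assumes the `/health` endpoint shown above; `wait_for_health` and its default attempt count and delay are illustrative choices.

```shell
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Poll URL until curl succeeds or the attempts run out.
wait_for_health() {
    wfh_url=$1
    wfh_attempts=${2:-30}
    wfh_delay=${3:-2}
    wfh_i=1
    while [ "$wfh_i" -le "$wfh_attempts" ]; do
        # -f: fail on HTTP errors, -s: quiet, --max-time: bound each probe
        if curl -fs --max-time 5 -o /dev/null "$wfh_url"; then
            echo "healthy after $wfh_i attempt(s)"
            return 0
        fi
        wfh_i=$((wfh_i + 1))
        sleep "$wfh_delay"
    done
    echo "not healthy after $wfh_attempts attempt(s): $wfh_url" >&2
    return 1
}
```

A typical call would be `wait_for_health http://localhost:8000/health 30 2`. Recent Docker Compose releases also accept `up -d --wait`, which blocks until services are running or healthy, provided the compose file defines health checks.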

4. Verify deployment

# Check application logs
docker compose logs --tail=100 backend

# Run smoke tests
curl -f http://localhost/api/v1/documents
curl -f http://localhost/health

# Check monitoring (run from your workstation, not the server;
# `open` is macOS-only, use xdg-open on Linux)
open http://monitoring.example.com/dashboard
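
When the list of smoke-test URLs grows, running them one by one makes it easy to miss a failure. The following sketch runs every URL, reports each result, and fails if any check fails. It uses the two endpoints from the step above only as examples; `smoke_test` is an illustrative name.

```shell
# smoke_test URL...: request every URL, print PASS or FAIL per URL,
# and return non-zero if at least one request failed.
smoke_test() {
    st_failed=0
    for st_url in "$@"; do
        if curl -fs --max-time 10 -o /dev/null "$st_url"; then
            echo "PASS $st_url"
        else
            echo "FAIL $st_url"
            st_failed=$((st_failed + 1))
        fi
    done
    [ "$st_failed" -eq 0 ]
}
```

For example: `smoke_test http://localhost/api/v1/documents http://localhost/health`. Unlike a sequence of bare `curl -f` commands, every URL is checked even after the first failure.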

Rollback Procedure

If deployment fails or issues are discovered:

# Identify the previous working image tag
git log --oneline -5
docker images | grep ai-doc-qa

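# Rollback to previous version
# `docker compose pull` takes a service name, not an image:tag, so the tag must
# come from the compose file. This assumes the backend image tag is read from
# a variable (here a hypothetical BACKEND_TAG).
BACKEND_TAG=previous-tag docker compose -f docker-compose.production.yml pull backend
BACKEND_TAG=previous-tag docker compose -f docker-compose.production.yml up -d backend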

# Verify rollback
curl -f http://localhost/health
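
Deploy, health check, and rollback can be chained so that a failed health check triggers the rollback automatically. The sketch below is one possible arrangement, not the project's own tooling. It assumes the compose file reads the backend image tag from an environment variable, here given the hypothetical name `BACKEND_TAG`; the function names are illustrative as well.

```shell
# deploy_tag TAG: pull and start the backend service at TAG.
# Runs in a subshell so the exported variable does not leak to the caller.
# BACKEND_TAG is a hypothetical variable the compose file is assumed to use.
deploy_tag() (
    BACKEND_TAG=$1
    export BACKEND_TAG
    docker compose -f docker-compose.production.yml pull backend &&
    docker compose -f docker-compose.production.yml up -d backend
)

# deploy_or_rollback NEW_TAG PREVIOUS_TAG HEALTH_URL
# Returns 0 if NEW_TAG is healthy, 1 if rolled back, 2 if the rollback failed too.
deploy_or_rollback() {
    if deploy_tag "$1" &&
        curl -fs --retry 5 --retry-delay 2 --retry-connrefused -o /dev/null "$3"; then
        echo "deployed $1"
        return 0
    fi
    echo "deploy of $1 failed, rolling back to $2" >&2
    if deploy_tag "$2"; then
        return 1
    fi
    echo "rollback to $2 also failed" >&2
    return 2
}
```

This only swaps the image. Schema changes applied by `alembic upgrade head` stay in place; `alembic downgrade -1` steps back one revision, but only if the migration defines a working downgrade.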

Common Issues

Container restart loop

Symptom: Containers restart repeatedly

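# Check container logs for kill, signal, or exit messages
# (an OOM kill is recorded in the container state, not necessarily in the logs)
docker compose logs backend | grep -iE "killed|signal|exit"

# Check resource usage against limits (one-off snapshot)
docker stats --no-stream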

# Increase memory if needed
# Edit docker-compose.yml and redeploy
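
Docker records whether the kernel's out-of-memory killer stopped a container in the container state, which `docker inspect` can read directly. The following sketch wraps that lookup for a Compose service; `oom_killed` is an illustrative helper name.

```shell
# oom_killed SERVICE: succeed if the service's container was stopped by the
# OOM killer. Returns 2 if no container exists for the service.
oom_killed() {
    # -a includes stopped containers, -q prints only container IDs
    ok_id=$(docker compose ps -a -q "$1" | head -n 1)
    [ -n "$ok_id" ] || { echo "no container for $1" >&2; return 2; }
    [ "$(docker inspect --format '{{.State.OOMKilled}}' "$ok_id")" = "true" ]
}

# Example (not run here):
# oom_killed backend && echo "backend ran out of memory; raise its limit"
```

The same `docker inspect --format` template can also print `{{.RestartCount}}` and `{{.State.ExitCode}}` when the restarts have another cause.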

Database connection failures

Symptom: Backend logs show "connection refused" to postgres

# Verify postgres is running
docker compose ps postgres

# Check postgres logs
docker compose logs postgres

# Restart postgres if unhealthy
docker compose restart postgres

# Verify connection from backend
docker compose exec backend python -c "from app.db import engine; engine.connect()"
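
PostgreSQL ships `pg_isready`, whose exit code distinguishes a server that is still starting from one that is not responding at all. The sketch below runs it inside the `postgres` service and translates the code. It assumes the service uses an image that includes `pg_isready` (the official postgres image does); the function names are illustrative.

```shell
# describe_pg_status CODE: translate a pg_isready exit code into a message.
describe_pg_status() {
    case "$1" in
        0) echo "accepting connections" ;;
        1) echo "rejecting connections (for example, still starting up)" ;;
        2) echo "no response" ;;
        3) echo "no attempt made (for example, invalid parameters)" ;;
        *) echo "unexpected exit code $1" ;;
    esac
}

# check_postgres: run pg_isready in the postgres service and report the result.
# -T disables TTY allocation so this also works from scripts.
check_postgres() {
    cp_rc=0
    docker compose exec -T postgres pg_isready > /dev/null 2>&1 || cp_rc=$?
    echo "postgres: $(describe_pg_status "$cp_rc")"
    return "$cp_rc"
}
```

Run this only after `docker compose ps postgres` shows the container is up: if the container is stopped, `docker compose exec` itself fails and its exit code would be misread as a `pg_isready` result.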

Model server timeout

Symptom: Requests hang and eventually time out

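# Check model server memory usage (one-off snapshot; Compose prefixes
# container names, so look up the container ID by service name)
docker stats --no-stream "$(docker compose ps -q model_server)"

# View model server logs
docker compose logs --tail=200 model_server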

# If OOM, increase memory limit or reduce context size
# Edit docker-compose.yml CONTEXT_SIZE parameter
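
Memory pressure can be detected before the model server is killed by comparing its usage against a threshold. This is a minimal sketch built on `docker stats --format`; the helper names and the 85 percent figure in the example are illustrative, not values taken from the project.

```shell
# mem_percent SERVICE: print the service container's memory use as a bare
# number, for example 45.32.
mem_percent() {
    mp_id=$(docker compose ps -q "$1" | head -n 1)
    [ -n "$mp_id" ] || return 1
    docker stats --no-stream --format '{{.MemPerc}}' "$mp_id" | tr -d '%'
}

# mem_above SERVICE THRESHOLD: succeed if memory use exceeds THRESHOLD percent.
# Fails when no reading is available.
mem_above() {
    mem_percent "$1" |
        awk -v limit="$2" 'NR == 1 { found = ($1 + 0 > limit + 0) } END { exit !found }'
}

# Example (not run here):
# mem_above model_server 85 && echo "model_server is close to its memory limit"
```

The percentage is relative to the container's memory limit when one is set, and to host memory otherwise.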

Escalation Contacts

Issue                  Contact          Response Time
Database corruption    DBA Team         15 minutes
Infrastructure down    Platform Team    5 minutes
Security incident      Security Team    Immediate

EXERCISE

Perform a deployment to a staging environment using this runbook. Time the process and identify bottlenecks.
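
One way to time the process is to wrap each runbook command in a helper that records its duration, then sort the results. The sketch below is an assumption about how you might do this; `timed`, `slowest_steps`, and the `deploy-timings.log` file name are illustrative.

```shell
# File that collects one line per step: "<seconds><TAB><label><TAB>exit=<code>"
TIMING_LOG=${TIMING_LOG:-deploy-timings.log}

# timed LABEL COMMAND [ARGS...]: run the command, append its duration and
# exit code to $TIMING_LOG, and return the command's own exit code.
timed() {
    t_label=$1
    shift
    t_start=$(date +%s)
    t_rc=0
    "$@" || t_rc=$?
    t_end=$(date +%s)
    printf '%s\t%s\texit=%s\n' "$((t_end - t_start))" "$t_label" "$t_rc" >> "$TIMING_LOG"
    return "$t_rc"
}

# slowest_steps: list recorded steps, longest first.
slowest_steps() {
    sort -rn "$TIMING_LOG"
}
```

For example, `timed migrations docker compose exec backend alembic upgrade head` records how long the migration step took, and `slowest_steps` afterwards shows where the time went. Resolution is one second, which is usually enough for finding deployment bottlenecks.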