16. Deployment Runbook
Chapter 16 of 18 · 25 min
A deployment runbook provides step-by-step instructions for operators. It covers normal operations, common failures, and escalation paths. Runbooks should be detailed enough that someone unfamiliar with the system can complete the deployment.
Pre-Deployment Checklist
- Database migrations reviewed and tested
- New environment variables documented
- Rollback plan documented
- Monitoring dashboards ready
- On-call engineer notified
Standard Deployment
1. Prepare the deployment server
# SSH to the deployment server (substitute your own user and host)
ssh <user>@<deploy-host>
# Navigate to application directory
cd /app/ai-doc-qa
# Pull latest changes
git pull origin main
# Review configuration changes
diff .env.production .env.staging
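A plain `diff` of env files prints every differing value, secrets included. As a complementary check for the "new environment variables documented" item, the sketch below compares variable names only. The function names `env_keys` and `missing_env_keys` are hypothetical helpers, and the sketch assumes simple `KEY=value` lines (it ignores `export KEY=value` and multi-line values).

```shell
# Print the variable names defined in an env file, one per line, sorted.
env_keys() {
  # Keep KEY from lines shaped like KEY=value; comments and blank lines do not match.
  sed -n 's/^[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$1" | sort -u
}

# Print names defined in the first file but absent from the second.
# Values are never printed.
missing_env_keys() {
  a=$(mktemp)
  b=$(mktemp)
  env_keys "$1" > "$a"
  env_keys "$2" > "$b"
  # comm -23 keeps lines that appear only in the first sorted input
  comm -23 "$a" "$b"
  rm -f "$a" "$b"
}
```

Running `missing_env_keys .env.staging .env.production` would list variables that staging defines and production does not yet have; empty output means the key sets match in that direction.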
2. Run database migrations
# Start database if stopped
docker compose up -d postgres redis
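# Run pending migrations (exec uses the backend container that is already running, so it must already contain the new migration files)
docker compose exec backend alembic upgrade head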
# Verify migration status
docker compose exec backend alembic current
docker compose exec backend alembic history
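`alembic upgrade head` refuses to run when the migration history has more than one head, so it can help to check for that before upgrading. The sketch below reads the output of `alembic heads` (one line per head revision) on standard input. `check_single_head` is a hypothetical helper name, not part of Alembic.

```shell
# Read `alembic heads` output on stdin and succeed only if exactly one head exists.
check_single_head() {
  # Count non-empty lines; each one corresponds to a head revision
  heads=$(awk 'NF { n++ } END { print n + 0 }')
  if [ "$heads" -eq 1 ]; then
    echo "ok: 1 head"
  else
    echo "stop: expected 1 head, found $heads" >&2
    return 1
  fi
}
```

A possible invocation is `docker compose exec -T backend alembic heads | check_single_head`, where `-T` disables pseudo-TTY allocation so the output can be piped.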
3. Deploy services
# Pull new images
docker compose -f docker-compose.production.yml pull
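# Recreate services from the new images (a plain `up -d` restarts changed containers, so expect a brief interruption unless replicas or a proxy mask it)
docker compose -f docker-compose.production.yml up -d
# Check container status; repeat until services report running or healthy
docker compose ps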
# Verify backend health
curl -f http://localhost:8000/health
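A single `curl` right after `up -d` can fail simply because the backend is still starting. One way to wait for it is a small retry loop, sketched below. `retry` is a hypothetical helper, and the attempt count and delay in the usage note are illustrative values rather than recommendations from this runbook.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have been used up.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $n/$attempts failed: $*" >&2
    # Sleep between attempts, but not after the last one
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}
```

For example, `retry 30 2 curl -fsS http://localhost:8000/health` would poll the health endpoint for up to roughly a minute and return non-zero if it never responds successfully, which makes it usable as a gate in a deploy script.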
4. Verify deployment
# Check application logs
docker compose logs --tail=100 backend
# Run smoke tests
curl -f http://localhost/api/v1/documents
curl -f http://localhost/health
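# Check monitoring: load the dashboard in a browser on your workstation (`open` is the macOS launcher; use `xdg-open` on Linux desktops)
open http://monitoring.example.com/dashboard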
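When the list of smoke-test endpoints grows, running each `curl` by hand makes it easy to miss a failure. The sketch below loops over a list of URLs and reports a single pass/fail status. `smoke_test` and the `SMOKE_CHECK` variable are hypothetical names, and the 10-second `--max-time` is an assumed value.

```shell
# Command used to probe one URL; override SMOKE_CHECK to use a different client.
SMOKE_CHECK=${SMOKE_CHECK:-curl -fsS -o /dev/null --max-time 10}

# smoke_test URL [URL...]
# Prints PASS or FAIL per URL and returns non-zero if any URL failed.
smoke_test() {
  failed=0
  for url in "$@"; do
    # $SMOKE_CHECK is left unquoted on purpose so its flags split into words
    if $SMOKE_CHECK "$url"; then
      echo "PASS $url"
    else
      echo "FAIL $url"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

With the endpoints from this step, the call would be `smoke_test http://localhost/api/v1/documents http://localhost/health`.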
Rollback Procedure
If deployment fails or issues are discovered:
# Identify the previous working image tag
git log --oneline -5
docker images | grep ai-doc-qa
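# Roll back to the previous version.
# `docker compose pull` takes service names, not image:tag references, so select the tag through the compose file.
# Assumption: the backend image is declared with a variable, e.g. image: <image>:${BACKEND_TAG} (BACKEND_TAG is a hypothetical name)
BACKEND_TAG=previous-tag docker compose -f docker-compose.production.yml pull backend
BACKEND_TAG=previous-tag docker compose -f docker-compose.production.yml up -d backend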
# Verify rollback
curl -f http://localhost/health
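Finding the previous working tag from `git log` and `docker images` under pressure is error-prone. One option, sketched here as an assumption rather than something this runbook already does, is to append each deployed tag to a history file and read the rollback target from it. The file name `deploy-history.log` and the functions `record_deploy` and `previous_tag` are hypothetical.

```shell
# Hypothetical history file: one deployed image tag per line, newest last.
DEPLOY_HISTORY=${DEPLOY_HISTORY:-deploy-history.log}

# record_deploy TAG: append the tag that was just deployed.
record_deploy() {
  echo "$1" >> "$DEPLOY_HISTORY"
}

# previous_tag: print the most recent tag that differs from the current one.
# Prints nothing if the history holds fewer than two distinct tags.
previous_tag() {
  current=$(tail -n 1 "$DEPLOY_HISTORY")
  # -v inverts the match, -x matches whole lines, -F treats the tag as a literal string
  grep -vxF -- "$current" "$DEPLOY_HISTORY" | tail -n 1
}
```

Calling `record_deploy` at the end of every successful deployment keeps the file current, and `previous_tag` then gives the tag to pin when rolling back. An empty result should be treated as "no known rollback target".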
Common Issues
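Container restart loop
Symptom: Containers restart repeatedly
# Check container logs for crash output and exit signals
docker compose logs backend | grep -E "Killed|Signal|Exit"
# An out-of-memory kill is recorded in the container state rather than in its logs
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' $(docker compose ps -aq backend)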
# Check resource limits
docker stats
# Increase memory if needed
# Edit docker-compose.yml and redeploy
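The live `docker stats` table is hard to scan when many containers are running. The sketch below filters a one-shot snapshot down to the containers at or above a memory threshold. `high_memory_containers` is a hypothetical helper, and the threshold of 80 in the usage note is an assumed value.

```shell
# high_memory_containers THRESHOLD_PERCENT
# Reads "NAME PERCENT%" lines on stdin and prints the names at or above the threshold.
high_memory_containers() {
  awk -v limit="$1" '{
    pct = $2
    sub(/%$/, "", pct)              # strip the trailing percent sign
    if (pct + 0 >= limit + 0) print $1
  }'
}
```

A snapshot in the expected shape can be produced with `docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | high_memory_containers 80`. The percentage is relative to each container's memory limit, or to host memory when no limit is set.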
Database connection failures
Symptom: Backend logs show "connection refused" to postgres
# Verify postgres is running
docker compose ps postgres
# Check postgres logs
docker compose logs postgres
# Restart postgres if unhealthy
docker compose restart postgres
# Verify connection from backend
docker compose exec backend python -c "from app.db import engine; engine.connect()"
Model server timeout
Symptom: Requests hang and eventually time out
# Check model server memory usage
docker stats model_server
# View model server logs
docker compose logs model_server --tail=200
# If OOM, increase memory limit or reduce context size
# Edit docker-compose.yml CONTEXT_SIZE parameter
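To tell a genuine timeout apart from a refused connection or an HTTP error, it can help to probe with a bounded wait and inspect curl's exit code. The sketch below maps the common codes to short labels. `classify_curl_exit` is a hypothetical helper, and the 30-second limit in the usage note is an assumed value.

```shell
# classify_curl_exit CODE: translate a curl exit code into a short diagnosis.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "refused: could not connect to host" ;;
    22) echo "http-error: server returned status 400 or above (with -f)" ;;
    28) echo "timeout: operation timed out" ;;
    *)  echo "other: curl exit code $1" ;;
  esac
}
```

For example: `curl -fsS --max-time 30 -o /dev/null http://localhost:8000/health; classify_curl_exit $?`. A `timeout` result points toward the slow model server path, while `refused` suggests the service is not listening at all.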
Escalation Contacts
| Issue | Contact | Response Time |
|---|---|---|
| Database corruption | DBA Team | 15 minutes |
| Infrastructure down | Platform Team | 5 minutes |
| Security incident | Security Team | Immediate |
EXERCISE
Perform a deployment to a staging environment using this runbook. Time the process and identify bottlenecks.