04. Docker Compose for AI Stack
Docker Compose orchestrates multi-container applications for local development and testing environments. Production deployments typically graduate to Kubernetes, but Compose remains valuable for development iteration, integration testing, and staging environments that mirror production topology.
The compose specification defines services, networks, volumes, and configurations as versioned YAML documents. Each service maps to a running container with specific image or build instructions, environment variables, port mappings, and dependencies.
AI inference stacks typically include multiple services with distinct responsibilities. The model server handles inference requests. The API gateway provides authentication, rate limiting, and request routing. Redis or Memcached provides caching for repeated queries. A vector database serves similarity search workloads.
Service dependencies introduce startup ordering. On its own, the depends_on directive only controls the order in which containers are started; it does not wait for a service to be ready to accept requests. Pairing a health check with condition: service_healthy closes that gap for cases where service readiness matters more than process existence.
version: "3.9"

services:
  # Model inference server
  model-server:
    build:
      context: ./model_server
      dockerfile: Dockerfile
    image: inference/model-server:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      MODEL_PATH: /models/transformer
      MAX_BATCH_SIZE: 32
      DEVICE: cuda
    volumes:
      - model_cache:/models
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - ai-inference

  # API Gateway service
  api-gateway:
    build:
      context: ./api_gateway
      dockerfile: Dockerfile
    image: inference/api-gateway:latest
    ports:
      - "8000:8000"
    environment:
      MODEL_SERVICE_URL: http://model-server:8080
      REDIS_URL: redis://cache:6379
      LOAD_BALANCER_STRATEGY: least-loaded
    depends_on:
      model-server:
        condition: service_healthy
      cache:
        condition: service_started
    networks:
      - ai-inference

  # Inference caching layer
  cache:
    image: redis:7-alpine
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis-data:/data
    networks:
      - ai-inference

  # Vector database for semantic search
  vector-db:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant-data:/qdrant/storage
    networks:
      - ai-inference

volumes:
  model_cache:
  redis-data:
  qdrant-data:

networks:
  ai-inference:
    driver: bridge
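In the file above, api-gateway waits for cache only with condition: service_started, so the gateway may start before Redis accepts connections. The fragment below is a minimal sketch of tightening that, not a complete file. It assumes the redis-cli binary that ships in the redis:7-alpine image, and all timing values (including the start_period that gives the model time to load before failed probes count) are illustrative assumptions.

```yaml
# Fragment: only the keys that change relative to the stack above.
services:
  cache:
    healthcheck:
      # redis-cli exits non-zero when it cannot reach the server
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s   # assumed value
      timeout: 3s     # assumed value
      retries: 5      # assumed value

  api-gateway:
    depends_on:
      model-server:
        condition: service_healthy
      cache:
        # wait for the Redis health check, not just for the process to start
        condition: service_healthy

  model-server:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s   # assumed grace period for loading model weights
```

The curl-based check on model-server only works if curl is installed in that image, which the original Dockerfile would need to guarantee.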
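Compose scales a service with docker compose up --scale SERVICE=NUM (docker-compose with the older standalone binary), for example docker compose up -d --scale model-server=3. Replicas share the service name on the Compose network, and Docker's embedded DNS resolves that name to every replica's address, so traffic sent to model-server can be spread across multiple replicas. The gateway service is responsible for tracking which instances are healthy and routing traffic accordingly. A service that publishes a fixed host port, such as api-gateway with "8000:8000", cannot be scaled this way because the replicas would collide on that port.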
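The replica count can also be recorded in a file instead of being passed on the command line. The sketch below assumes Compose v2, which honors deploy.replicas outside Swarm, and a host with at least two NVIDIA GPUs, since each replica keeps the one-GPU reservation from the base file. The file name docker-compose.scale.yml is hypothetical.

```yaml
# docker-compose.scale.yml (hypothetical file name)
# Apply on top of the base file:
#   docker compose -f docker-compose.yml -f docker-compose.scale.yml up -d
services:
  model-server:
    deploy:
      # merged into the existing deploy block; the GPU reservation is kept
      replicas: 2
```

Keeping the count in a separate file makes the scaled topology reviewable and repeatable, while the base file stays usable on a single-GPU workstation.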
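Environment-specific settings live in override files. The base docker-compose.yml defines the configuration common to every environment, while docker-compose.override.yml applies local development modifications and is merged in automatically when no -f flag is given. Production deployments name their files explicitly, for example docker compose -f base.yml -f production.yml up -d. Files are merged in the order listed, later files take precedence, and the default override file is not loaded once -f is used.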
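As an illustration of the override pattern, a development override for the stack in this section might look like the sketch below. The LOG_LEVEL variable is hypothetical (the services would need to read it), and publishing the model server's port is an assumed convenience for local debugging rather than something the base file requires.

```yaml
# docker-compose.override.yml
# Picked up automatically by `docker compose up` when no -f flag is given.
services:
  model-server:
    ports:
      - "8080:8080"        # reach the model server directly from the host
    environment:
      LOG_LEVEL: debug     # hypothetical variable; merged with the base environment
      MAX_BATCH_SIZE: 4    # overrides the base value of 32 for quick local runs

  api-gateway:
    environment:
      LOG_LEVEL: debug     # hypothetical variable
```

Environment entries are merged key by key, so only the listed variables change; everything else in the base service definition is inherited.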
Create a Docker Compose stack for a Retrieval-Augmented Generation (RAG) application with five services: web frontend, API server, inference service, PostgreSQL database, and Redis cache. Define health checks, dependencies, and resource reservations. Supply model configuration and database credentials as environment variables using a .env file pattern.
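The .env pattern relies on Compose variable interpolation: Compose reads a file named .env in the project directory and substitutes ${VAR} references in the compose file. The fragment below sketches the database service only, assuming the official postgres image and its POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB variables. The service name db, the default value rag, and the sample values are illustrative.

```yaml
# .env (kept out of version control), illustrative values:
#   POSTGRES_USER=rag
#   POSTGRES_PASSWORD=change-me
#   POSTGRES_DB=rag

services:
  db:
    image: postgres:16-alpine
    environment:
      # ":-" falls back to a default when the variable is unset or empty
      POSTGRES_USER: ${POSTGRES_USER:-rag}
      # ":?" aborts startup with this message when the variable is unset or empty
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?set POSTGRES_PASSWORD in .env}
      POSTGRES_DB: ${POSTGRES_DB:-rag}
    healthcheck:
      # pg_isready ships with the postgres image
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-rag} -d ${POSTGRES_DB:-rag}"]
      interval: 10s
      timeout: 5s
      retries: 5
```

Requiring the password with the ":?" form makes a missing .env file fail loudly at startup instead of silently creating a database with an empty credential.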