Docker Compose for AI Stack — Production Local AI Deployment (Chapter 4)

Docker Compose orchestrates multi-container applications for local development and testing environments. Production deployments typically graduate to Kubernetes, but Compose remains valuable for development iteration, integration testing, and staging environments that mirror production topology.

The Compose Specification defines services, networks, volumes, and configs in a YAML file, conventionally named compose.yaml or docker-compose.yml. Each service describes how to run one or more containers: the image to pull or the build context to build from, environment variables, port mappings, and dependencies on other services.

AI inference stacks typically include multiple services with distinct responsibilities. The model server handles inference requests. The API gateway provides authentication, rate limiting, and request routing. Redis or Memcached provides caching for repeated queries. A vector database serves similarity search workloads.

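Service dependencies introduce startup ordering. By default, depends_on only controls the order in which containers are started; it does not wait for a dependency to be ready to accept requests. Pairing it with a health check and condition: service_healthy delays the dependent service until the dependency reports healthy. This matters for a model server, which may need some time to load weights before it can answer requests.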

# The top-level "version" key is obsolete in the Compose Specification; Docker Compose v2 ignores it.

services:
  # Model inference server
  model-server:
    build:
      context: ./model_server
      dockerfile: Dockerfile
    image: inference/model-server:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      MODEL_PATH: /models/transformer
      MAX_BATCH_SIZE: 32
      DEVICE: cuda
    volumes:
      - model_cache:/models
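    healthcheck:
      # Requires curl to be installed in the model-server image.
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      # Grace period for model loading; 60s is an assumed value, tune per model.
      start_period: 60s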
    networks:
      - ai-inference

  # API Gateway service
  api-gateway:
    build:
      context: ./api_gateway
      dockerfile: Dockerfile
    image: inference/api-gateway:latest
    ports:
      - "8000:8000"
    environment:
      MODEL_SERVICE_URL: http://model-server:8080
      REDIS_URL: redis://cache:6379
      LOAD_BALANCER_STRATEGY: least-loaded
    depends_on:
      model-server:
        condition: service_healthy
      cache:
        condition: service_started
    networks:
      - ai-inference

  # Inference caching layer
  cache:
    image: redis:7-alpine
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis-data:/data
    networks:
      - ai-inference

  # Vector database for semantic search
  vector-db:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant-data:/qdrant/storage
    networks:
      - ai-inference

volumes:
  model_cache:
  redis-data:
  qdrant-data:

networks:
  ai-inference:
    driver: bridge
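
The cache above has no health check, so the gateway only waits for the Redis container to start, not for Redis to answer. As a minimal sketch, assuming the stock redis:7-alpine image (which ships redis-cli), the same readiness pattern could be applied to the cache. The interval, timeout, and retry values are illustrative rather than tuned, and the fragment is meant to be merged into the service definitions above:

```yaml
services:
  cache:
    healthcheck:
      # redis-cli exits non-zero when the server cannot be reached
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

  api-gateway:
    depends_on:
      cache:
        # wait for a passing health check instead of mere process start
        condition: service_healthy
```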

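Scaling a service in Compose uses the --scale flag, for example docker compose up -d --scale model-server=3. Scaling model servers behind a load balancer distributes inference load across multiple replicas. Compose itself offers only basic distribution: the service name model-server resolves through Docker's embedded DNS to the addresses of all running replicas. Health-aware or least-loaded routing has to be implemented by the gateway, which is what the LOAD_BALANCER_STRATEGY setting above is intended to configure. Note that a service with a fixed host port mapping, such as api-gateway, cannot be scaled this way, because two containers cannot bind the same host port.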
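
The replica count can also be declared in the Compose file instead of on the command line. The fragment below is a hedged sketch, assuming Docker Compose v2 (which honors deploy.replicas outside Swarm) and a host with enough GPU capacity for three copies of the model. The value 3 is illustrative:

```yaml
services:
  model-server:
    deploy:
      # start three containers for this service on `docker compose up`
      replicas: 3
```

Placed in an override file, this merges with the deploy.resources block of the base definition rather than replacing it. A --scale flag given on the command line takes precedence over the value in the file.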

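Environment-specific settings belong in override files. The base docker-compose.yml defines the configuration common to all environments, while docker-compose.override.yml, which Compose merges automatically when no -f flag is given, applies local development modifications. Production deployments name their files explicitly, for example docker compose -f docker-compose.yml -f production.yml up -d, which skips the default override file and keeps the environments separate.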
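
As an illustration of the override mechanism, a development override might look like the sketch below. The file name docker-compose.override.yml is the Compose default. The LOG_LEVEL variable, the bind-mounted source path, and the reduced batch size are hypothetical and stand in for whatever the services actually read:

```yaml
# docker-compose.override.yml (development only)
services:
  model-server:
    environment:
      MAX_BATCH_SIZE: 4        # smaller batches for a workstation GPU
      LOG_LEVEL: debug         # hypothetical variable
    volumes:
      # hypothetical path: mount local source for quick iteration
      - ./model_server/src:/app/src

  api-gateway:
    environment:
      LOG_LEVEL: debug         # hypothetical variable
```

Compose merges mappings such as environment key by key and adds new volume mounts to the existing ones, so the base MODEL_PATH setting and the model_cache volume remain in effect alongside these additions.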