RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.coIndependently operated
Production Local AI Deployment

04. Docker Compose for AI Stack

Chapter 4 of 24 · 15 min
KEY INSIGHT

Docker Compose definitions become infrastructure as code, establishing reproducible multi-service deployments that mirror production Kubernetes topologies.

Docker Compose orchestrates multi-container applications for local development and testing environments. Production deployments typically graduate to Kubernetes, but Compose remains valuable for development iteration, integration testing, and staging environments that mirror production topology.

The compose specification defines services, networks, volumes, and configurations as versioned YAML documents. Each service maps to a running container with specific image or build instructions, environment variables, port mappings, and dependencies.
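
As a minimal sketch of that structure, the fragment below puts the four top-level elements in one file. The service name `app`, the image `example/app:1.0`, the variable `LOG_LEVEL`, and the file `./config.yaml` are placeholders chosen for illustration; they are not part of the stack built in this chapter.

```yaml
# Minimal Compose file skeleton (all names are illustrative placeholders)
services:
  app:
    image: example/app:1.0        # hypothetical image; use `build:` to build from a Dockerfile instead
    environment:
      LOG_LEVEL: info             # hypothetical variable
    ports:
      - "8000:8000"               # host:container
    volumes:
      - app-data:/var/lib/app     # named volume mounted into the container
    configs:
      - source: app-config
        target: /etc/app/config.yaml
    networks:
      - backend

networks:
  backend:
    driver: bridge

volumes:
  app-data:

configs:
  app-config:
    file: ./config.yaml           # hypothetical local file
```

Because the whole file is plain YAML, it can be committed, reviewed, and diffed like any other source file.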

AI inference stacks typically include multiple services with distinct responsibilities. The model server handles inference requests. The API gateway provides authentication, rate limiting, and request routing. Redis or Memcached provides caching for repeated queries. A vector database serves similarity search workloads.

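Service dependencies introduce startup ordering. By default, the depends_on directive only controls the order in which containers are started; it does not wait for the application inside a container to be ready. Pairing a health check with the condition: service_healthy form makes a dependent service wait until its dependency reports healthy, which matters when readiness is more important than process existence, as with a model server that needs time to load weights.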

# No top-level `version` key: it is obsolete in the Compose Specification and ignored by Docker Compose v2.

services:
  # Model inference server
  model-server:
    build:
      context: ./model_server
      dockerfile: Dockerfile
    image: inference/model-server:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      MODEL_PATH: /models/transformer
      MAX_BATCH_SIZE: 32
      DEVICE: cuda
    volumes:
      - model_cache:/models
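    healthcheck:
      # Assumes curl is installed in the image and the server exposes /health on port 8080
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s  # illustrative grace period while the model loads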
    networks:
      - ai-inference

  # API Gateway service
  api-gateway:
    build:
      context: ./api_gateway
      dockerfile: Dockerfile
    image: inference/api-gateway:latest
    ports:
      - "8000:8000"
    environment:
      MODEL_SERVICE_URL: http://model-server:8080
      REDIS_URL: redis://cache:6379
      LOAD_BALANCER_STRATEGY: least-loaded
    depends_on:
      model-server:
        condition: service_healthy
      cache:
        condition: service_started
    networks:
      - ai-inference

  # Inference caching layer
  cache:
    image: redis:7-alpine
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis-data:/data
    networks:
      - ai-inference

  # Vector database for semantic search
  vector-db:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant-data:/qdrant/storage
    networks:
      - ai-inference

volumes:
  model_cache:
  redis-data:
  qdrant-data:

networks:
  ai-inference:
    driver: bridge
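
In the stack above, `api-gateway` waits only for the `cache` container to start. One possible refinement, offered here as an assumption rather than as the course's configuration, is to give Redis its own health check so the gateway can wait on `service_healthy` instead. The fragment is meant to be merged into the file above; the interval, timeout, and retry values are illustrative.

```yaml
# Fragment to merge into the stack above (timing values are illustrative)
services:
  cache:
    image: redis:7-alpine
    healthcheck:
      # redis-cli ships in the official image; the check fails while the server refuses connections
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

  api-gateway:
    depends_on:
      cache:
        condition: service_healthy   # wait for a passing health check, not just a started process
```

The same pattern applies to any dependency that has a cheap readiness probe.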

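Compose scales a service with the --scale flag, for example docker compose up -d --scale model-server=3. All replicas share the service name, and Docker's embedded DNS resolves model-server to every replica, so a gateway or load balancer in front of them can distribute inference load. A scaled service cannot publish a fixed host port, because each replica would try to bind the same port; this is why model-server publishes no ports in the file above. The gateway service is responsible for tracking healthy instances and routing traffic accordingly.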
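
The replica count can also be declared in the file rather than on the command line. The fragment below is a sketch that assumes Docker Compose v2, which reads `deploy.replicas` during `docker compose up`; the count of 3 is illustrative.

```yaml
# Fragment: declare the replica count in the file instead of on the command line.
# One-off alternative: docker compose up -d --scale model-server=3
services:
  model-server:
    image: inference/model-server:latest
    deploy:
      replicas: 3          # illustrative count
    # No `ports:` entry: replicas cannot all bind the same fixed host port.
    # No `container_name:` either, since container names must be unique.
    networks:
      - ai-inference

networks:
  ai-inference:
    driver: bridge
```

A `--scale` value passed on the command line takes precedence over the count in the file, which keeps ad hoc experiments possible without editing it.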

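Environment-specific settings live in override files. The base docker-compose.yml defines the common configuration, and Compose automatically merges docker-compose.override.yml on top of it when no -f flag is given, which suits local development changes. For production, pass the files explicitly, for example docker compose -f docker-compose.yml -f production.yml up -d. Later files override earlier ones, and docker-compose.override.yml is not loaded when -f is used.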
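
A development override for the stack in this chapter might look like the sketch below. The `LOG_LEVEL` variable, the reduced batch size, the bind-mount path, and the extra port are assumptions chosen to illustrate how values merge; none of them come from the course material.

```yaml
# docker-compose.override.yml (merged automatically when no -f flag is given)
# Local-development changes layered on top of the base file; values are illustrative.
services:
  model-server:
    environment:
      LOG_LEVEL: debug               # hypothetical variable, added to the base environment
      MAX_BATCH_SIZE: "4"            # replaces the base value of 32
    volumes:
      - ./model_server/src:/app/src  # hypothetical bind mount for live code edits

  api-gateway:
    ports:
      - "8001:8000"                  # appended to the base port list, not a replacement
```

Running `docker compose config` prints the merged result, which is a quick way to confirm what an override actually changes before starting any containers.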

EXERCISE

Create a Docker Compose stack for a Retrieval-Augmented Generation (RAG) application: web frontend, API server, inference service, PostgreSQL database, and Redis cache. Define health checks, dependencies, and resource reservations. Supply model configuration and database credentials as environment variables loaded from a .env file.
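
As a starting hint for the .env part only, the fragment below sketches variable interpolation. Compose reads a `.env` file from the project directory and substitutes `${...}` references. The service names, the `example/inference:1.0` image, and the `MODEL_NAME` variable are hypothetical.

```yaml
# Fragment: values come from a .env file next to the Compose file, for example:
#   POSTGRES_PASSWORD=change-me
#   MODEL_NAME=my-model
services:
  db:
    image: postgres:16
    environment:
      # Abort with this message if the variable is unset or empty
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?set POSTGRES_PASSWORD in .env}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  inference:
    image: example/inference:1.0                 # hypothetical image
    environment:
      MODEL_NAME: ${MODEL_NAME:-default-model}   # falls back to a default when unset
```

Keeping `.env` out of version control and committing a sample file with dummy values is a common way to avoid leaking credentials.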
