Docker Compose AI Stack — Local AI on Linux (Chapter 10)

Docker Compose manages multi-container AI stacks declaratively. A typical stack includes the inference server, a reverse proxy that also enforces rate limits, and a monitoring sidecar.

Example docker-compose.yml:

services:
  llama-server:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda  # CUDA build, needed for -ngl GPU offload
    container_name: llama-inference
    runtime: nvidia
    environment:
      CUDA_VISIBLE_DEVICES: "0"
    volumes:
      - ./models:/models:ro
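    # The image's entrypoint is already the server binary, so pass only its arguments.
    # --metrics enables the Prometheus endpoint that prometheus-sidecar scrapes.
    command: >
      -m /models/mistral-7b-q4_k_m.gguf
      -ngl 99
      -c 8192
      --host 0.0.0.0
      --port 8080
      --metrics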
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  nginx-proxy:
    image: nginx:alpine
    container_name: ai-proxy
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - llama-server
    restart: unless-stopped

  prometheus-sidecar:
    image: prom/prometheus:latest
    container_name: ai-metrics
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped

volumes:
  prometheus-data:

Example nginx.conf:

events { worker_connections 1024; }

http {
  limit_req_zone $binary_remote_addr zone=ai_limit:10m rate=10r/s;

  upstream llama_backend {
    server llama-server:8080;
  }

  server {
    listen 80;

    location /v1/chat/completions {
      limit_req zone=ai_limit burst=20 nodelay;
      proxy_pass http://llama_backend;
      proxy_http_version 1.1;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
    }
  }
}

Example prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'llama-inference'
    static_configs:
      - targets: ['llama-server:8080']

Start the stack:

docker compose up -d
docker compose ps
docker compose logs -f llama-server

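Failure mode: llama-server is reported as unhealthy shortly after startup (or is restarted repeatedly if a watchdog acts on health status). The health check fires while the model is still loading, so the first probes fail and use up the retry budget before the server can answer. Add start_period: 60s to the healthcheck block; failures during that window do not count toward retries. Increase the value for larger models or slow disks.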
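
A minimal sketch of the healthcheck block with the grace period added. The 60s value comes from the text above and should be tuned to the actual model load time; the /health path assumes the llama.cpp server's health endpoint.

```yaml
services:
  llama-server:
    healthcheck:
      # Assumption: the server answers on /health once the model is loaded.
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      # Probe failures during the first 60s do not count toward retries.
      start_period: 60s
```

After the change, docker compose ps should show the service as "health: starting" during model load rather than "unhealthy".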

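Failure mode: CUDA out of memory in the inference container. This usually means multiple llama.cpp server instances are sharing the same GPU. Set CUDA_VISIBLE_DEVICES to a specific device for each instance. Compute mode does not limit memory, but running sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS (equivalently -c 3) on the host puts GPU 0 in exclusive-process mode, so only one process can claim the GPU at a time. If a single instance still runs out of memory, lower -ngl or -c to reduce VRAM use.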
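
As an alternative way to pin the container to one GPU, Compose's device reservation syntax can replace runtime: nvidia. This is a sketch, assuming the NVIDIA Container Toolkit is installed on the host and that GPU index 0 is the intended device.

```yaml
services:
  llama-server:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # Assumption: expose only GPU 0 to this container.
              device_ids: ["0"]
              capabilities: [gpu]
```

With only one device exposed, the container sees it as device 0 regardless of its index on the host, so a second inference service can be given a different device_ids entry without the two competing for the same VRAM.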

Failure mode: nginx-proxy returns 502 Bad Gateway. In its short form, depends_on waits only for the upstream container to start, not for the HTTP server inside it to accept requests. Either wrap the nginx start command in a wait-for-it.sh script, or use the long form of depends_on with condition: service_healthy, which relies on the health check defined on llama-server.
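
The long form of depends_on could look like the following sketch. It replaces the short list form in the nginx-proxy service and only works if llama-server defines a healthcheck, as in the example file.

```yaml
services:
  nginx-proxy:
    depends_on:
      llama-server:
        # Start nginx only after llama-server's healthcheck reports healthy.
        condition: service_healthy
```

With this in place, docker compose up -d holds back nginx-proxy until the model has loaded, so startup takes longer but the proxy has a working backend as soon as it starts.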