10. Docker Compose AI Stack
Docker Compose manages multi-container AI stacks declaratively. A typical stack includes an inference server, a reverse proxy that also enforces rate limits, and a monitoring sidecar.
Example docker-compose.yml:
services:
  llama-server:
    # CUDA build of the server image; the plain "server" tag is CPU-only.
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    container_name: llama-inference
    runtime: nvidia
    environment:
      CUDA_VISIBLE_DEVICES: "0"
    volumes:
      - ./models:/models:ro
    # The image's entrypoint is already the server binary, so pass only its arguments.
    command: >
      -m /models/mistral-7b-q4_k_m.gguf
      -ngl 99
      -c 8192
      --host 0.0.0.0
      --port 8080
      --metrics
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  nginx-proxy:
    image: nginx:alpine
    container_name: ai-proxy
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - llama-server
    restart: unless-stopped
  prometheus-sidecar:
    image: prom/prometheus:latest
    container_name: ai-metrics
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped
volumes:
  prometheus-data:
# nginx.conf
events { worker_connections 1024; }
http {
    limit_req_zone $binary_remote_addr zone=ai_limit:10m rate=10r/s;
    upstream llama_backend {
        server llama-server:8080;
    }
    server {
        listen 80;
        location /v1/chat/completions {
            limit_req zone=ai_limit burst=20 nodelay;
            proxy_pass http://llama_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'llama-inference'
    static_configs:
      - targets: ['llama-server:8080']
Start the stack:
docker compose up -d
docker compose ps
docker compose logs -f llama-server
Failure mode: llama-server is marked unhealthy shortly after startup (and is restarted if something in your setup acts on health status). Loading the model takes longer than the first health checks allow, so the probe fails before the server is ready and those failures count toward the retry limit. Add start_period: 60s to the healthcheck block so that probe failures during the first 60 seconds are not counted.
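A minimal sketch of the healthcheck with a startup grace period, shown as a fragment to merge into the llama-server service. The /health path assumes the llama.cpp server's health endpoint, and 60s is only a starting point; larger models or slower disks may need more.

```yaml
services:
  llama-server:
    healthcheck:
      # /health is assumed here; any URL that returns 2xx only once the model is loaded works.
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      # Probe failures inside this window do not count toward "retries".
      start_period: 60s
```

After the change, docker compose ps should show the service as "health: starting" while the model loads, then "healthy".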
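Failure mode: CUDA out of memory in the inference container. This typically happens when multiple llama.cpp server instances share the same GPU. Pin each instance to a specific device with CUDA_VISIBLE_DEVICES, and on the host run nvidia-smi -i 0 -c EXCLUSIVE_PROCESS (requires root) to put GPU 0 in exclusive-process compute mode, so only one process can claim the GPU at a time. This does not cap memory use; it prevents a second process from sharing the device.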
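As an illustration of pinning, the fragment below gives two inference services one GPU each. The service names llama-server-a and llama-server-b are hypothetical, and the sketch assumes a host with at least two GPUs exposed to the containers.

```yaml
services:
  llama-server-a:          # hypothetical first instance
    runtime: nvidia
    environment:
      CUDA_VISIBLE_DEVICES: "0"   # this process only sees GPU 0
  llama-server-b:          # hypothetical second instance
    runtime: nvidia
    environment:
      CUDA_VISIBLE_DEVICES: "1"   # this process only sees GPU 1
```

With one instance per device, the two servers no longer compete for the same VRAM.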
Failure mode: nginx-proxy returns 502. The short form of depends_on in Compose waits only for the upstream container to start, not for the HTTP server inside it to accept connections. Use the long form with condition: service_healthy together with a health check defined on the upstream service, or wrap the proxy's start command in a script such as wait-for-it.sh.
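The long-form dependency can be sketched as the fragment below, to be merged into the nginx-proxy service. It assumes llama-server defines a healthcheck; without one, Compose cannot evaluate the condition and will refuse to start the dependent service.

```yaml
services:
  nginx-proxy:
    depends_on:
      llama-server:
        # Start the proxy only after llama-server's healthcheck passes.
        condition: service_healthy
```

This gates only the initial startup order. If the inference server goes down later, nginx will still return 502 until it is back.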
Create a docker-compose.yml that runs a llama.cpp server, a Prometheus metrics sidecar, and an nginx proxy with a rate limit. Verify all three containers are running and the proxy forwards requests to the inference server.