How to create a Docker Compose stack with vLLM, Redis, and Prometheus
Prerequisites
Docker and Docker Compose installed, and the NVIDIA Container Toolkit configured for Docker.
What this does
This guide builds a single-node Docker Compose stack that runs vLLM for model inference, Redis for request queuing and caching, and Prometheus for metrics collection. The vLLM server serves a local model from /models. Redis is available to the application in front of vLLM as a task queue and a TTL-based cache for common requests; vLLM itself does not read from or write to Redis. Prometheus scrapes vLLM directly and Redis through a Redis exporter. The stack is suitable for development, testing, and small-scale production deployments on a single GPU machine.
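The caching role described above lives in whatever client or gateway calls vLLM. As a minimal sketch of that idea, assuming the stack from the steps below is running, the following helpers look up a prompt in Redis first and only call vLLM on a miss. The function names, the vllm:cache: key prefix, the 300-second TTL, and running redis-cli through docker compose exec are illustrative assumptions, not part of the stack itself.

```shell
#!/usr/bin/env bash
# Sketch: TTL-based response cache in Redis in front of vLLM.
# Assumptions (not from the guide): helper names, key prefix, default TTL.

VLLM_URL="${VLLM_URL:-http://localhost:8000/v1/completions}"
# vLLM serves the model under the value passed to --model
# unless --served-model-name is set.
MODEL="${MODEL:-/models/Meta-Llama-3-8B-Instruct}"
CACHE_TTL="${CACHE_TTL:-300}"

# Run a Redis command inside the redis service container (no TTY).
redis_cmd() {
  docker compose exec -T redis redis-cli "$@"
}

# Derive a stable cache key from the prompt text.
cache_key() {
  printf 'vllm:cache:%s' "$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)"
}

# Send the prompt to vLLM; jq builds the JSON body so quotes are escaped.
vllm_complete() {
  jq -n --arg m "$MODEL" --arg p "$1" \
    '{model: $m, prompt: $p, max_tokens: 10}' |
    curl -sf "$VLLM_URL" -H "Content-Type: application/json" -d @-
}

# Print the cached response if present; otherwise call vLLM and cache it.
cached_completion() {
  local prompt="$1" key cached response
  key="$(cache_key "$prompt")"
  cached="$(redis_cmd GET "$key")"
  if [ -n "$cached" ]; then
    printf '%s\n' "$cached"
    return 0
  fi
  response="$(vllm_complete "$prompt")" || return 1
  # EX sets the expiry in seconds, so stale answers age out on their own.
  redis_cmd SET "$key" "$response" EX "$CACHE_TTL" >/dev/null
  printf '%s\n' "$response"
}
```

After sourcing the file, cached_completion "Hello" would hit vLLM the first time and Redis on repeat calls within the TTL. Hashing the prompt keeps keys short and avoids special characters in key names; a real deployment would likely also fold sampling parameters into the key.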
Steps
Verify GPU access in Docker:

    docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Expected output: the nvidia-smi table listing your GPU(s).

Create the project directory and docker-compose.yml:

    version: "3.8"
    services:
      vllm:
        image: vllm/vllm-openai:latest
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=0   # expose GPU 0 only
        ports: ["8000:8000"]
        volumes:
          - /models:/models:ro         # host model directory, read-only
        command: >
          --model /models/Meta-Llama-3-8B-Instruct
          --max-model-len 8192
          --gpu-memory-utilization 0.90
        ipc: host
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
        volumes: ["redis_data:/data"]
        command: redis-server --maxmemory 4gb --maxmemory-policy allkeys-lru
      prometheus:
        image: prom/prometheus:v2.51.0
        ports: ["9090:9090"]
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - prometheus_data:/prometheus
    volumes:
      redis_data:
      prometheus_data:

Create prometheus.yml in the project directory:

    global:
      scrape_interval: 5s
    scrape_configs:
      - job_name: "vllm"
        static_configs:
          - targets: ["vllm:8000"]
      - job_name: "redis-exporter"
        static_configs:
          - targets: ["redis-exporter:9121"]

Add a Redis exporter to the Compose file, under services, to expose Redis metrics:

      redis-exporter:
        image: oliver006/redis_exporter:latest
        ports: ["9121:9121"]
        environment:
          - REDIS_ADDR=redis://redis:6379

Start the entire stack:

    docker compose up -d

Expected output: four containers starting. Confirm with:

    docker compose ps

All four services should show as running. The Compose file defines no health checks, so no "healthy" status is reported.

Verify vLLM is ready by querying its health endpoint:

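    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health

Expected output: 200 once the model has finished loading. The endpoint returns an empty body, so there is nothing to pipe to jq.

Test a complete inference through the stack. Send a prompt to vLLM:
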
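    curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"/models/Meta-Llama-3-8B-Instruct","prompt":"Hello","max_tokens":10}'

The model field must match the name vLLM serves, which defaults to the value passed to --model (here the full path) unless --served-model-name is set.

Expected output: JSON with generated text in the choices array.

Verify Prometheus is scraping metrics:
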
    curl -s "http://localhost:9090/api/v1/query?query=up" | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'

Expected output: a value of "1" for each target (vllm and redis-exporter).
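Because vLLM only answers its health endpoint after the model has loaded, scripted runs of the steps above usually need to wait before sending requests. A minimal polling sketch follows; the function names, the 300-second default timeout, and the 5-second interval are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: block until an HTTP endpoint answers 200, or give up after a timeout.
# Assumptions: function names and default timings are illustrative.

# Print only the HTTP status code for a URL ("000" if the connection fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# wait_for_ready URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_ready() {
  local url="$1" timeout="${2:-300}" interval="${3:-5}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ "$(http_status "$url")" = "200" ]; then
      echo "ready after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s waiting for $url" >&2
  return 1
}
```

A typical call would be wait_for_ready http://localhost:8000/health 300 5 before the inference test. The elapsed time counts sleep intervals only, so it slightly understates wall-clock time when requests are slow.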
Verification
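    docker compose ps --format json | jq -r 'if type == "array" then .[] else . end | "\(.Service): \(.State)"'

Expected output: all four services showing running, one per line (for example, vllm: running). The jq filter accepts both output shapes that Compose versions produce: a single JSON array or one JSON object per line.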
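For use in scripts, the same verification can return a non-zero exit code when a service is missing. The sketch below assumes Compose v2 (for the --services and --status flags of docker compose ps) and the four service names used in this guide; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: fail if any expected Compose service is not running.
# Assumptions: Compose v2 flags; function names are illustrative.

EXPECTED_SERVICES="vllm redis prometheus redis-exporter"

# List the names of services that currently have a running container.
running_services() {
  docker compose ps --services --status running
}

# Print one status line per expected service; return 1 if any is missing.
check_services() {
  local running missing=0 svc
  running="$(running_services)"
  for svc in $EXPECTED_SERVICES; do
    # -x matches whole lines, so "redis" does not match "redis-exporter".
    if printf '%s\n' "$running" | grep -qx -- "$svc"; then
      echo "$svc: running"
    else
      echo "$svc: NOT running"
      missing=1
    fi
  done
  return "$missing"
}
```

Running check_services from the project directory prints the per-service status and can gate later steps, for example check_services && echo "stack is up".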
Common failures
- vLLM container exits immediately: the model path inside the container must match the volume mount. Run "docker compose logs vllm" and look for "Model not found" errors.
- vLLM fails with CUDA out of memory: reduce --gpu-memory-utilization to 0.75 or decrease --max-model-len. Check current memory usage with nvidia-smi on the host.
- Redis exporter shows "context deadline exceeded": the exporter cannot reach Redis. Verify that the REDIS_ADDR environment variable uses the correct service name (redis://redis:6379 for Docker network resolution).
- Prometheus target shows "connection refused": the vLLM metrics endpoint may not be available until the model finishes loading. Wait 2-3 minutes and recheck.
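For the first failure above, a pre-flight check on the host can catch a bad model path before the container starts. This sketch assumes the model is stored in Hugging Face format, where a config.json sits at the top of the model directory; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a host model directory before starting the stack.
# Assumption: a Hugging Face style layout with config.json at the top level.

check_model_dir() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir" >&2
    return 1
  fi
  if [ ! -f "$dir/config.json" ]; then
    echo "no config.json in $dir" >&2
    return 1
  fi
  echo "ok: $dir"
}
```

With the volume mapping from this guide (/models on the host to /models in the container), check_model_dir /models/Meta-Llama-3-8B-Instruct checks the same path that --model points at.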