What this does

This guide creates a single-node Docker Compose stack that runs vLLM for model inference, Redis for request queuing and caching, and Prometheus for metrics collection. The vLLM server serves a local model through its OpenAI-compatible API. Redis is available to your application as a task queue and a TTL-based cache for common requests. Prometheus scrapes vLLM directly and scrapes Redis through a separate exporter container. This stack is suitable for development, testing, and small-scale production deployments on a single GPU machine.

Steps

Verify GPU access in Docker:
```
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```
Expected output: the nvidia-smi output showing your GPU(s).

Create the project directory and docker-compose.yml:

```
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia                 # needs the NVIDIA runtime registered with Docker
    environment:
      - NVIDIA_VISIBLE_DEVICES=0    # expose only GPU 0 to the container
    ports: ["8000:8000"]
    volumes:
      - /models:/models:ro          # host model directory, mounted read-only
    command: >
      --model /models/Meta-Llama-3-8B-Instruct
      --max-model-len 8192
      --gpu-memory-utilization 0.90
    ipc: host                       # share the host IPC namespace for shared memory
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    volumes: ["redis_data:/data"]
    # allkeys-lru evicts any key, queue entries included, once 4 GB is reached
    command: redis-server --maxmemory 4gb --maxmemory-policy allkeys-lru
  prometheus:
    image: prom/prometheus:v2.51.0
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
volumes:
  redis_data:
  prometheus_data:
```

Create prometheus.yml in the project directory:

```
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: "vllm"
    static_configs:
      - targets: ["vllm:8000"]
  - job_name: "redis-exporter"
    static_configs:
      - targets: ["redis-exporter:9121"]
```

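Add a Redis exporter under the services: key in docker-compose.yml so that Prometheus can scrape Redis metrics. The service name must be redis-exporter to match the target in prometheus.yml:

```
  redis-exporter:
    image: oliver006/redis_exporter:latest
    ports: ["9121:9121"]
    environment:
      - REDIS_ADDR=redis://redis:6379
```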

Start the entire stack:
```
docker compose up -d
```
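Expected output: four containers starting. Confirm with docker compose ps, which should list all four as running. The Compose file defines no health checks, so no health status is reported.

Verify vLLM is ready by querying its health endpoint:
```
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
```
Expected output: 200. The endpoint returns an empty body, so check the status code rather than the response content.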
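Because the model can take several minutes to load, a scripted setup may want to poll the health endpoint instead of checking it once. The following is a minimal sketch, assuming bash and curl on the host. The helper name wait_until is hypothetical, and the 300-second timeout and 5-second interval in the usage comment are arbitrary choices, not recommended values.

```shell
# Retry a command until it succeeds or the timeout (in seconds) elapses.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    # Give up once the deadline has passed.
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Example against the running stack (not executed here):
#   wait_until 300 5 curl -sf -o /dev/null http://localhost:8000/health \
#     && echo "vLLM is ready" || echo "vLLM did not become ready in time"
```

The -f flag makes curl exit non-zero on HTTP error statuses, so the loop keeps retrying both while the port is closed and while the server answers with an error.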

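Test inference by sending a prompt to vLLM. The model field must match the name vLLM registered, which is the value passed to --model unless --served-model-name is set. If unsure, list the registered names with curl -s http://localhost:8000/v1/models.

```
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model":"/models/Meta-Llama-3-8B-Instruct","prompt":"Hello","max_tokens":10}'
```

Expected output: JSON with generated text in the choices array.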
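The request above goes directly to vLLM. Nothing in this Compose file routes traffic through Redis, so any caching has to live in the client or gateway that sends the requests. One possible shape for the TTL-based cache is sketched below, assuming bash and sha256sum on the host. The helper name cache_key, the vllm:cache: prefix, the MODEL and RESPONSE variables, and the 300-second TTL are all illustrative assumptions.

```shell
# Derive a stable cache key from the model name and the prompt.
# Hashing keeps the key short and free of characters Redis clients must escape.
cache_key() {
  local model="$1" prompt="$2"
  local digest
  digest=$(printf '%s\n%s' "$model" "$prompt" | sha256sum | cut -d' ' -f1)
  printf 'vllm:cache:%s' "$digest"
}

# Example against the running stack (not executed here):
#   key=$(cache_key "$MODEL" "Hello")
#   # Store a response for 300 seconds, then read it back.
#   docker compose exec -T redis redis-cli SET "$key" "$RESPONSE" EX 300
#   docker compose exec -T redis redis-cli GET "$key"
```

A client would call GET first and only send the prompt to vLLM on a miss. Sampling parameters such as temperature would normally need to be part of the key as well, since they change the output.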

Verify Prometheus is scraping metrics:

```
curl -s "http://localhost:9090/api/v1/query?query=up" | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
```

Expected output: a value of "1" for each target (vllm:8000 and redis-exporter:9121).
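For a scripted check, the same query can be reduced to a pass/fail result. This is a minimal sketch, assuming bash, awk, curl, and jq on the host. The helper name down_targets is hypothetical. It expects one "job value" pair per line, which the jq filter in the usage comment produces from the job label that Prometheus attaches to every up series.

```shell
# Read "job value" pairs on stdin, print every job whose `up` value is not 1,
# and exit non-zero if at least one such job was found.
down_targets() {
  awk 'BEGIN { bad = 0 } $2 != 1 { print $1; bad = 1 } END { exit bad }'
}

# Example against the running stack (not executed here):
#   curl -s "http://localhost:9090/api/v1/query?query=up" \
#     | jq -r '.data.result[] | "\(.metric.job) \(.value[1])"' \
#     | down_targets && echo "all targets up"
```

A target that Prometheus has not scraped yet does not appear in the result at all, so an empty result also passes. A stricter check would compare the number of lines with the number of configured jobs.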

Verification

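```
docker compose ps --format "{{.Service}}: {{.State}}"
```

Expected output: one line per service, each ending in running (for example, vllm: running). Template output requires a recent Docker Compose v2 release. On older releases, read the State column of plain docker compose ps instead.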

Common failures

vLLM container exits immediately: the path passed to --model must exist inside the container, so the model directory has to sit under the host directory mounted at /models. Run docker compose logs vllm and look for errors that mention the model path.
vLLM fails with CUDA out of memory: reduce --gpu-memory-utilization to 0.75 or decrease --max-model-len. Check current memory usage with nvidia-smi on the host, since other processes on the same GPU count against the budget.
Redis exporter shows "context deadline exceeded": the exporter cannot reach Redis. Verify that the REDIS_ADDR environment variable uses the Compose service name (redis://redis:6379), which Docker resolves on the stack's network.
Prometheus target shows "connection refused": the vLLM API server, including its /metrics endpoint, does not accept connections until the model has finished loading. Wait until the health endpoint returns 200, which can take a few minutes, and recheck.
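For the out-of-memory case, a rough estimate shows why lowering --max-model-len helps. The sketch below computes the KV-cache size of a single sequence. The default values (32 layers, 8 key/value heads, head dimension 128, 2 bytes per element) are assumptions taken from the published Llama 3 8B configuration with 16-bit weights. The helper name kv_cache_bytes is hypothetical, and the result ignores model weights, activations, and any allocator overhead.

```shell
# KV-cache bytes for one sequence:
#   2 (keys and values) x layers x kv_heads x head_dim x bytes_per_element x tokens
# Defaults are assumed Llama 3 8B values in 16-bit precision.
kv_cache_bytes() {
  local tokens="$1"
  local layers="${2:-32}" kv_heads="${3:-8}" head_dim="${4:-128}" elem_bytes="${5:-2}"
  echo $(( 2 * layers * kv_heads * head_dim * elem_bytes * tokens ))
}

kv_cache_bytes 8192   # one sequence at --max-model-len 8192 -> 1073741824 (1 GiB)
kv_cache_bytes 4096   # after halving --max-model-len        -> 536870912 (0.5 GiB)
```

Under these assumptions each token costs 128 KiB of cache, so a full 8192-token sequence needs about 1 GiB on top of roughly 16 GB of 16-bit weights for an 8-billion-parameter model.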

How to create a Docker Compose stack with vLLM, Redis, and Prometheus
