HOW-TO · OPS

How to create a Docker Compose stack with vLLM, Redis, and Prometheus

Intermediate · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · vLLM (vllm/vllm-openai image)
PREREQUISITES

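Docker and Docker Compose installed, NVIDIA Container Toolkit configured for Docker, curl and jq for the verification commands, and the model weights present on the host under /models/Meta-Llama-3-8B-Instruct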

What this does

This guide creates a single-node Docker Compose stack that runs vLLM for model inference, Redis for request queuing and caching, and Prometheus for metrics collection. The vLLM server serves a local model through its OpenAI-compatible API. Redis is provisioned as a task queue and TTL-based cache for common requests; vLLM does not talk to Redis itself, so the queuing and caching logic lives in your application code. Prometheus scrapes vLLM's built-in metrics endpoint directly and Redis through a Redis exporter. This stack is suitable for development, testing, and small-scale production deployments on a single GPU machine.
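
The TTL-based caching mentioned above is not configured anywhere in the stack itself, so here is a minimal sketch of what the application side could look like. The function names (cache_key, cached_completion) and the InMemoryCache stand-in are hypothetical; the stand-in only mimics the get and setex methods of the redis-py client so the example runs without a Redis server.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple


def cache_key(model: str, prompt: str, max_tokens: int) -> str:
    """Derive a stable key from the request fields that affect the output."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens},
        sort_keys=True,
    )
    return "completion:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InMemoryCache:
    """Hypothetical stand-in exposing the two redis-py methods used below."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[bytes, float]] = {}

    def get(self, name: str) -> Optional[bytes]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._entries[name]  # expired, so behave like a miss
            return None
        return value

    def setex(self, name: str, ttl: int, value: str) -> None:
        self._entries[name] = (value.encode("utf-8"), time.monotonic() + ttl)


def cached_completion(
    cache,
    generate: Callable[[str, int], str],
    model: str,
    prompt: str,
    max_tokens: int = 10,
    ttl_seconds: int = 300,
) -> str:
    """Return a cached completion, calling `generate` only on a cache miss."""
    key = cache_key(model, prompt, max_tokens)
    hit = cache.get(key)
    if hit is not None:
        # redis-py returns bytes unless the client was created with decode_responses=True
        return hit.decode("utf-8") if isinstance(hit, bytes) else hit
    text = generate(prompt, max_tokens)
    cache.setex(key, ttl_seconds, text)  # entry expires after ttl_seconds
    return text
```

To run this against the stack, pass a redis-py client such as redis.Redis(host="localhost", port=6379) as the cache argument and a function that calls vLLM as generate. Caching like this is mainly useful for deterministic requests; with sampling enabled, a cached answer hides the variation a fresh call would produce.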

Steps

  1. Verify GPU access in Docker:

    docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
    

    Expected output: the nvidia-smi output showing your GPU(s).

  2. Create the project directory and docker-compose.yml:

    version: "3.8"
    services:
      vllm:
        image: vllm/vllm-openai:latest
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=0
        ports: ["8000:8000"]
        volumes:
          - /models:/models:ro
        command: >
          --model /models/Meta-Llama-3-8B-Instruct
          --max-model-len 8192
          --gpu-memory-utilization 0.90
        ipc: host
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
        volumes: ["redis_data:/data"]
        command: redis-server --maxmemory 4gb --maxmemory-policy allkeys-lru
      prometheus:
        image: prom/prometheus:v2.51.0
        ports: ["9090:9090"]
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - prometheus_data:/prometheus
    volumes:
      redis_data:
      prometheus_data:
    
  3. Create prometheus.yml in the project directory:

    global:
      scrape_interval: 5s
    scrape_configs:
      - job_name: "vllm"
        static_configs:
          - targets: ["vllm:8000"]
      - job_name: "redis-exporter"
        static_configs:
          - targets: ["redis-exporter:9121"]
    
  4. Add a Redis exporter to docker-compose.yml, under services: at the same indentation as redis, so Prometheus can scrape Redis metrics:

      redis-exporter:
        image: oliver006/redis_exporter:latest
        ports: ["9121:9121"]
        environment:
          - REDIS_ADDR=redis://redis:6379
    
  5. Start the entire stack:

    docker compose up -d
    

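    Expected output: four containers starting. Confirm with docker compose ps, which should list all four as running. The Compose file defines no health checks, so no container will report a healthy status.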

  6. Verify vLLM is ready by querying its health endpoint:

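    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
    

    Expected output: 200. The health endpoint returns an empty body, so check the status code instead of piping the response to jq. While the model is still loading, the connection may be refused and curl prints 000.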

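  7. Test a complete inference through the stack. Send a prompt to vLLM, using the --model path as the model name (vLLM registers the model under that value unless --served-model-name is set):

    curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model":"/models/Meta-Llama-3-8B-Instruct","prompt":"Hello","max_tokens":10}'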
    

    Expected output: JSON with generated text in the choices array.

  8. Verify Prometheus is scraping metrics:

    curl -s "http://localhost:9090/api/v1/query?query=up" | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
    

    Expected output: 1 for each target (vllm and redis-exporter).
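
Steps 6 to 8 can also be scripted. The following is a rough sketch of a smoke test using only the Python standard library; all function names are hypothetical, the URLs are the host ports published in docker-compose.yml, and the model name assumes vLLM registers the model under its --model path (no --served-model-name override).

```python
import http.client
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict

VLLM_URL = "http://localhost:8000"        # host port from docker-compose.yml
PROMETHEUS_URL = "http://localhost:9090"  # host port from docker-compose.yml
MODEL = "/models/Meta-Llama-3-8B-Instruct"  # assumption: no --served-model-name


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses (refused, timeout, non-2xx)
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 60,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or `attempts` is exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


def build_completion_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 10
) -> urllib.request.Request:
    """Build the POST request for the OpenAI-compatible completions endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_choice_text(response_body: bytes) -> str:
    """Extract the generated text from a completions response."""
    return json.loads(response_body)["choices"][0]["text"]


def target_status(query_response: dict) -> Dict[str, bool]:
    """Map each instance label in an `up` query result to True (up) or False."""
    return {
        item["metric"].get("instance", "<unknown>"): item["value"][1] == "1"
        for item in query_response["data"]["result"]
    }


def run_smoke_test() -> None:
    """Run steps 6 to 8 against the live stack (containers must be running)."""
    if not wait_until_ready(lambda: http_ok(VLLM_URL + "/health")):
        raise SystemExit("vLLM did not become ready in time")
    request = build_completion_request(VLLM_URL, MODEL, "Hello")
    with urllib.request.urlopen(request, timeout=60) as resp:
        print("completion:", first_choice_text(resp.read()))
    query = urllib.parse.urlencode({"query": "up"})
    url = f"{PROMETHEUS_URL}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        for instance, is_up in target_status(json.load(resp)).items():
            print(f"{instance}: {'up' if is_up else 'down'}")
```

Call run_smoke_test() once docker compose up -d has returned. The polling loop exists because vLLM only starts answering on port 8000 after the model has loaded, which can take several minutes.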

Verification

docker compose ps --format json | jq -r '(if type == "array" then .[] else . end) | "\(.Service): \(.State)"'

Expected output: all four services showing running.
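
For use in a script or CI job, the same check can be sketched in Python. The function names below are hypothetical. The parser accepts both output shapes that docker compose ps --format json has produced across Compose releases (one JSON array, or one JSON object per line), and it relies only on the Service and State fields.

```python
import json
import subprocess
from typing import Dict, List, Sequence

# The four services defined in this guide's docker-compose.yml.
EXPECTED_SERVICES = ("vllm", "redis", "prometheus", "redis-exporter")


def parse_compose_ps(output: str) -> Dict[str, str]:
    """Map service name to state from `docker compose ps --format json` output."""
    text = output.strip()
    if not text:
        return {}
    if text.startswith("["):
        entries = json.loads(text)  # single JSON array
    else:
        # one JSON object per line
        entries = [json.loads(line) for line in text.splitlines() if line.strip()]
    return {entry["Service"]: entry["State"] for entry in entries}


def not_running(
    states: Dict[str, str], expected: Sequence[str] = EXPECTED_SERVICES
) -> List[str]:
    """Return the expected services that are missing or not in the running state."""
    return [name for name in expected if states.get(name) != "running"]


def compose_states() -> Dict[str, str]:
    """Query the live stack; run from the directory containing docker-compose.yml."""
    result = subprocess.run(
        ["docker", "compose", "ps", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compose_ps(result.stdout)
```

An empty list from not_running(compose_states()) corresponds to the expected output above. Note that docker compose ps omits stopped containers unless --all is passed, so a crashed service shows up here as missing rather than exited.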

Common failures

  • vLLM container exits immediately — the model path inside the container must match the volume mount. Check with docker compose logs vllm for "Model not found" errors.
  • vLLM fails with CUDA out of memory — reduce --gpu-memory-utilization to 0.75 or decrease --max-model-len. Check current memory usage with nvidia-smi on the host.
  • Redis exporter shows "context deadline exceeded" — the exporter cannot reach Redis. Verify the REDIS_ADDR environment variable uses the correct service name (redis://redis:6379 for Docker network resolution).
  • Prometheus target shows "connection refused" — the vLLM metrics endpoint may not be available until the model finishes loading. Wait 2-3 minutes and recheck.
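
To get a feel for how much --max-model-len matters in the out-of-memory case, the arithmetic below estimates the KV cache needed by one full-length sequence. It is a back-of-the-envelope sketch, not vLLM's own accounting: it assumes a 16-bit KV cache and ignores model weights, activations, and allocator overhead. The layer and head counts are the published Llama 3 8B architecture values; the function names are hypothetical.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(
    context_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> float:
    """GiB of KV cache for a single sequence of `context_len` tokens."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_value
    )
    return context_len * per_token / 2**30


# Llama 3 8B: 32 layers, 8 key/value heads (grouped-query attention), head dim 128.
LLAMA3_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for context_len in (8192, 4096, 2048):
    gib = kv_cache_gib(context_len, **LLAMA3_8B)
    print(f"--max-model-len {context_len}: {gib:.2f} GiB per full-length sequence")
```

Under these assumptions each token costs 128 KiB, so one 8192-token sequence needs about 1 GiB on top of the model weights, and halving --max-model-len halves that figure.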
