
How to run vLLM in Docker

Intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Windows 11 · Ollama 0.4.x
macOS 15 · Ollama 0.4.x
PREREQUISITES

Docker with NVIDIA Container Toolkit, CUDA-capable GPU

What this does

Launches vLLM's OpenAI-compatible API server inside a Docker container with GPU acceleration, providing high-throughput inference via a familiar REST interface.

Steps

  1. Start the vLLM container with GPU passthrough and model volume.

    docker run -d \
      --gpus all \
      -v /path/to/models:/models \
      -p 8000:8000 \
      --shm-size=1g \
      --name vllm-server \
      vllm/vllm-openai:latest \
      --model /models/llama-model \
      --gpu-memory-utilization 0.9
    

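    Expected output: docker run prints the new container ID and returns. Follow startup with docker logs -f vllm-server until the logs show that vLLM server initialization has finished.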

  2. Confirm the API server is responding.

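    curl -i http://localhost:8000/health
    

    Expected output: an HTTP 200 status line with an empty body, indicating the server is ready. A connection error usually means the model is still loading; check docker logs vllm-server and retry.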

  3. Send a completion request to the OpenAI-compatible endpoint.

    curl -X POST http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"/models/llama-model","prompt":"What is machine learning?","max_tokens":128}'
    

    Expected output: JSON response with generated text and usage statistics.

  4. Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
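Model loading can take a while, so a single health check issued right after the container starts may fail even when nothing is wrong. The following is a minimal sketch of a readiness poll for the steps above, assuming curl is installed on the host. The function name wait_for_health and the BASE_URL variable are hypothetical helpers for illustration, not part of vLLM or Docker.

```shell
#!/bin/sh
# Hypothetical helper: BASE_URL and wait_for_health are illustrative names.
BASE_URL="${BASE_URL:-http://localhost:8000}"

# Poll GET /health until it answers HTTP 200 or the attempts run out.
# $1 = maximum number of attempts (default 60)
# $2 = seconds to sleep between attempts (default 5)
wait_for_health() {
  attempts="${1:-60}"
  delay="${2:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; a refused
    # connection yields 000 and a non-zero exit, which is ignored here.
    code="$(curl -s -o /dev/null -w '%{http_code}' "$BASE_URL/health" || true)"
    if [ "$code" = "200" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

A possible use is `wait_for_health 60 5 && echo "server ready"`, placed between the docker run command and the first completion request. The attempt count and delay are arbitrary starting values; larger models may need a longer window.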

Verification

curl -s http://localhost:8000/v1/models
# Expected: a JSON list whose data[0].id is /models/llama-model (the value passed to --model)
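To turn the verification into a pass/fail check, the response can be scanned for the expected model identifier. This is a rough sketch that uses only grep and sed; the function names list_ids and model_is_served are hypothetical. It assumes the response follows the OpenAI-style list shape, where each model entry carries an "id" field. The pattern also picks up any nested "id" values, which is why the check matches the expected name exactly instead of taking the first hit.

```shell
#!/bin/sh
# Hypothetical helpers: list_ids and model_is_served are illustrative names.

# Print every "id" string value found in the JSON read from stdin, one per line.
list_ids() {
  grep -oE '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed -E 's/^"id"[[:space:]]*:[[:space:]]*"(.*)"$/\1/'
}

# Succeed if the JSON on stdin contains an "id" exactly equal to $1.
model_is_served() {
  list_ids | grep -qxF -- "$1"
}
```

For example: `curl -s http://localhost:8000/v1/models | model_is_served /models/llama-model && echo "model is served"`. Where jq is available, a real JSON parser is the more robust choice; the text-matching approach here only avoids the extra dependency.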

Common failures

  • Shared memory size too small: vLLM uses PyTorch, which shares data between processes through shared memory, particularly for tensor-parallel inference. Increase --shm-size to 4g or 8g, or pass --ipc=host, if shared-memory errors appear in the container logs.
  • GPU memory exhausted at startup: Lower --gpu-memory-utilization to 0.7, reduce --max-model-len, or use a quantized model variant.
  • Model path not accessible inside container: Verify with docker exec vllm-server ls /models.
  • Port 8000 conflict: Use -p 8001:8000 and update the client URL accordingly.
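Several of the fixes above amount to relaunching the container with one value changed. The sketch below wraps the launch command from the steps so that each of those values becomes a variable. The function name launch_vllm and the variables MODEL_DIR, HOST_PORT, SHM_SIZE, GPU_UTIL, and DRY_RUN are hypothetical names chosen for this example; they are not Docker or vLLM settings. The defaults reproduce the original command.

```shell
#!/bin/sh
# Hypothetical wrapper: every variable name here is illustrative.
# Defaults mirror the launch command used in the steps.
launch_vllm() {
  set -- docker run -d --gpus all \
    -v "${MODEL_DIR:-/path/to/models}:/models" \
    -p "${HOST_PORT:-8000}:8000" \
    "--shm-size=${SHM_SIZE:-1g}" \
    --name vllm-server \
    vllm/vllm-openai:latest \
    --model /models/llama-model \
    --gpu-memory-utilization "${GPU_UTIL:-0.9}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "$*"
  else
    "$@"
  fi
}
```

A previous container with the same name has to be removed first, for example with `docker rm -f vllm-server`. After that, setting `SHM_SIZE=4g`, `GPU_UTIL=0.7`, and `HOST_PORT=8001` before calling `launch_vllm` retries with more shared memory, a lower GPU memory target, and a different host port. Setting `DRY_RUN=1` prints the command for review, which is also a convenient way to capture the exact command that was run.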

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
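One way to make the checkpoint routine is to append the relevant facts to a file after each successful run. The sketch below is an assumption-laden example, not a prescribed format: record_evidence, BASE_URL, and the default file name vllm-run-evidence.txt are hypothetical. It records the image ID of the running container (useful because the latest tag moves over time), the GPU name and driver version, and the /v1/models response. Any value that cannot be collected is written as "unknown" or "unavailable".

```shell
#!/bin/sh
# Hypothetical helper: record_evidence, BASE_URL, and the default
# file name are illustrative choices.
BASE_URL="${BASE_URL:-http://localhost:8000}"

# Append one evidence record to the file named in $1.
record_evidence() {
  out="${1:-vllm-run-evidence.txt}"
  {
    echo "recorded_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    # Image ID of the container started in step 1.
    echo "image: $(docker inspect --format '{{.Image}}' vllm-server 2>/dev/null || echo unknown)"
    # GPU model and driver version as reported on the host.
    echo "gpu: $(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null || echo unknown)"
    # Verification output from the running server.
    echo "models: $(curl -s "$BASE_URL/v1/models" || echo unavailable)"
  } >> "$out"
}
```

Calling `record_evidence` with no argument appends to vllm-run-evidence.txt in the current directory. The exact launch command is not captured here and would still need to be saved alongside the record.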
