How to run vLLM in Docker
Requirements: Docker with the NVIDIA Container Toolkit and a CUDA-capable GPU.
What this does
Launches vLLM's OpenAI-compatible API server inside a Docker container with GPU acceleration, providing high-throughput inference via a familiar REST interface.
Steps
Start the vLLM container with GPU passthrough and a mounted model volume.
docker run -d \
  --gpus all \
  -v /path/to/models:/models \
  -p 8000:8000 \
  --shm-size=1g \
  --name vllm-server \
  vllm/vllm-openai:latest \
  --model /models/llama-model \
  --gpu-memory-utilization 0.9
Expected output: The command prints the container ID. Running docker logs -f vllm-server then shows vLLM server initialization.
Confirm the API server is responding.
curl -i http://localhost:8000/health
Expected output: an HTTP 200 status line with an empty body, indicating the server is ready.
Send a completion request to the OpenAI-compatible endpoint.
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/models/llama-model","prompt":"What is machine learning?","max_tokens":128}'
Expected output: JSON response with generated text and usage statistics.
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
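Loading the model can take much longer than starting the container, so a health check sent right after docker run may fail even though nothing is wrong. The following is a minimal sketch of a polling loop, assuming the /health endpoint answers HTTP 200 once the server is ready. The function name wait_for_vllm and the defaults of 60 attempts spaced 5 seconds apart are illustrative choices, not part of vLLM.

```shell
#!/bin/sh
# Poll the health endpoint until it answers HTTP 200 or the attempt budget runs out.
# wait_for_vllm and its defaults (60 attempts, 5 s apart) are illustrative assumptions.
wait_for_vllm() {
  url="${1:-http://localhost:8000/health}"
  attempts="${2:-60}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s hides progress, -o discards the body, -w prints only the status code.
    # curl prints 000 when the connection fails, e.g. before the server is listening.
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$code" = "200" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}

# Usage: wait_for_vllm && echo "safe to send requests"
```

Calling wait_for_vllm with no arguments targets the port mapping used in the steps. Pass a different URL as the first argument if the host port was changed.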
Verification
curl -s http://localhost:8000/v1/models
# Expected: a JSON model list whose first entry has "id": "/models/llama-model"
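The verification above can be turned into a pass/fail check. This is a rough sketch, assuming the response lists the served model's "id" before any nested ids and that the id matches the --model path from the steps. The helper names first_model_id and check_model_id are hypothetical, and the text matching stands in for a proper JSON parser.

```shell
#!/bin/sh
# Print the first "id" value found in a /v1/models JSON response read from stdin.
# Text matching is a stand-in for a real JSON parser and assumes ids contain no escaped quotes.
first_model_id() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed 's/^"id"[[:space:]]*:[[:space:]]*"//; s/"$//'
}

# Succeed only when the served id equals the expected one.
# Arguments: [models URL] [expected id]; both defaults come from the steps above.
check_model_id() {
  expected="${2:-/models/llama-model}"
  actual="$(curl -s "${1:-http://localhost:8000/v1/models}" | first_model_id)"
  if [ "$actual" = "$expected" ]; then
    echo "OK: serving $actual"
  else
    echo "MISMATCH: expected $expected, got ${actual:-nothing}" >&2
    return 1
  fi
}

# Usage: check_model_id
```

A non-zero exit status makes the check usable in scripts, for example check_model_id || docker logs vllm-server.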
Common failures
- Shared memory size too small: Increase --shm-size to 4g or 8g if errors occur during heavy loads.
- GPU memory exhausted at startup: Lower --gpu-memory-utilization to 0.7 or use a quantized model variant.
- Model path not accessible inside container: Verify with docker exec vllm-server ls /models.
- Port 8000 conflict: Use -p 8001:8000 and update the client URL accordingly.
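To see the fixes above combined, the sketch below prints (without running) the docker command from the steps with the fallback values applied. The environment variable names MODEL_DIR, HOST_PORT, SHM_SIZE, and GPU_MEM_UTIL are hypothetical and exist only in this example. Their defaults are the fallback values from the list above.

```shell
#!/bin/sh
# Print (do not run) the docker command from the steps with fallback values applied.
# The variable names are hypothetical; the defaults are the fallback values listed above.
vllm_run_cmd() {
  printf 'docker run -d --gpus all -v %s:/models -p %s:8000 --shm-size=%s --name vllm-server vllm/vllm-openai:latest --model /models/llama-model --gpu-memory-utilization %s\n' \
    "${MODEL_DIR:-/path/to/models}" \
    "${HOST_PORT:-8001}" \
    "${SHM_SIZE:-4g}" \
    "${GPU_MEM_UTIL:-0.7}"
}

vllm_run_cmd
```

Because the container name is reused, remove the earlier container with docker rm -f vllm-server before running the printed command. Apply only the overrides that match the failure actually observed. For example, a port conflict alone does not require lowering GPU memory utilization.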
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
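One way to capture this checkpoint is a small script that appends the relevant facts to a file. This is a sketch under assumptions: the container is named vllm-server as in the steps, the API is on port 8000, and nvidia-smi may or may not be on the host PATH. The function name record_run and the default file name are made up for illustration.

```shell
#!/bin/sh
# Append a dated evidence record to a file.
# record_run and the default file name are hypothetical choices for this sketch.
record_run() {
  out="${1:-vllm-run-evidence.txt}"
  {
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "docker: $(docker --version 2>/dev/null || echo unavailable)"
    # .Config.Image is the image reference the container was started from.
    echo "image: $(docker inspect --format '{{.Config.Image}}' vllm-server 2>/dev/null || echo unavailable)"
    # Host GPU name and driver version; skipped when nvidia-smi is not on PATH.
    if command -v nvidia-smi >/dev/null 2>&1; then
      echo "gpu: $(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader)"
    fi
    # The model list doubles as the verification output.
    echo "models: $(curl -s http://localhost:8000/v1/models || echo unavailable)"
  } >> "$out"
  echo "wrote $out"
}

# Usage: record_run            (appends to vllm-run-evidence.txt)
#        record_run notes.txt  (appends to a file of your choice)
```

The record does not include the exact docker run command, so paste that into the same file by hand.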