How to configure GPU access in Docker Compose for AI inference
NVIDIA Container Toolkit, docker-compose.yml
What this does
This guide configures GPU passthrough to Docker containers using the NVIDIA Container Toolkit within Docker Compose. It covers specifying which GPUs to allocate, capping per-service GPU memory use, sharing one GPU across services, and verifying that CUDA is accessible inside the container. It is a prerequisite for containerized AI inference or training workloads that run on NVIDIA GPUs.
Steps
Install the NVIDIA Container Toolkit:
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
Configure Docker to use the NVIDIA runtime:
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
Verify GPU access works in a test container:
    docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Expected output: the same nvidia-smi output shown on the host.
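If the test container fails, it can help to confirm that Docker actually registered the NVIDIA runtime after the configure and restart steps. The following is a minimal sketch, assuming that `docker info --format '{{json .Runtimes}}'` prints a JSON map keyed by runtime name. The helper names and the sample string are hypothetical and only illustrate the shape of the check.

```python
import json
import subprocess


def has_nvidia_runtime(runtimes_json: str) -> bool:
    """Return True if the JSON map of Docker runtimes has an 'nvidia' key."""
    runtimes = json.loads(runtimes_json)
    return "nvidia" in runtimes


def docker_runtimes_json() -> str:
    """Ask the local Docker daemon for its registered runtimes as JSON.

    Requires the docker CLI and a running daemon; raises on failure.
    """
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Illustrative sample only; real output depends on the host.
SAMPLE = '{"nvidia": {"path": "nvidia-container-runtime"}, "runc": {"path": "runc"}}'
print(has_nvidia_runtime(SAMPLE))
```

On a real host, `has_nvidia_runtime(docker_runtimes_json())` performs the same check against the live daemon.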
In docker-compose.yml, add GPU configuration to the AI service. The modern syntax uses the deploy key:
    services:
      inference:
        image: vllm/vllm-openai:latest
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]
For the legacy Docker Compose syntax (when deploy is not supported), use runtime: nvidia:
    services:
      inference:
        image: vllm/vllm-openai:latest
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=0
This restricts the container to GPU index 0 only.
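For the deploy-based syntax, the device reservation entry can also be generated and validated in code before it is written into docker-compose.yml. The sketch below is an illustration, not part of any Docker tooling: `gpu_reservation` and `service_with_gpu` are hypothetical helpers. It enforces the Compose rule that `count` and `device_ids` cannot both be set on the same entry, and prints the result as JSON for inspection.

```python
import json
from typing import Optional, Sequence


def gpu_reservation(count: Optional[int] = None,
                    device_ids: Optional[Sequence[str]] = None) -> dict:
    """Build one entry for deploy.resources.reservations.devices.

    Exactly one of count or device_ids must be given.
    """
    if (count is None) == (device_ids is None):
        raise ValueError("set exactly one of count or device_ids")
    entry = {"driver": "nvidia", "capabilities": ["gpu"]}
    if count is not None:
        entry["count"] = count
    else:
        # Compose expects device IDs as strings, e.g. ["0", "1"].
        entry["device_ids"] = [str(d) for d in device_ids]
    return entry


def service_with_gpu(image: str, **gpu_kwargs) -> dict:
    """Wrap a reservation entry in the nesting Compose expects for a service."""
    return {
        "image": image,
        "deploy": {
            "resources": {
                "reservations": {"devices": [gpu_reservation(**gpu_kwargs)]}
            }
        },
    }


compose = {
    "services": {
        "inference": service_with_gpu("vllm/vllm-openai:latest", count=1)
    }
}
print(json.dumps(compose, indent=2))
```

Passing `device_ids=["0", "1"]` in place of `count=1` produces the entry for a specific GPU set.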
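To share one GPU across multiple services, assign the same GPU index to each service and cap each service's memory use. Docker does not partition GPU memory between containers, so the cap has to come from the application. The --gpu-memory-utilization flag shown here is a vLLM option, so both services are assumed to run vLLM (image lines omitted):
    services:
      vllm:
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=0
        command: --gpu-memory-utilization 0.5
      embedding:
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=0
        command: --gpu-memory-utilization 0.3
For multi-GPU setups, replace count with device_ids to select a specific GPU set (the two keys are mutually exclusive):
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
Start the stack and verify GPU visibility: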
    docker compose up -d
    docker compose exec inference nvidia-smi
Expected output: the GPU(s) listed as visible inside the container, matching the configuration.
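Comparing the visible GPUs against the configuration by eye gets tedious with several services. One possible way to automate it is to query nvidia-smi in CSV form and parse the result. This is a sketch under stated assumptions: the function names are hypothetical, the service name `inference` comes from the examples above, and the sample rows are made up rather than taken from real hardware.

```python
import subprocess
from typing import Dict, List

# nvidia-smi query that prints one CSV row per GPU, without a header line.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Parse 'index, name, memory.total' rows into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, name, memory = [field.strip() for field in line.split(",", 2)]
        gpus.append({"index": index, "name": name, "memory_total": memory})
    return gpus


def gpus_in_service(service: str = "inference") -> List[Dict[str, str]]:
    """Run the query inside a Compose service (-T disables the pseudo-TTY)."""
    result = subprocess.run(
        ["docker", "compose", "exec", "-T", service, *QUERY],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


# Made-up sample rows, for illustration only.
SAMPLE = "0, Example GPU A, 24576 MiB\n1, Example GPU B, 24576 MiB\n"
print(len(parse_gpu_csv(SAMPLE)))
```

With the stack running, `len(gpus_in_service("inference"))` can be compared against the configured `count` or the length of `device_ids`.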
Verification
docker compose exec inference python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Devices: {torch.cuda.device_count()}')"
Expected output: CUDA available: True, Devices: 1 (or the configured count).
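When the one-liner reports False, a slightly longer check can show whether the cause is a CPU-only PyTorch build or a missing device. The following is a minimal sketch to run inside the container. `cuda_report` is a hypothetical helper; it relies on `torch.version.cuda` being `None` in CPU-only builds.

```python
def cuda_report() -> dict:
    """Summarize what PyTorch can see; never raises if torch is missing."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    available = torch.cuda.is_available()
    # Only enumerate device names when CUDA is actually usable.
    names = (
        [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
        if available
        else []
    )
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None means a CPU-only build
        "cuda_available": available,
        "devices": names,
    }


print(cuda_report())
```

A `cuda_build` of `None` points at the PyTorch package, while a non-empty `cuda_build` with `cuda_available` set to False points at the GPU configuration.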
Common failures
- "could not select device driver 'nvidia'": the NVIDIA Container Toolkit is not installed, or Docker was not restarted after installation. Run sudo nvidia-ctk runtime configure --runtime=docker and sudo systemctl restart docker.
- CUDA available is False despite GPU config: PyTorch inside the container may not be a CUDA-enabled build. Check with pip list | grep torch and install torch with CUDA support if needed.
- Multiple services fail with "CUDA out of memory": when sharing one GPU, the --gpu-memory-utilization values of all services on that GPU must sum to less than 1.0. Reduce individual allocations or use separate GPUs.
- Environment variable NVIDIA_VISIBLE_DEVICES has no effect: this only works with runtime: nvidia, not with the deploy.resources syntax. With deploy.resources, select GPUs through count or device_ids instead. Choose one approach and use it consistently.
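The shared-GPU rule above (fractions on one GPU must sum to less than 1.0) is easy to check before starting the stack. The sketch below is illustrative only: `check_gpu_budget` and its `headroom` parameter are hypothetical, and the fractions are the ones used in the shared-GPU example.

```python
from typing import Dict


def check_gpu_budget(fractions: Dict[str, float], headroom: float = 0.0) -> float:
    """Return the summed fraction, or raise if the services do not fit.

    fractions maps service name -> --gpu-memory-utilization value.
    headroom is an optional extra margin to keep free (an assumption,
    not something the tools require).
    """
    for name, fraction in fractions.items():
        if not 0.0 < fraction <= 1.0:
            raise ValueError(f"{name}: fraction {fraction} is outside (0, 1]")
    total = sum(fractions.values())
    if total >= 1.0 - headroom:
        raise ValueError(
            f"fractions sum to {total:.2f}, limit is below {1.0 - headroom:.2f}"
        )
    return total


total = check_gpu_budget({"vllm": 0.5, "embedding": 0.3})
print(f"{total:.2f}")  # prints 0.80
```

Setting `headroom` to a small value such as 0.1 leaves room for memory the services use outside their configured fraction.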