What this does

This guide configures GPU passthrough to Docker containers using the NVIDIA Container Toolkit within Docker Compose. It covers specifying which GPUs to allocate, setting memory limits, enabling GPU sharing across services, and verifying that CUDA is accessible inside the container. This is the prerequisite step for any containerized AI inference or training workload.

Steps

Install the NVIDIA Container Toolkit (the commands below assume a Debian- or Ubuntu-based host that uses apt):

```
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```

Configure Docker to use the NVIDIA runtime:

```
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
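
One way to confirm that the configure step took effect is to inspect Docker's daemon configuration. The sketch below assumes that nvidia-ctk wrote its changes to /etc/docker/daemon.json (the usual location on Linux) and that the runtime is registered under the name nvidia. The helper name has_nvidia_runtime is illustrative and not part of any tool.

```python
import json
from pathlib import Path


def has_nvidia_runtime(daemon_json_text: str) -> bool:
    """Return True if the daemon config text registers a runtime named 'nvidia'."""
    try:
        config = json.loads(daemon_json_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(config, dict):
        return False
    runtimes = config.get("runtimes", {})
    return isinstance(runtimes, dict) and "nvidia" in runtimes


if __name__ == "__main__":
    # Assumed default path; adjust it if Docker reads its config from elsewhere.
    path = Path("/etc/docker/daemon.json")
    try:
        text = path.read_text()
    except OSError:
        text = "{}"
    print("nvidia runtime registered:", has_nvidia_runtime(text))
```

If this reports False after running the configure command, re-run the two commands above before moving on to the test container.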

Verify GPU access works in a test container:
```
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```
Expected output: the same nvidia-smi output shown on the host.

In docker-compose.yml, add GPU configuration to the AI service. The modern syntax uses the deploy key:

```
services:
  inference:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

For the legacy Docker Compose syntax (when deploy is not supported), use runtime: nvidia:
```
services:
  inference:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
```
This restricts the container to GPU index 0 only.

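To share one GPU across multiple services, expose the same GPU index to each service and cap each one's share of GPU memory. Docker does not enforce a per-container GPU memory limit here; --gpu-memory-utilization is a vLLM server flag, so this example assumes both services run vLLM-based images (image and model arguments are omitted):

```
services:
  vllm:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
    # Fraction of GPU 0's memory this service may claim
    command: --gpu-memory-utilization 0.5
  embedding:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
    # The two fractions together must stay below 1.0
    command: --gpu-memory-utilization 0.3
```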
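
Because nothing stops two services from over-committing the same GPU, it can help to check the planned fractions before starting the stack. The following is a minimal sketch: check_gpu_budget and its headroom parameter (a safety margin of 0.1 by default) are hypothetical and chosen for illustration, not taken from Docker or vLLM.

```python
from typing import Dict


def check_gpu_budget(fractions: Dict[str, float], headroom: float = 0.1) -> float:
    """Return the unallocated fraction of the GPU, or raise if over budget.

    fractions maps a service name to its --gpu-memory-utilization value.
    headroom is an assumed safety margin left unallocated.
    """
    for name, fraction in fractions.items():
        if not 0.0 < fraction <= 1.0:
            raise ValueError(f"{name}: fraction {fraction} is outside (0, 1]")
    total = sum(fractions.values())
    if total > 1.0 - headroom:
        raise ValueError(
            f"services request {total:.2f} of the GPU; limit is {1.0 - headroom:.2f}"
        )
    return 1.0 - total


if __name__ == "__main__":
    # Values from the Compose example: 0.5 for vllm, 0.3 for embedding.
    remaining = check_gpu_budget({"vllm": 0.5, "embedding": 0.3})
    print(f"unallocated fraction: {remaining:.2f}")
```

Setting headroom=0.0 reduces the check to the plain rule that the fractions must not exceed the whole GPU.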

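For multi-GPU setups, use device_ids to pin specific GPUs, or count to request a number of GPUs without pinning them. The two fields are mutually exclusive, so set only one of them on a device entry:

```
# Placed under the service definition
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["0", "1"]
          capabilities: [gpu]
```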
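
When Compose files are generated from a script, the rule that a device entry takes either count or device_ids, but not both, can be enforced in code. This sketch builds the same deploy mapping as a plain Python dictionary; gpu_reservation is a hypothetical helper name, and serializing the result to YAML or JSON is left to the caller.

```python
import json
from typing import Optional, Sequence, Union


def gpu_reservation(
    count: Optional[Union[int, str]] = None,
    device_ids: Optional[Sequence[str]] = None,
) -> dict:
    """Build a Compose 'deploy' mapping that reserves NVIDIA GPUs.

    Exactly one of count or device_ids must be given.
    """
    if (count is None) == (device_ids is None):
        raise ValueError("set exactly one of count or device_ids")
    device = {"driver": "nvidia", "capabilities": ["gpu"]}
    if count is not None:
        device["count"] = count
    else:
        # Compose expects device IDs as strings.
        device["device_ids"] = [str(d) for d in device_ids]
    return {"resources": {"reservations": {"devices": [device]}}}


if __name__ == "__main__":
    print(json.dumps(gpu_reservation(device_ids=["0", "1"]), indent=2))
```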

Start the stack and verify GPU visibility:
```
docker compose up -d
docker compose exec inference nvidia-smi
```
Expected output: the GPU(s) listed as visible inside the container, matching the configuration.
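
To compare the visible GPUs against the configuration automatically, nvidia-smi can emit machine-readable CSV via its --query-gpu and --format options. The sketch below assumes the service is named inference as in the examples, that docker compose is on the PATH, and that memory.total is reported as a plain number of MiB; parse_gpu_list and visible_gpus are illustrative names.

```python
import subprocess
from typing import Dict, List, Union

GpuInfo = Dict[str, Union[int, str]]


def parse_gpu_list(csv_text: str) -> List[GpuInfo]:
    """Parse 'index, name, memory.total' rows printed by nvidia-smi in CSV mode."""
    gpus: List[GpuInfo] = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        parts = [part.strip() for part in line.split(",")]
        gpus.append({
            "index": int(parts[0]),
            # Rejoin the middle fields in case a GPU name contains a comma.
            "name": ", ".join(parts[1:-1]),
            "memory_mib": int(parts[-1]),
        })
    return gpus


def visible_gpus(service: str = "inference") -> List[GpuInfo]:
    """Run nvidia-smi inside a Compose service and return the GPUs it sees."""
    cmd = [
        "docker", "compose", "exec", "-T", service,
        "nvidia-smi",
        "--query-gpu=index,name,memory.total",
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_list(result.stdout)
```

Calling len(visible_gpus()) from the directory that holds docker-compose.yml gives a number to compare against the configured count or device_ids.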

Verification

```
docker compose exec inference python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Devices: {torch.cuda.device_count()}')"
```

Expected output: CUDA available: True, Devices: 1 (or the configured count).

Common failures

"could not select device driver 'nvidia'": the NVIDIA Container Toolkit is not installed, or Docker was not restarted after the runtime was configured. Install the toolkit if it is missing, then run sudo nvidia-ctk runtime configure --runtime=docker followed by sudo systemctl restart docker.
CUDA available is False despite GPU config: the PyTorch build inside the container may be CPU-only. Check inside the container with pip list | grep torch (torch.version.cuda is also None on a CPU-only build) and install a CUDA-enabled torch build if needed.
Multiple services fail with "CUDA out of memory": when services share one GPU, their --gpu-memory-utilization values must sum to less than 1.0. Lower the individual values or move the services to separate GPUs.
Environment variable NVIDIA_VISIBLE_DEVICES has no effect: it only works with runtime: nvidia, not with the deploy.resources syntax, where count or device_ids selects the GPUs instead. Choose one approach and use it consistently.
