HOW-TO · OPS

How to configure GPU access in Docker Compose for AI inference

Intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

NVIDIA Container Toolkit, docker-compose.yml

What this does

This guide configures GPU passthrough to Docker containers using the NVIDIA Container Toolkit within Docker Compose. It covers specifying which GPUs to allocate, setting memory limits, enabling GPU sharing across services, and verifying that CUDA is accessible inside the container. This is the prerequisite step for any containerized AI inference or training workload.

Steps

  1. Install the NVIDIA Container Toolkit:

    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
    
  2. Configure Docker to use the NVIDIA runtime:

    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
    
  3. Verify GPU access works in a test container:

    docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
    

    Expected output: the same nvidia-smi output shown on the host.

  4. In docker-compose.yml, add GPU configuration to the AI service. The modern syntax uses the deploy key:

    services:
      inference:
        image: vllm/vllm-openai:latest
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]
    
  5. For the legacy Docker Compose syntax (when deploy is not supported), use runtime: nvidia:

    services:
      inference:
        image: vllm/vllm-openai:latest
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=0
    

    This restricts the container to GPU index 0 only.

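  6. To share one GPU across multiple services, assign the same GPU index to each service and cap each service's share of GPU memory. The --gpu-memory-utilization flag shown here is a vLLM option, so it applies only to services that run vLLM; other inference servers need their own equivalent limit: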

    services:
      vllm:
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=0
        command: --gpu-memory-utilization 0.5
      embedding:
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=0
        command: --gpu-memory-utilization 0.3
    
  7. For multi-GPU setups, replace count with device_ids to select specific GPUs (the two fields are mutually exclusive):

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    
  8. Start the stack and verify GPU visibility:

    docker compose up -d
    docker compose exec inference nvidia-smi
    

    Expected output: the GPU(s) listed as visible inside the container, matching the configuration.
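
The reservation blocks in steps 4 and 7 differ only in whether they set count or device_ids. As a minimal sketch, the Python helpers below build that structure as plain dictionaries and reject a configuration that sets both fields. The names gpu_reservation and gpu_service are hypothetical and are not part of Docker or Compose; the code only assembles data and never talks to the Docker daemon.

```python
from typing import Optional, Sequence


def gpu_reservation(count: Optional[int] = None,
                    device_ids: Optional[Sequence[str]] = None) -> dict:
    """Build one entry for deploy.resources.reservations.devices.

    Hypothetical helper for illustration only.
    """
    # count and device_ids are mutually exclusive in a device reservation.
    if count is not None and device_ids is not None:
        raise ValueError("set either count or device_ids, not both")

    entry = {"driver": "nvidia", "capabilities": ["gpu"]}
    if device_ids is not None:
        # Compose expects device IDs as strings, as in step 7.
        entry["device_ids"] = [str(i) for i in device_ids]
    else:
        # Assumption: fall back to a single GPU, mirroring step 4.
        entry["count"] = 1 if count is None else count
    return entry


def gpu_service(image: str, **reservation_args) -> dict:
    """Wrap a reservation in the service structure used in step 4."""
    return {
        "image": image,
        "deploy": {
            "resources": {
                "reservations": {
                    "devices": [gpu_reservation(**reservation_args)]
                }
            }
        },
    }
```

Calling gpu_service("vllm/vllm-openai:latest", count=1) reproduces the service body from step 4, and gpu_reservation(device_ids=["0", "1"]) reproduces the devices entry from step 7.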

Verification

docker compose exec inference python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Devices: {torch.cuda.device_count()}')"

Expected output: CUDA available: True, Devices: 1 (or the configured count).
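
To compare what the container sees against the host without reading nvidia-smi tables by eye, the query output can be parsed. The following is a minimal sketch that assumes nvidia-smi is on the PATH and that the service is named inference as in the steps above. The function names parse_gpu_list and visible_gpus are hypothetical.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU: index, name, memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_list(text: str) -> list:
    """Parse the CSV lines produced by QUERY into a list of dictionaries."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split off the first and last fields so a comma in the name is kept.
        index, rest = line.split(",", 1)
        name, memory = rest.rsplit(",", 1)
        gpus.append({
            "index": int(index.strip()),
            "name": name.strip(),
            "memory_mib": int(memory.strip()),
        })
    return gpus


def visible_gpus(prefix=()) -> list:
    """Run QUERY, optionally behind a command prefix, and parse the result."""
    result = subprocess.run(
        list(prefix) + QUERY, check=True, capture_output=True, text=True
    )
    return parse_gpu_list(result.stdout)
```

On the host, visible_gpus() lists every GPU. Passing the prefix ["docker", "compose", "exec", "-T", "inference"] runs the same query inside the container, so the two lists can be compared against the configured count or device_ids.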

Common failures

  • "could not select device driver 'nvidia'": the NVIDIA Container Toolkit is not installed, or Docker was not configured and restarted after installation. Run sudo nvidia-ctk runtime configure --runtime=docker, then sudo systemctl restart docker.
  • CUDA available is False despite GPU config: the PyTorch build inside the container may not be CUDA-enabled. Check with pip list | grep torch and install a CUDA build of torch if needed.
  • Multiple services fail with "CUDA out of memory": when services share one GPU, their --gpu-memory-utilization values must sum to less than 1.0. Reduce the individual allocations or use separate GPUs.
  • Environment variable NVIDIA_VISIBLE_DEVICES has no effect: this variable only works with runtime: nvidia, not with the deploy.resources syntax. Choose one approach and use it consistently.
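
The shared-GPU memory rule above can be checked before the stack is started. This is a minimal sketch, assuming each service's GPU index and --gpu-memory-utilization fraction are known in advance. The function name oversubscribed_gpus and the input shape are hypothetical.

```python
from collections import defaultdict


def oversubscribed_gpus(services: dict, limit: float = 1.0) -> dict:
    """Return the GPUs whose summed memory fractions reach or exceed limit.

    services maps a service name to a (gpu_index, fraction) pair.
    """
    totals = defaultdict(float)
    for _name, (gpu, fraction) in services.items():
        totals[gpu] += fraction
    # The fractions must sum to less than the limit, so equality is flagged too.
    return {
        gpu: round(total, 3)
        for gpu, total in totals.items()
        if total >= limit
    }
```

With the values from step 6, oversubscribed_gpus({"vllm": ("0", 0.5), "embedding": ("0", 0.3)}) returns an empty dictionary, because the fractions on GPU 0 sum to 0.8. The check covers only the configured fractions; memory used by other processes on the same GPU is not accounted for.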
