Hardware & infrastructure

Vulkan compute

Vulkan compute is the compute-shader side of the cross-platform Vulkan API; it runs inference workloads on GPUs without requiring CUDA. In local AI, it lets operators run models on AMD, Intel, and older NVIDIA GPUs that lack full CUDA support, often via llama.cpp's Vulkan backend. On NVIDIA hardware it is typically 10-30% slower than CUDA, but it enables much broader hardware compatibility.

Deeper dive

Vulkan is a low-overhead, cross-platform graphics and compute API developed by the Khronos Group. Its compute shaders can be used for neural network inference, bypassing the need for vendor-specific stacks like CUDA or ROCm. In the context of local AI, llama.cpp provides a Vulkan backend that implements model operations as Vulkan compute shaders, allowing inference on GPUs from AMD and Intel, including integrated graphics. The main trade-off is performance: on NVIDIA GPUs, Vulkan compute typically achieves 70-90% of CUDA throughput because of driver overhead and less heavily optimized kernels. However, for operators with non-NVIDIA hardware, Vulkan compute is often the only way to get GPU acceleration. It also supports multi-GPU setups, so a model can be split across the memory of several GPUs. The backend is under active development, with ongoing improvements in operator fusion and memory management.
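To make the 70-90% figure concrete, here is a minimal sketch that turns a measured CUDA throughput into the range one might expect from Vulkan on the same NVIDIA card. The function name and the 50 tok/s baseline are illustrative assumptions, and the 0.70 and 0.90 factors are simply the rule of thumb quoted above, not a benchmark.

```python
def vulkan_throughput_range(cuda_tok_per_s, low=0.70, high=0.90):
    """Return the (low, high) tok/s range expected from Vulkan for a CUDA baseline.

    The default factors are the 70-90% rule of thumb, not measured values.
    """
    if cuda_tok_per_s < 0:
        raise ValueError("throughput must be non-negative")
    return cuda_tok_per_s * low, cuda_tok_per_s * high


# 50 tok/s is a made-up CUDA baseline used only for illustration.
low_est, high_est = vulkan_throughput_range(50.0)
print(f"Expected Vulkan throughput: {low_est:.0f}-{high_est:.0f} tok/s")
```

Treat the result as a planning estimate only; actual numbers depend on the driver, the model, and the quantization.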

Practical example

An operator with an AMD RX 7900 XTX (24 GB VRAM) cannot use CUDA. By compiling llama.cpp with the Vulkan backend (cmake -DGGML_VULKAN=ON; older releases used -DLLAMA_VULKAN=ON), they can run Llama 3.1 70B at Q4_K_M (~40 GB). Because the model is larger than the card's VRAM, only part of it runs on the GPU and the remaining layers stay in system RAM, giving ~2 tok/s. Without Vulkan, they would be limited to CPU-only inference at ~0.5 tok/s.
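A rough way to decide how many layers to offload in a case like this is to divide the usable VRAM by the average size of one layer. The sketch below assumes all layers are the same size and reserves 2 GB of VRAM for the KV cache and scratch buffers; both are simplifying assumptions, and the function name is hypothetical. Llama 3.1 70B has 80 transformer layers.

```python
import math


def estimate_gpu_layers(model_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Rough count of layers that fit in VRAM, assuming equally sized layers.

    reserve_gb is an assumed allowance for KV cache and buffers, not a measured value.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# ~40 GB model, 80 layers, 24 GB card: (24 - 2) / 0.5 GB per layer
layers = estimate_gpu_layers(model_gb=40, n_layers=80, vram_gb=24)
print(layers)  # 44
```

The result is a starting value for -ngl; if the run fails with an out-of-memory error, lower it, since long contexts need more than the assumed reserve.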

Workflow example

In llama.cpp, enable Vulkan by building with -DGGML_VULKAN=ON (older releases used -DLLAMA_VULKAN=ON). Then run inference with llama-cli -m model.gguf -ngl 99 (the binary was named ./main in older releases) to offload all layers to the GPU; use a lower -ngl value if the model does not fit in VRAM. The runtime will select the Vulkan backend automatically if available. In LM Studio, select the Vulkan backend in Settings > Engine > Backend. On first run, the backend compiles shaders (takes 1-5 minutes), then runs inference normally.
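Before building, it can help to confirm that the system exposes a Vulkan device at all. The sketch below shells out to the vulkaninfo tool from the Vulkan SDK and pulls device names from its summary output. It assumes vulkaninfo is on the PATH and that the summary contains lines of the form "deviceName = ..."; the helper names are hypothetical, and this is not part of llama.cpp or LM Studio.

```python
import shutil
import subprocess


def parse_device_names(summary_text):
    """Extract device names from `vulkaninfo --summary`-style text."""
    names = []
    for line in summary_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "deviceName":
            names.append(value.strip())
    return names


def list_vulkan_devices():
    """Return device names reported by vulkaninfo, or [] if it is missing or fails."""
    exe = shutil.which("vulkaninfo")
    if exe is None:
        return []
    try:
        result = subprocess.run(
            [exe, "--summary"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return parse_device_names(result.stdout)


if __name__ == "__main__":
    devices = list_vulkan_devices()
    print(devices if devices else "No Vulkan devices detected")
```

An empty list usually points to a missing driver or SDK rather than a llama.cpp problem, so it is worth resolving before spending time on the build.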

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work