llama.cpp build failed — get the right backend compiling
Most llama.cpp build failures trace to a missing toolkit (CUDA, Metal, Vulkan SDK), wrong compiler version, or a stale CMake cache. Diagnose in order: PATH first, CMake version second, GCC/MSVC third.
Diagnostic order — most likely first
CUDA toolkit not in PATH
`nvcc --version` returns 'command not found.' CMake errors with 'CUDA compiler not found' or 'nvcc not found in PATH.'
Add CUDA bin to PATH. Linux: `export PATH=/usr/local/cuda/bin:$PATH`. Windows: add `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.x\bin` via System → Environment Variables. Verify with `nvcc --version` before retrying build.
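The PATH fix can be scripted and verified in one step. A minimal sketch for Linux, assuming the default install prefix `/usr/local/cuda` (adjust if your toolkit lives elsewhere):

```shell
# Put the CUDA toolchain on PATH for the current shell, then verify.
# /usr/local/cuda is the default Linux prefix -- adjust if yours differs.
export PATH=/usr/local/cuda/bin:$PATH

if command -v nvcc >/dev/null 2>&1; then
    nvcc --version    # should print the toolkit release
else
    echo "nvcc still not found -- is the CUDA toolkit actually installed?" >&2
fi
```

Add the `export` line to `~/.bashrc` (or your shell's rc file) so it persists across sessions.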
CMake too old (need 3.18+)
`cmake --version` shows < 3.18. CMake errors with 'unknown command' or 'feature not supported.'
Upgrade CMake. Linux: `sudo snap install cmake --classic` or build from source. macOS: `brew upgrade cmake`. Windows: download installer from cmake.org. llama.cpp HEAD typically needs CMake 3.20+.
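Build scripts can gate on the version check itself rather than failing mid-configure. A sketch; the `version_ge` helper is illustrative, not part of llama.cpp:

```shell
# Returns success when version $1 is at least version $2.
# sort -V does the comparison; available in GNU coreutils and busybox.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

have=$(cmake --version 2>/dev/null | awk 'NR==1 {print $3}')
if [ -n "$have" ] && version_ge "$have" "3.18"; then
    echo "CMake $have is new enough"
else
    echo "CMake missing or older than 3.18 -- upgrade before configuring" >&2
fi
```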
Mismatched CUDA + GCC versions
Build fails with 'CUDA compiler does not support GCC 14' or similar. Common on Arch / Fedora / rolling distros where system GCC is newer than CUDA's supported version.
Install matching GCC: `sudo apt install gcc-12 g++-12` (Ubuntu) or `sudo pacman -S gcc12` (Arch). Configure CMake to use it: `cmake -B build -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 -DGGML_CUDA=ON`.
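On rolling distros you can probe for a supported GCC and feed it to CMake automatically. A sketch that assumes GCC 11–13 are acceptable for your CUDA release (check the CUDA release notes for the actual ceiling):

```shell
# Probe for the newest versioned GCC and build the configure line from it.
for v in 13 12 11; do
    if command -v "gcc-$v" >/dev/null 2>&1; then
        cc="gcc-$v"; cxx="g++-$v"
        break
    fi
done

cmd="cmake -B build -DCMAKE_C_COMPILER=${cc:-gcc} -DCMAKE_CXX_COMPILER=${cxx:-g++} -DGGML_CUDA=ON"
echo "$cmd"    # run this once it prints the compiler you expect
```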
Metal flag on non-Apple machine
Build fails immediately with 'Metal framework not found' or 'this target is only available on macOS.'
Use the right backend flag. macOS: `-DGGML_METAL=ON`. Linux/Windows NVIDIA: `-DGGML_CUDA=ON`. AMD: `-DGGML_HIP=ON` (with ROCm) or `-DGGML_VULKAN=ON`. Don't mix Metal with anything else.
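A small platform switch keeps the mutually exclusive flags straight. A sketch that assumes NVIDIA on non-Apple machines; swap in `-DGGML_HIP=ON` or `-DGGML_VULKAN=ON` for AMD:

```shell
# Choose exactly one GPU backend flag per platform.
case "$(uname -s)" in
    Darwin) backend="-DGGML_METAL=ON" ;;  # Apple GPUs only
    *)      backend="-DGGML_CUDA=ON"  ;;  # assumes NVIDIA; AMD wants HIP or Vulkan
esac

echo "cmake -B build $backend"
```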
Vulkan SDK missing
`vulkaninfo` returns 'command not found.' Build with `-DGGML_VULKAN=ON` fails at 'Vulkan headers not found.'
Install the Vulkan SDK from LunarG (https://vulkan.lunarg.com). Linux: add LunarG's apt repository, then `sudo apt install vulkan-sdk` (the package isn't in stock Ubuntu repos). macOS: install the LunarG SDK, which bundles MoltenVK. Windows: the SDK installer sets the VULKAN_SDK env var. Verify with `vulkaninfo` before retrying.
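A quick preflight check saves a failed configure. This sketch only reports what's visible; it installs nothing:

```shell
# Report whether Vulkan tooling is visible before passing -DGGML_VULKAN=ON.
vulkan_ok=0
if command -v vulkaninfo >/dev/null 2>&1; then
    vulkan_ok=1
    echo "vulkaninfo found: $(command -v vulkaninfo)"
else
    echo "vulkaninfo missing -- install the LunarG SDK first" >&2
fi
if [ -n "${VULKAN_SDK:-}" ]; then
    echo "VULKAN_SDK=$VULKAN_SDK"
fi
```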
Stale CMake build directory
Build worked yesterday, broken today after pulling new commits. Cryptic CMake errors mentioning 'cached value' or 'previous configuration.'
Wipe and rebuild: `rm -rf build && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j`. Alternatively `cmake --fresh -B build` (CMake 3.24+) preserves the directory but resets cached vars.
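The two reset strategies can be folded into one snippet that picks whichever your CMake supports (`--fresh` landed in CMake 3.24):

```shell
# Reset cached CMake state: prefer --fresh (3.24+), fall back to rm -rf.
ver=$(cmake --version 2>/dev/null | awk 'NR==1 {split($3, v, "."); print v[1] * 100 + v[2]}')
if [ "${ver:-0}" -ge 324 ]; then
    echo "cmake --fresh -B build -DGGML_CUDA=ON"
else
    echo "rm -rf build && cmake -B build -DGGML_CUDA=ON"
fi
# then: cmake --build build --config Release -j
```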
Frequently asked questions
Should I build from source or use a prebuilt binary?
Prebuilt for evaluation (fastest path to a working `llama-cli`). Build from source if you want custom flags (FA, ROCm gfx targets, specific CUDA arch), are on a non-standard distro, or need the absolute latest commit. The official releases ship reasonable defaults for most users.
Best build flags for inference on RTX 4090?
`cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA=ON -DCMAKE_CUDA_ARCHITECTURES=89` then `cmake --build build --config Release -j`. The architecture flag (89 for Ada / RTX 40-series) trims the binary to your card's compute capability and speeds the build.
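If you're unsure of your card's compute capability, newer NVIDIA drivers can report it directly. A sketch; `native` is a valid fallback value on CMake 3.24+, which lets CMake detect the architecture itself:

```shell
# Query compute capability (e.g. "8.9" on Ada) and turn it into a CMake arch.
cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1)
arch=$(printf '%s' "$cap" | tr -d '. ')
flag="-DCMAKE_CUDA_ARCHITECTURES=${arch:-native}"
echo "cmake -B build -DGGML_CUDA=ON $flag"
```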
Build keeps OOMing on my system — why?
CUDA compilation is RAM-intensive. The `-j` flag spawns N parallel compile jobs; if RAM < jobs × 2 GB, you'll OOM. Lower parallelism: `cmake --build build --config Release -j 4` (or `-j 2` on tight RAM). Building takes longer but won't OOM.
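You can derive a safe job count from installed RAM instead of guessing. A sketch using the rough 2 GB-per-job rule of thumb above; Linux-only, since it reads `/proc/meminfo` (it falls back to 8 GB if the file is missing):

```shell
# Size -j to available RAM: roughly one CUDA compile job per 2 GB.
mem_gb=$(awk '/MemTotal/ {print int($2 / 1024 / 1024)}' /proc/meminfo 2>/dev/null)
jobs=$(( ${mem_gb:-8} / 2 ))
[ "$jobs" -ge 1 ] || jobs=1

echo "cmake --build build --config Release -j $jobs"
```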
Related troubleshooting
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
ROCm is finicky on consumer AMD GPUs in 2026. Here's the install order, the gfx-version override that fixes 80% of detection failures, and when to give up and use Vulkan.
PyTorch falsely reporting no CUDA is the most common Python ML setup failure. The cause is almost always: wrong PyTorch wheel for your CUDA version, or a CPU-only build accidentally installed.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: