How to compile llama.cpp from source
Prerequisites
Git, CMake, a C++ compiler (GCC 11+ or Clang 15+), Python 3.8+
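Before starting, it can help to confirm that the prerequisite tools are on PATH. The following is a minimal sketch, not part of llama.cpp: `check_tools` is a hypothetical helper name, and it only tests for presence, not for the minimum versions listed above.

```shell
# check_tools: report which of the named commands are missing from PATH.
# Returns 0 when all are present, 1 otherwise. Versions are not checked.
check_tools() {
    ct_missing=0
    for ct_tool in "$@"; do
        if ! command -v "$ct_tool" >/dev/null 2>&1; then
            echo "missing: $ct_tool" >&2
            ct_missing=1
        fi
    done
    return "$ct_missing"
}

# Example call (the compiler name g++ is an assumption; use clang++ if
# that is your toolchain):
#   check_tools git cmake g++ python3
```

A non-zero return status means at least one tool was not found; the missing names are printed to stderr.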
What this does
Builds the llama.cpp library and companion CLI and server binaries from the upstream Git repository. The result is a set of native executables (llama-cli, llama-server, llama-quantize) compiled and optimized for the host system architecture.
Steps
1. Clone the repository.
       git clone https://github.com/ggerganov/llama.cpp.git
       cd llama.cpp
   Expected output: repository cloned; llama.cpp/ directory created.
2. Create and enter a build directory. Out-of-source builds keep the source tree clean.
       mkdir build && cd build
   Expected output: build/ directory exists.
3. Configure CMake with the desired backend. GPU support is enabled by adding -DGGML_CUDA=ON (older checkouts use -DLLAMA_CUDA=ON).
       cmake .. -DGGML_CUDA=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_BUILD_CLI=ON
   Expected output: -- Configuring done followed by -- Generating done.
4. Compile with parallel jobs.
       cmake --build . --config Release -j$(nproc)
   Expected output: a [100%] Built target <target-name> line at the end of the build, with one Built target line per binary.
5. Verify the CLI binary runs. CMake places the executables in bin/ under the build directory.
       ./bin/llama-cli --version
   Expected output: a version line of the form version: <build-number> (<commit-hash>).
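The steps above can be collected into a single script. This is a sketch under stated assumptions, not an official build script: `run` and `build_llama` are hypothetical helper names, backend flags are passed in by the caller rather than hard-coded, and `DRY_RUN=1` prints each command instead of executing it, so the sequence can be reviewed first.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_llama: clone, configure, and compile. Any arguments are passed
# straight to the CMake configure step (for example a GPU backend flag).
build_llama() {
    # nproc is Linux-specific; fall back to 4 jobs (an arbitrary default).
    bl_jobs=$(nproc 2>/dev/null || echo 4)
    run git clone https://github.com/ggerganov/llama.cpp.git &&
    run cd llama.cpp &&
    run mkdir build &&
    run cd build &&
    run cmake .. "$@" &&
    run cmake --build . --config Release -j"$bl_jobs"
}

# Preview the commands without running anything:
#   DRY_RUN=1 build_llama -DLLAMA_BUILD_SERVER=ON
```

Chaining with `&&` stops at the first failing step, so a failed clone or configure does not trigger a compile in the wrong directory.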
Verification
# Run from the repository root (cd .. if you are still inside build/)
./build/bin/llama-cli -m models/some-model.gguf -p "Hello world" -n 10 --no-display-prompt 2>/dev/null
# Expected: model loads and outputs a 10-token completion without errors
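Before loading a model, a quick check that all three executables were produced can save time. The sketch below uses a hypothetical helper, `verify_binaries`, which takes the directory to inspect as its argument; which directory holds the binaries depends on the build layout (with CMake builds it is typically bin/ under the build directory, but treat that as an assumption).

```shell
# verify_binaries DIR: check that each expected executable exists in DIR.
# Returns 0 when all three are present and executable, 1 otherwise.
verify_binaries() {
    vb_dir=$1
    vb_status=0
    for vb_name in llama-cli llama-server llama-quantize; do
        if [ -x "$vb_dir/$vb_name" ]; then
            echo "ok: $vb_name"
        else
            echo "missing: $vb_name" >&2
            vb_status=1
        fi
    done
    return "$vb_status"
}

# Example call (directory is an assumption about the build layout):
#   verify_binaries build/bin
```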
Common failures
- "nvcc: command not found": CUDA is not in PATH. Set export PATH=/usr/local/cuda/bin:$PATH before running cmake.
- Header ggml.h not found: Submodules not initialized. Run git submodule update --init --recursive.
- CUDA compute capability mismatch: Set -DCMAKE_CUDA_ARCHITECTURES to the GPU's compute capability, e.g. -DCMAKE_CUDA_ARCHITECTURES=75 for compute capability 7.5 (Turing).
- Out of RAM during compilation: Reduce concurrency with -j4 on systems with limited RAM.
- Python bindings not built: Install Python dev headers, then reconfigure with -DLLAMA_PYTHON_BINDINGS=ON.
- Missing GLIBCXX symbols at runtime: The system libstdc++ is older than the build toolchain. Install a newer libstdc++-dev package and relink.
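For the out-of-RAM case, the job count can be derived from available memory instead of being fixed at 4. The following is a rough sketch: `pick_jobs` is a hypothetical helper, and the figure of about 2 GiB per compile job is an assumption chosen for illustration, not a measured value for llama.cpp.

```shell
# pick_jobs MEM_KB CORES: print a -j value limited by memory and core count.
# Assumes roughly 2 GiB of RAM per compile job (a guess; tune as needed).
pick_jobs() {
    pj_mem_kb=$1
    pj_cores=$2
    pj_per_job_kb=$((2 * 1024 * 1024))
    pj_jobs=$((pj_mem_kb / pj_per_job_kb))
    # Always allow at least one job.
    if [ "$pj_jobs" -lt 1 ]; then
        pj_jobs=1
    fi
    # Never exceed the number of cores.
    if [ "$pj_jobs" -gt "$pj_cores" ]; then
        pj_jobs=$pj_cores
    fi
    echo "$pj_jobs"
}

# Example on Linux, reading available memory (in kB) from /proc/meminfo:
#   jobs=$(pick_jobs "$(awk '/MemAvailable/ {print $2}' /proc/meminfo)" "$(nproc)")
#   cmake --build . --config Release -j"$jobs"
```

With 8 GiB available and 16 cores, `pick_jobs 8388608 16` prints 4; with ample memory the result is capped at the core count.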