What this does

Builds the llama.cpp library and companion CLI and server binaries from the upstream Git repository. The result is a set of native executables (llama-cli, llama-server, llama-quantize) compiled and optimized for the host system architecture.

Steps

Clone the repository.
```
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
```
Expected output: repository cloned; llama.cpp/ directory created.
Create and enter a build directory. Out-of-source builds keep the source tree clean.
```
mkdir build && cd build
```
Expected output: build/ directory exists.
Configure CMake with the desired backend. CUDA GPU support is enabled by adding -DGGML_CUDA=ON (older releases used -DLLAMA_CUDA=ON, which is now deprecated); omit the flag for a CPU-only build.
```
cmake .. -DGGML_CUDA=ON -DLLAMA_BUILD_SERVER=ON
```
Expected output: -- Configuring done followed by -- Generating done.
Compile with parallel jobs.
```
cmake --build . --config Release -j$(nproc)
```
Expected output: the final line reads [100%] Built target <target-name> for each binary.
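Verify the CLI binary runs. CMake places the executables under bin/ inside the build directory.
```
./bin/llama-cli --version
```
Expected output: a version line containing the build number and commit hash, in the form version: <build-number> (<commit>).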
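
The steps above can be collected into one script. What follows is a minimal sketch, not an official build script: the function names (detect_jobs, run, build_llama_cpp) and the DRY_RUN and LLAMA_CMAKE_FLAGS variables are hypothetical, and it uses cmake's -S/-B options (CMake 3.13 or newer) instead of changing directories. Backend flags from the configure step go in LLAMA_CMAKE_FLAGS, which is empty by default.

```shell
#!/bin/sh
# Sketch: clone, configure, and build llama.cpp in one go.
# Set DRY_RUN=1 to print the commands instead of running them.

# Number of parallel jobs: nproc on Linux, sysctl on macOS/BSD, else 1.
detect_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    elif command -v sysctl >/dev/null 2>&1 && sysctl -n hw.ncpu >/dev/null 2>&1; then
        sysctl -n hw.ncpu
    else
        echo 1
    fi
}

# Run a command, or only echo it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_llama_cpp() {
    src="${1:-llama.cpp}"   # source directory (hypothetical default)
    njobs="$(detect_jobs)"
    # Skip the clone when the source directory already exists.
    if [ ! -d "$src" ]; then
        run git clone https://github.com/ggerganov/llama.cpp.git "$src" || return 1
    fi
    # LLAMA_CMAKE_FLAGS is intentionally unquoted so it splits into separate flags.
    run cmake -S "$src" -B "$src/build" ${LLAMA_CMAKE_FLAGS:-} || return 1
    run cmake --build "$src/build" --config Release -j "$njobs" || return 1
    run "$src/build/bin/llama-cli" --version
}
```

Running DRY_RUN=1 build_llama_cpp previews the commands without executing them, which is a cheap way to check the flags before a long compile.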

Verification

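# Run from the build directory; replace the model path with a GGUF file you have downloaded.
./bin/llama-cli -m models/some-model.gguf -p "Hello world" -n 10 --no-display-prompt 2>/dev/null
# Expected: the model loads and prints a completion of up to 10 tokens without errors (2>/dev/null hides the load log; drop it when debugging)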
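
If the verification command fails immediately, it can help to confirm that the model path points at a GGUF file before suspecting the build. GGUF files begin with the four ASCII bytes GGUF. The sketch below checks only that magic value; is_gguf is a hypothetical helper name, and a passing check does not guarantee the file is complete or supported by your build.

```shell
#!/bin/sh
# Sketch: check that a file starts with the 4-byte GGUF magic ("GGUF").
is_gguf() {
    # Reject missing paths and directories early.
    [ -f "$1" ] || return 1
    # Read the first 4 bytes; strip NUL bytes so the shell does not warn about them.
    magic="$(head -c 4 "$1" 2>/dev/null | tr -d '\000')"
    [ "$magic" = "GGUF" ]
}
```

For example, is_gguf models/some-model.gguf || echo "not a GGUF file" reports a wrong or truncated download before llama-cli is started.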

Common failures

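nvcc: command not found: the CUDA toolkit is not on PATH. Run export PATH=/usr/local/cuda/bin:$PATH (adjust to your install location) before running cmake.
Header ggml.h not found: the source tree is incomplete. If your checkout uses submodules, run git submodule update --init --recursive; otherwise re-clone the repository.
CUDA compute capability mismatch: the binaries were built for a different GPU architecture. Set -DCMAKE_CUDA_ARCHITECTURES to your GPU's compute capability, for example 75 for Turing-class cards.
Out of RAM during compilation: each parallel job holds its own compiler process in memory. Reduce concurrency, for example -j4 or lower, on systems with limited RAM.
Python bindings not built: the llama.cpp CMake build does not produce Python bindings, so there is no CMake flag to enable them. They are provided by the separate llama-cpp-python project.
Missing GLIBCXX symbols at runtime: the libstdc++ on the machine running the binary is older than the one used to build it. Install a newer libstdc++ runtime on that machine, or rebuild with the toolchain that matches it.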
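
Several of the failures above can be caught before configuring. The sketch below is one possible pre-flight check, with hypothetical helper names (missing_tools, pick_jobs): it reports required tools that are absent from PATH and picks a job count bounded by available memory. The figure of roughly 2 GiB of RAM per compile job is an assumption, not a measured value for llama.cpp; tune it for your system.

```shell
#!/bin/sh
# Sketch: pre-flight checks for the failures listed above.

# Print the names of tools that are not on PATH; return 1 if any are missing.
missing_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    [ -z "$missing" ] && return 0
    echo "missing:$missing"
    return 1
}

# Pick a job count from the core count and available memory in KiB.
# Assumption: about 2 GiB of RAM per compile job (a rule of thumb).
pick_jobs() {
    cores="$1"
    mem_kib="$2"
    by_mem=$(( mem_kib / 1048576 / 2 ))
    # Always allow at least one job.
    [ "$by_mem" -lt 1 ] && by_mem=1
    if [ "$by_mem" -lt "$cores" ]; then
        echo "$by_mem"
    else
        echo "$cores"
    fi
}
```

On Linux this could be used as missing_tools git cmake nvcc followed by cmake --build . --config Release -j "$(pick_jobs "$(nproc)" "$(awk '/MemAvailable/ {print $2}' /proc/meminfo)")". Leave nvcc out of the tool list for CPU-only builds.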

How to compile llama.cpp from source
