06. llama.cpp from Source

Chapter 6 of 15 · 20 min

llama.cpp is a C/C++ inference engine for running quantized LLMs on CPU and GPU with minimal dependencies. Building from source lets the compiler target your CPU's instruction set (AVX2 or AVX-512 on x86, NEON on ARM) and lets you enable the CUDA or HIP GPU backends.
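
To see which of those instruction sets your own machine advertises, a small check like the following can help. It is a sketch that assumes Linux, where /proc/cpuinfo lists CPU features on a "flags" line (x86) or a "Features" line (ARM). The function name cpu_simd_flags is made up for this example, and on 64-bit ARM the kernel reports NEON as "asimd".

```shell
# Print the SIMD extensions (from a fixed shortlist) that a cpuinfo file reports.
# Assumption: Linux-style cpuinfo; the path defaults to /proc/cpuinfo.
cpu_simd_flags() {
  cpuinfo="${1:-/proc/cpuinfo}"
  # Take the first feature line; every core normally reports the same set.
  line=$(grep -m1 -E '^(flags|Features)' "$cpuinfo" 2>/dev/null)
  found=""
  for flag in avx2 avx512f neon asimd; do
    # Pad with spaces so "avx" cannot match inside "avx2".
    case " $line " in
      *" $flag "*) found="$found $flag" ;;
    esac
  done
  echo "${found# }"
}
```

Running cpu_simd_flags with no argument reads the live /proc/cpuinfo; an empty result means none of the four names were found.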

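Prerequisites on Ubuntu (git-lfs and pip are needed later for the model download and conversion; a CUDA build also requires the NVIDIA CUDA toolkit):

sudo apt update
sudo apt install build-essential cmake git git-lfs python3 python3-pip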

Clone and configure:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
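cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_TESTS=OFF
cmake --build build --config Release -j$(nproc)

The -DGGML_CUDA=ON flag builds the CUDA backend, which enables GPU offloading. Older checkouts used -DLLAMA_CUDA=ON, which is now deprecated. On AMD GPUs, build the HIP backend with -DGGML_HIP=ON instead (some earlier releases called it -DGGML_HIPBLAS=ON). Configuring with -B build keeps your shell in the repository root, so the build output lands in build/bin/ relative to where you are.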
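
If the same build script has to work on machines with different GPUs, the backend flag can be chosen by checking which vendor compiler is installed. The sketch below assumes a recent llama.cpp checkout where backend options carry the GGML_ prefix, and it treats the presence of nvcc or hipcc on PATH as a proxy for a usable toolkit. The function name pick_backend_flag is hypothetical.

```shell
# Echo a CMake backend flag based on which GPU compiler is on PATH.
# Prints nothing when neither is found, which leaves a CPU-only build.
pick_backend_flag() {
  if command -v nvcc >/dev/null 2>&1; then
    echo "-DGGML_CUDA=ON"      # NVIDIA CUDA toolkit present
  elif command -v hipcc >/dev/null 2>&1; then
    echo "-DGGML_HIP=ON"       # AMD ROCm/HIP toolchain present
  fi
}

# Example use (not executed here):
#   cmake -B build $(pick_backend_flag) -DLLAMA_BUILD_TESTS=OFF
```

A compiler on PATH does not guarantee a working driver or a supported GPU, so treat the result as a default to be overridden, not as a definitive detection.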

Download a model from Hugging Face, convert and quantize it, then run it. Run these commands from the llama.cpp repository root:

# Download the raw model weights (this may be several GB)
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
# Convert to GGUF at f16 precision (needs the Python packages listed in llama.cpp's requirements.txt)
python3 ../llama.cpp/convert_hf_to_gguf.py ./Mistral-7B-Instruct-v0.3 \
  --outfile ./mistral-7b.gguf --outtype f16
# Quantize to Q4_K_M (4-bit, ~4GB from 14GB)
./build/bin/llama-quantize ./mistral-7b.gguf ./mistral-7b-q4_k_m.gguf Q4_K_M
# Run with every layer offloaded to the GPU (lower -ngl to split between GPU and CPU)
./build/bin/llama-cli -m ./mistral-7b-q4_k_m.gguf \
  -ngl 99 -t 8 -c 4096 -p "Explain Docker in one sentence"

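-ngl sets the maximum number of model layers to offload to the GPU. 99 is larger than the layer count of a 7B model (Mistral 7B has 32 transformer layers), so every layer is offloaded; a smaller value offloads only that many layers and leaves the rest on the CPU. -ngl 0 runs fully on CPU. -t 8 uses 8 CPU threads. -c 4096 sets the context window to 4096 tokens.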
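
The "~4GB from 14GB" figure in the quantize step follows from simple arithmetic: file size is roughly parameter count times bits per weight, divided by 8. The sketch below assumes about 7.25 billion parameters for Mistral 7B and about 4.85 bits per weight for Q4_K_M. Both numbers are approximations used for illustration, and gguf_size_gb is a made-up helper name.

```shell
# Estimate model file size in GB from parameter count and bits per weight.
# Args: parameters in billions, average bits per weight.
# Ignores metadata and tokenizer data, so real files differ slightly.
gguf_size_gb() {
  LC_ALL=C awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

gguf_size_gb 7.25 16      # f16: prints 14.5
gguf_size_gb 7.25 4.85    # Q4_K_M (approximate bits per weight): prints 4.4
```

The same estimate is useful for deciding whether a quantized model can fit in VRAM before you download anything.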

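Failure mode: CMake fails with Could not find CUDA. The CUDA toolkit is installed, but CMake cannot locate it because nvcc is not on PATH and no CUDA location is set. Point CMake at the toolkit before configuring, for example export CUDA_PATH=/usr/local/cuda-12.4 and export PATH=$CUDA_PATH/bin:$PATH (adjust the version to match your install). Then delete the build directory and rerun cmake so the failed lookup is not reused from the CMake cache.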

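Failure mode: Build succeeds but llama-cli crashes at startup, with a segfault or a CUDA out-of-memory error. The usual cause is that the GPU does not have enough VRAM to hold every layer at -ngl 99. Offload fewer layers: on a 32-layer model, -ngl 20 keeps 20 layers on the GPU and runs the rest on the CPU. Alternatively, use a smaller quantization such as Q3_K_M, or lower the context size with -c.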
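
A starting value for -ngl can be estimated from the model file size and the free VRAM. The following is a rough sketch, not how llama.cpp itself allocates memory: it assumes all layers are the same size and sets aside a reserve for the KV cache and scratch buffers. The function name max_gpu_layers and the 1000 MiB reserve in the example are guesses to tune for your setup.

```shell
# Estimate how many layers fit in free VRAM.
# Args: model file size (MiB), layer count, free VRAM (MiB), reserve (MiB).
# Free VRAM on NVIDIA can be read with:
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
max_gpu_layers() {
  file_mib=$1; n_layers=$2; free_mib=$3; reserve_mib=$4
  usable=$(( free_mib - reserve_mib ))
  # Layers that fit = usable memory / (file size / layer count), in integer math.
  fit=$(( usable * n_layers / file_mib ))
  if [ "$fit" -lt 0 ]; then fit=0; fi
  if [ "$fit" -gt "$n_layers" ]; then fit=$n_layers; fi
  echo "$fit"
}

# A 4200 MiB model with 32 layers, 3000 MiB free, 1000 MiB held back:
max_gpu_layers 4200 32 3000 1000    # prints 15
```

If loading still fails at the estimated value, step it down by a few layers; larger context sizes need a larger reserve.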

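Failure mode: llama-quantize fails with file too large. The target filesystem caps the size of a single file. The most common case is FAT32, often found on USB drives and SD cards, which cannot hold a file of 4 GiB or more, while the f16 GGUF here is about 14GB. Write the output to an ext4 or XFS volume instead; both handle files of this size with default mkfs settings, so no special format options are needed. Check the filesystem type of the output directory with df -T before starting.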
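
One frequent cause of a "file too large" error is a filesystem with a per-file size cap, such as FAT32, where a single file can be at most 2^32 - 1 bytes. A quick pre-flight check can compare the expected output size against that cap. This is a sketch; exceeds_fat32 is a hypothetical helper, and the sizes in the example are round illustrative figures.

```shell
# Largest file FAT32 can store: 2^32 - 1 bytes (just under 4 GiB).
FAT32_MAX_BYTES=4294967295

# Exit status 0 if a file of the given size (in bytes) is too large for FAT32.
exceeds_fat32() {
  [ "$1" -gt "$FAT32_MAX_BYTES" ]
}

# A ~14.5 GB f16 file is over the limit; a ~4.0 GB file is just under it.
if exceeds_fat32 14500000000; then
  echo "f16 output will not fit on FAT32"
fi
```

On systems with GNU coreutils, df -T . shows the filesystem type of the current directory, which tells you whether the check matters.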

EXERCISE

Clone llama.cpp and build it with CUDA support enabled. Download a small model (Qwen-0.5B or similar), convert it to GGUF, and quantize it to Q4_K_M. Run it with -ngl 0 (CPU only) and record tokens per second, then run it with -ngl 99 and record the speedup.
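
To turn the two measurements into a speedup figure, divide the GPU rate by the CPU rate. A minimal sketch follows; the function name speedup is made up, and the numbers in the example are placeholders, not measured results.

```shell
# Compute GPU-over-CPU speedup from two tokens-per-second readings.
# Args: CPU tokens/s, GPU tokens/s. Prints "n/a" if the CPU rate is not positive.
speedup() {
  LC_ALL=C awk -v cpu="$1" -v gpu="$2" 'BEGIN {
    if (cpu <= 0) { print "n/a"; exit }
    printf "%.1fx\n", gpu / cpu
  }'
}

# Placeholder readings: 8.0 tokens/s on CPU, 40.0 tokens/s on GPU.
speedup 8.0 40.0    # prints 5.0x
```

Take each reading from a run with the same prompt and context size so the two rates are comparable.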