Precision
Precision in local AI refers to the number of bits used to represent each weight and activation in a neural network. Lower precision (e.g., 4-bit) reduces model size and memory bandwidth requirements, enabling larger models to run on consumer hardware, but can degrade output quality. Common precisions include FP32 (32-bit), FP16 (16-bit), and quantized integer formats like Q4_K_M used in llama.cpp (nominally 4-bit, averaging slightly more per weight once per-block scale factors are counted). The operator must balance VRAM constraints against acceptable perplexity loss.
Deeper dive
Precision directly impacts VRAM usage and inference speed. A 7B parameter model in FP32 requires ~28 GB for the weights alone (7 billion parameters × 4 bytes), while Q4_K_M reduces it to ~4 GB, which fits on a 6 GB GPU with some room left for the KV cache at modest context lengths. Lower precision also increases tokens per second, because token generation is usually limited by memory bandwidth and less data moves across the memory bus per token. However, aggressive quantization (e.g., 2-bit) can introduce noticeable quality loss. llama.cpp offers a range of quantization levels (Q2, Q3, Q4, Q5, Q6, Q8) with trade-offs between size and fidelity. Operators typically choose the highest precision that fits their VRAM budget.
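The arithmetic above can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a measurement: the bits-per-weight value for Q4_K_M is an assumed average, the 300 GB/s bandwidth is a hypothetical figure, and the function names are illustrative. The throughput function gives only an upper bound for memory-bound decoding, where every weight is read once per generated token.

```python
# Approximate bits per weight. FP32 and FP16 are exact; Q8_0 stores 32 int8
# values plus one 16-bit scale per block (8.5 bits); the Q4_K_M value is an
# assumed average, since block scales add overhead beyond 4 bits.
BITS_PER_WEIGHT = {"FP32": 32.0, "FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}


def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone in GB (10**9 bytes).

    Excludes the KV cache, activations, and runtime buffers.
    """
    return n_params * bits_per_weight / 8 / 1e9


def max_tokens_per_second(size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / size_gb


# Hypothetical setup: a 7B model on a GPU with 300 GB/s of memory bandwidth.
for name, bits in BITS_PER_WEIGHT.items():
    size = weight_size_gb(7e9, bits)
    bound = max_tokens_per_second(size, 300.0)
    print(f"{name:7s} {size:5.1f} GB  at most {bound:5.1f} tok/s")
```

Real throughput lands below this bound, since compute, KV cache reads, and kernel overhead all take time as well.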
Practical example
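A 13B model in FP16 (26 GB) exceeds the 16 GB of VRAM on the 16 GB variant of the RTX 4060 Ti. A Q4_K_M quantization (roughly 8 GB) fits comfortably and leaves room for the KV cache, allowing inference in the range of tens of tokens per second depending on context length and backend. Going to Q2_K (roughly 5.5 GB) would run even faster but may noticeably degrade output coherence.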
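The selection rule in this example (take the highest precision that fits) can be written as a short helper. This is a minimal sketch: the size table uses approximate figures assumed for illustration, actual GGUF file sizes vary by model and quantization variant, and the headroom parameter is a hypothetical allowance for the KV cache and runtime buffers.

```python
from typing import List, Optional, Tuple

# Approximate weight sizes in GB for a 13B model, ordered from highest to
# lowest precision. Quantized figures are assumptions for illustration.
SIZES_13B_GB: List[Tuple[str, float]] = [
    ("FP16", 26.0),
    ("Q8_0", 13.8),
    ("Q5_K_M", 9.2),
    ("Q4_K_M", 7.9),
    ("Q2_K", 5.4),
]


def pick_quantization(
    sizes: List[Tuple[str, float]],
    vram_gb: float,
    headroom_gb: float = 2.0,
) -> Optional[str]:
    """Return the first (highest-precision) entry whose weights fit in VRAM
    after reserving headroom for the KV cache and runtime buffers.

    Returns None when nothing fits.
    """
    budget = vram_gb - headroom_gb
    for name, size_gb in sizes:
        if size_gb <= budget:
            return name
    return None


# A 16 GB card with a small reserve versus the same card with a large
# reserve, for example to hold a long context.
print(pick_quantization(SIZES_13B_GB, 16.0))
print(pick_quantization(SIZES_13B_GB, 16.0, headroom_gb=8.0))
```

Raising the headroom pushes the choice toward smaller quantizations, which is the usual trade when long contexts matter more than the last bit of fidelity.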
Workflow example
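In llama.cpp, precision is a property of the model file, not a runtime flag (the -b option sets the batch size). The operator either downloads a pre-quantized GGUF file like llama-2-13b.Q4_K_M.gguf or converts a higher-precision GGUF with the bundled quantization tool, passing the input file, the output file, and the target type (e.g., llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M). In LM Studio, the model card lists available quantizations; the operator picks one that fits their GPU VRAM. In MLX, weights load in the dtype stored in the checkpoint; the operator can cast a model with model.set_dtype(mx.float16) or quantize it with mlx.nn.quantize(model, group_size=64, bits=4).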
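When browsing pre-quantized downloads, the quantization level is usually encoded in the file name. The sketch below extracts it, assuming the common naming convention in which the tag sits just before the .gguf extension, separated by a dot or hyphen. The convention is not enforced by the format, and the helper name quant_tag is hypothetical.

```python
import re
from typing import Optional

# Matches tags such as Q4_K_M, Q8_0, Q2_K, F16, or F32 that appear right
# before ".gguf", preceded by a dot or a hyphen.
_QUANT_TAG = re.compile(
    r"[.\-]((?:Q\d+_[A-Z0-9_]+)|F16|F32)\.gguf$", re.IGNORECASE
)


def quant_tag(filename: str) -> Optional[str]:
    """Return the quantization tag from a GGUF file name, or None if absent."""
    match = _QUANT_TAG.search(filename)
    return match.group(1).upper() if match else None


print(quant_tag("llama-2-13b.Q4_K_M.gguf"))
```

A file name without a recognizable tag returns None, in which case the model card is the place to check.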
Reviewed by Fredoline Eruo. See our editorial policy.