How to optimize llama.cpp inference parameters
Prerequisite: llama.cpp compiled and running.
What this does
Adjusts runtime parameters to balance inference speed, memory usage, and output quality. Parameter tuning changes how the model generates text without altering the model file itself.
Steps
Set the context size to cover the expected prompt length plus the tokens to be generated. A value larger than needed wastes memory; a value that is too small truncates the prompt or drops earlier context.
./llama-cli -m model.gguf -c 2048 -p "Prompt here"
Expected output: Model accepts the context size and begins generation within memory constraints.
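To see why an oversized context wastes memory, the following is a rough sketch of how the key/value cache grows with -c. It assumes an f16 cache (2 bytes per element) and uses made-up model dimensions (32 layers, 8 KV heads, head dimension 128); the function name kv_cache_mib is hypothetical, and the real figures for a given model come from its metadata.

```shell
# Rough KV-cache size in MiB, assuming an f16 cache (2 bytes per element).
# Arguments: n_layers n_kv_heads head_dim n_ctx
# All values passed below are illustrative, not taken from a real model.
kv_cache_mib() {
  # The leading factor 2 accounts for storing both keys and values.
  echo $(( 2 * $1 * $2 * $3 * $4 * 2 / 1024 / 1024 ))
}

kv_cache_mib 32 8 128 2048   # prints 256
kv_cache_mib 32 8 128 8192   # prints 1024
```

Under these assumptions the cache scales linearly with the context size, so quadrupling -c quadruples this part of the memory footprint. Model weights and compute buffers come on top of it.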
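Adjust the batch size to trade prompt-processing throughput against memory. Larger batches generally process the prompt faster at the cost of additional memory; the batch size has little effect on token-by-token generation speed.
./llama-cli -m model.gguf -c 2048 -b 512 -p "Prompt here"
Expected output: Prompt-processing tokens-per-second changes with the batch size; on GPU builds, larger batches are usually faster.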
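Because the best batch size depends on the hardware, it can help to compare several values under otherwise identical settings. The sketch below only prints the commands to run; the function name batch_sweep_commands, the MODEL variable, and the chosen batch sizes are placeholders for illustration.

```shell
# Print one llama-cli command per batch size so the runs can be compared.
# MODEL and the batch sizes are placeholder values.
MODEL=${MODEL:-model.gguf}

batch_sweep_commands() {
  for b in "$@"; do
    # -n 64 keeps each run short; only -b varies between runs.
    printf './llama-cli -m %s -c 2048 -b %s -n 64 -p "Prompt here"\n' "$MODEL" "$b"
  done
}

batch_sweep_commands 128 256 512
```

Printing rather than executing keeps the sweep easy to review first; piping the output to sh would run the commands in sequence.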
Control generation length with -n (n_predict).
./llama-cli -m model.gguf -c 2048 -n 256 -p "Prompt here"
Expected output: Generation stops after at most 256 tokens, or earlier if the model emits an end-of-sequence token.
Select a temperature for output diversity. Lower values make output more deterministic; higher values introduce more variation.
./llama-cli -m model.gguf -c 2048 --temp 0.7 -p "Prompt here"
Expected output: Varied outputs across multiple runs when temperature is above 0.
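The effect of temperature can be illustrated with the standard softmax-with-temperature formula, in which each logit is divided by the temperature before normalization. This is a simplified sketch over three made-up logits, not llama.cpp's actual sampler chain; the function name softmax_temp is hypothetical.

```shell
# Softmax with temperature over a few made-up logits.
# Usage: softmax_temp TEMPERATURE LOGIT...
softmax_temp() {
  temp=$1; shift
  printf '%s\n' "$@" | awk -v t="$temp" '
    # Scale each logit by 1/t and track the maximum for numerical stability.
    { l[NR] = $1 / t; if (NR == 1 || l[NR] > max) max = l[NR] }
    END {
      for (i = 1; i <= NR; i++) { e[i] = exp(l[i] - max); sum += e[i] }
      for (i = 1; i <= NR; i++) printf "%.3f%s", e[i] / sum, (i < NR ? " " : "\n")
    }'
}

softmax_temp 0.7 2.0 1.0 0.0   # sharper distribution
softmax_temp 1.5 2.0 1.0 0.0   # flatter distribution
```

At the lower temperature the highest logit takes a larger share of the probability mass, which is why low values look more deterministic and high values more varied.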
Verification
./llama-cli -m model.gguf -c 2048 -p "Test prompt" 2>&1 | grep "tokens per second"
# Expected: measurable tokens-per-second value indicating throughput
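To record the throughput as a number rather than a full log line, the value can be pulled out of the timing output. The sketch below assumes the timing line ends with "<number> tokens per second"; the sample line is illustrative, its prefix and spacing vary between llama.cpp versions, and the function name extract_tps is hypothetical.

```shell
# Extract the numeric tokens-per-second values from timing output on stdin.
extract_tps() {
  grep -oE '[0-9]+(\.[0-9]+)? tokens per second' | awk '{print $1}'
}

# Illustrative sample line (assumed format, not captured from a real run).
sample='eval time = 1234.56 ms / 255 runs ( 4.84 ms per token, 206.55 tokens per second)'
printf '%s\n' "$sample" | extract_tps   # prints 206.55
```

In practice the function would sit at the end of the verification pipeline in place of the plain text filter, producing one number per timing line.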
Common failures
- Context size exceeds available memory: Reduce -c to 1024 or lower to stay within memory limits.
- Temperature set too high produces incoherent output: Values above 1.2 often produce degenerate output. Use --temp 0.7.
- Batch size causes out-of-memory errors on GPU: Reduce -b from 512 to 128.
- Output truncates unexpectedly: Explicitly set -n to the desired token count.
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.