What this does

Adjusts runtime parameters to balance inference speed, memory usage, and output quality. Parameter tuning changes how the model generates text without altering the model file itself.

Steps

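Set the context size (-c) to cover the expected prompt length plus the tokens you plan to generate. A value larger than needed wastes memory on the key/value cache; a value that is too small means long prompts are truncated or earlier tokens are dropped as the window fills.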
```
./llama-cli -m model.gguf -c 2048 -p "Prompt here"
```
Expected output: The model loads with the requested context size and begins generating without an out-of-memory error.
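To see why an oversized context wastes memory, it can help to estimate the key/value cache, which grows linearly with -c. The sketch below applies the usual transformer KV-cache size formula. The model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are hypothetical placeholders, not values read from model.gguf, and real memory use also includes the weights and compute buffers.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical model dimensions, for illustration only.
DIMS = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (1024, 2048, 8192):
    mib = kv_cache_bytes(n_ctx, **DIMS) / (1024 ** 2)
    print(f"-c {n_ctx}: ~{mib:.0f} MiB of KV cache")
```

Under these assumed dimensions, doubling -c doubles the cache, so the estimate is mainly useful for comparing candidate context sizes against each other.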
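Adjust the batch size (-b) to speed up prompt processing. Larger batches let more prompt tokens be evaluated per pass at the cost of additional memory; the setting has little effect on per-token generation speed.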
```
./llama-cli -m model.gguf -c 2048 -b 512 -p "Prompt here"
```
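Expected output: Prompt-evaluation tokens per second changes with the batch size, most visibly on GPU builds. Compare against a run without -b, since the default batch size differs between llama.cpp versions.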
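Comparing a few batch sizes side by side is easier when the commands are generated rather than typed. The following is a minimal sketch that only builds the argument lists; it uses the same flags as the command above (-m, -c, -b, -p), and the function name and the candidate sizes are illustrative assumptions.

```python
import shlex

def batch_sweep_commands(batch_sizes, model="model.gguf", ctx=2048, prompt="Prompt here"):
    """Return one llama-cli argument list per batch size, using only -m, -c, -b, -p."""
    return [
        ["./llama-cli", "-m", model, "-c", str(ctx), "-b", str(b), "-p", prompt]
        for b in batch_sizes
    ]

commands = batch_sweep_commands([128, 256, 512])
for argv in commands:
    # shlex.join quotes the prompt so the line can be pasted into a POSIX shell.
    print(shlex.join(argv))
```

Each list could then be passed to subprocess.run(argv, capture_output=True, text=True) and the timing lines compared across runs.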
Control generation length with -n (n_predict), the maximum number of tokens to generate.
```
./llama-cli -m model.gguf -c 2048 -n 256 -p "Prompt here"
```
Expected output: Generation stops after at most 256 tokens, or earlier if the model emits an end-of-sequence token.
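One conservative way to choose -n is to make sure the prompt and the generated tokens fit in the context window together. The sketch below is plain arithmetic under that assumption; the prompt token counts are made-up examples, and the function name is hypothetical.

```python
def max_new_tokens(n_ctx, prompt_tokens, requested):
    """Clamp the requested -n so that prompt plus generation fits in n_ctx tokens."""
    if prompt_tokens >= n_ctx:
        raise ValueError("prompt alone fills the context; raise -c or shorten the prompt")
    return min(requested, n_ctx - prompt_tokens)

print(max_new_tokens(2048, 300, 256))   # fits as requested: 256
print(max_new_tokens(2048, 1900, 256))  # clamped to the remaining room: 148
```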
Select a temperature (--temp) for output diversity. Lower values make outputs more predictable, and a temperature of 0 gives greedy, effectively deterministic decoding; higher values introduce more variation.
```
./llama-cli -m model.gguf -c 2048 --temp 0.7 -p "Prompt here"
```
Expected output: Varied outputs across multiple runs when temperature is above 0.
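The effect of temperature can be illustrated with the standard softmax-with-temperature calculation: logits are divided by the temperature before being turned into probabilities. This is a generic sketch of that formula, not llama.cpp's sampler code, and the logits are made-up scores for three candidate tokens.

```python
import math

def softmax_with_temperature(logits, temp):
    """Convert logits to probabilities after dividing by temp (temp must be > 0)."""
    scaled = [x / temp for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temp in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At low temperature nearly all of the probability lands on the top-scoring token, while at higher temperature it spreads across the alternatives, which is where the run-to-run variation comes from.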

Verification

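./llama-cli -m model.gguf -c 2048 -p "Test prompt" 2>&1 | Select-String "tokens per second"
# Expected: timing lines that report a tokens-per-second value for prompt processing and for generation.
# Select-String is a PowerShell cmdlet; on Linux or macOS, pipe to grep "tokens per second" instead.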
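To compare runs numerically rather than by eye, the tokens-per-second figures can be pulled out of the captured output. The sketch below relies only on the phrase "tokens per second" that the command above filters for. The sample lines are illustrative, not real output, since the exact wording of the timing lines varies between builds.

```python
import re

TPS_PATTERN = re.compile(r"([0-9]+(?:\.[0-9]+)?)\s+tokens per second")

def tokens_per_second(log_text):
    """Return every tokens-per-second figure found in the captured output, in order."""
    return [float(m) for m in TPS_PATTERN.findall(log_text)]

# Illustrative lines only; the exact wording of timing output varies by build.
sample = (
    "prompt eval time = 100.00 ms / 10 tokens (10.00 ms per token, 100.00 tokens per second)\n"
    "eval time = 2000.00 ms / 50 runs (40.00 ms per token, 25.00 tokens per second)\n"
)
print(tokens_per_second(sample))
```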

Common failures

Context size exceeds available memory: reduce -c to 1024 or lower to stay within memory limits.
Temperature set too high produces incoherent output: values above about 1.2 often degrade into rambling or nonsensical text. Return to --temp 0.7. Repetition loops are more typical of very low temperatures.
Batch size causes out-of-memory errors on GPU: reduce -b from 512 to 128.
Output truncates unexpectedly: set -n explicitly to the desired token count, and check that -c leaves room for the prompt plus the generated tokens.
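For the two out-of-memory cases, the reductions suggested above can be laid out as a retry ladder. The sketch below halves the batch size first and then the context size; that ordering, the lower bounds, and the function name are assumptions made for illustration, not a llama.cpp recommendation.

```python
def fallback_settings(ctx=2048, batch=512, min_ctx=512, min_batch=128):
    """Yield (ctx, batch) pairs, halving batch first and then context, for OOM retries."""
    yield ctx, batch
    while batch > min_batch:
        batch //= 2
        yield ctx, batch
    while ctx > min_ctx:
        ctx //= 2
        yield ctx, batch

for ctx, batch in fallback_settings():
    print(f'./llama-cli -m model.gguf -c {ctx} -b {batch} -p "Prompt here"')
```

Trying the printed commands in order and stopping at the first one that loads gives the largest of these settings that fits on the machine.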

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
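The checkpoint can be kept as a small structured record instead of free-form notes. The sketch below gathers the operating system and machine type from Python's platform module and leaves the rest to be filled in by hand; the field names and the placeholder values are hypothetical.

```python
import json
import platform
from datetime import datetime, timezone

def checkpoint_record(model_file, build_version, backend, verification_output):
    """Collect the details the checkpoint asks for into one JSON-serializable dict."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "model_file": model_file,
        "build_version": build_version,  # fill in by hand from your build
        "backend": backend,              # for example "CPU" or the GPU backend in use
        "verification_output": verification_output,
    }

# Placeholder values; replace them with what you actually observed.
record = checkpoint_record("model.gguf", "unknown", "CPU", "<paste timing lines here>")
print(json.dumps(record, indent=2))
```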

How to optimize llama.cpp inference parameters
