19. Slow Inference Debugging
Slow inference has multiple causes: CPU-only processing, insufficient memory bandwidth, thermal throttling, or model configuration issues. This chapter provides a systematic debugging approach.
Baseline Measurement
First, establish a baseline. Run the same prompt multiple times:
# Requires GNU date (%N gives nanoseconds) and jq
for i in {1..5}; do
  start=$(date +%s%N)
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"llama3.2:1b","prompt":"Write a haiku","stream":false}' \
    | jq '.eval_duration'   # server-side generation time, in nanoseconds
  end=$(date +%s%N)
  echo "Total time: $(( (end - start) / 1000000 ))ms"
done
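The first run is often slower because it includes loading the model, so compare the runs after it. If timing still varies significantly between those runs, thermal throttling or memory pressure may be involved.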
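To put a number on run-to-run variation, one option is to compare the spread of the server-reported total_duration values (in nanoseconds) after discarding the first run. The following is a minimal sketch: the function name summarize_runs and the sample durations are illustrative, not measured values.

```python
import statistics

def summarize_runs(durations_ns, skip_warmup=True):
    """Summarize request durations reported in nanoseconds.

    Returns the mean and sample standard deviation in milliseconds, plus
    the coefficient of variation (stdev / mean) as a rough stability signal.
    """
    runs = list(durations_ns[1:]) if skip_warmup else list(durations_ns)
    if len(runs) < 2:
        raise ValueError("need at least two runs after warm-up")
    ms = [d / 1_000_000 for d in runs]
    mean = statistics.mean(ms)
    stdev = statistics.stdev(ms)
    return {"mean_ms": mean, "stdev_ms": stdev, "cv": stdev / mean}

# Hypothetical total_duration values from five runs (not real measurements);
# the first one is larger because it would include model loading.
sample = [900_000_000, 400_000_000, 410_000_000, 390_000_000, 400_000_000]
stats = summarize_runs(sample)
print(f"mean={stats['mean_ms']:.1f}ms stdev={stats['stdev_ms']:.1f}ms cv={stats['cv']:.3f}")
```

What counts as "significant" variation depends on the machine, so the coefficient of variation is most useful when compared against your own earlier baselines.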
Identifying the Bottleneck
Check GPU utilization:
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv
- Low GPU utilization (<50%) with high latency suggests CPU bottleneck or data transfer overhead
- High GPU utilization indicates the model is compute-bound; consider a smaller or more heavily quantized model
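The rule of thumb in these bullets can be applied automatically to nvidia-smi output. This sketch assumes the command is run with --format=csv,noheader,nounits so that each line looks like "35, 12". The helper names are hypothetical, and treating everything at or above the 50% threshold as "high" is a simplification.

```python
def parse_gpu_utilization(csv_text):
    """Parse output of:
    nvidia-smi --query-gpu=utilization.gpu,utilization.memory \
               --format=csv,noheader,nounits
    Each line is expected to look like "35, 12" (GPU %, memory %).
    """
    rows = []
    for line in csv_text.strip().splitlines():
        gpu, mem = (int(field.strip()) for field in line.split(","))
        rows.append({"gpu": gpu, "memory": mem})
    return rows

def classify(gpu_percent, low_threshold=50):
    """Apply the 50% rule of thumb to one utilization reading."""
    if gpu_percent < low_threshold:
        return "low: check for CPU bottleneck or transfer overhead"
    return "high: likely compute-bound"

sample_output = "35, 12\n97, 64\n"  # illustrative values, one line per GPU
for row in parse_gpu_utilization(sample_output):
    print(row, "->", classify(row["gpu"]))
```

In practice the text would come from subprocess.run(..., capture_output=True, text=True).stdout, sampled while a generation request is in flight.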
Check CPU usage:
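top -H -p "$(pgrep -d, -f ollama)"
This shows per-thread CPU usage for every matching Ollama process. Sustained high CPU usage during generation suggests that part of the work is running on the CPU (for example, model layers that were not offloaded to the GPU, tokenization, or post-processing) and may be the bottleneck.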
Common Causes and Fixes
CPU-only inference - If the PROCESSOR column of ollama ps shows CPU instead of GPU, follow Chapter 17 to fix GPU detection.
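Small context window - A context window that is too small for the conversation forces frequent KV cache evictions. A larger window also uses more memory, so increase it only as far as needed:
# Increase the context window from inside an interactive session
# (started with: ollama run llama3.2:1b)
/set parameter num_ctx 4096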
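Low batch size - Small batch sizes reduce parallelism during prompt processing:
# num_batch is passed per request through the API "options" field
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.2:1b","prompt":"Write a haiku","stream":false,"options":{"num_batch":256}}'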
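Both num_ctx and num_batch can also be set per request through the options field of the REST API, which is convenient when comparing settings from a script. The sketch below only builds the request. The helper name build_generate_request is hypothetical, and the host is assumed to be the default local address used earlier in this chapter.

```python
import json
import urllib.request

def build_generate_request(model, prompt, num_ctx=None, num_batch=None,
                           host="http://localhost:11434"):
    """Build (but do not send) a POST request for /api/generate."""
    options = {}
    if num_ctx is not None:
        options["num_ctx"] = num_ctx
    if num_batch is not None:
        options["num_batch"] = num_batch
    body = {"model": model, "prompt": prompt, "stream": False, "options": options}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("llama3.2:1b", "Write a haiku",
                             num_ctx=4096, num_batch=256)
print(json.loads(req.data)["options"])

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```

Changing one option at a time and re-running the baseline measurement makes it easier to attribute any difference to that option.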
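Model quantization - Lower-bit quantization (like Q2_K) uses less memory at the cost of output quality, and it is not always faster, because dequantization overhead can offset the bandwidth savings on some hardware: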
# Try a different quantization level
ollama run llama3.2:3b # Q4_K_M (default)
Network latency - If using a remote Ollama instance, network delay adds to response time. Test with ping <ollama-host> and compare local versus remote response times.
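As a complement to ping, the time to open a TCP connection to the Ollama port gives a rough view of network delay as the client sees it. This is a minimal sketch assuming the default port 11434. The function names are hypothetical, and no call is made at import time because it needs a reachable host.

```python
import socket
import time

def tcp_connect_ms(host, port=11434, timeout=3.0):
    """Return the time in milliseconds to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close it immediately
    return (time.perf_counter() - start) * 1000

def median_connect_ms(host, port=11434, attempts=5):
    """Median of several connection attempts, to smooth out outliers."""
    samples = sorted(tcp_connect_ms(host, port) for _ in range(attempts))
    return samples[len(samples) // 2]
```

Calling median_connect_ms with the remote host and again with localhost gives two numbers to compare. If the difference is small relative to the total response time, the network is unlikely to be the main cause.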
Thermal Throttling
Sustained high GPU load generates heat. On laptops or small form-factor PCs, thermal throttling can reduce GPU clock speed:
# Check GPU temperature
nvidia-smi -q -i 0 -d TEMPERATURE
# If temperature exceeds 85°C, throttling is likely (exact limits vary by GPU)
Solutions:
- Improve airflow (laptop stands, additional fans)
- Reduce GPU power draw, for example by capping the power limit (nvidia-smi -pl <watts>, where the driver allows it)
- Use a smaller model during sustained workloads
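To confirm throttling rather than guess at it, temperature and clock speed can be logged together under load and checked for hot samples with a reduced clock. The sketch below assumes samples collected with nvidia-smi --query-gpu=temperature.gpu,clocks.sm --format=csv,noheader,nounits. The 15% clock-drop threshold and the function names are assumptions chosen for illustration.

```python
def parse_thermal_samples(lines):
    """Parse lines like "72, 1890" (temperature in °C, SM clock in MHz)."""
    samples = []
    for line in lines:
        temp, clock = (int(x.strip()) for x in line.split(","))
        samples.append((temp, clock))
    return samples

def looks_throttled(samples, temp_limit=85, clock_drop=0.15):
    """Heuristic: any hot sample whose clock sits well below the observed peak."""
    peak = max(clock for _, clock in samples)
    return any(
        temp >= temp_limit and clock < peak * (1 - clock_drop)
        for temp, clock in samples
    )

# Illustrative samples taken a few seconds apart under load (not real data)
log = ["61, 1900", "78, 1890", "87, 1500", "88, 1450"]
print(looks_throttled(parse_thermal_samples(log)))
```

A clock drop with no matching temperature rise points away from thermal limits and toward other causes, such as power limits.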
Debugging API Latency
Add timing to Python client calls:
import time
from ollama import chat

# perf_counter is a monotonic clock suited to measuring elapsed time
start = time.perf_counter()
response = chat(model='llama3.2:1b', messages=[
    {'role': 'user', 'content': 'Hello'}
])
elapsed = time.perf_counter() - start
print(f"Total time: {elapsed:.2f}s")
print(f"Response: {response['message']['content']}")
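Compare this with the curl timing for the same model and prompt to isolate client versus server issues: a large gap points to client-side overhead, while similar numbers point to the server.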
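The response also carries server-side duration fields in nanoseconds (total_duration, load_duration, prompt_eval_duration, eval_duration), which can be subtracted from the client's wall-clock time to estimate overhead outside the server. The following is a minimal sketch with a hypothetical helper name and made-up numbers. It assumes all four fields are present in the response.

```python
def timing_breakdown(response, wall_seconds):
    """Split client-observed time into server-reported phases (seconds)."""
    ns = 1_000_000_000
    server_total = response["total_duration"] / ns
    return {
        "load_s": response["load_duration"] / ns,
        "prompt_eval_s": response["prompt_eval_duration"] / ns,
        "generation_s": response["eval_duration"] / ns,
        "server_total_s": server_total,
        # Time the server did not account for: HTTP, serialization, network
        "client_overhead_s": wall_seconds - server_total,
    }

# Illustrative response fields (not real measurements)
fake = {
    "total_duration": 2_000_000_000,
    "load_duration": 500_000_000,
    "prompt_eval_duration": 200_000_000,
    "eval_duration": 1_250_000_000,
}
print(timing_breakdown(fake, wall_seconds=2.1))
```

A large load_s on every request suggests the model is being unloaded between calls, which is a different problem from slow generation.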
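Run a benchmark script that measures tokens per second. Then restart the Ollama server in CPU-only mode with CUDA_VISIBLE_DEVICES="" set in the server's environment (the variable must apply to the server process, not the client; if the GPU is still used, try an invalid ID such as -1) and re-run. Compare results to determine how much GPU acceleration helps in your setup.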
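Tokens per second can be derived from two fields in each response: eval_count (generated tokens) and eval_duration (nanoseconds). The sketch below computes the rate and the GPU-to-CPU ratio from two sets of collected responses. The function names and the sample numbers are illustrative, not benchmark results.

```python
def tokens_per_second(response):
    """Generation speed from eval_count (tokens) and eval_duration (ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1_000_000_000)

def speedup(gpu_runs, cpu_runs):
    """Ratio of mean GPU tokens/s to mean CPU tokens/s."""
    gpu = sum(map(tokens_per_second, gpu_runs)) / len(gpu_runs)
    cpu = sum(map(tokens_per_second, cpu_runs)) / len(cpu_runs)
    return gpu / cpu

# Illustrative responses (made-up numbers, not benchmark results)
gpu_runs = [
    {"eval_count": 120, "eval_duration": 1_000_000_000},
    {"eval_count": 120, "eval_duration": 1_200_000_000},
]
cpu_runs = [{"eval_count": 120, "eval_duration": 6_000_000_000}]
print(f"GPU is {speedup(gpu_runs, cpu_runs):.1f}x faster")
```

Using the same prompt and model for both sets keeps the comparison fair, since output length and model size both affect the rate.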