RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Ollama — Installation to Mastery

19. Slow Inference Debugging

Chapter 19 of 20 · 20 min
KEY INSIGHT

Slow inference is usually caused by one bottleneck: CPU-only mode, low GPU utilization, or thermal throttling. Measure systematically to identify which.

Slow inference has multiple causes: CPU-only processing, insufficient memory bandwidth, thermal throttling, or model configuration issues. This chapter provides a systematic debugging approach.

Baseline Measurement

First, establish a baseline. Run the same prompt multiple times:

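# Time five identical requests (date +%s%N needs GNU date, as on Linux)
for i in {1..5}; do
    start=$(date +%s%N)
    # eval_duration is the server-side generation time in nanoseconds
    curl -s http://localhost:11434/api/generate \
        -d '{"model":"llama3.2:1b","prompt":"Write a haiku","stream":false}' \
        | jq '.eval_duration'
    end=$(date +%s%N)
    echo "Total time: $(( (end - start) / 1000000 ))ms"
done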

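If timing varies significantly between runs, thermal throttling or memory pressure may be involved. The first run can be slower regardless if the model was not yet loaded, because it includes loading the model into memory.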
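To turn the raw durations into something comparable, a minimal sketch like the following converts each response into tokens per second and reports the spread between runs. It relies on the eval_count and eval_duration (nanoseconds) fields of the /api/generate response; the sample values and the helper names are illustrative assumptions, not measurements.

```python
import statistics


def tokens_per_second(resp: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def summarize(rates: list) -> dict:
    """Mean rate plus coefficient of variation (stdev / mean) across runs."""
    mean = statistics.mean(rates)
    cv = statistics.stdev(rates) / mean if len(rates) > 1 else 0.0
    return {"mean": mean, "cv": cv}


# Illustrative responses only; paste in the JSON bodies from your own runs.
sample_runs = [
    {"eval_count": 20, "eval_duration": 400_000_000},
    {"eval_count": 20, "eval_duration": 500_000_000},
    {"eval_count": 20, "eval_duration": 400_000_000},
]

rates = [tokens_per_second(r) for r in sample_runs]
summary = summarize(rates)
print(f"mean: {summary['mean']:.1f} tok/s, variation: {summary['cv']:.1%}")
```

A coefficient of variation above a few percent is a hint, not proof, that something other than the model itself is affecting the runs.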

Identifying the Bottleneck

Check GPU utilization while a request is generating (-l 1 repeats the sample every second):

nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv -l 1
  • Low GPU utilization (<50%) with high latency suggests a CPU bottleneck or data transfer overhead
  • High GPU utilization indicates the GPU itself is the limit; use a smaller or more quantized model
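The two rules above can be applied to captured nvidia-smi output with a small parser. This is a sketch under assumptions: the sample text mimics the CSV layout nvidia-smi prints for this query, the 50% threshold is the one from the list above, and the function names and labels are hypothetical.

```python
import csv
import io


def parse_utilization(csv_text: str) -> list:
    """Parse `nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv`."""
    reader = csv.reader(io.StringIO(csv_text.strip()), skipinitialspace=True)
    next(reader)  # skip the header row
    rows = []
    for row in reader:
        if not row:
            continue
        # Cells look like "23 %"; strip the unit before converting.
        gpu, mem = (int(cell.replace("%", "").strip()) for cell in row[:2])
        rows.append({"gpu": gpu, "memory": mem})
    return rows


def classify(gpu_util: int, low_threshold: int = 50) -> str:
    """Label one sample using the rule of thumb from the list above."""
    return "gpu-underused" if gpu_util < low_threshold else "gpu-bound"


# Illustrative capture, not real output from any particular machine.
sample = """utilization.gpu [%], utilization.memory [%]
23 %, 9 %
91 %, 64 %
"""

samples = parse_utilization(sample)
labels = [classify(s["gpu"]) for s in samples]
print(labels)
```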

Check CPU usage:

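top -H -p "$(pgrep -d, -f ollama)"

This shows per-thread CPU usage for every Ollama process (pgrep -d, joins multiple PIDs with commas, as top -p expects). Sustained high CPU usage during generation suggests that model layers are running on the CPU, or that tokenization or post-processing is the bottleneck.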

Common Causes and Fixes

CPU-only inference - If the PROCESSOR column of ollama ps shows 100% CPU for a loaded model, follow Chapter 17 to fix GPU detection.

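Small context window - If the conversation outgrows num_ctx, Ollama has to drop earlier tokens from the KV cache and reprocess the prompt, which adds latency. Raise the limit from inside an interactive session, keeping in mind that a larger context also needs more memory:

# Increase context window (typed at the >>> prompt of `ollama run llama3.2:1b`)
/set parameter num_ctx 4096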
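Raising num_ctx is not free, because the KV cache grows linearly with it. The sketch below uses the standard estimate (keys plus values, per layer, per KV head, per position). The layer count, head count, and head dimension are hypothetical placeholders, not values read from any specific model; substitute the figures for the model you run.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_element: int = 2) -> int:
    """Approximate KV cache size: 2 tensors (K and V) per layer per position."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_element


# Hypothetical model shape, 16-bit cache entries (2 bytes each).
LAYERS, KV_HEADS, HEAD_DIM = 16, 8, 64

for ctx in (2048, 4096, 8192):
    size_mib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / (1024 ** 2)
    print(f"num_ctx={ctx}: ~{size_mib:.0f} MiB of KV cache")
```

If the larger cache no longer fits in GPU memory alongside the weights, part of the model may fall back to the CPU, which can cost more speed than the bigger window gains.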

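Low batch size - num_batch sets how many prompt tokens are processed per step, so a small value slows prompt processing. If it has been lowered, raise it through the API options field:

curl -s http://localhost:11434/api/generate \
    -d '{"model":"llama3.2:1b","prompt":"Write a haiku","stream":false,"options":{"num_batch":512}}'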

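Model quantization - Quantization trades memory for quality. Fewer bits per weight (like Q2_K) use less memory and lose more accuracy, and they are not always faster, so benchmark each level on your own hardware:

# Try a different quantization level by choosing a different model tag
ollama show llama3.2:3b   # prints the quantization of the tag you have
ollama run llama3.2:3b    # Q4_K_M (default)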

Network latency - If using a remote Ollama instance, network delay adds to response time. Test with ping <ollama-host> and compare local versus remote response times.

Thermal Throttling

Sustained high GPU load generates heat. On laptops or small form-factor PCs, thermal throttling can reduce GPU clock speed:

# Check GPU temperature
nvidia-smi -q -i 0 -d TEMPERATURE

# If temperature exceeds 85°C, throttling is likely
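Temperature alone does not prove throttling; a clock drop at high temperature is stronger evidence. The sketch below assumes you logged samples with nvidia-smi --query-gpu=temperature.gpu,clocks.sm --format=csv,noheader,nounits -l 1 during a long generation. The 85°C limit comes from the note above, while the 10% clock-drop threshold and the function names are assumptions chosen for illustration.

```python
def parse_samples(text: str) -> list:
    """Parse lines like '88, 1500' into (temperature_c, sm_clock_mhz) tuples."""
    samples = []
    for line in text.strip().splitlines():
        temp, clock = (int(part.strip()) for part in line.split(","))
        samples.append((temp, clock))
    return samples


def throttling_suspected(samples: list, temp_limit: int = 85,
                         clock_drop: float = 0.10) -> bool:
    """True if any hot sample runs well below the peak clock seen in the log."""
    peak = max(clock for _, clock in samples)
    floor = peak * (1 - clock_drop)
    return any(temp > temp_limit and clock < floor for temp, clock in samples)


# Illustrative logs, not real measurements.
hot_log = "62, 1900\n78, 1890\n88, 1500\n"
cool_log = "60, 1900\n70, 1890\n"

print(throttling_suspected(parse_samples(hot_log)))
print(throttling_suspected(parse_samples(cool_log)))
```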

Solutions:

  • Improve airflow (laptop stands, additional fans)
  • Lower GPU power draw (undervolt, or cap the power limit with nvidia-smi -pl <watts>)
  • Use a smaller model during sustained workloads

Debugging API Latency

Add timing to Python client calls:

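import time
from ollama import chat

# perf_counter is a monotonic clock, better suited to intervals than time.time
start = time.perf_counter()
response = chat(model='llama3.2:1b', messages=[
    {'role': 'user', 'content': 'Hello'}
])
elapsed = time.perf_counter() - start

print(f"Total time: {elapsed:.2f}s")
# total_duration is the server-side time in nanoseconds
print(f"Server time: {response['total_duration'] / 1e9:.2f}s")
print(f"Response: {response['message']['content']}")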

Compare timing with curl to isolate client versus server issues.
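One way to separate client from server time is to subtract the server-reported durations from the wall-clock time measured on the client. The sketch below takes the response body as a plain dict and uses the total_duration, load_duration, prompt_eval_duration, and eval_duration fields (all in nanoseconds). The function name and the sample numbers are illustrative assumptions.

```python
NS_PER_S = 1e9


def latency_breakdown(resp: dict, client_elapsed_s: float) -> dict:
    """Split one request into server phases and client/network overhead."""
    server_total = resp["total_duration"] / NS_PER_S
    return {
        "load_s": resp.get("load_duration", 0) / NS_PER_S,
        "prompt_s": resp.get("prompt_eval_duration", 0) / NS_PER_S,
        "generate_s": resp.get("eval_duration", 0) / NS_PER_S,
        "server_total_s": server_total,
        # Whatever the server did not account for: network, client library, JSON parsing.
        "client_overhead_s": client_elapsed_s - server_total,
    }


# Illustrative values only.
sample_response = {
    "total_duration": 2_000_000_000,
    "load_duration": 500_000_000,
    "prompt_eval_duration": 300_000_000,
    "eval_duration": 1_100_000_000,
}

report = latency_breakdown(sample_response, client_elapsed_s=2.4)
for name, seconds in report.items():
    print(f"{name}: {seconds:.2f}")
```

A large load_s points at model loading rather than generation, and a large client_overhead_s points at the network or the client rather than the server.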

EXERCISE

Run a benchmark script that measures tokens per second. Then force CPU-only mode by restarting the Ollama server with CUDA_VISIBLE_DEVICES=-1 (an invalid GPU ID, which hides the GPU from the server) and re-run. Compare results to determine how much GPU acceleration helps in your setup.
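As a starting point for the exercise, here is a minimal benchmark sketch using only the standard library. It assumes the default endpoint used earlier in this chapter and the eval_count and eval_duration response fields; the function names are hypothetical, and the post argument exists only so the HTTP call can be swapped out when testing.

```python
import json
import statistics
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint used earlier in this chapter


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def benchmark(model: str, prompt: str, runs: int = 5, post=post_json) -> float:
    """Mean tokens per second over several non-streaming requests."""
    rates = []
    for _ in range(runs):
        body = post(OLLAMA_URL, {"model": model, "prompt": prompt, "stream": False})
        rates.append(body["eval_count"] / (body["eval_duration"] / 1e9))
    return statistics.mean(rates)


def speedup(gpu_tps: float, cpu_tps: float) -> float:
    """How many times faster the GPU run was than the CPU-only run."""
    return gpu_tps / cpu_tps
```

Call benchmark("llama3.2:1b", "Write a haiku") once with the GPU available and once in CPU-only mode, then pass both results to speedup.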
