How to monitor CPU and GPU memory during inference
Prerequisites: nvidia-smi (NVIDIA) or rocm-smi (AMD) available on the PATH.
What this does
Monitoring memory usage during inference helps diagnose out-of-memory errors, identify memory leaks, and tune offloading settings for optimal performance.
Steps
Monitor GPU memory in real-time.
    nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1
Expected: a new CSV row every second showing VRAM usage and GPU utilization.
Log memory to a file during a benchmark run.
    # Start logging in the background (nounits keeps the logged values numeric)
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -lms 500 > gpu_mem_log.csv &
    LOG_PID=$!
    # Run inference
    ./llama-cli -m model.gguf -p "Long prompt here" -n 512
    # Stop logging
    kill "$LOG_PID"
Monitor CPU memory on Windows.
    Get-Process -Name ollama | Select-Object WorkingSet64, PrivateMemorySize64
    # Or watch total system memory
    while ($true) { Get-Counter "\Memory\Available MBytes"; Start-Sleep 1 }
Plot memory usage over time.
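    import pandas as pd
    import matplotlib.pyplot as plt

    # One used-memory sample per line; read as text so a trailing "MiB" unit can be stripped
    df = pd.read_csv("gpu_mem_log.csv", header=None, names=["memory_mib"], dtype=str)
    df["memory_mib"] = df["memory_mib"].str.replace("MiB", "", regex=False).str.strip().astype(float)
    df.plot()
    plt.xlabel("Sample index")
    plt.ylabel("GPU Memory (MiB)")
    plt.savefig("memory_profile.png")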
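When a plot is more than you need, a few summary numbers from the same log can answer the out-of-memory question directly. The following is a minimal sketch using only the standard library. It assumes the log holds one used-memory value per line, with or without a trailing "MiB" unit. The helper names `parse_mib` and `summarize_gpu_log` are hypothetical and not part of any tool used in this guide.

```python
from statistics import mean


def parse_mib(line):
    """Parse one log line such as '1234 MiB' or '1234' into a float (MiB)."""
    text = line.strip()
    if text.endswith("MiB"):
        text = text[: -len("MiB")].strip()
    return float(text)


def summarize_gpu_log(lines):
    """Return sample count, peak and mean used memory for an iterable of log lines."""
    # Blank lines are skipped; anything else must parse as a number.
    samples = [parse_mib(line) for line in lines if line.strip()]
    if not samples:
        raise ValueError("log contains no samples")
    return {
        "samples": len(samples),
        "peak_mib": max(samples),
        "mean_mib": mean(samples),
    }
```

To use it on the real log, pass an open file handle, for example `summarize_gpu_log(open("gpu_mem_log.csv"))`, and compare `peak_mib` against the card's total VRAM.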
Verification
    # Check that the log contains memory samples (on Linux: head -n 5 gpu_mem_log.csv)
    Get-Content gpu_mem_log.csv | Select-Object -First 5
    # Expected: one memory value per line; across the full log, values rise as the model loads, then plateau during generation
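The "rise, then plateau" expectation can also be checked programmatically instead of by eye. This is a rough sketch, not a definitive test. The function name `rises_then_plateaus`, the 25% tail window, and the 5% tolerance are all assumptions chosen for illustration; tune them to your own runs.

```python
def rises_then_plateaus(samples, tail_fraction=0.25, tolerance=0.05):
    """Return True if memory starts below its peak and the final samples sit near the peak.

    samples: used-memory readings in log order (any consistent unit).
    tail_fraction and tolerance are illustrative defaults, not recommended values.
    """
    if len(samples) < 4:
        return False  # too few samples to judge a shape
    peak = max(samples)
    if peak <= 0:
        return False
    tail_len = max(1, int(len(samples) * tail_fraction))
    tail = samples[-tail_len:]
    near_peak = peak * (1 - tolerance)
    rose = samples[0] < near_peak                     # started clearly below the peak
    flat_tail = all(value >= near_peak for value in tail)  # ended close to the peak
    return rose and flat_tail
```

If logging continues after the inference process has exited, the last samples will drop and the check will fail, so stop the logger as soon as generation finishes.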
Common failures
- nvidia-smi shows zero GPU activity: The model may be running entirely on CPU. Check the --n-gpu-layers setting.
- Sampling interval too short: -lms 100 (100 ms) can add polling overhead and noise to the log. 500 ms is usually sufficient to capture model load and generation.
- Permission denied: On Linux, nvidia-smi may need sudo for certain metrics. On Windows, run PowerShell as Administrator.
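The first failure above can be spotted from captured output of the real-time query. The sketch below assumes each row looks like `1234 MiB, 8192 MiB, 45 %` (used, total, utilization) and that the header row starts with `memory.used`. The names `parse_sample` and `gpu_looks_idle` and the 1% threshold are hypothetical. Rows containing non-numeric values such as `[N/A]` are not handled.

```python
def parse_sample(line):
    """Parse '1234 MiB, 8192 MiB, 45 %' into (used_mib, total_mib, util_percent)."""
    used, total, util = (field.strip() for field in line.split(","))
    return (
        float(used.replace("MiB", "").strip()),
        float(total.replace("MiB", "").strip()),
        float(util.replace("%", "").strip()),
    )


def gpu_looks_idle(lines, util_threshold=1.0):
    """Return True if every sample's GPU utilization is below util_threshold percent."""
    samples = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("memory.used"):
            continue  # skip blank lines and the CSV header row
        samples.append(parse_sample(line))
    if not samples:
        raise ValueError("no samples to inspect")
    return all(util < util_threshold for _, _, util in samples)
```

A `True` result during generation suggests the model is not offloaded to the GPU, which points back to the --n-gpu-layers setting.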
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
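One way to keep that record consistent between runs is to store it as a small JSON file next to the log. This is only a sketch: the field names and the helpers `build_checkpoint` and `save_checkpoint` are hypothetical, and the values you pass in are whatever applies to your own setup.

```python
import json
import platform
from datetime import datetime, timezone


def build_checkpoint(runtime, model, backend, verification_output):
    """Collect the details named in the checkpoint into one JSON-serializable dict."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "os": platform.platform(),
        "runtime": runtime,                  # local runtime name and version, as you report it
        "model": model,                      # model or package version
        "backend": backend,                  # hardware/backend, if relevant
        "verification_output": verification_output,
    }


def save_checkpoint(record, path):
    """Write the record as indented JSON so it can be diffed between runs."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2)
```

For example, `save_checkpoint(build_checkpoint("my-runtime 1.0", "model.gguf", "my-gpu", "first 5 log lines here"), "checkpoint.json")`, where every argument is a placeholder.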