HOW-TO · INF

How to monitor CPU and GPU memory during inference

Intermediate · 10 min · By Fredoline Eruo
PREREQUISITES

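nvidia-smi (NVIDIA) or rocm-smi (AMD) available on your PATH. The commands in this guide use nvidia-smi. Step 4 also needs Python with pandas and matplotlib installed.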

What this does

Monitoring memory usage during inference helps diagnose out-of-memory errors, identify memory leaks, and tune offloading settings for optimal performance.

Steps

  1. Monitor GPU memory in real-time.

    nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1
    

    Expected: A CSV header followed by one new row per second showing VRAM used, VRAM total, and GPU utilization. Press Ctrl+C to stop.
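
If you want to act on those rows in a script rather than read them by eye, each data row can be split into numbers. The sketch below is a minimal illustration, assuming rows of the form `1234 MiB, 8192 MiB, 45 %` and that the header row has already been skipped. The function name `parse_gpu_sample` and the sample values are hypothetical.

```python
def parse_gpu_sample(line):
    """Parse one data row such as '1234 MiB, 8192 MiB, 45 %' (header row not handled)."""
    used, total, util = (field.strip() for field in line.split(","))
    used_mib = int(used.split()[0])    # drop the 'MiB' unit
    total_mib = int(total.split()[0])
    util_pct = int(util.split()[0])    # drop the '%' sign
    return {
        "used_mib": used_mib,
        "total_mib": total_mib,
        "used_fraction": used_mib / total_mib,
        "gpu_util_pct": util_pct,
    }


# Made-up sample row, for illustration only
sample = parse_gpu_sample("1234 MiB, 8192 MiB, 45 %")
print(f"{sample['used_fraction']:.1%} of VRAM in use")  # prints "15.1% of VRAM in use"
```

The fraction of VRAM in use is usually the more useful number when you are deciding how many layers to offload.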

  2. Log memory to a file during a benchmark run.

    # Start logging in the background (bash syntax; nounits writes each sample as a bare number in MiB)
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -lms 500 > gpu_mem_log.csv &
    LOG_PID=$!
    # Run inference
    ./llama-cli -m model.gguf -p "Long prompt here" -n 512
    # Stop logging
    kill "$LOG_PID"
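
The start, run, stop pattern above can also be written in Python, which avoids shell-specific job control and stops the logger even when the inference command fails. This is a sketch, not a tested wrapper for any particular runtime. The function name `run_with_logger` is hypothetical, and the commands in the usage comment simply mirror the ones in this step.

```python
import subprocess


def run_with_logger(logger_cmd, run_cmd, log_path):
    """Run run_cmd while logger_cmd writes its stdout to log_path.

    Returns the exit code of run_cmd.
    """
    with open(log_path, "w") as log_file:
        logger = subprocess.Popen(logger_cmd, stdout=log_file)
        try:
            # Blocks until the inference command exits
            return subprocess.run(run_cmd).returncode
        finally:
            # Stop the logger even if the inference command raised or failed
            logger.terminate()
            logger.wait()


# Usage (assumes nvidia-smi and ./llama-cli exist on this machine):
# run_with_logger(
#     ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader", "-lms", "500"],
#     ["./llama-cli", "-m", "model.gguf", "-p", "Long prompt here", "-n", "512"],
#     "gpu_mem_log.csv",
# )
```

Passing the commands as argument lists keeps quoting out of the picture, so a prompt containing spaces needs no escaping.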
    
  3. Monitor CPU memory on Windows.

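    # Memory of the inference process, in bytes (replace "ollama" with your runtime's process name)
    Get-Process -Name ollama | Select-Object WorkingSet64, PrivateMemorySize64
    # Or watch available system memory, sampled once per second
    while ($true) { Get-Counter "\Memory\Available MBytes"; Start-Sleep 1 }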
    
  4. Plot memory usage over time.

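    import pandas as pd
    import matplotlib.pyplot as plt
    # One sample per line; strip a trailing "MiB" in case the log was written with units
    df = pd.read_csv("gpu_mem_log.csv", header=None, names=["memory_mib"])
    df["memory_mib"] = pd.to_numeric(df["memory_mib"].astype(str).str.replace("MiB", "", regex=False).str.strip())
    df.plot()
    plt.xlabel("Sample (500 ms apart)")
    plt.ylabel("GPU Memory (MiB)")
    plt.savefig("memory_profile.png")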
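
Alongside the plot, a few summary numbers are often enough to compare runs. The sketch below uses only the standard library and assumes one sample per line, 500 ms apart, with or without a trailing `MiB`. The function name `summarize_log` and its field names are hypothetical.

```python
def summarize_log(lines, interval_s=0.5):
    """Summarize a single-GPU memory log with one sample per line."""
    values = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        values.append(int(text.split()[0]))  # tolerates '1234' and '1234 MiB'
    if not values:
        raise ValueError("log contains no samples")
    peak = max(values)
    return {
        "samples": len(values),
        "start_mib": values[0],
        "peak_mib": peak,
        # Time of the first sample that reached the peak
        "peak_at_s": values.index(peak) * interval_s,
        "growth_mib": peak - values[0],
    }


# Usage:
# with open("gpu_mem_log.csv") as f:
#     print(summarize_log(f))
```

On a multi-GPU machine nvidia-smi writes one line per GPU per sample, so the log would need to be split per device before using this.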
    

Verification

# Check that the log contains memory samples (PowerShell; in bash use: head -n 5 gpu_mem_log.csv)
Get-Content gpu_mem_log.csv | Select-Object -First 5
# Expected: one memory value per line, rising as the model loads and leveling off during generation
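
The rise-then-plateau expectation can also be checked in code instead of by eye. This is a rough heuristic, assuming a list of memory samples in MiB. The function name `rises_then_plateaus` and the defaults for `tail` and `tolerance_mib` are arbitrary illustrative choices, not recommended thresholds.

```python
def rises_then_plateaus(values, tail=5, tolerance_mib=64):
    """Return True if memory grew from its first sample and the last
    `tail` samples stay within `tolerance_mib` of each other."""
    if len(values) <= tail:
        raise ValueError("need more samples than the tail length")
    tail_values = values[-tail:]
    rose = max(values) > values[0]
    flat = max(tail_values) - min(tail_values) <= tolerance_mib
    return rose and flat
```

If the logger captured samples after the inference process exited, trim those first, since the drop back to baseline would make the tail look unstable.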

Common failures

  • nvidia-smi shows zero GPU activity: The model may be running entirely on CPU. Check the --n-gpu-layers setting.
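  • Sampling interval poorly chosen: A long interval can miss short allocation spikes, while a very short one such as -lms 100 (100 ms) can add polling overhead and produces large logs. 500 ms is a reasonable starting point.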
  • Permission denied: On Linux, nvidia-smi may need sudo for certain metrics. On Windows, run PowerShell as Administrator.
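
How the sampling interval affects what a log can capture is easiest to see with synthetic data. The sketch below uses a made-up trace at 100 ms resolution with a 200 ms spike and compares the peak seen at two sampling rates. The function name `observed_peak` and all values are illustrative, not measurements.

```python
def observed_peak(trace_mib, step):
    """Peak seen when only every `step`-th sample of the trace is kept."""
    return max(trace_mib[::step])


# Synthetic trace at 100 ms resolution: 4000 MiB baseline with a 200 ms spike
trace = [4000] * 20
trace[7] = trace[8] = 6500

print(observed_peak(trace, 1))  # 100 ms sampling -> 6500
print(observed_peak(trace, 5))  # 500 ms sampling -> 4000 (spike falls between samples)
```

A spike shorter than the interval is only caught when a sample happens to land on it, so an out-of-memory error with a flat log can be a reason to shorten the interval for one run.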

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
