RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · OPS

How to monitor GPU utilization for AI inference servers with nvidia-ml-py

intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

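NVIDIA GPU with a working driver, plus Python 3 and pip. The nvidia-ml-py package (installed in step 2) provides the pynvml module used throughout.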

What this does

This guide uses pynvml, the Python bindings for the NVIDIA Management Library (NVML), to collect real-time GPU metrics and expose them in Prometheus format. The steps below cover utilization and memory usage; temperature and power draw can be read through the same NVML device handle. With these metrics, operators can track GPU saturation, detect memory leaks in inference servers, and set alerts for thermal throttling. The approach works with any NVIDIA GPU supported by the installed driver and does not require DCGM.
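The numbered steps only wire up utilization and memory, so here is a minimal sketch of how the temperature and power readings could be taken. It relies on the NVML calls nvmlDeviceGetTemperature and nvmlDeviceGetPowerUsage; the function name read_thermal_and_power is hypothetical, and passing the pynvml module in as an argument is an assumption made so the logic can be tried without a GPU.

```python
def read_thermal_and_power(nvml, handle):
    """Return (temperature in Celsius, power draw in watts) for one GPU.

    `nvml` is the imported pynvml module and `handle` comes from
    nvmlDeviceGetHandleByIndex(). Both are passed in (an assumption of
    this sketch) so a stand-in object can replace pynvml in tests.
    """
    # NVML_TEMPERATURE_GPU selects the on-die GPU temperature sensor.
    temp_c = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
    # NVML reports power in milliwatts; convert to watts.
    power_w = nvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    return temp_c, power_w
```

On a real host this would be called as read_thermal_and_power(pynvml, handle) after nvmlInit(), with the two values set on additional gauges (for example nvidia_gpu_temperature_celsius and nvidia_gpu_power_watts, names chosen here for illustration). Some GPUs do not expose power readings, in which case NVML raises an error that the caller should handle.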

Steps

  1. Verify the NVIDIA driver and nvidia-smi are functional:

    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
    

    Expected output: CSV header followed by a data row, e.g., 45 %, 8192 MiB, 24576 MiB.

  2. Install the Python libraries:

    pip install nvidia-ml-py prometheus-client
    
  3. Create a gpu_metrics.py module that initializes NVML and defines Prometheus gauges:

    # nvidia-ml-py installs its module under the name pynvml.
    from pynvml import (
        nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
        nvmlDeviceGetUtilizationRates, nvmlDeviceGetMemoryInfo,
    )
    from prometheus_client import Gauge, generate_latest, CollectorRegistry
    
    # A dedicated registry limits /metrics output to the GPU gauges below.
    registry = CollectorRegistry()
    gpu_util = Gauge("nvidia_gpu_utilization_percent", "GPU utilization", ["gpu"], registry=registry)
    gpu_mem_used = Gauge("nvidia_gpu_memory_used_bytes", "GPU memory used", ["gpu"], registry=registry)
    gpu_mem_total = Gauge("nvidia_gpu_memory_total_bytes", "GPU memory total", ["gpu"], registry=registry)
    
    # Initialize NVML once at import time; every later query depends on it.
    nvmlInit()
    
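  4. Add a collection function and a loop that updates the gauges every 5 seconds:

    import time

    def collect_gpu_metrics():
        """Read every GPU once and update the gauges."""
        count = nvmlDeviceGetCount()
        for i in range(count):
            handle = nvmlDeviceGetHandleByIndex(i)
            util = nvmlDeviceGetUtilizationRates(handle)  # util.gpu is a percentage
            mem = nvmlDeviceGetMemoryInfo(handle)         # mem.used and mem.total are bytes
            gpu_util.labels(gpu=str(i)).set(util.gpu)
            gpu_mem_used.labels(gpu=str(i)).set(mem.used)
            gpu_mem_total.labels(gpu=str(i)).set(mem.total)

    def collect_loop(interval=5.0):
        """Refresh the gauges forever; run this in a background thread (step 6)."""
        while True:
            collect_gpu_metrics()
            time.sleep(interval)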
    
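  5. Serve the metrics over HTTP on port 9400, the port the scrape config in step 7 targets. Use FastAPI or a minimal http.server with a /metrics route that returns generate_latest(registry). For the simplest case, prometheus_client also ships start_http_server(9400, registry=registry).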

  6. Run the collector as a background thread or asyncio task to ensure metrics stay current between scrapes.

  7. Add a scrape job under scrape_configs in prometheus.yml:

    - job_name: "gpu-metrics"
      scrape_interval: 5s
      static_configs:
        - targets: ["localhost:9400"]
    
  8. Query GPU utilization in Prometheus:

    nvidia_gpu_utilization_percent
    

    Expected: value between 0 and 100 for each GPU index.
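Steps 5 and 6 are described in prose only, so the following is one possible sketch of them using just the standard library. The names make_metrics_handler, run_periodically, and start_exporter are hypothetical, as is the choice to bind to 127.0.0.1 by default. The render and collect callables are passed in so the HTTP and threading logic does not depend on a GPU being present.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_metrics_handler(render):
    """Build a handler class whose /metrics route returns render() as bytes."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render()
            self.send_response(200)
            # Content type of the Prometheus text exposition format.
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep per-scrape request logs out of stderr

    return MetricsHandler


def run_periodically(task, interval_s, stop):
    """Call task() every interval_s seconds until the stop event is set."""
    while not stop.is_set():
        task()
        stop.wait(interval_s)


def start_exporter(render, collect, host="127.0.0.1", port=9400,
                   interval_s=5.0, stop=None):
    """Start the collector thread and return an HTTP server ready to serve."""
    stop = stop or threading.Event()
    threading.Thread(target=run_periodically,
                     args=(collect, interval_s, stop), daemon=True).start()
    return ThreadingHTTPServer((host, port), make_metrics_handler(render))


# Intended wiring with the module from steps 3 and 4 (not executed here):
#   from prometheus_client import generate_latest
#   from gpu_metrics import registry, collect_gpu_metrics
#   start_exporter(lambda: generate_latest(registry),
#                  collect_gpu_metrics).serve_forever()
```

Running the collector on its own thread means a scrape never waits on NVML calls. In this sketch an exception raised inside collect ends the collector thread, so a production version would probably want to catch and log NVML errors inside the loop.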

Verification

curl -s http://localhost:9400/metrics | grep nvidia_gpu_utilization_percent

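Expected output: the # HELP and # TYPE lines for the metric, followed by a sample such as nvidia_gpu_utilization_percent{gpu="0"} 45.0. An idle GPU legitimately reports 0.0, so send an inference request before checking if you want to see a non-zero value.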
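For a scripted version of this check, the sketch below parses the exposition text and confirms that every utilization sample falls between 0 and 100. The function names are hypothetical, and the parser assumes each sample carries exactly one gpu label, as the gauges in step 3 do.

```python
import re


def parse_gpu_gauge(exposition_text, metric_name):
    """Map GPU index (as a string) to value for one gauge in Prometheus text format."""
    # Matches lines like: metric_name{gpu="0"} 45.0
    pattern = re.compile(r'^%s\{gpu="([^"]+)"\}\s+(\S+)$' % re.escape(metric_name))
    values = {}
    for line in exposition_text.splitlines():
        match = pattern.match(line)
        if match:  # comment lines (# HELP, # TYPE) never match
            values[match.group(1)] = float(match.group(2))
    return values


def utilization_looks_valid(exposition_text):
    """True if at least one GPU is reported and all values are within 0..100."""
    values = parse_gpu_gauge(exposition_text, "nvidia_gpu_utilization_percent")
    return bool(values) and all(0.0 <= v <= 100.0 for v in values.values())
```

The text to check can come from the curl command above or from urllib.request.urlopen("http://localhost:9400/metrics").read().decode().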

Common failures

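  • NVML initialization fails: confirm the NVIDIA kernel module is loaded with lsmod | grep nvidia. If it is absent, install a driver with sudo ubuntu-drivers install (the older autoinstall subcommand is deprecated) and reboot.
  • Gauges always report zero: the collection function is probably never called. Verify that the background thread is started, for example threading.Thread(target=collect_loop, daemon=True).start(). An idle GPU also reports 0 % utilization, so check nvidia_gpu_memory_total_bytes as well; it is non-zero whenever collection runs.
  • Permission denied on /dev/nvidia0: inspect the device permissions with ls -l /dev/nvidia*. If access is restricted to a group (commonly video), add the user with sudo usermod -aG video $USER and log in again.
  • nvidia-smi works but pynvml fails: the likely cause is a version mismatch between the installed nvidia-ml-py and the driver. Compare the output of pip show nvidia-ml-py with nvidia-smi --query-gpu=driver_version --format=csv,noheader.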
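To tell these failures apart from Python, a small diagnostic along the following lines may help. It is a sketch that maps NVML error codes to the hints above; diagnose_nvml is a hypothetical name, and the pynvml module is passed in as an argument (an assumption of this sketch) so the logic can be tested without a driver.

```python
def diagnose_nvml(nvml):
    """Try nvmlInit() and return a short human-readable diagnosis.

    `nvml` is the imported pynvml module.
    """
    hints = {
        nvml.NVML_ERROR_DRIVER_NOT_LOADED:
            "driver not loaded; check lsmod | grep nvidia",
        nvml.NVML_ERROR_NO_PERMISSION:
            "permission denied; check ls -l /dev/nvidia*",
        nvml.NVML_ERROR_LIBRARY_NOT_FOUND:
            "NVML shared library not found; is the driver installed?",
        nvml.NVML_ERROR_LIB_RM_VERSION_MISMATCH:
            "library and kernel driver versions differ; reboot or reinstall",
    }
    try:
        nvml.nvmlInit()
    except nvml.NVMLError as err:
        # pynvml stores the numeric NVML return code on err.value.
        return "NVML init failed: " + hints.get(err.value, str(err))
    try:
        version = nvml.nvmlSystemGetDriverVersion()
        if isinstance(version, bytes):  # older bindings return bytes
            version = version.decode()
        return "NVML OK, driver " + version
    finally:
        nvml.nvmlShutdown()
```

On a real host, run it as import pynvml; print(diagnose_nvml(pynvml)) under the same user account that runs the exporter, since permission problems are per-user.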

Related guides

  • Implement GPU-based autoscaling with custom metrics
  • Configure GPU access in Docker Compose for AI inference
  • Deploy vLLM on Kubernetes with GPU node selection