RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · INF

How to choose the right quantization level based on your hardware

intermediate·10 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Knowledge of your system's available RAM, and a base model in mind

What this does

Measures available system memory, estimates RAM demands for quantized models, and selects an appropriate quantization level so inference runs without swapping or OOM errors. The result is a model and quantization pairing that fits the hardware budget.

Steps

  1. Check available system RAM. Measures the memory ceiling available for model loading.

    free -h
    

    Expected output: A table with total, used, free, and available columns. Budget against the available figure on the Mem row, not the free figure.

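  2. Estimate RAM requirement from model size and quantization. A 7B model in Q4_K_M needs roughly 4.5-5 GB on disk and 6-8 GB during inference, because the runtime adds the context (KV) cache and working buffers on top of the weights. Larger parameter counts scale roughly proportionally: a 13B model at the same quantization needs nearly twice as much.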

  3. Select the quantization that matches available memory. Use this guide (all figures are for 7B models):

    • Q8_0: requires 12+ GB free RAM
    • Q5_K_M: suits systems with 8-12 GB free
    • Q4_K_M: ideal for 6-8 GB free
    • Q3_K_M: for 4-6 GB free
    • Q2_K: use only below 4 GB free, when nothing larger fits
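  4. Pull the chosen variant. Downloads the model at the selected quantization level. Ollama encodes the quantization in the tag, and tag names differ per model, so confirm the exact tag on the model's page in the Ollama library before pulling.

    ollama pull llama3:8b-instruct-q4_K_M
    

    Expected output: Progress bars for each layer, ending with a success line.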
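
Steps 1 to 3 can be sketched as a small shell helper. It reads MemAvailable from /proc/meminfo (the figure that free reports as available) and maps it onto the thresholds listed in step 3. The function names mem_available_gib and pick_quant are hypothetical, and the thresholds are only this guide's 7B figures; treat the result as a starting point rather than a guarantee.

```shell
# Sketch only: function names are illustrative, and the thresholds are
# the 7B-model figures from step 3 of this guide.

# mem_available_gib [MEMINFO_FILE]
# Print available memory in whole GiB (rounded down).
mem_available_gib() {
    # MemAvailable is reported in kB; 1048576 kB = 1 GiB.
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# pick_quant FREE_GIB
# Map free memory in GiB onto a quantization level for a 7B model.
pick_quant() {
    if   [ "$1" -ge 12 ]; then echo "Q8_0"
    elif [ "$1" -ge 8 ];  then echo "Q5_K_M"
    elif [ "$1" -ge 6 ];  then echo "Q4_K_M"
    elif [ "$1" -ge 4 ];  then echo "Q3_K_M"
    else                       echo "Q2_K"
    fi
}
```

On the target machine, pick_quant "$(mem_available_gib)" prints the suggested level. Rounding down to whole GiB is deliberate: it biases the choice toward the lighter quantization when the machine sits near a threshold.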

Verification

free -h | awk 'NR==2{print "Available RAM: " $7}' && ollama list | grep q4_K_M
# Expected: Available RAM greater than estimated model need; model variant present in list
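
The one-liner above prints both values but leaves the comparison to the reader. A minimal sketch that makes the check scriptable follows: ram_fits compares MemAvailable against a required figure in GiB, and has_variant scans ollama list output piped into it. Both function names are hypothetical, and the required figure is whatever estimate came out of step 2.

```shell
# Sketch only: helper names are illustrative.

# ram_fits NEEDED_GIB [MEMINFO_FILE]
# Succeeds (exit 0) when MemAvailable is at least NEEDED_GIB.
ram_fits() {
    rf_need_kb=$(( $1 * 1048576 ))
    rf_avail_kb=$(awk '/^MemAvailable:/ { print $2 }' "${2:-/proc/meminfo}")
    [ "${rf_avail_kb:-0}" -ge "$rf_need_kb" ]
}

# has_variant PATTERN
# Reads `ollama list` output on stdin and succeeds when any model name
# contains PATTERN (case-insensitive). The header row is skipped.
has_variant() {
    awk -v pat="$1" '
        NR > 1 && index(tolower($1), tolower(pat)) { found = 1 }
        END { exit !found }
    '
}
```

For example, ram_fits 8 && ollama list | has_variant q4_K_M && echo "ready" prints ready only when both conditions hold, which makes the check usable in a setup script.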

Common failures

  • OOM killer triggers during model load - Available RAM was overestimated; close other applications or switch to a lighter quantization.
  • Disk size is not RAM usage - The on-disk size underreports the RAM need; real memory use also depends on the context window and batch settings.
  • GPU offload complications - Offloading layers to a GPU needs a supported runtime such as CUDA. Without it, inference falls back to the CPU and the whole model has to fit in system RAM; check runtime GPU support.
  • Confusing VRAM with RAM - On discrete GPU systems, VRAM and system RAM are separate pools; each must be considered independently.
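
The gap between disk size and RAM use comes largely from the KV cache, which grows linearly with the context window. The arithmetic can be sketched as below. The defaults (32 layers, 8 KV heads, head dimension 128, 2 bytes per value for an fp16 cache) are assumptions chosen to resemble Llama 3 8B, not values read from a model file, and kv_cache_mib is a hypothetical helper name.

```shell
# Sketch only: defaults are assumed, Llama-3-8B-like dimensions.

# kv_cache_mib CTX [LAYERS] [KV_HEADS] [HEAD_DIM] [BYTES_PER_VALUE]
# Print the approximate KV cache size in MiB for a context of CTX tokens.
kv_cache_mib() {
    kv_ctx="$1"
    kv_layers="${2:-32}"
    kv_heads="${3:-8}"
    kv_dim="${4:-128}"
    kv_bytes="${5:-2}"
    # Factor 2: one key tensor and one value tensor per layer.
    echo $(( 2 * kv_layers * kv_heads * kv_dim * kv_bytes * kv_ctx / 1048576 ))
}

kv_cache_mib 8192    # prints 1024
kv_cache_mib 32768   # prints 4096
```

Under these assumptions an 8K context adds about 1 GiB on top of the weights and a 32K context about 4 GiB, which is why a model that fits on paper can still trigger the OOM killer once the context window is raised.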

Related guides

  • How to run quantized models on systems with limited RAM
  • How to compare file sizes between different quantization formats