RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Hardware Planning for Local AI
COURSE · FND · B006

Hardware Planning for Local AI

Learn hardware planning for local ai through RunLocalAI's practical lens: hardware, gpu, vram and budget, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.

20 chapters·8h·Foundations track·By Fredoline Eruo
PREREQUISITES
  • B001

Course B006: Hardware Planning for Local AI

Running AI models locally requires matching your hardware to your workload. This course removes the guesswork from hardware planning. You will learn which specifications actually matter for AI inference, how to calculate VRAM requirements for any model, and how to build or buy a system that meets your specific use case at your specific budget. Whether you are running 7B parameter models on a laptop or 70B models on a server, this course gives you the numbers to make informed decisions.

What You Will Know After This Course

  • Calculate minimum and recommended VRAM requirements for any open-source model
  • Select and compare GPUs across budget tiers with confidence
  • Build a hardware plan that balances cost, performance, and future needs
  • Identify cost-effective alternatives including used hardware and cloud fallback options
CHAPTERS
  1. 01VRAM is EverythingVRAM capacity determines which models you can run locally and at what speed—without sufficient VRAM, no other specification matters for inference workloads.15 min
  2. 02Calculating VRAM NeedsQuantization reduces VRAM requirements dramatically at acceptable quality loss—INT4 quantization runs 7B models on 6GB VRAM. ```python # VRAM Calculator Template def calculate_vram_requirement( model_size_billions: float, precision: str = "FP16", context_length: int = 2048, overhead_factor: float = 1.25 ) -> float: """Calculate VRAM requirement in GB.""" precision_bytes = { "FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5 } weights_gb = model_size_billions * precision_bytes[precision] return weights_gb * overhead_factor # Example: Mistral 7B in INT4 print(f"{calculate_vram_requirement(7, 'INT4'):.1f} GB minimum") ```15 min
  3. 03GPU Selection: Budget TierThe RTX 3060 12GB delivers the best VRAM-per-dollar in the budget tier—prioritize VRAM over raw compute for inference workloads. ```bash # Verify CUDA availability and compute capability python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')" ```15 min
  4. 04GPU Selection: Mid-RangeThe RTX 4080's 16GB balances cost and capacity—running 13B models in FP16 without needing aggressive quantization. ```bash # Compare GPU performance using llama.cpp benchmark ./llama-bench -m models/llama-3-8b-q4_k_m.gguf -ngl 99 -t 12 # Expected output format: # llama_layer_load_time: 123.45ms # sample_time: 50.23ms # prompt_eval_time: 2345.67ms # tokens generated: 512 # tokens_per_second: 18.45 ```15 min
  5. 05GPU Selection: High-EndThe RTX 4090 offers the best practical balance in high-end consumer GPUs—24GB VRAM handles anything feasible while remaining achievable for individual purchases. ```bash # Monitor GPU temperature and power under load watch -n 1 nvidia-smi --query-gpu=temperature.gpu,power.draw,utilization.gpu --format=csv # Stress test to verify thermal headroom nvidia-smi -pl 450 # Set power limit to maximum python3 -c "import torch; torch.cudnn.benchmark=True; torch.zeros((4096,4096), device='cuda')" & ```15 min
  6. 06CPU-Only InferenceCPU inference is practical only for development/testing or non-interactive batch workloads—GPU acceleration is essential for interactive LLM use. ```bash # Run llama.cpp with CPU only (no GPU offload) ./main -m models/llama-3-8b-q4_k_m.gguf \ --seed 42 \ -p "Explain quantum computing" \ -n 256 \ -t 8 \ # threads -ngl 0 # NO GPU layers (CPU only) # Typical output: # llama_new_context_with_model: n_ctx = 2048, n_keep = 0 # CUDA: Not using # AVX2 = 1, AVX_VNNI = 1, FMA = 1 # inference takes 3-5 minutes for 256 tokens ```15 min
  7. 07AMD ROCm CompatibilityAMD GPUs are viable alternatives when NVIDIA prices are prohibitive, but verify framework support before purchase—llama.cpp works well, but other engines vary. ```bash # Check AMD GPU visibility in ROCm rocm-smi --showproductname rocm-smi --showid rocm-smi --showtemp rocm-smi --showbus # Expected output: # GPU ID : 0 # Name : gfx1100 # Bus : 3 # Temp (C) : 42 ```20 min
  8. 08Apple Silicon Deep DiveUnified memory bandwidth directly limits AI performance—prioritize memory capacity over GPU core count for local AI workloads. ```bash # macOS: Check Apple Silicon specs system_profiler SPHardwareDataType # Verify MLX installation for optimized inference pip3 install mlx python3 -c "import mlx.core; print(mlx.core.default_device())" # Speed test with MLX python3 -c " from transformers import AutoModelForCausalLM import time # Requires hummingbird or local model " ```15 min
  9. 09System RAM BenefitsSystem RAM matters primarily during model loading and CPU preprocessing—16GB is adequate for 7B models, but 32GB+ becomes necessary for 13B+ configurations. ```bash # Monitor RAM usage during inference python3 -c " import psutil import time process = psutil.Process() for i in range(30): mem = process.memory_info().rss / 1024**3 print(f'RAM usage: {mem:.2f} GB') time.sleep(1) " & # Run inference in parallel to see actual consumption ```20 min
  10. 10Motherboard and PSUA single high-end GPU typically needs a quality 850W PSU, while dual-GPU configurations require 1400W+—never cheap out on power delivery. ```bash # Check current system power draw when idle nvidia-smi -q -d POWER | grep "Power Draw" # Monitor power under inference load python3 -c " import torch import time # Sustained inference loads the GPU model = torch.nn.Linear(4096, 4096).cuda() data = torch.randn(512, 4096).cuda() for i in range(1000): result = model(data) torch.cuda.synchronize() if i % 100 == 0: print(f'Iteration {i}') " ```15 min
  11. 11External GPU EnclosureseGPU enclosures provide upgrade paths for laptops at 15-25% performance cost versus internal GPUs—acceptable for development, marginal for production use. ```bash # Verify Thunderbolt connection on macOS system_profiler SPThunderboltDataType # Check eGPU enumeration on Linux ls -la /sys/bus/pci/devices/ # Identify the eGPU by vendor ID (NVIDIA = 10de) lspci | grep -i vga ```15 min
  12. 12Cloud GPU FallbackCloud GPUs offer cost-effective capacity for occasional heavy workloads—calculate break-even against purchase cost before committing to either path. ```bash # Test connection to cloud llama.cpp instance curl -X POST http://instance-ip:8080/completion \ -H "Content-Type: application/json" \ -d '{"prompt": "Hello, explain AI", "max_tokens": 100}' # Monitor usage to calculate monthly cost watch -n 60 nvidia-smi ```20 min
  13. 13Budget Build: Entry-Level Under $500Entry-level builds with RTX 3060 12GB handle 7B models well—manage expectations for larger models through appropriate quantization. ```bash # Verify complete system specification inxi -Fxz # Expected output for key items: # GPU: NVIDIA GeForce RTX 3060 (12GB) # RAM: 16GB DDR4 (dual channel at 3600 MT/s) # Storage: 1TB NVMe (PCIe 4.0) # CPU: AMD Ryzen 5 5600 (6 cores, 12 threads) ```15 min
  14. 14Budget Build: Mid-Range $800-1200The RTX 4080 16GB balances cost and capability—$700 for the 4070 Ti saves $300 but loses 4GB VRAM and significant 40B+ model capacity. ```bash # Full system benchmark for AI workloads python3 << 'EOF' import torch import time print(f"CUDA available: {torch.cuda.is_available()}") if torch.cuda.is_available(): print(f"GPU: {torch.cuda.get_device_name(0)}") print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB") # Inference simulation start = time.time() for _ in range(100): x = torch.randn(4096, 4096, device='cuda') y = x @ x.T torch.cuda.synchronize() elapsed = time.time() - start print(f"Matrix operations: {elapsed:.2f}s ({100/elapsed:.1f} ops/sec)") EOF ```20 min
  15. 15Budget Build: High-Performance $2000-3000RTX 4090 at 24GB runs 70B models via INT4 quantization at interactive speeds—$1700 GPU cost is the practical ceiling for consumer-grade AI workstations. ```bash # Measure actual inference performance ./llama-bench \ -m models/llama-3-8b-instruct-q4_k_m.gguf \ -ngl 999 \ -t 16 \ -c 4096 \ -n 512 # Parse key metrics # tokens_per_second: Expected 60-75 # prompt_eval_time: Memory bandwidth test # sample_time: Compute-bound test ```15 min
  16. 16Used GPU Buying GuideUsed RTX 3080/RTX 3090 cards offer the best value-to-performance ratio at 35-45% below retail—prioritize low hours-display and no mining history indicators. ```bash # Check GPU usage hours (NVIDIA display driver private data) nvidia-smi -q -i 0 --display=utilization | head -50 # Look for "Used hours" or "Total hours" if available # Historical temperature check on some cards nvidia-smi -q -i 0 -x | grep -A 5 "Temperature" # Higher historical max = hotter operation history ```20 min
  17. 17Future-ProofingAllocate budget for infrastructure upgrades (PSU, case, storage) that outlast GPU cycles—these components cost more to replace than upgrade. ```bash # Monitor framework releases for efficiency improvements # Check llama.cpp releases monthly # https://github.com/ggerganov/llama.cpp/releases # Example: Q8_0 optimization reduced memory by 15% with same quality # Version 2000: 3.5GB for 7B # Version 2100: 3.0GB for 7B (after optimization) ```15 min
  18. 18Multi-GPU SetupMulti-GPU setups rarely achieve linear scaling due to communication overhead—evaluate whether cost difference versus single high-end GPU justifies the complexity. ```bash # Monitor both GPUs during parallel inference nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \ --format=csv # Split model by layer count for tensor parallel # 80 layers on 2 GPUs = 40 layers each # Total VRAM = GPU_VRAM * 2 (minus overhead) ```20 min
  19. 19Hardware BenchmarkingStandardized benchmarks using consistent prompts, model versions, and quantization formats enable valid hardware comparisons. ```bash # Automated benchmark logging echo "GPU: $(nvidia-smi --query-gpu=name --format=csv,noheader)" > benchmark_results.txt echo "CUDA Version: $(nvcc --version | grep release)" >> benchmark_results.txt echo "llama.cpp commit: $(git rev-parse HEAD)" >> benchmark_results.txt echo "---" >> benchmark_results.txt ./llama-bench >> benchmark_results.txt ```20 min
  20. 20Personal Hardware Plan ProjectThe best hardware plan balances current needs, budget constraints, and future flexibility—heavy users break even on purchases, light users should use cloud. ```markdown # Example Personal Hardware Plan Template Save as: `hardware_plan.md` ## My Hardware Plan for Local AI ### Current Status - GPU: nvidia-smi --query-gpu=name,memory.total - RAM: [system info] - Primary models: [List of models you run regularly] ### VRAM Requirements | Model | FP16 (GB) | INT8 (GB) | INT4 (GB) | |-------|-----------|-----------|-----------| [Your table here] ### Recommended Build [Your specific configuration] ### Purchase Decision Cloud [$X]/month vs. Purchase [$Y] one-time Break-even: [Calculated months] ```25 min
← All coursesStart chapter 1 →