RUNLOCALAI

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

Verdict: Won't fit

Running Llama 3.3 70B Instruct on NVIDIA GeForce RTX 4090

Llama 3.3 70B Instruct requires more memory than the NVIDIA GeForce RTX 4090 provides: even the smallest listed quantization (Q4_K_M) needs roughly 48 GB of VRAM, twice the card's 24 GB.

By Fredoline Eruo · Last verified May 14, 2026

  • Model size: 70B params (Llama 3.3 70B Instruct)
  • Memory available: 24 GB (NVIDIA GeForce RTX 4090)
  • Recommended quant: — (field shows the highest quality that fits; none does here)

Variants and what fits

Quantization   File size   VRAM required   Fits on NVIDIA GeForce RTX 4090 (24 GB)?
Q4_K_M         40.0 GB     48 GB           No
Q5_K_M         47.0 GB     56 GB           No
Q8_0           70.0 GB     80 GB           No
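The "fits" column follows from a simple headroom rule. A minimal sketch in Python, assuming roughly 20% overhead on top of the GGUF file size for KV cache and runtime buffers; this is an approximation for illustration, not the site's exact formula (the listed Q8_0 figure implies slightly less overhead):

```python
# Rough VRAM fit check for the quants listed above. The ~20% headroom
# for KV cache and runtime buffers is an assumption, not the site's
# exact formula.
GPU_VRAM_GB = 24.0  # NVIDIA GeForce RTX 4090

QUANTS = {           # quant -> GGUF file size in GB (from the table)
    "Q4_K_M": 40.0,
    "Q5_K_M": 47.0,
    "Q8_0": 70.0,
}

def vram_required(file_size_gb: float, overhead: float = 0.20) -> float:
    """Weights plus headroom for KV cache and runtime buffers."""
    return file_size_gb * (1 + overhead)

for name, size_gb in QUANTS.items():
    need = vram_required(size_gb)
    verdict = "fits" if need <= GPU_VRAM_GB else "does not fit"
    print(f"{name}: ~{need:.0f} GB needed -> {verdict}")
```

Every listed quant clears 24 GB by a wide margin, which is why the recommended-quant field is empty.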

Real benchmarks

Tool        Quant    Context   tok/s   VRAM used   Source
Ollama      Q4_K_M   8,192     14.8    23.4 GB     community
llama.cpp   Q4_K_M   4,096     8.0     —           community
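The Ollama row runs at all because llama.cpp-based tools keep only as many transformer layers in VRAM as fit and run the rest on CPU. A sketch of that split, assuming equal-size layers (a simplification) and an illustrative 22 GB budget to leave headroom under 24 GB; 80 is Llama 70B's actual layer count:

```python
# Layer-split arithmetic behind partial offload: load layers onto the
# GPU until the VRAM budget is exhausted, run the remainder on CPU.
# Equal-size layers and the 22 GB budget are illustrative assumptions.
N_LAYERS = 80          # transformer layers in Llama 3.3 70B
FILE_SIZE_GB = 40.0    # Q4_K_M GGUF size, from the table above
VRAM_BUDGET_GB = 22.0  # headroom below the 4090's 24 GB

per_layer_gb = FILE_SIZE_GB / N_LAYERS
n_gpu_layers = min(int(VRAM_BUDGET_GB // per_layer_gb), N_LAYERS)
print(f"~{per_layer_gb:.2f} GB/layer -> put {n_gpu_layers} of {N_LAYERS} layers on GPU")
```

With llama.cpp, this split is what the `-ngl` (`--n-gpu-layers`) flag controls; Ollama picks an equivalent split automatically, which is consistent with the ~23 GB VRAM-used figure in the table.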

Frequently asked

Can NVIDIA GeForce RTX 4090 run Llama 3.3 70B Instruct?

Not fully. The smallest listed quantization (Q4_K_M) needs about 48 GB of VRAM, while the NVIDIA GeForce RTX 4090 provides 24 GB. Community benchmarks show it can still run with part of the model offloaded to CPU, at reduced speed.

What quantization should I use?

No quantization of Llama 3.3 70B Instruct fits entirely on the NVIDIA GeForce RTX 4090. Pick a smaller model, or accept partial CPU offload at reduced speed.

How fast will it be?

Community benchmarks measured 14.8 tok/s with Ollama (Q4_K_M, 8,192-token context), likely with some layers offloaded to CPU, since even Q4_K_M exceeds the card's 24 GB.

See also: Llama 3.3 70B Instruct, NVIDIA GeForce RTX 4090, all benchmarks.

Reviewed by RunLocalAI Editorial. See our editorial policy.

Community benchmarks for this exact pair


Operator-submitted measurements for this specific model + hardware combination. Editorial review required before publication; provenance badge on every row.

No community benchmarks yet for this combination. Submit yours →