llama

70B parameters

Commercial OK

Reviewed June 2026

Llama 3.3 70B Instruct

Late-2024 refresh of the 70B Llama line. Roughly matches Llama 3.1 405B on most benchmarks at one-fifth the parameter count. The default high-end model for serious local inference on 48GB+ VRAM rigs.

License: Llama 3.3 Community License·Released Dec 6, 2024·Context: 131,072 tokens

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026

9.1/10

Positioning

The new ceiling for "what a serious local-AI hobbyist runs daily." Llama 3.3 70B closed almost the entire gap between open-weight 70B and frontier closed models for everyday work — chat, drafting, code review, RAG, multi-turn tool use. If you have an RTX 3090 / 4090 / 5090 / dual 3090s, this is your headline model.

Strengths

70B-class quality at 32B-class effective requirements: Meta clearly continued training with reasoning + tool-use data. Outperforms Llama 3.1 70B noticeably on instruction following.
Single-card runnable: Q4_K_M at 39 GB requires offload on a 24 GB card, but it works — ~22–28 tok/s with the right runner. On 32 GB+ (RTX 5090, dual cards) it's pure VRAM.
License unchanged from 3.1: same permissive commercial terms, same broad ecosystem.

Limitations

Single-card VRAM is the bottleneck, not compute — even at Q4 you're partial-offloading on 24 GB. Q5+ is dual-card or workstation territory.
No native vision: Meta kept multimodality on the 11B/90B vision branch. Use Llama 3.2 90B Vision or Pixtral if you need images.
Knowledge cutoff is early-2024, which shows on current-events tasks; the local stack should pair it with web search or RAG for time-sensitive work.

Real-world performance on RTX 4090

Q4_K_M (39 GB) — partial GPU offload: 22–28 tok/s decode, TTFT 350–500 ms on 1K prompt
Q5_K_M (47 GB) — heavy CPU offload: 9–14 tok/s, only worth it on dual-card setups
Q8_0 (70 GB) — workstation only (A6000 / dual 4090 / Mac Studio M2 Ultra)

Should you run this locally?

Yes, for anyone who wants a near-frontier general model and is comfortable with 22–28 tok/s. The quality jump from 8B-class is enormous. No, for users on 16 GB or less VRAM (you'll partial-offload onto system RAM and watch tok/s collapse), or for high-throughput agent loops where speed matters more than ceiling quality.

How it compares

vs Llama 3.1 70B → 3.3 is a meaningful upgrade on instruction following, math, and multi-turn coherence; same VRAM footprint. Always pick 3.3.
vs Qwen 2.5 72B → Qwen edges Llama on multilingual + raw knowledge; Llama edges Qwen on instruction reliability and tool-use. Both are fair picks; pick by license preference.
vs DeepSeek R1 Distill Llama 70B → R1 Distill is dramatically better at reasoning tasks (math, code, planning); base Llama 3.3 is better at general chat and writing. Run both side by side if disk allows.
vs Mixtral 8x22B → Mixtral has sparser compute but heavier total VRAM (~84 GB Q4) and is now behind on quality. Llama 3.3 70B replaces Mixtral 8x22B for almost every workload.

Run this yourself

ollama pull llama3.3:70b-instruct-q4_K_M
ollama run llama3.3:70b-instruct-q4_K_M

Settings used in the timing range above Quant: Q4_K_M GGUF Context: 8192 (--n-gpu-layers 65 of 81) Backend: llama.cpp via Ollama, CUDA 12.4 GPU: RTX 4090, 64 GB DDR5 system RAM

›Why this rating

9.1/10 — the best open-weight model you can run on a single 24 GB consumer GPU, and the highest-quality general-purpose local model period for anyone willing to live with 22–28 tok/s. Loses fractional points only because frontier closed models are still ahead on hard reasoning.

Overview

Late-2024 refresh of the 70B Llama line. Roughly matches Llama 3.1 405B on most benchmarks at one-fifth the parameter count. The default high-end model for serious local inference on 48GB+ VRAM rigs.

Featured in these stacks

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

Stack · L3·Workstation tier·Role: Primary 70B chat / instruction model
Dual RTX 3090 workstation stack — 70B-class on $1,800 of used GPUs
Llama 3.3 70B at AWQ-INT4 (~40 GB weights) fits dual-3090 with comfortable 6 GB headroom. The L1.25-enriched 70B-class default; superseded only by Qwen 2.5 72B for Apache-license sensitivity.
Stack · L3·Production tier·Role: Primary 70B model
Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink
Same 70B-class envelope as dual-3090. AWQ-INT4 fits 48 GB total / ~45 GB effective with 6 GB headroom for KV at 8K context. The L1.25-enriched 70B canonical.
Stack · L3·Homelab tier·Role: Primary 70B model (Q4_K_M)
Mixed RTX 4090 + 3090 workstation — the asymmetric upgrade path
70B Q4_K_M (~40 GB weights) fits the 48 GB total via layer-split. Real per-stream throughput trails dual-3090 NVLink by ~40-50% due to asymmetric stalling.

Execution notes

L1.25 enriched

Operator notes

Llama 3.3 70B Instruct is the production-default 70B model from late 2024 through mid-2026 — when teams say "we need self-hosted general-purpose serving at 70B scale," this is the reference point.

The 70B class sits in an awkward window:

Too big for single consumer cards (24 GB requires ~50% CPU offload — throughput drops 7-8x)
Too small to justify multi-node clustering (the /stacks/distributed-inference-homelab cost model)
Just right for 2x A100 80 GB or 1x H100, which is where most production-self-hosted deployments land.

Llama 4 70B succeeded this model in February 2026 with better reasoning at the same hardware envelope. Llama 3.3 70B remains the pragmatic default for two reasons: (1) deployment infrastructure already tuned for it, and (2) Apache-2.0-equivalent Llama Community License unchanged.

Deployment notes

Production tier: 2x A100 80 GB or 1x H100 80 GB with vLLM tensor-parallel-size=2 (or 1 for the H100 single-card config). AWQ-INT4 fits with ~30 GB of KV-cache headroom — comfortable for 32K context at single-user concurrency, ~5 concurrent users at 8K context.

Workstation tier: RTX 4090 24 GB with Q4_K_M offload. Throughput drops to ~14 tok/s; usable for single-user but unsuitable for production multi-user. The /stacks/rtx-4090-workstation recipe deliberately avoids 70B for this reason.

Apple Silicon: M3 Ultra Mac Studio (192 GB unified memory) handles MLX-4bit comfortably. M4 Max 128 GB is tight; M3 Max 64 GB requires offload.

Runtime compatibility

vLLM ✓ excellent. The reference production runtime; tensor parallel via NCCL is mature.
SGLang ✓ excellent. Same hardware envelope as vLLM; pick when shared system prompts dominate the workload.
Ollama ✓ good. Q4_K_M with consumer-tier offload for solo-developer workflows.
TensorRT-LLM ✓ excellent. Compile-once-per-SKU for absolute lowest TTFT; pick when single-request latency matters more than iteration speed.
MLX-LM ✓ good. Apple Silicon path with the M3 Ultra / M4 Max VRAM envelope.

When to use a different model

You need autonomous coding agents on a single 4090: Qwen 2.5 Coder 32B Instruct — fits AWQ-INT4 in 24 GB with no offload tax.
You need reasoning depth: DeepSeek R1 Distill Llama 70B — same hardware envelope, R1's explicit reasoning emission.
You're starting fresh with 70B-class infrastructure: Llama 4 70B — drop-in upgrade, same hardware, better benchmarks. Llama 3.3 70B is the pragmatic continuity choice; Llama 4 70B is the future.
You need Apache 2.0 license clean: Qwen 2.5 72B Instruct — Apache 2.0; reasoning-mode toggle; same hardware envelope.

Best use cases

Production general-purpose chat at 5-50 user team scale — the /stacks/rtx-4090-workstation recipe with vLLM scales here.
Long-context summarization — 128K context with verified recall at the 70B class.
Multilingual workflows — strong on European languages and CJK; better than 32B-class on the long-tail languages.
Drop-in upgrade from Llama 3.1 70B with no infrastructure changes — same architecture, slightly better instruction following.

Failure modes specific to this model

Offload tax on 24 GB cards is severe. Q4_K_M's 42 GB total means ~50% of layers run on CPU. PCIe-bound throughput drops decode to ~14 tok/s. Don't attempt 70B on a single 4090 for production.
NVLink absent on consumer multi-card setups. 2x consumer 4090s on PCIe lose 30-40% of throughput vs 2x A100 NVLink. The architectural cost of TP scales with interconnect quality.
Long-context KV cache pressure. 128K context costs ~12 GB of KV cache at FP16; tight on a 24 GB card after model weights. Drop max_model_len if you don't need full context.

Going deeper

/stacks/rtx-4090-workstation — the workstation recipe (avoids 70B; uses 32B class instead)
/stacks/distributed-inference-homelab — when 70B isn't enough and you go cluster
/systems/distributed-inference — the architectural depth on TP/PP/cluster economics
vLLM operational review — the production runtime

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model

Llama 3.1 70B Instruct70B

Datacenter

Family siblings (llama-3.x)

Llama 3.2 1B Instruct1B

Edge

Llama 3.2 3B Instruct3B

Edge

Llama 3.1 8B Instruct8B

Consumer

Llama 3.3 8B Instruct8B

Consumer

Llama 3.1 70B Instruct70B

Datacenter

Llama 3.3 70B Instruct70B

You are here

Distilled / fine-tuned from this

Dolphin 3 Llama 3.3 70B70B

Datacenter

Hermes 4 Llama 3.3 70B70B

Strengths

Top-tier reasoning at 70B size
128K context
Production-grade tool calling

Weaknesses

Needs 48GB+ VRAM at Q4
Slow on memory-bandwidth-bound consumer cards

Prompting kit

From model card

source

Tested patterns for getting the most out of Llama 3.3 70B Instruct locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Recommended system prompt

You are a helpful, honest, and concise assistant. Answer the user's question directly. If you need to use a tool, emit a JSON function call. If you don't know something, say so rather than guessing.

Quirks to know

•Multilingual: officially supports 8 languages — English, German, French, Italian, Portuguese, Hindi, Spanish, Thai. Per the model card, performance in other languages is not guaranteed.
•128K context window per the model card. Inference frameworks may default to a shorter context; check your runtime config.
•Per Meta's release notes, Llama 3.3 70B matches Llama 3.1 405B on most benchmarks at ~5× lower inference cost. Use 70B over 405B unless you specifically need the larger weights.
•Refusal-prone on ambiguous prompts. Per Meta's responsible-use guide, the model is tuned to add disclaimers; for production assistant work, anchor the system prompt to a specific persona and task so the model stays on-task.
•Native tool-calling support — declare tools in the system prompt as JSON schemas; the model emits function call JSON in the assistant turn.

Chat template

Llama 3

Uses <|begin_of_text|>, <|start_header_id|>{role}<|end_header_id|>, <|eot_id|> tokens. Apply via the runtime's built-in template (tokenizer_config.json ships the canonical version).

Tool calling

✓ Supported(json-function-calls)

Per the model card: tools are declared in the system prompt as a JSON schema, and the model responds with `{"name": "...", "parameters": {...}}` in the assistant turn. Compatible with llama.cpp --jinja and with the OpenAI tool-calling client convention.

Sampler settings

temperature: 0.6
top_p: 0.9

Meta does not publish strict sampler defaults; these are the community-converged values used in Meta's own evaluation harness. Lower temperature (0.1-0.3) for tool calling and structured output.

Browse prompting kits for every model →/prompting

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization	File size	VRAM required
Q4_K_M	40.0 GB	48 GB
Q5_K_M	47.0 GB	56 GB
Q8_0	70.0 GB	80 GB

Get the model

Ollama

One-line install

ollama run llama3.3:70bRead our Ollama review →

HuggingFace

Original weights

huggingface.co/meta-llama/Llama-3.3-70B-Instruct

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Llama 3.3 70B Instruct.

NVIDIA B300 (Blackwell Ultra)

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier

Models in the same parameter band as this one

Step up

More capable — bigger memory footprint

Step down

Smaller — faster, runs on weaker hardware

Frequently asked

What's the minimum VRAM to run Llama 3.3 70B Instruct?

48GB of VRAM is enough to run Llama 3.3 70B Instruct at the Q4_K_M quantization (file size 40.0 GB). Higher-quality quantizations need more.

Can I use Llama 3.3 70B Instruct commercially?

Yes — Llama 3.3 70B Instruct ships under the Llama 3.3 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Llama 3.3 70B Instruct?

Llama 3.3 70B Instruct supports a context window of 131,072 tokens (about 131K).

How do I install Llama 3.3 70B Instruct with Ollama?

Run `ollama pull llama3.3:70b` to download, then `ollama run llama3.3:70b` to start a chat session. The default quantization is Q4_K_M.

Compare against other models

Curated head-to-head decisions where Llama 3.3 70B Instruct is one of the contenders. For arbitrary pairings use /model-battle.

DeepSeek R1 Distill Llama 70B vs Llama 3.3 70B

reasoning vs instruction following

Llama 3.3 70B vs Qwen 3 32B

the size-vs-architecture tradeoff

Source: huggingface.co/meta-llama/Llama-3.3-70B-Instruct

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Compare hardware

Buyer guides

When it doesn't work

Recommended hardware

Alternatives

Llama 3.2 3B Instruct Llama 3.1 8B Instruct Llama 3.1 70B Instruct Llama 3.3 8B Instruct Llama 3.2 1B Instruct

Before you buy

Verify Llama 3.3 70B Instruct runs on your specific hardware before committing money.

Will it run on my hardware? →Custom hardware comparison →GPU recommender (4 questions) →