70B parameters · Commercial OK · Reviewed May 2026

Tulu 3 70B

Tulu 3 at 70B. AI2's fully-open instruct fine-tune — research transparency at scale.

License: Llama 3.1 Community License · Released Nov 21, 2024 · Context: 131,072 tokens

Overview

Tulu 3 70B is Ai2's instruction-tuned model built on Llama 3.1 70B. Tulu is Ai2's research fine-tune focused on improving instruction-following with a curated dataset mix and a fully open post-training pipeline. Use it for general chat, instruction-following, agent tasks, and knowledge work.

How to run it

Run at Q4_K_M via Ollama (ollama pull tulu3:70b) or llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is ~40 GB on disk. Minimum VRAM is 48 GB: an RTX A6000 (48 GB) handles Q4_K_M at 4K context, while an RTX 4090 (24 GB) needs Q3_K_M with partial offload. Recommended for serving: an A100 80GB at AWQ-INT4. Throughput: ~15-25 tok/s on an A6000 at Q4_K_M (4K context); ~30-45 tok/s on an A100. The architecture is standard Llama, so the model is drop-in compatible with any Llama inference stack. As a Llama 3.1 derivative, Tulu 3 70B inherits the Llama 3.1 Community License (Ai2's own Tulu artifacts, such as the training datasets, carry permissive licenses like ODC-By). Context: Llama 3.1-level — 128K nominal, practically 8-16K on 48 GB.
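A minimal quick-start, as a sketch: the tulu3:70b tag is an assumption (verify it exists in Ollama's catalog first), and the GGUF file name for llama.cpp is a placeholder for whatever you download or produce yourself.

  # Ollama — pull and chat (tag availability is not guaranteed; see "What breaks first")
  ollama pull tulu3:70b
  ollama run tulu3:70b

  # llama.cpp — all layers on GPU (-ngl 999), flash attention (-fa), 8K context (-c 8192)
  ./llama-cli -m ./tulu3-70b-Q4_K_M.gguf -ngl 999 -fa -c 8192

If an A6000 runs out of memory at 8K, drop -c to 4096 before reaching for a smaller quant.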

Hardware guidance

Minimum: RTX 3090 24GB at Q3_K_M with partial offload (4K). Recommended: RTX A6000 48GB at Q4_K_M (8K). Optimal: A100 80GB at AWQ-INT4.

VRAM math: 70B dense at Q4_K_M ≈ 40 GB of weights; KV cache plus runtime overhead at 8K is budgeted at ~10 GB, for a total of ~50 GB.

  • RTX A6000 48GB: borderline — trim context to 4K.
  • RTX 4090 24GB: Q3_K_M ≈ 30 GB, so part of the model and the KV cache must spill to system RAM.
  • RTX 5090 32GB: Q4_K_M's 40 GB of weights already exceed VRAM — offload is mandatory.
  • Dual RTX 4090 (48 GB total): Q4 at 8K — viable.
  • Mac Studio M4 Max 64GB: Q4_K_M at 5-10 tok/s.
  • Cloud: A100 80GB at $5-10/hr; AWQ-INT4 on an A100 enables 32K context.
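As a sanity check on the KV line item, here is the standard per-token KV-cache formula plugged with Llama 3.1 70B's published shape (80 layers, 8 KV heads under grouped-query attention, head dim 128). Treat it as back-of-envelope arithmetic, not a measurement:

  # KV bytes = 2 (K and V) × layers × kv_heads × head_dim × bytes/element × tokens
  layers=80; kv_heads=8; head_dim=128; fp16_bytes=2; ctx=8192
  echo $(( 2 * layers * kv_heads * head_dim * fp16_bytes * ctx ))  # 2684354560 ≈ 2.5 GiB

The raw FP16 figure is well below the ~10 GB budgeted above; the gap covers compute scratch buffers, allocator fragmentation, and stacks that hold the cache at higher precision, which is why the A6000 stays borderline rather than comfortable.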

What breaks first

  1. Tulu chat template. Tulu 3 uses Ai2's chat template, which differs slightly from standard Llama 3.1. Using the Llama 3.1 default template may produce subtly worse instruction-following. Use Tulu's template from tokenizer_config.json (see the snippet after this list).
  2. Benchmark overfitting. Tulu 3's training uses public benchmarks in the data mix. Performance on exact benchmark prompts may overstate real-world quality. Test on your own tasks.
  3. Q3 quality on instruction-following. Tulu's instruction-tuning is relatively shallow compared to base Llama training. At Q3, instruction adherence degrades more than base knowledge — the fine-tuned behavior is more quant-sensitive.
  4. Ollama tag freshness. Tulu 3 may not be in Ollama's default catalog. Check huggingface.co/allenai for GGUF availability or convert from the Hugging Face weights.
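For item 1, you can inspect the template without downloading the weights by pulling tokenizer_config.json straight from the repo (assumes curl and jq are installed; the chat_template field is the standard Hugging Face location):

  # Print Tulu 3's chat template and compare it against your runtime's default
  curl -s https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B/raw/main/tokenizer_config.json \
    | jq -r '.chat_template'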

Runtime recommendation

Ollama for quick starts (if a Tulu 3 tag exists). llama.cpp for fine-grained control. vLLM for serving. The Llama-based architecture means broad support. Tulu 3 uses the same chat-template family as Llama 3.1 with minor modifications — most stacks handle it correctly.
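For the vLLM path, a serving sketch matching the "A100 80GB at AWQ-INT4" recommendation above. Ai2 publishes the original weights only, so the AWQ repo name below is a placeholder — substitute a community AWQ-INT4 export if one exists:

  # Single A100 80GB; INT4 weights leave headroom for long context (~32K per the guidance above)
  vllm serve your-org/Llama-3.1-Tulu-3-70B-AWQ \
    --quantization awq \
    --max-model-len 32768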

Common beginner mistakes

Mistake: Using Llama 3.1's default chat template with Tulu 3. Fix: Tulu 3 uses Ai2's template. Check tokenizer_config.json for the exact format or use the model card's recommended template.

Mistake: Assuming Tulu 3 matches Llama 3.3 70B quality. Fix: Tulu 3 is fine-tuned on Llama 3.1 70B, not 3.3 — a different base model. Expect quality similar to Llama 3.1 70B with improved instruction-following.

Mistake: Expecting Tulu 3 to follow system prompts as aggressively as command-r models. Fix: Tulu 3 is instruction-tuned but not specifically system-prompt-optimized. Long system prompts may be ignored or only partially followed.

Mistake: Running at 128K context on consumer hardware. Fix: As with all ~70B models, the KV cache at 128K runs to 80+ GB. Keep context at 4-8K on 24-48 GB GPUs (one way to enforce this is sketched below).
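One way to make the last fix stick in Ollama is to bake the context cap into a derived tag via a Modelfile (the tag names here are assumptions — adjust to whatever tag you actually pulled):

  # Create an 8K-capped variant so nobody accidentally requests 128K
  printf 'FROM tulu3:70b\nPARAMETER num_ctx 8192\n' > Modelfile
  ollama create tulu3-8k -f Modelfile
  ollama run tulu3-8k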

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model
  • Llama 3.1 70B (the Tulu 3 recipe is a post-training pass over this base)
Family siblings (tulu-3)
  • Tulu 3 8B (8B) · Consumer
  • Tulu 3 70B (70B) · you are here

Strengths

  • Fully-open recipe at 70B

Weaknesses

  • Llama Community license inherited

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M       | 40.0 GB   | 48 GB

Get the model

HuggingFace (original weights): huggingface.co/allenai/Llama-3.1-Tulu-3-70B

Source repository — direct quantization required; a typical workflow is sketched below.
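Since there is no official GGUF, a typical convert-and-quantize path uses llama.cpp's own tooling (the script and binary names below are from the llama.cpp repo; the BF16 download is roughly 140 GB, so budget disk space accordingly):

  # Download original weights, convert to GGUF, then quantize to Q4_K_M
  huggingface-cli download allenai/Llama-3.1-Tulu-3-70B --local-dir tulu3-70b
  python convert_hf_to_gguf.py tulu3-70b --outfile tulu3-70b-f16.gguf
  ./llama-quantize tulu3-70b-f16.gguf tulu3-70b-Q4_K_M.gguf Q4_K_M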

Hardware that runs this

Cards with enough VRAM for at least one quantization of Tulu 3 70B.

NVIDIA GB200 NVL72
13,824 GB · NVIDIA
AMD Instinct MI355X
288 GB · AMD
AMD Instinct MI325X
256 GB · AMD
AMD Instinct MI300X
192 GB · AMD
NVIDIA B200
192 GB · NVIDIA
NVIDIA H100 NVL
188 GB · NVIDIA
NVIDIA H200
141 GB · NVIDIA
AMD Instinct MI250X
128 GB · AMD

Frequently asked

What's the minimum VRAM to run Tulu 3 70B?

48GB of VRAM is enough to run Tulu 3 70B at the Q4_K_M quantization (file size 40.0 GB). Higher-quality quantizations need more.

Can I use Tulu 3 70B commercially?

Yes — Tulu 3 70B ships under the Llama 3.1 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Tulu 3 70B?

Tulu 3 70B supports a context window of 131,072 tokens (128K).

Source: huggingface.co/allenai/Llama-3.1-Tulu-3-70B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Compare alternatives

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier: models in the same parameter band as this one
  • Llama 3.3 70B Instruct · llama · 70B · 9.1/10
  • DeepSeek R1 Distill Llama 70B · deepseek · 70B · 9.0/10
  • Qwen 2.5 72B Instruct · qwen · 72B · 9.0/10
  • Llama 3.1 70B Instruct · llama · 70B · 8.0/10

Step up: more capable, bigger memory footprint
  • DeepSeek V4 Pro (1.6T MoE) · deepseek · 1600B · unrated
  • Qwen 3.5 235B-A17B (MoE) · qwen · 397B · unrated

Step down: smaller, faster, runs on weaker hardware
  • Qwen 3 30B-A3B · qwen · 30B · unrated
  • Gemma 4 31B Dense · gemma · 31B · unrated