other

70B parameters

Commercial OK

Reviewed May 2026

Tulu 3 70B

Tulu 3 at 70B. AI2's fully-open instruct fine-tune — research transparency at scale.

License: Llama 3.1 Community License·Released Nov 21, 2024·Context: 131,072 tokens

Overview

Tulu 3 at 70B. AI2's fully-open instruct fine-tune — research transparency at scale.

How to run it

Tulu 3 70B is Ai2's instruction-tuned 70B model based on Llama 3.1 70B. Tulu is Ai2's research fine-tune focused on improving instruction-following with a curated dataset mix (open-source post-training pipeline). Run at Q4_K_M via Ollama (ollama pull tulu3:70b) or llama.cpp with -ngl 999 -fa -c 8192. Q4_K_M file size ~40 GB on disk. Minimum VRAM: 48 GB — RTX A6000 (48GB) at Q4_K_M for 4K context. RTX 4090 24GB: Q3_K_M with KV offload. Recommended: A100 80GB at AWQ-INT4 for serving. Throughput: ~15-25 tok/s on A6000 at Q4_K_M (4K context); ~30-45 tok/s on A100. Standard Llama architecture — dropp-in compatible with any Llama inference stack. Tulu 3 is instruction-tuned (chat/agent focus). Use for: general chat, instruction-following, agent tasks, knowledge work. Ai2's license is permissive (usually ODC-By or Apache 2.0 for Tulu). Context: Llama 3.1-level (128K, practical 8-16K on 48 GB).

Hardware guidance

Minimum: RTX 3090 24GB at Q3_K_M with KV offload (4K). Recommended: RTX A6000 48GB at Q4_K_M (8K). Optimal: A100 80GB at AWQ-INT4. VRAM math: 70B dense, Q4_K_M ≈ 40 GB. KV cache at 8K: ~10 GB. Total: ~50 GB at 8K. A6000 48GB: borderline — trim context to 4K. RTX 4090 24GB: Q3_K_M ≈ 30 GB + KV offload. RTX 5090 32GB: Q4_K_M 40 GB — must offload KV. Dual RTX 4090 48 GB: Q4 at 8K — viable. Mac Studio M4 Max 64GB: Q4_K_M at 5-10 tok/s. Cloud: A100 80GB at $5-10/hr. AWQ-INT4 on A100 enables 32K context.

What breaks first

Tulu chat template. Tulu 3 uses Ai2's chat template, which differs slightly from standard Llama 3.1. Using the Llama 3.1 default template may produce subtly worse instruction-following. Use Tulu's template from tokenizer_config.json. 2. Benchmark overfitting. Tulu 3's training uses public benchmarks in the data mix. Performance on exact benchmark prompts may overstate real-world quality. Test on your own tasks. 3. Q3 quality on instruction-following. Tulu's instruction-tuning is relatively shallow compared to base Llama training. At Q3, instruction adherence degrades more than base knowledge — the fine-tuned behavior is more quant-sensitive. 4. Ollama tag freshness. Tulu 3 may not be in Ollama's default catalog. Check huggingface.co/allenai for GGUF availability or convert from hf.

Runtime recommendation

Ollama for quick-start (if Tulu 3 tag exists). llama.cpp for fine control. vLLM for serving. Llama-based architecture means broad support. Tulu 3 uses the same chat template family as Llama 3.1 with minor modifications — most stacks handle it correctly.

Common beginner mistakes

Mistake: Using Llama 3.1's default chat template with Tulu 3. Fix: Tulu 3 uses Ai2's template. Check tokenizer_config.json for exact format or use the model card's recommended template. Mistake: Assuming Tulu 3 matches Llama 3.3 70B quality. Fix: Tulu 3 is fine-tuned on Llama 3.1 70B, not 3.3. It's a different base model. Expect quality similar to Llama 3.1 70B with improved instruction-following. Mistake: Expecting Tulu 3 to follow system prompts as aggressively as command-r models. Fix: Tulu 3 is instruction-tuned but not specifically system-prompt-optimized. Longer system prompts may be ignored or partially followed. Mistake: Running at 128K context on consumer hardware. Fix: Same as all ~70B models — KV cache at 128K is 80+ GB. Keep context 4-8K on 24-48 GB GPUs.

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model

Tulu 3 8B8B

Consumer

Family siblings (tulu-3)

Tulu 3 8B8B

Consumer

Tulu 3 70B70B

You are here

Strengths

Fully-open recipe at 70B

Weaknesses

Llama Community license inherited

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization	File size	VRAM required
Q4_K_M	40.0 GB	48 GB

Get the model

HuggingFace

Original weights

huggingface.co/allenai/Llama-3.1-Tulu-3-70B

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Tulu 3 70B.

Frequently asked

What's the minimum VRAM to run Tulu 3 70B?

48GB of VRAM is enough to run Tulu 3 70B at the Q4_K_M quantization (file size 40.0 GB). Higher-quality quantizations need more.

Can I use Tulu 3 70B commercially?

Yes — Tulu 3 70B ships under the Llama 3.1 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Tulu 3 70B?

Tulu 3 70B supports a context window of 131,072 tokens (about 131K).

Source: huggingface.co/allenai/Llama-3.1-Tulu-3-70B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Compare hardware

Buyer guides

When it doesn't work

Recommended hardware

Alternatives

Tulu 3 8B

Before you buy

Verify Tulu 3 70B runs on your specific hardware before committing money.

Will it run on my hardware? →Custom hardware comparison →GPU recommender (4 questions) →

other

70B parameters

Commercial OK

Reviewed May 2026

Tulu 3 70B

Tulu 3 at 70B. AI2's fully-open instruct fine-tune — research transparency at scale.

License: Llama 3.1 Community License·Released Nov 21, 2024·Context: 131,072 tokens

Overview

Tulu 3 at 70B. AI2's fully-open instruct fine-tune — research transparency at scale.

How to run it

Hardware guidance

What breaks first

Tulu chat template. Tulu 3 uses Ai2's chat template, which differs slightly from standard Llama 3.1. Using the Llama 3.1 default template may produce subtly worse instruction-following. Use Tulu's template from tokenizer_config.json. 2. Benchmark overfitting. Tulu 3's training uses public benchmarks in the data mix. Performance on exact benchmark prompts may overstate real-world quality. Test on your own tasks. 3. Q3 quality on instruction-following. Tulu's instruction-tuning is relatively shallow compared to base Llama training. At Q3, instruction adherence degrades more than base knowledge — the fine-tuned behavior is more quant-sensitive. 4. Ollama tag freshness. Tulu 3 may not be in Ollama's default catalog. Check huggingface.co/allenai for GGUF availability or convert from hf.

Runtime recommendation

Common beginner mistakes

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model

Tulu 3 8B8B

Consumer

Family siblings (tulu-3)

Tulu 3 8B8B

Consumer

Tulu 3 70B70B

You are here