Tulu 3 70B
Tulu 3 at 70B. AI2's fully-open instruct fine-tune — research transparency at scale.
Positioning
Tulu 3 70B is a dense 70-billion-parameter instruct model released by the Allen Institute (AI2) under the Llama 3.1 Community License. With a 131,072-token context window, it is designed for datacenter-tier deployment. Its key distinction is full research transparency: AI2 published the complete training recipe, data, and code, making it a go-to choice for operators who need to audit or reproduce the fine-tuning process.
Strengths
- Fully open recipe. Unlike many instruct models that only release weights, AI2 provides the complete training pipeline, enabling operators to inspect, replicate, or modify the fine-tuning methodology.
- Long 128K context. The 131,072-token context window supports extended document analysis, multi-turn conversations, and retrieval-augmented generation without truncation.
- Permissive commercial license. The Llama 3.1 Community License allows commercial use, making Tulu 3 suitable for proprietary applications.
- Dense architecture at scale. As a dense 70B model, it offers predictable inference behavior without the routing complexity of mixture-of-experts models.
Limitations
- Datacenter-only deployment. With FP16 requiring ~140 GB of disk and even Q4_K_M at ~39 GB, plus substantial KV cache overhead, this model cannot run on consumer or workstation GPUs. It requires multi-GPU datacenter hardware.
- No community benchmarks yet. We do not have independent, community-reported benchmark scores for this model. Operators should treat any vendor-published metrics as best-case and verify on their own workloads.
- High memory overhead. The 128K context window demands significant KV cache memory — expect to add 30–50% to the model footprint for typical usage, pushing Q4_K_M from ~39 GB to 50–60 GB or more.
- Limited ecosystem. As a relatively new model from a research institute, community tooling, quantization presets, and deployment guides are less mature than for more widely adopted families like Llama or Mistral.
What it takes to run this locally
Tulu 3 70B is a datacenter-class model. Quantization reduces disk footprint but does not change the hardware requirement: you need multiple high-memory GPUs (e.g., 2–4× A100 80GB or H100) to accommodate the model plus KV cache. At Q4_K_M (39 GB), expect total memory demand of 50–60 GB with moderate context lengths. At FP16 (140 GB), you need 2–3× A100 80GB or similar. No single consumer GPU can run this model.
Should you run this locally?
Yes if you need full transparency into the fine-tuning process for compliance, research, or customization, and you have access to datacenter-grade multi-GPU hardware. The open recipe and permissive license make it ideal for organizations that want to audit or modify the training pipeline.
No if you lack multi-GPU infrastructure, need a model that fits on a single workstation GPU, or require mature community tooling and pre-built quantized runtimes. For those cases, smaller dense models or established families may be more practical.
Catalog cross-links
- Llama 3.1 70B – the base model Tulu 3 is fine-tuned from.
- AI2 OLMo – another fully open model from the same institute.
- A100 GPU – typical hardware for running 70B-class models.
Overview
Tulu 3 at 70B. AI2's fully-open instruct fine-tune — research transparency at scale.
How to run it
Tulu 3 70B is Ai2's instruction-tuned 70B model based on Llama 3.1 70B. Tulu is Ai2's research fine-tune focused on improving instruction-following with a curated dataset mix (open-source post-training pipeline). Run at Q4_K_M via Ollama (ollama pull tulu3:70b) or llama.cpp with -ngl 999 -fa -c 8192. Q4_K_M file size ~40 GB on disk. Minimum VRAM: 48 GB — RTX A6000 (48GB) at Q4_K_M for 4K context. RTX 4090 24GB: Q3_K_M with KV offload. Recommended: A100 80GB at AWQ-INT4 for serving. Throughput: ~15-25 tok/s on A6000 at Q4_K_M (4K context); ~30-45 tok/s on A100. Standard Llama architecture — dropp-in compatible with any Llama inference stack. Tulu 3 is instruction-tuned (chat/agent focus). Use for: general chat, instruction-following, agent tasks, knowledge work. Ai2's license is permissive (usually ODC-By or Apache 2.0 for Tulu). Context: Llama 3.1-level (128K, practical 8-16K on 48 GB).
Hardware guidance
Minimum: RTX 3090 24GB at Q3_K_M with KV offload (4K). Recommended: RTX A6000 48GB at Q4_K_M (8K). Optimal: A100 80GB at AWQ-INT4. VRAM math: 70B dense, Q4_K_M ≈ 40 GB. KV cache at 8K: ~10 GB. Total: ~50 GB at 8K. A6000 48GB: borderline — trim context to 4K. RTX 4090 24GB: Q3_K_M ≈ 30 GB + KV offload. RTX 5090 32GB: Q4_K_M 40 GB — must offload KV. Dual RTX 4090 48 GB: Q4 at 8K — viable. Mac Studio M4 Max 64GB: Q4_K_M at 5-10 tok/s. Cloud: A100 80GB at $5-10/hr. AWQ-INT4 on A100 enables 32K context.
What breaks first
- Tulu chat template. Tulu 3 uses Ai2's chat template, which differs slightly from standard Llama 3.1. Using the Llama 3.1 default template may produce subtly worse instruction-following. Use Tulu's template from tokenizer_config.json. 2. Benchmark overfitting. Tulu 3's training uses public benchmarks in the data mix. Performance on exact benchmark prompts may overstate real-world quality. Test on your own tasks. 3. Q3 quality on instruction-following. Tulu's instruction-tuning is relatively shallow compared to base Llama training. At Q3, instruction adherence degrades more than base knowledge — the fine-tuned behavior is more quant-sensitive. 4. Ollama tag freshness. Tulu 3 may not be in Ollama's default catalog. Check huggingface.co/allenai for GGUF availability or convert from hf.
Runtime recommendation
Common beginner mistakes
Mistake: Using Llama 3.1's default chat template with Tulu 3. Fix: Tulu 3 uses Ai2's template. Check tokenizer_config.json for exact format or use the model card's recommended template. Mistake: Assuming Tulu 3 matches Llama 3.3 70B quality. Fix: Tulu 3 is fine-tuned on Llama 3.1 70B, not 3.3. It's a different base model. Expect quality similar to Llama 3.1 70B with improved instruction-following. Mistake: Expecting Tulu 3 to follow system prompts as aggressively as command-r models. Fix: Tulu 3 is instruction-tuned but not specifically system-prompt-optimized. Longer system prompts may be ignored or partially followed. Mistake: Running at 128K context on consumer hardware. Fix: Same as all ~70B models — KV cache at 128K is 80+ GB. Keep context 4-8K on 24-48 GB GPUs.
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Fully-open recipe at 70B
Weaknesses
- Llama Community license inherited
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 40.0 GB | 48 GB |
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Tulu 3 70B.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Tulu 3 70B?
Can I use Tulu 3 70B commercially?
What's the context length of Tulu 3 70B?
Source: huggingface.co/allenai/Llama-3.1-Tulu-3-70B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Tulu 3 70B runs on your specific hardware before committing money.