other
70B parameters
Commercial OK
Reviewed June 2026

Tulu 3 70B

Tulu 3 at 70B. AI2's fully-open instruct fine-tune — research transparency at scale.

License: Llama 3.1 Community License·Released Nov 21, 2024·Context: 131,072 tokens
BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
unrated

Positioning

Tulu 3 70B is a dense 70-billion-parameter instruct model released by the Allen Institute (AI2) under the Llama 3.1 Community License. With a 131,072-token context window, it is designed for datacenter-tier deployment. Its key distinction is full research transparency: AI2 published the complete training recipe, data, and code, making it a go-to choice for operators who need to audit or reproduce the fine-tuning process.

Strengths

  • Fully open recipe. Unlike many instruct models that only release weights, AI2 provides the complete training pipeline, enabling operators to inspect, replicate, or modify the fine-tuning methodology.
  • Long 128K context. The 131,072-token context window supports extended document analysis, multi-turn conversations, and retrieval-augmented generation without truncation.
  • Permissive commercial license. The Llama 3.1 Community License allows commercial use, making Tulu 3 suitable for proprietary applications.
  • Dense architecture at scale. As a dense 70B model, it offers predictable inference behavior without the routing complexity of mixture-of-experts models.

Limitations

  • Datacenter-only deployment. With FP16 requiring ~140 GB of disk and even Q4_K_M at ~39 GB, plus substantial KV cache overhead, this model cannot run on consumer or workstation GPUs. It requires multi-GPU datacenter hardware.
  • No community benchmarks yet. We do not have independent, community-reported benchmark scores for this model. Operators should treat any vendor-published metrics as best-case and verify on their own workloads.
  • High memory overhead. The 128K context window demands significant KV cache memory — expect to add 30–50% to the model footprint for typical usage, pushing Q4_K_M from ~39 GB to 50–60 GB or more.
  • Limited ecosystem. As a relatively new model from a research institute, community tooling, quantization presets, and deployment guides are less mature than for more widely adopted families like Llama or Mistral.

What it takes to run this locally

Tulu 3 70B is a datacenter-class model. Quantization reduces disk footprint but does not change the hardware requirement: you need multiple high-memory GPUs (e.g., 2–4× A100 80GB or H100) to accommodate the model plus KV cache. At Q4_K_M (39 GB), expect total memory demand of 50–60 GB with moderate context lengths. At FP16 (140 GB), you need 2–3× A100 80GB or similar. No single consumer GPU can run this model.

Should you run this locally?

Yes if you need full transparency into the fine-tuning process for compliance, research, or customization, and you have access to datacenter-grade multi-GPU hardware. The open recipe and permissive license make it ideal for organizations that want to audit or modify the training pipeline.

No if you lack multi-GPU infrastructure, need a model that fits on a single workstation GPU, or require mature community tooling and pre-built quantized runtimes. For those cases, smaller dense models or established families may be more practical.

Catalog cross-links

  • Llama 3.1 70B – the base model Tulu 3 is fine-tuned from.
  • AI2 OLMo – another fully open model from the same institute.
  • A100 GPU – typical hardware for running 70B-class models.

Overview

Tulu 3 at 70B. AI2's fully-open instruct fine-tune — research transparency at scale.

How to run it

Tulu 3 70B is Ai2's instruction-tuned 70B model based on Llama 3.1 70B. Tulu is Ai2's research fine-tune focused on improving instruction-following with a curated dataset mix (open-source post-training pipeline). Run at Q4_K_M via Ollama (ollama pull tulu3:70b) or llama.cpp with -ngl 999 -fa -c 8192. Q4_K_M file size ~40 GB on disk. Minimum VRAM: 48 GB — RTX A6000 (48GB) at Q4_K_M for 4K context. RTX 4090 24GB: Q3_K_M with KV offload. Recommended: A100 80GB at AWQ-INT4 for serving. Throughput: ~15-25 tok/s on A6000 at Q4_K_M (4K context); ~30-45 tok/s on A100. Standard Llama architecture — dropp-in compatible with any Llama inference stack. Tulu 3 is instruction-tuned (chat/agent focus). Use for: general chat, instruction-following, agent tasks, knowledge work. Ai2's license is permissive (usually ODC-By or Apache 2.0 for Tulu). Context: Llama 3.1-level (128K, practical 8-16K on 48 GB).

Hardware guidance

Minimum: RTX 3090 24GB at Q3_K_M with KV offload (4K). Recommended: RTX A6000 48GB at Q4_K_M (8K). Optimal: A100 80GB at AWQ-INT4. VRAM math: 70B dense, Q4_K_M ≈ 40 GB. KV cache at 8K: ~10 GB. Total: ~50 GB at 8K. A6000 48GB: borderline — trim context to 4K. RTX 4090 24GB: Q3_K_M ≈ 30 GB + KV offload. RTX 5090 32GB: Q4_K_M 40 GB — must offload KV. Dual RTX 4090 48 GB: Q4 at 8K — viable. Mac Studio M4 Max 64GB: Q4_K_M at 5-10 tok/s. Cloud: A100 80GB at $5-10/hr. AWQ-INT4 on A100 enables 32K context.

What breaks first

  1. Tulu chat template. Tulu 3 uses Ai2's chat template, which differs slightly from standard Llama 3.1. Using the Llama 3.1 default template may produce subtly worse instruction-following. Use Tulu's template from tokenizer_config.json. 2. Benchmark overfitting. Tulu 3's training uses public benchmarks in the data mix. Performance on exact benchmark prompts may overstate real-world quality. Test on your own tasks. 3. Q3 quality on instruction-following. Tulu's instruction-tuning is relatively shallow compared to base Llama training. At Q3, instruction adherence degrades more than base knowledge — the fine-tuned behavior is more quant-sensitive. 4. Ollama tag freshness. Tulu 3 may not be in Ollama's default catalog. Check huggingface.co/allenai for GGUF availability or convert from hf.

Runtime recommendation

Ollama for quick-start (if Tulu 3 tag exists). llama.cpp for fine control. vLLM for serving. Llama-based architecture means broad support. Tulu 3 uses the same chat template family as Llama 3.1 with minor modifications — most stacks handle it correctly.

Common beginner mistakes

Mistake: Using Llama 3.1's default chat template with Tulu 3. Fix: Tulu 3 uses Ai2's template. Check tokenizer_config.json for exact format or use the model card's recommended template. Mistake: Assuming Tulu 3 matches Llama 3.3 70B quality. Fix: Tulu 3 is fine-tuned on Llama 3.1 70B, not 3.3. It's a different base model. Expect quality similar to Llama 3.1 70B with improved instruction-following. Mistake: Expecting Tulu 3 to follow system prompts as aggressively as command-r models. Fix: Tulu 3 is instruction-tuned but not specifically system-prompt-optimized. Longer system prompts may be ignored or partially followed. Mistake: Running at 128K context on consumer hardware. Fix: Same as all ~70B models — KV cache at 128K is 80+ GB. Keep context 4-8K on 24-48 GB GPUs.

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model
Tulu 3 8B8B
Consumer
Family siblings (tulu-3)
Tulu 3 8B8B
Consumer
Tulu 3 70B70B
You are here

Strengths

  • Fully-open recipe at 70B

Weaknesses

  • Llama Community license inherited

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

QuantizationFile sizeVRAM required
Q4_K_M40.0 GB48 GB

Get the model

HuggingFace

Original weights

huggingface.co/allenai/Llama-3.1-Tulu-3-70B

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Tulu 3 70B.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Tulu 3 70B?

48GB of VRAM is enough to run Tulu 3 70B at the Q4_K_M quantization (file size 40.0 GB). Higher-quality quantizations need more.

Can I use Tulu 3 70B commercially?

Yes — Tulu 3 70B ships under the Llama 3.1 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Tulu 3 70B?

Tulu 3 70B supports a context window of 131,072 tokens (about 131K).

Source: huggingface.co/allenai/Llama-3.1-Tulu-3-70B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Alternatives
Before you buy

Verify Tulu 3 70B runs on your specific hardware before committing money.