DBRX Base

DBRX base (non-instruct). 132B total / 36B active fine-grained MoE.

License: Databricks Open Model License · Released Mar 27, 2024 · Context: 32,768 tokens


How to run it

DBRX is Databricks' 132B-parameter MoE model, with about 36B parameters active per token through 4-of-16 expert routing. For local use, run it at Q4_K_M via llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is about 75 GB on disk.

Minimum VRAM: 48 GB. That means an RTX A6000 (48 GB) at Q4_K_M with expert offload, or dual RTX 3090s with row split (48 GB total). Recommended: A100 80GB at AWQ-INT4. Throughput: about 15-25 tok/s on an A6000 at Q4_K_M with 8K context.

DBRX uses a fine-grained MoE with 16 experts, 4 active per token, so each token involves more routing decisions than in a Mixtral-style model (8 experts, 2 active). The cost is higher routing overhead; the potential benefit is better expert specialization.

DBRX Base is not instruction-tuned. Use it for fine-tuning rather than direct chat; for instruction-following, use DBRX-Instruct or fine-tune it yourself. Ollama may not carry DBRX Base, so verify the tag before relying on it.

Architecture: a standard transformer with MoE FFN layers, which is well supported in llama.cpp and potentially in vLLM.
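To make the 4-of-16 versus 2-of-8 comparison concrete, here is a minimal sketch of top-k expert selection and of how many distinct expert subsets each scheme allows. The function name `top_k_experts`, the example logits, and the softmax-over-selected-logits weighting are illustrative assumptions, not DBRX's actual router code.

```python
import math

def top_k_experts(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only. This is one common
    convention and an assumption here, not a statement about DBRX internals.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in ranked)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

# Number of distinct expert subsets a single token can be routed to.
dbrx_combos = math.comb(16, 4)     # 4 active out of 16 experts
mixtral_combos = math.comb(8, 2)   # 2 active out of 8 experts

# Hypothetical router scores for one token across 16 experts.
logits = [0.1 * i for i in range(16)]
chosen = top_k_experts(logits, k=4)

print(dbrx_combos, mixtral_combos, dbrx_combos // mixtral_combos)
print([index for index, _ in chosen])
```

The larger space of expert subsets is what leaves room for finer specialization, while selecting and dispatching to four experts per token instead of two is where the extra routing work comes from.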

Hardware guidance

Minimum: dual RTX 3090 (48 GB total) at Q4_K_M, tight at 4K context.
Recommended: A100 80GB at AWQ-INT4 for serving.
Budget: RTX A6000 48GB at Q3_K_M with expert offload.

VRAM math: 132B total parameters, about 36B active (4 experts selected per token). Q4_K_M for the full 132B takes roughly 70-80 GB. Expert offload reduces VRAM use to roughly 30-40 GB by keeping active experts in VRAM and the rest in system RAM. KV cache at 8K context: about 10-15 GB. On 48 GB with expert offload the fit is borderline; an 80 GB A100 is comfortable with all experts in VRAM.

Other options: Mac Studio M4 Max 64GB runs Q4_K_M with expert offload at 3-6 tok/s. RTX 4090 24GB needs Q3_K_M with aggressive expert offload. Cloud: a single A100 at $5-10/hr for AWQ.
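The weight-size figures above follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below reproduces that estimate. The bits-per-weight values are assumptions chosen for illustration (real quantized files carry scales and mixed-precision tensors, so actual sizes vary), and `weights_gb` and `headroom_gb` are hypothetical helper names.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight storage in decimal GB: params * bits / 8."""
    return params_billion * bits_per_weight / 8

def headroom_gb(vram_gb, weight_gb):
    """VRAM left for KV cache and runtime buffers; negative means offload is needed."""
    return vram_gb - weight_gb

TOTAL_B = 132   # total parameters, in billions
ACTIVE_B = 36   # parameters active per token, in billions

# Assumed effective bits per weight; not figures taken from the source.
for label, bits in [("4.0 bpw", 4.0), ("4.5 bpw", 4.5), ("4.85 bpw", 4.85)]:
    full = weights_gb(TOTAL_B, bits)
    active = weights_gb(ACTIVE_B, bits)
    print(f"{label}: full model {full:.1f} GB, active subset {active:.1f} GB")

# All experts resident at 4.5 bpw on two card sizes.
full_q4 = weights_gb(TOTAL_B, 4.5)
print("80 GB card headroom:", headroom_gb(80, full_q4))
print("48 GB card headroom:", headroom_gb(48, full_q4))
```

A negative headroom on the 48 GB configuration is why expert offload is required there; the estimate ignores KV cache and runtime overhead, which must also fit in whatever remains.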

What breaks first

1. Base model, not instruct. DBRX-base has no chat or instruction tuning. Raw completions will continue the prompt style rather than answer questions. Fine-tuning or few-shot prompting is necessary.
2. Fine-grained MoE routing overhead. 16 experts with top-4 routing per token means more routing decisions and higher all-to-all communication. On PCIe cards, this routing pattern causes more stalls than Mixtral-style routing.
3. AWQ calibration gap. DBRX AWQ quants calibrated on generic data may not preserve quality on domain-specific tasks. Test quant quality on your data before deploying.
4. Databricks' license. Verify DBRX's license for commercial use, since it may differ from standard open-weight licenses. Check huggingface.co/databricks/dbrx-base for terms.
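One way to act on the advice in item 3 is to compare perplexity on your own text between a reference model and the quantized one. The sketch below assumes you have already collected per-token natural-log probabilities from each runtime; the helper names and the sample numbers are placeholders, not measurements from DBRX.

```python
import math

def perplexity(token_logprobs):
    """Perplexity: exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_degradation(baseline_logprobs, quant_logprobs):
    """Fractional perplexity increase of the quantized model over the baseline."""
    base = perplexity(baseline_logprobs)
    quant = perplexity(quant_logprobs)
    return (quant - base) / base

# Placeholder log-probs for the same four tokens of domain text.
baseline = [-1.0, -2.0, -1.5, -0.5]
quantized = [-1.1, -2.2, -1.6, -0.7]

print(f"baseline perplexity:  {perplexity(baseline):.3f}")
print(f"quantized perplexity: {perplexity(quantized):.3f}")
print(f"relative degradation: {relative_degradation(baseline, quantized):.1%}")
```

What counts as an acceptable increase depends on the task, so treat any threshold as something to set from your own evaluation data.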

Runtime recommendation

Use llama.cpp with -ngl 999 for local use and vLLM for multi-user serving on A100. DBRX's fine-grained MoE benefits from vLLM's expert-parallel scheduling. Avoid Ollama for base models, since it is designed for instruct/chat use. For fine-tuning: Axolotl or Unsloth with QLoRA.

Common beginner mistakes

Mistake: Expecting DBRX-base to chat.
Fix: Base models generate completions, not conversations. Use DBRX-Instruct or fine-tune. For base-model use, rely on few-shot prompting with careful formatting.

Mistake: Assuming 132B total means it needs 132 GB VRAM.
Fix: At Q4_K_M the MoE is about 75 GB on disk. The active subset per token is only ~36B parameters (about 21 GB at Q4). Expert offload makes it run on 48 GB.

Mistake: Using standard Llama GGUF conversion.
Fix: DBRX has a specific architecture. Use the correct conversion script or pre-converted GGUFs from TheBloke or bartowski.

Mistake: Ignoring the 16-expert routing overhead.
Fix: DBRX's top-4-of-16 routing is more complex than Mixtral's top-2-of-8. Expect higher latency variance per token due to more frequent expert switches.
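As an illustration of "few-shot prompting with careful formatting", the sketch below builds a prompt that ends exactly where the model should start its answer. The Q:/A: labels, the function name, and the example pairs are arbitrary choices for this sketch; nothing here is a format that DBRX requires.

```python
def build_few_shot_prompt(examples, query, q_label="Q:", a_label="A:"):
    """Format (question, answer) pairs followed by one open question.

    The prompt ends right after the answer label, so a base model's
    most natural continuation is the answer itself.
    """
    blocks = [f"{q_label} {q}\n{a_label} {a}" for q, a in examples]
    blocks.append(f"{q_label} {query}\n{a_label}")
    return "\n\n".join(blocks)

# Illustrative examples; use pairs from your own domain in practice.
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Italy?")
print(prompt)
```

Because a base model keeps completing the pattern, it is usually worth stopping generation at the next blank line or question label so it does not invent further question-and-answer pairs.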

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Family siblings (dbrx)

DBRX Base · 132B (you are here)

DBRX Instruct · 132B

Datacenter