Sarvam 105B FP8
Sarvam-105B is a Mixture-of-Experts model with 10.3B active parameters built for Indian-language tasks, reasoning, and coding. It covers 22 Indian languages and supports a 128K context window via YaRN scaling. This repo ships FP8-quantized weights intended for deployment with SGLang or patched vLLM.
If you're running inference infrastructure and need strong Hindi (or broader Indic-language) coverage at long context, this is currently one of the few serious options at this scale. The FP8 weights reduce memory pressure, but you still need a multi-GPU server and a non-standard inference stack (SGLang or a patched vLLM build), so this is not a plug-and-play download. With only 546 downloads and 5 likes on Hugging Face, expect rough edges and little community help. Verdict: worth evaluating if Indic-language quality is your primary requirement and you have the hardware; skip it if you want easy deployment or active community support.
Why this rating
Auto-generated rating (Opus 4.7 judge, claude-opus-4-7). Overall 9.15/10. The license is explicitly apache-2.0 in the card and matches the claim. Metadata is accurate: 105B total parameters with 10.3B active (MoE), 128K context via YaRN, FP8 weights, and an Indian-language focus are all verifiable from the card. The editorial voice is honest and operator-grade: it explicitly flags the non-standard inference stack, the VRAM requirements, and the weak community signal. The row's '22 Indian languages' claim is consistent with the card, which reports state-of-the-art results for its size across 22 Indian languages. parameterCountB=105 is the total parameter count rather than the 10.3B active count, which is acceptable as the standard convention. Brand fit is moderate since this is a datacenter-scale model, not a local-laptop model, but Indic coverage at this scale is genuinely useful to the audience. Clears the 9.0 bar.
Flags:
- Datacenter-scale model on a 'local AI' catalog: ensure the framing makes hardware requirements obvious (the row does this adequately).
- parameterCountB=105 reflects total parameters, not the 10.3B active; the convention is fine, but readers may need the MoE distinction surfaced in the UI.
Strengths
- Covers 22 Indian languages including Hindi — broadest regional coverage in its class
- 128K context via YaRN scaling
- Only 10.3B parameters are active at inference time despite 105B total, reducing compute per token
- Apache-2.0 license — commercial use permitted
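The active-versus-total distinction is easier to judge with numbers. The sketch below uses only the two parameter counts quoted in this listing. The figure of roughly 2 FLOPs per active parameter per token is a common rule of thumb for a forward pass, not something taken from the model card, and it ignores attention cost over long contexts.

```python
# Rough per-token compute for a Mixture-of-Experts model.
# Parameter counts come from the listing; the 2-FLOPs-per-parameter
# rule of thumb is an assumption, not a measured figure.

TOTAL_PARAMS = 105e9      # total parameters
ACTIVE_PARAMS = 10.3e9    # parameters used for each token

FLOPS_PER_PARAM = 2       # assumed: one multiply and one add per weight


def active_fraction(active: float, total: float) -> float:
    """Share of the weights that take part in each token's forward pass."""
    return active / total


def forward_flops_per_token(params: float) -> float:
    """Approximate forward-pass FLOPs for a single token."""
    return FLOPS_PER_PARAM * params


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"Active fraction: {fraction:.1%}")
print(f"MoE forward pass: ~{moe_flops / 1e9:.1f} GFLOPs per token")
print(f"Dense 105B forward pass: ~{dense_flops / 1e9:.1f} GFLOPs per token")
```

Compute per token scales with the active count, but all 105B weights still have to sit in memory, so the saving shows up in throughput rather than in VRAM.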
Weaknesses
- Even with FP8 quantization (about one byte per parameter), the weights alone take roughly 105 GB before KV cache and activations, so this is not a consumer-GPU model
- Requires SGLang or a patched vLLM build; stock inference stacks won't work out of the box
- MoE routing can introduce latency spikes compared to equivalently-sized dense models
- 546 HF downloads and 5 likes — very limited community testing or troubleshooting resources
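To put a rough number on the VRAM weakness, here is a back-of-the-envelope sketch. It assumes FP8 stores one byte per parameter, a hypothetical 80 GB accelerator, and a 20% per-GPU reserve for KV cache, activations, and runtime overhead. None of these figures come from the model card, so treat the result as a floor, not a sizing guide.

```python
import math

# Back-of-the-envelope VRAM estimate for FP8 weights.
# Assumptions (not from the model card): 1 byte per parameter,
# a hypothetical 80 GB card, and 20% of each card held in reserve.

TOTAL_PARAMS = 105e9
BYTES_PER_PARAM_FP8 = 1
GPU_MEMORY_GB = 80          # hypothetical card size
RESERVED_FRACTION = 0.20    # assumed headroom per GPU


def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def min_gpus(weights_gb: float, gpu_gb: float, reserved: float) -> int:
    """Smallest GPU count whose usable memory can hold the weights."""
    usable_per_gpu = gpu_gb * (1 - reserved)
    return math.ceil(weights_gb / usable_per_gpu)


weights_gb = weight_footprint_gb(TOTAL_PARAMS, BYTES_PER_PARAM_FP8)
gpus = min_gpus(weights_gb, GPU_MEMORY_GB, RESERVED_FRACTION)

print(f"FP8 weights: ~{weights_gb:.0f} GB")
print(f"Minimum {GPU_MEMORY_GB} GB GPUs under these assumptions: {gpus}")
```

Real deployments will likely need more than this floor: KV cache at long context lengths can be large, and tensor-parallel setups commonly use 2, 4, or 8 GPUs.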
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 57.8 GB | 74 GB |
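The table row can be sanity-checked against the 105B parameter count. This sketch derives the effective bits per weight and the gap between file size and listed VRAM. It assumes the table uses decimal gigabytes (1 GB = 1e9 bytes), which the listing does not state.

```python
# Sanity-check the Q4_K_M row against the parameter count.
# Assumption: the table's GB values are decimal (1e9 bytes).

TOTAL_PARAMS = 105e9
FILE_SIZE_GB = 57.8   # Q4_K_M file size from the table
VRAM_GB = 74.0        # Q4_K_M VRAM figure from the table


def bits_per_weight(file_gb: float, params: float) -> float:
    """Average number of bits stored per parameter."""
    return file_gb * 1e9 * 8 / params


bpw = bits_per_weight(FILE_SIZE_GB, TOTAL_PARAMS)
overhead_gb = VRAM_GB - FILE_SIZE_GB

print(f"Effective bits per weight: {bpw:.2f}")
print(f"VRAM beyond the file size: {overhead_gb:.1f} GB")
```

The result of roughly 4.4 bits per weight is in line with a 4-bit quantization, and the extra VRAM on top of the file size is presumably the allowance for KV cache and runtime buffers.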
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
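Frequently asked
What's the minimum VRAM to run Sarvam 105B FP8?
The quantization table lists 74 GB for Q4_K_M, the only variant shown. The FP8 weights in this repo need more: at about one byte per parameter, the weights alone come to roughly 105 GB.
Can I use Sarvam 105B FP8 commercially?
Yes. The model is released under the Apache-2.0 license, which permits commercial use.
What's the context length of Sarvam 105B FP8?
128K tokens, reached via YaRN scaling.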
Source: huggingface.co/sarvamai/sarvam-105b-fp8
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.