1.1B parameters

Commercial OK

Reviewed May 2026

TinyLlama 1.1B Chat v0.3 GPTQ

GPTQ-quantized build of TinyLlama 1.1B Chat v0.3, trained on SlimPajama, StarCoder, and OpenAssistant data. Runs in roughly 0.8 GB VRAM thanks to 4-bit quantization. English only, 2048-token context window.

License: apache-2.0 · Context: 2,048 tokens

Our verdict

Eruo Fredoline · Verified May 29, 2026

9.0/10

If you are building for German-speaking users, skip this — it has no multilingual capability and will produce poor German output. For English-only edge deployments where VRAM is the hard constraint, the ~0.8 GB footprint is genuinely useful. Do not expect reliable reasoning or multi-step instruction following at 1.1B. Treat it as a keyword responder, not a capable assistant.

Why this rating

Auto-generated rating (Opus 4.7 judge, claude-opus-4-7). Overall 9.00/10. License (Apache-2.0) is verified directly from the card and commercial-OK is correct. Metadata (1.1B, 2048 ctx, TinyLlama family, TheBloke as quantizer) is accurate. The editorial voice is solid and operator-grade, with honest weaknesses about VRAM, context, and reasoning limits. However, the useCases array includes 'german' which directly contradicts the description, weaknesses, and verdict — this is an internal inconsistency that would embarrass the catalog. Also, GPTQ is not supported by llama.cpp (that's GGUF), which is a factual error in the strengths list. These two concrete errors push it below the 9.0 bar.

Flags:
- useCases contains 'german' but the model is English-only: a direct contradiction with the description and verdict
- Strength claims 'GPTQ format broadly supported by ... llama.cpp backends', but llama.cpp does not support GPTQ (it uses GGUF); factual error
- A GGUF alternative exists from the same uploader, so the 'GPTQ adds dependency vs plain GGUF' weakness should reference that sibling repo for honesty

Strengths

~0.8 GB VRAM footprint — fits on almost any GPU or CPU-offload setup
Apache 2.0 license, commercial use permitted
Over 1 million HF downloads — well-tested in the wild
GPTQ format supported by AutoGPTQ and text-generation-webui (not by llama.cpp, which uses GGUF)

Weaknesses

English only — no German or multilingual support
1.1B parameters means weak reasoning and poor instruction-following on complex tasks
2048-token context is short; long conversations or documents will hit the limit fast
GPTQ quantization adds a setup dependency compared to plain GGUF; a GGUF build from the same uploader exists for llama.cpp users
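
The 2048-token limit noted above is usually handled by dropping the oldest turns before each request. Below is a minimal sketch of that idea, not a tested integration: `trim_history`, `approx_token_count`, and the 256-token reply reserve are hypothetical names and values chosen for illustration, and the word-based counter is only a crude stand-in for the model's real tokenizer.

```python
from typing import Callable, List

CONTEXT_TOKENS = 2048  # context window stated on this page


def approx_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(
    messages: List[str],
    reserve_for_reply: int = 256,  # assumed headroom for the generated answer
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    budget = CONTEXT_TOKENS - reserve_for_reply
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

In practice you would pass the model tokenizer's own counting function as `count_tokens`, since real token counts are typically higher than word counts.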

Quantization variants

Each quantization trades model quality for file size and VRAM. The catalog lists a single 4-bit entry, labeled Q4_K_M; that is a GGUF-style name, while this repository itself ships GPTQ weights.

Quantization	File size	VRAM required
Q4_K_M	0.6 GB	1 GB
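
As a rough sanity check on the figures in this table, the size of the weights alone can be estimated from the parameter count and the bits stored per weight. The sketch below is back-of-envelope arithmetic only; `quantized_weight_gb` is a hypothetical helper, and it ignores quantization metadata, the KV cache, and activations, which is one plausible reason the listed file size and VRAM sit above the raw 4-bit number.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 1.1e9  # parameter count stated on this page

fp16_gb = quantized_weight_gb(N_PARAMS, 16)  # unquantized half-precision baseline
int4_gb = quantized_weight_gb(N_PARAMS, 4)   # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gb:.2f} GB")
print(f"4-bit weights: {int4_gb:.2f} GB")
```

The 4-bit estimate comes to about 0.55 GB, against roughly 2.2 GB at 16 bits, which is in line with the 0.6 GB file size listed here.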

Get the model

HuggingFace

Quantized weights (GPTQ)

huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GPTQ

Source repository. The weights are already GPTQ-quantized, so no further quantization step is needed.
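
For readers who want to try the repository above, here is a minimal loading sketch using Hugging Face Transformers. It assumes a CUDA GPU and a Transformers install with a GPTQ backend available (for example AutoGPTQ via Optimum); those requirements, the `generate_reply` helper, and the 64-token default are assumptions for illustration, not something this page verifies. Check the model card for the expected chat prompt format before relying on the output.

```python
REPO_ID = "TheBloke/TinyLlama-1.1B-Chat-v0.3-GPTQ"  # repository named on this page


def generate_reply(prompt: str, max_new_tokens: int = 64) -> str:
    """Load the GPTQ checkpoint and return a short completion for `prompt`."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
    # device_map="auto" lets the library place the quantized weights on the GPU.
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, device_map="auto")

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_reply("Name three uses for a small language model.")` downloads the weights on first use. The helper reloads the model on every call, which keeps the sketch short but is not how you would structure a service.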

Hardware that runs this

Cards with enough VRAM for at least one quantization of TinyLlama 1.1B Chat v0.3 GPTQ.

NVIDIA GB200 NVL72

13824GB · nvidia

AMD Instinct MI350X

NVIDIA B300 (Blackwell Ultra)

288GB · nvidia

AMD Instinct MI355X

AMD Instinct MI325X

AMD Instinct MI300X

192GB · amd

NVIDIA H100 NVL

188GB · nvidia
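
The rule behind this list is a simple threshold: a card qualifies if its VRAM meets the requirement of at least one listed quantization. A small sketch of that check follows, using only the VRAM figures shown on this page; `cards_that_fit` is a hypothetical helper, and cards listed here without a figure are left out rather than guessed.

```python
from typing import Dict, List

REQUIRED_VRAM_GB = 1.0  # Q4_K_M row of the quantization table on this page

# VRAM figures as listed on this page; cards shown without a figure are omitted.
CARDS_GB: Dict[str, float] = {
    "NVIDIA GB200 NVL72": 13824,
    "NVIDIA B300 (Blackwell Ultra)": 288,
    "AMD Instinct MI300X": 192,
    "NVIDIA H100 NVL": 188,
}


def cards_that_fit(cards: Dict[str, float], required_gb: float) -> List[str]:
    """Return the names of cards whose VRAM meets the requirement, sorted by name."""
    return sorted(name for name, vram in cards.items() if vram >= required_gb)


print(cards_that_fit(CARDS_GB, REQUIRED_VRAM_GB))
```

With a 1 GB requirement every card in the list passes, which is why the list says little about the practical minimum for this model.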

Frequently asked

What's the minimum VRAM to run TinyLlama 1.1B Chat v0.3 GPTQ?

1 GB of VRAM is enough to run TinyLlama 1.1B Chat v0.3 GPTQ at the listed Q4_K_M quantization (file size 0.6 GB). Higher-quality quantizations need more.

Can I use TinyLlama 1.1B Chat v0.3 GPTQ commercially?

Yes. TinyLlama 1.1B Chat v0.3 GPTQ ships under the Apache 2.0 license, which permits commercial use. Always read the license text before deployment.

What's the context length of TinyLlama 1.1B Chat v0.3 GPTQ?

TinyLlama 1.1B Chat v0.3 GPTQ supports a context window of 2,048 tokens (about 2K).

Source: huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GPTQ

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
