Transformer & LLM components
Definition pending

SentencePiece

We've cataloged "SentencePiece" but haven't written a full definition yet. Definitions are hand-curated rather than auto-generated, so it takes time to cover every term.

Practical example

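SentencePiece is a tokenizer library that doesn't assume space-separated words: it treats whitespace as an ordinary symbol (written ▁) and segments the raw Unicode text directly, so it works with any language, even those written without spaces (Chinese, Japanese). It's used by Llama 1 and 2, Gemma, T5, and many others. SentencePiece is not an alternative to BPE; it implements both BPE and unigram segmentation. The practical difference from word-based pre-tokenizers is that no language-specific splitting is needed, and the pieces can be joined back into text without special rules. With byte fallback enabled, a character missing from the vocabulary is encoded as UTF-8 byte pieces instead of an unknown token, which is what lets it handle rare characters and code better.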
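
To make the whitespace symbol and byte fallback concrete, here is a minimal sketch in plain Python. It is not the SentencePiece library and not its BPE or unigram algorithm: the greedy longest-match segmenter, the function names to_pieces and from_pieces, and the tiny vocabulary are all assumptions made up for illustration. Only the ▁ marker, the leading dummy prefix, and the <0xNN> byte-piece format follow SentencePiece's conventions.

```python
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace


def to_pieces(text, vocab):
    """Toy greedy longest-match segmentation over a hypothetical vocab.

    Illustration only: real SentencePiece models use BPE or unigram scoring.
    """
    # Whitespace becomes a visible symbol; a leading marker is added,
    # mirroring SentencePiece's default dummy prefix.
    s = WS + text.replace(" ", WS)
    max_len = max(len(v) for v in vocab)
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(s), i + max_len), i, -1):
            if s[i:j] in vocab:
                pieces.append(s[i:j])
                i = j
                break
        else:
            # Byte fallback: emit one piece per UTF-8 byte of the character.
            pieces.extend(f"<0x{b:02X}>" for b in s[i].encode("utf-8"))
            i += 1
    return pieces


def from_pieces(pieces):
    """Join pieces back into text, decoding byte pieces along the way."""
    out = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            out.append(int(p[3:5], 16))
        else:
            out.extend(p.encode("utf-8"))
    text = out.decode("utf-8").replace(WS, " ")
    # Drop the space produced by the dummy prefix.
    return text[1:] if text.startswith(" ") else text


# Hypothetical four-entry vocabulary, far smaller than any real model's.
vocab = {WS + "Hello", WS + "world", WS, "東京"}
pieces = to_pieces("Hello world 東京🙂", vocab)
print(pieces)
print(from_pieces(pieces))
```

The emoji is not in the toy vocabulary, so it comes out as four byte pieces, while the Japanese word is matched without any word-splitting step. Joining the pieces restores the original string.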

Workflow example

When switching between models, check the tokenizer type: Llama 1 and 2 use SentencePiece (in BPE mode), Llama 3 and GPT models use byte-level BPE in the tiktoken style, and BERT-era models use WordPiece. If the same prompt produces wildly different token counts across models, the tokenizer is the most likely cause. For consistent token budgeting, count tokens with the tokenizer you'll use in production. Don't estimate tokens by word count: a 100-word English prompt is typically around 130 tokens with a GPT tokenizer, and the same content in Korean can take far more with a tokenizer whose vocabulary is mostly English.
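
One way to keep budgeting tied to the production tokenizer is to pass its encode function into the budget check instead of counting words. The sketch below is an assumption-laden illustration: count_tokens, fits_budget, and the two stand-in encoders are hypothetical names invented here so the example runs without any model files. In real use you would pass the encode method of your actual tokenizer.

```python
from typing import Callable, Sequence

# Any function that turns text into a sequence of tokens or token ids.
Encoder = Callable[[str], Sequence]


def count_tokens(text: str, encode: Encoder) -> int:
    """Count tokens using whichever encoder is supplied."""
    return len(encode(text))


def fits_budget(text: str, encode: Encoder, context_window: int,
                reserved_for_output: int = 0) -> bool:
    """True if the prompt plus the reserved output space fits the window."""
    return count_tokens(text, encode) + reserved_for_output <= context_window


# Stand-in encoders (hypothetical, for illustration only). In production,
# pass the real tokenizer instead, for example
# tiktoken.get_encoding("cl100k_base").encode, or the encode method of a
# sentencepiece.SentencePieceProcessor loaded from your model file.
def whitespace_encode(text: str) -> list:
    # Word-count style estimate: one token per space-separated word.
    return text.split()


def utf8_byte_encode(text: str) -> list:
    # Worst case for a byte-level tokenizer with no learned merges.
    return list(text.encode("utf-8"))


prompt = "안녕하세요 세계"
print(count_tokens(prompt, whitespace_encode))
print(count_tokens(prompt, utf8_byte_encode))
print(fits_budget(prompt, utf8_byte_encode, context_window=20))
```

The two stand-ins bracket the range a real tokenizer falls into: the word count is a floor, and the raw byte count is roughly what a byte-level tokenizer would produce if none of its merges applied. The gap between them is widest for scripts like Hangul, where each syllable is three UTF-8 bytes.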