SentencePiece
Definition pending
We've cataloged "SentencePiece" but haven't written a full definition yet. Definitions are hand-curated rather than auto-generated, so it takes time to cover every term.
Practical example
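SentencePiece is a tokenizer library that doesn't assume space-separated words. It trains directly on raw text and treats whitespace as an ordinary symbol (written "▁", U+2581), so it works with any language, including those written without spaces between words (Chinese, Japanese). It's used by Llama 1 and 2, Gemma, T5, and many others. SentencePiece is not an alternative to BPE: it implements both BPE and unigram language-model segmentation. The practical difference from word-level BPE pipelines is that it needs no language-specific pre-tokenization and operates on Unicode characters, not on pre-split words. With byte fallback enabled, a character missing from the vocabulary is split into its UTF-8 bytes instead of becoming an unknown token, which helps with rare characters and code.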
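To make the "no space-separated words" idea concrete, here is a minimal toy sketch. It is not the SentencePiece library or its API: the function names `to_pieces` and `from_pieces` and the tiny vocabularies are hypothetical, and greedy longest-match stands in for the BPE or unigram segmentation the real library performs. It only illustrates two ideas: whitespace is escaped to "▁" so the text round-trips losslessly, and characters outside the vocabulary fall back to byte pieces such as `<0xC3>`.

```python
# Toy illustration only: NOT the SentencePiece library or its API.
WS = "\u2581"  # "▁", the meta symbol used to represent whitespace


def to_pieces(text, vocab):
    """Greedy longest-match segmentation (a stand-in for BPE/unigram)."""
    s = WS + text.replace(" ", WS)  # escape spaces, add a leading marker
    pieces, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest candidate first
            if s[i:j] in vocab:
                pieces.append(s[i:j])
                i = j
                break
        else:
            # Byte fallback: emit the UTF-8 bytes of the unknown character.
            pieces.extend(f"<0x{b:02X}>" for b in s[i].encode("utf-8"))
            i += 1
    return pieces


def from_pieces(pieces):
    """Reverse to_pieces: rebuild bytes, decode, and unescape whitespace."""
    buf = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            buf.append(int(p[3:5], 16))  # byte-fallback piece
        else:
            buf.extend(p.encode("utf-8"))
    return buf.decode("utf-8").replace(WS, " ")[1:]  # drop leading marker


# Hypothetical vocabularies, chosen only for the demo.
en_vocab = {WS + "Hello", WS + "wor", "ld", WS}
ja_vocab = {WS + "東京", "都"}

en_pieces = to_pieces("Hello world é", en_vocab)
ja_pieces = to_pieces("東京都", ja_vocab)

print(en_pieces)  # ['▁Hello', '▁wor', 'ld', '▁', '<0xC3>', '<0xA9>']
print(ja_pieces)  # ['▁東京', '都']
print(from_pieces(en_pieces))  # Hello world é
```

The Japanese input segments without any spaces to split on, and "é" (absent from the toy vocabulary) survives the round trip as two byte pieces.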
Workflow example
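When switching between models, check the tokenizer type: Llama 1 and 2 use SentencePiece (in BPE mode), Llama 3 and the GPT models use tiktoken-style byte-level BPE, and BERT-family models use WordPiece. If the same prompt produces very different token counts across models, the difference usually comes from the tokenizers' vocabularies, not from the prompt. For consistent token budgeting, count with the exact tokenizer you'll use in production. Don't estimate tokens from word count: as a rough illustration, a 100-word English prompt comes to around 130 tokens on a GPT tokenizer, and the same prompt in Korean can come out far higher on a tokenizer with little Korean vocabulary.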
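A hedged sketch of budgeting against the production tokenizer instead of word count. The helpers `count_tokens` and `fits_budget` are hypothetical names, and the two encoders are deliberately crude stand-ins so the example runs without any dependencies. In practice you would pass the real encode function, for example the `encode` method of a loaded `sentencepiece.SentencePieceProcessor` or of a `tiktoken` encoding.

```python
from typing import Callable, Sequence


def count_tokens(text: str, encode: Callable[[str], Sequence]) -> int:
    """Count tokens using whatever encoder the production model uses."""
    return len(encode(text))


def fits_budget(text: str, encode: Callable[[str], Sequence],
                budget: int, reserve: int = 0) -> bool:
    """True if the prompt plus a reserved reply allowance fits the budget."""
    return count_tokens(text, encode) + reserve <= budget


# Stand-in encoders (assumptions for the demo, not real model tokenizers).
def whitespace_encode(text: str) -> list:
    return text.split()  # word-count style estimate


def utf8_byte_encode(text: str) -> list:
    return list(text.encode("utf-8"))  # worst case: one token per byte


prompt = "안녕하세요 세계"

print(count_tokens(prompt, whitespace_encode))  # 2
print(count_tokens(prompt, utf8_byte_encode))   # 22
print(fits_budget(prompt, whitespace_encode, budget=20))  # True
print(fits_budget(prompt, utf8_byte_encode, budget=20))   # False
```

The same Korean prompt passes or fails a 20-token budget depending only on the encoder. This is why the count should come from the tokenizer that will run in production.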