WordPiece

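WordPiece is the subword tokenizer used by BERT. Like BPE, it starts from individual characters and builds its vocabulary by repeatedly merging pairs of symbols; the difference is the merge criterion. BPE merges the most frequent adjacent pair, while WordPiece merges the pair that most increases the likelihood of the training data. It handles rare words by splitting them into known pieces: "unaffable" → ["un", "##aff", "##able"]. The "##" prefix means "this is a continuation of the previous token." If you see "##" in decoded text, your detokenizer is not joining continuation pieces back onto the preceding token.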
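To make the splitting and the "##" convention concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy BERT's reference tokenizer applies at inference time, together with a matching detokenizer. The function names and the toy vocabulary are made up for illustration; a real BERT vocabulary has roughly 30,000 entries and may split words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def detokenize(pieces):
    """Rejoin pieces: '##' pieces attach to the previous token, others start a new word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


# Toy vocabulary, invented for this example.
TOY_VOCAB = {"un", "##aff", "##able", "cyber", "##security", "the"}

pieces = wordpiece_tokenize("unaffable", TOY_VOCAB)
print(pieces)              # ['un', '##aff', '##able']
print(detokenize(pieces))  # unaffable
```

A detokenizer that simply joins pieces with spaces would print "un ##aff ##able", which is the broken output described above.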

WordPiece matters mainly if you're using BERT-family models, for example for classification or NER. The tokenizer splits words you didn't anticipate: depending on the vocabulary, "cybersecurity" may come out as ["cyber", "##security"], which is 2 tokens for 1 word. When using BERT for text classification, max_length=512 means 512 WordPiece tokens, special tokens such as [CLS] and [SEP] included, which is roughly 350–400 English words. If your documents are longer and truncation is enabled, the extra tokens are dropped silently, so log truncation_rate (the fraction of documents that were cut) to catch data loss.
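One way to compute truncation_rate is sketched below. It assumes you already have a WordPiece token count per document (before special tokens) and that a single-sequence input reserves two positions for [CLS] and [SEP]. The function signature and the sample counts are hypothetical.

```python
def truncation_rate(token_counts, max_length=512, num_special_tokens=2):
    """Fraction of documents whose WordPiece tokens exceed the model's budget."""
    # Room left for content after [CLS] and [SEP] (assumed: single-sequence input).
    budget = max_length - num_special_tokens
    if not token_counts:
        return 0.0
    truncated = sum(1 for n in token_counts if n > budget)
    return truncated / len(token_counts)


# Hypothetical per-document WordPiece counts, excluding special tokens.
counts = [120, 510, 511, 900]
rate = truncation_rate(counts)
print(f"truncation_rate={rate:.2f}")  # truncation_rate=0.50
```

Logging this number per dataset split makes it obvious when a meaningful share of documents loses its tail before it reaches the model.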
