Natural language processing

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model that conditions each token's representation on both its left and right context, producing context-aware token embeddings. Unlike autoregressive models (e.g., GPT), BERT is an encoder-only model pre-trained with masked language modeling and next-sentence prediction. Operators encounter BERT primarily in tasks such as text classification, named entity recognition, and extractive question answering. BERT models are far smaller than modern LLMs (BERT-base has 110M parameters) and run efficiently on consumer hardware: BERT-base weights occupy roughly 220 MB at FP16, so inference fits comfortably within a few GB of VRAM.

Deeper dive

BERT popularized deep bidirectional pre-training, which became foundational for NLP. During pre-training, about 15% of the tokens in each input are selected at random and masked, and BERT learns to predict them from both left and right context. This yields deep bidirectional representations. Variants include RoBERTa (optimized training recipe), DistilBERT (40% smaller, 60% faster), and ALBERT (parameter-efficient through cross-layer weight sharing). For operators, BERT models are typically loaded via Hugging Face Transformers. They are not used for open-ended text generation (there is no decoder) but excel at understanding tasks. Quantization (e.g., ONNX Runtime with INT8) can shrink BERT-base from ~440 MB at FP32 to ~110 MB, usually with only a small accuracy loss that should be checked per task, enabling deployment on edge devices or low-VRAM GPUs.
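
The masking step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual data pipeline: it assumes a pre-tokenized input, decides per token with an independent coin flip rather than selecting an exact share of positions, and does not exclude special tokens. The function name mask_tokens and the toy vocabulary are hypothetical. The 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) follows the recipe in the original BERT paper.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must
    predict, and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return masked, labels

tokens = "the movie was surprisingly good".split()
vocab = ["the", "movie", "was", "good", "bad", "film"]
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

Because the model sees the full masked sequence at once, the prediction for each selected position can draw on tokens to both its left and its right, which is what distinguishes this objective from left-to-right language modeling.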

Practical example

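A GPU with 6 GB of VRAM (e.g., a laptop RTX 3060; the desktop card ships with more) can run BERT-base (110M params) at FP16 (~220 MB of weights) with a batch size of 32 and sequence length 512, the maximum BERT supports, achieving on the order of 500 samples/sec for sentiment classification. Actual throughput depends heavily on sequence length and the inference runtime. Quantizing to INT8 reduces weight memory to ~110 MB, leaving more room for larger batches. DistilBERT (66M params) runs even faster, around 800 samples/sec on the same hardware.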
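
The memory figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them; the helper name weight_size_mb is hypothetical, and the result covers weights only. Activations, framework overhead, and the CUDA context add to this and grow with batch size and sequence length.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_size_mb(num_params, dtype):
    """Approximate weight storage in MB (1 MB = 10**6 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

BERT_BASE = 110_000_000
DISTILBERT = 66_000_000

for dtype in ("fp32", "fp16", "int8"):
    print(f"BERT-base  {dtype}: {weight_size_mb(BERT_BASE, dtype):.0f} MB")
    print(f"DistilBERT {dtype}: {weight_size_mb(DISTILBERT, dtype):.0f} MB")
```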

Workflow example

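In Hugging Face Transformers, BERT is loaded for classification with: from transformers import BertForSequenceClassification; model = BertForSequenceClassification.from_pretrained('bert-base-uncased'). The base checkpoint has no trained classification head, so the model must be fine-tuned (or a fine-tuned checkpoint loaded instead) before its predictions are meaningful. For inference, operators often export to ONNX and apply INT8 dynamic quantization to reduce latency; the quantization API ships in the onnxruntime package itself as onnxruntime.quantization, which supersedes the older onnxruntime-tools package. In LM Studio, BERT models appear under 'Embedding & Classification' and can be used for feature extraction or, given a suitably fine-tuned checkpoint, zero-shot classification, without generation overhead.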
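
A minimal sketch of this workflow follows, assuming the transformers, torch, and onnxruntime packages are installed and the checkpoint can be downloaded. The function names classify and quantize_onnx, and any file paths passed to them, are illustrative. Heavy imports are placed inside the functions so the module itself loads without them. With 'bert-base-uncased' the classification head is randomly initialized, so the returned probabilities carry no meaning until the model is fine-tuned.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(texts, model_name="bert-base-uncased", num_labels=2):
    """Run a BERT classifier over a list of strings; returns class probabilities."""
    import torch
    from transformers import BertForSequenceClassification, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    model.eval()
    # BERT accepts at most 512 tokens, so longer inputs are truncated.
    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**batch).logits
    return [softmax(row) for row in logits.tolist()]

def quantize_onnx(fp32_path, int8_path):
    """Apply INT8 dynamic quantization to an already exported ONNX model."""
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(fp32_path, int8_path, weight_type=QuantType.QInt8)
```

A call such as classify(["great film", "dull plot"]) returns one probability list per input. The quantize_onnx helper assumes the model has already been exported to ONNX by a separate step, which is not shown here.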

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work