Natural language processing

FastText

FastText is an open-source library for learning word representations and classifying text, developed by Facebook AI Research. It represents each word as a bag of character n-grams, so it can build a vector for an out-of-vocabulary word from its subword units. For operators, FastText is a lightweight alternative to deep neural networks for text classification, often used in preprocessing pipelines or as a baseline model. It does not run on local AI runtimes such as llama.cpp or Ollama, but its pretrained vectors (e.g., wiki-news-300d-1M) can be loaded in Python for feature extraction or similarity search.

Deeper dive

FastText extends the word2vec approach with subword information. Each word is wrapped in boundary markers and split into character n-grams: for n=3, 'apple' becomes '<apple>' and yields '<ap', 'app', 'ppl', 'ple', 'le>'. By default n ranges from 3 to 6 when training word vectors. The word vector combines the vectors of these n-grams with a vector for the whole word. This lets FastText produce embeddings for rare or unseen words, a key advantage over word2vec and GloVe, which have no vector for a word missing from their vocabulary. The library also includes a supervised classifier, which can optionally use a hierarchical softmax to speed up training when there are many labels. Operators typically use FastText from Python via the fasttext package, either to generate embeddings or to classify text. It is not designed for GPU acceleration and runs efficiently on CPU, which suits low-resource environments. Pretrained word vectors for 157 languages are available, and operators can use pretrained vectors to initialize training on custom datasets.
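The subword decomposition described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's own implementation (which also hashes n-grams into a fixed number of buckets); the function names char_ngrams and subword_units are hypothetical and exist only for this example.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def subword_units(word, min_n=3, max_n=6):
    """N-grams plus the whole wrapped word, which is kept as its own unit."""
    units = char_ngrams(word, min_n, max_n)
    whole = f"<{word}>"
    if whole not in units:
        units.append(whole)
    return units


print(char_ngrams("apple", 3, 3))
# ['<ap', 'app', 'ppl', 'ple', 'le>']

# A misspelling still shares several n-grams with the correct word,
# which is why its vector can end up close to the original.
shared = set(char_ngrams("weather")) & set(char_ngrams("weathr"))
print(sorted(shared))
```

The overlap in the last step is what makes a vector for an unseen or misspelled word possible: the shared n-grams already have trained vectors.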

Practical example

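An operator wants to classify short text queries into categories (e.g., 'weather', 'news', 'music'). They train a FastText classifier on 100k labeled examples. As illustrative figures, training takes ~2 minutes on a CPU, inference runs at ~100k queries/second, and the model file is ~10 MB; actual numbers depend on hardware, vocabulary size, and training settings. A BERT-based classifier, by contrast, would typically need a GPU for practical training, could take hours to train, and would produce a ~400 MB model. FastText can also tolerate misspellings (e.g., 'weathr' shares most of its n-grams with 'weather'), provided subword n-grams are enabled: in supervised mode they are off by default (minn and maxn are 0) and must be set explicitly.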
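A minimal sketch of how such a classifier might be prepared, assuming the fasttext Python package is installed. FastText's supervised mode reads one example per line, with each label prefixed by __label__. The helper names, the sample queries, and the minn/maxn values below are illustrative assumptions, not recommended settings.

```python
import os
import tempfile


def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    clean = " ".join(text.split())  # collapse newlines and repeated whitespace
    return f"__label__{label} {clean}"


def write_training_file(examples, path):
    """Write (label, text) pairs to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_fasttext_line(label, text) + "\n")


def train_query_classifier(train_path):
    """Train a classifier with subword n-grams enabled (needs the fasttext package)."""
    import fasttext  # imported lazily so the helpers above work without it

    # minn/maxn turn on character n-grams; these values are illustrative only.
    return fasttext.train_supervised(input=train_path, minn=2, maxn=5)


# Hypothetical sample data, written to a temporary directory.
examples = [("weather", "will it rain today"), ("music", "play some jazz")]
demo_path = os.path.join(tempfile.mkdtemp(), "train.txt")
write_training_file(examples, demo_path)

with open(demo_path, encoding="utf-8") as handle:
    print(handle.read())
```

With a realistically sized training file, model = train_query_classifier(demo_path) followed by model.predict('weathr tomorrow') would return the predicted labels and their probabilities; two examples are far too few to train anything useful.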

Workflow example

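In a typical workflow, an operator installs FastText with pip install fasttext, then downloads a pretrained model: wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz. The download is several gigabytes and must be decompressed before loading: gunzip cc.en.300.bin.gz. Then in Python: import fasttext; model = fasttext.load_model('cc.en.300.bin'); vec = model.get_word_vector('example'). For classification, the training file holds one example per line, each prefixed with its label in the form __label__<name>: model = fasttext.train_supervised('train.txt'). The trained model is saved with model.save_model('model.bin') and loaded later for inference with fasttext.load_model('model.bin'). FastText is not integrated with llama.cpp or Ollama; it runs standalone, either as a Python library or through its command-line tool.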
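The decompress-then-load step can also be done from Python. The sketch below assumes the compressed model has already been downloaded and that the fasttext package is installed; decompress_gz and load_word_vector are hypothetical helper names for this example.

```python
import gzip
import shutil


def decompress_gz(src_path, dest_path):
    """Stream-decompress a .gz file so the whole model never sits in memory."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path


def load_word_vector(model_path, word):
    """Load a fastText .bin model and return the vector for `word`."""
    import fasttext  # imported lazily; only needed once a model is loaded

    model = fasttext.load_model(model_path)
    return model.get_word_vector(word)


# Example usage, assuming cc.en.300.bin.gz is in the working directory:
# decompress_gz("cc.en.300.bin.gz", "cc.en.300.bin")
# vec = load_word_vector("cc.en.300.bin", "example")
```

Loading a full pretrained model needs several gigabytes of RAM, so in a long-running service it is usually better to load the model once and reuse it than to call a helper like load_word_vector per request.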

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work