Natural language processing

Word2Vec

Word2Vec is an algorithm that learns dense vector representations (embeddings) of words from large text corpora. Each word maps to a fixed-size vector (e.g., 300 dimensions) such that semantically similar words have nearby vectors. Two main architectures exist: Continuous Bag-of-Words (CBOW) predicts a target word from its context, while Skip-gram predicts context words from a target. These vectors capture analogies (e.g., king - man + woman ≈ queen) and are used as input features for downstream NLP models. Operators encounter Word2Vec when fine-tuning or using older models that rely on static embeddings rather than contextual ones like BERT.
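The difference between the two architectures is easiest to see in the training examples each one draws from the same sentence. The following is a minimal sketch in plain Python; the function names skipgram_pairs and cbow_examples, the window size, and the sample sentence are illustrative assumptions, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per neighbor inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


# Hypothetical sample sentence, split on whitespace for simplicity.
tokens = "the cat sat on the mat".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print(sg[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
print(cb[:2])  # [(['cat'], 'the'), (['the', 'sat'], 'cat')]
```

Skip-gram turns every neighbor into its own training pair, so a rare word contributes several updates each time it appears, while CBOW folds the whole window into a single example.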

Deeper dive

Word2Vec, introduced by Mikolov et al. in 2013, made it practical to train high-quality word embeddings efficiently on very large corpora. The algorithm uses a shallow neural network (a single hidden layer with no nonlinearity) trained on a sliding window over text. CBOW averages the context word vectors to predict the center word, while Skip-gram uses the center word to predict each surrounding word and often performs better on rare words. Training produces a weight matrix in which each row is one word's embedding. These embeddings are static: each word has one vector regardless of context, so 'bank' gets the same vector in 'river bank' and 'bank account'. For operators, Word2Vec is relevant when working with legacy models or when computational resources are limited, because looking up a static embedding is a table read rather than a forward pass through a large network. However, for most local AI tasks, contextual embeddings (e.g., from BERT or Llama) are preferred because they handle polysemy. Word2Vec is still used in recommendation systems and information retrieval where speed is critical.
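To make the "each row is a word's embedding" idea concrete, here is a rough sketch of a single CBOW forward pass in plain Python. The four-word vocabulary and every weight value are made up for illustration (nothing here is trained), and the full softmax is used only for readability; practical implementations typically swap it for a cheaper approximation such as negative sampling.

```python
import math

# Toy vocabulary and hand-picked weights (illustrative only, not trained).
vocab = ["the", "cat", "sat", "mat"]
index = {word: i for i, word in enumerate(vocab)}

# Input matrix: one row per word. After training, these rows are the embeddings.
W_in = [
    [0.1, 0.3],
    [0.8, 0.1],
    [0.2, 0.9],
    [0.7, 0.2],
]
# Output matrix: one row per word, used to score candidate center words.
W_out = [
    [0.2, 0.1],
    [0.9, 0.3],
    [0.1, 0.8],
    [0.6, 0.4],
]


def embedding(word):
    """A static embedding is just a row lookup in the input matrix."""
    return W_in[index[word]]


def cbow_forward(context_words):
    """Average the context embeddings, then softmax over the vocabulary."""
    vecs = [embedding(w) for w in context_words]
    dim = len(vecs[0])
    hidden = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    scores = [sum(h * o for h, o in zip(hidden, row)) for row in W_out]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return hidden, probs


hidden, probs = cbow_forward(["the", "sat"])
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Training would adjust W_in and W_out so that the true center word receives a high probability; only W_in is usually kept afterwards as the embedding table.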

Practical example

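A 300-dimensional Word2Vec model trained on Google News (~100 billion words) produces vectors where 'Paris' - 'France' + 'Italy' ≈ 'Rome'. For an operator running a text classifier on an RTX 3060, pre-trained Word2Vec embeddings loaded with Gensim can keep the model small, but only with a trimmed vocabulary: the full Google News file holds 3 million entries × 300 floats × 4 bytes ≈ 3.6 GB, whereas keeping only 100,000 words takes about 120 MB. That compares with roughly 400 MB of weights for a BERT-based classifier, which also needs more VRAM and more compute per input, so the Word2Vec classifier typically runs faster at inference.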
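The analogy arithmetic can be sketched with a handful of hand-made vectors. The values in TOY below are invented so that the example works out exactly (dimensions stand for "is a capital", "France", "Italy", "Germany"); real embeddings are learned, have hundreds of dimensions, and only approximate this behavior. The helper names cosine and analogy are assumptions for illustration.

```python
import math

# Hand-made toy vectors, NOT real Word2Vec output.
# Dimensions: [is a capital, France, Italy, Germany]
TOY = {
    "Paris":   [1.0, 1.0, 0.0, 0.0],
    "France":  [0.0, 1.0, 0.0, 0.0],
    "Rome":    [1.0, 0.0, 1.0, 0.0],
    "Italy":   [0.0, 0.0, 1.0, 0.0],
    "Berlin":  [1.0, 0.0, 0.0, 1.0],
    "Germany": [0.0, 0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the three inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


print(analogy("Paris", "France", "Italy", TOY))  # Rome
```

With a loaded Gensim model the comparable query would be model.most_similar(positive=['Paris', 'Italy'], negative=['France'], topn=1), which applies the same idea to normalized vectors over the full vocabulary.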

Workflow example

In a Python script using Gensim, an operator first imports the library (import gensim) and then loads a Word2Vec model: model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True). The call returns a KeyedVectors object, so the vector for 'king' is a simple lookup: vec = model['king']. These vectors can be fed into a simple classifier (e.g., logistic regression) for tasks like sentiment analysis, avoiding the need for a GPU. In Hugging Face Transformers, Word2Vec is not directly used; instead, operators would use AutoModel for contextual embeddings.
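One common way to turn word vectors into classifier features is to average them per document. The sketch below assumes that approach; load_vectors and document_vector are hypothetical helper names, and toy_vectors is a made-up stand-in so the example runs without Gensim or the model file. The limit argument of load_word2vec_format reads only the first entries of the file, which is one way to keep memory use down.

```python
def load_vectors(path, limit=None):
    """Load word2vec-format vectors with Gensim (requires gensim and the file)."""
    # Imported lazily so the rest of the sketch runs without Gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)


def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(float(v[d]) for v in known) / len(known) for d in range(dim)]


# Stand-in for a loaded model so the example runs anywhere (made-up values).
toy_vectors = {"good": [1.0, 0.0], "film": [0.0, 1.0]}

features = document_vector(["good", "film", "unseen"], toy_vectors, dim=2)
print(features)  # [0.5, 0.5]
```

In a real script, vectors would come from load_vectors('GoogleNews-vectors-negative300.bin', limit=100000) with dim=300, and the resulting feature lists would be passed to a classifier such as logistic regression.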

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work