Natural language processing

Language Modeling

Language modeling is the task of predicting the next token (word, subword, or character) in a sequence given the preceding context. In local AI, this is what transformer-based models like Llama, Mistral, or Qwen do during inference: they take a prompt and generate one token at a time, each step conditioned on all previous tokens. The model assigns a probability distribution over the vocabulary, and the runtime samples from that distribution to produce the next token. Language modeling is the core capability behind text generation, chat, and completion tasks. For operators, the key metric is tokens per second, which depends on the hardware (chiefly memory bandwidth and whether the model fits in VRAM), the quantization level, and the context length.
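The phrase "each step conditioned on all previous tokens" can be written compactly. The notation below is introduced here for illustration and is not part of the original entry: x_t is the token at position t and T is the sequence length.

```latex
% Autoregressive factorization: the probability of a whole sequence is the
% product of next-token probabilities, each conditioned on the tokens before it.
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
```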

Deeper dive

Language modeling has evolved from n-gram statistical models to neural networks, and now to large transformers trained on massive text corpora. The training objective is typically next-token prediction (autoregressive, as in GPT-style models) or masked language modeling (as in BERT). Generative models are trained to minimize the cross-entropy loss between the predicted distribution and the actual next token.

During inference, the model runs in a loop: given a sequence of tokens, it computes logits for the next token, applies a softmax (usually after temperature scaling) to get probabilities, and then picks the next token either greedily or by sampling, often restricted by top-k or top-p filtering. This token is appended to the context, and the process repeats. The cost of each step grows with context length because every new token attends to all previous tokens; processing a full sequence therefore scales quadratically with its length, which is why operators care about context window size and KV cache management. Quantization reduces model size and usually speeds up inference at the cost of some accuracy. Local AI operators often use 4-bit or 8-bit quantized models to fit larger models into VRAM.
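The inference loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how llama.cpp or any other runtime implements it: the function names, the five-token vocabulary, and toy_model (a stand-in for a real forward pass) are all hypothetical.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick one token id using temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: always take the argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices accepts unnormalised weights, so no renormalisation needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def generate(next_logits, prompt_ids, max_new_tokens, **sampling):
    """Autoregressive loop: sample a token, append it, feed the sequence back."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_logits(tokens)  # stand-in for the model's forward pass
        tokens.append(sample_next_token(logits, **sampling))
    return tokens

def toy_model(tokens):
    """Hypothetical 5-token 'model' that favours the id after the last one."""
    return [1.0 if i == (tokens[-1] + 1) % 5 else 0.0 for i in range(5)]

print(generate(toy_model, [0], 4, temperature=0))  # greedy decoding
print(generate(toy_model, [0], 4, temperature=0.8, top_k=3, top_p=0.9))
```

With temperature 0 the output is deterministic. With sampling enabled it varies from run to run unless a seeded random.Random instance is passed as rng.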

Practical example

When running Llama 3.1 8B at Q4_K_M on an RTX 4090 (24 GB VRAM), the model generates about 40-60 tokens per second for a 2K context. If the context grows to 32K, the KV cache consumes more VRAM (roughly 2 GB for 32K context with an 8-bit cache), and throughput drops to ~20-30 tokens per second because each generated token must read a larger KV cache from memory. Operators can monitor VRAM usage with nvidia-smi and adjust context length or quantization to balance speed and capability.
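The "roughly 2 GB" figure can be checked with a back-of-the-envelope calculation. The sketch below assumes the published Llama 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128) and exactly one byte per element for an 8-bit cache. Real 8-bit cache formats also store scaling factors, so actual usage will be slightly higher. The function name is made up for this example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_element):
    """Approximate KV cache size: keys and values for every layer and position."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element

# Assumed architecture values for Llama 3.1 8B (not stated in the text above).
LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for context_len in (2048, 32768):
    for label, bytes_per_element in (("fp16", 2), ("8-bit", 1)):
        size = kv_cache_bytes(
            context_len=context_len,
            bytes_per_element=bytes_per_element,
            **LLAMA_3_1_8B,
        )
        print(f"{context_len:>6} tokens, {label:>5} cache: {size / 2**30:.3f} GiB")
```

Under these assumptions the 32K, 8-bit case comes out at 2 GiB, while a 16-bit cache at the same context length would need twice that.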

Workflow example

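In llama.cpp, language modeling is invoked by running ./llama-cli -m model.gguf -p "Hello, how are you?" -n 256 (the binary was named ./main in older builds). The runtime loads the model, tokenizes the prompt, and then autoregressively generates up to 256 tokens. Adding the -ngl flag (for example, -ngl 99) controls how many layers are offloaded to the GPU. In Ollama, ollama run llama3.1:8b starts an interactive session where each user input is appended to the conversation history, and the model generates responses token by token. Operators can see generation speed in the timing statistics llama.cpp prints when generation finishes, or by passing --verbose to ollama run; ollama ps lists the loaded models along with their memory footprint and CPU/GPU split.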
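Generation speed can also be measured programmatically. The sketch below assumes an Ollama server on its default address (http://localhost:11434) and relies on the eval_count and eval_duration fields that Ollama's /api/generate endpoint returns for non-streaming requests, with durations in nanoseconds. The helper names are made up for this example.

```python
import json
import urllib.request

def tokens_per_second(response):
    """Generation speed from an Ollama response; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

def ollama_generate(model, prompt, host="http://localhost:11434"):
    """Send one non-streaming request to a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)

# Example usage (requires a running server with the model already pulled):
#   result = ollama_generate("llama3.1:8b", "Hello, how are you?")
#   print(f"{tokens_per_second(result):.1f} tokens/s")
```

This measures generation only. Prompt processing is reported separately in the prompt_eval_count and prompt_eval_duration fields.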

Reviewed by Fredoline Eruo.