551 terms across 19 categories. 440 have full definitions today; the rest are cataloged and being written.
We focus depth on terms most relevant to running AI locally. Cloud-only and academic terms are listed for completeness but get less attention.
Artificial Intelligence (AI) refers to systems that perform tasks typically requiring human intelligence, such as reasonin
Machine Learning (ML) is a field of AI where systems learn patterns from data without being explicitly programmed for ever
Deep learning (DL) is a subset of machine learning that uses multi-layer neural networks to learn patterns from data. In l
Neural networks are the computational architecture behind modern AI models. They consist of layers of interconnected nod
Artificial General Intelligence (AGI) refers to a hypothetical AI system that can perform any intellectual task that a hum
Artificial Superintelligence (ASI) refers to a hypothetical AI system that surpasses human intelligence across all domains
Inference is the process of running a trained model on input data to generate an output — the "forward pass" that produc
A Large Language Model is a neural network with billions of parameters trained on massive text corpora to predict the ne
Quantization is the process of reducing a model's numeric precision to shrink its memory footprint with minimal quality
Inference is the act of running a trained model to generate predictions, as opposed to training which produces the model
A prompt is the input text you provide to a language model to generate a response. It can be a simple question, a set of
RAG is the pattern of retrieving relevant documents from a knowledge base and including them in the LLM's prompt so the
Hallucination is when an LLM generates plausible-sounding but factually incorrect information — citing papers that don't
Prompt engineering is the practice of crafting model inputs to elicit better outputs without changing the model itself.
LoRA is a parameter-efficient fine-tuning technique that adapts a large pre-trained model by training small low-rank mat
RLHF (Reinforcement Learning from Human Feedback) is a training method that fine-tunes a language model using human prefer
Fine-tuning is continued training of a pre-trained model on a smaller, task-specific dataset. Pre-training builds genera
An embedding is a fixed-length vector representation of text, image, or other input — typically 384-3072 dimensions — wh
A foundation model is a large neural network trained on broad data at scale, designed to be adapted for a wide range of
Chain-of-thought prompting is asking a model to show its reasoning step-by-step before giving the final answer. It drama
Latency measures how fast you get a response. Two metrics matter for local LLMs: Time to First Token (TTFT), wall-clock
A vector database stores and retrieves data as high-dimensional vectors (embeddings) rather than rows or documents. In loc
Alignment refers to the process of fine-tuning a base LLM so its outputs match human preferences, values, or safety guid
GGUF (GGML Unified Format) is the file format used by llama.cpp and its ecosystem (Ollama, KoboldCpp, LM Studio). A single f
Pre-training is the initial phase where a large language model learns from a vast, diverse corpus of text data e.g., web
A system prompt is the initial instruction or context prepended to a conversation with an LLM. It sets the model's behav
Throughput measures how much work a system completes per unit time — typically tokens-per-second across all concurrent r
Instruction tuning is a supervised fine-tuning step where a base language model is trained on (instruction, response) pair
QLoRA combines LoRA fine-tuning with 4-bit quantization of the base model. Introduced by Tim Dettmers in 2
Semantic search retrieves results based on meaning rather than exact keyword matches. Instead of looking for literal wor
Direct Preference Optimization (DPO) is a method for fine-tuning language models to align with human preferences without u
Few-shot prompting is a technique where you include a small number of input-output examples in the prompt to guide the m
In-context learning (ICL) is a capability of large language models where the model adapts its behavior based solely on exa
A jailbreak is a prompt designed to bypass the safety guardrails of an LLM, causing it to generate content it would norm
ORPO (Odds Ratio Preference Optimization) is a fine-tuning method that combines supervised fine-tuning (SFT) and preference
Prompt injection is a security exploit where a crafted input overrides the system prompt or instruction set of an LLM, c
Zero-shot prompting is a technique where you give a language model a task description or instruction without providing a
DoRA (Weight-Decomposed Low-Rank Adaptation) is a fine-tuning method that improves upon LoRA by decomposing pre-trained we
KV cache quantization reduces the memory footprint of the key-value (KV) cache by storing its entries in lower-precision f
Speculative decoding speeds up LLM inference by using a small fast "draft" model to propose the next several tokens, the
Distillation is a training technique where a smaller 'student' model learns to mimic the behavior of a larger 'teacher'
Guardrails are runtime constraints or filters applied to an LLM's input and output to enforce safety, compliance, or for
Parameter-Efficient Fine-Tuning (PEFT) is a set of techniques that adapt a pre-trained large language model to a specific
ReAct (Reasoning + Acting) is a prompting technique that interleaves chain-of-thought reasoning with tool-use actions. In
Red teaming is the practice of systematically probing an LLM to find failure modes: harmful outputs, jailbreaks, halluci
Chunked prefill is an inference-engine technique that splits long-prompt processing into smaller chunks so the engine ca
Dense retrieval finds documents by computing cosine similarity or dot product between learned vector embeddings of the q
A reranker is a cross-encoder model that scores query/document pairs jointly (concatenated as input), producing a relevanc
Hybrid retrieval combines dense and sparse retrieval, typically by union-then-rerank or reciprocal rank fusion (RRF). The
Constitutional AI (CAI) is a training method that aligns language model behavior using a set of written rules (a 'constitut
Grounding connects a language model's output to verifiable external sources (documents, databases, APIs) to reduce halluci
Knowledge distillation is a technique where a smaller, faster 'student' model is trained to mimic the behavior of a larg
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm used to fine-tune large language models (LLMs) with
RLAIF (Reinforcement Learning from AI Feedback) is a technique for fine-tuning language models where an AI system, rather
BM25 is the canonical sparse-retrieval algorithm: a TF-IDF variant that saturates term frequency (a token appearing 100 t
Sycophancy in LLMs refers to the model's tendency to agree with a user's stated or implied position, even when that posi
Tree of Thoughts (ToT) is a prompting strategy that expands a single chain of reasoning into a tree of multiple reasoning
Sparse retrieval scores documents by lexical overlap with the query — high-dimensional vectors where most entries are ze
The KV cache stores the key and value tensors from previous attention computations so the model doesn't recompute them a
The context window is the maximum number of tokens a model can attend to at once — both prompt and previously generated
The attention mechanism is a neural network component that lets a model weigh the importance of different parts of the i
A token is the smallest unit of text a language model processes. Most modern models use subword tokenization, where comm
Self-attention computes a weighted representation of every position in a sequence by comparing each token against every
Tokenization is the process of converting text into the numeric tokens a model can process. Modern systems use subword t
Multi-Head Attention is a mechanism in transformer models where the input is projected into multiple parallel 'attention
Multi-Head Latent Attention (MLA) is an attention mechanism used in DeepSeek V2/V3 that compresses the key-value (KV) cache
Prefill is the first phase of LLM inference: the model processes the entire prompt in a single parallel pass, building u
Decode is the second phase of LLM inference: generating one output token at a time, autoregressively. Each decode step d
Flash Attention is a memory-efficient implementation of the attention mechanism that reduces memory usage from O(n²) to O(n)
Sliding Window Attention (SWA) is an attention pattern where each token only attends to a fixed-size window of nearby toke
Temperature is a sampling parameter that controls the randomness of token selection during text generation. It scales th
A decoder is the component of a transformer model that generates output tokens one at a time, using the input's encoded
An encoder is a neural network component that processes input data (text, images, audio) into a dense representation, a vec
Grouped-Query Attention (GQA) is a variant of multi-head attention that reduces memory and compute costs by sharing key-va
Rotary Position Embedding (RoPE) is a method for encoding token position in transformer models by rotating query and key v
Multi-Query Attention (MQA) is a transformer attention variant where all attention heads share a single key/value projecti
PagedAttention is the memory layout introduced by vLLM that stores the KV cache in fixed-size blocks (pages, like virtual
Sampling is the process of converting model logits into output tokens. Common strategies: greedy (temperature 0), random s
Byte Pair Encoding (BPE) is a subword tokenization algorithm that splits text into a sequence of tokens by iteratively mer
An encoder-decoder is a neural network architecture that processes an input sequence through an encoder to produce a com
Top-p (nucleus) sampling is a text generation strategy that selects from the smallest set of tokens whose cumulative proba
Temperature 0 disables sampling entirely — the model picks the highest-logit token at every step. Equivalent to greedy d
Cross-attention is a mechanism in transformer models where the query vectors come from one sequence e.g., the decoder's
Positional encoding is a technique used in transformer models to inject information about the position of tokens in a se
Top-k sampling is a text-generation strategy that restricts the model's next-token choices to the k tokens with the high
Deterministic decoding means same prompt → same output, every time. Achieved by setting temperature to 0 (always pick the
Layer normalization is a technique that stabilizes training and inference by normalizing activations across the features
Logits are the raw, unnormalized scores output by the final linear layer of a transformer model, before the softmax func
A random seed initializes the pseudo-random generator that drives sampling at temperature > 0. Same seed + same prompt +
RMSNorm is a simpler variant of LayerNorm that normalizes activations by their root-mean-square instead of their varianc
YaRN is a context-extension method that modifies RoPE frequencies to let a model trained on, say, 8K context generalize
SwiGLU is a gated feed-forward activation: (W1·x ⊙ swish(W2·x)) · W3, replacing the standard MLP's GELU/ReLU in modern trans
ALiBi is a positional encoding scheme that biases attention scores by a linear function of token distance, instead of in
Mirostat is a sampling algorithm that targets a fixed perplexity-like "surprise" level (tau) instead of a fixed top-p or t
GPT (Generative Pre-trained Transformer) is a decoder-only Transformer architecture that predicts the next token in a sequ
Natural Language Processing (NLP) is the field of AI focused on enabling computers to understand, interpret, and generate
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model that reads text in bo
Language modeling is the task of predicting the next token (word, subword, or character) in a sequence given the preceding
Text generation is the process where a language model produces coherent sequences of tokens (words or subwords) in respons
Automatic Speech Recognition (ASR) converts spoken audio into text. Operators encounter ASR when running models like Whisp
Machine translation (MT) is the task of automatically translating text from one natural language to another using a neural
Sentiment analysis is a text classification task where a model assigns a label (e.g., positive, negative, neutral) to a pi
Text summarization is a natural language processing task where a model generates a shorter version of a longer text whil
Text-to-Speech (TTS) converts written text into spoken audio using neural models. Operators encounter TTS when running loc
A word embedding is a dense vector of floating-point numbers that maps a word or token to a point in a high-dimensional
Word2Vec is an algorithm that learns dense vector representations (embeddings) of words from large text corpora. Each word
Named Entity Recognition (NER) is an NLP task that identifies and classifies named entities (e.g., person, organization, lo
Question answering (QA) is a natural language processing task where a model receives a question and returns a concise answ
Text classification is a natural language processing task where a model assigns a predefined category label to a piece o
GloVe (Global Vectors for Word Representation) is a static word embedding method that learns vector representations of wor
Speech synthesis, also known as text-to-speech (TTS), converts written text into spoken audio. In local AI, operators run
T5 (Text-to-Text Transfer Transformer) is a sequence-to-sequence model from Google that converts every NLP task into a tex
FastText is a library for efficient learning of word representations and sentence classification, developed by Facebook
An n-gram is a contiguous sequence of n items (usually tokens or characters) from a text. In local AI, n-grams appear in t
Topic modeling is an unsupervised NLP technique that discovers latent themes (topics) across a collection of documents. It
GPT-4 is a large multimodal language model developed by OpenAI, released in March 2023. It accepts text and image inputs
OpenAI is the organization that developed the GPT series of large language models (GPT-3, GPT-4, GPT-4o) and the DALL-E im
Llama is a family of open-weight large language models (LLMs) developed by Meta, starting with Llama 1 in 2023 and continu
Anthropic is an AI safety and research company that develops large language models (LLMs) under the Claude family. Operato
Claude is a family of large language models (LLMs) developed by Anthropic, designed for safe and helpful text generation.
NVIDIA designs the GPUs most operators use for local AI inference. Its consumer RTX series (e.g., RTX 4090) and workstatio
DeepSeek is a family of open-weight large language models developed by DeepSeek (深度求索), a Chinese AI research company. The
GPT-5 is the hypothetical successor to OpenAI's GPT-4 model family. As of early 2025, no official GPT-5 model has been r
Gemini is a family of multimodal large language models (LLMs) developed by Google DeepMind, designed to process text, imag
Google DeepMind is an AI research lab formed from the 2023 merger of Google Brain and DeepMind. It develops large langua
Hugging Face is a platform and company that hosts a vast repository of open-source machine learning models, datasets, an
Qwen is a family of large language models (LLMs) developed by Alibaba Cloud, ranging from 0.5B to 110B parameters. Operato
Meta AI is the artificial intelligence research division of Meta Platforms (formerly Facebook). For local AI operators, Me
Mistral is a family of open-weight large language models (LLMs) developed by Mistral AI, known for their efficiency and st
Stability AI is the company behind the Stable Diffusion family of image generation models, which operators run locally v
Grok is a family of large language models (LLMs) developed by xAI, led by Elon Musk. The first version, Grok-1, was releas
Phi is a family of small language models (SLMs) developed by Microsoft, designed to run efficiently on consumer hardware l
Generative AI (GenAI) refers to machine learning models that produce new content (text, images, audio, code, or video) by le
A deepfake is synthetic media (image, video, or audio) generated or manipulated by a deep learning model, typically an a
A generative model is a type of machine learning model that learns the underlying distribution of training data and can
ControlNet is a neural network architecture that adds spatial conditioning to pretrained image diffusion models like Sta
Latent diffusion is a technique used in image generation models like Stable Diffusion that applies the diffusion process
Video generation refers to the process of creating new video content from text prompts, images, or other video inputs us
Autoregressive models generate text one token at a time, where each new token depends on all previously generated tokens
Latent space is the internal, compressed representation of data that a generative model learns during training. It is a
Voice cloning is the process of generating synthetic speech that mimics a specific person's voice, including timbre, pit
Audio generation refers to the process of creating audio content—such as speech, music, or sound effects—using machine l
DreamBooth is a fine-tuning technique that personalizes a text-to-image model like Stable Diffusion to generate images o
StyleGAN is a generative adversarial network (GAN) architecture designed for high-resolution image synthesis, introduced b
DDPM (Denoising Diffusion Probabilistic Models) is a class of generative models that learn to generate data by reversing a
Music generation refers to the use of AI models to produce audio or symbolic representations of music e.g., MIDI, sheet
Ollama is a runtime and CLI tool for running large language models locally on consumer hardware. It wraps llama.cpp and
PyTorch is an open-source machine learning framework developed by Meta. It provides tensor computation with GPU accelera
llama.cpp is a C++ inference engine for running large language models (LLMs) locally on consumer hardware. It loads quanti
vLLM is an open-source inference engine optimized for high-throughput, low-latency serving of large language models. It
Hugging Face Transformers is a Python library that provides pre-trained models and tools for natural language processing
LM Studio is a desktop application that provides a graphical interface for downloading, managing, and running local larg
LangChain is a Python/TypeScript framework for building applications that chain together LLM calls, external data source
TensorFlow is an open-source machine learning framework developed by Google. Operators encounter it as an alternative to
scikit-learn is a Python library for classical machine learning (regression, classification, clustering, dimensionality r
text-generation-webui (often called oobabooga) is a browser-based interface for running large language models locally. It
ExLlamaV2 is a high-performance inference engine for Llama-family models, optimized for GPU execution. It achieves faste
KoboldCpp is a single-file, self-contained executable that bundles llama.cpp with a web-based UI and a built-in API, des
LlamaIndex is a data framework for building retrieval-augmented generation (RAG) applications. It provides tools to ingest
OpenCV (Open Source Computer Vision Library) is a C++ library with Python bindings for real-time image and video processin
Continuous batching (sometimes "iteration-level scheduling") is a serving optimization where new requests join the active
Hugging Face Text Generation Inference (TGI) is a production-grade inference server for large language models, optimized f
Gradio is an open-source Python library for quickly building web-based user interfaces for machine learning models. Oper
JAX is a numerical computing library from Google that combines NumPy-like array operations with automatic differentiatio
Keras is a high-level neural network API, written in Python and capable of running on top of TensorFlow, JAX, or PyTorch
MLC LLM (Machine Learning Compilation for Large Language Models) is a framework that compiles LLMs into deployable binarie
SGLang is an open-source LLM inference engine focused on high throughput for structured generation and complex agent wor
Streamlit is an open-source Python framework for turning data scripts into interactive web apps with minimal code. Opera
Prefix caching stores the KV cache from previous requests so a new request that shares a prefix (system prompt, few-shot
Request batching packs multiple inference requests into a single forward pass to amortize the cost of loading model weig
MPS is Apple's high-level Metal-based compute library, exposed in PyTorch as the mps device backend. Calling model.to("mp
Airflow is a workflow orchestration tool for scheduling, monitoring, and managing complex data pipelines as directed acy
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, including experimentation, rep
Ray is an open-source distributed computing framework for scaling AI workloads across multiple machines. Operators encou
Triton Inference Server is an open-source inference serving software by NVIDIA that manages multiple AI models across GP
Weights & Biases (W&B) is a cloud-based MLOps platform for tracking experiments, visualizing metrics, and managing model a
spaCy is a Python library for industrial-strength natural language processing (NLP) that provides pre-trained pipelines fo
FAISS (Facebook AI Similarity Search) is a C++/Python library for fast approximate nearest-neighbor search over dense vect
GGML is the C/C++ tensor library that underlies llama.cpp, whisper.cpp, and the original GGUF format. It provides quanti
Vulkan compute is the cross-vendor GPU compute API from Khronos. llama.cpp ships a Vulkan backend that runs on AMD, Inte
NLTK (Natural Language Toolkit) is a Python library for classical NLP tasks like tokenization, stemming, tagging, and pars
TensorBoard is a visualization toolkit from TensorFlow for inspecting model training metrics, graph structures, and weig
DirectML is Microsoft's GPU-agnostic ML acceleration API, layered on DirectX 12. It works on any Windows-supported GPU —
Expert parallelism is a parallelism strategy specific to MoE models: each GPU holds a different subset of the experts, a
The Transformer is a neural network architecture introduced in 2017 that replaced recurrent layers with a self-attention
A diffusion model is a type of generative model that learns to reverse a gradual noising process. During training, the m
A Convolutional Neural Network (CNN) is a neural network architecture that uses convolutional layers to process grid-like
A Generative Adversarial Network (GAN) is a machine learning architecture where two neural networks, a generator and a disc
Mixture of Experts is a neural network architecture where multiple specialized sub-networks ("experts") exist, but only a
Multimodal AI refers to models that process and generate multiple data types—typically text, images, and sometimes audio
A Vision-Language Model (VLM) processes both images and text, enabling tasks like image captioning, visual question answer
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture designed to model sequential data while avoid
A Recurrent Neural Network (RNN) is a neural network architecture designed for sequential data, where each output depends
A Multi-Layer Perceptron (MLP) is a feedforward neural network composed of at least three layers: an input layer, one or m
A Residual Network (ResNet) is a neural network architecture that introduces skip connections (also called shortcut connect
A Vision Transformer (ViT) is a neural network architecture that applies the Transformer model, originally designed for te
Decoder-only is the architecture of GPT, Llama, Qwen, Mistral, DeepSeek, and almost every modern open-weight LLM. The mo
An autoencoder is a neural network trained to reconstruct its input after passing it through a bottleneck layer. The bot
A Graph Neural Network (GNN) is a neural network architecture designed to process data structured as graphs: nodes connecte
A perceptron is the simplest form of a neural network: a single linear unit that takes weighted inputs, sums them, adds
State Space Models (SSMs), notably the Mamba architecture, are a class of sequence models that process tokens in linear ti
U-Net is a convolutional neural network architecture designed for image segmentation tasks. It consists of a contracting
A Variational Autoencoder (VAE) is a generative neural network that learns a compressed latent representation of input dat
A dense model activates every parameter on every forward pass — the default architecture for transformers like Llama, Qw
A Neural Radiance Field (NeRF) is a neural network that represents a 3D scene as a continuous function mapping a 3D locati
MoE routing is the gating mechanism that decides which experts a token activates in a Mixture-of-Experts layer. Top-k ro
A feedforward neural network (FFNN) is the simplest type of neural network where connections between nodes do not form cyc
Encoder-decoder transformers (T5, BART, original "Attention Is All You Need" architecture) have two halves: an encoder rea
VRAM is the dedicated memory on a GPU. For local AI, VRAM capacity is the single most important spec — it determines whi
A GPU (Graphics Processing Unit) is a specialized processor designed for parallel computation, originally for graphics but
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel-computing platform and the dominant API for GPU-accelerate
CPU offload is a technique where parts of a neural network model are processed by the CPU instead of the GPU, typically
Edge AI refers to running machine learning models locally on consumer hardware (laptops, phones, GPUs) rather than sending
VRAM bandwidth is the rate at which the GPU's video memory can transfer data to the compute cores, measured in GB/s. For
MLX is Apple's open-source array framework optimized for Apple Silicon. The Apple equivalent of PyTorch + CUDA, with fir
A Tensor Processing Unit (TPU) is a custom ASIC designed by Google specifically for accelerating machine learning workload
Distributed training splits the work of training a neural network across multiple GPUs or machines, using techniques lik
Edge computing means running AI inference on a local device (laptop, phone, embedded system) instead of sending data to a
FLOPS (Floating Point Operations Per Second) measures how many floating-point calculations a processor can perform in one
FP16 (16-bit floating point) is a number format that uses 16 bits per weight or activation, balancing precision and memory
A Neural Processing Unit (NPU) is a specialized hardware accelerator designed to execute neural network operations efficie
On-device AI refers to running machine learning models directly on local hardware (CPU, GPU, NPU) rather than sending data
GDDR7 uses PAM3 signaling to push per-pin rates to 28–32 Gbps in first-gen products (2025), with a path to 40+ Gbps. RTX 5
Unified memory is a memory architecture where CPU and GPU share the same physical RAM pool, eliminating CPU↔GPU copies.
BF16 (Brain Floating Point 16) is a 16-bit floating-point number format that uses 8 exponent bits and 7 mantissa bits, mat
Data parallelism is a distributed training strategy where a model is replicated across multiple devices (GPUs or nodes), a
DeepSpeed is a deep learning optimization library by Microsoft that reduces memory usage and speeds up training for larg
FP8 (Floating Point 8) is an 8-bit floating-point number format used in AI inference and training to reduce memory and com
HBM (High Bandwidth Memory) is a 3D-stacked DRAM design that vertically layers memory dies with through-silicon vias (TSVs)
Mixed precision is a technique that uses different numerical precisions (e.g., FP16 and FP32) for different parts of a mod
NVLink is NVIDIA's proprietary GPU-to-GPU interconnect, used to bind multiple data-center GPUs into a coherent memory fa
ONNX (Open Neural Network Exchange) is an open-source format for representing machine learning models, designed to enable
ROCm (Radeon Open Compute) is AMD's open-source equivalent of NVIDIA's CUDA. It's required for any meaningful AMD GPU infe
Tensor Cores are specialized hardware units on NVIDIA GPUs (Volta architecture and later) that perform fused multiply-add
Tensor parallelism splits each transformer layer's weight matrices across multiple GPUs. Card 0 holds the first half of
TensorRT is NVIDIA's SDK for optimizing and deploying deep learning models on NVIDIA GPUs. It performs graph optimizatio
FSDP (Fully Sharded Data Parallel) is a distributed training technique that shards model parameters, gradients, and optimi
INT8 (8-bit integer) is a numerical format that uses 8 bits to represent integers, typically in the range [-128, 127] for si
Metal is Apple's low-level GPU programming framework and API, analogous to Vulkan on other platforms. For local AI opera
Model parallelism is a technique that splits a single neural network across multiple GPUs or other accelerators, with ea
ZeRO (Zero Redundancy Optimizer) is a memory optimization technique for distributed training of large models. It partition
cuDNN (CUDA Deep Neural Network library) is NVIDIA's GPU-accelerated library for deep learning primitives like convolution
NVSwitch is the crossbar that connects 8 (or in NVL72, 72) GPUs into a single all-to-all NVLink fabric. Each GPU talks to
FP32 (32-bit floating point) is a numerical format that uses 32 bits to represent each model weight, offering high precisi
INT4 is a quantization format that stores each model weight using 4 bits, reducing memory usage by roughly 4× compared t
Pipeline parallelism (a.k.a. "layer split" in llama.cpp parlance) puts whole layers on different GPUs. Card 0 handles laye
Vulkan compute is a cross-platform GPU compute API that runs inference workloads on GPUs without requiring CUDA. In loca
Q4_K_M is the most-downloaded GGUF quantization on Hugging Face, the default tradeoff for local inference. It mixes 6-bit
AWQ (Activation-aware Weight Quantization) is a 4-bit quantization method designed for fast inference on NVIDIA GPUs. It's
Backpropagation is the algorithm used to train neural networks by computing gradients of the loss function with respect
Dropout is a regularization technique used during neural network training where randomly selected neurons are ignored (dr
Gradient descent is an optimization algorithm that iteratively adjusts model weights to minimize a loss function. In loc
Overfitting occurs when a model learns training data too well, including noise and irrelevant patterns, at the cost of g
Q5_K_M is a mixed-precision GGUF quantization averaging ~5.7 bits per parameter. Attention and feed-forward weights use 6-
Q8_0 is llama.cpp's simplest 8-bit GGUF quantization: weights in INT8, one FP16 scale per 32-element block, no zero-point
Adam (Adaptive Moment Estimation) is an optimizer that adjusts learning rates per parameter during training. It combines m
Batch normalization is a training technique that normalizes the inputs to a layer across a mini-batch of data. It comput
A hyperparameter is a configuration variable set before training begins that controls the learning process, not a parame
Learning rate is a hyperparameter that controls how much the model's weights are adjusted during each training step. A h
Stochastic Gradient Descent (SGD) is an optimization algorithm used during model training to minimize the loss function. U
GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot post-training quantization method that uses approxima
Q4_0 is the original llama.cpp 4-bit quantization: INT4 weights with one FP16 scale per 32-element block, no zero-point,
AdamW is an optimizer algorithm used during fine-tuning or training of neural networks, including LLMs. It modifies the
Batch size is the number of training samples processed together in one forward and backward pass. In local AI training,
Hyperparameter tuning is the process of selecting the configuration values that control how a model trains, such as lear
Regularization is a set of techniques used during model training to prevent overfitting—where the model memorizes traini
EXL2 is the ExLlamaV2 quantization format. NVIDIA-only, single-stream-throughput-optimized. Allows fractional bit-rates
The bias-variance tradeoff describes the tension between a model's ability to fit training data closely (low bias) and its
An epoch is one complete pass through the entire training dataset during model training. In practice, operators fine-tun
HQQ (Half-Quadratic Quantization) is a calibration-free quantization method that produces 2-, 3-, 4-, and 8-bit weight qua
Q3_K_M is a 3-bit GGUF K-quant averaging ~3.9 bits per parameter. It's the smallest format that still produces usable outp
The vanishing gradient problem occurs when gradients used to update model weights become extremely small as they are bac
Early stopping is a training technique that halts model training when performance on a validation set stops improving, p
An exploding gradient occurs when the gradients used to update model weights during training grow exponentially large, c
Gradient clipping is a technique used during neural network training to prevent exploding gradients. It caps the gradien
A learning rate schedule adjusts the step size (learning rate) during training to improve convergence and model quality. I
Weight decay is a regularization technique used during model training that adds a penalty proportional to the squared ma
Q2_K is 2-bit GGUF quantization averaging ~3.0 bits per parameter with mandatory 4-bit scales and importance metadata. It
Stable Diffusion is a text-to-image model that generates images from text prompts using a diffusion process. It runs on
Object detection is a computer vision task that identifies and localizes specific objects within an image or video frame
DALL-E is a family of text-to-image generative models developed by OpenAI. Operators encounter it as a cloud-only API se
Image classification is a computer vision task where a model assigns a single label from a predefined set to an input im
Midjourney is a proprietary text-to-image AI service accessible via Discord, not a local model. Operators cannot downloa
Optical Character Recognition (OCR) is the process of converting images of text (scanned documents, photos, or screenshots)
YOLO (You Only Look Once) is a family of real-time object detection models that process an entire image in a single forwar
Face recognition is a computer vision task that identifies or verifies a person from an image or video frame by comparin
Image segmentation is a computer vision task that partitions an image into multiple segments or regions, each correspond
The R-CNN family is a series of object detection architectures that evolved from region-based convolutional neural netwo
Semantic segmentation is a computer vision task that assigns a class label (e.g., 'car', 'road', 'person') to every pixel
Super-resolution is a computer vision technique that takes a low-resolution image and generates a higher-resolution vers
Feature extraction is the process of converting raw input data like an image into a compact set of numerical representat
Image inpainting is the task of filling missing or masked regions of an image with plausible, contextually consistent co
Instance segmentation is a computer vision task that assigns a pixel-level mask to each distinct object instance in an i
SLAM (Simultaneous Localization and Mapping) is a computational problem in robotics and computer vision where a device bui
Style transfer is a computer vision technique that applies the visual style of one image (e.g., a painting) to the content
Depth estimation is a computer vision task that predicts a depth value for each pixel in an image, producing a depth map
Edge detection is a computer vision technique that identifies points in an image where brightness changes sharply, formi
Pose estimation is a computer vision task that identifies the positions of key body joints e.g., shoulders, elbows, wris
An AI agent is software that uses an LLM to decide what to do, takes actions, observes results, and iterates toward a go
A coding agent is a language model configured to write, debug, or refactor code autonomously or semi-autonomously. It ty
Function calling (also called tool use) is a capability where the model emits structured JSON requesting that specific too
Tool calling (also called function calling) is a model's structured output capability where it produces JSON-shaped tool i
MCP is an open protocol introduced by Anthropic in late 2024 for connecting AI agents to tools and data sources in a sta
An autonomous agent is a system that uses a language model to decide and execute multi-step tasks without human interven
A browser agent is an AI-driven program that controls a web browser to automate tasks like form filling, data extraction
A multi-agent system (MAS) is a setup where multiple AI agents, each with distinct roles or capabilities, collaborate or c
Orchestration in the context of agents refers to the system that manages the lifecycle, communication, and task delegati
Planning in agents refers to the process where an LLM decomposes a complex goal into a sequence of sub-steps or actions
Agent memory refers to the mechanisms an AI agent uses to store and recall information across interactions. Short-term m
Robotic Process Automation (RPA) is software that automates repetitive, rule-based tasks typically performed by humans int
Embodied AI refers to AI systems that interact with the physical world through a body or sensorimotor capabilities, rath
A Reactive Agent selects actions based solely on its current percepts and a fixed set of condition-action rules, without
Tokens per second (tok/s) is the most-cited LLM throughput metric, but it's also the most-misunderstood. It splits into tw
Accuracy measures how often a model's predictions match the expected ground truth, typically expressed as a percentage e
TTFT (time-to-first-token) is the latency between sending a prompt and receiving the first generated token. It's dominated
The F1 score is the harmonic mean of precision and recall, giving a single metric that balances false positives and fals
Perplexity is a metric that measures how well a language model predicts a sequence of tokens. Lower perplexity means the
Precision in local AI refers to the number of bits used to represent each weight and activation in a neural network. Low
Recall measures the fraction of relevant items that a retrieval or classification system successfully finds. In local AI
AUC (Area Under the Curve) measures a model's ability to rank positive examples higher than negative ones, typically using
A confusion matrix is a table that summarizes the performance of a classification model by comparing predicted labels ag
Elo rating in LLM benchmarks is a relative scoring system that ranks models based on pairwise comparison results, typica
Pass@k is a metric that measures the probability that at least one of k independently generated samples from a model con
A Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate at various cl
Throughput is aggregate tokens generated per second across all in-flight requests; latency is wall-clock time for a sing
GSM8K is a benchmark of 8,500 grade-school math word problems requiring 2–8 reasoning steps. Models are scored by whethe
BLEU (Bilingual Evaluation Understudy) is an automated metric that measures how similar a machine-generated text is to one
FID (Fréchet Inception Distance) is a metric that measures the quality of images generated by a model by comparing the sta
IoU (Intersection over Union) is a metric that measures the overlap between a predicted bounding box and a ground-truth bo
R² (coefficient of determination) measures how well a regression model's predictions match actual outcomes, on a scale fro
mAP (mean Average Precision) is a metric that evaluates object detection models by averaging precision across recall thres
pass@1 is the probability that a model's first generated solution passes the unit tests for a coding problem, computed f
Sensitivity measures how much a model's output changes in response to small changes in its input. In local AI, sensitivi
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an
Self-supervised learning (SSL) is a training paradigm where a model learns representations from unlabeled data by creating
Supervised learning is a training paradigm where a model learns to map inputs to outputs using labeled data — each train
Zero-shot learning is a capability where a model performs a task it was never explicitly trained on, using only a natura
Transfer learning is a technique where a model trained on one task is reused as the starting point for a second task. In
Federated learning is a machine learning technique where a model is trained across multiple decentralized devices or ser
Few-shot learning is a technique where a model performs a task after seeing only a small number of examples typically 2–
Unsupervised learning is a machine learning paradigm where a model finds patterns in data without labeled examples. Unli
Contrastive learning is a self-supervised training method where a model learns to pull similar data points e.g., two aug
Deep Reinforcement Learning (DRL) combines deep neural networks with reinforcement learning, enabling agents to learn opti
Representation learning is the process by which a model automatically discovers the features or patterns in raw data tha
Continual learning (also called lifelong learning) is a machine learning paradigm where a model is trained on a sequence o
Meta-learning, or 'learning to learn,' is a training paradigm where a model is exposed to many related tasks so it can q
Multi-task learning (MTL) trains a single model on multiple related tasks simultaneously, sharing representations across t
AI safety refers to the set of practices and research aimed at ensuring that AI systems behave reliably, predictably, an
AI alignment refers to the challenge of ensuring that a model's outputs match the operator's intended goals and values.
AI ethics refers to the principles and practices that guide the responsible development and deployment of AI systems. Fo
Bias in AI/ML refers to systematic errors in model outputs that result from skewed training data, flawed assumptions, or
Algorithmic bias refers to systematic and repeatable errors in a model's outputs that create unfair outcomes, such as pr
The EU AI Act is a regulatory framework from the European Union that classifies AI systems by risk level unacceptable, h
Explainability refers to the ability to understand and interpret why a model produces a specific output. For local AI op
Fairness in AI refers to the absence of systematic bias in model outputs across different demographic groups. For operat
Interpretability refers to the ability to understand and explain why a model produces a specific output. For local AI op
Privacy in local AI refers to the operator's control over their data and model interactions, ensuring no data leaves the
AI regulation refers to laws, policies, and guidelines that govern the development, deployment, and use of AI systems. F
An adversarial attack is a technique where small, often imperceptible perturbations are added to an input to cause a mac
AI Governance refers to the set of policies, processes, and technical controls that determine how a model is developed,
An adversarial example is an input to a machine learning model that has been intentionally perturbed to cause a mispredi
Differential Privacy is a mathematical framework that provides a formal guarantee that the output of an analysis reveals
Mechanistic interpretability is the research approach of reverse-engineering neural networks into human-understandable a
Transparency in AI refers to the degree to which a model's behavior, training data, architecture, and decision-making pr
Explainable AI (XAI) refers to methods that make the decisions of machine learning models understandable to humans. For lo
Accountability in AI means that the operator or organization deploying a model can be held responsible for its outputs a
Computer vision is the field of AI that enables machines to interpret and process visual data—images, videos, or live ca
Self-driving cars, also known as autonomous vehicles, use AI to perceive their environment and navigate without human in
AlphaFold is a deep learning model developed by DeepMind that predicts the 3D structure of proteins from their amino aci
AlphaGo is a computer program developed by DeepMind that plays the board game Go at a superhuman level. It combines deep
Autonomous vehicles are self-driving systems that use AI to perceive their environment, plan routes, and control vehicle
Robotics in AI refers to the integration of machine learning models into physical robots to enable perception, decision-
Healthcare AI refers to machine learning models applied to medical data for tasks like diagnosis, treatment planning, dr
Recommender systems are machine learning models that predict user preferences for items (movies, products, content) based
AI in Finance refers to the application of machine learning and deep learning models to financial tasks like fraud detec
AlphaZero is a reinforcement learning algorithm developed by DeepMind that learns to master board games Go, chess, shogi
Anomaly detection is the task of identifying data points, events, or patterns that deviate significantly from a dataset'
Fraud detection is a machine learning task that identifies suspicious transactions, account activities, or user behavior
Medical imaging AI refers to machine learning models trained to analyze medical scans like X-rays, CTs, MRIs, and pathol
Speech processing refers to the analysis, synthesis, and manipulation of human speech by AI models. Operators encounter
Algorithmic trading uses computer programs to execute financial trades based on predefined rules, often involving statis
Drug discovery with AI applies machine learning to the process of identifying and designing new pharmaceutical compounds
Game AI refers to the algorithms and systems that control non-player characters (NPCs), opponents, and procedural content
Training data is the dataset used to teach a model its patterns and behaviors. For LLMs, this typically means trillions
ImageNet is a large-scale image dataset containing over 14 million labeled images across 20,000 categories, organized by
MMLU (Massive Multitask Language Understanding) is a benchmark that tests a language model's knowledge across 57 subjects,
Data augmentation is the technique of generating modified copies of existing training data to increase dataset size and
Feature engineering is the process of transforming raw data into input variables (features) that improve model performance
HumanEval is a benchmark dataset of 164 hand-written programming problems, each with a function signature, docstring, an
MNIST (Modified National Institute of Standards and Technology) is a dataset of 70,000 grayscale images of handwritten dig
Synthetic data is artificially generated data used to train or fine-tune AI models, created by algorithms rather than co
COCO (Common Objects in Context) is a large-scale image dataset created by Microsoft for object detection, segmentation, a
Cross-validation is a technique for evaluating how well a model generalizes to unseen data by partitioning the dataset i
Data labeling is the process of annotating raw data (text, images, audio) with tags or categories that teach a model what
A data pipeline is a sequence of automated steps that ingest, transform, and load data from source to destination. In lo
ETL (Extract, Transform, Load) is a data pipeline process that pulls raw data from sources (Extract), cleans or reformats it
Ground truth is the correct, real-world answer or label that a model is trained to predict or evaluated against. In supe
Test data is a set of examples used to evaluate a model's performance after training, distinct from the training data th
Validation data is a subset of examples held back from training to evaluate how well a model generalizes to unseen input
Annotation is the process of adding labels, tags, or metadata to raw data (text, images, audio) to create a training datas
CIFAR-10 and CIFAR-100 are datasets of 32x32 color images used for benchmarking image classification models. CIFAR-10 ha
Feature selection is the process of identifying and retaining only the most relevant input variables (features) for a mach
Imbalanced data refers to a dataset where the number of samples per class is significantly skewed, with one or more mino
Normalization is a data preprocessing step that rescales input values to a fixed range (e.g., [0, 1] or [-1, 1]) or adjusts them
One-hot encoding converts categorical data (e.g., token IDs) into binary vectors where only one element is 'hot' (1) and all
Concept drift is a change in the statistical properties of a target variable over time, causing a trained model to becom
Feature scaling adjusts the range of numeric input values so that each feature contributes equally to a model's training
The GLUE (General Language Understanding Evaluation) benchmark is a collection of nine natural language understanding task
K-Fold Cross-Validation is a technique for evaluating a model's performance by splitting the dataset into K equal-sized
Standardization in local AI refers to the process of converting raw data into a consistent format that models can proces
XGBoost (Extreme Gradient Boosting) is a gradient-boosted decision tree (GBDT) library optimized for structured/tabular data
Random Forest is an ensemble machine learning method that builds multiple decision trees during training and outputs the
A decision tree is a supervised learning model that splits data into branches based on feature values, forming a tree-li
Gradient boosting is an ensemble machine learning technique that builds a strong predictive model by sequentially adding
K-Means Clustering is an unsupervised learning algorithm that partitions a dataset into K distinct, non-overlapping clus
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed for efficiency and sp
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a high-dimensional dataset into
CatBoost is a gradient boosting library developed by Yandex that handles categorical features automatically without manu
K-Nearest Neighbors (KNN) is a classical machine learning algorithm used for classification or regression. It works by fin
Linear regression is a statistical method that models the relationship between an input variable (feature) and an output v
Logistic regression is a statistical model used for binary classification tasks, predicting the probability that an inpu
A Support Vector Machine (SVM) is a supervised learning model that finds a hyperplane (a decision boundary) to separate data
t-SNE (t-distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique used to visualize high-dimensi
Q-Learning is a model-free reinforcement learning algorithm that learns an optimal action-selection policy by iterativel
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique used to visualize high-dimens
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that groups d
A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in environments where outcomes ar
Monte Carlo methods are a class of algorithms that use repeated random sampling to approximate numerical results. In loc
MLOps (Machine Learning Operations) is the practice of managing the lifecycle of machine learning models from development
LLMOps (Large Language Model Operations) is the set of practices for deploying, monitoring, and maintaining LLMs in produc
Model deployment is the process of making a trained AI model available for inference in a production environment. For lo
A/B Testing in ML compares two model variants, a control (current production model) and a treatment (candidate model), by
An inference API is a programmatic interface that accepts input data like a prompt and returns a model's output like gen
Model Monitoring continuously tracks the health and performance of deployed ML models by measuring: (1) prediction quality
Model serving is the process of making a trained AI model available for inference via an API or local runtime. For opera
Real-time inference means the model processes input and returns output fast enough to feel instantaneous to a human user
Model Versioning tracks the evolution of ML models over time by assigning unique identifiers to each trained artifact an
A Model Registry is a centralized catalog that stores and versions trained models along with their metadata — training d
Shadow Deployment (also called dark launch or shadow mode) runs a candidate model in production alongside the current mode
The glossary grows when we find gaps.
If you searched for an AI term and we don't have a definition, contact support with the term. We prioritize terms that are practical for running AI locally over purely academic ones, but we'll consider any reasonable suggestion.