Machine Learning (ML)
Machine Learning (ML) is a field of AI where systems learn patterns from data without being explicitly programmed for every rule. In local AI, operators run ML models—like language models or image generators—on their own hardware. The model's weights, learned during training, are loaded into VRAM or system RAM for inference. ML matters because the same model can be used for many tasks (e.g., text generation, classification) by swapping the weights file, but the hardware must fit the model's size and compute requirements.
Deeper dive
ML models are trained on large datasets to minimize a loss function, adjusting internal parameters (weights) via backpropagation. The result is a static set of weights that encode learned patterns. For operators, the key distinction is between training (resource-intensive, often done in the cloud) and inference (running the model locally). Local inference uses frameworks like llama.cpp or MLX to load quantized weights into VRAM and perform forward passes. Model architecture (e.g., transformer) determines compute and memory needs. Operators choose models based on task, hardware constraints (VRAM, RAM, GPU compute), and latency requirements. Quantization reduces model size and speeds up inference at a cost to accuracy.
Practical example
An operator with an RTX 3060 12GB can run Llama 3.1 8B at Q4_K_M (5 GB VRAM) for chat, but cannot fit Llama 3.1 70B at Q4 (40 GB) without offloading to system RAM, which drops tokens/sec from ~40 to ~3. The same GPU can run Stable Diffusion XL (SDXL) for image generation, which uses ~8 GB VRAM at 1024x1024 resolution. ML enables both tasks with the same hardware, just different model files.
Workflow example
In Ollama, an operator runs ollama pull llama3.1:8b to download a pre-trained ML model. The runtime loads the weights into VRAM and runs inference. In LM Studio, the operator selects a model from the hub, adjusts context length, and starts a chat session. In vLLM, the operator launches an API server with vllm serve meta-llama/Llama-3.1-8B-Instruct to serve multiple requests. All these workflows rely on the ML model's weights being loaded and executed locally.
Reviewed by Fredoline Eruo. See our editorial policy.