Hugging Face
Hugging Face is a platform and company that hosts a vast repository of open-source machine learning models, datasets, and tools. For local AI operators, it's the primary source for downloading model weights, configuration files, and tokenizers. Models are organized into repositories with metadata like license, architecture, and quantization options. The Hugging Face Hub integrates with tools like llama.cpp, Ollama, and vLLM, allowing operators to pull models directly via URLs or CLI commands. It also provides the Transformers library for loading and running models in Python, though many local runtimes use their own loaders.
Deeper dive
Hugging Face started as a chatbot company but pivoted to become the central repository for the open-source ML community. The Hub hosts over 500,000 models, including popular architectures like Llama, Mistral, and Gemma. Each model repository typically contains weight files (often in safetensors format), a config.json with architecture parameters, and tokenizer files. Operators interact with Hugging Face primarily through the huggingface_hub Python library or by downloading files directly. Runtimes that read the original weights, such as vLLM and Transformers, accept a model identifier like meta-llama/Llama-3.1-8B and fetch the files themselves. GGUF-based runtimes such as llama.cpp and Ollama can also fetch from the Hub, but they need a repository that hosts GGUF files, which is usually a community conversion rather than the original repository. Model cards carry important details such as license terms, context length, and sometimes hardware requirements; cards for quantized repositories usually list the available variants (e.g., GGUF quantization levels, GPTQ) and their file sizes. While the Transformers library is the standard for Python inference, local runtimes often use custom loaders that bypass Transformers for better performance on consumer hardware.
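Downloading a file directly comes down to building a URL from the repository identifier, a revision, and a file name. The sketch below shows the Hub's "resolve" URL pattern using only the standard library. The helper name resolve_url is made up for this illustration, and the default revision of "main" is an assumption about the repository's branch; in practice the huggingface_hub library's hf_hub_download function covers the same ground and adds caching and authentication.

```python
from urllib.parse import quote

HUB_BASE = "https://huggingface.co"


def resolve_url(repo_id, filename, revision="main"):
    """Build a direct-download URL for one file in a Hub model repository.

    resolve_url is a hypothetical helper written for this example.
    repo_id  -- model identifier, e.g. "meta-llama/Llama-3.1-8B"
    filename -- path of the file inside the repository, e.g. "config.json"
    revision -- branch, tag, or commit hash; "main" is assumed as the default
    """
    # Encode the revision fully (a branch name may contain "/"),
    # but keep "/" in the filename so files in subfolders still resolve.
    return f"{HUB_BASE}/{repo_id}/resolve/{quote(revision, safe='')}/{quote(filename)}"


# Build the URL for the architecture config of the example model.
print(resolve_url("meta-llama/Llama-3.1-8B", "config.json"))
```

The URL only locates the file. Gated repositories still require an authenticated request after the license has been accepted.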
Practical example
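When an operator wants to run Llama 3.1 8B locally, they visit huggingface.co/meta-llama/Llama-3.1-8B to find the model card. The repository is gated, so they must accept the license before downloading. The original weights are in safetensors format (about 16 GB), while community repositories host quantized conversions of the instruction-tuned variant, such as llama-3.1-8b-instruct-q4_k_m.gguf (about 5 GB). They download the GGUF file and load it in llama.cpp or Ollama. As a rough guide, plan for ~6 GB of VRAM for Q4 and at least ~16 GB for FP16, since 16-bit weights alone take 2 bytes per parameter; long contexts need more on top of that.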
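The file sizes in this example follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below is a back-of-the-envelope estimate only. The rounded parameter count of 8 billion and the figure of about 4.85 effective bits per weight for a 4-bit quantization are assumptions chosen for illustration, and the result covers the weights alone, not the KV cache or other runtime overhead.

```python
def estimate_weight_gb(n_params, bits_per_weight):
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: parameter count rounded to 8 billion for illustration.
N_PARAMS = 8e9

# Assumption: ~4.85 effective bits per weight for a 4-bit quantization;
# the real figure depends on the quantization scheme used.
variants = [("16-bit", 16.0), ("4-bit (assumed ~4.85 bits/weight)", 4.85)]

for label, bits in variants:
    print(f"{label}: ~{estimate_weight_gb(N_PARAMS, bits):.1f} GB of weights")
```

Actual memory use at inference time is higher than these numbers, which is why a model with a 5 GB file is usually quoted as needing around 6 GB of VRAM.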
Workflow example
In Ollama, an operator runs ollama pull llama3.1:8b. Ollama resolves this name against its own model library rather than the Hugging Face Hub, downloads quantized GGUF weights, and stores them under ~/.ollama/models by default. To pull a GGUF repository from the Hub instead, Ollama accepts a reference of the form hf.co/{username}/{repository}. Alternatively, an operator using llama.cpp can download a GGUF file directly from Hugging Face using wget and then run ./llama-cli -m model.gguf -p "Hello". In LM Studio, the operator searches the Hub's model catalog within the app and clicks download.
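The download-then-run path for llama.cpp can also be scripted. The following is a minimal sketch using only the Python standard library in place of wget. The helper names download_file and llama_cli_args are hypothetical, the URL is whatever direct link the operator copied from the repository's file listing, and the binary path ./llama-cli is taken from the command above. Authentication for gated repositories is not handled.

```python
import shutil
import urllib.request


def download_file(url, dest_path):
    """Stream a file from url to dest_path without loading it all into memory.

    Hypothetical helper for this example; no authentication is attempted.
    """
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


def llama_cli_args(model_path, prompt, binary="./llama-cli"):
    """Build the argument list for the llama-cli invocation shown in the text."""
    return [binary, "-m", model_path, "-p", prompt]


# Show the command that would be executed for a local GGUF file.
print(" ".join(llama_cli_args("model.gguf", "Hello")))
```

After calling download_file with a real GGUF link, the list returned by llama_cli_args can be passed to subprocess.run to start inference.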
Reviewed by Fredoline Eruo. See our editorial policy.