Notable models & companies

OpenAI

OpenAI is the organization that developed the GPT series of large language models (GPT-3, GPT-4, GPT-4o) and the DALL-E image generation models. For local AI operators, OpenAI is relevant as the creator of model architectures and weights that are often reimplemented or reverse-engineered by open-source projects. For example, the GPT-2 architecture was the basis for many early local models, and OpenAI's API pricing and capabilities set a benchmark for what local models aim to match or exceed in terms of quality and latency.

Deeper dive

OpenAI was founded in 2015 as a non-profit AI research lab, later transitioning to a capped-profit structure. They have released several influential models, including GPT-1, GPT-2, GPT-3, GPT-4, and GPT-4o, as well as the CLIP and DALL-E models. While OpenAI's models are primarily accessed via cloud API, their research publications and model weights (e.g., GPT-2) have spurred the open-source local AI community. Operators often compare local model performance against OpenAI's API benchmarks (e.g., MMLU, HumanEval) and use OpenAI's tokenizer (tiktoken) or model architectures as reference implementations. The release of GPT-2's weights in 2019 was a pivotal moment for local AI, enabling the first wave of local language models. However, later models like GPT-3 and GPT-4 have not been fully open-sourced, leading to the development of alternatives like LLaMA and Mistral.

Practical example

An operator running Llama 3.1 8B locally might compare its output quality to GPT-4o on a specific task, noting that GPT-4o runs on remote servers with low latency (~1-2 seconds) but costs per token, while the local model runs at ~40 tok/s on an RTX 4090 with no ongoing cost. The operator might also use OpenAI's tiktoken library to count tokens for local model prompts, ensuring they stay within context limits.

Workflow example

When using Hugging Face Transformers to load a model like GPT-2, the operator runs from transformers import GPT2LMHeadModel and downloads weights from the hub. In LM Studio, an operator might select a model that uses the GPT-2 architecture (e.g., DistilGPT-2) and run inference locally. For API comparison, an operator might use curl https://api.openai.com/v1/chat/completions to test a prompt, then replicate it locally with Ollama using ollama run llama3.1 to compare latency and output.

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work