GPT by OpenAI
OpenAI's closed-weight family: GPT-4, GPT-4o, and the GPT-5 series. It is the reference baseline for capability comparisons but is not self-hostable; RunLocalAI covers the GPT family for benchmark context only.
There is no modern open-weight GPT model for local deployment. For local open-weight alternatives: Llama 3.3 70B at Q4 (40 GB) on 2× RTX 4090 matches GPT-4o on MMLU (87%); Qwen 3 32B at Q4 (18 GB) on a single RTX 4090 matches GPT-4-Turbo on math; DeepSeek V3 at FP8 on 8× H100 SXM approaches GPT-5-class reasoning. The one historical open-weight GPT release is GPT-2 1.5B (2019, MIT license): historically significant and able to run on any hardware, but functionally obsolete for production use. GPT-4o and GPT-5 are API-only. For a self-hosted alternative with GPT-compatible tool use and function calling, serve Llama 3.3 70B behind vLLM's OpenAI-compatible API; most GPT SDKs (the Python openai package, LangChain) work against a vLLM endpoint with nothing more than a base_url change. Skip building a local "GPT replacement" without specific requirements; the API is the pragmatic choice for GPT-family models.
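The base_url change is the whole integration story. A minimal sketch, using only the Python standard library, of what an OpenAI-compatible /v1/chat/completions request to a local vLLM server looks like; the endpoint URL, API key, and model name below are illustrative assumptions, not values from this page:

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: list) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request.

    The same payload shape works against api.openai.com and a local
    vLLM server; only the base URL (and model name) changes.
    """
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Hypothetical local endpoint; vLLM serves on port 8000 by default.
req = build_chat_request(
    "http://localhost:8000",  # swap for https://api.openai.com to hit OpenAI
    "local-key",
    "meta-llama/Llama-3.3-70B-Instruct",
    [{"role": "user", "content": "Hello"}],
)
```

In practice you would not hand-roll the request: the openai SDK accepts a base_url argument (e.g. `OpenAI(base_url="http://localhost:8000/v1", api_key=...)`), which is exactly the one-line swap described above.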
GPT models are API-only. OpenAI API pricing: GPT-4o ($2.50/1M input, $10/1M output), GPT-4o-mini ($0.15/1M input, $0.60/1M output), GPT-5 (pricing TBD). For a self-hosted API-compatible alternative: vLLM 0.6.3+ serving Llama 3.3 70B AWQ 4-bit with the --api-key and --served-model-name gpt-4 flags exposes an OpenAI-compatible /v1/chat/completions endpoint. For on-device mobile: llama.cpp serving Llama 3.1 8B Q4_0 on Snapdragon X Elite reaches ~18 tok/s with an OpenAI-compatible HTTP API. For Azure OpenAI users with data-residency requirements: deploy Llama 3.3 70B on-premises with vLLM on H100 SXM in your own VPC for GPT-4o-class quality with full data sovereignty. The GPT-2 1.5B open-weight model runs on any hardware via the Transformers AutoModelForCausalLM API (3 GB at FP32, ~40 tok/s on CPU) and is suitable for educational and research use only.
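At those list prices, the API-versus-self-hosting decision is simple arithmetic. A hedged sketch using the GPT-4o and GPT-4o-mini prices quoted above; the monthly token volumes are made-up illustration numbers, not a benchmark:

```python
def monthly_api_cost(input_tokens_m: float, output_tokens_m: float,
                     price_in_per_m: float, price_out_per_m: float) -> float:
    """Monthly API cost in USD, given token volumes in millions of
    tokens and prices in USD per million tokens."""
    return input_tokens_m * price_in_per_m + output_tokens_m * price_out_per_m


# List prices from above: GPT-4o $2.50/$10 per 1M in/out,
# GPT-4o-mini $0.15/$0.60 per 1M in/out.
# Hypothetical workload: 500M input + 100M output tokens per month.
cost_4o = monthly_api_cost(500, 100, 2.50, 10.00)   # 2250.0 USD/month
cost_mini = monthly_api_cost(500, 100, 0.15, 0.60)  # 135.0 USD/month
```

Comparing that monthly figure against the amortized hardware and power cost of a self-hosted vLLM box is the honest way to decide when local serving pays off.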
GPT (OpenAI) is API-only, so there is no local hardware to verify; before committing money, verify instead that your chosen open-weight alternative runs on your specific hardware.