GPT-4 — AI glossary

GPT-4 is a large multimodal language model developed by OpenAI, released in March 2023. It accepts text and image inputs and produces text outputs. For operators, GPT-4 is a proprietary, closed-weight model accessible only through hosted services such as OpenAI's API or a ChatGPT Plus subscription. It is not available for local download or self-hosting, unlike open-weight models such as Llama 3.1 or Mistral. OpenAI has not disclosed the model's size; unofficial estimates put it at roughly 1.7 trillion parameters in a mixture-of-experts (MoE) architecture, which would be far too large to run on consumer hardware even with quantization. Operators encounter GPT-4 when comparing its API-based performance against locally runnable models for tasks like coding, reasoning, or creative writing.

Deeper dive

GPT-4 represents a significant leap over GPT-3.5 in reasoning, factual accuracy, and steerability. It is multimodal, meaning it can process images (e.g., diagrams, screenshots) as well as text. The model is built on a transformer architecture and is widely reported to use a mixture-of-experts (MoE) design, in which only a subset of parameters is active per token, reducing inference cost. OpenAI has not released official parameter counts or architecture details. Leaks and third-party analyses suggest roughly 1.7 to 1.8 trillion total parameters, with reported expert counts ranging from 8 to 16 and around 280 billion parameters active per token; all of these figures are unconfirmed. GPT-4 is available in several variants: the original GPT-4, GPT-4 Turbo (cheaper, faster, with a knowledge cutoff of April 2023 at launch, later extended to December 2023), and GPT-4o (omni, multimodal, faster). For operators, the key takeaway is that GPT-4 is a closed, API-only model. Its performance sets a benchmark for local models, but its per-token cost, network latency, and the need to send data to a third party are trade-offs. Among local models, Llama 3.1 70B approaches GPT-4 quality on many tasks and, quantized to 4 bits, runs on a single high-end GPU (e.g., 48 GB VRAM). Mixtral 8x22B is also competitive, but with 141 billion total parameters it needs roughly twice that memory at the same quantization.
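
Whether a model's weights fit on a given GPU can be sanity-checked with a rough estimate: parameter count times bits per weight, divided by 8. The sketch below is an approximation under stated assumptions, not a sizing tool. It assumes about 4.5 bits per weight for a 4-bit quantization (real formats vary) and counts weights only, ignoring the KV cache and runtime overhead. The function name is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    # billions of parameters * bits per parameter / 8 bits per byte = billions of bytes
    return params_billions * bits_per_weight / 8


# Assumption: ~4.5 bits/weight stands in for a generic 4-bit quantization.
BITS_PER_WEIGHT = 4.5

for name, params_billions in [("Llama 3.1 70B", 70), ("Mixtral 8x22B", 141)]:
    gb = weight_memory_gb(params_billions, BITS_PER_WEIGHT)
    print(f"{name}: ~{gb:.0f} GB of weights at {BITS_PER_WEIGHT} bits/weight")
```

By this estimate the 70B weights come to about 39 GB and fit in 48 GB with some room left for the KV cache, while a 141-billion-parameter model comes to about 79 GB and does not fit at the same bit width.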

Practical example

An operator comparing GPT-4 to a local model might run a coding benchmark and track cost per query. At GPT-4 Turbo's launch pricing of $0.01 per 1K input tokens and $0.03 per 1K output tokens (prices change, so check the current rate card), a 1,000-token prompt with a 500-token response costs $0.01 + $0.015 = $0.025. In contrast, running Llama 3.1 70B Q4 locally on an RTX 6000 Ada (48 GB VRAM) costs only electricity (about $0.10/hour) and yields ~15 tok/s. For a batch of 100 such queries, GPT-4 Turbo costs $2.50. The local run generates 50,000 output tokens, which at 15 tok/s takes a little under an hour, so it costs ~$0.10 in electricity but requires an upfront hardware investment.
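
The arithmetic above can be reproduced with a short script. This is a minimal sketch using only the figures quoted in this entry (per-token prices, 15 tok/s, $0.10/hour); the function names are hypothetical, and the local estimate counts generation time only, ignoring prompt processing and hardware cost.

```python
def api_cost(input_tokens: int, output_tokens: int,
             input_price_per_1k: float = 0.01,
             output_price_per_1k: float = 0.03) -> float:
    """Dollar cost of one API call at the given per-1K-token prices."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def local_electricity_cost(output_tokens: int,
                           tokens_per_second: float = 15.0,
                           dollars_per_hour: float = 0.10) -> float:
    """Dollar cost of electricity to generate the given number of tokens locally."""
    hours = output_tokens / tokens_per_second / 3600
    return hours * dollars_per_hour


queries = 100
api_total = queries * api_cost(input_tokens=1000, output_tokens=500)
local_total = local_electricity_cost(output_tokens=queries * 500)

print(f"API total:         ${api_total:.2f}")
print(f"Local electricity: ${local_total:.2f}")
```

The local figure comes out slightly below $0.10 because 50,000 tokens at 15 tok/s take about 56 minutes rather than a full hour.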

Workflow example

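In practice, an operator might use GPT-4 via the OpenAI API in a script. With version 1.0 or later of the openai Python package (which removed the older openai.ChatCompletion.create call), this looks like: from openai import OpenAI; client = OpenAI(); response = client.chat.completions.create(model="gpt-4", messages=[{"role": "user", "content": "Explain quantum computing"}]). The client reads the API key from the OPENAI_API_KEY environment variable, and the response includes token usage (response.usage.total_tokens). For a local alternative, the operator would use Ollama: ollama run llama3.1:70b --verbose, where the --verbose flag prints timing statistics, including the generation rate in tokens per second, after each response. The choice between GPT-4 and local models hinges on budget, latency tolerance, and data privacy requirements.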
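
For scripted benchmarking, the same tokens-per-second figure can be derived from Ollama's local HTTP API instead of reading the --verbose output by hand. The sketch below is illustrative rather than definitive: it assumes an Ollama server on the default address http://localhost:11434 and relies on the eval_count and eval_duration (nanoseconds) fields that the /api/generate endpoint returns in non-streaming mode. The helper names are hypothetical.

```python
import json
import urllib.request


def generate(prompt: str, model: str = "llama3.1:70b",
             host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def tokens_per_second(result: dict) -> float:
    """Generation speed from Ollama's counters (eval_duration is in nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)
```

With the server running and the model pulled, tokens_per_second(generate("Explain quantum computing")) returns the generation rate for that request. Averaging over several prompts gives a steadier number than a single run.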