Browser AI
Running models directly in web browsers with Transformers.js, WebLLM (web-llm), ONNX Runtime Web, and WebGPU.
Setup walkthrough
- Open a recent Chrome or Edge (WebGPU is enabled by default from version 113 on Windows, macOS, and ChromeOS). No installation needed.
- Visit huggingface.co/spaces/webml-community/llama-3.2-webgpu and click "Load Model". The model (~2 GB) downloads into your browser's cache and stays there for future visits.
- After loading (~1-2 minutes on a first visit, a few seconds on later visits because the model loads from cache), type: "What is WebGPU and how does it enable browser AI?"
- First response in 3-10 seconds — entirely local, zero server calls, works offline after model is cached.
- For Transformers.js in your own site:
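npm install @huggingface/transformers
This lets you run Whisper, embeddings, image classification, and small LLMs in-browser. A minimal sentiment-analysis example (in an ES module, which allows top-level await):
import { pipeline } from "@huggingface/transformers";
// The first call downloads the default model for this task and caches it in the browser
const classifier = await pipeline("sentiment-analysis");
const result = await classifier("I love local AI!");
console.log(result); // e.g. [{ label: "POSITIVE", score: 0.99... }]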
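- With Transformers.js, a first browser AI app takes about 30 minutes. No inference server is needed: a static page is enough, and by default model files are fetched from the Hugging Face Hub and cached in the browser.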
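Because WebGPU is not available in every browser, it helps to choose a backend at runtime instead of assuming one. A minimal sketch, assuming Transformers.js v3, whose pipeline() accepts a device option such as "webgpu" or "wasm"; the helper name pickDevice is hypothetical, and the navigator object is injectable so the logic can be tested outside a browser:

```javascript
// Hypothetical helper: choose a Transformers.js device ("webgpu" or "wasm").
// Assumes Transformers.js v3, where pipeline(task, model, { device }) accepts these values.
async function pickDevice(nav = globalThis.navigator) {
  // navigator.gpu only exists in browsers that expose WebGPU.
  if (!nav || !nav.gpu) return "wasm";
  try {
    // requestAdapter() resolves to null when no suitable GPU adapter is available.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any adapter error as "no WebGPU" and fall back to the CPU (WASM) backend.
    return "wasm";
  }
}
```

The result would be passed as the device option when creating a pipeline, so devices without WebGPU still work, just more slowly on the CPU.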
The cheap setup
Browser AI runs on the hardware you already own. A laptop with 8+ GB RAM, a WebGPU-capable GPU, and a browser from 2023 or later typically runs 3B models at roughly 10-30 tok/s. A Chromebook ($200-300, 8 GB RAM) runs Llama 3.2 3B in WebLLM competently. Small embedding models (e.g., Nomic Embed Text, ~200 MB) run in-browser on most devices, including recent phones. Browser AI is the ultimate "cheap" AI: the user already has the hardware, and your web app just ships the model. If your users have a modern browser, they have AI compute. Incremental hardware cost: $0.
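Why do 3B models mean downloads of around 2 GB and laptops with 8 GB RAM? A common back-of-envelope rule (not from the source above) is that weight size is parameter count times bits per weight, divided by 8. The sketch below applies it; it ignores tokenizer files, layers kept at higher precision, the KV cache, and runtime overhead, so real downloads and memory use come out somewhat larger:

```javascript
// Rough size of model weights in bytes: parameters x bits per weight / 8.
// Excludes tokenizer files, higher-precision layers, KV cache, and runtime overhead.
function estimateWeightBytes(paramCount, bitsPerWeight) {
  return (paramCount * bitsPerWeight) / 8;
}

// Convert bytes to decimal gigabytes.
function toGB(bytes) {
  return bytes / 1e9;
}

// A 3B-parameter model at 4-bit vs. 16-bit precision.
const q4 = toGB(estimateWeightBytes(3e9, 4)); // 1.5
const f16 = toGB(estimateWeightBytes(3e9, 16)); // 6
console.log(`3B @ 4-bit ≈ ${q4} GB, 3B @ fp16 ≈ ${f16} GB`);
```

This is why browser deployments almost always ship 4-bit quantized weights: at fp16 the same 3B model would not fit comfortably on an 8 GB machine.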
The serious setup
Browser AI has no "serious hardware" tier: it runs on the user's device, not yours. For developers building browser AI apps: optimize model size (quantized ONNX models, WebGPU shader optimizations), test on low-end devices (e.g., a Chromebook with 4 GB RAM), and implement progressive loading (for example, show download progress and offer smaller models to weaker devices). For users running browser AI: a MacBook Pro M4 Max (see /hardware/macbook-pro-16-m4-max) with a 40-core GPU runs WebGPU at desktop speeds, around 50-80 tok/s for 3B models. An RTX 4060 gaming laptop (~$1,000) achieves similar speeds. But browser AI is deliberately lightweight: if you have a $2,000 GPU, run models natively rather than in-browser. Browser AI is for accessibility, not maximum performance.
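Testing on low-end devices usually leads to serving different models to different devices. A minimal sketch of that selection logic, assuming hypothetical tiers and placeholder model IDs (these are not real repositories). It reads navigator.deviceMemory from the Device Memory API, which is Chromium-only and reports a coarse, rounded value, and it assumes 4 GB when the value is missing:

```javascript
// Hypothetical model tiers, largest first. IDs are placeholders, not real repositories.
const MODEL_TIERS = [
  { id: "example/llm-3b-q4", minMemoryGB: 8, needsWebGPU: true },
  { id: "example/llm-1b-q4", minMemoryGB: 4, needsWebGPU: true },
  { id: "example/embed-small-q8", minMemoryGB: 0, needsWebGPU: false },
];

// navigator.deviceMemory is Chromium-only and coarse; assume a conservative 4 GB if absent.
function readDeviceMemoryGB(nav = globalThis.navigator) {
  return nav && typeof nav.deviceMemory === "number" ? nav.deviceMemory : 4;
}

// Return the largest tier this device can plausibly run.
function chooseModelTier({ hasWebGPU, deviceMemoryGB }) {
  return MODEL_TIERS.find(
    (tier) => (hasWebGPU || !tier.needsWebGPU) && deviceMemoryGB >= tier.minMemoryGB
  );
}

// Example: a 4 GB Chromebook with WebGPU gets the 1B tier.
console.log(chooseModelTier({ hasWebGPU: true, deviceMemoryGB: 4 }).id); // prints example/llm-1b-q4
```

Keeping a WebGPU-free tier at the bottom of the list guarantees every device gets something useful instead of a failed load.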
Common beginner mistake
The mistake: building a web app that re-downloads a 2 GB model on every page load because the model isn't cached properly. Users on mobile data can get a $10 phone bill just for loading your demo.
Why it fails: large models consume mobile data and browser storage quota. On a metered connection, a 2 GB download costs money and takes roughly 5-10 minutes on 4G. Users bounce before the model loads.
The fix: cache model files persistently in the browser (Cache Storage or IndexedDB). WebLLM and Transformers.js cache model files by default, but verify that caching actually works in your deployment (for example, that it hasn't been disabled and that model URLs stay stable between visits). On first load, show a progress bar ("Downloading model (2 GB)... This is a one-time download, cached for future visits."). On later loads, the model loads from cache in under 5 seconds. For mobile users, serve a smaller model (e.g., a 1B variant at roughly 1 GB) or offer a server-side inference fallback. Also check navigator.connection.saveData (Chromium-only): if the user has data saver mode on, ask before downloading 2 GB. Respect your users' data plans.
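The first-load checks above can be sketched as three small helpers. The function names are hypothetical; the browser APIs are real but unevenly supported: navigator.connection (Network Information API) exists only in Chromium browsers, while navigator.storage.estimate() and navigator.storage.persist() are standard StorageManager methods. Treating "slow-2g", "2g", and "3g" as slow connections is an assumption you may want to tune:

```javascript
// Ask before a large download if data saver is on or the connection looks slow.
// Without navigator.connection there is no signal, so we don't prompt
// (the progress bar still tells the user how large the download is).
function shouldAskBeforeDownload(nav = globalThis.navigator) {
  const conn = nav && nav.connection;
  if (!conn) return false;
  if (conn.saveData) return true;
  return ["slow-2g", "2g", "3g"].includes(conn.effectiveType);
}

// Progress-bar text for the one-time download.
function formatProgress(loadedBytes, totalBytes) {
  const pct = totalBytes > 0 ? Math.floor((loadedBytes / totalBytes) * 100) : 0;
  const totalGB = (totalBytes / 1e9).toFixed(1);
  return `Downloading model (${totalGB} GB)... ${pct}%. This is a one-time download, cached for future visits.`;
}

// Check free quota and ask the browser not to evict cached data.
// enoughSpace is null when no estimate is available.
async function prepareStorage(modelBytes, nav = globalThis.navigator) {
  const storage = nav && nav.storage;
  if (!storage || typeof storage.estimate !== "function") {
    return { enoughSpace: null, persisted: false };
  }
  const { quota = 0, usage = 0 } = await storage.estimate();
  const persisted = typeof storage.persist === "function" ? await storage.persist() : false;
  return { enoughSpace: quota - usage >= modelBytes, persisted };
}

console.log(formatProgress(5e8, 2e9));
// prints: Downloading model (2.0 GB)... 25%. This is a one-time download, cached for future visits.
```

Requesting persistent storage matters because browsers may otherwise evict a multi-gigabyte cache under storage pressure, which silently brings back the re-download problem.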
Reality check
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
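The claim that bandwidth decides decode speed can be turned into a quick upper bound. For a dense model generating one sequence, each new token reads roughly all the weights once, so tokens per second cannot exceed memory bandwidth divided by weight size. This is a standard back-of-envelope rule rather than something stated above, and real throughput is lower because of KV-cache reads, kernel overhead, and, in the browser, WebGPU overhead. The numbers in the example are hypothetical:

```javascript
// Upper bound on decode speed for a dense, memory-bandwidth-bound model at batch size 1:
//   tokens/s <= memory bandwidth (GB/s) / weight size (GB)
function decodeUpperBoundTokPerSec(bandwidthGBps, weightsGB) {
  return bandwidthGBps / weightsGB;
}

// Hypothetical example: 100 GB/s of memory bandwidth and 2 GB of weights.
console.log(decodeUpperBoundTokPerSec(100, 2)); // 50
```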
Common mistakes
- Buying for spec-sheet VRAM without modeling KV cache + activation overhead
- Underestimating quantization quality loss below Q4
- Skipping flash-attention support (real perf gap on long context)
- Ignoring sustained-load thermals (many laptops thermal-throttle within 30 minutes)
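Modeling KV-cache overhead, the first mistake listed, is simple arithmetic. For a standard decoder with grouped-query attention, the cache stores one key and one value vector per layer, per KV head, per token. A minimal sketch; the configuration below is an assumption resembling Llama 3.2 3B (28 layers, 8 KV heads, head dimension 128), so check the model's config.json before relying on the numbers:

```javascript
// KV-cache size for one sequence:
//   bytes = 2 (K and V) x layers x kvHeads x headDim x seqLen x bytesPerElement
// Batching multiplies this by the number of concurrent sequences.
function kvCacheBytes({ layers, kvHeads, headDim, seqLen, bytesPerElement }) {
  return 2 * layers * kvHeads * headDim * seqLen * bytesPerElement;
}

// Assumed Llama-3.2-3B-like config, 4096-token context, fp16 cache (2 bytes per element).
const bytes = kvCacheBytes({ layers: 28, kvHeads: 8, headDim: 128, seqLen: 4096, bytesPerElement: 2 });
console.log((bytes / 2 ** 30).toFixed(2), "GiB"); // prints 0.44 GiB
```

That extra ~0.44 GiB sits on top of the weights, and it grows linearly with context length.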
Before you buy
Verify that your specific hardware can handle browser AI before committing money.