Running models directly in web browsers: Transformers.js, WebLLM, ONNX Runtime Web, WebGPU.
npm install @huggingface/transformers — run Whisper, embeddings, image classification, and small LLMs in-browser.

import { pipeline } from "@huggingface/transformers";
const classifier = await pipeline("sentiment-analysis");
const result = await classifier("I love local AI!");
console.log(result); // [{ label: "POSITIVE", score: 0.99 }]
Browser AI runs on the hardware you already own. Any laptop with 8+ GB RAM and a browser from 2023 or later runs 3B models at 10-30 tok/s. A Chromebook ($200-300, 8 GB RAM) runs Llama 3.2 3B competently under WebLLM. Embedding models such as Nomic Embed Text (~200 MB) run in-browser on any device, including phones. Browser AI is the ultimate "cheap" AI: the user already has the hardware, and your web app just ships the model. If your users have a browser, they have AI compute. Incremental hardware cost: $0.
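To give a sense of how lightweight in-browser embeddings are, here is a minimal sketch using the Transformers.js feature-extraction pipeline. The model ID (a quantized MiniLM variant) is an illustrative stand-in for any small ONNX embedding model; swap in Nomic Embed Text if its ONNX weights suit your use case.

import { pipeline } from "@huggingface/transformers";

// Feature-extraction pipeline for sentence embeddings.
// Model choice is an assumption for illustration; any small
// ONNX embedding model works the same way.
const extractor = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2"
);

// Mean-pool and normalize to get one vector per input string.
const output = await extractor(
  ["I love local AI!", "Browser AI runs on hardware you already own."],
  { pooling: "mean", normalize: true }
);

console.log(output.dims); // e.g. [2, 384]: two 384-dimensional embeddings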
Browser AI has no "serious hardware" tier — it runs on the user's device, not yours. For developers building browser AI apps: optimize model sizes (use ONNX quantized models, WebGPU shader optimizations), test on low-end devices (Chromebook with 4 GB RAM), and implement progressive loading. For users running browser AI: a MacBook Pro M4 Max (see /hardware/macbook-pro-16-m4-max) with 40-core GPU runs WebGPU at desktop speeds — 50-80 tok/s for 3B models. An RTX 4060 gaming laptop ($1,000) achieves similar speeds. But browser AI is deliberately lightweight — if you have a $2,000 GPU, you should run models natively, not in-browser. Browser AI is for accessibility, not maximum performance.
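One practical way to handle that hardware spread is to feature-detect WebGPU and fall back to the WASM (CPU) backend when it is absent. The sketch below uses standard browser APIs; the device option and the default sentiment model shown for Transformers.js reflect current usage, but verify both against the version you ship.

import { pipeline } from "@huggingface/transformers";

// Detect WebGPU support: navigator.gpu exists only in WebGPU-capable
// browsers, and requestAdapter() can still return null (blocklisted GPUs,
// headless contexts), so check both before committing to the GPU path.
async function pickDevice() {
  if ("gpu" in navigator) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) return "webgpu";
  }
  return "wasm"; // CPU fallback: slower, but runs everywhere
}

const device = await pickDevice();

// Model ID and the `device` option name are assumptions based on current
// Transformers.js releases; confirm against your installed version.
const classifier = await pipeline(
  "sentiment-analysis",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english",
  { device }
);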
The mistake: Building a web app that downloads a 2 GB model on every page load because the model isn't cached properly. Users on mobile data get a $10 phone bill for loading your demo.

Why it fails: Large models trigger browser download prompts and consume mobile data. On metered connections, a 2 GB model download costs money and takes 5-10 minutes on 4G. Users bounce before the model loads.

The fix: Use browser-side model caching (Cache API or IndexedDB). WebLLM and Transformers.js both cache downloaded weights, but you must make sure caching is enabled and actually working. First load: show a progress bar ("Downloading model (2 GB)... This is a one-time download, cached for future visits."). Subsequent loads: the model loads from cache in under 5 seconds. For mobile users: serve a smaller model variant (Q2_K quant, ~1 GB) or offer a "use server-side inference" fallback. Also check navigator.connection.saveData: if the user has data saver mode on, ask before downloading 2 GB. Respect your users' data plans.
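A minimal sketch of that fix, assuming current Transformers.js behavior: browser caching is on by default, the pipeline factory accepts a progress_callback, and navigator.connection is feature-detected because not every browser implements it. The env flag name and the exact callback payload shape are assumptions to verify against the version you ship.

import { pipeline, env } from "@huggingface/transformers";

// Caching is on by default in the browser, but set it explicitly so a
// misconfigured build doesn't silently re-download the model each visit.
env.useBrowserCache = true;

// Respect data saver mode; navigator.connection is not available in every
// browser, so use optional chaining before reading saveData.
const saveData = navigator.connection?.saveData ?? false;
if (saveData && !confirm("This demo downloads a large model. Continue?")) {
  throw new Error("User declined model download on a data-saver connection");
}

// progress_callback fires as each model file downloads; the payload shape
// here is an assumption, so log it once and adapt your progress bar.
const classifier = await pipeline("sentiment-analysis", null, {
  progress_callback: (p) => {
    if (p.status === "progress") {
      console.log(`Downloading ${p.file}: ${Math.round(p.progress)}%`);
    }
  },
});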
Browse all tools for runtimes that fit this workload.
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
The errors most operators hit when running browser AI locally. Each links to a diagnose+fix walkthrough.
Verify your specific hardware can handle browser AI before committing money.