The canonical taxonomy of what local AI actually does — 94 tasks across 11 modality buckets. For each task: which models do it well, what hardware they need, what runtimes they require, and what breaks in production.
Chat, reasoning, summarization, translation, extraction.
General-purpose conversational and instruction-following text generation. The default LLM workload — answering questions, writing prose, drafting emails, summarizing inputs.
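A minimal local chat sketch, assuming llama-cpp-python and an instruction-tuned GGUF checkpoint already on disk; the model path below is a placeholder:

```python
# Minimal local chat with llama-cpp-python. Point model_path at any
# instruction-tuned GGUF file you have downloaded (placeholder path here).
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct-q4_k_m.gguf",  # placeholder
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload all layers if a GPU is available
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(resp["choices"][0]["message"]["content"])
```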
Multi-step logical reasoning, mathematical problem-solving, and symbolic manipulation. Distinguished from general chat by chain-of-thought trace quality and accuracy on AIME/GSM8K-class benchmarks.
Explicit step-by-step reasoning with visible intermediate steps. Useful for transparency and debuggability in agentic workflows.
Condensing long documents into shorter summaries — extractive (pulling key sentences) or abstractive (rewriting in fewer words). Long-context capable models excel here.
Translating text between languages. Multilingual instruction-tuned models handle this competently; specialized translation models exist for very-low-resource languages.
Educational explanation, concept teaching, and Socratic guidance. Strong reasoning + patient explanation styles matter more than raw capability.
Pulling structured data (entities, dates, prices, relationships) from unstructured text. Strong instruction-following + JSON-mode capability matters.
Generating reliably-formatted JSON, XML, YAML, or schema-constrained output. Grammar-constrained generation libraries (Outlines, Guidance, llama.cpp grammars) are the canonical solution.
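As a sketch of grammar-constrained decoding, here is the Outlines 0.x JSON-schema interface driving a small local model; the model ID and schema are illustrative, and the interface has shifted in newer Outlines releases:

```python
# Schema-constrained generation with Outlines (0.x-style API).
# The Pydantic model defines the JSON schema the decoder must satisfy.
from pydantic import BaseModel
import outlines

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str

# Placeholder model ID; any instruction-tuned HF causal LM works.
model = outlines.models.transformers("Qwen/Qwen2.5-1.5B-Instruct")
generator = outlines.generate.json(model, Invoice)

invoice = generator("Extract the invoice fields: ACME Corp billed $1,200.00 USD.")
print(invoice.vendor, invoice.total, invoice.currency)  # parsed, schema-valid object
```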
Long-form character roleplay, creative fiction, and persona-driven dialogue. Specialized fine-tunes (uncensored, character-tuned) dominate this space.
Contract review, case-law analysis, regulatory interpretation. Privacy + on-prem deployment is the wedge — legal data can't leave the firm. Long-context handling is critical.
Clinical note review, medical literature search, treatment-recommendation drafting. HIPAA + privacy = local deployment is non-negotiable. Specialized medical-tuned models exist.
Earnings transcript analysis, SEC filing review, sentiment from financial news. Compliance + sensitivity = local deployment for many workflows.
OCR, classification, detection, document understanding, VQA.
Assigning labels to images — single-label or multi-label. Foundational vision task; modern multimodal LLMs handle this competently in addition to specialized classifiers.
Extracting text from images, PDFs, screenshots, and handwritten documents. Modern multimodal LLMs (Qwen2.5-VL, InternVL, GPT-4V) increasingly outperform specialized OCR engines on complex layouts.
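A hedged sketch of VLM-as-OCR, assuming a Qwen2.5-VL-class model served behind a local OpenAI-compatible endpoint (for example via vLLM or llama.cpp); the base URL, model name, and file path are placeholders:

```python
# Send a page image to a locally served vision-language model and ask for a
# transcription. Endpoint, model name, and image path are all placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("scan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen2.5-vl-7b-instruct",  # placeholder served-model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe all text in this image, preserving layout."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```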
Extracting data from charts, graphs, plots, and infographics. Specialized capability for vision-language models — distinct from raw OCR.
Understanding software UI from screenshots — identifying buttons, fields, widgets, layout. Foundation for browser agents and computer-use AI.
General screenshot understanding for productivity workflows — code screenshots, terminal output, error messages, document screenshots.
Locating and labeling specific objects in images with bounding boxes. Specialized detection models (YOLO family, DETR) dominate, though VLMs increasingly handle simple detection via prompting.
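A minimal detection pass with the Ultralytics YOLO API; pretrained weights download on first run, and the image path is a placeholder:

```python
# One-image object detection with Ultralytics YOLO.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # small pretrained checkpoint
results = model("street_scene.jpg")   # placeholder image path

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    conf = float(box.conf)
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{cls_name} {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```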
Pixel-level region labeling — semantic, instance, or panoptic segmentation. Specialized models (SAM family, Mask2Former) dominate. Critical for medical imaging, robotics, content creation.
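A sketch of automatic mask generation with the original SAM package, assuming the ViT-B checkpoint has been downloaded separately; the image path is a placeholder:

```python
# Automatic mask generation with Segment Anything (SAM).
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # downloaded weights
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # HWC uint8 RGB
masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'area', 'bbox', ...
print(f"{len(masks)} masks; largest covers {max(m['area'] for m in masks)} pixels")
```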
Answering natural-language questions about image contents. Modern VLMs make this accessible — Qwen2.5-VL, InternVL, LLaVA all credible.
Parsing complex document layouts — tables, multi-column text, footnotes, equations. Combines OCR + structure understanding + reasoning.
Retrieval-augmented generation over document images directly — no OCR pre-processing step. ColPali / ColQwen-style models embed page images for retrieval.
Text-to-image, editing, inpainting, anime, photorealistic, posters.
Generating images from text prompts. The canonical creative AI workload — Flux, SDXL, Stable Diffusion 3.5, Playground v3 lead the open-weight tier.
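A minimal diffusers sketch using SDXL as a stand-in for whichever open-weight checkpoint you run; it assumes a CUDA GPU with enough VRAM for fp16 inference:

```python
# Text-to-image with diffusers and SDXL (swap in another checkpoint as needed).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    prompt="studio product shot of a ceramic teapot, soft light, 85mm",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("teapot.png")
```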
Modifying existing images via prompts or masks. Distinct from generation — Flux Fill handles mask-based edits, while ControlNet and IP-Adapter add structural and reference-image control.
Filling in masked regions of an image based on context + optional prompts. Essential for object removal, background replacement, content-aware fills.
Extending images beyond their original borders. Useful for aspect ratio changes, scene expansion, panoramic creation.
Anime, manga, illustration-style image generation. Specialized fine-tunes (Pony Diffusion, NoobAI, Illustrious-XL) dominate this niche.
Photorealistic portraits, landscapes, product shots. Flux Dev/Schnell, Stable Diffusion 3.5 Large, Playground v3 are the open-weight leaders.
Generating data visualizations and infographics from prompts. Combines text rendering + diagrammatic layout — challenging task where models still struggle.
Marketing posters, social media graphics, promotional images. Text rendering quality is the differentiator — Flux family excels.
Sequential image generation for film storyboards, comic panels. Consistency across frames is the hard problem.
Multi-panel comic and manga generation with character consistency, panel composition, speech bubbles.
Logo design generation. Specialized models and strong text rendering matter; vector output typically comes from raster-to-SVG post-processing.
Text-to-video, image-to-video, understanding, avatars, animation.
Generating short video clips from text prompts. Wan 2.1, HunyuanVideo, LTX-Video lead the open-weight tier in 2026.
Animating still images into short video clips. Stable Video Diffusion, Wan, CogVideoX-I2V are open-weight options.
Comprehending video content — captioning, Q&A, action recognition. Multimodal video LLMs like Qwen2.5-VL handle this.
Film-quality cinematic generation with camera moves, lighting, narrative consistency. Open-weight is closing the gap with Sora/Veo but not there yet.
Applying motion from a source video to a target subject — pose-driven dance generation, lip-sync, gesture transfer.
Generating intermediate frames between sparse keyframes — slow-mo, smooth animation, frame-rate upscaling.
Talking-head avatar video generation from audio + reference image. SadTalker and Hallo are open-weight options.
Animated character motion, 2D/3D animation, looping animations. AnimateDiff family + dedicated animation models.
STT, TTS, voice cloning, music generation, dubbing, diarization.
Transcribing spoken audio into text. Whisper family is the open-weight default; faster-whisper + WhisperX deliver production-grade speed.
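A minimal transcription sketch with faster-whisper; the audio path is a placeholder, and the small model runs acceptably on CPU with int8 compute:

```python
# Local speech-to-text with faster-whisper (CTranslate2 backend).
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("meeting.wav", beam_size=5)  # placeholder path

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:6.1f}s -> {seg.end:6.1f}s] {seg.text}")
```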
Generating natural-sounding speech from text. F5-TTS, XTTS-v2, Kokoro, Sesame CSM-1B lead open-weight TTS in 2026.
Replicating a specific voice from a few seconds of reference audio. F5-TTS and XTTS-v2 are zero-shot voice cloning leaders.
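A zero-shot cloning sketch using XTTS-v2 through the Coqui TTS package; the reference clip and output path are placeholders, and a few seconds of clean speech is enough:

```python
# Zero-shot voice cloning with XTTS-v2 via the TTS library.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="This sentence will be spoken in the reference speaker's voice.",
    speaker_wav="reference_voice.wav",  # placeholder: short clip of the target speaker
    language="en",
    file_path="cloned_output.wav",
)
```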
Generating music from text prompts or melody references. MusicGen, Stable Audio Open, and Suno-style open-weight models cover this space.
Translating audio across languages while preserving speaker voice. Combines STT → translation → cloned-voice TTS.
Removing noise, restoring clarity, enhancing low-quality recordings. Specialized models (FRCRN, DeepFilterNet) excel here.
Identifying who-spoke-when in multi-speaker audio. PyAnnote is the open-weight default.
Detecting speech vs silence in audio streams. Silero VAD is the open-weight default — small, fast, accurate.
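A minimal Silero VAD sketch loaded via torch.hub; the wav path is a placeholder, and the model expects 16 kHz mono audio:

```python
# Speech/silence detection with Silero VAD.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("call_recording.wav", sampling_rate=16000)  # placeholder path
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

for seg in speech:
    print(f"speech from sample {seg['start']} to {seg['end']}")
```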
AI-generated podcast-style audio from text scripts or document inputs. NotebookLM-clone workflows combine TTS + dialogue generation.
Text-to-3D, image-to-3D, mesh, texture, Gaussian splatting.
Generating 3D models from text prompts. Hunyuan3D-2, TRELLIS, Stable Fast 3D lead open-weight in 2026.
Reconstructing 3D models from single or multiple images. TripoSR, Stable Fast 3D, Hunyuan3D-2 + multi-view diffusion approaches.
3D scene representation via Gaussian splats. Real-time rendering of photorealistic scenes — VFX, robotics, AR/VR applications.
Generating clean polygon meshes — distinct from point clouds or SDFs. Production 3D pipelines need clean topology.
Generating textures and materials for 3D models. PBR-aware texture synthesis is the production target.
AI-assisted CAD modeling — parametric design, constraint solving, design suggestions. Specialized fine-tunes for engineering workflows.
Volumetric scene representation from posed images. Largely superseded by Gaussian Splatting for real-time use but still relevant for research.
Code generation, repo chat, debugging, code review, agentic coding.
Generating code from natural language prompts. Qwen 2.5 Coder, DeepSeek Coder V3, Codestral are open-weight leaders.
Conversational interface over an entire codebase. Combines RAG over code + long-context model + tool-use for navigation.
AI-assisted bug diagnosis and fix generation. Reasoning + code understanding + execution tool-use combine here.
Reviewing PRs/MRs for bugs, style, security. Tool-use + repo-context awareness drive quality.
Multi-step autonomous coding agents that read repos, edit files, run tests. Aider, Cline, OpenHands are open-weight tooling leaders.
Natural-language → shell command translation, terminal-based AI workflows. tldr-clones, Warp AI, ShellGPT.
AI-driven Terraform, Ansible, Kubernetes manifest generation. Specialized + tool-use heavy.
Embeddings, reranking, retrieval, semantic search, agent memory.
Generating dense vector representations of text for similarity search and retrieval. BGE-M3 is the canonical multilingual open-weight choice in 2026.
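A minimal embedding-plus-similarity sketch with BGE-M3 through sentence-transformers; the documents and query are illustrative:

```python
# Dense embeddings with BGE-M3. Unit-normalized vectors let a plain dot
# product serve as cosine similarity.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
docs = [
    "The invoice is due on March 31.",
    "Payment terms are net 30 from the invoice date.",
    "The cafeteria menu changes weekly.",
]
embeddings = model.encode(docs, normalize_embeddings=True)
query = model.encode("When does the invoice need to be paid?", normalize_embeddings=True)

scores = embeddings @ query  # cosine similarity via dot product
print(scores)
```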
Cross-encoder reranking of retrieved documents for relevance. BGE Reranker v2 M3 leads open-weight; Cohere Rerank is the hosted point of comparison.
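A reranking sketch using the sentence-transformers CrossEncoder wrapper around BGE Reranker v2 M3 (FlagEmbedding's FlagReranker is the other common path); the query and passages are illustrative:

```python
# Rerank first-stage hits with a cross-encoder: score (query, passage) pairs jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
query = "When does the invoice need to be paid?"
candidates = [
    "The invoice is due on March 31.",
    "Payment terms are net 30 from the invoice date.",
    "The cafeteria menu changes weekly.",
]
scores = reranker.predict([(query, passage) for passage in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for passage, score in ranked:
    print(f"{score:7.3f}  {passage}")
```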
First-stage retrieval over document corpora — dense, sparse (BM25), or hybrid. Foundation for all RAG pipelines.
Search by meaning rather than keyword match. Powered by embedding models + vector databases.
Search across enterprise data sources (Confluence, Slack, Drive, internal docs) with permissions awareness. Self-hosted is the privacy wedge.
RAG over sensitive documents (legal, medical, financial, personal) where data must not leave the local environment. The local-first wedge.
Long-term memory for AI agents — summarization-based, vector-based, graph-based. Mem0, Letta, and Zep are the open-source tooling options.
Browser, coding, workflow, computer-use, autonomous, multi-agent systems.
AI agents that navigate and interact with web browsers. Browser-use, Playwright-based agents, BrowserBase pattern.
Multi-step autonomous coding agents. Aider, Cline, OpenHands, Continue.dev, Claude Code.
Agents that orchestrate multi-step business workflows — n8n + AI, Zapier AI, custom orchestration.
Agents that operate desktop applications via screenshot + mouse/keyboard. Anthropic Computer Use API, OS-Atlas, ShowUI.
Long-horizon planning agents that pursue goals over extended timeframes. AutoGPT-lineage + research-grade autonomy frameworks.
Coordinated multi-agent workflows — manager+worker, debate, swarm patterns. CrewAI, AutoGen, Swarm.
iPhone, Android, browser, WebGPU, TinyML, Jetson, Raspberry Pi.
On-device AI on iPhone. Apple Intelligence (A18 Pro+), MLC-LLM iOS apps, third-party MLX-on-iOS deployment.
On-device AI on Android. Google Gemini Nano, Samsung Galaxy AI, OEM-specific NPU acceleration.
Running models directly in web browsers. Transformers.js, web-llm, ONNX Runtime Web, WebGPU.
WebGPU-accelerated inference in browsers. Massive privacy + zero-install wedge for consumer AI apps.
AI on microcontrollers — Arduino, ESP32, Raspberry Pi Pico. Sub-100 KB models for sensors and embedded systems.
AI at the edge — IoT cameras, industrial sensors, retail kiosks. Mid-tier between mobile SoCs and datacenter.
AI on NVIDIA Jetson Nano/Xavier/Orin. CUDA on edge — robotics, drones, industrial computer vision.
Google Coral Edge TPU for low-power on-device inference. Requires fully INT8-quantized TensorFlow Lite models compiled for the Edge TPU.
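A hedged Edge TPU inference sketch with tflite_runtime, assuming an Edge TPU-compiled model file and the libedgetpu runtime are installed; the model path and dummy input are placeholders:

```python
# Run an Edge TPU-compiled TFLite model via the libedgetpu delegate.
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path="mobilenet_v2_edgetpu.tflite",  # placeholder, must be Edge TPU-compiled
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],  # Linux delegate name
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Dummy input matching the model's expected shape and dtype (e.g. [1, 224, 224, 3] uint8).
dummy = np.zeros(input_details["shape"], dtype=input_details["dtype"])
interpreter.set_tensor(input_details["index"], dummy)
interpreter.invoke()
scores = interpreter.get_tensor(output_details["index"])
print("top class:", int(np.argmax(scores)))
```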
AI on Raspberry Pi (4, 5, 500). CPU-only inference + small models, often with Coral or Hailo accelerators.
Theorem proving, chemistry, biology, robotics, RL, forecasting.
AI-assisted formal theorem proving in Lean, Coq, Isabelle. DeepSeek-Prover, Lean Copilot, AlphaProof-lineage.
Multi-step scientific reasoning across physics, chemistry, biology. GPQA + ScienceQA benchmark this. Frontier reasoning models lead.
AlphaFold-lineage protein structure prediction, molecular design, drug discovery. Specialized scientific models.
Genomic sequence analysis, protein design, single-cell analysis. ESM, RoseTTAFold, scGPT.
Vision-language-action models for robotics. RT-2, Open X-Embodiment, RDT-1B, Pi0.
Graph neural networks for molecular property prediction, social networks, knowledge graphs.
RL for game-playing, robotics, alignment. PPO, DPO, post-training RLHF + RLAIF.
Foundation models for time-series — TimeGPT, Chronos, Lag-Llama. Generic forecasting across domains.
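A zero-shot forecasting sketch with Chronos; the toy series stands in for real data, and predict() returns sample paths that are summarized into a median forecast:

```python
# Zero-shot time-series forecasting with Chronos (amazon-science/chronos-forecasting).
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Toy history; replace with a real 1-D series.
history = torch.tensor([112., 118., 132., 129., 121., 135., 148., 148., 136., 119.])
forecast = pipeline.predict(context=history, prediction_length=6)  # [series, samples, horizon]
median = forecast.quantile(0.5, dim=1)  # median across sample paths
print(median)
```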