The canonical taxonomy of what local AI actually does — 94 tasks across 11 modality buckets. For each task: which models do it well, what hardware they need, what runtimes they require, and what breaks in production.
Chat, reasoning, summarization, translation, extraction.
General-purpose conversational and instruction-following text generation. The default LLM workload — answering questions, writing prose, drafting emails, summarizing inputs.
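A minimal local chat sketch, assuming llama-cpp-python and an instruction-tuned GGUF checkpoint already on disk; the model path below is a placeholder:

```python
# Minimal local chat with llama-cpp-python. Point model_path at any
# instruction-tuned GGUF file you have downloaded (placeholder path here).
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct-q4_k_m.gguf",  # placeholder
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload all layers if a GPU is available
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(resp["choices"][0]["message"]["content"])
```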
Multi-step logical reasoning, mathematical problem-solving, and symbolic manipulation. Distinguished from general chat by chain-of-thought trace quality and accuracy on AIME/GSM8K-class benchmarks.
Explicit step-by-step reasoning with visible intermediate steps. Useful for transparency and debuggability in agentic workflows.
Condensing long documents into shorter summaries — extractive (pulling key sentences) or abstractive (rewriting in fewer words). Long-context capable models excel here.
Translating text between languages. Multilingual instruction-tuned models handle this competently; specialized translation models exist for very-low-resource languages.
Educational explanation, concept teaching, and Socratic guidance. Strong reasoning + patient explanation styles matter more than raw capability.
Pulling structured data (entities, dates, prices, relationships) from unstructured text. Strong instruction-following + JSON-mode capability matters.
Generating reliably-formatted JSON, XML, YAML, or schema-constrained output. Grammar-constrained generation libraries (Outlines, Guidance, llama.cpp grammars) are the canonical solution.
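As a sketch of grammar-constrained decoding, here is the Outlines 0.x JSON-schema interface driving a small local model; the model ID and schema are illustrative, and the interface has shifted in newer Outlines releases:

```python
# Schema-constrained generation with Outlines (0.x-style API).
# The Pydantic model defines the JSON schema the decoder must satisfy.
from pydantic import BaseModel
import outlines

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str

# Placeholder model ID; any instruction-tuned HF causal LM works.
model = outlines.models.transformers("Qwen/Qwen2.5-1.5B-Instruct")
generator = outlines.generate.json(model, Invoice)

invoice = generator("Extract the invoice fields: ACME Corp billed $1,200.00 USD.")
print(invoice.vendor, invoice.total, invoice.currency)  # parsed, schema-valid object
```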
Long-form character roleplay, creative fiction, and persona-driven dialogue. Specialized fine-tunes (uncensored, character-tuned) dominate this space.
Contract review, case-law analysis, regulatory interpretation. Privacy + on-prem deployment is the wedge — legal data can't leave the firm. Long-context handling is critical.
Clinical note review, medical literature search, treatment-recommendation drafting. HIPAA + privacy = local deployment is non-negotiable. Specialized medical-tuned models exist.
Earnings transcript analysis, SEC filing review, sentiment from financial news. Compliance + sensitivity = local deployment for many workflows.
OCR, classification, detection, document understanding, VQA.
Assigning labels to images — single-label or multi-label. Foundational vision task; modern multimodal LLMs handle this competently in addition to specialized classifiers.
Extracting text from images, PDFs, screenshots, and handwritten documents. Modern multimodal LLMs (Qwen2.5-VL, InternVL, GPT-4V) increasingly outperform specialized OCR engines on complex layouts.
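A hedged sketch of VLM-as-OCR, assuming a Qwen2.5-VL-class model served behind a local OpenAI-compatible endpoint (for example via vLLM or llama.cpp); the base URL, model name, and file path are placeholders:

```python
# Send a page image to a locally served vision-language model and ask for a
# transcription. Endpoint, model name, and image path are all placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("scan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen2.5-vl-7b-instruct",  # placeholder served-model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe all text in this image, preserving layout."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```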
Extracting data from charts, graphs, plots, and infographics. Specialized capability for vision-language models — distinct from raw OCR.
Understanding software UI from screenshots — identifying buttons, fields, widgets, layout. Foundation for browser agents and computer-use AI.
General screenshot understanding for productivity workflows — code screenshots, terminal output, error messages, document screenshots.
Locating and labeling specific objects in images with bounding boxes. Specialized detection models (YOLO family, DETR) dominate, though VLMs increasingly handle simple detection via prompting.
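A minimal detection pass with the Ultralytics YOLO API; pretrained weights download on first run, and the image path is a placeholder:

```python
# One-image object detection with Ultralytics YOLO.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # small pretrained checkpoint
results = model("street_scene.jpg")   # placeholder image path

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    conf = float(box.conf)
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{cls_name} {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```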
Pixel-level region labeling — semantic, instance, or panoptic segmentation. Specialized models (SAM family, Mask2Former) dominate. Critical for medical imaging, robotics, content creation.
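A sketch of automatic mask generation with the original SAM package, assuming the ViT-B checkpoint has been downloaded separately; the image path is a placeholder:

```python
# Automatic mask generation with Segment Anything (SAM).
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # downloaded weights
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # HWC uint8 RGB
masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'area', 'bbox', ...
print(f"{len(masks)} masks; largest covers {max(m['area'] for m in masks)} pixels")
```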
Answering natural-language questions about image contents. Modern VLMs make this accessible — Qwen2.5-VL, InternVL, LLaVA all credible.
Parsing complex document layouts — tables, multi-column text, footnotes, equations. Combines OCR + structure understanding + reasoning.
Retrieval-augmented generation over document images directly — no OCR pre-processing step. ColPali / ColQwen-style models embed page images for retrieval.
Text-to-image, editing, inpainting, anime, photorealistic, posters.
Generating images from text prompts. The canonical creative AI workload — Flux, SDXL, Stable Diffusion 3.5, Playground v3 lead the open-weight tier.
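A minimal diffusers sketch using SDXL as a stand-in for whichever open-weight checkpoint you run; it assumes a CUDA GPU with enough VRAM for fp16 inference:

```python
# Text-to-image with diffusers and SDXL (swap in another checkpoint as needed).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    prompt="studio product shot of a ceramic teapot, soft light, 85mm",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("teapot.png")
```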
Modifying existing images via prompts or masks. Distinct from generation — Flux Fill handles mask-based edits, while ControlNet and IP-Adapter add structural and reference-image control.
Filling in masked regions of an image based on context + optional prompts. Essential for object removal, background replacement, content-aware fills.
Extending images beyond their original borders. Useful for aspect ratio changes, scene expansion, panoramic creation.
Anime, manga, illustration-style image generation. Specialized fine-tunes (Pony Diffusion, NoobAI, Illustrious-XL) dominate this niche.
Photorealistic portraits, landscapes, product shots. Flux Dev/Schnell, Stable Diffusion 3.5 Large, Playground v3 are the open-weight leaders.
Generating data visualizations and infographics from prompts. Combines text rendering + diagrammatic layout — challenging task where models still struggle.
Marketing posters, social media graphics, promotional images. Text rendering quality is the differentiator — Flux family excels.
Sequential image generation for film storyboards, comic panels. Consistency across frames is the hard problem.
Multi-panel comic and manga generation with character consistency, panel composition, speech bubbles.
Logo design generation. Specialized models and strong text rendering matter; vector output typically comes from raster-to-SVG post-processing.
Text-to-video, image-to-video, understanding, avatars, animation.
Generating short video clips from text prompts. Wan 2.1, HunyuanVideo, LTX-Video lead the open-weight tier in 2026.
Animating still images into short video clips. Stable Video Diffusion, Wan, CogVideoX-I2V are open-weight options.
Comprehending video content — captioning, Q&A, action recognition. Multimodal video LLMs like Qwen2.5-VL handle this.
Film-quality cinematic generation with camera moves, lighting, narrative consistency. Open-weight is closing the gap with Sora/Veo but not there yet.
Applying motion from a source video to a target subject — pose-driven dance generation, lip-sync, gesture transfer.
Generating intermediate frames between sparse keyframes — slow-mo, smooth animation, frame-rate upscaling.
Talking-head avatar video generation from audio + reference image. SadTalker and Hallo are open-weight options.
Animated character motion, 2D/3D animation, looping animations. AnimateDiff family + dedicated animation models.
STT, TTS, voice cloning, music generation, dubbing, diarization.
Transcribing spoken audio into text. Whisper family is the open-weight default; faster-whisper + WhisperX deliver production-grade speed.
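A minimal transcription sketch with faster-whisper; the audio path is a placeholder, and the small model runs acceptably on CPU with int8 compute:

```python
# Local speech-to-text with faster-whisper (CTranslate2 backend).
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("meeting.wav", beam_size=5)  # placeholder path

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:6.1f}s -> {seg.end:6.1f}s] {seg.text}")
```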
Generating natural-sounding speech from text. F5-TTS, XTTS-v2, Kokoro, Sesame CSM-1B lead open-weight TTS in 2026.
Replicating a specific voice from a few seconds of reference audio. F5-TTS and XTTS-v2 are zero-shot voice cloning leaders.
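A zero-shot cloning sketch using XTTS-v2 through the Coqui TTS package; the reference clip and output path are placeholders, and a few seconds of clean speech is enough:

```python
# Zero-shot voice cloning with XTTS-v2 via the TTS library.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="This sentence will be spoken in the reference speaker's voice.",
    speaker_wav="reference_voice.wav",  # placeholder: short clip of the target speaker
    language="en",
    file_path="cloned_output.wav",
)
```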
Generating music from text prompts or melody references. MusicGen, Stable Audio Open, and Suno-style open-weight models cover this space.
Translating audio across languages while preserving speaker voice. Combines STT → translation → cloned-voice TTS.
Removing noise, restoring clarity, enhancing low-quality recordings. Specialized models (FRCRN, DeepFilterNet) excel here.
Identifying who-spoke-when in multi-speaker audio. PyAnnote is the open-weight default.
Detecting speech vs silence in audio streams. Silero VAD is the open-weight default — small, fast, accurate.
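A minimal Silero VAD sketch loaded via torch.hub; the wav path is a placeholder, and the model expects 16 kHz mono audio:

```python
# Speech/silence detection with Silero VAD.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("call_recording.wav", sampling_rate=16000)  # placeholder path
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

for seg in speech:
    print(f"speech from sample {seg['start']} to {seg['end']}")
```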
AI-generated podcast-style audio from text scripts or document inputs. NotebookLM-clone workflows combine TTS + dialogue generation.
Text-to-3D, image-to-3D, mesh, texture, Gaussian splatting.
Generating 3D models from text prompts. Hunyuan3D-2, TRELLIS, Stable Fast 3D lead open-weight in 2026.
Reconstructing 3D models from single or multiple images. TripoSR, Stable Fast 3D, Hunyuan3D-2 + multi-view diffusion approaches.
3D scene representation via Gaussian splats. Real-time rendering of photorealistic scenes — VFX, robotics, AR/VR applications.
Generating clean polygon meshes — distinct from point clouds or SDFs. Production 3D pipelines need clean topology.
Generating textures and materials for 3D models. PBR-aware texture synthesis is the production target.
AI-assisted CAD modeling — parametric design, constraint solving, design suggestions. Specialized fine-tunes for engineering workflows.
Volumetric scene representation from posed images. Largely superseded by Gaussian Splatting for real-time use but still relevant for research.
Code generation, repo chat, debugging, code review, agentic coding.
Generating code from natural language prompts. Qwen 2.5 Coder, DeepSeek Coder V3, Codestral are open-weight leaders.
Conversational interface over an entire codebase. Combines RAG over code + long-context model + tool-use for navigation.
AI-assisted bug diagnosis and fix generation. Reasoning + code understanding + execution tool-use combine here.
Reviewing PRs/MRs for bugs, style, security. Tool-use + repo-context awareness drive quality.
Multi-step autonomous coding agents that read repos, edit files, run tests. Aider, Cline, OpenHands are open-weight tooling leaders.
Natural-language → shell command translation, terminal-based AI workflows. tldr-clones, Warp AI, ShellGPT.
AI-driven Terraform, Ansible, Kubernetes manifest generation. Specialized + tool-use heavy.
Embeddings, reranking, retrieval, semantic search, agent memory.
Generating dense vector representations of text for similarity search and retrieval. BGE-M3 is the canonical multilingual open-weight choice in 2026.
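A minimal embedding-plus-similarity sketch with BGE-M3 through sentence-transformers; the documents and query are illustrative:

```python
# Dense embeddings with BGE-M3. Unit-normalized vectors let a plain dot
# product serve as cosine similarity.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
docs = [
    "The invoice is due on March 31.",
    "Payment terms are net 30 from the invoice date.",
    "The cafeteria menu changes weekly.",
]
embeddings = model.encode(docs, normalize_embeddings=True)
query = model.encode("When does the invoice need to be paid?", normalize_embeddings=True)

scores = embeddings @ query  # cosine similarity via dot product
print(scores)
```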
Cross-encoder reranking of retrieved documents for relevance. BGE Reranker v2 M3 leads open-weight; Cohere Rerank is the hosted point of comparison.
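A reranking sketch using the sentence-transformers CrossEncoder wrapper around BGE Reranker v2 M3 (FlagEmbedding's FlagReranker is the other common path); the query and passages are illustrative:

```python
# Rerank first-stage hits with a cross-encoder: score (query, passage) pairs jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
query = "When does the invoice need to be paid?"
candidates = [
    "The invoice is due on March 31.",
    "Payment terms are net 30 from the invoice date.",
    "The cafeteria menu changes weekly.",
]
scores = reranker.predict([(query, passage) for passage in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for passage, score in ranked:
    print(f"{score:7.3f}  {passage}")
```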
First-stage retrieval over document corpora — dense, sparse (BM25), or hybrid. Foundation for all RAG pipelines.
Search by meaning rather than keyword match. Powered by embedding models + vector databases.
Search across enterprise data sources (Confluence, Slack, Drive, internal docs) with permissions awareness. Self-hosted is the privacy wedge.
RAG over sensitive documents (legal, medical, financial, personal) where data must not leave the local environment. The local-first wedge.
Long-term memory for AI agents — summarization-based, vector-based, graph-based. Mem0, Letta, and Zep are the open-source tooling options.
Browser, coding, workflow, computer-use, autonomous, multi-agent systems.
AI agents that navigate and interact with web browsers. Browser-use, Playwright-based agents, BrowserBase pattern.
Multi-step autonomous coding agents. Aider, Cline, OpenHands, Continue.dev, Claude Code.
Agents that orchestrate multi-step business workflows — n8n + AI, Zapier AI, custom orchestration.
Agents that operate desktop applications via screenshot + mouse/keyboard. Anthropic Computer Use API, OS-Atlas, ShowUI.
Long-horizon planning agents that pursue goals over extended timeframes. AutoGPT-lineage + research-grade autonomy frameworks.
Coordinated multi-agent workflows — manager+worker, debate, swarm patterns. CrewAI, AutoGen, Swarm.
iPhone, Android, browser, WebGPU, TinyML, Jetson, Raspberry Pi.
On-device AI on iPhone. Apple Intelligence (A18 Pro+), MLC-LLM iOS apps, third-party MLX-on-iOS deployment.
On-device AI on Android. Google Gemini Nano, Samsung Galaxy AI, OEM-specific NPU acceleration.
Running models directly in web browsers. Transformers.js, web-llm, ONNX Runtime Web, WebGPU.
WebGPU-accelerated inference in browsers. Massive privacy + zero-install wedge for consumer AI apps.
AI on microcontrollers — Arduino, ESP32, Raspberry Pi Pico. Sub-100 KB models for sensors and embedded systems.
AI at the edge — IoT cameras, industrial sensors, retail kiosks. Mid-tier between mobile SoCs and datacenter.
AI on NVIDIA Jetson Nano/Xavier/Orin. CUDA on edge — robotics, drones, industrial computer vision.
Google Coral Edge TPU for low-power on-device inference. Requires fully INT8-quantized TensorFlow Lite models compiled for the Edge TPU.
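A hedged Edge TPU inference sketch with tflite_runtime, assuming an Edge TPU-compiled model file and the libedgetpu runtime are installed; the model path and dummy input are placeholders:

```python
# Run an Edge TPU-compiled TFLite model via the libedgetpu delegate.
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path="mobilenet_v2_edgetpu.tflite",  # placeholder, must be Edge TPU-compiled
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],  # Linux delegate name
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Dummy input matching the model's expected shape and dtype (e.g. [1, 224, 224, 3] uint8).
dummy = np.zeros(input_details["shape"], dtype=input_details["dtype"])
interpreter.set_tensor(input_details["index"], dummy)
interpreter.invoke()
scores = interpreter.get_tensor(output_details["index"])
print("top class:", int(np.argmax(scores)))
```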
AI on Raspberry Pi (4, 5, 500). CPU-only inference + small models, often with Coral or Hailo accelerators.
Theorem proving, chemistry, biology, robotics, RL, forecasting.
AI-assisted formal theorem proving in Lean, Coq, Isabelle. DeepSeek-Prover, Lean Copilot, AlphaProof-lineage.
Multi-step scientific reasoning across physics, chemistry, biology. GPQA + ScienceQA benchmark this. Frontier reasoning models lead.
AlphaFold-lineage protein structure prediction, molecular design, drug discovery. Specialized scientific models.
Genomic sequence analysis, protein design, single-cell analysis. ESM, RoseTTAFold, scGPT.
Vision-language-action models for robotics. RT-2, Open X-Embodiment, RDT-1B, Pi0.
Graph neural networks for molecular property prediction, social networks, knowledge graphs.
RL for game-playing, robotics, alignment. PPO, DPO, post-training RLHF + RLAIF.
Foundation models for time-series — TimeGPT, Chronos, Lag-Llama. Generic forecasting across domains.
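A zero-shot forecasting sketch with Chronos; the toy series stands in for real data, and predict() returns sample paths that are summarized into a median forecast:

```python
# Zero-shot time-series forecasting with Chronos (amazon-science/chronos-forecasting).
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Toy history; replace with a real 1-D series.
history = torch.tensor([112., 118., 132., 129., 121., 135., 148., 148., 136., 119.])
forecast = pipeline.predict(context=history, prediction_length=6)  # [series, samples, horizon]
median = forecast.quantile(0.5, dim=1)  # median across sample paths
print(median)
```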