RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Generative AI

Generative Model

A generative model is a machine learning model that learns the underlying distribution of its training data and can then produce new samples resembling that data. In local AI, generative models such as large language models (LLMs) and diffusion models generate text, images, or audio from prompts or latent vectors. They differ from discriminative models, which classify or label inputs rather than produce new data. For operators, the practical consequence is that generative models need substantial VRAM and compute for inference, and the requirement grows with model size (e.g., 70B parameters).
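The "learn a distribution, then sample from it" idea can be sketched with a toy character-level bigram model. This is an illustration only, not how an LLM is implemented; the corpus and the function names `fit_bigram` and `sample` are made up for the example.

```python
import random
from collections import Counter, defaultdict

def fit_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def sample(counts, start, length, seed=0):
    """Generate text one character at a time from the learned counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:  # no observed continuation: stop early
            break
        chars = list(followers.keys())
        weights = list(followers.values())
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)

# Hypothetical toy corpus; a real model trains on vastly more data.
corpus = "the cat sat on the mat. the rat sat on the cat."
model = fit_bigram(corpus)
print(sample(model, "t", 40))
```

A discriminative model trained on the same data could only assign a label to an input; it has no mechanism for producing a new string like the one printed here.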

Deeper dive

Generative models capture the joint probability distribution P(X, Y) or the data distribution P(X), which is what lets them create new instances. Common types include autoregressive models (e.g., GPT, Llama), which predict the next token sequentially; variational autoencoders (VAEs); generative adversarial networks (GANs); and diffusion models (e.g., Stable Diffusion). In local deployment, autoregressive LLMs dominate text generation, while diffusion models are popular for image generation. Model size and quantization together determine VRAM usage and inference speed. For example, a 7B parameter model at 4-bit quantization uses ~4 GB VRAM, while a 70B model at the same quantization uses ~40 GB, which dictates the hardware required.
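The arithmetic behind those figures can be sketched as parameters times bits per weight. This is a rough estimate for the weights alone, and `weight_memory_gb` is a hypothetical helper name; it ignores the KV cache and runtime buffers.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for model weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: {weight_memory_gb(n_params, 4):.1f} GB at 4 bits, "
          f"{weight_memory_gb(n_params, 16):.1f} GB at 16 bits")
```

The 4-bit results (3.5 GB and 35 GB) sit just under the ~4 GB and ~40 GB quoted above. The gap is plausibly overhead: practical 4-bit formats store somewhat more than 4 bits per weight on average, and the runtime needs room for the KV cache and buffers.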

Practical example

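An operator running Llama 3.1 8B on an RTX 4090 (24 GB VRAM) can generate text at ~50 tokens/sec using Q4_K_M quantization. The same model also fits entirely in the 12 GB of an RTX 3060, since the Q4_K_M weights take roughly 5 GB, but it runs slower because that card has lower memory bandwidth and compute. Partial offloading to system RAM only becomes necessary with larger models or very long contexts. For image generation, Stable Diffusion XL requires ~8 GB VRAM; an RTX 3060 can run it but may slow down or run out of memory on outputs well above the model's native 1024×1024 resolution.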

Workflow example

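With Ollama, the command ollama run llama3.1:8b downloads the model if needed and loads it for generation. The runtime allocates VRAM for the weights and the KV cache. If VRAM is insufficient, Ollama keeps some layers in system RAM and runs them on the CPU, which can cut throughput from ~40 to ~5 tokens/sec. To confirm the model fits, operators check VRAM usage with nvidia-smi, or run ollama ps, whose PROCESSOR column shows how the loaded model is split between CPU and GPU.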
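The fit check can also be scripted. The sketch below parses the CSV output of nvidia-smi's memory query and compares free VRAM against a required amount. The helper names, the 512 MiB headroom, and the sample readings are assumptions made for illustration; `query_gpus` is defined but not called, so the example runs on machines without an NVIDIA GPU.

```python
import subprocess

# Query used and total memory per GPU, in MiB, without header or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total,
                     "free_mib": total - used})
    return gpus

def fits(required_mib, gpu, headroom_mib=512):
    """True if the GPU has room for the model plus some safety headroom."""
    return gpu["free_mib"] >= required_mib + headroom_mib

def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True,
                         check=True).stdout
    return parse_memory(out)

# Illustrative readings, not measurements: one mostly idle 24 GB card
# and one 12 GB card that is already nearly full.
sample_output = "1532, 24564\n11000, 12288\n"
for index, gpu in enumerate(parse_memory(sample_output)):
    print(index, gpu["free_mib"], "MiB free, fits 5000 MiB:", fits(5000, gpu))
```

On a real machine, replacing `parse_memory(sample_output)` with `query_gpus()` would use live readings; the required size would come from the model file plus an allowance for the KV cache.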

Reviewed by Fredoline Eruo. See our editorial policy.
