RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
MLOps & deployment

Model Serving

Model serving is the process of making a trained AI model available for inference through an API or a local runtime. For operators running local AI, this means loading a model into memory (typically VRAM) and exposing it through a server endpoint (e.g., HTTP or gRPC) or a command-line interface. The serving layer handles request batching, concurrency, and resource management. Two concerns dominate for operators: VRAM capacity determines which models can be served without offloading layers to system RAM, and latency and throughput depend on quantization level, batch size, and hardware. Serving tools such as vLLM and llama.cpp improve throughput with continuous batching and KV-cache management.
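The VRAM concern above can be turned into a rough fit check. The sketch below is an estimate under stated assumptions, not a measurement: the default layer count, KV-head count, and head dimension are those of Llama 3.1 8B, the KV cache is assumed to be stored in 16-bit precision, and the 1 GB runtime overhead is a placeholder guess. The function names are hypothetical.

```python
def kv_cache_bytes(context_tokens, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes of KV cache for one sequence of `context_tokens` tokens.

    Defaults assume Llama 3.1 8B with a 16-bit cache. The leading 2
    accounts for storing both a key and a value tensor in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


def fits_in_vram(weights_gb, context_tokens, vram_gb,
                 concurrent_sequences=1, overhead_gb=1.0):
    """Return (fits, needed_gb) for a rough VRAM budget.

    `overhead_gb` is an assumed allowance for activations and runtime
    buffers; real overhead varies by runtime and driver.
    GB here means 10**9 bytes.
    """
    kv_gb = kv_cache_bytes(context_tokens) * concurrent_sequences / 1e9
    needed_gb = weights_gb + kv_gb + overhead_gb
    return needed_gb <= vram_gb, needed_gb


# Example: ~5.5 GB of weights, 8192-token context, one sequence, 24 GB card.
fits, needed = fits_in_vram(5.5, 8192, 24)
print(f"needs ~{needed:.1f} GB, fits: {fits}")
```

Raising `concurrent_sequences` shows why batched serving uses more VRAM than single-request serving: every in-flight sequence needs its own KV cache.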

Deeper dive

Model serving is the step between a trained model and the applications that call it. In production, it involves deploying a model behind a REST or gRPC API and handling authentication, rate limiting, and scaling. For local AI, serving is often simpler: a single process (e.g., Ollama, LM Studio) loads the model and listens on localhost. The runtime must manage the model's weights, the KV cache, and the token generation loop. The key serving metrics are time-to-first-token (TTFT), tokens per second (TPS), and memory usage. Advanced serving frameworks like vLLM use PagedAttention to reduce KV-cache memory fragmentation and support continuous batching, which processes multiple requests concurrently to improve throughput. Operators should consider whether their workload is throughput-bound, with many concurrent requests that benefit from batching (e.g., a multi-user chat app), or latency-bound, with a single user waiting on each token (e.g., a real-time assistant).
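TTFT and TPS can be derived from timestamps recorded while a response streams in. The sketch below assumes the client has noted when the request was sent and when each token arrived. The function name and the returned keys are hypothetical, and TPS is computed over the decode phase only (after the first token), which is one common convention rather than the only one.

```python
def serving_metrics(request_sent, token_times):
    """Compute TTFT and decode TPS from arrival timestamps in seconds.

    request_sent: time the request was issued.
    token_times:  arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")

    # Time-to-first-token covers queueing plus prompt processing.
    ttft = token_times[0] - request_sent

    # Decode throughput: tokens after the first, over the time they took.
    decode_time = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1
    decode_tps = decoded / decode_time if decode_time > 0 else float("nan")

    return {
        "ttft_s": ttft,
        "decode_tps": decode_tps,
        "total_s": token_times[-1] - request_sent,
    }


# Synthetic example: first token after 0.5 s, then one token every 0.25 s.
print(serving_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
```

Separating TTFT from decode TPS matters because long prompts mostly raise TTFT, while quantization level and memory bandwidth mostly affect decode speed.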

Practical example

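An operator with an RTX 4090 (24 GB VRAM) serves Llama 3.1 8B at Q4_K_M (~5.5 GB). Using llama.cpp's built-in HTTP server (llama-server -m model.gguf --host 0.0.0.0 --port 8080; the binary was named ./server in older builds), they get ~40 tok/s for single requests. Note that --host 0.0.0.0 exposes the server to the whole network; use 127.0.0.1 for local-only access. If they switch to vLLM with continuous batching, aggregate throughput can exceed 100 tok/s under concurrent load, but VRAM usage increases because vLLM by default preallocates most of the GPU's memory for the KV cache.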
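The throughput gain under concurrent load comes from how batch slots are refilled. The toy simulation below is a sketch, not vLLM's scheduler: it assumes every decode step produces one token per active request, ignores prompt processing, and counts steps rather than seconds. Both function names are hypothetical.

```python
from collections import deque


def static_batch_steps(output_lengths, slots):
    """Decode steps when each batch must finish before the next starts."""
    steps = 0
    for i in range(0, len(output_lengths), slots):
        # The whole batch waits for its longest request.
        steps += max(output_lengths[i:i + slots])
    return steps


def continuous_batch_steps(output_lengths, slots):
    """Decode steps when a freed slot is refilled immediately.

    Assumes every entry in output_lengths is at least 1.
    """
    queue = deque(output_lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Each active request emits one token; drop finished requests.
        active = [left - 1 for left in active if left - 1 > 0]
    return steps


lengths = [10, 2, 2, 2]  # one long request, three short ones
print(static_batch_steps(lengths, slots=2))      # 12
print(continuous_batch_steps(lengths, slots=2))  # 10
```

The same 16 tokens finish in fewer steps with continuous batching because short requests no longer hold an idle slot while the long one completes.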

Workflow example

In Ollama, serving needs no extra configuration: start the server with ollama serve (desktop installs usually start it in the background), then download a model with ollama pull llama3.1:8b. The runtime listens on localhost:11434. An operator can send a request with curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "prompt": "Hello"}'. For custom setups, llama.cpp's server binary loads a GGUF file and provides a REST API. vLLM requires a Python environment: vllm serve meta-llama/Llama-3.1-8B-Instruct starts an OpenAI-compatible server. Add --quantization awq only when the checkpoint was already quantized with AWQ; the flag does not quantize full-precision weights at load time.
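The curl request above can also be issued from a script. The sketch below uses only the Python standard library and assumes an Ollama server is already listening on localhost:11434 with llama3.1:8b pulled. The helper names are hypothetical. Setting "stream" to false asks Ollama for a single JSON reply instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build (but do not send) a POST request for Ollama's generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request and return the generated text.

    Requires a running Ollama server; raises urllib.error.URLError otherwise.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        payload = json.loads(reply.read().decode("utf-8"))
    return payload["response"]


# Usage, with the server running:
#   print(generate("llama3.1:8b", "Hello"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log without a server running.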

Reviewed by Fredoline Eruo. See our editorial policy.
