

Aphrodite Engine


By Fredoline Eruo · Last verified May 9, 2026 · 1,700 GitHub stars

Overview

vLLM fork specialized for creative writing / role-play workloads. Adds samplers (smoothing factor, dynatemp, mirostat, DRY, XTC) that mainline vLLM doesn't ship. Same continuous-batching architecture; trades some throughput for sampler richness.
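Those samplers ride along in the standard request body. The sketch below shows the idea with the openai Python client, assuming a server launched as in the setup section below; the exact field names (smoothing_factor, dynatemp_min/dynatemp_max, xtc_threshold, xtc_probability) are assumptions based on community convention, not confirmed API, so check Aphrodite's sampling docs before relying on them.

```python
# Hedged sketch: passing Aphrodite's extra samplers alongside a normal
# chat request. The extra_body field names are assumptions -- verify
# them against Aphrodite's sampling documentation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="sk-no-key")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Open a noir scene in two sentences."}],
    temperature=1.0,
    extra_body={                  # extra fields pass through to the server
        "smoothing_factor": 0.3,  # quadratic sampling strength (assumed name)
        "dynatemp_min": 0.7,      # dynamic-temperature bounds (assumed names)
        "dynatemp_max": 1.3,
        "xtc_threshold": 0.1,     # XTC sampler knobs (assumed names)
        "xtc_probability": 0.5,
    },
)
print(resp.choices[0].message.content)
```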

Setup guidance

Install via pip in a Python 3.10+ venv with CUDA 12.1+:

  • Install: pip install aphrodite-engine
  • Start: aphrodite run meta-llama/Llama-3.1-8B-Instruct --port 2242. The server exposes an OpenAI-compatible API at /v1/chat/completions.
  • Verify: curl http://localhost:2242/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hello"}]}' (or from Python, as sketched below).

Aphrodite is a vLLM fork optimized for single-user throughput rather than multi-tenant serving. It keeps vLLM's PagedAttention KV-cache management and continuous-batching architecture but tunes the scheduler for single-stream workloads. It supports EXL2, AWQ, GPTQ, and FP8 quantization formats; for GGUF models, point it at the file directly: aphrodite run ./model.gguf. First run downloads the model from Hugging Face (~5–20 minutes for a 70B). Time-to-first-response from zero: ~10 minutes. Aphrodite also includes a SillyTavern-compatible API mode for roleplay and character-chat UI integrations.
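The same verification can be done from Python with the official openai client (pip install openai). A minimal sketch, assuming the server above is running on port 2242 with no API key configured:

```python
# Minimal sketch: query Aphrodite's OpenAI-compatible endpoint using the
# openai client. Assumes the server from the setup steps is listening on
# port 2242; the api_key is a placeholder since no key was set at launch.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="sk-no-key")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```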

Workload fit

Best for:
  • Single-user, high-throughput local LLM serving on NVIDIA GPUs
  • Roleplay and creative writing, where high single-stream decode speed keeps the interaction responsive (a quick way to measure this is sketched below)
  • SillyTavern and character-chat frontend integration
  • Users who want vLLM's PagedAttention memory management without multi-tenant serving complexity
  • GGUF model users who want faster decode than raw llama.cpp CUDA

Not suited for:
  • Multi-tenant production serving (use vLLM)
  • CPU-only or Apple Silicon deployment
  • GPUs outside CUDA and ROCm
  • Workloads with multiple concurrent users
  • Users who need automatic model management (use Ollama)
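The single-stream decode speed called out above is easy to eyeball: stream one completion and count chunks per second. A rough sketch, assuming the server from the setup section and treating one streamed chunk as roughly one token:

```python
# Rough single-stream throughput check against the OpenAI-compatible
# endpoint. Chunk count is only a proxy for token count, so treat the
# result as an estimate, not a benchmark.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="sk-no-key")

start = time.time()
chunks = 0
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell a short story."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Skip role-only / empty deltas; count content-bearing chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.time() - start
print(f"~{chunks / elapsed:.1f} tok/s single-stream decode (approximate)")
```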

Alternatives

  • Use Aphrodite when you want vLLM-level single-user throughput with less operational complexity; it is the go-to engine for roleplay and creative writing.
  • Switch to vLLM when you need multi-tenant concurrency; Aphrodite's scheduler is not tuned for concurrent requests.
  • Use ExLlamaV2 for maximum single-stream decode speed on consumer NVIDIA GPUs, if you can accept EXL2 format conversion.
  • Use Ollama for zero-config desktop LLM serving with automatic model management; Aphrodite requires explicit model specification and a Python environment.
  • Use KoboldCPP when you need a bundled chat UI, Windows-native deployment without Python, and the full GGUF ecosystem.

Aphrodite sits between vLLM and ExLlamaV2: more single-user throughput than vLLM, more model-format support than ExLlamaV2.

Troubleshooting + when to switch

Problem: performance identical to vLLM, no throughput gain. Fix: Aphrodite's single-user optimization engages at concurrency 1; with multiple concurrent requests it falls back to near-vLLM behavior, so test with single sequential requests. Enable --enforce-eager to bypass CUDA graph capture, which can mask single-user gains.

Problem: GGUF model fails to load. Fix: Aphrodite's GGUF support comes through its llama.cpp integration, and not all GGUF quantizations are supported. Stick to Q4_K_M, Q5_K_M, and Q8_0; below Q4_K_M, Aphrodite may reject the model or produce garbage output.

Problem: SillyTavern connection fails. Fix: Aphrodite's SillyTavern API mode requires the --api-type kobold flag. The endpoint is /api/v1/generate on port 2242, not the standard OpenAI endpoint. Configure SillyTavern with the "KoboldAI" API type pointing at http://localhost:2242 (a direct endpoint check is sketched below).
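To confirm the KoboldAI-compatible endpoint is up before debugging SillyTavern itself, hit it directly. A hedged sketch using requests; it assumes the server was launched with --api-type kobold on port 2242, and the payload/response shapes follow the KoboldAI United API, which may differ across Aphrodite versions:

```python
# Hedged sketch: sanity-check the KoboldAI-compatible endpoint that
# SillyTavern talks to. Payload and response fields follow the KoboldAI
# United convention and are assumptions for Aphrodite specifically.
import requests

payload = {"prompt": "Once upon a time,", "max_length": 60}
r = requests.post(
    "http://localhost:2242/api/v1/generate",
    json=payload,
    timeout=120,
)
r.raise_for_status()
print(r.json()["results"][0]["text"])
```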

Pros

  • Sampling-method richness — DRY / XTC / dynatemp don't exist in stock vLLM
  • OpenAI-compatible API like vLLM — drop-in for compatible clients
  • Strong fit for SillyTavern / TavernAI / role-play workloads

Cons

  • Lags vLLM mainline on new model architectures by 2-6 weeks
  • Smaller community + fewer production deployments
  • Throughput slightly trails vLLM at high concurrency

Compatibility

Operating systems: Linux, Windows
GPU backends: NVIDIA CUDA, AMD ROCm
License: Open source · free + open-source

Runtime health

Operator-grade signals on how actively Aphrodite Engine is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.

Release cadence

Derived from the most recent editorial signal on this row.

Active · updated May 9, 2026 · 5 days since last refresh

Benchmark freshness

How recent the editorial measurements on this runtime are.

0 editorial benchmarks

No editorial benchmarks for this runtime yet.

Community reproduction

Submissions that match an editorial measurement on similar hardware.

0 reproduced reports

No community reproductions on file yet.

Get Aphrodite Engine

Official site: https://aphrodite.pygmalion.chat
GitHub: https://github.com/aphrodite-engine/aphrodite-engine

Frequently asked

Is Aphrodite Engine free?

Yes. Aphrodite Engine is free and open-source; there is no paid tier.

What operating systems does Aphrodite Engine support?

Aphrodite Engine supports Linux and Windows.

Which GPUs work with Aphrodite Engine?

Aphrodite Engine supports NVIDIA CUDA and AMD ROCm. CPU-only inference is also possible but slow.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.

Alternatives
MLX-LM · ExLlamaV2 · llama.cpp · Llamafile · Ollama · IPEX-LLM · CTranslate2 · Intel OpenVINO