TurboVec
TurboVec is an open-source, **local-first vector index** (Rust core + Python bindings) by Ryan Codrai, MIT-licensed, built on Google Research's **TurboQuant** quantizer (presented at ICLR 2026). Its pitch for local AI: fit far more embeddings in RAM so a privacy-first RAG stack stays on one machine instead of needing a memory-optimized cloud box.
Architecture: a data-oblivious quantizer, not another HNSW fork
TurboQuant is a data-oblivious scalar quantizer: a random rotation followed by scalar quantization of each coordinate. The operationally important detail is that there is no k-means, no learned codebook, and no separate training pass. Quantization parameters derive only from the vector dimension and the target bit-width, so an add() needs no fitting step and the index never needs a rebuild as documents stream in. The core is Rust with thin Python bindings (crates.io, MSRV 1.70). Search is SIMD-accelerated, with hand-written NEON on ARM and AVX-512BW/AVX2 on x86, and filtered search runs against an id allowlist with no recall penalty.
If you have lived with FAISS's IVF/PQ train step, this is the headline ergonomic win: you skip the "collect a representative sample, fit the quantizer, pray the distribution holds" dance entirely. That single design choice is why TurboVec fits the local-AI stack so cleanly — it pairs with any open-weight embedding model you already run through Ollama and never phones home. For the broader "which model on which box" decisions around it, our models catalog and will-it-run checker cover the embedding and inference side.
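To make the rotate-then-quantize idea concrete, here is a minimal NumPy sketch of a data-oblivious scalar quantizer. It illustrates the general technique only; it is not TurboQuant's actual algorithm or TurboVec's code. The function names, the 3/sqrt(dim) clip range, the uniform level grid, and the unit-norm input are all assumptions chosen for clarity. What it shows is that every parameter comes from the dimension, the bit-width, and a seed, so nothing is fitted to the corpus.

```python
import numpy as np

def make_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix; depends only on dim and seed, never on the data."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal

def quantize(vectors: np.ndarray, rotation: np.ndarray, bits: int) -> np.ndarray:
    """Rotate unit-norm vectors, then snap each coordinate to one of 2**bits levels."""
    dim = vectors.shape[1]
    levels = 2 ** bits
    # Assumption: after rotation, coordinates of a unit vector have standard
    # deviation of roughly 1/sqrt(dim), so a fixed clip range suits any corpus.
    clip = 3.0 / np.sqrt(dim)
    rotated = vectors @ rotation
    unit = (np.clip(rotated, -clip, clip) + clip) / (2 * clip)  # map to [0, 1]
    return np.minimum(np.floor(unit * levels), levels - 1).astype(np.uint8)

def dequantize(codes: np.ndarray, dim: int, bits: int) -> np.ndarray:
    """Map codes back to the centre of their level, still in rotated space."""
    levels = 2 ** bits
    clip = 3.0 / np.sqrt(dim)
    return ((codes.astype(np.float64) + 0.5) / levels) * 2 * clip - clip
```

Because the rotation is orthogonal, inner products are preserved: a query rotated with the same matrix can be scored directly against the dequantized codes. Adding a new document is just one call to quantize, with no refit of anything built earlier.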
Compatibility: what the SIMD tiers actually mean
The published throughput claims (beating FAISS IndexPQFastScan by 12–20% on ARM, ~0.23 ms/query on an Apple M3 Max) target the NEON and AVX-512 paths. Most readers will not be on those — and that is fine, because the AVX2 fallback is still fast. The matrix below records what we verified versus what the project reports. The takeaway: TurboVec is CPU-only by design (no GPU path, and it does not need one), so the relevant axis is your CPU's SIMD width, not your GPU.
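If you are unsure which tier your machine falls into, a small check of the CPU feature flags is usually enough. The sketch below is an assumption-laden helper, not part of TurboVec: the function names are hypothetical, it reads /proc/cpuinfo and therefore only works on Linux, and it says nothing about how TurboVec itself picks a kernel at runtime. It only maps the standard Linux flag names (avx512bw, avx2, and asimd or neon on ARM) to the tiers discussed here.

```python
from pathlib import Path

def simd_tier(flags: set[str]) -> str:
    """Map CPU feature flags to the SIMD tiers discussed in this review."""
    if "avx512bw" in flags:
        return "AVX-512BW"
    if "avx2" in flags:
        return "AVX2 fallback"
    if "asimd" in flags or "neon" in flags:  # aarch64 reports NEON as "asimd"
        return "NEON"
    return "no supported SIMD tier detected"

def read_linux_cpu_flags(path: str = "/proc/cpuinfo") -> set[str]:
    """Collect flags from the 'flags' (x86) or 'Features' (ARM) lines; Linux only."""
    flags: set[str] = set()
    cpuinfo = Path(path)
    if not cpuinfo.exists():
        return flags  # non-Linux systems: nothing to read
    for line in cpuinfo.read_text().splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() in {"flags", "features"}:
            flags.update(value.split())
    return flags

if __name__ == "__main__":
    print(simd_tier(read_linux_cpu_flags()))
```

On the Zen 3 laptop used for the measurements in this review, a check like this would land on the AVX2 tier, since that chip has AVX2 but no AVX-512.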
Deployment paths: three honest shapes
We tested the Python path end to end. The cleanest deployment is local Python RAG: pip install turbovec, embed with a local model, done — an air-gapped index on a laptop. The second is embedding the Rust crate directly in a service when you do not want Python in the hot path; online ingest with no rebuild makes it viable for streaming document pipelines. The third is memory-tight edge / on-device, where 2-bit quantization is the whole point: a 50K-vector index drops under 10 MB, so a Raspberry-Pi-class box holds an index that would otherwise want a server. The structured cards below lay out the hardware and complexity for each.
Resource usage and first-party benchmarks
We installed TurboVec ourselves (turbovec 0.7.0, Python 3.14) on a Ryzen 9 5900HX laptop (Zen 3, AVX2 — no AVX-512) and measured two workloads. Numbers here are ours, not the project's.
Synthetic, 50,000 vectors × 768-dim, 200 queries:
- float32 baseline: 153.6 MB index.
- 4-bit: 19.2 MB (8× smaller), 2.1 s build, 0.19 ms/query.
- 2-bit: 9.6 MB (16× smaller), 1.9 s build, 0.10 ms/query.
The 8× and 16× compression ratios are exact (153.6 MB / 19.2 MB and 153.6 MB / 9.6 MB). Builds finish in under 3 seconds with no train step, and search stays sub-millisecond on the AVX2 fallback: the headline ARM and AVX-512 numbers do not apply to this chip, yet it still serves queries in ~0.1–0.2 ms.
Real embeddings, 888 RunLocalAI catalog docs via nomic-embed-text, 768-dim, 150 held-out semantic queries, recall vs exact float32 brute force:
- 4-bit: recall@10 0.955, 0.033 ms/query, 0.3 MB.
- 2-bit: recall@10 0.884, 0.020 ms/query, 0.2 MB.
On real embeddings the recall is far higher than on random vectors — quantization preserves genuine semantic structure — so the project's published ~0.955 recall ballpark held up on our data. At 4-bit you recover ~95.5% of the true top-10 neighbours an uncompressed search would return; 2-bit keeps ~88% at 16× less memory. These sit alongside our other measured numbers in the benchmarks hub.
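The recall@10 figures above use the standard definition: for each query, the fraction of the exact float32 top-10 that also appears in the compressed index's top-10, averaged over queries. A minimal sketch of that measurement follows, assuming you already have the approximate ids from whatever index you are testing. The helper names are illustrative, and approx_ids is a placeholder rather than a TurboVec call.

```python
import numpy as np

def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores in each query row (unordered within the k)."""
    return np.argpartition(-scores, k - 1, axis=1)[:, :k]

def recall_at_k(exact_ids: np.ndarray, approx_ids: np.ndarray) -> float:
    """Mean fraction of the exact top-k that the approximate top-k also returned."""
    k = exact_ids.shape[1]
    hits = [len(set(e) & set(a)) for e, a in zip(exact_ids, approx_ids)]
    return float(np.mean(hits)) / k

# Ground truth by brute force: with unit-norm embeddings, queries @ docs.T is cosine similarity.
# exact_ids = top_k(queries @ docs.T, k=10)
# score = recall_at_k(exact_ids, approx_ids)  # approx_ids come from the index under test
```

Run this on your own corpus and queries rather than relying on a published number; recall depends on the embedding model and on corpus size.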
Capacity planning from these numbers. The memory math is linear and easy to project, which is what makes TurboVec predictable to deploy. A 768-dim vector costs ~3,072 bytes at float32, ~384 bytes at 4-bit, and ~192 bytes at 2-bit. So a one-million-vector index lands around 3 GB uncompressed, ~384 MB at 4-bit, and ~192 MB at 2-bit — which is exactly why the project can quote a 10M-document corpus dropping from ~31 GB to ~4 GB. For an operator, that converts directly into a hosting decision: a 4-bit million-vector index fits comfortably in the RAM you already have on a laptop or a small VM, with no memory-optimised cloud box and no managed vector service in the loop. Because builds are train-free and ingest is online, you also do not pay a rebuild tax as the corpus grows — you size for the final footprint once and add documents incrementally.
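The projection above is simple enough to script. The helper below is a sketch that counts only the quantized payload (vectors × dimensions × bits / 8); it assumes no per-vector overhead for ids or other bookkeeping, which matches the figures quoted in this review but may not match the exact on-disk size of a real index. The function name is illustrative.

```python
def index_bytes(n_vectors: int, dim: int, bits_per_dim: int) -> int:
    """Payload size of the stored codes alone: n * dim * bits / 8."""
    return n_vectors * dim * bits_per_dim // 8

for label, bits in [("float32", 32), ("4-bit", 4), ("2-bit", 2)]:
    mb = index_bytes(1_000_000, 768, bits) / 1e6
    print(f"{label:>7}: {mb:,.0f} MB per million 768-dim vectors")
# float32: 3,072 MB per million 768-dim vectors
#   4-bit: 384 MB per million 768-dim vectors
#   2-bit: 192 MB per million 768-dim vectors
```

Plugging in 10 million vectors at 4-bit gives about 3.84 GB, which lines up with the roughly 31 GB to 4 GB figure the project quotes.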
Failure modes: what breaks, and where it is the wrong tool
TurboVec is a vector index, not a vector database, and most of its failure modes are really scope mismatches:
- No server, dashboard, query language, or rich metadata filtering. Filtering is an id allowlist; if you need `WHERE tenant = x AND date > y` at the store layer, this is not it.
- No AVX-512 means no headline numbers. On older CPUs (our Zen 3 case) you get the AVX2 path: still sub-0.2 ms/query at 50K×768, but do not expect the published ARM figures.
- Recall drifts on much larger corpora. Our recall run was 888 docs; at tens of millions the 2-bit tier in particular will lose more. Benchmark on your own data — the author's docs say the same.
- 2-bit is a footprint trade, not a free lunch. It cost ~7 points of recall@10 (0.955 → 0.884) in our test. Use it only when memory is the binding constraint.
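The metadata limitation in the first bullet has a common workaround: keep the metadata in a separate store, resolve the predicate there, and hand the resulting ids to the index as the allowlist. The sketch below assumes a hypothetical SQLite table named docs with id, tenant, and date columns, and uses an exact NumPy search as a stand-in for the index, since the precise TurboVec call for allowlist search is not shown in this review. All names are illustrative.

```python
import sqlite3
import numpy as np

def allowed_ids(conn: sqlite3.Connection, tenant: str, after: str) -> list[int]:
    """Resolve the metadata predicate in SQL and return the matching vector ids."""
    rows = conn.execute(
        # ISO-8601 date strings compare correctly as plain text.
        "SELECT id FROM docs WHERE tenant = ? AND date > ? ORDER BY id",
        (tenant, after),
    )
    return [row[0] for row in rows]

def filtered_search(query: np.ndarray, vectors: np.ndarray,
                    allow: list[int], k: int) -> list[int]:
    """Stand-in for an allowlist search: exact dot-product over allowed rows only."""
    allow_arr = np.asarray(allow, dtype=np.int64)
    if allow_arr.size == 0:
        return []
    scores = vectors[allow_arr] @ query
    order = np.argsort(-scores)[:k]  # best-scoring allowed rows first
    return allow_arr[order].tolist()
```

The trade-off is that you now own two stores and must keep their ids in sync, which is exactly the work a full vector database would otherwise do for you.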
How it compares
For billion-scale trained PQ on GPU, FAISS still wins — TurboVec has no GPU path and does not try to. For sub-microsecond flat-storage latency on small sets, HNSWlib wins. Against managed cloud vector databases, TurboVec's pitch is the opposite axis: everything stays on your machine, which is the entire reason to run it in a local-first RAG stack. Its genuine sweet spot is sub-1M-vector prototyping and memory-tight edge/on-device RAG where instant builds and aggressive compression matter more than distributed scale. If you are choosing between local inference engines and stores generally, the tools catalog frames the wider landscape.
Verdict
TurboVec does one thing and does it honestly: compressed approximate-nearest-neighbour search with instant, train-free builds. Our testing confirmed the compression math exactly (8× / 16×) and a production-viable 0.955 recall@10 at 4-bit on real embeddings, all on CPU and even without AVX-512. It will not replace a full vector database and it is not a GPU billion-scale engine — but for an air-gapped, single-machine RAG index, that is precisely the point. Recommended (4.4/5) for local and edge RAG up to ~1M vectors; reach for FAISS or a managed store past that. The 4-bit tier is the default; drop to 2-bit only when every megabyte counts.
| Status | Runtime / Stack | Notes |
|---|---|---|
| Excellent | x86-64 with AVX-512 (Zen 4 / Sapphire Rapids+) | Headline path. AVX-512BW SIMD kernels — this is the tier the project's published throughput claims target. |
| Good | x86-64 with AVX2 only (Zen 3 / older Intel) | Fallback path we tested on a Ryzen 9 5900HX. Still ~0.1-0.2 ms/query at 50K x 768; just not the headline SIMD numbers. |
| Excellent | Apple Silicon / ARM (NEON) | Hand-written NEON kernels; the author reports ~0.23 ms/query on an M3 Max. Not independently reproduced by us. |
| Excellent | Python 3.x via pip | Installed clean on Python 3.14. LangChain / LlamaIndex / Haystack / Agno integrations via pip extras. |
| Excellent | Rust via crates.io (MSRV 1.70) | cargo add turbovec — the native path with no Python in the hot loop. Online ingest, no rebuilds. |
Local Python RAG (laptop / desktop)
Complexity: trivial. pip install turbovec plus a local embedding model via Ollama (e.g. nomic-embed-text). Our 888-doc real-embedding index used 0.3 MB at 4-bit with 0.955 recall@10. Fastest path to an air-gapped RAG index.
Embedded in a Rust service
Complexity: moderate. cargo add turbovec to embed the index in a Rust binary with no Python in the hot path. Train-free online ingest makes it viable for streaming document pipelines.
Memory-tight edge / on-device
Complexity: moderate. 2-bit quantization keeps a 50K-vector index under 10 MB, so a Pi-class device holds an index that would otherwise need a server. Costs ~7 points of recall@10 (0.955 → 0.884) vs 4-bit.
Pros
- Extreme memory compression (2-bit/4-bit, 16x/8x) — fits ~10M vectors in ~4GB, keeps RAG on one local box
- Zero train step / no codebook — instant online ingest, no rebuilds as the corpus grows
- Pure-local, MIT, air-gappable; drop-in integrations for LangChain / LlamaIndex / Haystack / Agno
- Hand-written SIMD (NEON/AVX-512BW) + filtered search with no recall penalty
Cons
- An index, NOT a database — no server, dashboard, query language, or distributed/metadata-filter features
- CPU-only (SIMD); no GPU acceleration — scan-based search trails HNSWlib on raw latency
- Young (first release Apr 2026); published benchmarks are the author's own and not yet independently reproduced
- Best for <1M vectors / edge; FAISS still wins for billion-scale GPU PQ
Compatibility
| Property | Value |
|---|---|
| Operating systems | Linux, macOS, Windows |
| GPU backends | None (CPU SIMD) |
| License | MIT (open source, free) |
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.