RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Vector Database Internals
  6. /Ch. 14
Vector Database Internals

14. Multi-Tier Indexing

Chapter 14 of 18 · 25 min
KEY INSIGHT

Multi-tier indexing is fundamentally a budget allocation problem. Your goal is to maximize recall per dollar spent on storage and compute. The optimal tier sizes depend on your query distribution—profile before committing to a tier split.

Multi-tier indexing separates data by access frequency and query requirements. Hot data lives in fast storage; cold data on slower, cheaper tiers. This is essential for scaling beyond single-machine memory.

Tier Architecture

┌─────────────────────────────────────────────────────────┐
│  Tier 0: Hot Index (HNSW in RAM)                       │
│  - Top 1-10% of most queried vectors                   │
│  - Sub-millisecond query latency                       │
│  - High memory cost                                    │
├─────────────────────────────────────────────────────────┤
│  Tier 1: Warm Index (HNSW on NVMe)                     │
│  - Next 10-30% of vectors                              │
│  - 1-10ms query latency                                │
│  - Lower cost, larger capacity                         │
├─────────────────────────────────────────────────────────┤
│  Tier 2: Cold Storage (PQ + IVF on object storage)     │
│  - Remaining vectors                                   │
│  - 10-100ms query latency                              │
│  - Commodity storage costs                             │
└─────────────────────────────────────────────────────────┘

Routing Strategy

Queries cascade through tiers until recall requirements are met:

from dataclasses import dataclass
from typing import Protocol
import numpy as np

class VectorIndex(Protocol):
    def search(self, query: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]: ...

@dataclass
class TierConfig:
    name: str
    index: VectorIndex
    size: int  # number of vectors in this tier
    recall_weight: float  # fraction of total recall this tier contributes

class MultiTierSearch:
    def __init__(self, tiers: list[TierConfig]):
        self.tiers = sorted(tiers, key=lambda t: t.size, reverse=True)  # smallest first
        self.total_vectors = sum(t.size for t in tiers)
        self.cumulative_sizes = np.cumsum([t.size for t in tiers])
    
    def search(self, query: np.ndarray, k: int, 
               target_recall: float = 0.95) -> tuple[np.ndarray, np.ndarray]:
        """
        Search through tiers until target recall is achieved.
        Returns approximate top-k results.
        """
        results = {}  # index -> (distance, original_id)
        seen_ids = set()
        accumulated = 0
        
        for tier in self.tiers:
            tier_results, tier_ids = tier.index.search(query, k + len(seen_ids))
            
            # Merge new results
            for dist, idx in zip(tier_results, tier_ids):
                global_idx = accumulated + idx
                if global_idx not in seen_ids:
                    results[global_idx] = (dist, global_idx)
                    seen_ids.add(global_idx)
            
            accumulated += tier.size
            
            # Check if we have enough results
            if len(results) >= k:
                sorted_results = sorted(results.values(), key=lambda x: x[0])
                top_k = sorted_results[:k]
                return np.array([r[0] for r in top_k]), np.array([r[1] for r in top_k])
        
        # Fallback: return what we have
        sorted_results = sorted(results.values(), key=lambda x: x[0])
        return np.array([r[0] for r in sorted_results[:k]]), np.array([r[1] for r in sorted_results[:k]])

Tier Promotion

Hot vectors get promoted to faster tiers based on query frequency:

from collections import deque
import threading

class TierManager:
    def __init__(self, hot_threshold: int = 100, promotion_window: int = 3600):
        self.query_counts = {}  # vector_id -> query count in window
        self.window_seconds = promotion_window
        self.hot_threshold = hot_threshold
        self.lock = threading.Lock()
        self.promotion_queue = deque()
    
    def record_query(self, vector_id: int):
        """Record that a vector was returned in a query result."""
        with self.lock:
            self.query_counts[vector_id] = self.query_counts.get(vector_id, 0) + 1
    
    def check_promotions(self) -> list[int]:
        """
        Return list of vector IDs that should be promoted to hot tier.
        Should be called periodically (e.g., every minute).
        """
        with self.lock:
            candidates = [
                vid for vid, count in self.query_counts.items()
                if count >= self.hot_threshold
            ]
            
            self.promotion_queue.extend(candidates)
            self.query_counts.clear()
        
        return list(self.promotion_queue)

Failure Modes

Tier boundary recall loss: Vectors near tier boundaries may have different distances in each tier's index. Always re-rank unified results against the original vectors when precision matters.

Promotion lag: Newly hot vectors wait for the promotion cycle. If query patterns change rapidly (e.g., trending content), consider synchronous promotion with a latency budget.


EXERCISE

Implement a 2-tier system with tier 0 containing 10% of vectors and tier 1 containing 90%. Measure end-to-end recall at different k values (1, 10, 100) comparing single-tier (tier 1 only) vs multi-tier search. Quantify the recall cost of tiering.

# Expected: recall cost increases with smaller k
# Reason: top results are more likely in hot tier, but misses are more costly
← Chapter 13
SIMD Distance Computation
Chapter 15 →
Distributed Search