RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RAG Systems: Part 2

05. Reranking Pipeline

Chapter 5 of 22 · 25 min
KEY INSIGHT

The reranking pipeline's value lies in decoupling recall (retrieve widely) from precision (select intelligently), but it only pays off when you tune the k parameters against your own evaluation data.

This chapter integrates reranking into a complete retrieval pipeline, covering configuration, tuning, and common integration patterns with LangChain, LlamaIndex, and custom implementations.

Adding Reranking to LangChain

LangChain wraps a cross-encoder in a document compressor, which plugs into any retriever built on its vector store abstraction:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Base vectorstore (from Part 1)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Cross-encoder model that scores (query, document) pairs
reranker = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")

# Contextual compression with reranking
compressor = CrossEncoderReranker(
    model=reranker,
    top_n=5  # Return top 5 after reranking
)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 50})
)

# Queries now go through the retrieval -> reranking pipeline.
# invoke() replaces the deprecated get_relevant_documents() in recent LangChain releases.
results = retriever.invoke("What is the vacation policy?")

The key parameters: the base retriever's k=50 fetches 50 initial candidates, and the reranker's top_n=5 keeps the 5 highest-scoring ones.

Tuning k Parameters

The initial retrieval k and final selection top_n require tuning for your use case. Guidelines:

Initial k should be high enough to capture relevant content. If your documents are information-dense, relevant content may be spread across many chunks. Start with k=100 and measure recall on your evaluation set.

Final top_n should match your LLM context budget. If your LLM accepts 16k tokens and each chunk averages 200 tokens plus 1,000 tokens of context wrapping, each chunk costs about 1,200 tokens, so at most 13 fit (16,000 / 1,200 ≈ 13.3). Budgeting 10-12 chunks leaves headroom for the system prompt and query.
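The context-budget arithmetic above can be sketched as a small helper. This is an illustration only: the function name, the reading of "context wrapping" as a fixed per-chunk overhead, and the 2,000-token reserve for the system prompt and query are all assumptions, not values from this chapter.

```python
def max_chunks_in_context(context_window, chunk_tokens, wrapper_tokens, reserved_tokens):
    """Return how many retrieved chunks fit in the context window.

    All arguments are token counts. reserved_tokens is a hypothetical
    allowance for the system prompt, the query, and any other fixed text.
    """
    per_chunk = chunk_tokens + wrapper_tokens
    available = context_window - reserved_tokens
    if per_chunk <= 0 or available <= 0:
        return 0
    return available // per_chunk


# 16k window, 200-token chunks, 1,000 tokens of wrapping per chunk.
upper_bound = max_chunks_in_context(16_000, 200, 1_000, reserved_tokens=0)
# Same figures with an assumed 2,000-token reserve for prompt and query.
budgeted = max_chunks_in_context(16_000, 200, 1_000, reserved_tokens=2_000)
print(upper_bound, budgeted)
```

With no reserve the helper returns the hard ceiling; with a reserve it returns a value inside the 10-12 range suggested above. Token counts from a real tokenizer will differ from averages, so treat the result as a starting point for top_n.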

The gap between k and top_n reflects reranking value. A large gap (k=100 → top_n=5) means the reranker is doing aggressive filtering. A small gap (k=20 → top_n=10) means your initial retrieval was already fairly precise. Small gaps may indicate the reranker isn't adding much value.

# Tuning script to compare k / top_n combinations for a single labeled query.
# Assumes rerank_documents(query, docs, reranker, top_n) is defined elsewhere
# and returns the top_n documents in reranked order.
def tune_retrieval_params(query, relevant_doc_ids, vectorstore, reranker,
                          k_values=(20, 50, 100, 200), top_n_values=(5, 10, 20)):
    relevant = set(relevant_doc_ids)
    if not relevant:
        raise ValueError("relevant_doc_ids must not be empty")

    results = {}
    for k in k_values:
        # Initial retrieval
        raw_results = vectorstore.similarity_search(query, k=k)
        raw_ids = [doc.metadata.get('chunk_id') for doc in raw_results]

        # Recall of the initial candidate set
        initial_recall = len(set(raw_ids) & relevant) / len(relevant)

        for top_n in top_n_values:
            # Rerank and keep the top_n documents
            reranked = rerank_documents(query, raw_results, reranker, top_n=top_n)
            reranked_ids = [doc.metadata.get('chunk_id') for doc in reranked]

            # Recall after reranking
            reranked_recall = len(set(reranked_ids) & relevant) / len(relevant)

            # Reciprocal rank of the first relevant document (0 if none).
            # Averaging this value over many queries gives MRR.
            reciprocal_rank = 0.0
            for i, doc_id in enumerate(reranked_ids):
                if doc_id in relevant:
                    reciprocal_rank = 1 / (i + 1)
                    break

            results[(k, top_n)] = {
                'initial_recall': initial_recall,
                'reranked_recall': reranked_recall,
                'mrr': reciprocal_rank
            }
    return results
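The script above scores one query at a time, so its 'mrr' entry is a single reciprocal rank. Mean Reciprocal Rank is the average of that value over every query in the evaluation set. A minimal sketch of the aggregation follows; the function names and the toy chunk ids are made up for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant id in ranked_ids, or 0.0 if none appears."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Toy evaluation set with three queries (ids are invented).
runs = [
    (["c1", "c2", "c3"], ["c2"]),  # first hit at rank 2 -> 0.5
    (["c4", "c5"], ["c4"]),        # first hit at rank 1 -> 1.0
    (["c6", "c7"], ["c9"]),        # no hit             -> 0.0
]
mrr = mean_reciprocal_rank(runs)
print(mrr)
```

Queries with no relevant document in the reranked list contribute zero, which is why a falling MRR can signal over-filtering as well as poor ordering.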

LlamaIndex Integration

LlamaIndex provides post-processors for reranking:

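# Import paths for llama-index 0.10 and later; older releases used
# llama_index.postprocessor, llama_index.retrievers, and llama_index.query_engine.
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Define retriever (index is assumed to be an existing VectorStoreIndex)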
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=100  # High initial k
)

# Define reranker post-processor
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-base",
    top_n=10,  # Final selection
    device="cuda"  # use "cpu" if no GPU is available
)

# Combine into query engine
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker]
)

# Query
response = query_engine.query("Your query here")

Custom Pipeline Integration

For full control, implement a custom pipeline:

class RerankingRetrievalPipeline:
    def __init__(self, vectorstore, reranker, embedder):
        self.vectorstore = vectorstore
        self.reranker = reranker
        self.embedder = embedder
    
    def retrieve(self, query, initial_k=100, final_k=20):
        # Step 1: Initial vector retrieval
        raw_results = self.vectorstore.similarity_search(
            query, 
            k=initial_k
        )
        
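        # Nothing to rerank if the vector search returned no documents
        if not raw_results:
            return []

        # Step 2: Score each (query, document) pair with the cross-encoder.
        # Assumes the reranker exposes predict(pairs), as the
        # sentence-transformers CrossEncoder does.
        reranked = self.reranker.predict([
            [query, doc.page_content] for doc in raw_results
        ])
        
        # Step 3: Sort by score (highest first) and keep the top final_k
        scored = list(zip(raw_results, reranked))
        scored.sort(key=lambda x: x[1], reverse=True)
        
        return scored[:final_k]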
    
    def retrieve_with_confidence(self, query, initial_k=100, final_k=20, score_threshold=0.5):
        results = self.retrieve(query, initial_k, final_k)
        
        # Filter by confidence if threshold provided
        if score_threshold is not None:
            results = [
                (doc, score) for doc, score in results 
                if score > score_threshold
            ]
        
        return results

Failure Modes

Reranker too slow for interactive use. If cross-encoder inference exceeds your latency budget, options include a smaller model (MiniLM-L-6 instead of L-12), a quantized model, GPU acceleration, or caching reranker scores for repeated queries.
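The caching option can be sketched as a thin wrapper around any pair-scoring function. Everything here is hypothetical: CachedScorer is not part of any library, and toy_overlap_scorer is a stand-in for a real cross-encoder's predict method so the example runs without downloading a model.

```python
class CachedScorer:
    """Wraps a pair-scoring function and memoises one score per (query, text) pair."""

    def __init__(self, score_pairs):
        # score_pairs: callable that takes a list of (query, text) pairs and
        # returns one score per pair, e.g. a cross-encoder's predict method.
        self._score_pairs = score_pairs
        self._cache = {}
        self.misses = 0  # number of pairs actually sent to the underlying scorer

    def predict(self, pairs):
        pairs = [tuple(pair) for pair in pairs]
        # Only score pairs not seen before; dict.fromkeys removes duplicates in order.
        missing = [pair for pair in dict.fromkeys(pairs) if pair not in self._cache]
        if missing:
            self.misses += len(missing)
            for pair, score in zip(missing, self._score_pairs(missing)):
                self._cache[pair] = float(score)
        return [self._cache[pair] for pair in pairs]


def toy_overlap_scorer(pairs):
    """Stand-in scorer: counts words shared by query and text."""
    return [len(set(query.split()) & set(text.split())) for query, text in pairs]


cached = CachedScorer(toy_overlap_scorer)
pairs = [
    ("vacation policy", "the vacation policy is"),
    ("vacation policy", "expense reports"),
]
first = cached.predict(pairs)   # both pairs are scored
second = cached.predict(pairs)  # both pairs come from the cache
print(first, second, cached.misses)
```

The cache in this sketch grows without bound and is keyed on exact strings, so it only helps when the same query meets the same chunk text again. A production version would likely need an eviction policy and invalidation when documents are re-chunked.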

Over-filtering. Aggressive reranking (small final_k) may filter out genuinely relevant documents that score lower because of surface-form differences. Always measure recall on a labeled evaluation set, not just precision.

Trusting scores across queries. A reranker score of 0.9 doesn't mean 90% relevance, and a 0.9 on one query is not directly comparable to a 0.9 on another. Scores are ordinal: use them for ranking within a query, not for hard classification without calibration.
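If a hard cutoff is needed anyway, one simple form of calibration is to choose the threshold from labeled (score, relevant?) pairs instead of picking a round number. The sketch below selects the cutoff that maximises F1. The function name is hypothetical and the scores and labels are invented toy values, not measurements from any model.

```python
def pick_threshold(scores, labels):
    """Return (threshold, f1) maximising F1 for the rule `score >= threshold`."""
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Invented reranker scores with human relevance labels (1 = relevant).
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]
threshold, f1 = pick_threshold(scores, labels)
print(threshold, round(f1, 3))
```

On this toy data the best cutoff sits well below 0.9, which illustrates why a fixed threshold chosen by intuition can discard relevant documents. A threshold fitted this way applies only to the model and corpus it was measured on.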

EXERCISE

Take the retrieval pipeline from Part 1 and add the reranking components from this chapter. Run a simple evaluation that measures recall with an initial k=100 and a final k=20, then compare the result to the baseline without reranking.
