Reranking Pipeline — RAG Systems: Part 2 (Chapter 5)

This chapter integrates reranking into a complete retrieval pipeline, covering configuration, tuning, and common integration patterns with LangChain, LlamaIndex, and custom implementations.

Adding Reranking to LangChain

LangChain's cross-encoder reranker plugs into the vector store retriever through a document compressor:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Base vectorstore (from Part 1)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Cross-encoder model that scores (query, document) pairs
reranker = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")

# Document compressor: reranks the candidates and keeps the top 5
compressor = CrossEncoderReranker(
    model=reranker,
    top_n=5
)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 50})
)

# Queries now go through the retrieval -> reranking pipeline.
# invoke() replaces the deprecated get_relevant_documents() in recent LangChain releases.
results = retriever.invoke("What is the vacation policy?")

Two parameters matter here: search_kwargs={"k": 50} on the base retriever fetches 50 initial candidates, and top_n=5 on the compressor keeps the 5 highest-scoring candidates after reranking.

Tuning k Parameters

The initial retrieval k and final selection top_n require tuning for your use case. Guidelines:

Initial k should be high enough to capture relevant content. If your documents are information-dense, relevant content might be spread across many chunks. Start with k=100 and measure recall on your evaluation set.

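Final top_n should match your LLM context budget. If your LLM accepts 16k tokens, your chunks average 200 tokens, and each chunk is wrapped with about 1,000 tokens of surrounding context, every retrieved chunk costs roughly 1,200 tokens. About 13 chunks fit in theory, so plan for 10-12 once you leave headroom for the system prompt and query.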
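The budget arithmetic above can be sketched as a small helper. The function name and the 2,000-token reserve for the system prompt, query, and answer are illustrative assumptions, and the 1,000-token wrapper is read here as a per-chunk cost.

```python
def max_chunks_for_budget(context_window, chunk_tokens, wrapper_tokens=0, reserved_tokens=0):
    """Return how many chunks fit in the context window after reserving headroom."""
    per_chunk = chunk_tokens + wrapper_tokens
    if per_chunk <= 0:
        raise ValueError("per-chunk token cost must be positive")
    available = context_window - reserved_tokens
    # Integer division: partial chunks do not count.
    return max(available // per_chunk, 0)


# 16k window, 200-token chunks, 1,000 tokens of wrapping per chunk
no_headroom = max_chunks_for_budget(16_000, 200, wrapper_tokens=1_000)
with_headroom = max_chunks_for_budget(
    16_000, 200, wrapper_tokens=1_000, reserved_tokens=2_000  # assumed reserve
)
print(no_headroom, with_headroom)
```

Token counts depend on the tokenizer, so treat the result as an upper bound and verify it against real prompts.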

The gap between k and top_n reflects reranking value. A large gap (k=100 → top_n=5) means the reranker is doing aggressive filtering. A small gap (k=20 → top_n=10) means your initial retrieval was already fairly precise. Small gaps may indicate the reranker isn't adding much value.

# Tuning script: sweep k and top_n for one labeled query
def tune_retrieval_params(query, relevant_doc_ids, vectorstore, reranker,
                          k_values=(20, 50, 100, 200), top_n_values=(5, 10, 20)):
    # rerank_documents(query, docs, reranker, top_n=...) is assumed to be defined
    # elsewhere and to return the top_n documents in descending score order.
    relevant = set(relevant_doc_ids)
    if not relevant:
        raise ValueError("relevant_doc_ids must not be empty")

    results = {}
    for k in k_values:
        # Initial retrieval
        raw_results = vectorstore.similarity_search(query, k=k)
        raw_ids = [doc.metadata.get('chunk_id') for doc in raw_results]

        # Recall of the initial candidate set
        initial_recall = len(set(raw_ids) & relevant) / len(relevant)

        for top_n in top_n_values:
            # Rerank and keep the top_n candidates
            reranked = rerank_documents(query, raw_results, reranker, top_n=top_n)
            reranked_ids = [chunk.metadata.get('chunk_id') for chunk in reranked]

            # Recall after reranking
            reranked_recall = len(set(reranked_ids) & relevant) / len(relevant)

            # Reciprocal rank of the first relevant chunk for this query;
            # average it over all evaluation queries to get Mean Reciprocal Rank.
            mrr = 0.0
            for rank, doc_id in enumerate(reranked_ids, start=1):
                if doc_id in relevant:
                    mrr = 1 / rank
                    break

            results[(k, top_n)] = {
                'initial_recall': initial_recall,
                'reranked_recall': reranked_recall,
                'mrr': mrr
            }
    return results
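The tuning script calls rerank_documents without defining it. One possible shape for that helper is sketched below, assuming the reranker exposes predict(pairs) as a sentence-transformers CrossEncoder does (LangChain's HuggingFaceCrossEncoder exposes score(pairs) instead). StubDocument and WordOverlapReranker are hypothetical stand-ins so the example runs without a model.

```python
from dataclasses import dataclass, field


def rerank_documents(query, documents, reranker, top_n=5):
    """Score each (query, document) pair and return the top_n documents, best first."""
    if not documents:
        return []
    pairs = [[query, doc.page_content] for doc in documents]
    scores = reranker.predict(pairs)  # assumed interface: one score per pair
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


# --- Stand-ins for illustration only (not part of any library) ---
@dataclass
class StubDocument:
    page_content: str
    metadata: dict = field(default_factory=dict)


class WordOverlapReranker:
    """Toy scorer: counts query words that also appear in the document text."""

    def predict(self, pairs):
        return [len(set(q.lower().split()) & set(t.lower().split())) for q, t in pairs]


docs = [
    StubDocument("Office parking rules", {"chunk_id": "a"}),
    StubDocument("The vacation policy grants 20 days", {"chunk_id": "b"}),
    StubDocument("Vacation requests need approval", {"chunk_id": "c"}),
]
top = rerank_documents("vacation policy", docs, WordOverlapReranker(), top_n=2)
top_ids = [doc.metadata["chunk_id"] for doc in top]
print(top_ids)
```

Swapping in a real cross-encoder only requires an object with the same predict method.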

LlamaIndex Integration

LlamaIndex provides post-processors for reranking:

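# Import paths for llama-index 0.10+; older releases used llama_index.postprocessor,
# llama_index.retrievers, and llama_index.query_engine without the .core segment.
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine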

# Define retriever (index is assumed to be an existing VectorStoreIndex)
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=100  # High initial k
)

# Define reranker post-processor
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-base",
    top_n=10,  # Final selection
    device="cuda"  # assumes a GPU is available; use "cpu" otherwise
)

# Combine into query engine
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker]
)

# Query
response = query_engine.query("Your query here")

Custom Pipeline Integration

For full control, implement a custom pipeline:

class RerankingRetrievalPipeline:
    def __init__(self, vectorstore, reranker, embedder):
        self.vectorstore = vectorstore
        # reranker must expose predict(pairs), e.g. a sentence-transformers CrossEncoder
        self.reranker = reranker
        # embedder is stored for callers; the methods below do not use it
        self.embedder = embedder

    def retrieve(self, query, initial_k=100, final_k=20):
        """Return up to final_k (document, score) tuples, highest score first."""
        # Step 1: Initial vector retrieval
        raw_results = self.vectorstore.similarity_search(query, k=initial_k)
        if not raw_results:
            return []

        # Step 2: Score every (query, document) pair with the cross-encoder
        scores = self.reranker.predict([
            [query, doc.page_content] for doc in raw_results
        ])

        # Step 3: Sort by score and keep the top final_k
        scored = sorted(zip(raw_results, scores), key=lambda pair: pair[1], reverse=True)
        return scored[:final_k]

    def retrieve_with_confidence(self, query, initial_k=100, final_k=20, score_threshold=0.5):
        results = self.retrieve(query, initial_k, final_k)

        # Drop low-scoring results. The score scale is model-specific,
        # so choose the threshold from labeled data rather than assuming 0.5 fits.
        if score_threshold is not None:
            results = [
                (doc, score) for doc, score in results
                if score > score_threshold
            ]

        return results

Failure Modes

Reranker too slow for interactive use. If cross-encoder inference exceeds your latency budget, consider a smaller model (MiniLM-L-6 instead of L-12), a quantized model, GPU acceleration, or caching reranker scores for repeated queries.
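The caching option can be sketched as a thin wrapper around the reranker. This is a minimal in-memory version under stated assumptions: the wrapped reranker exposes predict(pairs), the CachedReranker class and its max_entries parameter are hypothetical names, and CountingReranker is a stub used only to show that the second call skips inference.

```python
import hashlib


class CachedReranker:
    """Wrap a reranker so repeated (query, text) pairs skip model inference."""

    def __init__(self, reranker, max_entries=10_000):
        self.reranker = reranker
        self.max_entries = max_entries
        self._cache = {}  # dicts keep insertion order, so the oldest entry is first

    @staticmethod
    def _key(query, text):
        # Hash the pair so long documents do not bloat the cache keys.
        return hashlib.sha256(f"{query}\x00{text}".encode("utf-8")).hexdigest()

    def predict(self, pairs):
        keys = [self._key(query, text) for query, text in pairs]

        # Collect each uncached pair once, preserving order.
        missing = {}
        for key, pair in zip(keys, pairs):
            if key not in self._cache and key not in missing:
                missing[key] = pair

        fresh = {}
        if missing:
            scores = self.reranker.predict(list(missing.values()))
            fresh = {key: float(score) for key, score in zip(missing, scores)}

        # Build the result before eviction so every key is still resolvable.
        results = [fresh[key] if key in fresh else self._cache[key] for key in keys]

        self._cache.update(fresh)
        while len(self._cache) > self.max_entries:
            self._cache.pop(next(iter(self._cache)))  # evict the oldest entry
        return results


class CountingReranker:
    """Stub model: scores by text length and counts how many pairs it scored."""

    def __init__(self):
        self.scored = 0

    def predict(self, pairs):
        self.scored += len(pairs)
        return [float(len(text)) for _, text in pairs]


base = CountingReranker()
cached = CachedReranker(base)
pairs = [["q", "alpha"], ["q", "be"]]
first = cached.predict(pairs)
second = cached.predict(pairs)  # served entirely from the cache
print(first, second, base.scored)
```

A cache like this only helps when the same query and chunk text recur, and it should be cleared whenever the reranker model or the indexed content changes.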

Over-filtering. Aggressive reranking (a small final_k) may drop relevant documents that score lower only because their wording differs from the query. Always measure recall on a labeled evaluation set, not just precision.

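Trusting scores across queries. A reranker score of 0.9 does not mean 90% relevance, and the same score can mean different things for different queries. Scores are ordinal: use them to rank candidates within a query, and calibrate them against labeled data before using them for hard classification.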
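If a hard cutoff is unavoidable, one simple stand-in for calibration is to choose the threshold from labeled data instead of guessing it. The sketch below picks the cutoff that maximizes F1 over labeled (score, is_relevant) examples. The function name and the sample scores are made up for illustration, and this is not a full probability calibration method.

```python
def pick_threshold(scores, labels):
    """Return (threshold, f1) for the cutoff that maximizes F1 on labeled data.

    A document is predicted relevant when its score is >= threshold.
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, l in zip(predicted, labels) if p and l)
        fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
        fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
        if tp == 0:
            continue  # F1 is zero or undefined without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Made-up raw reranker scores with human relevance labels
sample_scores = [2.1, 0.3, -1.5, 1.2, -0.4, 0.9]
sample_labels = [True, False, False, True, False, True]
threshold, f1 = pick_threshold(sample_scores, sample_labels)
print(threshold, f1)
```

A threshold chosen this way is specific to the model and the evaluation set, so it should be re-derived whenever either changes.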