05. Reranking Pipeline
This chapter integrates reranking into a complete retrieval pipeline, covering configuration, tuning, and common integration patterns with LangChain, LlamaIndex, and custom implementations.
Adding Reranking to LangChain
LangChain's cross-encoder reranker wraps any retriever built on a LangChain vector store, using ContextualCompressionRetriever to apply the reranker to the retrieved documents:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Base vectorstore (from Part 1)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Cross-encoder model that scores (query, document) pairs
reranker = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")

# Document compressor that keeps the top 5 documents after reranking
compressor = CrossEncoderReranker(model=reranker, top_n=5)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 50}),
)

# Queries now go through the retrieval -> reranking pipeline
results = retriever.invoke("What is the vacation policy?")
Two parameters control the pipeline: k=50 in the base retriever's search_kwargs sets how many initial candidates are retrieved, and top_n=5 sets how many of them the reranker keeps.
Tuning k Parameters
The initial retrieval k and final selection top_n require tuning for your use case. Guidelines:
Initial k should be high enough to capture relevant content. If your documents are information-dense, relevant content may be spread across many chunks. Start with k=100 and measure recall on your evaluation set.
Final top_n should match your LLM context budget. If your LLM accepts 16k tokens and each 200-token chunk is wrapped with about 1,000 tokens of surrounding context, each result costs roughly 1,200 tokens. About 13 results would fill the window, so 10-12 fit once you leave headroom for the system prompt and query.
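The budget arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name `max_chunks_in_context` and the 2,000-token reserve for the system prompt, query, and answer are assumptions made for illustration, not values tied to any particular model.

```python
def max_chunks_in_context(context_window, chunk_tokens, wrapper_tokens, reserved_tokens):
    """Return how many wrapped chunks fit after reserving prompt/answer headroom."""
    per_chunk = chunk_tokens + wrapper_tokens
    if per_chunk <= 0:
        raise ValueError("per-chunk token cost must be positive")
    available = context_window - reserved_tokens
    # Floor division: a partially fitting chunk does not count
    return max(available // per_chunk, 0)


# 16k window, 200-token chunks each wrapped with 1,000 tokens of context,
# and a hypothetical 2,000-token reserve for prompt, query, and answer
top_n_budget = max_chunks_in_context(16_000, 200, 1_000, 2_000)
print(top_n_budget)  # 11
```

Treat the result as an upper bound for top_n; token counts from a real tokenizer will vary from chunk to chunk.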
The gap between k and top_n reflects how much filtering the reranker does. A large gap (k=100 → top_n=5) means the reranker is filtering aggressively. A small gap (k=20 → top_n=10) means it mostly reorders candidates rather than removing them. If initial retrieval was already fairly precise, check whether the reordering improves your metrics enough to justify the added latency.
# Tuning script to find optimal k values.
# Assumes rerank_documents(query, docs, reranker, top_n) is a helper that returns
# the top_n documents ordered by reranker score.
def tune_retrieval_params(query, relevant_doc_ids, vectorstore, reranker,
                          k_values=(20, 50, 100, 200), top_n_values=(5, 10, 20)):
    relevant = set(relevant_doc_ids)
    if not relevant:
        raise ValueError("relevant_doc_ids must not be empty")
    results = {}
    for k in k_values:
        # Initial retrieval
        raw_results = vectorstore.similarity_search(query, k=k)
        raw_ids = [doc.metadata.get('chunk_id') for doc in raw_results]
        # Measure initial recall
        initial_recall = len(set(raw_ids) & relevant) / len(relevant)
        for top_n in top_n_values:
            # Rerank
            reranked = rerank_documents(query, raw_results, reranker, top_n=top_n)
            reranked_ids = [chunk.metadata.get('chunk_id') for chunk in reranked]
            # Measure reranked recall
            reranked_recall = len(set(reranked_ids) & relevant) / len(relevant)
            # Reciprocal rank of the first relevant result for this query;
            # average it over many queries to get Mean Reciprocal Rank (MRR)
            mrr = 0.0
            for i, doc_id in enumerate(reranked_ids):
                if doc_id in relevant:
                    mrr = 1 / (i + 1)
                    break
            results[(k, top_n)] = {
                'initial_recall': initial_recall,
                'reranked_recall': reranked_recall,
                'mrr': mrr
            }
    return results
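The tuning script calls rerank_documents, which this chapter does not define. One possible sketch follows. It assumes the reranker exposes a `predict(pairs)` method returning one score per `[query, text]` pair (as the sentence-transformers `CrossEncoder` does) and that documents carry their text in `page_content`. `Doc` and `KeywordOverlapReranker` are stand-ins invented here so the example runs without downloading a model.

```python
from dataclasses import dataclass, field


def rerank_documents(query, documents, reranker, top_n=5):
    """Score each document against the query and return the top_n best."""
    if not documents:
        return []
    pairs = [[query, doc.page_content] for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


@dataclass
class Doc:
    """Hypothetical stand-in for a vector store document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordOverlapReranker:
    """Toy scorer: fraction of query words that appear in the text."""

    def predict(self, pairs):
        scores = []
        for query, text in pairs:
            query_words = set(query.lower().split())
            text_words = set(text.lower().split())
            scores.append(len(query_words & text_words) / max(len(query_words), 1))
        return scores


docs = [
    Doc("Office hours are nine to five", {"chunk_id": "a"}),
    Doc("The vacation policy grants twenty days", {"chunk_id": "b"}),
    Doc("Vacation requests need manager approval", {"chunk_id": "c"}),
]
top = rerank_documents("vacation policy", docs, KeywordOverlapReranker(), top_n=2)
print([d.metadata["chunk_id"] for d in top])  # ['b', 'c']
```

Swapping the toy scorer for a real cross-encoder should only require passing a different object with the same `predict` method.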
LlamaIndex Integration
LlamaIndex provides post-processors for reranking:
# Import paths for llama-index 0.10+; earlier releases omit the ".core" segment
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# Define retriever (index is an existing VectorStoreIndex built elsewhere)
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=100,  # High initial k
)

# Define reranker post-processor
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-base",
    top_n=10,  # Final selection
    device="cuda",  # Use "cpu" if no GPU is available
)

# Combine into query engine
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker],
)

# Query
response = query_engine.query("Your query here")
Custom Pipeline Integration
For full control, implement a custom pipeline:
class RerankingRetrievalPipeline:
    def __init__(self, vectorstore, reranker, embedder):
        self.vectorstore = vectorstore
        self.reranker = reranker  # must expose predict(pairs) -> scores
        self.embedder = embedder  # stored but not used by the methods below

    def retrieve(self, query, initial_k=100, final_k=20):
        # Step 1: Initial vector retrieval
        raw_results = self.vectorstore.similarity_search(query, k=initial_k)
        if not raw_results:
            return []
        # Step 2: Rerank by scoring each (query, document) pair
        scores = self.reranker.predict([
            [query, doc.page_content] for doc in raw_results
        ])
        # Step 3: Select top-k as (document, score) tuples, best first
        scored = list(zip(raw_results, scores))
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored[:final_k]

    def retrieve_with_confidence(self, query, initial_k=100, final_k=20, score_threshold=0.5):
        results = self.retrieve(query, initial_k, final_k)
        # Filter by confidence unless the threshold is disabled with None
        if score_threshold is not None:
            results = [
                (doc, score) for doc, score in results
                if score > score_threshold
            ]
        return results
Failure Modes
Reranker too slow for interactive use. If cross-encoder inference exceeds your latency budget, options include a smaller model (MiniLM-L-6 instead of L-12), a quantized model, GPU acceleration, or caching reranker scores for repeated queries.
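Caching is the simplest of these options to sketch. The wrapper below memoizes scores keyed on the (query, text) pair, so repeated queries skip inference for pairs already seen. `CachedReranker` is a hypothetical name, and it assumes the wrapped model exposes `predict(pairs)` returning one score per pair. `CountingReranker` is a stub used only to show the cache working.

```python
class CachedReranker:
    """Wrap a reranker and memoize scores per (query, text) pair."""

    def __init__(self, reranker):
        self.reranker = reranker
        self._cache = {}

    def predict(self, pairs):
        keys = [(query, text) for query, text in pairs]
        # Collect cache misses once each, preserving order
        missing = list(dict.fromkeys(k for k in keys if k not in self._cache))
        if missing:
            new_scores = self.reranker.predict([list(k) for k in missing])
            for key, score in zip(missing, new_scores):
                self._cache[key] = float(score)
        return [self._cache[k] for k in keys]


class CountingReranker:
    """Stub model: scores by text length and counts the pairs it scored."""

    def __init__(self):
        self.pairs_scored = 0

    def predict(self, pairs):
        self.pairs_scored += len(pairs)
        return [float(len(text)) for _, text in pairs]


model = CountingReranker()
cached = CachedReranker(model)
pairs = [["q", "short"], ["q", "a longer passage"]]
first = cached.predict(pairs)
second = cached.predict(pairs)  # served entirely from the cache
print(first == second, model.pairs_scored)  # True 2
```

The dictionary here grows without bound. A production cache would need an eviction policy and invalidation when the underlying documents change.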
Over-filtering. Aggressive reranking (small final_k) may filter out relevant documents that score lower because of surface-form differences. Always measure recall on a labeled evaluation set, not just precision.
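One way to spot over-filtering is to list the labeled-relevant chunks that were in the candidate set but were dropped by the final cut. A minimal sketch, assuming chunk IDs are available for both stages; the function name and sample IDs are illustrative.

```python
def dropped_relevant(candidate_ids, final_ids, relevant_ids):
    """Return relevant IDs that were retrieved initially but cut after reranking."""
    final = set(final_ids)
    relevant = set(relevant_ids)
    return [cid for cid in candidate_ids if cid in relevant and cid not in final]


candidates = ["c1", "c2", "c3", "c4", "c5", "c6"]  # initial retrieval
reranked_top3 = ["c4", "c1", "c6"]                 # after reranking, final_k=3
relevant = {"c1", "c5", "c9"}                      # labeled ground truth

print(dropped_relevant(candidates, reranked_top3, relevant))  # ['c5']
```

In this toy data c5 was lost at the reranking cut, which points toward raising final_k or inspecting the reranker. c9 never reached the reranker at all, which is an initial-retrieval miss and points toward raising k.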
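Trusting scores across queries. A reranker score of 0.9 doesn't mean 90% relevance, and the same score can indicate different levels of relevance for different queries. Scores are ordinal: use them for ranking within a query, not for hard classification without calibration.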
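If a hard cutoff such as the score_threshold in retrieve_with_confidence is needed, one simple form of calibration is to choose it from labeled data instead of assuming 0.5. The sketch below picks the cutoff that maximizes F1 over labeled (score, is_relevant) pairs. The function name and the sample scores are made up for illustration, and F1 is only one possible objective.

```python
def best_threshold(scored_labels):
    """Pick the score cutoff that maximizes F1 on labeled (score, is_relevant) pairs."""
    total_relevant = sum(1 for _, relevant in scored_labels if relevant)
    best_cutoff, best_f1 = None, 0.0
    # Try every observed score as a candidate cutoff (keep scores >= cutoff)
    for cutoff in sorted({score for score, _ in scored_labels}):
        kept = [relevant for score, relevant in scored_labels if score >= cutoff]
        true_pos = sum(kept)
        if true_pos == 0:
            continue
        precision = true_pos / len(kept)
        recall = true_pos / total_relevant
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_cutoff, best_f1 = cutoff, f1
    return best_cutoff, best_f1


# Hypothetical labeled reranker outputs pooled from an evaluation set
labeled = [(0.95, True), (0.90, False), (0.70, True),
           (0.40, True), (0.35, False), (0.10, False)]
cutoff, f1 = best_threshold(labeled)
print(cutoff, round(f1, 2))  # 0.4 0.86
```

On this toy data the best cutoff is 0.4, not 0.5. A threshold chosen this way is specific to the model and the data it was tuned on.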
Take the retrieval pipeline from Part 1 and add the reranking components from this chapter. Run a simple evaluation that measures recall with initial k=100 and final k=20, then compare the results against a baseline without reranking.
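A starting point for this evaluation could look like the following sketch. It assumes each pipeline can be wrapped as a function that maps a query to a ranked list of chunk IDs; `compare_pipelines`, the toy rankings, and the labels are all hypothetical placeholders for your own data.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant IDs found in the first k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def compare_pipelines(eval_set, baseline_fn, reranked_fn, final_k=20):
    """Average recall@final_k of two retrieval functions over a labeled set.

    eval_set is a list of (query, relevant_ids) pairs; each function maps a
    query to a ranked list of chunk IDs.
    """
    totals = {"baseline": 0.0, "reranked": 0.0}
    for query, relevant_ids in eval_set:
        totals["baseline"] += recall_at_k(baseline_fn(query), relevant_ids, final_k)
        totals["reranked"] += recall_at_k(reranked_fn(query), relevant_ids, final_k)
    n = max(len(eval_set), 1)
    return {name: total / n for name, total in totals.items()}


# Toy rankings standing in for real pipeline output
baseline_rankings = {"q1": ["a", "b", "c", "d"], "q2": ["e", "f", "g", "h"]}
reranked_rankings = {"q1": ["c", "a", "b", "d"], "q2": ["h", "e", "f", "g"]}
eval_set = [("q1", ["c"]), ("q2", ["h", "e"])]

report = compare_pipelines(eval_set, baseline_rankings.get,
                           reranked_rankings.get, final_k=2)
print(report)  # {'baseline': 0.25, 'reranked': 1.0}
```

For the exercise, the baseline function would return the first 20 IDs from vector search alone, and the reranked function would retrieve 100 candidates and return the 20 best after reranking.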