How to Add Reranking to Your RAG Pipeline
Prerequisites: a running RAG pipeline, with sentence-transformers installed. The steps below also use LangChain, Chroma, and a local Ollama server.
What this does
Reranking refines the initial retrieval pass by re-scoring candidate documents with a cross-encoder model that evaluates query-document pairs jointly. This step boosts precision by promoting documents that are genuinely relevant while demoting superficially similar ones. The result is a tighter context window and more accurate answers.
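The two-stage idea can be illustrated without any model at all. The sketch below is a toy: `toy_pair_score` is a hypothetical stand-in that counts shared words, not a cross-encoder, and the candidate strings are invented for illustration. It only shows the shape of the process: score each query-document pair, then sort best-first.

```python
def toy_pair_score(query: str, document: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)


query = "configure retrieval settings"

# Pretend this is the order returned by the first-pass vector search.
candidates = [
    "retrieval is a common topic in search systems",
    "to configure retrieval settings edit the config file",
    "settings for the display can be changed in preferences",
]

# Score every (query, document) pair, then sort by score, highest first.
scored = [(toy_pair_score(query, doc), doc) for doc in candidates]
reranked = [doc for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)]

print(reranked[0])
# prints: to configure retrieval settings edit the config file
```

A real cross-encoder replaces the word-overlap function with a model that reads the query and the document together, which is what lets it tell a genuinely relevant passage from one that merely shares vocabulary.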
Steps
Set up the initial vector store. Load and index documents as usual.

import os
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load the source text and split it into chunks of roughly 500 characters.
loader = TextLoader("context/guides.txt")
docs = loader.load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)

# Embed the chunks and index them in Chroma.
embeddings = OllamaEmbeddings(model="llama3")
db = Chroma.from_documents(chunks, embeddings)

Retrieve a broad candidate set. Retrieve more candidates than you will ultimately use.

query = "How do I configure retrieval settings?"
initial_results = db.similarity_search(query, k=20)
candidate_texts = [r.page_content for r in initial_results]

Load a cross-encoder reranker. The CrossEncoder scores each query-document pair.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

Score and re-rank candidates. Pair the query with each candidate for joint scoring.

pairs = [[query, text] for text in candidate_texts]
scores = reranker.predict(pairs)

# Sort candidate indices by score, highest first, and keep the five best.
ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
reranked = [candidate_texts[i] for i in ranked_indices[:5]]

Pass reranked chunks to the LLM. Use only the top results as context.

from langchain_ollama import ChatOllama
from langchain.prompts import PromptTemplate

llm = ChatOllama(model="llama3")
prompt = PromptTemplate.from_template(
    "Context: {context}\n\nQuestion: {question}\n\nAnswer:"
)

# Use the three highest-scoring chunks as context.
context = "\n\n".join(reranked[:3])
result = llm.invoke(prompt.format(context=context, question=query))
print(result.content)

Expected output: a precise answer drawn from the most relevant reranked chunks.
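The scoring and sorting step can be wrapped in a small helper so it is easy to reuse and to test without downloading a model. This is a minimal sketch, not part of any library: `rerank` and `fake_predict` are hypothetical names, and the helper assumes only that the scorer takes a list of [query, text] pairs and returns one score per pair, as `CrossEncoder.predict` does.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    texts: Sequence[str],
    predict: Callable[[List[List[str]]], Sequence[float]],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_n (text, score) pairs, highest score first."""
    if not texts:
        return []
    pairs = [[query, text] for text in texts]
    scores = predict(pairs)
    # Sort on the raw score; negative values are fine, higher is still better.
    ranked = sorted(zip(texts, scores), key=lambda item: item[1], reverse=True)
    return [(text, float(score)) for text, score in ranked[:top_n]]


def fake_predict(pairs: List[List[str]]) -> List[float]:
    """Stand-in scorer for illustration only: longer documents score higher."""
    return [float(len(document)) for _, document in pairs]


top = rerank("q", ["aa", "a", "aaaa"], fake_predict, top_n=2)
print(top)
# prints: [('aaaa', 4.0), ('aa', 2.0)]
```

With the objects from the steps above, the call would look like `rerank(query, candidate_texts, reranker.predict, top_n=3)`. Returning the scores alongside the texts makes it easier to log why a chunk was selected.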
Verification
python -c "
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([['query', 'document']])
print(len(scores) == 1)
# Expected: True
"
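The command above confirms only that the model loads and returns one score per pair. A slightly stronger check, sketched here under stated assumptions, is that a clearly relevant passage outscores an unrelated one. `prefers_relevant` and `stub_predict` are hypothetical names, the example sentences are invented, and the stub lets the snippet run without the model installed.

```python
def prefers_relevant(predict) -> bool:
    """Return True if the scorer ranks a relevant passage above an unrelated one."""
    question = "How do I reset my password?"
    pairs = [
        [question, "Click 'Forgot password' on the login page to reset your password."],
        [question, "Bananas are rich in potassium."],
    ]
    relevant_score, unrelated_score = predict(pairs)
    return bool(relevant_score > unrelated_score)


def stub_predict(pairs):
    """Stand-in scorer for illustration only; not a real cross-encoder."""
    return [1.0 if "password" in document else -1.0 for _, document in pairs]


print(prefers_relevant(stub_predict))
# prints: True
```

To run the same check against the real model, pass `reranker.predict` in place of `stub_predict`. A working reranker would be expected to return True here, though that depends on the model and is not guaranteed by this sketch.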
Common failures
- Cross-encoder model not downloaded. On first run, the model downloads automatically; ensure internet access.
- Too many candidates causing latency. Limit initial retrieval to 20-50 documents; the cross-encoder runs one forward pass per query-document pair, so reranking cost grows linearly with the candidate count.
- Negative scores causing wrong ranking. Cross-encoder scores are relative; sort by descending value, not absolute magnitude.
- Reranking hurting diversity. Keep the top-k selection, but let lower-ranked candidates in when their scores are close to the cutoff.
- Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
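Two of the failures above, negative scores and lost diversity, can be shown with plain numbers. The sketch below is one possible reading of "allow secondary candidates to enter when scores are close", not a prescribed method: `select_with_margin` and its `margin` parameter are hypothetical, and the scores are made up for illustration.

```python
def select_with_margin(texts, scores, top_n=3, margin=0.5):
    """Keep the top_n texts, plus any others scoring within `margin` of the cutoff."""
    # Sort on the raw score, highest first. Negative scores need no special handling.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if len(order) <= top_n:
        return [texts[i] for i in order]
    cutoff = scores[order[top_n - 1]]  # score of the last guaranteed pick
    keep = [
        i for rank, i in enumerate(order)
        if rank < top_n or cutoff - scores[i] <= margin
    ]
    return [texts[i] for i in keep]


texts = ["a", "b", "c", "d", "e"]
scores = [-2.1, 3.4, -0.3, -0.5, -7.0]

selected = select_with_margin(texts, scores, top_n=2, margin=0.5)
print(selected)
# prints: ['b', 'c', 'd']

# The mistake to avoid: sorting by absolute magnitude promotes the worst candidate.
wrong_order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
wrong_top = texts[wrong_order[0]]
print(wrong_top)
# prints: e
```

Here "d" is admitted because its score is within 0.5 of the cutoff set by "c", while "a" is not. A suitable margin depends on the score range of the model in use, so treat the default as a placeholder.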