What this does

Reranking refines the initial retrieval pass by re-scoring candidate documents with a cross-encoder model that evaluates each query-document pair jointly. This improves precision: documents that actually answer the query move up, and documents that merely share vocabulary with it move down. The LLM then receives a smaller, more relevant context, which generally leads to more accurate answers.
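
The two-stage shape can be illustrated without any models. The sketch below is only a toy: both scoring functions are hypothetical stand-ins (word overlap for the vector search, adjacent word pairs for the cross-encoder), chosen to show how a second, more careful pass can reorder candidates that the first pass could not tell apart.

```python
import re

def tokens(text: str) -> list[str]:
    """Lowercase word tokens with punctuation removed."""
    return re.findall(r"\w+", text.lower())

def first_pass_score(query: str, doc: str) -> int:
    """Stand-in for vector similarity: shared words, ignoring order."""
    return len(set(tokens(query)) & set(tokens(doc)))

def second_pass_score(query: str, doc: str) -> int:
    """Stand-in for a cross-encoder: shared adjacent word pairs, so order matters."""
    def bigrams(words):
        return set(zip(words, words[1:]))
    return len(bigrams(tokens(query)) & bigrams(tokens(doc)))

toy_query = "configure retrieval settings"
toy_docs = [
    "Retrieval settings are in the appendix; configure nothing here.",
    "To configure retrieval settings, edit the retriever section.",
    "Billing questions are answered in the FAQ.",
]

# Stage 1: keep the two best candidates by the cheap score.
stage_one = sorted(
    toy_docs, key=lambda d: first_pass_score(toy_query, d), reverse=True
)[:2]

# Stage 2: re-score only those candidates with the more careful scorer.
stage_two = sorted(
    stage_one, key=lambda d: second_pass_score(toy_query, d), reverse=True
)

print("first pass top:", stage_one[0])
print("reranked top:  ", stage_two[0])
```

Both of the first two documents share the same three words with the query, so the first pass leaves them in their original order; only the second pass promotes the one that uses the words in the queried order.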

Steps

Set up the initial vector store. Load and index documents as usual.

import os
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

loader = TextLoader("context/guides.txt")
docs = loader.load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
embeddings = OllamaEmbeddings(model="llama3")
db = Chroma.from_documents(chunks, embeddings)

Retrieve a broad candidate set. Retrieve more candidates than you will ultimately use.

query = "How do I configure retrieval settings?"
initial_results = db.similarity_search(query, k=20)
candidate_texts = [r.page_content for r in initial_results]
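
A broad candidate set can contain repeated chunks, for instance if the same documents were indexed more than once. Each duplicate costs a reranker pass and can crowd out other results, so it may be worth removing them first. The helper below is a minimal sketch; `dedupe_preserving_order` and the sample strings are hypothetical and not part of any library.

```python
def dedupe_preserving_order(texts):
    """Drop repeated texts, comparing them after whitespace normalisation."""
    seen = set()
    unique = []
    for text in texts:
        key = " ".join(text.split())  # collapse runs of whitespace
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

# Hypothetical candidates; the second differs from the first only in whitespace.
sample_candidates = [
    "Set k in the retriever config.",
    "Set k  in the retriever config.\n",
    "Chunk size is controlled by the splitter.",
]
unique_candidates = dedupe_preserving_order(sample_candidates)
print(len(unique_candidates))
```

In the guide's flow this would be applied as `candidate_texts = dedupe_preserving_order(candidate_texts)` before scoring.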

Load a cross-encoder reranker. The CrossEncoder scores each query-document pair.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

Score and re-rank candidates. Pair the query with each candidate for joint scoring.

pairs = [[query, text] for text in candidate_texts]
scores = reranker.predict(pairs)
ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
reranked = [candidate_texts[i] for i in ranked_indices[:5]]
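
The scoring and sorting above can be wrapped in a small function that keeps each score next to its text, which helps when inspecting why a chunk was chosen. This is a sketch under assumptions: `rerank` and `toy_predict` are hypothetical names, and `toy_predict` is a word-overlap stand-in so the example runs without downloading a model.

```python
def rerank(query, texts, predict, top_n=5):
    """Score (query, text) pairs with `predict` and return the best top_n
    as (score, text) tuples, highest score first."""
    if not texts:
        return []
    pairs = [[query, text] for text in texts]
    scores = [float(s) for s in predict(pairs)]
    ranked = sorted(zip(scores, texts), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_n]

def toy_predict(pairs):
    """Hypothetical scorer: number of query words that appear in the text."""
    return [
        len(set(q.lower().split()) & set(t.lower().split()))
        for q, t in pairs
    ]

top = rerank(
    "retrieval settings",
    ["billing faq", "retrieval settings live in config", "settings overview"],
    toy_predict,
    top_n=2,
)
for score, text in top:
    print(score, text)
```

With the real model, the same helper should work as `rerank(query, candidate_texts, reranker.predict)`, since `predict` accepts a list of pairs and returns one score per pair.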

Pass reranked chunks to the LLM. Use only the top three reranked chunks as context.

from langchain_ollama import ChatOllama
from langchain.prompts import PromptTemplate

llm = ChatOllama(model="llama3")
prompt = PromptTemplate.from_template(
    "Context: {context}\n\nQuestion: {question}\n\nAnswer:"
)
context = "\n\n".join(reranked[:3])
result = llm.invoke(prompt.format(context=context, question=query))
print(result.content)

Expected output: a precise answer drawn from the most relevant reranked chunks.
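
Instead of a fixed number of chunks, the context can be capped by size. The sketch below joins chunks in ranked order until a character budget is reached; `build_context` and the `max_chars` default are assumptions for illustration, and a token-based budget would be more exact for a given model.

```python
def build_context(chunks, max_chars=2000, separator="\n\n"):
    """Join chunks in ranked order, stopping before the budget is exceeded."""
    selected = []
    used = 0
    for chunk in chunks:
        # Account for the separator that join() adds between chunks.
        extra = len(chunk) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk)
        used += extra
    return separator.join(selected)

# Hypothetical ranked chunks of 40 characters each.
ranked_chunks = ["a" * 40, "b" * 40, "c" * 40]
context_text = build_context(ranked_chunks, max_chars=90)
print(len(context_text))
```

Because the loop stops at the first chunk that does not fit, the best-ranked chunks are always the ones kept.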

Verification

python -c "
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([['query', 'document']])
print(len(scores) == 1)
# Expected: True
"

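The command above only confirms that the model loads and returns one score. A slightly stronger check is that an on-topic passage scores above an off-topic one. The sketch below assumes nothing about the model: `ranks_relevant_first` is a hypothetical helper that takes any `predict` callable, and `toy_predict` is a word-overlap stand-in so the example runs offline.

```python
def ranks_relevant_first(predict, query, relevant, irrelevant):
    """Return True if `predict` scores the relevant passage above the irrelevant one."""
    relevant_score, irrelevant_score = predict(
        [[query, relevant], [query, irrelevant]]
    )
    return float(relevant_score) > float(irrelevant_score)

def toy_predict(pairs):
    """Hypothetical scorer: number of query words that appear in the text."""
    return [
        len(set(q.lower().split()) & set(t.lower().split()))
        for q, t in pairs
    ]

check = ranks_relevant_first(
    toy_predict,
    "how to configure retrieval",
    "configure retrieval in the settings file",
    "the cafeteria opens at noon",
)
print(check)
```

To run the check against the real reranker, pass `reranker.predict` in place of `toy_predict`.
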
Common failures

Cross-encoder model not downloaded. On first run, sentence-transformers downloads the model automatically, so the machine needs internet access; later runs use the locally cached copy.
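Too many candidates causing latency. Limit initial retrieval to 20-50 documents; the cross-encoder runs one forward pass per query-document pair, so reranking time grows in proportion to the number of candidates and is far slower per document than a vector lookup.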
Negative scores causing wrong ranking. Cross-encoder scores can be negative and are only meaningful relative to one another; sort by descending value, not by absolute magnitude.
Reranking hurting diversity. Keep the top-k selection, but let lower-ranked candidates in when their scores are close to the cutoff.
Version mismatch. The installed package or runtime differs from the one the command assumes; check the installed version first, then rerun the smallest verification command.
Local environment drift. A different service, virtual environment, model, or path is in use; print the active binary path and configuration before changing the guide steps.
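
Two of the failures above, sorting negative scores and losing near-tied candidates at the cutoff, can be illustrated together. This is one possible reading of "scores are close", not a standard method: `top_k_with_near_ties`, `margin`, and `max_extra` are hypothetical names, and the scores are made up for the example.

```python
def top_k_with_near_ties(scores, k=3, margin=0.1, max_extra=2):
    """Return indices of the top-k scores, plus up to max_extra runners-up
    whose score is within `margin` of the k-th best score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = order[:k]
    if len(order) <= k:
        return chosen
    cutoff = scores[chosen[-1]] - margin
    extras = [i for i in order[k:] if scores[i] >= cutoff][:max_extra]
    return chosen + extras

# Made-up scores; negative values are ordinary and simply rank lower.
example_scores = [-7.2, 3.1, -0.5, -0.55, 2.9, -4.0]

# Correct: descending by value. Index 3 is admitted as a near tie with index 2.
picked = top_k_with_near_ties(example_scores, k=3, margin=0.1)

# Incorrect: sorting by absolute magnitude puts the worst score (-7.2) first.
by_magnitude = sorted(
    range(len(example_scores)),
    key=lambda i: abs(example_scores[i]),
    reverse=True,
)

print(picked)
print(by_magnitude[0])
```

A suitable `margin` depends on the score scale of the model in use, so it is best tuned against real queries rather than fixed in advance.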

How to Add Reranking to Your RAG Pipeline
