RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
HOW-TO · RAG

How to Add Reranking to Your RAG Pipeline

Intermediate · 20 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

RAG pipeline running, sentence-transformers installed

What this does

Reranking refines the initial retrieval pass by re-scoring candidate documents with a cross-encoder model that evaluates each query-document pair jointly, rather than comparing embeddings computed separately. This step improves precision by promoting documents that are genuinely relevant and demoting ones that are only superficially similar. The result is a smaller, more relevant context for the LLM and, typically, more accurate answers.
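
To make the two-stage idea concrete, here is a minimal sketch of the retrieve-then-rerank control flow. Both scorers are toy stand-ins invented for illustration: `word_overlap` plays the role of the fast vector search and `phrase_overlap` the role of the cross-encoder. Neither is part of this guide's pipeline, and the function name `retrieve_then_rerank` is hypothetical.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of words shared by query and document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_overlap(query: str, doc: str) -> float:
    """Toy second-stage scorer: adjacent query word pairs that appear verbatim."""
    words = query.lower().split()
    text = doc.lower()
    return float(sum(1 for a, b in zip(words, words[1:]) if f"{a} {b}" in text))


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage: Scorer,
    second_stage: Scorer,
    k_retrieve: int,
    k_final: int,
) -> List[str]:
    # Stage 1: a cheap scorer over the whole corpus yields a broad candidate set.
    candidates = sorted(corpus, key=lambda d: first_stage(query, d), reverse=True)
    candidates = candidates[:k_retrieve]
    # Stage 2: the more careful scorer runs only on those candidates.
    reranked = sorted(candidates, key=lambda d: second_stage(query, d), reverse=True)
    return reranked[:k_final]


corpus = [
    "settings menu where you configure display audio and retrieval of saved games",
    "edit top_k to configure retrieval settings",
    "bread recipes for beginners",
]
best = retrieve_then_rerank(
    "configure retrieval settings", corpus, word_overlap, phrase_overlap,
    k_retrieve=2, k_final=1,
)
print(best)
```

The first document shares as many words with the query as the second, so a word-level first stage cannot separate them; the second stage promotes the document that actually matches the phrasing.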

Steps

  1. Set up the initial vector store. Load and index documents as usual.

    from langchain_community.document_loaders import TextLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_ollama import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    
    # Ollama's default local endpoint, passed explicitly through base_url.
    OLLAMA_URL = "http://localhost:11434"
    
    loader = TextLoader("context/guides.txt")
    docs = loader.load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
    # llama3 works here, but a dedicated embedding model usually retrieves better.
    embeddings = OllamaEmbeddings(model="llama3", base_url=OLLAMA_URL)
    db = Chroma.from_documents(chunks, embeddings)
    
  2. Retrieve a broad candidate set. Retrieve more candidates than you will ultimately use.

    query = "How do I configure retrieval settings?"
    initial_results = db.similarity_search(query, k=20)
    candidate_texts = [r.page_content for r in initial_results]
    
  3. Load a cross-encoder reranker. The CrossEncoder scores each query-document pair.

    from sentence_transformers import CrossEncoder
    
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    
  4. Score and re-rank candidates. Pair the query with each candidate for joint scoring.

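    # Build [query, document] pairs so the model scores each pair jointly.
    pairs = [[query, text] for text in candidate_texts]
    scores = reranker.predict(pairs)
    # Sort candidate indices by score, highest first, and keep the top 5.
    ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    reranked = [candidate_texts[i] for i in ranked_indices[:5]]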
    
  5. Pass reranked chunks to the LLM. Use only the top results as context.

    from langchain_ollama import ChatOllama
    from langchain_core.prompts import PromptTemplate
    
    llm = ChatOllama(model="llama3")
    prompt = PromptTemplate.from_template(
        "Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    )
    # Only the three highest-scoring chunks go into the prompt.
    context = "\n\n".join(reranked[:3])
    result = llm.invoke(prompt.format(context=context, question=query))
    print(result.content)
    

    Expected output: a precise answer drawn from the most relevant reranked chunks.
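
Step 4 keeps only the chunk text, so source metadata is dropped before the prompt is built. If you want to cite sources, one option is a helper that reranks the retrieved objects themselves. The sketch below is an assumption-laden illustration, not the guide's method: `Doc` is a hypothetical stand-in that mirrors the `page_content` and `metadata` attributes of a LangChain document, `rerank_documents` is an invented name, and `toy_score_pairs` replaces the real model so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in with the same attribute names as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rerank_documents(
    query: str,
    docs: List[Doc],
    score_pairs: Callable[[List[List[str]]], Sequence[float]],
    top_n: int = 5,
) -> List[Tuple[Doc, float]]:
    """Return the top_n (document, score) tuples, highest score first."""
    if not docs:
        return []
    pairs = [[query, d.page_content] for d in docs]
    scores = score_pairs(pairs)  # one score per pair, in input order
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_n]]


def toy_score_pairs(pairs: List[List[str]]) -> List[float]:
    """Offline stand-in for a cross-encoder: counts words shared by each pair."""
    return [
        float(len(set(q.lower().split()) & set(t.lower().split())))
        for q, t in pairs
    ]


docs = [
    Doc("bread recipes", {"source": "a.txt"}),
    Doc("configure retrieval settings in config", {"source": "b.txt"}),
    Doc("retrieval basics", {"source": "c.txt"}),
]
top = rerank_documents("configure retrieval settings", docs, toy_score_pairs, top_n=2)
for doc, score in top:
    print(doc.metadata["source"], score)
```

With the objects from the steps above, the equivalent call would be `rerank_documents(query, initial_results, reranker.predict)`, since `reranker.predict` accepts a list of pairs and returns one score per pair.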

Verification

python -c "
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([['query', 'document']])
print(len(scores) == 1)
# Expected: True
"
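
The command above only confirms that the model loads and returns one score. A slightly stronger check is that an obviously relevant document outscores an obviously irrelevant one. The following is a sketch under stated assumptions: the helper name `relevant_outscores_irrelevant` and the sample texts are invented here, and `toy_score_pairs` is an offline stand-in so the snippet runs without downloading a model.

```python
from typing import Callable, List, Sequence


def relevant_outscores_irrelevant(
    score_pairs: Callable[[List[List[str]]], Sequence[float]],
    query: str,
    relevant: str,
    irrelevant: str,
) -> bool:
    """True when the scorer ranks the relevant text above the irrelevant one."""
    scores = score_pairs([[query, relevant], [query, irrelevant]])
    return float(scores[0]) > float(scores[1])


def toy_score_pairs(pairs: List[List[str]]) -> List[float]:
    """Offline stand-in for a cross-encoder: counts words shared by each pair."""
    return [
        float(len(set(q.lower().split()) & set(t.lower().split())))
        for q, t in pairs
    ]


ok = relevant_outscores_irrelevant(
    toy_score_pairs,
    query="how do I configure retrieval settings",
    relevant="configure retrieval settings in the config file",
    irrelevant="bread recipes for beginners",
)
print(ok)

# With the real model loaded as in step 3, pass its predict method instead:
#   relevant_outscores_irrelevant(reranker.predict, query, relevant, irrelevant)
```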

Common failures

  • Cross-encoder model not downloaded. On first run, the model downloads automatically; ensure internet access.
  • Too many candidates causing latency. Limit initial retrieval to 20-50 documents; reranking cost grows linearly with the candidate count because every query-document pair needs its own forward pass through the cross-encoder.
  • Negative scores causing wrong ranking. Cross-encoder scores can be raw, unbounded values, so negative numbers are normal. They are only meaningful relative to each other; sort by descending value, not absolute magnitude.
  • Reranking hurting diversity. The top results can be near-duplicates of one another. Keep the top-k cutoff, but when a distinct lower-ranked candidate scores close to a near-duplicate, let it take the duplicate's slot.
  • Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
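
Two of the failures above, negative scores and lost diversity, can be handled in a few lines. This is a minimal sketch, not a definitive implementation: `to_probability`, `jaccard`, and `select_diverse` are hypothetical helpers, word-level Jaccard similarity is an assumed stand-in for whatever duplicate test suits your data, and the `margin` and `max_sim` thresholds are arbitrary illustrative values.

```python
import math
from typing import List, Sequence


def to_probability(score: float) -> float:
    """Map a raw score to (0, 1) with a sigmoid; the ordering is unchanged."""
    return 1.0 / (1.0 + math.exp(-score))


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity, used here as a crude near-duplicate test."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def select_diverse(
    texts: List[str],
    scores: Sequence[float],
    top_n: int = 3,
    margin: float = 0.5,
    max_sim: float = 0.8,
) -> List[str]:
    """Pick top_n texts, letting close-scoring candidates replace near-duplicates."""
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    if len(order) <= top_n:
        return [texts[i] for i in order]
    cutoff = scores[order[top_n - 1]]
    # Only candidates within `margin` of the cutoff score may enter the result.
    pool = [i for i in order if scores[i] >= cutoff - margin]
    chosen: List[int] = []
    deferred: List[int] = []
    for i in pool:
        if any(jaccard(texts[i], texts[j]) > max_sim for j in chosen):
            deferred.append(i)  # near-duplicate of something already chosen
        else:
            chosen.append(i)
    # Backfill with deferred duplicates only if there are not enough distinct picks.
    final = (chosen + deferred)[:top_n]
    final.sort(key=lambda i: scores[i], reverse=True)
    return [texts[i] for i in final]


texts = [
    "set top_k in the retrieval config",
    "set top_k in the retrieval config file",
    "chunk size also affects retrieval quality",
    "unrelated note about logging",
]
scores = [4.2, 4.1, 3.9, -2.3]  # made-up raw scores, including a negative one

picked = select_diverse(texts, scores, top_n=2)
print(picked)
print([round(to_probability(s), 3) for s in scores])
```

Sorting by the raw value already handles negative scores correctly; the sigmoid is only useful if you want a bounded number for thresholds or logging.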

Related guides

  • How to Build a Basic RAG Pipeline with LangChain (build-basic-rag-pipeline-langchain)
  • How to Implement Hybrid Search RAG (BM25 + Vector) (implement-hybrid-search-rag-bm25-vector)