RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.coIndependently operated
HOW-TO · RAG

How to Evaluate RAG Pipeline Performance

intermediate·30 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

A running RAG pipeline, the ragas and langchain-ollama Python packages installed, and an Ollama chat model pulled (the examples use llama3).

What this does

Measuring a RAG pipeline requires more than asking a few questions manually. Quantitative evaluation reveals where retrieval fails, where generation drifts, and whether the system handles edge cases. This guide covers using the ragas library to compute faithfulness, answer relevance, context precision, and context recall against a curated test set.
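As a rough intuition for two of these metrics: ragas describes faithfulness as the share of claims in the response that the retrieved context supports, and context recall as the share of claims in the reference answer that the retrieved context covers. In ragas an LLM judge extracts and checks the claims. The sketch below skips that step and uses hand-labeled verdicts (hypothetical data, with a hypothetical helper named `supported_fraction`) to show only the arithmetic.

```python
def supported_fraction(verdicts):
    """Fraction of claims marked as supported; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Hypothetical hand-labeled verdicts: True means the retrieved context supports the claim.
response_claims = [True, True, False]  # 2 of 3 response claims are grounded
reference_claims = [True, False]       # 1 of 2 reference claims is covered

faithfulness_score = supported_fraction(response_claims)
context_recall_score = supported_fraction(reference_claims)
print(round(faithfulness_score, 2), context_recall_score)
```

Both values fall between 0 and 1, which is why the targets later in this guide are expressed as fractions.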

Steps

  1. Set up the evaluation dataset. Structure the test data as a dictionary of lists.

    import os
    os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"
    
    from ragas import EvaluationDataset
    
    eval_data = {
        "user_input": [
            "What is retrieval-augmented generation?",
            "How does embedding work?",
        ],
        "retrieved_contexts": [
            ["RAG combines retrieval with generation models."],
            ["Embeddings convert text into numerical vectors."],
        ],
        "response": [
            "RAG augments generation with retrieved documents.",
            "Embeddings map text to vectors for similarity search.",
        ],
        "reference": [
            "RAG is a technique that uses retrieved documents to ground LLM responses.",
            "An embedding is a numerical representation of text used in similarity comparisons.",
        ],
    }
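    # from_list expects one dict per sample, so turn the columns into rows first.
    rows = [dict(zip(eval_data, values)) for values in zip(*eval_data.values())]
    dataset = EvaluationDataset.from_list(rows)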
    
  2. Import the evaluation metrics. Ragas provides these four as ready-made metric objects.

    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )
    
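  3. Initialize an evaluator LLM and embeddings. Use Ollama as the evaluation backend. answer_relevancy also needs an embedding model, so pass one explicitly rather than relying on the library default.

    from ragas.llms import LangchainLLMWrapper
    from ragas.embeddings import LangchainEmbeddingsWrapper
    from langchain_ollama import ChatOllama, OllamaEmbeddings
    
    eval_llm = LangchainLLMWrapper(ChatOllama(model="llama3", base_url="http://localhost:11434"))
    # "nomic-embed-text" is an assumed model name; use any embedding model you have pulled.
    eval_embeddings = LangchainEmbeddingsWrapper(OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434"))
    
  4. Run the evaluation. Score each metric across the dataset.

    from ragas import evaluate
    
    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
        llm=eval_llm,
        embeddings=eval_embeddings,
    )
    print(result)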
    

    Expected output: the mean score for each metric across all test samples. Call result.to_pandas() for a per-sample table.

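  5. Interpret the results. As rough targets for a well-tuned system, aim for context precision above 0.8 and faithfulness above 0.9. Low context recall means the retriever missed documents needed to support the reference answer; low faithfulness means the response makes claims the retrieved context does not support (hallucination).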
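The targets in step 5 can be checked mechanically. The sketch below is a minimal illustration in plain Python and not part of ragas: `flag_weak_metrics` and `THRESHOLDS` are hypothetical names, the scores are made-up numbers, and the metric keys assume the names used in the imports above. With a real run, per-sample scores could be taken from `result.to_pandas()`.

```python
from statistics import mean

# Targets quoted in step 5; treat them as rough guides, not hard rules.
THRESHOLDS = {"context_precision": 0.8, "faithfulness": 0.9}


def flag_weak_metrics(per_sample_scores, thresholds=THRESHOLDS):
    """Return {metric: mean_score} for metrics whose mean falls below its target."""
    weak = {}
    for metric, target in thresholds.items():
        scores = per_sample_scores.get(metric, [])
        if scores and mean(scores) < target:
            weak[metric] = round(mean(scores), 3)
    return weak


# Made-up per-sample scores for illustration only.
example_scores = {
    "faithfulness": [0.95, 0.80, 0.90],
    "context_precision": [0.90, 0.85, 0.95],
}
print(flag_weak_metrics(example_scores))
```

Here only faithfulness is flagged, because its mean falls just under 0.9 while context precision clears 0.8.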

Verification

python -c "
from ragas.metrics import faithfulness
print(hasattr(faithfulness, 'name'))
# Expected: True
"

Common failures

  • Insufficient test samples. Fewer than 10 samples produce noisy scores; aim for 20 or more.
  • Ragas LLM not connecting. Verify Ollama is running at http://localhost:11434 before evaluation.
  • Reference answers missing. Context recall compares the retrieved contexts against the reference answer, so it cannot be computed without one.
  • Metric timeout on long responses. Local models can be slow to judge long responses; pass a RunConfig with a larger timeout to evaluate() through its run_config argument.
  • Version mismatch. The installed package or runtime differs from the one the commands were written for; check the version first, then rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
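Two of the failures above (Ollama not reachable, version mismatch) can be ruled out before a long evaluation run. The following is a minimal preflight sketch using only the standard library. The helper names `ollama_reachable` and `installed_versions` are hypothetical, the distribution names are assumed from the imports used in this guide, and the check relies on Ollama answering a plain GET on its base URL.

```python
import urllib.error
import urllib.request
from importlib.metadata import PackageNotFoundError, version


def ollama_reachable(base_url="http://localhost:11434", timeout=3.0):
    """Return True if an HTTP GET to the Ollama base URL succeeds."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print("Ollama reachable:", ollama_reachable())
    # Distribution names assumed from the imports used in this guide.
    for name, ver in installed_versions(["ragas", "langchain-ollama"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this inside the same virtual environment as the evaluation script also helps surface environment drift, since it reports what that interpreter actually sees.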

Related guides

  • How to Build a Basic RAG Pipeline with LangChain
  • How to Add Reranking to Your RAG Pipeline