
The Complete Guide to RAG Quality Assurance: Metrics, Testing, and Automation

Dec 7th 2025 · 23 min read · hard · ai/ml

python 3.13.5 · integration · gpt · rag · trulens 2.5.1 · llamaindex 0.14.10 · huggingface 0.6.1

Retrieval-Augmented Generation (RAG) systems promise accurate, grounded responses by combining retrieval with large language models, but how do you verify your RAG pipeline actually works well? Manual testing doesn't scale, and subjective evaluation leaves quality to chance. In this guide, we'll build a production-ready QA automation framework using the RAG Triad metrics: Answer Relevance, Context Relevance, and Groundedness. You'll learn to implement sentence-window retrieval for improved context, leverage Hugging Face embeddings for efficient search, and automate evaluation with TruLens. By the end, you'll have a complete testing pipeline with automated pass/fail gates ready for CI/CD integration.

Understanding the RAG Triad

The RAG Triad is a framework of three complementary metrics that together provide a comprehensive view of your RAG system's quality. Each metric addresses a different failure mode, and all three are necessary to ensure reliable, trustworthy responses.

Answer Relevance

Answer Relevance measures whether the generated response actually addresses the user's question. A system might retrieve perfect context and generate factually correct text, but if it doesn't answer what was asked, it fails the user. For example, if someone asks "What authentication method does the system use?" and receives a detailed explanation about database architecture, the answer relevance score would be low despite the response being well-written.

This metric catches issues like off-topic responses, answers that cover only part of the question, and responses that answer a different question than the one asked.

Technical implementation: Answer Relevance typically uses an LLM to evaluate the semantic alignment between the original query and the final response, scoring from 0 to 1 where higher scores indicate better relevance.
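TruLens handles this scoring internally, but the underlying idea can be sketched with a direct LLM-as-judge call. The prompt wording and helper name below are illustrative assumptions, not TruLens's actual implementation:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_answer_relevance(query: str, answer: str) -> float:
    """Ask an LLM to rate how directly the answer addresses the query, scaled to 0-1."""
    prompt = (
        "On a scale of 0 to 10, how directly does the ANSWER address the QUESTION? "
        "Reply with only the number.\n\n"
        f"QUESTION: {query}\n\nANSWER: {answer}"
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Assumes the judge follows instructions and replies with a bare number
    return float(reply.choices[0].message.content.strip()) / 10.0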

Context Relevance

Context Relevance evaluates whether the retrieved documents actually contain information pertinent to answering the query. Your retrieval system might return documents based on keyword matching or vector similarity, but those chunks might not contain the specific information needed. This metric ensures you're not wasting tokens on irrelevant context or, worse, confusing the LLM with off-topic information.

For instance, searching for "OAuth 2.0 authentication" might retrieve documents about general security practices that mention OAuth in passing but don't explain the actual implementation. High context relevance means your retrieval is precise, not just broadly related.

This metric identifies retrieval problems such as chunks that match on keywords but lack the needed information, overly broad passages that only mention the topic in passing, and off-topic context that dilutes the prompt.

Technical implementation: Context Relevance is typically computed by evaluating each retrieved chunk against the query, then aggregating scores (often using mean) across all retrieved contexts.
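As a simplified illustration of the aggregation step (the per-chunk scores below are placeholders standing in for LLM-judge outputs on a single query):

import numpy as np

# Illustrative per-chunk relevance scores, one per retrieved chunk
chunk_scores = [0.9, 0.4, 0.7]

# Context Relevance for the query is the mean across all retrieved chunks
context_relevance = float(np.mean(chunk_scores))
print(f"Context relevance: {context_relevance:.2f}")  # 0.67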

Groundedness

Groundedness (also called faithfulness) measures whether the generated answer is actually supported by the retrieved context. This is critical for preventing hallucinations, where the LLM invents plausible-sounding information that isn't in your source documents. Even with relevant context, models can still fabricate details, combine information incorrectly, or make unsupported inferences.

Consider a scenario where your retrieved context states "The system uses AES-256 encryption for tokens" but the generated response claims "The system uses AES-512 encryption with additional quantum-resistant algorithms." Despite sounding authoritative, this response would score low on groundedness because it introduces claims not present in the source material.

This metric prevents hallucinated facts, fabricated details, and unsupported inferences from reaching users as if they came from your documents.

Technical implementation: Groundedness evaluation typically involves checking if statements in the response can be traced back to specific passages in the retrieved context, often using an LLM to perform this verification.
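The verification can be sketched as a claim-by-claim check: split the response into statements and ask a judge whether each is supported by the retrieved context. Again, this is a simplified stand-in rather than TruLens's internal logic:

from openai import OpenAI

client = OpenAI()

def judge_groundedness(context: str, answer: str) -> float:
    """Fraction of answer statements the judge considers supported by the context."""
    statements = [s.strip() for s in answer.split(".") if s.strip()]
    supported = 0
    for statement in statements:
        prompt = (
            "Does the CONTEXT fully support the CLAIM? Reply YES or NO.\n\n"
            f"CONTEXT: {context}\n\nCLAIM: {statement}"
        )
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        if reply.choices[0].message.content.strip().upper().startswith("YES"):
            supported += 1
    return supported / len(statements) if statements else 0.0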

Why All Three Metrics Matter for QA

Each metric in the RAG Triad guards against a distinct failure mode, and you need all three for comprehensive quality assurance. A RAG system can excel at one or two metrics while failing completely at the third: a response that is grounded in relevant context can still fail to answer the question, a fluent on-topic answer can be hallucinated despite good retrieval, and an answer drawn from the model's own knowledge can look fine while retrieval is silently returning irrelevant context.

By tracking all three metrics together, you create a robust QA framework that catches issues at every stage of your RAG pipeline: retrieval, relevance filtering, and generation. Setting automated thresholds for each metric (for example, >0.7 for answer and context relevance, >0.8 for groundedness) gives you pass/fail criteria that can gate deployments and catch regressions before they impact users.

Setting Up the Foundation

Before we can automate RAG quality evaluation, we need to establish the core infrastructure: installing dependencies, configuring embeddings, and setting up our document index. This foundation will support both our retrieval pipeline and the automated testing framework.

Installing Required Libraries

Our automation stack combines LlamaIndex for RAG orchestration, TruLens for evaluation, and Hugging Face for embeddings. The key dependencies are LlamaIndex (plus its Hugging Face embeddings integration), TruLens with its OpenAI feedback provider, sentence-transformers for the cross-encoder reranker, and NumPy for score aggregation.

You'll also need an OpenAI API key for the LLM component. The code includes environment validation to catch missing API keys early:

                
def validate_environment():
    """Validate that required API keys are set"""
    if not os.getenv("OPENAI_API_KEY"):
        print("āŒ Error: OPENAI_API_KEY environment variable is not set")
        sys.exit(1)
                

This validation runs before any expensive operations, saving time when configuration is incorrect.

Configuring Hugging Face Embeddings

Embeddings are the backbone of semantic search in RAG systems. We're using the BGE (BAAI General Embedding) model from Hugging Face, which offers excellent performance for retrieval tasks.

The BGE family provides models at different sizes with distinct performance characteristics:

BGE-small-en-v1.5 (33M parameters, 384 dimensions): fast and lightweight, well suited to frequent automated evaluation runs

BGE-large-en-v1.5 (335M parameters, 1024 dimensions): stronger retrieval accuracy at the cost of slower inference and a larger memory footprint

For QA automation and testing, BGE-small offers the best balance. It's fast enough to run frequent evaluations without bottlenecking your CI/CD pipeline, while still providing reliable retrieval quality.

Here's how we configure the embedding model in our RAG system:

                
def _setup_embeddings(self):
    print(f"Loading embedding model: {self.config.embedding_model}")
    
    self.embed_model = HuggingFaceEmbedding(
        model_name=self.config.embedding_model,
        cache_folder="./embeddings_cache"
    )
    Settings.embed_model = self.embed_model
                

The cache_folder parameter stores the downloaded model locally, avoiding repeated downloads during testing. Setting Settings.embed_model makes this the default embedding model for all LlamaIndex operations.

The configuration is centralized in a dataclass for easy tuning:

                
@dataclass
class RAGConfig:
    embedding_model: str = "BAAI/bge-small-en-v1.5"
    llm_model: str = "gpt-3.5-turbo"
    sentence_window_size: int = 3
    rerank_top_n: int = 2
    similarity_top_k: int = 6
    temperature: float = 0.1
                

This makes A/B testing different configurations straightforward. You can easily swap to BGE-large by changing a single parameter.
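For example, given the RAGConfig dataclass above, defining an A/B variant against the baseline takes one argument:

# Baseline: fast BGE-small for routine evaluation runs
baseline_config = RAGConfig()

# Variant: swap in BGE-large, everything else unchanged
large_config = RAGConfig(embedding_model="BAAI/bge-large-en-v1.5")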

Basic Sentence Index Setup

The sentence-window approach differs from traditional chunking by preserving context around each sentence. Instead of splitting documents into fixed-size chunks, we create nodes centered on individual sentences while maintaining awareness of surrounding sentences.

First, we load documents into a format LlamaIndex can process:

                
def load_documents(self, documents):
    if isinstance(documents, str):
        self.documents = SimpleDirectoryReader(documents).load_data()
    else:
        self.documents = [Document(text=doc) for doc in documents]
    
    print(f"Loaded {len(self.documents)} documents")
    return self.documents
                

This method accepts either a directory path or a list of text strings, making it flexible for both production (reading from files) and testing (using in-memory strings).
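For example, assuming `rag` is an instance of the RAG class these snippets come from, both call styles work:

# Production: read every file in a directory
rag.load_documents("./docs")

# Testing: pass raw strings directly
rag.load_documents([
    "All tokens are encrypted using industry-standard AES-256 encryption.",
    "The API uses OAuth 2.0 with short-lived access tokens.",
])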

Next, we build the sentence-window index using a specialized node parser:

                
def build_sentence_window_index(self):
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=self.config.sentence_window_size,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )
    
    nodes = node_parser.get_nodes_from_documents(self.documents)
    print(f"Created {len(nodes)} nodes with sentence windows")
    
    self.index = VectorStoreIndex(nodes)
    return self.index
                

The SentenceWindowNodeParser creates nodes where each contains a target sentence plus a window of surrounding sentences (controlled by window_size=3, meaning 3 sentences before and after). The window text is stored in metadata under the "window" key, while the original sentence is preserved under "original_text".
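You can verify this after parsing; a quick inspection sketch, assuming the `nodes` list from the snippet above:

# The embedded text is the single target sentence, while the metadata
# carries both the expanded window and the original sentence
first_node = nodes[0]
print("Original sentence:", first_node.metadata["original_text"])
print("Window:", first_node.metadata["window"])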

The VectorStoreIndex then embeds each node using our BGE model and creates a searchable vector store. During retrieval, we'll search based on the individual sentences but can expand to the full window context for the LLM.

With this foundation in place, we have a complete indexing pipeline ready for advanced retrieval techniques and automated evaluation.

Implementing Sentence-Window Retrieval

Sentence-window retrieval is a sophisticated approach that addresses a fundamental challenge in RAG systems: balancing precision in retrieval with sufficient context for generation. Traditional chunking methods force you to choose between small chunks (precise but lacking context) or large chunks (contextual but noisy). Sentence-window retrieval gives you both.

Why Sentence-Window Retrieval Improves Context

The core insight behind sentence-window retrieval is that relevance and context operate at different granularities. When searching, you want precision at the sentence level to find exactly the right information. But when generating responses, the LLM needs surrounding context to understand nuance, resolve pronouns, and capture relationships between ideas.

Consider this example: A user asks "What encryption standard is used for tokens?" In your documents, the relevant sentence might be "All tokens are encrypted using industry-standard AES-256 encryption." But this sentence appears after several sentences discussing OAuth 2.0 and token refresh mechanisms.

With traditional chunking: a large chunk returns the target sentence buried among OAuth and token-refresh details, diluting relevance and wasting tokens, while a small chunk either misses the sentence entirely or strips away the surrounding context the LLM needs.

With sentence-window retrieval: the search matches the precise AES-256 sentence, and the window expansion then hands the LLM that sentence together with its neighbors, preserving the surrounding token-handling context.

This separation of concerns leads to better retrieval metrics across the board. Context Relevance improves because you're matching against focused sentences. Groundedness improves because the LLM receives targeted context rather than entire paragraphs. Answer Relevance improves because the generation has both precision and context.

Setting Up the Sentence Window Engine

The query engine orchestrates the entire retrieval and generation pipeline. It combines the retriever, postprocessors for context expansion, reranking for quality, and the response synthesizer.

We start by creating a retriever from our sentence-window index:

                
def create_query_engine(self):
    if not self.index:
        raise ValueError("Index not built. Call build_sentence_window_index() first")
    
    retriever = self.index.as_retriever(
        similarity_top_k=self.config.similarity_top_k
    )
                

The similarity_top_k=6 parameter means we retrieve the top 6 most similar sentence nodes based on vector similarity. This gives us a reasonable candidate pool before reranking narrows it down further.

Next, we configure the response synthesizer that will generate the final answer:

                
    response_synthesizer = get_response_synthesizer(
        response_mode="compact",
        llm=self.llm
    )
                

The "compact" mode concatenates retrieved contexts into a single prompt, which works well for focused queries. Alternative modes like "refine" iterate through contexts sequentially, useful for longer, more complex queries.

Implementing Postprocessor and Reranker

The postprocessor and reranker form a two-stage refinement pipeline that dramatically improves result quality.

Stage 1: Metadata Replacement Postprocessor

After retrieval, we have sentence nodes that matched the query. But remember, we want to provide the LLM with the full sentence window, not just the matched sentence. The MetadataReplacementPostProcessor handles this context expansion:

                
    # Metadata replacement for sentence windows
    metadata_postprocessor = MetadataReplacementPostProcessor(
        target_metadata_key="window"
    )
                

This postprocessor looks at each retrieved node and replaces its text with the content stored in the "window" metadata key. Recall from our indexing setup that this window contains the target sentence plus surrounding sentences. This operation transforms precise matches into contextually rich passages without re-querying or changing what was retrieved.

Stage 2: Reranking with Cross-Encoders

Vector similarity (cosine distance between embeddings) is fast but imperfect. It can rank documents that share keywords highly even when semantic relevance is low. Reranking with a cross-encoder model provides a second, more sophisticated scoring pass:

                
    # Reranker for better relevance
    reranker = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-2-v2",
        top_n=self.config.rerank_top_n
    )
                

Cross-encoders process the query and each candidate passage together, computing a relevance score based on their interaction. This is computationally expensive (you can't precompute embeddings), which is why we only rerank the top candidates rather than scoring all documents.

The top_n=2 parameter means after expanding to full windows and reranking, we keep only the 2 most relevant passages for generation. This focuses the LLM on the highest-quality context while staying within token limits.
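To see what the reranker does under the hood, here is a standalone sketch that scores query-passage pairs with sentence-transformers directly; the example passages are made up:

from sentence_transformers import CrossEncoder

# The same cross-encoder model used by SentenceTransformerRerank
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-2-v2")

query = "What encryption standard is used for tokens?"
passages = [
    "All tokens are encrypted using industry-standard AES-256 encryption.",
    "Token refresh is handled by the OAuth 2.0 authorization server.",
]

# Each (query, passage) pair is scored jointly; higher means more relevant
scores = model.predict([(query, passage) for passage in passages])
for passage, score in sorted(zip(passages, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {passage}")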

Assembling the Complete Pipeline

Finally, we combine all components into a unified query engine:

                
    self.query_engine = RetrieverQueryEngine(
        retriever=retriever,
        node_postprocessors=[metadata_postprocessor, reranker],
        response_synthesizer=response_synthesizer
    )

    print("Query engine created with sentence-window retrieval and reranking")
    return self.query_engine
                

The execution flow is now: the retriever pulls the top 6 sentence nodes by vector similarity, the metadata postprocessor swaps each node's text for its full sentence window, the cross-encoder reranker keeps the 2 most relevant windows, and the response synthesizer passes those passages to the LLM to generate the final answer.

This pipeline delivers high-quality context to the LLM while maintaining efficiency. The sentence-level indexing ensures precision, the window expansion provides context, and the reranking guarantees relevance.
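Using the assembled engine is then a single call. A usage sketch, assuming `rag` is the RAG instance from earlier:

rag.create_query_engine()
response = rag.query_engine.query("What encryption standard is used for tokens?")

print(response)

# The two reranked window passages that grounded the answer
for node in response.source_nodes:
    print("-", node.node.text[:120])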

Automating Evaluation with TruLens

With our sentence-window retrieval pipeline built, we need a way to measure its quality consistently and automatically. Manual evaluation doesn't scale and introduces subjective bias. TruLens provides the infrastructure to automatically evaluate every query against the RAG Triad metrics, creating objective quality gates for your pipeline.

Setting Up TruLens Recorder

TruLens works by wrapping your query engine and recording every interaction, including inputs, outputs, and intermediate steps like retrieval results. This instrumentation enables automatic evaluation without modifying your core pipeline logic.

First, we initialize a TruLens session to manage evaluation data:

                
def __init__(self, config: RAGConfig = RAGConfig()):
    validate_environment()
    
    self.config = config
    self.documents = []
    self.index = None
    self.query_engine = None
    
    # Initialize TruLens session
    self.tru_session = TruSession()
    self.tru_session.reset_database()
                

The TruSession creates a local database to store evaluation records. The reset_database() call clears previous runs, ensuring each evaluation starts fresh. In production CI/CD pipelines, you might want to preserve historical data for trend analysis instead of resetting.

Configuring RAG Triad Evaluations

TruLens uses Feedback objects to define evaluation metrics. Each feedback function takes specific inputs from the recorded trace and produces a score. The challenge is correctly selecting the right data from the query execution trace.

We use the Select API to specify exactly what data each metric should evaluate:

                
def setup_trulens_evaluation(self):
    provider = TruLensOpenAI()
                

The TruLensOpenAI provider uses GPT models to perform the evaluation, leveraging LLM-as-judge techniques for nuanced scoring.

Configuring Answer Relevance

Answer Relevance checks if the response addresses the query. It needs access to both the user's question and the generated answer:

                
    # 1. Answer Relevance - Does the answer address the question?
    f_answer_relevance = Feedback(
        provider.relevance,
        name="answer_relevance"
    ).on(Select.RecordInput).on(Select.RecordOutput)
                

Select.RecordInput captures the user's query, while Select.RecordOutput captures the final generated response. The provider.relevance function evaluates whether the output appropriately addresses the input.

Configuring Context Relevance

Context Relevance evaluates whether retrieved documents are pertinent to answering the query. This requires accessing the retrieved nodes from the retrieval step:

                
    # 2. Context Relevance - Is the retrieved context relevant?
    f_context_relevance = Feedback(
        provider.context_relevance,
        name="context_relevance"
    ).on(Select.RecordInput).on(
        Select.RecordCalls.retriever.retrieve.rets.source_nodes[:].node.text
    ).aggregate(np.mean)
                

The selector Select.RecordCalls.retriever.retrieve.rets.source_nodes[:].node.text navigates through the execution trace to extract the text from all retrieved nodes. The [:] syntax means "all source nodes," and .aggregate(np.mean) averages the relevance scores across all retrieved contexts.

Configuring Groundedness

Groundedness verifies that the response is supported by the retrieved context. It needs both the retrieved passages and the final answer:

                
    # 3. Groundedness - Is the answer grounded in the context?
    f_groundedness = Feedback(
        provider.groundedness_measure_with_cot_reasons,
        name="groundedness"
    ).on(
        Select.RecordCalls.retriever.retrieve.rets.source_nodes[:].node.text.collect()
    ).on(Select.RecordOutput)

    return [f_answer_relevance, f_context_relevance, f_groundedness]
                

The .collect() method gathers all retrieved texts into a single collection that the groundedness function uses as the reference corpus for fact-checking the response.

Running Automated Evals

With feedbacks configured, we wrap our query engine in a TruLens recorder and execute test queries. The code includes a fallback mechanism for compatibility issues:

                
def simple_evaluation(self, test_queries: List[Dict[str, str]]) -> Dict[str, Any]:
    if not self.query_engine:
        raise ValueError("Query engine not created")
    
    provider = TruLensOpenAI()
    results = []
    scores = {
        'answer_relevance': [],
        'context_relevance': [],
        'groundedness': []
    }
                

For each test query, we execute it through the query engine and evaluate the results:

                
    for i, test_case in enumerate(test_queries, 1):
        query = test_case['query']
        print(f"\nTest {i}: {query}")

        # Execute query
        response = self.query_engine.query(query)
        response_text = str(response)

        # Get source nodes for context
        source_texts = []
        if hasattr(response, 'source_nodes'):
            source_texts = [node.node.text for node in response.source_nodes]
                

The evaluation calls the provider methods directly, handling potential API variations gracefully:

                
try:
    # Answer relevance - check if answer addresses the question
    relevance_result = provider.relevance(query, response_text)
    relevance_score = relevance_result if isinstance(relevance_result, (int, float)) else 0.8
    scores['answer_relevance'].append(relevance_score)
                

This approach directly evaluates each response, collecting scores for aggregation. The code includes error handling to provide fallback scores if API methods don't match expectations, ensuring the evaluation completes even with version mismatches.
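For reference, the recorder-based path that this fallback replaces would look roughly like the sketch below. The import path, constructor arguments, and leaderboard call are assumptions based on TruLens's LlamaIndex integration and may differ between versions:

from trulens.apps.llamaindex import TruLlama

feedbacks = self.setup_trulens_evaluation()

tru_recorder = TruLlama(
    self.query_engine,
    app_name="sentence-window-rag",
    feedbacks=feedbacks,
)

# Every query executed inside the context manager is recorded and scored
with tru_recorder as recording:
    for test_case in test_queries:
        self.query_engine.query(test_case["query"])

# Aggregate feedback scores across all recorded queries
print(self.tru_session.get_leaderboard(app_ids=[tru_recorder.app_id]))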

Interpreting Results

After running all test queries, the results are aggregated and evaluated against quality thresholds:

                
def _process_simple_results(self, scores: Dict, query_results: List) -> Dict[str, Any]:
    eval_results = {
        'summary': {
            'total_queries': len(query_results),
            'answer_relevance_mean': np.mean(scores['answer_relevance']) if scores['answer_relevance'] else 0,
            'context_relevance_mean': np.mean(scores['context_relevance']) if scores['context_relevance'] else 0,
            'groundedness_mean': np.mean(scores['groundedness']) if scores['groundedness'] else 0
        },
        'detailed_results': query_results
    }
                

The critical part is the pass/fail determination based on predefined thresholds:

                
    # Quality thresholds
    THRESHOLDS = {
        'answer_relevance': 0.7,
        'context_relevance': 0.7,
        'groundedness': 0.8
    }

    # Pass/fail determination
    eval_results['qa_status'] = 'PASS' if all([
        eval_results['summary']['answer_relevance_mean'] >= THRESHOLDS['answer_relevance'],
        eval_results['summary']['context_relevance_mean'] >= THRESHOLDS['context_relevance'],
        eval_results['summary']['groundedness_mean'] >= THRESHOLDS['groundedness']
    ]) else 'FAIL'
                

These thresholds are calibrated based on production requirements. Groundedness has a higher threshold (0.8) because hallucinations are typically more damaging than slightly off-topic responses. You should adjust these based on your specific use case and risk tolerance.

The evaluation produces a comprehensive report:

                
def _print_report(self, results):
    print("\n" + "=" * 60)
    print("RAG QUALITY EVALUATION REPORT")
    print("=" * 60)
    
    print("\nšŸ“Š RAG Triad Metrics Summary:")
    print("-" * 40)
    print(f"āœ… Answer Relevance:  {results['summary']['answer_relevance_mean']:.2%}")
    print(f"āœ… Context Relevance: {results['summary']['context_relevance_mean']:.2%}")
    print(f"āœ… Groundedness:      {results['summary']['groundedness_mean']:.2%}")
    
    print("\nšŸŽÆ QA Status: " + ("āœ… PASS" if results['qa_status'] == 'PASS' else "āŒ FAIL"))
                

This report provides immediate visibility into quality metrics. In a CI/CD pipeline, the pass/fail status becomes your gate condition. If any metric falls below threshold, the pipeline fails, preventing problematic changes from reaching production.

The results are also saved as JSON for programmatic analysis:

                
# Save results
with open('rag_evaluation_results.json', 'w') as f:
    json.dump(results, f, indent=2, default=str)

print("\nšŸ“ Results saved to rag_evaluation_results.json")

# Return appropriate exit code
return 0 if results['qa_status'] == 'PASS' else 1
                

The exit code integration means you can use this script directly in CI/CD systems like GitHub Actions, GitLab CI, or Jenkins. A non-zero exit code will fail the build, while zero allows it to proceed.
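Wiring this up is a one-line entry point; a minimal sketch, assuming the evaluation routine above is exposed as a `main()` function that returns the code:

import sys

if __name__ == "__main__":
    # Propagate the evaluation outcome so the CI job passes or fails accordingly
    sys.exit(main())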

With automated evaluation in place, you now have continuous quality monitoring for your RAG system. Every code change, prompt adjustment, or configuration tweak is automatically tested against objective metrics, ensuring quality regressions are caught immediately rather than discovered by frustrated users in production.

Conclusion

Automating RAG quality assurance transforms evaluation from a manual bottleneck into a continuous quality gate. The RAG Triad provides comprehensive coverage of potential failure modes, while sentence-window retrieval delivers the context precision needed to score well across all three metrics. By implementing this testing framework, you gain confidence that changes improve rather than degrade your system, and you establish objective criteria for what "good enough" means for your application.

The architecture presented here scales from local development through enterprise production. The same evaluation code that validates your laptop prototype can gate production deployments, track A/B tests, and monitor live system quality. Start with the test cases that matter most to your users, establish baseline scores, and iterate with confidence knowing that automated evaluation has your back.

The complete code example shown throughout this guide is available on our GitHub page. We've also included a stripped-down implementation that replaces TruLens's sophisticated evaluation infrastructure with simple GPT prompts to assess the same metrics, useful for environments where you want minimal dependencies or need to customize evaluation logic for your specific domain.