Definition
The systematic evaluation of RAG pipeline performance and AI agent outputs using metrics such as faithfulness, relevance, and precision, with the goal of minimizing hallucinations and ensuring answers stay grounded in the retrieved context. It involves a trade-off between the high cost and accuracy of 'LLM-as-a-judge' or human evaluation and the speed but lower precision of heuristic-based metrics.
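To make the heuristic end of that trade-off concrete, here is a minimal sketch of a token-overlap grounding check. The function name and scoring rule are illustrative assumptions, not a standard metric: it treats the fraction of answer tokens that also appear in the retrieved context as a crude faithfulness proxy, trading precision for speed exactly as described above.

```python
def heuristic_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the retrieved context.

    A crude grounding proxy (illustrative, not a standard metric):
    1.0 means every answer token occurs somewhere in the context;
    low scores flag answers that may contain hallucinated content.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


# A fully grounded answer scores 1.0; an unsupported claim scores lower.
grounded = heuristic_faithfulness(
    "paris is the capital",
    "the capital of france is paris",
)
ungrounded = heuristic_faithfulness(
    "berlin is the capital",
    "the capital of france is paris",
)
```

In practice a check like this is fast enough to run on every response, while an 'LLM-as-a-judge' pass is reserved for sampled or flagged outputs.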
Disambiguation
Distinguishes 'Quality Assurance' (a testing methodology) from 'Question Answering' (a functional task); both abbreviate to 'QA'.
Visual Analog
A food safety inspector using a checklist to verify that a chef (LLM) used only the provided ingredients (retrieved context) without adding any unauthorized fillers.