Definition
A large-scale reading comprehension benchmark used to evaluate RAG pipelines by testing an agent's ability to retrieve and synthesize answers from multiple, often noisy, evidence documents. It highlights the architectural trade-off between retrieval recall (surfacing any supporting source) and synthesis precision (filtering out irrelevant distractors).
A research benchmark for model evaluation, not a consumer trivia application.
"An open-book exam where the student must scan a stack of 100 messy newspaper clippings to find one specific factual date."
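The open-book-exam analogy can be sketched as a minimal retrieve-then-read loop. Everything here is a hypothetical illustration, not part of the benchmark: the toy corpus, the token-overlap scorer, and the function names are all invented for clarity.

```python
import re

def tokens(text: str) -> list[str]:
    """Lowercase word tokens; punctuation stripped (toy tokenizer)."""
    return re.findall(r"\w+", text.lower())

def score(query: str, doc: str) -> int:
    """Toy retrieval score: how many document tokens appear in the query."""
    q = set(tokens(query))
    return sum(1 for tok in tokens(doc) if tok in q)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Retrieval recall: surface the k highest-overlap documents,
    distractors included."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

corpus = [
    "The treaty was signed in 1648 in Westphalia.",    # relevant evidence
    "A treaty on fishing rights was signed in 1982.",  # lexical distractor
    "Westphalia is a region of Germany.",              # partial match
]
docs = retrieve("When was the treaty of Westphalia signed?", corpus)
# Note the distractor rides along with the relevant document; synthesis
# precision is the reader's job: filter `docs` before extracting the answer.
```

The toy scorer deliberately retrieves the fishing-rights distractor alongside the real evidence, which is exactly the recall-versus-precision tension the benchmark is designed to expose.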
Related Terms
- Exact Match (EM): Primary Evaluation Metric
- Natural Questions (NQ): Alternative Benchmark
- Open-Domain QA: Task Category
- Data Contamination: Evaluation Risk
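Exact Match is typically computed after SQuAD-style answer normalization (lowercasing, stripping punctuation, articles, and extra whitespace). A minimal sketch, assuming that normalization scheme; the function names are illustrative, not from any particular library:

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    remove articles (a/an/the), collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """EM scores 1 if the normalized prediction equals any
    normalized gold answer, else 0."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)

print(exact_match("The 4th of July, 1776", ["4th of july 1776"]))  # True
print(exact_match("July 1776", ["4th of july 1776"]))              # False
```

Because EM is all-or-nothing, a semantically correct but differently worded answer scores zero, which is why benchmarks usually report it alongside a softer token-overlap F1 score.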