Definition
A large-scale reading comprehension benchmark used to evaluate RAG pipelines by testing an agent's ability to retrieve and synthesize answers from multiple, often noisy, evidence documents. It highlights the architectural trade-off between retrieval recall (finding any source) and synthesis precision (filtering out irrelevant distractors).
A research benchmark for model evaluation, not a consumer trivia application.
"An open-book exam where the student must scan a stack of 100 messy newspaper clippings to find one specific factual date."
Conceptual Overview
A large-scale reading comprehension benchmark used to evaluate RAG pipelines by testing an agent's ability to retrieve and synthesize answers from multiple, often noisy, evidence documents. It highlights the architectural trade-off between retrieval recall (finding any source) and synthesis precision (filtering out irrelevant distractors).
Disambiguation
A research benchmark for model evaluation, not a consumer trivia application.
Visual Analog
An open-book exam where the student must scan a stack of 100 messy newspaper clippings to find one specific factual date.