Definition
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a metric that measures the lexical overlap of n-grams between an LLM-generated response and a reference ground-truth text, emphasizing recall: how much of the reference information was successfully retrieved and synthesized. It is efficient for benchmarking, but the trade-off is that it rewards exact word matching over semantic meaning, so it can penalize factually correct answers that merely use different phrasing.
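As a minimal sketch of the idea, ROUGE-N recall can be computed as the clipped n-gram overlap divided by the total number of reference n-grams. This assumes simple whitespace tokenization and a single reference; production implementations (e.g. the original ROUGE toolkit) add stemming, stopword options, and precision/F1 variants.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by total reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5/6 ≈ 0.833
```

Note that the substitution of "is" for "sat" costs one of six reference unigrams, even though a paraphrase could be equally correct — the failure mode described above.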
Disambiguation
ROUGE focuses on recall (how much of the ground truth was captured), whereas BLEU focuses on precision (how much of the generated text is valid).
Visual Analog
A highlighter overlay: placing a transparent sheet of the model's answer over the reference text to see how much of the original 'gold' text is covered.
- N-gram (Component)
- Ground Truth (Prerequisite)
- BLEU Score (Alternative Metric)
- METEOR (Semantic Alternative)