Definition
A lexical similarity metric defined as the size of the intersection divided by the size of the union of two token sets, yielding a score from 0 (no shared tokens) to 1 (identical sets). In retrieval-augmented generation (RAG) it is used to evaluate keyword overlap. It is computationally efficient for deduplication and exact-match filtering, but it lacks semantic awareness: synonyms and paraphrases that vector-based embeddings would capture count as non-matching tokens.
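The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `tokenize` and `jaccard` helper names, the whitespace tokenizer, and the convention that two empty sets score 1.0 are all assumptions made for the example.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace (a deliberately naive, assumed tokenizer)."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Return |a & b| / |a | b| for two token sets."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


query = tokenize("fast car repair")
doc_keyword = tokenize("car repair shop")
doc_synonym = tokenize("quick automobile fix")

# 2 shared tokens out of 4 distinct tokens
print(jaccard(query, doc_keyword))   # 0.5
# Same meaning, but no shared tokens
print(jaccard(query, doc_synonym))   # 0.0
```

The second comparison shows the limitation described above: the synonym document expresses roughly the same request, yet shares no tokens with the query and therefore scores zero.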
Disambiguation
Measures set-based keyword overlap rather than semantic vector direction.
Visual Analog
A Venn diagram where the score is the ratio of the shared overlapping area to the total combined area of both circles.
- Cosine Similarity (Semantic Alternative)
- Tokenization (Prerequisite)
- MinHash (Optimization Technique)
- BM25 (Alternative Lexical Ranker)