Definition
Stop Word Removal is a preprocessing technique in RAG pipelines in which high-frequency words that carry little semantic content (e.g., 'the', 'is', 'on') are filtered out during indexing or query processing, so that retrieval prioritizes tokens with domain-specific meaning. It benefits sparse retrieval (BM25) and reduces index size, but it can degrade dense embeddings: transformer-based models rely on function words for syntactic context, so removing them changes what the model encodes. Stripping 'not' from "not covered by warranty", for example, inverts the meaning of the phrase.
Disambiguation: Not to be confused with 'Negative Constraints' or 'Safety Guardrails' in Agent logic.
Visual Analog: A gold-panning sifter that lets common sand pass through while retaining valuable nuggets of information.
- BM25 (Component)
- Tokenization (Prerequisite)
- Hybrid Search (Component)
- Sparse Vector (Component)
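The split described in the definition, filtering stop words for the sparse (BM25) path while leaving the text intact for the dense path, might be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the stop list is deliberately tiny, and `tokenize`, `remove_stop_words`, and `prepare_query` are hypothetical helper names, not functions from any particular library.

```python
import re

# Illustrative stop list (an assumption). Production pipelines typically
# use a larger, language-specific list tuned to the corpus.
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "on", "in", "of", "to", "and"})

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list, preserving order."""
    return [t for t in tokens if t not in stop_words]


def prepare_query(query):
    """Build separate inputs for the sparse and dense retrieval paths.

    The sparse path gets filtered tokens; the dense path gets the
    original text so the encoder still sees the function words.
    """
    return {
        "sparse_terms": remove_stop_words(tokenize(query)),
        "dense_text": query,
    }


prepared = prepare_query("What is the effect of caching on latency?")
print(prepared["sparse_terms"])  # ['what', 'effect', 'caching', 'latency']
print(prepared["dense_text"])    # What is the effect of caching on latency?
```

In a hybrid search setup, `sparse_terms` would feed the BM25 scorer and `dense_text` would go to the embedding model unchanged. The same filter would need to be applied at indexing time so that query terms and indexed terms stay consistent.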