Definition
A text normalization technique in RAG pipelines that reduces inflected word forms to their dictionary base form (the lemma) using morphological analysis and part-of-speech context. It is primarily used to improve keyword-based retrieval such as BM25, where a query term only matches documents that contain the same token, and it is sometimes applied to make chunk text more consistent before embedding.
- Stemming (lower-fidelity alternative)
- Tokenization (prerequisite)
- Hybrid Search (retrieval context)
- BM25 (retrieval algorithm benefiting from normalization)
Disambiguation
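Lemmatization uses a vocabulary and morphological rules to return real dictionary words, including irregular forms (e.g., 'better' to 'good'). Stemming strips suffixes heuristically; it is cheaper and often lands on the same base ('running' to 'run'), but it can produce non-words ('studies' to 'studi') and cannot resolve irregular forms.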
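The contrast can be sketched with a toy example. This is a minimal illustration, not a production implementation: LEMMA_TABLE, toy_lemmatize, and toy_stem are hypothetical names invented here, the lookup table stands in for the dictionary and part-of-speech analysis a real lemmatizer would perform, and the suffix rules are a deliberately crude stand-in for a real stemming algorithm.

```python
# Toy sketch only. A real pipeline would use a library lemmatizer and stemmer;
# the table and functions below are hypothetical stand-ins for illustration.

# (surface form, part of speech) -> lemma
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("studies", "NOUN"): "study",
    ("saw", "VERB"): "see",   # "I saw it"
    ("saw", "NOUN"): "saw",   # "a hand saw"
}


def toy_lemmatize(word, pos):
    """Look up the dictionary form for a word given its part of speech."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


def toy_stem(word):
    """Strip one common suffix with no dictionary or part-of-speech knowledge."""
    w = word.lower()
    for suffix in ("ing", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            # Undouble a consonant left behind by "-ing" ("runn" -> "run").
            if suffix == "ing" and w[-1] == w[-2] and w[-1] not in "aeiou":
                w = w[:-1]
            break
    return w


for word, pos in [("better", "ADJ"), ("running", "VERB"),
                  ("ran", "VERB"), ("studies", "NOUN")]:
    print(word, "lemma:", toy_lemmatize(word, pos), "stem:", toy_stem(word))
```

In this sketch the stemmer agrees with the lemmatizer on 'running', leaves 'better' and 'ran' unchanged, and turns 'studies' into the non-word 'studi'. The two entries for 'saw' show why part-of-speech context matters: the same surface form maps to different lemmas.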
Visual Analog
A Master Filing Cabinet where 'running', 'ran', and 'runs' are all filed under a single labeled folder 'run'.
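In retrieval terms, the filing cabinet corresponds roughly to an inverted index keyed by lemma. The sketch below is an assumption-laden illustration: LEMMAS is a hypothetical lookup standing in for a real lemmatizer, whitespace splitting stands in for real tokenization, and the index only records which documents contain a term rather than computing BM25 scores.

```python
from collections import defaultdict

# Hypothetical lookup standing in for a real lemmatizer.
LEMMAS = {"running": "run", "ran": "run", "runs": "run"}


def lemma_of(token):
    """Return the folder label (lemma) for a token, defaulting to the token itself."""
    t = token.lower()
    return LEMMAS.get(t, t)


def build_index(docs):
    """Map each lemma to the set of document ids containing any of its forms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():  # naive whitespace tokenization
            index[lemma_of(token)].add(doc_id)
    return index


def lookup(index, query_term):
    """Normalize the query term the same way, then open that folder."""
    return sorted(index.get(lemma_of(query_term), set()))


docs = {
    "d1": "she runs every morning",
    "d2": "he ran a marathon",
    "d3": "running shoes on sale",
    "d4": "walking shoes on sale",
}
index = build_index(docs)
print(lookup(index, "ran"))  # ['d1', 'd2', 'd3']
```

Because documents and queries are normalized with the same function, a query for 'ran' reaches documents that only contain 'runs' or 'running'. Without that step, an exact-token index would return only d2.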