Definition
In RAG pipelines, data lineage is the end-to-end record of a piece of information's journey from its raw source through extraction, chunking, embedding, and storage to its retrieval and inclusion in an LLM prompt. It provides the forensic map needed for source attribution and for debugging 'hallucinations', because it identifies exactly which source segments were placed in the context for a specific model response.
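As a rough illustration of the journey described above, the sketch below attaches a lineage record to each chunk at ingestion time and later follows that record back to the source text. Everything in it is an assumption made for illustration: the names `LineageRecord`, `chunk_with_lineage`, and `trace_to_source`, the fixed-size character chunking, and the `doc://handbook` URI are all hypothetical. The embedding and vector-store steps are left out; in a real pipeline the record would typically be stored as metadata next to each vector.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical lineage metadata carried alongside one chunk."""
    chunk_id: str      # ID built from the source URI and character offsets
    source_uri: str    # where the raw document came from
    start: int         # character offset of the chunk in the source (inclusive)
    end: int           # character offset of the chunk in the source (exclusive)
    content_hash: str  # SHA-256 of the chunk text, used to detect source drift


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_with_lineage(source_uri: str, text: str, size: int = 200):
    """Split text into fixed-size chunks, pairing each with its lineage record."""
    chunks = []
    for start in range(0, len(text), size):
        piece = text[start:start + size]
        end = start + len(piece)
        record = LineageRecord(
            chunk_id=f"{source_uri}#{start}-{end}",
            source_uri=source_uri,
            start=start,
            end=end,
            content_hash=_sha256(piece),
        )
        chunks.append((piece, record))
    return chunks


def trace_to_source(record: LineageRecord, sources: dict) -> str:
    """Follow a lineage record back to the source span it was cut from."""
    original = sources[record.source_uri][record.start:record.end]
    if _sha256(original) != record.content_hash:
        # The source changed after indexing, so the citation is stale.
        raise ValueError(f"source drift detected for {record.chunk_id}")
    return original


if __name__ == "__main__":
    sources = {"doc://handbook": "Refunds are issued within 30 days. " * 10}
    indexed = chunk_with_lineage("doc://handbook", sources["doc://handbook"], size=80)
    piece, record = indexed[1]  # pretend this chunk was retrieved for a prompt
    print(record.chunk_id, "->", repr(trace_to_source(record, sources)))
```

Storing a content hash with the offsets is one possible design choice: it lets a citation be rejected when the source document has changed since indexing, rather than silently pointing at different text.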
- Chunking (Prerequisite)
- Source Attribution (Component)
- Metadata Filtering (Component)
- Vector Store (Component)
Disambiguation
In this context, data lineage means the 'Source-to-Context' mapping used for AI citations, as distinct from general database transaction logs.
Visual Analog
A digital breadcrumb trail leading from a single puzzle piece (the retrieved chunk) back to the original box lid (the source document).