Definition
The initial stage of a RAG ingestion pipeline, in which unstructured source formats such as PDFs, HTML, or images are parsed and converted into clean, machine-readable text strings.
Conceptual Overview
Extraction trades speed against structural fidelity. The central decision is whether to preserve the document's layout (for example headings, tables, and reading order) or to prioritize raw character accuracy, for instance by running OCR on scanned pages and images.
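As a rough illustration of that trade-off, the sketch below extracts text from HTML in two modes using only Python's standard-library `html.parser`. The names `TextExtractor`, `extract_text`, and `keep_layout` are hypothetical and chosen for this example; a production pipeline would more likely rely on a dedicated parsing or OCR library, and the heading markers used here are just one assumed way of retaining layout.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text; optionally marks headings to keep some layout.

    Hypothetical helper for illustration only. Headings that contain
    nested inline tags are not merged into a single line.
    """

    SKIP = {"script", "style"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, keep_layout=False):
        super().__init__()
        self.keep_layout = keep_layout
        self.parts = []
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._heading = None   # current heading tag, e.g. "h2"

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._heading = tag

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._heading = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or self._skip_depth:
            return
        if self.keep_layout and self._heading:
            level = int(self._heading[1])
            text = "#" * level + " " + text  # keep the heading level as a marker
        self.parts.append(text)


def extract_text(html, keep_layout=False):
    """Return clean text; keep_layout=True keeps line breaks and heading markers."""
    parser = TextExtractor(keep_layout=keep_layout)
    parser.feed(html)
    parser.close()
    separator = "\n" if keep_layout else " "
    return separator.join(parser.parts)
```

With the input `<h1>Title</h1><p>Hello   world.</p>`, the flat mode returns `Title Hello world.`, while `keep_layout=True` returns `# Title` and `Hello world.` on separate lines. The flat mode is simpler but discards structure that later chunking steps may need.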
Disambiguation
Refers to ingesting the characters and layout of source files, not to semantic tasks such as entity extraction or named entity recognition (NER).
Visual Analog
A digital mineral refinery that crushes raw, complex ore (unstructured files) to isolate the pure gold (clean text strings).