Definition
The initial stage of a RAG ingestion pipeline, in which unstructured source formats such as PDFs, HTML, or images are parsed and converted into clean, machine-readable text strings. The process trades speed against structural fidelity: layout-aware parsing keeps tables and headings intact but costs more, while faster extraction prioritizes raw character accuracy and relies on OCR for image-based content.
- Optical Character Recognition (OCR) (Prerequisite for image-based or non-searchable documents)
- Chunking (Subsequent step to break extracted text into manageable segments)
- Layout Analysis (Component for maintaining semantic grouping of tables and headers)
- ETL Pipeline (Architectural parent process for data movement)
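The speed versus structural fidelity trade-off in the definition can be illustrated with a minimal sketch for HTML input only, using Python's standard-library `html.parser`. The names `TextExtractor`, `extract_text`, and `keep_layout` are hypothetical and chosen for this example; PDFs and images would need a dedicated parser or an OCR engine, which are not shown here.

```python
from html.parser import HTMLParser

# Tag groups are illustrative assumptions, not an exhaustive list.
SKIP_TAGS = {"script", "style"}            # content is never visible text
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
CELLS = {"td", "th"}
BLOCKS = {"p", "div", "li", "tr", "br", "table", "ul", "ol"}
STRUCTURAL = HEADINGS | CELLS | BLOCKS


class TextExtractor(HTMLParser):
    """Collects visible text; optionally keeps coarse layout markers."""

    def __init__(self, keep_layout=False):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.keep_layout = keep_layout
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in STRUCTURAL:
            if not self.keep_layout:
                self.parts.append(" ")       # flat mode: just separate words
            elif tag in HEADINGS:
                self.parts.append("\n# ")    # mark headings
            elif tag in CELLS:
                self.parts.append(" | ")     # mark table cells
            else:
                self.parts.append("\n")      # start a new line per block

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in STRUCTURAL:
            layout_break = self.keep_layout and tag not in CELLS
            self.parts.append("\n" if layout_break else " ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text(html, keep_layout=False):
    """Return clean text from an HTML string.

    keep_layout=False: one flat, whitespace-normalized string (fast path).
    keep_layout=True:  one line per block, headings and cells marked.
    """
    parser = TextExtractor(keep_layout)
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    if not keep_layout:
        return " ".join(raw.split())
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)
```

Calling `extract_text(page)` yields a single flat string that is cheap to produce but loses table and heading boundaries, while `extract_text(page, keep_layout=True)` keeps one line per block so that a later chunking step can split on structure rather than on arbitrary character offsets.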
Disambiguation
Refers to the physical ingestion of characters and layout, not to semantic tasks such as entity extraction or named-entity recognition (NER).
Visual Analog
A digital mineral refinery that crushes raw, complex ore (unstructured files) to isolate the pure gold (clean text strings).