OCR (Optical Character Recognition)

Definition

The extraction of machine-readable text and structural metadata from image-based or otherwise non-selectable document formats, so that the content can be ingested into a RAG pipeline.

Architectural trade-off: High-accuracy deep-learning OCR models preserve complex layouts and tables better than lightweight, rule-based engines, but they introduce significantly higher latency and inference cost.
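
The architectural trade-off in the definition can be illustrated with a small router that sends only layout-heavy pages to the slower, more accurate engine. This is a minimal sketch under stated assumptions: `PageHints`, `choose_engine`, `run_ocr`, and the engine labels "deep" and "lightweight" are hypothetical names introduced for illustration, and the engines are stubs rather than calls into any real OCR library.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PageHints:
    """Cheap, pre-OCR signals about a page (hypothetical fields)."""
    has_tables: bool
    multi_column: bool


# An OCR engine is modeled as any callable that maps image bytes to text.
OcrEngine = Callable[[bytes], str]


def choose_engine(hints: PageHints) -> str:
    """Pick the heavier model only when the layout is complex."""
    if hints.has_tables or hints.multi_column:
        return "deep"
    return "lightweight"


def run_ocr(image: bytes, hints: PageHints, engines: Dict[str, OcrEngine]) -> str:
    """Dispatch the page image to whichever engine the router selects."""
    return engines[choose_engine(hints)](image)


# Stub engines standing in for real OCR backends (assumption, not real APIs).
engines: Dict[str, OcrEngine] = {
    "lightweight": lambda image: "fast-path text",
    "deep": lambda image: "layout-aware text",
}

table_page = PageHints(has_tables=True, multi_column=False)
plain_page = PageHints(has_tables=False, multi_column=False)

table_text = run_ocr(b"<image bytes>", table_page, engines)
plain_text = run_ocr(b"<image bytes>", plain_page, engines)
```

Routing per page rather than per document keeps the expensive model off pages that do not need it; how the hints are obtained (for example from a layout-analysis step) is left open here.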

Disambiguation

In a RAG context, OCR is the bridge between raw pixels and tokenizable strings for LLM indexing. It is distinct from simple text extraction from digital PDFs, which already carry selectable text.
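
The distinction above suggests a simple check before OCR is invoked: if a page already yields embedded text, use it, and fall back to OCR only when the text layer is missing or nearly empty. The sketch below assumes the embedded text has already been read by a PDF library (with PyMuPDF this would typically come from `page.get_text()`); `needs_ocr`, `page_to_text`, and the `min_chars` threshold of 20 are hypothetical choices for illustration, and the OCR engine is a stub.

```python
from typing import Callable


def needs_ocr(embedded_text: str, min_chars: int = 20) -> bool:
    """Return True when the embedded text layer is missing or near-empty.

    min_chars is an illustrative threshold, not a recommended value.
    """
    return len(embedded_text.strip()) < min_chars


def page_to_text(embedded_text: str, image: bytes, ocr: Callable[[bytes], str]) -> str:
    """Prefer the existing text layer; run OCR only on image-only pages."""
    if needs_ocr(embedded_text):
        return ocr(image)
    return embedded_text


def stub_ocr(image: bytes) -> str:
    """Stand-in for a real OCR engine (assumption for this sketch)."""
    return "text recovered from pixels"


digital_page = page_to_text(
    "This page already has a selectable text layer.", b"<image bytes>", stub_ocr
)
scanned_page = page_to_text("   ", b"<image bytes>", stub_ocr)
```

A character-count threshold is a crude signal; scanned documents that carry a poor-quality hidden text layer would pass this check and may still need OCR.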

Visual Metaphor

A Digital Scribe: transcribing a photo of an ancient scroll into a typed manuscript so that it can be searched in a modern library.

Key Tools

Unstructured.io
Tesseract
Amazon Textract
Azure AI Document Intelligence
PaddleOCR
PyMuPDF

Related Connections

Layout Analysis (Component)
Document Chunking (Dependent Step)
Multimodal LLMs (Emerging Alternative)
Vector Embedding (Downstream Consumer)
