Definition
The process of segmenting raw text into discrete sub-word units (tokens), each mapped to a numerical ID that an LLM can ingest. In RAG pipelines, tokenization is the critical first step for managing context window limits and determining the granularity of document chunking, where the trade-off lies between smaller chunks for retrieval precision and larger chunks for processing efficiency and cost.
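To make the token-to-ID mapping concrete, below is a minimal sketch using the tiktoken library; the cl100k_base encoding name is an assumption, so substitute the encoding that matches your target model.

import tiktoken

# Load a byte-pair encoding; "cl100k_base" is an assumed choice here.
encoding = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits raw text into sub-word units."
token_ids = encoding.encode(text)  # text -> numerical IDs the model ingests
slices = [encoding.decode([tid]) for tid in token_ids]  # the sub-word pieces

print(token_ids)  # list of integers
print(slices)     # the text "slices" those integers represent
print(len(token_ids), "tokens counted against the context window")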
Related Concepts
- Chunking (Dependent Process; see the sketch after this list)
- Embedding (Subsequent Transformation)
- Context Window (Resource Constraint)
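Because chunk size is measured in tokens rather than characters, a chunker typically encodes first and splits on token boundaries. The sketch below is illustrative only: the function name chunk_by_tokens and the 256-token budget are assumptions, not a fixed API.

import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 256) -> list[str]:
    # Smaller budgets favor retrieval precision; larger budgets favor
    # throughput and cost, mirroring the trade-off in the definition above.
    encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    token_ids = encoding.encode(text)
    return [
        encoding.decode(token_ids[i:i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

document = "Some long source document. " * 500  # placeholder text
chunks = chunk_by_tokens(document, max_tokens=256)
print(len(chunks), "chunks, each within the token budget")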
Disambiguation
Refers to tokenization for LLM data ingestion, not to security tokenization for PII masking or to blockchain token assets.
Visual Analog
The Salami Slicer: Dividing a continuous loaf of text into uniform, manageable slices that a machine can digest.