Definition
The architectural constraint defining the maximum number of sub-word units (tokens) an LLM can process in a single request, encompassing both the prompt and the completion. In RAG pipelines, this limit forces a trade-off between the depth of retrieved context and the remaining capacity for the model's reasoning and response generation.
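In practice, this trade-off is handled by budgeting tokens before the request is sent: count what the prompt will consume and reserve room for the completion. The sketch below is a minimal illustration, not a prescribed method; it assumes OpenAI's tiktoken tokenizer, and the limit, reserve value, and trim_chunks_to_budget function are hypothetical placeholders for whatever a given pipeline uses.

```python
# Minimal sketch: budgeting retrieved context against a model's token limit.
# Assumes the tiktoken library; MODEL_TOKEN_LIMIT and COMPLETION_RESERVE are
# illustrative values, not properties of any specific model.
import tiktoken

MODEL_TOKEN_LIMIT = 8192   # hypothetical total budget for prompt + completion
COMPLETION_RESERVE = 1024  # tokens held back for the model's answer

enc = tiktoken.get_encoding("cl100k_base")

def trim_chunks_to_budget(question: str, chunks: list[str]) -> list[str]:
    """Keep retrieved chunks, in ranked order, until the prompt budget is spent."""
    budget = MODEL_TOKEN_LIMIT - COMPLETION_RESERVE - len(enc.encode(question))
    kept = []
    for chunk in chunks:
        cost = len(enc.encode(chunk))
        if cost > budget:
            break  # adding this chunk would crowd out the reserved completion space
        kept.append(chunk)
        budget -= cost
    return kept
```

Pipelines differ in how they spend the budget: some truncate the final chunk rather than dropping it, others re-rank chunks before trimming, but the underlying constraint is the same fixed prompt-plus-completion total.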
Related Terms
- Context Window (Prerequisite)
- Chunking (Component)
- Lost in the Middle (Consequent Phenomenon)
- KV Cache (Underlying Architecture)
Disambiguation
Not to be confused with 'Rate Limits,' which govern the frequency of API calls rather than the volume of data per call.
Visual Analog
A fixed-length conveyor belt that can only carry a specific amount of cargo into a factory; if you add more raw materials (context), you have less room for the finished product (output).