Definition
The finite allocation of sub-word units (tokens) allowed within a single inference call, requiring developers to balance the volume of retrieved context, conversation history, and system instructions against the desired length of the generated output. In RAG, managing this budget is critical to avoid truncating vital information while minimizing latency and API costs.
Disambiguation
This refers to the per-request limit of an LLM's context window, not the total monthly financial spend on an AI provider.
"A fixed-size shipping container where every piece of 'Retrieved Context' added reduces the available space for the 'Model's Response'."
- Context Window (Hard Physical Limit)
- Retrieval Augmented Generation (RAG) (Primary Consumer of Budget)
- Lost in the Middle (Performance degradation caused by budget saturation)
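The balancing act described above can be sketched in code: reserve tokens for the fixed parts of the request (system prompt, history, desired output length), then spend whatever remains on retrieved chunks. This is a minimal illustration, not a production implementation — the whitespace-based token estimator, the 8192-token window, and all names and numbers here are illustrative assumptions; a real system would count tokens with the model's own tokenizer.

```python
# Sketch of a per-request token-budget planner for RAG.
# All constants and the token estimator are illustrative assumptions.

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~1 token per 0.75 words (real systems use the model's tokenizer)."""
    return int(len(text.split()) / 0.75) + 1

def pack_chunks(chunks, budget):
    """Greedily add retrieved chunks until the budget is exhausted."""
    packed, used = [], 0
    for chunk in chunks:  # assumed pre-sorted by relevance
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # drop whole chunks rather than truncate mid-passage
        packed.append(chunk)
        used += cost
    return packed, used

CONTEXT_WINDOW = 8192  # hard physical limit of the (hypothetical) model
MAX_OUTPUT = 1024      # tokens reserved for the model's response

system_prompt = "You are a helpful assistant. Answer from the context only."
history = ["User: What is a token budget?"]

# Fixed costs come off the top; the remainder is the retrieval budget.
fixed = estimate_tokens(system_prompt) + sum(map(estimate_tokens, history))
retrieval_budget = CONTEXT_WINDOW - MAX_OUTPUT - fixed

chunks = [
    "A token budget is the per-request allotment of context-window tokens.",
    "Retrieved passages, chat history, and instructions all share this budget.",
]
packed, used = pack_chunks(chunks, retrieval_budget)
print(f"retrieval budget: {retrieval_budget}, used: {used}, chunks kept: {len(packed)}")
```

Dropping whole chunks when the budget runs out (rather than truncating the last one) is one common design choice: a partially cut passage can mislead the model more than an absent one.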