Definition
The strategic allocation of token limits or hardware resources (VRAM/RAM) within an LLM or Agent architecture to manage context window utilization and retrieval density. It forces a trade-off: comprehensive context improves recall but increases latency and cost, while a leaner budget reduces cost but risks context loss.
Related Concepts
- Context Window (Prerequisite)
- Sliding Window Memory (Component)
- Tokenization (Component)
- Vector Quantization (Component)
Conceptual Overview
Memory budgeting decides, in advance, how a fixed context window (and the VRAM/RAM backing it) is divided among competing consumers: the system prompt, conversation history, retrieved documents, and the model's own output. Allocating more of the budget to context improves recall but increases latency and cost; allocating less keeps calls cheap and fast but risks discarding information the model still needs.
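The cost side of this trade-off is often handled with sliding-window memory (listed among the related concepts): drop the oldest conversation turns until what remains fits a token quota. A minimal sketch, assuming whitespace word counts as a crude stand-in for a real tokenizer; the function name and quota numbers are illustrative, not from any particular library.

```python
from collections import deque


def count_tokens(text: str) -> int:
    """Crude estimate: whitespace word count (stand-in for a real tokenizer)."""
    return len(text.split())


def sliding_window(turns: list[str], quota: int) -> list[str]:
    """Keep the most recent turns whose combined size fits the token quota.

    Dropping the oldest turns first is the 'performance efficiency' side of
    the trade-off: cheaper, faster calls, but earlier context is lost.
    """
    kept: deque[str] = deque()
    used = 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > quota:
            break  # everything older than this turn is discarded
        kept.appendleft(turn)
        used += cost
    return list(kept)


history = ["first turn " * 10, "second turn " * 10, "third turn " * 10]
print(len(sliding_window(history, quota=45)))  # → 2
```

With a quota of 45 "tokens" and three 20-token turns, only the two most recent turns survive; the first is silently lost, which is exactly the context-loss risk the overview describes.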
Disambiguation
In AI, this term refers to token quotas and context-window management, not just general-purpose system RAM.
Visual Analog
A Suitcase with Fixed Dividers: Deciding exactly how much space is reserved for 'essential clothes' (System Prompt) versus 'souvenirs' (Retrieved Documents) before the lid won't close.