Definition
Large Language Models engineered with extended attention mechanisms—such as Rotary Positional Embeddings (RoPE) or FlashAttention—to process input sequences often exceeding 100k tokens. In RAG pipelines, these models allow for the ingestion of entire documents or codebases directly into the prompt, though they trade off increased inference latency and higher token costs for reduced complexity in vector database management.
Distinguish between the size of the model's active memory (Context Window) versus the total number of parameters in its weights.
"A massive banquet table that can hold an entire library’s worth of open books simultaneously, rather than a small desk where only one page can be read at a time."
Conceptual Overview
Large Language Models engineered with extended attention mechanisms—such as Rotary Positional Embeddings (RoPE) or FlashAttention—to process input sequences often exceeding 100k tokens. In RAG pipelines, these models allow for the ingestion of entire documents or codebases directly into the prompt, though they trade off increased inference latency and higher token costs for reduced complexity in vector database management.
Disambiguation
Distinguish between the size of the model's active memory (Context Window) versus the total number of parameters in its weights.
Visual Analog
A massive banquet table that can hold an entire library’s worth of open books simultaneously, rather than a small desk where only one page can be read at a time.