Intermediate

Long-Context Models

Large language models built to process very long input sequences, often exceeding 100k tokens. They typically combine positional encodings that extend to long ranges, such as Rotary Positional Embeddings (RoPE), with memory-efficient attention implementations such as FlashAttention. In RAG pipelines, these models allow entire documents or codebases to be placed directly in the prompt. This reduces the complexity of vector database management, but at the cost of higher inference latency and higher token costs.
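The trade-off between placing everything in the prompt and retrieving a few chunks can be sketched as a simple token-budget check. This is a minimal illustration, not a production policy: the 4-characters-per-token ratio is a rough heuristic (real counts depend on the model's tokenizer), and the names `plan_prompt`, `PromptPlan`, and all default values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PromptPlan:
    strategy: str       # "full_context" or "retrieval"
    prompt_tokens: int  # estimated tokens sent with each query


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; an assumption, not a real tokenizer."""
    return max(1, round(len(text) / chars_per_token))


def plan_prompt(
    documents: Sequence[str],
    question: str,
    context_window: int = 128_000,    # hypothetical model limit
    reserved_for_output: int = 4_000,  # tokens kept free for the answer
    top_k: int = 5,                    # chunks a retriever would return
    chunk_tokens: int = 500,           # assumed size of each chunk
) -> PromptPlan:
    """Send every document if it fits the budget, else fall back to retrieval."""
    question_tokens = estimate_tokens(question)
    full_prompt = sum(estimate_tokens(d) for d in documents) + question_tokens
    budget = context_window - reserved_for_output
    if full_prompt <= budget:
        return PromptPlan("full_context", full_prompt)
    return PromptPlan("retrieval", top_k * chunk_tokens + question_tokens)


if __name__ == "__main__":
    small_corpus = ["a" * 4_000]
    large_corpus = ["a" * 1_000_000]
    print(plan_prompt(small_corpus, "What is RoPE?"))
    print(plan_prompt(large_corpus, "What is RoPE?"))
```

Even when the full corpus fits, the estimated prompt size is paid on every query, which is where the higher latency and token cost come from.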

Disambiguation

"Long context" refers to the size of the model's active memory (its context window, measured in tokens), not to the total number of parameters in its weights.
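One way to see that the two quantities are independent is to compare how memory scales with each. In a standard transformer decoder, weight memory depends only on the parameter count, while the key-value cache grows linearly with the number of tokens in the context. The sketch below uses an illustrative configuration (7B parameters, 32 layers, 8 key-value heads, head dimension 128, 16-bit values); these numbers are assumptions and do not describe any specific model.

```python
def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights: fixed, regardless of prompt length."""
    return n_params * bytes_per_param


def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Memory for cached keys and values: grows linearly with seq_len."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model configuration, for illustration only.
    config = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    print("weights (GB):", weight_bytes(7_000_000_000) / 1e9)
    for seq_len in (4_000, 32_000, 128_000):
        gb = kv_cache_bytes(seq_len, **config) / 1e9
        print(f"KV cache at {seq_len} tokens (GB): {gb:.2f}")
```

Changing the context length changes only the second figure; changing the parameter count changes only the first.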

Visual Metaphor

"A massive banquet table that can hold an entire library’s worth of open books simultaneously, rather than a small desk where only one page can be read at a time."
