Definition
The practice of maximizing model performance per token consumed in RAG pipelines and AI agents, in order to reduce inference cost and latency while staying within a finite context window.
- Context Window (Hard Constraint)
- Prompt Compression (Optimization Technique)
- Reranking (Filtering Mechanism)
- Lost in the Middle (Performance Trade-off)
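The concepts above can be combined into a minimal sketch, under stated assumptions: chunks arrive with relevance scores from some reranker, the context window is modeled as a fixed token budget, and the strongest chunks are placed at the edges of the prompt to soften the "lost in the middle" effect. The names `Chunk`, `estimate_tokens`, `pack_context`, and `order_for_edges` are hypothetical, and the whitespace word count is only a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from a reranker (higher is better); assumed given


def estimate_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    # A real pipeline would use the target model's tokenizer instead.
    return len(text.split())


def pack_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Greedily keep the highest-scoring chunks that fit within `budget` tokens.

    The result is ordered from highest to lowest score.
    """
    kept: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost <= budget:
            kept.append(chunk)
            used += cost
    return kept


def order_for_edges(ranked: list[Chunk]) -> list[Chunk]:
    """Given chunks sorted by descending score, alternate them between the
    front and the back so the weakest ones end up in the middle."""
    front: list[Chunk] = []
    back: list[Chunk] = []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


if __name__ == "__main__":
    retrieved = [
        Chunk("alpha beta gamma", 0.9),
        Chunk("delta epsilon", 0.2),
        Chunk("zeta eta theta iota", 0.8),
        Chunk("kappa", 0.5),
    ]
    selected = order_for_edges(pack_context(retrieved, budget=8))
    print("\n".join(c.text for c in selected))
```

Greedy selection by score is one simple choice, not the only one. It can leave part of the budget unused when a large chunk is skipped, and a pipeline that needs tighter packing might compress chunks first or treat selection as a knapsack problem.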
Disambiguation
Focuses on prompt engineering and context pruning during inference, rather than hardware throughput or training speed.
Visual Analog
Packing a carry-on suitcase: strategically selecting only the most essential items to fit within a strict weight limit to avoid extra fees and delays.