Definition
In RAG and agentic workflows, a bottleneck is the performance or architectural constraint, typically retrieval latency, LLM inference speed, or the context window limit, that caps the maximum throughput and responsiveness of the entire system. Improving any other stage does not raise overall throughput until the bottleneck itself is relieved.
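As a rough illustration of how one stage caps the whole pipeline, the sketch below models each stage by an average service time and a worker count, then picks the stage with the lowest sustainable request rate. The `Stage` class, the `find_bottleneck` helper, the stage names, and all numbers are hypothetical and chosen only for illustration; they are not measurements of any real system.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    seconds_per_request: float  # assumed average service time per request
    workers: int = 1            # assumed number of parallel workers

    @property
    def capacity_rps(self) -> float:
        """Requests per second this stage can sustain."""
        return self.workers / self.seconds_per_request


def find_bottleneck(stages):
    """Return the stage with the lowest sustainable throughput."""
    return min(stages, key=lambda s: s.capacity_rps)


# Illustrative numbers only, not measurements.
pipeline = [
    Stage("vector_retrieval", 0.05, workers=2),
    Stage("rerank", 0.10, workers=2),
    Stage("llm_inference", 2.0, workers=4),
]

bottleneck = find_bottleneck(pipeline)

# For a single request passing through every stage in sequence,
# latency is the sum of the stage service times.
end_to_end_latency = sum(s.seconds_per_request for s in pipeline)

print(f"bottleneck: {bottleneck.name} ({bottleneck.capacity_rps} req/s)")
print(f"single-request latency: {end_to_end_latency:.2f} s")
```

Under these assumed numbers, LLM inference is the limiting stage: adding retrieval workers would leave the pipeline's sustainable request rate unchanged.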
Disambiguation: here the term refers to computational or data-flow constraints in AI pipelines, not to physical hardware effects such as thermal throttling.
Visual analog: a narrow hourglass neck, where the large volume of sand (vector database results) is restricted by the small opening (the LLM context window) before reaching the bottom chamber (the final response).
- Context Window (Resource Constraint)
- Inference Latency (Performance Metric)
- Vector Retrieval (Potential Component Source)
- Token Limits (Structural Constraint)
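The hourglass picture, together with the context window and token limit constraints listed above, can be sketched as a greedy packing step: many retrieved chunks arrive, but only those that fit a fixed token budget are passed on to the model. This is a minimal sketch under stated assumptions. `pack_context` is a hypothetical helper, the chunks are assumed to be sorted by relevance already, and whitespace word counting stands in for a real tokenizer, which would give different counts.

```python
def pack_context(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-ranked chunks that fit within max_tokens.

    chunks is assumed to be ordered best-first. count_tokens defaults to a
    whitespace word count, which only approximates a model tokenizer.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a later, smaller chunk may still fit
        kept.append(chunk)
        used += cost
    return kept, used


# Made-up retrieval results, ordered by assumed relevance.
retrieved = [
    "alpha beta gamma delta",       # 4 words
    "one two three four five six",  # 6 words
    "red green",                    # 2 words
    "north south east",             # 3 words
]

kept, used = pack_context(retrieved, max_tokens=7)
print(kept, used)
```

With a budget of 7, only the first and third chunks pass through the neck. The rest of the retrieved material never reaches the model, however relevant it is.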