Definition
The enhancement of a RAG pipeline or AI agent's capacity to process a greater volume of requests, tokens, or inferences per unit of time, typically achieved through parallel execution, request batching, or model quantization. While it increases total system capacity (Queries Per Second), it often introduces a trade-off with individual request latency due to queueing or processing overhead.
- Continuous Batching (Component)
- Latency (Trade-off)
- Quantization (Component)
- Parallel Retrieval (Prerequisite)
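As a rough illustration of the request-batching lever mentioned in the definition above, the sketch below compares a sequential loop with a batched loop over the same queries. The functions `handle_request` and `handle_batch`, the 50 ms fixed cost, and the batch size of 16 are hypothetical stand-ins, not any particular library's API.

```python
import time

def handle_request(query: str) -> str:
    """Stand-in for one retrieval + inference call."""
    time.sleep(0.05)  # pretend each call costs a fixed 50 ms
    return f"answer for {query}"

def handle_batch(queries: list[str]) -> list[str]:
    """Stand-in for a batched call: the fixed cost is paid once per batch."""
    time.sleep(0.05 + 0.005 * len(queries))  # fixed cost plus a small per-item cost
    return [f"answer for {q}" for q in queries]

queries = [f"query {i}" for i in range(64)]

# Sequential baseline: every request pays the full fixed cost on its own.
start = time.perf_counter()
for q in queries:
    handle_request(q)
sequential_qps = len(queries) / (time.perf_counter() - start)

# Batched: grouping 16 requests amortizes the fixed cost, raising throughput,
# but each request now also waits for the rest of its batch to finish.
start = time.perf_counter()
for i in range(0, len(queries), 16):
    handle_batch(queries[i:i + 16])
batched_qps = len(queries) / (time.perf_counter() - start)

print(f"sequential: {sequential_qps:.0f} QPS, batched: {batched_qps:.0f} QPS")
```

On these toy numbers the batched loop finishes the same 64 queries far faster, which is the Queries Per Second gain the definition refers to; the extra wait each request spends inside its batch is the latency trade-off.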
Disambiguation
Throughput is about volume (how many), whereas Latency is about speed (how fast).
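A minimal way to see the difference in numbers, assuming we logged hypothetical arrival and completion timestamps for a handful of requests:

```python
import statistics

# Hypothetical (arrival, completion) timestamps in seconds for eight requests.
requests = [
    (0.00, 0.30), (0.01, 0.32), (0.02, 0.35), (0.05, 0.60),
    (0.06, 0.62), (0.07, 0.65), (0.10, 0.90), (0.11, 0.93),
]

# Throughput is volume: completed requests per unit of wall-clock time.
wall_clock = max(done for _, done in requests) - min(arrived for arrived, _ in requests)
throughput_qps = len(requests) / wall_clock

# Latency is speed: how long each individual request took, arrival to completion.
latencies = [done - arrived for arrived, done in requests]
median_latency_ms = statistics.median(latencies) * 1000

print(f"throughput: {throughput_qps:.1f} QPS, median latency: {median_latency_ms:.0f} ms")
```

Batching or parallelism can push the first number up while leaving the second flat or slightly worse, which is exactly the distinction this entry draws.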
Visual Analog
A multi-lane highway that allows more cars to pass through a toll gate simultaneously, even if the speed limit for each individual car remains the same.