Definition
The measure of the total data or requests an AI system processes per unit of time; in RAG, typically quantified as tokens per second (TPS) for generation or queries per second (QPS) for the end-to-end pipeline. Maximizing throughput usually involves architectural trade-offs such as batching multiple requests, which increases overall efficiency but can result in higher tail latency for individual users.
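The two metrics named above are simple rates. A minimal sketch, with hypothetical example numbers (not benchmarks):

```python
def tokens_per_second(total_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: tokens emitted per second (TPS)."""
    return total_tokens / elapsed_s

def queries_per_second(completed_requests: int, elapsed_s: float) -> float:
    """End-to-end pipeline throughput: requests completed per second (QPS)."""
    return completed_requests / elapsed_s

# Hypothetical measurement window of 10 seconds:
tps = tokens_per_second(total_tokens=12_000, elapsed_s=10.0)    # 1200.0 TPS
qps = queries_per_second(completed_requests=50, elapsed_s=10.0)  # 5.0 QPS
```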
Throughput is the volume of work finished per unit of time (the width of the pipe), whereas latency is the time taken to finish one task (the length of the pipe).
A multi-lane highway: increasing lanes allows more cars to pass a point per hour (throughput) even if the speed of each individual car (latency) remains the same.
- Latency (Inverse Metric/Trade-off)
- Continuous Batching (Optimization Technique)
- Concurrency (Prerequisite)
- Time To First Token (TTFT) (Component)
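The batching trade-off mentioned in the definition can be made concrete with back-of-the-envelope arithmetic. The timings below are made-up illustrative numbers, not measurements of any real system:

```python
# Serving one request alone vs. serving a batch together.
# Assumption (hypothetical): a batch of 8 takes 2.0 s total, while a
# lone request takes 1.0 s -- typical of hardware that amortizes work
# across a batch but runs each batched request somewhat slower.

SINGLE_REQUEST_S = 1.0   # latency of an unbatched request
BATCH_SIZE = 8
BATCH_TIME_S = 2.0       # wall-clock time to finish all 8 together

unbatched_qps = 1 / SINGLE_REQUEST_S     # 1.0 QPS, 1.0 s per-request latency
batched_qps = BATCH_SIZE / BATCH_TIME_S  # 4.0 QPS: 4x the throughput
batched_latency_s = BATCH_TIME_S         # but every request now waits 2.0 s
```

Under these assumed numbers, batching quadruples throughput while doubling each individual user's latency, which is exactly the efficiency-versus-tail-latency trade-off described above.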