Definition
The simultaneous execution of multiple inference requests or data transformation tasks (such as embedding generation) to maximize hardware utilization and system throughput.
Related Terms
- Throughput (Primary Metric)
- Vector Indexing (Common Pipeline Stage)
- Dynamic Batching (Optimization Component; sketched below)
- GPU Memory Bandwidth (Hardware Constraint)
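The Dynamic Batching entry above refers to the common accumulate-then-flush pattern. The minimal sketch below is illustrative only: the names (`handle_batch`, `MAX_BATCH_SIZE`, `MAX_WAIT_S`) and values are assumptions, not taken from any particular serving framework. Requests are grouped until either the batch fills or a wait deadline expires, trading per-request latency for throughput.

```python
import queue
import threading
import time

MAX_BATCH_SIZE = 8   # flush as soon as this many requests arrive...
MAX_WAIT_S = 0.05    # ...or after 50 ms, whichever comes first

requests = queue.Queue()

def handle_batch(batch):
    # Placeholder for one grouped forward pass, e.g. model(batch).
    print(f"processing batch of {len(batch)} requests")

def batching_loop():
    while True:
        batch = [requests.get()]                 # block for the first request
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break                            # deadline hit: flush a partial batch
        handle_batch(batch)

threading.Thread(target=batching_loop, daemon=True).start()
for i in range(20):
    requests.put(f"request-{i}")
    time.sleep(0.01)
time.sleep(0.2)                                  # let the final batch flush
```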
Conceptual Overview
The simultaneous execution of multiple inference requests or data transformation tasks (such as embedding generation) to maximize hardware utilization and system throughput. While batching significantly lowers the cost per token and increases total capacity, it typically introduces higher per-request latency because the system waits to accumulate and process the group.
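To make the throughput gain concrete, here is a toy, self-contained illustration. It treats embedding generation as a single matrix multiply, where the weight matrix W stands in for a hypothetical model and the sizes are arbitrary assumptions: one grouped call over all queued inputs typically finishes far faster than the same work issued row by row.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 768))        # stand-in "model" weights
inputs = rng.standard_normal((512, 1024))   # 512 queued requests

t0 = time.perf_counter()
one_by_one = np.stack([x @ W for x in inputs])   # 512 separate calls
t1 = time.perf_counter()
batched = inputs @ W                             # one grouped call
t2 = time.perf_counter()

assert np.allclose(one_by_one, batched)          # identical results
print(f"one-by-one: {t1 - t0:.4f}s, batched: {t2 - t1:.4f}s")
```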
Disambiguation
Throughput-oriented, high-volume batch processing vs. latency-oriented real-time streaming, which serves each request as soon as it arrives.
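A back-of-the-envelope model can make that trade-off explicit. Assuming, purely for illustration, that each forward pass costs a fixed setup overhead plus a small per-item cost, growing the batch raises requests per second while every request now waits for the whole group:

```python
SETUP_S = 0.010      # fixed cost per forward pass (assumed)
PER_ITEM_S = 0.001   # marginal cost per request (assumed)

for batch_size in (1, 8, 64):
    batch_time = SETUP_S + PER_ITEM_S * batch_size
    throughput = batch_size / batch_time     # requests per second
    print(f"batch={batch_size:3d}  latency/request={batch_time * 1000:6.1f} ms"
          f"  throughput={throughput:7.1f} req/s")
```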
Visual Analog
A multi-passenger airport shuttle bus that waits to fill its seats before departing, rather than a private taxi that leaves immediately for one person.