Definition
The measure of the total data or requests an AI system processes per unit of time; in RAG, typically quantified as tokens per second (TPS) for generation or queries per second (QPS) for the end-to-end pipeline. Maximizing throughput usually involves architectural trade-offs such as batching multiple requests, which increases overall efficiency but can result in higher tail latency for individual users.
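The two metrics named above are simple rates. A minimal sketch, with hypothetical example numbers (not benchmarks):

```python
def tokens_per_second(total_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: tokens emitted per second (TPS)."""
    return total_tokens / elapsed_s

def queries_per_second(completed_requests: int, elapsed_s: float) -> float:
    """End-to-end pipeline throughput: requests completed per second (QPS)."""
    return completed_requests / elapsed_s

# Hypothetical measurement window of 10 seconds:
tps = tokens_per_second(total_tokens=12_000, elapsed_s=10.0)    # 1200.0 TPS
qps = queries_per_second(completed_requests=50, elapsed_s=10.0)  # 5.0 QPS
```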
Throughput is the volume of work finished per unit of time (the width of the pipe), whereas latency is the time taken to finish one task (the length of the pipe).
A multi-lane highway: increasing lanes allows more cars to pass a point per hour (throughput) even if the speed of each individual car (latency) remains the same.
- Latency (Inverse Metric/Trade-off)
- Continuous Batching (Optimization Technique)
- Concurrency (Prerequisite)
- Time To First Token (TTFT) (Component)
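The batching trade-off mentioned in the definition can be made concrete with back-of-the-envelope arithmetic. The timings below are made-up illustrative numbers, not measurements of any real system:

```python
# Serving one request alone vs. serving a batch together.
# Assumption (hypothetical): a batch of 8 takes 2.0 s total, while a
# lone request takes 1.0 s -- typical of hardware that amortizes work
# across a batch but runs each batched request somewhat slower.

SINGLE_REQUEST_S = 1.0   # latency of an unbatched request
BATCH_SIZE = 8
BATCH_TIME_S = 2.0       # wall-clock time to finish all 8 together

unbatched_qps = 1 / SINGLE_REQUEST_S     # 1.0 QPS, 1.0 s per-request latency
batched_qps = BATCH_SIZE / BATCH_TIME_S  # 4.0 QPS: 4x the throughput
batched_latency_s = BATCH_TIME_S         # but every request now waits 2.0 s
```

Under these assumed numbers, batching quadruples throughput while doubling each individual user's latency, which is exactly the efficiency-versus-tail-latency trade-off described above.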