Definition
The total time elapsed between a user's query and the final response in an AI pipeline, representing the cumulative delay of embedding generation, vector retrieval, and LLM inference. Lowering latency often comes at the cost of retrieval accuracy or model reasoning depth.
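As a minimal sketch of how these stage delays add up, the snippet below times each stage separately and sums them into the end-to-end figure. The `embed_model.embed`, `vector_index.search`, and `llm.generate` calls are hypothetical stand-ins for whatever embedding, retrieval, and generation clients a given stack actually uses.

```python
import time

def answer_with_timings(query, embed_model, vector_index, llm):
    """Run a simple retrieval-augmented pipeline and record per-stage latency.

    embed_model, vector_index, and llm are placeholders for real clients.
    """
    timings = {}

    t0 = time.perf_counter()
    query_vector = embed_model.embed(query)              # embedding generation
    timings["embedding_s"] = time.perf_counter() - t0

    t1 = time.perf_counter()
    documents = vector_index.search(query_vector, k=5)   # vector retrieval
    timings["retrieval_s"] = time.perf_counter() - t1

    t2 = time.perf_counter()
    response = llm.generate(query, context=documents)    # LLM inference
    timings["generation_s"] = time.perf_counter() - t2

    # End-to-end latency: the sum of the stage delays plus any glue overhead,
    # which is what the user actually experiences.
    timings["total_s"] = time.perf_counter() - t0
    return response, timings
```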
- Time to First Token (TTFT) (Component; see the sketch after this list)
- Throughput (Performance Metric Counterpart)
- Embedding Generation (Pipeline Stage)
- Vector Retrieval (Pipeline Stage)
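Since Time to First Token appears above as a component, here is a minimal sketch of how TTFT differs from total latency when the response is streamed; the token iterator is a stand-in for any streaming LLM client.

```python
import time

def measure_ttft_and_total(token_stream):
    """Consume a streaming response and report TTFT and total latency."""
    start = time.perf_counter()
    first_token_at = None
    tokens = []
    for token in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrives
        tokens.append(token)
    end = time.perf_counter()
    return {
        "ttft_s": (first_token_at - start) if first_token_at is not None else None,
        "total_s": end - start,
        "text": "".join(tokens),
    }
```

In a streamed interface, TTFT governs perceived responsiveness, while the total figure matches the end-to-end definition above.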
Disambiguation
Refers to the end-to-end inference delay in AI workflows rather than standard network ping or database read speeds.
Visual Analog
The total duration a customer spends at a drive-thru, from the moment they speak into the intercom to the moment they receive their bag at the window.