Definition
The total time elapsed between a user's query and the final response in an AI pipeline, representing the cumulative delay of embedding generation, vector retrieval, and LLM inference. Lowering latency often comes at the cost of retrieval accuracy or model reasoning depth.
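As a minimal sketch of how these stage delays add up, the snippet below times each stage separately and sums them into the end-to-end figure. The `embed_model.embed`, `vector_index.search`, and `llm.generate` calls are hypothetical stand-ins for whatever embedding, retrieval, and generation clients a given stack actually uses.

```python
import time

def answer_with_timings(query, embed_model, vector_index, llm):
    """Run a simple retrieval-augmented pipeline and record per-stage latency.

    embed_model, vector_index, and llm are placeholders for real clients.
    """
    timings = {}

    t0 = time.perf_counter()
    query_vector = embed_model.embed(query)              # embedding generation
    timings["embedding_s"] = time.perf_counter() - t0

    t1 = time.perf_counter()
    documents = vector_index.search(query_vector, k=5)   # vector retrieval
    timings["retrieval_s"] = time.perf_counter() - t1

    t2 = time.perf_counter()
    response = llm.generate(query, context=documents)    # LLM inference
    timings["generation_s"] = time.perf_counter() - t2

    # End-to-end latency: the sum of the stage delays plus any glue overhead,
    # which is what the user actually experiences.
    timings["total_s"] = time.perf_counter() - t0
    return response, timings
```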
- Time to First Token (TTFT) (Component; see the sketch after this list)
- Throughput (Performance Metric Counterpart)
- Embedding Generation (Pipeline Stage)
- Vector Retrieval (Pipeline Stage)
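Since Time to First Token appears above as a component, here is a minimal sketch of how TTFT differs from total latency when the response is streamed; the token iterator is a stand-in for any streaming LLM client.

```python
import time

def measure_ttft_and_total(token_stream):
    """Consume a streaming response and report TTFT and total latency."""
    start = time.perf_counter()
    first_token_at = None
    tokens = []
    for token in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrives
        tokens.append(token)
    end = time.perf_counter()
    return {
        "ttft_s": (first_token_at - start) if first_token_at is not None else None,
        "total_s": end - start,
        "text": "".join(tokens),
    }
```

In a streamed interface, TTFT governs perceived responsiveness, while the total figure matches the end-to-end definition above.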
Disambiguation
Refers to the end-to-end inference delay in AI workflows rather than standard network ping or database read speeds.
Visual Analog
The total duration a customer spends at a drive-thru, from the moment they speak into the intercom to the moment they receive their bag at the window.