Definition
Query Latency is the total duration an AI agent or RAG pipeline takes to process a request, spanning embedding generation, vector database retrieval, context re-ranking, and the final LLM inference. It is the end-to-end delay that governs user experience, and it captures the trade-off between retrieval depth and response speed.
Distinguish between 'Inference Latency' (LLM only) and 'Pipeline Latency' (retrieval + processing + inference).
"A multi-stop transit route where the 'query' must stop at a library (Vector DB) to pick up books before heading to a translator (LLM) to deliver the message."
Related Concepts
- Time to First Token (TTFT) (Component)
- Vector Search Latency (Component)
- Semantic Caching (Optimization Technique)
- Throughput (Inverse System Metric)
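The stage-by-stage decomposition above can be sketched as a simple timing harness. This is an illustrative sketch only: the stage functions (`embed_query`, `vector_search`, `rerank`, `llm_generate`) are hypothetical placeholders standing in for real embedding, vector database, re-ranker, and LLM calls, with `time.sleep` simulating their cost.

```python
import time

# Hypothetical stage stubs -- in a real pipeline these would call an
# embedding model, a vector database, a re-ranker, and an LLM.
def embed_query(q):
    time.sleep(0.01)          # simulated embedding cost
    return [0.0] * 8

def vector_search(vec, k=5):
    time.sleep(0.02)          # simulated retrieval cost
    return ["doc"] * k

def rerank(q, docs):
    time.sleep(0.01)          # simulated re-ranking cost
    return docs

def llm_generate(q, ctx):
    time.sleep(0.05)          # simulated inference cost
    return "answer"

def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def run_pipeline(query):
    """Measure each stage; pipeline latency = sum of all stage latencies."""
    timings = {}
    vec, timings["embedding"] = timed(embed_query, query)
    docs, timings["vector_search"] = timed(vector_search, vec)
    ctx, timings["rerank"] = timed(rerank, query, docs)
    answer, timings["llm_inference"] = timed(llm_generate, query, ctx)
    timings["pipeline_total"] = sum(timings.values())
    return answer, timings

answer, timings = run_pipeline("What drives query latency?")
for stage, t in timings.items():
    print(f"{stage}: {t * 1000:.1f} ms")
```

Note how `llm_inference` alone corresponds to Inference Latency, while `pipeline_total` corresponds to Pipeline Latency (retrieval + processing + inference), making the disambiguation above concrete.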