Definition
The velocity at which a Large Language Model (LLM) processes input tokens and generates output tokens, typically measured in tokens per second (TPS). In RAG and agentic workflows, inference speed determines both the time-to-first-token (TTFT) and the total duration of multi-step reasoning cycles.
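As a minimal sketch of how these two metrics are measured, the snippet below times a token stream; `generate_stream` is a hypothetical stand-in for any streaming LLM client that yields tokens one at a time.

```python
import time

def generate_stream(prompt):
    # Hypothetical stand-in for a streaming LLM client;
    # here it just yields dummy tokens with a small delay.
    for token in prompt.split():
        time.sleep(0.01)
        yield token

def measure_inference_speed(prompt):
    """Return (time_to_first_token, tokens_per_second) for one request."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in generate_stream(prompt):
        if ttft is None:
            # Latency until the first token arrives (TTFT).
            ttft = time.perf_counter() - start
        n_tokens += 1
    total = time.perf_counter() - start
    # Overall throughput in tokens per second (TPS).
    tps = n_tokens / total if total > 0 else 0.0
    return ttft, tps
```

In practice the same loop wraps a real streaming API call; TTFT captures perceived responsiveness while TPS captures total generation throughput.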
- Quantization: Optimization technique that increases speed by reducing numerical precision.
- Time to First Token (TTFT): A component metric of inference speed focusing on initial responsiveness.
- KV Cache: Architectural component that accelerates inference by storing previously computed attention keys/values.
Disambiguation
Distinguish from 'Retrieval Latency', which is the time taken to fetch documents from a vector database before the LLM begins processing.
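A sketch of how the two latencies are kept separate in a RAG pipeline; `retrieve_docs` and `llm_answer` are hypothetical placeholders for a vector-database lookup and an LLM call.

```python
import time

def retrieve_docs(query):
    # Hypothetical vector-database lookup.
    time.sleep(0.02)
    return ["doc1", "doc2"]

def llm_answer(query, docs):
    # Hypothetical LLM call.
    time.sleep(0.05)
    return "answer"

def timed_rag(query):
    """Time retrieval and generation separately so retrieval
    latency and inference speed are never conflated."""
    t0 = time.perf_counter()
    docs = retrieve_docs(query)
    retrieval_latency = time.perf_counter() - t0  # vector DB time only

    t1 = time.perf_counter()
    answer = llm_answer(query, docs)
    generation_time = time.perf_counter() - t1    # LLM inference time only

    return answer, retrieval_latency, generation_time
```

Reporting the two numbers separately makes it clear whether a slow end-to-end response is a retrieval problem or an inference-speed problem.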
Visual Analog
The 'Frames Per Second' (FPS) of a video game—higher speeds create a seamless, real-time experience, while low speeds cause jarring lag.