
Inference Speed

Definition

The rate at which a Large Language Model (LLM) processes input tokens and generates output tokens, typically measured in tokens per second (TPS). In RAG and agentic workflows, it determines the time-to-first-token (TTFT) and the total duration of multi-step reasoning cycles.
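
A minimal sketch of how both numbers can be measured from any token stream; the `fake_stream` generator below is a hypothetical stand-in for a real streaming API call:

```python
import time

def measure_inference_speed(stream):
    """Measure time-to-first-token (TTFT) and decode throughput in
    tokens per second (TPS) from any iterable that yields tokens."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_at - start if first_token_at is not None else float("nan")
    # Decode TPS counts tokens emitted after the first one, over the time
    # spent decoding (i.e. excluding the prefill phase before TTFT).
    tps = (n_tokens - 1) / (end - first_token_at) if n_tokens > 1 else 0.0
    return ttft, tps

# Stand-in stream: ~20 ms per token simulates roughly 50 TPS.
def fake_stream(n=50, delay=0.02):
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = measure_inference_speed(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, decode speed: {tps:.1f} TPS")
```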

Disambiguation

Distinguish from 'Retrieval Latency', which is the time taken to fetch documents from a vector database before the LLM begins processing.
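
The distinction is easy to enforce with separate timers; in the sketch below, `retrieve` and `generate` are hypothetical stand-ins for a vector-store lookup and an LLM call:

```python
import time

def timed_rag_query(retrieve, generate, question):
    """Time retrieval and generation separately, so that retrieval
    latency is never conflated with LLM inference speed."""
    t0 = time.perf_counter()
    docs = retrieve(question)           # vector-database fetch
    retrieval_latency = time.perf_counter() - t0

    t1 = time.perf_counter()
    answer = generate(question, docs)   # LLM prefill + decode
    inference_time = time.perf_counter() - t1

    print(f"retrieval: {retrieval_latency * 1000:.0f} ms, "
          f"inference: {inference_time * 1000:.0f} ms")
    return answer

# Usage with trivial stand-ins:
timed_rag_query(lambda q: ["doc A", "doc B"],
                lambda q, docs: f"answer grounded in {len(docs)} docs",
                "What is TPS?")
```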

Visual Metaphor

"The 'Frames Per Second' (FPS) of a video game—higher speeds create a seamless, real-time experience, while low speeds cause jarring lag."

Key Tools
  • vLLM (example below)
  • NVIDIA TensorRT-LLM
  • llama.cpp
  • Groq LPU
  • Hugging Face Text Generation Inference (TGI)
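
As one example of these tools in use, the sketch below loads a model through vLLM's offline batching interface; the model name and sampling settings are illustrative choices, not recommendations:

```python
from vllm import LLM, SamplingParams

# Load a model for offline batched inference; continuous batching and
# PagedAttention are what give vLLM its throughput advantage.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain tokens per second in one sentence."], params)
print(outputs[0].outputs[0].text)
```
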
Related Connections
  • Quantization: optimization technique that increases speed by reducing numerical precision.
  • Time to First Token (TTFT): a component metric of inference speed focusing on initial responsiveness.
  • KV Cache: architectural component that accelerates inference by storing previously computed attention keys/values (see the sketch below).
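
Loosely, the payoff of the KV Cache can be shown with a back-of-the-envelope cost model (a toy sketch, not a real FLOP count): without a cache, decoding step t recomputes attention scores for all t positions; with one, only the new token's query attends over the stored keys and values:

```python
def attention_score_ops(seq_len, use_kv_cache):
    """Rough count of query-key dot products needed to decode
    `seq_len` tokens one at a time (a toy cost model)."""
    total = 0
    for t in range(1, seq_len + 1):
        if use_kv_cache:
            total += t        # new token attends over t cached positions
        else:
            total += t * t    # recompute attention for all t positions
    return total

print(attention_score_ops(1024, use_kv_cache=False))  # ~358 million
print(attention_score_ops(1024, use_kv_cache=True))   # ~525 thousand
```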
