
Inference Speed

The rate at which a Large Language Model (LLM) processes input tokens and generates output tokens, typically measured in tokens per second (TPS). In RAG and agentic workflows, inference speed determines the time-to-first-token (TTFT) and the total duration of multi-step reasoning cycles.
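The two quantities in the definition can be measured directly from any streaming token source. A minimal sketch, assuming a generic Python iterable that yields tokens as they arrive (the `fake_stream` generator below is a hypothetical stand-in for a real streaming LLM client):

```python
import time

def measure_inference_speed(stream):
    """Return (time-to-first-token, tokens/sec) for an iterable of tokens."""
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now  # first token arrived: fixes TTFT
        count += 1
    end = time.perf_counter()
    ttft = first_token_time - start
    tps = count / (end - start)  # throughput over the whole generation
    return ttft, tps

# Hypothetical stand-in simulating a streaming LLM response.
def fake_stream(n_tokens=50, delay=0.001):
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = measure_inference_speed(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.0f} tokens/sec")
```

Note that TTFT is dominated by prompt (prefill) processing, while the overall tokens/sec figure also reflects per-token decode speed; production benchmarks often report the two separately.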

Disambiguation

Distinguish from 'Retrieval Latency', which is the time taken to fetch documents from a vector database before the LLM begins processing.
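The distinction is easiest to see by timing the two phases of a RAG request separately. A sketch with hypothetical stand-in functions (`retrieve` and `generate` are placeholders, not a real API):

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Hypothetical stand-ins for a RAG pipeline's two phases.
def retrieve(query):
    time.sleep(0.02)          # simulated vector-DB lookup (retrieval latency)
    return ["doc1", "doc2"]

def generate(query, docs):
    time.sleep(0.05)          # simulated LLM call (governed by inference speed)
    return "answer"

docs, retrieval_latency = timed(retrieve, "q")
answer, generation_latency = timed(generate, "q", docs)
print(f"retrieval: {retrieval_latency * 1000:.0f} ms, "
      f"generation: {generation_latency * 1000:.0f} ms")
```

Only the second number improves with a faster model or inference stack; the first is a property of the retrieval layer.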

Visual Metaphor

The 'Frames Per Second' (FPS) of a video game—higher speeds create a seamless, real-time experience, while low speeds cause jarring lag.
