
Inference Speed

Definition

The rate at which a Large Language Model (LLM) processes input tokens and generates output tokens, typically measured in tokens per second (TPS). In RAG and agentic workflows, it determines the time-to-first-token (TTFT) and the total duration of multi-step reasoning cycles.
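
A minimal sketch of how both numbers can be measured from any token stream; the `fake_stream` generator below is a hypothetical stand-in for a real streaming API call:

```python
import time

def measure_inference_speed(stream):
    """Measure time-to-first-token (TTFT) and decode throughput in
    tokens per second (TPS) from any iterable that yields tokens."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_at - start if first_token_at is not None else float("nan")
    # Decode TPS counts tokens emitted after the first one, over the time
    # spent decoding (i.e. excluding the prefill phase before TTFT).
    tps = (n_tokens - 1) / (end - first_token_at) if n_tokens > 1 else 0.0
    return ttft, tps

# Stand-in stream: ~20 ms per token simulates roughly 50 TPS.
def fake_stream(n=50, delay=0.02):
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = measure_inference_speed(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, decode speed: {tps:.1f} TPS")
```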

Disambiguation

Distinguish from 'Retrieval Latency', which is the time taken to fetch documents from a vector database before the LLM begins processing.
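
The distinction is easy to enforce with separate timers; in the sketch below, `retrieve` and `generate` are hypothetical stand-ins for a vector-store lookup and an LLM call:

```python
import time

def timed_rag_query(retrieve, generate, question):
    """Time retrieval and generation separately, so that retrieval
    latency is never conflated with LLM inference speed."""
    t0 = time.perf_counter()
    docs = retrieve(question)           # vector-database fetch
    retrieval_latency = time.perf_counter() - t0

    t1 = time.perf_counter()
    answer = generate(question, docs)   # LLM prefill + decode
    inference_time = time.perf_counter() - t1

    print(f"retrieval: {retrieval_latency * 1000:.0f} ms, "
          f"inference: {inference_time * 1000:.0f} ms")
    return answer

# Usage with trivial stand-ins:
timed_rag_query(lambda q: ["doc A", "doc B"],
                lambda q, docs: f"answer grounded in {len(docs)} docs",
                "What is TPS?")
```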

Visual Metaphor

"The 'Frames Per Second' (FPS) of a video game—higher speeds create a seamless, real-time experience, while low speeds cause jarring lag."

Key Tools
  • vLLM (example below)
  • NVIDIA TensorRT-LLM
  • llama.cpp
  • Groq LPU
  • Hugging Face Text Generation Inference (TGI)
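
As one example of these tools in use, the sketch below loads a model through vLLM's offline batching interface; the model name and sampling settings are illustrative choices, not recommendations:

```python
from vllm import LLM, SamplingParams

# Load a model for offline batched inference; continuous batching and
# PagedAttention are what give vLLM its throughput advantage.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain tokens per second in one sentence."], params)
print(outputs[0].outputs[0].text)
```
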
Related Connections
  • Quantization: optimization technique that increases speed by reducing numerical precision.
  • Time to First Token (TTFT): a component metric of inference speed focusing on initial responsiveness.
  • KV Cache: architectural component that accelerates inference by storing previously computed attention keys/values (see the sketch below).
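
Loosely, the payoff of the KV Cache can be shown with a back-of-the-envelope cost model (a toy sketch, not a real FLOP count): without a cache, decoding step t recomputes attention scores for all t positions; with one, only the new token's query attends over the stored keys and values:

```python
def attention_score_ops(seq_len, use_kv_cache):
    """Rough count of query-key dot products needed to decode
    `seq_len` tokens one at a time (a toy cost model)."""
    total = 0
    for t in range(1, seq_len + 1):
        if use_kv_cache:
            total += t        # new token attends over t cached positions
        else:
            total += t * t    # recompute attention for all t positions
    return total

print(attention_score_ops(1024, use_kv_cache=False))  # ~358 million
print(attention_score_ops(1024, use_kv_cache=True))   # ~525 thousand
```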
