Definition
The measurement of discrete semantic units processed during an LLM inference cycle, encompassing both input tokens (prompt/context) and output tokens (completion). In RAG and Agentic systems, it serves as the primary metric for managing API costs, monitoring latency, and preventing context window overflow during document retrieval.
Quantifies internal model units rather than raw character counts or file sizes.
"A metered taxi fare where the cost and time of the trip are determined by every block (token) the car travels through the city."
Conceptual Overview
The measurement of discrete semantic units processed during an LLM inference cycle, encompassing both input tokens (prompt/context) and output tokens (completion). In RAG and Agentic systems, it serves as the primary metric for managing API costs, monitoring latency, and preventing context window overflow during document retrieval.
Disambiguation
Quantifies internal model units rather than raw character counts or file sizes.
Visual Analog
A metered taxi fare where the cost and time of the trip are determined by every block (token) the car travels through the city.