Definition
The measurement of discrete semantic units processed during an LLM inference cycle, encompassing both input tokens (prompt and context) and output tokens (completion). In RAG and agentic systems, token count is the primary metric for managing API costs, monitoring latency, and preventing context-window overflow during document retrieval.
Related concepts:
- Context Window (Hard Constraint)
- Chunking (Input Optimization)
- Inference Latency (Performance Correlation)
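As a concrete illustration of budgeting by token count, the sketch below estimates the tokens consumed by a prompt plus retrieved chunks and checks them against a context-window limit before a model call. It is a minimal sketch, not a prescribed implementation: the tiktoken library, the cl100k_base encoding, the window size, and the per-token price are assumptions chosen for illustration and should be replaced with the values for the model actually in use.

```python
# Minimal sketch: estimate input token usage for a RAG prompt and guard
# against context-window overflow. The encoding name, window size, and
# price below are illustrative assumptions, not values from this entry.
import tiktoken

CONTEXT_WINDOW = 8192          # assumed model context window, in tokens
PRICE_PER_1K_INPUT = 0.0005    # assumed input price, USD per 1K tokens

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count model tokens, not characters."""
    return len(enc.encode(text))

def budget_check(prompt: str, retrieved_chunks: list[str],
                 reserve_for_output: int = 512) -> dict:
    """Sum input tokens and flag prompts that would overflow the window."""
    input_tokens = count_tokens(prompt) + sum(count_tokens(c) for c in retrieved_chunks)
    return {
        "input_tokens": input_tokens,
        "estimated_cost_usd": input_tokens / 1000 * PRICE_PER_1K_INPUT,
        "fits_window": input_tokens + reserve_for_output <= CONTEXT_WINDOW,
    }

if __name__ == "__main__":
    report = budget_check(
        prompt="Answer using only the provided context.",
        retrieved_chunks=["Tokens are subword units...", "RAG retrieves documents..."],
    )
    print(report)
```

In practice the tokenizer should match the target model, since different models segment the same text into different numbers of tokens.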
Disambiguation
Token count quantifies internal model units rather than raw character counts or file sizes.
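A quick way to see the distinction is to compare a string's character length with its token count. The snippet below is a hedged example: it assumes the tiktoken library and the cl100k_base encoding, and exact counts will differ by tokenizer and model.

```python
# Sketch: character count and token count are different measurements.
# Assumes the tiktoken library with an illustrative encoding choice.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Retrieval-augmented generation (RAG) chunks documents before indexing."

print(len(text))              # characters in the raw string
print(len(enc.encode(text)))  # tokens the model actually processes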
Visual Analog
A metered taxi fare where the cost and time of the trip are determined by every block (token) the car travels through the city.