Definition
Prompt caching persists and reuses the computed state of frequently used prompt prefixes (such as system instructions, few-shot examples, or static RAG context) to reduce latency and token-processing costs during inference.
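In practice, some providers expose this through explicit cache markers on the request, while others apply prefix caching automatically. The sketch below uses the cache_control field from Anthropic's Messages API as one concrete example; the model name and placeholder context are illustrative, and exact fields and minimum cacheable prefix lengths vary by provider and version, so check current documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A long, stable prefix worth caching (illustrative placeholder text).
# Providers typically enforce a minimum cacheable length (on the order of
# ~1024 tokens), so short prefixes may not be cached at all.
STATIC_CONTEXT = "You are a contracts assistant.\n" + ("<terms of service text> " * 2000)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STATIC_CONTEXT,
            # Marks everything up to and including this block as a cacheable
            # prefix; later requests with an identical prefix reuse its
            # precomputed state instead of re-running prefill over it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the termination clause."}],
)

# Usage metadata reports how much of the prompt was written to or read from
# the cache, which is how you verify that caching actually took effect.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```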
Related Concepts
- KV Cache (Underlying Mechanism)
- System Prompt (Prerequisite)
- Context Window (Constraint)
Conceptual Overview
Transformers process a prompt in a prefill pass that computes per-layer key/value (KV) attention states for every token, and because attention is causal, the state for a prefix depends only on that prefix. Prompt caching exploits this: when many requests begin with the same system instructions, few-shot examples, or reference documents, the serving stack computes the prefix's KV state once, stores it, and reattaches it to each later request, so only the new suffix tokens pay the prefill cost. Hits generally require an exact prefix match, and providers typically evict cached entries after a short time-to-live.
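A minimal sketch of the idea, with all names hypothetical: the cache keys on a hash of the exact prefix tokens, so identical prefixes reuse the stored prefill result while any changed token forces a recompute.

```python
import hashlib
from typing import Callable

class PrefixKVCache:
    """Toy prefix cache: hash of exact prefix token IDs -> precomputed KV state.

    In a real serving stack the stored value would be per-layer key/value
    tensors living on the accelerator; here it is an opaque Python object.
    """

    def __init__(self) -> None:
        self._store: dict[str, object] = {}

    @staticmethod
    def _key(prefix_tokens: list[int]) -> str:
        # Exact-match key: changing any prefix token changes the hash,
        # which is why cache hits require a byte-identical prefix.
        raw = ",".join(map(str, prefix_tokens)).encode()
        return hashlib.sha256(raw).hexdigest()

    def get_or_compute(
        self,
        prefix_tokens: list[int],
        prefill: Callable[[list[int]], object],
    ) -> object:
        key = self._key(prefix_tokens)
        if key not in self._store:
            # Expensive path: full prefill over the prefix, paid once.
            self._store[key] = prefill(prefix_tokens)
        return self._store[key]  # cheap on every later identical request

# Usage: the first call pays for prefill, the second is a cache hit.
cache = PrefixKVCache()
system_prompt_tokens = [101, 2023, 2003, 1037, 2146, 6123, 102]  # illustrative IDs
kv = cache.get_or_compute(system_prompt_tokens, lambda toks: {"kv": len(toks)})
kv_again = cache.get_or_compute(system_prompt_tokens, lambda toks: {"kv": len(toks)})
assert kv is kv_again  # same object back: prefill ran only once
```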
Disambiguation
Distinct from Semantic Caching, which stores and reuses model outputs for queries with similar meaning; prompt caching instead stores the pre-processed state of the prompt inputs themselves, so a hit requires an exact match on the prompt prefix.
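The distinction is easiest to see in what each cache stores and how it is keyed. The sketch below is purely illustrative, with both caches reduced to plain dictionaries.

```python
# Semantic cache: keyed on the (approximate) meaning of the WHOLE prompt,
# stores the model's OUTPUT. A hit skips running the model entirely.
# (Real semantic caches key on embeddings and match by similarity, not text.)
semantic_cache: dict[str, str] = {}
semantic_cache["what is prompt caching"] = "Prompt caching reuses precomputed prompt state."

# Prompt cache: keyed on the EXACT token prefix, stores the model's
# precomputed INPUT state. A hit still runs the model, but skips prefill
# for the cached prefix.
prompt_cache: dict[int, object] = {}
prompt_cache[hash("You are a contracts assistant. <long static context>")] = {
    "kv_state": "per-layer key/value tensors for the prefix",
}
```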
Visual Analog
A reusable rubber stamp for a long legal header, sparing you from handwriting the same introduction on every page.