Definition
The storage and reuse of pre-computed LLM responses, vector embeddings, or prompt prefixes to reduce inference latency and API token costs. In RAG specifically, it involves Semantic Caching, which matches new queries to cached results by vector similarity rather than exact string matching, and Prompt Caching, which preserves context across long agentic turns.
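To make the semantic-caching path concrete, here is a minimal sketch in Python. It is illustrative only: the injected embed_fn, the in-memory linear scan, and the 0.92 similarity threshold are assumptions, not a reference implementation.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # assumed cutoff; tune per workload


class SemanticCache:
    """Toy in-memory semantic cache keyed by query-embedding similarity."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # any text -> 1-D vector function (assumption)
        self.embeddings = []      # vectors of previously seen queries
        self.responses = []       # LLM responses paired with those queries

    def lookup(self, query: str):
        """Return a cached response if a prior query is similar enough, else None."""
        if not self.embeddings:
            return None
        q = self.embed_fn(query)
        matrix = np.stack(self.embeddings)
        # Cosine similarity between the new query and every cached query.
        sims = (matrix @ q) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
        best = int(np.argmax(sims))
        if sims[best] >= SIMILARITY_THRESHOLD:
            return self.responses[best]  # semantic hit: skip the LLM call
        return None                      # miss: caller falls through to the LLM

    def store(self, query: str, response: str):
        self.embeddings.append(self.embed_fn(query))
        self.responses.append(response)
```

In practice the linear scan would be replaced by an approximate-nearest-neighbor index (often the same vector database used for retrieval), and the threshold trades hit rate against the risk of serving an answer that only looks similar.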
Related Concepts
- Semantic Similarity (prerequisite for identifying cache hits on non-exact queries)
- TTL (Time To Live) (mechanism governing cache invalidation and data freshness; see the sketch after this list)
- Prompt Engineering (practice optimized by prefix-based prompt caching)
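The TTL sketch referenced above, again a hypothetical Python fragment: expired entries are treated as misses and lazily evicted, and the 24-hour default is an assumed freshness window, not a recommendation.

```python
import time

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # assumed window; tune to corpus churn


class TTLCache:
    """Minimal TTL cache: an expired entry counts as a miss and is evicted."""

    def __init__(self, ttl: float = DEFAULT_TTL_SECONDS):
        self.ttl = ttl
        self._store = {}  # key -> (response, absolute expiry time)

    def get(self, key: str):
        hit = self._store.get(key)
        if hit is None:
            return None
        response, expires_at = hit
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy eviction keeps stale answers out
            return None
        return response

    def set(self, key: str, response: str):
        self._store[key] = (response, time.monotonic() + self.ttl)
```

The same expiry check slots into the semantic cache above by storing (response, expires_at) pairs instead of bare responses.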
Disambiguation
Refers to Semantic Caching and Prompt Caching at the application layer, not to hardware L1/L2 CPU caches or browser-side storage.
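Prompt (prefix) caching deserves one structural note: providers that support it generally reuse precomputed state only for a byte-identical prompt prefix, so stable material must come first. The sketch below shows that ordering pattern only; SYSTEM_PROMPT, TOOL_DEFINITIONS, and build_prompt are hypothetical names, and vendor-specific activation details (cache-control markers, minimum prefix lengths) are deliberately omitted.

```python
# Assumed pattern: keep invariant content first so the provider's prefix
# cache (e.g., reused KV-cache state) can match across requests.
SYSTEM_PROMPT = "You are a support assistant for AcmeCo."     # stable every turn
TOOL_DEFINITIONS = "...long tool and schema descriptions..."  # stable every turn


def build_prompt(history: list[str], user_turn: str) -> str:
    stable_prefix = f"{SYSTEM_PROMPT}\n{TOOL_DEFINITIONS}"  # cacheable prefix
    variable_suffix = "\n".join(history + [user_turn])      # changes every turn
    return f"{stable_prefix}\n{variable_suffix}"
```

Placing volatile content (timestamps, user IDs) in the prefix silently defeats this kind of caching, which is why long agentic loops keep system instructions and tool schemas pinned at the top.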
Visual Analog
A prepared 'frequently asked questions' sheet at a help desk that prevents the clerk from calling the back office for every repeat visitor.