Definition
The practice of using an in-memory data store to hold LLM responses, prompt templates, or vector embeddings, so that responses to recurring queries can be served quickly, reducing inference latency and API token expenditure.
Specifically refers to semantic and prompt caching for LLMs, not generic web session state or static asset delivery.
"An agent's 'Quick-Access Cheat Sheet' that stores answers to common questions so they don't have to consult the expensive master textbook every time."
- Semantic Similarity Search (Component)
- Token Latency (Influenced metric)
- TTL (Time-to-Live) (Component)
- Vector Database (Alternative/Augmentation)