Definition
The storage and reuse of pre-computed LLM responses, vector embeddings, or prompt prefixes to reduce inference latency and API token costs. In RAG specifically, it involves Semantic Caching, which matches new queries to cached results by vector similarity rather than exact string matching, and Prompt Caching, which preserves context across long agentic turns.
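To make the semantic-caching path concrete, here is a minimal sketch in Python. It is illustrative only: the injected embed_fn, the in-memory linear scan, and the 0.92 similarity threshold are assumptions, not a reference implementation.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # assumed cutoff; tune per workload


class SemanticCache:
    """Toy in-memory semantic cache keyed by query-embedding similarity."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # any text -> 1-D vector function (assumption)
        self.embeddings = []      # vectors of previously seen queries
        self.responses = []       # LLM responses paired with those queries

    def lookup(self, query: str):
        """Return a cached response if a prior query is similar enough, else None."""
        if not self.embeddings:
            return None
        q = self.embed_fn(query)
        matrix = np.stack(self.embeddings)
        # Cosine similarity between the new query and every cached query.
        sims = (matrix @ q) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
        best = int(np.argmax(sims))
        if sims[best] >= SIMILARITY_THRESHOLD:
            return self.responses[best]  # semantic hit: skip the LLM call
        return None                      # miss: caller falls through to the LLM

    def store(self, query: str, response: str):
        self.embeddings.append(self.embed_fn(query))
        self.responses.append(response)
```

In practice the linear scan would be replaced by an approximate-nearest-neighbor index (often the same vector database used for retrieval), and the threshold trades hit rate against the risk of serving an answer that only looks similar.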
Related Concepts
- Semantic Similarity (prerequisite for identifying cache hits on non-exact queries)
- TTL (Time To Live) (mechanism governing cache invalidation and data freshness; see the sketch after this list)
- Prompt Engineering (practice optimized by prefix-based prompt caching)
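The TTL sketch referenced above, again a hypothetical Python fragment: expired entries are treated as misses and lazily evicted, and the 24-hour default is an assumed freshness window, not a recommendation.

```python
import time

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # assumed window; tune to corpus churn


class TTLCache:
    """Minimal TTL cache: an expired entry counts as a miss and is evicted."""

    def __init__(self, ttl: float = DEFAULT_TTL_SECONDS):
        self.ttl = ttl
        self._store = {}  # key -> (response, absolute expiry time)

    def get(self, key: str):
        hit = self._store.get(key)
        if hit is None:
            return None
        response, expires_at = hit
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy eviction keeps stale answers out
            return None
        return response

    def set(self, key: str, response: str):
        self._store[key] = (response, time.monotonic() + self.ttl)
```

The same expiry check slots into the semantic cache above by storing (response, expires_at) pairs instead of bare responses.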
Disambiguation
Refers to Semantic Caching and Prompt Caching at the application layer, not to hardware L1/L2 CPU caches or browser-side storage.
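Prompt (prefix) caching deserves one structural note: providers that support it generally reuse precomputed state only for a byte-identical prompt prefix, so stable material must come first. The sketch below shows that ordering pattern only; SYSTEM_PROMPT, TOOL_DEFINITIONS, and build_prompt are hypothetical names, and vendor-specific activation details (cache-control markers, minimum prefix lengths) are deliberately omitted.

```python
# Assumed pattern: keep invariant content first so the provider's prefix
# cache (e.g., reused KV-cache state) can match across requests.
SYSTEM_PROMPT = "You are a support assistant for AcmeCo."     # stable every turn
TOOL_DEFINITIONS = "...long tool and schema descriptions..."  # stable every turn


def build_prompt(history: list[str], user_turn: str) -> str:
    stable_prefix = f"{SYSTEM_PROMPT}\n{TOOL_DEFINITIONS}"  # cacheable prefix
    variable_suffix = "\n".join(history + [user_turn])      # changes every turn
    return f"{stable_prefix}\n{variable_suffix}"
```

Placing volatile content (timestamps, user IDs) in the prefix silently defeats this kind of caching, which is why long agentic loops keep system instructions and tool schemas pinned at the top.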
Visual Analog
A prepared 'frequently asked questions' sheet at a help desk that prevents the clerk from calling the back office for every repeat visitor.