Definition
Model Cascading is an architectural strategy that invokes models of increasing capability and cost in sequence: execution escalates to a more powerful model only when the preceding, smaller model fails to meet a confidence threshold or quality metric. In RAG pipelines, cascading is used primarily to minimize inference cost and latency by letting small language models (SLMs) handle trivial queries and escalating only the harder ones to large language models (LLMs).
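The "confidence threshold" that gates escalation needs a concrete signal. A minimal sketch of one common heuristic (an assumption here, not a standard): map the mean token log-probability of the small model's output to a [0, 1] score via the geometric mean of token probabilities.

```python
import math

def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Map per-token log-probabilities to a [0, 1] confidence score.

    Heuristic (an assumption, not a standard): the geometric mean of
    token probabilities, i.e. exp(mean log-probability). Values near 1.0
    mean the model was consistently sure of each token it emitted.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Tokens the model was sure about yield a score near 1.
print(confidence_from_logprobs([-0.05, -0.02, -0.10]))
# Tokens the model hedged on drag the score down.
print(confidence_from_logprobs([-1.5, -2.0, -0.8]))
```

In practice the threshold compared against this score is tuned on a held-out query set, trading escalation rate against answer quality.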
Related Concepts
- LLM Routing (Component)
- Confidence Scoring (Prerequisite)
- Inference Latency (Trade-off)
- Semantic Cache (Prerequisite)
Disambiguation
Unlike 'Model Routing', which selects a single model upfront, Cascading is a multi-stage fallback process: each stage runs only if the preceding, cheaper stage failed its quality check.
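The multi-stage fallback described above can be sketched in a few lines. Everything below (the model stubs, the confidence signal, the thresholds, and the `Stage` helper) is a hypothetical illustration under the assumption that each model call returns an answer plus a confidence score in [0, 1], not a reference implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    """One tier of the cascade: a model callable plus its acceptance threshold."""
    name: str
    generate: Callable[[str], tuple[str, float]]  # returns (answer, confidence)
    threshold: float

def cascade(query: str, stages: list[Stage]) -> tuple[str, str]:
    """Try each stage in order of increasing cost; escalate on low confidence.

    The final stage always answers, acting as the terminal LLM fallback.
    """
    for stage in stages[:-1]:
        answer, confidence = stage.generate(query)
        if confidence >= stage.threshold:
            return answer, stage.name  # cheap model was confident enough
    # Last stage is the most capable (and most expensive) model.
    answer, _ = stages[-1].generate(query)
    return answer, stages[-1].name

# Hypothetical stand-ins for real model calls.
def small_model(q: str) -> tuple[str, float]:
    # A real SLM would return its answer plus a confidence signal,
    # e.g. mean token log-probability mapped to [0, 1].
    return ("short answer", 0.95 if len(q) < 40 else 0.3)

def large_model(q: str) -> tuple[str, float]:
    return ("detailed answer", 0.99)

stages = [
    Stage("slm", small_model, threshold=0.8),
    Stage("llm", large_model, threshold=0.0),  # terminal fallback, always accepts
]

print(cascade("What is RAG?", stages))  # short query: the SLM is confident, no escalation
print(cascade("Compare cascade routing trade-offs in depth.", stages))  # escalates to the LLM
```

Note the contrast with routing: no upfront classifier decides the model; the decision emerges stage by stage from each model's own confidence.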
Visual Analog
A multi-stage water filtration system: a coarse mesh catches large debris immediately, and only what passes through is sent on to the expensive, high-pressure carbon filter.