Definition
Model Cascading is an architectural strategy that invokes models of increasing capability and cost in sequence: execution escalates to a more powerful model only when the preceding, smaller model fails to meet a confidence threshold or quality metric. In RAG pipelines, cascading is used primarily to minimize inference cost and latency by letting small language models (SLMs) handle trivial queries and escalating only the harder ones to large language models (LLMs).
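The "confidence threshold" that gates escalation needs a concrete signal. A minimal sketch of one common heuristic (an assumption here, not a standard): map the mean token log-probability of the small model's output to a [0, 1] score via the geometric mean of token probabilities.

```python
import math

def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Map per-token log-probabilities to a [0, 1] confidence score.

    Heuristic (an assumption, not a standard): the geometric mean of
    token probabilities, i.e. exp(mean log-probability). Values near 1.0
    mean the model was consistently sure of each token it emitted.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Tokens the model was sure about yield a score near 1.
print(confidence_from_logprobs([-0.05, -0.02, -0.10]))
# Tokens the model hedged on drag the score down.
print(confidence_from_logprobs([-1.5, -2.0, -0.8]))
```

In practice the threshold compared against this score is tuned on a held-out query set, trading escalation rate against answer quality.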
Related Concepts
- LLM Routing (Component)
- Confidence Scoring (Prerequisite)
- Inference Latency (Trade-off)
- Semantic Cache (Prerequisite)
Disambiguation
Unlike 'Model Routing', which selects a single model upfront, Cascading is a multi-stage fallback process: each stage runs only if the preceding, cheaper stage failed its quality check.
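The multi-stage fallback described above can be sketched in a few lines. Everything below (the model stubs, the confidence signal, the thresholds, and the `Stage` helper) is a hypothetical illustration under the assumption that each model call returns an answer plus a confidence score in [0, 1], not a reference implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    """One tier of the cascade: a model callable plus its acceptance threshold."""
    name: str
    generate: Callable[[str], tuple[str, float]]  # returns (answer, confidence)
    threshold: float

def cascade(query: str, stages: list[Stage]) -> tuple[str, str]:
    """Try each stage in order of increasing cost; escalate on low confidence.

    The final stage always answers, acting as the terminal LLM fallback.
    """
    for stage in stages[:-1]:
        answer, confidence = stage.generate(query)
        if confidence >= stage.threshold:
            return answer, stage.name  # cheap model was confident enough
    # Last stage is the most capable (and most expensive) model.
    answer, _ = stages[-1].generate(query)
    return answer, stages[-1].name

# Hypothetical stand-ins for real model calls.
def small_model(q: str) -> tuple[str, float]:
    # A real SLM would return its answer plus a confidence signal,
    # e.g. mean token log-probability mapped to [0, 1].
    return ("short answer", 0.95 if len(q) < 40 else 0.3)

def large_model(q: str) -> tuple[str, float]:
    return ("detailed answer", 0.99)

stages = [
    Stage("slm", small_model, threshold=0.8),
    Stage("llm", large_model, threshold=0.0),  # terminal fallback, always accepts
]

print(cascade("What is RAG?", stages))  # short query: the SLM is confident, no escalation
print(cascade("Compare cascade routing trade-offs in depth.", stages))  # escalates to the LLM
```

Note the contrast with routing: no upfront classifier decides the model; the decision emerges stage by stage from each model's own confidence.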
Visual Analog
A multi-stage water filtration system: a coarse mesh catches large debris immediately, and only what passes through is sent on to the expensive, high-pressure carbon filter.