Definition
The deployment of redundant copies of vector database indices or model inference nodes across a distributed cluster to increase read throughput and provide high availability (HA). It improves performance for concurrent RAG queries but introduces trade-offs: synchronization latency, because updates take time to propagate to every replica, and higher infrastructure costs, because each replica consumes its own compute, memory, and storage.
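To make the read-scaling and high-availability part of the definition concrete, here is a minimal Python sketch. It assumes an in-memory brute-force "index" in place of a real vector database, and `Replica` and `ReplicaPool` are hypothetical names, not any product's API. The pool sends each query to the next replica in round-robin order and skips replicas marked unhealthy, so reads keep succeeding when one copy fails.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical stand-in for one full copy of a vector index."""
    name: str
    documents: dict[str, list[float]]
    healthy: bool = True

    def search(self, query: list[float], k: int = 1) -> list[str]:
        # Brute-force dot-product scoring over this replica's full copy.
        scores = {
            doc_id: sum(q * v for q, v in zip(query, vec))
            for doc_id, vec in self.documents.items()
        }
        return sorted(scores, key=scores.get, reverse=True)[:k]


class ReplicaPool:
    """Routes each read to the next healthy replica (round-robin with failover)."""

    def __init__(self, replicas: list[Replica]):
        self.replicas = replicas
        self._cycle = itertools.cycle(range(len(replicas)))

    def search(self, query: list[float], k: int = 1) -> tuple[str, list[str]]:
        # Try each replica at most once per request; skip unhealthy ones.
        for _ in range(len(self.replicas)):
            replica = self.replicas[next(self._cycle)]
            if replica.healthy:
                return replica.name, replica.search(query, k)
        raise RuntimeError("no healthy replicas available")


if __name__ == "__main__":
    index = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0]}
    pool = ReplicaPool([Replica(f"replica-{i}", dict(index)) for i in range(3)])
    # Three identical queries are spread across three replicas.
    print([pool.search([0.9, 0.1])[0] for _ in range(3)])
    # → ['replica-0', 'replica-1', 'replica-2']
    pool.replicas[1].healthy = False  # simulate a failed node
    print([pool.search([0.9, 0.1])[0] for _ in range(3)])
    # → ['replica-0', 'replica-2', 'replica-0']
```

Round-robin is only one routing policy; real load balancers often weight replicas by latency or current load instead.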
In RAG, this refers to scaling read capacity and fault tolerance, not data partitioning (sharding) or data duplication in a corpus.
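The difference from sharding can be sketched in a few lines of Python. `replicate` and `shard` are hypothetical helpers: replication gives every node the full corpus, so any single node can answer a query on its own, while sharding gives each node a disjoint slice, so a query must fan out to all shards and the partial results must be merged.

```python
def replicate(docs: list[str], n_replicas: int) -> list[list[str]]:
    # Replication: every node holds a complete copy of the corpus/index.
    return [list(docs) for _ in range(n_replicas)]


def shard(docs: list[str], n_shards: int) -> list[list[str]]:
    # Sharding: each document lands on exactly one node. Modulo assignment
    # by position is an illustrative choice, not a recommended scheme.
    shards: list[list[str]] = [[] for _ in range(n_shards)]
    for i, doc in enumerate(docs):
        shards[i % n_shards].append(doc)
    return shards


if __name__ == "__main__":
    corpus = ["d0", "d1", "d2", "d3", "d4", "d5"]
    print([len(node) for node in replicate(corpus, 3)])  # → [6, 6, 6]
    print([len(node) for node in shard(corpus, 3)])      # → [2, 2, 2]
```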
"A popular library printing ten identical copies of the same reference book so ten researchers can look up facts simultaneously instead of waiting in line for one copy."
- Sharding (Complementary strategy for horizontal data distribution)
- Consistency Model (Trade-off governing how quickly updates reflect across replicas)
- Load Balancing (Prerequisite for distributing queries across replicas)
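The consistency trade-off, and the synchronization latency mentioned in the definition, can be illustrated with a toy model of asynchronous replication. This is a sketch under stated assumptions: writes land on a primary and are appended to an ordered log, and replicas replay that log only when `sync()` runs. `AsyncReplicatedStore` is a hypothetical class; real systems may instead replicate synchronously or use quorum reads to narrow the staleness window.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReplicaNode:
    """Hypothetical replica holding a key -> document mapping."""
    data: dict[str, str] = field(default_factory=dict)
    applied: int = 0  # number of log entries this replica has replayed


class AsyncReplicatedStore:
    """Primary accepts writes immediately; replicas catch up on sync()."""

    def __init__(self, n_replicas: int):
        self.log: list[tuple[str, str]] = []  # ordered replication log
        self.primary: dict[str, str] = {}
        self.replicas = [ReplicaNode() for _ in range(n_replicas)]

    def write(self, key: str, value: str) -> None:
        self.log.append((key, value))
        self.primary[key] = value

    def read_replica(self, i: int, key: str) -> Optional[str]:
        # May return stale data (or None) if replica i has not synced yet.
        return self.replicas[i].data.get(key)

    def sync(self) -> None:
        # Replay any log entries each replica has not applied yet.
        for node in self.replicas:
            for key, value in self.log[node.applied:]:
                node.data[key] = value
            node.applied = len(self.log)

    def lag(self, i: int) -> int:
        # Replication lag measured in unapplied log entries.
        return len(self.log) - self.replicas[i].applied


if __name__ == "__main__":
    store = AsyncReplicatedStore(n_replicas=2)
    store.write("doc-42", "v2 of the policy text")
    print(store.read_replica(0, "doc-42"), store.lag(0))  # → None 1
    store.sync()
    print(store.read_replica(0, "doc-42"), store.lag(0))  # → v2 of the policy text 0
```

Between `write()` and `sync()`, a RAG query routed to a replica would retrieve the old version of the document (or nothing), which is the staleness window a consistency model is meant to bound.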