Definition
The architectural practice of distributing the workload of a RAG pipeline or AI agent system across multiple compute or storage nodes to handle growth in concurrency and data volume. In practice this means sharding the vector database so each node stores and searches only a portion of the index, and deploying multiple parallel instances of LLM orchestrators or agent workers to serve simultaneous user sessions.
Disambiguation
Adding more instances of workers or database nodes, rather than upgrading the CPU/GPU of a single server.
Visual Analog
A supermarket opening ten checkout lanes to handle a holiday rush, rather than training one cashier to work ten times faster.
Related Concepts
- Sharding (Component)
- Load Balancing (Prerequisite)
- Vertical Scaling (Contrast)
- Statelessness (Design Constraint)
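The sharding half of this definition can be sketched in a few lines. The sketch below is illustrative only: `VectorShard`, `ShardedVectorStore`, hash-based routing, and brute-force dot-product scoring are all assumptions standing in for a real vector database, which would handle routing and merging internally. Each document is hashed to exactly one shard on write; a query is fanned out to every shard on read, and the per-shard top-k results are merged into a global top-k:

```python
import hashlib
import heapq


class VectorShard:
    """One storage node: holds only a subset of the vectors."""

    def __init__(self):
        self.vectors = {}  # doc_id -> embedding

    def add(self, doc_id, embedding):
        self.vectors[doc_id] = embedding

    def search(self, query, k):
        # Brute-force dot-product similarity, scoped to this shard only.
        def score(emb):
            return sum(q * e for q, e in zip(query, emb))

        scored = ((score(emb), doc_id) for doc_id, emb in self.vectors.items())
        return heapq.nlargest(k, scored)


class ShardedVectorStore:
    """Routes writes by stable hash; fans reads out to all shards and merges."""

    def __init__(self, num_shards):
        self.shards = [VectorShard() for _ in range(num_shards)]

    def _route(self, doc_id):
        # Stable hash so the same document always lands on the same shard.
        digest = hashlib.md5(doc_id.encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def add(self, doc_id, embedding):
        self.shards[self._route(doc_id)].add(doc_id, embedding)

    def search(self, query, k):
        # Scatter: each shard computes its local top-k (in parallel in practice).
        partials = [shard.search(query, k) for shard in self.shards]
        # Gather: merge per-shard candidates into a single global top-k.
        merged = heapq.nlargest(k, (hit for part in partials for hit in part))
        return [doc_id for _, doc_id in merged]
```

Because each shard or worker in this pattern holds no session state (statelessness), any number of replicas can sit behind a load balancer, which is what makes adding instances, rather than upgrading one server, the scaling lever.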