
Basic RAG Flows

An architectural synthesis of fundamental Retrieval-Augmented Generation patterns, covering linear pipelines, query decomposition for multi-hop reasoning, and iterative refinement loops.

TLDR

Basic RAG (Retrieval-Augmented Generation) represents the foundational architecture for grounding Large Language Models (LLMs) in external, verifiable data. This article synthesizes three primary operational patterns: the Standard Retrieval-Generation Flow, which establishes the dual pipeline of offline ingestion and online inference; Query Decomposition, which solves the "multi-hop" reasoning problem by breaking complex prompts into manageable sub-queries; and Iterative Retrieval, which employs a "Retrieve-Reason-Refine" loop to navigate deep knowledge gaps. Together, these flows transform the LLM from a static knowledge base into a dynamic reasoning engine capable of enterprise-grade precision, supported by techniques such as A/B testing of prompt variants and structured metadata indexing with Trie structures.


Conceptual Overview

The evolution of RAG marks a departure from treating LLMs as static encyclopedias. In traditional deployments, a model's utility is capped by its training cutoff and its propensity for "hallucinations"—generating plausible but factually incorrect information when its internal weights lack specific data.

Basic RAG Flows solve this by decoupling knowledge from parameters. The system treats the LLM as a reasoning agent that is provided with a "context window" containing relevant documents retrieved in real-time. This conceptual shift requires a sophisticated orchestration layer that manages how data is stored, how queries are understood, and how the model interacts with retrieved information.

The Hierarchy of RAG Complexity

  1. Linear (Standard RAG): A straight-line path from query to retrieval to generation. Best for simple fact-seeking.
  2. Parallel (Query Decomposition): A "divide and conquer" approach. The system identifies multiple facets of a query and retrieves information for each simultaneously.
  3. Recursive (Iterative Retrieval): A feedback loop where the model's initial findings dictate the next search parameters, essential for complex research tasks.

Infographic: The Basic RAG Ecosystem

Infographic Placeholder Description: A high-level architectural diagram showing a central "Query Controller." The controller receives a user prompt and selects a path: (A) the Standard path for simple queries, (B) the Decomposition path for multi-faceted queries, or (C) the Iterative path for deep-dive reasoning. All paths feed into a shared Vector Database and are optimized via A/B testing of prompt variants.
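
A minimal sketch of such a controller, assuming an injected classify() function (for example, an LLM-based intent classifier) and one callable per flow; every name here is illustrative rather than a specific framework's API.

```python
def route_query(query: str, classify, standard_rag, decomposed_rag, iterative_rag):
    """classify(query) -> 'simple' | 'multi_faceted' | 'multi_hop' (assumed interface)."""
    flow = {
        "simple": standard_rag,           # (A) linear: query -> retrieval -> generation
        "multi_faceted": decomposed_rag,  # (B) parallel: independent sub-queries
        "multi_hop": iterative_rag,       # (C) recursive: retrieve-reason-refine loop
    }[classify(query)]
    return flow(query)
```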


Practical Implementations

Implementing Basic RAG requires a bifurcated architecture: the Offline Ingestion Pipeline and the Online Inference Pipeline.

1. The Offline Ingestion Pipeline

Before a query can be answered, raw data must be transformed into a machine-readable format. The steps below are followed by a minimal code sketch.

  • NER (Named Entity Recognition): During ingestion, the system identifies key entities (e.g., "Q3 Revenue," "GDPR Compliance"). These entities are used to enrich the metadata of document chunks.
  • Chunking & Embedding: Documents are broken into semantically meaningful segments. These segments are converted into high-dimensional vectors.
  • Trie-based Metadata Indexing: To speed up filtering, metadata (like document source or date) is often stored in Trie structures, allowing for rapid prefix-based lookups that complement the fuzzy nature of vector searches.
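
The sketch below assumes injected embed() and tag_entities() functions standing in for an embedding model and an NER tagger; the fixed-size chunker and the Chunk record are deliberate simplifications, not a specific library's API.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source: str
    entities: list[str] = field(default_factory=list)     # NER-derived metadata
    embedding: list[float] = field(default_factory=list)  # vector for semantic search

def chunk_document(text: str, max_chars: int = 500) -> list[str]:
    # Naive fixed-size chunking; production systems split on semantic boundaries.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def ingest(doc_text: str, source: str, embed, tag_entities) -> list[Chunk]:
    """embed(text) -> vector and tag_entities(text) -> list of entity strings are injected."""
    chunks = []
    for segment in chunk_document(doc_text):
        chunks.append(Chunk(
            text=segment,
            source=source,
            entities=tag_entities(segment),  # enriches metadata for hard filtering
            embedding=embed(segment),
        ))
    return chunks  # persisted to the vector store, plus a Trie index over the entity metadata
```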

2. The Online Inference Pipeline

This is where the user interaction occurs. The sophistication of this pipeline determines the system's accuracy.

  • Query Understanding Layer: Instead of passing a raw string to the vector database, the system analyzes the intent. If the query is complex (e.g., "Compare our 2022 and 2023 cloud spend"), the Query Decomposition module splits it into two distinct searches.
  • Semantic Search: The system performs a similarity search in the vector database, often using HNSW (Hierarchical Navigable Small World) graphs for efficiency.
  • Context Injection: The retrieved "facts" are formatted into a prompt template. This is where A/B testing of prompt variants becomes critical; developers must test different ways of presenting context to the LLM to minimize the "lost in the middle" phenomenon. A sketch of the full pipeline follows this list.
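
The sketch below assumes an llm(prompt) -> str callable and a vector_store.search(query, k) -> list[str] interface; both are placeholders, and the decomposition and injection prompts are exactly the kind of variants one would A/B test.

```python
def answer(query: str, llm, vector_store, top_k: int = 4) -> str:
    """llm(prompt) -> str and vector_store.search(query, k) -> list[str] are assumed interfaces."""
    # 1. Query understanding: let the model split multi-faceted queries into sub-queries.
    sub_queries = llm(
        "Split the question into independent sub-questions, one per line, "
        "or repeat it unchanged if it is already simple:\n" + query
    ).splitlines()

    # 2. Semantic search: retrieve context for each sub-query.
    context: list[str] = []
    for sq in sub_queries:
        context.extend(vector_store.search(sq, k=top_k))

    # 3. Context injection: this template is one variant worth testing against others.
    prompt = (
        "Answer using only the context below. Say 'unknown' if it is insufficient.\n\n"
        "Context:\n" + "\n---\n".join(context) + "\n\nQuestion: " + query
    )
    return llm(prompt)
```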

Advanced Techniques

To move beyond "Naive RAG," several optimization strategies are employed across the three flows.

Solving the Multi-Hop Problem

Standard RAG often fails when an answer requires connecting disparate dots. Query Decomposition addresses this by treating the query as a high-level task. By generating sub-queries, the system reduces semantic noise—the dilution of a vector's meaning when too many distinct concepts are packed into one search string.

The Retrieve-Reason-Refine Loop

In Iterative Retrieval, the system doesn't stop after the first search; a code sketch of the loop follows the steps below.

  1. Retrieve: Fetch initial documents.
  2. Reason: The LLM evaluates if the documents answer the query. If not, it identifies "missing links."
  3. Refine: The LLM generates a new search query based on the missing links. This loop is particularly effective for legal or scientific research where the first set of results often introduces new terminology that must be explored.
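
The sketch below reuses the assumed llm() and vector_store.search() interfaces from the inference pipeline above; the DONE convention and the hop cap are illustrative choices, not a standard protocol.

```python
def iterative_rag(query: str, llm, vector_store, max_hops: int = 3) -> str:
    """Retrieve-Reason-Refine loop over assumed llm() and vector_store.search() interfaces."""
    collected: list[str] = []
    search_query = query
    for _ in range(max_hops):  # strict stopping criterion to bound latency
        collected.extend(vector_store.search(search_query, k=4))       # Retrieve
        verdict = llm(                                                  # Reason
            "Context:\n" + "\n".join(collected) + f"\n\nQuestion: {query}\n"
            "Reply DONE if the context fully answers the question; otherwise "
            "reply with a follow-up search query for the missing information."
        )
        if verdict.strip().upper().startswith("DONE"):
            break
        search_query = verdict.strip()                                  # Refine
    return llm("Context:\n" + "\n".join(collected) + f"\n\nQuestion: {query}")
```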

Optimization via A/B Testing (Comparing Prompt Variants)

System performance is highly sensitive to the "System Prompt." By utilizing A/B testing, architects can systematically test the following (a sketch of such a test harness appears after this list):

  • Instructional Clarity: Does the model perform better when told to "be concise" or "be exhaustive"?
  • Context Ordering: Does placing the most relevant document at the beginning or end of the prompt improve recall?
  • Decomposition Logic: Which prompt variant most reliably splits complex queries without losing the original intent?
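
The harness below assumes a variants dictionary of prompt templates, an eval_set of (question, context, reference_answer) tuples, and a grade() scorer returning a value between 0 and 1; all of these are hypothetical stand-ins for a team's own evaluation data and metric.

```python
def ab_test_prompts(variants: dict[str, str], eval_set, llm, grade) -> dict[str, float]:
    """Score each prompt variant over the evaluation set and return mean scores per variant."""
    eval_set = list(eval_set)
    scores = {name: 0.0 for name in variants}
    for question, context, reference in eval_set:
        for name, template in variants.items():
            response = llm(template.format(context=context, question=question))
            scores[name] += grade(response, reference)
    return {name: total / len(eval_set) for name, total in scores.items()}

variants = {
    "concise":    "Be concise. Context:\n{context}\nQ: {question}",
    "exhaustive": "Be exhaustive and cite the context. Context:\n{context}\nQ: {question}",
}
```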

Research and Future Directions

The primary frontier in Basic RAG Flows is closing the "compositionality gap." Research by Press et al. (2022) demonstrates that while LLMs are excellent at retrieving individual facts, their ability to compose those facts into a coherent multi-step answer is significantly lower.

Emerging Frameworks

  • FLARE (Forward-Looking Active REtrieval): Instead of retrieving once up front, FLARE triggers retrieval mid-generation whenever the model is uncertain about what it is about to produce (signalled by low-probability tokens in the upcoming sentence).
  • SELF-RAG: A framework where the model outputs "reflection tokens" to critique its own retrieval quality and relevance, effectively automating the "Reason" step of the iterative loop.

As vector databases evolve, we are seeing a tighter integration between Trie-based keyword search and semantic vector search (Hybrid Search), ensuring that specific technical terms (found via NER) are not "smoothed over" by the embedding model's latent space.


Frequently Asked Questions

Q: When should I choose Query Decomposition over Iterative Retrieval?

Query Decomposition is preferred when the sub-questions are independent and can be answered in parallel (e.g., "What were the sales in NY and CA?"). Iterative Retrieval is necessary when the second question depends on the answer to the first (e.g., "Who is the CEO of the company that acquired Startup X?").

Q: How does NER improve the retrieval process in Standard RAG?

NER allows for hard-filtering. If a user asks about "Project Titan," the system can use the NER-tagged metadata to filter the vector database to only include chunks tagged with "Project Titan," drastically reducing the search space and eliminating irrelevant results that might be semantically similar but contextually wrong.

Q: What is the "semantic noise" problem in complex queries?

When a query contains multiple distinct intents, the resulting embedding vector is an average of those intents. In a high-dimensional space, this "average" vector may land in a region where no single document is a good match, leading to poor retrieval. Decomposition ensures each intent has its own clean vector.
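
A toy illustration of this averaging effect, using hand-crafted 2-D vectors in place of real embeddings (which have hundreds of dimensions), so the numbers are only indicative:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" with one axis per intent (purely illustrative).
doc_ny_sales = np.array([1.0, 0.0])  # chunk about NY sales
doc_ca_sales = np.array([0.0, 1.0])  # chunk about CA sales

combined_query = (doc_ny_sales + doc_ca_sales) / 2  # "sales in NY and CA"

print(cosine(combined_query, doc_ny_sales))  # ~0.71: a mediocre match for both documents
print(cosine(doc_ny_sales, doc_ny_sales))    # 1.0: a decomposed, single-intent query matches cleanly
```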

Q: Does Iterative Retrieval significantly increase latency?

Yes. Because each "hop" requires a new LLM call and a new database lookup, latency scales linearly with the number of iterations. This is why strict stopping criteria and A/B testing of prompt variants for efficiency are vital in production environments.

Q: How do Trie structures complement vector databases?

Vector databases are great for "fuzzy" conceptual matches but struggle with exact matches for IDs, part numbers, or specific names. A Trie structure allows for instantaneous, exact prefix matching on metadata, which can be used to pre-filter the vector space before the expensive similarity search begins.
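
A minimal sketch of such a metadata Trie; the project names and chunk IDs are invented for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.doc_ids: set[str] = set()

class MetadataTrie:
    """Exact prefix matching over metadata values (project names, part numbers, IDs)."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, value: str, doc_id: str) -> None:
        node = self.root
        for ch in value.lower():
            node = node.children.setdefault(ch, TrieNode())
            node.doc_ids.add(doc_id)  # every prefix of `value` now maps to this chunk

    def prefix_search(self, prefix: str) -> set[str]:
        node = self.root
        for ch in prefix.lower():
            if ch not in node.children:
                return set()
            node = node.children[ch]
        return node.doc_ids

trie = MetadataTrie()
trie.insert("Project Titan", "chunk-042")
trie.insert("Project Tempest", "chunk-077")
allowed = trie.prefix_search("project tit")  # {'chunk-042'}: pre-filters the vector search space
```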

References

  1. Press et al. (2022) - Measuring and Narrowing the Compositionality Gap in Language Models
  2. Jiang et al. (2023) - Active Retrieval Augmented Generation (FLARE)
  3. Asai et al. (2023) - Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
