Mermaid Diagram of RAG Topology

A comprehensive technical guide to visualizing Retrieval-Augmented Generation (RAG) architectures using Mermaid.js, covering Naive, Advanced, Agentic, and GraphRAG topologies.

TLDR

Retrieval-Augmented Generation (RAG) has transitioned from a linear "Retrieve-and-Read" pipeline to a multi-dimensional topology involving agentic loops, graph-based relationships, and sophisticated re-ranking stages. Visualizing these architectures using Mermaid.js allows engineers to map the flow of high-dimensional vector data, the logic of query transformations, and the decision-making nodes of autonomous agents. This article explores the evolution of RAG topologies—from Naive to Agentic—providing standardized Mermaid templates to document and optimize production-grade AI systems.

Conceptual Overview

The term RAG (Retrieval-Augmented Generation) describes a framework for grounding Large Language Model (LLM) outputs in verifiable, external data. However, as systems move from prototypes to production, the "topology"—the arrangement of components and the flow of data—becomes increasingly complex.

The Visual Grammar of RAG

In technical documentation, Mermaid.js serves as a text-based diagramming tool with Markdown-inspired syntax that renders plain text into visual charts. For RAG, we primarily use graph LR (Left-to-Right) or graph TD (Top-Down) to represent the lifecycle of a query; a minimal skeleton combining the three elements below follows the list.

  1. Nodes: Represent functional units like Vector Databases, Embedding Models, or LLMs.
  2. Edges: Represent the flow of data (e.g., JSON objects, Tensors, or Natural Language strings).
  3. Subgraphs: Group related processes, such as the "Indexing Pipeline" vs. the "Retrieval Pipeline."
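
Putting these three elements together, a minimal skeleton might look like this (the node names are illustrative, not a required convention):

graph LR
    subgraph Indexing
        A[Raw Docs] --> B[Embedding Model]
        B --> C[(Vector DB)]
    end
    subgraph Retrieval
        D[User Query] --> E{Similarity Search}
        C -.-> E
        E -->|Top-K Context| F[LLM]
    end

Rendered, this gives a reader the entire query lifecycle at a glance before any implementation detail is introduced.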

The Evolution of Topology

The industry categorizes RAG into three main evolutionary stages:

  • Naive RAG: A straight-line process of indexing, retrieving, and generating.
  • Advanced RAG: Introduces pre-retrieval (query expansion) and post-retrieval (reranking) optimizations to solve issues like "Lost in the Middle" or low precision[src:001].
  • Modular/Agentic RAG: Features non-linear paths where the LLM decides whether to retrieve more data, rewrite the query, or ignore the retrieved context entirely[src:004].

Infographic Placeholder: A high-level comparison of Naive vs. Advanced RAG topologies, showing the addition of 'Query Rewriting' and 'Reranking' nodes in the advanced version.

Practical Implementations

1. Naive RAG Topology

The Naive RAG pipeline is the foundational architecture. It assumes that the user's query is perfectly formulated for vector search and that the top-k retrieved documents are always relevant.

graph LR
    subgraph Indexing_Phase
        A[Raw Docs] --> B[Chunking]
        B --> C[Embedding Model]
        C --> D[(Vector DB)]
    end

    subgraph Retrieval_Phase
        E[User Query] --> F[Embedding Model]
        F --> G{Similarity Search}
        D -.-> G
        G --> H[Top-K Context]
    end

    subgraph Generation_Phase
        H --> I[Prompt Template]
        E --> I
        I --> J[LLM]
        J --> K[Final Answer]
    end

Key Components:

  • Chunking Strategy: Fixed-size vs. Semantic chunking.
  • Vector DB: Storage for high-dimensional embeddings (e.g., Pinecone, Milvus, Weaviate).
  • Top-K: The number of document snippets retrieved per query (these parameters can be annotated directly on the diagram, as shown below).
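
Surfacing these choices as node and edge labels keeps the configuration visible at a glance. A small variation of the retrieval path, where the chunk size, database, and k value are placeholder examples:

graph LR
    A[Raw Docs] -->|512-token chunks| B[Chunking]
    B --> C[Embedding Model]
    C --> D[("Vector DB: Milvus")]
    E[User Query] --> F{Similarity Search}
    D -.-> F
    F -->|k = 5| G[Top-K Context]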

2. Advanced RAG: The Two-Stage Retrieval

Advanced RAG addresses the "Precision vs. Recall" trade-off. While vector search is excellent at finding similar items (Recall), it often fails at finding the most relevant items for a specific answer (Precision).

graph TD
    UserQuery[User Query] --> QueryTransform[Query Rewriting / HyDE]
    QueryTransform --> VectorSearch[Vector Search - Bi-Encoder]
    QueryTransform --> KeywordSearch[BM25 Keyword Search]
    
    VectorSearch --> Fusion[Reciprocal Rank Fusion]
    KeywordSearch --> Fusion
    
    Fusion --> Reranker[Cross-Encoder Reranker]
    Reranker --> ContextFilter[Context Compression]
    
    ContextFilter --> LLM[LLM Generation]
    LLM --> Output[Refined Answer]

Technical Nuance:

  • HyDE (Hypothetical Document Embeddings): The LLM generates a "fake" answer first, and that answer is used to search the database, often yielding better semantic matches (see the sketch after this list)[src:005].
  • Cross-Encoder: Unlike Bi-Encoders (which compare embeddings), Cross-Encoders process the query and document together, providing much higher accuracy at the cost of latency.
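
A possible Mermaid sketch of the HyDE branch, assuming the hypothetical answer is drafted by the same LLM that later produces the final response (node names are illustrative):

graph LR
    Q[User Query] --> Draft["LLM: Hypothetical Answer"]
    Draft --> Embed[Embedding Model]
    Embed --> Search{Vector Search}
    DB[(Vector DB)] -.-> Search
    Search --> Context[Real Retrieved Docs]
    Q --> Prompt[Prompt Template]
    Context --> Prompt
    Prompt --> Gen[LLM Generation]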

Advanced Techniques

Agentic RAG (Self-Correction Loops)

In Agentic RAG, the system is no longer a static pipeline but a state machine. The LLM acts as an "Agent" that can critique its own retrieval results.

graph TD
    Start[User Query] --> Router{Router Agent}
    Router -->|Internal Knowledge| LLM[Direct Answer]
    Router -->|External Data| Retrieval[Retrieve Context]
    
    Retrieval --> Critique{Critique Agent}
    Critique -->|Irrelevant| Rewrite[Rewrite Query]
    Rewrite --> Retrieval
    
    Critique -->|Relevant| Generate[Generate Answer]
    Generate --> HallucinationCheck{Hallucination Grade}
    
    HallucinationCheck -->|Fail| Generate
    HallucinationCheck -->|Pass| Final[Final Output]

Corrective RAG (CRAG): This specific topology introduces a "Knowledge Refinement" step. If the retrieved documents are evaluated as "ambiguous," the agent triggers a web search to supplement the local vector store[src:003].
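
A simplified sketch of the CRAG topology under that description; the three-way grading follows the paper's Correct / Ambiguous / Incorrect split, while the node names are our own:

graph TD
    Query[User Query] --> Retrieve[Retrieve from Vector Store]
    Retrieve --> Grade{Retrieval Evaluator}
    Grade -->|Correct| Refine[Knowledge Refinement]
    Grade -->|Ambiguous| Both[Refinement + Web Search]
    Grade -->|Incorrect| Web[Web Search]
    Refine --> Generate[LLM Generation]
    Both --> Generate
    Web --> Generate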

GraphRAG: Entity-Centric Topology

GraphRAG, popularized by Microsoft Research, moves away from simple text chunks. It builds a Knowledge Graph (KG) where nodes are entities (People, Places, Concepts) and edges are relationships[src:002].

graph LR
    Docs[Unstructured Docs] --> Extraction[LLM Entity Extraction]
    Extraction --> KG[(Knowledge Graph)]
    KG --> CommunityDetection[Leiden Algorithm]
    CommunityDetection --> Summaries[Community Summaries]
    
    Query[User Query] -->|Global question| Summaries
    Summaries --> LLM_Final[Final Answer]
    
    Query -->|Local question| LocalSearch[Entity Traversal]
    KG -.-> LocalSearch
    LocalSearch --> LLM_Final

Why GraphRAG? Standard RAG struggles with "Global" questions (e.g., "What are the main themes in these 1,000 documents?"). GraphRAG solves this by pre-summarizing "communities" of entities in the graph, allowing the LLM to reason over the structure of the data, not just the text.

Research and Future Directions

1. Long-Context LLMs vs. RAG

With the advent of models like Gemini 1.5 Pro (2M token window), some argue that RAG is obsolete. However, research suggests a "Hybrid Topology" is the future, as sketched after the list below:

  • RAG for massive, dynamic datasets (Petabytes).
  • Long-Context for deep reasoning over a specific, retrieved subset (Megabytes).
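
One way to sketch that hybrid topology; the routing criterion shown (whether the working set already fits in the context window) is an assumption, not a fixed standard:

graph TD
    Q[User Query] --> Router{"Working set fits in the context window?"}
    Router -->|Yes| LongCtx[Long-Context LLM Reasoning]
    Router -->|No| RAG[RAG Retrieval over Massive Corpus]
    RAG -->|Retrieved subset| LongCtx
    LongCtx --> Answer[Final Answer]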

2. Self-RAG and Reflection

Self-RAG introduces "Reflection Tokens." The model is trained to output special tokens like [Retrieve], [No Retrieval], [Is Supported], and [Is Useful][src:004]. This allows the topology to be controlled by the model's own internal logic rather than hard-coded if/else statements.
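
A sketch of how those reflection tokens can be mapped onto Mermaid decision nodes; the loop-back edges are a simplification of the full Self-RAG control flow:

graph TD
    Q[User Query] --> R1{"Retrieve? ([Retrieve] vs. [No Retrieval])"}
    R1 -->|No Retrieval| Direct[Answer from Parametric Memory]
    R1 -->|Retrieve| Fetch[Retrieve Passages]
    Fetch --> Gen[Generate Candidate Answer]
    Gen --> R2{"Grounded? ([Is Supported])"}
    R2 -->|Not supported| Fetch
    R2 -->|Supported| R3{"Helpful? ([Is Useful])"}
    R3 -->|Not useful| Gen
    R3 -->|Useful| Final[Final Output]
    Direct --> Final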

3. Multimodal RAG Topologies

The next frontier is visualizing the flow of images, audio, and video.

  • Topology Shift: The "Embedding Model" must be a CLIP-style model capable of mapping different modalities into the same vector space.
  • Mermaid Representation: Requires parallel paths for different data types (e.g., an OCR path for images and a Whisper path for audio) before fusion, as shown in the sketch below.
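
A sketch of those parallel modality paths; the preprocessing tools follow the examples above, and the fusion step is an assumption about where the paths converge:

graph LR
    Img[Images] --> OCR[OCR / Captioning]
    Aud[Audio] --> ASR[Whisper Transcription]
    Txt[Text Docs] --> Chunk[Chunking]
    OCR --> Embed[CLIP-Style Embedder]
    ASR --> Embed
    Chunk --> Embed
    Embed --> DB[(Shared Vector Space)]
    Query[User Query] --> QEmbed[Query Embedding]
    QEmbed --> Search{Similarity Search}
    DB -.-> Search
    Search --> Fusion[Modal Fusion]
    Fusion --> LLM[Multimodal LLM]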

Frequently Asked Questions

Q: Why use Mermaid instead of a tool like Lucidchart for RAG?

Mermaid is "Diagram-as-Code." This means your RAG topology lives in your Git repository alongside your code. When the architecture changes, the diagram is updated in the same Pull Request, ensuring documentation never goes out of sync with the implementation.

Q: What is the "Lost in the Middle" problem in RAG topology?

Research has shown that LLMs are better at processing information at the very beginning or very end of a prompt. If your RAG topology retrieves 20 documents and the most relevant one is at index #10, the LLM might ignore it. Advanced topologies solve this using Rerankers to place the most critical context at the "poles" of the prompt.

Q: How does Hybrid Search affect the Mermaid diagram?

Hybrid search adds a parallel branch. One branch performs a Vector Search (semantic), and the other performs a Keyword Search (lexical). These are merged using Reciprocal Rank Fusion (RRF) before reaching the LLM.
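
In Mermaid, that branch can be isolated as a reusable fragment (a condensed cut of the Advanced RAG template above):

graph LR
    Q[User Query] --> V[Vector Search]
    Q --> K[BM25 Keyword Search]
    V --> RRF[Reciprocal Rank Fusion]
    K --> RRF
    RRF --> LLM[LLM Generation]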

Q: Can Mermaid represent the latency of different RAG nodes?

While Mermaid doesn't natively track time, you can use "Styling" or "Classes" to color-code nodes. For example, you can color the Cross-Encoder node red to indicate it is a high-latency bottleneck, while the Vector Search node is green (low latency).
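
For example, classDef can attach a latency legend to the nodes; the colors and class names here are arbitrary choices:

graph LR
    Q[User Query] --> VS[Vector Search]:::fast
    VS --> CE[Cross-Encoder Reranker]:::slow
    CE --> LLM[LLM Generation]
    classDef fast fill:#c8f7c5,stroke:#2e7d32
    classDef slow fill:#f7c5c5,stroke:#c62828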

Q: What is the difference between a Bi-Encoder and a Cross-Encoder in these diagrams?

In a Mermaid diagram, a Bi-Encoder is typically used in the "Indexing" and "Initial Retrieval" stages because it is fast. A Cross-Encoder is placed in the "Post-retrieval" stage as a Reranker because it is slow but highly accurate at comparing the specific query-document pair.
