TLDR
Decomposition RAG is a sophisticated evolution of standard Retrieval-Augmented Generation (RAG) designed to solve the "multi-hop" reasoning problem[1]. While standard RAG systems often fail when a user query requires connecting disparate pieces of information across multiple documents, Decomposition RAG employs a "divide and conquer" strategy. It utilizes a Large Language Model (LLM) to break a complex query into a series of simpler, atomic sub-questions. Each sub-question triggers its own retrieval process, and the resulting evidence is consolidated and reranked to provide a comprehensive answer[1][4]. Research indicates that this approach can improve retrieval performance (MRR@10) by up to 36.7% and answer accuracy (F1) by 11.6% on challenging benchmarks like MultiHop-RAG[1]. It represents a practical, training-free enhancement for production-grade AI agents.
Conceptual Overview
The fundamental limitation of "Naive" or standard RAG is its reliance on a single retrieval step. In a standard pipeline, a user query is converted into a vector embedding and compared against a vector database to find the top-k most similar chunks. This works exceptionally well for factoid questions (e.g., "What is the capital of France?") but collapses when faced with multi-hop queries (e.g., "How does the revenue of the company that acquired Slack compare to the revenue of Zoom?").
The Multi-Hop Challenge
Multi-hop questions require the system to:
- Identify the first entity (the company that acquired Slack is Salesforce).
- Retrieve information about that entity (Salesforce's revenue).
- Retrieve information about the second entity (Zoom's revenue).
- Perform a comparison or synthesis.
In a standard RAG system, the initial query embedding might be dominated by terms like "Slack" or "Zoom," potentially missing the "Salesforce" connection entirely because that information isn't explicitly in the query.
The Decomposition Philosophy
Decomposition RAG addresses this by introducing a Decomposer module—typically an LLM prompted to act as a query planner. The decomposer transforms the complex query into a directed acyclic graph (DAG) or a simple list of sub-queries[2]. This ensures that the retrieval engine searches for each specific piece of the puzzle independently, maximizing the chance that the relevant "hops" are captured in the context window.
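For the Slack/Zoom example above, the decomposer's output might look like the following sketch (a hypothetical structure; the exact schema depends on your planner prompt). The optional depends_on field is what turns a flat list of sub-queries into a DAG:

# Hypothetical decomposer output for the Slack/Zoom comparison query.
# "depends_on" encodes the DAG: sub-query 2 cannot be resolved until
# sub-query 1 has identified the acquiring company.
decomposition = [
    {"id": 1, "question": "Which company acquired Slack?", "depends_on": []},
    {"id": 2, "question": "What is the annual revenue of the company that acquired Slack?", "depends_on": [1]},
    {"id": 3, "question": "What is the annual revenue of Zoom?", "depends_on": []},
]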
Key Components
- Query Decomposer: An LLM that parses the intent and generates sub-questions.
- Sub-query Retriever: A standard retriever (semantic or hybrid) that executes searches for each sub-question.
- Reranker: A critical component that filters the expanded pool of documents (which can be 3-5x larger than in standard RAG) to remove noise[1].
- Response Synthesizer: An LLM that takes the consolidated, reranked evidence and the original query to produce the final output.
![Infographic: Decomposition RAG Pipeline. The diagram shows a central 'User Query' entering a 'Decomposition Engine'. The engine outputs three parallel arrows labeled 'Sub-query 1', 'Sub-query 2', and 'Sub-query 3'. Each arrow points to a 'Vector Store' block. The outputs of these blocks converge into a 'Reranking Layer' (Cross-Encoder), which then feeds a filtered set of 'Top Context' into the 'Final LLM Synthesis' block to produce the 'User Answer'.]
Practical Implementations
Building a Decomposition RAG system requires careful orchestration of the LLM prompts and the retrieval logic. Below is the typical workflow for implementation.
1. The Decomposition Prompt
The first step is prompting the LLM to generate sub-questions. A common technique is Least-to-Most Prompting or Self-Ask.
- Example Prompt: "Given the complex question: {query}, break it down into 3-4 simpler sub-questions that, when answered, will provide all the information needed to answer the original question. Output as a JSON list."
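A minimal sketch of this step in Python, assuming a call_llm helper (a placeholder for whichever LLM client you use) that takes a prompt string and returns the model's text output:

import json

DECOMPOSITION_PROMPT = """Given the complex question: {query}
Break it down into 3-4 simpler sub-questions that, when answered,
will provide all the information needed to answer the original question.
Output only a JSON list of strings."""

def decompose(query: str, call_llm) -> list[str]:
    # call_llm is a placeholder: any function that takes a prompt string
    # and returns the model's text completion.
    raw = call_llm(DECOMPOSITION_PROMPT.format(query=query))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the original query if the model did not return valid JSON.
        return [query]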
2. Parallel vs. Sequential Retrieval
There are two primary ways to execute the sub-queries:
- Parallel Retrieval: All sub-questions are fired at the vector database simultaneously (see the sketch after this list). This is faster (lower latency) but doesn't allow for "interleaving," where the answer to sub-question 1 informs sub-question 2[3].
- Sequential (Recursive) Retrieval: The system answers the first sub-question, adds that information to the context, and then generates/executes the next sub-question. This is more powerful for deep reasoning but significantly slower.
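A sketch of the parallel variant, assuming a thread-safe vector_db.search method (a hypothetical interface) so that all sub-queries can be issued concurrently:

from concurrent.futures import ThreadPoolExecutor

def parallel_retrieve(sub_queries, vector_db, top_k=5):
    # Fire all sub-queries at once; vector_db.search is assumed to return
    # a list of document chunks and to be safe to call from multiple threads.
    with ThreadPoolExecutor(max_workers=max(1, len(sub_queries))) as pool:
        per_query_results = list(
            pool.map(lambda sq: vector_db.search(sq, top_k=top_k), sub_queries)
        )
    # Flatten the per-query result lists into a single candidate pool.
    return [doc for docs in per_query_results for doc in docs]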
3. Handling the "Noise" Problem
Because Decomposition RAG generates multiple queries, it retrieves a much larger volume of documents. If you retrieve the top 5 documents for each of 4 sub-questions, you have 20 documents. Feeding all 20 into an LLM can lead to the "Lost in the Middle" phenomenon or context window overflow.
- Implementation Detail: Use a Cross-Encoder Reranker (like BGE-Reranker or Cohere Rerank). The reranker evaluates the relevance of all 20 documents against the original complex query, ensuring only the most salient evidence reaches the final synthesis stage[1].
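A sketch of this reranking step using the CrossEncoder wrapper from the sentence-transformers library (the model name is one of the public BGE reranker checkpoints; swap in whichever reranker you deploy):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(original_query: str, documents: list[str], top_n: int = 10) -> list[str]:
    # Score every retrieved chunk against the ORIGINAL complex query,
    # then keep only the top_n most relevant chunks.
    unique_docs = list(dict.fromkeys(documents))  # drop exact duplicates, keep order
    scores = reranker.predict([(original_query, doc) for doc in unique_docs])
    ranked = sorted(zip(unique_docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]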
4. Pseudo-Code Logic
def decomposition_rag(user_query):
    # Step 1: Decompose the complex query into atomic sub-questions,
    # e.g. ["Who acquired Slack?", "Salesforce revenue 2023", "Zoom revenue 2023"]
    sub_queries = llm.generate_sub_queries(user_query)

    # Step 2: Retrieve the top-k chunks for each sub-question independently
    all_docs = []
    for sq in sub_queries:
        docs = vector_db.search(sq, top_k=5)
        all_docs.extend(docs)

    # Step 3: Rerank the pooled evidence against the ORIGINAL query,
    # filtering out duplicates and noise
    ranked_docs = reranker.rank(query=user_query, documents=all_docs, top_n=10)

    # Step 4: Synthesize the final answer from the original query plus the ranked evidence
    final_answer = llm.synthesize(user_query, ranked_docs)
    return final_answer
Advanced Techniques
As the field matures, several advanced variations of Decomposition RAG have emerged to handle edge cases and optimize performance.
Step-Back Prompting
Instead of just breaking a question down, the system generates a "step-back" question—a more generic, high-level version of the query. For example, if the query is "Why did my specific Nvidia RTX 3080 crash during Cyberpunk 2077?", the step-back question might be "What are the common causes of GPU crashes in high-demand games?". Retrieving the high-level principles provides a conceptual framework that helps the LLM interpret the specific evidence[4].
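A minimal sketch of generating and using a step-back question, reusing the hypothetical call_llm and vector_db interfaces from the sketches above (the prompt wording is an illustrative assumption):

STEP_BACK_PROMPT = """You are an expert at reformulating questions.
Given the specific question below, write one more generic, high-level
question about the underlying principles needed to answer it.

Specific question: {query}
Step-back question:"""

def step_back_retrieve(query, call_llm, vector_db, top_k=5):
    # Retrieve for both the specific query and its high-level reformulation,
    # so the synthesis step sees general principles alongside the specifics.
    step_back_q = call_llm(STEP_BACK_PROMPT.format(query=query)).strip()
    return vector_db.search(query, top_k=top_k) + vector_db.search(step_back_q, top_k=top_k)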
Multi-Query Fusion
This technique involves generating multiple variations of the same sub-question to overcome the limitations of distance-based vector search. By slightly varying the wording, the system can capture relevant chunks that might have been missed due to the specific phrasing of a single sub-query. The results are then combined using Reciprocal Rank Fusion (RRF).
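Reciprocal Rank Fusion itself is only a few lines; the sketch below fuses several ranked lists of document IDs using the conventional k = 60 smoothing constant:

from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    # result_lists: one ranked list of document IDs per query variation.
    # Each document's fused score is the sum of 1 / (k + rank) across lists.
    scores = defaultdict(float)
    for ranked in result_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Return document IDs sorted by descending fused score.
    return sorted(scores, key=scores.get, reverse=True)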
Contextual Retrieval (Anthropic Pattern)
Recent research from Anthropic suggests that retrieval is improved when chunks are "contextualized" before being indexed[4]. In Decomposition RAG, this means that when a sub-query retrieves a chunk, the system doesn't just look at the chunk text, but also a small summary of the document it came from. This is particularly useful when sub-questions are highly specific and might retrieve chunks that lack context (e.g., a table row that says "Revenue: $5B" without mentioning the company name).
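One way to apply this pattern at index time is to prepend a short document-level summary to each chunk before embedding it. The sketch below assumes the hypothetical call_llm helper plus a vector_db.add method (illustrative interfaces, not a specific library API):

def index_with_context(document_text, chunks, call_llm, vector_db):
    # Generate one short summary per document, then prepend it to every chunk
    # so that isolated fragments (e.g. a bare table row) still carry the
    # document's identity when retrieved by a narrow sub-query.
    summary = call_llm(f"Summarize this document in two sentences:\n{document_text[:4000]}")
    for chunk in chunks:
        vector_db.add(text=f"Document context: {summary}\n\n{chunk}")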
Iterative Refinement (IRRR)
Interleaving Retrieval and Reasoning (IRRR) is a dynamic form of decomposition. Instead of generating all sub-questions upfront, the agent generates one, retrieves info, reasons about it, and then decides what the next sub-question should be. This is the foundation of "Agentic RAG" workflows[5].
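A sketch of the interleaved loop, again using the hypothetical call_llm and vector_db interfaces; the stop condition and prompt wording are assumptions rather than a fixed recipe:

def interleaved_rag(user_query, call_llm, vector_db, max_hops=4):
    evidence = []
    for _ in range(max_hops):
        # Ask the model what it still needs to know, given the evidence gathered so far.
        next_q = call_llm(
            f"Question: {user_query}\nEvidence so far: {evidence}\n"
            "If the evidence is sufficient, reply DONE. Otherwise write the next sub-question."
        ).strip()
        if next_q == "DONE":
            break
        evidence.extend(vector_db.search(next_q, top_k=3))
    return call_llm(f"Question: {user_query}\nEvidence: {evidence}\nAnswer:")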
Research and Future Directions
The efficacy of Decomposition RAG is well-documented in recent literature.
Performance Benchmarks
In the paper "Improving Complex Question Answering with Decomposition RAG" (2023), researchers tested the architecture against the MultiHop-RAG dataset[1]. This dataset specifically contains questions that cannot be answered by a single document.
- Standard RAG: MRR@10 of ~0.45.
- Decomposition RAG: MRR@10 of ~0.61 (+36.7% improvement).
- Answer Accuracy: F1 score improved from 32.4 to 36.2 (+11.6%)[1].
The Latency-Accuracy Trade-off
The primary hurdle for Decomposition RAG is latency. Generating sub-queries and performing multiple retrieval passes takes significantly longer than a single-hop search. Future research is focused on:
- Speculative Decoding for Decomposition: Using smaller, faster models to generate sub-queries while the larger model handles synthesis.
- Cached Sub-queries: Identifying common "query patterns" and caching the decomposition paths to avoid redundant LLM calls[6].
- End-to-End Training: While current methods are training-free, there is interest in training "Dense Decomposers" that can output sub-query embeddings directly without generating text, potentially merging the decomposition and retrieval steps[7].
Integration with Knowledge Graphs
There is a growing trend of combining Decomposition RAG with GraphRAG. In this hybrid model, the decomposition step identifies entities and relations, which are then used to traverse a Knowledge Graph (KG) rather than just performing vector similarity searches. This allows for even more precise multi-hop traversal, as the "hops" are explicitly defined by graph edges.
Frequently Asked Questions
Q: When should I use Decomposition RAG instead of standard RAG?
You should use Decomposition RAG when your users ask complex, comparative, or multi-part questions. If your data requires connecting information from different files (e.g., comparing financial reports of two different companies), standard RAG will likely fail, and decomposition becomes necessary.
Q: Does this increase my LLM costs?
Yes. Decomposition RAG requires at least two LLM calls (one for decomposition and one for synthesis) and potentially more if you use sequential reasoning. It also increases the number of tokens processed because you are retrieving and feeding more context into the final prompt.
Q: Can I use a smaller model for the decomposition step?
Absolutely. Models like GPT-3.5 Turbo, Llama-3-8B, or Mistral-7B are often sufficient for the decomposition task, provided they are given clear instructions and a few-shot examples. You can reserve the larger, more expensive models (like GPT-4o or Claude 3.5 Sonnet) for the final synthesis.
Q: How do I prevent the system from generating too many sub-questions?
You can constrain the LLM in the system prompt (e.g., "Limit to a maximum of 3 sub-questions"). Additionally, you can implement a "Query Merger" step that identifies and combines redundant sub-questions before they are sent to the retriever.
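A sketch of such a merger step, assuming an embed function that returns unit-normalized vectors (a hypothetical interface): sub-questions whose embeddings are nearly identical to one already kept are dropped.

import numpy as np

def merge_redundant(sub_queries, embed, threshold=0.92):
    # Drop sub-questions that are near-duplicates of one already kept.
    # embed(text) is assumed to return a unit-normalized numpy vector,
    # so the dot product is the cosine similarity.
    kept, kept_vecs = [], []
    for sq in sub_queries:
        vec = embed(sq)
        if all(float(np.dot(vec, kv)) < threshold for kv in kept_vecs):
            kept.append(sq)
            kept_vecs.append(vec)
    return kept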
Q: Is reranking mandatory in this architecture?
While not strictly mandatory, it is highly recommended. Without reranking, the "signal-to-noise" ratio in your context window decreases significantly because you are aggregating results from multiple different searches. Reranking ensures that the most relevant pieces of evidence from all sub-queries rise to the top[1].
References
- [1] Improving Complex Question Answering with Decomposition RAG (research paper)
- [2] Query Decomposition for Question Answering (documentation)
- [3] RAG Pipeline: A Complete Guide (blog post)
- [4] Contextual Retrieval (blog post)
- [5] Decomposition Strategies for RAG (blog post)
- [6] Breakdown RAG Model Parameters, Settings, and Their Impact (article)
- [7] MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Question Answering (research paper)