
Mastering Query Decomposition: A Technical Guide to Multi-Hop Retrieval in RAG

An engineering-first deep dive into Query Decomposition—a critical preprocessing layer for solving multi-hop reasoning challenges in Retrieval-Augmented Generation (RAG) systems.

TL;DR

Query decomposition is a specialized preprocessing technique in Retrieval-Augmented Generation (RAG) that transforms a single, complex user prompt into multiple, discrete sub-queries. It is the primary solution for the "multi-hop" reasoning problem, where the information required to answer a question is distributed across disparate documents or requires sequential logical steps. By breaking down a query, systems reduce semantic noise in vector searches, improve retrieval precision, and ensure comprehensive recall. In production, this acts as a "Query Understanding" layer that bridges the gap between sophisticated human intent and the limitations of flat vector database lookups.


Conceptual Overview

In standard RAG workflows, a user query is converted into a vector embedding and compared against a document store. While effective for simple fact retrieval, this "flat" approach fails when queries are multi-faceted. Query decomposition introduces a Query Understanding layer that treats the user's input not as a search string, but as a high-level task.

The Multi-Hop Reasoning Challenge

A "multi-hop" query is one where the answer to the first part of the question is required to find the answer to the second part, or where multiple independent facts must be aggregated. Research into the "compositionality gap" (Press et al., 2022) suggests that LLMs often struggle to compose multiple retrieved facts even if they can retrieve them individually.

  • Example: "How did the 2023 monetary policy of the ECB affect the stock price of German automotive manufacturers compared to their 2022 performance?"
  • The Problem: A single vector search for this entire sentence will likely land in a "semantic no-man's land." The embedding model tries to represent central-bank policy, German automotive stocks, 2023 data, and 2022 data in a single point. The resulting vector is an "average" that may not be close enough to any specific relevant document to trigger a high-confidence match.

The Mathematical Intuition: Semantic Noise

When an embedding model processes a complex query, it produces a single high-dimensional vector. As query complexity grows, the "signal" (the specific keywords and entities) is diluted by the "noise" (the relationships and constraints connecting them), an effect that can be thought of as a degraded Signal-to-Noise Ratio (SNR) in vector space.

By decomposing the query into:

  1. "What was the ECB fiscal policy in 2023?"
  2. "List of German automotive manufacturers."
  3. "Stock price performance of [List] in 2022 vs 2023."

Each sub-query has a much higher SNR. The resulting vectors are "sharper" and more likely to reside in the same neighborhood as the relevant source text. In practice, the cosine similarity between a focused sub-query and its target document is typically much higher than the similarity between the original complex query and that same document, as the sketch below illustrates.
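
The effect is easy to probe empirically. The snippet below is a minimal sketch assuming the sentence-transformers library and an illustrative MiniLM checkpoint; any embedding model and similarity function can be substituted, and the example document text is invented for demonstration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any embedding model works here

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice, not a requirement

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

complex_query = ("How did the 2023 monetary policy of the ECB affect the stock price of "
                 "German automotive manufacturers compared to their 2022 performance?")
sub_query = "What was the ECB's monetary policy in 2023?"
document = "In 2023 the ECB continued tightening, raising its key interest rates to curb inflation."

q_complex, q_sub, d = model.encode([complex_query, sub_query, document])

# The focused sub-query typically sits much closer to the policy document
# than the diluted, multi-concept root query.
print(f"sim(complex query, doc) = {cos(q_complex, d):.3f}")
print(f"sim(sub-query,     doc) = {cos(q_sub, d):.3f}")
```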

[Infographic placeholder: A complex "Root Query" enters an LLM "Decomposition Agent," which outputs "Sub-Query 1," "Sub-Query 2," and "Sub-Query 3." These flow into a Vector Store, return "Context Snippets," and converge in a "Final Synthesis" step. A feedback loop lets the agent refine sub-queries based on initial retrieval results.]


Practical Implementations

Building a decomposition layer requires an orchestration framework (such as LangChain or LlamaIndex) and a high-reasoning LLM (e.g., GPT-4o or Claude 3.5 Sonnet) to act as the "Planner."

1. The Planning Phase (Prompt Engineering)

The planner's job is to identify which sub-questions are independent and which depend on earlier answers. A typical system prompt for a decomposer looks like this (a code sketch of the full planning call follows the example output below):

"You are a query decomposition assistant. Break the following user query into a set of atomic, independent sub-questions that can be answered by searching a technical database. If a question depends on the answer of another, mark it as 'sequential'. Output in JSON format."

Input: "Compare the energy density of solid-state batteries vs lithium-ion and explain which is better for long-haul trucking."

Decomposed Output:

  • q1: "What is the typical energy density of solid-state batteries?"
  • q2: "What is the typical energy density of lithium-ion batteries?"
  • q3: "What are the energy density requirements for long-haul trucking?"
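
A minimal sketch of the planning call is shown below. It uses the OpenAI Python SDK purely as an example client, and the JSON schema (a top-level sub_queries list of id/question/mode objects) is an assumption made for this sketch rather than a standard format; any instruction-following model and schema will do.

```python
import json
from openai import OpenAI  # example client; any chat-completion API works

client = OpenAI()

DECOMPOSER_PROMPT = (
    "You are a query decomposition assistant. Break the following user query into a set "
    "of atomic, independent sub-questions that can be answered by searching a technical "
    "database. If a question depends on the answer of another, mark it as 'sequential'. "
    "Output in JSON format: {\"sub_queries\": [{\"id\": ..., \"question\": ..., "
    "\"mode\": \"parallel\" or \"sequential\"}]}"
)

def decompose(query: str, model: str = "gpt-4o") -> list[dict]:
    """Ask the planner LLM to split a complex query into structured sub-queries."""
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # force parseable JSON output
        messages=[
            {"role": "system", "content": DECOMPOSER_PROMPT},
            {"role": "user", "content": query},
        ],
    )
    return json.loads(response.choices[0].message.content)["sub_queries"]

for sq in decompose(
    "Compare the energy density of solid-state batteries vs lithium-ion "
    "and explain which is better for long-haul trucking."
):
    print(sq["id"], sq["mode"], sq["question"])
```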

2. Execution Strategies

Once decomposed, the system must decide how to execute the retrieval:

A. Parallel Execution (Independent)

If the sub-queries are independent (like q1 and q2 above), they are fired simultaneously against the vector database. This minimizes latency. The results are collected into a shared context window. LangChain's MultiQueryRetriever applies the same parallel pattern, although it generates reworded variants of one question rather than true sub-questions.
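
The sketch below shows the parallel pattern with plain asyncio; the retrieve function is a placeholder standing in for whatever similarity-search call your vector database client exposes.

```python
import asyncio

async def retrieve(sub_query: str, k: int = 4) -> list[str]:
    """Placeholder retriever: replace the body with your vector store's similarity
    search (Chroma, Qdrant, pgvector, etc.)."""
    await asyncio.sleep(0.1)  # stands in for network / database latency
    return [f"<chunk {i} relevant to: {sub_query!r}>" for i in range(k)]

async def retrieve_parallel(sub_queries: list[str]) -> dict[str, list[str]]:
    # Independent sub-queries are fired concurrently, so total latency is roughly
    # that of the slowest single retrieval rather than the sum of all of them.
    results = await asyncio.gather(*(retrieve(q) for q in sub_queries))
    return dict(zip(sub_queries, results))

context = asyncio.run(retrieve_parallel([
    "What is the typical energy density of solid-state batteries?",
    "What is the typical energy density of lithium-ion batteries?",
]))
```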

B. Sequential (Chained) Execution

If q2 requires the output of q1, the system uses a Chain-of-Thought (CoT) approach.

  • Step 1: Retrieve context for q1.
  • Step 2: Use the context from q1 to refine q2.
  • Step 3: Retrieve context for q2.

This is essential for queries like "Who is the current CEO of the company that acquired Slack, and what is their background?" The system must first identify Salesforce before it can search for the CEO's background.
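
A minimal sketch of that two-hop chain follows. Both retrieve and ask_llm are hypothetical placeholders (the ask_llm stub returns a hard-coded answer purely for illustration); in a real pipeline they would wrap your vector store and chat model.

```python
def retrieve(query: str) -> list[str]:
    """Hypothetical retriever: swap in your vector store search."""
    return [f"<chunks for {query!r}>"]

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call: swap in your chat-completion client."""
    return "Salesforce"  # hard-coded for illustration only

# Hop 1: resolve the intermediate entity.
hop1_context = retrieve("Which company acquired Slack?")
acquirer = ask_llm(
    f"Context: {hop1_context}\n"
    "Based only on the context, which company acquired Slack? Answer with the name only."
)

# Hop 2: rewrite the dependent sub-query with the resolved entity and retrieve again.
hop2_query = f"Who is the current CEO of {acquirer}, and what is their background?"
hop2_context = retrieve(hop2_query)
```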

3. Synthesis and Merging

The final step is the Synthesis Phase. The LLM receives the original query and the aggregated context from all sub-queries (see the sketch after this list). It must then:

  • De-duplicate: Remove overlapping information retrieved by different sub-queries.
  • Resolve Conflicts: If q1 and q2 return contradictory data, the LLM must use reasoning to determine the most authoritative source.
  • Cite: Map specific facts back to the sub-query sources to maintain transparency.
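
A sketch of the context-assembly side of this phase is shown below: it de-duplicates chunks across sub-queries and tags each block with its sub-query source so the model can cite it. The prompt wording is illustrative, not canonical.

```python
def build_synthesis_prompt(original_query: str, context: dict[str, list[str]]) -> str:
    """Assemble the synthesis prompt: de-duplicate retrieved chunks and tag each block
    with the sub-query that produced it so the model can cite its sources."""
    seen: set[str] = set()
    blocks: list[str] = []
    for i, (sub_query, chunks) in enumerate(context.items(), start=1):
        unique = []
        for chunk in chunks:
            if chunk not in seen:  # drop overlap retrieved by multiple sub-queries
                seen.add(chunk)
                unique.append(chunk)
        blocks.append(f"[q{i}] {sub_query}\n" + "\n".join(unique))
    return (
        "Answer the original question using only the context below. "
        "If sources conflict, prefer the most authoritative one and say so. "
        "Cite each fact with its [q#] tag.\n\n"
        f"Original question: {original_query}\n\n" + "\n\n".join(blocks)
    )
```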

Advanced Techniques

As RAG architectures evolve, simple decomposition is being replaced by more dynamic, agentic strategies.

Recursive Decomposition

For massive research tasks, a sub-query might itself be too complex. Recursive decomposition allows the system to break a sub-query into "sub-sub-queries" dynamically. This is often used in "Long-Context RAG" where the system must navigate thousands of pages. If a sub-query returns too much information or "no relevant results," the agent triggers a second level of decomposition.
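
A compact sketch of the control flow is below; retrieve, is_relevant, and decompose are trivial placeholders for a vector search, a relevance grader (LLM or cross-encoder), and the planner LLM respectively.

```python
def retrieve(query: str) -> list[str]:
    """Placeholder: your vector store search."""
    return []

def is_relevant(query: str, chunk: str) -> bool:
    """Placeholder: an LLM or cross-encoder grading chunk relevance."""
    return False

def decompose(query: str) -> list[str]:
    """Placeholder: the planner LLM splitting a query into sub-sub-queries."""
    return []

def recursive_retrieve(query: str, depth: int = 0, max_depth: int = 2) -> list[str]:
    chunks = retrieve(query)
    # Stop when relevant chunks were found or the recursion budget is spent.
    if depth >= max_depth or any(is_relevant(query, c) for c in chunks):
        return chunks
    # Otherwise break this sub-query down one more level and recurse.
    results: list[str] = []
    for sub in decompose(query):
        results.extend(recursive_retrieve(sub, depth + 1, max_depth))
    return results
```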

Least-to-Most Prompting

Based on the research by Zhou et al. (2022), this technique involves the LLM solving the easiest sub-problem first. The context gained from the easy problem provides the "grounding" needed to tackle more complex segments of the query. This reduces hallucinations in the later stages of retrieval because the model is "warmed up" with relevant facts.

Step-Back Abstraction

Instead of just breaking the query down, the system generates a "step-back" question (Zheng et al., 2023).

  • Original: "Why did my specific NVIDIA H100 node fail with Error 404?"
  • Step-Back: "What are the common causes of Error 404 in NVIDIA H100 clusters?"

By retrieving the broad principles first, the system provides a conceptual framework that makes the subsequent, more specific retrieval and decomposition more accurate.
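
A sketch of the pattern, reusing the hypothetical ask_llm and retrieve placeholders from the earlier sketches, might look like this; the prompt wording is an assumption.

```python
STEP_BACK_PROMPT = (
    "Rewrite the user's question as a more general 'step-back' question about the "
    "underlying principles, dropping instance-specific details.\n"
    "Question: {question}\nStep-back question:"
)

def step_back_retrieve(question: str) -> dict[str, list[str]]:
    """Retrieve broad principles via the step-back question first, then the specifics."""
    step_back_q = ask_llm(STEP_BACK_PROMPT.format(question=question))
    return {
        "principles": retrieve(step_back_q),  # e.g. common causes of the error class
        "specifics": retrieve(question),      # e.g. the user's exact node and error
    }
```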

GraphRAG Integration

While vector databases rely on semantic similarity, Knowledge Graphs (KGs) rely on explicit relationships. Advanced decomposition layers use KGs to identify entities. If a query mentions "The CEO of Apple," the decomposer can use a KG to instantly resolve this to "Tim Cook" before even hitting the vector store, significantly narrowing the search space and improving retrieval precision.
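
The sketch below illustrates the idea with a toy in-memory triple store; a production system would query a real knowledge graph (Neo4j, an RDF store, etc.) and use proper entity linking rather than string matching.

```python
# Toy (subject, relation) -> object triples standing in for a real knowledge graph.
KG = {("Apple", "has_ceo"): "Tim Cook"}

def resolve_entities(sub_query: str) -> str:
    """Rewrite relational phrases into concrete entities before the vector search."""
    if "the CEO of Apple" in sub_query:
        return sub_query.replace("the CEO of Apple", KG[("Apple", "has_ceo")])
    return sub_query

print(resolve_entities("What is the management philosophy of the CEO of Apple?"))
# -> "What is the management philosophy of Tim Cook?"
```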


Research and Future Directions

The frontier of query decomposition is moving toward efficiency and self-governance.

1. Self-Correction (Self-RAG)

The Self-RAG framework (Asai et al., 2023) introduces "reflection tokens." During decomposition, the model evaluates its own retrieved chunks. If the retrieved context for a sub-query is irrelevant, the model "self-corrects" by generating a new, different sub-query. This iterative loop ensures that the synthesis phase only receives high-quality data.
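
Self-RAG itself trains the model to emit reflection tokens; the sketch below only approximates that behavior with an external grader loop, reusing the hypothetical retrieve, is_relevant, and ask_llm placeholders from the earlier sketches.

```python
def self_correcting_retrieve(sub_query: str, max_retries: int = 2) -> list[str]:
    """Grade the chunks retrieved for a sub-query; if none are judged relevant,
    ask the LLM to rewrite the sub-query and try again."""
    for _ in range(max_retries + 1):
        chunks = retrieve(sub_query)
        if any(is_relevant(sub_query, c) for c in chunks):
            return chunks
        sub_query = ask_llm(
            f"The search query {sub_query!r} returned no relevant passages. "
            "Rewrite it as a different, more specific search query."
        )
    return []  # give up and let the synthesis step handle the gap
```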

2. Distilled Decomposers (Token Efficiency)

Decomposition is expensive because it requires multiple LLM calls. Current research focuses on distillation: training smaller models (e.g., 7B or 8B parameters) specifically for the task of decomposition. These "specialist" models can approach GPT-4-level quality at breaking down queries, at a fraction of the cost and latency.

3. Cross-Modal Decomposition

In Multimodal RAG, a query like "Show me the chart of Tesla's revenue and explain the dip in Q3" requires decomposing the intent into a text-based search (for the explanation) and an image/vision-based search (for the chart). Future systems will use unified decomposition layers to route sub-queries to different modal encoders (CLIP for images, BERT/Ada for text).

4. Agentic RAG Loops

The industry is shifting from static pipelines to Agentic RAG. In this model, the decomposition isn't a single step at the start; it's a continuous process. The agent retrieves some data, realizes it's missing a piece of the puzzle, decomposes a new sub-query on the fly, and repeats until the "information gain" threshold is met.


Frequently Asked Questions

Q: Does query decomposition increase latency?

Yes, decomposition typically increases latency because it involves an initial LLM call to generate sub-queries and potentially multiple retrieval steps. However, this is often offset by the use of parallel execution and the significantly higher accuracy of the final answer, which reduces the need for user follow-up questions.

Q: When should I NOT use query decomposition?

If your RAG system primarily handles simple, fact-based questions (e.g., "What is the company's PTO policy?"), decomposition is overkill. It adds unnecessary cost and latency. Use a "Router" to trigger decomposition only when the LLM detects a complex or multi-part query, as in the sketch below.
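
A minimal sketch of such a router, again built on the hypothetical ask_llm, decompose, and retrieve placeholders from the earlier sketches, is shown below; the classification prompt and labels are assumptions.

```python
ROUTER_PROMPT = (
    "Classify the user query as 'simple' (answerable with a single retrieval) or "
    "'complex' (multi-part, comparative, or multi-hop). Reply with exactly one word.\n"
    "Query: {query}"
)

def route_and_retrieve(query: str) -> dict[str, list[str]]:
    """Only pay the decomposition cost when the query is classified as complex."""
    label = ask_llm(ROUTER_PROMPT.format(query=query)).strip().lower()
    # decompose() is assumed to return plain sub-query strings here.
    sub_queries = decompose(query) if label == "complex" else [query]
    return {q: retrieve(q) for q in sub_queries}
```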

Q: How many sub-queries are too many?

Most production systems cap decomposition at 3-5 sub-queries. Beyond this, the "Synthesis Phase" can become overwhelmed with context, leading to the "Lost in the Middle" phenomenon where the LLM ignores information placed in the center of a long prompt.

Q: Can I use query decomposition with open-source models?

Absolutely. Open models like Llama-3 and Mistral-7B are highly capable of query decomposition when given a clear few-shot prompt. The key is to use an instruction-tuned variant so the model reliably follows the output format.

Q: How does decomposition help with "Semantic Search" limitations?

Semantic search (vector similarity) is great at finding "things like this" but bad at "logical relationships." Decomposition converts a logical problem into multiple similarity problems, which vector databases are actually designed to solve.

References

  1. Asai, A., et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. https://arxiv.org/abs/2310.11511
  2. Zhou, D., et al. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. https://arxiv.org/abs/2205.10625
  3. Trivedi, H., et al. (2022). Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. https://arxiv.org/abs/2212.10509
  4. LangChain documentation: MultiQueryRetriever. https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever
  5. LlamaIndex documentation: Sub Question Query Engine. https://docs.llamaindex.ai/en/stable/examples/query_engine/sub_question_query_engine/
  6. Press, O., et al. (2022). Measuring and Narrowing the Compositionality Gap in Language Models. https://arxiv.org/abs/2210.03350
  7. Zheng, H. S., et al. (2023). Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models. https://arxiv.org/abs/2310.06117

Related Articles

Iterative Retrieval

Iterative Retrieval moves beyond the static 'Retrieve-then-Generate' paradigm by implementing a Retrieve-Reason-Refine loop. This approach is critical for solving multi-hop questions where the information required to answer a query is not contained in a single document but must be unrolled through sequential discovery.

Standard Retrieval-Generation Flow

A comprehensive technical exploration of Retrieval-Augmented Generation (RAG), detailing the dual-pipeline architecture, vector indexing strategies, and advanced optimization patterns for production AI.

Adaptive RAG

Adaptive RAG is an advanced architectural pattern that dynamically adjusts retrieval strategies based on query complexity, utilizing classifier-guided workflows and self-correction loops to optimize accuracy and efficiency.

Agentic Retrieval

Agentic Retrieval (Agentic RAG) evolves traditional Retrieval-Augmented Generation from a linear pipeline into an autonomous, iterative process where LLMs act as reasoning engines to plan, execute, and refine search strategies.

Corrective RAG

Corrective Retrieval-Augmented Generation (CRAG) is an advanced architectural pattern that introduces a self-correction layer to RAG pipelines, utilizing a retrieval evaluator to dynamically trigger knowledge refinement or external web searches.

Dense Passage Retrieval (DPR) Enhanced Approaches

An exhaustive technical exploration of Dense Passage Retrieval (DPR) enhancements, focusing on hard negative mining, RocketQA optimizations, multi-vector late interaction (ColBERT), and hybrid retrieval strategies.

Federated RAG

Federated RAG (Federated Retrieval-Augmented Generation) is an architectural evolution that enables querying across distributed knowledge sources without the need for data...

Multi-Agent RAG Systems

A comprehensive technical guide to Multi-Agent Retrieval-Augmented Generation (RAG), detailing the transition from linear pipelines to stateful, collaborative agentic workflows using Planners, Retrievers, and Refiners.