Improving RAG Query Quality and Relevance
Techniques for rewriting user queries (HyDE, Expansion) and reranking results to boost retrieval accuracy in Retrieval-Augmented Generation.
Introduction
When I first started analyzing our RAG pipeline’s failure cases, I was focused on the wrong end of the problem. I spent weeks tuning embedding models, adjusting chunk sizes, and experimenting with different vector distance metrics. Then I looked at our query logs and the real bottleneck became painfully obvious: it was the queries themselves. Users typed “show me the tax stuff” and expected the system to understand they meant “display all documents categorized as tax-related with pending review status.” The gap between user intent and query text was enormous. Once I shifted focus to query enhancement — rewriting, expanding, and classifying queries before they ever hit the vector database — retrieval precision jumped almost overnight.
Retrieval-Augmented Generation (RAG) is only as good as the documents it retrieves. If the user asks a vague question, a naive semantic search will return vague results. In a production system like ours, we need to bridge the gap between what the user says and what the vector database (Qdrant) understands.
Why Query Transformation Matters:
- Ambiguity Resolution: Users often ask “What about the contract?” without specifying which contract.
- Vocabulary Mismatch: Users might say “money” while documents say “fiscal compensation”.
- Context Awareness: Queries often depend on previous turns in the conversation.
What We’ll Build
In this guide, we will implement a pipeline that sits before the semantic search. We will:
- Implement Query Expansion: Use an LLM to generate synonyms or related questions.
- Apply HyDE (Hypothetical Document Embeddings): Generate a fake “ideal” answer to search against.
- Add Reranking: Use a cross-encoder to re-score the top K results from Qdrant.
Architecture Overview
We are moving from a direct “User -> Embedding -> Search” flow to a more sophisticated pipeline.
flowchart LR
User -->|Query| LLM[LLM Rewriter]
LLM -->|HyDE/Expansion| Embed[Embedding Model]
Embed -->|Vector Search| Qdrant
Qdrant -->|Top 50 Candidates| Reranker[Cross-Encoder]
Reranker -->|Top 5 Contexts| Generator[LLM Answer]
classDef primary fill:#7c3aed,color:#fff
classDef secondary fill:#06b6d4,color:#fff
classDef db fill:#f43f5e,color:#fff
classDef warning fill:#fbbf24,color:#000
class LLM,Embed,Reranker,Generator primary
class Qdrant db
class User warning
Section 1: Query Expansion and Transformations
The first step is to fix the user’s input. We don’t want to search with the raw query if it is poorly phrased.
HyDE (Hypothetical Document Embeddings)
HyDE asks the LLM to hallucinate a plausible answer, then embeds that answer to find real documents that look similar. The core insight from the original paper is that the hypothetical document, even though it may contain factual errors, captures the structure and vocabulary of a relevant answer, which produces a better embedding for retrieval than the short query alone.
[Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)] — Gao, L., Ma, X., Lin, J. & Callan, J., 2022
public async Task<string> GenerateHypotheticalAnswerAsync(string query)
{
var prompt = $@"
Please write a passage that answers the question: '{query}'.
Do not answer truthfully, just generate a plausible hypothetical response
containing the relevant keywords and structure.";
// Call Ollama or OpenAI
return await _aiService.CompleteAsync(prompt);
}
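To show where the hypothetical answer fits, here is a minimal sketch of HyDE retrieval end to end; the _embedder abstraction and the SearchByVectorAsync method are assumed placeholders for your embedding client and Qdrant wrapper, not part of the snippet above.
// Sketch: embed the hypothetical answer instead of the raw query, then search.
// _embedder and SearchByVectorAsync are placeholder abstractions.
public async Task<List<DocumentChunk>> SearchWithHydeAsync(string query)
{
    var hypothetical = await GenerateHypotheticalAnswerAsync(query);

    // The "answer-shaped" embedding lands closer to real answer documents
    var vector = await _embedder.EmbedAsync(hypothetical);

    return await _qdrant.SearchByVectorAsync(vector, limit: 50);
}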
Multi-Query Expansion
Sometimes a single angle isn’t enough. We can ask the LLM to generate three different versions of the query and search with all of them (a sketch of the generation step follows the example list below). A related approach, Query2Doc, uses the LLM to generate a pseudo-document that is then appended to the original query for expansion.
[Query2Doc: Query Expansion with Large Language Models] — Wang, L. et al., 2023
LlamaIndex provides a well-documented framework for these query transformation patterns, and we borrowed several ideas from their pipeline design.
[LlamaIndex Query Transformations] — LlamaIndex, 2024
For example, a question about ending a lease might be expanded into:
- “How do I terminate the lease?”
- “What are the lease cancellation clauses?”
- “Legal requirements for ending a rental agreement.”
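Here is a minimal sketch of the generation step, reusing the _aiService abstraction from the HyDE example; the prompt wording and the line-based parsing are our own choices, not a fixed recipe.
public async Task<List<string>> GenerateQueryVariationsAsync(string query)
{
    var prompt = $@"
Rewrite the following question in 3 different ways, one per line.
Vary the vocabulary and the angle, but keep the original intent.
Question: '{query}'";

    var response = await _aiService.CompleteAsync(prompt);

    // Parse one variation per line and always keep the original query
    var variations = response
        .Split('\n', StringSplitOptions.RemoveEmptyEntries)
        .Select(line => line.Trim())
        .Where(line => line.Length > 0)
        .Take(3)
        .ToList();
    variations.Insert(0, query);
    return variations;
}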
We execute all three searches in parallel against Qdrant and fuse the ranked lists with Reciprocal Rank Fusion (RRF), as sketched below.
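RRF scores each document by summing 1/(k + rank) over every list it appears in, so documents ranked highly by several query variations rise to the top; k = 60 is the constant from the original RRF paper. A minimal sketch, assuming each DocumentChunk exposes a stable Id:
// RRF: score(d) = sum over result lists of 1 / (k + rank of d in that list)
public static List<DocumentChunk> FuseWithRrf(
    IEnumerable<List<DocumentChunk>> resultLists, int k = 60)
{
    var fused = new Dictionary<string, (DocumentChunk Chunk, double Score)>();

    foreach (var list in resultLists)
    {
        for (var rank = 0; rank < list.Count; rank++)
        {
            var chunk = list[rank];
            var contribution = 1.0 / (k + rank + 1); // ranks are 1-based in RRF

            fused[chunk.Id] = fused.TryGetValue(chunk.Id, out var entry)
                ? (entry.Chunk, entry.Score + contribution)
                : (chunk, contribution);
        }
    }

    // Documents that appear high in several lists float to the top
    return fused.Values
        .OrderByDescending(e => e.Score)
        .Select(e => e.Chunk)
        .ToList();
}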
Step-back prompting is another powerful technique for complex queries that require reasoning. Instead of searching for the specific question, the LLM first generates a broader, more abstract question and searches for that.
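The step-back prompt can be very simple; the wording below is our own illustration rather than the paper’s, and it reuses the _aiService abstraction from earlier.
// Sketch: derive a broader "step-back" question before retrieval
var stepBackPrompt = $@"
Given this specific question: '{query}'
Write one more general question about the underlying topic or principle,
suitable for finding background documents.";

var broaderQuery = await _aiService.CompleteAsync(stepBackPrompt);
// Search with broaderQuery (and optionally the original query) against Qdrant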
[Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models] — Zheng, H. et al., 2023
Section 2: Semantic Search & Reranking
Once we have our better query (or queries), we hit Qdrant. However, vector distance (Cosine Similarity) is a proxy for relevance, not a guarantee.
The Reranking Step
Vector databases are fast thanks to Approximate Nearest Neighbor (ANN) search, but ANN can miss nuanced relationships. A cross-encoder model is slower but much more accurate, because it scores the query and document together rather than comparing two independently computed embeddings.
We fetch 50 documents from Qdrant and pass them through a reranker.
// Using a local reranker model or a reranking API
var rerankRequest = new RerankRequest
{
    Query = userQuery,
    Documents = candidates.Select(c => c.Content).ToList(),
    TopN = 5 // keep only the five best contexts for generation
};
var topResults = await _reranker.RankAsync(rerankRequest);
Microsoft’s Semantic Kernel provides useful abstractions for building these kinds of memory-augmented pipelines, and its patterns for query routing influenced our orchestration design.
[Microsoft Semantic Kernel - Memory Patterns] — Microsoft, 2024
Section 3: Putting It All Together
In our SearchService, we orchestrate these steps.
public async Task<List<DocumentChunk>> SearchAsync(string rawQuery)
{
    // 1. Transform: expand the raw query into multiple variations
    var expandedQuery = await _queryTransformer.ExpandAsync(rawQuery);

    // 2. Vector search: run each variation against Qdrant in parallel
    var tasks = expandedQuery.Variations.Select(q => _qdrant.SearchAsync(q));
    var results = await Task.WhenAll(tasks);

    // 3. Deduplicate & fuse the merged candidate lists
    var candidates = Deduplicate(results.SelectMany(x => x));

    // 4. Rerank against the original query and return the best contexts
    var finalSet = await _reranker.RankAsync(rawQuery, candidates);
    return finalSet;
}
To measure whether these query enhancements were actually improving results, we adopted the RAGAs evaluation framework, which provides standardized metrics for retrieval quality including faithfulness, answer relevancy, and context precision.
[RAGAs: Automated Evaluation of Retrieval Augmented Generation] — Es, S., James, J., Espinosa-Anke, L. & Schockaert, S., 2023
Conclusion
By treating the user’s query as a starting point rather than a final command, we significantly improve the relevance of our RAG pipeline. Looking back, the most surprising lesson was how much retrieval quality depended on the query rather than the retrieval system itself. We spent months optimizing vector indexes and embedding models, but the single biggest improvement came from a relatively simple query rewriting step.
The three techniques — HyDE for complex analytical queries, multi-query expansion for exploratory searches, and cross-encoder reranking for precision — each address a different failure mode. The key is not to apply them uniformly, but to classify the query first and then route it through the appropriate enhancement path. This selective approach kept latency manageable while delivering the quality improvement where it was actually needed.
Next Steps
- Explore Hybrid Search: Combining Semantic and Keyword Search to add keyword matching alongside these query enhancements.
- Read about Entity Analysis and Graph Databases for structured knowledge extraction.
- Consider Storage Performance to keep your vector DB fast under production load.