Improving RAG Query Quality and Relevance
Techniques for rewriting user queries (HyDE, Expansion) and reranking results to boost retrieval accuracy in Retrieval-Augmented Generation.
Introduction
When I first started analyzing our RAG pipeline’s failure cases, I was focused on the wrong end of the problem. I spent weeks tuning embedding models, adjusting chunk sizes, and experimenting with different vector distance metrics. Then I looked at our query logs and the real bottleneck became painfully obvious: it was the queries themselves. Users typed “show me the tax stuff” and expected the system to understand they meant “display all documents categorized as tax-related with pending review status.” The gap between user intent and query text was enormous. Once I shifted focus to query enhancement — rewriting, expanding, and classifying queries before they ever hit the vector database — retrieval precision jumped almost overnight.
Retrieval-Augmented Generation (RAG) is only as good as the documents it retrieves. If the user asks a vague question, a naive semantic search will return vague results. In a production system like ours, we need to bridge the gap between what the user says and what the vector database (Qdrant) understands.
Why Query Transformation Matters:
- Ambiguity Resolution: Users often ask “What about the contract?” without specifying which contract.
- Vocabulary Mismatch: Users might say “money” while documents say “fiscal compensation”.
- Context Awareness: Queries often depend on previous turns in the conversation.
What We’ll Build
In this guide, we will implement a pipeline that sits before the semantic search. We will:
- Implement Query Expansion: Use an LLM to generate synonyms or related questions.
- Apply HyDE (Hypothetical Document Embeddings): Generate a fake “ideal” answer to search against.
- Add Reranking: Use a cross-encoder to re-score the top K results from Qdrant.
Architecture Overview
We are moving from a direct “User -> Embedding -> Search” flow to a more sophisticated pipeline.
flowchart LR
User -->|Query| LLM[LLM Rewriter]
LLM -->|HyDE/Expansion| Embed[Embedding Model]
Embed -->|Vector Search| Qdrant
Qdrant -->|Top 50 Candidates| Reranker[Cross-Encoder]
Reranker -->|Top 5 Contexts| Generator[LLM Answer]
classDef primary fill:#7c3aed,color:#fff
classDef secondary fill:#06b6d4,color:#fff
classDef db fill:#f43f5e,color:#fff
classDef warning fill:#fbbf24,color:#000
class LLM,Embed,Reranker,Generator primary
class Qdrant db
class User warning
Section 1: Query Expansion and Transformations
The first step is to fix the user’s input. We don’t want to search with the raw query if it is poorly phrased.
HyDE (Hypothetical Document Embeddings)
HyDE asks the LLM to hallucinate a plausible answer, then embeds that answer to find real documents that look similar. The core insight from the original paper is that the hypothetical document, even though it may contain factual errors, captures the structure and vocabulary of a relevant answer, which produces a better embedding for retrieval than the short query alone.
[Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)] — Gao, L., Ma, X., Lin, J. & Callan, J., 2022
public async Task<string> GenerateHypotheticalAnswerAsync(string query)
{
var prompt = $@"
Please write a passage that answers the question: '{query}'.
Do not answer truthfully, just generate a plausible hypothetical response
containing the relevant keywords and structure.";
// Call Ollama or OpenAI
return await _aiService.CompleteAsync(prompt);
}
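To show where the hypothetical answer fits, here is a minimal sketch of HyDE retrieval end to end; the _embedder abstraction and the SearchByVectorAsync method are assumed placeholders for your embedding client and Qdrant wrapper, not part of the snippet above.
// Sketch: embed the hypothetical answer instead of the raw query, then search.
// _embedder and SearchByVectorAsync are placeholder abstractions.
public async Task<List<DocumentChunk>> SearchWithHydeAsync(string query)
{
    var hypothetical = await GenerateHypotheticalAnswerAsync(query);

    // The "answer-shaped" embedding lands closer to real answer documents
    var vector = await _embedder.EmbedAsync(hypothetical);

    return await _qdrant.SearchByVectorAsync(vector, limit: 50);
}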
Multi-Query Expansion
Sometimes a single angle isn’t enough. We can ask the LLM to generate three different versions of the query and search with all of them (a sketch of the generation step follows the example list below). A related approach, Query2Doc, uses the LLM to generate a pseudo-document that is then appended to the original query for expansion.
[Query2Doc: Query Expansion with Large Language Models] — Wang, L. et al., 2023
LlamaIndex provides a well-documented framework for these query transformation patterns, and we borrowed several ideas from their pipeline design.
[LlamaIndex Query Transformations] — LlamaIndex, 2024
For example, a question about ending a lease might be expanded into:
- “How do I terminate the lease?”
- “What are the lease cancellation clauses?”
- “Legal requirements for ending a rental agreement.”
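Here is a minimal sketch of the generation step, reusing the _aiService abstraction from the HyDE example; the prompt wording and the line-based parsing are our own choices, not a fixed recipe.
public async Task<List<string>> GenerateQueryVariationsAsync(string query)
{
    var prompt = $@"
Rewrite the following question in 3 different ways, one per line.
Vary the vocabulary and the angle, but keep the original intent.
Question: '{query}'";

    var response = await _aiService.CompleteAsync(prompt);

    // Parse one variation per line and always keep the original query
    var variations = response
        .Split('\n', StringSplitOptions.RemoveEmptyEntries)
        .Select(line => line.Trim())
        .Where(line => line.Length > 0)
        .Take(3)
        .ToList();
    variations.Insert(0, query);
    return variations;
}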
We execute all three searches in parallel against Qdrant and fuse the ranked lists with Reciprocal Rank Fusion (RRF), as sketched below.
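RRF scores each document by summing 1/(k + rank) over every list it appears in, so documents ranked highly by several query variations rise to the top; k = 60 is the constant from the original RRF paper. A minimal sketch, assuming each DocumentChunk exposes a stable Id:
// RRF: score(d) = sum over result lists of 1 / (k + rank of d in that list)
public static List<DocumentChunk> FuseWithRrf(
    IEnumerable<List<DocumentChunk>> resultLists, int k = 60)
{
    var fused = new Dictionary<string, (DocumentChunk Chunk, double Score)>();

    foreach (var list in resultLists)
    {
        for (var rank = 0; rank < list.Count; rank++)
        {
            var chunk = list[rank];
            var contribution = 1.0 / (k + rank + 1); // ranks are 1-based in RRF

            fused[chunk.Id] = fused.TryGetValue(chunk.Id, out var entry)
                ? (entry.Chunk, entry.Score + contribution)
                : (chunk, contribution);
        }
    }

    // Documents that appear high in several lists float to the top
    return fused.Values
        .OrderByDescending(e => e.Score)
        .Select(e => e.Chunk)
        .ToList();
}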
Step-back prompting is another powerful technique for complex queries that require reasoning. Instead of searching for the specific question, the LLM first generates a broader, more abstract question and searches for that.
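The step-back prompt can be very simple; the wording below is our own illustration rather than the paper’s, and it reuses the _aiService abstraction from earlier.
// Sketch: derive a broader "step-back" question before retrieval
var stepBackPrompt = $@"
Given this specific question: '{query}'
Write one more general question about the underlying topic or principle,
suitable for finding background documents.";

var broaderQuery = await _aiService.CompleteAsync(stepBackPrompt);
// Search with broaderQuery (and optionally the original query) against Qdrant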
[Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models] — Zheng, H. et al., 2023
Section 2: Semantic Search & Reranking
Once we have our better query (or queries), we hit Qdrant. However, vector distance (Cosine Similarity) is a proxy for relevance, not a guarantee.
The Reranking Step
Vector databases are fast thanks to Approximate Nearest Neighbor (ANN) search, but ANN can miss nuanced relationships. A cross-encoder model is slower but much more accurate, because it scores the query and document together rather than comparing two independently computed embeddings.
We fetch 50 documents from Qdrant and pass them through a reranker.
// Using a local reranker model or a reranking API
var rerankRequest = new RerankRequest
{
    Query = userQuery,
    Documents = candidates.Select(c => c.Content).ToList(),
    TopN = 5 // keep only the five best contexts for generation
};
var topResults = await _reranker.RankAsync(rerankRequest);
Microsoft’s Semantic Kernel provides useful abstractions for building these kinds of memory-augmented pipelines, and its patterns for query routing influenced our orchestration design.
[Microsoft Semantic Kernel - Memory Patterns] — Microsoft, 2024
Section 3: Putting It All Together
In our SearchService, we orchestrate these steps.
public async Task<List<DocumentChunk>> SearchAsync(string rawQuery)
{
    // 1. Transform: expand the raw query into multiple variations
    var expandedQuery = await _queryTransformer.ExpandAsync(rawQuery);

    // 2. Vector search: run each variation against Qdrant in parallel
    var tasks = expandedQuery.Variations.Select(q => _qdrant.SearchAsync(q));
    var results = await Task.WhenAll(tasks);

    // 3. Deduplicate & fuse the merged candidate lists
    var candidates = Deduplicate(results.SelectMany(x => x));

    // 4. Rerank against the original query and return the best contexts
    var finalSet = await _reranker.RankAsync(rawQuery, candidates);
    return finalSet;
}
To measure whether these query enhancements were actually improving results, we adopted the RAGAs evaluation framework, which provides standardized metrics for retrieval quality including faithfulness, answer relevancy, and context precision.
[RAGAs: Automated Evaluation of Retrieval Augmented Generation] — Es, S., James, J., Espinosa-Anke, L. & Schockaert, S., 2023
Conclusion
By treating the user’s query as a starting point rather than a final command, we significantly improve the relevance of our RAG pipeline. Looking back, the most surprising lesson was how much retrieval quality depended on the query rather than the retrieval system itself. We spent months optimizing vector indexes and embedding models, but the single biggest improvement came from a relatively simple query rewriting step.
The three techniques — HyDE for complex analytical queries, multi-query expansion for exploratory searches, and cross-encoder reranking for precision — each address a different failure mode. The key is not to apply them uniformly, but to classify the query first and then route it through the appropriate enhancement path. This selective approach kept latency manageable while delivering the quality improvement where it was actually needed.
Next Steps
- Explore Hybrid Search: Combining Semantic and Keyword Search to add keyword matching alongside these query enhancements.
- Read about Entity Analysis and Graph Databases for structured knowledge extraction.
- Consider Storage Performance to keep your vector DB fast under production load.