Improving RAG Query Quality and Relevance
Techniques for rewriting user queries (HyDE, Expansion) and reranking results to boost retrieval accuracy in Retrieval-Augmented Generation.
Introduction
Retrieval-Augmented Generation (RAG) is only as good as the documents it retrieves. If the user asks a vague question, a naive semantic search will return vague results. In a production system like BlueRobin, we need to bridge the gap between what the user says and what the vector database (Qdrant) understands.
Why Query Transformation Matters:
- Ambiguity Resolution: Users often ask “What about the contract?” without specifying which contract.
- Vocabulary Mismatch: Users might say “money” while documents say “fiscal compensation”.
- Context Awareness: Queries often depend on previous turns in the conversation.
What We’ll Build
In this guide, we will implement a pipeline that sits before the semantic search. We will:
- Implement Query Expansion: Use an LLM to generate synonyms or related questions.
- Apply HyDE (Hypothetical Document Embeddings): Generate a fake “ideal” answer to search against.
- Add Reranking: Use a cross-encoder to re-score the top K results from Qdrant.
Architecture Overview
We are moving from a direct “User -> Embedding -> Search” flow to a more sophisticated pipeline.
flowchart LR
User -->|Query| LLM[LLM Rewriter]
LLM -->|HyDE/Expansion| Embed[Embedding Model]
Embed -->|Vector Search| Qdrant
Qdrant -->|Top 50 Candidates| Reranker[Cross-Encoder]
Reranker -->|Top 5 Contexts| Generator[LLM Answer]
classDef primary fill:#7c3aed,color:#fff
classDef secondary fill:#06b6d4,color:#fff
classDef db fill:#f43f5e,color:#fff
classDef warning fill:#fbbf24,color:#000
class LLM,Embed,Reranker,Generator primary
class Qdrant db
class User warning
Section 1: Query Expansion and Transformations
The first step is to fix the user’s input. We don’t want to search with the raw query if it’s low quality.
HyDE (Hypothetical Document Embeddings)
HyDE asks the LLM to hallucinate a plausible answer, then embeds that answer to find real documents that look similar.
public async Task<string> GenerateHypotheticalAnswerAsync(string query)
{
    var prompt = $@"
Please write a passage that answers the question: '{query}'.
The answer does not need to be factually accurate; just generate a plausible
hypothetical response containing the relevant keywords and structure.";

    // Call Ollama or OpenAI
    return await _aiService.CompleteAsync(prompt);
}
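To make the flow concrete, here is a minimal sketch of how the hypothetical answer feeds the vector search. The `_embedder` and `_qdrant` members and their method names (`EmbedAsync`, `SearchByVectorAsync`) are assumptions for illustration, not part of BlueRobin’s actual API:

```csharp
// Sketch: search with the hallucinated passage instead of the raw query.
// _embedder and _qdrant are hypothetical service interfaces.
public async Task<List<DocumentChunk>> SearchWithHydeAsync(string query)
{
    var hypothetical = await GenerateHypotheticalAnswerAsync(query);

    // Embed the fake "ideal" answer, not the user's question.
    float[] vector = await _embedder.EmbedAsync(hypothetical);

    // Real documents that look like the fake answer are strong candidates.
    return await _qdrant.SearchByVectorAsync(vector, limit: 50);
}
```

The key idea is that an answer-shaped passage lands closer in embedding space to real answer-shaped documents than a short, vague question does.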
Multi-Query Expansion
Sometimes a single angle isn’t enough. We can ask the LLM to generate 3 different versions of the query.
- “How do I terminate the lease?”
- “What are the lease cancellation clauses?”
- “Legal requirements for ending a rental agreement.”
We execute all three searches in parallel against Qdrant and fuse the results (Reciprocal Rank Fusion).
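The fusion step can be sketched as follows. Document IDs stand in for full chunks here, and k = 60 is the conventional RRF damping constant; the method name is illustrative:

```csharp
// Sketch of Reciprocal Rank Fusion (RRF): each ranked list contributes
// 1 / (k + rank) to a document's score, where rank is 1-based.
// Documents that appear high in several lists accumulate the most score.
public static List<string> FuseWithRrf(List<List<string>> rankedLists, int k = 60)
{
    var scores = new Dictionary<string, double>();

    foreach (var list in rankedLists)
    {
        for (int rank = 1; rank <= list.Count; rank++)
        {
            var docId = list[rank - 1];
            scores.TryGetValue(docId, out var current);
            scores[docId] = current + 1.0 / (k + rank);
        }
    }

    // Highest fused score first.
    return scores.OrderByDescending(kv => kv.Value)
                 .Select(kv => kv.Key)
                 .ToList();
}
```

Because RRF only uses rank positions, it sidesteps the problem of comparing raw similarity scores across the three searches, which may not be on the same scale.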
Section 2: Semantic Search & Reranking
Once we have our better query (or queries), we hit Qdrant. However, vector similarity (cosine) is a proxy for relevance, not a guarantee.
The Reranking Step
Vector databases are fast (ANN - Approximate Nearest Neighbor) but can miss nuanced relationships. A Cross-Encoder model is slower but much more accurate because it looks at the query and document together.
We fetch 50 documents from Qdrant and pass them through a reranker.
// Using a local reranker or an API
var rerankRequest = new RerankRequest
{
    Query = userQuery,
    Documents = candidates.Select(c => c.Content).ToList(),
    TopN = 5
};

var topResults = await _reranker.RankAsync(rerankRequest);
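Downstream, the top reranked chunks become the context for the answering LLM (the “Top 5 Contexts -> Generator” step in the diagram). A minimal sketch, assuming a `RankedDocument` type with a `Content` property and reusing the `_aiService` from earlier; the prompt format is illustrative:

```csharp
// Sketch: stuff the reranked contexts into the generation prompt.
public async Task<string> AnswerAsync(string userQuery, List<RankedDocument> topResults)
{
    // Separate chunks clearly so the model can tell them apart.
    var context = string.Join("\n---\n", topResults.Select(r => r.Content));

    var prompt = $@"Answer the question using only the context below.

Context:
{context}

Question: {userQuery}";

    return await _aiService.CompleteAsync(prompt);
}
```

Keeping only the top five contexts limits prompt size and reduces the chance that a marginally relevant chunk distracts the generator.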
Section 3: Putting It All Together
In our SearchService, we orchestrate these steps.
public async Task<List<DocumentChunk>> SearchAsync(string rawQuery)
{
    // 1. Transform: expand the raw query into multiple variations
    var expandedQuery = await _queryTransformer.ExpandAsync(rawQuery);

    // 2. Vector search, one parallel request per variation
    var tasks = expandedQuery.Variations.Select(q => _qdrant.SearchAsync(q));
    var results = await Task.WhenAll(tasks);

    // 3. Deduplicate and fuse the candidate lists (Reciprocal Rank Fusion)
    var candidates = Deduplicate(results.SelectMany(x => x));

    // 4. Rerank against the original query and keep the top contexts
    var finalSet = await _reranker.RankAsync(rawQuery, candidates);
    return finalSet;
}
Conclusion
By treating the user’s query as a starting point rather than a final command, we significantly improve the relevance of our RAG pipeline.
Next Steps:
- Explore Entity Extraction for Graph Databases to enhance search further.
- Read about Storage Performance to keep your vector DB fast.