Improving RAG Query Quality and Relevance
Techniques for rewriting user queries (HyDE, Expansion) and reranking results to boost retrieval accuracy in Retrieval-Augmented Generation.
Introduction
Retrieval-Augmented Generation (RAG) is only as good as the documents it retrieves. If the user asks a vague question, a naive semantic search will return vague results. In a production system like BlueRobin, we need to bridge the gap between what the user says and what the vector database (Qdrant) understands.
Why Query Transformation Matters:
- Ambiguity Resolution: Users often ask “What about the contract?” without specifying which contract.
- Vocabulary Mismatch: Users might say “money” while documents say “fiscal compensation”.
- Context Awareness: Queries often depend on previous turns in the conversation.
What We’ll Build
In this guide, we will implement a pipeline that sits before the semantic search. We will:
- Implement Query Expansion: Use an LLM to generate synonyms or related questions.
- Apply HyDE (Hypothetical Document Embeddings): Generate a fake “ideal” answer to search against.
- Add Reranking: Use a cross-encoder to re-score the top K results from Qdrant.
Architecture Overview
We are moving from a direct “User -> Embedding -> Search” flow to a more sophisticated pipeline.
flowchart LR
User -->|Query| LLM[LLM Rewriter]
LLM -->|HyDE/Expansion| Embed[Embedding Model]
Embed -->|Vector Search| Qdrant
Qdrant -->|Top 50 Candidates| Reranker[Cross-Encoder]
Reranker -->|Top 5 Contexts| Generator[LLM Answer]
classDef primary fill:#7c3aed,color:#fff
classDef secondary fill:#06b6d4,color:#fff
classDef db fill:#f43f5e,color:#fff
classDef warning fill:#fbbf24,color:#000
class LLM,Embed,Reranker,Generator primary
class Qdrant db
class User warning
Section 1: Query Expansion and Transformations
The first step is to fix the user’s input. We don’t want to search with the raw query if it’s low quality.
HyDE (Hypothetical Document Embeddings)
HyDE asks the LLM to hallucinate a plausible answer, then embeds that answer to find real documents that look similar.
public async Task<string> GenerateHypotheticalAnswerAsync(string query)
{
    var prompt = $@"
Please write a passage that answers the question: '{query}'.
The answer does not need to be factually accurate; just generate a plausible
hypothetical response containing the relevant keywords and structure.";

    // Call Ollama or OpenAI
    return await _aiService.CompleteAsync(prompt);
}
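To make the flow concrete, here is a minimal sketch of how the hypothetical answer feeds the vector search. The `_embedder` and `_qdrant` members and their method names (`EmbedAsync`, `SearchByVectorAsync`) are assumptions for illustration, not part of BlueRobin’s actual API:

```csharp
// Sketch: search with the hallucinated passage instead of the raw query.
// _embedder and _qdrant are hypothetical service interfaces.
public async Task<List<DocumentChunk>> SearchWithHydeAsync(string query)
{
    var hypothetical = await GenerateHypotheticalAnswerAsync(query);

    // Embed the fake "ideal" answer, not the user's question.
    float[] vector = await _embedder.EmbedAsync(hypothetical);

    // Real documents that look like the fake answer are strong candidates.
    return await _qdrant.SearchByVectorAsync(vector, limit: 50);
}
```

The key idea is that an answer-shaped passage lands closer in embedding space to real answer-shaped documents than a short, vague question does.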
Multi-Query Expansion
Sometimes a single angle isn’t enough. We can ask the LLM to generate 3 different versions of the query.
- “How do I terminate the lease?”
- “What are the lease cancellation clauses?”
- “Legal requirements for ending a rental agreement.”
We execute all three searches in parallel against Qdrant and fuse the results (Reciprocal Rank Fusion).
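The fusion step can be sketched as follows. Document IDs stand in for full chunks here, and k = 60 is the conventional RRF damping constant; the method name is illustrative:

```csharp
// Sketch of Reciprocal Rank Fusion (RRF): each ranked list contributes
// 1 / (k + rank) to a document's score, where rank is 1-based.
// Documents that appear high in several lists accumulate the most score.
public static List<string> FuseWithRrf(List<List<string>> rankedLists, int k = 60)
{
    var scores = new Dictionary<string, double>();

    foreach (var list in rankedLists)
    {
        for (int rank = 1; rank <= list.Count; rank++)
        {
            var docId = list[rank - 1];
            scores.TryGetValue(docId, out var current);
            scores[docId] = current + 1.0 / (k + rank);
        }
    }

    // Highest fused score first.
    return scores.OrderByDescending(kv => kv.Value)
                 .Select(kv => kv.Key)
                 .ToList();
}
```

Because RRF only uses rank positions, it sidesteps the problem of comparing raw similarity scores across the three searches, which may not be on the same scale.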
Section 2: Semantic Search & Reranking
Once we have our better query (or queries), we hit Qdrant. However, vector similarity (cosine) is a proxy for relevance, not a guarantee.
The Reranking Step
Vector databases are fast (ANN - Approximate Nearest Neighbor) but can miss nuanced relationships. A Cross-Encoder model is slower but much more accurate because it looks at the query and document together.
We fetch 50 documents from Qdrant and pass them through a reranker.
// Using a local reranker or an API
var rerankRequest = new RerankRequest
{
    Query = userQuery,
    Documents = candidates.Select(c => c.Content).ToList(),
    TopN = 5
};

var topResults = await _reranker.RankAsync(rerankRequest);
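Downstream, the top reranked chunks become the context for the answering LLM (the “Top 5 Contexts -> Generator” step in the diagram). A minimal sketch, assuming a `RankedDocument` type with a `Content` property and reusing the `_aiService` from earlier; the prompt format is illustrative:

```csharp
// Sketch: stuff the reranked contexts into the generation prompt.
public async Task<string> AnswerAsync(string userQuery, List<RankedDocument> topResults)
{
    // Separate chunks clearly so the model can tell them apart.
    var context = string.Join("\n---\n", topResults.Select(r => r.Content));

    var prompt = $@"Answer the question using only the context below.

Context:
{context}

Question: {userQuery}";

    return await _aiService.CompleteAsync(prompt);
}
```

Keeping only the top five contexts limits prompt size and reduces the chance that a marginally relevant chunk distracts the generator.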
Section 3: Putting It All Together
In our SearchService, we orchestrate these steps.
public async Task<List<DocumentChunk>> SearchAsync(string rawQuery)
{
    // 1. Transform: expand the raw query into multiple variations
    var expandedQuery = await _queryTransformer.ExpandAsync(rawQuery);

    // 2. Vector search, one parallel request per variation
    var tasks = expandedQuery.Variations.Select(q => _qdrant.SearchAsync(q));
    var results = await Task.WhenAll(tasks);

    // 3. Deduplicate and fuse the candidate lists (Reciprocal Rank Fusion)
    var candidates = Deduplicate(results.SelectMany(x => x));

    // 4. Rerank against the original query and keep the top contexts
    var finalSet = await _reranker.RankAsync(rawQuery, candidates);
    return finalSet;
}
Conclusion
By treating the user’s query as a starting point rather than a final command, we significantly improve the relevance of our RAG pipeline.
Next Steps:
- Explore Entity Extraction for Graph Databases to enhance search further.
- Read about Storage Performance to keep your vector DB fast.