# Agentic RRF Ensembling for Production Search
How to combine multiple retrieval strategies with reciprocal rank fusion to produce stable, high-quality contexts for LLM responses.
When I first ran A/B tests on our search pipeline, I expected the improvements from ensembling to be marginal — maybe a few percentage points of NDCG. Instead, the numbers told a different story: a single dense retrieval model scored 0.72 NDCG, while the RRF ensemble scored 0.89. No model retraining. No new embeddings. Just combining existing signals. That 17-point jump was the highest-ROI improvement in the entire RAG pipeline, and it changed how I think about retrieval architecture. Ensembling isn’t an optimization — it’s a fundamental design principle.
Single-model retrieval is brittle. A dense embedding model might miss exact lexical matches, while BM25 misses semantic similarity. In production RAG systems, combining multiple retrieval signals through reciprocal rank fusion (RRF) produces more stable, higher-quality contexts for LLM generation (Cormack, Clarke, and Buettcher).

## Why Ensemble Retrieval?
Each retrieval model has blind spots:
| Model Type | Strength | Weakness |
|---|---|---|
| Dense embeddings | Semantic similarity | Exact keyword matches |
| Sparse/BM25 | Lexical precision | Synonyms, paraphrases |
| Graph traversal | Relational context | Cold-start entities |
| NER-filtered | Domain entity precision | Free-form queries |
By combining their outputs, you get coverage (fewer misses) and stability (less sensitivity to any single model’s quirks). This approach aligns with the broader “mixture of experts” concept (Jacobs et al.), in which specialized models each contribute their strengths.

## Reciprocal Rank Fusion
RRF is a rank aggregation algorithm that combines ranked lists without needing score normalization. For each document appearing in any result list:
RRF(d) = sum over r in R of: weight_r / (k + rank_r(d))
Where:
- R is the set of rankers (retrieval models)
- weight_r is the weight for ranker r
- k is a constant (typically 60) that dampens the impact of high ranks
- rank_r(d) is the position of document d in ranker r’s list
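The sum is easy to sanity-check by hand. Here is a minimal, language-agnostic Python sketch of the same formula with k = 60; the ranker weights and document IDs are invented for illustration:

```python
def rrf_fuse(weighted_lists, k=60):
    """Sum weight_r / (k + rank_r(d)) for every document d in any list."""
    scores = {}
    for weight, docs in weighted_lists:
        for rank, doc_id in enumerate(docs, start=1):  # rank_r(d) is 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked lists from a dense and a sparse retriever.
dense = (1.0, ["doc_a", "doc_b", "doc_c"])
sparse = (0.8, ["doc_b", "doc_a", "doc_d"])
fused = rrf_fuse([dense, sparse])
# doc_a wins: 1.0/61 + 0.8/62 ≈ 0.0293 beats doc_b's 1.0/62 + 0.8/61 ≈ 0.0292
```

Note how close the top two scores are: because both rankers place `doc_a` and `doc_b` near the top, each accumulates reciprocal-rank credit from both lists, and the higher-weighted dense ranker breaks the tie.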
The elegance of RRF is that it requires only rank positions, not raw scores, which makes it trivial to combine results from systems with incomparable score distributions (NIST).

## Implementation
```csharp
public class RrfEnsembleSearch
{
    // Damping constant; 60 is the typical choice.
    private const int K = 60;

    public List<ScoredDocument> Fuse(IReadOnlyList<RankedList> rankedLists)
    {
        var scores = new Dictionary<string, double>();
        foreach (var list in rankedLists)
        {
            for (int rank = 0; rank < list.Documents.Count; rank++)
            {
                var docId = list.Documents[rank].Id;
                scores.TryGetValue(docId, out var current);
                // rank is 0-based here; RRF positions are 1-based, hence rank + 1.
                scores[docId] = current + list.Weight * (1.0 / (K + rank + 1));
            }
        }
        return scores
            .OrderByDescending(kv => kv.Value)
            .Select(kv => new ScoredDocument(kv.Key, kv.Value))
            .ToList();
    }
}
```
```csharp
public record RankedList(
    List<RetrievedDocument> Documents,
    double Weight);
```

## Building the Retrieval Pipeline
The ensemble search coordinates multiple retrievers in parallel:
```csharp
public class EnsembleSearchService
{
    private readonly IVectorStore _vectorStore;
    private readonly IGraphStore _graphStore;
    private readonly RrfEnsembleSearch _rrf;

    public async Task<List<ScoredDocument>> SearchAsync(
        string query, SearchOptions options, CancellationToken ct)
    {
        // Run retrievers in parallel.
        var denseTask = _vectorStore.SearchAsync(
            query, model: "nomic-embed-text", topK: 20, ct);
        var sparseTask = _vectorStore.SearchAsync(
            query, model: "bge-m3", topK: 20, ct);
        var graphTask = _graphStore.TraverseAsync(
            query, maxHops: 2, ct);
        await Task.WhenAll(denseTask, sparseTask, graphTask);

        // Assign weights per retriever.
        var rankedLists = new List<RankedList>
        {
            new(await denseTask, Weight: 1.0),
            new(await sparseTask, Weight: 0.8),
            new(await graphTask, Weight: 0.6),
        };

        return _rrf.Fuse(rankedLists)
            .Take(options.MaxResults)
            .ToList();
    }
}
```

## Agentic Retrieval
In an agentic setup, the LLM itself decides which retrievers to invoke based on the query type. This is a form of query routing (Li and Croft) that avoids running unnecessary retrievers.

```csharp
public class AgenticRetriever
{
    public async Task<List<ScoredDocument>> RetrieveAsync(
        string query, CancellationToken ct)
    {
        // Classify the query, then invoke only the retrievers suited to it.
        var queryType = await ClassifyQueryAsync(query, ct);
        var retrievers = queryType switch
        {
            QueryType.Factual => new[] { "dense", "sparse" },
            QueryType.Relational => new[] { "dense", "graph" },
            QueryType.EntityLookup => new[] { "graph", "ner" },
            _ => new[] { "dense", "sparse", "graph" },
        };
        var lists = await RunRetrieversAsync(retrievers, query, ct);
        return _rrf.Fuse(lists);
    }
}
```
This reduces latency: if a graph traversal isn’t needed for a query, it never runs (Liu).

## Production Considerations
- Timeout individual retrievers: if one model is slow, don’t block the entire pipeline. Use `Task.WhenAll` with per-task cancellation (for example, `CancellationTokenSource.CancelAfter`), and treat a timed-out retriever as an empty ranked list.
- Deduplication: RRF handles duplicate documents across lists naturally by summing their scores.
- Score normalization: RRF doesn’t need it; that’s the whole point. Avoid mixing in raw similarity scores, which aren’t comparable across models.
- Caching: cache retrieval results for identical queries within a short window (30 seconds).
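The timeout advice above translates to any async runtime. Here is a minimal Python `asyncio` sketch (the retriever names and latencies are hypothetical) in which a timed-out retriever degrades to an empty ranked list instead of failing the whole query:

```python
import asyncio

async def search_with_timeout(retriever, query, timeout_s=0.5):
    """Run one retriever; on timeout, degrade to an empty result list."""
    try:
        return await asyncio.wait_for(retriever(query), timeout=timeout_s)
    except asyncio.TimeoutError:
        return []  # the ensemble proceeds with the remaining retrievers

async def fast_dense(query):
    return ["doc_a", "doc_b"]

async def slow_graph(query):
    await asyncio.sleep(10)  # simulates a stuck graph traversal
    return ["doc_c"]

async def main():
    # Both retrievers run concurrently; only the slow one is cut off.
    return await asyncio.gather(
        search_with_timeout(fast_dense, "q"),
        search_with_timeout(slow_graph, "q", timeout_s=0.1),
    )

results = asyncio.run(main())
```

Because RRF only sums over the lists it receives, an empty list from a timed-out retriever simply contributes nothing, and the fused ranking stays well defined.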
## Conclusion
After months of iterating on this pipeline, the lesson that sticks with me is deceptively simple: combining mediocre signals intelligently beats optimizing any single signal in isolation. The RRF ensemble didn’t require better models or more training data — it just required the discipline to run multiple retrieval strategies and trust the rank fusion to surface the right documents. The agentic routing layer was the second big win, not because it improved quality, but because it cut latency by skipping unnecessary retrievers for well-understood query types.
If you’re building a production RAG system and still relying on a single retrieval model, ensembling is the single highest-ROI change you can make. Start with dense + sparse, add RRF, measure the NDCG improvement, and then decide whether the complexity of graph traversal and agentic routing is worth it for your use case.
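Measuring the NDCG improvement doesn’t require an evaluation framework. A minimal binary-relevance NDCG@k fits in a few lines of Python; the function name and the relevance judgments below are made up for illustration:

```python
import math

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: actual DCG divided by the ideal DCG."""
    rels = [1.0 if d in relevant_ids else 0.0 for d in retrieved_ids[:k]]
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    # Ideal ranking puts every relevant document (up to k) on top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical judgments: d1 and d2 are relevant for this query.
relevant = {"d1", "d2"}
single_model = ndcg_at_k(["d3", "d1", "d2"], relevant, k=3)  # relevant docs ranked low
ensemble = ndcg_at_k(["d1", "d2", "d3"], relevant, k=3)      # relevant docs on top
```

Averaging this over a labeled query set, once for the single retriever and once for the fused ranking, is enough to decide whether the ensemble is earning its keep.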
## Next Steps
- Hybrid Retrieval with Graph Filters: FalkorDB + Qdrant — how graph traversal complements vector search in the ensemble
- GraphRAG Routing with Fallback Strategies — ensuring the agentic pipeline degrades gracefully when components fail