
Agentic RRF Ensembling for Production Search

How to combine multiple retrieval strategies with reciprocal rank fusion to produce stable, high-quality contexts for LLM responses.

By Victor Robin

When I first ran A/B tests on our search pipeline, I expected the improvements from ensembling to be marginal — maybe a few percentage points of NDCG. Instead, the numbers told a different story: a single dense retrieval model scored 0.72 NDCG, while the RRF ensemble scored 0.89. No model retraining. No new embeddings. Just combining existing signals. That 17-point jump was the highest-ROI improvement in the entire RAG pipeline, and it changed how I think about retrieval architecture. Ensembling isn’t an optimization — it’s a fundamental design principle.

Single-model retrieval is brittle. A dense embedding model might miss lexical matches, while BM25 misses semantic similarity. In production RAG systems, combining multiple retrieval signals through reciprocal rank fusion (RRF) produces more stable, higher-quality contexts for LLM generation.


Why Ensemble Retrieval?

Each retrieval model has blind spots:

Model Type       | Strength                | Weakness
Dense embeddings | Semantic similarity     | Exact keyword matches
Sparse/BM25      | Lexical precision       | Synonyms, paraphrases
Graph traversal  | Relational context      | Cold-start entities
NER-filtered     | Domain entity precision | Free-form queries

By combining their outputs, you get coverage (fewer misses) and stability (less sensitivity to any single model’s quirks). This approach aligns with the broader “mixture of experts” idea, going back to Jacobs et al., where specialized models each contribute their strengths.

Reciprocal Rank Fusion

RRF, introduced by Cormack, Clarke, and Buettcher, is a rank aggregation algorithm that combines ranked lists without needing score normalization. For each document d appearing in any result list:

RRF(d) = Σ_{r ∈ R} weight_r / (k + rank_r(d))

Where:

  • R is the set of rankers (retrieval models)
  • weight_r is the weight for ranker r
  • k is a constant (typically 60) that dampens the impact of high ranks
  • rank_r(d) is the position of document d in ranker r’s list

The elegance of RRF is that it only requires rank positions, not raw scores — making it trivial to combine results from systems with incomparable score distributions.
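
A quick worked example: with k = 60 and unit weights, a document ranked 1st by the dense retriever and 4th by BM25 scores 1/61 + 1/64 ≈ 0.032, while a document that appears 2nd in only one list scores 1/62 ≈ 0.016. Agreement across retrievers is rewarded even when neither individual rank is the best.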


Implementation

RrfEnsembleSearch.cs
public class RrfEnsembleSearch
{
    // Damping constant from the original RRF formulation; 60 works well in practice.
    private const int K = 60;

    public List<ScoredDocument> Fuse(
        IReadOnlyList<RankedList> rankedLists)
    {
        var scores = new Dictionary<string, double>();

        foreach (var list in rankedLists)
        {
            for (int rank = 0; rank < list.Documents.Count; rank++)
            {
                var docId = list.Documents[rank].Id;
                // TryGetValue leaves current at 0.0 for documents seen for the first time.
                scores.TryGetValue(docId, out var current);
                // rank is 0-based here; "+ 1" converts it to the 1-based rank in the formula.
                scores[docId] = current
                    + list.Weight * (1.0 / (K + rank + 1));
            }
        }

        // Documents that rank high in several lists accumulate the largest fused scores.
        return scores
            .OrderByDescending(kv => kv.Value)
            .Select(kv => new ScoredDocument(kv.Key, kv.Value))
            .ToList();
    }
}

public record RankedList(
    List<RetrievedDocument> Documents,
    double Weight);
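
The document records themselves aren’t defined in the post. A minimal sketch of shapes that make the class above compile, plus a hand-rolled call to Fuse, might look like this (the record fields and sample document IDs are assumptions for illustration):

using System.Collections.Generic;

var dense  = new List<RetrievedDocument> { new("doc-a"), new("doc-b"), new("doc-c") };
var sparse = new List<RetrievedDocument> { new("doc-c"), new("doc-a"), new("doc-d") };

var fused = new RrfEnsembleSearch().Fuse(new List<RankedList>
{
    new(dense, Weight: 1.0),   // trust the dense retriever slightly more
    new(sparse, Weight: 0.8),
});
// "doc-a" and "doc-c" appear in both lists, so they rise to the top of the fused ranking.

public record RetrievedDocument(string Id);
public record ScoredDocument(string Id, double Score);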

Building the Retrieval Pipeline

The ensemble search coordinates multiple retrievers in parallel:

EnsembleSearchService.cs
public class EnsembleSearchService
{
    private readonly IVectorStore _vectorStore;
    private readonly IGraphStore _graphStore;
    private readonly RrfEnsembleSearch _rrf;

    public async Task<List<ScoredDocument>> SearchAsync(
        string query, SearchOptions options, CancellationToken ct)
    {
        // Run retrievers in parallel
        var denseTask = _vectorStore.SearchAsync(
            query, model: "nomic-embed-text", topK: 20, ct);
        var sparseTask = _vectorStore.SearchAsync(
            query, model: "bge-m3", topK: 20, ct);
        var graphTask = _graphStore.TraverseAsync(
            query, maxHops: 2, ct);

        await Task.WhenAll(denseTask, sparseTask, graphTask);

        // Assign weights per retriever
        var rankedLists = new List<RankedList>
        {
            new(await denseTask, Weight: 1.0),
            new(await sparseTask, Weight: 0.8),
            new(await graphTask, Weight: 0.6),
        };

        return _rrf.Fuse(rankedLists)
            .Take(options.MaxResults)
            .ToList();
    }
}
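
The IVectorStore and IGraphStore abstractions and the SearchOptions type aren’t shown here. A minimal sketch consistent with how SearchAsync calls them might be (names and shapes are assumptions; adapt them to whatever vector and graph clients you actually run, e.g. Qdrant on the vector side):

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Assumed retriever abstractions matching the calls in EnsembleSearchService.
public interface IVectorStore
{
    Task<List<RetrievedDocument>> SearchAsync(
        string query, string model, int topK, CancellationToken ct);
}

public interface IGraphStore
{
    Task<List<RetrievedDocument>> TraverseAsync(
        string query, int maxHops, CancellationToken ct);
}

// Assumed options type; only MaxResults is used above.
public record SearchOptions(int MaxResults);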

Agentic Retrieval

In an agentic setup, the LLM itself decides which retrievers to invoke based on the query type. This is a form of query routing that avoids running unnecessary retrievers.

public class AgenticRetriever
{
    private readonly RrfEnsembleSearch _rrf;

    public AgenticRetriever(RrfEnsembleSearch rrf) => _rrf = rrf;

    public async Task<List<ScoredDocument>> RetrieveAsync(
        string query, CancellationToken ct)
    {
        // Let the model (or a cheap classifier) decide what kind of query this is.
        var queryType = await ClassifyQueryAsync(query, ct);

        // Route to only the retrievers that make sense for this query type.
        var retrievers = queryType switch
        {
            QueryType.Factual => new[] { "dense", "sparse" },
            QueryType.Relational => new[] { "dense", "graph" },
            QueryType.EntityLookup => new[] { "graph", "ner" },
            _ => new[] { "dense", "sparse", "graph" },
        };

        // Run only the selected retrievers, then fuse their ranked lists.
        var lists = await RunRetrieversAsync(
            retrievers, query, ct);
        return _rrf.Fuse(lists);
    }
}
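
ClassifyQueryAsync and RunRetrieversAsync are left as implementation details here; the query taxonomy itself can be as small as an enum. A sketch, with illustrative example queries in the comments (both the enum shape and the examples are assumptions):

// Assumed query taxonomy backing the switch above.
public enum QueryType
{
    Factual,       // "What is the default value of k in RRF?"
    Relational,    // "Which services depend on the billing API?"
    EntityLookup,  // "Show everything we have on customer 4711."
    Unknown,       // Anything else falls through to the full ensemble.
}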

This reduces latency — if a graph traversal isn’t needed, don’t run it.


Production Considerations

  • Timeout individual retrievers: If one model is slow, don’t block the entire pipeline. Keep the Task.WhenAll fan-out, but give each retriever its own linked CancellationTokenSource with a deadline, and treat a timed-out retriever as an empty result list (see the sketch after this list).
  • Deduplication: RRF handles duplicate documents across lists naturally by summing their scores.
  • Score normalization: RRF doesn’t need it — that’s the whole point. Avoid mixing raw similarity scores, which aren’t comparable across models.
  • Caching: Cache retrieval results for identical queries within a short window (30 seconds).
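
A minimal sketch of the per-retriever timeout pattern from the first bullet, assuming each retriever returns a plain document list (the helper and its name are illustrative, not part of the original pipeline):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class RetrieverTimeouts
{
    // Wraps a single retriever call in its own deadline. On timeout the
    // retriever contributes an empty list instead of failing the whole ensemble.
    public static async Task<List<RetrievedDocument>> WithTimeoutAsync(
        Func<CancellationToken, Task<List<RetrievedDocument>>> retrieve,
        TimeSpan timeout,
        CancellationToken outerCt)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outerCt);
        cts.CancelAfter(timeout);
        try
        {
            return await retrieve(cts.Token);
        }
        catch (OperationCanceledException) when (!outerCt.IsCancellationRequested)
        {
            // Only this retriever hit its deadline; the caller's token is still live.
            return new List<RetrievedDocument>();
        }
    }
}

Each call in SearchAsync can then be wrapped with whatever deadline fits your latency budget, and Task.WhenAll still collects whatever came back in time.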

Conclusion

After months of iterating on this pipeline, the lesson that sticks with me is deceptively simple: combining mediocre signals intelligently beats optimizing any single signal in isolation. The RRF ensemble didn’t require better models or more training data — it just required the discipline to run multiple retrieval strategies and trust the rank fusion to surface the right documents. The agentic routing layer was the second big win, not because it improved quality, but because it cut latency by skipping unnecessary retrievers for well-understood query types.

If you’re building a production RAG system and still relying on a single retrieval model, ensembling is the single highest-ROI change you can make. Start with dense + sparse, add RRF, measure the NDCG improvement, and then decide whether the complexity of graph traversal and agentic routing is worth it for your use case.
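
If you want to put a number on that improvement, NDCG@k is simple enough to compute yourself on a labelled query set. A minimal helper, an assumption rather than part of the original pipeline, might look like this:

using System;
using System.Collections.Generic;
using System.Linq;

public static class Ndcg
{
    // relevances: graded relevance of the returned documents, in ranked order.
    // Note: IDCG is computed from the returned list only, a common simplification.
    public static double AtK(IReadOnlyList<double> relevances, int k)
    {
        static double Dcg(IEnumerable<double> rels, int k) =>
            rels.Take(k)
                .Select((rel, i) => (Math.Pow(2, rel) - 1) / Math.Log(i + 2, 2))
                .Sum();

        var ideal = Dcg(relevances.OrderByDescending(r => r), k);
        return ideal == 0 ? 0 : Dcg(relevances, k) / ideal;
    }
}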
