🤖 AI/ML Expert ⏱️ 30 min

Building a RAG Pipeline with Ollama and Qdrant

Implement a complete Retrieval-Augmented Generation pipeline that combines semantic search with local LLM inference for intelligent document Q&A.

By Victor Robin

Introduction

Large Language Models are impressive, but they have a critical limitation: they only know what they were trained on. Ask ChatGPT about your company’s vacation policy, and it will confidently hallucinate an answer. RAG (Retrieval-Augmented Generation) solves this by grounding LLM responses in your actual documents.

Why RAG Architecture Matters:

  • Accuracy: Answers are based on your documents, not model hallucinations
  • Currency: No retraining needed—just update your document corpus
  • Attribution: Cite specific sources for every answer
  • Privacy: Run entirely on-premise with Ollama—your data never leaves your infrastructure

In BlueRobin, RAG powers our document Q&A feature. Users upload contracts, policies, and reports, then ask natural language questions like “What’s the termination clause in the Smith contract?” The system retrieves relevant chunks, feeds them to an LLM, and returns an accurate, cited answer.

Architecture Overview

Retrieval-Augmented Generation (RAG) combines the knowledge stored in your documents with the reasoning capabilities of large language models. This guide builds a complete RAG pipeline using Ollama for local inference and Qdrant for vector retrieval.
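To follow along, Ollama needs to be installed and running locally with the model pulled. With Ollama's standard CLI that looks like this (11434 is Ollama's default port):

```shell
# Download the model used throughout this guide
ollama pull llama3:8b

# Start the Ollama server; it listens on http://localhost:11434 by default
ollama serve
```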

RAG Architecture

flowchart TB
    subgraph Input["💬 User Question"]
        Q["What are our GDPR obligations?"]
    end

    subgraph QueryProcessing["1️⃣ Query Processing"]
        QP1[🔄 Query expansion]
        QP2[🧠 Generate query embedding]
        QP3[🏷️ Extract filters]
    end

    subgraph Retrieval["2️⃣ Retrieval"]
        R1[("🔮 Qdrant<br/>Semantic")]
        R2[("🐘 PostgreSQL<br/>Keyword")]
        R3["⚖️ Reranker<br/>Cross-Encoder"]
        R1 & R2 --> R3
    end

    subgraph Context["3️⃣ Context Building"]
        C1[📊 Select top-k chunks]
        C2[🔗 Deduplicate]
        C3[📝 Format with metadata]
        C1 --> C2 --> C3
    end

    subgraph Generation["4️⃣ Generation"]
        G1["🦙 Ollama (llama3:8b)"]
        G2["📋 System: Answer based on documents..."]
        G3["📑 Context: Retrieved chunks"]
    end

    subgraph Response["5️⃣ Response"]
        RS1[💡 Answer with citations]
        RS2[📈 Confidence score]
        RS3[❓ Follow-up suggestions]
    end

    Q --> QP1
    QP1 --> QP2 --> QP3
    QP3 --> R1 & R2
    R3 --> C1
    C3 --> G1
    G1 --> RS1

    classDef primary fill:#7c3aed,color:#fff
    classDef secondary fill:#06b6d4,color:#fff
    classDef db fill:#f43f5e,color:#fff
    classDef warning fill:#fbbf24,color:#000
    class Input warning
    class QueryProcessing,Context,Generation,Response primary
    class Retrieval db

Core Components

RAG Service Interface

// Application/Interfaces/IRagService.cs
public interface IRagService
{
    Task<RagResponse> AskAsync(
        BlueRobinId userId,
        string question,
        RagOptions? options = null,
        CancellationToken ct = default);
    
    IAsyncEnumerable<RagStreamChunk> AskStreamingAsync(
        BlueRobinId userId,
        string question,
        RagOptions? options = null,
        CancellationToken ct = default);
}

public sealed record RagOptions
{
    public int TopK { get; init; } = 5;
    public float MinRelevanceScore { get; init; } = 0.5f;
    public string Model { get; init; } = "llama3:8b";
    public float Temperature { get; init; } = 0.1f;
    public int MaxTokens { get; init; } = 1024;
    public bool IncludeSources { get; init; } = true;
    public IReadOnlyList<string>? FilterTags { get; init; }
    public IReadOnlyList<BlueRobinId>? FilterDocuments { get; init; }
}

public sealed record RagResponse
{
    public required string Answer { get; init; }
    public required IReadOnlyList<SourceReference> Sources { get; init; }
    public required float Confidence { get; init; }
    public required RagMetrics Metrics { get; init; }
    public IReadOnlyList<string>? SuggestedFollowUps { get; init; }
}

public sealed record SourceReference
{
    public required BlueRobinId DocumentId { get; init; }
    public required string DocumentTitle { get; init; }
    public required string Excerpt { get; init; }
    public required int ChunkIndex { get; init; }
    public required float RelevanceScore { get; init; }
}

public sealed record RagMetrics
{
    public required int ChunksRetrieved { get; init; }
    public required int ChunksUsed { get; init; }
    public required long RetrievalTimeMs { get; init; }
    public required long GenerationTimeMs { get; init; }
    public required int InputTokens { get; init; }
    public required int OutputTokens { get; init; }
}
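As a usage sketch, a caller can tighten retrieval through `RagOptions`. The variable names and the tag value here are illustrative, not part of the pipeline itself:

```csharp
// Hypothetical call site: ragService and userId come from DI / auth context
var response = await ragService.AskAsync(
    userId,
    "What's the termination clause in the Smith contract?",
    new RagOptions
    {
        TopK = 3,                  // fewer, higher-quality chunks
        MinRelevanceScore = 0.7f,  // stricter semantic threshold
        FilterTags = ["contracts"] // illustrative tag
    });

Console.WriteLine(response.Answer);
foreach (var source in response.Sources)
{
    Console.WriteLine($"  [{source.DocumentTitle}] score={source.RelevanceScore:F2}");
}
```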

RAG Service Implementation

// Infrastructure/AI/RagService.cs
public sealed class RagService : IRagService
{
    private readonly IHybridSearchService _search;
    private readonly IOllamaClient _ollama;
    private readonly IDocumentRepository _documents;
    private readonly ILogger<RagService> _logger;
    
    private const string SystemPrompt = """
        You are a helpful assistant that answers questions based on the provided documents.
        
        Guidelines:
        - Only use information from the provided context to answer questions
        - If the context doesn't contain relevant information, say so clearly
        - Cite sources using [Source N] notation when referencing specific information
        - Be concise but thorough
        - If you're uncertain, express that uncertainty
        
        Context documents are provided below, each marked with a source number.
        """;
    
    public RagService(
        IHybridSearchService search,
        IOllamaClient ollama,
        IDocumentRepository documents,
        ILogger<RagService> logger)
    {
        _search = search;
        _ollama = ollama;
        _documents = documents;
        _logger = logger;
    }
    
    public async Task<RagResponse> AskAsync(
        BlueRobinId userId,
        string question,
        RagOptions? options = null,
        CancellationToken ct = default)
    {
        options ??= new RagOptions();
        var sw = Stopwatch.StartNew();
        
        // Step 1: Retrieve relevant chunks
        var retrievalStart = sw.ElapsedMilliseconds;
        var searchOptions = new HybridSearchOptions
        {
            Limit = options.TopK * 2, // Retrieve more than needed for reranking
            SemanticThreshold = options.MinRelevanceScore,
            FilterTags = options.FilterTags
        };
        
        var searchResults = await _search.SearchAsync(
            userId, 
            question, 
            searchOptions, 
            ct);
        
        var retrievalTime = sw.ElapsedMilliseconds - retrievalStart;
        
        if (searchResults.Results.Count == 0)
        {
            return CreateNoContextResponse(sw.ElapsedMilliseconds);
        }
        
        // Step 2: Build context from top chunks
        var chunks = await BuildContextAsync(
            searchResults.Results, 
            options.TopK, 
            ct);
        
        var context = FormatContext(chunks);
        
        // Step 3: Generate answer
        var generationStart = sw.ElapsedMilliseconds;
        var prompt = BuildPrompt(context, question);
        
        var generationResult = await _ollama.GenerateAsync(new OllamaRequest
        {
            Model = options.Model,
            System = SystemPrompt,
            Prompt = prompt,
            Options = new OllamaOptions
            {
                Temperature = options.Temperature,
                NumPredict = options.MaxTokens
            }
        }, ct);
        
        var generationTime = sw.ElapsedMilliseconds - generationStart;
        
        // Step 4: Build response with sources
        var sources = chunks.Select((c, i) => new SourceReference
        {
            DocumentId = c.DocumentId,
            DocumentTitle = c.DocumentTitle,
            Excerpt = TruncateExcerpt(c.Content, 200),
            ChunkIndex = c.ChunkIndex,
            RelevanceScore = c.Score
        }).ToList();
        
        var confidence = CalculateConfidence(chunks, generationResult.Response);
        
        sw.Stop();
        
        _logger.LogInformation(
            "RAG query completed in {TotalMs}ms (retrieval: {RetrievalMs}ms, generation: {GenerationMs}ms)",
            sw.ElapsedMilliseconds,
            retrievalTime,
            generationTime);
        
        return new RagResponse
        {
            Answer = generationResult.Response,
            Sources = sources,
            Confidence = confidence,
            Metrics = new RagMetrics
            {
                ChunksRetrieved = searchResults.Results.Count,
                ChunksUsed = chunks.Count,
                RetrievalTimeMs = retrievalTime,
                GenerationTimeMs = generationTime,
                InputTokens = generationResult.PromptEvalCount,
                OutputTokens = generationResult.EvalCount
            },
            SuggestedFollowUps = GenerateFollowUpSuggestions(question, generationResult.Response)
        };
    }
    
    public async IAsyncEnumerable<RagStreamChunk> AskStreamingAsync(
        BlueRobinId userId,
        string question,
        RagOptions? options = null,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        options ??= new RagOptions();
        
        // Retrieve context (non-streaming)
        var searchResults = await _search.SearchAsync(
            userId, 
            question, 
            new HybridSearchOptions { Limit = options.TopK * 2 }, 
            ct);
        
        var chunks = await BuildContextAsync(searchResults.Results, options.TopK, ct);
        var context = FormatContext(chunks);
        var prompt = BuildPrompt(context, question);
        
        // Yield sources first
        yield return new RagStreamChunk
        {
            Type = RagChunkType.Sources,
            Sources = chunks.Select((c, i) => new SourceReference
            {
                DocumentId = c.DocumentId,
                DocumentTitle = c.DocumentTitle,
                Excerpt = TruncateExcerpt(c.Content, 200),
                ChunkIndex = c.ChunkIndex,
                RelevanceScore = c.Score
            }).ToList()
        };
        
        // Stream the answer
        await foreach (var token in _ollama.GenerateStreamingAsync(new OllamaRequest
        {
            Model = options.Model,
            System = SystemPrompt,
            Prompt = prompt,
            Options = new OllamaOptions
            {
                Temperature = options.Temperature,
                NumPredict = options.MaxTokens
            }
        }, ct))
        {
            yield return new RagStreamChunk
            {
                Type = RagChunkType.Token,
                Token = token
            };
        }
        
        yield return new RagStreamChunk
        {
            Type = RagChunkType.Complete
        };
    }
    
    private async Task<List<RetrievedChunk>> BuildContextAsync(
        IReadOnlyList<HybridSearchResult> results,
        int topK,
        CancellationToken ct)
    {
        var chunks = new List<RetrievedChunk>();
        var seenDocuments = new HashSet<BlueRobinId>();
        
        foreach (var result in results.Take(topK * 2))
        {
            // Deduplicate: hybrid search can surface the same document
            // through both the semantic and keyword paths
            if (!seenDocuments.Add(result.DocumentId)) continue;
            
            // Get document details if not cached
            var document = result.Document ?? 
                await _documents.GetByIdAsync(result.DocumentId, ct);
            
            if (document is null) continue;
            
            // Add matched chunks
            if (result.MatchedChunks is not null)
            {
                foreach (var chunk in result.MatchedChunks)
                {
                    chunks.Add(new RetrievedChunk
                    {
                        DocumentId = result.DocumentId,
                        DocumentTitle = document.Title.Value,
                        Content = chunk.Content,
                        ChunkIndex = chunk.ChunkIndex,
                        Score = result.Score
                    });
                }
            }
            
            if (chunks.Count >= topK) break;
        }
        
        // Sort by score and take top-k
        return chunks
            .OrderByDescending(c => c.Score)
            .Take(topK)
            .ToList();
    }
    
    private static string FormatContext(List<RetrievedChunk> chunks)
    {
        var sb = new StringBuilder();
        
        for (int i = 0; i < chunks.Count; i++)
        {
            var chunk = chunks[i];
            sb.AppendLine($"[Source {i + 1}] Document: {chunk.DocumentTitle}");
            sb.AppendLine(chunk.Content);
            sb.AppendLine();
        }
        
        return sb.ToString();
    }
    
    private static string BuildPrompt(string context, string question)
    {
        return $"""
            Based on the following documents, please answer the question.
            
            DOCUMENTS:
            {context}
            
            QUESTION: {question}
            
            ANSWER:
            """;
    }
    
    private static float CalculateConfidence(
        List<RetrievedChunk> chunks,
        string answer)
    {
        if (chunks.Count == 0) return 0;
        
        // Base confidence on retrieval scores
        var avgRetrievalScore = chunks.Average(c => c.Score);
        
        // Penalize very short answers
        var lengthFactor = Math.Min(1.0f, answer.Length / 100f);
        
        // Penalize if answer contains uncertainty phrases
        var uncertaintyPhrases = new[] 
        { 
            "i don't know", 
            "not sure", 
            "cannot determine",
            "no information",
            "unclear"
        };
        
        var uncertaintyPenalty = uncertaintyPhrases
            .Any(p => answer.Contains(p, StringComparison.OrdinalIgnoreCase))
            ? 0.3f
            : 0;
        
        return Math.Max(0, Math.Min(1, avgRetrievalScore * lengthFactor - uncertaintyPenalty));
    }
    
    private static RagResponse CreateNoContextResponse(long elapsedMs)
    {
        return new RagResponse
        {
            Answer = "I couldn't find any relevant documents to answer your question. " +
                     "Please try rephrasing your question or ensure relevant documents are uploaded.",
            Sources = [],
            Confidence = 0,
            Metrics = new RagMetrics
            {
                ChunksRetrieved = 0,
                ChunksUsed = 0,
                RetrievalTimeMs = elapsedMs,
                GenerationTimeMs = 0,
                InputTokens = 0,
                OutputTokens = 0
            }
        };
    }
    
    private static string TruncateExcerpt(string content, int maxLength)
    {
        if (content.Length <= maxLength) return content;
        return content[..(maxLength - 3)] + "...";
    }
    
    private static IReadOnlyList<string>? GenerateFollowUpSuggestions(
        string question,
        string answer)
    {
        // Simple heuristic - in production, use LLM to generate these
        var suggestions = new List<string>();
        
        if (answer.Contains("GDPR", StringComparison.OrdinalIgnoreCase))
        {
            suggestions.Add("What are the penalties for GDPR non-compliance?");
            suggestions.Add("How do we handle data subject access requests?");
        }
        
        return suggestions.Count > 0 ? suggestions : null;
    }
    
    private sealed record RetrievedChunk
    {
        public BlueRobinId DocumentId { get; init; }
        public required string DocumentTitle { get; init; }
        public required string Content { get; init; }
        public required int ChunkIndex { get; init; }
        public required float Score { get; init; }
    }
}

public sealed record RagStreamChunk
{
    public required RagChunkType Type { get; init; }
    public string? Token { get; init; }
    public IReadOnlyList<SourceReference>? Sources { get; init; }
}

public enum RagChunkType
{
    Sources,
    Token,
    Complete
}

Ollama Client

// Infrastructure/AI/OllamaClient.cs
public sealed class OllamaClient : IOllamaClient
{
    private readonly HttpClient _httpClient;
    private readonly ILogger<OllamaClient> _logger;
    
    public OllamaClient(
        HttpClient httpClient,
        ILogger<OllamaClient> logger)
    {
        _httpClient = httpClient;
        _logger = logger;
    }
    
    public async Task<OllamaResponse> GenerateAsync(
        OllamaRequest request,
        CancellationToken ct = default)
    {
        var response = await _httpClient.PostAsJsonAsync(
            "/api/generate",
            new
            {
                model = request.Model,
                system = request.System,
                prompt = request.Prompt,
                stream = false,
                options = request.Options != null ? new
                {
                    temperature = request.Options.Temperature,
                    num_predict = request.Options.NumPredict,
                    top_k = request.Options.TopK,
                    top_p = request.Options.TopP
                } : (object?)null // cast needed: anonymous type vs. null has no common type
            },
            ct);
        
        response.EnsureSuccessStatusCode();
        
        var result = await response.Content.ReadFromJsonAsync<OllamaApiResponse>(ct);
        
        return new OllamaResponse
        {
            Response = result!.Response,
            PromptEvalCount = result.PromptEvalCount,
            EvalCount = result.EvalCount,
            TotalDuration = result.TotalDuration
        };
    }
    
    public async IAsyncEnumerable<string> GenerateStreamingAsync(
        OllamaRequest request,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var httpRequest = new HttpRequestMessage(HttpMethod.Post, "/api/generate")
        {
            Content = JsonContent.Create(new
            {
                model = request.Model,
                system = request.System,
                prompt = request.Prompt,
                stream = true,
                options = request.Options != null ? new
                {
                    temperature = request.Options.Temperature,
                    num_predict = request.Options.NumPredict
                } : (object?)null // cast needed: anonymous type vs. null has no common type
            })
        };
        
        using var response = await _httpClient.SendAsync(
            httpRequest,
            HttpCompletionOption.ResponseHeadersRead,
            ct);
        
        response.EnsureSuccessStatusCode();
        
        await using var stream = await response.Content.ReadAsStreamAsync(ct);
        using var reader = new StreamReader(stream);
        
        // Avoid reader.EndOfStream here: it can block synchronously on a
        // network stream. Read lines until the stream ends instead.
        while (await reader.ReadLineAsync(ct) is { } line)
        {
            if (line.Length == 0) continue;
            
            var chunk = JsonSerializer.Deserialize<OllamaStreamChunk>(line);
            if (chunk?.Response is not null)
            {
                yield return chunk.Response;
            }
            
            if (chunk?.Done == true) break;
        }
    }
    
    private sealed record OllamaApiResponse
    {
        [JsonPropertyName("response")]
        public string Response { get; init; } = string.Empty;
        
        [JsonPropertyName("prompt_eval_count")]
        public int PromptEvalCount { get; init; }
        
        [JsonPropertyName("eval_count")]
        public int EvalCount { get; init; }
        
        [JsonPropertyName("total_duration")]
        public long TotalDuration { get; init; }
    }
    
    private sealed record OllamaStreamChunk
    {
        [JsonPropertyName("response")]
        public string? Response { get; init; }
        
        [JsonPropertyName("done")]
        public bool Done { get; init; }
    }
}
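For completeness, here is a registration sketch for wiring these services into DI. This assumes a standard Program.cs; the base address is Ollama's default port, and the timeout is an assumption that suits slow CPU-bound generation:

```csharp
// Program.cs — typed HttpClient for the Ollama API, plus the RAG services
builder.Services.AddHttpClient<IOllamaClient, OllamaClient>(client =>
{
    client.BaseAddress = new Uri("http://localhost:11434");
    client.Timeout = TimeSpan.FromMinutes(5); // local generation can be slow
});

builder.Services.AddScoped<IRagService, RagService>();
```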

Blazor Chat Component

@* Components/Chat/RagChat.razor *@
@inject IRagService RagService
@implements IAsyncDisposable

<div class="flex flex-col h-full">
    @* Messages *@
    <div class="flex-1 overflow-y-auto p-4 space-y-4" @ref="_messagesContainer">
        @foreach (var message in _messages)
        {
            <ChatMessage Message="message" />
        }
        
        @if (_isLoading)
        {
            <div class="flex items-center gap-2 text-gray-400">
                <div class="animate-pulse">●</div>
                <span>Thinking...</span>
            </div>
        }
    </div>
    
    @* Input *@
    <div class="border-t border-white/10 p-4">
        <form @onsubmit="HandleSubmit" class="flex gap-2">
            <GlassInput
                @bind-Value="_input"
                Placeholder="Ask a question about your documents..."
                Disabled="_isLoading"
                class="flex-1" />
            
            <GlassButton
                Type="submit"
                Disabled="_isLoading || string.IsNullOrWhiteSpace(_input)">
                <svg class="w-5 h-5" fill="none" stroke="currentColor" viewBox="0 0 24 24">
                    <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" 
                          d="M12 19l9 2-9-18-9 18 9-2zm0 0v-8"/>
                </svg>
            </GlassButton>
        </form>
    </div>
</div>

@code {
    [CascadingParameter] private Task<AuthenticationState>? AuthState { get; set; }
    
    private readonly List<ChatMessageModel> _messages = [];
    private string _input = string.Empty;
    private bool _isLoading;
    private ElementReference _messagesContainer;
    private CancellationTokenSource? _cts;
    
    private async Task HandleSubmit()
    {
        if (string.IsNullOrWhiteSpace(_input) || _isLoading) return;
        
        var question = _input;
        _input = string.Empty;
        _isLoading = true;
        
        // Add user message
        _messages.Add(new ChatMessageModel
        {
            Role = "user",
            Content = question,
            Timestamp = DateTimeOffset.UtcNow
        });
        
        // Add placeholder for assistant
        var assistantMessage = new ChatMessageModel
        {
            Role = "assistant",
            Content = string.Empty,
            Timestamp = DateTimeOffset.UtcNow
        };
        _messages.Add(assistantMessage);
        
        await ScrollToBottom();
        
        try
        {
            var authState = await AuthState!;
            var userId = authState.User.GetBlueRobinId();
            
            _cts = new CancellationTokenSource();
            
            // Stream the response
            await foreach (var chunk in RagService.AskStreamingAsync(
                userId, 
                question, 
                ct: _cts.Token)) // parameter is named 'ct' in IRagService
            {
                switch (chunk.Type)
                {
                    case RagChunkType.Sources:
                        assistantMessage.Sources = chunk.Sources;
                        break;
                    
                    case RagChunkType.Token:
                        assistantMessage.Content += chunk.Token;
                        StateHasChanged();
                        break;
                    
                    case RagChunkType.Complete:
                        break;
                }
            }
        }
        catch (OperationCanceledException)
        {
            assistantMessage.Content += " [Cancelled]";
        }
        catch (Exception ex)
        {
            assistantMessage.Content = $"Error: {ex.Message}";
            assistantMessage.IsError = true;
        }
        finally
        {
            _isLoading = false;
            _cts?.Dispose();
            _cts = null;
            await ScrollToBottom();
        }
    }
    
    private async Task ScrollToBottom()
    {
        await Task.Yield();
        // JS interop to scroll would go here
    }
    
    public ValueTask DisposeAsync()
    {
        _cts?.Cancel();
        _cts?.Dispose();
        return ValueTask.CompletedTask;
    }
}

public sealed class ChatMessageModel
{
    public required string Role { get; init; }
    public string Content { get; set; } = string.Empty;
    public required DateTimeOffset Timestamp { get; init; }
    public IReadOnlyList<SourceReference>? Sources { get; set; }
    public bool IsError { get; set; }
}

Summary

A production RAG pipeline requires:

| Component | Purpose |
|-----------|---------|
| Hybrid Retrieval | Find relevant chunks |
| Context Building | Format for LLM |
| Prompt Engineering | Guide LLM behavior |
| Streaming | Better UX |
| Source Attribution | Transparency |
| Confidence Scoring | Reliability indicator |

This foundation can be extended with reranking, query expansion, and conversation memory for even more sophisticated applications.
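As one example of such an extension, query expansion can be prototyped with the same `IOllamaClient`: ask the model for paraphrases of the question, then search with each variant and merge the results before reranking. A hypothetical sketch (the prompt wording and helper are illustrative, not part of the pipeline above):

```csharp
// Hypothetical helper — generate query variants for broader recall
public static async Task<IReadOnlyList<string>> ExpandQueryAsync(
    IOllamaClient ollama,
    string question,
    CancellationToken ct = default)
{
    var result = await ollama.GenerateAsync(new OllamaRequest
    {
        Model = "llama3:8b",
        Prompt = $"""
            Rewrite the following question in three different ways,
            one per line, without answering it:

            {question}
            """
    }, ct);

    // Keep the original question plus up to three variants
    return new[] { question }
        .Concat(result.Response
            .Split('\n', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
            .Take(3))
        .ToList();
}
```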
