Building a RAG Pipeline with Ollama and Qdrant
Implement a complete Retrieval-Augmented Generation pipeline that combines semantic search with local LLM inference for intelligent document Q&A.
Introduction
When I first watched the full RAG pipeline work end-to-end, I got the kind of satisfaction you normally only feel after shipping a major feature. A colleague typed “What are our obligations under the data processing agreement?” into the chat UI, and within three seconds the system had embedded the query, found the four most relevant chunks from over 500 indexed documents, and the LLM synthesized a coherent two-paragraph answer with inline source citations pointing back to the exact contract clauses. It felt like we had built a private search engine — one that actually understood questions and could compose answers instead of just returning a list of links. That moment justified every late night spent tuning chunk sizes, wrestling with prompt templates, and debugging why the model kept hallucinating contract terms that did not exist.
Large Language Models are impressive, but they have a critical limitation: they only know what they were trained on. Ask ChatGPT about your company’s vacation policy, and it will confidently hallucinate an answer. RAG (Retrieval-Augmented Generation) solves this by grounding LLM responses in your actual documents.
Why RAG Architecture Matters:
- Accuracy: Answers are based on your documents, not model hallucinations
- Currency: No retraining needed—just update your document corpus
- Attribution: Cite specific sources for every answer
- Privacy: Run entirely on-premise with Ollama—your data never leaves your infrastructure
RAG powers our document Q&A feature. Users upload contracts, policies, and reports, then ask natural language questions like “What’s the termination clause in the Smith contract?” The system retrieves relevant chunks, feeds them to an LLM, and returns an accurate, cited answer.
Architecture Overview
Retrieval-Augmented Generation (RAG) combines the knowledge stored in your documents with the reasoning capabilities of large language models. This guide builds a complete RAG pipeline using Ollama for local inference and Qdrant for vector retrieval.
[Semantic Kernel: RAG Patterns and Best Practices] — Microsoft, 2024-06-20
RAG Architecture
flowchart TB
subgraph Input["❓ User Question"]
Q["What are our GDPR obligations?"]
end
subgraph QueryProcessing["1️⃣ Query Processing"]
QP1[🔄 Query expansion]
QP2[🧠 Generate query embedding]
QP3[🏷️ Extract filters]
end
subgraph Retrieval["2️⃣ Retrieval"]
R1[("🔮 Qdrant<br/>Semantic")]
R2[("🐘 PostgreSQL<br/>Keyword")]
R3["⚖️ Reranker<br/>Cross-Encoder"]
R1 & R2 --> R3
end
subgraph Context["3️⃣ Context Building"]
C1[📊 Select top-k chunks]
C2[🔗 Deduplicate]
C3[📝 Format with metadata]
C1 --> C2 --> C3
end
subgraph Generation["4️⃣ Generation"]
G1["🦙 Ollama (llama3:8b)"]
G2["📋 System: Answer based on documents..."]
G3["📑 Context: Retrieved chunks"]
end
subgraph Response["5️⃣ Response"]
RS1[💡 Answer with citations]
RS2[📈 Confidence score]
RS3[❓ Follow-up suggestions]
end
Q --> QP1
QP1 --> QP2 --> QP3
QP3 --> R1 & R2
R3 --> C1
C3 --> G1
G1 --> RS1
classDef primary fill:#7c3aed,color:#fff
classDef secondary fill:#06b6d4,color:#fff
classDef db fill:#f43f5e,color:#fff
classDef warning fill:#fbbf24,color:#000
class Input warning
class QueryProcessing,Context,Generation,Response primary
class Retrieval db
Core Components
RAG Service Interface
// Application/Interfaces/IRagService.cs
public interface IRagService
{
Task<RagResponse> AskAsync(
CustomId userId,
string question,
RagOptions? options = null,
CancellationToken ct = default);
IAsyncEnumerable<RagStreamChunk> AskStreamingAsync(
CustomId userId,
string question,
RagOptions? options = null,
CancellationToken ct = default);
}
public sealed record RagOptions
{
public int TopK { get; init; } = 5;
public float MinRelevanceScore { get; init; } = 0.5f;
public string Model { get; init; } = "llama3:8b";
public float Temperature { get; init; } = 0.1f;
public int MaxTokens { get; init; } = 1024;
public bool IncludeSources { get; init; } = true;
public IReadOnlyList<string>? FilterTags { get; init; }
public IReadOnlyList<CustomId>? FilterDocuments { get; init; }
}
public sealed record RagResponse
{
public required string Answer { get; init; }
public required IReadOnlyList<SourceReference> Sources { get; init; }
public required float Confidence { get; init; }
public required RagMetrics Metrics { get; init; }
public IReadOnlyList<string>? SuggestedFollowUps { get; init; }
}
public sealed record SourceReference
{
public required CustomId DocumentId { get; init; }
public required string DocumentTitle { get; init; }
public required string Excerpt { get; init; }
public required int ChunkIndex { get; init; }
public required float RelevanceScore { get; init; }
}
public sealed record RagMetrics
{
public required int ChunksRetrieved { get; init; }
public required int ChunksUsed { get; init; }
public required long RetrievalTimeMs { get; init; }
public required long GenerationTimeMs { get; init; }
public required int InputTokens { get; init; }
public required int OutputTokens { get; init; }
}
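To ground the contract above, here is a hypothetical call site; the `ragService` and `userId` variables and the `"contracts"` tag are illustrative, not from the codebase:

```csharp
// Hypothetical call site for IRagService — names and tag values are illustrative.
var response = await ragService.AskAsync(
    userId,
    "What's the termination clause in the Smith contract?",
    new RagOptions { TopK = 4, FilterTags = ["contracts"] });

Console.WriteLine(response.Answer);
foreach (var source in response.Sources)
{
    // Each source carries enough metadata to link back to the original chunk
    Console.WriteLine(
        $"[{source.DocumentTitle}] chunk {source.ChunkIndex}, score {source.RelevanceScore:F2}");
}
```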
RAG Service Implementation
[Qdrant Search API: Similarity Search with Filtering] — Qdrant, 2024-08-10
// Infrastructure/AI/RagService.cs
public sealed class RagService : IRagService
{
private readonly IHybridSearchService _search;
private readonly IOllamaClient _ollama;
private readonly IDocumentRepository _documents;
private readonly ILogger<RagService> _logger;
private const string SystemPrompt = """
You are a helpful assistant that answers questions based on the provided documents.
Guidelines:
- Only use information from the provided context to answer questions
- If the context doesn't contain relevant information, say so clearly
- Cite sources using [Source N] notation when referencing specific information
- Be concise but thorough
- If you're uncertain, express that uncertainty
Context documents are provided below, each marked with a source number.
""";
[Prompt Engineering Guide] — OpenAI, 2024-01-10
public RagService(
IHybridSearchService search,
IOllamaClient ollama,
IDocumentRepository documents,
ILogger<RagService> logger)
{
_search = search;
_ollama = ollama;
_documents = documents;
_logger = logger;
}
public async Task<RagResponse> AskAsync(
CustomId userId,
string question,
RagOptions? options = null,
CancellationToken ct = default)
{
options ??= new RagOptions();
var sw = Stopwatch.StartNew();
// Step 1: Retrieve relevant chunks
var retrievalStart = sw.ElapsedMilliseconds;
var searchOptions = new HybridSearchOptions
{
Limit = options.TopK * 2, // Retrieve more than needed for reranking
SemanticThreshold = options.MinRelevanceScore,
FilterTags = options.FilterTags
};
var searchResults = await _search.SearchAsync(
userId,
question,
searchOptions,
ct);
var retrievalTime = sw.ElapsedMilliseconds - retrievalStart;
if (searchResults.Results.Count == 0)
{
return CreateNoContextResponse(sw.ElapsedMilliseconds);
}
// Step 2: Build context from top chunks
var chunks = await BuildContextAsync(
searchResults.Results,
options.TopK,
ct);
var context = FormatContext(chunks);
// Step 3: Generate answer
var generationStart = sw.ElapsedMilliseconds;
var prompt = BuildPrompt(context, question);
var generationResult = await _ollama.GenerateAsync(new OllamaRequest
{
Model = options.Model,
System = SystemPrompt,
Prompt = prompt,
Options = new OllamaOptions
{
Temperature = options.Temperature,
NumPredict = options.MaxTokens
}
}, ct);
var generationTime = sw.ElapsedMilliseconds - generationStart;
// Step 4: Build response with sources
var sources = chunks.Select(c => new SourceReference
{
DocumentId = c.DocumentId,
DocumentTitle = c.DocumentTitle,
Excerpt = TruncateExcerpt(c.Content, 200),
ChunkIndex = c.ChunkIndex,
RelevanceScore = c.Score
}).ToList();
var confidence = CalculateConfidence(chunks, generationResult.Response);
sw.Stop();
_logger.LogInformation(
"RAG query completed in {TotalMs}ms (retrieval: {RetrievalMs}ms, generation: {GenerationMs}ms)",
sw.ElapsedMilliseconds,
retrievalTime,
generationTime);
return new RagResponse
{
Answer = generationResult.Response,
Sources = sources,
Confidence = confidence,
Metrics = new RagMetrics
{
ChunksRetrieved = searchResults.Results.Count,
ChunksUsed = chunks.Count,
RetrievalTimeMs = retrievalTime,
GenerationTimeMs = generationTime,
InputTokens = generationResult.PromptEvalCount,
OutputTokens = generationResult.EvalCount
},
SuggestedFollowUps = GenerateFollowUpSuggestions(question, generationResult.Response)
};
}
[Lost in the Middle: How Language Models Use Long Contexts] — Liu et al., 2023-07-06
public async IAsyncEnumerable<RagStreamChunk> AskStreamingAsync(
CustomId userId,
string question,
RagOptions? options = null,
[EnumeratorCancellation] CancellationToken ct = default)
{
options ??= new RagOptions();
// Retrieve context (non-streaming)
var searchResults = await _search.SearchAsync(
userId,
question,
new HybridSearchOptions { Limit = options.TopK * 2 },
ct);
var chunks = await BuildContextAsync(searchResults.Results, options.TopK, ct);
var context = FormatContext(chunks);
var prompt = BuildPrompt(context, question);
// Yield sources first
yield return new RagStreamChunk
{
Type = RagChunkType.Sources,
Sources = chunks.Select(c => new SourceReference
{
DocumentId = c.DocumentId,
DocumentTitle = c.DocumentTitle,
Excerpt = TruncateExcerpt(c.Content, 200),
ChunkIndex = c.ChunkIndex,
RelevanceScore = c.Score
}).ToList()
};
// Stream the answer
await foreach (var token in _ollama.GenerateStreamingAsync(new OllamaRequest
{
Model = options.Model,
System = SystemPrompt,
Prompt = prompt,
Options = new OllamaOptions
{
Temperature = options.Temperature,
NumPredict = options.MaxTokens
}
}, ct))
{
yield return new RagStreamChunk
{
Type = RagChunkType.Token,
Token = token
};
}
yield return new RagStreamChunk
{
Type = RagChunkType.Complete
};
}
private async Task<List<RetrievedChunk>> BuildContextAsync(
IReadOnlyList<HybridSearchResult> results,
int topK,
CancellationToken ct)
{
var chunks = new List<RetrievedChunk>();
var seenDocuments = new HashSet<CustomId>();
foreach (var result in results.Take(topK * 2))
{
// Deduplicate: skip results from documents we've already taken chunks from
if (!seenDocuments.Add(result.DocumentId)) continue;
// Get document details if not cached
var document = result.Document ??
await _documents.GetByIdAsync(result.DocumentId, ct);
if (document is null) continue;
// Add matched chunks
if (result.MatchedChunks is not null)
{
foreach (var chunk in result.MatchedChunks)
{
chunks.Add(new RetrievedChunk
{
DocumentId = result.DocumentId,
DocumentTitle = document.Title.Value,
Content = chunk.Content,
ChunkIndex = chunk.ChunkIndex,
Score = result.Score
});
}
}
if (chunks.Count >= topK) break;
}
// Sort by score and take top-k
return chunks
.OrderByDescending(c => c.Score)
.Take(topK)
.ToList();
}
private static string FormatContext(List<RetrievedChunk> chunks)
{
var sb = new StringBuilder();
for (int i = 0; i < chunks.Count; i++)
{
var chunk = chunks[i];
sb.AppendLine($"[Source {i + 1}] Document: {chunk.DocumentTitle}");
sb.AppendLine(chunk.Content);
sb.AppendLine();
}
return sb.ToString();
}
private static string BuildPrompt(string context, string question)
{
return $"""
Based on the following documents, please answer the question.
DOCUMENTS:
{context}
QUESTION: {question}
ANSWER:
""";
}
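To make the prompt shape concrete, here is roughly what `FormatContext` plus `BuildPrompt` produce for a single retrieved chunk; the document title and excerpt are invented for illustration:

```csharp
// Illustration only: the assembled prompt for one retrieved chunk.
// Document title and content are invented for this example.
var context = """
    [Source 1] Document: Data Processing Agreement
    The processor shall implement appropriate technical and organisational
    measures to ensure a level of security appropriate to the risk...

    """;
var prompt = $"""
    Based on the following documents, please answer the question.
    DOCUMENTS:
    {context}
    QUESTION: What are our GDPR obligations?
    ANSWER:
    """;
Console.WriteLine(prompt);
```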
private static float CalculateConfidence(
List<RetrievedChunk> chunks,
string answer)
{
if (chunks.Count == 0) return 0;
// Base confidence on retrieval scores
var avgRetrievalScore = chunks.Average(c => c.Score);
// Penalize very short answers
var lengthFactor = Math.Min(1.0f, answer.Length / 100f);
// Penalize if answer contains uncertainty phrases
var uncertaintyPhrases = new[]
{
"i don't know",
"not sure",
"cannot determine",
"no information",
"unclear"
};
var uncertaintyPenalty = uncertaintyPhrases
.Any(p => answer.Contains(p, StringComparison.OrdinalIgnoreCase))
? 0.3f
: 0;
return Math.Max(0, Math.Min(1, avgRetrievalScore * lengthFactor - uncertaintyPenalty));
}
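A quick worked example of the heuristic, with assumed input values:

```csharp
// Standalone walk-through of the confidence heuristic with assumed inputs.
float avgRetrievalScore = 0.82f;                  // mean score of the chunks used
float lengthFactor = Math.Min(1.0f, 250 / 100f);  // 250-char answer → capped at 1.0
float uncertaintyPenalty = 0f;                    // no hedging phrases in the answer
float confidence = Math.Max(0, Math.Min(1,
    avgRetrievalScore * lengthFactor - uncertaintyPenalty));
// confidence = 0.82; a "no information" answer would drop it to 0.52
```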
private static RagResponse CreateNoContextResponse(long elapsedMs)
{
return new RagResponse
{
Answer = "I couldn't find any relevant documents to answer your question. " +
"Please try rephrasing your question or ensure relevant documents are uploaded.",
Sources = [],
Confidence = 0,
Metrics = new RagMetrics
{
ChunksRetrieved = 0,
ChunksUsed = 0,
RetrievalTimeMs = elapsedMs,
GenerationTimeMs = 0,
InputTokens = 0,
OutputTokens = 0
}
};
}
private static string TruncateExcerpt(string content, int maxLength)
{
if (content.Length <= maxLength) return content;
return content[..(maxLength - 3)] + "...";
}
private static IReadOnlyList<string>? GenerateFollowUpSuggestions(
string question,
string answer)
{
// Simple heuristic - in production, use LLM to generate these
var suggestions = new List<string>();
if (answer.Contains("GDPR", StringComparison.OrdinalIgnoreCase))
{
suggestions.Add("What are the penalties for GDPR non-compliance?");
suggestions.Add("How do we handle data subject access requests?");
}
return suggestions.Count > 0 ? suggestions : null;
}
private sealed record RetrievedChunk
{
public required CustomId DocumentId { get; init; }
public required string DocumentTitle { get; init; }
public required string Content { get; init; }
public required int ChunkIndex { get; init; }
public required float Score { get; init; }
}
}
public sealed record RagStreamChunk
{
public required RagChunkType Type { get; init; }
public string? Token { get; init; }
public IReadOnlyList<SourceReference>? Sources { get; init; }
}
public enum RagChunkType
{
Sources,
Token,
Complete
}
Ollama Client
[Ollama API Documentation: Generate] — Ollama, 2024-09-15
// Infrastructure/AI/OllamaClient.cs
public sealed class OllamaClient : IOllamaClient
{
private readonly HttpClient _httpClient;
private readonly ILogger<OllamaClient> _logger;
public OllamaClient(
HttpClient httpClient,
ILogger<OllamaClient> logger)
{
_httpClient = httpClient;
_logger = logger;
}
public async Task<OllamaResponse> GenerateAsync(
OllamaRequest request,
CancellationToken ct = default)
{
var response = await _httpClient.PostAsJsonAsync(
"/api/generate",
new
{
model = request.Model,
system = request.System,
prompt = request.Prompt,
stream = false,
options = request.Options != null ? new
{
temperature = request.Options.Temperature,
num_predict = request.Options.NumPredict,
top_k = request.Options.TopK,
top_p = request.Options.TopP
} : null
},
ct);
response.EnsureSuccessStatusCode();
var result = await response.Content.ReadFromJsonAsync<OllamaApiResponse>(ct);
return new OllamaResponse
{
Response = result!.Response,
PromptEvalCount = result.PromptEvalCount,
EvalCount = result.EvalCount,
TotalDuration = result.TotalDuration
};
}
public async IAsyncEnumerable<string> GenerateStreamingAsync(
OllamaRequest request,
[EnumeratorCancellation] CancellationToken ct = default)
{
var httpRequest = new HttpRequestMessage(HttpMethod.Post, "/api/generate")
{
Content = JsonContent.Create(new
{
model = request.Model,
system = request.System,
prompt = request.Prompt,
stream = true,
options = request.Options != null ? new
{
temperature = request.Options.Temperature,
num_predict = request.Options.NumPredict
} : null
})
};
using var response = await _httpClient.SendAsync(
httpRequest,
HttpCompletionOption.ResponseHeadersRead,
ct);
response.EnsureSuccessStatusCode();
await using var stream = await response.Content.ReadAsStreamAsync(ct);
using var reader = new StreamReader(stream);
while (!reader.EndOfStream)
{
var line = await reader.ReadLineAsync(ct);
if (string.IsNullOrEmpty(line)) continue;
var chunk = JsonSerializer.Deserialize<OllamaStreamChunk>(line);
if (chunk?.Response is not null)
{
yield return chunk.Response;
}
if (chunk?.Done == true) break;
}
}
private sealed record OllamaApiResponse
{
[JsonPropertyName("response")]
public string Response { get; init; } = string.Empty;
[JsonPropertyName("prompt_eval_count")]
public int PromptEvalCount { get; init; }
[JsonPropertyName("eval_count")]
public int EvalCount { get; init; }
[JsonPropertyName("total_duration")]
public long TotalDuration { get; init; }
}
private sealed record OllamaStreamChunk
{
[JsonPropertyName("response")]
public string? Response { get; init; }
[JsonPropertyName("done")]
public bool Done { get; init; }
}
}
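One piece the snippets above assume is the `HttpClient` wiring. A minimal registration sketch: the base address is Ollama's default port (11434); the five-minute timeout is a suggestion for slow local generation, not a value from the article's codebase:

```csharp
// Program.cs (sketch) — DI wiring assumed by the services above.
builder.Services.AddHttpClient<IOllamaClient, OllamaClient>(client =>
{
    // Ollama serves its HTTP API on http://localhost:11434 by default
    client.BaseAddress = new Uri("http://localhost:11434");
    // Local generation on CPU can take a while; the 100s default may be too short
    client.Timeout = TimeSpan.FromMinutes(5);
});
builder.Services.AddScoped<IRagService, RagService>();
```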
Blazor Chat Component
@* Components/Chat/RagChat.razor *@
@inject IRagService RagService
@implements IAsyncDisposable
<div class="flex flex-col h-full">
@* Messages *@
<div class="flex-1 overflow-y-auto p-4 space-y-4" @ref="_messagesContainer">
@foreach (var message in _messages)
{
<ChatMessage Message="message" />
}
@if (_isLoading)
{
<div class="flex items-center gap-2 text-gray-400">
<div class="animate-pulse">●</div>
<span>Thinking...</span>
</div>
}
</div>
@* Input *@
<div class="border-t border-white/10 p-4">
<form @onsubmit="HandleSubmit" class="flex gap-2">
<GlassInput
@bind-Value="_input"
Placeholder="Ask a question about your documents..."
Disabled="_isLoading"
class="flex-1" />
<GlassButton
Type="submit"
Disabled="_isLoading || string.IsNullOrWhiteSpace(_input)">
<svg class="w-5 h-5" fill="none" stroke="currentColor" viewBox="0 0 24 24">
<path stroke-linecap="round" stroke-linejoin="round" stroke-width="2"
d="M12 19l9 2-9-18-9 18 9-2zm0 0v-8"/>
</svg>
</GlassButton>
</form>
</div>
</div>
@code {
[CascadingParameter] private Task<AuthenticationState>? AuthState { get; set; }
private readonly List<ChatMessageModel> _messages = [];
private string _input = string.Empty;
private bool _isLoading;
private ElementReference _messagesContainer;
private CancellationTokenSource? _cts;
private async Task HandleSubmit()
{
if (string.IsNullOrWhiteSpace(_input) || _isLoading) return;
var question = _input;
_input = string.Empty;
_isLoading = true;
// Add user message
_messages.Add(new ChatMessageModel
{
Role = "user",
Content = question,
Timestamp = DateTimeOffset.UtcNow
});
// Add placeholder for assistant
var assistantMessage = new ChatMessageModel
{
Role = "assistant",
Content = string.Empty,
Timestamp = DateTimeOffset.UtcNow
};
_messages.Add(assistantMessage);
await ScrollToBottom();
try
{
var authState = await AuthState!;
var userId = authState.User.GetCustomId();
_cts = new CancellationTokenSource();
// Stream the response
await foreach (var chunk in RagService.AskStreamingAsync(
userId,
question,
ct: _cts.Token))
{
switch (chunk.Type)
{
case RagChunkType.Sources:
assistantMessage.Sources = chunk.Sources;
break;
case RagChunkType.Token:
assistantMessage.Content += chunk.Token;
StateHasChanged();
break;
case RagChunkType.Complete:
break;
}
}
}
catch (OperationCanceledException)
{
assistantMessage.Content += " [Cancelled]";
}
catch (Exception ex)
{
assistantMessage.Content = $"Error: {ex.Message}";
assistantMessage.IsError = true;
}
finally
{
_isLoading = false;
_cts?.Dispose();
_cts = null;
await ScrollToBottom();
}
}
private async Task ScrollToBottom()
{
await Task.Yield();
// JS interop to scroll would go here
}
public ValueTask DisposeAsync()
{
_cts?.Cancel();
_cts?.Dispose();
return ValueTask.CompletedTask;
}
}
public sealed class ChatMessageModel
{
public required string Role { get; init; }
public string Content { get; set; } = string.Empty;
public required DateTimeOffset Timestamp { get; init; }
public IReadOnlyList<SourceReference>? Sources { get; set; }
public bool IsError { get; set; }
}
Conclusion
A production RAG pipeline requires:
| Component | Purpose |
|---|---|
| Hybrid Retrieval | Find relevant chunks |
| Context Building | Format for LLM |
| Prompt Engineering | Guide LLM behavior |
| Streaming | Better UX |
| Source Attribution | Transparency |
| Confidence Scoring | Reliability indicator |
This foundation can be extended with reranking, query expansion, and conversation memory for even more sophisticated applications.
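As a sketch of one such extension point, a reranker could slot in between retrieval and context building; this interface is hypothetical, not part of the codebase above:

```csharp
// Hypothetical extension point: rerank retrieved candidates with a
// cross-encoder before the top-k chunks are selected for the context.
public interface IReranker
{
    Task<IReadOnlyList<HybridSearchResult>> RerankAsync(
        string query,
        IReadOnlyList<HybridSearchResult> candidates,
        CancellationToken ct = default);
}
```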
Personal Reflection
If I had to distill everything I learned building this pipeline into one sentence, it would be this: the retrieval step matters more than the generation step. I spent weeks trying to improve answer quality by tweaking prompts, switching models, and adjusting temperature — and those things helped, but the biggest quality jumps always came from improving what context the LLM received in the first place. Better chunking, better reranking, and better filtering made a larger difference than any prompt engineering trick. The other lesson that surprised me was how well a local 8B parameter model performed when given good context. Our initial assumption was that we would need a 70B model or a commercial API for production-quality answers, but with well-retrieved, well-ordered chunks and a carefully tuned system prompt, llama3:8b through Ollama produced answers that our internal testers rated as “good” or “excellent” on 87% of test queries. Running locally also meant zero API costs and complete data privacy — a requirement for our enterprise clients.
Next Steps
- Building a Semantic Embedding Pipeline with Ollama and Qdrant — The companion article covering the embedding and indexing side of the pipeline in detail.
- Conversation Memory — Add multi-turn conversation support so users can ask follow-up questions that reference previous answers without restating full context.
- Evaluation Framework — Build an automated evaluation pipeline using labeled question-answer pairs to measure retrieval precision, answer accuracy, and hallucination rates across model and prompt changes.