Building a RAG Pipeline with Ollama and Qdrant
Implement a complete Retrieval-Augmented Generation pipeline that combines semantic search with local LLM inference for intelligent document Q&A.
Introduction
When I first watched the full RAG pipeline work end-to-end, I got the kind of satisfaction you normally only feel after shipping a major feature. A colleague typed “What are our obligations under the data processing agreement?” into the chat UI, and within three seconds the system had embedded the query, found the four most relevant chunks from over 500 indexed documents, and the LLM synthesized a coherent two-paragraph answer with inline source citations pointing back to the exact contract clauses. It felt like we had built a private search engine — one that actually understood questions and could compose answers instead of just returning a list of links. That moment justified every late night spent tuning chunk sizes, wrestling with prompt templates, and debugging why the model kept hallucinating contract terms that did not exist.
Large Language Models are impressive, but they have a critical limitation: they only know what they were trained on. Ask ChatGPT about your company’s vacation policy, and it will confidently hallucinate an answer. RAG (Retrieval-Augmented Generation) solves this by grounding LLM responses in your actual documents.
Why RAG Architecture Matters:
- Accuracy: Answers are based on your documents, not model hallucinations
- Currency: No retraining needed—just update your document corpus
- Attribution: Cite specific sources for every answer
- Privacy: Run entirely on-premise with Ollama—your data never leaves your infrastructure
RAG powers our document Q&A feature. Users upload contracts, policies, and reports, then ask natural language questions like “What’s the termination clause in the Smith contract?” The system retrieves relevant chunks, feeds them to an LLM, and returns an accurate, cited answer.
Architecture Overview
Retrieval-Augmented Generation (RAG) combines the knowledge stored in your documents with the reasoning capabilities of large language models. This guide builds a complete RAG pipeline using Ollama for local inference and Qdrant for vector retrieval.
[Semantic Kernel: RAG Patterns and Best Practices] — Microsoft, 2024-06-20
RAG Architecture
flowchart TB
subgraph Input["❓ User Question"]
Q["What are our GDPR obligations?"]
end
subgraph QueryProcessing["1️⃣ Query Processing"]
QP1[🔄 Query expansion]
QP2[🧠 Generate query embedding]
QP3[🏷️ Extract filters]
end
subgraph Retrieval["2️⃣ Retrieval"]
R1[("🔮 Qdrant<br/>Semantic")]
R2[("🐘 PostgreSQL<br/>Keyword")]
R3["⚖️ Reranker<br/>Cross-Encoder"]
R1 & R2 --> R3
end
subgraph Context["3️⃣ Context Building"]
C1[📊 Select top-k chunks]
C2[🔗 Deduplicate]
C3[📝 Format with metadata]
C1 --> C2 --> C3
end
subgraph Generation["4️⃣ Generation"]
G1["🦙 Ollama (llama3:8b)"]
G2["📋 System: Answer based on documents..."]
G3["📑 Context: Retrieved chunks"]
end
subgraph Response["5️⃣ Response"]
RS1[💡 Answer with citations]
RS2[📈 Confidence score]
RS3[❓ Follow-up suggestions]
end
Q --> QP1
QP1 --> QP2 --> QP3
QP3 --> R1 & R2
R3 --> C1
C3 --> G1
G1 --> RS1
classDef primary fill:#7c3aed,color:#fff
classDef secondary fill:#06b6d4,color:#fff
classDef db fill:#f43f5e,color:#fff
classDef warning fill:#fbbf24,color:#000
class Input warning
class QueryProcessing,Context,Generation,Response primary
class Retrieval db
Core Components
RAG Service Interface
// Application/Interfaces/IRagService.cs
public interface IRagService
{
Task<RagResponse> AskAsync(
CustomId userId,
string question,
RagOptions? options = null,
CancellationToken ct = default);
IAsyncEnumerable<RagStreamChunk> AskStreamingAsync(
CustomId userId,
string question,
RagOptions? options = null,
CancellationToken ct = default);
}
public sealed record RagOptions
{
public int TopK { get; init; } = 5;
public float MinRelevanceScore { get; init; } = 0.5f;
public string Model { get; init; } = "llama3:8b";
public float Temperature { get; init; } = 0.1f;
public int MaxTokens { get; init; } = 1024;
public bool IncludeSources { get; init; } = true;
public IReadOnlyList<string>? FilterTags { get; init; }
public IReadOnlyList<CustomId>? FilterDocuments { get; init; }
}
public sealed record RagResponse
{
public required string Answer { get; init; }
public required IReadOnlyList<SourceReference> Sources { get; init; }
public required float Confidence { get; init; }
public required RagMetrics Metrics { get; init; }
public IReadOnlyList<string>? SuggestedFollowUps { get; init; }
}
public sealed record SourceReference
{
public required CustomId DocumentId { get; init; }
public required string DocumentTitle { get; init; }
public required string Excerpt { get; init; }
public required int ChunkIndex { get; init; }
public required float RelevanceScore { get; init; }
}
public sealed record RagMetrics
{
public required int ChunksRetrieved { get; init; }
public required int ChunksUsed { get; init; }
public required long RetrievalTimeMs { get; init; }
public required long GenerationTimeMs { get; init; }
public required int InputTokens { get; init; }
public required int OutputTokens { get; init; }
}
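To ground the contract above, here is a hypothetical call site; the `ragService` and `userId` variables and the `"contracts"` tag are illustrative, not from the codebase:

```csharp
// Hypothetical call site for IRagService — names and tag values are illustrative.
var response = await ragService.AskAsync(
    userId,
    "What's the termination clause in the Smith contract?",
    new RagOptions { TopK = 4, FilterTags = ["contracts"] });

Console.WriteLine(response.Answer);
foreach (var source in response.Sources)
{
    // Each source carries enough metadata to link back to the original chunk
    Console.WriteLine(
        $"[{source.DocumentTitle}] chunk {source.ChunkIndex}, score {source.RelevanceScore:F2}");
}
```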
RAG Service Implementation
[Qdrant Search API: Similarity Search with Filtering] — Qdrant, 2024-08-10
// Infrastructure/AI/RagService.cs
public sealed class RagService : IRagService
{
private readonly IHybridSearchService _search;
private readonly IOllamaClient _ollama;
private readonly IDocumentRepository _documents;
private readonly ILogger<RagService> _logger;
private const string SystemPrompt = """
You are a helpful assistant that answers questions based on the provided documents.
Guidelines:
- Only use information from the provided context to answer questions
- If the context doesn't contain relevant information, say so clearly
- Cite sources using [Source N] notation when referencing specific information
- Be concise but thorough
- If you're uncertain, express that uncertainty
Context documents are provided below, each marked with a source number.
""";
[Prompt Engineering Guide] — OpenAI, 2024-01-10
public RagService(
IHybridSearchService search,
IOllamaClient ollama,
IDocumentRepository documents,
ILogger<RagService> logger)
{
_search = search;
_ollama = ollama;
_documents = documents;
_logger = logger;
}
public async Task<RagResponse> AskAsync(
CustomId userId,
string question,
RagOptions? options = null,
CancellationToken ct = default)
{
options ??= new RagOptions();
var sw = Stopwatch.StartNew();
// Step 1: Retrieve relevant chunks
var retrievalStart = sw.ElapsedMilliseconds;
var searchOptions = new HybridSearchOptions
{
Limit = options.TopK * 2, // Retrieve more than needed for reranking
SemanticThreshold = options.MinRelevanceScore,
FilterTags = options.FilterTags
};
var searchResults = await _search.SearchAsync(
userId,
question,
searchOptions,
ct);
var retrievalTime = sw.ElapsedMilliseconds - retrievalStart;
if (searchResults.Results.Count == 0)
{
return CreateNoContextResponse(sw.ElapsedMilliseconds);
}
// Step 2: Build context from top chunks
var chunks = await BuildContextAsync(
searchResults.Results,
options.TopK,
ct);
var context = FormatContext(chunks);
// Step 3: Generate answer
var generationStart = sw.ElapsedMilliseconds;
var prompt = BuildPrompt(context, question);
var generationResult = await _ollama.GenerateAsync(new OllamaRequest
{
Model = options.Model,
System = SystemPrompt,
Prompt = prompt,
Options = new OllamaOptions
{
Temperature = options.Temperature,
NumPredict = options.MaxTokens
}
}, ct);
var generationTime = sw.ElapsedMilliseconds - generationStart;
// Step 4: Build response with sources
var sources = chunks.Select(c => new SourceReference
{
DocumentId = c.DocumentId,
DocumentTitle = c.DocumentTitle,
Excerpt = TruncateExcerpt(c.Content, 200),
ChunkIndex = c.ChunkIndex,
RelevanceScore = c.Score
}).ToList();
var confidence = CalculateConfidence(chunks, generationResult.Response);
sw.Stop();
_logger.LogInformation(
"RAG query completed in {TotalMs}ms (retrieval: {RetrievalMs}ms, generation: {GenerationMs}ms)",
sw.ElapsedMilliseconds,
retrievalTime,
generationTime);
return new RagResponse
{
Answer = generationResult.Response,
Sources = sources,
Confidence = confidence,
Metrics = new RagMetrics
{
ChunksRetrieved = searchResults.Results.Count,
ChunksUsed = chunks.Count,
RetrievalTimeMs = retrievalTime,
GenerationTimeMs = generationTime,
InputTokens = generationResult.PromptEvalCount,
OutputTokens = generationResult.EvalCount
},
SuggestedFollowUps = GenerateFollowUpSuggestions(question, generationResult.Response)
};
}
[Lost in the Middle: How Language Models Use Long Contexts] — Liu et al., 2023-07-06
public async IAsyncEnumerable<RagStreamChunk> AskStreamingAsync(
CustomId userId,
string question,
RagOptions? options = null,
[EnumeratorCancellation] CancellationToken ct = default)
{
options ??= new RagOptions();
// Retrieve context (non-streaming)
var searchResults = await _search.SearchAsync(
userId,
question,
new HybridSearchOptions { Limit = options.TopK * 2 },
ct);
var chunks = await BuildContextAsync(searchResults.Results, options.TopK, ct);
var context = FormatContext(chunks);
var prompt = BuildPrompt(context, question);
// Yield sources first
yield return new RagStreamChunk
{
Type = RagChunkType.Sources,
Sources = chunks.Select(c => new SourceReference
{
DocumentId = c.DocumentId,
DocumentTitle = c.DocumentTitle,
Excerpt = TruncateExcerpt(c.Content, 200),
ChunkIndex = c.ChunkIndex,
RelevanceScore = c.Score
}).ToList()
};
// Stream the answer
await foreach (var token in _ollama.GenerateStreamingAsync(new OllamaRequest
{
Model = options.Model,
System = SystemPrompt,
Prompt = prompt,
Options = new OllamaOptions
{
Temperature = options.Temperature,
NumPredict = options.MaxTokens
}
}, ct))
{
yield return new RagStreamChunk
{
Type = RagChunkType.Token,
Token = token
};
}
yield return new RagStreamChunk
{
Type = RagChunkType.Complete
};
}
private async Task<List<RetrievedChunk>> BuildContextAsync(
IReadOnlyList<HybridSearchResult> results,
int topK,
CancellationToken ct)
{
var chunks = new List<RetrievedChunk>();
var seenDocuments = new HashSet<CustomId>();
foreach (var result in results.Take(topK * 2))
{
// Deduplicate: skip results from documents we've already taken chunks from
if (!seenDocuments.Add(result.DocumentId)) continue;
// Get document details if not cached
var document = result.Document ??
await _documents.GetByIdAsync(result.DocumentId, ct);
if (document is null) continue;
// Add matched chunks
if (result.MatchedChunks is not null)
{
foreach (var chunk in result.MatchedChunks)
{
chunks.Add(new RetrievedChunk
{
DocumentId = result.DocumentId,
DocumentTitle = document.Title.Value,
Content = chunk.Content,
ChunkIndex = chunk.ChunkIndex,
Score = result.Score
});
}
}
if (chunks.Count >= topK) break;
}
// Sort by score and take top-k
return chunks
.OrderByDescending(c => c.Score)
.Take(topK)
.ToList();
}
private static string FormatContext(List<RetrievedChunk> chunks)
{
var sb = new StringBuilder();
for (int i = 0; i < chunks.Count; i++)
{
var chunk = chunks[i];
sb.AppendLine($"[Source {i + 1}] Document: {chunk.DocumentTitle}");
sb.AppendLine(chunk.Content);
sb.AppendLine();
}
return sb.ToString();
}
private static string BuildPrompt(string context, string question)
{
return $"""
Based on the following documents, please answer the question.
DOCUMENTS:
{context}
QUESTION: {question}
ANSWER:
""";
}
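To make the prompt shape concrete, here is roughly what `FormatContext` plus `BuildPrompt` produce for a single retrieved chunk; the document title and excerpt are invented for illustration:

```csharp
// Illustration only: the assembled prompt for one retrieved chunk.
// Document title and content are invented for this example.
var context = """
    [Source 1] Document: Data Processing Agreement
    The processor shall implement appropriate technical and organisational
    measures to ensure a level of security appropriate to the risk...

    """;
var prompt = $"""
    Based on the following documents, please answer the question.
    DOCUMENTS:
    {context}
    QUESTION: What are our GDPR obligations?
    ANSWER:
    """;
Console.WriteLine(prompt);
```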
private static float CalculateConfidence(
List<RetrievedChunk> chunks,
string answer)
{
if (chunks.Count == 0) return 0;
// Base confidence on retrieval scores
var avgRetrievalScore = chunks.Average(c => c.Score);
// Penalize very short answers
var lengthFactor = Math.Min(1.0f, answer.Length / 100f);
// Penalize if answer contains uncertainty phrases
var uncertaintyPhrases = new[]
{
"i don't know",
"not sure",
"cannot determine",
"no information",
"unclear"
};
var uncertaintyPenalty = uncertaintyPhrases
.Any(p => answer.Contains(p, StringComparison.OrdinalIgnoreCase))
? 0.3f
: 0;
return Math.Max(0, Math.Min(1, avgRetrievalScore * lengthFactor - uncertaintyPenalty));
}
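A quick worked example of the heuristic, with assumed input values:

```csharp
// Standalone walk-through of the confidence heuristic with assumed inputs.
float avgRetrievalScore = 0.82f;                  // mean score of the chunks used
float lengthFactor = Math.Min(1.0f, 250 / 100f);  // 250-char answer → capped at 1.0
float uncertaintyPenalty = 0f;                    // no hedging phrases in the answer
float confidence = Math.Max(0, Math.Min(1,
    avgRetrievalScore * lengthFactor - uncertaintyPenalty));
// confidence = 0.82; a "no information" answer would drop it to 0.52
```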
private static RagResponse CreateNoContextResponse(long elapsedMs)
{
return new RagResponse
{
Answer = "I couldn't find any relevant documents to answer your question. " +
"Please try rephrasing your question or ensure relevant documents are uploaded.",
Sources = [],
Confidence = 0,
Metrics = new RagMetrics
{
ChunksRetrieved = 0,
ChunksUsed = 0,
RetrievalTimeMs = elapsedMs,
GenerationTimeMs = 0,
InputTokens = 0,
OutputTokens = 0
}
};
}
private static string TruncateExcerpt(string content, int maxLength)
{
if (content.Length <= maxLength) return content;
return content[..(maxLength - 3)] + "...";
}
private static IReadOnlyList<string>? GenerateFollowUpSuggestions(
string question,
string answer)
{
// Simple heuristic - in production, use LLM to generate these
var suggestions = new List<string>();
if (answer.Contains("GDPR", StringComparison.OrdinalIgnoreCase))
{
suggestions.Add("What are the penalties for GDPR non-compliance?");
suggestions.Add("How do we handle data subject access requests?");
}
return suggestions.Count > 0 ? suggestions : null;
}
private sealed record RetrievedChunk
{
public required CustomId DocumentId { get; init; }
public required string DocumentTitle { get; init; }
public required string Content { get; init; }
public required int ChunkIndex { get; init; }
public required float Score { get; init; }
}
}
public sealed record RagStreamChunk
{
public required RagChunkType Type { get; init; }
public string? Token { get; init; }
public IReadOnlyList<SourceReference>? Sources { get; init; }
}
public enum RagChunkType
{
Sources,
Token,
Complete
}
Ollama Client
[Ollama API Documentation: Generate] — Ollama, 2024-09-15
// Infrastructure/AI/OllamaClient.cs
public sealed class OllamaClient : IOllamaClient
{
private readonly HttpClient _httpClient;
private readonly ILogger<OllamaClient> _logger;
public OllamaClient(
HttpClient httpClient,
ILogger<OllamaClient> logger)
{
_httpClient = httpClient;
_logger = logger;
}
public async Task<OllamaResponse> GenerateAsync(
OllamaRequest request,
CancellationToken ct = default)
{
var response = await _httpClient.PostAsJsonAsync(
"/api/generate",
new
{
model = request.Model,
system = request.System,
prompt = request.Prompt,
stream = false,
options = request.Options != null ? new
{
temperature = request.Options.Temperature,
num_predict = request.Options.NumPredict,
top_k = request.Options.TopK,
top_p = request.Options.TopP
} : null
},
ct);
response.EnsureSuccessStatusCode();
var result = await response.Content.ReadFromJsonAsync<OllamaApiResponse>(ct);
return new OllamaResponse
{
Response = result!.Response,
PromptEvalCount = result.PromptEvalCount,
EvalCount = result.EvalCount,
TotalDuration = result.TotalDuration
};
}
public async IAsyncEnumerable<string> GenerateStreamingAsync(
OllamaRequest request,
[EnumeratorCancellation] CancellationToken ct = default)
{
var httpRequest = new HttpRequestMessage(HttpMethod.Post, "/api/generate")
{
Content = JsonContent.Create(new
{
model = request.Model,
system = request.System,
prompt = request.Prompt,
stream = true,
options = request.Options != null ? new
{
temperature = request.Options.Temperature,
num_predict = request.Options.NumPredict
} : null
})
};
using var response = await _httpClient.SendAsync(
httpRequest,
HttpCompletionOption.ResponseHeadersRead,
ct);
response.EnsureSuccessStatusCode();
await using var stream = await response.Content.ReadAsStreamAsync(ct);
using var reader = new StreamReader(stream);
while (!reader.EndOfStream)
{
var line = await reader.ReadLineAsync(ct);
if (string.IsNullOrEmpty(line)) continue;
var chunk = JsonSerializer.Deserialize<OllamaStreamChunk>(line);
if (chunk?.Response is not null)
{
yield return chunk.Response;
}
if (chunk?.Done == true) break;
}
}
private sealed record OllamaApiResponse
{
[JsonPropertyName("response")]
public string Response { get; init; } = string.Empty;
[JsonPropertyName("prompt_eval_count")]
public int PromptEvalCount { get; init; }
[JsonPropertyName("eval_count")]
public int EvalCount { get; init; }
[JsonPropertyName("total_duration")]
public long TotalDuration { get; init; }
}
private sealed record OllamaStreamChunk
{
[JsonPropertyName("response")]
public string? Response { get; init; }
[JsonPropertyName("done")]
public bool Done { get; init; }
}
}
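One piece the snippets above assume is the `HttpClient` wiring. A minimal registration sketch: the base address is Ollama's default port (11434); the five-minute timeout is a suggestion for slow local generation, not a value from the article's codebase:

```csharp
// Program.cs (sketch) — DI wiring assumed by the services above.
builder.Services.AddHttpClient<IOllamaClient, OllamaClient>(client =>
{
    // Ollama serves its HTTP API on http://localhost:11434 by default
    client.BaseAddress = new Uri("http://localhost:11434");
    // Local generation on CPU can take a while; the 100s default may be too short
    client.Timeout = TimeSpan.FromMinutes(5);
});
builder.Services.AddScoped<IRagService, RagService>();
```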
Blazor Chat Component
@* Components/Chat/RagChat.razor *@
@inject IRagService RagService
@implements IAsyncDisposable
<div class="flex flex-col h-full">
@* Messages *@
<div class="flex-1 overflow-y-auto p-4 space-y-4" @ref="_messagesContainer">
@foreach (var message in _messages)
{
<ChatMessage Message="message" />
}
@if (_isLoading)
{
<div class="flex items-center gap-2 text-gray-400">
<div class="animate-pulse">●</div>
<span>Thinking...</span>
</div>
}
</div>
@* Input *@
<div class="border-t border-white/10 p-4">
<form @onsubmit="HandleSubmit" class="flex gap-2">
<GlassInput
@bind-Value="_input"
Placeholder="Ask a question about your documents..."
Disabled="_isLoading"
class="flex-1" />
<GlassButton
Type="submit"
Disabled="_isLoading || string.IsNullOrWhiteSpace(_input)">
<svg class="w-5 h-5" fill="none" stroke="currentColor" viewBox="0 0 24 24">
<path stroke-linecap="round" stroke-linejoin="round" stroke-width="2"
d="M12 19l9 2-9-18-9 18 9-2zm0 0v-8"/>
</svg>
</GlassButton>
</form>
</div>
</div>
@code {
[CascadingParameter] private Task<AuthenticationState>? AuthState { get; set; }
private readonly List<ChatMessageModel> _messages = [];
private string _input = string.Empty;
private bool _isLoading;
private ElementReference _messagesContainer;
private CancellationTokenSource? _cts;
private async Task HandleSubmit()
{
if (string.IsNullOrWhiteSpace(_input) || _isLoading) return;
var question = _input;
_input = string.Empty;
_isLoading = true;
// Add user message
_messages.Add(new ChatMessageModel
{
Role = "user",
Content = question,
Timestamp = DateTimeOffset.UtcNow
});
// Add placeholder for assistant
var assistantMessage = new ChatMessageModel
{
Role = "assistant",
Content = string.Empty,
Timestamp = DateTimeOffset.UtcNow
};
_messages.Add(assistantMessage);
await ScrollToBottom();
try
{
var authState = await AuthState!;
var userId = authState.User.GetCustomId();
_cts = new CancellationTokenSource();
// Stream the response
await foreach (var chunk in RagService.AskStreamingAsync(
userId,
question,
ct: _cts.Token))
{
switch (chunk.Type)
{
case RagChunkType.Sources:
assistantMessage.Sources = chunk.Sources;
break;
case RagChunkType.Token:
assistantMessage.Content += chunk.Token;
StateHasChanged();
break;
case RagChunkType.Complete:
break;
}
}
}
catch (OperationCanceledException)
{
assistantMessage.Content += " [Cancelled]";
}
catch (Exception ex)
{
assistantMessage.Content = $"Error: {ex.Message}";
assistantMessage.IsError = true;
}
finally
{
_isLoading = false;
_cts?.Dispose();
_cts = null;
await ScrollToBottom();
}
}
private async Task ScrollToBottom()
{
await Task.Yield();
// JS interop to scroll would go here
}
public ValueTask DisposeAsync()
{
_cts?.Cancel();
_cts?.Dispose();
return ValueTask.CompletedTask;
}
}
public sealed class ChatMessageModel
{
public required string Role { get; init; }
public string Content { get; set; } = string.Empty;
public required DateTimeOffset Timestamp { get; init; }
public IReadOnlyList<SourceReference>? Sources { get; set; }
public bool IsError { get; set; }
}
Conclusion
A production RAG pipeline requires:
| Component | Purpose |
|---|---|
| Hybrid Retrieval | Find relevant chunks |
| Context Building | Format for LLM |
| Prompt Engineering | Guide LLM behavior |
| Streaming | Better UX |
| Source Attribution | Transparency |
| Confidence Scoring | Reliability indicator |
This foundation can be extended with reranking, query expansion, and conversation memory for even more sophisticated applications.
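As a sketch of one such extension point, a reranker could slot in between retrieval and context building; this interface is hypothetical, not part of the codebase above:

```csharp
// Hypothetical extension point: rerank retrieved candidates with a
// cross-encoder before the top-k chunks are selected for the context.
public interface IReranker
{
    Task<IReadOnlyList<HybridSearchResult>> RerankAsync(
        string query,
        IReadOnlyList<HybridSearchResult> candidates,
        CancellationToken ct = default);
}
```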
Personal Reflection
If I had to distill everything I learned building this pipeline into one sentence, it would be this: the retrieval step matters more than the generation step. I spent weeks trying to improve answer quality by tweaking prompts, switching models, and adjusting temperature — and those things helped, but the biggest quality jumps always came from improving what context the LLM received in the first place. Better chunking, better reranking, and better filtering made a larger difference than any prompt engineering trick. The other lesson that surprised me was how well a local 8B parameter model performed when given good context. Our initial assumption was that we would need a 70B model or a commercial API for production-quality answers, but with well-retrieved, well-ordered chunks and a carefully tuned system prompt, llama3:8b through Ollama produced answers that our internal testers rated as “good” or “excellent” on 87% of test queries. Running locally also meant zero API costs and complete data privacy — a requirement for our enterprise clients.
Next Steps
- Building a Semantic Embedding Pipeline with Ollama and Qdrant — The companion article covering the embedding and indexing side of the pipeline in detail.
- Conversation Memory — Add multi-turn conversation support so users can ask follow-up questions that reference previous answers without restating full context.
- Evaluation Framework — Build an automated evaluation pipeline using labeled question-answer pairs to measure retrieval precision, answer accuracy, and hallucination rates across model and prompt changes.