# Integrating Docling OCR in a .NET Document Pipeline
Learn how to integrate Docling, an AI-powered document understanding library, into your .NET application for high-quality OCR with layout preservation.
## Introduction
Traditional OCR tools extract text but lose document structure. Docling is IBM’s open-source document understanding library that preserves layout, tables, and semantic structure while converting documents to clean Markdown.
In this guide, we’ll integrate Docling as a microservice in a .NET document processing pipeline.
## Why Docling?
| Feature | Traditional OCR | Docling |
|---|---|---|
| Text extraction | ✅ | ✅ |
| Layout preservation | ❌ | ✅ |
| Table reconstruction | ❌ | ✅ |
| Semantic sections | ❌ | ✅ |
| Multi-format support | Limited | PDF, DOCX, PPTX, Images, HTML |
| Output format | Plain text | Markdown, JSON, DocTags |
Docling understands document structure—headings, paragraphs, lists, tables—and outputs clean Markdown that’s perfect for RAG pipelines and semantic search.
[Docling - Document Understanding Made Easy] — IBM Research

## Architecture Overview

We’ll deploy Docling as a REST API service and call it from .NET workers:
```mermaid
flowchart LR
    MinIO["🪣 MinIO\n(Storage)"] --> Worker["⚙️ OCR Worker\n(.NET)"]
    Worker --> Docling["📄 Docling\n(Python)"]
    Worker --> NATS["⚡ NATS\n(Events)"]

    classDef primary fill:#7c3aed,color:#fff
    classDef secondary fill:#06b6d4,color:#fff
    classDef db fill:#f43f5e,color:#fff
    classDef warning fill:#fbbf24,color:#000

    class Worker primary
    class Docling secondary
    class MinIO,NATS db
```
## Implementation

### Setting Up Docling Service

#### Dockerfile

Create a containerized Docling API service:

```dockerfile
# Dockerfile for Docling OCR Service
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies for PDF processing
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx \
    libglib2.0-0 \
    poppler-utils \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```
#### Python API Service

```python
# app.py
import os
import tempfile
from pathlib import Path

from fastapi import FastAPI, File, Form, HTTPException, UploadFile
from pydantic import BaseModel

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

app = FastAPI(title="Docling OCR Service")

# Configure converter with OCR and table-structure recognition enabled
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True

converter = DocumentConverter(
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.DOCX,
        InputFormat.PPTX,
        InputFormat.IMAGE,
        InputFormat.HTML,
    ],
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    },
)


class ConvertResponse(BaseModel):
    content: str
    format: str
    page_count: int
    tables: list[dict] | None = None


@app.post("/convert", response_model=ConvertResponse)
async def convert_document(
    file: UploadFile = File(...),
    output_format: str = Form("markdown"),  # markdown, json, doctags
):
    """Convert an uploaded document to the specified format."""
    # Validate file type
    allowed_extensions = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".jpeg", ".html"}
    ext = Path(file.filename).suffix.lower()
    if ext not in allowed_extensions:
        raise HTTPException(400, f"Unsupported file type: {ext}")

    # Save uploaded file temporarily
    with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name

    try:
        # Convert document
        result = converter.convert(tmp_path)
        doc = result.document

        # Generate output in requested format
        if output_format == "markdown":
            output = doc.export_to_markdown()
        elif output_format == "json":
            output = doc.export_to_dict()
        elif output_format == "doctags":
            output = doc.export_to_document_tokens()
        else:
            raise HTTPException(400, f"Unknown output format: {output_format}")

        # Extract table data for separate processing
        tables = []
        for table in doc.tables:
            tables.append({
                "id": table.self_ref,
                "rows": table.data.num_rows,
                "cols": table.data.num_cols,
                "markdown": table.export_to_markdown(),
            })

        return ConvertResponse(
            content=output if isinstance(output, str) else str(output),
            format=output_format,
            page_count=len(doc.pages),
            tables=tables if tables else None,
        )
    finally:
        # Clean up temp file
        os.unlink(tmp_path)


@app.get("/health")
async def health():
    """Health check endpoint."""
    return {"status": "healthy", "service": "docling-ocr"}
```
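For reference, the JSON contract the .NET client later consumes can be sketched with plain dictionaries (field names mirror the `ConvertResponse` model above; all values here are illustrative):

```python
import json

# Illustrative /convert response payload (all values are made up)
response_payload = {
    "content": "# Invoice\n\nTotal: 42.00",
    "format": "markdown",
    "page_count": 1,
    "tables": [
        {"id": "#/tables/0", "rows": 2, "cols": 2, "markdown": "| Item | Qty |"},
    ],
}

def validate_convert_response(payload: dict) -> bool:
    """Check that a payload carries the fields ConvertResponse always returns."""
    required = {"content": str, "format": str, "page_count": int}
    return all(isinstance(payload.get(key), type_) for key, type_ in required.items())

# Round-trip through JSON, as the .NET client would receive it
decoded = json.loads(json.dumps(response_payload))
```

Pinning the contract down like this makes it easy to spot breaking changes when either side of the service boundary evolves.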
#### Requirements

```text
# requirements.txt
docling>=2.15.0
fastapi>=0.115.0
uvicorn>=0.32.0
python-multipart>=0.0.17
```
### Kubernetes Deployment

Deploy Docling to your K3s cluster:

```yaml
# infrastructure/ai/docling-ocr/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: docling-ocr
  namespace: ai
spec:
  replicas: 2
  selector:
    matchLabels:
      app: docling-ocr
  template:
    metadata:
      labels:
        app: docling-ocr
    spec:
      containers:
        - name: docling
          image: ghcr.io/bluerobin/docling-ocr:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "2Gi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: docling-ocr
  namespace: ai
spec:
  selector:
    app: docling-ocr
  ports:
    - port: 80
      targetPort: 8080
```
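Inside the cluster, the Service is reachable through standard Kubernetes DNS as `<service>.<namespace>.svc.cluster.local`. A small sketch of the URL the .NET client configuration below assumes:

```python
def cluster_url(service: str, namespace: str, port: int = 80) -> str:
    """Build the in-cluster URL for a Kubernetes Service (port 80 is implied in HTTP URLs)."""
    host = f"{service}.{namespace}.svc.cluster.local"
    return f"http://{host}" if port == 80 else f"http://{host}:{port}"

print(cluster_url("docling-ocr", "ai"))  # → http://docling-ocr.ai.svc.cluster.local
```

Exposing the Service on port 80 (while the container listens on 8080) keeps the client-side base address free of port numbers.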
### .NET Client Implementation

#### HTTP Client Service

Create a typed HTTP client to call the Docling API:
```csharp
// Infrastructure/Ocr/DoclingOcrService.cs
using System.Net.Http.Headers;
using System.Net.Http.Json;
using System.Text.Json.Serialization;

namespace Archives.Infrastructure.Ocr;

public interface IOcrService
{
    Task<OcrResult> ProcessDocumentAsync(
        Stream documentStream,
        string fileName,
        CancellationToken ct = default);
}

public sealed record OcrResult(
    string Content,
    int PageCount,
    IReadOnlyList<TableData>? Tables);

public sealed record TableData(
    string Id,
    int Rows,
    int Columns,
    string Markdown);

public sealed class DoclingOcrService : IOcrService
{
    private readonly HttpClient _httpClient;
    private readonly ILogger<DoclingOcrService> _logger;

    public DoclingOcrService(
        HttpClient httpClient,
        ILogger<DoclingOcrService> logger)
    {
        _httpClient = httpClient;
        _logger = logger;
    }

    public async Task<OcrResult> ProcessDocumentAsync(
        Stream documentStream,
        string fileName,
        CancellationToken ct = default)
    {
        _logger.LogInformation("Processing document {FileName} with Docling OCR", fileName);

        using var content = new MultipartFormDataContent();
        using var streamContent = new StreamContent(documentStream);

        // Set content type based on file extension
        var contentType = GetContentType(fileName);
        streamContent.Headers.ContentType = new MediaTypeHeaderValue(contentType);

        content.Add(streamContent, "file", fileName);
        content.Add(new StringContent("markdown"), "output_format");

        var response = await _httpClient.PostAsync("/convert", content, ct);

        if (!response.IsSuccessStatusCode)
        {
            var error = await response.Content.ReadAsStringAsync(ct);
            _logger.LogError("Docling OCR failed: {StatusCode} - {Error}",
                response.StatusCode, error);
            throw new OcrException($"OCR processing failed: {error}");
        }

        var result = await response.Content.ReadFromJsonAsync<DoclingResponse>(ct)
            ?? throw new OcrException("Empty response from OCR service");

        _logger.LogInformation(
            "OCR completed: {PageCount} pages, {TableCount} tables extracted",
            result.PageCount,
            result.Tables?.Count ?? 0);

        return new OcrResult(
            Content: result.Content,
            PageCount: result.PageCount,
            Tables: result.Tables?.Select(t => new TableData(
                t.Id, t.Rows, t.Cols, t.Markdown)).ToList());
    }

    private static string GetContentType(string fileName)
    {
        var ext = Path.GetExtension(fileName).ToLowerInvariant();
        return ext switch
        {
            ".pdf" => "application/pdf",
            ".docx" => "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            ".pptx" => "application/vnd.openxmlformats-officedocument.presentationml.presentation",
            ".png" => "image/png",
            ".jpg" or ".jpeg" => "image/jpeg",
            ".html" => "text/html",
            _ => "application/octet-stream"
        };
    }

    // ReadFromJsonAsync uses case-insensitive web defaults, so only the
    // snake_case "page_count" field needs an explicit mapping.
    private sealed record DoclingResponse(
        string Content,
        string Format,
        [property: JsonPropertyName("page_count")] int PageCount,
        List<DoclingTable>? Tables);

    private sealed record DoclingTable(
        string Id,
        int Rows,
        int Cols,
        string Markdown);
}

public class OcrException : Exception
{
    public OcrException(string message) : base(message) { }
    public OcrException(string message, Exception inner) : base(message, inner) { }
}
```
#### Service Registration

Configure the HTTP client with resilience policies:

```csharp
// Program.cs / ServiceCollectionExtensions.cs
services.AddHttpClient<IOcrService, DoclingOcrService>(client =>
{
    client.BaseAddress = new Uri(configuration["Ocr:DoclingUrl"]
        ?? "http://docling-ocr.ai.svc.cluster.local");
    client.Timeout = TimeSpan.FromMinutes(5); // Long timeout for large documents
})
.AddStandardResilienceHandler(options =>
{
    options.Retry.MaxRetryAttempts = 3;
    options.Retry.Delay = TimeSpan.FromSeconds(2);
    options.CircuitBreaker.SamplingDuration = TimeSpan.FromSeconds(30);
});
```
### OCR Worker Integration

The OCR worker listens for document upload events and processes them:
```csharp
// Workers/OcrWorkerService.cs
using System.Text;
using System.Text.Json;

namespace Archives.Workers;

public sealed class OcrWorkerService : BackgroundService
{
    private readonly INatsConnection _nats;
    private readonly IOcrService _ocrService;
    private readonly IObjectStorage _storage;
    private readonly ILogger<OcrWorkerService> _logger;
    private readonly string _environment;

    public OcrWorkerService(
        INatsConnection nats,
        IOcrService ocrService,
        IObjectStorage storage,
        IConfiguration configuration,
        ILogger<OcrWorkerService> logger)
    {
        _nats = nats;
        _ocrService = ocrService;
        _storage = storage;
        _logger = logger;
        _environment = configuration["Environment"] ?? "dev";
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var subject = $"{_environment}.archives.documents.object.uploaded";
        _logger.LogInformation("OCR Worker subscribing to {Subject}", subject);

        await foreach (var msg in _nats.SubscribeAsync<DocumentProcessingRequestedEvent>(
            subject, cancellationToken: stoppingToken))
        {
            try
            {
                await ProcessDocumentAsync(msg.Data!, stoppingToken);
                await msg.AckAsync(cancellationToken: stoppingToken);
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Failed to process document {DocumentId}",
                    msg.Data?.DocumentId);
                await msg.NakAsync(cancellationToken: stoppingToken);
            }
        }
    }

    private async Task ProcessDocumentAsync(
        DocumentProcessingRequestedEvent evt,
        CancellationToken ct)
    {
        _logger.LogInformation(
            "Processing OCR for document {DocumentId} from bucket {Bucket}",
            evt.DocumentId, evt.Bucket);

        // 1. Download document from MinIO
        var objectPath = $"original/{evt.DocumentId}{evt.FileExtension}";
        await using var documentStream = await _storage.GetObjectAsync(
            evt.Bucket, objectPath, ct);

        // 2. Process with Docling OCR
        var result = await _ocrService.ProcessDocumentAsync(
            documentStream,
            evt.FileName,
            ct);

        // 3. Store extracted content in MinIO
        var contentPath = $"processed/{evt.DocumentId}/content.md";
        await using var contentStream = new MemoryStream(
            Encoding.UTF8.GetBytes(result.Content));
        await _storage.PutObjectAsync(
            evt.Bucket,
            contentPath,
            contentStream,
            "text/markdown",
            ct);

        // 4. Store table data if present
        if (result.Tables?.Any() == true)
        {
            var tablesJson = JsonSerializer.Serialize(result.Tables);
            var tablesPath = $"processed/{evt.DocumentId}/tables.json";
            await using var tablesStream = new MemoryStream(
                Encoding.UTF8.GetBytes(tablesJson));
            await _storage.PutObjectAsync(
                evt.Bucket,
                tablesPath,
                tablesStream,
                "application/json",
                ct);
        }

        // 5. Publish completion event
        await _nats.PublishAsync(
            $"{_environment}.archives.documents.ocr.completed",
            new OcrCompletedEvent
            {
                DocumentId = evt.DocumentId,
                Bucket = evt.Bucket,
                UserId = evt.UserId,
                PageCount = result.PageCount,
                TableCount = result.Tables?.Count ?? 0,
                ContentPath = contentPath,
                ProcessedAt = DateTimeOffset.UtcNow
            },
            cancellationToken: ct);

        _logger.LogInformation(
            "OCR completed for {DocumentId}: {Pages} pages, {Tables} tables",
            evt.DocumentId, result.PageCount, result.Tables?.Count ?? 0);
    }
}
```
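The worker reads and writes the bucket according to a simple layout convention; the paths come straight from the code above, sketched here in Python for clarity:

```python
def document_paths(document_id: str, file_extension: str) -> dict[str, str]:
    """Object keys the OCR worker reads and writes for a single document."""
    return {
        "original": f"original/{document_id}{file_extension}",  # input, as uploaded
        "content": f"processed/{document_id}/content.md",       # Markdown from Docling
        "tables": f"processed/{document_id}/tables.json",       # extracted tables, if any
    }

paths = document_paths("doc-123", ".pdf")
```

Keeping originals and derived artifacts under distinct prefixes makes lifecycle rules (e.g. expiring processed output) straightforward to apply per prefix.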
### Event Schemas

Define strongly-typed events for the pipeline:
```csharp
// Application/Events/DocumentEvents.cs
namespace Archives.Application.Events;

public sealed record DocumentProcessingRequestedEvent
{
    public required string DocumentId { get; init; }
    public required string Bucket { get; init; }
    public required string UserId { get; init; }
    public required string FileName { get; init; }
    public required string FileExtension { get; init; }
    public required long FileSize { get; init; }
    public required string ContentType { get; init; }
    public required DateTimeOffset UploadedAt { get; init; }
}

public sealed record OcrCompletedEvent
{
    public required string DocumentId { get; init; }
    public required string Bucket { get; init; }
    public required string UserId { get; init; }
    public required int PageCount { get; init; }
    public required int TableCount { get; init; }
    public required string ContentPath { get; init; }
    public required DateTimeOffset ProcessedAt { get; init; }
}
```
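Each event type travels on an environment-prefixed NATS subject. The naming convention the worker uses can be sketched as:

```python
def nats_subject(environment: str, event: str) -> str:
    """Environment-prefixed subject, matching the worker's naming convention."""
    return f"{environment}.archives.documents.{event}"

uploaded = nats_subject("dev", "object.uploaded")   # consumed by the OCR worker
completed = nats_subject("dev", "ocr.completed")    # published after processing
```

The environment prefix lets dev, staging, and prod share one NATS cluster without cross-talk.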
### Error Handling and Retries

Add robust error handling for OCR failures:
```csharp
// Infrastructure/Ocr/OcrRetryPolicy.cs
using Polly;
using Polly.Retry;

public static class OcrRetryPolicy
{
    public static AsyncRetryPolicy<OcrResult> Create(ILogger logger)
    {
        return Policy<OcrResult>
            .Handle<HttpRequestException>()
            .Or<TaskCanceledException>()
            .Or<OcrException>(ex => !ex.Message.Contains("Unsupported"))
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: attempt =>
                    TimeSpan.FromSeconds(Math.Pow(2, attempt)),
                onRetry: (outcome, timeSpan, attempt, _) =>
                {
                    logger.LogWarning(
                        "OCR retry {Attempt} after {Delay}s due to: {Error}",
                        attempt,
                        timeSpan.TotalSeconds,
                        outcome.Exception?.Message);
                });
    }
}
```
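The `Math.Pow(2, attempt)` sleep provider yields an exponential 2s, 4s, 8s schedule. The same computation in Python, for clarity:

```python
def backoff_schedule(retry_count: int = 3, base: float = 2.0) -> list[float]:
    """Delays (seconds) for attempts 1..retry_count, mirroring Math.Pow(2, attempt)."""
    return [base ** attempt for attempt in range(1, retry_count + 1)]

print(backoff_schedule())  # → [2.0, 4.0, 8.0]
```

Exponential backoff keeps transient failures (a pod restarting, a brief network blip) from turning into hard failures, while still giving up quickly on persistent errors.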
### Health Checks

Add Docling to your health check suite:
```csharp
// Infrastructure/HealthChecks/DoclingHealthCheck.cs
public class DoclingHealthCheck : IHealthCheck
{
    private readonly HttpClient _httpClient;

    public DoclingHealthCheck(IHttpClientFactory factory)
    {
        _httpClient = factory.CreateClient("Docling");
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken ct = default)
    {
        try
        {
            var response = await _httpClient.GetAsync("/health", ct);
            if (response.IsSuccessStatusCode)
            {
                return HealthCheckResult.Healthy("Docling OCR is available");
            }
            return HealthCheckResult.Degraded(
                $"Docling returned {response.StatusCode}");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy(
                "Cannot reach Docling OCR service",
                exception: ex);
        }
    }
}

// Registration
services.AddHealthChecks()
    .AddCheck<DoclingHealthCheck>("docling-ocr");
```
### Testing the Integration

Write integration tests to verify OCR processing:
```csharp
// Tests/Integration/DoclingOcrServiceTests.cs
using Xunit;

public class DoclingOcrServiceTests : IClassFixture<DoclingFixture>
{
    private readonly IOcrService _ocrService;

    public DoclingOcrServiceTests(DoclingFixture fixture)
    {
        _ocrService = fixture.OcrService;
    }

    [Fact]
    public async Task ProcessDocument_WithValidPdf_ReturnsMarkdown()
    {
        // Arrange
        var pdfPath = Path.Combine("TestFiles", "sample-invoice.pdf");
        await using var stream = File.OpenRead(pdfPath);

        // Act
        var result = await _ocrService.ProcessDocumentAsync(
            stream, "sample-invoice.pdf");

        // Assert
        Assert.NotEmpty(result.Content);
        Assert.Contains("Invoice", result.Content);
        Assert.True(result.PageCount >= 1);
    }

    [Fact]
    public async Task ProcessDocument_WithTable_ExtractsTables()
    {
        // Arrange
        var pdfPath = Path.Combine("TestFiles", "sample-with-table.pdf");
        await using var stream = File.OpenRead(pdfPath);

        // Act
        var result = await _ocrService.ProcessDocumentAsync(
            stream, "sample-with-table.pdf");

        // Assert
        Assert.NotNull(result.Tables);
        Assert.NotEmpty(result.Tables);
        Assert.Contains("|", result.Tables[0].Markdown); // Markdown table
    }
}
```
## Performance Optimization

### Batch Processing

For high-volume scenarios, batch documents:
```csharp
public async Task ProcessBatchAsync(
    IEnumerable<DocumentToProcess> documents,
    CancellationToken ct)
{
    // Cap concurrent OCR calls at 3 to avoid overwhelming the service
    var semaphore = new SemaphoreSlim(initialCount: 3);

    var tasks = documents.Select(async doc =>
    {
        await semaphore.WaitAsync(ct);
        try
        {
            await ProcessDocumentAsync(doc, ct);
        }
        finally
        {
            semaphore.Release();
        }
    });

    await Task.WhenAll(tasks);
}
```
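The same bounded-concurrency pattern, sketched with Python's asyncio (the document list and processing step are illustrative stand-ins):

```python
import asyncio

async def process_document(doc: str) -> str:
    # Stand-in for the real OCR call
    await asyncio.sleep(0)
    return f"processed:{doc}"

async def process_batch(docs: list[str], max_concurrency: int = 3) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(doc: str) -> str:
        async with semaphore:  # at most max_concurrency calls in flight
            return await process_document(doc)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(d) for d in docs))

results = asyncio.run(process_batch(["a", "b", "c", "d"]))
```

A semaphore bounds resource use without serializing the whole batch, which matters when each OCR call pins significant memory on the Docling side.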
### Caching Document Fingerprints

Skip re-processing unchanged documents:
```csharp
public async Task<bool> ShouldProcessAsync(
    string bucket,
    string documentId,
    string fingerprint,
    CancellationToken ct)
{
    var existingFingerprint = await _storage.GetMetadataAsync(
        bucket,
        $"processed/{documentId}/fingerprint",
        ct);

    return existingFingerprint != fingerprint;
}
```
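A fingerprint can be any stable content hash; a minimal sketch using SHA-256 (the specific hash is an assumption, the post does not prescribe one):

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Stable content hash for change detection (SHA-256 assumed)."""
    return hashlib.sha256(content).hexdigest()

a = fingerprint(b"same bytes")
b = fingerprint(b"same bytes")
c = fingerprint(b"different bytes")
```

Identical bytes always produce the same digest, so comparing the stored fingerprint with the incoming one is enough to skip redundant OCR runs.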
## Summary

We’ve built a complete Docling OCR integration:
- **Docling Service** — containerized Python API built with FastAPI
- **Kubernetes Deployment** — scalable service in the `ai` namespace
- **HTTP Client** — typed client with resilience policies
- **OCR Worker** — event-driven document processing
- **Health Checks** — monitoring integration
Docling’s layout-aware extraction produces clean Markdown that’s ideal for:
- Semantic search with embeddings
- RAG pipelines with structured chunks
- Document summarization
- Knowledge extraction
Next in the series, we’ll build the embedding pipeline that converts OCR output into vector representations for semantic search.