
Integrating Docling OCR in a .NET Document Pipeline

Learn how to integrate Docling, an AI-powered document understanding library, into your .NET application for high-quality OCR with layout preservation.

By Victor Robin

When I received a batch of 200 scanned receipts from a user and ran them through our document processing pipeline, every single one came back blank. The existing text extraction only worked on digital PDFs — scanned documents, which are essentially images wrapped in a PDF container, produced zero text. That moment of realization, staring at 200 empty extraction results, led me to integrate Docling’s OCR capability and make the platform truly document-format-agnostic.

Introduction

Traditional OCR tools extract text but lose document structure. Docling is IBM’s open-source document understanding library that preserves layout, tables, and semantic structure while converting documents to clean Markdown.

In this guide, we’ll integrate Docling as a microservice in a .NET document processing pipeline.

Why Docling?

| Feature | Traditional OCR | Docling |
|---|---|---|
| Text extraction | ✅ | ✅ |
| Layout preservation | ❌ | ✅ |
| Table reconstruction | ❌ | ✅ |
| Semantic sections | ❌ | ✅ |
| Multi-format support | Limited | PDF, DOCX, PPTX, Images, HTML |
| Output format | Plain text | Markdown, JSON, DocTags |

Docling understands document structure—headings, paragraphs, lists, tables—and outputs clean Markdown that’s perfect for RAG pipelines and semantic search.

[Docling - Document Understanding Made Easy] — IBM Research
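
Because Docling emits heading-structured Markdown, downstream RAG chunking can simply key off the headings. A minimal sketch (the function name and chunk shape are illustrative, not part of Docling's API):

```python
import re


def chunk_markdown(md: str) -> list[dict]:
    """Split heading-structured Markdown into heading-scoped chunks for RAG."""
    chunks: list[dict] = []
    heading, buf = "", []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"heading": heading, "text": text})

    for line in md.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)", line)
        if m:
            flush()                      # close the previous section
            heading, buf = m.group(1), []
        else:
            buf.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading as context, which tends to give embeddings better grounding than blind fixed-size splits.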

Traditional OCR engines like Tesseract are powerful for raw text extraction but require significant preprocessing to handle complex layouts reliably.

[Tesseract OCR Documentation] — Tesseract OCR Project

Architecture Overview

We’ll deploy Docling as a REST API service and call it from .NET workers:

flowchart LR
    MinIO["🪣 MinIO\n(Storage)"] --> Worker["⚙️ OCR Worker\n(.NET)"]
    Worker --> Docling["📄 Docling\n(Python)"]
    Worker --> NATS["⚡ NATS\n(Events)"]

    classDef primary fill:#7c3aed,color:#fff
    classDef secondary fill:#06b6d4,color:#fff
    classDef db fill:#f43f5e,color:#fff
    classDef warning fill:#fbbf24,color:#000

    class Worker primary
    class Docling secondary
    class MinIO,NATS db

Understanding the PDF specification is important context for why OCR is necessary — digital PDFs store text as character codes with positioning data, but scanned PDFs store page images with no extractable text layer.

[PDF Reference, Sixth Edition, Version 1.7] — Adobe Systems
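
To see why scanned PDFs defeat plain text extraction, consider a crude structural check: digital PDFs contain text-showing operators (`Tj`/`TJ`) and font resources, while scans mostly embed image XObjects. This heuristic only works on uncompressed content streams (real PDFs usually Flate-compress them, so production code should use a proper parser); it is purely illustrative:

```python
def looks_scanned(pdf_bytes: bytes) -> bool:
    """Rough heuristic: does this PDF look like a scan (images, no text layer)?"""
    # Text-showing operators / font resources indicate a digital text layer.
    has_text = b"Tj" in pdf_bytes or b"TJ" in pdf_bytes or b"/Font" in pdf_bytes
    # Image XObjects indicate page-sized bitmaps, typical of scans.
    has_image = b"/Subtype /Image" in pdf_bytes or b"/Subtype/Image" in pdf_bytes
    return has_image and not has_text
```

Documents that trip this kind of check are exactly the ones that need the OCR path below.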

Implementation

Setting Up Docling Service

Dockerfile

Create a containerized Docling API service:

# Dockerfile for Docling OCR Service
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies for PDF processing
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx \
    libglib2.0-0 \
    poppler-utils \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 8080

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

Python API Service

# app.py
from fastapi import FastAPI, File, Form, HTTPException, UploadFile
from pydantic import BaseModel
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
import tempfile
import os
from pathlib import Path

app = FastAPI(title="Docling OCR Service")

# Configure the PDF pipeline with OCR and table-structure recovery enabled
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True

converter = DocumentConverter(
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.DOCX,
        InputFormat.PPTX,
        InputFormat.IMAGE,
        InputFormat.HTML,
    ],
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    },
)


class ConvertResponse(BaseModel):
    content: str
    format: str
    page_count: int
    tables: list[dict] | None = None


@app.post("/convert", response_model=ConvertResponse)
async def convert_document(
    file: UploadFile = File(...),
    output_format: str = Form("markdown"),  # markdown, json, doctags
):
    """Convert an uploaded document to the specified format."""

    # Validate file type
    allowed_extensions = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".jpeg", ".html"}
    ext = Path(file.filename).suffix.lower()
    if ext not in allowed_extensions:
        raise HTTPException(400, f"Unsupported file type: {ext}")

    # Save uploaded file temporarily
    with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name

    try:
        # Convert document
        result = converter.convert(tmp_path)
        doc = result.document

        # Generate output in requested format
        if output_format == "markdown":
            output = doc.export_to_markdown()
        elif output_format == "json":
            import json  # serialize the dict export as real JSON, not a Python repr
            output = json.dumps(doc.export_to_dict())
        elif output_format == "doctags":
            output = doc.export_to_document_tokens()
        else:
            raise HTTPException(400, f"Unknown output format: {output_format}")

        # Extract table data for separate processing
        tables = []
        for table in doc.tables:
            tables.append({
                "id": table.self_ref,  # internal reference, e.g. "#/tables/0"
                "rows": table.data.num_rows,
                "cols": table.data.num_cols,
                "markdown": table.export_to_markdown(doc),
            })

        return ConvertResponse(
            content=output if isinstance(output, str) else str(output),
            format=output_format,
            page_count=doc.num_pages(),
            tables=tables if tables else None,
        )

    finally:
        # Cleanup temp file
        os.unlink(tmp_path)


@app.get("/health")
async def health():
    """Health check endpoint."""
    return {"status": "healthy", "service": "docling-ocr"}
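
With the service running locally (e.g. `uvicorn app:app --port 8080`), you can exercise `/convert` without any client library. This sketch hand-builds the multipart body with only the standard library; the URL is an assumption for local testing:

```python
import json
import mimetypes
import urllib.request
import uuid


def build_multipart(filename: str, file_bytes: bytes, boundary: str,
                    output_format: str = "markdown") -> bytes:
    """Encode the file and output_format fields as multipart/form-data."""
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode() + file_bytes
    format_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="output_format"\r\n\r\n'
        f"{output_format}"
    ).encode()
    return file_part + b"\r\n" + format_part + f"\r\n--{boundary}--\r\n".encode()


def convert(path: str, url: str = "http://localhost:8080/convert") -> dict:
    boundary = uuid.uuid4().hex
    with open(path, "rb") as f:
        body = build_multipart(path, f.read(), boundary)
    req = urllib.request.Request(url, data=body, headers={
        "Content-Type": f"multipart/form-data; boundary={boundary}"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

This mirrors what the .NET client below does with `MultipartFormDataContent`, which makes it handy for diagnosing encoding issues at the boundary.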

Requirements

# requirements.txt
docling>=2.15.0
fastapi>=0.115.0
uvicorn>=0.32.0
python-multipart>=0.0.17

Kubernetes Deployment

Deploy Docling to your K3s cluster:

# infrastructure/ai/docling-ocr/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: docling-ocr
  namespace: ai
spec:
  replicas: 2
  selector:
    matchLabels:
      app: docling-ocr
  template:
    metadata:
      labels:
        app: docling-ocr
    spec:
      containers:
        - name: docling
          image: registry.example.com/myapp/docling-ocr:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "2Gi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: docling-ocr
  namespace: ai
spec:
  selector:
    app: docling-ocr
  ports:
    - port: 80
      targetPort: 8080

.NET Client Implementation

HTTP Client Service

Create a typed HTTP client to call the Docling API:

// Infrastructure/Ocr/DoclingOcrService.cs
using System.Net.Http.Headers;
using System.Net.Http.Json;
using System.Text.Json;
using Microsoft.Extensions.Logging;

namespace MyApp.Infrastructure.Ocr;

public interface IOcrService
{
    Task<OcrResult> ProcessDocumentAsync(
        Stream documentStream,
        string fileName,
        CancellationToken ct = default);
}

public sealed record OcrResult(
    string Content,
    int PageCount,
    IReadOnlyList<TableData>? Tables);

public sealed record TableData(
    string Id,
    int Rows,
    int Columns,
    string Markdown);

public sealed class DoclingOcrService : IOcrService
{
    // The Python service returns snake_case fields (page_count, …);
    // SnakeCaseLower requires .NET 8+.
    private static readonly JsonSerializerOptions JsonOptions = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower
    };

    private readonly HttpClient _httpClient;
    private readonly ILogger<DoclingOcrService> _logger;

    public DoclingOcrService(
        HttpClient httpClient,
        ILogger<DoclingOcrService> logger)
    {
        _httpClient = httpClient;
        _logger = logger;
    }

    public async Task<OcrResult> ProcessDocumentAsync(
        Stream documentStream,
        string fileName,
        CancellationToken ct = default)
    {
        _logger.LogInformation("Processing document {FileName} with Docling OCR", fileName);

        using var content = new MultipartFormDataContent();
        using var streamContent = new StreamContent(documentStream);

        // Set content type based on file extension
        var contentType = GetContentType(fileName);
        streamContent.Headers.ContentType = new MediaTypeHeaderValue(contentType);

        content.Add(streamContent, "file", fileName);
        content.Add(new StringContent("markdown"), "output_format");

        var response = await _httpClient.PostAsync("/convert", content, ct);

        if (!response.IsSuccessStatusCode)
        {
            var error = await response.Content.ReadAsStringAsync(ct);
            _logger.LogError("Docling OCR failed: {StatusCode} - {Error}",
                response.StatusCode, error);
            throw new OcrException($"OCR processing failed: {error}");
        }

        var result = await response.Content.ReadFromJsonAsync<DoclingResponse>(JsonOptions, ct)
            ?? throw new OcrException("Empty response from OCR service");

        _logger.LogInformation(
            "OCR completed: {PageCount} pages, {TableCount} tables extracted",
            result.PageCount,
            result.Tables?.Count ?? 0);

        return new OcrResult(
            Content: result.Content,
            PageCount: result.PageCount,
            Tables: result.Tables?.Select(t => new TableData(
                t.Id, t.Rows, t.Cols, t.Markdown)).ToList());
    }

    private static string GetContentType(string fileName)
    {
        var ext = Path.GetExtension(fileName).ToLowerInvariant();
        return ext switch
        {
            ".pdf" => "application/pdf",
            ".docx" => "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            ".pptx" => "application/vnd.openxmlformats-officedocument.presentationml.presentation",
            ".png" => "image/png",
            ".jpg" or ".jpeg" => "image/jpeg",
            ".html" => "text/html",
            _ => "application/octet-stream"
        };
    }

    private sealed record DoclingResponse(
        string Content,
        string Format,
        int PageCount,
        List<DoclingTable>? Tables);

    private sealed record DoclingTable(
        string Id,
        int Rows,
        int Cols,
        string Markdown);
}

public class OcrException : Exception
{
    public OcrException(string message) : base(message) { }
    public OcrException(string message, Exception inner) : base(message, inner) { }
}

Service Registration

Configure the HTTP client with resilience policies:

// Program.cs / ServiceCollectionExtensions.cs
services.AddHttpClient<IOcrService, DoclingOcrService>(client =>
{
    client.BaseAddress = new Uri(configuration["Ocr:DoclingUrl"]
        ?? "http://docling-ocr.ai.svc.cluster.local");
    client.Timeout = TimeSpan.FromMinutes(5); // Long timeout for large documents
})
.AddStandardResilienceHandler(options =>
{
    options.Retry.MaxRetryAttempts = 3;
    options.Retry.Delay = TimeSpan.FromSeconds(2);

    // The standard handler's defaults (30s total timeout) are tuned for short
    // requests and would cancel large scans regardless of HttpClient.Timeout.
    // Note: the circuit breaker's sampling window must be at least twice the
    // attempt timeout, or options validation fails at startup.
    options.AttemptTimeout.Timeout = TimeSpan.FromMinutes(2);
    options.TotalRequestTimeout.Timeout = TimeSpan.FromMinutes(5);
    options.CircuitBreaker.SamplingDuration = TimeSpan.FromMinutes(4);
});

OCR Worker Integration

The OCR worker listens for document upload events and processes them:

// Workers/OcrWorkerService.cs
namespace MyApp.Workers;

public sealed class OcrWorkerService : BackgroundService
{
    private readonly INatsConnection _nats;
    private readonly IOcrService _ocrService;
    private readonly IObjectStorage _storage;
    private readonly ILogger<OcrWorkerService> _logger;
    private readonly string _environment;

    public OcrWorkerService(
        INatsConnection nats,
        IOcrService ocrService,
        IObjectStorage storage,
        IConfiguration configuration,
        ILogger<OcrWorkerService> logger)
    {
        _nats = nats;
        _ocrService = ocrService;
        _storage = storage;
        _logger = logger;
        _environment = configuration["Environment"] ?? "dev";
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var subject = $"{_environment}.archives.documents.object.uploaded";

        _logger.LogInformation("OCR Worker subscribing to {Subject}", subject);

        // Ack/Nak semantics require a JetStream consumer, not a core NATS
        // subscription. This assumes a "DOCUMENTS" stream capturing the
        // subject above; adjust the stream and durable names to your setup.
        var js = new NatsJSContext(_nats);
        var consumer = await js.CreateOrUpdateConsumerAsync(
            "DOCUMENTS",
            new ConsumerConfig("ocr-worker") { FilterSubject = subject },
            stoppingToken);

        await foreach (var msg in consumer.ConsumeAsync<DocumentProcessingRequestedEvent>(
            cancellationToken: stoppingToken))
        {
            try
            {
                await ProcessDocumentAsync(msg.Data!, stoppingToken);
                await msg.AckAsync(cancellationToken: stoppingToken);
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Failed to process document {DocumentId}",
                    msg.Data?.DocumentId);
                await msg.NakAsync(cancellationToken: stoppingToken);
            }
        }
    }

    private async Task ProcessDocumentAsync(
        DocumentProcessingRequestedEvent evt,
        CancellationToken ct)
    {
        _logger.LogInformation(
            "Processing OCR for document {DocumentId} from bucket {Bucket}",
            evt.DocumentId, evt.Bucket);

        // 1. Download document from MinIO
        var objectPath = $"original/{evt.DocumentId}{evt.FileExtension}";
        await using var documentStream = await _storage.GetObjectAsync(
            evt.Bucket, objectPath, ct);

        // 2. Process with Docling OCR
        var result = await _ocrService.ProcessDocumentAsync(
            documentStream,
            evt.FileName,
            ct);

        // 3. Store extracted content in MinIO
        var contentPath = $"processed/{evt.DocumentId}/content.md";
        await using var contentStream = new MemoryStream(
            Encoding.UTF8.GetBytes(result.Content));

        await _storage.PutObjectAsync(
            evt.Bucket,
            contentPath,
            contentStream,
            "text/markdown",
            ct);

        // 4. Store table data if present
        if (result.Tables?.Any() == true)
        {
            var tablesJson = JsonSerializer.Serialize(result.Tables);
            var tablesPath = $"processed/{evt.DocumentId}/tables.json";
            await using var tablesStream = new MemoryStream(
                Encoding.UTF8.GetBytes(tablesJson));

            await _storage.PutObjectAsync(
                evt.Bucket,
                tablesPath,
                tablesStream,
                "application/json",
                ct);
        }

        // 5. Publish completion event
        await _nats.PublishAsync(
            $"{_environment}.archives.documents.ocr.completed",
            new OcrCompletedEvent
            {
                DocumentId = evt.DocumentId,
                Bucket = evt.Bucket,
                UserId = evt.UserId,
                PageCount = result.PageCount,
                TableCount = result.Tables?.Count ?? 0,
                ContentPath = contentPath,
                ProcessedAt = DateTimeOffset.UtcNow
            },
            cancellationToken: ct);

        _logger.LogInformation(
            "OCR completed for {DocumentId}: {Pages} pages, {Tables} tables",
            evt.DocumentId, result.PageCount, result.Tables?.Count ?? 0);
    }
}

Event Schemas

Define strongly-typed events for the pipeline:

// Application/Events/DocumentEvents.cs
namespace MyApp.Application.Events;

public sealed record DocumentProcessingRequestedEvent
{
    public required string DocumentId { get; init; }
    public required string Bucket { get; init; }
    public required string UserId { get; init; }
    public required string FileName { get; init; }
    public required string FileExtension { get; init; }
    public required long FileSize { get; init; }
    public required string ContentType { get; init; }
    public required DateTimeOffset UploadedAt { get; init; }
}

public sealed record OcrCompletedEvent
{
    public required string DocumentId { get; init; }
    public required string Bucket { get; init; }
    public required string UserId { get; init; }
    public required int PageCount { get; init; }
    public required int TableCount { get; init; }
    public required string ContentPath { get; init; }
    public required DateTimeOffset ProcessedAt { get; init; }
}
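
Since these events cross the .NET/Python boundary, mirroring the contract on the Python side can catch drift early. This sketch assumes snake_case JSON field names on the wire (matching the serializer configuration used for the Docling responses):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrCompletedEvent:
    document_id: str
    bucket: str
    user_id: str
    page_count: int
    table_count: int
    content_path: str
    processed_at: str  # ISO-8601 timestamp

    @classmethod
    def from_dict(cls, payload: dict) -> "OcrCompletedEvent":
        # Fails loudly on missing fields instead of silently defaulting.
        return cls(**{f: payload[f] for f in cls.__dataclass_fields__})
```

A round-trip test that feeds a captured .NET payload through `from_dict` is a cheap guard against one side renaming a field.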

Error Handling and Retries

Add robust error handling for OCR failures:

// Infrastructure/Ocr/OcrRetryPolicy.cs
public static class OcrRetryPolicy
{
    public static AsyncRetryPolicy<OcrResult> Create(ILogger logger)
    {
        return Policy<OcrResult>
            .Handle<HttpRequestException>()
            .Or<TaskCanceledException>()
            .Or<OcrException>(ex => !ex.Message.Contains("Unsupported"))
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: attempt =>
                    TimeSpan.FromSeconds(Math.Pow(2, attempt)),
                onRetry: (outcome, timeSpan, attempt, _) =>
                {
                    logger.LogWarning(
                        "OCR retry {Attempt} after {Delay}s due to: {Error}",
                        attempt,
                        timeSpan.TotalSeconds,
                        outcome.Exception?.Message);
                });
    }
}
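
For intuition, the delay schedule produced by the policy above (`Math.Pow(2, attempt)` seconds) can be sketched directly:

```python
def backoff_delays(retries: int = 3, base: float = 2.0) -> list[float]:
    # Mirrors the Polly sleepDurationProvider: 2s, 4s, 8s for three retries.
    return [base ** attempt for attempt in range(1, retries + 1)]
```

In production you would typically add jitter on top of this schedule so that many failed workers don't retry in lockstep.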

Health Checks

Add Docling to your health check suite:

// Infrastructure/HealthChecks/DoclingHealthCheck.cs
public class DoclingHealthCheck : IHealthCheck
{
    private readonly HttpClient _httpClient;

    public DoclingHealthCheck(IHttpClientFactory factory)
    {
        _httpClient = factory.CreateClient("Docling");
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken ct = default)
    {
        try
        {
            var response = await _httpClient.GetAsync("/health", ct);

            if (response.IsSuccessStatusCode)
            {
                return HealthCheckResult.Healthy("Docling OCR is available");
            }

            return HealthCheckResult.Degraded(
                $"Docling returned {response.StatusCode}");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy(
                "Cannot reach Docling OCR service",
                exception: ex);
        }
    }
}

// Registration — the health check resolves a named client, so register it
// with the same base address as the typed client
services.AddHttpClient("Docling", client =>
    client.BaseAddress = new Uri(configuration["Ocr:DoclingUrl"]
        ?? "http://docling-ocr.ai.svc.cluster.local"));

services.AddHealthChecks()
    .AddCheck<DoclingHealthCheck>("docling-ocr");

Testing the Integration

Write integration tests to verify OCR processing:

// Tests/Integration/DoclingOcrServiceTests.cs
public class DoclingOcrServiceTests : IClassFixture<DoclingFixture>
{
    private readonly IOcrService _ocrService;

    public DoclingOcrServiceTests(DoclingFixture fixture)
    {
        _ocrService = fixture.OcrService;
    }

    [Fact]
    public async Task ProcessDocument_WithValidPdf_ReturnsMarkdown()
    {
        // Arrange
        var pdfPath = Path.Combine("TestFiles", "sample-invoice.pdf");
        await using var stream = File.OpenRead(pdfPath);

        // Act
        var result = await _ocrService.ProcessDocumentAsync(
            stream, "sample-invoice.pdf");

        // Assert
        Assert.NotEmpty(result.Content);
        Assert.Contains("Invoice", result.Content);
        Assert.True(result.PageCount >= 1);
    }

    [Fact]
    public async Task ProcessDocument_WithTable_ExtractsTables()
    {
        // Arrange
        var pdfPath = Path.Combine("TestFiles", "sample-with-table.pdf");
        await using var stream = File.OpenRead(pdfPath);

        // Act
        var result = await _ocrService.ProcessDocumentAsync(
            stream, "sample-with-table.pdf");

        // Assert
        Assert.NotNull(result.Tables);
        Assert.NotEmpty(result.Tables);
        Assert.Contains("|", result.Tables[0].Markdown); // Markdown table
    }
}

Performance Optimization

Batch Processing

For high-volume scenarios, batch documents:

public async Task ProcessBatchAsync(
    IEnumerable<DocumentToProcess> documents,
    CancellationToken ct)
{
    var semaphore = new SemaphoreSlim(initialCount: 3); // at most 3 concurrent OCR calls

    var tasks = documents.Select(async doc =>
    {
        await semaphore.WaitAsync(ct);
        try
        {
            return await ProcessDocumentAsync(doc, ct);
        }
        finally
        {
            semaphore.Release();
        }
    });

    await Task.WhenAll(tasks);
}
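
The same bounded-concurrency pattern applies on the Python side if you ever batch inside the service itself; sketched with asyncio (names are illustrative):

```python
import asyncio


async def process_batch(docs, worker, max_concurrency: int = 3):
    # Bound concurrent OCR calls so the Docling pods aren't overwhelmed.
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(doc):
        async with sem:
            return await worker(doc)

    # gather preserves input order even though completion order varies.
    return await asyncio.gather(*(run_one(d) for d in docs))
```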

Caching Document Fingerprints

Skip re-processing unchanged documents:

public async Task<bool> ShouldProcessAsync(
    string bucket,
    string documentId,
    string fingerprint,
    CancellationToken ct)
{
    var existingFingerprint = await _storage.GetMetadataAsync(
        bucket,
        $"processed/{documentId}/fingerprint",
        ct);

    return existingFingerprint != fingerprint;
}
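
On the processing side, the fingerprint itself can be a plain SHA-256 over the object bytes, computed at upload time (a sketch; where you store it is up to your metadata layer):

```python
import hashlib


def fingerprint(data: bytes) -> str:
    # Content-addressed: the same scan uploaded twice yields the same hash,
    # so the pipeline can skip a second OCR pass.
    return hashlib.sha256(data).hexdigest()
```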

Conclusion

Building this Docling OCR integration taught me that document processing is deceptively complex — the gap between “extract text from a PDF” and “reliably extract structured text from any document format” required solving problems at every layer of the stack, from Python-to-.NET serialization to Kubernetes readiness probes to image preprocessing. The investment in a clean microservice boundary between .NET and Python paid for itself quickly: the OCR service can be scaled, updated, and monitored independently, and the typed HTTP client makes the integration feel native to .NET developers on the team.

Key Takeaways

  1. Docling Service — Containerized Python API with FastAPI
  2. Kubernetes Deployment — Scalable service in the ai namespace with proper readiness probes
  3. HTTP Client — Typed client with resilience policies and multipart uploads
  4. OCR Worker — Event-driven document processing via NATS
  5. Health Checks — Monitoring integration for production reliability

Docling’s layout-aware extraction produces clean Markdown that’s ideal for:

  • Semantic search with embeddings
  • RAG pipelines with structured chunks
  • Document summarization
  • Knowledge extraction

Next Steps

Next in the series, we’ll build the embedding pipeline that converts OCR output into vector representations for semantic search.

Further Reading

[Document AI Overview] — Google Cloud
[OCR Benchmark: Evaluating Modern OCR Systems] — OCR-D Project