🤖 AI/ML Intermediate ⏱️ 16 min

NER with spaCy for Document Analysis

Extract named entities from documents using spaCy NLP service for building knowledge graphs and improving search relevance.

By Victor Robin

Introduction

Named Entity Recognition (NER) is a natural language processing technique that identifies and classifies named entities in text—people, organizations, locations, dates, and more. For a document management system, NER transforms unstructured text into structured, queryable data.

Why NER matters for document analysis:

  • Searchability: Find all documents mentioning “Acme Corporation” without relying on exact text matches
  • Knowledge Graphs: Automatically discover relationships between entities across documents
  • Faceted Search: Filter search results by entity type (show only documents mentioning people)
  • Document Clustering: Group related documents by shared entities

What We’ll Build

In this guide, we’ll:

  1. Deploy a spaCy NER service as a containerized microservice
  2. Create a .NET client with retry policies and circuit breakers
  3. Integrate NER into our document processing pipeline via NATS events
  4. Store extracted entities for graph queries and search facets

Architecture Overview

Before diving into code, let’s understand how NER fits into the broader document processing pipeline:

flowchart TB
    subgraph Pipeline["🔬 NER Processing Pipeline"]
        Doc["📄 Document Content\n(Markdown)"] --> Chunk["✂️ Chunking\nSplit into sections"]
        
        subgraph Spacy["🐍 spaCy NER Service"]
            Token["🔤 Tokenizer"] --> Tag["🏷️ Tagger"] --> NER["🎯 NER"]
            Types["Entity Types:\nPERSON, ORG, GPE,\nDATE, MONEY, etc."]
        end
        
        Chunk --> Spacy
        Spacy --> Entities["📋 Extracted Entities\n• PERSON: John Doe\n• ORG: Acme Corp"]
        
        Entities --> Falkor["🔴 FalkorDB\n(Graph)"]
        Entities --> Qdrant["🔮 Qdrant\n(Metadata)"]
        Entities --> PG["🐘 PostgreSQL\n(Facets)"]
    end

    classDef primary fill:#7c3aed,color:#fff
    classDef secondary fill:#06b6d4,color:#fff
    classDef db fill:#f43f5e,color:#fff

    class Pipeline primary
    class Spacy secondary
    class Falkor,Qdrant,PG db

The flow:

  1. After OCR completes, document content is chunked into manageable sections
  2. Each chunk is sent to the spaCy NER service for entity extraction
  3. Extracted entities are stored in three places:
    • FalkorDB: Graph database for relationship queries (“Who works at Company X?”)
    • Qdrant: Vector metadata for filtered semantic search
    • PostgreSQL: Facet counts for UI filtering

Deploying the spaCy Service

We’ll run spaCy as a dedicated microservice for several reasons: spaCy’s transformer models require significant memory (2-4GB), Python’s GIL limits concurrency, and isolating NER allows independent scaling.

The Dockerfile

# services/spacy-ner/Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Download the transformer-based model (~500MB)
# This provides much better accuracy than the smaller models
RUN python -m spacy download en_core_web_trf

COPY app.py .

EXPOSE 8080

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

We use en_core_web_trf—the transformer-based model—because accuracy matters more than speed for our use case. A misidentified entity pollutes the knowledge graph; a slower extraction is merely inconvenient.

FastAPI Application

The service exposes a simple REST API for entity extraction:

# services/spacy-ner/app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import spacy
import time
from typing import List, Optional
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load transformer-based model for better accuracy
# This happens once at startup and is reused for all requests
nlp = spacy.load("en_core_web_trf")

app = FastAPI(title="spaCy NER Service")

The model is loaded once at startup—this takes several seconds but ensures subsequent requests are fast.

Request and response models:

class NERRequest(BaseModel):
    text: str
    entity_types: Optional[List[str]] = None  # Filter to specific types
    min_confidence: float = 0.5               # Threshold for inclusion

class Entity(BaseModel):
    text: str           # The entity text ("John Doe")
    label: str          # spaCy label ("PERSON")
    start: int          # Character offset in original text
    end: int            # End character offset
    confidence: float   # Estimated confidence score

class NERResponse(BaseModel):
    entities: List[Entity]
    text_length: int
    processing_time_ms: float

The extraction endpoint:

@app.post("/extract", response_model=NERResponse)
async def extract_entities(request: NERRequest):
    start_time = time.time()
    
    # Guard against excessive input
    if len(request.text) > 100000:
        raise HTTPException(
            status_code=400, 
            detail="Text too long. Maximum 100,000 characters."
        )
    
    # Run the spaCy pipeline. This call is CPU-bound and blocks the event
    # loop; for concurrent traffic, run multiple uvicorn workers (each
    # loads its own copy of the model).
    doc = nlp(request.text)
    
    entities = []
    for ent in doc.ents:
        # Filter by entity type if specified
        if request.entity_types and ent.label_ not in request.entity_types:
            continue
        
        # spaCy doesn't provide confidence scores directly,
        # so we use a heuristic based on entity characteristics
        confidence = min(0.9, 0.6 + (len(ent.text) / 50))
        
        if confidence >= request.min_confidence:
            entities.append(Entity(
                text=ent.text,
                label=ent.label_,
                start=ent.start_char,
                end=ent.end_char,
                confidence=round(confidence, 2)
            ))
    
    processing_time = (time.time() - start_time) * 1000
    logger.info(f"Extracted {len(entities)} entities in {processing_time:.2f}ms")
    
    return NERResponse(
        entities=entities,
        text_length=len(request.text),
        processing_time_ms=round(processing_time, 2)
    )

Supporting endpoints for health checks and entity type discovery:

@app.get("/health")
async def health():
    return {"status": "healthy", "model": "en_core_web_trf"}

@app.get("/entity-types")
async def entity_types():
    """Return all supported spaCy entity types with descriptions"""
    return {
        "types": [
            {"label": "PERSON", "description": "People, including fictional"},
            {"label": "ORG", "description": "Companies, agencies, institutions"},
            {"label": "GPE", "description": "Countries, cities, states"},
            {"label": "LOC", "description": "Non-GPE locations"},
            {"label": "DATE", "description": "Absolute or relative dates"},
            {"label": "TIME", "description": "Times smaller than a day"},
            {"label": "MONEY", "description": "Monetary values"},
            {"label": "PERCENT", "description": "Percentage values"},
            {"label": "EVENT", "description": "Named events"},
            {"label": "PRODUCT", "description": "Products"},
            {"label": "LAW", "description": "Laws and legal documents"},
            {"label": "WORK_OF_ART", "description": "Titles of creative works"}
        ]
    }

Kubernetes Deployment

The NER service needs more resources than typical web services due to the transformer model. The manifest below assumes a matching ClusterIP Service (spacy-ner in the ai namespace), which is what lets clients reach the pods at spacy-ner.ai.svc.cluster.local:8080:

# apps/spacy-ner/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spacy-ner
  namespace: ai
spec:
  replicas: 2  # Multiple replicas for availability
  selector:
    matchLabels:
      app: spacy-ner
  template:
    metadata:
      labels:
        app: spacy-ner
    spec:
      containers:
        - name: spacy-ner
          image: registry.bluerobin.local/spacy-ner:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "1Gi"   # Minimum for stable operation
              cpu: "500m"
            limits:
              memory: "4Gi"   # Transformer model needs headroom
              cpu: "2000m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30  # Model loading takes time
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60  # Allow for model warmup
            periodSeconds: 30
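
Once the pods are ready, it's worth a quick smoke test before wiring up the full client. A throwaway C# sketch (assuming kubectl -n ai port-forward svc/spacy-ner 8080:8080, and that the Service noted above exists):

// SmokeTest.cs (sketch) — top-level program
using System.Net.Http.Json;

using var http = new HttpClient { BaseAddress = new Uri("http://localhost:8080") };

// Field names match the Python NERRequest model (snake_case)
var response = await http.PostAsJsonAsync("/extract", new
{
    text = "Acme Corporation hired John Doe in March 2024.",
    min_confidence = 0.5
});
response.EnsureSuccessStatusCode();

// Expect PERSON, ORG, and DATE entities in the JSON payload
Console.WriteLine(await response.Content.ReadAsStringAsync());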

.NET Client Integration

Now let’s build the .NET client that calls our spaCy service. We’ll add resilience patterns to handle transient failures gracefully.

The NER Client

// Infrastructure/Ai/SpacyNerClient.cs
public sealed class SpacyNerClient : INerService
{
    private readonly HttpClient _http;
    private readonly ILogger<SpacyNerClient> _logger;
    
    public SpacyNerClient(
        HttpClient http,
        ILogger<SpacyNerClient> logger)
    {
        _http = http;
        _logger = logger;
    }
    
    public async Task<IReadOnlyList<ExtractedEntity>> ExtractAsync(
        string text,
        NerOptions? options = null,
        CancellationToken ct = default)
    {
        options ??= NerOptions.Default;
        
        var request = new NerRequest
        {
            Text = text,
            EntityTypes = options.EntityTypes,
            MinConfidence = options.MinConfidence
        };
        
        var response = await _http.PostAsJsonAsync("/extract", request, ct);
        response.EnsureSuccessStatusCode();
        
        var result = await response.Content
            .ReadFromJsonAsync<NerResponse>(ct);
        
        if (result == null)
        {
            return [];
        }
        
        _logger.LogDebug(
            "Extracted {Count} entities in {Time}ms",
            result.Entities.Count,
            result.ProcessingTimeMs);
        
        // Map spaCy labels to our domain model
        return result.Entities.Select(e => new ExtractedEntity
        {
            Value = e.Text,
            Type = MapEntityType(e.Label),
            Confidence = e.Confidence,
            StartPosition = e.Start,
            EndPosition = e.End
        }).ToList();
    }
    
    /// <summary>
    /// Map spaCy's label strings to our strongly-typed EntityType enum.
    /// This insulates our domain model from spaCy's naming conventions.
    /// </summary>
    private static EntityType MapEntityType(string label) => label switch
    {
        "PERSON" => EntityType.Person,
        "ORG" => EntityType.Organization,
        "GPE" or "LOC" => EntityType.Location,  // Combine geo-political and location
        "DATE" or "TIME" => EntityType.Date,
        "MONEY" or "PERCENT" => EntityType.Numeric,
        "EVENT" => EntityType.Event,
        "PRODUCT" => EntityType.Product,
        "LAW" => EntityType.Legal,
        "WORK_OF_ART" => EntityType.CreativeWork,
        _ => EntityType.Other
    };
}

Supporting types for the client:

public sealed record NerOptions
{
    public IReadOnlyList<string>? EntityTypes { get; init; }
    public float MinConfidence { get; init; } = 0.5f;
    
    public static NerOptions Default { get; } = new();
    
    // Preset for common use case
    public static NerOptions PeopleAndOrgs { get; } = new()
    {
        EntityTypes = ["PERSON", "ORG"]
    };
}

public sealed record ExtractedEntity
{
    public required string Value { get; init; }
    public required EntityType Type { get; init; }
    public required float Confidence { get; init; }
    public int StartPosition { get; init; }
    public int EndPosition { get; init; }
}

public enum EntityType
{
    Person,
    Organization,
    Location,
    Date,
    Numeric,
    Event,
    Product,
    Legal,
    CreativeWork,
    Other
}
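
The SpacyNerClient above also needs wire-level DTOs that mirror the FastAPI models. The service speaks snake_case, so we map field names explicitly with JsonPropertyName. NerRequest and NerResponse are the names the client code references; the element type (EntityDto here) is our own naming. A minimal sketch:

// Infrastructure/Ai/NerContracts.cs
using System.Text.Json.Serialization;

public sealed record NerRequest
{
    [JsonPropertyName("text")]
    public required string Text { get; init; }

    [JsonPropertyName("entity_types")]
    public IReadOnlyList<string>? EntityTypes { get; init; }

    [JsonPropertyName("min_confidence")]
    public float MinConfidence { get; init; } = 0.5f;
}

public sealed record EntityDto
{
    [JsonPropertyName("text")]
    public required string Text { get; init; }

    [JsonPropertyName("label")]
    public required string Label { get; init; }

    [JsonPropertyName("start")]
    public int Start { get; init; }

    [JsonPropertyName("end")]
    public int End { get; init; }

    [JsonPropertyName("confidence")]
    public float Confidence { get; init; }
}

public sealed record NerResponse
{
    [JsonPropertyName("entities")]
    public required IReadOnlyList<EntityDto> Entities { get; init; }

    [JsonPropertyName("text_length")]
    public int TextLength { get; init; }

    [JsonPropertyName("processing_time_ms")]
    public float ProcessingTimeMs { get; init; }
}

With these in place, a call like await ner.ExtractAsync(text, NerOptions.PeopleAndOrgs) round-trips cleanly through System.Text.Json.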

Dependency Injection with Resilience

We configure the HttpClient with Polly policies for retry and circuit breaking (AddPolicyHandler comes from the Microsoft.Extensions.Http.Polly package; HttpPolicyExtensions from Polly.Extensions.Http):

// Infrastructure/DependencyInjection.cs
public static IServiceCollection AddNerServices(
    this IServiceCollection services,
    IConfiguration configuration)
{
    services.AddHttpClient<INerService, SpacyNerClient>(client =>
    {
        client.BaseAddress = new Uri(
            configuration["SpacyNer:Endpoint"] 
            ?? "http://spacy-ner.ai.svc.cluster.local:8080");
        client.Timeout = TimeSpan.FromSeconds(30);
    })
    .AddPolicyHandler(GetRetryPolicy())
    .AddPolicyHandler(GetCircuitBreakerPolicy());
    
    return services;
}

/// <summary>
/// Exponential backoff retry: wait 2s, 4s, 8s between attempts.
/// Handles transient HTTP errors and 5xx responses.
/// </summary>
private static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
    return HttpPolicyExtensions
        .HandleTransientHttpError()
        .WaitAndRetryAsync(3, retryAttempt => 
            TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)));
}

/// <summary>
/// Circuit breaker: after 5 failures, stop calling for 30 seconds.
/// Prevents cascade failures when the NER service is down.
/// </summary>
private static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
    return HttpPolicyExtensions
        .HandleTransientHttpError()
        .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));
}
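
With the extension method in place, host wiring is a single call. A sketch, assuming a standard .NET 8 Program.cs:

// Program.cs (sketch)
var builder = Host.CreateApplicationBuilder(args);

// Registers the typed HttpClient plus retry and circuit-breaker policies;
// override the endpoint via the SpacyNer:Endpoint configuration key.
builder.Services.AddNerServices(builder.Configuration);

await builder.Build().RunAsync();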

Document Processing Integration

The final piece: a background worker that listens for document analysis events and triggers entity extraction.

// Workers/EntityExtractionWorker.cs
public sealed class EntityExtractionWorker : BackgroundService
{
    private readonly INatsConnection _nats;
    private readonly INerService _ner;
    private readonly IEntityRelationshipService _relationships;
    private readonly IDocumentRepository _documents;
    private readonly IConfiguration _config;
    private readonly ILogger<EntityExtractionWorker> _logger;
    
    public EntityExtractionWorker(
        INatsConnection nats,
        INerService ner,
        IEntityRelationshipService relationships,
        IDocumentRepository documents,
        IConfiguration config,
        ILogger<EntityExtractionWorker> logger)
    {
        _nats = nats;
        _ner = ner;
        _relationships = relationships;
        _documents = documents;
        _config = config;
        _logger = logger;
    }
    
    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        // Subscribe to the environment-prefixed subject.
        // Ack/Nak semantics assume this subscription is backed by a
        // JetStream consumer (core NATS messages cannot be acknowledged).
        var env = _config["Environment"] ?? "dev";
        var subject = $"{env}.archives.documents.analysis.completed";
        
        await foreach (var msg in _nats.SubscribeAsync<AnalysisCompletedEvent>(
            subject, cancellationToken: ct))
        {
            try
            {
                await ProcessAsync(msg.Data!, ct);
                await msg.AckAsync(cancellationToken: ct);
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, 
                    "Failed to extract entities for {DocumentId}",
                    msg.Data?.DocumentId);
                // NAK will trigger redelivery after delay
                await msg.NakAsync(cancellationToken: ct);
            }
        }
    }
    
    private async Task ProcessAsync(
        AnalysisCompletedEvent evt, 
        CancellationToken ct)
    {
        var document = await _documents.GetByIdAsync(
            BlueRobinId.From(evt.DocumentId), ct);
        
        if (document == null) return;
        
        // Extract entities from the full document content
        var entities = await _ner.ExtractAsync(
            evt.Content,
            NerOptions.Default,
            ct);
        
        _logger.LogInformation(
            "Extracted {Count} entities from document {DocumentId}",
            entities.Count,
            evt.DocumentId);
        
        // Store entities in the graph database for relationship queries
        await _relationships.AddDocumentEntitiesAsync(
            document.Id,
            document.OwnerId,
            entities,
            ct);
        
        // Update document with aggregated entity counts for faceted search
        document.UpdateEntityCounts(
            entities.Count(e => e.Type == EntityType.Person),
            entities.Count(e => e.Type == EntityType.Organization),
            entities.Count(e => e.Type == EntityType.Location));
        
        await _documents.UpdateAsync(document, ct);
    }
}
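
The worker depends on an event contract that isn't shown elsewhere in this guide. A minimal sketch of what it needs (the real event may well carry more fields), plus the hosted-service registration:

// Events/AnalysisCompletedEvent.cs (sketch)
public sealed record AnalysisCompletedEvent
{
    public required string DocumentId { get; init; }
    public required string Content { get; init; }  // Markdown produced by OCR
}

// Program.cs
builder.Services.AddHostedService<EntityExtractionWorker>();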

Conclusion

We’ve built a complete NER pipeline that transforms unstructured document text into structured, queryable entity data:

Component       Purpose
spaCy Service   Transformer-based NER with 12+ entity types
.NET Client     Resilient HTTP client with retry and circuit breaker
Worker          Event-driven integration via NATS
Storage         Graph (relationships), vector (metadata), relational (facets)

Key takeaways:

  • Isolate NLP services: Python NLP models have different resource needs than .NET APIs
  • Use transformer models: The accuracy improvement is worth the memory cost
  • Build resilience: NER failures shouldn’t crash your document pipeline
  • Store strategically: Different storage backends serve different query patterns

With entities extracted and stored in FalkorDB, you can now build powerful features like “find all documents where Person A and Company B are mentioned together” or “show the relationship network for this person across all my documents.”
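
As a concrete taste, here is a hedged sketch of such a co-mention query issued to FalkorDB via StackExchange.Redis. The connection address, graph name, and the (:Document)-[:MENTIONS]->(:Entity) schema are assumptions for illustration only — use whatever shape IEntityRelationshipService actually writes:

// Sketch: documents mentioning both a person and an organization.
using StackExchange.Redis;

var redis = await ConnectionMultiplexer.ConnectAsync("falkordb:6379");
var db = redis.GetDatabase();

// FalkorDB inherits GRAPH.QUERY from RedisGraph; parameters are passed
// in a CYPHER preamble ahead of the query text.
const string cypher =
    "CYPHER person='John Doe' org='Acme Corp' " +
    "MATCH (d:Document)-[:MENTIONS]->(:Entity {type: 'Person', value: $person}), " +
    "      (d)-[:MENTIONS]->(:Entity {type: 'Organization', value: $org}) " +
    "RETURN d.id";

var result = (RedisResult[])(await db.ExecuteAsync("GRAPH.QUERY", "entities", cypher))!;
// result[0] = column headers, result[1] = result rows, result[2] = query stats.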

[spaCy NER Documentation] — Explosion AI