Backend · Advanced · 12 min

Latency Revolution: Optimizing 60s to 3s

How we slashed system latency by 95% by moving from sequential HTTP calls to parallel NATS requests, implementing Redis caching, and tuning Qdrant vector search.

By Victor Robin

When I first watched our system take nearly a minute to process a single document upload, I knew something was fundamentally wrong with our architecture. The frustrating part was that each individual service was fast on its own — it was the way we orchestrated them that created a cascading latency nightmare. What followed was a week-long deep dive into profiling, parallelization, and caching that taught me more about distributed systems performance than any textbook could. This article is the story of how we turned a 60-second embarrassment into a 3-second success.

Introduction

In distributed systems, latency isn’t just a number—it’s the user experience. When we started, processing a complex document upload with OCR, embedding generation, and metadata extraction took nearly 60 seconds. This “coffee break” delay was unacceptable for a real-time archive system.

By auditing our architecture and making targeted changes to our communication patterns and data access strategies, we reduced this end-to-end process to under 3 seconds.

Why Optimization Matters:

  • User Trust: Validating a document upload instantly builds confidence in the system.
  • Resource Efficiency: Holding connections open for 60 seconds wastes thread pool resources and memory.
  • Scalability: Sequential processing creates backpressure that chokes the system under load.

What We’ll Build

In this retrospective guide, we will walk through the three key optimizations that revolutionized our performance:

  1. Parallelization: Replacing sequential HTTP orchestration with NATS JetStream fan-out patterns.
  2. Caching: Determining what to cache in Redis to spare the primary database.
  3. Vector Tuning: Optimizing Qdrant HNSW parameters for trade-offs between recall and speed.

Architecture Overview

To achieve sub-3-second latency, we optimized the read path by introducing aggressive caching layers before hitting the primary database.

flowchart LR
    API[API] --> Check{Check Cache}
    Check -->|Hit| Return[Return Result]
    Check -->|Miss| DB[(Postgres)]
    DB --> Write[Write to Cache]
    Write --> Return

    classDef primary fill:#7c3aed,color:#fff
    classDef secondary fill:#06b6d4,color:#fff
    classDef db fill:#f43f5e,color:#fff
    classDef warning fill:#fbbf24,color:#000

    class API,Check,Return primary
    class Write secondary
    class DB db
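The read path above is a classic cache-aside pattern. Here is a minimal sketch, with an in-memory dict standing in for Redis and a function standing in for the Postgres query (all names are illustrative, not our production code):

```python
# Cache-aside read path: check the cache, fall back to the database,
# then populate the cache so the next read is a hit.
cache: dict[str, str] = {}  # stand-in for Redis

def query_postgres(doc_id: str) -> str:
    # stand-in for the real database read
    return f"document-{doc_id}"

def get_document(doc_id: str) -> str:
    if doc_id in cache:              # Hit: return straight from the cache
        return cache[doc_id]
    value = query_postgres(doc_id)   # Miss: go to Postgres
    cache[doc_id] = value            # Write to cache for next time
    return value

first = get_document("42")   # miss -> served by Postgres
second = get_document("42")  # hit -> served from cache
```

Only the miss path pays the database round-trip; every subsequent read for the same key is a memory lookup.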

Phase 1: The Bottleneck of Sequential HTTP

Our initial MVP used a familiar pattern: a central API controller that orchestrated the entire pipeline. It would upload to MinIO, then call the OCR service, wait for a response, call the Embedding service, wait again, and finally save to Postgres.

This synchronous HTTP chain was the primary culprit.

[Release It! Design and Deploy Production-Ready Software] — Michael Nygard, 2018
sequenceDiagram
    participant User
    participant API as API (Orchestrator)
    participant OCR
    participant Embed as Embeddings
    participant DB

    User->>API: Upload Document
    activate API
    note right of API: Timer Starts
    API->>OCR: POST /process (20s)
    OCR-->>API: Result
    API->>Embed: POST /vectorize (15s)
    Embed-->>API: Vectors
    API->>DB: INSERT Metadata (50ms)
    DB-->>API: ID
    note right of API: Timer Ends (~35s+)
    API-->>User: 200 OK
    deactivate API

The Fix: NATS JetStream Fan-Out

We moved to an event-driven architecture using NATS JetStream. The API now simply uploads the raw file and publishes a document.uploaded event.

Multiple workers (OCR, Analysis) subscribe to this event and process it in parallel.

[NATS JetStream Documentation] — Synadia, 2024
flowchart LR
    API[API Service]
    NATS((NATS JetStream))
    OCR[OCR Worker]
    Embed[Embedding Worker]
    DB[(PostgreSQL)]

    classDef primary fill:#7c3aed,color:#fff
    classDef secondary fill:#06b6d4,color:#fff
    classDef db fill:#f43f5e,color:#fff
    classDef warning fill:#fbbf24,color:#000

    class API,NATS,OCR,Embed primary
    class DB db

    API -->|Pub: document.uploaded| NATS
    NATS -->|Sub| OCR
    NATS -->|Sub| Embed
    OCR -->|Write| DB
    Embed -->|Write| DB
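The latency win comes from replacing a sum of stage times with their maximum. Here is a toy asyncio simulation of the two orchestration styles, scaled down to fractions of a second (0.20 and 0.15 stand in for the 20s OCR and 15s embedding stages):

```python
import asyncio
import time

async def stage(name: str, seconds: float) -> str:
    # stand-in for a worker processing a document
    await asyncio.sleep(seconds)
    return name

async def sequential() -> float:
    # old style: the orchestrator awaits each service in turn
    start = time.perf_counter()
    await stage("ocr", 0.20)
    await stage("embed", 0.15)
    return time.perf_counter() - start

async def fan_out() -> float:
    # new style: both workers consume the event concurrently
    start = time.perf_counter()
    await asyncio.gather(stage("ocr", 0.20), stage("embed", 0.15))
    return time.perf_counter() - start

seq = asyncio.run(sequential())
par = asyncio.run(fan_out())
print(f"sequential ~{seq:.2f}s, fan-out ~{par:.2f}s")
```

Sequential time is roughly 0.20 + 0.15, while the fan-out finishes in roughly max(0.20, 0.15) — the same shape as 35s collapsing to ~20s at full scale.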

Phase 2: Caching Expensive Lookups

Profiling revealed that during high-traffic ingestion, we were slamming the database with repeated queries for Tag and Category existence checks. For every page of a document, we were checking if tags existed before inserting.

[Caching Best Practices and Patterns] — Microsoft, 2024

We implemented a Write-Through Caching strategy using Redis.

Implementation

Instead of SELECT id FROM tags WHERE name = @name, we check Redis first.

public async Task<Guid> GetOrCreateTagIdAsync(string tagName)
{
    // Normalize once so the cache key and the DB lookup agree on casing
    var normalized = tagName.ToLowerInvariant();
    var cacheKey = $"tag:{normalized}";

    // 1. Fast path: Redis
    var cachedId = await _cache.GetStringAsync(cacheKey);
    if (cachedId != null) return Guid.Parse(cachedId);

    // 2. Slow path: Postgres
    var tag = await _dbContext.Tags.FirstOrDefaultAsync(t => t.Name == normalized);
    if (tag == null)
    {
        // Concurrent workers can race here; a unique index on Name plus
        // handling the duplicate-key exception keeps this safe.
        tag = new Tag(normalized);
        _dbContext.Tags.Add(tag);
        await _dbContext.SaveChangesAsync();
    }

    // 3. Cache for next time (1 hour expiration)
    await _cache.SetStringAsync(cacheKey, tag.Id.ToString(),
        new DistributedCacheEntryOptions { AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1) });

    return tag.Id;
}

This simple change reduced database CPU usage by 40% during bulk uploads.

Phase 3: Tuning Qdrant & HNSW

The final bottleneck was vector search. As our collection grew to millions of vectors, search latency crept up to 400ms. We use Qdrant as our vector engine.

Qdrant uses HNSW (Hierarchical Navigable Small World) graphs. The default settings prioritize recall (accuracy) over speed. For a personal archive, we can tolerate a slight drop in accuracy for blazing speed.

[Efficient and Robust Approximate Nearest Neighbor Search Using HNSW Graphs] — Malkov and Yashunin, 2018

Optimizing Index Parameters

We adjusted the m (edges per node) and ef_construct (candidates during index build) parameters in our collection creation:

{
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "hnsw_config": {
    "m": 16,
    "ef_construct": 100,
    "full_scan_threshold": 10000
  }
}

We reduced m from 32 to 16 (less link memory, faster search) and ef_construct from 200 to 100 (faster indexing); full_scan_threshold keeps small segments on exact brute-force search, where scanning is cheaper than using the index.
[Qdrant HNSW Index Configuration] — Qdrant, 2024
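As a back-of-envelope check on the memory claim: each vector stores its float payload plus roughly 2·m graph links at the base layer (upper layers add comparatively little). This rough estimate uses assumed sizes, not Qdrant's exact accounting:

```python
def hnsw_memory_bytes(num_vectors: int, dim: int, m: int,
                      bytes_per_float: int = 4, bytes_per_link: int = 4) -> int:
    vectors = num_vectors * dim * bytes_per_float  # raw float32 payload
    links = num_vectors * 2 * m * bytes_per_link   # ~2*m links per node at layer 0
    return vectors + links

one_million = 1_000_000
m16 = hnsw_memory_bytes(one_million, 1536, m=16)
m32 = hnsw_memory_bytes(one_million, 1536, m=32)
print(f"m=16: {m16 / 1e9:.2f} GB, m=32: {m32 / 1e9:.2f} GB")
```

At 1,536 dimensions the vectors themselves dominate, so halving m mostly buys search speed (fewer edges to traverse per hop) rather than dramatic memory savings.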

The Results

Combining these changes transformed the system responsiveness.

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| End-to-End Latency | 58s | 2.8s | 95% |
| Database CPU | 85% | 15% | 82% |
| Throughput | 12 docs/min | 250 docs/min | 20x |
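The improvement column follows directly from the before/after values; a quick sanity check of the arithmetic:

```python
def pct_improvement(before: float, after: float) -> float:
    # relative reduction, expressed as a percentage
    return (before - after) / before * 100

latency = pct_improvement(58, 2.8)   # end-to-end latency
db_cpu = pct_improvement(85, 15)     # database CPU
speedup = 250 / 12                   # throughput multiplier, roughly the 20x above
print(f"latency: {latency:.0f}%, db cpu: {db_cpu:.0f}%, throughput: {speedup:.1f}x")
```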

Conclusion

Optimization is rarely about finding a single “magic bullet.” It requires a systematic analysis of the pipeline:

  1. Transport Layer: Move from synchronous blocking calls to asynchronous messaging (NATS).
  2. Data Layer: Cache read-heavy, write-rare data close to the application (Redis).
  3. Compute Layer: Tune algorithms to your specific use case (Qdrant HNSW).

Reflecting on this optimization journey, the biggest takeaway for me was that the most impactful improvements came not from clever code but from architectural decisions. Switching from sequential HTTP to event-driven messaging was a one-time effort that delivered permanent, compounding benefits. If I had to do it again, I would start with the messaging layer from day one rather than building a synchronous MVP first. The migration cost was significant, and the technical debt of the synchronous approach nearly derailed us before we caught it.

[Designing Data-Intensive Applications] — Martin Kleppmann, 2017

Next Steps

  • Dive into the Introduction to NATS JetStream for a deeper look at our messaging infrastructure.
  • Explore standard vs keyword search strategies in Qdrant for different query patterns.
  • Implement distributed tracing with OpenTelemetry to get per-service latency breakdowns in production.
  • Add automated performance regression tests to CI/CD to catch latency regressions before deployment.

Further Reading

  • [NATS by Example: Core Patterns and JetStream] — Example, 2024
  • [Redis Caching Strategies in .NET] — Microsoft, 2024
  • [Qdrant Performance Tuning Guide] — Qdrant, 2024
  • [Designing Data-Intensive Applications] — Martin Kleppmann, 2017