# Latency Revolution: Optimizing from 60s to 3s
How we slashed system latency by 95% by moving from sequential HTTP calls to parallel NATS requests, implementing Redis caching, and tuning Qdrant vector search.
## Introduction
In distributed systems, latency isn’t just a number—it’s the user experience. When BlueRobin started, processing a complex document upload with OCR, embedding generation, and metadata extraction took nearly 60 seconds. This “coffee break” delay was unacceptable for a real-time archive system.
By auditing our architecture and making targeted changes to our communication patterns and data access strategies, we reduced this end-to-end process to under 3 seconds.
### Why Optimization Matters
- User Trust: Validating a document upload instantly builds confidence in the system.
- Resource Efficiency: Holding connections open for 60 seconds wastes thread pool resources and memory.
- Scalability: Sequential processing creates backpressure that chokes the system under load.
### What We’ll Build
In this retrospective guide, we will walk through the three key optimizations that revolutionized our performance:
- Parallelization: Replacing sequential HTTP orchestration with NATS JetStream fan-out patterns.
- Caching: Determining what to cache in Redis to spare the primary database.
- Vector Tuning: Optimizing Qdrant HNSW parameters for trade-offs between recall and speed.
## Architecture Overview
To achieve sub-3-second latency, we optimized the read path by introducing aggressive caching layers before hitting the primary database.
```mermaid
flowchart LR
    API[API] --> Check{Check Cache}
    Check -->|Hit| Return[Return Result]
    Check -->|Miss| DB[(Postgres)]
    DB --> Write[Write to Cache]
    Write --> Return
    classDef primary fill:#7c3aed,color:#fff
    classDef secondary fill:#06b6d4,color:#fff
    classDef db fill:#f43f5e,color:#fff
    classDef warning fill:#fbbf24,color:#000
    class API,Check,Return primary
    class Write secondary
    class DB db
```
## Phase 1: The Bottleneck of Sequential HTTP
Our initial MVP used a familiar pattern: a central API controller that orchestrated the entire pipeline. It would upload to MinIO, then call the OCR service, wait for a response, call the Embedding service, wait again, and finally save to Postgres.
This synchronous HTTP chain was the primary culprit.
```mermaid
sequenceDiagram
    participant User
    participant API as API (Orchestrator)
    participant OCR
    participant Embed as Embeddings
    participant DB

    User->>API: Upload Document
    activate API
    note right of API: Timer Starts
    API->>OCR: POST /process (20s)
    OCR-->>API: Result
    API->>Embed: POST /vectorize (15s)
    Embed-->>API: Vectors
    API->>DB: INSERT Metadata (50ms)
    DB-->>API: ID
    note right of API: Timer Ends (~35s+)
    API-->>User: 200 OK
    deactivate API
```
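Reconstructed as code, the blocking chain looks roughly like this. This is a simplified sketch, not our production controller; the endpoint URLs and the metadata helper are illustrative assumptions. The point is structural: each `await` blocks until the downstream service replies, so total latency is the *sum* of every step.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Sketch of the original blocking pipeline (URLs and shapes are assumptions).
public class UploadOrchestrator
{
    private readonly HttpClient _http;
    public UploadOrchestrator(HttpClient http) => _http = http;

    public async Task<Guid> ProcessAsync(Stream file)
    {
        // Step 1: OCR -- the request thread sits idle for the full duration (~20s).
        var ocrResponse = await _http.PostAsync("http://ocr/process", new StreamContent(file));
        var text = await ocrResponse.Content.ReadAsStringAsync();

        // Step 2: embeddings -- only starts after OCR finishes (~15s).
        var embedResponse = await _http.PostAsync("http://embeddings/vectorize", new StringContent(text));
        var vectors = await embedResponse.Content.ReadAsStringAsync();

        // Step 3: persist metadata (~50ms) -- cheap, but gated behind everything above.
        return await SaveMetadataAsync(text, vectors);
    }

    // Placeholder for the database write.
    private Task<Guid> SaveMetadataAsync(string text, string vectors) => Task.FromResult(Guid.NewGuid());
}
```

Nothing here can overlap: OCR and embedding could in principle run concurrently, but the orchestrator serializes them by construction.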
### The Fix: NATS JetStream Fan-Out
We moved to an event-driven architecture using NATS JetStream. The API now simply uploads the raw file and publishes a `document.uploaded` event.
Multiple workers (OCR, Analysis) subscribe to this event and process it in parallel.
```mermaid
flowchart LR
    API[API Service]
    NATS((NATS JetStream))
    OCR[OCR Worker]
    Embed[Embedding Worker]
    DB[(PostgreSQL)]

    API -->|Pub: document.uploaded| NATS
    NATS -->|Sub| OCR
    NATS -->|Sub| Embed
    OCR -->|Write| DB
    Embed -->|Write| DB

    classDef primary fill:#7c3aed,color:#fff
    classDef secondary fill:#06b6d4,color:#fff
    classDef db fill:#f43f5e,color:#fff
    classDef warning fill:#fbbf24,color:#000
    class API,NATS,OCR,Embed primary
    class DB db
```
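In code, the fan-out reduces to one publish plus a durable consumer per worker. The sketch below uses the NATS.Net v2 client; the stream name `DOCS`, the consumer name `ocr-worker`, and the `RunOcrAsync` placeholder are assumptions (it also assumes a `DOCS` stream capturing `document.*` subjects already exists).

```csharp
using System;
using System.Threading.Tasks;
using NATS.Client.Core;
using NATS.Client.JetStream;
using NATS.Client.JetStream.Models;

// --- Publisher side (API service) ---
await using var nats = new NatsConnection();
var js = new NatsJSContext(nats);

// After the file lands in object storage, emit one event; the API's work is done.
var documentId = Guid.NewGuid();
await js.PublishAsync("document.uploaded", documentId.ToByteArray());

// --- Worker side (e.g. the OCR worker; the Embedding worker looks the same) ---
// Each worker binds its own durable consumer, so every worker receives a copy
// of the event and processes it in parallel with the others.
var consumer = await js.CreateOrUpdateConsumerAsync(
    "DOCS", new ConsumerConfig("ocr-worker"));

await foreach (var msg in consumer.ConsumeAsync<byte[]>())
{
    await RunOcrAsync(msg.Data!);   // heavy work happens off the request path
    await msg.AckAsync();           // explicit ack; JetStream redelivers on failure
}

static Task RunOcrAsync(byte[] payload) => Task.CompletedTask; // placeholder
```

Because each worker acks independently, a slow or crashed OCR worker no longer delays embedding generation, and JetStream's redelivery gives us at-least-once processing for free.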
## Phase 2: Caching Expensive Lookups
Profiling revealed that during high-traffic ingestion, we were slamming the database with repeated queries for Tag and Category existence checks. For every page of a document, we were checking if tags existed before inserting.
We implemented a cache-aside strategy using Redis: read from the cache first, fall back to Postgres on a miss, then populate the cache for the next caller.
### Implementation
Instead of `SELECT id FROM tags WHERE name = @name`, we check Redis first.
```csharp
public async Task<Guid> GetOrCreateTagIdAsync(string tagName)
{
    // Invariant lowering avoids culture-specific casing surprises (e.g. Turkish 'i').
    var cacheKey = $"tag:{tagName.ToLowerInvariant()}";

    // 1. Fast path: Redis
    var cachedId = await _cache.GetStringAsync(cacheKey);
    if (cachedId != null) return Guid.Parse(cachedId);

    // 2. Slow path: Postgres. Note that concurrent callers can race here;
    // a unique index on Name keeps the duplicate insert from succeeding.
    var tag = await _dbContext.Tags.FirstOrDefaultAsync(t => t.Name == tagName);
    if (tag == null)
    {
        tag = new Tag(tagName);
        _dbContext.Tags.Add(tag);
        await _dbContext.SaveChangesAsync();
    }

    // 3. Cache for next time (1-hour expiration)
    await _cache.SetStringAsync(cacheKey, tag.Id.ToString(),
        new DistributedCacheEntryOptions { AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1) });

    return tag.Id;
}
```
This simple change reduced database CPU usage by 40% during bulk uploads.
## Phase 3: Tuning Qdrant & HNSW
The final bottleneck was vector search. As our collection grew to millions of vectors, search latency crept up to ~400ms. We use Qdrant as our vector engine.
Qdrant uses HNSW (Hierarchical Navigable Small World) graphs. The default settings prioritize recall (accuracy) over speed. For a personal archive, we can tolerate a slight drop in accuracy for blazing speed.
### Optimizing Index Parameters
We adjusted the `m` (edges per node) and `ef_construct` (candidate list size during index build) parameters when creating the collection:
```jsonc
{
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "hnsw_config": {
    "m": 16,             // reduced from 32: less memory, faster search
    "ef_construct": 100, // reduced from 200: faster indexing
    "full_scan_threshold": 10000
  }
}
```
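The same trade-off is also available per request. Qdrant's search API accepts an `hnsw_ef` parameter, the size of the candidate list walked at query time: lower values are faster but may miss some neighbors. A sketch of a request body for `POST /collections/{collection}/points/search` (the vector and values shown are illustrative, not our production settings):

```json
{
  "vector": [0.12, -0.07, 0.33],
  "limit": 10,
  "params": {
    "hnsw_ef": 64,
    "exact": false
  }
}
```

This lets latency-sensitive endpoints dial recall down without re-indexing, while batch jobs can raise `hnsw_ef` (or set `"exact": true`) when accuracy matters more than speed.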
## The Results
Combining these changes transformed the system responsiveness.
| Metric | Before | After | Improvement |
|---|---|---|---|
| End-to-End Latency | 58s | 2.8s | 95% |
| Database CPU | 85% | 15% | 82% |
| Throughput | 12 docs/min | 250 docs/min | 20x |
## Conclusion
Optimization is rarely about finding a single “magic bullet.” It requires a systematic analysis of the pipeline:
- Transport Layer: Move from synchronous blocking calls to asynchronous messaging (NATS).
- Data Layer: Cache read-heavy, write-rare data close to the application (Redis).
- Compute Layer: Tune algorithms to your specific use case (Qdrant HNSW).
**Next Steps:**
- Introduction to NATS JetStream
- Exploring standard vs keyword search in Qdrant