Latency Revolution: Optimizing from 60s to 3s
How we slashed system latency by 95% by moving from sequential HTTP calls to parallel NATS requests, implementing Redis caching, and tuning Qdrant vector search.
When I first watched our system take nearly a minute to process a single document upload, I knew something was fundamentally wrong with our architecture. The frustrating part was that each individual service was fast on its own — it was the way we orchestrated them that created a cascading latency nightmare. What followed was a week-long deep dive into profiling, parallelization, and caching that taught me more about distributed systems performance than any textbook could. This article is the story of how we turned a 60-second embarrassment into a 3-second success.
Introduction
In distributed systems, latency isn’t just a number; it’s the user experience. When we started, processing a complex document upload with OCR, embedding generation, and metadata extraction took nearly 60 seconds. This “coffee break” delay was unacceptable for a real-time archive system.
By auditing our architecture and making targeted changes to our communication patterns and data access strategies, we reduced this end-to-end process to under 3 seconds.
Why Optimization Matters:
- User Trust: Validating a document upload instantly builds confidence in the system.
- Resource Efficiency: Holding connections open for 60 seconds wastes thread pool resources and memory.
- Scalability: Sequential processing creates backpressure that chokes the system under load.
What We’ll Build
In this retrospective guide, we will walk through the three key optimizations that revolutionized our performance:
- Parallelization: Replacing sequential HTTP orchestration with NATS JetStream fan-out patterns.
- Caching: Determining what to cache in Redis to spare the primary database.
- Vector Tuning: Optimizing Qdrant HNSW parameters for trade-offs between recall and speed.
Architecture Overview
To achieve sub-3-second latency, we optimized the read path by introducing aggressive caching layers before hitting the primary database.
```mermaid
flowchart LR
    API[API] --> Check{Check Cache}
    Check -->|Hit| Return[Return Result]
    Check -->|Miss| DB[(Postgres)]
    DB --> Write[Write to Cache]
    Write --> Return
    classDef primary fill:#7c3aed,color:#fff
    classDef secondary fill:#06b6d4,color:#fff
    classDef db fill:#f43f5e,color:#fff
    classDef warning fill:#fbbf24,color:#000
    class API,Check,Return primary
    class Write secondary
    class DB db
```
Phase 1: The Bottleneck of Sequential HTTP
Our initial MVP used a familiar pattern: a central API controller that orchestrated the entire pipeline. It would upload to MinIO, then call the OCR service, wait for a response, call the Embedding service, wait again, and finally save to Postgres.
This synchronous HTTP chain was the primary culprit.
[Release It! Design and Deploy Production-Ready Software] — Michael Nygard, 2018

```mermaid
sequenceDiagram
    participant User
    participant API as API (Orchestrator)
    participant OCR
    participant Embed as Embeddings
    participant DB
    User->>API: Upload Document
    activate API
    note right of API: Timer starts
    API->>OCR: POST /process (20s)
    OCR-->>API: Result
    API->>Embed: POST /vectorize (15s)
    Embed-->>API: Vectors
    API->>DB: INSERT metadata (50ms)
    DB-->>API: ID
    note right of API: Timer ends (~35s+)
    API-->>User: 200 OK
    deactivate API
```
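The core problem is arithmetic: sequential awaits add their latencies together, while concurrent calls are bounded by the slowest one. Here is a minimal, stdlib-only Python sketch of that difference (the sleeps are scaled-down stand-ins for the OCR and embedding calls; all names are illustrative, not our production code):

```python
import asyncio
import time

async def call_service(name: str, seconds: float) -> str:
    # Stand-in for an HTTP round-trip to a downstream service.
    await asyncio.sleep(seconds)
    return f"{name}:done"

async def sequential() -> float:
    start = time.perf_counter()
    await call_service("ocr", 0.2)     # scaled-down 20s OCR call
    await call_service("embed", 0.15)  # scaled-down 15s embedding call
    return time.perf_counter() - start

async def parallel() -> float:
    start = time.perf_counter()
    await asyncio.gather(
        call_service("ocr", 0.2),
        call_service("embed", 0.15),
    )
    return time.perf_counter() - start

seq = asyncio.run(sequential())  # latencies add: ~0.35s
par = asyncio.run(parallel())    # bounded by the slowest call: ~0.2s
```

With real 20s and 15s services, that is the difference between ~35s and ~20s before any other optimization.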
The Fix: NATS JetStream Fan-Out
We moved to an event-driven architecture using NATS JetStream. The API now simply uploads the raw file and publishes a document.uploaded event.
Multiple workers (OCR, Analysis) subscribe to this event and process it in parallel.
[NATS JetStream Documentation] — Synadia, 2024

```mermaid
flowchart LR
    API[API Service]
    NATS((NATS JetStream))
    OCR[OCR Worker]
    Embed[Embedding Worker]
    DB[(PostgreSQL)]
    API -->|Pub: document.uploaded| NATS
    NATS -->|Sub| OCR
    NATS -->|Sub| Embed
    OCR -->|Write| DB
    Embed -->|Write| DB
    classDef primary fill:#7c3aed,color:#fff
    classDef secondary fill:#06b6d4,color:#fff
    classDef db fill:#f43f5e,color:#fff
    classDef warning fill:#fbbf24,color:#000
    class API,NATS,OCR,Embed primary
    class DB db
```
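The shape of the fan-out is easy to model in-process. The sketch below is deliberately not NATS — it uses asyncio queues as a toy stand-in for a JetStream subject with two durable consumers — but it shows the property we were after: publishing is fire-and-forget for the API, and every subscriber receives the same event and processes it concurrently.

```python
import asyncio

class Bus:
    """Toy pub/sub bus: each subscriber gets its own queue (a stand-in
    for a JetStream consumer; real NATS adds persistence, acks, retries)."""
    def __init__(self) -> None:
        self.subscribers: list[asyncio.Queue] = []

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self.subscribers.append(q)
        return q

    def publish(self, event: dict) -> None:
        # Non-blocking from the publisher's point of view.
        for q in self.subscribers:
            q.put_nowait(event)

async def worker(name: str, q: asyncio.Queue, results: list) -> None:
    event = await q.get()
    await asyncio.sleep(0.01)  # simulated processing (OCR / embedding)
    results.append((name, event["doc_id"]))

async def main() -> list:
    bus = Bus()
    results: list = []
    tasks = [
        asyncio.create_task(worker("ocr", bus.subscribe(), results)),
        asyncio.create_task(worker("embed", bus.subscribe(), results)),
    ]
    # The API's only job now: publish document.uploaded and return.
    bus.publish({"subject": "document.uploaded", "doc_id": "doc-1"})
    await asyncio.gather(*tasks)
    return results

results = asyncio.run(main())
```

In production the same shape comes from one JetStream stream with one consumer per worker type; the API returns to the user as soon as the publish is acknowledged.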
Phase 2: Caching Expensive Lookups
Profiling revealed that during high-traffic ingestion, we were slamming the database with repeated queries for Tag and Category existence checks. For every page of a document, we were checking if tags existed before inserting.
We implemented a cache-aside (lazy-loading) strategy using Redis: check the cache first, fall back to the database on a miss, then populate the cache for subsequent lookups.
Implementation
Instead of running `SELECT id FROM tags WHERE name = @name` on every lookup, we check Redis first.
```csharp
public async Task<Guid> GetOrCreateTagIdAsync(string tagName)
{
    var cacheKey = $"tag:{tagName.ToLower()}";

    // 1. Fast path: Redis
    var cachedId = await _cache.GetStringAsync(cacheKey);
    if (cachedId != null) return Guid.Parse(cachedId);

    // 2. Slow path: Postgres
    var tag = await _dbContext.Tags.FirstOrDefaultAsync(t => t.Name == tagName);
    if (tag == null)
    {
        tag = new Tag(tagName);
        _dbContext.Tags.Add(tag);
        await _dbContext.SaveChangesAsync();
    }

    // 3. Cache for next time (1-hour expiration)
    await _cache.SetStringAsync(cacheKey, tag.Id.ToString(),
        new DistributedCacheEntryOptions { AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1) });

    return tag.Id;
}
```
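For readers outside the .NET world, here is the same fast-path/slow-path flow as a runnable Python sketch; a dict stands in for Redis, another for the tags table, and a counter proves the database is only hit once per tag (all names are illustrative, not our production code):

```python
import uuid

cache: dict[str, str] = {}  # stand-in for Redis
db: dict[str, str] = {}     # stand-in for the tags table
db_queries = 0              # counts round-trips to "Postgres"

def get_or_create_tag_id(tag_name: str) -> str:
    global db_queries
    key = f"tag:{tag_name.lower()}"
    # 1. Fast path: cache hit, no database round-trip.
    if key in cache:
        return cache[key]
    # 2. Slow path: look up (or create) the tag in the database.
    db_queries += 1
    tag_id = db.get(tag_name)
    if tag_id is None:
        tag_id = str(uuid.uuid4())
        db[tag_name] = tag_id
    # 3. Populate the cache for subsequent calls.
    cache[key] = tag_id
    return tag_id

first = get_or_create_tag_id("invoice")
second = get_or_create_tag_id("invoice")  # served from cache
```

During a bulk upload the same handful of tags repeats across thousands of pages, which is exactly the access pattern cache-aside rewards.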
This simple change reduced database CPU usage by 40% during bulk uploads.
Phase 3: Tuning Qdrant & HNSW
The final bottleneck was vector search. As our collection grew to millions of vectors, search latency crept up to 400ms. We use Qdrant as our vector engine.
Qdrant uses HNSW (Hierarchical Navigable Small World) graphs. The default settings prioritize recall (accuracy) over speed. For a personal archive, we can tolerate a slight drop in accuracy for blazing speed.
[Efficient and Robust Approximate Nearest Neighbor Search Using HNSW Graphs] — Malkov and Yashunin, 2018

Optimizing Index Parameters
We adjusted the `m` (edges per node) and `ef_construct` (candidates evaluated during index build) parameters in our collection creation:
```json
{
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "hnsw_config": {
    "m": 16,
    "ef_construct": 100,
    "full_scan_threshold": 10000
  }
}
```

We halved `m` from 32 (less memory per node, faster search) and cut `ef_construct` from 200 to 100 (faster index builds).
[Qdrant HNSW Index Configuration] — Qdrant, 2024
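Index-time settings aren't the only knob. If I recall the Qdrant search API correctly, recall can also be traded per query via an `hnsw_ef` entry in the request's `params` object — a larger value explores more of the graph at the cost of latency. Treat the exact field placement as an assumption to verify against the docs for your Qdrant version; the vector is truncated here for brevity:

```json
{
  "vector": [0.1, 0.2, 0.3],
  "limit": 10,
  "params": {
    "hnsw_ef": 128
  }
}
```

This let us keep the leaner index while selectively boosting accuracy for queries where it mattered.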
The Results
Combining these changes transformed the system responsiveness.
| Metric | Before | After | Improvement |
|---|---|---|---|
| End-to-End Latency | 58s | 2.8s | 95% |
| Database CPU | 85% | 15% | 82% |
| Throughput | 12 docs/min | 250 docs/min | 20x |
Conclusion
Optimization is rarely about finding a single “magic bullet.” It requires a systematic analysis of the pipeline:
- Transport Layer: Move from synchronous blocking calls to asynchronous messaging (NATS).
- Data Layer: Cache read-heavy, write-rare data close to the application (Redis).
- Compute Layer: Tune algorithms to your specific use case (Qdrant HNSW).
Reflecting on this optimization journey, the biggest takeaway for me was that the most impactful improvements came not from clever code but from architectural decisions. Switching from sequential HTTP to event-driven messaging was a one-time effort that delivered permanent, compounding benefits. If I had to do it again, I would start with the messaging layer from day one rather than building a synchronous MVP first. The migration cost was significant, and the technical debt of the synchronous approach nearly derailed us before we caught it.
[Designing Data-Intensive Applications] — Martin Kleppmann, 2017

Next Steps
- Dive into the Introduction to NATS JetStream for a deeper look at our messaging infrastructure.
- Explore standard vs keyword search strategies in Qdrant for different query patterns.
- Implement distributed tracing with OpenTelemetry to get per-service latency breakdowns in production.
- Add automated performance regression tests to CI/CD to catch latency regressions before deployment.