Distributed API Rate Limiting with NATS KV
Moving from local in-memory limits to cluster-aware NATS KV partitions for fair, durable API throttling in distributed deployments.
When I first configured rate limiting across multiple API replicas on our k3s cluster, I assumed a simple in-memory counter per instance would be enough. It was not. During load testing, I watched a single user blow past their quota by a factor of three simply because requests were round-robined across replicas. Each instance maintained its own counter, blissfully unaware of the others. That experience pushed me toward a shared counter backed by NATS KV, where compare-and-swap semantics gave me the atomic guarantees I needed without bolting on Redis or another external dependency.
Rate limiting in a single-instance API is straightforward — an in-memory counter per user, a sliding window, done. But when you scale to multiple replicas behind a load balancer, each instance has its own counter. A user can effectively multiply their quota by the number of replicas. This article shows how to use NATS KV as a shared, cluster-aware rate limit store.
The Multi-Instance Problem
With N API replicas and local rate limiting:
- User quota: 100 requests/minute
- Actual effective quota: 100 * N requests/minute
- Each replica only sees 1/N of the user's traffic
This defeats the purpose of rate limiting entirely. You need a shared counter that all replicas can read and write atomically.
[Rate Limiting Best Practices] — Google Cloud Architecture, 2023-06-15

Why NATS KV?
If you already run NATS JetStream for messaging, NATS KV gives you a distributed key-value store with:
- Atomic operations via compare-and-swap (CAS)
- TTL support for automatic window expiry
- No extra infrastructure — it’s built into NATS
- Replication matching your NATS cluster topology
sequenceDiagram
participant C1 as Client A
participant R1 as API Replica 1
participant R2 as API Replica 2
participant KV as NATS KV
participant C2 as Client B
C1->>R1: POST /documents
C2->>R2: POST /documents
R1->>KV: Get(rate:upload:user1)
KV-->>R1: value=4, rev=7
R2->>KV: Get(rate:upload:user1)
KV-->>R2: value=4, rev=7
R1->>KV: Update(key, 5, rev=7)
KV-->>R1: OK, rev=8
R2->>KV: Update(key, 5, rev=7)
KV-->>R2: WrongLastRevision!
R2->>KV: Get(rate:upload:user1)
KV-->>R2: value=5, rev=8
R2->>KV: Update(key, 6, rev=8)
KV-->>R2: OK, rev=9
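In the .NET client (NATS.Net), the sequence above maps onto CreateAsync and revision-checked UpdateAsync calls. A rough sketch, assuming an INatsKVStore instance named store and an illustrative key:

// Create fails if the key already exists; Update is a compare-and-swap
// that fails unless the supplied revision is still the latest one.
var revision = await store.CreateAsync("rate:api:user1:123", 1);

try
{
    await store.UpdateAsync("rate:api:user1:123", 2, revision);
}
catch (NatsKVWrongLastRevisionException)
{
    // Another writer got there first: re-read the entry and retry.
}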
[NATS Key/Value Store] — NATS Authors, 2024-03-20
Implementation
Rate Limit Key Strategy
The key encodes the user identity and time window:
rate:{policyName}:{userId}:{windowStart}
For authenticated users, use their user ID. For anonymous requests, hash the IP address to avoid storing raw IPs. The window index is the current UTC tick count divided by the window length, so every replica computes the same bucket for the same instant:
using System.Security.Cryptography;
using System.Text;

public static class RateLimitKeyGenerator
{
    public static string Generate(
        HttpContext context, string policyName, TimeSpan window)
    {
        var userId = context.User?.FindFirst("sub")?.Value;

        // Integer division: all replicas agree on the bucket for a given instant.
        var windowStart = DateTime.UtcNow.Ticks / window.Ticks;

        if (!string.IsNullOrEmpty(userId))
            return $"rate:{policyName}:{userId}:{windowStart}";

        // GetHashCode is randomized per process in modern .NET, so replicas would
        // disagree on the key; use a stable SHA-256 hash of the IP instead.
        var ip = context.Connection.RemoteIpAddress?.ToString() ?? "unknown";
        var hash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(ip)))[..16];
        return $"rate:{policyName}:anon:{hash}:{windowStart}";
    }
}

Atomic Increment with CAS
The core operation: read the current count, increment it, and write back atomically. If another replica modified the value between read and write, the CAS fails and we retry:
public class NatsKvRateLimiter
{
    private readonly INatsKVStore _store;
    private readonly int _maxRetries = 3;

    public NatsKvRateLimiter(INatsKVStore store) => _store = store;

    public async Task<RateLimitResult> CheckAsync(
        string key, int limit, CancellationToken ct)
    {
        for (int attempt = 0; attempt < _maxRetries; attempt++)
        {
            try
            {
                // Read the current count together with its KV revision.
                var entry = await _store.GetEntryAsync<int>(key, cancellationToken: ct);
                if (entry.Value >= limit)
                {
                    return RateLimitResult.Exceeded(entry.Value, limit);
                }

                // CAS write: succeeds only if the revision is still current.
                await _store.UpdateAsync(
                    key, entry.Value + 1, entry.Revision, cancellationToken: ct);
                return RateLimitResult.Allowed(entry.Value + 1, limit);
            }
            catch (NatsKVKeyNotFoundException)
            {
                // First request in this window
                try
                {
                    await _store.CreateAsync(key, 1, cancellationToken: ct);
                    return RateLimitResult.Allowed(1, limit);
                }
                catch (NatsKVCreateException)
                {
                    continue; // Another replica created it first
                }
            }
            catch (NatsKVWrongLastRevisionException)
            {
                continue; // CAS conflict, retry
            }
        }

        // After max retries, allow the request (fail open)
        return RateLimitResult.Allowed(0, limit);
    }
}
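Under real contention, many replicas can land on the same key and burn through their retries in lockstep. A small, growing delay with jitter before each retry spreads them out; a sketch of what could sit at the top of the retry loop above (the constants are illustrative):

// Back off before retrying, growing with each attempt, plus jitter so
// replicas don't retry in lockstep.
if (attempt > 0)
{
    var backoff = TimeSpan.FromMilliseconds(
        5 * (1 << attempt) + Random.Shared.Next(0, 5));
    await Task.Delay(backoff, ct);
}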
ASP.NET Core Middleware Integration

Wire the rate limiter into the ASP.NET Core pipeline. The middleware runs after authentication so the user identity is available:
app.UseAuthentication();
app.UseAuthorization();
app.UseRateLimiter(); // Must be after auth
// Configuration
builder.Services.AddRateLimiter(options =>
{
options.AddPolicy("api", context =>
{
var key = RateLimitKeyGenerator.Generate(
context, "api", TimeSpan.FromMinutes(1));
return RateLimitPartition.GetFixedWindowLimiter(key,
_ => new FixedWindowRateLimiterOptions
{
PermitLimit = 100,
Window = TimeSpan.FromMinutes(1),
});
});
});
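Note that the built-in fixed-window limiter keeps its counters in process memory, so on its own it is still per-replica. One way to make the decision against the shared NATS KV counter is a small custom middleware that calls NatsKvRateLimiter directly. A minimal sketch, assuming the limiter is registered as a singleton and that RateLimitResult exposes an IsAllowed flag (both assumptions are illustrative):

public class NatsRateLimitMiddleware
{
    private readonly RequestDelegate _next;
    private readonly NatsKvRateLimiter _limiter;

    public NatsRateLimitMiddleware(RequestDelegate next, NatsKvRateLimiter limiter)
    {
        _next = next;
        _limiter = limiter;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        var window = TimeSpan.FromMinutes(1);
        var key = RateLimitKeyGenerator.Generate(context, "api", window);

        // One CAS-backed increment per request against the shared counter.
        var result = await _limiter.CheckAsync(key, 100, context.RequestAborted);

        if (!result.IsAllowed) // assumed shape of RateLimitResult
        {
            context.Response.StatusCode = StatusCodes.Status429TooManyRequests;
            return;
        }

        await _next(context);
    }
}

// Registered after authentication so the "sub" claim is populated:
// app.UseAuthentication();
// app.UseMiddleware<NatsRateLimitMiddleware>();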
Bucket Configuration

var store = await kvContext.CreateStoreAsync(
    new NatsKVConfig("rate-limits")
    {
        MaxAge = TimeSpan.FromMinutes(5),    // Auto-expire old windows
        Storage = NatsKVStorageType.Memory,  // Speed over durability
        NumberOfReplicas = 3,                // Match the NATS cluster replication
    });
Using Memory storage instead of File trades durability for speed — rate limit counters don’t need to survive full cluster restarts.
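For reference, the kvContext above comes from the NATS.Net client's JetStream context. A rough wiring sketch, assuming the NATS.Client.Core, NATS.Client.JetStream and NATS.Client.KeyValueStore packages, an illustrative URL, and that NatsKvRateLimiter takes the store in its constructor:

// Connect once per replica; kvContext is then used to create the
// "rate-limits" bucket shown above.
var nats = new NatsConnection(new NatsOpts { Url = "nats://nats:4222" });
var js = new NatsJSContext(nats);
var kvContext = new NatsKVContext(js);

// Share one limiter instance (and therefore one NATS connection) per replica.
builder.Services.AddSingleton(new NatsKvRateLimiter(store));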
Response Headers
Always include rate limit headers so clients can self-throttle:
context.Response.Headers["X-RateLimit-Limit"] = limit.ToString();
context.Response.Headers["X-RateLimit-Remaining"] =
Math.Max(0, limit - count).ToString();
context.Response.Headers["X-RateLimit-Reset"] =
windowEnd.ToUnixTimeSeconds().ToString();
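The limit and count values come from the limiter's result; windowEnd can be derived from the same integer bucket index used in the key, so the advertised reset time matches the window that is actually counting. A small helper sketch under that assumption:

// End of the current fixed window, computed from the same
// integer-division bucket index used in RateLimitKeyGenerator.
static DateTimeOffset GetWindowEnd(TimeSpan window)
{
    var windowStart = DateTime.UtcNow.Ticks / window.Ticks;
    return new DateTimeOffset((windowStart + 1) * window.Ticks, TimeSpan.Zero);
}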
[RateLimit Header Fields for HTTP] — IETF HTTP API Working Group, 2024-05-01
Key Takeaways
- Local in-memory rate limiting is ineffective with multiple replicas
- NATS KV provides atomic counters with CAS — no extra infrastructure needed
- Use memory storage with short TTL for rate limit buckets
- Place middleware after authentication so user identity is available
- Prefer user ID over IP for rate limit keys when authenticated
- Fail open for most APIs, fail closed for security-critical endpoints
Looking back, the most valuable lesson from implementing distributed rate limiting was that the hardest part is not the algorithm — it is managing contention under real-world traffic patterns. CAS semantics work beautifully in theory, but you need to plan for the thundering herd scenario where many replicas compete for the same key simultaneously. The combination of exponential backoff, memory-backed storage, and a fail-open policy gave me a system that is both correct and performant under load.
[Counting Things: Token Buckets and Sliding Windows] — Cloudflare, 2023-09-12

Next Steps
- Implement sliding window rate limiting by maintaining multiple sub-window counters in NATS KV (see the sketch after this list)
- Add per-endpoint rate limit policies with different thresholds for read vs write operations
- Build a rate limit dashboard in Grafana to visualize quota consumption per user over time
- Explore token bucket algorithms for smoother traffic shaping on bursty endpoints
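For the sliding-window item, one common approximation keeps only the current and previous fixed-window counters and weights the previous one by how much of it still overlaps the sliding window. A sketch of the estimate, with the two NATS KV reads left out (names are illustrative):

// Sliding-window estimate from two adjacent fixed-window counters.
// previousCount and currentCount would come from two NATS KV keys
// (the previous and current windowStart buckets).
static double EstimateSlidingCount(
    int previousCount, int currentCount, TimeSpan window)
{
    var ticksIntoWindow = DateTime.UtcNow.Ticks % window.Ticks;
    var previousOverlap = 1.0 - (double)ticksIntoWindow / window.Ticks;
    return previousCount * previousOverlap + currentCount;
}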