Model Policy Governance and Token Usage Tracking
Implementing model policy management and token accounting so LLM features remain cost-aware and governable at platform scale.
When I first integrated LLM capabilities into our platform, I treated token usage the way most developers treat logging — something I would worry about later. That “later” came sooner than expected when a misbehaving summarization loop burned through an entire month’s API budget in a single weekend. The experience was a wake-up call: without governance guardrails, AI features are a financial liability hiding behind a cool demo. This article captures the policy enforcement and token accounting system I built to make sure it never happens again.
As you add more LLM-powered features — summarization, entity extraction, embeddings, RAG queries — token consumption grows unpredictably. Without governance, a single runaway feature can exhaust your model quota or blow through your budget. The sections below walk through the model policies and token accounting that bring that consumption under control.
The Problem
Without governance:
- No visibility into which feature consumes the most tokens
- No per-user or per-feature limits
- A misbehaving agent can loop indefinitely, consuming thousands of tokens
- Cost allocation across teams or features is impossible
flowchart TD
REQ["Incoming LLM Request"] --> CHK_EN{"Policy\nEnabled?"}
CHK_EN -->|No| ERR1["FeatureDisabledException"]
CHK_EN -->|Yes| EST["Estimate Input Tokens"]
EST --> CHK_TOK{"Within\nMaxInputTokens?"}
CHK_TOK -->|No| ERR2["TokenLimitExceededException"]
CHK_TOK -->|Yes| CHK_RATE{"Within\nMaxRequestsPerMinute?"}
CHK_RATE -->|No| ERR3["RateLimitExceededException"]
CHK_RATE -->|Yes| CALL["Call Ollama"]
CALL --> REC["Record Output Tokens"]
REC --> OTEL["Emit OpenTelemetry Metrics"]
style REQ fill:#1a2744,stroke:#6366f1,color:#e2e8f0
style CALL fill:#1a2744,stroke:#22c55e,color:#e2e8f0
style ERR1 fill:#3b1a1a,stroke:#ef4444,color:#fca5a5
style ERR2 fill:#3b1a1a,stroke:#ef4444,color:#fca5a5
style ERR3 fill:#3b1a1a,stroke:#ef4444,color:#fca5a5
style OTEL fill:#1a2744,stroke:#f59e0b,color:#e2e8f0
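Each of the three failure paths in the diagram surfaces as its own exception type. The snippets in this article reference them but never define them, so here is a minimal sketch of what they could look like; the names come from the diagram and the later code, while the messages and constructor shapes are assumptions matching how they are thrown below.
// Hypothetical exception types matching the failure paths in the diagram.
public class FeatureDisabledException : Exception
{
    public FeatureDisabledException(string feature)
        : base($"Feature '{feature}' is disabled by policy.") { }
}
public class TokenLimitExceededException : Exception
{
    public TokenLimitExceededException(string feature, int actualTokens, int maxTokens)
        : base($"Feature '{feature}' input is {actualTokens} tokens; the policy allows {maxTokens}.") { }
}
public class RateLimitExceededException : Exception
{
    public RateLimitExceededException(string feature)
        : base($"Feature '{feature}' exceeded its requests-per-minute limit.") { }
}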
[OpenAI API Rate Limits and Usage Tiers] — OpenAI, 2024
Model Policy Configuration
Define policies that specify which model each feature uses and its token limits:
[The Options Pattern in ASP.NET Core] — Microsoft, 2024
public class ModelPolicyOptions
{
public Dictionary<string, ModelPolicy> Policies { get; set; } = new();
}
public class ModelPolicy
{
public string ModelName { get; set; } = "llama3.2";
public int MaxInputTokens { get; set; } = 4096;
public int MaxOutputTokens { get; set; } = 2048;
public int MaxRequestsPerMinute { get; set; } = 60;
public bool Enabled { get; set; } = true;
}
Policies for each feature live in configuration, for example in appsettings.json:
{
"ModelPolicies": {
"Policies": {
"summarization": {
"ModelName": "llama3.2",
"MaxInputTokens": 8192,
"MaxOutputTokens": 1024,
"MaxRequestsPerMinute": 30,
"Enabled": true
},
"entity-extraction": {
"ModelName": "llama3.2",
"MaxInputTokens": 4096,
"MaxOutputTokens": 512,
"MaxRequestsPerMinute": 60,
"Enabled": true
},
"rag-query": {
"ModelName": "llama3.2",
"MaxInputTokens": 16384,
"MaxOutputTokens": 4096,
"MaxRequestsPerMinute": 20,
"Enabled": true
}
}
}
}
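These classes follow the options pattern cited above, so the "ModelPolicies" section can be bound at startup and injected wherever a policy lookup is needed. A minimal registration sketch, assuming a standard ASP.NET Core Program.cs:
// Program.cs: bind the "ModelPolicies" configuration section to ModelPolicyOptions.
var builder = WebApplication.CreateBuilder(args);

builder.Services.Configure<ModelPolicyOptions>(
    builder.Configuration.GetSection("ModelPolicies"));

// Consumers inject IOptions<ModelPolicyOptions> and look up a policy by feature name,
// e.g. options.Value.Policies["summarization"].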
Policy Enforcement
A decorator around the LLM service checks the policy before every call:
[Decorator Pattern in C#] — Refactoring Guru, 2024
public class PolicyEnforcingLlmService : ILlmService
{
private readonly ILlmService _inner;
private readonly ModelPolicyOptions _policies;
    private readonly ITokenCounter _tokenCounter;

    // Dependencies arrive via DI; the policies are bound through the options pattern.
    public PolicyEnforcingLlmService(
        ILlmService inner,
        IOptions<ModelPolicyOptions> policies,
        ITokenCounter tokenCounter)
    {
        _inner = inner;
        _policies = policies.Value;
        _tokenCounter = tokenCounter;
    }
public async Task<string> GenerateAsync(
string feature, string prompt, CancellationToken ct)
{
var policy = _policies.Policies.GetValueOrDefault(feature)
?? throw new InvalidOperationException(
$"No model policy defined for feature '{feature}'");
if (!policy.Enabled)
throw new FeatureDisabledException(feature);
// Check input token limit
var inputTokens = EstimateTokenCount(prompt);
if (inputTokens > policy.MaxInputTokens)
throw new TokenLimitExceededException(
feature, inputTokens, policy.MaxInputTokens);
// Check rate limit
if (!await _tokenCounter.TryConsumeAsync(
feature, 1, policy.MaxRequestsPerMinute, ct))
throw new RateLimitExceededException(feature);
var result = await _inner.GenerateAsync(
feature, prompt, ct);
// Record actual usage
var outputTokens = EstimateTokenCount(result);
await _tokenCounter.RecordUsageAsync(
feature, inputTokens, outputTokens, ct);
return result;
}
    // Rough token estimate for local models: ~4 characters per token.
    private static int EstimateTokenCount(string text) =>
        text.Length / 4;
}
Token Accounting
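The decorator delegates rate limiting and usage recording to an ITokenCounter abstraction whose definition is not shown above. A minimal sketch of the interface, inferred from the calls PolicyEnforcingLlmService makes (the member names come from that snippet; everything else is an assumption):
// Sketch of the ITokenCounter contract used by the decorator above.
public interface ITokenCounter
{
    // Consume 'count' requests for a feature; return false once the per-minute
    // budget is exhausted (the windowing strategy is up to the implementation).
    Task<bool> TryConsumeAsync(
        string feature, int count, int maxRequestsPerMinute, CancellationToken ct);

    // Persist input/output token counts for cost allocation and dashboards.
    Task RecordUsageAsync(
        string feature, int inputTokens, int outputTokens, CancellationToken ct);
}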
Track cumulative token usage per feature for cost allocation and monitoring:
[Ollama API Documentation] — Ollama, 2024
public class TokenUsageRecord
{
public string Feature { get; init; } = "";
public string Model { get; init; } = "";
public int InputTokens { get; init; }
public int OutputTokens { get; init; }
public DateTime Timestamp { get; init; }
public string? UserId { get; init; }
}
Dashboard Metrics
Expose token usage as OpenTelemetry metrics:
[OpenTelemetry .NET Metrics] — Microsoft, 2024
var meter = new Meter("ai.token-usage");
var inputCounter = meter.CreateCounter<long>("ai.tokens.input");
var outputCounter = meter.CreateCounter<long>("ai.tokens.output");
var requestCounter = meter.CreateCounter<long>("ai.requests");
// After each LLM call
inputCounter.Add(inputTokens,
new("feature", feature), new("model", model));
outputCounter.Add(outputTokens,
new("feature", feature), new("model", model));
requestCounter.Add(1,
new("feature", feature), new("model", model));
Key Takeaways
- Define per-feature model policies — different features need different limits
- Enforce policies before every LLM call with a decorator/middleware
- Track token usage with OpenTelemetry for cost visibility
- Use a kill switch (Enabled: false) to disable runaway features instantly
- Estimate tokens with characters / 4 for local models
Building this governance layer was not glamorous work, but it is the kind of infrastructure that separates a prototype from a production system. Before I had it in place, every new AI feature felt like a gamble. Now I can onboard new features with confidence, because I know exactly how much each one costs, who is using it, and that I can shut it down in seconds if something goes wrong. If you are building anything with LLMs beyond a weekend project, invest in governance early — your future self (and your finance team) will thank you.
Next Steps
- Implement per-user token budgets with monthly reset cycles to enable fair usage across tenants in a multi-tenant system.
- Add alerting thresholds to your OpenTelemetry metrics so you get notified before hitting hard limits, not after.
- Explore token-aware request batching to reduce overhead by grouping small requests into a single LLM call.
- Build a cost dashboard that maps token usage to actual dollar amounts using your provider’s pricing tiers (a starting-point sketch follows below).
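As a starting point for that last item, here is a small sketch that turns a TokenUsageRecord into a dollar estimate. The pricing record is hypothetical; plug in your provider's real per-million-token prices:
// Hypothetical pricing model; substitute your provider's actual rates per million tokens.
public record ModelPricing(decimal InputPerMillionTokens, decimal OutputPerMillionTokens);

public static class CostCalculator
{
    public static decimal Estimate(TokenUsageRecord usage, ModelPricing pricing) =>
        usage.InputTokens / 1_000_000m * pricing.InputPerMillionTokens +
        usage.OutputTokens / 1_000_000m * pricing.OutputPerMillionTokens;
}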