Model Policy Governance and Token Usage Tracking
Implementing model policy management and token accounting so LLM features remain cost-aware and governable at platform scale.
When I first integrated LLM capabilities into our platform, I treated token usage the way most developers treat logging — something I would worry about later. That “later” came sooner than expected when a misbehaving summarization loop burned through an entire month’s API budget in a single weekend. The experience was a wake-up call: without governance guardrails, AI features are a financial liability hiding behind a cool demo. This article captures the policy enforcement and token accounting system I built to make sure it never happens again.
As you add more LLM-powered features — summarization, entity extraction, embeddings, RAG queries — token consumption grows unpredictably. Without governance, a single runaway feature can exhaust your model quota or blow through your budget. The sections below walk through the model policies and token accounting that bring that consumption under control.
The Problem
Without governance:
- No visibility into which feature consumes the most tokens
- No per-user or per-feature limits
- A misbehaving agent can loop indefinitely, consuming thousands of tokens
- Cost allocation across teams or features is impossible
flowchart TD
REQ["Incoming LLM Request"] --> CHK_EN{"Policy\nEnabled?"}
CHK_EN -->|No| ERR1["FeatureDisabledException"]
CHK_EN -->|Yes| EST["Estimate Input Tokens"]
EST --> CHK_TOK{"Within\nMaxInputTokens?"}
CHK_TOK -->|No| ERR2["TokenLimitExceededException"]
CHK_TOK -->|Yes| CHK_RATE{"Within\nMaxRequestsPerMinute?"}
CHK_RATE -->|No| ERR3["RateLimitExceededException"]
CHK_RATE -->|Yes| CALL["Call Ollama"]
CALL --> REC["Record Output Tokens"]
REC --> OTEL["Emit OpenTelemetry Metrics"]
style REQ fill:#1a2744,stroke:#6366f1,color:#e2e8f0
style CALL fill:#1a2744,stroke:#22c55e,color:#e2e8f0
style ERR1 fill:#3b1a1a,stroke:#ef4444,color:#fca5a5
style ERR2 fill:#3b1a1a,stroke:#ef4444,color:#fca5a5
style ERR3 fill:#3b1a1a,stroke:#ef4444,color:#fca5a5
style OTEL fill:#1a2744,stroke:#f59e0b,color:#e2e8f0
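Each of the three failure paths in the diagram surfaces as its own exception type. The snippets in this article reference them but never define them, so here is a minimal sketch of what they could look like; the names come from the diagram and the later code, while the messages and constructor shapes are assumptions matching how they are thrown below.
// Hypothetical exception types matching the failure paths in the diagram.
public class FeatureDisabledException : Exception
{
    public FeatureDisabledException(string feature)
        : base($"Feature '{feature}' is disabled by policy.") { }
}
public class TokenLimitExceededException : Exception
{
    public TokenLimitExceededException(string feature, int actualTokens, int maxTokens)
        : base($"Feature '{feature}' input is {actualTokens} tokens; the policy allows {maxTokens}.") { }
}
public class RateLimitExceededException : Exception
{
    public RateLimitExceededException(string feature)
        : base($"Feature '{feature}' exceeded its requests-per-minute limit.") { }
}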
[OpenAI API Rate Limits and Usage Tiers] — OpenAI, 2024
Model Policy Configuration
Define policies that specify which model each feature uses and its token limits:
[The Options Pattern in ASP.NET Core] — Microsoft, 2024
public class ModelPolicyOptions
{
public Dictionary<string, ModelPolicy> Policies { get; set; } = new();
}
public class ModelPolicy
{
public string ModelName { get; set; } = "llama3.2";
public int MaxInputTokens { get; set; } = 4096;
public int MaxOutputTokens { get; set; } = 2048;
public int MaxRequestsPerMinute { get; set; } = 60;
public bool Enabled { get; set; } = true;
}
Policies for each feature live in configuration, for example in appsettings.json:
{
"ModelPolicies": {
"Policies": {
"summarization": {
"ModelName": "llama3.2",
"MaxInputTokens": 8192,
"MaxOutputTokens": 1024,
"MaxRequestsPerMinute": 30,
"Enabled": true
},
"entity-extraction": {
"ModelName": "llama3.2",
"MaxInputTokens": 4096,
"MaxOutputTokens": 512,
"MaxRequestsPerMinute": 60,
"Enabled": true
},
"rag-query": {
"ModelName": "llama3.2",
"MaxInputTokens": 16384,
"MaxOutputTokens": 4096,
"MaxRequestsPerMinute": 20,
"Enabled": true
}
}
}
}
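These classes follow the options pattern cited above, so the "ModelPolicies" section can be bound at startup and injected wherever a policy lookup is needed. A minimal registration sketch, assuming a standard ASP.NET Core Program.cs:
// Program.cs: bind the "ModelPolicies" configuration section to ModelPolicyOptions.
var builder = WebApplication.CreateBuilder(args);

builder.Services.Configure<ModelPolicyOptions>(
    builder.Configuration.GetSection("ModelPolicies"));

// Consumers inject IOptions<ModelPolicyOptions> and look up a policy by feature name,
// e.g. options.Value.Policies["summarization"].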
Policy Enforcement
A decorator around the LLM service checks the policy before every call:
[Decorator Pattern in C#] — Refactoring Guru, 2024
public class PolicyEnforcingLlmService : ILlmService
{
private readonly ILlmService _inner;
private readonly ModelPolicyOptions _policies;
    private readonly ITokenCounter _tokenCounter;

    // Dependencies arrive via DI; the policies are bound through the options pattern.
    public PolicyEnforcingLlmService(
        ILlmService inner,
        IOptions<ModelPolicyOptions> policies,
        ITokenCounter tokenCounter)
    {
        _inner = inner;
        _policies = policies.Value;
        _tokenCounter = tokenCounter;
    }
public async Task<string> GenerateAsync(
string feature, string prompt, CancellationToken ct)
{
var policy = _policies.Policies.GetValueOrDefault(feature)
?? throw new InvalidOperationException(
$"No model policy defined for feature '{feature}'");
if (!policy.Enabled)
throw new FeatureDisabledException(feature);
// Check input token limit
var inputTokens = EstimateTokenCount(prompt);
if (inputTokens > policy.MaxInputTokens)
throw new TokenLimitExceededException(
feature, inputTokens, policy.MaxInputTokens);
// Check rate limit
if (!await _tokenCounter.TryConsumeAsync(
feature, 1, policy.MaxRequestsPerMinute, ct))
throw new RateLimitExceededException(feature);
var result = await _inner.GenerateAsync(
feature, prompt, ct);
// Record actual usage
var outputTokens = EstimateTokenCount(result);
await _tokenCounter.RecordUsageAsync(
feature, inputTokens, outputTokens, ct);
return result;
}
    // Rough token estimate for local models: ~4 characters per token.
    private static int EstimateTokenCount(string text) =>
        text.Length / 4;
}
Token Accounting
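The decorator delegates rate limiting and usage recording to an ITokenCounter abstraction whose definition is not shown above. A minimal sketch of the interface, inferred from the calls PolicyEnforcingLlmService makes (the member names come from that snippet; everything else is an assumption):
// Sketch of the ITokenCounter contract used by the decorator above.
public interface ITokenCounter
{
    // Consume 'count' requests for a feature; return false once the per-minute
    // budget is exhausted (the windowing strategy is up to the implementation).
    Task<bool> TryConsumeAsync(
        string feature, int count, int maxRequestsPerMinute, CancellationToken ct);

    // Persist input/output token counts for cost allocation and dashboards.
    Task RecordUsageAsync(
        string feature, int inputTokens, int outputTokens, CancellationToken ct);
}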
Track cumulative token usage per feature for cost allocation and monitoring:
[Ollama API Documentation] — Ollama, 2024
public class TokenUsageRecord
{
public string Feature { get; init; } = "";
public string Model { get; init; } = "";
public int InputTokens { get; init; }
public int OutputTokens { get; init; }
public DateTime Timestamp { get; init; }
public string? UserId { get; init; }
}
Dashboard Metrics
Expose token usage as OpenTelemetry metrics:
[OpenTelemetry .NET Metrics] — Microsoft, 2024
var meter = new Meter("ai.token-usage");
var inputCounter = meter.CreateCounter<long>("ai.tokens.input");
var outputCounter = meter.CreateCounter<long>("ai.tokens.output");
var requestCounter = meter.CreateCounter<long>("ai.requests");
// After each LLM call
inputCounter.Add(inputTokens,
new("feature", feature), new("model", model));
outputCounter.Add(outputTokens,
new("feature", feature), new("model", model));
requestCounter.Add(1,
new("feature", feature), new("model", model));
Key Takeaways
- Define per-feature model policies — different features need different limits
- Enforce policies before every LLM call with a decorator/middleware
- Track token usage with OpenTelemetry for cost visibility
- Use a kill switch (Enabled: false) to disable runaway features instantly
- Estimate tokens with characters / 4 for local models
Building this governance layer was not glamorous work, but it is the kind of infrastructure that separates a prototype from a production system. Before I had it in place, every new AI feature felt like a gamble. Now I can onboard new features with confidence, because I know exactly how much each one costs, who is using it, and that I can shut it down in seconds if something goes wrong. If you are building anything with LLMs beyond a weekend project, invest in governance early — your future self (and your finance team) will thank you.
Next Steps
- Implement per-user token budgets with monthly reset cycles to enable fair usage across tenants in a multi-tenant system.
- Add alerting thresholds to your OpenTelemetry metrics so you get notified before hitting hard limits, not after.
- Explore token-aware request batching to reduce overhead by grouping small requests into a single LLM call.
- Build a cost dashboard that maps token usage to actual dollar amounts using your provider’s pricing tiers (a starting-point sketch follows below).
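As a starting point for that last item, here is a small sketch that turns a TokenUsageRecord into a dollar estimate. The pricing record is hypothetical; plug in your provider's real per-million-token prices:
// Hypothetical pricing model; substitute your provider's actual rates per million tokens.
public record ModelPricing(decimal InputPerMillionTokens, decimal OutputPerMillionTokens);

public static class CostCalculator
{
    public static decimal Estimate(TokenUsageRecord usage, ModelPricing pricing) =>
        usage.InputTokens / 1_000_000m * pricing.InputPerMillionTokens +
        usage.OutputTokens / 1_000_000m * pricing.OutputPerMillionTokens;
}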