AI/ML · Intermediate · 8 min read

Model Policy Governance and Token Usage Tracking

Implementing model policy management and token accounting so LLM features remain cost-aware and governable at platform scale.

By Victor Robin

When I first integrated LLM capabilities into our platform, I treated token usage the way most developers treat logging — something I would worry about later. That “later” came sooner than expected when a misbehaving summarization loop burned through an entire month’s API budget in a single weekend. The experience was a wake-up call: without governance guardrails, AI features are a financial liability hiding behind a cool demo. This article captures the policy enforcement and token accounting system I built to make sure it never happens again.

As you add more LLM-powered features — summarization, entity extraction, embeddings, RAG queries — token consumption grows unpredictably. Without governance, a single runaway feature can exhaust your model quota or blow your budget. This article shows how to implement model policies and token accounting.

The Problem

Without governance:

  • No visibility into which feature consumes the most tokens
  • No per-user or per-feature limits
  • A misbehaving agent can loop indefinitely, consuming thousands of tokens
  • Cost allocation across teams or features is impossible

The enforcement path, from policy lookup through metrics emission:

flowchart TD
    REQ["Incoming LLM Request"] --> CHK_EN{"Policy\nEnabled?"}
    CHK_EN -->|No| ERR1["FeatureDisabledException"]
    CHK_EN -->|Yes| EST["Estimate Input Tokens"]
    EST --> CHK_TOK{"Within\nMaxInputTokens?"}
    CHK_TOK -->|No| ERR2["TokenLimitExceededException"]
    CHK_TOK -->|Yes| CHK_RATE{"Within\nMaxRequestsPerMinute?"}
    CHK_RATE -->|No| ERR3["RateLimitExceededException"]
    CHK_RATE -->|Yes| CALL["Call Ollama"]
    CALL --> REC["Record Output Tokens"]
    REC --> OTEL["Emit OpenTelemetry Metrics"]

    style REQ fill:#1a2744,stroke:#6366f1,color:#e2e8f0
    style CALL fill:#1a2744,stroke:#22c55e,color:#e2e8f0
    style ERR1 fill:#3b1a1a,stroke:#ef4444,color:#fca5a5
    style ERR2 fill:#3b1a1a,stroke:#ef4444,color:#fca5a5
    style ERR3 fill:#3b1a1a,stroke:#ef4444,color:#fca5a5
    style OTEL fill:#1a2744,stroke:#f59e0b,color:#e2e8f0
[OpenAI API Rate Limits and Usage Tiers] — OpenAI, 2024

Model Policy Configuration

Define policies that specify which model each feature uses and its token limits:

[The Options Pattern in ASP.NET Core] — Microsoft, 2024
ModelPolicyOptions.cs
public class ModelPolicyOptions
{
    public Dictionary<string, ModelPolicy> Policies { get; set; } = new();
}

public class ModelPolicy
{
    public string ModelName { get; set; } = "llama3.2";
    public int MaxInputTokens { get; set; } = 4096;
    public int MaxOutputTokens { get; set; } = 2048;
    public int MaxRequestsPerMinute { get; set; } = 60;
    public bool Enabled { get; set; } = true;
}

The corresponding configuration, typically kept in appsettings.json, maps each feature to a policy under the ModelPolicies section:

appsettings.json
{
  "ModelPolicies": {
    "Policies": {
      "summarization": {
        "ModelName": "llama3.2",
        "MaxInputTokens": 8192,
        "MaxOutputTokens": 1024,
        "MaxRequestsPerMinute": 30,
        "Enabled": true
      },
      "entity-extraction": {
        "ModelName": "llama3.2",
        "MaxInputTokens": 4096,
        "MaxOutputTokens": 512,
        "MaxRequestsPerMinute": 60,
        "Enabled": true
      },
      "rag-query": {
        "ModelName": "llama3.2",
        "MaxInputTokens": 16384,
        "MaxOutputTokens": 4096,
        "MaxRequestsPerMinute": 20,
        "Enabled": true
      }
    }
  }
}
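
To make these policies available to the rest of the application, the ModelPolicies section is bound with the options pattern referenced above. A minimal sketch, assuming a standard ASP.NET Core host where builder is the usual WebApplicationBuilder:

// Program.cs (sketch) -- bind the "ModelPolicies" section to ModelPolicyOptions
builder.Services.Configure<ModelPolicyOptions>(
    builder.Configuration.GetSection("ModelPolicies"));

// Consumers then receive the bound policies via IOptions<ModelPolicyOptions>
// and look up a feature with options.Value.Policies["summarization"], etc.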

Policy Enforcement

A decorator wrapped around the real LLM service checks the policy before every call:

[Decorator Pattern in C#] — Refactoring Guru, 2024
PolicyEnforcingLlmService.cs
using Microsoft.Extensions.Options;

public class PolicyEnforcingLlmService : ILlmService
{
    private readonly ILlmService _inner;
    private readonly ModelPolicyOptions _policies;
    private readonly ITokenCounter _tokenCounter;

    public PolicyEnforcingLlmService(
        ILlmService inner,
        IOptions<ModelPolicyOptions> policies,
        ITokenCounter tokenCounter)
    {
        _inner = inner;
        _policies = policies.Value;
        _tokenCounter = tokenCounter;
    }

    public async Task<string> GenerateAsync(
        string feature, string prompt, CancellationToken ct)
    {
        // Resolve the policy for this feature; fail loudly if none is configured
        var policy = _policies.Policies.GetValueOrDefault(feature)
            ?? throw new InvalidOperationException(
                $"No model policy defined for feature '{feature}'");

        // Kill switch: a disabled feature never reaches the model
        if (!policy.Enabled)
            throw new FeatureDisabledException(feature);

        // Check input token limit
        var inputTokens = EstimateTokenCount(prompt);
        if (inputTokens > policy.MaxInputTokens)
            throw new TokenLimitExceededException(
                feature, inputTokens, policy.MaxInputTokens);

        // Check rate limit
        if (!await _tokenCounter.TryConsumeAsync(
            feature, 1, policy.MaxRequestsPerMinute, ct))
            throw new RateLimitExceededException(feature);

        var result = await _inner.GenerateAsync(feature, prompt, ct);

        // Record actual usage
        var outputTokens = EstimateTokenCount(result);
        await _tokenCounter.RecordUsageAsync(
            feature, inputTokens, outputTokens, ct);

        return result;
    }

    // Rough heuristic: ~4 characters per token is close enough for local models
    private static int EstimateTokenCount(string text) =>
        Math.Max(1, text.Length / 4);
}
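
The three policy exceptions are referenced in the flow diagram and the decorator but not shown in full; their exact shape is up to you. A minimal sketch consistent with how they are thrown above:

public class FeatureDisabledException : Exception
{
    public FeatureDisabledException(string feature)
        : base($"LLM feature '{feature}' is disabled by policy.") { }
}

public class TokenLimitExceededException : Exception
{
    public TokenLimitExceededException(string feature, int actual, int limit)
        : base($"Feature '{feature}' prompt is ~{actual} tokens; policy allows {limit}.") { }
}

public class RateLimitExceededException : Exception
{
    public RateLimitExceededException(string feature)
        : base($"Feature '{feature}' exceeded its requests-per-minute limit.") { }
}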

Token Accounting

Track cumulative token usage per feature for cost allocation and monitoring:

[Ollama API Documentation] — Ollama, 2024
TokenUsageRecord.cs
public class TokenUsageRecord
{
    public string Feature { get; init; } = "";
    public string Model { get; init; } = "";
    public int InputTokens { get; init; }
    public int OutputTokens { get; init; }
    public DateTime Timestamp { get; init; }
    public string? UserId { get; init; }
}
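
The decorator depends on an ITokenCounter with two responsibilities: per-minute request throttling and usage recording. The article does not prescribe an implementation; the interface shape below is inferred from the calls in PolicyEnforcingLlmService, and InMemoryTokenCounter is a hypothetical single-node sketch (use Redis or similar if you run multiple instances):

using System.Collections.Concurrent;

public interface ITokenCounter
{
    Task<bool> TryConsumeAsync(
        string feature, int requests, int maxRequestsPerMinute, CancellationToken ct);

    Task RecordUsageAsync(
        string feature, int inputTokens, int outputTokens, CancellationToken ct);
}

// Fixed one-minute window per feature; approximate under heavy contention.
public class InMemoryTokenCounter : ITokenCounter
{
    private readonly ConcurrentDictionary<string, (DateTime WindowStart, int Count)> _windows = new();
    private readonly ConcurrentQueue<TokenUsageRecord> _usage = new();

    public Task<bool> TryConsumeAsync(
        string feature, int requests, int maxRequestsPerMinute, CancellationToken ct)
    {
        var now = DateTime.UtcNow;
        var allowed = true;

        _windows.AddOrUpdate(
            feature,
            _ => (now, requests),
            (_, window) =>
            {
                // Start a fresh window once a minute has elapsed
                if (now - window.WindowStart >= TimeSpan.FromMinutes(1))
                    return (now, requests);

                // Reject the request if it would exceed the per-minute limit
                if (window.Count + requests > maxRequestsPerMinute)
                {
                    allowed = false;
                    return window;
                }

                return (window.WindowStart, window.Count + requests);
            });

        return Task.FromResult(allowed);
    }

    public Task RecordUsageAsync(
        string feature, int inputTokens, int outputTokens, CancellationToken ct)
    {
        _usage.Enqueue(new TokenUsageRecord
        {
            Feature = feature,
            InputTokens = inputTokens,
            OutputTokens = outputTokens,
            Timestamp = DateTime.UtcNow
        });
        return Task.CompletedTask;
    }
}

A fixed window is deliberately simple; a sliding window or token-bucket counter gives smoother throttling if you need it.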

Dashboard Metrics

Expose token usage as OpenTelemetry metrics:

[OpenTelemetry .NET Metrics] — Microsoft, 2024
using System.Diagnostics.Metrics;

var meter = new Meter("ai.token-usage");
var inputCounter = meter.CreateCounter<long>("ai.tokens.input");
var outputCounter = meter.CreateCounter<long>("ai.tokens.output");
var requestCounter = meter.CreateCounter<long>("ai.requests");

// After each LLM call
inputCounter.Add(inputTokens,
    new("feature", feature), new("model", model));
outputCounter.Add(outputTokens,
    new("feature", feature), new("model", model));
requestCounter.Add(1,
    new("feature", feature), new("model", model));
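
For these counters to be exported anywhere, the meter name must also be registered with the OpenTelemetry SDK. A minimal sketch, assuming the OpenTelemetry.Extensions.Hosting package; the OTLP exporter is illustrative and can be swapped for Prometheus, console, etc.:

// Program.cs (sketch) -- collect the custom meter and export it
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("ai.token-usage")   // must match the Meter name used above
        .AddOtlpExporter());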

Key Takeaways

  • Define per-feature model policies — different features need different limits
  • Enforce policies before every LLM call with a decorator/middleware
  • Track token usage with OpenTelemetry for cost visibility
  • Use a kill switch (Enabled: false) to disable runaway features instantly
  • Estimate tokens with a rough characters / 4 heuristic for local models when an exact tokenizer is unavailable

Building this governance layer was not glamorous work, but it is the kind of infrastructure that separates a prototype from a production system. Before I had it in place, every new AI feature felt like a gamble. Now I can onboard new features with confidence because I know exactly how much each one costs and who is using it, and I can shut it down in seconds if something goes wrong. If you are building anything with LLMs beyond a weekend project, invest in governance early — your future self (and your finance team) will thank you.

Next Steps

  • Implement per-user token budgets with monthly reset cycles to enable fair usage across tenants in a multi-tenant system.
  • Add alerting thresholds to your OpenTelemetry metrics so you get notified before hitting hard limits, not after.
  • Explore token-aware request batching to reduce overhead by grouping small requests into a single LLM call.
  • Build a cost dashboard that maps token usage to actual dollar amounts using your provider’s pricing tiers.

Further Reading

  • [OpenTelemetry .NET Metrics] — Microsoft, 2024
  • [Ollama API Reference] — Ollama, 2024
  • [OpenAI Tokenizer] — OpenAI, 2024
  • [Rate Limiting with Redis] — Redis, 2024