Cross-cutting

Token budget

A token budget is the cap an application imposes on tokens consumed per request or per user, for cost control, latency control, and abuse prevention. The budget includes prompt tokens, context tokens, and generation tokens; the application tracks consumption and rejects or downgrades requests that would exceed the cap.

May 23, 2026

Token budgets matter most in high-volume applications, where unbounded usage makes cost unpredictable. Common patterns are a per-user monthly cap (the budget refills each billing cycle), a per-request hard cap (a maximum enforced through the API's max-tokens parameter), and a per-conversation soft cap (the application summarises and truncates history as the budget is approached). The discipline resembles memory budgeting in performance work: explicit limits force tradeoffs that spread cost across the code instead of concentrating it in a few unbounded calls. Many observability tools track per-request token consumption as a first-class metric.
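The per-user cap and the per-conversation soft cap can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The names UserBudget, admit, trim_history, and count_tokens are hypothetical, as is the 64-token minimum reply. The whitespace word count is only a stand-in for a real tokenizer. The sketch also only truncates history; summarising the dropped turns is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UserBudget:
    """Hypothetical per-user budget for one billing cycle, in tokens."""
    monthly_cap: int
    used: int = 0

    def remaining(self) -> int:
        return max(self.monthly_cap - self.used, 0)

    def record(self, tokens: int) -> None:
        # Call after each request with the tokens actually consumed.
        self.used += tokens


def admit(budget: UserBudget, prompt_tokens: int, max_output_tokens: int,
          min_output_tokens: int = 64) -> Optional[int]:
    """Return the output-token allowance for a request, or None to reject.

    Worst-case cost is prompt_tokens + max_output_tokens. If that fits,
    the request passes unchanged; if only a shorter reply fits, it is
    downgraded to the remaining room; otherwise it is rejected.
    """
    room = budget.remaining() - prompt_tokens
    if room >= max_output_tokens:
        return max_output_tokens
    if room >= min_output_tokens:
        return room
    return None


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a stand-in for the model's tokenizer.
    return len(text.split())


def trim_history(messages: List[str], soft_cap: int) -> List[str]:
    """Keep the most recent messages whose combined size fits under soft_cap."""
    kept: List[str] = []
    total = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if total + cost > soft_cap:
            break
        kept.append(message)
        total += cost
    kept.reverse()  # restore chronological order
    return kept
```

The value returned by admit would typically be passed on as the per-request hard cap, so the three patterns compose: the user budget decides how large a reply may be, the API parameter enforces it, and trim_history keeps the prompt side from growing without bound.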