Token budget

A token budget is the cap an application imposes on tokens consumed per request or per user — for cost control, latency control, and abuse prevention. The budget includes prompt tokens, context tokens, and generation tokens; the application tracks consumption and rejects or downgrades requests that would exceed the cap.
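The track-then-reject-or-downgrade behaviour described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `TokenBudget` class, its `decide` and `record` methods, and the `min_generation` threshold are all hypothetical names, and "downgrade" is assumed here to mean shrinking the generation allowance (an application could equally route to a cheaper model).

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Tracks tokens consumed against a fixed cap (hypothetical helper)."""
    cap: int
    used: int = 0

    @property
    def remaining(self) -> int:
        return max(self.cap - self.used, 0)

    def decide(self, prompt: int, context: int, generation: int,
               min_generation: int = 64) -> tuple[str, int]:
        """Return (decision, allowed_generation_tokens) for one request."""
        input_tokens = prompt + context
        if input_tokens + generation <= self.remaining:
            return "allow", generation
        # Assumed downgrade policy: keep the input, shrink the output allowance.
        reduced = self.remaining - input_tokens
        if reduced >= min_generation:
            return "downgrade", reduced
        return "reject", 0

    def record(self, tokens: int) -> None:
        """Add the tokens actually consumed by a completed request."""
        self.used += tokens


budget = TokenBudget(cap=1000)
first = budget.decide(prompt=200, context=100, generation=300)
budget.record(600)
second = budget.decide(prompt=200, context=100, generation=300)
budget.record(50)
third = budget.decide(prompt=200, context=100, generation=300)
```

With a cap of 1000, the first request fits in full, the second is downgraded to 100 generation tokens once 600 tokens have been used, and the third is rejected because the remaining allowance falls below the assumed 64-token minimum.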

Token budgets matter most for high-volume applications, where unbounded usage makes cost unpredictable. Common patterns: a per-user monthly cap (the budget refills each billing cycle), a per-request hard cap (a maximum token count enforced through the API's max-tokens parameter), and a per-conversation soft cap (the application summarises and truncates history as the budget is approached). The discipline resembles memory budgeting in performance work: explicit limits force tradeoffs and spread cost across the whole application instead of letting it concentrate in a few unbounded calls. Many observability tools now track per-request token consumption as a first-class metric.
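As an illustration of the per-conversation soft cap, the sketch below drops the oldest messages once the history exceeds the cap and leaves a placeholder where a summary would go. Everything here is an assumption made for the example: `trim_history` and `count_tokens` are hypothetical helpers, the whitespace word count is a crude stand-in for a real tokenizer, and the placeholder string stands in for an actual summarisation step.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages: list[str], soft_cap: int,
                 keep_recent: int = 2) -> list[str]:
    """Drop the oldest messages until the history fits under soft_cap
    or only the `keep_recent` newest messages remain.

    Dropped messages are replaced by a one-line placeholder where a real
    summary would go. The placeholder's own tokens are not counted, which
    is acceptable only because the cap is soft.
    """
    total = sum(count_tokens(m) for m in messages)
    kept = list(messages)
    dropped = 0
    while len(kept) > keep_recent and total > soft_cap:
        total -= count_tokens(kept.pop(0))
        dropped += 1
    if dropped == 0:
        return kept
    return [f"[summary of {dropped} earlier messages]"] + kept


history = ["a b c d", "e f g", "h i", "j k l m n"]  # 4 + 3 + 2 + 5 = 14 tokens
trimmed = trim_history(history, soft_cap=8)
untouched = trim_history(history, soft_cap=20)
```

Keeping the most recent messages unconditionally is one possible design choice; it favours conversational coherence over strict adherence to the cap, which is why the cap is described as soft.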
