Token budget
A token budget is the cap an application imposes on tokens consumed per request or per user — for cost control, latency control, and abuse prevention. The budget includes prompt tokens, context tokens, and generation tokens; the application tracks consumption and rejects or downgrades requests that would exceed the cap.
Token budgets matter most for high-volume applications, where unbounded usage produces unpredictable cost. Common patterns: a per-user monthly cap (the budget refills each billing cycle), a per-request hard cap (a maximum token count enforced through an API parameter), and a per-conversation soft cap (the application summarises and truncates history as usage approaches the budget). The discipline is similar to memory budgeting in performance work: explicit limits force tradeoffs that spread cost across the codebase instead of letting it concentrate in a few unbounded calls. Many observability tools track per-request token consumption as a first-class metric.
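The per-user and per-request caps described above can be sketched as a small tracker. This is a minimal illustration, not a reference implementation: the `TokenBudget` class, its field names, the three decision labels, and the numeric limits are all hypothetical, and token counts are assumed to be supplied by the caller (for example, from a tokenizer or from the usage figures an API returns).

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical per-user budget; names and limits are illustrative."""

    monthly_cap: int      # tokens allowed per billing cycle
    per_request_cap: int  # hard cap on prompt + output for one request
    used: int = 0         # tokens consumed so far in this cycle

    def check(self, prompt_tokens: int, max_output_tokens: int) -> str:
        """Return 'allow', 'downgrade', or 'reject' for a proposed request."""
        requested = prompt_tokens + max_output_tokens
        # Per-request hard cap: refuse oversized requests outright.
        if requested > self.per_request_cap:
            return "reject"
        remaining = self.monthly_cap - self.used
        if requested <= remaining:
            return "allow"
        # The prompt alone still fits, so the request could run
        # with a smaller output limit.
        if prompt_tokens < remaining:
            return "downgrade"
        return "reject"

    def record(self, prompt_tokens: int, output_tokens: int) -> None:
        """Add the tokens actually consumed by a completed request."""
        self.used += prompt_tokens + output_tokens

    def reset(self) -> None:
        """Refill the budget at the start of a billing cycle."""
        self.used = 0


budget = TokenBudget(monthly_cap=10_000, per_request_cap=4_000)
decision = budget.check(prompt_tokens=1_000, max_output_tokens=500)  # "allow"
budget.record(prompt_tokens=1_000, output_tokens=400)  # used is now 1_400
```

Checking against the requested output limit rather than the eventual output keeps the decision conservative: a request is admitted only if its worst case fits, and `record` then charges what was actually consumed.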
Related terms
- Context window
The context window is the maximum number of tokens an LLM can process in a single request — including the prompt, retrieved context, conversation history, and the generated response.
- Inference cost
Inference cost is the per-request economic cost of running an LLM — typically billed per million input tokens and per million output tokens, with output tokens often 3-5x more expensive than input.
- Prompt caching
Prompt caching is the LLM API feature that caches the processing of repeated prompt prefixes so subsequent requests with the same prefix run faster and cost less.
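The per-million-token billing described under Inference cost reduces to simple arithmetic. The sketch below assumes separate input and output prices; the function name `request_cost` and the prices in the example are placeholders chosen for illustration, not any provider's actual rates.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request when tokens are billed per million."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder prices: 1.00 per million input tokens, 4.00 per million output.
cost = request_cost(
    input_tokens=2_000,
    output_tokens=500,
    input_price_per_million=1.0,
    output_price_per_million=4.0,
)
```

With these placeholder prices, the 500 output tokens cost as much as the 2,000 input tokens, which is why capping generation length is often the first lever in a token budget.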