Inference cost
Inference cost is the per-request economic cost of running an LLM. It is typically billed per million input tokens and per million output tokens, with output tokens often priced 3 to 5 times higher than input tokens. At scale, inference cost is the dominant variable cost of LLM-powered applications and the primary target of optimisation.
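As a rough illustration of the per-token billing described above, the cost of a single request can be sketched as follows. The prices used here (3.00 and 15.00 USD per million tokens) and the names `Pricing` and `request_cost` are hypothetical, chosen only to show a 5x output-to-input ratio; real prices vary by provider and model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (hypothetical values, not a real price list)."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Return the cost in USD of one request under simple per-token billing."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Assumed example pricing: output tokens cost 5x input tokens.
EXAMPLE_PRICING = Pricing(input_per_million=3.0, output_per_million=15.0)

# A request with 2,000 input tokens and 500 output tokens.
cost = request_cost(2_000, 500, EXAMPLE_PRICING)
print(f"{cost:.4f} USD per request")
```

Under these assumed prices, the 500 output tokens cost more than the 2,000 input tokens, which is why output length is worth controlling.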
The main cost levers are model selection (smaller models for routine work, larger ones for complex tasks), prompt caching (large savings on repeated context), output length control (shorter responses cost less), batching (where the API supports it), and disciplined retries (not re-running the full request on every transient error). The architectural lever is the cascading pattern: try a cheap model first, and escalate to an expensive model only when the cheap one fails or returns low confidence. For applications handling millions of requests per day, the difference between sloppy and disciplined inference cost management is often the difference between viable and unviable unit economics.
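The cascading pattern can be sketched as below. This is a minimal sketch under stated assumptions: the models are represented as plain callables, the `confidence` score and the 0.8 threshold are hypothetical (how confidence is obtained is application-specific), and `stub_cheap` and `stub_expensive` stand in for real API calls.

```python
from typing import Callable, NamedTuple, Tuple


class ModelResult(NamedTuple):
    text: str
    confidence: float  # assumed to be in [0.0, 1.0]; its source is application-specific


ModelFn = Callable[[str], ModelResult]


def cascade(prompt: str, cheap: ModelFn, expensive: ModelFn,
            min_confidence: float = 0.8) -> Tuple[ModelResult, str]:
    """Try the cheap model first; escalate on failure or low confidence.

    Returns the result and the name of the tier that produced it.
    """
    try:
        result = cheap(prompt)
    except Exception:
        # The cheap model failed outright, so escalate.
        return expensive(prompt), "expensive"
    if result.confidence >= min_confidence:
        return result, "cheap"
    # The cheap model answered but was not confident enough, so escalate.
    return expensive(prompt), "expensive"


# Hypothetical stand-ins for real model calls.
def stub_cheap(prompt: str) -> ModelResult:
    confidence = 0.9 if len(prompt) < 40 else 0.3
    return ModelResult("cheap answer", confidence)


def stub_expensive(prompt: str) -> ModelResult:
    return ModelResult("expensive answer", 0.95)


easy = cascade("What is 2 + 2?", stub_cheap, stub_expensive)
hard = cascade("Summarise the legal implications of the following contract clause in detail.",
               stub_cheap, stub_expensive)
print(easy[1], hard[1])
```

The savings depend on how often the cheap tier is accepted: every escalation pays for both calls, so a threshold that escalates too often can cost more than calling the expensive model directly.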
Related terms
- Prompt caching
Prompt caching is the LLM API feature that caches the processing of repeated prompt prefixes so subsequent requests with the same prefix run faster and cost less.
- Token budget
A token budget is the cap an application imposes on tokens consumed per request or per user — for cost control, latency control, and abuse prevention.
- Context window
The context window is the maximum number of tokens an LLM can process in a single request — including the prompt, retrieved context, conversation history, and the generated response.
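The token budget defined above can be enforced with a small in-memory check like the following. This is a sketch only: the `TokenBudget` class, its limits, and the per-user counter are hypothetical, and a production system would typically persist usage and reset it on a schedule.

```python
from collections import defaultdict


class TokenBudget:
    """Enforces a per-request cap and a cumulative per-user cap (hypothetical design)."""

    def __init__(self, per_request_limit: int, per_user_limit: int) -> None:
        self.per_request_limit = per_request_limit
        self.per_user_limit = per_user_limit
        self._used = defaultdict(int)  # user_id -> tokens consumed so far

    def try_consume(self, user_id: str, tokens: int) -> bool:
        """Record the usage and return True if it fits both caps, else return False."""
        if tokens > self.per_request_limit:
            return False
        if self._used[user_id] + tokens > self.per_user_limit:
            return False
        self._used[user_id] += tokens
        return True


budget = TokenBudget(per_request_limit=1_000, per_user_limit=2_500)
results = [
    budget.try_consume("user-a", 1_000),  # fits both caps
    budget.try_consume("user-a", 1_000),  # fits both caps (2,000 used)
    budget.try_consume("user-a", 1_000),  # would exceed the per-user cap
    budget.try_consume("user-a", 500),    # fits (2,500 used)
    budget.try_consume("user-b", 1_500),  # exceeds the per-request cap
]
print(results)
```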