Cross-cutting

Inference cost

Inference cost is the per-request economic cost of running an LLM. It is typically billed per million input tokens and per million output tokens, with output tokens often priced 3-5x higher than input tokens. At scale, inference cost is the dominant variable cost of LLM-powered applications and the primary target of optimisation.
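The billing model above reduces to simple arithmetic. The sketch below is illustrative only: the `Pricing` class, the `request_cost` function, and the prices (1.00 and 4.00 USD per million tokens) are hypothetical values chosen to fall inside the 3-5x range, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price sheet, in USD per million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost in USD of one request, billed separately for input and output tokens."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Assumed prices: output is 4x the input price.
example_pricing = Pricing(input_per_million=1.0, output_per_million=4.0)

# A request with 2,000 input tokens and 500 output tokens.
per_request = request_cost(2_000, 500, example_pricing)

# The same request repeated one million times per day.
per_day = per_request * 1_000_000
```

Under these assumed prices, the 500 output tokens cost as much as the 2,000 input tokens, which is why output length shows up as a cost lever in its own right.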

May 23, 2026

The main cost levers are model selection (smaller models for routine work, larger ones for complex tasks), prompt caching (large savings when the same context is sent repeatedly), output length control (shorter responses cost less), batching (where the API supports it), and judicious retries (do not re-run a request on every transient error). The architectural lever is the cascading pattern: try a cheap model first, and escalate to an expensive model only when the cheap one fails or returns low confidence. For applications handling millions of requests per day, the difference between sloppy and disciplined inference cost management is often the difference between viable and unviable unit economics.
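The cascading pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions: `ModelResult`, `cascade`, the `min_confidence` threshold of 0.8, and the two stand-in model functions are all hypothetical. A real system would replace the stand-ins with actual API calls and would need its own way of estimating confidence.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ModelResult:
    text: Optional[str]  # None signals that the call failed
    confidence: float    # 0.0 to 1.0; how it is scored is application-specific


ModelFn = Callable[[str], ModelResult]


def cascade(prompt: str, cheap: ModelFn, expensive: ModelFn,
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate on failure or low confidence.

    Returns (tier_used, text) so callers can track how often they escalate.
    """
    first = cheap(prompt)
    if first.text is not None and first.confidence >= min_confidence:
        return "cheap", first.text

    second = expensive(prompt)
    if second.text is None:
        raise RuntimeError("both model tiers failed")
    return "expensive", second.text


# Stand-in models for demonstration only (no network calls).
def fake_cheap(prompt: str) -> ModelResult:
    # Pretend the cheap model is only confident on short prompts.
    if len(prompt) < 20:
        return ModelResult("short answer", 0.95)
    return ModelResult("unsure", 0.4)


def fake_expensive(prompt: str) -> ModelResult:
    return ModelResult("detailed answer", 0.99)


easy = cascade("What is 2 + 2?", fake_cheap, fake_expensive)
hard = cascade("Summarise the trade-offs between these three designs.",
               fake_cheap, fake_expensive)
```

Returning the tier alongside the text makes the escalation rate measurable. That rate determines whether the cascade saves money, since every escalated request pays for both models.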