Prompt caching
Prompt caching is an LLM API feature that caches the processing of repeated prompt prefixes, so that subsequent requests with the same prefix run faster and cost less. It is useful for long system prompts, large RAG contexts, and conversation histories that don't change between turns. Anthropic, OpenAI, and Google all support some form of it.
The cost model varies by provider. Anthropic charges a 25% premium on cache writes and gives a 90% discount on cache reads, so caching pays for itself on the first cache read: one write plus one read costs 1.35x the normal price of the prefix, against 2x for two uncached requests. OpenAI applies caching automatically to prompts of 1,024 tokens or more, at a 50% discount on the cached portion. The savings can be dramatic for applications with long static prompts; a 50 KB system prompt repeated across millions of requests is the canonical case.
Production patterns: structure prompts with static content first (system prompt, few-shot examples, RAG context) and dynamic content last (user query) so that the static prefix caches, and monitor the cache-hit rate as a leading indicator of cost savings.
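The arithmetic behind the write-premium and read-discount model, and the "static first, dynamic last" layout, can be sketched in a few lines of Python. This is a minimal illustration rather than any provider's API: `CachePricing`, `prefix_cost`, and `build_prompt` are hypothetical names, the default multipliers are simply the 25% premium and 90% discount quoted above, and the model assumes the first request writes the cache and every later request hits it (no expiry, no misses).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CachePricing:
    """Multipliers relative to the normal input-token price (assumed values)."""

    write_multiplier: float = 1.25  # 25% premium on the request that writes the cache
    read_multiplier: float = 0.10   # 90% discount on requests that read the cache


def prefix_cost(requests: int, pricing: CachePricing | None = None) -> tuple[float, float]:
    """Return (uncached, cached) cost of the static prefix over `requests` calls.

    Costs are in units of "one uncached pass over the prefix". Assumes the
    first request writes the cache and all later requests read it.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    pricing = pricing or CachePricing()
    uncached = float(requests)
    cached = pricing.write_multiplier + pricing.read_multiplier * (requests - 1)
    return uncached, cached


def build_prompt(static_parts: list[str], user_query: str) -> str:
    """Put static content first and the dynamic query last so the prefix repeats."""
    return "\n\n".join([*static_parts, user_query])


if __name__ == "__main__":
    for n in (1, 2, 10):
        uncached, cached = prefix_cost(n)
        print(f"{n:>2} requests: uncached={uncached:.2f} cached={cached:.2f}")
```

Under these assumptions a single request costs more with caching (1.25 against 1.00), while two requests already cost less (1.35 against 2.00). Real savings depend on the actual hit rate, which is why the hit rate is worth monitoring.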
Related terms
- Context window
The context window is the maximum number of tokens an LLM can process in a single request — including the prompt, retrieved context, conversation history, and the generated response.
- Inference cost
Inference cost is the per-request economic cost of running an LLM — typically billed per million input tokens and per million output tokens, with output tokens often 3-5x more expensive than input.
- System prompt
A system prompt is the initial instruction given to an LLM at the start of a session that sets behaviour, persona, output format, and constraints — distinct from user messages that follow.