All glossary terms
Cross-cutting

Prompt caching

Prompt caching is the LLM API feature that caches the processing of repeated prompt prefixes so subsequent requests with the same prefix run faster and cost less. Useful for long system prompts, large RAG contexts, or conversation histories that don't change between turns. Anthropic, OpenAI, and Google all support some form.

The cost model varies by provider: Anthropic charges a 25% premium on cache writes and 90% discount on cache reads (so the break-even is roughly 2 reads); OpenAI offers automatic caching for prompts > 1024 tokens at 50% discount on the cached portion. The savings can be dramatic for applications with long static prompts — a 50KB system prompt that's repeated across millions of requests is the canonical case. Production patterns: structure prompts with static content first (system prompt, few-shot examples, RAG context) and dynamic content last (user query) so the static prefix caches; monitor cache-hit rates as a leading indicator of cost optimisation.

Related terms