Context window
The context window is the maximum number of tokens an LLM can process in a single request, counting the prompt, retrieved context, conversation history, and the generated response. Sizes range from about 8K tokens in older models to 1M or more in long-context models such as Gemini 1.5 Pro.
Longer context windows enable richer applications such as full-document Q&A, long agent traces, and multi-file code understanding, but they bring three costs: latency (more tokens to process), price (per-token billing grows with context length), and recall degradation (models often attend less effectively to the middle of very long contexts). Production patterns aim for the smallest context that produces the desired output: retrieval-augmented generation instead of loading entire corpora, summarisation of conversation history, and structured prompts that signal which parts of the context matter most. Prompt caching, where supported, can substantially reduce the cost of repeated long contexts.
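One way to keep a request inside the context window is to trim conversation history before sending it. The sketch below is illustrative only: `approx_tokens` and `fit_history` are hypothetical helpers, and the 4-characters-per-token estimate is a rough assumption rather than a real tokenizer. A production system would count tokens with the tokenizer of the model it actually calls.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, (len(text) + 3) // 4)


def fit_history(
    system_prompt: str,
    history: list[str],
    context_window: int,
    reserved_for_output: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> list[str]:
    """Return the newest history messages that fit next to the system prompt.

    The generated response shares the context window with the input,
    so space for it is reserved before any history is admitted.
    """
    budget = context_window - reserved_for_output - count_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt and reserved output exceed the context window")

    kept: list[str] = []
    # Walk from newest to oldest and stop at the first message that does not fit,
    # so the kept history is always a contiguous recent suffix.
    for message in reversed(history):
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    kept.reverse()  # restore chronological order
    return kept
```

Dropping the oldest turns is the simplest policy. Replacing them with a summary, as described above, keeps more information at the price of an extra model call.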
Related terms
- Prompt caching
Prompt caching is the LLM API feature that caches the processing of repeated prompt prefixes so subsequent requests with the same prefix run faster and cost less.
- Token budget
A token budget is the cap an application imposes on tokens consumed per request or per user — for cost control, latency control, and abuse prevention.
- Large language model (LLM)
A large language model is a neural network trained on enormous text corpora to predict the next token given preceding tokens — typically with billions to trillions of parameters.