Cross-cutting

Context window

The context window is the maximum number of tokens an LLM can process in a single request, including the prompt, retrieved context, conversation history, and the generated response. Context windows range from about 8K tokens in older models to 1M or more in recent ones (for example Gemini 1.5 Pro, GPT-4.1, and Claude Sonnet 4 with its 1M-token option).
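Because the generated response shares the window with everything sent in, the room left for output is whatever the inputs do not consume. A minimal sketch of that arithmetic follows; the function name and the example token counts are illustrative assumptions, not values from any particular provider.

```python
def remaining_output_budget(
    context_window: int,
    prompt_tokens: int,
    retrieved_tokens: int,
    history_tokens: int,
) -> int:
    """Return how many tokens are left for the generated response.

    All inputs are token counts; how they are measured depends on the
    model's tokenizer, which this sketch does not attempt to model.
    """
    used = prompt_tokens + retrieved_tokens + history_tokens
    # Never report a negative budget: zero means the request already
    # fills (or overflows) the window and must be shortened.
    return max(context_window - used, 0)


# Hypothetical request against an 8,000-token window.
budget = remaining_output_budget(
    context_window=8_000,
    prompt_tokens=500,
    retrieved_tokens=3_000,
    history_tokens=2_500,
)
print(budget)  # 2000
```

A result of zero signals that the request needs trimming before it is sent.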

May 23, 2026

Longer context windows enable richer applications (full-document Q&A, long agent traces, multi-file code understanding) but bring three costs: latency (more tokens to process), price (per-token billing scales linearly with input length), and recall degradation (models often attend less effectively to the middle of very long contexts). Production patterns optimise for the smallest context that produces the desired output: retrieval-augmented generation rather than loading entire corpora, summarisation of conversation history, and structured prompts that signal which parts of the context matter most. Prompt caching, where supported, can substantially reduce the cost of repeatedly sending the same long context.
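One simple way to keep conversation history within a fixed budget is to retain only the most recent messages that fit. The sketch below is one possible approach rather than a recommended implementation: `approx_tokens` is a crude stand-in that counts whitespace-separated words, and a real system would use the model's own tokenizer and might summarise the dropped messages instead of discarding them.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Assumption: whitespace-separated words as a rough proxy for tokens.
    # Real tokenizers usually produce different counts.
    return len(text.split())


def trim_history(
    messages: List[str],
    budget: int,
    count: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Keep the newest messages whose combined size fits within budget."""
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest so recent turns are preferred.
    for message in reversed(messages):
        cost = count(message)
        if used + cost > budget:
            break  # Stop at the first message that does not fit.
        kept.append(message)
        used += cost
    kept.reverse()  # Restore chronological order.
    return kept


history = ["a b c", "d e", "f g h i", "j"]
print(trim_history(history, budget=6))  # ['f g h i', 'j']
```

Stopping at the first message that does not fit keeps the retained history contiguous, so the model never sees a conversation with a gap in the middle.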