Top-p sampling
Top-p (or nucleus) sampling restricts token selection to the smallest set of highest-probability tokens whose cumulative probability reaches or exceeds p, typically 0.9 or 0.95. The probabilities of the surviving tokens are renormalized before sampling. The technique adapts to the model's confidence: when the model is confident, the set is small; when it is uncertain, the set is large. Top-p often outperforms pure temperature sampling on quality at comparable diversity.
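The adaptive set size can be seen in a minimal sketch. The function name `nucleus` and the two example distributions are illustrative assumptions, not taken from any particular library. This version stops as soon as the cumulative probability reaches p; some implementations handle the boundary slightly differently.

```python
def nucleus(probs, p=0.9):
    """Return indices of the smallest set of tokens whose
    cumulative probability reaches p (hypothetical helper)."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

# Made-up distributions over a five-token vocabulary.
confident = [0.92, 0.04, 0.02, 0.01, 0.01]
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]

print(len(nucleus(confident)))  # 1
print(len(nucleus(uncertain)))  # 5
```

With the same p, the peaked distribution keeps a single token while the flat one keeps all five.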
Top-p was introduced by Holtzman et al. (2019) as a counter to two failure modes of pure temperature sampling: at low temperature, the output is too repetitive; at high temperature, low-quality tokens are sometimes sampled. Top-p truncates the long tail of unlikely tokens at any temperature, so high-temperature sampling stays more coherent than it would without truncation. Production usage typically combines the two: temperature controls diversity, and top-p controls tail truncation. Defaults of temperature=0.7 and top-p=0.95 are reasonable for most general-purpose generation.
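One way to combine the two parameters is sketched below, assuming temperature is applied to the logits first and top-p truncation second. That ordering is common, but it is an assumption here rather than something this entry specifies. The function name `sample_token` and the example logits are hypothetical.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.95, rng=random):
    """Sample one token index: temperature-scaled softmax, then top-p truncation."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding (always the most likely token).
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights, which renormalizes the nucleus.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Made-up logits for a five-token vocabulary.
logits = [4.0, 3.5, 1.0, -2.0, -5.0]
rng = random.Random(0)
samples = [sample_token(logits, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

For these logits and the default settings, the first two tokens already cover more than 95% of the probability mass, so the three tail tokens are never sampled.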
Related terms
- Temperature (sampling)
Temperature is the LLM sampling parameter that controls randomness in token selection: 0 produces deterministic output (always the most likely token), values between 0 and 1 sharpen the distribution toward likely tokens, 1 samples from the model's unmodified distribution, and higher values flatten the distribution and produce more diverse output.
- Large language model (LLM)
A large language model is a neural network trained on enormous text corpora to predict the next token given preceding tokens — typically with billions to trillions of parameters.
- Inference cost
Inference cost is the per-request economic cost of running an LLM — typically billed per million input tokens and per million output tokens, with output tokens often 3-5x more expensive than input.
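The per-request arithmetic behind inference cost can be sketched as follows. The prices are hypothetical placeholders chosen so that output costs 4x input; they are not any provider's actual rates, and the function name `request_cost` is illustrative.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request, in the same currency as the per-million-token prices."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000

# Hypothetical prices: 1.00 per million input tokens, 4.00 per million output tokens.
cost = request_cost(2_000, 500, input_price_per_m=1.00, output_price_per_m=4.00)
print(f"{cost:.4f}")  # 0.0040
```

In this example the 500 output tokens cost as much as the 2,000 input tokens.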