Reinforcement learning from human feedback (RLHF)

RLHF is the LLM training stage in which the model is refined using human preferences: humans rank model outputs, a reward model is trained on those rankings, and the LLM is optimised against the reward model with reinforcement learning. RLHF is one of the main techniques for turning a raw pretrained language model into a helpful, harmless assistant.
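To make the reward-model step concrete, a ranking is commonly reduced to pairs (a preferred output and a rejected one) and scored with a pairwise, Bradley-Terry style loss. The sketch below is a minimal plain-Python illustration under that assumption; the function name `pairwise_preference_loss` and the scalar scores are hypothetical, and a real reward model would produce the scores with a neural network.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen output beats the rejected one.

    Assumes a Bradley-Terry model:
        P(chosen preferred) = sigmoid(reward_chosen - reward_rejected)
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    # Branch on the sign so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores: the loss is small when the reward model already
# ranks the human-preferred output higher, and large when it does not.
loss_agrees = pairwise_preference_loss(2.0, 0.0)
loss_disagrees = pairwise_preference_loss(0.0, 2.0)
```

Minimising this loss over many comparisons pushes the reward model to score preferred outputs above rejected ones, which is the signal the RL stage then optimises against.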

RLHF was popularised by InstructGPT (2022) and made widely visible by ChatGPT. The pipeline has three stages: (1) collect demonstrations of desired behaviour and fine-tune the model on them; (2) collect preference comparisons (output A vs output B) and train a reward model on them; (3) use the reward model to score outputs during PPO-based RL fine-tuning. The technique converts vague behavioural goals (be helpful, honest, harmless) into a training signal. Variants emerging since 2022 include DPO (Direct Preference Optimisation), which skips the explicit reward model and trains directly on the preference pairs, and constitutional AI (Anthropic), which uses model-generated critiques to scale beyond human-labelled preferences. RLHF is computationally expensive but can produce step-changes in model behaviour.
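The way DPO avoids an explicit reward model can be sketched for a single preference pair: the loss depends only on log-probabilities from the policy being trained and from a frozen reference model. This is a simplified illustration, not a training implementation; the function name `dpo_loss`, the argument names, and the value `beta=0.1` are assumptions chosen for the example, and real log-probabilities would be summed over the tokens of each response.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value; controls how far the policy may drift
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), computed without overflowing exp().
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities. When the policy equals the reference,
# the loss is log(2); it falls as the policy shifts probability mass
# towards the chosen response relative to the reference.
loss_at_reference = dpo_loss(-11.0, -11.0, -11.0, -11.0)
loss_after_shift = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

The difference of log-ratios plays the role that the reward margin plays in a reward-model loss, which is why no separate reward model or RL loop is needed in this variant.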
