Reinforcement learning from human feedback (RLHF)
RLHF is the LLM training stage in which the model is refined using human preferences: humans rank model outputs, a reward model is trained on those rankings, and the LLM is optimised against the reward model with reinforcement learning. RLHF is the primary technique that turns a raw, pre-trained language model into a helpful, harmless assistant.
RLHF was popularised by InstructGPT (2022) and made widely visible by ChatGPT. The pipeline has three stages: (1) collect demonstrations of desired behaviour and fine-tune the model on them with supervised learning; (2) collect preference comparisons (output A vs output B for the same prompt) and train a reward model to predict which output humans prefer; (3) use the reward model to score sampled outputs during PPO-based RL fine-tuning. The technique converts vague behavioural goals (be helpful, honest, harmless) into a numerical training signal. Variants that have emerged since: DPO (Direct Preference Optimisation) skips the explicit reward model and trains the LLM directly on the preference pairs; constitutional AI (Anthropic) uses model-generated critiques to scale beyond human-labelled preferences. RLHF is computationally expensive, because the RL stage has to sample outputs from the model and score each one with the reward model, but it produces step-changes in model behaviour.
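The two preference-based objectives mentioned above can be sketched in a few lines. This is a minimal illustration, not a training implementation: it assumes the reward model is trained with the standard pairwise (Bradley-Terry) loss and uses the DPO loss in its commonly published form. The function names, the scalar inputs, and the default beta=0.1 are assumptions made for the example; a real system would compute these over batches of tensors and backpropagate through the model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss for one human comparison.

    r_chosen / r_rejected are the scalar rewards the reward model assigns
    to the output the human preferred and the one they rejected.
    The loss is small when the preferred output scores higher.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, controls how far the policy may drift
) -> float:
    """DPO loss for one comparison, with no explicit reward model.

    logp_* are sequence log-probabilities under the model being trained;
    ref_logp_* are the same quantities under a frozen reference model.
    """
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))
```

For example, reward_model_loss(2.0, 0.0) is lower than reward_model_loss(0.0, 2.0), because the first case ranks the human-preferred output higher. When the trained model and the reference model assign identical log-probabilities, dpo_loss returns log 2 (about 0.693), the value for a model that expresses no preference either way.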
Related terms
- Fine-tuning
Fine-tuning continues the training of a pre-trained LLM on a custom dataset — typically a few thousand to a few million examples — to adapt its behaviour to a specific domain, task, or output style.
- Large language model (LLM)
A large language model is a neural network trained on enormous text corpora to predict the next token given preceding tokens — typically with billions to trillions of parameters.
- Hallucination
Hallucination is the LLM failure mode in which the model generates content that is plausible-sounding but factually wrong — invented citations, fabricated quotes, non-existent functions, misremembered statistics.