Distillation
Knowledge distillation trains a smaller 'student' model to mimic the outputs of a larger 'teacher' model — producing a model that runs faster and cheaper while retaining most of the teacher's capability on the target tasks. Distillation is widely used to produce production models from frontier-quality teachers.
The mechanics: collect input-output pairs from the teacher model (often including the teacher's logits or chain-of-thought), then train the student on those pairs with a loss that matches both the answer and, where it is available, the reasoning. The student can be 10-100x smaller, and faster by a similar factor, while typically retaining 80-95% of the teacher's performance on the targeted task distribution. Distillation has produced many of the small but capable models that power production applications, for example Phi, Mistral 7B variants, and proprietary fine-tunes trained on Claude or GPT outputs. The trade-off: the student is narrow. It performs well on the distilled task distribution but may do worse off-distribution.
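To make the loss concrete, here is a minimal sketch of one common formulation for the case where teacher logits are available: a temperature-softened KL term that pulls the student's distribution toward the teacher's, blended with ordinary cross-entropy on the correct label. This is an illustrative assumption rather than the only way to distill. The function names, the `temperature` and `alpha` parameters, and the example logits are all hypothetical. When only teacher text is available (answers or chain-of-thought), the usual approach is plain next-token cross-entropy on that text instead.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term (match the teacher) with a hard-label term.

    temperature and alpha are illustrative defaults, not recommended values.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Scaling by temperature**2 keeps the soft term's gradient magnitude
    # comparable to the hard term when the temperature changes.
    soft_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Standard cross-entropy against the ground-truth class at temperature 1.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical three-class example: the teacher strongly prefers class 0.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers the wrong class

loss_close = distillation_loss(close_student, teacher, label=0)
loss_far = distillation_loss(far_student, teacher, label=0)
print(loss_close < loss_far)
```

In a real training loop this scalar would be computed per token or per example in a framework with automatic differentiation, then averaged over a batch. The pure-Python version only shows how the two terms combine.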
Related terms
- Fine-tuning
Fine-tuning continues the training of a pre-trained LLM on a custom dataset — typically a few thousand to a few million examples — to adapt its behaviour to a specific domain, task, or output style.
- Quantization
Quantization reduces the numerical precision of LLM weights (typically from FP16 to INT8 or INT4) to shrink memory footprint and speed up inference, with modest accuracy loss.
- Inference cost
Inference cost is the per-request economic cost of running an LLM — typically billed per million input tokens and per million output tokens, with output tokens often 3-5x more expensive than input.
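As a rough illustration of the per-token billing described in the last entry, the arithmetic can be sketched as follows. The prices are made-up placeholders chosen so that output tokens cost 4x input tokens. They are not any provider's actual rates, and `request_cost` is a hypothetical helper.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Cost of one request when prices are quoted per million tokens."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000

# Hypothetical prices: 1.00 per million input tokens, 4.00 per million output tokens.
cost = request_cost(input_tokens=2000, output_tokens=500,
                    input_price_per_million=1.00,
                    output_price_per_million=4.00)
print(f"{cost:.4f}")  # 0.0040
```

At these placeholder prices the 500 output tokens cost as much as the 2000 input tokens, so the length of the response can dominate the bill as much as the length of the prompt.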