Cross-cutting

Distillation

Knowledge distillation trains a smaller 'student' model to mimic the outputs of a larger 'teacher' model, producing a model that is faster and cheaper to run while retaining most of the teacher's capability on the target tasks. Distillation is widely used to produce production models from frontier-quality teachers.

May 23, 2026

The mechanics: collect input-output pairs from the teacher model (often including the teacher's logits or chain-of-thought), then train the student on those pairs with a loss that matches the teacher's answer and, where it is available, its reasoning. The student can be radically smaller (10-100x) and faster by a similar factor while retaining 80-95% of the teacher's performance on the targeted task distribution. Distillation has produced many of the small but capable models that power production applications: Phi, Mistral 7B variants, and proprietary fine-tunes trained on Claude or GPT outputs. The trade-off is that the student is narrow: good on the distilled task distribution, possibly worse off-distribution.
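
As a rough illustration of the training loss, the sketch below assumes the classic soft-target formulation for the case where teacher logits are available: a KL-divergence term between temperature-softened teacher and student distributions, blended with ordinary cross-entropy on the correct label. The text above does not prescribe a specific loss, so the function names, the `temperature` and `alpha` parameters, and the toy logits are all hypothetical choices made for illustration. It is written in plain Python for a single example; real training would typically use a deep learning framework and batches.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_sum for s in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term (match the teacher) with a hard-label term.

    temperature and alpha are illustrative hyperparameters, not values
    taken from the text.
    """
    # Soft targets: KL(teacher || student) on temperature-softened outputs.
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    soft = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    # Scaling by temperature**2 is the conventional correction that keeps
    # the soft term's gradient magnitude comparable across temperatures.
    soft *= temperature ** 2

    # Hard target: cross-entropy of the student against the correct label.
    hard = -log_softmax(student_logits)[label]

    return alpha * soft + (1 - alpha) * hard


# Toy logits for a 3-class problem (made-up numbers).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # disagrees with the teacher

print(distillation_loss(close_student, teacher, label=0))
print(distillation_loss(far_student, teacher, label=0))
```

The student that tracks the teacher's distribution receives the lower loss. Raising `alpha` weights imitation of the teacher more heavily than the ground-truth label; when only teacher text is collected (no logits), the same idea reduces to ordinary cross-entropy on the teacher's generated tokens.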