Quantization
Quantization reduces the numerical precision of LLM weights (typically from FP16 to INT8 or INT4) to shrink the memory footprint and speed up inference, usually at the cost of a modest accuracy loss. It lets large models run on smaller and cheaper hardware (at INT4, the weights of a 70B-parameter model take roughly 35 GB, down from about 140 GB at FP16, so they fit on a single 48 GB or 80 GB GPU) and reduces inference cost in production.
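The arithmetic behind the 70B example above can be sketched as follows. This is a rough, weights-only estimate: the function name and the 1 GB = 1e9 bytes convention are choices made for this illustration, and the figure ignores the KV cache, activations, and the small overhead of stored scales and zero-points.

```python
# Approximate weight-only memory footprint at different precisions.
# Ignores KV cache, activations, and per-group scale/zero-point overhead.

BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    # 70e9 parameters, as in the 70B example in the text.
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
```

For 70 billion parameters this gives about 140 GB at FP16, 70 GB at INT8, and 35 GB at INT4, which is why the INT4 variant is the one that fits on a single large GPU.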
Quantization trades precision for speed and memory. INT8 is typically near-lossless for inference; INT4 produces noticeable but manageable degradation; INT2 and lower remain research territory. Quantization-aware training (QAT) generally produces better results than post-training quantization (PTQ), but it requires access to the training pipeline and data, so in practice the model owner has to participate. Quantization is one of several model-compression approaches, alongside pruning (including sparsity) and distillation, and modern stacks for serving very large models efficiently often combine several of them. Production examples: GPTQ and AWQ are widely used post-training quantization methods, and bitsandbytes is a widely used library for 8-bit and 4-bit quantization.
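To make the precision trade-off concrete, here is a minimal sketch of symmetric round-to-nearest quantization with a single per-tensor scale. It is the simplest post-training scheme, not the algorithm used by GPTQ, AWQ, or bitsandbytes, which add per-group scales, calibration data, and other refinements. The function names and sample weights are invented for illustration.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization of a list of floats.

    Returns (integer codes, scale). Assumes one scale for the whole tensor.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer code and clamp to the valid range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


if __name__ == "__main__":
    weights = [0.02, -0.75, 0.31, 1.0, -0.48, 0.09]  # made-up sample values
    for bits in (8, 4):
        print(f"INT{bits}: max abs error = {max_abs_error(weights, bits):.4f}")
```

The worst-case rounding error is half the scale, and the scale grows as the bit width shrinks: INT4 has only 15 levels against INT8's 255, so its step size is roughly 18 times larger. This is the mechanism behind the larger degradation at INT4.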
Related terms
- LoRA adapter
LoRA (Low-Rank Adaptation) is a fine-tuning technique that trains only a small number of parameters, held in added low-rank decomposition matrices, while leaving the base model's weights frozen.
- Distillation
Knowledge distillation trains a smaller 'student' model to mimic the outputs of a larger 'teacher' model — producing a model that runs faster and cheaper while retaining most of the teacher's capability on the target tasks.
- Inference cost
Inference cost is the per-request economic cost of running an LLM — typically billed per million input tokens and per million output tokens, with output tokens often 3-5x more expensive than input.