Quantization

Quantization reduces the numerical precision of LLM weights (typically from FP16 to INT8 or INT4) to shrink the memory footprint and speed up inference, usually at the cost of a modest accuracy loss. It lets large models run on smaller, cheaper hardware (the weights of a 70B-parameter model take roughly 140 GB at FP16 but about 35 GB at INT4, which fits on a single high-memory GPU) and reduces inference cost in production.
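The footprint arithmetic can be sketched in a few lines. This is a rough estimate that counts weight storage only and ignores activations, the KV cache, and the per-group scale metadata that real quantized formats add; the function name `weight_memory_gb` is illustrative, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 10**9 bytes).

    Assumption: every parameter is stored at the same bit width and
    no extra metadata (scales, zero points) is counted.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Compare a 70B-parameter model at the precisions named above.
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
```

Halving the bit width halves the weight footprint, which is why the step from FP16 to INT4 cuts it to a quarter.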

Quantization trades precision for speed and memory. INT8 is typically near-lossless for inference; INT4 produces noticeable but manageable degradation; INT2 and lower remain research territory. Quantization-aware training generally produces better results than post-training quantization, but it requires access to the training pipeline, so the model owner has to participate. The technique is one of several model-compression approaches (alongside pruning, distillation, and sparsity); to serve very large models efficiently, modern stacks often combine several of these techniques. Production examples: GPTQ, AWQ, and bitsandbytes are among the most widely used quantization methods and toolkits.
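To make the precision trade-off concrete, here is a minimal sketch of symmetric, per-tensor "absmax" post-training quantization in plain Python. It is a simplified illustration, not the algorithm used by GPTQ, AWQ, or bitsandbytes, and the function names and sample weights are made up for the example.

```python
def quantize(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


def max_abs_error(weights, bits):
    """Largest round-trip error introduced by quantizing at the given width."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


if __name__ == "__main__":
    weights = [0.12, -0.53, 0.91, -0.07, 0.33, -1.20]  # hypothetical values
    for bits in (8, 4):
        print(f"INT{bits}: max round-trip error = {max_abs_error(weights, bits):.4f}")
```

With a single shared scale, the round-trip error is bounded by half the step size, and the step size grows as the bit width shrinks. That is the intuition behind INT4 degrading more than INT8; production methods reduce the error further with per-group scales and calibration data.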
