LoRA adapter
LoRA (Low-Rank Adaptation) is a fine-tuning technique that keeps the base model frozen and trains only a small number of parameters held in added low-rank decomposition matrices. It sharply reduces training cost (typically 10-100x cheaper than full fine-tuning) and lets multiple specialised adapters share one base model.
LoRA was introduced by Hu et al. (2021) and quickly became the dominant parameter-efficient fine-tuning method. The core insight: full fine-tuning updates billions of parameters, but the weight change needed to adapt a model to a new task tends to have low rank. LoRA therefore freezes each targeted weight matrix and learns its update as the product of two much smaller matrices. Because only these adapter matrices are trained (typically <1% of base model size), the technique captures most of the benefit of full fine-tuning at a fraction of the cost. Adapters can be swapped at inference time (multi-tenancy on a shared base model) or merged into the base weights for deployment, in which case they add no extra inference latency. QLoRA extends LoRA by keeping the frozen base model in quantized form (4-bit) while training the adapter, which further reduces memory requirements.
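As a rough illustration of the low-rank update and the merge step, the NumPy sketch below applies an adapter to a single weight matrix. The matrix sizes, the rank, the variable names, and the helper functions `lora_forward` and `merge` are assumptions chosen for the example; the alpha/r scaling and the zero initialisation of B follow the convention described by Hu et al. No training loop is shown: B is simply overwritten to stand in for a trained adapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for one weight matrix; real models have many such matrices.
d_out, d_in = 1024, 1024
r, alpha = 4, 8                # adapter rank and scaling hyperparameter
scale = alpha / r

W = rng.standard_normal((d_out, d_in))      # frozen base weight, never updated
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: adapter starts as a no-op


def lora_forward(x, W, A, B, scale):
    """Base output plus the low-rank correction, without forming B @ A."""
    return x @ W.T + scale * (x @ A.T) @ B.T


def merge(W, A, B, scale):
    """Fold the adapter into the base weight for deployment."""
    return W + scale * (B @ A)


x = rng.standard_normal((3, d_in))          # a small batch of inputs

# Before any training, the adapted model matches the base model exactly.
base_out = x @ W.T
init_out = lora_forward(x, W, A, B, scale)

# Stand-in for training: give B non-zero values.
B = rng.standard_normal((d_out, r)) * 0.01

adapter_out = lora_forward(x, W, A, B, scale)   # adapter kept separate (swappable)
merged_out = x @ merge(W, A, B, scale).T        # adapter merged into the weight

trainable = A.size + B.size                 # parameters the adapter trains
fraction = trainable / W.size               # share of this matrix's parameters

print(f"trainable parameters: {trainable} ({fraction:.2%} of the base matrix)")
print("merged matches unmerged:", np.allclose(adapter_out, merged_out))
```

Keeping A and B separate is what allows several adapters to be swapped over one shared W, while merging yields a single matrix of the original shape, so the deployed model has the same structure as the base model.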
Related terms
- Fine-tuning
Fine-tuning continues the training of a pre-trained LLM on a custom dataset — typically a few thousand to a few million examples — to adapt its behaviour to a specific domain, task, or output style.
- Quantization
Quantization reduces the numerical precision of LLM weights (typically from FP16 to INT8 or INT4) to shrink the memory footprint and speed up inference, usually at the cost of a modest loss in accuracy.
- Large language model (LLM)
A large language model is a neural network trained on enormous text corpora to predict the next token given preceding tokens — typically with billions to trillions of parameters.