LoRA
Overview
An important paradigm in natural language processing is large-scale pretraining on general-domain data followed by adaptation to specific tasks or domains. As models grow larger, full fine-tuning, which retrains every parameter, becomes impractical.
Definitions
Fine-Tuning
Adjusting the weights of a pre-trained model for a specific task. Becomes less feasible as models grow, since every parameter must be updated, and a full copy of the model must be stored, for each task.
Low-Rank Adaptation (LoRA)
LoRA freezes the pre-trained weights and injects trainable rank-decomposition matrices into the layers of the Transformer architecture, drastically reducing the number of trainable parameters and the GPU memory required for training.
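The idea can be sketched in a few lines. This is a minimal illustration, not the released implementation: the dimensions are arbitrary, and the frozen weight W is adapted by adding a low-rank product B·A, where only A and B are trained. B starts at zero so the adapted model initially matches the pre-trained one.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                       # hypothetical: hidden size 8, LoRA rank 2
W = rng.normal(size=(d, d))       # pre-trained weight, kept frozen
B = np.zeros((d, r))              # trainable, initialized to zero
A = rng.normal(size=(r, d))       # trainable

x = rng.normal(size=d)

# Adapted forward pass: h = W x + B A x
h = W @ x + B @ (A @ x)

# With B = 0 at initialization, the adapted model matches the original.
assert np.allclose(h, W @ x)
```

During training, gradients flow only into A and B; W never changes, so optimizer state is needed only for the small matrices.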
Trainable Parameters
The model weights that are updated during training. Compared to full fine-tuning of GPT-3 175B, LoRA can reduce the number of trainable parameters by a factor of 10,000 and the GPU memory requirement by a factor of 3.
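The parameter saving is simple arithmetic. For a single d × d weight matrix (the hidden size d and rank r below are hypothetical, not taken from the paper), LoRA trains only the r × d and d × r factors:

```python
d = 4096   # hypothetical hidden size
r = 8      # hypothetical LoRA rank

full = d * d            # fine-tuning updates the whole d x d matrix
lora = r * d + d * r    # LoRA trains only A (r x d) and B (d x r)

reduction = full / lora
print(f"{full:,} vs {lora:,} trainable parameters ({reduction:.0f}x fewer)")
# → 16,777,216 vs 65,536 trainable parameters (256x fewer)
```

Smaller ranks shrink the trainable set further, which is how the 10,000x figure is reached across a very large model.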
Model Quality
The accuracy or task performance of a model. Despite training far fewer parameters, LoRA performs on par with or better than traditional full fine-tuning.
Rank-Decomposition Matrices
Pairs of low-rank matrices injected by LoRA to represent the weight update, reducing training cost without compromising model quality or adding inference latency.
Inference Latency
The time a model takes to produce an output. Because the low-rank update can be merged into the frozen weights before deployment, LoRA adds no inference latency.
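Why there is no extra latency can be shown directly: once training is done, the low-rank product can be folded into the base weight, so inference uses a single matrix multiply just like the original model. A minimal sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 6, 2                      # hypothetical hidden size and rank
W = rng.normal(size=(d, d))      # frozen pre-trained weight
A = rng.normal(size=(r, d))      # trained LoRA factors
B = rng.normal(size=(d, r))
x = rng.normal(size=d)

# For deployment, fold the low-rank update into the base weight once:
W_merged = W + B @ A

# One matrix multiply at inference, identical cost to the original model.
assert np.allclose(W_merged @ x, W @ x + B @ (A @ x))
```

This is the key contrast with adapter layers, which add extra sequential modules that cannot be merged away.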
Benefits of LoRA
- Significant reduction in trainable parameters and GPU memory requirements.
- Comparable or superior performance to fine-tuning on models like RoBERTa, DeBERTa, GPT-2, and GPT-3.
- Higher training throughput and, unlike adapter-based methods, no additional inference latency.
Empirical Investigation
An empirical study of rank-deficiency in language model adaptation sheds light on why LoRA's low-rank updates are effective.
Availability
A package has been released that facilitates integrating LoRA with PyTorch models, including implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2.