LoRA


Overview

Natural language processing commonly relies on large-scale pretraining followed by adaptation to specific downstream tasks or domains. This adaptation is usually done through fine-tuning, which updates all the parameters of the original model. With very large models such as GPT-3, that approach becomes impractical: every adapted copy carries the same vast number of parameters as the original, making storage and deployment challenging. Low-Rank Adaptation (LoRA) was proposed as a solution to this problem.

Earlier attempts addressed this by updating only a small subset of parameters or by adding external adapter modules for each new task. Such methods make adaptation more efficient, but they can introduce extra inference latency or fall short of full fine-tuning in quality, forcing a trade-off between efficiency and model performance.

Low-Rank Adaptation (LoRA) proposes a different way to tackle the issue. Motivated by evidence that the change in weights during adaptation has a low "intrinsic rank," LoRA freezes the pre-trained weights and trains only small low-rank matrices injected alongside them. This makes LoRA efficient in both storage and computation, even for extremely large models like GPT-3, and offers a strong balance between model quality and operational efficiency.
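To make the idea concrete, the sketch below shows a LoRA-style linear layer in PyTorch. This is a minimal illustration, not the released implementation; the class name and the initialization scale are chosen for clarity, and the alpha / r scaling follows the convention described in the paper.

  import torch
  import torch.nn as nn

  class LoRALinear(nn.Module):
      """Minimal sketch: y = W0 x + (alpha / r) * B A x, with W0 frozen."""
      def __init__(self, in_features, out_features, r=4, alpha=1.0):
          super().__init__()
          # Frozen pre-trained weight W0 (randomly initialized here for the sketch).
          self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                     requires_grad=False)
          # Trainable rank decomposition B A: A starts small, B starts at zero,
          # so the update is zero at the start of training and the layer
          # initially behaves exactly like the pre-trained one.
          self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
          self.lora_B = nn.Parameter(torch.zeros(out_features, r))
          self.scaling = alpha / r

      def forward(self, x):
          frozen = x @ self.weight.T
          update = (x @ self.lora_A.T) @ self.lora_B.T * self.scaling
          return frozen + update

Only lora_A and lora_B receive gradients, so gradient and optimizer-state memory scale with the rank r rather than with the full weight matrix.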

Definitions

Fine-Tuning

Fine-tuning makes adjustments to all the weights of a pre-trained model to adapt it to a specific task. It becomes less feasible for larger models because every adapted copy has the same high cost and parameter count as the original.

Low-Rank Adaptation (LoRA)

LoRA keeps the pre-trained weights frozen and injects trainable rank decomposition matrices into each layer of the Transformer architecture, drastically reducing the number of trainable parameters and the GPU memory cost.

Trainable Parameters

These are the model weights that are updated during training. Compared with full fine-tuning of GPT-3 175B, LoRA reduces the number of trainable parameters by roughly 10,000 times and GPU memory requirements by about 3 times.
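A quick back-of-the-envelope calculation shows where the savings come from. For a single d x d weight matrix with d = 12288 (the hidden size of GPT-3 175B), a full update trains d^2 parameters, while a rank-4 decomposition trains only 2 * d * r:

  d = 12288                # hidden size of GPT-3 175B
  r = 4                    # LoRA rank

  full_update = d * d      # parameters in a full d x d weight update
  lora_update = 2 * d * r  # parameters in B (d x r) plus A (r x d)

  print(full_update)                  # 150994944
  print(lora_update)                  # 98304
  print(full_update // lora_update)   # 1536

That is a roughly 1,500-fold saving per matrix; the overall 10,000-fold figure is larger still because LoRA is applied to only a subset of the weight matrices, while the rest of the model contributes no trainable parameters at all.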

Model Quality

The accuracy or task performance of a model. LoRA performs on par with or better than traditional fine-tuning despite training far fewer parameters.

Rank-Decomposition Matrices

The pair of small matrices whose product represents the weight update in LoRA. They reduce complexity without compromising quality or adding inference latency.

Inference Latency

The time the model takes to produce an output at inference. LoRA does not increase this latency, because the trained low-rank update can be merged into the frozen weights before deployment, as sketched below.
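The product B A has the same shape as the frozen weight, so it can be folded in once after training. A small sketch with plain tensors and illustrative dimensions:

  import torch

  d, r = 64, 4
  W0 = torch.randn(d, d)   # frozen pre-trained weight
  B = torch.randn(d, r)    # trained LoRA factors
  A = torch.randn(r, d)
  scaling = 1.0 / r

  # Merge once before deployment: the adapted layer is then a single matrix
  # multiply, exactly like the original, so inference cost is unchanged.
  W_merged = W0 + scaling * (B @ A)

  x = torch.randn(1, d)
  y_unmerged = x @ W0.T + scaling * ((x @ A.T) @ B.T)
  y_merged = x @ W_merged.T
  assert torch.allclose(y_unmerged, y_merged, atol=1e-4)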

Benefits of LoRA

  • Significant reduction in trainable parameters and GPU memory requirements.
  • Comparable or superior performance to fine-tuning on models like RoBERTa, DeBERTa, GPT-2, and GPT-3.
  • Higher training throughput and no additional inference latency.

Empirical Investigation

An empirical study of rank-deficiency in language model adaptation sheds light on why LoRA is effective: the weight updates learned during adaptation can be approximated well by matrices of very low rank. A toy illustration of this kind of analysis follows.
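The snippet below is an illustrative stand-in, not the paper's experiment: it builds a synthetic weight update that is approximately low-rank and checks how much of it the top-r singular directions capture. A real analysis would use the actual update (fine-tuned weights minus pre-trained weights) from a fine-tuned model.

  import torch

  # Synthetic stand-in for a fine-tuning update: a rank-8 matrix plus noise.
  d, true_rank = 512, 8
  delta_W = torch.randn(d, true_rank) @ torch.randn(true_rank, d)
  delta_W = delta_W + 0.01 * torch.randn(d, d)

  # The singular value spectrum shows how well a rank-r approximation fits.
  S = torch.linalg.svdvals(delta_W)
  energy = (S ** 2).cumsum(0) / (S ** 2).sum()
  for r in (1, 2, 4, 8, 16):
      print(f"rank {r:2d} captures {energy[r - 1].item():.1%} of the update")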

Availability

A package has been released for integration with PyTorch, including implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2.
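This section does not spell out the API. The sketch below follows the interface documented for Microsoft's loralib release, so treat the exact package and function names as an assumption rather than a guarantee.

  import torch
  import torch.nn as nn
  import loralib as lora  # assumed package name, per Microsoft's release

  # Build a model with LoRA-enabled layers in place of ordinary nn.Linear.
  model = nn.Sequential(
      lora.Linear(768, 768, r=16),
      nn.ReLU(),
      lora.Linear(768, 10, r=16),
  )

  # Freeze everything except the LoRA matrices before training.
  lora.mark_only_lora_as_trainable(model)

  # ... train as usual ...

  # Save only the small LoRA parameters instead of a full model checkpoint.
  torch.save(lora.lora_state_dict(model), "lora_checkpoint.pt")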