LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis et al.
2022
We propose Low-Rank Adaptation (LoRA), which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.
Core Hypothesis
For pre-trained over-parameterized models, weight updates during fine-tuning have a low intrinsic rank. This motivates learning rank decomposition matrices instead of full weight updates.
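The hypothesis can be illustrated numerically (a hypothetical sketch, not from the paper): a weight update that is the product of two thin matrices has rank at most $r$, so a rank-$r$ SVD truncation recovers it exactly.

```python
import numpy as np

# Illustrative only: construct an update that is, by design, rank <= r,
# then confirm a rank-r SVD truncation reconstructs it exactly.
rng = np.random.default_rng(0)
d, k, r = 64, 48, 4
delta_w = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))  # rank <= r

u, s, vt = np.linalg.svd(delta_w)
effective_rank = int(np.sum(s > 1e-10))
low_rank = (u[:, :r] * s[:r]) @ vt[:r, :]
print(effective_rank)                      # 4
print(np.allclose(low_rank, delta_w))      # True
```

If fine-tuning updates really do have low intrinsic rank, parameterizing $\Delta W$ directly as $BA$ loses little expressiveness while shrinking the trainable parameter count dramatically.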
Method
For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, constrain the update:
$$W_0 + \Delta W = W_0 + BA$$
Where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.
The forward pass becomes: $$h = W_0 x + \Delta W x = W_0 x + BAx$$
Initialization
- $A$ initialized with random Gaussian
- $B$ initialized to zero (so $\Delta W = 0$ at start)
- Scaling: $\frac{\alpha}{r}$ applied to $\Delta W$
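The method and initialization above can be sketched as a small NumPy class (illustrative; the name `LoRALinear` and the hyperparameter values are our own, not from the paper):

```python
import numpy as np

# Minimal sketch of a LoRA-adapted linear layer.
class LoRALinear:
    def __init__(self, w0, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d, k = w0.shape
        self.w0 = w0                                  # frozen pre-trained weight W0
        self.a = rng.normal(scale=0.01, size=(r, k))  # A: random Gaussian init
        self.b = np.zeros((d, r))                     # B: zeros, so Delta W = 0 at start
        self.scale = alpha / r                        # alpha/r scaling on the update

    def __call__(self, x):
        # h = W0 x + (alpha / r) * B A x
        return self.w0 @ x + self.scale * (self.b @ (self.a @ x))

layer = LoRALinear(np.eye(6), r=2)
x = np.arange(6.0)
print(np.allclose(layer(x), x))  # True: with B = 0 the layer starts exactly as W0
```

Because $B$ starts at zero, the adapted model is identical to the pre-trained model at step 0, and only $A$ and $B$ receive gradients.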
Parameter Efficiency
For GPT-3 175B with $r=1$ applied to $W_q$ and $W_v$:
- Full fine-tuning: 175B parameters
- LoRA trainable params: ~4.7M (0.003% of original)
- Performance: Matches or exceeds full fine-tuning on GLUE, WikiSQL, SAMSum
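The ~4.7M figure follows from back-of-the-envelope arithmetic (assuming GPT-3's published shape, $d_{model} = 12288$ and 96 layers): each adapted $d \times d$ matrix gains $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$, i.e. $2dr$ parameters.

```python
# Parameter count when adapting W_q and W_v in every layer with rank r.
d_model, n_layers, r = 12288, 96, 1
per_matrix = 2 * d_model * r            # A (r x d) plus B (d x r)
trainable = per_matrix * 2 * n_layers   # two adapted matrices per layer
print(trainable)                        # 4718592, i.e. ~4.7M
print(f"{trainable / 175e9:.4%}")       # ~0.0027% of 175B
```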
Practical Considerations
- Applied to $W_q$ and $W_v$ in attention layers only (empirically sufficient)
- LoRA adapters for different tasks can be swapped cheaply at deployment by subtracting one $BA$ and adding another
- No inference latency: $W = W_0 + BA$ merged before deployment
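The merge and swap operations above can be verified in a few lines (a sketch under the same $\frac{\alpha}{r}$ scaling; all variable names are our own):

```python
import numpy as np

# Merging: W = W0 + (alpha / r) * B A collapses the adapter into one dense
# matrix, so serving uses a single matmul, same as the unadapted model.
rng = np.random.default_rng(1)
d, k, r, alpha = 16, 16, 4, 8
w0 = rng.normal(size=(d, k))
a, b = rng.normal(size=(r, k)), rng.normal(size=(d, r))
scale = alpha / r

w_merged = w0 + scale * (b @ a)
x = rng.normal(size=k)
print(np.allclose(w_merged @ x, w0 @ x + scale * (b @ (a @ x))))  # True

# Swapping tasks: subtract this adapter's BA to recover W0, then a
# different adapter's BA can be added in its place.
w_restored = w_merged - scale * (b @ a)
print(np.allclose(w_restored, w0))  # True
```

This is why LoRA adds no inference latency: after merging, the deployed weight matrix has the same shape and cost as the original $W_0$.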
References (3)
- Brown et al., 2020. Language Models are Few-Shot Learners (GPT-3)
- Houlsby et al., 2019. Parameter-Efficient Transfer Learning for NLP
- Dettmers et al., 2023. QLoRA: Efficient Finetuning of Quantized LLMs