Tags: fine-tuning · parameter-efficient · lora · adaptation · llm

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis et al.

2022

We propose Low-Rank Adaptation (LoRA), which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.



LoRA: Low-Rank Adaptation

Core Hypothesis

For pre-trained over-parameterized models, weight updates during fine-tuning have a low intrinsic rank. This motivates learning rank decomposition matrices instead of full weight updates.

Method

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, constrain the update:

$$W_0 + \Delta W = W_0 + BA$$

Where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

The forward pass becomes: $$h = W_0 x + \Delta W x = W_0 x + BAx$$
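The decomposition can be sanity-checked numerically. A minimal NumPy sketch with illustrative shapes (not the paper's actual dimensions): the low-rank path $B(Ax)$ never materializes a full $d \times k$ update, yet produces the same output as the merged weight.

```python
import numpy as np

# Illustrative shapes: d = k = 8, rank r = 2 (assumed, for demonstration only).
d, k, r = 8, 8, 2
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, k))   # frozen pre-trained weight
B = rng.standard_normal((d, r))    # trainable, d x r
A = rng.standard_normal((r, k))    # trainable, r x k
x = rng.standard_normal(k)

# Forward pass: h = W0 x + B (A x) -- two skinny matmuls instead of
# materializing the full d x k update matrix.
h = W0 @ x + B @ (A @ x)

# Equivalent to applying the merged weight W0 + BA.
assert np.allclose(h, (W0 + B @ A) @ x)
```

Note that the factored path costs $r(d + k)$ parameters per adapted matrix, versus $dk$ for a full update.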

Initialization

  • $A$ initialized with random Gaussian
  • $B$ initialized to zero (so $\Delta W = 0$ at start)
  • Scaling: $\frac{\alpha}{r}$ applied to $\Delta W$
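The initialization scheme above can be sketched in a few lines of NumPy (shapes and the Gaussian scale are illustrative assumptions): because $B$ starts at zero, the adapted model is exactly the pre-trained model at step zero, and training departs from it smoothly.

```python
import numpy as np

# Assumed toy hyperparameters for illustration.
d, k, r, alpha = 8, 8, 2, 16
rng = np.random.default_rng(0)

A = 0.01 * rng.standard_normal((r, k))  # random Gaussian init
B = np.zeros((d, r))                    # zero init, so Delta W = 0 at start

# Scaled update: Delta W = (alpha / r) * B A
delta_W = (alpha / r) * (B @ A)

# At initialization the update vanishes -- the first forward pass
# is identical to the frozen pre-trained model.
assert np.allclose(delta_W, 0.0)
```

The $\alpha/r$ factor keeps the update's scale roughly stable when $r$ changes, reducing the need to retune the learning rate per rank.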

Parameter Efficiency

For GPT-3 175B with $r=1$ applied to $W_q$ and $W_v$:

  • Full fine-tuning: 175B trainable parameters
  • LoRA trainable parameters: ~4.7M (≈0.003% of the original)
  • Performance: Matches or exceeds full fine-tuning on GLUE, WikiSQL, SAMSum
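The parameter count can be checked by hand: each adapted $d \times d$ matrix costs $r \cdot d + d \cdot r = 2rd$ parameters. Using the publicly known GPT-3 175B dimensions (96 layers, $d_\text{model} = 12288$) and adapting only $W_q$ and $W_v$:

```python
# Sanity check of LoRA's trainable-parameter count, using the public
# GPT-3 175B dimensions (96 layers, d_model = 12288) and adapting
# only W_q and W_v in each attention layer.
d_model = 12288
n_layers = 96
adapted_per_layer = 2  # W_q and W_v

def lora_params(r: int) -> int:
    # One adapter pair (B, A) for a d x d matrix costs r*d + d*r = 2*r*d.
    per_matrix = r * d_model + d_model * r
    return n_layers * adapted_per_layer * per_matrix

print(lora_params(1))  # 4,718,592  -> the ~4.7M figure
print(lora_params(4))  # 18,874,368 -> roughly 175B / 10,000
```

The $r=1$ configuration reproduces the ~4.7M figure; $r=4$ gives roughly the 10,000× reduction quoted in the abstract.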

Practical Considerations

  • Applied to $W_q$ and $W_v$ in attention layers only (empirically sufficient)
  • Multiple LoRA adapters can be combined at inference
  • No added inference latency: $W = W_0 + BA$ can be merged before deployment
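Merging and adapter-swapping can be sketched as follows (NumPy, illustrative shapes): after merging, inference uses a single matmul, and subtracting $BA$ back out recovers the frozen base weights so a different adapter can be loaded.

```python
import numpy as np

# Toy shapes and scaling factor (assumed for illustration).
d, k, r, alpha = 8, 8, 2, 16
rng = np.random.default_rng(1)

W0 = rng.standard_normal((d, k))  # frozen pre-trained weight
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))

# Merge for deployment: W = W0 + (alpha/r) * B A -- one matmul at inference.
W_merged = W0 + (alpha / r) * (B @ A)
x = rng.standard_normal(k)
assert np.allclose(W_merged @ x, W0 @ x + (alpha / r) * (B @ (A @ x)))

# To swap adapters, subtract B A back out and add a different pair.
W_restored = W_merged - (alpha / r) * (B @ A)
assert np.allclose(W_restored, W0)
```

This is what makes per-task adapter libraries cheap to serve: the base model is stored once and each task contributes only its small $(B, A)$ pair.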


Key Equations
LoRA low-rank weight decomposition — the core mathematical insight
$$W = W_0 + \Delta W = W_0 + BA, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)$$
Citation Graph
References (3)

Language Models are Few-Shot Learners (GPT-3)

Brown et al. · 2020

Parameter-Efficient Transfer Learning for NLP

Houlsby et al. · 2019

QLoRA: Efficient Finetuning of Quantized LLMs

Dettmers et al. · 2023