<a href="https://colab.research.google.com/github/mlabonne/llm-course/blob/main/nanoLoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# nanoLoRA
> A Minimalistic Implementation of Low-Rank Adaptation

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne) as part of the 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course).

1. **Low-Rank Decomposition**: LoRA represents the update to a weight matrix as a low-rank decomposition. For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the update is $\Delta W = BA$, so the adapted weight is $W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r$ is much smaller than $\min(d, k)$.

2. **Freezing Pre-trained Weights**: During training, $W_0$ remains constant and does not receive gradient updates. The only trainable parameters are the matrices $A$ and $B$, which constrains the update $BA$ to rank at most $r$.

3. **Forward Pass Modification**: The modified forward pass includes both the original weight matrix and the low-rank update, yielding $h=W_0 x+BAx$.

4. **Deployment Efficiency**: When deployed, LoRA can explicitly compute and store $W=W_0 + BA$ and perform inference as usual. This approach introduces no additional inference latency compared to a fully fine-tuned model.
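
The four points above can be checked on a toy example. The following is a minimal sketch, not part of the original notebook: the sizes `d_out`, `d_in`, and `r` are arbitrary illustrative values, and `W0` is a random stand-in for a pre-trained weight. It shows that $BA$ has rank at most $r$, that keeping the update separate and merging it give the same output, and how many parameters each option stores.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions for this sketch, not GPT-2's dimensions)
d_out, d_in, r = 64, 48, 4

# float64 keeps the numerical rank estimate and the comparison below clean
W0 = torch.randn(d_out, d_in, dtype=torch.float64)  # frozen weight (random stand-in)
B = torch.randn(d_out, r, dtype=torch.float64)      # trainable factor, d_out x r
A = torch.randn(r, d_in, dtype=torch.float64)       # trainable factor, r x d_in
x = torch.randn(d_in, dtype=torch.float64)

delta_W = B @ A  # low-rank update with the same shape as W0

# Forward pass with the update kept separate: h = W0 x + B (A x)
h_separate = W0 @ x + B @ (A @ x)

# Forward pass after merging for deployment: W = W0 + BA
h_merged = (W0 + delta_W) @ x

rank = int(torch.linalg.matrix_rank(delta_W))
full_params = W0.numel()              # parameters in a full update
lora_params = A.numel() + B.numel()   # parameters in the low-rank update

print(f"rank of BA: {rank} (r = {r})")
print(f"full update: {full_params} parameters, LoRA update: {lora_params} parameters")
print("same output:", torch.allclose(h_separate, h_merged))
```

Note that `B` is random here only so that the update is non-zero; in actual LoRA training one of the two factors typically starts at zero so that the adapted layer initially matches the pre-trained one.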


In [1]:
!pip install -q transformers datasets

In [2]:
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling
from transformers import GPT2Model, GPT2Config, GPT2Tokenizer

def tokenize(element):
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

context_length = 128
tokenized_datasets = dataset.map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

The `self.merged` flag tracks whether the low-rank update has been folded into the linear layer's weight, that is, whether `self.weight` currently holds $W_0$ or $W_0 + \frac{\alpha}{r} BA$.

* **During Training**: The LoRA layer keeps two sets of weights: the frozen original weights (`self.weight`) and the trainable low-rank factors (`self.lora_A` and `self.lora_B`). Only the low-rank factors should receive gradients, so they must stay separate from `self.weight`. When the layer is switched to training mode, `train()` checks `self.merged`; if the weights were merged, it separates them again by subtracting the scaled product `lora_B @ lora_A` from `self.weight` and sets `self.merged` to `False`.

* **During Evaluation**: During evaluation (or inference), nothing is being trained, so the update can be folded into `self.weight` once and the forward pass reduces to a single matrix multiplication. When the layer is switched to evaluation mode, `train(False)` (which is what `eval()` calls) adds the scaled product to `self.weight` if it has not been merged yet and sets `self.merged` to `True`.

The low-rank update contributes to the output in both modes. Merging does not change the result (up to floating-point rounding); it only removes the extra matrix multiplications at inference time.
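
The merge and unmerge steps described above can be checked numerically. This is a minimal sketch with plain tensors rather than the layer class itself; the sizes are arbitrary, and `lora_B` is filled with random values (an assumption made for illustration only) so that the update is non-zero.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes and hyperparameters (assumptions for this sketch)
in_features, out_features, r, lora_alpha = 16, 12, 4, 1
scaling = lora_alpha / r

weight = torch.randn(out_features, in_features, dtype=torch.float64)
lora_A = torch.randn(r, in_features, dtype=torch.float64)
lora_B = torch.randn(out_features, r, dtype=torch.float64)
x = torch.randn(3, in_features, dtype=torch.float64)

original_weight = weight.clone()

# Training mode: weights are separate, the low-rank path is added explicitly
out_unmerged = F.linear(x, weight) + (x @ lora_A.T @ lora_B.T) * scaling

# Switching to evaluation mode: fold the scaled update into the weight
weight += (lora_B @ lora_A) * scaling
out_merged = F.linear(x, weight)

# Switching back to training mode: subtract the same update again
weight -= (lora_B @ lora_A) * scaling

same_output = torch.allclose(out_unmerged, out_merged)
weight_restored = torch.allclose(weight, original_weight)

print("merged and unmerged outputs match:", same_output)
print("weight restored after unmerging:", weight_restored)
```

Because the subtraction is done in floating point, the restored weight matches the original only up to rounding error, which is why the comparison uses `torch.allclose` rather than exact equality.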

In [3]:
import torch
import math
from torch import nn
import torch.nn.functional as F

class nanoLoRA(nn.Linear):
    """Linear layer with a frozen weight and a trainable low-rank update B @ A."""

    def __init__(
        self,
        in_features: int,
        out_features: int,
        r: int = 8,
        lora_alpha: int = 1
    ):
        super().__init__(in_features, out_features)
        assert r > 0, "r must be > 0"

        # Low-rank factors: A is (r, in_features), B is (out_features, r)
        self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
        self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
        self.scaling = lora_alpha / r
        # Freeze the base weight; the bias keeps requires_grad=True
        self.weight.requires_grad = False
        self.merged = False
        self.reset_lora_parameters()

    def reset_lora_parameters(self):
        super().reset_parameters()
        # B starts at zero, so B @ A = 0 and the layer initially matches the base layer
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def train(self, mode: bool = True):
        super().train(mode)
        if mode:
            if self.merged:
                # Make sure that the weights are not merged for training
                self.weight.data -= (self.lora_B @ self.lora_A) * self.scaling
                self.merged = False
        else:
            if not self.merged:
                # Merge the weights for inference
                self.weight.data += (self.lora_B @ self.lora_A) * self.scaling
                self.merged = True
        # nn.Module.train() is expected to return the module itself
        return self

    def forward(self, x: torch.Tensor):
        if not self.merged:
            # Base path plus the scaled low-rank path: W0 x + scaling * B A x
            out = F.linear(x, self.weight, bias=self.bias)
            out += (x @ self.lora_A.transpose(0, 1) @ self.lora_B.transpose(0, 1)) * self.scaling
            return out
        else:
            # The update is already folded into self.weight
            return F.linear(x, self.weight, bias=self.bias)

In [4]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

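# Initialize the model
# Note: GPT2LMHeadModel(config) builds the architecture with randomly
# initialized weights; it does not load the pre-trained "gpt2" checkpoint.
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=1024,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)

# Freeze all parameters in the model
for param in model.parameters():
    param.requires_grad = False

# Replace both MLP projections in every block with nanoLoRA layers.
# Note: the new layers are freshly initialized n_embd -> n_embd linear layers,
# so they do not reuse the weights or the 4x hidden width of the original MLP.
# Their lora_A, lora_B, and bias are the only trainable parameters.
for i in range(config.n_layer):
    model.transformer.h[i].mlp.c_fc = nanoLoRA(config.n_embd, config.n_embd, r=8)
    model.transformer.h[i].mlp.c_proj = nanoLoRA(config.n_embd, config.n_embd, r=8)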

num_train_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_train_params:,} trainable parameters.')

The model has 313,344 trainable parameters.
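
The count printed above can be reproduced by hand. The sketch below assumes the default GPT-2 small configuration (`n_embd=768`, `n_layer=12`) and the `r=8` used in the cell; the variable names are illustrative. It also makes visible that the bias of each `nanoLoRA` layer is counted, since only `weight` is frozen in the class.

```python
# Assumed GPT-2 small configuration and the rank used in the notebook
n_embd, n_layer, r = 768, 12, 8

lora_A_params = r * n_embd   # A has shape (r, in_features)
lora_B_params = n_embd * r   # B has shape (out_features, r)
bias_params = n_embd         # the nn.Linear bias stays trainable

per_module = lora_A_params + lora_B_params + bias_params
modules_per_block = 2        # c_fc and c_proj are both replaced

total = per_module * modules_per_block * n_layer
print(f"{per_module:,} trainable parameters per nanoLoRA layer")
print(f"{total:,} trainable parameters in total")  # 313,344
```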


In [5]:
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="results",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=50,
    logging_steps=50,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=10,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    fp16=True,
    report_to="tensorboard",
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()



| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 10.3686       | 9.695271        |
| 100  | 9.6608        | 9.591634        |
| 150  | 9.6345        | 9.585637        |
| 200  | 9.623         | 9.58336         |
| 250  | 9.6131        | 9.582972        |


TrainOutput(global_step=266, training_loss=9.771246601764421, metrics={'train_runtime': 96.2756, 'train_samples_per_second': 88.164, 'train_steps_per_second': 2.763, 'total_flos': 279368589901824.0, 'train_loss': 9.771246601764421, 'epoch': 1.0})

In [6]:
trainer.evaluate()

{'eval_loss': 9.582963943481445,
 'eval_runtime': 3.3787,
 'eval_samples_per_second': 262.824,
 'eval_steps_per_second': 8.287,
 'epoch': 1.0}
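
One way to read the evaluation loss is to convert it to perplexity and compare it with a uniform guess over the vocabulary. This is a rough sketch, assuming the reported loss is the mean per-token cross-entropy in nats and that the GPT-2 tokenizer has 50,257 tokens; the variable names are illustrative.

```python
import math

eval_loss = 9.582963943481445  # value reported by trainer.evaluate() above
vocab_size = 50257             # assumed size of the GPT-2 vocabulary

# Perplexity is the exponential of the mean cross-entropy
perplexity = math.exp(eval_loss)

# Loss of a model that assigns equal probability to every token
uniform_loss = math.log(vocab_size)

print(f"perplexity: {perplexity:,.0f}")
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

A loss this close to the uniform-guess value suggests the model has learned little beyond token frequencies, which is what one would expect when the frozen base weights are randomly initialized rather than pre-trained.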