
Fine-Tuning a Tiny-Llama Model with Unsloth


Introduction

After the Llama and Mistral models were released, open-source LLMs took the limelight from OpenAI. Since then, multiple models based on the Llama and Mistral architectures have been released, performing on par with proprietary models like GPT-3.5 Turbo, Claude, Gemini, and so on. However, these models are too large to be used on consumer hardware.

But lately, a new class of LLMs has emerged: models in the sub-7B parameter class. Fewer parameters make them compact enough to run on consumer hardware while keeping performance comparable to 7B models. Models like Tiny-Llama-1B, Microsoft's Phi-2, and Alibaba's Qwen-3b can be great substitutes for larger models to run locally or deploy on the edge. At the same time, fine-tuning is crucial to bring the best out of any base model for downstream tasks.
Here, we will explore how to fine-tune a base Tiny-Llama model on a cleaned Alpaca dataset.


Learning Objectives

  • Understand fine-tuning and its different methods.
  • Learn about tools and techniques for efficient fine-tuning.
  • Learn about WandB for logging training metrics.
  • Fine-tune Tiny-Llama on the Alpaca dataset in Colab.

This article was published as a part of the Data Science Blogathon.

What is LLM Fine-Tuning?

Fine-tuning is the process of making a pre-trained model learn new information. A pre-trained model is a general-purpose model trained on a large amount of data. However, it often fails to perform as intended, and fine-tuning is the best way to adapt the model to specific use cases. For example, base LLMs do well at text generation for single-turn QA but struggle with multi-turn conversations, unlike chat models.

Base models need to be trained on transcripts of dialogues to be able to hold multi-turn conversations. Fine-tuning is essential to mold pre-trained models into different avatars. The quality of fine-tuned models depends on the quality of the data and the capabilities of the base model. There are multiple approaches to model fine-tuning, such as LoRA and QLoRA.

Let's briefly go through these concepts.

LoRA

LoRA stands for Low-Rank Adaptation, a popular fine-tuning technique in which we train a small set of additional parameters instead of updating all the parameters, via a low-rank approximation of the original weight matrices. A LoRA model can be fine-tuned faster on less compute-intensive hardware.
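To make the idea concrete, here is a minimal PyTorch sketch of the low-rank update, not the PEFT library's actual implementation; the layer shape, rank, and the alpha/r scaling are illustrative assumptions.

import torch

d, k, r, alpha = 2048, 2048, 32, 32   # hypothetical layer shape and LoRA rank/alpha
W = torch.randn(d, k)                 # frozen pre-trained weight, never updated
A = torch.randn(r, k) * 0.01          # trainable low-rank factor A
B = torch.zeros(d, r)                 # trainable low-rank factor B, initialized to zero

x = torch.randn(1, k)                 # one input activation
# Forward pass: frozen path plus the scaled low-rank update B @ A
y = x @ W.T + (alpha / r) * (x @ A.T @ B.T)

# Only A and B are trained instead of all d*k weights
print(f"trainable: {A.numel() + B.numel():,} vs full: {W.numel():,}")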

QLoRA

QLoRA, or Quantized LoRA, goes a step further than LoRA. Instead of a full-precision model, it quantizes the model weights to a lower floating-point precision before applying LoRA. Quantization is the process of downcasting higher-bit values to lower-bit values. A 4-bit quantization process involves quantizing 16-bit weights to 4-bit values.

Quantizing the model leads to a substantial reduction in model size while keeping accuracy comparable to the original model. In QLoRA, we take a quantized model and apply LoRA to it. Models can be quantized in multiple ways, such as with llama.cpp, AWQ, bitsandbytes, and so on.
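For illustration, here is a hedged sketch of one such route: loading a model in 4-bit through the bitsandbytes integration in Transformers. The checkpoint name and dtype choices are assumptions for the example; the walkthrough below uses Unsloth's pre-quantized checkpoint instead.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization, computing in fp16 (suits a T4-class GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)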

Fine-Tuning with Unsloth

Unsloth is an open-source platform for fine-tuning popular Large Language Models faster. It supports popular LLMs, including Llama-2 and Mistral, and their derivatives like Yi, OpenHermes, and so on. It implements custom Triton kernels and a manual back-propagation engine to improve the speed of model training.

Here, we will use Unsloth to fine-tune a base 4-bit quantized Tiny-Llama model on the Alpaca dataset. The model is quantized with bitsandbytes, and the kernels are optimized with OpenAI's Triton.


Logging with WandB

In machine learning, it is crucial to log training and evaluation metrics. This gives us a complete picture of the training run. Weights and Biases (WandB) is an open-source library for visualizing and tracking machine learning experiments. It has a dedicated web app for visualizing training metrics in real-time, and it also lets us manage production models centrally. We will use WandB only to track our Tiny-Llama fine-tuning run.

To use WandB, sign up for a free account and create an API key.

Now, let's start fine-tuning our model.

How to Fine-Tune Tiny-Llama?

Fine-tuning is a compute-heavy task. It requires a machine with 10-15 GB of VRAM, or you can use Colab's free Tesla T4 GPU runtime.
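Before installing anything, you can quickly confirm that a GPU is attached and how much VRAM it has; this small check is my own addition, not part of the original notebook.

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU found -- switch the Colab runtime to a GPU (e.g., T4).")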

Now, install Unsloth and WandB.

%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
!pip install wandb
if major_version >= 8:
    # Use this for newer GPUs like Ampere, Hopper (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
pass

The next step is to load the 4-bit quantized pre-trained model with Unsloth.

from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100; Bfloat16 for Ampere+
load_in_4bit = True # Use 4-bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/tinyllama-bnb-4bit", # "unsloth/tinyllama" for 16-bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

This will download the model locally. The 4-bit model is around 760 MB in size.
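If you want to verify the footprint yourself, the loaded model exposes Transformers' get_memory_footprint(); this optional check assumes the Unsloth-loaded model behaves like a standard Transformers model, and the exact number may vary slightly.

# Approximate in-memory size of the 4-bit model (optional sanity check)
print(f"{model.get_memory_footprint() / 1024**2:.0f} MB")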

Now, apply PEFT to the 4-bit Tiny-Llama model.

model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0! Suggested: 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True, # @@@ IF YOU GET OUT OF MEMORY - set to True @@@
    random_state = 3407,
    use_rslora = False,  # We support rank-stabilized LoRA
    loftq_config = None, # And LoftQ
)
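Since get_peft_model wraps the model with LoRA adapters, you can print how many parameters are actually trainable; this assumes the returned object is a standard PEFT model, and the exact count depends on the rank r chosen above.

# Only the LoRA adapter weights should be reported as trainable
model.print_trainable_parameters()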

Prepare the Data

The next step is to prepare the dataset for fine-tuning. As mentioned earlier, we will use a cleaned Alpaca dataset, a cleaned version of the original Alpaca dataset. It follows the instruction-input-response format. Here is an example of Alpaca data.
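The sample below is illustrative, written in the same instruction-input-output format rather than copied from the dataset:

example = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
}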


Now, let's prepare our data.

# @title prepare data

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
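To sanity-check the formatting, you can print one mapped record; this quick inspection is an optional addition.

# Inspect one formatted training example (truncated for readability)
print(dataset[0]["text"][:500])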

Now, split the data into train and eval sets. I have kept the eval set small, as a larger eval set slows down training.

dataset_dict = dataset.train_test_split(test_size=0.004)

Configure WandB

Now, configure Weights and Biases in your current runtime.

# @title wandb init
import wandb
wandb.login()

Provide the API key to log in to WandB when prompted.

Set up the environment variables.

%env WANDB_WATCH=all
%env WANDB_SILENT=true

Train the Model

So far, we have loaded the 4-bit model, created the LoRA configuration, prepared the dataset, and configured WandB. The next step is to train the model on the data. For that, we need to define a trainer from the TRL library; we will use the SFTTrainer. But before that, initialize WandB and define appropriate training arguments.

import os

from trl import SFTTrainer
from transformers import TrainingArguments
from transformers.utils import logging
import wandb

logging.set_verbosity_info()
project_name = "tiny-llama"
entity = "wandb"
# os.environ["WANDB_LOG_MODEL"] = "checkpoint"

wandb.init(project=project_name, name = "tiny-llama-unsloth-sft")

Training Arguments

args = TrainingArguments(
        per_device_train_batch_size = 2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps = 4,
        evaluation_strategy="steps",
        warmup_ratio = 0.1,
        num_train_epochs = 1,
        learning_rate = 2e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        optim = "adamw_8bit",
        weight_decay = 0.1,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to="wandb",  # allow logging to W&B
        # run_name="tiny-llama-alpaca-run",  # identify of the W&B run (non-obligatory)
        logging_steps=1,  # how typically to log to W&B
        logging_strategy = 'steps',
        save_total_limit=2,
    )

This is important for training. To keep GPU utilization low, keep the train batch size, eval batch size, and gradient accumulation steps low. logging_steps is the number of steps between metric logs to WandB.

Now, initialize the SFTTrainer.

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_dict["train"],
    eval_dataset=dataset_dict["test"],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True, # Packs short sequences together to save time!
    args = args,
)

Now, start the training.

trainer_stats = trainer.train()
wandb.finish()

During the training run, WandB will track the training and eval metrics. You can go to the dashboard link it prints and watch them in real-time.

Here is a screenshot from my run on a Colab notebook.


The training speed will depend on several factors, including the training and eval data sizes, the train and eval batch sizes, and the number of epochs. If you run into GPU utilization issues, try reducing the batch sizes and gradient accumulation steps. The effective train batch size = batch_size_per_device * gradient_accumulation_steps, and the number of optimization steps = total training data / effective batch size. You can play with these parameters and see what works best.
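As a rough worked example under the settings above (per-device batch size 2, gradient accumulation 4), assuming roughly 51,000 training rows after the split; note that packing=True will further reduce the step count by concatenating short examples.

# Rough arithmetic for the run above (approximate)
per_device_batch = 2
grad_accum = 4
train_rows = 51_000                               # approximate train split size

effective_batch = per_device_batch * grad_accum   # 8
steps_per_epoch = train_rows // effective_batch   # ~6,375
print(effective_batch, steps_per_epoch)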

You can visualize the training and evaluation loss of your run on the WandB dashboard.

Train Loss

Eval Loss

Inference

You can save the LoRA adapters locally or push them to the Hugging Face Hub.

model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving

You can also load the saved model from disk and use it for inference.

if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )

inputs = tokenizer(
[
    alpaca_prompt.format(
        "capital of France?", # instruction
        "", # input
        "", # output - leave this blank for a generation!
    )
]*1, return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

For streaming model responses:

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

So, this was all about fine-tuning a Tiny-Llama model with WandB logging.

Here is the Colab Notebook for the same.

Conclusion

Small LLMs can be useful for deployment on compute-restricted hardware, such as personal computers, mobile phones, and wearables. Fine-tuning allows these models to perform better on downstream tasks. In this article, we learned how to fine-tune a base language model on a dataset.

Key Takeaways

  • Fine-tuning is the process of making a pre-trained model adapt to a specific new task.
  • Tiny-Llama is an LLM with only 1.1 billion parameters, trained on 3 trillion tokens.
  • There are different ways to fine-tune LLMs, such as LoRA and QLoRA.
  • Unsloth is an open-source platform that provides CUDA-optimized LLMs to speed up fine-tuning.
  • Weights and Biases (WandB) is a tool for tracking and storing ML experiments.

Frequently Asked Questions

Q1. What is LLM fine-tuning?

A. Fine-tuning, in the context of machine learning, especially deep learning, is a technique where you take a pre-trained model and adapt it to a new, specific task.

Q2. Can I fine-tune LLMs for free?

A. It is possible to fine-tune smaller LLMs for free on Colab with the Tesla T4 GPU using QLoRA.

Q3. What are the benefits of fine-tuning LLMs?

A. Fine-tuning greatly enhances an LLM's ability to perform downstream tasks, like role play, code generation, etc.

Q4. What is Tiny-Llama?

A. Tiny-Llama is an LLM with 1.1B parameters, trained on 3 trillion tokens. The model adopts the original Llama-2 architecture.

Q5. What is Unsloth used for?

A. Unsloth is an open-source tool that provides faster and more efficient LLM fine-tuning by optimizing GPU kernels with Triton.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.


