Fine-Tuning a Code GenAI Model on a Google Colab T4 GPU: A Step-by-Step Guide

24 / Sep / 2024 by Tushar Raj Verma

Introduction

Fine-tuning large language models for code generation typically requires significant computing power. Many popular models, such as Code LLaMA or CodeT5, demand high-performance GPUs like the NVIDIA A100, making them less accessible for most users. However, by combining LoRA (Low-Rank Adaptation) with quantization via libraries such as `bitsandbytes` and `PEFT`, you can fine-tune Starcoder2 on a free Google Colab T4 instance.

This blog explores how you can achieve high-quality code generation results on limited hardware, making it an affordable option for those interested in model training but restricted by resource availability.

Read More: Understanding Generative AI and Predictive Analytics

Prerequisites and Setup

Start by installing the required libraries directly into your Colab environment. The core libraries used in this tutorial are datasets, trl, bitsandbytes, and peft.

!pip install -q datasets trl bitsandbytes peft

Next, log in to Hugging Face to access pre-trained models and datasets.

from huggingface_hub import notebook_login

notebook_login()

Loading the Starcoder2 Model

We will use the 3B variant of Starcoder2. Even at this size, full-precision weights plus training overhead would strain the T4's 16 GB of VRAM, so we use bitsandbytes to load the model in 4-bit precision and keep memory usage low.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import PartialState

# Load Starcoder2-3B with 4-bit NF4 quantization so it fits in the T4's memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b",
    quantization_config=bnb_config,
    device_map={"": PartialState().process_index},
)

Key Considerations

  • Setting load_in_4bit=True stores the model weights in 4-bit precision, and bnb_4bit_quant_type="nf4" selects the NF4 (NormalFloat4) data type, which substantially reduces memory consumption.
  • PartialState().process_index resolves to the current process's GPU, so device_map places the model on the appropriate device automatically.
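
As a quick sanity check, you can print the quantized model's memory footprint before moving on. This is an optional sketch; get_memory_footprint() is a standard transformers model method and the exact figure will vary.

# Optional check: report how much memory the 4-bit model occupies.
# get_memory_footprint() returns bytes; the quantized 3B model should fit
# comfortably within the T4's 16 GB of VRAM.
print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")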

Configuring LoRA (Low-Rank Adaptation)

from peft import LoraConfig, TaskType

# Attach LoRA adapters to the attention and MLP projection layers.
lora_config = LoraConfig(
    r=8,
    target_modules=[
        "q_proj",
        "o_proj",
        "k_proj",
        "v_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    task_type=TaskType.CAUSAL_LM,
)

LoRA is essential here because it trains only small low-rank adapter matrices on top of the frozen base weights instead of updating the full model, which makes it a natural fit for a resource-limited setup like a Colab T4 instance.
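
To see just how small the trainable footprint is, you can wrap the model with PEFT and print the parameter counts. This is an illustrative sketch only; if you run it, reload the base model afterwards, since the SFTTrainer configured below attaches the same adapters itself via peft_config.

from peft import get_peft_model

# For illustration only: attach the LoRA adapters and count trainable parameters.
# With r=8 on the listed projection modules, typically well under 1% of the
# model's ~3B parameters end up trainable.
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()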

Data Loading and Preprocessing

We use BigCode's the-stack-smol dataset, a small sample of The Stack containing code snippets in many languages, and focus on the Python subset. You can adjust the dataset as needed.

from datasets import load_dataset
import pandas as pd

# Load the Python subset of the-stack-smol and keep a CSV copy of the snippets.
data = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")
pd.DataFrame(data["content"]).to_csv("python_code_snippet_custom.csv", index=False)

Why Python?

Python remains one of the most popular languages for code generation tasks, and training a model on Python snippets can yield significant improvements in generating accurate and optimized code.
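
It is also worth eyeballing the data before training. A minimal sketch, using the dataset loaded above:

# Quick inspection: how many Python files we have and what a snippet looks like.
print(f"Number of Python snippets: {len(data)}")
print(data[0]["content"][:300])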

Setting up the Trainer

We use Hugging Face’s SFTTrainer (Supervised Fine-Tuning Trainer) to handle the fine-tuning process.

from trl import SFTTrainer
import transformers

trainer = SFTTrainer(
    model=model,
    train_dataset=data,
    dataset_text_field="content",
    max_seq_length=512,
    peft_config=lora_config,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=100,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        weight_decay=0.01,
        bf16=True,
        logging_strategy="steps",
        logging_steps=10,
        output_dir="finetune_starcoder2-3b",
        optim="paged_adamw_8bit",
        seed=0,
    ),
)

Important Configurations

  • Batch Size: A micro-batch size of 1 is combined with gradient accumulation to simulate a larger effective batch size (see the sketch after this list).
  • Scheduler: Cosine learning rate scheduling helps the run converge smoothly.
  • Optimization: paged_adamw_8bit keeps optimizer state in 8-bit and pages it to avoid memory spikes, making it a memory-efficient choice.
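
A small sketch of the batch-size arithmetic, using the values configured above:

# The optimizer only steps after gradient_accumulation_steps micro-batches,
# so the effective batch size is the product of the two settings.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
print("Effective batch size:", per_device_train_batch_size * gradient_accumulation_steps)  # 4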

Fine-Tuning Process

Once the setup is ready, you can start fine-tuning the model. The process will automatically log results after every 10 steps.

print("Training...")
trainer.train()

print("Saving the last checkpoint of the model")
model.save_pretrained("finetune_starcoder2-3b/final_checkpoint/")

[Screenshot: fine-tuning steps logged in Colab]

Uploading the Model to Hugging Face

# Push the fine-tuned adapter and training artifacts to the Hugging Face Hub.
trainer.push_to_hub("Upload model")

Testing the Fine-Tuned Model

After fine-tuning, you can load the model to generate Python code snippets based on natural language inputs.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftConfig, PeftModel
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Replace with your own Hub repository from the push_to_hub step above.
peft_model_id = "hfusername/finetune_starcoder2-3b"
base_model = "bigcode/starcoder2-3b"

# Reload the base model in 4-bit, then attach the fine-tuned LoRA adapters.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map="cuda",
)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

You can then input a question, and the model will generate Python code based on the prompt:

def generate_python_code(question):

    eval_prompt = f"""You are a powerful code generator model. Your job is to create a code about a module. You are given a question, convert it into a python code.
    ### Input:
    {question}
    ### Response:
    """
    model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
    model.eval()
    with torch.no_grad():
        output = model.generate(
            **model_input,
            max_length=300,
            eos_token_id=tokenizer.eos_token_id,
        )
        response = tokenizer.decode(output[0], skip_special_tokens=True)
        return response.split("### Response:")[-1].strip()

print(generate_python_code('how to load json'))

Output:

[Screenshot: generated Python code for the "how to load json" prompt]

Conclusion

The unique advantage of this approach lies in fine-tuning a large code-generation model like Starcoder2 on a free Google Colab T4 instance. Other models, such as Code LLaMA, often require more computational resources, but with LoRA and quantization, you can achieve competitive results without needing access to A100 GPUs or expensive cloud compute instances. By focusing on optimizations like 4-bit quantization and efficient parameter tuning via LoRA, this guide enables you to build high-quality models for real-world coding tasks efficiently and affordably.

Read More: Copilot with Xcode: Use genAI to accelerate your iOS development.

Key Takeaways

  • Google Colab’s free T4 instance is sufficient to fine-tune Starcoder2 using quantization and LoRA.
  • LoRA significantly reduces the need for high-resource machines by adapting specific projections during fine-tuning.
  • Tools like `bitsandbytes` enable fine-tuning at 4-bit precision, drastically reducing memory usage while maintaining performance.

Now you can easily adapt this process to fine-tune models for your own code-generation projects!

TO THE NEW, a leader in digital technology services, empowers businesses across industries to leverage the transformative power of AI and Machine Learning. Our team of 2000+ passionate experts combines the power of Cloud, Data, and AI to design and build innovative digital platforms that unlock new possibilities. Reach out to us for your next project requirements.
