Fine-tuning large language models like Qwen 7B is essential for adapting them to specific tasks. To transform Qwen 7B into a reasoning model, we’ll use GRPO (Group Relative Policy Optimization), a powerful reinforcement learning algorithm introduced by DeepSeek, along with Unsloth, a fast and memory-efficient training library.

This guide demonstrates how to create a fine-tuning job for Qwen 7B using job queues and save the resulting LoRA adapter to Hugging Face. We’ll also deploy a vLLM server for inference. We’ll use a single L40S GPU for both training and inference.

Prerequisites

Before starting, ensure you have configured Tensorfuse on your AWS account. If not, refer to the Getting Started guide.

Deploy a Fine-tuning Job Using Tensorfuse

To deploy a job with Tensorfuse, perform the following steps:

  1. Prepare the Dockerfile

  2. Clone the fine-tuning script

  3. Create Tensorfuse secrets

  4. Deploy the job with Tensorfuse

Step 1: Prepare the Dockerfile

Dockerfile
# Use the NVIDIA CUDA base image
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# Set non-interactive mode for apt
ENV DEBIAN_FRONTEND=noninteractive

# Update and install prerequisites, and install Python 3.11 and development packages
RUN apt-get update && \
    apt-get install -y software-properties-common curl && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -y python3.11 python3.11-dev python3.11-venv && \
    rm -rf /var/lib/apt/lists/*

# Set Python 3.11 as the default Python version
RUN ln -sf /usr/bin/python3.11 /usr/bin/python

# Install pip for Python 3.11 using the official get-pip.py script
RUN curl -sS https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python3.11 get-pip.py && \
    rm get-pip.py

# Upgrade pip and install required Python packages using Python 3.11
RUN python3.11 -m pip install --no-cache-dir --upgrade pip && \
    python3.11 -m pip install --no-cache-dir transformers torch && \
    python3.11 -m pip install --no-cache-dir accelerate unsloth vllm pillow diffusers hf_transfer huggingface_hub tensorkube wandb

ENV HF_HUB_ENABLE_HF_TRANSFER=1

# Set working directory
WORKDIR /code

# Copy the code files
COPY train.py /code/train.py
COPY reward_functions.py  /code/reward_functions.py
COPY hugging_face_upload.py /code/hugging_face_upload.py

# Run the application
CMD ["python3.11", "train.py"]

Step 2: Clone the fine-tuning script

The fine-tuning script uses Unsloth and GRPO to fine-tune Qwen 7B on the openai/gsm8k dataset with a set of reward functions. The script integrates wandb for logging and Hugging Face for uploading the LoRA adapter. It is inspired by this Unsloth guide.
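The reward functions themselves live in reward_functions.py in the repository. As a minimal sketch of the shape GRPOTrainer expects (each reward function receives the generated completions, plus any dataset columns as keyword arguments, and returns one float score per completion), an integer-answer check might look like this; the exact function bodies in the repository may differ:

import re

# Sketch only: rewards completions whose <answer> block contains a bare integer.
def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.search(r"<answer>\s*(.*?)\s*</answer>", r, re.DOTALL) for r in responses]
    return [0.5 if (m and m.group(1).strip().isdigit()) else 0.0 for m in matches]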

Clone the script from this Git repository. The repository contains two folders:

  • finetuning: Contains the training script and reward functions.
  • inference: Contains code to deploy the vLLM server with Tensorfuse.
train.py
from unsloth import FastLanguageModel, PatchFastRL, is_bfloat16_supported
import torch
from reward_functions import xmlcount_reward_func, soft_format_reward_func, strict_format_reward_func, int_reward_func, correctness_reward_func, dataset
from trl import GRPOConfig, GRPOTrainer
from hugging_face_upload import upload_lora
import os 
import json
import wandb
from tensorkube import get_queued_message

message = json.loads(get_queued_message())

PatchFastRL("GRPO", FastLanguageModel)
max_seq_length = message.get('max_seq_len') or 1024 # Can increase for longer reasoning traces
lora_rank = message.get('lora_rank') or 16 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = message.get("model_name") or "Qwen/Qwen2.5-7B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = message.get("gpu_memory_utilization") or 0.6, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Suggested values: 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)


# wandb config
project_name = message.get("wandb_project_name") or "unsloth"
wandb.init(project=project_name)

# train config
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = message.get("learning_rate") or 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = message.get("lr_scheduler_type") or "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = message.get("max_prompt_length") or 256,
    max_completion_length = message.get("max_completion_length") or 200,
    num_train_epochs = message.get("num_train_epochs") or 1, # Set to 1 for a full training run
    max_steps = 250, # Overrides num_train_epochs when set to a positive value
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "wandb", # Can use Weights & Biases
    output_dir = "outputs", # stores the checkpoints in outputs folder
)


trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)

trainer.train()
folder_name = "lora_adapter"
model.save_lora(folder_name)
print("model trained and saved")
folder_path = os.path.join(os.getcwd(), folder_name)

upload_lora(folder_path, folder_name)
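The upload_lora helper imported from hugging_face_upload.py is not shown above. A minimal sketch using the huggingface_hub client, assuming the HUGGING_FACE_HUB_TOKEN environment variable is set (it is injected by the Tensorfuse secret created in Step 3), could look like this:

import os
from huggingface_hub import HfApi

# Sketch only: pushes the saved adapter folder to a Hugging Face repo.
def upload_lora(folder_path: str, repo_name: str):
    api = HfApi(token=os.environ["HUGGING_FACE_HUB_TOKEN"])
    repo_id = api.create_repo(repo_name, exist_ok=True).repo_id
    api.upload_folder(folder_path=folder_path, repo_id=repo_id)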

Step 3: Create Tensorfuse Secrets

The training script requires WandB for logging and Hugging Face for uploading the LoRA adapter. Create secrets for both using the following commands:

tensorkube secret create wandb WANDB_API_KEY=<WANDB_API_KEY> --env keda
tensorkube secret create huggingface HUGGING_FACE_HUB_TOKEN=<HUGGING_FACE_API_KEY> --env keda

Replace placeholders with your actual API keys.
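Both libraries pick these up automatically: wandb reads WANDB_API_KEY and huggingface_hub reads HUGGING_FACE_HUB_TOKEN from the environment, so no explicit login call is needed in train.py. If you want the job to fail fast when a secret is missing, a small check like the following sketch (not part of the repository) can be added at the top of the script:

import os

# Fail fast if a required secret was not injected into the job environment.
for var in ("WANDB_API_KEY", "HUGGING_FACE_HUB_TOKEN"):
    if var not in os.environ:
        raise RuntimeError(f"Missing required secret: {var}")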

Step 4: Deploy the job with Tensorfuse

Now that your Dockerfile, fine-tuning scripts, and secrets are ready, deploy the job:

# go to the finetuning directory
cd finetuning
tensorkube job deploy --name qwen-7b-unsloth-grpo --gpus 1 --gpu-type l40s --secret huggingface --secret wandb

This command builds the Docker image, uploads it to the registry, and deploys it on your Tensorfuse cluster. For details on deploying jobs, refer to the job queues documentation.

Running the job

After deploying, you can run the job using:

tensorkube job queue --job-name qwen-7b-unsloth-grpo --job-id test-1 --payload '{
    "lora_rank": 16,
    "max_seq_len": 1024,
    "model_name": "Qwen/Qwen2.5-7B-Instruct",
    "gpu_memory_utilization": 0.6,
    "num_train_epochs": 1,
    "learning_rate": 5e-6,
    "lr_scheduler_type": "cosine",
    "wandb_project_name": "grpo"
}'

The payload parameters are accessible in the script via the get_queued_message() function from the tensorkube package.

Checking the job status

You can check the status of the job using the following command:

tensorkube job get --job-name qwen-7b-unsloth-grpo --job-id test-1

The status of the job will be displayed in the output. It should show SUCCESS once the job is completed.
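If you want to wait for completion programmatically, a small polling sketch around the same command might look like this (it assumes the printed status contains the string SUCCESS when the job finishes):

import subprocess
import time

# Sketch only: polls `tensorkube job get` until the status shows SUCCESS.
def wait_for_job(job_name: str, job_id: str, interval_s: int = 60):
    while True:
        result = subprocess.run(
            ["tensorkube", "job", "get", "--job-name", job_name, "--job-id", job_id],
            capture_output=True, text=True,
        )
        print(result.stdout)
        if "SUCCESS" in result.stdout:
            return
        time.sleep(interval_s)

wait_for_job("qwen-7b-unsloth-grpo", "test-1")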

Transform Qwen 7B into Your Own Reasoning Model

Once the job is successfully completed, the LoRA adapter will be available on the Hugging Face model hub. You can use this LoRA adapter for inference tasks involving reasoning. For this guide, we’ll deploy a vLLM server to utilize the LoRA adapter for inference.
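Before deploying, you can optionally sanity-check the adapter locally. The sketch below assumes transformers and peft are installed, that the adapter was uploaded as <HUGGING_FACE_ORG_NAME>/unsloth_qwen_7b_adapter (the repository id used in the load command below), and that a GPU with enough memory for the 7B model is available:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the fine-tuned LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base, "<HUGGING_FACE_ORG_NAME>/unsloth_qwen_7b_adapter")

# Run one prompt through the adapted model.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Calculate 0/10"}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=200)[0]))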

The inference folder in the cloned repository contains the code to deploy the vLLM server with Tensorfuse.

Run the following commands to deploy the vLLM server with your LoRA adapter:

# go to the inference directory
# if you are in the finetuning directory, run `cd ..` first to return to the repository root
cd inference
tensorkube secret create huggingface HUGGING_FACE_HUB_TOKEN=<HUGGING_FACE_API_KEY>
tensorkube deploy --config deployment.yaml

After the deployment is ready, load your LoRA adapter into the vLLM server using the following curl command:

  • Replace <TENSORKUBE_DEPLOYMENT_URL> with the actual deployment URL of the vLLM server.
  • Replace <HUGGING_FACE_ORG_NAME> with your Hugging Face organization name.
  • The lora_name must be unique for each LoRA adapter.
  • The lora_path is the path to the LoRA adapter on the Hugging Face model hub.

curl -X POST <TENSORKUBE_DEPLOYMENT_URL>/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{
      "lora_name": "unsloth_qwen_7b_adapter",
      "lora_path": "<HUGGING_FACE_ORG_NAME>/unsloth_qwen_7b_adapter"
  }'
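The same request from Python, as a sketch using the requests library (the placeholders are the same as in the curl command above):

import requests

# Sketch only: registers the LoRA adapter with the running vLLM server.
response = requests.post(
    "<TENSORKUBE_DEPLOYMENT_URL>/v1/load_lora_adapter",
    json={
        "lora_name": "unsloth_qwen_7b_adapter",
        "lora_path": "<HUGGING_FACE_ORG_NAME>/unsloth_qwen_7b_adapter",
    },
)
print(response.status_code, response.text)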

Once your LoRA adapter is loaded, you can perform inference using the vLLM server. Use the following curl command to test inference:

  • Replace <TENSORKUBE_DEPLOYMENT_URL> with the actual deployment URL of the vLLM server.
  • The model field should be the name of the LoRA adapter loaded above.

curl --request POST \
  --url <TENSORKUBE_DEPLOYMENT_URL>/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "unsloth_qwen_7b_adapter",
    "messages": [
        {
            "role": "system",
            "content": "\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n"
        },
        {
            "role": "user",
            "content": "Calculate 0/10"
        }
    ]
  }'
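Because vLLM exposes an OpenAI-compatible API, you can also call the endpoint from Python with the openai client, as in this sketch (it assumes the openai package is installed and that the server does not require an API key):

from openai import OpenAI

# Point the OpenAI client at the vLLM deployment; the key is a placeholder.
client = OpenAI(base_url="<TENSORKUBE_DEPLOYMENT_URL>/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="unsloth_qwen_7b_adapter",
    messages=[
        {"role": "system", "content": "Respond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"},
        {"role": "user", "content": "Calculate 0/10"},
    ],
)
print(response.choices[0].message.content)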

That’s it! You have successfully transformed Qwen 7B into your own reasoning model using Tensorfuse and Unsloth. You can now use your LoRA adapter for inference on a variety of reasoning tasks.

This guide is a high-level overview of the steps involved in transforming Qwen 7B into a reasoning model. You can customize the training script and deployment configurations to suit your requirements.

Before moving to production, please follow this guide for a production-ready vLLM server deployment and custom domains for secure endpoints.