Transforming Qwen 7B into Your Own Reasoning Model
Fine-tune Qwen 7B for reasoning tasks on your AWS account using Tensorfuse and Unsloth with GRPO.
Fine-tuning large language models like Qwen 7B is essential for adapting them to specific tasks. To transform Qwen 7B into a reasoning model, we'll use Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm introduced by DeepSeek, along with Unsloth, a fast and memory-efficient training library.
This guide demonstrates how to create a fine-tuning job for Qwen 7B using job queues and save the resulting LoRA adapter to Hugging Face. We'll also deploy a vLLM server for inference. A single L40S GPU is used for both training and inference.
The fine-tuning script uses Unsloth and GRPO to fine-tune Qwen 7B on the openai/gsm8k dataset with reward functions. It integrates wandb for logging and Hugging Face for uploading the LoRA adapter, and is inspired by this Unsloth guide.
Clone the script from this Git repository. The repository contains two folders:
finetuning: Contains the training script and reward functions.
inference: Contains code to deploy the vLLM server with Tensorfuse.
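The reward functions referenced above live in reward_functions.py inside the finetuning folder. To give a sense of what GRPO expects, here is a minimal sketch of a correctness-style reward in the spirit of the Unsloth GSM8K guide: each function receives the sampled completions (plus dataset columns such as answer as keyword arguments) and returns one score per completion. This is an illustrative simplification, not the exact code from the repository.

```python
import re

def extract_answer(text: str) -> str:
    """Pull the text between <answer> ... </answer> tags, if present."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1) if match else ""

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Reward 2.0 when the extracted answer matches the gold answer, else 0.0."""
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_answer(r) for r in responses]
    return [2.0 if ext == gold else 0.0 for ext, gold in zip(extracted, answer)]
```

GRPO then uses the spread of these scores within each group of generations to reinforce the better completions.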
```python
from unsloth import FastLanguageModel, PatchFastRL, is_bfloat16_supported
import torch
from reward_functions import xmlcount_reward_func, soft_format_reward_func, strict_format_reward_func, int_reward_func, correctness_reward_func, dataset
from trl import GRPOConfig, GRPOTrainer
from hugging_face_upload import upload_lora
import os
import json
import wandb
from tensorkube import get_queued_message

message = json.loads(get_queued_message())

PatchFastRL("GRPO", FastLanguageModel)

max_seq_length = message.get('max_seq_len') or 1024  # Can increase for longer reasoning traces
lora_rank = message.get('lora_rank') or 16  # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = message.get("model_name") or "Qwen/Qwen2.5-7B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = False,  # False for LoRA 16bit
    fast_inference = True,  # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = message.get("gpu_memory_utilization") or 0.6,  # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,  # LoRA rank should be a multiple of 2. Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],  # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",  # Enable long context finetuning
    random_state = 3407,
)

# wandb config
project_name = message.get("wandb_project_name") or "unsloth"
wandb.init(project=project_name)

# train config
training_args = GRPOConfig(
    use_vllm = True,  # use vLLM for fast inference!
    learning_rate = message.get("learning_rate") or 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = message.get("lr_scheduler_type") or "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,  # Increase to 4 for smoother training
    num_generations = 6,  # Decrease if out of memory
    max_prompt_length = message.get("max_prompt_length") or 256,
    max_completion_length = message.get("max_completion_length") or 200,
    num_train_epochs = message.get("num_train_epochs") or 1,  # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "wandb",  # Can use Weights & Biases
    output_dir = "outputs",  # stores the checkpoints in the outputs folder
)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

folder_name = "lora_adapter"
model.save_lora(folder_name)
print("model trained and saved")

folder_path = os.getcwd() + "/" + folder_name
upload_lora(folder_path, folder_name)
```
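The job below references two Tensorkube secrets, huggingface and wandb, so the training script can push the LoRA adapter and log to Weights & Biases. If you have not created them yet, a sketch of the commands is shown here; the HUGGING_FACE_HUB_TOKEN key matches the one used later in this guide, while the WANDB_API_KEY key name is an assumption, so match it to whatever your training image reads.

```bash
# Secret used to upload the LoRA adapter to the Hugging Face Hub
tensorkube secret create huggingface HUGGING_FACE_HUB_TOKEN=<HUGGING_FACE_API_KEY>

# Secret used for wandb logging; the key name here is an assumption
tensorkube secret create wandb WANDB_API_KEY=<WANDB_API_KEY>
```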
Now that your Dockerfile, fine-tuning scripts, and secrets are ready, deploy the job:
```bash
# go to the finetuning directory
cd finetuning

tensorkube job deploy --name qwen-7b-unsloth-grpo --gpus 1 --gpu-type l40s --secret huggingface --secret wandb
```
This command builds the Docker image, uploads it to the registry, and deploys it on your Tensorfuse cluster. For details on deploying jobs, refer to the job queues documentation.
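The training script reads its hyperparameters from the queued message via get_queued_message(), so the JSON payload you enqueue controls the run; every key is optional because the script falls back to the defaults shown above. A sample payload with the keys the script understands (see the job queues documentation for how to enqueue it):

```json
{
  "model_name": "Qwen/Qwen2.5-7B-Instruct",
  "max_seq_len": 1024,
  "lora_rank": 16,
  "gpu_memory_utilization": 0.6,
  "wandb_project_name": "unsloth",
  "learning_rate": 5e-6,
  "lr_scheduler_type": "cosine",
  "max_prompt_length": 256,
  "max_completion_length": 200,
  "num_train_epochs": 1
}
```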
Once the job completes successfully, the LoRA adapter will be available on the Hugging Face model hub, and you can use it for reasoning-focused inference. For this guide, we'll deploy a vLLM server to serve the LoRA adapter.
The inference folder in the cloned repository contains the code to deploy the vLLM server with Tensorfuse.
Run the following commands to deploy the vLLM server with your LoRA adapter:
```bash
# go to the inference directory
# if you are in the finetuning directory, run `cd ..` first to go back to the root directory
cd inference

tensorkube secret create huggingface HUGGING_FACE_HUB_TOKEN=<HUGGING_FACE_API_KEY>
tensorkube deploy --config deployment.yaml
```
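The deployment.yaml in the inference folder controls how the vLLM server is launched. For the adapter loading step below to work, the server needs LoRA support and runtime LoRA updates enabled. The actual configuration lives in the repository's deployment.yaml; as a rough sketch using standard vLLM options (the rank and port values here are assumptions), the server start roughly corresponds to:

```bash
# Allow adapters to be loaded at runtime through the /v1/load_lora_adapter endpoint
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True

# Serve the base model with LoRA support; max rank must cover the trained adapter
vllm serve Qwen/Qwen2.5-7B-Instruct --enable-lora --max-lora-rank 16 --port 80
```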
After the deployment is ready, load your LoRA adapter into the vLLM server using the following curl command:
Replace <TENSORKUBE_DEPLOYMENT_URL> with the actual deployment URL of the vLLM server.
Replace <HUGGING_FACE_ORG_NAME> with your actual Hugging Face org name.
The lora_name should be unique for each LoRA adapter.
The lora_path should be the path to the LoRA adapter on the Hugging Face model hub.
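Assuming the deployment exposes vLLM's runtime LoRA loading endpoint, the load request looks roughly like this. The lora_name below matches the model name used in the inference request further down; the lora_path repository name (lora_adapter, the folder name the training script uploads) is an assumption, so substitute the repository your upload actually created.

```bash
curl --request POST \
  --url <TENSORKUBE_DEPLOYMENT_URL>/v1/load_lora_adapter \
  --header 'Content-Type: application/json' \
  --data '{
    "lora_name": "unsloth_qwen_7b_adapter",
    "lora_path": "<HUGGING_FACE_ORG_NAME>/lora_adapter"
  }'
```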
Once your LoRA adapter is loaded, you can perform inference using the vLLM server. Use the following curl command to test inference:
Replace <TENSORKUBE_DEPLOYMENT_URL> with the actual deployment URL of the vLLM server.
The model should be the LoRA adapter name you used when loading the adapter.
```bash
curl --request POST \
  --url <TENSORKUBE_DEPLOYMENT_URL>/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "unsloth_qwen_7b_adapter",
    "messages": [
      {
        "role": "system",
        "content": "\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n"
      },
      {
        "role": "user",
        "content": "Calculate 0/10"
      }
    ]
  }'
```
That’s it! You have successfully transformed Qwen 7B into your own reasoning model using Tensorfuse and Unsloth. You can now utilize your LoRA adapter for inference on various reasoning tasks.
The above guide is a high-level overview of the steps involved in transforming Qwen 7B into a reasoning model. You can customize the training script and deployment configuration to suit your requirements.
Before moving to production, please follow this guide for a production-ready vLLM server deployment and custom domains for secure endpoints.