# Transforming Qwen 7B into Your Own Reasoning Model

> Fine-tune Qwen 7B for reasoning tasks on your AWS account using Tensorfuse and Unsloth with GRPO.

Fine-tuning large language models like Qwen 7B adapts them to specific tasks. To transform Qwen 7B into a reasoning model, we'll use [GRPO](https://arxiv.org/pdf/2402.03300), a reinforcement learning algorithm introduced by DeepSeek, along with Unsloth, a fast and memory-efficient training library.

This guide shows how to create a fine-tuning job for Qwen 7B using [job queues](https://tensorfuse.io/docs/concepts/job_queues), save the resulting LoRA adapter to Hugging Face, and deploy a vLLM server for inference. Training and inference each use a single L40S GPU.

## Prerequisites

Before starting, ensure you have configured Tensorfuse on your AWS account. If not, refer to the [Getting Started](/concepts/getting_started_tensorkube) guide.

## Deploy a Fine-tuning Job Using Tensorfuse

To deploy a job with Tensorfuse, perform the following steps:

1. **Prepare the Dockerfile**

2. **Clone the fine-tuning script**

3. **Create Tensorfuse secrets**

4. **Deploy the job with Tensorfuse**

### Step 1: Prepare the Dockerfile

```dockerfile Dockerfile theme={null}
# Use the NVIDIA CUDA base image
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# Set non-interactive mode for apt
ENV DEBIAN_FRONTEND=noninteractive

# Update and install prerequisites, and install Python 3.11 and development packages
RUN apt-get update && \
    apt-get install -y software-properties-common curl && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -y python3.11 python3.11-dev python3.11-venv && \
    rm -rf /var/lib/apt/lists/*

# Set Python 3.11 as the default Python version
RUN ln -sf /usr/bin/python3.11 /usr/bin/python

# Install pip for Python 3.11 using the official get-pip.py script
RUN curl -sS https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python3.11 get-pip.py && \
    rm get-pip.py

# Upgrade pip and install required Python packages using Python 3.11
RUN python3.11 -m pip install --no-cache-dir --upgrade pip && \
    python3.11 -m pip install --no-cache-dir transformers torch && python3.11 -m pip install accelerate unsloth vllm pillow diffusers hf_transfer huggingface_hub tensorkube wandb

ENV HF_HUB_ENABLE_HF_TRANSFER=1

# Set working directory
WORKDIR /code

# Copy the code files
COPY train.py /code/train.py
COPY reward_functions.py  /code/reward_functions.py
COPY hugging_face_upload.py /code/hugging_face_upload.py

# Run the application
CMD ["python3.11", "train.py"]
```

### Step 2: Clone the fine-tuning script

The fine-tuning script uses Unsloth and GRPO to fine-tune Qwen 7B on the [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) dataset, scoring each generated answer with a set of reward functions. It logs to wandb and uploads the LoRA adapter to Hugging Face. It is inspired by this [Unsloth guide](https://docs.unsloth.ai/basics/reasoning-grpo-and-rl).
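As a rough illustration of how GRPO turns reward scores into a training signal: for each prompt, the trainer samples a group of completions (`num_generations = 6` in the script), scores each one, and measures every score against the rest of its own group. The sketch below is a simplified stand-in for that idea, not TRL's implementation; the function name, the `eps` value, and the reward numbers are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed value) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for six completions of the same prompt
rewards = [3.5, 0.5, 0.0, 3.5, 1.0, 0.5]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.2f}")
```

Completions that score above their group's mean get a positive advantage and are reinforced; those below the mean are discouraged. If every completion in a group scores the same, all advantages are zero and that prompt contributes no learning signal.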

Clone the script from this [Git repository](https://github.com/tensorfuse/tensorfuse-examples/tree/feat.qwenReasoning/llms/reasoning/qwen7b). The repository contains two folders:

* **finetuning:** Contains the training script and reward functions.
* **inference:**  Contains code to deploy the vLLM server with Tensorfuse.

<CodeGroup>
  ```python train.py theme={null}
  from unsloth import FastLanguageModel, PatchFastRL, is_bfloat16_supported
  import torch
  from reward_functions import xmlcount_reward_func, soft_format_reward_func, strict_format_reward_func, int_reward_func, correctness_reward_func, dataset
  from trl import GRPOConfig, GRPOTrainer
  from hugging_face_upload import upload_lora
  import os 
  import json
  import wandb
  from tensorkube import get_queued_message

  message = json.loads(get_queued_message())

  PatchFastRL("GRPO", FastLanguageModel)
  max_seq_length = message.get('max_seq_len') or 1024 # Can increase for longer reasoning traces
  lora_rank = message.get('lora_rank') or 16 # Larger rank = smarter, but slower

  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name = message.get("model_name") or "Qwen/Qwen2.5-7B-Instruct",
      max_seq_length = max_seq_length,
      load_in_4bit = False, # False for LoRA 16bit
      fast_inference = True, # Enable vLLM fast inference
      max_lora_rank = lora_rank,
      gpu_memory_utilization = message.get("gpu_memory_utilization") or 0.6, # Reduce if out of memory
  )

  model = FastLanguageModel.get_peft_model(
      model,
      r = lora_rank, # lora rank should be in multiples of 2. Suggested 8, 16, 32, 64, 128
      target_modules = [
          "q_proj", "k_proj", "v_proj", "o_proj",
          "gate_proj", "up_proj", "down_proj",
      ], # Remove QKVO if out of memory
      lora_alpha = lora_rank,
      use_gradient_checkpointing = "unsloth", # Enable long context finetuning
      random_state = 3407,
  )


  # wandb config
  project_name = message.get("wandb_project_name") or "unsloth"
  wandb.init(project=project_name)

  # train config
  training_args = GRPOConfig(
      use_vllm = True, # use vLLM for fast inference!
      learning_rate = message.get("learning_rate") or 5e-6,
      adam_beta1 = 0.9,
      adam_beta2 = 0.99,
      weight_decay = 0.1,
      warmup_ratio = 0.1,
      lr_scheduler_type = message.get("lr_scheduler_type") or "cosine",
      optim = "paged_adamw_8bit",
      logging_steps = 1,
      bf16 = is_bfloat16_supported(),
      fp16 = not is_bfloat16_supported(),
      per_device_train_batch_size = 1,
      gradient_accumulation_steps = 1, # Increase to 4 for smoother training
      num_generations = 6, # Decrease if out of memory
      max_prompt_length = message.get("max_prompt_length") or 256,
      max_completion_length = message.get("max_completion_length") or 200,
      num_train_epochs = message.get("num_train_epochs") or 1, # Ignored while max_steps is positive
      max_steps = 250, # Takes precedence over num_train_epochs; set to -1 to train for full epochs
      save_steps = 250,
      max_grad_norm = 0.1,
      report_to = "wandb", # Can use Weights & Biases
      output_dir = "outputs", # stores the checkpoints in outputs folder
  )


  trainer = GRPOTrainer(
      model = model,
      processing_class = tokenizer,
      reward_funcs = [
          xmlcount_reward_func,
          soft_format_reward_func,
          strict_format_reward_func,
          int_reward_func,
          correctness_reward_func,
      ],
      args = training_args,
      train_dataset = dataset,
  )

  trainer.train()
  folder_name = "lora_adapter"
  model.save_lora(folder_name)
  print("model trained and saved")
  folder_path = os.getcwd() + "/" + folder_name

  upload_lora(folder_path, folder_name)
  ```

  ```python reward_functions.py theme={null}

  ## dataset and reward functions

  import re
  from datasets import load_dataset, Dataset

  # Load and prep dataset
  SYSTEM_PROMPT = """
  Respond in the following format:
  <reasoning>
  ...
  </reasoning>
  <answer>
  ...
  </answer>
  """

  XML_COT_FORMAT = """\
  <reasoning>
  {reasoning}
  </reasoning>
  <answer>
  {answer}
  </answer>
  """

  dataset_name = 'openai/gsm8k'
  def extract_xml_answer(text: str) -> str:
      answer = text.split("<answer>")[-1]
      answer = answer.split("</answer>")[0]
      return answer.strip()

  def extract_hash_answer(text: str) -> str | None:
      if "####" not in text:
          return None
      return text.split("####")[1].strip()

  # uncomment middle messages for 1-shot prompting
  def get_gsm8k_questions(split = "train") -> Dataset:
      data = load_dataset(dataset_name, 'main')[split] # type: ignore
      data = data.map(lambda x: { # type: ignore
          'prompt': [
              {'role': 'system', 'content': SYSTEM_PROMPT},
              {'role': 'user', 'content': x['question']}
          ],
          'answer': extract_hash_answer(x['answer'])
      }) # type: ignore
      return data # type: ignore

  dataset = get_gsm8k_questions()

  # Reward functions
  def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
      responses = [completion[0]['content'] for completion in completions]
      q = prompts[0][-1]['content']
      extracted_responses = [extract_xml_answer(r) for r in responses]
      print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
      return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

  def int_reward_func(completions, **kwargs) -> list[float]:
      responses = [completion[0]['content'] for completion in completions]
      extracted_responses = [extract_xml_answer(r) for r in responses]
      return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

  def strict_format_reward_func(completions, **kwargs) -> list[float]:
      """Reward function that checks if the completion has a specific format."""
      pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
      responses = [completion[0]["content"] for completion in completions]
      matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses] # DOTALL lets "." span multi-line reasoning
      return [0.5 if match else 0.0 for match in matches]

  def soft_format_reward_func(completions, **kwargs) -> list[float]:
      """Reward function that checks if the completion has a specific format."""
      pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
      responses = [completion[0]["content"] for completion in completions]
      matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses] # DOTALL lets "." span multi-line reasoning
      return [0.5 if match else 0.0 for match in matches]

  def count_xml(text) -> float:
      count = 0.0
      if text.count("<reasoning>\n") == 1:
          count += 0.125
      if text.count("\n</reasoning>\n") == 1:
          count += 0.125
      if text.count("\n<answer>\n") == 1:
          count += 0.125
          count -= len(text.split("\n</answer>\n")[-1])*0.001
      if text.count("\n</answer>") == 1:
          count += 0.125
          count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
      return count

  def xmlcount_reward_func(completions, **kwargs) -> list[float]:
      contents = [completion[0]["content"] for completion in completions]
      return [count_xml(c) for c in contents]


  ```

  ```python hugging_face_upload.py theme={null}
  from huggingface_hub import create_repo, upload_folder
  hugging_face_org_name = "<HUGGING_FACE_ORG_NAME>"
  def upload_lora(folder_path: str, model_name: str):
      repo_id = f"{hugging_face_org_name}/{model_name}"
      # Create the repository under the org (no-op if it already exists)
      create_repo(repo_id, private=True, exist_ok=True)
      # Upload the LoRA adapter to Hugging Face
      upload_folder(
          repo_id=repo_id,
          folder_path=folder_path,
          commit_message="unsloth grpo lora"
      )
  ```
</CodeGroup>
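To get a feel for what the reward functions return, the sketch below copies `extract_xml_answer` and `count_xml` from `reward_functions.py` and applies them to two hand-written completions. The sample strings are made up for illustration; the helpers are duplicated here only so the snippet runs without downloading the dataset.

```python
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def count_xml(text: str) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


# Hypothetical completions: one well formed, one with trailing text
well_formed = "<reasoning>\n6 * 7 = 42\n</reasoning>\n<answer>\n42\n</answer>\n"
with_trailing_text = well_formed + "extra"

extracted = extract_xml_answer(well_formed)
print(extracted)                      # 42
print(extracted.isdigit())            # True, so int_reward_func would give 0.5
print(count_xml(well_formed))         # 0.5, all four tags placed correctly
print(count_xml(with_trailing_text))  # slightly lower: text after </answer> is penalized
```

Each correctly placed tag contributes 0.125, and every character after the closing `</answer>` tag costs a small penalty, which nudges the model toward stopping once it has answered.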

### Step 3: Create Tensorfuse Secrets

The training script requires WandB for logging and Hugging Face for uploading the LoRA adapter. Create secrets for both using the following commands:

```sh theme={null}
tensorkube secret create wandb WANDB_API_KEY=<WANDB_API_KEY> --env keda
tensorkube secret create huggingface HUGGING_FACE_HUB_TOKEN=<HUGGING_FACE_API_KEY> --env keda
```

**Replace placeholders with your actual API keys.**
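If you want the job to fail fast when a secret is missing, a small pre-flight check at the top of `train.py` can help. This is a minimal sketch that assumes the secrets are exposed to the container as environment variables with the same names used above; `missing_secrets` is a hypothetical helper, not part of the tensorkube package.

```python
import os
from collections.abc import Mapping

# Names match the keys passed to `tensorkube secret create` above
REQUIRED_SECRETS = ("WANDB_API_KEY", "HUGGING_FACE_HUB_TOKEN")


def missing_secrets(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required secret names that are unset or empty in env."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"WANDB_API_KEY": "abc123", "HUGGING_FACE_HUB_TOKEN": ""}
print(missing_secrets(example_env))
```

In the training script you could call `missing_secrets()` with no arguments and raise an error if the returned list is not empty, before any model weights are downloaded.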

### Step 4: Deploy the job with Tensorfuse

Now that your Dockerfile, fine-tuning scripts, and secrets are ready, deploy the job:

```sh theme={null}
# go to finetuning directory
cd finetuning
tensorkube job deploy --name qwen-7b-unsloth-grpo --gpus 1 --gpu-type l40s --secret huggingface  --secret wandb
```

This command builds the Docker image, uploads it to the registry, and deploys it on your Tensorfuse cluster. For details on deploying jobs, refer to the [job queues](https://tensorfuse.io/docs/concepts/job_queues) documentation.

## Running the job

After deploying, you can run the job using:

```sh theme={null}
tensorkube job queue --job-name qwen-7b-unsloth-grpo --job-id test-1 --payload '{
    "lora_rank": 16,
    "max_seq_len": 1024,
    "model_name": "Qwen/Qwen2.5-7B-Instruct",
    "gpu_memory_utilization": 0.6,
    "num_train_epochs": 1,
    "learning_rate": 5e-6,
    "lr_scheduler_type": "cosine",
    "wandb_project_name": "grpo"
}
'
```

The payload parameters are accessible in the script via the `get_queued_message()` function from the `tensorkube` package. Any key you omit falls back to the default hard-coded in `train.py`.
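The sketch below mirrors the `message.get(key) or default` pattern used in `train.py`, with a literal JSON string standing in for the value returned by `get_queued_message()`. `DEFAULTS` and `resolve_config` are illustrative names that do not appear in the script; the default values are the ones hard-coded there.

```python
import json

# Defaults copied from train.py
DEFAULTS = {
    "max_seq_len": 1024,
    "lora_rank": 16,
    "model_name": "Qwen/Qwen2.5-7B-Instruct",
    "gpu_memory_utilization": 0.6,
    "wandb_project_name": "unsloth",
    "learning_rate": 5e-6,
    "lr_scheduler_type": "cosine",
    "max_prompt_length": 256,
    "max_completion_length": 200,
    "num_train_epochs": 1,
}


def resolve_config(raw_message: str) -> dict:
    """Merge a queued JSON payload with the defaults, as train.py does per key."""
    message = json.loads(raw_message)
    return {key: message.get(key) or default for key, default in DEFAULTS.items()}


# Stand-in for get_queued_message(): only two keys are overridden
config = resolve_config('{"lora_rank": 32, "wandb_project_name": "grpo"}')
print(config["lora_rank"])    # 32, taken from the payload
print(config["max_seq_len"])  # 1024, taken from the defaults
```

Because the pattern relies on `or`, any falsy payload value (`0`, `""`, `null`) is also replaced by the default, so a setting cannot be explicitly set to zero this way.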

## Checking the job status

You can check the status of the job using the following command:

```sh theme={null}
tensorkube job get --job-name qwen-7b-unsloth-grpo --job-id test-1
```

The status of the job will be displayed in the output. It should show `SUCCESS` once the job is completed.

## Transform Qwen 7B into Your Own Reasoning Model

Once the job is successfully completed, the LoRA adapter will be available on the Hugging Face model hub. You can use this LoRA adapter for inference tasks involving reasoning. For this guide, we'll deploy a vLLM server to utilize the LoRA adapter for inference.

The `inference` folder in the cloned repository contains the code to deploy the vLLM server with Tensorfuse.

Run the following commands to deploy the vLLM server with your LoRA adapter:

```sh theme={null}
# go to inference directory
# if you are in the finetuning directory, run `cd ..` first to go back to the root directory
cd inference
tensorkube secret create huggingface HUGGING_FACE_HUB_TOKEN=<HUGGING_FACE_API_KEY>
tensorkube deploy --config deployment.yaml
```

After the deployment is ready, load your LoRA adapter into the vLLM server using the following curl command:

<Note>
  <li>Replace `<TENSORKUBE_DEPLOYMENT_URL>` with the actual deployment URL of the vLLM server.</li>
  <li>Replace `<HUGGING_FACE_ORG_NAME>` with your actual Hugging Face org name.</li>
  <li>The `lora_name` should be unique for each LoRA adapter.</li>
  <li>The `lora_path` should be the adapter's repository on the Hugging Face model hub. The training script above names the repository after `folder_name` (`lora_adapter` as written), so make sure the repository name here matches the one your job uploaded to.</li>
</Note>

```sh theme={null}
curl -X POST <TENSORKUBE_DEPLOYMENT_URL>/v1/load_lora_adapter \
-H "Content-Type: application/json"  \
-d '{
    "lora_name": "unsloth_qwen_7b_adapter",
    "lora_path": "<HUGGING_FACE_ORG_NAME>/unsloth_qwen_7b_adapter"
}'
```

Once your LoRA adapter is loaded, you can perform inference using the vLLM server. Use the following curl command to test inference:

<Note>
  <li>Replace `<TENSORKUBE_DEPLOYMENT_URL>` with the actual deployment URL of the vLLM server.</li>
  <li>The `model` should be the `lora_name` you used when loading the adapter.</li>
</Note>

```sh theme={null}
curl --request POST  \
  --url <TENSORKUBE_DEPLOYMENT_URL>/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "unsloth_qwen_7b_adapter",
    "messages": [
        {
            "role": "system",
            "content": "\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n"
        },
        {
            "role": "user",
            "content": "Calculate 0/10"
        }
    ]
  }'
```
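The same request can be sent from Python. The sketch below uses only the standard library and assumes the server returns an OpenAI-style chat completion body (`choices[0].message.content`); `build_chat_payload`, `extract_answer`, and `ask` are hypothetical helper names, and the sample response text is made up.

```python
import json
import urllib.request

# Same system prompt the model was trained with
SYSTEM_PROMPT = (
    "\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n"
    "<answer>\n...\n</answer>\n"
)


def build_chat_payload(question: str, model: str = "unsloth_qwen_7b_adapter") -> dict:
    """Build the request body for /v1/chat/completions."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    }


def extract_answer(text: str) -> str:
    """Return the text between <answer> and </answer>."""
    return text.split("<answer>")[-1].split("</answer>")[0].strip()


def ask(base_url: str, question: str) -> str:
    """POST a question to the deployment and return the extracted answer."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return extract_answer(body["choices"][0]["message"]["content"])


# Offline example: parse a made-up response without calling the server
sample = "<reasoning>\n0 divided by 10 is 0.\n</reasoning>\n<answer>\n0\n</answer>\n"
print(extract_answer(sample))
```

Calling `ask("<TENSORKUBE_DEPLOYMENT_URL>", "Calculate 0/10")` with your real deployment URL sends the request shown in the curl command above.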

That's it! You have successfully transformed Qwen 7B into your own reasoning model using Tensorfuse and Unsloth. You can now utilize your LoRA adapter for inference on various reasoning tasks.

<Note>
  This guide is a high-level overview of the steps involved in transforming Qwen 7B into a reasoning model. You can customize the training script and deployment configuration to fit your requirements.

  Before moving to production, please follow this [guide](https://tensorfuse.io/docs/guides/llama_guide) for a production-ready vLLM server deployment and [custom domains](https://tensorfuse.io/docs/concepts/custom_domains_with_tls) for secure endpoints.
</Note>
