Fine-tuning Guide for Tensorfuse

This guide explains how to fine-tune Llama models using Tensorfuse’s QLoRA implementation.

Supported Models

Model          | GPU Requirements
---------------|------------------------
Llama 3.1 70B  | 4x L40S (Recommended)
Llama 3.1 8B   | 1-2x A10G

Dataset Preparation

Tensorfuse accepts datasets in JSONL format, where each line contains a valid JSON object.

The following example shows the format for a conversational dataset using the ChatML format:

{
  "messages":
  [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    },
    {
      "role": "assistant",
      "content": "The capital of France is Paris."
    }
  ]
}
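
If you are assembling the dataset programmatically, you can serialize each conversation as one JSON object per line. The snippet below is a minimal sketch using Python's standard json module; the conversations list and the data.jsonl filename are placeholders for your own data.

import json

# Hypothetical in-memory dataset: one ChatML-style conversation per entry.
conversations = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "The capital of France is Paris."},
        ]
    },
]

# Write one JSON object per line (JSONL), as expected by Tensorfuse.
with open("data.jsonl", "w", encoding="utf-8") as f:
    for conversation in conversations:
        f.write(json.dumps(conversation, ensure_ascii=False) + "\n")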

Dataset Commands

# Create dataset
tensorkube datasets create --dataset-id my_dataset --path data.jsonl

# List datasets
tensorkube datasets list

# Delete dataset
tensorkube datasets delete --dataset-id my_dataset

Once you have created your dataset, you can start fine-tuning your model. Before that, however, you need to create an authentication token from Hugging Face.

Authenticating Hugging Face and W&B

Create the required secrets. Tensorkube uses Kubernetes Event-Driven Autoscaling (KEDA) under the hood to scale and schedule training runs, so you need to create your secrets in the keda environment:

1. Access to Llama 3.1

Llama 3.1 requires a license agreement. Visit the Llama 3.1 Hugging Face repo to ensure that you have signed the agreement and have access to the model.

2. Set Hugging Face token

Get a WRITE token from your Hugging Face profile and store it as a secret in Tensorfuse using the command below.

tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=hf_EkXXrzzZsuoZubXhDQ --env keda

Ensure that the key for your secret is HUGGING_FACE_HUB_TOKEN, as Tensorfuse expects this exact key name.

If you don't wish to upload your models to your Hugging Face account, you can use a READ token instead.

3. Set your W&B authentication token

Weights and Biases (W&B) is used for logging and monitoring training runs. You need to create a W&B account and get an API key. Store it as a secret in Tensorfuse using the command below.

tensorkube secret create wb-secret WANDB_API_KEY=7xxxxxxxxxxx4 --env keda

Programmatic Access

Tensorfuse allows you to interact with the TensorKube cluster using the Python SDK, which provides a straightforward interface for creating fine-tuning jobs.

Authentication

First, you need to create access keys, which are required to authenticate with the TensorKube cluster deployed in your cloud.

You can skip this step if you are running the training runs from your local machine, as your default user will have sufficient permissions.

Run the following command:

tensorkube train create-user --name <user-name>

This will create a new user and provide you with access keys.

Next, export the AWS keys as environment variables where you will be running the Python code:

export AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<AWS_SECRET>

The following code demonstrates how to create a fine-tuning job using the Python SDK. The create_fine_tuning_job call below fine-tunes the Llama 3.1 70B Instruct base model on four L40S GPUs.

from tensorkube import create_fine_tuning_job

create_fine_tuning_job( # creates a fine-tuning job
    job_name="fine-tuning-job", # Job Name. Required
    job_id="unique_id", # Unique Job ID. Required
    gpus=4, # Number of GPUs. Required
    gpu_type="l40s", # GPU Type. Required
    max_scale=1, # Maximum Scale. Required
    base_model='meta-llama/Llama-3.1-70B-Instruct', # Base Model from hugging face. Required
    dataset='dataset-id', # Dataset ID. Required
    epochs=10, # Number of epochs. Required
    secrets=["hugging-face-secret", "wb-secret"], # List of secrets
    micro_batch_size=8, # Micro Batch Size. Optional, default is 8
    lora_r=4, # LoRA rank (r). Optional, default is 4
    learning_rate=0.00002, # Learning Rate. Optional, default is 0.00002
    val_set_size=0.1, # Validation Set Size. Optional, default is 0.1
    wandb_entity="ORG_NAME_HERE", # W&B organisation / account name. Optional, default is None
    hf_org_id="ORG_ID_HERE" # Hugging Face organisation ID. Optional, default is None
)

To check the status of the job, you can use the get_job_status function. The function returns the status of the job as QUEUED, PROCESSING, COMPLETED, or FAILED.

from tensorkube import get_job_status
status = get_job_status( # gets the status of the job
  job_name="fine-tuning-job", # Job Name. Required
  job_id="unique_id" # Unique Job ID. Required
)
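
For long-running jobs you may want to poll this status periodically. The loop below is a minimal sketch that reuses the get_job_status call shown above; the 60-second interval and the terminal states it checks for are assumptions based on the statuses listed in this guide.

import time
from tensorkube import get_job_status

# Poll until the job reaches a terminal state (assumed: COMPLETED or FAILED).
while True:
    status = get_job_status(job_name="fine-tuning-job", job_id="unique_id")
    print(f"Current status: {status}")
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(60)  # assumed polling interval; adjust as needed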

Once the job is completed, the adapter is uploaded to S3. You can retrieve your adapters from the S3 console as follows:

  • Find the S3 bucket with the prefix tensorkube-keda-train-bucket. All your training LoRA adapters reside here. The adapter ID is constructed from your job ID and the type of GPUs used for training, so your adapter URLs look like this: s3://<bucket-name>/lora-adapter/<job_name>/<job_id>

Below is an example of a training adapter URL with job_name fine-tuning-job and job_id unique_id, trained on 4 GPUs of type L40S:

s3://tensorkube-keda-train-bucket-d473253e-d692-4a15/lora-adapter/fine-tuning-job/unique_id
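
If you prefer to fetch the adapter programmatically rather than through the console, a short boto3 script can download everything under the adapter prefix. This is a minimal sketch, assuming the bucket and prefix from the example above; the bucket name is a placeholder and will differ in your account.

import os
import boto3

# Placeholder values from the example above; replace with your own.
bucket = "tensorkube-keda-train-bucket-d473253e-d692-4a15"
prefix = "lora-adapter/fine-tuning-job/unique_id"
local_dir = "adapter"

s3 = boto3.client("s3")  # uses the AWS credentials exported earlier

# Download every object under the adapter prefix, preserving relative paths.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        target = os.path.join(local_dir, os.path.relpath(key, prefix))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        s3.download_file(bucket, key, target)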

Uploading the Adapter to Hugging Face

To automatically upload the adapter to Hugging Face, make sure that:

  • You use a WRITE token as the HUGGING_FACE_HUB_TOKEN secret.
  • You are passing in the hf_org_id parameter in the create_fine_tuning_job function.

Tensorfuse automatically uploads the adapter to Hugging Face once the training is completed, in addition to uploading it to S3. Tensorfuse creates an adapter repo that follows the {HF_ORG_ID}/{job_name}_{job_id} format. For the above example, the adapter would be uploaded to {ORG_ID_HERE}/fine-tuning-job_unique_id. The repo is private by default.

Model Deployment

  1. Clone the Lorax repository:
git clone https://github.com/tensorfuse/vllm
cd vllm/llama_70b_lora
  2. Use the following command to deploy.

The deploy command below deploys a Lorax instance in the default environment. Make sure you have created the hugging-face-secret in the default environment; you can do so by adding the --env default flag to the secret creation command.

tensorkube deploy --gpus 4 --gpu-type L40S --secret hugging-face-secret

This will deploy the base model with the vLLM server.

  3. Get your deployment URL using tensorkube deployment list.

Inference

You can now use the deployment URL to make inference requests. Here is an example using curl. This will query the base model without any adapters.

curl ${ENDPOINT}/generate -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "[INST] Your prompt here [/INST]",
    "parameters": {
      "max_new_tokens": 64
    }
  }'

To query the fine-tuned adapter, use the following command:

curl --request POST -v \
  --url <endpoint>/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "<ADAPTER_HF_ID>",
  "messages": [
    {
      "role": "user",
      "content": "hello"
    }
  ]
}'
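
The same request can be made from Python. The snippet below is a minimal sketch using the requests library and mirrors the curl call above; the endpoint URL and adapter ID are placeholders you should replace with your deployment URL and the Hugging Face adapter ID created earlier.

import requests

# Placeholders: replace with your deployment URL and adapter repo ID.
ENDPOINT = "https://<your-deployment-url>"
ADAPTER_HF_ID = "<ADAPTER_HF_ID>"

# Send a chat completion request routed to the fine-tuned adapter.
response = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": ADAPTER_HF_ID,
        "messages": [
            {"role": "user", "content": "hello"},
        ],
    },
)
print(response.json())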