Fine-tuning Guide for Tensorfuse

This guide explains how to fine-tune Llama models using Tensorfuse’s QLoRA implementation.

Supported Models

Model            GPU Requirements
Llama 3.1 70B    4x L40S (Recommended)
Llama 3.1 8B     1-2x A10G

Dataset Preparation

Tensorfuse accepts datasets in JSONL format, where each line contains a valid JSON object.

The following example shows an entry from a conversational dataset in the ChatML format:

{
  "messages": 
  [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    },
    {
      "role": "assistant",
      "content": "The capital of France is Paris."
    }
  ]
}
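
Since JSONL requires one JSON object per line, a small script can help convert a list of conversations into the expected file. Below is a minimal Python sketch, assuming placeholder conversations and the file name data.jsonl used by the commands that follow:

import json

# Example conversations in the ChatML-style "messages" format (placeholder data)
conversations = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "The capital of France is Paris."},
        ]
    },
]

# Write one JSON object per line, as required by the JSONL format
with open("data.jsonl", "w") as f:
    for record in conversations:
        f.write(json.dumps(record) + "\n")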

Dataset Commands

# Create dataset
tensorkube datasets create --dataset-id my_dataset --path data.jsonl

# List datasets
tensorkube datasets list

# Delete dataset
tensorkube datasets delete --dataset-id my_dataset

Once you have created your dataset, you can start fine-tuning your model. Before that, however, you need to create an authentication token on Hugging Face.

Authentication

Create the required secrets. Tensorkube uses Kubernetes Event-Driven Autoscaling (KEDA) under the hood to scale and schedule training runs, so you need to create your secrets in the keda environment:

# Create Hugging Face token
tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=your_token --env keda

Programmatic Access

Tensorfuse allows you to interact with the TensorKube cluster using the Python SDK, which provides a straightforward interface for creating fine-tuning jobs.

Authentication

First, you need to create access keys, which are required to authenticate with the TensorKube cluster deployed in your cloud.

Run the following command:

tensorkube train create-user --name <user-name>

This will create a new user and provide you with access keys.

Next, export the AWS keys as environment variables where you will be running the Python code:

export AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<AWS_SECRET>
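
Before creating a job, it can help to confirm that the keys are visible to the Python process. A minimal sketch, assuming only the two environment variables exported above:

import os

# Verify that the access keys from `tensorkube train create-user` are available
# before calling the SDK; the variable names match the exports above.
missing = [
    var for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    if not os.environ.get(var)
]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")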

The following code demonstrates how to create a fine-tuning job using the Python SDK. The create_fine_tuning_job call below fine-tunes a Llama 3.1 70B base model on L40S GPUs.

from tensorkube import create_fine_tuning_job

create_fine_tuning_job( # creates a fine-tuning job
    job_name="fine-tuning-job", # Job name. Required
    job_id="unique_id", # Unique job ID. Required
    gpus=4, # Number of GPUs. Required
    gpu_type="l40s", # GPU type. Required
    max_scale=1, # Maximum scale. Required
    base_model='meta-llama/Llama-3.1-70B-Instruct', # Base model from Hugging Face. Required
    dataset='dataset-id', # Dataset ID. Required
    epochs=10, # Number of epochs. Required
    secrets=["hugging-face-secret"], # List of secrets
    micro_batch_size=16, # Micro batch size. Optional, default is 16
    lora_r=8, # LoRA rank. Optional, default is 8
    learning_rate=0.00002 # Learning rate. Optional, default is 0.00002
)

To check the status of a job, use the get_job_status function. It returns the job's status as QUEUED, PROCESSING, COMPLETED, or FAILED.

from tensorkube import get_job_status
status = get_job_status( # gets the status of the job
  job_name="fine-tuning-job", # Job Name. Required
  job_id="unique_id" # Unique Job ID. Required
)
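
For long-running jobs, a simple polling loop can wait for a terminal state. The sketch below assumes the returned status is one of the plain string values listed above; the one-minute polling interval is an arbitrary choice:

import time

from tensorkube import get_job_status

# Poll until the job reaches a terminal state (COMPLETED or FAILED)
while True:
    status = get_job_status(job_name="fine-tuning-job", job_id="unique_id")
    print(f"Current status: {status}")
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(60)  # check once a minute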

Once the job is completed, the adapter is uploaded to S3. In your S3 console, you can locate your adapters as follows:

  • Find the S3 bucket with the prefix tensorkube-train-bucket. All of your trained LoRA adapters reside here. The adapter path is constructed from your job name and job ID, so your adapter URLs look like this: s3://<bucket-name>/lora-adapter/<job_name>/<job_id>

Below is an example of a training adapter URL with job_name fine-tuning-job and job_id unique_id, trained on 4 GPUs of type L40S:

s3://tensorkube-train-bucket-d473253e-d692-4a15/lora-adapter/fine-tuning-job/unique_id
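
If you prefer to locate the adapter programmatically instead of through the console, a minimal sketch using boto3 is shown below; the bucket name is a placeholder taken from the example above, and it assumes your AWS credentials can read the training bucket:

import boto3

# Placeholder bucket name; look up your actual bucket with prefix "tensorkube-train-bucket"
BUCKET = "tensorkube-train-bucket-d473253e-d692-4a15"

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=BUCKET, Prefix="lora-adapter/fine-tuning-job/")

# Print the S3 URI of every object under the adapter prefix
for obj in response.get("Contents", []):
    print(f"s3://{BUCKET}/{obj['Key']}")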

Model Deployment

  1. Clone the Lorax repository:
git clone https://github.com/tensorfuse/lorax
cd lorax/llama-70b
  2. Use the following command to deploy:
tensorkube deploy --gpus 4 --gpu-type L40S --secret hugging-face-secret --secret aws-secret

This will deploy the base model with the lorax library.

  3. Get your deployment URL using tensorkube list deployments.

Inference

You can now use the deployment URL to make inference requests. Here is an example using curl. This will query the base model without any adapters.

curl ${ENDPOINT}/generate -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "[INST] Your prompt here [/INST]",
    "parameters": {
      "max_new_tokens": 64
    }
  }'

To run inference with your fine-tuned adapter, use the following command:

curl ${ENDPOINT}/generate -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "[INST] Your prompt here [/INST]",
    "parameters": {
      "max_new_tokens": 64,
      "adapter_id": "s3://your-bucket/lora-adapter/your-adapter-path",
      "adapter_source": "s3"
    }
  }'
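
The same request can also be made from Python. The sketch below mirrors the adapter curl call above using the requests library; the endpoint and adapter path are placeholders you should replace with your own deployment URL and S3 adapter URL:

import requests

# Deployment URL from `tensorkube list deployments` (placeholder)
ENDPOINT = "https://<your-deployment-url>"

payload = {
    "inputs": "[INST] Your prompt here [/INST]",
    "parameters": {
        "max_new_tokens": 64,
        # Adapter produced by the fine-tuning job; replace with your own S3 path
        "adapter_id": "s3://your-bucket/lora-adapter/your-adapter-path",
        "adapter_source": "s3",
    },
}

response = requests.post(f"{ENDPOINT}/generate", json=payload)
print(response.json())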