Fine-tuning Guide for Tensorfuse
This guide explains how to fine-tune Llama models using Tensorfuse’s QLoRA implementation.
Supported Models
| Model | GPU Requirements |
| --- | --- |
| Llama 3.1 70B | 4x L40S (Recommended) |
| Llama 3.1 8B | 1-2x A10G |
Dataset Preparation
Tensorfuse accepts datasets in JSONL format, where each line contains a valid JSON object.
The following example shows the format of a single record for a conversational dataset using the ChatML format (pretty-printed here for readability; in the actual JSONL file each record must occupy a single line):
{
"messages":
[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is the capital of France?"
},
{
"role": "assistant",
"content": "The capital of France is Paris."
}
]
}
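If you assemble the dataset programmatically, a minimal Python sketch like the one below writes one record per line in the required JSONL format (the record contents are just the illustrative example from above):
import json

# Illustrative records in the ChatML-style format shown above.
records = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "The capital of France is Paris."},
        ]
    },
]

# Each record must occupy exactly one line of the JSONL file.
with open("data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")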
Dataset Commands
# Create dataset
tensorkube datasets create --dataset-id my_dataset --path data.jsonl
# List datasets
tensorkube datasets list
# Delete dataset
tensorkube datasets delete --dataset-id my_dataset
Once you have created your dataset, you can start fine-tuning your model. Before that, however, you need to create an authentication token on Hugging Face.
Authenticating Huggingface and W&B
Create the required secrets. Tensorkube uses Kubernetes Event Driven Autoscaling (KEDA) under the hood to scale and schedule training runs, so you need to create your secrets in the keda environment:
Access to Llama 3.1
Llama 3.1 requires a license agreement. Visit the Llama 3.1 Hugging Face repo to ensure that you have signed the agreement and have access to the model.
Set your Hugging Face token
Get a WRITE token from your Hugging Face profile and store it as a secret in Tensorfuse using the command below:
tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=hf_EkXXrzzZsuoZubXhDQ --env keda
Ensure that the key for your secret is HUGGING_FACE_HUB_TOKEN, as Tensorfuse assumes the same. If you don't wish to upload your models to your Hugging Face account, you can use a READ token instead.
Set your W&B authentication token
Weights & Biases (W&B) is used for logging and monitoring training runs. You need to create a W&B account and get an API key, then store it as a secret in Tensorfuse using the command below:
tensorkube secret create wb-secret WANDB_API_KEY=7xxxxxxxxxxx4 --env keda
Programmatic Access
Tensorfuse allows you to interact with the TensorKube cluster using the Python SDK, which provides a straightforward interface for creating fine-tuning jobs.
Authentication
First, you need to create access keys, which are required to authenticate with the TensorKube cluster deployed in your cloud.
You can skip this step if you are running the training runs from your local machine, as your default user will have sufficient permissions.
Run the following command:
tensorkube train create-user --name <user-name>
This will create a new user and provide you with access keys.
Next, export the AWS keys as environment variables where you will be running the Python code:
export AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<AWS_SECRET>
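Alternatively, if you prefer to set the credentials from within Python rather than the shell, a minimal sketch (the key values are placeholders) looks like this:
import os

# Placeholders: substitute the access keys returned by `tensorkube train create-user`.
os.environ["AWS_ACCESS_KEY_ID"] = "<AWS_ACCESS_KEY>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<AWS_SECRET>"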
The following code demonstrates how to create a fine-tuning job using the Python SDK. The create_fine_tuning_job call below fine-tunes a Llama 3.1 70B base model using L40S GPUs.
from tensorkube import create_fine_tuning_job
create_fine_tuning_job(  # creates a fine-tuning job
    job_name="fine-tuning-job",  # Job name. Required
    job_id="unique_id",  # Unique job ID. Required
    gpus=4,  # Number of GPUs. Required
    gpu_type="l40s",  # GPU type. Required
    max_scale=1,  # Maximum scale. Required
    base_model='meta-llama/Llama-3.1-70B-Instruct',  # Base model from Hugging Face. Required
    dataset='dataset-id',  # Dataset ID. Required
    epochs=10,  # Number of epochs. Required
    secrets=["hugging-face-secret", "wb-secret"],  # List of secrets
    micro_batch_size=8,  # Micro batch size. Optional, default is 8
    lora_r=4,  # LoRA rank. Optional, default is 4
    learning_rate=0.00002,  # Learning rate. Optional, default is 0.00002
    val_set_size=0.1,  # Validation set size. Optional, default is 0.1
    wandb_entity="ORG_NAME_HERE",  # W&B organisation / account name. Optional, default is None
    hf_org_id="ORG_ID_HERE"  # Hugging Face organisation ID. Optional, default is None
)
The create_fine_tuning_job function also accepts additional keyword arguments (**kwargs) that align with the Axolotl config schema. This means you can pass any Axolotl-supported training parameters directly into the function for fine-tuning customization. For example, if you want to set gradient_accumulation_steps, save_strategy, or lr_scheduler_type, you can include them as additional arguments:
from tensorkube import create_fine_tuning_job
create_fine_tuning_job(
    job_name="fine-tuning-job",
    job_id="unique_id",
    gpus=4,
    gpu_type="l40s",
    max_scale=1,
    base_model='meta-llama/Llama-3.1-70B-Instruct',
    dataset='dataset-id',  # Dataset ID. Required
    epochs=10,  # Number of epochs. Required
    secrets=["hugging-face-secret", "wb-secret"],  # List of secrets
    micro_batch_size=8,  # Micro batch size. Optional, default is 8
    lora_r=4,  # LoRA rank. Optional, default is 4
    learning_rate=0.00002,  # Learning rate. Optional, default is 0.00002
    val_set_size=0.1,  # Validation set size. Optional, default is 0.1
    wandb_entity="ORG_NAME_HERE",  # W&B organisation / account name. Optional, default is None
    hf_org_id="ORG_ID_HERE",  # Hugging Face organisation ID. Optional, default is None
    gradient_accumulation_steps=2,  # From Axolotl config
    peft_use_rslora=True,  # From Axolotl config
    save_strategy="epoch",  # Save the model after each epoch (Axolotl config)
    lr_scheduler_type="linear"  # Learning rate scheduler (Axolotl config)
)
The LoRA adapter weights stored in the S3 bucket are in float32 format. To store the adapter weights in bfloat16 format instead, set the store_weights_as_bf16 flag to True.
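As a sketch, assuming store_weights_as_bf16 is passed as a keyword argument to create_fine_tuning_job (the other arguments repeat the example above):
from tensorkube import create_fine_tuning_job

create_fine_tuning_job(
    job_name="fine-tuning-job",
    job_id="unique_id",
    gpus=4,
    gpu_type="l40s",
    max_scale=1,
    base_model='meta-llama/Llama-3.1-70B-Instruct',
    dataset='dataset-id',
    epochs=10,
    secrets=["hugging-face-secret", "wb-secret"],
    store_weights_as_bf16=True  # assumed keyword argument: store adapter weights in bfloat16 instead of float32
)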
To know the status of the job, you can use the get_job_status function. The function returns the status of the job as QUEUED, PROCESSING, COMPLETED, or FAILED.
from tensorkube import get_job_status
status = get_job_status(  # gets the status of the job
    job_name="fine-tuning-job",  # Job name. Required
    job_id="unique_id"  # Unique job ID. Required
)
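To block until a run finishes, you can poll get_job_status until it reports a terminal state. A minimal sketch, assuming the status is returned as one of the strings listed above (the polling interval is arbitrary):
import time
from tensorkube import get_job_status

# Poll until the job reaches a terminal state (COMPLETED or FAILED).
while True:
    status = get_job_status(job_name="fine-tuning-job", job_id="unique_id")
    print(f"Current status: {status}")
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(60)  # arbitrary polling interval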
Once the job is completed, the adapter is uploaded to S3. In your S3 console you can find your adapters as follows:
- Find the S3 bucket with the prefix tensorkube-keda-train-bucket. All your trained LoRA adapters reside here. The adapter path is constructed from your job_name and job_id, so your adapter URLs look like this:
s3://<bucket-name>/lora-adapter/<job_name>/<job_id>
Below is an example of a training adapter URL with job_name fine-tuning-job and job_id unique_id, trained on 4 GPUs of type l40s:
s3://tensorkube-keda-train-bucket-d473253e-d692-4a15/lora-adapter/fine-tuning-job/unique_id
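You can also fetch the adapter programmatically with any S3 client. A minimal boto3 sketch, assuming your AWS credentials have read access to the training bucket (the bucket name and prefix are the example values from above):
import os
import boto3

bucket = "tensorkube-keda-train-bucket-d473253e-d692-4a15"  # example bucket name from above
prefix = "lora-adapter/fine-tuning-job/unique_id"  # lora-adapter/<job_name>/<job_id>
local_dir = "adapter"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Download every object under the adapter prefix, preserving the key layout locally.
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        dest = os.path.join(local_dir, os.path.relpath(key, prefix))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(bucket, key, dest)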
Uploading the Adapter to Hugging Face
To automatically upload the adapter to Hugging Face, make sure that:
- You use a WRITE token as the HUGGING_FACE_HUB_TOKEN secret.
- You pass the hf_org_id parameter to the create_fine_tuning_job function.
Tensorfuse automatically uploads the adapter to Hugging Face once training is completed, in addition to uploading it to S3. Tensorfuse creates an adapter repo that follows the {HF_ORG_ID}/{job_name}_{job_id} format, so for the example above the adapter would be uploaded to {ORG_ID_HERE}/fine-tuning-job_unique_id. The repo is private by default.
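Because the repo is private by default, pulling the adapter requires a token with access to the organisation. A minimal huggingface_hub sketch, where the repo id is just the example naming pattern from above:
from huggingface_hub import snapshot_download

# Example repo id following the {HF_ORG_ID}/{job_name}_{job_id} pattern described above.
local_path = snapshot_download(
    repo_id="ORG_ID_HERE/fine-tuning-job_unique_id",
    token="hf_xxx"  # a token with read access to the private adapter repo
)
print(local_path)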
Model Deployment
- Clone Lorax repository:
git clone https://github.com/tensorfuse/vllm
cd vllm/llama_70b_lora
- Use the following command to deploy.
The deploy command below deploys the Lorax instance in the default environment. Make sure you have created the hugging-face-secret in the default environment; you can create a secret in the default environment by adding the --env default flag to the secret creation command.
tensorkube deploy --gpus 4 --gpu-type L40S --secret hugging-face-secret
This will deploy the base model with the vLLM server.
- Get your deployment URL using tensorkube deployment list.
Inference
You can now use the deployment URL to make inference requests. Here is an example using curl; this queries the base model without any adapters.
curl ${ENDPOINT}/generate -X POST \
-H 'Content-Type: application/json' \
-d '{
"inputs": "[INST] Your prompt here [/INST]",
"parameters": {
"max_new_tokens": 64
}
}'
To use the adapter, run the following command:
curl --request POST -v \
--url <endpoint>/v1/chat/completions \
--header 'Content-Type: application/json' \
--data '{
"model": "<ADAPTER_HF_ID>",
"messages": [
{
"role": "user",
"content": "hello"
}
]
}'
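The same adapter request can be issued from Python. A minimal sketch using the requests library, mirroring the curl call above (the endpoint and adapter id are placeholders):
import requests

ENDPOINT = "<endpoint>"  # your deployment URL from `tensorkube deployment list`
ADAPTER_HF_ID = "<ADAPTER_HF_ID>"  # Hugging Face repo id of the uploaded adapter

response = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    json={
        "model": ADAPTER_HF_ID,
        "messages": [{"role": "user", "content": "hello"}],
    },
    timeout=60,
)
print(response.json())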