Tensorfuse supports Axolotl-style declarative configs for fine-tuning models on your AWS account. This guide explains how to fine-tune models; in this example, we will train a LoRA adapter for the Llama-3.1-8B model on a SQL dataset.

This guide is intended for users who want to perform one-off training runs and experiment with different hyperparameters. If you are looking to deploy a production-ready model, please refer to the programmatic access guide here. We only support highly tested configurations for programmatic access. If you would like us to add a new configuration, please reach out to us at [email protected].

Fine-tuning involves three steps:

  1. Dataset Preparation: Prepare a dataset in your S3 bucket. Ensure the dataset is accessible to the IAM user who created the TensorKube stack.
  2. Create a Hugging Face secret: Use your Hugging Face token, ensuring it has access to the model you want to fine-tune.
  3. Prepare a config.yaml file: Define training parameters. Refer to the Axolotl documentation for a list of supported parameters.

Using untested Axolotl configurations can lead to out-of-memory (OOM) errors and other compatibility issues. Use the log commands described below to debug any issues. If you need support, contact us at [email protected]. Well-tested configurations are available in the programmatic access guide.

Dataset Preparation

We will use a SQL dataset for this guide. The dataset follows the ChatML format; other supported dataset formats can be found here. Each dataset should be a JSONL file. Below is an example datapoint:

{
    "messages": [
        {
            "role": "system",
            "content": "You are an SQL expert that helps convert natural language queries into SQL statements."
        },
        {
            "role": "user",
            "content": "Show me all employees from New York. There is an employees table with the following columns: id, name, city."
        },
        {
            "role": "assistant",
            "content": "SELECT * FROM employees WHERE city = 'New York';"
        }
    ]
}

Upload this dataset to your S3 bucket and note the object path. In this example it is s3://testing-prod-123456789012/cli_sql_dataset.jsonl.
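
If you are assembling the dataset programmatically, the following is a minimal sketch (the file name, script name, and single datapoint are illustrative) that writes the JSONL file and uploads it to the bucket above with boto3. It assumes your local AWS credentials belong to the IAM user that created the TensorKube stack.

prepare_dataset.py
import json

import boto3

# Replace with your own training examples; each item becomes one JSONL line.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are an SQL expert that helps convert natural language queries into SQL statements."},
            {"role": "user", "content": "Show me all employees from New York. There is an employees table with the following columns: id, name, city."},
            {"role": "assistant", "content": "SELECT * FROM employees WHERE city = 'New York';"},
        ]
    },
]

# Write one JSON object per line (JSONL).
with open("cli_sql_dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload to the bucket used in this guide; substitute your own bucket name.
boto3.client("s3").upload_file(
    "cli_sql_dataset.jsonl", "testing-prod-123456789012", "cli_sql_dataset.jsonl"
)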

Create a Hugging Face secret

Use your Hugging Face token to create a secret. Ensure the token provides access to the model you wish to fine-tune.

tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=<TOKEN_HERE>

Verify that the secret was created successfully by running tensorkube list secrets.

(.venv) ➜  training_cli.py tensorkube list secrets
Secrets
├── Name: amaz-secret
└── Name: hugging-face-secret
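
If training later fails to download the base model, you can quickly check locally whether the token actually grants access to the gated Llama repository. A minimal sketch, assuming the huggingface_hub package is installed (the script name is illustrative):

check_token.py
from huggingface_hub import HfApi

# Use the same token you stored in the secret.
api = HfApi(token="<TOKEN_HERE>")

# Raises an error (e.g. a gated-repo / HTTP error) if the token cannot access the model.
info = api.model_info("meta-llama/Llama-3.1-8B-Instruct")
print(f"Token can access {info.id}")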

Prepare a config.yaml file

Here is an example configuration file for fine-tuning a LoRA adapter on the Llama-3.1-8B model. The configuration below is tested to work on a single A10G GPU. If you want to run it on different hardware, experiment with micro_batch_size and gradient_accumulation_steps (see the sketch after the config for how they combine into the effective batch size).

config.yaml
base_model: meta-llama/Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
chat_template: chatml
datasets:
  - path: s3://testing-prod-123456789012/cli_sql_dataset.jsonl
    type: chat_template
    field_messages: messages
    message_field_role: role
    message_field_content: content
    field_system: system
    field_human: user
    field_model: assistant

load_in_8bit: true
load_in_4bit: false
strict: false
val_set_size: 0
output_dir: ./outputs/out/lora-llama3-8b

adapter: lora
lora_model_dir:

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: false

lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
lora_target_linear: false
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 16
num_epochs: 10
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 0
eval_table_size:
saves_per_epoch: 1
debug: false
deepspeed:
weight_decay: 0.001
special_tokens:
  pad_token: <|end_of_text|>
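
When adjusting those two settings for other GPUs, a common approach is to shrink micro_batch_size until the model fits in memory and raise gradient_accumulation_steps to keep the effective batch size roughly constant. A minimal sanity-check sketch, assuming PyYAML is installed and config.yaml is in the current directory (the script name is illustrative):

check_batch.py
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

num_gpus = 1  # matches --gpus 1 in the train command below
effective_batch = cfg["micro_batch_size"] * cfg["gradient_accumulation_steps"] * num_gpus
print(f"Effective batch size: {effective_batch}")  # 16 * 4 * 1 = 64 for this config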

Fine-tuning

Use the tensorkube train create command to initiate the fine-tuning process. The following options are supported:

Usage: tensorkube train create [OPTIONS]

Options:
  --gpus INTEGER                  Number of GPUs needed for the training.
  --gpu-type [V100|A10G|T4|L4|L40S]
                                  Type of GPU.
  --env TEXT                      Environment to deploy the training to.
  --job-id TEXT                   Unique job id for the training job.
  --axolotl                       Run the axolotl training job. Necessary for axolotl training.
  --config-path TEXT              Path to the config.yaml file.
  --secret TEXT                   Secret to use for the deployment.
  --help                          Show this message and exit.

To start fine-tuning, run the following command:

tensorkube train create --gpus 1 --gpu-type A10G --config-path ./config.yaml --job-id llama-3-8b-sql --secret hugging-face-secret --axolotl

This command initiates fine-tuning on a single A10G GPU. Monitor the logs to ensure the training runs successfully.
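
Because this workflow is aimed at one-off experiments, you may want to script several runs with different hyperparameters. The following is a minimal sketch (the job ids, file names, and learning-rate sweep are illustrative, not a documented workflow) that writes variant configs and launches one job per learning rate through the same CLI flags shown above:

sweep.py
import subprocess

import yaml

with open("config.yaml") as f:
    base_cfg = yaml.safe_load(f)

for i, lr in enumerate([1e-5, 2e-5, 5e-5]):
    # Override a single hyperparameter and write a variant config.
    cfg = dict(base_cfg, learning_rate=lr)
    cfg_path = f"config-lr-{i}.yaml"
    with open(cfg_path, "w") as f:
        yaml.safe_dump(cfg, f)

    # Launch one training job per variant with a unique job id.
    subprocess.run(
        [
            "tensorkube", "train", "create",
            "--gpus", "1",
            "--gpu-type", "A10G",
            "--config-path", cfg_path,
            "--job-id", f"llama-3-8b-sql-lr{i}",
            "--secret", "hugging-face-secret",
            "--axolotl",
        ],
        check=True,
    )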

Checking status

You can run tensorkube train list to check the status of the training job.


 tensorkube train list
                                    Tensorkube Jobs
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Job Id            ┃ Status    ┃ Start Time          ┃ Completion Time     ┃ Env     ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ llama-3-8b-sql-16 │ Succeeded │ 2024-11-23 06:48:13 │ 2024-11-23 07:03:22 │ default │
└───────────────────┴───────────┴─────────────────────┴─────────────────────┴─────────┘

llama-3-8b-sql-16
├── Namespace: default
├── Status: Succeeded
├── Start Time: 2024-11-23 06:48:13
├── Completion Time: 2024-11-23 07:03:22
└── Conditions
    └── Complete: True
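
If you want to block until the run finishes, a simple polling sketch like the one below works against the plain-text output above. It assumes the status appears on the same line as the job id and that a failed run is reported as Failed; the script name is illustrative.

wait_for_job.py
import subprocess
import time

job_id = "llama-3-8b-sql"
while True:
    out = subprocess.run(
        ["tensorkube", "train", "list"], capture_output=True, text=True, check=True
    ).stdout
    # Find the table row for this job, if it is listed yet.
    line = next((l for l in out.splitlines() if job_id in l), "")
    if "Succeeded" in line or "Failed" in line:
        print(line.strip())
        break
    time.sleep(60)  # poll once a minute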

Checking logs

You can check the logs of the training job by running tensorkube train logs --job-id <JOB_ID>. The following options are supported:

tensorkube train logs --help
Usage: tensorkube train logs [OPTIONS]

Options:
  --job-id TEXT  Unique job id for the training job.  [required]
  --env TEXT     Environment to deploy the service to.
  --help         Show this message and exit.

For example, to check the logs of the training job created above:

tensorkube train logs --job-id llama-3-8b-sql

[2024-11-23 06:58:31,864] [INFO] [axolotl.train.train:141] [PID:23] [RANK:0] Pre-saving adapter config to ./outputs/out/lora-llama3-8b
[2024-11-23 06:58:32,037] [INFO] [axolotl.train.train:178] [PID:23] [RANK:0] Starting trainer...
  0%|          | 0/30 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py:316: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
{'loss': 0.2861, 'grad_norm': 1.6527355909347534, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.31}
[2024-11-23 06:58:53,534] [INFO] [axolotl.callbacks.on_step_end:128] [PID:23] [RANK:0] GPU memory usage while training: 8.534GB (+4.498GB cache, +0.825GB misc)
{'loss': 0.2548, 'grad_norm': 1.6560983657836914, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.62}
{'loss': 0.4286, 'grad_norm': 2.8043346405029297, 'learning_rate': 8.000000000000001e-06, 'epoch': 1.23}
{'loss': 0.2554, 'grad_norm': 1.5236539840698242, 'learning_rate': 1e-05, 'epoch': 1.54}
{'loss': 0.2523, 'grad_norm': 1.502539038658142, 'learning_rate': 1.2e-05, 'epoch': 1.85}
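
If you want to track the loss curve, the per-step metrics are printed as Python-style dicts (the {'loss': ...} lines above). A small sketch that extracts them, assuming you first saved the logs to a file, e.g. with tensorkube train logs --job-id llama-3-8b-sql > train.log (the script and file names are illustrative):

parse_loss.py
import ast

losses = []
with open("train.log") as f:
    for line in f:
        line = line.strip()
        # Metric lines look like: {'loss': 0.2861, 'grad_norm': ..., 'epoch': 0.31}
        if line.startswith("{'loss'"):
            losses.append(ast.literal_eval(line)["loss"])

print(losses)  # e.g. [0.2861, 0.2548, 0.4286, ...]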

Checking the adapter

Look for the tensorkube-train-bucket-<unique_id> bucket in your S3 console. All your trained LoRA adapters reside in this bucket. The adapter ID is constructed from your job-id; your adapter URL will look like this:

s3://tensorkube-train-bucket-d4232bakhb23e-d692-4a15/lora-adapter/ax-llama-3-8b-sql-16

The last portion of the URL is essentially ax-{job-id}. You can download the adapter from the S3 console and use it for inference, or serve it with LoRAX.
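
If you prefer to pull the adapter down programmatically and try it locally, here is a minimal sketch using boto3, transformers, and peft. The bucket name and prefix below are the example values from this guide; replace them with your own, and note that loading the base model requires a Hugging Face token with access to the gated repository.

load_adapter.py
import os

import boto3
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

bucket = "tensorkube-train-bucket-d4232bakhb23e-d692-4a15"  # your train bucket
prefix = "lora-adapter/ax-llama-3-8b-sql-16"                # ax-{job-id}
local_dir = "./lora-adapter"

# Download every object under the adapter prefix.
s3 = boto3.client("s3")
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):
            continue
        dest = os.path.join(local_dir, os.path.relpath(obj["Key"], prefix))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(bucket, obj["Key"], dest)

# Attach the adapter to the base model for local inference.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", torch_dtype="auto")
model = PeftModel.from_pretrained(base, local_dir)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")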