Axolotl Finetuning with TensorFuse Job Queues

Run powerful and flexible Axolotl finetuning jobs on TensorFuse with support for multiple dataset formats, models, and configurations. Ideal for supporting multiple customers and use cases, running parameter sweeps, and comparing models.

What You Can Do

  • Multiple Dataset Formats: JSONL, CSV, Parquet, HuggingFace datasets
  • Choose any Base Model: Llama, Qwen, Mistral, CodeLlama, and more
  • Flexible Chat Formats: Instructions, conversations, chat templates
  • Parameter Sweeps: Test different hyperparameters automatically
  • Model Comparisons: Compare different models on the same data
  • Automatic Uploads: Models uploaded to HuggingFace Hub
  • Training Monitoring: Full Weights & Biases integration
  • Queue Management: Run multiple experiments in parallel

Prerequisites

  1. TensorFuse Setup: Ensure your cluster is configured (see Getting Started)
  2. Set up Secrets:
    # HuggingFace token (required - ensure it has WRITE access)
    tensorkube secret create hugging-face-secret \
      HUGGING_FACE_HUB_TOKEN=hf_your_token_here --env keda
    
    # Weights & Biases token (recommended)
    tensorkube secret create wb-secret \
      WANDB_API_KEY=your_wandb_key_here --env keda
    

Quick Start

1. Set Up Your Project

Create a new directory for your finetuning project:
mkdir my-axolotl-finetuning
cd my-axolotl-finetuning

2. Create Base Configuration

Create axolotl-config.yaml:
base_model: Qwen/Qwen3-8B
# Automatically upload checkpoint and final model to HF
hub_model_id: your-username/my-finetuned-model

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
strict: false

chat_template: qwen3
datasets:
  - path: NousResearch/hermes-function-calling-v1
    name: func_calling_singleturn
    type: chat_template
    split: train[:80%]
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    roles:
      assistant:
        - gpt
        - assistant
      user:
        - human
        - user
      system:
        - system
    roles_to_train: [assistant]
val_set_size: 0.05
output_dir: ./outputs/out
dataset_prepared_path: last_run_prepared

sequence_len: 2048
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - down_proj
  - up_proj
lora_mlp_kernel: true
lora_qkv_kernel: true
lora_o_kernel: true

wandb_project: qwen-finetuning
wandb_entity: your-wandb-username
wandb_watch:
wandb_name: my-experiment
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 8
num_epochs: 3
optimizer: adamw_torch_4bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: true

gradient_checkpointing: offload
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
special_tokens:

3. Create Training Script

Create axolotl-train.py:
import os
import sys
import json
import subprocess
import yaml
from tensorkube import get_queued_message

def recursive_update(original, override):
    """Recursively update configuration with new values"""
    for key, value in override.items():
        if (
            key in original
            and isinstance(original[key], dict)
            and isinstance(value, dict)
        ):
            original[key] = recursive_update(original[key], value)
        else:
            original[key] = value
    return original

def run_finetuning(message_json):
    print("Starting Axolotl training job")

    # Load base configuration
    with open("axolotl-config.yaml", "r") as f:
        yaml_config = yaml.safe_load(f)

    # Parse message from job queue
    if isinstance(message_json, str):
        json_message = json.loads(message_json)
    else:
        json_message = message_json

    # Override base config with job-specific settings
    updated_config = recursive_update(yaml_config, json_message)

    # Save updated configuration
    with open("axolotl-config.yaml", "w") as f:
        yaml.safe_dump(updated_config, f, sort_keys=False)

    # Run Axolotl training
    result = subprocess.run(
        ['accelerate', 'launch', '-m', 'axolotl.cli.train', 'axolotl-config.yaml'],
        check=True,
        stdout=sys.stdout,
        stderr=sys.stderr,
        env=os.environ
    )
    print("Training completed successfully!")

if __name__ == "__main__":
    message = get_queued_message()
    run_finetuning(message)

4. Create Dockerfile

Create Dockerfile:
FROM axolotlai/axolotl:0.11.0

RUN pip install tensorkube pyyaml 

COPY axolotl-train.py .

COPY axolotl-config.yaml .

CMD ["python", "axolotl-train.py"]

5. Deploy and Run

# Deploy the job
tensorkube job deploy \
  --name my-axolotl-finetuning \
  --gpus 1 \
  --gpu-type l40s \
  --secret hugging-face-secret \
  --secret wb-secret

# Queue your first training run
tensorkube job queue \
  --job-name my-axolotl-finetuning \
  --job-id experiment-1 \
  --payload '{
    "hub_model_id": "your-username/qwen-function-calling",
    "wandb_name": "qwen-function-calling-v1"
  }'
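Each field in the payload overrides the matching key in axolotl-config.yaml through the recursive_update helper in axolotl-train.py. A minimal in-memory sketch of that merge (illustrative values only):
# Sketch: how a queued payload is merged into the base config
base_config = {
    "base_model": "Qwen/Qwen3-8B",
    "hub_model_id": "your-username/my-finetuned-model",
    "wandb_name": "my-experiment",
    "learning_rate": 0.0002,
}
payload = {
    "hub_model_id": "your-username/qwen-function-calling",
    "wandb_name": "qwen-function-calling-v1",
}

def recursive_update(original, override):
    """Recursively merge override into original, preferring override values."""
    for key, value in override.items():
        if key in original and isinstance(original[key], dict) and isinstance(value, dict):
            original[key] = recursive_update(original[key], value)
        else:
            original[key] = value
    return original

merged = recursive_update(base_config, payload)
print(merged["hub_model_id"])   # your-username/qwen-function-calling
print(merged["learning_rate"])  # 0.0002 (keys missing from the payload keep their base values)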

Dataset Formats

1. Chat/Conversation Format (JSONL)

Perfect for chatbots and conversational AI:
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is Python?"}, {"role": "assistant", "content": "Python is a programming language."}]}
{"messages": [{"role": "user", "content": "How do I write a loop?"}, {"role": "assistant", "content": "You can use a for loop: for i in range(10):"}]}
Configuration:
datasets:
  - path: chat-data.jsonl
    type: chat_template
    field_messages: messages
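If you are generating chat-data.jsonl yourself, here is a minimal sketch that writes the format shown above (one JSON object per line with a messages field):
import json

# Each line is one training conversation in the "messages" format shown above.
conversations = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
        {"role": "assistant", "content": "Python is a programming language."},
    ],
]

with open("chat-data.jsonl", "w") as f:
    for messages in conversations:
        f.write(json.dumps({"messages": messages}) + "\n")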

2. Instruction Format (CSV/Parquet/JSONL)

Great for instruction-following models. CSV format (instructions.csv):
instruction,input,output
"Explain what Python is","","Python is a high-level programming language known for its simplicity and readability."
"Write a function to add two numbers","","def add(a, b): return a + b"
Configuration:
datasets:
  - path: instructions.csv
    type: instruction
    field_instruction: instruction
    field_input: input
    field_output: output
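To generate or sanity-check instructions.csv, here is a minimal sketch using the standard csv module (column names match the configuration above):
import csv

rows = [
    {"instruction": "Explain what Python is", "input": "",
     "output": "Python is a high-level programming language known for its simplicity and readability."},
    {"instruction": "Write a function to add two numbers", "input": "",
     "output": "def add(a, b): return a + b"},
]

with open("instructions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["instruction", "input", "output"])
    writer.writeheader()
    writer.writerows(rows)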

3. HuggingFace Datasets

Use any dataset from HuggingFace Hub:
datasets:
  - path: HuggingFaceH4/no_robots
    type: chat_template
    split: train[:1000]
    field_messages: messages
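Before queuing a job, you can preview the dataset locally with the datasets library to confirm the split and field names (the snippet assumes the dataset exposes a messages field, as configured above):
from datasets import load_dataset

# Pull the same slice the config requests and inspect one example.
dataset = load_dataset("HuggingFaceH4/no_robots", split="train[:1000]")
print(dataset)
print(dataset[0]["messages"][:2])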

4. Custom Message Mapping

For datasets with different field names:
datasets:
  - path: custom-conversations.jsonl
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    roles:
      assistant: [gpt, bot, assistant]
      user: [human, user]
      system: [system]
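Conceptually, message_property_mappings and roles tell Axolotl how to translate each record into standard role/content messages. A rough, illustrative Python equivalent of that translation:
# Illustrative only: mirrors what the mapping above does inside Axolotl.
role_map = {"gpt": "assistant", "bot": "assistant", "assistant": "assistant",
            "human": "user", "user": "user", "system": "system"}

record = {"conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What is Python?"},
    {"from": "gpt", "value": "Python is a programming language."},
]}

messages = [
    {"role": role_map[turn["from"]], "content": turn["value"]}
    for turn in record["conversations"]
]
print(messages)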

Configurable Parameters

Model Configuration

{
  "base_model": "meta-llama/Llama-3.1-8B-Instruct",
  "hub_model_id": "your-org/custom-model",
  "chat_template": "llama3"
}
Popular Models & Templates:
  • Llama 3.1: meta-llama/Llama-3.1-8B-Instruct + llama3
  • Qwen: Qwen/Qwen2.5-7B-Instruct + qwen2_5
  • Mistral: mistralai/Mistral-7B-Instruct-v0.3 + mistral
  • CodeLlama: codellama/CodeLlama-7b-Instruct-hf + llama3

Training Hyperparameters

{
  "sequence_len": 4096,
  "micro_batch_size": 4,
  "num_epochs": 3,
  "learning_rate": 0.0001,
  "lr_scheduler": "cosine",
  "warmup_steps": 100
}

LoRA Configuration

{
  "adapter": "qlora",
  "lora_r": 32,
  "lora_alpha": 64,
  "lora_dropout": 0.05,
  "load_in_4bit": true
}

Monitoring & Logging

{
  "wandb_project": "my-experiments",
  "wandb_entity": "my-team",
  "wandb_name": "experiment-v1",
  "logging_steps": 10
}

Example Use Cases

Function Calling Assistant

Train models to perform structured function calling for tool usage and API integration:
tensorkube job queue \
  --job-name function-calling-job \
  --job-id hermes-qwen \
  --payload '{
    "base_model": "Qwen/Qwen3-8B",
    "hub_model_id": "your-org/qwen-function-calling",
    "datasets": [
      {
        "path": "NousResearch/hermes-function-calling-v1",
        "name": "func_calling_singleturn",
        "type": "chat_template",
        "split": "train[:80%]",
        "field_messages": "conversations",
        "message_property_mappings": {
          "role": "from",
          "content": "value"
        }
      }
    ],
    "sequence_len": 2048,
    "wandb_name": "hermes-function-calling-v1"
  }'

Code Assistant

tensorkube job queue \
  --job-name my-axolotl-finetuning \
  --job-id code-assistant \
  --payload '{
    "base_model": "codellama/CodeLlama-7b-Instruct-hf",
    "hub_model_id": "customer-a/code-assistant",
    "datasets": [
      {
        "path": "code-instructions.csv",
        "type": "instruction",
        "field_instruction": "problem",
        "field_output": "solution"
      }
    ],
    "sequence_len": 4096,
    "wandb_name": "code-assistant-v1"
  }'
Note: For local datasets, update your Dockerfile to copy the dataset file:
FROM axolotlai/axolotl:0.11.0

RUN pip install tensorkube pyyaml 

COPY axolotl-train.py .
COPY axolotl-config.yaml .
# Add your dataset file
COPY code-instructions.csv .

CMD ["python", "axolotl-train.py"]

Multilingual Chat

tensorkube job queue \
  --job-name my-axolotl-finetuning \
  --job-id multilingual-chat \
  --payload '{
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "hub_model_id": "customer-b/multilingual-chat",
    "datasets": [
      {
        "path": "multilingual-conversations.jsonl",
        "type": "chat_template"
      }
    ],
    "num_epochs": 5,
    "learning_rate": 0.00008,
    "wandb_name": "multilingual-chat-v1"
  }'
Note: Add the local dataset to your Dockerfile: COPY multilingual-conversations.jsonl .

Configuration Templates

Ready-to-use YAML configurations for common finetuning scenarios. Copy these as starting points and modify for your specific needs.

Template 1: Chat Assistant

Optimized for conversational AI with balanced performance and memory usage:
base_model: meta-llama/Llama-3.1-8B-Instruct
hub_model_id: your-org/chat-assistant
chat_template: llama3
datasets:
  - path: chat-data.jsonl
    type: chat_template
sequence_len: 2048
micro_batch_size: 8
num_epochs: 3
learning_rate: 0.0002
adapter: qlora
lora_r: 16
lora_alpha: 32

Template 2: Code Generator

Configured for code generation tasks with longer context and specialized model:
base_model: codellama/CodeLlama-7b-Instruct-hf
hub_model_id: your-org/code-generator
chat_template: llama3
datasets:
  - path: code-instructions.csv
    type: instruction
    field_instruction: problem
    field_output: solution
sequence_len: 4096
micro_batch_size: 4
num_epochs: 2
learning_rate: 0.0003

Template 3: Instruction Follower

Designed for general instruction-following with efficient LoRA settings:
base_model: Qwen/Qwen2.5-7B-Instruct
hub_model_id: your-org/instruction-follower
chat_template: qwen2_5
datasets:
  - path: instructions.parquet
    type: instruction
sequence_len: 2048
micro_batch_size: 8
num_epochs: 3
learning_rate: 0.0001
lora_r: 32
lora_alpha: 64

Job Management

Efficiently manage your training jobs, monitor progress, and handle multiple experiments running simultaneously.

Monitor Your Jobs

# List all jobs and their current status
tensorkube job list

# View real-time logs from a specific job
tensorkube job logs --job-name my-axolotl-finetuning

Batch Operations

Run multiple experiments programmatically for parameter sweeps and comparisons:
# Queue multiple experiments with different learning rates
experiments=(
  '{"learning_rate": 0.0001, "wandb_name": "exp-1"}'
  '{"learning_rate": 0.0002, "wandb_name": "exp-2"}'
  '{"learning_rate": 0.0005, "wandb_name": "exp-3"}'
)

for i in "${!experiments[@]}"; do
  tensorkube job queue \
    --job-name my-axolotl-finetuning \
    --job-id "batch-exp-$i" \
    --payload "${experiments[$i]}"
done
This approach lets you:
  • Automate experiments: No manual job queuing
  • Compare results: All experiments tracked in W&B
  • Save time: Queue multiple jobs at once

Parameter Sweeps

Learning Rate Sweep

#!/bin/bash
learning_rates=(0.00005 0.0001 0.0002 0.0005)

for lr in "${learning_rates[@]}"; do
  tensorkube job queue \
    --job-name my-axolotl-finetuning \
    --job-id "lr-sweep-${lr}" \
    --payload "{
      \"learning_rate\": $lr,
      \"wandb_name\": \"lr-sweep-${lr}\",
      \"hub_model_id\": \"experiments/lr-${lr}\"
    }"
done

LoRA Rank Comparison

#!/bin/bash
lora_ranks=(8 16 32 64)

for rank in "${lora_ranks[@]}"; do
  alpha=$((rank * 2))  # Common practice: alpha = 2 * rank
  tensorkube job queue \
    --job-name my-axolotl-finetuning \
    --job-id "lora-rank-${rank}" \
    --payload "{
      \"lora_r\": $rank,
      \"lora_alpha\": $alpha,
      \"wandb_name\": \"lora-rank-${rank}\",
      \"hub_model_id\": \"experiments/lora-r${rank}\"
    }"
done

Model Comparison

#!/bin/bash
# Compare different models on the same task
models=("meta-llama/Llama-3.1-8B-Instruct" "Qwen/Qwen2.5-7B-Instruct" "mistralai/Mistral-7B-Instruct-v0.3")
templates=("llama3" "qwen2_5" "mistral")

for i in "${!models[@]}"; do
  model="${models[$i]}"
  template="${templates[$i]}"
  model_name=$(echo "$model" | cut -d'/' -f2 | tr '[:upper:]' '[:lower:]')
  
  tensorkube job queue \
    --job-name my-axolotl-finetuning \
    --job-id "model-comparison-${model_name}" \
    --payload "{
      \"base_model\": \"$model\",
      \"chat_template\": \"$template\",
      \"wandb_name\": \"comparison-${model_name}\",
      \"hub_model_id\": \"experiments/${model_name}-finetuned\"
    }"
done

Monitoring with Weights & Biases

Key Metrics to Watch

  1. Training Loss: Should decrease steadily
  2. Learning Rate: Should follow the configured schedule
  3. GPU Utilization: Should be consistently high
  4. Validation Loss: Check for overfitting

Advanced W&B Configuration

{
  "wandb_project": "qwen-functioncalling",
  "wandb_entity": "your-team",
  "wandb_name": "experiment-name",
  "wandb_watch": "all",
  "wandb_log_model": "checkpoint",
  "wandb_tags": ["qwen", "function-calling", "hermes"]
}

HuggingFace Integration

Automatic Model Upload

Models are automatically uploaded to HuggingFace Hub when training completes:
{
  "hub_model_id": "your-username/model-name",
  "hub_strategy": "end",
  "hub_private_repo": true
}

Upload Configuration Options

{
  "hub_model_id": "organization/model-name",
  "hub_strategy": "checkpoint",
  "hub_private_repo": false,
  "hub_always_push": true,
  "hub_model_revision": "main"
}

Advanced Features

Memory Optimization (for larger models)

{
  "load_in_4bit": true,
  "gradient_checkpointing": "offload",
  "flash_attention": true,
  "sample_packing": true,
  "pad_to_sequence_len": true
}
Memory Usage Breakdown:
  • load_in_4bit: Reduces model weights from 16-bit to 4-bit (e.g., 8GB model → 2GB)
  • gradient_checkpointing: Trades compute for memory (slower but fits larger models)
  • flash_attention: 2-8x faster attention with lower memory footprint
  • sample_packing: Better GPU utilization, especially with variable-length sequences
  • pad_to_sequence_len: Predictable memory usage, prevents OOM errors

Multi-Dataset Training

{
  "datasets": [
    {
      "path": "instructions.jsonl",
      "type": "instruction",
      "weight": 0.7
    },
    {
      "path": "conversations.jsonl",
      "type": "chat_template",
      "weight": 0.3
    }
  ]
}

Evaluation Configuration

{
  "val_set_size": 0.1,
  "evals_per_epoch": 4,
  "eval_sample_packing": false,
  "eval_max_new_tokens": 128
}

Model Evaluation Integration

After training, evaluate your models using the integrated evaluation pipeline:

1. Deploy Inference Server

First, deploy your base model with LoRA support. Create inference/deployment.yaml:
gpus: 1
gpu_type: l40s
secret:
  - hugging-face-secret
  - vllm-api-secret
port: 80
min_scale: 0
max_scale: 1
startup_probe:
  httpGet:
    path: /health
    port: 80
readiness:
  httpGet:
    path: /health
    port: 80
Create inference/Dockerfile:
FROM vllm/vllm-openai:v0.9.2

RUN pip install hf-xet huggingface_hub

ENV HF_XET_HIGH_PERFORMANCE=1
ENV VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
ENV VLLM_USE_V1=1

EXPOSE 80

ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server \
           --model Qwen/Qwen3-8B \
           --dtype bfloat16 \
           --max-model-len 32768 \
           --port 80 \
           --enable-lora \
           --api-key $VLLM_API_KEY
Deploy the inference server:
cd inference

# Create API key secret for inference server
tensorkube secret create vllm-api-secret VLLM_API_KEY=your_secure_api_key_here

tensorkube deploy --config-file deployment.yaml

2. Create Evaluation Script

Create evals/evaluation_script.py to benchmark your models:
import json
import asyncio
import requests
from openai import AsyncOpenAI
from datasets import load_dataset

API_KEY = 'your_secure_api_key_here'  # Same as VLLM_API_KEY
BASE_URL = '<YOUR_INFERENCE_URL>/v1'  # Get from tensorkube deployment

def load_lora_adapter(lora_path):
    """Load your trained LoRA adapter"""
    url = f"{BASE_URL.replace('/v1', '')}/v1/load_lora_adapter"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}"
    }
    data = {
        "lora_name": lora_path,
        "lora_path": lora_path
    }
    
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()
    print(f"Successfully loaded LoRA adapter: {lora_path}")

# Load evaluation dataset
dataset = load_dataset("NousResearch/hermes-function-calling-v1", "func_calling_singleturn")

client = AsyncOpenAI(api_key=API_KEY, base_url=BASE_URL)

async def evaluate_model(sample):
    # Extract conversation
    messages = []
    for conv in sample["conversations"]:
        if conv["from"] == "system":
            messages.append({"role": "system", "content": conv["value"]})
        elif conv["from"] == "human":
            messages.append({"role": "user", "content": conv["value"]})
    
    # Get model response
    response = await client.chat.completions.create(
        model="your-model-name",  # Use loaded LoRA name
        messages=messages,
        temperature=0.0,
        max_tokens=2000
    )
    
    return {
        "expected": sample["conversations"][-1]["value"],
        "actual": response.choices[0].message.content
    }

3. Run Evaluation

cd evals
python evaluation_script.py

# Choose option 1: Load LoRA Adapter
# Enter: your-org/your-finetuned-model

# Then run again and choose option 2: Run Evaluations
The evaluation will:
  • Load your trained LoRA adapter
  • Test on function calling tasks
  • Calculate accuracy metrics
  • Compare base vs fine-tuned performance
  • Save results to benchmark_results.json (see the scoring sketch below)
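The scoring step itself can be as simple as comparing expected and actual outputs. A minimal sketch of the accuracy calculation and results file described above (exact-match scoring is an assumption; substitute your own metric):
import json

def score(results):
    """results: list of {"expected": ..., "actual": ...} pairs from evaluate_model."""
    correct = sum(1 for r in results if r["actual"].strip() == r["expected"].strip())
    return {"total": len(results), "correct": correct,
            "accuracy": correct / len(results) if results else 0.0}

# Hypothetical results for illustration
results = [
    {"expected": "get_weather(city='Paris')", "actual": "get_weather(city='Paris')"},
    {"expected": "get_time(zone='UTC')", "actual": "get_time(zone='EST')"},
]

with open("benchmark_results.json", "w") as f:
    json.dump(score(results), f, indent=2)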

4. Monitor Results

Check Weights & Biases for:
  • Training curves: Loss progression during finetuning
  • Evaluation metrics: Function calling accuracy
  • Validation performance: Generalization capability
This complete pipeline lets you:
  1. Finetune models with job queues
  2. Deploy inference servers with LoRA support
  3. Load trained adapters dynamically
  4. Evaluate performance on function calling tasks
  5. Compare different model configurations

LoRA Adapter Loading for Inference

Your trained models are automatically uploaded to HuggingFace Hub and can be loaded into running inference servers without restart:

Loading Process

  1. Training completes → Model uploaded to HuggingFace Hub
  2. Inference server running → vLLM with --enable-lora flag
  3. Load adapter → Call /v1/load_lora_adapter endpoint
  4. Ready for inference → Use adapter name in chat completions

API Usage

# Load your trained adapter
curl -X POST "http://your-server/v1/load_lora_adapter" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_secure_api_key_here" \
  -d '{
    "lora_name": "my-function-calling-model",
    "lora_path": "your-org/my-function-calling-model"
  }'

# Use in chat completions
curl -X POST "http://your-server/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_secure_api_key_here" \
  -d '{
    "model": "my-function-calling-model",
    "messages": [{"role": "user", "content": "Call the weather API"}]
  }'
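The same two calls from Python, using the requests and openai clients against the vLLM server (URL, API key, and adapter names are placeholders):
import requests
from openai import OpenAI

BASE_URL = "http://your-server/v1"
API_KEY = "your_secure_api_key_here"  # same as VLLM_API_KEY

# Load the trained adapter into the running vLLM server
requests.post(
    f"{BASE_URL}/load_lora_adapter",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"lora_name": "my-function-calling-model",
          "lora_path": "your-org/my-function-calling-model"},
).raise_for_status()

# Use the adapter name as the model in chat completions
client = OpenAI(api_key=API_KEY, base_url=BASE_URL)
response = client.chat.completions.create(
    model="my-function-calling-model",
    messages=[{"role": "user", "content": "Call the weather API"}],
)
print(response.choices[0].message.content)
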
This enables:
  • Hot-swapping models without server restart
  • A/B testing different fine-tuned versions
  • Multi-tenant serving with customer-specific models
  • Rapid experimentation with new training runs

Troubleshooting

Out of Memory Issues

If you get OOM errors, reduce micro_batch_size and compensate with gradient_accumulation_steps (effective batch size = micro_batch_size × gradient_accumulation_steps). For example:
{
  "micro_batch_size": 1,
  "gradient_accumulation_steps": 8,
  "gradient_checkpointing": "offload",
  "load_in_4bit": true
}

Slow Training

Speed up training with:
{
  "flash_attention": true,
  "sample_packing": true,
  "tf32": true,
  "bf16": "auto"
}

Upload Failures

Check your HuggingFace token:
# Confirm the secret exists (the token itself must have write access)
tensorkube secret list --env keda
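You can also check the token itself locally; a minimal sketch with huggingface_hub (whoami confirms the token is valid and which account it belongs to; write access is granted when the token is created on huggingface.co):
from huggingface_hub import whoami

# Raises if the token is invalid; prints the account name otherwise.
info = whoami(token="hf_your_token_here")
print(info["name"])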

W&B Connection Issues

Verify your W&B setup:
# Check if secret exists
tensorkube secret list --env keda | grep wb-secret

Debug Mode

For troubleshooting, use these Axolotl debugging settings:
{
  "logging_steps": 1,
  "max_steps": 10,
  "save_steps": 5,
  "eval_steps": 5,
  "wandb_name": "debug-run"
}
Additional debugging options:
  • Set TRANSFORMERS_VERBOSITY=debug in environment
  • Use --debug flag with accelerate launch
  • Check logs with: tensorkube job logs --job-name your-job-name
  • For config-only testing: add "wandb_mode": "disabled" to skip W&B entirely