Axolotl Finetuning with TensorFuse Job Queues
Run Axolotl finetuning jobs on TensorFuse job queues with support for multiple dataset formats, base models, and configurations. A single deployed job can serve different customers, use cases, parameter sweeps, and model comparisons, because each queued run supplies its own configuration overrides.
What You Can Do
- Multiple Dataset Formats: JSONL, CSV, Parquet, HuggingFace datasets
- Choose any Base Model: Llama, Qwen, Mistral, CodeLlama, and more
- Flexible Chat Formats: Instructions, conversations, chat templates
- Parameter Sweeps: Test different hyperparameters automatically
- Model Comparisons: Compare different models on the same data
- Automatic Uploads: Models uploaded to HuggingFace Hub
- Training Monitoring: Full Weights & Biases integration
- Queue Management: Run multiple experiments in parallel
Prerequisites
- TensorFuse Setup: Ensure your cluster is configured (see Getting Started)
- Set up Secrets:
# HuggingFace token (required - ensure it has WRITE access)
tensorkube secret create hugging-face-secret \
HUGGING_FACE_HUB_TOKEN=hf_your_token_here --env keda
# Weights & Biases token (recommended)
tensorkube secret create wb-secret \
WANDB_API_KEY=your_wandb_key_here --env keda
Quick Start
1. Set Up Your Project
Create a new directory for your finetuning project:
mkdir my-axolotl-finetuning
cd my-axolotl-finetuning
2. Create Base Configuration
Create axolotl-config.yaml:
base_model: Qwen/Qwen3-8B
# Automatically upload checkpoint and final model to HF
hub_model_id: your-username/my-finetuned-model
plugins:
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
strict: false
chat_template: qwen3
datasets:
  - path: NousResearch/hermes-function-calling-v1
    name: func_calling_singleturn
    type: chat_template
    split: train[:80%]
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    roles:
      assistant:
        - gpt
        - assistant
      user:
        - human
        - user
      system:
        - system
    roles_to_train: [assistant]
val_set_size: 0.05
output_dir: ./outputs/out
dataset_prepared_path: last_run_prepared
sequence_len: 2048
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- down_proj
- up_proj
lora_mlp_kernel: true
lora_qkv_kernel: true
lora_o_kernel: true
wandb_project: qwen-finetuning
wandb_entity: your-wandb-username
wandb_watch:
wandb_name: my-experiment
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 8
num_epochs: 3
optimizer: adamw_torch_4bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: auto
tf32: true
gradient_checkpointing: offload
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
special_tokens:
3. Create Training Script
Create axolotl-train.py:
import os
import sys
import json
import subprocess
import yaml
from tensorkube import get_queued_message


def recursive_update(original, override):
    """Recursively update configuration with new values"""
    for key, value in override.items():
        if (
            key in original
            and isinstance(original[key], dict)
            and isinstance(value, dict)
        ):
            original[key] = recursive_update(original[key], value)
        else:
            original[key] = value
    return original


def run_finetuning(message_json):
    print("Starting Axolotl training job")
    # Load base configuration
    with open("axolotl-config.yaml", "r") as f:
        yaml_config = yaml.safe_load(f)
    # Parse message from job queue
    if isinstance(message_json, str):
        json_message = json.loads(message_json)
    else:
        json_message = message_json
    # Override base config with job-specific settings
    updated_config = recursive_update(yaml_config, json_message)
    # Save updated configuration
    with open("axolotl-config.yaml", "w") as f:
        yaml.safe_dump(updated_config, f, sort_keys=False)
    # Run Axolotl training; check=True raises if the process exits non-zero
    result = subprocess.run(
        ['accelerate', 'launch', '-m', 'axolotl.cli.train', 'axolotl-config.yaml'],
        check=True,
        stdout=sys.stdout,
        stderr=sys.stderr,
        env=os.environ
    )
    print("Training completed successfully!")


if __name__ == "__main__":
    message = get_queued_message()
    run_finetuning(message)
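To make the override behaviour concrete, here is a small, self-contained sketch that re-declares the same recursive_update helper and merges a hypothetical payload into a trimmed-down stand-in for the base config. The config values and the extra_flag key are illustrative assumptions. Nested dictionaries are merged key by key, while lists such as datasets are replaced wholesale, so a payload that overrides datasets needs to carry every dataset field it relies on.

```python
import copy


def recursive_update(original, override):
    """Merge override into original in place; nested dicts merge, everything else is replaced."""
    for key, value in override.items():
        if key in original and isinstance(original[key], dict) and isinstance(value, dict):
            original[key] = recursive_update(original[key], value)
        else:
            original[key] = value
    return original


# Trimmed-down stand-in for axolotl-config.yaml (values are illustrative).
base_config = {
    "base_model": "Qwen/Qwen3-8B",
    "learning_rate": 0.0002,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "datasets": [{"path": "NousResearch/hermes-function-calling-v1", "type": "chat_template"}],
}

# Hypothetical job payload; "extra_flag" is a made-up key used only to show dict merging.
payload = {
    "learning_rate": 0.0001,
    "gradient_checkpointing_kwargs": {"extra_flag": True},
    "datasets": [{"path": "chat-data.jsonl", "type": "chat_template"}],
}

# Work on a copy so the base config stays untouched for comparison.
merged = recursive_update(copy.deepcopy(base_config), payload)
print(merged["learning_rate"])
print(merged["gradient_checkpointing_kwargs"])
print(merged["datasets"])
```

The scalar is overridden, the nested dictionary keeps use_reentrant and gains the new key, and the datasets list contains only the payload's entry.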
4. Create Dockerfile
Create Dockerfile:
FROM axolotlai/axolotl:0.11.0
RUN pip install tensorkube pyyaml
COPY axolotl-train.py .
COPY axolotl-config.yaml .
CMD ["python", "axolotl-train.py"]
5. Deploy and Run
# Deploy the job
tensorkube job deploy \
--name my-axolotl-finetuning \
--gpus 1 \
--gpu-type l40s \
--secret hugging-face-secret \
--secret wb-secret
# Queue your first training run
tensorkube job queue \
--job-name my-axolotl-finetuning \
--job-id experiment-1 \
--payload '{
"hub_model_id": "your-username/qwen-function-calling",
"wandb_name": "qwen-function-calling-v1"
}'
Dataset Formats
1. Chat Conversations
Perfect for chatbots and conversational AI:
JSONL Format (chat-data.jsonl):
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is Python?"}, {"role": "assistant", "content": "Python is a programming language."}]}
{"messages": [{"role": "user", "content": "How do I write a loop?"}, {"role": "assistant", "content": "You can use a for loop: for i in range(10):"}]}
Configuration:
datasets:
  - path: chat-data.jsonl
    type: chat_template
    field_messages: messages
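Before baking a chat dataset into the image, a quick structural check can catch malformed lines early. The following is a minimal sketch under stated assumptions: the function name validate_chat_line and the allowed role set are hypothetical helpers for illustration, not part of Axolotl or TensorFuse.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: the roles used in the examples above


def validate_chat_line(line):
    """Return a list of problems found in one JSONL line (an empty list means it looks fine)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index} is not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {index} has unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index} has no string 'content'")
    return problems


good = '{"messages": [{"role": "user", "content": "What is Python?"}, {"role": "assistant", "content": "A programming language."}]}'
bad = '{"messages": [{"role": "bot", "content": "hi"}]}'

print(validate_chat_line(good))
print(validate_chat_line(bad))
```

Running the function over every line of chat-data.jsonl and failing the build on any non-empty result keeps bad records out of a queued training run.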
2. Instruction Format
Great for instruction-following models:
CSV Format (instructions.csv):
instruction,input,output
"Explain what Python is","","Python is a high-level programming language known for its simplicity and readability."
"Write a function to add two numbers","","def add(a, b): return a + b"
Configuration:
datasets:
  - path: instructions.csv
    type: instruction
    field_instruction: instruction
    field_input: input
    field_output: output
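A similar pre-flight check works for the CSV layout. This sketch assumes the three column names from the example above. The helper name check_instruction_csv is hypothetical, and it only looks at structure, not content quality.

```python
import csv
import io

REQUIRED_COLUMNS = {"instruction", "input", "output"}


def check_instruction_csv(handle):
    """Return (row_count, problems) for an instruction CSV read from a file-like object."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return 0, [f"missing columns: {sorted(missing)}"]
    problems = []
    count = 0
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        count += 1
        if not (row["instruction"] or "").strip():
            problems.append(f"row {row_number}: empty instruction")
        if not (row["output"] or "").strip():
            problems.append(f"row {row_number}: empty output")
    return count, problems


# In-memory sample; for a real file use open("instructions.csv", newline="").
sample = io.StringIO(
    'instruction,input,output\n'
    '"Explain what Python is","","Python is a high-level programming language."\n'
    '"Write a function to add two numbers","",""\n'
)
print(check_instruction_csv(sample))
```

An empty input column is allowed here because the example data leaves it blank; only instruction and output are treated as required.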
3. HuggingFace Datasets
Use any dataset from HuggingFace Hub:
datasets:
  - path: HuggingFaceH4/no_robots
    type: chat_template
    split: train[:1000]
    field_messages: messages
4. Custom Message Mapping
For datasets with different field names:
datasets:
  - path: custom-conversations.jsonl
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    roles:
      assistant: [gpt, bot, assistant]
      user: [human, user]
      system: [system]
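Conceptually, the mapping above renames each message's keys and folds role aliases into the standard role names. The sketch below illustrates that idea in plain Python. It is not Axolotl's implementation, and normalize_conversation is a hypothetical helper.

```python
def normalize_conversation(record, field_messages, property_mappings, roles):
    """Map dataset-specific keys and role aliases onto the standard role/content shape."""
    # Invert {"assistant": ["gpt", "bot"]} into {"gpt": "assistant", "bot": "assistant"}.
    role_lookup = {alias: target for target, aliases in roles.items() for alias in aliases}
    normalized = []
    for message in record[field_messages]:
        source_role = message[property_mappings["role"]]
        normalized.append({
            "role": role_lookup[source_role],
            "content": message[property_mappings["content"]],
        })
    return normalized


record = {"conversations": [
    {"from": "human", "value": "How do I write a loop?"},
    {"from": "gpt", "value": "Use a for loop."},
]}

result = normalize_conversation(
    record,
    field_messages="conversations",
    property_mappings={"role": "from", "content": "value"},
    roles={"assistant": ["gpt", "bot", "assistant"], "user": ["human", "user"], "system": ["system"]},
)
print(result)
```

A role that is missing from the roles table raises a KeyError in this sketch, which is a convenient way to discover aliases the mapping does not cover yet.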
Configurable Parameters
Model Configuration
{
"base_model": "meta-llama/Llama-3.1-8B-Instruct",
"hub_model_id": "your-org/custom-model",
"chat_template": "llama3"
}
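Popular Models & Templates:
- Llama 3.1: meta-llama/Llama-3.1-8B-Instruct + llama3
- Qwen: Qwen/Qwen2.5-7B-Instruct + qwen2_5
- Mistral: mistralai/Mistral-7B-Instruct-v0.3 + mistral
- CodeLlama: codellama/CodeLlama-7b-Instruct-hf + llama3
Template names must match the ones bundled with your Axolotl version, so check its chat template list before queuing a job. CodeLlama is built on Llama 2, so confirm that the llama3 template is the prompt format you want for it.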
Training Hyperparameters
{
"sequence_len": 4096,
"micro_batch_size": 4,
"num_epochs": 3,
"learning_rate": 0.0001,
"lr_scheduler": "cosine",
"warmup_steps": 100
}
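To see how warmup_steps and the cosine scheduler interact, the sketch below computes the learning rate at a few steps. It assumes the common shape of linear warmup followed by cosine decay to zero. The trainer's scheduler may differ in details, and total_steps is a hypothetical value that in practice depends on dataset size, batch size, and num_epochs.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


base_lr = 0.0001      # "learning_rate" from the payload above
warmup_steps = 100    # "warmup_steps" from the payload above
total_steps = 1000    # hypothetical total number of optimizer steps

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, base_lr, warmup_steps, total_steps))
```

The rate climbs linearly for the first 100 steps, peaks at the configured learning_rate, passes half of it midway through the decay phase, and reaches zero at the final step.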
LoRA Configuration
{
"adapter": "qlora",
"lora_r": 32,
"lora_alpha": 64,
"lora_dropout": 0.05,
"load_in_4bit": true
}
Monitoring & Logging
{
"wandb_project": "my-experiments",
"wandb_entity": "my-team",
"wandb_name": "experiment-v1",
"logging_steps": 10
}
Example Use Cases
Function Calling Assistant
Train models to perform structured function calling for tool usage and API integration:
tensorkube job queue \
--job-name my-axolotl-finetuning \
--job-id hermes-qwen \
--payload '{
"base_model": "Qwen/Qwen3-8B",
"hub_model_id": "your-org/qwen-function-calling",
"datasets": [
{
"path": "NousResearch/hermes-function-calling-v1",
"name": "func_calling_singleturn",
"type": "chat_template",
"split": "train[:80%]",
"field_messages": "conversations",
"message_property_mappings": {
"role": "from",
"content": "value"
}
}
],
"sequence_len": 2048,
"wandb_name": "hermes-function-calling-v1"
}'
Code Assistant
tensorkube job queue \
--job-name my-axolotl-finetuning \
--job-id code-assistant \
--payload '{
"base_model": "codellama/CodeLlama-7b-Instruct-hf",
"hub_model_id": "customer-a/code-assistant",
"datasets": [
{
"path": "code-instructions.csv",
"type": "instruction",
"field_instruction": "problem",
"field_output": "solution"
}
],
"sequence_len": 4096,
"wandb_name": "code-assistant-v1"
}'
Note: For local datasets, update your Dockerfile to copy the dataset file:
FROM axolotlai/axolotl:0.11.0
RUN pip install tensorkube pyyaml
COPY axolotl-train.py .
COPY axolotl-config.yaml .
# Add your dataset file (Dockerfile comments must be on their own line)
COPY code-instructions.csv .
CMD ["python", "axolotl-train.py"]
Multilingual Chat
tensorkube job queue \
--job-name my-axolotl-finetuning \
--job-id multilingual-chat \
--payload '{
"base_model": "meta-llama/Llama-3.1-8B-Instruct",
"hub_model_id": "customer-b/multilingual-chat",
"datasets": [
{
"path": "multilingual-conversations.jsonl",
"type": "chat_template"
}
],
"num_epochs": 5,
"learning_rate": 0.00008,
"wandb_name": "multilingual-chat-v1"
}'
Note: Add the local dataset to your Dockerfile with: COPY multilingual-conversations.jsonl .
Configuration Templates
Ready-to-use YAML configurations for common finetuning scenarios. Copy these as starting points and modify for your specific needs.
Template 1: Chat Assistant
Optimized for conversational AI with balanced performance and memory usage:
base_model: meta-llama/Llama-3.1-8B-Instruct
hub_model_id: your-org/chat-assistant
chat_template: llama3
datasets:
  - path: chat-data.jsonl
    type: chat_template
sequence_len: 2048
micro_batch_size: 8
num_epochs: 3
learning_rate: 0.0002
adapter: qlora
lora_r: 16
lora_alpha: 32
Template 2: Code Generator
Configured for code generation tasks with longer context and specialized model:
base_model: codellama/CodeLlama-7b-Instruct-hf
hub_model_id: your-org/code-generator
chat_template: llama3
datasets:
  - path: code-instructions.csv
    type: instruction
    field_instruction: problem
    field_output: solution
sequence_len: 4096
micro_batch_size: 4
num_epochs: 2
learning_rate: 0.0003
Template 3: Instruction Follower
Designed for general instruction-following with efficient LoRA settings:
base_model: Qwen/Qwen2.5-7B-Instruct
hub_model_id: your-org/instruction-follower
chat_template: qwen2_5
datasets:
  - path: instructions.parquet
    type: instruction
sequence_len: 2048
micro_batch_size: 8
num_epochs: 3
learning_rate: 0.0001
lora_r: 32
lora_alpha: 64
Job Management
Efficiently manage your training jobs, monitor progress, and handle multiple experiments running simultaneously.
Monitor Your Jobs
# List all jobs and their current status
tensorkube job list
# View real-time logs from a specific job
tensorkube job logs --job-name my-axolotl-finetuning
Batch Operations
Run multiple experiments programmatically for parameter sweeps and comparisons:
# Queue multiple experiments with different learning rates
experiments=(
'{"learning_rate": 0.0001, "wandb_name": "exp-1"}'
'{"learning_rate": 0.0002, "wandb_name": "exp-2"}'
'{"learning_rate": 0.0005, "wandb_name": "exp-3"}'
)
for i in "${!experiments[@]}"; do
tensorkube job queue \
--job-name my-axolotl-finetuning \
--job-id "batch-exp-$i" \
--payload "${experiments[$i]}"
done
This approach lets you:
- Automate experiments: No manual job queuing
- Compare results: All experiments tracked in W&B
- Save time: Queue multiple jobs at once
Parameter Sweeps
Learning Rate Sweep
#!/bin/bash
learning_rates=(0.00005 0.0001 0.0002 0.0005)
for lr in "${learning_rates[@]}"; do
tensorkube job queue \
--job-name my-axolotl-finetuning \
--job-id "lr-sweep-${lr}" \
--payload "{
\"learning_rate\": $lr,
\"wandb_name\": \"lr-sweep-${lr}\",
\"hub_model_id\": \"experiments/lr-${lr}\"
}"
done
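The same sweep can be generated from Python, which avoids hand-escaping quotes inside the JSON payload. This is a sketch that only builds and prints the commands as a dry run. It uses the tensorkube flags shown in this guide, and build_queue_command is a hypothetical helper.

```python
import json
import shlex


def build_queue_command(job_name, job_id, payload):
    """Return the argv list for one `tensorkube job queue` call; json.dumps handles quoting."""
    return [
        "tensorkube", "job", "queue",
        "--job-name", job_name,
        "--job-id", job_id,
        "--payload", json.dumps(payload),
    ]


# Kept as strings so the job IDs match the shell version exactly.
learning_rates = ["0.00005", "0.0001", "0.0002", "0.0005"]

commands = []
for lr in learning_rates:
    payload = {
        "learning_rate": float(lr),
        "wandb_name": f"lr-sweep-{lr}",
        "hub_model_id": f"experiments/lr-{lr}",
    }
    commands.append(build_queue_command("my-axolotl-finetuning", f"lr-sweep-{lr}", payload))

# Dry run: print shell-safe commands instead of executing them.
for command in commands:
    print(shlex.join(command))
```

To actually queue the jobs, pass each list to subprocess.run(command, check=True) instead of printing it.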
LoRA Rank Comparison
#!/bin/bash
lora_ranks=(8 16 32 64)
for rank in "${lora_ranks[@]}"; do
alpha=$((rank * 2)) # Common practice: alpha = 2 * rank
tensorkube job queue \
--job-name my-axolotl-finetuning \
--job-id "lora-rank-${rank}" \
--payload "{
\"lora_r\": $rank,
\"lora_alpha\": $alpha,
\"wandb_name\": \"lora-rank-${rank}\",
\"hub_model_id\": \"experiments/lora-r${rank}\"
}"
done
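When comparing ranks, it helps to know how many trainable parameters each setting adds. A LoRA adapter on a linear layer with shape (d_in, d_out) adds rank * (d_in + d_out) parameters. The sketch below applies that formula to hypothetical layer shapes chosen only for illustration, so real models will give different totals.

```python
def lora_params_per_layer(rank, module_shapes):
    """Each adapted linear layer of shape (d_in, d_out) adds rank * (d_in + d_out) parameters."""
    return sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)


# Hypothetical shapes for illustration only: four square 4096x4096 projections per
# transformer block and 32 blocks. Real models differ (grouped-query attention, MLP sizes).
module_shapes = [(4096, 4096)] * 4
num_layers = 32

for rank in (8, 16, 32, 64):
    total = lora_params_per_layer(rank, module_shapes) * num_layers
    print(f"rank={rank:>2} alpha={rank * 2:>3} trainable LoRA params ~ {total / 1e6:.1f}M")
```

The count scales linearly with rank, so each doubling in the sweep doubles the adapter size and the optimizer state that goes with it.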
Model Comparison
#!/bin/bash
# Compare different models on the same task
models=("meta-llama/Llama-3.1-8B-Instruct" "Qwen/Qwen2.5-7B-Instruct" "mistralai/Mistral-7B-Instruct-v0.3")
templates=("llama3" "qwen2_5" "mistral")
for i in "${!models[@]}"; do
model="${models[$i]}"
template="${templates[$i]}"
model_name=$(echo "$model" | cut -d'/' -f2 | tr '[:upper:]' '[:lower:]')
tensorkube job queue \
--job-name my-axolotl-finetuning \
--job-id "model-comparison-${model_name}" \
--payload "{
\"base_model\": \"$model\",
\"chat_template\": \"$template\",
\"wandb_name\": \"comparison-${model_name}\",
\"hub_model_id\": \"experiments/${model_name}-finetuned\"
}"
done
Monitoring with Weights & Biases
Key Metrics to Watch
- Training Loss: Should decrease steadily
- Learning Rate: Should follow the configured schedule (warmup, then decay)
- GPU Utilization: Should be consistently high
- Validation Loss: Check for overfitting
Advanced W&B Configuration
{
"wandb_project": "qwen-functioncalling",
"wandb_entity": "your-team",
"wandb_name": "experiment-name",
"wandb_watch": "all",
"wandb_log_model": "checkpoint",
"wandb_tags": ["qwen", "function-calling", "hermes"]
}
HuggingFace Integration
Automatic Model Upload
Models are automatically uploaded to HuggingFace Hub when training completes:
{
"hub_model_id": "your-username/model-name",
"hub_strategy": "end",
"hub_private_repo": true
}
Upload Configuration Options
{
"hub_model_id": "organization/model-name",
"hub_strategy": "checkpoint",
"hub_private_repo": false,
"hub_always_push": true,
"hub_model_revision": "main"
}
Advanced Features
Memory Optimization (for larger models)
{
"load_in_4bit": true,
"gradient_checkpointing": "offload",
"flash_attention": true,
"sample_packing": true,
"pad_to_sequence_len": true
}
Memory Usage Breakdown:
- load_in_4bit: Stores model weights in 4-bit instead of 16-bit, roughly a 4x reduction (e.g., 8 GB of weights → about 2 GB)
- gradient_checkpointing: Trades compute for memory by recomputing activations during the backward pass (slower, but fits larger models)
- flash_attention: 2-8x faster attention with a lower memory footprint
- sample_packing: Better GPU utilization, especially with variable-length sequences
- pad_to_sequence_len: More predictable memory usage, which helps avoid mid-run OOM errors
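The effect of quantization on weight memory is simple arithmetic: parameters times bits per parameter, divided by eight. The sketch below works this through for an assumed 8-billion-parameter model. It covers weights only and ignores quantization metadata, activations, LoRA adapters, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 8e9  # assumption: an 8-billion-parameter model

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(num_params, bits):.1f} GB")
```

Actual GPU usage during training is higher than the weight figure, which is why the other settings in this section matter as well.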
Multi-Dataset Training
{
"datasets": [
{
"path": "instructions.jsonl",
"type": "instruction",
"weight": 0.7
},
{
"path": "conversations.jsonl",
"type": "chat_template",
"weight": 0.3
}
]
}
Evaluation Configuration
{
"val_set_size": 0.1,
"evals_per_epoch": 4,
"eval_sample_packing": false,
"eval_max_new_tokens": 128
}
Model Evaluation Integration
After training, evaluate your models using the integrated evaluation pipeline:
1. Deploy Inference Server
First, deploy your base model with LoRA support:
Create inference/deployment.yaml:
gpus: 1
gpu_type: l40s
secret:
- hugging-face-secret
- vllm-api-secret
port: 80
min_scale: 0
max_scale: 1
startup_probe:
  httpGet:
    path: /health
    port: 80
readiness:
  httpGet:
    path: /health
    port: 80
Create inference/Dockerfile:
FROM vllm/vllm-openai:v0.9.2
RUN pip install hf-xet huggingface_hub
ENV HF_XET_HIGH_PERFORMANCE=1
ENV VLLM_ALLOW_RUNTIME_LORA_UPDATING True
ENV VLLM_USE_V1 1
EXPOSE 80
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-8B \
--dtype bfloat16 \
--max-model-len 32768 \
--port 80 \
--enable-lora \
--api-key $VLLM_API_KEY
Deploy the inference server:
cd inference
# Create API key secret for inference server
tensorkube secret create vllm-api-secret VLLM_API_KEY=your_secure_api_key_here
tensorkube deploy --config-file deployment.yaml
2. Create Evaluation Script
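Create evals/evaluation_script.py to benchmark your models. The listing shows the core pieces (loading a LoRA adapter and querying the model for one sample); the interactive menu and scoring logic referenced in the next step are not reproduced here: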
import openai
from openai import AsyncOpenAI
import json
import asyncio
from datasets import load_dataset
import requests

API_KEY = 'your_secure_api_key_here'  # Same as VLLM_API_KEY
BASE_URL = '<YOUR_INFERENCE_URL>/v1'  # Get from tensorkube deployment


def load_lora_adapter(lora_path):
    """Load your trained LoRA adapter"""
    url = f"{BASE_URL.replace('/v1', '')}/v1/load_lora_adapter"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}"
    }
    data = {
        "lora_name": lora_path,
        "lora_path": lora_path
    }
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()
    print(f"Successfully loaded LoRA adapter: {lora_path}")


# Load evaluation dataset
dataset = load_dataset("NousResearch/hermes-function-calling-v1", "func_calling_singleturn")
client = AsyncOpenAI(api_key=API_KEY, base_url=BASE_URL)


async def evaluate_model(sample):
    # Extract conversation (system and user turns only; the final turn is the reference answer)
    messages = []
    for conv in sample["conversations"]:
        if conv["from"] == "system":
            messages.append({"role": "system", "content": conv["value"]})
        elif conv["from"] == "human":
            messages.append({"role": "user", "content": conv["value"]})
    # Get model response
    response = await client.chat.completions.create(
        model="your-model-name",  # Use loaded LoRA name
        messages=messages,
        temperature=0.0,
        max_tokens=2000
    )
    return {
        "expected": sample["conversations"][-1]["value"],
        "actual": response.choices[0].message.content
    }
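The dictionaries returned by evaluate_model still need to be turned into a score. The sketch below shows one simple option, whitespace-normalized exact match. This is a deliberately strict stand-in metric chosen for illustration, not necessarily how the original pipeline scores function calls. The sample results are made up.

```python
import json


def normalize(text):
    """Collapse whitespace so formatting differences do not count as mismatches."""
    return " ".join(text.split())


def exact_match_accuracy(results):
    """Fraction of samples whose model output equals the reference after normalization."""
    if not results:
        return 0.0
    matches = sum(normalize(r["expected"]) == normalize(r["actual"]) for r in results)
    return matches / len(results)


# Hypothetical results in the shape returned by evaluate_model.
results = [
    {"expected": '{"name": "get_weather"}', "actual": '{"name":  "get_weather"}'},
    {"expected": '{"name": "get_time"}', "actual": "I cannot help with that."},
]

summary = {"num_samples": len(results), "exact_match": exact_match_accuracy(results)}
print(json.dumps(summary))
```

For function calling, parsing both sides as JSON and comparing the function name and arguments is usually more forgiving than string equality. The summary dictionary can be written to a results file with json.dump.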
3. Run Evaluation
cd evals
python evaluation_script.py
# Choose option 1: Load LoRA Adapter
# Enter: your-org/your-finetuned-model
# Then run again and choose option 2: Run Evaluations
The evaluation will:
- Load your trained LoRA adapter
- Test on function calling tasks
- Calculate accuracy metrics
- Compare base vs fine-tuned performance
- Save results to benchmark_results.json
4. Monitor Results
Check Weights & Biases for:
- Training curves: Loss progression during finetuning
- Evaluation metrics: Function calling accuracy
- Validation performance: Generalization capability
This complete pipeline lets you:
- Finetune models with job queues
- Deploy inference servers with LoRA support
- Load trained adapters dynamically
- Evaluate performance on function calling tasks
- Compare different model configurations
LoRA Adapter Loading for Inference
Your trained models are automatically uploaded to HuggingFace Hub and can be loaded into running inference servers without restart:
Loading Process
- Training completes → Model uploaded to HuggingFace Hub
- Inference server running → vLLM with --enable-lora flag
- Load adapter → Call /v1/load_lora_adapter endpoint
- Ready for inference → Use adapter name in chat completions
API Usage
# Load your trained adapter
curl -X POST "http://your-server/v1/load_lora_adapter" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your_secure_api_key_here" \
-d '{
"lora_name": "my-function-calling-model",
"lora_path": "your-org/my-function-calling-model"
}'
# Use in chat completions
curl -X POST "http://your-server/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your_secure_api_key_here" \
-d '{
"model": "my-function-calling-model",
"messages": [{"role": "user", "content": "Call the weather API"}]
}'
This enables:
- Hot-swapping models without server restart
- A/B testing different fine-tuned versions
- Multi-tenant serving with customer-specific models
- Rapid experimentation with new training runs
Troubleshooting
Out of Memory Issues
If you get OOM errors, try:
{
"micro_batch_size": 1,
"gradient_accumulation_steps": 8,
"gradient_checkpointing": "offload",
"load_in_4bit": true
}
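Lowering micro_batch_size while raising gradient_accumulation_steps keeps the effective batch size constant, so the optimization behaviour stays comparable while peak memory drops. The sketch below checks this for the Quick Start settings and the OOM fallback above. The helper name is hypothetical, and a single GPU is assumed.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Number of (packed) sequences contributing to each optimizer step."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


quick_start = effective_batch_size(micro_batch_size=8, gradient_accumulation_steps=1)
oom_fallback = effective_batch_size(micro_batch_size=1, gradient_accumulation_steps=8)
print(quick_start, oom_fallback)
```

Because both configurations process the same number of sequences per optimizer step, the learning rate from the base config is a reasonable starting point for the fallback as well.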
Slow Training
Speed up training with:
{
"flash_attention": true,
"sample_packing": true,
"tf32": true,
"bf16": "auto"
}
Upload Failures
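Check that the HuggingFace secret exists, then confirm in your HuggingFace account settings that the token has WRITE access:
# Confirm hugging-face-secret is present (this does not validate the token's permissions)
tensorkube secret list --env keda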
W&B Connection Issues
Verify your W&B setup:
# Check if secret exists
tensorkube secret list --env keda | grep wb-secret
Debug Mode
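For troubleshooting, use these Axolotl debugging settings. Axolotl treats eval_steps/evals_per_epoch and save_steps/saves_per_epoch as mutually exclusive pairs, so the payload also clears the per-epoch values inherited from the base config:
{
"logging_steps": 1,
"max_steps": 10,
"save_steps": 5,
"eval_steps": 5,
"saves_per_epoch": null,
"evals_per_epoch": null,
"wandb_name": "debug-run"
}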
Additional debugging options:
- Set TRANSFORMERS_VERBOSITY=debug in environment
- Use --debug flag with accelerate launch
- Check logs with: tensorkube job logs --job-name your-job-name
- For config-only testing: add "wandb_mode": "disabled" to skip W&B entirely