Axolotl Finetuning with TensorFuse Job Queues
Run powerful and flexible Axolotl finetuning jobs on TensorFuse with support for multiple dataset formats, base models, and training configurations. Use it for per-customer models, new use cases, parameter sweeps, and side-by-side model comparisons.
What You Can Do
- Multiple Dataset Formats: JSONL, CSV, Parquet, HuggingFace datasets
- Choose any Base Model: Llama, Qwen, Mistral, CodeLlama, and more
- Flexible Chat Formats: Instructions, conversations, chat templates
- Parameter Sweeps: Test different hyperparameters automatically
- Model Comparisons: Compare different models on the same data
- Automatic Uploads: Models uploaded to HuggingFace Hub
- Training Monitoring: Full Weights & Biases integration
- Queue Management: Run multiple experiments in parallel
Prerequisites
- TensorFuse Setup: Ensure your cluster is configured (see Getting Started)
- Set up Secrets: Store your HuggingFace and Weights & Biases tokens as TensorFuse secrets so training jobs can pull gated models, push to the Hub, and log runs
Quick Start
1. Set Up Your Project
Create a new directory for your finetuning project:
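For example, a minimal layout (the directory and file names are placeholders that match the steps below):

```bash
mkdir axolotl-finetuning && cd axolotl-finetuning

# Files created in the next steps
touch axolotl-config.yaml axolotl-train.py Dockerfile
```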
2. Create Base Configuration
Create axolotl-config.yaml:
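A minimal sketch of a base config, assuming Llama 3.1 8B with QLoRA. Every value here is an illustrative starting point rather than the exact config from this guide; see the Axolotl config reference for the full option list.

```yaml
# axolotl-config.yaml -- illustrative starting point
base_model: meta-llama/Llama-3.1-8B-Instruct
chat_template: llama3

load_in_4bit: true
adapter: qlora

datasets:
  - path: conversations.jsonl
    type: chat_template

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
lr_scheduler: cosine
optimizer: adamw_torch
bf16: true
gradient_checkpointing: true
flash_attention: true

val_set_size: 0.05
output_dir: ./outputs

wandb_project: axolotl-finetuning
hub_model_id: your-username/your-model-name
```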
3. Create Training Script
Create axolotl-train.py:
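One possible shape for the entrypoint: read per-job overrides from the job payload, merge them into the base config, and launch Axolotl. The JOB_PAYLOAD environment variable and the override format are assumptions for illustration; wire it to however your TensorFuse job queue actually delivers payloads.

```python
# axolotl-train.py -- illustrative sketch, not the exact script from this guide
import json
import os
import subprocess

import yaml

BASE_CONFIG = "axolotl-config.yaml"
RUN_CONFIG = "run-config.yaml"


def main():
    # Per-job overrides, e.g. {"learning_rate": 1e-4, "lora_r": 32}.
    # JOB_PAYLOAD is an assumed variable name; adapt it to your queue's payload mechanism.
    overrides = json.loads(os.environ.get("JOB_PAYLOAD", "{}"))

    with open(BASE_CONFIG) as f:
        config = yaml.safe_load(f)
    config.update(overrides)

    with open(RUN_CONFIG, "w") as f:
        yaml.safe_dump(config, f)

    # Launch Axolotl on the merged config
    subprocess.run(
        ["accelerate", "launch", "-m", "axolotl.cli.train", RUN_CONFIG],
        check=True,
    )


if __name__ == "__main__":
    main()
```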
4. Create Dockerfile
Create Dockerfile:
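A sketch based on the public Axolotl image; the image tag is an assumption, so pin a version that matches your cluster's GPUs and CUDA setup.

```dockerfile
# Dockerfile -- illustrative; pin an image tag that matches your GPU/CUDA setup
FROM winglian/axolotl:main-latest

WORKDIR /workspace

COPY axolotl-config.yaml axolotl-train.py ./
# Copy local datasets into the image if you are not pulling from HuggingFace Hub
COPY conversations.jsonl ./

CMD ["python", "axolotl-train.py"]
```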
5. Deploy and Run
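Roughly, you deploy the job image once and then queue runs against it with per-run payloads of config overrides. The command names and flags below are placeholders, not verified syntax; follow the TensorFuse job queue docs for the exact commands.

```bash
# Illustrative only -- check the TensorFuse job queue docs for the real syntax
tensorkube job deploy --name axolotl-finetune --gpus 1 ...

# Queue a run with a JSON payload of config overrides
tensorkube job queue --job-name axolotl-finetune --payload '{"learning_rate": 2e-4}' ...
```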
Dataset Formats
1. Chat/Conversation Format (JSONL)
Perfect for chatbots and conversational AI:
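Each line is a standalone JSON object with a messages array. The field names follow the common OpenAI-style convention that Axolotl's chat_template type understands; the content is illustrative, not the guide's original file.

```json
{"messages": [{"role": "system", "content": "You are a helpful support assistant."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Go to Settings > Security and click 'Reset password'. You'll receive a confirmation email."}]}
{"messages": [{"role": "user", "content": "What are your business hours?"}, {"role": "assistant", "content": "We're available Monday to Friday, 9am to 6pm UTC."}]}
```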
2. Instruction Format (CSV/Parquet/JSONL)
Great for instruction-following models. CSV format (instructions.csv):
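For example, with the common alpaca-style instruction/input/output columns (Parquet and JSONL variants use the same fields); the rows below are illustrative:

```csv
instruction,input,output
"Summarize the following text.","TensorFuse runs serverless GPU workloads on your own AWS account.","TensorFuse lets you run serverless GPU workloads inside your own AWS account."
"Translate to French.","Good morning","Bonjour"
```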
3. HuggingFace Datasets
Use any dataset from the HuggingFace Hub:
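For example, pointing the datasets section at a public Hub dataset (tatsu-lab/alpaca is just an illustration):

```yaml
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
```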
4. Custom Message Mapping
For datasets with different field names:
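A sketch of remapping field names with the chat_template prompt strategy; the exact keys vary by Axolotl version, so verify against its dataset-format docs.

```yaml
datasets:
  - path: my-dataset.jsonl
    type: chat_template
    field_messages: conversations      # your dataset's messages field
    message_field_role: from           # role key inside each message
    message_field_content: value       # content key inside each message
```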
Configurable Parameters
Model Configuration
- Llama 3.1: `meta-llama/Llama-3.1-8B-Instruct` with the `llama3` chat template
- Qwen: `Qwen/Qwen2.5-7B-Instruct` with the `qwen2_5` chat template
- Mistral: `mistralai/Mistral-7B-Instruct-v0.3` with the `mistral` chat template
- CodeLlama: `codellama/CodeLlama-7b-Instruct-hf` with the `llama3` chat template
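In the config, the pairing is simply the base_model plus its matching chat_template, for example:

```yaml
base_model: Qwen/Qwen2.5-7B-Instruct
chat_template: qwen2_5
```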
Training Hyperparameters
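Typical knobs to expose per job (the values are illustrative defaults, not prescriptions):

```yaml
learning_rate: 0.0002
num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 4
lr_scheduler: cosine
warmup_steps: 20
optimizer: adamw_torch
weight_decay: 0.01
```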
LoRA Configuration
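A common LoRA block; ranks of 8 to 64 are typical, and a higher rank means more trainable parameters:

```yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
```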
Monitoring & Logging
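The relevant Axolotl keys (project, entity, and repo names below are placeholders):

```yaml
wandb_project: axolotl-finetuning
wandb_entity: your-team
hub_model_id: your-username/your-model-name
```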
Example Use Cases
Function Calling Assistant
Train models to perform structured function calling for tool usage and API integration:
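A sketch of one training example: the assistant turn contains the structured call the model should learn to emit. The tool schema and JSON shape are illustrative.

```json
{"messages": [{"role": "system", "content": "You can call get_weather(city) to look up current weather."}, {"role": "user", "content": "What's the weather in Berlin?"}, {"role": "assistant", "content": "{\"function\": \"get_weather\", \"arguments\": {\"city\": \"Berlin\"}}"}]}
```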
Code Assistant
Multilingual Chat
Bundle the multilingual dataset into your training image:
```dockerfile
COPY multilingual-conversations.jsonl .
```
Configuration Templates
Ready-to-use YAML configurations for common finetuning scenarios. Copy these as starting points and modify for your specific needs.
Template 1: Chat Assistant
Optimized for conversational AI with balanced performance and memory usage:
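A sketch of the settings such a template might emphasize on top of the base config above; every value is illustrative, not the guide's original template.

```yaml
base_model: meta-llama/Llama-3.1-8B-Instruct
chat_template: llama3
sequence_len: 2048
load_in_4bit: true
adapter: qlora
lora_r: 16
sample_packing: true
gradient_checkpointing: true
```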
Template 2: Code Generator
Configured for code generation tasks with a longer context and a specialized code model:
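One way to adapt the same skeleton for code; the CodeLlama pairing comes from the model list above, while the context length and rank are illustrative assumptions.

```yaml
base_model: codellama/CodeLlama-7b-Instruct-hf
sequence_len: 4096          # longer context for functions and whole files
lora_r: 32                  # a bit more capacity for code structure
sample_packing: true
flash_attention: true
```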
Template 3: Instruction Follower
Designed for general instruction-following with efficient LoRA settings:
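A lean instruction-following sketch with alpaca-style data and a small LoRA rank (again illustrative, not the original template):

```yaml
base_model: Qwen/Qwen2.5-7B-Instruct
chat_template: qwen2_5
datasets:
  - path: instructions.csv
    ds_type: csv
    type: alpaca
lora_r: 8
lora_alpha: 16
num_epochs: 2
```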
Job Management
Efficiently manage your training jobs, monitor progress, and handle multiple experiments running simultaneously.
Monitor Your Jobs
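At minimum you can tail a job's logs with the CLI (this command also appears in the debugging section below); other job-management commands are covered in the TensorFuse job queue docs.

```bash
tensorkube job logs --job-name your-job-name
```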
Batch Operations
Run multiple experiments programmatically for parameter sweeps and comparisons (see the sketch after this list):
- Automate experiments: No manual job queuing
- Compare results: All experiments tracked in W&B
- Save time: Queue multiple jobs at once
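A sketch of the idea: build one payload per experiment and queue them in a loop. The queue_job helper just shells out to the CLI; the exact tensorkube command and flags are an assumption, so substitute the syntax from the job queue docs.

```python
# batch_queue.py -- illustrative batch queuing sketch
import json
import subprocess

experiments = [
    {"learning_rate": 1e-4, "lora_r": 16},
    {"learning_rate": 2e-4, "lora_r": 16},
    {"learning_rate": 2e-4, "lora_r": 32},
]


def queue_job(job_id: str, payload: dict) -> None:
    # Placeholder command: check the TensorFuse job queue docs for the real syntax.
    subprocess.run(
        [
            "tensorkube", "job", "queue",
            "--job-name", "axolotl-finetune",
            "--job-id", job_id,
            "--payload", json.dumps(payload),
        ],
        check=True,
    )


for i, payload in enumerate(experiments):
    queue_job(f"sweep-{i}", payload)
```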
Parameter Sweeps
Learning Rate Sweep
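For example, sweeping only the learning rate and reusing the queuing loop above. The same pattern works for the LoRA rank and model comparisons below; just vary lora_r, or base_model and chat_template, in the payloads instead.

```python
learning_rates = [5e-5, 1e-4, 2e-4, 5e-4]
experiments = [{"learning_rate": lr} for lr in learning_rates]
```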
LoRA Rank Comparison
Model Comparison
Monitoring with Weights & Biases
Key Metrics to Watch
- Training Loss: Should decrease steadily
- Learning Rate: Should follow the configured schedule
- GPU Utilization: Should be consistently high
- Validation Loss: Should stay close to training loss; a growing gap signals overfitting
Advanced W&B Configuration
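Additional Axolotl W&B keys you can set (names from the Axolotl config reference; values are illustrative):

```yaml
wandb_watch: gradients       # log gradient histograms
wandb_log_model: checkpoint  # upload checkpoints as W&B artifacts
wandb_name: llama31-lr2e4-r16
```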
HuggingFace Integration
Automatic Model Upload
Models are automatically uploaded to the HuggingFace Hub when training completes:
Upload Configuration Options
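The relevant Axolotl keys (the repo name is a placeholder; hub_strategy controls when checkpoints are pushed):

```yaml
hub_model_id: your-username/your-model-name
hub_strategy: end            # push once at the end; "every_save" pushes each checkpoint
hf_use_auth_token: true
```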
Advanced Features
Memory Optimization (for larger models)
- `load_in_4bit`: Reduces model weights from 16-bit to 4-bit (e.g., an 8 GB model → ~2 GB)
- `gradient_checkpointing`: Trades compute for memory (slower, but fits larger models)
- `flash_attention`: 2-8x faster attention with a lower memory footprint
- `sample_packing`: Better GPU utilization, especially with variable-length sequences
- `pad_to_sequence_len`: Predictable memory usage, prevents OOM errors
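Combined in a config, a memory-lean setup might look like this (values illustrative):

```yaml
load_in_4bit: true
adapter: qlora
gradient_checkpointing: true
flash_attention: true
sample_packing: true
pad_to_sequence_len: true
micro_batch_size: 1
gradient_accumulation_steps: 8
```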
Multi-Dataset Training
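Axolotl accepts a list of datasets and mixes them during training; a sketch mixing a local file with a Hub dataset:

```yaml
datasets:
  - path: support-conversations.jsonl
    type: chat_template
  - path: tatsu-lab/alpaca
    type: alpaca
```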
Evaluation Configuration
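Hold out part of the data and evaluate during training (keys from the Axolotl config reference; values illustrative):

```yaml
val_set_size: 0.05       # fraction of data held out for validation
evals_per_epoch: 2
eval_sample_packing: false
```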
Model Evaluation Integration
After training, evaluate your models using the integrated evaluation pipeline.
1. Deploy Inference Server
First, deploy your base model with LoRA support. Create inference/deployment.yaml:
Then create inference/Dockerfile:
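A sketch assuming the official vLLM OpenAI-compatible image. The --enable-lora flag and the runtime-LoRA environment variable are real vLLM features, but pin image versions and adjust the base model to your deployment.

```dockerfile
# inference/Dockerfile -- illustrative
FROM vllm/vllm-openai:latest

# Allow loading LoRA adapters at runtime via /v1/load_lora_adapter
ENV VLLM_ALLOW_RUNTIME_LORA_UPDATING=True

CMD ["--model", "meta-llama/Llama-3.1-8B-Instruct", \
     "--enable-lora", \
     "--max-loras", "4", \
     "--port", "8000"]
```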
2. Create Evaluation Script
Create evals/evaluation_script.py to benchmark your models:
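A minimal sketch of such a script: it queries the OpenAI-compatible endpoint with either the base model or the loaded adapter name, checks whether responses parse as the expected function call, and writes benchmark_results.json. The endpoint URL, test cases, and scoring are placeholders.

```python
# evals/evaluation_script.py -- illustrative benchmark sketch
import json

import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # your inference server URL

# Tiny illustrative test set: prompt + expected function name
TEST_CASES = [
    {"prompt": "What's the weather in Berlin?", "expected_function": "get_weather"},
    {"prompt": "Book a table for two at 7pm.", "expected_function": "book_table"},
]


def score_model(model_name: str) -> float:
    correct = 0
    for case in TEST_CASES:
        resp = requests.post(
            ENDPOINT,
            json={
                "model": model_name,  # base model name or loaded LoRA adapter name
                "messages": [{"role": "user", "content": case["prompt"]}],
                "temperature": 0,
            },
            timeout=60,
        )
        content = resp.json()["choices"][0]["message"]["content"]
        try:
            call = json.loads(content)
            correct += call.get("function") == case["expected_function"]
        except json.JSONDecodeError:
            pass  # not valid structured output -> counted as incorrect
    return correct / len(TEST_CASES)


if __name__ == "__main__":
    results = {
        "base": score_model("meta-llama/Llama-3.1-8B-Instruct"),
        "finetuned": score_model("my-function-calling-adapter"),
    }
    with open("benchmark_results.json", "w") as f:
        json.dump(results, f, indent=2)
    print(results)
```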
3. Run Evaluation
- Load your trained LoRA adapter
- Test on function calling tasks
- Calculate accuracy metrics
- Compare base vs fine-tuned performance
- Save results to benchmark_results.json
4. Monitor Results
Check Weights & Biases for:
- Training curves: Loss progression during finetuning
- Evaluation metrics: Function calling accuracy
- Validation performance: Generalization capability
Putting it all together, this workflow lets you:
- Finetune models with job queues
- Deploy inference servers with LoRA support
- Load trained adapters dynamically
- Evaluate performance on function calling tasks
- Compare different model configurations
LoRA Adapter Loading for Inference
Your trained models are automatically uploaded to the HuggingFace Hub and can be loaded into running inference servers without a restart.
Loading Process
- Training completes → model is uploaded to the HuggingFace Hub
- Inference server running → vLLM started with the `--enable-lora` flag
- Load adapter → call the `/v1/load_lora_adapter` endpoint
- Ready for inference → use the adapter name in chat completions
API Usage
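For example, against a vLLM server started with --enable-lora and runtime LoRA updates enabled (adapter and repo names are placeholders):

```bash
# 1. Load the trained adapter from HuggingFace Hub (or a local path)
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-function-calling-adapter", "lora_path": "your-username/your-model-name"}'

# 2. Use the adapter name as the model in chat completions
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-function-calling-adapter", "messages": [{"role": "user", "content": "What is the weather in Berlin?"}]}'
```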
This enables:
- Hot-swapping models without server restart
- A/B testing different fine-tuned versions
- Multi-tenant serving with customer-specific models
- Rapid experimentation with new training runs
Troubleshooting
Out of Memory Issues
If you get OOM errors, try:
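For example (any one of these changes can help; values illustrative):

```yaml
micro_batch_size: 1
gradient_accumulation_steps: 16
load_in_4bit: true
gradient_checkpointing: true
sequence_len: 1024
```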
Slow Training
Speed up training with:
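For example:

```yaml
flash_attention: true
sample_packing: true
bf16: true
tf32: true
micro_batch_size: 4   # raise if memory allows
```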
Upload Failures
Check your HuggingFace token:
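For example, confirm the token is valid and present in the environment:

```bash
huggingface-cli whoami
echo $HF_TOKEN   # or HUGGING_FACE_HUB_TOKEN, depending on how your secret is exposed
```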
W&B Connection Issues
Verify your W&B setup:
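For example:

```bash
wandb login          # should report the account associated with WANDB_API_KEY
echo $WANDB_API_KEY
```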
Debug Mode
For troubleshooting, use these Axolotl debugging settings:
- Set `TRANSFORMERS_VERBOSITY=debug` in the environment
- Use the `--debug` flag with `accelerate launch`
- Check logs with `tensorkube job logs --job-name your-job-name`
- For config-only testing: add `"wandb_mode": "disabled"` to skip W&B entirely