Installation
Install TRL and the required dependencies:
- `trl`: Core training library
- `peft`: LoRA/QLoRA support
- `accelerate`: Multi-GPU and distributed training
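A typical installation from PyPI (pin versions as needed for your environment):

```bash
pip install -U trl peft accelerate
```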
Supervised Fine-Tuning (SFT)
The SFTTrainer makes it easy to fine-tune LFM models on instruction-following or conversational datasets. It handles chat templates, packing, and dataset formatting automatically. SFT training requires Instruction datasets.
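For reference, here is a minimal sketch of the conversational ("messages") format that `SFTTrainer` accepts out of the box; the exact column layout depends on your dataset.

```python
# One training example in conversational format; SFTTrainer applies the chat template for you.
example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}
```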
LoRA Fine-Tuning (Recommended)
LoRA (Low-Rank Adaptation) is the recommended approach for fine-tuning LFM2 models with TRL. It offers several key advantages:
- Memory efficient: Trains only small adapter weights (~1-2% of model size) instead of full model parameters
- Data efficient: Achieves strong task performance improvements with less training data than full fine-tuning
- Fast training: Reduced parameter count enables faster iteration and larger effective batch sizes
- Flexible: Easy to switch between different task adapters without retraining the base model
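A minimal LoRA fine-tuning sketch with `SFTTrainer`. The checkpoint name, dataset, and hyperparameters below are illustrative assumptions; substitute your own.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Example conversational dataset; replace with your own instruction data.
dataset = load_dataset("trl-lib/Capybara", split="train")

# LoRA adapter configuration: only the low-rank adapter weights are trained.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="lfm2-sft-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=1,
    logging_steps=10,
    bf16=True,
)

trainer = SFTTrainer(
    model="LiquidAI/LFM2-1.2B",  # assumed checkpoint name; pick the LFM size you need
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```

After training, `trainer.save_model()` writes only the adapter weights; they can later be merged into the base model with peft's `merge_and_unload()`.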
Full Fine-Tuning
Full fine-tuning updates all model parameters. Use this only when you have sufficient GPU memory and need maximum adaptation for your task.
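A sketch of the same setup without LoRA (again with assumed names); omitting `peft_config` makes the trainer update all parameters.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # example dataset

training_args = SFTConfig(
    output_dir="lfm2-sft-full",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # smaller per-device batches, more accumulation
    learning_rate=2e-5,
    bf16=True,
    gradient_checkpointing=True,     # trades compute for memory
)

trainer = SFTTrainer(
    model="LiquidAI/LFM2-1.2B",      # assumed checkpoint name
    args=training_args,
    train_dataset=dataset,           # no peft_config: all parameters are trained
)
trainer.train()
```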
Vision Language Model Fine-Tuning (VLM-SFT)
The SFTTrainer also supports fine-tuning Vision Language Models like LFM2.5-VL-1.6B on image-text datasets. VLM fine-tuning requires Vision datasets and involves a few key differences from text-only SFT:
- Uses `AutoModelForImageTextToText` instead of `AutoModelForCausalLM`
- Uses `AutoProcessor` instead of just a tokenizer
- Requires dataset formatting with image content types
- Needs a custom `collate_fn` for multimodal batching
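A sketch of the processor and collate function for image-text batches. The checkpoint name and the `messages`/`images` column layout are assumptions about your dataset; adjust them to match your data.

```python
from transformers import AutoProcessor

# The processor bundles the tokenizer and the image preprocessor.
processor = AutoProcessor.from_pretrained("LiquidAI/LFM2.5-VL-1.6B")  # assumed checkpoint

def collate_fn(examples):
    """Turn a list of {"messages": ..., "images": ...} examples into a model-ready batch."""
    texts = [
        processor.apply_chat_template(example["messages"], tokenize=False)
        for example in examples
    ]
    images = [example["images"] for example in examples]

    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")

    # Standard causal-LM labels: copy the inputs and ignore padding in the loss.
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    # Depending on the model, you may also want to mask the image placeholder tokens here.
    batch["labels"] = labels
    return batch
```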
VLM LoRA Fine-Tuning (Recommended)
LoRA is recommended for VLM fine-tuning due to the larger model size and multimodal complexity.
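A minimal wiring of the VLM pieces with LoRA, reusing the `collate_fn` sketched above. The model checkpoint, dataset, and config values are assumptions, not a definitive recipe.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForImageTextToText
from trl import SFTConfig, SFTTrainer

model = AutoModelForImageTextToText.from_pretrained(
    "LiquidAI/LFM2.5-VL-1.6B",        # assumed checkpoint name
    torch_dtype=torch.bfloat16,
)
dataset = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train")  # example vision dataset

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="lfm2-vl-sft-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    bf16=True,
    remove_unused_columns=False,                     # keep the raw image column for collate_fn
    dataset_kwargs={"skip_prepare_dataset": True},   # batches are built entirely in collate_fn
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collate_fn,    # from the sketch above
    peft_config=peft_config,
)
trainer.train()
```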
Full VLM Fine-Tuning
Full VLM fine-tuning updates all model parameters. Use this only when you have sufficient GPU memory.
Direct Preference Optimization (DPO)
The DPOTrainer implements Direct Preference Optimization, a method to align models with human preferences without requiring a separate reward model. DPO training requires Preference datasets with chosen and rejected response pairs.
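For reference, a single preference example in the conversational format that `DPOTrainer` understands (the `prompt`/`chosen`/`rejected` column names are the standard TRL ones):

```python
# One preference pair: the trainer teaches the model to prefer "chosen" over "rejected".
example = {
    "prompt": [{"role": "user", "content": "Summarize the water cycle in one sentence."}],
    "chosen": [{"role": "assistant", "content": "Water evaporates, condenses into clouds, and falls back as precipitation."}],
    "rejected": [{"role": "assistant", "content": "The water cycle is when water does things."}],
}
```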
DPO with LoRA (Recommended)
LoRA is highly recommended for DPO training, as it significantly reduces memory requirements while maintaining strong alignment performance.
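A minimal DPO-with-LoRA sketch; the checkpoint, dataset, and hyperparameters are placeholders. With a `peft_config`, no separate reference model is needed: the trainer recovers the reference policy by disabling the adapters.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "LiquidAI/LFM2-1.2B"   # assumed checkpoint; ideally start from your SFT model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # example preference data

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="lfm2-dpo-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=5e-7,
    beta=0.1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```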
Full DPO Training
Full DPO training updates all model parameters. Use this only when you have sufficient GPU memory.
Other Training Methods
TRL also provides additional trainers that work seamlessly with LFM models:
- RewardTrainer: Train reward models for RLHF
- PPOTrainer: Proximal Policy Optimization for reinforcement learning from human feedback
- ORPOTrainer: Odds Ratio Preference Optimization, an alternative to DPO
- KTOTrainer: Kahneman-Tversky Optimization for alignment
Tips
- Learning Rates: SFT typically uses higher learning rates (1e-5 to 5e-5) than DPO (1e-7 to 1e-6)
- Batch Size: DPO requires larger effective batch sizes; increase `gradient_accumulation_steps` if GPU memory is limited
- LoRA Ranks: Start with `r=16` for experimentation; increase to `r=64` or higher for better quality
- DPO Beta: The `beta` parameter controls the deviation from the reference model; typical values range from 0.1 to 0.5
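As a rough illustration of how these tips map onto the config objects (the values are starting points, not recommendations for every task):

```python
from peft import LoraConfig
from trl import DPOConfig, SFTConfig

sft_args = SFTConfig(output_dir="sft-run", learning_rate=2e-5)   # SFT: higher learning rate

dpo_args = DPOConfig(
    output_dir="dpo-run",
    learning_rate=5e-7,                # DPO: one to two orders of magnitude lower
    beta=0.1,                          # how far the policy may drift from the reference model
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,    # raise this instead of the batch size when memory is tight
)

lora_config = LoraConfig(r=16, lora_alpha=32)   # start at r=16; try r=64 or higher for better quality
```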
For more end-to-end examples, visit the Liquid AI Cookbook.