Fine-tuning Gemma 2 with PEFT and LoRA
Full model fine-tuning requires updating billions of parameters, which is expensive and time-consuming. PEFT (Parameter-Efficient Fine-Tuning) and LoRA (Low-Rank Adaptation) solve this by updating only a small fraction of parameters while maintaining model quality.
Why PEFT and LoRA?
The Problem: Fine-tuning all 9 billion parameters of Gemma 2 9B requires:
- 36GB+ VRAM
- Hours of GPU time
- Storage for multiple model versions
The Solution: LoRA adds small trainable matrices alongside frozen weights:
- 99% fewer trainable parameters (from 9B to ~50M)
- 4-8x faster training
- 10x smaller checkpoint files
- Minimal accuracy trade-off
How LoRA Works
Instead of updating weight matrices W directly:
W' = W + ΔW
LoRA decomposes updates into low-rank factors:
W' = W + BA
Where:
- B and A are low-rank matrices: if W is d × k, then B is d × r and A is r × k, with rank r typically 8 or 16
- Only B and A are trained, not W
- During inference, you can merge B and A into W for zero overhead
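The parameter savings follow directly from those shapes. A plain-Python sketch with illustrative dimensions (Gemma's real projection matrices are larger; r=16 matches the recommendation below):

```python
# Toy arithmetic behind the LoRA savings.
d, k, r = 512, 512, 16   # W is d x k; B is d x r; A is r x k

full_params = d * k               # entries updated by full fine-tuning
lora_params = d * r + r * k       # entries trained by LoRA (B and A only)

print(full_params)                 # 262144
print(lora_params)                 # 16384
print(full_params // lora_params)  # 16 -> 16x fewer trainable parameters at this size
```

The ratio grows with the layer size: for square d × d matrices it is d / (2r), which is how a 9B-parameter model drops to tens of millions of trainable parameters.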
PEFT Configuration
The key hyperparameters:
r (rank): How much capacity to add
- r=8: Minimal memory, slightly lower quality
- r=16: Good balance (recommended)
- r=32: Higher quality but more memory
lora_alpha: Scaling factor controlling the LoRA contribution (the update BA is scaled by lora_alpha / r)
- Typically set equal to r, e.g. 16 with r=16
target_modules: Which layers to apply LoRA to
- For Gemma: q_proj, v_proj (query and value projections)
- Can expand to include k_proj, o_proj for more expressiveness
lora_dropout: Dropout applied to the LoRA layers (0.05-0.1) to prevent overfitting
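Assembled into a peft LoraConfig, the hyperparameters above look roughly like this (a sketch assuming the Hugging Face peft and transformers packages; the checkpoint name is illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

config = LoraConfig(
    r=16,                                 # rank: good balance of quality and memory
    lora_alpha=16,                        # scaling factor, matching r
    target_modules=["q_proj", "v_proj"],  # query and value projections
    lora_dropout=0.05,                    # prevents overfitting
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap a base model so only the LoRA matrices are trainable:
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b")
model = get_peft_model(model, config)
model.print_trainable_parameters()        # reports the trainable-parameter fraction
```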
Practical Implementation
Memory requirements:
- Full fine-tuning: 36GB+
- LoRA fine-tuning: 10GB
- LoRA + gradient checkpointing: 8GB
- LoRA + bitsandbytes (bnb) 8-bit quantization: 6GB
Training speed:
- Full fine-tuning: 2-4 hours (500 examples)
- LoRA fine-tuning: 15-30 minutes (same data)
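The lower memory tiers above correspond to load-time configuration. A hedged sketch, assuming the transformers, peft, and bitsandbytes packages (not a complete training script):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# 8-bit quantized load targets the ~6GB tier:
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model.gradient_checkpointing_enable()            # trade recompute for activation memory
model = prepare_model_for_kbit_training(model)   # freeze base weights, stabilize k-bit training
```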
Best Practices
- Start with standard hyperparameters: Use established PEFT defaults before experimenting
- Monitor training loss: Should decrease smoothly without oscillation
- Validate on held-out examples: Check quality during training, not just loss
- Save intermediate checkpoints: Can revert if overfitting occurs
- Quantize if needed: 8-bit quantization cuts memory use further with minimal quality loss
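The checkpoint-and-monitor practices can be sketched as a small loop (a hypothetical helper; in real training the save step would call the adapter's save method and the losses would come from an evaluation pass):

```python
def train_with_checkpoints(val_losses, patience=3):
    """Track the best checkpoint and stop when validation loss plateaus.

    val_losses: per-epoch validation losses (precomputed here for the sketch).
    Returns (best_epoch, best_loss).
    """
    best_loss, best_epoch, stale = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
            # real training: model.save_pretrained(f"checkpoint-{epoch}")
        else:
            stale += 1
            if stale >= patience:   # loss stopped improving: likely overfitting
                break
    return best_epoch, best_loss

# A loss curve that improves, then degrades (overfitting sets in after epoch 3):
print(train_with_checkpoints([2.1, 1.6, 1.3, 1.25, 1.3, 1.4, 1.5]))  # (3, 1.25)
```

Reverting to the epoch-3 checkpoint recovers the best model even though training continued past it.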
Merging for Deployment
After training, you have options:
Option 1: Keep LoRA Separate
- Load base model + LoRA adapter at inference
- Trade: ~50MB extra, slower load time
- Benefit: Easy to swap multiple LoRA adapters
Option 2: Merge LoRA into Base
- Fuse B and A matrices into original W
- Trade: No separate adapter flexibility
- Benefit: Identical inference speed to base model, simpler deployment
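The zero-overhead claim behind Option 2 can be checked with toy matrices: applying the fused weight W + BA to an input gives exactly the same output as the base path plus the adapter path (plain-Python sketch, shapes illustrative):

```python
def matmul(X, Y):
    """Naive matrix product of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (d x k)
B = [[0.5], [0.25]]            # d x r, with r = 1
A = [[2.0, -1.0]]              # r x k
x = [[1.0, -2.0]]              # one input row

# Option 2: fuse once, then a single plain matmul at inference.
merged_out = matmul(x, add(W, matmul(B, A)))
# Option 1: keep the adapter separate; compute x @ W + (x @ B) @ A.
adapter_out = add(matmul(x, W), matmul(matmul(x, B), A))

print(merged_out)    # [[-5.0, -6.0]]
print(adapter_out)   # [[-5.0, -6.0]] -- identical, so merging loses nothing
```

Because x(W + BA) = xW + (xB)A by distributivity, the merged model is mathematically identical to base-plus-adapter while doing one matmul instead of three.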
Advanced Techniques
Multi-adapter: Load multiple LoRA adapters simultaneously for different specializations
Continuation training: Fine-tune an already-fine-tuned model without catastrophic forgetting
Mixed-precision training: Use float16 to reduce memory while maintaining accuracy
Using Gemma 2 Through Apertis AI
While you can self-host fine-tuned Gemma 2 models, Apertis AI also provides hosted Gemma 2 access through a unified API. This is valuable if you want:
- Simple integration without deployment complexity
- Auto-failover to alternative models
- Built-in rate limiting and monitoring
Reference: DataCamp Tutorial on Fine-tuning Gemma 2