Fine-tuning Gemma 2 with PEFT and LoRA

Full model fine-tuning requires updating billions of parameters—expensive and time-consuming. PEFT (Parameter-Efficient Fine-Tuning) and LoRA (Low-Rank Adaptation) solve this by updating only a small fraction of parameters while maintaining model quality.

Why PEFT and LoRA?

The Problem: Fine-tuning all of Gemma 2 9B's parameters requires:

  • 36GB+ VRAM
  • Hours of GPU time
  • Storage for multiple model versions

The Solution: LoRA adds small trainable matrices alongside frozen weights:

  • 99% fewer trainable parameters (from 9B to ~50M)
  • 4-8x faster training
  • 10x smaller checkpoint files
  • Minimal accuracy trade-off

How LoRA Works

Instead of updating weight matrices W directly:

W' = W + ΔW

LoRA decomposes updates into low-rank factors:

W' = W + BA

Where:

  • A (r × d) and B (d × r) are small matrices whose product BA has the same shape as W
  • The rank r is small (typically 8 or 16), so B and A together hold far fewer parameters than W
  • Only B and A are trained; W stays frozen
  • During inference, you can merge B and A into W for zero overhead
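The decomposition above can be sketched numerically. This is an illustrative example with made-up dimensions, not Gemma's actual layer sizes:

```python
import numpy as np

d, r = 1024, 16                      # hidden size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight
B = np.zeros((d, r))                 # initialized to zero so BA = 0 at start
A = rng.normal(size=(r, d)) * 0.01   # small random init

# Effective weight: W' = W + BA
W_prime = W + B @ A

# Parameter comparison: full update vs. low-rank update
full_params = d * d                  # 1,048,576 trainable values
lora_params = d * r + r * d          # 32,768 → ~3% of the full matrix
print(full_params, lora_params)
```

Because B starts at zero, W' equals W before any training: the adapter is a no-op until gradient updates move B and A away from their initialization.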

PEFT Configuration

The key hyperparameters:

r (rank): How much capacity to add

  • r=8: Minimal memory, slightly lower quality
  • r=16: Good balance (recommended)
  • r=32: Higher quality but more memory

lora_alpha: Scaling factor controlling the LoRA contribution; the update BA is scaled by lora_alpha / r

  • Typically set equal to r (e.g. 16 with r=16), giving a scale of 1

target_modules: Which layers to apply LoRA to

  • For Gemma: q_proj, v_proj (query and value projections)
  • Can expand to include k_proj, o_proj for more expressiveness

lora_dropout: Dropout applied to the LoRA layers (0.05-0.1) prevents overfitting
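Putting these hyperparameters together, a typical configuration with the Hugging Face peft library looks roughly like this (a sketch; adjust values to your task and hardware):

```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,                                 # rank: the recommended balance point
    lora_alpha=16,                        # scaling factor, matching r
    target_modules=["q_proj", "v_proj"],  # query and value projections
    lora_dropout=0.05,                    # regularization against overfitting
    task_type=TaskType.CAUSAL_LM,
)

# Attach to a loaded base model:
# from peft import get_peft_model
# model = get_peft_model(base_model, lora_config)
# model.print_trainable_parameters()  # reports trainable vs. total params
```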

Practical Implementation

Memory requirements:

  • Full fine-tuning: 36GB+
  • LoRA fine-tuning: 10GB
  • LoRA + gradient checkpointing: 8GB
  • LoRA + bnb quantization (8-bit): 6GB
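The 8-bit figure above corresponds to loading the base model quantized before attaching LoRA. A sketch of that setup with transformers, bitsandbytes, and peft (assumes a CUDA GPU and access to the `google/gemma-2-9b` checkpoint):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    quantization_config=bnb_config,   # weights stored in 8-bit
    device_map="auto",
)

# Prepares the quantized model for training (enables gradient
# checkpointing by default, which covers the 8GB row above)
model = prepare_model_for_kbit_training(model)

model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
))
```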

Training speed:

  • Full fine-tuning: 2-4 hours (500 examples)
  • LoRA fine-tuning: 15-30 minutes (same data)

Best Practices

  1. Start with standard hyperparameters: Use established PEFT defaults before experimenting
  2. Monitor training loss: Should decrease smoothly without oscillation
  3. Validate on held-out examples: Check quality during training, not just loss
  4. Save intermediate checkpoints: Can revert if overfitting occurs
  5. Quantize if needed: 8-bit quantization cuts memory use further with minimal quality loss
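Practices 2-4 amount to a simple loop discipline: validate as you go and keep the best checkpoint so you can revert. A minimal pure-Python sketch, where `train_step`, `eval_loss`, and `save_checkpoint` are placeholders for your actual training code:

```python
def train_with_checkpoints(steps, train_step, eval_loss, save_checkpoint):
    """Track validation loss and remember the best checkpoint,
    so overfitting in later steps can be rolled back."""
    best_loss, best_step = float("inf"), None
    for step in range(steps):
        train_step(step)
        loss = eval_loss(step)            # validate on held-out examples
        if loss < best_loss:
            best_loss, best_step = loss, step
            save_checkpoint(step)         # keep the best adapter so far
    return best_step, best_loss

# Toy usage: loss improves, then starts rising (overfitting)
losses = [2.0, 1.5, 1.2, 1.3, 1.6]
saved = []
best_step, best_loss = train_with_checkpoints(
    5,
    train_step=lambda s: None,
    eval_loss=lambda s: losses[s],
    save_checkpoint=saved.append,
)
print(best_step, best_loss)   # best checkpoint is step 2
```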

Merging for Deployment

After training, you have options:

Option 1: Keep LoRA Separate

  • Load base model + LoRA adapter at inference
  • Trade: ~50MB extra, slower load time
  • Benefit: Easy to swap multiple LoRA adapters

Option 2: Merge LoRA into Base

  • Fuse B and A matrices into original W
  • Trade: No separate adapter flexibility
  • Benefit: Identical inference speed to base model, simpler deployment
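The merge in Option 2 is just matrix addition, so the merged weight computes exactly the same outputs as base model plus adapter. A numpy sketch with illustrative dimensions, including the alpha/r scaling applied to the update:

```python
import numpy as np

d, r, alpha = 64, 8, 8
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d))            # base weight
A = rng.normal(size=(r, d))            # trained LoRA factors
B = rng.normal(size=(d, r))
scale = alpha / r                      # update is scaled by alpha / r

x = rng.normal(size=d)                 # an input activation

# Option 1: keep adapter separate (extra matmuls per layer)
y_adapter = W @ x + scale * (B @ (A @ x))

# Option 2: merge once, then a single matmul at inference
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))   # True: identical outputs
```

With peft, the equivalent one-liner on a trained model is `model.merge_and_unload()`, which fuses the adapter and returns a plain base-architecture model.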

Advanced Techniques

Multi-adapter: Load multiple LoRA adapters simultaneously for different specializations

Continuation training: Fine-tune an already-fine-tuned model; keeping the base weights frozen reduces the risk of catastrophic forgetting

Mixed-precision training: Use float16 to reduce memory while maintaining accuracy

Using Gemma 2 Through Apertis AI

While you can self-host fine-tuned Gemma 2 models, Apertis AI also provides hosted Gemma 2 access through a unified API. This is valuable if you want:

  • Simple integration without deployment complexity
  • Auto-failover to alternative models
  • Built-in rate limiting and monitoring

Reference: DataCamp Tutorial on Fine-tuning Gemma 2