Fine-tuning Gemma 2 with PEFT and LoRA

Full model fine-tuning requires updating billions of parameters—expensive and time-consuming. PEFT (Parameter-Efficient Fine-Tuning) and LoRA (Low-Rank Adaptation) solve this by updating only a small fraction of parameters while maintaining model quality.

Why PEFT and LoRA?

The Problem: Fine-tuning all of Gemma 2 9B's parameters requires:

  • 36GB+ VRAM
  • Hours of GPU time
  • Storage for multiple model versions

The Solution: LoRA adds small trainable matrices alongside frozen weights:

  • 99% fewer trainable parameters (from 9B to ~50M)
  • 4-8x faster training
  • 10x smaller checkpoint files
  • Minimal accuracy trade-off

How LoRA Works

Instead of updating weight matrices W directly:

W' = W + ΔW

LoRA decomposes updates into low-rank factors:

W' = W + BA

Where:

  • A (r × d) and B (d × r) are small matrices whose product BA has the same shape as W
  • The rank r is small (typically 8 or 16), so B and A together hold far fewer parameters than W
  • Only B and A are trained; W stays frozen
  • During inference, you can merge B and A into W for zero overhead
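The decomposition above can be sketched numerically. This is an illustrative example with made-up dimensions, not Gemma's actual layer sizes:

```python
import numpy as np

d, r = 1024, 16                      # hidden size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight
B = np.zeros((d, r))                 # initialized to zero so BA = 0 at start
A = rng.normal(size=(r, d)) * 0.01   # small random init

# Effective weight: W' = W + BA
W_prime = W + B @ A

# Parameter comparison: full update vs. low-rank update
full_params = d * d                  # 1,048,576 trainable values
lora_params = d * r + r * d          # 32,768 → ~3% of the full matrix
print(full_params, lora_params)
```

Because B starts at zero, W' equals W before any training: the adapter is a no-op until gradient updates move B and A away from their initialization.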

PEFT Configuration

The key hyperparameters:

r (rank): How much capacity to add

  • r=8: Minimal memory, slightly lower quality
  • r=16: Good balance (recommended)
  • r=32: Higher quality but more memory

lora_alpha: Scaling factor controlling the LoRA contribution; the update BA is scaled by lora_alpha / r

  • Typically set equal to r (e.g. 16 with r=16), giving a scale of 1

target_modules: Which layers to apply LoRA to

  • For Gemma: q_proj, v_proj (query and value projections)
  • Can expand to include k_proj, o_proj for more expressiveness

lora_dropout: Dropout applied to the LoRA layers (0.05-0.1) prevents overfitting
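Putting these hyperparameters together, a typical configuration with the Hugging Face peft library looks roughly like this (a sketch; adjust values to your task and hardware):

```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,                                 # rank: the recommended balance point
    lora_alpha=16,                        # scaling factor, matching r
    target_modules=["q_proj", "v_proj"],  # query and value projections
    lora_dropout=0.05,                    # regularization against overfitting
    task_type=TaskType.CAUSAL_LM,
)

# Attach to a loaded base model:
# from peft import get_peft_model
# model = get_peft_model(base_model, lora_config)
# model.print_trainable_parameters()  # reports trainable vs. total params
```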

Practical Implementation

Memory requirements:

  • Full fine-tuning: 36GB+
  • LoRA fine-tuning: 10GB
  • LoRA + gradient checkpointing: 8GB
  • LoRA + bnb quantization (8-bit): 6GB
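The 8-bit figure above corresponds to loading the base model quantized before attaching LoRA. A sketch of that setup with transformers, bitsandbytes, and peft (assumes a CUDA GPU and access to the `google/gemma-2-9b` checkpoint):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    quantization_config=bnb_config,   # weights stored in 8-bit
    device_map="auto",
)

# Prepares the quantized model for training (enables gradient
# checkpointing by default, which covers the 8GB row above)
model = prepare_model_for_kbit_training(model)

model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
))
```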

Training speed:

  • Full fine-tuning: 2-4 hours (500 examples)
  • LoRA fine-tuning: 15-30 minutes (same data)

Best Practices

  1. Start with standard hyperparameters: Use established PEFT defaults before experimenting
  2. Monitor training loss: Should decrease smoothly without oscillation
  3. Validate on held-out examples: Check quality during training, not just loss
  4. Save intermediate checkpoints: Can revert if overfitting occurs
  5. Quantize if needed: 8-bit quantization cuts memory use further with minimal quality loss
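Practices 2-4 amount to a simple loop discipline: validate as you go and keep the best checkpoint so you can revert. A minimal pure-Python sketch, where `train_step`, `eval_loss`, and `save_checkpoint` are placeholders for your actual training code:

```python
def train_with_checkpoints(steps, train_step, eval_loss, save_checkpoint):
    """Track validation loss and remember the best checkpoint,
    so overfitting in later steps can be rolled back."""
    best_loss, best_step = float("inf"), None
    for step in range(steps):
        train_step(step)
        loss = eval_loss(step)            # validate on held-out examples
        if loss < best_loss:
            best_loss, best_step = loss, step
            save_checkpoint(step)         # keep the best adapter so far
    return best_step, best_loss

# Toy usage: loss improves, then starts rising (overfitting)
losses = [2.0, 1.5, 1.2, 1.3, 1.6]
saved = []
best_step, best_loss = train_with_checkpoints(
    5,
    train_step=lambda s: None,
    eval_loss=lambda s: losses[s],
    save_checkpoint=saved.append,
)
print(best_step, best_loss)   # best checkpoint is step 2
```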

Merging for Deployment

After training, you have options:

Option 1: Keep LoRA Separate

  • Load base model + LoRA adapter at inference
  • Trade: ~50MB extra, slower load time
  • Benefit: Easy to swap multiple LoRA adapters

Option 2: Merge LoRA into Base

  • Fuse B and A matrices into original W
  • Trade: No separate adapter flexibility
  • Benefit: Identical inference speed to base model, simpler deployment
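The merge in Option 2 is just matrix addition, so the merged weight computes exactly the same outputs as base model plus adapter. A numpy sketch with illustrative dimensions, including the alpha/r scaling applied to the update:

```python
import numpy as np

d, r, alpha = 64, 8, 8
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d))            # base weight
A = rng.normal(size=(r, d))            # trained LoRA factors
B = rng.normal(size=(d, r))
scale = alpha / r                      # update is scaled by alpha / r

x = rng.normal(size=d)                 # an input activation

# Option 1: keep adapter separate (extra matmuls per layer)
y_adapter = W @ x + scale * (B @ (A @ x))

# Option 2: merge once, then a single matmul at inference
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))   # True: identical outputs
```

With peft, the equivalent one-liner on a trained model is `model.merge_and_unload()`, which fuses the adapter and returns a plain base-architecture model.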

Advanced Techniques

Multi-adapter: Load multiple LoRA adapters simultaneously for different specializations

Continuation training: Fine-tune an already-fine-tuned model; keeping the base weights frozen reduces the risk of catastrophic forgetting

Mixed-precision training: Use float16 to reduce memory while maintaining accuracy

Using Gemma 2 Through Apertis AI

While you can self-host fine-tuned Gemma 2 models, Apertis AI also provides hosted Gemma 2 access through a unified API. This is valuable if you want:

  • Simple integration without deployment complexity
  • Auto-failover to alternative models
  • Built-in rate limiting and monitoring

Reference: DataCamp Tutorial on Fine-tuning Gemma 2