RAFT: Retrieval-Augmented Fine-Tuning for Domain-Specific RAG
Fine-tuning an LLM on domain data is valuable, but standard approaches miss a critical challenge: LLMs trained on clean data struggle with noisy retrieval results. RAFT solves this by deliberately mixing gold-standard documents with irrelevant ones during training, teaching models to extract signal from noise, a skill they'll need in production.
The Core Problem
When you build RAG systems, you face two challenges:
Challenge 1: Domain Knowledge Gap. Pre-trained LLMs lack specialized knowledge. Fine-tuning helps, but traditional SFT doesn't prepare models for imperfect retrieval.
Challenge 2: Retrieval Imperfection. Real retrieval systems are noisy: a query may match documents that share keywords with the answer but come from the wrong context. LLMs need to learn:
- How to recognize when retrieved documents don't contain the answer
- When to ignore noisy documents
- How to reason with incomplete information
Standard RAG fine-tuning ignores this problem, assuming all retrieved documents are relevant. RAFT fixes it.
How RAFT Works
Training Data Structure
RAFT uses a mixed training strategy:
P% of samples (typically 80%):
- Contains: Question + Gold Document + Distractors + Answer
- Purpose: Learn domain knowledge AND document relevance
(1-P)% of samples (typically 20%):
- Contains: Question + Only Distractors (no gold document)
- Purpose: Learn to answer without help, preventing over-reliance on retrieval
Example:
Sample 1 (gold + distractors):
Q: "What is Apertis?"
Gold: "Apertis is an AI API gateway offering 500+ models..."
Distractors: [irrelevant docs about other APIs]
A: "Apertis is an AI API gateway..."
Sample 2 (distractors only):
Q: "What is Apertis?"
Distractors: [irrelevant docs, no gold document]
A: "Apertis is an AI API gateway..." (answer from base knowledge)
Training Process
- Extract entities and key terms from questions
- Build retrieval queries using both entity-based and semantic search
- Collect positive (gold) documents that answer the question
- Collect negative (distractor) documents that match keywords but don't answer
- Mix in ratio (80% gold+distractors, 20% distractors-only)
- Fine-tune LLM with chain-of-thought answers
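Steps 2-4 above amount to hybrid retrieval. A sketch of the scoring logic, assuming a BM25 index for the entity/keyword side and a sentence-transformers encoder for the semantic side (both library choices are assumptions, not what the paper mandates):

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = ["Apertis is an AI API gateway offering 500+ models...", "..."]

bm25 = BM25Okapi([doc.split() for doc in corpus])   # keyword/entity matching
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # semantic matching
doc_embs = encoder.encode(corpus, normalize_embeddings=True)

def hybrid_scores(question: str, alpha: float = 0.5) -> np.ndarray:
    """Blend keyword and semantic relevance into one score per document."""
    kw = np.array(bm25.get_scores(question.split()))
    kw = kw / (kw.max() + 1e-9)                     # scale to [0, 1]
    sem = doc_embs @ encoder.encode(question, normalize_embeddings=True)
    return alpha * kw + (1 - alpha) * sem
```

High-scoring documents that contain the answer become gold candidates; high-scoring documents that don't become strong distractors.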
Chain-of-Thought Answers
Rather than direct answers, RAFT uses reasoning chains:
Instead of: A: "American"
Use: A: "Trump was born in New York, which is in the United States, therefore his nationality is American."
Benefits:
- Teaches explicit reasoning steps
- Helps the generator show its work
- Improves generalization to harder questions
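A concrete training target could look like the string below. The ##-style delimiters are loosely modeled on the RAFT authors' released data, but treat the exact marker strings as an assumption; any consistent convention works:

```python
# CoT target: quote the evidence, reason over it, then answer.
cot_answer = (
    "##Reason: The document states ##begin_quote## Trump was born in "
    "New York ##end_quote##, and New York is in the United States, "
    "therefore his nationality is American. "
    "##Answer: American"
)
```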
Key Research Findings
Finding 1: Optimal Gold Document Ratio
Counter-intuitive result: P=80% outperforms P=100%
When all documents are gold-standard, models become overconfident and brittle. Adding 20% distractors forces the model to be more careful: it learns to discriminate between relevant and irrelevant information, which transfers to production.
Visual: Accuracy vs. Gold Document Percentage

Accuracy
  ^
  |          /\
  |         /  \____
  |        /
  |_______/______________ Gold Document %
          0%   80%  100%
Finding 2: Chain-of-Thought Improves Generalization
Models trained with reasoning chains:
- Achieve higher accuracy on complex questions
- Avoid overfitting to training questions
- Handle variations in question wording more reliably
Finding 3: Robustness to Distraction
Models trained with distractors are resilient to noisy retrieval. When tested with:
- 0 distractors: Standard performance
- 5-10 distractors: Minimal degradation
- 20+ distractors: Better robustness than baseline
This is critical for production where retrieval quality varies.
RAFT vs. Baselines
Baseline Approaches
LLaMA2-7B + 0-shot
- No fine-tuning, no retrieval
- Fast but limited accuracy
LLaMA2-7B + RAG
- Uses retrieval but no fine-tuning
- Fails on domain-specific questions
Domain-Specific Fine-tuning (DSF)
- Fine-tuned on domain data but no retrieval during training
- Doesn't learn to use retrieved documents effectively
DSF + RAG
- Fine-tuned + uses retrieval
- Better but still vulnerable to noisy retrieval
RAFT
- Fine-tuned with mixed gold/noisy examples
- Uses chain-of-thought
- Most robust to real-world retrieval noise
Performance Pattern
Accuracy with varying numbers of distractors:
RAFT:     degrades gracefully as distractors increase
DSF+RAG:  accuracy drops sharply once noise appears
DSF:      flat; doesn't use retrieval at all
Implementation Strategy
Step 1: Data Preparation
For each domain question:
1. Find correct answer document (oracle)
2. Retrieve N other documents (distractors)
3. Build the training sample, assigning each question to one of two buckets:
- 80% of questions: Q + oracle + distractors
- 20% of questions: Q + distractors only
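A minimal sketch of this step, assuming you already have one oracle document and a pool of keyword-matched distractors per question (function and field names are mine, not the paper's):

```python
import random

def build_raft_sample(question, gold_doc, distractor_pool, cot_answer,
                      p_gold=0.8, num_distractors=4, rng=random):
    """Build one RAFT training sample.

    With probability p_gold the oracle document is mixed into the context;
    otherwise the context is distractors only, forcing the model to answer
    from its own weights.
    """
    context = rng.sample(distractor_pool, k=num_distractors)
    if rng.random() < p_gold:
        context = context + [gold_doc]
    rng.shuffle(context)  # the oracle's position must carry no signal
    return {"question": question, "context": context, "answer": cot_answer}
```

The shuffle matters: if the oracle always lands in the same slot, the model learns position rather than relevance.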
Step 2: Generate Chain-of-Thought Answers
Use an existing strong LLM to generate reasoning:
Prompt: "Question: {q}
Documents: {docs}
Please answer step by step, showing your reasoning."
Or annotate manually for highest quality.
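For the automated route, a sketch using an OpenAI-compatible client as the teacher (the client and model name are assumptions; substitute any strong LLM you have access to):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_cot_answer(question: str, docs: list[str],
                        model: str = "gpt-4o") -> str:
    """Ask a strong teacher model for a step-by-step, grounded answer."""
    prompt = (
        f"Question: {question}\n"
        "Documents:\n" + "\n\n".join(docs) + "\n\n"
        "Please answer step by step, showing your reasoning."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```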
Step 3: Fine-tune
Use standard LLM fine-tuning with your prepared data:
- Model: LLaMA2-7B or similar
- Learning rate: 2e-4
- Epochs: 3-5
- Batch size: 16-32
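A sketch of this step with Hugging Face TRL and LoRA (the library choice and prompt rendering are assumptions; hyperparameters mirror the list above):

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

def render(sample):
    """Flatten a RAFT record into a single training string."""
    docs = "\n\n".join(d["text"] for d in sample["context"])
    return {"text": (f"Question: {sample['question']}\n"
                     f"Documents:\n{docs}\n"
                     f"Answer: {sample['answer']}")}

# raft_samples: the list of records produced in Step 1.
train_dataset = Dataset.from_list([render(s) for s in raft_samples])

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="raft-llama2-7b",
        learning_rate=2e-4,
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```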
Step 4: Evaluate
Test on held-out questions with varying distractor counts:
- 0 distractors (best case)
- 5 distractors (typical)
- 10+ distractors (stress test)
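A simple harness for this sweep, where answer_fn wraps your fine-tuned model's inference (the wrapper and the substring-match metric are placeholder assumptions):

```python
import random

def evaluate_robustness(answer_fn, eval_set, distractor_pool,
                        distractor_counts=(0, 5, 10)):
    """Accuracy as a function of how many distractors pad the context."""
    results = {}
    for k in distractor_counts:
        correct = 0
        for item in eval_set:
            docs = [item["gold_doc"]] + random.sample(distractor_pool, k=k)
            random.shuffle(docs)  # hide the gold document's position
            prediction = answer_fn(item["question"], docs)
            correct += item["expected"].lower() in prediction.lower()
        results[k] = correct / len(eval_set)
    return results  # maps distractor count -> accuracy
```

A RAFT-trained model should show a shallow decline across the sweep; a brittle model drops sharply as the distractor count grows.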
Practical Advantages
Works with any LLM: No special architecture needed
Simple to implement: Standard fine-tuning process, just different data prep
Production-ready: models train under the same noisy-retrieval conditions they face in deployment
Scalable: Works with 7B models; even better with larger models
Use RAFT with Apertis AI
Build RAFT systems using:
- Self-hosted LLM (fine-tune locally or on your infrastructure)
- Apertis AI for retrieval + generation layer: Access multiple LLMs if needed
- Hybrid approach: Use Apertis for fallback generation while your RAFT model handles primary queries
This gives you domain specialization without being locked into single-model deployments.
When to Use RAFT
Perfect for:
- Domain-specific Q&A (legal, medical, financial documents)
- Situations where retrieval is imperfect
- When model accuracy matters more than latency
- Teams with labeled training data
Less ideal for:
- Real-time requirements (fine-tuning takes time)
- Very broad general knowledge (domain advantage is lost)
- Rapidly changing knowledge (frequent retraining needed)
Reference: Zhang et al., "RAFT: Adapting Language Model to Domain Specific RAG," arXiv:2403.10131 (2024).