RAFT: Retrieval-Augmented Fine-Tuning for Domain-Specific RAG

Fine-tuning an LLM on domain data is valuable, but standard approaches miss a critical challenge: LLMs trained on clean data struggle with noisy retrieval results. RAFT solves this by deliberately mixing gold-standard documents with irrelevant ones during training, teaching models to extract signal from noise, a skill they'll need in production.

The Core Problem

When you build RAG systems, you face two challenges:

Challenge 1: Domain Knowledge Gap

Pre-trained LLMs lack specialized knowledge. Fine-tuning helps, but traditional SFT doesn't prepare models for imperfect retrieval.

Challenge 2: Retrieval Imperfection

Real retrieval systems are noisy. Your query might match documents that share its keywords but lack the right context. LLMs need to learn:

  • How to recognize when retrieved documents don't contain the answer
  • When to ignore noisy documents
  • How to reason with incomplete information

Standard RAG fine-tuning ignores this problem, assuming all retrieved documents are relevant. RAFT fixes it.

How RAFT Works

Training Data Structure

RAFT uses a mixed training strategy:

P% of samples (typically 80%):

  • Contains: Question + Gold Document + Distractors + Answer
  • Purpose: Learn domain knowledge AND document relevance

(100 - P)% of samples (typically 20%):

  • Contains: Question + Only Distractors (no gold document)
  • Purpose: Learn to answer without help, preventing over-reliance on retrieval

Example:

Sample 1 (gold + distractors):
Q: "What is Apertis?"
Gold: "Apertis is an AI API gateway offering 500+ models..."
Distractors: [irrelevant docs about other APIs]
A: "Apertis is an AI API gateway..."

Sample 2 (distractors only):
Q: "What is Apertis?"
Distractors: [irrelevant docs, no gold document]
A: "Apertis is an AI API gateway..." (answer from base knowledge)

Training Process

  1. Extract entities and key terms from questions
  2. Build retrieval queries using both entity-based and semantic search
  3. Collect positive (gold) documents that answer the question
  4. Collect negative (distractor) documents that match keywords but don't answer
  5. Mix in ratio (80% gold+distractors, 20% distractors-only)
  6. Fine-tune LLM with chain-of-thought answers

Chain-of-Thought Answers

Rather than direct answers, RAFT uses reasoning chains:

Instead of: A: "American"

Use: A: "Trump was born in New York, which is in the United States,
       therefore his nationality is American."
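
The paper also has the reasoning chain quote its supporting evidence directly from the context. A training target in that spirit might look like the following; the quote markers follow the paper's prompt format, but treat the wording here as illustrative:

Reasoning: The context states ##begin_quote## Trump was born in New
York ##end_quote##, and New York is in the United States, so his
nationality is American.
Answer: American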

Benefits:

  • Teaches explicit reasoning steps
  • Helps the generator show its work
  • Improves generalization to harder questions

Key Research Findings

Finding 1: Optimal Gold Document Ratio

Counter-intuitive result: P=80% outperforms P=100%

When all documents are gold-standard, models become overconfident and brittle. Adding 20% distractors forces the model to be more careful: it learns to discriminate between relevant and irrelevant information, which transfers to production.

Visual: Accuracy vs Gold Document Percentage

Accuracy
   ^
   |     ╱╲
   |    ╱  ╲
   |   ╱    ╲____
   |  ╱
   |________________→ Gold Document %
   0%    80%   100%

Finding 2: Chain-of-Thought Improves Generalization

Models trained with reasoning chains:

  • Achieve higher accuracy on complex questions
  • Avoid overfitting to training questions
  • Better handle variations in wording

Finding 3: Robustness to Distraction

Models trained with distractors are resilient to noisy retrieval. When tested with:

  • 0 distractors: Standard performance
  • 5-10 distractors: Minimal degradation
  • 20+ distractors: Better robustness than baseline

This is critical for production where retrieval quality varies.

RAFT vs. Baselines

Baseline Approaches

LLaMA2-7B + 0-shot

  • No fine-tuning, no retrieval
  • Fast but limited accuracy

LLaMA2-7B + RAG

  • Uses retrieval but no fine-tuning
  • Fails on domain-specific questions

Domain-Specific Fine-tuning (DSF)

  • Fine-tuned on domain data but no retrieval during training
  • Doesn't learn to use retrieved documents effectively

DSF + RAG

  • Fine-tuned + uses retrieval
  • Better but still vulnerable to noisy retrieval

RAFT

  • Fine-tuned with mixed gold/noisy examples
  • Uses chain-of-thought
  • Most robust to real-world retrieval noise

Performance Pattern

Accuracy with varying numbers of distractors:

RAFT    ████████░  (degrades gracefully)
DSF+RAG ████░░░░░  (breaks with noise)
DSF     ██░░░░░░░  (doesn't use retrieval)

Implementation Strategy

Step 1: Data Preparation

For each domain question:
  1. Find the document that answers it (the oracle/gold document)
  2. Retrieve N other documents (distractors)
  3. Create one training sample per question, mixing across the dataset:
     - 80% of questions: Q + oracle + distractors
     - 20% of questions: Q + distractors only
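
Good distractors look relevant without answering the question. Below is a hedged sketch that ranks candidates by keyword overlap with the question; a production pipeline would use BM25 or dense embeddings instead, and corpus/gold_id are assumed inputs:

def pick_distractors(question, corpus, gold_id, n=4):
    """Rank non-gold documents by shared keywords with the question.

    corpus: list of (doc_id, text) pairs; gold_id: id of the oracle doc.
    Keyword overlap is a stand-in for a real retriever.
    """
    q_terms = set(question.lower().split())
    scored = []
    for doc_id, text in corpus:
        if doc_id == gold_id:
            continue  # never leak the oracle into the distractor pool
        overlap = len(q_terms & set(text.lower().split()))
        scored.append((overlap, str(doc_id), text))
    scored.sort(reverse=True)  # most question-like first = hardest negatives
    return [text for _, _, text in scored[:n]]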

Step 2: Generate Chain-of-Thought Answers

Use an existing strong LLM to generate reasoning:

Prompt: "Question: {q}
Documents: {docs}
Please answer step by step, showing your reasoning."

Or annotate manually for highest quality.
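
A minimal sketch of that generation step using the OpenAI Python SDK as the strong LLM; any capable model works, and the model name is a placeholder:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_cot_answer(question: str, docs: list[str], model: str = "gpt-4o") -> str:
    context = "\n\n".join(docs)
    prompt = (
        f"Question: {question}\n"
        f"Documents: {context}\n"
        "Please answer step by step, showing your reasoning."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels are easier to audit
    )
    return resp.choices[0].message.content

Spot-check the generated answers before fine-tuning on them; a wrong reasoning chain teaches the wrong lesson.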

Step 3: Fine-tune

Use standard LLM fine-tuning with your prepared data:

  • Model: LLaMA2-7B or similar
  • Learning rate: 2e-4
  • Epochs: 3-5
  • Batch size: 16-32
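
A hedged sketch with Hugging Face TRL, assuming the prepared data sits in raft_train.jsonl with one {"text": ...} record per line (prompt and CoT answer concatenated). TRL's API shifts between versions, so treat this as a starting point; note that a 2e-4 learning rate usually implies LoRA-style adapters rather than full fine-tuning:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="raft_train.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # or any causal LM you can host
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="raft-llama2-7b",
        learning_rate=2e-4,
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
)
trainer.train()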

Step 4: Evaluate

Test on held-out questions with varying distractor counts:

Evaluate with:
  - 0 distractors (best case)
  - 5 distractors (typical)
  - 10+ distractors (stress test)
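
A sketch of that sweep with exact-match accuracy as the metric; eval_set and model_answer are stand-ins for your held-out data and inference call:

def accuracy_at(eval_set, n_distractors, model_answer):
    """Exact-match accuracy with the gold doc plus n distractors in context.

    eval_set: dicts with "question", "gold_doc", "distractor_pool", "answer".
    model_answer: callable (question, docs) -> predicted answer string.
    """
    correct = 0
    for ex in eval_set:
        docs = [ex["gold_doc"]] + ex["distractor_pool"][:n_distractors]
        pred = model_answer(ex["question"], docs)
        correct += pred.strip().lower() == ex["answer"].strip().lower()
    return correct / len(eval_set)

# Sweep the stress levels above:
# for n in (0, 5, 10):
#     print(n, accuracy_at(eval_set, n, model_answer))

If the drop from 0 to 10 distractors is small, the distractor training did its job.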

Practical Advantages

Works with any LLM: No special architecture needed

Simple to implement: Standard fine-tuning process, just different data prep

Production-ready: Models trained with distractors tolerate the noisy retrieval they will actually see in deployment

Scalable: Works with 7B models; even better with larger models

Use RAFT with Apertis AI

Build RAFT systems using:

  1. Self-hosted LLM (fine-tune locally or on your infrastructure)
  2. Apertis AI as the retrieval + generation layer: access multiple LLMs through one API if needed
  3. Hybrid approach: Use Apertis for fallback generation while your RAFT model handles primary queries

This gives you domain specialization without being locked into single-model deployments.
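
One hypothetical shape for that hybrid: try the local RAFT model first and fall back to a hosted model behind the gateway. The endpoint URLs and model names below are placeholders, not documented Apertis values, and both clients assume OpenAI-compatible APIs:

from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # e.g. a vLLM server
gateway = OpenAI(base_url="https://gateway.example.com/v1", api_key="YOUR_KEY")

def answer(question: str, context: str) -> str:
    prompt = f"Documents:\n{context}\n\nQuestion: {question}\nAnswer:"
    messages = [{"role": "user", "content": prompt}]
    try:
        # Primary: the domain-specialized RAFT model served locally
        resp = local.chat.completions.create(
            model="raft-llama2-7b", messages=messages, timeout=10
        )
    except Exception:
        # Fallback: a general-purpose hosted model via the gateway
        resp = gateway.chat.completions.create(
            model="general-purpose-model", messages=messages
        )
    return resp.choices[0].message.content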

When to Use RAFT

Perfect for:

  • Domain-specific Q&A (legal, medical, financial documents)
  • Situations where retrieval is imperfect
  • When model accuracy matters more than latency
  • Teams with labeled training data

Less ideal for:

  • Tight timelines (data prep and fine-tuning take time)
  • Very broad general knowledge (domain advantage is lost)
  • Rapidly changing knowledge (frequent retraining needed)

Reference: Zhang et al., "RAFT: Adapting Language Model to Domain Specific RAG," arXiv:2403.10131