RAFT: Retrieval-Augmented Fine-Tuning for Domain-Specific RAG

Fine-tuning an LLM on domain data is valuable, but standard approaches miss a critical challenge: LLMs trained on clean data struggle with noisy retrieval results. RAFT solves this by deliberately mixing gold-standard documents with irrelevant ones during training, teaching models to extract signal from noise, a skill they'll need in production.

The Core Problem

When you build RAG systems, you face two challenges:

Challenge 1: Domain Knowledge Gap

Pre-trained LLMs lack specialized knowledge. Fine-tuning helps, but traditional SFT doesn't prepare models for imperfect retrieval.

Challenge 2: Retrieval Imperfection

Real retrieval systems are noisy. Your query might match documents that share its keywords but lack the right context. LLMs need to learn:

  • How to recognize when retrieved documents don't contain the answer
  • When to ignore noisy documents
  • How to reason with incomplete information

Standard RAG fine-tuning ignores this problem, assuming all retrieved documents are relevant. RAFT fixes it.

How RAFT Works

Training Data Structure

RAFT uses a mixed training strategy:

P% of samples (typically 80%):

  • Contains: Question + Gold Document + Distractors + Answer
  • Purpose: Learn domain knowledge AND document relevance

(100 - P)% of samples (typically 20%):

  • Contains: Question + Only Distractors (no gold document)
  • Purpose: Learn to answer without help, preventing over-reliance on retrieval

Example:

Sample 1 (gold + distractors):
Q: "What is Apertis?"
Gold: "Apertis is an AI API gateway offering 500+ models..."
Distractors: [irrelevant docs about other APIs]
A: "Apertis is an AI API gateway..."

Sample 2 (distractors only):
Q: "What is Apertis?"
Distractors: [irrelevant docs, no gold document]
A: "Apertis is an AI API gateway..." (answer from base knowledge)

Training Process

  1. Extract entities and key terms from questions
  2. Build retrieval queries using both entity-based and semantic search
  3. Collect positive (gold) documents that answer the question
  4. Collect negative (distractor) documents that match keywords but don't answer
  5. Mix in ratio (80% gold+distractors, 20% distractors-only)
  6. Fine-tune LLM with chain-of-thought answers

Chain-of-Thought Answers

Rather than direct answers, RAFT uses reasoning chains:

Instead of: A: "American"

Use: A: "Trump was born in New York, which is in the United States,
       therefore his nationality is American."
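
The paper also has the reasoning chain quote its supporting evidence directly from the context. A training target in that spirit might look like the following; the quote markers follow the paper's prompt format, but treat the wording here as illustrative:

Reasoning: The context states ##begin_quote## Trump was born in New
York ##end_quote##, and New York is in the United States, so his
nationality is American.
Answer: American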

Benefits:

  • Teaches explicit reasoning steps
  • Helps the generator show its work
  • Improves generalization to harder questions

Key Research Findings

Finding 1: Optimal Gold Document Ratio

Counter-intuitive result: P=80% outperforms P=100%

When all documents are gold-standard, models become overconfident and brittle. Adding 20% distractors forces the model to be more careful: it learns to discriminate between relevant and irrelevant information, which transfers to production.

Visual: Accuracy vs Gold Document Percentage

Accuracy
   ^
   |     ╱╲
   |    ╱  ╲
   |   ╱    ╲____
   |  ╱
   |________________→ Gold Document %
   0%    80%   100%

Finding 2: Chain-of-Thought Improves Generalization

Models trained with reasoning chains:

  • Achieve higher accuracy on complex questions
  • Avoid overfitting to training questions
  • Better handle variations in wording

Finding 3: Robustness to Distraction

Models trained with distractors are resilient to noisy retrieval. When tested with:

  • 0 distractors: Standard performance
  • 5-10 distractors: Minimal degradation
  • 20+ distractors: Better robustness than baseline

This is critical for production where retrieval quality varies.

RAFT vs. Baselines

Baseline Approaches

LLaMA2-7B + 0-shot

  • No fine-tuning, no retrieval
  • Fast but limited accuracy

LLaMA2-7B + RAG

  • Uses retrieval but no fine-tuning
  • Fails on domain-specific questions

Domain-Specific Fine-tuning (DSF)

  • Fine-tuned on domain data but no retrieval during training
  • Doesn't learn to use retrieved documents effectively

DSF + RAG

  • Fine-tuned + uses retrieval
  • Better but still vulnerable to noisy retrieval

RAFT

  • Fine-tuned with mixed gold/noisy examples
  • Uses chain-of-thought
  • Most robust to real-world retrieval noise

Performance Pattern

Accuracy with varying numbers of distractors:

RAFT    ████████░  (degrades gracefully)
DSF+RAG ████░░░░░  (breaks with noise)
DSF     ██░░░░░░░  (doesn't use retrieval)

Implementation Strategy

Step 1: Data Preparation

For each domain question:
  1. Find the document that answers it (the oracle/gold document)
  2. Retrieve N other documents (distractors)
  3. Create one training sample per question, mixing across the dataset:
     - 80% of questions: Q + oracle + distractors
     - 20% of questions: Q + distractors only
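
Good distractors look relevant without answering the question. Below is a hedged sketch that ranks candidates by keyword overlap with the question; a production pipeline would use BM25 or dense embeddings instead, and corpus/gold_id are assumed inputs:

def pick_distractors(question, corpus, gold_id, n=4):
    """Rank non-gold documents by shared keywords with the question.

    corpus: list of (doc_id, text) pairs; gold_id: id of the oracle doc.
    Keyword overlap is a stand-in for a real retriever.
    """
    q_terms = set(question.lower().split())
    scored = []
    for doc_id, text in corpus:
        if doc_id == gold_id:
            continue  # never leak the oracle into the distractor pool
        overlap = len(q_terms & set(text.lower().split()))
        scored.append((overlap, str(doc_id), text))
    scored.sort(reverse=True)  # most question-like first = hardest negatives
    return [text for _, _, text in scored[:n]]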

Step 2: Generate Chain-of-Thought Answers

Use an existing strong LLM to generate reasoning:

Prompt: "Question: {q}
Documents: {docs}
Please answer step by step, showing your reasoning."

Or annotate manually for highest quality.
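
A minimal sketch of that generation step using the OpenAI Python SDK as the strong LLM; any capable model works, and the model name is a placeholder:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_cot_answer(question: str, docs: list[str], model: str = "gpt-4o") -> str:
    context = "\n\n".join(docs)
    prompt = (
        f"Question: {question}\n"
        f"Documents: {context}\n"
        "Please answer step by step, showing your reasoning."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels are easier to audit
    )
    return resp.choices[0].message.content

Spot-check the generated answers before fine-tuning on them; a wrong reasoning chain teaches the wrong lesson.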

Step 3: Fine-tune

Use standard LLM fine-tuning with your prepared data:

  • Model: LLaMA2-7B or similar
  • Learning rate: 2e-4
  • Epochs: 3-5
  • Batch size: 16-32
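
A hedged sketch with Hugging Face TRL, assuming the prepared data sits in raft_train.jsonl with one {"text": ...} record per line (prompt and CoT answer concatenated). TRL's API shifts between versions, so treat this as a starting point; note that a 2e-4 learning rate usually implies LoRA-style adapters rather than full fine-tuning:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="raft_train.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # or any causal LM you can host
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="raft-llama2-7b",
        learning_rate=2e-4,
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
)
trainer.train()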

Step 4: Evaluate

Test on held-out questions with varying distractor counts:

Evaluate with:
  - 0 distractors (best case)
  - 5 distractors (typical)
  - 10+ distractors (stress test)
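
A sketch of that sweep with exact-match accuracy as the metric; eval_set and model_answer are stand-ins for your held-out data and inference call:

def accuracy_at(eval_set, n_distractors, model_answer):
    """Exact-match accuracy with the gold doc plus n distractors in context.

    eval_set: dicts with "question", "gold_doc", "distractor_pool", "answer".
    model_answer: callable (question, docs) -> predicted answer string.
    """
    correct = 0
    for ex in eval_set:
        docs = [ex["gold_doc"]] + ex["distractor_pool"][:n_distractors]
        pred = model_answer(ex["question"], docs)
        correct += pred.strip().lower() == ex["answer"].strip().lower()
    return correct / len(eval_set)

# Sweep the stress levels above:
# for n in (0, 5, 10):
#     print(n, accuracy_at(eval_set, n, model_answer))

If the drop from 0 to 10 distractors is small, the distractor training did its job.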

Practical Advantages

Works with any LLM: No special architecture needed

Simple to implement: Standard fine-tuning process, just different data prep

Production-ready: Models trained with distractors tolerate the noisy retrieval they will actually see in deployment

Scalable: Works with 7B models; even better with larger models

Use RAFT with Apertis AI

Build RAFT systems using:

  1. Self-hosted LLM (fine-tune locally or on your infrastructure)
  2. Apertis AI as the retrieval + generation layer: access multiple LLMs through one API if needed
  3. Hybrid approach: Use Apertis for fallback generation while your RAFT model handles primary queries

This gives you domain specialization without being locked into single-model deployments.
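
One hypothetical shape for that hybrid: try the local RAFT model first and fall back to a hosted model behind the gateway. The endpoint URLs and model names below are placeholders, not documented Apertis values, and both clients assume OpenAI-compatible APIs:

from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # e.g. a vLLM server
gateway = OpenAI(base_url="https://gateway.example.com/v1", api_key="YOUR_KEY")

def answer(question: str, context: str) -> str:
    prompt = f"Documents:\n{context}\n\nQuestion: {question}\nAnswer:"
    messages = [{"role": "user", "content": prompt}]
    try:
        # Primary: the domain-specialized RAFT model served locally
        resp = local.chat.completions.create(
            model="raft-llama2-7b", messages=messages, timeout=10
        )
    except Exception:
        # Fallback: a general-purpose hosted model via the gateway
        resp = gateway.chat.completions.create(
            model="general-purpose-model", messages=messages
        )
    return resp.choices[0].message.content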

When to Use RAFT

Perfect for:

  • Domain-specific Q&A (legal, medical, financial documents)
  • Situations where retrieval is imperfect
  • When model accuracy matters more than latency
  • Teams with labeled training data

Less ideal for:

  • Tight timelines (data prep and fine-tuning take time)
  • Very broad general knowledge (domain advantage is lost)
  • Rapidly changing knowledge (frequent retraining needed)

Reference: Zhang et al., "RAFT: Adapting Language Model to Domain Specific RAG," arXiv:2403.10131