Multimodal Large Language Models: Architecture, Training & Data Strategies

Multimodal Large Language Models (MLLMs) represent a major shift in AI: instead of processing text alone, they understand images, videos, and text simultaneously. This guide covers the architecture, training strategies, and practical implementations behind systems like GPT-4V and Claude's image understanding.

MLLM Architecture: Three Key Components

1. Pre-trained Modality Encoders (The "Eyes")

These are specialized models that compress raw input (images, audio) into compact representations that language models can understand.

Common encoders:

  • CLIP: Aligns images and text through contrastive learning
  • High-resolution vision encoders: used in models like CogAgent for detailed visual understanding
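For a concrete feel, the sketch below pulls CLIP image features with the Hugging Face transformers library; the checkpoint is one public variant, and the shapes noted in the comments apply to it specifically:

```python
# Sketch: encoding an image with CLIP via Hugging Face transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

# One pooled embedding per image; shape [1, 512] for this checkpoint.
image_features = model.get_image_features(**inputs)

# MLLMs usually take the per-patch hidden states instead, since the
# modality interface needs a sequence of visual tokens to project.
patch_states = model.vision_model(**inputs).last_hidden_state  # [1, 50, 768]
```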

High-resolution approaches:

  • Direct Scaling: Feed higher-resolution images directly into the encoder (CogAgent, for example, takes 1120×1120 inputs)
  • Patch Division: Split a high-res image into blocks, process each with a standard-resolution encoder, then combine the features (like Monkey; sketched below)

The trade-off: higher resolution improves accuracy but increases computational cost.
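A minimal sketch of the patch-division idea, assuming a placeholder `encode` function that wraps any fixed-resolution encoder; the 448-pixel crop size and resize policy are illustrative, not Monkey's exact recipe:

```python
from PIL import Image

def divide_and_encode(image: Image.Image, encode, crop: int = 448):
    """Split a high-resolution image into fixed-size crops and encode each.

    `encode` is a placeholder for any standard-resolution vision encoder.
    """
    # Snap both sides to multiples of the crop size (illustrative policy).
    width = max(image.width // crop, 1) * crop
    height = max(image.height // crop, 1) * crop
    image = image.resize((width, height))

    features = []
    for top in range(0, height, crop):
        for left in range(0, width, crop):
            block = image.crop((left, top, left + crop, top + crop))
            features.append(encode(block))  # e.g. [num_patches, dim] per crop

    # Downstream, the per-crop features are combined (concatenated or fused)
    # before reaching the modality interface.
    return features
```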

2. Pre-trained Large Language Models (The "Brain")

The LLM generates responses based on the encoded visual information. Common choices:

  • Flan-T5-XL
  • LLaMA and Vicuna
  • Qwen

These are frozen or lightly fine-tuned; most learning happens through the next component.

3. Modality Interface (The "Translator")

This is where the challenge lies: how do you efficiently teach an LLM to understand visual information without retraining it end-to-end (which is prohibitively expensive)?

Strategy 1: Learnable Connectors

Project visual features into the LLM's token space. Two approaches:

  • Token-based fusion (BLIP-2): Convert visual features into tokens and feed alongside text tokens
  • Feature-based fusion (Flamingo): Insert deeper interaction modules for richer multimodal fusion
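To make the connector concrete, here is a minimal PyTorch sketch in the spirit of BLIP-2's token-based approach: a set of learnable query tokens cross-attends to frozen visual features and is projected into the LLM's embedding space. The dimensions, head count, and initialization are illustrative, not the published configuration:

```python
import torch
import torch.nn as nn

class QueryConnector(nn.Module):
    """Learnable queries cross-attend to visual features (BLIP-2-style sketch)."""

    def __init__(self, vis_dim=768, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # into the LLM's token space

    def forward(self, visual_feats):          # [batch, num_patches, vis_dim]
        b = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.cross_attn(q, visual_feats, visual_feats)
        return self.proj(fused)               # [batch, num_queries, llm_dim]

connector = QueryConnector()
visual_tokens = connector(torch.randn(2, 197, 768))  # e.g. ViT-B/16 patch feats
print(visual_tokens.shape)                           # torch.Size([2, 32, 4096])
```

The output sequence is simply concatenated with the text token embeddings, which is what makes this "token-based": the LLM treats visual tokens like any other input tokens.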

Strategy 2: Expert Models

Use specialized expert systems to convert images into text descriptions before feeding them to the LLM (e.g., VideoChat-Text). Trade-off: simpler, but potentially loses fine-grained visual details.
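A sketch of this route using an off-the-shelf BLIP captioner from transformers (the checkpoint is one public option; VideoChat-Text itself relies on its own set of perception tools):

```python
# Sketch: expert-model pipeline. An off-the-shelf captioner converts the image
# to text, which is then spliced into a plain-text prompt for any LLM.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
caption_ids = captioner.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)

# The LLM never sees pixels, only this description; hence the loss of detail.
prompt = f"Image description: {caption}\n\nQuestion: What is happening here?"
```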

Key insight: Token-based fusion is faster but feature-based fusion captures richer interactions. Avoid naively converting images to captions—you lose spatial and temporal relationships.

Training Strategies

Pre-training Phase

Goal: Align different modalities and learn world knowledge

The model learns by predicting caption text conditioned on the image, minimizing cross-entropy loss over the text tokens. Typically:

  • Keep pre-trained modules frozen
  • Train only the learnable interface
  • Data quality matters: large-scale but noisy caption data is usually trained at lower resolution for throughput, while cleaner long-form data justifies higher resolution
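In code, this stage mostly amounts to freezing the pre-trained modules and giving the optimizer only the interface's parameters. A self-contained PyTorch sketch, with placeholder modules standing in for the real encoder, connector, and LLM:

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the real components.
vision_encoder = nn.Linear(768, 768)   # stands in for a frozen ViT
connector = nn.Linear(768, 4096)       # the only trainable part
llm = nn.Linear(4096, 32000)           # stands in for a frozen LLM head

# Freeze everything except the connector.
for module in (vision_encoder, llm):
    for param in module.parameters():
        param.requires_grad = False

optimizer = torch.optim.AdamW(connector.parameters(), lr=1e-4)

# One illustrative step: project visual features, predict caption tokens,
# and minimize cross-entropy on the text side.
feats = vision_encoder(torch.randn(2, 49, 768))   # fake patch features
logits = llm(connector(feats))                    # [2, 49, vocab]
labels = torch.randint(0, 32000, (2, 49))         # fake caption tokens
loss = nn.functional.cross_entropy(logits.transpose(1, 2), labels)
loss.backward()
optimizer.step()
```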

Instruction Fine-tuning

Goal: Teach the model to follow user instructions effectively

Three techniques:

Data Adaptation: Transform existing high-quality datasets into instruction-following format without collecting new data (see the sketch below)

Self-Instruction Generation: Use an LLM to generate new instruction-response pairs, expanding datasets automatically (e.g., generate 10 instruction variations from one image)

Data Mixing: Combine multimodal data with language-only data during training for better language understanding
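As an example of the first technique, an existing VQA triple can be templated into an instruction-following sample with no new annotation; the field names and template below are illustrative:

```python
# Sketch: adapting an existing VQA sample into instruction-following format.
vqa_sample = {
    "image": "train2014/000000123.jpg",
    "question": "What color is the bus?",
    "answer": "red",
}

instruction_sample = {
    "image": vqa_sample["image"],
    "conversations": [
        {"role": "user",
         "content": f"<image>\n{vqa_sample['question']} "
                    "Answer with a short phrase."},
        {"role": "assistant", "content": vqa_sample["answer"]},
    ],
}
```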

Alignment Fine-tuning (RLHF & DPO)

Make the model prefer human-approved responses over others:

RLHF Approach:

  1. Supervised Fine-tuning creates a base policy
  2. Reward modeling learns to score good vs. bad responses based on human preference
  3. Proximal Policy Optimization (PPO) optimizes the policy against the reward model while a KL penalty keeps it close to the original
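Step 2 typically trains on pairwise human preferences with a Bradley-Terry-style objective, pushing the preferred response's score above the rejected one; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise Bradley-Terry loss: the preferred response should score higher."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Fake scalar rewards for a batch of 4 preference pairs.
loss = reward_model_loss(torch.randn(4), torch.randn(4))
```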

Direct Preference Optimization (DPO): Simpler alternative that learns directly from human preference labels without building an explicit reward model
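The DPO objective is compact enough to write out directly. The sketch below implements the standard loss from per-sequence log-probabilities under the policy and a frozen reference model; variable names and the beta value are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: push the policy's margin on (chosen - rejected)
    above the frozen reference model's margin, scaled by beta."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Example with fake per-sequence log-probabilities (batch of 4).
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```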

Advanced MLLM Capabilities

Region-based interaction (Shikra): Users select specific image regions for fine-grained understanding instead of whole-image analysis

Flexible prompting (Ferret): Support for multiple input modalities—points, bounding boxes, freehand sketches—instead of just image + text

Point-based selection (Osprey): Simple click-to-select interface for identifying specific entities

Multilingual transfer (VisCPM): Leverage English training data then transfer vision-language abilities to other languages using translation samples

Why This Matters for Developers

MLLMs enable:

  • Visual question answering (analyze screenshots, diagrams, charts)
  • Document understanding (process forms, invoices, contracts)
  • Accessibility features (describe images for screen readers)
  • Content analysis (moderate or categorize visual content)

You can access state-of-the-art MLLMs through Apertis AI's unified API, which supports both text and image inputs across Claude, GPT-4V, Gemini Vision, and other leading providers—without managing multiple SDKs.
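Assuming the endpoint is OpenAI-compatible (an assumption; check the provider's documentation for the real base URL and model identifiers), a text-plus-image request with the openai Python SDK looks like this:

```python
# Sketch: sending text + image to an OpenAI-compatible chat endpoint.
# The base URL and model name are assumptions; consult the provider's docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # hypothetical unified endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="claude-sonnet",  # hypothetical model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```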


Reference: Shukang Yin et al., "A Survey on Multimodal Large Language Models", arXiv:2306.13549