Language-vision models: recent developments

Published December 2, 2024

In a recent paper titled The Evolution of Multimodal Model Architectures [5], the authors present a taxonomy of multimodal vision-language models, categorized by the fusion stage (early or deep) and the fusion methods, including standard cross-attention layers, custom cross-attention layers, specialized tokenizers, or modality-specific encoders. Below, I provide a brief overview of the taxonomy groups developed by the authors [5], and examples of models within each category.

Deep Fusion

Type-A (multimodal inputs are directed to the internal LLM layers using cross-attention)

In this architecture, the multimodal inputs (image/video/audio) are passed through a multimodal encoder, resampled to a fixed length (using a resampler), and then passed to the internal layers of the LLM using cross-attention.

The fusion (via cross-attention) can be done either before or after the self-attention layer in the LLM, which gives two possible sub-architectures.

For one such model, see Dolphins below.
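
To make the Type-A pattern concrete, here is a minimal PyTorch sketch under my own simplifying assumptions (module names, gating, and sizes are illustrative, loosely following Flamingo-style gated cross-attention rather than any specific model): a resampler compresses the visual features to a fixed-length set of latents, and a decoder block fuses them via cross-attention placed before the self-attention layer.

```python
# Minimal sketch of the Type-A pattern: visual features are resampled to a
# fixed length and fused into an LLM block via cross-attention placed before
# the self-attention layer. All names and sizes are illustrative.
import torch
import torch.nn as nn


class Resampler(nn.Module):
    """Compress a variable-length visual sequence to `num_latents` vectors."""

    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        q = self.latents.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        out, _ = self.attn(q, visual_feats, visual_feats)  # queries = latents
        return out                                         # (B, num_latents, dim)


class TypeABlock(nn.Module):
    """One LLM block with cross-attention fusion before self-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # start with no visual influence
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, h: torch.Tensor, visual_latents: torch.Tensor) -> torch.Tensor:
        x, _ = self.cross_attn(self.norm1(h), visual_latents, visual_latents)
        h = h + torch.tanh(self.gate) * x   # fusion happens before self-attention
        x, _ = self.self_attn(self.norm2(h), self.norm2(h), self.norm2(h))
        h = h + x
        return h + self.mlp(self.norm3(h))


# Toy usage: 200 visual patch features fused into a sequence of 16 text tokens.
visual = torch.randn(1, 200, 512)
text_hidden = torch.randn(1, 16, 512)
latents = Resampler(512)(visual)
print(TypeABlock(512)(text_hidden, latents).shape)  # torch.Size([1, 16, 512])
```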

Type-B (multimodal inputs are directed to the internal LLM layers using custom cross-attention layers)

Type-B groups models that pass the multimodal input to the internal LLM layers via custom-designed cross-attention layers. The authors observe that in these models the deep fusion typically happens through add or concatenation operations after the self-attention layers of the LLM.

For one such model, see CogVLM below.
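
As a contrast with Type-A, here is a minimal sketch of the Type-B idea under my own simplifying assumptions: the output of a (deliberately simple) custom cross-attention module is merged with the hidden states by a learnably scaled add after the self-attention layer. This only illustrates the taxonomy category; it is not CogVLM's actual design.

```python
# Minimal sketch of the Type-B pattern: a custom cross-attention module whose
# output is merged with the hidden states *after* the self-attention layer.
# Illustration of the taxonomy only; real Type-B models use more specialised
# designs (e.g., modality-specific weights).
import torch
import torch.nn as nn


class TypeBBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))   # learnable fusion strength
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, h: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        x, _ = self.self_attn(self.norm1(h), self.norm1(h), self.norm1(h))
        h = h + x                                    # usual self-attention path
        v, _ = self.cross_attn(self.norm2(h), visual_feats, visual_feats)
        return h + self.alpha * v                    # add-fusion after self-attention


# Toy usage
out = TypeBBlock(512)(torch.randn(1, 16, 512), torch.randn(1, 200, 512))
print(out.shape)  # torch.Size([1, 16, 512])
```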

Early Fusion

Type-C (multimodal inputs are optionally embedded or passed as-is to the LLM)
  • Use pre-trained LLM as decoder
    • Input: Encoder output + text
  • Encoder can also be pre-trained
  • Incorporate off-the-shelf LLMs and encoders
  • Training & data:
    • Pre-train + alignment tuning: train projection layers (e.g., an MLP) for vision-text alignment (see the sketch after this list)
    • Instruction + alignment tuning: train projection layer + LLM
  • For one such model, see Qwen-VL below.
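
A minimal sketch of the Type-C recipe, assuming a frozen vision encoder and a frozen LLM connected by a small trainable MLP projection (all dimensions and names are illustrative): the projected visual embeddings are simply prepended to the text embeddings at the LLM input, which is exactly the part trained in the "pre-train + alignment tuning" step above.

```python
# Minimal sketch of the Type-C pattern: a frozen vision encoder and a frozen
# LLM connected by a small trainable projection; the projected visual tokens
# are prepended to the text embeddings at the LLM input.
# All dimensions and module names are illustrative.
import torch
import torch.nn as nn


class TypeCConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(              # the only trainable part during
            nn.Linear(vision_dim, llm_dim),     # "pre-train + alignment tuning"
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.proj(vision_feats)                 # (B, N_img, llm_dim)
        return torch.cat([visual_tokens, text_embeds], dim=1)   # early fusion at the input


# Toy usage: 256 visual patch features + 16 text token embeddings.
fused = TypeCConnector()(torch.randn(1, 256, 1024), torch.randn(1, 16, 4096))
print(fused.shape)  # torch.Size([1, 272, 4096])
```
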
Type-D (multimodal inputs are tokenized before being passed to the LLM)
  • One tokenizer for all modalities
  • Disadvantages:
    • The addition of a new modality requires a re-training of the tokenizer (to learn how to tokenize the new modality).
    • Training was observed to take longer than with the other methods, because the LLM (or encoder-decoder model) only sees the modality at the input stage and receives no modality-specific guidance at the intermediate, deeper layers.
  • For one such model, see Chameleon below.

Qwen-VL (2023)

Paper: https://arxiv.org/pdf/2308.12966

Model architecture
  • Visual Encoder (e.g., a ViT)
  • Position-aware Vision-Language Adapter (see the sketch after this list)
    • A cross-attention layer with:
      • Inputs:
        • visual embedding sequence from the Visual Encoder, as keys
        • trainable vector embeddings, as queries
      • Outputs:
        • a compressed fixed-length visual embedding sequence (e.g., 256)
  • Large Language Model (e.g., Qwen-7B)
    • Inputs:
      • compressed visual embedding sequence (surrounded by two special img tokens to distinguish it from the text input) + text input sequence
    • Outputs:
      • Predicted next text token
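
A minimal sketch of the adapter described above: a fixed set of 256 learnable query embeddings attends over the visual-encoder outputs in a single cross-attention layer, producing a fixed-length visual sequence for the LLM. Adding learnable positional embeddings to the visual keys is my simplification of the paper's position-aware scheme, and all dimensions are illustrative.

```python
# Minimal sketch of the position-aware vision-language adapter described above:
# 256 learnable query embeddings attend over the ViT patch features in a single
# cross-attention layer, producing a fixed-length visual sequence for the LLM.
# Adding learnable positional embeddings to the keys simplifies the paper's
# position-aware scheme; all sizes are illustrative.
import torch
import torch.nn as nn


class VLAdapter(nn.Module):
    def __init__(self, vit_dim: int = 1664, llm_dim: int = 4096,
                 num_queries: int = 256, max_patches: int = 1024, num_heads: int = 16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.pos_emb = nn.Parameter(torch.randn(max_patches, vit_dim) * 0.02)
        self.kv_proj = nn.Linear(vit_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N_patches, vit_dim) from the visual encoder
        kv = self.kv_proj(patch_feats + self.pos_emb[: patch_feats.size(1)])
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return out  # (B, 256, llm_dim): fixed-length visual sequence for the LLM


# Toy usage: 1024 ViT patch features compressed to 256 visual tokens.
visual_tokens = VLAdapter()(torch.randn(1, 1024, 1664))
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])
```
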
Training

Training is done in three phases (a parameter-freezing sketch follows the list):

  1. Pre-training on low-resolution image and text pairs:
    • LLM is frozen. Only adapter and visual encoder are trained to minimize cross-entropy on LLM output text.
  2. Multi-task pre-training on high-res image and text pairs, and interleaved image-text data:
    • LLM, adapter, encoder are all trained.
  3. Fine-tuning on interleaved image-text data:
    • Encoder is frozen, only LLM and adapter are trained.
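
The freezing schedule above can be summarized in a few lines. The sketch below assumes a wrapper model exposing visual_encoder, adapter, and llm sub-modules; these are placeholder names, not Qwen-VL's actual code.

```python
# Sketch of the three-phase freezing schedule above, expressed by toggling
# requires_grad on the relevant sub-modules. `model` is assumed to expose
# `visual_encoder`, `adapter`, and `llm` attributes (placeholder names).
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable


def configure_phase(model: nn.Module, phase: int) -> None:
    if phase == 1:    # pre-training: LLM frozen, encoder + adapter trained
        set_trainable(model.visual_encoder, True)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, False)
    elif phase == 2:  # multi-task pre-training: everything trained
        for m in (model.visual_encoder, model.adapter, model.llm):
            set_trainable(m, True)
    elif phase == 3:  # fine-tuning: encoder frozen, adapter + LLM trained
        set_trainable(model.visual_encoder, False)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, True)
```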

Chameleon (2024)

Paper: https://arxiv.org/pdf/2405.09818

Chameleon is a multimodal early fusion model pre-trained on a large dataset including pure text, text-image pairs, and interleaved text-image documents. It's pre-trained in two stages. The first stage accounts for most of the training and uses ~2.9T text-only tokens, ~1.5T text-image tokens, and ~400B interleaved text-image tokens. The second stage is much shorter and mixes a similar blend of data with higher-quality datasets.

Tokenization: At the core of Chameleon's architecture is a tokenization scheme that quantizes both images and text into discrete tokens, so that the same transformer-based model can be applied to the resulting sequence. Images (of size 512x512) are quantized into 1024 discrete tokens drawn from a codebook of 8192 image tokens. Text is tokenized with a BPE tokenizer whose vocabulary of ~65k tokens includes the 8192 image tokens, so image and text tokens can be interleaved in a single sequence.
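
A small sketch of how such a shared image+text token space can be laid out: the VQ image codes are offset into a reserved range of the BPE vocabulary so that one transformer consumes a single interleaved sequence. The numbers follow the description above (8192 image codes, ~65k total vocabulary), but the offset layout and helper names are my own assumptions rather than Chameleon's actual implementation.

```python
# Sketch of a shared image+text token space for this kind of early fusion:
# the 8192 VQ image codes are mapped into a reserved range of the ~65k BPE
# vocabulary so that one transformer can consume a single interleaved sequence.
# The offset layout and helper names are assumptions for illustration only.
from typing import List

TEXT_VOCAB_SIZE = 65_536 - 8_192   # BPE text tokens
IMAGE_CODEBOOK_SIZE = 8_192        # VQ codes; one 512x512 image -> 1024 codes
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE


def image_codes_to_tokens(codes: List[int]) -> List[int]:
    """Map VQ codebook indices (0..8191) into the shared vocabulary."""
    assert all(0 <= c < IMAGE_CODEBOOK_SIZE for c in codes)
    return [IMAGE_TOKEN_OFFSET + c for c in codes]


def build_sequence(text_tokens: List[int], image_codes: List[int]) -> List[int]:
    """Interleave text and image tokens into one sequence for the transformer."""
    return text_tokens + image_codes_to_tokens(image_codes)


# Toy usage: a short caption followed by a (truncated) 1024-code image.
sequence = build_sequence(text_tokens=[17, 42, 933], image_codes=[5, 8191, 1023])
print(sequence)  # [17, 42, 933, 57349, 65535, 58367]
```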

Architecture: The authors propose architectural changes to stabilize training. In particular, for the 7B-parameter model the attention blocks use QK-Norm (along with dropout after the attention and MLP blocks), while the larger 34B-parameter model uses the normalization strategy of the Swin transformer in its attention blocks. Also, to stabilize the final softmax over the logits, the authors use z-loss regularization (see Sec. 3.1.2 of arXiv:2309.14322v2 [6]).
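
A minimal sketch of the two stabilization tricks mentioned above: QK-Norm layer-normalizes the queries and keys before the dot-product attention, and the z-loss adds a penalty on the squared log of the softmax partition function. Shapes and hyperparameters are illustrative; this is not the paper's code.

```python
# Minimal sketch of the two stabilization tricks mentioned above: QK-Norm
# (layer-normalizing queries and keys before the dot-product attention) and the
# z-loss, which penalizes the squared log-partition-function of the final
# softmax. Shapes and hyperparameters are illustrative.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(self.head_dim)   # QK-Norm: normalize per head
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)        # keeps attention logits bounded
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        out = attn.softmax(dim=-1) @ v
        return self.out(out.transpose(1, 2).reshape(b, t, d))


def loss_with_z_reg(logits: torch.Tensor, targets: torch.Tensor, z_coef: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus z-loss: z_coef * log^2(Z), with Z = sum(exp(logits))."""
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)          # log of the partition function
    return ce + z_coef * (log_z ** 2).mean()


# Toy usage
y = QKNormAttention(512)(torch.randn(2, 10, 512))
loss = loss_with_z_reg(torch.randn(4, 65_536), torch.randint(0, 65_536, (4,)))
print(y.shape, loss.item())
```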

CogVLM (2023)

Paper: https://arxiv.org/pdf/2311.03079

Dolphins (2023)

Paper: https://arxiv.org/pdf/2312.00438

References & Footnotes


  1. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., … & Zhou, J. (2023). Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.

  2. Chameleon Team (2024). Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818.

  3. Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., … & Tang, J. (2023). CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.

  4. Ma, Y., Cao, Y., Sun, J., Pavone, M., & Xiao, C. (2023). Dolphins: Multimodal language model for driving. arXiv preprint arXiv:2312.00438.

  5. Wadekar, S. N., Chaurasia, A., Chadha, A., & Culurciello, E. (2024). The evolution of multimodal model architectures. arXiv preprint arXiv:2405.17927.

  6. Wortsman, M., Liu, P. J., Xiao, L., Everett, K., Alemi, A., Adlam, B., … & Kornblith, S. (2023). Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322.