CLIP (Contrastive Language-Image Pre-Training)

Published April 14, 2023

What


TL;DR: The CLIP (Contrastive Language-Image Pre-Training) model is trained on a large set of image-text pairs and can then be used zero-shot (i.e., without fine-tuning) for a variety of tasks, such as image classification, object detection over region proposals (with prompting), and image retrieval.

How


Contrastive Training

  • CLIP is trained on batches of image-text pairs using a contrastive loss similar to the one in ConVIRT1 (cf. Eq. 2-4 there).

  • Specifically, for a batch of $N$ image-text pairs, the corresponding $d$-dimensional image and text embeddings are computed (using standard image and text encoders2), giving $N^2$ possible image-text combinations whose cosine similarities form an $N \times N$ matrix. To compute the loss for the $i$-th matched pair, we consider the $i$-th row and $i$-th column of this matrix (corresponding to the $i$-th image and the $i$-th text in the batch):

    • We compute the cross-entropy loss across row $i$ (image-to-text, with the matching text as the positive class) and the cross-entropy loss across column $i$ (text-to-image, with the matching image as the positive class), and combine the two; CLIP averages them with equal weight, while ConVIRT uses a weighted combination.

    • The network learns a joint embedding space in which matching image-text pairs are close and non-matching pairs are far apart. A large batch size is therefore recommended, since each matching pair is contrasted against all the other pairs in the batch that we know do not match (a minimal code sketch of the loss follows below).
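To make the loss above concrete, here is a minimal PyTorch sketch of the symmetric contrastive loss, assuming the image and text embeddings have already been produced by the two encoders. The function name and the fixed temperature value are illustrative choices (CLIP learns the temperature as a parameter during training).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N matched image-text pairs.

    image_embeds, text_embeds: (N, d) outputs of the image/text encoders.
    temperature: logit scaling (fixed here for simplicity; learned in CLIP).
    """
    # L2-normalize so that dot products equal cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (N, N) matrix of cosine similarities, scaled by the temperature.
    logits = image_embeds @ text_embeds.t() / temperature

    # The i-th image matches the i-th text, so the targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy across rows (image-to-text) and across columns
    # (text-to-image), averaged with equal weight.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```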

Inference on unseen datasets and tasks (Zero-shot learning3)

  • Once trained, the CLIP model can be applied to new datasets and tasks. For example, it can be used for image captioning by ranking a set of pre-defined candidate captions by image-caption similarity and selecting the best match for each image.

  • The standard example is image classification: each candidate category is embedded with the text encoder (typically inserted into a prompt template such as "a photo of a {label}"), the image is embedded with the image encoder, and we compute the cosine similarity between the image and every category prompt, predicting the category with the highest similarity (see the sketch below).
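As an illustration, here is a minimal sketch of zero-shot classification using the Hugging Face transformers implementation of CLIP; the checkpoint name, category list, image path, and prompt template are illustrative assumptions, not part of the original post.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly released CLIP checkpoint (ViT-B/32 image encoder).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate categories inserted into a prompt template.
labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]

image = Image.open("example.jpg")  # hypothetical input image

# Embed the image and all prompts, then compare them.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities; the most
# similar prompt gives the predicted category.
probs = outputs.logits_per_image.softmax(dim=-1)
print(labels[probs.argmax(dim=-1).item()])
```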

References & Footnotes


  1. Zhang, Y., Jiang, H., Miura, Y., Manning, C. D., & Langlotz, C. P. (2022, December). Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference (pp. 2-25). PMLR. ↩︎

  2. For image embedding, we can use pretty much any vision model (such as ViT or ResNet-50), and for text embedding we can use a Transformer-based model. ↩︎

  3. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR. ↩︎