Motivation

Computer vision systems are usually trained to predict a fixed set of predetermined object categories, which limits their generality and usability: specifying any other visual concept requires additional labeled data.

CLIP learns SOTA image representations from a dataset of 400 million (image, text) pairs, then uses natural language to reference the learned visual concepts. The goal is to classify images without any explicit labels (zero-shot transfer).

Dataset: web-scale (image, text) collections > high-quality crowd-labeled NLP datasets (far more data, no manual annotation cost)

Background

Zero-shot learning

Reference: Understanding Zero-Shot Learning — Making ML More Human, by Ekin Tiu

The main idea of zero-shot learning is to train the model on data without explicit class labels, maximizing the vector similarity between embeddings of similar images so that they form clusters on a hypersphere (roughly analogous to word-embedding clusters).
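
A toy sketch of what "similarity on a hypersphere" means (the vectors below are made-up embeddings, not outputs of a real encoder): cosine similarity is just the dot product of L2-normalized vectors, so samples in the same cluster score near 1.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the dot product of L2-normalized vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

# Made-up toy embeddings: the two "cat" vectors point roughly the same
# way on the hypersphere, the "dog" vector points elsewhere.
cat_1 = np.array([0.9, 0.1, 0.2])
cat_2 = np.array([0.8, 0.2, 0.3])
dog_1 = np.array([0.1, 0.9, 0.1])

print(cosine_similarity(cat_1, cat_2))  # high: same cluster (~0.98)
print(cosine_similarity(cat_1, dog_1))  # low:  different clusters (~0.24)
```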

[Figure: example input of CLIP]

The text here is a form of supervision, not a set of labels. CLIP uses the text as supervision to derive labels via NLP (e.g., "These are three cats sitting…" → cats).
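
A minimal sketch of how those text-derived labels get used at inference time. The `fake_encode` function is a hypothetical stand-in for CLIP's image and text encoders (which return L2-normalized embeddings); the prompt template "a photo of a {name}" follows the CLIP paper.

```python
import zlib
import numpy as np

def fake_encode(x: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for CLIP's image/text encoders: returns a
    deterministic pseudo-random unit vector so the sketch runs end to end."""
    rng = np.random.default_rng(zlib.crc32(x.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def zero_shot_classify(image: str, class_names: list[str]) -> str:
    """Embed one prompt per candidate class, then pick the class whose
    prompt embedding has the highest dot product with the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([fake_encode(p) for p in prompts])  # (C, d)
    image_emb = fake_encode(image)                           # (d,)
    scores = text_embs @ image_emb                           # cosine similarities
    return class_names[int(np.argmax(scores))]

print(zero_shot_classify("three_cats.jpg", ["cat", "dog", "airplane"]))
```

Since no class list is baked into the model, swapping in new class names at inference time is all it takes to target a new task.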

Contrastive learning

Summary: enables zero-shot learning by contrasting the embeddings of matching (positive) and non-matching (negative) sample pairs


Reference: Understanding Contrastive Learning, by Ekin Tiu

The diagonal entries $(I_1 \cdot T_1, \ldots, I_N \cdot T_N)$ are the positive pairs, since each image matches its own text embedding. Training maximizes their cosine similarity (the dot product of the L2-normalized embeddings) while minimizing the similarity of the $N^2 - N$ off-diagonal (negative) pairs.
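
A short PyTorch sketch of this symmetric contrastive objective. It mirrors the pseudocode in the CLIP paper, but the function name and the fixed temperature of 0.07 are assumptions made here; CLIP actually learns the temperature as a trainable parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over the N x N image-text similarity matrix;
    the correct pairing for row i (and column i) is index i, the diagonal."""
    image_emb = F.normalize(image_emb, dim=-1)        # unit-norm rows
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature     # (N, N) cosine similarities
    targets = torch.arange(logits.shape[0])           # positives on the diagonal
    loss_images = F.cross_entropy(logits, targets)    # match each image to its text
    loss_texts = F.cross_entropy(logits.T, targets)   # match each text to its image
    return (loss_images + loss_texts) / 2

# Toy usage: a batch of N = 4 matched (image, text) embedding pairs.
loss = clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss)
```

Minimizing both cross-entropies pushes each diagonal similarity above its row and column competitors, which is exactly the "maximize positives, suppress the $N^2 - N$ negatives" behavior described above.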

Further reading: Wang & Isola, "Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere" (https://arxiv.org/pdf/2005.10242.pdf)

Natural language supervision

The main idea is to learn perception from supervision contained in natural language.

Benefits of natural language supervision