Motivation

Computer vision systems are usually trained to predict a fixed set of predetermined object categories, which limits their generality and usability: specifying any other visual concept requires additional labeled data.

CLIP learns SOTA image representations from a dataset of 400 million (image, text) pairs, then uses natural language to reference the learned visual concepts. The goal is to classify images without any explicit labels (zero-shot transfer).

Dataset: web-scale (image, text) collections > high-quality crowd-labeled NLP datasets (far more data, no manual annotation cost)

Background

Zero-shot learning

Reference: Understanding Zero-Shot Learning — Making ML More Human, by Ekin Tiu

The main idea of zero-shot learning is to train the model on data without explicit class labels, maximizing the vector similarity between embeddings of similar images so that they form clusters on a hypersphere (roughly analogous to word-embedding clusters).
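
A toy sketch of what "similarity on a hypersphere" means (the vectors below are made-up embeddings, not outputs of a real encoder): cosine similarity is just the dot product of L2-normalized vectors, so samples in the same cluster score near 1.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the dot product of L2-normalized vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

# Made-up toy embeddings: the two "cat" vectors point roughly the same
# way on the hypersphere, the "dog" vector points elsewhere.
cat_1 = np.array([0.9, 0.1, 0.2])
cat_2 = np.array([0.8, 0.2, 0.3])
dog_1 = np.array([0.1, 0.9, 0.1])

print(cosine_similarity(cat_1, cat_2))  # high: same cluster (~0.98)
print(cosine_similarity(cat_1, dog_1))  # low:  different clusters (~0.24)
```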

[Figure: example input of CLIP]

The text here is a form of supervision, not a set of labels. CLIP uses the text as supervision to derive labels via NLP (e.g., "These are three cats sitting…" → cats).
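
A minimal sketch of how those text-derived labels get used at inference time. The `fake_encode` function is a hypothetical stand-in for CLIP's image and text encoders (which return L2-normalized embeddings); the prompt template "a photo of a {name}" follows the CLIP paper.

```python
import zlib
import numpy as np

def fake_encode(x: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for CLIP's image/text encoders: returns a
    deterministic pseudo-random unit vector so the sketch runs end to end."""
    rng = np.random.default_rng(zlib.crc32(x.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def zero_shot_classify(image: str, class_names: list[str]) -> str:
    """Embed one prompt per candidate class, then pick the class whose
    prompt embedding has the highest dot product with the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([fake_encode(p) for p in prompts])  # (C, d)
    image_emb = fake_encode(image)                           # (d,)
    scores = text_embs @ image_emb                           # cosine similarities
    return class_names[int(np.argmax(scores))]

print(zero_shot_classify("three_cats.jpg", ["cat", "dog", "airplane"]))
```

Since no class list is baked into the model, swapping in new class names at inference time is all it takes to target a new task.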

Contrastive learning

Summary: enables zero-shot learning by contrasting the embeddings of matching (positive) and non-matching (negative) sample pairs


Reference: Understanding Contrastive Learning, by Ekin Tiu

The diagonal entries $(I_1 \cdot T_1, \ldots, I_N \cdot T_N)$ are the positive pairs, since each image matches its own text embedding. Training maximizes their cosine similarity (the dot product of the L2-normalized embeddings) while minimizing the similarity of the $N^2 - N$ off-diagonal (negative) pairs.
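
A short PyTorch sketch of this symmetric contrastive objective. It mirrors the pseudocode in the CLIP paper, but the function name and the fixed temperature of 0.07 are assumptions made here; CLIP actually learns the temperature as a trainable parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over the N x N image-text similarity matrix;
    the correct pairing for row i (and column i) is index i, the diagonal."""
    image_emb = F.normalize(image_emb, dim=-1)        # unit-norm rows
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature     # (N, N) cosine similarities
    targets = torch.arange(logits.shape[0])           # positives on the diagonal
    loss_images = F.cross_entropy(logits, targets)    # match each image to its text
    loss_texts = F.cross_entropy(logits.T, targets)   # match each text to its image
    return (loss_images + loss_texts) / 2

# Toy usage: a batch of N = 4 matched (image, text) embedding pairs.
loss = clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss)
```

Minimizing both cross-entropies pushes each diagonal similarity above its row and column competitors, which is exactly the "maximize positives, suppress the $N^2 - N$ negatives" behavior described above.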

Further reading: Wang & Isola, "Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere" (https://arxiv.org/pdf/2005.10242.pdf)

Natural language supervision

The main idea is to learn perception from supervision contained in natural language.

Benefits of natural language supervision