Computer vision systems are usually trained to predict a fixed set of predetermined object categories, which limits their generality and usability: additional labeled data is needed to specify any other visual concept.
CLIP learns SOTA image representations from a dataset of 400 million image-text pairs collected from the web, then uses natural language to reference the learned visual concepts. The goal is to classify images without any explicit labels.
Dataset: Web-scale collections > high-quality crowd-labeled NLP datasets
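A minimal sketch of what "classify images without explicit labels" looks like in practice, using the public CLIP checkpoint on the Hugging Face hub; the class names, prompt template, and image path below are made up for illustration:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes expressed as natural-language prompts (no training labels needed).
class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in class_names]

image = Image.open("some_image.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# Image-text similarity scores -> probabilities over the candidate prompts.
probs = out.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```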
Understanding Zero-Shot Learning — Making ML More Human | by Ekin Tiu (PDF)
The main idea of zero-shot learning here is to train the model without class labels, maximizing the similarity between embeddings of matching image-text pairs so that related concepts form clusters on the unit hypersphere (similar to how word embeddings cluster).
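A tiny illustration of the hypersphere point, with random stand-in embeddings: L2-normalizing the vectors puts them on the unit sphere, where cosine similarity reduces to a dot product.

```python
import torch
import torch.nn.functional as F

emb = torch.randn(4, 512)            # random stand-ins for learned embeddings
emb = F.normalize(emb, p=2, dim=-1)  # L2-normalize: every vector now lies on the unit hypersphere

print(emb.norm(dim=-1))              # all norms are 1
cos = emb[0] @ emb[1]                # cosine similarity of unit vectors is just a dot product
```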
Example input of CLIP
The text here is a form of supervision, not a set of labels. CLIP uses the text as supervision and derives labels from it with NLP ("These are three cats sitting…" → cats)
Summary: Enables zero-shot learning by contrasting embeddings of different samples
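A short sketch of what "contrasting embeddings of different samples" means, assuming a batch of N image-text pairs with stand-in embeddings: entry [i, j] of the similarity matrix compares image i with text j, and only the matching pairs should score high.

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings for a batch of N = 4 image-text pairs (dim 512).
N, d = 4, 512
image_emb = F.normalize(torch.randn(N, d), dim=-1)
text_emb = F.normalize(torch.randn(N, d), dim=-1)

# Pairwise similarity matrix: sim[i, j] compares image i with text j.
sim = image_emb @ text_emb.T        # shape (N, N)

# After contrastive training, sim[i, i] (the matching pair) should be the
# largest entry in row i, so an argmax over each row recovers the caption.
predicted_text = sim.argmax(dim=1)  # shape (N,)
```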
Understanding Contrastive Learning | by Ekin Tiu (PDF)
The diagonal entries $(I_1 \cdot T_1, ..., I_N \cdot T_N)$ are the positive samples, since each image is matched with its own text embedding. Their cosine similarity (essentially their dot product, as the embeddings are normalized) should be the highest, so their contribution to the loss is the least.
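A PyTorch sketch of the symmetric contrastive loss over that $N \times N$ similarity matrix, close in spirit to the pseudocode in the CLIP paper; the temperature value and embedding dimension here are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over the N x N image-text similarity matrix.

    image_emb, text_emb: (N, d) embeddings for N matching image-text pairs.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = similarity between image i and text j, scaled by temperature.
    logits = image_emb @ text_emb.T / temperature

    # The diagonal (I_i, T_i) pairs are the positives; all off-diagonal pairs are negatives.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i = F.cross_entropy(logits, targets)    # match each image to its text
    loss_t = F.cross_entropy(logits.T, targets)  # match each text to its image
    return (loss_i + loss_t) / 2

# Example with random stand-in embeddings for a batch of 8 pairs.
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```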
https://arxiv.org/pdf/2005.10242.pdf
The main idea is to learn perception from supervision contained in natural language.
Benefits of natural language supervision