Description
We’re introducing a neural network called CLIP, which efficiently learns visual concepts from natural language supervision.
Summary
- Connecting Text and Images (January 5, 2021): introduces a neural network called CLIP that efficiently learns visual concepts from natural language supervision.
- The most direct inspiration for CLIP is the work of Ang Li and his co-authors at FAIR, who in 2016 demonstrated using natural language supervision to enable zero-shot transfer to several existing computer vision classification datasets, such as the canonical ImageNet dataset.
- To validate this, we measured CLIP’s zero-shot performance on over 30 different datasets, covering tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR.[2] (A minimal zero-shot classification sketch follows this list.)
- In particular, learning OCR is an example of an exciting behavior that does not occur in standard ImageNet models.
- Although it’s noteworthy to achieve these results with task-agnostic pre-training, this performance is not competitive with widely available production-level models.
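
To make the zero-shot transfer mentioned above concrete, here is a minimal sketch of CLIP-style zero-shot classification. It assumes the open-source `clip` package from github.com/openai/CLIP (plus `torch` and `Pillow`) is installed; the image path and candidate labels are illustrative placeholders, not examples from the post.

```python
# Sketch of zero-shot image classification with CLIP.
# Assumptions: the github.com/openai/CLIP package is installed as `clip`,
# and "example.jpg" is any local image (hypothetical placeholder).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate classes are expressed as natural-language prompts; no
# task-specific training is needed, which is what "zero-shot" refers to.
labels = ["a dog", "a cat", "a diagram"]
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity between the image and each prompt, converted to
    # per-label probabilities with a softmax.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The prompt template "a photo of {label}" is one common choice here; swapping in a template that better matches the target dataset is a typical way to improve zero-shot accuracy without any retraining.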