Description
We’re introducing a neural network called CLIP, which efficiently learns visual concepts from natural language supervision.
Summary
- Connecting Text and Images (January 5, 2021): introduces a neural network called CLIP that efficiently learns visual concepts from natural language supervision.
- The most direct inspiration for CLIP is the work of Ang Li and his co-authors at FAIR, who in 2016 demonstrated using natural language supervision to enable zero-shot transfer to several existing computer vision classification datasets, such as the canonical ImageNet dataset.
- To validate this, we measured CLIP’s zero-shot performance on over 30 different datasets, covering tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR.[2] (A minimal zero-shot classification sketch follows this list.)
- In particular, learning OCR is an example of an exciting behavior that does not occur in standard ImageNet models.
- Although it’s noteworthy to achieve these results with task-agnostic pre-training, this performance is not competitive with widely available production-level models.
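
To make the zero-shot transfer mentioned above concrete, here is a minimal sketch of CLIP-style zero-shot classification. It assumes the open-source `clip` package from github.com/openai/CLIP (plus `torch` and `Pillow`) is installed; the image path and candidate labels are illustrative placeholders, not examples from the post.

```python
# Sketch of zero-shot image classification with CLIP.
# Assumptions: the github.com/openai/CLIP package is installed as `clip`,
# and "example.jpg" is any local image (hypothetical placeholder).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate classes are expressed as natural-language prompts; no
# task-specific training is needed, which is what "zero-shot" refers to.
labels = ["a dog", "a cat", "a diagram"]
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity between the image and each prompt, converted to
    # per-label probabilities with a softmax.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The prompt template "a photo of {label}" is one common choice here; swapping in a template that better matches the target dataset is a typical way to improve zero-shot accuracy without any retraining.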