CLIP: Connecting Text and Images

By OpenAI - 2021-01-05

Description

We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision.

Summary

  • Most inspirational for CLIP is the work of Ang Li and his co-authors at FAIR, who in 2016 demonstrated using natural language supervision to enable zero-shot transfer to several existing computer vision classification datasets, such as the canonical ImageNet dataset.
  • To validate this, we have measured CLIP’s zero-shot performance on over 30 different datasets, including tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR (a minimal code sketch of this zero-shot setup follows the list).
  • In particular, learning OCR is an example of an exciting behavior that does not occur in standard ImageNet models.
  • Although it’s noteworthy to achieve these results with task-agnostic pre-training, this performance is not competitive with widely available production-level models.
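
The zero-shot evaluation these bullets describe works by scoring an image against a set of candidate text labels in a shared embedding space and picking the best match. Below is a minimal sketch of that setup using the reference implementation from github.com/openai/CLIP; the model name, image path, and label set are illustrative assumptions, not values from the post.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # ViT-B/32 chosen for illustration

# Placeholder image and candidate labels; swap in your own.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
labels = ["a photo of a dog", "a photo of a cat", "a page of printed text"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # Embed the image and each candidate label into the shared space, then L2-normalize.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarities, scaled and softmaxed over the candidate labels.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Wrapping each class name in a caption-like prompt (e.g. “a photo of a …”) rather than using the bare label is the convention the CLIP paper reports as helping zero-shot accuracy.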


Topics

  1. NLP (0.33)
  2. Machine_Learning (0.3)
  3. Backend (0.13)
