CLIP: Connecting text and images


Footnotes

  1. In 2015, a group of researchers from Microsoft was the first to train a model whose top-5 accuracy on ImageNet surpassed the reported human top-5 accuracy.

  2. While CLIP’s zero-shot OCR performance is mixed, its semantic OCR representation is quite useful. When evaluated on the SST-2 NLP dataset rendered as images, a linear classifier on CLIP’s representation matches a CBoW model with direct access to the text. CLIP is also competitive at detecting hateful memes without needing ground-truth text. (A minimal sketch of this kind of linear-probe setup appears after these footnotes.)

  3. FairFace is a face image dataset designed to balance age, gender, and race, in order to reduce asymmetries common in previous face datasets. It categorizes gender into 2 groups (female and male) and race into 7 groups (White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino). There are inherent problems with race and gender classifications, as e.g. Bowker and Star (2000) and Keyes (2018) have shown. While FairFace’s dataset reduces the proportion of White faces, it still lacks representation of entire large demographic groups, effectively erasing such categories. We use the 2 gender categories and 7 race categories defined in the FairFace dataset in a number of our experiments not to reinforce or endorse the use of such reductive categories, but to enable comparisons to prior work.
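The SST-2 evaluation described in footnote 2 is a linear probe: sentences are rendered as images, embedded with CLIP’s image encoder, and a linear classifier is fit on the embeddings. The sketch below illustrates that setup under loose assumptions; it uses the open-source `clip` package (github.com/openai/CLIP), a toy text renderer, and tiny hypothetical `train_sents`/`test_sents` lists standing in for an SST-2 split. It is not the evaluation pipeline used in the paper.

```python
# Minimal sketch: linear probe on CLIP image features for text rendered as images.
# Assumes `clip` (github.com/openai/CLIP), Pillow, scikit-learn, and torch are installed.
import torch
import clip
from PIL import Image, ImageDraw
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def render_text(sentence, size=(224, 224)):
    """Draw a sentence as black text on a white canvas (toy renderer, no wrapping)."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((8, 8), sentence, fill="black")
    return img

def embed(sentences):
    """Encode rendered sentences with CLIP's image encoder."""
    batch = torch.stack([preprocess(render_text(s)) for s in sentences]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats.float().cpu().numpy()

# Hypothetical stand-ins for an SST-2 split loaded elsewhere (1 = positive, 0 = negative).
train_sents = ["a gorgeous, witty film", "a dull, lifeless mess"]
train_labels = [1, 0]
test_sents = ["sharp and funny throughout"]
test_labels = [1]

probe = LogisticRegression(max_iter=1000)
probe.fit(embed(train_sents), train_labels)
print("probe accuracy:", probe.score(embed(test_sents), test_labels))
```

In practice one would use the full SST-2 train/test splits and tune the probe’s regularization on a validation set, mirroring the usual linear-probe protocol.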

References

  1. Dodge, S., & Karam, L. (2017, July). “A study and comparison of human and deep learning recognition performance under visual distortions.” In ICCCN 2017.

  2. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2018). “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” In ICLR 2019.

  3. Alcorn, M. A., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W. S., & Nguyen, A. (2019). “Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects.” In CVPR 2019.

  4. Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., ... & Katz, B. (2019). “ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models.” In NeurIPS 2019.

  5. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). “Language Models are Unsupervised Multitask Learners.” Technical Report, OpenAI.

  6. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. (2020). “Language Models are Few-Shot Learners.” In NeurIPS 2020.

  7. He, K., Zhang, X., Ren, S., & Sun, J. (2016). “Deep residual learning for image recognition.” In CVPR 2016.

  8. Larochelle, H., Erhan, D., & Bengio, Y. (2008, July). “Zero-data learning of new tasks.” In AAAI 2008.

  9. Lampert, C. H., Nickisch, H., & Harmeling, S. (2009, June). “Learning to detect unseen object classes by between-class attribute transfer.” In CVPR 2009.

  10. Lei Ba, J., Swersky, K., & Fidler, S. (2015). “Predicting deep zero-shot convolutional neural networks using textual descriptions.” In ICCV 2015.

  11. Socher, R., Ganjoo, M., Manning, C. D., & Ng, A. (2013). “Zero-shot learning through cross-modal transfer.” In NeurIPS 2013.

  12. Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M. A., & Mikolov, T. (2013). “DeViSE: A deep visual-semantic embedding model.” In NeurIPS 2013.

  13. Li, A., Jabri, A., Joulin, A., & van der Maaten, L. (2017). “Learning visual n-grams from web data.” In ICCV 2017.

  14. Doersch, C., Gupta, A., & Efros, A. A. (2015). “Unsupervised visual representation learning by context prediction.” In ICCV 2015.

  15. Zhai, X., Oliver, A., Kolesnikov, A., & Beyer, L. (2019). “S4L: Self-supervised semi-supervised learning.” In ICCV 2019.

  16. Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Piot, B. (2020). “Bootstrap your own latent: A new approach to self-supervised learning.” In NeurIPS 2020.

  17. Oord, A. V. D., Li, Y., & Vinyals, O. (2018). “Representation Learning with Contrastive Predictive Coding.” arXiv preprint.

  18. Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., & Bengio, Y. (2018). “Learning deep representations by mutual information estimation and maximization.” In ICLR 2019.

  19. Bachman, P., Hjelm, R. D., & Buchwalter, W. (2019). “Learning representations by maximizing mutual information across views.” In NeurIPS 2019.

  20. He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). “Momentum contrast for unsupervised visual representation learning.” In CVPR 2020.

  21. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). “A simple framework for contrastive learning of visual representations.” arXiv preprint.

  22. Lee, D. H. (2013, June). “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks.” In ICML 2013 Workshop on Challenges in Representation Learning.

  23. Xie, Q., Luong, M. T., Hovy, E., & Le, Q. V. (2020). “Self-training with noisy student improves ImageNet classification.” In CVPR 2020.

  24. Kingma, D. P., Mohamed, S., Jimenez Rezende, D., & Welling, M. (2014). “Semi-supervised learning with deep generative models.” In NeurIPS 2014.

  25. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). “Improved techniques for training GANs.” In NeurIPS 2016.

  26. Donahue, J., & Simonyan, K. (2019). “Large scale adversarial representation learning.” In NeurIPS 2019.

  27. Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. (2020, November). “Generative pretraining from pixels.” In ICML 2020.

  28. He, K., Zhang, X., Ren, S., & Sun, J. (2015). “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.” In ICCV 2015.

  29. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Berg, A. C. (2015). “ImageNet large scale visual recognition challenge.” In IJCV 2015.

  30. Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., & Schmidt, L. (2020). “Measuring robustness to natural distribution shifts in image classification.” In NeurIPS 2020.

  31. Sohn, K. (2016). “Improved deep metric learning with multi-class n-pair loss objective.” In NeurIPS 2016.

  32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). “Attention is all you need.” In NeurIPS 2017.

  33. Desai, K., & Johnson, J. (2020). “VirTex: Learning Visual Representations from Textual Annotations.” arXiv preprint.

  34. Sariyildiz, M. B., Perez, J., & Larlus, D. (2020). “Learning Visual Representations with Caption Annotations.” In ECCV 2020.

  35. Zhang, Y., Jiang, H., Miura, Y., Manning, C. D., & Langlotz, C. P. (2020). “Contrastive Learning of Medical Visual Representations from Paired Images and Text.” arXiv preprint.

  36. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Uszkoreit, J. (2020). “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint.

  37. Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., ... & van der Maaten, L. (2018). “Exploring the limits of weakly supervised pretraining.” In ECCV 2018.

  38. Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., & Houlsby, N. (2019). “Big Transfer (BiT): General Visual Representation Learning.” arXiv preprint.

  39. Kärkkäinen, K., & Joo, J. (2019). “FairFace: Face attribute dataset for balanced race, gender, and age.” arXiv preprint.

  40. Bowker, G., & Star, S. L. (1999). “Sorting things out: Classification and its consequences.” Book.

  41. Keyes, O. (2018). “The misgendering machines: Trans/HCI implications of automatic gender recognition.” In Proceedings of the ACM on Human-Computer Interaction.

Acknowledgments

We’d like to thank the millions of people involved in creating the data CLIP is trained on. We are also grateful to all our co-authors for their contributions to the project. Finally, we’d like to thank Jeff Clune, Miles Brundage, Ryan Lowe, Jakub Pachocki, and Vedant Misra for feedback on drafts of this blog, and Matthew Knight for reviewing the code release.

Design & Cover Artwork

Justin Jay Wang