Abstract: We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at this https URL.
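The contrastive objective mentioned in the abstract can be illustrated with a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired video and text embeddings. This is not the authors' implementation: the function name `symmetric_info_nce`, the L2 normalization, and the `temperature` parameter are assumptions made for illustration, and the sketch treats the other items in the batch as negatives rather than reproducing the retrieval-based hard negative sampling.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(video_emb, text_emb, temperature=1.0):
    """Hypothetical helper: symmetric contrastive loss for paired embeddings.

    Row i of video_emb and row i of text_emb form a positive pair; every
    other row in the batch acts as a negative (an assumption of this sketch).
    """
    # Assumption: cosine similarity via L2-normalized embeddings.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] is the similarity between video i and text j.
    logits = v @ t.T / temperature
    idx = np.arange(len(v))

    # Video-to-text: normalize over texts (columns) for each video.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-video: normalize over videos (rows) for each text.
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return loss_v2t + loss_t2v
```

In this sketch, harder negatives would correspond to batches whose off-diagonal similarities are large, which is the situation a nearest neighbor retrieval step is meant to create.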
Submission history
From: Hu Xu
[v1] Tue, 28 Sep 2021 23:01:51 UTC (1,017 KB)
[v2] Fri, 1 Oct 2021 15:13:27 UTC (1,017 KB)