VideoCLIP: Contrastive Pre-Training for Zero-Shot Video-Text Understanding
arxiv.org
Not sure if this is the same thing?
Not the same. CLIP is trained on image-text pairs, whereas VideoCLIP is trained on video-text pairs.
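To make the "pairs" part concrete, here is a rough sketch of a symmetric contrastive loss over a batch of paired embeddings. It is a generic illustration, not the training code of either model: the function names, the toy 2-D vectors, and the fixed temperature of 0.07 are all assumptions made for the example. The loss itself does not care whether the first list of embeddings came from an image encoder or a video encoder; what differs between the two models is what gets encoded and paired.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(visual_embs, text_embs, temperature=0.07):
    """Hypothetical helper: item i in visual_embs is paired with item i in text_embs.

    The temperature is fixed here as an assumption for the sketch.
    """
    v = [normalize(e) for e in visual_embs]
    t = [normalize(e) for e in text_embs]
    # Similarity of every visual item against every text item in the batch.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed))

# Toy batch of two pairs. The "visual" vectors could stand for images or videos.
visual = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[0.9, 0.1], [0.1, 0.9]]   # text i is close to visual i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_matched = symmetric_contrastive_loss(visual, text_matched)
loss_swapped = symmetric_contrastive_loss(visual, text_swapped)
print(loss_matched < loss_swapped)
```

The correctly paired batch gets a lower loss than the mismatched one, which is the signal that pulls matching pairs together during training.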