# AskVideos-VideoCLIP
Joint video-text embeddings for search, classification, and more.
- AskVideos-VideoCLIP is a language-grounded video embedding model.
- This model produces a single context-aware embedding for each video clip.
- 16 frames are sampled from each video clip to generate the video embedding (a frame-sampling sketch follows this list).
- The model is trained with contrastive and captioning losses to ground the video embeddings to text.
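For concreteness, here is a minimal sketch of the 16-frame sampling step, assuming uniform temporal sampling with OpenCV; the repository's actual decoder and sampling strategy may differ.

```python
# Minimal frame-sampling sketch. Uniform sampling and OpenCV are
# assumptions; the repo's loader may decode and sample differently.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 16) -> np.ndarray:
    """Uniformly sample `num_frames` RGB frames from a video clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3)
```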
## Pre-trained & Fine-tuned Checkpoints
| Checkpoint | Link |
|---|---|
| AskVideos-VideoCLIP-v0.1 | link |
| AskVideos-VideoCLIP-v0.2 | link |
| AskVideos-VideoCLIP-v0.3 | link |
The demo is also available to run on Colab:
| Model | Colab link |
|---|---|
| AskVideos-VideoCLIP-v0.1 | link |
| AskVideos-VideoCLIP-v0.2 | link |
## Usage
### Environment Preparation
First, install ffmpeg.
```bash
apt update
apt install ffmpeg
```
Then, create a conda environment:
```bash
conda create -n askvideosclip python=3.9
conda activate askvideosclip
```
Then, install the requirements:
```bash
pip3 install -U pip
pip3 install -r requirements.txt
```
## How to Run Demo Locally
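The demo script itself ships with the repository and the Colab notebooks linked above. As a stand-in, the sketch below shows the text-to-video retrieval pattern the per-clip embeddings support; the random vectors are placeholders for real video and text encoder outputs, and the embedding width of 768 is an assumption.

```python
# Retrieval sketch with placeholder embeddings. Replace the random
# vectors with real AskVideos-VideoCLIP video/text embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 768  # assumed embedding width; check the checkpoint config

# One embedding per video clip (the model emits a single vector per clip).
video_embs = {"clip_a.mp4": rng.normal(size=dim),
              "clip_b.mp4": rng.normal(size=dim)}
query_emb = rng.normal(size=dim)  # stand-in for a text-query embedding

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank clips against the query; with real embeddings this is video search.
ranked = sorted(video_embs,
                key=lambda name: cosine(video_embs[name], query_emb),
                reverse=True)
print(ranked)
```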
## Star History
## Terms of Use
AskVideos code and models are distributed under the Apache 2.0 license.
## Acknowledgement
This model is inspired by the Video-Qformer from Video-LLaMA.
## Citation
```bibtex
@misc{askvideos2024videoclip,
  title        = {AskVideos-VideoCLIP: Language-grounded video embeddings},
  author       = {AskVideos},
  year         = {2024},
  howpublished = {GitHub},
  url          = {https://github.com/AskYoutubeAI/AskVideos-VideoCLIP}
}
```
