# AskVideos-VideoCLIP
Joint video-text embeddings for search, classification, and more.
- AskVideos-VideoCLIP is a language-grounded video embedding model.
- This model produces a single context-aware embedding for each video clip.
- 16 frames are sampled from each video clip to generate the video embedding (a frame-sampling sketch follows this list).
- The model is trained with contrastive and captioning losses to ground the video embeddings to text.
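For concreteness, here is a minimal sketch of the 16-frame sampling step, assuming uniform temporal sampling with OpenCV; the repository's actual decoder and sampling strategy may differ.

```python
# Minimal frame-sampling sketch. Uniform sampling and OpenCV are
# assumptions; the repo's loader may decode and sample differently.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 16) -> np.ndarray:
    """Uniformly sample `num_frames` RGB frames from a video clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3)
```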
## Pre-trained & Fine-tuned Checkpoints
| Checkpoint | Link |
|---|---|
| AskVideos-VideoCLIP-v0.1 | link |
| AskVideos-VideoCLIP-v0.2 | link |
| AskVideos-VideoCLIP-v0.3 | link |
The demo is also available to run on Colab:
| Model | Colab link |
|---|---|
| AskVideos-VideoCLIP-v0.1 | link |
| AskVideos-VideoCLIP-v0.2 | link |
## Usage
### Environment Preparation
First, install ffmpeg.
```bash
apt update
apt install ffmpeg
```
Then, create a conda environment:
```bash
conda create -n askvideosclip python=3.9
conda activate askvideosclip
```
Then, install the requirements:
```bash
pip3 install -U pip
pip3 install -r requirements.txt
```
## How to Run Demo Locally
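The demo script itself ships with the repository and the Colab notebooks linked above. As a stand-in, the sketch below shows the text-to-video retrieval pattern the per-clip embeddings support; the random vectors are placeholders for real video and text encoder outputs, and the embedding width of 768 is an assumption.

```python
# Retrieval sketch with placeholder embeddings. Replace the random
# vectors with real AskVideos-VideoCLIP video/text embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 768  # assumed embedding width; check the checkpoint config

# One embedding per video clip (the model emits a single vector per clip).
video_embs = {"clip_a.mp4": rng.normal(size=dim),
              "clip_b.mp4": rng.normal(size=dim)}
query_emb = rng.normal(size=dim)  # stand-in for a text-query embedding

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank clips against the query; with real embeddings this is video search.
ranked = sorted(video_embs,
                key=lambda name: cosine(video_embs[name], query_emb),
                reverse=True)
print(ranked)
```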
## Star History
## Terms of Use
AskVideos code and models are distributed under the Apache 2.0 license.
## Acknowledgement
This model is inspired by the Video-Qformer from Video-LLaMA.
## Citation
```bibtex
@misc{askvideos2024videoclip,
  title        = {AskVideos-VideoCLIP: Language-grounded video embeddings},
  author       = {AskVideos},
  year         = {2024},
  howpublished = {GitHub},
  url          = {https://github.com/AskYoutubeAI/AskVideos-VideoCLIP}
}
```
