ClipTagger-12b Playground


Upload or paste an image, then annotate it with Inference.net

Max 4.5MB

JPEG · PNG · WebP · GIF

Grass × Inference

Read the blog →

ClipTagger-12b is a 12B-parameter vision-language model for scalable video understanding. It outputs schema-consistent JSON for every frame and matches the accuracy of closed frontier models at roughly 17x lower cost.
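
As a rough illustration of the per-frame JSON workflow, a single image can be annotated through an OpenAI-compatible chat-completions client. This is a minimal sketch only: the base URL, model slug, environment variable name, and prompt are assumptions rather than values taken from this page, so check the Serverless API docs linked below for the actual parameters.

```python
# Minimal sketch of annotating one frame with ClipTagger-12b.
# The base URL, model slug, and env var name are ASSUMPTIONS, not
# confirmed values; see the Serverless API docs for the real ones.
import base64
import json
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.inference.net/v1",   # assumed OpenAI-compatible endpoint
    api_key=os.environ["INFERENCE_API_KEY"],   # hypothetical env var
)

with open("frame_0001.jpg", "rb") as f:        # JPEG/PNG/WebP/GIF, max 4.5MB
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="inference-net/cliptagger-12b",      # assumed model slug
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Annotate this frame. Return JSON only."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

# The model returns schema-consistent JSON per frame, so the reply
# can be parsed directly.
annotation = json.loads(response.choices[0].message.content)
print(json.dumps(annotation, indent=2))
```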

Blog post · Docs: Video understanding · Model card (HF) · Serverless API · GitHub

Drop an image here, or press ⌘/Ctrl+V to paste

Video (5-frame) Annotator

Drop a video here


MP4 · WebM · MOV
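
The video annotator works on five frames per clip. As a sketch of what that looks like on the client side, five frames can be sampled from an MP4/WebM/MOV file with OpenCV and then sent through the same image-annotation call shown above. The evenly-spaced sampling strategy here is an assumption for illustration, not the playground's documented behavior.

```python
# Sketch: sample 5 evenly spaced frames from a video for per-frame annotation.
# Requires OpenCV (pip install opencv-python). The 5-frame, evenly-spaced
# strategy is an illustrative assumption.
import cv2

def sample_frames(path: str, n: int = 5) -> list:
    cap = cv2.VideoCapture(path)                # MP4 / WebM / MOV
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        idx = int(i * (total - 1) / max(n - 1, 1))
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)   # seek to the target frame
        ok, frame = cap.read()
        if ok:
            frames.append(frame)                # BGR ndarray per sampled frame
    cap.release()
    return frames

for i, frame in enumerate(sample_frames("clip.mp4")):
    cv2.imwrite(f"frame_{i}.jpg", frame)        # JPEGs to annotate as above
```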