Upload or paste an image, then annotate using Inference.net
Max 4.5MB
JPEG · PNG · WebP · GIF

ClipTagger-12b is a 12B-parameter vision-language model for scalable video understanding. It outputs schema-consistent JSON per frame and matches the accuracy of frontier closed models at roughly 17x lower cost.
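As a rough illustration of what "schema-consistent JSON per frame" makes possible downstream, the sketch below checks that every frame's output shares the same top-level keys. The field names (`description`, `objects`) and the helper `inconsistent_frames` are hypothetical and chosen only for the example; they are not taken from the model's actual schema.

```python
import json


def inconsistent_frames(frame_outputs):
    """Return indices of frames whose top-level keys differ from the first frame."""
    if not frame_outputs:
        return []
    parsed = [json.loads(raw) for raw in frame_outputs]
    expected = set(parsed[0])
    return [i for i, frame in enumerate(parsed) if set(frame) != expected]


# Hypothetical per-frame outputs; the field names are illustrative only.
frames = [
    '{"description": "a dog on a beach", "objects": ["dog", "sand"]}',
    '{"description": "a dog running", "objects": ["dog"]}',
    '{"description": "waves"}',  # missing "objects", so it is flagged
]

bad = inconsistent_frames(frames)
print(bad)
```

When every frame follows one schema, this check returns an empty list and the outputs can be loaded into a table or index without per-frame special cases.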
Video (5-frame) Annotator
MP4 · WebM · MOV