DINOv3


INTRODUCING DINOV3

Self-supervised learning for vision at unprecedented scale

DINOv3 scales self-supervised learning (SSL) for images to produce our strongest universal vision backbones, enabling breakthrough performance across diverse domains.

DINOV3 OVERVIEW

Cutting-edge image representations, trained without human supervision

We scaled unsupervised training to 7B-parameter models and a 1.7B-image dataset, using a fraction of the compute required by weakly-supervised methods. Even with the backbones kept frozen during evaluation, they achieve absolute state-of-the-art performance across diverse domains.

Exceptional performance across visual domains

SSL unlocks domains where annotations are scarce or costly. DINOv3 backbones deliver state-of-the-art results on tasks ranging from object detection in web imagery to canopy height mapping in satellite and aerial imagery.

Versatile backbone with powerful dense image features

High-resolution dense features from a single DINOv3 backbone enable leading performance across vision tasks, including object detection, depth estimation, and segmentation, without any finetuning.
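
As an illustration, the sketch below pulls dense patch tokens from a frozen backbone and reshapes them into a spatial feature map. The hub entrypoint, weights name, and `forward_features` output format are assumptions modeled on DINOv2's public API, not a confirmed DINOv3 interface; check the official release for the actual calls.

```python
import torch

# ASSUMPTION: entrypoint name mirrors DINOv2's torch.hub API; consult the
# official facebookresearch/dinov3 repository for the real one.
backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_vitl16")
backbone.eval()

# One RGB image; height and width must be multiples of the patch size (16).
image = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    # ASSUMPTION: a DINOv2-style forward_features() that returns normalized
    # patch tokens under "x_norm_patchtokens".
    tokens = backbone.forward_features(image)["x_norm_patchtokens"]  # (1, 196, C)

# Reshape the token sequence into a dense (C, H/16, W/16) feature map that
# detection, segmentation, or depth heads can consume without finetuning.
h = w = 224 // 16
dense = tokens.reshape(1, h, w, -1).permute(0, 3, 1, 2)  # (1, C, 14, 14)
```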

Efficient model sizes and architectures

We release a comprehensive model suite addressing a wide range of use cases, including broad coverage of ViT sizes and efficient ConvNeXt models for on-device deployment.

PERFORMANCE

Evaluating DINOv3's Performance

DINOv3 sets a new standard for vision foundation models. For the first time, a model trained with SSL outperforms weakly-supervised models on a broad range of probing tasks, from fine-grained image classification to semantic segmentation to object tracking in video.

[Figure: chart of DINOv3 performance stats]

APPROACH

Self-supervised pre-training unlocks simple task adaptation

Pre-training data is curated from a large unlabeled image pool. During pre-training, the model learns general-purpose visual representations by matching features between different augmented views of the same image. In post-training, the large model is distilled into a family of smaller, more efficient models.
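
To make the view-matching objective concrete, here is a minimal, self-contained sketch in the spirit of DINO-family self-distillation. Everything below (the tiny network, temperatures, and EMA rate) is illustrative and not the actual DINOv3 training code, which uses large ViTs, multi-crop augmentation, and additional losses.

```python
# Illustrative sketch of DINO-style self-distillation between two augmented
# views of the same images (teacher centering and multi-crop omitted).
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 256))
teacher = copy.deepcopy(student)      # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)           # teacher receives no gradients

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def dino_loss(student_out, teacher_out, t_s=0.1, t_t=0.04):
    # Cross-entropy between the sharpened teacher distribution and the
    # student distribution: the student learns to match the teacher's view.
    teacher_probs = F.softmax(teacher_out / t_t, dim=-1)
    student_logp = F.log_softmax(student_out / t_s, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

view_a = torch.randn(8, 3, 64, 64)    # two augmented views of the same batch
view_b = torch.randn(8, 3, 64, 64)

loss = dino_loss(student(view_a), teacher(view_b))
opt.zero_grad()
loss.backward()
opt.step()

# The teacher follows the student via an exponential moving average (EMA).
with torch.no_grad():
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(0.996).add_(ps, alpha=1 - 0.996)
```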

A pre-trained DINOv3 model can be tailored to a new task simply by training a lightweight adapter on a small amount of annotated data.
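
For instance, a linear probe over frozen features is the simplest such adapter. The sketch below uses a stand-in backbone and a toy labeled batch; in practice you would load the released DINOv3 weights and a real annotated dataset.

```python
# Minimal sketch: training a lightweight linear adapter on frozen features.
# The backbone below is a placeholder; swap in a pretrained DINOv3 model
# and freeze it the same way.
import torch
import torch.nn.functional as F

embed_dim, num_classes = 1024, 10     # illustrative sizes

# Placeholder backbone, kept frozen just like the real DINOv3 backbone.
backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, embed_dim))
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

adapter = torch.nn.Linear(embed_dim, num_classes)   # the only trainable part
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

# A toy annotated batch standing in for a small labeled dataset.
images = torch.randn(16, 3, 224, 224)
labels = torch.randint(0, num_classes, (16,))

with torch.no_grad():                 # the backbone stays frozen
    feats = backbone(images)          # (16, embed_dim)
loss = F.cross_entropy(adapter(feats), labels)
opt.zero_grad()
loss.backward()
opt.step()
```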

DINO Evolution

DINOv3 marks a new milestone in self-supervised training at scale. It builds on the scaling progress of DINOv2, increasing model size by 6x and training data by 12x.
