Active vision has great theoretical potential, but has struggled in practice.
CanViT aims to change that, with a novel ViT-based architecture designed from the ground up for active vision at scale, and a distillation paradigm that makes it straightforward to train in a task- and policy-agnostic manner.
Our work introduces the Active-Vision Foundation Model (AVFM) paradigm, built on the idea of decoupling active-vision pretraining from the final viewing policy, and makes this paradigm computationally and empirically viable.
Easy-to-use code and HuggingFace-compatible checkpoints are available.
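As a rough sketch of what loading one of these checkpoints might look like with the `transformers` library (the repository id `example-org/canvit-base` below is a placeholder, not the actual released checkpoint name, and the output structure is assumed to follow the usual ViT conventions):

```python
# Hypothetical usage sketch -- the repo id is a placeholder, not the real checkpoint.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("example-org/canvit-base")
model = AutoModel.from_pretrained("example-org/canvit-base", trust_remote_code=True)

image = Image.open("scene.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # features that a downstream viewing policy could consume

# Assumes the model exposes ViT-style token features.
print(outputs.last_hidden_state.shape)
```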
Extensions to video and to embodied settings (robotics, motorized cameras, and so on) are natural next steps.
We are excited to share this work with the community and to see what people build on top of it!
The paper is also listed on alphaXiv.