Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Nice paper. I particularly like how they talk through the ideas they tried that didn’t work, and the process they used to land on the final results. A lot of ML papers present the finished result as if it appeared from nowhere, without trial and error, perhaps with some ablations in the appendix. I wish more papers followed this one in talking about the dead ends along the way.
Nothing would benefit the scientific enterprise more than explicitly publishing papers about failed experiments.
How much public money, time, and careers have been wasted chasing something that is already known not to work?
Unfortunately, careers don't get advanced that way. There are very backwards incentives.
I’m aware; I left the academic world in no small part because I refused to write papers that weren’t worth reading. A high-quality but short CV is a career ender these days. I’m happier now, though!
Very interesting work! More details here: https://depth-anything.github.io/
It seems better overall, and per parameter, than current work, for both relative and absolute depth measurement.
Is there any research people are aware of that provides sub-mm level models? For 3D modeling purposes? Or is "classic" photogrammetry still the best option there?
In grad school I was using stereo video cameras to measure fish. I wonder if a model like this could now do it accurately from frame grabs of a single feed. And of course an AI to identify fish, even if it just flagged which sections of video did or did not have fish, without even doing species-level ID, would have saved a ton of time.
We had a whole workshop on various monitoring technologies and the take home from the various video tools is that having highly trained grad students and/or techs watch and analyze the video is extremely slow and expensive.
I haven't worked with video in a while now, but I wonder if any labs are doing more automated identification these days. It feels like the kind of problem that is probably completely solvable if the right tech gets applied.
Definitely not with this model, because monocular depth is scale-ambiguous: is the fish 34cm away and 34cm long, or 30cm away and 30cm long? Both project to the same image. And the fish is floating in a transparent medium, so reference points aren’t even useful for calibration.
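To make the ambiguity concrete: under a pinhole camera, projected size is focal length times object size divided by distance, so any fish with the same size-to-distance ratio looks identical on the sensor. A toy calculation (the focal length is a made-up value):

```python
# Pinhole projection: image_size_px = focal_px * object_size / distance.
# Two fish with the same size-to-distance ratio project to the same
# image size, so one image cannot separate size from distance.
FOCAL_PX = 1000.0  # arbitrary illustrative focal length, in pixels

def projected_size_px(object_size_m, distance_m):
    return FOCAL_PX * object_size_m / distance_m

print(projected_size_px(0.34, 0.34))  # 34 cm fish at 34 cm -> 1000.0 px
print(projected_size_px(0.30, 0.30))  # 30 cm fish at 30 cm -> 1000.0 px
```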
Are the fish always the same color, and is their color distinct from the background?
I work at an industrial plant, and we have been able to measure a lot of things simply by analyzing the pixels in the video. For example, in one application we have a camera pointed down at a conveyor belt. The conveyor belt is one color, and objects on the belt are a distinctly different color.
- We just count how many pixels in a given frame are a specific color/brightness. Then you can easily work out how much of the conveyor belt has material on it in any given frame.
So if you are trying to work out which sections of a video have fish in them, you could count how many pixels differ from the normal background color, as in the sketch below.
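A minimal sketch of that kind of pixel counting with OpenCV (the HSV bounds, video path, and coverage threshold are placeholders you would tune for the actual footage):

```python
import cv2
import numpy as np

# Hypothetical HSV range for "interesting" pixels (material on the belt,
# or fish-colored regions); the bounds must be tuned per camera/scene.
LOWER = np.array([5, 80, 80])
UPPER = np.array([25, 255, 255])

def coverage_fraction(frame_bgr):
    """Fraction of the frame whose pixels fall inside the color range."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER, UPPER)
    return np.count_nonzero(mask) / mask.size

cap = cv2.VideoCapture("feed.mp4")  # placeholder path
frame_idx, flagged = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if coverage_fraction(frame) > 0.02:  # threshold chosen by eye
        flagged.append(frame_idx)        # frame has material / fish
    frame_idx += 1
cap.release()
```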
You can definitely train a model to identify fish. To be honest, you don't really have to train a whole model: there are tons of models trained on millions of images, so you can just extract the embeddings from one of those and train a single matrix to project them to your classes, and it will work very well. Something like the sketch below.
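For instance, a minimal linear-probe sketch using torchvision's pretrained ResNet-50 as a frozen embedding model (the class count and training data are placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen pretrained backbone used purely as an embedding extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()  # drop the classifier head -> 2048-d embeddings
backbone.eval().requires_grad_(False)

# The only trained parameters: one linear projection to the classes.
num_classes = 2  # e.g. fish vs. no-fish (placeholder)
probe = nn.Linear(2048, num_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (B, 3, 224, 224) normalized batch; labels: (B,) class ids."""
    with torch.no_grad():
        emb = backbone(images)   # (B, 2048) frozen embeddings
    loss = loss_fn(probe(emb), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```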
Very likely. Tbh, I think there are a lot of domain tasks where adding a machine learning expert to the team would make success and progress much more likely. But to be fair, there are a lot of people who can do ML and not a lot who have a deep understanding of it. The difference matters for real-world tasks, where the gap between dataset performance and generalization performance matters. And it's all too common that SOTA work is harder to generalize, though this is high variance.
Can someone explain the meaning of labelled vs unlabelled in this context? What kind of information would the labels carry?
Did they have depth maps for all 62 million images or not?
They explain in the paper that they used 1.5 million images with known depth maps (labels) to train a teacher model, and then used the teacher model to create pseudolabels (inferred depth maps) for the full dataset. Then they trained a student model to recover those pseudolabels from distorted versions of the original images.
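In rough pseudocode, the pipeline described above looks something like this (function names are stand-ins; the paper's actual recipe uses specific strong perturbations such as color distortion and CutMix, plus its own depth losses):

```python
# Stage 1: supervised teacher on the labeled subset (~1.5M images).
teacher = train_supervised(labeled_images, gt_depth_maps)

# Stage 2: self-training the student on the unlabeled pool (~62M images).
for image in unlabeled_images:
    pseudo_depth = teacher(image)      # pseudolabel from the CLEAN image
    distorted = strong_augment(image)  # heavy perturbation of the input
    pred = student(distorted)
    loss = depth_loss(pred, pseudo_depth)  # student must still recover
    update(student, loss)                  # the teacher's depth map
```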
Was that better than running the teacher model on the distorted images directly?
Very cool to see TikTok sharing its research.
Any information on the inference speed of this vs MiDaS?
Impressive demo.
Any FSD startup that put their money on LiDAR is even more screwed now.
Disagree there. Humans have massive compute, dual optics, and amazing filters.
Computer vision has one or two of those three, and I don't think we are near AGI for self-driving yet. Driving is, IMO, an AGI-level task.
Does your dataset have a crocodile in it? Does your monocular depth model get fooled by a billboard that's just a photo?
>Does your monocular depth model get fooled by a billboard that's just a photo?
This is actually a pretty clever example. I tried a few billboards on the demo online, and since these models are regressive (they output the mean of the possible outputs), the model is sometimes perplexed: it doesn’t seem to know whether to output something completely flat or something with actual depth, and by being perplexed it outputs something in between.
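That behavior is what plain L2 regression predicts: the minimizer of squared error over an ambiguous target distribution is its mean, so a flat answer and a deep answer blend into something in between. A toy illustration with made-up depth values:

```python
import torch

# Ambiguous supervision: for the same input, half the targets say
# "flat billboard" (5 m), half say "real scene behind it" (20 m).
targets = torch.tensor([5.0, 20.0]).repeat(1000)

pred = torch.zeros(1, requires_grad=True)  # one constant prediction
opt = torch.optim.SGD([pred], lr=0.1)
for _ in range(500):
    loss = ((pred - targets) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(pred.item())  # ~12.5: the in-between answer, neither flat nor deep
```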
AGI is a pretty fuzzy term whose goalposts will shift, just like AI’s have. You can define it that way tautologically, but I can easily see a world where we have self-driving cars but standalone AI scientists don’t exist. Does that mean we have AGI because we have self-driving cars, or that we don’t have it because it’s not general, in that it can’t also tackle other human endeavors?
That's only a "happy path" attitude.
How well would a monocular pipeline handle headlights moving toward it at night? How about rain, snow, or fog?
I'm not saying LiDAR is the only way, but I don't see a reason to use this as a solution.
I'm not saying this isn't valuable. I used to work in the 3D/metaverse space, and having depth from a single photo, and being able to recreate a 3D scene from it, is very valuable, and is the future.