Robust Conditional 3D Shape Generation from Casual Captures

facebookresearch.github.io

54 points by lastdong a day ago · 6 comments

fxtentacle 15 hours ago

This turns point clouds into meshes.

That means it doesn’t need depth. Depth is helpful for getting good point locations, but SLAM on multiple frames should also work.

I’m guessing that they are researching this for AR or robot navigation. Otherwise, the focus on accurately dividing the scene into objects wouldn’t make sense to me.

  • KaiserPro 13 hours ago

    It's much deeper than that.

    Segmentation in 2D is mostly a solved problem (Segment Anything is pretty fucking great), and segmentation in 3D is also fairly well done. You can use DINOv2 to do 3D object detection and segmentation.

    The difficult part _after_ that is interacting with the object. Sparse and semi-dense point clouds can be generated and refined in real time, but they are point clouds, not meshes. This means that interacting with the object accurately is super hard, because it's not a simple mesh that can be tested/interacted with; it's a bunch of points around the edges.

    Where this is useful is that it lets you generate a mostly plausible, simple 3D model that can act as a stand-in for any further interactions. In VR you can use it as a collision object for physics. For robotics you can use it to plan interactions (e.g. placing objects on the table).

    It's also a step in the direction of answering "whose" object it is, rather than "what" the object is. "Whose water bottle is this?" is much, much harder for machines to answer (without markers) than "is this a water bottle?" or "where is the water bottle in this scene?"
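To illustrate the mesh-vs-point-cloud point in the comment above: a mesh supports exact queries like ray intersection (the basis of collision tests in VR or grasp planning), which a bare set of edge points does not. A minimal sketch, not from the paper, using the standard Möller–Trumbore ray-triangle test:

```python
# Minimal ray-triangle intersection (Moller-Trumbore), pure Python.
# Shows why a mesh supports exact collision queries while a raw point
# cloud does not. Illustrative sketch only; not ShapeR's code.

def sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def cross(a, b): return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
def dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def ray_hits_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the ray parameter t of the hit, or None if the ray misses."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:      # outside the triangle in barycentric u
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:  # outside in barycentric v
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None

# A ray pointing straight down hits the unit triangle at z=0 after t=1:
t = ray_hits_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0),
                      (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(t)  # 1.0
```

With only a point cloud, the same query needs a density threshold or a surface fit first, which is exactly the gap a generated mesh closes.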

nico 20 hours ago

Does this need depth data capture as well? The phrase “casual captures” makes it seem like it only needs images, but apparently they use depth data too.

Also, can it run on Apple silicon?

  • KaiserPro 17 hours ago

    Nope, only needs depth for ground truth.

    It's designed to run on top of a SLAM system that outputs a sparse point cloud.

    On page 4, top right, you can see how the point cloud is used to feed into the object generator: https://cdn.jsdelivr.net/gh/facebookresearch/ShapeR@main/res...
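The conditioning step this comment describes can be sketched roughly: a SLAM system emits sparse landmarks, and the subset of points falling inside a detected object's 3D bounding box is what gets handed to the shape generator. The function name and box format below are hypothetical, not ShapeR's actual interface:

```python
# Schematic of conditioning a shape generator on a sparse SLAM point cloud:
# crop the cloud to an object's axis-aligned 3D bounding box. Hypothetical
# sketch; not ShapeR's real API.

def crop_to_box(points, box_min, box_max):
    """Keep the sparse points inside an axis-aligned object bounding box."""
    return [p for p in points
            if all(lo <= c <= hi for c, lo, hi in zip(p, box_min, box_max))]

# Sparse SLAM landmarks (metres); two fall inside the box around one object.
cloud = [(0.1, 0.2, 0.5), (0.12, 0.22, 0.48), (2.0, 1.0, 3.0)]
object_points = crop_to_box(cloud, box_min=(0.0, 0.0, 0.4),
                            box_max=(0.2, 0.3, 0.6))
print(len(object_points))  # 2
```

The real pipeline would then feed those object points (plus image features) into the generator to produce a watertight mesh in place of the sparse points.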

  • lastdong (OP) 17 hours ago

    I think it does use depth data, going by the parameters in the docs: python infer_shape.py --input_pkl <sample.pkl> (the depth is possibly obtainable using software like MapAnything). I believe it's CUDA only.

    • efskap 15 hours ago

      Yeah, they confirm that at the bottom of the linked page:

      > Furthermore, by leveraging tools like MapAnything to generate metric points, ShapeR can even produce metric 3D shapes from monocular images without retraining.
