NeRF: Representing scenes as neural radiance fields for view synthesis

matthewtancik.com

237 points by dfield 6 years ago · 42 comments

uoaei 6 years ago

This is absolutely stunning.

As they say in ML, representation first -- and this is one of the most natural and elegant ways to represent 3D scenes and subjective viewpoints. Great that it works into a rendering environment such that it's E2E differentiable.

This is the first leap toward true high-quality real-time ML-based rendering. I'm blown away.

jayd16 6 years ago

Very cool. Reminds me of when I played with Google's Seurat.

The paper says it's 5MB, 12 hours to train the NN, and then 30 seconds to render novel views of the scene on an Nvidia V100.

Sadly not something you can use in real time but still very cool.

Edit: 12 hours and 5MB NN, not 5 minutes

  • ssivark 6 years ago

    Huh, what? It needs almost a million views, and takes 1-2 days to train on a GPU. I’m not sure where the “5 minutes” number comes from.

    EDIT: I was referring to the last paragraph of section 5.3 (Implementation details), but maybe I’m misunderstanding how they use rays / sampled coordinates.

    Very impressive visual quality. But it seems like they need a LOT of data and computation for each scene. So it's still plausible that intelligently done photogrammetry will beat this approach in efficiency, but a bunch of important details need to be figured out to make that happen.

    • jayd16 6 years ago

      Excuse me, I meant 5MB. It takes 12 hours to train.

      >All compared single scene methods take at least 12 hours to train per scene

      But it seems to only need sparse images.

      >Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation

    • scribu 6 years ago

      > It needs almost a million views

      Not sure what you mean by "views". The comparisons in the paper use at most 100 input images per scene.

      • bla3 6 years ago

        A pixel is one view for their model if I understand correctly, so one hundred 100x100 images would be a million views.
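
        (Back-of-the-envelope check of that, treating each pixel of each input image as one training "view"/ray; the 100x100 resolution is just the commenter's example, not a number from the paper:)

          num_images, height, width = 100, 100, 100
          print(num_images * height * width)  # 1,000,000 pixel "views"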

lifeisstillgood 6 years ago

Well, that took some effort just to work out what they actually did. How they actually did it I have no idea. Impressive, however: a sort of fill-in-the-blanks for the bits that are missing. If our brains don't do this, one would be surprised.

And we are all supposed to become AI developers this decade?!

Come back Visual Basic all is forgiven :-)

raidicy 6 years ago

This blows my mind. This is probably a naive thought, but this technique looks like it could be combined with robotics to help robots navigate through their environment.

I'd also like to see what it does when you give it multiple views of scenes in a video game. Some from the direct pictures and some from pictures of the monitor.

  • yarg 6 years ago

    They've only shown it working with static content; they'll need to do it with video (multiple synchronised cameras) and in real time for any robotics application.

    • blurbleblurble 6 years ago

      It'd be interesting to see what would happen if they encoded an additional time parameter on each 'view' (input image pixel). Surely someone is already trying to extend this technique that way.
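
      (A minimal sketch of what that extension might look like, purely speculative and with made-up names, just widening the field's input with a time value:)

        import numpy as np

        # hypothetical: fit F(x, y, z, t, view_dir) -> (rgb, density)
        # instead of F(x, y, z, view_dir), so the scene can change over time
        def dynamic_radiance_field(mlp, position, t, view_dir):
            features = np.concatenate([position, [t], view_dir])  # 7-D input
            return mlp(features)

        # dummy stand-in network, only to show the interface
        dummy_mlp = lambda f: (np.tanh(f[:3]), float(np.abs(f).sum()))
        print(dynamic_radiance_field(dummy_mlp, [0.1, 0.2, 0.3], t=0.5, view_dir=[0.0, 0.0, 1.0]))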

  • iandanforth 6 years ago

    Currently, view coordinates relative to the volume are required, so you first have to solve the SLAM problem before you can optimize a network representation of a given volume.

  • BubRoss 6 years ago

    It takes 12 hours on a high end GPU to make one frame.

    • teraflop 6 years ago

      No, as appendix A of the paper states, each frame takes about 30 seconds to render.

      • BubRoss 6 years ago

        No, the high-dimensional field takes 12 hours to train, and the time to render the field to an image is not going to matter for robotics, where computer vision needs to be done in real time.

teknopurge 6 years ago

This is bad-ass, partly because it's so elegant.

blackhaz 6 years ago

Could someone ELI5, please?

  • mooneater 6 years ago

    If you give it a bunch of photos of a scene from different angles, this machine learning method lets you see angles that did not exist in the original set.

    Better results than other methods so far.

    • notfed 6 years ago

      Fist bump for actually answering as ELI5 (unlike the other responses).

    • airstrike 6 years ago

      So can we take it to the next level and give it a bunch of ML-generated photos of a scene that doesn't exist (from model B) and let this model A create the 3D view?

      Take it one step further and make model B create photos from some text description, similar to the one described in https://news.ycombinator.com/item?id=22640407 (although that one does 3D designs using voxels).

  • quadrature 6 years ago

    It's a very similar concept to photogrammetry, which is recovering a 3D representation of an object given pictures taken from different angles.

    In this work they take pictures of a scene from different angles and are able to train a neural network to render the scene from new angles that aren't in any source pictures.

    The neural network takes in a location (x, y, z) and a viewing direction, and spits out the RGB you would see if you viewed the scene from that location and angle.

    Using this network and traditional rendering techniques, they are able to render the whole scene.
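
    (Rough numpy sketch of that pipeline, with a toy stand-in for the trained network; none of the names below come from the paper:)

      import numpy as np

      def toy_field(xyz, view_dir):
          """Stand-in for the trained MLP: returns (rgb, density) at a 3-D point."""
          rgb = 0.5 + 0.5 * np.tanh(xyz)        # fake color
          density = np.exp(-np.sum(xyz ** 2))   # fake density, peaked at the origin
          return rgb, density

      def render_ray(field, origin, direction, near=0.0, far=4.0, n_samples=64):
          # sample the field along the camera ray and alpha-composite the samples
          ts = np.linspace(near, far, n_samples)
          deltas = np.diff(ts, append=far)
          color, transmittance = np.zeros(3), 1.0
          for t, delta in zip(ts, deltas):
              rgb, sigma = field(origin + t * direction, direction)
              alpha = 1.0 - np.exp(-sigma * delta)   # opacity contributed by this sample
              color += transmittance * alpha * rgb
              transmittance *= 1.0 - alpha
          return color  # one pixel of the rendered image

      print(render_ray(toy_field, np.zeros(3), np.array([0.0, 0.0, 1.0])))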

    • wokwokwok 6 years ago

      Significantly, the input is a sparse dataset.

      i.e. few source images vs. traditional photogrammetry.

      ...but basically yes, tl;dr: photogrammetry using neural networks; this one is better than other recent attempts at the same thing, but takes a really long time (2 days for this vs. 10 minutes for a voxel-based approach in one of their comparisons).

      Why bother?

      Mmm... there's some kind of speculation that you might be able to represent a photorealistic scene / 3D object as a neural model instead of voxels or meshes.

      That might be useful for some things. E.g., a voxel representation of semi-transparent fog or of high-detail objects like hair is impractically huge, and as a mesh it's very difficult to represent.

      • rebuilder 6 years ago

        A number of things this seems to do well would be pretty much impossible with standard photogrammetry: trees with leaves, fine details like rigging on a ship, reflective surfaces, even refraction (!)

        Of course the output is a new view, not a shaded mesh, but given that it appears to generate depth data, I think you should be able to generate a point cloud and mesh it. Getting the materials from the output might even be possible; I'm not very up to date on the state of material capture nowadays.
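
        (Sketch of that "depth map -> point cloud" step, assuming a simple pinhole camera with a known focal length; the details here are my own assumption, not something from the paper:)

          import numpy as np

          def depth_to_point_cloud(depth, focal):
              h, w = depth.shape
              j, i = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
              x = (i - w / 2) * depth / focal   # back-project each pixel
              y = (j - h / 2) * depth / focal
              return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

          points = depth_to_point_cloud(np.full((4, 4), 2.0), focal=50.0)  # toy 4x4 depth map
          print(points.shape)  # (16, 3) -- ready for a meshing step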

      • BubRoss 6 years ago

        > Significantly, the input is a sparse dataset. ie. Few source images vs. traditional photogrammetry.

        This uses dozens or hundreds of images, which isn't usually necessary for traditional photogrammetry that maps photos to hard surfaces with textures.

        I think what you noted about volumes is the significant part. Complex objects with fine detail and view-dependent reflections are the part that shines here over photogrammetry, but it does take a lot of images. I didn't see anything in the paper that dealt with transparency.

      • visarga 6 years ago

        > Why bother?

        There might be 10x speedups to be gained with a tweaked model.

  • type_enthusiast 6 years ago

    They're modeling a scene mathematically as a "radiance field" - a function that takes a view position and direction as inputs and returns the light color that hits that position from the direction it's facing. They use some input images to train a neural network, in order to find an optimal radiance field function which explains the input images. Once they have that function, they can construct images from new angles by evaluating the function over the (position, direction) inputs needed by the pixels in the new image.
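
    (Toy sketch of the "evaluate the function over the (position, direction) inputs needed by the pixels" part: generate one ray per pixel of the new camera, then feed each (origin, direction) pair to the learned field. The pinhole-camera details are my own simplification, not the paper's code:)

      import numpy as np

      def camera_rays(height, width, focal, cam_to_world):
          # one (origin, direction) pair per pixel of the novel view
          j, i = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
          dirs = np.stack([(i - width / 2) / focal,
                           -(j - height / 2) / focal,
                           -np.ones_like(i, dtype=float)], axis=-1)
          dirs = dirs @ cam_to_world[:3, :3].T               # rotate into world space
          origins = np.broadcast_to(cam_to_world[:3, 3], dirs.shape)
          return origins, dirs

      origins, dirs = camera_rays(4, 4, focal=50.0, cam_to_world=np.eye(4))
      print(origins.shape, dirs.shape)  # (4, 4, 3) (4, 4, 3)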

  • ur-whale 6 years ago

    >Could someone ELI5, please?

    Smart, high-dimensional interpolator.

  • imposter 6 years ago

    Wow great

kuprel 6 years ago

This would be great for instant replays

blurbleblurble 6 years ago

The neural networks representing these scenes take up just 5 MB... Less than the input images used to train them. Wow. Mind blowing!
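
(Back-of-the-envelope comparison; the image count and resolution below are assumptions for illustration, not figures from the paper:)

  num_images, width, height, bytes_per_pixel = 100, 1000, 750, 3
  raw_pixels_mb = num_images * width * height * bytes_per_pixel / 1e6
  print(raw_pixels_mb)  # 225.0 MB of raw pixels vs ~5 MB of network weights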

  • BubRoss 6 years ago

    Keep in mind, though, that this representation is a form of lossy compression, and the input images may not be.

byt143 6 years ago

If you're only looking for one novel view, can it use fewer views that are close to the novel one?

ssivark 6 years ago

Does anyone know how they do the “virtual object insertion” demonstrated in the paper summary video? Can that be somehow done on the network itself, or is that a diagnostic for scene accuracy by performing SFM on network output?

  • theresistor 6 years ago

    I'm pretty sure they're rendering a depth channel and compositing it in.

    • teraflop 6 years ago

      You could do that, but I think it's simpler to just introduce additional objects during the raytracing process that generates the images. That would produce accurate results even with semitransparent objects, unlike compositing with a depth buffer.
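
      (Toy sketch of that idea, not the authors' code: during the ray march, add the virtual object's density and blend its color at each sample, so semi-transparent scene content still occludes it correctly:)

        import numpy as np

        # fake scene and a fake inserted object, just to show the interface
        scene = lambda p: (np.array([0.8, 0.2, 0.2]), np.exp(-np.sum(p ** 2)))
        obj = lambda p: (np.array([0.1, 0.1, 0.9]), 5.0 * float(np.linalg.norm(p - np.array([0.0, 0.0, 2.0])) < 0.3))

        def render_ray_with_object(origin, direction, near=0.0, far=4.0, n_samples=128):
            ts = np.linspace(near, far, n_samples)
            deltas = np.diff(ts, append=far)
            color, transmittance = np.zeros(3), 1.0
            for t, delta in zip(ts, deltas):
                p = origin + t * direction
                (c_s, s_s), (c_o, s_o) = scene(p), obj(p)
                sigma = s_s + s_o                                   # densities simply add
                rgb = (s_s * c_s + s_o * c_o) / max(sigma, 1e-8)    # density-weighted color
                alpha = 1.0 - np.exp(-sigma * delta)
                color += transmittance * alpha * rgb
                transmittance *= 1.0 - alpha
            return color

        print(render_ray_with_object(np.zeros(3), np.array([0.0, 0.0, 1.0])))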

philip368320 6 years ago

I would like to see a “neural enhance” applied to an already rendered 3D scene, with the changes that would make it more realistic, given a depth map and other information to the neural network.

tanilama 6 years ago

This is REALLY cool, but kinda makes sense as well. Neural networks are very good at interpolation, given the right prior.

2OEH8eoCRo0 6 years ago

This is the kind of shit I come here for. Awesome post! Thanks for sharing!

anthk 6 years ago

This is like the Blade Runner in-game tool.
