4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
guanjunwu.github.io

This looks great! The main potential use for this must be for VR video with 6 degrees of freedom. What they have now does an incredible job of conveying space, but feels a bit limiting when your view doesn’t translate with you.
This is bad news for me. I am working on a similar project (Gaussian splatting + dynamic scene). Our method is different from the 4D Gaussian splatting mentioned here, but I am unsure whether I should continue or not.
Please continue working on it: being first doesn't imply being the best, all research is iterative.
There's nothing wrong with concurrently developing something similar. I can almost guarantee there will be something different enough about what you've developed to be considered novel (if you care about publication). If you don't care about publication, then definitely keep going! Hell, people still argue about which implementation of complete specifications is the best (for good reason).
Why is that bad news? If you're interested in the outcome - someone has saved you work.
If you're interested in the process - or exploring your specific approach then why stop?
Because of the sunk cost fallacy, which is only a fallacy if you ignore the emotions and trauma of having to abandon work you identify with or that is responsible for your self-esteem.
Don't see how the sunk cost has anything to do with it, OP clearly hoped to be first, as being first would almost guarantee a payoff of some form.
Coming out with a method second is much less likely to be rewarded by the community: not being rewarded for work completed is not sunk cost, it's just a straight-up loss.
If it’s a different method, it is definitely worth sharing.
The additional interest might actually be helpful.
Does anyone know if the pixel overdraw of the GS scene is consistent from every view angle? I'm asking because I would assume there is inconsistent GS density but the paper doesn't give a range of FPS measurements or 99th percentile or anything like that.
I'm pretty certain it is not - consider surfaces seen in steep angles, vs. ones seen perpendicularly. If we assume no culling or pruning occurs for the splats, steep angles result in way more overdraw.
This doesn't necessarily follow. If the splats are coplanar to the surface they are representing then viewing them at an angle wouldn't change how they overlap. But that said, I assume you're right.
This gives me hope that one day we'll have a holodeck. Holy crap! The applications for this are pretty broad. From safety (scene reconstruction from video sources) to real-estate, to hollywood and video games. I'm just blown away. Will we eventually see 4D GS AR/XR scenes we can walk about? I feel like that would make the perfect VR sherlock holmes game.
Why wouldn't you be able to walk about now? They already have examples with free camera movement. To make it an XR scene, you just need to render two cameras and pipe it into a headset.
One more step towards the next simulation level.
Aren't the scenes static?
Holy heck this is going to fundamentally change media production
After reconstruction, is there any way to scan for a particular condition in the model, and map it onto the 3D structure? For instance, find the broken cookie, or find a surface matching some input image.
I suspect typical point-cloud feature extraction techniques would work. Things like identify planar regions, from that join connecting planar regions into clusters, etc.
The time component is super interesting here though!
Seem fairly tractable to use Segment Anything or a similar method to derive plausible semantic clusters of splats.
Hard to believe the original Gaussian Splatting paper is still less than three months old, given the explosion of new techniques in recent weeks. It's wild to see the state of the art in environment and object capture suddenly advancing this quickly – beyond the obvious applications like real estate, I wonder what else GS will end up transforming.
At the risk of an "AcKchYuALly": Gaussian splatting has been around since at least the early 90s. There are even a few old games made with the technique.
The paper I think you're referring to made the interesting leap that a 3D radiance field could be re-rendered as a field of Gaussian splats, and that this would probably run faster in modern GPU pipelines for real-time performance. It looks like they also have the nice property of being able to be shifted around in memory quickly, hence the animation seen here.
If you want to be pedantic, the paper made the leap it did because of differentiable rendering which necessarily needs a differentiable representation of primitives - so they use Gaussians. It’s entirely novel and set in a nascent field (neural rendering). Gaussians happen to be further representable as easily rasterized primitives. Though some considerable work was put into making this performant. Everyone who keeps saying this has been around since the 90s is missing the context of the very modern differentiable rendering literature.
The point of Gaussian Splatting for me is that it is a learned representation. It's odd that others view it primarily as drawing sprites.
I'm curious, would you classify particle effects drawn with quads as 4D gaussian splatting too?
Well, in the old days, you just put the splats in your 3D space; they weren't really sprites (in the strict sense that they didn't use dedicated sprite hardware). The really interesting thing is that they're being used here to render the learned representation, but there's nothing particularly special or new or AI/ML about them.
You could "model" 3d objects with the gaussians by just putting a bunch together. It was a way to produce fast rendering 3d images without using a bunch of polygons. The results back then were...left behind by other techniques.
There's a massive back catalog of computer graphics work on the technique, it's usually just easiest to use the search tools and search back for all dates leading up to say...2021 and you'll find tons of normal old stuff like CS 302 - Computer Graphics courseware slides or whatever on the technique.
https://www.google.com/search?q=gaussian+splat+-site%3Apinte...
Being old and seeing the new generations amazed by the reapplication of what was discovered and used decades ago in a novel way amazes me.
Could you point us to some examples of old games using this technique? Would be awesome to see.
Ecstatica - https://www.youtube.com/watch?v=dnOXk3QJWN8
Is this really splatting gaussians? Or is it rendering ellipsoids?
It was just ellipsoids. I don’t know if any game specifically used Gaussians. But, the idea of splatting points, Gaussians, ellipsoids and a variety of other shapes has been around for at least 20 years.
The novelty of the paper was in using the differentiability of Gaussians to enable fitting splats to incrementally match all of the target photos simultaneously. So, it’s a new way to generate the splats from photos rather than modeling them by hand.
PlayStation Dreams used a very similar technique.
The backgrounds are static and prerendered! There's one sphere .bmp that's scaled and stretched. It comes with a depth offset map that populates a simple z-buffer to prevent overdraw. So rendering each frame becomes just a couple hundred dozen operations!
I remember a 4k demo that used translucent triangles (I think? my brain is showing me circles, so perhaps a fixed set of sizes and fast blit with alpha.) This created moving volumetric light and shadows around some geometric shapes, some pillars I think. Very smeary/ghostly with overdrawn shapes, but the effect was startling given it was on a 486. It didn't render full frames, but moved the model and just kept splatting.
Interesting! Can you please name some of these old games made with Gaussian Splatting? I would be interested to play, to get a sense why polygons won in that round (and likely to lose in this one).
I used additive gaussian fields (restricted by bounding regions) for this back in the late 90's for audio visualizations in a ripper/player called "Siren" (back when we actually thought we could charge money for something like that).
The technique worked well on non-accelerated (CPU only) hardware of the era, with the additive approach saving the pain of needing to keep a z buffer or fragment list.
Gaussian voxel reconstruction is useful in medical and GIS settings, which, if memory serves, is what Kyle Freeman from Novalogic drew on for his work on Comanche. As far as I know, that was the first commercial game with voxel rendering... It's been a bit since I played it, but the swimming jaggies make me think it was a Manhattan-distance height map offset by planar traversal (kinda like Doom raycasting) or some similar trick. I don't recall any intersections or overhangs, but, to be fair, I was a middle schooler when Comanche came out.
It also ran fine on my weak sauce PC.
Once acceleration hit, transformation of triangles with fixed-function pipelines took over: the ability to push textured triangles with minimal per-pixel value adjustment won out. Slowly but surely we've swung back to high ALU balance (albeit via massive stream parallelism). We've shifted from heavy list/vertex transformers to giant array multiply/add processors.
It's a pretty great time to be a processing nerd.
From another user:
This one was released today as well. Works out of the box: https://github.com/JonathonLuiten/Dynamic3DGaussians
We used to call this technique "vector balls" on the Amiga, here is one famous example: https://www.youtube.com/watch?v=gjKkUTlhIek . I remember implementing it myself for an unreleased demo.
I realize that a lot has happened since, but this is likely where it all started :)
Does anyone have a video or post that explains the optimization part of the original paper? I understand most of it except that part, and can’t seem to wrap my head around it.
Just glossed over the paper but it seems, in principle, simple enough (though rather brilliant IMHO).
Essentially they're doing what you do when you train a neural network, only that instead of adjusting weights connecting "neurons", you adjust the shape and position of gaussians, and the coefficients of spherical harmonics for the colors.
This requires the rendering step to be differentiable, so that you can back-propagate the error between the rendering and the ground-truth image.
The next key step is to adjust the number of Gaussians every N iterations: either fill in detail by cloning a Gaussian in an area which is under-covered, or split a Gaussian in an area which is over-covered.
They use the gradient of the view-space position to determine whether more detail is needed, i.e. Gaussians the optimizer wants to move significantly across the screen appear to be in regions without enough detail.
They then use the covariance of the Gaussians to determine whether to split or to clone. Gaussians with large variance get split, the others cloned.
They also remove gaussians which are almost entirely transparent, no point in keeping those around.
That's my understanding at least, after a first time gloss-through.
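To make that concrete, here's a rough PyTorch-style sketch of the loop as I understand it. The helper functions (`sample_training_view`, `render`, `densify_and_prune`), the parameter shapes, and all thresholds are placeholders of mine, not the paper's code; the actual renderer is a custom differentiable CUDA rasterizer.

```python
import torch

N = 100_000  # number of Gaussians; the paper seeds them from the SfM point cloud

# Per-Gaussian parameters, all directly optimized by gradient descent.
params = {
    "positions": torch.randn(N, 3, requires_grad=True),
    "scales":    torch.zeros(N, 3, requires_grad=True),     # extents of the covariance axes
    "rotations": torch.randn(N, 4, requires_grad=True),     # quaternions
    "opacities": torch.zeros(N, 1, requires_grad=True),
    "sh_colors": torch.zeros(N, 16, 3, requires_grad=True), # spherical-harmonic color coefficients
}
optimizer = torch.optim.Adam(list(params.values()), lr=1e-3)

for step in range(30_000):
    camera, gt_image = sample_training_view()   # placeholder: pick a training photo and its pose
    image = render(params, camera)               # placeholder: differentiable rasterization
    loss = (image - gt_image).abs().mean()       # the paper mixes an L1 term with D-SSIM
    loss.backward()                              # gradients flow back into every Gaussian
    optimizer.step()
    optimizer.zero_grad()

    if step % 100 == 0:
        # Adjust the *number* of Gaussians: clone/split where the view-space position
        # gradients are large, and drop Gaussians that are nearly transparent.
        densify_and_prune(params)
```

The important bit is that `render` is differentiable end to end, so `loss.backward()` produces a gradient for every Gaussian's position, shape, opacity and color.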
You:
> Essentially they're doing what you do when you train a neural network, only that instead of adjusting weights connecting "neurons", you adjust the shape and position of gaussians, and the coefficients of spherical harmonics for the colors.
My brain:
> They're providing inverse reactive current to generate unilateral phase detractors, automatically synchronizing cardinal gram meters.
Heh. For those that haven't dabbled much with neural nets, the key aspect here is the backpropagation[1]. If you want to optimize a process, you typically change the parameters (turn a knob or change a number) and see how the output reacts. If it changed too much you reduce the parameter etc. This is a forwards process.
The idea in backpropagation is instead to mathematically relate a change in output to a change in the parameters. You figure out how much you need to change the parameters to change the output a desired amount. Hence the "back" in the name: you want to control the output, "steering" it in the direction you want, and to do so you go backwards through the process to figure out how much you need to change the parameters.
Instead of "if I turn the knob 15 degrees the temperature goes up 20 degrees", you want "in order to increase the temperature 20 degrees the knob must be turned 15 degrees".
By comparing the output with a reference, you get how much the output needs to change to match the reference, and by using the backpropagation technique you can then relate that to how much you need to change the parameters.
In neural nets the parameters are the so-called weights of the connections between the layers in the model. However the idea is quite general so here they've applied it to optimizing the size, shape, position and color of (gaussian) blobs, which when rendered on top of each other blend to form an image.
Changing a blob's position, say, might make it better for one pixel but worse for another. So instead of making one big change to the parameters, you take small iterative steps. This is the so-called training phase. Over time the hope is that the output error decreases steadily.
edit: while backpropagation is quite general as such, as I alluded to earlier, it does require that the operation behaves sufficiently nice, so to speak. That's one reason for using gaussians over say spheres. Gaussians have nice smooth properties. Spheres have an edge, the surface, which introduces a sudden change. Backpropagation works best with smooth changes.
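As a tiny, self-contained toy of that idea (my own example, not the paper's pipeline): render a single 2D Gaussian blob, compare it against a reference image, and let autograd work out how the blob's position and size should change.

```python
import torch

H = W = 32
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")

def render_blob(cx, cy, sigma):
    # A smooth, differentiable "splat": intensity falls off with distance from the center.
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# Reference image: a blob at (20, 12) with size 3.
target = render_blob(torch.tensor(20.0), torch.tensor(12.0), torch.tensor(3.0))

# Start our blob in the wrong place with the wrong size.
cx = torch.tensor(16.0, requires_grad=True)
cy = torch.tensor(16.0, requires_grad=True)
sigma = torch.tensor(5.0, requires_grad=True)
optimizer = torch.optim.Adam([cx, cy, sigma], lr=0.3)

for step in range(300):
    loss = ((render_blob(cx, cy, sigma) - target) ** 2).mean()
    loss.backward()        # "backwards": how should each parameter move to reduce the error?
    optimizer.step()       # take a small step in that direction
    optimizer.zero_grad()

print(cx.item(), cy.item(), sigma.item())  # should end up close to (20, 12, 3)
```

The real method does exactly this, just simultaneously for millions of 3D Gaussians projected through a camera and alpha-blended by a custom rasterizer.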
Just to add some detail regarding the "blob optimization" phase.
The algorithm that recovers the camera positions from the reference images also gives you a sparse cloud of points (it places pixels from the images in 3D space). Use those as the centers of the initial blobs, and give each blob an initial size. This is almost certainly not enough detail, but it's a start.
Then you run the "training" for a while, optimizing the position and shape of the blobs. Then you try to optimize the number of blobs. The key aspect here is to determine where more detail is needed.
In order to do so, they exploit the fact that they already have derivatives of several properties, including the screen position of each blob. If the previous training pass tries to move a given blob a significant distance on the screen, they take that as a signal that the backpropagation is struggling to cover an area.
They then decide to add blobs either by duplication or by splitting, depending on whether the blob is large or not.
If it's small, they assume there's detail it can't fill in, so they duplicate the blob and move the copy slightly in the direction the source blob wanted to move, so the two don't overlap exactly.
If the blob is large they assume the detail is too fine and is overcovered by the blob, hence they split it up, calculating the properties of the new blobs so that they best cover the volume the source blob covered.
This process of training followed by blob optimization is repeated until the error is low enough or stops changing much, suggesting it has converged or failed to converge, respectively.
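Expressed as code, the clone-or-split decision might look roughly like this; the tensors, names and thresholds are illustrative stand-ins of mine, not values from the paper.

```python
import torch

num_blobs = 1000
# Stand-ins for quantities the training loop would accumulate per blob:
grad_norm = torch.rand(num_blobs)        # how far recent passes "wanted" to move each blob on screen
max_scale = torch.rand(num_blobs) * 0.1  # the largest axis of each blob

grad_threshold  = 0.5    # "this blob keeps wanting to move a lot" => the region needs more detail
scale_threshold = 0.05   # boundary between "small" and "large" blobs

needs_detail = grad_norm > grad_threshold
clone_mask = needs_detail & (max_scale <= scale_threshold)  # small blob: duplicate it and nudge the
                                                            # copy along its gradient direction
split_mask = needs_detail & (max_scale > scale_threshold)   # large blob: replace it with smaller
                                                            # blobs placed inside its footprint

print(clone_mask.sum().item(), "to clone,", split_mask.sum().item(), "to split")
```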
Thank you. This was much more approachable for someone like myself that has little background (a few undergrad courses) in both machine learning and computer vision concepts.
I was just about to ask: why not use a sphere? Since it could be thought of as a NN, it will probably be folded into NNs someday. I guess the splitting and merging can then be compared with dropout.
I'm no expert, but my immediate thoughts are that evaluating a gaussian blob is very simple, it's just an exponential of a distance. The edge of a sphere makes it more complicated to compute, hence slower.
For backpropagation, the differentials of a Gaussian are smooth while a sphere's are not, again because of the edge.
Now, if you want to use a sphere, you'd probably do something like adding an opacity falloff similar to ReLU[1], making it transparent at the edge.
This should make it smooth enough, I guess, but I imagine you'd still have the more complicated rendering. Though I may be mistaken.
[1]: https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
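A quick toy comparison of the gradients (again my own illustration, not from the paper): the Gaussian falloff gives the optimizer a useful signal for resizing the blob, while a hard-edged disk gives essentially none.

```python
import torch

d = torch.linspace(0.0, 2.0, 5)  # distances of a few sample pixels from the primitive's center

# Gaussian footprint: smooth falloff, informative gradient w.r.t. its size.
sigma = torch.tensor(0.5, requires_grad=True)
torch.exp(-0.5 * (d / sigma) ** 2).sum().backward()
print(sigma.grad)   # non-zero: the optimizer "knows" how resizing changes the image

# Hard-edged disk: inside = 1, outside = 0. torch.sign's gradient is defined as zero,
# so the radius receives no signal at all from the pixels.
r = torch.tensor(0.5, requires_grad=True)
torch.clamp(torch.sign(r - d), 0.0, 1.0).sum().backward()
print(r.grad)       # zero: a hard cutoff is useless to gradient descent
```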
I still continue to read comments like those though - there is a chance I might make sense of a word! But I did find myself laughing as I read the original post thinking about how this sounds like a word salad.
The objects being optimized are the parameters of a 3D Gaussian: just imagine a blob changing shape. Those are optimized instead of the weights of a neural network.
What parts confuse you? There are a few steps in the optimization. There are lots of papers on differentiable rendering, but for the pruning of Gaussians and the actual treatment of the Gaussians, I don't think there's a blog post.
Can someone help me understand what this is actually doing?
After the scene is filmed/photographed then one can re-position and re-point a virtual camera and have it correctly render the scene. And do so with higher quality results than photogrammetry and NeRF techniques.
Thanks!
With tech like this I'm starting to wonder if realistic games are going to become normalized and what will happen as a result.
Also has anyone been working on solving the "blurry" look these splats have up close?
But if I'm not mistaken, this technique still requires a ton of pictures from many angles? It's fine for visiting an apartment or watching a cooking video in 3D, but how could you possibly apply this to a video game that has many more degrees of freedom? Are you going to scan an entire city with a drone to create a GTA-like?
> We introduce the 4D Gaussian Splatting (4D-GS) to achieve real-time dynamic scene rendering while also enjoying high training and storage efficiency.
This seems to be a rendering efficiency innovation, not particular to scanning.
That means it applies to artificially generated environments, whether photo realistic or stylized, and whether based on a real environment or a completely fictional one.
But of course, any photorealistic, extremely faithful to the smallest detail, rendering of a real place is going to involve a lot of scanning. That is true for any kind of rendering.
Each Gaussian "splat" is literally a little blurry blob. The way to make it sharper is to increase the resolution - i.e. increase the number of splats, decrease the size of each one. This increases both training time and render time though.
> realistic games
That said, games don't have to be super realistic to be fun. E.g. I could imagine a game based on GS at "Minecraft resolution".
I'd love to see a machine learning model trained on the resulting data of this. It'd be crazy to see if it can effectively learn and generate realistic looking video as an output.
Can someone explain to me how it is possible, using Gaussians, to have different reflections based on the angle of view like in the demos? I'm finding it hard to grasp.
I believe that is due to the use of Spherical Harmonics.
That seems more complex to store and render than everything else about a Gaussian splat; how are these used efficiently?
There is a bit of explanation here https://aras-p.info/blog/2023/09/05/Gaussian-Splatting-is-pr... (found via Google).
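For intuition, here's a small sketch of what evaluating degree-0/1 spherical harmonics into a view-dependent color looks like. The constants are the standard real SH basis factors; the sign convention and the +0.5 offset follow my reading of the reference 3DGS implementation, so treat those details as assumptions rather than gospel.

```python
import torch

SH_C0 = 0.28209479177387814   # degree-0 (constant) basis factor
SH_C1 = 0.4886025119029199    # degree-1 basis factor

def sh_to_rgb(sh, view_dir):
    """sh: (4, 3) coefficients (1 DC term + 3 degree-1 terms, per RGB channel).
    view_dir: unit-length (3,) direction from the Gaussian toward the camera."""
    x, y, z = view_dir
    color = (SH_C0 * sh[0]
             - SH_C1 * y * sh[1]
             + SH_C1 * z * sh[2]
             - SH_C1 * x * sh[3])
    return torch.clamp(color + 0.5, 0.0, 1.0)

sh = torch.randn(4, 3) * 0.2             # one Gaussian's color coefficients
front = sh_to_rgb(sh, torch.tensor([0.0, 0.0, 1.0]))
side  = sh_to_rgb(sh, torch.tensor([1.0, 0.0, 0.0]))
print(front, side)  # same Gaussian, different RGB depending on the viewing direction
```

Higher SH degrees add more direction-dependent terms, which is what lets a splat's color shift like a glossy highlight as the camera moves.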
Feel like this changes everything, trying it out right now...
Interesting that the original publication this is based on (that won the SIGGRAPH 2023 best paper award) didn't get a lot of attention on HN at the time:
Great video I saw a while ago on this: https://www.youtube.com/watch?v=HVv_IQKlafQ (albeit for 3d, not 4d).
His editing is hilarious too.
I've been slowly building my own rendering and training on a non-CUDA library (trying with Vulkan/SPIR-V). I'm curious how many cameras they used here, though.
Reminds me of the Deja Vu movie and how they maneuver the angles.
Anyone know how well this technique deals with mesh/lattice type structures? For example, fences, ladders, climbing frames, etc.
Gaussian Splatting in general, or this specific approach to animation? Can't comment on the latter, but fine detail renders very nicely in still scenes.
https://lumalabs.ai/capture/ed9d985b-9cc1-49e0-a39c-88afa203...
https://lumalabs.ai/capture/83e9aae8-7023-448e-83a6-53ccb377...
https://lumalabs.ai/capture/7f8df9c9-c548-4a47-9892-e945637c...
https://lumalabs.ai/capture/076fcfdc-ea80-4fdc-8159-c9fed831...
Wow, that's impressive. Thank you for these.
Combine this with state of the art VR tech (something with good eye tracking and 4k per eye) and we're living in the future.
Wondering when this technique will be used for meal calorie counters
Not seeing how that is related?
The wobbling made me think of photogrammetry/estimating volume with a camera, paired with some visual model to detect peas or whatever. Without a concrete dimension though (e.g. from lidar), I'm not sure how accurate it would be.
That cookie looks delicious
Can someone also explain the implications of this on gaming?
YouTube recommended this video to me, which concisely explains splatting. The information needed to weigh the trade-offs is there, but the trade-off itself is left to the viewer.
The key drawback that isn't highlighted is that you need a physical space to be a close approximation of what you want to render. So if you want to make a few counter strike maps based off of your workplace (not recommended) then this would be a good technology, but if you want to make an open world on an alien planet you're likely better off with traditional rendering.
This is just incredible technology
I like how galaxies look like ellipsoids if you zoom out.
Mind blowing stuff.
Well, Rule 34 is about to happen. And "splatting" is already a decent name...
Ha! Or Rule 34a, "every sufficiently observed phenomenon, has just become somebody's new fetish".
Although actually, and on a slightly more innocent (but just as edgy!) note, the thing that immediately popped into my head upon reading "4D Gaussian Splatting" was the music from the 1992 Future Crew demo Unreal, and the image of its inter-scene title screens. ["IYKYK", but basically, that famous old PC demo consists of several short sections, each showcasing a particular coding/graphical technique, each section prefaced by a title screen naming the effect being showcased.]
YT of Unreal demo, as citation for this highly-important observation : https://www.youtube.com/watch?v=InrGJ7C9B3s
The German demo group Farbrausch pioneered this.