Bunnymark GL in Jai – 200k sprites at 200fps [video] (youtube.com)
It's been a while since I've done game engine work, but is this impressive? The first thing that comes to mind is that they're using instanced rendering. This lets the CPU deal with only one sprite while telling the GPU to render multiple instances of it, with a GPU buffer supplying each sprite's transformation matrix. All the CPU has to do is update that memory-mapped buffer with new position information (or do something more clever to derive the transformations).
Am I missing something that makes the video novel?
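(For readers who haven't used instancing: below is a minimal, hypothetical sketch of that kind of setup in C with OpenGL — a shared 6-vertex quad VBO plus a per-instance position buffer drawn with a single call. The names and layout are made up for illustration and are not the demo's code; it assumes a GL 3.3+ context and an already-initialized loader.)

    /* Hypothetical instanced-sprite setup (error checking omitted).
       One quad is shared by all sprites; per-sprite positions live in their own VBO. */
    #include <GL/glew.h>

    static const float quad_verts[] = {      /* two triangles covering a unit quad */
        0,0,  1,0,  1,1,
        0,0,  1,1,  0,1,
    };

    static GLuint quad_vbo, instance_vbo, vao;

    void setup_sprites(const float *positions, int sprite_count)
    {
        glGenVertexArrays(1, &vao);
        glBindVertexArray(vao);

        /* attribute 0: quad corners, advances once per vertex */
        glGenBuffers(1, &quad_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, quad_vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof quad_verts, quad_verts, GL_STATIC_DRAW);
        glEnableVertexAttribArray(0);
        glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, (void *)0);

        /* attribute 1: sprite position (x,y), advances once per instance */
        glGenBuffers(1, &instance_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, instance_vbo);
        glBufferData(GL_ARRAY_BUFFER, sprite_count * 2 * sizeof(float), positions, GL_DYNAMIC_DRAW);
        glEnableVertexAttribArray(1);
        glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 0, (void *)0);
        glVertexAttribDivisor(1, 1);
    }

    void draw_sprites(int sprite_count)
    {
        glBindVertexArray(vao);
        /* one draw call: sprite_count instances of the 6-vertex quad */
        glDrawArraysInstanced(GL_TRIANGLES, 0, 6, sprite_count);
    }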
It is not impressive. This workload is entirely bandwidth constrained and is more a test that your renderer isn’t doing anything stupid.
you're not missing anything, it's not impressive. i was just checking how fast computers are and sharing the results. my original title was "an optimized 2d game engine can render 200k sprites at 200fps" but the mods changed it to match my youtube title (which made it a lot more popular). and the fact it's written in jai isn't relevant, it's just what i happened to use
> and the fact it's written in jai isn't relevant
I figured it wasn't, given that you were showcasing a GL project. But it's nonetheless disappointing, as someone curious about whether the language helped in indirect ways with how you structured your project, and whether you feel you could scale it up to something closer to production-ready. That did seem to be the goal of Jai when I last looked into its development some 4 years ago.
jai's irrelevant to the performance here, but it's very relevant to how easy this was to make. i'm not a systems programmer. i've tried writing hardware accelerated things like this in C++ but have failed to get anything to compile for years. the only reason i was able to get this working is because of jai. this is my first time successfully using openGL directly, outside of someone else's game engine
Nope, I don't think so. The point I see in it relates to a thought I have that games should run on Intel graphics chipsets. Diablo 3 doesn't, for example. I wonder if D2 Resurrected does...
The author was using this benchmark to compare HTML game frameworks and made this to see how different it is native:
https://old.reddit.com/r/Kha/comments/8hjupc/how_the_heck_is...
> All the CPU has to do is update that mmap'ed buffer with new position information
Doesn’t even have to do that, this is child’s play for a compute shader. The CPU can go take a 16 millisecond nap and let the GPU do all the work.
They should try adding GPU polygon-volume collisions and see how many it can handle.
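(As a sketch of what "the GPU does all the work" can look like: a hypothetical GLSL compute shader, dispatched from C, that advances sprite positions stored in a shader storage buffer. The buffer layout, names, and workgroup size are assumptions for illustration, not taken from the demo.)

    /* Hypothetical sketch: let the GPU advance the sprites itself.
       Positions and velocities live in a shader storage buffer; per frame the
       CPU only sets two uniforms and dispatches. (Illustrative names only.) */
    static const char *move_cs_src =
        "#version 430\n"
        "layout(local_size_x = 256) in;\n"
        "layout(std430, binding = 0) buffer Sprites {\n"
        "    vec4 sprite[];              // xy = position, zw = velocity\n"
        "};\n"
        "uniform float dt;\n"
        "uniform uint  count;\n"
        "void main() {\n"
        "    uint i = gl_GlobalInvocationID.x;\n"
        "    if (i >= count) return;\n"
        "    sprite[i].xy += sprite[i].zw * dt;\n"
        "}\n";

    void move_sprites_on_gpu(GLuint program, GLuint sprite_ssbo, GLuint sprite_count, float dt)
    {
        glUseProgram(program);                                  /* compiled from move_cs_src */
        glUniform1f(glGetUniformLocation(program, "dt"), dt);
        glUniform1ui(glGetUniformLocation(program, "count"), sprite_count);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, sprite_ssbo);
        glDispatchCompute((sprite_count + 255) / 256, 1, 1);
        /* make the writes visible to the vertex stage that later reads the same buffer */
        glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);
    }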
Nice demo! We need more of this approach.
You really can achieve amazing stuff with just plain e.g. OpenGL optimized for your rendering needs. With today's GPU acceleration capabilities we could have town-building games with huge map resolutions and millions of entities. Instead it's mostly used just to make fancy graphics.
Actually I am currently trying to build something like that [1]. A big big world with hundreds of millions of sprites is achievable and runs smoothly; video RAM is the limit. Admittedly it is not optimized to display those hundreds of millions of sprites all at once, maybe just a few million. It would be a bit too chaotic for a game anyway, I guess.
> We need more of this approach.
1000% agree.
I recently took it upon myself to see just how far I can push modern hardware with some very tight constraints. I've been playing around with a 100% custom 3D rasterizer which operates purely on the CPU. For reasonable scenes (<10k triangles) and resolutions (720~1080p), I have been able to push over 30fps with a single thread. On a 5950X, I was able to support over 10 clients simultaneously without any issues. The GPU in my workstation is just moving the final content to the display device via whatever means necessary. The machine generating the frames doesn't even need a graphics device installed at all...
To be clear, this is exceptionally primitive graphics capability, but there are many styles of interactive experience that do not demand 4k textures, global illumination, etc. I am also not fully extracting the capabilities of my CPU. There are many optimizations (e.g. SIMD) that could be applied to get even more uplift.
One fun thing I discovered is just how low latency a pure CPU rasterizer can be compared to a full CPU-GPU pipeline. I have CPU-only user-interactive experiences that can go from input event to final output frame in under 2 milliseconds. I don't think even games like Overwatch can react to user input that quickly.
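(For readers curious what the core of such a software rasterizer looks like: a hypothetical, single-threaded edge-function triangle fill in C — flat colour only, no clipping or perspective. This is a generic sketch, not the commenter's C# code.)

    /* Hypothetical flat-shaded triangle fill using edge functions.
       No clipping, no sub-pixel precision, no perspective: just the core idea. */
    #include <math.h>
    #include <stdint.h>

    typedef struct { float x, y; } Vec2;

    /* signed area term: tells which side of edge (a,b) the point p lies on */
    static float edge(Vec2 a, Vec2 b, Vec2 p)
    {
        return (p.x - a.x) * (b.y - a.y) - (p.y - a.y) * (b.x - a.x);
    }

    void fill_triangle(uint32_t *fb, int width, int height,
                       Vec2 v0, Vec2 v1, Vec2 v2, uint32_t color)
    {
        /* bounding box of the triangle, clamped to the framebuffer */
        int minx = (int)fmaxf(0.0f, floorf(fminf(fminf(v0.x, v1.x), v2.x)));
        int miny = (int)fmaxf(0.0f, floorf(fminf(fminf(v0.y, v1.y), v2.y)));
        int maxx = (int)fminf((float)(width  - 1), ceilf(fmaxf(fmaxf(v0.x, v1.x), v2.x)));
        int maxy = (int)fminf((float)(height - 1), ceilf(fmaxf(fmaxf(v0.y, v1.y), v2.y)));

        for (int y = miny; y <= maxy; y++) {
            for (int x = minx; x <= maxx; x++) {
                Vec2 p = { x + 0.5f, y + 0.5f };    /* sample at the pixel center */
                float w0 = edge(v1, v2, p);
                float w1 = edge(v2, v0, p);
                float w2 = edge(v0, v1, p);
                /* inside if all three edge functions agree in sign (handles either winding) */
                if ((w0 >= 0 && w1 >= 0 && w2 >= 0) || (w0 <= 0 && w1 <= 0 && w2 <= 0))
                    fb[y * width + x] = color;
            }
        }
    }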
Just to be clear - you're writing a "software-based" 3D renderer, right? This is the sort of thing I excelled at back in the late 80s, early 90s, before the first 3D accelerators turned up around 1995 I think.
What features does your renderer support in terms of shading and texturing? Are you writing this all in a high-level language, e.g. C, or assembler? If assembler, what CPUs and features are you targeting?
And of course, why?
> you're writing a "software-based" 3D renderer, right?
Yes. This is 100% what you are familiar with.
> What features does your renderer support in terms of shading and texturing?
I have a software-defined pixel shading approach that allows for some degree of flexibility throughout. Each object in the scene currently defines a function that describes how to shade its final pixels based on a few parameters.
> Are you writing this all in a high-level language, e.g. C, or assembler?
I am writing this in C#/.NET 6. I do have unsafe turned on for pointer access over low-level bitmap operations, but otherwise it's all fully-managed runtime.
> And of course, why?
Because I want to see if I can actually build an effective gaming experience without a GPU in 2022. The secondary objective is simply to learn some new stuff that isn't boring banking CRUD apps.
That's awesome. I think the advantage of a software renderer is that you can adapt your inner loops to do things that a GPU can't do. You can create some new form of polygon-fill that isn't supported by Direct3D or OpenGL etc.
Plus, of course it will run on anything.
I hope you'll be willing to open the code at some point...
Unrelated, but w.r.t. modern rendering versus 90s rendering, I'd imagine that a lot of the performance tricks used in the 90s might not apply because the critical problem is different.
Performance-oriented development these days isn't so much about maximizing usage of the machine's cycles (I mean, OK, fundamentally it's still about that, but...), rather it's about getting the microarchitecture to do the right thing. E.g. LUTs can be extremely bad for cache performance; branch prediction is a much more important predictor of performance than almost anything else; huge amounts of RAM make a lot of the old tips about RAM usage invalid; and SIMD/vector operations and threading are a boon but require a very different way of working.
Even if your mental model is as simple as "CPU processing + L1 cache is infinitely fast, having to fetch data from anywhere else is dog slow" you'll be able to optimize code pretty well given the characteristics of modern processors.
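(A contrived C illustration of that mental model, assuming nothing about anyone's actual code: summing the same values stored contiguously versus scattered behind pointers touches the cache very differently.)

    /* Contrived illustration: same data, very different cache behaviour.
       The contiguous loop streams through memory; the linked list chases pointers
       scattered across the heap, so nearly every step can miss the cache. */
    #include <stddef.h>

    typedef struct Node { float value; struct Node *next; } Node;

    /* cache friendly: values packed together, the prefetcher can stay ahead */
    float sum_array(const float *values, size_t n)
    {
        float total = 0.0f;
        for (size_t i = 0; i < n; i++)
            total += values[i];
        return total;
    }

    /* cache hostile: each iteration waits on a dependent load to find the next node */
    float sum_list(const Node *head)
    {
        float total = 0.0f;
        for (const Node *n = head; n != NULL; n = n->next)
            total += n->value;
        return total;
    }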
If modern high performance code relies on making the microcode do "the right thing", and making sure the right data is in cache then why don't CPU manufacturers allow control over such things?
What’s right for today’s CPU is horrible for next year’s in those terms - and also the other way around.
> One fun thing I discovered is just how low latency a pure CPU rasterizer can be compared to a full CPU-GPU pipeline
i'm definitely going to have to test that! always trying to minimize input delay
I think it can reduce input delay enough to change streaming gaming economics, but the current state of cloud economy makes it difficult to scale in practice.
i'm just starting to learn directx and noticed it can render a triangle at 12,000 fps! i had no clue this was possible. i don't think there's any room for input delay there, but i'll find out
Did you consider using an existing software rasterizer, like Mesa llvmpipe? Or part of the challenge was writing one yourself (nothing wrong with that)?
The upper rendering limit generally isn't explored deeply by games because as soon as you add simulation behaviors, it imposes new bottlenecks. And the design space of "large scale" is often restricted by what is necessary to implement it; many of Minecraft's bugs, for example, are edge cases of streaming in the world data in chunks.
Thus games that ship to a schedule are hugely incentivized to favor making smaller play spaces with more authored detail, since that controls all the outcomes and reduces the technical dependencies of how scenes are authored.
There is a more philosophical reason to go in that direction too: Simulation building is essentially the art of building Plato's cave, and spending all your time on making the cave very large and the puppets extremely elaborate is a rather dubious idea.
Is this not done because of technical limitations, or is it just not done because a town building game with millions of entities would not be fun/manageable for the player?
Although, there's a few space 4x games that try this "everything is simulated" kind of approach and succeed. Allowing AI control of everything the player doesn't want to manage themselves is one nice way of dealing with it. See: https://store.steampowered.com/app/261470/Distant_Worlds_Uni...
I immediately thought of the bullet hell games like Gradius, Parodius, Raiden, R-Type.
What made them, of course, was the art. An army of digital illustrators working by hand to create bitmaps that pop.
One pseudo-2.5D game I'm playing now is Iridion 2 on GBA (2003). You can see the care taken by the art design team, pure lovers of the genre ;)
Very rough guesstimates:
200000 * 200 * 2 = 80M tris/sec
200000 * 200 * 32x32px = 40 gpix/sec (if no occlusion culling)
Neither of those numbers is particularly huge for modern GPUs.
I'd wager that a compute shader + mesh shader based version of this could hit 2M sprites at 200 fps, though at some point we'd have to argue about what counts as "cheating" - if I do a clustered occlusion query that results in my pipeline discarding an invisible batch of 128 sprites, does that still count as "rendering" them?
This demo program is obviously FPS limited in some other way - it's locked at 200fps from the start. The true limits are higher than what is shown.
I've been able to reach 5M particles at 60 fps on a very naive (as a GPU noob) implementation that uses Qt's RHI, which has some unnecessary copying and safeties, with compute + vertex + fragment.
writing 100% of the code on the gpu, you can render 10,000,000 triangles per frame at 60fps ... even in the web browser! (because there's no javascript running) https://www.youtube.com/watch?v=UNX4PR92BpI
but yes, that's cheating, since it's impractical to work with
Using goroutines, I also made 10k 2D rabbits wander on a map using 5% of my laptop's CPU (they'd sleep a lot, admittedly). One goroutine per rabbit, how amazing when you think about it. That's when Go really got me.
edit: oh they do rabbits in the video as well what a bunny coincidence
edit2: the goroutines weren't drawcalling btw, they were just moving the rabbits. The draw calls were still made using a regular for loop, in case you were wondering.
This is doable on a single core in JavaScript.
This by the looks of it is in Jonathan Blow’s Jai language.
How are you finding working with it? Have you done a similar thing in C++ to compare the results and the process of writing it?
200k at 200fps on an 8700k with a 1070 seems like a lot of rabbits. Are there similar benchmarks to compare against in other languages?
it's a lot of fun! jai is my intro to systems programming. so i haven't tried this in C++ (actually i have tried a few times over the past few years, but never successfully).
this is just a test of opengl, C++ should be the same exact performance considering my cpu usage is only 7% while gpu usage is 80%. but the process of writing it is infinitely better than C++, since i never got C++ to compile a hardware accelerated bunnymark.
the only bunnymarks i'm aware of are slow https://www.reddit.com/r/Kha/comments/8hjupc/how_the_heck_is...
which is why i wrote this, to see how fast it could go.
I thought Jai wasn't released yet. Are you a beta user or did he release it already?
It isn't released. That said, from people I know, it seems like you can just ask nicely and show some interest and he'll let you try it out.
That only applies if you are a known name (probably being known among his fans works too), or have somebody in his circle vouch for you. Regular people don't get in.
this is untrue. (source: firsthand)
Curious. Did that change more recently? When did you enter?
over a year ago. I explained that I worked on game engines in college and they were terrible and overengineered and wildly inefficient and I wanted to do things better going forward.
the official rendering modules are a bit all over the place atm... did you use Simp, Render, GL, or handle the rendering yourself?
just used raw GL calls from #import "GL". although i did #import "Simp" as well for Simp.Texture and Simp.Shader, which Simplified things quite a bit
Neat. Isn't this akin to 400k triangles on a GPU? So as long as you do instancing it doesn't seem too difficult (performance-wise) in itself. Even if there are many sprites, texture mapping should handle the part of getting pixels to the screen.
My guess is that the rendering is not the hardest part, although it's kinda cool.
> Isn't this akin to 400k triangles on a GPU?
Is it faster to render two triangles with slightly less area, or one triangle with slightly more area, to draw the same sprite?
Rendering only one large triangle can be faster than two. First, one triangle needs less memory, less vertex processing, etc.
Second, modern GPUs render pixels in groups of 2x2 up to 8x8 "tiles". If only one pixel from a group is part of a triangle, the entire group will be rendered. When two triangles form a quad, the entire area along the diagonal "seam" will be rendered twice. The smaller the quads, the more overhead.
Also see https://www.saschawillems.de/blog/2016/08/13/vulkan-tutorial...
I disagree, with the exception of the case you link to where half the pixels are outside the viewport or maybe where a sufficient percentage are outside the viewport.
> When two triangles form a quad, the entire area along the diagonal "seam" will be rendered twice
This may be true, but I'm pretty sure that this is more than made up for by the additional pixels in the single triangle circumscribing the quad. In fact, I'm willing to bet that it's a mathematical certainty for any rectangle, although I didn't do enough of the math to prove it.
Instead, I would say that most rendering, especially of hundreds of thousands of 2D shapes, is going to be pixel limited. So trading pixels for vertices is a poor trade.
It depends on the size of the sprites in this case. Small sprites will benefit from being drawn as single triangles.
These "shadow" pixel shader invocations are a very real pain when it comes to rendering highly detailed models. The hardware rasterization pipeline can't cope well with huge amounts of really tiny triangles. That's the reason why UE5 Nanite uses a software GPU rasterizer for the high geometry density sections of a model - it's faster! Large area primitives will be rendered normally AFAIK.
Pretty sure overdraw / fillrate bottlenecks before vertex processing. Also, you could draw that quad using strips, which would then amount to only one more processed vertex compared to a single triangle.
Edit: okay, surely with modern architectures there is no pixel write because of some early alpha cut, but you still have to fetch the texture to make that decision, so the texture fetch (memory) will bottleneck first. I guess.
You shouldn't use strips, they're slower than triangle lists on most GPUs.
If by alpha cut you mean "discard", that's going to be much slower than two triangles. Two triangles will have a tiny bit of quad overshading on the seam, compared to a full extra triangle's worth in the alpha cut case.
Yeah, discard used to be slow because it flushes pipelines or messes with branch prediction, I don't remember which; I just assumed they'd "fixed" that by now.
No, it's not either of those, it's just launching useless threads, plus all the down-stream effects of launching useless threads, e.g. if you have blending on, that will block the ROP unit which needs to wait for the threads for a given pixel in-order. If you have depth write on, that will move the write to late-Z.
More vertices is not a big problem, doubling your vertex count is not a big deal, since most GPUs process vertices in groups of 32 or more, and whether multiple instances get packed in the same group depends on the GPU vendor.
By this argument you should get higher performance from higher-poly models... which clearly isn't the case?
Oh, let me clear that up for you. The trick discussed here is that you can draw a sprite (a quad) using one large triangle. The sprite is just inside it, but the triangle has quite a bit of "wasted" surface.
Honestly I'm not sure.
I don't think that at the 200k or 400k level it will matter much. The math is probably easier on humans if you think about the sprites as rectangles (so two triangles), but you could in principle make each sprite a single triangle and texture map a rectangular area of the triangle in a shader.
Bit of a tangent and a useless thought experiment, but I think you could render an infinite number of such bunnies, or as many as you can fit in RAM/simulate. On the CPU, for each frame, iterate over all bunnies. Do your simulation for that bunny and, at the pixel corresponding to its position, store its information in a texture at that pixel if it is positioned over the bunny currently stored there (just its logical position, don't put it in all the pixels of its texture!). Then on the GPU, have a pixel shader look up (in surrounding pixels) the topmost bunny for the current pixel and draw it (or just draw all the overlaps using the z-buffer). For your source texture, use 0 for no bunny and other values to indicate the bunny's z-position.
The CPU work would be O(n) and the rendering/GPU work O(m*k), where n is the number of bunnies, m is the display resolution and k is the size of our bunny sprite.
The advantage of this (in real applications utterly useless[1]) method is that CPU work only increases linearly with the number of bunnies, you get to discard bunnies you don't care about really early in the process, and GPU work is constant regardless of how many bunnies you add.
It's conceptually similar to rendering voxels, except you're not tracing rays deep, but instead sweeping wide.
As long as your GPU is fine with sampling that many surrounding pixels, you're exploiting the capabilities of both your CPU and GPU quite well. Also the CPU work can be parallelized: Each thread operates on a subset of the bunnies and on its own texture, and only in the final step the textures are combined into one (which can also be done in parallel!). I wouldn't be surprised if modern CPUs could handle millions of bunnies while modern GPUs would just shrug as long as the sprite is small.
[1] In reality you don't have sprites at constant sizes and also this method can't properly deal with transparency of any kind. The size of your sprites will be directly limited by how many surrounding pixels your shader looks up during rendering, even if you add support for multiple sprites/sprite sizes using other channels on your textures.
i finally got around to writing an opengl "bunnymark" to check how fast computers are.
i got 200k sprites at 200fps on a 1070 (while recording). i'm not sure anyone could survive that many vampires
that many rabbits, it's frightening!
Do you have the code somewhere? I would like to see how it's made.
Does this work with large semi-transparent objects? (My 10-year-old experience with 2D game engines was that 10k objects wasn't really a problem, unless you were trying to make clouds or fog from ~200x100px, half-transparent images. Have 100 of those, and you'd run at 5 FPS.)
Instead of using textures you can get very good performance from shaders.
Example (not mine): https://www.shadertoy.com/view/tlB3zK
I assume each sprite is moved on the CPU and the position data is passed to the GPU for rendering.
Curious how you are passing the data to the GPU - are you having a single dynamic vertex buffer that is uploaded each frame?
Is the vertex data a single position and the GPU is generating the quad from this?
You can do it in SO many ways! You can have one vertex buffer or double-buffer it, or you can run the entire simulation on the GPU too. In general, uploading data to the GPU can be the slowest part. OpenGL and more modern graphics APIs have evolved in the direction of minimizing the communication between CPU and GPU, since it is almost always a big bottleneck. Modern GPUs are designed to manage themselves with work queues, local data and sometimes even local storage to avoid the need to interact with the CPU.
you can write 100% of the code on the gpu. but that's impractical to work with. i did that here to see how fast webgl can go, since javascript is so slow https://www.youtube.com/watch?v=UNX4PR92BpI
for this bunnymark i have 1 VBO containing my 200k bunnies array (just positions). and 1 VBO containing just the 6 verts required to render a quad. turns out the VAO can just read from both of them like that. the processing is all on the CPU and just overwrites the bunnies VBO each frame
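(A common way to do that kind of per-frame overwrite in C with OpenGL — not necessarily what this demo does — is to orphan the instance VBO and re-upload, so the driver doesn't stall waiting for the GPU to finish reading last frame's data. Names below are illustrative.)

    /* Hypothetical per-frame update of the positions VBO.
       Re-specifying the store with glBufferData(..., NULL, ...) "orphans" the old
       storage so the driver can hand back fresh memory instead of synchronizing. */
    void upload_bunny_positions(GLuint instance_vbo, const float *positions, int bunny_count)
    {
        GLsizeiptr size = (GLsizeiptr)bunny_count * 2 * sizeof(float);   /* x,y per bunny */

        glBindBuffer(GL_ARRAY_BUFFER, instance_vbo);
        glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);  /* orphan old storage */
        glBufferSubData(GL_ARRAY_BUFFER, 0, size, positions);       /* write this frame's data */
    }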
How much time is spent in Jai? How much time is spent presenting the graphics? Unfortunately, graphics benchmarks like this are hard because they don't tell us much. You have to profile these two parts separately.
Gotta be honest this is beyond my current comprehension, but seeing the visuals on this while stoned was a trippy pleasure.
Yes, although the performance is probably largely due to occlusion? Also, the sprites do not collide with their environment.
Is there a way to do it as 1 sprite with 200k SVG filters applied to it at 1fps?
Anecdote: In Unity, using DrawMeshInstancedIndirect, you can get >100k sprites _in motion_ and still maintain >100 FPS.
Using some slight shader/buffer trickery, and depending on what you're trying to do (as is always the case with games & rendering at this scale), you can easily get multiples of that -- and still stay >100FPS.
I agree, more of this approach is great. And I am totally flabbergasted at how abysmally poor the performance is with SpriteRenderer, Unity's built-in sprite rendering technique.
That said, it's doable to get relatively high-performance with existing engines -- and the benefits they come with -- even if you can definitely, easily even, do better by "going direct".