Revisiting the Business Card Raytracer

fabiensanglard.net

176 points by vanni 6 years ago · 53 comments

bigcheesegs 6 years ago

> Opening the binary with Binary Ninja revealed that clang had already managed to leverage the SSE registers.

X86-64 uses SSE registers for all floating point operations. I'm not sure that the author realized that they were looking at an -O0 binary. -O0 does not do vectorization (or anything else for that matter).

  • slavik81 6 years ago

    Looking at it on Godbolt, it doesn't really leverage SSE on -O3, either. You can get a reasonable grasp of whether it's using SSE effectively or not just by looking at the instruction names.

    mulss: multiplication of a single single-precision floating point value.

    mulsd: multiplication of a single double-precision floating point value.

    mulps: multiplication of a packed group of single-precision floating point values.

    mulpd: multiplication of a packed group of double-precision floating point values.

    If you're only seeing -ps suffixes on moves and shuffles, you're looking at code that is not being vectorized. (And, actually, if you're seeing a lot of shuffles, that's also a good sign it's not well-vectorized.)

    Incidentally, if you're seeing unexpected -sd suffixes, those are often due to unintended conversions between float and double. They can have a noticeable effect on performance, especially if you end up calling the double versions of math functions (as they often use iterative algorithms that need more iterations to achieve double-precision).
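
    A tiny sketch of how those sneak in (nothing to do with this particular raytracer, just the usual pattern):

        // The 0.1 literal is a double, so this takes a double-precision detour:
        // cvtss2sd + mulsd + cvtsd2ss (calling pow instead of powf is the same
        // story, only more expensive).
        float dim(float x)  { return x * 0.1; }

        // Stays single-precision: just mulss.
        float dimf(float x) { return x * 0.1f; }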

    I'm linking GCC output, because it's simpler to follow, but you see more or less the same struggle with Clang.

    https://godbolt.org/z/XtVqsU

    • K0nserv 6 years ago

      This post is really topical for me. I spent hours yesterday trying to write explicit SIMD code[0] for the vector dot product in my raytracer and all I managed to do was slow the code down by about 20-30%.

      The code generated by Rust from the naive solution mostly uses ss instructions, whereas my two attempts, one using `mm_dp_ps` and one using `mm_mul_ps` plus `mm_hadd_ps`, were both significantly slower even though they result in fewer instructions. I suspect that the issue is that, for a single dot product, the overhead of loading data in and out of the xmm registers costs more than it's worth.

      Naive Rust version output

          .cfi_startproc
          pushq %rbp
          .cfi_def_cfa_offset 16
          .cfi_offset %rbp, -16
          movq  %rsp, %rbp
          .cfi_def_cfa_register %rbp
          vmovss  (%rdi), %xmm0
          vmulss  (%rsi), %xmm0, %xmm0
          vmovsd  4(%rdi), %xmm1
          vmovsd  4(%rsi), %xmm2
          vmulps  %xmm2, %xmm1, %xmm1
          vaddss  %xmm1, %xmm0, %xmm0
          vmovshdup %xmm1, %xmm1
          vaddss  %xmm1, %xmm0, %xmm0
          popq  %rbp
          retq
      
      My handwritten version with `mm_mul_ps` and `mm_hadd_ps`

          .cfi_startproc
          pushq %rbp
          .cfi_def_cfa_offset 16
          .cfi_offset %rbp, -16
          movq  %rsp, %rbp
          .cfi_def_cfa_register %rbp
          vmovaps (%rdi), %xmm0
          vmulps  (%rsi), %xmm0, %xmm0
          vhaddps %xmm0, %xmm0, %xmm0
          vhaddps %xmm0, %xmm0, %xmm0
          popq  %rbp
          retq
      
      Intuitively it feels like my version should be faster, but it isn't. In this code I changed the struct from 3 f32 components to an array of 4 f32 elements to avoid having to create the array during the computation itself. The code also requires specific alignment to avoid segfaulting, which I guess might also have affected performance.
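
      For reference, the handwritten version has roughly this shape in C-style intrinsics (just a sketch of the idea; the actual Rust code, using the equivalent intrinsics, is linked below):

          #include <immintrin.h>

          // Sketch of the horizontal approach: one dot product per call.
          // Assumes 16-byte-aligned 4-float inputs laid out as (x, y, z, 0).
          float dot(const float* a, const float* b) {
              __m128 va = _mm_load_ps(a);       // aligned loads
              __m128 vb = _mm_load_ps(b);
              __m128 m  = _mm_mul_ps(va, vb);   // ax*bx, ay*by, az*bz, 0
              m = _mm_hadd_ps(m, m);            // pairwise sums
              m = _mm_hadd_ps(m, m);            // full sum in every lane
              return _mm_cvtss_f32(m);          // extract the low lane
          }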

      0: https://github.com/k0nserv/rusttracer/commits/SIMD-mm256-dp-...

      • slavik81 6 years ago

        I'm actually rather bad at this, but my understanding is that the horizontal operations are relatively slow. The easiest way to get throughput out of SIMD is to have a structure representing 4 points, with a vec4 of your X values, a vec4 of your Y values, and a vec4 of your Z values. Then you can do 4 dot products easily and efficiently, using only a handful of vertical packed instructions. (If figuring out how to effectively use a structure like that sounds difficult and annoying, that's because it is.)
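
        Roughly this kind of layout (a quick sketch, not tested):

            #include <immintrin.h>

            // Four 3D vectors stored "vertically": lane i of each register
            // holds the i-th vector's component.
            struct Vec3x4 {
                __m128 x, y, z;
            };

            // Four dot products at once, using only vertical packed ops.
            __m128 dot4(const Vec3x4& a, const Vec3x4& b) {
                __m128 r = _mm_mul_ps(a.x, b.x);
                r = _mm_add_ps(r, _mm_mul_ps(a.y, b.y));
                r = _mm_add_ps(r, _mm_mul_ps(a.z, b.z));
                return r;   // lane i holds dot(a_i, b_i)
            }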

        • K0nserv 6 years ago

          Yeah, that was my conclusion too. I don't think I have any cases where the need to perform dot products between multiple pairs of vectors arises, however, at least not anywhere in the hot loop where it would help. Raytracers tend to use a lot of dot products followed by some checks, then another dot product, but it's a strictly sequential process in these algorithms.

          • a_e_k 6 years ago

            I suspect that most people attempting to vectorize their ray tracer try the straightforward horizontal approach here and discover that it's not really any faster. I certainly did when I first started in this area.

            The key to the vertical, structure-of-arrays layout is to move to a more data-flow style of programming. Instead of generating a single ray at a time from your camera code, you can generate them in small batches corresponding to a rectangle on the screen. When I was in grad school we'd call these ray packets. We'd also do things like organize them into little 2x2 pixel "packlets" within the packet to take advantage of SSE or AltiVec. (This tends to work great with coherent pinhole camera rays, and shadow rays towards a common light source, but not so well with path-traced indirect rays, which quickly become incoherent.)
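
            A packlet doesn't need to be anything fancy, just a structure-of-arrays bundle of four rays (sketch):

                #include <immintrin.h>

                // 2x2 "packlet": four rays traced together, stored SoA so
                // each component maps onto a single SSE register.
                struct RayPacklet {
                    __m128 ox, oy, oz;   // origins (identical for a pinhole camera)
                    __m128 dx, dy, dz;   // directions
                    __m128 tmax;         // current closest-hit distance per ray
                };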

            Likewise, don't go all the way down to single primitives in the leaves of your acceleration structure. Instead, try to group them into small blocks of the same primitive type that are clustered together spatially. For example, you might have triangle mesh intersector code that has a BVH specialized for large numbers of pure triangles (generic programming can help here, e.g., BVH<triangle> for meshes and BVH<sphere> for point clouds). Since the leaves may have different numbers of primitives, a typical approach would be to pack all the primitives into a single flat array and then have your BVH leaves just give you the indices of the range within the array. (The nice thing is that this typically also shrinks your acceleration structures since you have fewer nodes due to the leaves not being so fine-grained.)
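
            So a leaf ends up being little more than an index range into that flat array; something along these lines (sketch):

                #include <cstdint>
                #include <vector>

                // Sketch of a BVH whose leaves reference a contiguous range of
                // primitives packed into one flat array.
                template <typename Primitive>
                struct BVH {
                    struct Node {
                        float    bounds[6];   // AABB min/max
                        uint32_t first;       // first child node, or first primitive
                        uint16_t prim_count;  // > 0 means leaf with this many primitives
                        uint16_t axis;        // split axis for inner nodes
                    };
                    std::vector<Node>      nodes;
                    std::vector<Primitive> prims;   // leaves cover [first, first + prim_count)
                };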

            If you're curious to see some vectorized ray/triangle intersection code that was fairly state-of-the-art in its day, I have an old paper[0] on the topic. I'd also suggest having a look at Intel's Embree[1], especially the early publications. The library is open source (Apache licensed), well organized, and not too large.

            [0] https://www.cs.utah.edu/~aek/research/triangle.pdf

            [1] https://www.embree.org/

      • MauranKilom 6 years ago

        Without looking deeper into it, I assume that you are falling prey to dependency chains here. All your vectorized code is one big dependency chain (every instruction uses the result of the previous one in xmm0). The scalar code has more instructions but they form multiple, somewhat separate, dependency chains.

        At the microarchitecture level, you can generally have multiple instructions of the same kind running "in parallel" if they are independent.

        For example, look at 256 bit vmulps here: https://software.intel.com/sites/landingpage/IntrinsicsGuide...

        On Ivy Bridge, you can start one vmulps per cycle, but it takes 5 cycles before you get a result. If you do several vmulps (and similar) in one long dependency chain, you will only progress by one instruction every 5 cycles!
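
        The textbook way to see this effect in isolation is summing a lot of products with one accumulator versus several (a sketch, not the code above):

            // One chain: every addition waits on the previous one, so the loop
            // runs at roughly one iteration per add latency.
            float sum_one_chain(const float* a, const float* b, int n) {
                float acc = 0.0f;
                for (int i = 0; i < n; ++i)
                    acc += a[i] * b[i];
                return acc;
            }

            // Four independent chains: adds for different accumulators can be
            // in flight at the same time, so throughput goes up.
            float sum_four_chains(const float* a, const float* b, int n) {
                float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
                int i = 0;
                for (; i + 4 <= n; i += 4) {
                    acc0 += a[i]     * b[i];
                    acc1 += a[i + 1] * b[i + 1];
                    acc2 += a[i + 2] * b[i + 2];
                    acc3 += a[i + 3] * b[i + 3];
                }
                for (; i < n; ++i) acc0 += a[i] * b[i];
                return (acc0 + acc1) + (acc2 + acc3);
            }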

        Another point to consider is that multithreading in a hyperthreading environment could change these results: If I'm not mistaken, the two hyperthreads sharing a core compete for the same execution ports but have separate instruction scheduling. What this means is that, in the above scenario, you could theoretically have two hyperthreads each executing one vmulps every five cycles on the same core, so that you actually get double the speed from two threads over one. However, less dependency-laden code (the scalar version?) could fully saturate the floating-point execution ports with just one thread, in which case you might not see any speed benefit at all from a second thread.

        This of course strongly depends on the hardware and on how the code is structured. I'm also not confident that either of these effects is necessarily at play or the prime influence here. But if you are interested in writing well-performing code at this level, these are topics you should look into!

    • gorgoiler 6 years ago

      Off topic: I teach compilers in high school and godbolt.org looks amazing, thanks for the link!

      • skavi 6 years ago

        Wow, really cool that there are high schools teaching compilers.

        • pjmlp 6 years ago

          Yes, in Portuguese high schools you can do a technical education during the last three years (10-12), which gets you ready for the job market.

          I did mine with a focus on informatics in the late 80's/early 90's.

          Brief overview of the subjects across those three years, besides the usual high school stuff:

          Graphics programming, compilers, databases, MS-DOS, UNIX (Xenix back then), Networking (Novell Netware), OS development.

          Languages that we got to use for different kinds of assignments during those three years: GW-Basic, Turbo Basic, Turbo Pascal 5.5, Turbo C 2.0/K&R C, Turbo C++ 1.0, dBase III+, Clipper Summer '87 and its OOP variant Clipper 5.x, plus 8086 and 68000 Assembly.

          The high school where I took it still offers this, naturally updated to more modern stacks and teaching subjects.

          • slavik81 6 years ago

            For comparison, my Canadian high school offered a "Teach Yourself C++ in 30 Days" book that you could study for up to 10 hours in the optional Computers course. If you chose that module, by the end of the first class, you would be more knowledgeable on the subject than any teacher in the school.

            (It was actually an excellent school; they just did not care about computing. Nevertheless, I'm quite jealous of those kids with such an interesting option available to them.)

            • a-priori 6 years ago

              Also, when I was in high school in Canada in about 2001-2003, I ended up taking over the class and teaching C++ because I'd already learned enough of it in my spare time that I knew it better than the teacher (he liked Pascal better).

            • pjmlp 6 years ago

              Yes, unfortunately education varies a lot across the globe.

              This is why we have Raspberry Pis now, even though such boards have existed for years.

              It all started as an effort to reboot UK high school computing education, which was stuck teaching Office.

      • rumanator 6 years ago

        Wow that's some next level stuff. How do high schoolers cope with the topic?

        • gorgoiler 6 years ago

          Very well! It's extremely simplified: we introduce the hierarchy of high-level Python, low-level C, assembler mnemonics and binary machine code, and look at examples of each.

          But prior to that I have a few "virtual machines" that we use as compiler targets, where the VM is a robot finger that accepts left, right, up, down and press commands, and the compilation step is to convert a string like HELLO into a series of robot commands.

          So no branching. No labels or repeatable units of code. The example gets them warmed up to the idea of converting ideas at a high level into simpler code at lower levels, for simple machines to execute.
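
          Roughly, the whole "compiler" fits in one function; a sketch in C++ (the 6-column alphabet grid here is invented for the example):

              #include <string>
              #include <vector>

              // Toy compiler: lower a word like "HELLO" into robot-finger
              // commands for an invented 6-column A..Z grid. No branching in
              // the target, no labels, just straight-line lowering.
              std::vector<std::string> compile_word(const std::string& word) {
                  std::vector<std::string> program;
                  int row = 0, col = 0;                       // finger starts on 'A'
                  for (char c : word) {
                      int t = c - 'A';
                      int trow = t / 6, tcol = t % 6;
                      while (row < trow) { program.push_back("DOWN");  ++row; }
                      while (row > trow) { program.push_back("UP");    --row; }
                      while (col < tcol) { program.push_back("RIGHT"); ++col; }
                      while (col > tcol) { program.push_back("LEFT");  --col; }
                      program.push_back("PRESS");
                  }
                  return program;
              }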

          Towards the end we look at (but don't dive too deep into) real-world compilers. What does print(hello 2+3) look like in Mach-O 64 assembler? Answer: erm, quite a lot of gibberish, but the ADDL is visible, and we can change it to SUBL and get "hello -1" to print :)

          Personally, the hardest parts of compiling for me to understand were the steps after lexing: moving through a grammar to actually do things. Having everything in Python helps this a lot, as you can see how parsing some source code is just a way of triggering other code to execute.

          Apologies for the hand waving. I have a CS degree so I promise it’s not quite as vague as I make it out to be!

  • fabiensanglard 6 years ago

    You are correct. Mārtiņš Možeiko pointed out that I had been too hasty when the article came out (https://twitter.com/mmozeiko/status/1257574246462570497). Concluding that SIMD is leveraged whenever XMM registers are used is wrong. What I should have looked for are packed instructions.

    • mappu 6 years ago

      It would be interesting to see how an ispc version performs, if it is able to extract any more CPU parallelism.

  • moosedev 6 years ago

    Yeah, use of SSE registers does not imply SIMD, since x87 is gone in x86-64, so even scalar FP has to use SSE registers. The asm snippets for v::operator*() in the "Optimization level 1" section use scalar SSE arithmetic only (mulss). (There's some use of movaps to move data around, but it's a stretch to call that SIMD.)

    I think the "leverage" sentence you quoted and the "with SIMD taken care of" one shortly after are maybe a bit misleading, since the asm snippets there don't really demonstrate SIMD.

    • amluto 6 years ago

      > since x87 is gone in x86-64, so even scalar FP has to use SSE registers.

      No, it’s still there. What’s actually going on is that all x86-64 CPUs support SSE2, so there is little reason to use x87 in 64-bit code.

      (You can use it for 80-bit precision. OTOH, for most purposes, 80-bit precision is actively harmful, and x87 is an incredible mess, so almost no one wants it.)

      • ethelward 6 years ago

        > 80-bit precision is actively harmful

        How come? Unexpected clipping when converting back and forth to 64 bits?

        • amluto 6 years ago

          Exactly. The conversions happen at unexpected and unpredictable times, depending on when the compiler needs to spill registers, which has surprising effects.
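
          The classic demonstration looks something like this (a sketch; whether you actually see it depends on the compiler, flags like -mfpmath=387 vs -ffloat-store, and when it happens to spill):

              #include <cstdio>

              double ratio(double a, double b) { return a / b; }

              int main() {
                  volatile double a = 1.0, b = 3.0;
                  double y = ratio(a, b);        // rounded to 64 bits when stored
                  // With x87 codegen the second call's result can still sit in an
                  // 80-bit register, so this comparison may come out "different".
                  std::puts(y == ratio(a, b) ? "same" : "different");
              }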

      • moosedev 6 years ago

        You're right - thanks for the correction!

rwmj 6 years ago

Is there a mistake in the original code on the right hand side? I get:

  card.cpp:16:2: error: ‘g’ was not declared in this scope
     16 | <g;p)t=p,n=v(0,0,1),m=1;for(i k=19;k--;)
        |  ^

Edit: Yes there is. The ‘<g;’ seems like it should have been the single character ‘<’, perhaps a corrupted HTML escape.

danielscrubs 6 years ago

Well there goes my weekend. :)

I also tried to optimise the code and got great speed increases just by making the vector methods constexpr, and I could quickly see that rand was problematic. Then Fabien releases this post with nvcc, which is another level. Really great blog post!

ectoplasmaboiii 6 years ago

A Ray-Tracer in 7 Lines of K: http://nsl.com/k/ray/ray.k

  • jagged-chisel 6 years ago

    7? Just remove those comments and the last few newlines and it’ll be one line.

    I don’t know K, but it looks like it uses semicolons to end statements. It’s “cheating” on line count to compress statements by just removing \n.

    After all, how many “lines” are in the business card raytracer? 4.

blondin 6 years ago

i love what fabien is doing with his website!

also been experimenting with pure html with an itsy-bitsy amount of css. for months now i wondered how to display code without involving javascript.

that textarea is so perfect! and i bet you when you copy and paste into word or your todo list application they won't even try to be "smart" about knowing what "rich text" is...

that's very cool.

  • GuiA 6 years ago

    i am glad that there is a small village of indomitable developers holding out against tens of megabytes of reactive functional progress modern javascript frameworks.

    thank you

    • wetmore 6 years ago

      There is a place for both.

      • GuiA 6 years ago

        It’s not so much that there’s a place for both as that both necessarily exist as two points in the same space.

        But the reality is that more websites than not these days will send you many megabytes of JS, mainly for the purpose of tracking you and extracting money/time from you, under the guise of “user experience”.

        So when I see some of those rare people who still actually care about quality, speed, performance, accessibility, etc I make sure to appreciate their work.

  • Kwantuum 6 years ago

    > that textarea is so perfect!

    There is no textarea on that page. The code sections use a <pre>.

  • nucleardog 6 years ago

    Yeah, damn, I thought I'd done well getting my page+css+font down to ~100kB (css/font cached after first load, so ~2kB on subsequent pages) but his site is tiny. Even with his CSS inlined to save a request the entire page is 15kB gzipped.

rrss 6 years ago

This was really fun to read, thanks fsanglard.

> This is correlated with the warning nvcc issued. Because the raytracer uses recursion, it uses a lot of stacks. So much actually that the SM cannot keep more than a few alive.

Stack frame size / "local memory" size doesn't actually directly limit occupancy. There's a list of the limiters here: https://docs.nvidia.com/gameworks/content/developertools/des.... I'm not sure why the achieved occupancy went up after removing the recursion, but I'd guess it was something like the compiler was able to reduce register usage.

fegu 6 years ago

Almost into passable frame rate territory. Next version could be business card VR:)

ntry 6 years ago

101,000ms to 150ms is a phenomenal speedup. Props

  • correct_horse 6 years ago

    To be fair, the thing he started with was totally unreadable but fit on a business card, and the thing he ended up with was readable but didn't fit. It isn't apples-to-apples.

    • sp332 6 years ago

      The original is 1,287 bytes after removing comments and line endings where possible. After just a bit of search-and-replace to shrink variable names, I got the new one down to 2,373 bytes including the same error checking and output at the end. So 1.8 business cards, which is not too bad.

      And no I haven't tried to compile it. https://pastebin.com/LDRd6U4e

      • a1369209993 6 years ago

        Hmm, some further improvements:

          typedef float F;typedef int I;
          // saves 2 lines (but zero bytes)
          
          #define R return
          // OR
          #define O(S,A,R) operator S(A){return R;}
          
          #define E(F){E_((F||cudaPeekAtLastError)(),__FILE__,__LINE__);}
        
        and so on - although you'd probably have to make semantic changes to get onto a single card. (Or use a smaller font, but that's presumably cheating.)

        • sp332 6 years ago

          I feel like the self-timing and error-checking is unnecessary, and you can at least fit it on the front + back of a business card with enough room left for your name and an email address.

mianos 6 years ago

I wonder if the tool-chain would be better under Linux? It is kind of funny the way Windows development has always been a hassle. Mscvars.bat and such have been there for at least 20 years.

tomsmeding 6 years ago

Nice work on the GPU programming, and the multicore before that, but I'm mystified as to why going from -O0 to -O3 is called an "optimisation". All respect for Fabien, but if your code is supposed to run faster than a snail (and you're not debugging, which is when you'd want -O0 for sensible output), that implies -O2 or -O3. (In practice, -O3 often doesn't give much performance over -O2, despite increasing compile times.)

The initial time is not 101.8 seconds, it's 11.6 seconds.

  • slavik81 6 years ago

    You could also add -ffast-math, which loosens the rules for floating point optimizations. For example, it would allow the compiler to turn floating point divisions into multiplications by an inverse, and to group operations more efficiently even if doing so would slightly affect rounding. It also flushes denormal numbers to zero, which can greatly improve performance on a lot of hardware.

    -march=native may also be useful, as it would allow the compiler to use newer CPU instructions, and tune the generated code to your hardware. That would make the program less portable, but it's not like CUDA is portable either.
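
    Concretely, something like this (GCC shown; clang accepts the same flags, and I'm assuming the source is a single card.cpp):

        g++ -O3 card.cpp -o card                            # baseline
        g++ -O3 -ffast-math -march=native card.cpp -o card
        g++ -Ofast -march=native card.cpp -o card           # -Ofast = -O3 + -ffast-math (and a little more)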

    My machine matches those numbers surprisingly closely. With -O0 it took 89.6s. With -O3, it took 11.7s. With -Ofast (which combines -O3 and -ffast-math), it took 10.6s. With -Ofast -march=native, it took 8.9s. I would expect those gains to extrapolate to the multi-threaded version, maybe pushing it down to 1 second without any further work. (Note: I'm using GCC on Ubuntu 18.04 with a Haswell i7. Your mileage may vary.)

    • amelius 6 years ago

      Offtopic, but I think that languages should have special float types that trigger the use of fast math. That way, a programmer can better control which parts of a program are done with approximate floating point operations.

      • slavik81 6 years ago

        One of the reasons why I think Zig looks appealing is that you can set the policy for these sorts of things on a per-block basis: https://ziglang.org/documentation/master/#setFloatMode

        • amelius 6 years ago

          Imho that's too fine-grained. Usually you determine which variables don't need strict accuracy rather than which code, so that's better controlled through types. Also, it's easy to limit approximations to a code block by casting to/from approximate types around it, so you can still have fine-grained control.

  • Natasha35Khan 6 years ago

    Agree with you

lonk 6 years ago

If you increase the resolution you can put more code on a business card :P
