A little clarification on modern shader compile times – Yosoygames


So I saw these tweets earlier today:

Horizon Zero Dawn has been fun.
Tells me I have an old driver. Still runs. Takes 30 minutes to compile shaders. I watch the intro cut scene. I think. Better to update the drivers. I quit. Update drivers. Start the game again.
And now we’re compiling shaders again? What?

Why are companies not compiling shaders as part of their build?
I am so confused are they injecting screen resolution and driver version or what?

I won’t quote every tweet; I’ll just mention another one pointing out that other (older) games don’t have this problem.

So there are a few issues that need explaining.

Pre DX12/Vulkan world

In D3D11/OpenGL from a 1000km view we can simplify the rendering pipeline to the following (I’m not going to cover every stage):

vertex input -> run vertex shader -> run pixel shader -> output pixel

  1. Vertex input: This is programmable from C++ side. It answers questions such as:
    • Does the vertex contain only position?
    • Does it have a normal?
    • Is the variable in FLOAT32 format? Does it use 16-bit floats? Is it in 8-bit, where the range [0; 255] is converted to the range [0; 1.0]?
  2. Vertex shader, which pretends all the vertex inputs are in 32-bit float
  3. Pixel Shader, which pretends all pixels are 4-channel RGBA in 32-bit floating precision
  4. The output pixel, which can be:
    • In RGBA8_UNORM, RGBA16_UNORM, RGBA16_FLOAT, RG16_UNORM, R8_SNORM, etc. See DXGI_FORMAT for a long list.
    • May use MSAA, may not use MSAA
    • May use alpha blending, it may not

So the problem is that vertex & pixel shaders pretend their inputs and outputs are in 32-bit floats.

How do modern GPUs address this problem in D3D11/GL? By dividing both shaders into 3 parts:

  1. Prefix or Preamble
  2. Body
  3. Suffix or Epilogue

The body is the vertex/pixel shader that gets compiled by fxc (or by the OpenGL driver) and later converted to a GPU-specific ISA (Instruction Set Architecture) i.e. the binary program that the GPU will run.

The prefix and suffix are both patchable regions. What do I mean by patchable? Well… as in binary patching.

The vertex shader ‘pretends’ the input is in 32-bit float. Thus the body got compiled as 32-bit.
But if the input is, say, a 16-bit half float with a specific vertex stride (the offset in bytes between consecutive vertices), then the preamble gets patched with a short sequence of instructions that loads the 16-bit half floats from the right offsets and converts them to the 32-bit floats the body expects.
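The conversion a patched preamble performs can be sketched in C++ as a plain IEEE 754 half-to-float decode (the function name is made up, and real preambles are GPU ISA, not C++):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Decode an IEEE 754 half-precision (16-bit) float into the 32-bit
// float the shader body expects: 1 sign bit, 5 exponent bits (bias 15),
// 10 mantissa bits.
float halfToFloat(uint16_t h) {
    const uint32_t sign = (h >> 15) & 0x1;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits;

    if (exp == 0x1F) {                 // Inf / NaN: max out the exponent
        bits = (sign << 31) | (0xFFu << 23) | (mant << 13);
    } else if (exp == 0) {
        if (mant == 0) {               // signed zero
            bits = sign << 31;
        } else {                       // subnormal: renormalize
            exp = 127 - 15 + 1;
            while (!(mant & 0x400)) { mant <<= 1; --exp; }
            mant &= 0x3FF;
            bits = (sign << 31) | (exp << 23) | (mant << 13);
        }
    } else {                           // normal number: rebias the exponent
        bits = (sign << 31) | ((exp - 15 + 127) << 23) | (mant << 13);
    }

    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}
```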

The same happens to the pixel shader, which needs to convert its four 32-bit floats into, say, RG8_UNORM: discard the blue and alpha channels, convert the red and green from the range [0; 1.0] to the range [0; 255], and store the result to memory.

In order to do that, the driver will patch the epilogue and perform the 32-bit -> 8-bit conversion on the red and green channels.
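What that patched epilogue does can be sketched in C++ roughly like this (function names are made up; the rounding mode shown is one plausible choice):

```cpp
#include <cassert>
#include <cstdint>

// UNORM encode: clamp a float to [0; 1.0], then expand it to [0; 255].
uint8_t floatToUnorm8(float v) {
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return static_cast<uint8_t>(v * 255.0f + 0.5f); // round to nearest
}

// RG8_UNORM epilogue: blue and alpha are simply discarded, red and
// green are encoded and packed into the two bytes that go to memory.
uint16_t packRG8(float r, float g, float /*b*/, float /*a*/) {
    return static_cast<uint16_t>(floatToUnorm8(r)) |
           (static_cast<uint16_t>(floatToUnorm8(g)) << 8);
}
```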

Depending on the GPU, the suffix may contain more operations that have to do with MSAA or even alpha blending (the latter is particularly true on mobile).

D3D11 games ran heavy optimizations only on the body section, mostly done by the fxc compiler, and developers could store them in a single file (a cache) that can be distributed to all machines.

The driver will still need to convert the D3D11 bytecode to a GPU-specific ISA, but it relies on fxc having already done the heavy optimization work. Thus the conversion from D3D11 bytecode to ISA isn’t free, but it isn’t too costly and can often be hidden by driver-side threading.

Shaders could be paired arbitrarily

One more thing I forgot to mention is that vertex and pixel shaders could be combined arbitrarily at runtime. There are a few rules, called signatures, about having matching layouts; otherwise the two shaders can’t be paired together.

But despite those rules, if a vertex shader outputs 16 floats for the pixel shader to use but the pixel shader only uses 4 of them, the vertex shader can’t be optimized for that case.

The vertex shader’s body will be optimized as if the pixel shader consumed all 16 floats. At most, the suffix will export only 4 floats to the pixel shader; but there’s still a lot of wasted code in the body that could be removed but won’t be.

Drivers may try to analyze the resulting pair and remove that waste, but they only have limited time to do so (otherwise some games could see permanent heavy stuttering).

Post DX12/Vulkan world

DX12/Vk introduced the concept of Pipeline State Objects, aka PSOs: one huge blob of all the pipeline data embedded into a single API object.

Because PSOs contain all the data required, there is no longer a need to divide shaders into prefix, body and epilogue.

Drivers know in advance the vertex format, the vertex & pixel shaders that will be paired together, pixel format, MSAA count, whether alpha blending will be used, etc.

We have all the information that is required to produce the optimal shader:

  • Code paths producing unused output will be removed
  • The whole shader’s ‘body’ may prefer to use 16-bit float registers if the vertex format input is in 16-bit (rather than converting 16 -> 32 bits and then operating in 32 bit)
  • Load & store instructions may be reordered anywhere to reduce latency (instead of being forced into the prefix or the suffix)

Therefore most optimizations are delayed until actual PSO creation time. Unlike D3D11’s fxc, which took forever to compile, D3D12’s newer dxc compiler and glslang (if you’re not using Microsoft’s compiler) actually compile very fast.

These shader compilers barely perform optimizations (although SPIR-V optimizers still exist, and they may make a difference on mobile).

Unfortunately, caching a PSO is tied to the GPU and driver version. Therefore something as trivial as a driver upgrade can invalidate the cache, meaning you have to wait all over again.
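One way to picture why a driver update invalidates everything: the cached ISA is only valid for the exact GPU and driver that produced it, so those fields are effectively part of the cache key. A hypothetical sketch (all names and the key layout are made up, not any real driver’s scheme):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical PSO disk-cache key: bump the driver version and every
// lookup misses, forcing a full recompile of all PSOs.
struct PsoCacheKey {
    std::string gpuDevice;      // e.g. "VendorX Model123" (made up)
    std::string driverVersion;  // changes on every driver update
    std::string psoDescHash;    // hash of shaders + formats + state

    bool operator<(const PsoCacheKey &o) const {
        if (gpuDevice != o.gpuDevice) return gpuDevice < o.gpuDevice;
        if (driverVersion != o.driverVersion) return driverVersion < o.driverVersion;
        return psoDescHash < o.psoDescHash;
    }
};

using IsaBlob = std::vector<uint8_t>;   // compiled GPU-specific binary
std::map<PsoCacheKey, IsaBlob> gPsoCache;

bool lookupPso(const PsoCacheKey &key) {
    return gPsoCache.find(key) != gPsoCache.end();
}
```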

As David Clyde said, this actively discourages people from updating their drivers and could become a problem.

There are mitigation strategies being researched, such as:

  • Compiling a slow version of the shader in short time, then recompiling in the background an optimized version
  • Users uploading their caches to a giant shared database classified by GPU device, vendor and driver version, which other users can download (note this may have security concerns; a shader is an executable, after all)
  • Faster optimizing compilers
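The first strategy can be sketched like this: serve a quickly-compiled pipeline immediately, and swap in the optimized one once a background compile finishes. Everything here (names, the two compile functions, the per-frame polling) is a made-up illustration, not any engine’s actual code:

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <memory>
#include <string>

struct Pipeline { std::string quality; };

// Stand-ins for the two compile paths: fast/unoptimized vs slow/optimized.
std::shared_ptr<Pipeline> compileFast(const std::string &)      { return std::make_shared<Pipeline>(Pipeline{"unoptimized"}); }
std::shared_ptr<Pipeline> compileOptimized(const std::string &) { return std::make_shared<Pipeline>(Pipeline{"optimized"}); }

class PipelineSlot {
public:
    explicit PipelineSlot(const std::string &src)
        : active(compileFast(src)),  // usable right away, no stutter
          pending(std::async(std::launch::async, compileOptimized, src)) {}

    // Called once per frame: promote the optimized pipeline when ready.
    std::shared_ptr<Pipeline> get() {
        if (pending.valid() &&
            pending.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
            active = pending.get();
        return active;
    }

private:
    std::shared_ptr<Pipeline> active;
    std::future<std::shared_ptr<Pipeline>> pending;
};
```

The frame loop only ever pays the cost of a non-blocking poll; the expensive compile runs on another thread.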

Please note that ported PS4 exclusives such as Horizon Zero Dawn and Detroit: Become Human were designed around having the PSO cache distributed with the binary (because there are only two GPUs to target: PS4 and PS4 Pro), so they were not designed to compile each shader twice (slow version, then fast). That is why these games spend 20 minutes at the beginning building their PSO cache.

So there you have it: that’s the reason modern games take so long to build shaders at the beginning, and why it may become more frequent in the future.