Programming a GPU on bare metal


I virtually attended Handmade Seattle 2024, where Mason Remaley presented "It's Not About the API" (link eventually), which discussed Vulkan. There was a question about how the buffer data layouts work on the GPU, and the speaker didn't have a clear answer. I don't want to come across as criticizing Mason by saying this; rather, I'm saying this kind of low-level information on GPUs is hard to come by because their internals are rarely revealed or discussed.

I realized I knew at least something about how one GPU does it, and that this information might be interesting. I figured I'd break my long dry spell of posts with a long post on how this GPU works at a bare-metal system level. I'll also go over why I was even looking into this, and what the next steps might be.

This is going to be a long and somewhat meandering post, which matches my long and somewhat meandering journey through Pi bare metal hardware-accelerated graphics programming.

(I use GPU here to mean graphics processing unit, but on the Raspberry Pi it's abbreviated "VPU", which I assume stands for "video processing unit". I'll mix and match the two, but just know I'm referring to the part of the chip which is designed to convert lists of vertices to rasterized pixels fast.)

The context

In 2023 I had some time on my hands and was feeling willing to take on the Thirty million line problem. I have written before about software complexity and wanted to try making a whole system which had reduced complexity.

There are other people trying to tackle this problem. One such project is Serenum. We should cheer such efforts on, and encourage more.

The Simple Useful System challenge

My goal was to have a complete system targeting a specific piece of hardware in under 100,000 lines of code. I call this the "Simple Useful System challenge".

I have a whole list of "rules" I was trying to follow with this goal, a subset of which are:

  • The line count includes everything, not just the final system. The compiler, linker, assembler, etc. (the toolchain) all count. This automatically eliminates existing kernels, complex languages like C++, the GCC toolchain, and languages built on LLVM, which is itself several million lines. "Code" has a loose interpretation: configuration, markup, JSON, etc. all count. This isn't intended to be a code golf/size challenge; it is meant to prevent you from pulling in huge libraries to do things which can be done in less code. Pursue true simplicity, not compactness.
  • The system must run on bare metal. No virtual machines. Real hardware.
  • The user can create and run new programs within the system.
  • The system can receive input via a USB keyboard. No UART piping through another terminal, etc.
  • The system displays its output to a screen connected through its standard video ports. Again, UART terminals are not allowed.
  • The system does not depend on a network connection. No cheating by making a thin client.

I have more such rules, and even a bonus points system. I came up with these with the intent to have it be something of a competition, or a rubric for grading systems to challenge and incentivize people to build these systems. If you look at most "hobbyist" operating systems, they do not meet most of these criteria, which was intentional on my part. I want something which could practically work as a useful system.

The hardware

I am passionate about and motivated by running on real hardware, so I wasn't interested in writing a virtual machine with a nice little sandbox. I also figured the problem would be much simpler if I targeted a very specific piece of hardware.

I ended up settling on the Raspberry Pi 4. It's a "single board computer", which means it comes with built-in RAM, CPU, GPU, various I/O ports (USB, HDMI, etc.), and so on. That means there is a fixed set of drivers, which I'll define as code the CPU must run in order to make any non-CPU hardware function. This fixed set is much easier to support than a full e.g. x86/64 PC, which can have a wide variety of hardware attached.

Another advantage of the Pi is its price. Compared to e.g. a Jetson Nano, the Pi is very cheap without being too underpowered. Many programmers I know already own one or several Pis, mostly just collecting dust. This makes the idea of deploying to customers easier, because it might just entail asking them to copy an image onto an SD card, or shipping a preloaded SD card.

Was Pi the best option?

I did look around. There are SBCs based on Rockchip, Pi variants, RISC-V boards, and more. It's worth doing some exploration if you are interested.

Still, I liked that I already had several Pis, and I knew that many other people had them too. I knew it had a GPU, which was important to me because I want to support 4K resolution graphical user interfaces.

I'm not sure there are better options in terms of available documentation. That is not a compliment to the Pi's documentation, which is very sparse; I just don't think other systems are doing much better. Hardware vendors are pretty secretive, and they mostly assume you'll just use Linux, so there's very little low-level information out there. The Pi at least has a decent reverse-engineering community thanks to its popularity, so there is some information out there.

One could also consider using an FPGA and writing their own CPU. An inspiring example of this is Project Oberon, written by the incredible Niklaus Wirth. FPGAs are quite expensive and their toolchains tend to be highly complex and closed source, however. I am interested in pursuing FPGA projects in the future.

Raspberry Pi? Who cares. Why not a "real" CPU like Intel i7 or AMD?

Well, those systems don't really have system-on-a-chip boards, which means the driver problem is much more pronounced.

Most everything I do here, including the GPU programming, can be done on desktop PCs too. The big difference is that because users can swap hardware on the motherboard, you no longer have any guarantees about which hardware device drivers you will need to write.

You can program your AMD64 processor and even your GPU in bare metal like I do for the Pi, but your code will only work for that exact combination. In the case of the Pi, the two are on the same silicon die and cannot be swapped, so the code will work on any Pi anyone owns.

One can, with great effort, practically program a system targeting a specific SoC (essentially, embedded programming). One cannot do so for desktop hardware without severely constraining the hardware or heavily limiting the software.

The Pi is cool because it is a quad-core 2+ GHz machine which also includes a GPU capable of 4K resolution rendering. It can do some serious computing. This is exciting to target because I know I won't be heavily limited by the hardware.

Getting off the ground

I started with a very nice little tutorial on breaking into bare metal on the Pi 4. Writing your first armstub.S which works is a very rewarding experience, especially when you can get software-rasterized graphics on the screen in only a tiny bit of code on the Pi. I am grateful for such a concise and motivating tutorial. Every platform should have similarly exciting materials to follow.

Aside: Pi boot sequence

The Pi, by the way, is a bit unique in its system architecture: the Pi VPU actually handles initializing the hardware and the Arm CPU. You can compare this to how x86 CPUs are initialized by a BIOS on some other chip on the motherboard. The Pi VPU initialization happens before the "armstub" which starts the CPU ever runs and is proprietary, though people have made open-source firmware for older Pis. This code is NOT compatible with the Pi 4+, which is substantially different.

Reading system documentation

I love learning as much as I can about the platform, so I downloaded all the hardware manuals I could find and committed them to a repository for long-term use. Having them all at hand was very nice when doing hard things like setting up the virtual memory mapping tables or finding the correct memory-mapped device addresses. I particularly liked the "ARM Cortex-A Series Programmer's Guide for ARMv8-A", which provides a great overview of 64-bit ARM processor programming. Whatever the platform, find as much documentation as you can and hoard it all for later reference.

Some of it is rather intimidating: the ARM Architecture Reference Manual (the "ARM ARM") is over 12,000 pages! It's really nice to have when trying to interpret a specific instruction's exact behaviors, or what exactly flags mean for e.g. memory mapping tables. If you practice reading these (in the case of a reference manual, not cover-to-cover!), you will level up and feel confident tackling ever more complex systems.

It's important to learn to use first-party documentation like this because the internet is filled with wild speculation on such things that can be known for sure by reading the manuals. Seriously, low level assembly questions on Stack Overflow get absolute garbage answers. You can use the answers to find the general direction, then use the manual to find the actual answer. If the poster mentions the manual being wrong and that there's some documented errata, then it's probably more legit.

Use a hardware debugger!

If you are interested in doing bare-metal yourself, I highly recommend investing in a hardware debugger. I wrote what I believe to be the most comprehensive and up-to-date tutorial on using a hardware debugger on Pi 4 and 5. I was going insane before I had this. Seriously, do not try to debug with just serial/UART/printf. Do not do that. Hardware debuggers are literally only $15-$25 and will save you hours of time.

This applies to all hardware: use the best tools you can for the job. Hardware is often literally in a black box, so you need tools to help you open them up, otherwise you're just wasting time unnecessarily.

USB, oh no

I had the basic boot sequence in as well as some text displaying on the screen, so next was getting keyboard input.

This is when what seemed like a possible project started to take a turn for the worse.

USB. Universal Serial Bus. USB is a great thing for users, but a huge maintenance burden for systems developers. Basically, most devices need custom code running on the CPU in order to function. This is the unsolvable complexity at the core of modern computing. This custom code is often proprietary and is provided for specific operating systems by the hardware designer.

The USB problem (and other hardware drivers) is where I believe the main value and staying power of Linux lies. Through both raw reverse-engineering by hobbyists and company-contributed drivers, Linux's popularity has made a huge amount of Free Software drivers available for a variety of hardware. Love it or hate it, Linux is the only hope we have right now for using a variety of USB devices in a Free Software environment.

Things are not perfect, of course. Many companies provide only binary blobs for their devices. This is better than forcing us to use Windows to use their devices, but only barely so.

There are some good things about USB. If devices conform to e.g. the "human interface device" (HID) interfaces, then one driver can support multiple devices providing that interface. There are keyboard, mouse, and sound interfaces, for example. If your keyboard supports the keyboard HID rather than requiring a custom driver, then the system's USB keyboard HID driver will be compatible with your keyboard.

It's time to pivot

I started reading the USB specification and what it would take to get USB interaction on the Pi 4. These specifications are not simple, though Ben Eater does a heroic job of simplifying it as much as possible. I didn't want to spend the next 3 months implementing multiple layers of protocols (in this case, PCI-E, then XHCI, then USB), so I started looking for some existing implementations which still met my simplicity criteria.

I ended up settling on Circle, a bare-metal environment for applications on various Pi models. Circle is pretty lean all things considered and provides a large amount of device support. It had a major problem for me though: It is written in C++, which means its toolchain is too complex to meet my simplicity goal. It was okay in terms of line count otherwise; by my count it was around 60,000 lines, which would provide USB, keyboard, mouse, sound, networking, and various other things.

I decided to port Circle to C so I could use a simpler toolchain, specifically TCC, to build and eventually bootstrap the system. The port took around two months. At my fastest clip I could convert about 1,500 source lines a day, amortized and including testing the conversion on hardware. This conversion was painfully repetitive but gave me a great appreciation for the work that Rene Stange and collaborators did.

While I might disagree with their exact approach (heavily object-oriented, including inheritance and polymorphic structures), the stuff within the functions (i.e. the part that matters most, in my view) is hugely valuable. I am very thankful that they didn't rely on C++ templates, which would make the port drastically more difficult.
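To give a sense of what the port looked like mechanically, here is an illustrative sketch (not actual Circle code, and the names are made up): a C++ class becomes a struct plus free functions taking an explicit "this" pointer, and a virtual method becomes a function pointer stored in the struct.

typedef struct TUSBDevice TUSBDevice;

struct TUSBDevice
{
    unsigned char address;
    // Replaces a virtual method: each "subclass" fills in its own handler
    int (*configure)(TUSBDevice* self);
};

// Was a member function; the implicit C++ 'this' becomes an explicit parameter
static int USBDeviceInitialize(TUSBDevice* self)
{
    self->address = 0;
    // Polymorphic call site: was a virtual call in C++
    return self->configure ? self->configure(self) : 0;
}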

The code to interact with hardware is filled with trivia and when it's wrong, you get little (if any) indication as to what you did wrong. The device documentation is often sparse, and frequently the only real reference you have is Linux or OpenBSD kernel drivers for the specific device. For a taste, here's the Circle implementation of PCI-E for the Broadcom PCI-E bus on the Pi 4. As you can see, there are a lot of registers and specific values that need to be written to those registers before you can even begin to talk to other devices on the bus.

Repeat this for several protocols several layers deep, and you'll finally be able to speak to a USB device.

It's important to acknowledge that at this level, complexity is forced upon us in software land by the hardware. There is simply no other way to talk to a device than to implement this interface.

If we want simpler software, we need simpler hardware speaking simpler protocols. This might shift more complexity onto the hardware, but it could also give hardware vendors more control and reliability, since they would rely less on potentially flawed operating system drivers. I am pushing the limits of my understanding here, though: I soldered a mechanical keyboard and wrote custom firmware for it, but that's the extent of my device-side programming experience. If anyone has good reading on how we can simplify both hardware and software, please do share it.

What about the GPU?

After the Circle port, I had USB keyboard input, everything was in C (or Arm assembly), and within my 100,000 line goal.

You've made it through the CPU part. Now it's time to share what I consider the most special part of this project: bare metal hardware accelerated graphics.

From the goals in my notes:

My goal is to have a fully hardware accelerated 2D desktop interface in bare metal on Raspberry Pi 4 and Pi 5. It needs to be completely standalone without dependencies on e.g. precompiling the shaders with Mesa on Linux before being used.

I am NOT exhaustively reverse-engineering or documenting the V3D. I am focused only on a "happy path" or limited subset which achieves my goals.

This all being said, I hope that this effort can be the leanest, closest to bare-metal hardware acceleration library available on contemporary hardware. I want to set a standard of <20k total lines of code, 3-5 standalone C header files, and still deliver a 3D acceleration interface with a programmable pipeline. I believe with such a project we can both understand GPUs better and gain more performance by stripping away all the unnecessary drivers and complexity. The hardware is complex enough as it is; let's try to keep the software simple.

I am NOT interested in implementing an OpenGL layer, or a GLSL compiler, etc. This is supposed to be a simple, minimal library that still practically lets you create new hardware accelerated graphics on bare metal.

My conservative estimate is that fewer than 12 people or teams in the world have done this level of GPU programming. If you do this, you are in a truly exclusive group. Whether this group is worth being in, I don't know, but it is truly rare to do this. I hope reading further will give you an idea why I believe this to be true.

The proprietary nature of GPUs

Most hobby OSes stick to software rasterization because GPU documentation is very proprietary. If you want proof of this, compare ARM's public documentation for their ARMv8-A CPUs with their public documentation for the ARM Mali GPUs (you'll need to search for them; I don't trust their website to have stable links at all). There is virtually nothing low-level said about their GPUs.

I read that GPU vendors consider their GPUs as complete software-hardware systems, and because of that, consider documentation unnecessary because they are providing you with the software to interact with their hardware. This is mostly fine if you are staying strictly in Linux or Windows land, but is a huge problem if you're trying to work on bare metal.

Broadcom actually released documentation for the Pi 3 GPU, which we should celebrate, but they have been extremely tight-lipped about any and all subsequent VideoCore offerings.

A disclaimer

Everything I'm going to link to and present here is to my knowledge information gleaned from public sources. I extensively use and referenced the Mesa project source code, for example.

I did not to my knowledge ever rely on information gleaned from reverse-engineered firmware or private documentation.

I partly put this disclaimer here as a warning: if you are attempting a project like this, you should avoid any sort of reverse engineering of e.g. bytecode firmware, because it could result in legal trouble due to violation of terms of use. See the Coders' Rights Project Reverse Engineering FAQ.

The firmware and hardware design for the VideoCore VI are unknown to me. The only hardware information I have is gleaned from the VideoCore IV document linked previously. The firmware is provided as a binary blob in this repository. I got more intimately familiar with the Mesa v3d implementation, which is as far as I know the only "documentation" for the VideoCore V and up. Needless to say, this isn't an ideal situation. Ideally, I would have a full reference document and a hardware debugger, sort of like I have for the Arm CPU on the Pi. Alas, Broadcom has no interest nor incentive to provide these things.

Getting a break

I was determined regardless to get hardware-accelerated graphics for my system. Luckily, I found a bare metal Pi 4 GPU example (my mirror, with modifications). This was truly a breakthrough project for me; getting a single triangle on the screen with full hardware acceleration made me believe this project was possible.

There was much to be done to make the example work. This example draws a single triangle for a single frame. I needed to get many triangles rendering, then sample textures so that I could render images and text.

How the CPU commands the GPU

I should start at the beginning: how does the CPU tell the GPU what to do?

It's helpful to know how essentially all hardware is controlled by the CPU. For system-on-a-chip boards, one way the CPU controls devices is through predefined memory mappings. The Pi mapping is documented here. The CPU writes to an agreed-upon RAM address, then the device reads that same address. There are also special registers that, when written to, cause the hardware to do things.

One little bare metal pitfall I fell into was that I needed to tell the CPU to flush its caches to RAM so that the hardware device would have the complete picture.
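For reference, here is a minimal sketch of the kind of cache maintenance involved on AArch64 (not my exact code; it assumes a 64-byte cache line size, whereas the real line size can be read from CTR_EL0):

static void clean_dcache_range(const void* start, unsigned long size)
{
    unsigned long address = (unsigned long)start & ~63ul;
    unsigned long end = (unsigned long)start + size;
    for (; address < end; address += 64)
    {
        // Clean (write back) this cache line so RAM holds what the CPU wrote
        __asm__ volatile("dc cvac, %0" : : "r"(address) : "memory");
    }
    // Make sure all the cleans complete before telling the device to read the memory
    __asm__ volatile("dsb sy" : : : "memory");
}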

The VPU has a few such special addresses. We construct a command list, which is just tightly packed data conforming to the VPU's expected format, then we tell the VPU the address of our command list. It reads that RAM address via direct memory access and starts executing commands.

I wanted to improve the example's command list interface first, so I wrote my own code generator that turns the XML command definitions into packed C structure definitions. I like that the commands were specified in an easily parsed format like this. Arm does a similar thing with their instruction set, where every instruction has an entry with its properties and encodings. This is a good way to make your interface programming-language-agnostic.

Here's an example command:

<packet code="25" shortname="clear" name="Clear Tile Buffers" cl="R" max_ver="42">
    <field name="Clear Z/Stencil Buffer" size="1" start="1" type="bool"/>
    <field name="Clear all Render Targets" size="1" start="0" type="bool"/>
</packet>

My generator turns it into the following C structure:

#define v3d_OP_CLEAR_TILE_BUFFERS 25
typedef struct PACKED v3d_clear_tile_buffers
{
    v3d_uint operation : 8;
    v3d_bool clear_all_render_targets : 1;
    v3d_bool clear_z_stencil_buffer : 1;
} v3d_clear_tile_buffers;

Every command starts with an operation. The VPU sees operation 25, which it knows corresponds to the Clear Tile Buffers operation. It interprets the data accordingly. The command list is tightly packed, so commands can be different sizes. This is important for performance because some commands require lots of data (like the "Gl shader state record" command, which has 45 fields), and some have no data, like the "Flush" command.

That's all there is to it. The CPU builds a command list in memory, tells the VPU where it is, then the VPU reads the commands and, if it is a valid list, executes them.
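To make that concrete, here is a simplified sketch of the shape of the code (not the actual implementation; the register pointers stand in for the memory-mapped control-list registers, whose exact names and kick-off sequence are hardware-specific):

typedef struct v3d_command_list
{
    unsigned char* data;
    unsigned int size;
} v3d_command_list;

// Reserve space for the next command. Commands are tightly packed, so the
// reservation is exactly the size of the command's struct, no padding.
static void* command_list_add(v3d_command_list* list, unsigned int command_size)
{
    void* command = list->data + list->size;
    list->size += command_size;
    return command;
}

// Point the VPU at the finished list via the memory-mapped control-list
// start/end registers (placeholders here)
static void command_list_submit(const v3d_command_list* list, unsigned int bus_address,
                                volatile unsigned int* start_register,
                                volatile unsigned int* end_register)
{
    clean_dcache_range(list->data, list->size); // from the earlier sketch
    *start_register = bus_address;
    *end_register = bus_address + list->size;
}

Filling out a command then just means writing to the generated struct in place:

static void example_clear(v3d_command_list* list)
{
    v3d_clear_tile_buffers* clear =
        (v3d_clear_tile_buffers*)command_list_add(list, sizeof(v3d_clear_tile_buffers));
    clear->operation = v3d_OP_CLEAR_TILE_BUFFERS;
    clear->clear_all_render_targets = 1;
    clear->clear_z_stencil_buffer = 1;
}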

A complete example of building a command list and drawing a bunch of triangles is here. It's nearly 2,000 lines, but you can rest assured that most of the command list work can be reused for other rendering work, and is essentially just a lot of configuration.

Parallels to Vulkan/DirectX 12, contrasts to OpenGL

This command list setup is much closer to the Vulkan graphics API. This is no accident; the DX12 and Vulkan APIs more closely resemble how the CPU actually needs to communicate with the GPU. By having the application build the command lists, the application has more control. There is much less "driver" here.

In contrast, the OpenGL interface doesn't resemble a command list. The OpenGL driver (in the case of Linux, the Mesa Broadcom and Gallium v3d drivers) will have to interpret your OpenGL requests and generate a valid command list, which is different for each different kind of GPU.

Note that since OpenGL, DirectX, and Vulkan have existed for a while, GPU vendors now specifically design hardware to support these APIs. It used to be the other way around: the API would be designed around the pre-existing hardware interface. As evidence, the v3d.xml has "Gl" in the command names because the GPU command was designed with the specific OpenGL state in mind. If you are a GPU designer and you want to release a new GPU, you'd do well to support the most popular API usage patterns so that things are closer to "just working" on your new hardware. Things never quite go that smoothly, of course. A substantial amount of driver code papers over the various hardware differences, for better and for worse.

Shaders

Okay, so we know we build a big list of commands that tell the GPU what to do. That sounds like fixed-function GPU programming though.

The VideoCore VI is not fixed function. While it is a mobile-tier GPU, it is programmable and can even run general purpose compute shaders.

Shader is just the name for programs which run on the GPU. They are not fundamentally different from CPU programs: they take some inputs, run machine code instructions operating on registers and so on, and produce some outputs. The GPU has specific inputs and outputs designed around 3D rendering, though compute shaders, for example, don't have these predefined inputs/outputs.

An understanding of the GPU programmable pipeline is important at this point. I don't expect you to understand much of what follows without it. Perhaps try following LearnOpenGL and reading some books like Game Engine Architecture by Jason Gregory (sorry I don't have any great recommendations here; I'm less experienced in graphics programming). Rendering is a big topic but it is quite rewarding too.

The VideoCore VI is a tiled GPU, which means it first bins triangles into per-tile lists based on which screen tiles each triangle overlaps. This means there are actually three shaders:

  • The coordinate shader, which only transforms each vertex into view/screen space for tile binning
  • The vertex shader, which is identical to the coordinate shader but also computes e.g. color output to the fragment shader
  • The fragment shader, which is invoked for every pixel in the triangle which should be rasterized

In Mesa, the coordinate shader is automatically derived from the vertex shader because it is strictly a subset of the vertex shader. They both perform the same operation (transforming vertices), but the coordinate shader only needs to output what the tile binner needs to know to bin the vertex in the appropriate tile/cell.

Nowadays there are many shader languages: GLSL, HLSL, SPIR-V, even graphical node-based ones. At the end of the day though, the driver generates machine code for the specific GPU hardware. Unlike standardized ISAs like Arm Aarch64 or x86/64, GPUs have no standard ISA. This causes the complexity to move to software, much like USB making us have to write all these drivers.

The VideoCore series has its own peculiar ISA for shaders. Each instruction is 64 bits, but you can encode essentially two operations and some load operations in each instruction. There are two processors, an add and a multiply, which is why there are two operations per instruction. From the VideoCore IV documentation:

The QPU contains two independent (and asymmetric) ALU units, an 'add' unit and a 'mul' unit. The 'add' unit performs add-type operations, integer bit manipulation/shift/logical operations and 8-bit vector add/subtract (with saturation). The multiply unit performs integer and floating point multiply, 8-bit vector adds/subs/min/max and multiply (where the 8-bit vector elements are treated as being in the range [0.0, 1.0]).

A basic vertex shader in GLSL looks like this:

#version 300 es
in vec3 position;
in vec4 color;
in vec2 offset;

out vec4 v_color;

void main()
{
    gl_Position = vec4(position.xy + offset.xy, position.z, 1.0);
    v_color=color;
}

Here's the same shader (well, almost the same shader; I have hand-modified it afterwards, and there are additional uniforms I provide that aren't required in the GLSL version but are here) in V3D assembler:

static const char* g_vertex_shader_assembly[] = {
    // xyz
    "ldvpmv_in rf3, 0    ; nop",
    "ldvpmv_in rf4, 1    ; nop",
    "ldvpmv_in rf5, 2   ; nop",
    // color rgba
    "ldvpmv_in rf9, 3    ; nop",
    "ldvpmv_in rf10, 4    ; nop",
    "ldvpmv_in rf11, 5    ; nop",
    "ldvpmv_in rf12, 6    ; nop",
    // offset XY
    "ldvpmv_in rf7, 7    ; nop",
    "ldvpmv_in rf8, 8    ; nop",
    // Load w
    "nop       ; nop ; ldunifrf.rf13",
    // X + offset X ; Load viewport X scale
    "fadd r0, rf3, rf7   ; nop ; ldunif",
    // X * viewport X, load viewport Y
    "nop ; fmul r1, r0, r5 ; ldunif",
    // X to int
    "ftoiz r0, r1 ; nop",
    // Store screen position X
    "stvpmv 0, r0 ; nop",

    // Y + offset Y ;
    "fadd r0, rf4, rf8 ; nop",
    // Y * viewport Y, load viewport Z
    "nop ; fmul r1, r0, r5 ; ldunif",
    /* "nop ; fmul r1, rf4, r5 ; ldunif", */
    // Y to int ;
    "ftoiz r0, r1 ; nop",
    // Store screen position Y
    "stvpmv 1, r0 ; nop",

    // Z * viewport Z, load viewport Z offset
    "nop ; fmul r0, rf5, r5 ; ldunif",
    // Z + viewport Z offset
    "fadd r1, r0, r5 ; nop",
    // Store Z (Zs)
    "stvpmv 2, r1 ; nop",
    // Store 1/Wc
    "stvpmv 3, rf13       ; nop",

    // Store RGBA
    "stvpmv 4, rf9       ; nop",
    "stvpmv 5, rf10       ; nop",
    "stvpmv 6, rf11       ; nop",
    "stvpmv 7, rf12       ; nop",

    // Finished
    "vpmwt -   ; nop",
    "nop       ; nop ; thrsw",
    "nop       ; nop",
    "nop       ; nop",
};

You have officially read GPU shader assembly code. Very few people have done this.

The assembly syntax here is:

add ALU instr. ; mul ALU instr. ; flags

The instruction "nop ; fmul r1, r0, r5 ; ldunif" tells the VPU to:

  • Do nothing on the 'add' ALU
  • Multiply the floating-point values in registers 0 and 5 together and store the result in register 1 (register 5, by the way, is a special register where uniforms are loaded)
  • Load the next uniform (the next 32-bit value in our uniforms buffer)

Many instructions have a no-operation instruction for one of the ALUs, either because e.g. a multiply isn't needed at that time or because the constraints of the processor prohibit two instructions operating at the same time with the given flags (because e.g. the operands cannot be packed with 64 bits in that case, or the hardware would conflict trying to perform those two operations simultaneously).

There is a lot to explain here, but I won't be getting into those details. If you want, you can look at my notes, especially the verbose versions.

Now you see that GPUs aren't so different from CPUs–they aren't magical. There's still just a pile of machine code running, but the GPU has many more assumptions it has made to run faster, such as dispatching in parallel and loading our uniforms in this case.
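One practical consequence: since the shader consumes uniforms strictly in the order it issues ldunif, the CPU side just appends 32-bit values in that same order. Here is a minimal sketch (the helper names are made up) matching the hand-written vertex shader above:

// Append one 32-bit value to the uniforms buffer; the shader will consume
// these in exactly this order
static unsigned int push_uniform_float(unsigned int* uniforms, unsigned int count, float value)
{
    union { float f; unsigned int u; } bits;
    bits.f = value;
    uniforms[count] = bits.u;
    return count + 1;
}

static unsigned int build_vertex_uniforms(unsigned int* uniforms, float viewportXScale,
                                          float viewportYScale, float viewportZScale,
                                          float viewportZOffset)
{
    unsigned int count = 0;
    count = push_uniform_float(uniforms, count, 1.f);            // w, loaded via ldunifrf.rf13
    count = push_uniform_float(uniforms, count, viewportXScale); // first ldunif
    count = push_uniform_float(uniforms, count, viewportYScale);
    count = push_uniform_float(uniforms, count, viewportZScale);
    count = push_uniform_float(uniforms, count, viewportZOffset);
    return count;
}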

The V3D shader assembler

I had to write the V3D assembler because none existed; the only public compiler for VideoCore VI was in Mesa Gallium V3D, and it only compiles the Mesa Gallium intermediate representation of GLSL.

I made the assembler by taking the V3D's debug disassembler (which I presume the compiler author used to debug the compiled code) and inverting it. This was what I imagine to be a highly unusual way to write an assembler, but because I had no ISA documentation, was the only way I could think of to do it. Even now I don't know exactly what some instructions do, but I know I can encode any of the instructions that are decodable by the disassembler.
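One way to gain confidence in encodings produced this way is a round trip: assemble the text, disassemble the resulting bits, and check that the text matches. A sketch of the idea (the function names are illustrative, not an actual API):

#include <string.h>

// Illustrative declarations; the real functions live in the assembler/disassembler
int v3d_assemble_line(const char* line, unsigned long long* out_instruction);
void v3d_disassemble_instruction(unsigned long long instruction, char* out_text, int out_size);

// Encode one line of assembly, then disassemble the resulting 64-bit instruction
// and require the text to round-trip (whitespace normalization omitted for brevity)
static int assemble_and_verify(const char* source_line, unsigned long long* out_instruction)
{
    char round_trip[256];
    if (!v3d_assemble_line(source_line, out_instruction))
        return 0;
    v3d_disassemble_instruction(*out_instruction, round_trip, sizeof(round_trip));
    return strcmp(source_line, round_trip) == 0;
}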

Another interesting wrinkle was the strange semantics of the V3D processor. There are many strange rules surrounding instruction cycle counts. The VideoCore IV manual lists its rules, but the VI potentially has more or fewer constraints. Luckily, the V3D gallium driver implements a validator, which I then ported to the assembler to test these rules. It isn't perfect because the V3D GLSL compiler also encodes rules that a shader author might not be aware exist. This is an unfortunate reality of not having the hardware interface documentation.

Why didn't I support GLSL instead? Well, the GLSL -> Gallium IR was substantially more complicated than the assembler. I didn't feel too bad about having to write assembly because the shaders I intended to write are all very simple–essentially just multiplying vectors and adding pixel colors. The Pi VPU is not incredibly fast, so being somewhat limited by a cumbersome language would keep you from pushing it past its modest limits anyways.

In hindsight, the strange semantics of the code do make a higher-level language appealing, but there is a substantial complexity increase here, and against my 100,000 lines-of-code budget it didn't seem worth it.

The V3D shader simulator

While working on these things I would get the dreaded black screen on the Pi. I had no way of debugging the GPU itself, so I had to resort to simulation to find as many problems as I could.

I wrote an extremely limited simulator for V3D to sort out incorrect memory accesses or buffer setups. This simulator helped me find bugs in my command lists right away. By interpreting a tiny subset of the V3D machine code, I could even step-debug my shaders!

This is a useful technique to know about when dealing with undebuggable black boxes. You can do your best to try to imagine how it works, and at least make sure your inputs work in that imaginary system. This fixes common trivial mistakes like a missed add, an incorrect buffer pointer, or a wrong data layout offset.
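To give a flavor of what catching "an incorrect buffer pointer" looks like, here is a simplified sketch (structure and names are made up for illustration): every simulated memory access gets checked against the buffers the command list actually declared.

typedef struct SimBuffer
{
    unsigned int busAddress;
    unsigned int size;
    const char* label; // e.g. "vertices", "uniforms", "texture"
} SimBuffer;

// Return the buffer containing the access, or null if the access is out of
// bounds, so the simulator can report it instead of mysteriously rendering black
static const SimBuffer* sim_find_buffer(const SimBuffer* buffers, int count,
                                        unsigned int address, unsigned int accessSize)
{
    for (int i = 0; i < count; ++i)
    {
        if (address >= buffers[i].busAddress
            && address + accessSize <= buffers[i].busAddress + buffers[i].size)
            return &buffers[i];
    }
    return 0;
}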

This technique is limited by what's called the "Simulation-to-reality gap" in robotics: at the end of the day, reality is reality, and some amount of your simulation will be wrong or inadequately capture the reality. Eventually we need to run on real hardware, which is the ultimate arbiter of our code's correctness.

This simulator was a huge help though, and helped me feel more confident hand-writing shader assembly.

These tools helped me get to the point where I could render many triangles every frame, like I shared before.

Texture sampling

Now that I had an assembler to practically write simple shaders, I wanted to get hardware-accelerated text rendering in. This required sampling a bitmap of rasterized glyphs (i.e., how literally all operating systems performantly render text).

I wrote a GLSL fragment shader and ran it on Raspberry Pi OS in order to see what V3D instructions were generated by Mesa to sample textures:

in highp vec4 color;
in highp vec2 textureCoordinates;
uniform sampler2D texture;
out highp vec4 FragColor;
void main()
{
    FragColor = texture2D(texture, textureCoordinates) * color;
}

Here's the assembly, though I have hand-modified it from what Mesa generated, and of course the comments are my own:

static const char* g_fragment_shader_assembly[] = {
    // payload_w : rf0
    // payload_w_centroid : rf1
    // payload_z : rf2

    // Load S ; write uniform texture p0
    "nop ; nop ; ldvary.r0; wrtmuc", // (tex[0].p0 | 0x3)",
    // S * W ; load T ; write uniform texture p1
    "nop ; fmul r1, r0, rf0    ; ldvary.r3; wrtmuc",
    // S + r5 (from varying?) ; T * W ; load R
    "fadd r2, r1, r5 ; fmul r4, r3, rf0    ; ldvary.r1",
    // T + r5 (from varying?) ; R * W ; load G
    "fadd r0, r4, r5 ; fmul r3, r1, rf0    ; ldvary.r4",
    // R + r5 (from varying?) ; G * W ; load B
    "fadd rf3, r3, r5 ; fmul r1, r4, rf0    ; ldvary.r3",
    // G + r5 (from varying?) ; B * W ; load A
    "fadd rf4, r1, r5 ; fmul r4, r3, rf0    ; ldvary.r1",
    // B + r5 (from varying?) ; set T
    "fadd rf5, r4, r5 ; mov tmut, r0  ; thrsw",
    // A * W
    "nop ; fmul r3, r1, rf0    ; thrsw",
    // A + r5 (from varying?) ; Set S
    "fadd rf6, r3, r5      ; mov tmus, r2",
    // Load RG
    "nop ; nop  ; ldtmu.r4",
    // R * sample R ; Load BA
    "nop ; fmul rf7, r4.l, rf3 ; ldtmu.r0",
    // G * sample G
    "nop ; fmul rf8, r4.h, rf4",
    // B * sample B
    "nop ; fmul rf9, r0.l, rf5",
    // A * sample A
    "nop ; fmul rf10, r0.h, rf6",
    // RG
    "vfpack tlb, rf7, rf8  ; nop  ; thrsw",
    // BA
    "vfpack tlb, rf9, rf10 ; nop",
    "nop ; nop",

    // out[0] = vary[0] * payload_w + r5
    // out[1] = vary[1] * payload_w + r5
    // out[2] = vary[2] * payload_w + r5
    // out[3] = vary[3] * payload_w + r5
};

Interestingly enough, the actual address the VPU will read from to sample our texture is passed in through a shader uniform, then loaded via the wrtmuc instruction. See my notes if you want a better idea. The "sampler2D" uniforms in GLSL are the rough equivalent; in the VPU's case the shader literally passes the VPU a pointer in RAM to the pixel data to sample from.

This part of my GPU journey was a hard one. I didn't have any existing example to work from besides the generated V3D assembly. I didn't know how exactly to set up the uniforms. Once I did set them up properly (via referencing Mesa code), the texture came out incorrectly:

GPUs often do not store texture pixels in the order we are familiar with–each pixel in a row one by one, each row one after another–so-called "raster" order. Instead, they expect the pixels to be arranged such that nearby pixels are more likely to be nearby in memory. This is clear if you compare the cache locality of accessing an adjacent pixel on the X axis to one on the Y axis: since a raster-order texture would put rows of pixels together, the X axis pixel would be next to the pixel in memory.

In contrast, the Y axis pixel would be an entire row's width away (plus any additional offset/padding for alignment; the total offset from one pixel to the next pixel down is called "stride" or "pitch"). Since textures are often sampled spatially, e.g. adjacent pixels in the output are also adjacent in the input, GPUs do well to try to preserve some of this spatial relationship in memory as well, for cache locality.

We want to take a 2D grid and lay it out in linear memory such that 2D adjacency is preserved as much as possible despite the 1-dimensional storage. You can look up things like Hilbert curves for another way to make adjacent values closer together. In the VPU's case, it divides the texture into blocks, then divides those blocks into smaller blocks, and orders those blocks a certain way.
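To illustrate the difference, here is a toy single-level 4x4 block tiling next to plain raster order. This is not the VPU's actual format (which is only partially documented and has multiple block levels); it is just the general idea:

// Raster order: neighbors on Y are a whole row apart in memory ("stride")
static unsigned int raster_index(unsigned int x, unsigned int y, unsigned int width)
{
    return y * width + x;
}

// Toy tiled order: every 4x4 neighborhood of pixels is contiguous in memory.
// Assumes width is a multiple of 4; returns a pixel index, so multiply by
// bytes-per-pixel for a byte offset.
static unsigned int tiled_index(unsigned int x, unsigned int y, unsigned int width)
{
    unsigned int blocksPerRow = width / 4;
    unsigned int blockIndex = (y / 4) * blocksPerRow + (x / 4);
    unsigned int indexWithinBlock = (y % 4) * 4 + (x % 4);
    return blockIndex * 16 + indexWithinBlock; // 16 pixels per 4x4 block
}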

This is when I was very stuck. I didn't understand the format the VPU expected, which is only partially documented. I didn't know how to convert my raster-order textures to this custom format.

I reached out to the Mesa team and they miraculously responded, unjamming me. I want to publicly thank them again for taking time out of their busy days to help someone learn.

Getting texture sampled triangle rendering was a huge milestone for me after months of work.

(This is the XOR texture, by the way. It's a nice little trick to have when you don't want to do the work to load pixels from an image.)

A GPU command list interface, assembler, and simulator in just 3 single-header zero dependency C files. I think this might be the smallest programmable GPU "driver" in existence!

Update: The toolchain now has its own repository, and I've added a graphical inspector and shader step-debugger.

Much of the code comes directly from the Mesa V3D project. My unique contributions are writing the simulator and the assembler, as well as stripping all dependencies (there are zero #includes here, except the simulator including the V3D interface header).

Where did I stop? What's next?

I ran out of steam for this project after I got the textured rendering working.

On the Circle to C port ("Rpi System"), I stopped just short of a full port. I did not port the network stack to C. I don't think it will be much trouble; I just didn't have the motivation to do it because I wasn't shooting for internet connectivity at the time. I would be very happy to give guidance if someone wants to take up this task, or if someone cheers me on (perhaps financially?) I can possibly push through.

I haven't ported the Raspberry Pi 5 support yet. There are some significant changes, but I don't think it would be nearly as much work as the original Pi 4 to C port. Again, huge compliments to Rene Stange and the Circle team for their amazing continued work.

The GPU is at a very exciting point where it can really do interesting things. I did leave off on a problem where rendering a specific number of triangles would cause a lock-up, which I have no idea how to approach fixing. I may need to work around it.

Update: I fixed the lock-up problem. An incorrect Z-buffer stride probably caused a corrupted control list and broke rendering.

Thanks

There were many sources I relied on or discovered. I read Linux kernel source code, Android kernel source, and various random blog posts like this one.

Here are some other Pi GPU resources which I am grateful for:

Thanks again to the Mesa team for answering my questions on texture format conversion.

Feel free to email me. If you are in the D.C. area, meet me in person at Handmade Cities D.C., where I'm happy to talk more.

If you are aware of a more open and well-documented GPU, please let me know. As far as I know, none exist.

If you like this post and this work, also let me know. I won't write these things out if I think no one reads them. It's a lot of work putting these write-ups together. It can make my day to receive emails offering insights or praise for my often otherwise solitary work.

The question

Now, the question I mentioned at the start, asked by Taylor after Mason's Vulkan talk at Handmade Seattle 2024:

Q: these "make one buffer and send it to the GPU" approaches imply there is now special logic on the GPU to unpack these buffers. Is that unpacking code hard to optimize? If it's arbitrary shader code how does the GPU do these things quickly when it can't have hardware built specifically for these operations?

My response:

I can only speak for the GPU I've programmed shaders for in assembly on bare metal, the VideoCore VI (Raspberry Pi 4 and 5). My understanding of how it works at the lowest level API/software-wise is that you specify essentially how much memory you're going to read per shader instance (e.g. per vertex, per pixel, etc.). After that, the machine code just reads the data off the stack in order based on how much you pre-declared. The GPU's memory manager will then try to stay ahead of the shaders by pre-caching the data.

In the Pi's case the GPU has DMA so you don't ever actually "upload" anything to the GPU–you tell it the address of your command buffer and it starts reading it directly. You can see a full example here (username 'human', password 'noai'), where the shader assembly starts the file, then you'll see all the command buffer data I fill out with the various layouts. The layouts specify only how much data you read; the shader assembly determines the interpretation of that data.

My expectation is that more powerful GPUs have virtualized memory to the point where layouts are less important, and essentially just get the starting address for that shader instance for each different buffer. Either way though, these are programmable devices, so how you actually interpret the data is up to the shader code.

The VideoCore VI has an instruction to suspend your shader after you request e.g. a texture sample. While the GPU is looking up that pixel and caching it if necessary, it can run other shaders.

Actually, texture sampling is a good example of where the GPU is built for closer to random access data. Uniforms and buffers are more rigid so the GPU can be smarter about pre-caching things for shader instances.

If you compare the command buffers for compute shaders to a vertex/fragment shader command buffer, you'll see how the data layouts in the vertex/fragment command buffer are kind of in the direction of fixed-function. You're letting the hardware designers make assumptions about your task so they can more intelligently cache data.

Specific implementation of buffers

See VideoCore® IV 3D Architecture Reference Guide, Uniforms p. 22:

The uniforms are accessed by reading from a special register in the A/B regfile space. A uniforms cache and small FIFO in the QPU will keep prefetched uniform values ready for access.

For per-vertex memory:

The vertex pipe memory "VPM" has 16 channels or columns. See the VideoCore IV manual on VPM.

When running a vertex shader, V3D loads several 16-vertex batches. See Mesa's v3d_vs_set_prog_data():

Compute VCM cache size. We set up our program to take up less than half of the VPM, so that any set of bin and render programs won't run out of space. We need space for at least one input segment, and then allocate the rest to output segments (one for the current program, the rest to VCM). The valid range of the VCM cache size field is 1-4 16-vertex batches, but GFXH-1744 limits us to 2-4 batches.

The device info holds the VPM size in bytes, which can then be used to determine the maximum input and output size, as well as the batch count.

In v3d_gl_shader_state_record e.g. vertex_shader_output_vpm_segment_size determines how many segments (sets of 8 rows in the VPM) are used for the vertex shader's output. If your vertex shader takes e.g. position XYZ, color RGBA, and an offset XYZ, you have 3 + 4 + 3 = 10 components, which requires two 8-component segments.
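In code, that is just a rounded-up division; a quick sketch of the arithmetic described above:

// Number of 8-component VPM segments needed for the vertex shader's outputs,
// e.g. XYZ + RGBA + offset XYZ = 10 components -> 2 segments
static unsigned int vpm_output_segments(unsigned int output_components)
{
    return (output_components + 7) / 8;
}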

The v3d_gl_shader_state_attribute_record does actually specify the types of values read, so I lied: the GPU is doing special unpacking for us. An example attribute record:

v3d_gl_shader_state_attribute_record* defaultPositionsAttribute =
    V3D_BUFFER_ALLOC_STRUCT(&defaultAttributesState, v3d_gl_shader_state_attribute_record);
defaultPositionsAttribute->address = V3D_ARM_TO_BUS_ADDR(args->defaultVerticesXYZ);
defaultPositionsAttribute->number_of_values_read_by_vertex_shader = 3;
defaultPositionsAttribute->number_of_values_read_by_coordinate_shader = 3;
// NOT instanced, since every instance uses the same positions
defaultPositionsAttribute->instance_divisor = 0;
defaultPositionsAttribute->stride = 3 * sizeof(float);
defaultPositionsAttribute->maximum_index = 0xFFFFFF;
defaultPositionsAttribute->vec_size = v3d_VEC_3;
defaultPositionsAttribute->type = v3d_ATTRIBUTE_FLOAT;
++numAttributes;

This also gives the GPU an opportunity to perform any encoding of the values because we tell it the type of the data provided.

Updated response

So, I wasn't entirely correct in my answer, because I forgot that the GPU does do work for us based on the layout when loading our data into the VPM.

If you put yourself in a GPU hardware designer's shoes, it follows that knowing more about the incoming data will provide more opportunities for the hardware to transform that data. With purely opaque buffers, any type conversion is going to need to happen in the shader. If you write a compute shader to do rendering, it will be slower than an equivalent vertex/fragment shader pair specifically because the latter has more hardware supporting the flow of memory through the shaders.

© 2025 Macoy Madson.