Interlude: Using the Index Registers Effectively on the Z80

Working on these Spectrum projects has meant more practice with Z80 assembly along with everything else, and while I do think I’m basically at parity between Z80 and 6502 now in terms of my skills, the grind of actually doing things with it to get practical experience is still very necessary. Implementing the line-drawing algorithm required me to stretch my skills in new directions and put in some real work on both program design, instruction-level micro-optimizations, and in seeing just how far I could push my preferred assembler.

This was great, as practice: I had to really sit down and grind through what options are all available and when and how I could best use each option. I had to consider multiple ways of phrasing everything. I did in fact come up with some real uses for multiple idioms that were new to me in my own code. I could find that all of this still stayed basically consonant with my big CPU comparison last month.

Image: Menu screen from the game EXAPUNKS (Zachtronics, 2018). — I still have no proper illustration for the act of studying and hand-optimizing assembly code, so once again, enjoy the main menu screen of *EXApunks* (Zachtronics, 2018).

But then, the punchline: in practice, as opposed to as practice, none of those idioms ended up being worthwhile for the actual line-drawing function. After using them to write the function in the first place and then testing them out, I started tuning performance and every single one got removed.

That’s a quirk of this function, though. Wrestling with the alternatives here lays out the space neatly, and even though there’s no final work to show, there’s still going to be some value in showing my work.

Moving Beyond the Z80’s Register Space

Our core problem here is that our computation requires more memory than we can fit into registers, so we need to keep some of our state in memory. The nature of the Z80 instruction set is such that we generally cannot work with data unless it’s in registers, so this is much more a case of stashing values we don’t need just now with the intent of recovering them later. (The alternative may be seen in the 8086 and the 68000, where you can use values from the stack nearly as freely as you can the values within your registers, as sources or destinations.) In my Z80 projects to date, I have generally been able to rely on one of four techniques for managing computations. From least to most complex:

The entire computation fits in the register bank. Any persistent data is stored in global variables accessed via absolute address. My implementation of the 16-bit Xorshift PRNG works like this.
As above, but I want the function to not trash certain registers. This works as above but there is now a function prologue and epilogue that PUSH and POP the registers I wish to preserve. My SG-1000 BIOS library does this a great deal.
As above, but not at function boundaries. We can use PUSH and POP anywhere to make room for temporary values and then restore the values those temporaries displaced. My LZ4 decompressor uses this technique to stash the lengths of the various runs in the compressed data and recover them as needed.
As above, but we have exactly two 16-bit temporary values and we alternate between them before restoring the values we displaced. The Z80’s EX (SP),HL instruction lets us perform this alternation without disrupting any other registers. The LZ4 decompressor uses this to juggle the compressed data pointer and the backreference pointer during backreference processing.

With Bresenham’s algorithm, we finally reach the final stage:

There is more data than can fit into registers and the values persist sufficiently that we cannot treat some of them as “temporaries”. In this case we must set aside memory to hold the computation state and sync registers with memory as necessary. The Z80 instruction set grants us the most freedom of action if this block of memory is a collection of contiguous global variables at a fixed location in RAM.

We have actually already seen an example of “a collection of contiguous global variables at a fixed location in RAM:” the ZX Spectrum’s system variables. It gives itself even more flexibility by assigning IY to a fixed location that allows it to reach all of them within its index range; for our own code, we may also do this with IX if we so desire.

Ways to Access Memory

The Z80 has a dizzying array of ways to move data around and it isn’t always obvious which ones are more performant than others:

An 8-bit register transfer LD r8,r8 is 1 byte and 4 cycles.
An 8-bit immediate load LD r8,n is 2 bytes and 7 cycles.
An 8-bit memory load or store through a 16-bit pointer register LD r8,(HL), LD (HL),r8, LD A,(HL/DE/BC), or LD (HL/DE/BC),A is 1 byte and 7 cycles.
An 8-bit memory load or store through an index register LD r8,(Ir+d) or LD (Ir+d),r8 costs 3 bytes and 19 cycles.
An 8-bit memory load or store through a direct address LD A,(nn) or LD (nn),A costs 3 bytes and 13 cycles.
An 8-bit immediate store through a pointer LD (HL),n is 2 bytes and 10 cycles.
An 8-bit immediate store through an index register LD (Ir+d),n is 4 bytes and 19 cycles.
8-bit arithmetic operations can dereference a pointer register for the second operand. These are generally one byte and 7 cycles for (HL) and 3 bytes and 19 cycles for index registers.
There are no 16-bit register transfers. If you can phrase it as two 8-bit transfers that is 2 bytes and 8 cycles, but if you need to pair a push and pop to do it, that is 2 bytes and 21 cycles.
A 16-bit immediate load LD r16,nn costs 3 bytes and 10 cycles for most registers, but 4 bytes and 14 cycles for index registers.
A 16-bit load or store through a direct address costs a varying amount depending on the register. LD HL,(nn) or LD (nn),HL cost 3 bytes and 16 cycles, while BC, DE, SP, IX, and IY cost 4 bytes and 20 cycles.
Pushing a register costs 1 byte and 11 cycles for most registers but 2 bytes and 15 cycles for index registers.
Popping a register is one cycle faster than pushing: 1 byte and 10 cycles for most registers but 2 bytes and 14 cycles for index registers.
16-bit arithmetic operations are extremely inconsistent. ADD instructions with HL as the target are one byte and 11 cycles. ADC and SBC expand that to 2 bytes and 15 cycles since they’re multibyte, and so is ADD when an index register is the target. Index registers seem to work by standing in for HL so in a wild asymmetry, you cannot actually add HL to an index register—instead, the corresponding point in the instruction space adds the index register to itself.
Individual bits can be test, set, or cleared. For any 8-bit register this costs 2 bytes and 8 cycles. For (HL) a test is 2 bytes and 12 cycles, and for a write of some kind it’s 15 cycles instead. Indexed operations are all 4 bytes, and a test is 20 cycles while updates are 23.

That’s a lot, and taking full advantage of it is tedious. The instruction sequence XOR A; LD (nn),A is shorter and faster than LD (IX+d),0, but make it any value but zero and LD A,3; LD (nn),A is the same length and slower. Even in the zeroing case, though, the faster code overwrites the value in A; can you afford to do that? Will you save enough time over the course of the routine that you don’t end up losing time initializing IX in the first place? Can we remove the need to save and restore a temporary value? A PUSH/POP pair, after all, is 22 cycles long and indexing lets us relieve register pressure.

These are the kinds of questions we are capable of mechanizing these days; a good optimizing compiler will have many techniques it can use to direct register allocation to various ends. Actual compilers for these smaller chips, though, seem to focus more on correctness than speed. Rightly so, of course, but that does mean we bear this burden entirely when writing in assembly for it.

Real Stack Frames and Why Not to Use Them

The other reason to rely on the index variables is if the structure you’re working with isn’t in a fixed space. That removes basically all of the absolute-address options, and the unindexable pointers are disastrous when it comes to random access within a block of data.

But do you know what else isn’t in a fixed place? The stack pointer. We can, with a little effort, turn IX into something very like other systems’ frame pointers, and even preallocate some stack space for initially-uninitialized variables:

        LD IX, -8
        ADD IX, SP
        LD SP, IX

At this point we have considerable freedom to interact with bytes on the stack as long as we’re operating with them alongside the A register. The result isn’t incredibly ergonomic, but it’s at least on par with the 65816.

In my first cut at implementing LINE I tried doing it all stack-local like this, but I found that it was markedly slower than the global variable block; the lack of an indexed 16-bit load ended up being very costly here.

Optimizing Bresenham’s Algorithm For the Z80

Here’s the implementation in C of the Bresenham’s line-drawing algorithm as I normally use it:

(I worked through the exact mechanism by which this algorithm worked in an article two years ago—rereading it today I think it suffers a bit from my bizarre epiphany at the time that this graphical linear interpolation system was strongly analogous to the audio linear interpolation systems I had just finished studying, but it does work through the basics appropriately.)

There are a few elements of this that we can simplify for the Z80 CPU and the Spectrum display in particular:

The screen is 256×192, so the arguments, along with x, y, sx, and sy may all be byte variables instead of full 16-bit ints. We cannot shrink dx or dy because they double the value and can range up to 510, and we cannot shrink d because it interoperates with dx and dy.
It’s cheaper and easier to add two 16-bit values than it is to subtract them. In any given branch, one of dx and dy will only ever be subtracted. At the point where we know that will happen, we can negate it, and then do nothing but additions from there on out.
It’s cheaper and easier to shift 16-bit values left than it is to shift them right. (This is a consequence of the previous point: we can add them to themselves.) We can delay the left-shift on dx and dy until after d is initialized to save ourselves a shift-right operation.

The resulting C code looks like this:

C’s syntax goes out of its way to conceal it, but the translation process also highlighted several other areas where 16-bit math operations were really just sign-extended 8-bit operations, which often meant that we could do the smaller, faster 8-bit math and just sign-extend the result if needed.

More interesting, perhaps, is how the memory allocation worked out. The arguments were all passed in in registers. Most of the variables, including the arguments, were then put in a chunk of global variables, and accessed via absolute-address load-and-store operations. However, the color argument was passed in in the C register, which stayed completely untouched by LINE and which PLOT read but never wrote. Both of them also used B as a scratch register. The PLOT routine takes its X and Y coordinates in L and H respectively, and trashes DE as part of its operation. The HL register was untouched by PLOT but LINE makes very heavy use of it as a sort of 16-bit accumulator, so all four of these variables have to be saved off and restored as needed. Furthermore, “as needed” is inconvenient and doesn’t match the strict time-bounding that would let us just use the stack to stash temporaries; I instead put all four coordinates in adjacent locations in the variable block so that I could restore a pair of them at will with a single LD HL,(nn)-style instruction. Happily, since these values never change over the course of the function, we can restore the registers on the way out simply by re-reading them… and then I figured that it was silly to trash only A and B, and used traditional stack spilling to preserve BC along the way.

Most interesting, though, is that I did not have to allocate space for d, x, or y at all. These three variables were never used simultaneously and also were always updated in place by the operations upon them, and I got to rely on the stack-swapping instruction EX (SP),HL to swap between working with the 16-bit variable d or working with the two 8-bit variables x and y.

Overall, there’s a lot to juggle here, but while I certainly commented the function to within an inch of its life so that it was possible to track the computation and current significance of the registers as they evolved, we really, really want some support from the assembler itself to keep us honest and to help us keep all our symbols straight.

It is not a coincidence that my preferred Z80 assembler, Sjasm, is up to the task; it being up to this task specifically was one of the main reasons I ended up preferring it.

Local Variables and Indices in Sjasm

One of the things I particularly liked about Sjasm was what it called its MAP system. You would set an address for MAP and then when defining a label, instead of giving it code or data to refer to, you could instead specify an amount of space, like so:

        MAP $7800
label # 1
label2 # 2
label3 # 1

These definitions can be anywhere in the program and no matter what else is going on, they reserve space against the current map instead of being defined in terms of what’s going on in the code at that point: in this case, label will be at $7800, label2 will be at $7801, and label3 will be at $7803. Even better, the argument for MAP doesn’t have to be a constant expression like the one I gave above; it just has to be unambiguous at build time. When I write Spectrum programs, the last thing I put in my program is a label with a name like PROG_END and then, at the top of the program, open with MAP PROG_END. This puts my variables right after the end of the program, just like BSS definitions in a more modern system.

This system also interoperates completely transparently with the local label system:

line:
.dx # 2
.dy # 2
.sx # 1
.sy # 1
.x1 # 1
.y1 # 1
.x2 # 1
.y2 # 1

If we want to work with them using IX as a sort of off-stack frame pointer, it turns out the EQU directive for declaring aliases also interoperates smoothly with the local label system. The trick is to create new aliases representing the distance of each variable from the first one defined:

.dxi equ .dx-.dx
.dyi equ .dy-.dx
.sxi equ .sx-.dx
.syi equ .sy-.dx
.x1i equ .x1-.dx
.y1i equ .y1-.dx
.x2i equ .x2-.dx
.y2i equ .y2-.dx

For actual setup, we preserve the registers we’re trashing and initialize IX to point to the first variable in the block, like so:

        push ix
        push bc
        ld ix,.dx
        ld (.x1),hl
        ld (.x2),de

At this point we can use any variable as a direct address the way we do when saving out HL and DE, and we could do things like subtract y1 from y2 with code like:

        ld a,(.y2)
        sub (ix+.y1i)

Cleanup is very similar to setup:

        ld hl,(.x1)
        ld de,(.x2)
        pop bc
        pop ix
        ret

There’s a small subtlety here with the data layout; the PLOT function, and really any code that wants to treat coordinates as something kind of like addresses, really wants the X coordinate in the low byte and the Y coordinate in the high byte. I can, and do, take advantage of this by allocating y, y1, and y2 right after their x equivalents; this lets 16-bit load and store operations pull each coordinate into the proper register right away.

Multi-Function Scratch Space

The code we’ve written above works great, and it even has a canonical representation in C; we aren’t using stack-based variables here, but all the variables are still ultimately interior to the function itself. C represents that by declaring in-function variables static, and while it’s something you only want sometimes there, it does still have its place in the modern world. (That place is more for data that needs to persist from call to call as part of the computation than it is to save on stack operations, admittedly.) If we look at period code written for 8-bit systems in assembly language, though, we’ll see a different pattern that we generally don’t see in C: a bunch of functions will use the same block of memory for their local variables, essentially treating that chunk of globally-allocated RAM as a scratch space. (C can technically do this, too, with global variables that are also unions—I have never seen this done in seriously-intended code but I’m happy to leave making use of it as a challenge to my readers.) Sjasm also handles this case very cleanly.

The way it works is that MAP directives can nest. While it’s common to just have a single directive at the top of the program (indeed, I’m doing that here), we can override the MAP location at will with a new directive, and then go back to where we were with ENDMAP. Unlike a true segment-like system, once a map ends, we lose track of it forever and can’t re-extend it later; however, also unlike a true segment-like system, that makes it a more direct fit for this shared-scratch-space system.

The trick is that these submaps also get to use symbols from elsewhere in the program as their start point.

        MAP bss_start
scratch1 # 16
scratch2 # 16
        ;; other global variables as needed...

        ;; some actual program, and then...

line:
        MAP scratch1
.dx # 2
.dy # 2
;;; ..etc. Don't allocate more than 16 bytes! That's all that scratch1 has!
        MAPEND

The body of the function ends up using the variables just as they do in the code where only LINE gets them— But now, every function that doesn’t call any other function that uses scratch1 can reuse the same memory for its locals with this construct. Similarly, functions that only call functions that use scratch1 for variables can use scratch2 for their own. We can chain this as deep as we like.

The end result is not unlike the register allocation discipline we imposed on ourselves for the TI-99/4A. Each scratch zone is basically a workspace, and within a workspace we could divide up the bytes more carefully to not need to change our context. That’s important when, as on the TI, the workspace is your registers and also basically all of your usable RAM, but it’s less crucial here.

One Other Cool Feature I Didn’t Need Right Here

The Z80 instruction set doesn’t have a handy equivalent to the 6502’s LDA addr,X—adding a small index to a full pointer address as a way to index into an array without necessary iterating directly through it. The full procedure is to put the base address in HL, the offset in DE, then produce the final address to dereference with ADD HL,DE. That’s 7 bytes and 31 cycles. If we know that the whole array is in a single 256-byte-aligned region, though, then we know that the result won’t ever touch the high byte. Now we can write LD HL,base;LD A,index;ADD L;LD L,A for the same effect, saving 1 byte and 6 cycles of computation. One way to do that is to make sure that base is sufficiently aligned that it cannot cross a page boundary, and Sjasm offers the command ## n to pad the map out to n-byte alignment.

Optimizing Away IX

My initial implementation of LINE used IX in five spots. Four of them were places that computed x += sx or y += sy, with code that looked like this:

        ld a,l                          ; x += sx
        add (ix+.sxi)
        ld l,a

This turns out to be strictly inferior to the code that doesn’t touch the index at all:

        ld a,(.sx)
        add l
        ld l,a

This is the same number of bytes of code, and it’s 6 cycles faster. These spots also run whenever the “pen” moves in either dimension, so a diagonal line drawn across the whole screen will save (255+191)*6=2676 cycles: almost a whole millisecond on a single line! The “make your code be 8080-friendly when possible” maxim wins another point.

The last case only runs once, during setup. In it, we have to decide whether we’re iterating along X or Y by deciding which of .dx and .dy is larger.

        ld a,(.dy)
        cp (ix+.dxi)
        jr nc,.y_major

Rewriting this one 8080-style costs us two bytes, two cycles, and burns an extra register:

        ld a,(.dx)
        ld b,a
        ld a,(.dy)
        cp b
        jr nc,.y_major

Happily for us, burning the register is free; the first thing we do whether or not we take the branch is call PLOT, which trashes B without reading it. No additional harm done there. Also, unlike the other four spots, since this only runs once, so the two-cycle cost is only paid once.

It’s also, crucially, the last use of IX in the function. If we make this change, the main body of the function may be two cycles slower, but we can remove the three instructions PUSH IX; LD IX,.dx; POP IX from the prologue and epilogue, deleting 8 bytes and saving 43 cycles. The net savings for bouncing through the B register is 6 bytes of code and 41 cycles of runtime.

Conclusion: Where Does Indexing Pay?

Well, that was a very long trip that took us back to square one. Have we just buried index registers? Is this why the Game Boy’s Z80-variant CPU just didn’t bother with them? … I actually have no idea about that latter, to be honest (though in its place it has actual stack-relative addressing so normal stack frames are much easier to write over there). But I think we’ve still got quite a few places where it makes sense to rely on the index registers:

Bitwise operations. If IX is preconfigured, unless you already have HL pointing at the exact memory location you want, indexed bit operations are probably cheaper than the extra configuration. If you’re doing some kind of weird loop where you’re doing several bit operations per byte in a buffer or something you might have to do some math, but if you’re doing multiple bit operations like that you should also make sure that you don’t really want byte-level logic operations instead!
8-bit Memory-Memory operations. Using IX really did save us both time and space when we needed to compute the low byte of .dy-.dx—it just didn’t happen to save us enough time in this particular function to justify the cost of initializing the register in the first place. With a slightly different algorithm to implement, that decision could have easily gone the other way.
Preventing register spills. If doing a computation in registers would require a PUSH/POP pair, using indexed operations may be cheaper than the costs of the stack operations. If there are a lot of indexed operations the result could go the other way, though, so this is another case where you’ll have to use some craft.
Dynamic variable blocks. This isn’t an efficiency concern exactly, but if you are iterating through an array of structs, index variables are all but tailor-made for this purpose. Just remember that when advancing it that ADD IX,HL isn’t an instruction; use literally any other register pair instead.

We see plenty of the first two cases in the Spectrum’s own system ROM, and the final case showed up for me when processing sprite data on the ColecoVision and MSX1. So the index registers have definitely got their place. It’s just that it’s usually going to be in your back pocket instead of always ready to hand.

Next time: some actual progress towards an animated sprite engine!