BBC BASIC raytracer in 432 characters (mastodon.me.uk)

In 398 characters, I can give it error diffusion dithering capabilities, which makes the code smaller and the output quality a bit better:
https://bbcmic.ro/#%7B%22v%22%3A1%2C%22program%22%3A%22MODE1...
It would look even better with bidirectional error diffusion, but that requires reading back memory which I don't know how to do.
Thanks for sharing. I asked ChatGPT to analyze your code, while telling it what it does. How well do you think it did? Did it make any mistakes?
https://chat.openai.com/share/6e754fa1-531d-432e-a767-8c8132...
50% right... 50% trash...
> VDU5: Sets the output to graphics.
Send a command to the visual display unit to turn off the flashing cursor.
> B=0: Initializes variable B to 0. This variable will be used for brightness calculations.
B is used to propagate quantization error from one pixel to the next. Specifically, if we wanted one pixel to be orange, but the closest color we could use was red, then B contains "more yellow needed". This gets added in to the next pixel, giving it a better chance of coming out yellow.
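Here's the same idea as a minimal Python sketch (illustrative only, not the actual BASIC code):

    def dither_row(values, levels=4):
        # quantize brightnesses in [0, 1] to `levels` shades, carrying the
        # quantization error into the next pixel (the role B plays above)
        out, err = [], 0.0
        for v in values:
            want = v + err                  # ideal value for this pixel
            q = min(max(round(want * (levels - 1)), 0), levels - 1)
            out.append(q)                   # nearest displayable shade
            err = want - q / (levels - 1)   # "more yellow needed" next time
        return out

(Scanning strictly left to right with no noise in the input is also why the result can come out stripy; serpentine scanning on alternate rows would break that up.)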
> recursion
There is no recursion here. It's an infinite loop that exits when the raytraced ray fails to hit the floor or a sphere.
> The 4M and 4N are for scaling the coordinates to the screen resolution.
No - this is because early computers didn't have enough memory to store the whole screen image. Therefore there were 'device pixels' and 'logical pixels'. In this mode, each row of pixels is output four times to the screen, making each pixel into a group of 16 device pixels (4x4).
Drawing commands in BBC BASIC are done in device pixels, but there is no point in figuring out a color for every device pixel if you can only store one in 16 of them.
Didn't check the inner loop since I didn't write that code - but I suspect it has similar accuracy issues.
Thanks!
>50% right... 50% trash...
That's not bad at all for ChatGPT. I guess it will be a long time before a computer can really accurately describe what a complicated uncommented algorithm is doing.
Wow - I never thought of using ChatGPT for analysing code. It reads well - but does the author agree?
I just asked it to reformat the code for better readability, and it guessed it was a ray tracer. I for one welcome our new robot^H^H^H^H^Hparrot overlords.
Also tried having it generate a JavaScript version, but the output doesn't look right: https://chat.openai.com/share/5d76e32b-81a1-4b4a-a808-7682a8... I'd be inclined to suspect the lack of line numbers, as a first guess towards the problem.
Hah! The code doesn't work because ChatGPT missed a very obscure trick...
At the start of the code is a comment "REM xxxxxx". It stripped this comment out thinking it was junk.
But actually, the code later uses the "PAGE" command to directly read RAM to get the contents of this comment, which it then uses as a lookup table for dithering.
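In case the trick is unfamiliar: the tokenized program itself lives in RAM starting at PAGE, so the REM's characters sit at a fixed offset you can read like an array. A rough Python model of what line 70 is doing (the PAGE value and the exact offsets here are illustrative; on a real machine PAGE depends on the filing system):

    memory = bytearray(32768)       # pretend BBC Micro RAM
    PAGE = 0x1900                   # typical with DFS; varies by filing system

    def peek(addr):                 # BASIC's ? indirection operator
        return memory[addr]

    def dither_entry(m, n):         # the lookup in line 70
        # if the line header takes 4 bytes and the REM token 1, the REM's
        # text starts at PAGE + 5; its 16 characters form a 4x4 table
        return peek(PAGE + 5 + m % 4 + (n % 4) * 4)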
Actually I did that (note that the REM is missing from my prompt.) It didn't look like a hack for embedding binary data for subsequent PEEK or CALL statements, which was common on the 8-bit Apple I grew up using, so I assumed it was just garbage and left it out.
When I fed it your updated version that didn't include the REM, farther down the page, it still didn't quite work. Although I can tell it's not too badly broken. There's some visual structure in the output, just not the expected image.
I wish there was an option to run at greater than realtime!
It makes the output better in some ways, but since it's just one-dimensional and there isn't really any noise in the image, it does produce a very stripy result on the portion I let it run for.
In the replies underneath, someone ran it on a real Acorn Electron computer (6502 at 1 MHz), where it completed in 8 hours and 40 minutes. Not too bad really.
I'm running this on a BBC micro emulator running at period-accurate speed, and it's taking just under a minute to draw one line. So I think it's going to finish in about 4 hours, which would make sense - approximately twice as fast as the Acorn Electron.
Update: Finished in only 2 hours 30 minutes.
The Electron runs extra slowly in MODE 1 as the video system needs 100% of the RAM bandwidth to form the display. The CPU essentially ends up paused during the visible portion of each scan line, so the effective speed is probably more like 0.6 MHz.
I wonder how much of that it spent just computing those square roots.
Does anybody happen to know what SQR algorithm BBC Basic uses? http://www.bbcbasic.co.uk/bbcwin/manual/bbcwin7.html#printha... does not seem to have the actual algorithm.
Intel's optimization manual has suggestions for fast versions at 15.12.3 (recommended by @James https://stackoverflow.com/a/2637823/10981777). They don't look especially long to implement for a 22-bit approximation.
https://software.intel.com/content/www/us/en/develop/downloa...
Also, what's the SQRD? Not finding that referred to anywhere.
SQRD is just SQR D :)
Thanks. So much more of that code makes sense now. IFD, SGNU, FORN, FORM. FORM especially - it really reads like a special keyword of its own.
What a nightmare to read, though, if you don't know how it's supposed to be written. Is it "FORM=" or "FOR M="? Is it "S GNU:G", "SG NU:G", or "SGN U:G"? SQRD totally looks like some special kind of square root implementation.
Yeah, early BASICs tended to let you skip the spaces, and in fact it sometimes did have a measurable effect on parsing/execution speed. And of course on file size too… Additionally, at least C64 BASIC had two-character shorthands for all keywords.
that's astounding; i would have guessed months or years
I recreated a Mandelbrot zoom I saw in a library book in BBC BASIC, and the last image took two weeks. I had 8 of them saved on a floppy and it would slideshow them for a 2fps "movie".
that is fucking awesome, you are a legend
BBC BASIC is surprisingly efficient for an interpreted language. It kind of had to be, running on processors that slow.
all the basics i used on processors that slow were surprisingly inefficient, even for interpreted languages
i thought they had to be, running on processors with that little memory. though later on i learned about forth, which is surprisingly efficient for an interpreted language
a more likely explanation is that sophie wilson was just a better hacker than bill gates and paul allen
Yeah, BBC BASIC is really good. Both the language design and the implementation.
I loved it at the time, and the more I think back on it the more impressed I am. Like it had a very decent suite of floating-point routines, which if I remember right were very performant. In a 32KB ROM!
Some of the improvement in BBC BASIC over Microsoft's BASIC is attributable to the larger ROM (the first Altair BASIC was just 4K), but it's also that the BBC Micro's 6502 and DRAM were clocked higher than in contemporary machines. It's just a faster, more refined 8-bit machine all around.
BASIC was actually a 16KB ROM. The MOS occupied another 16KB ROM.
Dang, you're right, I was misremembering.
Some good discussion of that in the BBC BASIC Wikipedia entry: https://en.wikipedia.org/wiki/BBC_BASIC (paragraph beginning "Due to a number of optimizations[…]").
Also you could swap out hot spots with assembly trivially as it had an inline assembler. And you had indirect pointers in it!
The OP's code has REM-based assembly in it to save space
is that assembly or is that the dithering table
Yes, the REM is the lookup table used in line 70.
Because BBC Basic had a built-in assembler it was pretty uncommon for BBC programs to inline machine code as raw data (unlike some other computers from BITD).
just a table of constants.
Also, don't miss the gallery site! https://www.bbcmicrobot.com
the gallery site is amazing in and of itself. the gallery isn't PNGs or GIFs - it's generated in the browser from the emulator state. if you hover over an item it resumes the emulator.
This one was from 2021.
Do enjoy the insane things people managed to cram into tweets to the bbcmicrobot over the years. For anyone who's followed any part of the demoscene, especially the classic small-payload demos (4k etc.), this kind of thing is mindblowing and totally in the same spirit.
Ok, this is probably a dumb question: I get that the code traces the rays for this scene, obviously — but where is the scene, then? That is, where is the “data” (as opposed to the algorithm) that says “there’s a sphere at coordinates x,y,z with radius r, and here’s the other one, and here’s the checkerboard plane”?
It’s mostly implicit. See how the spheres are symmetric about the centerpoint? (Well, almost - the camera is at Y = -0.1.) The (X, Y) coordinates of their centers are simply (-1, -1) and (1, 1), given by the `I = SGN U` variable at the end of line 40. The Z is implicitly zero. Their radius is 1. The plane is the floor at y = -2, reached via `P = (Y + 2)/V` at line 60.
…to elucidate a bit, when tracing the rays on the left side of the image, the sign of u is -1, so we're trying to intersect a sphere at (x-1, y-1, z), and on the right side similarly one at (x+1, y+1, z). This works because none of the left-side rays can even in theory hit the right-side sphere and vice versa. It's a very simple version of space partitioning optimization. And when tracing the reflected rays, we flip the sign of i, because there's no self-reflection – the left-hand sphere can only reflect the right-hand one and vice versa.
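To make the implicit scene concrete, here's the whole "database" written out in Python (a sketch of my reading of the code; I = SGN U is approximated with copysign, which differs from SGN at exactly zero, a case that never occurs here):

    import math

    def scene_for_ray(u):
        # I = SGN U: a left-half ray only ever tests the left sphere, and
        # vice versa - a minimal space-partitioning trick, not stored data
        i = math.copysign(1.0, u)
        sphere_centre = (i, i, 0.0)   # (-1, -1, 0) or (1, 1, 0)
        sphere_radius = 1.0
        floor_y = -2.0                # the checkerboard plane of line 60
        return sphere_centre, sphere_radius, floor_y

    # after a bounce the code flips I = -I: a sphere never reflects itself,
    # so a reflected ray is only tested against the opposite sphere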
these comments were extremely helpful, thank you
No problem, this was a fun puzzle to solve =D
sharlin has explained how the sphere coordinates are computed; the checkerboard plane is implemented by converting downward-pointing vectors (where the y-component v is negative) into upward-pointing vectors toward a point in the sky with the right color
then the (palette index of the?) color of the sky is computed from the square root of the y-coordinate of the direction vector as

    IF V < 0 then P = (Y + 2)/V : V = -V*((INT(X - U*P) + INT(Z - W*P) AND 1)/2 + .3) + .2

plus a fixed dithering table:

    GCOL 0, 3 - (48*SQR V) DIV 16

The 3D scene is described by formulas instead of data. It's a bit similar to how SDF demos describe a 3D scene by combining shape formulas into a single formula for the whole scene (SDF = signed distance fields).
E.g. see:
while i know you didn't say this was using an sdf, i want to point out that it doesn't seem to be using an sdf, just in case anyone is wondering; it seems to be a conventional whitted-style raytracer that computes sphere–ray intersections in closed form by solving the quadratic
This makes a sphere. It gets multiplied by two coordinates to make spheres at different places, U and V.

    W=1/SQR(U*U+V*V+1)

i think this is incorrect, i think that's computing the z component of the initial direction vector of the ray being traced
if you change the initialization of i from sgn u to 1.2 * sgn u you will get an image with the spheres a little further apart
Yeah, that's something of a red herring. The actual ray-sphere intersection happens on line 50:
    // sphere_center = (E, F, Z)
    E = X - I
    F = Y - I
    // solving the ray-sphere quadratic eqn...
    P = U * E + V * F - W * Z
    // discriminant
    D = P * P - E * E - F * F - Z * Z + 1
    // if solution (ie. intersection) exists
    IF D > 0
    // intersect_point = ray_orig + T * ray_dir
    // this is the closer pt, -P + SQR D is the other
    T = -P - SQR D

yeah, that was what i guessed it was doing, but i couldn't remember or rederive the ray-sphere intersection formula to be sure
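for reference, here's the same test written out in python - my transcription, so treat it as a sketch; it assumes a unit sphere at (i, i, 0), ray origin (x, y, z), and direction (u, v, -w) to match the basic's Z = Z - T*W:

    import math

    def hit_sphere(x, y, z, u, v, w, i):
        e, f = x - i, y - i                 # origin relative to sphere centre
        p = u*e + v*f - w*z                 # half the linear coefficient, b/2
        d = p*p - (e*e + f*f + z*z) + 1.0   # (b/2)^2 - (|o-c|^2 - r^2), r = 1
        if d <= 0:
            return None                     # ray misses the sphere entirely
        t = -p - math.sqrt(d)               # nearer root of t^2 + 2pt + c = 0
        return t if t > 0 else None         # ignore hits behind the origin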
a thing that puzzles me is how https://bbcmic.ro/?t=9ctpk is only 4× faster in emulation (about 30 seconds per scan line and so on the order of 2 hours for the whole image)
i'm running this on a ryzen 5 3500u at 2400 megahertz. the acorn electron which supposedly takes 8 hours and 40 minutes is a 1 megahertz 6502 when running from ram, roughly, 262144 instructions per second. at 2 ipc one core of the ryzen should be about 4800 mips. if the emulation slowdown is 10× (10 host instructions to emulate one guest instruction), which is typical for naïve interpretation, it should be about 1000× faster on this laptop as on the original hardware
(possibly the basic is in rom and so it's closer to 524288 instructions per second)
emulation through qemu-like dynamic code generation should cut the 10× slowdown to 3×, so it should be 3000× faster than the original hardware
where did that factor of 1000 go? not the javascript engine, surely?
incidentally there is a rocket-ship button underneath the output screen image labeled 'send to beebjit' which runs the program in about a second
If the emulator is entirely cycle correct for the entire system (including the video system, which is usually the most expensive to emulate, not the CPU), then the speedup on modern computers will be much less than just comparing the clock frequencies.
Consider that each chip of a home computer system (CPU, 1..2 IO/timer chips, audio, video, ...) needs to do 'emulation work' for each cycle of its 1 or 2 MHz clock, which can add up to quite a number of host computer instructions (dozens to hundreds).
If each chip emulator takes just around 10..20 host system clock cycles to emulate one of its own clock cycles, then you are already looking at around 100 host system clock cycles per emulated clock cycle for the entire home computer (in reality it's probably worse).
Such 'vertically sliced' emulation code is also full of conditional branches which put a heavy burden on the host CPU branch predictor.
...and that's how a theoretical 1000x speedup suddenly becomes a 10x speedup, it's not the CPU emulation (this is usually cheap) but the rest of the emulated system which can be expensive.
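To put rough numbers on it (a back-of-the-envelope sketch using the estimates above, not a measurement):

    emulated_hz = 2_000_000                 # ~2 MHz system clock
    host_cycles_per_emulated = 100          # ~10..20 per chip, several chips
    host_hz_needed = emulated_hz * host_cycles_per_emulated    # 200 MHz
    print(2_400_000_000 / host_hz_needed)   # ~12x headroom on a 2.4 GHz core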
Different emulators use all sorts of tricks and shortcuts, but usually with tradeoffs like less precise emulation, or less 'compartmentalized' code.
PS: case in point this is just the top-level per-cycle function in my C64 emulator, which in turn calls per-cycle-functions in each of the chip emulators (which may each be just as much code):
https://github.com/floooh/chips/blob/9a7f6d659b5d1bbf72bc8d0...
I'm trying to strike a balance between 'purity' (e.g. an entire emulated system can be 'plugged together' by wiring together chip emulators, just like one would build a real 8-bit computer on a breadboard), while still being fast enough to comfortably run in realtime even in browsers (https://floooh.github.io/tiny8bit/c64.html).
It's definitely possible to implement a faster cycle-correct C64 emulator with a less 'pure' architecture, but it's quite hard to do meaningful optimizations while keeping that same 'virtual breadboard' architecture.
...considering all the code that runs per cycle it's actually amazing how fast modern CPUs are :)
that's an excellent point, i hadn't considered that they might be running this program in basic on a cycle-correct emulator of the bbc micro. thank you very much for explaining
The Acorn Electron is a very different design, and the BBC Micro runs measurably faster (particularly in this case, due to the high-res graphics). The bbcmic.ro emulation will almost certainly be pretty much exactly the same speed as a real BBC Micro.
The Firefox profiler suggests it's spending most of its time doing nothing, waiting for the next Window.requestAnimationFrame.
The rocket ship icon seems to have disappeared. Maybe I used it too much, or maybe the site owner removed it? (was it serverside and overloaded?)
i had that problem when i visited via a saved url in tinyurl.com
https://www.pouet.net/prod.php?which=78045
Incredible raytraced demo in 256 bytes.
https://www.pouet.net/prod.php?which=53871
https://www.youtube.com/watch?v=36BPql6Nl_U
Incredible raytraced demo in 128 bytes. (An underwater journey through a Menger-sponge fractal)
last year i wrote a tetris in arm assembly https://asciinema.org/a/622461 and was pleased to get it down below 1024 bytes
then i looked and rrrola's 4is256 https://www.pouet.net/prod.php?which=29286 is a working tetris in under 256 bytes, and it's better than mine because it has colors and scoring. i could blame a little bit of inflation on arm being 32-bit rather than 16-bit, but not 4×
The upper limit of instruction density on x86 is much higher than any RISC.
Particularly in the case of this x86 ray tracer, because the core of it is written using the stack-based x87 floating point instructions.
in general i find that stack-based instruction sets are more compact than register-based or belt-based instruction sets, despite using almost twice the number of instructions to get anything done, because the individual instructions are smaller. in smalltalk or the jvm or emacs lisp bytecode, most instructions are a single byte, though occasionally you'll have a jump or something that has a following operand byte. (python is much worse at this, because python was never meant to run on 16-bit machines.) on the mup21, each instruction is 5 bits; on the x18, each instruction averages 4½ bits
so, in the space of a single 16-bit thumb or rv32c instruction, you can fit 1.8–2.8 instructions, and that makes a big difference
i compiled some code for rv32ec. here are 8 instructions of it occupying 22 bytes. what would this look like in a stack bytecode?

    38: 00158693  addi a3,a1,1          a1 1 +        3 bytecodes
    3c: 43d8      lw a4,4(a5)           a5 4 + @      4
    3e: 08e6fa63  bgeu a3,a4,d2 <.L15>  < if <.L15>   3 bytecode bytes including jump target
    42: 439c      lw a5,0(a5)           a5 @          2
    44: 058a      slli a1,a1,0x2        a1 2 lshift   3
    46: 7179      addi sp,sp,-48        literal 48 stackframe  3
    48: 97ae      add a5,a5,a1          +             1
    4a: 0007a303  lw t1,0(a5)           @             1

we can see that in a stack instruction set this would use about 20 bytes, barely less, because the instructions being smaller is offset by their being more numerous. on the other hand, some of the code is occupied with doing things like allocating a stack frame, which usually isn’t necessary on a stack machine

but as far as i can tell on the x87 (which i have never programmed, i'm just going by p. a-1 (176/258) et seq. of http://bitsavers.trailing-edge.com/components/intel/80386/23...) all the instructions are at least two bytes, so i don't see where you get any extra code density
for what it's worth, the subroutine that the above was taken from compiles to 62 instructions and 156 bytes for rv32ec, 61 instructions and 189 bytes for amd64, and 52 instructions and 138 bytes for arm cortex-m4. i'll be compiling it for my own stack-based virtual machine this year but i don't have even a prototype compiler backend yet
There was a Usenix paper examining this.
https://www.usenix.org/legacy/events%2Fvee05%2Ffull_papers/p... [2005]
(Hey, I seem to remember that an Anton Ertl posts to comp.compilers.)
Spoiler: they claim that with their sophisticated translation from stack to register code, they eliminated 47% of the instructions, and the resulting code is still around 25% larger than the byte code. The size advantage goes to stack-based byte code, but it may not necessarily be as large as you might think.
So more or less in line with your findings or intuition?
hey, thanks! on skimming it looks like the stack-based bytecode they were comparing was jvm bytecode, which is normally one byte per instruction, no?
that hasn't been my experience with instructions i've written myself, but then again, i'm not rrrola
You could try using Thumb instructions, they are 16 bit.
i was, but i wasn't talking about the size of the instruction word, but the size of the registers
Can anybody provide de-obfuscated/minified source so those of us who understand the principle (tracing rays of light around a scene) can understand the maths?
here's my attempt
i'm interested to hear what i got wrong

    10 REMĀĘĆĞ$Č*Ēĉ!ăě-ĕ'ď
       rem dithering table in comment above consulted by GCOL line?
       rem raytracer from https://bbcmic.ro/?t=9ctpk by https://mastodon.me.uk/@coprolite9000
       rem deobfuscated and possibly broken by kragen javier sitaker (untested)
    20 MODE 1
       VDU 5
    30 FOR N = 8 TO 247 : rem loop over rows (n) and columns (m) of pixels
       FOR M = 0 TO 319
    40 X = 0 : rem three-dimensional vector
       Y = -.1
       Z = 3
       U = (M - 159.5)/160 : rem xform screen coords to direction vector
       V = (N - 127.5)/160
       W = 1/SQR(U*U + V*V + 1) : rem normalize direction vector
       U = U * W
       V = V * W
       I = SGN U : rem derive coordinate of sphere center (i, i, 1)?
       G = 1
    50 E = X - I : rem beginning of do-while loop
       F = Y - I
       P = U*E + V*F - W*Z
       D = P*P - E*E - F*F - Z*Z + 1
       IF D <= 0 then goto 60
       T = -P - SQR D
       IF T <= 0 then goto 60
       X = X + T*U
       Y = Y + T*V
       Z = Z - T*W
       E = X - I
       F = Y - I
       G = Z
       P = 2*(U*E + V*F - W*G)
       U = U - P*E
       V = V - P*F
       W = W + P*G
       I = -I : rem if we are reflecting then check other sphere now
       GOTO 50 : rem end of do-while loop, we have our escape from the scene geometry
       rem color according to y-component of direction vector to get sky gradient
       rem if instead y component is negative, make a checkerboard of pieces of sky
    60 IF V < 0 then P = (Y + 2)/V : V = -V*((INT(X - U*P) + INT(Z - W*P) AND 1)/2 + .3) + .2
    70 GCOL 0, 3 - (48*SQR V + ?(PAGE + 5 + M MOD 4 + N MOD 4*4)/3) DIV 16
    80 PLOT 69, 4*M, 4*N
       NEXT,

see also https://news.ycombinator.com/item?id=39026517 and https://news.ycombinator.com/item?id=39026495
> rem if instead y component is negative, make a checkerboard of pieces of sky
I'd maybe phrase this a bit differently – in both cases it uses the v coordinate to calculate the shade but whereas the sky is a simple gradient, the plane has a "ground fog" type effect that can be adjusted with those +.3 and +.2 magic constants.
i was trying to figure out how the ground fog happens, and i still don't
The value of v corresponds to how far away the intersection point is. At the horizon v=0 so the color value is just constant 0.2. The closer to the camera, the more the checker value (0 or 1) contributes. The exact values are probably chosen experimentally, but note the important property that -v*(checker/2+0.3)+0.2 is always between 0 and 1. Note also that 1 means black and 0 white here, as ultimately the value is negated at line 70.
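You can sanity-check that bound quickly (v is the y-component of a unit vector, so -v is in (0, 1] for downward rays):

    for checker in (0, 1):
        for k in range(1, 101):
            minus_v = k / 100                            # -v in (0, 1]
            shade = minus_v * (checker / 2 + 0.3) + 0.2  # line 60's formula
            assert 0.0 <= shade <= 1.0                   # 1 = black, 0 = white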
This effort is truly specular!
This is interesting and serves as a potential "hook" for getting people interested in coding. I understand the BASIC appeal!
In something like python, an analogous thing is turtle graphics. Is there something more similar (but still very basic) in python for something like this? A mini graphical language?
Hmm, something like graphics.py maybe? https://mcsp.wartburg.edu/zelle/python/ppics2/code/graphics....
Yeah, something like that. It's hard to be both expressive and useful.
This is a small set of primitives around a tkinter canvas. I suppose one could expose a small set of primitives around pygame too.
I have to think hard about what exactly I would want.
The dither effect looks really neat on a ray trace; then I remembered that's just your average newspaper photo. Still, it would be cool to see it in a game.
Check out the game Return of the Obra Dinn for exactly this.
Just curious: how fast could the same thing be if someone who knows the hardware really well wrote it in assembly?
There’s a bunch of things you can do. Replacing all the square roots with a lookup table might be doable, and you could make the final drawing faster by not using the generic plot routines.
You don’t have a lot of memory to play with, though, as you only have 32K to start with and the frame buffer will use 20K of that, so I’m not sure how good a lookup table you could make for SQR.
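A sketch of the kind of table that might fit, assuming you normalize inputs into [0, 1) and interpolate between entries (Python for clarity - the sizes are the point, not the language):

    import math

    TABLE = [math.sqrt(i / 256) for i in range(257)]    # 257 entries

    def fast_sqrt(x):               # x in [0, 1); scale and unscale outside
        idx = int(x * 256)
        frac = x * 256 - idx
        return TABLE[idx] + frac * (TABLE[idx + 1] - TABLE[idx])

    # at BBC BASIC's 5 bytes per float that's ~1.3 KB - affordable, but a
    # finer table eats quickly into what's left after the 20K frame buffer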
In a subthread, I got nerdsniped to figure out how this works, and for fun transcribed it ~faithfully to Rust – this one just outputs ASCII art instead of plotting pixels. You'll want to zoom out your browser as far as you can:
https://www.rustexplorer.com/b#Zm4gbWFpbigpIHsKICAgIC8vIHJvd...
Nice! Is there commented source code?
this says `goto 50` but there are no line numbers
If you go to the source link - https://bbcmic.ro/?t=9ctpk - then you can see the numbers with the lines. They increment in steps of 10 (which IIRC you could override manually with numbers in-between).
If you're playing with this on a BBC micro emulator, the best way to get this code into the emulator is to download the disk image from the above link. The actual basic program is named "PROGRAM" on the disc. List the code by typing LIST20, (the computer will lock up if you try to list line 10).
If you just want to see the result, type MODE 1 and then press return, then type *LOAD SCREEN and press return again.
thanks! i didn't see that link. but i think the 432 bytes leaves the line numbers out, so it should actually be 448 bytes
I guess that's a philosophical question.
In most home computer BASIC interpreters, the source code (including line numbers) never exists as text in memory or on tape/disk, only as binary tokens. The only place where the original input text exists is a small line buffer which gets translated into binary token form when pressing Enter/Return.
The LIST command translates the token form back into human readable text.
And if you use 'AUTO' you also never type any line numbers... so are they actually part of the source code? Is there even 'source code' when the original source is never stored as text anywhere? Is the 'source code size' then the size of the original text (which actually never exists as a whole), or the size of the binary token representation in memory and on tape/disk? :)
sure, you might want to count the tokenized representation instead of the printed representation, so maybe the real number is something smaller than 432, but it contains the line numbers too. you definitely don't want to count the 432 bytes of printed representation and leave out the 16 bytes of line numbers because the program won't run without them
How do you speculate GOTO works if the tokenized program doesn’t have line numbers? Or if I write

    5 REM MY COOL PROGRAM
    10 PRINT "HELLO"
    20 GOTO 10

and then later I write

    10 PRINT "DIE BART DIE"

how does it know to replace the right line?

The line number is stored in the program, just not as text. It's preparsed into a (usually 16-bit) integer as part of the 'tokenization' of the input line buffer.
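A toy model of why the replacement works (the byte-level details are invented for illustration; real BASICs differ in the exact layout):

    program = {}                            # keyed by the pre-parsed line number

    def tokenize(body):
        # stand-in: a real interpreter swaps keywords for one-byte tokens here
        return body.encode("ascii", "replace")

    def enter_line(text):
        num, _, body = text.lstrip().partition(" ")
        program[int(num)] = tokenize(body)  # same key -> old line replaced

    enter_line('10 PRINT "HELLO"')
    enter_line('10 PRINT "DIE BART DIE"')   # silently overwrites the first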
i don't think that flohofwoe was claiming the line numbers weren't stored, just that the textual representation with line numbers maybe isn't the correct measure
The robot automatically adds line numbers in increments of 10s.