I want a good parallel computer

raphlinus.github.io

233 points by raphlinus a month ago


deviantbit - a month ago

"I believe there are two main things holding it back."

He really science’d the heck out of that one. I’m getting tired of seeing opinions dressed up as insight—especially when they’re this detached from how real systems actually work.

I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with. There’s a reason it didn’t survive.

What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on. They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences. That’s how you end up with security holes, random crashes, and broken multi-tasking. There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.

I will take how things are today over how things used to be in a heartbeat. I really believe I need to spend two weeks requiring students to write code on an Amiga, with all of their programs running at the same time. If any one of them crashes, they all fail my course. A newfound appreciation may flourish.

grg0 - a month ago

The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths. It's an entirely moronic programming model to be using in 2025.

- You need to compile shader source/bytecode at runtime; you can't just "run" a program.

- On NUMA/discrete, the GPU cannot just manipulate the data structures the CPU already has; gotta copy the whole thing over. And you better design an algorithm that does not require immediate synchronization between the two.

- You need to synchronize data access between CPU-GPU and GPU workloads.

- You need to deal with bad and confusing APIs because there is no standardization of the underlying hardware.

- You need to deal with a combinatorial turd explosion of configurations. HW vendors want to protect their turd, so drivers and specs are behind fairly tight gates. OS vendors also want to protect their turd and refuse even the software API standard altogether. And then the tooling also sucks.

What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does. But maybe that is an inherently crappy architecture for reasons that are beyond my basic hardware knowledge.
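Roughly the model I mean, sketched with plain OS threads in Rust (purely illustrative; the function and names are made up): one shared address space, one language, no shader compiler, no staging copies.

    use std::thread;

    // One address space, one language: each "worker" mutates its own disjoint
    // slice of a shared buffer. No runtime shader compilation, no staging
    // buffers, no explicit CPU<->GPU copies.
    fn scale_in_place(data: &mut [f32], factor: f32) {
        let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
        let chunk = ((data.len() + workers - 1) / workers).max(1);
        thread::scope(|s| {
            for slice in data.chunks_mut(chunk) {
                s.spawn(move || {
                    for x in slice {
                        *x *= factor;
                    }
                });
            }
        }); // the scope joins every worker before returning
    }

    fn main() {
        let mut data = vec![1.0_f32; 1 << 20];
        scale_in_place(&mut data, 2.0);
        assert!(data.iter().all(|&x| x == 2.0));
    }

The question is whether hardware could give you hundreds of such workers without giving up the shared, coherent memory that makes this sketch trivial.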

IshKebab - a month ago

Having worked for a company that made one of these "hundreds of small CPUs on a single chip" products, I can tell you now that they're all going to fail, because the programming model is too weird and nobody will write software for them.

Whatever comes next will be a GPU with extra capabilities, not a totally new architecture. Probably an nVidia GPU.

armchairhacker - a month ago

> The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?

What other workloads would benefit from a GPU?

Computers are so fast that in practice, many tasks don't need more performance. If a program that runs those tasks is slow, it's because that program's code is particularly bad, and the solution to make the code less bad is simpler than re-writing it for the GPU.

For example, GUIs have responded to user input with imperceptible latency for over 20 years. If an app's GUI feels sluggish, the problem is that the app's actions and rendering aren't on separate coroutines, or the action's coroutine is blocking (maybe it needs to be on a separate thread). But the rendering part of the GUI doesn't need to be on a GPU (any more than it is today; I admit I don't know much about rendering), because responsive GUIs exist today, some even written in scripting languages.

In some cases, parallelizing a task intrinsically makes it slower, because the sequential operations required for coordination mean there are more forced-sequential operations in total. In other cases, a program spawns 1000+ threads that only run on 8-16 processors; it would be faster if it spawned fewer threads, since it would still keep every processor busy.
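As a sketch of that second case (illustrative only, plain Rust, names made up): spawn one worker per hardware thread and have them pull tasks from a shared counter, instead of one OS thread per task.

    use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
    use std::thread;

    // One worker per hardware thread pulling task indices from a shared
    // counter, instead of 1000+ OS threads contending for 8-16 cores.
    fn process_all(tasks: &[u64]) -> u64 {
        let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(8);
        let next = AtomicUsize::new(0);
        let total = AtomicU64::new(0);
        thread::scope(|s| {
            for _ in 0..workers {
                s.spawn(|| loop {
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= tasks.len() {
                        break;
                    }
                    // Stand-in for real per-task work.
                    total.fetch_add(tasks[i] * tasks[i], Ordering::Relaxed);
                });
            }
        });
        total.load(Ordering::Relaxed)
    }

    fn main() {
        let tasks: Vec<u64> = (0..10_000).collect();
        println!("sum of squares: {}", process_all(&tasks));
    }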

I do think GPU programming should be made much simpler, so this work is probably useful, but mainly to ease the implementation of tasks that already use the GPU: real-time graphics and machine learning.

morphle - a month ago

I haven't yet read the full blog post, but so far my response is that you can have this good parallel computer. See my previous HN comments from the past months on building an M4 Mac mini supercomputer.

For example, reverse engineering the Apple M3 Ultra GPU and Neural Engine instruction sets, the IOMMU, and the page tables that prevent you from programming all the processor cores in the chip (146 cores to over ten thousand, depending on how you delineate what a core is), and then writing your own abstract-syntax-tree-to-assembly compiler for these undocumented cores, will unleash at least 50 trillion operations per second. I still have to benchmark this chip and make the roofline graphs for the M4 to be sure; it might be more.

https://en.wikipedia.org/wiki/Roofline_model

dekhn - a month ago

There are many intertwined issues here. One of the reasons we can't have a good parallel computer is that you need to get a large number of people to adopt your device for development purposes, and they need a large community of people who can run their code. Great projects die all the time because a slightly worse but more ubiquitous technology prevents new approaches from flowering. There are economies of scale that feed back into ever-improving iterations of existing systems.

Simply porting existing successful codes from CPU to GPU can be a major undertaking, and if there aren't any experts who can write something that drives immediate sales, a project can die on the vine.

See for example the Cray MTA (https://en.wikipedia.org/wiki/Cray_MTA). When I was first asked to try this machine, it was pitched as "run a million threads; the system will context switch between threads when they block on memory and run them when the memory is ready". It never really made it on its own as a supercomputer, but lots of the ideas made it into GPUs.

AMD and others have explored the idea of moving the GPU closer to the CPU by placing it directly onto the same memory crossbar. Instead of the GPU connecting to the PCI express controller, it gets dropped into a socket just like a CPU.

I've found the best strategy is to target my development at what high-end consumers will be buying in 2 years. This is similar to many games, which launch with terrible performance on the fastest commercially available card, then run great 2 years later when the next generation of cards arrives ("Can it run Crysis?").

Animats - a month ago

Interesting article.

Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D. Now, 3D renderers, we need all the help we can get.

In this context, a "renderer" is something that takes in meshes, textures, materials, transforms, and objects, and generates images. It's not an entire game development engine, such as Unreal, Unity, or Bevy. Those have several more upper levels above the renderer. Game engines know what all the objects are and what they are doing. Renderers don't.

Vulkan, incidentally, is a level below the renderer. Vulkan is a cross-hardware API for asking a GPU to do all the things a GPU can do. WGPU for Rust is a wrapper that extends that concept to cross-platform use (Mac, Android, browsers, etc.).

It seems like you should be able to write a general 3D renderer that works in a wide variety of situations, but that does not work well in practice. I wish Rust had one. I've tried Rend3 (abandoned), and looked at Renderling (in progress), Orbit (abandoned), and Three.rs (abandoned). They all scale up badly as scene complexity increases.

There's a friction point in design here. The renderer needs more info to work efficiently than it needs to just draw in a dumb way. Modern GPUs are good enough that a dumb renderer works pretty well, until the scene complexity hits some limit. Beyond that point, problems such as lighting requiring O(lights * objects) time start to dominate. The CPU driving the GPU maxes out while the GPU is at maybe 40% utilization. The operations that can easily be parallelized have been. Now it gets hard.

In Rust 3D land, everybody seems to write My First Renderer, hit this wall, and quit.

The big game engines (Unreal, etc.) handle this by using the scene graph info of the game to guide the rendering process. This is visually effective, very complicated, prone to bugs, and takes a huge engine dev team to make work.

Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
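A rough sketch of that interface in Rust, just to make the shape concrete (every name here is hypothetical, not from any existing renderer):

    // The renderer asks the caller, which owns the spatial data structures,
    // to enumerate only the objects within range of each light, instead of
    // scanning every object for every light.
    #[derive(Clone, Copy)]
    pub struct LightId(pub u32);
    #[derive(Clone, Copy)]
    pub struct ObjectId(pub u32);

    pub trait SceneQueries {
        /// Calls `visit` once for each object within range of `light`.
        fn for_each_object_in_range(&self, light: LightId, visit: &mut dyn FnMut(ObjectId));
    }

    pub fn shade_lights(scene: &dyn SceneQueries, lights: &[LightId]) {
        for &light in lights {
            // The caller's spatial structure (BVH, grid, ...) does the pruning,
            // so this is no longer a blind O(lights * objects) pass.
            scene.for_each_object_in_range(light, &mut |obj| {
                // ... accumulate this light's contribution to `obj` here ...
                let _ = (light.0, obj.0);
            });
        }
    }

Whether the callback should yield object IDs, bounding volumes, or something richer is exactly the open design question.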


ip26 - a month ago

> I believe there are two main things holding it back. One is an impoverished execution model, which makes certain tasks difficult or impossible to do efficiently; GPUs … struggle when the workload is dynamic

This sacrifice is a purposeful cornerstone of what allows GPUs to be so high throughput in the first place.

bee_rider - a month ago

It is odd that he talks about Larrabee so much, but doesn’t mention the Xeon Phis. (Or is it Xeons Phi?)

> As a general trend, CPU designs are diverging into those optimizing single-core performance (performance cores) and those optimizing power efficiency (efficiency cores), with cores of both types commonly present on the same chip. As E-cores become more prevalent, algorithms designed to exploit parallelism at scale may start winning, incentivizing provision of even larger numbers of increasingly efficient cores, even if underpowered for single-threaded tasks.

I’ve always been slightly annoyed by the concept of E-cores, because they are so close to what I want, but not quite there… I want, like, throughput cores. Let’s take E-cores, give them their AVX-512 back, and give them higher-throughput memory. Maybe try to pull the Phi trick of less OoO capability but more threads per core. Eventually the goal should be to come up with an AVX unit so big it kills iGPUs, haha.

Retr0id - a month ago

Something that frustrates me a little is that my system (apple silicon) has unified memory, which in theory should negate the need to shuffle data between CPU and GPU. But, iiuc, the GPU programming APIs at my disposal all require me to pretend the memory is not unified - which makes sense because they want to be portable across different hardware configurations. But it would make my life a lot easier if I could just target the hardware I have, and ignore compatibility concerns.

svmhdvn - a month ago

I've always admired the work that the team behind https://www.greenarraychips.com/ does, and the GA144 chip seems like a great parallel computing innovation.

api - a month ago

I implemented some evolutionary computation stuff on the Cell BE in college. It was a really interesting machine and could be very fast for its time but it was somewhat painful to program.

The main cores were PPC and the Cell cores were… a weird proprietary architecture. You had to write kernels for them like GPGPU, so in that sense it was similar. You couldn’t use them seamlessly or run mixed workloads easily.

Larrabee and Xeon Phi are closer to what I’d want.

I’ve always wondered about truly many-core CPUs too. How many tiny ARM32 cores could you put on a big modern 5nm die? Give each one local RAM and connect them with an on-die network fabric. That’d be an interesting machine for certain kinds of workloads. It’d be like a 1990s or 2000s era supercomputer on a chip, but with much faster clocks, RAM, and network.

scroot - a month ago

When this topic comes up, I always think of uFork [1]. They are even working on an FPGA prototype.

[1] https://ufork.org/

throwawayabcdef - a month ago

The AIE arrays on Versal and Ryzen with XDNA are a big grid of cores (400 in an 8 x 50 array) that you program with streaming work graphs.

https://docs.amd.com/r/en-US/am009-versal-ai-engine/Overview

Each AIE tile can stream 64 Gbps in and out and perform 1024-bit SIMD operations. Each tile shares memory with its neighbors, and the streams can be interconnected in various ways.

andrewstuart - a month ago

The AMD Strix Halo APU is a CPU with a very powerful integrated GPU.

It’s faster at AI on large models than an Nvidia RTX 4090, because 96GB of the 128GB can be allocated to the GPU memory space. This means it doesn’t have the same swapping/memory thrashing that a discrete GPU experiences when a model doesn’t fit in VRAM.

16 CPU cores and 40 GPU compute units sounds pretty parallel to me.

Doesn’t that fit the bill?

Quis_sum - a month ago

Clearly the author never worked with a CM2 - I did, though. The CM2 was more like a co-processor which had to be controlled by a (for that age) rather beefy Sun workstation/server. The program itself ran on the workstation, which then sent the data-parallel instructions to the CM2. The CM2 was an extreme form of SIMD design (that is why it was called data parallel). You worked with a large rectangular array (I cannot recall up to how many dimensions) whose size had to be a multiple of the number of physical processors (in your partition). All cells typically performed exactly the same operation. If you wanted to perform an operation on a subset, you had to "mask" the other cells (which were essentially idling during that time).
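In today's terms, the masking trick looks roughly like this (a scalar Rust sketch, purely illustrative): every cell evaluates the same instruction stream, and a mask decides whether the result is kept, so the masked-off processors do no useful work.

    // Every "cell" computes the same thing; the mask decides whether the
    // result is written back. Masked-off lanes are the idle time mentioned above.
    fn masked_add(cells: &mut [f32], addend: f32, mask: &[bool]) {
        for (cell, &active) in cells.iter_mut().zip(mask) {
            let candidate = *cell + addend; // computed for every cell
            if active {
                *cell = candidate;          // kept only where the mask is set
            }
        }
    }

    fn main() {
        let mut cells = vec![1.0_f32, 2.0, 3.0, 4.0];
        let mask = vec![true, false, true, false];
        masked_add(&mut cells, 10.0, &mask);
        assert_eq!(cells, vec![11.0, 2.0, 13.0, 4.0]);
    }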

That is hardly what the author describes.

sitkack - a month ago

This essay needs more work.

Are you arguing for a better software abstraction, a different hardware abstraction or both? Lots of esoteric machines are name dropped, but it isn't clear how that helps your argument.

Why not link to Vello? https://github.com/linebender/vello

I think a stronger essay would, by the end, give the reader a clear view of what "Good" means, how to decide whether one machine is closer to Good than another, and why.

SIMD machines can be turned into MIMD machines. Even hardware problems still need a software solution. The hardware is there to offer the right affordances for the kinds of software you want to write.

Lots of words here are in the eye of the beholder. We need a checklist, or that Good parallel computer won't be built.

SergeAx - a month ago

The thing is that most of our everyday software will not benefit from parallelism. What we really have a use for is concurrency, which is a totally different beast.

Wumpnot - a month ago

I had hoped the GPU API would go away, and the entire thing would become fully programmable, but so far we just keep using these shitty APIs and horrible shader languages.

Personally, I would like to use the same language I write the application in to write the rendering code (C++). Preferably with shared memory, not some separate memory system that takes forever to transfer anything to. Something along the lines of the new AMD 360 Max chips, but with graphics written in explicit C++.

muziq - a month ago

I was always fascinated by the prospect of the 1024-core Epiphany-V from Parallella: https://parallella.org/2016/10/05/epiphany-v-a-1024-core-64-... But it seems whatever the DARPA connection was has led to it not being for scruffs like me, and it is likely powering god knows what military systems.

mikewarot - a month ago

Any computing model that tries to parallelize von Neumann machines - that is, anything with program counters or address spaces - just isn't going to scale.

nickpsecurity - a month ago

There are designs like Tilera and Phalanx that have tons of cores. Then there were the NUMA machines with 128-256 sockets and coherent memory in one box. The SGI machines let you program them as if they were one machine. Languages like Chapel were designed to make parallel programming easier.

Making more things like that, at the lowest possible unit prices, could help a lot.

amelius - a month ago

Isn't the ONNX standard already going in the direction of programming a GPU using a computation graph? Could it be made more general?

0xbadcafebee - a month ago

If we had distributed operating systems and SSI (single system image) kernels, your computer could use the idle cycles of other computers [that aren't on battery power]. People talk about a grid of solar houses, but we could've had personal/professional grid computing like 15 years ago. Nobody wanted to invest in it, I guess because chips kept getting faster.

nromiun - a month ago

What about unified memory? I know these APUs are slower than traditional GPUs, but it still seems like the simpler programming model would be worth it.

The biggest problem is that most APUs don't even support full unified memory (system SVM in OpenCL). From my research, only the Apple M series, some Qualcomm Adreno parts, and some AMD APUs support it.

joshu - a month ago

Huh. The Blelloch mentioned in the Thinking Machines section taught my parallel algorithms class in 1994 or so.

eternityforest - a month ago

I wonder if CDN server applications could use something like this, if every core had a hardware TCP/TLS stack and there was a built-in IP router to balance the load, or something like that.

casey2 - a month ago

I think Tim was right. It's 2025, Nvidia just released their 50 series, but I don't see any cards, let alone GPUs.

dragontamer - a month ago

There's a lot here that seems to misunderstand GPUs and SIMD.

Note that raytracing is a very dynamic problem, where the GPU isn't sure whether a ray hits geometry or misses. When it hits, the ray needs to bounce, possibly multiple times.

Raytracing, recursion, dynamic parallelism, whatever - there are various implementations of all of it. It's all there.

Now, the software and compilers aren't ready (outside of specialized situations like Microsoft's DirectX Raytracing, which compiles down to a very intriguing threading model). But what was accomplished with DirectX can be done in other situations.

-------

The Connection Machine is before my time, but there's no way I'd consider that 80s hardware comparable to AVX2, let alone a modern GPU.

The Connection Machine was a 1-bit computer, for crying out loud - just thousands of them in parallel (up to 65,536 in the largest configuration).

Xeon Phi (roughly 70 Intel Atom-class cores) is slower and weaker than a modern 192-core EPYC chip.

-------

Today's machines are better. A lot better than the past machines. I cannot believe any serious programmer would complain about the level of parallelism we have today and wax poetic about historic and archaic computers.

pikuseru - a month ago

No mention of the Transputer :(

Ericson2314 - a month ago

Agreed with the premise here

I have never done GPU programming or graphics, but what feels frustrating looking from the outside is that the designs and constraints seem so arbitrary. They don't feel like they come from actual hardware constraints/problems. It just looks like pure path dependency going all the way back to the fixed-function days, with tons of accidental complexity and half-finished generalizations ever since.
