Here’s the conclusion up front: M5 Max is the most impressive piece of silicon I’ve used in a laptop. It’s more efficient and more performant, theoretically cheaper and more sustainable to manufacture with less waste, an AI powerhouse, and still runs in a laptop that lasts all day. I can’t think of an SoC or chip I’d rather have in a laptop. This is fantastic.
Ok, with the conclusion out of the way, let’s talk about what I mean when I say “theoretically cheaper and more sustainable.” At a pure technical and silicon level, I think that is the most interesting part of M5 Max and M5 Pro.
This is the first Apple silicon SoC that uses what I’d call a proper chiplet design. M1/M2/M3 Ultra were two dies bonded together with an interposer fabric that made two whole SoCs function as one large SoC. M5 Pro/Max are different: they use a CPU-dominant tile and a GPU-dominant tile, combined into a single high-performance package. The CPU tile is shared between M5 Pro and M5 Max, while the GPU tile differs between them, with M5 Max’s being roughly double the size of M5 Pro’s.
For M5 Pro and M5 Max, the real story here is economics, yield, and waste. On advanced nodes, defect density is everything. Every wafer has some number of random defects across it, and the larger each individual die gets, the greater the odds that one of those defects lands somewhere important. That is why very large monolithic chips get so expensive so quickly: more dies fail, more silicon gets thrown away, and the cost of every good die rises. Smaller dies are simply more efficient to manufacture because more of the wafer survives QA and less total silicon is wasted.
That is where chiplets become really compelling. Instead of printing one massive SoC and hoping it comes out clean, you break the design into smaller functional blocks and bond them together later in packaging. Even if the total silicon area is similar, two smaller dies generally yield better than one giant one because each individual die has a lower probability of containing a fatal defect. You then take the packaging hit later, but on a leading-edge node that trade can still be very favorable because the wafer-level economics improve so much. In simple terms, you spend more putting the chip together, but lose less money throwing broken silicon away.
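To make that concrete, here is a toy version of the math using the classic Poisson yield model. The numbers are purely illustrative, not Apple’s or TSMC’s actual defect densities or die sizes:

```python
import math

def poisson_yield(area_mm2, d0):
    """Fraction of good dies under the simple Poisson model: Y = exp(-A * D0)."""
    return math.exp(-area_mm2 * d0)

# Illustrative numbers only -- not real figures for any actual chip or node.
D0 = 0.001            # random defects per mm^2
monolithic_mm2 = 800  # one big monolithic die
tile_mm2 = 400        # same total area, split into two tiles

y_mono = poisson_yield(monolithic_mm2, D0)  # ~44.9% of monolithic dies survive
y_tile = poisson_yield(tile_mm2, D0)        # ~67.0% of individual tiles survive

print(f"monolithic good-die rate: {y_mono:.1%}")
print(f"per-tile good-die rate:   {y_tile:.1%}")
```

Because good tiles are binned independently and only paired up at packaging, roughly 67% of the silicon becomes sellable product instead of roughly 45%, even though the total area is the same. That gap, minus whatever the bonding step itself costs and loses, is the waste reduction.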
What seems especially smart here is how Apple appears to have partitioned the design. M5 Pro and M5 Max look to share the same CPU-dominant tile, meaning the same 18-core CPU block, the same neural engine, and the same media blocks. The real differentiation seems to be the GPU-dominant tile, with M5 Max getting the full version and M5 Pro using what appears to be a smaller or effectively cut-down version of the same broader design. That matters because Apple is not building two completely different large chips from scratch. They can reuse the same CPU tile across both SKUs, keep the architecture work more unified, and improve yields by being more flexible with GPU binning. That lowers manufacturing cost, validation cost, and R&D cost all at once.
The trade-off is packaging, because Apple is paying for a much fancier way to assemble all of this. They are using SoIC-MH here, which is a more advanced and more expensive packaging technology than the InFO-style approach used before. Hybrid bonding multiple silicon tiles together, plus keeping unified memory on-package, is not cheap. But if the savings on die cost and yield are large enough, the total chip can still end up cheaper overall. In other words, Apple is spending more on the back end of manufacturing in exchange for saving more on the front end where the real waste happens.
There is also a very real thermal and system-level efficiency advantage to doing it this way. On a traditional monolithic design, the CPU, GPU, cache, and other blocks all share the same physical slab of silicon, so when one area gets hot it naturally bleeds into the others. Push the CPU hard enough and the GPU starts to inherit that thermal load, and vice versa. By separating the CPU and GPU into different dominant tiles, that coupling appears to be reduced quite a bit. In my testing, that seems to be why I can push both the CPU and GPU much harder at the same time than I expected. It is not just about cost savings or yield, it is also about thermal behavior and power delivery. Apple seems to be getting higher peak performance, better sustained behavior, and better efficiency from the same architectural decision, which is exactly why I think this is the most interesting part of M5 Pro and M5 Max.
Before I get into my observations, I want to mention a few tweaks Apple made to the CPU design. Across the M5 family, there are three core classes: super cores, performance cores, and efficiency cores. The regular M5 still uses efficiency cores, while M5 Pro and M5 Max appear to drop them entirely and instead lean on only super cores and performance cores. On M5 Pro and M5 Max, that means 6 super cores and 12 performance cores, with no dedicated efficiency cores at all. That distinction matters because the core strategy is no longer the same across the whole family.
Super cores are designed to maximize single-threaded workloads, getting them done as quickly as possible with maximum performance. Efficiency cores are designed for background work and non-latency-sensitive tasks, where the goal is to run at the lowest power possible while still delivering good enough performance. Performance cores sit in the middle and are really meant for sustained multi-threaded work, the kinds of tasks that use multiple cores in unison to complete a single workload.
Some apps, like Google Chrome, may use multiple cores simultaneously, but these are often multiple single-threaded tasks. Something like Adobe Premiere using the CPU, on the other hand, will use multiple cores to process a single workload, which is truly multi-threaded. This is an important distinction when discussing efficiency, since cores and silicon in general operate on efficiency curves. There are points of diminishing returns when you push more power into a core designed for efficiency, and at the lower end, running a super core at low power can in some cases be just as efficient on bursty workloads as using a dedicated efficiency core. Back in October of last year, I started asking silicon vendors whether we were likely to see more of this design going forward, with big cores running at low power instead of traditional efficiency cores. I think we are starting to see that with M5 Pro and M5 Max.
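A toy model shows why those curves behave that way. Dynamic power scales roughly with frequency times voltage squared, and voltage has to rise with frequency, so performance per watt falls off as you push a core harder. The constants below are made up purely to illustrate the shape of the curve:

```python
# Toy dynamic-power model: P ~ f * V^2, with V rising roughly linearly with f.
# All constants are illustrative, not measured from any real core.
def perf_per_watt(freq_ghz, v_base=0.7, v_slope=0.15):
    voltage = freq_ghz * v_slope + v_base
    power = freq_ghz * voltage**2   # arbitrary units
    return freq_ghz / power         # performance scales ~linearly with frequency

for f in (1.0, 2.0, 3.0, 4.0):
    print(f"{f:.1f} GHz -> perf/W {perf_per_watt(f):.2f}")
```

Efficiency per watt drops monotonically as the clock climbs, which is why a big core held at the low end of its curve can be a credible substitute for a dedicated efficiency core.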
That architectural shift makes the idle behavior especially interesting: on M5 Max, despite having no efficiency cores, I am seeing sub-2W package power at light desktop idle. This is just my Mac sitting next to me while I write this, with my email, Chrome, and a few apps open but not actively used. That is an impressively low figure. It’s hard to say what the number would have been with e-cores, but I’d wager this approach is actually more efficient. That’s likely why Apple rates battery life on the M5 Max MacBook Pro one hour longer than M4 Max, and why their sustainability report shows full-system idle power down 500 mW, from 7.6 W to 7.1 W.
When you start to push the SoC, you can see brief peak draw hit around 80W on the GPU and around 80W on the CPU, though those are short spikes rather than sustained behavior. What’s most interesting is that I’ve seen the M5 Max CPU try to sustain around 75W, but as the SoC warms up, the power draw slowly drops and stabilizes around 50W on the CPU/package, with peaks that go a little higher. I noticed this mostly in the Cinebench multi-threaded benchmark. If properly cooled, i.e. in a room that’s not 74 degrees like mine is right now, you can likely sustain the higher wattage for longer. What I think is interesting is that the peak power does seem higher than M4 Max, but the sustained power is lower while still delivering better performance. **I believe M5 Max has the best Apple silicon performance and performance per watt.**
In terms of that performance, we are looking at roughly a 4,300 single-core score and between 28,000 and 30,000 multi-core in Geekbench 6.6.0. This hits around 66W peak power draw and, again, is bursty by the nature of Geekbench. I’m sure if I ran this in a cooler room, the scores would be even higher.
Below is just a chart of a few of the benchmarks I think are interesting.
| Benchmark | M5 Max | M4 Max | M3 Ultra |
|---|---|---|---|
| Geekbench 6.6 Single | 4246 | 3895 | 3082 |
| Geekbench 6.6 Multi | 28728 | 25984 | 27157 |
| Cinebench Single | 738 | 676 | 573 |
| Cinebench Multi | 8413 | 7829 | 12082 |
| Cinebench GPU | 93577 | 68590 | 83865 |
Now, for AI performance. There are two parts to benchmarking AI: prefill and decode. Prefill is the compute-intensive prompt-processing stage; simply put, it is the time the GPU spends working before generating your first token. Once token generation starts, you move to decode, which is limited by memory bandwidth. For M5 Max, the neural accelerators in the GPU improve prefill performance, while the slightly higher memory bandwidth improves decode.
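The decode side is simple enough to reason about on a napkin: every generated token has to stream roughly all the model weights out of memory, so bandwidth divided by model size gives you a tokens-per-second ceiling. The model size and bandwidth below are hypothetical round numbers, not measurements from any specific machine:

```python
def decode_ceiling_tok_s(model_bytes, bandwidth_bytes_per_s):
    # Each decoded token reads (roughly) every weight once, so decode speed
    # is bounded above by memory bandwidth / model size.
    return bandwidth_bytes_per_s / model_bytes

# Hypothetical example: an 8B-parameter model quantized to 4-bit (~4 GB of
# weights) on a machine with 600 GB/s of unified memory bandwidth.
model_gb = 4
bw_gb_s = 600
print(f"decode ceiling: ~{decode_ceiling_tok_s(model_gb * 1e9, bw_gb_s * 1e9):.0f} tok/s")
```

Real decode speeds land well under this ceiling, but it explains why a modest bandwidth bump moves decode while the GPU neural accelerators move prefill.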
Below are a few comparisons of M3 Ultra and M5 Max running the same workload across both systems. (M4 Max coming soon; these take a while to run.)
While I would want to compare to DGX Spark, it’s hard to do a clean apples-to-apples comparison today because the inference frameworks are not yet aligned and the relevant low-bit data types are not fully mature on macOS. That means some of the workloads and precisions that would make the comparison most useful are either missing, in beta, or not implemented in the same way across platforms. Once int4 support is stable, it should become much easier to run tests directly comparable to DGX Spark fp4-style workloads and get a much clearer read on relative performance. That int4 support [is in beta for macOS 26.4](https://developer.apple.com/documentation/metal/mtltensordatatype/int4), and I’ll do more in-depth testing when it reaches stable. My back-of-napkin math is that you should expect roughly equal AI performance between DGX Spark and M5 Max.
There are some changes to the Apple Neural Engine/NPU, but I’m still working through the low-level details. To try to get a better understanding of it, I decided to give GPT-5.4 in Codex access to the machine and just let it play around, test, and come to a conclusion on it. Here’s what it says. I would treat this as directional analysis rather than something I’m fully validating, but this is the model’s view after using the machine:
> On the M5 Max, the Apple Neural Engine feels less like a generic “AI accelerator” and more like a very fast, fixed-shape dense compute engine. Once the workload is made to look like large FP16 matrix math, it delivers impressive throughput; when the work gets small, dynamic, or dispatch-heavy, it falls off quickly. In practice, that means the ANE is at its best on big, regular blocks of computation such as prefill-style transformer passes, large linear layers, and other fused tensor workloads, and much less compelling for token-by-token decode or anything else that behaves more like matrix-vector math than matrix-matrix math.
>
> That distinction mattered a lot in testing. The original benchmark setup this project inherited from M4-era work understated the M5 Max because it was using shapes that were simply not ideal for the newer chip. After retuning around larger FP16 matmuls, the measured wall-clock peak rose to about 19.9 TFLOPS, versus the project’s older 15.8 TFLOPS M4 reference point. So the headline here is not just that the M5 Max is faster, but that it seems even more sensitive to being fed the right kind of workload: dense, high-occupancy, matmul-heavy shapes are clearly where the hardware wants to live.
>
> The interesting caveat is that raw peak throughput and real application performance are not the same thing. In end-to-end workloads, actual sustained utilization is much lower because compile time, dispatch overhead, CPU coordination, and data movement all matter. That leaves the M5 Max ANE in an interesting spot: as a block-level accelerator, it is excellent, and meaningfully stronger than the older M4-based baseline suggests; as a low-latency engine for tiny, irregular inference steps, it is much less dominant. The clearest way to think about it is that Apple has a very capable on-chip tensor engine here, but one that rewards careful workload shaping far more than the marketing-level “TOPS” number implies.
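For reference, wall-clock TFLOPS figures like those come from simple arithmetic: an n×n matmul costs 2n³ floating-point operations, so you divide by measured time. Here is the same calculation sketched with NumPy on the CPU, purely to show the math; it is not an ANE benchmark, and the number it prints depends entirely on whatever BLAS your machine has:

```python
import time
import numpy as np

def matmul_tflops(n, repeats=5):
    """Measure effective TFLOPS of an n x n matmul: 2*n^3 ops per multiply."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up run so one-time setup cost isn't timed
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = time.perf_counter() - start
    return (2 * n**3 * repeats) / elapsed / 1e12

print(f"measured: {matmul_tflops(2048):.2f} TFLOPS")
```

The same sensitivity the model describes shows up even here: small n under-reports peak throughput because fixed overheads dominate, which is exactly the “feed it the right shapes” effect.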
So, after all of that, we end up exactly where we started. M5 Max is the most impressive piece of silicon I’ve used in a laptop. It is faster, more efficient, seemingly cheaper to manufacture at the die level, and somehow still lives inside a machine that gets all-day battery life. The CPU is absurd, the GPU is absurd, the AI performance is absurd. This is insanely impressive for a laptop, but that should be no surprise for Apple Silicon at this point.
So yes, the conclusion is still the conclusion. M5 Max is the best laptop SoC on the market. I tried to take the scenic route to get there, but unfortunately for the sake of suspense, the answer at the end is the same as the one at the beginning.