Hi all.
I have one, maybe two, questions for you. This question came out of a webinar series on High Performance Computing (HPC) I took part in (the Italy–Germany HPC webinars organised on the Italian side through CNR). I raised this concern there, and my impression was that the other speakers did not share it to the same degree. The room leaned more optimistic than I am. That is exactly why I want to put it to a wider audience: I may be wrong, and I would like to hear where others land.
The concern is precision. Most scientific HPC needs double precision (FP64). In computational fluid dynamics, which is my field, we resolve physical scales spanning many orders of magnitude, and to do that correctly (with very high-order accuracy methods), we need 64-bit floating point.
AI computing does not need this. Training and inference work well at 8-bit (now even at 4-bit). So, the two workloads require different hardware: AI needs many low-precision cores, while science requires strong FP64 capabilities.
The problem is that the vendors follow the AI market because that is where the money is. Comparing on vector FP64 (peak, dense), the recent trend is to hold it flat or lower it, and spend the transistors on low-precision math instead:
- NVIDIA H100: 34 TFLOP/s vector FP64, or 67 with the FP64 tensor-core path. The newer B200 does about 40 vector FP64. Blackwell dropped the dedicated FP64 tensor-core path that Hopper had, and gained around 20 PFLOP/s of FP4 for AI. The Rubin roadmap reportedly cuts FP64 further.
- AMD MI300X: 81.7 TFLOP/s FP64. The newer MI355X does 78.6, below its own predecessor, with the gains all in FP8/FP4 for AI inference.
- Intel has stepped back from a dedicated HPC GPU. Its current HPC silicon, the Max-series (Ponte Vecchio) in Aurora, has no standalone successor. Intel cancelled Falcon Shores as a product in early 2025 and folded its HPC and AI lines into one chip, Jaguar Shores, due around 2026/2027. Intel describes it as serving both AI and HPC, but says it will compete on total cost of ownership rather than peak FLOPS, and has published no FP64 figure.
- Consumer silicon makes the direction plainest. NVIDIA’s N1X, the new Blackwell laptop chip, publishes only AI-precision figures (NVFP4, around 1000 TOPS) and quotes no FP64 at all. Double precision is simply not a design goal there.
So across all three vendors the direction looks the same. The new chips are built for AI, and double precision gets quietly de-prioritized along the way.
There is one strong counter-current. AMD’s MI430X, coming this year, is a deliberate HPC part. AMD claims more than 200 TFLOP/s of FP64, and independent estimates back out around 211 from the Alice Recoque exascale contract, which would be the highest of any GPU so far, while it still carries FP4/FP8 for AI. It will power Alice Recoque, the next European exascale machine, alongside the US Discovery and Germany’s Herder. So a dedicated FP64 line still exists, for now.
But it is one product line, from one vendor, against a whole market moving the other way. That is what I cannot resolve: whether a first-class FP64 hardware line survives, or shrinks to a small premium niche while everything else is optimized for AI.
Two questions for you:
- Do you share this concern, or do you think I am overstating it?
- If you share it, do you already see a way out?
I would be glad to hear how others in the Fortran and HPC community are thinking about this.
Stefano
mhulsen 2
There is also NextSilicon’s Maverick-2 in Sandia’s supercomputer Spectra.
jorgeg 3
Hi Stefano,
This is on my mind constantly. However, I’ve talked to people who work at either NVIDIA or AMD, and they’ve all assured me that FP64 is not going to be dropped like a hot potato. Cards are separating into two lines: the B300, for example, is for AI/inference, while the B200 is going to be for science and HPC.
What is true is that we won’t see the crazy increases in FP64 performance any more: V100 FP64 was 7.9 TFLOP/s, A100 was 19 (??), H200 was 34-ish, and the B200, as you pointed out, only creeps up to 40 TFLOP/s, a smaller step than any of those.
The thing here is that most workloads out there do not use the peak FLOP rate of these cards. Not everything is a DGEMM, and only a limited number of applications can really run at that rate. Most other apps, I feel, are limited by memory bandwidth, which is important to both AI and HPC workflows. AMD, however, seems to be championing FP64 support more than NVIDIA. GPU support for Fortran is still not the best, but it is improving a LOT. I have my do concurrent code, which I can translate via a Python script to omp target, and it works very nicely.
So yeah, I am concerned, but I don’t see it as life-threatening. The thing is to keep developing things that NEED FP64, to force vendors to support it.
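As a rough illustration of the memory-bandwidth point, here is a minimal roofline-style sketch in Python. The 34 TFLOP/s peak is the vector FP64 figure quoted earlier in this thread; the 3.35 TB/s bandwidth and the axpy traffic count are illustrative assumptions, not measurements.

```python
# Roofline sketch: attainable FLOP rate = min(peak, intensity * bandwidth).
# PEAK_FP64 is the vector FP64 figure quoted in this thread; MEM_BW is an
# assumed HBM bandwidth used only for illustration.

PEAK_FP64 = 34e12   # FLOP/s
MEM_BW = 3.35e12    # bytes/s (assumption)


def attainable(intensity, peak=PEAK_FP64, bandwidth=MEM_BW):
    """Upper bound on FLOP/s for a kernel doing `intensity` FLOP per byte moved."""
    return min(peak, intensity * bandwidth)


# FP64 axpy, y = a*x + y: 2 FLOP per element and 3 * 8 bytes of traffic
# (read x, read y, write y), assuming no cache reuse.
axpy_intensity = 2 / 24

# Intensity above which a kernel stops being bandwidth-bound (the ridge point).
ridge = PEAK_FP64 / MEM_BW

if __name__ == "__main__":
    print(f"axpy bound : {attainable(axpy_intensity) / 1e12:.2f} TFLOP/s")
    print(f"ridge point: {ridge:.1f} FLOP/byte")
```

Under these assumptions an axpy-like kernel is capped at less than 1% of the FP64 peak, so a flat or slightly lower peak would not change its runtime; only kernels with an intensity above the ridge point would notice.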
Hope this helps!
hkvzjal 4
Yes, I share it, can’t say if it is overstated or not …
Last year I attended a seminar in Montpellier at CINES, and there was a very nice talk about their supercomputer Adastra https://mumps-solver.org/doc/workshop2025/Hautreux_Slides.pdf (in French). One of the discussions was precisely this point. The seminar happened a few days after the NVIDIA GTC in Paris, where NVIDIA mentioned their plans to go further into lower precision; they showed that AMD is indeed the one championing FP64 HPC loads.
PierU 5
IMO there will still be GPGPUs dedicated to HPC, with FP64 etc. They will just become a kind of “niche” hardware, just like HPC hardware was “niche” 30 years ago and before. We have lived through a (relatively short) window where HPC could almost use consumer hardware (i.e. high-end CPUs and GPUs that were designed for the consumer market, or derived from versions for the consumer market), but it’s over. The hardware dedicated to HPC will probably be more expensive from now on.
6
Indeed, we are currently in a period where AI is the primary “engine” of GPU innovation, and as a result, hardware is being hyper-specialized for that task. While this is fantastic for AI research, it undeniably leaves the traditional scientific computing community in an awkward position where they must either pay a premium for specialized hardware or deal with the limitations of consumer-grade cards that are increasingly “AI-first.”
On the other hand, looking at NVIDIA’s statements in the link I attach, the future may be a little less tragic than it currently appears. It must be said that in the past, those of us who work in scientific computing have often used resources that weren’t intended for our work: we installed Linux on a PlayStation… I hope that this time too, by accident, the hardware will evolve towards something that is useful to us as well.
szaghi 7
Hi, thank you very much. I did not know NextSilicon’s work, it is really interesting.
Stefano
szaghi 8
Hi Jorge,
This helps a lot; having first-hand feedback from the vendors is quite encouraging. As for GPU support for Fortran, that is a whole other problem we struggle with daily; maybe we can discuss it in detail in another topic.
Stefano
jkd2022 9
szaghi 10
Hi,
Thank you, the slides are really interesting. In AMD we have to trust.
Stefano
szaghi 11
Hi PierU,
Indeed, this is my concern: after the high-end CPU-GPU era, in which HPC shared hardware with other applications (where the money is, e.g., gaming, crypto-mining, etc.), HPC is going back into its old corner.
Stefano
szaghi 12
Hi,
Thank you for sharing this link, I did not know it. All of you are more optimistic than me, so I have to change my mind. Anyhow, I still trust AMD’s announcements more than NVIDIA’s.
Cheers.
Stefano
szaghi 13
Hi,
Thank you for sharing these references. I’ll read them soon.
I am already playing with some toy code to explore low-precision representations and compensated algorithms that recover (at least part of) the accuracy, but I do not yet have a strong, evidence-based opinion. Moreover, I do not know the “emulation mode” claimed by NVIDIA, in particular whether it ensures precision and preserves performance (both aspects are important for me).
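For readers who have not met compensated algorithms, a minimal sketch of the simplest one, Kahan summation, might look like the following. It is in Python rather than Fortran, and FP32 is emulated by rounding through `struct`; the `f32` helper and the test data are assumptions made for this illustration, not the toy code mentioned above.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum32(values):
    """Accumulate in emulated FP32, rounding after every addition."""
    s = 0.0
    for v in values:
        s = f32(s + f32(v))
    return s


def kahan_sum32(values):
    """Kahan compensated summation, every operation rounded to FP32."""
    s = 0.0   # running sum
    c = 0.0   # running estimate of the low-order bits lost so far
    for v in values:
        y = f32(f32(v) - c)       # apply the correction to the next term
        t = f32(s + y)            # low-order bits of y are lost here
        c = f32(f32(t - s) - y)   # recover what was lost
        s = t
    return s


if __name__ == "__main__":
    data = [0.1] * 100_000
    exact = 10_000.0
    print("naive FP32 error:", abs(naive_sum32(data) - exact))
    print("Kahan FP32 error:", abs(kahan_sum32(data) - exact))
```

The compensated version costs four additions per term instead of one, which is the kind of accuracy-versus-performance trade-off in question. Note that aggressive compiler options that reassociate floating-point operations can silently remove the correction term in compiled languages.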
Stefano
rwmsu 14
I share this concern too, but there is nothing I can do about it, so it’s not on my list of things to worry about. The optimist in me hopes that one day NVIDIA, AMD, and Intel will enable FP64 in their commodity GPUs. The realist in me knows that’s probably never going to happen, because it’s hard to make a business case for it. I’m not a gamer and just need a GPU with as much memory as I can afford to do SciVis. Therefore, I’m not going to waste $1000 on a GPU when I can get something that meets my needs for around $300 to $400. That might change if I could get FP64 support in the $1000 card. Not going to happen though.
As to using FP32 for numerical analysis, I think that the latest linear algebra packages from Jack Dongarra’s group (MAGMA, if I remember correctly) use iterative refinement to recover some of the accuracy lost by using FP32 in their matrix solvers.
Edit.
Here is a paper from Dongarra’s group on using mixed-precision iterative refinement on GPUs. There are several other papers on the MAGMA web site.
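The idea behind mixed-precision iterative refinement can be sketched in a few lines. This is not MAGMA's implementation or API: it is a toy in pure Python, with FP32 emulated through a hypothetical `f32` helper, a small made-up 3x3 system, and a full re-solve at each step where a real code would factor once and reuse the factors.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve32(A, b):
    """Gaussian elimination with partial pivoting, every operation rounded to FP32."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(m * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        acc = M[i][n]
        for j in range(i + 1, n):
            acc = f32(acc - f32(M[i][j] * x[j]))
        x[i] = f32(acc / M[i][i])
    return x


def residual64(A, b, x):
    """r = b - A x, accumulated in FP64 (native Python floats)."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, iterations=4):
    """Return the list of iterates: FP32 solve, then FP64-residual corrections."""
    x = solve32(A, b)
    history = [list(x)]
    for _ in range(iterations):
        r = residual64(A, b, x)                 # high precision
        d = solve32(A, r)                       # low precision
        x = [xi + di for xi, di in zip(x, d)]   # update in high precision
        history.append(list(x))
    return history


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    x_true = [1.0 / 3.0, 2.0 / 3.0, 1.0]
    b = [sum(a * x for a, x in zip(row, x_true)) for row in A]
    for k, x in enumerate(refine(A, b)):
        err = max(abs(xi - ti) for xi, ti in zip(x, x_true))
        print(f"iteration {k}: max error = {err:.3e}")
```

The expensive part, the solve, stays in low precision; only the residual and the update need FP64. The approach converges quickly for well-conditioned systems like this one and is not guaranteed to converge when the matrix is too ill-conditioned for the low-precision solve.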
Thanks for the references. A famous quote comes to mind here:
If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?
Seymour Cray
PierU 16
It’s always possible to simulate high-precision computations with reduced-precision instructions; it’s “just” one more layer of complexity. But one can imagine these layers being part of the vendor libraries (for LAPACK, FFTs, …) and transparent to the user.
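To give an idea of what such a layer looks like, here is a minimal sketch of "float-float" addition, where a value is carried as an unevaluated sum of two FP32 numbers. It is a Python toy with FP32 emulated through a hypothetical `f32` helper; it is not how any vendor library does it, and a usable layer would also need multiplication, division, and careful handling of special values.

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Knuth's error-free transformation: s + e equals a + b exactly, s = fl32(a + b)."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def split(x):
    """Represent an FP64 value as a pair (hi, lo) of FP32 values with x ~ hi + lo.

    The subtraction x - hi is done in FP64; it is only used to set up test inputs.
    """
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo


def ff_add(a, b):
    """Add two (hi, lo) pairs using FP32 operations only."""
    s, e = two_sum(a[0], b[0])      # exact sum of the high parts
    e = f32(e + f32(a[1] + b[1]))   # fold in the low parts
    return two_sum(s, e)            # renormalise into a new (hi, lo) pair


if __name__ == "__main__":
    x, y = math.pi, math.e
    hi, lo = ff_add(split(x), split(y))
    print("plain FP32 error :", abs(f32(f32(x) + f32(y)) - (x + y)))
    print("float-float error:", abs((hi + lo) - (x + y)))
```

One addition becomes roughly a dozen FP32 operations, and the pair carries about 48 significand bits rather than the 53 of FP64, so both the cost and the "almost double" accuracy are visible even in this small case.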
szaghi 17
Yes, I know it is possible, and speaking of NVIDIA, I know it claims to offer some form of this emulation, but until I can try it, I am skeptical (about both accuracy and performance). Moreover, the NVIDIA SDK for Fortran developers is not the most bug-free, standard-compliant SDK, so I am also skeptical about their vendor libraries.
I know we can recover high precision with low precision representation, but I am afraid of how much effort this requires for small (or solo) Fortran developer teams in scientific fields.
Stefano
This quote is from 1995, when a supercomputer CPU executed at a rate of some 500 MFLOPS. Today’s chicken CPU executes at 30 GFLOPS per thread. History has answered that question.
rwmsu 19
Is there a source for this quote? Given the time frame, I’m guessing it was in response to the Connection Machine. I heard Cray give a keynote speech at a conference once (around 1983 or so), and he said that prior to leaving CDC he was trying to convince management to build a large multi-processor architecture for the CDC 8000 series, but it was not thought to be practical at the time.
jorgeg 20
I agree with this but from what I’ve been seeing, if you use any fancy things (that might be supported or not) you start getting performance hits. For example, I had a memory allocation in a polymorphic function:
class(*)
...
!$omp target enter data map(alloc:...)
and when I changed it to type(my_type) I got a 10% performance uplift, haha.
So if you basically write F90 in your more computationally expensive functions you also get speed.