Fortran Outsmarted Our Billion-Dollar AI Chips


We spent millions optimizing CUDA kernels — then a 1970s Fortran loop quietly ran faster.

Dax


I Thought GPUs Were the Future. Then Fortran Happened.

It started as a joke.

We were benchmarking our matrix solver — the one optimized with CUDA, cuBLAS, and enough parallelism to make your GPU whine audibly.
A colleague, half-joking, said:

“You know, my old professor had this Fortran code that did the same thing. Probably slower, but might be fun to compare.”

Fun. Right.

I ran it anyway — mostly to laugh at it.
Except… it wasn’t funny.

The Fortran version ran 1.4× faster than our hand-tuned GPU kernel on an NVIDIA A100.
And honestly? I didn’t see it coming.
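
For reference, the GPU baseline in question was a cuBLAS-backed solver. The article doesn't show that code, but a minimal sketch of that kind of setup (a double-precision GEMM through cublasDgemm) looks something like this; the function name and structure are my own illustration, not the team's actual kernel:

/* Hypothetical sketch of a cuBLAS-based GEMM baseline (not the article's actual kernel). */
/* Computes C = A * B for N x N double-precision matrices on the GPU.                     */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gpu_matmul(const double *hA, const double *hB, double *hC, int N)
{
    size_t bytes = (size_t)N * N * sizeof(double);
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    /* cuBLAS uses column-major storage, the same layout Fortran uses. */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, dA, N, dB, N, &beta, dC, N);
    cudaDeviceSynchronize();

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

Built with something like nvcc -O3 solver.cu -lcublas, this is only meant to show what "optimized with CUDA and cuBLAS" roughly means in practice, not to reproduce the benchmark.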

The Ancient Loop That Refused to Die

Here’s the kind of code we’re talking about — straight out of a dusty 1978 research archive:

C     Classic FORTRAN 77 triple loop: initialize C, then accumulate A*B.
      SUBROUTINE MATMUL(A, B, C, N)
      REAL*8 A(N,N), B(N,N), C(N,N)
      DO 20 I = 1, N
        DO 10 J = 1, N
          C(I,J) = 0.0
          DO 10 K = 1, N
   10       C(I,J) = C(I,J) + A(I,K)*B(K,J)
   20 CONTINUE
      RETURN
      END