Government-funded academic research on parallel computing, stream processing, real-time shading languages, and programmable graphics processing units (GPUs) directly led to the development of GPU computing. GPUs are used in modern datacenters and have enabled the current revolution in artificial intelligence (AI). Nvidia, which makes GPUs, is now the most valuable company in the world. This transformation of computing and resulting economic value created was enabled by more than 30 years of government-funded research. Government funding not only helped develop many of the key technical innovations; it also enabled the training of large numbers of students who have conveyed this technology to industry.
This article traces the origins of GPU computing. We start by describing the development of the technologies on which GPU computing is built (parallel computing, parallel graphics systems, programmable shaders, and stream processing), and then we detail how this technology was transferred to Nvidia and other companies and came to be applied to modern machine learning.
Enabling Technologies
GPU computing builds on earlier work in parallel computing, parallel graphics systems, and stream processing. These technologies were developed through more than 30 years of government-sponsored academic research.
Parallel computing. When you learn about computing, you learn about a central processing unit (CPU) executing a sequence of instructions one after another. In reality, chips contain billions of transistors switching in parallel, connected by wires. Switches and wires are the fundamental building blocks of physical computers, and they operate concurrently. Moreover, transistor switching consumes very little energy, whereas communication along wires consumes much more: the power required to send a signal from one point to another increases with distance and is very large when signaling between chips. Although a sequential computer may be easier to think about than a parallel computer, a sequential computer must be implemented with transistors switching concurrently and wires transmitting information simultaneously. Sequential computers use many transistors to compute results in parallel and then carefully assemble these results in a way that is consistent with a sequential execution. The need to maintain this illusion of sequential execution is inefficient in both power and performance, and as the number of available transistors grows, the inefficiency increases. The natural way to build a computer in modern semiconductor technology is to design a parallel computer. GPUs are more efficient than CPUs because they are massively parallel computers.
GPU computing built on earlier work on parallel computing. As with all parallel computers, the parallel tasks or threads running on GPUs must synchronize and communicate with one another. Communication is needed so one thread can use data produced by another thread; synchronization is needed to signal when data is available and to ensure the correct value is consumed. Many of the fundamentals of parallel computing, synchronization, and communication were developed by government-sponsored academic research. The DARPA-funded Cosmic Cube project led by Chuck Seitz at Caltech35 developed many of the fundamentals of parallel computing. The hardware developed on this project was the model on which the Intel iPSC, Delta, and Paragon machines were based, as well as several of the early Department of Energy ASCI machines.6 The Cosmic-C programming language introduced asynchronous message passing and collectives, which later became the norm for programming large parallel machines in the form of the message passing interface (MPI).29
The DARPA-funded J-Machine10 and M-Machine21 projects at the Massachusetts Institute of Technology (MIT) developed low-overhead mechanisms for communication and synchronization and many key aspects of modern interconnection networks. These mechanisms enabled parallelism to be exploited at a very fine grain, with as few as 10 or 20 instructions forming a schedulable unit of work. Many of the features of the J-Machine were directly adopted by the Cray T3D24 and T3E34 computers.
There is a rich history of parallel computing that goes beyond this particular branch of history. We do not have space to give a complete survey. A good review is in Culler et al.9
GPU computing, like all high-performance computing, draws heavily on this legacy. It uses MPI for communication between nodes and interconnection networks to connect them, and it coordinates its parallel computations with many of the communication and synchronization mechanisms developed in the course of this research.
Parallel graphics systems. Although less well known than traditional parallel computing and supercomputers, parallel graphics and imaging computers have a long history. Processing and generating images requires enormous amounts of computation. For example, if a one-million-instructions-per-second (1-MIPS) computer applies one arithmetic operation to each pixel of a one-megapixel image, it takes one second to process the image. Rendering 3D virtual worlds for movies and games takes many orders of magnitude more computation per pixel than image processing. For example, an image generated for a modern movie requires about a billion floating-point operations per pixel. As a result, to be useful in practice, graphics and imaging require high-performance parallel supercomputers, which parallelize computations over large collections of data.
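The arithmetic behind these figures is worth making explicit. The short calculation below uses only the numbers quoted above (1 MIPS, one megapixel, roughly a billion floating-point operations per pixel):

```python
# Back-of-the-envelope cost of per-pixel processing, using the figures above.

MEGAPIXEL = 1_000_000              # pixels in a one-megapixel image

# Image processing: one arithmetic operation per pixel on a 1-MIPS machine.
mips_rate = 1_000_000              # instructions per second
seconds_per_image = MEGAPIXEL * 1 / mips_rate
print(seconds_per_image)           # 1.0 second per image

# Movie rendering: roughly a billion floating-point operations per pixel.
flops_per_pixel = 1_000_000_000
total_flops = MEGAPIXEL * flops_per_pixel
print(f"{total_flops:.1e}")        # 1.0e+15 operations per frame
```

A petaflop-scale operation count per frame makes clear why sequential machines were never an option for film rendering.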
One early DARPA-funded research effort was the Geometry Engine, led by Jim Clark at Stanford.7 The Geometry Engine led to the formation of Silicon Graphics (SGI), a company that pioneered the development of 3D graphics workstations. The SGI hardware architecture and OpenGL software library defined the modern GPU architecture. Another notable government-funded research effort was the Pixel Planes series of high-performance graphics systems led by Henry Fuchs and his collaborators at the University of North Carolina.22 The Pixel Planes 5 was in fact a quite general single-instruction, multiple-data (SIMD) computer that ran parallel computations on a 128 × 128 image. Other examples of early parallel graphics and image computers were the NASA Massively Parallel Processor (MPP),1 the Ikonas Graphics Systems,15 and the Pixar Image Computer.27
Early GPUs implemented a fixed-function graphics pipeline similar to that of the early SGI workstations. The term GPU was introduced by Nvidia when it became possible to implement the entire OpenGL graphics pipeline on a single chip. The Nvidia GeForce 256, introduced in 1999 and consisting of 17 million transistors, was the first commercially available GPU.
Earlier, while at Pixar, Pat Hanrahan developed RenderMan, a system to generate photorealistic images. This system revolutionized the movie industry because it enabled the generation of images that could be seamlessly combined with live action captured with cameras. A key component of RenderMan was its shading language, which enabled users to extend the system to model complex materials and lighting.
Although the first GPUs implemented a fixed-function pipeline, they were constructed from programmable components. Unfortunately, these processing elements changed from system to system and from generation to generation. What was needed was a portable programming model. Since the major application for GPUs was computer games, it seemed natural to adapt the RenderMan shading language to GPUs so game developers could create new lighting and shading effects.
At Stanford, under a DARPA-funded project, a real-time shading language (RTSL) was designed and implemented for the then-current GPUs. Shading-language programs are now called shaders. Bill Mark, a postdoctoral scholar, led the design of the Stanford RTSL and later joined Nvidia, where, along with former Stanford graduate student Kurt Akeley, he enhanced that technology to create the Cg shading language. Cg led to the development of Microsoft's HLSL and OpenGL's GLSL.
It was quickly realized that these early shading languages were flexible enough to implement many algorithms in scientific computing. Researchers adapted algorithms such as matrix-matrix multiplication, linear solvers, fluid-flow solvers, and molecular dynamics to run on shaders. This led to the GPGPU (general-purpose GPU) computing movement.28,36
Stream processing. DARPA- and DOE-funded work at Stanford on the Imagine stream processor and the Merrimac streaming supercomputer developed stream processing, a form of parallel computing that increases arithmetic intensity (the ratio of arithmetic operations to memory bandwidth). As mentioned previously, the majority of the power consumed by a processor goes to communication, and sending signals between chips is particularly power hungry. Off-chip communication is also much slower than on-chip communication. Stream processing involves two main ideas that reduce the need for memory bandwidth. The first is to exploit producer-consumer locality, so that one stage, the producer, forwards its results to the next stage, the consumer, without writing to and reading from memory. The second is to organize the computation into functions called kernels. Each kernel takes a packet of data, executes a function on that packet, and outputs another packet of data, and the number of arithmetic operations in the function is greater than the number of reads and writes to memory. These two techniques significantly decrease the number of memory accesses and improve the efficiency of stream-processing architectures.
In a stream processor, a computation was organized into kernels that produced and consumed streams of data. A producing kernel would write its output stream into a stream register file (SRF). The consuming kernel would read its input from the SRF without the data ever needing to be written to or read from memory. With appropriate scheduling to match the batch size of streams to the capacity of the SRF, this organization enabled applications to sustain very high arithmetic intensity.
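As a rough illustration of this organization, the sketch below models kernels as functions over fixed-size packets, chained so intermediate results pass directly from producer to consumer rather than through a simulated memory. The kernel functions and packet size are invented for illustration; this is a minimal sketch, not Imagine's actual programming model.

```python
# A minimal sketch of stream processing: kernels consume and produce
# fixed-size packets, and intermediate results stay in an SRF-like
# buffer between kernels instead of round-tripping through main memory.

def scale_kernel(packet):
    # First kernel: multiply every element of the packet by 2.
    return [2 * x for x in packet]

def offset_kernel(packet):
    # Second kernel: add 1 to every element of the packet.
    return [x + 1 for x in packet]

def run_pipeline(stream, kernels, packet_size=4):
    """Split `stream` into packets and push each packet through the
    chain of kernels; the packet plays the role of data held in the
    stream register file between producer and consumer."""
    out = []
    for i in range(0, len(stream), packet_size):
        packet = stream[i:i + packet_size]   # one read from "memory"
        for kernel in kernels:               # producer -> consumer, in SRF
            packet = kernel(packet)
        out.extend(packet)                   # one write to "memory"
    return out

result = run_pipeline(list(range(8)), [scale_kernel, offset_kernel])
print(result)  # [1, 3, 5, 7, 9, 11, 13, 15]
```

Note that each element is read from and written to "memory" exactly once, however many kernels are chained; that is the source of the bandwidth savings.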
A DARPA-funded project to design and build the Imagine Stream Processor was started at MIT in 1997 and migrated to Stanford later that year.12,23,26 Imagine was a graphics and media processor intended for signal- and image-processing workloads. It consisted of a number of parallel arithmetic units with local register files, a central stream register file, and a memory system. Kernels read streams from the stream register file, passed intermediate results through the local register files, and wrote output streams back to the stream register file to be read by the next kernel.
The Stream-C programming language was developed to program Imagine. It extended the C programming language with constructs to describe kernels and streams. Numerous graphics,32 signal-processing, and image-processing33 applications were developed to tune and evaluate the architecture. Imagine's performance on texture-mapped raster graphics was comparable to that of a contemporary fixed-function GPU.
At a DARPA principal investigators meeting, the authors of this article realized this technology could be applied to high-performance computing and conceived the Merrimac project. The computer science (CS) part of the Stanford DOE ASCI Center was redirected to pursue this approach to high-performance computing. The Center's annual reports offer a detailed history of the development of stream processing.a
The Merrimac architecture11 was defined to adapt stream processing to scientific applications. The major changes from Imagine were the addition of data types (like FP64) needed for scientific computations, scaling the architecture to multiple nodes connected by an interconnection network to handle problems at scale, and the addition of a number of resilience features17 to support computation at scale with a reasonable failure rate.
The Stream-C programming language evolved into Brook.2 The key idea behind Brook was to merge ideas from stream programming with more traditional data-parallel computing. Kernel functions became the key processing primitive to maintain high arithmetic intensity.
Brook was adapted to target the GPUs of the early 2000s.3 These GPUs ran programmable vertex and fragment shaders. The shaders could implement kernels but supported only a limited number of instructions and a few registers. Common data-parallel programming primitives such as map, reduce/scan, filter, gather, and scatter were implemented by building a virtual data-parallel computer on top of the low-level graphics shaders. This abstraction enabled a large number of existing parallel algorithms to run on GPUs, and gradually the limitations of the early shaders were eliminated.
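A rough sketch of these primitives in plain Python (the function names here mirror the primitive names, not any Brook API; in Brook each would run as a kernel over a stream):

```python
# Data-parallel primitives of the kind Brook layered on top of shaders.

def par_map(f, xs):
    # Apply f independently to every element (one kernel invocation each).
    return [f(x) for x in xs]

def par_reduce(f, xs):
    # Combine all elements with an associative operator f.
    acc = xs[0]
    for x in xs[1:]:
        acc = f(acc, x)
    return acc

def par_filter(pred, xs):
    # Keep only the elements satisfying pred.
    return [x for x in xs if pred(x)]

def gather(xs, idx):
    # out[i] = xs[idx[i]]: an indexed read.
    return [xs[i] for i in idx]

def scatter(xs, idx, n):
    # out[idx[i]] = xs[i]: an indexed write into an array of length n.
    out = [0] * n
    for x, i in zip(xs, idx):
        out[i] = x
    return out

data = [3, 1, 4, 1, 5]
print(par_map(lambda x: x * x, data))        # [9, 1, 16, 1, 25]
print(par_reduce(lambda a, b: a + b, data))  # 14
print(gather(data, [4, 0]))                  # [5, 3]
```

Because each primitive touches elements independently (or with a fixed combining pattern), each maps naturally onto thousands of shader invocations running in parallel.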
A good example of an early application of a kernel to perform computations with high arithmetic intensity is dense matrix-matrix multiplication, which underlies modern neural-network algorithms. When multiplying matrices, two n × n matrices must be read and one n × n matrix must be written, while the multiplication itself requires n³ multiply-accumulate operations. Thus, the arithmetic intensity is O(n). This fact is well known and leads to efficient methods for blocking matrix multiplies for CPUs with caches. Blocking works well when run on GPUs.19
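The O(n) claim can be checked with a short calculation, counting each multiply-accumulate as two operations and each matrix element as one word of memory traffic:

```python
# Arithmetic intensity of dense n x n matrix multiply:
# 2*n^3 arithmetic operations versus 3*n^2 words moved
# (read A, read B, write C), so intensity grows linearly with n.

def arithmetic_intensity(n):
    ops = 2 * n ** 3        # n^3 multiply-accumulates = 2*n^3 operations
    words = 3 * n ** 2      # two matrices read, one matrix written
    return ops / words      # simplifies to 2*n/3

print(arithmetic_intensity(3))     # 2.0
print(arithmetic_intensity(1024))  # ~682.7: each word is reused many times
```

The linear growth is exactly what blocking exploits: a block held in cache (or in a GPU's shared memory) is reused on the order of n times before being evicted.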
The numerical scientists in the Stanford ASCI Center ported several scientific codes to Brook to run on a Merrimac simulator. These included computational fluid dynamics,20 magnetohydrodynamics, and n-body simulations.16 The n-body simulation is a good example of an efficient GPU application. In astrophysical simulations, the force between a pair of bodies is given by the law of gravitation; in molecular dynamics, the interactions between non-bonded atoms are approximated by the Lennard-Jones potential (or even more complicated empirical potentials). These functions require many arithmetic operations. For these simulations, each atom's nearby atoms are stored in a "neighbor list," so only nearby interactions need be computed. Molecular dynamics simulations immediately became a major application for GPUs.14
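To make the arithmetic cost concrete, the toy sketch below evaluates the Lennard-Jones potential over a neighbor list. The one-dimensional positions and unit Lennard-Jones parameters are simplifications for illustration, not a real molecular-dynamics code:

```python
# A toy sketch of the Lennard-Jones interaction with a neighbor list:
# each atom sums the pair potential only over its listed neighbors.

def lj_potential(r, epsilon=1.0, sigma=1.0):
    # V(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6)
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def total_energy(positions, neighbor_list):
    """Sum V(r) over the pairs in the neighbor list.
    neighbor_list[i] holds indices j > i to avoid double counting."""
    energy = 0.0
    for i, neighbors in enumerate(neighbor_list):
        for j in neighbors:
            r = abs(positions[i] - positions[j])
            energy += lj_potential(r)
    return energy

# Two atoms at the potential minimum r = 2^(1/6): V = -epsilon.
positions = [0.0, 2 ** (1 / 6)]
neighbors = [[1], []]
print(round(total_energy(positions, neighbors), 6))  # -1.0
```

Even in this toy form, each pair costs roughly a dozen arithmetic operations for two position reads, which is the high arithmetic intensity that made such codes a natural fit for GPUs.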
A key feature of GPUs and stream processors is that they have multiple forms of hardware parallelism. Each GPU consists of many cores, each core contains a SIMD processing unit (typically 32 wide), and, in addition, each core is multithreaded. Recall that GPUs were developed for gaming applications, whose performance depended on the efficiency of applying textures to triangles. Texture mapping involves computing texture coordinates for each pixel fragment inside a triangle and then fetching from an image using those coordinates. These texture fetches have spatial locality but very little temporal locality. The spatial locality can be captured with small caches, but because there is so little temporal locality, caches alone cannot hide the latency of the fetches. Efficient texture mapping therefore required that the GPU hide this latency. Early GPUs did so by having a fragment request a texture, suspending execution of that fragment, and immediately switching to processing another fragment. This is a simple form of multithreading, and it means the GPU must keep many parallel threads running simultaneously. The total number of threads is the number of cores times the number of resident warps per core (a warp is a group of threads executed together on the SIMD arithmetic units) times the number of threads per warp. A Blackwell B200 GPU has 384 streaming multiprocessors (SMs); each SM has 64 resident warps of 32 threads each. Thus, there are 786,432 threads executing simultaneously on this GPU.
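The thread count quoted above follows directly from multiplying the three levels of parallelism:

```python
# Total resident threads on the GPU described above:
# SMs x resident warps per SM x threads per warp.

sms = 384              # streaming multiprocessors
warps_per_sm = 64      # resident warps per SM
threads_per_warp = 32  # threads executed together as one warp

total_threads = sms * warps_per_sm * threads_per_warp
print(total_threads)   # 786432
```

Only a fraction of these threads compute at any instant; the rest stand ready to run, which is what lets the hardware hide memory latency.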
Technology transfer. The stream-processing architecture and programming system were transferred from Stanford to Nvidia by moving people. John Nickolls, an architect at Nvidia, heard about stream processing and in 2003 recruited Bill Dally to consult at Nvidia on the architecture of the NV50.31 (The NV50 launched as the G80 in 2006.) Many of the features of stream processors were incorporated into the architecture; the "shared memory" of the NV50 served the function of the SRF in Imagine and Merrimac.
Ian Buck (a graduate student on the Merrimac project and the principal developer of Brook) joined Nvidia in 2004 and worked with John Nickolls to evolve Brook into CUDA.30 CUDA incorporated the best features of Brook and Cg (a graphics shading language) as well as feedback from Brook programmers. The story of how the technology was transferred from Stanford to Nvidia is described in a presentation.4 Mike Houston (another graduate student on the project) joined AMD and used Brook directly as the programming language for its GPUs. The G80 (NV50) and CUDA were launched at Supercomputing in 2006.
When CUDA was launched in 2006, few people understood parallel programming, let alone GPU stream programming. To overcome this workforce deficit, Wen-Mei Hwu and David Kirk evangelized GPU computing by teaching CUDA programming courses for professors.25 The faculty who attended these courses went on to teach thousands of students parallel programming in CUDA. Parallel computing technologies borrowed from the Cosmic Cube, J-Machine, and M-Machine were applied both within a GPU, to coordinate multiple SMs, and across GPUs as multi-node GPU systems were built to tackle large problems.
Enabling AI. Modern machine learning relies on three key ingredients: massive datasets, large models with many layers and weights, and the computing power to optimize the weights. The core algorithms (deep neural networks, convolutional networks, training using backpropagation, and stochastic gradient descent) have all been around since the 1980s or earlier. Large, labeled datasets, such as PASCAL18 and ImageNet,13 appeared in the early 2000s. Recent advances, such as embedding text into vector spaces, enabled deep learning for natural language. Transformers ("attention is all you need")37 replaced hard-to-train recurrent neural networks, which summarize the past in a hidden state, with easy-to-train networks that attend directly to the history. GPU computing made it economical to train large networks on massive datasets. Once this capability was demonstrated (AlexNet, GPT), the capabilities of AI improved rapidly, and the rapid adoption of AI provided even more impetus for improving GPU computing systems.
Machine learning at Nvidia was also enabled by academic-industry synergy. A 2010 breakfast conversation between one of the authors (Dally) and Andrew Ng led to a joint project between Nvidia and Stanford to build deep neural networks on GPUs.8 Bryan Catanzaro led the Nvidia portion of the project. The software developed during this project became cuDNN,5 which provides a readily available library for deep learning on Nvidia GPUs, democratizing deep learning.
Conclusion
The technologies behind GPU computing, which has enabled modern machine learning, were largely developed through 30 years of government-funded academic research. Research in parallel computing, parallel graphics systems, and stream processing laid the groundwork for GPU computing. Many of the students trained during these research projects took positions in industry, transferring the technology and using it to develop innovative products. The transfer from the Stanford stream processing project to GPU computing was very direct, with the academic Brook language evolving into CUDA and stream-processor features being incorporated in the G80 GPU. The efficient, easily programmed, and very high-performance computing platform provided by GPUs with compute shaders enabled the current revolution in machine learning, providing the missing ingredient to complement the algorithms and data that had been available for some time.
Acknowledgments
The late John Nickolls played a critical role in transferring the Stanford stream processing technology to Nvidia. Scott Rixner, Brucek Khailany, Ujval Kapasi, John Owens, Jung Ho Ahn, Peter Mattson, Abhishek Das, Jinyung Namkoong, Brian Towles, Andrew Chang, and Ben Mowery contributed to the Imagine project. Several of these students along with Mattan Erez, Ian Buck, Mike Houston, Timothy Knight, Nuwan Jayasena, Francois Labonte, Jayanth Gummaraju, Ben Serebrin, and Binu Mathew contributed to the Merrimac project. Several of these students along with Tim Foley, Daniel Horn, Jeremy Sugarman, and Kayvon Fatahalian contributed to the development of Brook. Kekoa Proudfoot, Bill Mark, Kurt Akeley, Ravi Glanville, Randy Fernando, and Mark Kilgard contributed to the development of RTSL and Cg. David Kirk and Wen-Mei Hwu evangelized CUDA programming. Massimiliano Fatica, Vijay Pande, Eric Darve, Juan Alonso, Alan Wray, and Tim Barth developed early scientific Brook applications on Merrimac. Grants from DARPA, DOE, NSF, Intel, Sun, SGI, NVIDIA, and AMD supported this work.