Groq Demonstrates Fast LLMs on 4-Year-Old Silicon

MOUNTAIN VIEW, CALIF. — Groq has repositioned its first-generation AI inference chip as a language processing unit (LPU) and demonstrated Meta’s Llama-2 70-billion–parameter large language model (LLM) running inference at 240 tokens per second per user. Groq CEO Jonathan Ross told EE Times that the company had Llama-2 up and running on its 10-rack (640-chip) cloud-based dev system in “a couple of days.” This system is based on the company’s first-generation AI silicon, released four years ago.

“We are focused on the LLM opportunity,” Ross said. “There’s a lot we can do, but this kind of fell in our laps.”

Groq CEO Jonathan Ross shows Sally Ward-Foxton Groq’s PCIe cards at Groq HQ in Mountain View, Calif. (Source: EE Times)

Soaring market opportunities for LLM inference in the wake of ChatGPT’s popularity are encouraging data center AI chip companies to demonstrate their technologies on new LLM workloads. Ross said Groq’s market, data center AI inference, is set to grow much more rapidly than training now that fine-tuning, which reduces the need for training from scratch, is becoming popular for LLMs. Prompt engineering can often be enough to remove the need for fine-tuning altogether, he added.

“People are now doing the fine-tuning on their laptops in a very short period of time,” he said. “So we don’t think that training is a huge market. In fact, I had a very large infrastructure customer tell us that they had expected all of their money would come from training and actually their revenue on training is now going down.”

Latency issue

“The No. 1 thing we’re hearing from people right now is that on LLMs, latency is the problem,” Ross said.

Training LLMs requires high-throughput networking, but for inference, designing for a batch size of one is critical.

Realized throughput of an 8-way AllReduce for Groq hardware compared with the competition. (Source: Groq)

At ISCA 2022, Groq’s paper showed its first-generation PCIe card was able to transfer small tensors, like the 8-16 kB tensors found in LLM workloads, more efficiently than competing architectures. Being able to distribute the workload efficiently across a large number of chips is key to achieving low latency, according to Ross.

“If you have one Nvidia A100 server versus one of ours, they’re going to win, but if you take the 65-billion–parameter LLM and take 40 of our servers against 40 of theirs, it’s not even close [Groq’s latency is much lower],” Ross said.

Traction

The company already has several of its 10-rack, 640-chip systems up and running, with plans to build more. One is currently used for internal development, and a second is offered in the cloud to Groq’s customers in the financial services industry. Groq also has hardware installed at the Argonne Leadership Computing Facility’s (ALCF) AI Testbed, where it’s used for experimenting with future fusion energy devices.

Groq nodes are already installed at Argonne National Laboratory. (Source: Groq)

Groq chips are designed, engineered and fabricated in North America in order to appeal to U.S. government agency customers. Chips are fabbed at GlobalFoundries in Malta, N.Y., and packaged in Canada. The company announced recently that its second-generation chip will be fabbed at a Samsung foundry in Taylor, Texas.

The company is also working on an 8-chip board for its first-gen chips with proprietary interconnect to get around the performance limitations of PCIe and improve compute density.

Secret sauce

The development of Groq’s tensor streaming processor (TSP) hardware architecture started with software. The company first developed a prototype of its machine learning compiler and built the hardware around it. The compiler handles all execution planning, orchestrating data flow and timing, which means hardware design can be simplified, and performance and latency are entirely predictable at compile time.

Reaching production readiness with its compiler meant the number of models Groq could compile for its chip ramped from 60 to 500 in the space of a few weeks, and the company was able to get Llama up and running quickly. Flexibility to run many different types of models quickly is desirable because it offers a level of future-proofing in a market where workloads evolve quickly.

Groq engineers have also been using the company’s GroqView visualization tool to watch the model run on simulated hardware, changing the distribution of the workload across the chip to optimize performance further. The company intends to continue working to optimize performance.

Kernel free

One of the most interesting things about Groq’s TSP architecture is that it’s completely kernel free.

Most hardware architectures require kernels—snippets of low-level code that directly control the hardware. Generally, AI models from a framework like PyTorch are compiled to a set of kernels. With Groq’s architecture, problems are instead broken into a small number of intrinsic functions, around which the chip is designed. Mathematically, Ross said, it can be shown that models can always be reduced to these intrinsics.

“We can also do so in a computationally inexpensive manner; there are no NP-complete problems to compile for our chip,” he said.

While other hardware architectures have to try to solve a 2D bin packing problem (a type of computationally expensive NP-complete problem) during compilation, the layout of Groq’s chip is one-dimensional, so compilation is less compute-intensive. This capability wouldn’t be possible to reverse-engineer by writing a new compiler for existing silicon, Ross added.

The layout of Groq’s chip. (Source: Groq)

One of the benefits of not using kernels is users don’t have to spend time writing custom kernels for new or proprietary functions. Nor do they have to wait for their hardware provider to write new kernels on their behalf. However, Ross admitted that the technical story has been a tough one to tell customers. Many assume there is some kind of software tool to automatically create kernels, or they simply don’t believe kernel-free operation is possible.

“One of the hardest things was getting to the point where we could prove it would work,” he said. “There are so many reasons to believe you can’t actually build a kernel-less compiler.”

Igor Arsovski, Groq’s head of silicon, told EE Times that the simplicity of Groq’s chip is made possible by extracting dynamic controls for decisions like caching and moving them all into software, leaving the hardware entirely for workload acceleration.

Igor Arsovski. (Source: Groq)

“By doing this, we can schedule into the hardware exactly where execution will be taking place, down to the nanosecond,” he said. “This is what makes the software easier, allowing the kernel-less approach because everything in the hardware is pre-scheduled. We know what memory is accessed, we know what functional units are being activated. The software knows which functional units are busy during which nanosecond, so when it’s busy you can use another functional unit. You can’t do that in a GPU because you don’t know if your previous execution hit the cache or not, so you have to plan for that by writing kernels.”
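The pre-scheduling Arsovski describes can be illustrated with a minimal Python sketch. This is not Groq’s compiler. It assumes every operation has a fixed latency known at compile time and runs on one named functional unit; the `Op` class, the unit names, the cycle counts and the `schedule` function are all hypothetical. Because nothing depends on run-time events such as cache hits, the start and end cycle of every operation is known before anything executes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # hypothetical functional unit, e.g. "matmul"
    cycles: int        # fixed latency, assumed known at compile time
    deps: tuple = ()   # names of ops that must finish first


def schedule(ops):
    """Assign each op a (start, end) cycle using only compile-time facts.

    Ops must be listed in dependency order.
    """
    unit_free = {}   # unit -> first cycle at which it is idle
    times = {}       # op name -> (start, end)
    for op in ops:
        # Earliest cycle at which all inputs are available.
        ready = max((times[d][1] for d in op.deps), default=0)
        # Wait for the functional unit if it is still busy.
        start = max(ready, unit_free.get(op.unit, 0))
        end = start + op.cycles
        unit_free[op.unit] = end
        times[op.name] = (start, end)
    return times


ops = [
    Op("load_a", "memory", 4),
    Op("load_b", "memory", 4),
    Op("matmul", "matmul", 10, deps=("load_a", "load_b")),
    Op("load_c", "memory", 4),   # memory unit is idle while matmul runs
    Op("add", "vector", 2, deps=("matmul", "load_c")),
]
plan = schedule(ops)
for name, (start, end) in plan.items():
    print(f"{name}: cycles {start}-{end}")
```

In this toy plan, `load_c` is placed on the memory unit during the cycles when the matmul unit is busy, which is the kind of overlap that is only safe to plan when every latency is known in advance.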

Describing Groq’s chip as “an accelerator and a router at the same time,” Arsovski said networking between chips is also scheduled by the compiler, effectively creating one large deterministic multi-chip processor.

Power control

There are also secondary benefits of determinism, Arsovski explained.

“If you know exactly how you’re lighting your chip up, you know exactly where you’re burning power,” he said. “If you’re going to be doing 3D stacking, you need to know where you’re burning power, because you’re generating heat. If you have a non-deterministic chip on top of a non-deterministic chip, you can get superposition of thermal events or hotspots, but you can’t plan for them.”

Groq’s software can predict power peaks based on the workload down to the nanosecond. This allows users to compile for a particular maximum power consumption for the whole chip, or to trade off performance for peak current, or to control thermal issues that impact the hardware’s reliability and lifetime.

Groq’s compiler allows visibility into the power consumed by functional units at nanosecond resolution. (Source: Groq)

Power can be limited to a maximum level in software. Groq’s example shows lowering the peak power for use cases with different thermal constraints. (Source: Groq)
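The trade-off between peak power and performance can be shown with a toy scheduler. This is a sketch under stated assumptions, not Groq’s software: tasks are independent, each draws a constant, known power for a known number of cycles, and the task list, the power units and the `schedule_under_cap` function are all hypothetical.

```python
def schedule_under_cap(tasks, cap):
    """Greedily place tasks so total power never exceeds `cap` in any cycle.

    tasks: list of (cycles, power) tuples, assumed independent.
    Returns (start cycle per task, power drawn in each cycle).
    """
    profile = []   # profile[t] = total power drawn in cycle t
    starts = []
    for cycles, power in tasks:
        if power > cap:
            raise ValueError("a single task exceeds the power cap")
        t = 0
        while True:
            # Power already committed in the cycles this task would occupy.
            window = [profile[i] if i < len(profile) else 0
                      for i in range(t, t + cycles)]
            if all(p + power <= cap for p in window):
                break
            t += 1   # slide the task later until it fits under the cap
        while len(profile) < t + cycles:
            profile.append(0)
        for i in range(t, t + cycles):
            profile[i] += power
        starts.append(t)
    return starts, profile


tasks = [(4, 30), (4, 30), (2, 50)]   # (cycles, power), arbitrary units

starts_hi, profile_hi = schedule_under_cap(tasks, cap=120)
starts_lo, profile_lo = schedule_under_cap(tasks, cap=80)

print(max(profile_hi), len(profile_hi))   # peak power, total cycles
print(max(profile_lo), len(profile_lo))
```

With the looser cap all three tasks run at once; with the tighter cap the third task is pushed later, so the peak falls and the run takes more cycles. Such a plan only holds if the power of every task is predictable, which is the property the article attributes to determinism.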

The need to keep a safety margin in case of dI/dt events means today’s chips use more power than they need, according to Arsovski.

“Right now everybody’s running 50-80 mV higher than they need to, because of unpredictable events,” he said. Eliminating this safety margin could cut as much as 20% from power consumption.
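A back-of-envelope calculation shows how a margin of this size could translate into savings of that order. The sketch below assumes dynamic power scales with the square of supply voltage at fixed frequency and ignores leakage. The 0.75-V nominal supply is a hypothetical value chosen for illustration, not a Groq figure, and the article does not say how the 20% estimate was derived.

```python
def dynamic_power_saving(v_nominal, margin_v):
    """Fractional dynamic-power saving from lowering the supply by margin_v.

    Assumes P is proportional to V^2 at fixed frequency and capacitance.
    """
    v_new = v_nominal - margin_v
    return 1 - (v_new / v_nominal) ** 2


V_NOMINAL = 0.75   # volts; hypothetical nominal supply

low = dynamic_power_saving(V_NOMINAL, 0.050)    # 50-mV margin removed
high = dynamic_power_saving(V_NOMINAL, 0.080)   # 80-mV margin removed

print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions, removing a 50-80 mV margin saves roughly 13% to 20% of dynamic power. A higher nominal voltage would give a smaller percentage.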

Managing dI/dt events effectively may also be a way to mitigate silent data corruption—hardware failures that can affect the result of a computation without being detected. This problem is particularly evident in long training runs, Arsovski said, but is becoming apparent for multi-chip inference systems, too.

“It’s getting more and more significant because there’s more and more chips being deployed together—for single chip systems it isn’t such a big deal,” he said. “By managing our current predictably, we can predict and control these events.”

Roadmap

Groq, founded in 2016, is still optimizing software for its first-generation silicon, which was released in 2019. However, the company is considering its options for second-generation silicon. Its compiler-first approach allows design-space exploration in software, using a tool developed in-house.

“This was not intended when we started Groq,” Ross said. “But because we’re kernel free, we decided rather than baking assumptions about the chip into the compiler, we would pass in a config file that would say, here’s how many memories and where they are, and so on. We had been compiling for our v2 before we knew what our v2 was going to be. But someone realized, well, we could just start doing sweeps with this config file and figure out what the ideal chip is.”
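The kind of sweep Ross describes might look something like the following Python sketch. Everything here is hypothetical: the `ChipConfig` fields, the area formula and the `estimated_latency` cost model are made-up stand-ins for “compile each model against the config and read off the result,” and none of the numbers come from Groq.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ChipConfig:
    memory_banks: int
    matmul_units: int


def estimated_latency(cfg, workload):
    """Toy cost model: the slower of compute and memory bounds the run time."""
    compute = workload["flops"] / cfg.matmul_units
    memory = workload["bytes"] / cfg.memory_banks
    return max(compute, memory)


def area(cfg):
    """Made-up area estimate in arbitrary units."""
    return 2 * cfg.memory_banks + 3 * cfg.matmul_units


def sweep(workloads, bank_options, unit_options, area_budget):
    """Return (total latency, config) for the best config within budget."""
    best = None
    for banks, units in product(bank_options, unit_options):
        cfg = ChipConfig(banks, units)
        if area(cfg) > area_budget:
            continue   # does not fit on the die
        total = sum(estimated_latency(cfg, w) for w in workloads)
        if best is None or total < best[0]:
            best = (total, cfg)
    return best


workloads = [
    {"flops": 1200, "bytes": 400},   # compute-heavy model
    {"flops": 300, "bytes": 900},    # memory-heavy model
]
best = sweep(workloads, [4, 8, 16], [4, 8, 16], area_budget=60)
print(best)
```

The point of the sketch is the shape of the loop: the chip description is data passed in rather than assumptions baked into the compiler, so candidate designs can be compared across many workloads without writing anything new per design.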

All AI accelerator companies face the difficulty of designing silicon, which might typically take 18-24 months, to accelerate AI workloads that evolve extremely rapidly. This is difficult to do without visibility into how the workload will evolve. Reducing the time taken to design accelerator chips may offer an opportunity to stay ahead of the game; this is what Groq is counting on with its AI-assisted design exploration tool. The tool also considers how various chip and system-level configurations affect performance, power and total cost of ownership. Without the need to write new kernels, hundreds of common models can be tested in software to see how efficiently they run on potential hardware.

So, will Groq plan a whole family of more specialized chips tuned to different workloads for its second generation?

“Ideally we would find one piece of silicon that works great for everything, because that’s the cheapest way to do it,” Ross said. “Maybe it turns out that two [more specialized] pieces of silicon are the right way to do it, but you don’t want to keep building more and more and getting diminishing returns… When you start specializing too much, the time it takes to get that next chip out [compares unfavorably with] building a general chip with more optimization in it.”

LLMs are taking over many AI use cases, but the workload is far from fixed, Ross said, highlighting that researchers are currently experimenting with different architectures for attention heads, for example. He argued that since LLMs aren’t yet ready to power search, Google and its competitors are likely still working on algorithmic improvements, which means the workload isn’t mature enough to specialize for.

Another evolution relates to a technique called reflection, which is new for LLMs. Today’s models can be asked to produce an answer and then iterate on that answer to make it better, but future models will give themselves time to think—effectively performing multiple inferences to try and come up with the best possible answer on their own. In this way, each “inference” is actually several inferences in a row.
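A short Python sketch shows why this multiplies inference demand. The draft, critique and revise structure here is one assumed form of reflection, not a description of any particular model, and `reflect` and `fake_model` are hypothetical names. The stub model simply returns a placeholder string so the example runs without an LLM.

```python
def reflect(prompt, generate, rounds):
    """Draft an answer, then critique and revise it `rounds` times.

    generate: any callable mapping a text prompt to a text reply.
    Returns (final answer, number of inference calls made).
    """
    answer = generate(prompt)
    calls = 1
    for _ in range(rounds):
        critique = generate(f"Critique this answer to '{prompt}': {answer}")
        answer = generate(
            f"Improve the answer to '{prompt}' using this critique: "
            f"{critique}\nAnswer: {answer}"
        )
        calls += 2   # one critique plus one revision per round
    return answer, calls


def fake_model(text):
    """Stand-in for an LLM call; returns a placeholder reply."""
    return f"<reply to {len(text)} chars>"


answer, calls = reflect("Why is the sky blue?", fake_model, rounds=3)
print(calls)
```

In this sketch, three rounds of reflection turn one user request into seven model calls, each of which must finish before the next can start. That is why per-inference latency matters more as the number of rounds grows.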

“When people start doing this, the need for inference compute is going to explode,” Ross said. “People don’t have the intuition around why reflection matters. Eventually they will, but people will need a lot of inference compute.”
