mizvekov 1
It seems we are flying blind when it comes to frontend compiler performance regressions for C++ code, especially related to concepts and template-heavy code.
Thankfully we have the https://llvm-compile-time-tracker.com, but it has shortcomings for us:
- Doesn't compile anything on `-std=c++20` or above. There, we start using concepts, and even if the user code doesn't use them, the standard library does, and that by itself can cause large regressions. This has the potential to shock us when the time comes to migrate LLVM to C++20.
- None of the test cases are particularly template heavy, at least compared to something like an `std::execution` implementation. This can make us underestimate performance improvements in this area.
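To make the C++20 gap concrete, here is a hedged sketch (entirely hypothetical; `make_concepts_tu` and all names in it are made up for illustration, not part of llvm-compile-time-tracker or any existing suite) of a tiny generator that emits the kind of concepts-heavy translation unit these trackers currently miss:

```python
# Hypothetical sketch: generate a concepts-heavy C++20 TU to stress the
# frontend. Every generated function is constrained by a concept, so the
# compiler must perform constraint checking once per call site.

def make_concepts_tu(n_types: int = 50) -> str:
    """Return C++20 source with n_types structs, each used through a
    concept-constrained function template."""
    lines = [
        "#include <concepts>",
        "template <class T>",
        "concept Summable = requires(T a, T b) { { a + b } -> std::convertible_to<T>; };",
    ]
    for i in range(n_types):
        lines.append(f"struct S{i} {{ int v; S{i} operator+(S{i} o) const {{ return {{v + o.v}}; }} }};")
        lines.append(f"template <Summable T> T add{i}(T a, T b) {{ return a + b; }}")
        lines.append(f"inline auto use{i} = add{i}(S{i}{{1}}, S{i}{{2}});")
    return "\n".join(lines)

if __name__ == "__main__":
    # Write the TU out, then time e.g. `clang++ -std=c++20 -fsyntax-only tu.cpp`.
    print(make_concepts_tu())
```

Scaling `n_types` up makes the constraint-satisfaction overhead, rather than lexing or codegen, dominate the frontend time.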
I’d like to ask what can be done to help bridge that gap.
Is the main problem a lack of CI resources to run them?
How can we address this?
Performance monitoring is a lot of continual work: you need to select appropriate benchmarks, build infrastructure, continuously monitor the infrastructure, fix it when it breaks, report when there are regressions, follow up when some fraction of developers inevitably ignore the reports, etc. And hope your manager is willing to give you credit for the time you spend on this.
I’m sure someone would be willing to donate compute resources; the problem is finding someone to do the work.
Endill 3
A bit unrelated, but I've been wondering what the workload should be, until I realized that building the libc++ `std` module would be a good start.
mizvekov 4
I have used NVIDIA/stdexec (`std::execution`, the proposed C++ framework for asynchronous and parallel programming) as a benchmark for some of my past work. It's good for the AST-heavy stuff, but as far as I remember the concepts overhead doesn't show much there.
We could optimize CI usage on these tests by running frontend-only jobs, but that would depend on whether other people are interested in codegen too.
+1 having better benchmarks would be nice, but someone needs to do the fixing as well.
I'm somewhat disappointed that LLVM 22 shipped despite the 16% compile-time slowdown after clang PR #161671 (llvm/llvm-project#172266) being marked as a release blocker.
+1, but I think we should also better define what we're trying to benchmark before trying to find those resources or someone to do the work. Are we trying to devise a static test that represents a common set of workloads to catch compile-time performance regressions, or are we trying to devise a more dynamic test, e.g., "when compiling Clang with this fork of Clang, the bootstrap got 4% slower" or "when compiling <something> with this fork of Clang using <this set of options>, it got 4% slower"?
I think the latter is more useful to solve. e.g., @mizvekov was asking about compiling source with C++20 and above, that’s testing a particular set of options that’s not our usual workload. I think the former is basically covered by llvm-compile-time-tracker unless the concern is with the set of projects picked to track.
I’ve been using compiling SemaExpr.cpp as a local perf test, and I don’t think that’s unreasonable.
I think we could build up a set of “basic” tests from single files - something like a collection of the longest build time files from a bunch of top X projects.
The benchmark versions of the files would be post preprocessing so they aren’t impacted by host libraries, and are tested with the relevant build flags from the original preprocessing (sans library dirs, etc).
In principle this would allow platform agnostic testing of compile time performance that can easily be run locally and have a reasonable degree of confidence in the local before/after performance comparison.
The reason for selecting a set of individual files is to keep the runtime to a reasonable length - ideally you’d want a usable/meaningful benchmark run to be a few minutes at most (a full run being multiple before + after runs).
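As a rough illustration of that workflow, here is a hedged sketch (hypothetical script, not an existing harness; it assumes a corpus of preprocessed `.ii` files and a `clang++` on PATH) of timing single preprocessed files with a best-of-N loop:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: time frontend-only compiles of preprocessed files."""
import subprocess
import sys
import time

def build_cmd(compiler: str, path: str, flags: list[str]) -> list[str]:
    # Preprocessed input needs no include paths, only the semantic flags
    # recorded from the original build (sans library dirs, etc.).
    return [compiler, "-fsyntax-only", *flags, path]

def time_file(compiler: str, path: str, flags: list[str], runs: int = 5) -> float:
    """Best-of-N wall time; the minimum is the least noise-contaminated run."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(build_cmd(compiler, path, flags), check=True)
        best = min(best, time.perf_counter() - t0)
    return best

if __name__ == "__main__":
    for f in sys.argv[1:]:
        print(f, f"{time_file('clang++', f, ['-std=c++20']):.3f}s")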
mikeynap 8
Same: we haven't updated to 22 at work due to the >50% compile-time/memory regression on a very template-heavy project. I feel if it were quantified and blinking on some benchmark suite it would get more attention.
Edit: just noticed improvements are on the way! Nice!!
shafik 9
This has the potential to shock us when the time comes to migrate LLVM to C++20
I have had folks tell me that getting their groups to move to C++20 is challenging due to increased compile times, mostly but not solely down to how much larger the std header files are.
FWIW, I’ve put a bunch of time into thinking about at least how to build better frontend benchmarks for Carbon, and it seems reasonably applicable to Clang as well. We even have at least one rudimentary benchmark that covers both Clang and Carbon. All of this is open source under the LLVM license, and if there is interest and a good way, happy to contribute it or find ways to share it. =]
Code:
- carbon-lang/toolchain/driver/compile_benchmark.cpp at trunk (carbon-language/carbon-lang)
- carbon-lang/testing/base/source_gen.h at trunk (carbon-language/carbon-lang)
- carbon-lang/testing/base/source_gen.cpp at trunk (carbon-language/carbon-lang)
This is a very different benchmarking approach from the others mentioned so far. The goal is not to create a reasonable proxy for what real compile times will be; that is, I think, well covered by the techniques mentioned here. Instead, the goal is to create interesting but extremely consistent measurements of important code paths through the frontend.
Why: I moved away from proxies because my experience is that it is very hard to get good measurements from them. Realistic compiles tend to be fairly noisy and to change over time. To get really good data, you want to be able to do many runs and get very accurate timings. However, compiling the same proxy over and over again either requires the proxy to be so large that it takes too long for constant use, or results in misleading timings, as significant amounts of work done on the first compilation provide optimizations (from branch prediction to cache population) to all subsequent compilations. That doesn't mean we shouldn't have proxies to use as a baseline periodically, but I don't think it's easy to use these to get coverage or precision for doing targeted improvements.
The technique I came up with is to create a code generator that synthesizes “interesting” patterns of code in a way that is randomly permuted but completely consistent. So the total number of characters in identifiers is always the same, and the histogram of lengths is the same, but the actual identifiers and the order in which ones of a given length are used varies on each generation. Similarly for the number of declarations, the code constructs, etc.
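A toy version of that idea (my own illustration, far simpler than the Carbon generator; `gen_identifiers` is a hypothetical name): keep the multiset of identifier lengths fixed while permuting both the spellings and their order with a seed:

```python
# Toy sketch of seeded, histogram-preserving identifier synthesis.
import random
import string

def gen_identifiers(length_histogram: dict[int, int], seed: int) -> list[str]:
    """Emit identifiers whose length histogram is fixed, but whose spellings
    and ordering vary with the seed: same total lexing work, unpredictable
    layout, so caches and branch predictors can't learn the input."""
    rng = random.Random(seed)
    lengths = [n for n, count in length_histogram.items() for _ in range(count)]
    rng.shuffle(lengths)  # permute the order in which each length appears
    return ["".join(rng.choice(string.ascii_lowercase) for _ in range(n))
            for n in lengths]

# Two seeds give different identifiers but identical length histograms,
# i.e. the same number of characters for the compiler to process.
a = gen_identifiers({4: 3, 12: 2}, seed=1)
b = gen_identifiers({4: 3, 12: 2}, seed=2)
```

The same seeding discipline would extend to the number of declarations, member counts, and so on, as the post describes.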
The key is to synthesize code that will do the same amount of work, but in an unpredictable order to accurately measure the cold-execution time of the compiler, as that’s what matters for improving performance in practice. And you have to be really serious about what counts as “work” – one of the fastest parts of Clang is skipping comments, but changing the # of bytes in the input by having comment strings 2x longer has a larger effect on my measured compile times than most other changes. So every byte counts here from my experience. This is compounded by wanting the structure of the code to roughly match what clang-format or a human would produce, but you can’t afford to have a non-deterministic number or position of newlines.
Once you have this synthesis system, you can use more traditional microbenchmark techniques (linked above, built on top of google/benchmark) to actually build reasonably stable measurements. My experience is then that you need to do many runs and aggregate them, as there are otherwise too many process-specific variations that will drown out any data in noise. I wrote a custom benchmark runner to do this and compute good statistical measures: carbon-lang/scripts/bench_runner.py at trunk (carbon-language/carbon-lang)
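The aggregation step can be approximated with standard robust statistics. A sketch (my own simplification, not the actual bench_runner.py logic; the 2% threshold and 3x-spread rule are arbitrary assumptions):

```python
# Robust aggregation of many noisy benchmark runs.
import statistics

def summarize(samples: list[float]) -> dict[str, float]:
    """The minimum approximates the noise-free cost, the median resists
    outliers, and the median absolute deviation (MAD) estimates spread."""
    med = statistics.median(samples)
    mad = statistics.median(abs(s - med) for s in samples)
    return {"min": min(samples), "median": med, "mad": mad}

def regressed(before: list[float], after: list[float],
              threshold: float = 0.02) -> bool:
    """Flag a regression only when the median delta exceeds both a relative
    threshold and the combined spread of the two sample sets."""
    b, a = summarize(before), summarize(after)
    delta = a["median"] - b["median"]
    return delta > threshold * b["median"] and delta > 3 * (b["mad"] + a["mad"])
```

Comparing medians guarded by spread, rather than single runs, is what makes before/after deltas trustworthy on noisy machines.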
The source generation in Carbon is specifically designed to support multi-language synthesis, and I want to expand it to cover more interesting patterns. Right now it only generates one specific pattern that I care a lot about: large sequences of declarations of classes with lots of methods and a decent number of member variables, but no definitions. Basically, “boring” but super large header files. Even this has found lots of nice optimizations.
One thing I’ve been trying to expand it to is benchmarking the complete compiler invocation by running the compiler in a subprocess. For small files, it turns out that the compile time of C++ (and Carbon!) is completely dominated by process startup. This is how I found some of the dynamically initialized string tables in Clang a while back. There is probably more to find here.
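That startup-dominated effect is easy to reproduce. A hedged sketch (it times an empty Python interpreter invocation only because that binary is guaranteed to be present; the same loop works with a clang binary and a tiny input file):

```python
# Measure pure process startup + teardown cost with a best-of-N loop.
import subprocess
import sys
import time

def startup_overhead(argv: list[str], runs: int = 5) -> float:
    """Best-of-N wall time for spawning a subprocess that does no real work,
    i.e. an estimate of process startup overhead."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(argv, check=True, capture_output=True)
        best = min(best, time.perf_counter() - t0)
    return best

# An "empty" interpreter run stands in for a compiler invoked on a tiny file.
overhead = startup_overhead([sys.executable, "-c", ""])
```

For small enough inputs, this fixed cost dwarfs the actual parse, which is why dynamically initialized tables in the compiler binary show up so clearly.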
Anyways, happy to answer questions here, and would love to have help expanding source generation to cover more patterns. This approach is really powerful, but it is also really difficult to build the source generation: much harder than the benchmark itself, it turns out. But the rewards look promising.
I think if you add what amounts to a flashing red button under every slow PR saying "this reduces performance by X amount in Y case(s)", it is a lot more likely to be fixed before it is merged.
I think this would be a pretty good idea: libc++ has all the newest features, and this can test modules too.
mizvekov 13
This is to help prevent future regressions, and to help folks working on improvements.
I think we need both. Some of the workloads which could use a lot of improvements are not achievable with the existing set of tests in llvm-compile-time-tracker. That’s the “template-heavy” case I mentioned. Others could be achieved with the existing tests.
Also we are missing a way to adopt performance regression test cases users submit.
For example @hansw2000 provided a nice test case for the concepts regression, but this will be forgotten once it's fixed and tested locally.
I have received other such test cases in the past, which are unfortunately forgotten in the mists of time.
That's a good idea, and what I have done in the past as well. Instead of building the whole of stdexec, we can test only the slowest-compiling file.
I presume for some cases we would want to track regressions in preprocessing, so we can’t say we would always preprocess them before inclusion.
I don’t propose we track how much worse building with C++20 is compared to C++17, though I suppose depending on how we go about this, this number could be easy to provide.
But this is more of a help for the libc++ folks; we can't do much about that on the clang side.
But we do want to know if we caused a regression in C++20 mode from a clang change, that we can do something about.
Thanks! Yeah I remember you shared this idea on last year’s LLVM dev conf dinner.
I’d be interested in trying this out, to create new test cases to include in future improvements.
But the main problem right now is finding a place to put these tests.
I did something very similar with my compiler, which I called code journeys. It's a combination of tooling written in Python to give me static analysis, and AI agents to come up with the scenarios and guide the testing. I have a write-up about it and you can see the breakdown here: What is a Code Journey? | OriLang