Choosing our Benchmarking Strategy
We are going to use google_benchmark,
the standard C++ benchmarking library maintained by Google. It’s widely adopted
across the C++ ecosystem, supports fixtures and parameterized benchmarks with
statistical analysis, and works with CMake, Bazel, and other build systems.
Your First Benchmark
Let’s start by creating a benchmark for a recursive Fibonacci function to see how we can measure computational performance.
Project Setup
First, create a basic project structure:
mkdir my_project && cd my_project
mkdir benchmarks
Writing the Benchmark
Create a new file benchmarks/main.cpp:
benchmarks/main.cpp
#include <benchmark/benchmark.h>
// Recursive Fibonacci function to benchmark
static long long fibonacci(int n) {
if (n <= 1)
return n;
return fibonacci(n - 1) + fibonacci(n - 2);
}
// Define the benchmark
static void BM_Fibonacci(benchmark::State &state) {
// Use a volatile variable to prevent compile-time optimization
volatile int n = 30;
// This loop runs multiple times to get accurate measurements
for (auto _ : state) {
// Prevent compiler from optimizing away the computation
auto result = fibonacci(n);
benchmark::DoNotOptimize(result);
}
}
// Register the benchmark, specifying the time unit as milliseconds for better
// readability
BENCHMARK(BM_Fibonacci)->Unit(benchmark::kMillisecond);
// Entrypoint that runs all registered benchmarks
BENCHMARK_MAIN();
A few things to note:
- volatile int n = 30 prevents the compiler from computing the result at compile time
- benchmark::State& state provides the benchmark loop that runs your code multiple times
- for (auto _ : state) is where your actual benchmark code goes; this loop is timed
- benchmark::DoNotOptimize() prevents the compiler from optimizing away the result
- BENCHMARK() registers your function as a benchmark
- ->Unit(benchmark::kMillisecond) displays results in milliseconds for better readability (the default is nanoseconds)
- BENCHMARK_MAIN() provides the entry point that discovers and runs all benchmarks
Configuration with CMake
Create a CMakeLists.txt file in the benchmarks/ folder:
benchmarks/CMakeLists.txt
cmake_minimum_required(VERSION 3.14)
project(my_benchmarks VERSION 0.1.0 LANGUAGES CXX)
# Use C++17 (or your preferred version)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Enable optimizations with debug symbols for profiling
set(CMAKE_BUILD_TYPE RelWithDebInfo)
# Fetch google_benchmark from CodSpeed's repository
include(FetchContent)
FetchContent_Declare(
google_benchmark
GIT_REPOSITORY https://github.com/CodSpeedHQ/codspeed-cpp
SOURCE_SUBDIR google_benchmark
GIT_TAG main
)
set(BENCHMARK_DOWNLOAD_DEPENDENCIES ON)
FetchContent_MakeAvailable(google_benchmark)
# Create the benchmark executable
add_executable(bench main.cpp)
# Link against google_benchmark
target_link_libraries(bench benchmark::benchmark)
Key configuration points:
- CMAKE_BUILD_TYPE RelWithDebInfo enables optimizations with debug symbols for accurate profiling
- We use CodSpeed's fork of google_benchmark, which adds performance measurement capabilities and CI integration
- BENCHMARK_DOWNLOAD_DEPENDENCIES ON allows google_benchmark to download its dependencies
Building and Running the Benchmark
Build your benchmark:
cd benchmarks
mkdir build && cd build
cmake ..
make
You should see output like:
-- The CXX compiler identification is GNU 14.2.1
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Configuring done (8.6s)
-- Generating done (0.1s)
-- Build files have been written to: /home/user/my_project/benchmarks/build
[ 1%] Building CXX object ...
...
[100%] Built target bench
Now run your benchmark:
./bench
You should see output like this:
2025-12-01T17:24:27+01:00
Running ./bench
Run on (8 X 24 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x8)
Load Average: 8.47, 7.96, 7.04
-------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------
BM_Fibonacci 2.74 ms 2.65 ms 271
Congratulations! You’ve created your first C++ benchmark. The output shows that
computing fibonacci(30) takes about 2.74 milliseconds on average.
Benchmarking with Parameters
So far, we’ve only tested our function with a single input (n=30). But what if
we want to see how performance changes with different input sizes? This is where
DenseRange comes in.
Let’s add a parameterized benchmark to test Fibonacci with various input sizes.
Update your main.cpp to include:
benchmarks/main.cpp
// Define the benchmark with a parameter
static void BM_Fibonacci_DenseRange(benchmark::State &state) {
// Get the input value from the benchmark parameter
volatile int n = state.range(0);
for (auto _ : state) {
auto result = fibonacci(n);
benchmark::DoNotOptimize(result);
}
}
// Test Fibonacci with inputs from 15 to 35 in steps of 5
BENCHMARK(BM_Fibonacci_DenseRange)
->DenseRange(15, 35, 5) // Test inputs 15, 20, 25, 30, 35
->Unit(benchmark::kMillisecond);
Now state.range(0) gives us the input parameter, and DenseRange(15, 35, 5)
tells the benchmark to run with inputs 15, 20, 25, 30, and 35.
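If you only need a handful of specific inputs rather than an evenly spaced range, you can also register them one by one with Arg(). Here is a minimal sketch reusing the same fibonacci function (the BM_Fibonacci_Args name is just for illustration):
// Equivalent registration with explicit arguments instead of a range
static void BM_Fibonacci_Args(benchmark::State &state) {
  volatile int n = state.range(0);
  for (auto _ : state) {
    auto result = fibonacci(n);
    benchmark::DoNotOptimize(result);
  }
}
// Each Arg() call registers one more input value
BENCHMARK(BM_Fibonacci_Args)
    ->Arg(20)
    ->Arg(30)
    ->Unit(benchmark::kMillisecond);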
Rebuild and run:
make
./bench --benchmark_filter=Fibonacci_DenseRange
You should see output like:
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_Fibonacci_DenseRange/15 0.002 ms 0.002 ms 380948
BM_Fibonacci_DenseRange/20 0.022 ms 0.021 ms 33413
BM_Fibonacci_DenseRange/25 0.276 ms 0.234 ms 3050
BM_Fibonacci_DenseRange/30 2.62 ms 2.59 ms 278
BM_Fibonacci_DenseRange/35 28.1 ms 28.0 ms 25
Notice how the execution time grows exponentially with the input size, clearly demonstrating the O(2^n) complexity of the recursive Fibonacci algorithm. This is the power of parameterized benchmarks – they help you understand how your code scales with different inputs.
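Google Benchmark can also compute a big-O estimate from a parameter sweep: report the problem size with state.SetComplexityN() and add Complexity() when registering. Here is a small sketch building on the same fibonacci function (the BM_Fibonacci_Complexity name is illustrative); note that the built-in fits cover constant, logarithmic, and polynomial curves, so for an exponential algorithm like this one the raw timings remain the clearest signal:
// Report an automatically fitted complexity estimate alongside the timings
static void BM_Fibonacci_Complexity(benchmark::State &state) {
  volatile int n = state.range(0);
  for (auto _ : state) {
    auto result = fibonacci(n);
    benchmark::DoNotOptimize(result);
  }
  // Tell the library what "N" was for this run so it can fit a curve
  state.SetComplexityN(state.range(0));
}
// Complexity() with no argument picks the best-fitting built-in curve
BENCHMARK(BM_Fibonacci_Complexity)
    ->DenseRange(15, 35, 5)
    ->Complexity();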
Multiple Arguments
What if your function takes multiple parameters? For example, let’s benchmark
the performance of std::string::find() with varying text and pattern sizes.
Let’s add a new benchmark to main.cpp:
benchmarks/main.cpp
// ... (previous code) ...
#include <string>
static void BM_StringFind(benchmark::State& state) {
size_t string_size = state.range(0);
size_t pattern_size = state.range(1);
// Setup
std::string text(string_size, 'a');
std::string pattern(pattern_size, 'b');
// Place pattern near the end for worst-case scenario
text.replace(string_size - pattern_size, pattern_size, pattern);
// Benchmark
for (auto _ : state) {
auto pos = text.find(pattern);
benchmark::DoNotOptimize(pos);
}
}
// Benchmark different combinations of text and pattern sizes using ArgsProduct
BENCHMARK(BM_StringFind)
->ArgsProduct({
{1000, 10000, 100000}, // Text sizes
{50, 500} // Pattern sizes
});
The ArgsProduct() function creates benchmarks for all combinations of the
provided argument lists. In this case, it generates 6 benchmarks (3 text sizes ×
2 pattern sizes), letting you analyze how both parameters affect performance.
Here is the output when you run this benchmark:
./bench --benchmark_filter=StringFind
...
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_StringFind/1000/50 28.7 ns 28.0 ns 25077651
BM_StringFind/10000/50 337 ns 237 ns 3123341
BM_StringFind/100000/50 2157 ns 2066 ns 287731
BM_StringFind/1000/500 30.3 ns 28.6 ns 24820407
BM_StringFind/10000/500 248 ns 243 ns 2987100
BM_StringFind/100000/500 2075 ns 2031 ns 348384
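If the full Cartesian product is more than you need, Args() registers a single combination per call. A minimal sketch of this alternative registration for the same BM_StringFind function (shown on its own, not in addition to the ArgsProduct version):
// Register only the specific (text size, pattern size) pairs you care about
BENCHMARK(BM_StringFind)
    ->Args({1000, 50})
    ->Args({100000, 500});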
Benchmarking Only What Matters
Sometimes you have expensive setup that shouldn't be included in your benchmark measurements, such as loading data from a file or building large data structures. Google Benchmark provides several ways to handle this.
Fresh Setup per Iteration
Let’s benchmark a sorting algorithm where we need fresh data for each iteration.
We do not want the data generation time to be included in the benchmark. We can
exclude it using PauseTiming() and ResumeTiming():
benchmarks/main.cpp
// ... (previous code) ...
#include <algorithm>
#include <random>
#include <vector>
static void BM_SortVector(benchmark::State &state) {
size_t size = state.range(0);
std::mt19937 gen(42); // Fixed seed for reproducibility
for (auto _ : state) {
// Pause timing during setup
state.PauseTiming();
// Generate random data (NOT measured)
std::vector<int> data(size);
std::uniform_int_distribution<> dis(1, 10000);
for (size_t i = 0; i < size; ++i) {
data[i] = dis(gen);
}
// Resume timing for the actual work
state.ResumeTiming();
// Sort the vector (MEASURED)
std::sort(data.begin(), data.end());
benchmark::DoNotOptimize(data.data());
benchmark::ClobberMemory();
}
}
BENCHMARK(BM_SortVector)->Range(100, 100000)->Unit(benchmark::kMicrosecond);
The setup code (generating random data) runs before each iteration but isn’t
included in the timing. Only the std::sort() call is measured.
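Keep in mind that PauseTiming() and ResumeTiming() add a small overhead on every call, which can distort results when the measured region is very short. One alternative is manual timing: measure the interesting region yourself and report it with SetIterationTime(), registering the benchmark with UseManualTime(). Here is a sketch of that approach for the same sort benchmark (the BM_SortVectorManualTime name is illustrative):
#include <chrono>
static void BM_SortVectorManualTime(benchmark::State &state) {
  size_t size = state.range(0);
  std::mt19937 gen(42); // Fixed seed for reproducibility
  std::uniform_int_distribution<> dis(1, 10000);
  for (auto _ : state) {
    // Data generation happens outside the manually timed region
    std::vector<int> data(size);
    for (size_t i = 0; i < size; ++i) {
      data[i] = dis(gen);
    }
    auto start = std::chrono::high_resolution_clock::now();
    std::sort(data.begin(), data.end()); // Only this is reported
    auto end = std::chrono::high_resolution_clock::now();
    benchmark::DoNotOptimize(data.data());
    // Report the sort duration for this iteration
    state.SetIterationTime(std::chrono::duration<double>(end - start).count());
  }
}
BENCHMARK(BM_SortVectorManualTime)
    ->Range(100, 100000)
    ->UseManualTime()
    ->Unit(benchmark::kMicrosecond);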
Shared Setup for All Iterations
When you can reuse the same data across iterations, fixtures are more efficient. A fixture is a class that defines SetUp and TearDown methods which run outside the timed benchmark loop, once for all iterations. Here is an example where we set up a sorted vector once for all iterations and benchmark binary search on it:
benchmarks/main.cpp
// Define a fixture class that sets up a random vector for searching
class VectorFixture : public benchmark::Fixture {
public:
std::vector<int> data;
// Setup runs once before all iterations
void SetUp(const ::benchmark::State &state) {
size_t size = state.range(0);
std::mt19937 gen(42); // Fixed seed for reproducibility
std::uniform_int_distribution<> dis(1, size);
data.resize(size);
for (size_t i = 0; i < size; ++i) {
data[i] = dis(gen);
}
std::sort(data.begin(), data.end());
}
// TearDown runs once after all iterations
void TearDown(const ::benchmark::State &) { data.clear(); }
};
// Define the BinarySearch benchmark using VectorFixture
BENCHMARK_DEFINE_F(VectorFixture, BinarySearch)(benchmark::State &state) {
int target = data.size() / 2;
for (auto _ : state) {
// Only this is measured
bool found = std::binary_search(data.begin(), data.end(), target);
benchmark::DoNotOptimize(found);
}
}
// Register the fixture benchmark with different vector sizes
BENCHMARK_REGISTER_F(VectorFixture, BinarySearch)->Range(1000, 100000);
In this example, the SetUp() method initializes a sorted vector once before
all iterations, and TearDown() cleans up afterward. The benchmark only
measures the std::binary_search() calls. Fixtures use different macros:
BENCHMARK_DEFINE_F to define and BENCHMARK_REGISTER_F to register with
parameters.
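When a fixture benchmark takes no parameters, the BENCHMARK_F macro defines and registers it in one step. A minimal sketch with a hypothetical fixed-size fixture (SmallVectorFixture is made up for illustration):
// Fixture with a fixed-size, already sorted data set
class SmallVectorFixture : public benchmark::Fixture {
public:
  std::vector<int> data;
  void SetUp(const ::benchmark::State &) {
    data.resize(1000);
    for (size_t i = 0; i < data.size(); ++i) {
      data[i] = static_cast<int>(i);
    }
  }
  void TearDown(const ::benchmark::State &) { data.clear(); }
};
// BENCHMARK_F defines the body and registers the benchmark at the same time
BENCHMARK_F(SmallVectorFixture, BinarySearchFixed)(benchmark::State &state) {
  for (auto _ : state) {
    bool found = std::binary_search(data.begin(), data.end(), 500);
    benchmark::DoNotOptimize(found);
  }
}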
Best Practices
Prevent Compiler Optimizations
The C++ compiler is extremely aggressive with optimizations. Always protect your benchmarks:
// ❌ BAD: Compiler might optimize everything away
static void BM_Bad(benchmark::State& state) {
for (auto _ : state) {
int x = 42;
int y = x * 2; // Compiler knows this is 84 at compile time
}
}
// ✅ GOOD: Use DoNotOptimize for values
static void BM_Good(benchmark::State& state) {
for (auto _ : state) {
int x = 42;
benchmark::DoNotOptimize(x);
int y = x * 2;
benchmark::DoNotOptimize(y);
}
}
// ✅ BETTER: Use DoNotOptimize and ClobberMemory
static void BM_Better(benchmark::State& state) {
for (auto _ : state) {
int x = 42;
benchmark::DoNotOptimize(x);
int y = x * 2;
benchmark::DoNotOptimize(y);
benchmark::ClobberMemory();
}
}
Keep Benchmarks Deterministic
Use fixed seeds for random number generators:
// ❌ BAD: Non-deterministic results
static void BM_NonDeterministic(benchmark::State& state) {
std::random_device rd;
std::mt19937 gen(rd()); // Different every run!
for (auto _ : state) {
// ...
}
}
// ✅ GOOD: Deterministic with fixed seed
static void BM_Deterministic(benchmark::State& state) {
std::mt19937 gen(42); // Fixed seed
for (auto _ : state) {
// ...
}
}
Benchmark Real-World Code
In real projects, you’ll benchmark functions from your library. Here’s a typical structure for a C++ project with benchmarks:
my_project/
├── CMakeLists.txt
├── include/
│ └── mylib/
│ └── algorithms.hpp
├── src/
│ └── algorithms.cpp
└── benchmarks/
└── bench_algorithms.cpp
The header include/mylib/algorithms.hpp defines your library’s API:
include/mylib/algorithms.hpp
#pragma once
#include <vector>
namespace mylib {
std::vector<int> bubble_sort(std::vector<int> arr);
} // namespace mylib
The implementation src/algorithms.cpp contains the actual algorithm:
src/algorithms.cpp
#include "mylib/algorithms.hpp"
namespace mylib {
std::vector<int> bubble_sort(std::vector<int> arr) {
size_t n = arr.size();
for (size_t i = 0; i < n; ++i) {
for (size_t j = 0; j < n - 1 - i; ++j) {
if (arr[j] > arr[j + 1]) {
std::swap(arr[j], arr[j + 1]);
}
}
}
return arr;
}
} // namespace mylib
The benchmark benchmarks/bench_algorithms.cpp tests the bubble sort function:
benchmarks/bench_algorithms.cpp
#include "mylib/algorithms.hpp"
#include <benchmark/benchmark.h>
#include <random>
// Define a fixture class that sets up random data for sorting
class SortFixture : public benchmark::Fixture {
public:
std::vector<int> original_data;
// Setup runs once before all iterations
void SetUp(const ::benchmark::State &state) {
size_t size = state.range(0);
std::mt19937 gen(42); // Fixed seed for reproducibility
std::uniform_int_distribution<> dis(1, size);
original_data.resize(size);
for (size_t i = 0; i < size; ++i) {
original_data[i] = dis(gen);
}
}
// TearDown runs once after all iterations
void TearDown(const ::benchmark::State &) { original_data.clear(); }
};
// Define the BubbleSort benchmark using SortFixture
BENCHMARK_DEFINE_F(SortFixture, BubbleSort)(benchmark::State &state) {
for (auto _ : state) {
// Make a copy of the original data for each iteration
// Only the sorting is measured, not the copy
state.PauseTiming();
std::vector<int> data = original_data;
state.ResumeTiming();
auto sorted = mylib::bubble_sort(data);
benchmark::DoNotOptimize(sorted.data());
benchmark::ClobberMemory();
}
}
// Register the fixture benchmark with different data sizes
BENCHMARK_REGISTER_F(SortFixture, BubbleSort)
->Range(1000, 100000)
->Unit(benchmark::kMillisecond);
BENCHMARK_MAIN();
Update your CMakeLists.txt to build both your library and benchmarks:
cmake_minimum_required(VERSION 3.14)
project(mylib VERSION 0.1.0 LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Enable optimizations with debug symbols for profiling
set(CMAKE_BUILD_TYPE RelWithDebInfo)
# Your library
add_library(mylib src/algorithms.cpp)
target_include_directories(mylib PUBLIC include)
# Fetch google_benchmark
include(FetchContent)
FetchContent_Declare(
google_benchmark
GIT_REPOSITORY https://github.com/CodSpeedHQ/codspeed-cpp
SOURCE_SUBDIR google_benchmark
GIT_TAG main
)
set(BENCHMARK_DOWNLOAD_DEPENDENCIES ON)
FetchContent_MakeAvailable(google_benchmark)
# Benchmark executable
add_executable(bench_algorithms benchmarks/bench_algorithms.cpp)
target_link_libraries(bench_algorithms mylib benchmark::benchmark)
You can now build and run your benchmarks with the following commands:
mkdir build && cd build
cmake ..
make
./bench_algorithms
This will yield an output similar to:
2025-12-02T16:50:44+01:00
Running ./bench_algorithms
Run on (8 X 24 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x8)
Load Average: 9.83, 10.83, 8.99
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
SortFixture/BubbleSort/1000 0.381 ms 0.321 ms 2219
SortFixture/BubbleSort/4096 5.80 ms 4.97 ms 136
SortFixture/BubbleSort/32768 732 ms 718 ms 1
SortFixture/BubbleSort/100000 10848 ms 9529 ms 1
Running Benchmarks Continuously with CodSpeed
So far, you’ve been running benchmarks locally. But local benchmarking has limitations:
- Inconsistent hardware: Different developers get different results
- Manual process: Easy to forget to run benchmarks before merging
- No historical tracking: Hard to spot gradual performance degradation
- No PR context: Can’t see performance impact during code review
This is where CodSpeed comes in. It runs your benchmarks automatically in CI and provides:
- Automated performance regression detection in PRs
- Consistent metrics with reliable measurements across all runs
- Historical tracking to see performance over time with detailed charts
- Flamegraph profiles to see exactly what changed in your code’s execution
How to set up CodSpeed with google_benchmark
Here’s how to integrate CodSpeed with your google_benchmark benchmarks using
CMake:
Next Steps
Check out these resources to continue your C++ benchmarking journey: