How do I profile C++ code running on Linux?

Survey of C++ profiling techniques: gprof vs valgrind vs perf vs gperftools

In this answer, I will use several different tools to a analyze a few very simple test programs, in order to concretely compare how those tools work.

Summary of results:

tool	recompile	slowdown (1x == no slowdown, 2x == twice as slow)
gprof	`-pg`	1x (2x-3x reported in real use cases)
perf	no	1x
valgrind	no	15x (80x reported in real use cases)
gperftools	no

The following test program is very simple and does the following:

main calls fast and maybe_slow 3 times, one of the maybe_slow calls being slow

The slow call of maybe_slow is 10x longer, and dominates runtime if we consider calls to the child function common. Ideally, the profiling tool will be able to point us to the specific slow call.
both fast and maybe_slow call common, which accounts for the bulk of the program execution
The program interface is:
```
./main.out [n [seed]]
```
and the program does O(n^2) loops in total. seed is just to get different output without affecting runtime.

main.c

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

uint64_t __attribute__ ((noinline)) common(uint64_t n, uint64_t seed) {
    for (uint64_t i = 0; i < n; ++i) {
        seed = (seed * seed) - (3 * seed) + 1;
    }
    return seed;
}

uint64_t __attribute__ ((noinline)) fast(uint64_t n, uint64_t seed) {
    uint64_t max = (n / 10) + 1;
    for (uint64_t i = 0; i < max; ++i) {
        seed = common(n, (seed * seed) - (3 * seed) + 1);
    }
    return seed;
}

uint64_t __attribute__ ((noinline)) maybe_slow(uint64_t n, uint64_t seed, int is_slow) {
    uint64_t max = n;
    if (is_slow) {
        max *= 10;
    }
    for (uint64_t i = 0; i < max; ++i) {
        seed = common(n, (seed * seed) - (3 * seed) + 1);
    }
    return seed;
}

int main(int argc, char **argv) {
    uint64_t n, seed;
    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 1;
    }
    if (argc > 2) {
        seed = strtoll(argv[2], NULL, 0);
    } else {
        seed = 0;
    }
    seed += maybe_slow(n, seed, 0);
    seed += fast(n, seed);
    seed += maybe_slow(n, seed, 1);
    seed += fast(n, seed);
    seed += maybe_slow(n, seed, 0);
    seed += fast(n, seed);
    printf("%" PRIX64 "\n", seed);
    return EXIT_SUCCESS;
}

Default compilation unless otherwise specified:

gcc -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c

gprof

gprof requires recompiling the software with instrumentation, and it also uses a sampling approach together with that instrumentation. It therefore strikes a balance between accuracy (sampling is not always fully accurate and can skip functions) and execution slowdown (instrumentation and sampling are relatively fast techniques that don't slow down execution very much).

gprof is built-into GCC/binutils, so all we have to do is to compile with the -pg option to enable gprof. We then run the program normally with a size CLI parameter that produces a run of reasonable duration of a few seconds (10000):

gcc -pg -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
time ./main.out 10000

For educational reasons, we will also do a run without optimizations enabled. Note that this is useless in practice, as you normally only care about optimizing the performance of the optimized program:

gcc -pg -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
./main.out 10000

First, time tells us that the execution time with and without -pg were the same, which is great: no slowdown! I have however seen accounts of 2x - 3x slowdowns on complex software, e.g. as shown in this ticket.

Because we compiled with -pg, running the program produces a file gmon.out file containing the profiling data.

We can observe that file graphically with gprof2dot as asked at: Is it possible to get a graphical representation of gprof results?

sudo apt install graphviz
python3 -m pip install --user gprof2dot
gprof main.out > main.gprof
gprof2dot < main.gprof | dot -Tsvg -o output.svg

Here, the gprof tool reads the gmon.out trace information, and generates a human readable report in main.gprof, which gprof2dot then reads to generate a graph.

The source for gprof2dot is at: https://github.com/jrfonseca/gprof2dot

We observe the following for the -O0 run:

and for the -O3 run:

The -O0 output is pretty much self-explanatory. For example, it shows that the 3 maybe_slow calls and their child calls take up 97.56% of the total runtime, although execution of maybe_slow itself without children accounts for 0.00% of the total execution time, i.e. almost all the time spent in that function was spent on child calls.

TODO: why is main missing from the -O3 output, even though I can see it on a bt in GDB? Missing function from GProf output I think it is because gprof is also sampling based in addition to its compiled instrumentation, and the -O3 main is just too fast and got no samples.

I choose SVG output instead of PNG because the SVG is searchable with Ctrl + F and the file size can be about 10x smaller. Also, the width and height of the generated image can be humoungous with tens of thousands of pixels for complex software, and GNOME eog 3.28.1 bugs out in that case for PNGs, while SVGs get opened by my browser automatically. gimp 2.8 worked well though, see also:

but even then, you will be dragging the image around a lot to find what you want, see e.g. this image from a "real" software example taken from this ticket:

Can you find the most critical call stack easily with all those tiny unsorted spaghetti lines going over one another? There might be better dot options I'm sure, but I don't want to go there now. What we really need is a proper dedicated viewer for it, but I haven't found one yet:

You can however use the color map to mitigate those problems a bit. For example, on the previous huge image, I finally managed to find the critical path on the left when I made the brilliant deduction that green comes after red, followed finally by darker and darker blue.

Alternatively, we can also observe the text output of the gprof built-in binutils tool which we previously saved at:

cat main.gprof

By default, this produces an extremely verbose output that explains what the output data means. Since I can't explain better than that, I'll let you read it yourself.

Once you have understood the data output format, you can reduce verbosity to show just the data without the tutorial with the -b option:

gprof -b main.out

In our example, outputs were for -O0:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
100.35      3.67     3.67   123003     0.00     0.00  common
  0.00      3.67     0.00        3     0.00     0.03  fast
  0.00      3.67     0.00        3     0.00     1.19  maybe_slow

            Call graph


granularity: each sample hit covers 2 byte(s) for 0.27% of 3.67 seconds

index % time    self  children    called     name
                0.09    0.00    3003/123003      fast [4]
                3.58    0.00  120000/123003      maybe_slow [3]
[1]    100.0    3.67    0.00  123003         common [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    3.67                 main [2]
                0.00    3.58       3/3           maybe_slow [3]
                0.00    0.09       3/3           fast [4]
-----------------------------------------------
                0.00    3.58       3/3           main [2]
[3]     97.6    0.00    3.58       3         maybe_slow [3]
                3.58    0.00  120000/123003      common [1]
-----------------------------------------------
                0.00    0.09       3/3           main [2]
[4]      2.4    0.00    0.09       3         fast [4]
                0.09    0.00    3003/123003      common [1]
-----------------------------------------------

Index by function name

   [1] common                  [4] fast                    [3] maybe_slow

and for -O3:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
100.52      1.84     1.84   123003    14.96    14.96  common

            Call graph


granularity: each sample hit covers 2 byte(s) for 0.54% of 1.84 seconds

index % time    self  children    called     name
                0.04    0.00    3003/123003      fast [3]
                1.79    0.00  120000/123003      maybe_slow [2]
[1]    100.0    1.84    0.00  123003         common [1]
-----------------------------------------------
                                                 <spontaneous>
[2]     97.6    0.00    1.79                 maybe_slow [2]
                1.79    0.00  120000/123003      common [1]
-----------------------------------------------
                                                 <spontaneous>
[3]      2.4    0.00    0.04                 fast [3]
                0.04    0.00    3003/123003      common [1]
-----------------------------------------------

Index by function name

   [1] common

As a very quick summary for each section e.g.:

                0.00    3.58       3/3           main [2]
[3]     97.6    0.00    3.58       3         maybe_slow [3]
                3.58    0.00  120000/123003      common [1]

centers around the function that is left indented (maybe_slow). [3] is the ID of that function. Above the function, are its callers, and below it the callees.

For -O3, see here like in the graphical output that maybe_slow and fast don't have a known parent, which is what the documentation says that <spontaneous> means.

I'm not sure if there is a nice way to do line-by-line profiling with gprof: `gprof` time spent in particular lines of code

perf from linux-tools

perf works by sampling which part of the programming is running regularly. This makes it very simple to setup without any recompilation. It can even attach to programs that are already running. The downside is that we lose a bit of data compared to gprof instrumented code; notably we don't know how many times each function is called.

This section was tested on Ubuntu 23.10:

sudo apt install linux-tools-common linux-tools-generic

And then you also need:

sudo sysctl kernel.perf_event_paranoid=-1 kernel.kptr_restrict=0

or to persist those options that across reboots:

printf 'kernel.perf_event_paranoid = -1
kernel.kptr_restrict = 0
' | sudo tee -a /etc/sysctl.conf

Collect data:

time perf record --call-graph dwarf ./main.out 10000

This produces a 42 MB file perf.data and adds a fixed 1 s overhead to execution, so it's fine time-wise. Ther --call-graph dwarf option is mentioned at: Call stack in the perf profiler

Next we can inspect the data interactively with:

perf report

Initially this gives us a view:

TODO how to always sort by "Self"? Sorting by self column in perf report It seemed to randomly sort by one column or another from time to time?

This immediately tells us that common is what takes up most of the execution time. If we select the "common" line and hit + (plus sign) or e (expand) and keep expanding recursively we finally see:

which tells us exactly what we want: maybe_slow calls to common are the slow ones.

Another mega-cool thing we can do is to hit a (annotate) on the common line, which gives us this assembly level breakdown of the function

Another interesting visualization is with https://github.com/brendangregg/FlameGraph which is also mentioned at: http://www.brendangregg.com/perf.html#FlameGraphs

git clone https://github.com/brendangregg/FlameGraph
git -C FlameGraph/ checkout cd9ee4c4449775a2f867acf31c84b7fe4b132ad5
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg

and we get:

On the a more complex example I got from another program it becomes clearer what the graph means:

TODO there are a log of [unknown] functions in that example, why is that?

Another perf GUI interfaces which might be worth it include:

Eclipse Trace Compass plugin: https://www.eclipse.org/tracecompass/

But this has the downside that you have to first convert the data to the Common Trace Format, which can be done with perf data --to-ctf, but it needs to be enabled at build time/have perf new enough, either of which is not the case for the perf in Ubuntu 18.04
https://github.com/KDAB/hotspot

The downside of this is that there seems to be no Ubuntu package, and building it requires Qt 5.10 while Ubuntu 18.04 is at Qt 5.9.

But David Faure mentions in the comments that there is no an AppImage package which might be a convenient way to use it.

valgrind callgrind

valgrind runs the program through the valgrind virtual machine. This makes the profiling very accurate, but it also produces a very large slowdown of the program.

callgrind is the valgrind's tool to profile code and kcachegrind is a KDE program that can visualize cachegrind output. I have also mentioned kcachegrind previously at: Tools to get a pictorial function call graph of code

First we have to remove the -pg flag to go back to normal compilation, otherwise the run actually fails with Profiling timer expired, and yes, this is so common that I did and there was a Stack Overflow question for it.

So we compile and run as:

sudo apt install kcachegrind valgrind
gcc -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
time valgrind --tool=callgrind valgrind --dump-instr=yes \
  --collect-jumps=yes ./main.out 10000

I enable --dump-instr=yes --collect-jumps=yes because this also dumps information that enables us to view a per assembly line breakdown of performance, at a relatively small added overhead cost.

Off the bat, time tells us that the program took 29.5 seconds to execute, so we had a slowdown of about 15x on this example. Clearly, this slowdown is going to be a serious limitation for larger workloads. On the "real world software example" mentioned here, I observed a slowdown of 80x.

The run generates a profile data file named callgrind.out.<pid> e.g. callgrind.out.8554 in my case. We view that file with:

kcachegrind callgrind.out.8554

which shows a GUI that contains data similar to the textual gprof output:

Also, if we go on the bottom right "Call Graph" tab, we see a call graph which we can export by right clicking it to obtain the following image with unreasonable amounts of white border :-)

I think fast is not showing on that graph because kcachegrind must have simplified the visualization because that call takes up too little time, this will likely be the behavior you want on a real program. The right click menu has some settings to control when to cull such nodes, but I couldn't get it to show such a short call after a quick attempt. If I click on fast on the left window, it does show a call graph with fast, so that stack was actually captured. No one had yet found a way to show the complete graph call graph: Make callgrind show all function calls in the kcachegrind callgraph

TODO on complex C++ software, I see some entries of type <cycle N>, e.g. <cycle 11> where I'd expect function names, what does that mean? I noticed there is a "Cycle Detection" button to toggle that on and off, but what does it mean?

gperftools

Previously called "Google Performance Tools", source: https://github.com/gperftools/gperftools Sample based.

First install gperftools with:

sudo apt install google-perftools

Then, we can enable the gperftools CPU profiler in two ways: at runtime, or at build time.

At runtime, we have to pass set the LD_PRELOAD to point to libprofiler.so, which you can find with locate libprofiler.so, e.g. on my system:

gcc -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so \
  CPUPROFILE=prof.out ./main.out 10000

Alternatively, we can build the library in at link time, dispensing passing LD_PRELOAD at runtime:

gcc -Wl,--no-as-needed,-lprofiler,--as-needed -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
CPUPROFILE=prof.out ./main.out 10000

The nicest way to view this data I've found so far is to make pprof output the same format that kcachegrind takes as input (yes, the Valgrind-project-viewer-tool) and use kcachegrind to view that:

google-pprof --callgrind main.out prof.out  > callgrind.out
kcachegrind callgrind.out

After running with either of those methods, we get a prof.out profile data file as output. We can view that file graphically as an SVG with:

google-pprof --web main.out prof.out

which gives as a familiar call graph like other tools, but with the clunky unit of number of samples rather than seconds.

Alternatively, we can also get some textual data with:

google-pprof --text main.out prof.out

which gives:

Using local file main.out.
Using local file prof.out.
Total: 187 samples
     187 100.0% 100.0%      187 100.0% common
       0   0.0% 100.0%      187 100.0% __libc_start_main
       0   0.0% 100.0%      187 100.0% _start
       0   0.0% 100.0%        4   2.1% fast
       0   0.0% 100.0%      187 100.0% main
       0   0.0% 100.0%      183  97.9% maybe_slow