Survey of C++ profiling techniques: gprof vs valgrind vs perf vs gperftools
In this answer, I will use several different tools to a analyze a few very simple test programs, in order to concretely compare how those tools work.
Summary of results:
| tool | recompile | slowdown (1x == no slowdown, 2x == twice as slow) |
|---|---|---|
| gprof | -pg |
1x (2x-3x reported in real use cases) |
| perf | no | 1x |
| valgrind | no | 15x (80x reported in real use cases) |
| gperftools | no |
The following test program is very simple and does the following:
maincallsfastandmaybe_slow3 times, one of themaybe_slowcalls being slowThe slow call of
maybe_slowis 10x longer, and dominates runtime if we consider calls to the child functioncommon. Ideally, the profiling tool will be able to point us to the specific slow call.both
fastandmaybe_slowcallcommon, which accounts for the bulk of the program executionThe program interface is:
./main.out [n [seed]]and the program does
O(n^2)loops in total.seedis just to get different output without affecting runtime.
main.c
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
uint64_t __attribute__ ((noinline)) common(uint64_t n, uint64_t seed) {
for (uint64_t i = 0; i < n; ++i) {
seed = (seed * seed) - (3 * seed) + 1;
}
return seed;
}
uint64_t __attribute__ ((noinline)) fast(uint64_t n, uint64_t seed) {
uint64_t max = (n / 10) + 1;
for (uint64_t i = 0; i < max; ++i) {
seed = common(n, (seed * seed) - (3 * seed) + 1);
}
return seed;
}
uint64_t __attribute__ ((noinline)) maybe_slow(uint64_t n, uint64_t seed, int is_slow) {
uint64_t max = n;
if (is_slow) {
max *= 10;
}
for (uint64_t i = 0; i < max; ++i) {
seed = common(n, (seed * seed) - (3 * seed) + 1);
}
return seed;
}
int main(int argc, char **argv) {
uint64_t n, seed;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 1;
}
if (argc > 2) {
seed = strtoll(argv[2], NULL, 0);
} else {
seed = 0;
}
seed += maybe_slow(n, seed, 0);
seed += fast(n, seed);
seed += maybe_slow(n, seed, 1);
seed += fast(n, seed);
seed += maybe_slow(n, seed, 0);
seed += fast(n, seed);
printf("%" PRIX64 "\n", seed);
return EXIT_SUCCESS;
}
Default compilation unless otherwise specified:
gcc -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
gprof
gprof requires recompiling the software with instrumentation, and it also uses a sampling approach together with that instrumentation. It therefore strikes a balance between accuracy (sampling is not always fully accurate and can skip functions) and execution slowdown (instrumentation and sampling are relatively fast techniques that don't slow down execution very much).
gprof is built-into GCC/binutils, so all we have to do is to compile with the -pg option to enable gprof. We then run the program normally with a size CLI parameter that produces a run of reasonable duration of a few seconds (10000):
gcc -pg -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
time ./main.out 10000
For educational reasons, we will also do a run without optimizations enabled. Note that this is useless in practice, as you normally only care about optimizing the performance of the optimized program:
gcc -pg -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
./main.out 10000
First, time tells us that the execution time with and without -pg were the same, which is great: no slowdown! I have however seen accounts of 2x - 3x slowdowns on complex software, e.g. as shown in this ticket.
Because we compiled with -pg, running the program produces a file gmon.out file containing the profiling data.
We can observe that file graphically with gprof2dot as asked at: Is it possible to get a graphical representation of gprof results?
sudo apt install graphviz
python3 -m pip install --user gprof2dot
gprof main.out > main.gprof
gprof2dot < main.gprof | dot -Tsvg -o output.svg
Here, the gprof tool reads the gmon.out trace information, and generates a human readable report in main.gprof, which gprof2dot then reads to generate a graph.
The source for gprof2dot is at: https://github.com/jrfonseca/gprof2dot
We observe the following for the -O0 run:
and for the -O3 run:
The -O0 output is pretty much self-explanatory. For example, it shows that the 3 maybe_slow calls and their child calls take up 97.56% of the total runtime, although execution of maybe_slow itself without children accounts for 0.00% of the total execution time, i.e. almost all the time spent in that function was spent on child calls.
TODO: why is main missing from the -O3 output, even though I can see it on a bt in GDB? Missing function from GProf output I think it is because gprof is also sampling based in addition to its compiled instrumentation, and the -O3 main is just too fast and got no samples.
I choose SVG output instead of PNG because the SVG is searchable with Ctrl + F and the file size can be about 10x smaller. Also, the width and height of the generated image can be humoungous with tens of thousands of pixels for complex software, and GNOME eog 3.28.1 bugs out in that case for PNGs, while SVGs get opened by my browser automatically. gimp 2.8 worked well though, see also:
- https://askubuntu.com/questions/1112641/how-to-view-extremely-large-images
- https://unix.stackexchange.com/questions/77968/viewing-large-image-on-linux
- https://superuser.com/questions/356038/viewer-for-huge-images-under-linux-100-mp-color-images
but even then, you will be dragging the image around a lot to find what you want, see e.g. this image from a "real" software example taken from this ticket:
Can you find the most critical call stack easily with all those tiny unsorted spaghetti lines going over one another? There might be better dot options I'm sure, but I don't want to go there now. What we really need is a proper dedicated viewer for it, but I haven't found one yet:
You can however use the color map to mitigate those problems a bit. For example, on the previous huge image, I finally managed to find the critical path on the left when I made the brilliant deduction that green comes after red, followed finally by darker and darker blue.
Alternatively, we can also observe the text output of the gprof built-in binutils tool which we previously saved at:
cat main.gprof
By default, this produces an extremely verbose output that explains what the output data means. Since I can't explain better than that, I'll let you read it yourself.
Once you have understood the data output format, you can reduce verbosity to show just the data without the tutorial with the -b option:
gprof -b main.out
In our example, outputs were for -O0:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.35 3.67 3.67 123003 0.00 0.00 common
0.00 3.67 0.00 3 0.00 0.03 fast
0.00 3.67 0.00 3 0.00 1.19 maybe_slow
Call graph
granularity: each sample hit covers 2 byte(s) for 0.27% of 3.67 seconds
index % time self children called name
0.09 0.00 3003/123003 fast [4]
3.58 0.00 120000/123003 maybe_slow [3]
[1] 100.0 3.67 0.00 123003 common [1]
-----------------------------------------------
<spontaneous>
[2] 100.0 0.00 3.67 main [2]
0.00 3.58 3/3 maybe_slow [3]
0.00 0.09 3/3 fast [4]
-----------------------------------------------
0.00 3.58 3/3 main [2]
[3] 97.6 0.00 3.58 3 maybe_slow [3]
3.58 0.00 120000/123003 common [1]
-----------------------------------------------
0.00 0.09 3/3 main [2]
[4] 2.4 0.00 0.09 3 fast [4]
0.09 0.00 3003/123003 common [1]
-----------------------------------------------
Index by function name
[1] common [4] fast [3] maybe_slow
and for -O3:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
100.52 1.84 1.84 123003 14.96 14.96 common
Call graph
granularity: each sample hit covers 2 byte(s) for 0.54% of 1.84 seconds
index % time self children called name
0.04 0.00 3003/123003 fast [3]
1.79 0.00 120000/123003 maybe_slow [2]
[1] 100.0 1.84 0.00 123003 common [1]
-----------------------------------------------
<spontaneous>
[2] 97.6 0.00 1.79 maybe_slow [2]
1.79 0.00 120000/123003 common [1]
-----------------------------------------------
<spontaneous>
[3] 2.4 0.00 0.04 fast [3]
0.04 0.00 3003/123003 common [1]
-----------------------------------------------
Index by function name
[1] common
As a very quick summary for each section e.g.:
0.00 3.58 3/3 main [2]
[3] 97.6 0.00 3.58 3 maybe_slow [3]
3.58 0.00 120000/123003 common [1]
centers around the function that is left indented (maybe_slow). [3] is the ID of that function. Above the function, are its callers, and below it the callees.
For -O3, see here like in the graphical output that maybe_slow and fast don't have a known parent, which is what the documentation says that <spontaneous> means.
I'm not sure if there is a nice way to do line-by-line profiling with gprof: `gprof` time spent in particular lines of code
perf from linux-tools
perf works by sampling which part of the programming is running regularly. This makes it very simple to setup without any recompilation. It can even attach to programs that are already running. The downside is that we lose a bit of data compared to gprof instrumented code; notably we don't know how many times each function is called.
This section was tested on Ubuntu 23.10:
sudo apt install linux-tools-common linux-tools-generic
And then you also need:
sudo sysctl kernel.perf_event_paranoid=-1 kernel.kptr_restrict=0
or to persist those options that across reboots:
printf 'kernel.perf_event_paranoid = -1
kernel.kptr_restrict = 0
' | sudo tee -a /etc/sysctl.conf
Collect data:
time perf record --call-graph dwarf ./main.out 10000
This produces a 42 MB file perf.data and adds a fixed 1 s overhead to execution, so it's fine time-wise. Ther --call-graph dwarf option is mentioned at: Call stack in the perf profiler
Next we can inspect the data interactively with:
perf report
Initially this gives us a view:
TODO how to always sort by "Self"? Sorting by self column in perf report It seemed to randomly sort by one column or another from time to time?
This immediately tells us that common is what takes up most of the execution time. If we select the "common" line and hit + (plus sign) or e (expand) and keep expanding recursively we finally see:
which tells us exactly what we want: maybe_slow calls to common are the slow ones.
Another mega-cool thing we can do is to hit a (annotate) on the common line, which gives us this assembly level breakdown of the function
Another interesting visualization is with https://github.com/brendangregg/FlameGraph which is also mentioned at: http://www.brendangregg.com/perf.html#FlameGraphs
git clone https://github.com/brendangregg/FlameGraph
git -C FlameGraph/ checkout cd9ee4c4449775a2f867acf31c84b7fe4b132ad5
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg
and we get:
On the a more complex example I got from another program it becomes clearer what the graph means:
TODO there are a log of [unknown] functions in that example, why is that?
Another perf GUI interfaces which might be worth it include:
Eclipse Trace Compass plugin: https://www.eclipse.org/tracecompass/
But this has the downside that you have to first convert the data to the Common Trace Format, which can be done with
perf data --to-ctf, but it needs to be enabled at build time/haveperfnew enough, either of which is not the case for the perf in Ubuntu 18.04https://github.com/KDAB/hotspot
The downside of this is that there seems to be no Ubuntu package, and building it requires Qt 5.10 while Ubuntu 18.04 is at Qt 5.9.
But David Faure mentions in the comments that there is no an AppImage package which might be a convenient way to use it.
valgrind callgrind
valgrind runs the program through the valgrind virtual machine. This makes the profiling very accurate, but it also produces a very large slowdown of the program.
callgrind is the valgrind's tool to profile code and kcachegrind is a KDE program that can visualize cachegrind output. I have also mentioned kcachegrind previously at: Tools to get a pictorial function call graph of code
First we have to remove the -pg flag to go back to normal compilation, otherwise the run actually fails with Profiling timer expired, and yes, this is so common that I did and there was a Stack Overflow question for it.
So we compile and run as:
sudo apt install kcachegrind valgrind
gcc -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
time valgrind --tool=callgrind valgrind --dump-instr=yes \
--collect-jumps=yes ./main.out 10000
I enable --dump-instr=yes --collect-jumps=yes because this also dumps information that enables us to view a per assembly line breakdown of performance, at a relatively small added overhead cost.
Off the bat, time tells us that the program took 29.5 seconds to execute, so we had a slowdown of about 15x on this example. Clearly, this slowdown is going to be a serious limitation for larger workloads. On the "real world software example" mentioned here, I observed a slowdown of 80x.
The run generates a profile data file named callgrind.out.<pid> e.g. callgrind.out.8554 in my case. We view that file with:
kcachegrind callgrind.out.8554
which shows a GUI that contains data similar to the textual gprof output:
Also, if we go on the bottom right "Call Graph" tab, we see a call graph which we can export by right clicking it to obtain the following image with unreasonable amounts of white border :-)
I think fast is not showing on that graph because kcachegrind must have simplified the visualization because that call takes up too little time, this will likely be the behavior you want on a real program. The right click menu has some settings to control when to cull such nodes, but I couldn't get it to show such a short call after a quick attempt. If I click on fast on the left window, it does show a call graph with fast, so that stack was actually captured. No one had yet found a way to show the complete graph call graph: Make callgrind show all function calls in the kcachegrind callgraph
TODO on complex C++ software, I see some entries of type <cycle N>, e.g. <cycle 11> where I'd expect function names, what does that mean? I noticed there is a "Cycle Detection" button to toggle that on and off, but what does it mean?
gperftools
Previously called "Google Performance Tools", source: https://github.com/gperftools/gperftools Sample based.
First install gperftools with:
sudo apt install google-perftools
Then, we can enable the gperftools CPU profiler in two ways: at runtime, or at build time.
At runtime, we have to pass set the LD_PRELOAD to point to libprofiler.so, which you can find with locate libprofiler.so, e.g. on my system:
gcc -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so \
CPUPROFILE=prof.out ./main.out 10000
Alternatively, we can build the library in at link time, dispensing passing LD_PRELOAD at runtime:
gcc -Wl,--no-as-needed,-lprofiler,--as-needed -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
CPUPROFILE=prof.out ./main.out 10000
See also: gperftools - profile file not dumped
The nicest way to view this data I've found so far is to make pprof output the same format that kcachegrind takes as input (yes, the Valgrind-project-viewer-tool) and use kcachegrind to view that:
google-pprof --callgrind main.out prof.out > callgrind.out
kcachegrind callgrind.out
After running with either of those methods, we get a prof.out profile data file as output. We can view that file graphically as an SVG with:
google-pprof --web main.out prof.out
which gives as a familiar call graph like other tools, but with the clunky unit of number of samples rather than seconds.
Alternatively, we can also get some textual data with:
google-pprof --text main.out prof.out
which gives:
Using local file main.out.
Using local file prof.out.
Total: 187 samples
187 100.0% 100.0% 187 100.0% common
0 0.0% 100.0% 187 100.0% __libc_start_main
0 0.0% 100.0% 187 100.0% _start
0 0.0% 100.0% 4 2.1% fast
0 0.0% 100.0% 187 100.0% main
0 0.0% 100.0% 183 97.9% maybe_slow
See also: How to use google perf tools
Instrument your code with raw perf_event_open syscalls
I think this is the same underlying subsystem that perf uses, but you could of course attain even greater control by explicitly instrumenting your program at compile time with events of interest.
This is likely too hardcore for most people, but it's kind of fun. Minimal runnable example at: Quick way to count number of instructions executed in a C program
Intel VTune
https://en.wikipedia.org/wiki/VTune
This seems to be closed source and x86-only, but it is likely to be amazing from what I've heard. I'm not sure how free it is to use, but it seems to be free to download. TODO evaluate.
Tested in Ubuntu 18.04, gprof2dot 2019.11.30, valgrind 3.13.0, perf 4.15.18, Linux kernel 4.15.0, FLameGraph 1a0dc6985aad06e76857cf2a354bd5ba0c9ce96b, gperftools 2.5-2.










