Agner Fog - Why do you want libc to be 5 times slower than other libraries?

This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.

I am doing research on optimization of microprocessors and compilers. Some of you probably know my optimization manuals (www.agner.org/optimize/).

I have tested many different compilers and compared how well they optimize C++ code. I have been pleased to observe that gcc has improved a lot in the last couple of years. The gcc compiler itself now matches the optimizing performance of the Intel compiler, and it beats all other compilers I have tested. The many hard-working developers deserve credit for this!

Unfortunately, libc turns out to be a weak point in the comparison. The performance of libc on memory and string functions is poor compared to other function libraries because it doesn't use the XMM registers. See my test results below.

If somebody would do the job of updating these functions, then we would have the wonderful situation where gcc/libc would be the best optimizing solution for all x86 and x86-64 platforms.

Test results. Memcpy function on an Intel Core 2 processor, in core clock cycles per byte of data:

Function library    aligned by 16    unaligned data
---------------------------------------------------
gcc builtin              0.18            1.21
libc 2.7 32 bit          0.18            0.57
libc 2.8 32 bit          0.18            0.58
libc 2.7 64 bit          0.18            0.44
Microsoft                0.12            0.63
CodeGear                 0.18            0.75
Intel                    0.12            0.18
Mac                      0.11            0.11
My own library           0.11            0.12
---------------------------------------------------

As you can see, the speed of memcpy in libc can be improved by a factor of 4-5 for unaligned data on a Core 2. The default gcc builtin version is slower still.

On an AMD K8 CPU there is less difference between the performance of the different libraries because the K8 has only 64-bit internal data paths, so it cannot take full advantage of the 128-bit XMM registers. I expect the AMD K10 to perform similarly to the Intel Core 2 because it has 128-bit data paths. However, I haven't had the chance to test this on an AMD K10 yet.

There are significant performance differences on the strlen function and other functions as well. You can find my complete test results at http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6.

I would recommend that you implement CPU dispatching for the different instruction sets in the most important memory and string functions and take advantage of the newest instruction sets if available. Of course, old computers without SSE should still be supported, but the 99% of users who have SSE2 or later should not be penalized for the sake of compatibility with old CPUs.

The work of putting this into libc should not be too big. Open source optimized code is available in Mac/Xnu, in OpenSolaris, and in my own function library "asmlib" at www.agner.org/optimize/asmlib.zip. All of these have open source licenses, although with various differences. I don't know if these differences in license conditions cause legal problems that cannot be solved through negotiation. At least I am willing to grant the necessary licenses to the Gnu/libc project if you want to use my code.

I am not going to join the libc development team because I have lots of other work to do; I am just offering my advice.
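
To illustrate the CPU-dispatching recommendation above, here is a minimal sketch in C. It is not code from libc, asmlib, or any of the libraries in the table. The names copy_generic, copy_sse2, select_copy and my_memcpy are hypothetical, and it assumes gcc or a compatible compiler that provides __builtin_cpu_supports on x86. The SSE2 path moves 16 bytes at a time through an XMM register using unaligned loads and stores. A tuned implementation would also handle alignment and large blocks separately.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>          /* SSE2 intrinsics */
#define HAVE_X86_DISPATCH 1     /* hypothetical macro for this sketch */
#endif

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Fallback for CPUs without SSE2: plain byte-by-byte copy. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

#ifdef HAVE_X86_DISPATCH
/* SSE2 version: copies 16 bytes per iteration through an XMM register.
   The target attribute lets this compile even in a 32-bit build
   where SSE2 is not enabled by default. */
__attribute__((target("sse2")))
static void *copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);  /* unaligned load  */
        _mm_storeu_si128((__m128i *)d, v);                /* unaligned store */
        s += 16;
        d += 16;
        n -= 16;
    }
    while (n--)                 /* remaining 0-15 bytes */
        *d++ = *s++;
    return dst;
}
#endif

/* Pick the best version that the running CPU supports. */
static copy_fn select_copy(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return copy_sse2;
#endif
    return copy_generic;
}

/* Entry point: the CPU check runs on the first call only.
   Not synchronized; concurrent first calls would store the same pointer. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    static copy_fn impl;
    if (!impl)
        impl = select_copy();
    return impl(dst, src, n);
}
```

A caller uses my_memcpy(dst, src, n) exactly like memcpy. The cost of the dispatch after the first call is one indirect call through the function pointer. The same pattern extends to newer instruction sets by adding more branches to select_copy.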