This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.
I am doing research on optimization of microprocessors and compilers.
Some of you probably know my optimization manuals
(www.agner.org/optimize/).

I have tested many different compilers and compared how well they
optimize C++ code. I have been pleased to observe that gcc has been
improved a lot in the last couple of years. The gcc compiler itself is
now matching the optimizing performance of the Intel compiler and it
beats all other compilers I have tested. The many hard-working
developers deserve credit for this! Unfortunately, libc turns out to
be a weak point in the comparison. The performance of libc on memory
and string functions is poor compared to other function libraries
because it doesn't use the XMM registers. See my test results below.

If somebody would do the job of updating these functions then we would
have the wonderful situation where gcc/libc would be the best
optimizing solution for all x86 and x86-64 platforms.

Test results. Memcpy function on Intel Core 2 processor, core clock
cycles per byte of data:

Function library     aligned by 16   unaligned data
---------------------------------------------------
gcc builtin               0.18            1.21
libc 2.7 32 bit           0.18            0.57
libc 2.8 32 bit           0.18            0.58
libc 2.7 64 bit           0.18            0.44
Microsoft                 0.12            0.63
CodeGear                  0.18            0.75
Intel                     0.12            0.18
Mac                       0.11            0.11
My own library            0.11            0.12
---------------------------------------------------

As you can see, the speed of memcpy in libc can be improved by a factor of 4-5 for unaligned data on a Core 2. The default builtin version is slower still. On an AMD K8 CPU there is less difference between the performance of the different libraries because the K8 has only 64-bit internal data paths, so it cannot take full advantage of the 128-bit XMM registers. I expect the AMD K10 to perform similarly to the Intel Core 2 because it has 128-bit data paths. However, I haven't had the chance to test this on an AMD K10 yet.

There are significant performance differences on the strlen function and other functions as well. You can find my complete test results at http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6.

I would recommend that you make CPU-dispatching for the different instruction sets in the most important memory and string functions and take advantage of the newest instruction sets if available. Of course the old computers without SSE should still be supported, but the 99% of users who have SSE2 or later should not be penalized for the sake of compatibility with old CPUs.

The work of putting this into libc should not be too big. Open source optimized code is available in Mac/Xnu, in OpenSolaris, and in my own function library "asmlib" at www.agner.org/optimize/asmlib.zip
All these have open source licenses, although with various differences. I don't know if these differences in license conditions cause legal problems that cannot be solved through negotiation. At least I am willing to grant the necessary licenses to the Gnu/libc project if you want to use my code.

I am not going to join the libc development team because I have lots of other work to do; I am just offering my advice.
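
To make the CPU-dispatching suggestion above concrete, here is a minimal sketch in C. It is only an illustration of the idea, not code from libc, asmlib or any of the libraries mentioned. The names copy_generic, copy_sse2, select_copy and dispatched_copy are hypothetical. It assumes gcc or clang, whose __builtin_cpu_supports built-in is used for the run-time check. The SSE2 path uses plain unaligned 16-byte loads and stores, which is much simpler than what a tuned library would do, so it should not be expected to reproduce the numbers in the table.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h>              /* SSE2 intrinsics */
#define SKETCH_X86 1
#endif

/* Signature shared by all copy implementations (hypothetical name). */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Portable fallback for CPUs without SSE2: one byte at a time. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

#ifdef SKETCH_X86
/* SSE2 path: moves 16 bytes per iteration through an XMM register,
 * using unaligned loads and stores, then copies the remaining tail. */
__attribute__((target("sse2")))
static void *copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
    while (n--)
        *d++ = *s++;
    return dst;
}
#endif

/* Picks an implementation based on what the running CPU supports. */
static copy_fn select_copy(void)
{
#ifdef SKETCH_X86
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return copy_sse2;
#endif
    return copy_generic;
}

/* Public entry point: resolves the implementation on the first call
 * and reuses it afterwards. Like memcpy, the buffers must not overlap.
 * The lazy initialization is not synchronized; concurrent first calls
 * would all compute the same pointer. */
void *dispatched_copy(void *dst, const void *src, size_t n)
{
    static copy_fn impl = NULL;
    assert(n == 0 || (dst != NULL && src != NULL));
    if (impl == NULL)
        impl = select_copy();
    return impl(dst, src, n);
}
```

A caller would use dispatched_copy(dst, src, n) exactly like memcpy. The point of the design is that the instruction-set check happens once, so users with SSE2 get the faster path while older CPUs still get a working fallback.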