| Sort scalability on modern CPUs | |
| We don't need core counts of thirty | |
| If we win by playing dirty! | |
| Let's test Linux sort on a Coffee Lake generation processor: | |
| System: Ubuntu 16, 6-core i5-8500, 16 GB of DDR4, Nvidia GTX 1080 | |
| Test file: 760 MB text file of tab-separated numeric and text fields from the TPC-H benchmark | |
| The sort program is run repeatedly, so the source file is read from cache. The results are written | |
| to /dev/null, so we are comparing CPU sort performance and not disk performance. | |
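The setup described above could be scripted roughly as follows. This is only a sketch: the original does not say how the core count was limited, so the use of GNU sort's `--parallel` option is an assumption, and the file name `data.tsv` and the helper names `sort_command` and `time_sort` are hypothetical.

```python
import subprocess
import time


def sort_command(path, field, cores, numeric=False):
    """Build a GNU sort command line for one benchmark run."""
    # Sort on exactly one field; append "n" for a numeric comparison.
    key = f"-k{field},{field}" + ("n" if numeric else "")
    return [
        "sort",
        f"--parallel={cores}",  # assumption: cores are limited via --parallel
        "-t", "\t",             # fields are tab separated
        key,
        path,
    ]


def time_sort(path, field, cores, numeric=False):
    """Run sort once, discarding its output, and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        sort_command(path, field, cores, numeric),
        stdout=subprocess.DEVNULL,  # same effect as redirecting to /dev/null
        check=True,
    )
    return time.perf_counter() - start


if __name__ == "__main__":
    path = "data.tsv"  # hypothetical name for the 760 MB test file
    time_sort(path, 16, 6)  # warm-up run so the file is in the page cache
    for cores in (6, 5, 4, 3, 2, 1):
        print(cores, round(time_sort(path, 16, cores), 1))
```

For the numeric test, the same loop would call `time_sort(path, 3, cores, numeric=True)`.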
| Sorting on the 16th field (variable-length text): | |
| Cores used:             6     5     4     3     2     1 | |
| Running time (seconds): 9.1   8.9   9.1   13.4  13.8  23.6 | |
| Sorting on the 3rd field (numeric): | |
| Cores used:             6     5     4     3     2     1 | |
| Running time (seconds): 3.3   3.4   3.4   4.8   4.9   7.9 | |
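| As we see, the sort doesn't scale very well: going from 1 core to 6 gives a speed-up of only about 2.6x | |
| for the text field and 2.4x for the numeric field, with essentially no gain beyond 4 cores. | |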
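To make the scaling concrete, here is a small sketch that turns the timings from the two tables above into speed-up and parallel efficiency figures. The function name `scaling` is just for illustration; the numbers are copied from the tables.

```python
cores = [6, 5, 4, 3, 2, 1]
text_times = [9.1, 8.9, 9.1, 13.4, 13.8, 23.6]    # 16th field, seconds
numeric_times = [3.3, 3.4, 3.4, 4.8, 4.9, 7.9]    # 3rd field, seconds


def scaling(cores, times):
    """Return (cores, speedup, efficiency) rows relative to the 1-core run."""
    baseline = times[cores.index(1)]
    rows = []
    for n, t in zip(cores, times):
        speedup = baseline / t
        # Efficiency is the share of the ideal n-fold speed-up actually achieved.
        rows.append((n, speedup, speedup / n))
    return rows


if __name__ == "__main__":
    for label, times in (("text", text_times), ("numeric", numeric_times)):
        print(label)
        for n, speedup, efficiency in scaling(cores, times):
            print(f"  {n} cores: {speedup:.2f}x speed-up, {efficiency:.0%} efficiency")
```

Perfect scaling would show 100% efficiency on every row; the 6-core rows come out below 50% for both sorts.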
| Let's see the numbers when using a GPU. Here is a simple program that reads the file into GPU memory, | |
| sorts on the specified fields, copies the file back into main memory and writes the file to disk. | |
| For sorting, the program uses partitioning that exactly load-balances the work across GPU threads. | |
| In these tests, however, the results are not written to a file, to match the CPU tests. | |
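The text does not say which partitioning scheme the program uses. As an illustration of what exact load balancing can look like, here is a sketch of one well-known technique, merge path partitioning, written in plain Python rather than GPU code. Treat it as an assumption about the general idea, not as nvSort's actual method; the names `merge_path_split` and `balanced_merge` are hypothetical, and "workers" stand in for GPU threads.

```python
import heapq


def merge_path_split(a, b, diag):
    """Return how many items of sorted `a` are among the first `diag` merged items."""
    lo = max(0, diag - len(b))
    hi = min(diag, len(a))
    while lo < hi:
        mid = (lo + hi) // 2
        # Ties go to `a`, so a[mid] precedes b[diag - mid - 1] when they are equal.
        if a[mid] <= b[diag - mid - 1]:
            lo = mid + 1
        else:
            hi = mid
    return lo


def balanced_merge(a, b, workers):
    """Merge sorted lists `a` and `b`, giving each worker an equal share of the output."""
    total = len(a) + len(b)
    # Evenly spaced cut points in the merged output, one boundary per worker.
    diags = [k * total // workers for k in range(workers + 1)]
    cuts = [(merge_path_split(a, b, d), d) for d in diags]
    out, sizes = [], []
    for (i0, d0), (i1, d1) in zip(cuts, cuts[1:]):
        # Each worker merges its own slices independently; run sequentially here.
        chunk = list(heapq.merge(a[i0:i1], b[d0 - i0:d1 - i1]))
        sizes.append(len(chunk))
        out.extend(chunk)
    return out, sizes


if __name__ == "__main__":
    merged, sizes = balanced_merge([1, 3, 5, 7], [2, 4, 6, 8], workers=4)
    print(merged, sizes)
```

The point of the technique is that every worker produces the same number of output items (differing by at most one) regardless of how the input values are distributed, so no worker sits idle while another finishes a larger share.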
| Running time (seconds) for text field sort: 1.5 (copying the data and sorting take 0.79 seconds; | |
| the rest is GPU initialization time). | |
| Running time (seconds) for numeric field sort: 0.8 (copying the data and sorting take 0.4 seconds). | |
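For a side-by-side view, the sketch below compares the GPU timings with the 6-core CPU timings reported earlier, both including and excluding GPU initialization. The dictionary and function names are illustrative only; the figures are the ones quoted in the text.

```python
cpu_6_cores = {"text": 9.1, "numeric": 3.3}       # seconds, 6-core CPU runs
gpu_total = {"text": 1.5, "numeric": 0.8}         # seconds, including GPU init
gpu_copy_sort = {"text": 0.79, "numeric": 0.4}    # seconds, copy + sort only


def compare(kind):
    """Return GPU-vs-CPU ratios and the GPU initialization overhead for one test."""
    return {
        "speedup_total": cpu_6_cores[kind] / gpu_total[kind],
        "speedup_copy_sort": cpu_6_cores[kind] / gpu_copy_sort[kind],
        "init_seconds": gpu_total[kind] - gpu_copy_sort[kind],
    }


if __name__ == "__main__":
    for kind in ("text", "numeric"):
        r = compare(kind)
        print(
            f"{kind}: {r['speedup_total']:.1f}x end to end, "
            f"{r['speedup_copy_sort']:.1f}x without init, "
            f"{r['init_seconds']:.2f} s of init"
        )
```

Initialization accounts for roughly half of each GPU run, so the end-to-end ratio understates how much faster the copy and sort steps themselves are.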
| My GPU sort program: https://github.com/antonmks/nvSort | |
| nvSort is not ready for production; it was written just for the purpose of this benchmark | |
| and has not been tested on any other files. | |