Sort scalability on modern CPUs

	We don't need core counts of thirty
	If we win by playing dirty!

	Let's test Linux sort on a Coffee Lake generation processor:
	System: Ubuntu 16, 6-core i5-8500, 16 GB of DDR4, Nvidia GTX 1080
	Test file: 760 MB text file of tab-separated numeric and text fields from the TPC-H benchmark
	The sort program is run repeatedly, so the source file is read from cache. The results are written
	to /dev/null, so we are comparing CPU sort performance and not disk performance.
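
	The setup above can be sketched as a small timing harness. This is only a sketch under assumptions:
	the post does not say how the number of cores was limited, so GNU sort's --parallel option is assumed
	here, and the file name and the helper names build_sort_cmd and time_sort are hypothetical.

```python
import subprocess
import time


def build_sort_cmd(path, field, numeric=False, threads=1):
    """Build a GNU sort command line for one tab-separated key field."""
    key = f"-k{field},{field}" + ("n" if numeric else "")
    return [
        "sort",
        "-t", "\t",               # fields are tab separated
        key,                      # sort on this single field only
        f"--parallel={threads}",  # assumption: cores are limited via --parallel
        "-o", "/dev/null",        # discard the output so disk writes are not timed
        path,
    ]


def time_sort(cmd, runs=3):
    """Run the command several times and return the best wall-clock time in seconds.

    Repeating the run means the input file is served from the page cache.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        best = min(best, time.perf_counter() - start)
    return best
```

	For example, time_sort(build_sort_cmd("data.tsv", 16, threads=6)) would time the text field sort
	on six threads, with "data.tsv" standing in for the test file.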

	Sorting on the 16th field (variable-length text):

	Cores used             :    6    5    4    3    2    1
	Running time (seconds) :  9.1  8.9  9.1 13.4 13.8 23.6

	Sorting on the 3rd field (numeric):

	Cores used             :    6    5    4    3    2    1
	Running time (seconds) :  3.3  3.4  3.4  4.8  4.9  7.9

	As we see, the sort doesn't scale very well: going from one core to six cuts the running time by only
	about 2.5x, and there is practically no gain beyond four cores.
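
	To put numbers on the scaling, the speed-up and parallel efficiency can be computed from the two
	tables above. A minimal sketch using only the measured times; the function name scaling is made up
	for illustration.

```python
# Measured running times from the tables above, in seconds.
cores = [6, 5, 4, 3, 2, 1]
text_times = [9.1, 8.9, 9.1, 13.4, 13.8, 23.6]
numeric_times = [3.3, 3.4, 3.4, 4.8, 4.9, 7.9]


def scaling(cores, times):
    """Return (cores, speedup, efficiency) rows relative to the 1-core run."""
    base = times[cores.index(1)]
    return [(n, base / t, base / t / n) for n, t in zip(cores, times)]


for label, times in (("text", text_times), ("numeric", numeric_times)):
    print(label)
    for n, speedup, efficiency in scaling(cores, times):
        print(f"  {n} cores: speed-up {speedup:.2f}x, efficiency {efficiency:.0%}")
```

	On six cores the speed-up is about 2.6x for the text field and about 2.4x for the numeric field,
	which is a parallel efficiency of roughly 40%.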

	Let's see the numbers when using a GPU. Here is a simple program that reads the file into GPU memory,
	sorts on the specified fields and copies the sorted data back into main memory. It normally writes
	the result to disk, but in these tests the results are not written to a file, to match the CPU tests.
	For sorting, the program uses partitioning that exactly load-balances the work across GPU threads.
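
	The post does not spell out the partitioning scheme. One well-known way to get an exact balance
	when merging sorted runs is the "merge path" technique, where a binary search along a diagonal
	gives every worker the same number of output elements. Whether nvSort does exactly this is an
	assumption; the sketch below uses plain Python lists instead of CUDA, and the names
	merge_path_split and balanced_merge are hypothetical.

```python
from heapq import merge


def merge_path_split(a, b, diag):
    """Find (i, j) with i + j == diag such that a[:i] and b[:j] together
    hold the first `diag` elements of the merged output of sorted a and b."""
    lo = max(0, diag - len(b))
    hi = min(diag, len(a))
    while lo < hi:
        mid = (lo + hi) // 2
        # Take a[mid] into the prefix if it is not greater than the
        # matching element of b on the same diagonal.
        if a[mid] <= b[diag - 1 - mid]:
            lo = mid + 1
        else:
            hi = mid
    return lo, diag - lo


def balanced_merge(a, b, workers):
    """Merge sorted a and b as `workers` independent chunks of equal size
    (sizes differ by at most one element when the total is not divisible)."""
    total = len(a) + len(b)
    cuts = [merge_path_split(a, b, total * w // workers) for w in range(workers + 1)]
    chunks = []
    for (i0, j0), (i1, j1) in zip(cuts, cuts[1:]):
        # Each chunk depends only on its own slices, so on a GPU every
        # thread could produce its chunk without talking to the others.
        chunks.append(list(merge(a[i0:i1], b[j0:j1])))
    return chunks


chunks = balanced_merge([1, 3, 5, 7, 9, 11], [2, 4, 6, 8, 10, 12], workers=4)
print([len(c) for c in chunks])
```

	Each worker gets the same amount of output to produce regardless of how the keys are distributed
	between the two inputs, which is what makes the balance exact rather than approximate.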

	Running time (seconds) for the text field sort: 1.5 (copying the data and sorting take 0.79 seconds;
	the rest is GPU initialization time).
	Running time (seconds) for the numeric field sort: 0.8 (copying the data and sorting take 0.4 seconds).
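
	For comparison with the CPU runs, the GPU speed-up and the share of time spent outside copying
	and sorting can be worked out directly from the figures above. A small sketch; the dictionary
	layout and the name gpu_summary are just for illustration.

```python
# Figures from the post, in seconds.
cpu_times = {
    "text": [9.1, 8.9, 9.1, 13.4, 13.8, 23.6],
    "numeric": [3.3, 3.4, 3.4, 4.8, 4.9, 7.9],
}
gpu_total = {"text": 1.5, "numeric": 0.8}          # whole program run
gpu_copy_and_sort = {"text": 0.79, "numeric": 0.4}  # data copying plus sorting


def gpu_summary(field):
    """Return (speedup over best CPU run, speedup without overhead, overhead seconds)."""
    best_cpu = min(cpu_times[field])
    overhead = gpu_total[field] - gpu_copy_and_sort[field]
    return (
        best_cpu / gpu_total[field],
        best_cpu / gpu_copy_and_sort[field],
        overhead,
    )


for field in ("text", "numeric"):
    total_x, core_x, overhead = gpu_summary(field)
    print(f"{field}: {total_x:.1f}x overall, {core_x:.1f}x copy+sort only, "
          f"{overhead:.2f} s outside copy+sort")
```

	Against the best CPU time in each table, the GPU run is roughly 6x faster for the text field and
	roughly 4x faster for the numeric field, even with the initialization time included.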

	My GPU sort program: https://github.com/antonmks/nvSort
	nvSort is not ready for production; it was written just for the purpose of this benchmark
	and has not been tested on any other files.