
Ping pong merge | multi-threaded merge sorts #16

Open
jeffareid opened this issue Jan 18, 2023 · 12 comments

@jeffareid

From the readme: "Traditionally merge sorts would merge two blocks to swap memory, then copy them back to main memory."

I'm not aware of any library implementation of merge sort that does a copy back. Most library implementations of stable sort are some hybrid of insertion sort | bottom-up merge sort that change the direction of merge with each pass.

Merge sorts have been using ping-pong-like merging since the days of tape sorts (1960's). For 3 tape drives, polyphase merge sort can be used, and it's generally quicker than standard merge sort for up to 7 tape drives:

Polyphase merge sort

Multi-threading - GNU Sort for Unix defaults to using 16 threads during the internal sort phase, which sorts an array of pointers to records, then writes the records according to the sorted array to create sorted runs on hard drives. It then defaults to a 16-way merge to combine the sorted runs into a sorted file. For drives with very fast transfer rates, like NVMe SSDs with 6+ GB/s transfer rates, multi-threading with 2- or 4-way merging (using if|else rather than a min-heap) would be faster.

Sort - Unix

Sort source code

@scandum
Owner

scandum commented Jan 20, 2023

I'm quite sure Timsort, at least originally, performs a copy back. The performance loss is minimal. I've updated the readme to "Most textbook mergesort examples merge two blocks to swap memory, then copy them back to main memory."

Quadsort doesn't use multi-threading, and it's not something I'm interested in as it'd likely be disruptive to processes running on the other cores in a real-world setting.

@jeffareid
Author

jeffareid commented Jan 21, 2023

Looking at tape-based merge sorts, they date back to the late 1950's (IBM 729 tape drives), before the early 1960's. As posted above, tape-based sorts don't use copy back.

I haven't worked with Timsort.

Regular merge sorts with copy back are mostly classroom exercises, so while classroom textbooks might show copy back, algorithm-oriented textbooks probably don't. I don't have any copies of Knuth or similar textbooks, so I can't be sure. Top-down merge sorts are also mostly educational, while most libraries use some hybrid of insertion sort | bottom-up merge sort, such as C++ std::stable_sort. One exception is Visual Studio 2015 std::list::sort, which switched to using iterators, but also switched to a less efficient top-down merge sort. I falsely assumed that Microsoft ran into some issue with an iterator-based bottom-up merge sort for a linked list; only later, out of curiosity, did I determine it wasn't an issue.

https://stackoverflow.com/questions/40622430/40629882#40629882

I converted your quadsort code to compile with Visual Studio (it doesn't allow returning a void expression, so only a few lines were changed). I tested with 2^24 (16 million) 64-bit integers versus my hybrid insertion sort | 4-way merge sort (using nested if|else instead of a min-heap): 4-way sort 1.3 seconds, quadsort 0.8 seconds. I repeated the test with the values masked with 0xff, giving a huge number of duplicates: 4-way merge sort 0.38 seconds, quadsort 0.62 seconds. Based on your description, branch prediction is probably the issue, as there are 3 compares for every element moved during a 4-way merge. I haven't looked at all the optimizations in quadsort to get an idea of why it's so much faster for normal data.

As for multi-threading, it's something that someone could always add to an existing merge sort. Unix sort and mainframe sorts use multi-threading, but libraries like C++ std::stable_sort do not.

@jeffareid
Author

jeffareid commented Jan 21, 2023

In the case of 16 million pseudo-random 64-bit integers masked with 0xff (lots of duplicates), a standard merge sort behaves much as it does on already sorted data: whole left and right runs are moved with not much more than n/2 compares and virtually no branch mispredictions. However, quadsort is extremely fast at handling already sorted data, so my guess is some minor tweak could be made to handle a large number of duplicates (when most of the left run is <= the right run due to duplicates).

Out of all the optimizations in quadsort, is there one key optimization, or is the increase in performance due to the combined effect of all of them?

@scandum
Owner

scandum commented Jan 21, 2023

I don't think modulating by 256 is enough for quadsort to detect it as non-random data, or otherwise benefit. Modulating by 16 should trigger galloping mode and some skips. I haven't experimented much with brute-forcing branched comparisons on mostly sorted data, it's definitely worth looking into.

Fluxsort will do a lot better on that data set though, and I don't think quadsort could ever beat it in that department, so I'd suggest giving fluxsort a spin. Fluxsort should sort that data set in about 0.31 seconds.

As for key optimizations, the quad_swap routine sorts fully sorted data in O(n) without ruining the performance on random data in the way Timsort does.

The parity and galloping merge, which merge from both ends of the arrays, are instrumental to fast branchless merging. As far as I know, it's not feasible to do branchless merging without writing to two separate memory regions, and I'm quite sure I was the first one to discover it. My hint was that branchless quicksort was 2x faster, and quicksort naturally writes to two separate memory regions instead of one.

The rest of the performance is primarily the cumulative sum of a bunch of micro optimizations. Since I compile with gcc some of it might be compiler specific.

@jeffareid
Author

jeffareid commented Jan 22, 2023

Thanks for explaining your work. My main specialty is Reed Solomon code and some generic finite field stuff. I know basic sort algorithms but haven't spent much time on specialized optimizations.

As for the modulo 256 issue: with a generic merge sort, when merging a pair of runs, the compare is left element <= right element, so duplicates result in all of the left run being merged with only a small fraction of the right run; most of the right run is then just copied in a tight loop with no compares. Not only does this reduce the number of compares, almost all of the <= compares will be true, which means branch prediction will be correct almost all of the time. You mention runs are scanned forwards and backwards, and I assume this is why quadsort doesn't get the same benefit from duplicates.

As a comparison to a similar issue with quicksort, duplicates cause Lomuto partition scheme to degrade, while they cause Hoare partition scheme to improve. All duplicates are a worst case scenario for Lomuto, and a best case for Hoare.

@scandum
Owner

scandum commented Jan 23, 2023

I'm aware; the performance you are reporting for mod 256 is quite unusual for a merge. Gains from simple branched merges are generally pretty limited. Either you're doing something quite unique, or there might be a problem with the algorithm.

I wrote a 4-way merge a while back, but it didn't exhibit any unusual behavior.

@jeffareid
Author

jeffareid commented Jan 23, 2023

For a 4-way merge (3 compares per move) of {run0 run1 run2 run3} modulo 256, most of run0 will be <= the other runs, aiding branch prediction, and the end of run0 is reached early, dropping down to a 3-way merge (2 compares per move), then a 2-way merge (1 compare per move), then a copy of the remainder of run3. This results in an average of about 1.5 compares per move, with half the number of merge passes of a 2-way merge.

I uploaded the source code. The timing stuff with QP set to 1 is Windows specific and needs to be changed (note I since changed it to 0, but you already got the prior version). This is a generic hybrid insertion sort | 4 way merge sort, with no special optimizations. Note that the second array (b[]) is a one time allocation that is passed to the mergesort function, and the allocation time is not included in the timing. This 4 way merge is intended to be run on a CPU with 16 registers, such as an X86 in 64 bit mode.

https://github.com/jeffareid/misc/blob/master/msbu4i.c

@scandum
Owner

scandum commented Jan 23, 2023

I managed to reach an average of 1 compare per move with my own 4-way merge, but it's pretty bulky.

I'll see if I can plug your sort into my own benchmark.

@scandum
Owner

scandum commented Jan 23, 2023

Here are the results when I run it through my benchmark:

(screenshot of benchmark results; the numbers are reproduced in the data table below)

I aliased your sort as fourway. My best guess is that you forgot to define the #define cmp(a,b) (*(a) > *(b)) macro for quadsort. Since your sort uses inlined primitive comparisons it makes a significant difference. Compiling with optimizations enabled is also a must. Memory allocation is included in the timing.

The data table:

Name Items Type Best Average Compares Samples Distribution
quadsort 10000000 32 0.410247 0.418234 1 5 random order
fluxsort 10000000 32 0.259402 0.262681 1 5 random order
fourway 10000000 32 0.850645 0.852009 1 5 random order
quadsort 10000000 32 0.273567 0.274634 1 5 random % 100
fluxsort 10000000 32 0.099596 0.100465 1 5 random % 100
fourway 10000000 32 0.421388 0.421923 1 5 random % 100
quadsort 10000000 32 0.012305 0.012775 1 5 ascending order
fluxsort 10000000 32 0.006983 0.007099 1 5 ascending order
fourway 10000000 32 0.123592 0.124429 1 5 ascending order
quadsort 10000000 32 0.097924 0.098559 1 5 ascending saw
fluxsort 10000000 32 0.097320 0.097645 1 5 ascending saw
fourway 10000000 32 0.184623 0.185662 1 5 ascending saw
quadsort 10000000 32 0.071398 0.071744 1 5 pipe organ
fluxsort 10000000 32 0.053403 0.054020 1 5 pipe organ
fourway 10000000 32 0.153717 0.160141 1 5 pipe organ
quadsort 10000000 32 0.010339 0.010578 1 5 descending order
fluxsort 10000000 32 0.010325 0.010474 1 5 descending order
fourway 10000000 32 0.183821 0.194731 1 5 descending order
quadsort 10000000 32 0.097083 0.097512 1 5 descending saw
fluxsort 10000000 32 0.096807 0.097685 1 5 descending saw
fourway 10000000 32 0.185179 0.185889 1 5 descending saw
quadsort 10000000 32 0.029866 0.030105 1 5 random tail
fluxsort 10000000 32 0.043229 0.043561 1 5 random tail
fourway 10000000 32 0.129266 0.130999 1 5 random tail
quadsort 10000000 32 0.248677 0.248990 1 5 random half
fluxsort 10000000 32 0.165950 0.166270 1 5 random half
fourway 10000000 32 0.495616 0.496706 1 5 random half
quadsort 10000000 32 0.157256 0.158560 1 5 ascending tiles
fluxsort 10000000 32 0.074750 0.088254 1 5 ascending tiles
fourway 10000000 32 0.172149 0.172785 1 5 ascending tiles
quadsort 10000000 32 0.384517 0.385016 1 5 bit reversal
fluxsort 10000000 32 0.240946 0.250197 1 5 bit reversal
fourway 10000000 32 0.241819 0.244781 1 5 bit reversal

@jeffareid
Author

jeffareid commented Jan 23, 2023

fourway is intended for an X86 64-bit build, as it needs 16 registers, most of them used up by the pointers. It would also work on a 32-bit processor that has 16 registers. I uploaded 2 more files to my github.

Changes:

- MergeSort now uses a pointer to a compare function, but Visual Studio defaults to inlining the compare function, so it makes no difference. The compare macro also makes no difference, but I left it enabled.
- MergeSort no longer takes a pointer to the second array b[]; instead it does a one-time allocation and freeing of that array. Any difference was smaller than the difference between runs, so not measurable.
- #define QP 0 so that the test program uses the standard time functions.

I assumed (incorrectly) that 10000000 and 100 were hex numbers, i.e. 0x1000000 (16777216 instead of 10000000) and 0x100. Results below (note these are 64-bit builds). There is little or no difference between 64-bit and 32-bit random data, but a significant difference for random%0x100, due to near-streaming moves.

Intel 3770K | Windows 7 Pro 64 bit | VS2015 | 64 bit build
64 bit data random = 1.290 | random%0x100 = 0.357
32 bit data random = 1.290 | random%0x100 = 0.242

Intel 10510U | Windows 10 Pro 64 bit | VS2019 | 64 bit build
64 bit data random = 1.178 | random%0x100 = 0.235
32 bit data random = 1.145 | random%0x100 = 0.181

Link to 64 bit build, 64 bit data:
https://github.com/jeffareid/misc/blob/master/msbu4ic.c

Link to 64 bit build, 32 bit data:
https://github.com/jeffareid/misc/blob/master/msbu4c32.c

@scandum
Owner

scandum commented Jan 24, 2023

I am benching with an Intel Core i3-8100, and the numbers are in decimal.

We're also using a different compiler, and I'm running everything within WSL.

On *nix systems and WSL, setting the cmp macro makes a big difference.

Without compiler optimizations (-O3 in gcc) I can reproduce the situation you are reporting, where your sort is 10% faster than quadsort on random % 100. With -O3 enabled quadsort becomes the fastest however.

@jeffareid
Author

jeffareid commented Jan 24, 2023

The cmp macro didn't make any difference with Visual Studio, which inlines the compare function with or without the macro. The VS C++ standard template library functions such as std::sort, std::stable_sort, and std::list::sort optionally take a pointer-to-compare-function parameter; the versions without that parameter just call the version with the pointer using a generic compare function, relying on VS to inline it.

Another issue was the switch from unsigned integers to signed integers, where the result of modulo takes the sign of the dividend. I switched back to unsigned integers. For %256, &0xff could be used instead to avoid the signed-integer issue. With the switch to unsigned integers, there isn't much difference between 64-bit and 32-bit data. %256 is faster than %100, and I'm not sure why. For quadsort, %256 is about 9% faster. For fourway, %100 is about 2 times as fast as unmasked random, which is expected (an average of 1.5 compares per move versus 3 compares per move), but I don't know why %256 is over 4 times as fast.

Intel 10510U | Windows 10 Pro 64 bit | VS2019 | 64 bit build | 64 bit data
quadsort random = 0.332 | %100 = 0.257 | %256 = 0.235
fourway random = 0.675 | %100 = 0.325 | %256 = 0.157

Intel 10510U | Windows 10 Pro 64 bit | VS2019 | 64 bit build | 32 bit data
quadsort random = 0.302 | %100 = 0.257 | %256 = 0.235
fourway random = 0.670 | %100 = 0.320 | %256 = 0.145

I added compare counts for fourway 64 bit. For n = 10000000, insertion sort is creating sorted runs of 16 before starting merge sort passes. I don't know why %256 is so much faster than %100.

Intel 3770K | Windows 7 Pro 64 bit | VS2015 | 64 bit build | 64 bit data
random 0.749 # insert compares: 45400272 | merge compares: 283661102 | total compares: 329061374
  %100 0.394 # insert compares: 45361374 | merge compares: 282621068 | total compares: 327982442
  %256 0.224 # insert compares: 45585938 | merge compares: 283598899 | total compares: 329184837
