perf: improve parallelism of SortBotMerge#831
Conversation
541e928 to
3e45fac
Compare
|
I have removed the second commit, which get rid of "block 0": there is another circumstance in which it is required, separate from copying a partial term from the last block, such that it is contiguous with the tail in block 1, which is when adding terms with polyratfun. When terms merge, if the result is larger, it is copied into the space before "term1" in the merge: this requires block 0 if term1 is the first term of block 1. I added commentary to the code to highlight this. |
|
The following table shows benchmark results for
DetailsSpeedup of B over A (mean) = (mean time of A) / (mean time of B) A: B: Paired runs with n = 30 per benchmark with Environment:
|
|
By the way, if you and Ana would like, you could try using co-authored commits. |
Currently, the sortbot stage of tform sorting does not achieve good parallelism. Primarily, this is because the "sortblock" size has grown with default SmallSize and LargeSize adjustments, such that in many cases the sortbot levels do not run in parallel at all, because each thread's output fits in a single block. This commit adjusts the logic for filling and unlocking the blocks such that sortbot threads can start work as soon as possible: - Only put complete terms into the blocks, no splitting over the blocks. - Track the number of terms in each block, and use this when reading the data to determine when a block is complete, rather than waiting for a term to overlap the "stop" pointer. - When filling blocks (in PutToMaster), if a certain (small) minimum number of terms has been written, probe if a reading thread is waiting on the current block by attempting to lock the previous block. If a thread is waiting (so the lock fails), unlock the current block immediately and write the term into the next. Also reduce the MINIMUMNUMBEROFTERMS parameter from 10 to 1, so that the small+large buffer does not scale (so much) with large MaxTermSize and many threads, and similarly reduce NUMBEROFBLOCKSINSORT to its minimal value of 8. Co-authored-by: AnaIPereira <AnaIPereira@users.noreply.github.com>
|
I also benchmarked this PR against the current master on my MacBook Pro (M5 Pro-chip, 15-core CPU, 48GB memory), using
with |
|
I also ran on a local Nikhef machine for
Machine: AMD EPYC 7702P 64-Core Processor, x86_64 and |
This change comes from discussion and benchmarking efforts in collaboration with @AnaIPereira. See the commit message for a full description.
On my 12-core 7900X, the results are as follows. @AnaIPereira is finishing up some testing on higher core-count sytems, to make sure we have a good value for
MINWRITENUMBEROFTERMSand that the adjusted, lower value ofMINIMUMNUMBEROFTERMSdoes not affect performance.Speedup w.r.t. master:
chromaticcolorfmftforcer-expforcerhyperformmass-factmbox1lminceexmincermzv-dmsort-disksort-largesort-smalltrace