perf: improve parallelism of SortBotMerge by jodavies · Pull Request #831 · form-dev/form

jodavies · 2026-05-18T18:23:16Z

This change comes from discussion and benchmarking efforts in collaboration with @AnaIPereira. See the commit message for a full description.

On my 12-core 7900X, the results are as follows. @AnaIPereira is finishing up some testing on higher core-count systems, to make sure we have a good value for MINWRITENUMBEROFTERMS and that the adjusted, lower value of MINIMUMNUMBEROFTERMS does not affect performance.

Speedup w.r.t. master:

Benchmark	form	-w2	-w4	-w6	-w12	-w24 (HT)
`chromatic`	0.99 ± 0.01	1.03 ± 0.02	1.08 ± 0.02	1.14 ± 0.01	1.12 ± 0.01	1.08 ± 0.00
`color`	0.99 ± 0.01	1.01 ± 0.01	1.02 ± 0.01	1.03 ± 0.02	1.03 ± 0.02	1.03 ± 0.01
`fmft`	1.01 ± 0.01	1.02 ± 0.00	1.08 ± 0.01	1.13 ± 0.01	1.14 ± 0.01	1.09 ± 0.01
`forcer-exp`	1.00 ± 0.00	1.00 ± 0.00	1.04 ± 0.01	1.09 ± 0.03	1.13 ± 0.00	1.14 ± 0.00
`forcer`	1.00 ± 0.01	1.01 ± 0.00	1.07 ± 0.01	1.14 ± 0.00	1.20 ± 0.01	1.21 ± 0.00
`hyperform`	1.00 ± 0.01	1.03 ± 0.01	1.06 ± 0.01	1.09 ± 0.01	1.08 ± 0.01	1.05 ± 0.01
`mass-fact`	1.00 ± 0.01	1.02 ± 0.02	1.01 ± 0.03	1.02 ± 0.02	1.06 ± 0.07	1.03 ± 0.05
`mbox1l`	1.00 ± 0.00	1.01 ± 0.00	1.07 ± 0.01	1.15 ± 0.02	1.30 ± 0.03	1.38 ± 0.04
`minceex`	1.00 ± 0.00	1.04 ± 0.00	1.06 ± 0.01	1.08 ± 0.01	1.08 ± 0.01	1.06 ± 0.01
`mincer`	1.01 ± 0.02	1.01 ± 0.01	1.03 ± 0.01	1.03 ± 0.01	1.02 ± 0.01	1.02 ± 0.01
`mzv-dm`	1.00 ± 0.01	0.99 ± 0.02	1.00 ± 0.01	1.00 ± 0.02	1.01 ± 0.04	1.00 ± 0.02
`sort-disk`	1.02 ± 0.01	1.01 ± 0.02	1.01 ± 0.01	1.01 ± 0.01	1.02 ± 0.02	1.00 ± 0.01
`sort-large`	1.02 ± 0.01	0.98 ± 0.01	0.99 ± 0.01	1.00 ± 0.02	1.01 ± 0.01	1.01 ± 0.01
`sort-small`	1.02 ± 0.02	0.98 ± 0.02	1.00 ± 0.02	1.02 ± 0.03	1.07 ± 0.04	1.07 ± 0.08
`trace`	1.00 ± 0.02	0.99 ± 0.01	0.99 ± 0.01	1.00 ± 0.01	1.00 ± 0.01	0.99 ± 0.03

coveralls · 2026-05-18T21:34:38Z

coverage: 61.507% (+0.07%) from 61.438% — jodavies:sortbot-blocks into form-dev:master

jodavies · 2026-05-21T08:13:09Z

I have removed the second commit, which got rid of "block 0". There is another circumstance in which block 0 is required, separate from copying a partial term from the last block so that it is contiguous with the tail in block 1: adding terms with polyratfun. When terms merge, if the result is larger, it is copied into the space before "term1" in the merge; this requires block 0 if term1 is the first term of block 1. I added commentary to the code to highlight this.

tueda · 2026-05-23T03:19:48Z

The following table shows benchmark results for tform -w8 on my machine (Intel Core i9-12900, Ubuntu 20.04, x86_64), using master as the baseline, though the baseline includes other merged PRs that are not included in the PR branch. The results show improvements with no regressions.

Benchmark	Speedup	95% bootstrap CI
chromatic	1.10	[1.09, 1.10]
color	1.04	[1.03, 1.04]
fmft	1.12	[1.12, 1.13]
forcer	1.25	[1.24, 1.25]
forcer-exp	1.19	[1.19, 1.19]
hyperform	1.13	[1.13, 1.14]
mbox1l	1.16	[1.16, 1.17]
minceex	1.04	[1.04, 1.05]
mincer	1.02	[1.01, 1.02]
mzv-dm	1.02	[1.01, 1.03]
sort-disk	1.02	[1.01, 1.02]
sort-large	1.03	[1.02, 1.04]
sort-small	1.05	[1.04, 1.07]
trace	1.03	[1.03, 1.04]

Details

Speedup of B over A (mean) = (mean time of A) / (mean time of B)
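As a minimal sketch of the definition above (not the benchmark scripts themselves), the speedup is just the ratio of two mean wall-clock times. The function names `mean_time` and `speedup` are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n wall-clock times (any consistent unit). */
double mean_time(const double *t, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += t[i];
    }
    return sum / (double)n;
}

/* Speedup of B over A = (mean time of A) / (mean time of B).
 * A value above 1 means B is faster than A on average. */
double speedup(const double *a, size_t na, const double *b, size_t nb) {
    return mean_time(a, na) / mean_time(b, nb);
}
```

For example, with made-up timings of 10 s and 14 s for A and 5 s and 7 s for B, `speedup` returns 2.0.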

A:

TFORM 5.0.0 (May 19 2026, v5.0.0-25-g1f70ea1)
-backtrace  +flint=3.4.0  +gmp=6.3.0   -mpi    +pthreads  +zlib=1.2.11
-debugging  +float        +mpfr=4.2.2  +posix  -windows   +zstd=1.4.4
Compiler: GCC 11.5.0
Architecture: x86_64

B:

TFORM 5.0.0 (May 21 2026, v5.0.0-20-g3e45fac)
-backtrace  +flint=3.4.0  +gmp=6.3.0   -mpi    +pthreads  +zlib=1.2.11
-debugging  +float        +mpfr=4.2.2  +posix  -windows   +zstd=1.4.4
Compiler: GCC 11.5.0
Architecture: x86_64

Paired runs with n = 30 per benchmark with /tmp instead of /dev/shm. Used the scripts from this snapshot.

Environment:


OS	Ubuntu 20.04.6 LTS
Kernel	Linux 5.15.0-84-generic
Architecture	x86_64
CPU	Intel Core i9-12900
CPU configuration	16 cores / 24 threads (8 P-cores + 8 E-cores)
Memory	62.6 GiB
Storage	WD_BLACK SN770 1TB NVMe SSD

tueda · 2026-06-02T09:23:59Z

By the way, if you and Ana would like, you could try using co-authored commits. `Co-authored-by: NAME <EMAIL>` lines at the end of the commit message should be recognised by GitHub.

Currently, the sortbot stage of tform sorting does not achieve good parallelism. Primarily, this is because the "sortblock" size has grown with default SmallSize and LargeSize adjustments, such that in many cases the sortbot levels do not run in parallel at all, because each thread's output fits in a single block.

This commit adjusts the logic for filling and unlocking the blocks such that sortbot threads can start work as soon as possible:

- Only put complete terms into the blocks, no splitting over the blocks.
- Track the number of terms in each block, and use this when reading the data to determine when a block is complete, rather than waiting for a term to overlap the "stop" pointer.
- When filling blocks (in PutToMaster), if a certain (small) minimum number of terms has been written, probe if a reading thread is waiting on the current block by attempting to lock the previous block. If a thread is waiting (so the lock fails), unlock the current block immediately and write the term into the next.

Also reduce the MINIMUMNUMBEROFTERMS parameter from 10 to 1, so that the small+large buffer does not scale (so much) with large MaxTermSize and many threads, and similarly reduce NUMBEROFBLOCKSINSORT to its minimal value of 8.

Co-authored-by: AnaIPereira <AnaIPereira@users.noreply.github.com>
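The writer-side logic described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in PutToMaster: the `Block` and `BlockRing` types, `ring_init`, `ring_put_term`, the `BLOCK_CAPACITY` limit, and the value chosen for `MIN_WRITE_TERMS` are all hypothetical, terms are only counted rather than copied, and the special role of "block 0" is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_BLOCKS      8  /* matches the reduced NUMBEROFBLOCKSINSORT */
#define MIN_WRITE_TERMS 2  /* stand-in for MINWRITENUMBEROFTERMS (value assumed) */
#define BLOCK_CAPACITY  4  /* hypothetical: max complete terms per block */

typedef struct {
    pthread_mutex_t lock; /* held by the writer while it fills the block */
    int nterms;           /* complete terms stored; a reader stops at this count */
} Block;

typedef struct {
    Block blocks[NUM_BLOCKS];
    int current;          /* index of the block the writer holds locked */
} BlockRing;

/* Initialise all blocks and let the writer take the lock on block 0. */
void ring_init(BlockRing *r) {
    for (int i = 0; i < NUM_BLOCKS; i++) {
        int rc = pthread_mutex_init(&r->blocks[i].lock, NULL);
        assert(rc == 0);
        (void)rc;
        r->blocks[i].nterms = 0;
    }
    r->current = 0;
    pthread_mutex_lock(&r->blocks[0].lock);
}

/* Write one complete term; returns the index of the block it went into. */
int ring_put_term(BlockRing *r) {
    Block *cur = &r->blocks[r->current];
    int advance = 0;

    if (cur->nterms >= BLOCK_CAPACITY) {
        advance = 1; /* block is full: terms are never split over blocks */
    } else if (cur->nterms >= MIN_WRITE_TERMS) {
        /* Probe: a failed trylock on the previous block is taken to mean
         * that a reader still holds it and wants the current block next. */
        int prev = (r->current + NUM_BLOCKS - 1) % NUM_BLOCKS;
        int rc = pthread_mutex_trylock(&r->blocks[prev].lock);
        if (rc == 0) {
            pthread_mutex_unlock(&r->blocks[prev].lock); /* nobody waiting */
        } else if (rc == EBUSY) {
            advance = 1; /* release the current block early */
        }
    }

    if (advance) {
        int next = (r->current + 1) % NUM_BLOCKS;
        pthread_mutex_lock(&r->blocks[next].lock); /* take next block first */
        r->blocks[next].nterms = 0;
        pthread_mutex_unlock(&cur->lock);          /* then hand over current */
        r->current = next;
        cur = &r->blocks[next];
    }

    cur->nterms++;
    return r->current;
}
```

Because each block records its own `nterms`, a reader that obtains the lock knows exactly how many terms to consume and does not need a term that overlaps a "stop" pointer to detect the end of the block. The probe can be exercised from a single thread by locking the previous block by hand, since `pthread_mutex_trylock` on an already locked default (non-recursive) mutex reports `EBUSY`.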

cbmarini · 2026-06-02T09:50:27Z

I also benchmarked this PR against the current master on my MacBook Pro (M5 Pro-chip, 15-core CPU, 48GB memory), using tform -w8.

Test	Speedup
chromatic	1.06 ± 0.0
color	1.01 ± 0.02
fmft	1.08 ± 0.01
forcer-exp	1.04 ± 0.01
forcer	1.08 ± 0.01
hyperform	1.04 ± 0.01
mbox1l	1.16 ± 0.02
minceex	1.08 ± 0.02
mincer	1.10 ± 0.0
mzv-dm	1.00 ± 0.01
sort-disk	0.97 ± 0.04
sort-large	1.06 ± 0.04
sort-small	1.01 ± 0.02
trace	1.00 ± 0.03

with

tform-master -vv
TFORM 5.0.0-beta.1 (May 19 2026, v5.0.0-beta.1-359-g1f70ea1)
-backtrace  +flint=3.5.0  +gmp=6.3.0   -mpi    +pthreads  +zlib=1.2.12
-debugging  +float        +mpfr=4.2.2  +posix  -windows   +zstd=1.5.7
Compiler: Apple Clang 21.0.0 (build 21000101)
Architecture: arm64

tform-josh -vv
TFORM 4.1 (May 21 2026, v4.1-20131025-925-g3e45fac)
-backtrace  +flint=3.5.0  +gmp=6.3.0   -mpi    +pthreads  +zlib=1.2.12
-debugging  +float        +mpfr=4.2.2  +posix  -windows   +zstd=1.5.7
Compiler: Apple Clang 21.0.0 (build 21000101)
Architecture: arm64

cbmarini · 2026-06-02T11:40:34Z

I also ran on a local Nikhef machine for -w32:

Benchmark	Speedup
chromatic	1.12 ± 0.01
color	1.02 ± 0.02
fmft	1.14 ± 0.01
forcer	1.38 ± 0.02
forcer-exp	1.18 ± 0.01
hyperform	1.09 ± 0.04
mbox1l	1.43 ± 0.04
minceex	1.11 ± 0.02
mincer	1.02 ± 0.02
mzv-dm	0.99 ± 0.04
sort-disk	1.04 ± 0.02
sort-large	0.99 ± 0.02
sort-small	1.19 ± 0.08
trace	1.10 ± 0.12

Machine: AMD EPYC 7702P 64-Core Processor, x86_64 and

tform-master -vv
TFORM 5.0.0 (May 19 2026, v5.0.0-25-g1f70ea1)
-backtrace  +flint=3.4.0  +gmp=6.3.0   -mpi    +pthreads  +zlib=1.2.11
-debugging  +float        +mpfr=4.2.2  +posix  -windows   +zstd=1.4.4
Compiler: GCC 8.5.0
Architecture: x86_64

tform-josh -vv
TFORM 4.1 (May 21 2026, v4.1-20131025-925-g3e45fac)
-backtrace  +flint=3.4.0  +gmp=6.3.0   -mpi    +pthreads  +zlib=1.2.11
-debugging  +float        +mpfr=4.2.2  +posix  -windows   +zstd=1.4.4
Compiler: GCC 8.5.0
Architecture: x86_64

jodavies force-pushed the sortbot-blocks branch from 6ef8662 to 5c7c957 Compare May 18, 2026 21:16

jodavies mentioned this pull request May 19, 2026

wip: reduce size of thread buckets at larger maxtermsize #829

Draft

jodavies force-pushed the sortbot-blocks branch 2 times, most recently from 541e928 to 3e45fac Compare May 21, 2026 05:33

jodavies force-pushed the sortbot-blocks branch from 3e45fac to aeb849a Compare June 2, 2026 09:44

jodavies merged commit c26c056 into form-dev:master Jun 2, 2026
84 checks passed
