Speed up scalar Sobol with closed-form evaluation by wantonsushi · Pull Request #97 · AcademySoftwareFoundation/openqmc

wantonsushi · 2026-06-22T02:36:59Z

Hello OpenQMC maintainers!

This PR implements the Ahmed 2024 paper mentioned in issue #67 for the scalar path of sobolReversedIndex. I used the existing benchmark tool to check timings and generate to confirm the output is unchanged.

Changes

include/oqmc/owen.h: the scalar path uses the closed form.
src/tools/cli/matrices.cpp: derives the steps from the generator matrices and prints them, the same way it already prints directions[].
src/tests/owen.cpp: OwenTest.SobolReversedIndex checks the result against the direction matrices for every 16-bit index and dimension.

I also tried to implement a SIMD version (sobol_simd_experiment.cpp). It's not in this PR since it doesn't beat the existing SIMD, but I'm sharing it so you can reproduce the SIMD timings below.

Timings

Using benchmark sobol samples, median of 9 runs, in microseconds:

target	old	new	speedup
CPU scalar	60034	56482	1.06x
GPU (RTX 4070)	128830	124873	1.03x
CPU SSE	47545	62700	0.76x
CPU AVX	47627	62856	0.76x

The speedup is small because generation is only part of a draw, along with scrambling and state hashing. The SIMD implementation fits one dimension per lane (four lanes). The existing SIMD path is faster because it packs all sixteen matrix columns into a wider register, so I left the SIMD paths alone.

Verification

generate sobol output is byte-identical to before on scalar, SSE and AVX
Tests pass (168/168), clang-format and clang-tidy clean

joshbainbridge · 2026-06-24T23:26:15Z

Hi @wantonsushi. I've just given the PR a scan and it is looking good. This is great work, thank you.

Very interesting to see the comparison. I'll follow up with a closer read of the code details sometime tomorrow.

joshbainbridge · 2026-06-28T17:26:32Z

These changes are stellar, thank you. I would be happy to add this into the project. I've added some minor comments to the PR.

There are also a few additional of files we should update:

Adding a change log entry to CHANGELOG.md file.
Referencing the technique and the author in both include/oqmc/sobol.h:103 and README.md:979. Just a short sentence would be totally fine.

Once we are happy with the PR, next steps will be rebase onto main, and squash the commits down to a single commit. You will need to digitally sign the commit. Info on how to digitally sign, and commit message format can be found here:

Digital signing: https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification
Commit message format: https://github.com/AcademySoftwareFoundation/openqmc/blob/main/CONTRIBUTING.md#developer-workflow

joshbainbridge · 2026-07-01T20:08:18Z

Everything is looking good to me. Thank you for adding those additional changes. I think we are now ready for next steps:

Rebasing onto main
Squashing commits
- Authoring Git message (brief, why, what)
- Make sure to sign the commit with -S

Once that is done, the pipeline should allow us to merge the PR 🚀

The scalar Sobol path evaluates each dimension with a per-bit matrix-vector product over 16 columns. Ahmed (EGSR 2024) computes the same result more efficiently using a sequence of shift-mask-xor steps, avoiding the loop over the columns. Replace the scalar sobolReversedIndex loop with closed-form steps for dimensions 0 to 3, steps emitted by a new matrices CLI tool, and cite the technique in the comments and docs. Leave the SIMD paths unchanged, as experiments showed Ahmed to be slower since the SIMD path also avoids the loop by packing the 16 columns into a wide register. Add test to validate the new path against old classic implementation. Signed-off-by: wantonsushi <[email protected]>

wantonsushi · 2026-07-02T15:47:10Z

Pushed. Should be good to go now, let me know if anything else is required. Thanks for the opportunity to contribute! Might do more in the future :)

fpsunflower · 2026-07-02T16:56:32Z

This is cool! I hadn't figured out how to extend Ahmed's construction to arbitrary matrices as you've done here.

If I am reading the code right though, aside from the Pascal matrix (dimension 1) which has a compact O(log(n)) evaluation, we should expect to get roughly O(n) steps for an arbitrary matrix? So if we wanted to extend the code to 32-bit indices we would expect roughly twice as many steps?

Ahmed's follow up paper on SZ sequences has the nice property that higher order terms also have a factored representation so they are all O(log(n)) to evaluate, which allows extending even to 64-bit sequences (potentially useful if you are sharing a single sequence across pixels). Also since all pairs of dimensions are themselves (0,2) sequences, you can benefit from the super fast pascal matrix evaluation for every other term.

wantonsushi · 2026-07-02T19:37:19Z

Hi Chris. Exciting to see you comment here. Is it weird to say I'm a big fan of your work at SPI?

Yes, you're right: dim 1 is O(log n), whereas dims 2/3 are O(n). I verified this, for dim 2:

num bits	num steps
16	12
32	22
64	44

so roughly doubling like you said.

The SZ construction seems to make sense: every dim is a Pascal matrix P(a), and the S-P-Z decomposition evaluates it as the shared Pascal plus precision-independent corrections, so all dims stay O(log n) even at 64-bit (with every other dim reducing to plain Pascal)? I guess the catch is it's a different sequence past dim 1, so for OpenQMC, this isn't a simple drop-in replacement. @joshbainbridge what do you think?

I think I'll try making an SZ prototype for OpenQMC and benchmark it against the current path when I get a chance, I will follow up. Thanks for the comment.

fpsunflower · 2026-07-02T21:58:05Z

I played a bit with the code and got the same results as you for the Kuo Sobol matrices which OpenQMC is using. The SZ matrices have a slightly less random structure, so more stuff cancels out and you get only 12 steps required for 32-bit input for dimensions 2,3 (which is fairly close to what you need with the factored implementation and is easier to reason about). I'll do some tests to see if it makes any big difference in practice.

One small optimization, if instead of:

  index ^= (index & mask) << shift;

You do this:

   index ^= (index & reversebits(mask)) >> shift;

you can skip the final call to reversebits16() which should be a bit faster (of course the reversing of the mask is precomputed, and therefore free).

wantonsushi · 2026-07-03T01:30:46Z

@fpsunflower Thanks for the suggestion. I applied it, but ran into an issue. The conjugated form measured 0.83x on GPU for me. To my understanding, right shifts contend with the logic pipe, while the original left shifts lower to multiplies on the multiply pipe. I tried right-shift-as-multiply-high, but __umulhi expands on Ada and I measured 0.59x. So I kept the old method for the GPU path. Maybe there is something I overlooked.

For CPU though, works great. With benchmark sobol samples, median of 9 runs: 1.14x speedup on CPU scalar, with byte-identical output.

I've left this as an independent commit so you and @joshbainbridge can easily verify the changes. I'll squash it to one commit if we're happy with it.

fpsunflower · 2026-07-03T02:43:19Z

Interesting. That difference may depend on the GPU architecture in practice, I hadn't considered the fact that left-shifts are potentially faster.

Still, its better if the reversebits() happens on the input since its then shared across all 4 dimensions. The code probably should be restructured a bit to make this more obvious, again probably better left for another PR.

I'll also point out that in OpenQMC, reverseAndShuffle() calls reversebits() on the same index each time. This can be pulled out of drawSample(). In the end, you should only have 1 + num_dim calls to reversebits() per drawSample call (one for the index, and one per output dimension).

Also, while the 16-bit implementation is sufficient for a randomized seed (all the index bits are used for the sample index), when using the Z-sampler construction, part of the index gets used for the pixel coordinates and you need a fully 32-bit evaluation. With the optimized method used here, I don't think it should be that much slower than the 16-bit version and would scale to higher sample counts.

wantonsushi requested a review from joshbainbridge as a code owner June 22, 2026 02:37

joshbainbridge reviewed Jun 28, 2026

View reviewed changes

Comment thread src/tools/cli/matrices.cpp

Comment thread src/tools/cli/matrices.cpp Outdated

Comment thread src/tests/owen.cpp

joshbainbridge linked an issue Jun 29, 2026 that may be closed by this pull request

Improve performance with new Sobol construction method #67

Open

wantonsushi requested a review from joshbainbridge June 30, 2026 02:15

wantonsushi force-pushed the sobol-fast-2d-ahmed branch from c676881 to 467eaf9 Compare July 2, 2026 15:41

Move CPU sobol bit reversal onto the input

fbf7ec2

wantonsushi force-pushed the sobol-fast-2d-ahmed branch from fde20ff to fbf7ec2 Compare July 3, 2026 01:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Speed up scalar Sobol with closed-form evaluation#97

Speed up scalar Sobol with closed-form evaluation#97
wantonsushi wants to merge 2 commits into
AcademySoftwareFoundation:mainfrom
wantonsushi:sobol-fast-2d-ahmed

wantonsushi commented Jun 22, 2026

Uh oh!

joshbainbridge commented Jun 24, 2026

Uh oh!

joshbainbridge commented Jun 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

joshbainbridge commented Jul 1, 2026

Uh oh!

wantonsushi commented Jul 2, 2026

Uh oh!

fpsunflower commented Jul 2, 2026

Uh oh!

wantonsushi commented Jul 2, 2026

Uh oh!

fpsunflower commented Jul 2, 2026

Uh oh!

wantonsushi commented Jul 3, 2026

Uh oh!

fpsunflower commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

wantonsushi commented Jun 22, 2026

Changes

Timings

Verification

Uh oh!

joshbainbridge commented Jun 24, 2026

Uh oh!

joshbainbridge commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

joshbainbridge commented Jul 1, 2026

Uh oh!

wantonsushi commented Jul 2, 2026

Uh oh!

fpsunflower commented Jul 2, 2026

Uh oh!

wantonsushi commented Jul 2, 2026

Uh oh!

fpsunflower commented Jul 2, 2026

Uh oh!

wantonsushi commented Jul 3, 2026

Uh oh!

fpsunflower commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

joshbainbridge commented Jun 28, 2026 •

edited

Loading