Skip to content

Speed up scalar Sobol with closed-form evaluation#97

Open
wantonsushi wants to merge 2 commits into
AcademySoftwareFoundation:mainfrom
wantonsushi:sobol-fast-2d-ahmed
Open

Speed up scalar Sobol with closed-form evaluation#97
wantonsushi wants to merge 2 commits into
AcademySoftwareFoundation:mainfrom
wantonsushi:sobol-fast-2d-ahmed

Conversation

@wantonsushi

Copy link
Copy Markdown

Hello OpenQMC maintainers!

This PR implements the Ahmed 2024 paper mentioned in issue #67 for the scalar path of sobolReversedIndex. I used the existing benchmark tool to check timings and generate to confirm the output is unchanged.

Changes

  • include/oqmc/owen.h: the scalar path uses the closed form.
  • src/tools/cli/matrices.cpp: derives the steps from the generator matrices and prints them, the same way it already prints directions[].
  • src/tests/owen.cpp: OwenTest.SobolReversedIndex checks the result against the direction matrices for every 16-bit index and dimension.

I also tried to implement a SIMD version (sobol_simd_experiment.cpp). It's not in this PR since it doesn't beat the existing SIMD, but I'm sharing it so you can reproduce the SIMD timings below.

Timings

Using benchmark sobol samples, median of 9 runs, in microseconds:

target old new speedup
CPU scalar 60034 56482 1.06x
GPU (RTX 4070) 128830 124873 1.03x
CPU SSE 47545 62700 0.76x
CPU AVX 47627 62856 0.76x

The speedup is small because generation is only part of a draw, along with scrambling and state hashing. The SIMD implementation fits one dimension per lane (four lanes). The existing SIMD path is faster because it packs all sixteen matrix columns into a wider register, so I left the SIMD paths alone.

Verification

  • generate sobol output is byte-identical to before on scalar, SSE and AVX
  • Tests pass (168/168), clang-format and clang-tidy clean

@joshbainbridge

Copy link
Copy Markdown
Collaborator

Hi @wantonsushi. I've just given the PR a scan and it is looking good. This is great work, thank you.

Very interesting to see the comparison. I'll follow up with a closer read of the code details sometime tomorrow.

@joshbainbridge

joshbainbridge commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

These changes are stellar, thank you. I would be happy to add this into the project. I've added some minor comments to the PR.

There are also a few additional of files we should update:

  • Adding a change log entry to CHANGELOG.md file.
  • Referencing the technique and the author in both include/oqmc/sobol.h:103 and README.md:979. Just a short sentence would be totally fine.

Once we are happy with the PR, next steps will be rebase onto main, and squash the commits down to a single commit. You will need to digitally sign the commit. Info on how to digitally sign, and commit message format can be found here:

Comment thread src/tools/cli/matrices.cpp
Comment thread src/tools/cli/matrices.cpp Outdated
Comment thread src/tests/owen.cpp
@joshbainbridge joshbainbridge linked an issue Jun 29, 2026 that may be closed by this pull request
@joshbainbridge

Copy link
Copy Markdown
Collaborator

Everything is looking good to me. Thank you for adding those additional changes. I think we are now ready for next steps:

  • Rebasing onto main
  • Squashing commits
    • Authoring Git message (brief, why, what)
    • Make sure to sign the commit with -S

Once that is done, the pipeline should allow us to merge the PR 🚀

The scalar Sobol path evaluates each dimension with a per-bit
matrix-vector product over 16 columns. Ahmed (EGSR 2024) computes the
same result more efficiently using a sequence of shift-mask-xor steps,
avoiding the loop over the columns.

Replace the scalar sobolReversedIndex loop with closed-form steps for
dimensions 0 to 3, steps emitted by a new matrices CLI tool, and cite
the technique in the comments and docs. Leave the SIMD paths unchanged,
as experiments showed Ahmed to be slower since the SIMD path also
avoids the loop by packing the 16 columns into a wide register. Add
test to validate the new path against old classic implementation.

Signed-off-by: wantonsushi <[email protected]>
@wantonsushi wantonsushi force-pushed the sobol-fast-2d-ahmed branch from c676881 to 467eaf9 Compare July 2, 2026 15:41
@wantonsushi

Copy link
Copy Markdown
Author

Pushed. Should be good to go now, let me know if anything else is required. Thanks for the opportunity to contribute! Might do more in the future :)

@fpsunflower

Copy link
Copy Markdown

This is cool! I hadn't figured out how to extend Ahmed's construction to arbitrary matrices as you've done here.

If I am reading the code right though, aside from the Pascal matrix (dimension 1) which has a compact O(log(n)) evaluation, we should expect to get roughly O(n) steps for an arbitrary matrix? So if we wanted to extend the code to 32-bit indices we would expect roughly twice as many steps?

Ahmed's follow up paper on SZ sequences has the nice property that higher order terms also have a factored representation so they are all O(log(n)) to evaluate, which allows extending even to 64-bit sequences (potentially useful if you are sharing a single sequence across pixels). Also since all pairs of dimensions are themselves (0,2) sequences, you can benefit from the super fast pascal matrix evaluation for every other term.

@wantonsushi

Copy link
Copy Markdown
Author

Hi Chris. Exciting to see you comment here. Is it weird to say I'm a big fan of your work at SPI?

Yes, you're right: dim 1 is O(log n), whereas dims 2/3 are O(n). I verified this, for dim 2:

num bits num steps
16 12
32 22
64 44

so roughly doubling like you said.

The SZ construction seems to make sense: every dim is a Pascal matrix P(a), and the S-P-Z decomposition evaluates it as the shared Pascal plus precision-independent corrections, so all dims stay O(log n) even at 64-bit (with every other dim reducing to plain Pascal)? I guess the catch is it's a different sequence past dim 1, so for OpenQMC, this isn't a simple drop-in replacement. @joshbainbridge what do you think?

I think I'll try making an SZ prototype for OpenQMC and benchmark it against the current path when I get a chance, I will follow up. Thanks for the comment.

@fpsunflower

Copy link
Copy Markdown

I played a bit with the code and got the same results as you for the Kuo Sobol matrices which OpenQMC is using. The SZ matrices have a slightly less random structure, so more stuff cancels out and you get only 12 steps required for 32-bit input for dimensions 2,3 (which is fairly close to what you need with the factored implementation and is easier to reason about). I'll do some tests to see if it makes any big difference in practice.

One small optimization, if instead of:

  index ^= (index & mask) << shift;

You do this:

   index ^= (index & reversebits(mask)) >> shift;

you can skip the final call to reversebits16() which should be a bit faster (of course the reversing of the mask is precomputed, and therefore free).

@wantonsushi

Copy link
Copy Markdown
Author

@fpsunflower Thanks for the suggestion. I applied it, but ran into an issue. The conjugated form measured 0.83x on GPU for me. To my understanding, right shifts contend with the logic pipe, while the original left shifts lower to multiplies on the multiply pipe. I tried right-shift-as-multiply-high, but __umulhi expands on Ada and I measured 0.59x. So I kept the old method for the GPU path. Maybe there is something I overlooked.

For CPU though, works great. With benchmark sobol samples, median of 9 runs: 1.14x speedup on CPU scalar, with byte-identical output.

I've left this as an independent commit so you and @joshbainbridge can easily verify the changes. I'll squash it to one commit if we're happy with it.

@wantonsushi wantonsushi force-pushed the sobol-fast-2d-ahmed branch from fde20ff to fbf7ec2 Compare July 3, 2026 01:34
@fpsunflower

Copy link
Copy Markdown

Interesting. That difference may depend on the GPU architecture in practice, I hadn't considered the fact that left-shifts are potentially faster.

Still, its better if the reversebits() happens on the input since its then shared across all 4 dimensions. The code probably should be restructured a bit to make this more obvious, again probably better left for another PR.

I'll also point out that in OpenQMC, reverseAndShuffle() calls reversebits() on the same index each time. This can be pulled out of drawSample(). In the end, you should only have 1 + num_dim calls to reversebits() per drawSample call (one for the index, and one per output dimension).

Also, while the 16-bit implementation is sufficient for a randomized seed (all the index bits are used for the sample index), when using the Z-sampler construction, part of the index gets used for the pixel coordinates and you need a fully 32-bit evaluation. With the optimized method used here, I don't think it should be that much slower than the 16-bit version and would scale to higher sample counts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve performance with new Sobol construction method

3 participants