Skip to content

feat(cuda): GPU batch inverse#658

Open
ColoCarletti wants to merge 79 commits into
mainfrom
feat/cuda-pr5-batch-invert
Open

feat(cuda): GPU batch inverse#658
ColoCarletti wants to merge 79 commits into
mainfrom
feat/cuda-pr5-batch-invert

Conversation

@ColoCarletti

Copy link
Copy Markdown
Collaborator

Summary

Replaces the two remaining CPU inplace_batch_inverse sites in the prover with a fused on-device compute+invert pipeline. R3 OOD's per-eval-point barycentric_inv_denoms and R4 DEEP's (x_i − z_k) denominators now both flow through a single multi-block Hillis-Steele scan kernel and return device-resident CudaSlice<u64> handles that downstream dispatchers (try_barycentric_*_on_handle, try_deep_composition_gpu) read directly.

Wall-clock parity on fib_iterative_4M on a 46-core host (savings overlap with existing GPU work). The win is in PCIe traffic (~6 GB of redundant H2D removed per prove), peak host RSS at R4 (~288 MB removed), and the architectural primitives (Arc<CudaStream> threading, DenomSign enum, public batch_inverse_ext3_dev API) that future device-resident extensions can reuse. Pays off proportionally on smaller hosts, larger traces, more eval points, and more tables.

Changes

  • crypto/math-cuda/kernels/inverse.cu — 6 kernels: sign-flagged compute_denoms_ext3 (R3 z − x vs R4 x − z), forward and reverse multi-block scans,
    apply-offsets passes, batch_inverse_combine_ext3.
  • crypto/math-cuda/src/inverse.rs — host driver: batch_inverse_ext3 (parity-test path), batch_inverse_ext3_dev (device→device), fused
    compute_and_invert_denoms_ext3_dev with the DenomSign enum, recursive scan driver, one ext3 Fermat inverse on host per batch.
  • crypto/math-cuda/src/{barycentric,deep}.rs_with_dev_inv_denoms variants that take a buffer + offset + caller's stream and slice internally (no
    per-call H2D, no cross-stream race).
  • crypto/stark/src/gpu_lde.rstry_compute_and_invert_inv_denoms_dev + try_inv_denoms_dev_with_stream (acquires backend + stream). Threads
    Option<(&CudaSlice<u64>, usize, &Arc<CudaStream>)> through the barycentric and DEEP dispatchers. New gpu_batch_invert_calls counter.
  • crypto/stark/src/{trace,prover}.rs — R3 OOD and R4 DEEP fast paths. CPU barycentric_inv_denoms is now lazy on the all-GPU happy path; the CPU denoms
    Vec in compute_deep_composition_poly_evaluations is only built when the dev-inv-denoms path returns None.
  • Testsbatch_inverse.rs (parity, n in {1, 2..256, 257..1024, 4096..2^16, 2^18, 2^20, 2^22}); compute_and_invert_denoms.rs (parity parametrised over both
    signs, R3 and R4 shapes); invert_ext3_host parity against Degree3GoldilocksExtensionField::inv; cuda_path_integration asserts the new counter fires;
    cuda_fallback_tests::gpu_batch_invert_fault_falls_back_to_cpu validates the CPU fallback under injected cudarc errors.

Fallback

Every dispatch returns None on TypeId mismatch (non-Goldilocks / non-ext3), below threshold, or any cudarc error. The caller falls through to the existing CPU inplace_batch_inverse path with no state change. Runtime-validated by the new fault-injection test.

ColoCarletti and others added 30 commits May 6, 2026 15:12
ColoCarletti and others added 20 commits June 3, 2026 15:43
# Conflicts:
#	crypto/math-cuda/build.rs
#	crypto/math-cuda/src/device.rs
#	crypto/math-cuda/src/lib.rs
#	crypto/stark/src/gpu_lde.rs
#	crypto/stark/src/prover.rs
#	prover/tests/cuda_path_integration.rs
…5-batch-invert

# Conflicts:
#	prover/tests/cuda_path_integration.rs
…nvert

# Conflicts:
#	crypto/math-cuda/build.rs
#	crypto/math-cuda/src/deep.rs
#	crypto/math-cuda/src/device.rs
#	crypto/math-cuda/src/lib.rs
#	crypto/stark/src/gpu_lde.rs
#	crypto/stark/src/prover.rs
#	prover/tests/cuda_fallback_tests.rs
#	prover/tests/cuda_path_integration.rs
@ColoCarletti

Copy link
Copy Markdown
Collaborator Author

/codex

@ColoCarletti

Copy link
Copy Markdown
Collaborator Author

/claude

@github-actions

Copy link
Copy Markdown

Codex Code Review

Found 2 issues in the PR diff.

  1. High - R4 GPU fallback can panic when device inv-denoms exist but device parts do not
    crypto/stark/src/gpu_lde.rs:1199 skips validating inv_denoms_host whenever inv_denoms_dev is Some, but the (None, _) match arm later still slices inv_denoms_host[0..lde_size] at crypto/stark/src/gpu_lde.rs:1294. The caller in crypto/stark/src/prover.rs:1369 can pass Some(inv_dev) while round_2_result.gpu_composition_parts is None, so this turns a GPU fast-path miss into an out-of-bounds panic instead of falling back to CPU.
    Fix by adding an explicit (None, Some(_)) path that returns None or by supporting host parts + device inv-denoms; alternatively only attempt the device-inv-denoms DEEP path when gpu_composition_parts.is_some().

  2. Medium - Size truncation can under-launch CUDA kernels and leave uninitialized output
    The new inverse path casts usize sizes to u32 before computing grid sizes, e.g. crypto/math-cuda/src/inverse.rs:182, crypto/math-cuda/src/inverse.rs:218, and crypto/math-cuda/src/inverse.rs:323. For inputs above u32::MAX, this wraps/truncates, while allocations like crypto/math-cuda/src/inverse.rs:173 are sized from the full usize. Because these buffers are allocated uninitialized under “kernel writes every slot” assumptions, an under-launch can return/read unwritten device memory.
    Add explicit checked bounds before launch, e.g. reject n > u32::MAX / total > u32::MAX, and use checked_mul(3) for allocation lengths.

No Critical security issues found in the reviewed diff.

Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/math-cuda/kernels/inverse.cu Outdated
@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review: feat(cuda): GPU batch inverse

Overall: Well-structured PR. The multi-block Hillis-Steele scan, recursive block-total pass, single host Fermat inversion per batch, DenomSign abstraction, Arc<CudaStream> threading, and fallback design are all sound. Test coverage is thorough. Two issues found.


Bug (Medium) — try_deep_composition_gpu can panic on R2-fail + R4-success combination

See inline on gpu_lde.rs:1308. When the caller passes inv_denoms_host = &[] with inv_denoms_dev = Some(...) (the prover.rs fast path does exactly this), but parts_dev = None (R2 GPU composition failed), execution falls into the (None, _) match arm and &inv_denoms_host[0..lde_size] panics on the empty slice. The length check at line ~1199 is conditional on inv_denoms_dev.is_none(), so it is bypassed for this combination. The fix is an early-return guard on the (None, _) arm when inv_denoms_host is unpopulated.

Low — Benign CUDA data race in block-total write (inverse.cu:130)

See inline on inverse.cu:130. When n % BLOCK_SIZE != 0, two threads write to the same block_totals slot. Both values are equal (identity padding leaves the running product unchanged) so correctness is never affected, but adding gid < n && to the condition makes the single-writer intent explicit.

@ColoCarletti ColoCarletti added the gpu Related to GPU/CUDA development label Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gpu Related to GPU/CUDA development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants