feat(cuda): GPU batch inverse #658
Co-authored-by: Gabriel Bosio <[email protected]>
# Conflicts: # crypto/math-cuda/build.rs # crypto/math-cuda/src/device.rs # crypto/math-cuda/src/lib.rs # crypto/stark/src/gpu_lde.rs # crypto/stark/src/prover.rs # prover/tests/cuda_path_integration.rs
…5-batch-invert # Conflicts: # prover/tests/cuda_path_integration.rs
…nvert # Conflicts: # crypto/math-cuda/build.rs # crypto/math-cuda/src/deep.rs # crypto/math-cuda/src/device.rs # crypto/math-cuda/src/lib.rs # crypto/stark/src/gpu_lde.rs # crypto/stark/src/prover.rs # prover/tests/cuda_fallback_tests.rs # prover/tests/cuda_path_integration.rs
/codex
/claude
Codex Code Review
Found 2 issues in the PR diff.
No Critical security issues found in the reviewed diff.
Review: feat(cuda): GPU batch inverse
Overall: Well-structured PR. The multi-block Hillis-Steele scan, recursive block-total pass, single host Fermat inversion per batch, …
Bug (Medium): …
Summary
Replaces the two remaining CPU inplace_batch_inverse sites in the prover with a fused on-device compute+invert pipeline. R3 OOD's per-eval-point barycentric_inv_denoms and R4 DEEP's (x_i − z_k) denominators now both flow through a single multi-block Hillis-Steele scan kernel and return device-resident CudaSlice<u64> handles that downstream dispatchers (try_barycentric_*_on_handle, try_deep_composition_gpu) read directly.

Wall-clock time is at parity on fib_iterative_4M on a 46-core host, because the savings overlap with existing GPU work. The win is in PCIe traffic (~6 GB of redundant H2D removed per prove), peak host RSS at R4 (~288 MB removed), and the architectural primitives (Arc<CudaStream> threading, DenomSign enum, public batch_inverse_ext3_dev API) that future device-resident extensions can reuse. The change pays off proportionally on smaller hosts, larger traces, more eval points, and more tables.

Changes
crypto/math-cuda/kernels/inverse.cu: 6 kernels, comprising sign-flagged compute_denoms_ext3 (R3 z − x vs R4 x − z), forward and reverse multi-block scans, apply-offsets passes, and batch_inverse_combine_ext3.

crypto/math-cuda/src/inverse.rs: host driver with batch_inverse_ext3 (parity-test path), batch_inverse_ext3_dev (device→device), fused compute_and_invert_denoms_ext3_dev with the DenomSign enum, recursive scan driver, and one ext3 Fermat inverse on host per batch.

crypto/math-cuda/src/{barycentric,deep}.rs: _with_dev_inv_denoms variants that take a buffer + offset + caller's stream and slice internally (no per-call H2D, no cross-stream race).

crypto/stark/src/gpu_lde.rs: try_compute_and_invert_inv_denoms_dev + try_inv_denoms_dev_with_stream (acquires backend + stream). Threads Option<(&CudaSlice<u64>, usize, &Arc<CudaStream>)> through the barycentric and DEEP dispatchers. New gpu_batch_invert_calls counter.

crypto/stark/src/{trace,prover}.rs: R3 OOD and R4 DEEP fast paths. CPU barycentric_inv_denoms is now lazy on the all-GPU happy path; the CPU denoms Vec in compute_deep_composition_poly_evaluations is only built when the dev-inv-denoms path returns None.

batch_inverse.rs (parity, n in {1, 2..256, 257..1024, 4096..2^16, 2^18, 2^20, 2^22}); compute_and_invert_denoms.rs (parity parametrised over both signs, R3 and R4 shapes); invert_ext3_host parity against Degree3GoldilocksExtensionField::inv; cuda_path_integration asserts the new counter fires; cuda_fallback_tests::gpu_batch_invert_fault_falls_back_to_cpu validates the CPU fallback under injected cudarc errors.

Fallback
Every dispatch returns None on TypeId mismatch (non-Goldilocks / non-ext3), below threshold, or any cudarc error. The caller falls through to the existing CPU inplace_batch_inverse path with no state change. This is runtime-validated by the new fault-injection test.
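
For readers unfamiliar with the pattern, the fallback contract described above can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the code in this PR: it works over the Goldilocks base field rather than ext3, runs the prefix products sequentially instead of as a multi-block scan, and the names mul, pow, inv, cpu_batch_inverse, and batch_inverse_with_fallback are hypothetical. It shows the two ideas the description relies on: a batch inverse that needs only one Fermat inversion per batch, and a fast path that returns None without touching the caller's data.

```rust
/// Goldilocks prime p = 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;

/// Modular multiplication via a 128-bit intermediate product.
fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Square-and-multiply exponentiation in the field.
fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Fermat inverse a^(p-2). The caller must ensure a != 0.
fn inv(a: u64) -> u64 {
    pow(a, P - 2)
}

/// CPU batch inverse (hypothetical stand-in for the existing CPU path).
/// All inputs are assumed nonzero; only one field inversion is performed.
fn cpu_batch_inverse(values: &mut [u64]) {
    if values.is_empty() {
        return;
    }
    // Forward pass: prefix[i] = v_0 * v_1 * ... * v_i.
    let mut prefix = Vec::with_capacity(values.len());
    let mut acc = 1u64;
    for &v in values.iter() {
        acc = mul(acc, v);
        prefix.push(acc);
    }
    // Single inversion of the total product.
    let mut inv_acc = inv(acc);
    // Reverse pass: peel off one factor at a time.
    for i in (1..values.len()).rev() {
        let v = values[i];
        values[i] = mul(inv_acc, prefix[i - 1]);
        inv_acc = mul(inv_acc, v);
    }
    values[0] = inv_acc;
}

/// Tries the fast path first. The fast path only sees an immutable slice,
/// so returning None leaves `values` unchanged and the CPU path runs on the
/// original data. Returns true when the fast path produced the result.
fn batch_inverse_with_fallback<F>(values: &mut [u64], fast_path: F) -> bool
where
    F: FnOnce(&[u64]) -> Option<Vec<u64>>,
{
    match fast_path(values) {
        Some(out) if out.len() == values.len() => {
            values.copy_from_slice(&out);
            true
        }
        _ => {
            cpu_batch_inverse(values);
            false
        }
    }
}
```

Passing a closure that always returns None, for example batch_inverse_with_fallback(&mut v, |_| None), exercises the fallback branch in the same spirit as the fault-injection test: the result must match what the CPU path alone would have produced.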