Fix memory issues in CUDA EXX screening #182
vmitq wants to merge 3 commits into wavefunction91:master
Conversation
Pull request overview
This PR fixes correctness issues in CUDA exact-exchange (EXX) EK screening for large workloads by adjusting how task patches are generated/processed on device, and by widening several CUDA screening counters to 64-bit to avoid overflow.
Changes:
- Reworks the CUDA `exx_ek_screening` task loop to process each `generate_buffers()` patch independently (avoiding buffer overwrite across inner-loop iterations).
- Promotes CUDA EXX screening collision counts / position lists from 32-bit to 64-bit.
- Adds a new device-memory requirement flag for reserving EXX collision scratch space, and increases Scheme1 static padding.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/xc_integrator/xc_data/device/xc_device_data.hpp` | Adds `task_exx_collision` and a sizing helper to reserve scratch space for EXX collision work. |
| `src/xc_integrator/xc_data/device/xc_device_aos_data.cxx` | Accounts for the new EXX collision scratch reservation in AoS device memory requirements. |
| `src/xc_integrator/local_work_driver/device/scheme1_data_base.cxx` | Increases static allocation padding used for alignment. |
| `src/xc_integrator/local_work_driver/device/cuda/kernels/exx_ek_screening_bfn_stats.cu` | Widens collision counts/position lists to 64-bit and updates associated kernel signatures and allocations. |
| `src/xc_integrator/integrator_util/exx_screening.cxx` | Removes the problematic double-loop batching and calls collision per generated task patch. |
Comments suppressed due to low confidence (1)
src/xc_integrator/local_work_driver/device/cuda/kernels/exx_ek_screening_bfn_stats.cu:299
`print_counts` now takes `uint64_t* counts`, but the `printf` format string still uses `%d` for `counts[i_task]` (and `i_task` is also not an `int`). This is undefined behavior and can corrupt output or crash when debugging. Use the correct 64-bit/`size_t` format specifiers (or cast to an explicitly formatted type).
```cuda
__global__ void print_counts(size_t ntasks, uint64_t* counts) {
  for(auto i_task = 0 ; i_task < ntasks; ++i_task) {
    printf("[GPU] ITASK %d: %d\n", i_task, counts[i_task]);
  }
}
```
Thanks for the patch @vmitq! You seem to have disabled the batching in the EXX screening. I'm not opposed to that offhand, but can you quantify the performance penalty (if any) on systems for which the memory problem wasn't a problem?
It doesn’t remove batching completely; it just removes the second level of batching, which seems redundant here. I haven’t observed a notable performance difference, but if you’d like concrete numbers, I can run a few benchmarks.
Compute collision scratch requirements as `max(mem_coll + mem_bfn_stats, mem_collision)` per task. Replace `task_exx_collision_size` with `task_exx_coll_bitvec_size`, `task_exx_coll_fmax_size`, and `task_exx_coll_position_size` helpers.
Only use `uint64_t` for prefix-sum counters
The scheduling changes do not have a measurable impact on performance. The memory size for the screening step is now computed more accurately, which allows for larger batch sizes and leads to a slight overall performance improvement compared to master. Below are the final benchmark results for several molecules with the cc-pVDZ basis set and different screening tolerances:
This PR addresses memory-related issues in exact-exchange (EXX) calculations on CUDA GPU devices.