Fix memory issues in CUDA EXX screening #182
vmitq wants to merge 3 commits into wavefunction91:master
Conversation
Pull request overview
This PR fixes correctness issues in CUDA exact-exchange (EXX) EK screening for large workloads by adjusting how task patches are generated/processed on device, and by widening several CUDA screening counters to 64-bit to avoid overflow.
Changes:
- Reworks the CUDA `exx_ek_screening` task loop to process each `generate_buffers()` patch independently (avoiding buffer overwrite across inner-loop iterations).
- Promotes CUDA EXX screening collision counts / position lists from 32-bit to 64-bit.
- Adds a new device-memory requirement flag for reserving EXX collision scratch space, and increases Scheme1 static padding.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/xc_integrator/xc_data/device/xc_device_data.hpp` | Adds `task_exx_collision` and a sizing helper to reserve scratch space for EXX collision work. |
| `src/xc_integrator/xc_data/device/xc_device_aos_data.cxx` | Accounts for the new EXX collision scratch reservation in AoS device memory requirements. |
| `src/xc_integrator/local_work_driver/device/scheme1_data_base.cxx` | Increases static allocation padding used for alignment. |
| `src/xc_integrator/local_work_driver/device/cuda/kernels/exx_ek_screening_bfn_stats.cu` | Widens collision counts/position lists to 64-bit and updates associated kernel signatures and allocations. |
| `src/xc_integrator/integrator_util/exx_screening.cxx` | Removes the problematic double-loop batching and calls collision per generated task patch. |
Comments suppressed due to low confidence (1)
src/xc_integrator/local_work_driver/device/cuda/kernels/exx_ek_screening_bfn_stats.cu:299
`print_counts` now takes `uint64_t* counts`, but the `printf` format string still uses `%d` for `counts[i_task]` (and `i_task` is also not an `int`). This is undefined behavior and can corrupt output or crash when debugging. Use the correct 64-bit/`size_t` format specifiers (or cast to an explicitly formatted type).
```cuda
__global__ void print_counts(size_t ntasks, uint64_t* counts) {
  for(auto i_task = 0 ; i_task < ntasks; ++i_task) {
    printf("[GPU] ITASK %d: %d\n", i_task, counts[i_task]);
  }
}
```
Thanks for the patch @vmitq! You seem to have disabled the batching in the EXX screening. I'm not opposed to that offhand, but can you quantify the performance penalty (if any) on systems for which the memory problem wasn't a problem?
It doesn’t remove batching completely; it just removes the second level of batching, which seems redundant here. I haven’t observed a notable performance difference, but if you’d like concrete numbers, I can run a few benchmarks.
Compute collision scratch requirements as `max(mem_coll + mem_bfn_stats, mem_collision)` per task. Replace `task_exx_collision_size` with `task_exx_coll_bitvec_size`, `task_exx_coll_fmax_size`, and `task_exx_coll_position_size` helpers.
Only use `uint64_t` for prefix-sum counters
The scheduling changes do not have a measurable impact on performance. The memory size for the screening step is now computed more accurately, which allows for larger batch sizes and leads to a slight overall performance improvement compared to master. Below are the final benchmark results for several molecules with the cc-pVDZ basis set and different screening tolerances:
This PR addresses memory-related issues in exact-exchange (EXX) calculations on CUDA GPU devices.