Skip to content

Optimize compute_zdiff_gradp#1272

Draft
msimberg wants to merge 12 commits into
C2SM:mainfrom
msimberg:zdiff-vertoffset-gradp
Draft

Optimize compute_zdiff_gradp#1272
msimberg wants to merge 12 commits into
C2SM:mainfrom
msimberg:zdiff-vertoffset-gradp

Conversation

@msimberg

Copy link
Copy Markdown
Contributor

This is a POC optimized mostly by an LLM, not meant for merging as-is.

compute_zdiff_gradp is very slow on GPUs at least in the distributed metrics tests. On CPU the ZDIFF_GRADP/VERTOFFSET_GRADP test runs in a few seconds, on GPU ~300 seconds. These two fields take by far the longest to on the distributed tests (not counting standalone driver).

msimberg added 9 commits May 19, 2026 09:48
…mpute_zdiff_gradp

Replaces O(nedges * nlev^2) nested jk/jk1 search loops with
O(nedges * nlev * log(nlev)) np.searchsorted on reversed z_ifc
slices. Eliminates per-iteration boolean array allocation and
array_ns.where calls.

Phase 1: batch searchsorted for all z_me[jk] per edge per side.
Phase 2: single searchsorted for z_aux2, applied via boolean mask.

Largest grid (exclaim_ch_r04b09_dsl): 2.95s -> 0.93s (3.2x)
…zdiff_gradp

Removes the (nedges, 2, nlev) vertidx_gradp allocation and the
final vertoffset_gradp subtraction step. Computes vertoffset_gradp
offsets directly in the per-edge loop. Initializes vertoffset_gradp
as zeros instead of broadcasting jk_field.
- Pre-compute z_ifc_asc = z_ifc[:, ::-1] to avoid per-edge reversal
- Extract e2c_0, e2c_1 column arrays to avoid per-edge indexing
- Unroll for side in range(2) into explicit side 0 and side 1
- Hoist jk_field_slice and z_me_slice out of side processing
The zdiff_gradp computation is now fast enough (0.92s vs 2.95s
baseline on largest grid) that the factory test no longer needs
the cpu_only guard. The vwind_impl_wgt speed concern is also
resolved.
Replaces 62K per-edge searchsorted calls with 4 batched 2D calls
using the row-offset trick. Detects CuPy vs NumPy at runtime.
All data stays on-device (no per-iteration D2H transfers).

Largest grid (exclaim_ch_r04b09_dsl): 2.95s -> 0.74s (4.0x)
Replaces _get_xp / _cp / _np with the array_ns returned by
data_alloc.array_namespace(), which already provides backend-
agnostic searchsorted, clip, where, take_along_axis, etc.
- Remove unnecessary z_me_m masking (invalid jk values are
  masked out by valid_jk after the search)
- Remove .copy() from z_ifc_asc (fancy indexing creates new arrays)
- Compute fill_high from z_ifc directly (fewer elements than z_ifc_e0)
- Remove zdiff_gradp exchange calls from function, delegate to
  factory via do_exchange=True
The z_me halo exchange inside compute_zdiff_gradp was redundant:
z_mc already has valid halo values from the factory dependency
chain (Z_MC provider exchanges its halo). Since z_me is computed
locally from z_mc[e2c], it inherits correct halo values without
an additional exchange.

- Remove exchange parameter from compute_zdiff_gradp signature
- Remove functools.partial wrapping in factory registration
- Remove dims and decomposition imports (no longer needed)
- Update unit test call site
@msimberg

Copy link
Copy Markdown
Contributor Author

cscs-ci run distributed

@github-actions

Copy link
Copy Markdown

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

@msimberg

Copy link
Copy Markdown
Contributor Author

cscs-ci run distributed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant