Optimize `compute_zdiff_gradp` by msimberg · Pull Request #1272 · C2SM/icon4py

msimberg · 2026-05-19T07:52:15Z

This is a POC optimized mostly by an LLM, not meant for merging as-is.

compute_zdiff_gradp is very slow on GPUs, at least in the distributed metrics tests. On CPU the ZDIFF_GRADP/VERTOFFSET_GRADP test runs in a few seconds; on GPU it takes ~300 seconds. These two fields take by far the longest in the distributed tests (not counting the standalone driver).

…mpute_zdiff_gradp

Replaces O(nedges * nlev^2) nested jk/jk1 search loops with O(nedges * nlev * log(nlev)) np.searchsorted on reversed z_ifc slices. Eliminates per-iteration boolean array allocation and array_ns.where calls.

Phase 1: batch searchsorted for all z_me[jk] per edge per side.
Phase 2: single searchsorted for z_aux2, applied via boolean mask.

Largest grid (exclaim_ch_r04b09_dsl): 2.95s -> 0.93s (3.2x)
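As a rough illustration of the idea in this commit, the sketch below compares a nested-loop layer search with `np.searchsorted` on a reversed (ascending) copy of one interface-height column. The function names, the toy heights, and the exact bracketing condition (`z_ifc[jk1] >= z > z_ifc[jk1 + 1]`) are assumptions made for the example; the criterion used in `compute_zdiff_gradp` may differ.

```python
import numpy as np


def find_layer_loop(z_ifc, targets):
    """Reference O(nlev) scan per target: find jk1 with z_ifc[jk1] >= z > z_ifc[jk1 + 1]."""
    nlev = z_ifc.shape[0] - 1
    out = np.full(targets.shape, -1, dtype=np.int64)
    for i, z in enumerate(targets):
        for jk1 in range(nlev):
            if z_ifc[jk1] >= z > z_ifc[jk1 + 1]:
                out[i] = jk1
                break
    return out


def find_layer_searchsorted(z_ifc, targets):
    """Same search in O(log nlev) per target on the reversed, ascending column."""
    nlev = z_ifc.shape[0] - 1
    z_ifc_asc = z_ifc[::-1]  # heights decrease with level index, so reverse them
    # side="left": pos is the number of interfaces strictly below the target
    pos = np.searchsorted(z_ifc_asc, targets, side="left")
    # Map the ascending position back to a layer index and keep it in range
    return np.clip(nlev - pos, 0, nlev - 1)


# Toy column: 4 layers, 5 interfaces, level 0 at the top (assumed layout)
z_ifc = np.array([100.0, 60.0, 30.0, 10.0, 0.0])
targets = np.array([80.0, 45.0, 30.0, 5.0])

layers_loop = find_layer_loop(z_ifc, targets)
layers_fast = find_layer_searchsorted(z_ifc, targets)
```

Both functions return the same layer indices for targets inside the column; the clip only matters for targets at or below the lowest interface, which the loop version would leave unset.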

…zdiff_gradp

Removes the (nedges, 2, nlev) vertidx_gradp allocation and the final vertoffset_gradp subtraction step. Computes vertoffset_gradp offsets directly in the per-edge loop. Initializes vertoffset_gradp as zeros instead of broadcasting jk_field.
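The change described here can be pictured with a one-column toy case. The sketch below is an assumption about the shape of the computation, not the PR's code: `found` and `valid` are hypothetical search results, and the "two-step" variant stands in for the removed index array plus final subtraction.

```python
import numpy as np

nlev = 4
jk = np.arange(nlev)                     # level index of each entry
found = np.array([0, 2, 3, 3])           # hypothetical search result per level
valid = np.array([True, True, True, False])  # hypothetical: last level has no match

# Two-step variant: build an absolute index array initialised from jk,
# then subtract jk at the end to obtain offsets.
vertidx = np.where(valid, found, jk)
offset_two_step = vertidx - jk

# Direct variant: start from zeros and write the offset where a match exists,
# so no intermediate index array or final subtraction is needed.
offset_direct = np.zeros(nlev, dtype=np.int64)
offset_direct[valid] = found[valid] - jk[valid]
```

Entries without a match keep an offset of zero in both variants, which is why initialising with zeros is equivalent to initialising the index array with `jk`.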

- Pre-compute z_ifc_asc = z_ifc[:, ::-1] to avoid per-edge reversal
- Extract e2c_0, e2c_1 column arrays to avoid per-edge indexing
- Unroll for side in range(2) into explicit side 0 and side 1
- Hoist jk_field_slice and z_me_slice out of side processing

The zdiff_gradp computation is now fast enough (0.92s vs 2.95s baseline on largest grid) that the factory test no longer needs the cpu_only guard. The vwind_impl_wgt speed concern is also resolved.

Replaces 62K per-edge searchsorted calls with 4 batched 2D calls using the row-offset trick. Detects CuPy vs NumPy at runtime. All data stays on-device (no per-iteration D2H transfers). Largest grid (exclaim_ch_r04b09_dsl): 2.95s -> 0.74s (4.0x)
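One common way to implement a "row-offset trick" is sketched below: each row is shifted by a multiple of a span larger than the global value range, so that the flattened array is sorted as a whole and a single `searchsorted` call serves all rows. This is a NumPy-only sketch under that assumption; `batched_searchsorted` is a hypothetical helper name, and the PR's version may differ in how it chooses offsets and handles out-of-range values.

```python
import numpy as np


def batched_searchsorted(sorted_rows, values, side="left"):
    """Row-wise searchsorted for 2D inputs using one flat searchsorted call.

    sorted_rows: (n_rows, n_cols), each row ascending.
    values:      (n_rows, n_vals), searched in the matching row.
    """
    n_rows, n_cols = sorted_rows.shape
    lo = min(sorted_rows.min(), values.min())
    hi = max(sorted_rows.max(), values.max())
    span = (hi - lo) + 1.0  # strictly larger than the global value range
    # Row r is moved into [r * span, (r + 1) * span), so rows never overlap
    row_offset = np.arange(n_rows, dtype=np.float64)[:, None] * span
    flat_sorted = ((sorted_rows - lo) + row_offset).ravel()
    shifted_values = (values - lo) + row_offset
    flat_pos = np.searchsorted(flat_sorted, shifted_values.ravel(), side=side)
    # Remove the contribution of all preceding rows to get per-row positions
    return flat_pos.reshape(values.shape) - np.arange(n_rows)[:, None] * n_cols


rows = np.array([[0.0, 10.0, 30.0, 60.0, 100.0],
                 [5.0, 20.0, 40.0, 70.0, 90.0]])
vals = np.array([[45.0, 5.0],
                 [20.0, 95.0]])

batched = batched_searchsorted(rows, vals)
per_row = np.stack([np.searchsorted(rows[r], vals[r]) for r in range(rows.shape[0])])
```

Adding large offsets to floating-point heights costs some precision, so a real implementation would need to check that the span times the number of rows stays well within the resolution required by the search.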

Replaces _get_xp / _cp / _np with the array_ns returned by data_alloc.array_namespace(), which already provides backend-agnostic searchsorted, clip, where, take_along_axis, etc.
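To show what writing against an array namespace looks like, the sketch below takes the namespace as an explicit `xp` argument and uses only functions that exist under the same names in NumPy and CuPy. The helper `gather_levels` is hypothetical, and how `data_alloc.array_namespace()` is actually called is not shown in this PR text, so it is not used here.

```python
import numpy as np


def gather_levels(xp, field, idx, fill_value=0.0):
    """Gather field[row, idx[row, k]], replacing out-of-range indices by fill_value.

    xp is the array namespace (numpy here; cupy offers the same functions).
    """
    nlev = field.shape[1]
    valid = (idx >= 0) & (idx < nlev)
    safe_idx = xp.clip(idx, 0, nlev - 1)           # keep the gather in bounds
    gathered = xp.take_along_axis(field, safe_idx, axis=1)
    return xp.where(valid, gathered, fill_value)   # mask entries that were clipped


field = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
idx = np.array([[0, 2, 5],
                [1, -1, 2]])

result = gather_levels(np, field, idx)
```

Because the function never imports a backend itself, the same body can run on host or device arrays as long as the caller passes the matching namespace.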

- Remove unnecessary z_me_m masking (invalid jk values are masked out by valid_jk after the search)
- Remove .copy() from z_ifc_asc (fancy indexing creates new arrays)
- Compute fill_high from z_ifc directly (fewer elements than z_ifc_e0)
- Remove zdiff_gradp exchange calls from function, delegate to factory via do_exchange=True
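The remark about `.copy()` relies on standard NumPy semantics: a reversed slice is a view, while indexing with an integer array (fancy indexing) returns a new array. The sketch below checks this on toy data; the array contents and the `e2c` connectivity are made up, and only the names `z_ifc_asc`, `e2c_0`, and `z_ifc_e0` are borrowed from the commit messages.

```python
import numpy as np

# Toy data: 3 cells with 4 interface heights each, 4 edges with 2 neighbour cells
z_ifc = np.array([[100.0, 60.0, 30.0, 0.0],
                  [90.0, 50.0, 20.0, 0.0],
                  [80.0, 40.0, 10.0, 0.0]])
e2c = np.array([[0, 1], [1, 2], [2, 0], [0, 2]])

z_ifc_asc = z_ifc[:, ::-1]       # basic slicing: a view, no data copied
e2c_0 = e2c[:, 0]                # first neighbour cell of every edge
z_ifc_e0 = z_ifc_asc[e2c_0]      # fancy indexing: a fresh (nedges, nlev + 1) array

view_shares_memory = np.shares_memory(z_ifc, z_ifc_asc)
gather_is_copy = not np.shares_memory(z_ifc, z_ifc_e0)

# Writing to the gathered array leaves the original untouched
z_ifc_e0[0, 0] = -1.0
original_unchanged = z_ifc[0, -1] == 0.0
```

Since the gathered array already owns its data, an extra `.copy()` on the reversed view before indexing only adds an allocation.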

The z_me halo exchange inside compute_zdiff_gradp was redundant: z_mc already has valid halo values from the factory dependency chain (Z_MC provider exchanges its halo). Since z_me is computed locally from z_mc[e2c], it inherits correct halo values without an additional exchange.

- Remove exchange parameter from compute_zdiff_gradp signature
- Remove functools.partial wrapping in factory registration
- Remove dims and decomposition imports (no longer needed)
- Update unit test call site

msimberg · 2026-05-19T07:52:28Z

cscs-ci run distributed

github-actions · 2026-05-20T09:43:54Z

Mandatory Tests

Please make sure you run these tests via comment before you merge!

cscs-ci run default
cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

msimberg · 2026-05-20T09:43:57Z

cscs-ci run distributed

msimberg added 9 commits May 19, 2026 09:48

test(metrics): remove cpu_only mark from test_factory_zdiff_gradp

a496ec1

The zdiff_gradp computation is now fast enough (0.92s vs 2.95s baseline on largest grid) that the factory test no longer needs the cpu_only guard. The vwind_impl_wgt speed concern is also resolved.

refactor(metrics): use array_ns instead of manual backend detection

86e95db

Replaces _get_xp / _cp / _np with the array_ns returned by data_alloc.array_namespace(), which already provides backend-agnostic searchsorted, clip, where, take_along_axis, etc.

Only run common mpi tests

7afb739

msimberg added 3 commits May 20, 2026 09:17

Refactor compute_zdiff_gradp to only do do_exchange exchange

84bb917

Some more refactoring

ffa6397

Merge remote-tracking branch 'origin/main' into zdiff-vertoffset-gradp

1d48eeb

msimberg wants to merge 12 commits into C2SM:main from msimberg:zdiff-vertoffset-gradp