Vectorize angular (ATS) reductions to fix XLA compile blowup by jpbrodrick89 · Pull Request #102 · ergodicio/tsadar

jpbrodrick89 · 2026-06-01T14:58:04Z

The angular forward model wrapped two functions whose Python list comprehensions were fully unrolled into the jit-traced graph, once per output bin, and then re-traced through value_and_grad:

- thomson_diagnostic.reduce_ATS_to_resunit: ~1024 + 1024 slice+mean+stack ops to bin the wavelength and angular axes.
- irf.add_ATS_IRF: ~2048 + 1024 jnp.convolve ops for the separable 2D instrument-response smoothing.

At angular shapes (npts=2048, 1024 angles) this is ~5000 unrolled iterations baked into a single graph. On CPU the angular value_and_grad failed to finish compiling in 15+ minutes.
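The unrolling described above can be reproduced on a toy array. The sketch below is illustrative only: `bin_loop` and `bin_reshape` are hypothetical names, not the tsadar functions, and the shapes are far smaller than the angular case. It counts the equations in the traced jaxpr to show that the list-comprehension form grows with the number of bins while the reshape form does not.

```python
import jax
import jax.numpy as jnp


def bin_loop(x, width):
    # The comprehension runs at trace time, so every bin contributes its own
    # slice and mean ops to the traced graph.
    n_bins = x.shape[0] // width
    return jnp.stack([x[i * width:(i + 1) * width].mean() for i in range(n_bins)])


def bin_reshape(x, width):
    # Same result for evenly divisible lengths, expressed as one reshape
    # followed by one reduction.
    n_bins = x.shape[0] // width
    return x[: n_bins * width].reshape(n_bins, width).mean(axis=1)


x = jnp.arange(64.0)

# Number of top-level equations in each traced graph.
n_loop = len(jax.make_jaxpr(lambda v: bin_loop(v, 4))(x).eqns)
n_vec = len(jax.make_jaxpr(lambda v: bin_reshape(v, 4))(x).eqns)

print("loop form equations:   ", n_loop)
print("reshape form equations:", n_vec)
```

With thousands of bins, and the same graph traced again for the gradient, the loop form is the kind of graph that XLA then has to compile op by op.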

Replace both with vectorized equivalents:

- _bin_average: reshape (+ NaN-pad for a ragged final window) and mean, a single op; bit-for-bit equal to the loop including the ragged case.
- add_ATS_IRF: vmap each 1D convolution over the batch axis.
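A minimal sketch of the reshape-and-mean idea with a ragged final window, assuming a 1D input and a fixed window width. The function names and signatures here are hypothetical and need not match the `_bin_average` in this PR; in particular, using `jnp.nanmean` to ignore the padding is an assumption about how the NaN pad is consumed.

```python
import jax.numpy as jnp


def bin_average_loop(x, width):
    # Reference semantics: mean over consecutive windows of `width` samples;
    # the final window may be shorter than `width`.
    n = x.shape[0]
    return jnp.stack([x[i:i + width].mean() for i in range(0, n, width)])


def bin_average_vectorized(x, width):
    n = x.shape[0]
    n_bins = -(-n // width)      # ceiling division
    pad = n_bins * width - n     # samples missing from the final window
    # Pad the tail with NaN so the array reshapes to (n_bins, width).
    padded = jnp.pad(x, (0, pad), constant_values=jnp.nan)
    # nanmean skips the padding, so the short window is averaged over its
    # real samples only.
    return jnp.nanmean(padded.reshape(n_bins, width), axis=1)


x = jnp.arange(10.0)             # 10 samples, width 4: final window has 2
print(bin_average_loop(x, 4))
print(bin_average_vectorized(x, 4))
```

One caveat of this sketch: `nanmean` would also skip NaNs already present in the data, whereas the loop would propagate them, so the two forms agree only for NaN-free input.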

After the change the same angular value_and_grad compiles in ~4.6s on CPU. Outputs are unchanged (1D fits recover identical parameters).

This is done with vmap rather than scan, so both CPU and GPU should benefit equally.
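The vmap approach for a separable 2D smoothing can be sketched as follows. This is an illustration under assumptions, not the PR's `add_ATS_IRF`: the function names, the `mode="same"` boundary handling, and the order of the two passes are all hypothetical.

```python
import jax
import jax.numpy as jnp


def smooth_rows_loop(image, kernel):
    # Original pattern: one jnp.convolve per row, unrolled at trace time.
    return jnp.stack([jnp.convolve(row, kernel, mode="same") for row in image])


def smooth_rows_vmap(image, kernel):
    # Same computation, batched over axis 0 as a single vmapped convolution.
    return jax.vmap(lambda row: jnp.convolve(row, kernel, mode="same"))(image)


def smooth_2d_vmap(image, kernel_rows, kernel_cols):
    # Separable smoothing: convolve along axis 1, then along axis 0.
    out = smooth_rows_vmap(image, kernel_rows)
    conv_col = lambda col: jnp.convolve(col, kernel_cols, mode="same")
    return jax.vmap(conv_col, in_axes=1, out_axes=1)(out)


image = jnp.arange(48.0).reshape(6, 8)
kernel = jnp.array([0.25, 0.5, 0.25])

smoothed = smooth_2d_vmap(image, kernel, kernel)
print(smoothed.shape)
```

Unlike `lax.scan`, which keeps the iterations sequential, `vmap` turns them into one batched operation, which is why the benefit is not expected to depend on the backend.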

Add tests/test_forward/test_ats_vectorization.py, pinning both vectorized forms to the original loop semantics. These tests run on CPU, unlike the angular integration tests, which skip without a GPU.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

almilder

Looks like a great fix, thanks for adding this!

almilder approved these changes Jun 4, 2026


almilder merged commit 6c509b2 into ergodicio:main Jun 4, 2026
2 checks passed

almilder merged 1 commit into ergodicio:main from jpbrodrick89:fix/vectorize-angular-compile-loops
