Nabla Path Tracer runtime compare 2026-03-28

@AnastaZIuk AnastaZIuk released this 28 Mar 12:53
· 5271 commits to main since this release
c74eed1

Nabla Path Tracer runtime compare from Nsight Graphics

This directory contains paired Nsight Graphics GPU Trace probes for the Nabla Path Tracer, one per variant.

Protocol:

  • 1 run per variant
  • same capture point: frame 1000
  • same effective render path:
    • geometry: sphere
    • effective method: solid angle
  • runtime numbers below come directly from Nsight Graphics exports:
    • FRAME.xls
    • GPUTRACE_FRAME.xls
  • measurement machine: see the Machine spec entry in the directory map below

Variant matrix

| Case | Checkout source | Nabla | DXC | SPIRV-Headers | SPIRV-Tools | Mode |
| --- | --- | --- | --- | --- | --- | --- |
| master_source_off | master_runcheck local worktree | e11b118dd2e80393b5b7eb309c6abb25f51a818c | d76c7890b19ce0b344ee0ce116dbc1c92220ccea | 057230db28c7f7d1d571c9e61732da44815f2891 | 91ac969ed599bfd0697a5b88cfae550318a04392 | local Release, SOURCE, runtime builtins OFF |
| devshfixes_upstream | unroll_dxc_df_upstream_check local worktree | c13c33662c3733b54d9014988a5ac602ab0c3245 | 74d6fbbad7388813c65ae269b20f15b4e971df9c | 10b37414a3c9269b9bd8861cc759bd7fdf09760d | 2c75d08e3b31a673726ce6be80ab528250247064 | local Release, SOURCE, runtime builtins OFF |
| unroll_artifact | CI install artifact from run 23599197849 | 262a8b72f295ec95d3cf83170f1768a43972c9ab | 07f06e9d48807ef8e7cabc41ae6acdeb26c68c09 | c141151dd53cbd5b1ced0665ad95ae3e91e8f916 | 2a730e127a32ac8b0713f5e1490d7b9be9d1cc9a | CI Release install artifact |
| unroll_v2 | unroll_o1_local local worktree after the latest -O1experimental changes | 6ee8dbc04df55db97c9440d078eef160522a6af1 | 891d1d7bd6fb20757a3af07f5a7a33ef59f7c15e | c141151dd53cbd5b1ced0665ad95ae3e91e8f916 | 0ecbcc95a108f1a3313ea184260b10d21e158a47 | local RelWithDebInfo, SOURCE, runtime -O1experimental |

Legend

```mermaid
flowchart LR
  A["master_source_off\nNabla e11b118d\n2026-03-26"] --> B["devshfixes_upstream\nNabla c13c3366\n2026-03-28"] --> C["unroll_artifact\nNabla 262a8b72\n2026-03-26"] --> D["unroll_v2\nlocal O1experimental refresh\n2026-03-28"]
```

This report compares four checkpoints:

  • master_source_off: current master-side baseline
  • devshfixes_upstream: the same line after refreshing devshFixes with newer DXC upstream state
  • unroll_artifact: that refreshed line plus the unroll PR work packaged in the published CI artifact
  • unroll_v2: an up-to-date local measurement with the new -O1experimental flag after the latest unroll-line changes

Runtime Probe

| Variant | GPU frame (ms) | Dispatch count | Compute active | SM throughput | PCIe write (GB/s) |
| --- | --- | --- | --- | --- | --- |
| master_source_off | 21.4304 | 2 | 83.2501% | 35.5388% | 2.62710 |
| devshfixes_upstream | 19.6157 | 2 | 82.9923% | 38.2916% | 2.64694 |
| unroll_artifact | 21.5935 | 2 | 83.9945% | 34.3346% | 2.62212 |
| unroll_v2 | 19.2360 | 2 | 86.9712% | 38.7311% | 2.64514 |

Runtime deltas

| Comparison | Delta (ms) | Delta (%) |
| --- | --- | --- |
| devshfixes_upstream vs master_source_off | -1.8147 | -8.47% |
| unroll_artifact vs master_source_off | +0.1631 | +0.76% |
| unroll_artifact vs devshfixes_upstream | +1.9778 | +10.08% |
| unroll_v2 vs master_source_off | -2.1944 | -10.24% |
| unroll_v2 vs devshfixes_upstream | -0.3797 | -1.94% |
| unroll_v2 vs unroll_artifact | -2.3575 | -10.92% |
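The deltas above follow directly from the GPU frame times in the runtime probe. A minimal sketch of the computation, with the values copied from the tables in this report:

```python
# Steady-state GPU frame times (ms) from the runtime probe table above.
frame_ms = {
    "master_source_off": 21.4304,
    "devshfixes_upstream": 19.6157,
    "unroll_artifact": 21.5935,
    "unroll_v2": 19.2360,
}

def delta(variant: str, baseline: str) -> tuple[float, float]:
    """Return (delta in ms, delta in %) of `variant` relative to `baseline`."""
    d = frame_ms[variant] - frame_ms[baseline]
    return d, 100.0 * d / frame_ms[baseline]

d_ms, d_pct = delta("unroll_v2", "master_source_off")
print(f"unroll_v2 vs master_source_off: {d_ms:+.4f} ms ({d_pct:+.2f}%)")
```

Negative values mean the variant is faster than the baseline, matching the sign convention of the deltas table.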

Cold startup Vulkan API probe

Cold startup vkCreateComputePipelines was measured on the same published runnable bundles with cleared pipeline/shader cache and Vulkan API tracing enabled for the process.

| Variant | vkCreateComputePipelines calls | vkCreateComputePipelines start->next sum (ms) |
| --- | --- | --- |
| master_source_off | 13 | 3737.11 |
| devshfixes_upstream | 21 | 3332.55 |
| unroll_artifact | 21 | 1418.86 |
| unroll_v2 | 2 | 354.707 |
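The cold-startup numbers are a call count plus a sum of start->next gaps taken from the API trace. As a sketch of that aggregation only (the CSV layout with `name` and `start_ms` columns here is a hypothetical stand-in, not the actual Nsight trace export schema):

```python
import csv
import io

# Hypothetical trace rows: (API call name, call start time in ms).
# Real Nsight API-trace exports use a different schema; this only
# illustrates the "count calls + sum start->next intervals" aggregation.
trace_csv = """name,start_ms
vkCreateComputePipelines,100.0
vkQueueSubmit,150.0
vkCreateComputePipelines,400.0
vkCreateComputePipelines,900.0
vkDestroyPipeline,950.0
"""

rows = list(csv.DictReader(io.StringIO(trace_csv)))
count = 0
start_to_next_sum = 0.0
for i, row in enumerate(rows):
    if row["name"] != "vkCreateComputePipelines":
        continue
    count += 1
    # Gap from this call's start to the start of the next traced call.
    if i + 1 < len(rows):
        start_to_next_sum += float(rows[i + 1]["start_ms"]) - float(row["start_ms"])

print(count, start_to_next_sum)
```

Note that "start->next" includes everything the process does until the next traced call, which is one reason these sums should be treated cautiously (see the TODO below).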

TODO: recheck the vkCreateComputePipelines numbers; these metrics look wrong as captured.

Main conclusion

In this probe, the refreshed upstream baseline (devshfixes_upstream) is measurably faster than master_source_off. At the same time, unroll_artifact is effectively at parity with master_source_off here at only +0.76%; its remaining gap shows up only against devshfixes_upstream.

The up-to-date unroll_v2 follow-up, measured with the new -O1experimental flag, goes further: in this probe it is now faster than master_source_off by 10.24% on steady-state GPU frame time (19.2360 ms vs 21.4304 ms).

Taken together, the measured runtime cost points at the unroll side of the experiment, not at the generic DXC/SPIRV-Tools upstream refresh. That tradeoff is also aligned with the intent of the experiment: reduce shader build time aggressively while accepting a small runtime cost.

In practice this is also a strong argument for the new explicit -O1experimental path. For the Nabla Path Tracer builds behind this comparison the shader-build wall time is about 10x worse without -O1experimental, while the newest unroll_v2 follow-up is already faster than the current master baseline on this measured path. On this workload -O1experimental delivers the intended development tradeoff directly: a major build-time win together with favorable measured runtime.

unroll_v2 is the current local follow-up checkpoint after those latest changes. It keeps the same high-level workload shape (dispatch_count = 2) and shows where the updated -O1experimental line lands relative to the published unroll_artifact and the current master baseline.

Deeper Nsight signals from the same exports

Frame-level exports also show:

  • dispatch_count = 2 and gr__ctas_launched_queue_sync.sum = 14401 in the first three variants
  • unroll_artifact has lower SM throughput than devshfixes_upstream
  • unroll_artifact also shows higher total executed instructions and much higher L1/LSU/shared pressure than devshfixes_upstream
  • unroll_v2 raises SM throughput back to 38.7311% while keeping dispatch_count = 2

This points at a compute-side codegen / execution-mix difference with higher L1/LSU/shared pressure on the unroll side.
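One way to surface that difference from the raw exports is to diff the per-variant counters directly. A small sketch of that comparison (the SM throughput values are copied from the runtime probe table above; the counter key name itself is just an illustrative label, not an exact Nsight export field):

```python
# Per-variant counters; SM throughput (%) comes from the runtime probe table.
counters = {
    "devshfixes_upstream": {"sm_throughput_pct": 38.2916},
    "unroll_artifact": {"sm_throughput_pct": 34.3346},
    "unroll_v2": {"sm_throughput_pct": 38.7311},
}

def rel_delta_pct(metric: str, variant: str, baseline: str) -> float:
    """Relative change (%) of `metric` in `variant` versus `baseline`."""
    a = counters[variant][metric]
    b = counters[baseline][metric]
    return 100.0 * (a - b) / b

drop = rel_delta_pct("sm_throughput_pct", "unroll_artifact", "devshfixes_upstream")
print(f"unroll_artifact SM throughput vs devshfixes_upstream: {drop:+.2f}%")
```

Extending the `counters` dict with executed-instruction and L1/LSU/shared counters from the same exports would make the codegen-side regression on the unroll line directly quantifiable in the same way.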

Directory map

  • Runtime stats
  • Machine spec
  • Executable locations
  • Capture files
  • Raw Nsight exports
  • Cold startup Vulkan API trace
  • Startup logs
  • Additional unroll_v2 diagnostics