Nabla Path Tracer runtime compare 2026-03-28

@AnastaZIuk AnastaZIuk released this 28 Mar 12:53
· 5271 commits to main since this release
c74eed1

Nabla Path Tracer runtime compare from Nsight Graphics

This directory contains paired Nsight Graphics GPU Trace probes for the Nabla Path Tracer, one per variant.

Protocol:

  • 1 run per variant
  • same capture point: frame 1000
  • same effective render path:
    • geometry: sphere
    • effective method: solid angle
  • runtime numbers below come directly from Nsight Graphics exports:
    • FRAME.xls
    • GPUTRACE_FRAME.xls
  • measurement machine: see the Machine spec entry in the directory map below

Variant matrix

| Case | Checkout source | Nabla | DXC | SPIRV-Headers | SPIRV-Tools | Mode |
| --- | --- | --- | --- | --- | --- | --- |
| master_source_off | master_runcheck local worktree | e11b118dd2e80393b5b7eb309c6abb25f51a818c | d76c7890b19ce0b344ee0ce116dbc1c92220ccea | 057230db28c7f7d1d571c9e61732da44815f2891 | 91ac969ed599bfd0697a5b88cfae550318a04392 | local Release, SOURCE, runtime builtins OFF |
| devshfixes_upstream | unroll_dxc_df_upstream_check local worktree | c13c33662c3733b54d9014988a5ac602ab0c3245 | 74d6fbbad7388813c65ae269b20f15b4e971df9c | 10b37414a3c9269b9bd8861cc759bd7fdf09760d | 2c75d08e3b31a673726ce6be80ab528250247064 | local Release, SOURCE, runtime builtins OFF |
| unroll_artifact | CI install artifact from run 23599197849 | 262a8b72f295ec95d3cf83170f1768a43972c9ab | 07f06e9d48807ef8e7cabc41ae6acdeb26c68c09 | c141151dd53cbd5b1ced0665ad95ae3e91e8f916 | 2a730e127a32ac8b0713f5e1490d7b9be9d1cc9a | CI Release install artifact |
| unroll_v2 | unroll_o1_local local worktree after the latest -O1experimental changes | 6ee8dbc04df55db97c9440d078eef160522a6af1 | 891d1d7bd6fb20757a3af07f5a7a33ef59f7c15e | c141151dd53cbd5b1ced0665ad95ae3e91e8f916 | 0ecbcc95a108f1a3313ea184260b10d21e158a47 | local RelWithDebInfo, SOURCE, runtime -O1experimental |

Legend

```mermaid
flowchart LR
  A["master_source_off\nNabla e11b118d\n2026-03-26"] --> B["devshfixes_upstream\nNabla c13c3366\n2026-03-28"] --> C["unroll_artifact\nNabla 262a8b72\n2026-03-26"] --> D["unroll_v2\nlocal O1experimental refresh\n2026-03-28"]
```

This report compares four checkpoints:

  • master_source_off: current master-side baseline
  • devshfixes_upstream: the same line after refreshing devshFixes with newer DXC upstream state
  • unroll_artifact: that refreshed line plus the unroll PR work packaged in the published CI artifact
  • unroll_v2: an up-to-date local measurement with the new -O1experimental flag after the latest unroll-line changes

Runtime Probe

| Variant | GPU frame (ms) | Dispatch count | Compute active | SM throughput | PCIe write (GB/s) |
| --- | --- | --- | --- | --- | --- |
| master_source_off | 21.4304 | 2 | 83.2501% | 35.5388% | 2.62710 |
| devshfixes_upstream | 19.6157 | 2 | 82.9923% | 38.2916% | 2.64694 |
| unroll_artifact | 21.5935 | 2 | 83.9945% | 34.3346% | 2.62212 |
| unroll_v2 | 19.2360 | 2 | 86.9712% | 38.7311% | 2.64514 |

Runtime deltas

| Comparison | Delta (ms) | Delta (%) |
| --- | --- | --- |
| devshfixes_upstream vs master_source_off | -1.8147 | -8.47% |
| unroll_artifact vs master_source_off | +0.1631 | +0.76% |
| unroll_artifact vs devshfixes_upstream | +1.9778 | +10.08% |
| unroll_v2 vs master_source_off | -2.1944 | -10.24% |
| unroll_v2 vs devshfixes_upstream | -0.3797 | -1.94% |
| unroll_v2 vs unroll_artifact | -2.3575 | -10.92% |
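The deltas above follow directly from the GPU frame times in the runtime probe. A minimal sketch of the computation, with the values copied from the tables in this report:

```python
# Steady-state GPU frame times (ms) from the runtime probe table above.
frame_ms = {
    "master_source_off": 21.4304,
    "devshfixes_upstream": 19.6157,
    "unroll_artifact": 21.5935,
    "unroll_v2": 19.2360,
}

def delta(variant: str, baseline: str) -> tuple[float, float]:
    """Return (delta in ms, delta in %) of `variant` relative to `baseline`."""
    d = frame_ms[variant] - frame_ms[baseline]
    return d, 100.0 * d / frame_ms[baseline]

d_ms, d_pct = delta("unroll_v2", "master_source_off")
print(f"unroll_v2 vs master_source_off: {d_ms:+.4f} ms ({d_pct:+.2f}%)")
```

Negative values mean the variant is faster than the baseline, matching the sign convention of the deltas table.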

Cold startup Vulkan API probe

Cold startup vkCreateComputePipelines was measured on the same published runnable bundles with cleared pipeline/shader cache and Vulkan API tracing enabled for the process.

| Variant | vkCreateComputePipelines calls | vkCreateComputePipelines start->next sum (ms) |
| --- | --- | --- |
| master_source_off | 13 | 3737.11 |
| devshfixes_upstream | 21 | 3332.55 |
| unroll_artifact | 21 | 1418.86 |
| unroll_v2 | 2 | 354.707 |
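The cold-startup numbers are a call count plus a sum of start->next gaps taken from the API trace. As a sketch of that aggregation only (the CSV layout with `name` and `start_ms` columns here is a hypothetical stand-in, not the actual Nsight trace export schema):

```python
import csv
import io

# Hypothetical trace rows: (API call name, call start time in ms).
# Real Nsight API-trace exports use a different schema; this only
# illustrates the "count calls + sum start->next intervals" aggregation.
trace_csv = """name,start_ms
vkCreateComputePipelines,100.0
vkQueueSubmit,150.0
vkCreateComputePipelines,400.0
vkCreateComputePipelines,900.0
vkDestroyPipeline,950.0
"""

rows = list(csv.DictReader(io.StringIO(trace_csv)))
count = 0
start_to_next_sum = 0.0
for i, row in enumerate(rows):
    if row["name"] != "vkCreateComputePipelines":
        continue
    count += 1
    # Gap from this call's start to the start of the next traced call.
    if i + 1 < len(rows):
        start_to_next_sum += float(rows[i + 1]["start_ms"]) - float(row["start_ms"])

print(count, start_to_next_sum)
```

Note that "start->next" includes everything the process does until the next traced call, which is one reason these sums should be treated cautiously (see the TODO below).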

TODO: recheck the vkCreateComputePipelines numbers; these metrics look wrong as captured.

Main conclusion

In this probe, the refreshed upstream baseline (devshfixes_upstream) is measurably faster than master_source_off. At the same time, unroll_artifact is effectively at parity with master_source_off here at only +0.76%; its remaining gap shows up only against devshfixes_upstream.

The up-to-date unroll_v2 follow-up, measured with the new -O1experimental flag, goes further: in this probe it is now faster than master_source_off by 10.24% on steady-state GPU frame time (19.2360 ms vs 21.4304 ms).

Taken together, the measured runtime cost points at the unroll side of the experiment, not at the generic DXC/SPIRV-Tools upstream refresh. That tradeoff is also aligned with the intent of the experiment: reduce shader build time aggressively while accepting a small runtime cost.

In practice this is also a strong argument for the new explicit -O1experimental path. For the Nabla Path Tracer builds behind this comparison the shader-build wall time is about 10x worse without -O1experimental, while the newest unroll_v2 follow-up is already faster than the current master baseline on this measured path. On this workload -O1experimental delivers the intended development tradeoff directly: a major build-time win together with favorable measured runtime.

unroll_v2 is the current local follow-up checkpoint after those latest changes. It keeps the same high-level workload shape (dispatch_count = 2) and shows where the updated -O1experimental line lands relative to the published unroll_artifact and the current master baseline.

Deeper Nsight signals from the same exports

Frame-level exports also show:

  • dispatch_count = 2 and gr__ctas_launched_queue_sync.sum = 14401 in the first three variants
  • unroll_artifact has lower SM throughput than devshfixes_upstream
  • unroll_artifact also shows higher total executed instructions and much higher L1/LSU/shared pressure than devshfixes_upstream
  • unroll_v2 raises SM throughput back to 38.7311% while keeping dispatch_count = 2

This points at a compute-side codegen / execution-mix difference with higher L1/LSU/shared pressure on the unroll side.
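One way to surface that difference from the raw exports is to diff the per-variant counters directly. A small sketch of that comparison (the SM throughput values are copied from the runtime probe table above; the counter key name itself is just an illustrative label, not an exact Nsight export field):

```python
# Per-variant counters; SM throughput (%) comes from the runtime probe table.
counters = {
    "devshfixes_upstream": {"sm_throughput_pct": 38.2916},
    "unroll_artifact": {"sm_throughput_pct": 34.3346},
    "unroll_v2": {"sm_throughput_pct": 38.7311},
}

def rel_delta_pct(metric: str, variant: str, baseline: str) -> float:
    """Relative change (%) of `metric` in `variant` versus `baseline`."""
    a = counters[variant][metric]
    b = counters[baseline][metric]
    return 100.0 * (a - b) / b

drop = rel_delta_pct("sm_throughput_pct", "unroll_artifact", "devshfixes_upstream")
print(f"unroll_artifact SM throughput vs devshfixes_upstream: {drop:+.2f}%")
```

Extending the `counters` dict with executed-instruction and L1/LSU/shared counters from the same exports would make the codegen-side regression on the unroll line directly quantifiable in the same way.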

Directory map

  • Runtime stats
  • Machine spec
  • Executable locations
  • Capture files
  • Raw Nsight exports
  • Cold startup Vulkan API trace
  • Startup logs
  • Additional unroll_v2 diagnostics