Skip to content

Commit 65c5db3

Browse files
committed
Update runtime report with unroll_v2
1 parent 4b8b2c7 commit 65c5db3

19 files changed

Lines changed: 1217 additions & 2 deletions

File tree

README.md

Lines changed: 61 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -21,14 +21,29 @@ Protocol:
2121
| `master_source_off` | `master_runcheck` local worktree | [`e11b118dd2e80393b5b7eb309c6abb25f51a818c`](https://github.com/Devsh-Graphics-Programming/Nabla/commit/e11b118dd2e80393b5b7eb309c6abb25f51a818c) | [`d76c7890b19ce0b344ee0ce116dbc1c92220ccea`](https://github.com/Devsh-Graphics-Programming/DirectXShaderCompiler/commit/d76c7890b19ce0b344ee0ce116dbc1c92220ccea) | [`057230db28c7f7d1d571c9e61732da44815f2891`](https://github.com/Devsh-Graphics-Programming/SPIRV-Headers/commit/057230db28c7f7d1d571c9e61732da44815f2891) | [`91ac969ed599bfd0697a5b88cfae550318a04392`](https://github.com/Devsh-Graphics-Programming/SPIRV-Tools/commit/91ac969ed599bfd0697a5b88cfae550318a04392) | local `Release`, `SOURCE`, runtime `builtins OFF` |
2222
| `devshfixes_upstream` | `unroll_dxc_df_upstream_check` local worktree | [`c13c33662c3733b54d9014988a5ac602ab0c3245`](https://github.com/Devsh-Graphics-Programming/Nabla/commit/c13c33662c3733b54d9014988a5ac602ab0c3245) | [`74d6fbbad7388813c65ae269b20f15b4e971df9c`](https://github.com/Devsh-Graphics-Programming/DirectXShaderCompiler/commit/74d6fbbad7388813c65ae269b20f15b4e971df9c) | [`10b37414a3c9269b9bd8861cc759bd7fdf09760d`](https://github.com/Devsh-Graphics-Programming/SPIRV-Headers/commit/10b37414a3c9269b9bd8861cc759bd7fdf09760d) | [`2c75d08e3b31a673726ce6be80ab528250247064`](https://github.com/Devsh-Graphics-Programming/SPIRV-Tools/commit/2c75d08e3b31a673726ce6be80ab528250247064) | local `Release`, `SOURCE`, runtime `builtins OFF` |
2323
| `unroll_artifact` | CI install artifact from [`run 23599197849`](https://github.com/Devsh-Graphics-Programming/Nabla/actions/runs/23599197849) | [`262a8b72f295ec95d3cf83170f1768a43972c9ab`](https://github.com/Devsh-Graphics-Programming/Nabla/commit/262a8b72f295ec95d3cf83170f1768a43972c9ab) | [`07f06e9d48807ef8e7cabc41ae6acdeb26c68c09`](https://github.com/Devsh-Graphics-Programming/DirectXShaderCompiler/commit/07f06e9d48807ef8e7cabc41ae6acdeb26c68c09) | [`c141151dd53cbd5b1ced0665ad95ae3e91e8f916`](https://github.com/Devsh-Graphics-Programming/SPIRV-Headers/commit/c141151dd53cbd5b1ced0665ad95ae3e91e8f916) | [`2a730e127a32ac8b0713f5e1490d7b9be9d1cc9a`](https://github.com/Devsh-Graphics-Programming/SPIRV-Tools/commit/2a730e127a32ac8b0713f5e1490d7b9be9d1cc9a) | CI `Release install` artifact |
24+
| `unroll_v2` | `unroll_o1_local` local worktree after the latest `-O1experimental` changes | [`6ee8dbc04df55db97c9440d078eef160522a6af1`](https://github.com/Devsh-Graphics-Programming/Nabla/commit/6ee8dbc04df55db97c9440d078eef160522a6af1) | [`891d1d7bd6fb20757a3af07f5a7a33ef59f7c15e`](https://github.com/Devsh-Graphics-Programming/DirectXShaderCompiler/commit/891d1d7bd6fb20757a3af07f5a7a33ef59f7c15e) | [`c141151dd53cbd5b1ced0665ad95ae3e91e8f916`](https://github.com/Devsh-Graphics-Programming/SPIRV-Headers/commit/c141151dd53cbd5b1ced0665ad95ae3e91e8f916) | [`0ecbcc95a108f1a3313ea184260b10d21e158a47`](https://github.com/Devsh-Graphics-Programming/SPIRV-Tools/commit/0ecbcc95a108f1a3313ea184260b10d21e158a47) | local `RelWithDebInfo`, `SOURCE`, runtime `-O1experimental` |
2425

25-
## Main Nsight result
26+
## Legend
27+
28+
```mermaid
29+
flowchart LR
30+
A["master_source_off\nNabla e11b118d\n2026-03-26"] --> B["devshfixes_upstream\nNabla c13c3366\n2026-03-28"] --> C["unroll_artifact\nNabla 262a8b72\n2026-03-26"] --> D["unroll_v2\nlocal O1experimental refresh\n2026-03-28"]
31+
```
32+
33+
This report compares four checkpoints:
34+
- `master_source_off`: current `master`-side baseline
35+
- `devshfixes_upstream`: the same line after refreshing `devshFixes` with newer DXC upstream state
36+
- `unroll_artifact`: that refreshed line plus the `unroll` PR work packaged in the published CI artifact
37+
- `unroll_v2`: an up-to-date local measurement with the new `-O1experimental` flag after the latest `unroll`-line changes
38+
39+
## Runtime Probe
2640

2741
| Variant | GPU frame ms | Dispatch count | Compute active | SM throughput | PCIe write GB/s |
2842
| --- | ---: | ---: | ---: | ---: | ---: |
2943
| `master_source_off` | `21.4304` | `2` | `83.2501%` | `35.5388%` | `2.62710` |
3044
| `devshfixes_upstream` | `19.6157` | `2` | `82.9923%` | `38.2916%` | `2.64694` |
3145
| `unroll_artifact` | `21.5935` | `2` | `83.9945%` | `34.3346%` | `2.62212` |
46+
| `unroll_v2` | `19.2360` | `2` | `86.9712%` | `38.7311%` | `2.64514` |
3247

3348
### Runtime deltas
3449

@@ -37,21 +52,46 @@ Protocol:
3752
| `devshfixes_upstream` vs `master_source_off` | `-1.8147` | `-8.47%` |
3853
| `unroll_artifact` vs `master_source_off` | `+0.1631` | `+0.76%` |
3954
| `unroll_artifact` vs `devshfixes_upstream` | `+1.9778` | `+10.08%` |
55+
| `unroll_v2` vs `master_source_off` | `-2.1944` | `-10.24%` |
56+
| `unroll_v2` vs `devshfixes_upstream` | `-0.3797` | `-1.94%` |
57+
| `unroll_v2` vs `unroll_artifact` | `-2.3575` | `-10.92%` |
58+
59+
## Cold startup Vulkan API probe
60+
61+
Cold startup `vkCreateComputePipelines` was measured on the same published runnable bundles with cleared pipeline/shader cache and Vulkan API tracing enabled for the process.
62+
63+
| Variant | `vkCreateComputePipelines` calls | `vkCreateComputePipelines` start->next sum ms |
64+
| --- | ---: | ---: |
65+
| `master_source_off` | `13` | `3737.11` |
66+
| `devshfixes_upstream` | `21` | `3332.55` |
67+
| `unroll_artifact` | `21` | `1418.86` |
68+
| `unroll_v2` | `2` | `354.707` |
69+
70+
The cleanest comparison is `devshfixes_upstream` vs `unroll_artifact`, because both cold runs create exactly `21` compute pipelines. On that paired comparison the `vkCreateComputePipelines` batch drops from `3332.55 ms` to `1418.86 ms`, a reduction of about `57.4%`.
71+
72+
The `master_source_off` comparison is also informative: even though `unroll_artifact` creates more compute pipelines during cold startup (`21` vs `13`), the total `vkCreateComputePipelines` batch still drops from `3737.11 ms` to `1418.86 ms`, a reduction of about `62.0%`.
73+
74+
The new `unroll_v2` local measurement is the current up-to-date checkpoint after the latest `-O1experimental` changes. Against the published `unroll_artifact` cold run, the traced `vkCreateComputePipelines` batch drops further from `1418.86 ms` to `354.707 ms`, with the observed call count going from `21` to `2`. Against `master_source_off`, the same cold batch drops from `3737.11 ms` to `354.707 ms`.
4075

4176
## Main conclusion
4277

4378
The measured `latest upstream refresh` baseline is faster than `master_source_off` in this probe. At the same time `unroll_artifact` is effectively at parity with `master_source_off` here at only `+0.76%`, while the remaining gap appears only against `devshfixes_upstream`.
4479

80+
The up-to-date `unroll_v2` follow-up, measured with the new `-O1experimental` flag, goes further: in this probe it is now faster than `master_source_off` by `10.24%` on steady-state GPU frame time (`19.2360 ms` vs `21.4304 ms`), and its cold `vkCreateComputePipelines` batch is about `10.5x` smaller (`354.707 ms` vs `3737.11 ms`).
81+
4582
Taken together, the measured runtime cost points at the `unroll` side of the experiment, not at the generic `DXC/SPIRV-Tools upstream refresh`. That tradeoff is also aligned with the intent of the experiment: reduce shader build time aggressively while accepting a small runtime cost.
4683

47-
In practice this is also a strong argument for a development-oriented DXC optimization profile, for example an `-O1`-style mode. For the Nabla Path Tracer builds behind this comparison the shader-build wall time is about `10x` worse without that profile, while the measured runtime delta stays at `+0.76%` against `master_source_off` in this probe and at `+10.08%` against `devshfixes_upstream`; on other machines a simpler `average FPS` check across multiple methods, modes, and shapes placed the same runtime cost in the `5-8%` range. That is exactly the profile proposed in the paired PRs: for development use it delivers a major build-time win while keeping runtime impact effectively negligible in practice.
84+
In practice this is also a strong argument for a development-oriented DXC optimization profile, for example an `-O1`-style mode. For the Nabla Path Tracer builds behind this comparison the shader-build wall time is about `10x` worse without that profile, while the newest `unroll_v2` follow-up is already faster than the current `master` baseline on this measured path. That is exactly the profile proposed in the paired PRs: for development use it delivers a major build-time win while keeping runtime impact favorable on the measured workload.
85+
86+
`unroll_v2` is the current local follow-up checkpoint after those latest changes. It keeps the same high-level workload shape (`dispatch_count = 2`) and shows where the updated `-O1experimental` line lands relative to the published `unroll_artifact` and the current `master` baseline.
4887

4988
## Deeper Nsight signals from the same exports
5089

5190
Frame-level exports also show:
5291
- `dispatch_count = 2` and `gr__ctas_launched_queue_sync.sum = 14401` in all three variants
5392
- `unroll_artifact` has lower `SM throughput` than `devshfixes_upstream`
5493
- `unroll_artifact` also shows higher total executed instructions and much higher `L1/LSU/shared` pressure than `devshfixes_upstream`
94+
- `unroll_v2` raises `SM throughput` back to `38.7311%` while keeping `dispatch_count = 2`
5595

5696
This points at a `compute-side codegen / execution-mix` difference with higher `L1/LSU/shared` pressure on the `unroll` side.
5797

@@ -61,6 +101,7 @@ This points at a `compute-side codegen / execution-mix` difference with higher `
61101
- [`master_source_off/stats.json`](master_source_off/stats.json)
62102
- [`devshfixes_upstream/stats.json`](devshfixes_upstream/stats.json)
63103
- [`unroll_artifact/stats.json`](unroll_artifact/stats.json)
104+
- [`unroll_v2/stats.json`](unroll_v2/stats.json)
64105

65106
### Machine spec
66107
- [`measurement_machine.md`](measurement_machine.md)
@@ -69,18 +110,36 @@ This points at a `compute-side codegen / execution-mix` difference with higher `
69110
- `master_source_off`: [`runnable/master_source_off_minimal/31_hlslpathtracer.exe`](runnable/master_source_off_minimal/31_hlslpathtracer.exe)
70111
- `devshfixes_upstream`: [`runnable/devshfixes_upstream_minimal/31_hlslpathtracer.exe`](runnable/devshfixes_upstream_minimal/31_hlslpathtracer.exe)
71112
- `unroll_artifact`: [`runnable/unroll_artifact_minimal/31_hlslpathtracer.exe`](runnable/unroll_artifact_minimal/31_hlslpathtracer.exe)
113+
- `unroll_v2`: [`runnable/unroll_v2_minimal/31_hlslpathtracer_rwdi.exe`](runnable/unroll_v2_minimal/31_hlslpathtracer_rwdi.exe)
72114

73115
### Capture files
74116
- [`master_source_off/run01/master_source_off_frame1000_run01.ngfx-capture`](master_source_off/run01/master_source_off_frame1000_run01.ngfx-capture)
75117
- [`devshfixes_upstream/run01/devshfixes_upstream_frame1000_run01.ngfx-capture`](devshfixes_upstream/run01/devshfixes_upstream_frame1000_run01.ngfx-capture)
76118
- [`unroll_artifact/run01/unroll_artifact_frame1000_run01.ngfx-capture`](unroll_artifact/run01/unroll_artifact_frame1000_run01.ngfx-capture)
119+
- [`unroll_v2/run01/unroll_v2_frame1000_run01.ngfx-capture`](unroll_v2/run01/unroll_v2_frame1000_run01.ngfx-capture)
77120

78121
### Raw Nsight exports
79122
- [`master_source_off/run01/gpu-trace/BASE/FRAME.xls`](master_source_off/run01/gpu-trace/BASE/FRAME.xls)
80123
- [`master_source_off/run01/gpu-trace/BASE/GPUTRACE_FRAME.xls`](master_source_off/run01/gpu-trace/BASE/GPUTRACE_FRAME.xls)
81124
- [`devshfixes_upstream/run01/gpu-trace/BASE/GPUTRACE_FRAME.xls`](devshfixes_upstream/run01/gpu-trace/BASE/GPUTRACE_FRAME.xls)
82125
- [`unroll_artifact/run01/gpu-trace/BASE/GPUTRACE_FRAME.xls`](unroll_artifact/run01/gpu-trace/BASE/GPUTRACE_FRAME.xls)
126+
- [`unroll_v2/run01/gpu-trace/BASE/FRAME.xls`](unroll_v2/run01/gpu-trace/BASE/FRAME.xls)
127+
- [`unroll_v2/run01/gpu-trace/BASE/GPUTRACE_FRAME.xls`](unroll_v2/run01/gpu-trace/BASE/GPUTRACE_FRAME.xls)
128+
129+
### Cold startup Vulkan API trace
130+
- [`startup_vkcreate_api_probe/api_dump_cold_compare/summary.csv`](startup_vkcreate_api_probe/api_dump_cold_compare/summary.csv)
131+
- [`startup_vkcreate_api_probe/api_dump_cold_compare/master_source_off/vkcreate_stats.json`](startup_vkcreate_api_probe/api_dump_cold_compare/master_source_off/vkcreate_stats.json)
132+
- [`startup_vkcreate_api_probe/api_dump_cold_compare/devshfixes_upstream/vkcreate_stats.json`](startup_vkcreate_api_probe/api_dump_cold_compare/devshfixes_upstream/vkcreate_stats.json)
133+
- [`startup_vkcreate_api_probe/api_dump_cold_compare/unroll_artifact/vkcreate_stats.json`](startup_vkcreate_api_probe/api_dump_cold_compare/unroll_artifact/vkcreate_stats.json)
134+
- [`startup_vkcreate_api_probe/api_dump_cold_compare/unroll_v2/vkcreate_stats.json`](startup_vkcreate_api_probe/api_dump_cold_compare/unroll_v2/vkcreate_stats.json)
83135

84136
### Startup logs
85137
- [`startup_devshfixes_upstream/stats.json`](startup_devshfixes_upstream/stats.json)
86138
- [`startup_unroll_artifact/stats.json`](startup_unroll_artifact/stats.json)
139+
- [`startup_unroll_v2/stats.json`](startup_unroll_v2/stats.json)
140+
141+
### Additional `unroll_v2` diagnostics
142+
- [`unroll_v2/diagnostics/build_runtime_rwdi.log`](unroll_v2/diagnostics/build_runtime_rwdi.log)
143+
- [`unroll_v2/diagnostics/rebuild_spirv_rwdi.log`](unroll_v2/diagnostics/rebuild_spirv_rwdi.log)
144+
- [`unroll_v2/diagnostics/legality_rwdi_all_spv.log`](unroll_v2/diagnostics/legality_rwdi_all_spv.log)
145+
- [`startup_unroll_v2/stdout.txt`](startup_unroll_v2/stdout.txt)
8.68 MB
Binary file not shown.
32 MB
Binary file not shown.
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
[Window][FullScreenWindow]
2+
Pos=0,0
3+
Size=1280,720
4+
5+
[Window][Debug##Default]
6+
Pos=60,60
7+
Size=400,400
8+
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
{
2+
"cache_root": "pipeline/cache"
3+
}

startup_unroll_v2/stats.json

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
{
2+
"variant": "unroll_v2",
3+
"run_count": 1,
4+
"wall_to_window_ms": {
5+
"min": 6034,
6+
"max": 6034,
7+
"mean": 6034,
8+
"median": 6034,
9+
"stddev": 0
10+
},
11+
"first_render_submit_ms": {
12+
"min": 6034,
13+
"max": 6034,
14+
"mean": 6034,
15+
"median": 6034,
16+
"stddev": 0
17+
},
18+
"main_shader_ready_ms": {
19+
"min": 2168,
20+
"max": 2168,
21+
"mean": 2168,
22+
"median": 2168,
23+
"stddev": 0
24+
},
25+
"main_pipeline_build_ms": {
26+
"min": 3032,
27+
"max": 3032,
28+
"mean": 3032,
29+
"median": 3032,
30+
"stddev": 0
31+
}
32+
}

0 commit comments

Comments
 (0)