README.md: 61 additions & 2 deletions
@@ -21,14 +21,29 @@ Protocol:
|`master_source_off`|`master_runcheck` local worktree |[`e11b118dd2e80393b5b7eb309c6abb25f51a818c`](https://github.com/Devsh-Graphics-Programming/Nabla/commit/e11b118dd2e80393b5b7eb309c6abb25f51a818c)|[`d76c7890b19ce0b344ee0ce116dbc1c92220ccea`](https://github.com/Devsh-Graphics-Programming/DirectXShaderCompiler/commit/d76c7890b19ce0b344ee0ce116dbc1c92220ccea)|[`057230db28c7f7d1d571c9e61732da44815f2891`](https://github.com/Devsh-Graphics-Programming/SPIRV-Headers/commit/057230db28c7f7d1d571c9e61732da44815f2891)|[`91ac969ed599bfd0697a5b88cfae550318a04392`](https://github.com/Devsh-Graphics-Programming/SPIRV-Tools/commit/91ac969ed599bfd0697a5b88cfae550318a04392)| local `Release`, `SOURCE`, runtime `builtins OFF`|
|`devshfixes_upstream`|`unroll_dxc_df_upstream_check` local worktree |[`c13c33662c3733b54d9014988a5ac602ab0c3245`](https://github.com/Devsh-Graphics-Programming/Nabla/commit/c13c33662c3733b54d9014988a5ac602ab0c3245)|[`74d6fbbad7388813c65ae269b20f15b4e971df9c`](https://github.com/Devsh-Graphics-Programming/DirectXShaderCompiler/commit/74d6fbbad7388813c65ae269b20f15b4e971df9c)|[`10b37414a3c9269b9bd8861cc759bd7fdf09760d`](https://github.com/Devsh-Graphics-Programming/SPIRV-Headers/commit/10b37414a3c9269b9bd8861cc759bd7fdf09760d)|[`2c75d08e3b31a673726ce6be80ab528250247064`](https://github.com/Devsh-Graphics-Programming/SPIRV-Tools/commit/2c75d08e3b31a673726ce6be80ab528250247064)| local `Release`, `SOURCE`, runtime `builtins OFF`|
|`unroll_artifact`| CI install artifact from [`run 23599197849`](https://github.com/Devsh-Graphics-Programming/Nabla/actions/runs/23599197849)|[`262a8b72f295ec95d3cf83170f1768a43972c9ab`](https://github.com/Devsh-Graphics-Programming/Nabla/commit/262a8b72f295ec95d3cf83170f1768a43972c9ab)|[`07f06e9d48807ef8e7cabc41ae6acdeb26c68c09`](https://github.com/Devsh-Graphics-Programming/DirectXShaderCompiler/commit/07f06e9d48807ef8e7cabc41ae6acdeb26c68c09)|[`c141151dd53cbd5b1ced0665ad95ae3e91e8f916`](https://github.com/Devsh-Graphics-Programming/SPIRV-Headers/commit/c141151dd53cbd5b1ced0665ad95ae3e91e8f916)|[`2a730e127a32ac8b0713f5e1490d7b9be9d1cc9a`](https://github.com/Devsh-Graphics-Programming/SPIRV-Tools/commit/2a730e127a32ac8b0713f5e1490d7b9be9d1cc9a)| CI `Release install` artifact |
+
|`unroll_v2`|`unroll_o1_local` local worktree after the latest `-O1experimental` changes |[`6ee8dbc04df55db97c9440d078eef160522a6af1`](https://github.com/Devsh-Graphics-Programming/Nabla/commit/6ee8dbc04df55db97c9440d078eef160522a6af1)|[`891d1d7bd6fb20757a3af07f5a7a33ef59f7c15e`](https://github.com/Devsh-Graphics-Programming/DirectXShaderCompiler/commit/891d1d7bd6fb20757a3af07f5a7a33ef59f7c15e)|[`c141151dd53cbd5b1ced0665ad95ae3e91e8f916`](https://github.com/Devsh-Graphics-Programming/SPIRV-Headers/commit/c141151dd53cbd5b1ced0665ad95ae3e91e8f916)|[`0ecbcc95a108f1a3313ea184260b10d21e158a47`](https://github.com/Devsh-Graphics-Programming/SPIRV-Tools/commit/0ecbcc95a108f1a3313ea184260b10d21e158a47)| local `RelWithDebInfo`, `SOURCE`, runtime `-O1experimental`|
|`devshfixes_upstream` vs `master_source_off`|`-1.8147`|`-8.47%`|
|`unroll_artifact` vs `master_source_off`|`+0.1631`|`+0.76%`|
|`unroll_artifact` vs `devshfixes_upstream`|`+1.9778`|`+10.08%`|
+
|`unroll_v2` vs `master_source_off`|`-2.1944`|`-10.24%`|
+
|`unroll_v2` vs `devshfixes_upstream`|`-0.3797`|`-1.94%`|
+
|`unroll_v2` vs `unroll_artifact`|`-2.3575`|`-10.92%`|
+
+
## Cold startup Vulkan API probe
+
+
Cold startup `vkCreateComputePipelines` time was measured on the same published runnable bundles, with the pipeline/shader cache cleared and Vulkan API tracing enabled for the process.
+
| Variant | `vkCreateComputePipelines` calls | `vkCreateComputePipelines` start->next sum (ms) |
| --- | ---: | ---: |
| `master_source_off` | `13` | `3737.11` |
| `devshfixes_upstream` | `21` | `3332.55` |
| `unroll_artifact` | `21` | `1418.86` |
| `unroll_v2` | `2` | `354.707` |
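The `start->next sum` column can be read as follows, under one assumed interpretation: for every traced `vkCreateComputePipelines` call, take the gap between its start timestamp and the start of the next traced API call, then add the gaps up. The sketch below illustrates that reading only. The `(timestamp, name)` record layout, the function name, and the sample timestamps are hypothetical and do not describe the tracing tool actually used.

```python
from typing import Iterable, Tuple

# Hypothetical trace record: (call start timestamp in ms, API call name).
Event = Tuple[float, str]


def start_to_next_sum(events: Iterable[Event],
                      target: str = "vkCreateComputePipelines") -> Tuple[int, float]:
    """Return (number of target calls, summed start->next gaps in ms)."""
    ordered = sorted(events, key=lambda event: event[0])
    calls = 0
    total_ms = 0.0
    for index, (start_ms, name) in enumerate(ordered):
        if name != target:
            continue
        calls += 1
        # Gap from this call's start to the start of whatever call follows it.
        # A target call that ends the trace has no successor and adds nothing.
        if index + 1 < len(ordered):
            total_ms += ordered[index + 1][0] - start_ms
    return calls, total_ms


# Made-up timestamps, for illustration only (not taken from the measurements).
example_trace = [
    (0.0, "vkCreateDevice"),
    (10.0, "vkCreateComputePipelines"),
    (250.0, "vkCreateComputePipelines"),
    (400.0, "vkQueueSubmit"),
]
example_calls, example_total_ms = start_to_next_sum(example_trace)
print(example_calls, example_total_ms)
```

Measuring to the start of the next call, rather than to the call's own return, would also fold in any driver-side work that happens before the application issues its next API call; whether that matches the published numbers is not confirmed here.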
+
+
The cleanest comparison is `devshfixes_upstream` vs `unroll_artifact`, because both cold runs create exactly `21` compute pipelines. On that paired comparison the `vkCreateComputePipelines` batch drops from `3332.55 ms` to `1418.86 ms`, a reduction of about `57.4%`.
+
+
The `master_source_off` comparison is also informative: even though `unroll_artifact` creates more compute pipelines during cold startup (`21` vs `13`), the total `vkCreateComputePipelines` batch still drops from `3737.11 ms` to `1418.86 ms`, a reduction of about `62.0%`.
+
+
The new `unroll_v2` local measurement is the most recent checkpoint, taken after the latest `-O1experimental` changes. Against the published `unroll_artifact` cold run, the traced `vkCreateComputePipelines` batch drops further from `1418.86 ms` to `354.707 ms` (about `75.0%`), with the observed call count going from `21` to `2`. Against `master_source_off`, the same cold batch drops from `3737.11 ms` to `354.707 ms` (about `90.5%`).
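The reduction percentages quoted in this section follow directly from the table. As a minimal sketch, the arithmetic can be reproduced like this; the dictionary and function names are illustrative, and only the millisecond values come from the table above.

```python
# Cold-startup vkCreateComputePipelines totals in ms, copied from the table.
cold_batch_ms = {
    "master_source_off": 3737.11,
    "devshfixes_upstream": 3332.55,
    "unroll_artifact": 1418.86,
    "unroll_v2": 354.707,
}


def reduction_pct(before: str, after: str) -> float:
    """Percentage by which the cold batch shrinks going from `before` to `after`."""
    before_ms = cold_batch_ms[before]
    after_ms = cold_batch_ms[after]
    return (before_ms - after_ms) / before_ms * 100.0


pairs = [
    ("devshfixes_upstream", "unroll_artifact"),
    ("master_source_off", "unroll_artifact"),
    ("unroll_artifact", "unroll_v2"),
    ("master_source_off", "unroll_v2"),
]
for before, after in pairs:
    print(f"{before} -> {after}: {reduction_pct(before, after):.1f}% smaller")
```

Only the `devshfixes_upstream` vs `unroll_artifact` pair has matching call counts (`21` each), so the other percentages compare batches of different sizes and should be read with that in mind.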
## Main conclusion
The measured `latest upstream refresh` baseline (`devshfixes_upstream`) is faster than `master_source_off` in this probe. At the same time, `unroll_artifact` is effectively at parity with `master_source_off` at only `+0.76%`; the remaining gap (`+10.08%`) appears only against `devshfixes_upstream`.
+
The up-to-date `unroll_v2` follow-up, measured with the new `-O1experimental` flag, goes further: in this probe it is now faster than `master_source_off` by `10.24%` on steady-state GPU frame time (`19.2360 ms` vs `21.4304 ms`), and its cold `vkCreateComputePipelines` batch is about `10.5x` smaller (`354.707 ms` vs `3737.11 ms`).
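The steady-state percentages can be cross-checked the same way. In the sketch below, the `master_source_off` and `unroll_v2` frame times are the ones quoted above; the `devshfixes_upstream` and `unroll_artifact` frame times are not quoted directly in this excerpt and are derived here from the `-1.8147 ms` and `+0.1631 ms` deltas in the comparison table, which is an assumption about how those deltas were computed. All names are illustrative.

```python
# Steady-state GPU frame times in ms.
MASTER_MS = 21.4304  # quoted in the text
frame_ms = {
    "master_source_off": MASTER_MS,
    "unroll_v2": 19.2360,                      # quoted in the text
    "devshfixes_upstream": MASTER_MS - 1.8147,  # derived from the delta table
    "unroll_artifact": MASTER_MS + 0.1631,      # derived from the delta table
}


def relative_delta_pct(variant: str, baseline: str) -> float:
    """Signed frame-time change of `variant` relative to `baseline`, in percent."""
    return (frame_ms[variant] - frame_ms[baseline]) / frame_ms[baseline] * 100.0


comparisons = [
    ("devshfixes_upstream", "master_source_off"),
    ("unroll_artifact", "master_source_off"),
    ("unroll_artifact", "devshfixes_upstream"),
    ("unroll_v2", "master_source_off"),
    ("unroll_v2", "devshfixes_upstream"),
    ("unroll_v2", "unroll_artifact"),
]
for variant, baseline in comparisons:
    print(f"{variant} vs {baseline}: {relative_delta_pct(variant, baseline):+.2f}%")
```

Negative values mean the variant renders a frame faster than the baseline, matching the sign convention of the comparison table.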
+
Taken together, the measured runtime cost points at the `unroll` side of the experiment, not at the generic `DXC/SPIRV-Tools upstream refresh`. That tradeoff is also aligned with the intent of the experiment: reduce shader build time aggressively while accepting a small runtime cost.
-
In practice this is also a strong argument for a development-oriented DXC optimization profile, for example an `-O1`-style mode. For the Nabla Path Tracer builds behind this comparison the shader-build wall time is about `10x` worse without that profile, while the measured runtime delta stays at `+0.76%` against `master_source_off` in this probe and at `+10.08%` against `devshfixes_upstream`; on other machines a simpler `average FPS` check across multiple methods, modes, and shapes placed the same runtime cost in the `5-8%` range. That is exactly the profile proposed in the paired PRs: for development use it delivers a major build-time win while keeping runtime impact effectively negligible in practice.
+
In practice this is also a strong argument for a development-oriented DXC optimization profile, for example an `-O1`-style mode. For the Nabla Path Tracer builds behind this comparison the shader-build wall time is about `10x` worse without that profile, while the newest `unroll_v2` follow-up is already faster than the current `master` baseline on this measured path. That is exactly the profile proposed in the paired PRs: for development use it delivers a major build-time win while keeping runtime impact favorable on the measured workload.
+
+
`unroll_v2` is the current local follow-up checkpoint after those latest changes. It keeps the same high-level workload shape (`dispatch_count = 2`) and shows where the updated `-O1experimental` line lands relative to the published `unroll_artifact` and the current `master` baseline.
## Deeper Nsight signals from the same exports
Frame-level exports also show:
- `dispatch_count = 2` and `gr__ctas_launched_queue_sync.sum = 14401` in all three originally compared variants (`master_source_off`, `devshfixes_upstream`, `unroll_artifact`)
- `unroll_artifact` has lower `SM throughput` than `devshfixes_upstream`
- `unroll_artifact` also shows higher total executed instructions and much higher `L1/LSU/shared` pressure than `devshfixes_upstream`
+
- `unroll_v2` raises `SM throughput` back to `38.7311%` while keeping `dispatch_count = 2`
This points at a `compute-side codegen / execution-mix` difference with higher `L1/LSU/shared` pressure on the `unroll` side.
@@ -61,6 +101,7 @@ This points at a `compute-side codegen / execution-mix` difference with higher `