Skip to content

Fix rawBufferVectorLoad/Store to widen min precision types to 32-bit#8274

Merged
alsepkow merged 14 commits intomicrosoft:mainfrom
alsepkow:user/alsepkow/fix-min-precision-vector-load
Mar 31, 2026
Merged

Fix rawBufferVectorLoad/Store to widen min precision types to 32-bit#8274
alsepkow merged 14 commits intomicrosoft:mainfrom
alsepkow:user/alsepkow/fix-min-precision-vector-load

Conversation

@alsepkow
Copy link
Copy Markdown
Contributor

@alsepkow alsepkow commented Mar 18, 2026

Summary

Fixes RawBufferVectorLoad/Store to use 32-bit element types (i32/f32) for min precision types (min16int, min16uint, min16float) instead of 16-bit (i16/f16). This matches how pre-SM6.9 RawBufferLoad handles min precision.

Resolves #8273

Root Cause

TranslateBufLoad in HLOperationLower.cpp creates the vector type directly from the min precision element type (i16/f16) without widening to i32/f32. This causes WARP (and potentially other drivers) to load/store 2 bytes per element instead of 4, mismatching the buffer layout.

Fix

Apply the same widening pattern used for bool types:

  • Load: Load as v_i32/v_f32, then trunc/fptrunc back to i16/half
  • Store: sext/fpext to i32/f32, then store as v_i32/v_f32

Testing

Added FileCheck test verifying all 3 min precision types produce i32/f32 vector load/store ops.

Co-authored-by: Copilot [email protected]

RawBufferVectorLoad/Store for min precision types (min16int, min16uint,
min16float) was emitting i16/f16 vector operations (e.g., v3i16) which
causes WARP and potentially other drivers to load/store 2 bytes per
element instead of 4. This mismatches the buffer layout when the CPU
writes 32-bit values.

Pre-SM6.9 RawBufferLoad correctly handles this by loading as i32/f32
and truncating. Apply the same pattern for SM6.9 vector variants:
- RawBufferVectorLoad: load as v_i32/v_f32, then trunc to i16/half
- RawBufferVectorStore: sext/fpext to i32/f32, then store as v_i32/v_f32

This matches the existing bool widening pattern already in TranslateBufLoad.

Co-authored-by: Copilot <[email protected]>
alsepkow added a commit to alsepkow/DirectXShaderCompiler that referenced this pull request Mar 24, 2026
The DXC compiler fix in PR microsoft#8274 handles min precision vector
load/store correctly at the DXIL level, so the test-side IO type
indirection (MIN_PRECISION, IO_TYPE, IO_OUT_TYPE, BUFFER_TYPE,
dispatchMinPrecisionTest, HLK_MIN_PRECISION_TEST) is no longer needed.
Min precision tests now use the same HLK_TEST/HLK_WAVEOP_TEST macros
and dispatch paths as all other types.

Co-authored-by: Copilot <[email protected]>
@alsepkow alsepkow marked this pull request as ready for review March 25, 2026 21:45
@V-FEXrt
Copy link
Copy Markdown
Collaborator

V-FEXrt commented Mar 25, 2026

Fixes RawBufferVectorLoad/Store to use 32-bit element types (i32/f32) for min precision types (min16int, min16uint, min16float) instead of 16-bit (i16/f16). This matches how pre-SM6.9 RawBufferLoad handles min precision.

Are we confident this is the right change? Doesn't SM6.9 make 16bit types manditory and IIRC soon we are going to make them always enabled.

@alsepkow
Copy link
Copy Markdown
Contributor Author

Fixes RawBufferVectorLoad/Store to use 32-bit element types (i32/f32) for min precision types (min16int, min16uint, min16float) instead of 16-bit (i16/f16). This matches how pre-SM6.9 RawBufferLoad handles min precision.

Are we confident this is the right change? Doesn't SM6.9 make 16bit types manditory and IIRC soon we are going to make them always enabled.

SM6.9 makes 16-bit type support mandatory. But a shader would still require the flag.

If we make them always enabled would that require a new SM?

alsepkow and others added 2 commits March 25, 2026 17:02
The min precision widening changes introduce trunc instructions that
cause SSA values to get names like %.i08 instead of numeric %6. Widen
the FileCheck regex from %{{[0-9]+}} to %{{[^ ,]+}} to accept both.

Co-authored-by: Copilot <[email protected]>
Same fix as min16uint_vec — widen regex from %{{[0-9]+}} to %{{[^ ,]+}}
to accept named SSA values introduced by min precision widening.

Co-authored-by: Copilot <[email protected]>
@alsepkow
Copy link
Copy Markdown
Contributor Author

alsepkow commented Mar 26, 2026

Added by Copilot

The min precision widening fix introduces trunc/fptrunc instructions after the widened load, which causes LLVM to assign named SSA values (e.g., %.i08) instead of the numeric ones (e.g., %6) that the original code produced. This broke three existing debug info FileCheck tests (min16uint_vec.hlsl, min16int_vec.hlsl, min16float_vec.hlsl) whose CHECK patterns used %{{[0-9]+}} which only matches numeric names. Updated the regex to %{{[^ ,]+}} to accept both numeric and named SSA values. The scalar variants (min16int.hlsl, min16uint.hlsl, min16float.hlsl) were unaffected.

Copy link
Copy Markdown
Collaborator

@V-FEXrt V-FEXrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more comments this time, though I again would like a review from a min precision person because while it looks like the PR does what it says it does I'm not actually able to say that we want to do what the PR does (specifically forcing 16->32 given that 16 is a mandatory thing moving forward)

Comment thread lib/HLSL/HLOperationLower.cpp Outdated
Comment thread lib/HLSL/HLOperationLower.cpp Outdated
Comment thread lib/HLSL/HLOperationLower.cpp Outdated
Comment thread lib/HLSL/HLOperationLower.cpp Outdated
Comment thread lib/HLSL/HLOperationLower.cpp Outdated
Comment thread lib/HLSL/HLOperationLower.cpp
Copy link
Copy Markdown
Collaborator

@V-FEXrt V-FEXrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more comments this time, though I again would like a review from a min precision person because while it looks like the PR does what it says it does I'm not actually able to say that we want to do what the PR does (specifically forcing 16->32 given that 16 is a mandatory thing moving forward)

@V-FEXrt
Copy link
Copy Markdown
Collaborator

V-FEXrt commented Mar 26, 2026

If we make them always enabled would that require a new SM?

not sure, but even if it did it would easily go into SM6.10. @daymanp can comment but the idea was to default the flag to true instead of false

- Extract isMinPrecisionType() and widenMinPrecisionType() helpers
- Remove unused TargetTy variable
- Shorten verbose comments
- Simplify store widening using helper function

Co-authored-by: Copilot <[email protected]>
Comment thread lib/HLSL/HLOperationLower.cpp Outdated
alsepkow and others added 2 commits March 25, 2026 17:33
Comment thread lib/HLSL/HLOperationLower.cpp
Comment thread lib/HLSL/HLOperationLower.cpp Outdated
Comment thread lib/HLSL/HLOperationLower.cpp Outdated
alsepkow and others added 2 commits March 30, 2026 16:22
- widenMinPrecisionType now takes a single Type* (vector or scalar) and
  LLVMContext instead of separate EltTy/VecOrScalarTy and IRBuilder.
- Load path uses widenMinPrecisionType upfront, eliminating the
  else-if-isMinPrec branch per tex3d's suggestion.
- Store widening now covers both RawBufferStore and RawBufferVectorStore.
  Without this, scalar min-precision stores crash with a cast type mismatch.
- Added scalar RawBufferLoad/Store test cases for all 3 min-precision types.
- Added TODO(microsoft#8314) for SExt/ZExt signedness issue.

Co-authored-by: Copilot <[email protected]>
@alsepkow
Copy link
Copy Markdown
Contributor Author

If we make them always enabled would that require a new SM?

not sure, but even if it did it would easily go into SM6.10. @daymanp can comment but the idea was to default the flag to true instead of false

I don't think that defaulting that flag to causes any apparent problems specific to this change. @tex3d to confirm and keep me honest though!

Copy link
Copy Markdown
Contributor

@tex3d tex3d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@alsepkow
Copy link
Copy Markdown
Contributor Author

alsepkow commented Mar 31, 2026 via email

alsepkow and others added 3 commits March 30, 2026 17:48
Shorten 3 comments in HLOperationLower.cpp that re-described code
behavior. Each now explains the underlying reason in 1 line.
The TODO(microsoft#8314) comment was already concise and left unchanged.

Co-authored-by: Copilot <[email protected]>
The test covers both vector and scalar paths, so drop 'vector' from
the file name to accurately reflect scope.

Co-authored-by: Copilot <[email protected]>
Restores SPIRV-Headers and SPIRV-Tools submodule pointers to their
previous state and removes setup_agility_sdk.py which was staged
unintentionally.

Co-authored-by: Copilot <[email protected]>
@alsepkow alsepkow enabled auto-merge (squash) March 31, 2026 01:30
alsepkow and others added 2 commits March 31, 2026 10:25
The store widening for min precision types was applying to both
RawBufferStore (scalar) and RawBufferVectorStore (vector). This
broke StructuredBuffer stores (struct_buf3.hlsl) because it replaced
the correct sext/zext from TranslateMinPrecisionRawBuffer with a
blanket sext, losing signedness info for min16uint.

Scope the RawBufferStore widening to RawBuffer (ByteAddressBuffer)
only. StructuredBuffer scalar stores are correctly handled by the
later TranslateMinPrecisionRawBuffer pass in DxilGenerationPass,
which has signedness info from struct type annotations.

ByteAddressBuffer scalar stores still need widening here because
the later pass crashes on non-struct resource types (cast<StructType>
on ByteAddressBuffer's i32 inner element).

Co-authored-by: Copilot <[email protected]>
Move scalar RawBufferStore widening back to TranslateMinPrecisionRawBuffer
in DxilGenerationPass where it belongs, keeping HLOperationLower scoped to
RawBufferVectorStore only.

Fix crash in ReplaceMinPrecisionRawBufferStoreByType for ByteAddressBuffer:
the existing code did cast<StructType> on the inner resource element, which
fails for ByteAddressBuffer (inner element is i32, not a struct). Use
dyn_cast instead and fall back to sext when no struct annotation is
available. Preserve undef args to avoid store mask validation errors.

Co-authored-by: Copilot <[email protected]>
Copy link
Copy Markdown
Member

@damyanp damyanp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alsepkow alsepkow merged commit dc4354b into microsoft:main Mar 31, 2026
12 checks passed
@github-project-automation github-project-automation Bot moved this from New to Done in HLSL Roadmap Mar 31, 2026
alsepkow added a commit to alsepkow/DirectXShaderCompiler that referenced this pull request Apr 1, 2026
Reverts the DxilGenerationPass ByteAddressBuffer scalar store fix from
PR microsoft#8274 — it was out of scope and incomplete (sext fallback is wrong
for min16uint). Scalar ByteAddressBuffer template store widening should
be handled during CodeGen where signedness info is still available.

Removes scalar load/store tests from min_precision_raw_load_store.hlsl
since scalar ByteAddressBuffer::Store<min16int>() hits a pre-existing
crash in TranslateMinPrecisionRawBuffer (cast<StructType> on non-struct
resource type). Test now covers vector loads/stores only, which is the
scope of the original fix.

Co-authored-by: Copilot <[email protected]>
alsepkow added a commit that referenced this pull request Apr 1, 2026
## Summary

Reverts the `DxilGenerationPass` ByteAddressBuffer scalar store changes
and removes scalar store tests that were included in PR #8274. These
changes were out of scope for the vector load/store fix and, on further
discussion, were concluded to be incomplete.

Follow-up issue for the proper fix: #8322

## What changed

- **DxilGenerationPass.cpp**: Reverted the ByteAddressBuffer fallback
path added in `ReplaceMinPrecisionRawBufferStoreByType`. The `sext`
fallback was wrong for `min16uint` (loses signedness), and the fix
belongs in CodeGen where signedness info is still available.
- **min_precision_raw_load_store.hlsl**: Removed scalar load/store
tests. Scalar `ByteAddressBuffer::Store<min16int>()` hits a pre-existing
crash in `TranslateMinPrecisionRawBuffer` (`cast<StructType>` on
ByteAddressBuffer's `i32` inner element). Test now covers vector
loads/stores only, which is the scope of the original fix.

## Context

The original PR #8274 correctly fixed `RawBufferVectorLoad/Store` to
widen min precision types to 32-bit. However, it also added a
ByteAddressBuffer scalar store fix in `DxilGenerationPass` that:
1. Crept outside the scope of the vector load/store fix
2. Was incomplete — the `sext` fallback is wrong for unsigned types
(`min16uint`)
3. Should instead be handled during Clang CodeGen, where signedness
information is available

Scalar ByteAddressBuffer template store widening for min precision types
is a separate pre-existing issue that needs a proper fix in CodeGen.

Co-authored-by: Copilot <[email protected]>
alsepkow added a commit that referenced this pull request Apr 1, 2026
…#8260)

This PR extends the SM 6.9 long vector execution tests to cover HLSL min
precision types (min16float, min16int, min16uint). These types are
always available — `D3D12_SHADER_MIN_PRECISION_SUPPORT` only reports
whether hardware actually uses reduced precision, not whether the types
compile — so no device capability check is needed and the tests live in
the existing `DxilConf_SM69_Vectorized_Core` class alongside other
types.

Note: I wasn't able to find any existing min precision HLK tests.
Unclear if we have coverage.

## Key design decisions

**Full-precision buffer I/O:** Min precision types have
implementation-defined buffer storage width, so we use full-precision
types (`float`/`int`/`uint`) for all Load/Store operations via the
`IO_TYPE`/`IO_OUT_TYPE` shader defines, with explicit casts to/from the
min precision compute type. This ensures deterministic data layout
regardless of the device implementation.

**Half-precision tolerances:** Validation compares results in fp16 space
using HLSLHalf_t ULP tolerances. Since min precision guarantees at least
16-bit, fp16 tolerances are a correct upper bound — devices computing at
higher precision will produce more accurate results, not less.

**Test coverage mirrors existing patterns:**
- min16float mirrors HLSLHalf_t
(float/trig/math/comparison/dot/cast/derivative/wave/quad/load-store)
- min16int mirrors int16_t
(arithmetic/bitwise/comparison/reduction/cast/wave/quad/load-store)
- min16uint mirrors uint16_t
(arithmetic/bitwise/comparison/cast/wave/quad/load-store)

**Wave and quad op support:** Wave ops
(WaveActiveSum/Min/Max/Product/AllEqual, WaveReadLaneAt/First,
WavePrefix*, WaveMultiPrefix*, WaveMatch) and quad ops (QuadReadLaneAt,
QuadReadAcrossX/Y/Diagonal) are tested for all three min precision
types, mirroring the ops supported by their 16-bit equivalents. The wave
op shader helpers use `#ifdef MIN_PRECISION` guards to store results via
`IO_OUT_TYPE` for deterministic buffer layout without changing DXIL for
existing non-min-precision tests.

**Excluded operations:**
- Signed div/mod on min16int: HLSL does not support signed integer
division on min precision types
- Bit shifting on min16int/min16uint: Not supported for min precision
types
- FP specials (INF/NaN/denorm): min precision types do not support them

Resolves #7780

All tests require the rawBufferVectorLoad/Store fix from :
#8274
The array accessor and wave/quad op tests for min precision require the
optimizer fix from:
#8269


Co-authored-by: Copilot <[email protected]>

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
alsepkow added a commit to alsepkow/DirectXShaderCompiler that referenced this pull request Apr 14, 2026
…icrosoft#8274)

## Summary

Fixes `RawBufferVectorLoad`/`Store` to use 32-bit element types
(`i32`/`f32`) for min precision types (`min16int`, `min16uint`,
`min16float`) instead of 16-bit (`i16`/`f16`). This matches how
pre-SM6.9 `RawBufferLoad` handles min precision.

Resolves microsoft#8273

## Root Cause

`TranslateBufLoad` in `HLOperationLower.cpp` creates the vector type
directly from the min precision element type (`i16`/`f16`) without
widening to `i32`/`f32`. This causes WARP (and potentially other
drivers) to load/store 2 bytes per element instead of 4, mismatching the
buffer layout.

## Fix

Apply the same widening pattern used for bool types:
- **Load**: Load as `v_i32`/`v_f32`, then trunc/fptrunc back to
`i16`/`half`
- **Store**: `sext`/`fpext` to `i32`/`f32`, then store as
`v_i32`/`v_f32`

## Testing

Added FileCheck test verifying all 3 min precision types produce
`i32`/`f32` vector load/store ops.

Co-authored-by: Copilot <[email protected]>

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Tex Riddell <[email protected]>
alsepkow added a commit to alsepkow/DirectXShaderCompiler that referenced this pull request Apr 14, 2026
…icrosoft#8321)

## Summary

Reverts the `DxilGenerationPass` ByteAddressBuffer scalar store changes
and removes scalar store tests that were included in PR microsoft#8274. These
changes were out of scope for the vector load/store fix and, on further
discussion, were concluded to be incomplete.

Follow-up issue for the proper fix: microsoft#8322

## What changed

- **DxilGenerationPass.cpp**: Reverted the ByteAddressBuffer fallback
path added in `ReplaceMinPrecisionRawBufferStoreByType`. The `sext`
fallback was wrong for `min16uint` (loses signedness), and the fix
belongs in CodeGen where signedness info is still available.
- **min_precision_raw_load_store.hlsl**: Removed scalar load/store
tests. Scalar `ByteAddressBuffer::Store<min16int>()` hits a pre-existing
crash in `TranslateMinPrecisionRawBuffer` (`cast<StructType>` on
ByteAddressBuffer's `i32` inner element). Test now covers vector
loads/stores only, which is the scope of the original fix.

## Context

The original PR microsoft#8274 correctly fixed `RawBufferVectorLoad/Store` to
widen min precision types to 32-bit. However, it also added a
ByteAddressBuffer scalar store fix in `DxilGenerationPass` that:
1. Crept outside the scope of the vector load/store fix
2. Was incomplete — the `sext` fallback is wrong for unsigned types
(`min16uint`)
3. Should instead be handled during Clang CodeGen, where signedness
information is available

Scalar ByteAddressBuffer template store widening for min precision types
is a separate pre-existing issue that needs a proper fix in CodeGen.

Co-authored-by: Copilot <[email protected]>
alsepkow added a commit to alsepkow/DirectXShaderCompiler that referenced this pull request Apr 14, 2026
…microsoft#8260)

This PR extends the SM 6.9 long vector execution tests to cover HLSL min
precision types (min16float, min16int, min16uint). These types are
always available — `D3D12_SHADER_MIN_PRECISION_SUPPORT` only reports
whether hardware actually uses reduced precision, not whether the types
compile — so no device capability check is needed and the tests live in
the existing `DxilConf_SM69_Vectorized_Core` class alongside other
types.

Note: I wasn't able to find any existing min precision HLK tests.
Unclear if we have coverage.

## Key design decisions

**Full-precision buffer I/O:** Min precision types have
implementation-defined buffer storage width, so we use full-precision
types (`float`/`int`/`uint`) for all Load/Store operations via the
`IO_TYPE`/`IO_OUT_TYPE` shader defines, with explicit casts to/from the
min precision compute type. This ensures deterministic data layout
regardless of the device implementation.

**Half-precision tolerances:** Validation compares results in fp16 space
using HLSLHalf_t ULP tolerances. Since min precision guarantees at least
16-bit, fp16 tolerances are a correct upper bound — devices computing at
higher precision will produce more accurate results, not less.

**Test coverage mirrors existing patterns:**
- min16float mirrors HLSLHalf_t
(float/trig/math/comparison/dot/cast/derivative/wave/quad/load-store)
- min16int mirrors int16_t
(arithmetic/bitwise/comparison/reduction/cast/wave/quad/load-store)
- min16uint mirrors uint16_t
(arithmetic/bitwise/comparison/cast/wave/quad/load-store)

**Wave and quad op support:** Wave ops
(WaveActiveSum/Min/Max/Product/AllEqual, WaveReadLaneAt/First,
WavePrefix*, WaveMultiPrefix*, WaveMatch) and quad ops (QuadReadLaneAt,
QuadReadAcrossX/Y/Diagonal) are tested for all three min precision
types, mirroring the ops supported by their 16-bit equivalents. The wave
op shader helpers use `#ifdef MIN_PRECISION` guards to store results via
`IO_OUT_TYPE` for deterministic buffer layout without changing DXIL for
existing non-min-precision tests.

**Excluded operations:**
- Signed div/mod on min16int: HLSL does not support signed integer
division on min precision types
- Bit shifting on min16int/min16uint: Not supported for min precision
types
- FP specials (INF/NaN/denorm): min precision types do not support them

Resolves microsoft#7780

All tests require the rawBufferVectorLoad/Store fix from :
microsoft#8274
The array accessor and wave/quad op tests for min precision require the
optimizer fix from:
microsoft#8269


Co-authored-by: Copilot <[email protected]>

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
alsepkow added a commit that referenced this pull request Apr 15, 2026
…in precision types to 32-bit (#8369)

Cherry-pick PR (#8274) and revert of out-of-scope changes PR (#8321)

Assisted by gh copilot.

SHA
[dc4354b](dc4354b)
SHA
[71aa195](71aa195)

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Tex Riddell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

rawBufferVectorLoad/Store emits i16/f16 for min precision types instead of i32/f32

4 participants