[Common] Use specialized unfused MXFP8 cast kernels by default #2958
Oleg-Goncharov wants to merge 5 commits into NVIDIA:main
Signed-off-by: Oleg Goncharov <[email protected]>
Greptile Summary
This PR promotes the unfused MXFP8 cast-only kernels from opt-in (via
Confidence Score: 5/5
The change is safe to merge. The correctness risk from enabling the fast path by default is tightly bounded by the new
Both changed files are narrow and well-reasoned. The env-var removal is a clean mechanical change. The rowwise alignment guard (
No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["quantize() called"] --> B{"hasSpec<IS_DBIAS,IS_DACT,IS_ACT,IType,OType>()"}
B -- "false (unsupported combo)" --> G["Generic kernel"]
B -- "true (fp16/bf16→fp8, cast-only)" --> C{"WITH_GEMM_SWIZZLED_SCALES?"}
C -- "yes" --> G
C -- "no" --> D{"scaling_type_has_specialized_support?"}
D -- "no (COLWISE, or ROWWISE with cols%32≠0)" --> G
D -- "yes" --> E{"scaling_type"}
E -- "ROWWISE\n(cols%32==0 guaranteed)" --> F1["specialized rowwise kernel\n(vectorized 32-elem chunks)"]
E -- "BIDIMENSIONAL\n(TMA handles non-aligned)" --> F2["specialized bidirectional kernel\n(TMA loads, internal OOB guards)"]
E -- "other" --> ERR["NVTE_ERROR: Invalid scaling type"]
F1 --> RET["return"]
F2 --> RET
G --> GEN["Generic MXFP8 kernel (TMA-based)"]
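
The decision tree in the flowchart can be written out as a small, host-only C++ sketch. This is not the Transformer Engine source: `Kernel`, `CastRequest`, `select_kernel`, and the two-argument signature of `scaling_type_has_specialized_support` are hypothetical names chosen for illustration, and the boolean `has_spec` merely stands in for the compile-time `hasSpec<...>()` check. Only the branch order and the `cols % 32` condition are taken from the flowchart.

```cpp
#include <cstddef>

// Hypothetical types for illustration; they mirror the flowchart labels only.
enum class ScalingType { ROWWISE, COLWISE, BIDIMENSIONAL };
enum class Kernel { Generic, SpecializedRowwise, SpecializedBidimensional };

struct CastRequest {
  bool has_spec;              // stands in for hasSpec<IS_DBIAS,IS_DACT,IS_ACT,IType,OType>()
  bool gemm_swizzled_scales;  // stands in for WITH_GEMM_SWIZZLED_SCALES
  ScalingType scaling_type;
  std::size_t cols;           // number of columns in the input tensor
};

// Assumed signature: ROWWISE needs cols divisible by 32, BIDIMENSIONAL is
// always eligible, COLWISE never is (per the flowchart's "no" branch).
constexpr bool scaling_type_has_specialized_support(ScalingType type, std::size_t cols) {
  switch (type) {
    case ScalingType::ROWWISE:
      return cols % 32 == 0;
    case ScalingType::BIDIMENSIONAL:
      return true;
    case ScalingType::COLWISE:
      return false;
  }
  return false;
}

// Walks the flowchart top to bottom; every rejected check falls back to the generic kernel.
constexpr Kernel select_kernel(const CastRequest& req) {
  if (!req.has_spec) return Kernel::Generic;
  if (req.gemm_swizzled_scales) return Kernel::Generic;
  if (!scaling_type_has_specialized_support(req.scaling_type, req.cols)) return Kernel::Generic;
  // Only ROWWISE and BIDIMENSIONAL can reach this point in the sketch, so the
  // flowchart's "Invalid scaling type" error branch has no counterpart here.
  return req.scaling_type == ScalingType::ROWWISE ? Kernel::SpecializedRowwise
                                                  : Kernel::SpecializedBidimensional;
}
```

Under these assumptions, a ROWWISE request with `cols = 48` falls back to the generic kernel, while the same request with `cols = 64` selects the specialized rowwise path.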
Reviews (3): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
/te-ci
Description
This PR enables the fast unfused MXFP8 cast kernels by default.
Previously, these kernels were gated behind an environment variable and therefore were not used unless explicitly enabled. This change makes the specialized cast-only path the default behavior.
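
As a rough illustration of the behavior change, the sketch below contrasts an opt-in gate with an always-on default. The environment variable name is a placeholder, since the real one is not given in this description, and the helper functions are hypothetical rather than taken from the PR. The rule that a value starting with `1` means "enabled" is likewise an assumption.

```cpp
#include <cstdlib>

// Placeholder name: the actual environment variable is not named in this description.
constexpr const char* kHypotheticalEnvVar = "EXAMPLE_ENABLE_SPECIALIZED_CAST";

// Assumed convention: a value starting with '1' enables the flag;
// an unset variable (nullptr) or any other value leaves it disabled.
constexpr bool env_flag_enabled(const char* value) {
  return value != nullptr && value[0] == '1';
}

// Before this PR (sketch): the specialized path is considered only on explicit opt-in.
inline bool use_specialized_cast_before() {
  return env_flag_enabled(std::getenv(kHypotheticalEnvVar));
}

// After this PR (sketch): the specialized path is always considered, and the
// remaining eligibility checks decide whether it actually runs.
constexpr bool use_specialized_cast_after() {
  return true;
}
```

In this sketch, a process started without the variable set would take the generic path before the change and become eligible for the specialized path after it.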
Type of change
Changes
Checklist: