
[All] Remove legacy max512 backend #2949

Open
cyanguwa wants to merge 7 commits into NVIDIA:main from cyanguwa:remove_max512_subbackend

Conversation

@cyanguwa
Collaborator

@cyanguwa cyanguwa commented Apr 30, 2026

Description

Fused attention (cuDNN attention) currently has 3 sub-backends:

  • f16_max512: FP16/BF16, max_seq_len <= 512, head_dim = 64, MHA only
  • f16_arbitrary: FP16/BF16, max_seq_len = any, head_dim <= 256, MHA/MQA/GQA/MLA, and
  • fp8: FP8 delayed scaling, FP8 current scaling, MXFP8.

f16_max512 was implemented with a much older cuDNN interface, which will be removed in the next cudnn-frontend release. This PR removes the f16_max512 sub-backend and routes all BF16/FP16 attention calculations to the f16_arbitrary sub-backend, which covers all max512 features.
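For reference, a minimal sketch of what the backend enum reduces to after this change; the values follow the Changes list below, and NVTE_No_Backend = -1 is an assumption inferred from the fallback discussed in the review, not quoted from the header.

```cpp
// Sketch of NVTE_Fused_Attn_Backend (transformer_engine/fused_attn.h) after
// this PR; NVTE_No_Backend = -1 is assumed, not copied from the header.
enum NVTE_Fused_Attn_Backend {
  NVTE_No_Backend = -1,           // no fused-attention backend applies
  // NVTE_F16_max512_seqlen = 0,  // removed by this PR
  NVTE_F16_arbitrary_seqlen = 1,  // FP16/BF16, any seq_len, head_dim <= 256
  NVTE_FP8 = 2,                   // FP8 delayed/current scaling, MXFP8
};
```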

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Deleted the CUDA kernel (fused_attn_f16_max512_seqlen.cu) and its header
  • Removed NVTE_F16_max512_seqlen from the NVTE_Fused_Attn_Backend enum (existing values NVTE_F16_arbitrary_seqlen = 1 and NVTE_FP8 = 2 are unchanged)
  • Removed flag_m512 computation, backend selection logic, and fwd/bwd dispatch for max512 in fused_attn.cpp (see the sketch after this list)
  • Removed max512 from pybind definitions (PyTorch + JAX), Python FusedAttnBackend dict, RNG workspace sizing, and docstrings
  • Removed max512-specific workarounds in DotProductAttention backend selection (env var override for post_scale_bias, sliding window filter, bias shape filter)
  • Updated tests and docs; note that the outdated support matrix in the FusedAttention class docstring is removed and will be replaced with a more complete and accurate version in a follow-up
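As a schematic sketch of what the simplified F16/BF16 selection in fused_attn.cpp reduces to (flag_arb and the enum values come from this PR; the function name and boolean parameter are hypothetical placeholders, not the real code):

```cpp
#include <transformer_engine/fused_attn.h>  // NVTE_Fused_Attn_Backend

// Schematic only: with flag_m512 gone, the F16/BF16 path needs a single
// feature check. flag_arb stands in for the real condition chain
// (any seq_len, head_dim <= 256, MHA/MQA/GQA/MLA) evaluated in fused_attn.cpp.
NVTE_Fused_Attn_Backend select_f16_backend(bool flag_arb) {
  return flag_arb ? NVTE_F16_arbitrary_seqlen : NVTE_No_Backend;
}
```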

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Charlene Yang <[email protected]>
@cyanguwa
Collaborator Author

/te-ci

@greptile-apps
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR removes the legacy F16_max512_seqlen cuDNN sub-backend (backend enum value 0), which used an older cuDNN interface scheduled for removal. All FP16/BF16 attention — including workloads previously handled by max512 (seq_len ≤ 512, head_dim = 64, MHA) — is now routed exclusively to the F16_arbitrary_seqlen backend, which is a strict superset of max512's feature coverage.

  • The CUDA kernel file (fused_attn_f16_max512_seqlen.cu), its header, the NVTE_F16_max512_seqlen = 0 enum value, all dispatch logic in fused_attn.cpp, and all Python/JAX/PyTorch bindings referencing it are cleanly removed.
  • Backend-selection workarounds in utils.py (bias-shape filter, sliding-window filter, env-var force) that existed to compensate for max512 limitations are also removed.
  • Test and documentation updates are included; a follow-up is noted to replace the removed FusedAttention support-matrix docstring with a more complete version.

Confidence Score: 5/5

The removal is well-scoped: the arbitrary-seqlen backend is a confirmed superset of max512's capabilities, so no previously-supported workload should silently fall back to NVTE_No_Backend.

The core deletion — kernel, header, dispatch, and bindings — is complete and internally consistent across C++, PyTorch, and JAX. The remaining findings are documentation-only: the FusedAttention docstring misstates the FP8 backend's sequence-length constraint, and the NVTE_FUSED_ATTN_BACKEND env-var docs still list value 1 as a meaningful F16 override when the code no longer reads it. Neither affects runtime correctness.

The FusedAttention class docstring in backends.py and the NVTE_FUSED_ATTN_BACKEND entry in docs/envvars.rst both carry inaccuracies introduced by this PR that are worth a second look before merging.

Important Files Changed

  • transformer_engine/common/fused_attn/fused_attn.cpp: Removes max512 backend selection logic, flag_m512 computation, and fwd/bwd dispatch; routes all F16/BF16 to flag_arb/NVTE_F16_arbitrary_seqlen. The NVTE_FUSED_ATTN_BACKEND env-var override is now completely gone from the F16 path.
  • transformer_engine/common/include/transformer_engine/fused_attn.h: Removes the NVTE_F16_max512_seqlen = 0 enum value and updates the support-matrix verbatim table in both the nvte_fused_attn_fwd and nvte_fused_attn_bwd docstrings to drop the backend-0 row and widen the backend-1 seqlen constraint to 'any'.
  • transformer_engine/pytorch/attention/dot_product_attention/backends.py: Removes the max512 docstring table; the replacement docstring incorrectly describes FP8 as supporting 'any sequence length' when the backend is still constrained to ≤ 512.
  • transformer_engine/pytorch/attention/dot_product_attention/utils.py: Removes three max512-specific workarounds: the env-var force to backend 1 for non-grad bias, the sliding-window filter, and the bias-shape filter; all clean deletions with no residual references.
  • transformer_engine/pytorch/cpp_extensions/fused_attn.py: Removes F16_max512 from the FusedAttnBackend dict, renames BACKEND_F16m512_FP8_THREADS_PER_CTA to BACKEND_FP8_THREADS_PER_CTA, drops the max512 RNG workspace sizing branch, and simplifies the aux_ctx_tensors guard; all changes are consistent.
  • tests/pytorch/attention/test_attention.py: Removes the dual-backend forced comparison (backend 0 vs 1) and simplifies to a single FusedAttention run; matches the new single-F16-backend reality.
  • tests/pytorch/utils.py: Updates the backends dict to {1: 'F16_arbitrary_seqlen', 2: 'FP8'} and loops over its keys instead of range(3), correctly excluding the removed backend 0.
  • transformer_engine/jax/csrc/extensions/attention.cpp: Removes the post-hoc softmax shape correction for max512 in PrepareFusedAttnBackwardAuxTensors; the remaining code correctly initialises tensors for the arbitrary-seqlen and FP8 backends only.
  • docs/envvars.rst: Removes value 0 from the NVTE_FUSED_ATTN_BACKEND docs; values 1 and 2 remain documented as valid overrides, but the env-var is now completely unused in fused_attn.cpp for the F16 path (see the sketch after this list).
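To make the docs/envvars.rst finding concrete, here is a purely hypothetical sketch of the kind of env-var read that, per the review, no longer exists on the F16 path; it is not the removed code itself, and the helper name is invented for illustration.

```cpp
#include <cstdlib>

// Hypothetical illustration only. Before this PR, an override along these
// lines allowed NVTE_FUSED_ATTN_BACKEND to force a specific F16 sub-backend.
// With max512 removed, fused_attn.cpp no longer reads the variable on the
// F16 path, so documenting value 1 as a meaningful override is now stale.
static bool f16_backend_forced(int value) {
  const char *env = std::getenv("NVTE_FUSED_ATTN_BACKEND");
  return env != nullptr && std::atoi(env) == value;
}
```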

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nvte_get_fused_attn_backend] --> B{dtype?}
    B -->|FP8| C[Evaluate FP8 conditions]
    B -->|FP16 / BF16| D[Evaluate flag_arb conditions]
    C --> E{FP8 conditions met?}
    E -->|Yes| F[NVTE_FP8]
    E -->|No| G[NVTE_No_Backend]
    D --> H{flag_arb?}
    H -->|Yes| I[NVTE_F16_arbitrary_seqlen]
    H -->|No| G

    style F fill:#4CAF50,color:#fff
    style I fill:#2196F3,color:#fff
    style G fill:#f44336,color:#fff

    subgraph REMOVED["Removed (this PR)"]
        R1["flag_m512 evaluation - seqlen <= 512, head_dim = 64, MHA only"]
        R2["NVTE_F16_max512_seqlen backend"]
        R3["NVTE_FUSED_ATTN_BACKEND env-var override for F16 path"]
    end
```


@cyanguwa
Copy link
Copy Markdown
Collaborator Author

cyanguwa commented May 5, 2026

/te-ci L0

@cyanguwa cyanguwa requested a review from sudhakarsingh27 May 5, 2026 19:23
@cyanguwa cyanguwa added the 2.16.0 label May 5, 2026
@cyanguwa cyanguwa changed the title [All] Remove max512 backend [All] Remove legacy max512 backend May 5, 2026
