[AMD] improve dsr1 fp4 disagg perf on mi355x - rapid follow up PR incoming to quant correction on fp8 combine #1236

Merged
Oseltamivir merged 75 commits into main from amd/mi355x-dsfp4-april14
May 7, 2026

Conversation

@billishyahao (Collaborator) commented Apr 30, 2026

initial merge - rapid follow-up PR incoming to add quant correction on fp8 combine

# NOTE: currently, with fp8_combine set, the evals do not pass on the InferenceX eval harness
# or on the SGLang native harness at high concurrency (4k), and get nowhere near the golden score of
# 0.95 even on basic GSM8K grade-school math, as confirmed by @billishyahao from AMD
# and by @Oseltamivir. This was initially merged with @billishyahao promising
# a fast-follow PR to fix the evals by adding quant correction to the fp8 combine.

replacement of #983

The new patch adds the following optimizations:

- "Bump SGL mori image to lmsysorg/sglang-rocm"
- "Add more high tput / low latency sweep configs"
- "Enable v2 mxfp4 DSR1 0528 model"
- "Enable fp4 disp / fp8 combine feature on mori"
- "Enable Mori SDMA + two batch overlapping feature"
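The mori-side features in the list above are toggled through environment variables that appear later in this thread (`SGLANG_MORI_FP8_COMB`, `MORI_MOE_MAX_INPUT_TOKENS_PREFILL`, `MORI_MOE_MAX_INPUT_TOKENS_DECODE`). A hedged sketch of how a benchmark run might enable them - the variable names come from the CI snippet quoted below, but the values and the `=1` convention are assumptions, not taken from this PR:

```shell
# Illustrative only: variable names are from this thread's CI snippet,
# the values shown are assumed for the sake of the example.
export SGLANG_MORI_FP8_COMB=1                    # fp4 dispatch / fp8 combine path
export MORI_MOE_MAX_INPUT_TOKENS_PREFILL=16384   # assumed prefill token cap
export MORI_MOE_MAX_INPUT_TOKENS_DECODE=4096     # assumed decode token cap
```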

billishyahao and others added 30 commits March 16, 2026 08:36
…transformers v5

Transformers v5 incorrectly rebuilds pre_tokenizer/decoder components for
models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama
tokenizer architecture. The sglang server fixes this at startup, but the
benchmark client loads the tokenizer without these fixes, causing a ~5x
token count inflation (e.g. 7000 tokens -> 35000 tokens) and false
performance regressions in TTFT and throughput benchmarks.

Apply the same tokenizer fixes (pre_tokenizer/decoder restoration and
add_bos_token recovery) that sglang server applies, so client and server
tokenize identically. No-op on transformers v4.

Made-with: Cursor
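The client-side repair described in this commit message could look roughly like the sketch below. The function name and the reference-component parameters are hypothetical; the real fix lives in sglang's server startup path, and the point here is only to show the shape of mirroring it in the benchmark client so both sides tokenize identically.

```python
# Hedged sketch: mirror the server-side tokenizer repair on the client under
# transformers v5. `repair_tokenizer` and the reference_* parameters are
# hypothetical names, not sglang's actual API.

def repair_tokenizer(tokenizer, reference_pre_tokenizer, reference_decoder,
                     add_bos_token=True):
    """Restore components transformers v5 rebuilds incorrectly for
    LlamaTokenizerFast-wrapped non-Llama tokenizers (e.g. DeepSeek-R1)."""
    backend = getattr(tokenizer, "backend_tokenizer", None)
    if backend is not None:
        # Put back the known-good pre_tokenizer/decoder pair.
        backend.pre_tokenizer = reference_pre_tokenizer
        backend.decoder = reference_decoder
    # Recover the add_bos_token flag that v5 drops; a no-op on v4, which
    # never rebuilt these components in the first place.
    tokenizer.add_bos_token = add_bos_token
    return tokenizer
```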
github-actions Bot commented May 3, 2026


DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
Contributor:

@billishyahao I don't understand why we are unsetting fp8 combine for evals only, while still using it for the performance benchmarks.

It seems like the only eval-specific change we should make is the context length, to fit the shots, not unsetting fp8 combine.

Can you work with @Oseltamivir to figure it out? Happy to dedicate time on our end to work with you on it.

Collaborator Author:

FP8 combine looks fine for
python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000

may need more debugging from @Oseltamivir

BTW, since this PR is based on the old March PR plus a switch of the upstream sglang PR, and eval is a new feature that needs more time to address: can we merge this first and use a follow-up PR to address the eval issue?

Collaborator Author:

[screenshot attachment]

functionstackx (Contributor) commented May 5, 2026:

FP8 combine looks fine for
python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000

@billishyahao even in your local bench, 91.8% on GSM8K is quite low and does not look fine for deepseekv3 R1; we are seeing 95-96% for deepseekv3 R1 on grade-school math.

Contributor:

We can potentially merge this and have the fix be a follow-up PR, but I would like a couple of days of work between you @billishyahao & @Oseltamivir before we merge this.

Your local sglang bench (not using the inferencex harness) is quite low at 91%.

Collaborator Author:

The current accuracy drop with fp8 combine is expected, as we have not introduced a quant factor to retain the precision. But the huge drop from 0.915 to 0.485 is yet another issue, coming from the harness.
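For readers unfamiliar with what "introducing a quant factor" means here, the sketch below is a toy numpy illustration of per-tensor scaled quantization, with integer rounding standing in for real FP8 E4M3 rounding. It shows the general technique being discussed (carry a scale alongside the low-precision values and apply it back after the combine), not AMD's actual fix.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in OCP FP8 E4M3

def quantize_with_scale(x):
    # Map the tensor's dynamic range onto the representable range before
    # rounding; the scale is carried alongside and applied back afterwards.
    scale = np.abs(x).max() / FP8_E4M3_MAX
    q = np.round(x / scale)  # toy integer-grid stand-in for FP8 rounding
    return q, scale

def dequantize(q, scale):
    return q * scale

def quantize_naive(x):
    # No scale factor: anything beyond the FP8 range is simply clipped.
    return np.clip(np.round(x), -FP8_E4M3_MAX, FP8_E4M3_MAX)

x = np.linspace(-1000.0, 1000.0, 101)
q, s = quantize_with_scale(x)
scaled_err = np.abs(dequantize(q, s) - x).max()
naive_err = np.abs(quantize_naive(x) - x).max()
# With the scale factor the worst-case error stays bounded by half a
# quantization step; without it, out-of-range values are clipped outright.
```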

functionstackx (Contributor) commented May 5, 2026:

@billishyahao dropping from 0.95 to 0.915 on GSM8K is already a huge issue. As you're aware, dropping 3 points on grade-school math 8k has huge effects on downstream evals like SWE-bench, and huge downstream effects when the harness template prompt changes slightly.

as suggested above, i think it would be good for @Oseltamivir and yall to debug for the next few days:
#1 first debug the 0.95 to 0.915 drop on the sgl harness template prompt
#2 then debug the 0.915 to 0.485 drop on the secondary prompt template, and whether the bug is from the secondary prompt template or a downstream bug in amd's code whose root cause is the missing quant factor to retain precision with fp8 combine

Contributor:

@billishyahao can you take a look at this? @Oseltamivir is currently unable to reproduce the issue you described and got 90+ without unsetting fp8 combine.

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25395388503

Collaborator Author:

Hi @functionstackx @Oseltamivir, this eplb only takes effect when EP is enabled. Job 25395388503 only evaluates the TP-only case.

github-actions Bot commented May 5, 2026

@claude claude Bot mentioned this pull request May 5, 2026
@Oseltamivir Oseltamivir (Collaborator) left a comment:

lgtm

github-actions Bot commented May 7, 2026

@Oseltamivir Oseltamivir merged commit ec02dba into main May 7, 2026
11 of 35 checks passed
@Oseltamivir Oseltamivir deleted the amd/mi355x-dsfp4-april14 branch May 7, 2026 02:03

@functionstackx functionstackx changed the title [AMD] improve dsr1 fp4 disagg perf on mi355x [AMD] improve dsr1 fp4 disagg perf on mi355x - rapid follow up PR incoming to quant correction on fp8 combine May 7, 2026