[AMD] improve dsr1 fp4 disagg perf on mi355x - rapid follow-up PR incoming for quant correction on fp8 combine #1236
Conversation
Transformers v5 incorrectly rebuilds the pre_tokenizer/decoder components for models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama tokenizer architecture. The sglang server fixes this at startup, but the benchmark client loads the tokenizer without these fixes, causing a ~5x token-count inflation (e.g. 7000 tokens -> 35000 tokens) and false performance regressions in TTFT and throughput benchmarks. Apply the same tokenizer fixes (pre_tokenizer/decoder restoration and add_bos_token recovery) that the sglang server applies, so client and server tokenize identically. No-op on transformers v4. Made-with: Cursor
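For reference, a minimal sketch of what such a client-side fix could look like; the helper name and the exact restoration steps are assumptions, not the actual sglang or benchmark code:

```python
import json
import os

import transformers
from tokenizers import Tokenizer
from transformers import AutoTokenizer


def load_tokenizer_like_server(model_path: str):
    """Mirror the server-side tokenizer fixes on the benchmark client (sketch)."""
    tok = AutoTokenizer.from_pretrained(model_path)
    if transformers.__version__.startswith("4."):
        return tok  # transformers v4 builds the components correctly; no-op

    # Restore the pre_tokenizer/decoder from the original tokenizer.json,
    # which transformers v5 may have rebuilt incorrectly for models that
    # use LlamaTokenizerFast with a non-Llama tokenizer architecture.
    raw = Tokenizer.from_file(os.path.join(model_path, "tokenizer.json"))
    tok.backend_tokenizer.pre_tokenizer = raw.pre_tokenizer
    tok.backend_tokenizer.decoder = raw.decoder

    # Recover add_bos_token from tokenizer_config.json if v5 dropped it.
    cfg_path = os.path.join(model_path, "tokenizer_config.json")
    if os.path.exists(cfg_path):
        with open(cfg_path) as f:
            cfg = json.load(f)
        if "add_bos_token" in cfg:
            tok.add_bos_token = cfg["add_bos_token"]
    return tok
```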
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25269775978
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25273191587
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25282687262
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25284166545
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25284187965
DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
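For context, a commented reading of the eval-only overrides above; the semantics are inferred from the variable names, not confirmed by the PR:

```sh
# Drop the fake EP dispatch algorithm from the decode server flags (inferred).
DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
# Remove the MORI MoE per-stage input-token caps (inferred).
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
# Disable the FP8 combine path for evals (inferred).
unset SGLANG_MORI_FP8_COMB
```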
@billishyahao I don't understand why we are unsetting fp8 combine for evals only while still using it for the performance benchmarks.
It seems like the only thing we should change specifically for evals is the context length to fit the shots, not unsetting fp8 combine.
Can you work with @Oseltamivir to figure it out? Happy to dedicate time on our end to work with you on it.
FP8 combine looks fine for
python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000
It may need more debugging from @Oseltamivir.
BTW, since this PR is based on an old March PR plus a switch to the upstream sglang PR, and eval is a new feature that needs more time to address, can we merge this first and address the eval issue in a follow-up PR?
> FP8 combine looks fine for
> python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000

@billishyahao even in your local bench, 91.8% on GSM8K is quite low and does not look fine for DeepSeek V3/R1; we are seeing 95-96% for DeepSeek V3/R1 on grade-school math.
We can potentially merge this and make fixing it a follow-up PR, but I would like a couple of days of work between you @billishyahao & @Oseltamivir before we merge this.
Your local sglang bench (not using the inferencex harness) is quite low at 91%.
The current accuracy drop with fp8 combine is expected, as we have not introduced a quant factor to retain the precision. But the huge drop from 0.915 to 0.485 is yet another issue, coming from the harness.
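For intuition, a minimal sketch of what such a quant correction factor could look like; this is an illustration in plain PyTorch, not the PR's kernel or AMD's planned fix:

```python
import torch


def fp8_combine_with_scale(partials: torch.Tensor) -> torch.Tensor:
    """Sum per-expert partial outputs after a round-trip through FP8 (sketch)."""
    # partials: [num_experts, num_tokens, hidden] in bf16/fp32 before combine.
    amax = partials.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = 448.0 / amax  # 448 is the max normal value of float8_e4m3fn
    q = (partials * scale).to(torch.float8_e4m3fn)  # quantize for transport
    # Dequantize with the saved per-token scale before accumulating; skipping
    # this correction is what erodes accuracy in a naive FP8 combine.
    return (q.to(partials.dtype) / scale).sum(dim=0)
```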
@billishyahao dropping from 0.95 to 0.915 on GSM8K is already a huge issue. As you're aware, dropping 3 points on grade-school math 8k has huge effects on downstream evals like SWE-bench, and similarly huge downstream effects when the harness template prompt changes slightly.
As suggested above, I think it would be good for @Oseltamivir and y'all to debug over the next few days:
#1 first debug the 0.95 to 0.915 drop on the sgl harness template prompt
#2 then debug the 0.915 to 0.485 drop on the secondary prompt template, and determine whether the bug is in the secondary prompt template or is a downstream bug in AMD's code whose root cause is the missing quant factor to retain precision with fp8 combine
@billishyahao can you take a look at this? @Oseltamivir is currently unable to reproduce the issue you described and got 90+ without unsetting fp8 combine.
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25395388503
Hi @functionstackx @Oseltamivir, this EPLB only takes effect when EP is enabled. Job 25395388503 is only evaluating the TP-only case.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25355064300
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25355166460
Latest full sweep result:
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25471826142
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25471841042

Initial merge - rapid follow-up PR incoming for quant correction on fp8 combine
Replacement of #983.
The new patch adds the following optimization: