[AMD] improve dsr1 fp4 disagg perf on mi355x - rapid follow-up PR incoming for quant correction on fp8 combine #1236
Conversation
Transformers v5 incorrectly rebuilds the pre_tokenizer/decoder components for models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama tokenizer architecture. The sglang server fixes this at startup, but the benchmark client loads the tokenizer without these fixes, causing a ~5x token-count inflation (e.g. 7000 tokens -> 35000 tokens) and false performance regressions in TTFT and throughput benchmarks. Apply the same tokenizer fixes (pre_tokenizer/decoder restoration and add_bos_token recovery) that the sglang server applies, so client and server tokenize identically. No-op on transformers v4. Made-with: Cursor
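For reference, a minimal sketch of what such a client-side fix could look like; the helper name and the exact restoration steps are assumptions, not the actual sglang or benchmark code:

```python
import json
import os

import transformers
from tokenizers import Tokenizer
from transformers import AutoTokenizer


def load_tokenizer_like_server(model_path: str):
    """Mirror the server-side tokenizer fixes on the benchmark client (sketch)."""
    tok = AutoTokenizer.from_pretrained(model_path)
    if transformers.__version__.startswith("4."):
        return tok  # transformers v4 builds the components correctly; no-op

    # Restore the pre_tokenizer/decoder from the original tokenizer.json,
    # which transformers v5 may have rebuilt incorrectly for models that
    # use LlamaTokenizerFast with a non-Llama tokenizer architecture.
    raw = Tokenizer.from_file(os.path.join(model_path, "tokenizer.json"))
    tok.backend_tokenizer.pre_tokenizer = raw.pre_tokenizer
    tok.backend_tokenizer.decoder = raw.decoder

    # Recover add_bos_token from tokenizer_config.json if v5 dropped it.
    cfg_path = os.path.join(model_path, "tokenizer_config.json")
    if os.path.exists(cfg_path):
        with open(cfg_path) as f:
            cfg = json.load(f)
        if "add_bos_token" in cfg:
            tok.add_bos_token = cfg["add_bos_token"]
    return tok
```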
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25269775978
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25273191587
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25282687262
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25284166545
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25284187965
DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
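For context, a commented reading of the eval-only overrides above; the semantics are inferred from the variable names, not confirmed by the PR:

```sh
# Drop the fake EP dispatch algorithm from the decode server flags (inferred).
DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
# Remove the MORI MoE per-stage input-token caps (inferred).
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
# Disable the FP8 combine path for evals (inferred).
unset SGLANG_MORI_FP8_COMB
```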
@billishyahao I don't understand why we are unsetting fp8 combine for evals only while still using it for the performance benchmarks.
It seems like the only thing we should change specifically for evals is the context length to fit the shots, not unsetting fp8 combine.
Can you work with @Oseltamivir to figure it out? Happy to dedicate time on our end to work with you on it.
FP8 combine looks fine for
python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000
It may need more debugging from @Oseltamivir.
BTW, since this PR is based on an old March PR plus a switch to the upstream sglang PR, and eval is a new feature that needs more time to address, can we merge this first and address the eval issue in a follow-up PR?
> FP8 combine looks fine for
> python benchmark/gsm8k/bench_sglang.py --num-questions 1300 --port 30000

@billishyahao even in your local bench, 91.8% on GSM8K is quite low and does not look fine for DeepSeek V3/R1; we are seeing 95-96% for DeepSeek V3/R1 on grade-school math.
We can potentially merge this and make fixing it a follow-up PR, but I would like a couple of days of work between you @billishyahao & @Oseltamivir before we merge this.
Your local sglang bench (not using the inferencex harness) is quite low at 91%.
The current accuracy drop with fp8 combine is expected, as we have not introduced a quant factor to retain the precision. But the huge drop from 0.915 to 0.485 is yet another issue, coming from the harness.
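For intuition, a minimal sketch of what such a quant correction factor could look like; this is an illustration in plain PyTorch, not the PR's kernel or AMD's planned fix:

```python
import torch


def fp8_combine_with_scale(partials: torch.Tensor) -> torch.Tensor:
    """Sum per-expert partial outputs after a round-trip through FP8 (sketch)."""
    # partials: [num_experts, num_tokens, hidden] in bf16/fp32 before combine.
    amax = partials.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = 448.0 / amax  # 448 is the max normal value of float8_e4m3fn
    q = (partials * scale).to(torch.float8_e4m3fn)  # quantize for transport
    # Dequantize with the saved per-token scale before accumulating; skipping
    # this correction is what erodes accuracy in a naive FP8 combine.
    return (q.to(partials.dtype) / scale).sum(dim=0)
```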
@billishyahao dropping from 0.95 to 0.915 on GSM8K is already a huge issue. As you're aware, dropping 3 points on grade-school math 8k has huge effects on downstream evals like SWE-bench, and similarly huge downstream effects when the harness template prompt changes slightly.
As suggested above, I think it would be good for @Oseltamivir and y'all to debug over the next few days:
#1 first debug the 0.95 to 0.915 drop on the sgl harness template prompt
#2 then debug the 0.915 to 0.485 drop on the secondary prompt template, and determine whether the bug is in the secondary prompt template or is a downstream bug in AMD's code whose root cause is the missing quant factor to retain precision with fp8 combine
@billishyahao can you take a look at this? @Oseltamivir is currently unable to reproduce the issue you described and got 90+ without unsetting fp8 combine.
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25395388503
Hi @functionstackx @Oseltamivir, this EPLB only takes effect when EP is enabled. Job 25395388503 is only evaluating the TP-only case.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25355064300
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25355166460
Latest full sweep result:
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25471826142
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25471841042

Initial merge - rapid follow-up PR incoming for quant correction on fp8 combine
Replacement of #983.
The new patch adds the following optimization: