Skip to content

Preflight Qwen3.6 deploy failures#550

Open
wang-tong0 wants to merge 1 commit into
mainfrom
fix-deploy-failure-log-classification
Open

Preflight Qwen3.6 deploy failures#550
wang-tong0 wants to merge 1 commit into
mainfrom
fix-deploy-failure-log-classification

Conversation

@wang-tong0

@wang-tong0 wang-tong0 commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add a scheduler preflight check for Qwen3.6 (qwen3_5_moe) deployments.
  • Verify preprocessor_config.json exists in the HF repo before starting inference.
  • Terminate candidates only when HF clearly reports the required file is missing.

Why

Qwen3.6 models fail deterministically in SGLang when the repo does not include the processor metadata required by Transformers. Checking this before deployment avoids wasting GPU time and prevents repeatedly retrying a model that cannot start.

HF/network probe errors remain retryable: inconclusive checks return no classification and the scheduler continues through the normal deploy/retry path.

Validation

  • pytest tests/test_scheduler_deploy_failures.py tests/test_scheduler_flow.py tests/test_scheduler_ssh.py -q

@wang-tong0 wang-tong0 marked this pull request as ready for review June 19, 2026 05:14
@wang-tong0 wang-tong0 force-pushed the fix-deploy-failure-log-classification branch from 760996b to b0852cb Compare June 19, 2026 06:08
@wang-tong0 wang-tong0 changed the title Classify deterministic deploy failures from logs Preflight Qwen3.6 deploy failures Jun 19, 2026
@wang-tong0 wang-tong0 force-pushed the fix-deploy-failure-log-classification branch from b0852cb to d9e6a72 Compare June 19, 2026 06:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant