ci: disable fail-fast in the test matrix #83
A flaky native SIGILL during import (mismatched runner CPU vs. a prebuilt jaxlib/scipy/onnxruntime wheel) intermittently crashes one Python-version job. With `fail-fast: true` that single crash cancels the other in-progress matrix jobs, hiding whether they passed and forcing a full-matrix re-run. Setting `fail-fast: false` lets each version finish independently, so a flake fails only its own job and can be re-run on its own (`gh run rerun --failed`).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Pull request overview
This PR adjusts the GitHub Actions test matrix behavior to avoid losing signal from other Python versions when one matrix entry fails due to an intermittent native SIGILL crash on some runners.
Changes:
- Set the `run_tests` job matrix `fail-fast` to `false` so all Python versions complete independently.
- Added in-file documentation explaining the rationale (flaky SIGILL causing cancellations) and the intended recovery workflow (`gh run rerun --failed`).
Codecov Report
✅ All modified and coverable lines are covered by tests.
Problem
`run_tests.yml` runs a 3-job matrix (Python 3.11/3.12/3.13) with `fail-fast: true`. The test job intermittently hits a flaky native SIGILL during library import: a prebuilt `jaxlib`/`scipy`/`onnxruntime` wheel uses a CPU instruction that some GitHub runner CPUs do not support. It strikes a fraction of runs, depending on which physical runner the job lands on, and crashes before pytest prints anything (exit code 132, zero output).

With `fail-fast: true`, that single flaky crash on one Python version cancels the other two in-progress jobs, hiding whether they passed and forcing a full-matrix re-run.

This was reproduced as a control experiment: re-running a known-green commit with no code change produced the same `(3.13)=failure (3.11)=cancelled (3.12)=cancelled` pattern.

Fix
Set `fail-fast: false`. Each Python version runs to completion independently, so a flaky SIGILL fails only its own job: 3.11/3.12 still report their true status, and you can re-run just the failed job (`gh run rerun --failed`, which lands on a fresh runner) instead of the whole matrix.

This isolates the flake and makes recovery cheap; it does not eliminate the SIGILL itself. The root cure would be pinning or constraining the offending native wheel, a separate, larger change that touches the lockfile.
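For readers who want to see where the setting lives, here is a minimal sketch of what the relevant part of `run_tests.yml` might look like. Only the job name `run_tests`, the three Python versions, and `fail-fast: false` come from this PR. The triggers, the runner label, the `python-version` matrix key, and the steps are assumptions for illustration, not the repository's actual workflow.

```yaml
# Illustrative sketch only: apart from the job name, the Python versions and
# the fail-fast setting, this layout is assumed rather than taken from the repo.
name: run_tests
on: [push, pull_request]      # assumption: triggers are not stated in the PR

jobs:
  run_tests:
    runs-on: ubuntu-latest    # assumption: runner label is not stated in the PR
    strategy:
      # A flaky native SIGILL on one Python version must not cancel the others.
      # Re-run only the failed job with: gh run rerun --failed
      fail-fast: false
      matrix:
        # Quoted so YAML does not parse the versions as floats.
        python-version: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      # Assumption: the real install and test commands may differ.
      - run: python -m pytest
```

Because `fail-fast` sits under `strategy` next to `matrix`, it applies to every job the matrix expands to. Each of the three version jobs then reports its own pass or fail status regardless of how the others end.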
🤖 Generated with Claude Code