ci: disable fail-fast in the test matrix #83

Merged
AlexanderFengler merged 1 commit into main from fix-ci-fail-fast on Jun 20, 2026

AlexanderFengler (Member) commented Jun 20, 2026

Problem

run_tests.yml runs a 3-job matrix (Python 3.11/3.12/3.13) with fail-fast: true. The test job intermittently dies with a native SIGILL during library import: a prebuilt jaxlib/scipy/onnxruntime wheel executes a CPU instruction that some GitHub runner CPUs do not support. Whether a given run is hit depends on which physical runner the job lands on, and the process crashes before pytest prints anything (exit code 132, i.e. 128 + signal 4, with zero output).

With fail-fast: true, that single flaky crash on one Python version cancels the other two in-progress jobs, so:

  • the whole run goes red even though 3.11/3.12 had no failure of their own,
  • you lose the real pass/fail status of the other versions,
  • the only recourse is a full-matrix re-run (3× cost).

A control experiment reproduced this: re-running a known-green commit with no code change produced the same pattern, 3.13 = failure, 3.11 = cancelled, 3.12 = cancelled.
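
Because the crash happens at import time and leaves no output, it can help to surface the runner's CPU and the import failure in their own steps. The following is a minimal sketch, not part of this PR: the step names are hypothetical, it assumes an x86 Linux runner (for /proc/cpuinfo), and it assumes jax, scipy, and onnxruntime are the import names installed in the test environment.

```yaml
# Hypothetical diagnostic steps, placed before the pytest step of the test job.
steps:
  - name: Record runner CPU  # assumed step name
    run: |
      # Log the CPU model and feature flags so a crash can be matched to hardware.
      grep -m1 'model name' /proc/cpuinfo
      grep -m1 '^flags' /proc/cpuinfo
  - name: Import native libraries before pytest  # assumed step name
    # A SIGILL here ends the step with exit code 132 (128 + 4),
    # which separates the import crash from real test failures.
    run: python -c "import jax, scipy, onnxruntime"
```

With steps like these, a failed job shows which step crashed and on which CPU model, which may help narrow down the offending wheel.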

Fix

Set fail-fast: false. Each Python version runs to completion independently, so a flaky SIGILL fails only its own job — 3.11/3.12 still report their true status, and you can re-run just the failed job (gh run rerun --failed, which lands on a fresh runner) instead of the whole matrix.

This isolates the flake and makes recovery cheap; it does not eliminate the SIGILL itself. The root-cause fix would be to pin or constrain the offending native wheel, a separate, larger change that touches the lockfile.
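
As a sketch of the change described above, the relevant part of .github/workflows/run_tests.yml might look like the fragment below. Only the fail-fast value, the job name run_tests, and the three Python versions come from this PR; the runner label, action versions, and test command are assumptions, and the actual file may differ.

```yaml
# Fragment of a workflow file; the `on:` trigger and other jobs are omitted.
jobs:
  run_tests:
    runs-on: ubuntu-latest  # assumed runner label
    strategy:
      # Let every Python version run to completion, so a flaky SIGILL
      # in one job does not cancel the others.
      fail-fast: false
      matrix:
        # Quoted so YAML does not parse the versions as floats.
        python-version: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4  # assumed action version
      - uses: actions/setup-python@v5  # assumed action version
        with:
          python-version: ${{ matrix.python-version }}
      - run: pytest  # assumed test command
```

When one matrix job then fails, gh run rerun --failed re-runs only that job and leaves the results of the other versions in place.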

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Enhanced test suite resilience by allowing Python version tests to run independently, preventing cascading failures when individual test runs encounter intermittent issues.

A flaky native SIGILL during import (mismatched runner CPU vs. a prebuilt
jaxlib/scipy/onnxruntime wheel) intermittently crashes one Python-version job.
With fail-fast: true that single crash cancels the other in-progress matrix
jobs, hiding whether they passed and forcing a full-matrix re-run. Setting
fail-fast: false lets each version finish independently, so a flake fails only
its own job and can be re-run on its own (`gh run rerun --failed`).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Copilot AI review requested due to automatic review settings June 20, 2026 15:48
coderabbitai Bot commented Jun 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 99889071-cbae-4fd4-b7fc-79b36d7e9989

📥 Commits

Reviewing files that changed from the base of the PR and between f59f0a8 and c99be5c.

📒 Files selected for processing (1)
  • .github/workflows/run_tests.yml

📝 Walkthrough

The GitHub Actions test workflow changes the matrix strategy's fail-fast setting from true to false and adds explanatory comments. Test runs across all Python versions now continue independently even when one version fails.

Changes

CI Matrix Strategy Update

| Layer / File(s) | Summary |
|---|---|
| Disable fail-fast in test matrix (.github/workflows/run_tests.yml) | strategy.fail-fast is set to false and comments are added to explain that Python version runs should proceed independently to handle occasional flaky native SIGILL failures during import. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

🐇 When one test stumbles and falls with a thud,
The others keep hopping right out of the mud.
No fast-fail to stop them, they run on their own,
Each Python version reaps what it has sown.
A flag flipped to false — small change, big cheer! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'ci: disable fail-fast in the test matrix' directly and concisely describes the main change: disabling the fail-fast setting in the GitHub Actions test matrix. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Copilot AI (Contributor) left a comment

Pull request overview

This PR adjusts the GitHub Actions test matrix behavior to avoid losing signal from other Python versions when one matrix entry fails due to an intermittent native SIGILL crash on some runners.

Changes:

  • Set the run_tests job matrix fail-fast to false so all Python versions complete independently.
  • Added in-file documentation explaining the rationale (flaky SIGILL causing cancellations) and the intended recovery workflow (gh run rerun --failed).

codecov Bot commented Jun 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@AlexanderFengler AlexanderFengler merged commit ab24ae6 into main Jun 20, 2026
9 checks passed

2 participants