[SYSTEMDS-2651] Poll for async compression in federated component tests #2472

Merged
Baunsgaard merged 1 commit into apache:main from Baunsgaard:fix-fed-component-flaky
May 26, 2026

Conversation

@Baunsgaard
Contributor

FedWorkerReadMatrixCompress.verifyRead failed roughly once per ten
component-test CI runs because it called FederatedTestUtils.wait(1000)
to give the worker time to finish its async compression (kicked off by
CompressedMatrixBlockFactory.compressAsync), then asserted that the
returned block was a CompressedMatrixBlock. On a contended runner the
1 s sleep was not enough: the subsequent read returned the
still-uncompressed block, and the assertion failed. Surefire's
rerunFailingTestsCount=2 masked this as a "Flake" rather than a job
failure.

Add FedWorkerBase.awaitCompressed(long id), which polls getMatrixBlock
at 25 ms intervals for up to COMPRESS_TIMEOUT_MS (10 s) and returns as
soon as the worker reports the compressed form, or returns the last-
observed block on timeout so the caller's assertion still produces a
meaningful failure.
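
The polling behaviour described above can be sketched roughly as follows. This is a minimal, generic illustration and not the code from this PR: the class `AwaitSketch` and the method `pollUntil` are hypothetical names, and the `Supplier`/`Predicate` parameters stand in for the `getMatrixBlock` read and the "is a CompressedMatrixBlock" check. Only the 25 ms interval and the 10 s timeout are taken from the description.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical stand-in for the polling helper; not the SystemDS implementation. */
final class AwaitSketch {
    // Values quoted in the description: 25 ms poll interval, 10 s timeout.
    static final long POLL_INTERVAL_MS = 25;
    static final long COMPRESS_TIMEOUT_MS = 10_000;

    /**
     * Re-reads a value until the predicate accepts it or the timeout elapses.
     * Returns the last observed value in both cases, so the caller's own
     * assertion decides whether the test passes or fails.
     */
    static <T> T pollUntil(Supplier<T> read, Predicate<T> done, long timeoutMs, long intervalMs) {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        T last = read.get();
        while (!done.test(last) && System.nanoTime() - deadline < 0) {
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting early.
                Thread.currentThread().interrupt();
                return last;
            }
            last = read.get();
        }
        return last;
    }
}
```

Compared with a fixed sleep, a loop of this shape returns as soon as the condition holds on a fast runner and only spends the full timeout when the condition never becomes true.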

Convert the three call sites that used the fixed-sleep anti-pattern:
- FedWorkerReadMatrixCompress.verifyRead (the actual CI flake)
- FedWorkerMatrixCompress.verifySameOrAlsoCompressedAsLocalCompress
  (polls only when local compresses, so the "do not compress"
  parametrization stays fast)
- FedWorkerMatrixMultiplyWorkload.verifySameOrAlsoCompressedAsLocalCompress

Remove the now-unused FederatedTestUtils.wait helper so the
anti-pattern is harder to reintroduce.

codecov Bot commented May 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.37%. Comparing base (34a19f0) to head (ca4e71f).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2472      +/-   ##
============================================
- Coverage     71.38%   71.37%   -0.01%     
  Complexity    48756    48756              
============================================
  Files          1571     1571              
  Lines        188912   188912              
  Branches      37067    37067              
============================================
- Hits         134858   134841      -17     
- Misses        43603    43612       +9     
- Partials      10451    10459       +8     

@Baunsgaard Baunsgaard merged commit 88c26e2 into apache:main May 26, 2026
46 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue May 26, 2026