[SYSTEMDS-2651] Poll for async compression in federated component tests#2472
Merged
Merged
Conversation
FedWorkerReadMatrixCompress.verifyRead failed roughly once per ten component-test CI runs because it called FederatedTestUtils.wait(1000) to give the worker time to finish its async compression (kicked off by CompressedMatrixBlockFactory.compressAsync), then asserted that the returned block was a CompressedMatrixBlock. On a contended runner the 1 s sleep was not enough, the subsequent read returned the still- uncompressed block, and the assertion failed. Surefire's rerunFailingTestsCount=2 hid this as a "Flake" rather than a job failure. Add FedWorkerBase.awaitCompressed(long id), which polls getMatrixBlock at 25 ms intervals for up to COMPRESS_TIMEOUT_MS (10 s) and returns as soon as the worker reports the compressed form, or returns the last- observed block on timeout so the caller's assertion still produces a meaningful failure. Convert the three call sites that used the fixed-sleep anti-pattern: - FedWorkerReadMatrixCompress.verifyRead (the actual CI flake) - FedWorkerMatrixCompress.verifySameOrAlsoCompressedAsLocalCompress (polls only when local compresses, so the "do not compress" parametrization stays fast) - FedWorkerMatrixMultiplyWorkload.verifySameOrAlsoCompressedAsLocalCompress Remove the now-unused FederatedTestUtils.wait helper so the anti-pattern is harder to reintroduce.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2472 +/- ##
============================================
- Coverage 71.38% 71.37% -0.01%
Complexity 48756 48756
============================================
Files 1571 1571
Lines 188912 188912
Branches 37067 37067
============================================
- Hits 134858 134841 -17
- Misses 43603 43612 +9
- Partials 10451 10459 +8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
FedWorkerReadMatrixCompress.verifyRead failed roughly once per ten component-test CI runs because it called FederatedTestUtils.wait(1000) to give the worker time to finish its async compression (kicked off by CompressedMatrixBlockFactory.compressAsync), then asserted that the returned block was a CompressedMatrixBlock. On a contended runner the 1 s sleep was not enough, the subsequent read returned the still- uncompressed block, and the assertion failed. Surefire's rerunFailingTestsCount=2 hid this as a "Flake" rather than a job failure.
Add FedWorkerBase.awaitCompressed(long id), which polls getMatrixBlock at 25 ms intervals for up to COMPRESS_TIMEOUT_MS (10 s) and returns as soon as the worker reports the compressed form, or returns the last- observed block on timeout so the caller's assertion still produces a meaningful failure.