Batch stat calls in find -exec by replacing \; with +#340
Open
jeanschmidt wants to merge 2 commits intoactions:mainfrom
Open
Batch stat calls in find -exec by replacing \; with +#340jeanschmidt wants to merge 2 commits intoactions:mainfrom
stat calls in find -exec by replacing \; with +#340jeanschmidt wants to merge 2 commits intoactions:mainfrom
Conversation
- Replace `\;` with `+` in find -exec for stat
Using `{} +` batches filenames into fewer stat
invocations instead of spawning one process per
file, reducing fork/exec overhead in large dirs.
Signed-off-by: Jean Schmidt <[email protected]>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves performance of workspace copy verification by batching stat invocations produced by listDirAllCommand() (used when hashing directory contents on both runner and pod sides).
Changes:
- Replace
find -exec ... {} \;withfind -exec ... {} +to batch multiple files perstatprocess.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Add `--` before `{}` in stat command to prevent filenames starting
with `-` from being interpreted as options
- Add tests for listDirAllCommand covering batched exec, end-of-options
marker, directory quoting, path exclusion, and file type filtering
Notes:
Without the `--` end-of-options marker, files whose names begin with a
dash (e.g. `-rf`) could be misinterpreted as flags by `stat`, causing
silent failures or incorrect output during workspace file listing.
Signed-off-by: Jean Schmidt <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Batch
statcalls infind -execby replacing\;with+inlistDirAllCommand(), reducing process spawns from one-per-file to one-per-batch.Problem
The
listDirAllCommand()function inpackages/k8s/src/k8s/utils.tsgenerates a shell command used to list every file (with its size) under a directory. It is invoked during workspace copy verification in bothexecCpToPodandexecCpFromPod— each of which runs the command up to 15 times in a retry loop on both the runner side (localspawn) and the job pod side (K8sexec). That's potentially 60 invocations of this command per job.The original command:
The
\;terminator tellsfindto spawn a separatestatprocess for every single file it discovers. For a workspace with 10,000 files, that means 10,000fork+execcycles — each creating a new process, loading thestatbinary, running it on one file, and exiting. The overhead is almost entirely in process creation, not in the actualstatsyscall.This is especially painful inside Kubernetes job pods, where:
Writablestream buffer (execCalculateOutputHashSorted), and the drip-feed of one-line-at-a-time from individualstatcalls increases GC pressure compared to receiving larger chunks.Solution
The
+terminator tellsfindto batch as many filenames as possible into eachstatinvocation, up to the OS argument-length limit (ARG_MAX, typically 2MB on Linux). For a 10,000-file workspace, this typically results in 1–3statprocesses instead of 10,000.This is a POSIX-standard feature (
find -exec {} +has been in POSIX since 2004 / IEEE Std 1003.1-2004) and is supported by everyfindimplementation used in GitHub Actions runner images (GNU findutils, BusyBox find, macOS find).Behavioral equivalence
The output is identical: one
%s %n(size + filename) line per file. The only difference is how many filenames are passed perstatinvocation. Since the downstream consumer (execCalculateOutputHashSorted/localCalculateOutputHashSorted) splits on newlines, sorts, and hashes — the batching is invisible to the hash comparison logic.What this does NOT change
stat -c '%s %n'format string)findfilter (unchanged — same-type f -not -pathexclusion)Performance impact
\;(before)+(after)The savings multiply across the retry loop (up to 15 iterations × 2 sides × 2 copy directions = 60 invocations per job in the worst case).
Files changed
packages/k8s/src/k8s/utils.ts— one-character change inlistDirAllCommand():\\;→+Test plan
npm run buildsucceeds (tsc + ncc)npx jest -- utils)