dind: warn when the nested daemon runs on the 'vfs' storage driver (issue #104) #105
Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #104
…(issue #104) Landing on the `vfs` storage driver was silent: a single `log` line named the driver, but nothing flagged that vfs performs NO copy-on-write. vfs stores every image layer as a full, independent copy, so a multi-GB image's on-disk footprint becomes the SUM of all cumulative layer sizes (many times the image size), and `docker pull`/`docker run` can fail with `failed to register layer: no space left on device` on a disk far larger than the image — with no breadcrumb pointing at the driver. Downstream this overflowed a disk with a >30 GB image (link-assistant/hive-mind#1914). The active driver ends up being vfs either pinned explicitly via `DIND_STORAGE_DRIVER=vfs` (legitimate for overlay-on-overlay compatibility) or reached as the last-resort auto-detect fallback. This is observability, not a default change — vfs stays the safe fallback. `start_dockerd` now calls `warn_if_vfs_storage_driver` right after the daemon becomes ready, emitting one actionable warning whenever the active driver is vfs: it explains the copy-on-write/disk implication and names the `DIND_STORAGE_DRIVER=fuse-overlayfs` remediation (copy-on-write, works overlay-on-overlay, already shipped in the image). The remediation adapts to whether `/dev/fuse` is present, pointing at `--privileged` / `--device /dev/fuse` first when it is missing. The `DIND_STORAGE_DRIVER` doc comment now spells out the vfs disk amplification too. Covered by a new unit test (experiments/test-issue104-vfs-warning.sh) and a new assertion in the CI-run tests/dind/example-storage-driver-vfs.sh; documented in docs/dind/USAGE.md. Adds a patch changeset.
🔄 Auto-restart triggered (iteration 1)
Reason: CI failures detected. Starting new session to address the issues. Auto-restart-until-mergeable mode is active. This run will stop after 5 restart iterations.
PR #105's three red checks (docker-build-test, pr-test / dind-js, pr-test / lean) were not caused by the vfs-warning feature itself: both the dind-js and lean jobs were failing on pre-existing transient flakes that the feature's new test surfaced. docker-build-test only aggregates the matrix, so it goes green once the other two do.

pr-test / dind-js: SIGPIPE false-negative in the example suite

The dind example tests asserted on container logs with `docker logs … | grep -q "NEEDLE"`. Under `set -o pipefail`, `grep -q` closes the pipe the instant it matches; the still-streaming `docker logs` upstream gets SIGPIPE (exit 141), pipefail propagates the 141, and a message that WAS present reads as absent, failing the one-shot checks spuriously (confirmed in the CI log: the expected preload line and the "complete" marker were both present, yet grep reported absent). tests/dind/lib.sh now provides a pipe-free `logs_contain` helper (capture once, match with a `case` glob; no pipe, no SIGPIPE) and every example assertion uses it. A focused regression test shows the old pattern false-negatives 30/30 under load while the new one never does.

pr-test / lean: un-retried transient npm failure

The JS image build ran `npm install -g npm@latest --no-fund --silent` with no retry; a single transient registry blip aborted the whole build, and `--silent` hid the cause. ubuntu/24.04/js/install.sh now routes its npm registry installs through a `run_with_retry` wrapper (exponential backoff, env-overridable budget mirroring apt_update_with_retry in common.sh). This is build-time resilience only: the image is unchanged on success.

Also hardened the issue-104 vfs example test to assert each warning needle via logs_contain. New unit tests: experiments/test-issue104-npm-retry.sh and experiments/test-issue104-logs-contain.sh. Changeset: bump patch.
🔄 Auto-restart triggered (iteration 2)
Reason: CI failures detected. Starting new session to address the issues. Auto-restart-until-mergeable mode is active. This run will stop after 5 restart iterations.
…edge GPG flake (issue #104) The dind-swift image build failed on `playwright install ... msedge ...` when packages.microsoft.com served an invalid GPG key body ("gpg: no valid OpenPGP data found" -> "Failed to install msedge"). msedge and chrome are fetched from third-party apt repos (packages.microsoft.com / Google), so they are subject to the same transient blips that already motivated wrapping the npm installs. Wrap both `playwright install` browser-download invocations in run_with_retry so a single blip retries instead of failing the whole image build. Playwright skips already-present browsers, so a retry only re-attempts the one that blipped. Generalize the retry wrapper's env-var prefix NPM_RETRY_* -> BUILD_RETRY_* to reflect that it now covers every network-bound build step (npm self-update, Playwright/Puppeteer CLI install, browser-binary download). Rename experiments/test-issue104-npm-retry.sh -> test-issue104-build-retry.sh and extend it to assert the browser download is wrapped and no bare `playwright install` survives; update the changeset accordingly.
✅ Ready to merge
This pull request is now ready to be merged. Monitored by hive-mind with the `--auto-restart-until-mergeable` flag.
…ge build and remove a SIGPIPE false-negative from the dind example tests (issue #104 / PR #105). The JS image build (`ubuntu/24.04/js/install.sh`, `COPY`'d into every dind/language image) occasionally died on a single transient third-party error, with no retry: the lean/language build hit a flaky npm registry response during `npm install -g npm@latest`, and the dind-swift build hit `playwright install … msedge …` getting an invalid GPG key body from packages.microsoft.com ("gpg: no valid OpenPGP data found" → "Failed to install msedge"). Every network-bound build step (the npm self-update, the Playwright/Puppeteer CLI install, and the Playwright browser-binary download) now goes through a `run_with_retry` wrapper that retries with exponential backoff (mirroring `apt_update_with_retry` in `common.sh`, with the same overridable retry budget so it stays unit-testable). `playwright install` skips already-present browsers, so a retry only re-attempts the one that blipped. This is build-time resilience only: the resulting image is unchanged on success.

Separately, the dind example suite asserted on container logs with `docker logs … | grep -q "needle"`. Under `set -o pipefail`, `grep -q` closes the pipe the instant it matches, which can deliver SIGPIPE to the still-streaming `docker logs`; pipefail then propagates that 141 and a present message reads as absent, failing the test spuriously (observed on the preload test even though the expected line was right there in the logs). `tests/dind/lib.sh` now provides a pipe-free `logs_contain` helper (capture once, match with a `case` glob) and all example assertions use it. Covered by new unit tests `experiments/test-issue104-build-retry.sh` and `experiments/test-issue104-logs-contain.sh`.

dind-box: warn when the nested daemon runs on the `vfs` storage driver (issue #104). When the inner dockerd ends up on `vfs`, either pinned explicitly via `DIND_STORAGE_DRIVER=vfs` (e.g. for overlay-on-overlay compatibility) or reached as the last-resort auto-detect fallback, large images could fail to pull/run with a cryptic `failed to register layer: no space left on device` and **no hint** that the storage driver was the cause. `vfs` performs no copy-on-write: it stores every image layer as a full, independent copy, so a multi-GB image's on-disk footprint becomes the *sum* of all cumulative layer sizes (many times the image size), and a >30 GB image can overflow a disk with far more than 30 GB free (`link-assistant/hive-mind#1914`). This is observability, not a default change: `vfs` stays the safe fallback. The entrypoint now emits a single, actionable warning right after the daemon becomes ready whenever the active driver is `vfs`, explaining the copy-on-write/disk implication and naming the `DIND_STORAGE_DRIVER=fuse-overlayfs` remediation (copy-on-write, works overlay-on-overlay, already shipped in the image). The remediation line adapts to whether `/dev/fuse` is present, so when it is missing it points at `--privileged` / `--device /dev/fuse` first. The `DIND_STORAGE_DRIVER` doc comment now spells out the `vfs` disk amplification too. Covered by a new unit test (`experiments/test-issue104-vfs-warning.sh`) and a new assertion in the CI-run `tests/dind/example-storage-driver-vfs.sh`; documented in `docs/dind/USAGE.md`.
Summary
Fixes #104.
When the nested Docker daemon ends up on the `vfs` storage driver, large
images can fail to pull/run with a cryptic
`failed to register layer: no space left on device`, with no hint that the
storage driver is the cause. `vfs` performs no copy-on-write: it stores a
full, independent copy of the entire filesystem for every layer, so a multi-GB
image's on-disk footprint becomes the sum of all cumulative layer sizes, many
times the image size. A ~30 GB image can overflow a disk with far more than
30 GB free.
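
To make the "sum of all cumulative layer sizes" arithmetic concrete, here is a minimal sketch. The function names `vfs_footprint` and `cow_footprint` and the layer sizes are illustrative assumptions, not part of this PR; each argument is the size in GB that one layer adds on top of its parent.

```sh
#!/bin/sh
# Illustrative arithmetic only; function names and sizes are hypothetical.

# vfs: each layer directory holds a full copy of everything below it plus its
# own changes, so layer i costs the cumulative size up to and including i.
vfs_footprint() (
    cumulative=0
    total=0
    for delta in "$@"; do
        cumulative=$((cumulative + delta))
        total=$((total + cumulative))
    done
    echo "$total"
)

# Copy-on-write drivers store each layer's delta once.
cow_footprint() (
    total=0
    for delta in "$@"; do
        total=$((total + delta))
    done
    echo "$total"
)

echo "copy-on-write: $(cow_footprint 2 3 5) GB"
echo "vfs:           $(vfs_footprint 2 3 5) GB"
```

For three layers adding 2, 3 and 5 GB, the copy-on-write footprint is 10 GB while the vfs footprint is 2 + 5 + 10 = 17 GB; the gap widens with every additional layer.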
`box-dind` already auto-detects a good driver (overlay2 → fuse-overlayfs → vfs,
with graceful fallback) and `vfs` is only the last resort, so this is not a
"wrong default" bug; it is observability. Today the choice is silent (only a
`log` line names the driver), so an operator hitting `no space left on device`
has no breadcrumb. This bit us downstream in `link-assistant/hive-mind#1914`
(a >30 GB image), where pinning `DIND_STORAGE_DRIVER=fuse-overlayfs` fixed it.

Change
A new `warn_if_vfs_storage_driver` in `ubuntu/24.04/dind/dind-entrypoint.sh` is
called from `start_dockerd()` right after `wait_for_dockerd_ready` succeeds, so
it fires whenever the active driver is `vfs`, whether pinned explicitly via
`DIND_STORAGE_DRIVER=vfs` (legitimate for overlay-on-overlay compatibility) or
reached as the last-resort auto-detect fallback.

The remediation line adapts to whether `/dev/fuse` is present (the device
`fuse-overlayfs` needs). When it is missing (the exact reason auto-detect fell
through to `vfs`), the warning instead explains that `fuse-overlayfs` is
unavailable and tells you to add `--privileged` or `--device /dev/fuse` first.

This is observability, not a default change: `vfs` stays the safe fallback.
Non-`vfs` drivers (`overlay2`, `fuse-overlayfs`) stay silent so normal runs are
not spammed. The `DIND_STORAGE_DRIVER` doc comment now also spells out the
`vfs` disk amplification.
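
The behaviour described in this section can be sketched roughly as follows. This is a hypothetical reconstruction, not the code in `dind-entrypoint.sh`: passing the active driver as the first argument, writing to stderr, and the exact message wording are all assumptions. Only the function name, `DIND_FUSE_DEVICE`, and `DIND_STORAGE_DRIVER` come from the PR.

```sh
#!/bin/sh
# Hypothetical sketch; the real function may obtain the driver differently
# and uses its own message text.
# $1: active storage driver (assumed to be supplied by the caller).
# DIND_FUSE_DEVICE: overridable FUSE device path, defaulting to /dev/fuse.
warn_if_vfs_storage_driver() {
    if [ "${1:-}" != "vfs" ]; then
        return 0   # overlay2, fuse-overlayfs, or empty: stay silent
    fi
    {
        echo "WARNING: the nested Docker daemon is using the 'vfs' storage driver."
        echo "WARNING: vfs does no copy-on-write; every layer is stored as a full copy,"
        echo "WARNING: so large images can fail with 'no space left on device'."
        if [ -e "${DIND_FUSE_DEVICE:-/dev/fuse}" ]; then
            echo "WARNING: consider setting DIND_STORAGE_DRIVER=fuse-overlayfs."
        else
            echo "WARNING: fuse-overlayfs is unavailable (${DIND_FUSE_DEVICE:-/dev/fuse} is missing)."
            echo "WARNING: add --privileged or --device /dev/fuse, then set DIND_STORAGE_DRIVER=fuse-overlayfs."
        fi
    } >&2
    return 0   # advisory only: never fail the caller, even under set -e
}
```

Returning `0` on every path keeps the warning purely advisory, which matches the stated goal of changing observability without changing behaviour.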
Reproduce
Tests
- Unit test (`experiments/test-issue104-vfs-warning.sh`): sources the real
  entrypoint via `DIND_ENTRYPOINT_SOURCE_ONLY=1` and drives
  `warn_if_vfs_storage_driver` directly. Asserts: `vfs` emits the warning
  (content checks for no-copy-on-write, the disk failure mode, and the
  `fuse-overlayfs` remediation); `overlay2`/`fuse-overlayfs`/empty stay silent;
  the `/dev/fuse`-present vs `/dev/fuse`-missing remediation branches (driven via
  the overridable `DIND_FUSE_DEVICE`); and that the function returns `0` so the
  `set -e` success branch is unaffected. `RESULT: 15 passed, 0 failed`.
  Regression: `experiments/preload-unit-test.sh` still `46 passed, 0 failed`.
- End-to-end (`tests/dind/example-storage-driver-vfs.sh`, run in CI): after
  asserting the inner daemon is on `vfs`, it now also asserts each line of the
  copy-on-write warning is present in the container logs, giving real
  end-to-end coverage under a `--privileged` `vfs` container.

CI reliability (flakes surfaced by this PR)
Across its pushes this PR turned several checks red (`docker-build-test`,
`pr-test / dind-js`, `pr-test / lean`, and `pr-test / dind-swift`), but not
because of the vfs-warning feature. Every failing job was tripping over a
pre-existing transient flake that the new vfs test merely exposed by running
the shared build/test paths more often; `docker-build-test` only aggregates the
matrix, so it goes green once the underlying jobs do. Per the root-cause mandate,
all are fixed here rather than re-run:
- `pr-test / dind-js`: SIGPIPE false-negative in the example suite. The dind
  example tests asserted on logs with `docker logs … | grep -q "NEEDLE"`. Under
  `set -o pipefail`, `grep -q` closes the pipe the instant it matches; the
  still-streaming `docker logs` upstream gets `SIGPIPE` (exit 141), `pipefail`
  propagates the 141, and a message that was present reads as absent,
  failing the one-shot checks spuriously. The CI log is conclusive: the expected
  preload line and the "complete" marker were both in the logs, yet the grep
  reported them absent. `tests/dind/lib.sh` now provides a pipe-free
  `logs_contain` helper (capture once, match with a `case` glob: no pipe, no
  `SIGPIPE`) and every example assertion uses it.
- `pr-test / lean` + `pr-test / dind-swift`: un-retried transient downloads in
  the JS image build. `ubuntu/24.04/js/install.sh` is `COPY`'d into every
  dind/language image, so a single transient third-party blip there aborts the
  whole image build with no retry. It bit two different jobs two different ways:
  `lean` died on a flaky npm registry response during `npm install -g npm@latest`
  (and `--silent` hid the cause), and `dind-swift` died on
  `playwright install … msedge …` when `packages.microsoft.com` served an
  invalid GPG key body (`gpg: no valid OpenPGP data found` →
  `Failed to install msedge`). msedge/chrome come from third-party apt repos
  (`packages.microsoft.com` / Google), so they are subject to the same blips as
  the npm registry. Every network-bound build step (the npm self-update, the
  Playwright/Puppeteer CLI install, and the Playwright browser-binary download)
  now routes through a `run_with_retry` wrapper (exponential backoff,
  env-overridable budget mirroring `apt_update_with_retry` in `common.sh`).
  `playwright install` skips already-present browsers, so a retry only
  re-attempts the one that blipped. Build-time resilience only: the image is
  unchanged on success.
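
A retry wrapper of the kind described for the JS image build might look like the sketch below. It is an assumption-laden illustration rather than the code in `ubuntu/24.04/js/install.sh`: the variable names `BUILD_RETRY_ATTEMPTS` and `BUILD_RETRY_DELAY`, their defaults, and the log wording are hypothetical (the PR only states that the budget is overridable through `BUILD_RETRY_*` variables).

```sh
#!/bin/sh
# Hypothetical sketch of a retry wrapper with exponential backoff.
# BUILD_RETRY_ATTEMPTS and BUILD_RETRY_DELAY are assumed names and defaults.
run_with_retry() {
    _rwr_max="${BUILD_RETRY_ATTEMPTS:-5}"   # total attempts before giving up
    _rwr_delay="${BUILD_RETRY_DELAY:-2}"    # seconds before the first retry
    _rwr_try=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$_rwr_try" -ge "$_rwr_max" ]; then
            echo "run_with_retry: giving up after $_rwr_try attempts: $*" >&2
            return 1
        fi
        echo "run_with_retry: attempt $_rwr_try failed, retrying in ${_rwr_delay}s: $*" >&2
        sleep "$_rwr_delay"
        _rwr_delay=$((_rwr_delay * 2))      # double the wait after each failure
        _rwr_try=$((_rwr_try + 1))
    done
}
```

A build step would then be invoked as, for example, `run_with_retry npm install -g npm@latest --no-fund`. Setting the delay to `0` through the environment keeps a unit test fast while still exercising the retry and give-up paths.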
New unit tests cover both fixes:

- `experiments/test-issue104-logs-contain.sh`: asserts the helper is wired in
  everywhere and demonstrates the race: the old `pipe | grep -q` false-negatives
  30/30 under load while `logs_contain` never does (0/30).
- `experiments/test-issue104-build-retry.sh`: asserts every `npm install -g`
  and every `playwright install` goes through `run_with_retry`, and that the
  wrapper succeeds-first-try / retries-then-succeeds / gives-up-after-budget
  correctly.
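
The pipe-free assertion that replaces `docker logs … | grep -q` can be sketched as below. This is a hypothetical version of the helper, not the one in `tests/dind/lib.sh`: the argument order (container, then needle) and the return code `2` for a failed `docker logs` call are assumptions.

```sh
#!/bin/sh
# Hypothetical sketch of a pipe-free log assertion.
# $1: container name or ID, $2: literal text expected in the logs.
logs_contain() {
    # Capture once: docker logs runs to completion, so nothing closes its
    # output early and it never receives SIGPIPE.
    _lc_logs="$(docker logs "$1" 2>&1)" || return 2
    # Match with a case glob; the quoted "$2" is treated as literal text.
    case "$_lc_logs" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

Because no pipeline is involved, `set -o pipefail` has nothing to propagate, and the result depends only on whether the needle is in the captured text. The trade-off is that the whole log is held in a shell variable, which is reasonable for short-lived test containers.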
Docs / release
- `docs/dind/USAGE.md`: documents the warning under the Storage Driver section,
  with a `fuse-overlayfs` (copy-on-write) example alongside the existing `vfs`
  one.
- `patch` changesets: `.changeset/issue-104-vfs-storage-driver-warning.md` (the
  feature) and `.changeset/issue-104-ci-reliability.md` (the CI flake fixes).
- Removes the `.gitkeep` placeholder.