dind: warn when the nested daemon runs on the 'vfs' storage driver (issue #104) by konard · Pull Request #105 · link-foundation/box

konard · 2026-06-14T21:59:39Z

Summary

Fixes #104.

When the nested Docker daemon ends up on the vfs storage driver, large
images can fail to pull or run with a cryptic "failed to register layer: no
space left on device" error and no hint that the storage driver is the cause.
vfs performs no copy-on-write: it stores a full, independent copy of the entire
filesystem for every layer, so a multi-GB image's on-disk footprint becomes
the sum of all cumulative layer sizes, many times the image size. A >30 GB
image can overflow a disk with far more than 30 GB free.
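The amplification can be checked with a little arithmetic. The sketch below is illustrative only: the layer sizes are hypothetical, and it assumes every layer only adds data, so the filesystem at each layer is the running total of the deltas so far.

```shell
# vfs_footprint_gb DELTA...
# Each argument is the data (in GB) that one layer adds on top of the previous one.
# vfs keeps a full copy of the filesystem per layer, so its footprint is the
# sum of the cumulative sizes, not the sum of the deltas.
vfs_footprint_gb() {
  local cumulative=0 total=0 delta
  for delta in "$@"; do
    cumulative=$(( cumulative + delta ))  # filesystem size at this layer
    total=$(( total + cumulative ))       # vfs stores that whole copy
  done
  echo "$total"
}

# Hypothetical 30 GB image built from four layers of 10, 8, 7 and 5 GB.
# Cumulative sizes are 10, 18, 25, 30, so vfs needs 10 + 18 + 25 + 30 = 83 GB.
echo "vfs footprint: $(vfs_footprint_gb 10 8 7 5) GB"
```

With a copy-on-write driver the same image would occupy roughly the 30 GB of deltas, which is why a disk with well over 30 GB free can still fill up under vfs.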

box-dind already auto-detects a good driver (overlay2 → fuse-overlayfs → vfs,
with graceful fallback) and vfs is only the last resort, so this is not a
"wrong default" bug; it is an observability gap. Today the choice is silent
(only a log line names the driver), so an operator hitting "no space left on
device" has no breadcrumb pointing at the driver. This bit us downstream in
link-assistant/hive-mind#1914 (a >30 GB image), where pinning
DIND_STORAGE_DRIVER=fuse-overlayfs fixed it.

Change

A new warn_if_vfs_storage_driver in ubuntu/24.04/dind/dind-entrypoint.sh is
called from start_dockerd() right after wait_for_dockerd_ready succeeds, so
it fires whenever the active driver is vfs — whether pinned explicitly via
DIND_STORAGE_DRIVER=vfs (legitimate for overlay-on-overlay compatibility) or
reached as the last-resort auto-detect fallback:

[dind-entrypoint] WARN: dockerd is using the 'vfs' storage driver, which has NO copy-on-write:
[dind-entrypoint] WARN: every image layer is stored as a full copy, so a multi-GB image's on-disk
[dind-entrypoint] WARN: footprint becomes the SUM of all cumulative layer sizes (many times the
[dind-entrypoint] WARN: image size). 'docker pull'/'docker run' can then fail with 'failed to
[dind-entrypoint] WARN: register layer: no space left on device' on a disk far larger than the image.
[dind-entrypoint] WARN: For copy-on-write here, set DIND_STORAGE_DRIVER=fuse-overlayfs (works
[dind-entrypoint] WARN: overlay-on-overlay; /dev/fuse is present).

The remediation line adapts to whether /dev/fuse is present (the device
fuse-overlayfs needs). When it is missing — the exact reason auto-detect fell
through to vfs — the warning instead explains that fuse-overlayfs is
unavailable and tells you to add --privileged or --device /dev/fuse first.

This is observability, not a default change: vfs stays the safe fallback.
Non-vfs drivers (overlay2, fuse-overlayfs) stay silent so normal runs are
not spammed. The DIND_STORAGE_DRIVER doc comment now also spells out the vfs
disk amplification.
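For readers who want the shape of the check, here is a minimal sketch of such a warning function. It is not the code from dind-entrypoint.sh: taking the driver as an argument, writing to stderr, and the shortened message text are all assumptions made for illustration. Only the function name, the DIND_FUSE_DEVICE override, and the two remediation branches come from the description above.

```shell
# Sketch only. Assumption: the caller passes the active driver name as $1.
warn_if_vfs_storage_driver() {
  local driver="${1:-}"
  # DIND_FUSE_DEVICE lets a test point at a path that does or does not exist.
  local fuse_device="${DIND_FUSE_DEVICE:-/dev/fuse}"

  # Non-vfs drivers (overlay2, fuse-overlayfs, empty) stay silent.
  [ "$driver" = "vfs" ] || return 0

  # Assumption: warnings go to stderr; the message is abbreviated here.
  echo "[dind-entrypoint] WARN: dockerd is using the 'vfs' storage driver, which has NO copy-on-write." >&2
  if [ -e "$fuse_device" ]; then
    echo "[dind-entrypoint] WARN: For copy-on-write here, set DIND_STORAGE_DRIVER=fuse-overlayfs." >&2
  else
    echo "[dind-entrypoint] WARN: fuse-overlayfs is unavailable ($fuse_device is missing); add --privileged or --device /dev/fuse first." >&2
  fi

  # Always succeed so a caller running under 'set -e' is not aborted by a warning.
  return 0
}
```

One possible call site, once the daemon is ready, would be `warn_if_vfs_storage_driver "$(docker info --format '{{.Driver}}')"`.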

Reproduce

docker run --rm -it --privileged \
  -e DIND_STORAGE_DRIVER=vfs \
  konard/box-dind:latest \
  bash -lc 'docker info --format "driver={{.Driver}}"'
# Before: driver=vfs, no warning — a later large pull dies with
#         "failed to register layer: no space left on device" and no hint.
# After:  startup logs the vfs copy-on-write warning + the fuse-overlayfs remediation.

Tests

Unit (experiments/test-issue104-vfs-warning.sh): sources the real
entrypoint via DIND_ENTRYPOINT_SOURCE_ONLY=1 and drives
warn_if_vfs_storage_driver directly. Asserts: vfs emits the warning
(content checks for no-copy-on-write, the disk failure mode, and the
fuse-overlayfs remediation); overlay2/fuse-overlayfs/empty stay silent;
the /dev/fuse-present vs /dev/fuse-missing remediation branches (driven via
the overridable DIND_FUSE_DEVICE); and that the function returns 0 so the
set -e success branch is unaffected. RESULT: 15 passed, 0 failed.
Regression: experiments/preload-unit-test.sh still 46 passed, 0 failed.
Integration (tests/dind/example-storage-driver-vfs.sh, run in CI): after
asserting the inner daemon is on vfs, it now also asserts each line of the
copy-on-write warning is present in the container logs, giving real
end-to-end coverage under a --privileged vfs container.
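The source-only approach used by the unit test can be illustrated with a stand-in script. Everything in the sketch below is hypothetical except the DIND_ENTRYPOINT_SOURCE_ONLY variable name: the guard shown is one common way to make a script sourceable for its functions, and may not match how the real entrypoint does it.

```shell
# Write a stand-in "entrypoint" to a temp file (hypothetical contents).
entrypoint="$(mktemp)"
cat > "$entrypoint" <<'EOF'
demo_warn() {
  [ "$1" = "vfs" ] || return 0   # silent for every other driver
  echo "WARN: vfs"
  return 0
}
main() {
  MAIN_RAN=1
  echo "starting dockerd (stand-in)"
}
# Skip main when the file is only being sourced for its functions.
if [ "${DIND_ENTRYPOINT_SOURCE_ONLY:-0}" != "1" ]; then
  main "$@"
fi
EOF

# The test sets the flag, sources the file, then drives the function directly.
DIND_ENTRYPOINT_SOURCE_ONLY=1
. "$entrypoint"
rm -f "$entrypoint"

output="$(demo_warn vfs)"
case "$output" in
  *"WARN"*) echo "PASS: vfs emits the warning" ;;
  *)        echo "FAIL: vfs stayed silent" ;;
esac
if [ -z "$(demo_warn overlay2)" ]; then
  echo "PASS: overlay2 stays silent"
fi
```

Sourcing rather than executing keeps the test fast and lets it exercise the function without starting a daemon.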

CI reliability (flakes surfaced by this PR)

Across its pushes this PR turned several checks red — docker-build-test,
pr-test / dind-js, pr-test / lean, and pr-test / dind-swift — but not
because of the vfs-warning feature. Every failing job was tripping over a
pre-existing transient flake that the new vfs test merely exposed by running
the shared build/test paths more often; docker-build-test only aggregates the
matrix, so it goes green once the underlying jobs do. Per the root-cause mandate,
all are fixed here rather than re-run:

pr-test / dind-js — SIGPIPE false-negative in the example suite. The dind
example tests asserted on logs with docker logs … | grep -q "NEEDLE". Under
set -o pipefail, grep -q closes the pipe the instant it matches; the
still-streaming docker logs upstream gets SIGPIPE (exit 141), pipefail
propagates the 141, and a message that was present reads as absent —
failing the one-shot checks spuriously. The CI log is conclusive: the expected
preload line and the "complete" marker were both in the logs, yet the grep
reported them absent. tests/dind/lib.sh now provides a pipe-free
logs_contain helper (capture once, match with a case glob — no pipe, no
SIGPIPE) and every example assertion uses it.
pr-test / lean + pr-test / dind-swift — un-retried transient downloads in
the JS image build. ubuntu/24.04/js/install.sh is COPY'd into every
dind/language image, so a single transient third-party blip there aborts the
whole image build with no retry. It bit two different jobs two different ways:
lean died on a flaky npm registry response during
npm install -g npm@latest (and --silent hid the cause), and dind-swift
died on playwright install … msedge … when packages.microsoft.com served an
invalid GPG key body (gpg: no valid OpenPGP data found → Failed to install msedge). msedge/chrome come from third-party apt repos
(packages.microsoft.com / Google), so they are subject to the same blips as
the npm registry. Every network-bound build step — the npm self-update, the
Playwright/Puppeteer CLI install, and the Playwright browser-binary download —
now routes through a run_with_retry wrapper (exponential backoff,
env-overridable budget mirroring apt_update_with_retry in common.sh).
playwright install skips already-present browsers, so a retry only
re-attempts the one that blipped. Build-time resilience only — the image is
unchanged on success.
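The pipe-free matching described in the dind-js item above could look roughly like the following. This is a sketch, not the helper from tests/dind/lib.sh: the argument order and the decision to merge stderr into the captured text are assumptions.

```shell
# logs_contain CONTAINER NEEDLE
# Returns 0 when NEEDLE appears anywhere in the container's logs.
logs_contain() {
  local container="$1" needle="$2" logs
  # Capture once. The command substitution reads docker logs to the end,
  # so no reader closes the pipe early and nothing can receive SIGPIPE.
  logs="$(docker logs "$container" 2>&1)" || return 1
  # Quoting "$needle" makes it a literal substring, not a glob pattern.
  case "$logs" in
    *"$needle"*) return 0 ;;
    *)           return 1 ;;
  esac
}
```

An assertion would then read `logs_contain "$container" "NEEDLE" || fail "…"`, where `fail` stands for whatever reporting function the suite already uses. The hazard itself is easy to see in bash: after `set -o pipefail`, running `yes | grep -q y` typically reports status 141 even though the match succeeded.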

New unit tests cover both fixes:

experiments/test-issue104-logs-contain.sh — asserts the helper is wired in
everywhere and demonstrates the race: the old pipe | grep -q false-negatives
30/30 under load while logs_contain never does (0/30).
experiments/test-issue104-build-retry.sh — asserts every npm install -g
and every playwright install goes through run_with_retry, and that the
wrapper succeeds-first-try / retries-then-succeeds / gives-up-after-budget
correctly.
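A retry wrapper with the three behaviours listed above might be sketched as follows. The BUILD_RETRY_ prefix comes from this PR, but the two variable names, their defaults, and the log wording are hypothetical; the real wrapper in ubuntu/24.04/js/install.sh may differ.

```shell
# run_with_retry COMMAND [ARGS...]
# Hypothetical knobs: BUILD_RETRY_ATTEMPTS (total tries), BUILD_RETRY_DELAY (first wait, seconds).
run_with_retry() {
  local max_attempts="${BUILD_RETRY_ATTEMPTS:-5}"
  local delay="${BUILD_RETRY_DELAY:-2}"
  local attempt=1

  while true; do
    # Running the command inside 'if' keeps 'set -e' from aborting on a failed try.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "run_with_retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "run_with_retry: attempt $attempt failed; retrying in ${delay}s: $*" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))        # exponential backoff
    attempt=$(( attempt + 1 ))
  done
}
```

A build step would be wrapped as `run_with_retry npm install -g npm@latest`. Making the budget overridable lets a unit test set the delay to 0 and cover the retry and give-up paths without waiting.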

Docs / release

docs/dind/USAGE.md: documents the warning under the Storage Driver section,
with a fuse-overlayfs (copy-on-write) example alongside the existing vfs
one.
Adds two patch changesets:
.changeset/issue-104-vfs-storage-driver-warning.md (the feature) and
.changeset/issue-104-ci-reliability.md (the CI flake fixes).
Removed the auto-generated .gitkeep placeholder.

Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #104

…(issue #104)

Landing on the `vfs` storage driver was silent: a single `log` line named the driver, but nothing flagged that vfs performs NO copy-on-write. vfs stores every image layer as a full, independent copy, so a multi-GB image's on-disk footprint becomes the SUM of all cumulative layer sizes (many times the image size), and `docker pull`/`docker run` can fail with `failed to register layer: no space left on device` on a disk far larger than the image, with no breadcrumb pointing at the driver. Downstream this overflowed a disk with a >30 GB image (link-assistant/hive-mind#1914).

The active driver ends up being vfs either pinned explicitly via `DIND_STORAGE_DRIVER=vfs` (legitimate for overlay-on-overlay compatibility) or reached as the last-resort auto-detect fallback. This is observability, not a default change: vfs stays the safe fallback.

`start_dockerd` now calls `warn_if_vfs_storage_driver` right after the daemon becomes ready, emitting one actionable warning whenever the active driver is vfs: it explains the copy-on-write/disk implication and names the `DIND_STORAGE_DRIVER=fuse-overlayfs` remediation (copy-on-write, works overlay-on-overlay, already shipped in the image). The remediation adapts to whether `/dev/fuse` is present, pointing at `--privileged` / `--device /dev/fuse` first when it is missing. The `DIND_STORAGE_DRIVER` doc comment now spells out the vfs disk amplification too.

Covered by a new unit test (experiments/test-issue104-vfs-warning.sh) and a new assertion in the CI-run tests/dind/example-storage-driver-vfs.sh; documented in docs/dind/USAGE.md. Adds a patch changeset.

konard · 2026-06-14T22:16:46Z

Working session summary

CI is still building (the pr-test / js job builds the box-dind image and runs the full dind integration suite, which takes time). I've started a reliable background poller (b4bc7camg) that will notify me the moment the run completes with the per-job results.

Summary of where things stand:

Committed & pushed to issue-104-8849a1f71a19 as 5b9944e — the vfs warning (warn_if_vfs_storage_driver), the /dev/fuse-adaptive remediation, the updated DIND_STORAGE_DRIVER doc comment, the unit test (15/15 pass locally), the CI integration assertion, the USAGE.md docs, the patch changeset, and the .gitkeep removal.
PR #105 (dind: warn when the nested daemon runs on the 'vfs' storage driver (issue #104)) updated: title de-[WIP]'d and a full description (Summary / Change / Reproduce / Tests / Docs-release) written.
CI gates already green: Check for Changesets ✓ and Check for Manual Version Changes ✓. Only pr-test / js remains in progress.
Main is already an ancestor of the branch — no merge needed.

Once the poller reports the run completed successfully, I'll mark the PR ready with gh pr ready 105. If anything fails, I'll download the logs and investigate before finalizing.

This summary was automatically extracted from the AI working session output.

konard · 2026-06-14T22:16:56Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $5.534161
Calculated by Anthropic: $6.343478
Difference: $0.809318 (+14.62%)

📊 Context and tokens usage:

Claude Opus 4.8: (2 sub-sessions)

116.7K / 1M (12%) input tokens, 40.4K / 128K (32%) output tokens
73.3K / 1M (7%) input tokens, 23.2K / 128K (18%) output tokens

Total: (10.6K new + 218.7K cache writes + 4.7M cache reads) input tokens, 70.4K output tokens, $5.534161 cost

🤖 Models used:

Tool: Anthropic Claude Code
Requested: opus
Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (2426KB)

View complete solution draft log

The working session has now ended. Feel free to review and add any feedback on the solution draft.

konard · 2026-06-14T23:00:42Z

🔄 Auto-restart triggered (iteration 1)

Reason: CI failures detected

Starting new session to address the issues.

Auto-restart-until-mergeable mode is active. This run will stop after 5 restart iterations.

PR #105's three red checks (docker-build-test, pr-test / dind-js, pr-test / lean) were not caused by the vfs-warning feature itself: both the dind-js and lean jobs were failing on pre-existing transient flakes that the feature's new test surfaced. docker-build-test only aggregates the matrix, so it goes green once the other two do.

pr-test / dind-js: SIGPIPE false-negative in the example suite

The dind example tests asserted on container logs with `docker logs … | grep -q "NEEDLE"`. Under `set -o pipefail`, `grep -q` closes the pipe the instant it matches; the still-streaming `docker logs` upstream gets SIGPIPE (exit 141), pipefail propagates the 141, and a message that WAS present reads as absent, failing the one-shot checks spuriously (confirmed in the CI log: the expected preload line and the "complete" marker were both present, yet grep reported absent). tests/dind/lib.sh now provides a pipe-free `logs_contain` helper (capture once, match with a `case` glob: no pipe, no SIGPIPE) and every example assertion uses it. A focused regression test shows the old pattern false-negatives 30/30 under load while the new one never does.

pr-test / lean: un-retried transient npm failure

The JS image build ran `npm install -g npm@latest --no-fund --silent` with no retry; a single transient registry blip aborted the whole build, and `--silent` hid the cause. ubuntu/24.04/js/install.sh now routes its npm registry installs through a `run_with_retry` wrapper (exponential backoff, env-overridable budget mirroring apt_update_with_retry in common.sh). Build-time resilience only: the image is unchanged on success.

Also hardened the issue-104 vfs example test to assert each warning needle via logs_contain.

New unit tests: experiments/test-issue104-npm-retry.sh and experiments/test-issue104-logs-contain.sh.

Changeset: bump patch.

konard · 2026-06-14T23:28:52Z

Working session summary

The workflow structure confirms my fixes are correctly targeted:

pr-test / lean (cell of pr-test-language, line 538) builds the JS + essentials + lean chain — which runs js/install.sh at build time. That's exactly where the npm flake aborted the build; my run_with_retry wrapper fixes it.
pr-test / dind-js (cell of pr-test-dind, line 774) runs the dind example tests — where the SIGPIPE false-negative lived; my logs_contain helper fixes it.
docker-build-test (line 417) is the branch-protection aggregator — it goes green once the above pass.

All three previously-red checks will be re-exercised by this run (should-build is true, since pr-test / js is scheduled rather than skipped). The tiers run sequentially (js → essentials → {lean, dind-js}), so the build takes a while. The background monitor is polling and will notify me on completion.

Everything actionable is done: both root causes fixed and committed (3958614), pushed to issue-104-8849a1f71a19, PR #105 description updated, compliance gates (version-check, changeset-check) already green, and all local tests passing. I'll report the final check results once the run completes.

This summary was automatically extracted from the AI working session output.

konard · 2026-06-14T23:29:00Z

🔄 Auto-restart-until-mergeable Log (iteration 1)

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $8.476338
Calculated by Anthropic: $16.102669
Difference: $7.626331 (+89.97%)

📊 Context and tokens usage:

Claude Opus 4.8: (3 sub-sessions)

114.1K / 1M (11%) input tokens, 40.0K / 128K (31%) output tokens
116.7K / 1M (12%) input tokens, 44.9K / 128K (35%) output tokens
73.2K / 1M (7%) input tokens, 23.8K / 128K (19%) output tokens

Total: (25.8K new + 353.0K cache writes + 6.0M cache reads) input tokens, 126.4K output tokens, $8.476338 cost

🤖 Models used:

Tool: Anthropic Claude Code
Requested: opus
Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (5855KB)

View complete solution draft log

The working session has now ended. Feel free to review and add any feedback on the solution draft.

konard · 2026-06-15T00:19:00Z

🔄 Auto-restart triggered (iteration 2)

Reason: CI failures detected

Starting new session to address the issues.

Auto-restart-until-mergeable mode is active. This run will stop after 5 restart iterations.

…edge GPG flake (issue #104)

The dind-swift image build failed on `playwright install ... msedge ...` when packages.microsoft.com served an invalid GPG key body ("gpg: no valid OpenPGP data found" -> "Failed to install msedge"). msedge and chrome are fetched from third-party apt repos (packages.microsoft.com / Google), so they are subject to the same transient blips that already motivated wrapping the npm installs.

Wrap both `playwright install` browser-download invocations in run_with_retry so a single blip retries instead of failing the whole image build. Playwright skips already-present browsers, so a retry only re-attempts the one that blipped.

Generalize the retry wrapper's env-var prefix NPM_RETRY_* -> BUILD_RETRY_* to reflect that it now covers every network-bound build step (npm self-update, Playwright/Puppeteer CLI install, browser-binary download). Rename experiments/test-issue104-npm-retry.sh -> test-issue104-build-retry.sh and extend it to assert the browser download is wrapped and no bare `playwright install` survives; update the changeset accordingly.

konard · 2026-06-15T00:36:22Z

Working session summary

Everything actionable is complete; CI is still building the matrix (the pr-test tiers run sequentially and each builds Docker images, so it takes a while). The background poller b9o35ftln will notify me the moment the run finishes.

Status of this iteration (fixing the pr-test / dind-swift flake):

Root cause verified against the actual CI log: the playwright install … msedge … chrome step pulled chromium/firefox/webkit/headless-shell from Playwright's CDN, then died on msedge with gpg: no valid OpenPGP data found → Error: Failed to install msedge — a transient packages.microsoft.com blip (other dind jobs passed the same step in the same run).
Fix committed & pushed (55e2efa): both playwright install browser-download lines now go through run_with_retry; the env-var prefix was generalized NPM_RETRY_* → BUILD_RETRY_* since the wrapper now covers every network-bound build step. Playwright skips already-present browsers, so a retry only re-attempts the one that blipped.
Tests: renamed test-issue104-npm-retry.sh → test-issue104-build-retry.sh, extended with browser-install assertions — 11/11 pass. Full local suite green (logs-contain, vfs-warning 15/15, preload 46/46). All bash -n clean.
Changeset & PR description updated to document the msedge/chrome retry and the renamed test; no stray NPM_RETRY or old-filename references remain.
Branch confirmed up to date with origin/main (no merge needed); no human review feedback outstanding.

I'll report the per-job results as soon as the run completes.

This summary was automatically extracted from the AI working session output.

konard · 2026-06-15T00:36:30Z

🔄 Auto-restart-until-mergeable Log (iteration 2)

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $5.434357
Calculated by Anthropic: $22.550392
Difference: $17.116035 (+314.96%)

📊 Context and tokens usage:

Claude Opus 4.8: (2 sub-sessions)

115.3K / 1M (12%) input tokens, 44.1K / 128K (34%) output tokens
83.2K / 1M (8%) input tokens, 20.6K / 128K (16%) output tokens

Total: (14.9K new + 272.5K cache writes + 3.7M cache reads) input tokens, 71.8K output tokens, $5.434357 cost

🤖 Models used:

Tool: Anthropic Claude Code
Requested: opus
Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (8340KB)

View complete solution draft log

The working session has now ended. Feel free to review and add any feedback on the solution draft.

konard · 2026-06-15T01:26:41Z

✅ Ready to merge

This pull request is now ready to be merged:

All CI checks have passed
No merge conflicts
No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

…ge build and remove a SIGPIPE false-negative from the dind example tests (issue #104 / PR #105).

The JS image build (`ubuntu/24.04/js/install.sh`, `COPY`'d into every dind/language image) occasionally died on a single transient third-party error, with no retry: the lean/language build hit a flaky npm registry response during `npm install -g npm@latest`, and the dind-swift build hit `playwright install … msedge …` getting an invalid GPG key body from packages.microsoft.com ("gpg: no valid OpenPGP data found" → "Failed to install msedge"). Every network-bound build step (the npm self-update, the Playwright/Puppeteer CLI install, and the Playwright browser-binary download) now goes through a `run_with_retry` wrapper that retries with exponential backoff (mirroring `apt_update_with_retry` in `common.sh`, with the same overridable retry budget so it stays unit-testable). `playwright install` skips already-present browsers, so a retry only re-attempts the one that blipped. This is build-time resilience only: the resulting image is unchanged on success.

Separately, the dind example suite asserted on container logs with `docker logs … | grep -q "needle"`. Under `set -o pipefail`, `grep -q` closes the pipe the instant it matches, which can deliver SIGPIPE to the still-streaming `docker logs`; pipefail then propagates that 141 and a present message reads as absent, failing the test spuriously (observed on the preload test even though the expected line was right there in the logs). `tests/dind/lib.sh` now provides a pipe-free `logs_contain` helper (capture once, match with a `case` glob) and all example assertions use it.

Covered by new unit tests `experiments/test-issue104-build-retry.sh` and `experiments/test-issue104-logs-contain.sh`.

dind-box: warn when the nested daemon runs on the `vfs` storage driver (issue #104).

When the inner dockerd ends up on `vfs`, either pinned explicitly via `DIND_STORAGE_DRIVER=vfs` (e.g. for overlay-on-overlay compatibility) or reached as the last-resort auto-detect fallback, large images could fail to pull/run with a cryptic `failed to register layer: no space left on device` and **no hint** that the storage driver was the cause. `vfs` performs no copy-on-write: it stores every image layer as a full, independent copy, so a multi-GB image's on-disk footprint becomes the *sum* of all cumulative layer sizes (many times the image size), and a >30 GB image can overflow a disk with far more than 30 GB free (`link-assistant/hive-mind#1914`). This is observability, not a default change: `vfs` stays the safe fallback.

The entrypoint now emits a single, actionable warning right after the daemon becomes ready whenever the active driver is `vfs`, explaining the copy-on-write/disk implication and naming the `DIND_STORAGE_DRIVER=fuse-overlayfs` remediation (copy-on-write, works overlay-on-overlay, already shipped in the image). The remediation line adapts to whether `/dev/fuse` is present, so when it is missing it points at `--privileged` / `--device /dev/fuse` first. The `DIND_STORAGE_DRIVER` doc comment now spells out the `vfs` disk amplification too.

Covered by a new unit test (`experiments/test-issue104-vfs-warning.sh`) and a new assertion in the CI-run `tests/dind/example-storage-driver-vfs.sh`; documented in `docs/dind/USAGE.md`.

Initial commit with task details

13d4c05

Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #104

konard self-assigned this Jun 14, 2026

konard changed the title ~~[WIP] dind: warn when the nested daemon runs on the 'vfs' storage driver (no copy-on-write → large images overflow disk)~~ dind: warn when the nested daemon runs on the 'vfs' storage driver (issue #104) Jun 14, 2026

konard marked this pull request as ready for review June 14, 2026 22:16

konard merged commit 4f67188 into main Jun 15, 2026
50 checks passed

konard merged 4 commits into main from issue-104-8849a1f71a19
