dind: warn when the nested daemon runs on the 'vfs' storage driver (issue #104) #105

Merged
konard merged 4 commits into main from issue-104-8849a1f71a19 on Jun 15, 2026

Conversation

@konard

@konard konard commented Jun 14, 2026

Member

Summary

Fixes #104.

When the nested Docker daemon ends up on the vfs storage driver, large
images can fail to pull or run with a cryptic "failed to register layer: no
space left on device" error and no hint that the storage driver is the cause. vfs
performs no copy-on-write: it stores a full, independent copy of the entire
filesystem for every layer, so a multi-GB image's on-disk footprint becomes
the sum of all cumulative layer sizes, many times the image size. A ~30 GB
image can overflow a disk with far more than 30 GB free.
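
To make the amplification concrete, here is a small arithmetic sketch. The layer sizes are hypothetical and `vfs_footprint_gb` is an illustrative helper, not part of the entrypoint. It assumes each layer only adds data, so the full filesystem at layer N is the sum of the first N additions.

```shell
#!/usr/bin/env bash
# Illustrative arithmetic only: the layer sizes below are hypothetical.
# Each argument is the size (in GB) that one layer ADDS to the image.
vfs_footprint_gb() {
  local cumulative=0 total=0 delta
  for delta in "$@"; do
    cumulative=$((cumulative + delta))   # full filesystem as of this layer
    total=$((total + cumulative))        # vfs stores that full copy per layer
  done
  echo "$total"
}

# Three layers adding 10 GB each form a 30 GB image, but vfs stores
# 10 + 20 + 30 GB of full copies.
vfs_footprint_gb 10 10 10   # prints 60
```

With a copy-on-write driver the same image would occupy roughly the 30 GB its layers add, which is why the gap grows with the number of layers.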

box-dind already auto-detects a good driver (overlay2 → fuse-overlayfs → vfs,
with graceful fallback) and vfs is only the last resort, so this is not a
"wrong default" bug — it is observability. Today the choice is silent (only a
log line names the driver), so an operator hitting no space left on device
has no breadcrumb. This bit us downstream in link-assistant/hive-mind#1914
(a >30 GB image), where pinning DIND_STORAGE_DRIVER=fuse-overlayfs fixed it.

Change

A new warn_if_vfs_storage_driver in ubuntu/24.04/dind/dind-entrypoint.sh is
called from start_dockerd() right after wait_for_dockerd_ready succeeds, so
it fires whenever the active driver is vfs — whether pinned explicitly via
DIND_STORAGE_DRIVER=vfs (legitimate for overlay-on-overlay compatibility) or
reached as the last-resort auto-detect fallback:

[dind-entrypoint] WARN: dockerd is using the 'vfs' storage driver, which has NO copy-on-write:
[dind-entrypoint] WARN: every image layer is stored as a full copy, so a multi-GB image's on-disk
[dind-entrypoint] WARN: footprint becomes the SUM of all cumulative layer sizes (many times the
[dind-entrypoint] WARN: image size). 'docker pull'/'docker run' can then fail with 'failed to
[dind-entrypoint] WARN: register layer: no space left on device' on a disk far larger than the image.
[dind-entrypoint] WARN: For copy-on-write here, set DIND_STORAGE_DRIVER=fuse-overlayfs (works
[dind-entrypoint] WARN: overlay-on-overlay; /dev/fuse is present).

The remediation line adapts to whether /dev/fuse is present (the device
fuse-overlayfs needs). When it is missing — the exact reason auto-detect fell
through to vfs — the warning instead explains that fuse-overlayfs is
unavailable and tells you to add --privileged or --device /dev/fuse first.
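
A minimal sketch of the shape such a function could take, for readers who do not have the diff open. This is not the code in dind-entrypoint.sh: passing the driver name as `$1`, the local `warn` helper, and the exact wording of the missing-device branch are assumptions made for illustration. `DIND_FUSE_DEVICE` is the overridable device path the unit test mentions.

```shell
#!/usr/bin/env bash
# Sketch only, NOT the real entrypoint code.
# Assumption: the active storage driver name is passed as $1.
warn() { echo "[dind-entrypoint] WARN: $*" >&2; }

warn_if_vfs_storage_driver() {
  local driver="${1:-}"
  local fuse_device="${DIND_FUSE_DEVICE:-/dev/fuse}"

  # overlay2, fuse-overlayfs and an empty/unknown driver stay silent.
  [ "$driver" = "vfs" ] || return 0

  warn "dockerd is using the 'vfs' storage driver, which has NO copy-on-write:"
  warn "every image layer is stored as a full copy, so large pulls can fail with"
  warn "'failed to register layer: no space left on device'."

  if [ -e "$fuse_device" ]; then
    warn "For copy-on-write here, set DIND_STORAGE_DRIVER=fuse-overlayfs ($fuse_device is present)."
  else
    # Wording of this branch is illustrative, not copied from the PR.
    warn "fuse-overlayfs is unavailable because $fuse_device is missing;"
    warn "add --privileged or --device /dev/fuse, then set DIND_STORAGE_DRIVER=fuse-overlayfs."
  fi

  # Always succeed so a caller running under 'set -e' is unaffected.
  return 0
}

# Example: force the "device missing" branch via the overridable path.
DIND_FUSE_DEVICE=/nonexistent-fuse-device warn_if_vfs_storage_driver vfs
```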

This is observability, not a default change: vfs stays the safe fallback.
Non-vfs drivers (overlay2, fuse-overlayfs) stay silent so normal runs are
not spammed. The DIND_STORAGE_DRIVER doc comment now also spells out the vfs
disk amplification.

Reproduce

docker run --rm -it --privileged \
  -e DIND_STORAGE_DRIVER=vfs \
  konard/box-dind:latest \
  bash -lc 'docker info --format "driver={{.Driver}}"'
# Before: driver=vfs, no warning — a later large pull dies with
#         "failed to register layer: no space left on device" and no hint.
# After:  startup logs the vfs copy-on-write warning + the fuse-overlayfs remediation.

Tests

  • Unit (experiments/test-issue104-vfs-warning.sh): sources the real
    entrypoint via DIND_ENTRYPOINT_SOURCE_ONLY=1 and drives
    warn_if_vfs_storage_driver directly. Asserts: vfs emits the warning
    (content checks for no-copy-on-write, the disk failure mode, and the
    fuse-overlayfs remediation); overlay2/fuse-overlayfs/empty stay silent;
    the /dev/fuse-present vs /dev/fuse-missing remediation branches (driven via
    the overridable DIND_FUSE_DEVICE); and that the function returns 0 so the
    set -e success branch is unaffected. RESULT: 15 passed, 0 failed.
    Regression: experiments/preload-unit-test.sh still 46 passed, 0 failed.
  • Integration (tests/dind/example-storage-driver-vfs.sh, run in CI): after
    asserting the inner daemon is on vfs, it now also asserts each line of the
    copy-on-write warning is present in the container logs, giving real
    end-to-end coverage under a --privileged vfs container.

CI reliability (flakes surfaced by this PR)

Across its pushes this PR turned several checks red — docker-build-test,
pr-test / dind-js, pr-test / lean, and pr-test / dind-swift — but not
because of the vfs-warning feature. Every failing job was tripping over a
pre-existing transient flake that the new vfs test merely exposed by running
the shared build/test paths more often; docker-build-test only aggregates the
matrix, so it goes green once the underlying jobs do. Per the root-cause mandate,
all are fixed here rather than re-run:

  • pr-test / dind-js — SIGPIPE false-negative in the example suite. The dind
    example tests asserted on logs with docker logs … | grep -q "NEEDLE". Under
    set -o pipefail, grep -q closes the pipe the instant it matches; the
    still-streaming docker logs upstream gets SIGPIPE (exit 141), pipefail
    propagates the 141, and a message that was present reads as absent —
    failing the one-shot checks spuriously. The CI log is conclusive: the expected
    preload line and the "complete" marker were both in the logs, yet the grep
    reported them absent. tests/dind/lib.sh now provides a pipe-free
    logs_contain helper (capture once, match with a case glob — no pipe, no
    SIGPIPE) and every example assertion uses it.
  • pr-test / lean + pr-test / dind-swift — un-retried transient downloads in
    the JS image build.
    ubuntu/24.04/js/install.sh is COPY'd into every
    dind/language image, so a single transient third-party blip there aborts the
    whole image build with no retry. It bit two different jobs two different ways:
    lean died on a flaky npm registry response during
    npm install -g npm@latest (and --silent hid the cause), and dind-swift
    died on playwright install … msedge … when packages.microsoft.com served an
    invalid GPG key body ("gpg: no valid OpenPGP data found" → "Failed to install msedge"). msedge/chrome come from third-party apt repos
    (packages.microsoft.com / Google), so they are subject to the same blips as
    the npm registry. Every network-bound build step — the npm self-update, the
    Playwright/Puppeteer CLI install, and the Playwright browser-binary download —
    now routes through a run_with_retry wrapper (exponential backoff,
    env-overridable budget mirroring apt_update_with_retry in common.sh).
    playwright install skips already-present browsers, so a retry only
    re-attempts the one that blipped. Build-time resilience only — the image is
    unchanged on success.
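
The SIGPIPE false-negative can be reproduced without Docker. The sketch below uses a hypothetical `stream_logs` function as a stand-in for a still-streaming `docker logs`, then shows a pipe-free matcher in the spirit of `logs_contain`. The helper's signature here (captured text plus a literal needle) is an assumption; the real one in tests/dind/lib.sh may differ.

```shell
#!/usr/bin/env bash
set -o pipefail

# Hypothetical stand-in for a `docker logs` that keeps writing after the match.
stream_logs() { echo "preload complete"; yes "filler line"; }

# Old pattern: grep -q exits on the first match, the writer then fails on the
# closed pipe (normally killed by SIGPIPE, status 141), and pipefail reports
# the writer's failure even though the needle was found.
if stream_logs | grep -q "preload complete"; then
  echo "old pattern: present"
else
  echo "old pattern: reported absent (false negative)"
fi

# Pipe-free check: match already-captured text with a case glob.
# Quoting "$needle" makes it a literal string, not a glob pattern.
logs_contain() {
  local haystack="$1" needle="$2"
  case "$haystack" in
    *"$needle"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Capture once (a finite log here), then assert as often as needed.
captured="$(printf '%s\n' "starting" "preload complete" "done")"
if logs_contain "$captured" "preload complete"; then
  echo "logs_contain: present"
fi
```

Because the match runs on a captured string, there is no upstream process left to fail, so the result depends only on the log content.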

New unit tests cover both fixes:

  • experiments/test-issue104-logs-contain.sh — asserts the helper is wired in
    everywhere and demonstrates the race: the old pipe | grep -q false-negatives
    30/30 under load while logs_contain never does (0/30).
  • experiments/test-issue104-build-retry.sh — asserts every npm install -g
    and every playwright install goes through run_with_retry, and that the
    wrapper succeeds-first-try / retries-then-succeeds / gives-up-after-budget
    correctly.
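
For reference, a retry wrapper with the three behaviours the test names can be sketched as follows. This is a minimal illustration, not the code in ubuntu/24.04/js/install.sh: the variable names `BUILD_RETRY_MAX_ATTEMPTS` and `BUILD_RETRY_INITIAL_DELAY` and their defaults are assumptions (the PR only states a `BUILD_RETRY_*` prefix), and `flaky` is a hypothetical command.

```shell
#!/usr/bin/env bash
# Sketch only. The BUILD_RETRY_* variable names and defaults are assumptions.
run_with_retry() {
  local max_attempts="${BUILD_RETRY_MAX_ATTEMPTS:-5}"
  local delay="${BUILD_RETRY_INITIAL_DELAY:-2}"
  local attempt=1

  while true; do
    # Run the wrapped command; stop at the first success.
    "$@" && return 0

    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "run_with_retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi

    echo "run_with_retry: attempt $attempt failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    delay=$((delay * 2))        # exponential backoff
    attempt=$((attempt + 1))
  done
}

# Hypothetical flaky command: fails twice, then succeeds.
calls=0
flaky() { calls=$((calls + 1)); [ "$calls" -ge 3 ]; }

# A zero delay keeps the demonstration fast, as a unit test would.
BUILD_RETRY_INITIAL_DELAY=0 run_with_retry flaky && echo "succeeded after $calls calls"
```

Keeping the budget in environment variables is what lets a unit test drive the give-up path without waiting through real backoff delays.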

Docs / release

  • docs/dind/USAGE.md: documents the warning under the Storage Driver section,
    with a fuse-overlayfs (copy-on-write) example alongside the existing vfs
    one.
  • Adds two patch changesets:
    .changeset/issue-104-vfs-storage-driver-warning.md (the feature) and
    .changeset/issue-104-ci-reliability.md (the CI flake fixes).
  • Removed the auto-generated .gitkeep placeholder.

Adding .gitkeep for PR creation (default mode).
This file will be removed when the task is complete.

Issue: #104
@konard konard self-assigned this Jun 14, 2026
…(issue #104)

Landing on the `vfs` storage driver was silent: a single `log` line named the
driver, but nothing flagged that vfs performs NO copy-on-write. vfs stores every
image layer as a full, independent copy, so a multi-GB image's on-disk footprint
becomes the SUM of all cumulative layer sizes (many times the image size), and
`docker pull`/`docker run` can fail with `failed to register layer: no space
left on device` on a disk far larger than the image — with no breadcrumb
pointing at the driver. Downstream this overflowed a disk with a >30 GB image
(link-assistant/hive-mind#1914).

The active driver ends up being vfs either pinned explicitly via
`DIND_STORAGE_DRIVER=vfs` (legitimate for overlay-on-overlay compatibility) or
reached as the last-resort auto-detect fallback. This is observability, not a
default change — vfs stays the safe fallback. `start_dockerd` now calls
`warn_if_vfs_storage_driver` right after the daemon becomes ready, emitting one
actionable warning whenever the active driver is vfs: it explains the
copy-on-write/disk implication and names the `DIND_STORAGE_DRIVER=fuse-overlayfs`
remediation (copy-on-write, works overlay-on-overlay, already shipped in the
image). The remediation adapts to whether `/dev/fuse` is present, pointing at
`--privileged` / `--device /dev/fuse` first when it is missing. The
`DIND_STORAGE_DRIVER` doc comment now spells out the vfs disk amplification too.

Covered by a new unit test (experiments/test-issue104-vfs-warning.sh) and a new
assertion in the CI-run tests/dind/example-storage-driver-vfs.sh; documented in
docs/dind/USAGE.md. Adds a patch changeset.
@konard konard changed the title [WIP] dind: warn when the nested daemon runs on the 'vfs' storage driver (no copy-on-write → large images overflow disk) dind: warn when the nested daemon runs on the 'vfs' storage driver (issue #104) Jun 14, 2026
@konard

konard commented Jun 14, 2026

Member Author

Working session summary

CI is still building (the pr-test / js job builds the box-dind image and runs the full dind integration suite, which takes time). I've started a reliable background poller (b4bc7camg) that will notify me the moment the run completes with the per-job results.

Summary of where things stand:

  • Committed & pushed to issue-104-8849a1f71a19 as 5b9944e — the vfs warning (warn_if_vfs_storage_driver), the /dev/fuse-adaptive remediation, the updated DIND_STORAGE_DRIVER doc comment, the unit test (15/15 pass locally), the CI integration assertion, the USAGE.md docs, the patch changeset, and the .gitkeep removal.
  • PR #105 (dind: warn when the nested daemon runs on the 'vfs' storage driver (issue #104)) updated: title de-[WIP]'d and a full description (Summary / Change / Reproduce / Tests / Docs-release) written.
  • CI gates already green: Check for Changesets ✓ and Check for Manual Version Changes ✓. Only pr-test / js remains in progress.
  • Main is already an ancestor of the branch — no merge needed.

Once the poller reports the run completed successfully, I'll mark the PR ready with gh pr ready 105. If anything fails, I'll download the logs and investigate before finalizing.


This summary was automatically extracted from the AI working session output.

@konard konard marked this pull request as ready for review June 14, 2026 22:16
@konard

konard commented Jun 14, 2026

Member Author

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $5.534161
  • Calculated by Anthropic: $6.343478
  • Difference: $0.809318 (+14.62%)

📊 Context and tokens usage:

Claude Opus 4.8: (2 sub-sessions)

  1. 116.7K / 1M (12%) input tokens, 40.4K / 128K (32%) output tokens
  2. 73.3K / 1M (7%) input tokens, 23.2K / 128K (18%) output tokens

Total: (10.6K new + 218.7K cache writes + 4.7M cache reads) input tokens, 70.4K output tokens, $5.534161 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (2426KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.

@konard

konard commented Jun 14, 2026

Member Author

🔄 Auto-restart triggered (iteration 1)

Reason: CI failures detected

Starting new session to address the issues.


Auto-restart-until-mergeable mode is active. This run will stop after 5 restart iterations.

PR #105's three red checks (docker-build-test, pr-test / dind-js,
pr-test / lean) were not caused by the vfs-warning feature itself — both
the dind-js and lean jobs were failing on pre-existing transient flakes
that the feature's new test surfaced. docker-build-test only aggregates
the matrix, so it goes green once the other two do.

pr-test / dind-js — SIGPIPE false-negative in the example suite
  The dind example tests asserted on container logs with
  `docker logs … | grep -q "NEEDLE"`. Under `set -o pipefail`, `grep -q`
  closes the pipe the instant it matches; the still-streaming `docker
  logs` upstream gets SIGPIPE (exit 141), pipefail propagates the 141,
  and a message that WAS present reads as absent — failing the one-shot
  checks spuriously (confirmed in the CI log: the expected preload line
  and the "complete" marker were both present, yet grep reported absent).
  tests/dind/lib.sh now provides a pipe-free `logs_contain` helper
  (capture once, match with a `case` glob — no pipe, no SIGPIPE) and
  every example assertion uses it. A focused regression test shows the
  old pattern false-negatives 30/30 under load while the new one never
  does.

pr-test / lean — un-retried transient npm failure
  The JS image build ran `npm install -g npm@latest --no-fund --silent`
  with no retry; a single transient registry blip aborted the whole
  build, and `--silent` hid the cause. ubuntu/24.04/js/install.sh now
  routes its npm registry installs through a `run_with_retry` wrapper
  (exponential backoff, env-overridable budget mirroring
  apt_update_with_retry in common.sh). Build-time resilience only — the
  image is unchanged on success.

Also hardened the issue-104 vfs example test to assert each warning
needle via logs_contain. New unit tests:
experiments/test-issue104-npm-retry.sh and
experiments/test-issue104-logs-contain.sh. Changeset: bump patch.
@konard

konard commented Jun 14, 2026

Member Author

Working session summary

The workflow structure confirms my fixes are correctly targeted:

  • pr-test / lean (cell of pr-test-language, line 538) builds the JS + essentials + lean chain — which runs js/install.sh at build time. That's exactly where the npm flake aborted the build; my run_with_retry wrapper fixes it.
  • pr-test / dind-js (cell of pr-test-dind, line 774) runs the dind example tests — where the SIGPIPE false-negative lived; my logs_contain helper fixes it.
  • docker-build-test (line 417) is the branch-protection aggregator — it goes green once the above pass.

All three previously-red checks will be re-exercised by this run (should-build is true, since pr-test / js is scheduled rather than skipped). The tiers run sequentially (js → essentials → {lean, dind-js}), so the build takes a while. The background monitor is polling and will notify me on completion.

Everything actionable is done: both root causes fixed and committed (3958614), pushed to issue-104-8849a1f71a19, PR #105 description updated, compliance gates (version-check, changeset-check) already green, and all local tests passing. I'll report the final check results once the run completes.


This summary was automatically extracted from the AI working session output.

@konard

konard commented Jun 14, 2026

Member Author

🔄 Auto-restart-until-mergeable Log (iteration 1)

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $8.476338
  • Calculated by Anthropic: $16.102669
  • Difference: $7.626331 (+89.97%)

📊 Context and tokens usage:

Claude Opus 4.8: (3 sub-sessions)

  1. 114.1K / 1M (11%) input tokens, 40.0K / 128K (31%) output tokens
  2. 116.7K / 1M (12%) input tokens, 44.9K / 128K (35%) output tokens
  3. 73.2K / 1M (7%) input tokens, 23.8K / 128K (19%) output tokens

Total: (25.8K new + 353.0K cache writes + 6.0M cache reads) input tokens, 126.4K output tokens, $8.476338 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (5855KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.

@konard

konard commented Jun 15, 2026

Member Author

🔄 Auto-restart triggered (iteration 2)

Reason: CI failures detected

Starting new session to address the issues.


Auto-restart-until-mergeable mode is active. This run will stop after 5 restart iterations.

…edge GPG flake (issue #104)

The dind-swift image build failed on `playwright install ... msedge ...` when
packages.microsoft.com served an invalid GPG key body ("gpg: no valid OpenPGP
data found" -> "Failed to install msedge"). msedge and chrome are fetched from
third-party apt repos (packages.microsoft.com / Google), so they are subject to
the same transient blips that already motivated wrapping the npm installs.

Wrap both `playwright install` browser-download invocations in run_with_retry so
a single blip retries instead of failing the whole image build. Playwright skips
already-present browsers, so a retry only re-attempts the one that blipped.
Generalize the retry wrapper's env-var prefix NPM_RETRY_* -> BUILD_RETRY_* to
reflect that it now covers every network-bound build step (npm self-update,
Playwright/Puppeteer CLI install, browser-binary download).

Rename experiments/test-issue104-npm-retry.sh -> test-issue104-build-retry.sh
and extend it to assert the browser download is wrapped and no bare
`playwright install` survives; update the changeset accordingly.
@konard

konard commented Jun 15, 2026

Member Author

Working session summary

Everything actionable is complete; CI is still building the matrix (the pr-test tiers run sequentially and each builds Docker images, so it takes a while). The background poller b9o35ftln will notify me the moment the run finishes.

Status of this iteration (fixing the pr-test / dind-swift flake):

  • Root cause verified against the actual CI log: the playwright install … msedge … chrome step pulled chromium/firefox/webkit/headless-shell from Playwright's CDN, then died on msedge with "gpg: no valid OpenPGP data found" followed by "Error: Failed to install msedge", a transient packages.microsoft.com blip (other dind jobs passed the same step in the same run).
  • Fix committed & pushed (55e2efa): both playwright install browser-download lines now go through run_with_retry; the env-var prefix was generalized from NPM_RETRY_* to BUILD_RETRY_* since the wrapper now covers every network-bound build step. Playwright skips already-present browsers, so a retry only re-attempts the one that blipped.
  • Tests: renamed test-issue104-npm-retry.sh → test-issue104-build-retry.sh and extended it with browser-install assertions (11/11 pass). Full local suite green (logs-contain, vfs-warning 15/15, preload 46/46). All bash -n clean.
  • Changeset & PR description updated to document the msedge/chrome retry and the renamed test; no stray NPM_RETRY or old-filename references remain.
  • Branch confirmed up to date with origin/main (no merge needed); no human review feedback outstanding.

I'll report the per-job results as soon as the run completes.


This summary was automatically extracted from the AI working session output.

@konard

konard commented Jun 15, 2026

Member Author

🔄 Auto-restart-until-mergeable Log (iteration 2)

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $5.434357
  • Calculated by Anthropic: $22.550392
  • Difference: $17.116035 (+314.96%)

📊 Context and tokens usage:

Claude Opus 4.8: (2 sub-sessions)

  1. 115.3K / 1M (12%) input tokens, 44.1K / 128K (34%) output tokens
  2. 83.2K / 1M (8%) input tokens, 20.6K / 128K (16%) output tokens

Total: (14.9K new + 272.5K cache writes + 3.7M cache reads) input tokens, 71.8K output tokens, $5.434357 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (8340KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.

@konard

konard commented Jun 15, 2026

Member Author

✅ Ready to merge

This pull request is now ready to be merged:

  • All CI checks have passed
  • No merge conflicts
  • No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

@konard konard merged commit 4f67188 into main Jun 15, 2026
50 checks passed
github-actions Bot added a commit that referenced this pull request Jun 15, 2026
…ge build and remove a SIGPIPE false-negative from the dind example tests (issue #104 / PR #105).

The JS image build (`ubuntu/24.04/js/install.sh`, `COPY`'d into every dind/language image) occasionally died on a single transient third-party error, with no retry: the lean/language build hit a flaky npm registry response during `npm install -g npm@latest`, and the dind-swift build hit `playwright install … msedge …` getting an invalid GPG key body from packages.microsoft.com ("gpg: no valid OpenPGP data found" → "Failed to install msedge"). Every network-bound build step (the npm self-update, the Playwright/Puppeteer CLI install, and the Playwright browser-binary download) now goes through a `run_with_retry` wrapper that retries with exponential backoff (mirroring `apt_update_with_retry` in `common.sh`, with the same overridable retry budget so it stays unit-testable). `playwright install` skips already-present browsers, so a retry only re-attempts the one that blipped. This is build-time resilience only: the resulting image is unchanged on success.

Separately, the dind example suite asserted on container logs with `docker logs … | grep -q "needle"`. Under `set -o pipefail`, `grep -q` closes the pipe the instant it matches, which can deliver SIGPIPE to the still-streaming `docker logs`; pipefail then propagates that 141 and a present message reads as absent, failing the test spuriously (observed on the preload test even though the expected line was right there in the logs). `tests/dind/lib.sh` now provides a pipe-free `logs_contain` helper (capture once, match with a `case` glob) and all example assertions use it. Covered by new unit tests `experiments/test-issue104-build-retry.sh` and `experiments/test-issue104-logs-contain.sh`.

dind-box: warn when the nested daemon runs on the `vfs` storage driver (issue #104). When the inner dockerd ends up on `vfs` (either pinned explicitly via `DIND_STORAGE_DRIVER=vfs`, e.g. for overlay-on-overlay compatibility, or reached as the last-resort auto-detect fallback), large images could fail to pull/run with a cryptic `failed to register layer: no space left on device` and **no hint** that the storage driver was the cause. `vfs` performs no copy-on-write: it stores every image layer as a full, independent copy, so a multi-GB image's on-disk footprint becomes the *sum* of all cumulative layer sizes (many times the image size), and a >30 GB image can overflow a disk with far more than 30 GB free (`link-assistant/hive-mind#1914`). This is observability, not a default change: `vfs` stays the safe fallback.

The entrypoint now emits a single, actionable warning right after the daemon becomes ready whenever the active driver is `vfs`, explaining the copy-on-write/disk implication and naming the `DIND_STORAGE_DRIVER=fuse-overlayfs` remediation (copy-on-write, works overlay-on-overlay, already shipped in the image). The remediation line adapts to whether `/dev/fuse` is present, so when it is missing it points at `--privileged` / `--device /dev/fuse` first. The `DIND_STORAGE_DRIVER` doc comment now spells out the `vfs` disk amplification too. Covered by a new unit test (`experiments/test-issue104-vfs-warning.sh`) and a new assertion in the CI-run `tests/dind/example-storage-driver-vfs.sh`; documented in `docs/dind/USAGE.md`.

Development

Successfully merging this pull request may close these issues.

dind: warn when the nested daemon runs on the 'vfs' storage driver (no copy-on-write → large images overflow disk)

1 participant