Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 32 additions & 0 deletions .changeset/issue-104-ci-reliability.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
---
bump: patch
---

CI reliability: retry transient network failures in the JS image build and remove
a SIGPIPE false-negative from the dind example tests (issue #104 / PR #105).

The JS image build (`ubuntu/24.04/js/install.sh`, `COPY`'d into every dind/language
image) occasionally died on a single transient third-party error, with no retry:
the lean/language build hit a flaky npm registry response during
`npm install -g npm@latest`, and the dind-swift build hit
`playwright install … msedge …` getting an invalid GPG key body from
packages.microsoft.com ("gpg: no valid OpenPGP data found" → "Failed to install
msedge"). Every network-bound build step — the npm self-update, the
Playwright/Puppeteer CLI install, and the Playwright browser-binary download — now
goes through a `run_with_retry` wrapper that retries with exponential backoff
(mirroring `apt_update_with_retry` in `common.sh`, with the same overridable retry
budget so it stays unit-testable). `playwright install` skips already-present
browsers, so a retry only re-attempts the one that blipped. This is build-time
resilience only — the resulting image is unchanged on success.

Separately, the dind example suite asserted on container logs with
`docker logs … | grep -q "needle"`. Under `set -o pipefail`, `grep -q` closes the
pipe the instant it matches, which can deliver SIGPIPE to the still-streaming
`docker logs`; pipefail then propagates that 141 and a present message reads as
absent, failing the test spuriously (observed on the preload test even though the
expected line was right there in the logs). `tests/dind/lib.sh` now provides a
pipe-free `logs_contain` helper (capture once, match with a `case` glob) and all
example assertions use it.

Covered by new unit tests `experiments/test-issue104-build-retry.sh` and
`experiments/test-issue104-logs-contain.sh`.
29 changes: 29 additions & 0 deletions .changeset/issue-104-vfs-storage-driver-warning.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
---
bump: patch
---

dind-box: warn when the nested daemon runs on the `vfs` storage driver (issue
#104).

When the inner dockerd ends up on `vfs` — either pinned explicitly via
`DIND_STORAGE_DRIVER=vfs` (e.g. for overlay-on-overlay compatibility) or reached
as the last-resort auto-detect fallback — large images could fail to pull/run
with a cryptic `failed to register layer: no space left on device` and **no
hint** that the storage driver was the cause. `vfs` performs no copy-on-write: it
stores every image layer as a full, independent copy, so a multi-GB image's
on-disk footprint becomes the *sum* of all cumulative layer sizes (many times the
image size), and a >30 GB image can overflow a disk with far more than 30 GB free
(`link-assistant/hive-mind#1914`).

This is observability, not a default change — `vfs` stays the safe fallback. The
entrypoint now emits a single, actionable warning right after the daemon becomes
ready whenever the active driver is `vfs`, explaining the copy-on-write/disk
implication and naming the `DIND_STORAGE_DRIVER=fuse-overlayfs` remediation
(copy-on-write, works overlay-on-overlay, already shipped in the image). The
remediation line adapts to whether `/dev/fuse` is present, so when it is missing
it points at `--privileged` / `--device /dev/fuse` first. The
`DIND_STORAGE_DRIVER` doc comment now spells out the `vfs` disk amplification too.

Covered by a new unit test (`experiments/test-issue104-vfs-warning.sh`) and a new
assertion in the CI-run `tests/dind/example-storage-driver-vfs.sh`; documented in
`docs/dind/USAGE.md`.
23 changes: 23 additions & 0 deletions docs/dind/USAGE.md
Original file line number Diff line number Diff line change
Expand Up @@ -352,6 +352,29 @@ docker run -d --privileged \
konard/box-dind sleep infinity
```

Because that trade-off is easy to hit by accident, the entrypoint emits a
one-time warning whenever the **active** driver ends up being `vfs` — whether
pinned explicitly or reached as the last-resort fallback (issue #104). `vfs`
stores every image layer as a full, independent copy, so a multi-GB image's
on-disk footprint becomes the *sum* of all cumulative layer sizes — many times
the image size — and `docker pull`/`docker run` can fail with `failed to register
layer: no space left on device` on a disk far larger than the image. The warning
makes that failure traceable instead of looking like a generic "out of disk".

If your host supports it, prefer `DIND_STORAGE_DRIVER=fuse-overlayfs`: it is
copy-on-write **and** works overlay-on-overlay (the compatibility reason `vfs` is
sometimes chosen), is already shipped in the image, and needs `/dev/fuse`
(provided by `--privileged`). The warning's remediation line adapts to whether
`/dev/fuse` is present, so when it is missing it tells you to add `--privileged`
or `--device /dev/fuse` before switching.

```bash
docker run -d --privileged \
-e DIND_STORAGE_DRIVER=fuse-overlayfs \
--name box-dind-cow \
konard/box-dind sleep infinity
```

CI verifies the forced `vfs` path:

```bash
Expand Down
100 changes: 100 additions & 0 deletions experiments/test-issue104-build-retry.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
#!/usr/bin/env bash
# Unit test for the JS image's network-bound build-step retry wrapper
# (issue #104 / PR #105).
#
# Two transient, third-party failures used to abort the whole JS image build with
# no retry:
# * the lean (pr-test / language) build died on a transient npm registry error
# during `npm install -g npm@latest`; and
# * the dind-swift build died on `playwright install ... msedge ...` when
# packages.microsoft.com served an invalid GPG key body
# ("gpg: no valid OpenPGP data found" -> "Failed to install msedge").
# js/install.sh now routes every such network-bound step through run_with_retry(),
# which retries with exponential backoff. This test asserts both the POLICY (the
# wrapper exists and wraps every npm install AND the playwright browser download)
# and the BEHAVIOR (it retries, then succeeds or gives up).
set -euo pipefail

cd "$(dirname "$0")/.."

install="ubuntu/24.04/js/install.sh"
fail=0

assert_grep() { # extended-regex, description
if grep -qE "$1" "$install"; then
echo " PASS: $2"
else
echo " FAIL: $2"
fail=1
fi
}

assert_not_grep() { # extended-regex, description
if grep -qE "$1" "$install"; then
echo " FAIL: $2"
fail=1
else
echo " PASS: $2"
fi
}

echo "== Case 1: js/install.sh defines and uses run_with_retry =="
assert_grep '^run_with_retry\(\) \{' "defines run_with_retry()"
assert_grep 'BUILD_RETRY_MAX_RETRIES' "retry budget is overridable (BUILD_RETRY_MAX_RETRIES)"
assert_grep 'BUILD_RETRY_INITIAL_DELAY' "initial delay is overridable (BUILD_RETRY_INITIAL_DELAY)"
assert_grep 'run_with_retry npm install -g npm@latest' "wraps the npm@latest self-update"
assert_grep 'run_with_retry npm install -g playwright' "wraps the playwright/puppeteer CLI install"
assert_grep 'run_with_retry playwright install ' "wraps the playwright browser binary download (msedge/chrome flake)"

echo "== Case 2: no un-retried network install remains =="
# A bare line that *starts* with `npm install -g` or `playwright install` would be
# un-retried; the wrapped calls start with `run_with_retry`, so they must not match.
assert_not_grep '^[[:space:]]*npm install -g' "every 'npm install -g' goes through run_with_retry"
assert_not_grep '^[[:space:]]*playwright install ' "every 'playwright install' goes through run_with_retry"

echo "== Case 3: run_with_retry retry semantics =="
# install.sh as a whole performs real installs, so load just the function body.
fn="$(awk '/^run_with_retry\(\) \{/{p=1} p{print} p&&/^\}/{exit}' "$install")"
eval "$fn"
# Stub the helpers the wrapper leans on so the test stays fast and silent.
log_warning() { :; }
sleep() { :; }

attempts=0
succeed_now() { attempts=$((attempts + 1)); return 0; }
if BUILD_RETRY_INITIAL_DELAY=0 run_with_retry succeed_now && [ "$attempts" -eq 1 ]; then
echo " PASS: succeeds on the first attempt without retrying"
else
echo " FAIL: expected exactly one attempt, got ${attempts}"
fail=1
fi

attempts=0
fail_then_succeed() { attempts=$((attempts + 1)); [ "$attempts" -ge 3 ]; }
if BUILD_RETRY_INITIAL_DELAY=0 BUILD_RETRY_MAX_RETRIES=5 run_with_retry fail_then_succeed \
&& [ "$attempts" -eq 3 ]; then
echo " PASS: retries transient failures and ultimately succeeds (3 attempts)"
else
echo " FAIL: expected success on the 3rd attempt, got ${attempts}"
fail=1
fi

attempts=0
always_fail() { attempts=$((attempts + 1)); return 1; }
if BUILD_RETRY_INITIAL_DELAY=0 BUILD_RETRY_MAX_RETRIES=3 run_with_retry always_fail; then
echo " FAIL: should have given up after exhausting retries"
fail=1
elif [ "$attempts" -eq 3 ]; then
echo " PASS: gives up after BUILD_RETRY_MAX_RETRIES attempts and returns non-zero"
else
echo " FAIL: expected exactly 3 attempts before giving up, got ${attempts}"
fail=1
fi

echo ""
if [ "$fail" -eq 0 ]; then
echo "RESULT: PASS - build-step retry wrapper is present and behaves correctly"
else
echo "RESULT: FAIL"
exit 1
fi
89 changes: 89 additions & 0 deletions experiments/test-issue104-logs-contain.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
#!/usr/bin/env bash
# Regression test for the dind example-suite log assertions (issue #104 / PR #105).
#
# The pr-test / dind-js job intermittently failed on assertions shaped like
# if ! docker logs "$c" 2>&1 | grep -q "NEEDLE"; then fail ...; fi
# Under `set -o pipefail`, `grep -q` closes the pipe the instant it matches, which
# delivers SIGPIPE (exit 141) to the still-streaming `docker logs`. pipefail then
# propagates that 141, so a needle that WAS present reads as absent and the test
# fails spuriously. tests/dind/lib.sh now provides logs_contain(), which captures
# the logs first and matches with a `case` glob -- no pipe, no SIGPIPE.
#
# This test asserts the POLICY (the helper exists, every example uses it, and no
# raw `docker logs | grep` survives) and the BEHAVIOR (capture+case is correct and
# immune to the pipefail false-negative the old pattern suffered).
set -euo pipefail

cd "$(dirname "$0")/.."

fail=0
pass() { echo " PASS: $1"; }
miss() { echo " FAIL: $1"; fail=1; }

lib="tests/dind/lib.sh"

echo "== Case 1: lib.sh defines the pipe-free logs_contain helper =="
if grep -qE '^logs_contain\(\) \{' "$lib"; then pass "defines logs_contain()"; else miss "defines logs_contain()"; fi
if grep -q 'docker logs' "$lib"; then pass "logs_contain captures docker logs"; else miss "logs_contain captures docker logs"; fi

echo "== Case 2: no raw 'docker logs | grep' assertion survives in tests/dind =="
# A pipe straight from docker logs into grep is the vulnerable shape we removed.
if grep -rnE 'docker logs [^|]*\| *grep' tests/dind/ >/dev/null 2>&1; then
grep -rnE 'docker logs [^|]*\| *grep' tests/dind/ >&2
miss "no docker-logs|grep pipelines remain"
else
pass "no docker-logs|grep pipelines remain"
fi

echo "== Case 3: example scripts assert via logs_contain =="
for f in tests/dind/example-preload-images.sh tests/dind/example-storage-driver-vfs.sh; do
if grep -q 'logs_contain' "$f"; then pass "$(basename "$f") uses logs_contain"; else miss "$(basename "$f") uses logs_contain"; fi
done

echo "== Case 4: capture+case is correct and SIGPIPE-immune =="
NEEDLE="image preload/passthrough complete"
# Producer prints the needle EARLY then streams a large tail, so a matcher that
# short-circuits is reliably killed by SIGPIPE while the producer is still writing
# -- the exact shape of `docker logs` on a busy dind container.
producer() {
printf '%s\n' "starting dockerd"
printf '%s\n' "$NEEDLE"
for n in $(seq 1 5000); do
printf 'trailing log line %s aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n' "$n"
done
}
old_match() { producer | grep -q "$NEEDLE"; } # vulnerable
new_match() { # logs_contain's core
local logs
logs="$(producer 2>&1 || true)"
case "$logs" in *"$NEEDLE"*) return 0 ;; *) return 1 ;; esac
}

iterations=30
old_fn=0
new_fn=0
for _ in $(seq 1 "$iterations"); do
if ! old_match; then old_fn=$((old_fn + 1)); fi
if ! new_match; then new_fn=$((new_fn + 1)); fi
done
echo " (old pipe|grep -q false negatives: ${old_fn}/${iterations}; new capture+case: ${new_fn}/${iterations})"
if [ "$new_fn" -eq 0 ]; then pass "capture+case never false-negatives under pipefail"; else miss "capture+case false-negatived ${new_fn}/${iterations}"; fi

# Correctness: reject an absent needle; match needles containing glob/regex
# metacharacters as literals (the vfs warning needles do).
absent="no marker here"
case "$absent" in *"$NEEDLE"*) miss "matched an absent needle" ;; *) pass "rejects an absent needle" ;; esac
meta="warning: 'vfs' storage driver [no copy-on-write] DIND_STORAGE_DRIVER=fuse-overlayfs"
meta_ok=1
for n in "'vfs' storage driver" "[no copy-on-write]" "DIND_STORAGE_DRIVER=fuse-overlayfs"; do
case "$meta" in *"$n"*) ;; *) meta_ok=0 ;; esac
done
if [ "$meta_ok" -eq 1 ]; then pass "matches glob/regex metacharacters as literals"; else miss "failed to match a literal metacharacter needle"; fi

echo ""
if [ "$fail" -eq 0 ]; then
echo "RESULT: PASS - logs_contain is wired in and SIGPIPE-immune"
else
echo "RESULT: FAIL"
exit 1
fi
89 changes: 89 additions & 0 deletions experiments/test-issue104-vfs-warning.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
#!/usr/bin/env bash
# Isolated unit test for the issue #104 vfs storage-driver warning in
# dind-entrypoint.sh.
#
# Landing on the `vfs` storage driver used to be silent (only a `log` line named
# the driver), so an operator hitting `failed to register layer: no space left on
# device` had no breadcrumb pointing at the copy-on-write-less driver. The
# entrypoint now emits a one-time `warn` whenever the *active* driver is `vfs`,
# with a remediation hint whose wording depends on whether the fuse-overlayfs
# device node is available.
#
# Building the full box-dind image requires overlay-backed nested Docker, which
# this sandbox cannot provide, so — exactly like preload-unit-test.sh — we source
# the real entrypoint (DIND_ENTRYPOINT_SOURCE_ONLY=1 returns before the
# startup/handoff flow) to get `warn_if_vfs_storage_driver` verbatim and drive it
# directly. The /dev/fuse probe is pointed at a temp path via DIND_FUSE_DEVICE so
# both remediation branches are exercised deterministically without a real device
# node.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ENTRYPOINT="$SCRIPT_DIR/../ubuntu/24.04/dind/dind-entrypoint.sh"

WORK="$(mktemp -d)"
trap 'rm -rf "$WORK"' EXIT

# Source the real entrypoint for its functions only.
# shellcheck disable=SC1090
DIND_ENTRYPOINT_SOURCE_ONLY=1 . "$ENTRYPOINT"

pass=0; fail=0
check() { # check <description> <condition-cmd...>
desc="$1"; shift
if "$@"; then echo " PASS: $desc"; pass=$((pass+1)); else echo " FAIL: $desc"; fail=$((fail+1)); fi
}

ERR="$WORK/err.log"
PRESENT_FUSE="$WORK/fuse-present" # an existing path: stands in for /dev/fuse
MISSING_FUSE="$WORK/fuse-missing" # a path that does not exist
: > "$PRESENT_FUSE"
rm -f "$MISSING_FUSE"

echo "== Case 1: vfs driver emits the copy-on-write warning =="
: > "$ERR"
DIND_FUSE_DEVICE="$PRESENT_FUSE" warn_if_vfs_storage_driver vfs 2>"$ERR"
check "warns the driver is vfs" grep -q "'vfs' storage driver" "$ERR"
check "calls out NO copy-on-write" grep -q "NO copy-on-write" "$ERR"
check "names the disk failure mode" grep -q "no space left on device" "$ERR"
check "names the fuse-overlayfs remediation" grep -q "DIND_STORAGE_DRIVER=fuse-overlayfs" "$ERR"

echo "== Case 2: overlay2 driver stays silent =="
: > "$ERR"
DIND_FUSE_DEVICE="$PRESENT_FUSE" warn_if_vfs_storage_driver overlay2 2>"$ERR"
check "no warning for overlay2" bash -c '! test -s "$1"' _ "$ERR"

echo "== Case 3: fuse-overlayfs driver stays silent =="
: > "$ERR"
DIND_FUSE_DEVICE="$PRESENT_FUSE" warn_if_vfs_storage_driver fuse-overlayfs 2>"$ERR"
check "no warning for fuse-overlayfs" bash -c '! test -s "$1"' _ "$ERR"

echo "== Case 4: empty/unknown driver stays silent =="
: > "$ERR"
DIND_FUSE_DEVICE="$PRESENT_FUSE" warn_if_vfs_storage_driver "" 2>"$ERR"
check "no warning for empty driver" bash -c '! test -s "$1"' _ "$ERR"

echo "== Case 5: /dev/fuse present -> 'set fuse-overlayfs' remediation =="
: > "$ERR"
DIND_FUSE_DEVICE="$PRESENT_FUSE" warn_if_vfs_storage_driver vfs 2>"$ERR"
check "remediation says the device is present" grep -q "is present" "$ERR"
check "remediation does NOT claim it is missing" bash -c '! grep -q "is missing" "$1"' _ "$ERR"

echo "== Case 6: /dev/fuse missing -> explains why fuse-overlayfs is unavailable =="
: > "$ERR"
DIND_FUSE_DEVICE="$MISSING_FUSE" warn_if_vfs_storage_driver vfs 2>"$ERR"
check "remediation explains the device is missing" grep -q "is missing" "$ERR"
check "remediation suggests --device /dev/fuse" grep -q -- "--device /dev/fuse" "$ERR"
check "remediation suggests --privileged" grep -q -- "--privileged" "$ERR"
check "still names the fuse-overlayfs driver" grep -q "DIND_STORAGE_DRIVER=fuse-overlayfs" "$ERR"

echo "== Case 7: function returns success so the start_dockerd success branch is unaffected =="
# warn_if_vfs_storage_driver runs immediately before `return 0`; under `set -e`
# a non-zero return would abort startup. Assert exit status 0 for both vfs and
# non-vfs drivers.
check "returns 0 for vfs" bash -c 'DIND_ENTRYPOINT_SOURCE_ONLY=1 . "$1"; DIND_FUSE_DEVICE="$2" warn_if_vfs_storage_driver vfs >/dev/null 2>&1' _ "$ENTRYPOINT" "$PRESENT_FUSE"
check "returns 0 for overlay2" bash -c 'DIND_ENTRYPOINT_SOURCE_ONLY=1 . "$1"; warn_if_vfs_storage_driver overlay2 >/dev/null 2>&1' _ "$ENTRYPOINT"

echo
echo "RESULT: $pass passed, $fail failed"
[ "$fail" -eq 0 ]
Loading