Snapshot shm via APFS fclonefileat for safe CoW #60
fail_snapshot:
    free(regions_snapshot);
    if (snapshot_shm_fd >= 0)
The child elfuse process is already spawned by posix_spawn at src/runtime/forkipc.c:1324, well before any of the snapshot/IPC steps run. Every goto fail_snapshot after that point therefore unwinds with a live child, but the fail_snapshot: block (src/runtime/forkipc.c:1637) only does close(ipc_sock) (:1655) and returns -LINUX_ENOMEM (:1660). It never reaps the child, and nothing else does either — so the child becomes a zombie that lingers for the parent elfuse's lifetime.
Why it becomes a zombie
close(ipc_sock) makes the child exit (it reads EOF on the control socket and bails), but exiting is exactly what creates the zombie: on macOS a child that has exited is kept as a <defunct> entry until its parent waitpids it. elfuse installs no host SIGCHLD handler and sets neither SIG_IGN nor SA_NOCLDWAIT (the only host-side touch is pthread_sigmask(SIG_BLOCK, …) in src/core/rosetta.c:1094, which merely defers delivery), so the kernel will not auto-discard the status. Something must call waitpid — and on this path nothing does:
- Guest-level reaper (sys_wait4 → host wait4): the path returns -LINUX_ENOMEM, i.e. the guest's fork() reports failure, so the guest believes no child exists and never waits for it.
- Host-level reaper (proc_reap_finished() sweeps proc_table with waitpid(WNOHANG)): entries land in proc_table only via proc_register_child, which runs only on the success path at src/runtime/forkipc.c:1603. A child unwound through fail_snapshot is never registered, so the sweep never sees it.
Both reapers key off "child was registered / guest knows about it." On the failure path neither holds, so no waitpid for child_host_pid is ever issued and the zombie persists. For a long-running, fork-heavy guest that repeatedly hits this path, zombies accumulate and consume host PIDs.
Scope note — which new fail points actually reach this path
This is pre-existing debt in fail_snapshot, but this PR changes the odds of reaching it:
- mkdtemp/fclonefileat/open inside fork_snapshot_shm_via_clonefile() return -1 and the caller falls back (rosetta → use_shm = false at :1455; native → live shm_fd). They do not goto fail_snapshot.
- Only the overlay-sync pwrite failure does goto fail_snapshot, and that block is pre-existing (the PR only moved it ahead of the header).
- The net new exposure is Rosetta-specific: pre-PR Rosetta ran with use_shm = false, so the overlay-sync pwrite block was skipped entirely and could not reach fail_snapshot; post-PR Rosetta runs use_shm = true (:1448) and now executes that pwrite, adding one new fail_snapshot trigger for Rosetta guests.
Confirmed and fixed.
Primary fix (this thread): every goto fail_snapshot lands after posix_spawn, so child_host_pid is always a live process there. After closing ipc_sock (which makes the child read EOF on fork_ipc_read_all and return nonzero from fork_child_main), reap it explicitly:
    pid_t reaped;
    do {
        reaped = waitpid(child_host_pid, NULL, 0);
    } while (reaped < 0 && errno == EINTR);
    if (reaped < 0)
        log_warn("clone: failed to reap fork-child pid=%d: %s",
                 (int) child_host_pid, strerror(errno));

A blocking waitpid is safe because fork_ipc_write_all only returns -1 when the write didn't deliver, so the child always sees EOF on a parent-side IPC failure (no pathological "child finished restore while parent thinks send failed" case to defend against).
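The close-then-reap sequence can be exercised in isolation. The following is a sketch under stated assumptions, not the forkipc.c code: `spawn_fail_and_reap` is a hypothetical name, a `socketpair` stands in for ipc_sock, `fork()` stands in for `posix_spawn`, and exit status 3 is an arbitrary "saw EOF" marker.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the failure path (not elfuse code).
 * Returns the child's exit code, or -1 on error. */
int spawn_fail_and_reap(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        close(sv[0]);                   /* child keeps only its own end */
        char byte;
        ssize_t n;
        do {
            n = read(sv[1], &byte, 1);  /* blocks until data or EOF */
        } while (n < 0 && errno == EINTR);
        _exit(n == 1 ? 0 : 3);          /* EOF or error: bail out nonzero */
    }

    close(sv[1]);   /* parent must not hold the child's end, or EOF never arrives */
    close(sv[0]);   /* simulated failure: drop the control socket */

    /* Blocking reap with EINTR retry, mirroring the loop in the fix. */
    int status;
    pid_t reaped;
    do {
        reaped = waitpid(pid, &status, 0);
    } while (reaped < 0 && errno == EINTR);
    if (reaped != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The blocking wait terminates here for the same reason given above: once every parent-side descriptor for the socket is closed, the child's read returns EOF and it exits.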
Bonus fix found while reviewing this patch: vfork_notify_fds[1] is closed at the post-spawn cleanup (line 1342) but never reset to -1, so the guarded close(vfork_notify_fds[1]) in fail_snapshot double-closes. In a multithreaded guest another vCPU can open a new fd between the two closes and the second close would steal it. Pre-existing, but adjacent to the lines this PR touches, so folded in: vfork_notify_fds[1] = -1; after the first close.
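The double-close hazard is easy to reproduce in a single thread, because POSIX hands out the lowest free descriptor number. A minimal sketch, assuming nothing about elfuse beyond the pattern described: `fresh_fd_survives` is a hypothetical helper, and a pipe plus `/dev/null` stand in for vfork_notify_fds and the unrelated fd another vCPU might open.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical demo (not elfuse code). Closes fds[1] early, opens an
 * unrelated fd, then runs a guarded cleanup close like the one in
 * fail_snapshot. Returns 1 if the unrelated fd is still open afterwards,
 * 0 if the cleanup closed it by mistake, -1 on setup error. */
int fresh_fd_survives(int reset_after_close)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    close(fds[1]);                      /* early close (post-spawn cleanup) */
    if (reset_after_close)
        fds[1] = -1;                    /* the fix: mark the slot as closed */

    /* Unrelated open; it receives the lowest free number, which is the
     * one that was just released. */
    int fresh = open("/dev/null", O_RDONLY);
    if (fresh < 0) {
        close(fds[0]);
        return -1;
    }

    /* Guarded cleanup: without the reset this closes `fresh`. */
    if (fds[1] >= 0)
        close(fds[1]);

    int survived = (fcntl(fresh, F_GETFD) != -1);
    if (survived)
        close(fresh);
    close(fds[0]);
    return survived;
}
```

In a multithreaded process the reuse is a race rather than a certainty, which is what makes the unreset descriptor hard to catch in testing.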
Validation: temporarily injected goto fail_snapshot immediately after posix_spawn, ran a fork-then-sleep guest, and ps -axo pid,ppid,stat under the elfuse parent showed:
- with fix: 0 zombies, fork() returns ENOMEM, no hang
- without fix (control): 1 <defunct> child (STAT=ZN)
Fork previously sent the parent's live shm_fd via SCM_RIGHTS and the child mapped it MAP_PRIVATE. The parent stayed on MAP_SHARED, so any page the child had not yet CoW'd reflected the parent's current bytes. That was benign for typical aarch64 workloads but corrupted x86_64-via-Rosetta guests: translator-internal structures (TLS slabs, code caches, indirect-call tables, block lists) cross page boundaries and observe parent-side mid-update reads. Issue #45 tracked the resulting fallback to a per-region byte copy through the IPC socket -- 10-16x slower per fork than the CoW path.

sys_clone now lifts the !g->is_rosetta gate and always asks fork_snapshot_shm_via_clonefile() for an APFS fclonefileat snapshot of the shm file. The clone shares blocks with the parent until either side writes, so the parent's later writes never reach the child's backing, and the existing guest_init_from_shm MAP_PRIVATE flow on the child consumes the snapshot unchanged. The snapshot helper uses a mode 0700 mkdtemp directory (clone inside, then unlink + rmdir) rather than an earlier mkstemp + unlink + fclonefileat sequence whose freed /tmp basename gave a local-user TOCTOU window that could DoS the fast path via EEXIST.

Fallback differs by guest. Rosetta drops use_shm on clonefile failure and falls through to the legacy region-copy path; sending the live fd would re-introduce the issue #45 corruption. Native guests keep use_shm and send g->shm_fd directly, preserving the original CoW behavior so a non-APFS /tmp does not silently slow forks down to per-region copy cost.

Overlay sync (pwrite of file-backed MAP_SHARED overlay bytes into shm_fd) moves before the IPC header so the cloned file picks up overlay-backed bytes and the header has_shm field reflects the post-clonefile outcome.

guest_init_from_shm now closes shm_fd on the compute_infra_layout failure path so the take-ownership contract holds on every error, not just the post-mmap ones.
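The mkdtemp-then-clone sequence described above can be sketched roughly as follows. This is an illustration under assumptions, not the body of fork_snapshot_shm_via_clonefile(): the function name `snapshot_fd_via_clonefile`, the `/tmp/shm-snap.XXXXXX` template, and the `clone` basename are all hypothetical, and the sketch assumes the source file lives on the same APFS volume as /tmp. On non-Apple hosts it compiles to a stub that reports ENOTSUP.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#ifdef __APPLE__
#include <sys/clonefile.h>
#endif

/* Hypothetical helper (names and paths are assumptions).
 * Returns an fd for a private clone of the file behind src_fd, or -1. */
int snapshot_fd_via_clonefile(int src_fd)
{
#ifdef __APPLE__
    /* mkdtemp creates the directory with mode 0700, so no other local
     * user can pre-create or swap the clone's destination path. */
    char dir[] = "/tmp/shm-snap.XXXXXX";
    if (mkdtemp(dir) == NULL)
        return -1;

    char path[sizeof(dir) + 8];
    snprintf(path, sizeof(path), "%s/clone", dir);

    int fd = -1;
    int err = 0;
    if (fclonefileat(src_fd, AT_FDCWD, path, 0) == 0) {
        fd = open(path, O_RDWR | O_CLOEXEC);
        if (fd < 0)
            err = errno;
        unlink(path);           /* the open fd keeps the clone alive */
    } else {
        err = errno;            /* e.g. ENOTSUP when /tmp is not APFS */
    }
    rmdir(dir);
    if (fd < 0)
        errno = err;            /* report the original failure, not rmdir's */
    return fd;
#else
    (void) src_fd;
    errno = ENOTSUP;            /* fclonefileat is macOS-only */
    return -1;
#endif
}
```

A caller following the per-guest fallback would treat -1 as "no snapshot": a Rosetta guest would drop to region copy, while a native guest would send the live fd instead.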
tests/bench-fork-cost.sh is added as the regression baseline. Per-fork wall-clock means from three back-to-back runs on M1 (subshell fork, no exec, per-fork numbers exclude startup via a 0-iter baseline subtraction):

| rss    | aarch64 CoW  | Rosetta (before / after) | ratio         |
|--------|--------------|--------------------------|---------------|
| 0 MiB  | ~113 ms/fork | ~1058-1196 / ~113 ms     | 10x -> ~1x    |
| 1 MiB  | ~113 ms/fork | ~1090-1250 / ~117 ms     | 10x -> ~1x    |
| 16 MiB | ~114 ms/fork | ~1125-1230 / ~120 ms     | 10x -> ~1x    |
| 64 MiB | ~114 ms/fork | ~1400-1840 / ~220 ms     | 12-16x -> ~2x |

The 64 MiB Rosetta residual is APFS clone metadata plus child-side MAP_PRIVATE materialization, not byte-copy bandwidth. test-cow-fork (5/5), make check, and the 71-test make test-rosetta-all suite stay green.

Close #45
Summary by cubic
Snapshot guest shared memory at fork with APFS fclonefileat to give the child a safe CoW snapshot, fixing Rosetta torn-read corruption and restoring fast-path performance (addresses #45). Adds per-guest fallbacks, syncs overlay data before the snapshot, and includes a fork-cost benchmark.

New Features
- [...] g->shm_fd via APFS fclonefileat; child maps the clone MAP_PRIVATE.
- [...] shm_fd.
- [...] shm_fd before snapshot; IPC header has_shm reflects the final path.
- [...] tests/bench-fork-cost.sh to track per-fork cost.

Bug Fixes
- [...] mkdtemp dir, then unlinking.
- guest_init_from_shm closes shm_fd if compute_infra_layout fails.
- [...] vfork_notify_fds[1], close snapshot fd on all exits) and reap failed children to prevent zombies.

Written for commit b724c9d. Summary will update on new commits.