Snapshot shm via APFS fclonefileat for safe CoW by jserv · Pull Request #60 · sysprog21/elfuse

jserv · 2026-05-30T09:56:41Z

Fork previously sent the parent's live shm_fd via SCM_RIGHTS and the child mapped it MAP_PRIVATE. The parent stayed on MAP_SHARED, so any page the child had not yet COW'd reflected the parent's current bytes. That was benign for typical aarch64 workloads but corrupted x86_64-via-Rosetta guests: translator-internal structures (TLS slabs, code caches, indirect-call tables, block lists) cross page boundaries, so the child could read them while the parent was mid-update. Issue #45 tracked the resulting fallback to a per-region byte copy through the IPC socket -- 10-16x slower per fork than the CoW path.
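The visibility problem described above can be reproduced outside elfuse. The following is a minimal sketch, not code from this PR: the function name and temp-file path are made up for illustration. POSIX leaves it unspecified whether a MAP_PRIVATE mapping sees later writes to the underlying file, and on common kernels an untouched private page typically does see them, which is the hazard here. Once the private side has written a page, it holds its own copy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo helper. Returns 1 if the MAP_PRIVATE view observed a
 * write made later through the MAP_SHARED view of the same file, 0 if it
 * did not, and -1 on setup failure. touch_first != 0 makes the private side
 * write to the page beforehand, i.e. the "already COW'd" case. */
int private_sees_shared_write(int touch_first)
{
    char path[] = "/tmp/cow-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    if (ftruncate(fd, page) < 0) {
        close(fd);
        return -1;
    }

    /* "parent" plays the MAP_SHARED side, "child" the MAP_PRIVATE side. */
    volatile char *parent =
        mmap(NULL, (size_t) page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    volatile char *child =
        mmap(NULL, (size_t) page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);

    int seen = -1;
    if (parent != MAP_FAILED && child != MAP_FAILED) {
        parent[0] = 'A';
        if (touch_first)
            child[1] = 'x'; /* private write: the page is copied now */
        parent[0] = 'B';    /* later parent-side update */
        seen = (child[0] == 'B');
    }
    if (parent != MAP_FAILED)
        munmap((void *) parent, (size_t) page);
    if (child != MAP_FAILED)
        munmap((void *) child, (size_t) page);
    return seen;
}
```

Calling it with touch_first = 1 should return 0 because the private copy already exists. With touch_first = 0 the result depends on the platform, which is why the live-fd path was only safe for guests that tolerate it.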

sys_clone now lifts the !g->is_rosetta gate and always asks fork_snapshot_shm_via_clonefile() for an APFS fclonefileat snapshot of the shm file. The clone shares blocks with the parent until either side writes, so the parent's later writes never reach the child's backing, and the existing guest_init_from_shm MAP_PRIVATE flow on the child consumes the snapshot unchanged. The snapshot helper uses a mode 0700 mkdtemp directory (clone inside, then unlink + rmdir) rather than an earlier mkstemp + unlink + fclonefileat sequence whose freed /tmp basename gave a local-user TOCTOU window that could DoS the fast path via EEXIST.
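The mkdtemp + fclonefileat sequence described above might look roughly like the sketch below. This is an illustration under assumptions, not the PR's fork_snapshot_shm_via_clonefile(): the function name, the /tmp template, and the "snap" basename are hypothetical, and the source file must live on the same APFS volume as the temporary directory. On non-Apple hosts the sketch simply reports ENOTSUP.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#ifdef __APPLE__
#include <sys/clonefile.h>
#endif

/* Hypothetical helper: returns an fd for a CoW clone of src_fd, or -1.
 * The clone is unlinked before returning, so only the fd keeps it alive. */
int snapshot_fd_via_clonefile(int src_fd)
{
    assert(src_fd >= 0);
#ifdef __APPLE__
    /* mkdtemp creates the directory with mode 0700, so no other local user
     * can pre-create the destination name inside it. */
    char dir[] = "/tmp/shm-snap-XXXXXX";
    if (!mkdtemp(dir))
        return -1;

    int fd = -1;
    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd >= 0) {
        if (fclonefileat(src_fd, dirfd, "snap", 0) == 0) {
            fd = openat(dirfd, "snap", O_RDWR);
            unlinkat(dirfd, "snap", 0); /* drop the name, keep the fd */
        }
        close(dirfd);
    }
    rmdir(dir);
    return fd;
#else
    errno = ENOTSUP; /* fclonefileat is an Apple API */
    return -1;
#endif
}
```

A caller would treat -1 as "no snapshot available" and choose a fallback. Note that the cleanup calls may overwrite errno on the Apple path, so a real helper would likely save it first.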

Fallback differs by guest. Rosetta drops use_shm on clonefile failure and falls through to the legacy region-copy path; sending the live fd would re-introduce the issue #45 corruption. Native guests keep use_shm and send g->shm_fd directly, preserving the original CoW behavior so a non-APFS /tmp does not silently slow forks down to per-region copy cost.
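The per-guest fallback reduces to a small decision table. The sketch below encodes it with hypothetical names (the enum and both functions are illustrative and do not appear in the source); only the three outcomes themselves come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three ways guest memory can reach the child. */
enum fork_transfer {
    XFER_CLONE_SNAPSHOT, /* fclonefileat snapshot, mapped MAP_PRIVATE */
    XFER_LIVE_SHM_FD,    /* original CoW path: send the live shm fd */
    XFER_REGION_COPY     /* legacy per-region byte copy over IPC */
};

enum fork_transfer choose_fork_transfer(bool is_rosetta, bool clone_ok)
{
    if (clone_ok)
        return XFER_CLONE_SNAPSHOT;
    /* Rosetta must not see the live fd; native guests keep the fast path. */
    return is_rosetta ? XFER_REGION_COPY : XFER_LIVE_SHM_FD;
}

const char *fork_transfer_name(enum fork_transfer t)
{
    switch (t) {
    case XFER_CLONE_SNAPSHOT: return "clone-snapshot";
    case XFER_LIVE_SHM_FD:    return "live-shm-fd";
    case XFER_REGION_COPY:    return "region-copy";
    }
    assert(!"unknown fork_transfer value");
    return "?";
}
```

Only the (Rosetta, clone failed) cell pays the region-copy cost; every other combination stays on a CoW path.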

Overlay sync (pwrite of file-backed MAP_SHARED overlay bytes into shm_fd) moves before the IPC header so the cloned file picks up overlay-backed bytes and the header has_shm field reflects the post-clonefile outcome.

guest_init_from_shm now closes shm_fd on the compute_infra_layout failure path so the take-ownership contract holds on every error, not just the post-mmap ones.
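A take-ownership contract of this kind can be illustrated with a toy initializer. The struct, the function names, and the layout_ok flag (a stand-in for the compute_infra_layout result) are all hypothetical; the point is only that the callee closes the fd on its earliest failure path as well, so the caller never has to guess who owns it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

struct guest_sketch {
    int shm_fd; /* owned by the struct after a successful init */
};

/* Takes ownership of shm_fd on every path: stored on success, closed on
 * failure. layout_ok simulates an early validation step. */
int guest_sketch_init(struct guest_sketch *g, int shm_fd, int layout_ok)
{
    assert(g != NULL && shm_fd >= 0);
    g->shm_fd = -1;
    if (!layout_ok) {
        close(shm_fd); /* early failure must not leak the fd */
        return -1;
    }
    g->shm_fd = shm_fd;
    return 0;
}

/* Returns nonzero if fd currently refers to an open descriptor. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}
```

After a failed call the descriptor is gone, so the caller must not close it again.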

tests/bench-fork-cost.sh is added as the regression baseline. Per-fork wall-clock means from three back-to-back runs on M1 (subshell fork, no exec, per-fork numbers exclude startup via a 0-iter baseline subtraction):

  rss      aarch64 CoW    Rosetta (before / after)   ratio
  0 MiB    ~113 ms/fork   ~1058-1196 / ~113 ms       10x -> ~1x
  1 MiB    ~113 ms/fork   ~1090-1250 / ~117 ms       10x -> ~1x
  16 MiB   ~114 ms/fork   ~1125-1230 / ~120 ms       10x -> ~1x
  64 MiB   ~114 ms/fork   ~1400-1840 / ~220 ms       12-16x -> ~2x

The 64 MiB Rosetta residual is APFS clone metadata plus child-side MAP_PRIVATE materialization, not byte-copy bandwidth. test-cow-fork (5/5), make check, and the 71-test make test-rosetta-all suite stay green.
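The 0-iter baseline subtraction mentioned for the benchmark amounts to the arithmetic below. The function name is hypothetical and the sketch does not reproduce tests/bench-fork-cost.sh; it only shows how a per-fork figure is derived from two wall-clock totals.

```c
#include <assert.h>

/* Per-fork cost in milliseconds: subtract the startup-only run (0 forks)
 * from the run with `iters` forks, then divide by the fork count. */
double per_fork_ms(double total_ms, double baseline_ms, unsigned iters)
{
    assert(iters > 0);
    return (total_ms - baseline_ms) / iters;
}
```

For example, with made-up inputs of a 400 ms baseline and a 1530 ms run of 10 forks, the result is (1530 - 400) / 10 = 113 ms per fork.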

Close #45

Summary by cubic

Snapshot guest shared memory at fork with APFS fclonefileat to give the child a safe CoW snapshot, fixing Rosetta torn-read corruption and restoring fast‑path performance (addresses #45). Adds per-guest fallbacks, syncs overlay data before the snapshot, and includes a fork-cost benchmark.

New Features
- Snapshot g->shm_fd via APFS fclonefileat; child maps the clone MAP_PRIVATE.
- Attempt clone for all guests; on failure, Rosetta falls back to region-copy, native guests send the live shm_fd.
- Sync overlay bytes into shm_fd before snapshot; IPC header has_shm reflects the final path.
- Add tests/bench-fork-cost.sh to track per‑fork cost.

Bug Fixes
- Remove TOCTOU by cloning inside a mode 0700 mkdtemp dir, then unlinking.
- Ensure guest_init_from_shm closes shm_fd if compute_infra_layout fails.
- Avoid fd double‑close/leaks (reset vfork_notify_fds[1], close snapshot fd on all exits) and reap failed children to prevent zombies.

Written for commit b724c9d. Summary will update on new commits.

Max042004 · 2026-05-31T16:27:21Z


 fail_snapshot:
    free(regions_snapshot);
+    if (snapshot_shm_fd >= 0)


The child elfuse process is already spawned by posix_spawn at src/runtime/forkipc.c:1324, well before any of the snapshot/IPC steps run. Every goto fail_snapshot after that point therefore unwinds with a live child, but the fail_snapshot: block (src/runtime/forkipc.c:1637) only does close(ipc_sock) (:1655) and returns -LINUX_ENOMEM (:1660). It never reaps the child, and nothing else does either — so the child becomes a zombie that lingers for the parent elfuse's lifetime.

Why it becomes a zombie

close(ipc_sock) makes the child exit (it reads EOF on the control socket and bails), but exiting is exactly what creates the zombie: on macOS a child that has exited is kept as a <defunct> entry until its parent waitpids it. elfuse installs no host SIGCHLD handler and sets neither SIG_IGN nor SA_NOCLDWAIT (the only host-side touch is pthread_sigmask(SIG_BLOCK, …) in src/core/rosetta.c:1094, which merely defers delivery), so the kernel will not auto-discard the status. Something must call waitpid — and on this path nothing does:

- Guest-level reaper (sys_wait4 → host wait4): the path returns -LINUX_ENOMEM, i.e. the guest's fork() reports failure, so the guest believes no child exists and never waits for it.
- Host-level reaper (proc_reap_finished() sweeps proc_table with waitpid(WNOHANG)): entries land in proc_table only via proc_register_child, which runs only on the success path at src/runtime/forkipc.c:1603. A child unwound through fail_snapshot is never registered, so the sweep never sees it.

Both reapers key off "child was registered / guest knows about it." On the failure path neither holds, so no waitpid for child_host_pid is ever issued and the zombie persists. For a long-running, fork-heavy guest that repeatedly hits this path, zombies accumulate and consume host PIDs.
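Why a registration-keyed sweep misses such a child can be shown with a toy version. Everything here is a hypothetical sketch (the table, the function names, and the waitid-based synchronization are not elfuse code): the sweep only issues waitpid for tracked pids, so an exited but untracked child stays a zombie until someone waits for it by pid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_TRACKED 8
static pid_t tracked[MAX_TRACKED];
static size_t n_tracked;

/* Stand-in for a "register child" step that only runs on success. */
void track_child(pid_t pid)
{
    assert(n_tracked < MAX_TRACKED);
    tracked[n_tracked++] = pid;
}

/* Non-blocking sweep over tracked pids; returns how many were reaped. */
int reap_finished(void)
{
    int reaped = 0;
    for (size_t i = 0; i < n_tracked;) {
        if (waitpid(tracked[i], NULL, WNOHANG) == tracked[i]) {
            tracked[i] = tracked[--n_tracked]; /* swap-remove the entry */
            reaped++;
        } else {
            i++;
        }
    }
    return reaped;
}

/* Forks a child that exits immediately with status 1. */
pid_t spawn_exiting_child(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(1);
    return pid;
}

/* Blocks until pid has exited but leaves it unreaped (WNOWAIT). */
int wait_until_exited(pid_t pid)
{
    siginfo_t info;
    return waitid(P_PID, (id_t) pid, &info, WEXITED | WNOWAIT);
}
```

If two children are spawned and only one is passed to track_child, reap_finished collects that one alone; the other remains waitable, which is the zombie case described above.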

Scope note — which new fail points actually reach this path

This is pre-existing debt in fail_snapshot, but this PR changes the odds of reaching it:

mkdtemp / fclonefileat / open inside fork_snapshot_shm_via_clonefile() return -1 and the caller falls back (rosetta → use_shm = false at :1455; native → live shm_fd). They do not goto fail_snapshot.

Only the overlay-sync pwrite failure does goto fail_snapshot, and that block is pre-existing (the PR only moved it ahead of the header).

The net new exposure is Rosetta-specific: pre-PR Rosetta ran with use_shm = false, so the overlay-sync pwrite block was skipped entirely and could not reach fail_snapshot; post-PR Rosetta runs use_shm = true (:1448) and now executes that pwrite, adding one new fail_snapshot trigger for Rosetta guests.

jserv · 2026-06-01

Confirmed and fixed.

Primary fix (this thread): every goto fail_snapshot lands after posix_spawn, so child_host_pid is always a live process there. After closing ipc_sock (which makes the child read EOF on fork_ipc_read_all and return nonzero from fork_child_main), reap it explicitly:

    pid_t reaped;
    do {
        reaped = waitpid(child_host_pid, NULL, 0);
    } while (reaped < 0 && errno == EINTR);
    if (reaped < 0)
        log_warn("clone: failed to reap fork-child pid=%d: %s",
                 (int) child_host_pid, strerror(errno));

A blocking waitpid is safe because fork_ipc_write_all only returns -1 when the write didn't deliver, so the child always sees EOF on a parent-side IPC failure (no pathological "child finished restore while parent thinks send failed" case to defend against).
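The close-then-reap ordering can be exercised in isolation with a socketpair standing in for the IPC socket. This is a hedged sketch with a hypothetical function name and exit code; it uses fork rather than posix_spawn for brevity, and the child's one-byte read is only a placeholder for the real restore protocol.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Simulates a parent-side failure after the child exists: close the control
 * socket so the child reads EOF and exits, then reap it with a blocking
 * waitpid. Returns the child's exit status, or -1 on error. */
int abort_fork_and_reap(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        close(sv[0]);
        char byte;
        ssize_t n;
        do {
            n = read(sv[1], &byte, 1);
        } while (n < 0 && errno == EINTR);
        _exit(n == 1 ? 0 : 3); /* EOF or error: bail out nonzero */
    }

    close(sv[1]); /* parent must not hold the child's end open */
    close(sv[0]); /* failure path: the child now sees EOF */

    int status;
    pid_t reaped;
    do {
        reaped = waitpid(pid, &status, 0);
    } while (reaped < 0 && errno == EINTR);
    if (reaped < 0 || !WIFEXITED(status))
        return -1;
    assert(reaped == pid);
    return WEXITSTATUS(status);
}
```

Because the child cannot make progress without data on the socket, the blocking wait returns promptly; the design relies on the parent having closed every copy of its socket end first.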

Bonus fix found while reviewing this patch: vfork_notify_fds[1] is closed at the post-spawn cleanup (line 1342) but never reset to -1, so the guarded close(vfork_notify_fds[1]) in fail_snapshot double-closes. In a multithreaded guest another vCPU can open a new fd between the two closes and the second close would steal it. Pre-existing, but adjacent to the lines this PR touches, so folded in: vfork_notify_fds[1] = -1; after the first close.
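The reset-after-close idiom can be wrapped in a tiny helper so that a later guarded close becomes a no-op. The helper name is hypothetical; the source only states that vfork_notify_fds[1] is set to -1 after the first close.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Closes *fd once and marks it invalid, so a second call cannot close a
 * descriptor number that another thread has since reused. */
void close_and_reset(int *fd)
{
    assert(fd != NULL);
    if (*fd >= 0) {
        close(*fd);
        *fd = -1;
    }
}
```

With this shape, cleanup code can call close_and_reset on the same slot from both the normal path and the failure label without tracking which one ran first.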

Validation: temporarily injected goto fail_snapshot immediately after posix_spawn, ran a fork-then-sleep guest, and ps -axo pid,ppid,stat under the elfuse parent showed:

- with fix: 0 zombies, fork() returns ENOMEM, no hang
- without fix (control): 1 <defunct> child (STAT=ZN)

jserv requested a review from Max042004 May 30, 2026 09:56

Max042004 reviewed May 31, 2026

jserv force-pushed the rosetta-fork branch from 8a76acd to b724c9d Compare June 1, 2026 03:35

jserv requested a review from Max042004 June 1, 2026 03:40

Max042004 approved these changes Jun 1, 2026

jserv merged commit b1ce739 into main Jun 1, 2026
4 checks passed

jserv deleted the rosetta-fork branch June 1, 2026 13:29
