Snapshot shm via APFS fclonefileat for safe CoW #60

Merged: jserv merged 1 commit into main from rosetta-fork on Jun 1, 2026

@jserv jserv commented May 30, 2026

Fork previously sent the parent's live shm_fd via SCM_RIGHTS, and the child mapped it MAP_PRIVATE. The parent stayed on MAP_SHARED, so any page the child had not yet COW'd reflected the parent's current bytes. That was benign for typical aarch64 workloads but corrupted x86_64-via-Rosetta guests: translator-internal structures (TLS slabs, code caches, indirect-call tables, block lists) span page boundaries, so the child could observe them torn in the middle of a parent-side update. Issue #45 tracked the resulting fallback to a per-region byte copy through the IPC socket, which is 10-16x slower per fork than the CoW path.

sys_clone now lifts the !g->is_rosetta gate and always asks fork_snapshot_shm_via_clonefile() for an APFS fclonefileat snapshot of the shm file. The clone shares blocks with the parent until either side writes, so the parent's later writes never reach the child's backing, and the existing guest_init_from_shm MAP_PRIVATE flow on the child consumes the snapshot unchanged. The snapshot helper uses a mode 0700 mkdtemp directory (clone inside, then unlink + rmdir) rather than an earlier mkstemp + unlink + fclonefileat sequence whose freed /tmp basename gave a local-user TOCTOU window that could DoS the fast path via EEXIST.

Fallback differs by guest. Rosetta drops use_shm on clonefile failure and falls through to the legacy region-copy path; sending the live fd would re-introduce the issue #45 corruption. Native guests keep use_shm and send g->shm_fd directly, preserving the original CoW behavior so a non-APFS /tmp does not silently slow forks down to per-region copy cost.
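The per-guest fallback can be summarized as a small decision function. The enum and function names below are hypothetical and not taken from the patch; the sketch only encodes the policy stated above.

```c
#include <stdbool.h>

/* Hypothetical names; they only label the three paths described above. */
enum fork_mem_path {
    FORK_MEM_CLONE_SNAPSHOT, /* child maps the fclonefileat snapshot */
    FORK_MEM_LIVE_SHM,       /* child maps the parent's live shm_fd */
    FORK_MEM_REGION_COPY     /* bytes are copied through the IPC socket */
};

/* Picks the memory-transfer path for one fork attempt. */
static enum fork_mem_path choose_fork_mem_path(bool is_rosetta, bool clone_ok)
{
    if (clone_ok)
        return FORK_MEM_CLONE_SNAPSHOT;
    /* Rosetta must never see the live fd again (issue #45), so it pays
     * the copy cost; native guests keep the original CoW behavior. */
    return is_rosetta ? FORK_MEM_REGION_COPY : FORK_MEM_LIVE_SHM;
}
```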

Overlay sync (pwrite of file-backed MAP_SHARED overlay bytes into shm_fd) moves before the IPC header so the cloned file picks up overlay-backed bytes and the header has_shm field reflects the post-clonefile outcome.
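For readers unfamiliar with the overlay sync step, a full-length positional write might look like the sketch below. The description only says the bytes are written with pwrite; the helper name `pwrite_all` and its short-write and EINTR handling are assumptions, not the patch's code.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes exactly len bytes at offset off, retrying on
 * short writes and EINTR. Returns 0 on success, -1 on failure. */
static int pwrite_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted before writing anything */
            return -1;
        }
        if (n == 0) {
            errno = EIO; /* no progress; avoid spinning forever */
            return -1;
        }
        p += n;
        len -= (size_t) n;
        off += n;
    }
    return 0;
}
```

Running this before the clone is what lets the snapshot file contain the overlay-backed bytes; a failure here is the case that unwinds through fail_snapshot.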

guest_init_from_shm now closes shm_fd on the compute_infra_layout failure path so the take-ownership contract holds on every error, not just the post-mmap ones.

tests/bench-fork-cost.sh is added as the regression baseline. Per-fork wall-clock means from three back-to-back runs on M1 (subshell fork, no exec, per-fork numbers exclude startup via a 0-iter baseline subtraction):

  rss      aarch64 CoW    Rosetta (before / after)   ratio
  0 MiB    ~113 ms/fork   ~1058-1196 / ~113 ms       10x -> ~1x
  1 MiB    ~113 ms/fork   ~1090-1250 / ~117 ms       10x -> ~1x
  16 MiB   ~114 ms/fork   ~1125-1230 / ~120 ms       10x -> ~1x
  64 MiB   ~114 ms/fork   ~1400-1840 / ~220 ms       12-16x -> ~2x

The 64 MiB Rosetta residual is APFS clone metadata plus child-side MAP_PRIVATE materialization, not byte-copy bandwidth. test-cow-fork (5/5), make check, and the 71-test make test-rosetta-all suite stay green.
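The baseline subtraction used for the per-fork numbers amounts to the arithmetic below. The function name is hypothetical, and the benchmark itself is a shell script, so this is only an illustration of the calculation.

```c
/* Hypothetical helper: per-fork cost with startup removed.
 * total_ms    - wall-clock time of a run that performs iters forks
 * baseline_ms - wall-clock time of the same run with 0 iterations */
static double per_fork_ms(double total_ms, double baseline_ms, unsigned iters)
{
    if (iters == 0)
        return 0.0; /* the baseline run itself has no per-fork cost */
    return (total_ms - baseline_ms) / (double) iters;
}
```

For example, with made-up inputs of a 1180 ms run, a 50 ms baseline, and 10 iterations, the helper yields 113 ms per fork.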

Close #45


Summary by cubic

Snapshot guest shared memory at fork with APFS fclonefileat to give the child a safe CoW snapshot, fixing Rosetta torn-read corruption and restoring fast‑path performance (addresses #45). Adds per-guest fallbacks, syncs overlay data before the snapshot, and includes a fork-cost benchmark.

  • New Features

    • Snapshot g->shm_fd via APFS fclonefileat; child maps the clone MAP_PRIVATE.
    • Attempt clone for all guests; on failure, Rosetta falls back to region-copy, native guests send the live shm_fd.
    • Sync overlay bytes into shm_fd before snapshot; IPC header has_shm reflects the final path.
    • Add tests/bench-fork-cost.sh to track per‑fork cost.
  • Bug Fixes

    • Remove TOCTOU by cloning inside a mode 0700 mkdtemp dir, then unlinking.
    • Ensure guest_init_from_shm closes shm_fd if compute_infra_layout fails.
    • Avoid fd double‑close/leaks (reset vfork_notify_fds[1], close snapshot fd on all exits) and reap failed children to prevent zombies.

Written for commit b724c9d. Summary will update on new commits.

@jserv jserv requested a review from Max042004 May 30, 2026 09:56
cubic-dev-ai[bot]

This comment was marked as resolved.

Comment thread src/runtime/forkipc.c

fail_snapshot:
free(regions_snapshot);
if (snapshot_shm_fd >= 0)

The child elfuse process is already spawned by posix_spawn at src/runtime/forkipc.c:1324, well before any of the snapshot/IPC steps run. Every goto fail_snapshot after that point therefore unwinds with a live child, but the fail_snapshot: block (src/runtime/forkipc.c:1637) only does close(ipc_sock) (:1655) and returns -LINUX_ENOMEM (:1660). It never reaps the child, and nothing else does either — so the child becomes a zombie that lingers for the parent elfuse's lifetime.

Why it becomes a zombie

close(ipc_sock) makes the child exit (it reads EOF on the control socket and bails), but exiting is exactly what creates the zombie: on macOS a child that has exited is kept as a <defunct> entry until its parent waitpids it. elfuse installs no host SIGCHLD handler and sets neither SIG_IGN nor SA_NOCLDWAIT (the only host-side touch is pthread_sigmask(SIG_BLOCK, …) in src/core/rosetta.c:1094, which merely defers delivery), so the kernel will not auto-discard the status. Something must call waitpid — and on this path nothing does:

  • Guest-level reaper (sys_wait4 → host wait4): the path returns -LINUX_ENOMEM, i.e. the guest's fork() reports failure, so the guest believes no child exists and never waits for it.
  • Host-level reaper (proc_reap_finished() sweeps proc_table with waitpid(WNOHANG)): entries land in proc_table only via proc_register_child, which runs only on the success path at src/runtime/forkipc.c:1603. A child unwound through fail_snapshot is never registered, so the sweep never sees it.

Both reapers key off "child was registered / guest knows about it." On the failure path neither holds, so no waitpid for child_host_pid is ever issued and the zombie persists. For a long-running, fork-heavy guest that repeatedly hits this path, zombies accumulate and consume host PIDs.

Scope note — which new fail points actually reach this path

This is pre-existing debt in fail_snapshot, but this PR changes the odds of reaching it:

  • mkdtemp / fclonefileat / open inside fork_snapshot_shm_via_clonefile() return -1 and the caller falls back (rosetta → use_shm = false at :1455; native → live shm_fd). They do not goto fail_snapshot.
  • Only the overlay-sync pwrite failure does goto fail_snapshot, and that block is pre-existing (the PR only moved it ahead of the header).
  • The net new exposure is Rosetta-specific: pre-PR Rosetta ran with use_shm = false, so the overlay-sync pwrite block was skipped entirely and could not reach fail_snapshot; post-PR Rosetta runs use_shm = true (:1448) and now executes that pwrite, adding one new fail_snapshot trigger for Rosetta guests.

@jserv jserv Jun 1, 2026

Confirmed and fixed.

Primary fix (this thread): every goto fail_snapshot lands after posix_spawn, so child_host_pid is always a live process there. After closing ipc_sock (which makes the child read EOF on fork_ipc_read_all and return nonzero from fork_child_main), reap it explicitly:

pid_t reaped;
do {
    reaped = waitpid(child_host_pid, NULL, 0);
} while (reaped < 0 && errno == EINTR);
if (reaped < 0)
    log_warn("clone: failed to reap fork-child pid=%d: %s",
             (int) child_host_pid, strerror(errno));

A blocking waitpid is safe because fork_ipc_write_all only returns -1 when the write didn't deliver, so the child always sees EOF on a parent-side IPC failure (no pathological "child finished restore while parent thinks send failed" case to defend against).

Bonus fix found while reviewing this patch: vfork_notify_fds[1] is closed at the post-spawn cleanup (line 1342) but never reset to -1, so the guarded close(vfork_notify_fds[1]) in fail_snapshot double-closes. In a multithreaded guest another vCPU can open a new fd between the two closes and the second close would steal it. Pre-existing, but adjacent to the lines this PR touches, so folded in: vfork_notify_fds[1] = -1; after the first close.

Validation: temporarily injected goto fail_snapshot immediately after posix_spawn, ran a fork-then-sleep guest, and ps -axo pid,ppid,stat under the elfuse parent showed:

  • with fix: 0 zombies, fork() returns ENOMEM, no hang
  • without fix (control): 1 <defunct> child (STAT=ZN)

@jserv jserv requested a review from Max042004 June 1, 2026 03:40
@jserv jserv merged commit b1ce739 into main Jun 1, 2026
4 checks passed
@jserv jserv deleted the rosetta-fork branch June 1, 2026 13:29

Development

Successfully merging this pull request may close these issues.

fork/clone falls back to full guest-memory copy for x86 (Rosetta) guests, losing CoW
