Cut dynamic-linker startup syscalls #62

Open
jserv wants to merge 1 commit into main from startup
@jserv jserv commented May 30, 2026

The dynamic-linker bring-up storm was the largest remaining startup band after pull request #34. A per-syscall histogram identified three costs: the sidecar walker as the dominant openat cost (61% of getent startup), the per-call path_translation_t memset as the second-largest source, and the opened_fd_type fstat as a small but real per-open round-trip.

src/debug/syscall-hist.[ch]: opt-in histogram via ELFUSE_STARTUP_TRACE=syscalls (or =all alongside the existing step trace). It keeps lock-free atomic counters per Linux syscall number and sorts the dump by total nanoseconds, descending. Recording freezes on the first successful execve so steady-state traffic does not pollute the startup picture. Fork children disable the histogram explicitly because they resume from a parent snapshot, not a fresh bring-up.
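The histogram design described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/debug/syscall-hist.[ch]: the names hist_record, hist_freeze, hist_snapshot, and the table size are all assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define HIST_MAX_SYSCALLS 512 /* hypothetical table size */

typedef struct {
    atomic_uint_fast64_t calls;
    atomic_uint_fast64_t total_ns;
} hist_slot_t;

/* Static storage: zero-initialized atomics are in a valid state. */
static hist_slot_t hist_slots[HIST_MAX_SYSCALLS];
static atomic_bool hist_frozen;

/* Record one syscall; a no-op once the histogram has been frozen. */
static void hist_record(unsigned nr, uint64_t elapsed_ns)
{
    if (nr >= HIST_MAX_SYSCALLS ||
        atomic_load_explicit(&hist_frozen, memory_order_relaxed))
        return;
    atomic_fetch_add_explicit(&hist_slots[nr].calls, 1,
                              memory_order_relaxed);
    atomic_fetch_add_explicit(&hist_slots[nr].total_ns, elapsed_ns,
                              memory_order_relaxed);
}

/* Called at the point where startup is considered finished. */
static void hist_freeze(void)
{
    atomic_store_explicit(&hist_frozen, true, memory_order_relaxed);
}

typedef struct {
    unsigned nr;
    uint64_t calls;
    uint64_t total_ns;
} hist_row_t;

/* Sort by total time descending, then by syscall number ascending. */
static int hist_row_cmp(const void *a, const void *b)
{
    const hist_row_t *x = a, *y = b;
    if (x->total_ns != y->total_ns)
        return x->total_ns < y->total_ns ? 1 : -1;
    return (x->nr > y->nr) - (x->nr < y->nr);
}

/* Copy non-empty slots into rows[] and sort them; returns the row count. */
static size_t hist_snapshot(hist_row_t *rows, size_t cap)
{
    size_t n = 0;
    for (unsigned nr = 0; nr < HIST_MAX_SYSCALLS && n < cap; nr++) {
        uint64_t calls = atomic_load_explicit(&hist_slots[nr].calls,
                                              memory_order_relaxed);
        if (calls == 0)
            continue;
        uint64_t total = atomic_load_explicit(&hist_slots[nr].total_ns,
                                              memory_order_relaxed);
        rows[n++] = (hist_row_t){ nr, calls, total };
    }
    qsort(rows, n, sizeof rows[0], hist_row_cmp);
    return n;
}
```

Relaxed ordering is enough in this sketch because the counters are independent tallies read only after recording has stopped; a record that races with the freeze may or may not be counted, which is acceptable for a diagnostic dump.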

src/syscall/sidecar.c: two caches. First, a per-directory absence cache keyed by (st_dev, st_ino, mtime, ctime) lets the walker skip the openat for .elfuse-sidecar-index when a recent fstat on the same dirfd already saw ENOENT. The mtime/ctime in the key closes ABA naturally and makes a cross-process index publish observable without explicit invalidation. Second, a cached sysroot dirfd is handed out via fcntl(F_DUPFD_CLOEXEC, 0), so each translated absolute path saves the ~30 us open(sysroot) round-trip and the dup carries CLOEXEC across any racing posix_spawn.
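A rough sketch of both sidecar caches, under stated assumptions: dir_key_t, the direct-mapped table, and every function name here are hypothetical, the key is filled in by the caller from an fstat result, and locking is omitted. Only open and fcntl(F_DUPFD_CLOEXEC) are real API calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Identity of a directory at one point in time (illustrative layout). */
typedef struct {
    uint64_t dev, ino;
    int64_t mtime_ns, ctime_ns;
} dir_key_t;

#define ABSENT_SLOTS 64 /* hypothetical cache size */

typedef struct {
    bool valid;
    dir_key_t key;
} absent_slot_t;

static absent_slot_t absent_cache[ABSENT_SLOTS];

static bool dir_key_eq(const dir_key_t *a, const dir_key_t *b)
{
    return a->dev == b->dev && a->ino == b->ino &&
           a->mtime_ns == b->mtime_ns && a->ctime_ns == b->ctime_ns;
}

/* Slot choice ignores the timestamps so a changed directory
 * overwrites its own stale entry. */
static unsigned dir_key_slot(const dir_key_t *k)
{
    return (unsigned) ((k->dev * 31u + k->ino) % ABSENT_SLOTS);
}

/* Remember that the index file was absent in this directory state. */
static void absent_remember(const dir_key_t *k)
{
    absent_slot_t *s = &absent_cache[dir_key_slot(k)];
    s->key = *k;
    s->valid = true;
}

/* True only if the same directory, with unchanged timestamps, was
 * already seen without an index; a publish changes mtime and misses. */
static bool absent_lookup(const dir_key_t *k)
{
    const absent_slot_t *s = &absent_cache[dir_key_slot(k)];
    return s->valid && dir_key_eq(&s->key, k);
}

static int sysroot_fd = -1;

/* Open the sysroot once, then hand out close-on-exec duplicates. */
static int sysroot_dup(const char *sysroot)
{
    if (sysroot_fd < 0) {
        sysroot_fd = open(sysroot, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
        if (sysroot_fd < 0)
            return -1;
    }
    return fcntl(sysroot_fd, F_DUPFD_CLOEXEC, 0);
}

/* Drop the cached descriptor, e.g. if the sysroot path changes. */
static void sysroot_cache_reset(void)
{
    if (sysroot_fd >= 0)
        close(sysroot_fd);
    sysroot_fd = -1;
}
```

Because the timestamps are part of the key rather than a separate validity check, a stale entry is never consulted: any change to the directory produces a different key and therefore a miss.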

src/syscall/path.c: drop the per-call zero-init of path_translation_t. The struct is ~12 KiB (24 metadata bytes plus three LINUX_PATH_MAX buffers), and each buffer is written by its resolver before it is read. The memset of all three was the dominant remaining cost after the sidecar caches.
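To illustrate the idea, here is a sketch of initializing only the fields that are read before being written. The field names and the layout of path_translation_t below are assumptions for the example (the real struct is not shown in this description), as are the helper names.

```c
#include <stddef.h>
#include <string.h>

#define LINUX_PATH_MAX 4096

/* Hypothetical layout: three length fields plus three path buffers. */
typedef struct {
    size_t guest_len, host_len, scratch_len;
    char guest[LINUX_PATH_MAX];
    char host[LINUX_PATH_MAX];
    char scratch[LINUX_PATH_MAX];
} path_translation_t;

/* Touch a handful of bytes instead of clearing ~12 KiB. */
static void path_translation_init(path_translation_t *t)
{
    t->guest_len = t->host_len = t->scratch_len = 0;
    t->guest[0] = t->host[0] = t->scratch[0] = '\0';
}

/* A resolver writes its buffer in full (including the terminator)
 * before anyone reads it, so prior contents never matter. */
static int path_translation_set_guest(path_translation_t *t,
                                      const char *path)
{
    size_t n = strlen(path);
    if (n >= LINUX_PATH_MAX)
        return -1;
    memcpy(t->guest, path, n + 1);
    t->guest_len = n;
    return 0;
}
```

The safety argument rests on every consumer reading at most up to the terminator or the recorded length; any code path that reads a buffer it has not written would need an explicit initialization.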

src/core/elf.c: skip the redundant memset of the file-data range in elf_map_segments. The loader previously zeroed the full page-aligned segment extent before issuing fread; now only the BSS portion plus page padding (filesz to zero_len) is zeroed.
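A simplified sketch of the loader change, assuming zero_len is the page-aligned end of the segment in memory; segment_load and align_up are hypothetical names, and the real elf_map_segments handles mapping and offsets that are left out here.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Round n up to a multiple of page (page must be a power of two). */
static size_t align_up(size_t n, size_t page)
{
    return (n + page - 1) & ~(page - 1);
}

/* Read filesz bytes of file data into seg, then zero only the tail
 * [filesz, zero_len): the BSS portion plus page padding. The file-data
 * range is not pre-zeroed because fread overwrites it completely. */
static int segment_load(FILE *f, unsigned char *seg, size_t filesz,
                        size_t memsz, size_t page)
{
    if (filesz > memsz)
        return -1;
    size_t zero_len = align_up(memsz, page);
    if (fread(seg, 1, filesz, f) != filesz)
        return -1;
    memset(seg + filesz, 0, zero_len - filesz);
    return 0;
}
```

For a text segment where filesz is close to memsz, this reduces the zeroed range from the whole segment to at most the padding of the final page.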

src/syscall/fs.c: skip opened_fd_type fstat when neither O_PATH nor O_DIRECTORY is set. Dynamic-linker opens are overwhelmingly regular files where the type is already implied. In the corner case where a guest opens a directory without O_DIRECTORY and then issues getdents, the call now returns ENOTDIR; glibc fdopendir has required O_DIRECTORY since 2009 and the test corpus does not exercise this corner case.
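The fast path can be sketched like this. The flag constants use illustrative bit values rather than the Linux ABI ones, and fd_type_t, the call counter, and classify_path are assumptions added so the behavior can be observed; only the shape of the check follows the description above.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative guest flag bits, NOT the real Linux O_* values. */
#define GUEST_O_DIRECTORY 0x1u
#define GUEST_O_PATH      0x2u

typedef enum { FD_ERROR = -1, FD_REGULAR, FD_DIR } fd_type_t;

/* Classify a freshly opened descriptor. The fstat is issued only when
 * the guest flags say the type might not be a regular file;
 * *fstat_calls counts how often the slow path ran. */
static fd_type_t opened_fd_type(int host_fd, unsigned guest_flags,
                                unsigned *fstat_calls)
{
    if (!(guest_flags & (GUEST_O_PATH | GUEST_O_DIRECTORY)))
        return FD_REGULAR; /* fast path: no syscall */

    struct stat st;
    (*fstat_calls)++;
    if (fstat(host_fd, &st) < 0)
        return FD_ERROR;
    return S_ISDIR(st.st_mode) ? FD_DIR : FD_REGULAR;
}

/* Convenience wrapper for experimenting: open, classify, close. */
static fd_type_t classify_path(const char *path, unsigned guest_flags,
                               unsigned *fstat_calls)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return FD_ERROR;
    fd_type_t t = opened_fd_type(fd, guest_flags, fstat_calls);
    close(fd);
    return t;
}
```

Calling classify_path("/", 0, &calls) shows the trade-off directly: the directory is recorded as FD_REGULAR with no fstat, which is why a later getdents on such a descriptor fails with ENOTDIR.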

src/core/startup-trace.h: env parsing extended to comma-separated tokens (steps, syscalls, all); legacy =1 keeps enabling steps only so existing scripts keep working.
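A minimal sketch of the token parsing, assuming a bitmask result and that unknown tokens are silently ignored (the description does not say how they are handled). TRACE_STEPS, TRACE_SYSCALLS, and trace_parse are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

enum {
    TRACE_STEPS    = 1u << 0,
    TRACE_SYSCALLS = 1u << 1
};

/* True if the len-byte token at s equals the string word. */
static int token_is(const char *s, size_t len, const char *word)
{
    return len == strlen(word) && memcmp(s, word, len) == 0;
}

/* Parse a comma-separated list such as "steps,syscalls" or "1". */
static unsigned trace_parse(const char *env)
{
    unsigned mask = 0;
    if (env == NULL)
        return 0;
    while (*env != '\0') {
        size_t len = strcspn(env, ",");
        if (token_is(env, len, "1") || token_is(env, len, "steps"))
            mask |= TRACE_STEPS; /* legacy "1" means steps only */
        else if (token_is(env, len, "syscalls"))
            mask |= TRACE_SYSCALLS;
        else if (token_is(env, len, "all"))
            mask |= TRACE_STEPS | TRACE_SYSCALLS;
        env += len;
        if (*env == ',')
            env++;
    }
    return mask;
}
```

A caller would typically pass getenv("ELFUSE_STARTUP_TRACE") and test the returned bits, so an unset variable disables both traces.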

Measurement: 30-run distributions under ELFUSE_STARTUP_TRACE=syscalls, warm cache:

  bench-hot-guard-glibc startup syscalls:
    5.225 ms baseline (single sample) -> 1.33 ms p50
    (p25 1.21, p75 1.55, stdev 0.45, n=30)         3.9x
  bench openat per-call:
    135 us baseline -> 33.4 us p50
    (p25 32.4, p75 35.8, stdev 7.1, n=30)          4.0x
  getent passwd root startup syscalls:
    7.478 ms baseline -> 2.22 ms p50
    (p25 2.10, p75 2.28, stdev 0.27, n=30)         3.4x
  getent openat per-call:
    230 us baseline -> 52.9 us p50
    (p25 51.5, p75 55.1, stdev 2.2, n=30)          4.3x

End-to-end wall-clock for getent: 14.6 ms p50 (p25 14.3, p75 15.1, stdev 1.18, n=30). Bench guardrail steady-state: static getpid 74 ns, clock_gettime 6.7 ns, urandom1 153 ns; dynamic-glibc getpid 53 ns, clock_gettime 6.4 ns, urandom1 142 ns. All under ceilings.

The original baselines were single first-run samples; their variance band was not measured, so the speedup ratios are best-effort relative to the cited starting point.

Lazy FD_REGULAR to FD_DIR promotion in sys_getdents64 was attempted but dropped after both reviewers flagged a HIGH-severity ABA hole: a sibling close+reopen between the probe and the install could land the original directory's DIR* onto a fresh regular file's slot. The fix path (fd-slot generation counter or stat+inode comparison under fd_lock) was invasive enough that the lazy promotion did not pay for its complexity.
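For context, the generation-counter variant of the fix path could look roughly like the sketch below. This approach was not adopted in this pull request; the slot layout and function names are hypothetical, the void pointer stands in for a DIR*, and the fd_lock that would have to be held around each call is not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { SLOT_FREE, SLOT_REGULAR, SLOT_DIR } slot_type_t;

typedef struct {
    slot_type_t type;
    unsigned gen; /* bumped every time the slot is (re)allocated */
    void *dir;    /* stands in for a DIR* */
} fd_slot_t;

/* Allocate the slot for a new open; this starts a new incarnation. */
static void slot_open(fd_slot_t *s, slot_type_t type)
{
    s->type = type;
    s->gen++;
    s->dir = NULL;
}

static void slot_close(fd_slot_t *s)
{
    s->type = SLOT_FREE;
    s->dir = NULL;
}

/* Probe step: remember which incarnation of the slot was inspected. */
static unsigned slot_probe(const fd_slot_t *s)
{
    return s->gen;
}

/* Install step: attach the directory stream only if the slot is still
 * the incarnation that was probed. A close+reopen in between changes
 * gen, so the stale promotion is rejected instead of landing on the
 * new file. */
static bool slot_promote(fd_slot_t *s, unsigned seen_gen, void *dir)
{
    if (s->type == SLOT_FREE || s->gen != seen_gen)
        return false;
    s->type = SLOT_DIR;
    s->dir = dir;
    return true;
}
```

Even in this reduced form, every open, close, and promote path has to agree on the counter, which gives a sense of why the change was judged too invasive for the benefit.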


Summary by cubic

Cuts dynamic-linker startup syscall cost by 3–4x with an opt-in per-syscall histogram and fast paths in the sidecar walker and path translation. Warm-cache startup drops from 5.2–7.5 ms to 1.3–2.2 ms p50; openat calls are ~4x faster.

  • New Features

    • Per-syscall histogram via ELFUSE_STARTUP_TRACE=syscalls or all; initialized at process start, freezes at first execve, disabled in fork children, dumps once before teardown, sorted by total time.
    • Startup trace env supports comma-separated tokens: steps, syscalls, all; legacy 1 aliases steps and also works in combos (e.g., 1,syscalls).
  • Refactors

    • Sidecar: per-directory ENOENT cache keyed by dev/ino + mtime/ctime; cached sysroot dirfd via F_DUPFD_CLOEXEC to avoid repeated opens.
    • Path translate: remove 12 KiB per-call memset; set only needed fields.
    • ELF loader: zero only BSS and page padding; skip zeroing file-data range.
    • FS: skip fstat in opened_fd_type unless O_PATH or O_DIRECTORY; getdents64 on non-directory FDs now returns ENOTDIR.

Written for commit f4782bb. Summary will update on new commits.


@jserv jserv requested a review from Max042004 May 30, 2026 19:23
cubic-dev-ai[bot]

This comment was marked as resolved.

@Max042004 Max042004 left a comment

The commit message describes changes to src/syscall/fs.c, but src/syscall/fs.c does not appear in the list of changed files.
