This extends the synthetic vDSO at AT_SYSINFO_EHDR with four new fast
paths and rebuilds the anchor publication protocol so the anchor can be
refreshed safely while concurrent guest readers are interpolating.
New trampolines, each ending in its own SVC fallback so the dynamic
linker sees a complete __kernel_* symbol:
  __kernel_clock_getres    76 B / 19 instructions. Returns {0, 1}
                           inline for REALTIME, MONOTONIC, MONOTONIC_RAW,
                           REALTIME_COARSE, MONOTONIC_COARSE, BOOTTIME.
                           CPU and dynamic per-pid clockids SVC out.
  __kernel_gettimeofday    160 B / 40 instructions. Mirrors clock_gettime
                           using the REALTIME anchor and divides by 1000
                           for tv_usec. tz, if non-NULL, gets one
                           str xzr to clear the obsolete struct timezone.
  __kernel_getcpu          52 B / 13 instructions. Stores zero to both
                           out-pointers (elfuse models one CPU / one
                           node) and returns 0.
  __kernel_clock_gettime   grows to 168 B / 42 instructions to hold the
                           anchor-age cap and the seqlock recheck added
                           to the existing CNTVCT trampoline.
VDSO_NUM_SYMS goes from 4 to 5; dynstr_data widens to 119 bytes; all
post-text section offsets shift, but the image still ends at 0x6A0,
inside the 4 KiB page.
The vvar's first uint32 is now a Linux-style seqlock counter:
  0             unseeded, no anchor data
  odd  N >= 1   writer has reserved generation (N+1)/2
  even N >= 2   stable generation N/2, anchor fields readable
vdso_seed_anchor publishes through one CAS-then-release-store sequence
that handles both initial seeding and refresh:
  load(seq, ACQUIRE), bail if odd
  CAS(seq, cur, cur+1, ACQUIRE, RELAXED), bail on contention
  thread_fence(RELEASE)          // CAS odd-publish before field stores
  store(field_i, RELAXED) * 5
  store(seq, cur+2, RELEASE)     // publish next even
The thread_fence(RELEASE) lowers to DMB ISH on AArch64 and closes the
window where another CPU could observe the relaxed field stores before
the odd-publish: the ACQUIRE on the CAS only orders later accesses after
its load half, and ARMv8 may otherwise make stores to different
locations visible out of program order. Without the fence, a reader
whose snapshot LDAR still saw the old even seq could read fields from
the new generation and recheck the same old even seq, accepting torn
data.
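A minimal C sketch of the writer sequence above, using the GCC/Clang `__atomic` builtins. The struct layout and the names `vvar_sketch` and `seed_anchor_sketch` are illustrative assumptions, not the real vvar layout or the actual vdso_seed_anchor:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ANCHOR_FIELDS 5              /* the description stores five fields */

/* Hypothetical layout: seq counter first, then the anchor fields. */
struct vvar_sketch {
    uint32_t seq;
    uint64_t field[ANCHOR_FIELDS];
};

/* Returns true if this caller published a new generation. */
static bool seed_anchor_sketch(struct vvar_sketch *v,
                               const uint64_t next[ANCHOR_FIELDS])
{
    assert(v != NULL && next != NULL);

    uint32_t cur = __atomic_load_n(&v->seq, __ATOMIC_ACQUIRE);
    if (cur & 1u)
        return false;                /* another writer holds the slot */

    /* Reserve the next generation: even -> odd. */
    if (!__atomic_compare_exchange_n(&v->seq, &cur, cur + 1u, false,
                                     __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return false;                /* lost the race, give up */

    /* Order the odd value before every field store that follows. */
    __atomic_thread_fence(__ATOMIC_RELEASE);

    for (int i = 0; i < ANCHOR_FIELDS; i++)
        __atomic_store_n(&v->field[i], next[i], __ATOMIC_RELAXED);

    /* Publish the next even value; the field stores are ordered before it. */
    __atomic_store_n(&v->seq, cur + 2u, __ATOMIC_RELEASE);
    return true;
}
```

A caller that gets `false` back simply skips the refresh; the writer that won the race publishes an equally fresh anchor.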
Trampoline readers snapshot the seq with LDAR, fall back on 0 or odd,
plain-load anchor fields, DMB ISHLD, LDAR the seq again, and SVC on
mismatch. The DMB ISHLD is load-bearing: LDAR provides forward acquire
only, so without it the recheck load can be observed before the plain
field LDR/LDPs complete and the race goes undetected. The host helpers
(vdso_anchor_age_exceeded, vdso_realtime_drift_exceeded) read through a
new vvar_snapshot_anchor() that mirrors this protocol in C: relaxed
atomic field loads with __atomic_thread_fence(__ATOMIC_ACQUIRE) before
the recheck.
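The reader side can be sketched in C along the same lines. This is a hedged model of what a helper like vvar_snapshot_anchor() might look like, with an assumed struct layout and hypothetical names; using an acquire load for the recheck mirrors the trampoline's second LDAR and is a choice made for this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ANCHOR_FIELDS 5

/* Hypothetical layout: seq counter first, then the anchor fields. */
struct vvar_sketch {
    uint32_t seq;
    uint64_t field[ANCHOR_FIELDS];
};

/* Copies a consistent anchor into out[]; false means "take the slow path". */
static bool snapshot_anchor_sketch(struct vvar_sketch *v,
                                   uint64_t out[ANCHOR_FIELDS])
{
    assert(v != NULL && out != NULL);

    uint32_t s1 = __atomic_load_n(&v->seq, __ATOMIC_ACQUIRE);
    if (s1 == 0 || (s1 & 1u))
        return false;                /* unseeded, or a writer is mid-update */

    for (int i = 0; i < ANCHOR_FIELDS; i++)
        out[i] = __atomic_load_n(&v->field[i], __ATOMIC_RELAXED);

    /* Keep the field loads from being satisfied after the recheck load
     * (the role DMB ISHLD plays in the trampoline). */
    __atomic_thread_fence(__ATOMIC_ACQUIRE);

    uint32_t s2 = __atomic_load_n(&v->seq, __ATOMIC_ACQUIRE);
    return s1 == s2;                 /* mismatch: a refresh raced with us */
}
```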
Two staleness gates drive the refresh:
- The trampolines LSR + CBNZ the CNTVCT delta against a 2**31-cycle
cap (~89 s at 24 MHz). A stale anchor falls back to SVC so the
host can publish a fresh one.
- sys_clock_gettime and sys_gettimeofday sample both host clocks
back-to-back on the vDSO SVC fallback and call vdso_seed_anchor
whenever the anchor is unseeded, has aged out, or has drifted past
VDSO_ANCHOR_MAX_DRIFT_NS (100 ms) relative to a fresh REALTIME.
This catches macOS NTP wall-clock steps without a host timer
thread. The drift detector short-circuits on age, guards
anchor_sec + delta_sec with __builtin_add_overflow, and saturates
the cross-second diff before the * 1e9 multiply so adversarial
vvar values cannot trip signed overflow.
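The two gates can be modelled in C roughly as follows. This is a sketch under stated assumptions: the function names and parameter order are hypothetical, the 24 MHz counter frequency is taken from the text, and the cycles-to-nanoseconds conversion uses plain division rather than whatever the real helpers do:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AGE_SHIFT     31             /* cap of 2**31 cycles (~89 s at 24 MHz) */
#define CNT_HZ        24000000ULL    /* counter frequency assumed from the text */
#define NSEC_PER_SEC  1000000000LL
#define MAX_DRIFT_NS  100000000LL    /* 100 ms */

/* Gate 1: any bit at or above AGE_SHIFT means the anchor aged out. A counter
 * that appears to run backwards wraps to a huge delta and is stale too. */
static bool age_exceeded(uint64_t now_cnt, uint64_t anchor_cnt)
{
    return ((now_cnt - anchor_cnt) >> AGE_SHIFT) != 0;
}

/* Gate 2: compare the anchor's extrapolated REALTIME with a fresh sample.
 * The anchor_* values live in guest-writable memory and are untrusted. */
static bool drift_exceeded(int64_t anchor_sec, int64_t anchor_nsec,
                           uint64_t anchor_cnt, uint64_t now_cnt,
                           int64_t real_sec, int64_t real_nsec)
{
    assert(real_nsec >= 0 && real_nsec < NSEC_PER_SEC);  /* host sample */

    if (age_exceeded(now_cnt, anchor_cnt))
        return true;                                     /* short-circuit */
    if (anchor_nsec < 0 || anchor_nsec >= NSEC_PER_SEC)
        return true;                                     /* malformed anchor */

    uint64_t delta = now_cnt - anchor_cnt;               /* < 2**31 here */
    int64_t delta_sec = (int64_t)(delta / CNT_HZ);
    int64_t exp_nsec = anchor_nsec
        + (int64_t)((delta % CNT_HZ) * (uint64_t)NSEC_PER_SEC / CNT_HZ);

    int64_t exp_sec;
    if (__builtin_add_overflow(anchor_sec, delta_sec, &exp_sec))
        return true;
    if (exp_nsec >= NSEC_PER_SEC) {                      /* normalize carry */
        exp_nsec -= NSEC_PER_SEC;
        if (__builtin_add_overflow(exp_sec, (int64_t)1, &exp_sec))
            return true;
    }

    int64_t diff_sec;
    if (__builtin_sub_overflow(real_sec, exp_sec, &diff_sec))
        return true;
    /* Saturate before the multiply: +/-2 s already exceeds the limit. */
    if (diff_sec > 2)
        diff_sec = 2;
    if (diff_sec < -2)
        diff_sec = -2;

    int64_t diff_ns = diff_sec * NSEC_PER_SEC + (real_nsec - exp_nsec);
    return diff_ns > MAX_DRIFT_NS || diff_ns < -MAX_DRIFT_NS;
}
```

Treating every malformed or overflowing input as "drifted" keeps the failure mode safe: the worst a hostile vvar value can cause is an extra re-seed.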
The host gates the HVF reads (ELR_EL1, X9) on clockid 0 or 1 every time,
not just before the anchor is published. The previous short-circuit on
vdso_anchor_is_seeded left stale anchors stranded.
sys_gettimeofday now writes 8 bytes of zero to tz_gva when non-null so
SVC and fast-path callers see the same tz semantics; previously the SVC
path ignored tz while the fast path zeroed it, so the first unseeded
fallback with tz != NULL silently diverged.
Measured under tests/bench-vdso.c at 200000 iterations, with the
seqlock and DMB ISHLD overhead included:
  clock_gettime(MONOTONIC) : 4.3 ns/op (SVC: 1324 ns; 311x)
  clock_gettime(REALTIME)  : 4.1 ns/op (SVC: 1315 ns; 321x)
  clock_getres(MONOTONIC)  : 2.3 ns/op (SVC: 1349 ns; 591x)
  gettimeofday             : 4.9 ns/op (SVC: 1331 ns; 271x)
  getcpu                   : 2.0 ns/op (SVC: 1519 ns; 755x)
The ~1.5 ns increase on clock_gettime over the prior one-shot anchor
(~2.5 ns baseline) is the cost of the seqlock recheck plus DMB ISHLD
plus the age-cap LSR+CBNZ. clock_getres and getcpu have no anchor
dependence so they stay at ~2 ns.
The synthetic vDSO at AT_SYSINFO_EHDR already carries DT_HASH, LINUX_2.6.39 symbol versioning, and five __kernel_* trampolines, but glibc 2.41's dynamic-linker vDSO probe rejected the page for lack of an NT_GNU_ABI_TAG note: every dynamically-linked guest fell back to SVC for clock_gettime, gettimeofday, and clock_getres. PR #34 measured 1006 ns/op against an 18 ns/op OrbStack reference, a 56x gap the TODO Tier D P1 entry tracked as the highest-leverage single fix.

This adds the note. To avoid moving VVAR (0x0B0), TEXT_OFF_SIGRET (0x0E0, exported in vdso.h for signal.c), or any trampoline / section offset, the program-header table relocates from 0x040 to 0x6B0 (after the section-header area). The reclaimed 0x040 window now holds the 32-byte NT_GNU_ABI_TAG:

  namesz : 4 ("GNU\0")
  descsz : 16
  type   : NT_GNU_ABI_TAG (1)
  name   : "GNU\0"
  desc   : { ELF_NOTE_OS_LINUX (0), 2, 6, 39 }

The descriptor's minimum kernel ABI (2.6.39) matches the LINUX_2.6.39 symbol version already exposed through DT_VERDEF, so a glibc that honors the version also honors the note. PT_LOAD continues to cover the whole page so the relocated PHDR table and the note both stay mapped at runtime.

Validation, dynamically-linked glibc 2.41 binary built from the cross-toolchain sysroot at /opt/toolchain/aarch64-linux-gnu (same toolchain PR #34 used for the baseline):

  libc clock_gettime : 6.97 ns/op (was 1006 ns/op pre-fix)
  direct vDSO call   : 6.24 ns/op (dlsym function-pointer)
  raw SVC syscall    : 2047.01 ns/op
  libc/vDSO ratio = 1.12x -- libc IS using the vDSO

The 0.7 ns libc-vs-direct gap is glibc's dl_sysinfo_dso dispatch, not an SVC fallback. libc clock_gettime now beats the OrbStack reference (18 ns/op) by ~2.6x.

gettimeofday and clock_getres land on the trampolines through the same probe path:

  libc gettimeofday : 7.5 ns/op (vDSO REALTIME anchor reuse)
  libc clock_getres : 4.9 ns/op (constant-resolution path)

readelf parses the page cleanly: e_phnum=3, e_phoff=0x6B0, three PHDRs (PT_LOAD covering the whole page, PT_DYNAMIC at 0x420 size 0x90, PT_NOTE at 0x40 size 0x20), and `readelf -n` decodes the note as "GNU NT_GNU_ABI_TAG OS: Linux, ABI: 2.6.39". No region overlaps; total page usage 0x758 / 0x1000.

Static vDSO bench unchanged at 6 ns/op for the time fast paths; the PHDR relocation only shifts where the dynamic linker looks for the table and does not touch any code the trampolines execute. test-signal explicit run passes, confirming the unchanged TEXT_OFF_SIGRET=0xE0 trampoline still drives the libc __restore_rt path.
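For illustration, the 32-byte note described above could be laid out from C like this. The struct and function names are hypothetical, the constants are spelled out locally rather than taken from `<elf.h>`, and the sketch assumes a little-endian host writing a little-endian image:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NT_GNU_ABI_TAG  1u    /* note type */
#define SKETCH_NOTE_OS_LINUX   0u    /* first descriptor word */

/* ELF note header followed by name and descriptor; no padding needed
 * because every member is 4-byte sized and aligned. */
struct abi_tag_note {
    uint32_t namesz;                 /* 4: "GNU\0" */
    uint32_t descsz;                 /* 16: four 32-bit words */
    uint32_t type;
    char     name[4];
    uint32_t desc[4];                /* OS, then minimum kernel a.b.c */
};

/* Writes the note at byte offset off inside a page of page_size bytes. */
static void write_abi_tag_note(uint8_t *page, size_t page_size, size_t off)
{
    struct abi_tag_note n = {
        .namesz = 4,
        .descsz = 16,
        .type   = SKETCH_NT_GNU_ABI_TAG,
        .name   = { 'G', 'N', 'U', '\0' },
        .desc   = { SKETCH_NOTE_OS_LINUX, 2, 6, 39 },
    };

    assert(sizeof n == 32);
    assert(off <= page_size && page_size - off >= sizeof n);
    memcpy(page + off, &n, sizeof n);
}
```

Calling `write_abi_tag_note(page, 0x1000, 0x40)` would fill the 0x40..0x5F window that the PT_NOTE header points at.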
Three hot paths the PR #34 OrbStack baseline tracked -- getpid (~47 ns), clock_gettime through the vDSO (~2.5 ns), and 1-byte /dev/urandom read (~134 ns) -- had no automated regression check. A silent slip-back to the SVC fallback turned each into a ~1-2 us trap without anything in CI to notice. This adds an explicit guardrail.

tests/bench-hot-guard.c resolves __kernel_clock_gettime via AT_SYSINFO_EHDR + PT_DYNAMIC + DT_HASH (SysV ELF hash walk) and measures three labels in fixed-width "%-20s %10.1f ns/op last=%ld" output: getpid (raw SVC), clock_gettime (vDSO trampoline), and read-urandom1 (raw 1-byte read of /dev/urandom). The same source builds two binaries via a compile-time switch:

  build/bench-hot-guard
      Static glibc. Built without the macro. clock_gettime invokes the trampoline directly through the resolved function pointer. Static glibc never initializes dl_sysinfo_dso, so its libc wrapper falls back to raw SVC for reasons unrelated to the vDSO; measuring the wrapper would fail the 50 ns ceiling for the wrong reason. Direct call isolates the trampoline.

  build/bench-hot-guard-glibc
      Dynamic glibc. Built with -DGUARD_USE_LIBC_CG=1. clock_gettime invokes glibc's clock_gettime() wrapper -- which on glibc 2.41 + a correctly-stamped vDSO (NT_GNU_ABI_TAG PT_NOTE, LINUX_2.6.39 versioning) routes through the trampoline. A regression in the note or versioning would push this measurement from ~7 ns to SVC range and trip the ceiling. Built only when the cross-toolchain sysroot at $(LINUX_TOOLCHAIN)/aarch64-unknown-linux-gnu/sysroot exists; run with elfuse --sysroot at that path.

Disassembly verifies the split: the dynamic binary lowers bench_clock_gettime to "bl <clock_gettime@plt>" while the static binary lowers it to "ldr x2, [x1], #8" + indirect dispatch.

Validation:

  static     getpid 50.4 ns, clock_gettime  6.7 ns, urandom 141.9 ns
  dyn-glibc  getpid 71.9 ns, clock_gettime 17.8 ns, urandom 147.9 ns
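The DT_HASH walk mentioned above follows the standard SysV scheme: hash the name, pick a bucket, then follow the chain. A simplified sketch, in which the function names are hypothetical and a plain array of names stands in for the real Elf64_Sym and dynstr lookups:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Standard SysV ELF hash over a NUL-terminated symbol name. */
static uint32_t sysv_elf_hash(const char *name)
{
    uint32_t h = 0;

    while (*name != '\0') {
        h = (h << 4) + (unsigned char)*name++;
        uint32_t g = h & 0xf0000000u;
        if (g != 0)
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

/* DT_HASH layout: nbucket, nchain, bucket[nbucket], chain[nchain].
 * names[i] is the name of symbol i (illustrative stand-in for reading
 * st_name out of the symbol table). Returns 0 (STN_UNDEF) if not found. */
static uint32_t hash_lookup(const uint32_t *hash_tab,
                            const char *const *names, const char *want)
{
    uint32_t nbucket = hash_tab[0];
    const uint32_t *bucket = &hash_tab[2];
    const uint32_t *chain = &hash_tab[2 + nbucket];

    assert(nbucket != 0);
    for (uint32_t i = bucket[sysv_elf_hash(want) % nbucket]; i != 0;
         i = chain[i]) {
        if (strcmp(names[i], want) == 0)
            return i;
    }
    return 0;
}
```

In a real resolver the returned index selects the Elf64_Sym whose st_value, added to the AT_SYSINFO_EHDR base, gives the function pointer to call.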
Collaborator
The same Startup time:

| | samples (ms) | min | median |
|---|---|---|---|
| elfuse (PR #52) | 48 41 35 34 34 34 34 34 34 34 | 34 | 34 |
| orbstack | 55 35 35 35 35 34 35 35 35 36 | 34 | 35 |

vDSO-heavy Python loops (N=500k, 5 rounds, median ns/op)

| Operation | elfuse (PR #52) | orbstack | Gap |
|---|---|---|---|
| time.time() (glibc → __kernel_gettimeofday) | 36.2 | 37.0 | elfuse 2% faster |
| time.monotonic_ns() (→ clock_gettime(MONO)) | 35.2 | 30.1 | orbstack 17% faster |
| clock_gettime(CLOCK_REALTIME) | 49.3 | 40.2 | orbstack 23% faster |
| clock_gettime(CLOCK_MONOTONIC) | 49.8 | 40.4 | orbstack 23% faster |
| clock_getres(CLOCK_MONOTONIC) | 49.3 | 38.9 | orbstack 27% faster |

CPU-bound control: fibonacci(50000) (10 runs)

| | samples (ms) | min | median |
|---|---|---|---|
| elfuse (PR #52) | 51 52 51 51 53 52 51 51 51 51 | 51 | 51 |
| orbstack | 52 52 52 51 52 51 53 51 52 52 | 51 | 52 |
Max042004
approved these changes
May 29, 2026
The trampoline's last divide -- UDIV by 1e9 to split anchor_nsec + delta_ns into sec_carry and nsec_out -- runs ~10-22 cycles on Apple Silicon and is not pipelined. Tightening VDSO_ANCHOR_AGE_SHIFT from 31 to 22 caps delta_ns at ~175e6 ns and bounds the sum below 2e9, so the quotient is always 0 or 1. That collapses the carry to a single SUBS + CSEL + CINC (~3 cycles, fully pipelined), eliminating the only remaining hot-path divide in __kernel_clock_gettime. The shift change costs one extra SVC re-seed per ~0.175 s of idle, which is negligible compared to the per-call gain.

Measured on M1: dyn-glibc clock_gettime drops from ~6.5 ns/op baseline to ~4.0 ns/op (~38% faster, guardrail static path 3.4 ns), closing most of the remaining gap to the PR #52 OrbStack baseline.

The host-side vdso_realtime_drift_exceeded was also updated to the matching (delta * 699050666) >> 24 mult+shift so the drift detector cannot mis-classify the trampoline's own rounding. x9 stays live across the new math and reaches the SVC fallback intact, preserving the trustworthy CNTVCT contract with sys_clock_gettime. The overflow invariant is documented on vdso_seed_anchor in vdso.h.
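The arithmetic behind this change can be checked with a small C model. The function names are hypothetical and the trampoline itself is AArch64 assembly; the multiplier 699050666 is approximately (1e9 / 24e6) * 2**24, which assumes the 24 MHz counter mentioned elsewhere in this PR:

```c
#include <assert.h>
#include <stdint.h>

#define AGE_SHIFT     22             /* delta must stay below 2**22 cycles */
#define NS_MULT       699050666ULL   /* ~41.667 ns per cycle, scaled by 2**24 */
#define NS_SHIFT      24
#define NSEC_PER_SEC  1000000000ULL

/* Cycles to nanoseconds without a divide; the product cannot overflow
 * because delta < 2**22 and NS_MULT < 2**30. */
static uint64_t cycles_to_ns(uint64_t delta)
{
    assert((delta >> AGE_SHIFT) == 0);
    return (delta * NS_MULT) >> NS_SHIFT;
}

/* Split anchor_nsec + delta_ns into a 0-or-1 second carry and a remainder.
 * The compare-and-select mirrors what SUBS + CSEL + CINC compute. */
static void split_carry(uint64_t anchor_nsec, uint64_t delta_ns,
                        uint64_t *sec_carry, uint64_t *nsec_out)
{
    uint64_t sum = anchor_nsec + delta_ns;

    assert(sum < 2 * NSEC_PER_SEC);  /* invariant that makes this valid */
    uint64_t carry = (sum >= NSEC_PER_SEC) ? 1u : 0u;
    *sec_carry = carry;
    *nsec_out = carry ? sum - NSEC_PER_SEC : sum;
}
```

Because the multiplier is truncated, the conversion rounds down slightly (24 cycles map to 999 ns rather than 1000), which is why the host-side drift check has to use the same formula.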
Collaborator
The same statically-linked 5 runs each, ns/op:
Summary by cubic
Adds a seqlock-protected vDSO time anchor with safe refresh and fast paths for `__kernel_clock_getres`, `__kernel_clock_gettime`, `__kernel_gettimeofday`, and `__kernel_getcpu`, plus an `NT_GNU_ABI_TAG` note so dynamic glibc binds to the vDSO. Also collapses the `clock_gettime` carry to remove the final divide, cutting dyn-glibc down to ~4 ns.

New Features
- `__kernel_*` symbols. Host refresh on SVC (seed/refresh on unseeded, aged, or >100 ms REALTIME drift); `sys_clock_getres` returns fixed resolutions for common clockids; `sys_gettimeofday` zeros `tz` to match fast-path behavior.
- `PT_NOTE` with `NT_GNU_ABI_TAG` (Linux 2.6.39); PHDR table relocated to keep vvar/text offsets stable. New tests and benches, plus a hot-syscall guardrail in `make check` enforcing ceilings: getpid ≤ 200 ns, libc `clock_gettime` ≤ 50 ns, 1-byte `/dev/urandom` read ≤ 200 ns. Optional dynamic-glibc bench verifies glibc routes through the vDSO.

Refactors
- `clock_gettime` with SUBS+CSEL+CINC by tightening the anchor age shift from 31 to 22; keeps X9 live for the SVC fallback. Dyn-glibc `clock_gettime` ~4.0 ns (static ~3.4 ns); costs one extra re-seed per ~0.175 s idle. Host drift detector updated to the same mult+shift so rounding matches the trampoline.

Written for commit f5b3e21. Summary will update on new commits.