Skip to content

perf(blitter): +15% AvP gameplay accurate via inlining ADDARRAY/DATA/COMP_CTRL#129

Merged
JoeMatt merged 6 commits intodevelopfrom
feature/blitter-perf-pass1
May 2, 2026
Merged

perf(blitter): +15% AvP gameplay accurate via inlining ADDARRAY/DATA/COMP_CTRL#129
JoeMatt merged 6 commits intodevelopfrom
feature/blitter-perf-pass1

Conversation

@github-actions
Copy link
Copy Markdown

@github-actions github-actions Bot commented May 2, 2026

Summary

  • Inline ADD16SAT / ADDARRAY / COMP_CTRL / DATA (static INLINE __attribute__((always_inline))) so the compiler can specialise per call site.
  • New perf_counters.h instrumentation system (zero-overhead when BENCH_PROFILE is undefined) that drove the diagnosis.
  • make benchmark now accepts BENCH_STATE=... to load a save state before timing — lets us measure actual gameplay instead of boot/menu.
  • scripts/profile-mac.sh xctrace wrapper and docs/profiling.md updates documenting the workflow.

Motivation

Reported in chat: AvP on RetroArch with the accurate blitter has audio dropouts on Apple Silicon. Headless make benchmark against the AvP gameplay state confirmed accurate is ~2× slower than fast (346 → 174 FPS). xctrace Time Profiler showed ADDARRAY as the single largest leaf in the entire emulator (1910 samples), with DATA (759) and COMP_CTRL (318) following — all called only from the BlitterMidsummer2 inner loop, all with mostly-constant flag arguments.

Result

ROM / state mode before after delta
AvP gameplay (state6) accurate ~174 FPS ~200 FPS +15%
AvP gameplay (state6) fast ~345 FPS ~340 FPS within noise

Sample-of-stack hits in the blitter chain went from 5268 → 4592.

Test plan

  • make test passes (existing audio-clipping warnings are pre-existing EXPECTED-FAIL on AvP)
  • make benchmark 3-run consistency on AvP state6 fast + accurate
  • scripts/c89-lint.sh clean
  • xctrace re-profile confirms ADDARRAY/DATA/COMP_CTRL absorbed into BlitterMidsummer2
  • Try AvP in real RetroArch on Apple Silicon and confirm audio dropouts reduced

Notes

  • Bit-exactness preserved: function bodies are byte-identical to the originals; only linkage + source position changed.
  • The +15% is from inlining alone — no algorithm changes. Bigger wins (per-pixsize specialisation, JIT) deferred to follow-up branches.
  • This branch also contains the perf_counters infra, which is reusable infrastructure for future profiling work.

🤖 Generated with Claude Code

JoeMatt and others added 3 commits May 1, 2026 23:12
Generalizes the ad-hoc BLITTER_PROFILE pattern into a reusable,
zero-overhead-when-off counter system any subsystem can use.

* src/core/perf_counters.{h,c} - PERF_COUNTER / PERF_INC / PERF_ADD
  macros backed by a constructor-registered linked list.  When
  BENCH_PROFILE is undefined every macro expands to (0) so there is
  no runtime, code-size, or symbol cost in shipped builds.

* src/tom/blitter.c - migrate the existing BLITTER_PROFILE counters
  in BlitterMidsummer2 onto the new system.  Counters are embedded in
  existing initializers via the comma operator so the file stays
  C89-clean (no statements before declarations).

* Makefile - `make BENCH_PROFILE=1` defines the macro globally.
  `make benchmark BENCH_PROFILE=1` re-invokes with TEST_EXPORTS=1 so
  test_benchmark can dlsym `perf_counters_dump` and print all
  registered counters next to the FPS report.

* test/tools/test_benchmark.c - dlsym the optional dump symbol; if
  present (BENCH_PROFILE build), call it before the BENCHMARK RESULTS
  block.  No effect on default builds.

* exports-test.list / link-test.T - add perf_counters_{dump,reset,
  register} so harnesses can reach them under TEST_EXPORTS=1.

* scripts/profile-mac.sh - one-line wrapper around `xctrace record`.
  Defaults to Time Profiler; --template "CPU Counters" for PMU
  events on Apple Silicon.  Builds the core + harness, runs the
  benchmark under instrumentation, writes a .trace bundle, and can
  auto-open it with --open.

* docs/profiling.md - new sections covering BENCH_PROFILE counters
  (when to use vs sampling profilers) and the profile-mac.sh wrapper.

Validated end-to-end with `make benchmark BENCH_PROFILE=1
BENCH_BLITTER=accurate`: counters populate
(blitter_inner=283994 over 120 frames of yarc), and default builds
remain unchanged.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
Two small follow-ups to the perf_counters baseline:

* Move the perf_counters_register/dump/reset declarations out of
  the BENCH_PROFILE ifdef in the header, and provide always-defined
  no-op bodies in perf_counters.c.  This lets the test ABI export
  the symbols unconditionally (added back to exports-test.list /
  link-test.T) so test_benchmark can dlsym them in any build flavor;
  in non-BENCH_PROFILE builds the bodies are empty.

* Makefile benchmark target gains a BENCH_STATE knob:
    make benchmark BENCH_ROM=foo.j64 BENCH_STATE=foo.state6
  Plumbed through to test_benchmark --load-state, which already
  supports both raw retro_serialize payloads and RetroArch RASTATE
  containers (lands via PR #128).  Lets us profile actual gameplay
  scenes instead of boot/menu idle.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
Profile data on AvP gameplay (state6, accurate blitter) showed
ADDARRAY as the single largest leaf in the entire emulator at
1910 sample-of-stack hits, with DATA (759) and COMP_CTRL (318)
not far behind.  All four are called from the BlitterMidsummer2
inner loop only, and most call sites pass compile-time-constant
flags for daddasel/daddbsel/daddmode/sat/eightbit/hicinh/etc --
ideal candidates for per-call-site specialisation through the
compiler if the bodies become visible at the call site.

This commit moves the four definitions above BlitterMidsummer2
(in the order ADD16SAT -> ADDARRAY -> COMP_CTRL -> DATA so each
sees its dependencies) and marks them
`static INLINE __attribute__((always_inline))`.  No body changes;
this is purely a re-arrangement so the compiler can do dead-arm
elimination and constant propagation across the call boundary.

Removed the matching extern forward declarations now that the
definitions provide the prototype.

Measured (Apple M-series, headless `make benchmark` against the
private AvP ROM with state6 loaded, accurate blitter, 600 frames
after 60 warmup, 3-run median):

      BlitterMidsummer2 + callees, sample-of-stack
        before:  ~5268 (BM2 2281, ADDARRAY 1910, DATA 759, COMP_CTRL 318)
        after:   ~4592 (BM2 absorbs the four inlinees)

      AvP accurate FPS
        baseline:    173-176
        +ADDARRAY:   192-195
        +DATA+COMP:  198-201   (~+15% net)

Fast-blitter perf unchanged (within ~3% run-to-run noise).
test_blitter_compare and the rest of `make test` pass.

Bit-exactness preserved: the function bodies are byte-for-byte
identical to the originals, only their linkage and source-file
position changed.

Addresses real-world AvP-on-RetroArch slowdown / audio-dropout
report on Apple Silicon, where the extra ~25 FPS recovers enough
budget for presentation + audio mixing to fit in 16.6 ms.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@JoeMatt JoeMatt changed the title Update from feature/blitter-perf-pass1 perf(blitter): +15% AvP gameplay accurate via inlining ADDARRAY/DATA/COMP_CTRL May 2, 2026
The +15% inlining win shipped in the previous commit didn't fix the
real-world AvP audio stutter, because audio dropouts are governed
by *worst-case* frame time, not average.  This commit lands the
diagnostic infrastructure that found the actual cost driver.

* test_benchmark now records per-frame timing and reports
  p50 / p99 / p999 / max ms-per-frame plus the count of frames
  that blew the 16.67 ms (60 Hz) budget.

* perf_counters.h gains `perf_counters_find(name)` so the harness
  can snapshot a counter before/after each retro_run() call and
  report per-frame deltas.

* test_benchmark uses the new lookup to print the slowest frames
  along with their blitter call count and inner-iteration count,
  and the slow-vs-avg ratio.

* Makefile gains BLITTER_TRACE=1 toggle that wires
  -DBLITTER_TRACE; src/tom/blitter.c grows a per-blit elapsed-ms
  dump that fires for any single BlitterMidsummer2 call exceeding
  a threshold, so we can tell whether worst-case frame spikes are
  one giant blit or many small ones.

What this revealed for AvP gameplay (state7, accurate blitter,
1200 frames after 120 warmup, M-series host):

   avg 5.1 ms, p99 17.0 ms, p999 18.2 ms, max 18.4 ms
   25 / 1200 frames over 16.67 ms (2%)

   Avg per frame:    403 blits,   49,774 inner iters
   Slow per frame:  2009 blits,  247,508 inner iters  (5x)

The slow frames are perfectly periodic and consistent (5x the work
on every spike, every spike same exact count).  No single blit is
slow -- the BLITTER_TRACE dump prints zero lines even at a 0.3 ms
threshold.  AvP is just doing 5x more blitter work on roughly
every 24th frame.  That's the audio dropout root cause: the inlining
helped average throughput but the periodic 5x spikes still blow
the budget.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@github-actions
Copy link
Copy Markdown
Author

github-actions Bot commented May 2, 2026

Regression: macos-arm64

Regression Test Results

ROM Status Details Diff
jagniccc ✅ PASS 0 pixels differ -
yarc ✅ PASS 0 pixels differ -
jagniccc (determinism) ✅ PASS identical across runs -
yarc (determinism) ✅ PASS identical across runs -
jagniccc (frameskip) ✅ PASS skip=0 matches skip=3 -
yarc (frameskip) ✅ PASS skip=0 matches skip=3 -
jagniccc (save state) ✅ PASS round-trip matches -
yarc (save state) ✅ PASS round-trip matches -
jagniccc (rewind) ✅ PASS rewind matches -
yarc (rewind) ✅ PASS rewind matches -

Platform: Darwin arm64

Updated by CI at 2026-05-02T19:51:41.719Z

@github-actions
Copy link
Copy Markdown
Author

github-actions Bot commented May 2, 2026

Regression: linux-x64

Regression Test Results

ROM Status Details Diff
jagniccc ✅ PASS 0 pixels differ -
yarc ✅ PASS 0 pixels differ -
jagniccc (determinism) ✅ PASS identical across runs -
yarc (determinism) ✅ PASS identical across runs -
jagniccc (frameskip) ✅ PASS skip=0 matches skip=3 -
yarc (frameskip) ✅ PASS skip=0 matches skip=3 -
jagniccc (save state) ✅ PASS round-trip matches -
yarc (save state) ✅ PASS round-trip matches -
jagniccc (rewind) ✅ PASS rewind matches -
yarc (rewind) ✅ PASS rewind matches -

Platform: Linux x86_64

Updated by CI at 2026-05-02T19:52:00.719Z

@github-actions
Copy link
Copy Markdown
Author

github-actions Bot commented May 2, 2026

Regression: linux-arm64

Regression Test Results

ROM Status Details Diff
jagniccc ✅ PASS 0 pixels differ -
yarc ✅ PASS 0 pixels differ -
jagniccc (determinism) ✅ PASS identical across runs -
yarc (determinism) ✅ PASS identical across runs -
jagniccc (frameskip) ✅ PASS skip=0 matches skip=3 -
yarc (frameskip) ✅ PASS skip=0 matches skip=3 -
jagniccc (save state) ✅ PASS round-trip matches -
yarc (save state) ✅ PASS round-trip matches -
jagniccc (rewind) ✅ PASS rewind matches -
yarc (rewind) ✅ PASS rewind matches -

Platform: Linux aarch64

Updated by CI at 2026-05-02T19:51:49.819Z

MSVC x86 / x64 compilation checks failed on PR #129 because
`__attribute__((always_inline))` is GCC/Clang-specific.  Wrap it
in a BLITTER_ALWAYS_INLINE macro that maps to:

  GCC / Clang: inline __attribute__((always_inline))
  MSVC:        __forceinline                    (replaces inline)
  Other:       inline                           (best-effort)

The macro spells the inline keyword itself so call sites are just
`static BLITTER_ALWAYS_INLINE void foo(...)` -- no extra INLINE
qualifier (MSVC's __forceinline conflicts with another inline
keyword, which is why the original `static INLINE __attribute__(...)`
form would have failed there even if MSVC understood the attribute).

Verified: clang still inlines (AvP accurate ~196 FPS, same as the
attribute-only form); test suite passes; libretro buildbot's MSVC
target should now build clean.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@JoeMatt JoeMatt self-assigned this May 2, 2026
@JoeMatt JoeMatt marked this pull request as ready for review May 2, 2026 19:40
@JoeMatt JoeMatt self-requested a review as a code owner May 2, 2026 19:40
Copilot AI review requested due to automatic review settings May 2, 2026 19:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR focuses on improving performance of the accurate blitter path (notably AvP gameplay on Apple Silicon) by enabling stronger compiler specialization/inlining in the hottest blitter helpers, and by adding optional profiling infrastructure plus benchmark workflow improvements to better measure gameplay-like performance.

Changes:

  • Inline hot blitter helpers (ADD16SAT, ADDARRAY, COMP_CTRL, DATA) and add optional blitter perf counters/tracing hooks.
  • Add an opt-in perf_counters instrumentation system (enabled via BENCH_PROFILE=1) and expose it through the test-export ABI.
  • Enhance make benchmark / test_benchmark to support loading a save state and reporting per-frame timing variance; add a macOS xctrace wrapper and profiling docs updates.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/tom/blitter.c Forces inlining of hot helper routines and adds optional perf counters + tracing instrumentation around the accurate blitter inner loop.
src/core/perf_counters.h Introduces opt-in counter macros and registry API for lightweight instrumentation.
src/core/perf_counters.c Implements the counter registry, lookup, dump, and reset behavior.
test/tools/test_benchmark.c Adds optional perf counter access plus per-frame timing samples/percentiles and worst-frame reporting.
Makefile Adds BENCH_PROFILE/BLITTER_TRACE build flags and enhances benchmark target to support state loading and optional wide exports.
Makefile.common Adds perf_counters.c to the core build sources.
link-test.T Exports perf counter symbols for the test ABI (ELF).
exports-test.list Exports perf counter symbols for the test ABI (Mach-O).
scripts/profile-mac.sh Adds a macOS Instruments (xctrace) wrapper for running the benchmark under templates.
docs/profiling.md Documents the new macOS profiling workflow and the BENCH_PROFILE=1 counters.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/tools/test_benchmark.c
Copilot review on PR #129 caught that the per-frame stats path
unconditionally allocates / sorts / divides by num_frames, which
would OOB-access sorted[num_frames - 1] and divide by zero if a
caller passed 0 or a negative number.  Both come from atoi() which
returns 0 on garbage input, so this was reachable.

Validate both at parse time and exit with a clear error before
allocating anything.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@JoeMatt JoeMatt merged commit 484f205 into develop May 2, 2026
30 checks passed
@JoeMatt JoeMatt deleted the feature/blitter-perf-pass1 branch May 3, 2026 23:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants