|
| 1 | +# Performance baseline + profile: OP vs blitter vs GPU/DSP |
| 2 | + |
| 3 | +Captured 2026-05-01 on Apple Silicon (M-series Mac), arm64 build, default flags + `RELEASE_DEBUG_INFO=1`. See `make benchmark` and `docs/profiling.md` for the methodology. |
| 4 | + |
| 5 | +## TL;DR |
| 6 | + |
| 7 | +**OP is not the bottleneck. Blitter is rarely the bottleneck either. GPU and DSP RISC interpretation dominate frame time across the board.** |
| 8 | + |
| 9 | +This redirects the next-up perf work from issue #123 (OP) to a different target — see "Recommendation" at the bottom. |
| 10 | + |
| 11 | +## Baseline FPS |
| 12 | + |
| 13 | +`make benchmark`: 600 measured frames after 60 warmup frames, headless, with no video presentation and no audio output. All runs used the same Mac in the same thermal state. |
| 14 | + |
| 15 | +| ROM | Fast blitter | Accurate blitter | Slowdown | |
| 16 | +|---|---:|---:|---:| |
| 17 | +| `yarc.j64` (raycasting demo) | 280 FPS | 229 FPS | 1.22× | |
| 18 | +| `jagniccc.j64` (NICCC compo demo) | 355 FPS | 258 FPS | 1.37× | |
| 19 | +| `Iron Soldier (1994).jag` | 312 FPS | 258 FPS | 1.21× | |
| 20 | +| `Iron Soldier 2 (World).j64` | 313 FPS | 266 FPS | 1.18× | |
| 21 | +| `Doom - Evil Unleashed (1994).jag` | 306 FPS | 261 FPS | 1.17× | |
| 22 | +| `Skyhammer_(1999).jag` | 339 FPS | 215 FPS | **1.58×** | |
| 23 | + |
| 24 | +All ROMs run well above 60 FPS on M-series. The "where does it hurt" question is meaningful for slower hosts (Pi, mobile) — same ratios, lower absolute numbers. |
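The "same ratios, lower absolute numbers" reasoning can be made concrete with a small sketch. The FPS figures below are copied from the baseline table; the uniform host slowdown factor is a hypothetical stand-in, not a measured Pi or mobile number, and real hosts will not scale every subsystem equally.

```python
# FPS figures copied from the baseline table: (fast blitter, accurate blitter).
BASELINE_FPS = {
    "yarc.j64": (280, 229),
    "jagniccc.j64": (355, 258),
    "Iron Soldier (1994).jag": (312, 258),
    "Iron Soldier 2 (World).j64": (313, 266),
    "Doom - Evil Unleashed (1994).jag": (306, 261),
    "Skyhammer_(1999).jag": (339, 215),
}

TARGET_FPS = 60  # the full-speed threshold used in the text


def frame_ms(fps: float) -> float:
    """Wall-clock milliseconds spent per emulated frame."""
    return 1000.0 / fps


def project(fps: float, host_slowdown: float) -> float:
    """Projected FPS on a host that is `host_slowdown` times slower.

    ASSUMPTION: uniform scaling (the "same ratios" premise). Cache sizes,
    branch prediction and memory latency will break this on real hardware.
    """
    return fps / host_slowdown


def below_target(host_slowdown: float) -> list[str]:
    """ROM/blitter combinations that would drop under TARGET_FPS."""
    misses = []
    for rom, (fast, accurate) in BASELINE_FPS.items():
        for mode, fps in (("fast", fast), ("accurate", accurate)):
            if project(fps, host_slowdown) < TARGET_FPS:
                misses.append(f"{rom} [{mode}]")
    return misses


if __name__ == "__main__":
    HYPOTHETICAL_SLOWDOWN = 4.0  # illustrative only, not a measurement
    for rom, (fast, accurate) in BASELINE_FPS.items():
        extra = frame_ms(accurate) - frame_ms(fast)
        print(f"{rom}: accurate blitter adds {extra:.2f} ms/frame")
    print("Below 60 FPS at 4x slowdown:", below_target(HYPOTHETICAL_SLOWDOWN))
```

Under that assumed 4× slowdown, only the accurate-blitter runs of `yarc.j64` and `Skyhammer_(1999).jag` fall below 60 FPS, which is where the subsystem breakdown starts to matter.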
| 25 | + |
| 26 | +## Profile breakdown (`/usr/bin/sample`, 8s @ 6000-frame run) |
| 27 | + |
| 28 | +Captured to `/tmp/op-baseline/sample-*-*.txt`. Aggregated top-of-stack by symbol: |
| 29 | + |
| 30 | +### `yarc.j64` (demo, fast blitter) |
| 31 | + |
| 32 | +| Function | Samples | % of frame | |
| 33 | +|---|---:|---:| |
| 34 | +| `GPUExec` (per-instruction dispatch in `src/tom/gpu.c`) | ~5018 | **~74%** | |
| 35 | +| `blitter_generic` | ~649 | 10% | |
| 36 | +| `HalflineCallback` | 79 | 1.2% | |
| 37 | +| (`OPProcessFixedBitmap` etc. didn't make top-15) | – | <1% | |
| 38 | + |
| 39 | +**Demo content with heavy GPU programs. This pattern would generalize to anything that uses the GPU's RISC for software rendering.** |
| 40 | + |
| 41 | +### `Iron Soldier (1994).jag` (commercial, fast blitter) |
| 42 | + |
| 43 | +| Function | Samples | % of frame | |
| 44 | +|---|---:|---:| |
| 45 | +| `DSPExec` (per-instruction dispatch in `src/jerry/dsp.c`) | 3456 | **~51%** | |
| 46 | +| `dsp_opcode_jr` | 818 | 12% | |
| 47 | +| `dsp_opcode_*` (load_r15_indexed, jump, load) | ~250 | 4% | |
| 48 | +| **DSP TOTAL** | **~4524** | **~67%** | |
| 49 | +| `HalflineCallback` / `JaguarReadLong` / `JaguarExecuteNew` | ~250 | 4% | |
| 50 | +| `OPProcessFixedBitmap` | 68 | **1%** | |
| 51 | +| `m68k_execute` | 49 | 0.7% | |
| 52 | +| `blitter_generic` | 48 | 0.7% | |
| 53 | + |
| 54 | +**OP at 1%. Blitter at 0.7%. 68K at 0.7%. Two-thirds of frame time in DSP interpretation.** |
| 55 | + |
| 56 | +### `Skyhammer_(1999).jag` (commercial, fast blitter) |
| 57 | + |
| 58 | +| Function | Samples | % of frame | |
| 59 | +|---|---:|---:| |
| 60 | +| `DSPExec` | 2260 | **~33%** | |
| 61 | +| `GPUExec` | ~1600 | **~24%** | |
| 62 | +| `m68k_execute` | 177 | 2.6% | |
| 63 | +| `blitter_generic` | 112 | 1.6% | |
| 64 | +| `HalflineCallback` | 106 | 1.6% | |
| 65 | + |
| 66 | +**GPU + DSP = 57% of frame time. Blitter is irrelevant here on the fast path.** |
| 67 | + |
| 68 | +### `Skyhammer_(1999).jag` (commercial, accurate blitter) |
| 69 | + |
| 70 | +| Function | Samples | % of frame | |
| 71 | +|---|---:|---:| |
| 72 | +| `DSPExec` | 1525 | **~22%** | |
| 73 | +| `BlitterMidsummer2` + `DATA` + `ADDARRAY` | ~1430 | **~21%** | |
| 74 | +| `GPUExec` | ~720 | **~11%** | |
| 75 | +| `m68k_execute` | 97 | 1.4% | |
| 76 | + |
| 77 | +**The one case where the blitter genuinely matters — and only because the user opted into accurate mode. Even here, DSP is comparable.** |
| 78 | + |
| 79 | +## What this changes |
| 80 | + |
| 81 | +The original `[spike] OP performance audit` (#123) and `[spike] Blitter performance audit` (#124) both assumed those subsystems dominate. They don't. |
| 82 | + |
| 83 | +| Target | Hypothesis going in | Reality | ROI | |
| 84 | +|---|---|---|---| |
| 85 | +| **OP** (#123) | "dominates on heavy-OP scenes (Wolf3D, T2K, Iron Soldier)" | 1% on Iron Soldier; doesn't make top-15 elsewhere | Low: even a 10× OP speedup recovers under 1% of frame time |
| 86 | +| **Blitter** (#124) | "fast vs accurate matters for some games" | Fast path is <2% on the commercial titles profiled and ~10% on `yarc.j64`; accurate mode, when opted into, reaches ~21% on Skyhammer | Medium: only for users running accurate blitter on specific titles |
| 87 | +| **GPU/DSP RISC** (#122 sub-component) | "JIT might help on mobile/Pi" | **57-74% of frame time (GPU + DSP combined) on every ROM profiled with the fast blitter; ~33% on Skyhammer with the accurate blitter** | **High**: a single dynarec helps both because they share an ISA |
| 88 | +| **68K** (#122 sub-component) | "wrap UAE JIT or Cyclone68k" | 0.7-2.6%, barely visible in any profile | Low |
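The ROI column follows from Amdahl's law: speeding up a component that takes a fraction `p` of the frame by a factor `s` recovers `p * (1 - 1/s)` of the frame time. A minimal sketch using the shares from the profiles above; the 10× figures come from the text, while the 2× GPU/DSP speedup is a hypothetical placeholder and not a measured or promised result.

```python
def frame_time_saved(fraction: float, speedup: float) -> float:
    """Fraction of total frame time recovered (Amdahl's law).

    fraction: share of frame time spent in the component (0..1).
    speedup:  how many times faster the component becomes (> 0).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return fraction * (1.0 - 1.0 / speedup)


def overall_speedup(fraction: float, speedup: float) -> float:
    """Whole-frame speedup that results from the component speedup."""
    return 1.0 / (1.0 - frame_time_saved(fraction, speedup))


if __name__ == "__main__":
    # (label, share of frame time from the profiles, assumed speedup)
    scenarios = [
        ("OP, Iron Soldier", 0.01, 10.0),
        ("68K, Skyhammer fast", 0.026, 10.0),
        ("DSP total, Iron Soldier", 0.67, 2.0),  # 2x is hypothetical
        ("GPU, yarc", 0.74, 2.0),                # 2x is hypothetical
    ]
    for label, share, factor in scenarios:
        saved = frame_time_saved(share, factor)
        total = overall_speedup(share, factor)
        print(f"{label}: saves {saved:.1%} of frame time, {total:.2f}x overall")
```

Even with a modest assumed 2× on the RISC interpreters, the recovered share is more than an order of magnitude larger than what a 10× OP or 68K speedup could deliver.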
| 89 | + |
| 90 | +## Recommendation |
| 91 | + |
| 92 | +**Redirect to #122 (JIT / dynarec / cached IR), specifically the GPU/DSP RISC half.** Both Tom GPU and Jerry DSP share the same ISA (~64 opcodes, fixed 16-bit encoding, no MMU). A single basic-block dynarec or cached-IR dispatcher covers both, and the profile data says it would attack the actual hot path on every game tested: |
| 93 | + |
| 94 | +- Demos (yarc/jagniccc) → mostly GPU. |
| 95 | +- Commercial games (Iron Soldier, Doom, Skyhammer) → mostly DSP, sometimes both. |
| 96 | + |
| 97 | +Closing #123 (OP) and #124 (blitter) as **wontfix-for-now** based on profile data is the honest call. Cheap wins documented in those spikes (e.g., the OP `O(N²)` discovery bug, fast-blitter SIMD widening) can still land opportunistically: they're correct fixes, they just won't move the headline number. |
| 98 | + |
| 99 | +## Why we DIDN'T touch OP/blitter as planned |
| 100 | + |
| 101 | +- The OP spike (#123) called out an `O(N²)` linear scan in `OPObjectExists`. That's still a real bug worth fixing for code-quality reasons. But it costs ~1% of frame time, not the ~30% the spike speculated. Not a perf win. |
| 102 | +- The blitter accurate-mode wins (#124) are real for users who deliberately enable that mode on a specific title. But the default user experience runs the fast blitter, which costs under 2% of frame time on the commercial titles profiled and about 10% on `yarc.j64`. |
| 103 | + |
| 104 | +## What 68K JIT would buy us |
| 105 | + |
| 106 | +Per profile: 0.7-2.6% of frame time across all ROMs. Even a 10× speedup of the 68K interpreter → at most 2-3% wall-clock improvement. **Not worth the GPLv2 / GPLv3 license dance with UAE-JIT or the ARM-only constraint of Cyclone68k.** |
| 107 | + |
| 108 | +The profile result also explains why the standalone Virtual Jaguar interpreter has been "fast enough" for years on desktop: modern CPUs make short work of the 68K interpreter. The GPU/DSP RISC interpreters fare worse because they pay call overhead on every emulated instruction. |
| 109 | + |
| 110 | +## Next steps |
| 111 | + |
| 112 | +1. Update issues #122, #123, #124 with this profile data. Keep #122 open and re-scope to GPU/DSP-only. Close #123 and #124 with the linked profile evidence. |
| 113 | +2. New spike: feasibility of a Tom RISC cached-IR / threaded-code dispatcher — the lowest-cost approach, works on JIT-restricted platforms (iOS, Switch). Block JIT comes later if the cached IR proves the model. |
| 114 | +3. This doc and its profile data live at `docs/op-perf-profile.md` (this file). Re-run the profile periodically on the same host to track commit-to-commit deltas. |
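To make the cached-IR idea in step 2 concrete, here is a toy sketch of a pre-decoded dispatch loop. It is written in Python only to show the structure; a real implementation would live in C next to `GPUExec`/`DSPExec`. Everything here is hypothetical: the five opcodes, their numbering, and the `Cpu` class are invented for illustration, and the 6-bit opcode plus two 5-bit operand fields layout is an assumption about the 16-bit encoding rather than something taken from the emulator source.

```python
from typing import Callable, Dict, List, Tuple

MASK32 = 0xFFFFFFFF


class Cpu:
    """Toy RISC state: 32 registers, word-indexed PC, decoded-op cache."""

    def __init__(self, program: List[int]) -> None:
        self.regs = [0] * 32
        self.pc = 0
        self.program = program
        self.halted = False
        self.decode_count = 0  # how often we paid the decode cost
        self.cache: Dict[int, Tuple[Callable, int, int]] = {}


# Hypothetical toy handlers: (cpu, field_a, field_b)
def op_add(cpu: Cpu, src: int, dst: int) -> None:
    cpu.regs[dst] = (cpu.regs[dst] + cpu.regs[src]) & MASK32


def op_addq(cpu: Cpu, imm: int, dst: int) -> None:
    cpu.regs[dst] = (cpu.regs[dst] + imm) & MASK32


def op_subq(cpu: Cpu, imm: int, dst: int) -> None:
    cpu.regs[dst] = (cpu.regs[dst] - imm) & MASK32


def op_jnz(cpu: Cpu, target: int, reg: int) -> None:
    if cpu.regs[reg] != 0:
        cpu.pc = target  # absolute word index, toy semantics


def op_halt(cpu: Cpu, _a: int, _b: int) -> None:
    cpu.halted = True


HANDLERS = {0: op_add, 1: op_addq, 2: op_subq, 3: op_jnz, 4: op_halt}


def encode(opcode: int, a: int, b: int) -> int:
    return (opcode << 10) | (a << 5) | b


def decode(word: int) -> Tuple[Callable, int, int]:
    """Split a 16-bit word into (handler, field_a, field_b)."""
    return HANDLERS[(word >> 10) & 0x3F], (word >> 5) & 0x1F, word & 0x1F


def run(cpu: Cpu, max_steps: int = 10_000) -> int:
    """Execute until HALT, decoding each address at most once."""
    steps = 0
    while not cpu.halted and steps < max_steps:
        entry = cpu.cache.get(cpu.pc)
        if entry is None:  # first visit: decode and remember
            entry = decode(cpu.program[cpu.pc])
            cpu.cache[cpu.pc] = entry
            cpu.decode_count += 1
        handler, a, b = entry
        cpu.pc += 1  # default fall-through; branches overwrite it
        handler(cpu, a, b)
        steps += 1
    return steps


def demo_program() -> List[int]:
    return [
        encode(1, 5, 1),  # 0: r1 += 5   (loop counter)
        encode(1, 3, 0),  # 1: r0 += 3   (loop body)
        encode(2, 1, 1),  # 2: r1 -= 1
        encode(3, 1, 1),  # 3: if r1 != 0 goto 1
        encode(4, 0, 0),  # 4: halt
    ]


if __name__ == "__main__":
    cpu = Cpu(demo_program())
    executed = run(cpu)
    print(f"executed={executed} decoded={cpu.decode_count} r0={cpu.regs[0]}")
```

The loop body executes five times, so 17 instructions run but only 5 are ever decoded. A real dispatcher would also have to invalidate cached entries whenever the emulated program writes to its own instruction memory, which this sketch deliberately omits.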
| 115 | + |
| 116 | +## Files / commands |
| 117 | + |
| 118 | +- Baseline benchmarks: `make benchmark BENCH_ROM=<rom> BENCH_BLITTER={fast,accurate}` — 600 frames + 60 warmup default |
| 119 | +- Profile capture: `./test/tools/test_benchmark <core> <rom> 6000 --warmup 60 --blitter <mode> &` then `sample $! 8 -file /tmp/sample.txt` |
| 120 | +- Test ROMs in tree: `test/roms/yarc.j64`, `test/roms/jagniccc.j64` |
| 121 | +- Commercial ROMs (private): `test/roms/private/Iron Soldier (1994).jag` and friends |