|
| 1 | +# Performance baseline + profile: OP vs blitter vs GPU/DSP |
| 2 | + |
| 3 | +Captured 2026-05-01 on Apple Silicon (M-series Mac), arm64 build, default flags + `RELEASE_DEBUG_INFO=1`. See `make benchmark` and `docs/profiling.md` for the methodology. |
| 4 | + |
| 5 | +## TL;DR |
| 6 | + |
| 7 | +**OP is not the bottleneck. GPU/DSP RISC interpretation is the biggest single target across the board. The accurate blitter is genuinely hot during gameplay (not just at boot) — comparable in cost to DSP for users who enable that mode.** |
| 8 | + |
| 9 | +> **Methodology fix mid-investigation:** the first round of this doc profiled the headless boot/menu state and concluded blitter was always small. That was misleading — boot screens don't draw much. Re-profiled with a save state loaded on top of the live ROM (Alien vs Predator at an in-game state6 save) and the picture changes substantially for the accurate-blitter case. Both data sets are below. |
| 10 | + |
| 11 | +## Baseline FPS |
| 12 | + |
| 13 | +`make benchmark`: 600 measured frames after 60 warmup frames, headless, no video presentation, no audio output. Same Mac, same thermal state. |
| 14 | + |
| 15 | +| ROM | Fast blitter | Accurate blitter | Slowdown | |
| 16 | +|---|---:|---:|---:| |
| 17 | +| `yarc.j64` (raycasting demo) | 280 FPS | 229 FPS | 1.22× | |
| 18 | +| `jagniccc.j64` (NICCC compo demo) | 355 FPS | 258 FPS | 1.37× | |
| 19 | +| `Iron Soldier (1994).jag` | 312 FPS | 258 FPS | 1.21× | |
| 20 | +| `Iron Soldier 2 (World).j64` | 313 FPS | 266 FPS | 1.18× | |
| 21 | +| `Doom - Evil Unleashed (1994).jag` | 306 FPS | 261 FPS | 1.17× | |
| 22 | +| `Skyhammer_(1999).jag` | 339 FPS | 215 FPS | **1.58×** | |
| 23 | + |
| 24 | +All ROMs run well above 60 FPS on M-series. The "where does it hurt" question is meaningful for slower hosts (Pi, mobile), where we expect similar ratios at lower absolute numbers; those hosts were not measured here. |
| 25 | + |
| 26 | +## Profile breakdown (`/usr/bin/sample`, 8s @ 6000-frame run) |
| 27 | + |
| 28 | +Captured to `/tmp/op-baseline/sample-*-*.txt`. Aggregated top-of-stack by symbol: |
| 29 | + |
| 30 | +### `yarc.j64` (demo, fast blitter) |
| 31 | + |
| 32 | +| Function | Samples | % of frame | |
| 33 | +|---|---:|---:| |
| 34 | +| `GPUExec` (per-instruction dispatch in `src/tom/gpu.c`) | ~5018 | **~74%** | |
| 35 | +| `blitter_generic` | ~649 | 10% | |
| 36 | +| `HalflineCallback` | 79 | 1.2% | |
| 37 | +| (`OPProcessFixedBitmap` etc. didn't make top-15) | – | <1% | |
| 38 | + |
| 39 | +**Demo content with heavy GPU programs. This pattern would generalize to anything that uses the GPU's RISC for software rendering.** |
| 40 | + |
| 41 | +### `Iron Soldier (1994).jag` (commercial, fast blitter) |
| 42 | + |
| 43 | +| Function | Samples | % of frame | |
| 44 | +|---|---:|---:| |
| 45 | +| `DSPExec` (per-instruction dispatch in `src/jerry/dsp.c`) | 3456 | **~51%** | |
| 46 | +| `dsp_opcode_jr` | 818 | 12% | |
| 47 | +| `dsp_opcode_*` (load_r15_indexed, jump, load) | ~250 | 4% | |
| 48 | +| **DSP TOTAL** | **~4524** | **~67%** | |
| 49 | +| `HalflineCallback` / `JaguarReadLong` / `JaguarExecuteNew` | ~250 | 4% | |
| 50 | +| `OPProcessFixedBitmap` | 68 | **1%** | |
| 51 | +| `m68k_execute` | 49 | 0.7% | |
| 52 | +| `blitter_generic` | 48 | 0.7% | |
| 53 | + |
| 54 | +**OP at 1%. Blitter at 0.7%. 68K at 0.7%. Two-thirds of frame time in DSP interpretation.** |
| 55 | + |
| 56 | +### `Skyhammer_(1999).jag` (commercial, fast blitter) |
| 57 | + |
| 58 | +| Function | Samples | % of frame | |
| 59 | +|---|---:|---:| |
| 60 | +| `DSPExec` | 2260 | **~33%** | |
| 61 | +| `GPUExec` | ~1600 | **~24%** | |
| 62 | +| `m68k_execute` | 177 | 2.6% | |
| 63 | +| `blitter_generic` | 112 | 1.6% | |
| 64 | +| `HalflineCallback` | 106 | 1.6% | |
| 65 | + |
| 66 | +**GPU + DSP = 57% of frame time. Blitter is irrelevant here on the fast path.** |
| 67 | + |
| 68 | +### `Skyhammer_(1999).jag` (commercial, accurate blitter) |
| 69 | + |
| 70 | +| Function | Samples | % of frame | |
| 71 | +|---|---:|---:| |
| 72 | +| `DSPExec` | 1525 | **~22%** | |
| 73 | +| `BlitterMidsummer2` + `DATA` + `ADDARRAY` | ~1430 | **~21%** | |
| 74 | +| `GPUExec` | ~720 | **~11%** | |
| 75 | +| `m68k_execute` | 97 | 1.4% | |
| 76 | + |
| 77 | +**Among the boot profiles, this is the one case where the blitter genuinely matters, and only because the user opted into accurate mode. Even here, DSP is comparable.** |
| 78 | + |
| 79 | +### `Alien vs Predator (1994)` at gameplay state6 (fast blitter) |
| 80 | + |
| 81 | +| Function | Samples | % of frame | |
| 82 | +|---|---:|---:| |
| 83 | +| `DSPExec` | 3303 | **~49%** | |
| 84 | +| `dsp_opcode_jr` | 977 | 14% | |
| 85 | +| **DSP TOTAL** | **~4280** | **~63%** | |
| 86 | +| `blitter_generic` | ~340 | 5% | |
| 87 | +| `GPUExec` | ~254 | 4% | |
| 88 | +| `HalflineCallback` / `JaguarExecuteNew` / `JaguarReadLong` | ~310 | 5% | |
| 89 | +| `OPProcessFixedBitmap` | 72 | 1% | |
| 90 | + |
| 91 | +**Same DSP-dominated picture as the commercial boot profiles.** Blitter is meaningfully present (5%) since the scene is actively drawing, but still single-digit. |
| 92 | + |
| 93 | +### `Alien vs Predator (1994)` at gameplay state6 (accurate blitter) |
| 94 | + |
| 95 | +| Function | Samples | % of frame | |
| 96 | +|---|---:|---:| |
| 97 | +| **`BlitterMidsummer2` family** (`BlitterMidsummer2` itself, `ADDARRAY` at lines 2090-2094, `DATA`) | ~2330 | **~34%** |
| 98 | +| `DSPExec` | 1652 | **~24%** | |
| 99 | +| `dsp_opcode_jr` | 443 | 6.5% | |
| 100 | +| **DSP TOTAL** | **~2095** | **~31%** | |
| 101 | + |
| 102 | +**Accurate blitter is co-equal with DSP during gameplay.** The hot inner code is `ADDARRAY` (lines 2090-2094 in `src/tom/blitter.c`) — the per-cycle FDSYNC cascade the spike report (#124) flagged — plus `DATA()` (line 2514). Confirms the spike's hypothesis was right; the boot-only profile just didn't expose it. |
| 103 | + |
| 104 | +This matches the user-reported behavior: AvP slows down noticeably when the character moves, and the slowdown is worse with the accurate blitter selected. |
| 105 | + |
| 106 | +## What this changes |
| 107 | + |
| 108 | +The original `[spike] OP performance audit` (#123) and `[spike] Blitter performance audit` (#124) both assumed those subsystems dominate. On the default fast-blitter path neither does; the blitter only becomes hot in accurate mode. |
| 109 | + |
| 110 | +| Target | Hypothesis going in | Reality | ROI | |
| 111 | +|---|---|---|---| |
| 112 | +| **OP** (#123) | "dominates on heavy-OP scenes (Wolf3D, T2K, Iron Soldier)" | ~1% on Iron Soldier and on AvP gameplay; not in the top-15 on yarc | Low: even a 10× OP speedup buys less than 1% of frame time |
| 113 | +| **Blitter — fast** (#124) | "default path matters" | <2% on commercial boot profiles, 5% on AvP gameplay, 10% on the yarc demo: small next to GPU/DSP everywhere | Low |
| 114 | +| **Blitter — accurate** (#124) | "matters for some games" | **~34% on AvP gameplay**, 21% on Skyhammer (boot); ADDARRAY cycle cascade dominates | **High for accurate-blitter users** | |
| 115 | +| **GPU/DSP RISC** (#122 sub-component) | "JIT might help on mobile/Pi" | **57-74% combined on fast blitter, ~31-33% even with accurate** | **High**: single dynarec covers both (shared ISA) |
| 116 | +| **68K** (#122 sub-component) | "wrap UAE JIT or Cyclone68k" | 0.7-2.6% — barely visible in any profile | Low | |
| 117 | + |
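The ROI column follows from Amdahl-style arithmetic: speeding up a component only recovers the share of frame time that component owns. A minimal sketch, using the approximate shares from the tables above; the function names are hypothetical, and the uniform 10× factor is purely illustrative rather than a prediction for any of these subsystems.

```python
def frame_time_saved(share: float, speedup: float) -> float:
    """Fraction of total frame time removed when a component that owns
    `share` of the frame becomes `speedup` times faster."""
    return share * (1.0 - 1.0 / speedup)


def overall_speedup(share: float, speedup: float) -> float:
    """Whole-frame speedup implied by the saving above (Amdahl's law)."""
    return 1.0 / (1.0 - frame_time_saved(share, speedup))


# Approximate shares taken from the profile tables in this document.
measured_shares = {
    "OP (Iron Soldier)": 0.01,
    "68K (worst case, Skyhammer)": 0.026,
    "accurate blitter (AvP gameplay)": 0.34,
    "DSP total (Iron Soldier, fast blitter)": 0.67,
}

if __name__ == "__main__":
    for name, share in measured_shares.items():
        saved = frame_time_saved(share, 10.0)  # illustrative 10x factor
        total = overall_speedup(share, 10.0)
        print(f"{name}: saves {saved:.1%} of frame time, {total:.2f}x overall")
```

With these shares, a 10× OP speedup removes 0.9% of frame time and a 10× 68K speedup removes about 2.3%, which is why both rows are rated Low.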
| 118 | +## Recommendation |
| 119 | + |
| 120 | +Two genuine wins, depending on which user we optimize for: |
| 121 | + |
| 122 | +### Path A — RISC dynarec / cached IR (#122, RISC half only) |
| 123 | + |
| 124 | +Helps **everyone** (every ROM tested, every blitter mode): |
| 125 | +- Fast-blitter users: attacks the dominant 50-75% slice |
| 126 | +- Accurate-blitter users: attacks the ~31% DSP slice (the other ~34% is blitter) |
| 127 | +- Cross-platform reach: cached IR / threaded code works on JIT-restricted hosts (iOS without entitlement, Switch) |
| 128 | +- Both Tom GPU and Jerry DSP share the same ISA → single dispatcher covers both |
| 129 | + |
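The cached-IR idea in Path A can be sketched as follows: decode each instruction word once into a (handler, operands) record, then execute the records without re-extracting bit fields. This is an illustration only. The three-opcode ISA, its 6/5/5-bit field layout, and every name below are assumptions made for the example, not the emulator's code, and a real dispatcher would also need branch handling and invalidation when code RAM is written.

```python
from typing import Callable, List, Tuple

MASK32 = 0xFFFFFFFF
Regs = List[int]
Decoded = Tuple[Callable[[Regs, int, int], None], int, int]


# Toy handlers (hypothetical ISA): `a` is the source field, `b` the destination.
def op_add(regs: Regs, a: int, b: int) -> None:
    regs[b] = (regs[b] + regs[a]) & MASK32


def op_sub(regs: Regs, a: int, b: int) -> None:
    regs[b] = (regs[b] - regs[a]) & MASK32


def op_moveq(regs: Regs, a: int, b: int) -> None:
    regs[b] = a  # `a` is a 5-bit immediate here


HANDLERS = {0: op_add, 1: op_sub, 2: op_moveq}


def encode(op: int, a: int, b: int) -> int:
    """Pack opcode (6 bits), a (5 bits), b (5 bits) into one 16-bit word."""
    return (op << 10) | (a << 5) | b


def run_interpreted(code: List[int], regs: Regs) -> None:
    """Baseline: extract fields and look up the handler on every execution."""
    for word in code:
        HANDLERS[word >> 10](regs, (word >> 5) & 31, word & 31)


def predecode(code: List[int]) -> List[Decoded]:
    """Decode once; the result is the cached IR."""
    return [(HANDLERS[w >> 10], (w >> 5) & 31, w & 31) for w in code]


def run_cached(decoded: List[Decoded], regs: Regs) -> None:
    """Execute pre-decoded records with no per-instruction field extraction."""
    for handler, a, b in decoded:
        handler(regs, a, b)


# MOVEQ 5,r1 ; MOVEQ 3,r2 ; ADD r1,r2 ; SUB r2,r1
program = [encode(2, 5, 1), encode(2, 3, 2), encode(0, 1, 2), encode(1, 2, 1)]
```

Because the cached form is plain data plus ordinary function calls, it needs no executable-memory permissions, which is the property that matters on JIT-restricted hosts.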
| 130 | +### Path B — accurate-blitter SIMD widening (#124, narrow scope) |
| 131 | + |
| 132 | +Helps **users who enable accurate blitter** specifically: |
| 133 | +- ADDARRAY (`src/tom/blitter.c:2090-2094`, the per-cycle FDSYNC cascade) is the biggest single function — ~15% of frame time on AvP gameplay |
| 134 | +- `BlitterMidsummer2` itself + `DATA()` add another ~19% |
| 135 | +- Existing `test_blitter_compare` infrastructure already gates bit-exactness regressions |
| 136 | +- Lower risk per change than dynarec — can be done in small, mergeable batches |
| 137 | + |
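The workflow Path B relies on (widen the arithmetic, then gate the change on bit-exactness against the reference) can be illustrated with a toy case. The sketch below adds four independent 16-bit lanes packed into a 64-bit word, first lane by lane and then in one wide operation, and compares the two on random inputs. It is an assumption-laden stand-in: it does not reproduce the real `ADDARRAY` carry or saturation behaviour, and the function names are hypothetical rather than part of `test_blitter_compare`.

```python
import random

LANES = 4
LANE_MASK = 0xFFFF
HIGH_BITS = 0x8000_8000_8000_8000  # top bit of each 16-bit lane
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def add_lanes_reference(a: int, b: int) -> int:
    """Reference: add one 16-bit lane at a time, discarding each carry-out."""
    out = 0
    for lane in range(LANES):
        shift = 16 * lane
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        out |= (lane_sum & LANE_MASK) << shift
    return out


def add_lanes_widened(a: int, b: int) -> int:
    """All four lanes in one 64-bit add: sum the low 15 bits of every lane
    (which cannot carry into a neighbour), then patch each top bit with XOR."""
    low = (a & ~HIGH_BITS) + (b & ~HIGH_BITS)
    return (low ^ ((a ^ b) & HIGH_BITS)) & MASK64


def count_mismatches(trials: int = 10_000, seed: int = 0) -> int:
    """Differential check: number of inputs where the two versions disagree."""
    rng = random.Random(seed)
    cases = [(0, 0), (MASK64, MASK64), (MASK64, 1)]
    cases += [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(trials)]
    return sum(add_lanes_reference(a, b) != add_lanes_widened(a, b) for a, b in cases)
```

Any non-zero mismatch count would block the change, which is what keeps small, mergeable batches safe.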
| 138 | +### Suggested order |
| 139 | + |
| 140 | +1. **Start with Path B (accurate-blitter perf)** — smaller surface, lower risk, immediate visible win for users hitting the AvP-style slowdown. ADDARRAY is the obvious first target. |
| 141 | +2. **Then Path A (RISC dynarec)** — bigger lift, bigger payoff, helps everyone including accurate-blitter users still bottlenecked on DSP after step 1. |
| 142 | + |
| 143 | +Closing #123 (OP) as **wontfix-for-now** is still the honest call — even on gameplay it's <2%. The cheap wins documented in #123 (OP `O(N²)` discovery scan, hoist transparency check) remain valid as code-quality fixes but won't move the headline number. |
| 144 | + |
| 145 | +## Why we DIDN'T touch OP/blitter as planned |
| 146 | + |
| 147 | +- The OP spike (#123) called out an `O(N²)` linear scan in `OPObjectExists`. That's still a real bug worth fixing for code-quality reasons. But it costs ~1% of frame time, not the ~30% the spike speculated. Not a perf win. |
| 148 | +- The blitter accurate-mode wins (#124) are real for users who deliberately enable that mode on a specific title. But the default user experience runs the fast blitter, which is single-digit% of frame time. |
| 149 | + |
| 150 | +## What 68K JIT would buy us |
| 151 | + |
| 152 | +Per profile: 0.7-2.6% of frame time across all ROMs. Even a 10× speedup of the 68K interpreter would yield at most a 2-3% wall-clock improvement. **Not worth the GPLv2 / GPLv3 license dance with UAE-JIT or the ARM-only constraint of Cyclone68k.** |
| 153 | + |
| 154 | +The profile result also explains why the standalone Virtual Jaguar interpreter has been "fast enough" for years on desktop: modern CPUs eat the 68K interpreter for breakfast. The GPU/DSP RISC cores do not fare as well, because their interpreters pay call overhead on every instruction. |
| 155 | + |
| 156 | +## Next steps |
| 157 | + |
| 158 | +1. Update issues #122, #123, #124 with this profile data. Keep #122 open and re-scope to GPU/DSP-only. Close #123 with the linked profile evidence. Keep #124 open, re-scoped to the accurate-blitter path (Path B). |
| 159 | +2. New spike: feasibility of a Tom RISC cached-IR / threaded-code dispatcher — the lowest-cost approach, works on JIT-restricted platforms (iOS, Switch). Block JIT comes later if the cached IR proves the model. |
| 160 | +3. The profile data and this doc live at `docs/op-perf-profile.md` (this file). Re-run periodically as a same-host commit-to-commit delta. |
| 161 | + |
| 162 | +## Files / commands |
| 163 | + |
| 164 | +- Baseline benchmarks: `make benchmark BENCH_ROM=<rom> BENCH_BLITTER={fast,accurate}` — 600 frames + 60 warmup default |
| 165 | +- Profile capture: `./test/tools/test_benchmark <core> <rom> 6000 --warmup 60 --blitter <mode> &` then `sample $! 8 -file /tmp/sample.txt` |
| 166 | +- Test ROMs in tree: `test/roms/yarc.j64`, `test/roms/jagniccc.j64` |
| 167 | +- Commercial ROMs (private): `test/roms/private/Iron Soldier (1994).jag` and friends |