Commit 439af9d

JoeMatt and claude committed
perf: profile data + decision doc -- redirect to GPU/DSP RISC dynarec
Captured baseline FPS + sample-based hot-function breakdown across 6 ROMs (yarc, jagniccc, Iron Soldier 1/2, Doom, Skyhammer) on Apple Silicon. Result contradicts the assumptions in spikes #123 (OP) and #124 (blitter):

  yarc.j64       -> GPUExec ~74% of frame time
  Iron Soldier   -> DSPExec ~67% of frame time
  Skyhammer fast -> GPU+DSP ~57% of frame time, blitter <2%
  Skyhammer acc  -> blitter ~21%, but only on opt-in accurate
  OP             -> 1% on Iron Soldier, doesn't crack top-15 elsewhere
  68K            -> 0.7-2.6% everywhere; JIT not worth the licensing

The right next perf target is GPU/DSP RISC dynarec or cached IR (half of issue #122). Both share the same Tom RISC ISA, so a single dispatcher attacks both.

docs/op-perf-profile.md captures the methodology, the ROM-by-ROM numbers, and the recommendation.

Drive-by: quote $(BENCH_ROM) in the Makefile benchmark recipe so ROM paths with spaces / parens (every commercial ROM filename) work.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
1 parent b73a11c commit 439af9d

2 files changed

Lines changed: 122 additions & 1 deletion

Makefile

Lines changed: 1 addition & 1 deletion
@@ -878,7 +878,7 @@ benchmark: $(TARGET)
 	$(CC) -O2 -Wall -std=c99 $(INCFLAGS) \
 	-o test/tools/test_benchmark test/tools/test_benchmark.c \
 	$(if $(filter Linux,$(shell uname -s)),-ldl)
-	./test/tools/test_benchmark ./$(TARGET) $(BENCH_ROM) $(BENCH_FRAMES) \
+	./test/tools/test_benchmark ./$(TARGET) "$(BENCH_ROM)" $(BENCH_FRAMES) \
 	--warmup $(BENCH_WARMUP) --blitter $(BENCH_BLITTER)
 
 print-%:

docs/op-perf-profile.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# Performance baseline + profile: OP vs blitter vs GPU/DSP

Captured 2026-05-01 on Apple Silicon (M-series Mac), arm64 build, default flags + `RELEASE_DEBUG_INFO=1`. See `make benchmark` and `docs/profiling.md` for the methodology.

## TL;DR

**OP is not the bottleneck. Blitter is rarely the bottleneck either. GPU and DSP RISC interpretation dominate frame time across the board.**

This redirects the next-up perf work from issue #123 (OP) to a different target; see "Recommendation" at the bottom.

## Baseline FPS

`make benchmark`: 600 frames after 60 warmup frames, headless, no video presentation, no audio output. Same Mac, same thermal state.

| ROM | Fast blitter | Accurate blitter | Slowdown (fast ÷ accurate) |
|---|---:|---:|---:|
| `yarc.j64` (raycasting demo) | 280 FPS | 229 FPS | 1.22× |
| `jagniccc.j64` (NICCC compo demo) | 355 FPS | 258 FPS | 1.37× |
| `Iron Soldier (1994).jag` | 312 FPS | 258 FPS | 1.21× |
| `Iron Soldier 2 (World).j64` | 313 FPS | 266 FPS | 1.18× |
| `Doom - Evil Unleashed (1994).jag` | 306 FPS | 261 FPS | 1.17× |
| `Skyhammer_(1999).jag` | 339 FPS | 215 FPS | **1.58×** |

All ROMs run well above 60 FPS on M-series. The "where does it hurt" question matters mainly for slower hosts (Pi, mobile), where the ratios should be similar but the absolute numbers lower.
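
The Slowdown column is the ratio of the two FPS figures, and converting FPS to per-frame time makes the headroom against a 60 FPS budget explicit. A minimal sketch of that arithmetic; the helper names `frame_time_ms` and `slowdown` are hypothetical and not part of the repo:

```python
def frame_time_ms(fps: float) -> float:
    """Average emulation time per frame, in milliseconds."""
    return 1000.0 / fps


def slowdown(fast_fps: float, accurate_fps: float) -> float:
    """How many times longer a frame takes with the accurate blitter."""
    return fast_fps / accurate_fps


# Skyhammer row from the table above: 339 FPS fast, 215 FPS accurate.
print(f"{slowdown(339, 215):.2f}x")           # 1.58x
print(f"{frame_time_ms(215):.2f} ms/frame")   # 4.65 ms, vs a 16.67 ms budget at 60 FPS
```

On a host several times slower than the M-series Mac, that 4.65 ms would scale up toward the 16.67 ms budget, which is where the per-subsystem breakdown starts to matter.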

## Profile breakdown (`/usr/bin/sample`, 8 s of a 6000-frame run)

Captured to `/tmp/op-baseline/sample-*-*.txt`. Aggregated top-of-stack by symbol:

### `yarc.j64` (demo, fast blitter)

| Function | Samples | % of frame |
|---|---:|---:|
| `GPUExec` (per-instruction dispatch in `src/tom/gpu.c`) | ~5018 | **~74%** |
| `blitter_generic` | ~649 | 10% |
| `HalflineCallback` | 79 | 1.2% |
| (`OPProcessFixedBitmap` etc. didn't make top-15) | | <1% |

**Demo content with heavy GPU programs. This pattern would generalize to anything that uses the GPU's RISC for software rendering.**

### `Iron Soldier (1994).jag` (commercial, fast blitter)

| Function | Samples | % of frame |
|---|---:|---:|
| `DSPExec` (per-instruction dispatch in `src/jerry/dsp.c`) | 3456 | **~51%** |
| `dsp_opcode_jr` | 818 | 12% |
| `dsp_opcode_*` (load_r15_indexed, jump, load) | ~250 | 4% |
| **DSP TOTAL** | **~4524** | **~67%** |
| `HalflineCallback` / `JaguarReadLong` / `JaguarExecuteNew` | ~250 | 4% |
| `OPProcessFixedBitmap` | 68 | **1%** |
| `m68k_execute` | 49 | 0.7% |
| `blitter_generic` | 48 | 0.7% |

**OP at 1%. Blitter at 0.7%. 68K at 0.7%. Two-thirds of frame time in DSP interpretation.**

### `Skyhammer_(1999).jag` (commercial, fast blitter)

| Function | Samples | % of frame |
|---|---:|---:|
| `DSPExec` | 2260 | **~33%** |
| `GPUExec` | ~1600 | **~24%** |
| `m68k_execute` | 177 | 2.6% |
| `blitter_generic` | 112 | 1.6% |
| `HalflineCallback` | 106 | 1.6% |

**GPU + DSP = 57% of frame time. Blitter is irrelevant here on the fast path.**

### `Skyhammer_(1999).jag` (commercial, accurate blitter)

| Function | Samples | % of frame |
|---|---:|---:|
| `DSPExec` | 1525 | **~22%** |
| `BlitterMidsummer2` + `DATA` + `ADDARRAY` | ~1430 | **~21%** |
| `GPUExec` | ~720 | **~11%** |
| `m68k_execute` | 97 | 1.4% |

**The one case where the blitter genuinely matters, and only because the user opted into accurate mode. Even here, DSP is comparable.**
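
The tables above come from counting which symbol sits at the top of the stack in each sample and dividing by the total. A minimal sketch of that aggregation, assuming the top-of-stack symbol for each sample has already been extracted from the `sample` report (the parsing step is not shown, and `share_by_symbol` is a hypothetical helper, not a script in the repo):

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def share_by_symbol(
    top_frames: Iterable[str],
    prefix_groups: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, int, float]]:
    """Return (symbol, samples, percent) rows sorted by descending sample count.

    prefix_groups maps a symbol prefix to a bucket name, e.g. to fold every
    dsp_opcode_* handler into one row.
    """
    prefix_groups = prefix_groups or {}
    counts = Counter()
    for symbol in top_frames:
        bucket = symbol
        for prefix, name in prefix_groups.items():
            if symbol.startswith(prefix):
                bucket = name
                break
        counts[bucket] += 1
    total = sum(counts.values())
    return [(sym, n, 100.0 * n / total) for sym, n in counts.most_common()]


# Synthetic input for illustration only, not real profile data.
samples = ["GPUExec"] * 3 + ["blitter_generic"]
for symbol, n, pct in share_by_symbol(samples):
    print(f"{symbol:20s} {n:5d} {pct:5.1f}%")
```

Passing `{"dsp_opcode_": "dsp_opcode_*"}` as `prefix_groups` would reproduce the kind of grouped row used in the Iron Soldier table.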

## What this changes

The original `[spike] OP performance audit` (#123) and `[spike] Blitter performance audit` (#124) both assumed those subsystems dominate. They don't.

| Target | Hypothesis going in | Reality | ROI |
|---|---|---|---|
| **OP** (#123) | "dominates on heavy-OP scenes (Wolf3D, T2K, Iron Soldier)" | 1% on Iron Soldier; doesn't make top-15 elsewhere | Low: even a 10× OP speedup buys ~1% of frame time |
| **Blitter** (#124) | "fast vs accurate matters for some games" | Fast is <2% everywhere except where accurate is opted-in (then ~21% on Skyhammer) | Medium: only for users running accurate blitter on specific titles |
| **GPU/DSP RISC** (#122 sub-component) | "JIT might help on mobile/Pi" | **24-74% of frame time, every ROM** | **High**: single dynarec helps both because they share an ISA |
| **68K** (#122 sub-component) | "wrap UAE JIT or Cyclone68k" | 0.7-2.6% — barely visible in any profile | Low |
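
The ROI column follows from Amdahl's law: speeding up a subsystem only helps in proportion to its share of frame time. A small sketch of that calculation, using shares from the profiles above; the 10× and 2× factors are illustrative assumptions, not measured or projected gains:

```python
def frame_speedup(share: float, factor: float) -> float:
    """Amdahl's law: whole-frame speedup when a subsystem that takes `share`
    of frame time (0..1) becomes `factor` times faster."""
    return 1.0 / ((1.0 - share) + share / factor)


def time_saved(share: float, factor: float) -> float:
    """Fraction of the original frame time removed by that speedup."""
    return share * (1.0 - 1.0 / factor)


for name, share, factor in [
    ("OP, Iron Soldier", 0.01, 10.0),
    ("68K, Skyhammer fast", 0.026, 10.0),
    ("DSP, Iron Soldier", 0.67, 2.0),
    ("GPU, yarc", 0.74, 2.0),
]:
    saved = time_saved(share, factor)
    print(f"{name:20s} saves {saved:6.1%} of frame time -> {frame_speedup(share, factor):.2f}x overall")
```

Under these assumptions a 10× faster OP yields roughly 1.01× overall, while a DSP interpreter that is only 2× faster yields roughly 1.50× on Iron Soldier.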

## Recommendation

**Redirect to #122 (JIT / dynarec / cached IR), specifically the GPU/DSP RISC half.** Tom GPU and Jerry DSP share essentially the same ISA (~64 opcodes, fixed 16-bit encoding, no MMU). A single basic-block dynarec or cached-IR dispatcher covers both, and the profile data says it would attack the actual hot path on every game tested:

- Demos (yarc/jagniccc) → mostly GPU.
- Commercial games (Iron Soldier, Doom, Skyhammer) → mostly DSP, sometimes both.
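
To make the cached-IR idea concrete, here is a toy model of decoding a block once and replaying the decoded form. It is a sketch in Python rather than the emulator's C, and everything in it is an assumption for illustration: the opcode numbers are made up, the 6/5/5-bit field split is assumed, and `BlockCache` is a hypothetical name. It also ignores branches, flags, and invalidation when code memory is overwritten, all of which a real implementation would need.

```python
from typing import Callable, Dict, List, Tuple

MASK32 = 0xFFFFFFFF
Regs = List[int]
Handler = Callable[[Regs, int, int], None]


def op_add(regs: Regs, src: int, dst: int) -> None:
    regs[dst] = (regs[dst] + regs[src]) & MASK32


def op_sub(regs: Regs, src: int, dst: int) -> None:
    regs[dst] = (regs[dst] - regs[src]) & MASK32


def op_addq(regs: Regs, imm: int, dst: int) -> None:
    regs[dst] = (regs[dst] + imm) & MASK32


# Hypothetical opcode numbers for the toy ISA (not the real Tom/Jerry table).
HANDLERS: Dict[int, Handler] = {0: op_add, 1: op_sub, 2: op_addq}


def encode(opcode: int, field1: int, field2: int) -> int:
    """Pack a 16-bit word as 6-bit opcode, 5-bit field, 5-bit field (assumed layout)."""
    return (opcode << 10) | (field1 << 5) | field2


def decode(word: int) -> Tuple[Handler, int, int]:
    """Unpack one word into (handler, field1, field2)."""
    return HANDLERS[word >> 10], (word >> 5) & 31, word & 31


def run_interpreted(code: List[int], regs: Regs) -> None:
    """Baseline: every instruction is decoded again each time it executes."""
    for word in code:
        handler, a, b = decode(word)
        handler(regs, a, b)


class BlockCache:
    """Cached IR: decode a straight-line block once, keyed by its start address."""

    def __init__(self) -> None:
        self.blocks: Dict[int, List[Tuple[Handler, int, int]]] = {}
        self.decodes = 0  # how many words have been decoded so far

    def run(self, pc: int, code: List[int], regs: Regs) -> None:
        block = self.blocks.get(pc)
        if block is None:  # first visit: decode and remember
            block = [decode(word) for word in code]
            self.decodes += len(block)
            self.blocks[pc] = block
        for handler, a, b in block:  # later visits skip decoding entirely
            handler(regs, a, b)


program = [encode(2, 3, 1), encode(0, 1, 2)]  # r1 += 3; r2 += r1
regs = [0] * 32
cache = BlockCache()
for _ in range(1000):
    cache.run(0x1000, program, regs)
print(regs[1], regs[2], cache.decodes)  # 3000 1501500 2
```

The block runs 1000 times but its two words are decoded only once. Because the handlers take the register file as an argument, the same cache could in principle serve both processors, which is the point of targeting a shared ISA.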

Closing #123 (OP) and #124 (blitter) as **wontfix-for-now** based on profile data is the honest call. Cheap wins documented in those spikes (e.g., the OP `O(N²)` discovery bug, fast-blitter SIMD widening) can still land opportunistically; they're correct fixes that just won't move the headline number.

## Why we DIDN'T touch OP/blitter as planned

- The OP spike (#123) called out an `O(N²)` linear scan in `OPObjectExists`. That's still a real bug worth fixing for code-quality reasons. But it costs ~1% of frame time, not the ~30% the spike speculated. Not a perf win.
- The blitter accurate-mode wins (#124) are real for users who deliberately enable that mode on a specific title. But the default user experience runs the fast blitter, which is at most ~10% of frame time in the profiles above (under 2% in the commercial titles).

## What 68K JIT would buy us

Per profile: 0.7-2.6% of frame time across all ROMs. Even a 10× speedup of the 68K interpreter would yield at most a 2-3% wall-clock improvement. **Not worth the GPLv2 / GPLv3 license dance with UAE-JIT or the ARM-only constraint of Cyclone68k.**

The profile result also explains why the standalone Virtual Jaguar interpreter has been "fast enough" for years on desktop: modern CPUs make short work of the 68K interpreter. The GPU/DSP RISC interpreters fare worse because they pay call overhead on every emulated instruction.

## Next steps

1. Update issues #122, #123, #124 with this profile data. Keep #122 open and re-scope to GPU/DSP-only. Close #123 and #124 with the linked profile evidence.
2. New spike: feasibility of a Tom RISC cached-IR / threaded-code dispatcher. This is the lowest-cost approach, and it works on JIT-restricted platforms (iOS, Switch). Block JIT comes later if the cached IR proves the model.
3. Profile data + this doc live at `docs/op-perf-profile.md` (this file). Re-run periodically as a same-host commit-to-commit delta.

## Files / commands

- Baseline benchmarks: `make benchmark BENCH_ROM=<rom> BENCH_BLITTER={fast,accurate}` (defaults: 600 frames + 60 warmup)
- Profile capture: `./test/tools/test_benchmark <core> <rom> 6000 --warmup 60 --blitter <mode> &` then `sample $! 8 -file /tmp/sample.txt`
- Test ROMs in tree: `test/roms/yarc.j64`, `test/roms/jagniccc.j64`
- Commercial ROMs (private): `test/roms/private/Iron Soldier (1994).jag` and friends
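
Paths like the last one contain spaces and parentheses, which is why the benchmark recipe quotes `$(BENCH_ROM)`. The sketch below illustrates the word-splitting half of the problem, using Python's `shlex` as a stand-in for POSIX shell tokenization. This is an approximation: a real shell would additionally treat the unquoted parentheses as syntax. `CORE` is a placeholder, not a real file name.

```python
import shlex

rom = "test/roms/private/Iron Soldier (1994).jag"

# Unquoted: the path is split on whitespace into three separate arguments.
unquoted = shlex.split(f"./test/tools/test_benchmark CORE {rom} 600")
print(len(unquoted), unquoted[2:5])

# Quoted: the path survives as a single argument.
quoted = shlex.split(f"./test/tools/test_benchmark CORE {shlex.quote(rom)} 600")
print(len(quoted), quoted[2])
```

With the unquoted form the benchmark tool would receive `test/roms/private/Iron` as the ROM path and `Soldier` as the frame count.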
