Commit 662b13f

Merge pull request #128 from libretro/feature/op-perf-baseline
perf: profile baseline -- redirect from OP/blitter to GPU/DSP RISC
2 parents b73a11c + 2be4278 commit 662b13f

3 files changed

Lines changed: 292 additions & 2 deletions

Makefile

Lines changed: 1 addition & 1 deletion
@@ -878,7 +878,7 @@ benchmark: $(TARGET)
878878
$(CC) -O2 -Wall -std=c99 $(INCFLAGS) \
879879
-o test/tools/test_benchmark test/tools/test_benchmark.c \
880880
$(if $(filter Linux,$(shell uname -s)),-ldl)
881-
./test/tools/test_benchmark ./$(TARGET) $(BENCH_ROM) $(BENCH_FRAMES) \
881+
./test/tools/test_benchmark ./$(TARGET) "$(BENCH_ROM)" $(BENCH_FRAMES) \
882882
--warmup $(BENCH_WARMUP) --blitter $(BENCH_BLITTER)
883883

884884
print-%:

docs/op-perf-profile.md

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
1+
# Performance baseline + profile: OP vs blitter vs GPU/DSP
2+
3+
Captured 2026-05-01 on Apple Silicon (M-series Mac), arm64 build, default flags + `RELEASE_DEBUG_INFO=1`. See `make benchmark` and `docs/profiling.md` for the methodology.
4+
5+
## TL;DR
6+
7+
**OP is not the bottleneck. GPU/DSP RISC interpretation is the biggest single target across the board. The accurate blitter is genuinely hot during gameplay (not just at boot) — comparable in cost to DSP for users who enable that mode.**
8+
9+
> **Methodology fix mid-investigation:** the first round of this doc profiled the headless boot/menu state and concluded blitter was always small. That was misleading — boot screens don't draw much. Re-profiled with a save state loaded on top of the live ROM (Alien vs Predator at an in-game state6 save) and the picture changes substantially for the accurate-blitter case. Both data sets are below.
10+
11+
## Baseline FPS
12+
13+
`make benchmark`: 600 frames after 60 warmup frames, headless, no video presentation, no audio output. Same Mac and same thermal state for every run.
14+
15+
| ROM | Fast blitter | Accurate blitter | Slowdown |
16+
|---|---:|---:|---:|
17+
| `yarc.j64` (raycasting demo) | 280 FPS | 229 FPS | 1.22× |
18+
| `jagniccc.j64` (NICCC compo demo) | 355 FPS | 258 FPS | 1.37× |
19+
| `Iron Soldier (1994).jag` | 312 FPS | 258 FPS | 1.21× |
20+
| `Iron Soldier 2 (World).j64` | 313 FPS | 266 FPS | 1.18× |
21+
| `Doom - Evil Unleashed (1994).jag` | 306 FPS | 261 FPS | 1.17× |
22+
| `Skyhammer_(1999).jag` | 339 FPS | 215 FPS | **1.58×** |
23+
24+
All ROMs run well above 60 FPS on the M-series host. The "where does it hurt" question matters on slower hosts (Pi, mobile), where we expect similar ratios at lower absolute numbers; that expectation has not been measured here.
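
The Slowdown column is fast-blitter FPS divided by accurate-blitter FPS. A minimal C sketch of that arithmetic, plus the per-frame time it implies, follows. The function names are illustrative and are not part of `test/tools/test_benchmark.c`.

```c
#include <assert.h>

/* Hypothetical helpers for reading the table; not part of test_benchmark.c. */

/* Slowdown factor of accurate mode relative to fast mode. */
double slowdown_ratio(double fast_fps, double accurate_fps)
{
    assert(fast_fps > 0.0 && accurate_fps > 0.0);
    return fast_fps / accurate_fps;
}

/* Average wall-clock time per emulated frame, in milliseconds. */
double frame_time_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}
```

For the Skyhammer row, `slowdown_ratio(339.0, 215.0)` comes out at roughly 1.58, and `frame_time_ms(215.0)` is about 4.65 ms, still well inside the roughly 16.7 ms budget of a 60 FPS target.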
25+
26+
## Profile breakdown (`/usr/bin/sample`, 8s @ 6000-frame run)
27+
28+
Captured to `/tmp/op-baseline/sample-*-*.txt`. Aggregated top-of-stack by symbol:
29+
30+
### `yarc.j64` (demo, fast blitter)
31+
32+
| Function | Samples | % of frame |
33+
|---|---:|---:|
34+
| `GPUExec` (per-instruction dispatch in `src/tom/gpu.c`) | ~5018 | **~74%** |
35+
| `blitter_generic` | ~649 | 10% |
36+
| `HalflineCallback` | 79 | 1.2% |
37+
| `OPProcessFixedBitmap` etc. (outside the top 15) | n/a | <1% |
38+
39+
**Demo content with heavy GPU programs. This pattern would generalize to anything that uses the GPU's RISC for software rendering.**
40+
41+
### `Iron Soldier (1994).jag` (commercial, fast blitter)
42+
43+
| Function | Samples | % of frame |
44+
|---|---:|---:|
45+
| `DSPExec` (per-instruction dispatch in `src/jerry/dsp.c`) | 3456 | **~51%** |
46+
| `dsp_opcode_jr` | 818 | 12% |
47+
| `dsp_opcode_*` (load_r15_indexed, jump, load) | ~250 | 4% |
48+
| **DSP TOTAL** | **~4524** | **~67%** |
49+
| `HalflineCallback` / `JaguarReadLong` / `JaguarExecuteNew` | ~250 | 4% |
50+
| `OPProcessFixedBitmap` | 68 | **1%** |
51+
| `m68k_execute` | 49 | 0.7% |
52+
| `blitter_generic` | 48 | 0.7% |
53+
54+
**OP at 1%. Blitter at 0.7%. 68K at 0.7%. Two-thirds of frame time in DSP interpretation.**
55+
56+
### `Skyhammer_(1999).jag` (commercial, fast blitter)
57+
58+
| Function | Samples | % of frame |
59+
|---|---:|---:|
60+
| `DSPExec` | 2260 | **~33%** |
61+
| `GPUExec` | ~1600 | **~24%** |
62+
| `m68k_execute` | 177 | 2.6% |
63+
| `blitter_generic` | 112 | 1.6% |
64+
| `HalflineCallback` | 106 | 1.6% |
65+
66+
**GPU + DSP = 57% of frame time. Blitter is irrelevant here on the fast path.**
67+
68+
### `Skyhammer_(1999).jag` (commercial, accurate blitter)
69+
70+
| Function | Samples | % of frame |
71+
|---|---:|---:|
72+
| `DSPExec` | 1525 | **~22%** |
73+
| `BlitterMidsummer2` + `DATA` + `ADDARRAY` | ~1430 | **~21%** |
74+
| `GPUExec` | ~720 | **~11%** |
75+
| `m68k_execute` | 97 | 1.4% |
76+
77+
**Among the boot profiles, this is the one case where the blitter genuinely matters, and only because the user opted into accurate mode. Even here, DSP is comparable.**
78+
79+
### `Alien vs Predator (1994)` at gameplay state6 (fast blitter)
80+
81+
| Function | Samples | % of frame |
82+
|---|---:|---:|
83+
| `DSPExec` | 3303 | **~49%** |
84+
| `dsp_opcode_jr` | 977 | 14% |
85+
| **DSP TOTAL** | **~4280** | **~63%** |
86+
| `blitter_generic` | ~340 | 5% |
87+
| `GPUExec` | ~254 | 4% |
88+
| `HalflineCallback` / `JaguarExecuteNew` / `JaguarReadLong` | ~310 | 5% |
89+
| `OPProcessFixedBitmap` | 72 | 1% |
90+
91+
**Same DSP-dominated picture as the boot profiles.** Blitter is meaningfully present (5%) since the scene is actively drawing, but still single-digit.
92+
93+
### `Alien vs Predator (1994)` at gameplay state6 (accurate blitter)
94+
95+
| Function | Samples | % of frame |
96+
|---|---:|---:|
97+
| **`BlitterMidsummer2` family** (`BlitterMidsummer2` itself, `ADDARRAY` at lines 2090-2094, `DATA`) | ~2330 | **~34%** |
98+
| `DSPExec` | 1652 | **~24%** |
99+
| `dsp_opcode_jr` | 443 | 6.5% |
100+
| **DSP TOTAL** | **~2095** | **~31%** |
101+
102+
**Accurate blitter is co-equal with DSP during gameplay.** The hot inner code is `ADDARRAY` (lines 2090-2094 in `src/tom/blitter.c`) — the per-cycle FDSYNC cascade the spike report (#124) flagged — plus `DATA()` (line 2514). Confirms the spike's hypothesis was right; the boot-only profile just didn't expose it.
103+
104+
This matches the user-reported behavior: AvP slows down noticeably when the character moves, and the slowdown is worse with the accurate blitter selected.
105+
106+
## What this changes
107+
108+
The original `[spike] OP performance audit` (#123) and `[spike] Blitter performance audit` (#124) both assumed those subsystems dominate. On the default fast-blitter path they don't; only the opt-in accurate blitter comes close.
109+
110+
| Target | Hypothesis going in | Reality | ROI |
111+
|---|---|---|---|
112+
| **OP** (#123) | "dominates on heavy-OP scenes (Wolf3D, T2K, Iron Soldier)" | 1% on Iron Soldier; doesn't make top-15 elsewhere | Low — even a 10× OP speedup buys ~1% of frame time |
113+
| **Blitter — fast** (#124) | "default path matters" | <2% on boot, 5% on AvP gameplay — small everywhere | Low |
114+
| **Blitter — accurate** (#124) | "matters for some games" | **~34% on AvP gameplay**, 21% on Skyhammer (boot); ADDARRAY cycle cascade dominates | **High for accurate-blitter users** |
115+
| **GPU/DSP RISC** (#122 sub-component) | "JIT might help on mobile/Pi" | **24-74% on fast blitter, ~31% even with accurate** | **High** — single dynarec covers both (shared ISA) |
116+
| **68K** (#122 sub-component) | "wrap UAE JIT or Cyclone68k" | 0.7-2.6% — barely visible in any profile | Low |
117+
118+
## Recommendation
119+
120+
Two genuine wins, depending on which user we optimize for:
121+
122+
### Path A — RISC dynarec / cached IR (#122, RISC half only)
123+
124+
Helps **everyone** (every ROM tested, every blitter mode):
125+
- Fast-blitter users: attacks the dominant 50-75% slice
126+
- Accurate-blitter users: attacks the ~31% DSP slice (the other ~34% is blitter)
127+
- Cross-platform reach: cached IR / threaded code works on JIT-restricted hosts (iOS without entitlement, Switch)
128+
- Both Tom GPU and Jerry DSP share the same ISA → single dispatcher covers both
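
To make "cached IR" concrete, here is a minimal sketch on a toy machine: each instruction word is decoded once into a handler pointer plus its operand fields, and the hot loop then only walks the decoded array. Everything in it is an assumption for illustration. The `toy_*` names, the opcode numbering, and the 6/5/5-bit field layout are not taken from `src/tom/gpu.c` or `src/jerry/dsp.c`. Branches and invalidation on writes to RISC RAM, which a real dispatcher would need, are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy machine for illustration only; NOT the real Tom/Jerry state or opcode map. */
typedef struct { uint32_t r[32]; } toy_cpu;

typedef struct toy_ir toy_ir;
typedef void (*toy_handler)(toy_cpu *cpu, const toy_ir *ins);

/* One pre-decoded instruction: handler pointer plus extracted fields. */
struct toy_ir { toy_handler fn; uint8_t src; uint8_t dst; };

/* dst += register src */
void toy_add(toy_cpu *cpu, const toy_ir *ins)  { cpu->r[ins->dst] += cpu->r[ins->src]; }
/* dst += immediate held in the src field */
void toy_addq(toy_cpu *cpu, const toy_ir *ins) { cpu->r[ins->dst] += ins->src; }
/* Unknown opcodes do nothing in this toy. */
void toy_nop(toy_cpu *cpu, const toy_ir *ins)  { (void)cpu; (void)ins; }

/* Decode each 16-bit word once: 6-bit opcode, 5-bit src, 5-bit dst (assumed layout). */
void toy_decode(const uint16_t *code, size_t count, toy_ir *out)
{
    assert(code != NULL && out != NULL);
    for (size_t i = 0; i < count; i++)
    {
        unsigned opcode = code[i] >> 10;
        out[i].src = (uint8_t)((code[i] >> 5) & 31u);
        out[i].dst = (uint8_t)(code[i] & 31u);
        switch (opcode)
        {
            case 0:  out[i].fn = toy_add;  break;
            case 1:  out[i].fn = toy_addq; break;
            default: out[i].fn = toy_nop;  break;
        }
    }
}

/* Hot loop: no field extraction or opcode switch, just one indirect call per instruction. */
void toy_run(toy_cpu *cpu, const toy_ir *ir, size_t count)
{
    for (size_t i = 0; i < count; i++)
        ir[i].fn(cpu, &ir[i]);
}
```

This is the call-threaded variant: it still pays one indirect call per instruction, but the decode work moves out of the hot loop and no executable memory is generated, which is why the approach remains usable on JIT-restricted hosts.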
129+
130+
### Path B — accurate-blitter SIMD widening (#124, narrow scope)
131+
132+
Helps **users who enable accurate blitter** specifically:
133+
- ADDARRAY (`src/tom/blitter.c:2090-2094`, the per-cycle FDSYNC cascade) is the biggest single function — ~15% of frame time on AvP gameplay
134+
- `BlitterMidsummer2` itself + `DATA()` add another ~19%
135+
- Existing `test_blitter_compare` infrastructure already gates bit-exactness regressions
136+
- Lower risk per change than dynarec — can be done in small, mergeable batches
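
The "widen, then gate on bit-exactness" workflow can be sketched on a stand-in operation. The example below does not reproduce `ADDARRAY`. It uses a per-byte wrapping add as a hypothetical placeholder, with a lane-by-lane reference, a widened version that handles four lanes in one 32-bit add, and a comparison loop in the spirit of `test_blitter_compare`. All names here are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: add four packed bytes one lane at a time, wrapping within each lane. */
uint32_t add_bytes_ref(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++)
    {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        out |= ((x + y) & 0xFFu) << (8 * lane);
    }
    return out;
}

/* Widened: add the low 7 bits of every lane in one 32-bit add (no carry can
 * cross a lane boundary), then patch in each lane's top bit with XOR. */
uint32_t add_bytes_wide(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low ^ ((a ^ b) & 0x80808080u);
}

/* Bit-exactness gate: count inputs where the two versions disagree.
 * Inputs come from a fixed-seed LCG so every run checks the same set. */
size_t count_mismatches(size_t iterations)
{
    uint32_t state = 0x12345678u;
    size_t bad = 0;
    assert(iterations > 0);
    for (size_t i = 0; i < iterations; i++)
    {
        uint32_t a, b;
        state = state * 1664525u + 1013904223u;
        a = state;
        state = state * 1664525u + 1013904223u;
        b = state;
        if (add_bytes_ref(a, b) != add_bytes_wide(a, b))
            bad++;
    }
    return bad;
}
```

Keeping the scalar reference alongside the widened version is what makes small, mergeable batches practical: each batch lands only if the mismatch count stays at zero.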
137+
138+
### Suggested order
139+
140+
1. **Start with Path B (accurate-blitter perf)** — smaller surface, lower risk, immediate visible win for users hitting the AvP-style slowdown. ADDARRAY is the obvious first target.
141+
2. **Then Path A (RISC dynarec)** — bigger lift, bigger payoff, helps everyone including accurate-blitter users still bottlenecked on DSP after step 1.
142+
143+
Closing #123 (OP) as **wontfix-for-now** is still the honest call — even on gameplay it's <2%. The cheap wins documented in #123 (OP `O(N²)` discovery scan, hoist transparency check) remain valid as code-quality fixes but won't move the headline number.
144+
145+
## Why we DIDN'T touch OP/blitter as planned
146+
147+
- The OP spike (#123) called out a linear scan in `OPObjectExists` that makes object discovery `O(N²)`. That's still a real bug worth fixing for code-quality reasons, but it costs ~1% of frame time, not the ~30% the spike speculated. Not a perf win.
148+
- The blitter accurate-mode wins (#124) are real for users who deliberately enable that mode on a specific title. But the default user experience runs the fast blitter, which accounts for a single-digit percentage of frame time.
149+
150+
## What 68K JIT would buy us
151+
152+
Per profile: 0.7-2.6% of frame time across all ROMs. Even a 10× speedup of the 68K interpreter → at most 2-3% wall-clock improvement. **Not worth the GPLv2 / GPLv3 license dance with UAE-JIT or the ARM-only constraint of Cyclone68k.**
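
The bound above is Amdahl's law applied to one slice of the frame: if a component takes fraction `f` of frame time and is made `s` times faster, the frame shrinks by `f * (1 - 1/s)`. A small C sketch of that arithmetic, with hypothetical function names:

```c
#include <assert.h>

/* Fraction of total frame time saved when a component that accounts for
 * `fraction` of the frame is made `speedup` times faster. */
double frame_time_saved(double fraction, double speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(speedup > 0.0);
    return fraction * (1.0 - 1.0 / speedup);
}

/* Resulting speedup of the whole frame (Amdahl's law). */
double overall_speedup(double fraction, double speedup)
{
    return 1.0 / (1.0 - frame_time_saved(fraction, speedup));
}
```

With the largest 68K share measured here, `frame_time_saved(0.026, 10.0)` is about 0.023, or roughly 2.3% of frame time. The same hypothetical 10× applied to a 67% DSP slice would save about 60%.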
153+
154+
The profile result also explains why the standalone Virtual Jaguar interpreter has been "fast enough" for years on desktop: modern CPUs eat the 68K interpreter for breakfast. The GPU/DSP RISC interpreters fare worse because every emulated instruction pays function-call dispatch overhead.
155+
156+
## Next steps
157+
158+
1. Update issues #122, #123, #124 with this profile data. Keep #122 open and re-scope it to GPU/DSP only. Close #123 with the linked profile evidence, and keep #124 open, narrowed to the accurate-blitter work in Path B.
159+
2. New spike: feasibility of a Tom RISC cached-IR / threaded-code dispatcher — the lowest-cost approach, works on JIT-restricted platforms (iOS, Switch). Block JIT comes later if the cached IR proves the model.
160+
3. The profile data and this doc live at `docs/op-perf-profile.md` (this file). Re-run the profile periodically as a same-host commit-to-commit delta.
161+
162+
## Files / commands
163+
164+
- Baseline benchmarks: `make benchmark BENCH_ROM=<rom> BENCH_BLITTER={fast,accurate}` — 600 frames + 60 warmup default
165+
- Profile capture: `./test/tools/test_benchmark <core> <rom> 6000 --warmup 60 --blitter <mode> &` then `sample $! 8 -file /tmp/sample.txt`
166+
- Test ROMs in tree: `test/roms/yarc.j64`, `test/roms/jagniccc.j64`
167+
- Commercial ROMs (private): `test/roms/private/Iron Soldier (1994).jag` and friends

test/tools/test_benchmark.c

Lines changed: 124 additions & 1 deletion
@@ -34,6 +34,8 @@ static void (*pretro_run)(void);
3434
static void (*pretro_unload_game)(void);
3535
static void *(*pretro_get_memory_data)(unsigned);
3636
static size_t (*pretro_get_memory_size)(unsigned);
37+
static size_t (*pretro_serialize_size)(void);
38+
static bool (*pretro_unserialize)(const void *, size_t);
3739

3840
/* Options state */
3941
static int bios_option_set = 0;
@@ -181,13 +183,17 @@ static void print_usage(const char *progname)
181183
fprintf(stderr,
182184
"Usage: %s <core.dylib> <rom_file> [num_frames]\n"
183185
" [--blitter fast|accurate] [--warmup N] [--load-srm file]\n"
186+
" [--load-state file]\n"
184187
"\n"
185188
"Options:\n"
186189
" num_frames Number of frames to benchmark (default: 300)\n"
187190
" --blitter fast Use fast blitter (default)\n"
188191
" --blitter accurate Use accurate (Midsummer2) blitter\n"
189192
" --warmup N Run N warmup frames before timing\n"
190-
" --load-srm file Load EEPROM save data from file\n",
193+
" --load-srm file Load EEPROM save data from file\n"
194+
" --load-state file Load a save state into the core after retro_load_game.\n"
195+
" Accepts raw retro_serialize() payloads or RetroArch\n"
196+
" RASTATE container files (the MEM chunk is extracted).\n",
191197
progname);
192198
}
193199

@@ -197,6 +203,7 @@ int main(int argc, char **argv)
197203
const char *core_path;
198204
const char *rom_path;
199205
const char *srm_load_path = NULL;
206+
const char *state_load_path = NULL;
200207
struct retro_game_info info;
201208
FILE *f;
202209
long fsize;
@@ -235,6 +242,8 @@ int main(int argc, char **argv)
235242
warmup_frames = atoi(argv[++i]);
236243
else if (strcmp(argv[i], "--load-srm") == 0 && i + 1 < argc)
237244
srm_load_path = argv[++i];
245+
else if (strcmp(argv[i], "--load-state") == 0 && i + 1 < argc)
246+
state_load_path = argv[++i];
238247
else if (strcmp(argv[i], "--help") == 0 || strcmp(argv[i], "-h") == 0)
239248
{
240249
print_usage(argv[0]);
@@ -298,6 +307,8 @@ int main(int argc, char **argv)
298307
LOAD_SYM(retro_unload_game);
299308
LOAD_SYM(retro_get_memory_data);
300309
LOAD_SYM(retro_get_memory_size);
310+
LOAD_SYM(retro_serialize_size);
311+
LOAD_SYM(retro_unserialize);
301312

302313
pretro_set_environment(environment_cb);
303314
pretro_set_video_refresh(video_refresh);
@@ -355,6 +366,118 @@ int main(int argc, char **argv)
355366
fprintf(stderr, "WARNING: Core reports no SAVE_RAM area\n");
356367
}
357368

369+
/* Load save state if provided. Accepts both raw retro_serialize()
370+
* payloads and RetroArch RASTATE container files (extracts the
371+
* MEM chunk). See https://github.com/libretro/RetroArch on the
372+
* RASTATE format. */
373+
if (state_load_path)
374+
{
375+
FILE *stf = NULL;
376+
long st_total = 0;
377+
uint8_t *st_buf = NULL;
378+
const uint8_t *payload = NULL;
379+
size_t payload_size = 0;
380+
size_t expected;
381+
const char *state_err = NULL;
382+
383+
stf = fopen(state_load_path, "rb");
384+
if (!stf)
385+
{
386+
state_err = "cannot open state file";
387+
goto state_fail;
388+
}
389+
if (fseek(stf, 0, SEEK_END) != 0)
390+
{
391+
state_err = "fseek to end failed";
392+
goto state_fail;
393+
}
394+
st_total = ftell(stf);
395+
if (st_total <= 0)
396+
{
397+
state_err = "ftell failed or empty state file";
398+
goto state_fail;
399+
}
400+
if (fseek(stf, 0, SEEK_SET) != 0)
401+
{
402+
state_err = "fseek to start failed";
403+
goto state_fail;
404+
}
405+
st_buf = (uint8_t *)malloc((size_t)st_total);
406+
if (!st_buf)
407+
{
408+
state_err = "malloc failed for state buffer";
409+
goto state_fail;
410+
}
411+
if (fread(st_buf, 1, (size_t)st_total, stf) != (size_t)st_total)
412+
{
413+
state_err = "short read on state file";
414+
goto state_fail;
415+
}
416+
fclose(stf);
417+
stf = NULL;
418+
payload = st_buf;
419+
payload_size = (size_t)st_total;
420+
/* RASTATE container? "RASTATE" + 1 version byte, then chunks. */
421+
if (st_total >= 16 && memcmp(st_buf, "RASTATE", 7) == 0)
422+
{
423+
const uint8_t *p = st_buf + 8; /* past magic+version */
424+
const uint8_t *end = st_buf + st_total;
425+
int found = 0;
426+
/* Each chunk: 4-byte type + 4-byte LE size + payload. */
427+
while (p + 8 <= end)
428+
{
429+
uint32_t chunk_size = (uint32_t)p[4] | ((uint32_t)p[5] << 8)
430+
| ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24);
431+
/* Bounds-check the declared chunk size against the buffer. */
432+
if (chunk_size > (uint32_t)(end - (p + 8)))
433+
{
434+
state_err = "RASTATE chunk size overruns buffer";
435+
goto state_fail;
436+
}
437+
if (memcmp(p, "MEM ", 4) == 0)
438+
{
439+
payload = p + 8;
440+
payload_size = chunk_size;
441+
found = 1;
442+
break;
443+
}
444+
p += 8 + chunk_size;
445+
}
446+
if (!found)
447+
{
448+
state_err = "no MEM chunk in RASTATE file";
449+
goto state_fail;
450+
}
451+
fprintf(stderr, "--- RASTATE: extracted MEM chunk (%zu bytes) ---\n", payload_size);
452+
}
453+
expected = pretro_serialize_size();
454+
fprintf(stderr, "--- State payload: %zu bytes (core expects %zu) ---\n",
455+
payload_size, expected);
456+
if (!pretro_unserialize(payload, payload_size))
457+
{
458+
state_err = "retro_unserialize failed";
459+
goto state_fail;
460+
}
461+
fprintf(stderr, "--- State loaded from %s ---\n", state_load_path);
462+
free(st_buf);
463+
st_buf = NULL;
464+
465+
if (0)
466+
{
467+
state_fail:
468+
fprintf(stderr, "ERROR: %s: %s\n",
469+
state_err ? state_err : "state load error",
470+
state_load_path);
471+
if (stf) fclose(stf);
472+
if (st_buf) free(st_buf);
473+
pretro_unload_game();
474+
pretro_deinit();
475+
free((void *)info.data);
476+
dlclose(handle);
477+
return 1;
478+
}
479+
}
480+
358481
/* Run warmup frames (not timed) */
359482
if (warmup_frames > 0)
360483
{
