Commit 79a65d0

JoeMatt and claude committed
perf: profile gameplay-state too (AvP), correct the blitter conclusion
Original profile only sampled headless boot/menu state, which doesn't exercise the blitter much. Added --load-state to test_benchmark (handles RetroArch RASTATE container by extracting the MEM chunk), loaded the user's AvP state6 save (active gameplay), and re-profiled.

Result: the spike reports' hypothesis about the blitter was right -- it just didn't show up in boot profiles.

AvP gameplay, fast blitter: DSP ~63% (same as before), blitter 5%, GPU 4%, OP 1%
AvP gameplay, accurate blitter: BlitterMidsummer2+ADDARRAY+DATA ~34% (matches docs/spike #124), DSP ~31%, GPU small

OP stays at ~1% even on gameplay -- closing #123 as wontfix-for-now is still the honest call.

Updated recommendation: two genuine targets, suggested order:
1. Accurate-blitter SIMD widening (#124) -- smaller surface, lower risk, immediate visible win for AvP-style accurate-blitter slowdown. ADDARRAY (src/tom/blitter.c:2090-2094) is the obvious first target.
2. RISC dynarec / cached IR (#122 RISC half) -- bigger lift, helps every game on every blitter mode.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
1 parent 439af9d commit 79a65d0

2 files changed

Lines changed: 135 additions & 9 deletions

docs/op-perf-profile.md

Lines changed: 54 additions & 8 deletions
@@ -4,9 +4,9 @@ Captured 2026-05-01 on Apple Silicon (M-series Mac), arm64 build, default flags
 
 ## TL;DR
 
-**OP is not the bottleneck. Blitter is rarely the bottleneck either. GPU and DSP RISC interpretation dominate frame time across the board.**
+**OP is not the bottleneck. GPU/DSP RISC interpretation is the biggest single target across the board. The accurate blitter is genuinely hot during gameplay (not just at boot) — comparable in cost to DSP for users who enable that mode.**
 
-This redirects the next-up perf work from issue #123 (OP) to a different target — see "Recommendation" at the bottom.
+> **Methodology fix mid-investigation:** the first round of this doc profiled the headless boot/menu state and concluded blitter was always small. That was misleading — boot screens don't draw much. Re-profiled with a save state loaded on top of the live ROM (Alien vs Predator at an in-game state6 save) and the picture changes substantially for the accurate-blitter case. Both data sets are below.
 
 ## Baseline FPS

@@ -76,25 +76,71 @@ Captured to `/tmp/op-baseline/sample-*-*.txt`. Aggregated top-of-stack by symbo
 
 **The one case where the blitter genuinely matters — and only because the user opted into accurate mode. Even here, DSP is comparable.**
 
+### `Alien vs Predator (1994)` at gameplay state6 (fast blitter)
+
+| Function | Samples | % of frame |
+|---|---:|---:|
+| `DSPExec` | 3303 | **~49%** |
+| `dsp_opcode_jr` | 977 | 14% |
+| **DSP TOTAL** | **~4280** | **~63%** |
+| `blitter_generic` | ~340 | 5% |
+| `GPUExec` | ~254 | 4% |
+| `HalflineCallback` / `JaguarExecuteNew` / `JaguarReadLong` | ~310 | 5% |
+| `OPProcessFixedBitmap` | 72 | 1% |
+
+**Same DSP-dominated picture as the boot profiles.** Blitter is meaningfully present (5%) since the scene is actively drawing, but still single-digit.
+
+### `Alien vs Predator (1994)` at gameplay state6 (accurate blitter)
+
+| Function | Samples | % of frame |
+|---|---:|---:|
+| **`BlitterMidsummer2` family** (`+ ADDARRAY 2090-2094, DATA, BlitterMidsummer2 itself`) | ~2330 | **~34%** |
+| `DSPExec` | 1652 | **~24%** |
+| `dsp_opcode_jr` | 443 | 6.5% |
+| **DSP TOTAL** | **~2095** | **~31%** |
+
+**Accurate blitter is co-equal with DSP during gameplay.** The hot inner code is `ADDARRAY` (lines 2090-2094 in `src/tom/blitter.c`) — the per-cycle FDSYNC cascade the spike report (#124) flagged — plus `DATA()` (line 2514). Confirms the spike's hypothesis was right; the boot-only profile just didn't expose it.
+
+This matches the user-reported behavior: AvP slows down noticeably when the character moves, and the slowdown is worse with the accurate blitter selected.
+
 ## What this changes
 
 The original `[spike] OP performance audit` (#123) and `[spike] Blitter performance audit` (#124) both assumed those subsystems dominate. They don't.
 
 | Target | Hypothesis going in | Reality | ROI |
 |---|---|---|---|
 | **OP** (#123) | "dominates on heavy-OP scenes (Wolf3D, T2K, Iron Soldier)" | 1% on Iron Soldier; doesn't make top-15 elsewhere | Low — even a 10× OP speedup buys ~1% of frame time |
-| **Blitter** (#124) | "fast vs accurate matters for some games" | Fast is <2% everywhere except where accurate is opted-in (then ~21% on Skyhammer) | Medium — only for users running accurate blitter on specific titles |
-| **GPU/DSP RISC** (#122 sub-component) | "JIT might help on mobile/Pi" | **24-74% of frame time, every ROM** | **High** — single dynarec helps both because they share an ISA |
+| **Blitter — fast** (#124) | "default path matters" | <2% on boot, 5% on AvP gameplay — small everywhere | Low |
+| **Blitter — accurate** (#124) | "matters for some games" | **~34% on AvP gameplay**, 21% on Skyhammer (boot); ADDARRAY cycle cascade dominates | **High for accurate-blitter users** |
+| **GPU/DSP RISC** (#122 sub-component) | "JIT might help on mobile/Pi" | **24-74% on fast blitter, ~31% even with accurate** | **High** — single dynarec covers both (shared ISA) |
 | **68K** (#122 sub-component) | "wrap UAE JIT or Cyclone68k" | 0.7-2.6% — barely visible in any profile | Low |
 
 ## Recommendation
 
-**Redirect to #122 (JIT / dynarec / cached IR), specifically the GPU/DSP RISC half.** Both Tom GPU and Jerry DSP share the same ISA (~64 opcodes, fixed 16-bit encoding, no MMU). A single basic-block dynarec or cached-IR dispatcher covers both, and the profile data says it would attack the actual hot path on every game tested:
+Two genuine wins, depending on which user we optimize for:
+
+### Path A — RISC dynarec / cached IR (#122, RISC half only)
+
+Helps **everyone** (every ROM tested, every blitter mode):
+- Fast-blitter users: attacks the dominant 50-75% slice
+- Accurate-blitter users: attacks the ~31% DSP slice (the other ~34% is blitter)
+- Cross-platform reach: cached IR / threaded code works on JIT-restricted hosts (iOS without entitlement, Switch)
+- Both Tom GPU and Jerry DSP share the same ISA → single dispatcher covers both
+
+### Path B — accurate-blitter SIMD widening (#124, narrow scope)
+
+Helps **users who enable accurate blitter** specifically:
+- ADDARRAY (`src/tom/blitter.c:2090-2094`, the per-cycle FDSYNC cascade) is the biggest single function — ~15% of frame time on AvP gameplay
+- `BlitterMidsummer2` itself + `DATA()` add another ~19%
+- Existing `test_blitter_compare` infrastructure already gates bit-exactness regressions
+- Lower risk per change than dynarec — can be done in small, mergeable batches
+
+### Suggested order
 
-- Demos (yarc/jagniccc) → mostly GPU.
-- Commercial games (Iron Soldier, Doom, Skyhammer) → mostly DSP, sometimes both.
+1. **Start with Path B (accurate-blitter perf)** — smaller surface, lower risk, immediate visible win for users hitting the AvP-style slowdown. ADDARRAY is the obvious first target.
+2. **Then Path A (RISC dynarec)** — bigger lift, bigger payoff, helps everyone including accurate-blitter users still bottlenecked on DSP after step 1.
 
-Closing #123 (OP) and #124 (blitter) as **wontfix-for-now** based on profile data is the honest call. Cheap wins documented in those spikes (e.g., the OP `O(N²)` discovery bug, fast-blitter SIMD widening) can still land opportunistically — they're correct fixes, just won't move the headline number.
+Closing #123 (OP) as **wontfix-for-now** is still the honest call — even on gameplay it's <2%. The cheap wins documented in #123 (OP `O(N²)` discovery scan, hoist transparency check) remain valid as code-quality fixes but won't move the headline number.
 
 ## Why we DIDN'T touch OP/blitter as planned
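The ROI column and the suggested order in the doc above rest on Amdahl's-law arithmetic: speeding up one component can only recover the slice of frame time that component occupies. A minimal sketch in C, assuming the profile percentages can be read as fractions of frame time. The helper names `frame_time_saved` and `overall_speedup` and any speedup factors passed to them are illustrative and not part of the repository.

```c
#include <assert.h>

/* Fraction of the original frame time recovered when a component that
 * occupies `share` (0..1) of the frame is made `speedup` times faster.
 * Hypothetical helper for illustration only. */
static double frame_time_saved(double share, double speedup)
{
    assert(share >= 0.0 && share <= 1.0);
    assert(speedup > 0.0);
    return share * (1.0 - 1.0 / speedup);
}

/* Whole-frame speedup from the same change (Amdahl's law):
 * the untouched (1 - share) part still costs what it did before. */
static double overall_speedup(double share, double speedup)
{
    assert(share >= 0.0 && share <= 1.0);
    assert(speedup > 0.0);
    return 1.0 / ((1.0 - share) + share / speedup);
}
```

With the figures from the tables, `frame_time_saved(0.01, 10.0)` evaluates to 0.009, which is the "~1% of frame time" quoted for a 10× OP speedup. A hypothetical 2× win on the ~34% accurate-blitter slice, `frame_time_saved(0.34, 2.0)`, would recover about 17% of the frame.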

test/tools/test_benchmark.c

Lines changed: 81 additions & 1 deletion
@@ -34,6 +34,8 @@ static void (*pretro_run)(void);
 static void (*pretro_unload_game)(void);
 static void *(*pretro_get_memory_data)(unsigned);
 static size_t (*pretro_get_memory_size)(unsigned);
+static size_t (*pretro_serialize_size)(void);
+static bool (*pretro_unserialize)(const void *, size_t);
 
 /* Options state */
 static int bios_option_set = 0;
@@ -181,13 +183,17 @@ static void print_usage(const char *progname)
     fprintf(stderr,
         "Usage: %s <core.dylib> <rom_file> [num_frames]\n"
         " [--blitter fast|accurate] [--warmup N] [--load-srm file]\n"
+        " [--load-state file]\n"
         "\n"
         "Options:\n"
         " num_frames Number of frames to benchmark (default: 300)\n"
         " --blitter fast Use fast blitter (default)\n"
         " --blitter accurate Use accurate (Midsummer2) blitter\n"
         " --warmup N Run N warmup frames before timing\n"
-        " --load-srm file Load EEPROM save data from file\n",
+        " --load-srm file Load EEPROM save data from file\n"
+        " --load-state file Load a save state into the core after retro_load_game.\n"
+        " Accepts raw retro_serialize() payloads or RetroArch\n"
+        " RASTATE container files (the MEM chunk is extracted).\n",
         progname);
 }

@@ -197,6 +203,7 @@ int main(int argc, char **argv)
     const char *core_path;
     const char *rom_path;
     const char *srm_load_path = NULL;
+    const char *state_load_path = NULL;
     struct retro_game_info info;
     FILE *f;
     long fsize;
@@ -235,6 +242,8 @@ int main(int argc, char **argv)
             warmup_frames = atoi(argv[++i]);
         else if (strcmp(argv[i], "--load-srm") == 0 && i + 1 < argc)
             srm_load_path = argv[++i];
+        else if (strcmp(argv[i], "--load-state") == 0 && i + 1 < argc)
+            state_load_path = argv[++i];
         else if (strcmp(argv[i], "--help") == 0 || strcmp(argv[i], "-h") == 0)
         {
             print_usage(argv[0]);
@@ -298,6 +307,8 @@ int main(int argc, char **argv)
     LOAD_SYM(retro_unload_game);
     LOAD_SYM(retro_get_memory_data);
     LOAD_SYM(retro_get_memory_size);
+    LOAD_SYM(retro_serialize_size);
+    LOAD_SYM(retro_unserialize);
 
     pretro_set_environment(environment_cb);
     pretro_set_video_refresh(video_refresh);
@@ -355,6 +366,75 @@ int main(int argc, char **argv)
             fprintf(stderr, "WARNING: Core reports no SAVE_RAM area\n");
     }
 
+    /* Load save state if provided. Accepts both raw retro_serialize()
+     * payloads and RetroArch RASTATE container files (extracts the
+     * MEM chunk). See https://github.com/libretro/RetroArch on the
+     * RASTATE format. */
+    if (state_load_path)
+    {
+        FILE *stf = fopen(state_load_path, "rb");
+        if (!stf)
+        {
+            fprintf(stderr, "ERROR: cannot open state file: %s\n", state_load_path);
+            return 1;
+        }
+        {
+            long st_total;
+            uint8_t *st_buf;
+            const uint8_t *payload;
+            size_t payload_size;
+            size_t expected;
+            fseek(stf, 0, SEEK_END);
+            st_total = ftell(stf);
+            fseek(stf, 0, SEEK_SET);
+            st_buf = (uint8_t *)malloc(st_total);
+            if (fread(st_buf, 1, st_total, stf) != (size_t)st_total)
+            {
+                fprintf(stderr, "ERROR: short read on state file\n");
+                free(st_buf); fclose(stf); return 1;
+            }
+            fclose(stf);
+            payload = st_buf;
+            payload_size = (size_t)st_total;
+            /* RASTATE container? "RASTATE" + 1 version byte, then chunks. */
+            if (st_total >= 16 && memcmp(st_buf, "RASTATE", 7) == 0)
+            {
+                const uint8_t *p = st_buf + 8; /* past magic+version */
+                const uint8_t *end = st_buf + st_total;
+                int found = 0;
+                while (p + 8 <= end)
+                {
+                    uint32_t chunk_size = (uint32_t)p[4] | ((uint32_t)p[5] << 8)
+                                        | ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24);
+                    if (memcmp(p, "MEM ", 4) == 0)
+                    {
+                        payload = p + 8;
+                        payload_size = chunk_size;
+                        found = 1;
+                        break;
+                    }
+                    p += 8 + chunk_size;
+                }
+                if (!found)
+                {
+                    fprintf(stderr, "ERROR: no MEM chunk in RASTATE file\n");
+                    free(st_buf); return 1;
+                }
+                fprintf(stderr, "--- RASTATE: extracted MEM chunk (%zu bytes) ---\n", payload_size);
+            }
+            expected = pretro_serialize_size();
+            fprintf(stderr, "--- State payload: %zu bytes (core expects %zu) ---\n",
+                    payload_size, expected);
+            if (!pretro_unserialize(payload, payload_size))
+            {
+                fprintf(stderr, "ERROR: retro_unserialize failed\n");
+                free(st_buf); return 1;
+            }
+            fprintf(stderr, "--- State loaded from %s ---\n", state_load_path);
+            free(st_buf);
+        }
+    }
+
     /* Run warmup frames (not timed) */
     if (warmup_frames > 0)
     {
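The chunk walk in the hunk above trusts the `chunk_size` field read from the file, so a truncated or corrupt state could push `payload` or `p` past the end of the buffer. Below is a hedged sketch of a bounds-checked variant, not the code from this commit. `read_le32` and `rastate_find_mem` are hypothetical names, and the 8-byte padding of chunk payloads is an assumption about the RASTATE layout that should be verified against the RetroArch sources before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a 32-bit little-endian value from four bytes. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: locate the "MEM " chunk inside a RASTATE buffer.
 * Returns 1 and fills *out / *out_size on success; returns 0 when the
 * magic is missing, the chunk is absent, or a size field overruns the
 * buffer. */
static int rastate_find_mem(const uint8_t *buf, size_t len,
                            const uint8_t **out, size_t *out_size)
{
    size_t pos = 8; /* "RASTATE" magic (7 bytes) + 1 version byte */

    assert(buf != NULL && out != NULL && out_size != NULL);
    if (len < 8 || memcmp(buf, "RASTATE", 7) != 0)
        return 0;

    /* Invariant: pos <= len, so the subtraction cannot wrap. */
    while (len - pos >= 8)
    {
        size_t size   = read_le32(buf + pos + 4);
        size_t avail  = len - pos - 8;
        size_t padded = (size + 7) & ~(size_t)7;

        if (size > avail)
            return 0; /* chunk claims more bytes than the file holds */
        if (memcmp(buf + pos, "MEM ", 4) == 0)
        {
            *out      = buf + pos + 8;
            *out_size = size;
            return 1;
        }
        /* ASSUMPTION: payloads are padded to a multiple of 8 bytes. */
        if (padded > avail)
            return 0;
        pos += 8 + padded;
    }
    return 0;
}
```

A caller could fall back to treating the whole file as a raw `retro_serialize()` payload when the magic does not match, and might also compare the extracted size with `retro_serialize_size()` before handing the bytes to `retro_unserialize()`.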
