[spike] JIT / dynarec / cached IR for Jaguar CPUs (68000, GPU, DSP) #122

## Summary

Investigate whether a JIT, dynamic recompiler, or cached-IR / threaded-code backend is worth building for one or more of the Jaguar's CPUs (68000, Tom GPU, Jerry DSP). This is a **research spike** — the deliverable is a profile + recommendation document, not code in `master`. The goal is to give a future contributor (human or LLM) enough context to pick this up cold.

---

## Motivation

The Virtual Jaguar core is currently fully interpreted on all three programmable processors. That is fine on a desktop x86_64 box but leaves measurable wins on the table for:

- **Battery life on mobile.** The Provenance iOS frontend ships this core; cycles spent in a 1.8 MB switch-statement interpreter (`src/m68000/cpuemu.c`) translate directly to mAh.
- **Speed on low-power ARM hardware.** Raspberry Pi 3/4, retro handhelds (RG35XX, Anbernic class), and similar SBCs struggle on the heavier Jaguar titles.
- **Headroom for future accuracy work.** Cycle accounting, bus contention, and a more accurate blitter all cost CPU; freeing budget on the interpreter side buys room to spend it on accuracy.

What we explicitly **do not** expect to help: modern desktop x86_64 / Apple Silicon, where the interpreter already runs full speed.

---

## Background — what the core actually executes

Three programmable processors share a unified big-endian memory map (`src/core/vjag_memory.h`):

| Processor | Clock | Source | Notes |
|---|---|---|---|
| Motorola 68000 | 13.3 MHz | `src/m68000/` | UAE-derived; `cpuemu.c` is **machine-generated, ~60k LOC / 1.8 MB**. Treat as opaque. |
| Tom GPU | 26.6 MHz | `src/tom/gpu.c` | Custom 32-bit RISC, 4 KB local SRAM. |
| Jerry DSP | 26.6 MHz | `src/jerry/dsp.c` | **Near-identical ISA to the GPU** (a handful of opcodes differ between the two), 8 KB local SRAM. Audio. |

Frame execution is **event-driven, not cycle-accurate** — see `JaguarExecuteNew()` in `src/core/jaguar.c` (line ~934). The loop runs the 68K to the next event boundary (half-line render, timer, IRQ), then steps the GPU, then fires callbacks. This matters for dynarec design: a classic block-JIT happily executes thousands of guest instructions before yielding, but here we may need to bail out at much finer granularity.
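
To make the yield problem concrete, here is a minimal sketch of a block executor that stops at an event boundary. Everything in it (`jit_block`, `run_until_event`, `demo_body`) is hypothetical and not part of the existing core; it only assumes that each translated block knows its guest-cycle cost up front.

```c
#include <stddef.h>

/* Hypothetical translated block. None of these names exist in the core. */
typedef struct jit_block {
    unsigned long guest_cycles;   /* guest cycles the block consumes */
    void (*run)(void *cpu_state); /* translated body, or an interpreter thunk */
    struct jit_block *next;       /* fallthrough successor, NULL = stop */
} jit_block;

/* Demo body for illustration: counts how many blocks were executed. */
void demo_body(void *cpu_state)
{
    ++*(int *)cpu_state;
}

/* Run blocks until the cycle budget to the next event is used up.
 * Returns the cycles actually consumed. The result can overshoot the
 * budget by at most one block, which is the argument for capping
 * block length. */
unsigned long run_until_event(jit_block *b, void *cpu_state,
                              unsigned long budget)
{
    unsigned long spent = 0;

    while (b != NULL && spent < budget) {
        b->run(cpu_state);
        spent += b->guest_cycles;
        b = b->next;
    }
    return spent;
}
```

With three 40-cycle blocks chained together and a budget of 50 cycles, this runs two blocks and reports 80 cycles, overshooting by 30. Whether that overshoot is tolerable for half-line rendering and IRQ latency is exactly what the spike needs to find out.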

**Key assumption to validate before any of this:** Object Processor (`src/tom/op.c`) and blitter (`src/tom/blitter.c`) dominate the frame, not the CPUs. If that is true, a CPU-side dynarec is the wrong place to spend a month. The new `make benchmark` target (PR #121, `test/tools/test_benchmark.c`) plus `perf` / Instruments should answer this in an afternoon. Recommended ROMs: `test/roms/yarc.j64` (ships in-tree), `jagniccc.j64`, plus the maintainer's private ROM set under `test/roms/private/`.

---

## Prior art to evaluate

### 68000

- **UAE JIT**: x86 / x86_64, lives in upstream UAE / WinUAE. License is the wrinkle: this codebase is GPLv3, so GPLv2-only files cannot be linked in, while files marked "GPLv2 or later" can be used under GPLv3. UAE's JIT appears to mix the two, so it needs a file-by-file audit.
- **Cyclone 68000**: 68000 core written in 32-bit ARM assembly, used by PicoDrive. https://github.com/notaz/cyclone68000. It is a hand-optimized interpreter rather than a recompiler, and it does not target arm64, so it would only help on 32-bit ARM builds. Those overlap with the targets that need help, but not completely. License looks MAME-license-ish; confirm against the repository.
- **FAME / C68K**: used in PicoDrive, Yabause, and various handheld ports (MAME itself uses Musashi); pure-C threaded interpreter, reportedly a 2-3x speedup over generic switch interpreters with no JIT at all. That figure needs measuring against `cpuemu.c` before we rely on it.
- **Musashi**: cleaner interpreter than UAE's but still an interpreter; mainly relevant if we end up rewriting the 68K core for maintainability and want a better baseline before JITing.

### GPU / DSP

No off-the-shelf option found. The Tom/Jerry RISC ISA is Jaguar-specific. Survey work needed:

- Has **BigPEmu** (the closed-source reference Jaguar emulator) published anything about its approach?
- Phoenix, Project Tempest, jagulator — any of them ship a recompiler?
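- Worst case this is greenfield. The ISA is small: a 6-bit opcode field (at most 64 opcodes per processor) and 16-bit instruction words, with `MOVEI` as the one exception because it is followed by a 32-bit immediate. That is *good* for a from-scratch dynarec.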
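
As a rough illustration of how little decoding the RISC ISA needs, the sketch below splits one instruction word into its fields. The layout assumed here (opcode in bits 15..10, first operand in bits 9..5, second operand in bits 4..0) should be confirmed against `src/tom/gpu.c` before anyone relies on it. The names `risc_insn`, `risc_fetch16` and `risc_decode` are made up for this example.

```c
/* Fields of one 16-bit Tom/Jerry instruction word (assumed layout). */
typedef struct risc_insn {
    unsigned opcode; /* bits 15..10 */
    unsigned reg1;   /* bits 9..5: source register or small immediate */
    unsigned reg2;   /* bits 4..0: destination register */
} risc_insn;

/* Read a big-endian 16-bit word from a guest memory buffer. */
unsigned risc_fetch16(const unsigned char *mem, unsigned long offset)
{
    return ((unsigned)mem[offset] << 8) | (unsigned)mem[offset + 1];
}

/* Split an instruction word into its three fields. */
risc_insn risc_decode(unsigned word)
{
    risc_insn d;

    d.opcode = (word >> 10) & 0x3Fu;
    d.reg1 = (word >> 5) & 0x1Fu;
    d.reg2 = word & 0x1Fu;
    return d;
}
```

A decoder this small is cheap enough to run once per instruction and cache, which is what both path (b) and path (c) depend on.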

---

## Three implementation paths

### (a) Wrap an existing 68K JIT

Smallest scope. Pick Cyclone (32-bit ARM, and strictly an assembly interpreter rather than a JIT) or UAE-JIT (x86_64, license permitting) and wire it behind the `m68kinterface.c` boundary. Biggest perceived speedup *if* the 68K turns out to be hot.

- **Effort:** 1-3 weeks including memory-map glue, IRQ handoff, and event-yield handling.
- **Risk:** licensing (UAE), platform coverage (Cyclone is 32-bit ARM only, so arm64 targets have no off-the-shelf option).

### (b) Custom GPU/DSP RISC dynarec

Greenfield basic-block JIT in C, targeting x86_64 + arm64. One backend can serve both processors because their ISAs are nearly identical; the few processor-specific opcodes need per-target handling.

- **Effort:** ~2-4 weeks for a first pass (decoder + register allocator + two backends + memory-op fallbacks).
- **Risk:** if GPU/DSP turn out *not* to be hot, this is wasted work. Profile first.

### (c) Cached IR / threaded code

No native code generation — decode each guest instruction once, cache as a function pointer or compact IR, dispatch via threaded interpreter. Ships everywhere.

- **Effort:** ~1-2 weeks per processor.
- **Speedup ceiling:** lower than a real JIT (a rough, unmeasured guess is 2-4x over `cpuemu.c`-style switch dispatch), but it works on **JIT-hostile platforms**: non-jailbroken iOS, Switch, locked-down handhelds.
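
The decode-once, dispatch-many idea can be sketched on a toy ISA. This is not the Jaguar encoding: the 8-bit format (kind in bits 7..6, destination in bits 5..4, source or immediate in bits 3..0) and every name below are invented purely to show the shape of a cached-IR interpreter.

```c
#include <stddef.h>

struct toy_cpu;
struct toy_op;
typedef void (*toy_handler)(struct toy_cpu *cpu, const struct toy_op *op);

/* One cached entry: the handler is resolved once, at predecode time. */
typedef struct toy_op {
    toy_handler handler;
    unsigned dst;
    unsigned src;
} toy_op;

typedef struct toy_cpu {
    unsigned long r[4];
    size_t pc;   /* index into the decoded cache */
    int halted;
} toy_cpu;

static void op_addq(toy_cpu *cpu, const toy_op *op)
{
    cpu->r[op->dst] += op->src;              /* add small immediate */
    cpu->pc++;
}

static void op_add(toy_cpu *cpu, const toy_op *op)
{
    cpu->r[op->dst] += cpu->r[op->src & 3u]; /* add register */
    cpu->pc++;
}

static void op_halt(toy_cpu *cpu, const toy_op *op)
{
    (void)op;
    cpu->halted = 1;
}

/* Decode every instruction once and cache its handler and operands. */
void toy_predecode(const unsigned char *code, size_t n, toy_op *cache)
{
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned kind = (unsigned)code[i] >> 6;

        cache[i].dst = ((unsigned)code[i] >> 4) & 3u;
        cache[i].src = (unsigned)code[i] & 0xFu;
        cache[i].handler = (kind == 0) ? op_addq
                         : (kind == 1) ? op_add
                         : op_halt;
    }
}

/* Dispatch loop: no decoding, just one indirect call per instruction. */
void toy_run(toy_cpu *cpu, const toy_op *cache, size_t n)
{
    while (!cpu->halted && cpu->pc < n) {
        const toy_op *op = &cache[cpu->pc];

        op->handler(cpu, op);
    }
}
```

A real version would also have to invalidate cache entries when the guest writes over code it has already executed, which is likely to matter for GPU/DSP programs uploaded into local SRAM.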

---

## Platform deployment matrix

| Target | (a) 68K JIT | (b) GPU/DSP dynarec | (c) Cached IR |
|---|---|---|---|
| Desktop x86_64 (Linux/Win/macOS) | yes (UAE) | yes | yes (low value) |
| Desktop arm64 (Apple Silicon) | unclear (UAE port? Cyclone is 32-bit ARM only) | yes | yes (low value) |
| iOS *with* JIT entitlement | yes | yes | yes |
| iOS *without* entitlement (App Store, Provenance default) | **no** | **no** | **yes** |
| Android arm64 | unclear (Cyclone is 32-bit ARM only; needs an arm64-capable core) | yes | yes |
| Raspberry Pi (arm64) | partial (Cyclone only under a 32-bit OS build) | yes | yes |
| Nintendo Switch (homebrew) | partial (W^X) | partial (W^X) | yes |
| PS Vita | partial (W^X) | partial (W^X) | yes |

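W^X platforms can usually be made to work, but a page may never be writable and executable at the same time. They therefore need a dedicated JIT page allocator that either flips pages between writable and executable or maps the same memory twice (one writable view, one executable view), and that adds porting friction.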

---

## Open questions / pre-work

- [ ] **Profile breakdown by subsystem.** Use `make benchmark` (PR #121) plus `perf record` on Linux or Instruments / `sample` on macOS. Required before committing to any path. Break down 68K vs GPU vs DSP vs OP vs blitter vs audio mix.
- [ ] **Event-driven yield model.** Can a block JIT cleanly hand control back on a half-line / IRQ / timer boundary, or do we need to chunk blocks more aggressively (e.g., cap block length, insert yield checks)? Read `JaguarExecuteNew()` carefully.
- [ ] **Memory budget for cached IR.** Worst-case code reach for a typical commercial ROM — does a cached-IR table fit comfortably on a 256 MB handheld?
- [ ] **License audit of UAE JIT.** GPLv2 vs GPLv3 compatibility on a per-file basis. If incompatible, drop path (a) entirely or restrict it to a separate non-distributed build.
- [ ] **Survey other Jaguar emulators** (BigPEmu, Phoenix, jagulator) for any published dynarec work.
- [ ] **Decide on backend strategy** if going custom: hand-written assembly per arch, LLVM ORC, lightweight IR (sljit, MIR, GNU Lightning), or DynASM.
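
The memory-budget question above can be bounded with simple arithmetic before any profiling. The helper below is a hypothetical sketch: it assumes one cache entry per possible instruction start address, which is a pessimistic upper bound, and the 16-byte entry size used in the example is a placeholder rather than a measured figure.

```c
/* Upper bound on cached-IR table size if every possible instruction
 * start address in a region gets its own entry. */
unsigned long cached_ir_worst_case_bytes(unsigned long region_bytes,
                                         unsigned long min_insn_bytes,
                                         unsigned long entry_bytes)
{
    unsigned long slots;

    if (min_insn_bytes == 0)
        return 0;
    slots = region_bytes / min_insn_bytes;
    return slots * entry_bytes;
}
```

For the 2 MB of main RAM, 2-byte minimum instruction size and 16-byte entries, that is 1,048,576 slots and 16 MB of table. The 4 KB of GPU local SRAM under the same assumptions needs 32 KB. A lazily populated table keyed by executed addresses should come in well under both figures, but that needs measuring.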

---

## Acceptance criteria for closing this spike

- [ ] Profile data captured for at least 3 ROMs (yarc, jagniccc, plus one demanding commercial title) on at least 2 hosts (one desktop, one ARM SBC if available).
- [ ] Recommendation document committed to `docs/` (e.g. `docs/jit-spike.md`) covering: hotspot breakdown, which path (a/b/c/none) to pursue, target platforms, rough effort estimate, license posture.
- [ ] Decision recorded — either a follow-up implementation issue is opened, or this is closed as "not worth it" with the data to back that up.

---

## Pointers for whoever picks this up

- Build: `make -j$(getconf _NPROCESSORS_ONLN)`
- Benchmark: `make benchmark BENCH_ROM=test/roms/yarc.j64` (see `Makefile` line ~860 and `test/tools/test_benchmark.c`)
- 68K interpreter entry point: `src/m68000/m68kinterface.c`
- GPU: `src/tom/gpu.c`
- DSP: `src/jerry/dsp.c`
- Frame loop: `src/core/jaguar.c::JaguarExecuteNew` (line ~934)
- Memory map: `src/core/vjag_memory.h` (RAM 0x000000 / 2 MB, cart 0x800000, TOM regs 0xF00000, JERRY regs 0xF10000)
- C89 / GNU89 strict — see `CLAUDE.md` and `scripts/c89-lint.sh`

