[spike] JIT / dynarec / cached IR for Jaguar CPUs (68000, GPU, DSP) #122

@JoeMatt

Summary

Investigate whether a JIT, dynamic recompiler, or cached-IR / threaded-code backend is worth building for one or more of the Jaguar's CPUs (68000, Tom GPU, Jerry DSP). This is a research spike — the deliverable is a profile + recommendation document, not code in master. The goal is to give a future contributor (human or LLM) enough context to pick this up cold.


Motivation

The Virtual Jaguar core is currently fully interpreted on all three programmable processors. That is fine on a desktop x86_64 box but leaves measurable wins on the table for:

  • Battery life on mobile. The Provenance iOS frontend ships this core; cycles spent in a 1.8 MB switch-statement interpreter (src/m68000/cpuemu.c) translate directly to mAh.
  • Speed on low-power ARM hardware. Raspberry Pi 3/4, retro handhelds (RG35XX, Anbernic class), and similar SBCs struggle on the heavier Jaguar titles.
  • Headroom for future accuracy work. Cycle accounting, bus contention, and a more accurate blitter all cost CPU; freeing budget on the interpreter side buys room to spend it on accuracy.

What we explicitly do not expect to help: modern desktop x86_64 / Apple Silicon, where the interpreter already runs full speed.


Background — what the core actually executes

Three programmable processors share a unified big-endian memory map (src/core/vjag_memory.h):

| Processor | Clock | Source | Notes |
|---|---|---|---|
| Motorola 68000 | 13.3 MHz | src/m68000/ | UAE-derived; cpuemu.c is machine-generated, ~60k LOC / 1.8 MB. Treat as opaque. |
| Tom GPU | 26.6 MHz | src/tom/gpu.c | Custom 32-bit RISC, 4 KB local SRAM. |
| Jerry DSP | 26.6 MHz | src/jerry/dsp.c | Nearly the same ISA as the GPU (a few opcodes differ), 8 KB local SRAM. Audio. |

Frame execution is event-driven, not cycle-accurate — see JaguarExecuteNew() in src/core/jaguar.c (line ~934). The loop runs the 68K to the next event boundary (half-line render, timer, IRQ), then steps the GPU, then fires callbacks. This matters for dynarec design: a classic block-JIT happily executes thousands of guest instructions before yielding, but here we may need to bail out at much finer granularity.
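
The yield concern can be made concrete with a toy model. The sketch below is hypothetical: `Block`, `run_until_event`, and the cycle counts are illustrative and not taken from jaguar.c. It checks the cycle budget only at block entry, which is what a simple block JIT would do, and so shows how far execution can overshoot an event boundary.

```c
/* Hypothetical model of block-granular execution against a cycle budget.
 * None of these names exist in the Virtual Jaguar source. */
typedef struct Block {
    int cycles; /* guest cycles the whole block consumes */
    int next;   /* index of the successor block, -1 to stop */
} Block;

/* Run blocks until the budget is used up. The budget is only checked
 * at block entry, so the return value may exceed it by up to one
 * block's worth of cycles. */
static long run_until_event(const Block *blocks, int *pc, long budget)
{
    long used = 0;
    while (*pc >= 0 && used < budget) {
        const Block *b = &blocks[*pc];
        used += b->cycles;
        *pc = b->next;
    }
    return used;
}
```

With three 10-cycle blocks in a loop and a 25-cycle budget, this consumes 30 cycles, an overshoot of 5. Capping block length bounds that overshoot, and the caller can subtract the excess from the next slice.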

Key assumption to validate before any of this: Object Processor (src/tom/op.c) and blitter (src/tom/blitter.c) dominate the frame, not the CPUs. If that is true, a CPU-side dynarec is the wrong place to spend a month. The new make benchmark target (PR #121, test/tools/test_benchmark.c) plus perf / Instruments should answer this in an afternoon. Recommended ROMs: test/roms/yarc.j64 (ships in-tree), jagniccc.j64, plus the maintainer's private ROM set under test/roms/private/.


Prior art to evaluate

68000

  • UAE JIT: x86 / x86_64, GPLv2, lives in upstream UAE / WinUAE. License is the wrinkle: this codebase is GPLv3, and GPLv2-only code cannot be linked in. Parts of UAE's JIT are "GPLv2 or later", so this needs a file-by-file audit.
  • Cyclone 68000: ARM-only 68000 core used by PicoDrive. https://github.com/notaz/cyclone68000. MAME-license-ish, hand-written ARM. As far as I can tell it is an interpreter core in 32-bit ARM assembly rather than a recompiler, so arm64 coverage needs checking. It would only help on ARM targets, but those are the targets that need help.
  • FAME / C68K: used in PicoDrive, Yabause, and various ports (MAME itself uses Musashi); pure-C threaded interpreters, reportedly a 2-3x speedup over generic switch interpreters with no JIT at all.
  • Musashi: cleaner interpreter than UAE's but still an interpreter; mainly relevant if we end up rewriting the 68K core for maintainability and want a better baseline before JITing.

GPU / DSP

No off-the-shelf option found. The Tom/Jerry RISC ISA is Jaguar-specific. Survey work needed:

  • Has BigPEmu (the closed-source reference Jaguar emulator) published anything about its approach?
  • Phoenix, Project Tempest, jagulator — any of them ship a recompiler?
  • Worst case this is greenfield. The ISA is small (~60 opcodes behind a 6-bit opcode field, 16-bit instruction words, with MOVEI followed by a 32-bit immediate), which is good for a from-scratch dynarec.
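
As a starting point for a from-scratch decoder, the sketch below splits a 16-bit GPU/DSP instruction word into a 6-bit opcode and two 5-bit operand fields. The field layout is my understanding of the Jaguar RISC encoding and should be checked against gpu.c / dsp.c; the names `RiscInsn` and `risc_decode` are hypothetical.

```c
/* Hypothetical decoder sketch; the names are not from the codebase.
 * Assumed layout of one 16-bit instruction word:
 *   bits 15..10  opcode
 *   bits  9..5   first operand (source register or small immediate)
 *   bits  4..0   second operand (destination register) */
typedef struct RiscInsn {
    unsigned opcode;
    unsigned reg1;
    unsigned reg2;
} RiscInsn;

static RiscInsn risc_decode(unsigned short word)
{
    RiscInsn d;
    d.opcode = (word >> 10) & 0x3F;
    d.reg1   = (word >> 5) & 0x1F;
    d.reg2   = word & 0x1F;
    return d;
}
```

A dynarec front end would run this once per instruction while building a block, then hand the decoded fields to the register allocator and backend.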

Three implementation paths

(a) Wrap an existing 68K JIT

Smallest scope. Pick Cyclone (ARM) or UAE-JIT (x86_64, license permitting) and wire it behind the m68kinterface.c boundary. Largest potential speedup if the 68K turns out to be hot.

  • Effort: 1-3 weeks including memory-map glue, IRQ handoff, and event-yield handling.
  • Risk: licensing (UAE), platform coverage (Cyclone is ARM-only).

(b) Custom GPU/DSP RISC dynarec

Greenfield basic-block JIT in C, targeting x86_64 + arm64. The same backend can serve both processors, since their ISAs are nearly identical.

  • Effort: ~2-4 weeks for a first pass (decoder + register allocator + two backends + memory-op fallbacks).
  • Risk: if GPU/DSP turn out not to be hot, this is wasted work. Profile first.

(c) Cached IR / threaded code

No native code generation — decode each guest instruction once, cache as a function pointer or compact IR, dispatch via threaded interpreter. Ships everywhere.

  • Effort: ~1-2 weeks per processor.
  • Speedup ceiling: lower than a real JIT (call it 2-4x over cpuemu.c-style switch dispatch) but works on JIT-hostile platforms — non-jailbroken iOS, Switch, locked-down handhelds.
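
A minimal sketch of the decode-once idea, assuming a made-up one-byte guest encoding (top two bits select the operation, then a 2-bit register and a 4-bit immediate). The names `Cpu`, `Op`, `predecode`, and `run` are hypothetical and exist only to show the shape of the technique: guest code is translated into cached slots once, and the hot loop just calls through function pointers.

```c
/* Toy cached-IR interpreter. The guest encoding and all names are
 * invented for illustration; this is not the Jaguar ISA. */
typedef struct Cpu {
    long regs[4];
    int pc;
    int halted;
} Cpu;

typedef struct Op Op;
typedef void (*Handler)(Cpu *, const Op *);

struct Op {
    Handler fn; /* handler chosen at decode time */
    int a;      /* register index */
    int b;      /* immediate */
};

static void op_addi(Cpu *c, const Op *o)
{
    c->regs[o->a] += o->b;
    c->pc++;
}

static void op_halt(Cpu *c, const Op *o)
{
    (void)o;
    c->halted = 1;
}

/* Decode each guest byte exactly once into a cached slot. */
static void predecode(const unsigned char *guest, int count, Op *cache)
{
    int i;
    for (i = 0; i < count; i++) {
        unsigned char w = guest[i];
        cache[i].a = (w >> 4) & 0x3;
        cache[i].b = w & 0xF;
        cache[i].fn = ((w >> 6) == 1) ? op_addi : op_halt;
    }
}

/* Hot loop: no decoding, only an indirect call per instruction. */
static void run(Cpu *c, const Op *cache)
{
    while (!c->halted) {
        const Op *o = &cache[c->pc];
        o->fn(c, o);
    }
}
```

A real backend would also need to invalidate cached slots when the guest writes to code in RAM, which this sketch leaves out.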

Platform deployment matrix

| Target | (a) 68K JIT | (b) GPU/DSP dynarec | (c) Cached IR |
|---|---|---|---|
| Desktop x86_64 (Linux/Win/macOS) | yes (UAE) | yes | yes (low value) |
| Desktop arm64 (Apple Silicon) | yes (Cyclone? UAE port?) | yes | yes (low value) |
| iOS with JIT entitlement | yes | yes | yes |
| iOS without entitlement (App Store, Provenance default) | no | no | yes |
| Android arm64 | yes (Cyclone) | yes | yes |
| Raspberry Pi (arm64) | yes (Cyclone) | yes | yes |
| Nintendo Switch (homebrew) | partial (W^X) | partial (W^X) | yes |
| PS Vita | partial (W^X) | partial (W^X) | yes |

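W^X platforms forbid pages that are writable and executable at the same time, so plain RWX allocations are out. They can usually be made to work with a dedicated code allocator that toggles page permissions or dual-maps the buffer, at the cost of extra porting friction.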


Open questions / pre-work

  • Profile breakdown by subsystem. Use make benchmark (PR #121, "chore: CI and code-quality improvements") plus perf record on Linux or Instruments / sample on macOS. Required before committing to any path. Break down 68K vs GPU vs DSP vs OP vs blitter vs audio mix.
  • Event-driven yield model. Can a block JIT cleanly hand control back on a half-line / IRQ / timer boundary, or do we need to chunk blocks more aggressively (e.g., cap block length, insert yield checks)? Read JaguarExecuteNew() carefully.
  • Memory budget for cached IR. Worst-case code reach for a typical commercial ROM — does a cached-IR table fit comfortably on a 256 MB handheld?
  • License audit of UAE JIT. GPLv2 vs GPLv3 compatibility on a per-file basis. If incompatible, drop path (a) entirely or restrict it to a separate non-distributed build.
  • Survey other Jaguar emulators (BigPEmu, Phoenix, jagulator) for any published dynarec work.
  • Decide on backend strategy if going custom: hand-written assembly per arch, LLVM ORC, lightweight IR (sljit, MIR, GNU Lightning), or DynASM.
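
For the memory-budget question, a back-of-the-envelope upper bound helps frame the answer. The sketch below assumes a flat table with one slot per 16-bit guest word; both the helper name `ir_table_bytes` and the 16-byte slot size (a function pointer plus two operands on a 64-bit host) are assumptions, as is the 6 MB cartridge window.

```c
/* Hypothetical sizing helper: bytes needed by a flat cached-IR table
 * with one slot per 16-bit guest word. */
static unsigned long ir_table_bytes(unsigned long guest_bytes,
                                    unsigned long slot_bytes)
{
    return (guest_bytes / 2UL) * slot_bytes;
}
```

Covering 2 MB of RAM plus an assumed 6 MB of cart space (8 MB total) at 16 bytes per slot gives 4 Mi slots, or 64 MB. That is a quarter of a 256 MB handheld, which suggests populating the table lazily per executed block rather than allocating it flat.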

Acceptance criteria for closing this spike

  • Profile data captured for at least 3 ROMs (yarc, jagniccc, plus one demanding commercial title) on at least 2 hosts (one desktop, one ARM SBC if available).
  • Recommendation document committed to docs/ (e.g. docs/jit-spike.md) covering: hotspot breakdown, which path (a/b/c/none) to pursue, target platforms, rough effort estimate, license posture.
  • Decision recorded — either a follow-up implementation issue is opened, or this is closed as "not worth it" with the data to back that up.

Pointers for whoever picks this up

  • Build: make -j$(getconf _NPROCESSORS_ONLN)
  • Benchmark: make benchmark BENCH_ROM=test/roms/yarc.j64 (see Makefile line ~860 and test/tools/test_benchmark.c)
  • 68K interpreter entry point: src/m68000/m68kinterface.c
  • GPU: src/tom/gpu.c
  • DSP: src/jerry/dsp.c
  • Frame loop: src/core/jaguar.c::JaguarExecuteNew (line ~934)
  • Memory map: src/core/vjag_memory.h (RAM 0x000000 / 2 MB, cart 0x800000, TOM regs 0xF00000, JERRY regs 0xF10000)
  • C89 / GNU89 strict — see CLAUDE.md and scripts/c89-lint.sh
