Investigate whether a JIT, dynamic recompiler, or cached-IR / threaded-code backend is worth building for one or more of the Jaguar's CPUs (68000, Tom GPU, Jerry DSP). This is a research spike — the deliverable is a profile + recommendation document, not code in master. The goal is to give a future contributor (human or LLM) enough context to pick this up cold.
Motivation
The Virtual Jaguar core is currently fully interpreted on all three programmable processors. That is fine on a desktop x86_64 box but leaves measurable wins on the table for:
- Battery life on mobile. The Provenance iOS frontend ships this core; cycles spent in a 1.8 MB switch-statement interpreter (src/m68000/cpuemu.c) translate directly to mAh.
- Speed on low-power ARM hardware. Raspberry Pi 3/4, retro handhelds (RG35XX, Anbernic class), and similar SBCs struggle on the heavier Jaguar titles.
- Headroom for future accuracy work. Cycle accounting, bus contention, and a more accurate blitter all cost CPU; freeing budget on the interpreter side buys room to spend it on accuracy.
What we explicitly do not expect to help: modern desktop x86_64 / Apple Silicon, where the interpreter already runs full speed.
Background — what the core actually executes
Three programmable processors share a unified big-endian memory map (src/core/vjag_memory.h):
| Processor | Clock | Source | Notes |
|---|---|---|---|
| Motorola 68000 | 13.3 MHz | src/m68000/ | UAE-derived; cpuemu.c is machine-generated, ~60k LOC / 1.8 MB. Treat as opaque. |
| Tom GPU | 26.6 MHz | src/tom/gpu.c | Custom 32-bit RISC, 4 KB local SRAM. |
| Jerry DSP | 26.6 MHz | src/jerry/dsp.c | Same ISA as the GPU, 8 KB local SRAM. Audio. |
Frame execution is event-driven, not cycle-accurate — see JaguarExecuteNew() in src/core/jaguar.c (line ~934). The loop runs the 68K to the next event boundary (half-line render, timer, IRQ), then steps the GPU, then fires callbacks. This matters for dynarec design: a classic block-JIT happily executes thousands of guest instructions before yielding, but here we may need to bail out at much finer granularity.
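The finer-grained bail-out can be sketched as a budget-checked dispatch loop: the dispatcher refuses to enter a block whose worst-case cycle cost would overrun the next event boundary, and yields instead. Everything here is hypothetical (Block, lookup_or_compile, run_until_event, and the toy 7-cycle blocks stand in for a real translation cache), not existing core code.

```c
#include <stdint.h>

/* Toy model: the "compiled" block at pc advances to pc+1 and costs 7
 * cycles. A real backend would hit a translation cache here and
 * JIT-compile on miss. */
typedef struct {
    uint32_t next_pc;   /* where the block leaves the guest pc */
    uint32_t cycles;    /* worst-case cycle cost of the block */
} Block;

static Block lookup_or_compile(uint32_t pc)
{
    Block b;
    b.next_pc = pc + 1;
    b.cycles = 7;
    return b;
}

/* Run guest blocks until the event slice's cycle budget is spent, then
 * yield so the scheduler can fire the half-line / timer / IRQ event.
 * Returns the pc at which control was handed back. */
static uint32_t run_until_event(uint32_t pc, int32_t budget)
{
    while (budget > 0) {
        Block b = lookup_or_compile(pc);
        if ((int32_t)b.cycles > budget)
            break;              /* block would overrun the boundary */
        pc = b.next_pc;         /* "execute" the block */
        budget -= (int32_t)b.cycles;
    }
    return pc;
}
```

The alternative to breaking out early is to interpret instruction-by-instruction near the boundary; either way the block cache needs a conservative per-block cycle bound.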
Key assumption to validate before any of this: Object Processor (src/tom/op.c) and blitter (src/tom/blitter.c) dominate the frame, not the CPUs. If that is true, a CPU-side dynarec is the wrong place to spend a month. The new make benchmark target (PR #121, test/tools/test_benchmark.c) plus perf / Instruments should answer this in an afternoon. Recommended ROMs: test/roms/yarc.j64 (ships in-tree), jagniccc.j64, plus the maintainer's private ROM set under test/roms/private/.
Prior art to evaluate
68000
- UAE JIT — x86 / x86_64, GPLv2, lives in upstream UAE / WinUAE. License is the wrinkle: this codebase is GPLv3, and GPLv2-only code cannot be linked in. UAE's JIT is also "GPLv2 or later" in places — needs a file-by-file audit.
- Cyclone 68000 — ARM-only assembly recompiler used by PicoDrive (https://github.com/notaz/cyclone68000). MAME-license-ish, hand-written ARM. Would only help on ARM targets, but those are the targets that need help.
- FAME / C68K — used in MAME and various ports; pure-C threaded interpreter, often a 2-3x speedup over generic switch interpreters with no JIT at all.
- Musashi — a cleaner interpreter than UAE's but still an interpreter; mainly relevant if we end up rewriting the 68K core for maintainability and want a better baseline before JITing.
GPU / DSP
No off-the-shelf option found. The Tom/Jerry RISC ISA is Jaguar-specific. Survey work needed:
- Has BigPEmu (the closed-source reference Jaguar emulator) published anything about its approach?
- Phoenix, Project Tempest, jagulator — do any of them ship a recompiler?
Worst case this is greenfield. The ISA is small (~60 opcodes, fixed 16-bit encoding) which is good for a from-scratch dynarec.
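Because the encoding is fixed-width, the decode step of a from-scratch dynarec is nearly trivial. A sketch, assuming the commonly documented Tom/Jerry layout (6-bit opcode in bits 15-10, two 5-bit operand fields in bits 9-5 and 4-0); verify against the hardware reference before relying on it:

```c
#include <stdint.h>

/* Assumed field layout of the fixed 16-bit Tom/Jerry instruction word:
 * bits 15-10 opcode, bits 9-5 first operand (source reg or 5-bit
 * immediate, opcode-dependent), bits 4-0 second operand / destination. */
typedef struct {
    uint8_t opcode;  /* 0..63 */
    uint8_t op1;     /* 0..31 */
    uint8_t op2;     /* 0..31 */
} RiscInsn;

static RiscInsn risc_decode(uint16_t word)
{
    RiscInsn i;
    i.opcode = (uint8_t)((word >> 10) & 0x3F);
    i.op1    = (uint8_t)((word >> 5) & 0x1F);
    i.op2    = (uint8_t)(word & 0x1F);
    return i;
}
```

With ~60 opcodes the decoder is a 64-entry table; the real work in a dynarec is register allocation and the memory-mapped I/O fallbacks, not decoding.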
Three implementation paths
(a) Wrap an existing 68K JIT
Smallest scope. Pick Cyclone (ARM) or UAE-JIT (x86_64, license permitting) and wire it behind the m68kinterface.c boundary. Biggest potential speedup if the 68K turns out to be hot.
- Effort: 1-3 weeks including memory-map glue, IRQ handoff, and event-yield handling.
- Risk: licensing (UAE), platform coverage (Cyclone is ARM-only).
(b) Custom GPU/DSP RISC dynarec
Greenfield basic-block JIT in C, targeting x86_64 + arm64. Same backend serves both processors since they share an ISA.
- Effort: ~2-4 weeks for a first pass (decoder + register allocator + two backends + memory-op fallbacks).
- Risk: if GPU/DSP turn out not to be hot, this is wasted work. Profile first.
(c) Cached IR / threaded code
No native code generation — decode each guest instruction once, cache as a function pointer or compact IR, dispatch via threaded interpreter. Ships everywhere.
- Effort: ~1-2 weeks per processor.
- Speedup ceiling: lower than a real JIT (call it 2-4x over cpuemu.c-style switch dispatch) but works on JIT-hostile platforms — non-jailbroken iOS, Switch, locked-down handhelds.
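A minimal sketch of the decode-once idea, with illustrative names only (op_movei / op_add stand in for real handlers): each predecoded slot carries a handler pointer plus its operands, so steady-state dispatch is one indirect call instead of fetch + switch.

```c
#include <stdint.h>

/* Cached-IR entry: handler pointer plus predecoded operands.
 * All names here are hypothetical, not existing core code. */
typedef struct Cpu Cpu;
typedef struct CachedOp CachedOp;
typedef void (*OpFn)(Cpu *cpu, const CachedOp *op);

struct CachedOp { OpFn fn; uint8_t dst, src; int32_t imm; };
struct Cpu { uint32_t r[32]; uint32_t pc; };

/* Two toy handlers in the style of the GPU/DSP ISA. */
static void op_movei(Cpu *c, const CachedOp *o) { c->r[o->dst] = (uint32_t)o->imm; c->pc++; }
static void op_add(Cpu *c, const CachedOp *o)   { c->r[o->dst] += c->r[o->src];   c->pc++; }

/* Threaded dispatch over a predecoded stream: no re-decode, no switch. */
static void run_cached(Cpu *c, const CachedOp *stream, uint32_t n)
{
    while (c->pc < n) {
        const CachedOp *o = &stream[c->pc];
        o->fn(c, o);
    }
}
```

This is the structure path (c) would cache per guest address; invalidation on self-modifying GPU/DSP code is the hard part the sketch omits.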
Platform deployment matrix
| Target | (a) 68K JIT | (b) GPU/DSP dynarec | (c) Cached IR |
|---|---|---|---|
| Desktop x86_64 (Linux/Win/macOS) | yes (UAE) | yes | yes (low value) |
| Desktop arm64 (Apple Silicon) | yes (Cyclone? UAE port?) | yes | yes (low value) |
| iOS with JIT entitlement | yes | yes | yes |
| iOS without entitlement (App Store, Provenance default) | no | no | yes |
| Android arm64 | yes (Cyclone) | yes | yes |
| Raspberry Pi (arm64) | yes (Cyclone) | yes | yes |
| Nintendo Switch (homebrew) | partial (W^X) | partial (W^X) | yes |
| PS Vita | partial (W^X) | partial (W^X) | yes |
W^X platforms can usually be made to work but require dedicated RWX page allocators and add porting friction.
Open questions / pre-work
- Profile breakdown by subsystem. Use make benchmark (PR #121) plus perf record on Linux or Instruments / sample on macOS. Required before committing to any path. Break down 68K vs GPU vs DSP vs OP vs blitter vs audio mix.
- Event-driven yield model. Can a block JIT cleanly hand control back on a half-line / IRQ / timer boundary, or do we need to chunk blocks more aggressively (e.g., cap block length, insert yield checks)? Read JaguarExecuteNew() carefully.
- Memory budget for cached IR. Worst-case code reach for a typical commercial ROM — does a cached-IR table fit comfortably on a 256 MB handheld?
- License audit of UAE JIT. GPLv2 vs GPLv3 compatibility on a per-file basis. If incompatible, drop path (a) entirely or restrict it to a separate non-distributed build.
- Survey other Jaguar emulators (BigPEmu, Phoenix, jagulator) for any published dynarec work.
- Decide on backend strategy if going custom: hand-written assembly per arch, LLVM ORC, lightweight IR (sljit, MIR, GNU Lightning), or DynASM.
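The memory-budget question has a quick pessimistic bound. Assumptions stated loudly: GPU/DSP local SRAM is tiny, so the binding case is the 68K's 2 MB DRAM entirely covered by code, with one cache slot per possible 16-bit instruction start and a guessed 16-byte IR entry (handler pointer plus predecoded operands).

```c
#include <stdint.h>

/* Pessimistic cached-IR footprint: one slot per 16-bit word across the
 * whole code-reachable region. entry_bytes is a guess, not a measured
 * size from this codebase. */
static uint64_t ir_table_bytes(uint64_t code_bytes, uint64_t entry_bytes)
{
    return (code_bytes / 2) * entry_bytes;
}
```

For 2 MB of code and 16-byte entries that is 16 MB, comfortably inside a 256 MB handheld even without lazy allocation; real footprints should be far smaller since only executed code gets decoded.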
Acceptance criteria for closing this spike
- Profile data captured for at least 3 ROMs (yarc, jagniccc, plus one demanding commercial title) on at least 2 hosts (one desktop, one ARM SBC if available).
- Recommendation document committed to docs/ (e.g. docs/jit-spike.md) covering: hotspot breakdown, which path (a/b/c/none) to pursue, target platforms, rough effort estimate, license posture.
- Decision recorded — either a follow-up implementation issue is opened, or this is closed as "not worth it" with the data to back that up.
Pointers for whoever picks this up
- Build: make -j$(getconf _NPROCESSORS_ONLN)
- Benchmark: make benchmark BENCH_ROM=test/roms/yarc.j64 (see Makefile line ~860 and test/tools/test_benchmark.c)
- Key code: src/m68000/m68kinterface.c, src/tom/gpu.c, src/jerry/dsp.c, src/core/jaguar.c::JaguarExecuteNew (line ~934)
- Memory map: src/core/vjag_memory.h (RAM 0x000000 / 2 MB, cart 0x800000, TOM regs 0xF00000, JERRY regs 0xF10000)
- Conventions: CLAUDE.md and scripts/c89-lint.sh