How to measure where the Virtual Jaguar libretro core actually spends time.
Wall-clock baseline you can run on every commit:

```sh
make benchmark                          # default: yarc.j64, 600 frames, fast blitter
make benchmark BENCH_FRAMES=3000        # longer run (smoother numbers)
make benchmark BENCH_BLITTER=accurate   # A/B against the slow path
make benchmark BENCH_ROM=test/roms/private/Atari\ Karts.jag
```

Reports Frames/sec, Time/frame, and total wall time. The harness boots the core via dlopen and runs N frames headless (no video presentation, no audio output), so you measure pure emulator work.
Use it as a delta: capture baseline before your change, run again after. Don't trust absolute numbers across hosts (CPU, thermals, scheduler, big.LITTLE pinning). Do trust same-host commit-to-commit deltas.
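One way to script the before/after comparison (a sketch; `fps_of` is a hypothetical helper, and the `Frames/sec:` pattern is an assumption — adjust it to whatever the report actually prints):

```sh
# Hypothetical helper: pull the Frames/sec figure out of a saved report.
fps_of() { awk -F': *' '/Frames\/sec/ { print $2; exit }' "$1"; }

# Typical use:
#   make benchmark > /tmp/bench-before.txt    # baseline, before your change
#   ...apply your change, rebuild...
#   make benchmark > /tmp/bench-after.txt
#   echo "before=$(fps_of /tmp/bench-before.txt) after=$(fps_of /tmp/bench-after.txt)"
```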
The harness lives at `test/tools/test_benchmark.c`. Read it if you want to measure something specific (per-subsystem timing, DSP-only, etc.) — it's under 400 lines.
Instruments (Time Profiler) is the easiest way to get a flame graph on macOS.
The wrapper at `scripts/profile-mac.sh` builds the core, runs the benchmark under xctrace, and writes a `.trace` bundle you can open in Instruments:

```sh
scripts/profile-mac.sh                                    # default: Time Profiler, accurate blitter
scripts/profile-mac.sh --template "CPU Counters"          # PMU: cycles, instructions, branch misses
scripts/profile-mac.sh --rom test/roms/yarc.j64 --open    # auto-open the trace
```

Manual invocation if you'd rather attach to a running process:
```sh
make benchmark BENCH_FRAMES=6000 BENCH_WARMUP=120 &
BENCH_PID=$!
xcrun xctrace record --template "Time Profiler" --attach $BENCH_PID --output bench.trace --time-limit 30s
open bench.trace
```

The default symbolication is good — you'll see `OPProcessFixedBitmap`, `BlitterMidsummer2`, `DSPExec`, etc. as the top hot frames if they're slow.
For a quick text dump without the GUI:
```sh
sample $BENCH_PID 5 -file /tmp/sample.txt
# 5-second sample. Read /tmp/sample.txt for collapsed call stacks.
```

Sampling profilers tell you where time goes; counters tell you how often something happens. When you want exact iteration counts (e.g., "did my fast path actually skip the inner loop?"), use the perf_counters system in `src/core/perf_counters.h`:
```sh
make benchmark BENCH_PROFILE=1 BENCH_BLITTER=accurate BENCH_FRAMES=300
# ...
# [perf] counter dump:
# [perf] blitter_phrase_writes    3034993
# [perf] blitter_phrase_reads      931821
# [perf] blitter_inner_io         3966814
# [perf] blitter_inner            4131151
# [perf] blitter_outer             337722
# [perf] blitter_calls             131628
```

The macros are zero-overhead when BENCH_PROFILE is undefined (the default build) — every `PERF_INC` becomes `((void)0)`, every `PERF_COUNTER` becomes a typedef. Use them freely in hot paths to instrument hypotheses.
Adding a counter:
```c
#include "perf_counters.h"

PERF_COUNTER(my_event);        /* file scope */

void hot(void)
{
    PERF_INC(my_event);        /* in-loop */
    PERF_ADD(my_event, n);     /* batch */
}
```

The harness (`test/tools/test_benchmark.c`) calls `perf_counters_dump(stderr)` at exit; counter values appear right before the BENCHMARK RESULTS block.
When to reach for this vs. Time Profiler:
| Question | Tool |
|---|---|
| "Where are we spending cycles?" | xctrace Time Profiler |
| "How many times does the inner loop run per frame?" | BENCH_PROFILE=1 |
| "What fraction of inner iterations are no-ops?" | BENCH_PROFILE=1 |
| "Are we hitting L1 / branch-mispredicting?" | xctrace CPU Counters |
| "Did this optimization change behavior, not just timing?" | BENCH_PROFILE=1 (deltas in counts) |
On Linux, capture a flame graph with perf and Brendan Gregg's FlameGraph scripts:

```sh
sudo apt install -y linux-tools-common linux-tools-generic
git clone https://github.com/brendangregg/FlameGraph /tmp/flamegraph
make benchmark BENCH_FRAMES=6000 BENCH_WARMUP=120 &
BENCH_PID=$!
perf record -F 999 -g -p $BENCH_PID -- sleep 30
perf script | /tmp/flamegraph/stackcollapse-perf.pl | /tmp/flamegraph/flamegraph.pl > flame.svg
open flame.svg
```

`-F 999` = 999 Hz sample rate (avoids 1000 Hz lockstep aliasing with display refresh). `-g` = capture call graphs.
Suspicious-by-default places when something gets slow:

| Subsystem | File | Notes |
|---|---|---|
| Object Processor (sprites, bitmaps) | `src/tom/op.c` | `OPProcessFixedBitmap`, `OPProcessScaledBitmap`, `OPDiscoverObjects`. Dominant on heavy-OP scenes (Wolf3D, Tempest 2000). |
| Blitter | `src/tom/blitter.c`, `blitter_mmio.c`, `blitter_simd_*.c` | Two paths: fast (`blitter_generic`, the upstream-derived path) and accurate (`BlitterMidsummer2`). SIMD (`blitter_simd_{sse2,neon,scalar}.c`) is currently wired only into the accurate blitter's pixel kernel — see issue #124 for the plan to widen SIMD coverage. |
| 68K | `src/m68000/cpuemu.c` | Machine-generated UAE. ~1.8 MB of source. If this is hot, there's not much to do beyond JIT (out of scope). |
| GPU (RISC, 26.6 MHz) | `src/tom/gpu.c` | `GPUExec` per-instruction. Hot when a game uses the GPU heavily (most do). |
| DSP (RISC, audio) | `src/jerry/dsp.c` | `DSPExec` per-instruction. See `src/jerry/dsp_acc40.h` for the 40-bit MAC. |
| Frame loop | `src/core/jaguar.c` | `JaguarExecuteNew` is the event-driven driver. If the event queue is hot, look at `src/core/event.c`. |
The blitter has three implementations, selected at build time via BLITTER_SIMD:
```sh
make BLITTER_SIMD=neon -j4 && make benchmark     # ARM
make BLITTER_SIMD=sse2 -j4 && make benchmark     # x86_64
make BLITTER_SIMD=scalar -j4 && make benchmark   # portable fallback
```

Auto-detection picks NEON on aarch64, SSE2 on x86_64, and scalar elsewhere (see Makefile.common). Force the scalar build to verify SIMD is actually winning — when the gap closes, your bottleneck has moved elsewhere.
| Goal | Flags |
|---|---|
| Production perf | `make` (default; `-O2 -DNDEBUG -ffast-math -fomit-frame-pointer`) |
| Profiling (good symbols, near-prod perf) | `make RELEASE_DEBUG_INFO=1` (`-O2 -g`). Strip later if shipping. |
| Sanitizers | `make CC="clang -fsanitize=address,undefined -O1 -g"` (catches bugs, roughly halves perf) |
| Coverage | `make COVERAGE=1` (`-O0 -g --coverage`). Don't profile this — coverage instrumentation overhead is ~3× and not representative. |
| Debug stepping | `make DEBUG=1` (`-O0 -g`) |
The core is event-driven, not cycle-accurate (`docs/source-layout.md` covers the rationale). `JaguarExecuteNew` runs the 68K to the next event, then the GPU, then fires callbacks. Don't expect cycle-counter results to match real hardware — measure wall-clock instead.
If you do need cycle-level inspection:
```sh
# Linux: cycles + instructions per loop iteration
perf stat -e cycles,instructions,branches,branch-misses ./test/tools/test_benchmark ./virtualjaguar_libretro.so test/roms/yarc.j64 600

# macOS: Instruments has a "CPU Counters" template
xcrun xctrace record --template "CPU Counters" --launch -- ./test/tools/test_benchmark ...
```

If `make benchmark` shows a regression after your change:
- Run twice — check for noise. Same-host runs typically vary <2%; a >5% delta is real signal.
- Bisect by commit: `git bisect start HEAD <known-good-commit>`, mark good/bad with `make benchmark` results until you isolate the offender.
- Check both blitters (`BENCH_BLITTER=fast` vs `accurate`) — sometimes a regression only shows on one path.
- Profile, don't guess — the bottleneck is rarely where you'd expect on this codebase.
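The bisect step can be automated with `git bisect run` (a sketch under assumptions: the report contains a `Frames/sec:` line, and `fps_at_least`, `THRESHOLD_FPS`, and `bisect-bench.sh` are hypothetical names):

```sh
# Hypothetical good/bad test: exit 0 if fps meets the known-good baseline.
fps_at_least() { awk -v f="$1" -v t="$2" 'BEGIN { exit (f + 0 >= t + 0) ? 0 : 1 }'; }

# bisect-bench.sh body (sketch):
#   make -j4 >/dev/null 2>&1 || exit 125     # 125 tells bisect to skip unbuildable commits
#   fps=$(make benchmark | awk -F': *' '/Frames\/sec/ { print $2; exit }')
#   fps_at_least "$fps" "$THRESHOLD_FPS"
# Then:
#   git bisect start HEAD <known-good-commit>
#   git bisect run sh bisect-bench.sh
```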
Further reading:

- `CLAUDE.md` — hardware model + repo layout
- `docs/source-layout.md` — file-by-file source tour
- `docs/emulation-bug-hunt-todos.md` — known performance / accuracy follow-ups
- `test/tools/test_benchmark.c` — the harness itself