perf(vm): superinstruction fusion — bytecode interpreter now faster than CPython #3627

Merged
hellovai merged 24 commits into hellovai/trim-events from hellovai/vm-superinstructions on Jun 2, 2026
@hellovai (Contributor) commented Jun 1, 2026

Stacked on #3616 (base = hellovai/trim-events). Makes the bex_vm bytecode interpreter faster than CPython 3.9 across most workloads, where it started ~2.2× slower.

Results (Apple M2 Max, scripts/speedtest run)

  • fib (1M × fib(50)): total instructions retired −60%, CPU cycles −51% vs the pre-fusion baseline — below the cycle count of an optimal naive hand-written interpreter for the same loop.
  • All ~38 speedtest workloads: BAML now beats Python on ~22 of them (several by 2–4×: parallel-sum 0.23×, spawn-fan-out 0.41×, call-chain 0.45×, class-instances 0.50×, method-call 0.56×). Not overfit to fib — the wins span compute, classes, concurrency, dispatch, and most string ops.

What changed (each commit validated + clippy-clean)

All fusion is an emit-time peephole that rewrites instructions in place, confined to the current basic block (mirrors the existing StoreVarLoadVar fusion), so jump targets and block addresses are never affected:

  • Superinstruction fusion — fold operand loads (and the store, and the loop condition's branch) into single ops:
    • AddIntVar/Const, SubIntVar/Const, CmpIntLtVar/Const (fold right operand)
    • AddIntVarVar/VarConst, SubIntVarVar/VarConst, CmpIntLtVarVar/VarConst (fold both operands — no operand stack pushes)
    • AddIntVarVarStore/VarConstStore (compute + store directly), MoveLocal (fused x = y)
    • CmpIntLt*BrFalse/BrTrue (fuse the loop condition with its branch; branch inversion drops the per-iteration jump-to-body)
  • Unchecked local-slot fast paths for LoadVar/StoreVar, and a lazy faulting_pc (track cur_pc once per dispatch instead of writing the frame every op).
  • Call path: consolidated redundant per-call get_object heap derefs with a plain-Function fast path.
  • kperf: an in-process Apple-Silicon PMC probe (cycles + instructions retired, env-gated BAML_KPERF=1, behind an off-by-default kperf cargo feature) for deterministic, frequency-independent measurement.
  • Toolchain: pinned the nightly toolchain and dropped the MSRV job.
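
As a rough illustration of the emit-time peephole described above, the sketch below fuses `LoadVar; LoadVar; AddInt` (or `LoadVar; LoadConst; AddInt`) into a single op while refusing to look past the start of the current basic block. The `Emitter` type, its `block_start` field and the operand encodings are assumptions made for this example; only the opcode names come from the PR text, and the real bex_vm emitter is not shown here.

```rust
// Hypothetical miniature instruction set, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    LoadVar(usize),
    LoadConst(i64),
    AddInt,
    AddIntVarVar(usize, usize),
    AddIntVarConst(usize, i64),
}

struct Emitter {
    code: Vec<Op>,
    block_start: usize, // index of the first op in the current basic block
}

impl Emitter {
    fn new() -> Self {
        Emitter { code: Vec::new(), block_start: 0 }
    }

    /// Marks a block boundary: ops emitted earlier can no longer be fused.
    fn begin_block(&mut self) {
        self.block_start = self.code.len();
    }

    fn emit(&mut self, op: Op) {
        let n = self.code.len();
        // Fuse only when both operand loads sit inside the current block.
        if op == Op::AddInt && n >= self.block_start + 2 {
            let fused = match (&self.code[n - 2], &self.code[n - 1]) {
                (Op::LoadVar(a), Op::LoadVar(b)) => Some(Op::AddIntVarVar(*a, *b)),
                (Op::LoadVar(a), Op::LoadConst(c)) => Some(Op::AddIntVarConst(*a, *c)),
                _ => None,
            };
            if let Some(fused) = fused {
                self.code.truncate(n - 2); // drop the two loads
                self.code.push(fused); // one dispatched op instead of three
                return;
            }
        }
        self.code.push(op);
    }
}

fn main() {
    let mut fused = Emitter::new();
    fused.emit(Op::LoadVar(0));
    fused.emit(Op::LoadConst(1));
    fused.emit(Op::AddInt);
    println!("same block:      {:?}", fused.code);

    let mut split = Emitter::new();
    split.emit(Op::LoadVar(0));
    split.begin_block(); // a jump target lands here
    split.emit(Op::LoadVar(1));
    split.emit(Op::AddInt);
    println!("across boundary: {:?}", split.code);
}
```

Because the rewrite only ever touches the tail of the block being emitted, the addresses of blocks that have already started stay where they were, which is the property the description relies on.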

Key insight

Cycle cost is dominated by branch mispredictions on the giant opcode dispatch match (an indirect jump), so the real lever is reducing the number of dispatched ops. That is why fusion keeps paying off (branch inversion: −15% cycles for −9% instructions), while removing the inline per-op counter barely moved cycles. Measure total instructions and cycles, not instr/op: fusion shrinks the op-count denominator, so the ratio can rise even as total work falls.
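
A small worked calculation shows why the per-op ratio misleads. It uses the cumulative figures reported in the compare-and-branch commit further down this page (instructions retired −52%, VM ops −60%); the helper name `ratio_change` is made up for this sketch.

```rust
/// Relative change of a ratio when its numerator and denominator each change
/// by the given fractions (for example -0.52 means "52% lower").
fn ratio_change(numerator_change: f64, denominator_change: f64) -> f64 {
    (1.0 + numerator_change) / (1.0 + denominator_change) - 1.0
}

fn main() {
    // Total instructions fell 52% while dispatched VM ops fell 60%.
    let instr_per_op = ratio_change(-0.52, -0.60);
    // 0.48 / 0.40 = 1.2, so instr/op goes UP by about 20%.
    println!("instr/op change: {:+.0}%", instr_per_op * 100.0);
}
```

Read on its own, the +20% instr/op figure would look like a regression even though the interpreter retired roughly half as many instructions overall.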

Validation

fib/boundary-loops/if-else correct; bex_vm, interfaces (337), exceptions, cancellation, host_value_callable, errors, floats, gc, dispatch, optimization all green.

Known follow-ups (diagnosed, not in this PR)

  • baml pack's embedded runtime stub needs rebuilding for the new opcodes (baml run is unaffected).
  • Remaining Python losses: strings (split ~2.2×, alloc-bound), deep recursion (~130 cyc/call, frame setup), collatz (%, /, *, !=, == still unfused).

🤖 Generated with Claude Code

github-actions Bot commented Jun 1, 2026

Binary size checks passed

7 passed

| Artifact | Platform | File | Gzip | Gated on | Baseline | Delta | Status |
|---|---|---|---|---|---|---|---|
| baml-cli | Linux | 🔒 17.1 MB | 7.3 MB | file | 20.0 MB | -2.9 MB (-14.7%) | OK |
| packed-program | Linux | 🔒 12.6 MB | 5.3 MB | file | 15.1 MB | -2.5 MB (-16.5%) | OK |
| baml-cli | macOS | 🔒 13.0 MB | 6.3 MB | file | 15.1 MB | -2.1 MB (-14.0%) | OK |
| packed-program | macOS | 🔒 9.6 MB | 4.6 MB | file | 11.5 MB | -1.9 MB (-16.3%) | OK |
| baml-cli | Windows | 🔒 14.0 MB | 6.4 MB | file | 16.3 MB | -2.3 MB (-14.2%) | OK |
| packed-program | Windows | 🔒 10.1 MB | 4.7 MB | file | 12.2 MB | -2.1 MB (-17.2%) | OK |
| bridge_wasm | WASM | 11.5 MB | 🔒 3.3 MB | gzip | 3.9 MB | -609.4 KB (-15.7%) | OK |

🔒 = the size this artifact is GATED on (ceiling + delta). Binaries gate on file size (installed binary); WASM gates on gzip (download size). The other size is shown for information only.

Generated by cargo size-gate

codspeed-hq Bot commented Jun 1, 2026

Merging this PR will improve performance by 60.45%

⚡ 10 improved benchmarks
✅ 3 untouched benchmarks
⏩ 7 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| WallTime | vm_loop_500k | 95 ms | 27.4 ms | ×3.5 |
| WallTime | vm_nested_loop | 9.7 ms | 4 ms | ×2.4 |
| WallTime | vm_field_access_50k | 16.9 ms | 8.4 ms | ×2 |
| WallTime | vm_closure_call_50k | 21.7 ms | 13.1 ms | +65.6% |
| WallTime | vm_class_create_50k | 40.1 ms | 28.3 ms | +41.95% |
| WallTime | vm_array_iter_10k | 7 ms | 5.4 ms | +29.22% |
| WallTime | vm_wide_nested_class_create_50k | 296.4 ms | 230.2 ms | +28.73% |
| WallTime | vm_array_push_50k | 21 ms | 16.4 ms | +27.63% |
| WallTime | vm_fib_20 | 7.2 ms | 6.1 ms | +17.43% |
| WallTime | vm_mixed_ops | 11.1 ms | 9.6 ms | +15.81% |

Comparing hellovai/vm-superinstructions (143bd33) with hellovai/trim-events (855fa73)

Footnotes

  1. 7 benchmarks were skipped, so the baseline results were used instead.

hellovai and others added 9 commits June 1, 2026 14:54
…d split

Prep for the threaded-dispatch work:

- Add crates/baml_tests/tests/dispatch.rs — inline-snapshot of how direct
  method calls and interface (polymorphic) dispatch lower to bytecode.
  Locks in that a direct `v.norm2()` is a plain static `call` (no
  make_bound_method / no per-call allocation) and that interface dispatch
  is an `is_type` chain + static call.
- Split store_local_value into an `#[inline(always)]` fast path (direct
  stack write) and a `#[cold] #[inline(never)]` watch handler, so the hot
  store path stays inline and the rarely-active watch bookkeeping is out of
  line. (Measured ~0% on its own; it's structural prep for `become`.)
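
The hot/cold split described above could look roughly like the following. This is a minimal sketch under stated assumptions: the `Frame` layout, the `watch_active` flag and the notification list are invented for the example, and the real `store_local_value` signature is not shown in the commit message. Only the attribute placement mirrors the text.

```rust
// Hypothetical frame layout; the real bex_vm frame is not shown in this PR.
struct Frame {
    locals: Vec<i64>,
    watched: Vec<bool>,
    watch_active: bool,
    notifications: Vec<(usize, i64)>,
}

impl Frame {
    fn new(slots: usize) -> Self {
        Frame {
            locals: vec![0; slots],
            watched: vec![false; slots],
            watch_active: false,
            notifications: Vec::new(),
        }
    }

    fn watch(&mut self, slot: usize) {
        self.watched[slot] = true;
        self.watch_active = true;
    }

    /// Hot path: one predictable branch, then a direct slot write.
    #[inline(always)]
    fn store_local_value(&mut self, slot: usize, value: i64) {
        if self.watch_active {
            self.store_watched(slot, value);
        } else {
            self.locals[slot] = value;
        }
    }

    /// Cold path: kept out of line so the bookkeeping does not bloat callers.
    #[cold]
    #[inline(never)]
    fn store_watched(&mut self, slot: usize, value: i64) {
        if self.watched[slot] {
            self.notifications.push((slot, value));
        }
        self.locals[slot] = value;
    }
}

fn main() {
    let mut frame = Frame::new(2);
    frame.store_local_value(0, 7); // fast path, no watch active
    frame.watch(1);
    frame.store_local_value(1, 9); // routed through the cold handler
    println!("{:?} {:?}", frame.locals, frame.notifications);
}
```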

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Cut interpreter instruction count on the hot arithmetic/loop path:

- Emit-time binop peephole folds operand loads into fused superinstructions
  (mirrors the existing StoreVarLoadVar in-place rewrite; confined to the
  current basic block, so fusion never crosses a jump target or block boundary):
  * single-fold (right operand): AddIntVar/AddIntConst, CmpIntLtVar/CmpIntLtConst
  * double-fold (both operands, no operand stack pushes): AddIntVarVar/
    AddIntVarConst, CmpIntLtVarVar/CmpIntLtVarConst
- Unchecked local-slot indexing (get_at/set_at) on LoadVar/StoreVar fast paths.
- Lazy faulting_pc: track cur_pc once per dispatch instead of writing the frame
  every op; reconstruct topmost/innermost bytecode PC only when unwinding.
- kperf: in-process Apple-Silicon PMC probe (cycles + instructions retired),
  env-gated by BAML_KPERF=1, for deterministic instruction-count comparisons.

fib (1M x fib50): total instructions retired -25.4%, cycles -22.8%, VM ops -33%
vs the pre-fusion baseline. Validated: result correct; bex_vm, exceptions,
cancellation, errors, floats, dispatch, optimization all green.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Extend the emit-time peephole with two more fused-op families, same
in-place / current-block-confined rewrite as before:

- MoveLocal(dst, src): fuses `LoadVar src; StoreVar dst` (every `x = y`).
  Stores via store_local_value, preserving watch semantics.
- AddIntVarVarStore / AddIntVarConstStore: a double-folded add whose result
  is stored straight into a (non-captured) local — `local[dst] = local[a] + ..`
  — never touching the eval stack.

fib (1M x fib50) cumulative vs the pre-fusion baseline: total instructions
retired -47%, cycles -35%, VM ops -55%, reaching the cycle count of an optimal
hand-written bytecode interpreter for the same workload. Validated: result
correct; bex_vm, exceptions, cancellation, errors, floats, dispatch,
optimization all green.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Add CmpIntLtVarVarBrFalse / CmpIntLtVarConstBrFalse: when a fused integer
`<` comparison is immediately consumed by a `PopJumpIfFalse` (the canonical
loop/if condition), the emit peephole collapses both into one op that
evaluates the comparison straight from its operands and branches without
materializing a bool on the stack. The fused op participates in jump
resolution like the plain jumps (instruction-relative offset patched to a
byte delta) and preserves the early-yield/cancellation check.

fib (1M x fib50) cumulative vs the pre-fusion baseline: total instructions
retired -52%, cycles -41%, VM ops -60% — now below the cycle count of an
optimal naive hand-written bytecode interpreter for the same workload.
Validated: fib correct; boundary loops (sum 0..10 = 45, count = 5) correct;
bex_vm, cancellation, exceptions, errors, floats, dispatch, optimization green.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
The VM-op counter (`op_count`) is pure measurement scaffolding but ran on the
hottest path in every build — a store plus a memory dependency per dispatched
op. Gate the increment behind a new off-by-default `kperf` cargo feature so
normal/release builds don't pay for it; kperf reads cycles and instructions
retired straight from the hardware counters, so the op count is only needed
for the optional per-op breakdown (built with `--features bex_vm/kperf`).

Also make the `cur_pc` fault-PC write unconditional (it is needed for correct
exception line numbers, not measurement) and drop the now-unused
`dbg_skip_*` / `BAML_NO_*` bisection env gates.

fib (1M x fib50), op_count removed: -6.7% instructions retired (20.7e9), and
BAML's interpreter is now ~10% faster than CPython 3.9 on the same workload.
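
One way to gate a counter behind an off-by-default cargo feature is sketched below. The `Vm` struct and its `step` method are assumptions made for illustration; only the feature name `kperf` and the idea of an `op_count` increment come from the commit message.

```rust
// Hypothetical VM state; only the cfg-gating pattern is the point here.
struct Vm {
    #[cfg(feature = "kperf")]
    op_count: u64,
    acc: i64,
}

impl Vm {
    fn new() -> Self {
        Vm {
            #[cfg(feature = "kperf")]
            op_count: 0,
            acc: 0,
        }
    }

    /// With the feature enabled, every dispatched op bumps the counter.
    #[cfg(feature = "kperf")]
    #[inline(always)]
    fn count_op(&mut self) {
        self.op_count += 1;
    }

    /// Without it, the call compiles to nothing on the hot path.
    #[cfg(not(feature = "kperf"))]
    #[inline(always)]
    fn count_op(&mut self) {}

    fn step(&mut self, delta: i64) {
        self.count_op();
        self.acc += delta;
    }
}

fn main() {
    let mut vm = Vm::new();
    vm.step(2);
    vm.step(3);
    println!("acc = {}", vm.acc);
}
```

In a default build the field does not exist at all, so there is no store and no memory dependency per op; the behaviour of `step` is otherwise identical.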

Co-Authored-By: Claude Opus 4.8 <[email protected]>
When a conditional's else-successor is the fall-through block (and the then
block is not), emit an inverted compare-and-branch, CmpIntLtVarVarBrTrue /
CmpIntLtVarConstBrTrue (branch to `then` when the comparison is true), and
reach the else block by falling through instead of jumping. This eliminates
the unconditional jump-to-body that otherwise runs every loop iteration.

Removing a dispatched op matters disproportionately: each dispatch is an
indirect jump through the giant opcode match that the branch predictor
routinely misses, so this is ~15% fewer cycles for ~9% fewer instructions
(IPC 5.0 -> 5.4). fib (1M x fib50) cumulative vs the pre-fusion baseline:
instructions retired -60%, cycles -51% (halved). Validated: fib correct;
if/else and boundary loops correct; bex_vm, interfaces (337), exceptions,
cancellation, errors, floats, gc, env, io, dispatch, optimization all green.
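
To make the effect of branch inversion concrete, here is a toy interpreter that runs the same `sum 0..10` loop in two layouts and counts dispatched ops. It is a sketch under simplifying assumptions: jump targets are absolute instruction indices rather than patched byte deltas, the early-yield/cancellation check is omitted, and the operand layout of each op is invented. Only the opcode names are taken from the commits.

```rust
#[derive(Clone, Copy)]
enum Op {
    /// Jump to `target` when NOT (locals[var] < konst); otherwise fall through.
    CmpIntLtVarConstBrFalse { var: usize, konst: i64, target: usize },
    /// Jump to `target` when locals[var] < konst; otherwise fall through.
    CmpIntLtVarConstBrTrue { var: usize, konst: i64, target: usize },
    AddIntVarVarStore { dst: usize, a: usize, b: usize },
    AddIntVarConstStore { dst: usize, a: usize, konst: i64 },
    Jump(usize),
    Halt,
}

/// Runs `code` with locals[0] = i and locals[1] = sum.
/// Returns (sum, number of dispatched ops).
fn run(code: &[Op]) -> (i64, usize) {
    let mut locals = [0i64; 2];
    let (mut pc, mut dispatched) = (0usize, 0usize);
    loop {
        dispatched += 1;
        match code[pc] {
            Op::CmpIntLtVarConstBrFalse { var, konst, target } => {
                pc = if locals[var] < konst { pc + 1 } else { target };
            }
            Op::CmpIntLtVarConstBrTrue { var, konst, target } => {
                pc = if locals[var] < konst { target } else { pc + 1 };
            }
            Op::AddIntVarVarStore { dst, a, b } => {
                locals[dst] = locals[a] + locals[b];
                pc += 1;
            }
            Op::AddIntVarConstStore { dst, a, konst } => {
                locals[dst] = locals[a] + konst;
                pc += 1;
            }
            Op::Jump(target) => pc = target,
            Op::Halt => return (locals[1], dispatched),
        }
    }
}

/// Block order: header, exit (else), body (then). Needs a jump-to-body.
fn br_false_layout() -> Vec<Op> {
    vec![
        Op::CmpIntLtVarConstBrFalse { var: 0, konst: 10, target: 2 },
        Op::Jump(3), // runs on every iteration
        Op::Halt,
        Op::AddIntVarVarStore { dst: 1, a: 1, b: 0 }, // sum += i
        Op::AddIntVarConstStore { dst: 0, a: 0, konst: 1 }, // i += 1
        Op::Jump(0),
    ]
}

/// Same block order, inverted branch: the exit block is plain fall-through.
fn br_true_layout() -> Vec<Op> {
    vec![
        Op::CmpIntLtVarConstBrTrue { var: 0, konst: 10, target: 2 },
        Op::Halt,
        Op::AddIntVarVarStore { dst: 1, a: 1, b: 0 },
        Op::AddIntVarConstStore { dst: 0, a: 0, konst: 1 },
        Op::Jump(0),
    ]
}

fn main() {
    println!("BrFalse + jump: {:?}", run(&br_false_layout())); // (45, 52)
    println!("BrTrue:         {:?}", run(&br_true_layout())); // (45, 42)
}
```

Both layouts compute 45, but the inverted form dispatches one op fewer per iteration. The exact savings in the real VM depend on the workload; the counts here only describe this toy loop.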

Co-Authored-By: Claude Opus 4.8 <[email protected]>
execute_call_from_locals_offset dereferenced the callee HeapPtr three separate
times to (a) check for a HostClosure, (b) extract Closure captured type args,
and (c) extract BoundMethod class type args. Fold these into a single match
with a plain-Function fast path (the common case, including all recursion),
which extracts nothing. The later callee-resolution match still validates
non-callable objects, so error behaviour is unchanged.

Measured call overhead is ~130 cyc / ~740 instructions per call (5M-call bench
minus an equivalent inline loop) — dominated by frame setup/teardown spread
across the call+return machinery, so this is a small (~1%) but free reduction.
Validated: fib32 correct; bex_vm, interfaces (337), dispatch, spawn,
cancellation, host_value_callable all green (closure/bound-method/host-closure
dispatch preserved).
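
The "one deref, one match" shape described above can be sketched as follows. The `Object` variants and the `CalleeInfo` struct are hypothetical stand-ins for the real heap object model, which this commit message does not show; non-callable objects are deliberately passed through, mirroring the note that a later match still validates them.

```rust
// Hypothetical heap object shapes, for illustration only.
enum Object {
    Function,
    Closure { captured_type_args: Vec<String> },
    BoundMethod { class_type_args: Vec<String> },
    HostClosure,
    NotCallable,
}

#[derive(Debug, PartialEq)]
struct CalleeInfo<'a> {
    is_host_closure: bool,
    type_args: &'a [String],
}

/// Inspects the callee once instead of dereferencing it three times.
fn inspect_callee(callee: &Object) -> CalleeInfo<'_> {
    match callee {
        // Fast path: plain functions (including all recursion) extract nothing.
        Object::Function => CalleeInfo { is_host_closure: false, type_args: &[] },
        Object::Closure { captured_type_args } => CalleeInfo {
            is_host_closure: false,
            type_args: captured_type_args,
        },
        Object::BoundMethod { class_type_args } => CalleeInfo {
            is_host_closure: false,
            type_args: class_type_args,
        },
        Object::HostClosure => CalleeInfo { is_host_closure: true, type_args: &[] },
        // Left for the later callee-resolution step to reject.
        Object::NotCallable => CalleeInfo { is_host_closure: false, type_args: &[] },
    }
}

fn main() {
    let closure = Object::Closure { captured_type_args: vec!["int".to_string()] };
    let method = Object::BoundMethod { class_type_args: vec!["T".to_string()] };
    for callee in [Object::Function, closure, method, Object::HostClosure, Object::NotCallable] {
        println!("{:?}", inspect_callee(&callee));
    }
}
```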

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Mirror the AddInt fusion for SubInt: SubIntVar / SubIntConst (fold the right
operand load) and SubIntVarVar / SubIntVarConst (fold both). Subtraction is not
commutative, so operand order is preserved (left - right) and no const-on-left
commute is applied. Same in-place, current-block-confined emit peephole.

Helps any code using subtraction, which the earlier add/compare fusion missed.
fib32-recursive (call-bound) still gains -9% cycles from fusing its n-1/n-2 +
result-add body; the hot fib loop is unchanged (no I-cache regression from the
larger dispatch match). Validated: fib32 = 2178309, subtraction loop correct;
bex_vm, interfaces (337), optimization, dispatch, errors, exceptions, floats
all green.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
…system fixture

The superinstruction fusion changes emitted bytecode (e.g. `load_var a;
load_var b; add_int` → `add_int_var_var`, `load_var x; store_var y` →
`move_local`, `load_const 1; add_int` → `add_int_const`), so every codegen /
bytecode snapshot is regenerated to match. Behaviour is unchanged — these are
the same programs, fewer dispatched ops.

Also delete the `event_system` test fixture: it exercised `baml.events.send`,
which the tracing trim removed, so it no longer compiles ("unresolved name:
send"). The feature is gone, so the fixture is removed rather than rewritten;
its generated test disappears when build.rs regenerates from projects/.

A few snapshots also drop `events.send` from builtin listings (baml_cli
package listing, __baml_std__, package_items) for the same removal.

Verified: baml_tests, baml_cli, baml_compiler2_emit, bex_vm, bex_vm_types all
green with no INSTA_UPDATE.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
hellovai and others added 4 commits June 1, 2026 23:17
## What

Adds **53 in-BAML test blocks** to `ns_floats/floats.baml` covering the
`==` / `!=` operators on floats. Regenerates the `floats` bytecode
snapshot.

## Why

Float equality **already works** end-to-end:
- **Type-checker** (`infer_binary_op`): permissive — any two operands →
`bool`; `int == float` widens int to f64.
- **Constant folder** (`try_fold_binary`): literal `float == float` is
folded at compile time — a *separate* path from runtime.
- **VM** (`exec_cmpop` + dedicated `CmpFloatEq` opcode): IEEE 754
semantics.

…but there was **no test coverage**: `floats.baml` deliberately used
epsilon checks and `operators.baml` only tested `int == int`. This locks
in the behavior.

## Coverage

Cases are grouped by code path (compile-fold vs runtime, forced via
`float.parse(...)`/calls) so a divergence between the folder and the VM
would be caught. They follow JS/TS (IEEE 754) conventions:

- **NaN**: `NaN != NaN` true, `NaN == NaN` false, NaN vs
number/inf/null, NaN propagation through arithmetic
- **Infinity**: `+inf == +inf`, `+inf != -inf`, overflow → inf,
max-finite ≠ inf, `inf - inf` → NaN
- **Signed zero**: `0.0 == -0.0`, `-1.0/inf` → `-0.0`
- **int/float mixing**: `2 == 2.0`, `3.0 == 3` (both paths)
- **Precision**: `0.1 + 0.2 != 0.3`, `== 0.30000000000000004`,
`1.0/3.0`, sqrt identities, subnormals (`5e-324`), large-magnitude loss
- **null**: `float == null` → false (allowed, not an error)

The one rejected pairing — `float == bigint` (compile error E0004,
bigint past 2⁵³ can't round-trip f64) — is unchanged and not expressible
as a passing test.
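
For readers who want to check the IEEE 754 conventions listed above outside BAML, the same comparisons can be reproduced with Rust's `f64`. This is an independent sketch, not the BAML test file: operands are parsed at run time (a rough analogue of forcing the runtime path with `float.parse(...)`), and the case labels are made up.

```rust
/// Each entry pairs a label with a check expected to hold under IEEE 754.
fn float_eq_cases() -> Vec<(&'static str, bool)> {
    // Parsing at run time keeps the compiler from folding the comparisons.
    let nan: f64 = "NaN".parse().unwrap();
    let inf: f64 = "inf".parse().unwrap();
    let zero: f64 = "0.0".parse().unwrap();
    let neg_zero: f64 = "-0.0".parse().unwrap();
    let tenth: f64 = "0.1".parse().unwrap();
    let fifth: f64 = "0.2".parse().unwrap();
    let tiny: f64 = "5e-324".parse().unwrap();

    vec![
        ("NaN != NaN", nan != nan),
        ("NaN == NaN is false", !(nan == nan)),
        ("+inf == +inf", inf == inf),
        ("+inf != -inf", inf != -inf),
        ("inf - inf is NaN", (inf - inf).is_nan()),
        ("overflow gives inf", f64::MAX * 2.0 == inf),
        ("max finite != inf", f64::MAX != inf),
        ("0.0 == -0.0", zero == neg_zero),
        ("-1.0 / inf is -0.0", (-1.0_f64 / inf).is_sign_negative()),
        ("2 == 2.0 after widening", 2_i64 as f64 == 2.0),
        ("0.1 + 0.2 != 0.3", tenth + fifth != 0.3),
        ("0.1 + 0.2 == 0.30000000000000004", tenth + fifth == 0.30000000000000004),
        ("5e-324 is a nonzero subnormal", tiny > 0.0 && !tiny.is_normal()),
    ]
}

fn main() {
    for (label, holds) in float_eq_cases() {
        println!("{label}: {holds}");
    }
}
```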

## Testing

```
cargo run -p baml_cli -- test --from crates/baml_tests/baml_src -i "::float_eq_*"
# 53 passed, 0 failed
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Tests**
  * Added comprehensive test suite for float comparison operations,
    covering IEEE-754 semantics and edge cases: NaN behavior and
    self-inequality, positive/negative infinity handling, signed-zero
    behavior, overflow-to-infinity conversions, precision and rounding
    imprecision scenarios, and parsing-based test cases including scientific
    notation edge ranges.

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
…pus (#3631)

## What

The `baml_tests` CodSpeed runtime suite was a set of hand-written
`#[divan::bench]` functions with BAML source inlined as string literals
— a parallel copy of workloads that already exist in the
`tools/speedtest` corpus. This replaces them with **one `vm_speedtest_*`
bench auto-generated per workload** under
`tools/speedtest/workloads/*.md`, so the speedtest harness and CodSpeed
share a single source of truth.

## How

- **`build.rs` (`generate_speedtest_benches`)** shells out once to a new
`tools/speedtest/export_baml.py`, which reuses `speedtest.loader`
(including `## eval-setup` + `$$` templating) to emit each workload's
*expanded* BAML as JSON. build.rs then generates one divan bench per
workload, named `vm_speedtest_<slug>`, each calling the existing
`bench_vm_main` helper (compile + tokio runtime built **once, outside**
the measured region → only `main()` is timed).
- **Graceful degradation:** if `python3` or the corpus is unavailable at
build time, it emits a `cargo:warning` and no benches rather than
breaking the crate build.
- **Sleep exclusion:** workloads that call the blocking `baml.sys.sleep`
are dropped at build time (matched by FQN) — as walltime benches their
sample time is dominated by sleeping, not VM work. A build warning names
what was skipped. Currently excludes `concurrency::parallel sleep
3x200ms`.
- **All hand-written benches removed** (`vm_*`, `e2e_*`, `startup_*`,
`compile_to_engine`, `engine_init_cost`) per design discussion. The 2
with no workload equivalent became new workloads, with BAML/Python/TS
output cross-verified against `baml-cli`:
  - `compute/wide-nested-class-create-50k.md` (= `8754025000`)
  - `compute/mixed-ops-5k.md` (= `62499999`)
- **CI** run filter updated `vm_|engine_init` → `vm_speedtest` (the old
alternatives no longer exist); build step unchanged.

## Result

**36 generated benches** (37 workloads − 1 sleep). All compile and
execute cleanly (`divan --test`, exit 0). No CI build wiring change
needed — `cargo codspeed build --bench runtime_benchmark` already covers
them.

To add or change a runtime bench going forward, edit a workload `.md` —
no Rust changes required.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Tests**
  * Added new speedtest workloads (mixed arithmetic and wide nested object
    creation) to expand performance coverage
  * Introduced generated VM-focused benchmark cases to measure pure VM
    execution timing

* **Chores**
  * Updated CI benchmark configuration to run VM speedtests for more
    representative timing
  * Added build-time benchmark generation and a CLI export tool to produce
    workload test data automatically

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
…tions

Drop the 20 emit-time fused opcodes added earlier (AddIntVar/Const,
SubInt*, CmpIntLt* folds, AddInt*Store, CmpIntLt*Br{False,True}, MoveLocal)
in favour of CPython's minimal, operand-movement-only superinstruction set:

- LoadVar2  (= LOAD_FAST_LOAD_FAST):  push two locals in one dispatch
- StoreVar2 (= STORE_FAST_STORE_FAST): store two locals in one dispatch

(StoreVarLoadVar already covered STORE_FAST_LOAD_FAST.) Type specialization
stays where it belongs — the pre-existing AddInt/SubInt/MulInt/CmpIntOp ops
are BAML's static-typing equivalent of CPython's BINARY_OP_*_INT /
COMPARE_OP_INT, emitted directly with no inline caches or deopt.
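
A toy stack machine makes the movement-only idea concrete: `LoadVar2` and `StoreVar2` move two locals per dispatch but perform no arithmetic themselves. The operand order of `StoreVar2` (first operand receives the top of stack) is an assumption modelled on CPython's `STORE_FAST_STORE_FAST`; the commit message does not spell it out, and the `run` helper is invented for this sketch.

```rust
#[derive(Clone, Copy)]
enum Op {
    LoadVar(usize),
    LoadVar2(usize, usize),  // push locals[a], then locals[b]
    StoreVar(usize),
    StoreVar2(usize, usize), // pop into locals[a], then into locals[b] (assumed order)
    AddInt,
}

/// Executes straight-line code and returns the number of dispatched ops.
fn run(code: &[Op], locals: &mut [i64]) -> usize {
    let mut stack: Vec<i64> = Vec::new();
    for op in code {
        match *op {
            Op::LoadVar(a) => stack.push(locals[a]),
            Op::LoadVar2(a, b) => {
                stack.push(locals[a]);
                stack.push(locals[b]);
            }
            Op::StoreVar(a) => locals[a] = stack.pop().expect("stack underflow"),
            Op::StoreVar2(a, b) => {
                locals[a] = stack.pop().expect("stack underflow");
                locals[b] = stack.pop().expect("stack underflow");
            }
            Op::AddInt => {
                let rhs = stack.pop().expect("stack underflow");
                let lhs = stack.pop().expect("stack underflow");
                stack.push(lhs + rhs);
            }
        }
    }
    code.len()
}

fn main() {
    // c = a + b, without and with the paired load.
    let plain = [Op::LoadVar(0), Op::LoadVar(1), Op::AddInt, Op::StoreVar(2)];
    let paired = [Op::LoadVar2(0, 1), Op::AddInt, Op::StoreVar(2)];

    let mut locals = [3, 4, 0];
    println!("plain:  {} ops -> {:?}", run(&plain, &mut locals), locals);
    let mut locals = [3, 4, 0];
    println!("paired: {} ops -> {:?}", run(&paired, &mut locals), locals);
}
```

The typed `AddInt` stays a separate op, so adding a new operator or type does not multiply the number of fused opcodes; that is the combinatorial argument made in the rationale.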

Rationale: the dedicated fused ops were a combinatorial set (operation ×
operand-kind × fold-depth × branch-polarity) that overfit the fib loop and
would explode in opcode count as more operators/types were covered. CPython
deliberately keeps fusion to a tiny movement-only set and leans on
specialization (which we get for free from static types) — and, for the real
"way faster" win, a copy-and-patch JIT, which subsumes interpreter fusion for
hot code. This keeps the dispatch table small (better I-cache) and the design
principled.

Bytecode snapshots regenerated accordingly (load_var2 + plain add_int/
cmp_int_op/store_var). Validated: fib correct; baml_tests, baml_cli green;
clippy clean on stable 1.93.0.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
…structions

# Conflicts:
#	baml_language/crates/baml_tests/benches/runtime_benchmark.rs
sxlijin and others added 3 commits June 2, 2026 04:06
Use sccache (R2-backed) for Rust **compilation** artifacts in the cargo
CI jobs, configured entirely from `.envrc` so CI matches local shells.

- `tools_sccache` crate / `tools/baml-sccache` wrapper: a
`RUSTC_WRAPPER` that maps `BAML_SCCACHE_R2_*` → `AWS_*` and execs
sccache (native crate on Windows, shell script on POSIX).
- mise installs sccache + direnv; each cargo job loads `.envrc` via
`direnv export gha` (the single source of truth for the sccache/R2
config).
- **Swatinem/rust-cache still caches the cargo registry/git download
state**, with `cache-targets: false` so sccache owns `target/` and the
two caches don't compete. Fork PRs without R2 secrets fall back to the
runner-local cache.
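
The wrapper's env translation might look something like the sketch below. Only the `BAML_SCCACHE_R2_*` to `AWS_*` prefix mapping and the use of sccache come from the description; the concrete variable suffix in `main` (`ACCESS_KEY_ID`), the function names, and the choice to build a `Command` rather than exec it are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::process::Command;

const SRC_PREFIX: &str = "BAML_SCCACHE_R2_";
const DST_PREFIX: &str = "AWS_";

/// Copies the environment and adds an AWS_* alias for every BAML_SCCACHE_R2_* key.
fn map_env(env: &HashMap<String, String>) -> HashMap<String, String> {
    let mut out = env.clone();
    for (key, value) in env {
        if let Some(suffix) = key.strip_prefix(SRC_PREFIX) {
            out.insert(format!("{DST_PREFIX}{suffix}"), value.clone());
        }
    }
    out
}

/// Builds (but does not run) the sccache invocation a RUSTC_WRAPPER would make.
fn build_command(rustc_args: &[String], env: &HashMap<String, String>) -> Command {
    let mut cmd = Command::new("sccache");
    cmd.args(rustc_args).envs(map_env(env));
    cmd
}

fn main() {
    let mut env = HashMap::new();
    // Hypothetical variable name, chosen only to show the prefix rewrite.
    env.insert("BAML_SCCACHE_R2_ACCESS_KEY_ID".to_string(), "example".to_string());

    let mut keys: Vec<String> = map_env(&env).into_keys().collect();
    keys.sort();
    println!("{keys:?}");

    let cmd = build_command(&["rustc".to_string(), "--version".to_string()], &env);
    println!("{cmd:?}");
}
```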

Follow-up #3624 replaces Swatinem for the download caches with a
granular, Cargo.lock-driven R2 action (`cache-cargo-home`); this PR is
the sccache base it stacks on.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
…3633)

## Issue Reference
N/A (net-new, self-contained addition under `tools/baml-bench/`)

## Changes
This PR adds **baml-bench**, an event-driven pipeline that benchmarks how well
a coding agent (Claude Code) uses BAML, surfaces real language/skill issues
from those runs, and dispatches fixes. The entire diff is confined to
`tools/baml-bench/` (88 files) and touches nothing else in the monorepo.

**The pipeline:** an inbound event (Slack mention, cron job, or bug report)
creates a task; a worker runs a Claude Code agent against the task and records
a "trophy" (transcript, metrics, findings); findings are classified and
deduplicated into issues; approved issues are synced to Notion and dispatched
to a Cursor cloud agent that opens a fix PR. A read-only dashboard shows the
whole thing live.

It is built as 8 Python services + a self-hosted Convex data layer + a Next.js
dashboard:

- **`bench_core`** (shared library): pydantic schemas, jsonl/prices utilities,
  the service/proxy/slack/notion/cursor clients, and the `Processor` claim-loop
  base (SSE wakeups, heartbeat, lease) that every worker builds on.
- **Convex data layer**: the schema, a generic claimable-queue lib, per-table
  query/mutation modules, and a reaper for stale claims.
- **`api`**: the sole Convex gateway, plus a blob store for
  transcripts/binaries and generic table + baml-builds routers.
- **`claude-proxy`**: runs real Claude Code sessions and parses them into
  transcript + metrics.
- **`baml-worker`**: task to trophy (agent run, trophy parse, repro
  verification).
- **`baml-dedup`**: trophy to issue (classify + dedup).
- **`baml-builder`**: tracks baml release binaries in a registry.
- **`ingress`**: public webhook gateway (slack/notion/bug, ack-first).
- **`notion-fixer`**: Notion board sync + Cursor cloud-agent fix dispatch.
- **`cron`**: daily build-refresh + task enqueuer.
Also included: Python packaging, the base Docker image + per-service
Dockerfiles, a `docker-compose` local stack with `.env.example`, the unit + E2E
test suites, Google-style docstrings on every function/method/class, a README,
and a generated `docs/reference.md` indexing every symbol across `bench_core`,
`services`, `convex`, and `ui`. Anthropic auth is API-key only.

## Testing

- [x] Unit tests added/updated
- [x] Manual testing performed
- [ ] Tested in [environment]

- Fast suite (no Docker): `cd tools/baml-bench && pytest -m "not integration"`
  -> 12 passed (app/health wiring, proxy session parsing, ingress routing).
- E2E suite (`@pytest.mark.integration`, self-skips without Docker):
  `pytest -m integration` boots a Convex backend container plus
  `api`/`ingress`/a stub proxy on ephemeral host ports and drives the full
  pipeline (task -> worker -> trophy -> dedup -> issue -> notion sync) and the
  ingress + fix-dispatch path end to end.
- The pipeline has been running in production (standalone on Fly), so the
  migrated code is exercised; this PR is the monorepo packaging of it.

## Screenshots

N/A (the UI is a read-only dashboard; no user-facing change to existing BAML
surfaces).

## PR Checklist

- [x] I have read and followed the contributing guidelines
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings

## Additional Notes

- **Isolation:** the entire diff is confined to `tools/baml-bench/`; nothing
  outside that path changes, so it has no effect on the rest of the monorepo.
- **CI is intentionally not in this PR.** A path-scoped Blacksmith workflow +
  pre-commit hooks are ready but touch shared `.github/` config outside
  `tools/baml-bench/`, so they will follow in a separate PR to keep this one
  isolated.
- **Some docs follow later.** The README and the generated API reference are
  included. The longer guides (architecture, data-model, configuration,
  local-setup, deployment, ci) are staged and will land in a follow-up once
  reviewed.

## Summary by CodeRabbit

* **New Features**
  * Full local benchmark stack: live dashboard (graph, tables, run/task
    pages), API + blob storage, claimable-queue workers, build manager,
    agent proxy, Slack/Notion ingress, cron-driven task enqueueing, and
    end-to-end agent/verification flows.

* **Documentation**
  * Complete architecture, data model, configuration guides and generated
    API reference.

* **Tests**
  * New unit and end-to-end integration suites with drivers and service
    stubs exercising ingress, proxy, and the full pipeline.

* **Chores**
  * Local dev tooling: docker-compose, env example, gitignore,
    Dockerfiles, and package manifests.

---------

Co-authored-by: Claude Opus 4.8 <[email protected]>
)

## What

Adds **interface-field destructuring** in `match` patterns. An interface
head binds the interface's declared fields across every implementor:

```baml
function describe(a: Animal) -> string {
  match (a) {
    Animal { name } => "animal: " + name   // binds `name` for any implementor
  }
}
```

`name` resolves through each implementor's field view, so it works
whether the field was auto-linked (`Dog { name }`) or `as`-aliased (`Cat
{ name as nickname }`). Because every implementor necessarily provides
the interface's declared fields, the pattern matches them all — so
`Animal { name }` is exhaustive on its own (no `_` needed).

Previously `Animal { name }` was mis-lowered as a construction
expression (`unresolved name: Animal`); only concrete-class destructure
(`Dog { name }`) worked.

## How

- **TIR** (`baml_compiler2_tir/src/builder.rs`):
`resolve_class_pattern_type` accepts interface heads; `lower_class_pat`
has an interface branch that binds each field's type via
`resolve_interface_member` and produces a wildcard-cover `DPat`.
- **MIR** (`baml_compiler2_mir/src/lower.rs`):
`project_class_pattern_field` routes interface heads to a new
`project_interface_pattern_field`, reusing the existing interface
field-view dispatch (`try_lower_interface_field_access`). The MIR `Ty`
has no interface variant, so the route keys off the raw
`Tir2Ty::Interface`.

## Tests

- New: `match_destructures_interface_fields_directly` (interface head,
auto-linked + aliased implementors) and
`match_destructures_concrete_implementor_fields`.
- Refreshed the BEP-044 regression-suite comments — all pass now. The
two interface-method-as-value cases (`fuzz_bug01/02`) remain
`#[ignore]`d (genuinely unimplemented).
- Interfaces suite: 339 passed / 2 ignored; full `baml_tests` (30
binaries) and `cargo check --workspace` clean.

The matching BEP-044 spec update (match syntax, this feature, and other
implementation-vs-draft corrections) was pushed separately to
beps.boundaryml.com.

## Also included

In-flight **Python SDK / bridge / codegen** fixes that were already
present on the working branch (`bridge_cffi`, `bridge_python`,
`codegen_python`, `harness_setup`, `baml_cli/generate.rs`, `.pyi`
stubs). Not authored as part of the interface work; bundled per request.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **High Risk**
> Changes span TIR/MIR pattern lowering and exhaustiveness (easy to get
match soundness wrong) plus the Python runtime initialization path that
all generated SDKs use after `baml generate`.
> 
> **Overview**
> Implements **BEP-044 interface destructuring** in `match` (`Animal {
name } => …`): TIR resolves interface pattern heads and lowers
`DPat::interface` with field-view types; exhaustiveness gains
`Ctor::Interface` and matrix specialization that maps interface field
slots onto implementing class fields; MIR projects bound fields via
`project_interface_pattern_field` / existing interface field dispatch.
> 
> **Python codegen/runtime** now embeds **borsh-serialized bytecode**
instead of inlined `.baml` source: `baml generate` compiles and calls
`to_source_code_with_bytecode`; `bex_project::new_from_bytecode`,
CFFI/Python `initialize_runtime_from_bytecode`, and generated
`_inlinedbaml.BYTECODE` wiring. CI **size-gate** baselines for
`baml-cli` are bumped slightly.
> 
> Large **interface test suite** additions (compile + VM) for
destructure exhaustiveness, mixed concrete/interface arms, generics, and
updated regression comments (most fuzz/wf3 cases now pass;
method-as-value tests still ignored).

## Summary by CodeRabbit

* **New Features**
  * Match expressions now support destructuring of interface-typed values.
  * SDKs can initialize the runtime from precompiled BAML bytecode (new
    runtime entrypoint and corresponding Python initializer).

* **Tests**
  * Added end-to-end tests covering interface-field destructuring,
    exhaustive matching, and related runtime behaviors.

* **Chores**
  * Test harness and workspace updated for bytecode support (borsh); CI
    size-gate baselines updated.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
hellovai and others added 7 commits June 2, 2026 09:10
…lock (#3635)

## TL;DR

`cargo test -p baml_tests` (and `baml-cli test --from baml_src`)
**hung**. It turned out to be **two unrelated bugs stacked on top of
each other**, and the first one masked the second:

1. **Compile-time `O(files²)` blow-up** — the deterministic hang.
Compiling the 55-file `baml_src` corpus as one project ran two
whole-project passes once *per file* / *per function*. Fixed with salsa
memoization. Corpus went from *never finishing* → ~30s, **byte-identical
bytecode**.
2. **A runtime GC/permit deadlock in `spawn`** — once compilation was
fast enough to actually reach execution, the test run started hanging
~50% of the time in the engine. Root cause was a **nested heap-permit
acquire that deadlocks against stop-the-world GC**. Fixed by allocating
the spawned future under the parent's existing permit.

Both were diagnosed by measurement (CPU sampling, scaling curves,
forced-GC stress repro), not guesswork. Details + code references below
so this is auditable.

**Commits**
1. `perf(compiler2): …` — memoize the two whole-project compile passes
(Problem 1).
2. `fix(engine): …` — allocate the spawned future under the parent's
permit (Problem 2).
3. `perf(mir): …` — follow-up: borrow `resolved_aliases` instead of
cloning it per context (see *Follow-up* section), addressing review
feedback.

---

## Problem 1 — `O(files²)` compilation (commit `perf(compiler2): …`)

### Symptom

`baml-cli test --from baml_src` pinned one core at ~95% CPU and never
finished on the full corpus. Not a cargo-lock deadlock, not network —
pure CPU-bound compilation. Scaling was clearly super-linear:

| files | time (before) |
|------:|--------------:|
| 1  | 4s |
| 15 | 13s |
| 27 | 31s |
| 35 | 44s |
| 55 | **never finished** (~100s+ extrapolated) |

`sample`-ing the process showed all time under `collect_diagnostics` →
type inference → `ppir_expansion_items` → `collect_alias_bodies` →
`lower_file`, and later under `generate_project_bytecode` →
`lower_function` → `LoweringContext::new` → `populate_from_package`.

### Two quadratics

**(a) `baml_compiler2_ppir`** — `ppir_expansion_items` is a **per-file**
`#[salsa::tracked]` query (`lib.rs:205`), but each invocation called
`collect_block_attrs` / `collect_alias_bodies` — plain functions that
iterate **every** file in the project and call `ast::lower_file` on
each. So N files × re-lowering N files = **O(N²)** lowerings.

**(b) `baml_compiler2_mir`** — `LoweringContext::new` / `new_for_let`
run **per function**, and each rebuilt `populate_from_package` (which
lowers every class field type across all packages, `lower.rs:1190`) plus
`ResolvedAliases::for_package` (which re-runs `find_recursive_aliases`
over the whole project). M functions × N classes = O(M × N) work, effectively **O(N²)** again because M grows with the project.

### Fix

Memoize the whole-project work behind package/project-keyed salsa
queries, reusing the manual `unsafe impl salsa::Update` (via
`PartialEq`) pattern already established for `PackageItems`:

- `project_expansion_maps(db, project)` — `ppir/lib.rs:165`
- `package_lowering_data(db, pkg_id)` — `mir/lower.rs:957`.
`LoweringContext` now **borrows** the schema maps and `resolved_aliases`
(`&'db`) instead of rebuilding/cloning them per function. (The
`resolved_aliases` borrow was completed in commit 3 — see *Follow-up*.)
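
The shape of this fix can be illustrated without salsa. The following is a minimal, std-only sketch: `Project`, `lower_file`, `expansion_items_naive`, and `expansion_items_memoized` are hypothetical stand-ins, and `OnceCell` takes the place of the real `#[salsa::tracked]` queries. It only counts how many lowerings each strategy performs.

```rust
use std::cell::{Cell, OnceCell};

/// Hypothetical stand-in for a project: file names plus a counter that
/// records how many times any file was lowered.
struct Project {
    files: Vec<String>,
    lowerings: Cell<usize>,
    /// Memoized whole-project result, computed at most once.
    expansion_maps: OnceCell<Vec<usize>>,
}

impl Project {
    fn new(n: usize) -> Self {
        Project {
            files: (0..n).map(|i| format!("file_{i}.baml")).collect(),
            lowerings: Cell::new(0),
            expansion_maps: OnceCell::new(),
        }
    }

    /// Stand-in for the expensive per-file lowering step.
    fn lower_file(&self, idx: usize) -> usize {
        self.lowerings.set(self.lowerings.get() + 1);
        self.files[idx].len()
    }

    /// Whole-project pass: lowers every file.
    fn collect_all(&self) -> Vec<usize> {
        (0..self.files.len()).map(|i| self.lower_file(i)).collect()
    }

    /// Before: each per-file query redoes the whole-project pass.
    fn expansion_items_naive(&self, idx: usize) -> usize {
        self.collect_all()[idx]
    }

    /// After: the whole-project pass runs once and is shared by every query.
    fn expansion_items_memoized(&self, idx: usize) -> usize {
        self.expansion_maps.get_or_init(|| self.collect_all())[idx]
    }
}

/// Runs the per-file query for every file and returns the lowering count.
fn count_lowerings(n: usize, memoized: bool) -> usize {
    let project = Project::new(n);
    for i in 0..n {
        if memoized {
            project.expansion_items_memoized(i);
        } else {
            project.expansion_items_naive(i);
        }
    }
    project.lowerings.get()
}

fn main() {
    println!("naive:    {} lowerings", count_lowerings(55, false));
    println!("memoized: {} lowerings", count_lowerings(55, true));
}
```

For a 55-file project the naive variant performs 55 × 55 lowerings and the memoized one performs 55, which is the quadratic-to-linear change the tables in this section measure.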

### Result

| files | before | after |
|------:|-------:|------:|
| 35 | 44s | 20s |
| 55 | never finished | ~30s |

Scaling is now ~linear. **The bytecode snapshot test passes unchanged**
— output is byte-for-byte identical, so this is a pure performance
change. 1576 `baml_tests` lib tests pass.

---

## Problem 2 — `spawn` deadlocks against GC (commit `fix(engine): …`)

This is the subtle one. Once compilation was fast, the full 1614-test
run started **hanging ~50% of the time**, always with the *same* shape:
the tokio runtime driver parked in `block_on` and **every worker thread
idle/parked** — i.e. a lost-wakeup, not a CPU spin.

### The BAML test that triggered it

`crates/baml_tests/baml_src/ns_cancel_cascade/cancel_cascade.baml`:

```baml
function cancelled_child_future_state_is_cancelled() -> baml.future.FutureState {
    let slow   = spawn { baml.sys.sleep(60000); 42 };  // task S: sleeps 60s
    let waiter = spawn { await slow };                  // task W: awaits S
    let _      = waiter.cancel();                        // cancel W (not S)
    waiter.state()
}
```

It passed **5/5 in isolation** but hung ~50% in the full run — the
classic fingerprint of a *concurrency* bug that needs accumulated load,
not a logic bug in the test. Two facts narrowed it down:

- All threads parked ⇒ a lost wakeup in async machinery, not a
synchronous lock.
- It correlated with **garbage collection**: forcing GC to run on
*every* allocation (temporarily setting the Gen0 threshold `10_000 → 1`
in `bex_heap`) turned the flake into a **100% reproducible** hang on a
tiny repro. That was the key to pinning it.
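
Why lowering the threshold makes the race deterministic can be seen in a toy model. This is only a sketch: the `Heap` type, its fields, and the `>=` trigger rule are assumptions for illustration, not the `bex_heap` implementation.

```rust
/// Hypothetical model of a Gen0 collection trigger.
struct Heap {
    gen0_threshold: usize,
    allocs_since_gc: usize,
    gc_cycles: usize,
}

impl Heap {
    fn new(gen0_threshold: usize) -> Self {
        Heap { gen0_threshold, allocs_since_gc: 0, gc_cycles: 0 }
    }

    /// Records one allocation and counts a collection once the threshold is reached.
    fn alloc(&mut self) {
        self.allocs_since_gc += 1;
        if self.allocs_since_gc >= self.gen0_threshold {
            self.gc_cycles += 1;
            self.allocs_since_gc = 0;
        }
    }
}

/// Number of collections triggered by `allocations` allocations.
fn gc_cycles_for(threshold: usize, allocations: usize) -> usize {
    let mut heap = Heap::new(threshold);
    for _ in 0..allocations {
        heap.alloc();
    }
    heap.gc_cycles
}

fn main() {
    // A small test allocates far fewer than 10_000 objects, so GC rarely overlaps it.
    println!("threshold 10_000: {} cycles", gc_cycles_for(10_000, 50));
    // With threshold 1, every allocation (including the one in `spawn`) meets a GC.
    println!("threshold 1:      {} cycles", gc_cycles_for(1, 50));
}
```

With the default threshold a short test may never overlap a collection; with a threshold of 1 every allocation does, so any ordering bug between allocation and GC is hit on the first try.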

### Background: the heap-permit model

The engine coordinates GC with a `HeapPermitManager` backed by a
**single tokio `Semaphore`** (`bex_heap/src/heap_guard.rs`):

- Each running VM mutator holds **one** `ActiveHeapPermit` (one
semaphore permit).
- Stop-the-world GC parks everything by draining the **entire**
semaphore at once:

```rust
// bex_heap/src/heap_guard.rs:227
pub async fn request_park(&self) -> HeapGuard<'_> {
    let permits = self.active
        .acquire_many(MAX_PERMITS)   // <-- wants ALL permits; completes only when
        .await                        //     every ActiveHeapPermit has been released
        ...
}
```

The crucial property: **tokio's `Semaphore` is fair (FIFO)**. Once
`acquire_many(MAX_PERMITS)` is queued, any later `acquire()` (even for 1
permit) queues **behind** it and cannot be granted until the big request
is satisfied and released.
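
The queueing rule can be modelled deterministically. The sketch below is a single-threaded toy, not tokio's `Semaphore`: `FairSemaphore`, its methods, and the value of `MAX_PERMITS` are assumptions chosen for illustration. It shows a one-permit request waiting behind a queued all-permits request even though permits are free.

```rust
use std::collections::VecDeque;

/// Toy model of a fair (FIFO) counting semaphore.
struct FairSemaphore {
    available: usize,
    /// Pending requests in arrival order: (requester, permits wanted).
    waiters: VecDeque<(&'static str, usize)>,
}

impl FairSemaphore {
    fn new(permits: usize) -> Self {
        FairSemaphore { available: permits, waiters: VecDeque::new() }
    }

    /// Returns true if granted immediately; otherwise the request is queued.
    /// A request never overtakes one that is already waiting.
    fn acquire(&mut self, who: &'static str, n: usize) -> bool {
        if self.waiters.is_empty() && self.available >= n {
            self.available -= n;
            true
        } else {
            self.waiters.push_back((who, n));
            false
        }
    }

    /// Returns `n` permits, then grants queued requests strictly in order.
    fn release(&mut self, n: usize) -> Vec<&'static str> {
        self.available += n;
        let mut granted = Vec::new();
        while let Some(&(who, want)) = self.waiters.front() {
            if want > self.available {
                break; // the head of the queue blocks everyone behind it
            }
            self.available -= want;
            self.waiters.pop_front();
            granted.push(who);
        }
        granted
    }
}

/// Replays: mutator takes 1, GC asks for all, a late request asks for 1.
fn demo() -> (bool, bool, bool, Vec<&'static str>) {
    const MAX_PERMITS: usize = 4; // illustrative value only
    let mut sem = FairSemaphore::new(MAX_PERMITS);
    let mutator = sem.acquire("mutator", 1); // granted, 3 permits remain
    let gc = sem.acquire("gc", MAX_PERMITS); // queued, needs all 4
    let late = sem.acquire("late", 1); // queued behind gc although 3 are free
    let granted = sem.release(1); // mutator releases, gc takes all 4
    (mutator, gc, late, granted)
}

fn main() {
    println!("{:?}", demo());
}
```

In this model the late request is only served after the all-permits request has been granted and released, which is the property the rest of this section relies on.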

### The bug

`spawn` allocated the child's heap `Future` by taking a **second,
fresh** permit *while the parent task that issued the `spawn` still held
its own permit*. The parent awaits `spawn_thread` inline, so both
permits live on the same logical flow:

```rust
// OLD — bex_engine/src/lib.rs, spawn_thread_setup (deleted in this PR)
let permit = self.heap_permit_manager.new_permit(()).await;
let permit = permit.acquire().await;          // <-- 2nd permit, while parent still holds its 1st
let (future_id, future_ptr) = {
    let mut guard = self.futures.acquire(permit.proof()).await;
    guard.new_future(child_cancel.clone())     // allocate the child Future
};
drop(permit);
```

Now interleave a GC park (which, under real workloads, fires whenever
heap pressure crosses the threshold — hence the flakiness, and 100%
under forced GC):

```mermaid
sequenceDiagram
    participant P as Parent task<br/>(holds permit P_main)
    participant G as GC (request_park)
    participant S as Semaphore (fair FIFO)

    P->>P: executing spawn { ... }
    Note over G,S: heap pressure → GC starts
    G->>S: acquire_many(MAX_PERMITS)
    S-->>G: queued — waits for ALL permits<br/>(P_main still held by Parent)
    P->>S: acquire() for the child-future permit
    S-->>P: queued BEHIND GC (fair) — blocked
    Note over P,G: 🔒 deadlock cycle
    Note right of P: Parent won't release P_main<br/>until spawn returns
    Note right of P: spawn can't return<br/>until it gets the 2nd permit
    Note right of G: GC can't grant the 2nd permit<br/>until it finishes, which needs P_main
```

So: **GC waits for the parent's permit → the parent waits (fairly,
behind GC) for a second permit → the parent won't release the first
until it gets the second.** Cycle. All tasks suspend; every worker
parks. The 60s `sleep` is a red herring — the hang is immediate (it
reproduced with `PASS=0`).

### The fix

There is no reason to take a *new* permit to allocate the child future:
the parent is **already** holding an active permit at the `spawn` site.
Allocate the future there, under the parent's permit, and hand the
`future_id` to `spawn_thread`:

```rust
// NEW — bex_engine/src/lib.rs, VmExecState::Spawn dispatch (~lib.rs:2462)
let child_cancel = cancel.child_token();
let future_ptr = {
    let mut guard = self.futures.acquire(thread.proof()).await;   // parent's permit
    let (future_id, future_ptr) = guard.new_future(child_cancel.clone());
    drop(guard);                                                  // dropped before the await below
    Arc::clone(self)
        .spawn_thread(child_cancel, parent_errors_arc, closure, spawn_name, call_id, future_id)
        .await?;
    future_ptr
};
thread.vm.stack.push(Value::object(future_ptr));
```

`spawn_thread` now only builds the child VM and **registers** the
child's permit via `new_permit` (which takes the holders mutex but does
**not** acquire a semaphore permit), then fires the task. The child's
permit is acquired later, on the spawned task — never nested under the
parent. **One permit per task on the spawn path**, so the deadlock cycle
cannot form. `spawn_thread_setup` is deleted.
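
One way to check the "cycle cannot form" claim is to write both spawn paths as wait-for graphs and look for a cycle. This is a sketch under simplifying assumptions: the node names, the edge lists, and `has_deadlock` are hypothetical, and each party is assumed to wait on at most one other.

```rust
use std::collections::HashMap;

/// Returns true if the wait-for graph contains a cycle.
/// An edge `(a, b)` means "a cannot proceed until b does".
/// Assumes each node waits on at most one other node.
fn has_deadlock(waits_for: &[(&str, &str)]) -> bool {
    let next: HashMap<&str, &str> = waits_for.iter().copied().collect();
    next.keys().any(|&start| {
        let mut cur = start;
        // Follow at most `len` edges; coming back to `start` means a cycle.
        for _ in 0..next.len() {
            match next.get(cur) {
                Some(&n) if n == start => return true,
                Some(&n) => cur = n,
                None => return false,
            }
        }
        false
    })
}

/// Old path: GC waits for the parent's permit, and the parent's nested
/// acquire is queued behind GC.
fn old_spawn_path() -> Vec<(&'static str, &'static str)> {
    vec![("gc", "parent"), ("parent", "gc")]
}

/// New path: GC still waits for the parent's permit, but the parent takes
/// no second permit, so it waits on nothing.
fn new_spawn_path() -> Vec<(&'static str, &'static str)> {
    vec![("gc", "parent")]
}

fn main() {
    println!("old path deadlocks: {}", has_deadlock(&old_spawn_path()));
    println!("new path deadlocks: {}", has_deadlock(&new_spawn_path()));
}
```

Removing the nested acquire deletes the `parent -> gc` edge, so the only remaining edge resolves as soon as the parent reaches its next release point.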

Why this is safe to audit:
- `new_future` is synchronous and only needs a `PermitProof` to prove GC
isn't running — the parent's `thread.proof()` satisfies that just as
well as a fresh permit did.
- The `FutureManagerGuard` is dropped **before** the
`spawn_thread().await`, so no non-`Send` guard crosses a yield point
(same pattern already used elsewhere in `run_thread_event_loop`).
- `new_permit` only contends on the holders mutex, and `request_park` is
explicitly ordered (semaphore first, then holders mutex) to not deadlock
against it.
- The parent holds its permit across the whole dispatch, so GC cannot
move `future_ptr` during the await.

### Verification

The bug is flaky, so I verified with the forced-GC stress harness (turns
the race into a deterministic signal) and then with the real threshold:

| scenario | before fix | after fix |
|---|---|---|
| minimal `spawn/cancel` repro, **GC forced every alloc** | 6/6 hang | **0/10 hang** |
| full 1614-test corpus, real GC threshold | ~50% hang | **0/8 hang** (1614 passed, 0 failed each) |

(The temporary `gc.rs` threshold change was only a debugging aid and is
**not** part of this PR.)

---

## Follow-up — borrow `resolved_aliases`, and what's *not* worth
optimizing (commit `perf(mir): …`)

Review feedback flagged two remaining per-`LoweringContext` (i.e.
per-function) costs. I profiled a full `--list` compile of the corpus
before touching either, and the result decided each:

> The whole MIR lowering path (`package_lowering_data` /
`LoweringContext::new` / `lower_function`) is **~11 of ~3600 samples**.
Compile time is dominated by TIR `infer_scope_types`
(`render_scope_diagnostics → infer_scope_types`). So neither of these is
a measurable cost on the current corpus.

**Fixed: `resolved_aliases` cloned per context → borrowed.** The other
five package-invariant schema maps were already borrowed (`&'db`) from
`package_lowering_data`; `resolved_aliases` (a `HashMap` + `HashSet`)
was the one I'd left as a per-context clone for expedience. Now it's
borrowed too. Whole-struct passes (`&self.resolved_aliases`) became
`self.resolved_aliases` (the field is already a reference); `.aliases`
sub-field accesses and `.convert(...)` calls are unchanged (they
auto-deref through the borrow). Not a measurable speedup here, but it's
a strict reduction in per-context allocation, completes the
borrow-not-clone design, and removes a cloning cost that grows as
`O(contexts × aliases)` in alias-heavy projects.
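
The clone-versus-borrow difference can be sketched as follows. All types here are hypothetical stand-ins (the real `ResolvedAliases` and `LoweringContext` have more fields); the sketch only shows why `&self.resolved_aliases` becomes `self.resolved_aliases` and why sub-field access is unchanged.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the package-wide alias data.
#[derive(Clone, Default)]
struct ResolvedAliases {
    aliases: HashMap<String, String>,
    recursive: HashSet<String>,
}

/// Before: every context owns a private copy of the maps.
struct OwningContext {
    resolved_aliases: ResolvedAliases,
}

/// After: every context borrows the single package-level value.
struct BorrowingContext<'db> {
    resolved_aliases: &'db ResolvedAliases,
}

fn lookup(aliases: &ResolvedAliases, name: &str) -> Option<String> {
    aliases.aliases.get(name).cloned()
}

impl OwningContext {
    fn resolve(&self, name: &str) -> Option<String> {
        lookup(&self.resolved_aliases, name) // whole-struct pass needs `&`
    }
}

impl<'db> BorrowingContext<'db> {
    fn resolve(&self, name: &str) -> Option<String> {
        lookup(self.resolved_aliases, name) // the field is already a reference
    }

    fn is_recursive(&self, name: &str) -> bool {
        self.resolved_aliases.recursive.contains(name) // auto-deref, unchanged
    }
}

fn sample_aliases() -> ResolvedAliases {
    let mut data = ResolvedAliases::default();
    data.aliases.insert("UserId".to_string(), "string".to_string());
    data.recursive.insert("Tree".to_string());
    data
}

fn main() {
    let shared = sample_aliases();
    let owning = OwningContext { resolved_aliases: shared.clone() }; // allocates
    let a = BorrowingContext { resolved_aliases: &shared }; // no allocation
    let b = BorrowingContext { resolved_aliases: &shared };
    println!("{:?} {:?}", owning.resolve("UserId"), a.resolve("UserId"));
    println!("shared storage: {}", std::ptr::eq(a.resolved_aliases, b.resolved_aliases));
    println!("Tree recursive: {}", b.is_recursive("Tree"));
}
```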

**Skipped (measured non-issue): making `build_class_type_tags` a tracked
query.** It showed up as **0 samples**. Unlike `populate_from_package`,
it does no type lowering — just cached `file_item_tree` reads and
`TypeName→i64` inserts. Memoizing it would add a wrapper + `Update` impl
+ a project-keyed query, and it's **bytecode-affecting** (it assigns the
global type-tag numbering that must match the emitter), so it's risk for
no measurable benefit. Worth revisiting only if a profile on a
larger/real project shows it mattering.

---

## Test status

- `bex_engine`, `bex_vm` unit tests — pass
- `cancel_cascade`, `spawn_array_race`, `spawn_parallel`,
`spawn_semantics`, `spawn_specialization` integration tests — pass
- `baml_tests --lib` (1576 tests) — pass
- bytecode snapshot (`baml_tests --test baml_src`) — pass, **unchanged**
- `cargo fmt` + `cargo clippy -D warnings` — clean (enforced by
pre-commit)

## Reviewer notes / blast radius

The engine change is on the **hot `spawn` path** and touches core
GC/permit concurrency, so it deserves a careful read despite the small
diff. The repro is flaky by nature; I'd suggest a CI loop running the
corpus a handful of times to build confidence. The compiler change is
performance-only and guarded by the byte-identical bytecode snapshot.

🤖 Generated with [Claude Code](https://claude.com/claude-code)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Memoized package-level schema and type-alias data to eliminate
redundant per-function recomputation and speed up compilation.
* Centralized project-wide expansion maps for consistent, more efficient
expansion across files.
* Adjusted spawn/allocation flow so child futures are allocated at spawn
sites, improving heap-permit handling and runtime stability.

* **Chores**
  * Updated size-gate thresholds and CI artifact size metadata.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
…anch`, non-blocking gate (#3637)

## Why

The size-gate baseline was a stale, hand-recorded local snapshot that
only moved when someone remembered to re-run it. Two consequences the
team has hit repeatedly:

1. **Drift** — real CI sizes had crept to **+2.9%** under the 3% gate
(linux `baml-cli` was ~26 KB from tripping), so even near-no-op PRs trip
it. Several ceilings sat ~0% above baseline instead of the intended ~3%.
2. **Broken fix instructions** — the hint pointed at `size-gate record`,
which rebuilds **locally** (sizes differ from CI) and only covers the
host platform, so following it verbatim can't green CI.

## What changed

**New `cargo-size-gate bake` — adopt CI-measured sizes, no rebuild**
- `bake --branch canary` (new `fetch.rs`) shells to `gh`, finds the
newest *completed* CI run on the branch with all four `size-gate-*`
reports (ignoring run conclusion and size violations — canary CI is
usually red for unrelated reasons), downloads them, writes
`.ci/size-gate/<platform>.toml`, and re-pegs the `max_*_bytes` ceilings
in `.cargo/size-gate.toml` (comments preserved via `toml_edit`).
Idempotent: no size change → no write → no PR.
- Also `bake <files...>` for explicit reports, and
`--repo/--download-dir/--summary-out`.
- Exposed as `mise run size-gate-update`.
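
The re-peg and idempotency rules can be sketched in a few lines. This is an assumption-laden illustration: `Baseline`, `ceiling_for`, and this `bake` function are hypothetical, the 3% headroom is taken from the "intended ~3%" figure above, and the round-up rule is a guess rather than the tool's actual arithmetic.

```rust
/// Headroom above the measured size, in basis points (300 = 3%).
const HEADROOM_BPS: u64 = 300;

/// Hypothetical per-artifact record: measured size and its ceiling.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Baseline {
    size_bytes: u64,
    max_bytes: u64,
}

/// Ceiling = baseline plus headroom, rounded up, using integer math.
fn ceiling_for(baseline_bytes: u64) -> u64 {
    baseline_bytes + (baseline_bytes * HEADROOM_BPS + 9_999) / 10_000
}

/// Returns `Some(new_baseline)` only when the measured size changed,
/// so an unchanged measurement produces no write and no PR.
fn bake(current: &Baseline, measured_bytes: u64) -> Option<Baseline> {
    if current.size_bytes == measured_bytes {
        return None;
    }
    Some(Baseline {
        size_bytes: measured_bytes,
        max_bytes: ceiling_for(measured_bytes),
    })
}

fn main() {
    let current = Baseline { size_bytes: 1_000_000, max_bytes: 1_030_000 };
    println!("{:?}", bake(&current, 1_000_000)); // unchanged size
    println!("{:?}", bake(&current, 1_200_000)); // new size, new ceiling
}
```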

**New workflow `size-gate-baseline-refresh.yml` — daily refresh**
- Daily (+ manual dispatch). Calls the **same** `bake --branch canary`
(no duplicated run-selection), opens PR
`chore/size-gate-baseline-refresh` via PAT so its own CI runs, and
enables squash **auto-merge** so it self-merges once required checks
pass. Reuses the artifacts canary CI already produced — **no rebuilds**
(avoids re-paying the mac/windows release builds that are CI's long
pole).

**Fixed the fix-hint** to point at `bake --branch canary` instead of the
broken `record` flow.

**Made size-gate non-blocking** — removed from `ci-failure-alert.needs`.
A size bump no longer blocks the merge queue. It's a signal (PR comment
+ daily refresh), not a gate.

**Re-pegged baselines + ceilings to current canary (`ef7c326`)** so the
gate is accurate on merge.

## Prerequisites for the daily job
- Repo setting **"Allow auto-merge"** enabled.
- `secrets.SAM_GITHUB_BOUNDARYML_READWRITE` (the PAT `oncall.yml` uses)
available to the workflow.
- Scheduled workflows fire from the **default branch**'s copy of the
file — merge there for the cron to run (the job checks out canary
explicitly regardless).

## Deferred / follow-ups (not in this PR)
- Move size-gate to **post-merge-only** (run on canary push, not
PRs/merge-queue) so it stops running on PRs entirely while still feeding
the nightly. Kept local for now.
- **Trend graph** (the git history of `.ci/size-gate/*.toml` is a
ready-made byte-exact time series). CodSpeed has no custom-metric
ingest, so this won't ride that rail cleanly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Made size-gate checks informational in CI; they no longer block
merges.

* **New Features**
* Added automated daily baseline refresh workflow for size-gate metrics.
* Added new `bake` command to update baseline values from CI
measurements.

* **Chores**
* Updated size-gate baselines for multiple platforms with new
measurements.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
Adds salsa caching to three functions, resulting in a ~60% speedup in
`baml_tests/baml_cli` compilation.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added compiler benchmarking suite to measure BAML compilation
performance.
* Added flamegraph profiling tool for analyzing compiler performance
bottlenecks.

* **Chores**
  * Optimized internal compilation caching for improved performance.
  * Updated build documentation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Fixes 17 bugs in BEP-044 interface **unions**, found by a union-focused
fuzz of the language. Each was pinned as a failing `union_fuzz_fNN_*`
test in `crates/baml_tests/tests/interfaces.rs` and is now green. A
class-only union method-dispatch case is also pinned as a regression
guard.

Two root causes dominated:
- **`unknown | unknown is not a function`** — a method present on every
member of a union containing an interface was rejected because the
union-member probe resolved interface members to the `Ty::Unknown`
sentinel.
- **VM crashes (`expected map, got instance`, `tagged_int_add`)** —
reading a field / narrowing a generic-interface match arm on a union
dispatched incorrectly at runtime.

## Fixes

| ID | Severity | Fix |
|----|----------|-----|
| F7/F12 | spurious-error / diagnostic | TIR resolves interface union members through the real interface machinery (side effects suppressed); MIR dispatches the call on the runtime class across all members' implementors. No more `unknown \| unknown`. |
| F1/F3/F11 | crash / soundness | MIR dispatches a union field read on the runtime class. Conflicting field types across union interfaces read soundly as `T \| U` (misuse → E0001); a genuinely ambiguous field view → E0131. |
| F2/F5 | crash / wrong-result | A generic-interface match arm (`Slot<int>`) respects its type argument at runtime instead of matching every implementor of the bare interface. |
| F4 | crash | `string + <non-object primitive>` (int/float/bigint/bool/null) is a type error, not an inferred `string` that aborts the VM. `string + uint8array` stays valid. |
| F6 | wrong-result | Reflection compares generic union args as unordered sets (`Box<int\|string>` == `Box<string\|int>`). |
| F8 | spurious-error | A bounded type variable is a subtype of itself (`<T extends I>(a: T) -> T { return a }` compiles). |
| F9 | spurious-error | A match arm overlapping any member of an optional/union scrutinee is accepted (`let a: Animal` over `(Dog \| Cat)?`). |
| F10/F15 | spurious-error / diagnostic | Out-of-body `implements I for <primitive>` resolves for every primitive (not just `int`); union method-not-found blames only the genuinely-lacking member. |
| F13/F14/F16/F17 | diagnostic | No `user.` package-prefix leak in match witnesses; uncovered interface members named instead of `_`; ambiguous union field → E0131; projection errors keep generic args (`Cargo<int>`). |
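
The F6 rule (union members compare as an unordered set, while generic arguments stay positional) can be sketched as a normalize-then-compare step. The `Ty` enum and both functions below are hypothetical; the VM's actual reflection types are not shown in this PR description.

```rust
use std::collections::BTreeSet;

/// Hypothetical type representation for illustration only.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Ty {
    Named(String),
    Union(Vec<Ty>),
    Generic(String, Vec<Ty>),
}

/// Canonical form: union members are sorted and deduplicated, recursively.
/// Generic arguments keep their order because position is significant.
fn normalize(ty: &Ty) -> Ty {
    match ty {
        Ty::Named(name) => Ty::Named(name.clone()),
        Ty::Union(members) => {
            let set: BTreeSet<Ty> = members.iter().map(normalize).collect();
            Ty::Union(set.into_iter().collect())
        }
        Ty::Generic(name, args) => {
            Ty::Generic(name.clone(), args.iter().map(normalize).collect())
        }
    }
}

/// Two types are the same if their canonical forms are equal.
fn same_type(a: &Ty, b: &Ty) -> bool {
    normalize(a) == normalize(b)
}

fn named(name: &str) -> Ty {
    Ty::Named(name.to_string())
}

fn main() {
    let a = Ty::Generic("Box".into(), vec![Ty::Union(vec![named("int"), named("string")])]);
    let b = Ty::Generic("Box".into(), vec![Ty::Union(vec![named("string"), named("int")])]);
    println!("Box<int|string> == Box<string|int>: {}", same_type(&a, &b));
}
```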

Snapshot + LSP expectation updates reflect the corrected diagnostics (no
`user.` leak; `string + int` now E0004).

## Test plan

- `cargo test -p baml_tests` — **2058 passed, 0 failed**
- `cargo test -p baml_lsp2_actions_tests` — **369 passed, 0 failed**
- `cargo clippy --all-targets -- -D warnings` — clean
- `cargo fmt` — clean

Merged latest `canary` (salsa-memoization hang fix); the one conflict in
`lower.rs` was resolved by keeping the generic-interface pattern routing
on top of canary's borrowed-`resolved_aliases` API.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> Touches MIR lowering, TIR inference, VM reflection, and pattern/match
codegen for interfaces and unions—areas that previously caused VM
crashes and silent wrong dispatch; regressions would affect runtime
behavior and type soundness.
> 
> **Overview**
> Fixes **BEP-044 interface unions** end-to-end: TIR no longer collapses
interface arms to `unknown | unknown` for callable unions; MIR adds
runtime class-tag dispatch for methods and fields when the receiver is a
union (including optional-wrapped unions and interface members),
including inherited defaults and interface field views. **Generic
interface** `is`/match patterns and fast type-tag switches now respect
type arguments so `Slot<int>` cannot capture `Slot<string>` values.
> 
> TIR also gains union-member resolution with diagnostic rollback,
E0121/E0131 for ambiguous shared implementors, out-of-body primitive
members, bounded-generic `T <: T`, match-arm overlap over optional
unions, and stricter `string +` rules for non-object primitives. VM
reflection compares generic args with order-insensitive union
equivalence. Diagnostics and exhaustiveness witnesses drop `user.`
prefixes and show user-facing type names. Large `union_fuzz_fNN_*`
regression tests plus snapshot/LSP updates; `.gitignore` adds
`workflow_scratch_files/`.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
b535f94. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Fixed runtime dispatch and field reads for unions that contain
interface members; tightened generic-interface runtime matching and
narrowed type-match behavior.
* Reduced spurious diagnostics during union-call inference; improved
subtype/pattern-overlap checking and rejected invalid string+int
concatenation.
* Made missing-case diagnostics and witness rendering use user-facing
type names and avoid leaking internal prefixes.

* **Tests**
* Added a large regression suite covering union/interface dispatch,
generic-interface scenarios, diagnostics, and runtime behaviors.

* **Chores**
  * Ignored local workflow scratch files.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a lightweight Bedrock request serializer and media support for
images, video, documents, and audio.
* Introduced local credential/token resolution modules for AWS and
Google Cloud with pluggable IO adapters.

* **Improvements**
* Streamlined Bedrock and Vertex AI authentication flows and request
signing for more reliable credential discovery.
* Simplified credential-provider precedence and expanded test coverage
for credential resolution.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## What & why

`baml describe <Class>` and LSP hover treated a class's methods as
incidental text living inside the raw class body. Under a line budget
those methods got chopped out and replaced with `[... skipped N lines
...]`, and LSP hover rendered a class with methods as a bare `class
String {}` — methods were effectively **invisible for discovery**.

This PR makes methods first-class, always-visible output — the same
treatment fields already get.

## End-user impact

- **Methods are always discoverable.** Every method of a class now
appears in `baml describe` with its first-line docstring, full
`function` signature, and definition line range, in dedicated `methods:`
/ `static_methods:` sections. These are **never truncated** — `baml
describe User --budget 5` shows the same methods as the full output.
Method *bodies* only appear when you drill into a specific method.
- **Class body shows fields only.** The body block is canonical BAML
(`name: type,`) containing just the fields; methods render below. A
fields-only body fits any reasonable budget, so it stops triggering
truncation.
- **`baml describe string` works.** Lowercase primitive/keyword aliases
— `string`, `int`, `bigint`, `float`, `bool`, `null`, `uint8array`,
`image`, `audio`, `video`, `pdf`, `json` — resolve to their builtin
classes, alongside the existing canonical (`root.ns.Foo`) and package
(`baml.json.json`) forms.
- **Consistent, canonical type printing** across describe + hover +
signatures: builtins collapse to their alias (`baml.String` → `string`,
`baml.json.json` → `json`), user types read `root.ns.Foo`, lists/maps as
`T[]` / `map<K, V>`. Headers show the canonical FQN in parens when it
differs from the bare name (e.g. `class String (string)`, `class Config
(root.llm.Config)`).
- **Minimal, useful hover.** Hover shows the class docstring + field
shape, plus a one-line `Run \`baml describe <FQN>\`` hint **only when
the class has methods**.
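
The printing rules above can be sketched as a small recursive printer. Everything here is illustrative: the `Ty` enum, `display_canonical`, and `class_header` are hypothetical names, and the alias table contains only the two mappings quoted above rather than the full builtin list.

```rust
/// Hypothetical type representation; `Named` holds a fully-qualified name.
enum Ty {
    Named(String),
    List(Box<Ty>),
    Map(Box<Ty>, Box<Ty>),
}

/// Builtin classes collapse to their lowercase alias.
/// Only the two mappings quoted in the description are included.
fn builtin_alias(fqn: &str) -> Option<&'static str> {
    match fqn {
        "baml.String" => Some("string"),
        "baml.json.json" => Some("json"),
        _ => None,
    }
}

/// Canonical spelling: alias for builtins, FQN for user types,
/// `T[]` for lists and `map<K, V>` for maps.
fn display_canonical(ty: &Ty) -> String {
    match ty {
        Ty::Named(fqn) => builtin_alias(fqn)
            .map(str::to_string)
            .unwrap_or_else(|| fqn.clone()),
        Ty::List(inner) => format!("{}[]", display_canonical(inner)),
        Ty::Map(k, v) => format!("map<{}, {}>", display_canonical(k), display_canonical(v)),
    }
}

/// Header shows the canonical name in parens only when it differs from the bare name.
fn class_header(bare: &str, fqn: &str) -> String {
    let canonical = display_canonical(&Ty::Named(fqn.to_string()));
    if canonical == bare {
        format!("class {bare}")
    } else {
        format!("class {bare} ({canonical})")
    }
}

fn main() {
    let ty = Ty::Map(
        Box::new(Ty::Named("baml.String".into())),
        Box::new(Ty::List(Box::new(Ty::Named("root.ns.Foo".into())))),
    );
    println!("{}", display_canonical(&ty));
    println!("{}", class_header("Config", "root.llm.Config"));
}
```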

### Before / after (`baml describe string`)

Before: a multi-hundred-line raw class body, truncated mid-way with
`[... skipped 448 lines ...]` — methods unusable.

After:
```
class String  (string)  <builtin>/baml/string.baml:5-475

/// A UTF-8 encoded string.
/// ...
class String {}

methods:
  /// Serializes this string to a JSON value.
  function to_json(self) -> json  <builtin>/baml/string.baml:8-10
  /// Returns the length of the string in UTF-8 bytes.
  function length(self) -> int  <builtin>/baml/string.baml:25-27
  ...

static_methods:
  function from_code_points(unicode: int[]) -> string throws root.errors.InvalidArgument  <builtin>/baml/string.baml:472-474

references (0):
```

## Implementation

- `MethodRef` (describe) and `MethodSig` (type_info) carry signature +
first-line docstring + full range. Signatures resolve via the package
interface — auto-derived methods are skipped, `self` is shown bare, and
`throws never` is omitted.
- One canonical printer: `QualifiedTypeName::builtin_alias` +
`display_ty_canonical_for_file`. It's opted into **only** by the
describe/hover/signature paths; diagnostics, completions, and inlay
hints keep their existing spelling (so other call sites can adopt it
mechanically later).
- `references` now exclude a symbol's own definition span, so a class is
no longer listed as a reference of itself via its own method bodies.
- Codegen (`cg::Class`) and the `truncate_body()` algorithm are
untouched, per the design.
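
The reference-filtering rule can be sketched as a span containment check. `Span` and `external_references` are hypothetical names, and the half-open offset convention is an assumption; the real describe code may represent ranges differently.

```rust
/// Hypothetical source span: file id plus half-open byte range [start, end).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    file: u32,
    start: u32,
    end: u32,
}

impl Span {
    /// True if `other` lies entirely inside `self` in the same file.
    fn contains(&self, other: &Span) -> bool {
        self.file == other.file && self.start <= other.start && other.end <= self.end
    }
}

/// Keeps only the hits that fall outside the symbol's own definition span,
/// so a class is not reported as referencing itself from its method bodies.
fn external_references(definition: Span, hits: &[Span]) -> Vec<Span> {
    hits.iter()
        .copied()
        .filter(|hit| !definition.contains(hit))
        .collect()
}

fn main() {
    let definition = Span { file: 0, start: 5, end: 475 };
    let hits = [
        Span { file: 0, start: 100, end: 106 }, // inside the class body
        Span { file: 0, start: 500, end: 506 }, // same file, after the class
        Span { file: 1, start: 100, end: 106 }, // different file
    ];
    println!("{:?}", external_references(definition, &hits));
}
```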

## Testing

783 tests green (TIR 136, LSP 114, CLI 164, integration 369). New
coverage: TIR alias round-trip (incl. the `json` special case); CLI
fixtures for instance-only, mixed instance+static, and generic classes,
alias resolution, and a tight-budget never-truncate case; LSP hover
tests for the describe hint. Existing describe/hover snapshots updated
for the new format.

## Known follow-up

Types referenced **only** in a method signature (e.g. `WrapperMarker` in
`-> T | WrapperMarker`) are not yet surfaced under `dependencies:`. The
class-dependency path matches the canonical `pkg.Name` string against
short outline names — a pre-existing gap with no current test coverage.
Fixing it (and adding method-signature deps on top) was deferred to
avoid regressing builtin output; method *listing* itself is unaffected.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Describe output now shows class methods (instance and static) with
full one-line signatures, per-method docstrings, and definition line
ranges; canonical fully-qualified names appended when different.

* **Improvements**
* Lowercase primitives (e.g., string, int, json, image) resolve as
aliases in describe/dispatch and canonical type rendering.
  * CLI JSON includes richer per-method metadata.
* Hovers include class docstrings and a "baml describe" hint when
methods exist; method sections are never truncated.
  * Non-doc single-line comments are stripped from rendered bodies.

* **Tests**
* Expanded fixtures/snapshots covering methods, alias dispatch, hover
output, comment handling, and truncation guarantees.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
…structions

# Conflicts:
#	baml_language/Cargo.lock
#	baml_language/crates/bridge_cffi/src/lib.rs
…ac targets)

The non-macos/aarch64 kperf shim had `#[inline(always)]` on its no-op
enabled/exec_start/exec_end fns, tripping clippy::inline_always under
`-D warnings` when compiled for other targets (e.g. CI's cross-target check).
Host clippy never compiles the shim, so it slipped through. `#[inline]` is
plenty for trivial no-ops. Verified clean via clippy --target
x86_64-unknown-linux-musl.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
@hellovai hellovai merged commit 99f624d into hellovai/trim-events Jun 2, 2026
31 of 34 checks passed
@hellovai hellovai deleted the hellovai/vm-superinstructions branch June 2, 2026 21:54