perf(vm): superinstruction fusion — bytecode interpreter now faster than CPython by hellovai · Pull Request #3627 · BoundaryML/baml

hellovai · 2026-06-01T17:49:46Z

Stacked on #3616 (base = hellovai/trim-events). Makes the bex_vm bytecode interpreter faster than CPython 3.9 across most workloads, where it started ~2.2× slower.

Results (Apple M2 Max, `scripts/speedtest run`)

fib (1M × fib(50)): total instructions retired −60%, CPU cycles −51% vs the pre-fusion baseline — below the cycle count of an optimal naive hand-written interpreter for the same loop.
All ~38 speedtest workloads: BAML now beats Python on ~22 of them (several by 2–4×: parallel-sum 0.23×, spawn-fan-out 0.41×, call-chain 0.45×, class-instances 0.50×, method-call 0.56×). Not overfit to fib — the wins span compute, classes, concurrency, dispatch, and most string ops.

What changed (each commit validated + clippy-clean)

All fusion is an emit-time peephole that rewrites instructions in place, confined to the current basic block (mirrors the existing StoreVarLoadVar fusion), so jump targets and block addresses are never affected:

- Superinstruction fusion: fold operand loads (and the store, and the loop condition's branch) into single ops:
  - AddIntVar/Const, SubIntVar/Const, CmpIntLtVar/Const (fold right operand)
  - AddIntVarVar/VarConst, SubIntVarVar/VarConst, CmpIntLtVarVar/VarConst (fold both operands, no operand stack pushes)
  - AddIntVarVarStore/VarConstStore (compute + store directly), MoveLocal (fused x = y)
  - CmpIntLt*BrFalse/BrTrue (fuse the loop condition with its branch; branch inversion drops the per-iteration jump-to-body)
- Unchecked local-slot fast paths for LoadVar/StoreVar, and a lazy faulting_pc (track cur_pc once per dispatch instead of writing the frame every op).
- Call path: consolidated redundant per-call get_object heap derefs with a plain-Function fast path.
- kperf: an in-process Apple-Silicon PMC probe (cycles + instructions retired, env-gated BAML_KPERF=1, behind an off-by-default kperf cargo feature) for deterministic, frequency-independent measurement.
- Toolchain: pinned the nightly toolchain and dropped the MSRV job.
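
To make the "emit-time peephole, confined to the current basic block" idea concrete, here is a minimal sketch for the add family only. The opcode names come from this PR; `Emitter`, `block_start`, and `start_block` are hypothetical names, and the real emitter in bex_vm is certainly more involved.

```rust
/// Simplified, hypothetical instruction set (add family only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    LoadVar(u16),
    LoadConst(i64),
    AddInt,
    AddIntVar(u16),           // stack top + local
    AddIntConst(i64),         // stack top + constant
    AddIntVarVar(u16, u16),   // local + local, no stack pushes
    AddIntVarConst(u16, i64), // local + constant, no stack pushes
}

#[derive(Default)]
pub struct Emitter {
    code: Vec<Op>,
    /// Index of the first instruction of the current basic block (assumption:
    /// the emitter is told whenever a new block, i.e. a jump target, begins).
    block_start: usize,
}

impl Emitter {
    /// Mark the start of a new basic block.
    pub fn start_block(&mut self) {
        self.block_start = self.code.len();
    }

    /// Emit `op`, fusing an `AddInt` with the operand loads that precede it,
    /// but only when those loads live in the current basic block.
    pub fn emit(&mut self, op: Op) {
        if op != Op::AddInt {
            self.code.push(op);
            return;
        }
        // Only look at instructions inside the current block.
        let fused = match &self.code[self.block_start..] {
            [.., Op::LoadVar(a), Op::LoadVar(b)] => Some((2, Op::AddIntVarVar(*a, *b))),
            [.., Op::LoadVar(a), Op::LoadConst(c)] => Some((2, Op::AddIntVarConst(*a, *c))),
            [.., Op::LoadVar(b)] => Some((1, Op::AddIntVar(*b))),
            [.., Op::LoadConst(c)] => Some((1, Op::AddIntConst(*c))),
            _ => None,
        };
        match fused {
            Some((consumed, fused_op)) => {
                // Rewrite the tail: the fused op takes the place of the first
                // consumed load, so earlier addresses never move.
                let new_len = self.code.len() - consumed;
                self.code.truncate(new_len);
                self.code.push(fused_op);
            }
            None => self.code.push(Op::AddInt),
        }
    }

    pub fn code(&self) -> &[Op] {
        &self.code
    }
}
```

In this sketch, a block boundary between the two loads limits fusion to the right operand, and the fused op sits at the same index the second load had, which is why block addresses stay valid.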

Key insight

Cycle cost is dominated by branch-mispredictions on the giant opcode dispatch match (an indirect jump). So the real lever is reducing the number of dispatched ops — that's why fusion keeps paying off (branch inversion: −15% cycles for −9% instructions), while removing the inline per-op counter barely moved cycles. Measure total instructions/cycles, not instr/op (fusion shrinks the op-count denominator).
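
The warning about instr/op can be checked against the cumulative fib figures quoted in this PR's commit messages (instructions retired −25.4% with VM ops −33%, then −47% / −55%, then −52% / −60%). The sketch below uses a hypothetical helper name and only those quoted percentages; it shows instructions per dispatched op rising at every step even though total instructions keep falling.

```rust
/// Relative change in instructions-per-op, given the fractional changes in
/// total instructions retired and in dispatched VM ops (e.g. -0.254 for -25.4%).
/// `instr_per_op_ratio` is a hypothetical helper, not part of bex_vm.
pub fn instr_per_op_ratio(instr_change: f64, op_change: f64) -> f64 {
    (1.0 + instr_change) / (1.0 + op_change)
}

/// Cumulative fib figures quoted in the commit messages:
/// (instructions retired, VM ops), both relative to the pre-fusion baseline.
pub const FIB_STEPS: [(f64, f64); 3] = [(-0.254, -0.33), (-0.47, -0.55), (-0.52, -0.60)];

/// Instructions-per-op after each step, relative to the baseline (1.0 = unchanged).
pub fn fib_instr_per_op_ratios() -> Vec<f64> {
    FIB_STEPS
        .iter()
        .map(|&(instr, ops)| instr_per_op_ratio(instr, ops))
        .collect()
}
```

The three ratios come out at roughly 1.11, 1.18, and 1.20: by the instr/op metric the interpreter looks about 20% worse at the point where it retires 52% fewer instructions in total.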

Validation

fib/boundary-loops/if-else correct; bex_vm, interfaces (337), exceptions, cancellation, host_value_callable, errors, floats, gc, dispatch, optimization all green.

Known follow-ups (diagnosed, not in this PR)

baml pack's embedded runtime stub needs rebuilding for the new opcodes (baml run is unaffected).
Remaining Python losses: strings (split ~2.2×, alloc-bound), deep recursion (~130 cyc/call, frame setup), collatz (`%`, `/`, `*`, `!=`, `==` still unfused).

🤖 Generated with Claude Code

github-actions · 2026-06-01T18:23:14Z

Binary size checks passed

✅ 7 passed

	Artifact	Platform	File	Gzip	Gated on	Baseline	Delta	Status
✅	`baml-cli`	Linux	🔒 17.1 MB	7.3 MB	file	20.0 MB	-2.9 MB (-14.7%)	OK
✅	`packed-program`	Linux	🔒 12.6 MB	5.3 MB	file	15.1 MB	-2.5 MB (-16.5%)	OK
✅	`baml-cli`	macOS	🔒 13.0 MB	6.3 MB	file	15.1 MB	-2.1 MB (-14.0%)	OK
✅	`packed-program`	macOS	🔒 9.6 MB	4.6 MB	file	11.5 MB	-1.9 MB (-16.3%)	OK
✅	`baml-cli`	Windows	🔒 14.0 MB	6.4 MB	file	16.3 MB	-2.3 MB (-14.2%)	OK
✅	`packed-program`	Windows	🔒 10.1 MB	4.7 MB	file	12.2 MB	-2.1 MB (-17.2%)	OK
✅	`bridge_wasm`	WASM	11.5 MB	🔒 3.3 MB	gzip	3.9 MB	-609.4 KB (-15.7%)	OK

🔒 = the size this artifact is GATED on (ceiling + delta). Binaries gate on file size (installed binary); WASM gates on gzip (download size). The other size is shown for information only.

Generated by cargo size-gate · workflow run

codspeed-hq · 2026-06-01T18:29:52Z

Merging this PR will improve performance by 60.45%

⚡ 10 improved benchmarks
✅ 3 untouched benchmarks
⏩ 7 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	WallTime	`vm_loop_500k`	95 ms	27.4 ms	×3.5
⚡	WallTime	`vm_nested_loop`	9.7 ms	4 ms	×2.4
⚡	WallTime	`vm_field_access_50k`	16.9 ms	8.4 ms	×2
⚡	WallTime	`vm_closure_call_50k`	21.7 ms	13.1 ms	+65.6%
⚡	WallTime	`vm_class_create_50k`	40.1 ms	28.3 ms	+41.95%
⚡	WallTime	`vm_array_iter_10k`	7 ms	5.4 ms	+29.22%
⚡	WallTime	`vm_wide_nested_class_create_50k`	296.4 ms	230.2 ms	+28.73%
⚡	WallTime	`vm_array_push_50k`	21 ms	16.4 ms	+27.63%
⚡	WallTime	`vm_fib_20`	7.2 ms	6.1 ms	+17.43%
⚡	WallTime	`vm_mixed_ops`	11.1 ms	9.6 ms	+15.81%

Comparing hellovai/vm-superinstructions (143bd33) with hellovai/trim-events (855fa73)

¹ 7 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

…d split

Prep for the threaded-dispatch work:

- Add crates/baml_tests/tests/dispatch.rs: an inline snapshot of how direct method calls and interface (polymorphic) dispatch lower to bytecode. Locks in that a direct `v.norm2()` is a plain static `call` (no make_bound_method, no per-call allocation) and that interface dispatch is an `is_type` chain + static call.
- Split store_local_value into an `#[inline(always)]` fast path (direct stack write) and a `#[cold] #[inline(never)]` watch handler, so the hot store path stays inline and the rarely-active watch bookkeeping is out of line. (Measured ~0% on its own; it's structural prep for `become`.)

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
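
A minimal sketch of the hot/cold split described in this commit. Only the name `store_local_value` and the attributes come from the commit message; the `Frame` layout, the `any_watch` flag, and the watch bookkeeping are assumptions made for illustration.

```rust
/// Hypothetical frame: integer locals plus rarely-used watch state.
pub struct Frame {
    locals: Vec<i64>,
    watched: Vec<bool>,
    any_watch: bool,
    watch_hits: usize,
}

impl Frame {
    pub fn new(slots: usize) -> Self {
        Frame {
            locals: vec![0; slots],
            watched: vec![false; slots],
            any_watch: false,
            watch_hits: 0,
        }
    }

    /// Start watching a slot (the rare case).
    pub fn watch(&mut self, slot: usize) {
        self.watched[slot] = true;
        self.any_watch = true;
    }

    /// Hot path: one predictable branch, then a direct write.
    #[inline(always)]
    pub fn store_local_value(&mut self, slot: usize, value: i64) {
        if self.any_watch {
            self.store_watched(slot, value);
        } else {
            self.locals[slot] = value;
        }
    }

    /// Cold path: kept out of line so it does not bloat every store site.
    #[cold]
    #[inline(never)]
    fn store_watched(&mut self, slot: usize, value: i64) {
        if self.watched[slot] {
            self.watch_hits += 1;
        }
        self.locals[slot] = value;
    }

    pub fn local(&self, slot: usize) -> i64 {
        self.locals[slot]
    }

    pub fn watch_hits(&self) -> usize {
        self.watch_hits
    }
}
```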

Cut interpreter instruction count on the hot arithmetic/loop path:

- Emit-time binop peephole folds operand loads into fused superinstructions (mirrors the existing StoreVarLoadVar in-place rewrite; confined to the current basic block so jump targets/block addresses are never fused across):
  * single-fold (right operand): AddIntVar/AddIntConst, CmpIntLtVar/CmpIntLtConst
  * double-fold (both operands, no operand stack pushes): AddIntVarVar/AddIntVarConst, CmpIntLtVarVar/CmpIntLtVarConst
- Unchecked local-slot indexing (get_at/set_at) on LoadVar/StoreVar fast paths.
- Lazy faulting_pc: track cur_pc once per dispatch instead of writing the frame every op; reconstruct topmost/innermost bytecode PC only when unwinding.
- kperf: in-process Apple-Silicon PMC probe (cycles + instructions retired), env-gated by BAML_KPERF=1, for deterministic instruction-count comparisons.

fib (1M x fib50): total instructions retired -25.4%, cycles -22.8%, VM ops -33% vs the pre-fusion baseline.

Validated: result correct; bex_vm, exceptions, cancellation, errors, floats, dispatch, optimization all green.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
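
The unchecked local-slot access can be sketched as follows. The names `get_at`/`set_at` appear in the commit message; the `Locals` type, the `validate_slots` helper, and the idea that slot indices are validated once up front are assumptions about how the safety contract might be met, not a description of bex_vm.

```rust
/// Hypothetical local-variable storage for one frame.
pub struct Locals {
    slots: Vec<i64>,
}

impl Locals {
    pub fn new(count: usize) -> Self {
        Locals { slots: vec![0; count] }
    }

    pub fn len(&self) -> usize {
        self.slots.len()
    }

    /// Read a slot without a bounds check.
    ///
    /// # Safety
    /// `idx` must be less than `self.len()`.
    #[inline(always)]
    pub unsafe fn get_at(&self, idx: usize) -> i64 {
        debug_assert!(idx < self.slots.len());
        // SAFETY: the caller guarantees `idx` is in range.
        unsafe { *self.slots.get_unchecked(idx) }
    }

    /// Write a slot without a bounds check.
    ///
    /// # Safety
    /// `idx` must be less than `self.len()`.
    #[inline(always)]
    pub unsafe fn set_at(&mut self, idx: usize, value: i64) {
        debug_assert!(idx < self.slots.len());
        // SAFETY: the caller guarantees `idx` is in range.
        unsafe { *self.slots.get_unchecked_mut(idx) = value };
    }
}

/// One-time check (assumed to run before execution) that every slot operand
/// in a function's bytecode fits its frame; this is what would justify the
/// unchecked accesses on the hot path.
pub fn validate_slots(slot_operands: &[usize], locals: &Locals) -> bool {
    slot_operands.iter().all(|&slot| slot < locals.len())
}
```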

Extend the emit-time peephole with two more fused-op families, same in-place / current-block-confined rewrite as before:

- MoveLocal(dst, src): fuses `LoadVar src; StoreVar dst` (every `x = y`). Stores via store_local_value, preserving watch semantics.
- AddIntVarVarStore / AddIntVarConstStore: a double-folded add whose result is stored straight into a (non-captured) local, `local[dst] = local[a] + ..`, never touching the eval stack.

fib (1M x fib50) cumulative vs the pre-fusion baseline: total instructions retired -47%, cycles -35%, VM ops -55%, reaching the cycle count of an optimal hand-written bytecode interpreter for the same workload.

Validated: result correct; bex_vm, exceptions, cancellation, errors, floats, dispatch, optimization all green.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
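
As an illustration of what these two families buy, the sketch below runs one fib-style step (`t = a + b; a = b; b = t`) through a toy interpreter, once unfused and once fused. Operand layouts, the toy `run` loop, and the wrapping-add overflow behaviour are assumptions; only the opcode names come from the commit.

```rust
#[derive(Clone, Copy)]
pub enum Op {
    LoadVar(usize),
    StoreVar(usize),
    AddInt,
    /// locals[dst] = locals[src]
    MoveLocal { dst: usize, src: usize },
    /// locals[dst] = locals[a] + locals[b], no eval-stack traffic
    AddIntVarVarStore { dst: usize, a: usize, b: usize },
}

/// Execute straight-line code and return the number of dispatched ops.
pub fn run(code: &[Op], locals: &mut [i64]) -> usize {
    let mut stack: Vec<i64> = Vec::new();
    for op in code {
        match *op {
            Op::LoadVar(i) => stack.push(locals[i]),
            Op::StoreVar(i) => locals[i] = stack.pop().expect("stack underflow"),
            Op::AddInt => {
                let r = stack.pop().expect("stack underflow");
                let l = stack.pop().expect("stack underflow");
                stack.push(l.wrapping_add(r)); // overflow policy is an assumption
            }
            Op::MoveLocal { dst, src } => locals[dst] = locals[src],
            Op::AddIntVarVarStore { dst, a, b } => {
                locals[dst] = locals[a].wrapping_add(locals[b]);
            }
        }
    }
    code.len()
}

// Slots: 0 = a, 1 = b, 2 = t.
pub fn fib_step_unfused() -> Vec<Op> {
    vec![
        Op::LoadVar(0), Op::LoadVar(1), Op::AddInt, Op::StoreVar(2), // t = a + b
        Op::LoadVar(1), Op::StoreVar(0),                             // a = b
        Op::LoadVar(2), Op::StoreVar(1),                             // b = t
    ]
}

pub fn fib_step_fused() -> Vec<Op> {
    vec![
        Op::AddIntVarVarStore { dst: 2, a: 0, b: 1 }, // t = a + b
        Op::MoveLocal { dst: 0, src: 1 },             // a = b
        Op::MoveLocal { dst: 1, src: 2 },             // b = t
    ]
}
```

Both programs leave the locals in the same state; the fused form dispatches 3 ops instead of 8 in this toy model.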

Add CmpIntLtVarVarBrFalse / CmpIntLtVarConstBrFalse: when a fused integer `<` comparison is immediately consumed by a `PopJumpIfFalse` (the canonical loop/if condition), the emit peephole collapses both into one op that evaluates the comparison straight from its operands and branches without materializing a bool on the stack. The fused op participates in jump resolution like the plain jumps (instruction-relative offset patched to a byte delta) and preserves the early-yield/cancellation check.

fib (1M x fib50) cumulative vs the pre-fusion baseline: total instructions retired -52%, cycles -41%, VM ops -60%; now below the cycle count of an optimal naive hand-written bytecode interpreter for the same workload.

Validated: fib correct; boundary loops (sum 0..10 = 45, count = 5) correct; bex_vm, cancellation, exceptions, errors, floats, dispatch, optimization green.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
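
A toy version of the fused compare-and-branch, running the `sum 0..10` boundary loop mentioned in the validation note. This is a sketch under simplifying assumptions: jump targets are absolute instruction indices (the commit describes relative offsets patched to byte deltas), and the early-yield/cancellation check is omitted.

```rust
#[derive(Clone, Copy)]
pub enum Op {
    /// Branch to `target` when `locals[var] < konst` is false; no bool is pushed.
    CmpIntLtVarConstBrFalse { var: usize, konst: i64, target: usize },
    /// locals[dst] = locals[a] + locals[b]
    AddIntVarVarStore { dst: usize, a: usize, b: usize },
    /// locals[dst] = locals[a] + konst
    AddIntVarConstStore { dst: usize, a: usize, konst: i64 },
    Jump(usize),
    Halt,
}

/// Run until `Halt`; returns the number of dispatched ops (including `Halt`).
pub fn run(code: &[Op], locals: &mut [i64]) -> usize {
    let mut pc = 0;
    let mut dispatched = 0;
    loop {
        let op = code[pc];
        pc += 1;
        dispatched += 1;
        match op {
            Op::CmpIntLtVarConstBrFalse { var, konst, target } => {
                // `>=` is the negation of `<` for integers.
                if locals[var] >= konst {
                    pc = target;
                }
            }
            Op::AddIntVarVarStore { dst, a, b } => locals[dst] = locals[a] + locals[b],
            Op::AddIntVarConstStore { dst, a, konst } => locals[dst] = locals[a] + konst,
            Op::Jump(target) => pc = target,
            Op::Halt => return dispatched,
        }
    }
}

/// `let sum = 0; for i in 0..10 { sum += i }`, returning (sum, dispatched ops).
pub fn sum_below_ten() -> (i64, usize) {
    const I: usize = 0;
    const SUM: usize = 1;
    let code = [
        Op::CmpIntLtVarConstBrFalse { var: I, konst: 10, target: 4 }, // 0: loop head
        Op::AddIntVarVarStore { dst: SUM, a: SUM, b: I },             // 1: sum += i
        Op::AddIntVarConstStore { dst: I, a: I, konst: 1 },           // 2: i += 1
        Op::Jump(0),                                                  // 3: back edge
        Op::Halt,                                                     // 4: exit
    ];
    let mut locals = [0i64; 2];
    let dispatched = run(&code, &mut locals);
    (locals[SUM], dispatched)
}
```

In this model the loop produces 45 after 42 dispatches: four per iteration, plus the final failing compare and the `Halt`.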

The VM-op counter (`op_count`) is pure measurement scaffolding but ran on the hottest path in every build: a store plus a memory dependency per dispatched op. Gate the increment behind a new off-by-default `kperf` cargo feature so normal/release builds don't pay for it; kperf reads cycles and instructions retired straight from the hardware counters, so the op count is only needed for the optional per-op breakdown (built with `--features bex_vm/kperf`).

Also make the `cur_pc` fault-PC write unconditional (it is needed for correct exception line numbers, not measurement) and drop the now-unused `dbg_skip_*` / `BAML_NO_*` bisection env gates.

fib (1M x fib50), op_count removed: -6.7% instructions retired (20.7e9), and BAML's interpreter is now ~10% faster than CPython 3.9 on the same workload.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
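
Feature-gating a counter like this can be sketched with `#[cfg(feature = "kperf")]`. The feature name comes from the commit; the `Vm` struct, its `add` op, and the `Option`-returning accessor are hypothetical, chosen so the example compiles with or without the feature.

```rust
pub struct Vm {
    acc: i64,
    /// Present only when the off-by-default `kperf` feature is enabled.
    #[cfg(feature = "kperf")]
    op_count: u64,
}

impl Vm {
    pub fn new() -> Self {
        Vm {
            acc: 0,
            #[cfg(feature = "kperf")]
            op_count: 0,
        }
    }

    /// With the feature on: one increment per dispatched op.
    #[cfg(feature = "kperf")]
    #[inline(always)]
    fn count_op(&mut self) {
        self.op_count += 1;
    }

    /// With the feature off: an empty function the optimizer removes entirely.
    #[cfg(not(feature = "kperf"))]
    #[inline(always)]
    fn count_op(&mut self) {}

    /// Stand-in for one dispatched op.
    pub fn add(&mut self, value: i64) {
        self.count_op();
        self.acc += value;
    }

    pub fn acc(&self) -> i64 {
        self.acc
    }

    #[cfg(feature = "kperf")]
    pub fn op_count(&self) -> Option<u64> {
        Some(self.op_count)
    }

    #[cfg(not(feature = "kperf"))]
    pub fn op_count(&self) -> Option<u64> {
        None
    }
}
```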

When a conditional's else-successor is the fall-through block (and the then block is not), emit an inverted compare-and-branch, CmpIntLtVarVarBrTrue / CmpIntLtVarConstBrTrue (branch to `then` when the comparison is true), and let the jump to the else block fall through. This eliminates the unconditional jump-to-body that otherwise runs every loop iteration.

Removing a dispatched op matters disproportionately: each dispatch is an indirect jump through the giant opcode match that the branch predictor routinely misses, so this is ~15% fewer cycles for ~9% fewer instructions (IPC 5.0 -> 5.4).

fib (1M x fib50) cumulative vs the pre-fusion baseline: instructions retired -60%, cycles -51% (halved).

Validated: fib correct; if/else and boundary loops correct; bex_vm, interfaces (337), exceptions, cancellation, errors, floats, gc, env, io, dispatch, optimization all green.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
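
The quoted IPC change follows from the other two figures, since IPC is instructions divided by cycles. A small check, using a hypothetical helper and the rounded percentages from the commit message:

```rust
/// New IPC after instructions and cycles change by the given fractions
/// (e.g. -0.09 for "9% fewer"). IPC = instructions / cycles, so the new value
/// is the old one scaled by (1 + instr_change) / (1 + cycle_change).
/// `ipc_after` is a hypothetical helper, not part of the kperf probe.
pub fn ipc_after(ipc_before: f64, instr_change: f64, cycle_change: f64) -> f64 {
    ipc_before * (1.0 + instr_change) / (1.0 + cycle_change)
}

/// The figures quoted above: IPC 5.0, ~9% fewer instructions, ~15% fewer cycles.
pub fn branch_inversion_ipc() -> f64 {
    ipc_after(5.0, -0.09, -0.15)
}
```

With the rounded inputs this gives about 5.35, consistent with the reported 5.0 -> 5.4.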

execute_call_from_locals_offset dereferenced the callee HeapPtr three separate times to (a) check for a HostClosure, (b) extract Closure captured type args, and (c) extract BoundMethod class type args. Fold these into a single match with a plain-Function fast path (the common case, including all recursion), which extracts nothing. The later callee-resolution match still validates non-callable objects, so error behaviour is unchanged.

Measured call overhead is ~130 cyc / ~740 instructions per call (5M-call bench minus an equivalent inline loop), dominated by frame setup/teardown spread across the call+return machinery, so this is a small (~1%) but free reduction.

Validated: fib32 correct; bex_vm, interfaces (337), dispatch, spawn, cancellation, host_value_callable all green (closure/bound-method/host-closure dispatch preserved).

Co-Authored-By: Claude Opus 4.8 <[email protected]>
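
The "one match instead of three derefs" shape might look like the sketch below. The variant names `Function`, `Closure`, `BoundMethod`, and `HostClosure` follow the commit message; the field layouts, the `CallPrep` result type, and `classify_callee` are assumptions for illustration.

```rust
/// Hypothetical heap object shapes relevant to a call.
#[derive(Debug)]
pub enum HeapObject {
    Function,
    Closure { captured_type_args: Vec<String> },
    BoundMethod { class_type_args: Vec<String> },
    HostClosure,
    Other,
}

/// What the call path needs to know before resolving the callee.
#[derive(Debug, PartialEq)]
pub enum CallPrep<'a> {
    /// Plain function: nothing to extract (the fast path).
    Plain,
    /// Host closure: handled by the host-call machinery.
    Host,
    /// Type arguments carried by a closure or bound method.
    TypeArgs(&'a [String]),
    /// Not classified here; the later resolution step reports non-callables.
    Deferred,
}

/// Inspect the callee once and answer all three questions in a single match.
pub fn classify_callee(callee: &HeapObject) -> CallPrep<'_> {
    match callee {
        HeapObject::Function => CallPrep::Plain,
        HeapObject::HostClosure => CallPrep::Host,
        HeapObject::Closure { captured_type_args } => {
            CallPrep::TypeArgs(captured_type_args.as_slice())
        }
        HeapObject::BoundMethod { class_type_args } => {
            CallPrep::TypeArgs(class_type_args.as_slice())
        }
        HeapObject::Other => CallPrep::Deferred,
    }
}
```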

Mirror the AddInt fusion for SubInt: SubIntVar / SubIntConst (fold the right operand load) and SubIntVarVar / SubIntVarConst (fold both). Subtraction is not commutative, so operand order is preserved (left - right) and no const-on-left commute is applied. Same in-place, current-block-confined emit peephole.

Helps any code using subtraction, which the earlier add/compare fusion missed. fib32-recursive (call-bound) still gains -9% cycles from fusing its n-1/n-2 + result-add body; the hot fib loop is unchanged (no I-cache regression from the larger dispatch match).

Validated: fib32 = 2178309, subtraction loop correct; bex_vm, interfaces (337), optimization, dispatch, errors, exceptions, floats all green.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
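
The commutativity point can be shown with a small selection function for the double-fold case. This sketch assumes the add fusion does commute a left-hand constant (`c + a` becomes `a + c`), which the commit message implies but does not state outright; `Operand`, `BinOp`, and `select_fused` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operand {
    Var(u16),
    Const(i64),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BinOp {
    Add,
    Sub,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Fused {
    AddIntVarVar(u16, u16),
    AddIntVarConst(u16, i64),
    SubIntVarVar(u16, u16),
    SubIntVarConst(u16, i64),
}

/// Pick a double-fold superinstruction for `left <op> right`, if one applies.
pub fn select_fused(op: BinOp, left: Operand, right: Operand) -> Option<Fused> {
    use Operand::{Const, Var};
    match (op, left, right) {
        (BinOp::Add, Var(a), Var(b)) => Some(Fused::AddIntVarVar(a, b)),
        (BinOp::Add, Var(a), Const(c)) => Some(Fused::AddIntVarConst(a, c)),
        // Addition commutes, so `c + a` can reuse the var-const form (assumed).
        (BinOp::Add, Const(c), Var(a)) => Some(Fused::AddIntVarConst(a, c)),
        (BinOp::Sub, Var(a), Var(b)) => Some(Fused::SubIntVarVar(a, b)),
        (BinOp::Sub, Var(a), Const(c)) => Some(Fused::SubIntVarConst(a, c)),
        // `c - a` is not `a - c`: operand order must be preserved, so no fold.
        (BinOp::Sub, Const(_), Var(_)) => None,
        // Const-const pairs are left to the constant folder.
        (_, Const(_), Const(_)) => None,
    }
}
```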

…system fixture

The superinstruction fusion changes emitted bytecode (e.g. `load_var a; load_var b; add_int` → `add_int_var_var`, `load_var x; store_var y` → `move_local`, `load_const 1; add_int` → `add_int_const`), so every codegen / bytecode snapshot is regenerated to match. Behaviour is unchanged: these are the same programs, fewer dispatched ops.

Also delete the `event_system` test fixture: it exercised `baml.events.send`, which the tracing trim removed, so it no longer compiles ("unresolved name: send"). The feature is gone, so the fixture is removed rather than rewritten; its generated test disappears when build.rs regenerates from projects/. A few snapshots also drop `events.send` from builtin listings (baml_cli package listing, __baml_std__, package_items) for the same removal.

Verified: baml_tests, baml_cli, baml_compiler2_emit, bex_vm, bex_vm_types all green with no INSTA_UPDATE.

Co-Authored-By: Claude Opus 4.8 <[email protected]>

## What

Adds **53 in-BAML test blocks** to `ns_floats/floats.baml` covering the `==` / `!=` operators on floats. Regenerates the `floats` bytecode snapshot.

## Why

Float equality **already works** end-to-end:

- **Type-checker** (`infer_binary_op`): permissive; any two operands → `bool`; `int == float` widens int to f64.
- **Constant folder** (`try_fold_binary`): literal `float == float` is folded at compile time, a *separate* path from runtime.
- **VM** (`exec_cmpop` + dedicated `CmpFloatEq` opcode): IEEE 754 semantics.

…but there was **no test coverage**: `floats.baml` deliberately used epsilon checks and `operators.baml` only tested `int == int`. This locks in the behavior.

## Coverage

Cases are grouped by code path (compile-fold vs runtime, forced via `float.parse(...)`/calls) so a divergence between the folder and the VM would be caught. They follow JS/TS (IEEE 754) conventions:

- **NaN**: `NaN != NaN` true, `NaN == NaN` false, NaN vs number/inf/null, NaN propagation through arithmetic
- **Infinity**: `+inf == +inf`, `+inf != -inf`, overflow → inf, max-finite ≠ inf, `inf - inf` → NaN
- **Signed zero**: `0.0 == -0.0`, `-1.0/inf` → `-0.0`
- **int/float mixing**: `2 == 2.0`, `3.0 == 3` (both paths)
- **Precision**: `0.1 + 0.2 != 0.3`, `== 0.30000000000000004`, `1.0/3.0`, sqrt identities, subnormals (`5e-324`), large-magnitude loss
- **null**: `float == null` → false (allowed, not an error)

The one rejected pairing, `float == bigint` (compile error E0004, bigint past 2⁵³ can't round-trip f64), is unchanged and not expressible as a passing test.

## Testing

```
cargo run -p baml_cli -- test --from crates/baml_tests/baml_src -i "::float_eq_*"
# 53 passed, 0 failed
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Tests**
  * Added comprehensive test suite for float comparison operations, covering IEEE-754 semantics and edge cases: NaN behavior and self-inequality, positive/negative infinity handling, signed-zero behavior, overflow-to-infinity conversions, precision and rounding imprecision scenarios, and parsing-based test cases including scientific notation edge ranges.

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
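
Several of the IEEE 754 cases listed under Coverage can be reproduced with Rust's `f64`, which follows the same standard. This is only a cross-check of the expected semantics in another language, not the BAML tests themselves; `ieee754_cases` is a hypothetical helper, and every listed case is expected to evaluate to `true`.

```rust
/// Named IEEE 754 equality facts, each paired with whether it holds for f64.
pub fn ieee754_cases() -> Vec<(&'static str, bool)> {
    let nan = f64::NAN;
    let inf = f64::INFINITY;
    let neg_zero = -1.0 / inf; // -0.0
    vec![
        ("NaN != NaN", nan != nan),
        ("!(NaN == NaN)", !(nan == nan)),
        ("NaN != 1.0", nan != 1.0),
        ("+inf == +inf", inf == inf),
        ("+inf != -inf", inf != -inf),
        ("overflow -> inf", f64::MAX * 2.0 == inf),
        ("max-finite != inf", f64::MAX != inf),
        ("inf - inf is NaN", (inf - inf).is_nan()),
        ("0.0 == -0.0", 0.0 == -0.0),
        ("-1.0/inf is -0.0", neg_zero == 0.0 && neg_zero.is_sign_negative()),
        ("2 == 2.0 after widening", (2_i64 as f64) == 2.0),
        ("0.1 + 0.2 != 0.3", 0.1 + 0.2 != 0.3),
        ("0.1 + 0.2 == 0.30000000000000004", 0.1 + 0.2 == 0.30000000000000004),
        ("5e-324 is a positive subnormal", 5e-324 > 0.0 && !(5e-324_f64).is_normal()),
    ]
}
```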

…pus (#3631)

## What

The `baml_tests` CodSpeed runtime suite was a set of hand-written `#[divan::bench]` functions with BAML source inlined as string literals: a parallel copy of workloads that already exist in the `tools/speedtest` corpus. This replaces them with **one `vm_speedtest_*` bench auto-generated per workload** under `tools/speedtest/workloads/*.md`, so the speedtest harness and CodSpeed share a single source of truth.

## How

- **`build.rs` (`generate_speedtest_benches`)** shells out once to a new `tools/speedtest/export_baml.py`, which reuses `speedtest.loader` (including `## eval-setup` + `$$` templating) to emit each workload's *expanded* BAML as JSON. build.rs then generates one divan bench per workload, named `vm_speedtest_<slug>`, each calling the existing `bench_vm_main` helper (compile + tokio runtime built **once, outside** the measured region → only `main()` is timed).
- **Graceful degradation:** if `python3` or the corpus is unavailable at build time, it emits a `cargo:warning` and no benches rather than breaking the crate build.
- **Sleep exclusion:** workloads that call the blocking `baml.sys.sleep` are dropped at build time (matched by FQN); as walltime benches their sample time is dominated by sleeping, not VM work. A build warning names what was skipped. Currently excludes `concurrency::parallel sleep 3x200ms`.
- **All hand-written benches removed** (`vm_*`, `e2e_*`, `startup_*`, `compile_to_engine`, `engine_init_cost`) per design discussion. The 2 with no workload equivalent became new workloads, with BAML/Python/TS output cross-verified against `baml-cli`:
  - `compute/wide-nested-class-create-50k.md` (= `8754025000`)
  - `compute/mixed-ops-5k.md` (= `62499999`)
- **CI** run filter updated `vm_|engine_init` → `vm_speedtest` (the old alternatives no longer exist); build step unchanged.

## Result

**36 generated benches** (37 workloads − 1 sleep). All compile and execute cleanly (`divan --test`, exit 0). No CI build wiring change needed: `cargo codspeed build --bench runtime_benchmark` already covers them. To add or change a runtime bench going forward, edit a workload `.md`; no Rust changes required.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Tests**
  * Added new speedtest workloads (mixed arithmetic and wide nested object creation) to expand performance coverage
  * Introduced generated VM-focused benchmark cases to measure pure VM execution timing
* **Chores**
  * Updated CI benchmark configuration to run VM speedtests for more representative timing
  * Added build-time benchmark generation and a CLI export tool to produce workload test data automatically

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>

…tions

Drop the 20 emit-time fused opcodes added earlier (AddIntVar/Const, SubInt*, CmpIntLt* folds, AddInt*Store, CmpIntLt*Br{False,True}, MoveLocal) in favour of CPython's minimal, operand-movement-only superinstruction set:

- LoadVar2 (= LOAD_FAST_LOAD_FAST): push two locals in one dispatch
- StoreVar2 (= STORE_FAST_STORE_FAST): store two locals in one dispatch

(StoreVarLoadVar already covered STORE_FAST_LOAD_FAST.)

Type specialization stays where it belongs: the pre-existing AddInt/SubInt/MulInt/CmpIntOp ops are BAML's static-typing equivalent of CPython's BINARY_OP_*_INT / COMPARE_OP_INT, emitted directly with no inline caches or deopt.

Rationale: the dedicated fused ops were a combinatorial set (operation × operand-kind × fold-depth × branch-polarity) that overfit the fib loop and would explode in opcode count as more operators/types were covered. CPython deliberately keeps fusion to a tiny movement-only set and leans on specialization (which we get for free from static types) and, for the real "way faster" win, a copy-and-patch JIT, which subsumes interpreter fusion for hot code. This keeps the dispatch table small (better I-cache) and the design principled.

Bytecode snapshots regenerated accordingly (load_var2 + plain add_int/cmp_int_op/store_var).

Validated: fib correct; baml_tests, baml_cli green; clippy clean on stable 1.93.0.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
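
The movement-only set is small enough to sketch in full. The version below pairs adjacent loads and adjacent stores within one basic block as a separate pass, which is a simplification (the PR describes emit-time, in-place rewriting); `pair_moves` is a hypothetical name and the operand order of `StoreVar2` is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    LoadVar(u16),
    StoreVar(u16),
    AddInt,
    /// Push two locals in one dispatch (same effect as LoadVar a; LoadVar b).
    LoadVar2(u16, u16),
    /// Store two locals in one dispatch (same effect as StoreVar a; StoreVar b).
    StoreVar2(u16, u16),
}

/// Pair adjacent loads and adjacent stores inside a single basic block.
/// Arithmetic ops are left untouched: no operation-specific fused opcodes.
pub fn pair_moves(block: &[Op]) -> Vec<Op> {
    let mut out = Vec::with_capacity(block.len());
    let mut i = 0;
    while i < block.len() {
        let fused = match (block[i], block.get(i + 1).copied()) {
            (Op::LoadVar(a), Some(Op::LoadVar(b))) => Some(Op::LoadVar2(a, b)),
            (Op::StoreVar(a), Some(Op::StoreVar(b))) => Some(Op::StoreVar2(a, b)),
            _ => None,
        };
        match fused {
            Some(op) => {
                out.push(op);
                i += 2; // consumed two instructions
            }
            None => {
                out.push(block[i]);
                i += 1;
            }
        }
    }
    out
}
```

For `load_var a; load_var b; add_int; store_var t` this yields `load_var2 a b; add_int; store_var t`, the shape the regenerated snapshots are described as having.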

…structions

# Conflicts:
#	baml_language/crates/baml_tests/benches/runtime_benchmark.rs

Use sccache (R2-backed) for Rust **compilation** artifacts in the cargo CI jobs, configured entirely from `.envrc` so CI matches local shells.

- `tools_sccache` crate / `tools/baml-sccache` wrapper: a `RUSTC_WRAPPER` that maps `BAML_SCCACHE_R2_*` → `AWS_*` and execs sccache (native crate on Windows, shell script on POSIX).
- mise installs sccache + direnv; each cargo job loads `.envrc` via `direnv export gha` (the single source of truth for the sccache/R2 config).
- **Swatinem/rust-cache still caches the cargo registry/git download state**, with `cache-targets: false` so sccache owns `target/` and the two caches don't compete.

Fork PRs without R2 secrets fall back to the runner-local cache.

Follow-up #3624 replaces Swatinem for the download caches with a granular, Cargo.lock-driven R2 action (`cache-cargo-home`); this PR is the sccache base it stacks on.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>

…3633)

## Issue Reference

N/A (net-new, self-contained addition under `tools/baml-bench/`)

## Changes

This PR adds **baml-bench**, an event-driven pipeline that benchmarks how well a coding agent (Claude Code) uses BAML, surfaces real language/skill issues from those runs, and dispatches fixes. The entire diff is confined to `tools/baml-bench/` (88 files) and touches nothing else in the monorepo.

**The pipeline:** an inbound event (Slack mention, cron job, or bug report) creates a task; a worker runs a Claude Code agent against the task and records a "trophy" (transcript, metrics, findings); findings are classified and deduplicated into issues; approved issues are synced to Notion and dispatched to a Cursor cloud agent that opens a fix PR. A read-only dashboard shows the whole thing live.

It is built as 8 Python services + a self-hosted Convex data layer + a Next.js dashboard:

- **`bench_core`** (shared library): pydantic schemas, jsonl/prices utilities, the service/proxy/slack/notion/cursor clients, and the `Processor` claim-loop base (SSE wakeups, heartbeat, lease) that every worker builds on.
- **Convex data layer**: the schema, a generic claimable-queue lib, per-table query/mutation modules, and a reaper for stale claims.
- **`api`**: the sole Convex gateway, plus a blob store for transcripts/binaries and generic table + baml-builds routers.
- **`claude-proxy`**: runs real Claude Code sessions and parses them into transcript + metrics.
- **`baml-worker`**: task to trophy (agent run, trophy parse, repro verification).
- **`baml-dedup`**: trophy to issue (classify + dedup).
- **`baml-builder`**: tracks baml release binaries in a registry.
- **`ingress`**: public webhook gateway (slack/notion/bug, ack-first).
- **`notion-fixer`**: Notion board sync + Cursor cloud-agent fix dispatch.
- **`cron`**: daily build-refresh + task enqueuer.

Also included: Python packaging, the base Docker image + per-service Dockerfiles, a `docker-compose` local stack with `.env.example`, the unit + E2E test suites, Google-style docstrings on every function/method/class, a README, and a generated `docs/reference.md` indexing every symbol across `bench_core`, `services`, `convex`, and `ui`. Anthropic auth is API-key only.

## Testing

Please describe how you tested these changes

- [x] Unit tests added/updated
- [x] Manual testing performed
- [ ] Tested in [environment]

- Fast suite (no Docker): `cd tools/baml-bench && pytest -m "not integration"` -> 12 passed (app/health wiring, proxy session parsing, ingress routing).
- E2E suite (`@pytest.mark.integration`, self-skips without Docker): `pytest -m integration` boots a Convex backend container plus `api`/`ingress`/a stub proxy on ephemeral host ports and drives the full pipeline (task -> worker -> trophy -> dedup -> issue -> notion sync) and the ingress + fix-dispatch path end to end.
- The pipeline has been running in production (standalone on Fly), so the migrated code is exercised; this PR is the monorepo packaging of it.

## Screenshots

If applicable, add screenshots to help explain your changes

N/A (the UI is a read-only dashboard; no user-facing change to existing BAML surfaces).

## PR Checklist

Please ensure you've completed these items

- [x] I have read and followed the contributing guidelines
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings

## Additional Notes

Add any other context about the PR here

- **Isolation:** the entire diff is confined to `tools/baml-bench/`; nothing outside that path changes, so it has no effect on the rest of the monorepo.
- **CI is intentionally not in this PR.** A path-scoped Blacksmith workflow + pre-commit hooks are ready but touch shared `.github/` config outside `tools/baml-bench/`, so they will follow in a separate PR to keep this one isolated.
- **Some docs follow later.** The README and the generated API reference are included. The longer guides (architecture, data-model, configuration, local-setup, deployment, ci) are staged and will land in a follow-up once reviewed.

## Summary by CodeRabbit

* **New Features**
  * Full local benchmark stack: live dashboard (graph, tables, run/task pages), API + blob storage, claimable-queue workers, build manager, agent proxy, Slack/Notion ingress, cron-driven task enqueueing, and end-to-end agent/verification flows.
* **Documentation**
  * Complete architecture, data model, configuration guides and generated API reference.
* **Tests**
  * New unit and end-to-end integration suites with drivers and service stubs exercising ingress, proxy, and the full pipeline.
* **Chores**
  * Local dev tooling: docker-compose, env example, gitignore, Dockerfiles, and package manifests.

---------

Co-authored-by: Claude Opus 4.8 <[email protected]>

)

## What

Adds **interface-field destructuring** in `match` patterns. An interface head binds the interface's declared fields across every implementor:

```baml
function describe(a: Animal) -> string {
  match (a) {
    Animal { name } => "animal: " + name // binds `name` for any implementor
  }
}
```

`name` resolves through each implementor's field view, so it works whether the field was auto-linked (`Dog { name }`) or `as`-aliased (`Cat { name as nickname }`). Because every implementor necessarily provides the interface's declared fields, the pattern matches them all, so `Animal { name }` is exhaustive on its own (no `_` needed).

Previously `Animal { name }` was mis-lowered as a construction expression (`unresolved name: Animal`); only concrete-class destructure (`Dog { name }`) worked.

## How

- **TIR** (`baml_compiler2_tir/src/builder.rs`): `resolve_class_pattern_type` accepts interface heads; `lower_class_pat` has an interface branch that binds each field's type via `resolve_interface_member` and produces a wildcard-cover `DPat`.
- **MIR** (`baml_compiler2_mir/src/lower.rs`): `project_class_pattern_field` routes interface heads to a new `project_interface_pattern_field`, reusing the existing interface field-view dispatch (`try_lower_interface_field_access`). The MIR `Ty` has no interface variant, so the route keys off the raw `Tir2Ty::Interface`.

## Tests

- New: `match_destructures_interface_fields_directly` (interface head, auto-linked + aliased implementors) and `match_destructures_concrete_implementor_fields`.
- Refreshed the BEP-044 regression-suite comments; all pass now. The two interface-method-as-value cases (`fuzz_bug01/02`) remain `#[ignore]`d (genuinely unimplemented).
- Interfaces suite: 339 passed / 2 ignored; full `baml_tests` (30 binaries) and `cargo check --workspace` clean.

The matching BEP-044 spec update (match syntax, this feature, and other implementation-vs-draft corrections) was pushed separately to beps.boundaryml.com.

## Also included

In-flight **Python SDK / bridge / codegen** fixes that were already present on the working branch (`bridge_cffi`, `bridge_python`, `codegen_python`, `harness_setup`, `baml_cli/generate.rs`, `.pyi` stubs). Not authored as part of the interface work; bundled per request.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **High Risk**
> Changes span TIR/MIR pattern lowering and exhaustiveness (easy to get match soundness wrong) plus the Python runtime initialization path that all generated SDKs use after `baml generate`.
>
> **Overview**
> Implements **BEP-044 interface destructuring** in `match` (`Animal { name } => …`): TIR resolves interface pattern heads and lowers `DPat::interface` with field-view types; exhaustiveness gains `Ctor::Interface` and matrix specialization that maps interface field slots onto implementing class fields; MIR projects bound fields via `project_interface_pattern_field` / existing interface field dispatch.
>
> **Python codegen/runtime** now embeds **borsh-serialized bytecode** instead of inlined `.baml` source: `baml generate` compiles and calls `to_source_code_with_bytecode`; `bex_project::new_from_bytecode`, CFFI/Python `initialize_runtime_from_bytecode`, and generated `_inlinedbaml.BYTECODE` wiring. CI **size-gate** baselines for `baml-cli` are bumped slightly.
>
> Large **interface test suite** additions (compile + VM) for destructure exhaustiveness, mixed concrete/interface arms, generics, and updated regression comments (most fuzz/wf3 cases now pass; method-as-value tests still ignored).
>
> Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 3c25e0f. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).

## Summary by CodeRabbit

* **New Features**
  * Match expressions now support destructuring of interface-typed values.
  * SDKs can initialize the runtime from precompiled BAML bytecode (new runtime entrypoint and corresponding Python initializer).
* **Tests**
  * Added end-to-end tests covering interface-field destructuring, exhaustive matching, and related runtime behaviors.
* **Chores**
  * Test harness and workspace updated for bytecode support (borsh); CI size-gate baselines updated.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>

…lock (#3635)

## TL;DR

`cargo test -p baml_tests` (and `baml-cli test --from baml_src`) **hung**. It turned out to be **two unrelated bugs stacked on top of each other**, and the first one masked the second:

1. **Compile-time `O(files²)` blow-up**: the deterministic hang. Compiling the 55-file `baml_src` corpus as one project ran two whole-project passes once *per file* / *per function*. Fixed with salsa memoization. Corpus went from *never finishing* → ~30s, **byte-identical bytecode**.
2. **A runtime GC/permit deadlock in `spawn`**: once compilation was fast enough to actually reach execution, the test run started hanging ~50% of the time in the engine. Root cause was a **nested heap-permit acquire that deadlocks against stop-the-world GC**. Fixed by allocating the spawned future under the parent's existing permit.

Both were diagnosed by measurement (CPU sampling, scaling curves, forced-GC stress repro), not guesswork. Details + code references below so this is auditable.

**Commits**

1. `perf(compiler2): …`: memoize the two whole-project compile passes (Problem 1).
2. `fix(engine): …`: allocate the spawned future under the parent's permit (Problem 2).
3. `perf(mir): …`: follow-up: borrow `resolved_aliases` instead of cloning it per context (see *Follow-up* section), addressing review feedback.

---

## Problem 1: `O(files²)` compilation (commit `perf(compiler2): …`)

### Symptom

`baml-cli test --from baml_src` pinned one core at ~95% CPU and never finished on the full corpus. Not a cargo-lock deadlock, not network; pure CPU-bound compilation.
Scaling was clearly super-linear:

| files | time (before) |
|------:|--------------:|
| 1 | 4s |
| 15 | 13s |
| 27 | 31s |
| 35 | 44s |
| 55 | **never finished** (~100s+ extrapolated) |

`sample`-ing the process showed all time under `collect_diagnostics` → type inference → `ppir_expansion_items` → `collect_alias_bodies` → `lower_file`, and later under `generate_project_bytecode` → `lower_function` → `LoweringContext::new` → `populate_from_package`.

### Two quadratics

**(a) `baml_compiler2_ppir`**: `ppir_expansion_items` is a **per-file** `#[salsa::tracked]` query (`lib.rs:205`), but each invocation called `collect_block_attrs` / `collect_alias_bodies`, plain functions that iterate **every** file in the project and call `ast::lower_file` on each. So N files × re-lowering N files = **O(N²)** lowerings.

**(b) `baml_compiler2_mir`**: `LoweringContext::new` / `new_for_let` run **per function**, and each rebuilt `populate_from_package` (which lowers every class field type across all packages, `lower.rs:1190`) plus `ResolvedAliases::for_package` (which re-runs `find_recursive_aliases` over the whole project). M functions × N classes = **O(M·N)**, which is quadratic again because both counts grow with the project.

### Fix

Memoize the whole-project work behind package/project-keyed salsa queries, reusing the manual `unsafe impl salsa::Update` (via `PartialEq`) pattern already established for `PackageItems`:

- `project_expansion_maps(db, project)`: `ppir/lib.rs:165`
- `package_lowering_data(db, pkg_id)`: `mir/lower.rs:957`. `LoweringContext` now **borrows** the schema maps and `resolved_aliases` (`&'db`) instead of rebuilding/cloning them per function. (The `resolved_aliases` borrow was completed in commit 3; see *Follow-up*.)

### Result

| files | before | after |
|------:|-------:|------:|
| 35 | 44s | 20s |
| 55 | never finished | ~30s |

Scaling is now ~linear. **The bytecode snapshot test passes unchanged**: output is byte-for-byte identical, so this is a pure performance change. 1576 `baml_tests` lib tests pass.
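The shape of the quadratic and of the fix can be sketched without salsa. The following is a minimal, std-only model, not the compiler's code: `SourceFile`, `Stats`, `expand_naive`, and `expand_memoized` are hypothetical names, and "compute once, then borrow" stands in for what a project-keyed tracked query provides.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed source file (not a real BAML type).
struct SourceFile {
    name: String,
    aliases: Vec<String>,
}

/// Counts simulated `lower_file` calls so the two strategies can be compared.
#[derive(Default)]
struct Stats {
    lowerings: usize,
}

/// Whole-project pass: lowers every file and maps each alias to its defining file.
fn collect_alias_bodies(files: &[SourceFile], stats: &mut Stats) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for file in files {
        stats.lowerings += 1; // one simulated lowering per file visited
        for alias in &file.aliases {
            map.insert(alias.clone(), file.name.clone());
        }
    }
    map
}

/// Before: each per-file query re-runs the whole-project pass (N x N lowerings).
fn expand_naive(files: &[SourceFile], stats: &mut Stats) -> usize {
    let mut visible = 0;
    for _file in files {
        visible += collect_alias_bodies(files, stats).len();
    }
    visible
}

/// After: the whole-project pass runs once and every per-file query borrows it.
fn expand_memoized(files: &[SourceFile], stats: &mut Stats) -> usize {
    let project_maps = collect_alias_bodies(files, stats);
    files.iter().map(|_file| project_maps.len()).sum()
}

/// Builds a synthetic project with one alias per file.
fn make_project(n: usize) -> Vec<SourceFile> {
    (0..n)
        .map(|i| SourceFile {
            name: format!("file_{i}.baml"),
            aliases: vec![format!("Alias{i}")],
        })
        .collect()
}

fn main() {
    let files = make_project(55);
    let (mut naive, mut memo) = (Stats::default(), Stats::default());
    let a = expand_naive(&files, &mut naive);
    let b = expand_memoized(&files, &mut memo);
    assert_eq!(a, b); // same answers, very different amounts of work
    println!("naive: {} lowerings, memoized: {}", naive.lowerings, memo.lowerings);
}
```

Both strategies return the same result, which mirrors the byte-identical bytecode claim: only the amount of repeated work changes.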
---

## Problem 2: `spawn` deadlocks against GC (commit `fix(engine): …`)

This is the subtle one. Once compilation was fast, the full 1614-test run started **hanging ~50% of the time**, always with the *same* shape: the tokio runtime driver parked in `block_on` and **every worker thread idle/parked**, i.e. a lost wakeup, not a CPU spin.

### The BAML test that triggered it

`crates/baml_tests/baml_src/ns_cancel_cascade/cancel_cascade.baml`:

```baml
function cancelled_child_future_state_is_cancelled() -> baml.future.FutureState {
  let slow = spawn { baml.sys.sleep(60000); 42 };  // task S: sleeps 60s
  let waiter = spawn { await slow };               // task W: awaits S
  let _ = waiter.cancel();                         // cancel W (not S)
  waiter.state()
}
```

It passed **5/5 in isolation** but hung ~50% in the full run, the classic fingerprint of a *concurrency* bug that needs accumulated load, not a logic bug in the test. Two facts narrowed it down:

- All threads parked ⇒ a lost wakeup in async machinery, not a synchronous lock.
- It correlated with **garbage collection**: forcing GC to run on *every* allocation (temporarily setting the Gen0 threshold `10_000 → 1` in `bex_heap`) turned the flake into a **100% reproducible** hang on a tiny repro. That was the key to pinning it.

### Background: the heap-permit model

The engine coordinates GC with a `HeapPermitManager` backed by a **single tokio `Semaphore`** (`bex_heap/src/heap_guard.rs`):

- Each running VM mutator holds **one** `ActiveHeapPermit` (one semaphore permit).
- Stop-the-world GC parks everything by draining the **entire** semaphore at once:

```rust
// bex_heap/src/heap_guard.rs:227
pub async fn request_park(&self) -> HeapGuard<'_> {
    let permits = self.active
        .acquire_many(MAX_PERMITS)   // <-- wants ALL permits; completes only when
        .await                       //     every ActiveHeapPermit has been released
    ...
}
```

The crucial property: **tokio's `Semaphore` is fair (FIFO)**.
Once `acquire_many(MAX_PERMITS)` is queued, any later `acquire()` (even for 1 permit) queues **behind** it and cannot be granted until the big request is satisfied and released.

### The bug

`spawn` allocated the child's heap `Future` by taking a **second, fresh** permit *while the parent task that issued the `spawn` still held its own permit*. The parent awaits `spawn_thread` inline, so both permits live on the same logical flow:

```rust
// OLD: bex_engine/src/lib.rs, spawn_thread_setup (deleted in this PR)
let permit = self.heap_permit_manager.new_permit(()).await;
let permit = permit.acquire().await;          // <-- 2nd permit, while parent still holds its 1st
let (future_id, future_ptr) = {
    let mut guard = self.futures.acquire(permit.proof()).await;
    guard.new_future(child_cancel.clone())    // allocate the child Future
};
drop(permit);
```

Now interleave a GC park (which, under real workloads, fires whenever heap pressure crosses the threshold; hence the flakiness, and 100% under forced GC):

```mermaid
sequenceDiagram
    participant P as Parent task (holds permit P_main)
    participant G as GC (request_park)
    participant S as Semaphore (fair FIFO)
    P->>P: executing spawn { ... }
    Note over G,S: heap pressure → GC starts
    G->>S: acquire_many(MAX_PERMITS)
    S-->>G: queued - waits for ALL permits (P_main still held by Parent)
    P->>S: acquire() for the child-future permit
    S-->>P: queued BEHIND GC (fair) - blocked
    Note over P,G: 🔒 deadlock cycle
    Note right of P: Parent won't release P_main until spawn returns
    Note right of P: spawn can't return until it gets the 2nd permit
    Note right of G: GC can't grant the 2nd permit until it finishes, which needs P_main
```

So: **GC waits for the parent's permit → the parent waits (fairly, behind GC) for a second permit → the parent won't release the first until it gets the second.** Cycle. All tasks suspend; every worker parks. The 60s `sleep` is a red herring: the hang is immediate (it reproduced with `PASS=0`).
### The fix

There is no reason to take a *new* permit to allocate the child future: the parent is **already** holding an active permit at the `spawn` site. Allocate the future there, under the parent's permit, and hand the `future_id` to `spawn_thread`:

```rust
// NEW: bex_engine/src/lib.rs, VmExecState::Spawn dispatch (~lib.rs:2462)
let child_cancel = cancel.child_token();
let future_ptr = {
    let mut guard = self.futures.acquire(thread.proof()).await; // parent's permit
    let (future_id, future_ptr) = guard.new_future(child_cancel.clone());
    drop(guard);                                                 // dropped before the await below
    Arc::clone(self)
        .spawn_thread(child_cancel, parent_errors_arc, closure, spawn_name, call_id, future_id)
        .await?;
    future_ptr
};
thread.vm.stack.push(Value::object(future_ptr));
```

`spawn_thread` now only builds the child VM and **registers** the child's permit via `new_permit` (which takes the holders mutex but does **not** acquire a semaphore permit), then fires the task. The child's permit is acquired later, on the spawned task, never nested under the parent. **One permit per task on the spawn path**, so the deadlock cycle cannot form. `spawn_thread_setup` is deleted.

Why this is safe to audit:

- `new_future` is synchronous and only needs a `PermitProof` to prove GC isn't running; the parent's `thread.proof()` satisfies that just as well as a fresh permit did.
- The `FutureManagerGuard` is dropped **before** the `spawn_thread().await`, so no non-`Send` guard crosses a yield point (same pattern already used elsewhere in `run_thread_event_loop`).
- `new_permit` only contends on the holders mutex, and `request_park` is explicitly ordered (semaphore first, then holders mutex) to not deadlock against it.
- The parent holds its permit across the whole dispatch, so GC cannot move `future_ptr` during the await.
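The cycle and the one-permit-per-task fix can be modeled deterministically, with no async runtime. The sketch below is a toy FIFO semaphore written for illustration only: it is not tokio's `Semaphore`, `MAX_PERMITS = 4` is an arbitrary value, and the assumption that the parent gives back its permit at its next yield point is mine rather than something the PR states.

```rust
use std::collections::VecDeque;

const MAX_PERMITS: usize = 4; // hypothetical value for the model

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Waiter {
    Gc,
    ChildFuture,
}

/// Toy model of a fair (FIFO) counting semaphore; no async, no threads.
struct FairSemaphore {
    available: usize,
    queue: VecDeque<(Waiter, usize)>,
}

impl FairSemaphore {
    fn new(permits: usize) -> Self {
        Self { available: permits, queue: VecDeque::new() }
    }

    /// Grants immediately only if nobody is queued ahead; otherwise enqueues.
    fn acquire(&mut self, who: Waiter, n: usize) -> bool {
        if self.queue.is_empty() && self.available >= n {
            self.available -= n;
            true
        } else {
            self.queue.push_back((who, n));
            false
        }
    }

    /// Returns permits, then serves the queue strictly front to back.
    fn release(&mut self, n: usize) -> Vec<Waiter> {
        self.available += n;
        let mut granted = Vec::new();
        while let Some(&(who, need)) = self.queue.front() {
            if self.available < need {
                break; // head of the queue blocks everyone behind it
            }
            self.available -= need;
            self.queue.pop_front();
            granted.push(who);
        }
        granted
    }
}

/// Old flow: the parent asks for a second permit while still holding its first.
/// Returns true if the model ends with the parent stuck behind the GC request.
fn old_flow_deadlocks() -> bool {
    let mut sem = FairSemaphore::new(MAX_PERMITS);
    sem.available -= 1; // parent already holds P_main
    let gc_granted = sem.acquire(Waiter::Gc, MAX_PERMITS); // needs P_main back
    let child_granted = sem.acquire(Waiter::ChildFuture, 1); // queues behind GC
    // The parent only releases P_main after the child permit is granted,
    // so with both requests pending nothing can ever make progress.
    !gc_granted && !child_granted
}

/// New flow: the future is allocated under P_main, so no second acquire exists.
/// Assumes the parent releases P_main at its next yield point.
fn new_flow_grants() -> Vec<Waiter> {
    let mut sem = FairSemaphore::new(MAX_PERMITS);
    sem.available -= 1; // parent holds P_main
    let gc_granted = sem.acquire(Waiter::Gc, MAX_PERMITS);
    assert!(!gc_granted); // GC still has to wait for the parent
    sem.release(1) // parent yields; GC is the only waiter
}

fn main() {
    println!("old flow deadlocks: {}", old_flow_deadlocks());
    println!("new flow grants: {:?}", new_flow_grants());
}
```

The model makes the role of fairness explicit: three permits are free when the child request arrives, yet it still cannot be served because the GC request sits at the head of the queue.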
### Verification

The bug is flaky, so I verified with the forced-GC stress harness (turns the race into a deterministic signal) and then with the real threshold:

| scenario | before fix | after fix |
|---|---|---|
| minimal `spawn/cancel` repro, **GC forced every alloc** | 6/6 hang | **0/10 hang** |
| full 1614-test corpus, real GC threshold | ~50% hang | **0/8 hang** (1614 passed, 0 failed each) |

(The temporary `gc.rs` threshold change was only a debugging aid and is **not** part of this PR.)

---

## Follow-up: borrow `resolved_aliases`, and what's *not* worth optimizing (commit `perf(mir): …`)

Review feedback flagged two remaining per-`LoweringContext` (i.e. per-function) costs. I profiled a full `--list` compile of the corpus before touching either, and the result decided each:

> The whole MIR lowering path (`package_lowering_data` / `LoweringContext::new` / `lower_function`) is **~11 of ~3600 samples**. Compile time is dominated by TIR `infer_scope_types` (`render_scope_diagnostics → infer_scope_types`). So neither of these is a measurable cost on the current corpus.

**Fixed: `resolved_aliases` cloned per context → borrowed.** The other five package-invariant schema maps were already borrowed (`&'db`) from `package_lowering_data`; `resolved_aliases` (a `HashMap` + `HashSet`) was the one I'd left as a per-context clone for expedience. Now it's borrowed too. Whole-struct passes (`&self.resolved_aliases`) became `self.resolved_aliases` (the field is already a reference); `.aliases` sub-field accesses and `.convert(...)` calls are unchanged (they auto-deref through the borrow). Not a measurable speedup here, but it's a strict reduction in per-context allocation, it completes the borrow-not-clone design, and it removes a cost that is asymptotically `O(contexts × aliases)` for alias-heavy projects.

**Skipped (measured non-issue): making `build_class_type_tags` a tracked query.** It showed up as **0 samples**.
Unlike `populate_from_package`, it does no type lowering, just cached `file_item_tree` reads and `TypeName→i64` inserts. Memoizing it would add a wrapper + `Update` impl + a project-keyed query, and it's **bytecode-affecting** (it assigns the global type-tag numbering that must match the emitter), so it's risk for no measurable benefit. Worth revisiting only if a profile on a larger/real project shows it mattering.

---

## Test status

- `bex_engine`, `bex_vm` unit tests: pass
- `cancel_cascade`, `spawn_array_race`, `spawn_parallel`, `spawn_semantics`, `spawn_specialization` integration tests: pass
- `baml_tests --lib` (1576 tests): pass
- bytecode snapshot (`baml_tests --test baml_src`): pass, **unchanged**
- `cargo fmt` + `cargo clippy -D warnings`: clean (enforced by pre-commit)

## Reviewer notes / blast radius

The engine change is on the **hot `spawn` path** and touches core GC/permit concurrency, so it deserves a careful read despite the small diff. The repro is flaky by nature; I'd suggest a CI loop running the corpus a handful of times to build confidence. The compiler change is performance-only and guarded by the byte-identical bytecode snapshot.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Refactor**
  * Memoized package-level schema and type-alias data to eliminate redundant per-function recomputation and speed up compilation.
  * Centralized project-wide expansion maps for consistent, more efficient expansion across files.
  * Adjusted spawn/allocation flow so child futures are allocated at spawn sites, improving heap-permit handling and runtime stability.
* **Chores**
  * Updated size-gate thresholds and CI artifact size metadata.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>

…anch`, non-blocking gate (#3637)

## Why

The size-gate baseline was a stale, hand-recorded local snapshot that only moved when someone remembered to re-run it. Two consequences the team has hit repeatedly:

1. **Drift**: real CI sizes had crept to **+2.9%** under the 3% gate (linux `baml-cli` was ~26 KB from tripping), so even near-no-op PRs trip it. Several ceilings sat ~0% above baseline instead of the intended ~3%.
2. **Broken fix instructions**: the hint pointed at `size-gate record`, which rebuilds **locally** (sizes differ from CI) and only covers the host platform, so following it verbatim can't green CI.

## What changed

**New `cargo-size-gate bake`: adopt CI-measured sizes, no rebuild**

- `bake --branch canary` (new `fetch.rs`) shells to `gh`, finds the newest *completed* CI run on the branch with all four `size-gate-*` reports (ignoring run conclusion and size violations; canary CI is usually red for unrelated reasons), downloads them, writes `.ci/size-gate/<platform>.toml`, and re-pegs the `max_*_bytes` ceilings in `.cargo/size-gate.toml` (comments preserved via `toml_edit`). Idempotent: no size change → no write → no PR.
- Also `bake <files...>` for explicit reports, and `--repo/--download-dir/--summary-out`.
- Exposed as `mise run size-gate-update`.

**New workflow `size-gate-baseline-refresh.yml`: daily refresh**

- Daily (+ manual dispatch). Calls the **same** `bake --branch canary` (no duplicated run-selection), opens PR `chore/size-gate-baseline-refresh` via PAT so its own CI runs, and enables squash **auto-merge** so it self-merges once required checks pass. Reuses the artifacts canary CI already produced: **no rebuilds** (avoids re-paying the mac/windows release builds that are CI's long pole).

**Fixed the fix-hint** to point at `bake --branch canary` instead of the broken `record` flow.

**Made size-gate non-blocking**: removed from `ci-failure-alert.needs`. A size bump no longer blocks the merge queue. It's a signal (PR comment + daily refresh), not a gate.

**Re-pegged baselines + ceilings to current canary (`ef7c326`)** so the gate is accurate on merge.

## Prerequisites for the daily job

- Repo setting **"Allow auto-merge"** enabled.
- `secrets.SAM_GITHUB_BOUNDARYML_READWRITE` (the PAT `oncall.yml` uses) available to the workflow.
- Scheduled workflows fire from the **default branch**'s copy of the file; merge there for the cron to run (the job checks out canary explicitly regardless).

## Deferred / follow-ups (not in this PR)

- Move size-gate to **post-merge-only** (run on canary push, not PRs/merge-queue) so it stops running on PRs entirely while still feeding the nightly. Kept local for now.
- **Trend graph** (the git history of `.ci/size-gate/*.toml` is a ready-made byte-exact time series). CodSpeed has no custom-metric ingest, so this won't ride that rail cleanly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Bug Fixes**
  * Made size-gate checks informational in CI; they no longer block merges.
* **New Features**
  * Added automated daily baseline refresh workflow for size-gate metrics.
  * Added new `bake` command to update baseline values from CI measurements.
* **Chores**
  * Updated size-gate baselines for multiple platforms with new measurements.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
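The drift problem in this change comes down to simple arithmetic on a ceiling pegged a fixed percentage above a baseline. A minimal sketch follows, assuming the ceiling is `baseline × 1.03` rounded down; the actual `cargo-size-gate` formula is not shown in the source, and the byte counts are made-up numbers chosen only to illustrate a +2.9% drift.

```rust
/// Ceiling pegged `headroom_percent` above a baseline size (integer math, rounds down).
fn ceiling_bytes(baseline: u64, headroom_percent: u64) -> u64 {
    baseline + baseline * headroom_percent / 100
}

/// Bytes a binary may still grow before it trips the ceiling (0 if already over).
fn headroom_bytes(measured: u64, ceiling: u64) -> u64 {
    ceiling.saturating_sub(measured)
}

fn main() {
    // Made-up numbers: a stale baseline and a CI build that drifted +2.9% above it.
    let stale_baseline: u64 = 26_000_000;
    let ci_measured: u64 = 26_754_000;

    let stale_ceiling = ceiling_bytes(stale_baseline, 3);
    println!("stale headroom: {} bytes", headroom_bytes(ci_measured, stale_ceiling));

    // Re-pegging the baseline to the CI-measured size restores the full ~3% margin.
    let fresh_ceiling = ceiling_bytes(ci_measured, 3);
    println!("fresh headroom: {} bytes", headroom_bytes(ci_measured, fresh_ceiling));
}
```

With a stale baseline almost all of the margin is consumed by accumulated drift, so any small PR can trip the gate; re-baking from CI-measured sizes resets the margin without rebuilding anything.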

Adds salsa caching to three functions, resulting in a ~60% speedup on the `baml_tests/baml_cli` compilation time.

## Summary by CodeRabbit

* **New Features**
  * Added compiler benchmarking suite to measure BAML compilation performance.
  * Added flamegraph profiling tool for analyzing compiler performance bottlenecks.
* **Chores**
  * Optimized internal compilation caching for improved performance.
  * Updated build documentation.

## Summary

Fixes 17 bugs in BEP-044 interface **unions**, found by a union-focused fuzz of the language. Each was pinned as a failing `union_fuzz_fNN_*` test in `crates/baml_tests/tests/interfaces.rs` and is now green. A class-only union method-dispatch case is also pinned as a regression guard.

Two root causes dominated:

- **`unknown | unknown is not a function`**: a method present on every member of a union containing an interface was rejected because the union-member probe resolved interface members to the `Ty::Unknown` sentinel.
- **VM crashes (`expected map, got instance`, `tagged_int_add`)**: reading a field / narrowing a generic-interface match arm on a union dispatched incorrectly at runtime.

## Fixes

| ID | Severity | Fix |
|----|----------|-----|
| F7/F12 | spurious-error / diagnostic | TIR resolves interface union members through the real interface machinery (side effects suppressed); MIR dispatches the call on the runtime class across all members' implementors. No more `unknown \| unknown`. |
| F1/F3/F11 | crash / soundness | MIR dispatches a union field read on the runtime class. Conflicting field types across union interfaces read soundly as `T \| U` (misuse → E0001); a genuinely ambiguous field view → E0131. |
| F2/F5 | crash / wrong-result | A generic-interface match arm (`Slot<int>`) respects its type argument at runtime instead of matching every implementor of the bare interface. |
| F4 | crash | `string + <non-object primitive>` (int/float/bigint/bool/null) is a type error, not an inferred `string` that aborts the VM. `string + uint8array` stays valid. |
| F6 | wrong-result | Reflection compares generic union args as unordered sets (`Box<int\|string>` == `Box<string\|int>`). |
| F8 | spurious-error | A bounded type variable is a subtype of itself (`<T extends I>(a: T) -> T { return a }` compiles). |
| F9 | spurious-error | A match arm overlapping any member of an optional/union scrutinee is accepted (`let a: Animal` over `(Dog \| Cat)?`). |
| F10/F15 | spurious-error / diagnostic | Out-of-body `implements I for <primitive>` resolves for every primitive (not just `int`); union method-not-found blames only the genuinely-lacking member. |
| F13/F14/F16/F17 | diagnostic | No `user.` package-prefix leak in match witnesses; uncovered interface members named instead of `_`; ambiguous union field → E0131; projection errors keep generic args (`Cargo<int>`). |

Snapshot + LSP expectation updates reflect the corrected diagnostics (no `user.` leak; `string + int` now E0004).

## Test plan

- `cargo test -p baml_tests`: **2058 passed, 0 failed**
- `cargo test -p baml_lsp2_actions_tests`: **369 passed, 0 failed**
- `cargo clippy --all-targets -- -D warnings`: clean
- `cargo fmt`: clean

Merged latest `canary` (salsa-memoization hang fix); the one conflict in `lower.rs` was resolved by keeping the generic-interface pattern routing on top of canary's borrowed-`resolved_aliases` API.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **High Risk**
> Touches MIR lowering, TIR inference, VM reflection, and pattern/match codegen for interfaces and unions, areas that previously caused VM crashes and silent wrong dispatch; regressions would affect runtime behavior and type soundness.
>
> **Overview**
> Fixes **BEP-044 interface unions** end-to-end: TIR no longer collapses interface arms to `unknown | unknown` for callable unions; MIR adds runtime class-tag dispatch for methods and fields when the receiver is a union (including optional-wrapped unions and interface members), including inherited defaults and interface field views. **Generic interface** `is`/match patterns and fast type-tag switches now respect type arguments so `Slot<int>` cannot capture `Slot<string>` values.
>
> TIR also gains union-member resolution with diagnostic rollback, E0121/E0131 for ambiguous shared implementors, out-of-body primitive members, bounded-generic `T <: T`, match-arm overlap over optional unions, and stricter `string +` rules for non-object primitives. VM reflection compares generic args with order-insensitive union equivalence. Diagnostics and exhaustiveness witnesses drop `user.` prefixes and show user-facing type names. Large `union_fuzz_fNN_*` regression tests plus snapshot/LSP updates; `.gitignore` adds `workflow_scratch_files/`.
>
> Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit b535f94. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).

## Summary by CodeRabbit

* **Bug Fixes**
  * Fixed runtime dispatch and field reads for unions that contain interface members; tightened generic-interface runtime matching and narrowed type-match behavior.
  * Reduced spurious diagnostics during union-call inference; improved subtype/pattern-overlap checking and rejected invalid string+int concatenation.
  * Made missing-case diagnostics and witness rendering use user-facing type names and avoid leaking internal prefixes.
* **Tests**
  * Added a large regression suite covering union/interface dispatch, generic-interface scenarios, diagnostics, and runtime behaviors.
* **Chores**
  * Ignored local workflow scratch files.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
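The F6 fix (comparing generic union arguments as unordered sets) can be illustrated with a small normalization pass. This is a sketch under my own assumptions, not the VM's reflection code: the `Ty` enum, `normalize`, and `equivalent` are hypothetical, and sorting members into a `BTreeSet` is just one way to get order-insensitive equality.

```rust
use std::collections::BTreeSet;

/// Hypothetical, tiny type representation used only for this sketch.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Ty {
    Int,
    String,
    Union(Vec<Ty>),
    Generic(std::string::String, Vec<Ty>), // e.g. Box<...>
}

/// Rewrites a type so union members appear in one canonical (sorted, deduplicated) order.
fn normalize(ty: &Ty) -> Ty {
    match ty {
        Ty::Union(members) => {
            let set: BTreeSet<Ty> = members.iter().map(normalize).collect();
            Ty::Union(set.into_iter().collect())
        }
        Ty::Generic(name, args) => {
            Ty::Generic(name.clone(), args.iter().map(normalize).collect())
        }
        other => other.clone(),
    }
}

/// Order-insensitive equality: unions compare as sets, generic args positionally.
fn equivalent(a: &Ty, b: &Ty) -> bool {
    normalize(a) == normalize(b)
}

/// Convenience constructor for `Box<arg>`.
fn boxed(arg: Ty) -> Ty {
    Ty::Generic("Box".to_string(), vec![arg])
}

fn main() {
    let a = boxed(Ty::Union(vec![Ty::Int, Ty::String]));
    let b = boxed(Ty::Union(vec![Ty::String, Ty::Int]));
    println!("structurally equal: {}", a == b);
    println!("equivalent: {}", equivalent(&a, &b));
}
```

Generic arguments stay positional (`Box<int>` and `Box<string>` remain distinct); only the member order inside a union is ignored.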

## Summary by CodeRabbit

* **New Features**
  * Added a lightweight Bedrock request serializer and media support for images, video, documents, and audio.
  * Introduced local credential/token resolution modules for AWS and Google Cloud with pluggable IO adapters.
* **Improvements**
  * Streamlined Bedrock and Vertex AI authentication flows and request signing for more reliable credential discovery.
  * Simplified credential-provider precedence and expanded test coverage for credential resolution.

## What & why

`baml describe <Class>` and LSP hover treated a class's methods as incidental text living inside the raw class body. Under a line budget those methods got chopped out and replaced with `[... skipped N lines ...]`, and LSP hover rendered a class with methods as a bare `class String {}`; methods were effectively **invisible for discovery**.

This PR makes methods first-class, always-visible output, the same treatment fields already get.

## End-user impact

- **Methods are always discoverable.** Every method of a class now appears in `baml describe` with its first-line docstring, full `function` signature, and definition line range, in dedicated `methods:` / `static_methods:` sections. These are **never truncated**: `baml describe User --budget 5` shows the same methods as the full output. Method *bodies* only appear when you drill into a specific method.
- **Class body shows fields only.** The body block is canonical BAML (`name: type,`) containing just the fields; methods render below. A fields-only body fits any reasonable budget, so it stops triggering truncation.
- **`baml describe string` works.** Lowercase primitive/keyword aliases (`string`, `int`, `bigint`, `float`, `bool`, `null`, `uint8array`, `image`, `audio`, `video`, `pdf`, `json`) resolve to their builtin classes, alongside the existing canonical (`root.ns.Foo`) and package (`baml.json.json`) forms.
- **Consistent, canonical type printing** across describe + hover + signatures: builtins collapse to their alias (`baml.String` → `string`, `baml.json.json` → `json`), user types read `root.ns.Foo`, lists/maps as `T[]` / `map<K, V>`. Headers show the canonical FQN in parens when it differs from the bare name (e.g. `class String (string)`, `class Config (root.llm.Config)`).
- **Minimal, useful hover.** Hover shows the class docstring + field shape, plus a one-line `Run \`baml describe <FQN>\`` hint **only when the class has methods**.

### Before / after (`baml describe string`)

Before: a multi-hundred-line raw class body, truncated mid-way with `[... skipped 448 lines ...]`; methods unusable.

After:

```
class String (string)
<builtin>/baml/string.baml:5-475

/// A UTF-8 encoded string.
/// ...
class String {}

methods:
  /// Serializes this string to a JSON value.
  function to_json(self) -> json
  <builtin>/baml/string.baml:8-10

  /// Returns the length of the string in UTF-8 bytes.
  function length(self) -> int
  <builtin>/baml/string.baml:25-27
  ...

static_methods:
  function from_code_points(unicode: int[]) -> string throws root.errors.InvalidArgument
  <builtin>/baml/string.baml:472-474

references (0):
```

## Implementation

- `MethodRef` (describe) and `MethodSig` (type_info) carry signature + first-line docstring + full range. Signatures resolve via the package interface: auto-derived methods are skipped, `self` is shown bare, and `throws never` is omitted.
- One canonical printer: `QualifiedTypeName::builtin_alias` + `display_ty_canonical_for_file`. It's opted into **only** by the describe/hover/signature paths; diagnostics, completions, and inlay hints keep their existing spelling (so other call sites can adopt it mechanically later).
- `references` now exclude a symbol's own definition span, so a class is no longer listed as a reference of itself via its own method bodies.
- Codegen (`cg::Class`) and the `truncate_body()` algorithm are untouched, per the design.

## Testing

783 tests green (TIR 136, LSP 114, CLI 164, integration 369). New coverage: TIR alias round-trip (incl. the `json` special case); CLI fixtures for instance-only, mixed instance+static, and generic classes, alias resolution, and a tight-budget never-truncate case; LSP hover tests for the describe hint. Existing describe/hover snapshots updated for the new format.

## Known follow-up

Types referenced **only** in a method signature (e.g. `WrapperMarker` in `-> T | WrapperMarker`) are not yet surfaced under `dependencies:`. The class-dependency path matches the canonical `pkg.Name` string against short outline names, a pre-existing gap with no current test coverage. Fixing it (and adding method-signature deps on top) was deferred to avoid regressing builtin output; method *listing* itself is unaffected.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **New Features**
  * Describe output now shows class methods (instance and static) with full one-line signatures, per-method docstrings, and definition line ranges; canonical fully-qualified names appended when different.
* **Improvements**
  * Lowercase primitives (e.g., string, int, json, image) resolve as aliases in describe/dispatch and canonical type rendering.
  * CLI JSON includes richer per-method metadata.
  * Hovers include class docstrings and a "baml describe" hint when methods exist; method sections are never truncated.
  * Non-doc single-line comments are stripped from rendered bodies.
* **Tests**
  * Expanded fixtures/snapshots covering methods, alias dispatch, hover output, comment handling, and truncation guarantees.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
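The canonical printing rules described in this change (builtins collapse to their alias, lists as `T[]`, maps as `map<K, V>`) can be sketched as a small recursive printer. The `Ty` enum and both functions below are hypothetical stand-ins, not the real `QualifiedTypeName::builtin_alias` or `display_ty_canonical_for_file`, and only the two alias mappings quoted in the description are included.

```rust
/// Hypothetical type representation; names are fully qualified strings.
enum Ty {
    Named(String),
    List(Box<Ty>),
    Map(Box<Ty>, Box<Ty>),
}

/// Maps a builtin's fully-qualified name to its lowercase alias, if it has one.
/// Only the two mappings quoted in the PR description are listed here.
fn builtin_alias(fqn: &str) -> Option<&'static str> {
    match fqn {
        "baml.String" => Some("string"),
        "baml.json.json" => Some("json"),
        _ => None,
    }
}

/// Prints a type in canonical form: aliases for builtins, FQNs for user types.
fn display_canonical(ty: &Ty) -> String {
    match ty {
        Ty::Named(fqn) => builtin_alias(fqn)
            .map(str::to_string)
            .unwrap_or_else(|| fqn.clone()),
        Ty::List(inner) => format!("{}[]", display_canonical(inner)),
        Ty::Map(key, value) => {
            format!("map<{}, {}>", display_canonical(key), display_canonical(value))
        }
    }
}

fn main() {
    let ty = Ty::Map(
        Box::new(Ty::Named("baml.String".to_string())),
        Box::new(Ty::List(Box::new(Ty::Named("root.ns.Foo".to_string())))),
    );
    println!("{}", display_canonical(&ty));
}
```

Keeping the alias lookup in one function is what lets describe, hover, and signatures agree on spelling while other call sites keep their existing output.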

…structions # Conflicts: # baml_language/Cargo.lock # baml_language/crates/bridge_cffi/src/lib.rs

…ac targets) The non-macos/aarch64 kperf shim had `#[inline(always)]` on its no-op enabled/exec_start/exec_end fns, tripping clippy::inline_always under `-D warnings` when compiled for other targets (e.g. CI's cross-target check). Host clippy never compiles the shim, so it slipped through. `#[inline]` is plenty for trivial no-ops. Verified clean via clippy --target x86_64-unknown-linux-musl. Co-Authored-By: Claude Opus 4.8 <[email protected]>
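A target-gated shim of the kind this commit describes might look like the sketch below. It is an illustration under assumptions: the module layout and `run_with_probe` are hypothetical, the Apple-Silicon branch is a placeholder rather than the real PMC probe, and only the function names `enabled`/`exec_start`/`exec_end` and the `BAML_KPERF=1` gate come from the PR text.

```rust
/// Placeholder for the real backend; only compiled on Apple Silicon macOS.
#[cfg(all(target_os = "macos", target_arch = "aarch64"))]
mod kperf {
    /// Env-gated, per the PR description (`BAML_KPERF=1`).
    pub fn enabled() -> bool {
        std::env::var("BAML_KPERF").map(|v| v == "1").unwrap_or(false)
    }
    pub fn exec_start() { /* counter sampling would start here (omitted) */ }
    pub fn exec_end() { /* counter sampling would stop here (omitted) */ }
}

/// Shim for every other target: trivial no-ops with plain `#[inline]`,
/// which avoids the `clippy::inline_always` lint that `#[inline(always)]` triggers.
#[cfg(not(all(target_os = "macos", target_arch = "aarch64")))]
mod kperf {
    #[inline]
    pub fn enabled() -> bool {
        false
    }
    #[inline]
    pub fn exec_start() {}
    #[inline]
    pub fn exec_end() {}
}

/// Hypothetical call site: brackets `work` with the probe only when it is enabled.
fn run_with_probe<T>(work: impl FnOnce() -> T) -> T {
    if kperf::enabled() {
        kperf::exec_start();
        let out = work();
        kperf::exec_end();
        out
    } else {
        work()
    }
}

fn main() {
    let total = run_with_probe(|| (1..=10u64).sum::<u64>());
    println!("{total}");
}
```

Because host clippy only compiles the branch matching the host target, linting the shim needs an explicit cross-target run, which is what the commit message reports doing.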

vercel Bot deployed to Preview – beps June 1, 2026 17:50 View deployment

vercel Bot deployed to Preview – promptfiddle2 June 1, 2026 18:07 View deployment

vercel Bot deployed to Preview – promptfiddle June 1, 2026 18:08 View deployment

hellovai and others added 9 commits June 1, 2026 14:54

hellovai force-pushed the hellovai/vm-superinstructions branch from 202a8ee to 143bd33 Compare June 1, 2026 22:37

vercel Bot deployed to Preview – beps June 1, 2026 22:38 View deployment

vercel Bot deployed to Preview – promptfiddle June 1, 2026 22:55 View deployment

vercel Bot deployed to Preview – promptfiddle2 June 1, 2026 22:56 View deployment

hellovai and others added 4 commits June 1, 2026 23:17

Merge remote-tracking branch 'origin/canary' into hellovai/vm-superin…

6a446df

…structions # Conflicts: # baml_language/crates/baml_tests/benches/runtime_benchmark.rs

vercel Bot deployed to Preview – beps June 2, 2026 03:00 View deployment

vercel Bot deployed to Preview – promptfiddle2 June 2, 2026 03:17 View deployment

vercel Bot deployed to Preview – promptfiddle June 2, 2026 03:17 View deployment

sxlijin and others added 3 commits June 2, 2026 04:06

hellovai and others added 7 commits June 2, 2026 09:10

Merge remote-tracking branch 'origin/canary' into hellovai/vm-superin…

a27a26b

…structions # Conflicts: # baml_language/Cargo.lock # baml_language/crates/bridge_cffi/src/lib.rs

vercel Bot deployed to Preview – beps June 2, 2026 21:11 View deployment

vercel Bot deployed to Preview – promptfiddle2 June 2, 2026 21:16 View deployment

vercel Bot deployed to Preview – promptfiddle June 2, 2026 21:28 View deployment

vercel Bot deployed to Preview – beps June 2, 2026 21:53 View deployment

hellovai merged commit 99f624d into hellovai/trim-events Jun 2, 2026
31 of 34 checks passed

hellovai deleted the hellovai/vm-superinstructions branch June 2, 2026 21:54

vercel Bot deployed to Preview – promptfiddle2 June 2, 2026 21:59 View deployment

hellovai mentioned this pull request Jun 2, 2026

refactor(tracing): trim the event-stream surface for a fresh start #3616

Draft

vercel Bot deployed to Preview – promptfiddle June 2, 2026 22:11 View deployment


hellovai merged 24 commits into hellovai/trim-events from hellovai/vm-superinstructions


5 participants


hellovai commented Jun 1, 2026

Results (Apple M2 Max, scripts/speedtest run)

What changed (each commit validated + clippy-clean)

Key insight

Validation

Known follow-ups (diagnosed, not in this PR)

vercel Bot commented Jun 1, 2026 • edited

coderabbitai Bot commented Jun 1, 2026 • edited

Review skipped

github-actions Bot commented Jun 1, 2026 • edited

Binary size checks passed

codspeed-hq Bot commented Jun 1, 2026 • edited

Merging this PR will improve performance by 60.45%
