perf(vm): superinstruction fusion - bytecode interpreter now faster than CPython #3627
Binary size checks passed✅ 7 passed
Merging this PR will improve performance by 60.45%
Performance Changes
…d split
Prep for the threaded-dispatch work:
- Add crates/baml_tests/tests/dispatch.rs: inline-snapshot of how direct method
calls and interface (polymorphic) dispatch lower to bytecode. Locks in that a
direct `v.norm2()` is a plain static `call` (no make_bound_method / no per-call
allocation) and that interface dispatch is an `is_type` chain + static call.
- Split store_local_value into an `#[inline(always)]` fast path (direct stack
write) and a `#[cold] #[inline(never)]` watch handler, so the hot store path
stays inline and the rarely-active watch bookkeeping is out of line.
(Measured ~0% on its own; it's structural prep for `become`.)
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
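The fast-path / cold-path split described in this commit can be sketched as follows. This is a minimal illustration only: the `Locals` struct, its fields, and `store_watched` are hypothetical stand-ins, not the actual `bex_vm` types.

```rust
/// Hypothetical stand-in for a frame's local slots plus optional watches.
pub struct Locals {
    slots: Vec<i64>,
    watched: Vec<bool>,
    any_watch: bool,
    pub watch_hits: usize,
}

impl Locals {
    pub fn new(n: usize) -> Self {
        Locals { slots: vec![0; n], watched: vec![false; n], any_watch: false, watch_hits: 0 }
    }

    pub fn watch(&mut self, idx: usize) {
        self.watched[idx] = true;
        self.any_watch = true;
    }

    /// Hot path: one predictable branch, then a direct slot write.
    #[inline(always)]
    pub fn store_local_value(&mut self, idx: usize, value: i64) {
        if self.any_watch {
            self.store_watched(idx, value);
        } else {
            self.slots[idx] = value;
        }
    }

    /// Rare path, kept out of line so it does not bloat every inlined caller.
    #[cold]
    #[inline(never)]
    fn store_watched(&mut self, idx: usize, value: i64) {
        if self.watched[idx] {
            self.watch_hits += 1; // placeholder for the real watch bookkeeping
        }
        self.slots[idx] = value;
    }

    pub fn get(&self, idx: usize) -> i64 {
        self.slots[idx]
    }
}
```

With no watches registered, `store_local_value` reduces to a flag test and a slot write; the bookkeeping only runs once `watch` has been called.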
Cut interpreter instruction count on the hot arithmetic/loop path:
- Emit-time binop peephole folds operand loads into fused superinstructions
(mirrors the existing StoreVarLoadVar in-place rewrite; confined to the
current basic block so jump targets/block addresses are never fused across):
* single-fold (right operand): AddIntVar/AddIntConst, CmpIntLtVar/CmpIntLtConst
* double-fold (both operands, no operand stack pushes): AddIntVarVar/
AddIntVarConst, CmpIntLtVarVar/CmpIntLtVarConst
- Unchecked local-slot indexing (get_at/set_at) on LoadVar/StoreVar fast paths.
- Lazy faulting_pc: track cur_pc once per dispatch instead of writing the frame
every op; reconstruct topmost/innermost bytecode PC only when unwinding.
- kperf: in-process Apple-Silicon PMC probe (cycles + instructions retired),
env-gated by BAML_KPERF=1, for deterministic instruction-count comparisons.
fib (1M x fib50): total instructions retired -25.4%, cycles -22.8%, VM ops -33%
vs the pre-fusion baseline. Validated: result correct; bex_vm, exceptions,
cancellation, errors, floats, dispatch, optimization all green.
Co-Authored-By: Claude Opus 4.8 <[email protected]>
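A rough sketch of the emit-time fold described in this commit, assuming a toy `Op` enum and `Emitter` (both hypothetical; the real emitter rewrites its own bytecode buffer). The only point illustrated is that operand loads are folded into the add when, and only when, they sit inside the current basic block.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    LoadVar(usize),
    LoadConst(i64),
    AddInt,
    AddIntVarVar(usize, usize),
    AddIntVarConst(usize, i64),
    /// Marks a jump target, i.e. the start of a new basic block.
    Label,
}

pub struct Emitter {
    pub code: Vec<Op>,
    /// Index of the first op of the current basic block.
    block_start: usize,
}

impl Emitter {
    pub fn new() -> Self {
        Emitter { code: Vec::new(), block_start: 0 }
    }

    pub fn emit(&mut self, op: Op) {
        match op {
            Op::Label => {
                self.code.push(Op::Label);
                self.block_start = self.code.len();
            }
            Op::AddInt => {
                let n = self.code.len();
                // Only fold when both operand loads belong to the current block.
                if n >= self.block_start + 2 {
                    let fused = match (&self.code[n - 2], &self.code[n - 1]) {
                        (Op::LoadVar(a), Op::LoadVar(b)) => Some(Op::AddIntVarVar(*a, *b)),
                        (Op::LoadVar(a), Op::LoadConst(k)) => Some(Op::AddIntVarConst(*a, *k)),
                        _ => None,
                    };
                    if let Some(f) = fused {
                        self.code.truncate(n - 2);
                        self.code.push(f);
                        return;
                    }
                }
                self.code.push(Op::AddInt);
            }
            other => self.code.push(other),
        }
    }
}
```

Emitting `LoadVar(0); LoadVar(1); AddInt` yields the single op `AddIntVarVar(0, 1)`, while the same sequence with a `Label` between the two loads is left unfused.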
Extend the emit-time peephole with two more fused-op families, same in-place /
current-block-confined rewrite as before:
- MoveLocal(dst, src): fuses `LoadVar src; StoreVar dst` (every `x = y`).
Stores via store_local_value, preserving watch semantics.
- AddIntVarVarStore / AddIntVarConstStore: a double-folded add whose result is
stored straight into a (non-captured) local (`local[dst] = local[a] + ..`),
never touching the eval stack.
fib (1M x fib50) cumulative vs the pre-fusion baseline: total instructions
retired -47%, cycles -35%, VM ops -55%, reaching the cycle count of an optimal
hand-written bytecode interpreter for the same workload. Validated: result
correct; bex_vm, exceptions, cancellation, errors, floats, dispatch,
optimization all green.
Co-Authored-By: Claude Opus 4.8 <[email protected]>
Add CmpIntLtVarVarBrFalse / CmpIntLtVarConstBrFalse: when a fused integer `<` comparison is immediately consumed by a `PopJumpIfFalse` (the canonical loop/if condition), the emit peephole collapses both into one op that evaluates the comparison straight from its operands and branches without materializing a bool on the stack. The fused op participates in jump resolution like the plain jumps (instruction-relative offset patched to a byte delta) and preserves the early-yield/cancellation check. fib (1M x fib50) cumulative vs the pre-fusion baseline: total instructions retired -52%, cycles -41%, VM ops -60% — now below the cycle count of an optimal naive hand-written bytecode interpreter for the same workload. Validated: fib correct; boundary loops (sum 0..10 = 45, count = 5) correct; bex_vm, cancellation, exceptions, errors, floats, dispatch, optimization green. Co-Authored-By: Claude Opus 4.8 <[email protected]>
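To make the fused compare-and-branch concrete, here is a toy interpreter that runs the `sum 0..10` boundary loop with ops named after the ones in these commits. It is a sketch under simplifying assumptions: jump targets are absolute instruction indices rather than patched byte deltas, and the early-yield/cancellation check is omitted.

```rust
#[derive(Debug, Clone, Copy)]
pub enum Op {
    /// local[dst] = local[a] + local[b]
    AddIntVarVarStore { dst: usize, a: usize, b: usize },
    /// local[dst] = local[a] + k
    AddIntVarConstStore { dst: usize, a: usize, k: i64 },
    /// Jump to `target` unless local[a] < k; no bool is pushed anywhere.
    CmpIntLtVarConstBrFalse { a: usize, k: i64, target: usize },
    Jump(usize),
    Halt,
}

/// Runs `code` and returns the number of dispatched ops.
pub fn run(code: &[Op], locals: &mut [i64]) -> usize {
    let mut pc = 0;
    let mut dispatched = 0;
    loop {
        dispatched += 1;
        match code[pc] {
            Op::AddIntVarVarStore { dst, a, b } => {
                locals[dst] = locals[a] + locals[b];
                pc += 1;
            }
            Op::AddIntVarConstStore { dst, a, k } => {
                locals[dst] = locals[a] + k;
                pc += 1;
            }
            Op::CmpIntLtVarConstBrFalse { a, k, target } => {
                pc = if locals[a] < k { pc + 1 } else { target };
            }
            Op::Jump(t) => pc = t,
            Op::Halt => return dispatched,
        }
    }
}

/// `let i = 0; let sum = 0; while i < 10 { sum = sum + i; i = i + 1 }`
pub fn sum_below_ten() -> (i64, usize) {
    let code = [
        Op::CmpIntLtVarConstBrFalse { a: 0, k: 10, target: 4 }, // 0: loop condition
        Op::AddIntVarVarStore { dst: 1, a: 1, b: 0 },           // 1: sum = sum + i
        Op::AddIntVarConstStore { dst: 0, a: 0, k: 1 },         // 2: i = i + 1
        Op::Jump(0),                                            // 3: back edge
        Op::Halt,                                               // 4
    ];
    let mut locals = [0i64, 0]; // [i, sum]
    let dispatched = run(&code, &mut locals);
    (locals[1], dispatched)
}
```

`sum_below_ten()` returns `(45, 42)`: four dispatches per iteration for ten iterations, plus the final failing comparison and the `Halt`.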
The VM-op counter (`op_count`) is pure measurement scaffolding but ran on the hottest path in every build — a store plus a memory dependency per dispatched op. Gate the increment behind a new off-by-default `kperf` cargo feature so normal/release builds don't pay for it; kperf reads cycles and instructions retired straight from the hardware counters, so the op count is only needed for the optional per-op breakdown (built with `--features bex_vm/kperf`). Also make the `cur_pc` fault-PC write unconditional (it is needed for correct exception line numbers, not measurement) and drop the now-unused `dbg_skip_*` / `BAML_NO_*` bisection env gates. fib (1M x fib50), op_count removed: -6.7% instructions retired (20.7e9), and BAML's interpreter is now ~10% faster than CPython 3.9 on the same workload. Co-Authored-By: Claude Opus 4.8 <[email protected]>
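One way to gate a counter like `op_count` behind a cargo feature is sketched below. The `Vm` struct and its methods are hypothetical; only the feature name `kperf` comes from the commit message. Without the feature, the field and the increment do not exist in the compiled code at all.

```rust
pub struct Vm {
    pub acc: i64,
    /// Present only when built with the `kperf` feature.
    #[cfg(feature = "kperf")]
    pub op_count: u64,
}

impl Vm {
    pub fn new() -> Self {
        Vm {
            acc: 0,
            #[cfg(feature = "kperf")]
            op_count: 0,
        }
    }

    /// Measurement build: one increment per dispatched op.
    #[cfg(feature = "kperf")]
    #[inline(always)]
    fn count_op(&mut self) {
        self.op_count += 1;
    }

    /// Normal build: compiles to nothing.
    #[cfg(not(feature = "kperf"))]
    #[inline(always)]
    fn count_op(&mut self) {}

    /// Stand-in for a dispatched op.
    pub fn add(&mut self, n: i64) {
        self.count_op();
        self.acc += n;
    }

    #[cfg(feature = "kperf")]
    pub fn ops_dispatched(&self) -> Option<u64> {
        Some(self.op_count)
    }

    #[cfg(not(feature = "kperf"))]
    pub fn ops_dispatched(&self) -> Option<u64> {
        None
    }
}
```

In a default build `ops_dispatched()` returns `None`; the commit message gives `--features bex_vm/kperf` as the way to enable the per-op breakdown.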
When a conditional's else-successor is the fall-through block (and the then block is not), emit an inverted compare-and-branch — CmpIntLtVarVarBrTrue / CmpIntLtVarConstBrTrue (branch to `then` when the comparison is true) — and let the jump to the else block fall through. This eliminates the unconditional jump-to-body that otherwise runs every loop iteration. Removing a dispatched op matters disproportionately: each dispatch is an indirect jump through the giant opcode match that the branch predictor routinely misses, so this is ~15% fewer cycles for ~9% fewer instructions (IPC 5.0 -> 5.4). fib (1M x fib50) cumulative vs the pre-fusion baseline: instructions retired -60%, cycles -51% (halved). Validated: fib correct; if/else and boundary loops correct; bex_vm, interfaces (337), exceptions, cancellation, errors, floats, gc, env, io, dispatch, optimization all green. Co-Authored-By: Claude Opus 4.8 <[email protected]>
execute_call_from_locals_offset dereferenced the callee HeapPtr three separate times to (a) check for a HostClosure, (b) extract Closure captured type args, and (c) extract BoundMethod class type args. Fold these into a single match with a plain-Function fast path (the common case, including all recursion), which extracts nothing. The later callee-resolution match still validates non-callable objects, so error behaviour is unchanged. Measured call overhead is ~130 cyc / ~740 instructions per call (5M-call bench minus an equivalent inline loop) — dominated by frame setup/teardown spread across the call+return machinery, so this is a small (~1%) but free reduction. Validated: fib32 correct; bex_vm, interfaces (337), dispatch, spawn, cancellation, host_value_callable all green (closure/bound-method/host-closure dispatch preserved). Co-Authored-By: Claude Opus 4.8 <[email protected]>
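The single-match shape described here might look roughly like the following. `Callee` and `CallPlan` are hypothetical simplifications of the heap object kinds named in the commit, written only to show one dereference with a plain-function fast path that extracts nothing.

```rust
pub enum Callee {
    Function { name: &'static str },
    Closure { captured_type_args: Vec<&'static str> },
    BoundMethod { class_type_args: Vec<&'static str> },
    HostClosure,
    NotCallable,
}

#[derive(Debug, PartialEq)]
pub enum CallPlan<'a> {
    /// Common case (including all recursion): nothing to extract.
    Plain,
    WithTypeArgs(&'a [&'static str]),
    Host,
    /// Left for the later callee-resolution step to report as an error.
    Defer,
}

/// Inspect the callee once instead of three separate times.
pub fn plan_call(callee: &Callee) -> CallPlan<'_> {
    match callee {
        Callee::Function { .. } => CallPlan::Plain,
        Callee::Closure { captured_type_args } => CallPlan::WithTypeArgs(captured_type_args),
        Callee::BoundMethod { class_type_args } => CallPlan::WithTypeArgs(class_type_args),
        Callee::HostClosure => CallPlan::Host,
        Callee::NotCallable => CallPlan::Defer,
    }
}
```

Non-callable values are deliberately not rejected in this sketch, mirroring the note that the later resolution match still validates them.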
Mirror the AddInt fusion for SubInt: SubIntVar / SubIntConst (fold the right operand load) and SubIntVarVar / SubIntVarConst (fold both). Subtraction is not commutative, so operand order is preserved (left - right) and no const-on-left commute is applied. Same in-place, current-block-confined emit peephole. Helps any code using subtraction, which the earlier add/compare fusion missed. fib32-recursive (call-bound) still gains -9% cycles from fusing its n-1/n-2 + result-add body; the hot fib loop is unchanged (no I-cache regression from the larger dispatch match). Validated: fib32 = 2178309, subtraction loop correct; bex_vm, interfaces (337), optimization, dispatch, errors, exceptions, floats all green. Co-Authored-By: Claude Opus 4.8 <[email protected]>
…system fixture
The superinstruction fusion changes emitted bytecode (e.g. `load_var a;
load_var b; add_int` → `add_int_var_var`, `load_var x; store_var y` →
`move_local`, `load_const 1; add_int` → `add_int_const`), so every codegen /
bytecode snapshot is regenerated to match. Behaviour is unchanged — these are
the same programs, fewer dispatched ops.
Also delete the `event_system` test fixture: it exercised `baml.events.send`,
which the tracing trim removed, so it no longer compiles ("unresolved name:
send"). The feature is gone, so the fixture is removed rather than rewritten;
its generated test disappears when build.rs regenerates from projects/.
A few snapshots also drop `events.send` from builtin listings (baml_cli
package listing, __baml_std__, package_items) for the same removal.
Verified: baml_tests, baml_cli, baml_compiler2_emit, bex_vm, bex_vm_types all
green with no INSTA_UPDATE.
Co-Authored-By: Claude Opus 4.8 <[email protected]>
202a8ee to 143bd33
## What

Adds **53 in-BAML test blocks** to `ns_floats/floats.baml` covering the `==` / `!=` operators on floats. Regenerates the `floats` bytecode snapshot.

## Why

Float equality **already works** end-to-end:

- **Type-checker** (`infer_binary_op`): permissive; any two operands → `bool`; `int == float` widens int to f64.
- **Constant folder** (`try_fold_binary`): literal `float == float` is folded at compile time, a *separate* path from runtime.
- **VM** (`exec_cmpop` + dedicated `CmpFloatEq` opcode): IEEE 754 semantics.

…but there was **no test coverage**: `floats.baml` deliberately used epsilon checks and `operators.baml` only tested `int == int`. This locks in the behavior.

## Coverage

Cases are grouped by code path (compile-fold vs runtime, forced via `float.parse(...)`/calls) so a divergence between the folder and the VM would be caught. They follow JS/TS (IEEE 754) conventions:

- **NaN**: `NaN != NaN` true, `NaN == NaN` false, NaN vs number/inf/null, NaN propagation through arithmetic
- **Infinity**: `+inf == +inf`, `+inf != -inf`, overflow → inf, max-finite ≠ inf, `inf - inf` → NaN
- **Signed zero**: `0.0 == -0.0`, `-1.0/inf` → `-0.0`
- **int/float mixing**: `2 == 2.0`, `3.0 == 3` (both paths)
- **Precision**: `0.1 + 0.2 != 0.3`, `== 0.30000000000000004`, `1.0/3.0`, sqrt identities, subnormals (`5e-324`), large-magnitude loss
- **null**: `float == null` → false (allowed, not an error)

The one rejected pairing, `float == bigint` (compile error E0004, bigint past 2⁵³ can't round-trip f64), is unchanged and not expressible as a passing test.
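For reference, the same IEEE 754 expectations can be checked against Rust's `f64`, which follows the same standard. This is an illustrative cross-check only; it is not the BAML test suite, and `float_eq_cases` is a name made up for this sketch.

```rust
/// Each entry pairs a claim from the coverage list with what `f64` reports.
pub fn float_eq_cases() -> Vec<(&'static str, bool)> {
    let nan = f64::NAN;
    let inf = f64::INFINITY;
    vec![
        ("NaN != NaN", nan != nan),
        ("NaN == NaN", nan == nan), // the only entry expected to be false
        ("+inf == +inf", inf == inf),
        ("+inf != -inf", inf != -inf),
        ("overflow -> inf", f64::MAX * 2.0 == inf),
        ("max-finite != inf", f64::MAX != inf),
        ("inf - inf is NaN", (inf - inf).is_nan()),
        ("0.0 == -0.0", 0.0_f64 == -0.0_f64),
        ("-1.0/inf is -0.0", (-1.0_f64 / inf).is_sign_negative() && -1.0_f64 / inf == 0.0),
        ("2 == 2.0", 2_i64 as f64 == 2.0),
        ("0.1 + 0.2 != 0.3", 0.1_f64 + 0.2 != 0.3),
        ("0.1 + 0.2 == 0.30000000000000004", 0.1_f64 + 0.2 == 0.30000000000000004),
        ("5e-324 is a positive subnormal", 5e-324_f64 > 0.0 && !5e-324_f64.is_normal()),
    ]
}
```

Every entry evaluates to `true` except `NaN == NaN`, which is `false`.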
## Testing

```
cargo run -p baml_cli -- test --from crates/baml_tests/baml_src -i "::float_eq_*"
# 53 passed, 0 failed
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Tests**
  * Added comprehensive test suite for float comparison operations, covering IEEE-754 semantics and edge cases: NaN behavior and self-inequality, positive/negative infinity handling, signed-zero behavior, overflow-to-infinity conversions, precision and rounding imprecision scenarios, and parsing-based test cases including scientific notation edge ranges.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
…pus (#3631) ## What The `baml_tests` CodSpeed runtime suite was a set of hand-written `#[divan::bench]` functions with BAML source inlined as string literals — a parallel copy of workloads that already exist in the `tools/speedtest` corpus. This replaces them with **one `vm_speedtest_*` bench auto-generated per workload** under `tools/speedtest/workloads/*.md`, so the speedtest harness and CodSpeed share a single source of truth. ## How - **`build.rs` (`generate_speedtest_benches`)** shells out once to a new `tools/speedtest/export_baml.py`, which reuses `speedtest.loader` (including `## eval-setup` + `$$` templating) to emit each workload's *expanded* BAML as JSON. build.rs then generates one divan bench per workload, named `vm_speedtest_<slug>`, each calling the existing `bench_vm_main` helper (compile + tokio runtime built **once, outside** the measured region → only `main()` is timed). - **Graceful degradation:** if `python3` or the corpus is unavailable at build time, it emits a `cargo:warning` and no benches rather than breaking the crate build. - **Sleep exclusion:** workloads that call the blocking `baml.sys.sleep` are dropped at build time (matched by FQN) — as walltime benches their sample time is dominated by sleeping, not VM work. A build warning names what was skipped. Currently excludes `concurrency::parallel sleep 3x200ms`. - **All hand-written benches removed** (`vm_*`, `e2e_*`, `startup_*`, `compile_to_engine`, `engine_init_cost`) per design discussion. The 2 with no workload equivalent became new workloads, with BAML/Python/TS output cross-verified against `baml-cli`: - `compute/wide-nested-class-create-50k.md` (= `8754025000`) - `compute/mixed-ops-5k.md` (= `62499999`) - **CI** run filter updated `vm_|engine_init` → `vm_speedtest` (the old alternatives no longer exist); build step unchanged. ## Result **36 generated benches** (37 workloads − 1 sleep). All compile and execute cleanly (`divan --test`, exit 0). 
No CI build wiring change needed — `cargo codspeed build --bench runtime_benchmark` already covers them. To add or change a runtime bench going forward, edit a workload `.md` — no Rust changes required. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added new speedtest workloads (mixed arithmetic and wide nested object creation) to expand performance coverage * Introduced generated VM-focused benchmark cases to measure pure VM execution timing * **Chores** * Updated CI benchmark configuration to run VM speedtests for more representative timing * Added build-time benchmark generation and a CLI export tool to produce workload test data automatically <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
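The bench-generation step in `build.rs` could be sketched along these lines. Everything here is an assumption made for illustration: the `Workload` struct, the slug rule, the `bench_vm_main(bencher, source)` call shape, and the substring test for `baml.sys.sleep` (the PR says the real exclusion matches by FQN).

```rust
/// Hypothetical in-memory form of one exported workload.
pub struct Workload {
    pub name: String,
    pub baml: String,
}

/// Turn a workload name into an identifier-safe slug (assumed rule).
pub fn slugify(name: &str) -> String {
    let mut out = String::new();
    for c in name.chars() {
        if c.is_ascii_alphanumeric() {
            out.push(c.to_ascii_lowercase());
        } else if !out.is_empty() && !out.ends_with('_') {
            out.push('_');
        }
    }
    out.trim_end_matches('_').to_string()
}

/// Returns the generated Rust source plus the names of skipped workloads.
pub fn generate_benches(workloads: &[Workload]) -> (String, Vec<String>) {
    let mut src = String::new();
    let mut skipped = Vec::new();
    for w in workloads {
        // Sleep-dominated workloads are dropped; report them to the caller.
        if w.baml.contains("baml.sys.sleep") {
            skipped.push(w.name.clone());
            continue;
        }
        src.push_str(&format!(
            "#[divan::bench]\nfn vm_speedtest_{}(b: divan::Bencher) {{\n    bench_vm_main(b, {:?});\n}}\n\n",
            slugify(&w.name),
            w.baml
        ));
    }
    (src, skipped)
}
```

A real `build.rs` would write the returned source under `OUT_DIR` and print one `cargo:warning=` line per skipped workload.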
…tions
Drop the 20 emit-time fused opcodes added earlier (AddIntVar/Const,
SubInt*, CmpIntLt* folds, AddInt*Store, CmpIntLt*Br{False,True}, MoveLocal)
in favour of CPython's minimal, operand-movement-only superinstruction set:
- LoadVar2 (= LOAD_FAST_LOAD_FAST): push two locals in one dispatch
- StoreVar2 (= STORE_FAST_STORE_FAST): store two locals in one dispatch
(StoreVarLoadVar already covered STORE_FAST_LOAD_FAST.) Type specialization
stays where it belongs — the pre-existing AddInt/SubInt/MulInt/CmpIntOp ops
are BAML's static-typing equivalent of CPython's BINARY_OP_*_INT /
COMPARE_OP_INT, emitted directly with no inline caches or deopt.
Rationale: the dedicated fused ops were a combinatorial set (operation ×
operand-kind × fold-depth × branch-polarity) that overfit the fib loop and
would explode in opcode count as more operators/types were covered. CPython
deliberately keeps fusion to a tiny movement-only set and leans on
specialization (which we get for free from static types) — and, for the real
"way faster" win, a copy-and-patch JIT, which subsumes interpreter fusion for
hot code. This keeps the dispatch table small (better I-cache) and the design
principled.
Bytecode snapshots regenerated accordingly (load_var2 + plain add_int/
cmp_int_op/store_var). Validated: fib correct; baml_tests, baml_cli green;
clippy clean on stable 1.93.0.
Co-Authored-By: Claude Opus 4.8 <[email protected]>
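The movement-only set this commit settles on can be illustrated with a small pairing pass. This is a sketch, not the emitter's code: it runs as a post-pass over a hypothetical `Op` list (the actual rewrite happens in place at emit time), and a `Label` stands in for a basic-block boundary.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    LoadVar(usize),
    StoreVar(usize),
    LoadVar2(usize, usize),
    StoreVar2(usize, usize),
    AddInt,
    /// Jump target; pairs are never formed across it.
    Label,
}

/// Pair adjacent loads and adjacent stores; arithmetic ops are left untouched.
pub fn pair_moves(code: &[Op]) -> Vec<Op> {
    let mut out: Vec<Op> = Vec::with_capacity(code.len());
    for op in code {
        let fused = match (out.last(), op) {
            (Some(Op::LoadVar(a)), Op::LoadVar(b)) => Some(Op::LoadVar2(*a, *b)),
            (Some(Op::StoreVar(a)), Op::StoreVar(b)) => Some(Op::StoreVar2(*a, *b)),
            _ => None,
        };
        match fused {
            Some(f) => {
                out.pop();
                out.push(f);
            }
            None => out.push(op.clone()),
        }
    }
    out
}
```

For `a + b` stored into `c`, the input `LoadVar(0); LoadVar(1); AddInt; StoreVar(2)` becomes `LoadVar2(0, 1); AddInt; StoreVar(2)`, which matches the `load_var2` + plain `add_int` / `store_var` shape mentioned for the regenerated snapshots.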
…structions
# Conflicts:
# baml_language/crates/baml_tests/benches/runtime_benchmark.rs
Use sccache (R2-backed) for Rust **compilation** artifacts in the cargo CI jobs, configured entirely from `.envrc` so CI matches local shells. - `tools_sccache` crate / `tools/baml-sccache` wrapper: a `RUSTC_WRAPPER` that maps `BAML_SCCACHE_R2_*` → `AWS_*` and execs sccache (native crate on Windows, shell script on POSIX). - mise installs sccache + direnv; each cargo job loads `.envrc` via `direnv export gha` (the single source of truth for the sccache/R2 config). - **Swatinem/rust-cache still caches the cargo registry/git download state**, with `cache-targets: false` so sccache owns `target/` and the two caches don't compete. Fork PRs without R2 secrets fall back to the runner-local cache. Follow-up #3624 replaces Swatinem for the download caches with a granular, Cargo.lock-driven R2 action (`cache-cargo-home`); this PR is the sccache base it stacks on. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
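A `RUSTC_WRAPPER` of the kind described might be sketched as below. Only the `BAML_SCCACHE_R2_` → `AWS_` prefix mapping comes from the PR text; the suffix-preserving rule, the function names, and invoking `sccache` from `PATH` are assumptions.

```rust
use std::process::Command;

const PREFIX: &str = "BAML_SCCACHE_R2_";

/// Rename `BAML_SCCACHE_R2_<X>` to `AWS_<X>`; unrelated variables are dropped
/// from the returned list (the child process still inherits them as usual).
pub fn map_env<I>(vars: I) -> Vec<(String, String)>
where
    I: IntoIterator<Item = (String, String)>,
{
    vars.into_iter()
        .filter_map(|(k, v)| k.strip_prefix(PREFIX).map(|rest| (format!("AWS_{rest}"), v)))
        .collect()
}

/// Build (but do not run) the command a wrapper would exec.
/// Cargo passes the rustc path followed by its arguments to the wrapper.
pub fn wrapper_command(rustc_and_args: &[String]) -> Command {
    let mut cmd = Command::new("sccache");
    cmd.args(rustc_and_args).envs(map_env(std::env::vars()));
    cmd
}
```

A wrapper binary would call `wrapper_command` with its own arguments, run it, and exit with the child's status code.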
…3633) ## Issue Reference N/A (net-new, self-contained addition under `tools/baml-bench/`) ## Changes This PR adds **baml-bench**, an event-driven pipeline that benchmarks how well a coding agent (Claude Code) uses BAML, surfaces real language/skill issues from those runs, and dispatches fixes. The entire diff is confined to `tools/baml-bench/` (88 files) and touches nothing else in the monorepo. **The pipeline:** an inbound event (Slack mention, cron job, or bug report) creates a task; a worker runs a Claude Code agent against the task and records a "trophy" (transcript, metrics, findings); findings are classified and deduplicated into issues; approved issues are synced to Notion and dispatched to a Cursor cloud agent that opens a fix PR. A read-only dashboard shows the whole thing live. It is built as 8 Python services + a self-hosted Convex data layer + a Next.js dashboard: - **`bench_core`** (shared library): pydantic schemas, jsonl/prices utilities, the service/proxy/slack/notion/cursor clients, and the `Processor` claim-loop base (SSE wakeups, heartbeat, lease) that every worker builds on. - **Convex data layer**: the schema, a generic claimable-queue lib, per-table query/mutation modules, and a reaper for stale claims. - **`api`**: the sole Convex gateway, plus a blob store for transcripts/binaries and generic table + baml-builds routers. - **`claude-proxy`**: runs real Claude Code sessions and parses them into transcript + metrics. - **`baml-worker`**: task to trophy (agent run, trophy parse, repro verification). - **`baml-dedup`**: trophy to issue (classify + dedup). - **`baml-builder`**: tracks baml release binaries in a registry. - **`ingress`**: public webhook gateway (slack/notion/bug, ack-first). - **`notion-fixer`**: Notion board sync + Cursor cloud-agent fix dispatch. - **`cron`**: daily build-refresh + task enqueuer. 
Also included: Python packaging, the base Docker image + per-service Dockerfiles, a `docker-compose` local stack with `.env.example`, the unit + E2E test suites, Google-style docstrings on every function/method/class, a README, and a generated `docs/reference.md` indexing every symbol across `bench_core`, `services`, `convex`, and `ui`. Anthropic auth is API-key only. ## Testing Please describe how you tested these changes - [x] Unit tests added/updated - [x] Manual testing performed - [ ] Tested in [environment] - Fast suite (no Docker): `cd tools/baml-bench && pytest -m "not integration"` -> 12 passed (app/health wiring, proxy session parsing, ingress routing). - E2E suite (`@pytest.mark.integration`, self-skips without Docker): `pytest -m integration` boots a Convex backend container plus `api`/`ingress`/a stub proxy on ephemeral host ports and drives the full pipeline (task -> worker -> trophy -> dedup -> issue -> notion sync) and the ingress + fix-dispatch path end to end. - The pipeline has been running in production (standalone on Fly), so the migrated code is exercised; this PR is the monorepo packaging of it. ## Screenshots If applicable, add screenshots to help explain your changes N/A (the UI is a read-only dashboard; no user-facing change to existing BAML surfaces). ## PR Checklist Please ensure you've completed these items - [x] I have read and followed the contributing guidelines - [x] My code follows the style guidelines of this project - [x] I have performed a self-review of my own code - [x] I have commented my code, particularly in hard-to-understand areas - [x] I have made corresponding changes to the documentation - [x] My changes generate no new warnings ## Additional Notes Add any other context about the PR here - **Isolation:** the entire diff is confined to `tools/baml-bench/`; nothing outside that path changes, so it has no effect on the rest of the monorepo. 
- **CI is intentionally not in this PR.** A path-scoped Blacksmith workflow + pre-commit hooks are ready but touch shared `.github/` config outside `tools/baml-bench/`, so they will follow in a separate PR to keep this one isolated. - **Some docs follow later.** The README and the generated API reference are included. The longer guides (architecture, data-model, configuration, local-setup, deployment, ci) are staged and will land in a follow-up once reviewed. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Full local benchmark stack: live dashboard (graph, tables, run/task pages), API + blob storage, claimable-queue workers, build manager, agent proxy, Slack/Notion ingress, cron-driven task enqueueing, and end-to-end agent/verification flows. * **Documentation** * Complete architecture, data model, configuration guides and generated API reference. * **Tests** * New unit and end-to-end integration suites with drivers and service stubs exercising ingress, proxy, and the full pipeline. * **Chores** * Local dev tooling: docker-compose, env example, gitignore, Dockerfiles, and package manifests. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.8 <[email protected]>
)

## What

Adds **interface-field destructuring** in `match` patterns. An interface head binds the interface's declared fields across every implementor:

```baml
function describe(a: Animal) -> string {
  match (a) {
    Animal { name } => "animal: " + name  // binds `name` for any implementor
  }
}
```

`name` resolves through each implementor's field view, so it works whether the field was auto-linked (`Dog { name }`) or `as`-aliased (`Cat { name as nickname }`). Because every implementor necessarily provides the interface's declared fields, the pattern matches them all, so `Animal { name }` is exhaustive on its own (no `_` needed).

Previously `Animal { name }` was mis-lowered as a construction expression (`unresolved name: Animal`); only concrete-class destructure (`Dog { name }`) worked.

## How

- **TIR** (`baml_compiler2_tir/src/builder.rs`): `resolve_class_pattern_type` accepts interface heads; `lower_class_pat` has an interface branch that binds each field's type via `resolve_interface_member` and produces a wildcard-cover `DPat`.
- **MIR** (`baml_compiler2_mir/src/lower.rs`): `project_class_pattern_field` routes interface heads to a new `project_interface_pattern_field`, reusing the existing interface field-view dispatch (`try_lower_interface_field_access`). The MIR `Ty` has no interface variant, so the route keys off the raw `Tir2Ty::Interface`.

## Tests

- New: `match_destructures_interface_fields_directly` (interface head, auto-linked + aliased implementors) and `match_destructures_concrete_implementor_fields`.
- Refreshed the BEP-044 regression-suite comments; all pass now. The two interface-method-as-value cases (`fuzz_bug01/02`) remain `#[ignore]`d (genuinely unimplemented).
- Interfaces suite: 339 passed / 2 ignored; full `baml_tests` (30 binaries) and `cargo check --workspace` clean.

The matching BEP-044 spec update (match syntax, this feature, and other implementation-vs-draft corrections) was pushed separately to beps.boundaryml.com.
## Also included In-flight **Python SDK / bridge / codegen** fixes that were already present on the working branch (`bridge_cffi`, `bridge_python`, `codegen_python`, `harness_setup`, `baml_cli/generate.rs`, `.pyi` stubs). Not authored as part of the interface work; bundled per request. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **High Risk** > Changes span TIR/MIR pattern lowering and exhaustiveness (easy to get match soundness wrong) plus the Python runtime initialization path that all generated SDKs use after `baml generate`. > > **Overview** > Implements **BEP-044 interface destructuring** in `match` (`Animal { name } => …`): TIR resolves interface pattern heads and lowers `DPat::interface` with field-view types; exhaustiveness gains `Ctor::Interface` and matrix specialization that maps interface field slots onto implementing class fields; MIR projects bound fields via `project_interface_pattern_field` / existing interface field dispatch. > > **Python codegen/runtime** now embeds **borsh-serialized bytecode** instead of inlined `.baml` source: `baml generate` compiles and calls `to_source_code_with_bytecode`; `bex_project::new_from_bytecode`, CFFI/Python `initialize_runtime_from_bytecode`, and generated `_inlinedbaml.BYTECODE` wiring. CI **size-gate** baselines for `baml-cli` are bumped slightly. > > Large **interface test suite** additions (compile + VM) for destructure exhaustiveness, mixed concrete/interface arms, generics, and updated regression comments (most fuzz/wf3 cases now pass; method-as-value tests still ignored). > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 3c25e0f. Bugbot is set up for automated code reviews on this repo. 
Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Match expressions now support destructuring of interface-typed values. * SDKs can initialize the runtime from precompiled BAML bytecode (new runtime entrypoint and corresponding Python initializer). * **Tests** * Added end-to-end tests covering interface-field destructuring, exhaustive matching, and related runtime behaviors. * **Chores** * Test harness and workspace updated for bytecode support (borsh); CI size-gate baselines updated. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
…lock (#3635)

## TL;DR

`cargo test -p baml_tests` (and `baml-cli test --from baml_src`) **hung**. It turned out to be **two unrelated bugs stacked on top of each other**, and the first one masked the second:

1. **Compile-time `O(files²)` blow-up**: the deterministic hang. Compiling the 55-file `baml_src` corpus as one project ran two whole-project passes once *per file* / *per function*. Fixed with salsa memoization. Corpus went from *never finishing* → ~30s, **byte-identical bytecode**.
2. **A runtime GC/permit deadlock in `spawn`**: once compilation was fast enough to actually reach execution, the test run started hanging ~50% of the time in the engine. Root cause was a **nested heap-permit acquire that deadlocks against stop-the-world GC**. Fixed by allocating the spawned future under the parent's existing permit.

Both were diagnosed by measurement (CPU sampling, scaling curves, forced-GC stress repro), not guesswork. Details + code references below so this is auditable.

**Commits**

1. `perf(compiler2): …`: memoize the two whole-project compile passes (Problem 1).
2. `fix(engine): …`: allocate the spawned future under the parent's permit (Problem 2).
3. `perf(mir): …`: follow-up: borrow `resolved_aliases` instead of cloning it per context (see *Follow-up* section), addressing review feedback.

---

## Problem 1: `O(files²)` compilation (commit `perf(compiler2): …`)

### Symptom

`baml-cli test --from baml_src` pinned one core at ~95% CPU and never finished on the full corpus. Not a cargo-lock deadlock, not network: pure CPU-bound compilation.
Scaling was clearly super-linear:

| files | time (before) |
|------:|--------------:|
| 1 | 4s |
| 15 | 13s |
| 27 | 31s |
| 35 | 44s |
| 55 | **never finished** (~100s+ extrapolated) |

`sample`-ing the process showed all time under `collect_diagnostics` → type inference → `ppir_expansion_items` → `collect_alias_bodies` → `lower_file`, and later under `generate_project_bytecode` → `lower_function` → `LoweringContext::new` → `populate_from_package`.

### Two quadratics

**(a) `baml_compiler2_ppir`**: `ppir_expansion_items` is a **per-file** `#[salsa::tracked]` query (`lib.rs:205`), but each invocation called `collect_block_attrs` / `collect_alias_bodies`, plain functions that iterate **every** file in the project and call `ast::lower_file` on each. So N files × re-lowering N files = **O(N²)** lowerings.

**(b) `baml_compiler2_mir`**: `LoweringContext::new` / `new_for_let` run **per function**, and each rebuilt `populate_from_package` (which lowers every class field type across all packages, `lower.rs:1190`) plus `ResolvedAliases::for_package` (which re-runs `find_recursive_aliases` over the whole project). M functions × N classes = **O(N²)** again.

### Fix

Memoize the whole-project work behind package/project-keyed salsa queries, reusing the manual `unsafe impl salsa::Update` (via `PartialEq`) pattern already established for `PackageItems`:

- `project_expansion_maps(db, project)` (`ppir/lib.rs:165`)
- `package_lowering_data(db, pkg_id)` (`mir/lower.rs:957`). `LoweringContext` now **borrows** the schema maps and `resolved_aliases` (`&'db`) instead of rebuilding/cloning them per function. (The `resolved_aliases` borrow was completed in commit 3; see *Follow-up*.)

### Result

| files | before | after |
|------:|-------:|------:|
| 35 | 44s | 20s |
| 55 | never finished | ~30s |

Scaling is now ~linear. **The bytecode snapshot test passes unchanged**: output is byte-for-byte identical, so this is a pure performance change. 1576 `baml_tests` lib tests pass.
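The quadratic and its fix can be reproduced in miniature. The sketch below uses a plain computed-once map rather than salsa, and `Project`, `lower_file`, and `expansion_map` are hypothetical stand-ins; it only counts how many file lowerings each strategy performs.

```rust
use std::cell::Cell;
use std::collections::HashMap;

pub struct Project {
    pub files: Vec<String>,
    /// Counts calls to `lower_file` so the two strategies can be compared.
    pub lowerings: Cell<usize>,
}

impl Project {
    pub fn new(files: Vec<String>) -> Self {
        Project { files, lowerings: Cell::new(0) }
    }

    /// Stand-in for an expensive per-file lowering.
    fn lower_file(&self, src: &str) -> usize {
        self.lowerings.set(self.lowerings.get() + 1);
        src.len()
    }

    /// Whole-project pass: lowers every file.
    fn expansion_map(&self) -> HashMap<usize, usize> {
        self.files.iter().enumerate().map(|(i, f)| (i, self.lower_file(f))).collect()
    }

    /// Before: the whole-project pass is recomputed inside each per-file query.
    pub fn compile_naive(&self) -> usize {
        (0..self.files.len()).map(|i| self.expansion_map()[&i]).sum()
    }

    /// After: the whole-project pass runs once and per-file work borrows it.
    pub fn compile_memoized(&self) -> usize {
        let map = self.expansion_map();
        (0..self.files.len()).map(|i| map[&i]).sum()
    }
}
```

For a 55-file project, `compile_naive` performs 55 × 55 = 3025 lowerings and `compile_memoized` performs 55, while both return the same result.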

---

## Problem 2: `spawn` deadlocks against GC (commit `fix(engine): …`)

This is the subtle one. Once compilation was fast, the full 1614-test run started **hanging ~50% of the time**, always with the *same* shape: the tokio runtime driver parked in `block_on` and **every worker thread idle/parked**, i.e. a lost-wakeup, not a CPU spin.

### The BAML test that triggered it

`crates/baml_tests/baml_src/ns_cancel_cascade/cancel_cascade.baml`:

```baml
function cancelled_child_future_state_is_cancelled() -> baml.future.FutureState {
  let slow = spawn { baml.sys.sleep(60000); 42 };  // task S: sleeps 60s
  let waiter = spawn { await slow };               // task W: awaits S
  let _ = waiter.cancel();                         // cancel W (not S)
  waiter.state()
}
```

It passed **5/5 in isolation** but hung ~50% in the full run, the classic fingerprint of a *concurrency* bug that needs accumulated load, not a logic bug in the test.

Two facts narrowed it down:

- All threads parked ⇒ a lost wakeup in async machinery, not a synchronous lock.
- It correlated with **garbage collection**: forcing GC to run on *every* allocation (temporarily setting the Gen0 threshold `10_000 → 1` in `bex_heap`) turned the flake into a **100% reproducible** hang on a tiny repro. That was the key to pinning it.

### Background: the heap-permit model

The engine coordinates GC with a `HeapPermitManager` backed by a **single tokio `Semaphore`** (`bex_heap/src/heap_guard.rs`):

- Each running VM mutator holds **one** `ActiveHeapPermit` (one semaphore permit).
- Stop-the-world GC parks everything by draining the **entire** semaphore at once:

```rust
// bex_heap/src/heap_guard.rs:227
pub async fn request_park(&self) -> HeapGuard<'_> {
    let permits = self.active
        .acquire_many(MAX_PERMITS) // <-- wants ALL permits; completes only when
        .await                     //     every ActiveHeapPermit has been released
    ...
}
```

The crucial property: **tokio's `Semaphore` is fair (FIFO)**.
Once `acquire_many(MAX_PERMITS)` is queued, any later `acquire()` (even for 1 permit) queues **behind** it and cannot be granted until the big request is satisfied and released.

### The bug

`spawn` allocated the child's heap `Future` by taking a **second, fresh** permit *while the parent task that issued the `spawn` still held its own permit*. The parent awaits `spawn_thread` inline, so both permits live on the same logical flow:

```rust
// OLD: bex_engine/src/lib.rs, spawn_thread_setup (deleted in this PR)
let permit = self.heap_permit_manager.new_permit(()).await;
let permit = permit.acquire().await; // <-- 2nd permit, while parent still holds its 1st
let (future_id, future_ptr) = {
    let mut guard = self.futures.acquire(permit.proof()).await;
    guard.new_future(child_cancel.clone()) // allocate the child Future
};
drop(permit);
```

Now interleave a GC park (which, under real workloads, fires whenever heap pressure crosses the threshold; hence the flakiness, and 100% under forced GC):

```mermaid
sequenceDiagram
    participant P as Parent task<br/>(holds permit P_main)
    participant G as GC (request_park)
    participant S as Semaphore (fair FIFO)
    P->>P: executing spawn { ... }
    Note over G,S: heap pressure → GC starts
    G->>S: acquire_many(MAX_PERMITS)
    S-->>G: queued, waits for ALL permits<br/>(P_main still held by Parent)
    P->>S: acquire() for the child-future permit
    S-->>P: queued BEHIND GC (fair), blocked
    Note over P,G: 🔒 deadlock cycle
    Note right of P: Parent won't release P_main<br/>until spawn returns
    Note right of P: spawn can't return<br/>until it gets the 2nd permit
    Note right of G: GC can't grant the 2nd permit<br/>until it finishes, which needs P_main
```

So: **GC waits for the parent's permit → the parent waits (fairly, behind GC) for a second permit → the parent won't release the first until it gets the second.** Cycle. All tasks suspend; every worker parks. The 60s `sleep` is a red herring: the hang is immediate (it reproduced with `PASS=0`).
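The cycle can be replayed with a tiny synchronous model of a fair semaphore. This is not tokio's implementation: `FairSemaphore` is a made-up, non-blocking simplification that only tracks who is queued, and `MAX_PERMITS` is set to 4 for readability. It shows why a nested acquire stalls behind a queued `acquire_many`, and why a single permit per task does not.

```rust
use std::collections::VecDeque;

/// Toy FIFO semaphore: requests are granted strictly in arrival order.
pub struct FairSemaphore {
    available: usize,
    queue: VecDeque<(u32, usize)>, // (waiter id, permits wanted)
}

impl FairSemaphore {
    pub fn new(permits: usize) -> Self {
        FairSemaphore { available: permits, queue: VecDeque::new() }
    }

    /// Returns true if granted immediately; otherwise the request is queued.
    /// Fairness: nothing is granted while an earlier request is still waiting.
    pub fn acquire(&mut self, who: u32, n: usize) -> bool {
        if self.queue.is_empty() && self.available >= n {
            self.available -= n;
            true
        } else {
            self.queue.push_back((who, n));
            false
        }
    }

    /// Releases `n` permits and returns the waiters that become runnable.
    pub fn release(&mut self, n: usize) -> Vec<u32> {
        self.available += n;
        let mut granted = Vec::new();
        while let Some(&(who, want)) = self.queue.front() {
            if self.available < want {
                break;
            }
            self.available -= want;
            self.queue.pop_front();
            granted.push(who);
        }
        granted
    }
}

pub const MAX_PERMITS: usize = 4;
pub const PARENT: u32 = 1;
pub const GC: u32 = 2;

/// Old spawn path: the parent asks for a second permit while holding its first.
pub fn old_spawn_deadlocks() -> bool {
    let mut sem = FairSemaphore::new(MAX_PERMITS);
    let parent_running = sem.acquire(PARENT, 1); // P_main
    let gc_granted = sem.acquire(GC, MAX_PERMITS); // queued: P_main is held
    let child_granted = sem.acquire(PARENT, 1); // queued behind GC despite free permits
    // The parent only releases P_main after the nested acquire succeeds,
    // so neither waiter can ever be granted.
    parent_running && !gc_granted && !child_granted
}

/// New spawn path: no nested acquire, so the parent reaches a point where it
/// releases P_main and GC proceeds.
pub fn new_spawn_deadlocks() -> bool {
    let mut sem = FairSemaphore::new(MAX_PERMITS);
    let parent_running = sem.acquire(PARENT, 1);
    let gc_granted = sem.acquire(GC, MAX_PERMITS);
    let woken = sem.release(1); // parent finishes its dispatch and yields
    parent_running && !(gc_granted || woken.contains(&GC))
}
```

In this model `old_spawn_deadlocks()` is `true` and `new_spawn_deadlocks()` is `false`.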
### The fix

There is no reason to take a *new* permit to allocate the child future: the parent is **already** holding an active permit at the `spawn` site. Allocate the future there, under the parent's permit, and hand the `future_id` to `spawn_thread`:

```rust
// NEW: bex_engine/src/lib.rs, VmExecState::Spawn dispatch (~lib.rs:2462)
let child_cancel = cancel.child_token();
let future_ptr = {
    let mut guard = self.futures.acquire(thread.proof()).await;   // parent's permit
    let (future_id, future_ptr) = guard.new_future(child_cancel.clone());
    drop(guard);                                                  // dropped before the await below
    Arc::clone(self)
        .spawn_thread(child_cancel, parent_errors_arc, closure, spawn_name, call_id, future_id)
        .await?;
    future_ptr
};
thread.vm.stack.push(Value::object(future_ptr));
```

`spawn_thread` now only builds the child VM and **registers** the child's permit via `new_permit` (which takes the holders mutex but does **not** acquire a semaphore permit), then fires the task. The child's permit is acquired later, on the spawned task, never nested under the parent. **One permit per task on the spawn path**, so the deadlock cycle cannot form. `spawn_thread_setup` is deleted.

Why this is safe to audit:

- `new_future` is synchronous and only needs a `PermitProof` to prove GC isn't running; the parent's `thread.proof()` satisfies that just as well as a fresh permit did.
- The `FutureManagerGuard` is dropped **before** the `spawn_thread().await`, so no non-`Send` guard crosses a yield point (same pattern already used elsewhere in `run_thread_event_loop`).
- `new_permit` only contends on the holders mutex, and `request_park` is explicitly ordered (semaphore first, then holders mutex) to not deadlock against it.
- The parent holds its permit across the whole dispatch, so GC cannot move `future_ptr` during the await.
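The cycle and its removal can be reproduced in a toy, single-threaded model. This is a minimal sketch, not the engine's code: `FairSemaphore`, `old_spawn_makes_progress`, `new_spawn_makes_progress`, and the value of `MAX_PERMITS` are hypothetical names chosen for illustration, and the model only assumes the FIFO-fairness property described above (a request never jumps ahead of an earlier queued one).

```rust
use std::collections::VecDeque;

/// Toy, single-threaded model of a FIFO-fair counting semaphore.
/// Hypothetical stand-in for illustration; it is not tokio's implementation.
struct FairSemaphore {
    available: usize,
    waiters: VecDeque<(&'static str, usize)>, // (name, permits wanted), oldest first
}

impl FairSemaphore {
    fn new(permits: usize) -> Self {
        Self { available: permits, waiters: VecDeque::new() }
    }

    /// Returns true if granted immediately; otherwise the request is queued.
    /// Fairness: a request never jumps ahead of an earlier queued request.
    fn acquire(&mut self, who: &'static str, n: usize) -> bool {
        if self.waiters.is_empty() && self.available >= n {
            self.available -= n;
            true
        } else {
            self.waiters.push_back((who, n));
            false
        }
    }

    /// Returns permits and wakes queued waiters strictly in FIFO order.
    fn release(&mut self, n: usize) -> Vec<&'static str> {
        self.available += n;
        let mut woken = Vec::new();
        while let Some(&(who, want)) = self.waiters.front() {
            if self.available < want {
                break; // the head of the queue blocks everyone behind it
            }
            self.available -= want;
            self.waiters.pop_front();
            woken.push(who);
        }
        woken
    }
}

const MAX_PERMITS: usize = 8; // arbitrary size for the model

/// Old flow: the parent asks for a second permit while holding its first.
/// Returns true if the spawn path can make progress.
fn old_spawn_makes_progress() -> bool {
    let mut sem = FairSemaphore::new(MAX_PERMITS);
    assert!(sem.acquire("parent", 1)); // P_main
    assert!(!sem.acquire("gc", MAX_PERMITS)); // GC park queues: P_main is out
    // The second permit queues behind GC even though 7 permits are free.
    // The parent only releases P_main after this is granted, so nothing
    // ever calls release(): both waiters are stuck.
    sem.acquire("parent-child-future", 1)
}

/// New flow: allocate under the parent's existing permit, no second acquire.
/// Returns true if GC is woken once the parent gives its permit back.
fn new_spawn_makes_progress() -> bool {
    let mut sem = FairSemaphore::new(MAX_PERMITS);
    assert!(sem.acquire("parent", 1));
    assert!(!sem.acquire("gc", MAX_PERMITS));
    // The child future is allocated here without touching the semaphore.
    // When the parent later gives up its permit, GC is the first one woken.
    sem.release(1) == vec!["gc"]
}
```

Calling the two functions from a `main` shows the difference: the old flow reports no progress because the second request sits behind the queued GC park, while the new flow lets GC proceed as soon as the parent's single permit is returned.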
### Verification

The bug is flaky, so I verified with the forced-GC stress harness (turns the race into a deterministic signal) and then with the real threshold:

| scenario | before fix | after fix |
|---|---|---|
| minimal `spawn/cancel` repro, **GC forced every alloc** | 6/6 hang | **0/10 hang** |
| full 1614-test corpus, real GC threshold | ~50% hang | **0/8 hang** (1614 passed, 0 failed each) |

(The temporary `gc.rs` threshold change was only a debugging aid and is **not** part of this PR.)

---

## Follow-up: borrow `resolved_aliases`, and what's *not* worth optimizing (commit `perf(mir): …`)

Review feedback flagged two remaining per-`LoweringContext` (i.e. per-function) costs. I profiled a full `--list` compile of the corpus before touching either, and the result decided each:

> The whole MIR lowering path (`package_lowering_data` / `LoweringContext::new` / `lower_function`) is **~11 of ~3600 samples**. Compile time is dominated by TIR `infer_scope_types` (`render_scope_diagnostics → infer_scope_types`). So neither of these is a measurable cost on the current corpus.

**Fixed: `resolved_aliases` cloned per context → borrowed.** The other five package-invariant schema maps were already borrowed (`&'db`) from `package_lowering_data`; `resolved_aliases` (a `HashMap` + `HashSet`) was the one I'd left as a per-context clone for expedience. Now it's borrowed too. Whole-struct passes (`&self.resolved_aliases`) became `self.resolved_aliases` (the field is already a reference); `.aliases` sub-field accesses and `.convert(...)` calls are unchanged (they auto-deref through the borrow). Not a measurable speedup here, but it's a strict reduction in per-context allocation, completes the borrow-not-clone design, and is asymptotically `O(contexts × aliases)` for alias-heavy projects.

**Skipped (measured non-issue): making `build_class_type_tags` a tracked query.** It showed up as **0 samples**.
Unlike `populate_from_package`, it does no type lowering, just cached `file_item_tree` reads and `TypeName→i64` inserts. Memoizing it would add a wrapper + `Update` impl + a project-keyed query, and it's **bytecode-affecting** (it assigns the global type-tag numbering that must match the emitter), so it's risk for no measurable benefit. Worth revisiting only if a profile on a larger/real project shows it mattering.

---

## Test status

- `bex_engine`, `bex_vm` unit tests: pass
- `cancel_cascade`, `spawn_array_race`, `spawn_parallel`, `spawn_semantics`, `spawn_specialization` integration tests: pass
- `baml_tests --lib` (1576 tests): pass
- bytecode snapshot (`baml_tests --test baml_src`): pass, **unchanged**
- `cargo fmt` + `cargo clippy -D warnings`: clean (enforced by pre-commit)

## Reviewer notes / blast radius

The engine change is on the **hot `spawn` path** and touches core GC/permit concurrency, so it deserves a careful read despite the small diff. The repro is flaky by nature; I'd suggest a CI loop running the corpus a handful of times to build confidence. The compiler change is performance-only and guarded by the byte-identical bytecode snapshot.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Refactor**
  * Memoized package-level schema and type-alias data to eliminate redundant per-function recomputation and speed up compilation.
  * Centralized project-wide expansion maps for consistent, more efficient expansion across files.
  * Adjusted spawn/allocation flow so child futures are allocated at spawn sites, improving heap-permit handling and runtime stability.
* **Chores**
  * Updated size-gate thresholds and CI artifact size metadata.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
…anch`, non-blocking gate (#3637)

## Why

The size-gate baseline was a stale, hand-recorded local snapshot that only moved when someone remembered to re-run it. Two consequences the team has hit repeatedly:

1. **Drift**: real CI sizes had crept to **+2.9%** under the 3% gate (linux `baml-cli` was ~26 KB from tripping), so even near-no-op PRs trip it. Several ceilings sat ~0% above baseline instead of the intended ~3%.
2. **Broken fix instructions**: the hint pointed at `size-gate record`, which rebuilds **locally** (sizes differ from CI) and only covers the host platform, so following it verbatim can't green CI.

## What changed

**New `cargo-size-gate bake`: adopt CI-measured sizes, no rebuild**

- `bake --branch canary` (new `fetch.rs`) shells to `gh`, finds the newest *completed* CI run on the branch with all four `size-gate-*` reports (ignoring run conclusion and size violations; canary CI is usually red for unrelated reasons), downloads them, writes `.ci/size-gate/<platform>.toml`, and re-pegs the `max_*_bytes` ceilings in `.cargo/size-gate.toml` (comments preserved via `toml_edit`). Idempotent: no size change → no write → no PR.
- Also `bake <files...>` for explicit reports, and `--repo/--download-dir/--summary-out`.
- Exposed as `mise run size-gate-update`.

**New workflow `size-gate-baseline-refresh.yml`: daily refresh**

- Daily (+ manual dispatch). Calls the **same** `bake --branch canary` (no duplicated run-selection), opens PR `chore/size-gate-baseline-refresh` via PAT so its own CI runs, and enables squash **auto-merge** so it self-merges once required checks pass. Reuses the artifacts canary CI already produced: **no rebuilds** (avoids re-paying the mac/windows release builds that are CI's long pole).

**Fixed the fix-hint** to point at `bake --branch canary` instead of the broken `record` flow.

**Made size-gate non-blocking**: removed from `ci-failure-alert.needs`. A size bump no longer blocks the merge queue.
It's a signal (PR comment + daily refresh), not a gate.

**Re-pegged baselines + ceilings to current canary (`ef7c326`)** so the gate is accurate on merge.

## Prerequisites for the daily job

- Repo setting **"Allow auto-merge"** enabled.
- `secrets.SAM_GITHUB_BOUNDARYML_READWRITE` (the PAT `oncall.yml` uses) available to the workflow.
- Scheduled workflows fire from the **default branch**'s copy of the file; merge there for the cron to run (the job checks out canary explicitly regardless).

## Deferred / follow-ups (not in this PR)

- Move size-gate to **post-merge-only** (run on canary push, not PRs/merge-queue) so it stops running on PRs entirely while still feeding the nightly. Kept local for now.
- **Trend graph** (the git history of `.ci/size-gate/*.toml` is a ready-made byte-exact time series). CodSpeed has no custom-metric ingest, so this won't ride that rail cleanly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Bug Fixes**
  * Made size-gate checks informational in CI; they no longer block merges.
* **New Features**
  * Added automated daily baseline refresh workflow for size-gate metrics.
  * Added new `bake` command to update baseline values from CI measurements.
* **Chores**
  * Updated size-gate baselines for multiple platforms with new measurements.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
Adds salsa caching to three functions, resulting in a ~60% speedup in the `baml_tests/baml_cli` compilation time.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added compiler benchmarking suite to measure BAML compilation performance.
  * Added flamegraph profiling tool for analyzing compiler performance bottlenecks.
* **Chores**
  * Optimized internal compilation caching for improved performance.
  * Updated build documentation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Fixes 17 bugs in BEP-044 interface **unions**, found by a union-focused
fuzz of the language. Each was pinned as a failing `union_fuzz_fNN_*`
test in `crates/baml_tests/tests/interfaces.rs` and is now green. A
class-only union method-dispatch case is also pinned as a regression
guard.
Two root causes dominated:
- **`unknown | unknown is not a function`** — a method present on every
member of a union containing an interface was rejected because the
union-member probe resolved interface members to the `Ty::Unknown`
sentinel.
- **VM crashes (`expected map, got instance`, `tagged_int_add`)** —
reading a field / narrowing a generic-interface match arm on a union
dispatched incorrectly at runtime.
## Fixes
| ID | Severity | Fix |
|----|----------|-----|
| F7/F12 | spurious-error / diagnostic | TIR resolves interface union members through the real interface machinery (side effects suppressed); MIR dispatches the call on the runtime class across all members' implementors. No more `unknown \| unknown`. |
| F1/F3/F11 | crash / soundness | MIR dispatches a union field read on the runtime class. Conflicting field types across union interfaces read soundly as `T \| U` (misuse → E0001); a genuinely ambiguous field view → E0131. |
| F2/F5 | crash / wrong-result | A generic-interface match arm (`Slot<int>`) respects its type argument at runtime instead of matching every implementor of the bare interface. |
| F4 | crash | `string + <non-object primitive>` (int/float/bigint/bool/null) is a type error, not an inferred `string` that aborts the VM. `string + uint8array` stays valid. |
| F6 | wrong-result | Reflection compares generic union args as unordered sets (`Box<int\|string>` == `Box<string\|int>`). |
| F8 | spurious-error | A bounded type variable is a subtype of itself (`<T extends I>(a: T) -> T { return a }` compiles). |
| F9 | spurious-error | A match arm overlapping any member of an optional/union scrutinee is accepted (`let a: Animal` over `(Dog \| Cat)?`). |
| F10/F15 | spurious-error / diagnostic | Out-of-body `implements I for <primitive>` resolves for every primitive (not just `int`); union method-not-found blames only the genuinely-lacking member. |
| F13/F14/F16/F17 | diagnostic | No `user.` package-prefix leak in match witnesses; uncovered interface members named instead of `_`; ambiguous union field → E0131; projection errors keep generic args (`Cargo<int>`). |
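
As an illustration of the F6 row, here is a minimal sketch of order-insensitive comparison for union type arguments. The `Ty` enum and the `canonical` / `equivalent` helpers are hypothetical and are not the VM's reflection types; the sketch only assumes that union members form a set while the positions of generic arguments remain significant.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified type representation used only for this sketch.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Ty {
    Int,
    Str,
    Generic(&'static str, Vec<Ty>),
    Union(Vec<Ty>),
}

/// Normalize a type so that union members are sorted and de-duplicated.
/// Generic argument positions are preserved because they are significant.
fn canonical(ty: &Ty) -> Ty {
    match ty {
        Ty::Generic(name, args) => {
            Ty::Generic(*name, args.iter().map(canonical).collect())
        }
        Ty::Union(members) => {
            let set: BTreeSet<Ty> = members.iter().map(canonical).collect();
            Ty::Union(set.into_iter().collect())
        }
        other => other.clone(),
    }
}

/// Two types are equivalent when their canonical forms are identical.
fn equivalent(a: &Ty, b: &Ty) -> bool {
    canonical(a) == canonical(b)
}
```

With this normalization, `Box<int|string>` and `Box<string|int>` compare equal, while a two-argument generic with its arguments swapped does not.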
Snapshot + LSP expectation updates reflect the corrected diagnostics (no
`user.` leak; `string + int` now E0004).
## Test plan
- `cargo test -p baml_tests` — **2058 passed, 0 failed**
- `cargo test -p baml_lsp2_actions_tests` — **369 passed, 0 failed**
- `cargo clippy --all-targets -- -D warnings` — clean
- `cargo fmt` — clean
Merged latest `canary` (salsa-memoization hang fix); the one conflict in
`lower.rs` was resolved by keeping the generic-interface pattern routing
on top of canary's borrowed-`resolved_aliases` API.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **High Risk**
> Touches MIR lowering, TIR inference, VM reflection, and pattern/match
codegen for interfaces and unions—areas that previously caused VM
crashes and silent wrong dispatch; regressions would affect runtime
behavior and type soundness.
>
> **Overview**
> Fixes **BEP-044 interface unions** end-to-end: TIR no longer collapses
interface arms to `unknown | unknown` for callable unions; MIR adds
runtime class-tag dispatch for methods and fields when the receiver is a
union (including optional-wrapped unions and interface members),
including inherited defaults and interface field views. **Generic
interface** `is`/match patterns and fast type-tag switches now respect
type arguments so `Slot<int>` cannot capture `Slot<string>` values.
>
> TIR also gains union-member resolution with diagnostic rollback,
E0121/E0131 for ambiguous shared implementors, out-of-body primitive
members, bounded-generic `T <: T`, match-arm overlap over optional
unions, and stricter `string +` rules for non-object primitives. VM
reflection compares generic args with order-insensitive union
equivalence. Diagnostics and exhaustiveness witnesses drop `user.`
prefixes and show user-facing type names. Large `union_fuzz_fNN_*`
regression tests plus snapshot/LSP updates; `.gitignore` adds
`workflow_scratch_files/`.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
b535f94. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Fixed runtime dispatch and field reads for unions that contain
interface members; tightened generic-interface runtime matching and
narrowed type-match behavior.
* Reduced spurious diagnostics during union-call inference; improved
subtype/pattern-overlap checking and rejected invalid string+int
concatenation.
* Made missing-case diagnostics and witness rendering use user-facing
type names and avoid leaking internal prefixes.
* **Tests**
* Added a large regression suite covering union/interface dispatch,
generic-interface scenarios, diagnostics, and runtime behaviors.
* **Chores**
* Ignored local workflow scratch files.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added a lightweight Bedrock request serializer and media support for images, video, documents, and audio.
  * Introduced local credential/token resolution modules for AWS and Google Cloud with pluggable IO adapters.
* **Improvements**
  * Streamlined Bedrock and Vertex AI authentication flows and request signing for more reliable credential discovery.
  * Simplified credential-provider precedence and expanded test coverage for credential resolution.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## What & why
`baml describe <Class>` and LSP hover treated a class's methods as
incidental text living inside the raw class body. Under a line budget
those methods got chopped out and replaced with `[... skipped N lines
...]`, and LSP hover rendered a class with methods as a bare `class
String {}` — methods were effectively **invisible for discovery**.
This PR makes methods first-class, always-visible output — the same
treatment fields already get.
## End-user impact
- **Methods are always discoverable.** Every method of a class now
appears in `baml describe` with its first-line docstring, full
`function` signature, and definition line range, in dedicated `methods:`
/ `static_methods:` sections. These are **never truncated** — `baml
describe User --budget 5` shows the same methods as the full output.
Method *bodies* only appear when you drill into a specific method.
- **Class body shows fields only.** The body block is canonical BAML
(`name: type,`) containing just the fields; methods render below. A
fields-only body fits any reasonable budget, so it stops triggering
truncation.
- **`baml describe string` works.** Lowercase primitive/keyword aliases
— `string`, `int`, `bigint`, `float`, `bool`, `null`, `uint8array`,
`image`, `audio`, `video`, `pdf`, `json` — resolve to their builtin
classes, alongside the existing canonical (`root.ns.Foo`) and package
(`baml.json.json`) forms.
- **Consistent, canonical type printing** across describe + hover +
signatures: builtins collapse to their alias (`baml.String` → `string`,
`baml.json.json` → `json`), user types read `root.ns.Foo`, lists/maps as
`T[]` / `map<K, V>`. Headers show the canonical FQN in parens when it
differs from the bare name (e.g. `class String (string)`, `class Config
(root.llm.Config)`).
- **Minimal, useful hover.** Hover shows the class docstring + field
shape, plus a one-line `Run \`baml describe <FQN>\`` hint **only when
the class has methods**.
### Before / after (`baml describe string`)
Before: a multi-hundred-line raw class body, truncated mid-way with
`[... skipped 448 lines ...]` — methods unusable.
After:
```
class String (string) <builtin>/baml/string.baml:5-475
/// A UTF-8 encoded string.
/// ...
class String {}
methods:
/// Serializes this string to a JSON value.
function to_json(self) -> json <builtin>/baml/string.baml:8-10
/// Returns the length of the string in UTF-8 bytes.
function length(self) -> int <builtin>/baml/string.baml:25-27
...
static_methods:
function from_code_points(unicode: int[]) -> string throws root.errors.InvalidArgument <builtin>/baml/string.baml:472-474
references (0):
```
## Implementation
- `MethodRef` (describe) and `MethodSig` (type_info) carry signature +
first-line docstring + full range. Signatures resolve via the package
interface — auto-derived methods are skipped, `self` is shown bare, and
`throws never` is omitted.
- One canonical printer: `QualifiedTypeName::builtin_alias` +
`display_ty_canonical_for_file`. It's opted into **only** by the
describe/hover/signature paths; diagnostics, completions, and inlay
hints keep their existing spelling (so other call sites can adopt it
mechanically later).
- `references` now exclude a symbol's own definition span, so a class is
no longer listed as a reference of itself via its own method bodies.
- Codegen (`cg::Class`) and the `truncate_body()` algorithm are
untouched, per the design.
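
The canonical printing rules above can be sketched as follows. This is a hedged illustration, not the PR's code: the `Ty` enum and the free functions `builtin_alias` and `display_canonical` are hypothetical stand-ins (the real entry points are `QualifiedTypeName::builtin_alias` and `display_ty_canonical_for_file`, whose signatures are not shown here), and only the two alias mappings quoted in this description are included.

```rust
/// Hypothetical, simplified type tree used only for this sketch.
enum Ty {
    Named(&'static str), // fully qualified name, e.g. "root.ns.Foo"
    List(Box<Ty>),
    Map(Box<Ty>, Box<Ty>),
}

/// Collapse a builtin's qualified name to its lowercase alias.
/// Only the mappings quoted in the description are listed.
fn builtin_alias(fqn: &str) -> Option<&'static str> {
    match fqn {
        "baml.String" => Some("string"),
        "baml.json.json" => Some("json"),
        _ => None,
    }
}

/// Render a type canonically: aliases for builtins, the qualified name for
/// user types, `T[]` for lists, and `map<K, V>` for maps.
fn display_canonical(ty: &Ty) -> String {
    match ty {
        Ty::Named(fqn) => builtin_alias(fqn).unwrap_or(*fqn).to_string(),
        Ty::List(inner) => format!("{}[]", display_canonical(inner)),
        Ty::Map(key, value) => format!(
            "map<{}, {}>",
            display_canonical(key),
            display_canonical(value)
        ),
    }
}
```

Keeping the alias table in one function mirrors the "one canonical printer" idea: every caller that opts in gets the same spelling, and callers that do not opt in are unaffected.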
## Testing
783 tests green (TIR 136, LSP 114, CLI 164, integration 369). New
coverage: TIR alias round-trip (incl. the `json` special case); CLI
fixtures for instance-only, mixed instance+static, and generic classes,
alias resolution, and a tight-budget never-truncate case; LSP hover
tests for the describe hint. Existing describe/hover snapshots updated
for the new format.
## Known follow-up
Types referenced **only** in a method signature (e.g. `WrapperMarker` in
`-> T | WrapperMarker`) are not yet surfaced under `dependencies:`. The
class-dependency path matches the canonical `pkg.Name` string against
short outline names — a pre-existing gap with no current test coverage.
Fixing it (and adding method-signature deps on top) was deferred to
avoid regressing builtin output; method *listing* itself is unaffected.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Describe output now shows class methods (instance and static) with
full one-line signatures, per-method docstrings, and definition line
ranges; canonical fully-qualified names appended when different.
* **Improvements**
* Lowercase primitives (e.g., string, int, json, image) resolve as
aliases in describe/dispatch and canonical type rendering.
* CLI JSON includes richer per-method metadata.
* Hovers include class docstrings and a "baml describe" hint when
methods exist; method sections are never truncated.
* Non-doc single-line comments are stripped from rendered bodies.
* **Tests**
* Expanded fixtures/snapshots covering methods, alias dispatch, hover
output, comment handling, and truncation guarantees.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
…structions

# Conflicts:
#	baml_language/Cargo.lock
#	baml_language/crates/bridge_cffi/src/lib.rs
…ac targets)

The non-macos/aarch64 kperf shim had `#[inline(always)]` on its no-op enabled/exec_start/exec_end fns, tripping clippy::inline_always under `-D warnings` when compiled for other targets (e.g. CI's cross-target check). Host clippy never compiles the shim, so it slipped through. `#[inline]` is plenty for trivial no-ops. Verified clean via clippy --target x86_64-unknown-linux-musl.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Stacked on #3616 (base = hellovai/trim-events). Makes the `bex_vm` bytecode interpreter faster than CPython 3.9 across most workloads, where it started ~2.2× slower.

## Results (Apple M2 Max, scripts/speedtest run)

fib(1M × fib(50)): total instructions retired −60%, CPU cycles −51% vs the pre-fusion baseline, below the cycle count of an optimal naive hand-written interpreter for the same loop.

## What changed (each commit validated + clippy-clean)

All fusion is an emit-time peephole that rewrites instructions in place, confined to the current basic block (mirrors the existing `StoreVarLoadVar` fusion), so jump targets and block addresses are never affected:

- `AddIntVar/Const`, `SubIntVar/Const`, `CmpIntLtVar/Const` (fold right operand)
- `AddIntVarVar/VarConst`, `SubIntVarVar/VarConst`, `CmpIntLtVarVar/VarConst` (fold both operands, no operand stack pushes)
- `AddIntVarVarStore/VarConstStore` (compute + store directly), `MoveLocal` (fused `x = y`)
- `CmpIntLt*BrFalse/BrTrue` (fuse the loop condition with its branch; branch inversion drops the per-iteration jump-to-body)
- `LoadVar`/`StoreVar`, and a lazy `faulting_pc` (track `cur_pc` once per dispatch instead of writing the frame every op).
- `get_object` heap derefs with a plain-`Function` fast path.
- (`BAML_KPERF=1`, behind an off-by-default `kperf` cargo feature) for deterministic, frequency-independent measurement.

## Key insight

Cycle cost is dominated by branch mispredictions on the giant opcode dispatch `match` (an indirect jump). So the real lever is reducing the number of dispatched ops. That's why fusion keeps paying off (branch inversion: −15% cycles for −9% instructions), while removing the inline per-op counter barely moved cycles. Measure total instructions/cycles, not `instr/op` (fusion shrinks the op-count denominator).

## Validation

fib/boundary-loops/if-else correct; `bex_vm`, `interfaces` (337), `exceptions`, `cancellation`, `host_value_callable`, `errors`, `floats`, `gc`, `dispatch`, `optimization` all green.

## Known follow-ups (diagnosed, not in this PR)

- `baml pack`'s embedded runtime stub needs rebuilding for the new opcodes (`baml run` is unaffected).
- (`split` ~2.2×, alloc-bound), deep recursion (~130 cyc/call, frame setup), collatz (`%`/`/`/`*`/`!=`/`==` still unfused).

🤖 Generated with Claude Code
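
For readers unfamiliar with emit-time peephole fusion, the single-fold case (folding the right operand into the add) can be sketched as below. This is a toy model under stated assumptions, not the `bex_vm` emitter: the `Op` enum, the `Emitter` struct, and its `start_block` / `emit` methods are hypothetical, and only the opcode names `LoadVar`, `AddIntVar`, and `AddIntConst` are taken from the description above.

```rust
/// Hypothetical, simplified instruction set used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op {
    LoadVar(u16),
    LoadConst(i64),
    AddInt,           // pops two operands, pushes their sum
    AddIntVar(u16),   // fused: pops left operand, reads right from a local
    AddIntConst(i64), // fused: pops left operand, right is an immediate
}

/// Toy emitter that tracks where the current basic block begins.
struct Emitter {
    code: Vec<Op>,
    block_start: usize,
}

impl Emitter {
    fn new() -> Self {
        Self { code: Vec::new(), block_start: 0 }
    }

    /// Mark a block boundary (e.g. a jump target). Nothing emitted before
    /// this point may be fused with anything emitted after it.
    fn start_block(&mut self) {
        self.block_start = self.code.len();
    }

    /// Emit an instruction, rewriting the previous one in place when an
    /// `AddInt` directly follows an operand load inside the same block.
    fn emit(&mut self, op: Op) {
        if op == Op::AddInt && self.code.len() > self.block_start {
            let last = self.code.len() - 1;
            let fused = match self.code[last] {
                Op::LoadVar(slot) => Some(Op::AddIntVar(slot)),
                Op::LoadConst(value) => Some(Op::AddIntConst(value)),
                _ => None,
            };
            if let Some(fused_op) = fused {
                self.code[last] = fused_op; // in-place rewrite, no new slot
                return;
            }
        }
        self.code.push(op);
    }
}
```

Because the rewrite only replaces the last instruction of the current block and never inserts or removes an earlier one, instruction indices recorded for earlier jump targets stay valid, which is the property the description relies on.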