Add no-argument String.split whitespace tokenization #3677

Draft
ATX24 wants to merge 2 commits into canary from cursor/no-arg-string-split-8a66

@ATX24 ATX24 commented Jun 4, 2026

The error

Calling String.split() without a separator currently fails arity validation instead of tokenizing whitespace:

function ReverseWords(s: string) -> string {
  let tokens = s.split()
  let reversed = tokens | reverse()
  reversed | join(" ")
}

Current diagnostic from engine/baml-compiler/src/thir/typecheck.rs:

Function baml.String.split expects 2 arguments, got 0

The explicit-separator workaround also produces incorrect tokens for mixed whitespace:

function Tokens() -> string[] {
  let s = " \thello  \nworld\r\nBAML "
  s.split(" ")
}

It splits only on literal spaces, so tab, newline, and carriage-return characters stay attached to the tokens, and repeated, leading, or trailing spaces produce empty fields.
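
The difference can be reproduced in plain Rust, since the root cause below notes that the native implementation delegates to `str::split`. The following is a minimal sketch rather than code from this PR; the helper names `split_on_space` and `split_on_whitespace` are hypothetical and exist only for illustration.

```rust
/// Splits on the literal space character only, mirroring `s.split(" ")`.
/// Tabs and newlines are not separators, and adjacent, leading, or
/// trailing spaces yield empty fields.
pub fn split_on_space(s: &str) -> Vec<&str> {
    s.split(" ").collect()
}

/// Splits on runs of Unicode whitespace using `str::split_whitespace`,
/// which never yields empty tokens.
pub fn split_on_whitespace(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}
```

For the input `" \thello  \nworld\r\nBAML "`, the first helper returns `["", "\thello", "", "\nworld\r\nBAML", ""]`, while the second returns `["hello", "world", "BAML"]`.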

Root cause

  • engine/baml-compiler/src/thir/typecheck.rs registered baml.String.split as a two-parameter native signature: receiver plus explicit separator, and method-call arity validation rejected receiver-only calls.
  • engine/baml-vm/src/vm.rs enforced each native function's fixed arity before dispatch, so a receiver-only bytecode call to baml.String.split could not reach the native implementation.
  • engine/baml-vm/src/native.rs::string_split always read args[1] as the delimiter and used Rust str::split.
  • engine/baml-compiler/src/thir/interpret.rs::evaluate_method_call also required exactly one delimiter argument.
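
The arity rejection described above can be modeled roughly as follows. This is a simplified sketch, not the actual typechecker or VM code: the `Arity` enum, `check_arity`, and the error text are hypothetical, and argument counts here include the receiver. An exact arity of 2 rejects a receiver-only call, whereas a range of 1 to 2 would be one possible way to admit both forms.

```rust
/// Hypothetical arity descriptor; counts include the method receiver.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Arity {
    /// Exactly `n` arguments are required.
    Exact(usize),
    /// Any count from `min` to `max`, inclusive, is accepted.
    Between { min: usize, max: usize },
}

impl Arity {
    /// Returns true when `got` arguments satisfy this arity.
    pub fn accepts(self, got: usize) -> bool {
        match self {
            Arity::Exact(n) => got == n,
            Arity::Between { min, max } => (min..=max).contains(&got),
        }
    }
}

/// Validates a call before dispatch; the error text is illustrative only.
pub fn check_arity(name: &str, arity: Arity, got: usize) -> Result<(), String> {
    if arity.accepts(got) {
        Ok(())
    } else {
        Err(format!("{name} rejected: {got} argument(s) does not satisfy {arity:?}"))
    }
}
```

With this model, `check_arity("baml.String.split", Arity::Exact(2), 1)` returns an error, which corresponds to the receiver-only call never reaching the native implementation.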

The fix

  • Allow s.split() through THIR typechecking while preserving s.split(separator) typechecking.
  • Allow the VM to dispatch baml.String.split with a single VM argument (the receiver), but only for the no-argument form.
  • Update string_split to use Rust split_whitespace() when called with only the receiver, which collapses contiguous whitespace and omits empty tokens; explicit separators continue to use str::split.
  • Update the direct THIR interpreter path to support both zero and one split argument.
  • Add Rust tests for no-argument whitespace splitting and typechecking.
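
The dispatch on argument count can be sketched as below. This is an illustration under simplifying assumptions, not the code in engine/baml-vm/src/native.rs: the real native function operates on VM values, whereas this hypothetical `string_split` takes plain string slices, with `args[0]` as the receiver and an optional `args[1]` as the separator.

```rust
/// Hypothetical, simplified stand-in for the native `string_split`.
/// `args[0]` is the receiver; `args[1]`, when present, is the separator.
pub fn string_split(args: &[&str]) -> Result<Vec<String>, String> {
    match args {
        // Receiver only: tokenize on runs of whitespace, no empty tokens.
        [receiver] => Ok(receiver.split_whitespace().map(str::to_owned).collect()),
        // Receiver plus separator: keep the existing `str::split` behavior,
        // including empty fields between adjacent separators.
        [receiver, separator] => Ok(receiver.split(*separator).map(str::to_owned).collect()),
        // Any other count is an arity error.
        _ => Err(format!(
            "string_split expects 1 or 2 arguments, got {}",
            args.len()
        )),
    }
}
```

Matching on the slice keeps the explicit-separator path untouched, so existing `s.split(separator)` calls would behave as before in this sketch.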

Verification

Passing commands:

$ mise exec -- cargo test -p baml-vm string_split --test strings -- --nocapture && mise exec -- cargo test -p baml-compiler typecheck_string_split_without_separator --lib -- --nocapture
...
test result: ok
$ mise exec -- cargo test --lib
...
test result: ok
$ mise exec -- cargo test --features skip-integ-tests
...
test result: ok

The same reproduction now succeeds via the VM test added in engine/baml-vm/tests/strings.rs:

function main() -> string[] {
  let s = " \thello  \nworld\r\nBAML "
  s.split()
}

Expected and now-passing output:

["hello", "world", "BAML"]

Full language integration runner status:

$ mise exec -- ./run-tests.sh

This reached the TypeScript integration phase and then blocked on an interactive Infisical login prompt:

No valid login session found, triggering login flow
? Select your hosting option:

Running the TypeScript integration tests directly without Infisical also failed, due to missing or invalid provider credentials rather than this code change:

$ pnpm test -- --silent false --testTimeout 60000
...
LLM client 'GPT35' requires environment variable 'OPENAI_API_KEY' to be set but it is not
LLM client 'Sonnet' requires environment variable 'ANTHROPIC_API_KEY' to be set but it is not
LLM client 'Gemini' requires environment variable 'GOOGLE_API_KEY' to be set but it is not
Request failed with status code: 401 Unauthorized ... invalid_api_key
Test Suites: 34 failed, 8 passed, 42 total
Tests: 155 failed, 81 passed, 236 total

Issue Reference

  • This PR fixes/closes #[issue number]

Changes

Implemented no-argument String.split() for whitespace tokenization while preserving explicit separator behavior.

Testing

  • Unit tests added/updated
  • Manual testing performed through focused VM/typechecker tests
  • Tested in Cursor Cloud Linux environment

Screenshots

Not applicable.

PR Checklist

  • I have read and followed the contributing guidelines
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

Additional Notes

The full integration suite requires authenticated provider credentials or a valid Infisical session in this environment.

vercel Bot commented Jun 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| beps | Ready | Preview, Comment | Jun 4, 2026 1:59am |
| promptfiddle2 | Ready | Preview, Comment | Jun 4, 2026 1:59am |

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| promptfiddle | Skipped | | Jun 4, 2026 1:59am |

coderabbitai Bot commented Jun 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7993b926-e6d4-4b5e-9b5c-970aabd7dbc4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

github-actions Bot commented Jun 4, 2026

⏭️ Performance benchmarks were skipped

Perf benchmarks (CodSpeed) are opt-in on pull requests — they no longer run on every push. They always run automatically after merge to canary/main.

To run them on this PR, do any of the following, then push a commit (or re-run CI):

  • Add RUN_CODSPEED=1 to the PR description, or
  • Include run-perf or /perf in the PR title or any commit message.
