fix(parser): close CR-012 benchmark-driven parser/tokenizer gaps #19
Merged
… UPDATE SET (CR-012)
Surfaced by the sql-ast-benchmark report against v0.9.37:
- Gap 1: tokenizer now accepts any Unicode letter as identifier-start and any Unicode alphanumeric as identifier-continue, matching SQL:2003 §5.2 (unblocks identifiers like 'regionalliga_süd', 'area_in_1000km²', 'København' that PG/MySQL/SQLite/Oracle/ClickHouse all accept today).
- Gap 3: '$' is now allowed mid-identifier; '$N' parameter form still works because identifier-start excludes '$'.
- Gap 2 (partial): consume the optional SQL-standard 'SELECT ALL' quantifier instead of trying to parse 'ALL' as a column.
- Gap 9: 'UPDATE … SET' now accepts a dotted LHS like 'alias.col' (Oracle/T-SQL idiom; 318 of the Oracle benchmark failures).
Adds tests/test_benchmark_regressions.rs pinning each gap so future parser/tokenizer changes do not silently regress acceptance on the real-world benchmark corpora.
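The identifier rules from Gaps 1 and 3 can be sketched roughly as below. This is an illustration, not the crate's code: the predicate names match the ones mentioned in this PR, but their bodies, the underscore handling, and the `scan_identifier` helper are assumptions.

```rust
/// Assumed shape of the start predicate: any Unicode letter, plus `_`.
/// `$` is deliberately excluded so that `$1` still lexes as a parameter.
fn is_identifier_start(c: char) -> bool {
    c.is_alphabetic() || c == '_'
}

/// Assumed shape of the continue predicate: any Unicode alphanumeric
/// (which includes "other number" characters such as `²`), plus `_` and `$`.
fn is_identifier_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_' || c == '$'
}

/// Hypothetical helper: returns the identifier at the start of `input`,
/// or `None` if `input` does not begin with an identifier-start character.
fn scan_identifier(input: &str) -> Option<&str> {
    let mut chars = input.char_indices();
    let (_, first) = chars.next()?;
    if !is_identifier_start(first) {
        return None;
    }
    // The identifier ends at the first character that cannot continue it.
    let end = chars
        .find(|&(_, c)| !is_identifier_continue(c))
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    Some(&input[..end])
}
```

With these rules `scan_identifier("sys$users,")` yields `sys$users`, while `scan_identifier("$1")` yields `None` and leaves the `$` to the parameter path.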
… GROUP, PG `@>` / `<@`, CTAS VALUES, ALTER fallback)
Completes coverage of the remaining sql-ast-benchmark gaps that were
deferred from the initial CR-012 pass.
AST additions (all serde-additive, no breaking changes):
* Statement::Command(CommandStatement { kind, body }) — raw-tail
statement that preserves verbs we don't model in detail. Covers
Gap 4 (SET, SHOW, DESCRIBE, ANALYZE standalone, COMMENT ON, GRANT,
REVOKE, GO, DECLARE, LOAD, REM/REMARK, RESET, PRAGMA, VACUUM,
REINDEX, CALL, LOCK/UNLOCK, CLUSTER, REFRESH, CHECKPOINT,
LISTEN/NOTIFY, PREPARE/EXECUTE/DEALLOCATE, DISCARD, COPY,
ATTACH/DETACH) plus the vendor-specific ALTER (Gap 5) and CREATE
OPERATOR / AGGREGATE / SEQUENCE / FUNCTION / TEXT SEARCH … (Gap 4)
fallbacks, plus CREATE TABLE … AS VALUES (Gap 7).
* Expr::Function gains `order_by: Vec<OrderByItem>` and
`within_group: bool` so aggregates round-trip:
array_agg(x ORDER BY y DESC)
percentile_cont(0.5) WITHIN GROUP (ORDER BY salary)
string_agg(x, ',') WITHIN GROUP (ORDER BY id)
Both fields default to empty / false (Gap 6).
* BinaryOperator gains AtArrow (`@>`) and ArrowAt (`<@`) for PG
array / jsonb / range containment (Gap 8).
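A simplified picture of the additive AST shape, under the assumption that call sites build nodes with defaults for the new fields. The types below are hypothetical stand-ins (the real `Expr::Function` holds expression nodes, not strings, and the real `Statement` has many more arms).

```rust
/// Stand-in for the crate's ORDER BY item; `expr` would be an AST node.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct OrderByItem {
    pub expr: String,
    pub descending: bool,
}

/// Stand-in for the payload of `Expr::Function` after this change.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct FunctionCall {
    pub name: String,
    pub args: Vec<String>,
    /// New: empty unless the call carried an ORDER BY.
    pub order_by: Vec<OrderByItem>,
    /// New: false means the ORDER BY (if any) was written inside the arg list.
    pub within_group: bool,
}

/// Raw-tail statement: the verb plus everything after it, unparsed.
#[derive(Debug, Clone, PartialEq)]
pub struct CommandStatement {
    pub kind: String,
    pub body: String,
}

/// Only the new arm is shown here.
#[derive(Debug, Clone, PartialEq)]
pub enum Statement {
    Command(CommandStatement),
}
```

A plain call such as `count(*)` can then be built with `FunctionCall { name, args, ..Default::default() }` and never touches the two new fields.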
Tokenizer:
* `@>` and `<@` tokenize as TokenType::AtArrow / ArrowAt.
* is_identifier_start / is_identifier_continue split out earlier
(covered by Gap 1) is reused here.
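One way to lex the two containment operators is a longest-match check ahead of the single-character operators. The sketch below assumes that approach; `lex_operator` and the `Lt` / `At` variants are hypothetical, and only `AtArrow` / `ArrowAt` are names taken from this commit.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenType {
    AtArrow, // `@>`
    ArrowAt, // `<@`
    Lt,      // `<`  (hypothetical single-character fallback)
    At,      // `@`  (hypothetical single-character fallback)
}

/// Returns the operator token at the start of `input` and its byte length.
/// Two-character operators are tried first so `<@` is not split into `<`, `@`.
fn lex_operator(input: &str) -> Option<(TokenType, usize)> {
    if input.starts_with("@>") {
        Some((TokenType::AtArrow, 2))
    } else if input.starts_with("<@") {
        Some((TokenType::ArrowAt, 2))
    } else if input.starts_with('<') {
        Some((TokenType::Lt, 1))
    } else if input.starts_with('@') {
        Some((TokenType::At, 1))
    } else {
        None
    }
}
```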
Parser dispatcher:
* Each command verb is recognized at statement boundary and the
remainder up to `;`/EOF is captured verbatim (paren-depth aware,
string-literal aware).
* ALTER and CREATE wrap their respective specialized parsers in a
rewind-on-error guard that falls back to Statement::Command.
* Aggregate function call parser accepts an in-arglist ORDER BY and
a trailing WITHIN GROUP (ORDER BY …) clause.
* is_name_token extends to RANGE, CONFLICT, UNNEST, TEXT, SHOW,
DESCRIBE, ANALYZE, INDEX so they parse as column identifiers
(Gap 2 full).
* attach_comments_to_statement handles the new Command arm.
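The raw-tail capture described above can be approximated as follows. This is a sketch over a plain string rather than the crate's token stream, and `capture_raw_tail` is a hypothetical name. It only understands single-quoted literals; quoted identifiers, dollar-quoting, and comments are left out.

```rust
/// Splits `input` at the first `;` that is outside parentheses and outside
/// a single-quoted string. Returns `(body, rest)`; `rest` is empty at EOF.
fn capture_raw_tail(input: &str) -> (&str, &str) {
    let mut depth: usize = 0;
    let mut in_string = false;
    for (i, c) in input.char_indices() {
        if in_string {
            // A doubled quote ('') closes and immediately reopens the
            // literal, so toggling on every quote handles the escape.
            if c == '\'' {
                in_string = false;
            }
            continue;
        }
        match c {
            '\'' => in_string = true,
            '(' => depth += 1,
            ')' => depth = depth.saturating_sub(1),
            ';' if depth == 0 => return (input[..i].trim(), &input[i + 1..]),
            _ => {}
        }
    }
    (input.trim(), "")
}
```

For example, the tail of `SET search_path = 'a;b', public; SELECT 1` is captured up to the second semicolon, because the first one sits inside a string literal.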
Generator:
* gen_command writes `<kind> <body>` verbatim.
* gen_expr for Expr::Function emits inline ORDER BY before `)` when
`within_group == false`, or appends `) WITHIN GROUP (ORDER BY …)`
when set. New helper gen_order_by_items_inline shared with the
existing gen_order_by.
* binary_op_str handles AtArrow / ArrowAt.
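The two output forms for aggregates can be illustrated with a string-based sketch. `gen_order_by_items_inline` is the helper name from this commit, but its signature here is an assumption, and `gen_function` is a hypothetical stand-in for the `Expr::Function` arm of `gen_expr`. ORDER BY items are modelled as `(expression, descending)` pairs.

```rust
/// Renders ORDER BY items without the `ORDER BY` keyword itself.
fn gen_order_by_items_inline(items: &[(&str, bool)]) -> String {
    items
        .iter()
        .map(|&(expr, descending)| {
            if descending {
                format!("{expr} DESC")
            } else {
                expr.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(", ")
}

/// Renders a function call, choosing between the inline ORDER BY form and
/// the trailing WITHIN GROUP form based on `within_group`.
fn gen_function(
    name: &str,
    args: &[&str],
    order_by: &[(&str, bool)],
    within_group: bool,
) -> String {
    let mut out = format!("{}({}", name, args.join(", "));
    if order_by.is_empty() {
        out.push(')');
    } else if within_group {
        out.push_str(") WITHIN GROUP (ORDER BY ");
        out.push_str(&gen_order_by_items_inline(order_by));
        out.push(')');
    } else {
        out.push_str(" ORDER BY ");
        out.push_str(&gen_order_by_items_inline(order_by));
        out.push(')');
    }
    out
}
```

Calling `gen_function("array_agg", &["x"], &[("y", true)], false)` reproduces the inline form from the commit message, and passing `true` for `within_group` switches to the trailing clause.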
Tests (tests/test_benchmark_regressions.rs):
* +20 regression tests pinning every gap from this commit.
* Full suite remains green (≈ 1086 tests pass).
Refs CR-012.
CR-012 — Benchmark-driven parser/tokenizer gap closure
Closes every gap surfaced by the `sql-ast-benchmark` corpus run that was
in scope for CR-012 (see tmp/CR-012-sqlglot-rust-benchmark-parser-gaps.md).
Gap 10 (SQL*Plus continuation lines) is explicitly out of scope per the CR.
Commits
b71bb4b: Gaps 1, 2 (partial), 3, 9.
36a02c1: Gaps 2 (full), 4, 5, 6, 7, 8.
What's covered
- Gap 1: `is_identifier_start` / `is_identifier_continue` use `char::is_alphabetic` / `is_alphanumeric`, accepting Latin-1 letters, superscripts, etc.
- Gap 2, `SELECT ALL` quantifier + non-reserved keywords as identifiers: `parse_select_body` consumes optional `ALL`; `is_name_token` extended with `RANGE`, `CONFLICT`, `UNNEST`, `TEXT`, `SHOW`, `DESCRIBE`, `ANALYZE`, `INDEX`.
- Gap 3, `$` mid-identifier: `$` continues an identifier when it's not the leading char (preserves `$1` parameter semantics).
- Gap 4: `Statement::Command { kind, body }` raw-tail capture for `SET` / `SHOW` / `DESCRIBE` / `ANALYZE` (standalone) / `COMMENT ON` / `GRANT` / `REVOKE` / `GO` / `DECLARE` / `LOAD` / `REM` / `REMARK` / `RESET` / `PRAGMA` / `VACUUM` / `REINDEX` / `CALL` / `LOCK` / `UNLOCK` / `CLUSTER` / `REFRESH` / `CHECKPOINT` / `LISTEN` / `NOTIFY` / `PREPARE` / `EXECUTE` / `DEALLOCATE` / `DISCARD` / `COPY` / `ATTACH` / `DETACH`, plus fallback for `CREATE OPERATOR / AGGREGATE / SEQUENCE / FUNCTION / TEXT SEARCH …`.
- Gap 5, `ALTER TABLE` tails: `parse_alter_or_command` rewinds on unknown actions (MySQL `CONVERT TO CHARACTER SET … COLLATE …`, Hive `PARTITION … COMPACT …`, T-SQL `WITH (…) CHECK CONSTRAINT …`) into a `Statement::Command`.
- Gap 6, `ORDER BY` / `WITHIN GROUP`: `Expr::Function` gains `order_by: Vec<OrderByItem>` + `within_group: bool`. Parser accepts inline `ORDER BY` in arg lists and trailing `WITHIN GROUP (ORDER BY …)`. Generator round-trips both forms.
- Gap 7, `CREATE TABLE … AS VALUES …`: `parse_create_or_command` rewinds the `CREATE` to a `Statement::Command` when the AS-payload isn't a SELECT.
- Gap 8, `@>` / `<@` containment: `TokenType::AtArrow` / `ArrowAt`; new `BinaryOperator::AtArrow` / `ArrowAt`; tokenizer, parser, and generator wired end-to-end.
- Gap 9, `UPDATE … SET alias.col = …`: a dotted left-hand side in `SET` is accepted.
AST surface changes
All additions are serde-additive — existing payloads deserialize unchanged.
Tests
`tests/test_benchmark_regressions.rs`: 29 regression tests, one per gap example from the benchmark report.
Full suite: 1086 tests pass, 0 failures, 0 ignored.
`cargo clippy --lib --tests` produces no new warnings.
Out of scope
Gap 10 (SQL*Plus `-` continuation lines): needs a preprocessor pass, tracked separately.
`~` regex and `?` jsonb-has-key: conflict with existing `BitwiseNot` and `Parameter` token roles; deferred until we restructure those token classes.
Risk
Low. All changes are additive at the AST level. Behavior changes are
gated behind new tokens / new statement verbs that previously errored,
so prior-passing inputs still parse identically.
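The reason previously passing inputs are unaffected is that the fallback is only reached after the specialized parser has already failed. A minimal sketch of such a rewind-on-error guard, assuming a whitespace-split token list and a toy specialized parser that understands only `ALTER TABLE <t> DROP COLUMN <c>` (the `Parser` type, `advance`, `expect`, and `AlterDropColumn` are all hypothetical; only `parse_alter_or_command` and `Statement::Command` are names from this PR):

```rust
#[derive(Debug, PartialEq)]
enum Statement {
    AlterDropColumn { table: String, column: String },
    Command { kind: String, body: String },
}

struct Parser<'a> {
    tokens: Vec<&'a str>,
    pos: usize,
}

impl<'a> Parser<'a> {
    fn new(sql: &'a str) -> Self {
        Parser { tokens: sql.split_whitespace().collect(), pos: 0 }
    }

    fn advance(&mut self) -> Result<&'a str, String> {
        let tok = self
            .tokens
            .get(self.pos)
            .copied()
            .ok_or_else(|| "unexpected end of input".to_string())?;
        self.pos += 1;
        Ok(tok)
    }

    fn expect(&mut self, keyword: &str) -> Result<(), String> {
        let tok = self.advance()?;
        if tok.eq_ignore_ascii_case(keyword) {
            Ok(())
        } else {
            Err(format!("expected {keyword}, found {tok}"))
        }
    }

    /// Toy specialized parser; errors on any action it does not model.
    fn parse_alter(&mut self) -> Result<Statement, String> {
        self.expect("ALTER")?;
        self.expect("TABLE")?;
        let table = self.advance()?.to_string();
        self.expect("DROP")?;
        self.expect("COLUMN")?;
        let column = self.advance()?.to_string();
        Ok(Statement::AlterDropColumn { table, column })
    }

    /// Tries the specialized parser first; on error, rewinds to the saved
    /// position and captures the statement as a raw command instead.
    fn parse_alter_or_command(&mut self) -> Statement {
        let start = self.pos;
        match self.parse_alter() {
            Ok(stmt) => stmt,
            Err(_) => {
                self.pos = start; // rewind past whatever was consumed
                let kind = self
                    .tokens
                    .get(start)
                    .map(|t| t.to_uppercase())
                    .unwrap_or_default();
                // Joining tokens is a simplification; the PR describes a
                // verbatim capture of the remaining text.
                let body = self
                    .tokens
                    .get(start + 1..)
                    .map(|rest| rest.join(" "))
                    .unwrap_or_default();
                self.pos = self.tokens.len();
                Statement::Command { kind, body }
            }
        }
    }
}
```

An input the toy parser understands still produces the structured node; only an input that would have errored takes the `Command` path.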