Skip to content

fix(parser): close CR-012 benchmark-driven parser/tokenizer gaps#19

Merged
yigalrozenberg merged 2 commits into
masterfrom
fix/cr-012-benchmark-parser-gaps
Jun 3, 2026
Merged

fix(parser): close CR-012 benchmark-driven parser/tokenizer gaps#19
yigalrozenberg merged 2 commits into
masterfrom
fix/cr-012-benchmark-parser-gaps

Conversation

@yigalrozenberg

Copy link
Copy Markdown
Collaborator

CR-012 — Benchmark-driven parser/tokenizer gap closure

Closes every gap surfaced by the sql-ast-benchmark corpus run that was
in scope for CR-012 (see tmp/CR-012-sqlglot-rust-benchmark-parser-gaps.md).
Gap 10 (SQL*Plus continuation lines) is explicitly out of scope per the CR.

Commits

  1. b71bb4b — Gaps 1, 2 (partial), 3, 9.
  2. 36a02c1 — Gaps 2 (full), 4, 5, 6, 7, 8.

What's covered

# Gap Resolution
1 Unicode identifier characters is_identifier_start / is_identifier_continue use char::is_alphabetic / is_alphanumeric, accepting Latin-1 letters, superscripts, etc.
2 SELECT ALL quantifier + non-reserved keywords as identifiers parse_select_body consumes optional ALL; is_name_token extended with RANGE, CONFLICT, UNNEST, TEXT, SHOW, DESCRIBE, ANALYZE, INDEX.
3 $ mid-identifier $ continues an identifier when it's not the leading char (preserves $1 parameter semantics).
4 Verbs we don't model in detail New Statement::Command { kind, body } raw-tail capture for SET/SHOW/DESCRIBE/ANALYZE (standalone)/COMMENT ON/GRANT/REVOKE/GO/DECLARE/LOAD/REM/REMARK/RESET/PRAGMA/VACUUM/REINDEX/CALL/LOCK/UNLOCK/CLUSTER/REFRESH/CHECKPOINT/LISTEN/NOTIFY/PREPARE/EXECUTE/DEALLOCATE/DISCARD/COPY/ATTACH/DETACH plus fallback for CREATE OPERATOR / AGGREGATE / SEQUENCE / FUNCTION / TEXT SEARCH ….
5 Vendor-specific ALTER TABLE tails parse_alter_or_command rewinds on unknown actions (MySQL CONVERT TO CHARACTER SET … COLLATE …, Hive PARTITION … COMPACT …, T-SQL WITH (…) CHECK CONSTRAINT …) into a Statement::Command.
6 Aggregate ORDER BY / WITHIN GROUP Expr::Function gains order_by: Vec<OrderByItem> + within_group: bool. Parser accepts inline ORDER BY in arg lists and trailing WITHIN GROUP (ORDER BY …). Generator round-trips both forms.
7 CREATE TABLE … AS VALUES … parse_create_or_command rewinds the CREATE to a Statement::Command when the AS-payload isn't a SELECT.
8 Postgres @> / <@ containment New TokenType::AtArrow / ArrowAt; new BinaryOperator::AtArrow / ArrowAt; tokenizer, parser, and generator wired end-to-end.
9 Qualified UPDATE … SET alias.col = … SET loop accepts dotted LHS.

AST surface changes

All additions are serde-additive — existing payloads deserialize unchanged.

// New Statement variant
Statement::Command(CommandStatement {
    comments: Vec<String>,
    kind: String,   // "SET", "SHOW", "ALTER", ...
    body: String,   // verbatim tail up to `;` / EOF
})

// New Expr::Function fields (default empty / false)
Expr::Function {
    // ...existing fields...
    order_by: Vec<OrderByItem>,
    within_group: bool,
}

// New operators
BinaryOperator::AtArrow   // @>
BinaryOperator::ArrowAt   // <@

// New tokens
TokenType::AtArrow
TokenType::ArrowAt

Tests

tests/test_benchmark_regressions.rs — 29 regression tests, one per gap
example from the benchmark report.

Full suite: 1086 tests pass, 0 failures, 0 ignored.
cargo clippy --lib --tests produces no new warnings.

Out of scope

  • Gap 10 (SQL*Plus - continuation lines) — needs a preprocessor pass,
    tracked separately.
  • PG ~ regex and ? jsonb-has-key — conflict with existing
    BitwiseNot and Parameter token roles; deferred until we restructure
    those token classes.

Risk

Low. All changes are additive at the AST level. Behavior changes are
gated behind new tokens / new statement verbs that previously errored,
so prior-passing inputs still parse identically.

… UPDATE SET (CR-012)

Surfaced by the sql-ast-benchmark report against v0.9.37:

- Gap 1: tokenizer now accepts any Unicode letter as identifier-start and
  any Unicode alphanumeric as identifier-continue, matching SQL:2003 §5.2
  (unblocks identifiers like 'regionalliga_süd', 'area_in_1000km²',
  'København' that PG/MySQL/SQLite/Oracle/ClickHouse all accept today).
- Gap 3: '$' is now allowed mid-identifier; '$N' parameter form still
  works because identifier-start excludes '$'.
- Gap 2 (partial): consume the optional SQL-standard 'SELECT ALL'
  quantifier instead of trying to parse 'ALL' as a column.
- Gap 9: 'UPDATE … SET' now accepts a dotted LHS like 'alias.col'
  (Oracle/T-SQL idiom; 318 of the Oracle benchmark failures).

Adds tests/test_benchmark_regressions.rs pinning each gap so future
parser/tokenizer changes do not silently regress acceptance on the
real-world benchmark corpora.
… GROUP, PG `@>` / `<@`, CTAS VALUES, ALTER fallback)

Completes coverage of the remaining sql-ast-benchmark gaps that were
deferred from the initial CR-012 pass.

AST additions (all serde-additive, no breaking changes):
  * Statement::Command(CommandStatement { kind, body }) — raw-tail
    statement that preserves verbs we don't model in detail. Covers
    Gap 4 (SET, SHOW, DESCRIBE, ANALYZE standalone, COMMENT ON, GRANT,
    REVOKE, GO, DECLARE, LOAD, REM/REMARK, RESET, PRAGMA, VACUUM,
    REINDEX, CALL, LOCK/UNLOCK, CLUSTER, REFRESH, CHECKPOINT,
    LISTEN/NOTIFY, PREPARE/EXECUTE/DEALLOCATE, DISCARD, COPY,
    ATTACH/DETACH) plus the vendor-specific ALTER (Gap 5) and CREATE
    OPERATOR / AGGREGATE / SEQUENCE / FUNCTION / TEXT SEARCH … (Gap 4)
    fallbacks, plus CREATE TABLE … AS VALUES (Gap 7).
  * Expr::Function gains `order_by: Vec<OrderByItem>` and
    `within_group: bool` so aggregates round-trip:
      array_agg(x ORDER BY y DESC)
      percentile_cont(0.5) WITHIN GROUP (ORDER BY salary)
      string_agg(x, ',') WITHIN GROUP (ORDER BY id)
    Both fields default to empty / false (Gap 6).
  * BinaryOperator gains AtArrow (`@>`) and ArrowAt (`<@`) for PG
    array / jsonb / range containment (Gap 8).

Tokenizer:
  * `@>` and `<@` tokenize as TokenType::AtArrow / ArrowAt.
  * is_identifier_start / is_identifier_continue split out earlier
    (covered by Gap 1) is reused here.

Parser dispatcher:
  * Each command verb is recognized at statement boundary and the
    remainder up to `;`/EOF is captured verbatim (paren-depth aware,
    string-literal aware).
  * ALTER and CREATE wrap their respective specialized parsers in a
    rewind-on-error guard that falls back to Statement::Command.
  * Aggregate function call parser accepts an in-arglist ORDER BY and
    a trailing WITHIN GROUP (ORDER BY …) clause.
  * is_name_token extends to RANGE, CONFLICT, UNNEST, TEXT, SHOW,
    DESCRIBE, ANALYZE, INDEX so they parse as column identifiers
    (Gap 2 full).
  * attach_comments_to_statement handles the new Command arm.

Generator:
  * gen_command writes `<kind> <body>` verbatim.
  * gen_expr for Expr::Function emits inline ORDER BY before `)` when
    `within_group == false`, or appends `) WITHIN GROUP (ORDER BY …)`
    when set. New helper gen_order_by_items_inline shared with the
    existing gen_order_by.
  * binary_op_str handles AtArrow / ArrowAt.

Tests (tests/test_benchmark_regressions.rs):
  * +20 regression tests pinning every gap from this commit.
  * Full suite remains green (≈ 1086 tests pass).

Refs CR-012.
@yigalrozenberg yigalrozenberg merged commit ef62e99 into master Jun 3, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant