
fix(schema): quote column names atomically + feat(clickhouse): bulk-insert builder #13

Open
lohanidamodar wants to merge 5 commits into main from feat/clickhouse-nested-column-quoting

Conversation

Contributor

@lohanidamodar lohanidamodar commented May 21, 2026

Summary

Column names containing a literal . (e.g. meta.key) are mis-quoted by QuotesIdentifiers::quote(), which unconditionally splits on . and treats each segment as a qualifier. The most visible victim is ClickHouse's standard nested-array convention, where meta.key Array(String) is a top-level column whose name literally contains a dot. The bug is dialect-general, though: any column whose name contains a . is emitted as if it were a qualified reference.

Before:

-- $schema->table('events')->array('meta.key', ColumnType::String)->create()
CREATE TABLE `events` (`meta`.`key` Array(String)) ENGINE = ...
-- ClickHouse rejects this — `meta`.`key` is read as a subobject access,
-- not a column reference, and is invalid DDL in this position.

After:

CREATE TABLE `events` (`meta.key` Array(String)) ENGINE = ...

What's changing

  • Adds QuotesIdentifiers::quoteLiteral() for atomic identifiers — no qualifier splitting, just wrap + escape (matching quote()'s * and control-character handling).
  • Switches column-emission paths in the schema compilers (Schema, ClickHouse, PostgreSQL, SQLite, MySQL, and the shared ForeignKeys trait) to quoteLiteral() — CREATE TABLE column lists, ALTER TABLE ADD/DROP/RENAME COLUMN, PRIMARY KEY / UNIQUE / FOREIGN KEY column lists, index column lists, ClickHouse ORDER BY columns, ClickHouse engine column-arg positions (ReplacingMergeTree's version column, SummingMergeTree's value columns, CollapsingMergeTree's sign column), and COMMENT ON COLUMN.
  • quote() itself is unchanged. Qualifier-friendly call sites — table names, FROM clauses, FK refTable, view / procedure / trigger / sequence / type / schema / database / extension names, builder SELECT / alias / CTE references — all keep splitting on . exactly as before.
  • MongoDB::quote() is a no-op identity; MongoDB::quoteLiteral() matches.
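
To make the difference between the two paths concrete, here is a minimal sketch of the splitting and atomic behaviours described above. It is not the trait's actual code: sketchQuote() and sketchQuoteLiteral() are hypothetical stand-ins, and the handling of a bare * (passed through) and of control characters (rejected with an exception) is an assumption based on the description in this PR.

```php
<?php
// Hypothetical stand-ins for QuotesIdentifiers::quoteLiteral() and quote().
// Names, signatures, and the exact * / control-character rules are assumptions.

function sketchQuoteLiteral(string $identifier, string $wrap = '`'): string
{
    // Assumed: a bare * passes through unquoted.
    if ($identifier === '*') {
        return '*';
    }
    // Assumed: control characters are rejected rather than escaped.
    if (preg_match('/[\x00-\x1F\x7F]/', $identifier) === 1) {
        throw new InvalidArgumentException('Identifier contains a control character');
    }
    // Escape the wrap character by doubling it, then wrap the whole name once.
    return $wrap . str_replace($wrap, $wrap . $wrap, $identifier) . $wrap;
}

function sketchQuote(string $identifier, string $wrap = '`'): string
{
    // Qualifier-aware path: every dot-separated segment is quoted on its own.
    $segments = explode('.', $identifier);

    return implode('.', array_map(
        static fn (string $segment): string => sketchQuoteLiteral($segment, $wrap),
        $segments
    ));
}

echo sketchQuote('meta.key'), "\n";        // `meta`.`key`
echo sketchQuoteLiteral('meta.key'), "\n"; // `meta.key`
```

The splitting path stays appropriate for schema.table style references; the atomic path is what column-emission sites need.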

Why this matters

Compatibility with ClickHouse's nested-array convention (meta.key, meta.value as sibling top-level columns) is the obvious motivator. But the fix applies to any dialect that quotes identifiers and lets users include . in a column name: quoted identifiers in MySQL, PostgreSQL, and SQLite all permit dots inside the quoted form, and the builder currently mis-quotes any such column.

What's new (also in this PR)

A second feature lands on the same branch: a typed ClickHouse bulk-insert envelope so callers stop hand-assembling the INSERT INTO <table> FORMAT <name> query string and the format-specific body payload separately.

use Utopia\Query\Builder\ClickHouse as Builder;
use Utopia\Query\Builder\ClickHouse\Format;

$statement = (new Builder())
    ->into('events')
    ->bulkInsert(Format::JSONEachRow, [
        ['id' => 1, 'event' => 'login',  'time' => '2024-01-01 00:00:00'],
        ['id' => 2, 'event' => 'logout', 'time' => '2024-01-01 00:00:05'],
    ]);

// $statement->query  -> INSERT INTO `events` (`id`, `event`, `time`) FORMAT JSONEachRow
// $statement->body   -> {"id":1,...}\n{"id":2,...}

  • Builder\ClickHouse::bulkInsert(Format $format, iterable $rows, array $columns = []): FormattedInsertStatement — the recommended bulk-ingest entry point. Emits the envelope and the serialized body in one call. Columns are derived from the first row's keys; pass $columns to pin order or fill missing keys with null.
  • Builder\ClickHouse\Format — a backed enum. Currently supports JSONEachRow (JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE, one row per line, no trailing newline) and TabSeparated (escapes \\, \t, \n, \r; emits \N for null; booleans as 0/1). Additional formats can be added as cases.
  • FormattedInsertStatement gains an optional ?string $body property (default null). The existing insertFormat() + insert() envelope-only path stays available as a lower-level setter (e.g. when streaming the payload from elsewhere) and keeps emitting body = null — fully back-compat with 0.3.2 callers.
  • Eager serialization: the row iterable is materialized into the body string before the statement is returned, keeping Statement a plain readonly value object. Generators are accepted at the API surface for ergonomics; they are consumed in full.
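
The serialization rules in the list above can be sketched roughly as follows. This is an illustration only, not the Format enum's implementation: sketchSerialize(), sketchProject(), and sketchTsvCell() are hypothetical names, the format is passed as a plain string instead of an enum case, only scalar cell values are handled, and joining TabSeparated rows with a newline and no trailing newline is an assumption carried over from the JSONEachRow description.

```php
<?php
// Hypothetical helpers illustrating the described rules; not the library's API.

function sketchProject(array $row, array $columns): array
{
    $projected = [];
    foreach ($columns as $column) {
        // Keys missing from a row are filled with null.
        $projected[$column] = $row[$column] ?? null;
    }

    return $projected;
}

function sketchTsvCell(mixed $value): string
{
    if ($value === null) {
        return '\N'; // single quotes keep the backslash literal
    }
    if (is_bool($value)) {
        return $value ? '1' : '0';
    }

    // strtr() with an array replaces in a single pass, so an escaped
    // backslash is never escaped a second time.
    return strtr((string) $value, [
        "\\" => "\\\\",
        "\t" => "\\t",
        "\n" => "\\n",
        "\r" => "\\r",
    ]);
}

function sketchSerialize(string $format, iterable $rows, array $columns = []): string
{
    $lines = [];
    foreach ($rows as $row) {
        if ($columns === []) {
            // Column order is derived from the first row's keys.
            $columns = array_keys($row);
        }
        $projected = sketchProject($row, $columns);
        $lines[] = $format === 'JSONEachRow'
            ? json_encode($projected, JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE)
            : implode("\t", array_map('sketchTsvCell', $projected));
    }

    // One row per line, no trailing newline; an empty iterable gives ''.
    return implode("\n", $lines);
}
```

Because the loop uses foreach, a generator is accepted and consumed in full, which mirrors the eager-serialization behaviour described above.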

bulkInsert() / insertFormat() boundary

The two methods are intentionally kept separate after this PR but share their envelope emitter:

  • bulkInsert() is the recommended entry point — typed enum, returns query + body together. Use this unless you have a specific reason not to.
  • insertFormat() (released in 0.3.2) is retained as a lower-level setter for callers that produce the body payload elsewhere and want only the envelope (body = null).
  • Both paths now route through a single private compileFormatInsertEnvelope() so they cannot diverge on table quoting, column resolution, or column-name validation. insertFormat()+insert() is also aligned with bulkInsert() to accept the no-columns form (INSERT INTO t FORMAT name) — ClickHouse treats the column list as optional and the two paths now match.
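
As a rough illustration of what a single shared envelope emitter looks like, the sketch below builds both the with-columns and the no-columns form from one function. sketchEnvelope() is a hypothetical name, not the private compileFormatInsertEnvelope() itself, and it simplifies quoting by wrapping the table and each column as one atomic backtick-quoted token, which may differ from how the builder resolves names.

```php
<?php
// Hypothetical sketch of a shared INSERT ... FORMAT envelope emitter.
// Quoting is simplified: every name is wrapped as a single atomic token.

function sketchEnvelope(string $table, array $columns, string $format): string
{
    $wrap = static fn (string $name): string => '`' . str_replace('`', '``', $name) . '`';

    $sql = 'INSERT INTO ' . $wrap($table);

    if ($columns !== []) {
        foreach ($columns as $column) {
            // Both entry points share this validation, so neither can skip it.
            if ($column === '') {
                throw new InvalidArgumentException('Column name must not be empty');
            }
        }
        $sql .= ' (' . implode(', ', array_map($wrap, $columns)) . ')';
    }

    // With no columns the list is omitted, since ClickHouse treats it as optional.
    return $sql . ' FORMAT ' . $format;
}

echo sketchEnvelope('events', ['id', 'event', 'time'], 'JSONEachRow'), "\n";
echo sketchEnvelope('t', [], 'TabSeparated'), "\n";
```

Routing both callers through one function like this is what keeps the two paths from drifting apart on quoting or validation.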

Motivation: routing bulk ingest through the typed Builder API removes a class of hand-assembled HTTP envelope code that previously bypassed the builder entirely — quoting, column-list construction, and body formatting are all now the builder's job.

Tests

  • QuotesIdentifiersTest covers quoteLiteral() for literal dots, plain identifiers, doubled wrap chars, bare *, name.* treated as a literal (no qualifier split), and control-character rejection.
  • ClickHouseTest::testCreateTableArrayWithDottedColumnName asserts that array('meta.key', ColumnType::String) and array('meta.value', ColumnType::String) emit as `meta.key` Array(String), `meta.value` Array(String) — single backtick-wrapped tokens, dot preserved.
  • New BulkInsertTest covers single/multi-row JSONEachRow, no trailing newline, empty iterable (empty body, optional column list), explicit column ordering with missing keys, JSON escaping of quotes/backslashes/tabs/newlines, slash + unicode preservation, null serialization, TabSeparated escaping and \N/0/1 mapping, generator input, table names with literal dots, and the body = null back-compat invariant for the envelope-only insertFormat() path. New parity tests assert bulkInsert() and insertFormat()+insert() emit identical envelopes for equivalent inputs (including tables with literal dots and empty-column-name rejection).
  • Full test suite + lint + PHPStan-max all green.

QuotesIdentifiers::quote() splits identifiers on '.' to support qualified
references like 'schema.table.column'. That assumption breaks for column
names that legitimately contain a literal dot — the canonical example is
ClickHouse's nested-array convention where 'meta.key Array(String)' is a
top-level column whose name happens to include a dot.

quoteLiteral() wraps + escapes a single identifier without splitting, so
callers that know the identifier is atomic (column-emission contexts) can
preserve the dot.
…ALTER

Column names appearing in CREATE TABLE column lists, ALTER TABLE
ADD/DROP/RENAME COLUMN, PRIMARY KEY/UNIQUE/FOREIGN KEY column lists,
index column lists, and the ClickHouse ORDER BY / engine column-arg
positions are atomic by definition — no schema/table qualifier is
allowed in those positions. Use quoteLiteral() there so identifiers
that contain a literal '.' (e.g. ClickHouse 'meta.key Array(String)')
emit as a single backtick-wrapped token instead of being split into
two segments.

Qualifier-friendly call sites (table names, FROM clauses, FK refTable,
view/procedure/trigger/sequence names, etc.) keep using quote() so
'schema.table' style references continue to work.

Also covers the ClickHouse engine column-arg positions
(ReplacingMergeTree, SummingMergeTree, CollapsingMergeTree) and
COMMENT ON COLUMN, which all take atomic column names.

github-actions Bot commented May 21, 2026

📊 Coverage

| Metric | PR | Baseline | Δ |
| --- | --- | --- | --- |
| Lines | 91.82% (7472/8138) | 91.80% | +0.02% |
| Methods | 84.29% (1105/1311) | 84.42% | -0.13% |
| Classes | 65.53% (135/206) | 65.85% | -0.32% |

Full per-file breakdown in the job summary.

Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR fixes a correctness bug in all SQL schema compilers where column names containing a literal . (e.g. ClickHouse nested-array columns like meta.key) were mis-split into qualified references instead of being quoted atomically. It also ships a new typed bulkInsert() API for the ClickHouse query builder.

  • Introduces QuotesIdentifiers::quoteLiteral() and migrates every column-name emission site in all schema compilers to use it, while keeping table/schema identifiers on the dot-splitting quote() path.
  • Adds Builder\ClickHouse::bulkInsert(Format, iterable, array) returning a FormattedInsertStatement with both ->query and ->body populated; Format::JSONEachRow and Format::TabSeparated handle serialization.
  • The insertFormat()+insert() path no longer throws when no columns are declared — it now emits a bare envelope, aligning with ClickHouse's optional column-list behavior.

Confidence Score: 4/5

The schema-compiler fix is correct and well-covered, but the new bulkInsert() feature has an open column-quoting gap in its INSERT envelope that a previous reviewer flagged and that remains unaddressed.

The quoteLiteral() migration across all schema compilers is thorough and the dotted-column DDL fix is correct end-to-end. However, compileFormatInsertEnvelope() wraps each column via resolveAndWrap(), which calls quote() and splits on dots: a column named meta.key in bulkInsert() would still be emitted as the mis-quoted `meta`.`key` in the INSERT envelope, causing ClickHouse to reject the statement.

src/Query/Builder/ClickHouse.php — compileFormatInsertEnvelope() where column names are wrapped via resolveAndWrap() instead of quoteLiteral().

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/Query/Builder/ClickHouse.php | Adds bulkInsert() and refactors insert() to share compileFormatInsertEnvelope(). Column quoting uses resolveAndWrap(), which calls quote() and splits on dots, leaving dotted column names mis-quoted in INSERT envelopes. |
| src/Query/QuotesIdentifiers.php | Adds quoteLiteral(), a dot-preserving quote path that skips qualifier splitting. Clean implementation mirroring quote()'s * and control-character handling. |
| src/Query/Schema.php | All column-emission sites switched to quoteLiteral(). Changes are consistent and correct. |
| src/Query/Schema/ClickHouse.php | Column-emission paths migrated to quoteLiteral(), including ORDER BY, engine args, SKIP INDEX, RENAME/DROP COLUMN, and COMMENT COLUMN. All correct. |
| src/Query/Builder/ClickHouse/Format.php | New backed enum with correct JSONEachRow and TabSeparated serialization. JSON_THROW_ON_ERROR guards failures; null/bool/escape handling matches ClickHouse spec. |
| src/Query/Builder/ClickHouse/FormattedInsertStatement.php | Adds optional ?string $body = null property; withExecutor() correctly threads it through the clone. Back-compat for callers using the envelope-only path. |

Reviews (4): Last reviewed commit: "refactor(clickhouse): drop builder state..."

Comment thread tests/Query/Schema/ClickHouseTest.php
Routes ClickHouse's canonical `INSERT INTO <table> FORMAT <name>` bulk-
ingest path through the typed builder. `Builder\ClickHouse::bulkInsert()`
takes a `Format` enum and an iterable of associative rows and returns a
`FormattedInsertStatement` whose `->query` is the envelope and whose new
`->body` field is the serialized payload — callers ship both to the
ClickHouse HTTP interface without hand-assembling either side.

The `Format` enum supports `JSONEachRow` (encoded with
JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE,
one row per line, no trailing newline) and `TabSeparated` (escapes
`\\`, `\t`, `\n`, `\r`; emits `\N` for null; booleans as 0/1). Empty
iterables yield an empty body, which ClickHouse accepts as a zero-row
ingest. Rows are materialized eagerly so the statement remains a plain
readonly value object.

`FormattedInsertStatement` gains an optional `?string $body` property
(default null) that preserves back-compat for the existing
`insertFormat()` + `insert()` envelope-only path. Callers who only want
the envelope (e.g. when streaming the payload from elsewhere) keep
using that path; callers who want a single typed call switch to
`bulkInsert()`.
@lohanidamodar lohanidamodar changed the title from "fix(schema): quote column names atomically — preserve literal dots like ClickHouse's meta.key" to "fix(schema): quote column names atomically + feat(clickhouse): bulk-insert builder" on May 21, 2026
Comment thread src/Query/Builder/ClickHouse.php
bulkInsert() is the recommended bulk-ingest entry point; insertFormat()
stays as a lower-level setter for callers that stream the body
separately. Both paths now share a single private envelope emitter
(compileFormatInsertEnvelope) so they cannot diverge on table quoting,
column resolution, or column-name validation.

Also aligns insertFormat()+insert() with bulkInsert() by accepting the
no-columns envelope form (INSERT INTO t FORMAT name) — ClickHouse
treats the column list as optional, and the two paths now match.
lohanidamodar added a commit to utopia-php/usage that referenced this pull request May 21, 2026
Picks up utopia-php/query#13 (feat/clickhouse-nested-column-quoting),
which lands the typed bulkInsert() entry point and the QuotesIdentifiers
quoteLiteral() fix for atomic identifiers. Bumps minimum-stability to dev
+ prefer-stable true so the dev-branch resolves alongside the rest of
the stable graph.

TODO: flip minimum-stability back to "stable" and pin to "^0.3.3" once
PR #13 is tagged.
lohanidamodar added a commit to utopia-php/usage that referenced this pull request May 21, 2026
…l JSONEachRow body assembly

addBatch() now hands its built rows directly to the typed
bulkInsert(Format::JSONEachRow, ...) entry point on the ClickHouse builder
and ships the eager $statement->body to the HTTP layer, replacing the
hand-rolled array_map(json_encode, ...) + implode("\n", ...) assembly.
The runtime instanceof guard on FormattedInsertStatement is gone too -
bulkInsert() returns the typed statement by signature.

insert() now takes the serialized body string instead of an array of
pre-encoded rows; the only caller is addBatch().

createDailyMaterializedView() picks up the createMaterializedView()
argument-order change in the new query branch (name, body, targetTable,
ifNotExists). The snapshot test for the MV path is updated to match.

New snapshots:
  - testAddBatchEmitsBulkInsertQueryAndBody asserts the envelope query
    and the serialized JSONEachRow body for a two-row fixture.
  - testNestedColumnDotQuoting validates that columns containing a dot
    (ClickHouse nested-array convention) remain single-backtick-wrapped
    atomic identifiers, exercising the QuotesIdentifiers::quoteLiteral()
    fix shipped in utopia-php/query#13.
…cument Format::serialize() column contract

- bulkInsert() no longer assigns to the fluent insertFormat / insertFormatColumns
  fields after the envelope is compiled. The compileFormatInsertEnvelope() helper
  is already parameterised, so those assignments were stale state — reusing a
  builder instance for a subsequent regular insert() previously inherited the
  residual format envelope.
- Format::serialize() PHPDoc now spells out the $columns === null contract:
  ordering is derived from the keys of the first row, with no cross-row
  consistency check. Positional formats (TabSeparated) corrupt silently on
  inconsistent row shapes; named formats (JSONEachRow) tolerate reordering but
  the explicit $columns argument acts as a projection filter.
- New tests guard both invariants: builder-reuse after bulkInsert() and the
  explicit-columns ordering contract on Format::serialize() across rows whose
  keys vary.