fix(schema): quote column names atomically + feat(clickhouse): bulk-insert builder#13
fix(schema): quote column names atomically + feat(clickhouse): bulk-insert builder#13lohanidamodar wants to merge 5 commits into
Conversation
QuotesIdentifiers::quote() splits identifiers on '.' to support qualified references like 'schema.table.column'. That assumption breaks for column names that legitimately contain a literal dot — the canonical example is ClickHouse's nested-array convention where 'meta.key Array(String)' is a top-level column whose name happens to include a dot. quoteLiteral() wraps + escapes a single identifier without splitting, so callers that know the identifier is atomic (column-emission contexts) can preserve the dot.
…ALTER Column names appearing in CREATE TABLE column lists, ALTER TABLE ADD/DROP/RENAME COLUMN, PRIMARY KEY/UNIQUE/FOREIGN KEY column lists, index column lists, and the ClickHouse ORDER BY / engine column-arg positions are atomic by definition — no schema/table qualifier is allowed in those positions. Use quoteLiteral() there so identifiers that contain a literal '.' (e.g. ClickHouse 'meta.key Array(String)') emit as a single backtick-wrapped token instead of being split into two segments. Qualifier-friendly call sites (table names, FROM clauses, FK refTable, view/procedure/trigger/sequence names, etc.) keep using quote() so 'schema.table' style references continue to work. Also covers the ClickHouse engine column-arg positions (ReplacingMergeTree, SummingMergeTree, CollapsingMergeTree) and COMMENT ON COLUMN, which all take atomic column names.
📊 Coverage
Full per-file breakdown in the job summary. |
Greptile SummaryThis PR fixes a correctness bug in all SQL schema compilers where column names containing a literal
Confidence Score: 4/5The schema-compiler fix is correct and well-covered, but the new bulkInsert() feature has an open column-quoting gap in its INSERT envelope that a previous reviewer flagged and that remains unaddressed. The quoteLiteral() migration across all schema compilers is thorough and the dotted-column DDL fix is correct end-to-end. However, compileFormatInsertEnvelope() wraps each column via resolveAndWrap() which calls quote() and splits on dots — a column named meta.key in bulkInsert() would still be emitted as mis-quoted backtick-meta backtick dot backtick-key backtick in the INSERT envelope, causing ClickHouse to reject the statement. src/Query/Builder/ClickHouse.php — compileFormatInsertEnvelope() where column names are wrapped via resolveAndWrap() instead of quoteLiteral(). Important Files Changed
Reviews (4): Last reviewed commit: "refactor(clickhouse): drop builder state..." | Re-trigger Greptile |
Routes ClickHouse's canonical `INSERT INTO <table> FORMAT <name>` bulk- ingest path through the typed builder. `Builder\ClickHouse::bulkInsert()` takes a `Format` enum and an iterable of associative rows and returns a `FormattedInsertStatement` whose `->query` is the envelope and whose new `->body` field is the serialized payload — callers ship both to the ClickHouse HTTP interface without hand-assembling either side. The `Format` enum supports `JSONEachRow` (encoded with JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE, one row per line, no trailing newline) and `TabSeparated` (escapes `\\`, `\t`, `\n`, `\r`; emits `\N` for null; booleans as 0/1). Empty iterables yield an empty body, which ClickHouse accepts as a zero-row ingest. Rows are materialized eagerly so the statement remains a plain readonly value object. `FormattedInsertStatement` gains an optional `?string $body` property (default null) that preserves back-compat for the existing `insertFormat()` + `insert()` envelope-only path. Callers who only want the envelope (e.g. when streaming the payload from elsewhere) keep using that path; callers who want a single typed call switch to `bulkInsert()`.
meta.keybulkInsert() is the recommended bulk-ingest entry point; insertFormat() stays as a lower-level setter for callers that stream the body separately. Both paths now share a single private envelope emitter (compileFormatInsertEnvelope) so they cannot diverge on table quoting, column resolution, or column-name validation. Also aligns insertFormat()+insert() with bulkInsert() by accepting the no-columns envelope form (INSERT INTO t FORMAT name) — ClickHouse treats the column list as optional, and the two paths now match.
Picks up utopia-php/query#13 (feat/clickhouse-nested-column-quoting), which lands the typed bulkInsert() entry point and the QuotesIdentifiers quoteLiteral() fix for atomic identifiers. Bumps minimum-stability to dev + prefer-stable true so the dev-branch resolves alongside the rest of the stable graph. TODO: flip minimum-stability back to "stable" and pin to "^0.3.3" once PR #13 is tagged.
…l JSONEachRow body assembly
addBatch() now hands its built rows directly to the typed
bulkInsert(Format::JSONEachRow, ...) entry point on the ClickHouse builder
and ships the eager $statement->body to the HTTP layer, replacing the
hand-rolled array_map(json_encode, ...) + implode("\n", ...) assembly.
The runtime instanceof guard on FormattedInsertStatement is gone too -
bulkInsert() returns the typed statement by signature.
insert() now takes the serialized body string instead of an array of
pre-encoded rows; the only caller is addBatch().
createDailyMaterializedView() picks up the createMaterializedView()
argument-order change in the new query branch (name, body, targetTable,
ifNotExists). The snapshot test for the MV path is updated to match.
New snapshots:
- testAddBatchEmitsBulkInsertQueryAndBody asserts the envelope query
and the serialized JSONEachRow body for a two-row fixture.
- testNestedColumnDotQuoting validates that columns containing a dot
(ClickHouse nested-array convention) remain single-backtick-wrapped
atomic identifiers, exercising the QuotesIdentifiers::quoteLiteral()
fix shipped in utopia-php/query#13.
…cument Format::serialize() column contract - bulkInsert() no longer assigns to the fluent insertFormat / insertFormatColumns fields after the envelope is compiled. The compileFormatInsertEnvelope() helper is already parameterised, so those assignments were stale state — reusing a builder instance for a subsequent regular insert() previously inherited the residual format envelope. - Format::serialize() PHPDoc now spells out the $columns === null contract: ordering is derived from the keys of the first row, with no cross-row consistency check. Positional formats (TabSeparated) corrupt silently on inconsistent row shapes; named formats (JSONEachRow) tolerate reordering but the explicit $columns argument acts as a projection filter. - New tests guard both invariants: builder-reuse after bulkInsert() and the explicit-columns ordering contract on Format::serialize() across rows whose keys vary.
Summary
Column names with a literal
.(e.g.meta.key) are currently mis-quoted byQuotesIdentifiers::quote(), which unconditionally splits on.and treats each segment as a qualifier. The most visible victim is ClickHouse's standard nested-array convention —meta.key Array(String)is a top-level column whose name literally contains a dot — but the bug is dialect-general: any column whose name contains a.is currently emitted as if it were a qualified reference.Before:
After:
What's changing
QuotesIdentifiers::quoteLiteral()for atomic identifiers — no qualifier splitting, just wrap + escape (matchingquote()'s*and control-character handling).Schema,ClickHouse,PostgreSQL,SQLite,MySQL, and the sharedForeignKeystrait) toquoteLiteral()— CREATE TABLE column lists, ALTER TABLE ADD/DROP/RENAME COLUMN, PRIMARY KEY / UNIQUE / FOREIGN KEY column lists, index column lists, ClickHouseORDER BYcolumns, ClickHouse engine column-arg positions (ReplacingMergeTree's version column, SummingMergeTree's value columns, CollapsingMergeTree's sign column), andCOMMENT ON COLUMN.quote()itself is unchanged. Qualifier-friendly call sites — table names, FROM clauses, FKrefTable, view / procedure / trigger / sequence / type / schema / database / extension names, builderSELECT/ alias / CTE references — all keep splitting on.exactly as before.MongoDB::quote()is a no-op identity;MongoDB::quoteLiteral()matches.Why this matters
Compatibility with ClickHouse's nested-array convention (
meta.key,meta.valueas sibling top-level columns) is the obvious motivator. But the fix is correctness-preserving for any dialect that quotes identifiers and lets users include.in a column name — quoted identifiers in MySQL, PostgreSQL, and SQLite all permit dots inside the quoted form, and the current builder mis-quotes any such column today.What's new (also in this PR)
A second feature lands on the same branch: a typed ClickHouse bulk-insert envelope so callers stop hand-assembling the
INSERT INTO <table> FORMAT <name>query string and the format-specific body payload separately.Builder\ClickHouse::bulkInsert(Format $format, iterable $rows, array $columns = []): FormattedInsertStatement— the recommended bulk-ingest entry point. Emits the envelope and the serialized body in one call. Columns are derived from the first row's keys; pass$columnsto pin order or fill missing keys withnull.Builder\ClickHouse\Format— a backed enum. Currently supportsJSONEachRow(JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE, one row per line, no trailing newline) andTabSeparated(escapes\\,\t,\n,\r; emits\Nfor null; booleans as 0/1). Additional formats can be added as cases.FormattedInsertStatementgains an optional?string $bodyproperty (defaultnull). The existinginsertFormat()+insert()envelope-only path stays available as a lower-level setter (e.g. when streaming the payload from elsewhere) and keeps emittingbody = null— fully back-compat with 0.3.2 callers.Statementa plain readonly value object. Generators are accepted at the API surface for ergonomics; they are consumed in full.bulkInsert()/insertFormat()boundaryThe two methods are intentionally kept separate after this PR but share their envelope emitter:
bulkInsert()is the recommended entry point — typed enum, returnsquery+bodytogether. Use this unless you have a specific reason not to.insertFormat()(released in 0.3.2) is retained as a lower-level setter for callers that produce the body payload elsewhere and want only the envelope (body = null).compileFormatInsertEnvelope()so they cannot diverge on table quoting, column resolution, or column-name validation.insertFormat()+insert()is also aligned withbulkInsert()to accept the no-columns form (INSERT INTO t FORMAT name) — ClickHouse treats the column list as optional and the two paths now match.Motivation: routing bulk ingest through the typed
BuilderAPI removes a class of hand-assembled HTTP envelope code that previously bypassed the builder entirely — quoting, column-list construction, and body formatting are all now the builder's job.Tests
QuotesIdentifiersTestcoversquoteLiteral()for literal dots, plain identifiers, doubled wrap chars, bare*,name.*treated as a literal (no qualifier split), and control-character rejection.ClickHouseTest::testCreateTableArrayWithDottedColumnNameasserts thatarray('meta.key', ColumnType::String)andarray('meta.value', ColumnType::String)emit as`meta.key` Array(String),`meta.value` Array(String)— single backtick-wrapped tokens, dot preserved.BulkInsertTestcovers single/multi-row JSONEachRow, no trailing newline, empty iterable (empty body, optional column list), explicit column ordering with missing keys, JSON escaping of quotes/backslashes/tabs/newlines, slash + unicode preservation,nullserialization, TabSeparated escaping and\N/0/1 mapping, generator input, table names with literal dots, and thebody = nullback-compat invariant for the envelope-onlyinsertFormat()path. New parity tests assertbulkInsert()andinsertFormat()+insert()emit identical envelopes for equivalent inputs (including tables with literal dots and empty-column-name rejection).