41 changes: 41 additions & 0 deletions README.md
@@ -1520,6 +1520,47 @@ $result->namedBindings;

Unregistered columns fall through to value-based inference: `int → Int64`, `float → Float64`, `bool → UInt8`, `null → Nullable(String)`, `DateTimeInterface → DateTime64(3)`, everything else → `String`. Register types via `withParamType($column, $type)` or `withParamTypes($map)` whenever the inference rule doesn't match the column's ClickHouse declaration. The positional `$bindings` array is still exposed on the resulting `Statement` for callers that prefer it.

**Bulk insert** — emit the canonical `INSERT INTO <table> FORMAT <name>` envelope together with the serialized row payload in a single typed call. The returned `FormattedInsertStatement` exposes `->query` (the envelope) and `->body` (the format-specific payload) so the caller can ship both to ClickHouse's HTTP interface without hand-assembling either side:

```php
use Utopia\Query\Builder\ClickHouse as Builder;
use Utopia\Query\Builder\ClickHouse\Format;

$statement = (new Builder())
    ->into('events')
    ->bulkInsert(Format::JSONEachRow, [
        ['id' => 1, 'event' => 'login', 'time' => '2024-01-01 00:00:00'],
        ['id' => 2, 'event' => 'logout', 'time' => '2024-01-01 00:00:05'],
    ]);

// $statement->query
// INSERT INTO `events` (`id`, `event`, `time`) FORMAT JSONEachRow
//
// $statement->body
// {"id":1,"event":"login","time":"2024-01-01 00:00:00"}
// {"id":2,"event":"logout","time":"2024-01-01 00:00:05"}
```

To ship the result over the HTTP interface, pass `$statement->query` as the `query` URL parameter and `$statement->body` as the POST body. When no column list is given, columns are taken from the first row's keys. Pass an explicit column list as the third argument to pin the order; keys missing from a row are filled with `null` and keys outside the list are dropped:

```php
$statement = (new Builder())
    ->into('events')
    ->bulkInsert(Format::JSONEachRow, $rows, ['id', 'event', 'time']);
```
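The projection rule can be illustrated without the builder. The sketch below is a standalone approximation, not the library's code: `projectRow()` is a hypothetical helper that mirrors the behaviour described above (listed columns in order, `null` for missing keys, unlisted keys dropped).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the library: project one row onto an
 * explicit column list the way the README describes for bulkInsert().
 *
 * @param array<string, mixed> $row
 * @param list<string> $columns
 * @return array<string, mixed>
 */
function projectRow(array $row, array $columns): array
{
    $ordered = [];
    foreach ($columns as $column) {
        // Missing keys become null; keys not in $columns are never copied.
        $ordered[$column] = $row[$column] ?? null;
    }

    return $ordered;
}

$columns = ['id', 'event', 'time'];

// Keys are out of order, 'time' is missing, and 'extra' is not a listed column.
$row = ['event' => 'login', 'extra' => 'ignored', 'id' => 1];

echo json_encode(projectRow($row, $columns), JSON_THROW_ON_ERROR), "\n";
// {"id":1,"event":"login","time":null}
```

Pinning the list this way matters most for positional formats such as TabSeparated, where a reordered row would otherwise put values under the wrong column.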

The `Format` enum currently supports `Format::JSONEachRow` and `Format::TabSeparated`. JSONEachRow rows are encoded with `JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE`, so slashes and non-ASCII characters are preserved verbatim and unencodable values throw. TabSeparated escapes backslash, tab, newline, and carriage return as `\\`, `\t`, `\n`, and `\r`, and emits `\N` for `null`. An empty row iterable produces an empty body, which ClickHouse accepts as a zero-row ingest. The iterable is consumed eagerly: a generator defers row construction until `bulkInsert()` runs, but every row and the full serialized body are held in memory before the statement is returned.
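The encoding rules can be checked in isolation. The sketch below uses only PHP built-ins; `escapeTsv()` is a hypothetical stand-in for the library's TabSeparated escaping and handles only `null` and strings to stay short.

```php
<?php

declare(strict_types=1);

// Flags named in the README for JSONEachRow rows.
$flags = JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE;

echo json_encode(['path' => 'a/b', 'city' => 'Zürich'], $flags), "\n";
// {"path":"a/b","city":"Zürich"}

/**
 * Hypothetical stand-in for the TabSeparated escaping described above.
 */
function escapeTsv(?string $value): string
{
    if ($value === null) {
        return '\\N';
    }

    // strtr() with an array replaces in one pass, so the backslashes
    // introduced for \t, \n and \r are not escaped a second time.
    return strtr($value, [
        '\\' => '\\\\',
        "\t" => '\\t',
        "\n" => '\\n',
        "\r" => '\\r',
    ]);
}

// One TabSeparated line: fields are joined with real tab characters.
$fields = array_map('escapeTsv', ["a\tb", 'C:\\dir', null]);
echo implode("\t", $fields), "\n";
// Prints the fields a\tb, C:\\dir and \N, separated by tabs.
```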

`bulkInsert()` is the recommended entry point for FORMAT-based ingest — it covers the full envelope + body contract in one typed call. The lower-level `insertFormat()` setter pairs with `insert()` for the envelope-only path (returns `body = null`) and is retained for callers that stream the payload separately. Both paths share the same envelope emitter, so the resulting `query` is identical for equivalent inputs:

```php
$statement = (new Builder())
    ->into('events')
    ->insertFormat('JSONEachRow', ['id', 'event', 'time'])
    ->insert();
// $statement->body is null; assemble the payload separately.
```
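On the envelope-only path the caller pairs the emitted query with a payload it built itself. The following is a minimal sketch under stated assumptions: `buildInsertRequest()` is a hypothetical helper, the endpoint is a local server on ClickHouse's default HTTP port, and the query string stands in for `$statement->query`. It only prepares the request; nothing is sent.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the library: pair an envelope query with
 * a caller-assembled body for ClickHouse's HTTP interface.
 *
 * @return array{url: string, options: array<string, array<string, mixed>>}
 */
function buildInsertRequest(string $endpoint, string $query, string $body): array
{
    return [
        // The envelope travels in the `query` URL parameter.
        'url' => $endpoint . '?' . http_build_query(['query' => $query]),
        // The serialized rows travel as the POST body. The explicit content
        // type avoids PHP's form-encoded default for stream POSTs.
        'options' => [
            'http' => [
                'method' => 'POST',
                'header' => "Content-Type: text/plain\r\n",
                'content' => $body,
            ],
        ],
    ];
}

// Stand-in for $statement->query from the envelope-only path.
$query = 'INSERT INTO `events` (`id`, `event`, `time`) FORMAT JSONEachRow';

// Payload assembled by the caller, one JSON object per line.
$body = implode("\n", [
    json_encode(['id' => 1, 'event' => 'login', 'time' => '2024-01-01 00:00:00'], JSON_THROW_ON_ERROR),
    json_encode(['id' => 2, 'event' => 'logout', 'time' => '2024-01-01 00:00:05'], JSON_THROW_ON_ERROR),
]);

// Assumed endpoint; adjust host, port, and credentials for a real server.
$request = buildInsertRequest('http://localhost:8123/', $query, $body);

// To actually send it (requires a reachable server):
// file_get_contents($request['url'], false, stream_context_create($request['options']));
```

The same helper would work for a `bulkInsert()` result by passing `$statement->body` instead of a hand-built payload.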

**UPDATE** — compiles to `ALTER TABLE ... UPDATE` with mandatory WHERE:

```php
110 changes: 85 additions & 25 deletions src/Query/Builder/ClickHouse.php
@@ -3,6 +3,7 @@
namespace Utopia\Query\Builder;

use Utopia\Query\Builder as BaseBuilder;
use Utopia\Query\Builder\ClickHouse\Format;
use Utopia\Query\Builder\ClickHouse\FormattedInsertStatement;
use Utopia\Query\Builder\Feature\BitwiseAggregates;
use Utopia\Query\Builder\Feature\ClickHouse\ApproximateAggregates;
@@ -140,11 +141,58 @@ public function hint(string $hint): static
    }

    /**
-    * Declare a ClickHouse FORMAT pragma for the next INSERT.
     * Recommended bulk-ingest entry point. Emits the canonical `INSERT INTO
     * <table> [(<cols>)] FORMAT <name>` envelope alongside the serialized row
     * payload in a single typed call. The returned `FormattedInsertStatement`
     * exposes `->query` (envelope, no bindings) and `->body` (format-specific
     * payload) so the caller can ship both to ClickHouse's HTTP interface
     * without hand-assembling either side.
     *
     * The target table must be set via `into()` first. Columns are derived
     * from the keys of the first row when `$columns` is omitted; pass
     * `$columns` explicitly to pin the order when row shapes vary or when
     * an empty iterable is passed. An empty iterable produces an empty
     * body, which ClickHouse accepts as a zero-row ingest.
     *
     * @param iterable<array<string, mixed>> $rows
     * @param list<string> $columns Optional explicit column ordering.
     */
    public function bulkInsert(Format $format, iterable $rows, array $columns = []): FormattedInsertStatement
    {
        $materialized = [];
        foreach ($rows as $row) {
            /** @phpstan-ignore function.alreadyNarrowedType */
            if (!\is_array($row)) {
                throw new ValidationException('bulkInsert() rows must be associative arrays.');
            }
            $materialized[] = $row;
        }

        if (empty($columns) && !empty($materialized)) {
            $columns = \array_keys($materialized[0]);
        }

        $sql = $this->compileFormatInsertEnvelope($format->value, $columns);

        $body = $format->serialize($materialized, empty($columns) ? null : $columns);

        return new FormattedInsertStatement(
            $sql,
            [],
            $columns,
            $format->value,
            $body,
            executor: $this->executor,
        );
    }

    /**
     * Lower-level setter for the FORMAT envelope. Use `bulkInsert()` for the
     * typed entry point; this method is retained for callers that need to
     * stream the body payload separately (e.g. piping a pre-serialized stream
     * straight into the HTTP request). The subsequent `insert()` call emits
     * the envelope only, with `body = null`.
     *
-    * When a format is set, `insert()` emits
-    * `INSERT INTO \`t\` (\`col1\`, \`col2\`) FORMAT <name>` with no VALUES.
-    * The row payload must be streamed into the HTTP body by the caller.
     * Column names are derived from the most recent `set()` call (values are
     * ignored). Pass `$columns` to declare them explicitly when no `set()`
     * call has been made.
@@ -163,6 +211,38 @@ public function insertFormat(string $format, array $columns = []): static
        return $this;
    }

    /**
     * Build the shared `INSERT INTO <table> [(<cols>)] FORMAT <name>`
     * envelope. Validates the table, validates column names, quotes the
     * table identifier, and wraps each column via `resolveAndWrap()`.
     * Resets bindings so callers don't accumulate stale values from prior
     * builder operations.
     *
     * @param list<string> $columns
     */
    private function compileFormatInsertEnvelope(string $format, array $columns): string
    {
        $this->bindings = [];
        $this->validateTable();

        foreach ($columns as $col) {
            if ($col === '') {
                throw new ValidationException('Column names for FORMAT INSERT must be non-empty strings.');
            }
        }

        $wrappedColumns = empty($columns)
            ? ''
            : ' (' . \implode(', ', \array_map(
                fn (string $col): string => $this->resolveAndWrap($col),
                $columns
            )) . ')';

        return 'INSERT INTO ' . $this->quote($this->table)
            . $wrappedColumns
            . ' FORMAT ' . $format;
    }

    /**
     * @param array<string, string> $settings
     */
@@ -618,31 +698,11 @@ public function insert(): Statement
            return $this->applyNamedTypedBindings(parent::insert());
        }

-       $this->bindings = [];
-       $this->validateTable();
-
        $columns = !empty($this->insertFormatColumns)
            ? $this->insertFormatColumns
            : (!empty($this->rows) ? \array_keys($this->rows[0]) : []);

-       if (empty($columns)) {
-           throw new ValidationException('No columns specified for FORMAT INSERT. Pass columns to insertFormat() or call set() before insert().');
-       }
-
-       foreach ($columns as $col) {
-           if ($col === '') {
-               throw new ValidationException('Column names for FORMAT INSERT must be non-empty strings.');
-           }
-       }
-
-       $wrappedColumns = \array_map(
-           fn (string $col): string => $this->resolveAndWrap($col),
-           $columns
-       );
-
-       $sql = 'INSERT INTO ' . $this->quote($this->table)
-           . ' (' . \implode(', ', $wrappedColumns) . ')'
-           . ' FORMAT ' . $format;
        $sql = $this->compileFormatInsertEnvelope($format, $columns);

        return new FormattedInsertStatement(
            $sql,
132 changes: 132 additions & 0 deletions src/Query/Builder/ClickHouse/Format.php
@@ -0,0 +1,132 @@
<?php

namespace Utopia\Query\Builder\ClickHouse;

use Utopia\Query\Exception\ValidationException;

/**
 * ClickHouse bulk-ingest format identifiers.
 *
 * The values map 1:1 to the names ClickHouse accepts after the `FORMAT`
 * keyword in an `INSERT INTO <table> FORMAT <name>` envelope. Each case
 * knows how to serialize a row iterable into the request body that
 * ClickHouse expects for that format.
 */
enum Format: string
{
    case JSONEachRow = 'JSONEachRow';
    case TabSeparated = 'TabSeparated';

    /**
     * Serialize an iterable of associative rows into the body payload for
     * this format. An empty iterable yields an empty string, which
     * ClickHouse accepts as a zero-row insert.
     *
     * When `$columns` is null no column list is derived here: each row is
     * serialized with its own keys, in its own key order, and there is no
     * cross-row consistency check. The implications differ per format:
     *
     * - For positional formats (e.g. {@see Format::TabSeparated}) values
     *   are emitted in row-key order. If later rows reorder their keys the
     *   columns silently misalign with the envelope's column list. Pass
     *   `$columns` explicitly whenever row shapes are not guaranteed
     *   identical, or whenever the format is positional.
     * - For named formats (e.g. {@see Format::JSONEachRow}) key ordering
     *   does not affect correctness because each value is paired with its
     *   key in the wire format. `$columns` still acts as a projection
     *   filter: rows missing a listed column receive `null`, and row keys
     *   outside the list are dropped.
     *
     * @param iterable<array<string, mixed>> $rows
     * @param list<string>|null $columns Optional explicit column ordering. When null, each row is serialized with its own keys.
     */
    public function serialize(iterable $rows, ?array $columns = null): string
    {
        return match ($this) {
            self::JSONEachRow => $this->serializeJsonEachRow($rows, $columns),
            self::TabSeparated => $this->serializeTabSeparated($rows, $columns),
        };
    }

    /**
     * @param iterable<array<string, mixed>> $rows
     * @param list<string>|null $columns
     */
    private function serializeJsonEachRow(iterable $rows, ?array $columns): string
    {
        $lines = [];
        foreach ($rows as $row) {
            if ($columns !== null) {
                $ordered = [];
                foreach ($columns as $col) {
                    $ordered[$col] = $row[$col] ?? null;
                }
                $row = $ordered;
            }

            $lines[] = \json_encode(
                (object) $row,
                JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE,
            );
        }

        return \implode("\n", $lines);
    }

    /**
     * @param iterable<array<string, mixed>> $rows
     * @param list<string>|null $columns
     */
    private function serializeTabSeparated(iterable $rows, ?array $columns): string
    {
        $lines = [];
        foreach ($rows as $row) {
            $values = [];

            if ($columns === null) {
                foreach ($row as $value) {
                    $values[] = $this->escapeTabSeparatedValue($value);
                }
            } else {
                foreach ($columns as $col) {
                    $values[] = $this->escapeTabSeparatedValue($row[$col] ?? null);
                }
            }

            $lines[] = \implode("\t", $values);
        }

        return \implode("\n", $lines);
    }

    private function escapeTabSeparatedValue(mixed $value): string
    {
        if ($value === null) {
            return '\\N';
        }

        if (\is_bool($value)) {
            return $value ? '1' : '0';
        }

        if (\is_int($value) || \is_float($value)) {
            return (string) $value;
        }

        if (! \is_string($value)) {
            if (\is_object($value) && \method_exists($value, '__toString')) {
                $value = (string) $value;
            } else {
                throw new ValidationException('TabSeparated values must be scalar, null, or stringable. Received: ' . \get_debug_type($value));
            }
        }

        return \strtr($value, [
            '\\' => '\\\\',
            "\t" => '\\t',
            "\n" => '\\n',
            "\r" => '\\r',
        ]);
    }
}
3 changes: 3 additions & 0 deletions src/Query/Builder/ClickHouse/FormattedInsertStatement.php
@@ -12,6 +12,7 @@
     * @param list<mixed> $bindings
     * @param list<string> $columns
     * @param string $format
     * @param ?string $body Serialized payload to ship as the HTTP request body alongside `$query`. Null when only the envelope query was produced (the caller assembles the body separately).
     * @param bool $readOnly
     * @param (Closure(Statement): (array<mixed>|int))|null $executor
     */
@@ -20,6 +21,7 @@ public function __construct(
        array $bindings,
        public array $columns,
        public string $format,
        public ?string $body = null,
        bool $readOnly = false,
        ?Closure $executor = null,
    ) {
@@ -34,6 +36,7 @@ public function withExecutor(Closure $executor): self
            $this->bindings,
            $this->columns,
            $this->format,
            $this->body,
            $this->readOnly,
            $executor,
        );
24 changes: 24 additions & 0 deletions src/Query/QuotesIdentifiers.php
@@ -41,4 +41,28 @@ protected function quote(string $identifier): string

        return \implode('.', $wrapped);
    }

    /**
     * Quote a single identifier without treating dots as qualifier separators.
     *
     * Use when the identifier is known to be atomic, e.g. a column name in a
     * CREATE TABLE definition where the dot is a literal part of the name
     * rather than a `schema.table.column` separator. The canonical case is
     * ClickHouse's nested-array convention (`meta.key Array(String)`) where
     * `meta.key` is a single top-level column whose name contains a dot.
     */
    protected function quoteLiteral(string $identifier): string
    {
        if ($identifier === '*') {
            return '*';
        }

        if (\preg_match('/[\x00-\x1f\x7f]/', $identifier) === 1) {
            throw new ValidationException('Identifier contains control character');
        }

        return $this->wrapChar
            . \str_replace($this->wrapChar, $this->wrapChar . $this->wrapChar, $identifier)
            . $this->wrapChar;
    }
}
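
The difference between dot-splitting and literal quoting can be seen in a standalone sketch. Both functions below are approximations written for illustration, not the trait's code: the backtick wrap character is an assumption matching the ClickHouse examples in this diff, and `quoteQualifiedSketch()` only guesses at how the existing `quote()` treats dots.

```php
<?php

declare(strict_types=1);

// Illustrative approximation of dot-aware quoting (assumed behaviour).
function quoteQualifiedSketch(string $identifier, string $wrap = '`'): string
{
    // Treat each dot as a qualifier separator and wrap the parts one by one.
    $parts = array_map(
        fn (string $part): string => $wrap . str_replace($wrap, $wrap . $wrap, $part) . $wrap,
        explode('.', $identifier)
    );

    return implode('.', $parts);
}

// Illustrative approximation of quoteLiteral(): the dot stays in the name.
function quoteLiteralSketch(string $identifier, string $wrap = '`'): string
{
    // Only embedded wrap characters are doubled; nothing is split.
    return $wrap . str_replace($wrap, $wrap . $wrap, $identifier) . $wrap;
}

echo quoteQualifiedSketch('meta.key'), "\n"; // `meta`.`key`
echo quoteLiteralSketch('meta.key'), "\n";   // `meta.key`
echo quoteLiteralSketch('odd`name'), "\n";   // `odd``name`
```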