feat(writer): broaden zone-map MIN/MAX to extension, Utf8, and dict columns #130

Merged
dfa1 merged 3 commits into main from feat/extension-zonemap-minmax
Jun 21, 2026

dfa1 commented Jun 21, 2026


Closes the zone-map MIN/MAX gap for non-primitive columns. Ground-truthed against Rust's default_zoned_aggregate_fns: zone-map stats are chosen from the logical column dtype, encoding-independent — so extension, Utf8, and dict columns all carry MIN/MAX/NULL_COUNT, not NULL_COUNT alone.

Changes (3 commits)

  1. Extension columns: flushZoneMaps unwraps an Extension to its storage primitive (ExtEncoding already propagates the storage min/max), via zoneStatPType / zoneMinMaxDtype.
  2. Utf8 columns: vortex.varbin already records full string min/max scalars; emit them (plus the always-false _is_truncated flags) per zone, matching ZonedStatsSchema (min/max dtype == column dtype).
  3. Dict columns: compute per-chunk min/max on each chunk's logical values at dict-build time (reusing PrimitiveEncodingEncoder.minMaxStats / VarBinEncodingEncoder.minMaxStats, now exposed so the dict and flat paths compute identically). Flat and dict emission are unified through one emitZoneMap helper.

Coverage

| Column type | MIN/MAX |
| --- | --- |
| Primitive | ✅ (was) |
| Extension (over primitive) | ✅ new |
| Utf8 | ✅ new |
| Dict (primitive + utf8) | ✅ new |
| Binary | ❌ deferred (varbin records string scalars, not bytes) |

Full ./mvnw verify green incl. integration/Rust interop + inspector decode + javadoc. New WriterZoneMapTest cases cover extension, Utf8, and primitive/utf8 dict per-zone min/max.

🤖 Generated with Claude Code

dfa1 and others added 3 commits June 21, 2026 17:37
…itive

flushZoneMaps gated min/max on `DType.Primitive`, so extension columns
(e.g. vortex.timestamp over I64) emitted NULL_COUNT-only zone-maps even
though ExtEncoding already propagates the storage array's min/max scalars
to each chunk. Generalise the gate via zoneStatPType, which unwraps an
Extension to its storage primitive; the per-zone stat column is stored as
that primitive, matching Rust (stats computed on the storage array).
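
The gate described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `DType`, `PType`, `Primitive`, `Utf8`, and `Extension` types are hypothetical stand-ins, and only the name `zoneStatPType` comes from the commit message. It assumes an extension qualifies only when its storage dtype is a primitive.

```java
import java.util.Optional;

class ZoneDtypeSketch {
    // Hypothetical stand-ins for the writer's dtype model, not the project's real classes.
    interface DType {}
    enum PType { I32, I64, F64 }
    record Primitive(PType ptype) implements DType {}
    record Utf8() implements DType {}
    record Extension(String id, DType storage) implements DType {}

    /**
     * Returns the primitive type under which per-zone MIN/MAX would be stored,
     * or empty when the column would get a NULL_COUNT-only zone map.
     */
    static Optional<PType> zoneStatPType(DType dtype) {
        if (dtype instanceof Primitive p) {
            return Optional.of(p.ptype());
        }
        if (dtype instanceof Extension ext && ext.storage() instanceof Primitive p) {
            // Assumption: stats live on the storage array, so the extension is unwrapped.
            return Optional.of(p.ptype());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        DType timestamp = new Extension("vortex.timestamp", new Primitive(PType.I64));
        System.out.println(zoneStatPType(timestamp));
        System.out.println(zoneStatPType(new Utf8()));
    }
}
```

In this sketch a Utf8 column still falls through to the empty case, which corresponds to the state after this commit alone.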

Co-Authored-By: Claude Opus 4.8 <[email protected]>

Generalise flushZoneMaps min/max beyond primitives: vortex.varbin already
records full string min/max scalars per chunk, so Utf8 columns now emit
MAX/MIN (plus the always-false _is_truncated flags) in the per-zone stats
table, matching ZonedStatsSchema (min/max dtype == column dtype) and Rust.

zoneMinMaxDtype resolves the stored min/max dtype (primitive / extension
storage / Utf8); zoneStatValues dispatches to the primitive or string
stat-column builder. Binary is excluded — varbin records its bounds as
string scalars, not bytes, so Binary min/max stays NULL_COUNT-only.
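
A rough sketch of what a per-zone Utf8 stat row might hold is shown below. The `ZoneStat` record and `statsFor` helper are hypothetical names, not the writer's API. It assumes strings are ordered by their unsigned UTF-8 bytes (as Rust's `str` comparison does), which can differ from `String.compareTo` for supplementary characters, and it stores the full strings, so both truncation flags stay false.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class Utf8ZoneStatSketch {
    // Hypothetical row of the per-zone stats table for a Utf8 column.
    record ZoneStat(String min, String max,
                    boolean minIsTruncated, boolean maxIsTruncated,
                    long nullCount) {}

    // Assumption: ordering is byte-wise over the UTF-8 encoding.
    static int compareUtf8(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    /** Computes min, max, and null count for one zone; min and max are null if every value is null. */
    static ZoneStat statsFor(List<String> zone) {
        String min = null;
        String max = null;
        long nulls = 0;
        for (String v : zone) {
            if (v == null) {
                nulls++;
                continue;
            }
            if (min == null || compareUtf8(v, min) < 0) min = v;
            if (max == null || compareUtf8(v, max) > 0) max = v;
        }
        // Full strings are kept, so neither bound is ever marked as truncated.
        return new ZoneStat(min, max, false, false, nulls);
    }

    public static void main(String[] args) {
        System.out.println(statsFor(Arrays.asList("pear", null, "apple", "zebra")));
    }
}
```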

Co-Authored-By: Claude Opus 4.8 <[email protected]>

Rust computes zone-map stats on the logical column dtype, independent of
the dict encoding, so dict columns carry MIN/MAX/NULL_COUNT like any
other. The Java dict path emitted NULL_COUNT only — a parity gap.

Compute per-chunk min/max on each chunk's logical values at dict-build
time (reusing PrimitiveEncodingEncoder.minMaxStats / VarBinEncodingEncoder
.minMaxStats, now exposed so the dict and flat paths stay identical) and
carry them on DictColRef. Unify zone-map emission: both the flat and dict
loops feed per-zone min/max scalar bytes + null counts through one
emitZoneMap helper (replacing the dict-only NULL_COUNT writer), and the
stat-column builders now take the scalar bytes directly.
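
To illustrate computing stats on logical values at dict-build time, here is a small sketch under simplifying assumptions: non-null `long` values, non-empty chunks, and one zone per chunk. `DictColumn`, `ChunkStats`, and `build` are hypothetical names and do not mirror DictColRef or the real encoders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DictZoneStatSketch {
    record ChunkStats(long min, long max) {}
    // Hypothetical result: one shared dictionary, per-chunk codes, per-chunk stats.
    record DictColumn(long[] dictionary, List<int[]> codes, List<ChunkStats> stats) {}

    static DictColumn build(List<long[]> chunks) {
        Map<Long, Integer> index = new LinkedHashMap<>(); // value -> code, in first-seen order
        List<int[]> codes = new ArrayList<>();
        List<ChunkStats> stats = new ArrayList<>();
        for (long[] chunk : chunks) {
            int[] chunkCodes = new int[chunk.length];
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = 0; i < chunk.length; i++) {
                long v = chunk[i];
                Integer code = index.get(v);
                if (code == null) {
                    code = index.size();
                    index.put(v, code);
                }
                chunkCodes[i] = code;
                // Stats are taken from the logical value, never from the code.
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
            codes.add(chunkCodes);
            stats.add(new ChunkStats(min, max));
        }
        long[] dictionary = index.keySet().stream().mapToLong(Long::longValue).toArray();
        return new DictColumn(dictionary, codes, stats);
    }

    public static void main(String[] args) {
        DictColumn col = build(List.of(new long[] {30, 10, 30}, new long[] {20, 20, 50}));
        System.out.println(col.stats());
    }
}
```

Taking min/max while the logical values are still in hand keeps each chunk's bounds tight: the bounds of the shared dictionary as a whole would be too wide for any single chunk, and the bounds of the codes would be meaningless for pruning.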

Co-Authored-By: Claude Opus 4.8 <[email protected]>
dfa1 merged commit e51da93 into main on Jun 21, 2026
6 checks passed
dfa1 deleted the feat/extension-zonemap-minmax branch on June 21, 2026 18:49