Skip to content

[GH-2939] Box2D spatial join: ST_BoxIntersects / ST_BoxContains#2953

Draft
jiayuasu wants to merge 2 commits into
apache:masterfrom
jiayuasu:feature/box2d-spatial-join
Draft

[GH-2939] Box2D spatial join: ST_BoxIntersects / ST_BoxContains#2953
jiayuasu wants to merge 2 commits into
apache:masterfrom
jiayuasu:feature/box2d-spatial-join

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Teaches Sedona's spatial join planner about the Box2D predicates from #2926. After this PR, both broadcast index joins and distributed range joins handle:

  • ST_BoxIntersects(box_a, box_b)
  • ST_BoxContains(box_a, box_b) (both argument orders)

…using the same machinery (partitioner, R-tree, refine evaluator) that already powers ST_Intersects / ST_Contains joins. No new physical operator, no new index implementation, no new partitioning code.

How it works

At every executor boundary where a shape column is materialised — TraitJoinQueryBase.toSpatialRDD, TraitJoinQueryBase.toExpandedEnvelopeRDD, and BroadcastIndexJoinExec.createStreamShapes — dispatch on the shape expression's dataType. If it is Box2DUDT, read the four doubles out of the serialized InternalRow and materialise the implied closed rectangular Polygon via Constructors.polygonFromEnvelope. Geometry columns continue through GeometrySerializer.deserialize as before.

The materialised Polygon flows through SpatialRDD<T extends Geometry>, the spatial partitioner's sample step, IndexBuilder's R-tree, and SpatialPredicateEvaluators unchanged. JTS already short-circuits axis-aligned rectangle predicates via RectangleIntersects / RectangleContains (gated on Polygon.isRectangle()), which polygonFromEnvelope produces exactly. The refine step therefore pays only a four-double envelope comparison per pair — the savings the user expects from "we know the data is a box".

Predicate-kind mapping

Source predicate SpatialPredicate
ST_BoxIntersects(a, b) INTERSECTS
ST_BoxContains(a, b) COVERS

ST_BoxContains deliberately maps to COVERS, not CONTAINS. PostGIS-style closed-interval containment counts edge-touching pairs as contained; JTS Geometry.contains excludes shared-edge pairs (strict interior), whereas Geometry.covers accepts them — which is what we want.

What's left untouched

  • The Box2D class hierarchy (still a value class).
  • Storage / on-wire layout (still a struct of four doubles via Box2DUDT).
  • SpatialRDD, the partitioners, the R-tree, the refine evaluator.
  • The raster join path and all Geography join paths.

Scope notes

ST_DWithin (distance join) only has (Geometry, Geometry, distance) and (Geography, Geography, distance) overloads today, so Box2D × Box2D distance joins are out of scope for this PR. Adding a (Box2D, Box2D, distance) overload is a follow-up.

How was this patch tested?

New Box2DJoinSuite under spark/common/src/test/scala/org/apache/sedona/sql/:

  • ST_BoxIntersects broadcast index join produces the expected 4 pairs from a 3×3 fixture, with BroadcastIndexJoinExec in the plan.
  • ST_BoxIntersects(R, L) produces the same 4 pairs (argument order symmetric).
  • ST_BoxContains broadcast index join produces the expected 2 closed-interval containment pairs.
  • Edge-touching containment (closed-interval) is counted — locks in the COVERS-not-CONTAINS mapping.
  • Non-broadcast range join produces the same 4 pairs with RangeJoinExec in the plan.
  • Result is equivalent (same ordered row pairs) to ST_Intersects on the same data materialised as polygons via ST_GeomFromBox2D.

Run locally against Spark 3.5 / Scala 2.12. Regression runs: BroadcastIndexJoinSuite 65/65, SpatialJoinSuite 160/160, Box2DUDTSuite 5/5, Box2DCastResolutionRuleSuite 3/3, GeoParquetSpatialFilterPushDownSuite 25/25 — all still pass.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public SQL API documentation surface in isolation. The new spatial-join behavior for ST_BoxIntersects / ST_BoxContains is covered by the consolidated Phase 1+2+3 Box2D docs update.

Wire `ST_BoxIntersects(box_a, box_b)` and `ST_BoxContains(box_a, box_b)`
through the existing spatial join planner (broadcast index, range, and
distance-capable join executors). No new partitioner, no new index, no
new refine path.

Approach: at the executor boundary, dispatch on the shape column's
`dataType`. If it's `Box2DUDT`, read the four doubles out of the
serialized InternalRow and materialise the implied rectangular Polygon
via `Constructors.polygonFromEnvelope`. The materialised Polygon flows
through `SpatialRDD<T extends Geometry>`, the partitioner sample step,
the R-tree `IndexBuilder`, and the refine evaluator with no API change.
JTS already short-circuits axis-aligned rectangle-rectangle predicates
through `RectangleIntersects` / `RectangleContains` (gated on
`Polygon.isRectangle()`), so the refine step pays only the four-double
envelope comparison naturally.

Changes:

- `TraitJoinQueryBase.shapeToGeometry`: centralised dispatch used by
  `toSpatialRDD` and `toExpandedEnvelopeRDD`. Box2DUDT → Polygon;
  GeometryUDT → existing `GeometrySerializer.deserialize`.
- `BroadcastIndexJoinExec.createStreamShapes`: same dispatch at the
  stream-side eval sites for both the distance and non-distance paths
  (raster path unchanged).
- `JoinQueryDetector`: recognise `ST_BoxIntersects` (→ INTERSECTS) and
  `ST_BoxContains` (→ COVERS). `COVERS` is intentional — ST_BoxContains
  is closed-interval, matching JTS `covers`; JTS `contains` would
  reject edge-touching pairs.
- `OptimizableJoinCondition.isOptimizablePredicate`: accept both
  predicates so the condition is forwarded to the detector.
- Storage and the `Box2D` class hierarchy are untouched.

Tests: new `Box2DJoinSuite` covering broadcast index join, range join,
the symmetric form of `ST_BoxIntersects`, the COVERS semantics for
`ST_BoxContains` (incl. edge-touching closed-interval case), and an
equivalence check that the Box2D path yields the same row pairs as
`ST_Intersects` over the boxes materialised as polygons.

Closes apache#2939.
@jiayuasu jiayuasu marked this pull request as draft May 15, 2026 17:29
@jiayuasu jiayuasu requested a review from Copilot May 15, 2026 17:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Extends Sedona Spark SQL’s spatial-join planning/execution pipeline to recognize Box2D predicates (ST_BoxIntersects, ST_BoxContains) as join conditions and route them through existing range joins and broadcast index joins (partitioner + R-tree + refine evaluator), rather than falling back to non-optimized join strategies.

Changes:

  • Adds ST_BoxIntersects / ST_BoxContains to join-condition detection and optimization eligibility.
  • Introduces Box2D→JTS-geometry materialization (as rectangles) so existing join machinery can operate on Box2D keys.
  • Adds a dedicated Scala test suite covering broadcast and non-broadcast joins for Box2D predicates.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
spark/common/src/test/scala/org/apache/sedona/sql/Box2DJoinSuite.scala New join-planner/executor coverage for Box2D predicates across broadcast index and range join plans.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/TraitJoinQueryBase.scala Centralizes shape materialization and adds Box2D handling for join inputs.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/OptimizableJoinCondition.scala Marks Box2D predicates as optimizable join predicates.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala Detects Box2D predicates as spatial join conditions and maps them to SpatialPredicate.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/BroadcastIndexJoinExec.scala Uses the centralized shape materialization logic for stream-side shape extraction (non-raster paths).
Comments suppressed due to low confidence (1)

spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/TraitJoinQueryBase.scala:131

  • toExpandedEnvelopeRDD now materializes shapes via shapeToGeometry, which may return null for NULL inputs; passing a null shape into geometryToExpandedEnvelope will throw. The previous implementation deserialized with GeometrySerializer, which handles null bytes. Add null handling here (skip rows or substitute an empty geometry) to avoid crashing distance joins on NULL shapes.
    spatialRdd.setRawSpatialRDD(
      rdd
        .map { x =>
          val shape = TraitJoinQueryBase.shapeToGeometry(shapeExpression, x)
          val distance = boundRadius.eval(x).asInstanceOf[Double]
          val expandedEnvelope =
            JoinedGeometry.geometryToExpandedEnvelope(shape, distance, isGeography)
          expandedEnvelope.setUserData(x.copy)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 52 to 56
rdd
.map { x =>
val shape =
GeometrySerializer.deserialize(shapeExpression.eval(x).asInstanceOf[Array[Byte]])
val shape = TraitJoinQueryBase.shapeToGeometry(shapeExpression, x)
shape.setUserData(x.copy)
shape
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9144fb8 — the helper now keeps returning null on null inputs, and toSpatialRDD/toExpandedEnvelopeRDD wrap it in a shapeToGeometryOrEmpty that substitutes an empty GeometryCollection. This matches GeometrySerializer.deserialize(null)'s legacy behaviour. The KnnJoinSuite null-geom regression is gone in local runs.

Comment on lines +203 to +208
Constructors
.polygonFromEnvelope(
box.getDouble(0),
box.getDouble(1),
box.getDouble(2),
box.getDouble(3))
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9144fb8 — the Box2D branch now validates xmin <= xmax and ymin <= ymax and throws the same IllegalArgumentException raised by Predicates.boxIntersects/boxContains. Kept polygonFromEnvelope as the materialiser; JTS handles degenerate boxes (zero-area polygon) correctly in both index and refine paths, and switching to geomFromBox2D would introduce shape polymorphism (Point/Line/Polygon) at the join boundary for what's nominally a single rectangle type.

Comment on lines +54 to +60
describe("Box2D spatial join") {

it("ST_BoxIntersects: broadcast index join produces correct pairs") {
val df = leftBoxes
.alias("L")
.join(broadcast(rightBoxes.alias("R")), expr("ST_BoxIntersects(L.box, R.box)"))
val plan = df.queryExecution.sparkPlan
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 9144fb8: 'Null Box2D rows are safe and produce no matches' (covers both broadcast and range paths) and 'Inverted Box2D bounds in a join throw IllegalArgumentException' (verifies the exception type and message).

- Restore null-shape fallback. `shapeToGeometry` returns null on null
  inputs; `toSpatialRDD` / `toExpandedEnvelopeRDD` substitute an empty
  GeometryCollection via the new `shapeToGeometryOrEmpty` wrapper,
  matching the pre-existing `GeometrySerializer.deserialize(null)`
  behaviour. Fixes the `KnnJoinSuite` null-geom NPE regression.
- Reject inverted Box2D bounds (`xmin > xmax` / `ymin > ymax`) with an
  `IllegalArgumentException` whose message matches
  `Predicates.boxIntersects` / `boxContains`. Inverted bounds have no
  defined planar meaning and would silently mis-prune the R-tree.
- Add two edge-case tests to `Box2DJoinSuite`: null Box2D rows survive
  the broadcast and range paths without crashing and without
  contributing matches; inverted Box2D bounds surface the same
  `IllegalArgumentException` as the scalar predicate.

Verified locally against Spark 3.5 / Scala 2.12: Box2DJoinSuite 8/8;
regression run across KnnJoinSuite, BroadcastIndexJoinSuite,
SpatialJoinSuite, Box2DJoinSuite — 262/262.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment on lines +141 to +145
val ex = intercept[org.apache.spark.SparkException] {
invertedLeft
.alias("L")
.join(broadcast(rightBoxes.alias("R")), expr("ST_BoxIntersects(L.box, R.box)"))
.collect()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spatial join planner: recognize ST_BoxIntersects / ST_BoxContains as join predicates

2 participants