
[SPARK-56395][CONNECT][PYTHON] Add NEAREST BY DataFrame API#55682

Open
dilipbiswal wants to merge 5 commits into apache:master from dilipbiswal:SPARK-56395-DF-CONNECT2

Conversation

@dilipbiswal
Contributor

What changes were proposed in this pull request?

Builds on the catalyst-side changes merged in SPARK-56395 (link). Adds the DataFrame `nearestByJoin` method in Scala / Java / PySpark and wires it up through Spark Connect.

Why are the changes needed?

API completeness. The prior PR exposed NEAREST BY only via SQL; this PR brings the same capability to DataFrame / PySpark / Spark Connect.

Does this PR introduce any user-facing change?

Yes. It adds a new `nearestByJoin` method to the DataFrame API:

// Scala

  users.nearestByJoin(
    products,
    -abs(users("score") - products("pscore")),
    numResults = 1,
    mode = "exact",
    direction = "similarity",
    joinType = "leftouter")

// PySpark

  users.nearestByJoin(
      products,
      -sf.abs(users.score - products.pscore),
      1,
      "exact",
      "similarity",
      joinType="leftouter",
  ).select("user_id", "product").show()

How was this patch tested?

DataFrameNearestByJoinSuite, RewriteNearestByJoinSuite, and Python doctests.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7), human-reviewed and tested

session=self._session,
)

def nearestByJoin(
Contributor


We need Spark Connect tests for nearestByJoin - see the lateralJoin tests in DataFrameSubquerySuite and PlanGenerationTestSuite.

}

private[sql] object Dataset {
// Acceptance lists for `nearestByJoin`. Must stay aligned with `NearestByJoinType` /
Contributor


How do we keep these in sync? Is there a good way to share the same validation list, e.g. move the lists to sql/api so that both sql/connect and sql/catalyst could reuse them?
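One way to avoid the two lists drifting is a single shared constants module that both validators import, analogous to the reviewer's sql/api suggestion. A minimal sketch in Python (module name and helper are illustrative, not part of this PR):

```python
# nearest_by_constants.py -- hypothetical single source of truth, analogous to
# hoisting the Scala acceptance lists into sql/api so that sql/connect and
# sql/catalyst import one definition instead of keeping two copies aligned.
SUPPORTED_MODES = ("approx", "exact")
SUPPORTED_DIRECTIONS = ("distance", "similarity")
SUPPORTED_JOIN_TYPES = frozenset({"inner", "leftouter", "left", "left_outer"})

def is_supported_join_type(join_type: str) -> bool:
    # Case-insensitive membership check against the one shared list.
    return join_type.lower() in SUPPORTED_JOIN_TYPES
```

With this layout, adding a new mode or join type is a one-line change that every validator picks up automatically.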

... [("A", 11.0), ("B", 22.0), ("C", 5.0)], ["product", "pscore"])
>>> users.nearestByJoin(
... products, -sf.abs(users.score - products.pscore), 1, "exact", "similarity"
... ).select("user_id", "product").orderBy("user_id").show()
Contributor


The doctest will only cover the happy path with the default inner join - can we add more tests similar to

def test_lateral_join_with_single_column_select(self):

// cannot import.
private val MaxNumResults: Int = 100000
private val SupportedJoinTypeDisplay = "'INNER', 'LEFT OUTER'"
private val SupportedJoinTypes = Set("inner", "leftouter", "left", "left_outer")
Contributor


why do we need both leftouter and left_outer?
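For context, Spark's SQL-side JoinType parsing accepts several spellings by lowercasing the string and dropping underscores, which is likely why both appear in the list. A pure-Python sketch of that normalization (function name is illustrative):

```python
def canonicalize_join_type(name: str) -> str:
    # Lowercase and drop underscores, so "leftouter", "left_outer",
    # "LEFT_OUTER", and "left" all collapse to one canonical value.
    normalized = name.lower().replace("_", "")
    if normalized == "inner":
        return "inner"
    if normalized in ("leftouter", "left"):
        return "leftouter"
    raise ValueError(
        f"Unsupported join type '{name}'. Supported: 'INNER', 'LEFT OUTER'")
```

If the validation list normalized first, it would only need the canonical forms rather than enumerating every alias.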

private val SupportedModes = Seq("approx", "exact")
private val SupportedDirections = Seq("distance", "similarity")

private[connect] def validateNearestByJoinArgs(
Contributor


Looks like we are missing similar validations in the Python client?

Contributor

@zhidongqu-db zhidongqu-db left a comment


I would closely examine the implementation and existing tests for lateral join and try to mirror that here.

@HyukjinKwon HyukjinKwon changed the title [SPARK-56395][DataFrame][CONNECT][PYTHON] Add NEAREST BY DataFrame API [SPARK-56395][CONNECT][PYTHON] Add NEAREST BY DataFrame API May 5, 2026
Builds on SPARK-56395 (catalyst-side, prior PR). Adds the DataFrame
`nearestByJoin` method in Scala / Java / PySpark, the corresponding
Spark Connect proto and server/client wiring, and the end-to-end
DataFrame test suite.
@dilipbiswal force-pushed the SPARK-56395-DF-CONNECT2 branch from 9576f79 to eb05ccf on May 6, 2026 18:52
@dilipbiswal force-pushed the SPARK-56395-DF-CONNECT2 branch from 275687d to 17e65ad on May 6, 2026 22:23