
[SPARK-56395][CONNECT][PYTHON] Add NEAREST BY DataFrame API#55682

Open
dilipbiswal wants to merge 5 commits into apache:master from dilipbiswal:SPARK-56395-DF-CONNECT2

Conversation

@dilipbiswal
Contributor

What changes were proposed in this pull request?

Builds on the catalyst-side changes merged in SPARK-56395 (link). Adds the DataFrame `nearestByJoin` method in Scala / Java / PySpark and wires it up through Spark Connect.

Why are the changes needed?

API completeness. The prior PR exposed NEAREST BY only via SQL; this PR brings the same capability to DataFrame / PySpark / Spark Connect.

Does this PR introduce any user-facing change?

Yes. It adds a new `nearestByJoin` method to the DataFrame API:

// Scala

  users.nearestByJoin(
    products,
    -abs(users("score") - products("pscore")),
    numResults = 1,
    mode = "exact",
    direction = "similarity",
    joinType = "leftouter")

// PySpark

  users.nearestByJoin(
      products,
      -sf.abs(users.score - products.pscore),
      1,
      "exact",
      "similarity",
      joinType="leftouter",
  ).select("user_id", "product").show()

How was this patch tested?

DataFrameNearestByJoinSuite, RewriteNearestByJoinSuite, and Python doctests.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7), human-reviewed and tested

session=self._session,
)

def nearestByJoin(
Contributor


We need Spark Connect tests for nearestByJoin - see the lateralJoin tests in DataFrameSubquerySuite and PlanGenerationTestSuite.

}

private[sql] object Dataset {
// Acceptance lists for `nearestByJoin`. Must stay aligned with `NearestByJoinType` /
Contributor


How do we keep these in sync? Is there a good way to share the same validation list, e.g. move the lists to sql/api so that both sql/connect and sql/catalyst could reuse them?
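One way to avoid the two lists drifting is a single shared constants module that both validators import, analogous to the reviewer's sql/api suggestion. A minimal sketch in Python (module name and helper are illustrative, not part of this PR):

```python
# nearest_by_constants.py -- hypothetical single source of truth, analogous to
# hoisting the Scala acceptance lists into sql/api so that sql/connect and
# sql/catalyst import one definition instead of keeping two copies aligned.
SUPPORTED_MODES = ("approx", "exact")
SUPPORTED_DIRECTIONS = ("distance", "similarity")
SUPPORTED_JOIN_TYPES = frozenset({"inner", "leftouter", "left", "left_outer"})

def is_supported_join_type(join_type: str) -> bool:
    # Case-insensitive membership check against the one shared list.
    return join_type.lower() in SUPPORTED_JOIN_TYPES
```

With this layout, adding a new mode or join type is a one-line change that every validator picks up automatically.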

... [("A", 11.0), ("B", 22.0), ("C", 5.0)], ["product", "pscore"])
>>> users.nearestByJoin(
... products, -sf.abs(users.score - products.pscore), 1, "exact", "similarity"
... ).select("user_id", "product").orderBy("user_id").show()
Contributor


The doctest will only cover the happy path with the default inner join - can we add more tests similar to

def test_lateral_join_with_single_column_select(self):

// cannot import.
private val MaxNumResults: Int = 100000
private val SupportedJoinTypeDisplay = "'INNER', 'LEFT OUTER'"
private val SupportedJoinTypes = Set("inner", "leftouter", "left", "left_outer")
Contributor


why do we need both leftouter and left_outer?
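For context, Spark's SQL-side JoinType parsing accepts several spellings by lowercasing the string and dropping underscores, which is likely why both appear in the list. A pure-Python sketch of that normalization (function name is illustrative):

```python
def canonicalize_join_type(name: str) -> str:
    # Lowercase and drop underscores, so "leftouter", "left_outer",
    # "LEFT_OUTER", and "left" all collapse to one canonical value.
    normalized = name.lower().replace("_", "")
    if normalized == "inner":
        return "inner"
    if normalized in ("leftouter", "left"):
        return "leftouter"
    raise ValueError(
        f"Unsupported join type '{name}'. Supported: 'INNER', 'LEFT OUTER'")
```

If the validation list normalized first, it would only need the canonical forms rather than enumerating every alias.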

private val SupportedModes = Seq("approx", "exact")
private val SupportedDirections = Seq("distance", "similarity")

private[connect] def validateNearestByJoinArgs(
Contributor


Looks like we are missing similar validations in the Python client?

Contributor

@zhidongqu-db zhidongqu-db left a comment


I would closely examine the implementation and existing tests for lateral join and try to mirror that here.

@HyukjinKwon HyukjinKwon changed the title [SPARK-56395][DataFrame][CONNECT][PYTHON] Add NEAREST BY DataFrame API [SPARK-56395][CONNECT][PYTHON] Add NEAREST BY DataFrame API May 5, 2026
Builds on SPARK-56395 (catalyst-side, prior PR). Adds the DataFrame
`nearestByJoin` method in Scala / Java / PySpark, the corresponding
Spark Connect proto and server/client wiring, and the end-to-end
DataFrame test suite.
@dilipbiswal force-pushed the SPARK-56395-DF-CONNECT2 branch from 9576f79 to eb05ccf on May 6, 2026 18:52
@dilipbiswal force-pushed the SPARK-56395-DF-CONNECT2 branch from 275687d to 17e65ad on May 6, 2026 22:23