[SPARK-56395][CONNECT][PYTHON] Add NEAREST BY DataFrame API #55682
dilipbiswal wants to merge 5 commits into apache:master
session=self._session,
)

def nearestByJoin(
We need Spark Connect tests for `nearestByJoin`; see the `lateralJoin` tests in `DataFrameSubquerySuite` and `PlanGenerationTestSuite`.
}

private[sql] object Dataset {
// Acceptance lists for `nearestByJoin`. Must stay aligned with `NearestByJoinType` /
How do we keep these in sync? Is there a good way to share the same validation lists, for example by moving them to sql/api so that both sql/connect and sql/catalyst can reuse them?
... [("A", 11.0), ("B", 22.0), ("C", 5.0)], ["product", "pscore"])
>>> users.nearestByJoin(
... products, -sf.abs(users.score - products.pscore), 1, "exact", "similarity"
... ).select("user_id", "product").orderBy("user_id").show()
The doctest only covers the happy path with the default inner join. Can we add more tests similar to
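One way to pin down expected results for the non-default cases is a brute-force reference in plain Python. The sketch below assumes that `nearestByJoin` keeps, for each left row, the top `num_results` right rows by the ranking expression (largest first for "similarity", smallest first for "distance"), and that a left outer join keeps left rows that have no candidates. Those semantics, the function name `nearest_by_reference`, and the user rows are assumptions for illustration; only the product rows come from the doctest.

```python
def nearest_by_reference(left, right, rank, num_results,
                         direction="similarity", how="inner"):
    """Brute-force reference: pair each left row with its best-ranked right rows."""
    out = []
    for lrow in left:
        # Best match first: descending for similarity, ascending for distance.
        ranked = sorted(
            right,
            key=lambda rrow: rank(lrow, rrow),
            reverse=(direction == "similarity"),
        )
        matches = ranked[:num_results]
        if not matches and how != "inner":
            # Assumed left outer behavior: keep the unmatched left row.
            out.append((lrow, None))
        out.extend((lrow, rrow) for rrow in matches)
    return out


products = [("A", 11.0), ("B", 22.0), ("C", 5.0)]  # rows from the doctest
users = [(1, 10.0), (2, 20.0)]                     # hypothetical (user_id, score) rows


def similarity(user, product):
    # Mirrors the doctest's ranking expression: -abs(score - pscore).
    return -abs(user[1] - product[1])


pairs = nearest_by_reference(users, products, similarity, 1)
result = [(user[0], product[0]) for user, product in pairs]
```

Comparing a table like `result` against the DataFrame output for each join type, mode, and direction would exercise the paths the doctest skips.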
// cannot import.
private val MaxNumResults: Int = 100000
private val SupportedJoinTypeDisplay = "'INNER', 'LEFT OUTER'"
private val SupportedJoinTypes = Set("inner", "leftouter", "left", "left_outer")
Why do we need both `leftouter` and `left_outer`?
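For context on the aliases: Catalyst's `JoinType` parser lowercases the name and drops underscores before matching, so `left_outer` and `leftouter` collapse to a single spelling there. A minimal Python sketch of the same normalization applied to this acceptance set follows; `normalize_join_type` and `CANONICAL_JOIN_TYPES` are hypothetical names, not part of this PR.

```python
# Hypothetical mapping: one entry per alias after normalization.
CANONICAL_JOIN_TYPES = {
    "inner": "inner",
    "left": "leftouter",
    "leftouter": "leftouter",
}


def normalize_join_type(join_type: str) -> str:
    """Lowercase and drop underscores, then map the alias to a canonical name."""
    key = join_type.lower().replace("_", "")
    if key not in CANONICAL_JOIN_TYPES:
        raise ValueError(
            f"Unsupported join type {join_type!r}; supported: 'INNER', 'LEFT OUTER'"
        )
    return CANONICAL_JOIN_TYPES[key]
```

With this shape the acceptance set only needs the underscore-free spellings, since `left_outer` and `LEFT_OUTER` both normalize to `leftouter`.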
private val SupportedModes = Seq("approx", "exact")
private val SupportedDirections = Seq("distance", "similarity")

private[connect] def validateNearestByJoinArgs(
Looks like we are missing similar validations in the Python client?
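A minimal sketch of what mirrored client-side checks might look like in Python, using the acceptance values from the Scala excerpts quoted in this review. The function name `validate_nearest_by_join_args`, its parameter names, the lower bound of 1 on the result count, the case-insensitive matching, and the use of `ValueError` are all assumptions for illustration; the actual PySpark client would presumably raise its own error classes.

```python
# Values copied from the Scala acceptance lists quoted in this review.
MAX_NUM_RESULTS = 100000
SUPPORTED_JOIN_TYPES = {"inner", "leftouter", "left", "left_outer"}
SUPPORTED_MODES = ("approx", "exact")
SUPPORTED_DIRECTIONS = ("distance", "similarity")


def validate_nearest_by_join_args(num_results, join_type, mode, direction):
    """Hypothetical client-side check; raises ValueError on the first bad argument."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(num_results, bool) or not isinstance(num_results, int):
        raise ValueError(f"numResults must be an int, got {type(num_results).__name__}")
    # The lower bound of 1 is an assumption; the upper bound is from the Scala side.
    if not 1 <= num_results <= MAX_NUM_RESULTS:
        raise ValueError(f"numResults must be in [1, {MAX_NUM_RESULTS}], got {num_results}")
    # Case-insensitive matching is an assumption.
    if join_type.lower() not in SUPPORTED_JOIN_TYPES:
        raise ValueError(f"Unsupported join type {join_type!r}; supported: 'INNER', 'LEFT OUTER'")
    if mode.lower() not in SUPPORTED_MODES:
        raise ValueError(f"Unsupported mode {mode!r}; supported: {SUPPORTED_MODES}")
    if direction.lower() not in SUPPORTED_DIRECTIONS:
        raise ValueError(f"Unsupported direction {direction!r}; supported: {SUPPORTED_DIRECTIONS}")
```

Failing fast in the client keeps error messages consistent between classic and Connect sessions, at the cost of one more copy of the lists to keep aligned.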
zhidongqu-db left a comment
I would closely examine the implementation and existing tests for lateral join and try to mirror them here.
Builds on SPARK-56395 (catalyst-side, prior PR). Adds the DataFrame `nearestByJoin` method in Scala / Java / PySpark, the corresponding Spark Connect proto and server/client wiring, and the end-to-end DataFrame test suite.
…on tests, doc alignment
9576f79 to eb05ccf
275687d to 17e65ad
What changes were proposed in this pull request?
Builds on the catalyst-side change merged in SPARK-56395 (link). Adds the DataFrame `nearestByJoin` method in Scala / Java / PySpark and wires up Spark Connect.
Why are the changes needed?
API completeness. The prior PR exposed `NEAREST BY` only via SQL; this PR brings the same capability to DataFrame / PySpark / Spark Connect.
Does this PR introduce any user-facing change?
// Scala
// PySpark
How was this patch tested?
`DataFrameNearestByJoinSuite`, `RewriteNearestByJoinSuite`, and Python doctests.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7), human-reviewed and tested