feat(vector/hnsw): add per‑query ef and distance_threshold to similar_to, fix early termination (#9514)
Hugely appreciative of the Dgraph team’s work. Native vector search
integrated directly into a graph database is a no-brainer today. I have
deployed Dgraph (both vanilla and customised) in systems with 1M+
vectors guiding deep traversal queries across 10M+ nodes. Tight
coupling of vector search with graph traversal at that scale gets us
closer to something that could represent the fuzzy nuances of everything
in an enterprise. It is certainly not the biggest deployment your team
will have seen. This PR fixes an under‑recall edge case in HNSW and
introduces opt‑in, per‑query controls that let users trade recall
against latency safely and predictably. I’ve had this running in
production for a while and thought it worth proposing to main.
- Summary
- Fix incorrect early termination in the HNSW bottom layer that could
stop before collecting k neighbours.
- Extend similar_to with optional per‑query `ef` and
`distance_threshold` (string or JSON‑like fourth argument).
- Backwards compatible: default 3‑arg behaviour of similar_to is
unchanged.
- Motivation
- In narrow probes, the bottom‑layer search could exit at a local
minimum before collecting k, hurting recall.
- No per‑query `ef` meant recall vs latency trade‑offs required global
tuning or inflating k (and downstream work).
- This PR corrects the termination logic and adds opt‑in knobs so users
can increase exploration only when needed.
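
To make the termination condition concrete, here is a minimal sketch of a bottom‑layer beam search. It is not the code in `tok/hnsw/persistent_hnsw.go`: the `item` type, the `searchLayer0` name, the precomputed `dist` slice and the in‑memory `adj` map are all hypothetical simplifications. The point it illustrates is that the loop may only stop once the result set holds `max(k, ef)` entries and the closest remaining candidate cannot improve it.

```go
package main

import (
	"fmt"
	"sort"
)

// item pairs a node id with its distance to the query.
type item struct {
	id   int
	dist float64
}

func sortByDist(xs []item) {
	sort.Slice(xs, func(i, j int) bool { return xs[i].dist < xs[j].dist })
}

// searchLayer0 runs a simplified bottom-layer beam search.
// Assumptions for brevity: dist[i] is the precomputed distance from the query
// to node i, and adj is a hypothetical in-memory adjacency list.
func searchLayer0(dist []float64, adj map[int][]int, entry, k, ef int) []item {
	if k <= 0 {
		return nil // guard zero/negative k
	}
	width := k
	if ef > width {
		width = ef // the bottom layer keeps max(k, ef) results
	}
	visited := map[int]bool{entry: true}
	candidates := []item{{entry, dist[entry]}}
	results := []item{{entry, dist[entry]}} // kept sorted, nearest first
	for len(candidates) > 0 {
		sortByDist(candidates)
		c := candidates[0]
		candidates = candidates[1:]
		worst := results[len(results)-1].dist
		// Stop only when the result set is full AND the nearest remaining
		// candidate is farther than the worst result.
		if len(results) >= width && c.dist > worst {
			break
		}
		for _, n := range adj[c.id] {
			if visited[n] {
				continue
			}
			visited[n] = true
			if len(results) < width || dist[n] < worst {
				candidates = append(candidates, item{n, dist[n]})
				results = append(results, item{n, dist[n]})
				sortByDist(results)
				if len(results) > width {
					results = results[:width]
				}
				worst = results[len(results)-1].dist
			}
		}
	}
	if len(results) > k {
		results = results[:k] // k controls output size, ef only exploration
	}
	return results
}

func main() {
	// A chain 0-1-2-3-4 where the entry point (0) sits in a local minimum.
	dist := []float64{1.0, 5.0, 0.5, 0.4, 0.3}
	adj := map[int][]int{0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
	fmt.Println(searchLayer0(dist, adj, 0, 1, 0)) // narrow probe
	fmt.Println(searchLayer0(dist, adj, 0, 1, 4)) // same k, wider ef
}
```

On this toy graph the narrow probe (`k=1`, no `ef`) returns the entry node, while the same query with `ef=4` reaches node 4, the true nearest neighbour, without inflating `k`.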
- Changes (key files)
- `tok/hnsw/persistent_hnsw.go`: fix early termination, add
`SearchWithOptions`/`SearchWithUidAndOptions`, apply `ef` override at
upper layers and `max(k, ef)` at bottom layer, apply
`distance_threshold` in the metric domain (Euclidean squared internally,
cosine as 1 − sim).
- `tok/index/index.go`: add `VectorIndexOptions` and
`OptionalSearchOptions` (non‑breaking).
- `worker/task.go`: parse optional fourth argument to `similar_to`
(`ef`, `distance_threshold`), thread options, route to optional methods
when provided, guard zero/negative k.
- `tok/index/search_path.go`: add `SearchPathResult` helper.
- Tests: `tok/hnsw/ef_recall_test.go` adds
- `TestHNSWSearchEfOverrideImprovesRecall`
- `TestHNSWDistanceThreshold_Euclidean`
- `TestHNSWDistanceThreshold_Cosine`
- `CHANGELOG.md`: Unreleased entry for HNSW fix and per‑query options.
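
As a rough illustration of the `worker/task.go` change, the sketch below parses an options string into `ef` and `distance_threshold`. The exact syntax accepted by the PR is not shown in this message, so the `key=value` and `{key: value}` forms, the `searchOptions` type and the `parseSearchOptions` name are assumptions made for this example only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// searchOptions is a hypothetical holder for the per-query knobs.
// A nil DistanceThreshold means "no threshold requested".
type searchOptions struct {
	Ef                int
	DistanceThreshold *float64
}

// parseSearchOptions accepts an assumed syntax such as
// "ef=64, distance_threshold=0.5" or "{ef: 64, distance_threshold: 0.5}".
func parseSearchOptions(arg string) (searchOptions, error) {
	var opts searchOptions
	arg = strings.Trim(strings.TrimSpace(arg), "{}")
	for _, part := range strings.Split(arg, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		idx := strings.IndexAny(part, "=:")
		if idx < 0 {
			return searchOptions{}, fmt.Errorf("malformed option %q", part)
		}
		key := strings.Trim(strings.TrimSpace(part[:idx]), `"`)
		val := strings.TrimSpace(part[idx+1:])
		switch key {
		case "ef":
			n, err := strconv.Atoi(val)
			if err != nil || n <= 0 {
				return searchOptions{}, fmt.Errorf("ef must be a positive integer, got %q", val)
			}
			opts.Ef = n
		case "distance_threshold":
			f, err := strconv.ParseFloat(val, 64)
			if err != nil || f < 0 {
				return searchOptions{}, fmt.Errorf("distance_threshold must be a non-negative number, got %q", val)
			}
			opts.DistanceThreshold = &f
		default:
			return searchOptions{}, fmt.Errorf("unknown option %q", key)
		}
	}
	return opts, nil
}

func main() {
	opts, err := parseSearchOptions("ef=64, distance_threshold=0.5")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(opts.Ef, *opts.DistanceThreshold)
}
```

Rejecting unknown keys and non-positive `ef` is a design choice of this sketch; it keeps a typo in the fourth argument from silently falling back to default behaviour.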
- Backwards compatibility
- No default behaviour changes. The three‑argument `similar_to(attr, k,
vector_or_uid)` is unchanged.
- `ef` and `distance_threshold` are optional; unsupported metrics safely
ignore the threshold.
- Performance
- No overhead without options.
- With `ef`, bottom‑layer candidate size becomes `max(k, ef)` (as in
HNSW), so cost scales accordingly.
- Threshold filtering is a cheap pass over candidates; squaring
Euclidean thresholds avoids extra square roots.
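
The threshold pass can be sketched as follows, assuming candidates already carry squared Euclidean distances as the index computes them internally. The `candidate` type and both function names are hypothetical, and treating the threshold as inclusive is an assumption of this sketch rather than something stated above.

```go
package main

import (
	"fmt"
	"math"
)

// candidate is a hypothetical search hit carrying a squared Euclidean distance.
type candidate struct {
	uid         uint64
	squaredDist float64
}

// filterEuclidean keeps candidates within the user-facing threshold.
// The threshold is squared once, so no per-candidate square root is needed.
func filterEuclidean(cands []candidate, threshold float64) []candidate {
	limit := threshold * threshold
	kept := make([]candidate, 0, len(cands))
	for _, c := range cands {
		if c.squaredDist <= limit { // inclusive bound: an assumption
			kept = append(kept, c)
		}
	}
	return kept
}

// cosineDistance returns 1 - cosine similarity, the domain in which a cosine
// threshold is compared. Vectors are assumed to have equal length; a
// zero-length vector is treated as maximally distant (distance 1) here.
func cosineDistance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 1
	}
	return 1 - dot/math.Sqrt(na*nb)
}

func main() {
	cands := []candidate{{1, 1.0}, {2, 4.0}, {3, 9.0}}
	fmt.Println(filterEuclidean(cands, 2.0))
	fmt.Println(cosineDistance([]float64{1, 0}, []float64{0, 1}))
}
```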
- Rationale and alignment
- Matches HNSW semantics: `ef_search` controls exploration/recall, `k`
controls output size.
- Aligns with
[Typesense](https://typesense.org/docs/29.0/api/vector-search.html#vector-search-parameters)’s
per‑query `ef` and `distance_threshold` semantics for familiarity.
Checklist
- [x] Code compiles correctly and linting passes locally
- [x] For all code changes, an entry added to the `CHANGELOG.md`
describing this PR
- [x] Tests added for new functionality / regression tests for the bug
fix
- [ ] For public APIs/new features, docs PR will be prepared and linked
here after initial review