Commit 2c84a01

feat(vector/hnsw): add per‑query ef and distance_threshold to similar_to, fix early termination (#9514)
Hugely appreciative of the Dgraph team’s work. Native vector search integrated directly into a graph database is a no‑brainer today. I have deployed Dgraph (both vanilla and customised) in systems with 1M+ vectors guiding deep traversal queries across 10M+ nodes. Tight coupling of vector search with graph traversal at that scale gets us closer to something that could represent the fuzzy nuances of everything in an enterprise. It is certainly not the biggest deployment your team will have seen, but this PR fixes an under‑recall edge case in HNSW and introduces opt‑in, per‑query controls that let users trade recall against latency safely and predictably. I’ve had this running in production for a while and thought it worth proposing to main.

- Summary
  - Fix incorrect early termination in the HNSW bottom layer that could stop before collecting k neighbours.
  - Extend similar_to with optional per‑query `ef` and `distance_threshold`, passed as named parameters after the third argument.
  - Backwards compatible: default 3‑arg behaviour of similar_to is unchanged.
- Motivation
  - In narrow probes, the bottom‑layer search could exit at a local minimum before collecting k, hurting recall.
  - Without a per‑query `ef`, recall vs latency trade‑offs required global tuning or inflating k (and the downstream work that comes with it).
  - This PR corrects the termination logic and adds opt‑in knobs so users can increase exploration only when needed.
- Changes (key files)
  - `tok/hnsw/persistent_hnsw.go`: fix early termination; add `SearchWithOptions`/`SearchWithUidAndOptions`; apply the `ef` override at upper layers and `max(k, ef)` at the bottom layer; apply `distance_threshold` in the metric domain (Euclidean squared internally, cosine as 1 − sim).
  - `tok/index/index.go`: add `VectorIndexOptions` and `OptionalSearchOptions` (non‑breaking).
  - `worker/task.go`: parse the optional `ef` and `distance_threshold` options passed to `similar_to`, thread them through, route to the optional methods when they are provided, and guard against zero/negative k.
  - `tok/index/search_path.go`: add `SearchPathResult` helper.
  - Tests: `tok/hnsw/ef_recall_test.go` adds
    - `TestHNSWSearchEfOverrideImprovesRecall`
    - `TestHNSWDistanceThreshold_Euclidean`
    - `TestHNSWDistanceThreshold_Cosine`
  - `CHANGELOG.md`: Unreleased entry for the HNSW fix and per‑query options.
- Backwards compatibility
  - No default behaviour changes. The three‑argument `similar_to(attr, k, vector_or_uid)` is unchanged.
  - `ef` and `distance_threshold` are optional; unsupported metrics safely ignore the threshold.
- Performance
  - No overhead without options.
  - With `ef`, the bottom‑layer candidate size becomes `max(k, ef)` (as in HNSW); cost scales accordingly.
  - Threshold filtering is a cheap pass over candidates; squaring Euclidean thresholds avoids extra roots.
- Rationale and alignment
  - Matches HNSW semantics: `ef_search` controls exploration/recall, `k` controls output size.
  - Aligns with [Typesense](https://typesense.org/docs/29.0/api/vector-search.html#vector-search-parameters)’s per‑query `ef` and `distance_threshold` semantics for familiarity.

Checklist
- [x] Code compiles correctly and linting passes locally
- [x] For all code changes, an entry added to the `CHANGELOG.md` describing this PR
- [x] Tests added for new functionality / regression tests for the bug fix
- [ ] For public APIs/new features, docs PR will be prepared and linked here after initial review
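
The option semantics described above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not the code in `tok/hnsw/persistent_hnsw.go`: the helper names `bottomLayerEf`, `euclideanCutoff`, and `cosineDistance` are hypothetical, and treating a non‑positive `ef` as "no override" is an assumption.

```go
package main

import "fmt"

// bottomLayerEf returns the candidate-list size for the bottom layer.
// Assumption: ef <= 0 means "no override", so k alone sizes the list.
func bottomLayerEf(k, ef int) int {
	if ef > k {
		return ef
	}
	return k
}

// euclideanCutoff moves a user-facing Euclidean threshold into the squared
// domain, so candidates can be compared without taking square roots.
func euclideanCutoff(threshold float64) float64 {
	return threshold * threshold
}

// cosineDistance maps a cosine similarity to a distance as 1 - sim.
func cosineDistance(sim float64) float64 {
	return 1 - sim
}

func main() {
	fmt.Println(bottomLayerEf(10, 64)) // ef widens the search
	fmt.Println(bottomLayerEf(10, 4))  // k acts as the floor
	fmt.Println(euclideanCutoff(1.5))
	fmt.Println(cosineDistance(0.75))
}
```

Comparing squared distances against a squared threshold keeps the ordering of candidates unchanged, which is why the root can be skipped.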
1 parent a4e752f commit 2c84a01

11 files changed

Lines changed: 946 additions & 36 deletions

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -88,6 +88,11 @@ as a guide.

 - **Query**
   - fix(query): return full float value in query results (#9492)
+- **Vector**
+  - fix(vector/hnsw): correct early termination in bottom-layer search to ensure at least k
+    candidates are considered before breaking
+  - feat(vector/hnsw): add optional per-query controls to similar_to via named parameters: `ef`
+    (search breadth override) and `distance_threshold` (metric-domain cutoff); defaults unchanged

 - **Changed**
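
The patched search function is not part of this excerpt, so the following is only a generic sketch of the termination rule the changelog entry describes: a best-first layer search that refuses to stop until it holds `ef` results. The types `node` and `scored`, the function `searchLayer`, and the one-dimensional toy graph are all hypothetical; sorted slices stand in for the heaps a real HNSW implementation would use.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// node is a point on a line plus its neighbours (indices into the graph).
type node struct {
	pos   float64
	edges []int
}

type scored struct {
	id   int
	dist float64
}

func byDist(s []scored) {
	sort.Slice(s, func(i, j int) bool { return s[i].dist < s[j].dist })
}

// searchLayer explores the graph best-first and returns up to ef results,
// nearest first. It only breaks early once ef results have been collected.
func searchLayer(g []node, query float64, entry, ef int) []scored {
	start := scored{entry, math.Abs(g[entry].pos - query)}
	visited := map[int]bool{entry: true}
	candidates := []scored{start} // frontier, nearest first
	results := []scored{start}    // best so far, nearest first
	for len(candidates) > 0 {
		c := candidates[0]
		candidates = candidates[1:]
		worst := results[len(results)-1]
		// Guard: a local minimum is not a reason to stop while the
		// result list is still short of ef entries.
		if len(results) >= ef && c.dist > worst.dist {
			break
		}
		for _, n := range g[c.id].edges {
			if visited[n] {
				continue
			}
			visited[n] = true
			s := scored{n, math.Abs(g[n].pos - query)}
			if len(results) < ef || s.dist < results[len(results)-1].dist {
				candidates = append(candidates, s)
				results = append(results, s)
				byDist(candidates)
				byDist(results)
				if len(results) > ef {
					results = results[:ef]
				}
			}
		}
	}
	return results
}

func main() {
	// The entry node (pos 1) is nearest to the query; its only neighbour is far away.
	g := []node{
		{pos: 1, edges: []int{1}},
		{pos: 4, edges: []int{0, 2}},
		{pos: 2, edges: []int{1, 3}},
		{pos: 3, edges: []int{2}},
	}
	for _, r := range searchLayer(g, 0, 0, 3) {
		fmt.Println(r.id, r.dist)
	}
}
```

In this toy graph the search has to pass through the distant node at position 4 to reach the two remaining near neighbours, which is the kind of path a premature exit would cut off.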

dql/parser.go

Lines changed: 92 additions & 1 deletion
@@ -1745,6 +1745,10 @@ L:

 name := collectName(it, item.Val)
 function.Name = strings.ToLower(name)
+var similarToOptSeen map[string]struct{}
+if function.Name == similarToFn {
+	similarToOptSeen = make(map[string]struct{})
+}
 if _, ok := tryParseItemType(it, itemLeftRound); !ok {
 	return nil, it.Errorf("Expected ( after func name [%s]", function.Name)
 }
@@ -1874,7 +1878,10 @@ L:
 case IsInequalityFn(function.Name):
 	err = parseFuncArgs(it, function)

-case function.Name == "uid_in" || function.Name == "similar_to":
+case function.Name == "uid_in":
+	err = parseFuncArgs(it, function)
+
+case function.Name == "similar_to":
 	err = parseFuncArgs(it, function)

 default:
@@ -1892,7 +1899,87 @@ L:
 	}
 	expectArg = false
 	continue
+case itemLeftCurl:
+	return nil, itemInFunc.Errorf("Unrecognized character inside a func: U+007B '{'")
+case itemRightCurl:
+	// Right curly braces are never valid in function arguments outside of
+	// the (unsupported) object literal syntax. Always error on stray '}'.
+	return nil, itemInFunc.Errorf("Unrecognized character inside a func: U+007D '}'")
 default:
+	// similar_to supports named optional parameters after the 3rd positional argument:
+	// similar_to(pred, k, vec, ef: 64, distance_threshold: 0.5)
+	//
+	// Internally we represent each option as two args appended after k and vec:
+	// ["ef", "64", "distance_threshold", "0.5", ...]
+	if itemInFunc.Typ == itemName && function.Name == similarToFn &&
+		function.Attr != "" && len(function.Args) >= 2 {
+		next, ok := it.PeekOne()
+		if ok && next.Typ == itemColon {
+			key := strings.ToLower(collectName(it, itemInFunc.Val))
+			switch key {
+			case "ef", "distance_threshold":
+			default:
+				return nil, itemInFunc.Errorf("Unknown option %q in similar_to", key)
+			}
+			if _, exists := similarToOptSeen[key]; exists {
+				return nil, itemInFunc.Errorf("Duplicate key %q in similar_to options", key)
+			}
+			similarToOptSeen[key] = struct{}{}
+
+			if ok := trySkipItemTyp(it, itemColon); !ok {
+				return nil, it.Errorf("Expected colon(:) after %s", key)
+			}
+			if !it.Next() {
+				return nil, it.Errorf("Expected value for %s", key)
+			}
+			valItem := it.Item()
+			switch valItem.Typ {
+			case itemDollar:
+				varName, err := parseVarName(it)
+				if err != nil {
+					return nil, err
+				}
+				function.Args = append(function.Args, Arg{Value: key})
+				function.Args = append(function.Args, Arg{Value: varName, IsDQLVar: true})
+			case itemMathOp:
+				// Allow signed numeric literals, e.g. distance_threshold: -0.5
+				prefix := valItem.Val
+				if !it.Next() {
+					return nil, it.Errorf("Expected value after %s for %s", prefix, key)
+				}
+				valItem = it.Item()
+				if valItem.Typ != itemName {
+					return nil, valItem.Errorf("Expected value for %s", key)
+				}
+				v := collectName(it, valItem.Val)
+				v = strings.Trim(v, " \t")
+				uq, err := unquoteIfQuoted(v)
+				if err != nil {
+					return nil, err
+				}
+				function.Args = append(function.Args, Arg{Value: key})
+				function.Args = append(function.Args, Arg{Value: prefix + uq})
+			default:
+				if valItem.Typ != itemName {
+					return nil, valItem.Errorf("Expected value for %s", key)
+				}
+				v := collectName(it, valItem.Val)
+				v = strings.Trim(v, " \t")
+				uq, err := unquoteIfQuoted(v)
+				if err != nil {
+					return nil, err
+				}
+				function.Args = append(function.Args, Arg{Value: key})
+				function.Args = append(function.Args, Arg{Value: uq})
+			}
+
+			expectArg = false
+			continue
+		}
+
+		// Disallow extra positional args after (k, vec). Options must be named.
+		return nil, itemInFunc.Errorf("Expected named parameter in similar_to options (e.g. ef: 64)")
+	}
 	if itemInFunc.Typ != itemName {
 		return nil, itemInFunc.Errorf("Expected arg after func [%s], but got item %v",
 			function.Name, itemInFunc)
@@ -2408,6 +2495,10 @@ loop:
 		// The parentheses are balanced out. Let's break.
 		break loop
 	}
+case item.Typ == itemLeftCurl:
+	return nil, item.Errorf("Unrecognized character inside a func: U+007B '{'")
+case item.Typ == itemRightCurl:
+	return nil, item.Errorf("Unrecognized character inside a func: U+007D '}'")
 default:
 	return nil, item.Errorf("Unexpected item while parsing @filter: %v", item)
 }
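
The parser comment above says each named option is stored as a (key, value) pair appended after `k` and `vec`. A consumer of that flat list could decode it roughly as follows. This is a sketch under assumptions: `searchOptions` and `decodeSimilarToOptions` are hypothetical names, plain strings stand in for the parser's `Arg` values, and the real decoding in `worker/task.go` is not shown in this excerpt.

```go
package main

import (
	"fmt"
	"strconv"
)

// searchOptions is a hypothetical container for the decoded options.
type searchOptions struct {
	Ef                int
	DistanceThreshold float64
	HasThreshold      bool
}

// decodeSimilarToOptions reads the (key, value) pairs that follow k and vec,
// e.g. ["4", "[0,0]", "ef", "64", "distance_threshold", "0.5"].
func decodeSimilarToOptions(args []string) (searchOptions, error) {
	var opts searchOptions
	if len(args) < 2 {
		return opts, fmt.Errorf("similar_to needs k and a vector, got %d args", len(args))
	}
	pairs := args[2:]
	if len(pairs)%2 != 0 {
		return opts, fmt.Errorf("option %q has no value", pairs[len(pairs)-1])
	}
	for i := 0; i < len(pairs); i += 2 {
		key, val := pairs[i], pairs[i+1]
		switch key {
		case "ef":
			n, err := strconv.Atoi(val)
			if err != nil {
				return opts, fmt.Errorf("ef: %w", err)
			}
			opts.Ef = n
		case "distance_threshold":
			f, err := strconv.ParseFloat(val, 64)
			if err != nil {
				return opts, fmt.Errorf("distance_threshold: %w", err)
			}
			opts.DistanceThreshold = f
			opts.HasThreshold = true
		default:
			return opts, fmt.Errorf("unknown option %q", key)
		}
	}
	return opts, nil
}

func main() {
	opts, err := decodeSimilarToOptions(
		[]string{"4", "[0,0]", "distance_threshold", "1.5", "ef", "12"})
	fmt.Println(opts, err)
}
```

Keeping a separate `HasThreshold` flag distinguishes "no threshold given" from an explicit threshold of zero.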

dql/parser_test.go

Lines changed: 127 additions & 6 deletions
@@ -2518,6 +2518,12 @@ func TestParseFilter_brac(t *testing.T) {
 }

 // Test if unbalanced brac will lead to errors.
+// Note: This query has two errors: missing ')' after '()' AND a stray '{'.
+// After changes to support similar_to's JSON args the lexer now emits brace tokens
+// instead of erroring immediately. This causes the query to fail on the structural
+// error (unclosed brackets) rather than the character-specific error. This is an
+// acceptable trade-off because queries with multiple syntax errors may report a different
+// (but equally fatal) error first.
 func TestParseFilter_unbalancedbrac(t *testing.T) {
 	query := `
 	query {
@@ -2532,8 +2538,119 @@ func TestParseFilter_unbalancedbrac(t *testing.T) {
 	`
 	_, err := Parse(Request{Str: query})
 	require.Error(t, err)
-	require.Contains(t, err.Error(),
-		"Unrecognized character inside a func: U+007B '{'")
+	require.Contains(t, err.Error(), "Unclosed Brackets")
+}
+
+func TestParseSimilarToNamedParams(t *testing.T) {
+	query := `{
+		q(func: similar_to(voptions, 4, "[0,0]", distance_threshold: 1.5, ef: 12)) {
+			uid
+		}
+	}`
+	res, err := Parse(Request{Str: query})
+	require.NoError(t, err)
+	require.Len(t, res.Query, 1)
+	require.NotNil(t, res.Query[0])
+	require.NotNil(t, res.Query[0].Func)
+	require.Equal(t, "similar_to", res.Query[0].Func.Name)
+	require.Equal(t, "voptions", res.Query[0].Func.Attr)
+	require.Equal(t, "4", res.Query[0].Func.Args[0].Value)
+	require.Equal(t, "[0,0]", res.Query[0].Func.Args[1].Value)
+
+	// Options are appended as (key, value) pairs after k and vec.
+	require.Len(t, res.Query[0].Func.Args, 6)
+	require.Equal(t, "distance_threshold", res.Query[0].Func.Args[2].Value)
+	require.Equal(t, "1.5", res.Query[0].Func.Args[3].Value)
+	require.Equal(t, "ef", res.Query[0].Func.Args[4].Value)
+	require.Equal(t, "12", res.Query[0].Func.Args[5].Value)
+}
+
+func TestParseSimilarToThreeArgs(t *testing.T) {
+	// Test three-arg form (no options)
+	query := `{
+		q(func: similar_to(voptions, 4, "[0,0]")) {
+			uid
+		}
+	}`
+	res, err := Parse(Request{Str: query})
+	require.NoError(t, err)
+	require.Equal(t, "similar_to", res.Query[0].Func.Name)
+	require.Len(t, res.Query[0].Func.Args, 2)
+}
+
+func TestParseSimilarToRejectsObjectLiteralSyntax(t *testing.T) {
+	query := `{
+		q(func: similar_to(voptions, 4, "[0,0]", {ef: 12})) {
+			uid
+		}
+	}`
+	_, err := Parse(Request{Str: query})
+	require.Error(t, err)
+	require.Contains(t, err.Error(), "Unrecognized character inside a func: U+007B '{'")
+}
+
+func TestParseSimilarToWithQueryVariable(t *testing.T) {
+	query := `query test($eff: int) {
+		q(func: similar_to(voptions, 4, "[0,0]", ef: $eff)) {
+			uid
+		}
+	}`
+	res, err := Parse(Request{
+		Str:       query,
+		Variables: map[string]string{"$eff": "64"},
+	})
+	require.NoError(t, err)
+	require.Equal(t, "similar_to", res.Query[0].Func.Name)
+	require.Len(t, res.Query[0].Func.Args, 4)
+	require.Equal(t, "ef", res.Query[0].Func.Args[2].Value)
+	require.Equal(t, "64", res.Query[0].Func.Args[3].Value)
+}
+
+func TestParseSimilarToRejectsLegacyStringOptionsSyntax(t *testing.T) {
+	query := `{
+		q(func: similar_to(voptions, 4, "[0,0]", "ef=64,distance_threshold=0.45")) {
+			uid
+		}
+	}`
+	_, err := Parse(Request{Str: query})
+	require.Error(t, err)
+	require.Contains(t, err.Error(), "Expected named parameter in similar_to options")
+}
+
+func TestParseSimilarToUnknownOption(t *testing.T) {
+	query := `{
+		q(func: similar_to(voptions, 4, "[0,0]", foo: 5)) {
+			uid
+		}
+	}`
+	_, err := Parse(Request{Str: query})
+	require.Error(t, err)
+	require.Contains(t, err.Error(), "Unknown option")
+	require.Contains(t, err.Error(), "foo")
+}
+
+func TestParseSimilarToDuplicateOption(t *testing.T) {
+	query := `{
+		q(func: similar_to(voptions, 4, "[0,0]", ef: 10, ef: 20)) {
+			uid
+		}
+	}`
+	_, err := Parse(Request{Str: query})
+	require.Error(t, err)
+	require.Contains(t, err.Error(), "Duplicate key")
+	require.Contains(t, err.Error(), "ef")
+}
+
+func TestParseNonSimilarToWithBrace(t *testing.T) {
+	// Braces in non-similar_to functions should be rejected
+	query := `{
+		q(func: eq(name, {value: "test"})) {
+			uid
+		}
+	}`
+	_, err := Parse(Request{Str: query})
+	require.Error(t, err)
+	require.Contains(t, err.Error(), "Unrecognized character inside a func: U+007B '{'")
 }

 func TestParseFilter_Geo1(t *testing.T) {
@@ -2768,6 +2885,10 @@ func TestParseCountAsFunc(t *testing.T) {

 }

+// Note: This query has two errors: missing ')' after 'friends' AND a stray '}'.
+// After changes to support similar_to's JSON args the lexer emits brace tokens instead
+// of erroring immediately -- causing this to fail on unclosed brackets rather than the
+// specific character error. See TestParseFilter_unbalancedbrac for full explanation.
 func TestParseCountError1(t *testing.T) {
 	query := `{
 		me(func: uid(1)) {
@@ -2779,10 +2900,11 @@ func TestParseCountError1(t *testing.T) {
 	`
 	_, err := Parse(Request{Str: query})
 	require.Error(t, err)
-	require.Contains(t, err.Error(),
-		"Unrecognized character inside a func: U+007D '}'")
+	require.Contains(t, err.Error(), "Unclosed Brackets")
 }

+// Note: Similar to TestParseCountError1, this has missing ')' and stray '}',
+// now reports structural error instead of character-specific error.
 func TestParseCountError2(t *testing.T) {
 	query := `{
 		me(func: uid(1)) {
@@ -2794,8 +2916,7 @@ func TestParseCountError2(t *testing.T) {
 	`
 	_, err := Parse(Request{Str: query})
 	require.Error(t, err)
-	require.Contains(t, err.Error(),
-		"Unrecognized character inside a func: U+007D '}'")
+	require.Contains(t, err.Error(), "Unclosed Brackets")
 }

 func TestParseCheckPwd(t *testing.T) {
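
The parser tests above pin down the accepted call shape: three positional arguments followed by optional `key: value` pairs. A client could render such a call with a small helper like the one below. `similarToCall` is a hypothetical name, not part of Dgraph or its clients, and treating `ef <= 0` or a nil threshold as "omit the option" is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// similarToCall renders a similar_to call using the named-parameter syntax.
// ef <= 0 omits ef; a nil threshold omits distance_threshold.
func similarToCall(pred string, k int, vec string, ef int, threshold *float64) string {
	var b strings.Builder
	fmt.Fprintf(&b, "similar_to(%s, %d, %q", pred, k, vec)
	if ef > 0 {
		fmt.Fprintf(&b, ", ef: %d", ef)
	}
	if threshold != nil {
		b.WriteString(", distance_threshold: ")
		b.WriteString(strconv.FormatFloat(*threshold, 'g', -1, 64))
	}
	b.WriteString(")")
	return b.String()
}

func main() {
	cutoff := 1.5
	fmt.Println(similarToCall("voptions", 4, "[0,0]", 12, &cutoff))
	fmt.Println(similarToCall("voptions", 4, "[0,0]", 0, nil))
}
```

For values that come from user input, passing them as DQL variables (as `TestParseSimilarToWithQueryVariable` does with `$eff`) avoids string interpolation altogether.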

dql/state.go

Lines changed: 12 additions & 0 deletions
@@ -306,6 +306,18 @@ func lexFuncOrArg(l *lex.Lexer) lex.StateFn {
 	l.Emit(itemLeftSquare)
 case r == rightSquare:
 	l.Emit(itemRightSquare)
+case r == leftCurl:
+	empty = false
+	l.Emit(itemLeftCurl)
+	// Design decision: Emit brace tokens without affecting ArgDepth tracking.
+	// The parser validates whether braces are legal in context.
+	// Trade-off: Queries with multiple syntax errors (e.g., missing ')' AND stray '}')
+	// will report structural errors (Unclosed Brackets) rather than character-specific
+	// errors. This is acceptable as the query is still rejected with a clear error.
+case r == rightCurl:
+	l.Emit(itemRightCurl)
+	// Don't decrement ArgDepth for braces; let parser validate context.
+	// See leftCurl case above for full rationale.
 case r == '#':
 	return lexComment
 case r == '.':
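
The trade-off described in the lexer comments can be reproduced with a toy two-stage check. This is not Dgraph's lexer: `checkFuncArgs` is a hypothetical function that takes only the text of a single function call, and it assumes that lexing finishes (and its errors are reported) before the parser looks at any brace token.

```go
package main

import (
	"errors"
	"fmt"
)

// checkFuncArgs mimics the ordering of the two error paths for one function call.
func checkFuncArgs(src string) error {
	depth := 0
	var braces []rune
	// Stage 1 ("lexer"): only parentheses change the depth; braces are
	// recorded as tokens and otherwise ignored.
	for _, r := range src {
		switch r {
		case '(':
			depth++
		case ')':
			depth--
		case '{', '}':
			braces = append(braces, r)
		}
	}
	if depth != 0 {
		return errors.New("Unclosed Brackets")
	}
	// Stage 2 ("parser"): any brace token among the arguments is rejected.
	if len(braces) > 0 {
		return fmt.Errorf("Unrecognized character inside a func: %U %q", braces[0], braces[0])
	}
	return nil
}

func main() {
	fmt.Println(checkFuncArgs(`eq(name, {value: "test"})`)) // balanced parens, brace error
	fmt.Println(checkFuncArgs(`count(friends }`))           // missing ')' is reported first
	fmt.Println(checkFuncArgs(`eq(name, "test")`))
}
```

With two faults in one input, the structural error is reported first, which matches the updated expectations in `TestParseCountError1` and `TestParseCountError2`.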

query/vector/vector_test.go

Lines changed: 68 additions & 0 deletions
@@ -417,6 +417,74 @@ func TestVectorIndexRebuildWhenChange(t *testing.T) {
 	require.Greater(t, dur, time.Second*4)
 }

+func TestSimilarToOptionsIntegration(t *testing.T) {
+	const pred = "voptions"
+	dropPredicate(pred)
+	t.Cleanup(func() { dropPredicate(pred) })
+
+	setSchema(fmt.Sprintf(vectorSchemaWithIndex, pred, "4", "euclidean"))
+
+	rdf := `<0x1> <voptions> "[0,0]" .
+	<0x2> <voptions> "[1,0]" .
+	<0x3> <voptions> "[2,0]" .
+	<0x4> <voptions> "[5,0]" .`
+	require.NoError(t, addTriplesToCluster(rdf))
+
+	t.Run("ef_override_named_param", func(t *testing.T) {
+		query := `{
+			results(func: similar_to(voptions, 3, "[0,0]", ef: 2)) {
+				uid
+			}
+		}`
+		resp := processQueryNoErr(t, query)
+
+		var result struct {
+			Data struct {
+				Results []struct {
+					UID string `json:"uid"`
+				} `json:"results"`
+			} `json:"data"`
+		}
+		require.NoError(t, json.Unmarshal([]byte(resp), &result))
+		require.Len(t, result.Data.Results, 3)
+
+		expected := map[string]struct{}{"0x1": {}, "0x2": {}, "0x3": {}}
+		for _, r := range result.Data.Results {
+			_, ok := expected[r.UID]
+			require.Truef(t, ok, "unexpected uid %s", r.UID)
+			delete(expected, r.UID)
+		}
+		require.Empty(t, expected)
+	})
+
+	t.Run("distance_threshold_named_param", func(t *testing.T) {
+		query := `{
+			results(func: similar_to(voptions, 4, "[0,0]", distance_threshold: 1.5)) {
+				uid
+			}
+		}`
+		resp := processQueryNoErr(t, query)
+
+		var result struct {
+			Data struct {
+				Results []struct {
+					UID string `json:"uid"`
+				} `json:"results"`
+			} `json:"data"`
+		}
+		require.NoError(t, json.Unmarshal([]byte(resp), &result))
+		require.Len(t, result.Data.Results, 2)
+
+		expected := map[string]struct{}{"0x1": {}, "0x2": {}}
+		for _, r := range result.Data.Results {
+			_, ok := expected[r.UID]
+			require.Truef(t, ok, "unexpected uid %s", r.UID)
+			delete(expected, r.UID)
+		}
+		require.Empty(t, expected)
+	})
+}
+
 func TestVectorInQueryArgument(t *testing.T) {
 	dropPredicate("vtest")
 	setSchema(fmt.Sprintf(vectorSchemaWithIndex, "vtest", "4", "euclidean"))
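
The expected UIDs in `TestSimilarToOptionsIntegration` can be checked by hand with an exact, brute-force search over the same four vectors. The sketch below is a reference calculation only: `point`, `squaredDistance`, and `nearest` are hypothetical names, a negative threshold is used here to mean "no cutoff", and treating the cutoff as inclusive is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

type point struct {
	uid string
	vec []float64
}

// squaredDistance assumes a and b have the same length.
func squaredDistance(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// nearest returns up to k uids ordered by distance to q. Points farther than
// threshold are dropped; a negative threshold disables the cutoff.
func nearest(pts []point, q []float64, k int, threshold float64) []string {
	type hit struct {
		uid string
		d2  float64
	}
	var hits []hit
	for _, p := range pts {
		d2 := squaredDistance(p.vec, q)
		// Compare in the squared domain, so no square root is needed.
		if threshold >= 0 && d2 > threshold*threshold {
			continue
		}
		hits = append(hits, hit{p.uid, d2})
	}
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].d2 < hits[j].d2 })
	if len(hits) > k {
		hits = hits[:k]
	}
	uids := make([]string, 0, len(hits))
	for _, h := range hits {
		uids = append(uids, h.uid)
	}
	return uids
}

func main() {
	pts := []point{
		{"0x1", []float64{0, 0}},
		{"0x2", []float64{1, 0}},
		{"0x3", []float64{2, 0}},
		{"0x4", []float64{5, 0}},
	}
	q := []float64{0, 0}
	fmt.Println(nearest(pts, q, 3, -1))  // prints [0x1 0x2 0x3]
	fmt.Println(nearest(pts, q, 4, 1.5)) // prints [0x1 0x2]
}
```

The squared distances from the query are 0, 1, 4, and 25, and the squared cutoff is 2.25, so only `0x1` and `0x2` survive the threshold even though `k` is 4.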
