fix: stop lucene_sanitize escaping individual uppercase letters (breaks BM25)#1569
eldar702 wants to merge 1 commit into
All contributors have signed the CLA ✍️ ✅
I have read the CLA Document and I hereby sign the CLA behalf on myself, e-mail: [email protected]
I have read the CLA Document and I hereby sign the CLA
…ep#1302)

lucene_sanitize built its escape map with single-letter entries for O/R/N/T/A/D and applied it via str.translate, which is per-character. As a result every uppercase O/R/N/T/A/D in a query was backslash-escaped (e.g. "NORD stream" -> "\N\O\R\D stream"), corrupting BM25 fulltext matching for any query containing those common letters.

Drop the six single-letter map entries and instead escape only the Lucene boolean operators AND / OR / NOT, as whole words, using a stdlib re.sub. All other special-character escaping is unchanged.

AI-assisted contribution.
c92346f to 328333a
Rebased onto current |
FYI — filed #1582 about the |
Summary
`lucene_sanitize()` escaped individual uppercase letters O/R/N/T/A/D as a crude approximation of escaping the Lucene boolean operators AND/OR/NOT. Because `str.maketrans`/`str.translate` operate per-character, every uppercase O, R, N, T, A, or D in a query was backslash-escaped, corrupting most real-world queries and breaking BM25 fulltext matching.

On current `main`:
- `"NORD stream"` → `"\N\O\R\D stream"`
- `"EBITDA forecast"` → `"EBI\T\D\A forecast"`

Root cause
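The two corrupted outputs above can be reproduced outside the repository. The following is a minimal sketch, assuming a cut-down translation table that contains only the six single-letter entries; the name `old_sanitize_letters` is hypothetical, and the real `escape_map` also covers Lucene special characters, which are omitted here.

```python
# Cut-down stand-in for the old map: only the six single-letter entries.
# The real escape_map also escapes Lucene special characters (omitted here).
old_letter_map = str.maketrans({
    'O': r'\O', 'R': r'\R', 'N': r'\N',
    'T': r'\T', 'A': r'\A', 'D': r'\D',
})


def old_sanitize_letters(query: str) -> str:
    """Hypothetical helper reproducing only the letter-escaping part of the bug."""
    # str.translate looks up each character independently, so it cannot
    # distinguish "the O in OR" from "the O in NORD".
    return query.translate(old_letter_map)


print(old_sanitize_letters("NORD stream"))      # \N\O\R\D stream
print(old_sanitize_letters("EBITDA forecast"))  # EBI\T\D\A forecast
```

Because the table is keyed by single characters, it has no way to express "escape this letter only when it is part of a whole-word operator".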
`graphiti_core/helpers.py::lucene_sanitize` builds `escape_map = str.maketrans({... 'O': r'\O', 'R': r'\R', 'N': r'\N', 'T': r'\T', 'A': r'\A', 'D': r'\D'})`. Per-character translation escapes those letters everywhere, not just inside the operators AND/OR/NOT.

Fix
- Drop the six single-letter entries from `escape_map`.
- Escape only the boolean operators AND/OR/NOT, as whole words, via `re.sub(r'\b(AND|OR|NOT)\b', r'\\\1', sanitized)` (`re` is stdlib, so there is no new dependency).
- `graphiti_core/helpers.py`: +7/-6.

Behavior after the fix:
- `"NORD stream"` → `"NORD stream"` (unchanged)
- `"cats AND dogs"` → `"cats \AND dogs"` (operator still escaped, as a whole word)
- `"ANDES mountains"` → `"ANDES mountains"` (operator substring inside a word is not escaped)

Testing
- Updated `test_lucene_sanitize`, whose expected value encoded the old bug (a leading `\` on "This" from the `T` rule).
- Added `test_lucene_sanitize_preserves_uppercase_words`, covering plain uppercase tokens, whole-word operator escaping, and embedded-operator non-escaping.
- `uv run pytest tests/helpers_test.py::test_lucene_sanitize_preserves_uppercase_words` and the existing `test_lucene_sanitize` both pass.

Closes #1302
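For reviewers who want to check the whole-word operator escaping without the repository, here is a minimal standalone sketch. It covers only the operator step from the Fix section, not the special-character escaping that `lucene_sanitize` also performs, and the names `_OPERATOR_RE` and `escape_boolean_operators` are hypothetical rather than taken from the PR.

```python
import re

# Whole-word match for the three Lucene boolean operators.
# \b on both sides keeps "ANDES" or "NORD" from matching.
_OPERATOR_RE = re.compile(r'\b(AND|OR|NOT)\b')


def escape_boolean_operators(sanitized: str) -> str:
    """Hypothetical helper: backslash-escape AND/OR/NOT as whole words only."""
    # In the replacement template r'\\\1', "\\" emits one literal backslash
    # and "\1" re-inserts the matched operator, so "AND" becomes "\AND".
    return _OPERATOR_RE.sub(r'\\\1', sanitized)


# The three behaviors listed in the PR description.
assert escape_boolean_operators("NORD stream") == "NORD stream"
assert escape_boolean_operators("cats AND dogs") == r"cats \AND dogs"
assert escape_boolean_operators("ANDES mountains") == "ANDES mountains"
```

The pattern is case-sensitive, so lowercase "and", "or", and "not" pass through untouched in this sketch.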
Type of change: Bug fix.
Drafted with AI assistance (Claude Opus) and verified against current `main`. The Zep CLA will be signed by me (the human contributor) when the CLA-assistant bot prompts.