Skip to content

fix: stop lucene_sanitize escaping individual uppercase letters (breaks BM25)#1569

Open
eldar702 wants to merge 1 commit into
getzep:mainfrom
eldar702:fix/1302-lucene-sanitize-uppercase
Open

fix: stop lucene_sanitize escaping individual uppercase letters (breaks BM25)#1569
eldar702 wants to merge 1 commit into
getzep:mainfrom
eldar702:fix/1302-lucene-sanitize-uppercase

Conversation

@eldar702

@eldar702 eldar702 commented Jun 9, 2026

Copy link
Copy Markdown

Summary

lucene_sanitize() escaped individual uppercase letters O/R/N/T/A/D as a crude approximation of escaping the Lucene boolean operators AND/OR/NOT. Because str.maketrans / str.translate operate per-character, every uppercase O, R, N, T, A, or D in a query was backslash-escaped — corrupting most real-world queries and breaking BM25 fulltext matching.

On current main:

  • "NORD stream""\N\O\R\D stream"
  • "EBITDA forecast""EBI\T\D\A forecast"

Root cause

graphiti_core/helpers.py::lucene_sanitize builds escape_map = str.maketrans({... 'O': r'\O', 'R': r'\R', 'N': r'\N', 'T': r'\T', 'A': r'\A', 'D': r'\D'}). Per-character translation escapes those letters everywhere, not just inside the operators AND/OR/NOT.

Fix

  • Remove the six single-letter entries from escape_map.
  • Escape the boolean operators as whole words instead: re.sub(r'\b(AND|OR|NOT)\b', r'\\\1', sanitized) (re is stdlib — no new dependency).
  • graphiti_core/helpers.py +7/-6.

Behavior after the fix:

  • "NORD stream""NORD stream" (unchanged)
  • "cats AND dogs""cats \AND dogs" (operator still escaped, as a whole word)
  • "ANDES mountains""ANDES mountains" (operator substring inside a word is not escaped)

Testing

  • Updated the pre-existing test_lucene_sanitize, whose expected value encoded the old bug (a leading \ on "This" from the T rule).
  • Added test_lucene_sanitize_preserves_uppercase_words covering plain uppercase tokens, whole-word operator escaping, and embedded-operator non-escaping.
  • Service-free (no DB/LLM): uv run pytest tests/helpers_test.py::test_lucene_sanitize_preserves_uppercase_words and the existing test_lucene_sanitize both pass.

Closes #1302

Type of change: Bug fix.


Drafted with AI assistance (Claude Opus) and verified against current main. The Zep CLA will be signed by me (the human contributor) when the CLA-assistant bot prompts.

@zep-cla-assistant

zep-cla-assistant Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@eldar702

eldar702 commented Jun 9, 2026

Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA behalf on myself, e-mail: [email protected]

@eldar702 eldar702 marked this pull request as ready for review June 10, 2026 06:55
@eldar702

Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA behalf on myself, e-mail: [email protected]

@eldar702

Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

zep-cla-assistant Bot added a commit that referenced this pull request Jun 10, 2026
…ep#1302)

lucene_sanitize built its escape map with single-letter entries for
O/R/N/T/A/D and applied it via str.translate, which is per-character.
As a result every uppercase O/R/N/T/A/D in a query was backslash-escaped
(e.g. "NORD stream" -> "\N\O\R\D stream"), corrupting BM25 fulltext
matching for any query containing those common letters.

Drop the six single-letter map entries and instead escape only the
Lucene boolean operators AND / OR / NOT, as whole words, using a stdlib
re.sub. All other special-character escaping is unchanged.

AI-assisted contribution.
@eldar702 eldar702 force-pushed the fix/1302-lucene-sanitize-uppercase branch from c92346f to 328333a Compare June 15, 2026 13:29
@eldar702

Copy link
Copy Markdown
Author

Rebased onto current main (was 4 commits behind). The CLAAssistant check was showing FAILURE on the old SHA even though the CLA bot itself confirmed "All contributors have signed" in a comment on June 9. Pushing to the new SHA should re-trigger the check and clear that false alarm. All other checks (ruff, check-fork) were green on the old SHA — expecting the same here. Happy to address any review feedback!

@eldar702

Copy link
Copy Markdown
Author

FYI — filed #1582 about the triage workflow FAILURE visible on this PR. It's a repo-side infra bug in the anthropics/claude-code-action runner (directory mismatch + Node.js 20 deprecation), not related to this PR's code. All other checks (CLA, ruff, check-fork) are green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] lucene_sanitize() escapes individual uppercase letters, breaking BM25 for most queries

1 participant