
fix: validate embeddings for NaN/Inf values with automatic retry#1511

Open
BaiYouQing wants to merge 1 commit into
getzep:mainfrom
BaiYouQing:fix/nan-embedding-validation

Conversation

@BaiYouQing


Description

Embedding providers can occasionally return corrupted responses containing NaN or Inf values in the embedding vector. When this happens, cosine similarity comparisons (used for entity deduplication) silently produce wrong results: a NaN component makes the similarity itself NaN, and any threshold comparison against NaN is false, so cosine_similarity(NaN, any_vector) never exceeds 0.6. The dedup step then misses existing entities and creates duplicate nodes.
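A minimal standalone sketch of that failure mode, using only the standard library. The `cosine_similarity` helper and the `>=` comparison are illustrative assumptions, not the project's actual dedup code; only the 0.6 threshold comes from this PR.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity; hypothetical stand-in for the real helper."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


DEDUP_THRESHOLD = 0.6  # threshold quoted in this PR

good = [1.0, 2.0, 3.0]
corrupt = [1.0, float("nan"), 3.0]  # one corrupted component

clean_score = cosine_similarity(good, good)
corrupt_score = cosine_similarity(corrupt, good)

# NaN propagates through the dot product and the norm, so the score is NaN.
# Every ordered comparison with NaN is False, so the match is silently missed.
is_match_clean = clean_score >= DEDUP_THRESHOLD
is_match_corrupt = corrupt_score >= DEDUP_THRESHOLD
```

No exception is raised anywhere, which is why the problem goes unnoticed. An Inf component ends the same way in this sketch, because the score becomes `inf / inf`, which is also NaN.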

This is related to Issue #1505.

Changes

  1. Added _validate_embedding() method that checks for NaN/Inf using numpy
  2. On detection, logs a warning and raises ValueError to trigger a single retry
  3. Both create() and create_batch() use this validation with automatic retry
  4. Added logging and numpy imports
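The validate-then-retry flow described in this list could look roughly like the sketch below. It is not the code from this PR: the function names, the `fetch` callable, and the `max_retries` parameter are hypothetical, and `math.isfinite` is swapped in for the numpy check so the snippet runs with the standard library alone (with numpy, `np.isfinite(arr).all()` would play the same role).

```python
import logging
import math

logger = logging.getLogger(__name__)


def validate_embedding(vector):
    """Raise ValueError if any component is NaN or +/-Inf."""
    if not all(math.isfinite(x) for x in vector):
        logger.warning("Embedding contains NaN/Inf values")
        raise ValueError("embedding contains NaN/Inf values")
    return vector


def create_with_retry(fetch, text, max_retries=1):
    """Call fetch(text) and validate; retry on invalid output.

    `fetch` is a hypothetical callable standing in for the provider request.
    With max_retries=1 this gives the single retry described above.
    """
    for attempt in range(max_retries + 1):
        vector = fetch(text)
        try:
            return validate_embedding(vector)
        except ValueError:
            if attempt == max_retries:
                raise  # still corrupt after the retry: surface the error


# Usage with a stub provider that is corrupt on the first call only.
calls = []


def flaky_fetch(text):
    calls.append(text)
    return [float("nan"), 0.5] if len(calls) == 1 else [0.1, 0.5]


result = create_with_retry(flaky_fetch, "hello")
```

If the retry also returns a corrupt vector, the `ValueError` propagates to the caller, so a bad embedding is never stored quietly.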

Testing

  • Tested with a local FalkorDB instance and DeepSeek/Tencent embedding APIs
  • NaN detection correctly catches corrupt embeddings and retries
  • After retry, the second call typically returns valid embeddings

Embedding providers can occasionally return corrupted responses
containing NaN or Inf values. When this happens, cosine similarity
comparisons silently produce wrong results: cosine_similarity(nan, vec)
never exceeds 0.6, so the dedup step misses existing entities
and creates duplicates.

Changes:
- Add _validate_embedding() that checks for NaN/Inf using numpy
- Both create() and create_batch() validate with automatic single retry
- Add numpy as a dependency

@Adelagric

Strong direction. Two extensions worth keeping in mind once this lands:

  1. Dimension validation. embedder/openai.py:60,66 silently truncates with [: self.config.embedding_dim] — if the provider returns fewer dims than expected, the resulting vector is shorter than EMBEDDING_DIM, mixing inconsistent-length vectors inside the same index. Worth a length check alongside the isfinite guard.

  2. L2 normalization invariant. helpers.py:116-119 applies normalize_l2 inline in only two places (bulk edge dedup, MMR), but vectors persisted to the graph backend are the raw provider output. With FalkorDB's non-normalized cosine semantics (graph_queries.py:158), this drifts the meaning of the 0.6 dedup threshold.
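Both invariants can be illustrated with a short standard-library sketch. The helper names and the `EXPECTED_DIM` value are hypothetical, `normalize_l2_sketch` is a stand-in rather than the project's `normalize_l2`, and the backend behavior described in item 2 is taken from the comment, not verified here.

```python
import math

EXPECTED_DIM = 8  # hypothetical value standing in for the configured dimension


def check_dimension(vector, expected_dim):
    """Reject vectors whose length differs from the configured dimension."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    return vector


def normalize_l2_sketch(vector):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# 1. Slicing never pads: a short response stays short after truncation.
short = [0.1] * 5
truncated = short[:EXPECTED_DIM]  # still 5 elements, no error raised

# 2. A raw dot product depends on magnitude; after L2 normalization it
#    equals the cosine, so a fixed threshold keeps a stable meaning.
a, b = [3.0, 4.0], [6.0, 8.0]  # same direction, different lengths
raw_score = dot(a, b)
unit_score = dot(normalize_l2_sketch(a), normalize_l2_sketch(b))
```

Running `check_dimension(truncated, EXPECTED_DIM)` raises `ValueError`, which is the explicit failure the slice alone does not provide.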

Both could fit as separate sub-issues if the maintainers prefer to keep this PR focused on NaN/Inf. Out-of-process alternative for callers who want the full set of invariants enforced today: https://github.com/Adelagric/vector-router (Apache 2.0). Not a competitor to this PR — they compose.
