fix: validate embeddings for NaN/Inf values with automatic retry #1511
BaiYouQing wants to merge 1 commit into
Embedding providers can occasionally return corrupted responses containing NaN or Inf values. When this happens, cosine similarity comparisons silently produce wrong results: cosine_similarity(nan, vec) never exceeds 0.6, causing the dedup step to miss existing entities and create duplicates.

Changes:
- Add _validate_embedding() that checks for NaN/Inf using numpy
- Both create() and create_batch() validate with automatic single retry
- Add numpy as a dependency
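A minimal sketch of the validate-then-retry flow described above, not the code from this PR. It uses the standard library's `math.isfinite` in place of numpy so it runs without extra dependencies. The names `validate_embedding`, `embed_with_retry`, `fetch`, and `flaky_fetch` are hypothetical, and raising `ValueError` after the retry fails is an assumption; the PR may handle that case differently.

```python
import math
from typing import Callable, List, Sequence


def validate_embedding(embedding: Sequence[float]) -> bool:
    """Return True only if every component is finite (no NaN, no +/-Inf)."""
    return all(math.isfinite(x) for x in embedding)


def embed_with_retry(fetch: Callable[[str], List[float]], text: str) -> List[float]:
    """Call the provider, retrying once if the result contains NaN/Inf."""
    for _attempt in range(2):  # first call plus a single retry
        embedding = fetch(text)
        if validate_embedding(embedding):
            return embedding
    # Assumption: fail loudly rather than return a corrupted vector.
    raise ValueError("embedding still contains NaN/Inf after one retry")


# Hypothetical provider that returns a corrupted vector on its first call only.
calls = {"n": 0}


def flaky_fetch(text: str) -> List[float]:
    calls["n"] += 1
    return [float("nan"), 0.5] if calls["n"] == 1 else [0.1, 0.5]


result = embed_with_retry(flaky_fetch, "hello")
```

Here the first response is rejected and the second is returned, so the caller never sees the corrupted vector.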
Strong direction. Two extensions worth keeping in mind once this lands:
Both could fit as separate sub-issues if the maintainers prefer to keep this PR focused on NaN/Inf. Out-of-process alternative for callers who want the full set of invariants enforced today: https://github.com/Adelagric/vector-router (Apache 2.0). Not a competitor to this PR; they compose.
Description
Embedding providers can occasionally return corrupted responses containing NaN or Inf values in the embedding vector. When this happens, cosine similarity comparisons (used for entity deduplication) silently produce wrong results: cosine_similarity(NaN, any_vector) never exceeds 0.6, causing the dedup step to miss existing entities and create duplicate nodes.

This is related to Issue #1505.
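The failure mode described above can be reproduced with a plain-Python cosine similarity. This is an illustrative sketch: `cosine_similarity` and the `> 0.6` check are stand-ins, not the project's actual functions. Arithmetic involving NaN yields NaN, and every ordered comparison with NaN is false, so the match check can never succeed and no error is raised.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity; NaN in either input propagates to the result."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


good = [1.0, 0.0]
corrupted = [float("nan"), 0.0]  # simulated corrupted provider response

healthy_score = cosine_similarity(good, good)
corrupted_score = cosine_similarity(corrupted, good)

# The same vector should match itself, but the corrupted copy never does.
is_duplicate_healthy = healthy_score > 0.6
is_duplicate_corrupted = corrupted_score > 0.6
```

Because the comparison simply evaluates to false, the entity is treated as new, which is how duplicates slip through silently.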
Changes
- _validate_embedding() method that checks for NaN/Inf using numpy
- create() and create_batch() use this validation with automatic retry
- logging and numpy imports

Testing