✨ feat(identity): same-name disambiguation & user-self identity resolution #75
Code Review
This pull request implements same-name identity disambiguation and self-identity resolution to prevent the knowledge graph from misattributing claims to same-named nodes. It introduces a dedicated user-self-identity module to manage the user's Person node, seeds only distinguishing multi-token aliases, wires transcript speakers as claim subjects, and updates the identity resolver to split instead of guessing on ambiguous ties. Feedback on the changes highlights an optimization opportunity in ensureUserSelfIdentity to avoid redundant database updates to nodeMetadata on every transcript ingestion by checking if the label has actually changed first.
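The optimization suggested in the review could look roughly like the sketch below: read the stored label and issue the write only when it differs. The `NodeMetadataStore` interface, the in-memory store, and `updateLabelIfChanged` are hypothetical stand-ins for illustration; the PR's actual schema and queries are not shown here.

```typescript
// Hypothetical stand-in for the nodeMetadata table; the real schema is not shown in this PR.
interface NodeMetadataStore {
  getLabel(nodeId: string): string | undefined;
  setLabel(nodeId: string, label: string): void;
}

class InMemoryStore implements NodeMetadataStore {
  writes = 0; // counts writes so the effect of the guard is observable
  private labels = new Map<string, string>();

  getLabel(nodeId: string): string | undefined {
    return this.labels.get(nodeId);
  }

  setLabel(nodeId: string, label: string): void {
    this.labels.set(nodeId, label);
    this.writes += 1;
  }
}

// Returns true only when a write was actually issued.
function updateLabelIfChanged(
  store: NodeMetadataStore,
  nodeId: string,
  label: string,
): boolean {
  if (store.getLabel(nodeId) === label) {
    return false; // label unchanged: skip the redundant update
  }
  store.setLabel(nodeId, label);
  return true;
}
```

With a guard of this shape, ingesting several transcripts for a user whose label has not changed would issue one write in total rather than one per ingestion.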
What & why
The knowledge graph attributed the account owner's own first-person statements (from an ingested WhatsApp transcript) to a different same-named person's node — e.g. the user's "I…" routed to a contact also named "Marcel". This is general same-name disambiguation with the account owner as a first-class case — deliberately not a blunt "assume the user when unsure" heuristic (which over-merges).
Four components, in leverage order:
1. user-self-identity.ts names the self node with the user's most-specific (multi-token) alias and seeds only multi-token aliases into the alias table. Bare first names never enter it, so a same-named contact can never merge into the user (or vice versa).
2. ensureUserSelfIdentity is idempotent and advisory-lock-guarded. It is called from setUserSelfAliases (config) and ingestTranscript (effective aliases, which covers WhatsApp, where stored aliases are empty). A backfill route repairs existing self nodes.
3. extract-graph.ts registers resolved speaker nodes into idMap and instructs the model to use a speaker's nodeId as the subject of first-person claims. Load-bearing fix: claim insertion skips subjects not in idMap.
4. resolveIdentity resolves only on a unique canonical/alias match; >1 match records ambiguous, logs identity.ambiguous_skip, and falls through to split. No self-prior anywhere.
Result: the bug dies twice over. Subject-wiring routes the user's utterance to the self node, and even a bare name reaching the resolver splits rather than guessing.
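The resolver rule described above (resolve only on a unique canonical/alias match, otherwise record the ambiguity and split) might be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `PersonNode`, `Resolution`, `resolveIdentitySketch`, and the case-insensitive matching are all hypothetical, and "split" is modelled as returning a marker that the caller would turn into a new node.

```typescript
// Hypothetical shapes; the PR's real types are not shown here.
interface PersonNode {
  id: string;
  canonicalName: string;
  aliases: string[];
}

type Resolution =
  | { kind: "resolved"; nodeId: string }
  | { kind: "ambiguous"; candidateIds: string[] } // caller splits (creates a new node)
  | { kind: "unmatched" }; // caller also creates a new node

// Assumption: names are compared case-insensitively after trimming.
const normalize = (s: string): string => s.trim().toLowerCase();

function resolveIdentitySketch(
  name: string,
  nodes: PersonNode[],
  log: (event: string) => void,
): Resolution {
  const key = normalize(name);
  const matches = nodes.filter(
    (n) =>
      normalize(n.canonicalName) === key ||
      n.aliases.some((a) => normalize(a) === key),
  );

  const [first] = matches;
  if (first !== undefined && matches.length === 1) {
    // Exactly one candidate: safe to resolve.
    return { kind: "resolved", nodeId: first.id };
  }
  if (matches.length > 1) {
    // Tie: never guess and never prefer the self node; log and let the caller split.
    log("identity.ambiguous_skip");
    return { kind: "ambiguous", candidateIds: matches.map((m) => m.id) };
  }
  return { kind: "unmatched" };
}
```

Given a self node named "Marcel Samyn" and a contact named "Marcel", the full name resolves to the self node and the bare first name resolves to the contact. A second node named "Marcel" turns the bare name into an ambiguous result rather than a guess.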
How to test
Key tests:
- identity-resolution.test.ts (ambiguous tie → no resolve + identity.ambiguous_skip; Marcel Samyn → self, bare Marcel → the contact)
- ingest-transcript.test.ts (user-self first-person claim attaches to the self node, not a same-named participant)
Operational note
Existing user-self nodes predate the new contract. Run a one-time backfill per user to give them a distinguishing label + multi-token aliases (and strip any previously-seeded bare alias):
Passing an empty/absent effective alias list clears the self node's alias rows (clean-slate) — pass a non-empty list to avoid that.
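The alias contract and the clean-slate behaviour can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the `AliasTable` map, the `seedSelfAliases` name, and the definition of "multi-token" as two or more whitespace-separated tokens. The real backfill route and table are not reproduced.

```typescript
// Hypothetical in-memory alias table keyed by node id; the real table is not shown in this PR.
type AliasTable = Map<string, string[]>;

// Assumption: an alias is "distinguishing" if it has at least two whitespace-separated tokens.
function isMultiToken(alias: string): boolean {
  return alias.trim().split(/\s+/).filter((t) => t.length > 0).length >= 2;
}

// Replaces the self node's alias rows with the multi-token subset of `effectiveAliases`.
// An empty or absent list therefore leaves the node with no alias rows (clean slate).
function seedSelfAliases(
  table: AliasTable,
  selfNodeId: string,
  effectiveAliases?: string[],
): string[] {
  const seeded = (effectiveAliases ?? [])
    .filter(isMultiToken)
    .map((a) => a.trim());
  table.set(selfNodeId, seeded); // overwrite: previously seeded bare aliases are dropped
  return seeded;
}
```

Because the write replaces rather than appends, calling it with `["Marcel", "Marcel Samyn"]` keeps only the full name, and a later call with no list removes that row as well, which is the reason to pass a non-empty list during backfill.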
Checklist
- pnpm run build:check (tsc --noEmit + structured-output schema check)
- pnpm run lint
- pnpm run format:5431 (note: PR CI runs lint/format/build only; vitest is local)
🤖 Generated with Claude Code