fix(attribution): normalize repo identity at every write boundary#10
Merged
Conversation
Same git repo could be persisted under several strings — a remote URL (SSH/HTTPS, ±.git), a bare owner/repo slug, or a directory basename — and since GitHub slugs are case-insensitive but case-preserving, Acme/Repo and acme/repo fragmented into separate analytics buckets. This is the correct-at-write half of the attribution work; the cleanup --remap-repo backfill stays for historical rows. Adds normalizeRepo() in core (canonical lowercase owner/name, idempotent) and applies it at ALL repo write boundaries: - cli/git.ts getCurrentRepo() (covers wrap + hooks) - cli run start --repo and job submit --repo (operator-typed values) - web SDK runs route (caller-supplied environment.repo), which also now rejects a non-empty repo that normalizes to "" (e.g. a host-only URL) rather than persisting an empty bucket. Read-side symmetry: run list / job list --repo filters normalize their input too, so a mixed-case filter still matches normalized rows. Tests: normalizeRepo unit suite (URL/SSH/slug/basename/idempotency/dedup), getCurrentRepo lowercasing, CLI run-start/job-submit write-path normalization, and SDK inbound normalize + empty-repo rejection. Full suite 1241 pass, lint clean, build green. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
The wrap command's auto-detected repo already normalizes via getCurrentRepo, but the explicit `wrap --repo <value>` override bypassed it and persisted the raw string — the same write-boundary gap fixed for run start / job submit. Wrap the whole `opts.repo ?? getCurrentRepo()` expression in normalizeRepo (idempotent, so the auto-detected path is unaffected). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
First of the two attribution root-cause fixes — moves repo bucketing from cleanup-after to correct-at-write.
A single git repo could be persisted under several strings: a remote URL (SSH/HTTPS, ±
.git), a bareowner/reposlug, or a directory basename. Because GitHub slugs are case-insensitive but case-preserving,Acme/Repoandacme/repofragmented into separate analytics buckets. Thecleanup --remap-repobackfill mopped this up after the fact; this fixes it at the source.Change
New
normalizeRepo()in core → canonical lowercaseowner/name(or a lowercased single basename), strictly idempotent. Applied at every repo write boundary:getCurrentRepo()(coverswrap+ hooks)cli/git.tsrun start --repo/job submit --repo(operator-typed)cli/commands/run.ts,job.tsenvironment.repoweb/.../sdk/runs/route.tsThe SDK route now also rejects a non-empty repo that normalizes to
""(e.g. a host-only URL) instead of persisting an empty bucket. Read-side symmetry:run list/job list --repofilters normalize their input too, so a mixed-case filter still matches.The
--remap-repobackfill is kept — still the only fix for historical rows.Decision
Canonical form = lowercase
owner/name(per owner's call). Tradeoff: repo names render lowercase. Matches the existing backfill's de-facto target.Known limitation
normalizeRepocan't merge a bare basename (agentops, from a repo with no git remote) intoiaj6/agentops— there's no owner to recover. Repos with a remote (the trial) getowner/nameand dedupe fully.Tests
normalizeRepounit suite (URL/SSH/slug/basename/idempotency/dedup/empty),getCurrentRepolowercasing, CLIrun start/job submitwrite-path normalization, SDK inbound normalize + empty-repo rejection. 1241 pass, lint clean, build green.Review
Adversarial 2-lens review (completeness/consumer-safety + correctness) — caught two real gaps now fixed: the
run start/job submitwrite paths were initially missed, and confirmed no consumer depends on the raw case-preserving value (--remap-reposeeds raw history;pr/linkderive repo fromgh, notenvironment.repo; analytics only group by the stored string).Part 1 of 2 (attribution root causes). PR A (NULL user_id) follows.
🤖 Generated with Claude Code