feat: import pipeline, search benchmark, asymmetric query prefixes#4
Merged
Conversation
…efixes - scripts/import_markdown.py: generic, config-driven bulk importer — every record goes through capture_memory (chained, audited, injection-screened); section-level chunking, path-based type classification, content-hash dedupe (idempotent re-runs), markdown report. No personal paths baked in. - embedding: embed_query() with per-model retrieval prefixes (arctic-embed 'query: ', nomic 'search_query: ') — asymmetric models need query-side conditioning; semantic search now uses it. - scripts/bench_search.py: p50/p95 latency for semantic+fts and verify_chain runtime at store scale. Validated against a real 2,328-memory corpus (staging): chain valid at full count; semantic p50 161ms end-to-end (≈15ms of that is HNSW — query embedding dominates); verify_chain <100ms; retrieval quality sanity-checked. Field finding: injection heuristics flag ~7% of security-engineering prose — documented for the judicial-review queue. Co-Authored-By: Claude Fable 5 <[email protected]>
CI's live-Postgres concurrency test caught it: now() is transaction-START time, and chain writes serialize on the advisory lock AFTER their transactions begin — concurrent writers tie on created_at, so both the find-latest query and the verify walk saw an order that wasn't the chain. - alembic 005: chain_seq (sequence default, claimed inside the locked insert => strictly increasing in true chain order), unique, backfilled for pre-005 rows in (created_at, id) order (sequential data, safe) - prev-hash lookup + verify walk now order by chain_seq - integration stub: embed_query = embed (symmetric) so exact-match distance assertions survive the asymmetric prefix change 131 unit + 5 live integration green; 005 applied to prod+staging; staging chain re-verified valid at 2,328 records post-migration. Co-Authored-By: Claude Fable 5 <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overnight build, validated against a real 2,328-memory corpus in a staging database.
scripts/import_markdown.py— generic bulk importer; every record passes throughcapture_memory(an importer bypassing the scoped write path would be a second, unguarded door). Section chunking, path→type classification, hash dedupe, idempotent re-runs, report output.embed_query()with per-model prefixes (arcticquery:); semantic search uses it.scripts/bench_search.py— p50/p95 + verify_chain at scale.Real-corpus results (2,328 memories): chain valid at full count · semantic p50 161ms end-to-end (HNSW itself ≈15ms — query embedding dominates; keep-alive/warm-model is the lever) · fts p50 14.5ms · verify_chain <100ms · top-1 retrieval quality confirmed on live queries.
Field finding for the judicial queue: injection regexes flag ~7% of security-engineering prose (notes that describe attacks trip the patterns). Documented in the import report; refinement belongs to the warn-daemon work, not this PR.
🤖 Generated with Claude Code