Skip to content

feat: import pipeline, search benchmark, asymmetric query prefixes#4

Merged
jp-cruz merged 2 commits into
mainfrom
feat/import-pipeline
Jul 3, 2026
Merged

feat: import pipeline, search benchmark, asymmetric query prefixes#4
jp-cruz merged 2 commits into
mainfrom
feat/import-pipeline

Conversation

@jp-cruz

@jp-cruz jp-cruz commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Overnight build, validated against a real 2,328-memory corpus in a staging database.

  • scripts/import_markdown.py — generic bulk importer; every record passes through capture_memory (an importer bypassing the scoped write path would be a second, unguarded door). Section chunking, path→type classification, hash dedupe, idempotent re-runs, report output.
  • Asymmetric retrieval fixembed_query() with per-model prefixes (arctic query: ); semantic search uses it.
  • scripts/bench_search.py — p50/p95 + verify_chain at scale.

Real-corpus results (2,328 memories): chain valid at full count · semantic p50 161ms end-to-end (HNSW itself ≈15ms — query embedding dominates; keep-alive/warm-model is the lever) · fts p50 14.5ms · verify_chain <100ms · top-1 retrieval quality confirmed on live queries.

Field finding for the judicial queue: injection regexes flag ~7% of security-engineering prose (notes that describe attacks trip the patterns). Documented in the import report; refinement belongs to the warn-daemon work, not this PR.

🤖 Generated with Claude Code

jp-cruz and others added 2 commits July 3, 2026 00:54
…efixes

- scripts/import_markdown.py: generic, config-driven bulk importer —
  every record goes through capture_memory (chained, audited,
  injection-screened); section-level chunking, path-based type
  classification, content-hash dedupe (idempotent re-runs), markdown
  report. No personal paths baked in.
- embedding: embed_query() with per-model retrieval prefixes
  (arctic-embed 'query: ', nomic 'search_query: ') — asymmetric models
  need query-side conditioning; semantic search now uses it.
- scripts/bench_search.py: p50/p95 latency for semantic+fts and
  verify_chain runtime at store scale.

Validated against a real 2,328-memory corpus (staging): chain valid at
full count; semantic p50 161ms end-to-end (≈15ms of that is HNSW —
query embedding dominates); verify_chain <100ms; retrieval quality
sanity-checked. Field finding: injection heuristics flag ~7% of
security-engineering prose — documented for the judicial-review queue.

Co-Authored-By: Claude Fable 5 <[email protected]>
CI's live-Postgres concurrency test caught it: now() is transaction-START
time, and chain writes serialize on the advisory lock AFTER their
transactions begin — concurrent writers tie on created_at, so both the
find-latest query and the verify walk saw an order that wasn't the chain.

- alembic 005: chain_seq (sequence default, claimed inside the locked
  insert => strictly increasing in true chain order), unique, backfilled
  for pre-005 rows in (created_at, id) order (sequential data, safe)
- prev-hash lookup + verify walk now order by chain_seq
- integration stub: embed_query = embed (symmetric) so exact-match
  distance assertions survive the asymmetric prefix change

131 unit + 5 live integration green; 005 applied to prod+staging;
staging chain re-verified valid at 2,328 records post-migration.

Co-Authored-By: Claude Fable 5 <[email protected]>
@jp-cruz jp-cruz merged commit 3daec02 into main Jul 3, 2026
11 checks passed
@jp-cruz jp-cruz deleted the feat/import-pipeline branch July 3, 2026 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant